Ask HN: What are some good resources for learning about low level disk/file IO?
I've been messing around with writing a toy database for fun/learning, and realised I've got a fairly big gap in my knowledge when it comes to performance and durability of file reads/writes.
Examples of some questions I'd like to be able to answer, or at least make reasonable decisions about (note: I don't actually want answers to these now; they're just examples of the sort of thing I'd like to read about in depth, to build up some background knowledge):
I recall reading some posts (related to Redis/SQLite/Postgres) related to this, which made me realise that it's a fairly complex topic, but not one I've found a good entry point for.
* how to ensure data's been safely written (e.g. when to flush, fsync, what guarantees that gives, using a WAL)
* block sizes to read/write for different purposes, tradeoffs, etc.
* considerations for writing to different media/filesystems (e.g. disk, SSD, NFS)
* when to rely on the OS disk cache vs. using your own cache
* when to use/not use mmap
* performance considerations (e.g. multiple small files vs. a few larger ones, concurrent readers/writers, locking, etc.)
* OS-specific considerations
Any pointers to books, documentation, etc. on the above would be much appreciated.
There are some easy answers here:
* Bigger blocks = better performance. The bigger you can make them, the faster you'll go. The limiting factor is usually the granularity the user needs (i.e. aggregating small items into big blocks will inevitably result in under-utilized space).
* Disk, SSD and NFS don't all belong in the same category. Most modern storage products are developed with the expectation that the media is SSD; virtually nobody wants to enter the HDD market. The performance gap is just too big, and the existing products that still use HDDs rely on fast caching in something like flash memory anyway. NFS is a hopelessly backwards and outdated technology. It's the least common denominator, and that's why various storage products still support it, but if you want to go fast, forget about it. The tradeoff is usually between writing your own client (typically a kernel module) to do I/O efficiently, or sparing users the need to install a custom kernel module (often a security-audit issue) and letting them go slow...
* "OS disk cache" is somewhat of a misnomer, and two different things tend to get confused here. The OS doesn't cache data written to the disk -- the disk itself does, in its on-board cache; the OS just provides the mechanism to talk to the disk and instruct it to flush that cache. Separately, there's the filesystem (page) cache -- that's what the OS does: it caches the contents of recently accessed files in memory it manages.
* I/O through mmap is a gimmick. Just one of the ways to abuse a system API to do something it's not really intended to do. You can safely ignore it. If you are looking into making I/O more efficient, look into io_uring.
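The flush/fsync point raised in the original question can be sketched concretely. Below is a minimal durable-write pattern (Python for brevity; `durable_write` is a hypothetical helper, not from any library): write to a temp file, fsync it, rename into place, then fsync the parent directory so the rename itself survives a crash. Exact guarantees vary by filesystem and mount options.

```python
import os

def durable_write(path: str, data: bytes) -> None:
    """Write data to path so it survives a crash/power loss.

    Common pattern: write a temp file, fsync it, atomically rename
    it into place, then fsync the parent directory so the rename
    itself is durable. Details vary by filesystem.
    """
    tmp = path + ".tmp"
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
        os.fsync(fd)          # flush file data down to the device
    finally:
        os.close(fd)
    os.rename(tmp, path)      # atomic replace on POSIX filesystems
    dfd = os.open(os.path.dirname(path) or ".", os.O_RDONLY)
    try:
        os.fsync(dfd)         # make the rename (metadata) durable
    finally:
        os.close(dfd)
```

Note that a plain `write()` without the fsync calls may sit in the page cache and the disk's own cache indefinitely, which is exactly the distinction the comment above is drawing.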
I've spent a lot of my career automating datacenter and HPC environments and I disagree with several of these points.
* Big distributed storage systems still use HDDs, usually within a tiered system that also includes SSDs and NVMe.
* A good nfs server implementation will beat the pants off all the cloud vendors. It's still highly relevant in physical datacenters.
* Mmap is used heavily in a ton of software for good reason. On top of that it's part of the POSIX API.
* While block size is one of those things where it usually doesn't matter until it does, just saying "bigger blocks are faster" is a bit misleading.
For the record, I work in HPC environment, but originally, my background is in storage.
> Big distributed storage systems still use hdd's
So what? Did you read what I wrote? I wrote about developing new, not supporting old...
> A good nfs server implementation will beat the pants off all the cloud vendors.
What are you even talking about? What cloud vendors have to do with this? Did you read what you replied to?
> Mmap is used heavily in a ton of software for good reason
So what? OP is asking in the context of writing a database / disk I/O. It's the wrong system API for that. It's intended to let applications "easily" save their in-memory data; if that's what it's used for, then it's fine. If it's used to implement a filesystem, then the filesystem authors don't understand what they are doing. Also, being part of POSIX or any other standard doesn't confer magical immunity from being bad functionality... just look at the history of UNIX / Linux repeatedly failing to come up with an interface for asynchronous I/O, and sure enough, all those iterations made it into the standard.
> just staying bigger blocks is faster is a bit misleading.
It's not misleading. For a one-paragraph answer, it's perfectly correct. And, no, block size is a very important aspect of any storage system; it's not something that doesn't matter.
As an aside: you sound pretentious, and try to pass as knowledgeable by saying things that have a drop of truth, but are mostly fancily dressed nonsense. Just stop. It's embarrassing.
NFS and Samba are right out IMO
The book Operating Systems: Three Easy Pieces has a chapter on I/O and persistence in general.
I think, most of all, you should dig through LWN and the kernel documentation. Both are great learning resources. LWN has many introductory/educational articles about the very topics you care about. (Also, Jens Axboe wrote a long and great explanation why blk-mq was necessary, which you should seek out. He did the same for io_uring)
I highly recommend Gregg's Systems Performance (2nd edition came out in 2020). While the book is focused on performance rather than development, Gregg does a great job explaining a huge number of concepts without going too deep, specifically related to memory, fs, and block I/O.
Unfortunately, in terms of many of the things you care about, books tend to be outdated. Kerrisk's Linux Programming Interface is over 10 years old, and covers only ext2. Robert Love's great books on the kernel are hugely useful (though less intended for application developers) but also slightly outdated.
Thanks, that's a great idea. Last time I read kernel documentation was probably 15+ years ago, so I'm probably very out of date in some concepts I've internalised.
Did you see this discussion? https://news.ycombinator.com/item?id=32965075
As far as books are concerned:
- Database Internals
- Designing Data Intensive Applications
- Disk-Based Algorithms for Big Data
- Database Systems by Ullman et al (http://infolab.stanford.edu/~ullman/pub/dscbtoc.txt) Part IV covers implementation details of a database system.
Thanks, I hadn't seen that discussion, looks like there might be some useful info there too.
I'm in the middle of reading Designing Data Intensive Applications, and Database Internals is next on my list, but hadn't come across the other two yet - have added to my reading list now.
I recently read this stack overflow post which was interesting:
If you use io_uring (such as via liburing) on Linux, it splits your I/O in half: you submit requests and then wait for their completions, but you can still do other things in between. You can submit multiple writes or reads in parallel and handle them as they become ready.
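Real io_uring is a C API (shared submission/completion rings via liburing), but the submit-then-reap control flow it describes can be mimicked with futures. A rough analogue (Python; `read_chunks` is a hypothetical name, and the thread pool only imitates the pattern, not the kernel mechanism):

```python
import os
from concurrent.futures import ThreadPoolExecutor

def read_chunks(path: str, offsets, length: int):
    """Illustrative analogue of io_uring's split submit/complete model:
    queue several reads up front, keep working, then harvest results.
    Real io_uring batches these into a shared ring with the kernel."""
    fd = os.open(path, os.O_RDONLY)
    try:
        with ThreadPoolExecutor(max_workers=4) as pool:
            # "submission": queue all reads without waiting on any of them
            futures = [pool.submit(os.pread, fd, length, off) for off in offsets]
            # ... other work could happen here while I/O is in flight ...
            # "completion": reap the results
            return [f.result() for f in futures]
    finally:
        os.close(fd)
```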
This white paper talks about writing to disk in S3 in a scheduled order to avoid corruption of concurrent requests in the event of a crash.
"Using Lightweight Formal Methods to Validate a Key-Value Storage Node in Amazon S3"
Apache Kafka supposedly doesn't need fsync due to the recovery protocol. So you might want to investigate why that is the case and whether or not you can create the same behaviour.
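The general idea behind logs that skip fsync-per-write is that a crash may lose the un-flushed tail, but framing each record with a length and checksum lets recovery detect exactly where the valid log ends. A minimal sketch of that framing (Python; `append_record`/`recover` are hypothetical names, and this is the generic technique, not Kafka's actual protocol, which also leans on replication):

```python
import os, struct, zlib

HEADER = struct.Struct("<II")  # record length, CRC32 of payload

def append_record(fd: int, payload: bytes) -> None:
    """Append one length+CRC framed record (no fsync per record)."""
    os.write(fd, HEADER.pack(len(payload), zlib.crc32(payload)) + payload)

def recover(path: str):
    """Replay the log, stopping at the first torn/corrupt record.
    A crash can lose the tail, but never yields a silently bad record."""
    records = []
    with open(path, "rb") as f:
        data = f.read()
    pos = 0
    while pos + HEADER.size <= len(data):
        length, crc = HEADER.unpack_from(data, pos)
        payload = data[pos + HEADER.size : pos + HEADER.size + length]
        if len(payload) < length or zlib.crc32(payload) != crc:
            break  # torn write at the tail: discard from here on
        records.append(payload)
        pos += HEADER.size + length
    return records
```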
Read the Unix books, LWN, the kernel mailing list -- it's definitely a good way to start out and get the fundamentals.
A good start on device drivers (old 2.6.10 kernel but still good): https://lwn.net/Kernel/LDD3/
Operating Systems: Three Easy Pieces - Files and directories: https://pages.cs.wisc.edu/~remzi/OSTEP/file-intro.pdf (Another great book from an OS perspective with some userspace interactions)
A thorough Linux internal engineering book is: https://mirrors.edge.kernel.org/pub/linux/kernel/people/paul... (The bibliography has tons of links on topics you might be interested in, Chapter 7 on locks is great)
I recommend implementing a basic key/value single table "database" in C/C++ and then add threading/multi-process interfaces so you can mentally figure out all the pros/cons. It's not technically "hard" and you'll learn a lot.
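For a feel of what that exercise looks like before the threading/multi-process parts, here is a toy append-only key/value store (sketched in Python for brevity; the same structure ports to C/C++, and `ToyKV` is a hypothetical name). Every put appends a record and fsyncs before acknowledging; opening the file rebuilds an in-memory index:

```python
import os, struct

class ToyKV:
    """Append-only key/value store: every put appends a record; an
    in-memory dict maps key -> latest value, rebuilt on open."""
    REC = struct.Struct("<II")  # key length, value length

    def __init__(self, path: str):
        self.path = path
        self.index = {}
        if os.path.exists(path):
            self._rebuild()

    def _rebuild(self):
        with open(self.path, "rb") as f:
            data = f.read()
        pos = 0
        while pos + self.REC.size <= len(data):
            klen, vlen = self.REC.unpack_from(data, pos)
            pos += self.REC.size
            key = data[pos:pos + klen]; pos += klen
            val = data[pos:pos + vlen]; pos += vlen
            self.index[key] = val  # later records win

    def put(self, key: bytes, value: bytes):
        with open(self.path, "ab") as f:
            f.write(self.REC.pack(len(key), len(value)) + key + value)
            f.flush()
            os.fsync(f.fileno())  # durability before acknowledging
        self.index[key] = value

    def get(self, key: bytes):
        return self.index.get(key)
```

From here, adding concurrent readers/writers forces exactly the locking and cache-coherence questions the parent comment mentions.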
You might be interested in the Modern SSDs course from ETH Zürich https://youtube.com/playlist?list=PL5Q2soXY2Zi_8qOM5Icpp8hB2...
You might find these helpful:
- "Practical Filesystem Design": http://www.nobius.org/dbg/practical-file-system-design.pdf
- "Robert Love: Linux Kernel Development (chapters 13-14)" https://www.amazon.com/Linux-Kernel-Development-Robert-Love/...
- "The Linux Programming Interface (File I/O chapters)": https://www.amazon.com/Linux-Programming-Interface-System-Ha...
My advice is simple: write code. Implement toy systems that use these things, and begin exploring how they work and their inherent trade-offs. Don't just read about it, do it: practical learning is always 1000x as valuable.
Books can't get you very far. All they're really good for is informing that exploration.
Can you give some examples of systems to try implementing?
The sibling comment is great. If you can't decide, pick whatever motivates you the most.
The key is to hold yourself accountable. It's easy to build sandcastles and think you're a brilliant architect if you don't try jumping on top of them. The iteration loop is what forces you to learn: pick something about your thing you can test objectively, and then make it better.
One of my favorite things to do is to implement something boring and standard, but with an unusual arbitrary design constraint that forces me to rethink the normal approach.
Literally write a toy database.
Redis from Scratch will walk you through the very basics.
Want to know how to build compilers? Crafting Interpreters (while very wordy) will walk you (painstakingly) through the very basics.
Want to build a basic server? Beej on Linux Networking.
Build a time-series database to handle back-testing for automated trading systems.
Slap on a bare-bones SQL interpreter onto it.
Now add networking so you can deploy it somewhere.
Now figure out what’s wrong with it (is the performance merely slow or are there serious pitfalls a la MongoDB?)
How do you handle multi user environments?
How do you optimize for filesystem throughput while maintaining ACID? Are you like Mongo, where you just queue everything and return an ACK — or do you only ACK back when you’ve successfully written to disk?
What’s your protocol for communicating with your DB?
How about sharding or distributed storage?
Hot/cold data swapping?
Execution engine or hand-crafted data retrieval semantics a la q (lang)?
How about remote direct memory access (RDMA) to get past the kernel? How about regular old kernel bypass?
How do you handle a catastrophic failure where I physically pull the plug on your machine?
Are you using SIMD/vectorization?
There’s so much you could do. Pick whatever interests you the most.
I stumbled across this Carnegie Mellon University database course a year or two ago and this guy gets really deep into the technologies behind a database. Here's the list of YouTube videos of lectures that the professor has posted online:
Not specifically addressing your question, but when you get to the point of wanting to start doing some experiments you may find that 'fio'  is very handy.
Not only that. FIO has one of the most plain and easy-to-understand C code-bases I've ever seen. Studying its source code will help a lot in understanding many aspects of the Linux I/O subsystem, its many drivers, and the trade-offs between different approaches, as well as give a glimpse into the bizarre and completely illogical naming conventions and traditions that will often trip up the uninitiated.
FIO is also really useful in that its dizzying array of options will often prompt questions you didn't even think to ask about what you're trying to measure. Every time I am trying to design a storage benchmark for a certain purpose or trying to replicate a particular workload, I read the FIO documentation cover to cover to make sure I'm not forgetting any crucial details.
(Of course, FIO's options aren't quite exhaustive, but I can only think of two or three things where I've had to wrap or extend FIO to achieve what I needed.)
And if you want to graph the fio output data, 'fio-plot'  may help.
Nice, I saw this referenced in another post, but didn't know what it was, looks super useful.
I led a project that included shipping a filesystem driver and a virtual disk on Windows.
What I did to learn the lower-level APIs, and perform initial testing on the driver, was write a "mirror" drive. The user-mode code pointed to a folder on disk, the driver made a virtual disk drive, and all reads and writes in the virtual disk drive went to the mirror folder. All of our (cough) unit tests for virtual drive handling used the mirror drive. ("cough" because the tests fit into that happy area that truly is a unit test but drives people nuts about splitting hairs about the semantics between unit and integration tests.)
On Windows, you can implement something like that using Dokany, Dokan, or WinFsp. On Linux, there's the FUSE API. On Mac, there's macFUSE.
Even if you don't do a "mirror" drive, understanding the callbacks that libraries like Dokany, Dokan, WinFsp, and FUSE expose helps you understand how I/O happens in the driver. Many I/O methods provided in popular languages are abstractions above what the OS does. (For example, the Windows kernel has no concept of the "Stream" that's in your C# program. The "Stream"'s Position property is purely a construct within the .NET framework.)
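That layering is easy to demonstrate in any language with buffered streams. A small sketch (Python; `buffered_vs_os_demo` is a hypothetical name) showing that bytes written to a language-level stream aren't visible at the OS level until the buffer is flushed:

```python
import os, tempfile

def buffered_vs_os_demo():
    """Show that a buffered file object sits above the OS: data written
    to the user-space buffer isn't visible to os.stat() until flushed.
    Returns (file size before flush, file size after flush)."""
    path = os.path.join(tempfile.mkdtemp(), "demo.bin")
    f = open(path, "wb")     # buffered stream (analogous to a .NET Stream)
    f.write(b"x" * 10)       # lands in the user-space buffer only
    before = os.stat(path).st_size
    f.flush()                # push the buffer down to the OS
    after = os.stat(path).st_size
    f.close()
    return before, after
```

(And even after `flush()`, the data is only in the OS page cache; it takes an `os.fsync()` to push it toward the device.)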
Another place to start is the OS's documentation itself. For example, you can start with Windows' CreateFileA function. This is typically what gets called "under the hood" in most programming languages when you open or create a file: https://learn.microsoft.com/en-us/windows/win32/api/fileapi/...
This is certainly something that needs to be addressed far more often.
Too many times have I seen some data scientist trying to parse and write 6 Petabytes of data with multiple cores, while the disk is thrashing about.
Spinning disks are still the backbone of most data science operations because they deal with >>4TB datasets, which can't be stored on SSDs without breaking the bank.
So yes, understanding how to use producer/consumer multiprocessing queues correctly should be taught, as a standard template, to everyone who does computing.
Disk thrash is a threat.
This series has been really informative regarding the history of filesystems - when each breakthrough occurred and the historical tradeoffs that were considered at each point.
> Are You Sure You Want to Use MMAP in Your Database Management System?
That's exactly the sort of thing I was after, thanks. Trying to search for in depth info about mmap has generally given me docs on how to use it, not where/when it should/shouldn't be used.
This one is my favorite guide about SSD internals for programmers.
File Structures: An Object-Oriented Approach with C++ https://www.goodreads.com/book/show/614858.File_Structures
Rarely covered, but critical: knowledge of seek timings on different media (and how they differ between manufacturers), the transition times between reads and writes, and the additional information many storage devices can provide on request, or that is simply there if you know about it: the ability to get I/O buffer state-transition callbacks, such as buffer filled, ring buffer rotated, last byte read, and so on. I spent time working at disc manufacturers, where various dedicated software companies, mostly database developers and game devs, compiled this information for their company's needs.
In the spirit of building complex systems from scratch, you’ll enjoy this repo https://github.com/codecrafters-io/build-your-own-x that contains several ideas for projects around building things from scratch (eg Build your own Git, Docker, etc)
The issue "SSDs vs spinning disks" is such a big one, I feel like you should understand the differences there before digging into anything else.
An SSD is just a completely different beast from a spinning disk. Spinning disks are much slower and really want to read things linearly.
Most of your other questions - block sizes, caching, mmapping, file size, concurrency - the answers will be completely different on SSD vs spinning disk.
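The block-size part of that is easy to observe directly: reading the same file with smaller blocks costs proportionally more system calls, and on spinning disks the access pattern matters far more than on SSDs. A small sketch (Python; `read_all` is a hypothetical name, and actual throughput differences need measuring on the target media, e.g. with fio):

```python
import os

def read_all(path: str, block_size: int):
    """Read a file sequentially with a given block size, returning
    (total bytes read, number of read() calls). Fewer, larger reads
    mean fewer syscalls; on spinning disks they also keep the head
    moving sequentially instead of seeking."""
    fd = os.open(path, os.O_RDONLY)
    total = calls = 0
    try:
        while True:
            buf = os.read(fd, block_size)
            calls += 1                 # count every syscall, incl. final EOF read
            if not buf:
                break
            total += len(buf)
    finally:
        os.close(fd)
    return total, calls
```

For a 64 KiB file, a 4 KiB block size costs 17 read() calls where a 64 KiB block size costs 2; the syscall overhead alone is measurable, before any device effects.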
I more or less agree. Having a decent understanding of the underlying hardware reality makes it much easier to understand the purpose of the software abstractions, and understanding how SSDs differ from hard drives makes it easier to see why software abstractions from the 1980s or earlier may now be ill-suited to today's problems.
I don't understand this comment. Spinning disks are still used very often in computing.
SSD/NVMe/etc drives are great for tiny amounts of storage, but if you need to process something substantial, like petabytes of data, you need to know how to write code to operate on spinning HDs.
Most people in ML or data science should default to writing for spinning disks, as they often have to deal with >>4TB of data.
You're imagining some specific focus on huge data that did not previously exist in the conversation, and does not actually seem relevant to today's real world. If you care about performance, you use SSDs, even if your dataset is large. Consumer 2TB SSDs have been under $100 for a while now. Enterprise SSDs can fit 1PB in a single 1U or 2U box. There are very few niches where hard drives are still relevant as something other than cold storage. Nothing I said implied that those niches don't exist, but my comment was mainly addressed at how understanding SSDs and their fundamental differences from hard drives is valuable knowledge for the more common use cases.
In particular, making good use of SSDs requires your application to be able to issue many simultaneous IO requests, which hard drives are relatively bad at handling and many IO APIs from before the SSD era make difficult or impossible.
Worth a watch: https://youtu.be/LMe7hf2G1po
This only covers SSDs, but I remember it from a previous Hacker News thread. It's a very short read.
IMO, reading books about operating systems internals / implementation would help too, particularly the parts about disk management and file systems, block vs. character I/O, the buffer cache, fragmentation, etc. Some of these are Unix-based terms; translate for other OSes.
Somewhat specific to Linux, but this one comes to mind:
Personally, I'd like to figure out how to use a SATA controller as a high-speed serial port. Swapping pairs on a 6 GHz cable will be dicey, especially compared to the old RS-232 cables and DB9/25 connectors, but it should be possible to have comms between PCs via spare SATA ports.
I think that plan would run into problems with too much functionality being hard-wired in SATA controllers. There are plenty of off-the-shelf solutions for PCI Express, though.
Related oldie and good starting point: Memory Systems: Cache, DRAM, Disk by Bruce Jacob et al.