A few factual inaccuracies in here that don't affect the general thrust. For example, the claim that S3 uses a 5:9 sharding scheme. In fact they use many different sharding schemes, and iirc 5:9 isn't one of them.
The main reason being that a ratio of 1.8 physical bytes to 1 logical byte is awful for HDD costs. You can get that down significantly, and you get wider parallelism and better availability guarantees to boot (consider: if a whole AZ goes down, how many shards can you lose before an object is unavailable for GET?).
Naively it seems difficult to decrease the ratio of 1.8x while simultaneously increasing availability. The less duplication, the greater risk of data loss if an AZ goes down? (I thought AWS promises you have a complete independent copy in all 3 AZs though?)
To me, though, the idea that to read a single 16MB chunk you actually need to read ~4MB pieces from 5 different hard drives, and that this is somehow faster, is baffling.
Availability zones are not durability zones. S3 aims for objects to still be available with one AZ down, but not more than that. That does actually impose a constraint on the ratio relative to the number of AZs you shard across.
If we assume 3 AZs, then you lose 1/3 of shards when an AZ goes down. You could do at most 6:9, which is a 1.5 byte ratio. But that's unacceptable, because you know you will temporarily lose shards to HDD failure, and this scheme doesn't permit that in the AZ down scenario. So 1.5 is our limit.
To lower the ratio from 1.8, it's necessary to increase the denominator (the number of shards necessary to reconstruct the object). This is not possible while preserving availability guarantees with just 9 shards.
Note that Cloudflare's R2 makes no such guarantees, and so does achieve a more favorable cost with their erasure coding scheme.
Note also that if you increase the number of shards, it becomes possible to change the ratio without sacrificing availability. Example: if we have 18 shards, we can choose 11:18, which gives us about 1.64 physical bytes per logical byte. And it still takes 1 AZ + 2 shards to make an object unavailable.
You can extrapolate from there to develop other sharding schemes that would improve the ratio and improve availability!
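If you want to play with that arithmetic, here's a back-of-envelope sketch (my own, not any actual AWS scheme) that enumerates k:n schemes spread evenly over 3 AZs which stay readable with one whole AZ down plus at least one extra shard lost, and still beat the 1.8 ratio:

```python
# Enumerate erasure-coding schemes spread evenly over 3 AZs that survive
# "1 AZ down + 1 extra shard" and have a better physical:logical ratio than 1.8.
AZS = 3

def shards_left(n, az_down=1, extra_lost=0):
    # shards still readable with `az_down` whole AZs lost plus `extra_lost` drives
    return n - az_down * (n // AZS) - extra_lost

for n in range(9, 25, 3):                  # total shards, divisible by 3 AZs
    for k in range(2, n):                  # shards needed to reconstruct an object
        if shards_left(n, 1, 1) >= k and n / k < 1.8:
            spare = shards_left(n, 1, 0) - k
            print(f"{k}:{n}  ratio {n / k:.2f}  survives 1 AZ + {spare} extra shard(s)")
```

Running it shows nothing qualifies at n=9 (matching the point above), while 7:12, 11:18, 13:21 and so on do.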
Another key hidden assumption is that you don't worry about correlated shard loss except in the AZ down case. HDDs fail, but these are independent events. So you can bound the probability of simultaneous shard loss using the mean time to failure and the mean time to repair that your repair system achieves.
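To make that concrete, a rough bound with made-up MTTF/MTTR numbers (purely illustrative, not real fleet figures):

```python
# For MTTR << MTTF, the chance a given shard is currently dead-and-unrepaired
# is roughly MTTR / MTTF, and independent failures multiply.
from math import comb

MTTF_HOURS = 300_000     # illustrative drive mean time to failure
MTTR_HOURS = 6           # illustrative time for the repair system to restore a shard
p = MTTR_HOURS / MTTF_HOURS          # ~2e-5: chance a given shard is missing right now

n, k = 18, 11                        # the 11-of-18 example above
max_lost = n - k
p_unreadable = sum(comb(n, r) * p**r * (1 - p)**(n - r)
                   for r in range(max_lost + 1, n + 1))
print(f"p(single shard missing) = {p:.1e}")
print(f"p(object unreadable from independent drive failures alone) = {p_unreadable:.1e}")
```

The second number comes out absurdly small, which is why the AZ-down (correlated) case is the one that actually constrains the scheme.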
I generally don't think about storage I/O speed at that scale (I mean really who does?). I once used a RAID0 to store data to HDDs faster, but that was a long time ago.
I would have naively guessed an interesting caching system, and to some degree tiers of storage for hot vs cold objects.
It was obvious after I read the article that parallelism was a great choice, but I definitely hadn't considered the detailed scheme of S3, or the error correction it used. Parallelism is the one word summary, but the details made the article worth reading. I bet minio also has a similar scaling story: parallelism.
RAID doesn’t exactly make writes faster, it can actually be slower. Depends on if you are using RAID for mirroring or sharding. When you mirror, writes are slower since you have to write to all disks.
AWS themselves have bragged that the biggest S3 buckets are striped across over 1 million hard drives. This doesn't mean they are using all of the space of all these drives, because one of the key concepts of S3 is to average IO of many customers over many drives.
Unless you have a large cluster with many tens of nodes/OSDs (and who does in a homelab?) then using Ceph is a bad idea (I've run large Ceph clusters at previous jobs).
Ceph's design is to avoid a single bottleneck or single point of failure, with many nodes that can all ingest data in parallel (high bandwidth across the whole cluster) and be redundant/fault tolerant in the face of disk/host/rack/power/room/site failures. In exchange it trades away some of: low latency, efficient disk space use, simple design, some kinds of flexibility. If you have a "small" use case then you will have a much easier life with a SAN or a bunch of disks in a Linux server with LVM, and probably get better performance.
How does it work with no single front end[1] and no centralised lookup table of data placement (because that could be a bottleneck)? All the storage nodes use the same deterministic algorithm for data placement known as CRUSH, guided by placement rules which the admin has written into the CRUSH map, things like:
- these storage servers are grouped together by some label (e.g. same rack, same power feed, same data centre, same site).
- I want N copies of data blocks, separated over different redundancy / failure boundaries like different racks or different sites.
There's a monitoring daemon which shares the CRUSH map out to each node. They get some data coming in over the network, work through the CRUSH algorithm, and then send the data internally to the target node. The algorithm is probabilistic and not perfectly balanced so some nodes end up with more data than others, and because there's no central data placement table this design is "as full as the fullest disk" - one full disk anywhere in the cluster will put the entire cluster into read-only mode until you fix it. Ceph doesn't easily run well with random cheap different size disks for that reason, the smallest disk or host will be a crunch point. It runs best with raw storage below 2/3rds full. It also doesn't have a single head which can have a fast RAM cache like a RAID controller can have.[2] Nothing about this is designed for the small business or home use case, it's all designed to spread out over a lot of nodes[3].
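To give a feel for "deterministic placement with no lookup table" - this is not CRUSH itself, just a toy rendezvous-hashing sketch of the same idea, with made-up rack and OSD names:

```python
# Every node can compute the same placement from the object name alone,
# no central table needed, and copies land in different failure domains.
import hashlib

racks = {
    "rack-a": ["osd.0", "osd.1", "osd.2"],
    "rack-b": ["osd.3", "osd.4", "osd.5"],
    "rack-c": ["osd.6", "osd.7", "osd.8"],
}

def score(osd, obj):
    # deterministic pseudo-random weight for this (osd, object) pair
    return hashlib.md5(f"{osd}:{obj}".encode()).hexdigest()

def place(obj, copies=3):
    # best OSD inside each rack, then the best `copies` racks overall
    best_per_rack = [max(osds, key=lambda o: score(o, obj)) for osds in racks.values()]
    return sorted(best_per_rack, key=lambda o: score(o, obj), reverse=True)[:copies]

print(place("mybucket/object-123"))   # same answer on every node that runs this
```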
It’s got a design where the units of storage are OSDs (Object Storage Devices) which correspond roughly to disks/partitions/LVM volumes, each one has a daemon controlling it. Those are pulled together as RADOS (Reliable Autonomic Distributed Object Store) where Ceph internally keeps data, and on top of that the admin can layer user-visible storage such as the CephFS filesystem, Amazon S3 compatible object storage, or a layer that presents as a block device which can be formatted with XFS/etc.
It makes a distributed system that can ingest a lot of data in parallel streams using every node’s network bandwidth, but with quite a lot of internal shuffling of data between nodes and layers that adds latency, and there are monitor daemons and management daemons overseeing the whole cluster to keep track of failed storage units and make the CRUSH map available to all nodes, and those ought to be duplicated and redundant as well. It's a bit of a "build it yourself storage cluster kit" which is pretty nicely designed and flexible but complex and layered and non-trivial.
There are some talks on YouTube by people who managed and upgraded it at CERN as targets of particle accelerators data which are quite interesting. I can only recommend searching for "Ceph at Cern" and there are many hours of talks, I can't remember which ones I've seen. Titles like: "Ceph at CERN: A Year in the Life of a Petabyte-Scale Block Storage Service", "Ceph and the CERN HPC Infrastructure", "Ceph Days NYC 2023: Ceph at CERN: A Ten-Year Retrospective", "Ceph Operations at CERN: Where Do We Go From Here?".
[1] If you are not writing your own software that speaks to Ceph's internal object storage APIs, then you are fronting it with something like a Linux machine running an XFS filesystem or the S3-compatible gateway, and that machine becomes your single point of failure and bottleneck. Then you front one Ceph cluster with many separate Linux machines as targets, and have your users point their software to different front ends, and in that case why use Ceph at all? You may as well have had many Linux machines with their own separate internal storage and rsync, and no Ceph. Or two SANs with data replication between them. Do you need (or want) what Ceph does, specifically?
[2] I have only worked on HDD based clusters, with some SSDs for storing metadata to speed up performance. These clusters were not well specced and the metadata overflowed onto the HDDs which didn't help anything.
[3] There are ways to adjust the balance of data on each node to work with different size or nearly full disks, but if you get to read-only mode you end up waiting for it to internally rebalance while everything is down. This isn't so different to other storage like SANs, it's just that if you are going for Ceph you probably have a big cluster with a lot of things using it so a lot of things offline. You still have to consider running multiple Ceph clusters to limit blast radius of failures, if you are thinking "I don't want to bother with multiple storage targets I want one Ceph" you still need to plan that maybe you don't just want one Ceph.
While most of what you speak of re Ceph is correct, I want to strongly disagree with your view of not filling up Ceph above 66%. It really depends on implementation details. If you have 10 nodes, yeah then maybe that's a good rule of thumb. But if you're running 100 or 1000 nodes, there's no reason to waste so much raw capacity.
With upmap and balancer it is very easy to run a Ceph cluster where every single node/disk is within 1-1.5% of the average raw utilization of the cluster. Yes, you need room for failures, but on a large cluster it doesn't require much.
80% is definitely achievable, 85% should be as well on larger clusters.
Also re scale, depending on how small we're talking of course, but I'd rather have a small Ceph cluster with 5-10 tiny nodes than a single Linux server with LVM if I care about uptime. It makes scheduled maintenances much easier, also a disk failure on a regular server means RAID group (or ZFS/btrfs?) rebuild. With Ceph, even at fairly modest scale you can have very fast recovery times.
Source, I've been running production workloads on Ceph at fortune-50 companies for more than a decade, and yes I'm biased towards Ceph.
I defer to your experience and agree that it really depends on implementation details (and design). I've only worked on a couple of Ceph clusters built by someone else who left, around 1-2PB, 100-150 OSDs, <25 hosts, and not all the same disks in them. They started falling over because some OSDs filled up, and I had to quickly learn about upmap and rebalancing. I don't remember how full they were, but numbers around 75-85% were involved so I'm getting nervous around 75% from my experiences. We'd suddenly commit 20TB of backup data and that's a 2% swing. It was a regular pain in the neck and stress point, and the creaking, amateurishly managed, under-invested Ceph cluster's problems caused several outages and some data corruption. Just having some more free space slack in it would have spared us.[1]
That whole situation is probably easier the bigger the cluster gets; any system with three "units" that has to tolerate one failing can only have 66% usable. With a hundred "units" then 99% are usable. Too much free space is only wasting money, too full is a service down disaster, for that reason I would prefer to err towards the side of too much free rather than too little.
Other than Ceph I've only worked on systems where one disk failure needs one hotspare disk to rebuild, anything else is handled by a separate backup and DR plan. With Ceph, depending on the design it might need free space to handle a host or rack failure, and that's pretty new to me and also leads me to prefer more free space rather than less. With a hundred "units" of storage grouped into 5 failure domains then only 80% is usable, again probably better with scale and experienced design.
If I had 10,000 nodes I'd rather have 10,100 nodes and better sleep than play "how close to full can I get this thing", constantly on edge waiting for a problem which takes down a 10,000 node cluster and all the things that needed such a big cluster. I'm probably taking some advice from Reddit threads talking about 3-node Ceph/Proxmox setups which say 66% and YouTube videos talking about Ceph at CERN - in those I think their use case is a bursty massive dump of particle accelerator data to ingest, followed by a quieter period of read-heavy analysis and reporting, so they need to keep enough free space for large swings. My company's use case was more backup data churn, lower peaks, less tidal, quite predictable, and we did run much fuller than 66%. We're now down below 50% used as we migrate away, and they're much more stable.
[1] it didn't help that we had nobody familiar with Ceph once the builder had left, and these had been running a long time and partially upgraded through different versions, and had one-of-everything; some S3 storage, some CephFS, some RBDs with XFS to use block cloning, some N+1 pools, some Erasure Coding pools, some physical hardware and some virtual machines, some Docker containerised services but not all, multiple frontends hooked together by password based SSH, and no management will to invest or pay for support/consultants, some parts running over IPv6 and some over IPv4, none with DNS names, some front-ends with redundant multiple back end links, others with only one. A well-designed, well-planned, management-supported cluster with skilled admins can likely run with finer tolerances.
I think the article's title question is a bit misleading because it focuses on peak throughput for S3 as a whole. The interesting question is "How can the throughput for a GET exceed the throughput of an HDD?"
If you just replicated, you could still get big throughput for S3 as a whole by doing many reads that target different HDDs. But you'd still be limited to max HDD throughput * number of GETs. S3 is not so limited, and that's interesting and non-obvious!
I still feel like you're underselling the article however.
It's obviously ultimately parallelism, but parallelism is hard at scale - because things often don't scale - and incorrect parallelism can even make things slower. And it's not always obvious why something gets slower with parallelism.
As a dumb example, if you have a fictional HDD with one disk and one head, you have two straightforward options to optimize performance:
- Make sure only one file is read at the same time (otherwise the disk will keep seeking back and forth)
- Make sure the file is persisted in a way that you're only accessing one sector, never entering the situation in which it would seek back and forth.
Ofc, that can be dumbed down to "parallelism", because this is inherently a question about how to parallelize... But it's also ignoring that that's exactly what is being elaborated on: the ways S3 uses to enable parallelism.
You're right if you're only looking at peak sequential throughput. However, and this is the part that the author could have emphasized more, the impressive part is their strategy for dealing with disk access latency to improve random read throughput.
They shard the data as you might expect of a RAID 5, 6, etc. array, and the distributed parity solves the problem of failure tolerance as you would expect, and also improves bandwidth via parallelism as you describe.
The interesting part is that their best strategy for sharding the data is plain-old-simple random. The decision of which disks and at which sectors to shard the data is made at random, and this creates the best chance that at least one of the two copies of data can be accessed with much lower latency (~1ms instead of ~8ms).
The most crude, simple approach turns out to give them the best mileage. There's something vaguely poetic about it, an aesthetic beauty reminiscent of Euler's Identity or the solution to the Basel Problem; a very simple statement with powerful implications.
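A toy model of that latency claim, treating seek time as proportional to head travel on a [0, 1] platter and ignoring rotation, queueing, and zoning, so purely illustrative:

```python
# Expected seek to a single randomly placed copy vs the nearer of two
# randomly placed copies, with seek time modeled as linear in travel distance.
import random

FULL_SEEK_MS = 8.0   # the article's (optimistic) full-stroke figure
N = 1_000_000

one_copy, best_of_two = 0.0, 0.0
for _ in range(N):
    head, a, b = random.random(), random.random(), random.random()
    one_copy    += abs(head - a)
    best_of_two += min(abs(head - a), abs(head - b))

print(f"avg seek, single copy   : {FULL_SEEK_MS * one_copy / N:.2f} ms")
print(f"avg seek, nearer of two : {FULL_SEEK_MS * best_of_two / N:.2f} ms")
```

The second number comes out noticeably lower than the first, which is the whole trick: random placement means at least one copy is usually close to wherever the head happens to be.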
The fractional part isn't helping them serve data any faster. To the contrary, it actually reduces the speed from parallelism. E.g. a 5:9 scheme only achieves 1.8x throughput, whereas straight-up triple redundancy would achieve 3x.
It just saves AWS money is all, by achieving greater redundancy with less disk usage.
> full-platter seek time: ~8ms; half-platter seek time (avg): ~4ms
Average distance between two points (first is current location, second is target location) when both are uniformly distributed in [ 0 .. +1 ] interval is not 0.5, it’s 1/3. If the full platter seek time is 8ms, average seek time should be 2.666ms.
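For reference, that 1/3 is the expected distance between two independent uniform points on [0, 1] (which, as pointed out elsewhere in the thread, ignores the ZCAV effect on how block numbers map to tracks):

```latex
\[
\mathbb{E}\,\lvert X-Y\rvert
  = \int_0^1\!\!\int_0^1 \lvert x-y\rvert \,dx\,dy
  = 2\int_0^1\!\!\int_0^x (x-y)\,dy\,dx
  = 2\int_0^1 \frac{x^2}{2}\,dx
  = \frac{1}{3}.
\]
```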
Full seek on a modern drive is a lot closer to 25ms than 8ms. It's pretty easy to test yourself if you have a hard drive in your machine and root access - fire up fio with --readonly and feed it a handmade trace that alternates reading blocks at the beginning and end of disk. (--readonly does a hard disable of any code that could write to the drive)
When I was reviewing it for publication I ran a couple of tests and found more like 18ms on the devices I tested, but I’m sure there are some that do 15. 25 is probably on the slow end. (although I’ve never tested a HAMR drive - their head assemblies are probably heavier and more delicate)
Old SCSI 10K drives could handle a huge queue and reach 500 random read IOPS, sounding like a buzzsaw while they did it. Modern capacity drives treat their internals much more gently, and don’t get as much queuing gain. Note also that for larger objects the chunk size is probably 1+ rotations to amortize the seek overhead.
Oh, and by “full seek” I mean the time it takes to read the max block number (inner diameter) right after reading the min block number (OD) subtracting rotational delay.
You can do this test yourself with fio --readonly and root access to a hard drive block device, even if it’s mounted. (good luck reading any files while the test is running, but no damage done) Pick a variety of very high and low blocks, and the min delay will be when rotational delay is close to zero.
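If you'd rather not hand-craft an fio trace, here is a rough pure-Python version of the same idea (the device name is a placeholder, it needs root and an otherwise idle disk, and fio is still the more careful tool):

```python
# Alternate reads between the outer and inner edge of a disk and look at the
# fastest round trip (the minimum approximates full-stroke seek with near-zero
# rotational delay, as described above). Read-only; no writes are issued.
import os, random, time

DEV = "/dev/sdX"                    # <-- your hard drive, placeholder name
BLOCK = 4096
fd = os.open(DEV, os.O_RDONLY)
size = os.lseek(fd, 0, os.SEEK_END)

samples = []
for _ in range(200):
    lo = random.randrange(0, size // 100) // BLOCK * BLOCK                    # near the OD
    hi = random.randrange(size * 99 // 100, size - BLOCK) // BLOCK * BLOCK    # near the ID
    os.pread(fd, BLOCK, lo)                       # park the head at the outer edge
    t0 = time.perf_counter()
    os.pread(fd, BLOCK, hi)                       # time the full-stroke seek + rotation
    samples.append((time.perf_counter() - t0) * 1000)

os.close(fd)
samples.sort()
print(f"min {samples[0]:.1f} ms (~full seek), median {samples[len(samples)//2]:.1f} ms")
```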
There's acceleration of the read head to move it between the tracks. So it may well be 4ms because shorter distances are penalized by a lower peak speed of the read head as well as constant factors (settling at the end of motion)
That’s a bogus number from an ancient slide deck for a class in 2001 or so, that’s misled generations of folks googling for the answer.
Note also that the outer tracks are longer (google ZCAV) and hold more data, so seeks across uniformly distributed block numbers do not generate uniformly distributed track numbers.
The platter is a circle so using the uniform distribution [0, 1] is incorrect, you should use the unit circular distribution of [0, 2pi] and also since the platter also spins in a single direction the distance is only computed going one way around (if target is right before current, it's one full spin).
But you can simplify this problem down and ask: with no loss of generality, if your starting point is always 0 degrees, how many degrees clockwise is a random point on average, if the target is uniformly distributed?
Since 0-180 has the same arc length as 180-360 then the average distance is 180 degrees. So average half-platter seek is half of the full-platter seek.
What you wrote only applies to rotational latency, not seek latency. The seek latency is the time it takes for the head to reach the target. Heads only rotate within the small range like [ 0 .. 25° ], they are designed for rapid movements in either direction.
Note that you can kind of infer that S3 is still using hard drives for their basic service by looking at pricing and calculating the IOPS rate that doubles the cost per GB per month.
S3 GET and PUT requests are sufficiently expensive that AWS can afford to let disk space sit idle to satisfy high-performance tenants, but not a lot more expensive than that.
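A back-of-envelope version of that calculation, using approximate S3 Standard list prices (ballpark us-east-1 figures, check current pricing):

```python
STORAGE_PER_GB_MONTH = 0.023      # S3 Standard storage, $/GB-month (approximate)
GET_PER_1000 = 0.0004             # S3 Standard GET requests, $ per 1,000 (approximate)
SECONDS_PER_MONTH = 30 * 24 * 3600

cost_per_tb_month = STORAGE_PER_GB_MONTH * 1000
gets_that_double_it = cost_per_tb_month / (GET_PER_1000 / 1000)
iops_per_tb = gets_that_double_it / SECONDS_PER_MONTH
print(f"~{iops_per_tb:.0f} sustained GETs/sec per TB makes request charges equal storage charges")

# Compare: a ~20 TB nearline HDD doing ~100-200 random IOPS offers ~5-10 IOPS per TB,
# while SSDs offer thousands -- the break-even sits in HDD territory, not SSD territory.
```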
S3’s KeyMap Index uses SSDs. I also wouldn’t be surprised if at this point SSDs are somewhere along the read path for caching hot objects or in the new one zone product.
The storage itself is probably (mostly) on HDDs, but I'd imagine metadata, indices, etc are stored on much faster flash storage. At least, that's the common advice for small-ish Ceph cluster MDS servers. Obviously S3 is a few orders of magnitude bigger than that...
Repeating a comment I made above - for standard tier, requests are expensive enough that it's cost-effective to let space on the disks go unused if someone wants an IOPS/TB ratio that's higher than what disk drives can provide. But not much more expensive than that.
The latest generation of drives store about 30TB - I don't know how much AWS pays for them, but a wild-ass guess would be $300-$500. That's a lot cheaper than 30TB of SSD.
Also important - you can put those disks in high-density systems (e.g. 100 drives in 4U) that only add maybe 25% to the total cost, at least if you're AWS, a bit more for the rest of us. The per-slot cost of boxes that hold lots of SSDs seems to be a lot higher.
I've always felt it's probably a wrapper around the Amazon EFS due to the similar pricing and that S3 One Zone has "Directory" buckets, a very file system-y idea.
Seems to indicate the storage underneath might be similar in cost and performance, and this might in fact really be similar. Not that the software on top is the same.
Yeah, I don't know about S3, but years back I talked a fair bit with someone that did storage stuff for HPC, and one thing he talked about is building huge JBOD arrays where only a handful of disks per rack would be spun up, basically pushing what could be done with scsi extenders or such. It wouldn't surprise me if they're doing something like that with batch scheduling the drive activations over a minutes to hours window.
I think that's close to the truth. IIRC it's something like a massive cluster of machines that are effectively powered off 99% of the time with a careful sharding scheme where they're turned on and off in batches over a long period of time for periodic backup or restore of blobs.
It's amazing that Glacier is such a huge system, with so many people working on it, and it's still a public mystery how it works. I've not seen a single confirmation of how it works.
Not even the higher tiers of Glacier were tape afaict (at least when it was first created), just the observation that hard drives are much bigger than you can reasonably access in useful time.
In the early days when there were articles speculating on what Glacier was backed by, it was actually on crusty old S3 gear (and at the very beginning, it was just on S3 itself as a wrapper and a hand wavy price discount, eating the costs to get people to buy in to the idea!). Later on (2018 or so) they began moving to a home grown tape-based solution (at least for some tiers).
I'm not aware of AWS ever confirming tape for glacier. My own speculation is they likely use hdd for glacier - especially so for the smaller regions - and eat the cost.
Someone recently came across some planning documents filed in London for a small "datacenter" which wasn't attached to their usual London compute DCs and built to house tape libraries (this was explicitly called out as there was concern about power - tape libraries don't use much). So I would be fairly confident they wait until the glacier volumes grow enough on hdd before building out tape infra.
Do you have any sources for that? I'm really curious about Glacier's infrastructure and AWS has been notoriously tight-lipped about it. I haven't found anything better than informed speculation.
My speculation: writes are to /dev/null, and the fact that reads are expensive and that you need to inventory your data before reading means Amazon is recreating your data from network transfer logs.
I'd be curious whether simulating a shitty restoration experience was part of the emulation when they first ran Glacier on plain S3 to test the market.
There might be surprisingly little value in going tape due to all the specialization required. As the other comment suggests, many of the lower tiers likely represent basically IO bandwidth classes. A 16 TB disk with 100 IOPS can only offer 1 IOPS over 160 GB to each of 100 customers, or 0.1 IOPS over 16 GB to each of 1000, etc. Just scale up that thinking to a building full of disks; it still applies.
I realize you're making a general point about space/IO ratios and the below is orthogonal, no contradiction.
It's actually a lot less user-facing IO capacity per disk that you will be able to "sell" in a large distributed storage system. There's constant maintenance churn to keep data available:
- local hardware failure
- planned larger scale maintenance
- transient, unplanned larger scale failures
(etc)
In general, you can fall back to using reconstruction from the erasure codes for serving during degradation. But that's a) enormously expensive in IO and CPU and b) you carry higher availability and/or durability risk because you lost redundancy.
Additionally, it may make sense to rebalance where data lives for optimal read throughput (and other performance reasons).
So in practice, there's constant rebalancing going on in a sophisticated distributed storage system that takes a good chunk of your HDD IOPS.
This + garbage collection also makes tape really unattractive for all but very static archives.
See comments above about AWS per-request cost - if your customers want higher performance, they'll pay enough to let AWS waste some of that space and earn a profit on it.
Std has the same performance as every other storage class. There are 2 async classes which you can't read from without retrieving first, but that's not a 'performance' difference as such - GETs aren't slow, they fail.
It is interesting that even after falling HDD prices, S3 pricing has remained the same for at least 8 years. There's just not enough competition to push them to reduce costs. But imagine the money it brings in for AWS because of this.
Same with every other aspect of their offerings. Look at EC2, even with instances like m7a.medium: 1 vCPU (not core) and 4GB memory for ~$50 USD/month on demand or ~$35/month reserved for 1 year. It isn't even close to being competitive outside the other big cloud providers.
There is inflation, so it has effectively dropped in price. But your point is taken: inflation’s effect on prices is most assuredly slower than the progress of technology’s effect.
Reducing costs is the wrong incentive. If you look at a modern vendor such as Splunk or CrowdStrike, they have huge estates in AWS. There are huge swaths of repeating data, both within and across tenants. Rather than pointing this out, it is simpler and more effective to charge the customer for this data/usage, and use simple techniques so that it isn't duplicative. Reducing costs would only incentivize and increase this asinine usage.
How much have HDD prices really fallen? AFAIK the incredible improvements in price per byte for HDDs have slowed so much that they'll be eclipsed by SSDs in a few years.
Flash went from within 2x the price of DRAM in 2012 or so to maybe 40-50x cheaper today, driven somewhat by shrinking feature sizes, but mostly by the shift from SLC (1 bit/cell) to TLC (3 bits) and QLC (4 bits) and from planar to 300+ layer 3D flash.
Flash is near the end of the “S-curve” of those technologies being rolled out.
During that time HDD technology was pretty stagnant, with a mere 2x increase due to higher platter count with the use of helium.
New HDD technologies (HAMR) are just starting their rollout, promising major improvements in $/GB over the next few years as they roll out.
You can’t just look at a price curve on a graph and predict where it’s going to go. The actual technologies responsible for that curve matter.
> mostly by the shift from SLC (1 bit/cell) to TLC (3 bits) and QLC (4 bits) and from planar to 300+ layer 3D flash
That "and" is doing a lot of work.
In 2012 most flash was MLC.
In 2025 most flash is TLC.
> During that time HDD technology was pretty stagnant, with a mere 2x increase due to higher platter count with the use of helium.
They've advanced slower than SSDs but it wasn't that slow. Between 2012 and 2025, excluding HAMR, sizes have improved from 4TB to 24TB and prices at the low end have improved from $50/TB to $12/TB.
This is one of those times a downvote confuses me. I corrected some numbers. Was I accidentally rude? If I made a mistake on the numbers please give the right numbers.
If my first line was unclear: We might say the denser bits give us a 65% density improvement. And quick math shows that an 80-100x improvement is actually nine 65% improvements in a row. So the denser bits per cell aren't doing much, it's pretty much all process improvement.
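Checking that compounding arithmetic:

```python
print(1.65 ** 9)   # ≈ 90.9 -- nine compounded 65% steps land in the claimed 80-100x range
```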
3D flash is over 300 layers now. The size of a single 300-bit stack on the surface of the chip is bigger than an old planar cell, but that 300x does a lot more than make up for it.
3D NAND isn’t a “process improvement” - it’s a fundamental new architecture. It’s radically cheaper because it’s a set of really cheap steps to make all 300+ layers, not using any of the really expensive lithography systems in the fab, then a single (really complicated) set of steps to drill holes through the layers for the bit stacks and coat the insides of the holes.
Chip cost basically = the depreciation of the fab investment during the time a chip spends in the fab, so 3D NAND is a huge win. (just stacking layers by running the chip through the process N times wouldn’t save any money, and would probably just decrease yields)
A total guess - 2x more expensive for extra steps, bit stacks take 4x more area than planar cells, 300 layer would have 300/8 = 37.5x cheaper bits. (That 4x is pulling a lot of weight - for all I know it might be more like 8x, but the point stands)
Because they made something different with the same process, instead of making the same thing with a different process. Feature size didn’t get any smaller. (or, rather, you get the order of magnitude improvement without it, and those gains were vastly more than the feature size improvements over that time period)
Also because “process improvement” usually refers to things where you get incremental improvements basically for free as each new generation of fab rolls out. Unless you can invent a 4D flash, this is a single (huge) improvement that’s mostly played out.
Oh, and no one has a solution to make HDDs faster. If anything, they may have gotten slower as they get optimized for capacity instead of speed.
(Well, peak data transfer rate keeps going up as bits get packed tighter, but capacity goes up linearly with areal bit density, while the speed the bits go under the head goes up with the square root.)
(Well, sort of. For a while a lot of the progress came from making the bits skinnier but not much shorter, so transfer rates didn’t go up that much)
Magnetic hard drives are 100X cheaper per GB than when S3 launched, and are about 3X cheaper than in 2016 when the price last dropped. Magnetic prices have actually ticked up recently due to supply chain issues, but HAMR is expected to cause a significant drop (50-75%/GB) in magnetic storage prices as it rolls out in next few years. SSDs are ~$120/T and magnetic drives are ~$18/T. This hasn't changed much in the last 2 years.
Author of the 2minutestreaming blog here. Good point! I'll add this as a reference at the end. I loved that piece. My goal was to be more concise and focus on the HDD aspect
Please fix the seek time numbers - they’re wildly inaccurate; full platter seek is more like 20-25ms. I tried chasing down where the 8ms number came from a while back, and I think it applies to old sub-100GB 10K RPM high-speed drives from 25 or so years ago, which were purposely low density so they could swing the head faster and less precisely.
Woah buddy, I worked with Andy for years and this is not my experience. Moving a large product like S3 around is really, really difficult, and I've always thought highly of Andy's ability to: (a) predict where he thought the product should go, (b) come up with novel ways of getting there, and (c) trimming down the product to get something in the hands of customers.
Also, did you create this account for the express purpose of bashing Andy? That's not cool.
If we assume enterprise HDDs in the double-digit TB range then one can estimate that the total S3 storage volume of AWS is in the triple-digit exabyte range. That's probably the biggest storage system on planet Earth.
This piece is interesting background, but worth noting that the actual numbers are highly speculative. The NSA has never disclosed hard data on capacity, and most of what's out there is inference from blueprints, water/power usage, or second-hand claims. No verifiable figures exist.
Is there an open source service designed with HDDs in mind that achieves similar performance? I know none of the big ones work that well with HDDs: MinIO, Swift, Ceph+RadosGW, SeaweedFS; they all suggest flash-only deployments.
Recently I've been looking into Garage and liking the idea of it, but it seems to have a very different design (no EC).
I would say that Ceph+RadosGW works well with HDDs, as long as 1) you use SDDs for the index pool, and 2) you are realistic about the number of IOPs you can get out of your pool of HDDs.
And remember that there's a multiplication of IOPS for any individual client IOP, whether you're using triplicate storage or erasure coding. S3 also has IOP multiplication, which they solve with tons of HDDs.
For big object storage that's mostly streaming 4MB chunks, this is no big deal. If you have tons of small random reads and writes across many keys or a single big key, that's when you need to make sure your backing store can keep up.
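A minimal sketch of that multiplication, assuming a full-stripe write and ignoring journaling, partial-write read-modify-write, and metadata ops (so the real numbers are worse):

```python
# Backend ops per client op for 3x replication vs k+m erasure coding.
def backend_ops(scheme, op):
    if scheme == "replica3":
        return 3 if op == "write" else 1       # write all copies, read any one
    k, m = scheme                              # (data chunks, parity chunks), e.g. (4, 2)
    return k + m if op == "write" else k       # full-stripe write touches every chunk

for scheme in ("replica3", (4, 2), (8, 3)):
    print(scheme, "-> write x", backend_ops(scheme, "write"),
          ", read x", backend_ops(scheme, "read"))
```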
Thanks, but I forgot to specify that I'm interested in S3-compatible servers only.
Basically, I have a single big server with 80 high-capacity HDDs and 4 high-endurance NVMes, and it's the S3 endpoint that gets a lot of writes.
So yes, for now my best candidate is ZFS + Garage, this way I can get away with using replica=1 and rely on ZFS RAIDz for data safety, and the NVMEs can get sliced and diced to act as the fast metadata store for Garage, the "special" device/small records store for the ZFS, the ZIL/SLOG device and so on.
Currently it's a bit of a Frankenstein's monster: using XFS+OpenCAS as the backing storage for an old version of MinIO (containerized to run as 5 instances). I'm looking to replace it with a simpler design and hopefully get better performance.
It is probably worth noting that most of the listed storage systems (including S3) are designed to scale not only in hard drives, but horizontally across many servers in a distributed system. They really are not optimized for a single storage node use case. There are also other things to consider that can limit performance, like what does the storage back plane look like for those 80 HDDs, and how much throughput can you effectively push through that. Then there is the network connectivity that will also be a limiting factor.
It's a very beefy server with 4 NVMe and 20 HDD bays + a 60-drive external enclosure, 2 enterprise grade HBA cards set to multipath round-robin mode, even with 80 drives it's nowhere near the data path saturation point.
The link is a 10G 9K MTU connection, the server is only accessed via that local link.
Essentially, the drives being HDD are the only real bottleneck (besides the obvious single-node scenario).
At the moment, all writes are buffered into the NVMes via OpenCAS write-through cache, so the writes are very snappy and are pretty much ingested at the rate I can throw data at it. But the read/delete operations require at least a metadata read, and due to the very high number of small (most even empty) objects they take a lot more time than I would like.
I'm willing to sacrifice the write-through cache benefits (the write performance is actually an overkill for my use case), in order to make it a little more balanced for better List/Read/DeleteObject operations performance.
On paper, most "real" writes will be sequential data, so writing that directly to the HDDs should be fine, while metadata write operations will be handled exclusively by the flash storage, thus also taking care of the empty/small objects problem.
> Essentially, the drives being HDD are the only real bottleneck
On the low end a single HDD can deliver 100MB/s, so 80 can deliver 8,000MB/s; a single NVMe can do 700MB/s and you have 4, so 2,800MB/s. A 10Gb link can only do ~1,000MB/s, so isn't your bottleneck the network, and then probably the CPU?
If your server is old, the RAID card's PCIe interface will be another bottleneck, alongside the latencies added if the card is not that powerful to begin with.
Same applies to your NVMe throughput, since now you risk congesting the PCIe lanes if you're increasing lane count with PCIe switches.
If there are gateway services or other software bound processes like zRAID, your processor will saturate way before your NIC, adding more jitter and inconsistency to your performance.
NIC is an independent republic on the motherboard. They can accelerate almost anything related to stack, esp. server grade cards. If you can pump the data to the NIC, you can be sure that it can be pushed at line speed.
However, running a NIC at line speed with data read from elsewhere on the system is not always that easy.
For sure, there is zero expectations for any kind of hardware downtime tolerance, it's a secondary backup storage cobbled together from leftovers over many years :)
For software, at least with MinIO it's possible to do rolling updates/restarts since the 5 instances in docker-compose are enough for proper write quorum even with any single instance down.
Getting Ceph erasure coding set up properly on a big hard disk pool is a pain - you can tell that EC was shoehorned into a system that was totally designed around triple replication.
Originally Ceph divided big objects into 4MB chunks, sending each chunk to an OSD server which replicated it to 2 more servers. 4MB was chosen because it was several drive rotations, so the seek+ rotational delay didn’t affect the throughput very much.
Now the first OSD splits it into k data chunks plus d parity chunks, so the disk write size isn’t 4MB, it’s 4MB/k, while the efficient write size has gone up 2x? 4x? since the original 4MB decision as drive transfer rates increase.
You can change this, but still the tuning is based on the size of the block to be coded, not the size of the chunks to be written to disk. (and you might have multiple pools with much different values of k)
I'm still not sure which exact Ceph concept you are referring to. There is the "minimum allocation size" [1], but that is currently 4 KB (not MB).
There is also striping [2], which is the equivalent of RAID-10 functionality to split a large file into independent segments that can be written in parallel.
Perhaps you are referring to RGW's default stripe size of 4 MB [3]?
If yes, I can understand your point about one 4 MB RADOS object being erasure-coded to e.g. 6 = 4+2 "parity chunks", making it < 1 MB writes that are not efficient on HDDs.
But would you not simply raise `rgw_obj_stripe_size` to address that, according to the k you choose? E.g. 24 MB? You mention it can be changed, but I don't understand the "but still the tuning is based on the size of the block to be coded" part, (why) is that a problem?
Also, how else would you do it when designing EC writes?
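The arithmetic behind that, purely illustrative (actual Ceph behaviour also depends on BlueStore allocation and partial stripes):

```python
# Per-OSD write size for an erasure-coded RGW object, as a function of
# stripe size and k (data chunks).
MB = 1 << 20

for stripe_mb in (4, 24):                 # candidate rgw_obj_stripe_size values
    for k in (4, 6, 8):                   # data chunks in the EC profile
        per_osd = stripe_mb * MB / k
        print(f"stripe {stripe_mb} MB, k={k}: {per_osd / MB:.2f} MB written per data chunk")
```

With the default 4 MB stripe the per-OSD writes shrink well below 1 MB as k grows; raising the stripe size restores multi-MB sequential writes per drive.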
If you can afford it, mirroring in some form is going to give you way better read perf than RAIDz. Using zfs mirrors is probably easiest but least flexible, zfs copies=2 with all devices as top level vdevs in a single zpool is not very unsafe, and something custom would be a lot of work but could get safety and flexibility if done right.
You're basically seek limited, and a read on a mirror is one seek, whereas a read on a RAIDz is one seek per device in the stripe. (Although if most of your objects are under the chunk size, you end up with more of mirroring than striping)
Yeah, unfortunately mirrors are a no-go due to efficiency requirements, but luckily read performance is not that important if I manage to completely offload FS/S3 metadata and small files to flash storage (separate zpool for Garage metadata, separate special VDEV for metadata/small files).
I think I'm going to go with 8x RAIDz2 VDEVs 10x HDDs each, so that the 20 drives in the internal drive enclosure could be 2 separate VDEVs and not mix with the 60 in the external enclosure.
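Quick capacity check for that layout, with a placeholder drive size:

```python
DRIVE_TB = 16                      # placeholder, whatever the real disks are
VDEVS, WIDTH, PARITY = 8, 10, 2    # 8 x RAIDz2, 10 drives wide

raw = VDEVS * WIDTH * DRIVE_TB
usable = VDEVS * (WIDTH - PARITY) * DRIVE_TB   # before ZFS metadata/slop overhead
print(f"raw {raw} TB, ~{usable} TB usable, {usable / raw:.0%} space efficiency")
```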
It's great to see other people's working solutions, thanks. Can I ask if you have backup on something like this? In many systems it's possible to store some data on ingress or after processing, which serves as something that's rebuildable, even if it's not a true backup. I'm not familiar if your software layer has backup to off site as part of their system, for example, which would be a great feature.
ZFS/OpenZFS can do scrub and do block-level recovery. I'm not sure about Lustre, but since Petabyte sized storage is its natural habitat, there should be at least one way to handle that.
At a past job we had an object store that used SwiftStack. We just used SSDs for the metadata storage but all the objects were stored on regular HDDs. It worked well enough.
Apache Ozone has multiple 100+ petabyte clusters in production. The capacity is on HDDs and metadata is on SSDs. Updated docs (staging for new docs): https://kerneltime.github.io/ozone-site/
Doing some light googling aside from Ceph being listed, there's one called Gluster as well. Hypes itself as "using common off-the-shelf hardware you can create large, distributed storage solutions for media streaming, data analysis, and other data- and bandwidth-intensive tasks."
It's open source / free to boot. I have no direct experience with it myself however.
Gluster has been slowly declining for a while. It used to be sponsored by Red Hat, but that stopped a few years ago. Since then, development has slowed significantly.
I used to keep a large cluster array with Gluster+ZFS (1.5PB), and I can’t say I was ever really that impressed with the performance. That said — I really didn’t have enough horizontal scaling to make it worthwhile from a performance aspect. For us, it was mainly used to make a union file system.
But, I can’t say I’d recommend it for anything new.
A decade ago where I worked we used gluster for ~200TB of HDD for a shared file system on a SLURM compute cluster, as a much better clustered version of NFS. And we used ceph for its S3 interface (RadosGW) for tens of petabytes of back storage after the high IO stages of compute were finished. The ceph was all HDD though later we added some SSDs for a caching pool.
For single client performance, ceph beat the performance I get from S3 today for large file copies. Gluster had difficult to characterize performance, but our setup with big fast RAID arrays seems to still outperform what I see of AWS's Lustre as a service today for our use case of long sequential reads and writes.
We would occasionally try cephFS, the POSIX shared network filesystem, but it couldn't match our gluster performance for our workload. But also, we built the ceph long term storage to maximize TB/$, so it was at a disadvantage compared to our gluster install. Still, I never heard of cephFS being used anywhere despite it being the original goal in the papers back at UCSC. Keep an eye on CERN for news about one of the bigger ceph installs with public info.
I love both of the systems, and see ceph used everywhere today, but am surprised and happy to see that gluster is still around.
I’ve used GlusterFS before because I had tens of old PCs, and it worked very well for me. It was basically a PoC to see how it works rather than something for production, though.
In their design document at https://garagehq.deuxfleurs.fr/documentation/design/goals/ they state: "erasure coding or any other coding technique both increase the difficulty of placing data and synchronizing; we limit ourselves to duplication"
Nice! Learned something new today. Seems like a way to do error correction. One can store parts of the data with some extra metadata, and if some parts are lost, the original can be reconstructed with some computational work.
Seems like some kind of compression?
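It's redundancy rather than compression: you store a little extra so that missing pieces can be recomputed. A minimal illustration with a single XOR parity shard (real systems such as S3 use Reed-Solomon codes, which tolerate several lost shards):

```python
obj = b"0123456789abcdef" * 3                     # 48 bytes of example data
k = 4
size = len(obj) // k
data = [obj[i * size:(i + 1) * size] for i in range(k)]          # 4 data shards
parity = bytes(a ^ b ^ c ^ d for a, b, c, d in zip(*data))       # 1 parity shard

lost = 2                                           # pretend shard 2's drive died
survivors = [d for i, d in enumerate(data) if i != lost]
rebuilt = bytes(p ^ a ^ b ^ c for p, a, b, c in zip(parity, *survivors))
assert rebuilt == data[lost]
print("rebuilt:", rebuilt)   # 5 shards stored for 4 shards of data: 1.25x, vs 2x for a mirror
```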
Is that how the error correction on DVDs works?
And is that how GridFS can keep file store costs so low compared to a regular file system?
We've been running a production ceph cluster for 11 years now, with only one full scheduled downtime for a major upgrade in all those years, across three different hardware generations. I wouldn't call it easy, but I also wouldn't call it hard. I used to run it with SSDs for radosgw indexes as well as a fast pool for some VMs, and hard drives for bulk object storage. Since I was only running 5 nodes with 10 drives each, I was tired of occasional iop issues under heavy recovery so on the last upgrade I just migrated to 100% nvme drives. To mitigate the price I just bought used enterprise micron drives off ebay whenever I saw a good deal pop up. Haven't had any performance issues since then no matter what we've tossed at it. I'd recommend it, though I don't have experience with the other options. On paper I think it's still the best option. Stay away from CephFS though, performance is truly atrocious and you'll footgun yourself for any use in production.
We're using CephFS for a couple years, with some PBs of data on it (HDDs).
What performance issues and footguns do you have in mind?
I also like that CephFS has a performance benefits that doesn't seem to exist anywhere else: Automatic transparent Linux buffer caching, so that writes are extremely fast and local until you fsync() or other clients want to read, and repeat-reads or read-after-write are served from local RAM.
I was an SDE on the S3 Index team 10 years ago, but I doubt much of the core stack has changed.
S3 is composed primarily of layers of Java-based web services. The hot-path operations (object get / put / list) are all served by synchronous API servers - no queues or workers. It is the best example of how many transactions per second a pretty standard Java web service stack can handle that I’ve seen in my career.
For a get call, you first hit a fleet of front-end HTTP API servers behind a set of load balancers. Partitioning is based on the key name prefixes, although I hear they’ve done work to decouple that recently. Your request is then sent to the Indexing fleet to find the mapping of your key name to an internal storage id. This is returned to the front end layer, which then calls the storage layer with the id to get the actual bits. It is a very straightforward multi-layer distributed system design for serving synchronous API responses at massive scale.
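A toy sketch of that layered, synchronous flow; every name here is illustrative, it's just the shape of the request path, not S3's actual code:

```python
NUM_INDEX_PARTITIONS = 4
index = [dict() for _ in range(NUM_INDEX_PARTITIONS)]   # index fleet: key -> storage id
storage = {}                                            # storage fleet: id -> bytes

def partition_for(bucket, key):
    # toy stand-in for routing on the key-name prefix
    return hash((bucket, key[:8])) % NUM_INDEX_PARTITIONS

def put_object(bucket, key, data):
    storage_id = f"blob-{len(storage)}"
    storage[storage_id] = data                                        # write the bits
    index[partition_for(bucket, key)][(bucket, key)] = storage_id     # record the mapping

def get_object(bucket, key):
    storage_id = index[partition_for(bucket, key)].get((bucket, key)) # index lookup
    return None if storage_id is None else storage[storage_id]        # fetch from storage

put_object("photos", "2024/cat.jpg", b"\xff\xd8...")
print(get_object("photos", "2024/cat.jpg"))
```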
The only novel bit is all the backend communication uses a home-grown stripped-down HTTP variant, called STUMPY if I recall. It was a dumb idea to not just use HTTP but the service is ancient and originally built back when principal engineers were allowed to YOLO their own frameworks and protocols so now they are stuck with it. They might have done the massive lift to replace STUMPY with HTTP since my time.
Rest assured STUMPY was replaced with another home grown protocol! Though I think a stream oriented protocol is a better match for large scale services like S3 storage than a synchronous protocol like HTTP.
> Partitioning is based on the key name prefixes, although I hear they’ve done work to decouple that recently.
They may still use key names for partitioning. But they now randomly hash the user key name prefix on the back end to handle hotspots generated by similar keys.
> The hot path (... list) are all served by synchronous API servers
Wait; how does that work, when a user is PUTting tons of objects concurrently into a bucket, and then LISTing the bucket during that? If the PUTs are all hitting different indexing-cluster nodes, then...?
(Or do you mean that there are queues/workers, but only outside the hot path; with hot-path requests emitting events that then get chewed through async to do things like cross-shard bucket metadata replication?)
LIST is dog slow, and everyone expects it to be. (my research group did a prototype of an ultra-high-speed S3-compatible system, and it really helps not needing to list things quickly)
I worked on lifecycle ~5 years ago and just the Standard -> Glacier transition path involved no fewer than 7 microservices.
Just determining which of the 400 trillion keys are eligible for a lifecycle action (comparing each object's metadata against the lifecycle policy on the bucket) is a massive big data job.
Always was a fun oncall when some bucket added a lifecycle rule that queued 1PB+ of data for transition or deletion on the same day. At the time our queuing had become good enough to handle these queues gracefully but our alarming hadn't figured out how to differentiate between the backlog for a single customer with a huge job and the whole system failing to process quickly enough. IIRC this was being fixed as I left.
I work on tiny systems now, but something I miss from "big" deployments is how smooth all of the metrics were! Any bump was a signal that really meant something.
Amazon biases towards a Systems Oriented Architecture approach that sits in the middle ground between monoliths and microservices.
It biases away from lots of small services in favour of larger ones that handle more of the work, so that as much as possible you avoid the costs and latency of preparing, transmitting, receiving and processing requests.
I know S3 has changed since I was there nearly a decade ago, so this is outdated. Off the top of my head it used to be about a dozen main services at that time. A request to put an object would only touch a couple of services en route to disk, and similar on retrieval. There were a few services that handled fixity and data durability operations, the software on the storage servers themselves, and then stuff that maintained the mapping between object and storage.
At this kind of scale, queues, caches and long running workers ought to be avoided at all costs due to their highly opaque nature which drastically increases the unpredictability in the system's behaviour whilst decreasing the reliability and observability.
See timestamp 42:20 at https://youtu.be/NXehLy7IiPM?si=QQEOMCt7kOBTMaGK
The way it’s worded makes me understand that’s what scheme they’re using. Curious to hear what you know
VAST data uses 146+4
https://www.vastdata.com/whitepaper/#similarity-reduction-in...
The page loads, then quickly moves up as some video loads, and the content is gone.
Web 3 in a nutshell
Wow, that is very annoying. Here is a better page:
https://www.vastdata.com/blog/introducing-rack-scale-resilie...
I enjoyed this article but I think the answer to the headline is obvious: parallelism
My homelab servers all have raidz out of 3 nvme drives for this reason: higher parallelism without losing redundancy.
> I would have naively guessed an interesting caching system, and to some degree tiers of storage for hot vs cold objects.
Caching in this scenario is usually done outside of S3, in something like CloudFront.
He explicitly mentioned RAID0 though :)
If you’re curious about this at home, try Ceph in Proxmox.
Ceph's design is to avoid a single bottleneck or single point of failure, with many nodes that can all ingest data in parallel (high bandwidth across the whole cluster) and be redundant/fault tolerant in the face of disk/host/rack/power/room/site failures. In exchange it trades away some of: low latency, efficient disk space use, simple design, some kinds of flexibility. If you have a "small" use case then you will have a much easier life with a SAN or a bunch of disks in a Linux server with LVM, and probably get better performance.
How does it work with no single front end[1] and no centralised lookup table of data placement (because that could be a bottleneck)? All the storage nodes use the same deterministic algorithm for data placement known as CRUSH, guided by placement rules which the admin has written into the CRUSH map, things like:
- these storage servers are grouped together by some label (e.g. same rack, same power feed, same data centre, same site).
- I want N copies of data blocks, separated over different redundancy / failure boundaries like different racks or different sites.
There's a monitoring daemon which shares the CRUSH map out to each node. They get some data coming in over the network, work through the CRUSH algorithm, and then send the data internally to the target node. The algorithm is probabalistic and not perfectly balanced so some nodes end up with more data than others, and because there's no central data placement table this design is "as full as the fullest disk" - one full disk anywhere in the cluster will put the entire cluster into read-only mode until you fix it. Ceph doesn't easily run well with random cheap different size disks for that reason, the smallest disk or host will be a crunch point. It runs best with raw storage below 2/3rds full. It also doesn't have a single head which can have a fast RAM cache like a RAID controller can have.[2] Nothing about this is designed for the small business or home use case, it's all designed to spread out over a lot of nodes[3].
It’s got a design where the units of storage are OSDs (Object Storage Devices) which correspond roughly to disks/partitions/LVM volumes, each one has a daemon controlling it. Those are pulled together as RADOS (Reliable Autonomic Distributed Object Store) where Ceph internally keeps data, and on top of that the admin can layer user-visible storage such as the CephFS filesystem, Amazon S3 compatible object storage, or a layer that presents as a block device which can be formatted with XFS/etc.
It makes a distributed system that can ingest a lot of data in parallel streams using every node’s network bandwidth, but quite a lot of internal shuffling of data around between nodes and layers adding latency, and there are monitor daemons and management daemons overseeing the whole cluster to keep track of failed storage units and make the CRUSH map available to all nodes, and those ought to be duplicated and redundant as well. It's a bit of a "build it yourself storage cluster kit" which is pretty nicely designed and flexible but complex and layered and non-trivial.
There are some talks on YouTube by people who managed and upgraded it at CERN as targets of particle accelerators data which are quite interesting. I can only recommend searching for "Ceph at Cern" and there are many hours of talks, I can't remember which ones I've seen. Titles like: "Ceph at CERN: A Year in the Life of a Petabyte-Scale Block Storage Service", "Ceph and the CERN HPC Infrastructure", "Ceph Days NYC 2023: Ceph at CERN: A Ten-Year Retrospective", "Ceph Operations at CERN: Where Do We Go From Here?".
[1] If you are not writing your own software that speaks to Ceph's internal object storage APIs, then you are fronting it with something like a Linux machine running an XFS filesystem or the S3-compatible gateway, and that machine becomes your single point of failure and bottleneck. Then you front one Ceph cluster with many separate Linux machines as targets, and have your users point their software to different front ends, and in that case why use Ceph at all? You may as well have had many Linux machines with their own separate internal storage and rsync, and no Ceph. Or two SANs with data replication between them. Do you need (or want) what Ceph does, specifically?
[2] I have only worked on HDD based clusters, with some SSDs for storing metadata to speed up performance. These clusters were not well specced and the metadata overflowed onto the HDDs which didn't help anything.
[3] There are ways to adjust the balance of data on each node to work with different size or nearly full disks, but if you get to read-only mode you end up waiting for it to internally rebalance while everything is down. This isn't so different to other storage like SANs, it's just that if you are going for Ceph you probably have a big cluster with a lot of things using it so a lot of things offline. You still have to consider running multiple Ceph clusters to limit blast radius of failures, if you are thinking "I don't want to bother with multiple storage targets I want one Ceph" you still need to plan that maybe you don't just want one Ceph.
While most of what you speak of re Ceph is correct, I want to strongly disagree with your view of not filling up Ceph above 66%. It really depends on implementation details. If you have 10 nodes, yeah then maybe that's a good rule of thumb. But if you're running 100 or 1000 nodes, there's no reason to waste so much raw capacity.
With upmap and balancer it is very easy to run a Ceph cluster where every single node/disk is within 1-1.5% of the average raw utilization of the cluster. Yes, you need room for failures, but on a large cluster it doesn't require much.
80% is definitely achievable, 85% should be as well on larger clusters.
Also re scale, depending on how small we're talking of course, but I'd rather have a small Ceph cluster with 5-10 tiny nodes than a single Linux server with LVM if I care about uptime. It makes scheduled maintenances much easier, also a disk failure on a regular server means RAID group (or ZFS/btrfs?) rebuild. With Ceph, even at fairly modest scale you can have very fast recovery times.
Source, I've been running production workloads on Ceph at fortune-50 companies for more than a decade, and yes I'm biased towards Ceph.
I defer to your experience and agree that it really depends on implementation details (and design). I've only worked on a couple of Ceph clusters built by someone else who left, around 1-2PB, 100-150 OSDs, <25 hosts, and not all the same disks in them. They started falling over because some OSDs filled up, and I had to quickly learn about upmap and rebalancing. I don't remember how full they were, but numbers around 75-85% were involved, so I get nervous around 75% from my experiences. We'd suddenly commit 20TB of backup data and that's a 2% swing. It was a regular pain in the neck and stress point, and that creaking, amateurishly managed, under-invested Ceph cluster's problems caused several outages and some data corruption. Just having some more free space slack in it would have spared us.[1]
That whole situation is probably easier the bigger the cluster gets; any system with three "units" that has to tolerate one failing can only have 66% usable. With a hundred "units" then 99% are usable. Too much free space is only wasting money, too full is a service down disaster, for that reason I would prefer to err towards the side of too much free rather than too little.
Other than Ceph I've only worked on systems where one disk failure needs one hotspare disk to rebuild, anything else is handled by a separate backup and DR plan. With Ceph, depending on the design it might need free space to handle a host or rack failure, and that's pretty new to me and also leads me to prefer more free space rather than less. With a hundred "units" of storage grouped into 5 failure domains then only 80% is usable, again probably better with scale and experienced design.
If I had 10,000 nodes I'd rather have 10,100 nodes and better sleep than play "how close to full can I get this thing", constantly on edge waiting for a problem which takes down a 10,000-node cluster and all the things that needed such a big cluster. I'm probably taking some advice from Reddit threads talking about 3-node Ceph/Proxmox setups which say 66% and YouTube videos talking about Ceph at CERN - in those I think their use case is a bursty massive dump of particle accelerator data to ingest, followed by a quieter period of read-heavy analysis and reporting, so they need to keep enough free space for large swings. My company's use case was more backup data churn, lower peaks, less tidal, quite predictable, and we did run much fuller than 66%. We're now down below 50% used as we migrate away, and they're much more stable.
[1] it didn't help that we had nobody familiar with Ceph once the builder had left, and these had been running a long time and partially upgraded through different versions, and had one-of-everything; some S3 storage, some CephFS, some RBDs with XFS to use block cloning, some N+1 pools, some Erasure Coding pools, some physical hardware and some virtual machines, some Docker containerised services but not all, multiple frontends hooked together by password based SSH, and no management will to invest or pay for support/consultants, some parts running over IPv6 and some over IPv4, none with DNS names, some front-ends with redundant multiple back end links, others with only one. A well-designed, well-planned, management-supported cluster with skilled admins can likely run with finer tolerances.
I think the article's title question is a bit misleading because it focuses on peak throughput for S3 as a whole. The interesting question is "How can the throughput for a GET exceed the throughput of an HDD?"
If you just replicated, you could still get big throughput for S3 as a whole by doing many reads that target different HDDs. But you'd still be limited to max HDD throughput * number of GETs. S3 is not so limited, and that's interesting and non-obvious!
Millions of hard drives cumulatively has enormous IO bandwidth.
That's like saying "how to get to the moon is obvious: traveling"
Thank you for setting me up for this...
It's not exactly rocket science.
Haha, good one!
I still feel like you're underselling the article however.
It is obviously ultimately parallelism, but parallelism is hard at scale - because things often don't scale - and incorrect parallelism can even make things slower. And it's not always obvious why something gets slower when parallelized.
As a dumb example, if you have a fictional HDD with one disk and one head, you have two straightforward options to optimize performance:
- Make sure only one file is read at a time (otherwise the disk will keep seeking back and forth).
- Make sure the file is persisted in one contiguous stretch, so you never enter the situation in which it would seek back and forth.
Ofc, that can be dumbed down to "parallelism", because this is inherently a question about how to parallelize... But that's also ignoring what's being elaborated on: the ways S3 uses to enable parallelism.
I dunno, the article's tl;dr is just parallelism.
Data gets split into redundant copies, and is rebalanced in response to hot spots.
Everything in this article is the obvious answer you'd expect.
You're right if you're only looking at peak sequential throughput. However, and this is the part that the author could have emphasized more, the impressive part is their strategy for dealing with disk access latency to improve random read throughput.
They shard the data much as you might expect of a RAID 5, 6, etc. array, and the distributed parity solves the problem of failure tolerance as you would expect, and also improves bandwidth via parallelism as you describe.
The interesting part is their best strategy for sharding the data: plain old simple random. The decision of which disks and which sectors to shard the data onto is made at random, and this creates the best chance that at least one of the two copies of data can be accessed with much lower latency (~1ms instead of ~8ms).
The most crude, simple approach turns out to give them the best mileage. There's something vaguely poetic about it, an aesthetic beauty reminiscent of Euler's Identity or the solution to the Basel Problem; a very simple statement with powerful implications.
It's not really "redundant copies". It's erasure coding (ie, your data is the solution of an overdetermined system of equations).
That’s just fractional redundant copies.
And "fractional redundant copies" is way less obvious.
The fractional part isn't what speeds up serving. To the contrary, per logical byte a 5:9 scheme only puts 1.8 physical bytes (and the drive bandwidth attached to them) behind your data, whereas straight-up triple replication puts 3x behind it.
It mostly just saves AWS money, by achieving greater fault tolerance with less disk usage.
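To put numbers on the trade-off (using 5:9 purely as the article's example scheme, and assuming each shard or replica sits on its own drive):

    # Rough comparison under the simplifying assumption that each replica or
    # shard lives on its own drive. 5:9 is the article's example scheme, not
    # necessarily what S3 actually uses.
    def replication(copies):
        return {
            "physical_bytes_per_logical": copies,      # full copies of every byte
            "drives_striped_per_get": 1,               # one replica can serve the whole read
            "shard_losses_tolerated": copies - 1,
        }

    def erasure_code(k, n):
        return {
            "physical_bytes_per_logical": round(n / k, 2),   # e.g. 9/5 = 1.8
            "drives_striped_per_get": k,                     # any k of n shards reconstruct
            "shard_losses_tolerated": n - k,
        }

    print("3x replication:", replication(3))
    print("5:9 erasure coding:", erasure_code(5, 9))

So 5:9 tolerates four lost shards instead of two, at 1.8x instead of 3x the bytes; whether the k-wide striped read counts as a speed-up or a cost depends on whether you care about single-GET throughput or aggregate bandwidth per stored byte.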
> full-platter seek time: ~8ms; half-platter seek time (avg): ~4ms
Average distance between two points (first is current location, second is target location) when both are uniformly distributed in [ 0 .. +1 ] interval is not 0.5, it’s 1/3. If the full platter seek time is 8ms, average seek time should be 2.666ms.
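A quick sanity check of the 1/3 figure (a minimal sketch, assuming seek distance is uniform on [0, 1] and, as above, that seek time scales linearly with distance):

    import random

    # E|X - Y| for X, Y independent and uniform on [0, 1] is exactly 1/3
    # (the double integral of |x - y| over the unit square).
    n = 1_000_000
    mean_dist = sum(abs(random.random() - random.random()) for _ in range(n)) / n
    print(mean_dist)          # ~0.333
    print(8.0 * mean_dist)    # ~2.67 ms, if seek time really scaled linearly with distance

As the replies below point out, real drives don't scale linearly with distance (acceleration, settling, zoned recording), so this is a property of the uniform model rather than of any actual drive.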
Full seek on a modern drive is a lot closer to 25ms than 8ms. It's pretty easy to test yourself if you have a hard drive in your machine and root access - fire up fio with --readonly and feed it a handmade trace that alternates reading blocks at the beginning and end of disk. (--readonly does a hard disable of any code that could write to the drive)
Here's a good paper that explains why the 1/3 number isn't quite right on any drives manufactured in the last quarter century or so - https://www.msstconference.org/MSST-history/2024/Papers/msst...
I'd be happy to answer any other questions about disk drive mechanics and performance.
> Full seek on a modern drive is a lot closer to 25ms than 8ms
Is "full seek" a synonymous for worst case time to reach a position occurring less than 1% of working time?
From the article: max seek time is 15.2 ms + additionally 0 to 8.3 ms of rotational latency.
Reordering of sector accesses by NCQ should reduce worst case scenario occurrences.
I didn’t check the article.
When I was reviewing it for publication I ran a couple of tests and found more like 18 on the devices I tested, but I’m sure there are some that do 15. 25 is probably on the slow end. (although I’ve never tested a HAMR drive - their head assemblies are probably heavier and more delicate)
Old SCSI 10K drives could handle a huge queue and reach 500 random read IOPS, sounding like a buzzsaw while they did it. Modern capacity drives treat their internals much more gently, and don’t get as much queuing gain. Note also that for larger objects the chunk size is probably 1+ rotations to amortize the seek overhead.
Oh, and by “full seek” I mean the time it takes to read the max block number (inner diameter) right after reading the min block number (OD) subtracting rotational delay.
You can do this test yourself with fio --readonly and root access to a hard drive block device, even if it’s mounted. (good luck reading any files while the test is running, but no damage done) Pick a variety of very high and low blocks, and the min delay will be when rotational delay is close to zero.
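For a rough sense of the same measurement without building an fio trace, here's a sketch in Python (the device path is a placeholder; it needs root, opens the device read-only, and drops the page cache before each read; fio with O_DIRECT is still the more trustworthy tool):

    import os, time

    DEV = "/dev/sdX"     # placeholder -- point this at a real HDD block device
    BLOCK = 4096

    fd = os.open(DEV, os.O_RDONLY)
    size = os.lseek(fd, 0, os.SEEK_END)        # device size in bytes
    offsets = [BLOCK, size - BLOCK] * 50       # alternate near-OD and near-ID reads

    samples = []
    for off in offsets:
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)   # drop cached pages first
        t0 = time.perf_counter()
        os.pread(fd, BLOCK, off - off % BLOCK)               # aligned 4K read
        samples.append((time.perf_counter() - t0) * 1000)
    os.close(fd)

    samples.sort()
    # The minimum is close to pure full-stroke seek (rotational delay near zero);
    # the median includes roughly half a rotation on top.
    print(f"min {samples[0]:.1f} ms  median {samples[len(samples)//2]:.1f} ms  max {samples[-1]:.1f} ms")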
There's acceleration of the read head to move it between the tracks. So it may well be 4ms because shorter distances are penalized by a lower peak speed of the read head as well as constant factors (settling at the end of motion)
IT’S NOT 4 MS.
That’s a bogus number from an ancient slide deck for a class in 2001 or so, that’s misled generations of folks googling for the answer.
Note also that the outer tracks are longer (google ZCAV) and hold more data, so seeks across uniformly distributed block numbers do not generate uniformly distributed track numbers.
The platter is a circle so using the uniform distribution [0, 1] is incorrect, you should use the unit circular distribution of [0, 2pi] and also since the platter also spins in a single direction the distance is only computed going one way around (if target is right before current, it's one full spin).
But you can simplify this problem down and ask: with no loss of generality, if your starting point is always 0 degrees, how many degrees clockwise is a random point on average, if the target is uniformly distributed?
Since 0-180 has the same arc length as 180-360 then the average distance is 180 degrees. So average half-platter seek is half of the full-platter seek.
What you wrote only applies to rotational latency, not seek latency. The seek latency is the time it takes for the head to reach the target. Heads only rotate within the small range like [ 0 .. 25° ], they are designed for rapid movements in either direction.
Note that you can kind of infer that S3 is still using hard drives for their basic service by looking at pricing and calculating the IOPS rate that doubles the cost per GB per month.
S3 GET and PUT requests are sufficiently expensive that AWS can afford to let disk space sit idle to satisfy high-performance tenants, but not a lot more expensive than that.
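A back-of-envelope version of that inference, using rough published us-east-1 prices for S3 Standard (around $0.023/GB-month for storage and $0.0004 per 1,000 GETs at the time of writing; treat the numbers as illustrative):

    # How many sustained GETs per second per GB does it take before request
    # charges match the storage charge? Rough S3 Standard us-east-1 list
    # prices; check current pricing before relying on this.
    storage_per_gb_month = 0.023          # $/GB-month
    cost_per_get = 0.0004 / 1000          # $/GET
    seconds_per_month = 30 * 24 * 3600

    gets_per_gb_month = storage_per_gb_month / cost_per_get          # ~57,500
    iops_per_tb = gets_per_gb_month / seconds_per_month * 1000       # ~22 GET/s per TB

    hdd_iops_per_tb = 120 / 20            # a ~20TB drive doing ~120 random reads/s
    print(f"request cost matches storage cost at ~{iops_per_tb:.0f} IOPS/TB")
    print(f"a raw HDD offers roughly {hdd_iops_per_tb:.0f} IOPS/TB")

So a tenant who wants much more than HDD-level IOPS/TB is already paying roughly double per byte in request charges, which is what pays for the stranded capacity.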
So is any of S3 powered by SSD's?
I honestly figured that it must be powered by SSD for the standard tier and the slower tiers were the ones using HDD or slower systems.
> So is any of S3 powered by SSD's?
S3’s KeyMap Index uses SSDs. I also wouldn’t be surprised if at this point SSDs are somewhere along the read path for caching hot objects or in the new one zone product.
The storage itself is probably (mostly) on HDDs, but I'd imagine metadata, indices, etc are stored on much faster flash storage. At least, that's the common advice for small-ish Ceph cluster MDS servers. Obviously S3 is a few orders of magnitude bigger than that...
Repeating a comment I made above - for standard tier, requests are expensive enough that it's cost-effective to let space on the disks go unused if someone wants an IOPS/TB ratio that's higher than what disk drives can provide. But not much more expensive than that.
The latest generation of drives store about 30TB - I don't know how much AWS pays for them, but a wild-ass guess would be $300-$500. That's a lot cheaper than 30TB of SSD.
Also important - you can put those disks in high-density systems (e.g. 100 drives in 4U) that only add maybe 25% to the total cost, at least if you're AWS, a bit more for the rest of us. The per-slot cost of boxes that hold lots of SSDs seems to be a lot higher.
It's assumed that the new S3 Express One Zone is backed by SSDs but I believe Amazon doesn't say so explicitly.
I've always felt it's probably a wrapper around the Amazon EFS due to the similar pricing and that S3 One Zone has "Directory" buckets, a very file system-y idea.
That seems to indicate the storage underneath might be similar in cost and performance, and might in fact really be similar, even if the software on top is not the same.
nope
I always assumed the really slow tiers were tape.
My own assumption was always that the cold tiers are managed by a tape robot, but managing offlined HDDs rather than actual tapes.
Yeah, I don't know about S3, but years back I talked a fair bit with someone that did storage stuff for HPC, and one thing he talked about is building huge JBOD arrays where only a handful of disks per rack would be spun up, basically pushing what could be done with scsi extenders or such. It wouldn't surprise me if they're doing something like that with batch scheduling the drive activations over a minutes to hours window.
There was an article or interview with one of the lead AWS engineers, and he said they use CDs or DVDs for cold glacier.
I think that's close to the truth. IIRC it's something like a massive cluster of machines that are effectively powered off 99% of the time with a careful sharding scheme where they're turned on and off in batches over a long period of time for periodic backup or restore of blobs.
it's amazing that Glacier is such a huge system with so many people working on it and it's still a public mystery how it works. I've not seen a single confirmation of how it works..
Glacier could be doing similar to what Azure does: https://www.microsoft.com/en-us/research/project/project-sil...
Also see this thread: https://news.ycombinator.com/item?id=13011396
I doubt it’s using WORM drives.
Not even the higher tiers of Glacier were tape afaict (at least when it was first created), just the observation that hard drives are much bigger than you can reasonably access in useful time.
In the early days when there were articles speculating on what Glacier was backed by, it was actually on crusty old S3 gear (and at the very beginning, it was just on S3 itself as a wrapper and a hand wavy price discount, eating the costs to get people to buy in to the idea!). Later on (2018 or so) they began moving to a home grown tape-based solution (at least for some tiers).
I'm not aware of AWS ever confirming tape for glacier. My own speculation is they likely use hdd for glacier - especially so for the smaller regions - and eat the cost.
Someone recently came across some planning documents filed in London for a small "datacenter" which wasn't attached to their usual London compute DCs and built to house tape libraries (this was explicitly called out as there was concern about power - tape libraries don't use much). So I would be fairly confident they wait until the glacier volumes grow enough on hdd before building out tape infra.
Do you have any sources for that? I'm really curious about Glacier's infrastructure and AWS has been notoriously tight-lipped about it. I haven't found anything better than informed speculation.
My speculation: writes are to /dev/null, and the fact that reads are expensive and that you need to inventory your data before reading means Amazon is recreating your data from network transfer logs.
Maybe they ask the NSA for a copy.
Source is SWIM who worked there (doubt any of that stuff has been published)
That's surprising given how badly restoration worked (much more like tape than drives).
I'd be curious whether simulating a shitty restoration experience was part of the emulation when they first ran Glacier on plain S3 to test the market.
The “drain time” for a 30TB drive is probably between 36 and 48 hours. I don’t have one in my lab to test, or the patience to do so if I did.
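The arithmetic behind that estimate, assuming sustained sequential rates of roughly 180-230 MB/s (an assumption; real rates vary by model and by position on the platter):

    capacity_bytes = 30e12
    for mb_per_s in (180, 230):   # assumed sustained sequential rates
        hours = capacity_bytes / (mb_per_s * 1e6) / 3600
        print(f"{mb_per_s} MB/s -> {hours:.0f} hours to read or rewrite the whole drive")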
Yep, did 20TB drives in my Unraid box and it took about 2 days and some change to set up a clean sync between them all :)
There might be surprisingly little value in going to tape due to all the specialization required. As the other comment suggests, many of the lower tiers likely represent IO bandwidth classes: a 16 TB disk with 100 IOPS can only offer 1 IOPS over 1.6 TB each to 100 customers, or 0.1 IOPS over 160 GB each to 1,000, etc. Just scale that thinking up to a building full of disks; it still applies.
I realize you're making a general point about space/IO ratios and the below is orthogonal, no contradiction.
In practice, the user-facing IO capacity per disk that you can "sell" in a large distributed storage system is a lot lower. There's constant maintenance churn to keep data available:
- local hardware failure
- planned larger-scale maintenance
- transient, unplanned larger-scale failures (etc.)
In general, you can fall back to reconstruction from the erasure codes to keep serving during degradation. But that's a) enormously expensive in IO and CPU, and b) carries higher availability and/or durability risk because you've lost redundancy.
Additionally, it may make sense to rebalance where data lives for optimal read throughput (and other performance reasons).
So in practice, there's constant rebalancing going on in a sophisticated distributed storage system that takes a good chunk of your HDD IOPS.
This + garbage collection also makes tape really unattractive for all but very static archives.
See comments above about AWS per-request cost - if your customers want higher performance, they'll pay enough to let AWS waste some of that space and earn a profit on it.
Std has the same performance as every other storage class. There are 2 async classes which you can't read from without retrieving first, but that's not a 'performance' difference as such - GETs aren't slow, they fail.
I expect they are storing metadata on SSDs. They might have SSD caches for really hot objects that get read a lot.
It is interesting that even after falling HDD prices, S3 costs have remained the same for at least 8 years. There's just not enough competition to push them to reduce prices. But imagine the money it brings in for AWS because of this.
Same with every other aspect of their offerings. Look at EC2, even with instances like m7a.medium: 1 vCPU (not core) and 4GB memory for ~$50 USD/month on demand or ~$35/month with a 1-year reservation. It isn't even close to being competitive outside other big cloud providers.
EDIT: clarity on monthly pricing.
There is inflation, so it has effectively dropped in price. But your point is taken: inflation’s effect on prices is most assuredly slower than the progress of technology’s effect.
Reducing costs is the wrong incentive. If you look at a modern vendor such as Splunk or CrowdStrike, they have huge estates in AWS. There are huge swaths of repeating data, both within and across tenants. Rather than pointing this out, it is simpler and more effective to charge the customer for this data/usage, and use simple techniques so that it isn't duplicative. Reducing costs would only incentivize and increase this asinine usage.
How much have hdd prices really fallen? AFAIK the incredible improvements in price per byte in HDD had slowed so much that they'll be eclipsed by SSDs in a few years.
Flash went from within 2x the price of DRAM in 2012 or so to maybe 40-50x cheaper today, driven somewhat by shrinking feature sizes, but mostly by the shift from SLC (1 bit/cell) to TLC (3 bits) and QLC (4 bits) and from planar to 300+ layer 3D flash.
Flash is near the end of the “S-curve” of those technologies being rolled out.
During that time HDD technology was pretty stagnant, with a mere 2x increase due to higher platter count with the use of helium.
New HDD technologies (HAMR) are just starting their rollout, promising major improvements in $/GB over the next few years as they roll out.
You can’t just look at a price curve on a graph and predict where it’s going to go. The actual technologies responsible for that curve matter.
> mostly by the shift from SLC (1 bit/cell) to TLC (3 bits) and QLC (4 bits) and from planar to 300+ layer 3D flash
That "and" is doing a lot of work.
In 2012 most flash was MLC.
In 2025 most flash is TLC.
> During that time HDD technology was pretty stagnant, with a mere 2x increase due to higher platter count with the use of helium.
They've advanced slower than SSDs but it wasn't that slow. Between 2012 and 2025, excluding HAMR, sizes have improved from 4TB to 24TB and prices at the low end have improved from $50/TB to $12/TB.
This is one of those times a downvote confuses me. I corrected some numbers. Was I accidentally rude? If I made a mistake on the numbers please give the right numbers.
If my first line was unclear: We might say the denser bits give us a 65% density improvement. And quick math shows that an 80-100x improvement is actually nine 65% improvements in a row. So the denser bits per cell aren't doing much, it's pretty much all process improvement.
It’s mostly 3D, not process.
3D flash is over 300 layers now. The size of a single 300-bit stack on the surface of the chip is bigger than an old planar cell, but that 300x does a lot more than make up for it.
3D NAND isn’t a “process improvement” - it’s a fundamental new architecture. It’s radically cheaper because it’s a set of really cheap steps to make all 300+ layers, not using any of the really expensive lithography systems in the fab, then a single (really complicated) set of steps to drill holes through the layers for the bit stacks and coat the insides of the holes. Chip cost basically = the depreciation of the fab investment during the time a chip spends in the fab, so 3D NAND is a huge win. (just stacking layers by running the chip through the process N times wouldn’t save any money, and would probably just decrease yields)
A total guess - 2x more expensive for extra steps, bit stacks take 4x more area than planar cells, 300 layer would have 300/8 = 37.5x cheaper bits. (That 4x is pulling a lot of weight - for all I know it might be more like 8x, but the point stands)
I was counting all the 3D manufacturing innovations as "process improvement". I'm not sure why you don't.
Anyway the point stands that bits per cell is barely doing anything compared to making the cells cheaper.
Because they made something different with the same process, instead of making the same thing with a different process. Feature size didn’t get any smaller. (or, rather, you get the order of magnitude improvement without it, and those gains were vastly more than the feature size improvements over that time period)
Also because “process improvement” usually refers to things where you get incremental improvements basically for free as each new generation of fab rolls out. Unless you can invent a 4D flash, this is a single (huge) improvement that’s mostly played out.
> with the same process
Same process node.
Node is part of process, but all the layering and etching techniques they figured out to make 3D cells are also process. At least that's how I see it.
Oh well, I don't want to argue definitions, I just want to clarify what I meant.
Oh, and no one has a solution to make HDDs faster. If anything, they may have gotten slower as they get optimized for capacity instead of speed.
(Well, peak data transfer rate keeps going up as bits get packed tighter, but capacity goes up linearly with areal bit density, while the speed the bits go under the head goes up with the square root.)
(Well, sort of. For a while a lot of the progress came from making the bits skinnier but not much shorter, so transfer rates didn’t go up that much)
Magnetic hard drives are 100X cheaper per GB than when S3 launched, and are about 3X cheaper than in 2016 when the price last dropped. Magnetic prices have actually ticked up recently due to supply chain issues, but HAMR is expected to cause a significant drop (50-75%/GB) in magnetic storage prices as it rolls out in next few years. SSDs are ~$120/T and magnetic drives are ~$18/T. This hasn't changed much in the last 2 years.
I think a more interesting article on S3 is "Building and operating a pretty big storage system called S3"
https://www.allthingsdistributed.com/2023/07/building-and-op...
Discussed at the time:
Building and operating a pretty big storage system called S3 - https://news.ycombinator.com/item?id=36894932 - July 2023 (160 comments)
Really nice read, thank you for that.
Author of the 2minutestreaming blog here. Good point! I'll add this as a reference at the end. I loved that piece. My goal was to be more concise and focus on the HDD aspect
Please fix the seek time numbers - they’re wildly inaccurate; full-platter seek is more like 20-25ms. I tried chasing down where the 8ms number came from a while back, and I think it applies to old sub-100GB 10K RPM high-speed drives from 25 or so years ago, which were purposely low density so they could swing the head faster and less precisely.
Check out the Olmez et al paper from MSST 2024 - I linked it above, but here it is again: https://www.msstconference.org/MSST-history/2024/Papers/msst...
[flagged]
Woah buddy, I worked with Andy for years and this is not my experience. Moving a large product like S3 around is really, really difficult, and I've always thought highly of Andy's ability to: (a) predict where he thought the product should go, (b) come up with novel ways of getting there, and (c) trimming down the product to get something in the hands of customers.
Also, did you create this account for the express purpose of bashing Andy? That's not cool.
[flagged]
Well, the claim in question is not exactly a rebuttal of the original commenter's point despite the negative tone so I'd cut him some slack
Can you share some anecdotes?
How does https://github.com/rustfs/rustfs perform?
> tens of millions of disks
If we assume enterprise HDDs in the double-digit TB range, then one can estimate that the total S3 storage volume of AWS is in the triple-digit exabyte range. That's probably the biggest storage system on planet Earth.
A certain data center in Utah might top that, assuming that they have upgraded their hardware since 2013.
https://www.forbes.com/sites/kashmirhill/2013/07/24/blueprin...
This piece is interesting background, but worth noting that the actual numbers are highly speculative. The NSA has never disclosed hard data on capacity, and most of what's out there is inference from blueprints, water/power usage, or second-hand claims. No verifiable figures exist.
Production scale enterprise HDDs are in the 30TB range, 50TB on the horizon...
Yes, WD 28-30TB range; we have a sku with 1000+ drives per rack and it weighs more than a ton.
Google for Seagate Mozaic
Is there an open source service designed with HDDs in mind that achieves similar performance? I know none of the big ones work that well with HDDs: MinIO, Swift, Ceph+RadosGW, SeaweedFS; they all suggest flash-only deployments.
Recently I've been looking into Garage and liking the idea of it, but it seems to have a very different design (no EC).
I would say that Ceph+RadosGW works well with HDDs, as long as 1) you use SSDs for the index pool, and 2) you are realistic about the number of IOPS you can get out of your pool of HDDs.
And remember that there's a multiplication of iops for any individual client iop, whether you're using triplicate storage or erasure coding. S3 also has iop multiplication, which they solve with tons of HDDs.
For big object storage that's mostly streaming 4MB chunks, this is no big deal. If you have tons of small random reads and writes across many keys or a single big key, that's when you need to make sure your backing store can keep up.
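A sketch of that multiplication with made-up but plausible parameters (stripe size, k, and index cost are assumptions, not anyone's real configuration):

    import math

    # Disk reads generated by one client GET under k:m erasure coding with
    # fixed-size stripes. All parameters are illustrative assumptions.
    def disk_reads_per_get(object_mb, stripe_mb=4, k=5, index_lookups=1):
        stripes = math.ceil(object_mb / stripe_mb)
        return index_lookups + stripes * k       # k shard reads reconstruct each stripe

    for size_mb in (0.1, 4, 64):
        print(f"{size_mb} MB object -> ~{disk_reads_per_get(size_mb)} backend reads")

And that's before repair, scrub, or rebalance traffic, which is exactly why lots of tiny objects hurt far more than their byte count suggests.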
Lustre and ZFS can do similar speeds.
However, if you need high IOPS, you need flash on MDS for Lustre and some Log SSDs (esp. dedicated write and read ones) for ZFS.
Thanks, but I forgot to specify that I'm interested in S3-compatible servers only.
Basically, I have a single big server with 80 high-capacity HDDs and 4 high-endurance NVMes, and it's the S3 endpoint that gets a lot of writes.
So yes, for now my best candidate is ZFS + Garage, this way I can get away with using replica=1 and rely on ZFS RAIDz for data safety, and the NVMEs can get sliced and diced to act as the fast metadata store for Garage, the "special" device/small records store for the ZFS, the ZIL/SLOG device and so on.
Currently it's a bit of a Frankenstein's monster: using XFS+OpenCAS as the backing storage for an old version of MinIO (containerized to run as 5 instances), I'm looking to replace it with a simpler design and hopefully get a better performance.
It is probably worth noting that most of the listed storage systems (including S3) are designed to scale not only in hard drives, but horizontally across many servers in a distributed system. They really are not optimized for a single storage node use case. There are also other things to consider that can limit performance, like what does the storage back plane look like for those 80 HDDs, and how much throughput can you effectively push through that. Then there is the network connectivity that will also be a limiting factor.
It's a very beefy server with 4 NVMe and 20 HDD bays + a 60-drive external enclosure, 2 enterprise grade HBA cards set to multipath round-robin mode, even with 80 drives it's nowhere near the data path saturation point.
The link is a 10G 9K MTU connection, the server is only accessed via that local link.
Essentially, the drives being HDD are the only real bottleneck (besides the obvious single-node scenario).
At the moment, all writes are buffered into the NVMes via OpenCAS write-through cache, so the writes are very snappy and are pretty much ingested at the rate I can throw data at it. But the read/delete operations require at least a metadata read, and due to the very high number of small (most even empty) objects they take a lot more time than I would like.
I'm willing to sacrifice the write-through cache benefits (the write performance is actually an overkill for my use case), in order to make it a little more balanced for better List/Read/DeleteObject operations performance.
On paper, most "real" writes will be sequential data, so writing that directly to the HDDs should be fine, while metadata write operations will be handled exclusively by the flash storage, thus also taking care of the empty/small objects problem.
> Essentially, the drives being HDD are the only real bottleneck
On the low end a single HDD can deliver 100MB/s, so 80 can deliver 8,000MB/s; a single NVMe can do 700MB/s and you have 4, so 2,800MB/s. A 10Gb link can only do ~1,000MB/s, so isn't your bottleneck the network, and then probably the CPU?
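Putting those numbers side by side (using the per-device figures above as given):

    hdd_mb_s, nvme_mb_s = 100, 700        # the low-end per-device figures above
    hdds, nvmes = 80, 4
    link_mb_s = 10_000 / 8                # 10 Gb/s ~= 1,250 MB/s before protocol overhead

    print("HDD aggregate: ", hdds * hdd_mb_s, "MB/s")    # 8,000 MB/s
    print("NVMe aggregate:", nvmes * nvme_mb_s, "MB/s")  # 2,800 MB/s
    print("network link:  ", int(link_mb_s), "MB/s")

For sequential throughput the 10G link is indeed the ceiling; the HDDs mostly become the limit for seek-heavy work (small objects, metadata, deletes) rather than raw MB/s, which is the workload described above.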
If your server is old, the RAID card's PCIe interface will be another bottleneck, alongside the latencies added if the card is not that powerful to begin with.
Same applies to your NVMe throughput since now you have the risk to congest the PCIe lanes if you're increasing line count with PCIe switches.
If there are gateway services or other software bound processes like zRAID, your processor will saturate way before your NIC, adding more jitter and inconsistency to your performance.
NIC is an independent republic on the motherboard. They can accelerate almost anything related to stack, esp. server grade cards. If you can pump the data to the NIC, you can be sure that it can be pushed at line speed.
However, running a NIC at line speed with data read from elsewhere on the system is not always that easy.
Hope you don't have expectations (over the long run) for high availability. At some point that server will come down (planned or unplanned).
For sure, there is zero expectations for any kind of hardware downtime tolerance, it's a secondary backup storage cobbled together from leftovers over many years :)
For software, at least with MinIO it's possible to do rolling updates/restarts since the 5 instances in docker-compose are enough for proper write quorum even with any single instance down.
I'm working on something that might be suited for this use-case at https://github.com/uroni/hs5 (not ready for production yet).
It would still need a resilience/cache layer like ZFS, though.
Ceph's S3 protocol implementation is really good.
Getting Ceph erasure coding set up properly on a big hard disk pool is a pain - you can tell that EC was shoehorned into a system that was totally designed around triple replication.
Could you elaborate on what you mean by the last sentence?
Originally Ceph divided big objects into 4MB chunks, sending each chunk to an OSD server which replicated it to 2 more servers. 4MB was chosen because it was several drive rotations, so the seek + rotational delay didn’t affect the throughput very much.
Now the first OSD splits it into k data chunks plus d parity chunks, so the disk write size isn’t 4MB, it’s 4MB/k, while the efficient write size has gone up 2x? 4x? since the original 4MB decision as drive transfer rates increase.
You can change this, but still the tuning is based on the size of the block to be coded, not the size of the chunks to be written to disk. (and you might have multiple pools with much different values of k)
I'm still not sure which exact Ceph concept you are referring to. There is the "minimum allocation size" [1], but that is currently 4 KB (not MB).
There is also striping [2], which is the equivalent of RAID-10 functionality to split a large file into independent segments that can be written in parallel. Perhaps you are referring to RGW's default stripe size of 4 MB [3]?
If yes, I can understand your point about one 4 MB RADOS object being erasure-coded into e.g. 6 = 4+2 chunks, making them < 1 MB writes that are not efficient on HDDs.
But would you not simply raise `rgw_obj_stripe_size` to address that, according to the k you choose? E.g. 24 MB? You mention it can be changed, but I don't understand the "but still the tuning is based on the size of the block to be coded" part, (why) is that a problem?
Also, how else would you do it when designing EC writes?
Thanks!
[1]: https://docs.ceph.com/en/squid/rados/configuration/bluestore...
[2]: https://docs.ceph.com/en/squid/architecture/#data-striping
[3]: https://docs.ceph.com/en/squid/radosgw/config-ref/#confval-r...
If you can afford it, mirroring in some form is going to give you way better read perf than RAIDz. Using zfs mirrors is probably easiest but least flexible, zfs copies=2 with all devices as top level vdevs in a single zpool is not very unsafe, and something custom would be a lot of work but could get safety and flexibility if done right.
You're basically seek limited, and a read on a mirror is one seek, whereas a read on a RAIDz is one seek per device in the stripe. (Although if most of your objects are under the chunk size, you end up with more of mirroring than striping)
You lose on capacity though.
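Rough numbers for the 80-drive case above, assuming ~100 random reads/s per HDD and that a RAIDZ vdev serves small random reads at roughly the rate of a single drive (a common rule of thumb, not a precise model):

    drives, iops_per_drive = 80, 100      # assumed random-read IOPS per HDD

    # 8 x RAIDZ2(10): each vdev behaves like roughly one drive for small random reads
    raidz_read_iops = 8 * iops_per_drive
    raidz_capacity = 8 * (10 - 2)         # data drives' worth of space

    # 40 x 2-way mirrors: every drive can serve reads independently
    mirror_read_iops = drives * iops_per_drive
    mirror_capacity = drives // 2

    print(f"8x raidz2(10): ~{raidz_read_iops} read IOPS, {raidz_capacity} drives of capacity")
    print(f"40x mirrors:   ~{mirror_read_iops} read IOPS, {mirror_capacity} drives of capacity")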
Yeah, unfortunately mirrors are a no-go due to efficiency requirements, but luckily read performance is not that important if I manage to completely offload FS/S3 metadata and small files to flash storage (separate zpool for Garage metadata, separate special VDEV for metadata/small files).
I think I'm going to go with 8x RAIDz2 VDEVs 10x HDDs each, so that the 20 drives in the internal drive enclosure could be 2 separate VDEVs and not mix with the 60 in the external enclosure.
It's great to see other people's working solutions, thanks. Can I ask if you have backups for something like this? In many systems it's possible to store some data on ingress or after processing, which serves as something that's rebuildable, even if it's not a true backup. I'm not familiar with whether your software layer has off-site backup as part of the system, for example, which would be a great feature.
It might not be the most ideal solution, but did you consider installing TrueNAS on that thing?
TrueNAS can handle the OpenZFS (zRAID, Caches and Logs) part and you can deploy Garage or any other S3 gateway on top of it.
It can be an interesting experiment, and 80 disk server is not too big for a TrueNAS installation.
Do you know if some of these systems have components to periodically checksum the data at rest?
ZFS/OpenZFS can do scrub and do block-level recovery. I'm not sure about Lustre, but since Petabyte sized storage is its natural habitat, there should be at least one way to handle that.
Any of them will work just as well, but only with many datacenters worth of drives, which very few deployments can target.
It's the classic horizontal/vertical scaling trade off, that's why flash tends to be more space/cost efficient for speedy access.
SeaweedFS has evolved a lot the last few years, with RDMA support and EC.
At a past job we had an object store that used SwiftStack. We just used SSDs for the metadata storage but all the objects were stored on regular HDDs. It worked well enough.
Apache Ozone has multiple 100+ petabyte clusters in production. The capacity is on HDDs and metadata is on SSDs. Updated docs (staging for new docs): https://kerneltime.github.io/ozone-site/
Doing some light googling, aside from Ceph being listed there's one called Gluster as well. It hypes itself as "using common off-the-shelf hardware you can create large, distributed storage solutions for media streaming, data analysis, and other data- and bandwidth-intensive tasks."
It's open source / free to boot. I have no direct experience with it myself however.
https://www.gluster.org/
Gluster has been slowly declining for a while. It used to be sponsored by Red Hat, but that stopped a few years ago. Since then, development has slowed significantly.
I used to keep a large cluster array with Gluster+ZFS (1.5PB), and I can’t say I was ever really that impressed with the performance. That said — I really didn’t have enough horizontal scaling to make it worthwhile from a performance aspect. For us, it was mainly used to make a union file system.
But, I can’t say I’d recommend it for anything new.
A decade ago where I worked we used gluster for ~200TB of HDD for a shared file system on a SLURM compute cluster, as a much better clustered version of NFS. And we used ceph for its S3 interface (RadosGW) for tens of petabytes of backing storage after the high-IO stages of compute were finished. The ceph was all HDD, though later we added some SSDs for a caching pool.
For single-client performance, ceph beat the performance I get from S3 today for large file copies. Gluster had difficult-to-characterize performance, but our setup with big fast RAID arrays seems to still outperform what I see of AWS's Lustre as a service today for our use case of long sequential reads and writes.
We would occasionally try cephFS, the POSIX shared network filesystem, but it couldn't match our gluster performance for our workload. But also, we built the ceph long term storage to maximize TB/$, so it was at a disadvantage compared to our gluster install. Still, I never heard of cephFS being used anywhere despite it being the original goal in the papers back at UCSC. Keep an eye on CERN for news about one of the bigger ceph installs with public info.
I love both of the systems, and see ceph used everywhere today, but am surprised and happy to see that gluster is still around.
I’ve used GlusterFS before because I had tens of old PCs, and it worked very well for me. It was basically a PoC to see how it works rather than production, though.
>Recently I've been looking into Garage and liking the idea of it, but it seems to have a very different design (no EC).
What do you mean by no EC?
In their design document at https://garagehq.deuxfleurs.fr/documentation/design/goals/ they state: "erasure coding or any other coding technique both increase the difficulty of placing data and synchronizing; we limit ourselves to duplication"
Nice! Learned something new today. It seems to be a form of error correction: one can store parts of the data with some extra parity, and if some parts of the data are lost, the original can be reconstructed with some computational work.
Seems like some kind of compression?
Is that how the error correction on DVDs works?
And is that how GridFS can keep file storage overhead so low compared to a regular file system?
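It's closer to the opposite of compression: you store extra bytes so you can afford to lose some. The simplest case is a single XOR parity block (k data blocks plus 1 parity survives any one loss); the systems discussed here use Reed-Solomon codes that tolerate several lost shards. A minimal sketch:

    def xor_blocks(blocks):
        out = bytearray(len(blocks[0]))
        for block in blocks:
            for i, byte in enumerate(block):
                out[i] ^= byte
        return bytes(out)

    data = [b"16MB", b"obj_", b"splt", b"into"]   # 4 equal-sized data shards
    parity = xor_blocks(data)                     # 1 parity shard -> 25% overhead

    lost = 2                                      # lose any single shard...
    survivors = [b for i, b in enumerate(data) if i != lost]
    recovered = xor_blocks(survivors + [parity])  # ...and rebuild it from the rest
    assert recovered == data[lost]
    print("recovered:", recovered)

Reed-Solomon generalizes this to m parity shards so that any k of k+m reconstruct, which is where fractional overheads like 1.8x come from.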
We've been running a production ceph cluster for 11 years now, with only one full scheduled downtime for a major upgrade in all those years, across three different hardware generations. I wouldn't call it easy, but I also wouldn't call it hard. I used to run it with SSDs for radosgw indexes as well as a fast pool for some VMs, and harddrives for bulk object storage. Since i was only running 5 nodes with 10 drives each, I was tired of occasional iop issues under heavy recovery so on the last upgrade I just migrated to 100% nvme drives. To mitigate the price I just bought used enterprise micron drives off ebay whenever I saw a good deal popup. Haven't had any performance issues since then no matter what we've tossed at it. I'd recommend it, though I don't have experience with the other options. On paper I think it's still the best option. Stay away from CephFS though, performance is truly atrocious and you'll footgun yourself for any use in production.
We're using CephFS for a couple years, with some PBs of data on it (HDDs).
What performance issues and footguns do you have in mind?
I also like that CephFS has a performance benefits that doesn't seem to exist anywhere else: Automatic transparent Linux buffer caching, so that writes are extremely fast and local until you fsync() or other clients want to read, and repeat-reads or read-after-write are served from local RAM.
Does anyone know what the technology stack of S3 is? Monolith or multiple services?
I assume it would have lots of queues, caches and long-running workers.
I was an SDE on the S3 Index team 10 years ago, but I doubt much of the core stack has changed.
S3 is composed primarily of layers of Java-based web services. The hot paths (object get / put / list) are all served by synchronous API servers - no queues or workers. It is the best example of how many transactions per second a pretty standard Java web service stack can handle that I’ve seen in my career.
For a get call, you first hit a fleet of front-end HTTP API servers behind a set of load balancers. Partitioning is based on the key name prefixes, although I hear they’ve done work to decouple that recently. Your request is then sent to the Indexing fleet to find the mapping of your key name to an internal storage id. This is returned to the front end layer, which then calls the storage layer with the id to get the actual bits. It is a very straightforward multi-layer distributed system design for serving synchronous API responses at massive scale.
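A toy sketch of that layered, synchronous flow (all names and interfaces here are invented for illustration; nothing is S3's actual API):

    # Toy version of the layered GET path described above (front end -> index -> storage).
    # Everything on the hot path is a synchronous call; no queues or workers.
    class IndexService:
        def __init__(self):
            self.key_to_storage_id = {}   # partitioned by key prefix in the real system

        def lookup(self, bucket, key):
            return self.key_to_storage_id[(bucket, key)]

    class StorageService:
        def __init__(self):
            self.blobs = {}               # stands in for the erasure-coded shard layer

        def read(self, storage_id):
            return self.blobs[storage_id]

    class FrontEnd:
        def __init__(self, index, storage):
            self.index, self.storage = index, storage

        def get_object(self, bucket, key):
            storage_id = self.index.lookup(bucket, key)   # hop 1: indexing fleet
            return self.storage.read(storage_id)          # hop 2: storage fleet

    index, storage = IndexService(), StorageService()
    index.key_to_storage_id[("photos", "cat.jpg")] = "blob-42"
    storage.blobs["blob-42"] = b"...object bytes..."
    print(FrontEnd(index, storage).get_object("photos", "cat.jpg"))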
The only novel bit is all the backend communication uses a home-grown stripped-down HTTP variant, called STUMPY if I recall. It was a dumb idea to not just use HTTP but the service is ancient and originally built back when principal engineers were allowed to YOLO their own frameworks and protocols so now they are stuck with it. They might have done the massive lift to replace STUMPY with HTTP since my time.
Rest assured STUMPY was replaced with another home grown protocol! Though I think a stream oriented protocol is a better match for large scale services like S3 storage than a synchronous protocol like HTTP.
> Partitioning is based on the key name prefixes, although I hear they’ve done work to decouple that recently.
They may still use key names for partitioning. But they now randomly hash the user key name prefix on the back end to handle hotspots generated by similar keys.
> The hot path (... list) are all served by synchronous API servers
Wait; how does that work, when a user is PUTting tons of objects concurrently into a bucket, and then LISTing the bucket during that? If the PUTs are all hitting different indexing-cluster nodes, then...?
(Or do you mean that there are queues/workers, but only outside the hot path; with hot-path requests emitting events that then get chewed through async to do things like cross-shard bucket metadata replication?)
LIST is dog slow, and everyone expects it to be. (my research group did a prototype of an ultra-high-speed S3-compatible system, and it really helps not needing to list things quickly)
It's not all java anymore. There's some rust now, too. ShardStore, at least (which the article mentions).
"It is the best example of how many transactions per second a pretty standard Java web service stack can handle that I’ve seen in my career."
can you give some numbers? or at least ballpark?
Tens of thousands of TPS per node.
Microservices for days.
I worked on lifecycle ~5 years ago and just the Standard -> Glacier transition path involved no fewer than 7 microservices.
Just determining which of the 400 trillion keys are eligible for a lifecycle action (comparing each object's metadata against the lifecycle policy on the bucket) is a massive big data job.
Always was a fun oncall when some bucket added a lifecycle rule that queued 1PB+ of data for transition or deletion on the same day. At the time our queuing had become good enough to handle these queues gracefully but our alarming hadn't figured out how to differentiate between the backlog for a single customer with a huge job and the whole system failing to process quickly enough. IIRC this was being fixed as I left.
I used to work on the backing service for S3's Index and the daily humps in our graphs from lifecycle running were immense!
I work on tiny systems now, but something I miss from "big" deployments is how smooth all of the metrics were! Any bump was a signal that really meant something.
Amazon biases towards Systems Oriented Architecture approach that is in the middle ground between monolith and microservices.
Biasing away from lots of small services in favour of larger ones that handle more of the work so that as much as possible you avoid the costs and latency of preparing, transmitting, receiving and processing requests.
I know S3 has changed since I was there nearly a decade ago, so this is outdated. Off the top of my head it used to be about a dozen main services at that time. A request to put an object would only touch a couple of services en route to disk, and similar on retrieval. There were a few services that handled fixity and data durability operations, the software on the storage servers themselves, and then stuff that maintained the mapping between object and storage.
Amusingly, I suspect that the "dozen main services" is still quite a few more than most smaller companies would consider on their stacks.
Probably. Conway's law comes into effect, naturally.
There's a pretty good talk on S3 under the hood from last year's re:Invent: https://www.youtube.com/watch?v=NXehLy7IiPM
"Pretty good" is hugely underselling this!
I was just looking for this video so I can send it to my coworkers as one of the best introductory videos into the basics of cloud computing concepts.
The only scholarly paper they've written about it is this one: https://www.amazon.science/publications/using-lightweight-fo...
(well, I think they may have submitted one or two others, but this is the only one that got published)
> conway’s law and how it shapes S3’s architecture (consisting of 300+ microservices)
At this kind of scale, queues, caches and long running workers ought to be avoided at all costs due to their highly opaque nature which drastically increases the unpredictability in the system's behaviour whilst decreasing the reliability and observability.
Does AWS S3 pack small files into a larger file? It seems hard to do with erasure coding and ShardStore.
Consider the read/write ratio. Most of the time the S3 service is serving from memory buffers, and memory is quite a lot faster than disk.
Here is another very interesting post on the subject from 2023:
https://www.allthingsdistributed.com/2023/07/building-and-op...
Can we replicate something similar to this for a homelab?
garage
Add ZeroFS on top and get very low latency for frequently used data while bulk storage is remote S3.
[dead]
And still serving/protecting those 'internet scanners' and hacking script kiddies?
Because as a self-hoster, I had to block a ton of AWS IPv4 addresses because of those.