S3 scales to petabytes a second on top of slow HDDs
24 comments · September 24, 2025 · EwanToo
littlesnugblood
Andy Warfield is a narcissistic asshole. I speak from experience.
giancarlostoro
Really nice read, thank you for that.
enether
Author of the 2minutestreaming blog here. Good point! I'll add this as a reference at the end. I loved that piece. My goal was to be more concise and focus on the HDD aspect.
dgllghr
I enjoyed this article but I think the answer to the headline is obvious: parallelism
crabique
Is there an open source service designed with HDDs in mind that achieves similar performance? I know none of the big ones work that well with HDDs: MinIO, Swift, Ceph+RadosGW, SeaweedFS; they all suggest flash-only deployments.
Recently I've been looking into Garage and liking the idea of it, but it seems to have a very different design (no erasure coding).
olavgg
SeaweedFS has evolved a lot over the last few years, with RDMA support and EC.
bayindirh
Lustre and ZFS can do similar speeds.
However, if you need high IOPS, you need flash on the MDS for Lustre, and some log SSDs (especially dedicated write-log and read-cache devices) for ZFS.
crabique
Thanks, but I forgot to specify that I'm interested in S3-compatible servers only.
Basically, I have a single big server with 80 high-capacity HDDs and 4 high-endurance NVMes, and it serves as an S3 endpoint that gets a lot of writes.
So yes, for now my best candidate is ZFS + Garage: this way I can get away with using replica=1 and rely on ZFS RAIDz for data safety, while the NVMes get sliced and diced to act as the fast metadata store for Garage, the "special" device/small-records store for ZFS, the ZIL/SLOG device, and so on (rough sketch of the layout below).
Currently it's a bit of a Frankenstein's monster: XFS+OpenCAS as the backing storage for an old version of MinIO (containerized to run as 5 instances). I'm looking to replace it with a simpler design and hopefully get better performance.
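To make that concrete, roughly the layout I have in mind (untested, just a sketch; device names and vdev width are placeholders) would be eight 10-wide RAIDz2 vdevs across the 80 HDDs, one NVMe mirror as the "special" vdev and the other as the SLOG:

    # Sketch only: compose (don't run) the zpool create command for
    # 8x 10-wide raidz2 vdevs plus NVMe "special" and log mirrors.
    HDDS = [f"/dev/disk/by-id/hdd{i:02d}" for i in range(80)]   # placeholder ids
    NVMES = [f"/dev/disk/by-id/nvme{i}" for i in range(4)]      # placeholder ids
    VDEV_WIDTH = 10                                             # 8 vdevs of 10 disks

    cmd = ["zpool", "create", "tank"]
    for i in range(0, len(HDDS), VDEV_WIDTH):
        cmd += ["raidz2", *HDDS[i:i + VDEV_WIDTH]]
    cmd += ["special", "mirror", NVMES[0], NVMES[1]]  # metadata + small records
    cmd += ["log", "mirror", NVMES[2], NVMES[3]]      # SLOG for sync writes

    # Print the command for review instead of executing it.
    print(" ".join(cmd))

Garage's metadata directory would then go on a small-recordsize dataset (with special_small_blocks set) so it effectively lives on the NVMe special vdev, while the object data stays on the HDD vdevs.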
creiht
It is probably worth noting that most of the listed storage systems (including S3) are designed to scale not only in hard drives, but horizontally across many servers in a distributed system. They really are not optimized for a single-storage-node use case. There are also other things to consider that can limit performance, like what the storage backplane looks like for those 80 HDDs and how much throughput you can effectively push through it. Then there is the network connectivity, which will also be a limiting factor.
bayindirh
It might not be the ideal solution, but did you consider installing TrueNAS on that thing?
TrueNAS can handle the OpenZFS part (RAIDZ, caches, and logs), and you can deploy Garage or any other S3 gateway on top of it.
It could be an interesting experiment, and an 80-disk server is not too big for a TrueNAS installation.
foobarian
Do you know if some of these systems have components to periodically checksum the data at rest?
elitepleb
Any of them will work just as well, but only with many datacenters' worth of drives, which very few deployments can target.
It's the classic horizontal/vertical scaling trade-off; that's why flash tends to be more space/cost efficient for speedy access.
giancarlostoro
Doing some light googling, aside from Ceph (already listed) there's also one called Gluster. It hypes itself as: "using common off-the-shelf hardware you can create large, distributed storage solutions for media streaming, data analysis, and other data- and bandwidth-intensive tasks."
It's open source / free to boot. I have no direct experience with it myself, however.
a012
I've used GlusterFS before because I had tens of old PCs, and it worked very well for me. It was basically a PoC to see how it works rather than production, though.
nerdjon
So is any of S3 powered by SSDs?
I honestly figured that the standard tier must be powered by SSDs and that the slower tiers were the ones using HDDs or slower systems.
MDGeist
I always assumed the really slow tiers were tape.
wg0
Does anyone know what the technology stack of S3 is? Monolith or multiple services?
I assume it would have lots of queues, caches, and long-running workers.
Twirrim
Amazon biases towards a Systems-Oriented Architecture approach that sits in the middle ground between a monolith and microservices.
It biases away from lots of small services in favour of larger ones that handle more of the work, so that as much as possible you avoid the costs and latency of preparing, transmitting, receiving, and processing requests.
I know S3 has changed since I was there nearly a decade ago, so this is outdated. Off the top of my head it used to be about a dozen main services at that time. A request to put an object would only touch a couple of services en route to disk, and similar on retrieval. There were a few services that handled fixity and data durability operations, the software on the storage servers themselves, and then stuff that maintained the mapping between object and storage.
hnexamazon
I was an SDE on the S3 Index team 10 years ago, but I doubt much of the core stack has changed.
S3 is composed primarily of layers of Java-based web services. The hot paths (object get / put / list) are all served by synchronous API servers - no queues or workers. It is the best example of how many transactions per second a pretty standard Java web service stack can handle that I’ve seen in my career.
For a get call, you first hit a fleet of front-end HTTP API servers behind a set of load balancers. Partitioning is based on the key name prefixes, although I hear they’ve done work to decouple that recently. Your request is then sent to the Indexing fleet to find the mapping of your key name to an internal storage id. This is returned to the front end layer, which then calls the storage layer with the id to get the actual bits. It is a very straightforward multi-layer distributed system design for serving synchronous API responses at massive scale.
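If it helps, here's a toy sketch of that flow (my own illustration, not actual S3 code; all of the class and method names are invented):

    # Toy illustration of the layered synchronous GET path described above:
    # front end -> index service (key -> storage id) -> storage service (bytes).

    class IndexService:
        """Stand-in for the Indexing fleet: key name -> internal storage id."""
        def __init__(self, table):
            self._table = table

        def resolve(self, bucket, key):
            return self._table.get(f"{bucket}/{key}")

    class StorageService:
        """Stand-in for the storage layer: storage id -> object bytes."""
        def __init__(self, blobs):
            self._blobs = blobs

        def fetch(self, storage_id):
            return self._blobs[storage_id]

    class FrontEndGetHandler:
        """Synchronous GET path: two blocking calls, no queues or workers."""
        def __init__(self, index, storage):
            self._index = index
            self._storage = storage

        def get_object(self, bucket, key):
            storage_id = self._index.resolve(bucket, key)
            if storage_id is None:
                raise KeyError(f"NoSuchKey: {bucket}/{key}")
            return self._storage.fetch(storage_id)

    if __name__ == "__main__":
        handler = FrontEndGetHandler(
            IndexService({"photos/cat.jpg": "blob-42"}),
            StorageService({"blob-42": b"object bytes"}),
        )
        print(handler.get_object("photos", "cat.jpg"))

The real thing layers load balancers, prefix-based partitioning, replication, and retries on top of each hop, but the request shape really is that simple.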
The only novel bit is that all the backend communication uses a home-grown, stripped-down HTTP variant called STUMPY, if I recall. It was a dumb idea not to just use HTTP, but the service is ancient and was originally built back when principal engineers were allowed to YOLO their own frameworks and protocols, so now they are stuck with it. They might have done the massive lift to replace STUMPY with HTTP since my time.
js4ever
"It is the best example of how many transactions per second a pretty standard Java web service stack can handle that I’ve seen in my career."
Can you give some numbers? Or at least a ballpark?
jyscao
> conway’s law and how it shapes S3’s architecture (consisting of 300+ microservices)
cramcgrab
[dead]
I think a more interesting article on S3 is "Building and operating a pretty big storage system called S3"
https://www.allthingsdistributed.com/2023/07/building-and-op...