The real failure rate of EBS
30 comments
·March 18, 2025
samlambert
it's not a contradiction but there is nuance. local disks mean we can do a significant amount of the operations involved in a write locally without every block going over the network. It's true that a replica has to acknowledge it received the write but that's a single operation vs hundreds over a network.
c4wrd
> When attached to an EBS–optimized instance, General Purpose SSD (gp2 and gp3) volumes are designed to deliver at least 90 percent of their provisioned IOPS performance 99 percent of the time in a given year. This means a volume is expected to experience under 90% of its provisioned performance 1% of the time. That's 14 minutes of every day or 86 hours out of the year of potential impact. This rate of degradation far exceeds that of a single disk drive or SSD.

> This is not a secret, it's from the documentation. AWS doesn't describe how failure is distributed for gp3 volumes, but in our experience it tends to last 1-10 minutes at a time. This is likely the time needed for a failover in a network or compute component. Let's assume the following: Each degradation event is random, meaning the level of reduced performance is somewhere between 1% and 89% of provisioned, and your application is designed to withstand losing 50% of its expected throughput before erroring. If each individual failure event lasts 10 minutes, every volume would experience about 43 events per month, with at least 21 of them causing downtime!
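To check the quoted arithmetic, here is a quick back-of-the-envelope sketch. The uniform 1-89% severity and the 50% tolerance threshold are the article's stated assumptions, not measured AWS behaviour:

```python
# Sanity check of the arithmetic quoted above; the assumptions are the article's, not AWS data.
MINUTES_PER_YEAR = 365 * 24 * 60                     # 525,600

# "at least 90% of provisioned IOPS 99% of the time in a given year"
# => up to 1% of the year may fall below 90% of provisioned performance.
allowed_degraded_minutes = 0.01 * MINUTES_PER_YEAR   # 5,256 minutes
print(allowed_degraded_minutes / 365)                # ~14.4 minutes per day
print(allowed_degraded_minutes / 60)                 # ~87.6 hours per year (quoted as 86)

# Assume every degradation event lasts 10 minutes, as the article does.
events_per_month = allowed_degraded_minutes / 10 / 12
print(events_per_month)                              # ~43.8 events per month

# Severity uniform between 1% and 89% of provisioned; the app breaks below 50%.
p_below_50 = (50 - 1) / (89 - 1)
print(events_per_month * p_below_50)                 # ~24 "downtime" events per month,
                                                     # in the ballpark of the quoted 21+
```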
These are some seriously heavy-handed assumptions, and they disregard the data the author's company collects. First, the author assumes these failure events are distributed randomly and happen on a daily basis, even though Amazon states the figure over an entire year ("99 percent of the time in a given year"). Second, the author says that in practice failures last between 1 and 10 minutes, yet then assumes every failure lasts the full 10 minutes, ignoring the very duration range they just introduced.
Imagine your favorite pizza company claiming to deliver on time "99% of the time throughout a year." The author's logic is like saying, "The delivery driver knocks precisely 14 minutes late every day, and each delay is exactly 10 minutes, no exceptions!" It ignores reality: sometimes your pizza is a minute late, sometimes 10 minutes late, and sometimes it arrives exactly on time for four months straight.
From a company with useful real-world data, I expect arguments backed by cold, hard data rather than exaggeration. For transparency, my organization has seen 51 degraded EBS volume events across ~10,000 EBS volumes in the past 3 years. Of those events, 41 lasted less than one minute, nine lasted two minutes, and one lasted three minutes.
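A rough scale comparison, counting each sub-minute event as a full minute (so an upper bound on the observed degradation), against what the gp3 wording permits:

```python
# Scale comparison: the fleet experience described above vs. what the gp3 SLO permits.
# Each of the 41 sub-minute events is counted as a full minute, so this is an upper bound.
volumes, years = 10_000, 3
degraded_minutes_fleetwide = 41 * 1 + 9 * 2 + 1 * 3      # <= 62 minutes over 3 years

observed_per_volume_year = degraded_minutes_fleetwide / volumes / years
allowed_per_volume_year = 0.01 * 365 * 24 * 60           # 5,256 minutes allowed per volume-year

print(observed_per_volume_year)   # ~0.002 degraded minutes per volume per year observed
print(allowed_per_volume_year)    # 5,256 degraded minutes per volume per year permitted
```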
remram
They are expanding on what the AWS guarantee means, and their statement is correct. They did not say the pizza place does this; they said the pizza place's guarantee allows for this. I don't see a problem.
samlambert
we have a lot more content like this on the way. if anyone has feedback or questions let us know.
swyx
LOVE this stuff sam, it's highly educational but also establishes a ton of trust in PS. please keep it up!
JoshTriplett
How often do you boot up instances? Do you measure detailed metrics for the time from the RunInstances call to the earliest possible timestamp you can easily get from the user code, to quantify the amount of time spent in AWS before any instance code gets control?
If so, I'd love to see your measured distribution of boot times, because I've observed results similar to your EBS observations, with some long-tail outliers.
Thanks for the analysis and article!
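For concreteness, one minimal way to capture that measurement (the AMI ID, region, and file path below are placeholders, and cloud-init user data stands in for "the earliest user code"):

```python
# One minimal way to capture "RunInstances call -> earliest user code" timing.
# The AMI ID, region, and file path are placeholders; cloud-init user data stands in
# for "the earliest timestamp you can easily get from user code".
import time
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

user_data = """#!/bin/bash
# Runs as early user code via cloud-init; record wall-clock time for later comparison.
date +%s.%N > /var/tmp/first_user_code_ts
"""

t_call = time.time()
resp = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",   # placeholder AMI
    InstanceType="t3.micro",
    MinCount=1,
    MaxCount=1,
    UserData=user_data,
)
instance_id = resp["Instances"][0]["InstanceId"]

# Later, read /var/tmp/first_user_code_ts off the instance (SSM, SSH, your agent, ...)
# and subtract t_call to get the time spent in AWS before any of your code ran.
print(instance_id, t_call)
```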
miller_joe
Instances are constantly booting up because most instances live <30d. Boot time, in terms of how soon a node is fully booted, joined to the EKS apiserver, and ready for workloads, is approx 2.5-3 min. There are a lot of parts involved in getting to that point, though, some of which would not matter if you're not using EKS. Also, this is not something we measure super closely, as from a user perspective it is generally imperceptible.
A possibly better metric for your particular case (assuming you're interested in the fastest boot achievable) comes from our self-managed github-actions runners. Those boot times are in the 40-50s range, which is consistent with what others see, as far as I know. The depot.dev folks have a good blog post on this topic, including how they got boot-to-ready times down to 5s: https://depot.dev/blog/github-actions-breaking-five-second-b...
JoshTriplett
I'm already at the ~5s mark, booting a brand new instance, almost all of which is AWS time before my instance gets control; once the kernel takes over the remaining boot time is milliseconds. (I plan to go the "pool of instances" route in the future to eliminate the AWS time I have no control over.)
But every so often, I observe instances taking several more seconds of that uncontrollable AWS time, and I wondered what statistics you might have on that.
Possibly relatedly, do you ever observe EBS being degraded at initial boot?
bigfatfrock
Great deep dive. I've been actively curious about some of the results you found, which show up similarly in infra setups I run or have run previously.
This kind of miffs me, too:
> AWS doesn’t describe how failure is distributed for gp3 volumes
I wonder why? Because it affects their number of 9s? Rep?
samlambert
it's hard to know for sure. it might be that, or it might just be that publishing it would present a number that is confusing to most.
ta988
Thanks! This is extremely useful and I'll be waiting for the next ones.
flaminHotSpeedo
Do you listen for volume degradation EventBridge notifications? I'm curious if or how often AWS flags these failed volumes for you
nickvanw
Our experience has been that they do fire, but not reliably enough or soon enough to be worth anything other than validating the problem after the fact.
reedf1
If you can detect EBS failure better than Amazon, I'd be selling this to them tomorrow.
tpetry
They probably do detect this. That's why, according to the article, the problem is resolved after one to ten minutes. There's probably nothing they can do that wouldn't stress the disks more.
diggan
Probably sometimes, at least if we trust the article:
> In our experience, the documentation is accurate: sometimes volumes pass in and out of their provisioned performance in small time windows:
What AWS considers "small degradation" is sometimes "100% down" for its users, though. Look at any previous "AWS is down/having problems" HN comment thread and you'll see there tends to be a huge mismatch between what AWS considers "not working" and what users of AWS consider "not working".
Doesn't surprise me people want better tooling than what AWS themselves offer.
nickvanw
Author here. It's not that we're detecting failure better than they are (though certainly, we might be able to do it as fast as anyone else); it's what you do afterwards that matters.
Being able to fail over to another database instance, backed by a different volume in a different zone, minimizes the impact. This is well in line with AWS best practices; it's just arduous to do quickly and at scale.
sougou
It's not just failure detection. A write to EBS takes at least two additional network hops: the first to reach the machine handling the initial write, and the second to propagate that write to another machine for durability. Multiply this by the number of IOPS required to complete a database transaction.
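Purely as an illustration of how that multiplies (the per-hop latency and the IO count per transaction below are assumed round numbers, not figures from the article):

```python
# Illustrative only: how per-write network hops compound across a transaction.
# Both numbers below are assumed round figures, not measurements from the article.
hop_latency_ms = 0.5        # assumed one-way latency per network hop
extra_hops_per_write = 2    # reach the EBS server, then replicate for durability
ios_per_transaction = 200   # assumed IOs needed to commit one database transaction

added_ms = hop_latency_ms * extra_hops_per_write * ios_per_transaction
print(f"~{added_ms:.0f} ms of added network time per transaction if the IOs are "
      "fully serialized (batching and parallelism reduce this, but never to zero)")
```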
dijit
Why? They wouldn't buy it.
No offence to anyone who has drunk the kool-aid with AWS, but honestly they're making a product, *not* foundational infrastructure.
This might feel like a jarring point.
When you think of foundational infrastructure in the real world, you think of bridges and plumbing, and the cost of building such things is stupidly high.
Yet when those things get grossly privatised, they end up like Lagos, Nigeria [0].
There is a difference between delivering something that works most of the time and something that works all of the time. The major point being: one of them is obscenely profitable, and the other might not even break even, which is why governments usually take on the cost of foundational infrastructure: they never expect to break even.
[0]: https://ourworld.unu.edu/en/water-privatisation-a-worldwide-...
flaminHotSpeedo
I think the more interesting part here (besides the fact that AWS SLAs sneakily screw you over and make it hard to guarantee static stability) is the remediation aspect.
This is a consistent letdown across most AWS products: they build the undifferentiated 90% of a thing, but some PM refuses to admit the product isn't complete, so instead of providing optional feature flags or CDK samples or something to help with that last 10%, they bury it deep in the docs and try not to draw attention to it. Then, when you open a support case, they tell you to pound sand, or maybe suggest rearchitecting to avoid the foot-gun they didn't tell you about.
bddicken
Or in this case, to spend far more $$ on io2.
jewel
I wonder if you could work around this problem by having two EBS volumes on each host and writing to them both. You'd have the OS report the write as successful as soon as either drive reported success. With reads, you could alternate between drives for double the read performance during happy times, but quickly detect when one drive is slow and reroute those reads to the other drive.
We could call this RAID -1.
You'd need some accounting to ensure that the drives are eventually consistent, but based on the graphs of the issue it seems like you could keep the queue of pending writes in RAM for the duration of the slowdown.
Of course, it's quite likely that there will be correlated failures, as the two EBS volumes might end up on the same SAN and set of physical drives. Also it doesn't seem worth paying double for this.
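A toy sketch of that "acknowledge on the first success" idea, just to make the shape concrete (write_to_volume is a made-up placeholder, and the eventual-consistency bookkeeping mentioned above is left out entirely):

```python
# Toy illustration of "write to two volumes, acknowledge on the first success".
# Not crash-safe and not a real block layer: write_to_volume is a placeholder, and the
# reconciliation/consistency bookkeeping is omitted.
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def write_to_volume(volume, offset, data):
    volume.pwrite(data, offset)   # stand-in for issuing the IO to one EBS volume
    return volume

def mirrored_write(pool: ThreadPoolExecutor, vol_a, vol_b, offset, data):
    futures = [pool.submit(write_to_volume, v, offset, data) for v in (vol_a, vol_b)]
    done, _pending = wait(futures, return_when=FIRST_COMPLETED)
    # Report success as soon as either volume acks; the slower write keeps going in the
    # background and would need reconciliation if it ultimately fails.
    return next(iter(done)).result()
```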
maherbeg
The blog post mentioned correlated failures within an availability zone. You could likely reduce the risk a bit, but you'd still run into this frequently enough.
samlambert
it's a lot of complexity and cost for a service that is already replicating 3 ways. 6x replication for a single node's disks seems excessive.
QuinnyPig
I'm a sucker for deep dive cloud nerd content like this.
semi-extrinsic
Funny to see the plots with "No unit" on the y-axis label and then the actual units in parentheses in the title.
waynesonfire
[flagged]
"What makes PlanetScale Metal performance so much better? With your storage and compute on the same server, you avoid the I/O network hops that traditional cloud databases require [...] Every PlanetScale database requires at least 2 replicas in addition to the primary. Semi-synchronous replication is always enabled. This ensures every write has reached stable storage in two availability zones before it’s acknowledged to the client."
Isn't there a contradiction between these two statements?
My personal experience with EBS analogs in China (Aliyun, Tencent, Huawei clouds) is that every disk will experience a fatal failure or a disconnection at least once a month, at any provisioned IOPS. I don't know what makes them so bad, but I gave up running any kind of DB workload on them and use node-local storage instead. If there are durability constraints, I would spin up Longhorn or Rook on top of local storage. I can see replicas degrade from time to time, but overall the systems work (nothing too large, maybe ~50K QPS).