AWS Outage: A Single Cloud Region Shouldn't Take Down the World. But It Did

JCM9

US-East-1 is more than just a normal region. It also provides the backbone for other services, including those in other regions. Thus simply being in another region doesn’t protect you from the consistent us-east-1 shenanigans.

AWS doesn’t talk about that much publicly, but if you press them they will admit in private that there are some pretty nasty single points of failure in the design of AWS that can materialize if us-east-1 has an issue. Most people would say that means AWS isn’t truly multi-region in some areas.

Not entirely clear yet if those single points of failure were at play here, but risk mitigation isn’t as simple as just “don’t use us-east-1” or “deploy in multiple regions with load balancing failover.”

bravetraveler

Call me crazy, because it is, but perhaps it's their "Room 641a". The purpose of a system is what it does, etc.

They've been in the business of charging people for, and marketing, "Availability" for decades at this point. I worked for a competitor and made a better product: it could endure any zone failing.

voxadam

> perhaps it's their "Room 641a".

For the uninitiated: https://en.wikipedia.org/wiki/Room_641A

xbar

This set of facts comes to light every 3-5 years when US-East-1 has another failure. Clearly they could have architected their way out of this blast radius problem by now, but they do not.

Why would they keep a large set of centralized, core traffic services in Virginia for decades despite it being a bad design?

firesteelrain

It’s probably because there is a lot of tech debt, plus look at where it is: Virginia. It shouldn’t take much imagination to figure out why that location is strategic.

gchamonlive

Been a while since I last suffered from AWS's arbitrary complexity, but afaik you can only associate certificates with CloudFront if they are issued in us-east-1, so it's undoubtedly a single point of failure for the whole CDN if that's still the case.
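
If it helps anyone, here's a rough boto3 sketch of that constraint (the domain name is made up): the ACM client has to be pinned to us-east-1 for CloudFront to accept the certificate, no matter where the rest of your stack runs.

```python
import boto3

# CloudFront only accepts ACM certificates issued in us-east-1,
# so the ACM client has to be pinned to that region even if the
# rest of the stack lives somewhere else.
acm = boto3.client("acm", region_name="us-east-1")

# Made-up domain; the DNS validation records still have to be created.
cert = acm.request_certificate(
    DomainName="www.example.com",
    ValidationMethod="DNS",
)
print("Certificate ARN (us-east-1):", cert["CertificateArn"])

# This ARN is what goes into the distribution's ViewerCertificate;
# a certificate issued in any other region is simply rejected.
```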

kokanee

I worked at AMZN for a bit and the complexity is not exactly arbitrary; it's political. Engineers and managers are highly incentivized to make technical decisions based on how they affect inter-team dependencies and the related corporate dynamics. It's all about review time.

sharpy

I have seen one promo docket get rejected for doing work that is not complex enough... I thought the problem was challenging, and the simple solution brilliant, but the tech assessor disagreed. I mean once you see there is a simple solution to a problem, it looks like the problem is simple...

helsinkiandrew

>US-East-1 is more than just a normal region. It also provides the backbone for other services, including those in other regions

I thought that if us-east-1 goes down you might not be able to administer (or bring up new services) in other zones, but if you have services running that can take over from us-east-1, you can maintain your app/website etc.

I haven’t had to do this for several years but that was my experience a few years ago on an outage - obviously it depends on the services you’re using.

You can’t start cloning things to other zones after us-east-1 is down - you’ve left it too late
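
Roughly what that looks like in practice, as a hedged boto3 sketch (the hosted zone ID, health check ID and hostnames are all made up): a Route 53 failover record pair, where the secondary stack in another region has to be deployed and healthy before us-east-1 goes down, not after.

```python
import boto3

route53 = boto3.client("route53")

# Made-up IDs and hostnames. The secondary stack must already be
# deployed and healthy in another region *before* us-east-1 fails;
# the Route 53 control plane itself lives in us-east-1, so record
# changes may not even be possible during an outage there.
HOSTED_ZONE_ID = "Z0000000000000000000"
PRIMARY_HEALTH_CHECK_ID = "00000000-0000-0000-0000-000000000000"

def failover_record(role, target, health_check_id=None):
    record = {
        "Name": "app.example.com",
        "Type": "CNAME",
        "SetIdentifier": role.lower(),
        "Failover": role,  # "PRIMARY" or "SECONDARY"
        "TTL": 60,
        "ResourceRecords": [{"Value": target}],
    }
    if health_check_id:
        record["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": record}

route53.change_resource_record_sets(
    HostedZoneId=HOSTED_ZONE_ID,
    ChangeBatch={
        "Changes": [
            failover_record("PRIMARY", "app-use1.example.com",
                            PRIMARY_HEALTH_CHECK_ID),
            failover_record("SECONDARY", "app-euw1.example.com"),
        ]
    },
)
```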

sgarland

It depends on the outage. There was one a year or two ago (I think? They run together) that impacted EC2 such that as long as you weren’t trying to scale or issue any commands, your service would continue to operate. The EKS clusters at my job at the time kept chugging along, but had Karpenter tried to schedule more nodes, we’d have had a bad time.

bpicolo

Static stability is a very valuable infra attribute. You should definitely consider how statically stable your services are when architecting them.
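
A minimal sketch of one way to buy some static stability with boto3 (the group name and sizes are made up): keep the Auto Scaling group's minimum at roughly peak capacity, so steady-state traffic never needs the control plane and an outage only bites when you try to scale or deploy.

```python
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

# Made-up group name and sizes. The idea behind static stability:
# keep MinSize at (or near) the capacity needed for peak load, so the
# already-running fleet keeps serving traffic even if the EC2 / Auto
# Scaling control plane is down and no new instances can be launched.
autoscaling.update_auto_scaling_group(
    AutoScalingGroupName="web-fleet",
    MinSize=12,           # enough for peak, not just the quiet hours
    DesiredCapacity=12,
    MaxSize=24,
)
```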

cmiles8

Well, that sounds like exactly the sort of thing that shouldn’t happen when there’s an issue, given that the usual response is to spin things up elsewhere, especially for lower-priority services where instant failover isn’t needed.

api

My contention for a long time has been that cloud is full of single points of failure (and nightmarish security hazards) that are just hidden from the customer.

"We can't run things on just a box! That's a single point of failure. We're moving to cloud!"

The difference is that when the cloud goes down you can shift the blame to them, not you, and fixing it is their problem.

The corporate world is full of stuff like this. A huge role of consultants like McKinsey is to provide complicated reports and presentations backing the ideas that the CEO or other board members want to pursue. That way if things don't work out they can blame McKinsey.

raw_anon_1111

You act as if that is a bug, not a feature. Hypothetically, as someone responsible for my site staying up, I would much rather blame AWS than myself. Besides, none of your customers are going to blame you if every other major site is down.

Yeul

The Internet was supposed to be a communications network that could survive the East Coast being nuked.

What it turned into was Daedalus from Deus Ex lol.

t_sawyer

Yeah, because Amazon engineers are hypocrites. They want you to spend extra money on region failover and multi-AZ deploys, but they don't do it themselves.

ZeroCool2u

They can't even bother to enable billing services in GovCloud regions.

falcor84

That's a good point, but I'd just s/Amazon engineers/AWS leadership/, as I'm pretty sure there are a few layers of management between the engineers on the ground at AWS, those who deprioritise any longer-term resilience work needed (which is a very strategic decision), and those who are in charge of external comms/education about best practices for AWS customers.

ajsnigrutin

Luckily, those people are the ones that will be getting all the phone calls from angry customers here. If you're selling resilience and selling twice the service (so your company can still run if one location fails), and it still failed, well... the phones will be ringing.

ajkjk

They absolutely do do it themselves..

t_sawyer

There are multiple single points of failure for their entire cloud in us-east-1.

I think it's hypocritical for them to push customers to double or triple their spend in AWS when they themselves have single points of failure on a single region.

falcor84

What do you mean? Obviously, as TFA shows and as others here pointed out, AWS relies globally on services that are fully dependent on us-east-1, so they aren't fully multi-region.

rose-knuckle17

AWS had an outage. Many companies were impacted. Headlines around the world blame AWS. The real news is how easy it is to identify the companies that have put cost management ahead of service resiliency.

Lots of orgs operating wholly in AWS, and sometimes only within us-east-1, had no operational problems last night. Some of that is design (not using the impacted services). Some of that is good resiliency in design. And some of that was dumb luck (accidentally good design).

Overall, the companies that had operational problems likely wouldn't have invested in resiliency in any other deployment strategy either. It could have happened to them on Azure, GCP, or even a home-rolled datacenter.

martypitt

A bit meta, but what is faun.dev? I visited their site - it looks like a very, very slow (possibly because of the current outage?), ad-funded Reddit/HN clone?

But in its sidebar of "Trending technologies", it lists "Ansible" and "Jenkins"... which, while both great, I doubt are actually trending currently.

Curious what this is?

sofixa

> "Ansible" and "Jenkins" .. which while are both great

I would strongly argue that there is nothing great about Jenkins. It's an unholy mess of mouldy spaghetti that can sometimes be used to achieve a goal, but is generally terrible at everything. Shit to use, shit to maintain, shit to secure. It was the best solution because of a lack of competition 20 years ago, but hasn't been relevant or anywhere near the top 50 since any competition appeared.

The fact that to this very day, nearing the end of 2025, they still don't support JWT identities for runs is embarrassing. Same goes for VMware vSphere.

Maxion

OP is the creator of faun.dev. Seems to just be yet another tech news site.

pjmlp

It just goes to show the difference between best practices in cloud computing, and what everyone ends up doing in reality, including well known industry names.

mrbungie

> It just goes to show the difference between best practices in cloud computing, and what everyone ends up doing in reality,

Well, inter-region DR/HA is an expensive thing to ensure (whether in salaries, infra, or both), especially when you are in AWS.

spyspy

Eh, the "best practices" that would've prevented this aren't trivial to implement and are definitely far beyond what most engineering teams are capable of, in my experience. It depends on your risk profile. When we had cloud outages at the freemium game company I worked at, we just shrugged and waited for the systems to come back online - nobody dying because they couldn't play a word puzzle. But I've also had management come down and ask what it would take to prevent issues like that from happening again, and then pretend they never asked once it was clear how much engineering effort it would take. I've yet to meet a product manager that would shred their entire roadmap for 6-18 months just to get at an extra 9 of reliability, but I also don't work in industries where that's super important.

esafak

Does AWS follow its own Well-Architected Framework!?

bilekas

These things happen when profits are the measure of everything. Change your provider, but if their number doesn't go up, they won't be reliable.

So your complaints mean nothing, because "number go up".

I remember the good old days of everyone starting a hosting company. We never should have left.

helsinkiandrew

> The incident underscores the risks associated with the heavy reliance on a few major cloud service providers.

Perhaps for the internet as a whole, but for each individual service it underscores the risk of not hosting your service in multiple zones or having a backup

mrbluecoat

> due to an "operational issue" related to DNS

Always DNS..

aeon_ai

It's not DNS

There's no way it's DNS

It was DNS

randomtoast

Thing is, us-east-1 is the primary region for many AWS services. DynamoDB is a very central offering used by many services. And the issue that happened is very common[^1].

I think no matter how hard you try to avoid it, in the end there's always a massive dependency chain for modern digital infrastructure[^2].

[1]: https://itsfoss.community/uploads/default/optimized/2X/a/ad3...

[2]: https://xkcd.com/2347/
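
For what it's worth, here's a minimal boto3 sketch (table name and regions are made up, and it assumes the table meets the global-table prerequisites) of adding a cross-region replica with DynamoDB global tables, which is roughly the mitigation for this kind of single-region dependency:

```python
import boto3

dynamodb = boto3.client("dynamodb", region_name="us-east-1")

# Made-up table name and regions. Adding a replica turns the table into
# a global table (version 2019.11.21), so reads and writes can be served
# from eu-west-1 if us-east-1 is having a bad day.
dynamodb.update_table(
    TableName="orders",
    ReplicaUpdates=[
        {"Create": {"RegionName": "eu-west-1"}},
    ],
)

# The application would then point its client at the replica region:
replica = boto3.client("dynamodb", region_name="eu-west-1")
```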

mcphage

It shouldn’t, but it does. As a civilization, we’ve eliminated resilience wherever we could, because it’s more cost-effective. Resilience is expensive. So everything is resting on a giant pile of single points of failure.

Maybe this is the event to get everyone off of piling everything onto us-east-1 and hoping for the best, but the last few outages didn’t, so I don’t expect this one to, either.

2d8a875f-39a2-4

> we’ve eliminated resilience wherever we could, because it’s more cost-effective. Resilience is expensive.

You are right. But alas, a peek at the AMZN stock ticker suggests that the market doesn't really value resilience that much.

foinker

No shot that happens until an outage breaks at least an entire workday in the US timezones. The only complaint I personally heard was from someone who couldn't load reddit on the train to work.

Trasmatta

The irony is that true resilience is very complex, and complexity can be a major source of outages in and of itself

agos

when did we have resilience?

BoredPositron

Cold War was pretty good in terms of resilience.

mschuster91

> Maybe this is the event to get everyone off of piling everything onto us-east-1 and hoping for the best, but the last few outages didn’t, so I don’t expect this one to, either.

Doesn't help either. us-east-1 hosts the internal control plane of AWS, and a bunch of stuff is only available in us-east-1 - most importantly CloudFront, AWS ACM for CloudFront, and parts of IAM.

And that last one is the true big problem. When IAM has a sniffle, everything else collapses, because literally everything else depends on IAM. If I were to guess, IAM probably handles millions if not billions of requests a second, because every action on every AWS service causes at least one request to IAM.

TheNewsIsHere

The last re:Invent presentation I saw from one of the principals working on IAM quoted 500 million requests per second. I expect that’s because IAM also underpins everything inside AWS, too.

mschuster91

IAM, hands down, is one of the most amazing pieces of technology there is.

The sheer volume is one thing, but IAM's policy engine is another. Up to 5,000 different roles per account, dozens of policies that can have an effect on any given user entity, and on top of that you can also create IAM policies that blanket-affect all entities (or only a filtered subset) in an account, and each policy definition can be what, 10 kB or so in size. Filters can include multiple wildcards everywhere, so you can't go for a fast path with an in-memory index, and they can use variables with on-demand evaluation as well.

And all of that is reachable not on an account-specific endpoint that could get sharded from a shared instance should the load of one account become too expensive, no, it's a global (and region-shared) endpoint. And if that weren't enough, all calls are shipped off to CloudTrail's event log, always, with full context cues to have an audit and debug trail.

To achieve all that at a service quality where an IAM policy change takes effect in under 10 seconds and calls return in milliseconds is nothing short of amazing.
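
You can poke at that engine from the outside with the policy simulator. A small boto3 sketch (the policy and ARNs are made up), which runs the same evaluation logic, wildcards and all:

```python
import json
import boto3

iam = boto3.client("iam")

# Made-up policy: wildcards in both the action and the resource, which
# is exactly the kind of matching IAM has to do on every single request.
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": "s3:Get*",
        "Resource": "arn:aws:s3:::example-bucket/*",
    }],
}

result = iam.simulate_custom_policy(
    PolicyInputList=[json.dumps(policy)],
    ActionNames=["s3:GetObject", "s3:PutObject"],
    ResourceArns=["arn:aws:s3:::example-bucket/report.csv"],
)

for evaluation in result["EvaluationResults"]:
    print(evaluation["EvalActionName"], "->", evaluation["EvalDecision"])
```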

fsto

Ironically, the HTTP request to this article timed out twice before a successful response.

jamesbelchamber

This website just seems to be an auto-generated list of "things" with a catchy title:

> 5000 Reddit users reported a certain number of problems shortly after a specific time.

> 400000 A certain number of reports were made in the UK alone in two hours.