AWS Multiple Services Down in us-east-1
120 comments
·October 20, 2025
jacquesm
Every week or so we interview a company and ask them if they have a fall-back plan in case AWS goes down or their cloud account disappears. They always have this deer-in-the-headlights look. 'That can't happen, right?'
Now imagine for a bit that it will never come back up. See where that leads you. The internet got its main strengths from the fact that it was completely decentralized. We've been systematically eroding that strength.
padjo
Planning for an AWS outage is a complete waste of time and energy for most companies. Yes it does happen but very rarely to the tune of a few hours every 5-10 years. I can almost guarantee that whatever plans you have won’t get you fully operational faster than just waiting for AWS to fix it.
mlrtime
>Yes it does happen but very rarely to the tune of a few hours every 5-10 years.
It is rare, but it happens at LEAST 2-3x a year. AWS us-east-1 has a major incident affecting multiple services (which affect most downstream AWS services) multiple times a year, rarely with the same root cause.
Not very many people realize that there are some services that still run only in us-east-1.
energy123
It happens 2-3x a year during peacetime. Tail events are not homogeneously distributed across time.
sreekanth850
Depends on how serious you are about your SLAs.
lucideer
> to the tune of a few hours every 5-10 years
I presume this means you must not be working for a company running anything at scale on AWS.
skywhopper
That is the vast majority of customers on AWS.
davedx
> I can almost guarantee that whatever plans you have won’t get you fully operational faster than just waiting for AWS to fix it.
Absurd claim.
Multi-region and multi-cloud are strategies that have existed almost as long as AWS. If your company is in anything finance-adjacent or critical infrastructure then you need to be able to cope with a single region of AWS failing.
padjo
It's not absurd, I've seen it happen. A company executes its DR plan due to an AWS outage, AWS is back before the DR is complete, the DR has to be aborted, and the service is down longer than if they'd just waited.
Of course there are cases where multi-cloud makes sense, but they are in the minority. The absurd claim is that most companies should worry about cloud outages and plan for AWS to go offline forever.
afro88
> If your company is in anything finance-adjacent or critical infrastructure
GP said:
> most companies
Most companies aren't finance-adjacent or critical infrastructure
antihero
My website running on an old laptop in my cupboard is doing just fine.
whatevaa
When your laptop dies it's gonna be a pretty long outage too.
api
I have this theory of something I call “importance radiation.”
An old mini PC running a home media server and a Minecraft server will run flawlessly forever. Put something of any business importance on it and it will fail tomorrow.
jacquesm
Thank you for illustrating my point. You didn't even bother to read the second paragraph.
chanux
I get you. I am with you. But isn't money/resources always a constraint on having a solid backup solution?
I guess the reason people are not doing it is that it hasn't been demonstrated to be worth it, yet!
I've got to admit though, whenever I hear about having a backup plan I think of an apples-to-apples copy elsewhere, which is probably not wise/viable anyway. Perhaps having just enough to reach out to the service users/customers would suffice.
Also I must add I am heavily influenced by a comment by Adrian Cockcroft on why going multi-cloud isn't worth it. He worked for AWS (at the time at least) so I should probably have reached for the salt dispenser.
shawabawa3
> Now imagine for a bit that it will never come back up. See where that leads you. The internet got its main strengths from the fact that it was completely decentralized. We've been systematically eroding that strength.
my business contingency plan for "AWS shuts down and never comes back up" is to go bankrupt
mlrtime
We all read it. AWS not coming back up is your point on not having a backup plan?
You might as well say the entire NY + DC metro loses power and "never comes back up." What is the plan around that? The person replying is correct: most companies do not have an actionable plan for AWS never coming back up.
I worked at a medium-large company and was responsible for reviewing the infrastructure BCP plan. It stated that AWS going down was a risk, and if it happens we wait for it to come back up. (In a lot more words than that.)
pmontra
In the case of a customer of mine, the AWS outage manifested itself as Twilio failing to deliver SMSes. The fallback plan was to disable the rotation of our two SMS providers and send all messages with the remaining one. But what if the other one had something on AWS too? Or maybe both of them have something else vital on Azure, or Google Cloud, which will fail next week and stop our service. Who knows?
For small and medium sized companies it's not easy to perform accurate due diligence.
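A minimal sketch of that kind of fallback, with hypothetical send_via_twilio / send_via_backup wrappers (not the actual setup described above): take the failing provider out of rotation and everything goes through the one that's left.

    # Hypothetical provider wrappers; in reality these would call the
    # Twilio SDK and the second provider's API.
    def send_via_twilio(to, body):
        raise RuntimeError("Twilio unreachable (upstream AWS outage)")

    def send_via_backup(to, body):
        print(f"sent to {to} via backup provider: {body}")

    PROVIDERS = {
        "twilio": {"send": send_via_twilio, "enabled": True},
        "backup": {"send": send_via_backup, "enabled": True},
    }

    def send_sms(to, body):
        # Normally both providers are in rotation; disabling one (e.g. during
        # an outage) collapses the rotation to the remaining provider. A send
        # failure also falls through to the next enabled provider.
        last_error = None
        for provider in PROVIDERS.values():
            if not provider["enabled"]:
                continue
            try:
                provider["send"](to, body)
                return True
            except Exception as exc:
                last_error = exc
        raise RuntimeError(f"all SMS providers failed: {last_error}")

    # During the outage: take Twilio out of rotation entirely.
    PROVIDERS["twilio"]["enabled"] = False
    send_sms("+15555550123", "your verification code is 123456")

The hard part, as noted above, is knowing whether the "remaining" provider has its own hidden dependency on the same cloud.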
freetanga
Additionally, I find that most hyperscalers try to lock you in by taking industry-standard services and tailoring them with custom features, which end up putting down roots and making a multi-vendor setup or a lift-and-shift problematic.
You need to keep your eyes peeled at all levels of the organization, as many of these creep in through day-to-day work…
jacquesm
Yes, they're really good at that. This is just 'embrace and extend'. We all know the third.
hvb2
> The internet got its main strengths from the fact that it was completely decentralized.
Decentralized in terms of many companies making up the internet. Yes, we've seen heavy consolidation: fewer than 10 companies now make up the bulk of the internet.
The problem here isn't caused by companies choosing one cloud provider over another. It's the economies of scale leading us to a few large companies in any sector.
lentil_soup
> Decentralized in terms of many companies making up the internet
Not companies, the protocols are decentralized, and at some point it was mostly non-companies. Anyone can hook up a computer and start serving requests, which was/is a radical concept. We've lost a lot, unfortunately.
hvb2
No we've not lost that at all. Nobody prevents you from doing that.
We have put more and more services on fewer and fewer vendors. But that's the consolidation and cost point.
jacquesm
I think one reason is that people are just bad at statistics. Chance of materialization * impact = small. Sure. Over a short enough time that's true for any kind of risk. But companies tend to live for years, decades even and sometimes longer than that. If we're going to put all of those precious eggs in one basket, as long as the basket is substantially stronger than the eggs we're fine, right? Until the day someone drops the basket. And over a long enough time span all risks eventually materialize. So we're playing this game, and usually we come out ahead.
But trust me, 10 seconds after this outage is solved everybody will have forgotten about the possibility.
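To put rough numbers on that (purely hypothetical ones, just to show how the time horizon changes the picture):

    # Hypothetical numbers: a catastrophic cloud event (account loss,
    # prolonged regional outage) with a small yearly probability.
    p_per_year = 0.02          # 2% chance in any given year
    impact = 2_000_000         # cost in dollars if it happens

    print(f"expected loss per year: ${p_per_year * impact:,.0f}")

    # Over a company lifetime, "probably not this year" stops being comforting.
    for years in (1, 5, 10, 20, 30):
        p_at_least_once = 1 - (1 - p_per_year) ** years
        print(f"P(at least one event within {years:2d} years) = {p_at_least_once:.0%}")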
hvb2
Absolutely, but the cost of perfection (100% uptime in this case) is infinite.
As long as the outages are rare enough and you automatically fail over to a different region, what's the problem?
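As a sketch of what that failover can look like at the client level — assuming the data is already replicated to the second region, e.g. as a DynamoDB global table (my assumption, not something stated above):

    import boto3
    from botocore.config import Config
    from botocore.exceptions import BotoCoreError, ClientError

    # Regions to try, in order; assumes the table is replicated to both.
    REGIONS = ["us-east-1", "us-west-2"]

    _clients = {
        region: boto3.client(
            "dynamodb",
            region_name=region,
            config=Config(connect_timeout=2, read_timeout=2,
                          retries={"max_attempts": 1}),
        )
        for region in REGIONS
    }

    def get_item_with_failover(table, key):
        last_error = None
        for region in REGIONS:
            try:
                return _clients[region].get_item(TableName=table, Key=key)
            except (BotoCoreError, ClientError) as exc:
                last_error = exc  # region looks unhealthy, try the next one
        raise RuntimeError(f"all regions failed: {last_error}")

    # Hypothetical usage:
    # item = get_item_with_failover("orders", {"order_id": {"S": "1234"}})

The loop is the easy part; the real cost is making sure the data and the rest of the stack are actually usable in the second region.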
raincole
Most companies just aren't important enough to worry about "AWS never come back up." Planning for this case is just like planning for a terrorist blowing up your entire office. If you're the Pentagon sure you'd better have a plan for that. But most companies are not the Pentagon.
Frieren
> Most companies just aren't important enough to worry about "AWS never come back up."
But a large enough number of "not too big to fail" companies becomes a too-big-to-fail event. Too many medium-sized companies have a total dependency on AWS or Google or Microsoft, if not several at the same time.
We live in an increasingly fragile society, one step closer to critical failure, because big tech is not regulated in the same way as other infrastructure.
raincole
Well, I agree. I kinda think the AI apocalypse would not be Skynet killing us, but malware being patched onto every Tesla, causing a million crashes tomorrow morning.
anal_reactor
First, planning for an AWS outage is pointless. Unless you provide a service of national-security importance or something, your customers are going to understand that when there's a global internet outage, your service doesn't work either. The cost of maintaining a working failover across multiple cloud providers is just too high compared to the potential benefits. It's astonishing that so few engineers understand that maintaining a technically beautiful solution costs time and money, which might not make for a justified business case.
Second, preparing for the disappearance of AWS is even sillier. The chance that it will happen is orders of magnitude smaller than the chance that the cost of preparing for such an event will kill your business.
Let me ask you: how do you prepare your website for the complete collapse of western society? Will you be able to adapt your business model to a post-apocalyptic world where only the cockroaches are left?
jacquesm
> Let me ask you: how do you prepare your website for the complete collapse of western society?
How did we go from 'you could lose your AWS account' to 'complete collapse of western society'? Do websites even matter in that context?
> Second, preparing for the disappearance of AWS is even more silly.
What's silly is not thinking ahead.
csomar
> Now imagine for a bit that it will never come back up.
Given the current geopolitical circumstances, that's not a far-fetched scenario. Especially for us-east-1, or anything in the D.C. metro area.
chibea
One main problem that we observed was that big parts of their IAM / auth setup were overloaded or down, which led to all kinds of cascading problems. It sounds as if DynamoDB was reported to be a root cause, so is IAM dependent on DynamoDB internally?
Of course, such a large control-plane system has all kinds of complex dependency chains. Auth/IAM seems like such a potential (global) SPOF that you'd want to reduce its dependencies to an absolute minimum. On the other hand, it's also the place that needs really good scalability, consistency, etc., so you'd probably like to use the battle-tested DB infrastructure you already have in place. Does that mean you end up with a complex cyclic dependency that needs complex bootstrapping when it goes down? Or how is that handled?
glemmaPaul
LOL. Make one DB service a central point of failure, charge gold for small compute instances, rage about needing Multi-AZ, and push the costs onto the developer/organization. But now it fails at the region level, so are we going to need multi-country setups for simple small applications?
DrScientist
According to their status page, the fault was in DNS resolution of the DynamoDB service endpoints.
Everything depends on DNS....
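You could watch that particular failure mode from the outside with nothing but the standard library, something like:

    import socket
    import time

    ENDPOINT = "dynamodb.us-east-1.amazonaws.com"

    def probe(hostname):
        try:
            infos = socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP)
            addrs = sorted({info[4][0] for info in infos})
            print(f"{time.strftime('%H:%M:%S')} {hostname} -> {', '.join(addrs)}")
        except socket.gaierror as exc:
            # Roughly what clients saw during the incident: the name
            # simply failed to resolve.
            print(f"{time.strftime('%H:%M:%S')} {hostname} -> resolution failed: {exc}")

    while True:
        probe(ENDPOINT)
        time.sleep(30)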
mlrtime
Dynamo had an outage last year, if I recall correctly.
KettleLaugh
We may be distributed, but we die united...
hangsi
Divided we stand,
United we fall.
glemmaPaul
AWS Communist Cloud
Hamuko
I thought it was a pretty well-known issue that the rest of AWS depends on us-east-1 working. Basically any other AWS region can get hit by a meteor without bringing down everything else – except us-east-1.
yellow_lead
But it seems like only us-east-1 is down today, is that right?
dikei
Some global services have their control plane located only in `us-east-1`, without which they become read-only at best, or even fail outright.
https://docs.aws.amazon.com/whitepapers/latest/aws-fault-iso...
philipallstar
Just don't buy it if you don't want it. No one is forced to buy this stuff.
benterix
> No one is forced to buy this stuff.
Actually, many companies are de facto forced to do that, for various reasons.
philipallstar
How so?
tonypapousek
Looks like they’re nearly done fixing it.
> Oct 20 3:35 AM PDT
> The underlying DNS issue has been fully mitigated, and most AWS Service operations are succeeding normally now. Some requests may be throttled while we work toward full resolution. Additionally, some services are continuing to work through a backlog of events such as Cloudtrail and Lambda. While most operations are recovered, requests to launch new EC2 instances (or services that launch EC2 instances such as ECS) in the US-EAST-1 Region are still experiencing increased error rates. We continue to work toward full resolution. If you are still experiencing an issue resolving the DynamoDB service endpoints in US-EAST-1, we recommend flushing your DNS caches. We will provide an update by 4:15 AM, or sooner if we have additional information to share.
chibea
It's a bit funny that they say "most service operations are succeeding normally now" when, in fact, you cannot yet launch or terminate new EC2 instances, which is basically the defining feature of the cloud...
rswail
Only in that region; other regions are able to launch EC2 and ECS/EKS instances without a problem.
weberer
Llama-5-beelzebub has escaped containment. A special task force has been deployed to the Virginia data center to pacify it.
kalleboo
It's fun watching their list of "Affected Services" grow literally in front of your eyes as they figure out how many things have this dependency.
It's still missing the one that earned me a phone call from a client.
zenexer
It's seemingly everything. SES was the first one that I noticed, but from what I can tell, all services are impacted.
mlrtime
When these major issues come up, all they have is symptoms, not causes. Maybe not until the DynamoDB on-call comes on and says it's down does everyone know at least the reason for their team's outage.
The scale here is so large that they don't know the complete dependency tree until teams check in on what is out or not, growing this list. Of course most of it is automated, but getting onto 'Affected Services' is not.
hvb2
In AWS, if you take out one of DynamoDB, S3, or Lambda, you're going to be in a world of pain. Any architecture will likely use those somewhere, including all the other services built on top.
If the storage service in your own datacenter goes down, how much remains running?
Aldipower
My minor 2000-user web app hosted on Hetzner works, FYI. :-P
aembleton
Right up until the DNS fails
Aldipower
I am using ClouDNS, an Anycast DNS provider. My hope is that they are more reliable. But yeah, it is still DNS and it will fail. ;-)
mlrtime
But how are you going to web scale it!? /s
Aldipower
Web scale? It is a _web_ app, so it is already web scaled, hehe.
Seriously, this thing already runs on 3 servers: a primary + backup, and a secondary in another datacenter/provider at Netcup. DNS is with an Anycast DNS provider called ClouDNS. Everything is still way cheaper than AWS. The database is already replicated for reads, and I could switch to sharding if necessary. I can easily scale to 5, 7, whatever dedicated servers, but I do not have to right now. The primary is at 1% (sic!) load.
There really is no magic behind this. And you have to write your application in a distributable way anyway; you need to understand the concepts of statelessness, write-locking, etc. with AWS as well.
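For what it's worth, the read-replica part of a setup like that can be as small as this sketch (hypothetical hosts, and it assumes a PostgreSQL-style primary/replica pair, which isn't stated above):

    import psycopg2

    # Hypothetical connection strings for a primary and a read replica.
    PRIMARY_DSN = "host=primary.example.net dbname=app user=app"
    REPLICA_DSN = "host=replica.example.net dbname=app user=app"

    def run_query(sql, params=(), write=False):
        # Writes always go to the primary; reads prefer the replica and
        # fall back to the primary if the replica is unreachable.
        dsns = [PRIMARY_DSN] if write else [REPLICA_DSN, PRIMARY_DSN]
        last_error = None
        for dsn in dsns:
            try:
                with psycopg2.connect(dsn, connect_timeout=2) as conn:
                    with conn.cursor() as cur:
                        cur.execute(sql, params)
                        return cur.fetchall() if cur.description else None
            except psycopg2.OperationalError as exc:
                last_error = exc  # try the next candidate
        raise RuntimeError(f"no database reachable: {last_error}")

    # run_query("SELECT count(*) FROM users")                           # read
    # run_query("INSERT INTO users (name) VALUES (%s)", ("bob",), write=True)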
nextaccountic
Is this why reddit is down? (https://www.redditstatus.com/ still says it is up but with degraded infrastructure)
krowek
Shameless of them to make it look like it's a user problem. It was loading fine for me an hour ago; now I refresh the page and their message states I'm doing too many requests and should chill out (1 request per hour is too many for you?).
etothet
Never ascribe to malice that which is adequately explained by incompetence.
It’s likely that, like many organizations, this scenario isn’t something Reddit are well prepared for in terms of correct error messaging.
anal_reactor
I remember I made a website and then got a report that it didn't work on the newest Safari. Obviously, Safari would crash with a message blaming the website. Bro, no website should ever make your shitty browser outright crash.
kaptainscarlet
I got a rate-limit error, which didn't make sense since it was my first time opening Reddit in hours.
igleria
Funny that even though we have our app running fine in AWS Europe, we are affected as developers because npm/Docker/etc. are down. Oh well.
dijit
AWS has made the internet into a single point of failure.
What's the point of all the auto-healing node-graph systems that were designed in the 70s and refined over decades if we're just going to do mainframe development anyway?
voidUpdate
To be fair, there is another point of failure: Cloudflare. It seems like half the internet goes down when Cloudflare has one of their moments.
polaris64
It looks like DNS has been restored:
    dynamodb.us-east-1.amazonaws.com. 5 IN A 3.218.182.189
miyuru
I wonder if the new endpoint was affected as well.
dynamodb.us-east-1.api.aws
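One way to check is to resolve both names over IPv4 and IPv6 side by side (a sketch; as far as I know the .api.aws name is the dual-stack endpoint):

    import socket

    ENDPOINTS = [
        "dynamodb.us-east-1.amazonaws.com",  # legacy regional endpoint
        "dynamodb.us-east-1.api.aws",        # newer dual-stack endpoint
    ]

    for host in ENDPOINTS:
        for family, label in ((socket.AF_INET, "A"), (socket.AF_INET6, "AAAA")):
            try:
                infos = socket.getaddrinfo(host, 443, family=family,
                                           proto=socket.IPPROTO_TCP)
                addrs = sorted({info[4][0] for info in infos})
                print(f"{host} [{label}]: {', '.join(addrs)}")
            except socket.gaierror:
                print(f"{host} [{label}]: no answer")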
testemailfordg2
Seems like we need more antitrust cases against AWS, or we need to break it up; it is becoming too big. Services used in the rest of the world get impacted by issues in one region.
gbalduzzi
Twilio is down worldwide: https://status.twilio.com/