AWS Multiple Services Down in us-east-1
120 comments
·October 20, 2025
jacquesm
Every week or so we interview a company and ask them if they have a fall-back plan in case AWS goes down or their cloud account disappears. They always have this deer-in-the-headlights look. 'That can't happen, right?'
Now imagine for a bit that it will never come back up. See where that leads you. The internet got its main strengths from the fact that it was completely decentralized. We've been systematically eroding that strength.
padjo
Planning for an AWS outage is a complete waste of time and energy for most companies. Yes it does happen but very rarely to the tune of a few hours every 5-10 years. I can almost guarantee that whatever plans you have won’t get you fully operational faster than just waiting for AWS to fix it.
mlrtime
>Yes it does happen but very rarely to the tune of a few hours every 5-10 years.
It is rare, but it happens at LEAST 2-3x a year. AWS us-east-1 has a major incident affecting multiple services (which affect most downstream AWS services) multiple times a year, rarely with the same root cause.
Not very many people realize that there are some services that still run only in us-east-1.
energy123
It happens 2-3x a year during peacetime. Tail events are not homogeneously distributed across time.
sreekanth850
Depends on how serious you are about your SLAs.
lucideer
> to the tune of a few hours every 5-10 years
I presume this means you must not be working for a company running anything at scale on AWS.
skywhopper
That is the vast majority of customers on AWS.
davedx
> I can almost guarantee that whatever plans you have won’t get you fully operational faster than just waiting for AWS to fix it.
Absurd claim.
Multi-region and multi-cloud are strategies that have existed almost as long as AWS. If your company is in anything finance-adjacent or critical infrastructure then you need to be able to cope with a single region of AWS failing.
padjo
It's not absurd, I've seen it happen. A company executes its DR plan due to an AWS outage, AWS is back before the DR is complete, the DR has to be aborted, and the service is down longer than if they'd just waited.
Of course there are cases where multi-cloud makes sense, but they are in the minority. The absurd claim is that most companies should worry about cloud outages and plan for AWS to go offline forever.
afro88
> If your company is in anything finance-adjacent or critical infrastructure
GP said:
> most companies
Most companies aren't finance-adjacent or critical infrastructure
antihero
My website running on an old laptop in my cupboard is doing just fine.
whatevaa
When your laptop dies it's gonna be a pretty long outage too.
api
I have this theory of something I call “importance radiation.”
An old mini PC running a home media server and a Minecraft server will run flawlessly forever. Put something of any business importance on it and it will fail tomorrow.
jacquesm
Thank you for illustrating my point. You didn't even bother to read the second paragraph.
chanux
I get you. I am with you. But isn't money/resources always a constraint on having a solid backup solution?
I guess the reason people are not doing it is that it hasn't been demonstrated to be worth it, yet!
I've got to admit though, whenever I hear about having a backup plan I think of an apples-to-apples copy elsewhere, which is probably not wise/viable anyway. Perhaps having just enough to reach out to the service users/customers would suffice.
Also I must add I am heavily influenced by a comment by Adrian Cockcroft on why going multi-cloud isn't worth it. He worked for AWS (at the time at least) so I should probably have reached for the salt dispenser.
shawabawa3
> Now imagine for a bit that it will never come back up. See where that leads you. The internet got its main strengths from the fact that it was completely decentralized. We've been systematically eroding that strength.
my business contingency plan for "AWS shuts down and never comes back up" is to go bankrupt
mlrtime
We all read it. AWS not coming back up is your point on not having a backup plan?
You might as well say the entire NY + DC metro loses power and "never comes back up." What is the plan around that? The person replying is correct: most companies do not have an actionable plan for AWS never coming back up.
I worked at a medium-large company and was responsible for reviewing the infrastructure BCP plan. It stated that AWS going down was a risk, and if it happens we wait for it to come back up. (In a lot more words than that.)
pmontra
In the case of a customer of mine, the AWS outage manifested itself as Twilio failing to deliver SMSes. The fallback plan was to disable the rotation of our two SMS providers and send all messages with the remaining one. But what if the other one had something on AWS too? Or maybe both of them have something else vital on Azure, or Google Cloud, which will fail next week and stop our service. Who knows?
For small and medium sized companies it's not easy to perform accurate due diligence.
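A minimal sketch of that kind of fallback, with hypothetical send_via_twilio / send_via_backup wrappers (not the actual setup described above): take the failing provider out of rotation and everything goes through the one that's left.

    # Hypothetical provider wrappers; in reality these would call the
    # Twilio SDK and the second provider's API.
    def send_via_twilio(to, body):
        raise RuntimeError("Twilio unreachable (upstream AWS outage)")

    def send_via_backup(to, body):
        print(f"sent to {to} via backup provider: {body}")

    PROVIDERS = {
        "twilio": {"send": send_via_twilio, "enabled": True},
        "backup": {"send": send_via_backup, "enabled": True},
    }

    def send_sms(to, body):
        # Normally both providers are in rotation; disabling one (e.g. during
        # an outage) collapses the rotation to the remaining provider. A send
        # failure also falls through to the next enabled provider.
        last_error = None
        for provider in PROVIDERS.values():
            if not provider["enabled"]:
                continue
            try:
                provider["send"](to, body)
                return True
            except Exception as exc:
                last_error = exc
        raise RuntimeError(f"all SMS providers failed: {last_error}")

    # During the outage: take Twilio out of rotation entirely.
    PROVIDERS["twilio"]["enabled"] = False
    send_sms("+15555550123", "your verification code is 123456")

The hard part, as noted above, is knowing whether the "remaining" provider has its own hidden dependency on the same cloud.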
freetanga
Additionally, I find that most hyperscalers try to lock you in by taking industry-standard services and tailoring them with custom features, which end up putting down roots and making a multi-vendor setup or a lift-and-shift problematic.
You need to keep your eyes peeled at all levels of the organization, as many of these creep in through day-to-day work…
jacquesm
Yes, they're really good at that. This is just 'embrace and extend'. We all know the third.
hvb2
> The internet got its main strengths from the fact that it was completely decentralized.
Decentralized in terms of many companies making up the internet. Yes, we've seen heavy consolidation: fewer than 10 companies now make up the bulk of the internet.
The problem here isn't caused by companies choosing one cloud provider over another. It's the economies of scale leading us to a few large companies in any sector.
lentil_soup
> Decentralized in terms of many companies making up the internet
Not companies, the protocols are decentralized, and at some point it was mostly non-companies. Anyone can hook up a computer and start serving requests, which was/is a radical concept. We've lost a lot, unfortunately.
hvb2
No we've not lost that at all. Nobody prevents you from doing that.
We have put more and more services on fewer and fewer vendors. But that's the consolidation and cost point.
jacquesm
I think one reason is that people are just bad at statistics. Chance of materialization * impact = small. Sure. Over a short enough time that's true for any kind of risk. But companies tend to live for years, decades even and sometimes longer than that. If we're going to put all of those precious eggs in one basket, as long as the basket is substantially stronger than the eggs we're fine, right? Until the day someone drops the basket. And over a long enough time span all risks eventually materialize. So we're playing this game, and usually we come out ahead.
But trust me, 10 seconds after this outage is solved everybody will have forgotten about the possibility.
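To put rough numbers on that (purely hypothetical ones, just to show how the time horizon changes the picture):

    # Hypothetical numbers: a catastrophic cloud event (account loss,
    # prolonged regional outage) with a small yearly probability.
    p_per_year = 0.02          # 2% chance in any given year
    impact = 2_000_000         # cost in dollars if it happens

    print(f"expected loss per year: ${p_per_year * impact:,.0f}")

    # Over a company lifetime, "probably not this year" stops being comforting.
    for years in (1, 5, 10, 20, 30):
        p_at_least_once = 1 - (1 - p_per_year) ** years
        print(f"P(at least one event within {years:2d} years) = {p_at_least_once:.0%}")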
hvb2
Absolutely, but the cost of perfection (100% uptime in this case) is infinite.
As long as the outages are rare enough and you automatically fail over to a different region, what's the problem?
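As a sketch of what that failover can look like at the client level — assuming the data is already replicated to the second region, e.g. as a DynamoDB global table (my assumption, not something stated above):

    import boto3
    from botocore.config import Config
    from botocore.exceptions import BotoCoreError, ClientError

    # Regions to try, in order; assumes the table is replicated to both.
    REGIONS = ["us-east-1", "us-west-2"]

    _clients = {
        region: boto3.client(
            "dynamodb",
            region_name=region,
            config=Config(connect_timeout=2, read_timeout=2,
                          retries={"max_attempts": 1}),
        )
        for region in REGIONS
    }

    def get_item_with_failover(table, key):
        last_error = None
        for region in REGIONS:
            try:
                return _clients[region].get_item(TableName=table, Key=key)
            except (BotoCoreError, ClientError) as exc:
                last_error = exc  # region looks unhealthy, try the next one
        raise RuntimeError(f"all regions failed: {last_error}")

    # Hypothetical usage:
    # item = get_item_with_failover("orders", {"order_id": {"S": "1234"}})

The loop is the easy part; the real cost is making sure the data and the rest of the stack are actually usable in the second region.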
raincole
Most companies just aren't important enough to worry about "AWS never come back up." Planning for this case is just like planning for a terrorist blowing up your entire office. If you're the Pentagon sure you'd better have a plan for that. But most companies are not the Pentagon.
Frieren
> Most companies just aren't important enough to worry about "AWS never come back up."
But a large enough number of "not too big to fail" companies becomes a too-big-to-fail event. Too many medium-sized companies have a total dependency on AWS or Google or Microsoft, if not several at the same time.
We live in an increasingly fragile society, one step closer to critical failure, because big tech is not regulated in the same way as other infrastructure.
raincole
Well, I agree. I kinda think the AI apocalypse would not be Skynet killing us, but malware being patched onto every Tesla, causing a million crashes tomorrow morning.
anal_reactor
First, planning for an AWS outage is pointless. Unless you provide a service of national-security importance or something, your customers are going to understand that when there's a global internet outage, your service doesn't work either. The cost of maintaining a working failover across multiple cloud providers is just too high compared to the potential benefits. It's astonishing that so few engineers understand that maintaining a technically beautiful solution costs time and money, which might not make for a justified business case.
Second, preparing for the disappearance of AWS is even sillier. The chance that it will happen is orders of magnitude smaller than the chance that the cost of preparing for such an event will kill your business.
Let me ask you: how do you prepare your website for the complete collapse of western society? Will you be able to adapt your business model to a post-apocalyptic world where only the cockroaches are left?
jacquesm
> Let me ask you: how do you prepare your website for the complete collapse of western society?
How did we go from 'you could lose your AWS account' to 'complete collapse of western society'? Do websites even matter in that context?
> Second, preparing for the disappearance of AWS is even more silly.
What's silly is not thinking ahead.
csomar
> Now imagine for a bit that it will never come back up.
Given the current geopolitical circumstances, that's not a far-fetched scenario. Especially for us-east-1, or anything in the D.C. metro area.
chibea
One main problem that we observed was that big parts of their IAM / auth setup were overloaded or down, which led to all kinds of cascading problems. It sounds as if DynamoDB was reported to be a root cause, so is IAM dependent on DynamoDB internally?
Of course, such a large control-plane system has all kinds of complex dependency chains. Auth/IAM seems like such a potential (global) SPOF that you'd want to reduce its dependencies to an absolute minimum. On the other hand, it's also the place that needs really good scalability, consistency, etc., so you'd probably like to use the battle-tested DB infrastructure you already have in place. Does that mean you end up with a complex cyclic dependency that needs complex bootstrapping when it goes down? Or how is that handled?
glemmaPaul
LOL. Make one DB service a central point of failure, charge gold for small compute instances, rage about needing Multi-AZ, and push the costs onto the developer/organization. But now it fails at the region level, so are we going to need multi-country setups for simple small applications?
DrScientist
According to their status page, the fault was in DNS resolution of the DynamoDB service endpoints.
Everything depends on DNS....
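You could watch that particular failure mode from the outside with nothing but the standard library, something like:

    import socket
    import time

    ENDPOINT = "dynamodb.us-east-1.amazonaws.com"

    def probe(hostname):
        try:
            infos = socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP)
            addrs = sorted({info[4][0] for info in infos})
            print(f"{time.strftime('%H:%M:%S')} {hostname} -> {', '.join(addrs)}")
        except socket.gaierror as exc:
            # Roughly what clients saw during the incident: the name
            # simply failed to resolve.
            print(f"{time.strftime('%H:%M:%S')} {hostname} -> resolution failed: {exc}")

    while True:
        probe(ENDPOINT)
        time.sleep(30)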
mlrtime
Dynamo had an outage last year, if I recall correctly.
KettleLaugh
We may be distributed, but we die united...
hangsi
Divided we stand,
United we fall.
glemmaPaul
AWS Communist Cloud
Hamuko
I thought it was a pretty well-known issue that the rest of AWS depends on us-east-1 working. Basically any other AWS region can get hit by a meteor without bringing down everything else – except us-east-1.
yellow_lead
But it seems like only us-east-1 is down today, is that right?
dikei
Some global services have their control plane located only in `us-east-1`, without which they become read-only at best, or even fail outright.
https://docs.aws.amazon.com/whitepapers/latest/aws-fault-iso...
philipallstar
Just don't buy it if you don't want it. No one is forced to buy this stuff.
benterix
> No one is forced to buy this stuff.
Actually, many companies are de facto forced to do that, for various reasons.
philipallstar
How so?
tonypapousek
Looks like they’re nearly done fixing it.
> Oct 20 3:35 AM PDT
> The underlying DNS issue has been fully mitigated, and most AWS Service operations are succeeding normally now. Some requests may be throttled while we work toward full resolution. Additionally, some services are continuing to work through a backlog of events such as Cloudtrail and Lambda. While most operations are recovered, requests to launch new EC2 instances (or services that launch EC2 instances such as ECS) in the US-EAST-1 Region are still experiencing increased error rates. We continue to work toward full resolution. If you are still experiencing an issue resolving the DynamoDB service endpoints in US-EAST-1, we recommend flushing your DNS caches. We will provide an update by 4:15 AM, or sooner if we have additional information to share.
chibea
It's a bit funny that they say "most service operations are succeeding normally now" when, in fact, you cannot yet launch or terminate new EC2 instances, which is basically the defining feature of the cloud...
rswail
Only in that region; other regions are able to launch EC2 and ECS/EKS instances without a problem.
weberer
Llama-5-beelzebub has escaped containment. A special task force has been deployed to the Virginia data center to pacify it.
kalleboo
It's fun watching their list of "Affected Services" grow literally in front of your eyes as they figure out how many things have this dependency.
It's still missing the one that earned me a phone call from a client.
zenexer
It's seemingly everything. SES was the first one that I noticed, but from what I can tell, all services are impacted.
mlrtime
When these major issues come up, all they have is symptoms, not causes. Maybe not until the DynamoDB on-call comes on and says it's down does everyone know at least the reason for their team's outage.
The scale here is so large that they don't know the complete dependency tree until teams check in on what is out or not, growing this list. Of course most of it is automated, but getting onto 'Affected Services' is not.
hvb2
In AWS, if you take out one of DynamoDB, S3, or Lambda, you're going to be in a world of pain. Any architecture will likely use those somewhere, including all the other services built on top.
If the storage service in your own datacenter goes down, how much remains running?
Aldipower
My minor 2000-user web app hosted on Hetzner works, FYI. :-P
aembleton
Right up until the DNS fails
Aldipower
I am using ClouDNS, an Anycast DNS provider. My hope is that they are more reliable. But yeah, it is still DNS and it will fail. ;-)
mlrtime
But how are you going to web scale it!? /s
Aldipower
Web scale? It is a _web_ app, so it is already web scaled, hehe.
Seriously, this thing already runs on 3 servers: a primary + backup, and a secondary in another datacenter/provider at Netcup. DNS is with an Anycast DNS provider called ClouDNS. Everything is still way cheaper than AWS. The database is already replicated for reads, and I could switch to sharding if necessary. I can easily scale to 5, 7, whatever dedicated servers, but I do not have to right now. The primary is at 1% (sic!) load.
There really is no magic behind this. And you have to write your application in a distributable way anyway; you need to understand the concepts of statelessness, write-locking, etc. with AWS as well.
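For what it's worth, the read-replica part of a setup like that can be as small as this sketch (hypothetical hosts, and it assumes a PostgreSQL-style primary/replica pair, which isn't stated above):

    import psycopg2

    # Hypothetical connection strings for a primary and a read replica.
    PRIMARY_DSN = "host=primary.example.net dbname=app user=app"
    REPLICA_DSN = "host=replica.example.net dbname=app user=app"

    def run_query(sql, params=(), write=False):
        # Writes always go to the primary; reads prefer the replica and
        # fall back to the primary if the replica is unreachable.
        dsns = [PRIMARY_DSN] if write else [REPLICA_DSN, PRIMARY_DSN]
        last_error = None
        for dsn in dsns:
            try:
                with psycopg2.connect(dsn, connect_timeout=2) as conn:
                    with conn.cursor() as cur:
                        cur.execute(sql, params)
                        return cur.fetchall() if cur.description else None
            except psycopg2.OperationalError as exc:
                last_error = exc  # try the next candidate
        raise RuntimeError(f"no database reachable: {last_error}")

    # run_query("SELECT count(*) FROM users")                           # read
    # run_query("INSERT INTO users (name) VALUES (%s)", ("bob",), write=True)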
nextaccountic
Is this why reddit is down? (https://www.redditstatus.com/ still says it is up but with degraded infrastructure)
krowek
Shameless of them to make it look like it's a user problem. It was loading fine for me an hour ago; now I refresh the page and their message states I'm doing too many requests and should chill out (1 request per hour is too many for you?).
etothet
Never ascribe to malice that which is adequately explained by incompetence.
It’s likely that, like many organizations, this scenario isn’t something Reddit are well prepared for in terms of correct error messaging.
anal_reactor
I remember I made a website and then got a report that it didn't work on the newest Safari. Obviously, Safari would crash with a message blaming the website. Bro, no website should ever make your shitty browser outright crash.
kaptainscarlet
I got a rate-limit error, which didn't make sense since it was my first time opening Reddit in hours.
igleria
Funny that even though we have our app running fine in AWS Europe, we are affected as developers because npm/Docker/etc. are down. Oh well.
dijit
AWS has made the internet into a single point of failure.
What's the point of all the auto-healing node-graph systems that were designed in the 70s and refined over decades if we're just going to do mainframe development anyway?
voidUpdate
To be fair, there is another point of failure: Cloudflare. It seems like half the internet goes down when Cloudflare has one of their moments.
polaris64
It looks like DNS has been restored:
    dynamodb.us-east-1.amazonaws.com. 5 IN A 3.218.182.189
miyuru
I wonder if the new endpoint was affected as well.
dynamodb.us-east-1.api.aws
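One way to check is to resolve both names over IPv4 and IPv6 side by side (a sketch; as far as I know the .api.aws name is the dual-stack endpoint):

    import socket

    ENDPOINTS = [
        "dynamodb.us-east-1.amazonaws.com",  # legacy regional endpoint
        "dynamodb.us-east-1.api.aws",        # newer dual-stack endpoint
    ]

    for host in ENDPOINTS:
        for family, label in ((socket.AF_INET, "A"), (socket.AF_INET6, "AAAA")):
            try:
                infos = socket.getaddrinfo(host, 443, family=family,
                                           proto=socket.IPPROTO_TCP)
                addrs = sorted({info[4][0] for info in infos})
                print(f"{host} [{label}]: {', '.join(addrs)}")
            except socket.gaierror:
                print(f"{host} [{label}]: no answer")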
testemailfordg2
Seems like we need more antitrust cases against AWS, or we need to break it up; it is becoming too big. Services used in the rest of the world get impacted by issues in one region.
gbalduzzi
Twilio is down worldwide: https://status.twilio.com/