
Major AWS Outage Happening

415 comments · October 20, 2025

littlecranky67

Just a couple of days ago, in this HN thread [0], there were quite a few users claiming Hetzner is not an option because its uptime isn't as good as AWS's, hence the higher AWS pricing is worth the investment. Oh, the irony.

[0]: https://news.ycombinator.com/item?id=45614922

jwr

As a data point, I've been running stuff at Hetzner for 10 years now, in two datacenters (physical servers). There were brief network outages when they replaced networking equipment, and exactly ONE outage for hardware replacement, scheduled weeks in advance, with a 4-hour window and around 1-2h duration.

It's just a single data point, but for me that's a pretty good record.

It's not that Hetzner is miraculously better at infrastructure; it's that physical servers are way simpler than the extremely complex software and networking systems that AWS provides.

jamesbelchamber

> physical servers are way simpler than the extremely complex software and networking systems that AWS provides.

Or, rather, it's your fault when the complex software and networking systems you deployed on top of those physical servers go wrong (:

jwr

Yes. Which is why I try to keep my software from being overly complex, for example by not succumbing to the Kubernetes craze.

sapiogram

Do you monitor your product closely enough to know that there weren't other brief outages? E.g. something on the scale of unscheduled server restarts, and minute-long network outages?

supriyo-biswas

I personally do, through status monitors hosted at larger cloud providers with 30-second resolution, and I've never noticed any downtime. They will sometimes drop ICMP, though, even when the host is alive and kicking.

jwr

Of course. It's a production SaaS, after all. But I don't monitor with sub-minute resolution.

lossolo

I do. Routers, switches, and power redundancy are solved problems in datacenter hardware. Network outages rarely occur because of these systems, and if any component goes down, there's usually an automatic failover. The only thing you might notice is TCP connections resetting and reconnecting, which typically lasts just a few seconds.

jpalomaki

When AWS is down, everybody knows it. People don't really question your hosting choice. It's the IBM of the cloud era.

JCM9

Yes, but those days are numbered. For many years AWS was in a league of its own. Now they’ve fallen badly behind in a growing number of areas and are struggling to catch up.

There’s a ton of momentum associated with the prior dominance, but between the big misses on AI, a general slow pace of innovation on core services, and a steady stream of top leadership and engineers moving elsewhere they’re looking quite vulnerable.

zjaffee

I couldn't agree more. There was clearly a big shift when Jassy became CEO of Amazon as a whole and Charlie Bell left (which is also interesting, because it's not like Azure is magically better now).

The improvements to core services at AWS haven't really happened at the same pace post-COVID as they did prior, but that could also have something to do with the overall maturity of the ecosystem.

Although it's also largely the case that other cloud providers have realized it's hard for them to compete against the core competency of other companies, whereas they'd still be selling the infrastructure the above services run on.

cmiles8

Looks like you’re being down voted for saying the quiet bit out loud. You’re not wrong though.

sisve

That is 100% true. You can't be fired for picking AWS... but I doubt it's the best choice for most people. Sad but true.

dijit

Schrödinger's user:

Simultaneously too confused to make their own UX choices, but smart enough to understand the backend of your infrastructure well enough to know why it doesn't work and excuse you for it.

zejn

You can't be fired, but you burn through your runway quicker. No matter which option you choose, there is some exothermic oxidative process involved.

ExoticPearTree

Every one of the big hyperscalers has a big outage from time to time.

Unless you lose a significant amount of money per minute of downtime, there is no incentive to go multicloud.

And multicloud has its own issues.

In the end, you live with the fact that your service might be down a day or two per year.

brazukadev

Usually, two founders creating a startup can't fire each other anyway, so a bad decision can still be very bad for lots of people on this forum.

patshead

On the other side of that coin, I am excited to be up and running while everyone else is down!

citrin_ru

On one hand it allows you to shift the blame, but on the other hand it shows a disadvantage of hyper-centralization: if AWS is down, too many important services are down at the same time, which makes things worse. E.g. when AWS is down it's important to have communication/monitoring services up so engineers can discuss and coordinate workarounds and have good visibility, but Atlassian was (is) significantly degraded today too.

jamesbelchamber

To back up this point, currently BBC News have it as their most significant story, with "live" reporting: https://www.bbc.co.uk/news/live/c5y8k7k6v1rt

This is alongside "live" reporting on the Israel/Gaza conflict as well as news about Epstein and the Louvre heist.

This is mainstream news.

addandsubtract

I like how their headline starts with Snapchat and Roblox being affected.

abujazar

That depends on the service. Far from everyone is on their PC or smartphone all day, and even fewer care about these kinds of news.

antihero

Amazon is up, what are they doing?

petesergeant

100%. When AWS was down, we'd say "AWS is down!", and our customers would get it. Saying "Hetzner is down!" raises all sorts of questions your customers aren't interested in.

sph

I've run a production application off Hetzner for a client for almost a decade and I don't think I have ever had to tell them "Hetzner is down", apart from planned maintenance windows.

neverminder

You can argue about Hetzner's uptime, but you can't argue about Hetzner's pricing, which is hands down the best there is. I'd rather go with Hetzner and cobble together some failover than pay AWS's extortionate prices.

Lio

For the price of AWS you could run Hetzner, add a second provider for resiliency, and still make a large saving.

Your margin is my opportunity indeed.

k4rli

I switched to netcup for an even cheaper private VPS for personal, non-critical hosting. I'd heard netcup was less reliable, but so far it's been 4+ months of uptime with no problems. Europe region.

Hetzner has the better web interface and supposedly better uptime, but I've had no problems with either. The web interface isn't really necessary anyway when you only use SSH and pay directly.

benterix

Exactly. Hetzner is the equivalent of the original Raspberry Pi. It might not have all the fancy features, but it delivers, at a price that essentially unblocks you and allows you to do things you wouldn't be able to do otherwise.

motorest

> I'd rather go with Hetzner and cobble up together some failover than pay AWS extortion.

Comments like this are so exaggerated that they risk moving the goodwill needle back to where it was before. Hetzner offers no service that is similar to DynamoDB, IAM or Lambda. If you are going to praise Hetzner as a valid alternative during a DynamoDB outage caused by DNS configuration, you would need to argue that a) Hetzner is a better option regarding DNS outages, and b) Hetzner is a preferable option for those who use serverless offerings.

I say this as a long-time Hetzner user. Hetzner is indeed cheaper, but don't pretend that Hetzner lets you click your way into a highly-available NoSQL data store. You need a non-trivial amount of your own work to develop, deploy, and maintain such a service.

1dom

> but don't pretend that Hetzner lets you click your way into a highly-available NoSQL data store.

The idea that you can click your way to a highly available, production-configured anything in AWS - especially involving Dynamo, IAM and Lambda - is something I've only heard from people who've done AWS quickstarts but never run anything at scale in AWS.

Of course nobody else offers AWS products, but people use AWS for solutions to their compute problems, and it can be easy to forget that virtually all other providers offer solutions to all the same problems.

esskay

Are you Netflix? Because if not, there's a 99% probability you don't need any of those AWS services and just have a severe case of shiny-object syndrome in your organisation.

Plenty of heavy-traffic, high-redundancy applications exist without the need for AWS's (or any other cloud provider's) overpriced "bespoke" systems.

mschuster91

> Hetzner offers no service that is similar to DynamoDB, IAM or Lambda.

The key thing you should ask yourself: do you need DynamoDB or Lambda? Like "need need" or "my resume needs Lambda".

ViewTrick1002

If you need the absolutely stupid scale DynamoDB enables, what is the difference compared to running, for example, FoundationDB on your own on Hetzner?

You will need specialized people in both cases.

sreekanth850

TBH, in my last 3 years with Hetzner, I never saw any downtime on my servers other than my own routine maintenance for OS updates. Location: Falkenstein.

ratg13

And I have seen them delete my entire environment including my backups due to them not following their own procedures.

Sure, if you configure offsite backups you can guard against this stuff, but as with anything in life, you get what you pay for.

whizzter

You really need your backup and failover procedures though: a friend bought a used server and the disk died fairly quickly, leaving him sour.

bert-ye

I work at a small/medium company with ~20 dedicated servers and ~30 cloud servers at Hetzner. Outages have happened, but we were lucky that the few times they did, it was never a problem / actual downtime.

One thing to note is that there were some scheduled maintenances where we needed to react.

ffsm8

Haha, yeah that's a nugget

> 99.99% uptime infra significantly cheaper than the cloud.

I guess that's another person that has never actually worked in the domain (SRE/admin) but still wants to talk with confidence on the topic.

Why do I say that? Because 99.99% is frickin easy

That's almost one full hour of complete downtime per year.

It only gets hard in the 99.9999+ range... and you rarely meet that range with cloud providers either, as requests still fail for some reason, like a random 503 when a container is decommissioned or similar.

krsdcbl

We've been running our services on Hetzner for 10 years and have never experienced any significant outages.

That might be datacenter-dependent, of course, since our root servers and cloud services are all hosted in Europe, but I've really never understood why Hetzner is said to be less reliable.

me551ah

I’m more curious to understand how we ended up creating a single point of failure across the whole internet.

kbrkbr

It can still be true that the uptime is better, or am I overlooking something?

aiiizzz

Nah you're definitely correct.

0x002A

As Amazon moves from the "Day 1" company it once claimed to be toward a sales company like Oracle, focused on raking in money, expect more outages to come, and expect them to take longer to resolve.

Amazon is burning out and driving away technical talent and knowledge, knowing the vendor lock-in will keep bringing in the sweet money. You will see more salespeople hovering around your C-suite and executives, while you face even worse technical support that doesn't seem to know what it's talking about, let alone able to fix the issue you expected to be fixed easily.

Mark my words: if you are putting your eggs in one basket, that basket is now too complex and too interdependent, and the people who built it and knew its intricacies have been driven away by RTO and move-to-hub mandates. The services that everything else (including other AWS services) depends on so heavily may be more fragile than the public knows.

hopelite

That is why the technical leader's role should demand that they not only gather data, but also report things like accurate operational, alternative, and scenario cost analyses; financial risks; vendor lock-in; etc.

However, as may be apparent from just that small set, it is not exactly something technical people often feel comfortable doing. That is why, at least in some organizations, you get the friction of a business type interfacing with technical people in varying ways but not really getting along, because they don't understand each other and there are often barriers to openness.

zht

Do you have data suggesting AWS outages are more frequent and/or take longer to resolve?

0x002A

This is a prediction, not a historical pattern to be observed now. Only future data can verify if this prediction was correct or not.

JCM9

Have a meeting today with our AWS account team about how we're no longer going to be "all in on AWS" as we diversify workloads away. The decision was mostly about the pace of innovation on core services slowing and AWS being too far behind on AI services, so we're buying those from elsewhere.

The AWS team keeps touting the rock solid reliability of AWS as a reason why we shouldn’t diversify our cloud. Should be a fun meeting!

cmiles8

This. When Andy Jassy was challenged by analysts on the last earnings call about why AWS has fallen so far behind on innovation in some areas, his answer was a hand-wavy response that diverted attention by saying AWS is durable, stable, and reliable and that customers care more about that. Oops.

judahmeek

Behind on innovation how, exactly?

JCM9

I listened to the earnings call. I believe the question was mostly focused on why AWS has been so behind on AI. Jassy did flub the question quite badly and rambled on for a while. The press has mentioned the botched answer in a few articles recently.

GoblinSlayer

But then you will be affected by outages of every dependency you use.

wrasee

Please tell me there was a mixup and for some reason they didn’t show up.

stepri

“Based on our investigation, the issue appears to be related to DNS resolution of the DynamoDB API endpoint in US-EAST-1. We are working on multiple parallel paths to accelerate recovery.”

It’s always DNS.

Nextgrid

I wonder how much of this is "DNS resolution" vs "underlying config/datastore of the DNS server is broken". I'd expect the latter.

babarjaana

Dumb question but what's the difference between the two? If the underlying config is broken then DNS resolution would fail, and that's basically the only way resolution fails, no?

wdfx

... wonders if the dns config store is in fact dynamodb ...

huflungdung

I don’t think it is DNS. The DNS A records were 2h before they announced it was DNS but _after_ reporting it was a DNS issue.

koliber

It's always US-EAST-1 :)

oneeyedpigeon

Downtime Never Stops!

commandersaki

Someone probably failed to lint the zone file.

DrewADesign

DNS strikes me as the kind of solution someone designed thinking “eh, this is good enough for now. We can work out some of the clunkiness when more organizations start using the Internet.” But it just ended up being pretty much the best approach indefinitely.

movpasd

Seems like an example of "worse is better". The worse solution has better survival characteristics (on account of getting actually made).

bayindirh

Even when it's not DNS, it's DNS.

us0r

Or expired domains which I suppose is related?

shamil0xff

Might just be BGP dressed as DNS

amadeoeoeo

Oh no... maybe LaLiga found out the pirates are hosting on AWS?

agos

This is how I discover that it's not just Serie A doing these shenanigans. I'm not really surprised.

sofixa

All the big leagues take "piracy" very seriously and constantly try to clamp down on it.

TV rights are one of their main revenue sources, and they're expected to always go up, so they see "piracy" as a fundamental threat. IMO, it's a fundamental misunderstanding on their side, because people "pirating" usually don't have a choice - either there is no option for them to pay for the content (e.g. the UK's 3pm blackout), or it's too expensive and/or spread out. People in the UK have to pay for 3-4 different subscriptions to access all local games.

The best solution, by far, is what France's Ligue 1 just did (out of necessity, though: nobody was paying them what they wanted for the rights after the previous debacles). A Ligue 1+ streaming service, owned and operated by them, which you can access in a variety of different ways (a regular old paid TV channel, Amazon Prime, DAZN, beIN Sports), whichever suits you best. Same acceptable price for all games.

fairity

As this incident unfolds, what’s the best way to estimate how many additional hours it’s likely to last? My intuition is that the expected remaining duration increases the longer the outage persists, but that would ultimately depend on the historical distribution of similar incidents. Is that kind of data available anywhere?

greybeard69

To my understanding the main problem is DynamoDB being down, and DynamoDB is what a lot of AWS services use for their eventing systems behind the scenes. So there's probably like 500 billion unprocessed events that'll need to get processed even when they get everything back online. It's gonna be a long one.

jewba

500 billion events. It always blows my mind how many people use AWS.

Implicated

I know nothing. But I'd imagine the number of 'events' generated during this period of downtime will eclipse that number every minute.

froobius

Yes, with no prior knowledge the mathematically correct estimate is:

time left = time so far

But as you note, prior knowledge will enable a better guess.

matsemann

Yeah, the Copernican Principle.

> I visited the Berlin Wall. People at the time wondered how long the Wall might last. Was it a temporary aberration, or a permanent fixture of modern Europe? Standing at the Wall in 1969, I made the following argument, using the Copernican principle. I said, Well, there’s nothing special about the timing of my visit. I’m just travelling—you know, Europe on five dollars a day—and I’m observing the Wall because it happens to be here. My visit is random in time. So if I divide the Wall’s total history, from the beginning to the end, into four quarters, and I’m located randomly somewhere in there, there’s a fifty-percent chance that I’m in the middle two quarters—that means, not in the first quarter and not in the fourth quarter.

> Let’s suppose that I’m at the beginning of that middle fifty percent. In that case, one-quarter of the Wall’s ultimate history has passed, and there are three-quarters left in the future. In that case, the future’s three times as long as the past. On the other hand, if I’m at the other end, then three-quarters have happened already, and there’s one-quarter left in the future. In that case, the future is one-third as long as the past.

https://www.newyorker.com/magazine/1999/07/12/how-to-predict...

tsimionescu

This thought process suggests something very wrong. The guess "it will last again as long as it has lasted so far" doesn't give any real insight. The wall was actually as likely to end five months from when they visited it, as it was to end 500 years from then.

What this "time-wise Copernican principle" gives you is a guarantee that, if you apply this logic every time you have no other knowledge and have to guess, you will get the least mean error over all of your guesses. For some events, you'll guess that they'll end in 5 minutes, and they actually end 50 years later. For others, you'll guess they'll take another 50 years and they actually end 5 minutes later. Add these two up and overall you get 0 - you won't have a bias toward either overestimating or underestimating.

But this doesn't actually give you any insight into how long the event will actually last. For a single event, with no other knowledge, the probability that it will end after 1 minute is equal to the probability that it will end after the same duration that it has lasted so far, and it is equal to the probability that it will end after a billion years. There is nothing at all that you can say about the probability of an event ending from pure mathematics like this - you need event-specific knowledge to draw any conclusions.

So while this Copernican principle sounds very deep and insightful, it is actually just a pretty trite mathematical observation.

hshdhdhehd

Is this a weird Monty Hall thing where the person next to you didn't visit the wall randomly (maybe they decided to visit on some anniversary of the wall), so for them the expected lifetime of the wall is different?

tsimionescu

Note that this is equivalent to saying "there's no way to know". This guess doesn't give any insight, it's just the function that happens to minimize the total expected error for an unknowable duration.

Edit: I should add that, more specifically, this is a property of the uniform distribution, it applies to any event for which EndsAfter(t) is uniformly distributed over all t > 0.

froobius

I'm not sure about that. Is it not sometimes useful for decision making, when you don't have any insight as to how long a thing will be? It's better than just saying "I don't know".

rwky

Generally, expect issues for the rest of the day: AWS will recover slowly, then anyone who relies on AWS will recover slowly. All the background jobs that are stuck will need processing.

seydor

1440 min

emrodre

Their status page (https://health.aws.amazon.com/health/status) says the only disrupted service is DynamoDB, but it's impacting 37 other services. It is amazing to see how big a blast radius a single service can have.

jamesbelchamber

It's not surprising that it's impacting other services in the region because DynamoDB is one of those things that lots of other services build on top of. It is a little bit surprising that the blast radius seems to extend beyond us-east-1, mind.

In the coming hours/days we'll find out if AWS still have significant single points of failure in that region, or if _so many companies_ are just not bothering to build in redundancy to mitigate regional outages.

I'm looking forward to the RCA!

XorNot

I'm really curious how much of AWS GovCloud has kept running through this, actually. But even if it's fine, from a strategic perspective, how much damage did we just discover you could do with a targeted disruption at the right time?

thmpp

AWS engineers are trained to use their internal services for each new system. They seem to like using DynamoDB. Dependencies like this should be made transparent.

Nextgrid

Not sure why this is downvoted - this is absolutely correct.

A lot of AWS services under the hood depend on others, and especially us-east-1 is often used for things that require strong consistency like AWS console logins/etc (where you absolutely don't want a changed password or revoked session to remain valid in other regions because of eventual consistency).

bsjaux628

Not "like using", they are mandated from the top to use DynamoDB for any storage. At my org in the retail page, you needed director approval if you wanted to use a relational DB for a production service.

nevada_scout

It's now listing 58 impacted services, so the blast radius is growing, it seems.

littlecranky67

The same page now says 58 services - just 23 minutes after your post. Seems this is becoming a larger issue.

kalleboo

When I first visited the page it said like 23 services, now it says 65

oneeyedpigeon

74 now. This is an extreme way of finding out just how many AWS services there really are!

SeanAnderson

Looks like it affected Vercel, too. https://www.vercel-status.com/

My website is down :(

(EDIT: website is back up, hooray)

l5870uoo9y

Static content resolves correctly but data fetching is still not functional.

maximefourny

Have you done anything for it to be back up? Looks like mines are still down.

LostMyLogin

Looks as if they are rerouting to a different region.

hugh-avherald

mines are generally down

TiredOfLife

A service that runs on AWS is down when AWS is down. Who knew.

rirze

We just had a power outage in Ashburn starting at 10pm Sunday night. It was restored at around 3:40am, and I know datacenters have redundant power sources, but the timing is very suspicious. The AWS outage supposedly started at midnight.

Hilift

Even with redundancy, the response time between NYC and Amazon East in Ashburn is something like 10 ms. The impedance mismatch, dropped packets, and increased latency would doom most organizations' craplications.

OliverGuy

Their latest update on the status page says it's a DynamoDB DNS issue.

shawabawa3

but the cause of that could be anything, including some kind of config getting wiped due to a temporary power outage

mittermayr

Careful: npm _says_ they're up (https://status.npmjs.org/), but I am seeing a lot of packages not updating and npm install taking forever or never finishing. So hold off on deploying for now if you're dependent on it.

olex

They've acknowledged an issue now on the status page. For me at least, it's completely down: package installation straight up doesn't work. Thankfully my current work project uses a pull-through mirror that allows us to continue working.

tonyhart7

"Thankfully current work project uses a pull-through mirror that allows us to continue working."

so there is no free coffee time???? lmao

gjvr

Yep. It's the auditing part that is broken. As a (dangerous) workaround, use --no-audit.

drinchev

Also npm audit times out.

me551ah

We created a single point of failure on the Internet, so that companies could avoid single points of failure in their data centers.

frays

Robinhood's completely down. Even their main website: https://robinhood.com/

mittermayr

Amazing. I wonder what their interview process is like (probably whiteboarding a next-gen LLM in WASM), and meanwhile their entire website goes down with us-east-1... I mean.
