Major AWS outage takes down Fortnite, Alexa, Snapchat, and more

comrade1234

I like that we can advertise to our customers that over the last X years we have had better uptime than Amazon, Google, etc.

eska

Just yesterday I saw another Hetzner thread where someone claimed AWS beats them in uptime and someone else blasted AWS for huge incidents. I bet his coffee tastes better this morning.

VBprogrammer

I honestly wonder if there is safety in the herd here. If you have a dedicated server in a rack somewhere that goes down and takes your site with it, or even if the whole data center has connectivity issues, then as far as the customer is concerned, you screwed up.

If you are on AWS and AWS goes down, that's covered in the news as a bunch of billion dollar companies were also down. Customer probably gives you a pass.

BrentOzar

> If you are on AWS and AWS goes down, that's covered in the news as a bunch of billion dollar companies were also down. Customer probably gives you a pass.

Exactly - I've had clients say, "We'll pay for hot standbys in the same region, but not in another region. If an entire AWS region goes down, it'll be in the news, and our customers will understand, because we won't be their only service provider that goes down, and our clients might even be down themselves."

tgv

The Register calls it Microsoft 364, 363, ...

kingstnap

Reported uptimes are little more than fabricated bullshit.

They measure uptime using averages of "if any part of a chain is even marginally working".

People experience downtime however as "if any part of a chain is degraded".
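
A rough sketch of the gap between those two definitions, using made-up per-minute health samples for a hypothetical three-component chain (the component names and numbers are assumptions for illustration, not anyone's real reporting method):

```python
# Hypothetical per-minute health samples for a three-component request chain.
# True = component fully healthy in that minute, False = degraded or down.
samples = [
    {"dns": True,  "api": True,  "db": True},
    {"dns": False, "api": True,  "db": True},
    {"dns": True,  "api": False, "db": True},
    {"dns": False, "api": False, "db": False},
]


def availability(samples, minute_is_up):
    """Fraction of sampled minutes counted as 'up' under the given rule."""
    up_minutes = sum(1 for sample in samples if minute_is_up(sample))
    return up_minutes / len(samples)


# Lenient rule: a minute counts as up if ANY component is healthy.
reported = availability(samples, lambda s: any(s.values()))

# Strict rule: a minute counts as up only if EVERY component is healthy,
# which is closer to what a user of the whole chain experiences.
experienced = availability(samples, lambda s: all(s.values()))

print(f"reported: {reported:.0%}, experienced: {experienced:.0%}")
```

Same samples, very different number depending on which rule you pick.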

ivad

Seems to have taken down my router "smart wifi" login page, and there's no backup router-only login option! Brilliant work, linksys....

hmry

WiFi login portal (Icomera) on the train I'm on doesn't work either.

ta988

Happened to lots of commercial routers too (free wifi with sign-in pages in stores for example) and that's way outside us-east-1

kristopherleads

Was just on a Lufthansa and then United flight - both of which did not have WiFi. Was wondering if there was something going on at the infrastructure level.

shakesbeard

Slack (canvas and huddles), Circle CI and Bitbucket are also reporting issues due to this.

CTDOCodebases

I'm getting rate limit issues on Reddit so it could be related.

hobo_mark

When did Snapchat move out of GCP?

freeqaz

Since I'm 5+ years out from my NDA around this stuff, I'll give some high level details here.

Snapchat heavily used Google AppEngine to scale. This was basically a magical Java runtime that would 'hot path split' the monolithic service into lambda-like worker pools. Pretty crazy, but it worked well.

Snapchat leaned very heavily on this though and basically let Google build the tech that allowed them to scale up instead of dealing with that problem internally. At one point, Snap was >70% of all GCP usage. And this was almost all concentrated on ONE Java service. Nuts stuff.

Anyway, eventually Google was no longer happy with supporting this, and the corporate way of breaking up is "hey we're gonna charge you 10x what we did last year for this, kay?" (I don't know if it was actually 10x. It was just a LOT more.)

So began the migration towards Kubernetes and AWS EKS. Snap was one of the pilot customers for EKS before it was generally available, iirc. (I helped work on this migration in 2018/2019)

Now, 6+ years later, I don't think Snap heavily uses GCP for traffic unless they migrated back. And this outage basically confirms that :P

dijit

They might have an implicit dependency on AWS, even if they're not primarily hosted there.

binsquare

The internal disruption reviews are going to be fun :)

Msurrow

The fun is really gonna start if the root cause of this somehow implicates an AI as a primary cause.

karel-3d

I haven't seen the "90% of our code is AI" nonsense from Amazon.

portaouflop

It’s gonna be DNS

WelcomeShorty

Your remark made me laugh, but...:

"Oct 20 3:35 AM PDT The underlying DNS issue has been fully mitigated, and most AWS Service operations are succeeding normally now. Some requests may be throttled while we work toward full resolution."

https://health.aws.amazon.com/health/status

aurareturn

It's never an AI's fault since it's up to a human to implement the AI and put in a process that prevents this stuff from happening.

So blame humans even if an AI wrote some bad code.

Msurrow

I agree, but then again it’s always a human’s fault in the end, so the root cause will probably have a bit more nuance. I was thinking more of the possible headlines and how they could affect the public AI debate, since this event is big enough to actually get the attention of e.g. risk management at not-insignificant orgs.

moribvndvs

So, uh, over the weekend I decided to use the fact that my company needs a status checker/page to try out Elixir + Phoenix LiveView, and just now I found out my region is down while tinkering with it and watching Final Destination. That’s a little too on the nose for my comfort.

lbreakjai

Well at least you don't have to figure out how to test your setup locally.

codebolt

Atlassian cloud is also having issues. Closing in on the 3 hour mark.

ryanmcdonough

Now, I may well be naive - but isn't the point of these systems that you fail over gracefully to another data centre and no-one notices?

spicybright

It should be! When I was a complete newbie at AWS, my first question was why you have to pick a region. I thought the whole point was that you didn't have to worry about that stuff.

nicce

As far as I know, region selection is about regulation and privacy and guarantees on that.

speedgoose

The region labels found within the metadata are very very powerful.

They make lawyers happy and they stop intelligence services from accessing the associated resources.

For example, no one would even consider accessing data from a European region without the right paperwork.

sofixa

> another data centre

Yes, within the same region. Doing stuff cross-region takes a little bit more effort and cost, so many skip it.
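
A minimal sketch of what the client side of that extra effort can look like, assuming you already run the service in more than one region. The region list and the `request` callable are hypothetical placeholders, not an AWS API, and the hard parts (data replication, DNS, cost) are not shown:

```python
from typing import Callable, Sequence


def call_with_failover(regions: Sequence[str], request: Callable[[str], str]) -> str:
    """Try each region in order and return the first successful response."""
    last_error = None
    for region in regions:
        try:
            return request(region)
        except ConnectionError as exc:
            # This region is unreachable; remember why and try the next one.
            last_error = exc
    raise RuntimeError("all regions failed") from last_error


def fake_request(region: str) -> str:
    """Stand-in for a real network call: pretends us-east-1 is down."""
    if region == "us-east-1":
        raise ConnectionError(f"{region} unreachable")
    return f"ok from {region}"


result = call_with_failover(["us-east-1", "us-west-2"], fake_request)
print(result)
```

The retry loop is the cheap part. Keeping a second region warm and its data in sync is where the effort and cost go.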

seeg

quay.io was down: https://status.redhat.com

gramakri2

quay.io is down

pantulis

Now I know why the documents I was sending to my Kindle didn't go through.
