AWS outage shows internet users 'at mercy' of too few providers, experts say
156 comments
October 20, 2025
sunrunner
The 'experts' also made similar criticisms after the Fastly outage in 2021; did anything obvious change as a result? In a week's time, no national newspapers will be talking about this.
Meanwhile, everyone that spends actual time in these areas:
- Knows that running an operation at AWS scale is difficult and any armchair criticism from 'experts' is exactly that. Actions speak louder than words.
- Understands that, in most cases, the cost of actually accounting for these kinds of scenarios is incredibly high relative to the benefit
- Knows that genuinely 'critical' services (i.e. health) should be designed to account for this, and that every other 'serious' issue, such as 'I can't log in to Fortnite', just shows the price and effort of actually making that work versus how much an outage costs the affected companies when it happens
- Knows how much time national newspapers actually spend talking about the importance of multi-region/multi-cloud redundancy: zero, until the one day it happens, and then it's old news
- Is just curious about what exactly happened from a technical perspective
This isn't to say that good blameless post-mortems shouldn't happen to figure out process and technical issues, but the armchair criticism with no actual followup? All noise, no signal.
imgabe
The "experts" in this case are
> Dr Corinne Cath-Speth, the head of digital at human rights organisation Article 19
Dr. Cath-Speth has a PhD in cultural anthropology
> Cori Crider, the executive director of the Future of Technology Institute
A lawyer
> Madeline Carr, professor of global politics and cybersecurity at University College London
A professor. Her bio doesn't say what her degree is in, but she mostly seems to publish in political science and international relations
So, not a single technical expert. Not anyone who has ever run a hosting service before or even worked for one. Just people who write papers and sit around waiting for journalists to call them for quotes.
kopecs
Do you not think it a bit too hyperbolic to throw scare quotes around experts and imply the only people who can have opinions on systemic risk are software engineers? I don't think it is unreasonable for people who haven't run or worked for a hosting service to have opinions on the policy aspect or economic impact of hyperscalers.
sunrunner
> I don't think it is unreasonable for people who haven't run or worked for a hosting service to have opinions on the policy aspect or economic impact of hyperscalers.
Yeah, that's completely fair. My angle was more that firstly this doesn't come across as an opinion that needs the expert in question, and secondly this is yet another case of 'Talk is cheap, show me the code', particularly when quotes in the article include "We urgently need diversification in cloud computing."
I feel like the 'We' is doing an awful lot of heavy lifting and there's no mention of the costs of taking on such a task.
Additionally, and awkwardly, it's possible to be both a monopoly in the space but also technically a more stable solution, making the cost for competitors or people willing to use competitors doubly high.
Edit: Realised after the fact that I'm the GP to your post; I assumed it was a reply to mine, but I'm keeping the words anyway.
imgabe
Anyone can have an opinion, I never said or implied otherwise. Having an opinion does not make one an expert, hence the scare quotes.
The headline is misleading because when there is news about experts saying something about technology, one would naturally think that they are at least somewhat technical experts. Instead the "expert" is the director of the "Big Tech is Bad Institute" who says that "Big Tech is Bad". And their qualification of being an expert is solely that they are director of the "Big Tech is Bad Institute".
jimbokun
I don’t think it’s useful at all.
What are they going to say that’s useful for making concrete technical decisions?
They can advise on how to write contracts for dealing with these situations after the fact, I suppose.
mhb
Experts said that cloth masks would protect you from a deadly virus.
Waterluvian
Right?! Same with seatbelts. I don’t wear mine because there’s obviously still automobile deaths. Experts said seatbelts would protect us from deadly accidents. What else are they wrong about?!
wagwang
Opinions are valid but also worthless. Just give me a funny tweet to digest the situation.
zenoprax
I think your third point is what I've had to attune to when criticizing cloud dependence. I think if your entire source of revenue is dependent on AWS then you should be prepared for 16+ hours of downtime per year. Individuals notice it more when something is down for hours, but with good observability I am guessing the business notices it more when performance drags for the other 8742 hours of the year. Brief bursts of downtime during the day can still be attributed to the device, wifi, ISP, or some other intermediary's DNS/BGP.
If your margins are so tight that 16 hours of downtime will bankrupt you then I think either: a) I have no idea how to run a business; or b) you have no idea how to run a business. I'm also biased because I love highly fault-tolerant, geo-redundant, durable systems much more than "good enough for this KPI".
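The arithmetic behind "16+ hours of downtime per year" is easy to check. A minimal Python sketch, assuming a 365-day year (8,760 hours); `availability_from_downtime` is a hypothetical helper name:

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years ignored for simplicity


def availability_from_downtime(downtime_hours: float) -> float:
    """Return yearly availability as a percentage, given total hours of downtime."""
    if not 0 <= downtime_hours <= HOURS_PER_YEAR:
        raise ValueError("downtime must be between 0 and one year")
    return 100.0 * (HOURS_PER_YEAR - downtime_hours) / HOURS_PER_YEAR


if __name__ == "__main__":
    # 16 hours down per year, the figure used in the comment above
    print(f"16 h/year of downtime -> {availability_from_downtime(16):.3f}% availability")
```

16 hours a year works out to just under 99.82% availability, i.e. short of three nines (99.9%).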
sunrunner
> but with good observability I am guessing the business notices it more when performance drags for the other 8742 hours of the year
This is a really good point that aligns with my experience. Today's event was LOUD and (compared to other incidents) long, but perhaps not really that long compared to the situation you describe, which for most businesses is going to be more pernicious.
Business intelligence and analytics-type folks at $DAYJOB are _very_ watchful for year-on-year deviations, and even for periods of just a few hours where the prediction lines didn't match up.
BrenBarn
I think all of that is mostly irrelevant. You don't need to pay a huge cost for a small benefit, you don't need every service to be resilient to this, or any of that. You just need multiple different providers so that not everyone gets screwed at once.
sunrunner
But that would require companies to actually spend time and money testing and working with either a cross-provider multi-master-type system (with all the associated consistency headaches) or regularly test a functioning disaster-recovery/fallback system.
The time spent on that (let alone cost, for companies with large amounts of data) far outweighs the cost when a single region has an issue of today's scope. And you said it yourself, it's a 'small benefit'. Small benefits sound like exactly the things not worth spending time or money on.
For as much as many companies have had issues today, the daily reality is that these same companies haven't been having issues all the rest of the time (or this wouldn't have felt so shocking) and are likely to be okay with an outage of this scope (plus, everyone's too busy making noise about the issues to be working normally).
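The simplest form of the fallback described above is client-side failover: call the primary, and if that fails, try a secondary. A minimal Python sketch; `call_with_failover` and both backends are hypothetical stand-ins, and it deliberately leaves out the genuinely hard parts (keeping data consistent between the backends and regularly testing that the fallback still works):

```python
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def call_with_failover(backends: Sequence[Callable[[], T]]) -> T:
    """Try each backend in priority order and return the first successful result.

    Raises RuntimeError (chained to the last underlying error) if every backend fails.
    """
    last_error: Optional[Exception] = None
    for backend in backends:
        try:
            return backend()
        except Exception as exc:  # a real client should catch only transient/network errors
            last_error = exc
    raise RuntimeError("all backends failed") from last_error


if __name__ == "__main__":
    def primary() -> str:
        # Hypothetical primary provider/region, simulated as being down
        raise ConnectionError("primary unreachable")

    def fallback() -> str:
        # Hypothetical secondary provider/region
        return "served by fallback"

    print(call_with_failover([primary, fallback]))
```

The loop itself is trivial; the cost sunrunner describes lives in making `fallback` return correct, current data.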
bamboozled
Yes but we live in a highly anti-competitive monopolized world now. With more to come under the new admin.
jdminhbg
It’s hard to think of anything less monopolized than cloud hosting. There are hundreds of providers.
A4ET8a8uTh0_v2
Um... you don't need to be an expert in security, comp.science or economics to know that putting all eggs in one basket may not be a great idea, as it introduces one giant systemic target. If anything, regular people here are uniquely qualified to say something along the lines of:
Oi, this is ridiculous. Maybe more things should be run locally..
FWIW, it was instructive to me as to which companies were not able to function today.
hippo77
These are Guardian 'experts' so can be safely ignored.
alecco
> - Knows that running an operation at AWS scale is difficult and any armchair criticism from 'experts' is exactly that. Actions speak louder than words.
NO. From their own reports, clearly AWS is too centralized and dependent on a specific region (us-east-1) and a specific service (DynamoDB). This has been observed for well over 10 years. Why do they stay in this centralized architecture? Cloud services need much higher standards than the average corporation. Just look at how they took down 2000+ services for many hours.
inopinatus
Even wearing my ex-AWS hat and understanding to some degree the internal complexity of these services, I too am boggled that foundational stuff is still out of Virginia and not a separately operated global region for the subset of control-plane dependencies that can’t be refactored into tolerating eventual consistency (such as parts of IAM).
We always used to talk a lot about minimising blast radius and there’s been enough time, and enough scale, to fix it.
Nevertheless the Guardian’s choice to label self-promoting policy wonks as “experts” is a cringe-inducing reminder that journalists don’t know anything about anything.
sunrunner
I don't deny that an incident of this scope should prompt a serious technical and process review (and as you describe it, it sounds like this is long overdue). However, how often does this kind of thing not affect 2000+ services? Companies should be tracking the time they don't have issues as much as the time they do in order to actually understand if they'd be better off elsewhere.
And to be clear, I'm not at all arguing for the monopolisation of cloud providers, only stating that it's easy to point from far away and say 'This is bad' while simultaneously not doing anything to understand the cost and make that change that you say is important, because it's actually costly (in many dimensions) to do.
sysguest
> - Knows that genuinely 'critical' services (i.e. health) should be designed to account for this
yeah but aws advertises as "trust me bro I won't go down for 99.99999%"
I've seen a lot of gov proposals using aws to 'get away with downtime management'
labrador
Kieran Healy @kjhealy@mastodon.social
Always worth taking sentences that use “the Cloud” or “the Internet” and try replacing those phrases with “A shed in Virginia” to see how they hold up. “Our service is fully based in a shed in Virginia”; “All my files are in a shed in Virginia”; “A shed in Virginia was designed to survive a nuclear war”, etc.
SpicyLemonZest
Sounds like a pretty good shed! Like a lot of pithy commentary on the cloud, this ignores the fact the practical alternative to a shed in Virginia for most businesses is a shelf in the supply closet. "Oops, Jim Bob tripped over the power cord, guess we won't get any emails until the IT guy shows up" - this used to be a routine experience.
gspencley
> "Oops, Jim Bob tripped over the power cord, guess we won't get any emails until the IT guy shows up" - this used to be a routine experience.
You're not entirely wrong, but you're being hyperbolic too. I'm actually curious how old you are / how long you've worked in tech, because I started out pre-cloud and things weren't nearly as bad or as limited as you suggest.
First, on-prem servers are not the only alternative to "cloud." Many businesses, including the ones I worked for, did co-location. The companies owned their own bare metal servers, but would rent a rack in a data centre, and certain things - like the network admin - were entirely outsourced to the data centre / hosting company.
You could also rent managed bare metal servers (you still can). This means that you can pretty much outsource your entire IT department, but you're still not doing cloud services. Meaning you've got bare metal servers, someone you're paying at the hosting company is handling security updates and troubleshooting. You don't get things like auto-scaling or serverless or other cloud features, but you also don't have to worry about Jim tripping over the power cable either.
There are also still virtual servers, which are basically VMs running on a server that hosts multiple clients.
All of this is to say that the alternative is not "cloud" or "box in a closet." The alternative is "cloud" and a ton of different server options: owned, rented, co-located, on-prem, dedicated, virtual, managed v un-managed (outsource IT vs admin your own) and the list goes on and on.
maccard
We run a subset of our CI workload on on-prem workstations because the performance per dollar of consumer hardware is so much better than that of servers. A machine with a 1TB NVMe drive, a 7950X/i9, 64GB RAM and gigabit networking is < $1000. It actually completes our CI job faster than AWS restarts a GPU instance.
In 2 years, 100% of our failures with this machine have been "carpet cleaners unplugged the machine". Last year we had nobody in the office afterwards (due to the carpet cleaning). This year we sent someone in straight after the cleaning to fix it.
SpicyLemonZest
I've never managed IT professionally myself (pre-cloud or otherwise), so a lot of my information comes from family members who do, but my impression is that bare metal rental and colo centers weren't realistic options for any but the most technically sophisticated organizations. I know schools, stores, even research centers who went straight from on-prem to managed cloud with no real consideration for anything in between.
Spivak
But is the distinction meaningful? The alternative to a shed in Virginia is a different shed in Montana? I mean sure there are a lot of different sheds out there but they're all still sheds. They're all shared responsibility models where the line is drawn in different areas, some outages will be because of your fuckup, some will be theirs.
Not saying as an industry we shouldn't diversify a little but it doesn't fundamentally change the relationship each company has to their hosting provider.
noir_lord
Once had a site-wide outage of the internet and backup servers (biggish manufacturing company) because one of the women wanted to plug her hair straighteners in for the xmas party.
To the surprise of literally no one, that happening on the last Friday before the xmas break got my "We need to secure the main comms cabinet" item (the cabinet had the backup server and main ingress for the WAN and was in a separate building on the other side of the site), which I'd been asking about for months, to the top of the list.
Still one of my favourite "outages" because I got to my desk, turned PC on, no network, walked across the landing into the main office, opened comms cabinet, plugged it back in and was "resolved" before the MD got to my desk.
darkwater
With a gazillion shelves, closets, Jims and cables. So if Fortnite's Jim trips on a wire, Canva's Jim is quietly sipping coffee at his desk.
jacobsenscott
Largely mitigated by twist lock sockets.
franz_vlkshp
on the other hand, that's a small price to pay to having total control and physical access to your own infrastructure. if the sysadmin did his job properly, an incident like that shouldn't require anything else but to plug the server back in and hit the power switch. but then if he did his job properly, no one but IT should be tripping on power cables to begin with.
impure
We already have diversification. You can rent a VPS from hundreds of possible companies. And people are very happy with them, it seems every month or two there’s a post here about how some company slashed their cloud bill by switching to a VPS. What we have here is a lock-in and marketing problem.
jasode
>You can rent a VPS from hundreds of possible companies. And people are very happy with them, it seems every month or two there’s a post here about how some company slashed their cloud bill by switching to a VPS.
Companies are using the higher-level "PaaS" suite of services from AWS, such as DynamoDB, RedShift, etc., and not just the lower-level "IaaS" such as basic EC2 instances or pure containers. The same "lock-in" situation applies to the higher-level services from MS Azure and Google Cloud.
For those dependent on high-level services, migrating to a VPS like Hetzner or self-hosting is not possible unless they re-invent the AWS stack by installing/babysitting a bunch of open-source software. It's going to be a lot more involved than just installing a PostgreSQL db instance on a VPS.
SoftTalker
> It's going to be a lot more involved
Yes, and you can't escape that by outsourcing it. The complexity is still there, and it will still bite you when your outsourcer fails to manage it.
candiddevmike
Same thing applies to AWS...
TZubiri
Amazon offers VPSes as well (EC2 instances). Were those affected? I think they weren't.
swiftcoder
Our actual running instances were pretty much fine throughout, as was the RDS cluster, but we had no way to launch new instances (or auto-scale), and no way to invoke any of the other AWS services (IAM, SQS, Lambda, etc). Also no CloudWatch logs/metrics for the duration, so limited visibility.
Overall not that bad for us, but if you had more high-level service dependencies, there would have been impact.
TYPE_FASTER
> While most operations are recovered, requests to launch new EC2 instances (or services that launch EC2 instances such as ECS) in the US-EAST-1 Region are still experiencing increased error rates.
> We continue to investigate the root cause for the network connectivity issues that are impacting AWS services such as DynamoDB, SQS, and Amazon Connect in the US-EAST-1 Region. We have identified that the issue originated from within the EC2 internal network.
So, kinda? Some global services depend on us-east-1...
> Global services or features that rely on US-EAST-1 endpoints such as IAM updates and DynamoDB Global tables may also be experiencing issues.
Basically, you know it's going to be a bumpy day when us-east-1 has an issue, because your ability to run across regions depends on what the issue is and what the impact is.
morshu9001
The expert opinions are more about geopolitics, like maybe don't have all your country's systems realtime depend on a foreign company.
If you are just one company whose goal is to maximize uptime without bringing in the complexity of multi-cloud, relying on AWS is reasonable. You probably won't get better uptime using something else, you'll only be down at different times than most others, which in most cases is actually worse.
kristianc
For the kind of person being quoted, the stock in trade is not actually doing anything to fix it, it's in being the person quoted when something goes wrong.
xp84
In 2011 there was some kind of big outage at some major AWS US-east pop. I started a job at a company (very boring B2C startup) which had taken the lesson from that, that "cloud anything is dangerous."
They went and bought a bunch of literal servers and installed them in a datacenter, 90 miles away from our offices, and this is where all our applications ran for the remainder of that company's existence (about 6 more years). For the whole time I was at that company, we had somewhat more, and usually more lengthy, outages than the average startup. The only difference is that when some piece of networking gear took a crap, or a disk failed, or whatever, our guys had to diagnose and resolve it (Their karma, I guess, since this was their idea).
Anyway, I do think it would be good if at least so-called 'tech companies' had a little less obsession with outsourcing everything -- even easy things -- to AWS, GCP, and Azure. I feel that way mainly for cost reasons, as many of these services are wildly overpriced. But we also shouldn't kid ourselves by ignoring the advantages of operating at the scale those guys do. They can afford to have multiple absolute wizards available around the clock who make sure that when a problem happens, it's not the kind of "S-show" we had at my old company, where we were all in a slack room or zoom or whatever, just guessing at what to try for half an hour before we could figure out what the actual issue was.
robomc
This. And when a service goes down it's a lot easier to explain to your client/boss that "half the internet is down" than "our boutique solution is broken so it's just us actually".
999900000999
I largely agree with you. When AWS goes down, for most situations I can just go outside and smoke a cigarette and not worry about it.
It's someone else's problem.
KronisLV
So, how many people will actually switch their setups to multi-cloud as a consequence of this? How many will move over to self-hosting? Or will they just do a post-incident report, wave hands around and do nothing?
Because I think it's very much the same way as it is with Cloudflare - while the large vendors aren't always openly hostile, we can just smile and hope that they don't get too keen on reminding us that they're holding us hostage.
I don't see that changing anytime soon. I've personally also used Hetzner, Contabo, Scaleway, Vultr, DigitalOcean, Time4VPS and some other platforms, but when people couple their setups to CF/AWS/GCP/Azure, typically that coupling is hard to get rid of and doing so is hard to justify.
SkyPuncher
For most companies, I suspect this will actually re-affirm _not_ switching to multi-cloud.
Lots of businesses will have their outage today completely forgotten, because all of their customers were dealing with their own outages and with outages at dozens of other providers.
Obviously, that doesn't fly for everyone.
1970-01-01
GCP and Azure should be running a 10% sale/discount (Coupon code: RAINYDAY) for new accounts during the week of an AWS outage. The bean counters would take note.
jimbokun
Nobody ever got fired for buying IBM…
…no, Microsoft…
…no, AWS.
dynamite-ready
The whole industry walked straight into the cloud service lock-in trap. How would we begin to wind back? I also think Docker is as much to blame as the bigger cloud vendors.
spjt
I don't think it wants to. Ask any on-call engineer or support tech how they felt when, after having their phone blow up at 1am because everything is falling apart, they found out that this was an AWS-wide outage.
Jcowell
Why is docker to blame?
dynamite-ready
It's subjective I guess, but I feel as though containerisation has greatly supported the large cloud vendors' desire to subvert the more common model of computing... Like, before, your server was a computer, much like your desktop machine, and you programmed it much like your desktop machine.
But now, people are quite happy to put their app in a Docker container and outsource all design and architecture decisions pertaining to data storage and performance.
And with that, the likes of ECS, Dynamo, RedShift, etc, are a somewhat reasonable answer to that. It's much easier to offer a distinct proposition around that state of affairs, than say a market that was solely based on EC2-esque VMs.
What I did not like, but absolutely expected, was this lurch towards near enough standardising one specific vendor's model. We're in quite a strange place atm, where AWS specific knowledge might actually have a slightly higher value than traditional DevOps skills for many organisations.
Felt like this all happened both at the speed of light, and in slow motion, at the same time.
godelski
Containers let me essentially build those machines but at the actual requirements I need for a particular system. So instead of 10 machines I can build 1. I then don't need to upgrade that machine if my service changes.
It's also more resilient because I can trash a container and load up a new one with low overhead. I can't really do that with a full machine. It also gives some more security by sandboxing.
This does lead to laziness by programmers, accelerated by myopic management. "It works" except when it doesn't. It's easier to say you just need to restart the container than to figure out the actual issue.
But I'm not sure what that has to do with cloud. You'd do the same thing self hosting. Probably save money too. Though I'm frequently confused why people don't do both. Self host and host in the cloud. That's how you create resilience. Though you also need to fix problems rather than restart to be resilient too.
I feel like our industry wants to move fast but without direction. It's like we know velocity matters but since it's easier to read the speedometer we pretend they're the same thing. So fast and slow makes sense. Fast by magnitude of the vector. Slow if you're measuring how fast we make progress in the intended direction.
pythonaut_16
I don't see how Docker makes that worse.
Before Docker you had things like Heroku and Amazon Elastic Beanstalk with a much greater degree of lock in than Docker.
ECS and its analogues on the other cloud providers have very little lock in. You should be able to deploy your container to any provider or your own VM. I don't see what Dynamo and data storage have to do with that. If we were all on EC2s with no other services you'd still have to figure out how to move your data somewhere else?
Like I truly don't understand your argument here.
throwaway894345
Containers have nothing to do with storage. They are completely orthogonal to storage (you can use Dynamo or RedShift from EC2), and many people run Docker directly on VMs. Plenty of us still spend lots of time thinking about storage and state even with containers.
Containers allow me to outsource host management. I gladly spend far less time troubleshooting cloud-init, SSH, process managers, and logging/metrics agents.
ryandvm
Man, I did not have "AWS us-east-1 will only have TWO 9s this year" on my bingo card.
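For scale, "N nines" of availability caps yearly downtime at 10^-N of the year. A small sketch, assuming a 365-day year; `downtime_budget_hours` is a hypothetical helper name:

```python
HOURS_PER_YEAR = 365 * 24  # leap years ignored


def downtime_budget_hours(nines: int) -> float:
    """Maximum yearly downtime, in hours, allowed by an availability of `nines` nines."""
    if nines < 1:
        raise ValueError("nines must be >= 1")
    unavailability = 10.0 ** -nines  # e.g. 2 nines = 99% up = 1% down
    return unavailability * HOURS_PER_YEAR


if __name__ == "__main__":
    for n in range(2, 6):
        print(f"{n} nines: {downtime_budget_hours(n):.2f} hours/year")
```

Two nines allows about 87.6 hours of downtime a year, three nines about 8.8 hours, and four nines under an hour (roughly 53 minutes).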
aurumque
For those of us who have been using AWS for almost 20 years now, I can't imagine why anyone would willingly choose us-east-1 for anything. It is the oldest, highest traffic, most critical path region and is subject to turbulence.
tlogan
I think it is a little complicated. For example, your service might have full failover, but you might use APIs from other services that are down.
Or you might use BART to come to work and you got stuck: https://www.kqed.org/news/12060687/bart-resumes-service-but-...
dingnuts
ha! I saw another comment on here talking about how ec2 doesn't need to be held to the same standard as the power company because it's not as important as real infrastructure.
wish I'd already had this link in my back pocket. our industry needs to take its job, as a whole, much more seriously.
captainkrtek
“Global” and “edge” services such as IAM, Route53, CloudFront and so on have dependencies on us-east-1, so even if you don’t think you do, you probably do.
interroboink
By some logic, that would mean it is the most battle-tested and highest-stakes (and therefore most carefully-managed) choice. I.e. reasons in favor.
Not that I disagree with you, but maybe not for the reasons you say (:
swiftcoder
> By some logic, that would mean it is the most battle-tested and highest-stakes (and therefore most carefully-managed) choice
As someone who used to work on the inside, us-east-1 has the biggest pile of legacy workarounds for internal AWS issues, it has a variety of legacy API behaviours that don't exist in other regions, and because everyone picks it as the default, it has significantly more pressure on contested resources (e.g. spot instance pools).
Plus since it's the default in all the tooling, if you ever decide to go multi-region, you'll find tons of things break right away.
morshu9001
It can make sense to depend on the thing that will attract massive worldwide attention if/when it goes down. Or, more likely, it's just a default people don't change.
bongodongobob
Well, we didn't, but some of our third-party software did. Hard to avoid.
TZubiri
Wait, was the whole region affected? Like even if you had an EC2 instance?
mads_quist
No, we run on US East 1 but only EC2. Everything was running smoothly!
mads_quist
Our strategy has always been to use as few of the cloud providers' higher-level abstractions as possible. Glad we went this way; it saved us quite a few SLA breaches today! I am confident saying that it's the "best of both worlds". We get great availability zone redundancy from AWS without having to rely on and pay for all the PaaS stuff the cloud giants offer. Also, we can migrate "fairly easily" to any other cloud provider because we only need Debian instances running.
bigstrat2003
Yes, it was. We have EC2 instances that we turn on as-needed, and at times were unable to start said instances.
neom
Been a while since I worked in cloud, but at least when I got out of it, the primitives were all shaping up to be generally very similar.
Did multi cloud redundancy end up being too expensive? Tech didn't line up enough? No good business case?
The elastic cloud story that never was? https://www.slideshare.net/slideshow/pets-vs-cattle-the-elas...
What happened?
LaurensBER
The (cognitive) overhead of managing and deploying to multiple clouds usually isn't worth it for most teams. Hiring experts and maintaining knowledge about the ins and outs of two (or more) clouds is less feasible for small, fast moving teams.
Simplicity is linked to uptime, and having a single cloud is a simpler solution.
For large companies, it's mostly cost savings. Easier to negotiate a good discount at N million versus N/2 million.
Besides that no-one ever got fired for picking AWS ;)
tadfisher
Not a justifiable expense when no one else is resilient against their AWS region going down either. Also cross-cloud orchestration is quite dead because every provider is still 100% proprietary bullshit and the control plane is... kubernetes. We settled for kubernetes.
morshu9001
Also if you can't even do cross region, cross cloud won't happen
dylan604
Cross region isn't simple when you have terabytes of storage in buckets in a region. Building services in other regions without that data doesn't really do any good. Maintaining instances in various regions is easy, but it's that data that complicates everything. If you need to use the instances in a different region because your main region is down, you still can't do anything because those cross region instances can't access the necessary data.
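dylan604's point can be stated as a rule for choosing a failover target: a region only counts if it is healthy *and* already holds a replica of every dataset the service needs. A minimal sketch; the function and dataset names are hypothetical, and region names are used only as labels:

```python
from typing import Dict, Optional, Set


def pick_failover_region(
    primary: str,
    healthy_regions: Set[str],
    replicas: Dict[str, Set[str]],
    required_datasets: Set[str],
) -> Optional[str]:
    """Return a healthy region (other than `primary`) that holds every required dataset.

    `replicas` maps region name -> set of dataset names replicated there.
    Returns None when no region qualifies, i.e. failing over would not help.
    """
    for region in sorted(healthy_regions - {primary}):  # sorted for a deterministic choice
        if required_datasets <= replicas.get(region, set()):
            return region
    return None


if __name__ == "__main__":
    # Hypothetical dataset names; us-west-2 has compute but no copy of "media".
    replicas = {
        "us-east-1": {"media", "orders"},
        "us-west-2": {"orders"},
        "eu-west-1": {"media", "orders"},
    }
    print(pick_failover_region("us-east-1", {"us-west-2", "eu-west-1"}, replicas, {"media", "orders"}))
```

In the usage example, `us-west-2` is healthy but lacks the `media` data, so only `eu-west-1` qualifies; without that replica there would be nowhere useful to fail over to.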
toast0
It seems that clouds balance their budget on egress charges... which leads to cross cloud communication being too expensive to setup multi cloud redundancy. Cross region redundancy is often too expensive too. Even cross availability zones is too expensive for some clouds and applications. (Cross region redundancy in a single cloud doesn't always work out, if the cloud has an outage on a global subsystem, or the broken subsystem gets pushed to multiple regions before exhibiting symptoms)
Additionally, moving your load to a different cloud can be challenging while one is down. It ends up being a lot of work that pays off for a few hours a year. For a lot of applications, it's better to just suffer the downtime and spend money on other things.
justapassenger
There’s a huge difference between “similar” and “works and is ROI positive for my business across the whole lifecycle”.
Multi cloud redundancy is like Java being a solution to platform independency.
dylan604
If you're a company providing services to people that already have data stored in VendorA's cloud, being on a different cloud would be expensive and prevent you from winning much work. If it turns out that VendorA happens to be the vendor for your clients, you build your services to run on VendorA's cloud too.
This is the situation for my company that started with the intent of being platform agnostic, but it quickly became much less complex as all of the potential client pool was using the same cloud. People with buckets with large amounts of data are not going to be able to convince the bean counters that it would be worth it to have that storage bill from multiple vendors.
conductr
> are not going to be able to convince the bean counters that it would be worth it to have that storage bill from multiple vendors
Because it rarely is. Occasional downtime is just a cost of doing business. It is, or should be, rare enough that you just take it as it comes instead of trying to have redundancy. We don't build tunnels everywhere as a backup for surface roads on snowy days. We just cancel school and work for the day and make up for it later. Do some important things get impacted? Sure, but most things aren't as mission critical as we make them out to be. The press coverage of an AWS outage makes it so easy to shrug it off and point fingers.
Analemma_
All the cloud providers have cheap compute but ludicrously expensive network egress. Trying to multicloud will stick you with a massive traffic bill, which is probably not a coincidence.
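The egress point can be made concrete with back-of-the-envelope arithmetic: mirroring data to a second provider costs a one-time seeding transfer plus the ongoing changes. A sketch; the helper name is hypothetical, and the numbers in the usage are placeholders, not any provider's published price (real egress pricing is tiered and varies by provider and region):

```python
from typing import Tuple


def replication_egress_cost(
    initial_gb: float,
    changed_gb_per_day: float,
    price_per_gb: float,
    days: int = 30,
) -> Tuple[float, float]:
    """Estimate egress cost of mirroring data to another provider.

    Returns (one-time cost to seed the copy, recurring cost per `days`-day period).
    `price_per_gb` is an input assumption, not a real provider's rate.
    """
    if min(initial_gb, changed_gb_per_day, price_per_gb) < 0 or days <= 0:
        raise ValueError("inputs must be non-negative and days positive")
    seed_cost = initial_gb * price_per_gb
    recurring_cost = changed_gb_per_day * days * price_per_gb
    return seed_cost, recurring_cost


if __name__ == "__main__":
    # Placeholder inputs: 50 TB already stored, 200 GB changed per day, $0.05/GB
    seed, monthly = replication_egress_cost(50_000, 200, 0.05)
    print(f"seed: ${seed:,.2f}  per month: ${monthly:,.2f}")
```

Both terms scale linearly with data volume, which is why the bill grows fastest for exactly the data-heavy companies that would most need a second copy.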
jamesblonde
It's a market regulation failure. Which results in a failed market, with the cloud infra provider also providing data services. 20 years ago, there were 20+ widely used operational databases. Now, it's like DynamoDB with like half the market.
conductr
How should this have played out in a regulated market? DynamoDB gets released, then what? Has limits on the market share it's allowed to steal?
Should we similarly cap, say, front-end frameworks on market penetration/growth? Is React too big to fail? Do we need to force some of its users to use something else?
jimbokun
What would these regulations say, exactly?
sumtechguy
Many companies' idea of a disaster plan is to make it after the disaster.
You have to build it in. That takes time, money, and training. Do you do failovers? Do they work? What is your backup situation? What is your list of work items to do during the failover? How long does it take? Do you even HAVE a failover plan? Can your services handle being in 'split brain'? Do you have specialty services that can only run in one place?
The unfortunate reality is that this planning too often happens too late.
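The 'split brain' question has a standard answer in majority quorums: a partition may keep accepting writes only if it can reach a strict majority of the nodes, and at most one partition can do that. A minimal sketch; `has_quorum` is a hypothetical helper, not any particular database's API:

```python
def has_quorum(reachable_nodes: int, cluster_size: int) -> bool:
    """Return True if a partition that can reach `reachable_nodes` nodes
    (counting itself) holds a strict majority of a `cluster_size`-node cluster.

    At most one partition can hold a strict majority, so if only quorum
    holders accept writes, the two sides of a split cannot both diverge.
    """
    if cluster_size <= 0 or not 0 <= reachable_nodes <= cluster_size:
        raise ValueError("invalid node counts")
    return reachable_nodes > cluster_size // 2


if __name__ == "__main__":
    # A 5-node cluster split 3/2: only the 3-node side keeps accepting writes.
    print(has_quorum(3, 5), has_quorum(2, 5))
    # A 4-node cluster split 2/2: neither side has a majority, so both stop writing.
    print(has_quorum(2, 4))
```

The 2/2 case is why clusters are usually sized with an odd number of members spread across separate failure domains.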
TZubiri
It feels like a hat on a hat: cloud systems are already designed for redundancy, so adding a redundant layer on top of that is like a double condom, or investing in multiple investment funds.
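Whether the extra hat helps can be checked with the usual probability arithmetic: running replicas in parallel multiplies together their chances of being down, but only if their failures are independent. A shared dependency, like the global services captainkrtek notes depend on us-east-1, caps the result. A sketch with hypothetical helper names and illustrative availabilities:

```python
def parallel_availability(*availabilities: float) -> float:
    """Availability of a service that stays up while ANY of its replicas is up.

    Assumes replica failures are independent.
    """
    p_all_down = 1.0
    for a in availabilities:
        if not 0.0 <= a <= 1.0:
            raise ValueError("availability must be a fraction in [0, 1]")
        p_all_down *= 1.0 - a
    return 1.0 - p_all_down


def with_shared_dependency(shared: float, *availabilities: float) -> float:
    """Like parallel_availability, but every replica also needs one common
    dependency (e.g. a single control-plane region) with availability `shared`."""
    if not 0.0 <= shared <= 1.0:
        raise ValueError("availability must be a fraction in [0, 1]")
    return shared * parallel_availability(*availabilities)


if __name__ == "__main__":
    print(f"{parallel_availability(0.999, 0.999):.6f}")  # two independent replicas
    print(f"{with_shared_dependency(0.999, 0.999, 0.999):.6f}")  # same, behind one shared dependency
```

Two independent 99.9% replicas give about 99.9999%, but putting both behind a shared 99.9% dependency drags the total back below 99.9%, which is roughly the situation when "redundant" regions all lean on us-east-1.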
Aeolun
It’s only a single region. If anything it shows how many people just double down on the default without any redundancy.
stronglikedan
> It’s only a single region
Which was effectively the only region
dijit
And we lean into it by saying "Well, if everyone else is down, I get a free pass".
(which is not true in reality if you have ordinary customers).
Related ongoing thread:
AWS Multiple Services Down in us-east-1 - https://news.ycombinator.com/item?id=45640838 -(1650 comments so far)