AWS to bare metal two years later: Answering your questions about leaving AWS
261 comments · October 29, 2025 · bilekas
fulafel
The direct cost is the easy part. The more insidious part is that you're now cultivating a growing staff of technologists whose careers depend on doing things the AWS way, who get AWS certified to ensure they build your systems the AWS Well-Architected Way instead of thinking for themselves, and who can upsell you on AWS lock-in solutions using AWS-provided soundbites and sales arguments.
("Shall we make the app very resilient to failure? Yes running on multiple regions makes the AWS bill bigger but you'll get much fewer outages, look at all this technobabble that proves it")
And of course AWS lock-in services are priced to look cheap compared to their overpriced standard offerings[1]: if you just spend the engineering and IaC coding effort to move onto them, these "savings" can be put into more AWS cloud engineering effort, which again makes your cloud eng org bigger and more important.
[1] For example, moving your app off containers onto Lambda, or the DB off PostgreSQL onto DynamoDB, etc.
Hilift
> The direct cost is the easy part
I don't think it is easy. I see most organizations struggle with the fact that everything is throttled in the cloud: CPU, storage, network. Tenants often discover large amounts of activity they were previously unaware of that contributes to the usage and cost. And there may be individuals or teams creating new workloads that are grossly impacting their allocation. Did you know there is a setting in MS SQL Server that impacts performance by an order of magnitude when sending/receiving data from the Cloud to your on-premises servers? It's the default in the ORM generated settings.
Then you can start adding in the cloud "value", such as incomprehensible networking diagrams that are probably non-compliant in some way (guess which ones!). And security? What's that?
m-gasser
> Did you know there is a setting in MS SQL Server that impacts performance by an order of magnitude when sending/receiving data from the Cloud to your on-premises servers? It's the default in the ORM generated settings.
Sounds interesting, which setting is that?
hinkley
My last team decided to hand-manage a Memcached cluster because running it ourselves as an unmanaged service cost half as much as AWS's alternative. I don't know how much we really saved versus the opportunity cost on dev time, though. But it's close to negative.
torginus
Unfortunately it's not, and it gets more difficult the more cloud-y your app gets.
You can pay for EC2 + EBS + network costs, or you can have a fancy cloud-native solution where you pay for Lambda, ALBs, CloudWatch, metrics, Secrets Manager... things you'd assume they would just give you; if you eat at a restaurant, you probably won't expect to pay for the parking, the toilet, or rent for the table and seats.
So cloud billing is its own science and art - and in most orgs devs don't even know how much the stuff they're building costs, until finance people start complaining about the monthly bills.
vidarh
I was about to rage at you over the first sentence, because this is so often how people start trying to argue bare metal setups are expensive. But after reading the rest: 100% this. I see so many people push AWS setups not because it's the best thing - it can be if you're not cost sensitive - but because it is what they know and they push what they know instead of evaluating the actual requirements.
hibikir
Well, they aren't wrong about bare metal either: every organization ends up tied to its staff, and that staff was hired to work on the stack you are using. People end up in quite the fight because their supposed experts are more fond of uniformity, and of learning nothing new.
Many a company was stuck with a datacenter unit that was unresponsive to the company's needs, and people migrated to AWS to avoid dealing with them. This happened right in front of my eyes multiple times. At the same time, you also end up in AWS, or even within AWS, using tools that are extremely expensive, because the cost-benefit analysis done by the individuals making the decision, who often don't know much beyond what they use right now, is just wrong for the company. The executive on top is often either not much of a technologist or 20 years out of date, so they have no way to discern the quality of their staff. Technical disagreements? They might only know who they like to hang out with, but that's where it ends.
So for path-dependent reasons, companies end up making a lot of decisions that in retrospect seem very poor. In startups it often just kills the company. Just don't assume the error is always in one direction.
torginus
The weird thing is, I'm old enough to have grown up in the pre-cloud world, and most of the stuff (file servers, proxies, DBs, etc.) isn't any more difficult to set up than the AWS equivalents; it's just that the skills are different.
Also there's a mindset difference - if I gave you a server with 32 cores you wouldn't design a microservice system on it, would you? After all there's nowhere to scale to.
But with AWS, you're sold the story of infinite compute you can just expect to be there, and you'll quickly find out just how stingy they can get about automatically giving you more hardware to scale onto.
I don't dislike AWS, but I feel this promise of false abundance has driven the growth in complexity and resource use of the backend.
Reality tends to be that you hit a bottleneck you have a hard time optimizing away; the more complex your architecture, the harder it is, and then you can stew.
anal_reactor
My manager wants me to get this silly AWS certification.
Let me go on a tangent about trains. In Spain, before you board a high-speed train you need to go through a full security check, like at an airport. In all other EU countries you just show up and board, but in Spain there's the security check. The problem is that even though the security check is an expensive, inefficient theatre, just in case something does blow up, nobody wants to be the politician who removed the security check. There will be no reward for a politician who makes life marginally easier for lots of people, but there will be severe punishment for a politician who is involved in a potential terrorist attack, even if the chance of that happening is ridiculously small.
This is exactly why so many companies love to be balls deep in the AWS ecosystem, even if it's expensive.
rsav
Nobody gets fired for buying IB^H^H AWS
embedding-shape
> In all other EU countries you just show up and board, but in Spain there's the security check
Just for curiosity's sake, did any other EU countries have any recent terrorist attacks involving bombs on trains in the capital, or is Spain so far alone with this experience?
kleiba
How does Spain deal with trains that come in from a neighboring country?
mrits
AWS doesn’t have to be expensive.
steelegbr
AWS may be overcharging, but it's a balancing act. Going on-prem (well, shared DC) will be cheaper but comes with a requirement for either jack-of-all-trades sysadmins or a bunch of specialists. It can work well if your product is simple and scalable. A lot of places quietly achieve this.
That said, I've seen real-world scenarios where complexity is up the wazoo and an opex cost focus means you're hiring under-skilled staff to manage offerings built on components with low sticker prices. Throw in a bit of the old NIH mindset (DIY all the things!) and it's large blast radii with expensive service credits being dished out to customers regularly. On the human factors front, your team will be seeing countless middle-of-the-night conference calls.
While I'm not 100% happy with the AWS/Azure/GCP world, the reality is that on-prem skillsets are becoming rarer and more specialist. Hiring good people can be either really expensive or a bit of a unicorn hunt.
mhitza
It's a chicken-and-egg problem. If the cloud hadn't become such a prominent thing, the last decade and a half would have seen the rise of much better tools for managing on-premise servers (i.e. requiring less in-depth sysadmin expertise). I think we're starting to see such tools appear in the last few years, after enough people got burned by cloud bills and lock-in.
PenguinCoder
I'm proudly a 100% on-prem Linux sysadmin. There are no openings for my skills, and they do not pay as well as whatever cloud hotness is "needed".
marcosdumay
Nobody is hiring generalists nowadays.
At the same time, the incredible complexity of the software infrastructure is making specialists more and more useless. To the point that almost every successful specialist out there is just a disguised generalist who decided to focus their presentation on a single area.
whstl
That's the crazy thing.
Most AWS-only Ops engineers I know are making bank and in high demand, and Ops teams are always HUGE in terms of headcount outside of startups.
The "AWS is cheaper" thing is the biggest grift in our industry.
hibikir
And don't forget the real crux of the problem: Do I even know whether a specialist is good or not? Hiring experts is really difficult if you don't have the skill in the topic, and if you do, you either don't need an expert or you will be biased towards those who agree with you.
It's not even limited to sysadmins, or to tech. How do you know whether a mechanic is very good or iffy? Is a financial advisor giving you good advice, or basically robbing you? It's not as if many companies are going to hire four business units' worth of on-prem admins and then decide which one does better after running for three years, or something empirical like that. You might be the poor sob who hires the very expensive yet incompetent and out-of-date specialist, whose only remaining good skill is selling confidence to employers.
dns_snek
> Do I even know whether a specialist is good or not?
Of course but unless I misunderstood what you meant to say, you don't escape that by buying from AWS. It's just that instead of "sysadmin specialists" you need "AWS specialists".
If you want to outsource the job then you need to go up at least 1 more layer of abstraction (and likely an order of magnitude in price) and buy fully managed services.
everfrustrated
This only gets worse as you go higher in management. How does a technical founder know what good sales or marketing looks like? They are often swayed by people who can talk a good talk and deliver nothing.
canucktrash669
Managed servers reduce the on-prem skillset requirement and can also deliver a lot of value.
The most frustrating part of hyperscalers is that it's so easy to make mistakes. Active tracking of your bill is a must, but the data is 24-48h late in some cases. So a single engineer can cause five-figure regrettable spend very quickly.
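One partial guardrail is an AWS Budgets alert that emails you when actual spend crosses a threshold. A minimal boto3 sketch (the account ID, limit, and address are placeholders, and the underlying cost data can still lag a day or two):

    import boto3

    budgets = boto3.client("budgets")
    budgets.create_budget(
        AccountId="123456789012",  # placeholder account ID
        Budget={
            "BudgetName": "monthly-guardrail",
            "BudgetLimit": {"Amount": "10000", "Unit": "USD"},  # placeholder monthly limit
            "TimeUnit": "MONTHLY",
            "BudgetType": "COST",
        },
        NotificationsWithSubscribers=[{
            "Notification": {
                "NotificationType": "ACTUAL",
                "ComparisonOperator": "GREATER_THAN",
                "Threshold": 80.0,              # alert at 80% of the limit
                "ThresholdType": "PERCENTAGE",
            },
            "Subscribers": [{"SubscriptionType": "EMAIL", "Address": "ops@example.com"}],
        }],
    )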
Aurornis
> I'm so surprised there is so much pushback against this.. AWS is extremely expensive.
I see more comments in favor than pushing back.
The problem I have with these stories is the confirmation bias that comes with them. Going self-hosted or on-premises does make sense in some carefully selected use cases, but I have dozens of stories of startup teams spinning their wheels with self-hosting strategies that turn into a big waste of time and headcount that they should have been using to grow their businesses instead.
The shared theme of all of the failure stories is missing the true cost of self-hosting: The hours spent getting the servers just right, managing the hosting, debating the best way to run things, and dealing with little issues add up but are easily lost in the noise if you’re not looking closely. Everyone goes through a honeymoon phase where the servers arrive and your software is up and running and you’re busy patting yourselves on the back about how you’re saving money. The real test comes 12 months later when the person who last set up the servers has left for a new job and the team is trying to do forensics to understand why the documentation they wrote doesn’t actually match what’s happening on the servers, or your project managers look back at the sprints and realize that the average time spent on self-hosting related tasks and ideas has added up to a lot more than anyone would have guessed.
Those stories aren’t shared as often. When they are, they’re not upvoted. A lot of people in my local startup scene have sheepish stories about how they finally threw in the towel on self-hosting and went to AWS and got back to focusing on their core product. Few people are writing blog posts about that because it’s not a story people want to hear. We like the heroic stories where someone sets up some servers and everything just works perfectly and there are no downsides.
You really need to weigh the tradeoffs, but many people are not equipped to do that. They just think their chosen solution will be perfect and the other side will be the bad one.
mjr00
> I have dozens of stories of startup teams spinning their wheels with self-hosting strategies that turn into a big waste of time and headcount that they should have been using to grow their businesses instead.
Funnily enough, the article even affirms this, though most people seemed to have skimmed over it (or not read it at all).
> Cloud-first was the right call for our first five years. Bare metal became the right call once our compute footprint, data gravity, and independence requirements stabilised.
Unless you've got uncommon data egress requirements, if you're worried about optimizing cloud spend instead of growing your business in the first 5 years you're almost certainly focusing on the wrong problem.
> You really need to weigh the tradeoffs, but many people are not equipped to do that. They just think their chosen solution will be perfect and the other side will be the bad one.
This too. Most of the massive AWS savings articles in the past few days have been from companies that do a massive amount of data egress i.e. video transfer, or in this case log data. If your product is sending out multiple terabytes of data monthly, hosting everything on AWS is certainly not the right choice. If your product is a typical n-tier webapp with database, web servers, load balancer, and some static assets, you're going to be wasting tons of time reinventing the wheel when you can spin up everything with redundancy & backups on AWS (or GCP, or Azure) in 30 minutes.
DrewADesign
> The shared theme of all of the failure stories is missing the true cost of self-hosting: The hours spent getting the servers just right, managing the hosting, debating the best way to run things, and dealing with little issues add up but are easily lost in the noise if you’re not looking closely.
What the modern software business seems to have lost is the understanding that ops and dev are two different universes. DevOps was a reaction to the fact that even outsourcing ops to AWS doesn’t entirely solve all of your ops problems and the role is absolutely no substitute for a systems administrator. Having someone who helps derive the requirements for your infrastructure, then designs it, builds it, backs it up, maintains it, troubleshoots it, monitors performance, determines appropriate redundancy, etc. etc. etc., and then tells the developers how to work with it, is the missing link. Hit-by-a-bus documentation, support and update procedures, security incident response… these are all problems we solved a long time ago, but sort of forgot about when moving everything to cloud architecture.
mjr00
> DevOps was a reaction to the fact that even outsourcing ops to AWS doesn’t entirely solve all of your ops problems and the role is absolutely no substitute for a systems administrator.
This is revisionist history. DevOps was a reaction to the fact that many/most software development organizations had a clear separation between "developers" and "sysadmins". Developers' responsibility ended when they compiled an EXE/JAR file/whatever; then they tossed it over the fence to the sysadmins, who were responsible for running it. DevOps was the realization that, huh, software works better when the people responsible for building it ("Dev") are also the same people responsible for keeping it running ("Ops").
hndc
> DevOps was a reaction to the fact that even outsourcing ops to AWS doesn’t entirely solve all of your ops problems
DevOps, conceptually, goes back to the 90s. I was using the term in 2001. If memory serves, AWS didn't really start to take off until the mid/late aughts, or at least not until they launched S3.
DevOps was a reaction to the software lifecycle problem and didn't have anything to do with AWS. If anything it's the other way around: AWS and cloud hosting gained popularity in part due to DevOps culture.
wredcoll
> What the modern software business seems to have lost is the understanding that ops and dev are two different universes.
This is a fascinating take, if you ask me, treating them as separate is the whole problem!
The point of being an engineer is to solve real world problems, not to live inside your own little specialist world.
Obviously there's a lot to be said for being really good at a specialized set of skills, but that's only relevant to the part where you're actually solving problems.
esskay
> I'm so surprised there is so much pushback against this
I'm not. It seems to be happening a lot. Any time a topic about not using AWS comes up here or on Reddit, there's a sudden surge of people appearing out of nowhere shouting down anyone who suggests other options. It's honestly starting to feel like paid shilling.
Spooky23
I don’t think it’s paid shilling, it’s dogma that reflects where people are working here. The individual engineers are hammers and AWS is the nail.
AWS/Azure/GCP is great, but like any tool or platform you need to do some financial/process engineering to make an optimal choice. For small companies, time to market is often key, hence AWS.
Once you're a little bigger, you may develop frameworks to operate efficiently. I have apps that I run in a data center because they'd cost 10-20x at a cloud provider. Conversely, I have apps that get more favorable licensing terms in AWS, so I run them there even though the compute is slower and less efficient.
You also have people who treat AWS with the old “nobody gets fired for buying IBM” mentality.
dangus
I think a lot of engineers who remember the bare metal days have legitimate qualms about going back to the way that world used to work especially before containerization/Kubernetes.
I imagine a lot of people who use Linux/AWS now started out with bare metal Microsoft/VMWare/Oracle type of environments where AWS services seemed like a massive breath of fresh air.
TheCondor
It's the current version of the CCIE or some of the other certs. People pay money to learn how to operate AWS; other things erode the value of their investment.
BirAdam
I'm not either. I used to do fully managed hosting solutions at a datacenter. I had to do everything from hardware through debugging customer applications. Now, people pay me to do the same but on cloud platforms and the occasional on-prem stuff. In general, the younger people I've come across have no idea how to set anything up. They've always just used awscli, the AWS Console, or terraform. I've even been ridiculed for suggesting people not use AWS. Thing is, public cloud really killed my passion for the industry in general.
Beyond public cloud being bad for the planet, I also hate that it drains companies of money, centralizes everyone's risk, and helps to entrench Amazon as yet another tech oligarchic fiefdom. For most people, these things just don't matter apparently.
palata
> Thing is, public cloud really killed my passion for the industry in general.
Similar here, I think. I got into Computer Science because I liked software... the way it was. Now I truly think that most software completely sucks.
The thing is that it has grown so much since then that most developers come at it from a different angle.
ecshafer
I think in 5-10 years there is going to be very profitable consulting work in setting up data center infrastructure and de-clouding companies.
7thaccount
I think some of that is a certain group of people will do anything to play with the new shiny stuff. In my org it's cloud and now GPU.
The cloud stuff is extremely expensive and doesn't work any better than our existing solutions. Like a commenter said below, it's insidious, as your entire organization later becomes dependent on it. If you buy a cloud solution, you're also stuck if the vendor decides to double the cost of the product once you're locked in.
The GPU stuff is annoying as all of our needs are fine with normal CPU workloads today. There are no performance issues, so again...what's the point? Well... somebody wants to play with GPUs I guess.
ghaff
Resume-driven development. It's probably pretty much always been a thing.
mrits
I think people who lived through the time when their servers were down because the admin forgot to turn them back on after he drove 50 miles back from the colo might not want to live through that again.
fabian2k
A large part of the different views on this topic are due to the way people estimate the amount of saved effort and money because you're pushing some admin duties to the cloud provider instead of doing this yourself. And people come to vastly different conclusions on this aspect.
It's also that the requirements vary a lot, discussions here on HN often seem to assume that you need HA and lots of scaling options. That isn't universally true.
nicce
> A large part of the different views on this topic are due to the way people estimate the amount of saved effort and money because you're pushing some admin duties to the cloud provider instead of doing this yourself. And people come to vastly different conclusions on this aspect
This only applies if you have an extra customer who pays the difference. Basically, the argument only holds if you can't take on more customers because keeping up the infrastructure takes too much time, or if you'd need to hire an extra person who costs more than the AWS bill difference.
vb-8448
> Maybe I'm just the old man screaming at cloud (no pun intended) but when did people forget how to run a baremetal server ?
It's a way to "commoditize" engineers. You can run on premise or mixed infra better and cheaper, but only if you know what you are doing. This requires experienced guys and doesn't work with new grad hired by big cons and sold ad "cloud experts".
calgoo
Also, when something breaks, you are responsible. If you put it in AWS like everyone else and it breaks, then it's their problem, not yours. We will still implement workarounds and fixes when it happens, but we are not responsible. The basic enterprise rule these days is to always pay someone else to be responsible.
vb-8448
Actually, nothing new here; it was the same in the pre-cloud era, where everyone in enterprises preferred big names (IBM, Microsoft, Oracle, etc.) so they could pass the responsibility to them in case of failure... aka "nobody gets fired for buying IBM".
vidarh
Unless you put someone on retainer to be responsible, which you can do for less than it costs to keep your AWS setup from breaking...
(I do that for people; my AWS using customers consistently end up needing more help)
chasd00
> then it's their problem, not yours
This is the main advantage of cloud: no one cares if the site/service/app is down, as long as it's someone else's fault and responsibility.
yomismoaqui
> Maybe I'm just the old man screaming at cloud (no pun intended) but when did people forget how to run a baremetal server ?
We should coin the term "Cloud Learned Helplessness"
izacus
A lot of people here have built their whole professional careers around knowing AWS and deploying to it.
Moving away is an existential issue for them; this is why there's such pushback. A huge percentage of the new generation of developers and devops engineers doesn't know anything about deploying software on bare metal, or even on other clouds, and they're terrified of being unemployed.
goalieca
Meanwhile, skills in operating systems, networking, and optimization are declining. Every system I've seen in the last 10 years or so has left huge amounts of cash on the table by not being aware of the basics.
cs702
In the early days of cloud service providers, they offered a handful of high-value services, all at great prices, making them cost-competitive with bare metal but much easier. That was then.
Things today are different. As cloud service providers have grown to become dominant, they now offer a vast, complicated tangle of services, microservices, control panels, etc., at prices that can spiral out of control if you are not constantly on top of them, making bare metal cheaper for many use cases.
JCM9
This. When AWS was 10 solid core services it made sense and was exciting. It’s now a bloated mess of 200+ services (many of which almost nobody uses) with all that complexity starting to create headaches and cracks.
AWS needs to stop trying to have a half-arsed solution to every possible use case and instead focus on doing a few basic things really well.
hinkley
I don’t think I’ve seen a menu as hilariously bad as the AWS dashboard menu. No popup menu should consume the entire screen edge to edge. Just a wall of cryptic service names with ambiguous icons.
genidoi
Imo the fact that an "AWS Certified Solutions Architect" is yet another AWS service/thing that is attainable, via an actual exam[0] for $300, is indicative of just how intentionally bloated the entire system has become.
[0] https://aws.amazon.com/certification/certified-solutions-arc...
rossdavidh
"Embrace, extend, extinguish". It was a Microsoft saying, but it explains Amazon's approach to Linux. Once your customers are skilled in how to do things on your platform, using your specialized products, they won't price-comparison (or compare in any other way) to competing options. Whether those countless other "half-arsed solutions" actually make money is beside the point; as long as the customer has baked at least one into their tech stack, they can't easily leave.
jrochkind1
(Real question, not meant to be sarcastic or challenging!) -- What are the challenges in trying to use just the ~10 core services you want/need and ignoring the others? What problems do the others you don't use cause with this use case?
whstl
The early services were mostly self-contained.
A lot of newer stuff that actually scales (so Lightsail doesn't count) is entangled with "security", "observability" and "network" services. So if you just want to run EC2 + RDS today, you also have to deal with VPC, Subnets, IAM, KMS, CloudWatch, CloudTrail, etc.
Since security and logs are not optional, you have very limited choice.
Having that many required additional services means lots of hidden charges, complexity and problems. And you need a team if you're not doing small-scale stuff.
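To make that concrete, here is a rough boto3 sketch of what even a single "simple" instance launch presupposes; every ID and name below is a placeholder, and this still leaves out CloudWatch/CloudTrail:

    import boto3

    ec2 = boto3.client("ec2")
    ec2.run_instances(
        ImageId="ami-0123456789abcdef0",              # placeholder AMI
        InstanceType="t3.micro",
        MinCount=1,
        MaxCount=1,
        SubnetId="subnet-0abc1234",                   # presupposes a VPC, subnets, routing
        SecurityGroupIds=["sg-0abc1234"],             # presupposes security group rules
        IamInstanceProfile={"Name": "app-instance"},  # presupposes IAM roles and policies
        BlockDeviceMappings=[{
            "DeviceName": "/dev/xvda",
            "Ebs": {"Encrypted": True, "KmsKeyId": "alias/app-key"},  # presupposes a KMS key
        }],
    )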
aaronax
Costs have not dropped. Computing becomes cheaper over time, but AWS largely does not.
cmiles8
Word on the street is that Amazon leadership basically agrees with this and recognizes things have gotten off course. AWS is a small number of things that make money and then a whole bunch of slop and bloat.
AWS was mostly spared from yesterday’s big cuts, but staff have been told to “watch this space” in the new year after re:Invent.
embedding-shape
> they offered a handful of high-value services, all at great prices, making them cost-competitive with bare metal but much easier
That was never the case for AWS, the point was never "We're cheap" but "We let you scale faster for a premium".
I first came across cloud services around 2010-2011, I think, when the company I worked at at the time started growing and we needed something better than shared hosting. AWS was brought up as a "fresh but expensive" alternative, and the CTO managed to convince management that we needed AWS even if it was expensive, because it would be a lot easier to spin servers up and down as we needed. Bandwidth costs were, I think, the most expensive part of the package, at least back then.
When I look at the performance per $ you get with AWS et al. today, it looks the same: incredibly expensive for the performance you (don't) get. You're better off with dedicated instances unless your team lacks basic server-management skills, or until the company has grown enough that dealing with the infrastructure keeps being a struggle; then hire a dedicated person and let them make the calls for what's next.
everfrustrated
I'd agree that AWS never sold on being cheaper, but there is one particular way AWS could be cheaper and that is their approach to billing-by-the-unit with no fixed costs or minimum charges.
Being able to start small from a $1/mth bill without any fixed cost overheads is incredibly powerful for small startups.
If I wanted to store bytes in a DC, it would cost $10k/mth for colo/servers/disks before I stored my first byte. Sure, there wouldn't be any incremental cost for the second byte, but that's a steep jump. S3 would have cost me $0.02. Being able to try technology and prove concepts at the product development stage is very powerful, and it's why AWS became not just a vendor but a _technology partner_ for many companies.
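Back-of-envelope, assuming roughly $0.023/GB-month for S3 Standard and ignoring requests and egress, the break-even against that fixed colo cost is a long way out:

    # Rough break-even between a fixed-cost colo setup and pay-per-unit S3.
    # $10k/month fixed cost is from the comment above; ~$0.023/GB-month is an
    # assumed S3 Standard list price (requests and egress ignored).
    COLO_FIXED_PER_MONTH = 10_000.0   # $/month before the first byte
    S3_PER_GB_MONTH = 0.023           # $/GB-month, assumed

    def s3_monthly_cost(gb_stored: float) -> float:
        """Pure pay-per-unit: no fixed overhead."""
        return gb_stored * S3_PER_GB_MONTH

    break_even_gb = COLO_FIXED_PER_MONTH / S3_PER_GB_MONTH
    print(f"S3 for 1 GB:   ${s3_monthly_cost(1):.2f}/month")        # ~$0.02
    print(f"Break-even at: ~{break_even_gb / 1024:.0f} TB stored")  # ~425 TB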
embedding-shape
> Being able to start small from a $1/mth bill without any fixed cost overheads is incredibly powerful for small startups.
Yes, no doubt about it. Initially AWS was mostly sold to growing startups as "You never know when you might want to scale fast; imagine being featured in a newspaper and your servers can't handle the load, you need cloud for that!", and in that context it kind of makes sense: pay extra, but at least be online.
But initially when you're small, or later when you're big and established, other things make more sense. That said, I agree that if you need to be able to scale up or down aggressively, cloud resources make sense for that, in addition to your base infrastructure.
torginus
But if AWS didn't have that anti-competitive data transfer fee that gets waived if your traffic goes to an internal server, why would you choose S3 vs a white-label storage vendor's similar offering?
rco8786
Anytime I have to go into the AWS control panel (which is often) I am immediately overwhelmed with a sense of dread. It's just the most bloated overcomplicated thing I could possibly imagine.
antonkochubey
You're lucky not to have dealt with Azure and GCP control panels, in that case :-)
rob74
...while on the other side, the "traditional" hosting/colocation providers feel the squeeze and have to offer more competitive prices to stay in business?
__alexs
AFAICT no AWS service has ever had a price increase. This is nonsense.
torginus
Considering you get exponentially more compute/hardware for the same money every 2 years or so, they haven't been getting that much cheaper.
raincole
Cloud has been generally getting cheaper if you take inflation into account. But hating AWS is the fad so...
thelastgallon
These are the features that AWS provides:
(1) Massive expansion of budget (100-1000x) to support empire building. Instead of one minimum-wage sysadmin with 2 high-availability, maxed-out servers for $20K-$40K (and 4-hour response time from Dell/HPE), you can have $100M of multi-cloud Kubernetes + Lambda + a mix-and-match of various locked-in cloud services (DB, etc.). And you can have a large army of SRE/DevOps. You get power and influence as a VP of Cloud This-and-That with 300-1000 people reporting to you.
(2) OpEx instead of CapEx.
(3) All leaders are completely clueless about hiring the right people in tech. They hire their incompetent buddies, who hire their cronies. Data centers can run at scale with 5-10 good people. However, they hire 3000 horrible, incompetent, and toxic people, and they build lots of paperwork, bureaucracy, and approvals around it. Before AWS, it was VMware's internal cloud that ran most companies. Getting bare metal or a VM would take months to years, and many, many meetings and escalations. With AWS, "here is my credit card, pls gimme 2 VMs" is the biggest feature.
torginus
The problem with those 5 people is that you can't hire a 6th: your stack is custom, and even if you find the right person, they'll probably need months of ramp-up.
In contrast, you could throw a stone into a bush and hit an AWS guy.
darkwater
The core of this success is this, IMO:
> Our workload is 24/7 steady. We were already at >90% reservation coverage; there was no idle burst capacity to “right size” away. If we had the kind of bursty compute profile many commenters referenced, the choice would be different.
Which TBH applies to many, many places, even if they are not aware of it.
marcinzm
I'd say the core of their success is running everything in a single rack in a single datacenter at first (for months? a year?) and getting lucky. Life is simple when you don't need the costs and effort of reliability upfront.
darkwater
They mention having a second half-rack in a different DC.
In any case, not everyone needs five nines, and it's usually much easier to bring down a platform through a bug in your own software than for the core infrastructure to go down at the rack level.
sceptic123
The point is valid: they mention adding that, so at one point they didn't have it. They're also only storing monitoring & observability data, which is never going to be mission-critical for their customers.
That's probably the main reason why they were able to get away with this and why their application does not need scalability. I see they themselves are only offering two 9s of uptime.
rossdavidh
I had a problem figuring out why the place I was working wanted to move from in-house to AWS; their workload was easily handled by a few servers, they had no big bursts of traffic, and they didn't need any of the specialized features of AWS.
Eventually, I realized that it was because the devs wanted to put "AWS" on their resumes. I wondered how long it would take management to catch on that they were being used as a place to spruce up your resume before moving on to catch bigger fish.
But not long after, I realized that the management was doing the same thing. "Led a team migration to AWS" looked good on their resume, also, and they also intended to move on/up. Shortly after I left, the place got bought and the building it was in is empty now.
I wonder, now that Amazon is having layoffs and Big Tech generally is not as many people's target employer, will "migrated off of AWS to in-house servers" be what devs (and management) want on their resume?
whstl
Devs wanting to put AWS on their resume push for it, then the next wave you hire only knows AWS.
And then discussions on how to move forward are held between people that only know AWS and people who want to use other stuff, but only one side is transparent about it.
ksec
Many other points. When the cloud started, it offered great value in adjacent products and services. Scaling was painful, getting bare metal hardware had long lead times, provisioning took time, DCs were not of as high quality, and networks weren't as redundant. A lot of these are much less of an issue today.
In 2010 you could only get 64 Xeon cores by going to 8 sockets, i.e. a maximum of 8 cores per socket. And that is ignoring NUMA issues. Today you can get 256 cores per socket, each at least twice as fast. What used to be 64 servers can now be fitted into 1. And by 2030 it will be closer to a 100-to-1 ratio. Not to mention that server software has gotten a lot faster compared to 2010: PHP, Python, Ruby, Java, ASP, or even Perl. If we added everything up, I wouldn't be surprised if we were at a 200- or 300-to-1 ratio compared to 2010.
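One rough way to read those numbers (an interpretation with assumed figures, not a benchmark):

    # 64 old single-socket 8-core servers vs. one modern 256-core socket whose
    # cores are ~2x faster, plus an assumed ~4x speedup from modern runtimes.
    old_servers = 64
    old_cores_per_server = 8
    old_capacity = old_servers * old_cores_per_server        # 512 core-equivalents

    new_cores_per_socket = 256
    per_core_speedup = 2.0
    new_capacity = new_cores_per_socket * per_core_speedup   # 512 -> ~64:1 hardware consolidation

    software_speedup = 4.0   # assumed gain from faster PHP/Python/Ruby/Java/etc. runtimes
    print(old_capacity, new_capacity)                        # 512 512.0
    print(old_servers * software_speedup)                    # ~256:1 combined, in the 200-300:1 range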
I am pretty sure there is some version of Oxide in the pipeline that will catch up to the latest Zen CPU cores. If one server isn't enough, a few Oxide racks should fit 99% of Internet companies' usage.
debarshri
Recently I learned that orgs these days want to show software and infrastructure spend as CapEx, since they can book it as a depreciating asset for tax purposes.
I understand that with AWS you cannot do that, as it is usually seen as OpEx.
I guess that's a good enough motivation to move out of AWS at scale.
yanslookup
FD: I work at Amazon. I also started my career at a time when I had to submit paper requests for servers, with turnaround times measured in months.
I just don't see it. Given the nature of the services they offer, it's just too risky not to use as much managed stuff with SLAs as possible. k8s alone is a very complicated control plane plus a freaking database that is hard to keep happy if it's not completely static. In a prior life I went very deep on k8s, including self-managing clusters, and it's just too fragile; I literally had to contribute patches to etcd and I'm not a db engineer. I kept reading the post and seeing future failure point after future failure point.
The other aspect is there doesn't seem to be an honest assessment of the tradeoffs. It's all peaches and cream, no downsides, no tradeoffs, no risk assessment etc.
hedora
At another big-4 hyperscaler, we ended up with substantial downtime and a lossy migration because they didn’t know how to manage kubernetes.
Microk8s doesn’t use etcd (they have their own, simpler thing), which seems like a good tradeoff at single rack scale: https://benbrougher.tech/posts/microk8s-6-months-later/
The article’s deployment has a spare rack in a second DC and they do a monthly cutover to AWS in case the colo provider has a two site issue.
Spending time on that would make me sleep much better than hardening a deployment of etcd running inside a single point of failure.
What other problems do you see with the article? (Their monthly time estimates seem too low to me - they’re all 10x better than I’ve seen for well-run public cloud infrastructure that is comparable to their setup).
AndroTux
Managing a complex environment is hard, no matter whether that’s deployed on AWS or on prem. You always need skilled workers. On one platform you need k8s experts. On the other platform you need AWS experts. Let’s not pretend like AWS is a simple one-click fire and forget solution.
And let’s be very real here: if your cloud service goes down for a few hours because you screwed something up, or because AWS deployed some bad DNS rules again, the world moves on. At the end of the day, nobody gives a shit.
electroly
I put our company onto a hybrid AWS-colocation setup to attempt to get the best of both worlds. We have cheap fiddly/bursty things and expensive stable things and nothing in between. Obviously, put the fiddly/bursty things in AWS and put the stable things in colocation. Direct Connect keeps latency and egress costs down; we are 1 millisecond away from us-east-1 and for egress we pay $0.02/GB instead of the regular $0.09/GB and it doesn't need to go through a NAT Gateway. The database is on the colo side so database-to-AWS reads are all free AWS ingress instead of egress, and database-to-server traffic on the colo side doesn't transit to AWS at all. The savings on the HA pair of SQL Server instances is shocking and pays for the entire colo setup, and then some. I'm surprised hybrids are not more common. We are able to manage it with our existing (small) staff, and in absolute terms we don't spend much time on it--that was the point of putting the fiddly stuff in AWS.
The biggest downside I see? We had to sign a 3 year contract with the colocation facility up front, and any time we want to change something they want a new commitment. On AWS you don't commit to spending until after you've got it working, and even then it's your choice.
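To make the egress arithmetic concrete, using the per-GB rates above (the 50 TB/month volume is purely illustrative, and the Direct Connect port-hour charge is ignored):

    EGRESS_INTERNET_PER_GB = 0.09   # standard internet egress rate cited above
    EGRESS_DX_PER_GB = 0.02         # Direct Connect data transfer out, cited above

    example_tb_per_month = 50                      # hypothetical monthly volume
    gb = example_tb_per_month * 1024
    internet_cost = gb * EGRESS_INTERNET_PER_GB    # ~$4,608/month
    dx_cost = gb * EGRESS_DX_PER_GB                # ~$1,024/month
    print(f"Internet egress: ${internet_cost:,.0f}/month")
    print(f"Direct Connect:  ${dx_cost:,.0f}/month (saves ${internet_cost - dx_cost:,.0f})")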
sondr3
> Cloud makes sense when elasticity matters; bare metal wins when baseload dominates.
This really is the crux of the matter in my opinion, at least for applications (databases and so on are more nuanced). I've only worked at one place where using cloud functions made sense (keeping it somewhat vague here): data ingestion from stations that could be EXTREMELY bursty. Usually we got data from the stations at roughly midnight every day, nothing a regular server couldn't handle, but occasionally a station would come back online after weeks, or new stations got connected, etc., which produced incredible load for a very short time while we fetched, parsed, and handled each packet. Instead of queuing things for ages, we could just scale it out horizontally to handle the pressure.
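The rough shape of that, as an SQS-triggered Lambda handler (a sketch; the function and field names are hypothetical, not our actual code):

    import json

    def handler(event, context):
        # With an SQS trigger, Lambda delivers a batch under event["Records"];
        # concurrency scales with queue depth rather than a fixed server pool.
        for record in event["Records"]:
            packet = json.loads(record["body"])
            process_station_packet(packet)

    def process_station_packet(packet: dict) -> None:
        # Placeholder for the real parse/store step.
        print(f"station={packet.get('station_id')} readings={len(packet.get('readings', []))}")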
TYPE_FASTER
> It depends on your workload.
Very much this.
Small team in a large company who has an enterprise agreement (discount) with a cloud provider? The cloud can be very empowering, in that teams who own their infra in the cloud can make changes that benefit the product in a fraction of the time it would take to work those changes through the org on prem. This depends on having a team that has enough of an understanding of database, network and systems administration to own their infrastructure. If you have more than one team like this, it also pays to have a central cloud enablement team who provides common config and controls to make sure teams have room to work without accidentally overrunning a budget or creating a potential security vulnerability.
Startup who wants to be able to scale? You can start in the cloud without tying yourself to the cloud or a provider if you are really careful. Or, at least design your system architecture in such a way that you can migrate in the future if/when it makes sense.
nik736
It's an interesting article, thanks for that.
What people forget about the OVH or Hetzner comparison is that the entry-level servers they are known for (think the Advance line at OVH or the AX line at Hetzner) come with some drawbacks.
The OVH Advance line, for example, comes without ECC memory, in a server that might host databases. It's a disaster waiting to happen. There is no option to add ECC memory to the Advance line, so you have to use Scale or High Grade servers, which are far from "affordable".
Hetzner by default comes with a single PSU and a single uplink. Yes, if nothing happens this is probably fine, but if you need a reliable private network or 10G it will cost extra.
hedora
Their current Advance offerings use AMD EPYC 4004 with on-die ECC. I can’t figure out whether it’s “real” single-error-correction/double-error-detection, or whether the data lines between the processor and the DIMMs are protected, though.
torginus
I never understood the draw of 'server-grade hardware'. Consumer hardware fails rarely enough that you could 2x your infra and still be paying less.
lossolo
These concerns are exaggerated. I've been running on Hetzner, OVH and friends for 20 years. During that time I've had only two issues, one about 15 years ago when a PSU failed on one of the servers, and another a few years ago when an OVH data center caught fire and one of the servers went down. There have been no other hardware issues. YMMV.
hedora
They matter at scale, where 1% issues end up happening on a daily or weekly basis.
For a startup with one rack in each of two data centers, it’s probably fine. You’ll end up testing failover a bit more, but you’ll need that if you scale anyway.
If it’s for some back office thing that will never have any load, and must not permanently fail (eg payroll), maybe just slap it on an EC2 VM and enable off-site backup / ransomware protection.
vjerancrnjak
Is there software that works without ECC RAM? I think most popular databases just assume memory never corrupts.
torginus
I'm pretty sure they keep internal checksums at various points to make sure the data on disk is intact (so does the filesystem). I think they can catch when memory corruption occurs and roll back to a consistent state (you still get some data loss).
But IMO, systems like these (like the ones handling bank transactions) should have a degree of resiliency to this kind of failure, as any hardware or software problem can cause something similar.
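A toy illustration of the checksum idea (not how any particular database lays out its pages): store a CRC next to each block and verify it on read, so a flipped bit is detected instead of silently served back.

    import zlib

    def write_page(payload: bytes) -> bytes:
        # Prepend a CRC32 of the payload to the stored page.
        return zlib.crc32(payload).to_bytes(4, "big") + payload

    def read_page(page: bytes) -> bytes:
        stored_crc = int.from_bytes(page[:4], "big")
        payload = page[4:]
        if zlib.crc32(payload) != stored_crc:
            raise IOError("page checksum mismatch: corruption detected")
        return payload

    page = write_page(b"account=42;balance=100")
    corrupted = page[:10] + bytes([page[10] ^ 0x01]) + page[11:]  # flip one bit
    read_page(page)  # fine
    try:
        read_page(corrupted)
    except IOError as e:
        print(e)     # corruption caught instead of being returned as valid data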
jammo
Yes, but there are options for dedicated server providers who offer dual PSUs, ECC RAM, etc. It's more expensive though; e.g. a 24-core EPYC with 384GB RAM and dual 10G network is around $500/month (though there are smaller servers on serversearcher.com, among other examples).
I'm so surprised there is so much pushback against this... AWS is extremely expensive. The use cases for setting up your system or service entirely in AWS are rarer than people seem to realise. Maybe I'm just the old man screaming at cloud (no pun intended), but when did people forget how to run a bare metal server?
> We have 730+ days with 99.993% measured availability and we also escaped AWS region wide downtime that happened a week ago.
This is a very nice brag. Given they are running their DDoS protection and ingress via Cloudflare, there is that dependency, but in that case I can 100% agree that DNS and ingress can absolutely be a full-time job. Running some microservices and a database absolutely is not. If your teams are constantly monitoring and adjusting them, such as scaling, then the problem is the design, not the hosting.
Unless you're a small company serving up billions of heavy requests an hour, I would put money on the bet AWS is overcharging you.