
Geico repatriates work from the cloud, continues ambitious infra overhaul

geicosreyes

I've directly participated in this project and all I have to say is this: the same madness that created a super complex and unmanageable environment in the cloud is now in charge of creating a super easy and manageable environment on premises. The PoC had barely been approved and there was already legacy stuff in the new production environment.

Geico's IT will slow to a crawl in the next years due to the immense madness of supporting Kubernetes on top of OpenStack on top of Kubernetes (yes, that's what they are doing).

stackskipton

This article read more like advertisement for VP spearheading all of this.

JohnMakin

Thank you for posting this - reading this set off a lot of alarm bells, and there's a loud, growing "on prem" marketing movement that is likely to trumpet this as the downfall of "cloud" that I wasn't particularly looking forward to arguing with.

sofixa

> Kubernetes on top of OpenStack on top of Kubernetes (yes, that's what they are doing).

OpenStack's services are running in Kube? And Kube itself is run as an OpenStack thing? Why? Why not use the same tooling used to deploy that initial Kube to deploy as many as needed? Still a massive maintenance burden, but you don't need to add OpenStack into the mix.

mrweasel

Because you can't necessarily run everything in Kubernetes, or in the same cluster. OpenStack probably provides VMs, private networks and a bunch of other stuff to run legacy systems, third-party software, Windows applications, tons of stuff that can't be containerized.

You can have a large Kubernetes cluster running OpenStack, because it's probably the easiest way to deploy and maintain OpenStack. You then build smaller, isolated Kubernetes clusters on top of OpenStack, using VMs.

It's not as crazy as it sounds, but it does feel a little unnecessarily complex.

hamandcheese

I get why you might want to use OpenStack.

And I get why you might want to use OpenStack on Kubernetes.

What I don't get is why you would want Kubernetes on OpenStack on Kubernetes.

derefr

From what I've seen in other projects, I think that translates to:

1. we have a management k8s cluster where we deploy app blueprints

2. the app blueprints contain, among other things, specifications for VMs to allocate, which get allocated through an OpenStack CRD controller

3. and those VMs then get provisioned as k8s nodes, forming isolated k8s clusters (probably themselves exposed as resource manifests by the CRD controller on the management cluster);

4. where those k8s nodes can then have "namespaced" (in the Linux kernel namespaces sense) k8s resource manifests bound to them

5. which, through another CRD controller on the management cluster and a paired CRD agent controller in the isolated cluster, causes equivalent regular resource manifests to be created in the isolated cluster

6. ...which can then do whatever arbitrary things k8s resource manifests can do. (After all, these manifests might even include deployments of arbitrary other CRD controllers, for other manifests to rely upon.)

All said, it's not actually that braindead of an architecture. You might better think of it as "k8s, with OpenStack serving as its 'Container Compute-Cluster Interface' driver for allocating new nodes/node pools for itself" (the same way that k8s has Container Storage Interface drivers.) Except that

1. there isn't a "Container Compute-Cluster Interface" spec like the CSI spec, so this needs to be done ad-hoc right now; and

2. k8s doesn't have a good multi-tenant security story — so rather than the k8s nodes created in these VMs being part of the cluster that spawned them, their resources isolated from the management-layer resources at a policy level, instead, the created nodes are formed into their own isolated clusters, with an isolated resource-set, and some kind of out-of-band resource replication and rewriting to allow for "passive" resources in the management cluster that control "active" resources in the sandboxed clusters.
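The nesting described above can be sketched as a workload-cluster blueprint on the management cluster, in the style of Cluster API with its OpenStack infrastructure provider (CAPO). This is a hypothetical illustration of the pattern, not Geico's actual config; all names and fields here are assumptions:

```yaml
# Hypothetical sketch of steps 2-3: a blueprint on the management k8s
# cluster declares a child cluster; an OpenStack CRD controller turns it
# into VMs, which come back as nodes of an isolated k8s cluster.
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: tenant-a              # the isolated child cluster
  namespace: management
spec:
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
    kind: OpenStackCluster    # handled by the OpenStack CRD controller
    name: tenant-a
---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: OpenStackCluster
metadata:
  name: tenant-a
  namespace: management
spec:
  # OpenStack (itself deployed on the management k8s) allocates the VMs,
  # networks, and load balancers that back the tenant-a cluster's nodes.
  externalNetwork:
    filter:
      name: public            # illustrative network name
```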

lowbloodsugar

First you charge them to put a star on their belly, and then you can charge them to take the star off their belly!

RobRivera

All the whey down

Dios mio mayne

0xbadcafebee

They had an expensive, fractured, hard to maintain on-prem layout. Then they moved to the cloud. And it turned out the cloud was expensive, fractured, and hard to maintain. So they're moving to on-prem.

Any bets on what's going to happen next?

mmcconnell1618

The comment about "running legacy applications in the cloud was not any cheaper" stood out to me. Just moving the same legacy design into the cloud is not the optimal way to gain cost and availability improvements.

If you have ever seen a data center from Azure, GCP or AWS, you will realize how difficult it will be for any company to compete in the long run. Those companies develop new generations of data center infrastructure with power efficiency improvements every single year. They negotiate network and power contracts at a scale that exceeds any typical Fortune 500 company. I'm skeptical that running your own data center will end up a cost saver in the long run.

kkielhofner

> They negotiate network and power contracts at a scale that exceeds any typical Fortune 500 company.

..and then mark it up. AWS overall has 38% operating margin[0]. Depending on your application this can hit you really hard (cloud egress bandwidth being an especially obscene offender).

> I'm skeptical that running your own data center will end up a cost saver in the long run.

It's not cloud -or- your own Azure-scale datacenter. There are any number of approaches in between including hybrid to offload stuff like CDN, storage, edge services, etc to cloud but the fact remains many companies can run the entire business from a few beefy machines in co-location facilities. Most companies, solutions, etc are not actually Google, Snapchat, Geico, etc scale and never will be.

Throw in some minor accounting tricks like leasing (with or without Section 179) and these kinds of "creative" approaches are often impossible to beat from a pricing/performance and even uptime standpoint. That's certainly been my experience.

[0] - https://www.theinformation.com/articles/why-aws-fat-margins-...
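To make the markup point concrete, a back-of-the-envelope sketch; every number below is an assumption for illustration, not a vendor quote:

```python
# Back-of-the-envelope egress comparison. Both rates are assumed,
# illustrative numbers: ~$0.09/GB is a commonly cited public-cloud
# internet egress rate; colo bandwidth is often billed as a flat
# monthly fee for a committed port, regardless of bytes moved.
CLOUD_EGRESS_PER_GB = 0.09       # assumed $/GB
COLO_FLAT_MONTHLY = 2_000.00     # assumed flat $/month for a bandwidth commit

def cloud_egress_cost(gb_per_month: float) -> float:
    """Monthly cloud egress bill at the assumed per-GB rate."""
    return gb_per_month * CLOUD_EGRESS_PER_GB

if __name__ == "__main__":
    gb = 100_000  # 100 TB/month of egress
    print(f"cloud: ${cloud_egress_cost(gb):,.0f}/month")  # ~ $9,000
    print(f"colo:  ${COLO_FLAT_MONTHLY:,.0f}/month (flat)")
```

Under these assumed rates, the crossover comes fast for egress-heavy workloads, which is why CDN/bandwidth offload is usually the first thing a hybrid setup moves off the cloud.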

HideousKojima

Colocation is always an option

wnevets

> Any bets on what's going to happen next?

Someone in the c-suite gets a massive bonus before moving to a new company.

miyuru

According to the blog they started the cloud migration in 2013; there have been a lot of improvements/changes to on-prem since then.

whatever1

If you don't have strong seasonality and aren't expecting a significant ramp-up of compute demand (true for startups), why bother with the cloud?

It is not more secure (I read about downtime events every quarter), and more importantly you have zero control over your costs.

Your company is likely not Amazon, you will do fine if you have your on prem computers.

bluGill

If your data center isn't large enough to need at least 5 full-time admins, then you should just go cloud. With a part-time person you will see downtime when a machine fails. With 1 person, that person will sometimes be on vacation when a zero day takes you down. With 2 people, 1 will be on vacation when the second gets sick. You end up needing at least 5 people before you have redundancy for human issues and the ability to train people on whatever is newly needed.

Of course even in the cloud you still need to apply security patches to everything. However it still saves a lot of issues and thus money in all but the largest setups.

munk-a

Additionally, as someone who has been a part of the interview process for IT people: if you only have two people and you're not an expert yourself, there's a non-negligible chance that neither of the two people you've got is particularly good at their job. I'd advise any company to just accept the premium cost of using cloud services rather than risk getting ransomwared or what-have-you and finding out nobody ever actually tested the backups.

The costs of getting things wrong with on-prem aren't high on average, but they sure are spiky if you get unlucky.

x0x0

> With a part time person you will see downtime when a machine fails.

Many data centers offer remote hands services. And I don't believe this is at all true.

I worked at a place that managed thousands of boxes in dozens of pops with 1.5 fulltime people. If you design it for this from the beginning, with cattle not pets and netboot everywhere, this is very doable. And a large cost savings vs cloud.

bluGill

The assertion was about bringing this on-prem, so you don't get that remote hands service. A data center instead of on-prem is a valid option and might be best; check the contract and the services they provide carefully.

kkielhofner

> With a part time person you will see downtime when a machine fails

If a hardware failure causes downtime you're doing it wrong. Additionally, big cloud scaring people from hardware with marketing and FUD has been very effective. Modern hardware is insanely reliable and performant - I don't think I've seen a datacenter/enterprise NVMe drive fail yet. It's not 2005 with spinning disks and power supplies blowing up left and right anymore.

> With 1 person that person will sometimes be on vacation when a zero day takes you down. With 2 people 1 will be on vacation when the second gets sick. You end up needing at least 5 people before you have enough people that you have redundancy for humans issues and the ability to train people in whatever is the latest needed.

Hardware vendors (Dell, etc) have highly-discounted warranty services. In the event of a hardware failure you open a ticket and they dispatch someone directly to the facility (often within hours by SLA) and it gets handled.

Same thing for shipping HW directly to co-lo and they rack/cable/bootstrap for a nominal fee, remote hands for weird edge-cases, etc.

A lot of takes here and elsewhere seem to be either big-cloud or Meta-level datacenter. I have operated POPs in a dozen co-location ("datacenter") facilities (a cabinet or two each) no one on staff ever stepped foot in with hardware we owned (and/or financed) that no one ever saw or touched. We operated this with two people looking after it as part of their broader roles and responsibilities and frankly they didn't have much to do.

There is an entire industry that provides any number of highly flexible and cost-effective approaches for everything in between.

stackskipton

To me, the downside of on-premise hardware isn't hardware swap-out, it's just dealing with hardware in general. All hardware needs updates, which means downtime for that hardware. Also, anyone who's been in this industry long enough has been around for an "Oh, we'll just replace that broken piece of hardware" that ended up as "WHY IS EVERYTHING ON FIRE?" because versions didn't match up, hardware was rejected, or just plain "Actually, THAT failure mode isn't redundant."

That can happen in the public cloud as well, but since they work with hardware at much, much larger scale and, most of the time, build their own hardware and software, they are much more aware of the sharp edges.

Finally, with the Broadcom acquisition, what virtualization software are you using, and is it really cheaper than the cloud?

oneplane

It's not really about cloud vs. on-prem, it's the fact that people cut corners and lack knowledge on-prem, and don't have the budgets to do anything about it.

What you're referring to is mostly about elasticity, and it's true that if you don't need it, it doesn't make sense to pay for it.

But that doesn't mean that on-prem (which almost always turns into a virtual machine shitshow with crappy network design -- which will continue as long as nobody implements things like strong IAM and Security Groups in their on-prem setups) is 'the same' as cloud but just in a physical location you control.

The inverse is also true. If you just run some VMs 'in the cloud', you're doing it wrong. Playing datacenter is just as bad as not moving away from classic virtual machines, cloud or no cloud.

mrweasel

That's really what some/most companies want, a platform that can run cheap, fast and easy VMs, like on-prem, but without the hassle of having to deal with the hardware and physical network part, like in the cloud. Sadly that's not the choice being offered.

I don't know, I've seen the shittiest stuff built on-prem and in cloud, and I've seen completely amazing on-prem infrastructure and cloud stuff that could not possibly be built outside AWS.

whatever1

So when they are setting up config files for the cloud they don't cut corners? It is an insane amount of work to follow safe practices to configure your cloud.

I don't see that much difference compared to doing actual admin tasks.

oneplane

The entire underlying layer of possible misconfigurations is absent in the cloud. Yes, the services on top of that can still be misconfigured, but you don't get access to hosts, SANs, switches, firewalls, gateways, there isn't anything for you to mess up. The shared responsibility model allows you to also pick even more robust options.

But even if you were to stick to something simple, say, object storage. A bucket or blob store has no SAN config, no webserver config, no switches, no gateways, no raid controllers, no striping, mirroring, parity configuration, no firmware, no BIOS, no BMC, no OS. None of that. It's all eliminated. All that remains is the top layer where you configure your cost-to-resilience ratio and your access policy. And yes, you could cut corners, but those are orders of magnitude fewer corners you could be cutting than if you include all the stuff below it.
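As a concrete illustration of how small that remaining surface is: for object storage, the access policy itself is roughly the whole knob, and a public-read statement is the classic corner-cut. A minimal sketch building an S3-style policy document (the bucket name and helper are made up for illustration):

```python
import json

def bucket_policy(bucket: str, public_read: bool = False) -> str:
    """Build a minimal S3-style bucket policy document.

    With managed object storage, this policy (plus a storage-class /
    resilience choice) is essentially the entire configuration surface
    left to the user -- and the public-read statement below is the
    classic corner-cut misconfiguration.
    """
    statements = []
    if public_read:
        statements.append({
            "Sid": "AllowPublicRead",
            "Effect": "Allow",
            "Principal": "*",                       # anyone on the internet
            "Action": ["s3:GetObject"],
            "Resource": [f"arn:aws:s3:::{bucket}/*"],
        })
    return json.dumps({"Version": "2012-10-17", "Statement": statements})

policy = bucket_policy("example-bucket", public_read=True)
```

One document to review, instead of SAN zoning, RAID layout, firmware, and an OS underneath it.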

Add to that: almost all of it has good APIs that are well defined, well supported and have an ecosystem to go with it. Try finding anything like that for a crappy NetApp or EMC appliance you find in a datacenter. It either doesn't exist, or it's so bad you might as well run MinIO or a bloody NFS share (not actual object storage) yourself.

Being bad at cloud is definitely more expensive than being bad at on-prem, I'll give you that. But with cloud, at least you get a bill that you can use to show your peers and higher ups that being bad has a cost. Internal virtual/amortised dollars are much harder to allocate to incompetence. It's often completely ignored, and at best revisited at periodic capacity planning reviews with few to no consequences.

The only place on-prem wins is with locality requirements. That includes latency-sensitive things where sub-1ms is a goal, and air-gapped things. But even in the first case things like an AWS Outpost exist, and those are cheaper than doing it yourself (not much, but enough to save on the hardware and on 2 FTEs).

milesward

Find me a list of customers on cloud who got hacked, vs folks on-prem. I've got 3k+ customers, I know which one I see 99.99% of the time...

whatever1

I guess you don't count misconfigurations. But deciding between the cloud vs local is a choice between config or admin.

alexjplant

Disclaimer: this is anecdotal so n=1. All opinions are my own. No value judgment one way or another is expressed or implied.

Professional developers these days are primarily concerned with 1) getting their service running 2) as quickly as possible 3) someplace where they have instant access and control of it. Clicking around a cloud console accomplishes all three of these and allows you to write "Delivered the ____ service in 3 months that generates $XX M/year" on a performance review in short order. Having to build, rack, and configure a physical server or deal with "IT" (which has somehow become something separate from software engineering) does not. Because the developers are the ones delivering value, they get to decide how it's done. AWS gets it done. A server in a datacenter in Texas that requires an SSH keypair to reach doesn't.

Your average SDE L4 doesn't know or care about init systems or SANs or colos or 802.1q or any of the myriad things required to run on-prem infra. They write software. Software makes money and so the business makes money - wash, rinse, repeat. Why would you have people on the front lines of your revenue stream worrying about these things when you can have a hyperscaler with a control plane do it for a nominal fee?

whatever1

If the hyperscaler asks for 200% of my revenue then yes.

alexjplant

But they don't. They ask for a deterministic usage-based amount.

weitendorf

Because you're not Amazon, you also probably don't have tech as your core competency and don't have the budget to hire people skilled enough to operate your on-prem setup as well as they operate their cloud.

Because you're not a startup, there is a very good chance that you have a very process-driven (cover your ass), slow-moving culture - this very often translates to an IT department where getting even basic things done (like reserving extra compute or changing a network setting or starting to use third-party software) takes months of waiting or pleading. Maybe you have never encountered this kind of pathological IT department, but they're very common, and it's a major reason executives bought into cloud to begin with. Of course, many companies like Geico seem to have merely replicated their IT pathologies in the cloud, but at least in the cloud you have fewer sources of problems in areas like physical space management, buying/integrating hardware to grow or change your footprint and dealing with all the SKUs and supply chain problems therein, or negotiating on-prem licences.

There are many more moving pieces when operating on-prem: more operations staff across more kinds of roles (yes, you still have eg devops people when using the cloud, but you don't need as many building operations staff (where managing a datacenter is its own speciality), people managing hardware/software vendors and related supply chain issues, people skilled in physical networking, people to plug things in/out and physically operate the machines), managing and acquiring the physical space where your on-prem setup is, buying/accounting for all the different kinds of hardware you need, licensing/using more software with more difficult integration to achieve equivalent functionality to eg EC2, licensing all your 3P software to run on-prem... even if nominally less expensive than the cloud in some cases, there are many more places where things can go wrong. That's not as easy to account for in a direct TCO comparison because it manifests as slowing things down - which does introduce very substantial costs - and distracting management away from other opportunities to grow revenue or improve costs.

Also, cloud downtime is really overstated as a problem in 2024. It makes the news because it has a high blast radius and involves high profile companies, not because it's more common than on-prem. With the exception of AWS us-east1 issues (which can break many AWS products at once across the world), most cloud reliability issues these days are isolated to only a few products and only a few regions. I think a lot of small on-prem companies don't realize that they are not actually more reliable, but just operate at a smaller scale where the probability of downtime causes "lucky streaks" to be more common (ie if you play roulette for three rounds, you're much more likely to have an abnormally high win rate than someone who plays it for three hundred rounds, even though you both have the same odds). Most companies don't have as mature security/risk operations as cloud providers and so face an existential risk/the possibility of huge (months) of downtime in the event of a fire/natural disaster at their dc, cryptolocker attack, janitor unplugging the server that says "do not unplug" - this isn't something people have to worry about with cloud providers to nearly the same extent.

VirusNewbie

> expecting a significant ramp up of compute demand

Lots of data processing workloads don't need to be run constantly, but do need to be run in a shorter amount of time. Cloud is pretty good for that sort of thing.

beaviskhan

A company with the size and financial resources of Geico ought to be able to handle on-prem just fine. I am a huge public cloud fan, but it is definitely not a great (or even good) fit for everyone.

mullingitover

I feel like even in Geico's case, once they've paid salaries for everyone who's going to need to maintain this infra they're bringing in-house, they're probably not saving that much. Then again, maybe they were already paying those salaries redundantly on top of all the services they were spending on, e.g. managed databases.

jnwatson

Cloud provides the CIO the same opportunities for advancement that COOs have had for years.

Staff costs too high? Outsource. Opex too high? Insource.

You can spend a career jumping among companies swinging the pendulum back and forth.


gtirloni

I'd gladly pay 2.5x more to not use OpenStack ever again.

hnburnsy

Is building things cloud provider agnostic a thing? Is building things cloud or on prem agnostic a thing?

delusional

What a shame that the most interesting thing we can discuss about software now is where the computer it's running on is located.

I must admit. The computer was never the part of software that interested me.

stonethrowaway

> In an interview with The Stack she confirmed the shift, saying “we have a lot of data – and it turns out that storage in the cloud is one of the most expensive things you can do in the cloud, followed by AI in the cloud…”

This has been the story for 20 years now. Not even exaggerating. We all knew it was expensive from the get-go because we all did things on prem.