
Zero-Downtime Kubernetes Deployments on AWS with EKS

paol

I'm not sure why they state "although the AWS Load Balancer Controller is a fantastic piece of software, it is surprisingly tricky to roll out releases without downtime."

The AWS Load Balancer Controller uses readiness gates by default, exactly as described in the article. Am I missing something?

Edit: Ah, it's not on by default; it requires a label on the namespace. I'd forgotten about this. To be fair, though, the AWS docs do tell you to add this label.
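
For anyone else who forgot, the opt-in is a namespace label, roughly like this (the namespace name is just an example):

    apiVersion: v1
    kind: Namespace
    metadata:
      name: my-app                                    # example namespace
      labels:
        # opts pods in this namespace in to readiness gate injection
        # by the AWS Load Balancer Controller
        elbv2.k8s.aws/pod-readiness-gate-inject: enabled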

pmig

Yes, that is what we thought as well, but it turns out there is still a delay between the load balancer controller marking a target for deregistration and the pod actually being terminated. We ran some benchmarks to highlight that gap.

paol

You mean the problem you describe in "Part 3" of the article?

Damn it, now you've made me paranoid. I'll have to check the ELB logs for 502 errors during our deployment windows.

pmig

Exactly! We initially received some Sentry errors that piqued our curiosity.

Spivak

I think the "label based configuration" has got to be my least favorite thing about the k8s ecosystem. They're super magic, completely undiscoverable outside the documentation, not typed, not validated (for mutually exclusive options), and rely on introspecting the cluster and so aren't part of the k8s solver.

AWS uses them for all of their integrations and they're never not annoying.
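
To give a flavor of what I mean (annotations and labels alike), a Service fronted by the AWS Load Balancer Controller ends up configured through free-form strings, roughly like this; the name, selector, and ports are illustrative:

    apiVersion: v1
    kind: Service
    metadata:
      name: my-service                  # illustrative name
      annotations:
        # none of this is typed or validated by the API server;
        # a mistyped key is just silently ignored by the controller
        service.beta.kubernetes.io/aws-load-balancer-type: external
        service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing
        service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
    spec:
      type: LoadBalancer
      selector:
        app: my-app
      ports:
        - port: 80
          targetPort: 8080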

_bare_metal

I run https://BareMetalSavings.com.

The number of companies that use K8s when they have no business or technological justification for it is staggering. It is the number one blocker in moving to bare metal/on-prem when costs become too much.

Yes, on-prem has its gotchas, just like the EKS deployment described in the post, but everything is so much simpler and more straightforward that the on-prem side of things is much easier to grasp.

abtinf

Could you expand a bit on the point of K8S being a blocker to moving to on-prem?

Naively, I would think it would be neutral, since I would assume that if a customer gets k8s running on-prem, then apps designed to run in k8s should have a straightforward migration path?

MPSimmons

I can expand a little bit, but based on your question, I suspect you may know everything I'm going to type.

In cloud environments, it's pretty common that your cloud provider has specific implementations of Kubernetes objects, either by creating custom resources that you can make use of, or just building opinionated default instances of things like storage classes, load balancers, etc.

It's pretty easy not to think about the implementation details of, say, an object-storage-backed PVC until you need to do it in a K8s instance that doesn't already have your desired storage class. Then you've got to figure out how to map your simple-but-custom $thing from provider-managed to platform-managed. If you're moving into Rancher, for instance, it's relatively batteries-included, but there are definitely considerations you need to make for things like how machines are built from a disk storage perspective and where Longhorn drives are mapped.

It's like that for a ton of stuff; a whole lot of the Kubernetes/outside-infrastructure interface works this way. Networking, storage, maybe even certificate management: those all need consideration if you're migrating from cloud to on-prem.
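
To make that concrete, existing PVC manifests can keep working if the storage class name survives the move, but the class itself has to be rebuilt on a different provisioner. This is just a sketch with made-up names, using the EBS CSI driver on one side and Longhorn on the other:

    # EKS: class backed by the EBS CSI driver
    apiVersion: storage.k8s.io/v1
    kind: StorageClass
    metadata:
      name: fast                        # example name
    provisioner: ebs.csi.aws.com
    parameters:
      type: gp3
    ---
    # On-prem (e.g. Rancher + Longhorn): same class name, different backend
    apiVersion: storage.k8s.io/v1
    kind: StorageClass
    metadata:
      name: fast                        # reuse the name so PVC manifests don't change
    provisioner: driver.longhorn.io
    parameters:
      numberOfReplicas: "3"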

reillyse

Out of interest, do you recommend any good places to host a machine in the US? A major part of why I like the cloud is that it really simplifies hardware maintenance.

glenjamin

The fact that the state-of-the-art container orchestration system requires you to run a sleep command in order not to drop traffic on the floor is a travesty of system design.

We had perfectly good rolling deploys before k8s came on the scene, but k8s's insistence on a single-phase deployment process means we end up with this silly workaround.

I yelled into the void about this once and was told that it was inevitable because it's an eventually consistent distributed system. I'm pretty sure it could still have had a two-phase pod shutdown by encoding a timeout on the first stage. Sure, it would have made some internals require more complex state, but isn't that the point of k8s? Instead, everyone has to rediscover the sleep hack over and over again.
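
For anyone who hasn't hit this yet, the hack in question is a preStop sleep in the pod spec, something like this (names and durations are placeholders; the sleep has to cover your load balancer's deregistration delay):

    spec:
      terminationGracePeriodSeconds: 60   # must be longer than the preStop sleep
      containers:
        - name: app                       # example container
          image: my-app:latest            # placeholder image
          lifecycle:
            preStop:
              exec:
                # keep the pod serving while the load balancer finishes
                # deregistering it; needs a `sleep` binary in the image
                command: ["sleep", "20"]

I believe newer Kubernetes releases ship a built-in sleep lifecycle handler so you don't need the binary in the image, but the idea is exactly the same.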

dilyevsky

There are a few warts like this with the core/apps controllers. Nothing unfixable within the general k8s design, imho, but unfortunately most of the community has moved on to newer, shinier things.

evacchi

Somewhat related: https://architect.run/

> Seamless Migrations with Zero Downtime

(I don't work for them but they are friends ;))
