K8s with 1M nodes

22 comments

October 16, 2025

jeffinhat

This is an awesome experiment and write up. I really appreciate the reproducibility.

I would like to see how moving to a database that scales write throughput with replicas would behave, namely FoundationDB. I think this will require more than an intermediary like kine to be efficient since, as the author illustrates, the apiserver does a fair bit of its own watching and state keeping. I also think there's benefit, at least for blast radius, in sharding the server by API group or namespace.

I think years ago this would have been a non-starter with the community, but given AWS has replaced etcd (or at least aspects of it) with their internal log service for their large cluster offering, I bet there's some appetite for making this interchangeable and bringing an open source solution to market.

I share the author's viewpoint that for modern cloud based deployments, you're probably best avoiding it and relying on VMs being stable and recoverable. I think reliability does matter if you want to actually realize the "borg" value and run it on bare metal across a serious fleet. I haven't found the business justification to work on that though!

rixed

I don't get the point of benchmarking k8s without the guarantees of etcd. At some point, you are just competing with clusterssh.

wppick

If you don't need the isolation of k8s then don't forget about erlang, which is another option to scale up to 1 million functions. Obviously k8s containers (which are fundamentally just isolated processes) and erlang processes are not interchangeable things, but when thinking about needing on the order of millions of processes, erlang is pretty good prior art.

theptip

This is 1m nodes, you typically run tens or hundreds of pods per node, each with one or more containers. So more like 100m+ functions if I follow the Erlang analogy correctly?

ktpsns

Typical large scale high performance computing clusters are at a size of 10k nodes (for instance Jupiter and SuperMUC in Germany) [1]. These centers are quite remarkably big buildings. I wonder how many single 1M-node k8s clusters there are in the world right now. Most likely at the hyperscalers.

[1] what is a node? Typically it is a synonym for "server". In some configurations HPC schedulers allow node sharing. Then we talk about order of 100k cores to be scheduled.

osigurdson

>> [1] what is a node? Typically it is a synonym for "server". In some configurations HPC schedulers allow node sharing

I'm sure they mean actual servers / not just cores. Even in traditional HPC it usually isn't abstracted to the level of individual cores, since most HPC jobs care about memory bandwidth - even with Infiniband or other techniques, throughput / latency is much worse than on a single machine. Of course, multiple machines are connected (usually using MPI / Infiniband), but it's important to try to minimize communication between nodes where possible.

For AI workloads, they are running GPUs - so 10K+ cores on a single device so even less likely to be talking about cores here.

stackskipton

I doubt any Hyperscalers are running 1M Node clusters either. They probably just have groups of clusters at each datacenter and some overall scheduler that determines which cluster is best suited for workload during deployment then connects to that cluster and schedules the workload.
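To make the "groups of clusters plus an overall scheduler" idea concrete, here is a minimal sketch of the top-level placement step. Everything in it is hypothetical: the `Cluster` fields, the `pick_cluster` name, and the "most CPU headroom" rule are illustrative assumptions, not how any hyperscaler actually does it.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Cluster:
    # Hypothetical capacity summary that each cluster reports upward.
    name: str
    free_cpu: float      # unallocated cores
    free_mem_gib: float  # unallocated memory in GiB


def pick_cluster(clusters: List[Cluster], cpu: float, mem_gib: float) -> Optional[Cluster]:
    """Return the cluster that fits the workload and has the most CPU headroom."""
    candidates = [
        c for c in clusters
        if c.free_cpu >= cpu and c.free_mem_gib >= mem_gib
    ]
    if not candidates:
        return None  # nothing fits; a real system would queue or reject
    return max(candidates, key=lambda c: c.free_cpu)


clusters = [
    Cluster("dc1-a", free_cpu=120, free_mem_gib=512),
    Cluster("dc1-b", free_cpu=800, free_mem_gib=256),
    Cluster("dc2-a", free_cpu=400, free_mem_gib=2048),
]

chosen = pick_cluster(clusters, cpu=64, mem_gib=300)
print(chosen.name if chosen else "no cluster fits")
```

The per-cluster scheduler then only ever sees its own nodes, which is why no single control plane has to hold state for the whole fleet.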

up2isomorphism

“Perhaps my spiciest take from this entire project: most clusters don’t actually need the level of reliability and durability that etcd provides.”

This assumption is completely out of touch, and is especially funny when the goal is to build an extra large cluster.

itsnowandnever

etcd is also the entire point of k8s. that it's a single self-contained framework and doesn't require an external backer service. there is no kubernetes without etcd. much of the "secret sauce" of kubernetes is the "watch etcd" logic that "watches" desired state and does the cybernetic loop to bring the observed state in line with the desired state.
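For anyone unfamiliar with that loop, here is a toy sketch of the idea: compare desired and observed state, emit the actions that close the gap, repeat. The `reconcile` and `apply_actions` names and the dict-of-replica-counts model are assumptions for illustration only; real controllers react to watch events from the API server rather than diffing plain dicts.

```python
from typing import Dict, List, Tuple

Action = Tuple[str, str, int]  # (verb, workload name, replica delta)


def reconcile(desired: Dict[str, int], observed: Dict[str, int]) -> List[Action]:
    """Compute the actions that move observed replica counts toward desired."""
    actions: List[Action] = []
    for name, want in desired.items():
        have = observed.get(name, 0)
        if have < want:
            actions.append(("scale_up", name, want - have))
        elif have > want:
            actions.append(("scale_down", name, have - want))
    for name, have in observed.items():
        if name not in desired:
            actions.append(("delete", name, have))
    return actions


def apply_actions(observed: Dict[str, int], actions: List[Action]) -> Dict[str, int]:
    """Pretend to execute the actions and return the new observed state."""
    new_state = dict(observed)
    for verb, name, count in actions:
        if verb == "scale_up":
            new_state[name] = new_state.get(name, 0) + count
        elif verb == "scale_down":
            new_state[name] -= count
        else:  # "delete"
            del new_state[name]
    return new_state


desired = {"web": 3, "worker": 2}
observed = {"web": 1, "cron": 1}

actions = reconcile(desired, observed)
observed = apply_actions(observed, actions)

# Once the states match, a further pass has nothing left to do.
assert observed == desired
assert reconcile(desired, observed) == []
```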

trenchpilgrim

The API and controller loops are the point of k8s. etcd is an implementation detail and lots of clusters swap it out for something else like sqlite. I'm pretty sure that GCP and Azure are using Spanner or Cosmos instead of etcd for their managed offerings.

itsnowandnever

not exactly a fair assessment since neither of those were out and/or available to the kubernetes team at the time. sure, some things at many times from now into eternity may be or become better suited for the kubernetes data plane but at the time if etcd wasn't used there would be no kubernetes today

geoctl

Is it? I honestly kinda believe that etcd is probably the weakest point in vanilla k8s. It is simply unsuitable for heavy write environments and causes lots of consistency problems under heavy write loads, it's generally slow, it has value size constraints, it offers very primitive querying, etc... Why not replace etcd altogether with something like Postgres + Redis/NATS?

itsnowandnever

that touches on what I consider the dichotomy of k8s: it's a really scalable system that makes it easy to spin up a cluster locally on your laptop and interact with the full API locally just like in prod. so it's a super scalable system with a dense array of features. but paradoxically most shops won't need the vast majority of k8s features ever, and by the time they scale to where they do need a ton of distributed init features they're extremely close to the point where they'd be better served by a bespoke system conceived from scratch in house that solves problems very specific to the business in question. if you have many thousands of k8s nodes, you're probably in the gray area of whether using k8s is worth it, because the k8s pull/watch control plane loop will never be as fast as a centralized push control plane. and naturally at scale that problem will only compound

varispeed

> Why not replace etcd altogether with something like Postgres + Redis/NATS?

Holy Raft protocol is the blockchain of cloud.

jauntywundrkind

The API server is the thing. It so happens that the API server can mostly be a thin shell over etcd. But etcd itself while so common is not sacrosanct.

https://github.com/k3s-io/kine is a reasonably adequate substitute for etcd. sqlite, MySQL, PostgreSQL can also be substituted in. Etcd is from the ground up built to be more scale-out reliable, and that rocks to have baked in. But given how easy it is to substitute etcd out, I feel like we are at least a little off if we're trying to say "etcd is also the entire point of k8s" (the APIserver is)

itsnowandnever

that's fair but that 99% of all apiserver deployments in the world have the same standard boilerplate footprint is a large part of why it became so ubiquitous. that people running it locally don't have to make any decisions about how to deploy which database or why to use this one over that one... and that's also the same situation in production so people doing stuff in dev aren't punched in the face by an exponentially more complex system in production is huge.

kevin_nisbet

I'm with you, I think most people might think they don't need this reliability, until they do. I'm sure there is some subset of clusters where the claim is correct.

But the article talks about turning off fsync and expecting to only lose a few ms of updates. I've tried to recover etcd on volumes that lied about fsync and experienced a power outage, and I don't think we managed to recover it. There might be more options now to recover and ignore corrupted WAL entries, but at that time it was very difficult and I think we ended up just reinstalling from scratch. For clusters where this doesn't matter or the SLOs for recovery account for this, I'm totally onboard, but only if you know what you're doing.

Similarly, the point from the article that "full control plane data loss isn’t catastrophic in some environments" is correct, in the sense of what the author means by some environments. I don't think it's limited to those that are managed by gitops as suggested, though, but applies wherever there is enough resiliency and time to redeploy and do all the cleanup.
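To illustrate what fsync buys you, here is a toy append-only log. This is a sketch under loud assumptions: the length-prefixed record format and the `append_record` / `read_records` names are made up for the example and have nothing to do with etcd's actual WAL format. Without the `os.fsync` call, a record can still be sitting in the OS page cache when the power goes out, and a volume that lies about fsync puts you in the same spot even with the call.

```python
import os
import tempfile
from typing import List


def append_record(path: str, payload: bytes, durable: bool = True) -> None:
    """Append one length-prefixed record; optionally force it to stable storage."""
    with open(path, "ab") as f:
        f.write(len(payload).to_bytes(4, "big") + payload)
        f.flush()  # moves data from Python's buffer to the OS page cache
        if durable:
            os.fsync(f.fileno())  # asks the OS to persist it to the device


def read_records(path: str) -> List[bytes]:
    """Read complete records, stopping at the first torn (incomplete) one."""
    with open(path, "rb") as f:
        data = f.read()
    records: List[bytes] = []
    pos = 0
    while pos + 4 <= len(data):
        size = int.from_bytes(data[pos:pos + 4], "big")
        end = pos + 4 + size
        if end > len(data):
            break  # torn tail: header promises more bytes than exist
        records.append(data[pos + 4:end])
        pos = end
    return records


with tempfile.TemporaryDirectory() as tmp:
    wal = os.path.join(tmp, "toy.wal")
    append_record(wal, b"put /a 1")
    append_record(wal, b"put /b 2", durable=False)
    # Simulate a crash mid-write: a header claiming 9 bytes, only 3 present.
    with open(wal, "ab") as f:
        f.write((9).to_bytes(4, "big") + b"put")
    recovered = read_records(wal)

print(recovered)
```

A clean torn tail like this one is the easy case. Damage in the middle of the log is much harder, since this toy reader has no checksums and would happily misparse everything after a bad length prefix.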

Anyways, like much advice on the internet, it's not good or bad, just highly situational, and some of the suggestions should only be applied if the implications are fully understood.