High Available Mosquitto MQTT on Kubernetes

31 comments

·May 14, 2025

zrail

To preface, I'm not a Kubernetes or Mosquitto expert by any means.

I'm confused about one point. A k8s Service sends traffic to pods matching the selector that are in "Ready" state, so wouldn't you accomplish HA without the pseudocontroller by just putting both pods in the Service? The Mosquitto bridge mechanism is bi-directional so you're already getting data re-sync no matter where a client writes.

edit: I'm also curious if you could use a headless service and use an init container on the secondary to set up the bridge to the primary by selecting the IP that isn't its own.
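A minimal sketch of that init-container idea, assuming the headless Service's DNS name resolves to one A record per pod and that the pod knows its own IP (for example through an environment variable). The function names and the default port are illustrative assumptions, not anything from the article:

```python
import socket


def peer_ips(resolved_ips, own_ip):
    """Return every resolved address except this pod's own, sorted for stable output."""
    return sorted(set(resolved_ips) - {own_ip})


def resolve_headless(service_dns_name, port=1883):
    """Resolve a headless Service name to the pod IPs behind it (IPv4 only)."""
    infos = socket.getaddrinfo(
        service_dns_name, port, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr is (ip, port).
    return [info[4][0] for info in infos]
```

The init container would call `peer_ips(resolve_headless(name), own_ip)` and write the single remaining address into the bridge configuration. Whether a not-yet-ready peer shows up in DNS depends on how the Service is configured, so this is only the selection step.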

jandeboevrie

> so wouldn't you accomplish HA without the pseudocontroller by just putting both pods in the Service?

I'm not sure how fast that would be; the extra controller container is needed for the almost instant failover.

To answer your second question, why not an init container in the secondary: because this way we can scale the failover controller across multiple nodes. Otherwise, if the node where the (fairly stateless) controller runs goes down, we'd have to wait until k8s schedules another pod instead of failing over almost instantly.

rad_gruchalski

> without the pseudocontroller

I am making an assumption. I assume that you mean the deployment. The deployment is responsible for individual pods. If a pod goes away, the deployment brings a new pod in. The deployment controls individual pods.

To answer your question: yes, you can simply create pods without the deployment. But then you are fully responsible for their lifecycle and failures. The deployment makes your life easier.

zrail

I was referring to the pod running the kubectl loop. As far as I can tell (I could be wrong! I haven't experimented yet) the script is relying on the primary Mosquitto pod's ready state, which is also what a Service relies on by default.
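To make the comparison concrete, here is a toy model (not Kubernetes code) of how a Service picks its endpoints: pods whose labels match the selector and that report Ready. The pod dictionaries and label values are made up for illustration:

```python
def service_endpoints(pods, selector):
    """Return the IPs of pods whose labels match the selector and that are Ready."""
    return [
        pod["ip"]
        for pod in pods
        if pod["ready"]
        and all(pod["labels"].get(key) == value for key, value in selector.items())
    ]


# Hypothetical cluster state: the primary has just failed its readiness probe.
pods = [
    {"ip": "10.0.0.5", "ready": False, "labels": {"app": "mosquitto", "role": "primary"}},
    {"ip": "10.0.0.6", "ready": True, "labels": {"app": "mosquitto", "role": "secondary"}},
]
print(service_endpoints(pods, {"app": "mosquitto"}))  # ['10.0.0.6']
```

The model says nothing about how quickly the Ready flag flips after a failure, which is the part that decides how fast either approach fails over.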

andrewfromx

When dealing with long-lasting TCP connections, why add that extra layer of network complexity with k8s? I work for a big IoT company and we have 1.8M connections spread across 15 ec2 c8g.xlarge boxes. Not even using an NLB, just round-robin DNS. Wrote our own broker with https://github.com/lesismal/nbio and use a packer .hcl file to make the AMI that each ec2 box boots. Using https://github.com/lesismal/llib/tree/master/std/crypto/tls to make nbio work with TLS.

stackskipton

Ops type here who deals with this around Kafka.

It comes down to how much you use Kubernetes. At my company, just about everything is in Kubernetes except for databases which are hosted by Azure. So having random VMs means we need to get Ansible, SSH Keys and SOC2 compliance annoyance. So the workload effort to get VMs running may be higher than Kubernetes even if you have to put in extra hacks.

NewJazz

You don't need Ansible if it is all packed into the AMI.

stackskipton

Packer only works if you can replace machines on a repeatable basis and data can be properly moved.

If not, you need Ansible to run apt update;apt upgrade -y periodically, make sure Security Software is installed and other maintenance tasks.

spotman

Having worked at multiple IoT companies with many millions of connections: this is the way.

People tend to overcomplicate things with K8S. I have never once seen a massively distributed IoT system run without a TON of headache and outages with k8s. Sure, it can be done, but it requires spending 4-8x the amount of development time and has many more outages due to random things.

It's not just the network, it's also the amount of config you have to do to get a deterministic system. For IoT, you don't need as much bursting (for most workloads). It's a bunch of devices that are connected 24/7 with fairly deterministic workloads, usually over some type of TCP connection that is not HTTP, and trying to shove that into an HTTP paradigm costs more money, adds complexity, and is not reliable.

avianlyric

K8s itself doesn’t introduce any real additional network complexity, at least not vanilla k8s.

At the end of the day, K8s only takes care of scheduling containers, and provides a super basic networking proxy layer for convenience. But there’s absolutely nothing in k8s that requires you use that proxy layer, or any other network overlay.

You can easily set up pods that directly expose their ports on the node they’re running on, and have k8s services just provide the IPs of nodes running associated pods as a list. Then either rely on clients to handle multiple addresses themselves (by picking an address at random, and failing over to another random address if needed), configure k8s DNS to provide DNS round robin, or put an NLB or something in front of it all.
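The client-side option can be sketched roughly like this: shuffle the address list, try each address in turn, and move on when a connection attempt fails. The `connect` callable stands in for whatever the MQTT client library actually provides and is an assumption here, as is the choice to treat any `OSError` as "try the next one":

```python
import random


def connect_with_failover(addresses, connect, rng=random):
    """Try addresses in random order; return (address, connection) for the first success."""
    candidates = list(addresses)
    rng.shuffle(candidates)
    last_error = None
    for address in candidates:
        try:
            return address, connect(address)
        except OSError as exc:  # refused, unreachable, timed out, ...
            last_error = exc
    raise ConnectionError(
        f"no broker reachable among {len(candidates)} addresses"
    ) from last_error
```

On a dropped connection the client would simply call this again, which spreads reconnects across the remaining nodes without any load balancer in the path.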

Everyone uses network overlays with k8s because it makes it easy for services in k8s to talk to other services in k8s. But there’s no requirement to force all your external inbound traffic through that layer. You can just use k8s to handle nodes, and collect needed meta-data for upstream clients to connect directly to services running on nodes with nothing but the container layer between the client and the running service.

andrewfromx

| Aspect | Direct EC2 (No K8s) | Kubernetes (K8s Pods) |
|--------|---------------------|-----------------------|
| Networking Layers | Direct connection to EC2 instance (optional load balancer). | Service VIP → kube-proxy → CNI → pod (plus optional external load balancer). |
| Load Balancing | Optional, handled by ELB/ALB or application. | Built-in via kube-proxy (iptables/IPVS) and Service. |
| IP Addressing | Static or dynamic EC2 instance IP. | Pod IPs are dynamic, abstracted by Service VIP. |
| Connection Persistence | Depends on application and OS TCP stack. | Depends on session affinity, graceful termination, and application reconnection logic. |
| Overhead | Minimal (direct TCP). | Additional latency from kube-proxy, CNI, and load balancer. |
| Resilience | Connection drops if instance fails. | Connection may drop if pod is rescheduled, but Kubernetes can reroute to new pods. |
| Configuration Complexity | Simple (OS-level TCP tuning). | Complex (session affinity, PDBs, graceful termination, CNI tuning). |

avianlyric

If you read my reply again, you’ll notice that I explicitly highlight that K8s does not require the use of a CNI. There’s a reason CNIs are plugins, and not core parts of k8s.

How do you think external network traffic gets routed into a CNI’s front proxy? It’s not via kube-proxy. kube-proxy isn’t designed for use in proper production systems; it’s only a stopgap to provide a functioning cluster to enable bootstrapping of a proper network management layer.

There is absolutely nothing preventing a network layer from directly routing external traffic to pods, with the only translation being a basic iptables rule so that data sent to a node’s network interface with a pod IP is accepted by the node and routed to the pod. Given it’s just basic Linux network interface bridging, happening entirely in the kernel with zero copies, the impact of this layer is practically zero.

Indeed, k8s services set up with external load balancers basically handle all of this setup for you.

There are plenty of reasons not to use k8s, but arguing that a k8s cluster must inherently introduce multiple additional network components and complexity is simply incorrect.

oulipo

Wouldn't more modern implementations like EMQx be better suited for HA?

jpgvm

I built a high-scale MQTT ingestion system by utilising the MQTT protocol handler for Apache Pulsar (https://github.com/streamnative/mop). I ran a forked version and contributed back some of the non-proprietary bits.

A lot more work than Mosquitto but obviously HA/distributed and some tradeoffs w.r.t features. Worth it if you want to run Pulsar anyway for other reasons.

oulipo

I was going to go for Redpanda; what do you think the pros/cons of Pulsar would be?

jpgvm

With Redpanda you would need to build something external. With Pulsar the protocol handlers run within the Pulsar proxy execution mode and all of your authn/authz can be done by Pulsar etc.

Redpanda might be more resource efficient, however, and have less operational overhead than a Pulsar system.

Pulsar has some very distinct advantages over Redpanda when it comes to actually consuming messages though. Specifically it enables both queue-like and streaming consumption patterns (it is still a distributed log underneath but does selective acknowledgement at the subscription level).

jandeboevrie

Would they be as performant and use as few resources (almost nothing)? I've run Mosquitto clusters with tens of thousands of connected clients and thousands of messages per second on 2 cores and 2GB of RAM, while mostly idling. (Without retention, using clean sessions and only QoS 0)...

bo0tzz

EMQX just locked HA/clustering behind a paywall: https://www.emqx.com/en/blog/adopting-business-source-licens...

zrail

Sigh, that's annoying.

Edit: it's not a paywall. It's the standard BSL with a 4 year Apache revert. I personally have zero issue with this.

casper14

Oh can you comment on what this means? I'm not too familiar with it. Thanks!

bo0tzz

It is a paywall, clustering won't work unless you have a license key.

seized

VerneMQ also has built-in clustering and message replication, which would make this easy.

oulipo

Have you tried both EMQx and VerneMQ, and would you specifically recommend one over the other? I don't have experience with VerneMQ.
