Graceful Shutdown in Go: Practical Patterns

60 comments

·May 4, 2025

zdc1

I've been bitten by the surprising amount of time it takes for Kubernetes to update loadbalancer target IPs in some configurations. For me, 90% of the graceful shutdown battle was just ensuring that traffic was actually being drained before pod termination.

Adding a global preStop hook with a 15 second sleep did wonders for our HTTP 503 rates. This creates time between when the loadbalancer deregistration gets kicked off, and when SIGTERM is actually passed to the application, which in turn simplifies a lot of the application-side handling.

rdsubhas

Yes. Prestop sleep is the magic SLO solution for high quality rolling deployments.

IMHO, there are two things that kubernetes could improve on:

1. Pods should be removed from Endpoints _before_ initiating the shutdown sequence. Like the termination grace period, there should be an option for a termination delay.
2. PDB should allow an option for recreation _before_ eviction.

LazyMans

We just realized this was a problem too

evil-olive

another factor to consider is that if you have a typical Prometheus `/metrics` endpoint that gets scraped every N seconds, there's a period in between the "final" scrape and the actual process exit where any recorded metrics won't get propagated. this may give you a false impression about whether there are any errors occurring during the shutdown sequence.

it's also possible, if you're not careful, to lose the last few seconds of logs from when your service is shutting down. for example, if you write to a log file that is watched by a sidecar process such as Promtail or Vector, and on startup the service truncates and starts writing to that same path, you've got a race condition that can cause you to lose logs from the shutdown.

tmpz22

Is it me or are observability stacks kind of ridiculous. Logs, metrics, and traces, each with their own databases, sidecars, visualization stacks. Language-specific integration libraries written by whoever felt like it. MASSIVE cloud bills.

Then after you go through all that effort most of the data is utterly ignored, and rarely are the business insights much better than the trailer park version: ssh'ing into a box and grepping a log file to find the error output.

Like we put so much effort into this ecosystem but I don't think it has paid us back with any significant increase in uptime, performance, or ergonomics.

nkraft11

I can say that going from a place that had all of that observability tooling set up to one that was at the "ssh'ing into a box and grepping a log" stage, you best believe I missed company A immensely. Even knowing which box to ssh into, which log file to grep, and which magic words to search for was nigh impossible if you weren't the dev that set up the machine and wrote the bug in the first place.

MortyWaves

I completely agree with you but I also think, like many aspects of "tech" certain segments of it have been monopolised and turned into profit generators for certain organisations. DevOps, Agile/Scrum, Observability, Kubernetes, are all examples of this.

This dilutes the good and helpful stuff with marketing bullshit.

Grafana seemingly inventing new time series databases and engines every few months is absolutely painful to try to keep up to date with in order to make informed decisions.

So much so I've started using rrdtool/smokeping again.

evil-olive

if you're working on a system simple enough that "SSH to the box and grep the log file" works, then by all means have at it.

but many systems are more complicated than that. the observability ecosystem exists for a reason, there is a real problem that it's solving.

for example, your app might outgrow running on a single box. now you need to SSH into N different hosts and grep the log file from all of them. or you invent your own version of log-shipping with a shell script that does SCP in a loop.

going a step further, you might put those boxes into an auto-scaling group so that they would scale up and down automatically based on demand. now you really want some form of automatic log-shipping, or every time a host in the ASG gets terminated, you're throwing away the logs of whatever traffic it served during its lifetime.

or, maybe you notice a performance regression and narrow it down to one particular API endpoint being slow. often it's helpful to be able to graph the response duration of that endpoint over time. has it been slowing down gradually, or did the response time increase suddenly? if it was a sudden increase, what else happened around the same time? maybe a code deployment, maybe a database configuration change, etc.

perhaps the service you operate isn't standalone, but instead interacts with services written by other teams at your company. when something goes wrong with the system as a whole, how do you go about root-causing the problem? how do you trace the lifecycle of a request or operation through all those different systems?

when something goes wrong, you SSH to the box and look at the log file...but how do you know something went wrong to begin with? do you rely solely on user complaints hitting your support@ email? or do you have monitoring rules that will proactively notify you if a "huh, that should never happen" thing is happening?

HdS84

Overall, I think centralized logging and metrics are super valuable. But stacks are all missing the mark. For example, every damn log message has hundreds of fields, most of which never change. Why not push this information once, on service startup, and not with every log message? OK, obviously the current system provides huge bills to the benefit of the companies offering these services.

valyala

> For example, every damn log message has hundreds of fields, most of which never change. Why not push this information once, on service startup, and not with every log message?

If the log field doesn't change with every log entry, then good databases for logs (such as VictoriaLogs) compress such a field by 1000x or more, so its storage space usage can be ignored, and it doesn't affect query performance in any way.

Storing many fields in every log entry simplifies further analysis of these logs, since you can get all the needed information from a single log entry instead of jumping across a large number of interconnected logs. This also improves analysis of logs at scale by filtering and grouping the logs by any subset of numerous fields. Such logs with a large number of fields are called "wide events". See the following excellent article about this type of logs - https://jeremymorrell.dev/blog/a-practitioners-guide-to-wide... .

01HNNWZ0MV43FF

Programs are for people. That's why we got JSON, a bunch of debuggers, Python, and so on. Programming is only like 10 percent of programming

openWrangler

It's not just you - OSS toolstacks can be sprawling and involve long manual processes while costs from most enterprise vendors are too steep for fully mapped observability.

Coroot is an open source project I'm working with to try to tackle this. eBPF automatically gathers your data into a centralized service map, and then the tool provides RCA insights (with things like mapped incident timeframes) to help implement fixes quicker and improve uptime.

GitHub here and we'd love any feedback if you think it can help: https://github.com/coroot/coroot

utrack

Jfyi, I'm doing exactly this (and more) in a platform library; it covers the issues I've encountered during the last 8+ years I've been working with Go highload apps. During this time developing/improving the platform and rolling was a hobby of mine in every company :)

It (will) cover the stuff like "sync the logs"/"wait for ingresses to catch up with the liveness handler"/etc.

https://github.com/utrack/caisson-go/blob/main/caiapp/caiapp...

https://github.com/utrack/caisson-go/tree/main/closer

The docs are sparse and some things aren't covered yet; however I'm planning to do the first release once I'm back from a holiday.

In the end, this will be a meta-platform (carefully crafted building blocks), and a reference platform library, covering a typical k8s/otel/grpc+http infrastructure.

peterldowns

I'll check this out, thanks for sharing. I think all of us golang infra/platform people probably have had to write our own similar libraries. Thanks for sharing yours!

RainyDayTmrw

I never understood why Prometheus and related use a "pull" model for data, when most things use a "push" model.

dilyevsky

That’s an artifact of Google’s original Borgmon design. Fwiw, in a “v2” system at Google they tried switching to push-only and it went sideways, so they settled on a sort of hybrid pull-push streaming API.

PrayagS

Is "v2" based on their paper around Monarch?

evil-olive

Prometheus doesn't necessarily lock you into the "pull" model, see [0].

however, there are some benefits to the pull model, which is why I think Prometheus does it by default.

with a push model, your service needs to spawn a background thread/goroutine/whatever that pushes metrics on a given interval.

if that background thread crashes or hangs, metrics from that service instance stop getting reported. how do you detect that, and fire an alert about it happening?

"cloud-native" gets thrown around as a buzzword, but this is an example where it's actually meaningful. Prometheus assumes that whatever service you're trying to monitor, you're probably already registering each instance in a service-discovery system of some kind, so that other things (such as a load-balancer) know where to find it.

you tell Prometheus how to query that service-discovery system (Kubernetes, for example [1]) and it will automatically discover all your service instances, and start scraping their /metrics endpoints.

this provides an elegant solution to the "how do you monitor a service that is up and running, except its metrics-reporting thread has crashed?" problem. if it's up and running, it should be registered for service-discovery, and Prometheus can trivially record (this is the `up` metric) if it discovers a service but it's not responding to /metrics requests.

and this greatly simplifies the client-side metrics implementation, because you don't need a separate metrics thread in your service. you don't need to ensure it runs forever and never hangs and always retries and all that. you just need to implement a single HTTP GET endpoint, and have it return text in a format simple enough that you can sprintf it yourself if you need to.
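
to make the "sprintf it yourself" point concrete, here is a minimal sketch of a hand-rolled `/metrics` endpoint using only the standard library. the metric name `myapp_requests_total`, the `renderMetrics` helper, and the port are hypothetical; a real service would more likely use the official client library.

```go
package main

import (
	"fmt"
	"log"
	"net/http"
	"sync/atomic"
)

// requestsTotal is a hypothetical counter, incremented elsewhere in the app.
var requestsTotal atomic.Int64

// renderMetrics formats one counter in the Prometheus text exposition format.
// The metric name is made up for this example.
func renderMetrics(requests int64) string {
	return fmt.Sprintf("# TYPE myapp_requests_total counter\nmyapp_requests_total %d\n", requests)
}

// metricsHandler is the single HTTP GET endpoint that Prometheus scrapes.
func metricsHandler(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Content-Type", "text/plain; version=0.0.4")
	fmt.Fprint(w, renderMetrics(requestsTotal.Load()))
}

func main() {
	http.HandleFunc("/metrics", metricsHandler)
	log.Fatal(http.ListenAndServe(":8080", nil)) // assumed port
}
```

there is no background goroutine to supervise here: the handler only runs when a scrape arrives.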

for a more theoretical understanding, you can also look at it in terms of the "supervision trees" popularized by Erlang. parents monitor their children, by pulling status from them. children are not responsible for pushing status reports to their parents (or siblings). with the push model, you have a supervision graph instead of a supervision tree, with all the added complexity that entails.

0: https://prometheus.io/docs/instrumenting/pushing/

1: https://prometheus.io/docs/prometheus/latest/configuration/c...

raffraffraff

Great answer. I managed metrics systems way back (cacti, nagios, graphite, kairosdb) and one thing that always sucked about push based metrics was coping with variable volume of data coming from an uncontrollable number of sources. Scaling was a massive headache. "Scraping" helps to solve this through splitting duty across a number of "scrapers" that autodiscover sources. And by placing limits on how much it will scrape from any given metrics source, you can effectively protect the system from overload. Obviously this comes at the expense of dropping metrics from noisy sources, but as the metrics owner I say "too bad, your fault, fix your metrics". Back in the old days you had to accept whatever came in through the fire hose.

sporkland

Having operated a large site with 1000's of services I've never had the metrics thread crash on a service. I've often seen the telemetry pipeline crash. If you've been writing the metrics to logs in a thread you at least have a chance to recover and backfill that information when you fix the pipeline.

bbkane

Thanks for writing this out; very insightful!

PrayagS

> another factor to consider is that if you have a typical Prometheus `/metrics` endpoint that gets scraped every N seconds, there's a period in between the "final" scrape and the actual process exit where any recorded metrics won't get propagated. this may give you a false impression about whether there are any errors occurring during the shutdown sequence.

Have you come across any convenient solution for this? If my scrape interval is 15 seconds, I don't exactly have 30 seconds to record two scrapes.

This behavior has sort of been the reason why our services still use statsd since the push-based model doesn't see this problem.

karel-3d

one tiny thing I see quite often: people think that if you do `log.Fatal`, it will still run things in `defer`. It won't!

    package main
    
    import (
     "fmt"
     "log"
    )
    
    func main() {
        defer fmt.Println("in defer") // never runs: log.Fatal exits first
    
        log.Fatal("fatal") // prints "fatal", then calls os.Exit(1)
    }

this just prints "fatal"... because log.Fatal calls os.Exit, which ends the process immediately without running deferred functions.

    package main
    
    import (
     "fmt"
     "log"
    )
    
    func main() {
        defer fmt.Println("in defer") // runs while the panic unwinds
    
        panic("fatal") // prints the panic message and exits with status 2
    }

This shows both `fatal` and `in defer`
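
One common way around this, sketched here as an assumption rather than anything from the article, is to keep the real logic in a `run` function that returns an error, and to call `os.Exit` only from `main` after `run` (and its deferred calls) have finished. The name `run` is just a convention.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// run holds the real program logic. Its deferred calls execute when it
// returns, which is before main decides on the exit code.
func run() error {
	defer fmt.Println("in defer")

	return errors.New("fatal")
}

func main() {
	if err := run(); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1) // safe here: main itself has no pending defers
	}
}
```

With this layout cleanup still runs on the error path, and the process still exits with a non-zero status.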

wbl

If a distributed system relies on clients gracefully exiting to work, the system will eventually break badly.

Rhapso

And I believe that so much that I don't even consider graceful shutdown in design. Components should be able to safely (and even frequently) hard-crash, and so long as a critical percentage of the system is WAI then it shouldn't meaningfully impact the overall system.

The only way to make sure a system can handle components hard crashing, is if hard crashing is a normal thing that happens all the time.

All glory to the chaos monkey!

ikiris

There's a big gap between graceful shutdown to be nice to clients / workflows, and clients relying on it to work.

smcleod

Way back when, in physical land - I used STONITH for that! https://smcleod.net/2015/07/delayed-serial-stonith/

XorNot

There's valid reasons to want the typical exit not to look like a catastrophic one even if that's a recoverable situation.

That my application went down from sig int makes a big difference compared to kill.

Blue-Green migrations for example require a graceful exit behavior.

shoo

> Blue-Green migrations for example require a graceful exit behavior.

it may not always be necessary. e.g. if you are deploying a new version of a stateless backend service, and there is a load balancer forwarding traffic to current version and new version backends, the load balancer could be responsible for cutting over, allowing in flight requests to be processed by the current version backends while only forwarding new requests to the new backends. then the old backends could be ungracefully terminated once the LB says they are not processing any requests.

eknkc

Yeah. However, I do not need to pull the plug to shut things down even if the software was designed to tolerate that.

On second thought though, maybe I do. That might be the only way to ensure the assumption is true. Like Netflix's chaos monkey thing a couple of years ago.

antonvs

> Like the Netflix's chaos monkey thing a couple years ago.

That was released 15 years ago.

eknkc

Thanks for reminding how old I am.

icedchai

Relying on graceful exit and supporting it are two different things. You want to support it so you can stop serving clients without giving them nasty 5xx errors.

Thaxll

No one said that.

fpoling

I was hoping the article would describe how to perform an application restart without dropping a single incoming connection, where a new service instance receives the listening socket from the old instance.

It is relatively straightforward to implement under systemd. And nginx has been supporting that for over 20 years. Sadly Kubernetes and Docker have no support for that, assuming it is done in the load balancer or the reverse proxy.

joaohaas

You're probably looking for Cloudflare's tableflip: https://github.com/cloudflare/tableflip

giancarlostoro

I had a coworker that would always say, if your program cannot cleanly handle ctrl c and a few other commands to close it, then it's written poorly.

danhau

Your coworker is correct.

amelius

Ctrl-C is reserved for copy into the clipboard ... Stopping the program instead is highly counter-intuitive and will result in angry users.

moooo99

Have you really never cancelled a program in a terminal session?

tgv

I think it was a joke. The style, clearly, almost pedantically stating an annoyance as fact, does suggest that.

gchamonlive

This is one of the things I think Elixir is really smart in handling. I'm not very experienced in it, but it seems to me that having your processes designed around tiny VM processes that are meant to panic, quit and get respawned eliminates the need to have to intentionally create graceful shutdown routines, because this is already embedded in the application architecture.

cle

How does that eliminate the need for the graceful shutdown the author discusses?

fredrikholm

In the same way that GC eliminates the need for manual memory management.

Sometimes it's not enough and you have to 'do it by hand', but generally if you're working in a system that has GC, freeing memory is not something that you think of often.

The BEAM is designed for building distributed, fault tolerant systems in the sense that these types of concerns are first class objects, as compared to having them as external libraries (eg. Kafka) or completely outside of the system (eg. Kubernetes).

The three points the author lists in the beginning of the article are already built in and their behavior is described rather than implemented, which is what I think OP meant with not having to 'intentionally create graceful shutdown routines'.

joaohaas

I really don't see how what you are describing has anything to do with the graceful shutdown strategies/tips mentioned in the post.

- Some applications want to instantly terminate upon receiving kill sigs, others want to handle them, OP shows how to handle them

- In the case of HTTP servers, you want to stop listening for new requests, but finish handling current ones under a timer. TBF, OPs post actually handles that badly with a time.Sleep when there's a running connection, instead of using a sync.WaitGroup like most applications would want to do

- Regardless if the application is GCd or not, you probably want to still manually close connections, so you can handle any possible errors (a lot of connections stuff flushes data on close)

eberkund

I created a small library for handling graceful shutdowns in my projects: https://github.com/eberkund/graceful

I find that I typically have a few services that I need to start-up and sometimes they have different mechanisms for start-up and shutdown. Sometimes you need to instantiate an object first, sometimes you have a context you want to cancel, other times you have a "Stop" method to call.

I designed the library to help me consolidate this all in one place with a unified API.

mariusor

Haha, I had the exact same idea, though my API looks a bit less elegant. Maybe it's because it allows the caller to set-up multiple signals to handle and in which way to do it.

https://pkg.go.dev/git.sr.ht/~mariusor/wrapper#example-Regis...

pseidemann

I did something similar as well: https://github.com/pseidemann/finish

deathanatos

> After updating the readiness probe to indicate the pod is no longer ready, wait a few seconds to give the system time to stop sending new requests.

> The exact wait time depends on your readiness probe configuration

A terminating pod is not ready by definition. The service will also mark the endpoint as terminating (and as not ready). This occurs on the transition into Terminating; you don't have to fail a readiness check to cause it.

(I don't know about the ordering of the SIGTERM & the various updates to the objects such as Pod.status or the endpoint slice; there might be a small window after SIGTERM where you could still get a connection, but it isn't the large "until we fail a readiness check" TFA implies.)

(And as someone who manages clusters, honestly that infinitesimal window probably doesn't matter. Just stop accepting new connections, gracefully close existing ones, and terminate reasonably fast. But I feel like half of the apps I work with fall into either a bucket of "handle SIGTERM & take forever to terminate" or "fail to handle SIGTERM (and take forever to terminate)".)

cientifico

We've adopted Google Wire for some projects at JustWatch, and it's been a game changer. It's surprisingly under the radar, but it helped us eliminate messy shutdown logic in Kubernetes. Wire forces clean dependency injection, so now everything shuts down in order instead... well who knows :-D

https://go.dev/blog/wire https://github.com/google/wire

Savageman

I wish it would talk about liveness too. I've seen apps several times that use the same endpoint for liveness/readiness, but it feels wrong.

liampulles

I tend to use a waitgroup plus context pattern. Any internal service which needs to wind down for shutdown gets a context which it can listen to in a goroutine to start shutting down, and a waitgroup to indicate that it is finished shutting down.

Then the main app goroutine can close the context when it wants to shutdown, and block on the waitgroup until everything is closed.

mariusor

If you look at the article, it presents some additional niceties, like having middleware that is aware of the shutdown - though they didn't detail exactly how the WithCancellation() function is doing that.

So if you send SIGINT or SIGTERM to the server, there's a delay to clean up resources. During that delay, new requests don't try to access those resources and fail in unexpected ways; instead they get a configurable "not in service" error.