Breaking Up with On-Call
73 comments
March 16, 2025 · dakiol
ekimekim
When I'm in charge of an on-call rotation I always try to make it very clear that this is not the expectation.
In my preferred model of on-call, you have a primary, then after 5min an escalation to secondary, then after 5min an escalation to something drastic (sometimes "everyone", sometimes a manager).
The expectation is that most of the time you should be able to respond within 5 minutes, but if you can't then that's what the secondary role is for - to catch you. This means it's perfectly acceptable to go for a run, go to a movie, etc.
You relax the responsibility on the individual and let a sensible amount of redundancy solve the problem instead. Everyone is less stressed, and sure you get the occasional 5min delay in response but I'm willing to bet that the overall MTTR is lower since people are well rested and happier to be on call to begin with.
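A minimal sketch of the escalation chain described above, assuming the two 5-minute windows from the comment. The tier names, the `ESCALATION_CHAIN` table and the `tiers_paged` helper are hypothetical and not taken from any particular paging tool.

```python
from typing import Optional

# Hypothetical tiers: (name, minutes after the alert fires at which they are paged).
ESCALATION_CHAIN = [
    ("primary", 0),     # paged immediately
    ("secondary", 5),   # paged if nobody has acknowledged after 5 minutes
    ("everyone", 10),   # drastic fallback after another 5 minutes
]


def tiers_paged(minutes_since_alert: float, acked_at: Optional[float] = None) -> list[str]:
    """Return every tier that has been paged so far.

    Escalation stops at the moment someone acknowledges the alert,
    so tiers whose delay falls after `acked_at` are never paged.
    """
    cutoff = minutes_since_alert
    if acked_at is not None:
        cutoff = min(cutoff, acked_at)
    return [name for name, delay in ESCALATION_CHAIN if delay <= cutoff]


if __name__ == "__main__":
    print(tiers_paged(3))               # primary only
    print(tiers_paged(7))               # primary was out for a run, secondary is paged
    print(tiers_paged(12, acked_at=6))  # secondary acked at minute 6, nobody else is woken
```

The point of the design is that a missed page by the primary costs one window of delay rather than an unanswered incident.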
WhyIsItAlwaysHN
So it takes 10 min until you've gone to the drastic solution? With this time-frame it would be risky to go to the bathroom, let alone to a movie. Also, even the backup sounds like a primary in this scenario.
rwmj
This is why people used to be paid time-and-a-half or even double-time for being on call. Ask your union to demand that.
zild3d
So for, say, a weekly on-call rotation you would be paid for all 24 hours x 7 days at double rate (minus the ~40 regular working hours)?
Also most tech companies don't have unions...
chgs
Last time I was on call it was a one hour payment per day, a 4 hour payment for answering the phone.
That lasted about 3 months, just not acceptable.
liveoneggs
For nurses, on-call pay is a tiny amount: $3/hr, and then you get 1.5x if you actually get called in.
danpalmer
As a current SRE, and having worked in a small startup, this doesn't echo my experience at all. What the author describes is possibly what we would call "on duty" work, the grunt/maintenance work that comes with big software systems. It's not fun, and most companies/teams have friction getting this sort of work done. It is not, however, how my SRE role is defined by any stretch. Our on-call work is much more about support during exceptional, somewhat rare circumstances.
strken
Yeah, I was similarly confused. My experience has been that on-call is a roster of who will keep their laptop with them on the weekend. Incidents involve little to no customer comms, because if anything more than a sentence once an hour is necessary then there's someone else on call to handle it.
Support work is a necessary nuisance, but it's also not what on-call is meant to be.
coolgoose
Sounds like normal support work; not sure why that's affecting morale.
It's normal in any kind of system to also cover issues.
Yes, at some point you might want to have a customer support / customer success layer to at least triage them, but that makes more sense as you get bigger, not when you are small.
I actually like having discussions on support days with customers. Yes, sometimes they're more annoying, but it's direct feedback from people trying to use your software.
cdnthrownawy39
My on-call experience required that I be able to respond within ten minutes of the call, with 24/7/365 coverage. But if I couldn't get the issue resolved remotely, it meant that I'd have to be in the office lab to recreate and reproduce the problem. It effectively restricted my personal movements to within commute distance of my office, and that includes all my vacation time as well.
That was the better part of a year in my life, continuous. Of constantly considering that every decision, every meal, every movement, every action at all times, and weighing it against the risk of impacting my ability to respond to a customer call. Maintaining an extended period of alertness for a threat that very rarely materializes is frustrating in many ways that I'd like very much to forget.
I didn't burn out from it, but it was a major factor in my decision to resign from that company. Obviously there are people out there who can handle this lifestyle, but I couldn't. And frankly I'm quite content to never try again.
hbsbsbsndk
> Obviously there are people out there who can handle this lifestyle, but I couldn't
No I don't think anyone could. What you're describing is an insane policy of abusing employees. Most on-call is done on a rotation, not permanently.
All of my previous roles involved on-call and it was 1 week per quarter.
baq
You should be paid extra for all that time, and indeed there are countries with legal frameworks that require an employer to pay the employee if their personal life outside of working hours is in any way constrained. If you are a contractor, always put limits on this kind of extra service.
whstl
Yeah, I worked in a couple companies that had the same requirements.
It also meant having to wake up on command in case the phone beeped, being unable to drink in my free time, being limited in which activities I could go to.
It was hell.
I know people are gonna hit back with "you're doing it wrong", but in this case it's the company doing it wrong, but nobody on HN will go there and tell them.
anal_reactor
I don't understand why on-call is normal. It's a huge mental burden on employees, which matters all the more in modern times of widespread mental health issues, just so that some shitty mobile app can be available 24/7/365. If your business is important enough to have on-call, then you should have dedicated employees covering night shifts and nothing else, limiting the work to someone's office hours and effectively removing on-call. I think that there should be laws against on-call.
gpi
Agreed. Author's point of view seemed to have been influenced by their bad experience being part of the "on duty" work.
pokoleo
From my experience working on SaaS, and improving ops at large organizations, I've seen that "on-call culture" often exists inversely proportional to incentive alignment.
When engineers bear the full financial consequences of 3AM pages, they're more likely to make systems more resilient by adding graceful failure modes. When incident response becomes an organizational checkbox divorced from financial outcomes and planning, you get perpetual firefighting.
The most successful teams I've seen treat on-call like a leading indicator - every incident represents unpriced technical debt that should be systematically eliminated. Each alert becomes an investment opportunity rather than a burden to be rotated.
Big companies aren't missing the resources to fix this; they just don't have the aligned incentive structures that make fixing it rational for individuals involved.
The most rational thing to do as an individual on a bad rotation: quit or transfer.
DanHulton
This assumes that the engineers in question get to choose how to allot their time, and are _allowed_ to spend time to add graceful failure modes. I cannot tell you how many stories I have heard of, and companies I have directly worked at, where this power is not granted to engineers, and they are instead directed to "stop working on technical debt, we'll make time to come back to that later". Of course, time is never found later, and the 3am pages continue because the people who DO choose how time is allocated are not the ones waking up at 3am to fix problems.
nijave
Definitely an issue but I think there's a little room for push back. Work done outside normal working hours is automatically the highest priority, by definition. It's helpful to remind people of that.
If it's important enough to deserve a page, it's top priority work. The reverse is also true (if a page isn't top priority, disable the paging alert and stick it on a dashboard or periodic checklist)
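The rule above (page only for top-priority work, everything else goes to a dashboard) can be sketched as a small audit. This is illustrative only: the `Alert` type, the `top_priority` flag, the alert names and both helper functions are assumptions, not the API of any real alerting system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    name: str
    # Assumed flag: would the team drop planned work to fix the cause of this alert?
    top_priority: bool


def route(alert: Alert) -> str:
    """Page a human only for top-priority alerts; everything else goes to a dashboard."""
    return "page" if alert.top_priority else "dashboard"


def demotion_candidates(alerts: list[Alert], currently_paging: set[str]) -> list[str]:
    """Names of alerts that wake someone up today but are not treated as top priority."""
    return [
        a.name
        for a in alerts
        if a.name in currently_paging and not a.top_priority
    ]


if __name__ == "__main__":
    alerts = [Alert("db_down", True), Alert("disk_70_percent", False)]
    print([route(a) for a in alerts])
    print(demotion_candidates(alerts, {"db_down", "disk_70_percent"}))
```

Anything returned by `demotion_candidates` is either work that should be prioritised or a page that should be switched off.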
whstl
You're right, but it's still outrageous that engineers need to burn political capital in order to have proper sleep and avoid burnout.
tacticus
IMO it's when the incident response and readiness practice imposes a direct backpressure on feature delivery that you get the issues actually fixed and a resilient system.
If it's just the engineer who pays, while product and management see no real cost, then people burn out and leave.
> The most successful teams I've seen treat on-call like a leading indicator - every incident represents unpriced technical debt that should be systematically eliminated. Each alert becomes an investment opportunity rather than a burden to be rotated.
100%
rednafi
> When engineers bear the full financial consequences of 3AM pages, they're more likely to make systems more resilient by adding graceful failure modes.
Making engineers handle 3 AM issues caused by their code is one thing, but making them bear the financial consequences is another. That’s how you create a blame-game culture where everyone is afraid to deploy at the end of the day or touch anything they don’t fully understand.
cyberax
"Financial consequences" probably mean "the success of the startup, so your options won't be worth less than the toilet paper", rather than "you'll pay for the downtime out of your salary".
mrkeen
Engineers don't pick their work, management does.
A manager no longer needs to choose between system reliability and churning out new features with on-call:
The manager can get all the credit for pushing out new features during the day, and sleep well at night knowing that the engineers aren't.
procaryote
At a lot of companies engineers are involved in picking the work. It's silly to hire competent problem solvers and treat them as unskilled workers needing micro-management.
Besides, if you set the on-call system up so people get free time the following day to compensate for waking up at night, the manager can't pretend there's no cost.
Bad management will fail on both of these of course, but there's no saving that beyond finding a better company.
northern-lights
This assumes that the engineers who wrote the code that caused the 3 AM pages will still be around to suffer the consequences. This is often not true, especially in an environment that fosters moving around internally every now and then. Happens in at least one of the FAANGs.
adrianN
Minimizing 3am pages is good for engineers but it is not necessarily the best investment for the company. Beyond a certain scale it is probably not a good investment to try to get rid of all pages.
mook
By that point wouldn't it start to make sense to have people across time zones so that it will be working hours somewhere?
chilldsgn
> Filing for a new feature implementation would require a thorough documentation, rightfully so, followed by a political campaign to convince the political party of principal engineers and managers to accept the new feature. These stakeholders carry incentives and principals of their own - that do not necessarily always align with the true engineering spirit of solving the problem, nor with satisfying the customer.
> The same friction applied to fixing a bug or a flawed process. I would reproduce the bug: spin up the entire environment, the appropriate binary artifact and the reproduced state of application, create the test cases and pin-point the exact problem for the stakeholders as well as present the possible solutions. I would get sent to a dozen of meetings, bouncing my ideas back and forth until receiving the dire verdict - rejection to fix this bug altogether.
This has been my experience for the entire duration of my tenure at the current place I am employed at. I've stopped doing this, because my backlog is filled with "best effort" features and when I attempt to slot some of these into a quiet sprint, management says no.
Many of the features I request come not from assumption but from data I gathered with analytics for the customer-facing website. One example is changing a page layout and adding better search functionality for that page and its data, because I noticed a 70%+ no-results rate in the search analytics. I suggested a change, but marketing shot it down saying it's low priority for them, while they frequently say that the website could perform better at generating leads. I might be wrong, but to me it feels like utter short-sightedness and goes against the strategic goal of the company.
I'm just an engineer, what do I know?
caseyohara
Not to distract from the article, but I'm pretty sure that photo of the guard tower is from Manzanar, one of the concentration camps in California that interned Japanese Americans during WWII. Probably not in great taste to use that photo to represent the idea of "guard duty" in software.
strken
Apart from anything else, a fire tower is a better metaphor for on-call than a guard tower, too.
idiotsecant
Eh, intent matters. They probably just googled 'guard tower' and used the first one that looked close enough.
caseyohara
That’s my assumption and why I said something. I’m guessing they didn’t know.
abrookewood
[flagged]
brigandish
Who, specifically, would that picture offend, or the change protect?
Xylakant
People who were interned in this or one of the other camps, or their descendants. It’s one generation ago; people born and raised in the camps are still alive. George Takei, for example, spent part of his childhood in the camps.
It’s a low stakes change.
Nobody assumes harmful intentions from the author - I would not have recognized the picture either. But now that it’s been pointed out that it’s from a site where people were held illegally against their will, the reaction is telling. Knowing this, and insisting on keeping the images is now willfully associating with harmful behavior.
Apart from that, I cannot associate with the picture either - as an on-call engineer for some widely used infrastructure, I am not a guard on duty keeping people in a camp. I am an emergency responder. I fix things when they go haywire or an accident happens. A firefighter, paramedic, or civil emergency responder would IMO be a much better metaphor for what I do.
modo_mario
>Nobody assumes harmful intentions from the author
Then what's the issue or purpose? Other than satisfying or inflating one's own moral self-image? That kind of social dynamic appears to be what this kind of thing is about most of the time, as opposed to preventing mental harm or stopping such issues from (re)occurring.
>Knowing this, and insisting on keeping the images is now willfully associating with harmful behavior.
Do you not associate with harmful behavior in far, far more direct ways? Perhaps something that's considered ridiculous for most people to avoid, like paying taxes that get spent on bombs or the like?
If I want to write an article showing a prison and I pick one from Google, does it matter if that prison happened to be used in the past to house civil rights activists? Is the implication that that somehow normalises, advocates for or otherwise inches towards oppressing civil rights movements in any meaningful way?
stackskipton
I have no idea what the article was trying to convey, since on-call was poorly defined.
I also couldn’t take it seriously when article opened with this.
> Startups cannot afford engineers to baby sit software, big tech does.
Say what? As an ops person, I’ve seen multiple startups where devs are drowning in operational issues because the software was written just well enough for a feature MVP, shipped with poor testing, then constantly poked at with sticks to keep it working in the hope they could get enough revenue to not flame out.
apeescape
Well, did those startups succeed? If not, I think it means the author was spot on in saying startups can't _afford_ that.
smcameron
Been a few years since I worked at Google as an SRE, but I did not find "There’s no incentive in big tech to write software with no bugs" to be particularly true. Perhaps because I was an SRE for a few of the older, intrinsic, core products. A lot of the pages we got were for things outside our control: some fiber optic cables got broken (we found out later), so we've got to drain some cluster because nothing's getting through there; or something's getting overloaded because it's the Super Bowl today, so throw more machines/memory at it; or other weird external things. I don't remember a lot of pages due to outright bugs in the code... though it's been enough years now that I might have forgotten.
nijave
Were the systems designed to scale based on load and handle transient failures?
It seems like lack of automated remediation would be a bug unless it's an "accepted business risk" i.e. cheaper to throw people at to manually fix than build a software solution.
Xylakant
You can’t plan for all failure modes. Weird shit happens and it needs human intervention to figure out what went wrong. Sometimes someone needs to assess what path forward is the lowest (financial) harm and weigh the options. No computer should make that call.
VirusNewbie
If a bug is causing multiple pages, SRE will absolutely either force the devs to fix it, or fix the bug themselves.
snapetom
I'll add to this that some companies (my current one included) still have a hero culture. People who are forced to work all weekend to fix a problem are lauded as the ones who went the extra mile. An ego boost and kudos, for some reason, are reward enough. Companies don't have to substantially reward these employees, nor do they have to invest in paying down technical debt, infrastructure, or automation/monitoring. It's incentivisation gone horribly wrong.
dmoy
Was author's big tech experience with Amazon?
Because Amazon oncall is ... well it's something. But I'm not sure it's really indicative of the rest of big tech oncall.
northern-lights
The author's LinkedIn indeed lists working at AWS for 3 years.
AdieuToLogic
The industry myth of "devs need to be on-call just in case prod crashes at 3am" needs to die.
First, a system failure addressed by someone awoken "at 3am" assumes said person can transition from sleep to peak analytic mode in moments. This is obviously not the case.
Second, a system failure addressed by someone awoken "at 3am" assumes said person possesses omniscient awareness of all aspects of a non-trivial system. This is obviously not the case.
Third, and finally, a system failure addressed by someone "at 3am" assumes said person can rectify the problem in real time and without any coworker collaboration or stakeholder input.
The last is most important:
If a resolution requires a system change, what organization would allow one person at 3am to make it and push to prod?
procaryote
If the code is important enough that it needs to be fixed if it breaks at 03:00, someone needs to be on-call.
If that someone isn't the dev, things breaking at 03:00 is someone else's problem from the PoV of the dev, and it will likely keep breaking.
I've tried at several companies to get dev-teams to prioritise things causing a lot of work for the ops-team, and nothing has worked as well as disbanding the ops team and putting the devs on-call.
Pain from a problem needs to live where the problem can be fixed.
hn_user82179
great points. I think over the 8 years of my SRE experience, I've probably caused a few outages and/or prolonged them after being paged at 3-4am. I've fallen asleep while responding to one (fortunately the main outage had been fixed, but we went way beyond our internal SLO for backups as I had fallen asleep before running them).
That said, the author neglected to mention timezones or following the sun. A 12/12 hour shift or 8/8/8 (NA/EU, NA/EU/EMEA) addresses the sleep deprivation problem pretty well, and is pretty easy to staff in a large enough org
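A minimal sketch of an 8/8/8 follow-the-sun lookup, to show how little machinery the idea needs. The region names and UTC boundaries in `SHIFTS` are placeholders chosen for illustration; a real rotation would pick them from where its teams actually sit.

```python
from datetime import datetime, timezone

# Hypothetical 8/8/8 split: (region, start hour, end hour) in UTC, end exclusive.
SHIFTS = [
    ("region_a", 0, 8),
    ("region_b", 8, 16),
    ("region_c", 16, 24),
]


def on_call_region(utc_hour: int) -> str:
    """Return the region whose working-hours shift covers the given UTC hour."""
    if not 0 <= utc_hour < 24:
        raise ValueError(f"hour out of range: {utc_hour}")
    for region, start, end in SHIFTS:
        if start <= utc_hour < end:
            return region
    # Only reachable if SHIFTS leaves a gap in the 24-hour day.
    raise RuntimeError("shift table does not cover all 24 hours")


def on_call_region_at(moment: datetime) -> str:
    """Same lookup for a timezone-aware datetime in any zone."""
    return on_call_region(moment.astimezone(timezone.utc).hour)


if __name__ == "__main__":
    print(on_call_region_at(datetime.now(timezone.utc)))
```

With a table like this, every page lands in somebody's working day, which is the property the 12/12 and 8/8/8 schemes are buying.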
AdieuToLogic
> That said, the author neglected to mention timezones or following the sun. A 12/12 hour shift or 8/8/8 (NA/EU, NA/EU/EMEA) addresses the sleep deprivation problem pretty well
I completely agree.
The "get a call at 3am" scenario, to me, is shorthand for an organization which intentionally under-staffs in order to save money. If a system has genuine 24x7 support needs, be it because of SLAs or inferior construction, it is incumbent upon the organization to staff accordingly.
Still and all, identifying a production issue is one thing, expecting near real-time resolution by personnel likely unfamiliar with the intricacies of every aspect of a non-trivial production system is another.
nijave
It always amazes me these places that have "24/7 support needs" then all of a sudden a bug comes in that's "not important" even though it has customer uptime impacts.
whstl
Great points.
> If a resolution requires a system change, what organization would allow one person at 3am to make it and push to prod?
The only advantage of having a micromanaging manager: if they want to review all PRs and system changes, they also must be on call every single night of the year.
xyst
Question to tech folks in general: when you are "on-call" are you provided with differentials in pay regardless of whether an incident actually occurs?
I have seen it vary across companies:
- Some companies don’t pay any differential
- Some companies only pay out _per incident_
- One company I worked with paid out per shift of on-call rotation
- One company treated it as overtime pay
brikym
I work for a big tech company and your solution has already been built. The on-call software uses an LLM to summarize the tickets, and a vector db to find troubleshooting guides and similar tickets.
For me the worst thing about being on-call is not the actual work outside business hours (it’s usually not much), but the potential work: if something happens I need to jump onto my laptop within X minutes (it varies from company to company, but it’s usually within 10 minutes). This means: I cannot go for a run, I cannot go to the movies, I cannot go out for dinner with family, I cannot even go shopping (the shopping mall is further than a 10 min. trip). Basically, all I can do is stay at home and be available. It sucks, and the money is not worth it.