Breaking Up with On-Call
29 comments · March 16, 2025
danpalmer
strken
Yeah, I was similarly confused. My experience has been that on-call is a roster of who will keep their laptop with them on the weekend. Incidents involve little to no customer comms, because if anything more than a sentence once an hour is necessary then there's someone else on call to handle it.
Support work is a necessary nuisance, but it's also not what on-call is meant to be.
coolgoose
Sounds like normal support work; not sure why that's affecting morale.
It's normal in any kind of system to also cover issues.
Yes, at some point you might want to have a customer support / customer success layer to at least triage them, but that makes more sense as you get bigger, not when you are small.
I actually like having discussions on support days with customers. Yes, sometimes they're more annoying, but it's direct feedback from people trying to use your software.
gpi
Agreed. Author's point of view seemed to have been influenced by their bad experience being part of the "on duty" work.
caseyohara
Not to distract from the article, but I'm pretty sure that photo of the guard tower is from Manzanar, one of the concentration camps in California that interned Japanese Americans during WWII. Probably not in great taste to use that photo to represent the idea of "guard duty" in software.
strken
Apart from anything else, a fire tower is a better metaphor for on-call than a guard tower, too.
brigandish
Who, specifically, would that picture offend, or would changing it protect?
idiotsecant
Eh, intent matters. They probably just googled 'guard tower' and used the first one that looked close enough.
caseyohara
That’s my assumption and why I said something. I’m guessing they didn’t know.
abrookewood
[flagged]
pokoleo
From my experience working on SaaS, and improving ops at large organizations, I've seen that "on-call culture" often exists in inverse proportion to incentive alignment.
When engineers bear the full financial consequences of 3AM pages, they're more likely to make systems more resilient by adding graceful failure modes. When incident response becomes an organizational checkbox divorced from financial outcomes and planning, you get perpetual firefighting.
The most successful teams I've seen treat on-call like a leading indicator - every incident represents unpriced technical debt that should be systematically eliminated. Each alert becomes an investment opportunity rather than a burden to be rotated.
Big companies aren't missing the resources to fix this; they just don't have the aligned incentive structures that make fixing it rational for individuals involved.
The most rational thing to do as an individual on a bad rotation: quit or transfer.
DanHulton
This assumes that the engineers in question get to choose how to allot their time, and are _allowed_ to spend time to add graceful failure modes. I cannot tell you how many stories I have heard of, and companies I have directly worked at, where this power is not granted to engineers, and they are instead directed to "stop working on technical debt, we'll make time to come back to that later". Of course, time is never found later, and the 3am pages continue because the people who DO choose how time is allocated are not the ones waking up at 3am to fix problems.
adrianN
Minimizing 3am pages is good for engineers but it is not necessarily the best investment for the company. Beyond a certain scale it is probably not a good investment to try to get rid of all pages.
tacticus
IMO it's when the incident response and readiness practice imposes a direct backpressure on feature delivery that you get the issues actually fixed and a resilient system.
If it's just the engineer bearing the cost while product and management see no real cost, then people burn out and leave.
> The most successful teams I've seen treat on-call like a leading indicator - every incident represents unpriced technical debt that should be systematically eliminated. Each alert becomes an investment opportunity rather than a burden to be rotated.
100%
rednafi
> When engineers bear the full financial consequences of 3AM pages, they're more likely to make systems more resilient by adding graceful failure modes.
Making engineers handle 3 AM issues caused by their code is one thing, but making them bear the financial consequences is another. That’s how you create a blame-game culture where everyone is afraid to deploy at the end of the day or touch anything they don’t fully understand.
stackskipton
I have no idea what the article was trying to convey, since on call was poorly defined.
I also couldn’t take it seriously when the article opened with this.
> Startups cannot afford engineers to baby sit software, big tech does.
Say what? As an ops person, I’ve seen multiple startups where devs are drowning in operational issues because software was written just enough for a feature MVP, shipped with poor testing, then constantly poked at with sticks to keep it working in hopes they could get enough revenue to not flame out.
smcameron
Been a few years since I worked at Google as an SRE, but I did not find "There’s no incentive in big tech to write software with no bugs" to be particularly true. Perhaps because I was an SRE for a few of the older, intrinsic, core products. A lot of the pages we got were for things outside our control: some fiber optic cables got broken (we found out later), so we've got to drain some cluster because nothing's getting through there; or something's getting overloaded because it's the Super Bowl today, so throw more machines/memory at it; or other weird external things. I don't remember a lot of pages due to outright bugs in the code... though it's been enough years now that I might have forgotten.
snapetom
I'll add to this that some companies (my current one included) still have a hero culture. People who are forced to work all weekend to fix a problem are lauded as the ones who went the extra mile. An ego boost and kudos, for some reason, are reward enough. Companies don't have to substantially reward these employees, nor does the company have to invest in paying down technical debt, infrastructure, or automation/monitoring. It's incentivisation gone horribly wrong.
AdieuToLogic
The industry myth of "devs need to be on-call just in case prod crashes at 3am" needs to die.
First, a system failure addressed by someone awoken "at 3am" assumes said person can transition from sleep to peak analytic mode in moments. This is obviously not the case.
Second, a system failure addressed by someone awoken "at 3am" assumes said person possesses omniscient awareness of all aspects of a non-trivial system. This is obviously not the case.
Third, and finally, a system failure addressed by someone "at 3am" assumes said person can rectify the problem in real time and without any coworker collaboration or stakeholder input.
The last is most important:
If a resolution requires a system change, what organization would allow one person at 3am to make it and push to prod?
hn_user82179
Great points. Over the 8 years of my SRE experience, I've probably caused a few outages after being paged at 3-4am, and/or prolonged them. I once fell asleep while responding (fortunately the main outage had been fixed, but we went way beyond our internal SLO for backups because I had fallen asleep before running them).
That said, the author neglected to mention timezones or following the sun. A 12/12 hour shift or 8/8/8 (NA/EU, NA/EU/APAC) addresses the sleep deprivation problem pretty well, and is pretty easy to staff in a large enough org.
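As a rough illustration of what a follow-the-sun split looks like, here is a minimal sketch that maps a UTC hour to the region holding the pager. The shift start hours, the region labels, and the `region_on_call` helper are all assumptions made up for this example; a real schedule would follow local working hours and daylight saving changes.

```python
# Hypothetical 8/8/8 split, expressed as (UTC start hour, region).
SHIFTS_8H = [(0, "APAC"), (8, "EU"), (16, "NA")]
# Hypothetical 12/12 split; the NA shift wraps past midnight UTC.
SHIFTS_12H = [(6, "EU"), (18, "NA")]


def region_on_call(utc_hour: int, shifts: list[tuple[int, str]]) -> str:
    """Return the region whose shift covers the given UTC hour (0-23)."""
    if not 0 <= utc_hour <= 23:
        raise ValueError("utc_hour must be in 0..23")
    ordered = sorted(shifts)
    # Hours before the first start belong to the last shift,
    # which wraps around midnight.
    current = ordered[-1][1]
    for start, region in ordered:
        if utc_hour >= start:
            current = region
    return current
```

With these made-up boundaries, nobody is paged outside an 8-hour (or 12-hour) window in their own shift, which is the whole point of the arrangement.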
AdieuToLogic
> That said, the author neglected to mention timezones or following the sun. A 12/12 hour shift or 8/8/8 (NA/EU, NA/EU/APAC) addresses the sleep deprivation problem pretty well
I completely agree.
The "get a call at 3am" scenario, to me, is shorthand for an organization which intentionally under-staffs in order to save money. If a system has genuine 24x7 support needs, be it because of SLAs or inferior construction, it is incumbent upon the organization to staff accordingly.
Still and all, identifying a production issue is one thing, expecting near real-time resolution by personnel likely unfamiliar with the intricacies of every aspect of a non-trivial production system is another.
dmoy
Was author's big tech experience with Amazon?
Because Amazon oncall is ... well it's something. But I'm not sure it's really indicative of the rest of big tech oncall.
rednafi
A moderate on-call ritual is a necessary evil. I’ve worked at places that tried to get rid of it with all kinds of automation and playbooks, only to revert back to PagerDuty a few months later.
That said, my last workplace completely burned me out with a terrible on-call policy and an absurdly short recovery period. Not to mention, upper management tried to gaslight everyone into thinking on-call was just part of normal work and didn’t warrant additional compensation.
I get why it’s necessary, but I’m lucky enough to be in a position where I can flat-out decline any job that mandates unpaid on-call.
bigstrat2003
> Not to mention, upper management tried to gaslight everyone into thinking on-call was just part of normal work and didn’t warrant additional compensation.
In every position I've held (save my current one), that has most definitely been the norm. If you have a busy night, managers are cool with you coming in late the next day (or potentially not at all), but it's very unusual to be paid for on call in my experience.
tbrownaw
> tried to gaslight everyone into thinking on-call was just part of normal work and didn’t warrant additional compensation.
It is part of normal work for roles where it's a thing, and I'd think it could be either baked in or on top as long as which one it was was known at salary negotiation time.
lolinder
> upper management tried to gaslight everyone into thinking on-call was just part of normal work and didn’t warrant additional compensation.
If it's made explicitly clear in the job interview process that on call is expected in a role and if it's enforced evenly across the organization, this can absolutely be normal and ethical—you're on a specific salary and your salary includes on call expectations. If you knew about that going in then you weighed it against the offered salary and decided it was worth it. It's not "unpaid on-call" in that case, it's paid on-call.
Obviously it can also be done poorly in ways that are harmful and dishonest. Not communicating it clearly during the job interview and salary negotiation, applying it inconsistently, or changing the frequency or the difficulty of the rotation after you have started are all real problems.
rednafi
I agree with your expansion. In this particular case, the on-call policy was enforced after the hiring.
AdieuToLogic
>> upper management tried to gaslight everyone into thinking on-call was just part of normal work and didn’t warrant additional compensation.
> If it's made explicitly clear in the job interview process that on call is expected in a role and if it's enforced evenly across the organization, this can absolutely be normal and ethical—you're on a specific salary and your salary includes on call expectations.
Where this logic breaks down is when on-call expectations are in addition to what a salaried position compensates - a 40-ish hour work week.
Expecting an employee to be paid industry rate for services rendered and then expecting periodic 24-168 hour near real-time availability without compensation is, by definition, "unpaid on-call".
parpfish
And the frequency with which you have to take one of the 168-hour shifts is inversely proportional to how many colleagues you have in the rotation. If somebody leaves, the amount of unpaid work you have to do increases.
So if on-call isn’t explicitly compensated, an employee quitting essentially gives the rest of the team more hours at a lower hourly rate.
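To make that arithmetic concrete, here is a rough sketch. The 40-hour base week, the 52-week year, and the helper names `oncall_weeks_per_year` and `effective_hourly_rate` are illustrative assumptions, not figures from the thread. It treats every on-call hour outside the base week as availability time, which is a deliberately crude model.

```python
HOURS_PER_WEEK = 168   # a full on-call week, as in the comment above
BASE_WEEK_HOURS = 40   # assumed salaried work week
WEEKS_PER_YEAR = 52    # ignores holidays and leave for simplicity


def oncall_weeks_per_year(team_size: int) -> float:
    """Weeks per year each person carries the pager in an even rotation."""
    if team_size < 1:
        raise ValueError("team_size must be at least 1")
    return WEEKS_PER_YEAR / team_size


def effective_hourly_rate(salary: float, team_size: int) -> float:
    """Salary divided by worked hours plus on-call availability hours.

    Only on-call hours outside the base work week are added, so each
    on-call week contributes 168 - 40 = 128 extra hours.
    """
    extra_hours = oncall_weeks_per_year(team_size) * (HOURS_PER_WEEK - BASE_WEEK_HOURS)
    total_hours = WEEKS_PER_YEAR * BASE_WEEK_HOURS + extra_hours
    return salary / total_hours
```

Under these assumptions a team of four puts each person on call 13 weeks a year; if one person leaves, that rises to roughly 17 weeks, and `effective_hourly_rate` for the same salary drops accordingly.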
As a current SRE, and having worked in a small startup, this doesn't echo my experience at all. What the author describes is possibly what we would call "on duty" work, the grunt/maintenance work that comes with big software systems. It's not fun, and most companies/teams have friction getting this sort of work done. However, it's also not how my SRE role is defined by any stretch. Our on-call work is much more about support during exceptional, somewhat rare circumstances.