Ask HN: Why are most status pages delayed?

50 comments

·November 4, 2025

As I type this, Reddit is down. My own requests return 500s, Down Detector reports that there is an outage but Reddit itself says all systems operational.

This is a pattern that I have noticed time and time again with many services. Why even have a status page if it is not going to be accurate in real time? It's also not uncommon that smaller issues never get acknowledged.

Is this a factor of how Atlassian Statuspage works?

Edit: Redditstatus finally acknowledged the issue as of 04:27 PST, a good 20+ minutes after Down Detector charts show the spike

swiftcoder

Because for most major sites, updating the status page requires (a significant number of) humans in the loop.

Back when I worked at a major cloud provider (which admittedly was >5 years ago), our alarms would go off after ~3-15 minutes of degraded functionality (depending on the sensitivity settings of that specific alarm). At that point the on call gets paged in to investigate and validates that the issue is real (and not trivially correctable). There was also automatic escalation if the on call doesn't acknowledge the issue after 15 minutes.

If so, a manager gets paged in to coordinate the response, and if the manager considers the outage to be serious (or to affect a key customer), a director or above gets paged in. The director/VP has the ultimate say about posting an outage, but in parallel they consult the PR/comms team on the wording/severity of the notification, any partnership managers for key affected clients, and legal regarding any contractual requirements the outage may be breaching...

So in a best-case scenario you'd have 3 minutes (for a fast alarm to raise) plus ~5 minutes for the on call to engage, plus ~10 minutes of initial investigation, plus ~20 minutes of escalations and discussions... all before anyone with permission to edit the status page can go ahead and do so
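
Adding up the best-case figures above gives a rough floor for the delay. A minimal sketch of that arithmetic follows; the stage names are illustrative labels for the steps described, not terms from any real incident process.

```python
# Best-case delays in minutes, taken from the figures in the comment above.
# The stage names are illustrative labels only.
BEST_CASE_MINUTES = {
    "alarm_fires": 3,
    "on_call_engages": 5,
    "initial_investigation": 10,
    "escalation_and_discussion": 20,
}


def total_delay(stages):
    """Sum the per-stage delays, assuming the stages happen one after another."""
    return sum(stages.values())


print(total_delay(BEST_CASE_MINUTES))  # 38
```

Under these numbers the status page cannot change until roughly 38 minutes after the degradation starts, even when every step goes well.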

mirekrusin

So the "degraded functionality" signal is available from the start (the "3-15 minutes of degraded functionality" one). People are asking: why not share THAT, then?

Nobody cares about internal escalations, or whether a manager is taking shit or not. That's not service status, that's the internal process for dealing with the shit. It can surface as extra timestamped comments next to the service STATUS.

swiftcoder

> why not share THAT then?

When you've guaranteed 4 or 5 nines worth of uptime to the customer, every acknowledged outage results in refunds (and potentially being sued over breach of contract)
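
To put "4 or 5 nines" in perspective, here is a small sketch of the downtime such a guarantee leaves room for. It assumes the guarantee is measured over a 365-day year; real contracts often use a monthly window, so treat the numbers as illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years


def downtime_budget_minutes(nines: int) -> float:
    """Allowed downtime per year for an availability of N nines (4 -> 99.99%)."""
    unavailability = 10 ** -nines
    return MINUTES_PER_YEAR * unavailability


for n in (4, 5):
    print(f"{n} nines: {downtime_budget_minutes(n):.2f} minutes/year")
```

Four nines allows about 52.56 minutes of downtime per year and five nines about 5.26 minutes, so a single acknowledged outage of the length discussed in this thread can use up most or all of the budget.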

jakevoytko

Because the systems are so complex and capable of emergent behavior that you need a human in the loop to truly interpret behavior and impact. Just because an alert is going off doesn't mean that the alert was written properly, or is measuring the correct thing, or the customer is interpreting its meaning correctly, etc.

2gremlin181

Copying my response over from another comment:

I totally get that, but how hard would it be to actually make calls to your own API from the status page? If it fails, display a vague message saying there might be issues and that you are looking into it. Clearly these metrics and alerts exist internally too. I'm not asking for an instant RCA or confirmation of the scope of the outage. Just stop gaslighting me.
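
A rough sketch of what such a self-check could look like. Everything here is hypothetical: `probe_status`, `http_probe`, the message wording, and the `max_failures` threshold are assumptions for illustration, not how any real status page works.

```python
import urllib.request
from typing import Callable, Iterable


def http_probe(url: str, timeout: float = 5.0) -> Callable[[], bool]:
    """Build a probe that succeeds when the URL answers with a 2xx status."""
    def probe() -> bool:
        # urlopen raises on connection errors, timeouts, and 4xx/5xx responses.
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    return probe


def probe_status(probes: Iterable[Callable[[], bool]], max_failures: int = 0) -> str:
    """Run every probe and return a deliberately vague public message."""
    failures = 0
    for probe in probes:
        try:
            ok = probe()
        except Exception:
            ok = False  # any error counts as a failed probe
        if not ok:
            failures += 1
    if failures > max_failures:
        return "We may be experiencing issues and are looking into it."
    return "No issues detected by automated checks."


# Offline demo with stand-in probes instead of real HTTP calls.
print(probe_status([lambda: True, lambda: False]))
```

In practice the probes would be built with something like `http_probe("https://api.example.com/health")` (a placeholder URL) and run from outside the service's own infrastructure, so that the check does not go down together with the thing it measures.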

rozenmd

There are increasingly more status pages that automatically update based on uptime data (I built a service providing that - OnlineOrNot)

Early-stage startups typically have engineering own the status page, but as they grow, ownership usually transfers to customer support. These teams optimize for controlling the message rather than technical detail, which explains the shift toward vaguer/slower incident descriptions.

Yeri

Because you'd have a ton of downtime and they'd rather hide it if they could. :)

I used to work at a very big cloud service provider, and as the initial comment mentioned, we'd get a ton of escalations/alerts in a day, but the majority didn't necessarily warrant a status page update (only affecting X% of users, or not 'major' enough, or not having any visible public impact).

I don't really agree with that, but that was how it was. A manager would decide whether or not to update the status page, the wording was reviewed before being posted, etc. All that takes a lot of time.

swiftcoder

Not hard at all (our internal dashboards did just that). But to have that data posted publicly was not in the best interests of the business.

And honestly, having been on a few customer escalations where they threatened legal action over outages, one kind of starts to see things the business way...

dvt

> Just stop gaslighting me.

I heard this years ago from someone, but there's material impact to a company's bottom line if those pages get updated, so that's why someone fairly senior usually has to "approve" it. Obviously it's technically trivial, but if they acknowledge downtime (for example, like in the AWS case), investors will have questions, it might make quarterly reports, and it might impact stock price.

So it's not just a "status page," it's an indicator that could affect market sentiment, so there's a lot of pressure to leave everything "green" until there's no way to avoid it.

FinnKuhn

I feel like there should at least be some sort of disclaimer, then, telling me the status page can take up to xx minutes to show an outage, rather than making it seem as if it is updated instantaneously. That way I could wait those xx minutes before I file a ticket with support, and not open the case thinking it is an isolated problem for me instead of a major outage.

Bender

Bureaucracy. Companies have service level agreements with other companies. They want to be damn sure they cannot disavow an outage before something says there is an outage. There will in most cases be a process involved in updating the status page that will intentionally have many layers of bureaucratic hurdles to jump through, including many approvals. The preference will often be to downgrade an "outage" to a "degradation" or "partial outage" or some other term to downplay it and avoid having to pay credits on their B2B service level agreements and such.

mirekrusin

Because they're incentivized to delay it, ideally until resolved; this way their SLA uptime is 100%. Less reported downtime is better for them, so they push it as much as possible. If they were to report all failures, their pretty green history would be filled with red. What are you going to do, sue them? They can do it, so they do.

anomaloustho

It’s already been said, but most companies already have those instant “alarms” that go off within minutes. 80% of the time, those alarms are red herrings that get triaged. At a lot of companies, they go off constantly.

As a company, you don’t want to declare an outage readily and you definitely don’t want it to be declared frequently. Declaring an outage frequently means:

• Telling your exec team that your department is not running well
• Negative signal to your investors
• Bad reputation with your customers
• Admitting culpability to your customers and partners (inviting lawsuits and refunds)
• Telling your engineering leadership team that your specific team isn’t running well
• Messing up your quarterly goals, bonuses etcetera for outages that aren’t real

So every social and incentive structure along the way basically signals that you don’t want to declare an outage when it isn’t real. You want to make sure you get it right. Therefore, you don’t just want to flip a status page because a few API calls had a timeout.
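
One common way to avoid flipping on a handful of timeouts is to require several consecutive failed checks before changing state. The sketch below is only an illustration of that idea; the class name, state names, and thresholds are made up for the example.

```python
class DebouncedStatus:
    """Change state only after a run of consecutive contradicting checks."""

    def __init__(self, failures_to_trip=3, successes_to_clear=3):
        self.failures_to_trip = failures_to_trip
        self.successes_to_clear = successes_to_clear
        self.state = "operational"
        self._streak = 0  # consecutive results that contradict the current state

    def record(self, check_ok):
        """Feed one check result and return the (possibly updated) state."""
        contradicts = check_ok == (self.state == "degraded")
        if not contradicts:
            self._streak = 0  # an agreeing result resets the run
            return self.state
        self._streak += 1
        if self.state == "operational":
            needed = self.failures_to_trip
        else:
            needed = self.successes_to_clear
        if self._streak >= needed:
            self.state = "degraded" if self.state == "operational" else "operational"
            self._streak = 0
        return self.state


status = DebouncedStatus(failures_to_trip=3, successes_to_clear=2)
for ok in (False, True, False, False, False):
    print(status.record(ok))
```

With these settings the isolated failure at the start is ignored, and the state only becomes "degraded" after the third failure in a row. The trade-off is the one discussed throughout this thread: every extra required check adds delay before a real outage is shown.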

FinnKuhn

>So every social and incentive structure along the way basically signals that you don’t want to declare an outage when it isn’t real.

I would argue that every social and incentive structure along the way basically signals that you don't want to declare an outage, even when it is real. You should still do it though or it becomes meaningless.

Great example of Goodhart's law.

gwbas1c

Just wanted to chime in that, at my company, we have some policies that impact when we actually update our status page to show that we have an outage. Without going into details, the policies deliberately slow down our reporting of downtime: We (engineering) need to have a clear understanding of what the problem is before we say there is a problem publicly.

I've personally challenged some details in these policies, which I won't discuss publicly. What I generally agree with is that it's important to have a human in the loop, and to be very thoughtful about when to update a status page and what is put there.

jpalawaga

It’s not a technical issue, it’s a business one.

Those status pages are often linked to contractual SLAs and updating the page tangibly means money lost.

So there’s an incentive to only update it when the issue is severe and not quickly remediated.

It’s not an engineer’s tool, it’s a liability tool.

digitalsushi

Imagine what you could get away with if you owned the ledger of truth. Any time you made an offence, you could just update that ledger to say the people complaining are wrong, and that's the end of it.

I feel that the tech industry does not have sole ownership of this powerful tool

colinbartlett

This delay in status page acknowledgement is a huge reason that my app, StatusGator, has blown up in popularity recently.

We are now regularly detecting outages long before providers acknowledge them which is hugely beneficial to IT teams.

For this Reddit outage, we alerted 13 minutes before the official status page.

Last week's Azure outage, it was 42 minutes prior (!?!).

bithaze

> Why even have a status page if it is not going to be accurate in real time?

The funny thing is reddit's status page used to have real-time graphs of things like error rate, comment backlog, visits, etc. Not with any numbers on the Y-axis, so you could only see relative changes, really, but they were still helpful to see changes before humans got around to updating the status page.

bberenberg

Alert fatigue. Down Detector will show an outage with a service when the intermediate network is down. Companies have to triage alerts and once they’re validated they are posted on a status page. Some companies abuse this to hide their outages. Others delay in a reasonable manner.

I have considered building something to address this and even own honeststatuspage.com to eventually host it on. But it’s a complex problem without an obviously correct answer.

giancarlostoro

I've seen all sorts of Azure outages that never wind up on their status page. Granted they could be unique to a small pool of services.

exasperaited

Yeah. Down Detector is more or less meaningless unless something massive has happened, and as you say it has terrible consequences for knock-on services.

It's not even just intermediate networks, it's sometimes direct connections. For example, a flood of people reporting an outage on mobile phone network X when the problem they are experiencing is not being able to call a loved one who is on phone network Y, which is the one that is down. This happened a little while back in the UK, leading the other phone providers to have to deny there was some broad outage (which is not an easy thing to reassure people about when there are so many MVNOs sharing network Y).

chrismorgan

A few months ago, Cloudflare accidentally turned off 1.1.1.1 (I’m simplifying slightly, most notably DNS-over-HTTPS continued to work). Over the course of five or six minutes, traffic dropped to 10% of normal, and stayed there. Somehow, it took another six minutes before an alert fired, at which point they noticed.

https://news.ycombinator.com/item?id=44578490

You’d think that for such a company they’d notice if global traffic for one of their important services for a given minute had dropped below 50% compared with the last hour, but apparently not.

And that’s Cloudflare, who I would expect better of than most.
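
The check described above (compare the current minute with the last hour and alert below 50%) can be sketched in a few lines. This is a hypothetical illustration, not how Cloudflare or anyone else actually monitors traffic; the class name, window size, and threshold are assumptions.

```python
from collections import deque


class TrafficDropDetector:
    """Flag a minute whose traffic falls below a fraction of the trailing mean."""

    def __init__(self, window_minutes=60, threshold=0.5):
        self.history = deque(maxlen=window_minutes)  # per-minute request counts
        self.threshold = threshold

    def observe(self, requests_this_minute):
        """Record one minute of traffic; return True if it should raise an alert."""
        alert = False
        # Only judge once a full window of history exists, to avoid noisy startup.
        if len(self.history) == self.history.maxlen:
            baseline = sum(self.history) / len(self.history)
            alert = requests_this_minute < self.threshold * baseline
        self.history.append(requests_this_minute)
        return alert


detector = TrafficDropDetector(window_minutes=3, threshold=0.5)
for count in (100, 100, 100, 40):
    print(detector.observe(count))
```

A sustained drop gradually pulls the baseline down as low minutes enter the window, so a real system would need to freeze or separately track the baseline during an incident. It would also need to account for the metric-reporting latency mentioned in the reply below.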

onionisafruit

At my company we will notice that traffic drops significantly for a minute, but thanks to reporting latency we don’t get alerted until a few minutes later. In our business and at our scale, that latency is fine because we aren’t a vital internet service.

edit: I should have followed your link before commenting, because this sentiment is well covered there.

Mojah

Most companies prefer to fix any downtime before it's noticed, and sharing any details on a status page means admitting something went wrong.

There are plenty of status page solutions [1] that tie in uptime monitoring with status updates, essentially providing an "if we get an alert, anyone can follow along through the status page" for near real-time updates. But, it means showing _all_ users that something went wrong, when maybe only a handful noticed it in the first place.

It's a flawed tactic to try and hide/dismiss any downtime (people will notice), but it's in our human nature to try and hide the bad things?

[1] e.g. https://ohdear.app/features/status-pages

xeonmc

Status pages can be replaced with a webcam feed of a whiteboard with post-it notes manually updated by employees.

fouc

I agree, literally the definition of a status page is:

"A status page is used to communicate real-time information about a company's system health, performance, and any ongoing incidents to users. It helps reduce support tickets, improve transparency, and build trust by keeping users informed during outages or maintenance"

real-time. for multiple good reasons. reduces confusion for everyone.