Questions for Cloudflare

50 comments · November 19, 2025

timenotwasted

"I don’t know. I wish technical organisations would be more thorough in investigating accidents." - This is just armchair quarterbacking at this point given that they were forthcoming during the incident and had a detailed post-mortem shortly after. The issue is that by not being a fly on the wall in the war room the OP is making massive assumptions about the level of discussions that take place about these types of incidents long after it has left the collective conscience of the mainstream.

cogman10

People outside of tech (and some inside) can be really bad at understanding how something like this could slip through the cracks.

Reading Cloudflare's description of the problem, this is something I could easily see my own company missing. A file got too big, which tanked performance enough to bring everything down. That's a VERY hard thing to test for, especially since this appears to have been a configuration file and a regular update.

The reason it's so hard to test for is that all tests would show there's no problem. This wasn't a code update, it was a config update. Without really extensive performance tests (which, when done well, take a long time!) there really wasn't a way to know that a change that appeared safe wasn't.
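To make the "all tests would show no problem" point concrete, here is a minimal sketch (hypothetical names and limit, not Cloudflare's code) of why a pre-deploy check over a fixture passes in CI while saying nothing about the file the production pipeline later generates:

    // Hypothetical sketch: a generated config artifact is validated against
    // a hard limit before deployment. The limit and names are illustrative.
    const MAX_FEATURES: usize = 200;

    fn validate_feature_file(features: &[String]) -> Result<(), String> {
        if features.len() > MAX_FEATURES {
            return Err(format!(
                "{} features exceeds limit of {}",
                features.len(),
                MAX_FEATURES
            ));
        }
        Ok(())
    }

    #[cfg(test)]
    mod tests {
        use super::*;

        // The fixture is a snapshot of a "normal" file, so this check always
        // passes in CI. If the production pipeline later emits a much larger
        // file (e.g. because an upstream query starts returning extra rows),
        // no test run ever sees that input.
        #[test]
        fn fixture_is_within_limit() {
            let fixture: Vec<String> = (0..60).map(|i| format!("feature_{i}")).collect();
            assert!(validate_feature_file(&fixture).is_ok());
        }
    }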

I personally give Cloudflare a huge pass for this. I don't think this happened due to any sloppiness on their part.

Now, if you want to see a sloppy outage, look at the CrowdStrike outage from a few years back that bricked basically everything. That is what sheer incompetence looks like.

jsnell

I don't believe that is an accurate description of the issue. It wasn't that the system got too slow due to a big file, it's that the file getting too big was treated as a fatal error rather than causing requests to fail open.
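A minimal sketch of the distinction being drawn here, with an assumed limit and placeholder scoring rather than anything from Cloudflare's actual code:

    // Illustrative only: the same "file too big" condition can either be
    // fatal for every request or degrade gracefully by skipping bot scoring.
    const MAX_FEATURES: usize = 200; // assumed hard limit, for illustration

    // Fail-closed: an oversized feature file is a fatal error, so every
    // request passing through this module fails (surfacing as 5xx responses).
    fn score_fail_closed(features: &[f64]) -> Result<u8, &'static str> {
        if features.len() > MAX_FEATURES {
            return Err("feature file exceeds limit");
        }
        Ok(42) // placeholder score
    }

    // Fail-open: the same condition is degraded rather than fatal; the
    // request proceeds without a bot score (and without bot-based blocking).
    fn score_fail_open(features: &[f64]) -> Option<u8> {
        if features.len() > MAX_FEATURES {
            return None;
        }
        Some(42) // placeholder score
    }

    fn main() {
        let oversized = vec![0.0; 2 * MAX_FEATURES];
        assert!(score_fail_closed(&oversized).is_err()); // every request errors
        assert!(score_fail_open(&oversized).is_none()); // request continues, unscored
    }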

kqr

The article makes no claim about the effort that has gone into the analysis. You can apply a lot of effort and still only produce a shallow analysis.

If the analysis has not uncovered the feedback problems (even with large effort, or without it), my argument is that a better method is needed.

Nextgrid

It is unfair to blame Cloudflare (or AWS, or Azure, or GitHub) for what’s happening, and I say that as one of the biggest “yellers at the cloud” on here.

Ultimately end-users don’t have a relationship with any of those companies. They have relationships with businesses that chose to rely on them. Cloudflare et al. publish SLAs and compensation schedules for when those SLAs are missed. Businesses chose to accept those SLAs and take on that risk.

If Cloudflare et al. signed a contract promising a certain SLA (with penalties) and then chose not to pay out those penalties, there would be reason to ask questions, but nothing suggests they’re not holding up their side of the deal - you will absolutely get compensated (in the form of a refund on your bill) in case of an outage.

The issue is that businesses accept this deal and then scream when it goes wrong, yet are unwilling to pay for a solution that does not fail in this way. Those solutions exist - you absolutely can build systems that are reliable and/or fail in a predictable and testable manner; it’s simply more expensive and requires more skill than slapping a few SaaSes and CNCF projects together. But it is possible - look at the uptime of card networks, stock exchanges, or airplane avionics. It’s just more expensive, and the truth is that businesses don’t want to pay for it (and neither do their end-customers - they will bitch about outages, but will immediately run the other way if you ask them to pony up for a more reliable system; the ones that don’t run away already run such a system and were unaffected by the recent outages).

psim1

> It is unfair to blame Cloudflare (or AWS, or Azure, or GitHub) for what’s happening

> Ultimately end-users don’t have a relationship with any of those companies. They have relationships with businesses that chose to rely on them

Could you not say this about any supplier relationship? No, in this case, we all know the root of the outage is Cloudflare, so it absolutely makes sense to blame Cloudflare, and not their customers.

wongarsu

Don't we say that about all supplier relationships? If my Samsung washing machine stops working I blame Samsung. Even when it turns out that it was a broken drive belt I don't blame the manufacturer of the drive belt, or whoever produced the rubber that went into the drive belt, or whoever made the machine involved in the production of this batch of rubber. Samsung chose to put the drive belt in my washing machine; that's where the buck stops. They are free to litigate the matter internally, but I only care about Samsung selling me a washing machine that's now broken.

Same with Cloudflare. If you run your site on Cloudflare, you are responsible for any downtime caused to your site by Cloudflare.

What we can blame Cloudflare for is having so many customers that a Cloudflare outage has outsized impact compared to the more uncorrelated outages we would have if sites were distributed among many smaller providers. But that's not quite the same as blaming any individual site being down on Cloudflare.

raincole

> Don't we say that about all supplier relationships?

Not always. If the farm sells packs of poisoned bacon to the supermarket, we blame the farm.

It's more about whether the website/supermarket can reasonably do the QA.

Nextgrid

Devil’s advocate: I operate the equivalent of an online lemonade stand, some shitty service at a cheap price offered with few guarantees (“if I fuck up I’ll refund you the price of your ‘lemonade’”) for hobbyists to use to host their blog, and Visa decides to use it in their critical path. Then this “lemonade stand” goes down. Do you think it’s fair to blame me? I never chose to be part of Visa’s authorization loop, and after all is done I did indeed refund them the price of their “lemonade”. It’s Visa’s fault for introducing a single point of failure with inadequate compensation schedules into their critical path.

stronglikedan

> Do you think it’s fair to blame me?

Absolutely, yes. Where's your backup plan for when Visa doesn't behave as you expect? It's okay to not have one, but it's also your fault for not having one, and that is the sole reason that the lemonade stand went down.

stronglikedan

If I'm paying a company that chose Cloudflare, and my SLA with that company entitles me to some sort of compensation for outages, then I expect that company to compensate me regardless of whose fault it is, and regardless of whether they were compensated by Cloudflare. I can know that the cause of the outage is Cloudflare, but also know that the company I'm paying should have had a backup plan and not be solely reliant on one vendor. In other words, I care about who I pay, not who they decide to use.

mschuster91

> look at the uptime of card networks, stock exchanges, or airplane avionics.

In fact, I'd say... airplane avionics are not what you should be looking at. Boeing's 787? Reboot every 51 days or risk the pilots getting wrong airspeed indicators! No, I'm not joking [1], and it's not the first time either [2], and it's not just Boeing [3].

[1] https://www.theregister.com/2020/04/02/boeing_787_power_cycl...

[2] https://www.theregister.com/2015/05/01/787_software_bug_can_...

[3] https://www.theregister.com/2019/07/25/a350_power_cycle_soft...

Nextgrid

> Reboot every 51 days or risk the pilots getting wrong airspeed indicators

If this is documented then fair enough - airlines don’t have to buy airplanes that need rebooting every 51 days, they can vote with their wallets and Boeing is welcome to fix it. If not documented, I hope regulators enforced penalties high enough to force Boeing to get their stuff together.

Either way, the uptime of avionics (and redundancies - including the unreliable airspeed checklists) are much higher than anything conventional software “engineering” has been putting out the past decade.

otterley

The post is describing a full post-mortem process including a Five Whys (https://en.wikipedia.org/wiki/Five_whys) inquiry. In a mature organization that follows best SRE practices, this will be performed by the relevant service teams, recorded in the post-mortem document, and used for creating follow-up actions. It's almost always an internal process and isn't shared with the public--and often not even with customers under NDA.
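As a rough illustration of what such a chain looks like (a hypothetical record for an outage of this general shape, not Cloudflare's internal findings):

    // Hypothetical Five Whys chain, loosely following the public account of
    // the incident; the wording is illustrative, not an internal document.
    fn main() {
        let five_whys = [
            ("Why did requests fail?",
             "The proxy's bot module returned errors for every request."),
            ("Why did the module error?",
             "Its feature file exceeded a hard size limit, treated as fatal."),
            ("Why did the file exceed the limit?",
             "An upstream data change roughly doubled the number of entries."),
            ("Why wasn't the growth caught before rollout?",
             "The file is generated from live data, so pre-deploy checks never saw the larger input."),
            ("Why did one bad file take down the whole request path?",
             "The failure mode was fail-closed rather than fail-open."),
        ];
        for (why, because) in five_whys {
            println!("{}\n  -> {}", why, because);
        }
    }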

We mustn't assume that Cloudflare isn't undertaking this process just because we're not an audience to it.

tptacek

It also couldn't have happened by the time the postmortem was produced. The author of this blog post appears not to have noticed that the postmortem was up within a couple hours of resolving the incident.

otterley

Exactly. These deeper investigations can sometimes take weeks to complete.

dkyc

These engineering insights were not worth the 16-second load time of this website.

It's extremely easy, and correspondingly valueless, to ask all kinds of "hard questions" about a system 24h after it had a huge incident. The hard part is doing this appropriately for every part of the system before something happens, while maintaining the other equally rightful goals of the organization (such as cost-efficiency, product experience, performance, etc.). There's little evidence to suggest Cloudflare isn't doing that, and their track record is definitely good for their scale.

raincole

Every engineer goes through a phase where they're capable enough to do something at small scale, so they look at the incumbents, who are doing a similar thing but at 1000x scale, and wonder how they can be so bad at it.

Some never get out of this phase though.

RationPhantoms

> I wish technical organisations would be more thorough in investigating accidents.

Cloudflare is probably one of the best "voices" in the industry when it comes to post-mortems and root cause analysis.

tptacek

I wish blog posts like these would be more thorough in simply looking at the timestamps on the posts they're critiquing.

ItsHarper

If you read their previous article about AWS (linked in this one), they specifically call out root cause analysis as a flawed approach.

spenrose

I am disappointed to see this article flagged. I thought it was excellent.

waiwai933

> Maybe some of these questions are obviously answered in a Cloudflare control panel or help document. I’m not in the market right now so I won’t do that research.

I don't love piling on, but it still shocks me that people write without first reading.

blixt

It's a bit odd to come from the outside to judge the internal process of an organization with many very complex moving parts, only a fraction of which we have been given context for, especially so soon after the incident and the post-mortem explaining it.

I think the ultimate judgement must come from whether we will stay with Cloudflare now that we have seen how bad it can get. One could also say that this level of outage hasn't happened in many years, and they are now freshly frightened of it happening again, so expect things to get tightened up (probably using different questions than this blog post proposes).

As for what this blog post could have been: maybe a page out of how these ideas were actively used by the author at e.g. Tradera or Loop54.

kqr

> how these ideas were actively used by the author at e.g. Tradera or Loop54.

This would be preferable, of course. Unfortunately both organisations were rather secretive about their technical and social deficiencies and I don't want to be the one to air them out like that.

vlovich123

A lot of these questions belie a misunderstanding of how it works - bot management is evaluated inline within the proxy as a feature on the site (similar to other features like image optimization).

So during ingress there’s not an async call to the bot management service which intercepts the request before it’s outbound to origin - it’s literally a Lua script (or Rust module in fl2) that runs on ingress inline as part of handling the request. Thus there’s no timeout or other concerns with the management service failing to assign a bot score.
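A rough sketch of that shape, with hypothetical types rather than Cloudflare's actual API:

    // Bot scoring as one inline step of request handling, not an RPC to a
    // separate service with its own timeout. Types and logic are made up.
    struct Request {
        path: String,
        bot_score: Option<u8>,
    }

    // In-process evaluation against already-loaded config/model state.
    // There is no network call here, so "the bot service timed out" is not
    // a failure mode; the failure mode is whatever this function does.
    fn compute_bot_score(req: &Request) -> Option<u8> {
        Some((req.path.len() % 100) as u8) // placeholder scoring logic
    }

    fn handle_ingress(mut req: Request) -> Request {
        // Bot management runs alongside other per-zone features (WAF, image
        // optimization, ...) while the request is processed, before it is
        // forwarded to the origin.
        req.bot_score = compute_bot_score(&req);
        req
    }

    fn main() {
        let req = handle_ingress(Request { path: "/blog".into(), bot_score: None });
        println!("bot score: {:?}", req.bot_score);
    }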

There are better questions but to me the ones posed don’t seem particularly interesting.

mnholt

This website could benefit from a CDN…

majke

Questions for "questions for cloudflare" owner

internetter

8.5s... yikes... although notably they aren't adopting an anti-CDN or even really anti-Cloudflare perspective, just grievances with software architecture. So the slowness of their site isn't really detrimental to their argument.

Sesse__

I loaded it and got an LCP of ~350 ms, which is better than the ~550 ms I got from this very comment page.

jcmfernandes

The tone is off. Cloudflare shared a post-mortem on the same day as the incident. It's unreasonable to throw an "I wish technical organisations would be more thorough in investigating accidents".

With that said, I would also like to know how it took them ~2 hours to see the error. That's a long, long time.

tptacek

It's a detailed postmortem published within a couple hours of the incident and this blog post is disappointed that it didn't provide a comprehensive assessment of all the procedural changes inside the engineering organization that came as a consequence. At the point in time when this blog post was written, it would not have been possible for them to answer these questions.

kqr

Part of my argument in the article is that it doesn't take long to come to that realisation if you use the right methods. It would absolutely have been possible to identify the problem of missing feedback by that time.

otterley

"But I need attention now!"