Cloudflare outage on November 18, 2025 post mortem
94 comments
November 18, 2025
ojosilva
This is the multi-million dollar .unwrap() story. In a critical path of infrastructure serving a significant chunk of the internet, calling .unwrap() on a Result means you're saying "this can never fail, and if it does, crash the thread immediately." The Rust compiler forced them to acknowledge this could fail (that's what Result is for), but they explicitly chose to panic instead of handling it gracefully. This is the textbook anti-pattern that "parse, don't validate" warns against.
I know, this is "Monday morning quarterbacking", but that's what you get for an outage this big that had me tied up for half a day.
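To make the contrast concrete, here is a minimal sketch with made-up names rather than Cloudflare's actual code: the unwrap turns one bad file into a dead worker thread, while matching on the Err arm lets the proxy degrade gracefully.

    // Hypothetical stand-in for the feature-file parser; not Cloudflare's code.
    fn parse_feature_file(raw: &str) -> Result<Vec<f64>, std::num::ParseFloatError> {
        raw.split(',').map(str::parse).collect()
    }

    fn main() {
        let bad_input = "0.1,0.2,not-a-number";

        // The pattern from the post mortem: an Err here aborts the worker thread.
        // let features = parse_feature_file(bad_input).unwrap();

        // Handling the Err arm instead keeps the thread alive and the old
        // configuration in service.
        match parse_feature_file(bad_input) {
            Ok(features) => println!("loaded {} features", features.len()),
            Err(e) => eprintln!("rejecting bad feature file: {e}"),
        }
    }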
wrs
It seems people have a blind spot for unwrap, perhaps because it's so often used in example code. In production code an unwrap or expect should be reviewed exactly like a panic.
It's not necessarily invalid to use unwrap in production code if you would just call panic anyway. But just like every unsafe block needs a SAFETY comment, every unwrap in production code needs an INFALLIBILITY comment. clippy::unwrap_used can enforce this.
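A sketch of what that looks like in practice (the clippy lint names are real; the function and its justification are invented):

    // Crate-level lints make every unwrap()/expect() a hard clippy error, so
    // each surviving call site needs an explicit #[allow] plus a justification.
    #![deny(clippy::unwrap_used)]
    #![deny(clippy::expect_used)]

    fn listen_port(s: &str) -> u16 {
        // INFALLIBILITY: `s` is a hard-coded constant covered by tests; a panic
        // here can only happen at startup, never while serving traffic.
        #[allow(clippy::expect_used)]
        let port = s.parse().expect("hard-coded port is a valid u16");
        port
    }

    fn main() {
        println!("listening on {}", listen_port("8443"));
    }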
arccy
if you make it easier to be lazy and panic than to properly handle the error, you've designed a poor language
otterley
> work has already begun on how we will harden them against failures like this in the future. In particular we are:
> Hardening ingestion of Cloudflare-generated configuration files in the same way we would for user-generated input
> Enabling more global kill switches for features
> Eliminating the ability for core dumps or other error reports to overwhelm system resources
> Reviewing failure modes for error conditions across all core proxy modules
Absent from this list are canary deployments and incremental or wave-based deployment of configuration files (which are often as dangerous as code changes) across fault isolation boundaries -- assuming CloudFlare has such boundaries at all. How are they going to contain the blast radius in the future?
This is something the industry was supposed to learn from the CrowdStrike incident last year, but it's clear that we still have a long way to go.
Also, enabling global anything (i.e., "enabling global kill switches for features") sounds like an incredibly risky idea. One can imagine a bug in a global switch that transforms disabling a feature into disabling an entire system.
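For concreteness, here is a sketch of the kind of wave-based rollout being asked for, with invented machine names, probes, and timings; a file that breaks the canary never reaches the later waves.

    use std::{thread, time::Duration};

    struct Machine {
        name: String,
    }

    fn push_config(machine: &Machine, config: &[u8]) -> Result<(), String> {
        // Stand-in for an RPC that installs the config file on one machine.
        println!("pushed {} bytes to {}", config.len(), machine.name);
        Ok(())
    }

    fn healthy(machine: &Machine) -> bool {
        // Stand-in health probe: did the proxy reload without panicking,
        // is its 5xx rate still within budget, and so on.
        println!("probing {}", machine.name);
        true
    }

    fn rollout(waves: &[Vec<Machine>], config: &[u8], bake: Duration) -> Result<(), String> {
        for (i, wave) in waves.iter().enumerate() {
            for m in wave {
                push_config(m, config)?;
            }
            // Let the new config bake before touching the next wave.
            thread::sleep(bake);
            if wave.iter().any(|m| !healthy(m)) {
                return Err(format!("wave {i} unhealthy; halting rollout"));
            }
        }
        Ok(())
    }

    fn main() {
        let waves = vec![
            vec![Machine { name: "canary-1".into() }],
            vec![
                Machine { name: "region-a-1".into() },
                Machine { name: "region-a-2".into() },
            ],
        ];
        match rollout(&waves, b"feature-config-v2", Duration::from_secs(1)) {
            Ok(()) => println!("rollout complete"),
            Err(e) => eprintln!("rollout aborted, later waves untouched: {e}"),
        }
    }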
nikcub
They require the bot management config to update and propagate quickly in order to respond to attacks - but this seems like a case where updating a single instance first would have seen the panic and stopped the deploy.
I wonder why ClickHouse is used to store the feature flags here, as it has its own duplication footguns[0] which could also easily have led to a query blowing up 2-3x in size. OLTP/SQLite seems more suited, but I'm sure they have their reasons.
[0] https://clickhouse.com/docs/guides/developer/deduplication
mewpmewp2
It seems they had this continuous rollout for the config service, but the services consuming it were affected even by a small percentage of these config providers being faulty, since they were auto-updating their configs every few minutes. And it seems there is a reason for these updating so fast, presumably having to react to threat actors quickly.
otterley
It's in everyone's interest to mitigate threats as quickly as possible. But it's of even greater interest that a core global network infrastructure service provider not DOS a significant proportion of the Internet by propagating a bad configuration too quickly. The key here is to balance responsiveness against safety, and I'm not sure they struck the right balance here. I'm just glad that the impact wasn't as long and as severe as it could have been.
SerCe
As always, kudos for releasing a post mortem less than 24 hours after the outage; very few tech organisations are capable of doing this.
yen223
I'm curious about how their internal policies work such that they are allowed to publish a post mortem this quickly, and with this much transparency.
At any other large-ish company, there would be layers of "stakeholders" that would slow this process down. They would almost never allow code to be published.
eastdakota
Well… we have a culture of transparency we take seriously. I spent 3 years in law school that many times over my career have seemed like wastes but days like today prove useful. I was in the triage video bridge call nearly the whole time. Spent some time after we got things under control talking to customers. Then went home. I’m currently in Lisbon at our EUHQ. I texted John Graham-Cumming, our former CTO and current Board member whose clarity of writing I’ve always admired. He came over. Brought his son (“to show that work isn’t always fun”). Our Chief Legal Officer (Doug) happened to be in town. He came over too. The team had put together a technical doc with all the details. A tick-tock of what had happened and when. I locked myself on a balcony and started writing the intro and conclusion in my trusty BBEdit text editor. John started working on the technical middle. Doug provided edits here and there on places we weren’t clear. At some point John ordered sushi but from a place with limited delivery selection options, and I’m allergic to shellfish, so I ordered a burrito. The team continued to flesh out what happened. As we’d write we’d discover questions: how could a database permission change impact query results? Why were we making a permission change in the first place? We asked in the Google Doc. Answers came back. A few hours ago we declared it done. I read it top-to-bottom out loud for Doug, John, and John’s son. None of us were happy — we were embarrassed by what had happened — but we declared it true and accurate. I sent a draft to Michelle, who’s in SF. The technical teams gave it a once over. Our social media team staged it to our blog. I texted John to see if he wanted to post it to HN. He didn’t reply after a few minutes so I did. That was the process.
madeofpalk
From what I've observed, it depends on whether you're an "engineering company" or not.
tom1337
I mean, the CEO posted the post-mortem, so there aren't that many layers of stakeholders above. For other post-mortems by engineers, Matthew once said that the engineering team runs the blog and that he wouldn't even know how to veto it even if he wanted to [0]
thesh4d0w
The person who posted both this blog article and the Hacker News post is Matthew Prince, one of the highly technical billionaire founders of Cloudflare. I'm sure if he wants something to happen, it happens.
gucci-on-fleek
> This showed up to Internet users trying to access our customers' sites as an error page indicating a failure within Cloudflare's network.
As a visitor to random web pages, I definitely appreciated this—much better than their completely false “checking the security of your connection” message.
> The issue was not caused, directly or indirectly, by a cyber attack or malicious activity of any kind. Instead, it was triggered by a change to one of our database systems' permissions
Also appreciate the honesty here.
> On 18 November 2025 at 11:20 UTC (all times in this blog are UTC), Cloudflare's network began experiencing significant failures to deliver core network traffic. […]
> Core traffic was largely flowing as normal by 14:30. We worked over the next few hours to mitigate increased load on various parts of our network as traffic rushed back online. As of 17:06 all systems at Cloudflare were functioning as normal.
Why did this take so long to resolve? I read through the entire article, and I understand why the outage happened, but when most of the network goes down, why wasn't the first step to revert any recent configuration changes, even ones that seem unrelated to the outage? (Or did I just misread something and this was explained somewhere?)
Of course, the correct solution is always obvious in retrospect, and it's impressive that it only took 7 minutes between the start of the outage and the incident being investigated, but taking roughly three more hours to restore core traffic and nearly six hours total for everything to be back to normal isn't great.
eastdakota
Because we initially thought it was an attack. And then when we figured it out we didn’t have a way to insert a good file into the queue. And then we needed to reboot processes on (a lot of) machines worldwide to get them to flush their bad files.
gucci-on-fleek
Thanks for the explanation! This definitely reminds me of the CrowdStrike outages last year:
- A product depends on frequent configuration updates to defend against attackers.
- A bad data file is pushed into production.
- The system is unable to easily/automatically recover from bad data files.
(The CrowdStrike outages were quite a bit worse though, since they took down entire computers and remediation required manual intervention on thousands of desktops, whereas parts of Cloudflare were still usable throughout the outage and the issue was 100% resolved in a few hours)
tptacek
Richard Cook #18 (and #10) strikes again!
https://how.complexsystems.fail/#18
It'd be fun to read more about how you all procedurally respond to this (but maybe this is just a fixation of mine lately). Like are you tabletopping this scenario, are teams building out runbooks for how to quickly resolve this, what's the balancing test for "this needs a functional change to how our distributed systems work" vs. "instead of layering additional complexity on, we should just have a process for quickly and maybe even speculatively restoring this part of the system to a known good state in an outage".
tetec1
Yeah, I can imagine that this insertion was some high-pressure job.
lukan
"Throwing us off and making us believe this might have been an attack was another apparent symptom we observed: Cloudflare’s status page went down. The status page is hosted completely off Cloudflare’s infrastructure with no dependencies on Cloudflare. While it turned out to be a coincidence, it led some of the team diagnosing the issue to believe that an attacker may be targeting both our systems as well as our status page."
Unfortunately they do not share what caused the status page to go down as well. (Does this happen often? Otherwise it seems like a big coincidence.)
eastdakota
We don’t know. Suspect it may just have been a big uptick in load and a failure of its underlying infrastructure to scale up.
reassess_blind
The status page is hosted on AWS Cloudfront, right? It sure looks like Cloudfront was overwhelmed by the traffic spike, which is a bit concerning. Hope we'll see a post from their side.
dnw
Yes, probably a bunch of automated bots decided to check the status page when they saw failures in production.
paulddraper
Quite possibly it was due to high traffic.
I don't know Atlassian Statuspage's clientele, but it's possible Cloudflare is much larger than their usual customer.
EvanAnderson
It reads a lot like the Crowdstrike SNAFU. Machine-generated configuration file b0rks-up the software that consumes it.
The "...was then propagated to all the machines that make up our network..." followed by "....caused the software to fail." screams for a phased rollout / rollback methodology. I get that "...it’s critical that it is rolled out frequently and rapidly as bad actors change their tactics quickly" but today's outage highlights that rapid deployment isn't all upside.
The remediation section doesn't give me any sense that phased deployment, acceptance testing, and rapid rollback are part of the planned remediation strategy.
tptacek
I don't think this system is best thought of as "deployment" in the sense of CI/CD; it's a control channel for a distributed bot detection system that (apparently) happens to be actuated by published config files (it has a consul-template vibe to it, though I don't know if that's what it is).
EvanAnderson
That's why I likened it to Crowdstrike. It's a signature database that blew up the consumer of said database. (You probably caught my post mid-edit, too. You may be replying to the snarky paragraph I thought better of and removed.)
Edit: Similar to Crowdstrike, the bot detector should have fallen-back to its last-known-good signature database after panicking, instead of just continuing to panic.
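A sketch of what that fallback could look like on the consumer side, with invented types rather than Cloudflare's actual code: refresh validates the candidate file and keeps serving the previous good one when validation fails.

    use std::sync::{Arc, RwLock};

    struct BotConfig {
        version: u64,
        features: Vec<f64>,
    }

    struct ConfigHolder {
        current: RwLock<Arc<BotConfig>>,
    }

    impl ConfigHolder {
        fn new(initial: BotConfig) -> Self {
            Self { current: RwLock::new(Arc::new(initial)) }
        }

        // Refresh never panics: a candidate that failed parsing or validation
        // is logged and dropped, and readers keep the previous good config.
        fn refresh(&self, candidate: Result<BotConfig, String>) {
            match candidate {
                Ok(cfg) => {
                    // Poisoning only happens if a writer panicked, which this
                    // design is meant to rule out; the assumption is documented.
                    *self.current.write().expect("config lock poisoned") = Arc::new(cfg);
                }
                Err(e) => eprintln!("keeping last known good config: {e}"),
            }
        }

        fn get(&self) -> Arc<BotConfig> {
            self.current.read().expect("config lock poisoned").clone()
        }
    }

    fn main() {
        let holder = ConfigHolder::new(BotConfig { version: 1, features: vec![0.1; 60] });
        // A bad file (say, one over the feature limit) is rejected on refresh...
        holder.refresh(Err("too many features".into()));
        // ...and traffic keeps being scored with the old version.
        let cfg = holder.get();
        println!("serving config v{} with {} features", cfg.version, cfg.features.len());
    }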
eastdakota
That’s correct.
tptacek
Is it actually consul-template? (I have post-consul-template stress disorder).
navigate8310
I'm amazed that they are not using a simulator of some sort and are instead pushing changes directly to production.
Aeolun
I’m fairly certain it will be after they read this thread. It doesn’t feel like they don’t want to improve, or are incapable of it.
dzonga
> thread fl2_worker_thread panicked: called Result::unwrap() on an Err value
I don't use Rust, but a lot of Rust people say if it compiles it runs.
Well Rust won't save you from the usual programming mistake. Not blaming anyone at cloudflare here. I love Cloudflare and the awesome tools they put out.
end of day - let's pick languages | tech because of what we love to do. if you love Rust - pick it all day. I actually wanna try it for industrial robot stuff or small controllers etc.
there's no bad language - just occasional hiccups from us users who use those tools.
jryio
You misunderstand what Rust’s guarantees are. Rust has never promised to solve or protect programmers from logic errors or poor programming. In fact, no language can do that, not even Haskell.
Unwrapping is a very powerful and important assertion to make in Rust, whereby the programmer explicitly states that the value within will not be an error, and that the program should panic otherwise. This is a contract between the author and the runtime. As you mentioned, this is a human failure, not a language failure.
Pause for a moment and think about what a C++ implementation of a globally distributed network ingress proxy service would look like - and how many memory vulnerabilities there would be… I shudder at the thought… (n.b. nginx)
This is the classic example of over-indexing on the cause of a failure when something breaks, while under-indexing on the quadrillions of memory accesses that went off without a single hitch thanks to the borrow checker.
I postulate that whatever this Cloudflare outage cost, whether millions or hundreds of millions of dollars, it has been more than paid for by the savings from safe memory access.
metaltyphoon
> Well Rust won't save you from the usual programming mistake
This is not a Rust problem. Someone consciously chose to NOT handle an error, possibly thinking "this will never happen". Then someone else consciously reviewed (I hope so) a PR with an unwrap() and let it slide.
tptacek
What people are saying is that idiomatic prod rust doesn't use unwrap/expect (both of which panic on the "exceptional" arm of the value) --- instead you "match" on the value and kick the can up a layer on the call chain.
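A small sketch of that idiom with invented names: the hot path reports the error with ? and only the top of the call chain decides the failure policy.

    #[derive(Debug)]
    enum FeatureError {
        TooMany(usize),
    }

    fn append_feature(features: &mut Vec<f64>, value: f64, max: usize) -> Result<(), FeatureError> {
        if features.len() >= max {
            return Err(FeatureError::TooMany(features.len()));
        }
        features.push(value);
        Ok(())
    }

    fn load_file(values: &[f64]) -> Result<Vec<f64>, FeatureError> {
        let mut features = Vec::new();
        for &v in values {
            // `?` propagates the error up a layer instead of unwrapping it away.
            append_feature(&mut features, v, 200)?;
        }
        Ok(features)
    }

    fn main() {
        // The caller, not the hot path, chooses what failure means.
        match load_file(&[0.5; 300]) {
            Ok(f) => println!("loaded {} features", f.len()),
            Err(e) => eprintln!("config rejected, keeping the old one: {e:?}"),
        }
    }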
dzonga
other people might say - why use unsafe Rust - but we don't know the conditions the original code shipped under, or why the PR was approved.
it could have been a tight deadline, managerial pressure, or just the occasional slip-up.
vsgherzi
Why does cloudflare allow unwraps in their code? I would've assumed they'd have clippy lints stopping that sort of thing. Why not just match with { Ok(value) => {}, Err(error) => {} }? The function already has a Result type.
At the bare minimum they could've used an expect("this should never happen, if it does database schema is incorrect").
The whole point of errors as values is preventing this kind of thing.... It wouldn't have stopped the outage but it would've made it easy to diagnose.
If anyone at cloudflare is here please let me in that codebase :)
waterTanuki
Not a cloudflare employee but I do write a lot of Rust. The number of things that can go wrong with any code that needs to make a network call is staggeringly high. unwrap() is normal during the development phase, but there are a number of times I leave an expect() in for production because sometimes there's no way to move forward.
vsgherzi
I'm in a similar boat; at the very least an expect can give hints about what happened. However this can also be problematic if you're a library developer. Sometimes Rust code is expected to never panic, especially in situations like WASM. This is a major problem for companies like Amazon Prime Video since they run in a WASM context for their TV app. Any panic crashes everything. Personally I usually just either create a custom error type (preferred) or erase it away with Box<dyn Error> (if there's no other option). Random unwraps and expects haunt my dreams.
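A compact sketch of those two shapes (names invented): a dedicated error enum that callers can match on, and a type-erased Box<dyn Error> for when nobody downstream needs to.

    use std::{error::Error, fmt};

    #[derive(Debug)]
    enum ScoreError {
        MissingFeature(&'static str),
    }

    impl fmt::Display for ScoreError {
        fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
            match self {
                ScoreError::MissingFeature(name) => write!(f, "missing feature: {name}"),
            }
        }
    }

    impl Error for ScoreError {}

    // Library-style API: a precise error type the caller can match on.
    fn score(features: &[(&'static str, f64)]) -> Result<f64, ScoreError> {
        features
            .iter()
            .copied()
            .find(|(name, _)| *name == "bot_likelihood")
            .map(|(_, v)| v)
            .ok_or(ScoreError::MissingFeature("bot_likelihood"))
    }

    // Application-style API: erase the type when callers only log the error.
    fn score_erased(features: &[(&'static str, f64)]) -> Result<f64, Box<dyn Error>> {
        Ok(score(features)?)
    }

    fn main() {
        match score_erased(&[("asn", 13335.0)]) {
            Ok(s) => println!("score: {s}"),
            Err(e) => eprintln!("scoring failed, not panicking: {e}"),
        }
    }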
nanankcornering
Matt, looking forward to you regaining Elon's and his team's trust so they use CF again.
ed_mercer
Wow. 26M 5xx HTTP errors per second over a span of roughly two hours. That's roughly 187 billion HTTP errors that interrupted people (and systems)!
trengrj
Classic combination of errors:
Having the feature table pivoted (with 200 feature1, feature2, etc columns) meant they had to do meta queries to system.columns to get all the feature columns which made the query sensitive to permissioning changes (especially duplicate databases).
A Crowdstrike style config update that affects all nodes but obviously isn't tested in any QA or staged rollout strategy beforehand (the application panicking straight away with this new file basically proves this).
Finally, an error in the bot management config files should probably disable bot management rather than crash the core proxy.
I'm interested in why they even decided to name ClickHouse, as this error could have been caused by any other database. I can see, though, that the replicas updating and causing flip-flopping results would have been really frustrating for incident responders.
tptacek
Right but also this is a pretty common pattern in distributed systems that publish from databases (really any large central source of truth); it might be *the* problem in systems like this. When you're lucky the corner cases are obvious; in the big one we experienced last year, a new row in our database tripped an if-let/mutex deadlock, which our system dutifully (and very quickly) propagated across our entire network.
The solution to that problem wasn't better testing of database permutations or a better staging environment (though in time we did do those things). It was (1) a watchdog system in our proxies to catch arbitrary deadlocks (which caught other stuff later), (2) segmenting our global broadcast domain for changes into regional broadcast domains so prod rollouts are implicitly staged, and (3) a process for operators to quickly restore that system to a known good state in the early stages of an outage.
(Cloudflare's responses will be different than ours, really I'm just sticking up for the idea that the changes you need don't follow obviously from the immediate facts of an outage.)
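As a toy illustration of the watchdog idea (not either company's actual implementation): workers stamp a heartbeat, and a monitor thread flags any worker that stops making progress, whether from a deadlock or a wedged config load.

    use std::sync::atomic::{AtomicU64, Ordering};
    use std::sync::Arc;
    use std::thread;
    use std::time::{Duration, SystemTime, UNIX_EPOCH};

    fn now_secs() -> u64 {
        SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .map(|d| d.as_secs())
            .unwrap_or(0)
    }

    fn main() {
        let heartbeat = Arc::new(AtomicU64::new(now_secs()));

        // Worker: stamps the heartbeat at the top of every request loop.
        let hb = Arc::clone(&heartbeat);
        thread::spawn(move || loop {
            hb.store(now_secs(), Ordering::Relaxed);
            thread::sleep(Duration::from_millis(100)); // stand-in for handling a request
        });

        // Watchdog: if the worker stops making progress, alert or restart it
        // instead of letting it hang silently.
        for _ in 0..3 {
            thread::sleep(Duration::from_secs(1));
            let age = now_secs().saturating_sub(heartbeat.load(Ordering::Relaxed));
            if age > 5 {
                eprintln!("worker stalled for {age}s; restarting it");
                // restart_worker();  // hypothetical recovery hook
            } else {
                println!("worker healthy ({age}s since last heartbeat)");
            }
        }
    }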
Related: Cloudflare Global Network experiencing issues - https://news.ycombinator.com/item?id=45963780 - Nov 2025 (1580 comments)