Cloudflare outage on December 5, 2025
96 comments · December 5, 2025
dkyc
One thing to keep in mind when judging what's 'appropriate' is that Cloudflare was effectively responding to an ongoing security incident outside of their control (the React Server RCE vulnerability). Part of Cloudflare's value proposition is being quick to react to such threats. That changes the equation a bit: every hour you wait longer to deploy, your customers are actively getting hacked through a known high-severity vulnerability.
In this case it's not just a matter of 'hold back for another day to make sure it's done right', like when adding a new feature to a normal SaaS application. In Cloudflare's case moving slower also comes with a real cost.
That isn't to say it didn't work out badly this time, just that the calculation is a bit different.
liampulles
Rollback is a reliable strategy when the rollback process is well understood. If a rollback process is not well known and well practiced, then it is a risk in itself.
I'm not sure of the nature of the rollback process in this case, but leaning on ill-founded assumptions is a bad practice. I do agree that a global rollout is a problem.
lukeasrodgers
Roll back is not always the right answer. I can’t speak to its appropriateness in this particular situation of course, but sometimes “roll forward” is the better solution.
echelon
You want to build a world where roll back is 95% the right thing to do. So that it almost always works and you don't even have to think about it.
Certain well-understood migrations are the only cases where roll back might not be acceptable.
Always keep your services in a "rollback-able", "graceful fail", "fail open" state.
This requires tremendous engineering consciousness across the entire org. Every team must be a diligent custodian of this. And even then, it will sometimes break down.
NoSalt
Ooh ... I want to be on a cowboy decision making team!!!
this_user
The question is perhaps what the shape and status of their tech stack is. Obviously, they are running at massive scale, and they have grown extremely aggressively over the years. What's more, especially over the last few years, they have been adding new product after new product. How much tech debt have they accumulated with that "move fast" approach that is now starting to rear its head?
otterley
From the post:
“We have spoken directly with hundreds of customers following that incident and shared our plans to make changes to prevent single updates from causing widespread impact like this. We believe these changes would have helped prevent the impact of today’s incident but, unfortunately, we have not finished deploying them yet.
“We know it is disappointing that this work has not been completed yet. It remains our first priority across the organization.”
nine_k
> more to the story
From a more tinfoil-wearing angle, it may not even be a regular deployment, given the idea of Cloudflare being "the largest MitM attack in history". ("Maybe not even by Cloudflare but by NSA", would say some conspiracy theorists, which is, of course, completely bonkers: NSA is supposed to employ engineers who never let such blunders blow their cover.)
egorfine
> provides customers with protection against malicious payloads, allowing them to be detected and blocked. To do this, Cloudflare’s proxy buffers HTTP request body content in memory for analysis.
I have a mixed feeling about this.
On one hand, I absolutely don't want a CDN to look inside my payloads and decide what's good for me or not. Today it's protection, tomorrow it's censorship.
On the other hand, this is exactly what Cloudflare is good for: protecting sites from malicious requests.
paradite
The deployment pattern from Cloudflare looks insane to me.
I've worked at one of the top fintech firms; whenever we do a config change or deployment, we are supposed to have a rollback plan ready and monitor key dashboards for 15-30 minutes.
The dashboards need to be prepared beforehand on systems and key business metrics that would be affected by the deployment and reviewed by teammates.
I've never seen a downtime longer than 1 minute while I was there, because you get a spike on the dashboard immediately when something goes wrong.
For the entire system to be down for 10+ minutes due to a bad config change or deployment is just beyond me.
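A minimal sketch of that watch-and-revert loop, with the metrics query and the deploy tooling stubbed out as hypothetical functions (fetch_error_rate, rollback) and made-up thresholds:

    use std::{thread, time::Duration};

    const BASELINE_ERROR_RATE: f64 = 0.001; // assumed normal error rate
    const SPIKE_MULTIPLIER: f64 = 3.0;      // how far above baseline counts as a spike
    const WATCH_MINUTES: u64 = 30;          // the 15-30 minute watch window

    fn fetch_error_rate() -> f64 {
        // Stand-in: really this would query the pre-built dashboards'
        // metrics backend for the service's error rate over the last minute.
        0.0005
    }

    fn rollback() {
        // Stand-in: really this would redeploy the previous artifact/config.
        eprintln!("rolling back to previous version");
    }

    fn main() {
        for minute in 0..WATCH_MINUTES {
            let rate = fetch_error_rate();
            if rate > BASELINE_ERROR_RATE * SPIKE_MULTIPLIER {
                eprintln!("error rate {rate} spiked at minute {minute}; reverting");
                rollback();
                return;
            }
            thread::sleep(Duration::from_secs(60));
        }
        println!("deploy looks healthy after {WATCH_MINUTES} minutes");
    }

The point is that the spike shows up on the dashboard within the first minute or two, which is why downtime stays short.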
markus_zhang
My guess is that CF has so many external customers that they need to move fast and try not to break things. My hunch is that their culture always favors moving fast. As long as they are not breaking too many things, customers won't leave them.
theideaofcoffee
Same; my time at an F100 ecommerce retailer showed me the same thing. Every change-control-board justification needed an explicit back-out/restoration plan: the exact steps to be taken, what was being monitored to ensure the plan was being held to, contacts for the groups anticipated to be affected, and emergency numbers/rooms for quick conferences if something did happen.
The process was pretty tight; there were almost no revenue-affecting outages that I can remember, because it was such a collaborative effort (even though the board presentation seemed a bit spiky and confrontational at the time, everyone was working together).
prdonahue
And you moved at a glacial pace compared to Cloudflare. There are tradeoffs.
Scaevolus
> Disabling this was done using our global configuration system. This system does not use gradual rollouts but rather propagates changes within seconds to the entire network and is under review following the outage we recently experienced on November 18.
> As soon as the change propagated to our network, code execution in our FL1 proxy reached a bug in our rules module which led to the following LUA exception:
They really need to figure out a way to correlate global configuration changes to the errors they trigger as fast as possible.
> as part of this rollout, we identified an increase in errors in one of our internal tools which we use to test and improve new WAF rules
Warning signs like this are how you know that something might be wrong!
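A toy version of that correlation, assuming config-change events and error-rate samples are available as timestamped lists (the data shapes and thresholds here are made up):

    struct ConfigChange {
        at: u64,          // unix seconds when the change propagated
        id: &'static str, // change identifier
    }

    fn main() {
        let changes = vec![
            ConfigChange { at: 1000, id: "waf-buffer-1mb" },
            ConfigChange { at: 2000, id: "global-killswitch" },
        ];
        // (timestamp, errors-per-second) samples from monitoring
        let errors: Vec<(u64, u64)> = vec![(990, 3), (1005, 4), (2010, 900)];

        let window = 60;           // seconds after a change to inspect
        let spike_threshold = 100; // errors/sec considered anomalous

        for c in &changes {
            let spiked = errors
                .iter()
                .any(|&(t, e)| t >= c.at && t <= c.at + window && e >= spike_threshold);
            if spiked {
                println!("change '{}' is followed by an error spike: revert candidate", c.id);
            }
        }
    }

With propagation happening in seconds globally, even a simple join like this could flag the offending change almost immediately and gate an automatic revert.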
philipwhiuk
> Warning signs like this are how you know that something might be wrong!
Yes; as they explain, it was the rollback triggered by seeing those errors that broke things.
liampulles
The lesson presented by the last few big outages is that entropy is, in fact, inescapable. The comprehensibility of a system cannot keep up with its growing and aging complexity forever. The rate of unknown unknowns will increase.
The good news is that a more decentralized internet, with components scoped to fit in a human brain, is better for innovation, progress, and freedom anyway.
miyuru
What's going on with Cloudflare's software team?
I have seen similar bugs in the Cloudflare API recently as well.
There is an endpoint for a feature that is available only to enterprise users, but the check for whether the user is on an enterprise plan is done at the last step.
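Roughly what the fix looks like, sketched with made-up names: the plan check gates the handler before any other work runs.

    #[derive(PartialEq)]
    enum Plan { Free, Pro, Enterprise }

    fn enterprise_endpoint(plan: Plan) -> Result<&'static str, &'static str> {
        // Check entitlement first, not as the last step:
        if plan != Plan::Enterprise {
            return Err("403: this feature requires an Enterprise plan");
        }
        // ...validation and the actual work happen only past the gate.
        Ok("200: enterprise feature response")
    }

    fn main() {
        println!("{:?}", enterprise_endpoint(Plan::Free));       // Err("403: ...")
        println!("{:?}", enterprise_endpoint(Plan::Enterprise)); // Ok("200: ...")
    }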
dreamcompiler
"Honey we can't go on that vacation after all. In fact we can't ever take a vacation period."
"Why?"
"I've just been transferred to the Cloudflare outage explanation department."
xnorswap
My understanding, paraphrased: "In order to gradually roll out one change, we had to globally push a different configuration change, which broke everything at once".
But a more important takeaway:
> This type of code error is prevented by languages with strong type systems
jsnell
That's a bizarre takeaway for them to suggest, when they had exactly the same kind of bug with Rust like three weeks ago. (In both cases they had code implicitly expecting results to be available. When the results weren't available, they terminated processing of the request with an exception-like mechanism. And then they had the upstream services fail closed, despite the failing requests being to optional sidecars rather than on the critical query path.)
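The fail-open alternative, sketched with a hypothetical scan_body standing in for the optional sidecar call:

    enum Verdict { Clean, Malicious }

    fn scan_body(_body: &[u8]) -> Result<Verdict, String> {
        // Simulate the optional analysis sidecar being unavailable.
        Err("scanner sidecar unavailable".to_string())
    }

    fn handle_request(body: &[u8]) -> &'static str {
        match scan_body(body) {
            Ok(Verdict::Malicious) => "403 blocked",
            Ok(Verdict::Clean) => "200 proxied",
            // Fail open: the scanner is off the critical path, so its
            // failure is logged and the request still flows.
            Err(e) => {
                eprintln!("scanner failed, failing open: {e}");
                "200 proxied (unscanned)"
            }
        }
    }

    fn main() {
        println!("{}", handle_request(b"POST /login ..."));
    }

Whether a WAF should ever fail open is of course itself a security tradeoff, which may be part of why the fail-closed behavior existed in the first place.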
littlestymaar
In fairness, the previous bug (with the Rust unwrap) should never have happened: someone explicitly called the panicking function, the review didn't catch it, and the CI didn't catch it.
It required a significant organizational failure to happen. These things happen, but they ought to be rarer than your average bug (unless your organization is fundamentally malfunctioning, that is).
greatgib
The issue also would not have happened if someone had written the right code and tests, or if the review or CI had caught it...
debugnik
Prevented unless they assert the wrong invariant at runtime like they did last time.
skywhopper
This is the exact same type of error that happened in their Rust code last time. Strong type systems don’t protect you from lazy programming.
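Concretely (lookup_rule is a made-up example): the compiler accepts both versions, and only one of them actually handles the absent case.

    fn lookup_rule(id: u32) -> Option<&'static str> {
        if id == 1 { Some("block sqli") } else { None }
    }

    fn main() {
        // Compiles fine, and panics at runtime the moment the implicit
        // invariant ("the rule always exists") stops holding:
        //
        //     let rule = lookup_rule(42).unwrap();
        //
        // The type system only helps if the None arm is really handled:
        match lookup_rule(42) {
            Some(rule) => println!("applying rule: {rule}"),
            None => eprintln!("rule 42 missing; skipping instead of crashing"),
        }
    }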
rachr
Time for Cloudflare to start using the BOFH excuse generator. https://bofh.d00t.org/
jgalt212
I do kind of like how they are blaming React for this.
rany_
> As part of our ongoing work to protect customers using React against a critical vulnerability, CVE-2025-55182, we started rolling out an increase to our buffer size to 1MB, the default limit allowed by Next.js applications.
Why would increasing the buffer size help with that security vulnerability? Is it just a performance optimization?
redslazer
If the request data is larger than the limit, it doesn't get processed by the Cloudflare system. By increasing the buffer size they process (and therefore protect) more requests.
boxed
I think the buffer size is the limit on what they check for malicious data, so the old 128k limit would mean it would be trivial to circumvent by just sending 128k of OK data and then putting the exploit after.
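A toy illustration of that bypass; the 128k window and the inspection function are assumptions for the sake of the example:

    const SCAN_LIMIT: usize = 128 * 1024; // the old 128k inspection window

    fn inspected_slice(body: &[u8]) -> &[u8] {
        // Only the first SCAN_LIMIT bytes ever reach the scanner.
        &body[..body.len().min(SCAN_LIMIT)]
    }

    fn main() {
        // 128k of innocuous filler, then the payload lands past the window.
        let mut body = vec![b'A'; SCAN_LIMIT];
        body.extend_from_slice(b"...exploit goes here...");

        let scanned = inspected_slice(&body);
        let caught = scanned.windows(7).any(|w| w == b"exploit");
        println!("scanner saw the exploit: {caught}"); // prints: false
    }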
gkoz
I sometimes feel we'd be better off without all the paternalistic kitchen-sink features. The solid, properly engineered features used intentionally aren't the ones causing these outages.
ilkkao
Agreed, I don't really like Cloudflare trying to magically fix every web exploit there is in frameworks my site has never used.
flaminHotSpeedo
What's the culture like at Cloudflare re: ops/deployment safety?
They saw errors related to a deployment, and because it was related to a security issue, instead of rolling it back they decided to make another deployment with global blast radius?
Not only did they fail to apply the deployment-safety 101 lesson of "when in doubt, roll back", but they also failed to assess the risk related to the same deployment system that caused their 11/18 outage.
Pure speculation, but to me that sounds like there's more to the story; this sounds like the sort of cowboy decision a team makes when they've either already broken all the rules or weren't following them in the first place.