
Cloudflare outage should not have happened

vessenes

"If they had a perfectly normalized database, no NULLing and formally verified code, this bug would not have happened."

That may be. What's not specified there is the immense, immense cost of driving a dev org on those terms. It limits, radically, the percent of engineers you can hire (to those who understand this and are willing to work this way), and it slows deployment radically.

Cloudflare may well need to transition to this sort of engineering culture, but there is no doubt that they would not be in the position they are in if they had started with this culture -- they would have been too slow to capture the market.

I think critiques that have actionable plans for real dev teams are likely to be more useful than what, to me, reads as a sort of complaint from an ivory tower. Culture matters, shipping speed matters, quality matters, team DNA matters. That's what makes this stuff hard (and interesting!)

dfabulich

That's entirely right. Products have to transition from fast-moving exploration to boring infrastructure. We have different goals and expectations for an ecommerce web app vs. a database, or a database vs. the software controlling an insulin pump.

Having said that, at this point, Cloudflare's core DDoS-protection proxy should now be built more like an insulin pump than like a web app. This thing needs to never go down worldwide, much more than it needs to ship a new feature fast.

jacquesm

Precisely. This is key infrastructure we're talking about, not some kind of webshop.

simlevesque

Yeah, but the anti-DDoS feature needs to react to new attack methods all the time; it's not a static thing you build once and it works forever.

An insulin pump is very different. Your human body, insulin, and physics aren't changing any time soon.

Aperocky

> This thing needs to never go down worldwide

Quantity introduces a quality all of its own in terms of maintenance.

necovek

This bug might not have, but others would. Formal verification methods still rely on humans to input the formal specification, which is where problems happen.

As others point out, if they didn't really ship fast, they certainly would not have become profitable, and they would definitely not have captured the market to the extent they have.

But really, if the market were more distributed and Cloudflare, as the biggest player, commanded 5% of the web, any single outage would have been limited in impact. So it's also about market behaviour: "nobody ever got fired for choosing IBM", as the saying went 40 years ago.

bambax

But does "formally verified code" really go in the same bag as "normalized database" and ensuring data integrity at the database level? The former is immensely complex and difficult; the other two are more like sound engineering principles?

mosura

Software people, especially those coming to this through Rust, are falling into the old trap of believing that if code is bug-free it is reliable. It isn't, because there is a world of faults outside the code, including but not limited to the developer's intentions.

This inverts everything, because structuring a system to be tolerant of the right faults changes what counts as a good idea almost entirely.

ViewTrick1002

Rust generally forces you to acknowledge these faults. The problem is managing them in a sane way, which for Rust in many cases simply means failing loudly.

Compare that to many other languages, which prefer chugging along and hoping that no downstream corruption happens.
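
A minimal sketch of that difference, using a made-up config value rather than anything from Cloudflare's code: the compiler forces you to deal with the Result, and the cheapest way out is to fail loudly.

    fn main() {
        let raw = "not-a-number";

        // Rust makes the failure impossible to ignore: parse() returns a Result.
        // The one-liner escape hatch is to fail loudly, which is what unwrap() does:
        // let limit: usize = raw.parse().unwrap(); // panics on bad input

        // The alternative is deciding, explicitly, what a bad value means here:
        let limit: usize = match raw.parse() {
            Ok(n) => n,
            Err(e) => {
                eprintln!("bad limit {raw:?}: {e}; keeping a safe default");
                200 // illustrative fallback, not a real Cloudflare setting
            }
        };
        println!("limit = {limit}");
    }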

jacquesm

When you're powering this large a fraction of the internet, is it even an option not to work like that? You'd think that with that kind of market cap, resource constraints should no longer be holding you back from doing things properly.

frumplestlatz

I work in formal verification at a FAANG.

It is so wildly more expensive than traditional development that it is simply not feasible to apply it anywhere but absolutely the most critical paths, and even then, the properties asserted by formal verification are often quite a bit less powerful than necessary to truly guarantee something useful.

I want formal verification everywhere. I believe in provable correctness. I wish we could hire people capable of always writing software to that standard and maintaining those proofs alongside their work.

We really can’t, though. It's a frustrating reality of being human — we know how to do it better, but nearly all of even the smartest engineers we can hire are not smart enough.

dpark

> we know how to do it better, but nearly all of even the smartest engineers we can hire are not smart enough.

This seems like a contradiction. If the smartest engineers you can hire are not smart enough to work within formal verification constraints then we in fact do not know how to do this.

If formal verification hinges on having perfect engineers then it’s useless because perfect engineers wouldn’t need formal verification.

jacquesm

The big trick is - as far as I understand it - to acknowledge that systems fail and to engineer for dealing with those failures.

I support your efforts downthread to at least know whether or not underlying abstractions are able to generate a panic (which is a massive side effect) or are only able to return valid results or error flags. The higher level the abstraction, the bigger the chance that there is a module somewhere in the stack that is able to blow it all up; at the highest level you can pretty much take it as read that this is the case.

So unless you engineer the whole thing from the ground up without any library modules, it is impossible to guarantee that this is not the case. As far as I understand your argument, you at least want to be informed when it is the case, or, alternatively, to have the compiler flag the situation as incompatible with the guarantees you are asking for. Is that a correct reading?

jacquesm

Ok, let's start off with holding them to the same standards as avionics software development. The formal verification can wait.

ottah

I would argue the largest CDN provider in the world is a critical path.

lenkite

We need modern programming languages with formal verification built in; it should be applicable to specially demarcated functions/modules. It is a headache to write TLA+ and keep the independent spec up to date with the production code.
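
Tooling in that direction does exist outside the language proper. As a rough sketch of the co-located style, assuming the Kani model checker (not mentioned in the article), the proof harness can live next to the code instead of in a separate spec:

    // Ordinary production function.
    fn clamp_feature_count(n: u32, max: u32) -> u32 {
        if n > max { max } else { n }
    }

    // Proof harness checked by `cargo kani`; compiled out of normal builds.
    #[cfg(kani)]
    #[kani::proof]
    fn clamp_never_exceeds_max() {
        let n: u32 = kani::any();
        let max: u32 = kani::any();
        assert!(clamp_feature_count(n, max) <= max);
    }

    fn main() {
        println!("{}", clamp_feature_count(250, 200));
    }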

FloorEgg

I agree with you.

I would just add that I've noticed organizations tend to calcify as they get bigger and older. Kind of like trees, they start out as flexible saplings, and over time develop hard trunks and branches. The rigidity gives them stability.

You're right that there's no way they could have gotten to where they are if they had prioritized data integrity and formal verification in all their practices. Now that they have so much market share, they might collapse under their own weight if their trunk isn't solid. Maybe investing in data integrity and strongly typed, functional programming that's formally verifiable is what will help them keep their market share.

Cultures are hard to change, and I'm not suggesting an expectation for them to change beyond what is feasible or practical. I don't lead an engineering organization like that, so I'm definitely armchairing here. I just see the logic of the argument that adopting some of these methods would probably benefit everyone using their services.

ihaveajob

Thank you for putting this in such clear terms. It really is a Catch-22 problem for startups. Most of the time, you can't reach scale unless you cut some corners along the way, and when you reach scale, you benefit from NOT cutting those corners.

swiftcoder

> It limits, radically, the percent of engineers you can hire (to those who understand this and are willing to work this way), and it slows deployment radically.

We could also invest in tooling to make this kind of thing easier. Unclear why humans need to hand-normalise the database schema - isn't this exactly the kind of thing compilers are good at?

locknitpicker

This sort of Monday morning quarterbacking is pointless and only serves as a way for random bloggers to try to grab credit without actually doing or creating any value.

nmoura

I disagree. I learnt good stuff from this article and it’s enough.

locknitpicker

> I disagree. I learnt good stuff from this article and it’s enough.

That's perfectly fine. It's also beside the point, though. You can learn without reading random people online cynically shit-talking others as a self-promotion strategy. This is junior dev energy manifesting a junior-level understanding of the whole problem domain.

There's not a lot to learn from claims that boil down to "don't have bugs".

engineeringwoke

I laughed out loud when he said Cloudflare should have formally verified its systems.

rvnx

It's very similar to LinkedIn posts, where everybody seems to know better than the people actually running the platforms.

DrSusanCalvin

This article actually explains how this bug in particular could have been avoided. Sure, you may not consider his approach realistic, but it's not at all saying "don't have bugs". In fact, not having formal verification or similar tooling in place would be more like saying "just don't write buggy code".

galleywest200

> You can learn without reading random people online

Somebody has to write something in the first place for one to learn from it, even if the writing is disagreeable.

alpinisme

Not commenting on the quality of this post but occasional writing that responds to an event provides a good opportunity to share thoughts that wouldn’t otherwise reach an audience. If you post advice without a concrete scenario you’re responding to, it’s both less tangible for your audience and less likely to find an audience when it’s easier to shrug off (or put off).

oivey

What did you learn? The suggestions in the post seem pretty shallow and non-actionable.

udev4096

Backdooring the internet is certainly a productive venture!

echelon

Like your comment? j/k :)

I'm using this incident to draw attention to Rust's panic behavior.

Rust could use additional language features to help us write mostly panic-free* code and statically catch even transitive dependencies that might subject us to unnecessary panics.

We've been talking about it on our team and to other Rust folks, and I think it's worth building a proposal around. Rust should have a way to statically guarantee this never happens. Opt-in at first, but eventually the default.

* with the exception of malloc failures, etc.

tracker1

It's already in the box... there are a bunch of options, from unwrap_or, etc., to actually checking the error result and dealing with it cleanly... that's not what happened.

Not to mention the possibility of just bubbling the error up through Result<> chaining with an app-specific error model. The author chose neither... likely because they want the app to crash/reload from an external service. This is often the best approach to an indeterminate or unusable state/configuration.
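
A minimal sketch of those two options, with made-up names rather than the actual proxy code:

    use std::num::ParseIntError;

    // Option 1: provide a fallback instead of panicking.
    fn limit_or_default(raw: &str) -> u32 {
        raw.parse().unwrap_or(200)
    }

    // Option 2: bubble the error up through an app-specific error type.
    #[derive(Debug)]
    enum ConfigError {
        BadLimit(ParseIntError),
    }

    fn limit(raw: &str) -> Result<u32, ConfigError> {
        let n = raw.parse().map_err(ConfigError::BadLimit)?;
        Ok(n)
    }

    fn main() {
        assert_eq!(limit_or_default("oops"), 200);
        assert!(limit("oops").is_err()); // the caller decides what failure means
    }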

echelon

> This is often the best approach to an indeterminate or unusable state/configuration.

The engineers had more semantic tools at their disposal for this than a bare `unwrap()`.

This was a systems failure. A better set of tools in Rust would have helped mitigate some of the blow.

`unwrap()` is from pre-1.0 Rust, before many of the type system-enabled error safety features existed. And certainly before many of the idiomatic syntactic sugars were put into place.

I posted in another thread that Rust should grow annotation features to allow us to statically eliminate or minimize panic behavior in our codebases. Outside of malloc failures, we should be able to constrain or remove large classes of them with something like this:

    panic fn my_panicky_function() {
        None.unwrap(); // NB: `unwrap()` is also marked `panic` in stdlib
    }

    fn my_safe_function() {
        // With a certain compiler or Cargo flag, this would fail to compile,
        // because my_safe_function isn't annotated as `panic`.
        my_panicky_function()
    }
Obviously just an idea, but something like this would be nice. We should be able to do more than just linting, and we should have tools that guarantee transitive dependencies can't blow off our feet with panic shotguns.

In any case, until something is done, this is not the last time we'll hear unwrap() horror stories.

1a527dd5

Unless you work at Cloudflare or have worked at Cloudflare I'm not sure opinions like this help.

You don't know the context, you don't know _anything_ except for what Cloudflare chooses to share.

There are very few companies who deal with the kind of load that Cloudflare does; I dread to think what weird edge cases they've run into because of their sheer scale.

IshKebab

Casually suggesting formally verifying the software too.

hodgesrm

> I base my paragraph on their choice of abandoning PostgreSQL and adopting ClickHouse(Bocharov 2018). The whole post is a great overview on trying to process data fast, without a single line on how to garantee its logical correctness/consistency in the face of changes.

I'm completely mystified how the author concludes that the switch from PostgreSQL to ClickHouse shows the root of this problem.

1. If the point is that PostgreSQL is somehow less prone to error, it's not in this case. You can make the same mistake if you leave off the table_schema in information_schema.columns queries (see the sketch after this list).

2. If the point is that Cloudflare should have somehow discovered this error through normalization and/or formal methods, perhaps he could demonstrate exactly how this would have (a) worked, (b) been less costly than finding and fixing the query through a better review process or testing, and (c) avoided generating other errors as a side effect.
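
To make point 1 concrete, a sketch of the unscoped versus scoped catalog query; the query text is illustrative, not Cloudflare's actual code:

    fn main() {
        // Unscoped: matches the table name in every database the account can see,
        // so an extra copy of the table doubles the rows returned.
        let unscoped = "SELECT name, type FROM system.columns \
                        WHERE table = 'http_requests_features'";

        // Scoped: the same query pinned to one database. The equivalent fix in
        // PostgreSQL is filtering information_schema.columns on table_schema.
        let scoped = "SELECT name, type FROM system.columns \
                      WHERE database = 'default' \
                        AND table = 'http_requests_features'";

        println!("{unscoped}\n{scoped}");
    }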

I'm particularly mystified how lack of normalization is at fault. ClickHouse system.columns is normalized. And if you normalized the query result to remove duplicates, that would just result in other kinds of bugs, as in 2(c) above.

Edit: fix typo

cmckn

I agree it should not have happened, but I don’t agree that the database schema is the core problem. The “logical single point of failure” here was created by the rapid, global deployment process. If you don’t want to take down all of prod, you can’t update all of prod at the same time. Gradual deployments are a more reliable defense against bugs than careful programming.

yodon

>Gradual deployments are a more reliable defense against bugs than careful programming

The challenge, as I understand it, is that the feature in question had an explicit requirement of fast, wide deployment because of the need to react in real time to changing external attacker behaviors.

packetslave

yep, and it was this exact requirement that also caused the exact same outage back in 2013 or so. DDoS rules were pushed to the GFE (edge proxy) every 15 seconds, and a bad release got out. Every single GFE worldwide crashed within 15 seconds. That outage is in the SRE book.

cmckn

Yeah, I don’t know how fast “fast” needs to be in this system; but my understanding is this particular failure would have been seen immediately on the first replica. The progression could still be aggressive after verifying the first wave.

btown

One of the things I find fascinating about this is that we don't blink twice about the idea that an update to a "hot" cache entry that's "just data" should propagate rapidly across caches... but we do have change management and gradual deployments for code updates and meaningful configuration changes.

Machine learning feature updates live somewhere in the middle. Large amounts of data, a need for unsupervised deployment that can react in seconds, somewhat opaque. But incredibly impactful if something bad rolls out.

I do agree with the OP that the remediation steps in https://blog.cloudflare.com/18-november-2025-outage/#remedia... seem undercooked. But I'd focus on something entirely different than trying to verify the creation of configuration files. There should be real attention to: "how can we take blue/green approaches to allowing our system to revert to old ML feature data and other autogenerated local caches, self-healing the same way we would when rolling out code updates?"

Of course, this has some risk in Cloudflare's context, because attackers may very well be overjoyed by a slower rollout of ML features that are used to detect their DDoS attacks (or a rollout that they can trigger to rollback by crafting DDoS attacks).

But I very much hope they find a happy medium. This won't be the last time that a behavior-modifying configuration file gets corrupted. And formal verification, as espoused by the OP, doesn't help if the problem is due to a bad business assumption, encoded in a verified way.

wnevets

> Gradual deployments are a more reliable defense against bugs than careful programming.

Wasn't this one of the key takeaways from the crowdstrike outage?

tptacek

Cloudflare doesn't seem to have called it a "Root Cause Analysis" and, in fact, the term "root cause" doesn't appear to occur in Prince's report. I bring this up because there's a school of thought that says "root cause analysis" is counterproductive: complex systems are always balanced on the precipice of multicausal failure.

Analemma_

When I was at AWS, when we did postmortems on incidents we called it "root cause analysis", but it was understood by everyone that most incidents are multicausal and the actual analyses always ended up being fishbone diagrams.

Probably there are some teams which don't do this and really do treat RCA as trying to find a sole root cause, but I think a lot of "getting mad at RCA" is bikeshedding the terminology and has nothing to do with the actual practice.

tptacek

Right, I'm not a semantic zealot on this point, but the post we're commenting on really does suggest that the Cloudflare incident had a root cause in basic database management failures, which is the substantive issue the root-cause-haters have with the term.

cyberax

> to find a sole root cause

"Six billion years ago the dust around the young Sun coalesced into planets"

luhn

"Workaround: If we wait long enough, the earth will eventually be consumed by the sun."

https://xkcd.com/1822/

parados

True, and I agree, but from their report they do seem to be doing Root Cause Analysis (RCA) even if they don't call it that.

RCA is a really bad way of investigating a failure. Simply put: if you show me your RCA, I know exactly where you couldn't be bothered to look any further.

I think most software engineers using RCA confuse the "cause" ("Why did this happen") with the solution ("We have changed this line of code and it's fixed"). These are quite different problem domains.

Using RCA to determine "Why did this happen" is only useful for explaining the last stages of an accident. It focuses on cause->effect relationships and tells a relatively simple story, one that is easy to communicate - hi there, managers and media! But RCA only encourages simple countermeasures, which will probably be ineffective and easily outrun by the complexity of real systems.

However, one thing RCA is really good at is allocating blame. If your organisation is using RCA then, whatever you pretend, your organisation has a culture of blame. With a blame culture (rather than a reporting culture) your organisation is much more likely to fail again. You will lack operational resilience.

PunchyHamster

then rename it to "root causes analysis"

nine_k

* The unwrap() in production code should never have passed code review. Damn, it should have been flagged by a linter (see the lint sketch after this list).

* The deployment should have followed the blue/green pattern, limiting the blast radius of a bad change to a subset of nodes.

* In general, a company so much at the foundational level of internet connectivity should not follow the "move fast, break things" pattern. They did not have an overwhelming reason to hurry and take risks. This has burned a lot of trust, no matter the nature of the actual bug.
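
For the first bullet, the linter part is already expressible today; a minimal sketch, assuming Clippy's restriction lints (the fallback value is illustrative):

    // At the crate root: under `cargo clippy`, every unwrap()/expect() becomes an
    // error, forcing an explicit decision at each call site.
    #![deny(clippy::unwrap_used, clippy::expect_used)]

    fn main() {
        let raw = "not-a-number";
        // let n: u32 = raw.parse().unwrap();         // rejected by the lint
        let n: u32 = raw.parse().unwrap_or_default(); // explicit fallback instead
        println!("{n}");
    }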

kalkin

Unless you work at Cloudflare it seems very unlikely that you have enough information about systems and tradeoffs there to make these flat assertions about what "should have" happened. Systems can do worse things than crashing in response to unexpected states. Blue/green deployment isn't always possible (eg due to constrained compute resources) or practical (perhaps requiring greatly increased complexity), and is by no means the only approach to reducing deploy risk. We don't know that any of the related code was shipped with a "move fast, break things" mindset; the most careful developers still write bugs.

Actually learning from incidents and making systems more reliable requires curiosity and a willingness to start with questions rather than mechanically applying patterns. This is standard systems-safety stuff. The sort of false confidence involved in making prescriptions from afar suggests a mindset I don't want anywhere near the operation of anything critical.

nine_k

Indeed, I never worked at Cloudflare. Still I have some nebulous idea about Cloudflare, and especially their scale.

Systems can do worse things than crashing in response to unexpected states, but they can also do better: report them and terminate gracefully. Especially if the code runs on so many nodes, and the crash renders them unresponsive.

Blue/green deployment isn't always possible, but my imagination is a bit weak, and I cannot suggest a way to synchronously update so many nodes literally all over the internet. A blue/green deployment happens in large distributed systems willy-nilly; it's better when it happens in a controlled way, and the safety of a change that affects basically the entire fleet is tested under real load before applying it everywhere.

I do not even assume that any of Cloudflare's code was ever shipped with the "move fast, break things" mindset; I only posit that such a mindset is not optimal for a company in Cloudflare's position. Their motto might rather be "move smooth, never break anything"; I suppose that most of their customers value their stability higher than their speed of releasing features, or whatnot.

Starting with questions is the right way, I agree. My first question: why would calling unwrap() ever be a good idea in production code, especially in config-loading code, which, to my mind, should be resilient and ready to handle variations in the config data gracefully? Certain mechanical patterns, like "don't hit your finger with a hammer", are best applied universally by default, with the rare exceptional cases carefully documented and explained, not the other way around.

whazor

The scale of the outage was so big and global, that the biggest failure was indeed the blast radius.

hnthrowaway0328

I wish the trust they burned would show up in their financial reports. Otherwise it is like "we do not like it, but we're going to use it anyway".

spwa4

* The step in front of this query created updates to policies. It should have been limited in the number of changes it would make at once (and ideally per hour, per day, and so on), and if it goes over that limit: stop updating, alert, and wait until explicitly unblocked. DO NOT generate invalid config and start using it; use the previous config that worked, and alert (a sketch follows below).

If this happens during startup use a default one.

That would still create impact (customers and developers would not see updates propagate), but would avoid destroying the service. When it comes to outages, people need to learn to go over what happens in the case of violating an invariant and look at what gets sacrificed in those cases, to make sure the answer isn't "the whole service".
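
A minimal sketch of that shape, with made-up types and thresholds (nothing here is Cloudflare's actual design):

    #[derive(Clone)]
    struct Config {
        rules: Vec<String>,
    }

    const MAX_CHANGED_RULES: usize = 50; // illustrative change budget

    /// Accept the candidate only if it is non-empty and stays within the change
    /// budget; otherwise alert and keep serving the last known-good config.
    fn next_config(current: &Config, candidate: Config) -> Config {
        let changed = candidate
            .rules
            .iter()
            .filter(|r| !current.rules.contains(*r))
            .count();

        if candidate.rules.is_empty() || changed > MAX_CHANGED_RULES {
            eprintln!("config rejected ({changed} changed rules); keeping previous");
            return current.clone(); // degrade the update path, not the service
        }
        candidate
    }

    fn main() {
        let current = Config { rules: vec!["allow:a".into()] };
        let corrupted = Config { rules: vec![] }; // e.g. a bad feature file
        let next = next_config(&current, corrupted);
        assert_eq!(next.rules, current.rules); // still serving the old config
    }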

If I get to be impolite, you do this because software architects, as seems to be the case here, often choose "crash and destroy the service" when their invariants are violated instead of "stop doing shit and alert" when faced with an unknown problem, or a problem they can't deal with.

This also requires test-crashing. You introduce an assert? Great! The more the merrier; seriously, you should have lots of them. BUT you should also include a test that the world doesn't end when your assert is hit.

zahlman

> the blue/green pattern

?

echelon

unwrap() and the family of methods like it are a Rust anti-pattern from the early days of the language. They date back to before many of the modern error-handling and safety-conscious features of the language and type system.

Rust is being pulled in many different directions by new users, directions the language perhaps never originally anticipated. Some engineers will be fine with panicky behavior, but a lot of others want to be able to statically guarantee that most panics (outside of perhaps memory allocation failures) cannot occur.

We need more than just a linter on this. A new language feature that poisons, marks, or annotates methods that can potentially panic (for reasons other than allocation) would be amazing. If you then call a method that can panic, you'll have to mark your own method as potentially panicky. The ideal future would be that in time, as more standard library and 3rd party library code adopts this, we can then statically assert our code cannot possibly panic.

As it stands, I'm pretty mortified that some transitive dependency might use unwrap() deep in its internals.

this_user

Of course it shouldn't have happened. But if you run infrastructure as complex as this, on the scale that they do, and with the agility that they need, then it was bound to happen eventually. No matter how good you are, there is always some extremely unlikely chain of events that will lead to a catastrophic outage. Given enough time, that chain will eventually happen.

hvb2

> A central database query didn’t have the right constraints to express business rules. Not only it missed the database name, but it clearly needs a distinct and a limit, since these seem to be crucial business rules.

In a database, you wouldn't solve this with a distinct or a limit? You would make the schema guarantee uniqueness?

And yes, that wouldn't deal with cross-database queries. But the solution for that is just filtering by db name; the rest is table design.
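
A sketch of that distinction, with illustrative DDL embedded as a string (not the actual schema):

    fn main() {
        // With uniqueness guaranteed by the schema, the reader needs no DISTINCT
        // or LIMIT to stay within the expected row count; the remaining fix is
        // simply scoping the query to one database.
        let ddl = "CREATE TABLE feature_columns (
                       database_name TEXT NOT NULL,
                       table_name    TEXT NOT NULL,
                       column_name   TEXT NOT NULL,
                       UNIQUE (database_name, table_name, column_name)
                   )";
        println!("{ddl}");
    }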

9cb14c1ec0

As an aside, I find it really interesting how Cloudflare has morphed from CDN/DDOS protection into a services conglomerate that many startups could use for every compute need they have.

jrm4

Nothing in this thread about "this should not have happened because Cloudflare is too centralized?"

We have far better ideas and working prototypes for preventing this from happening again than to be up here trying to "fix Cloudflare."

Think bigger, y'all.

kjuulh

It did happen, and Cloudflare should learn from it, but not just from the technical reasons.

Instead of focusing on the technical reasons why, they should answer how such a change bubbled out to cause such a massive impact.

Why: The proxy failed requests.

Why: Handlers crashed because of OOM.

Why: ClickHouse returned too much data.

Why: A change was introduced that doubled the amount of data.

Why: A central change was rolled out immediately to all clusters (single point of failure).

Why: There are exemptions from the standard operating procedures (gates) for releasing changes to the hot path of Cloudflare's network infra.

While the ClickHouse change is important, I personally think it is crucial that Cloudflare tackles the processes, and possibly gates / controls rollouts for hot-path systems no matter what kind of change it is; at their scale it should be possible. But that is probably enough co-driving. To me it seems like a process issue more than a technical one.

lysace

Very quick rollout is crucial for this kind of service. On top of what you wrote, institutionalizing rollback by default if something catastrophically breaks should be the norm.

Been there in those calls, begging the people in charge (who perhaps shouldn't have been), "eh, maybe we should attempt a rollback to the last known good state? cause, it, you know.... worked". But investigating further before making any change always seems to be the preferred action for these people. Can't be faulted for being cautious and doing things properly, right? I kid you not - this is their instinct.

If I recall correctly, it took CF two hours to roll back the broken changes.

So if I were in charge of Cloudflare (4-5k employees), I'd look at both the processes and the people in charge.