Bit flips: How cosmic rays grounded a fleet of aircraft
33 comments
December 8, 2025
chris_va
actionfromafar
That's very P. K. Dick. Or maybe more Heinlein.
RankingMember
It's important to note that this is just Airbus's best guess as to the cause, as there's no smoking gun: they simply exhausted their troubleshooting and were left scratching their heads so this was the "least unlikely" cause they could come up with given the circumstances.
DecentShoes
Just like that Mario 64 speedrunner! People say it like it's gospel, but it's really just a bunch of people's best guess. No proof.
RealityVoid
I thought the same, but after a deeper dive into the postmortem, I don't think it's a cop-out on their side. The report is actually really well done (I personally was impressed). The reason it probably was a bit flip is that the CPU did not have EDAC in this instance, so bit flips are expected. The consensus mechanism failed in this case and that is what they are updating, because even though the module gave wrong data, presumably because of bit flips, the consensus should have prevented the dive.
serial_dev
…but if I respond with this to a user’s bug report, I’m “not taking this seriously”…
avazhi
“The increasing reliance of computers in fly-by-wire systems in aircraft, which use electronics rather than mechanical systems to control the plane in the air, also mean the risk posed by bit flips when they do occur is higher.”
Bit of an understatement. I don’t think there are any active passenger airliners in the first world today that aren’t fly-by-wire. The MD-80 was the last of its kind and it’s been out of passenger operation for what, 10 years now?
Stevvo
Any Boeing other than 777/787 does not use fly-by-wire.
However, that doesn't eliminate the possibility of these errors. Whilst the flight controls are mechanically linked, the autopilot/trim is electric, so it is still susceptible to bit flips.
drob518
Still a lot of software involved in controlling the aircraft. The 737 Max incidents were eventually tracked to software quality issues, IIRC. All those old designs are being upgraded with modern avionics, so even if the airframe and linkages are old-school, the inputs are being driven by digital computers. At least that’s my understanding. I confess to not being a “plane guy,” though I have spent a lot of time traveling in planes, and I have stayed at a few Holiday Inn Express hotels.
SoftTalker
Boeing 717 is still in service and it's essentially an MD-80. Many 737s are in service and flight controls are hydraulic-boosted cable-and-pulley operated; the type design dates to the 1960s.
BurningFrog
Don't passenger aircraft have redundant systems, so if one computer flips, the backup takes over?
RealityVoid
Not to mention, the system affected by the bit flips was designed in the 90's AND newer systems have EDAC, so they are not susceptible to the same kind of issue. Honestly, if you look into the thing, the press coverage of the event is atrocious.
neko_ranger
I swear to god I've been got by cosmic rays modifying a bit before when my boot order changes for random reasons
SwiftyBug
I thought planes had insane redundancy exactly so stuff like that doesn't happen. How can a bit flip cause the system that controls altitude to malfunction like that?
procflora
From what I've heard (FWIW), Airbus released a version of the software for one of the flight computers that removed SEU protections (hence grounding affected models until they could be downgraded to the previous version).
There was still hardware redundancy though. Operation of the plane's elevator switched to a secondary computer. Presumably it was also running the same vulnerable software, but they diverted and landed early in part to minimize this risk.
So not just redundancy but layers of redundancy.
willis936
Why would you ever expect one bit flip? You have a flip rate and you design your system to tolerate a certain bit flip rate. Assumptions made during requirements establishment were wrong and nature eventually let them know they had negative margin.
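To put rough arithmetic behind "design for a flip rate": a minimal sketch, assuming upsets arrive as a Poisson process and that redundant channels fail independently. Both are simplifying assumptions, and the function names and any rates you plug in are hypothetical, not figures from the Airbus report.

```python
import math


def p_at_least_one_upset(rate_per_hour: float, hours: float) -> float:
    """Probability of one or more upsets in `hours`, assuming Poisson arrivals."""
    # 1 - exp(-rate * t), computed with expm1 for accuracy at small rates.
    return -math.expm1(-rate_per_hour * hours)


def p_two_of_three_upset(p_single: float) -> float:
    """Probability that at least 2 of 3 independent channels are upset
    in the same window, which is what defeats a simple 2-of-3 vote."""
    return 3 * p_single**2 * (1 - p_single) + p_single**3
```

The point of the second function is that redundancy squares a small probability rather than zeroing it, so the margin depends entirely on the rate you assumed in the first place.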
p_l
The possibility of bit flips from cosmic radiation only really came to the fore in the 1990s, and some aircraft and parts predate that.
bdangubic
if (cosmic_ray) {
    do_not_flip_bits()
} else {
    flip_away()
}
rjp0008
What if in the time between initialization of cosmic_ray to False, and the time this if statement executes, a legitimate cosmic ray flips the bool bit representing cosmic_ray?
sunrunner
This is a really good point and a common error in bit flip detection code. To avoid this kind of look-before-you-leap hazard the following is recommended:
try {
    do_action()
} catch (BitFlipError e) {
    logger.critical("Shouldn't get here")
}
Ask-for-forgiveness as an error detection pattern avoids these kinds of errors entirely.
wavemode
ah, a classic TORTOF bug (time-of-ray, time-of-flip)
preommr
I had no idea this was a real thing - I always thought that xkcd comic[0] was just a random joke.
nomel
It's literally one of the reasons ECC RAM exists.
aruametello
To dial up the weirdness, solar flare activity sometimes spikes (https://www.spaceweatherlive.com/en/solar-activity/solar-fla...) and these spikes have a mild relationship with the odds of getting "bitflips" in that timeframe.
We had "historically bad solar weather" a bunch of years ago, and I told a cyber cafe operator that "you could have more computers bluescreen this week than usual".
To me it got really weird when he said later that he really did, but honestly it's 50/50 that it could have been just incidental.
On another note, there are some "rather intense" discussions when someone speedrunning a game gets an "unreproducible glitch" in their favor. Some claim it's a flaw from ageing DRAM hardware, but some always point out that it could be a cosmic ray bit-flipping the right bit. (https://tildes.net/~games/1eqq/the_biggest_myth_in_speedrunn...)
MarkusQ
This is silly. Rapidly refreshing the data that was (presumably) flipped by a cosmic ray last time won't do anything to prevent an error in whatever it hits next time. Unless the theory is that cosmic rays are somehow more likely to hit these particular bits compared to all the millions (billions?) of others in the system...in which case I have a different objection.
RealityVoid
What is silly is media coverage of this. The error was in the ADIRU. They are updating the ELAC. The ELAC takes the decision based on multiple data streams from 3 ADIRU units and the issue being fixed is that it took the wrong decision. The ADIRU will probably continue having SEU but it will be fine.
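For anyone wondering what "takes the decision based on multiple data streams" can look like, here is a minimal sketch of mid-value selection across three redundant readings. This is a generic textbook voter, not Airbus's ELAC logic; the function name, the tolerance parameter, and the sample numbers are all made up for illustration.

```python
from statistics import median


def vote_with_outlier_flag(readings, tolerance):
    """Return the median of the redundant readings plus the indices of any
    sources that disagree with it by more than `tolerance`."""
    voted = median(readings)
    outliers = [i for i, r in enumerate(readings) if abs(r - voted) > tolerance]
    return voted, outliers


# One source spikes (as a corrupted value might); the vote ignores it.
value, bad_sources = vote_with_outlier_flag([2.1, 2.0, 57.3], tolerance=1.0)
print(value, bad_sources)  # 2.1 [2]
```

With three sources a single wild value never becomes the output, which is why one unit having upsets can be tolerable as long as the voting side handles it correctly.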
AlotOfReading
Not all circuits are equally sensitive. The parts that are known to be sensitive or critical are protected by redundancies and error checking, which are probabilistic protection. You haven't completely eliminated the possibility of corruption, just made it incredibly unlikely. Refreshing your inputs is another form of probabilistic protection focused on mitigating the consequences.
MarkusQ
Why not ECC though? Unless this is a latched output of a robust system being held for use by another robust system I guess?
AlotOfReading
ECC is one of the probabilistic protections I was talking about.
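To make "probabilistic" concrete, a minimal sketch of a Hamming(7,4) code, the classic single-error-correcting scheme. Real ECC memory uses wider codes (typically with double-error detection as well), so this is only an illustration; the function names are mine.

```python
def hamming74_encode(data):
    """Encode 4 data bits into 7 bits laid out as p1 p2 d1 p4 d2 d3 d4."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(code):
    """Correct at most one flipped bit; return (data bits, flipped position or 0)."""
    b = list(code)
    s1 = b[0] ^ b[2] ^ b[4] ^ b[6]
    s2 = b[1] ^ b[2] ^ b[5] ^ b[6]
    s4 = b[3] ^ b[4] ^ b[5] ^ b[6]
    pos = s1 + 2 * s2 + 4 * s4  # syndrome = 1-based index of the bad bit
    if pos:
        b[pos - 1] ^= 1
    return [b[2], b[4], b[5], b[6]], pos


word = hamming74_encode([1, 0, 1, 1])
word[4] ^= 1  # simulate a single upset
print(hamming74_decode(word))  # ([1, 0, 1, 1], 5)
```

One flip is always repaired, but two flips in the same word get silently "corrected" to the wrong data, so the protection only makes corruption unlikely, not impossible.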
jessriedel
I thought some combination of error correction and redundant systems was already widespread in airplanes to prevent cosmic-ray induced errors. (GPT agrees.) What am I missing? I've read multiple articles on this, and none of them address the fact that the problem, at the level of detail described in the article, should have been prevented by technology available and widely deployed for decades.
RealityVoid
You're missing that the systems were designed in the 90's and had no EDAC on them, relying instead on redundancy and a consensus system. The fact that bit flips happened is not why they grounded the things and updated the software; they grounded them to address the consensus algorithm in the other CPU, the one that did not get the bit flips.
pengaru
> GPT agrees
What do you think this adds? These things are sycophantic, confident idiots; they will agree, and then agree they're incorrect at the slightest challenge in the same interaction.
I highly recommend finding a cloud chamber (various science museums have them) to visualize just how much radiation is flying around.
Part of my work touches high power switches. I am going to do a bad job relating this story, but one of the power engineers was talking about how electric train switches in EU (Switzerland?) were having triggering issues. These were big MW scale IGBTs, not something you want to false trigger. Anyway, they eventually traced the problem to cosmic rays, and just turned the entire package vertical so the die was end-on to space (the mountains around were shielding the horizontal direction), and the problem went away.