
War story: the hardest bug I ever debugged

aetimmes

(disclaimer: I know OP IRL.)

I'm seeing a lot of comments saying "only 2 days? must not have been that bad of a bug". Some thoughts here:

At my current day job, our postmortem template asks "Where did we get lucky?" In this instance, the author definitely got lucky that they were working at Google where 1) there were enough users to generate this Heisenbug consistently and 2) that they had direct access to Chrome devs.

Additionally - the author (and his team) triaged, root caused and remediated a JS compiler bug in 2 days. The sheer amount of complexity involved in trying to narrow down where in the browser code this could all be going wrong is staggering. Consider that the reason it took him "only" two days is because he is very, _very_ good at what he does.

marginalia_nu

Days-taken-to-fix is kind of a weird measure for how difficult a bug is. It's clearly a function of a large number of things besides the bug itself, including experience and whether you have to go it alone or can talk to the right people.

The bug ticks most of the boxes for a tricky bug:

* Non-deterministic

* Enormous haystack

* Unexpected "1+1=3"-type error with a cause outside of the code itself

Like sure it would have been slower to debug if it took 30 hours to reproduce, and harder if he had to be going down Niagara Falls in a barrel while debugging it, but I'm not sure those things quite count.

I had a similar category of bug I was struggling with the other year[1] that was related to a faulty optimization in the GraalVM JVM leading to bizarre behavior in very rare circumstances. If I'd been sitting next to the right JVM engineers over at Oracle, I'm sure we'd have figured it out in days and not the weeks it took me.

[1] https://www.marginalia.nu/log/a_104_dep_bug/

ivraatiems

Imagine if you weren't working at Google and were trying to convince the Chromium team you found a bug in V8. That'd probably be nigh-impossible.

One thing I notice is that Google has no way whatsoever to actually just ask users "hey, are you having problems?", a definite downside of their approach to software development where there is absolutely no communication between users and developers.

seeingnature

I'd love to see the rest of your postmortem template! I never thought about adding a "Where did we get lucky?" question.

I recently realized that one question for me should be, "Did you panic? What was the result of that panic? What caused the panic?"

I had taken down a network, and the device led me down a pathway that required multiple apps and multiple logins I didn't have in order to regain access. I panicked and, because the network was small, roamed and moved all devices to my backup network.

The following day, under no stress, I realized that my mistake was that I was scanning a QR code 90 degrees off from its proper orientation. I didn't realize that QR codes had a proper orientation and figured that their corner identifiers handled any orientation. Then it was simple to gain access to that device. I couldn't even replicate the other odd path.

somat

One of my favorite man pages is scan_ffs https://man.openbsd.org/scan_ffs

    The basic operation of this program is as follows:

    1. Panic. You usually do so anyways, so you might as well get it over with. Just don't do anything stupid. Panic away from your machine. Then relax, and see if the steps below won't help you out.

    2. ...

srejk

The standard SRE one recommended by Google has a lucky section. We tend to use it to talk about getting unlucky too.

nathan_douglas

A good section to have is one on concept/process issues you encountered, which I think is a generalization of your question about panic.

For instance, you might be mistaken about the operation of a system in some way that prolongs an outage or complicates recovery. Or perhaps there are complicated commands that someone pasted in a comment in a Slack channel once upon a time and you have to engage in gymnastics with Sloogle™ to find them, while the PM and PO are requesting updates. Or you end up saving the day because of a random confluence of rabbit holes you'd traversed that week, but you couldn't expect anyone else on the team to have had the same flash of insight that you did.

That might be information that is valuable to document or add to training materials before it is forgotten. A lot of postmortems focus on the root cause, which is great and necessary, but don't look closely at the process of trying to stop the bleeding.

parliament32

No, QR codes are auto-orienting[1]. If you're getting a different reading at different orientations, there is a bug in your scanner.

[1] https://en.wikipedia.org/wiki/QR_code#Design

egypturnash

It does seem to be possible to design QR codes that scan differently depending on the orientation, though they look a little visibly malformed.

https://hackaday.com/2025/01/23/this-qr-code-leads-to-two-we...

Suppafly

> I didn't realize that QR codes had a proper orientation and figured that their corner identifiers handled any orientation.

Same, I assumed they were designed to always work. I suspect it was whatever app or library you were using that wasn't designed to handle them correctly.

null

[deleted]

jbs789

I suspect that minimising someone else's work allows the commenters to feel better about themselves. As a general rule/perspective.

lesuorac

> In this instance, the author definitely got lucky that they were working at Google where 1) there were enough users to generate this Heisenbug consistently and 2) that they had direct access to Chrome devs.

I'm not sure this is really luck.

The fix is to just not use Math.abs. If they didn't work at Google they still would've done the same debugging and used the same fix. Working at Google probably harmed them, as once they discovered Math.abs didn't work correctly they could've just immediately used `> 0` instead of asking the Chrome team about it.

There's nothing lucky about slowly adding printf statements until you understand what the computer is actually doing; that's just good work.

jdwithit

I wish I could recall the details better, but this was 20+ years ago now. In college I had an internship working at Bose, doing QA on firmware for a new multi-CD changer add-on to their flagship stereo. We were provided discs of music tracks with various characteristics. And had to listen to them over and over and over and over and over and over, running through test cases provided by QA management as we did. But also doing random ad-hoc testing once we finished the required tests on a given build.

At one point I found a bug where if you hit a sequence of buttons on the remote at a very specific time--I want to say it was "next track" twice right as a new track started--the whole device would crash and reboot. This was a show stopper; people would hit the roof if their $500 stereo crashed from hitting "next". Similar to the article, the engineering lead on the product cleared his schedule to reproduce, find, and fix the issue. He did explain what was going on at the time, but the specifics are lost to me.

Overall the work was incredibly boring. I heard the same few tracks so many times I literally started to hear them in my dreams. So it was cool to find a novel, highest severity bug by coloring outside the lines of the testcases. I felt great for finding the problem! I think the lead lost 20% of his hair in the course of fixing it, lol.

I haven't had QA as a job title in a long time but that job did teach me some important lessons about how to test outside the happy path, and how to write a reproducible and helpful bug report for the dev team. Shoutout to all the extremely underpaid and unappreciated QA folks out there. It sucks that the discipline doesn't get more respect.

steveBK123

That is great QAing. It also speaks to why QA should be a real role in more orgs, rather than a shrinking discipline. Engineers LOVE LOVE LOVE to test the happy path.

It's not even malice/laziness; it's that their entire interpretation of the problem/requirements drives their implementation, which then drives their testing. It's like asking restaurants to self-certify that they are up to food safety codes.

z3t4

If you do not follow the happy path, something will break 100% of the time. That's why engineers always follow the happy path. Some engineers even think that anything outside the happy path is an exception and not even worth investigating. These engineers only thrive if the users are unable to switch to another product. Only competition will lead to better products.

steveBK123

My favorite happy path developer, and he was by far 10x worse at this than any engineer I've worked with, did the following:

Spec: allow the internal BI tool to send scheduled reports to the user

Implementation: the server required the desktop front end of said user to have been opened that day for the scheduled reports to work, even though the server side was sending the mails

Why this was hilariously bad - the only reason to have this feature is for when the user is out of office / away from desk for an extended period, precisely when they may not have opened their desktop UI for the day.

One of my favorite examples of how an engineer can get the entire premise of the problem wrong.

In the end he had taken so long and was so intransigent that desktop support team found it easier to schedule the desktop UIs to auto-open in windows scheduler every day such that the whole Rube Goldberg scheduled reports would work.

pjc50

> If you do not follow the happy path something will break 100% of the time

No, that means you're dealing with an early alpha, rigged demo, or some sort of vibe coding nonsense.

epolanski

Nonsense, you're not describing any engineering at all.

I mean, it's well known that there's very little engineering in most software "engineers", but you're describing a person I've never seen.

fatnoah

> That is great QAing. It also speaks to why QA should be a real role in more orgs, rather than a shrinking discipline.

As a software engineer, I've always been very proud of my thoroughness and attention to detail in testing my code. However, good QA people always leave me wondering "how did they even think to do that?" when reviewing bug reports.

QA is both a skillset AND a mindset.

philk10

Pedantically pointing out the difference between doing some exploratory testing ("testing outside the test cases") and QA, which is setting up processes/procedures, part of which should be "do exploratory testing as well as running the test cases". But the "testing is not QA" distinction has been fought over for decades...

But, love the story and I collect tales like this all the time so thanks for sharing

HdS84

A friend of mine has near-PTSD from watching some movie over and over and over at an optician where she worked. It was on rotation so that customers could gauge their eyesight.

kridsdale1

I imagine flight attendants are pretty tired of the Delta Broadway Show video.

BobbyTables2

Interesting writeup, but 2 days to debug “the hardest bug ever”, while accurate, seems a bit overdone.

Though abs() returning negative numbers is hilarious.. “You had one job…”

To me, the hardest bugs are nearly irreproducible “Heisenbugs” that vanish when instrumentation is added.

I’m not just talking about concurrency issues either…

The kind of bug where a reproduction attempt takes a week, not parallelizable due to HW constraints, and logging instrumentation makes it go away or fail differently.

2 days is cute though.

userbinator

> The kind of bug where a reproduction attempt takes a week, not parallelizable due to HW constraints, and logging instrumentation makes it go away or fail differently.

The hardest one I've debugged took a few months to reproduce, and would only show up on hardware that only one person on the team had.

One of the interesting things about working on a very mature product is that bugs tend to be very rare, but those rare ones which do appear are also extremely difficult to debug. The 2-hour, 2-day, and 2-week bugs have long been debugged out already.

gmueckl

That reminded me of a former colleague at the desk next to me randomly exclaiming one day that he had just fixed a bug he had created 20 years ago.

The bug was actually quite funny in a way: it was in the code displaying the internal temperature of the electronics box of some industrial equipment. The string conversion was treating the temperature variable as an unsigned int when it was in fact signed. It took a brave field technician in Finland in winter, inspecting a unit in an unheated space, to even discover this particular bug, because the units' internal temperatures were usually about 20°C above ambient.

treyd

This is a surprisingly common mistake with temperature readings. Especially when the system has a thermal safety power off that triggers if it's above some temperature, but then interprets -1 deg C as actually 255 deg C.
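
A minimal sketch of that reinterpretation, using JS typed arrays (hypothetical values, not the actual firmware code):

    // An 8-bit signed sensor reading displayed through code that treats
    // the same byte as unsigned.
    const reading = new Int8Array([-1]);                 // sensor says -1 °C
    const displayed = new Uint8Array(reading.buffer)[0]; // same byte, read as unsigned
    console.log(displayed);                              // 255 -- "thermal shutdown" territory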

edarchis

My brother is a wifi expert at a hw manufacturer. He once had a case where the customer had issues setting the transmit power to like 100 times the legal limit. They happened to be an offshore drilling platform and had an exemption for the transmission power as their antenna was basically on a buoy on the ocean. He had to convince the developer to fix that very specific bug.

devsda

During the time I was working on a mature hardware product in maintenance, if I think about the number of customer bugs we had to close as not reproducible, or that were only present briefly in a specific setup, it was really embarrassing and we felt like a bunch of noobs.

dharmab

Bryan Cantrill did a talk about this phenomenon called "Zebras all the way down" some years back.

jakevoytko

Author here! I debugged a fair number of those when I was a systems engineer working on soft real-time robotics systems, but none of them felt as bad in retrospect, because you're just reading up on the system and mulling it over, and eventually you get the answer in a shower thought. Maybe I just find the puzzle of them fun; I don't know why they don't feel quite so bad. This was just an exhausting 2-day brute-force grind where it turned out the damn compiler was broken.

gertlex

I also came to the comments to weigh in on my perception of how rough this was, but instead will ask:

Regarding "exhausting 2-day brute-force grind": is/was this just how you like to get things done, or was there external pressure of the "don't work on anything else" sort? I've never worked at a large company, and lots of descriptions of the way things get done are pretty foreign to me :). I am also used to being able to say "this isn't getting figured out today; probably going to be best if I work on something else for a bit, and sleep on it, too".

jakevoytko

The fatal error volume was so overwhelming that we didn't have any option but understanding the problem in perfect detail so that we could fix it if the problem was on our side, or avoid it if it was caused by something like our compiler or the browser.

Our team also had a very grindy culture, so "I'm going to put in extra hours focusing exclusively on our top crash" was a pretty normalized behavior. After I left that team (and Google), most of my future teams have been more forgiving on pace for non-outages.

jffhn

> Though abs() returning negative numbers is hilarious.

Math.abs(Integer.MIN_VALUE) in Java very seriously returns -2147483648, as there is no int for 2147483648.
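
A rough way to see why, sketched with JS int32 truncation (the `|0` is just for illustration; Java's int arithmetic wraps the same way):

    // Negating the most negative 32-bit value overflows back onto itself.
    const min32 = -2147483648;   // Integer.MIN_VALUE
    console.log((-min32) | 0);   // -2147483648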

eterm

You inspired me to check what .NET does in that situation.

It throws an OverflowException: ("Negating the minimum value of a twos complement number is invalid.")

rhaps0dy

Oh no, Pytorch does the same thing:

    a = torch.tensor(-2**31, dtype=torch.int32)
    assert a == a.abs()

MawKKe

NumPy as well, and TensorFlow.

adrian_b

Unchecked integer overflow strikes again.

bobbylarrybobby

Rust does the same in release, although it panics in debug.

efortis

Same here, we had an IE8 bug that prevented the initial voice over of the screen reader (JAWS). No dev could reproduce it because we all had DevTools open.

gsck

I had a similar issue, worked fine when I was testing it on my machine, but I had dev tools open to see any potential issues.

Turns out IE8 doesn't define console until the devtools are open. That caused me to pull a few hairs out.

smrq

I can't remember the actual bug now, but one of my early career memories was hunting down an IE7 issue by using bookmarklets to alert() values. (Did IE7 even have dev tools?)

camtarn

There was a downloadable developer toolbar for IE6 and IE7, and scripts could be debugged in the external Windows Script Debugger. The developer toolbar even told you which elements had the famous hasLayout attribute applied, which completely changed how it was rendered and interacted with other objects, which was invaluable.

lukan

"To me, the hardest bugs are nearly irreproducible “Heisenbugs” that vanish when instrumentation is added."

My favourite are bugs that not only don't appear in the debugger, but also no longer reproduce under normal settings after I've taken a closer look in the debugger (only to come back later at a random time). Feels like chasing ghosts.

btschaegg

Terminology proposal: "Gremlins" :)

Adverblessly

> To me, the hardest bugs are nearly irreproducible “Heisenbugs” that vanish when instrumentation is added.

A favourite of mine was a bug (specifically, a stack corruption) that I only managed to see under instrumentation. After a lot of debugging turns out that the bug was in the instrumentation software itself, which generated invalid assembly under certain conditions (calling one of its own functions with 5 parameters even though it takes only 4). Resolved by upgrading to their latest version.

Terr_

This repro was a few times per day, but try fixing a Linux kernel panic when you don't even have C/C++ on your resume, and everyone who originally set stuff up has left...

https://news.ycombinator.com/item?id=37859771

Point being that the difficulty of a fix can come from many possible places.

rowanG077

I don't think the number of days something took to debug is an interesting measure at all. Trivial bugs can take weeks to debug for a noob. Insanely hard bugs take hours to debug for genius devs, maybe even without any reproducer, just by thinking about it.

nneonneo

FWIW: this type of bug in Chrome is exploitable to create out-of-bounds array accesses in JIT-compiled JavaScript code.

The JIT compiler contains passes that will eliminate unnecessary bounds checks. For example, if you write “var x = Math.abs(y); if(x >= 0) arr[x] = 0xdeadbeef;”, the JIT compiler will probably delete the if statement and the internal nonnegative array index check inside the [] operator, as it can assume that x is nonnegative.

However, if Math.abs is then “optimized” such that it can produce a negative number, then the lack of bounds checks means that the code will immediately access a negative array index - which can be abused to rewrite the array’s length and enable further shenanigans.
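
Roughly, the pattern looks like this (a hedged sketch of the idea, not actual V8 internals):

    // The optimizer's range analysis says x = Math.abs(y) is non-negative,
    // so both the explicit check and the engine's own index check get dropped.
    function store(arr, y) {
      var x = Math.abs(y);      // assumed >= 0
      if (x >= 0) {             // folded away as always-true
        arr[x] = 0xdeadbeef;    // internal non-negative index check also elided
      }
    }
    // If a miscompiled Math.abs returns a negative x, this now-unchecked store
    // lands before the array's elements, e.g. on its length field.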

Further reading about a Chrome CVE pretty much exactly in this mold: https://shxdow.me/cve-2020-9802/

saghm

> which can be abused to rewrite the array’s length and enable further shenanigans.

I followed all of this up until here. JavaScript lets you modify the length of an array by assigning to indexes that are negative? I'm familiar with the paradigm of negative indexing being used to access things from the end of the array (like -1 being the last element), but I don't understand what operation someone could do that would somehow modify the length of the array rather than modifying a specific element in-place. Does JIT-compiled JavaScript not follow the usual JavaScript semantics that would normally apply when using a negative index, or are you describing something that would be used in combination with some other compiler bug (which honestly sounds a lot more severe even in the absence of an unusual Math.abs implementation)?

nneonneo

Normally, there would be a bounds check to ensure that the index was actually non-negative; negative indices get treated as property accesses instead of array accesses (unlike e.g. Python where they would wrap around).

However, if the JIT compiler has "proven" that the index is never negative (because it came from Math.abs), it may omit such checks. In that case, the resulting access to e.g. arr[-1] may directly access the memory that sits one position before the array elements - which could, for example, be part of the array metadata, such as the length of the array.

You can read the comments on the sample CVE's proof-of-concept to see what the JS engine "thinks" is happening, vs. what actually happens when the code is executed: https://github.com/shxdow/exploits/blob/master/CVE-2020-9802.... This exploit is a bit more complicated than my description, but uses a similar core idea.

saghm

I understand the idea of the lack of a bounds check allowing access to early memory with a negative index, but I'm mostly struggling with wrapping my head around why the underlying memory layout is accessible in JavaScript in the first place. I hadn't considered the fact that the same syntax could be used for accessing arbitrary properties rather than just array indexes; that might be the nuance I was missing.

bryanrasmussen

>I followed all of this up until here. JavaScript lets you modify the length of an array by assigning to indexes that are negative?

This is my no doubt dumb understanding of what you can do, based on some funky stuff I did one time to mess with people's heads

Do the following:

    const arr = [];
    arr[-1] = "hi";
    console.log(arr);

This gives you:

    "-1": "hi"
    length: 0

I figured this is because an array is really just a special type of object (my interpretation, probably wrong).

Now we can see that the JavaScript Array length is 0, but since the value is findable in there, I would expect there is some length representation in the lower-level language that JavaScript is implemented in, in the browser. I would then think that there could even be exploits available by somehow taking advantage of the difference between this lower-level representation of length and the JS array length. (Again, all this is silly stuff I thought and have never investigated, and is probably laughably wrong in some ways.)

I remember seeing some additions to array a few years back that made it so you could protect against the possibility of negative indexes storing data in arrays - but that memory may be faulty as I have not had any reason to worry about it.

saghm

You raise a good point that JavaScript arrays are "just" objects that let you assign to arbitrary properties through the same syntax as array indexing. I could totally imagine some sort of optimization where a compiler utilizes this to be able to map arrays directly to their underlying memory layout (presumably with a length prefix), and that would end up potentially providing access to it in the case of a mistaken assumption about omitting a bounds check.

bboygravity

Javascript is the new Macromedia/Adobe Flash.

You can do more and more in it and it's so fun, until it suddenly isn't anymore and dies.

ongy

This is after the JIT. I.e., don't think of fancy language shenanigans that do negative indexing, but a negative offset from the start of the array's memory.

When there's some inlining, there will be no function call into some index operator function.

PhilipRoman

For example if arrays were implemented like this (they're not)

    struct js_array {
        uint64_t length;
        js_value *values[];
    };
Because after bounds checks have been taken care of, loading an element of a JS array probably compiles to a simple assembly-level load like mov. If you bypass the bounds checks, that mov can read or write any mapped address.

saghm

Yeah, I understand all of that. I think my surprise was that you can access arbitrary parts of this struct from within JavaScript at all; I guess I really just haven't delved deeply enough into what JIT compiling actually is doing at runtime, because I wouldn't have expected that to be possible.

perihelions

My own story: I spent >10 hours debugging an Emacs project that would occasionally cause a kernel crash on my machine. Proximate cause was a nonlocal interaction between two debug-print statements. (Wasn't my first guess). The Elisp debug-print function #'message has two effects: it appends to a log, and also does a small update notification in the corner of the editor window. If that corner-of-the-window GUI object is thrashed several hundred times in a millisecond, it would cause the GPU driver on my specific machine to lock up, for a reason I've never root-caused.

Emacs' #'message implementation has a debounce logic, that if you repeatedly debug-print the same string, it gets deduplicated. (If you call (message "foo") 50 times fast, the string printed is "foo [50 times]"). So: if you debug-print inspect a variable that infrequently changes (as was the case), no GUI thrashing occurs. The bug manifested when there were *two* debug-print statements active, which circumvented the debouncer, since the thing being printed was toggling between two different strings. Commenting out one debug-print statement, or the other, would hide the bug.

chrismorgan

> If that corner-of-the-window GUI object is thrashed several hundred times in a millisecond, it would cause the GPU driver on my specific machine to lock up, for a reason I've never root-caused.

Until comparatively recently, it was absurdly easy to crash machines via their graphics drivers, even by accident. And I bet a lot of them were security concerns, not just DoS vectors. WebGL has been marvellous at encouraging the makers to finally fix their drivers properly, because browsers declared that kind of thing unacceptable (you shouldn’t be able to bring the computer down from an unprivileged web page¹), and developed long blacklists of cards and drivers, and brought the methodical approach browsers had finally settled on to the graphics space.

Things aren’t perfect, but they are much better than ten years ago.

—⁂—

¹ Ah, fond memories of easy IE6 crashes, some of which would even BSOD Windows 98. My favourite was, if my memory serves me correctly, <script>document.createElement("table").appendChild(document.createElement("div"))</script>. This stuff was not robust.

jason_tko

Reminds me of the classic bug story where users couldn’t send emails more than 500 miles.

https://web.mit.edu/jemorris/humor/500-miles

decimalenough

BoorishBears

I experienced "crashes after 16 hours if you didn't copy the mostly empty demo Android project from the manufacturer and paste the entire existing project into it"

Turned out there was an undocumented MDM feature that would reboot the device if a package with a specific name wasn't running.

Upon decompilation it wasn't supposed to be active (they had screwed up and shipped a debug build of the MDM) and it was supposed to be 60 seconds according to the variable name, but they had mixed up milliseconds and seconds

sgarland

This deserves more upvotes. Absolute classic.

friendzis

My hardest bug story, almost circling back to the origin of the word.

An intern gets a devboard with a new MCU to play with. A new generation, but mostly backwards compatible or something like that. The intern gets the board up and running with the embedded equivalent of "hello world". They port basic product code - ${thing} does not work. After enough hair is pulled, I give them some guidance - ${thing} does not work. Okay, I instruct the intern to take the MCU vendor libraries/examples and get ${thing} running in isolation. The intern fails.

Okay, we are missing something huge that should be obvious. We start pair programming and strip the code down layer by layer. Eventually we are at a stage where we are accessing hand-coded memory addresses directly. ${thing} does not work. Okay, set up a peripheral and read state register back. Assertion fails. Okay, set up peripheral, nop some time for values to settle, read state register back. Assertion fails. Check generated assembly - nopsled is there.

We look at the manual; the bit switching the peripheral into the state we care about is not set. However we poke the MCU, whatever we write to the control register, the bit is just not set and the peripheral never switches into the mode we need. We get a new devboard (or resolder the MCU on the old one, I don't remember) and it works first try.

"New device - must be new behavior" thinking with lack of easy access to the new hardware led us down a rabbit hole. Yes, nothing too fancy. However, I shudder thinking what if reading the state register gave back the value written?

GianFabien

> what if reading the state register gave back the value written?

I've had that experience. Turned out some boards in the wild didn't have the bodge wire that connected the shift register output to the gate that changed the behavior.

latexr

It’s amusing how so many of the comments here are like “You think two days is hard? Well, I debugged a problem which was passed down to me by my father, and his father before him”. It reminds me of the Four Yorkshiremen sketch.

https://youtube.com/watch?v=sGTDhaV0bcw

The author’s “error”, of course, was calling it “the hardest bug I ever debugged”. It drives clicks, but comparisons too.

markrages

Of course the comments section is going to be full of war stories about everyone's hardest bug.

This is how humans work, and this is why I am reading the comments.

latexr

Yes, of course, I greatly enjoy the stories and it’s why I opened this thread. But that’s not what my comment is about, I was specifically referencing the parts of the comments which dismiss the difficulty and length of time the author spent tracking down this particular bug. I found that funny and my comment was essentially one big joke.

jandrese

At least the author worked for Google. It's another layer of fun to go through the work of tracking down a bug like that as a third party and then trying to somehow contact a person at the company who can fix it, especially when it is a big company and doubly so if the product is older and on a maintenance only schedule.

Me: "Your product is broken for all customers in this situation, probably has been so for years, here is the exact problem and how to fix it, can I talk with someone who can do the work?"

Customer Support: "Have you tried turning your machine off and turning it back on again?"

sandos

Complaining about "slow to reproduce" and talking _seconds_. Dear, oh dear those are rookie numbers!

Currently working a bug where we saw file system corruption after 3 weeks of automated testing, tens of thousands of restarts. We might never see the problem again, even? It has only happened once so far.

Cthulhu_

If it only happened once... it might be the final category of bugs where nothing you can do will fix it. Cosmic ray bit flipping bug. Which is something your software needs to be able to work around, or in this case, the file system itself... unless you're actually working on the file system itself, in which case, I wish you good luck.

sa46

What layers of hardware can cosmic rays impact? Memory with ECC is largely safe, right? What about the L1 cache and friends?

392

Anything can fail, at any time. The best we can do is mitigate it and estimate bounds for how likely it is to mess up. Sometimes those bounds are acceptable.

zdc1

My worst bug had me using statistics to try and correlate occurrence rates with traffic/time of day, API requests, app versions, Node.js versions, resource allocations, etc. And when that failed I was capturing Prod traffic for examination in Wireshark...

Turned out that Node.js didn't gracefully close TCP connections. It just silently dropped the connection and sent a RST packet if the other side tried to reuse it. Fun times.

tonyarkles

Heh, not a nodejs problem but something related to TCP connections.

I won't name the product because it's not its fault, but we had an HA cluster of 3 instances of it set up. Users reported that the first login of the day would fail, but only for the first person to come into the office. You hit the login button, it takes 30 seconds to give you an invalid login, and then you try logging in again and it works fine for the rest of the day.

Turns out IT had a "passive" firewall (traffic inspection and blocking, but no NAT) in place between the nodes. The nodes established long-running TCP connections between them for synchronization. The firewall internally kept a table of known established connections and eventually dropped them if they were idle. The product had turned on TCP keepalive, but the Linux default keepalive interval is longer than the firewall's timeout. When the firewall dropped a connection from the table, it didn't spit out RST packets to anyone; it just silently stopped letting traffic flow.

When the first user of the day tried to log in, all three HA nodes believed their TCP connections were still alive and happy (since they had no reason to think otherwise) and had to wait for the connections to time out before tearing them down and re-establishing them. That was a fun one to figure out...
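
For what it's worth, the application-side knob looks something like this in Node (hypothetical host, port, and interval; the real fix was in the product/firewall configuration):

    // Ask the OS to send keepalive probes after 30s of idle time, well under
    // the firewall's connection-table timeout, instead of relying on the
    // Linux default of 2 hours (net.ipv4.tcp_keepalive_time = 7200).
    const net = require('net');
    const socket = net.connect({ host: 'ha-peer.internal', port: 8443 });
    socket.setKeepAlive(true, 30000);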

smackeyacky

Networking in node.js is maddeningly stupid and extremely hard to debug, especially when you're running it in something like Azure where the port allocation can be restricted outside of your control. It's bad enough that I wouldn't consider using node.js on any new project.

noduerme

Amazing war story. Very well told.

Honestly, of all the stupid ideas, having your engine switch to a completely untested mode when under heavy load, a mode that no one ever checks and that might take years to discover bugs in, is absolutely one of the most insane things I can think of. That's at best really lazy, and at worst displays a corporate culture that prizes superficial performance over reliability and quality. Thankfully no one's deploying V8 in, like, avionics. I hope.

At least this is one of those bugs you can walk away from and say, it really truly was a low-level issue. And it takes serious time and energy to prove that.

atq2119

I agree with your assessment of how stupid this is, but I'm not surprised.

To be clear, there are good reasons for this different mode. The fuck-up is not testing it properly.

These kinds of modes can be tested properly in various ways, e.g. by having an override switch that forces the chosen mode to be used all the time instead of using the default heuristics for switching between modes. And then you run your test suite in that configuration in addition to the default configuration.

The challenge is that you have now at least doubled the time it takes to run all your tests. And with this kind of project (like a compiler), there are usually multiple switches of this kind, so you very quickly get into combinatorial explosion where even a company like Google falls far short of the resources it would require to run all the tests. (Consider how many -f flags GCC has... there aren't enough physical resources to run any test suite against all combinations.)

The solution I'd love to see is stochastic testing. Instead of (or, more realistically, in addition to) a single fixed test suite that runs on every check-in and/or daily, you have an ongoing testing process that continuously tests your main branch against randomly sampled (test, config) pairs from the space of { test suite } x { configuration space }. Ideally combine it with an automatic bisector which, whenever a failure is found, goes back to an older version to see if the failure is a recent regression and identifies the regression point if so.
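
A minimal sketch of that sampling loop (the test/config names and the runTest hook are placeholders, not a real harness):

    // Continuously sample (test, config) pairs from the cross product instead
    // of enumerating it; failures get handed off for bisection.
    const pick = (xs) => xs[Math.floor(Math.random() * xs.length)];

    async function stochasticLoop(tests, configs, runTest) {
      for (;;) {
        const test = pick(tests);
        const config = pick(configs);
        const ok = await runTest(test, config);
        if (!ok) {
          console.error(`FAIL: ${test} under ${JSON.stringify(config)}, candidate for bisection`);
        }
      }
    }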

friendzis

Isn't stochastic testing becoming more and more of a standard practice? Even if you have the hardware and time to run a full testsuite, you still want to add some randomness just to catch accidental dependencies between tests.

atq2119

Maybe? I'd love to hear if there are some good tools for it that can be integrated into typical setups with Git repositories, Jenkins or GitHub Actions, etc.