Debugging: Indispensable rules for finding even the most elusive problems (2004)
211 comments
January 13, 2025 · hughdbrown
randerson
The quickest solution, assuming learning from the problem isn't the priority, might be to replace the entire chain of lights without testing any of them. I've been part of some elusive production issues where eventually 1-2 team members attempted a rewrite of the offending routine while everyone else debugged it, and the rewrite "won" and shipped to production before we found the bug. Heresy I know. In at least one case we never found the bug, because we could only dedicate a finite amount of time to a "fixed" issue.
hughdbrown
> The quickest solution, assuming learning from the problem isn't the priority, might be to replace the entire chain of lights without testing any of them.
So as a metaphor for software debugging, this is "throw away the code, buy a working solution from somewhere else." It may be a way to run a business, but it does not explain how to debug software.
yakshaving_jgt
I’m not sure it implies buying from elsewhere. I understood it to mean make a new one, rather than try to repair the broken one.
eru
Depending on what your routine looks like, you could run both on a given input, and see when / where they differ.
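A minimal sketch of that idea in Python — the two routines and the random-input generator here are placeholders, but the loop is the whole trick: feed both versions the same inputs until they disagree.

```python
import random

def old_routine(xs):
    # the suspect implementation (placeholder)
    return sorted(xs)[len(xs) // 2]

def new_routine(xs):
    # the rewritten implementation (placeholder)
    ys = sorted(xs)
    return ys[(len(ys) - 1) // 2]

def find_divergence(trials=10_000):
    """Run both versions on the same random inputs; return the first mismatch."""
    for _ in range(trials):
        xs = [random.randint(-100, 100) for _ in range(random.randint(1, 20))]
        a, b = old_routine(xs), new_routine(xs)
        if a != b:
            return xs, a, b
    return None  # no difference found within `trials` attempts

print(find_divergence())  # e.g. ([3, 9], 9, 3) -- these two differ on even-length input
```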
GuB-42
Rule 0: Don't panic
Really, that's important. You need to think clearly; deadlines and angry customers are a distraction. That's also when having a good manager who trusts you is important: his job is to shield you from all that so that you can devote all of your attention to solving the problem.
augbog
100% agree. I remember being on call when our PagerDuty started going off for a SEV-2, and naturally a lot of managers from affected teams were in there sweating bullets because their products/features/metrics were impacted. It can get pretty frustrating having so many people trying to be cooks in the kitchen. We had a great manager who literally just moved us to a different call/meeting and told us, "Ignore everything those people are saying; just stay focused and I'll handle them." Everyone's respect for our manager really went up from there.
Cerium
Slow is smooth and smooth is fast. If you don't have time to do it right, what makes you think there is time to do it twice?
ianmcgowan
There's a story in the book - on nuclear submarines there's a brass bar in front of all the dials and knobs, and the engineers are trained to "grab the bar" when something goes wrong rather than jumping right to twiddling knobs to see what happens.
throwawayfks
I read this book and took this advice to heart. I don't have a brass bar in the office, but when I'm about to push a button that could cause destructive changes, especially in prod, my hands reflexively fly up into the air while I double-check everything.
tetha
A weird yet effective recommendation from someone at my last job: if it's a destructive or dangerous action in prod, touch both your elbows first. This forces you to take your hands away from the keyboard, stop any possible auto-pilot, and look at what you're doing.
toolslive
"a good chess player sits on his hands" (NN). It's good advice as it prevents you from playing an impulsive move.
stronglikedan
I have to sit on my hands at the dentist to prevent impulse moves.
scubbo
Thank you for explaining that phrase! I couldn't find it with a quick Google.
Shorn
I always wanted that to be a true story, but I don't think it is.
adamc
I had a boss who used to say that her job was to be a crap umbrella, so that the engineers under her could focus on their actual jobs.
r-bryan
I once worked with a company that provided IM services to hyper competitive, testosterone poisoned options traders. On the first fine trading day of a January new year, our IM provider rolled out an incompatible "upgrade" to some DLL that we (our software, hence our customers) relied on, that broke our service. Our customers, ahem, let their displeasure be known.
Another developer and I were tasked with fixing it. The Customer Service manager (although one of the most conniving, politically destructive assholes I have ever not-quite worked with) actually carried a crap umbrella. Instead of constantly flaming us with how many millions of dollars our outage was costing every minute, he held up that umbrella and diverted the crap. His forbearance let us focus. He discreetly approached every 20 minutes, toes not quite entering the office, calmly inquiring how it was going. In just over an hour (between his visits 3 and 4), Nate and I had the diagnosis, the fix, and had rolled it out to production, to the relief of pension funds worldwide.
As much as I dislike the memory of that manager to this day, I praise his wisdom every chance I get.
louthy
I always say this too. But the real trick is knowing what to let thru. You can’t just shield your team from everything going on in the organisation. You’re all a part of the organisation and should know enough to have an opinion.
A better analogy is you’re there to turn down the noise. The team hears what they need to hear and no more.
Equally, the job of a good manager is to help escalate team concerns. But just as there’s a filter stopping the shit flowing down, you have to know what to flow up too.
airblade
At first I thought you meant an umbrella that doesn't work very well.
FearNotDaniel
Ah, the unintentional ambiguity of language, the reason there are so many lawyers in the world and why they are so expensive. The GP's phrasing is not incorrect but your comment made me realize: I only parsed it correctly the first time because I've heard managers use similar phrases so I recognized the metaphor immediately. But for the sake of reducing miscommunication, which sadly tends to trigger so many conflicts, I could offer a couple of disambiguatory alternatives:
- "her job was to be a crap-umbrella": hyphenate into a compound noun, implies "an umbrella of/for crap" to clarify the intended meaning
- "her job was to be a crappy umbrella": make the adjective explicit if the intention was instead to describe an umbrella that doesn't work well
dazzawazza
Ideally it's crap umbrellas all the way down. Everyone should be shielding everyone below them from the crap slithering its way down.
saghm
Agreed. Even as a relatively junior engineer at my first job, I realized that a certain amount of the job was not worth exposing interns to (like excessive amounts of Jira ticket wrangling) because it would take parts of their already limited time away from doing things that would benefit them far more. Unless there's quite literally no one "below" you, there's probably _something_ you can do to stop shit from flowing down past you.
generic92034
That must be the trickle-down effect everyone was talking about in the '80s. ;)
o_nate
A corollary to this is always have a good roll-back plan. It's much nicer to be able to roll-back to a working version and then be able to debug without the crisis-level pressure.
sroussey
Rollback ability is a must—it can be the most used mitigation if done right.
Not all issues can be fixed with a rollback though.
CobrastanJorji
I once worked for a team where, when a serious visible incident occurred, a company VP would pace the floor, occasionally yelling, describing how much money we were losing per second (or how much customer trust, if that number was too low) or otherwise communicating that we were in a battlefield situation and things were Very Critical.
Later I worked for a company with a much bigger and more critical website, and the difference in tone during urgent incidents was amazing. The management made itself available for escalations and took a role in externally communicating what was going on, but besides that they just trusted us to do our jobs. We could even go get a glass of water during the incident without a VP yelling at us. I hadn't realized until that point that being calm adults was an option.
bch
> Rule 0: Don’t panic
“Slow is smooth, smooth is fast.”
I know this and still get caught out sometimes…
Edit: hahaha ‘Cerium saying ~same thing: https://news.ycombinator.com/item?id=42683671
bilekas
This is very underrated. An extension to this: don't be afraid to break things further in order to probe. I often see a lot of devs, mid-level included, panic to the point that they don't even know where to start. I've come to believe that some people just have an inherent intuition and some just need to learn it.
jimmySixDOF
Yes, sometimes instinct takes over when you're on the spot in a pinch, but there are institutional things you can do to be prepared in advance that expand your set of options in the moment, much like a pre-prepared fire-drill playbook you can pull from. There are also training courses like Kepner-Tregoe. But you're right, there are just some people who do better than others when _it's hitting the fan.
nickjj
For #4 (divide and conquer), I've found `git bisect` helps a lot. If you have a known good commit and one of dozens or hundreds of commits after that is bad, this can help you identify the bad commit / code in a few steps.
Here's a walk through on using it: https://nickjanetakis.com/blog/using-git-bisect-to-help-find...
I jumped into a pretty big unknown code base in a live consulting call and we found the problem pretty quickly using this method. Without that, the scope of where things could be broken was too big given the context (unfamiliar code base, multiple people working on it, only able to chat with 1 developer on the project, etc.).
jerf
"git bisect" is why I maintain the discipline that all commits to the "real" branch, however you define that term, should all individually build and pass all (known-at-the-time) tests and generally be deployable in the sense that they would "work" to the best of your knowledge, even if you do not actually want to deploy that literal release. I use this as my #1 principle, above "I should be able to see every keystroke ever written" or "I want every last 'Fixes.' commit" that is sometimes advocated for here, because those principles make bisect useless.
The thing is, I don't even bisect that often... the discipline necessary to maintain that in your source code heavily overlaps with the disciplines to prevent code regression and bugs in the first place, but when I do finally use it, it can pay for itself in literally one shot once a year, because we get bisect out for the biggest, most mysterious bugs, the ones that I know from experience can involve devs staring at code for potentially weeks, and while I'm yet to have a bisect that points at a one-line commit, I've definitely had it hand us multiple-day's-worth of clue in one shot.
If I was maintaining that discipline just for bisect we might quibble with the cost/benefits, but since there's a lot of other reasons to maintain that discipline anyhow, it's a big win for those sorts of disciplines.
aag
Sometimes you'll find a repo where that isn't true. Fortunately, git bisect has a way to deal with failed builds, etc: three-value logic. The test program that git bisect runs can return an exit value that means that the failure didn't happen, a different value that means that it did, or a third that means that it neither failed nor succeeded. I wrote up an example here:
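For anyone who hasn't used it: with `git bisect run`, the exit code of your test script is that three-valued answer — 0 means the failure didn't happen, 125 means "can't test this commit, skip it", and any other code from 1-127 means the failure did happen. A sketch of such a script in Python (the build command and the reproducer script are placeholders):

```python
#!/usr/bin/env python3
# Usage:
#   git bisect start <bad-commit> <good-commit>
#   git bisect run ./check.py
import subprocess
import sys

# If this commit doesn't even build, tell bisect to skip it (exit 125).
build = subprocess.run(["make"], capture_output=True)
if build.returncode != 0:
    sys.exit(125)

# Run the reproducer: exit 0 if the failure didn't happen (good commit),
# exit 1 if it did (bad commit). Codes >= 128 would abort the bisect.
test = subprocess.run(["./reproduce_bug.sh"], capture_output=True)
sys.exit(0 if test.returncode == 0 else 1)
```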
Noumenon72
Very handy! I forgot about `git bisect skip`.
SoftTalker
I do bisecting almost as a last resort. I've used it when all else fails only a few times. Especially as I've never worked on code where it was very easy to just build and deploy a working debug system from a random past commit.
Edit to add: I will study old diffs when there is a bug, particularly for bugs that seem correlated with a new release. Asking "what has changed since this used to work?" often leads to an obvious cause or at least helps narrow where to look. Also asking the person who made those changes for help looking at the bug can be useful, as the code may be more fresh in their mind than in yours.
skydhash
Same. Every branch apart from the “real” one and release snapshots is transient and WIP. They don’t get merged back unless tests pass.
forrestthewoods
> why I maintain the discipline that all commits to the "real" branch, however you define that term, should all individually build and pass all (known-at-the-time) tests and generally be deployable in the sense that they would "work" to the best of your knowledge, even if you do not actually want to deploy that literal release
You’re spot on.
However it’s clearly a missing feature that Git/Mercurial can’t tag diffs as “passes” or “bisectable”.
This is especially annoying when you want to merge a stack of commits and the top passes all tests but the middle does not. It’s a monumental and valueless waste of time to fix the middle of the stack. But it’s required if you want to maintain bisectability.
It’s very annoying and wasteful. :(
michalsustr
This is why we use squash like here https://docs.gitlab.com/ee/user/project/merge_requests/squas...
Izkata
If there's a way to identify those incomplete commits, git bisect does support "skip" - a commit that's neither good nor bad, just ignored.
rlkf
Could you not use --first-parent option to test only at the merge-points?
smcameron
Back in the 1990s, while debugging some network configuration issue a wiser older colleague taught me the more general concept that lies behind git bisect, which is "compare the broken system to a working system and systematically eliminate differences to find the fault." This can apply to things other than software or computer hardware. Back in the 90s my friend and I had identical jet-skis on a trailer we shared. When working on one of them, it was nice to have its twin right there to compare it to.
yuliyp
The principle here, bisection, is a lot more general than just "git bisect" for identifying ranges of commits. It can also be used for partitioning the space of systems. For instance, if a workflow with 10 steps is broken, can you perform some tests to confirm that 5 of the steps functioned correctly? Can you figure out that it's definitely not a hardware issue (or definitely a hardware issue) somewhere?
This is critical to apply in cases where the problem might not even be caused by a code commit in the repo you're bisecting!
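A sketch of that generalisation in Python. Here `state_ok_after` stands in for whatever per-system check you can actually perform (inspect an intermediate file, query a table, ping a host...), and the binary search assumes the input is good, the final output is broken, and once a step breaks things every later check also fails:

```python
def first_bad_step(num_steps, state_ok_after):
    """Binary-search for the first pipeline step whose output is broken.

    `state_ok_after(k)` must return True if the output is still correct
    after running steps 1..k, and False otherwise (a hypothetical check
    you implement per system).
    """
    lo, hi = 0, num_steps  # invariant: ok after step `lo`, broken after step `hi`
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if state_ok_after(mid):
            lo = mid
        else:
            hi = mid
    return hi

# Toy example: pretend step 7 of 10 corrupts the data.
print(first_bad_step(10, lambda k: k < 7))  # -> 7
```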
Icathian
Tacking on my own article about git bisect run. It really is an amazing little tool.
tetha
You can also use divide and conquer when dealing with a complex system.
Like, traffic going from A to B can turn ... complicated with VPNs and such. You kinda have source firewalls, source routing, connectivity of the source to a router, routing on the router, firewalls on the router, various VPN configs that can go wrong, and all of that on the destination side as well. There can easily be 15+ things that can cause the traffic to disappear.
That's why our runbook recommends starting troubleshooting by dumping traffic on the VPN nodes. That's a very low-effort, quick step to figure out which of the six-ish legs of the journey drops traffic - to VPN, through VPN, to destination, back to VPN node, back through VPN, back to source. Then you realize traffic back to the VPN node disappears and you can dig into that.
And this is a powerful concept to think through in system troubleshooting: Can I understand my system as a number of connected tubes, so that I have a simple, low-effort way to pinpoint one tube to look further into?
As another example, for many services, the answer here is to look at the requests on the loadbalancer. This quickly isolates which services are throwing errors blowing up requests, so you can start looking at those. Or, system metrics can help - which services / servers are burning CPU and thus do something, and which aren't? Does that pattern make sense? Sometimes this can tell you what step in a pipeline of steps on different systems fails.
jvans
git bisect is an absolute power feature everybody should be aware of. I use it maybe once or twice a year at most but it's the difference between fixing a bug in an hour vs spending days or weeks spinning your wheels
epolanski
Bisection is also useful when debugging css.
When you don't know what is breaking that specific scroll or layout somewhere in the page, you can just remove half the DOM in the dev tools and check if the problem is still there.
Rinse and repeat, it's a basic binary search.
I am often surprised that leetcode black belts are absolutely unable to apply what they learn in the real world, neither in code nor in debugging, which always reminds me what a useless metric it is for hiring engineers.
rozap
Binary search rules. Being systematic about dividing the problem in half, determining which half the issue is in, and then repeating applies to non software problems quite well. I use the strategy all the time while troubleshooting issue with cars, etc.
manhnt
> Make it fail: Do it again, start at the beginning, stimulate the failure, don't simulate the failure, find the uncontrolled condition that makes it intermittent, record everything and find the signature of intermittent bugs
Unfortunately, I've found that many times this is actually the most difficult step. I've lost count of how many times our QA reported an intermittent bug in their env, only to never be able to reproduce it again in the lab. Until it hits 1 or 2 customers in the field, but then when we try to take a look at the customer's env, it's gone and we don't know when it could come back again.
qwertox
Make sure you're editing the correct file on the correct machine.
ZedZark
Yep, this is a variation of "check the plug"
I find myself doing this all the time now: I will temporarily add a line to cause a fatal error, to check that it's the right file (and, depending on the situation, also the right line).
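In Python, for instance, that tripwire can be as crude as a line dropped right next to the spot you're about to change, and removed again straight after:

```python
# Temporary tripwire -- if the program *doesn't* die here, the file (or line)
# you're editing isn't the one that's actually running.
raise RuntimeError("editing the right file? then you should see this crash")
```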
overhead4075
This is also covered by "make it fail"
shmoogy
I'm glad I'm not the only one doing this after I wasted too much time trying to figure out why my docker build was not reflecting the changes ... never again..
eddd-ddde
How much time I've wasted unknowingly editing generated files, out of version files, forgetting to save, ... only god knows.
netcraft
The biggest thing I've always told myself and anyone I've taught: make sure you're running the code you think you're running.
chupasaurus
Poor Yorick!
ajuc
That's why you make it break differently first. To see your changes have any effect.
snowfarthing
When working on a test that has several asserts, I have adopted the process of adding one final assert, "assert 'TEST DEBUGGED' is False", so that even when I succeed, the test fails -- and I could review to consider if any other tests should be added or adjusted.
Once I'm satisfied with the test, I remove the line.
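In pytest, for example, that sentinel might look like the last line below (written here as `assert False, ...` to avoid the literal-comparison warning; the function under test is invented for illustration):

```python
import pytest

def add_tax(subtotal, rate=0.1):
    # stand-in for the real code under test
    return subtotal * (1 + rate)

def test_add_tax():
    assert add_tax(100) == pytest.approx(110)
    assert add_tax(0) == 0
    # Sentinel: keeps the test red while I'm still deciding whether more
    # asserts are needed; deleted once the test is final.
    assert False, "TEST DEBUGGED"
```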
spacebanana7
Also that it's in the correct folder
reverendsteveii
very first order of business: git stash && git checkout main && git pull
n144q
...and you are building and running the correct clone of a repository
heikkilevanto
Some additional rules:
- "It is your own fault". Always suspect your own code changes before anything else. It can be a compiler bug or even a hardware error, but those are very rare.
- "When you find a bug, go back and hunt down its family and friends". Think where else the same kind of thing could have happened, and check those.
- "Optimize for the user first, the maintenance programmer second, and last if at all for the computer".
physicles
The first one is known in the Pragmatic Programmer as “select isn’t broken.” Summarized at https://blog.codinghorror.com/the-first-rule-of-programming-...
bsammon
Alternatively, I've found the "Maybe it's a bug. I'll try and make a test case I can report on the mailing list" approach useful at times.
Usually, in the process of reducing my error-generating code down to a simpler case, I find the bug in my logic. I've been fortunate that heisenbugs have been rare.
Once or twice, I have ended up with something to report to the devs. Generally, those were libraries (probably from sourceforge/github) with only a few hundred or fewer users that did not get a lot of testing.
wormlord
I always have the mindset of "its my fault". My Linux workstation constantly crashing because of the i9-13900k in it was honestly humiliating. Was very relieved when I realized it was the CPU and not some impossible to find code error.
dehrmann
Linux is like an abusive relationship in that way--it's always your fault.
astrobe_
About "family and friends", a couple of times by fixing minor and a priori unrelated side issues, it revealed the bug I was after.
ajuc
It's healthier to assume your code is wrong than otherwise. But it's best to simply bisect the cause-effect chain a few more times and be sure.
nox101
> #1 Understand the system: Read the manual, read everything in depth, know the fundamentals, know the road map, understand your tools, and look up the details.
Maybe I'm misunderstanding, but "Read the manual, read everything in depth" sounds like: oh, I have a bug in my code, so first read the entire manual of the library I'm using, all 700 pages, then read 7 books on the library details, and now that a month or two has passed, go look at the bug.
I'd be curious if there's a single programmer that follows this advice.
feoren
Essentially yes, that's correct. Your mistake is thinking that the outcome of those months of work is being able to kinda-probably fix one single bug. No: the point of all that effort is to truly fix all the bugs of that kind (or as close to "all" as is feasible), and to stop writing them in the first place.
The alternative is paradropping into an unknown system with a weird bug, messing randomly with things you don't understand until the tests turn green, and then submitting a PR and hoping you didn't just make everything even worse. It's never really knowing whether your system actually works or not. While I understand that is sometimes how it goes, doing that regularly is my nightmare.
P.S. if the manual of a library you're using is 700 pages, you're probably using the wrong library.
earnestinger
> if the manual of a library you're using is 700 pages, you're probably using the wrong library.
Statement bordering on papyrophobia. (Seems that is a real phobia)
earnestinger
I think we have a bit different interpretations here.
> read everything in depth
Is not necessarily
> first read the entire manual of the library I'm using, all 700 pages
If I have a problem with “git bisect”, I can just go to Stack Overflow, try several snippets and see what sticks, or I can also go to https://git-scm.com/docs/git-bisect to get a bit deeper knowledge on the topic.
adolph
This was written in 2004, the year of Google's IPO. Atwood and Spolsky didn't found Stack Overflow until 2008. [0] People knew books by nicknames like the "Camel book" [1] and generally just knew things.
0. https://stackoverflow.blog/2021/12/14/podcast-400-an-oral-hi...
1. https://www.perl.com/article/extracting-the-list-of-o-reilly...
scudsworth
great strawman, guy that refuses to read documentation
mox1
I mean, he has a point. Things are incredibly complex nowadays; I don't think most people have time to "understand the system."
I would be much more interested in rules that don't start with that... Like "Rules for debugging when you don't have the capacity to fully understand every part of the system."
Bisecting is a great example here. If you are Bisecting, by definition you don't fully understand the system (or you would know which change caused the problem!)
sitkack
If folks want to instill this mindset in their kids, themselves or others I would recommend at least
The Martian by Andy Weir https://en.wikipedia.org/wiki/The_Martian_(Weir_novel)
https://en.wikipedia.org/wiki/Zen_and_the_Art_of_Motorcycle_...
https://en.wikipedia.org/wiki/The_Three-Body_Problem_(novel)
To Engineer Is Human - The Role of Failure in Successful Design By Henry Petroski https://pressbooks.bccampus.ca/engineeringinsociety/front-ma...
https://en.wikipedia.org/wiki/Surely_You%27re_Joking,_Mr._Fe...!
nextlevelwizard
What part of Three Body Problem has anything to do with debugging or troubleshooting or any form of planning?
Things just sort of happen with wild leaps of logic. The book is actually a fantasy book with thinnest layer of science babble on top.
deepspace
Agree. I think Zen and the Art of Motorcycle Maintenance encapsulates the art of troubleshooting the best. Especially the concept of "gumption traps":
"What you have to do, if you get caught in this gumption trap of value rigidity, is slow down...you're going to have to slow down anyway whether you want to or not...but slow down deliberately and go over ground that you've been over before to see if the things you thought were important were really important and to -- well -- just stare at the machine. There's nothing wrong with that. Just live with it for a while. Watch it the way you watch a line when fishing and before long, as sure as you live, you'll get a little nibble, a little fact asking in a timid, humble way if you're interested in it. That's the way the world keeps on happening. Be interested in it."
Words to live by
sitkack
Yeah, it is probably what I would start with and all the messages the book is sending you will resurface in the others. You have to cultivate that debugging mind, but once it starts to grow, it can't be stopped.
david_draco
Step 10, add the bug as a test to the CI to prevent regressions? Make sure the CI fails before the fix and works after the fix.
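A sketch of what that step might look like in practice with pytest — the module, function, and ticket number are invented for illustration:

```python
# test_regressions.py -- one test per fixed bug, named after the ticket.
from mylib import parse_duration  # hypothetical function that had the bug

def test_issue_1234_empty_string_returns_zero():
    # Before the fix, parse_duration("") raised IndexError.
    # Written so it fails on the unfixed code and passes after the fix;
    # CI then keeps it forever so the bug can't quietly come back.
    assert parse_duration("") == 0
```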
Tade0
The largest purely JavaScript repo I ever worked on (150k LoC) had this rule and it was a life saver, particularly because the project had commits dating back more than five years and, since it was a component/library, it had quite a few strange hacks for IE.
seanwilson
I don't think this is always worth it. Some tests can be time consuming or complex to write, have to be maintained, and we accept that a test suite won't be testing all edge cases anyway. A bug that made it to production can mean that particular bug might happen again, but it could be a silly mistake and no more likely to happen again than 100s of other potential silly mistakes. It depends, and writing tests isn't free.
Ragnarork
Writing tests isn't free but writing non-regression tests for bugs that were actually fixed is one of the best test cases to consider writing right away, before the bug is fixed. You'll be reproducing the bug anyway (so already consider how to reproduce). You'll also have the most information about it to make sure the test is well written anyway, after building a mental model around the bug.
Writing tests isn't free, I agree, but in this case a good chunk of the cost of writing them will have already been paid in a way.
jerf
For people who aren't getting the value of unit tests, this is my intro to the idea. You had to do some sort of testing on your code. At its core, the concept of unit testing is just, what if instead of throwing away that code, you kept it?
To the extent that other concerns get in the way of the concept, like the general difficulty of testing that GUIs do what they are supposed to do, I don't blame the concept of unit testing; I blame the techs that make the testing hard.
seanwilson
> Writing tests isn't free, I agree, but in this case a good chunk of the cost of writing them will have already been paid in a way.
Some examples that come to mind are bugs to do with UI interactions, visuals/styling, external online APIs/services, gaming/simulation stuff, and asynchronous/thread code, where it can be a substantial effort to write tests for, vs fixing the bug that might just be a typo. This is really different compared to if you're testing some pure functions that only need a few inputs.
It depends on what domain you're working in, but I find people very rarely mention how much work certain kinds of test can be to write, especially if there aren't similar tests written already and you have to do a ton of setup like mocking, factories, and UI automation.
n144q
You (or other people) will thank yourself in a few months/years when refactoring the code, knowing that they don't need to worry about missing edge cases, because all known edge cases are covered with these non regression tests.
seanwilson
There's really no situation you wouldn't write a test? Have you not had situations where writing the test would take a lot of effort vs the risk/impact of the bug it's checking for? Your test suite isn't going to be exhaustive anyway so there's always a balance of weighing up what's worth testing. Going overkill with tests can actually get in the way of refactoring as well when a change to the UI or API isn't done because it would require updating too many tests.
soco
What do you do with the years old bug fixes? How fast can one run the CI after a long while of accumulating tests? Do they still make sense to be kept in the long run?
gregthelaw
I would say yes, your CI should accumulate all of those regression tests. Where I work we now have many, many thousands of regression test cases. There's a subset to be run prior to merge which runs in reasonable time, but the full CI just cycles through.
For this to work all the regression tests must be fast, and 100% reliable. It's worth it though. If the mistake was made once, unless there's a regression test to catch it, it'll be made again at some point.
lelanthran
> For this to work all the regression tests must be fast,
Doesn't matter how fast it is: if you're continually adding tests for every single line of code introduced, eventually it will get so slow that you will want to prune away old tests.
jerf
I'm not particularly passionate about arguing the exact details of "unit" versus "integration" testing, let alone breaking down the granularity beyond that as some do, but I am passionate that they need to be fast, and this is why. By that, I mean, it is a perfectly viable use of engineering time to make changes that deliberately make running the tests faster.
A lot of slow tests are slow because nobody has even tried to speed them up. They just wrote something that worked, probably literally years ago, that does something horrible like fully build a docker container and fully initialize a complicated database and fully do an install of the system and starts processes for everything and liberally uses "sleep"-based concurrency control and so on and so forth, which was fine when you were doing that 5 times but becomes a problem when you're trying to run it hundreds of times, and that's a problem, because we really ought to be running it hundreds of thousands or millions of times.
I would love to work on a project where we had so many well-optimized automated tests that despite their speed they were still a problem for building. I'm sure there's a few out there, but I doubt it's many.
simmonmt
This is a great problem to have, if (IME) rare. Step 1 Understand the System helps you figure out when tests can be eliminated as no longer relevant and/or which tests can be merged.
hobs
Why would you want to stop knowing that your old bug fixes still worked in the context of your system?
Saying "oh its been good for awhile now" has nothing to do with breaking it in the future.
hsbauauvhabzb
I think for some types of bugs a CI test would be valuable if the developer believes regressions may occur, for other bugs they would be useless.
nonrandomstring
Yes, just more generally document it
I've lost count of how many things I've fixed only to see:
1) It recurs because a deeper "cause" of the bug reactivated it.
2) Nobody knew I fixed something so everyone continued to operate workarounds as if the bug was still there.
I realise these are related and arguably already fall under "You didn't fix it". That said a bit of writing-up and root-cause analysis after getting to "It's fixed!" seems helpful to others.
gnufx
Then, after successful debugging your job isn't finished. The outline of "Three Questions About Each Bug You Find" <http://www.multicians.org/thvv/threeq.html> is:
1. Is this mistake somewhere else also?
2. What next bug is hidden behind this one?
3. What should I do to prevent bugs like this?
knlb
I wrote a fairly similar take on this a few years ago (without having read the original book mentioned here) -- https://explog.in/notes/debugging.html
Julia Evans also has a very nice zine on debugging: https://wizardzines.com/zines/debugging-guide/
Hackbraten
I love Julia Evans’ zine! Bought several copies when it came out, gave some to coworkers and donated one to our office library.
fn-mote
The article is a 2024 "review" (really more of a very brief summary) of a 2002 book about debugging.
The list is fun for us to look at because it is so familiar. The enticement to read the book is the stories it contains. Plus the hope that it will make our juniors more capable of handling complex situations that require meticulous care...
The discussion on the article looks nice but the submitted title breaks the HN rule about numbering (IMO). It's a catchy take on the post anyway. I doubt I would have looked at a more mundane title.
bananapub
> The article is a 2024 "review"
2004.
In my experience, the most pernicious temptation is to take the buggy, non-working code you have now and to try to modify it with "fixes" until the code works. In my experience, you often cannot get broken code to become working code because there are too many possible changes to make. In my view, it is much easier to break working code than it is to fix broken code.
Suppose you have a complete chain of N Christmas lights and they do not work when turned on. The temptation is to go through all the lights and to substitute in a single working light until you identify the non-working light.
But suppose there are multiple non-working lights? You'll never find the error with this approach. Instead, you need to start with the minimal working approach -- possibly just a single light (if your Christmas lights work that way), adding more lights until you hit an error. In fact, the best case is if you have a broken string of lights and a similar but working string of lights! Then you can easily swap a test bulb out of the broken string and into the working chain until you find all the bad bulbs in the broken string.
Starting with a minimal working example is the best way to fix a bug I have found. And you will find you resist this because you believe that you are close and it is too time-consuming to start from scratch. In practice, it tends to be a real time-saver, not the opposite.