Debugging: Indispensable rules for finding even the most elusive problems (2004)
211 comments
January 13, 2025 · hughdbrown
randerson
The quickest solution, assuming learning from the problem isn't the priority, might be to replace the entire chain of lights without testing any of them. I've been part of some elusive production issues where eventually 1-2 team members attempted a rewrite of the offending routine while everyone else debugged it, and the rewrite "won" and shipped to production before we found the bug. Heresy I know. In at least one case we never found the bug, because we could only dedicate a finite amount of time to a "fixed" issue.
hughdbrown
> The quickest solution, assuming learning from the problem isn't the priority, might be to replace the entire chain of lights without testing any of them.
So as a metaphor for software debugging, this is "throw away the code, buy a working solution from somewhere else." It may be a way to run a business, but it does not explain how to debug software.
yakshaving_jgt
I’m not sure it implies buying from elsewhere. I understood it to mean make a new one, rather than try to repair the broken one.
eru
Depending on what your routine looks like, you could run both on a given input, and see when / where they differ.
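A minimal sketch of that idea in Python — the two routines and the random-input generator here are placeholders, but the loop is the whole trick: feed both versions the same inputs until they disagree.

```python
import random

def old_routine(xs):
    # the suspect implementation (placeholder)
    return sorted(xs)[len(xs) // 2]

def new_routine(xs):
    # the rewritten implementation (placeholder)
    ys = sorted(xs)
    return ys[(len(ys) - 1) // 2]

def find_divergence(trials=10_000):
    """Run both versions on the same random inputs; return the first mismatch."""
    for _ in range(trials):
        xs = [random.randint(-100, 100) for _ in range(random.randint(1, 20))]
        a, b = old_routine(xs), new_routine(xs)
        if a != b:
            return xs, a, b
    return None  # no difference found within `trials` attempts

print(find_divergence())  # e.g. ([3, 9], 9, 3) -- these two differ on even-length input
```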
GuB-42
Rule 0: Don't panic
Really, that's important. You need to think clearly; deadlines and angry customers are a distraction. That's also when having a good manager who trusts you is important: his job is to shield you from all that so that you can devote all of your attention to solving the problem.
augbog
100% agree. I remember being on call when our PagerDuty started going off for a SEV-2, and naturally a lot of managers from affected teams were in there sweating bullets because their products/features/metrics were impacted. It can get pretty frustrating having so many people trying to be cooks in the kitchen. We had a great manager who literally just moved us to a different call/meeting and told us, "Ignore everything those people are saying; just stay focused and I'll handle them." Everyone's respect for our manager really went up from there.
Cerium
Slow is smooth and smooth is fast. If you don't have time to do it right, what makes you think there is time to do it twice?
ianmcgowan
There's a story in the book - on nuclear submarines there's a brass bar in front of all the dials and knobs, and the engineers are trained to "grab the bar" when something goes wrong rather than jumping right to twiddling knobs to see what happens.
throwawayfks
I read this book and took this advice to heart. I don't have a brass bar in the office, but when I'm about to push a button that could cause destructive changes, especially in prod, my hands reflexively fly up into the air while I double-check everything.
tetha
A weird yet effective recommendation from someone at my last job: if it's a destructive or dangerous action in prod, touch both your elbows first. This forces you to take your hands away from the keyboard, stop any possible auto-pilot, and look at what you're doing.
toolslive
"a good chess player sits on his hands" (NN). It's good advice as it prevents you from playing an impulsive move.
stronglikedan
I have to sit on my hands at the dentist to prevent impulse moves.
scubbo
Thank you for explaining that phrase! I couldn't find it with a quick Google.
Shorn
I always wanted that to be a true story, but I don't think it is.
adamc
I had a boss who used to say that her job was to be a crap umbrella, so that the engineers under her could focus on their actual jobs.
r-bryan
I once worked with a company that provided IM services to hyper competitive, testosterone poisoned options traders. On the first fine trading day of a January new year, our IM provider rolled out an incompatible "upgrade" to some DLL that we (our software, hence our customers) relied on, that broke our service. Our customers, ahem, let their displeasure be known.
Another developer and I were tasked with fixing it. The Customer Service manager (although one of the most conniving, politically destructive assholes I have ever not-quite worked with) actually carried a crap umbrella. Instead of constantly flaming us with how many millions of dollars our outage was costing every minute, he held up that umbrella and diverted the crap. His forbearance let us focus. He discreetly approached every 20 minutes, toes not quite entering the office, calmly inquiring how it was going. In just over an hour (between his visits 3 and 4), Nate and I had the diagnosis, the fix, and had rolled it out to production, to the relief of pension funds worldwide.
As much as I dislike the memory of that manager to this day, I praise his wisdom every chance I get.
louthy
I always say this too. But the real trick is knowing what to let thru. You can’t just shield your team from everything going on in the organisation. You’re all a part of the organisation and should know enough to have an opinion.
A better analogy is you’re there to turn down the noise. The team hears what they need to hear and no more.
Equally, the job of a good manager is to help escalate team concerns. But just as there’s a filter stopping the shit flowing down, you have to know what to flow up too.
airblade
At first I thought you meant an umbrella that doesn't work very well.
FearNotDaniel
Ah, the unintentional ambiguity of language, the reason there are so many lawyers in the world and why they are so expensive. The GP's phrasing is not incorrect but your comment made me realize: I only parsed it correctly the first time because I've heard managers use similar phrases so I recognized the metaphor immediately. But for the sake of reducing miscommunication, which sadly tends to trigger so many conflicts, I could offer a couple of disambiguatory alternatives:
- "her job was to be a crap-umbrella": hyphenate into a compound noun, implies "an umbrella of/for crap" to clarify the intended meaning
- "her job was to be a crappy umbrella": make the adjective explicit if the intention was instead to describe an umbrella that doesn't work well
dazzawazza
Ideally it's crap umbrellas all the way down. Everyone should be shielding everyone below them from the crap slithering its way down.
saghm
Agreed. Even as a relatively junior engineer at my first job, I realized that a certain amount of the job was not worth exposing interns to (like excessive amounts of Jira ticket wrangling) because it would take parts of their already limited time away from doing things that would benefit them far more. Unless there's quite literally no one "below" you, there's probably _something_ you can do to stop shit from flowing down past you.
generic92034
That must be the trickle-down effect everyone was talking about in the '80s. ;)
o_nate
A corollary to this is always have a good roll-back plan. It's much nicer to be able to roll-back to a working version and then be able to debug without the crisis-level pressure.
sroussey
Rollback ability is a must—it can be the most used mitigation if done right.
Not all issues can be fixed with a rollback though.
CobrastanJorji
I once worked for a team where, when a serious visible incident occurred, a company VP would pace the floor, occasionally yelling, describing how much money we were losing per second (or how much customer trust, if that number was too low) or otherwise communicating that we were in a battlefield situation and things were Very Critical.
Later I worked for a company with a much bigger and more critical website, and the difference in tone during urgent incidents was amazing. The management made itself available for escalations and took a role in externally communicating what was going on, but besides that they just trusted us to do our jobs. We could even go get a glass of water during the incident without a VP yelling at us. I hadn't realized until that point that being calm adults was an option.
bch
> Rule 0: Don’t panic
“Slow is smooth, smooth is fast.”
I know this and still get caught out sometimes…
Edit: hahaha ‘Cerium saying ~same thing: https://news.ycombinator.com/item?id=42683671
bilekas
This is very underrated. An extension to this: don't be afraid to break things further in order to probe. I often see a lot of devs, mid-level included, panic to the point that they don't even know where to start. I've come to believe that some people just have an inherent intuition and some just need to learn it.
jimmySixDOF
Yes, sometimes instinct takes over when you're on the spot in a pinch, but there are institutional things you can do to be prepared in advance that expand your set of options in the moment, much like a pre-prepared fire-drill playbook you can pull from. There are also training courses like Kepner-Tregoe. But you're right, there are just some people who do better than others when _it's hitting the fan.
nickjj
For #4 (divide and conquer), I've found `git bisect` helps a lot. If you have a known good commit and one of dozens or hundreds of commits after that is bad, this can help you identify the bad commit / code in a few steps.
Here's a walk through on using it: https://nickjanetakis.com/blog/using-git-bisect-to-help-find...
I jumped into a pretty big unknown code base in a live consulting call and we found the problem pretty quickly using this method. Without that, the scope of where things could be broken was too big given the context (unfamiliar code base, multiple people working on it, only able to chat with 1 developer on the project, etc.).
jerf
"git bisect" is why I maintain the discipline that all commits to the "real" branch, however you define that term, should all individually build and pass all (known-at-the-time) tests and generally be deployable in the sense that they would "work" to the best of your knowledge, even if you do not actually want to deploy that literal release. I use this as my #1 principle, above "I should be able to see every keystroke ever written" or "I want every last 'Fixes.' commit" that is sometimes advocated for here, because those principles make bisect useless.
The thing is, I don't even bisect that often... the discipline necessary to maintain that in your source code heavily overlaps with the disciplines to prevent code regression and bugs in the first place, but when I do finally use it, it can pay for itself in literally one shot once a year, because we get bisect out for the biggest, most mysterious bugs, the ones that I know from experience can involve devs staring at code for potentially weeks, and while I'm yet to have a bisect that points at a one-line commit, I've definitely had it hand us multiple-day's-worth of clue in one shot.
If I was maintaining that discipline just for bisect we might quibble with the cost/benefits, but since there's a lot of other reasons to maintain that discipline anyhow, it's a big win for those sorts of disciplines.
aag
Sometimes you'll find a repo where that isn't true. Fortunately, git bisect has a way to deal with failed builds, etc: three-value logic. The test program that git bisect runs can return an exit value that means that the failure didn't happen, a different value that means that it did, or a third that means that it neither failed nor succeeded. I wrote up an example here:
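For anyone who hasn't used it: with `git bisect run`, the exit code of your test script is that three-valued answer — 0 means the failure didn't happen, 125 means "can't test this commit, skip it", and any other code from 1-127 means the failure did happen. A sketch of such a script in Python (the build command and the reproducer script are placeholders):

```python
#!/usr/bin/env python3
# Usage:
#   git bisect start <bad-commit> <good-commit>
#   git bisect run ./check.py
import subprocess
import sys

# If this commit doesn't even build, tell bisect to skip it (exit 125).
build = subprocess.run(["make"], capture_output=True)
if build.returncode != 0:
    sys.exit(125)

# Run the reproducer: exit 0 if the failure didn't happen (good commit),
# exit 1 if it did (bad commit). Codes >= 128 would abort the bisect.
test = subprocess.run(["./reproduce_bug.sh"], capture_output=True)
sys.exit(0 if test.returncode == 0 else 1)
```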
Noumenon72
Very handy! I forgot about `git bisect skip`.
SoftTalker
I do bisecting almost as a last resort. I've used it when all else fails only a few times. Especially as I've never worked on code where it was very easy to just build and deploy a working debug system from a random past commit.
Edit to add: I will study old diffs when there is a bug, particularly for bugs that seem correlated with a new release. Asking "what has changed since this used to work?" often leads to an obvious cause or at least helps narrow where to look. Also asking the person who made those changes for help looking at the bug can be useful, as the code may be more fresh in their mind than in yours.
skydhash
Same. Every branch apart from the “real” one and release snapshots is transient and WIP. They don’t get merged back unless tests pass.
forrestthewoods
> why I maintain the discipline that all commits to the "real" branch, however you define that term, should all individually build and pass all (known-at-the-time) tests and generally be deployable in the sense that they would "work" to the best of your knowledge, even if you do not actually want to deploy that literal release
You’re spot on.
However it’s clearly a missing feature that Git/Mercurial can’t tag diffs as “passes” or “bisectable”.
This is especially annoying when you want to merge a stack of commits and the top passes all tests but the middle does not. It’s a monumental and valueless waste of time to fix the middle of the stack. But it’s required if you want to maintain bisectability.
It’s very annoying and wasteful. :(
michalsustr
This is why we use squash like here https://docs.gitlab.com/ee/user/project/merge_requests/squas...
Izkata
If there's a way to identify those incomplete commits, git bisect does support "skip" - a commit that's neither good nor bad, just ignored.
rlkf
Could you not use --first-parent option to test only at the merge-points?
smcameron
Back in the 1990s, while debugging some network configuration issue a wiser older colleague taught me the more general concept that lies behind git bisect, which is "compare the broken system to a working system and systematically eliminate differences to find the fault." This can apply to things other than software or computer hardware. Back in the 90s my friend and I had identical jet-skis on a trailer we shared. When working on one of them, it was nice to have its twin right there to compare it to.
yuliyp
The principle here, bisection, is a lot more general than just "git bisect" for identifying ranges of commits. It can also be used for partitioning the space of systems. For instance, if a workflow with 10 steps is broken, can you perform some tests to confirm that 5 of the steps functioned correctly? Can you figure out that it's definitely not a hardware issue (or definitely a hardware issue) somewhere?
This is critical to apply in cases where the problem might not even be caused by a code commit in the repo you're bisecting!
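A sketch of that generalisation in Python. Here `state_ok_after` stands in for whatever per-system check you can actually perform (inspect an intermediate file, query a table, ping a host...), and the binary search assumes the input is good, the final output is broken, and once a step breaks things every later check also fails:

```python
def first_bad_step(num_steps, state_ok_after):
    """Binary-search for the first pipeline step whose output is broken.

    `state_ok_after(k)` must return True if the output is still correct
    after running steps 1..k, and False otherwise (a hypothetical check
    you implement per system).
    """
    lo, hi = 0, num_steps  # invariant: ok after step `lo`, broken after step `hi`
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if state_ok_after(mid):
            lo = mid
        else:
            hi = mid
    return hi

# Toy example: pretend step 7 of 10 corrupts the data.
print(first_bad_step(10, lambda k: k < 7))  # -> 7
```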
Icathian
Tacking on my own article about git bisect run. It really is an amazing little tool.
tetha
You can also use divide and conquer when dealing with a complex system.
Like, traffic going from A to B can turn ... complicated with VPNs and such. You kinda have source firewalls, source routing, connectivity of the source to a router, routing on the router, firewalls on the router, various VPN configs that can go wrong, and all of that on the destination side as well. There can easily be 15+ things that can cause the traffic to disappear.
That's why our runbook recommends starting troubleshooting by dumping traffic on the VPN nodes. That's a very low-effort, quick step to figure out which of the six-ish legs of the journey drops traffic - to VPN, through VPN, to destination, back to VPN node, back through VPN, back to source. Then you realize traffic back to the VPN node disappears and you can dig into that.
And this is a powerful concept to think through in system troubleshooting: Can I understand my system as a number of connected tubes, so that I have a simple, low-effort way to pinpoint one tube to look further into?
As another example, for many services, the answer here is to look at the requests on the loadbalancer. This quickly isolates which services are throwing errors blowing up requests, so you can start looking at those. Or, system metrics can help - which services / servers are burning CPU and thus do something, and which aren't? Does that pattern make sense? Sometimes this can tell you what step in a pipeline of steps on different systems fails.
jvans
git bisect is an absolute power feature everybody should be aware of. I use it maybe once or twice a year at most but it's the difference between fixing a bug in an hour vs spending days or weeks spinning your wheels
epolanski
Bisection is also useful when debugging css.
When you don't know what is breaking that specific scroll or layout somewhere in the page, you can just remove half the DOM in the dev tools and check if the problem is still there.
Rinse and repeat, it's a basic binary search.
I am often surprised that leetcode black belts are absolutely unable to apply what they learn in the real world, neither in code nor in debugging, which always reminds me what a useless metric it is for hiring engineers.
rozap
Binary search rules. Being systematic about dividing the problem in half, determining which half the issue is in, and then repeating applies to non software problems quite well. I use the strategy all the time while troubleshooting issue with cars, etc.
manhnt
> Make it fail: Do it again, start at the beginning, stimulate the failure, don't simulate the failure, find the uncontrolled condition that makes it intermittent, record everything and find the signature of intermittent bugs
Unfortunately, I've found that many times this is actually the most difficult step. I've lost count of how many times our QA reported an intermittent bug in their env, only to never be able to reproduce it again in the lab. Until it hits 1 or 2 customers in the field, but then when we try to take a look at the customer's env, it's gone and we don't know when it could come back again.
qwertox
Make sure you're editing the correct file on the correct machine.
ZedZark
Yep, this is a variation of "check the plug"
I find myself doing this all the time now: I will temporarily add a line to cause a fatal error, to check that it's the right file (and, depending on the situation, also the right line).
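In Python, for instance, that tripwire can be as crude as a line dropped right next to the spot you're about to change, and removed again straight after:

```python
# Temporary tripwire -- if the program *doesn't* die here, the file (or line)
# you're editing isn't the one that's actually running.
raise RuntimeError("editing the right file? then you should see this crash")
```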
overhead4075
This is also covered by "make it fail"
shmoogy
I'm glad I'm not the only one doing this after I wasted too much time trying to figure out why my docker build was not reflecting the changes ... never again..
eddd-ddde
How much time I've wasted unknowingly editing generated files, out of version files, forgetting to save, ... only god knows.
netcraft
The biggest thing I've always told myself and anyone I've taught: make sure you're running the code you think you're running.
chupasaurus
Poor Yorick!
ajuc
That's why you make it break differently first. To see your changes have any effect.
snowfarthing
When working on a test that has several asserts, I have adopted the process of adding one final assert, "assert 'TEST DEBUGGED' is False", so that even when I succeed, the test fails -- and I could review to consider if any other tests should be added or adjusted.
Once I'm satisfied with the test, I remove the line.
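In pytest, for example, that sentinel might look like the last line below (written here as `assert False, ...` to avoid the literal-comparison warning; the function under test is invented for illustration):

```python
import pytest

def add_tax(subtotal, rate=0.1):
    # stand-in for the real code under test
    return subtotal * (1 + rate)

def test_add_tax():
    assert add_tax(100) == pytest.approx(110)
    assert add_tax(0) == 0
    # Sentinel: keeps the test red while I'm still deciding whether more
    # asserts are needed; deleted once the test is final.
    assert False, "TEST DEBUGGED"
```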
spacebanana7
Also that it's in the correct folder
reverendsteveii
very first order of business: git stash && git checkout main && git pull
n144q
...and you are building and running the correct clone of a repository
heikkilevanto
Some additional rules:
- "It is your own fault". Always suspect your own code changes before anything else. It can be a compiler bug or even a hardware error, but those are very rare.
- "When you find a bug, go back and hunt down its family and friends". Think where else the same kind of thing could have happened, and check those.
- "Optimize for the user first, the maintenance programmer second, and last if at all for the computer".
physicles
The first one is known in the Pragmatic Programmer as “select isn’t broken.” Summarized at https://blog.codinghorror.com/the-first-rule-of-programming-...
bsammon
Alternatively, I've found the "Maybe it's a bug. I'll try and make a test case I can report on the mailing list" approach useful at times.
Usually, in the process of reducing my error-generating code down to a simpler case, I find the bug in my logic. I've been fortunate that heisenbugs have been rare.
Once or twice, I have ended up with something to report to the devs. Generally, those were libraries (probably from sourceforge/github) with only a few hundred or fewer users that did not get a lot of testing.
wormlord
I always have the mindset of "its my fault". My Linux workstation constantly crashing because of the i9-13900k in it was honestly humiliating. Was very relieved when I realized it was the CPU and not some impossible to find code error.
dehrmann
Linux is like an abusive relationship in that way--it's always your fault.
astrobe_
About "family and friends", a couple of times by fixing minor and a priori unrelated side issues, it revealed the bug I was after.
ajuc
It's healthier to assume your code is wrong than otherwise. But it's best to simply bisect the cause-effect chain a few more times and be sure.
nox101
> #1 Understand the system: Read the manual, read everything in depth, know the fundamentals, know the road map, understand your tools, and look up the details.
Maybe I'm misunderstanding, but "Read the manual, read everything in depth" sounds like: oh, I have a bug in my code, so first read the entire manual of the library I'm using, all 700 pages, then read 7 books on the library details, and now that a month or two has passed, go look at the bug.
I'd be curious if there's a single programmer that follows this advice.
feoren
Essentially yes, that's correct. Your mistake is thinking that the outcome of those months of work is being able to kinda-probably fix one single bug. No: the point of all that effort is to truly fix all the bugs of that kind (or as close to "all" as is feasible), and to stop writing them in the first place.
The alternative is paradropping into an unknown system with a weird bug, messing randomly with things you don't understand until the tests turn green, and then submitting a PR and hoping you didn't just make everything even worse. It's never really knowing whether your system actually works or not. While I understand that is sometimes how it goes, doing that regularly is my nightmare.
P.S. if the manual of a library you're using is 700 pages, you're probably using the wrong library.
earnestinger
> if the manual of a library you're using is 700 pages, you're probably using the wrong library.
Statement bordering on papyrophobia. (Seems that is a real phobia)
earnestinger
I think we have a bit different interpretations here.
> read everything in depth
Is not necessarily
> first read the entire manual of the library I'm using, all 700 pages
If I have a problem with “git bisect”, I can just go to Stack Overflow, try several snippets and see what sticks, or I can also go to https://git-scm.com/docs/git-bisect to get a bit deeper knowledge on the topic.
adolph
This was written in 2004, the year of Google's IPO. Atwood and Spolsky didn't found Stack Overflow until 2008. [0] People knew books by nicknames like the "Camel book" [1] and generally just knew things.
0. https://stackoverflow.blog/2021/12/14/podcast-400-an-oral-hi...
1. https://www.perl.com/article/extracting-the-list-of-o-reilly...
scudsworth
great strawman, guy that refuses to read documentation
mox1
I mean, he has a point. Things are incredibly complex nowadays; I don't think most people have time to "understand the system."
I would be much more interested in rules that don't start with that... Like "Rules for debugging when you don't have the capacity to fully understand every part of the system."
Bisecting is a great example here. If you are Bisecting, by definition you don't fully understand the system (or you would know which change caused the problem!)
sitkack
If folks want to instill this mindset in their kids, themselves or others I would recommend at least
The Martian by Andy Weir https://en.wikipedia.org/wiki/The_Martian_(Weir_novel)
https://en.wikipedia.org/wiki/Zen_and_the_Art_of_Motorcycle_...
https://en.wikipedia.org/wiki/The_Three-Body_Problem_(novel)
To Engineer Is Human - The Role of Failure in Successful Design By Henry Petroski https://pressbooks.bccampus.ca/engineeringinsociety/front-ma...
https://en.wikipedia.org/wiki/Surely_You%27re_Joking,_Mr._Fe...!
nextlevelwizard
What part of Three Body Problem has anything to do with debugging or troubleshooting or any form of planning?
Things just sort of happen with wild leaps of logic. The book is actually a fantasy book with thinnest layer of science babble on top.
deepspace
Agree. I think Zen and the Art of Motorcycle Maintenance encapsulates the art of troubleshooting the best. Especially the concept of "gumption traps":
"What you have to do, if you get caught in this gumption trap of value rigidity, is slow down...you're going to have to slow down anyway whether you want to or not...but slow down deliberately and go over ground that you've been over before to see if the things you thought were important were really important and to -- well -- just stare at the machine. There's nothing wrong with that. Just live with it for a while. Watch it the way you watch a line when fishing and before long, as sure as you live, you'll get a little nibble, a little fact asking in a timid, humble way if you're interested in it. That's the way the world keeps on happening. Be interested in it."
Words to live by
sitkack
Yeah, it is probably what I would start with and all the messages the book is sending you will resurface in the others. You have to cultivate that debugging mind, but once it starts to grow, it can't be stopped.
david_draco
Step 10, add the bug as a test to the CI to prevent regressions? Make sure the CI fails before the fix and works after the fix.
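A sketch of what that step might look like in practice with pytest — the module, function, and ticket number are invented for illustration:

```python
# test_regressions.py -- one test per fixed bug, named after the ticket.
from mylib import parse_duration  # hypothetical function that had the bug

def test_issue_1234_empty_string_returns_zero():
    # Before the fix, parse_duration("") raised IndexError.
    # Written so it fails on the unfixed code and passes after the fix;
    # CI then keeps it forever so the bug can't quietly come back.
    assert parse_duration("") == 0
```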
Tade0
The largest purely JavaScript repo I ever worked on (150k LoC) had this rule and it was a life saver, particularly because the project had commits dating back more than five years and, since it was a component/library, it had quite a few strange hacks for IE.
seanwilson
I don't think this is always worth it. Some tests can be time consuming or complex to write, have to be maintained, and we accept that a test suite won't be testing all edge cases anyway. A bug that made it to production can mean that particular bug might happen again, but it could be a silly mistake and no more likely to happen again than 100s of other potential silly mistakes. It depends, and writing tests isn't free.
Ragnarork
Writing tests isn't free but writing non-regression tests for bugs that were actually fixed is one of the best test cases to consider writing right away, before the bug is fixed. You'll be reproducing the bug anyway (so already consider how to reproduce). You'll also have the most information about it to make sure the test is well written anyway, after building a mental model around the bug.
Writing tests isn't free, I agree, but in this case a good chunk of the cost of writing them will have already been paid in a way.
jerf
For people who aren't getting the value of unit tests, this is my intro to the idea. You had to do some sort of testing on your code. At its core, the concept of unit testing is just, what if instead of throwing away that code, you kept it?
To the extent that other concerns get in the way of the concept, like the general difficulty of testing that GUIs do what they are supposed to do, I don't blame the concept of unit testing; I blame the techs that make the testing hard.
seanwilson
> Writing tests isn't free, I agree, but in this case a good chunk of the cost of writing them will have already been paid in a way.
Some examples that come to mind are bugs to do with UI interactions, visuals/styling, external online APIs/services, gaming/simulation stuff, and asynchronous/thread code, where it can be a substantial effort to write tests for, vs fixing the bug that might just be a typo. This is really different compared to if you're testing some pure functions that only need a few inputs.
It depends on what domain you're working in, but I find people very rarely mention how much work certain kinds of test can be to write, especially if there aren't similar tests written already and you have to do a ton of setup like mocking, factories, and UI automation.
n144q
You (or other people) will thank yourself in a few months/years when refactoring the code, knowing that they don't need to worry about missing edge cases, because all known edge cases are covered with these non regression tests.
seanwilson
There's really no situation you wouldn't write a test? Have you not had situations where writing the test would take a lot of effort vs the risk/impact of the bug it's checking for? Your test suite isn't going to be exhaustive anyway so there's always a balance of weighing up what's worth testing. Going overkill with tests can actually get in the way of refactoring as well when a change to the UI or API isn't done because it would require updating too many tests.
soco
What do you do with the years old bug fixes? How fast can one run the CI after a long while of accumulating tests? Do they still make sense to be kept in the long run?
gregthelaw
I would say yes, your CI should accumulate all of those regression tests. Where I work we now have many, many thousands of regression test cases. There's a subset to be run prior to merge which runs in reasonable time, but the full CI just cycles through.
For this to work all the regression tests must be fast, and 100% reliable. It's worth it though. If the mistake was made once, unless there's a regression test to catch it, it'll be made again at some point.
lelanthran
> For this to work all the regression tests must be fast,
Doesn't matter how fast it is: if you're continually adding tests for every single line of code introduced, eventually it will get so slow that you will want to prune away old tests.
jerf
I'm not particularly passionate about arguing the exact details of "unit" versus "integration" testing, let alone breaking down the granularity beyond that as some do, but I am passionate that they need to be fast, and this is why. By that, I mean, it is a perfectly viable use of engineering time to make changes that deliberately make running the tests faster.
A lot of slow tests are slow because nobody has even tried to speed them up. They just wrote something that worked, probably literally years ago, that does something horrible like fully build a docker container and fully initialize a complicated database and fully do an install of the system and starts processes for everything and liberally uses "sleep"-based concurrency control and so on and so forth, which was fine when you were doing that 5 times but becomes a problem when you're trying to run it hundreds of times, and that's a problem, because we really ought to be running it hundreds of thousands or millions of times.
I would love to work on a project where we had so many well-optimized automated tests that despite their speed they were still a problem for building. I'm sure there's a few out there, but I doubt it's many.
simmonmt
This is a great problem to have, if (IME) rare. Step 1 Understand the System helps you figure out when tests can be eliminated as no longer relevant and/or which tests can be merged.
hobs
Why would you want to stop knowing that your old bug fixes still worked in the context of your system?
Saying "oh its been good for awhile now" has nothing to do with breaking it in the future.
hsbauauvhabzb
I think for some types of bugs a CI test would be valuable if the developer believes regressions may occur, for other bugs they would be useless.
nonrandomstring
Yes, just more generally document it
I've lost count of how many things I've fixed only to see:
1) It recurs because a deeper "cause" of the bug reactivated it.
2) Nobody knew I fixed something so everyone continued to operate workarounds as if the bug was still there.
I realise these are related and arguably already fall under "You didn't fix it". That said a bit of writing-up and root-cause analysis after getting to "It's fixed!" seems helpful to others.
gnufx
Then, after successful debugging your job isn't finished. The outline of "Three Questions About Each Bug You Find" <http://www.multicians.org/thvv/threeq.html> is:
1. Is this mistake somewhere else also?
2. What next bug is hidden behind this one?
3. What should I do to prevent bugs like this?
knlb
I wrote a fairly similar take on this a few years ago (without having read the original book mentioned here) -- https://explog.in/notes/debugging.html
Julia Evans also has a very nice zine on debugging: https://wizardzines.com/zines/debugging-guide/
Hackbraten
I love Julia Evans’ zine! Bought several copies when it came out, gave some to coworkers and donated one to our office library.
fn-mote
The article is a 2024 "review" (really more of a very brief summary) of a 2002 book about debugging.
The list is fun for us to look at because it is so familiar. The enticement to read the book is the stories it contains. Plus the hope that it will make our juniors more capable of handling complex situations that require meticulous care...
The discussion on the article looks nice but the submitted title breaks the HN rule about numbering (IMO). It's a catchy take on the post anyway. I doubt I would have looked at a more mundane title.
bananapub
> The article is a 2024 "review"
2004.
In my experience, the most pernicious temptation is to take the buggy, non-working code you have now and to try to modify it with "fixes" until the code works. In my experience, you often cannot get broken code to become working code because there are too many possible changes to make. In my view, it is much easier to break working code than it is to fix broken code.
Suppose you have a complete chain of N Christmas lights and they do not work when turned on. The temptation is to go through all the lights and to substitute in a single working light until you identify the non-working light.
But suppose there are multiple non-working lights? You'll never find the error with this approach. Instead, you need to start with the minimal working approach -- possibly just a single light (if your Christmas lights work that way), adding more lights until you hit an error. In fact, the best case is if you have a broken string of lights and a similar but working string of lights! Then you can easily swap a test bulb out of the broken string and into the working chain until you find all the bad bulbs in the broken string.
Starting with a minimal working example is the best way to fix a bug I have found. And you will find you resist this because you believe that you are close and it is too time-consuming to start from scratch. In practice, it tends to be a real time-saver, not the opposite.