Every line is a potential bug (2009)
55 comments
·February 24, 2025deathanatos
> I wrote some code to get a message out of a hash table. The message was going to be put there by another thread. There was a small chance of a race where it wouldn’t be there yet when I initially looked for it. The code looked something like this:
while ((message = map.get(key)) == null && System.currentTimeMillis() < timeoutTime) {
wait(1000);
}
> The wait() call blocks the thread, waiting for the notifyAll() from the thread that puts the message into the map. The 1000 means one second. The timeout was going to be on the order of five seconds.> The above code is simple and correct
… it looks neither simple nor correct.
If this is multithreaded code, "small chance of a race" is a smell, and "notifyAll" is probably a smell. The multithreading primitive that I'd want to see here is a condition, so named because it permits a thread to wait for a particular condition. A condition wakes the "notified" thread with a mutex already locked, so that the condition ("is there a message in the table?") can be checked without interference from other threads. If there is a message, you can remove it, process it, etc. without races, and if there isn't, you can atomically release the lock and go back to sleep (which is again a thing the condition provides; it is not two separate calls).
ryandrake
The other surprising thing is that two code reviewers also evidently missed the opportunity to re-write it properly using thread synchronization primitives. Rather they argued about what to pass to wait(), except you shouldn't be using wait() or sleep() here at all!
pockmarked19
At least in 2025 they could slap it into Claude and produce a better review without engaging in…shudders *thought*.
ryandrake
Well the article was in 2009. Hopefully someone's taken them aside and explained synchronization objects to them by now.
pockmarked19
The first snippet of code makes me nauseous and the second makes me murderous, so the former is obviously better.
That being said, these people should all be miles away from mission critical code.
merb
That what was in my mind, the second I’ve seen the code. I would‘ be pushed so hard to make the while go away. In high level code, it’s simple wrong. Since his code is Java it looks even more suspicious because of the hashmap.
It’s basically a ‚work around’ bugfix.
sam_goody
So, how would you write that line?
null
Jtsummers
Throw in a mutex and a condition variable. Many languages and systems (including C++) with locks allow you to set a timeout (either a specific time to wake or a delay and then wake). Use a notify in the write thread to wake the reader.
https://en.cppreference.com/w/cpp/thread/condition_variable - C++11 so not available in this form to the blog writer, but an equivalent was.
IshKebab
Honestly the fact that a thread is even waiting for a map value to be filled in is a bit of a red flag. 99% of multithreaded code should be using channels or parallel for/map.
If you have to have a thread wait for a value from another then use a condition variable like he said.
https://docs.oracle.com/javase/8/docs/api/java/util/concurre...
dkarl
Lately I've been thinking about the airplane design analogy for writing software. Supposedly (possibly apocryphally) Bill Gates said that measuring software development progress by lines of code written was like measuring aircraft design progress by weight.
I think the analogy is a valuable one.
On an airplane, you might look for places where the airframe is stronger than it needs to be. If you can save material there, it'll be not just cheaper but lighter and more efficient. If the weight savings improves the flight characteristics of the aircraft, or if allows for the addition of safety features elsewhere, it might improve safety as well.
This is a really hard conversation to have in software, because when people think they can make something better with a lot of extra code, a lot of times they'll just do it, even if they know that other people think it's unnecessary. They put in the extra time on their own responsibility, thinking that the only downside is the initial work to write the code, and once it's done, it becomes an asset. It sucks to tell them that I'd rather not have it, that I'd rather have the easy version they could have done in an hour. I give in way more often than I'd like to.
codewench
Conversely, it's far better to have a slightly more verbose section of code that is immediately readable than a super slick one-liner that like two people can parse.
Maintainability is, in my opinion, just as important as other considerations when writing code. And often the more verbose approach is composed of simpler steps, meaning bugs can be easier to spot during code reviews
dkarl
Agreed. For readability, ignoring DRY, my hierarchy of preference is:
1. One reasonable line of code
2. Several lines of plain code
3. A function call, moving the implementation elsewhere
4. A gnarly one-liner
(The tough part about this is drawing the line between 1 & 4.)
yjftsjthsd-h
> This is a really hard conversation to have in software, because when people think they can make something better with a lot of extra code, a lot of times they'll just do it, even if they know that other people think it's unnecessary. They put in the extra time on their own responsibility, thinking that the only downside is the initial work to write the code, and once it's done, it becomes an asset. It sucks to tell them that I'd rather not have it, that I'd rather have the easy version they could have done in an hour. I give in way more often than I'd like to.
I feel like that's a perfect case for making software more modular? Whether it's runtime plugins or compile-time options (with stable APIs and keeping things separate), it's possible to let people who want a feature have it without affecting anyone else.
Retric
Modularity isn’t free. It’s useful for many things, but assuming it’s inherently valuable is how you end up with enterprise bloat.
Programming languages are inherently flexible, however when you have 2 features and a configuration option to select between them you now have 3 features.
yjftsjthsd-h
> Programming languages are inherently flexible, however when you have 2 features and a configuration option to select between them you now have 3 features.
Which is a good argument for separating things into plugins: If you have 1 stable API/ABI and 100 plugins, you have 1 feature.
DanielHB
The hardest part of keeping software sane is saying to no requirements.
I like to say two things about software:
Code is a liability
The best code is no code
umvi
The fastest code is code that never runs because it doesn't exist
fuzztester
it's also the most correct *, as well as the cheapest ;)
* "every line is a potential bug"
Okx
> The above code is simple and correct
Unfortunately not:
Using notifyAll and wait in Java is tricky to do correctly, you'd have a much easier chance using concurrency primitives that wrap these functions, like CountDownLatch[1] for one offs, or Condition[2].
System.currentTimeMillis is also not guaranteed to be monotonic, and System.nanoTime should be used instead. This would of course be obviated if a better concurrency primitive were to be used.
[1] https://docs.oracle.com/en/java/javase/21/docs/api/java.base...
[2] https://docs.oracle.com/en/java/javase/21/docs/api/java.base...
seanwilson
> There’s a small chance that the current time will have advanced by the time the subtraction is done, resulting in a negative value passed to wait() and an IllegalArgumentException getting thrown.
My general rule when there's a potential for race condition bugs like this is you should multiply your paranoia for looking for bugs by at least 10, and very heavily consider alternatives that aren't subject to race conditions because it's impractical to get everyone who might review, extend or maintain the code to be sufficiently paranoid.
For race conditions, it feels like you really have to experience the wide potential for bugs yourself during development to understand how unpredictable, unintuitive, and catastrophic the unhappy code paths can be.
When I'm editing or writing code that needs to be really robust, I'm usually thinking about the state space and control flow diagram in my head. If there's too many paths that aren't likely to be explored during regular manual testing, test suite runs, and normal user usage, that's a sign to do something like push back on if the requirements need to be this complex, or if there's some code improvements we can make like refactoring or better abstractions. Otherwise there's likely to be subtle bugs hiding somewhere. Potential race conditions in particular usually explode the state space.
eddythompson80
Sadly, this has become a fairly outdated view of how to do software. Tbh, I'm not sure if it ever really caught on, but there was sometime where it was gaining momentum. That's long gone though.
People will ask and expect every single part of your code to take in as much complexity as possible because "well, it makes sense". Take something as simple as a timeout value for a task.
- We need to add a timeout because list reasonable reasons to have timeouts
- Hey, so the timeout should only apply for external IO. Some tasks are just CPU intensive and that shouldn't cause the timeout to kick in. Like sure tasks usually take 10 minutes, but if it's taking more because of CPU, then we should let it.
- Hey, so we had an infinite loop bug in a task. So it was consuming a lot of CPU, but shouldn't have waited an hour to timeout. If there is no stdout, then its safe to assume it's stuck for other reasons.
- So this task runs ffmpeg, and there is no stdout, but it's actually running and should be fine. so check if ffmpeg process is running and handle it differently.
- Hey, so the timeout should be different for this one endpoint X. We know that it's slow, so we should increase the timeout only for that endpoint and not count it in the overall timeout.
- So that endpoint X has a fairly high timeout today. However, they had an outage yesterday and all our calls took hours to eventually timeout. We should have the timeout be adaptive depending on the outcome of previous attempts per endpoint.
- So the adaptive timeout logic kinda tripped over itself here. See, service Y is usually fast, so it's timeout is low. But last week they had an issue, and timeout was taking 20 minutes for all task. It looks like the adaptive-timeout-system took the timeout up to 20 minutes for it. But it shouldn't have. For service Y, it should have have only gone up to 2 minutes max. We need to add limits to the adaptive-timeout-system
- Can we determine these adaptive-timeout-system limits automatically? It's a pain in the ass to need to figure out what is the right timeout value for each service we call.
- So what if the value of the timeout came from an LLM. So we give the LLM the information about the current task, and what we know about the the previous calls, then the LLM suggests a value for the timeout.
swatcoder
Every line you don't write has consequences too.
It turns out that you just need to make reasoned assessments about how the system[*] you're building on works; what you need to make it do today, what you're likely to need to make it do tomorrow; how your work impacts other parts of the system today and tomorrow; what resource pressures your system is facing; etc
As written with its language of absolutes, the article gives horrible advice. If you were to follow it at every opportunity, you would write software that becomes more and more terribly brittle, terribly inefficient, terribly illegible, terrible incapable over time.
Our craft is one of engineering. It takes years of study and practice to learn to do it well. You never stop learning to take more into consideration and never overcome having to make hard, arguable decisions in light of both known and unknown unknowns. It just doesn't reduce to adages like this and you start building a wall in front of your own professional growth when start assuming it does.
--
[*] where system is inclusive of your skills and fluency, your development tools, your deployment environment, your team and their skills and preferences, your stakeholders and their requirements, your users, your budget of time and capital and mental energy, etc
kbutler
Thinking too small.
Every line is potentially /many/ bugs.
toddwprice
This is why you should just cram everything into one line.
ludston
I'm looking at this feeling anxiety about whether or not the implementation of `map.get` is threadsafe with memories of deadlocks in unsafe hashset implementations whilst accessing values during resizing.
When what they really wanted here for bug free multithreading is some kind of promise or async await abstraction.
lilyball
This is Java, the use of wait() and notifyAll() means the whole thing is running in a synchronized method/statement.
MathMonkeyMan
I like the author's original example the best, but:
while ((message = map.get(key)) == null) {
long now = System.currentTimeMillis();
if (now >= timeoutTime) {
break;
}
wait(timeoutTime - now);
}
If I had my way, then all "sleep" APIs would take a deadline time point, not a time duration. Also have the API do nothing if the deadline is passed. This way, getting the subtraction right need only happen in the lowest shared part of the library. Then it's a lot like the original example: while ((message = map.get(key)) == null) {
waitUntil(timeoutTime);
}
edit: Oops, my last example has a pretty bad bug. The author's point stands, and I need coffee.You could have waitUntil return true or false to indicate whether the wait occurred or not, but then that loses the opportunity to convey whether the thread was awakened due to an operation on the map.
do {
message = map.get(key);
} while (message == null && waitUntil(timeoutTime));
Author's original example still wins.rqtwteye
Waituntil would be a nice feature.
Jtsummers
https://en.cppreference.com/w/cpp/header/thread - sleep_for and sleep_until are both available in C++, though after this blog post was written (C++11). Also available in other languages.
sorokod
You do have to feel for the reviewer, who didn't introduce more lines of code by the way.
Magic number (1000) unrelated to the timeout constant must have triggered strong emotions.
I think that the moral of the story is a bit different.
All the smartest people are humble and realise the limits of their capability. as a result they automatically make decisions that reduce complexity. unfortunately software development is full of people who overestimate their capability with obvious results