3,200% CPU Utilization

96 comments · February 28, 2025

lucianbr

> I always thought of race conditions as corrupting the data or deadlocking. I never thought it could cause performance issues. But it makes sense, you could corrupt the data in a way that creates an infinite loop.

Food for thought. I often think to myself that any error or strange behavior or even warnings in a project should be fixed as a matter of principle, as they could cause seemingly unrelated problems. Rarely is this accepted by whoever chooses what we should work on.
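
To make the quoted failure mode concrete, here's a minimal sketch (not the article's actual code) of racing put() calls on an unsynchronized java.util.TreeMap. The outcome is undefined; it may or may not reproduce on a given run and JVM, but when it does, the corrupted red-black links can make later operations loop forever:

  import java.util.TreeMap;

  // Demonstration only: a data race on a non-thread-safe TreeMap.
  public class TreeMapRace {
    public static void main(String[] args) throws InterruptedException {
      TreeMap<Integer, Integer> map = new TreeMap<>();
      Runnable writer = () -> {
        for (int i = 0; i < 1_000_000; i++) {
          map.put(i % 1024, i); // unsynchronized concurrent put: a data race
        }
      };
      Thread t1 = new Thread(writer);
      Thread t2 = new Thread(writer);
      t1.start(); t2.start();
      t1.join(); t2.join();                // may never return if the tree is corrupted
      System.out.println(map.size());      // if the threads ever finish at all
    }
  }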

williamdclt

It's a decent rule of thumb, but it definitely needs some pragmatism. Squashing every error, strangeness, and warning can be very expensive in some projects, much more so than paying for the occasional seemingly-unrelated problem.

But of course it's quasi-impossible to know in advance the likelihood of a given error contributing to a future problem, and whether it's cheaper to fix the error ahead of time or let the problem happen. So deciding what to focus on becomes an art more than a science.

"Fix nothing" is certainly a horrible approach, and "fix everything" is often impractical. So you either need some sort of decision framework, or a combination of decent "instinct" (experience, really) and trust from your stakeholders (which comes from many places, including good communication and a track record of being pragmatic over dogmatic).

alex_smart

> Squashing any error, strangeness and warning can be very expensive in some projects

Strongly disagreed. Strange, unexpected behaviour of code is a warning sign that you have fallen short in defensive programming and no longer have a mental model of your code that corresponds with reality. That is a very dangerous place to be in; it's very easy to find yourself stuck in quicksand not too far afterwards.

Ntrails

I remember a project to get us down from something like 1500 build warnings to sub-100. It took a long time, generated plenty of bikeshedding, and its value was impossible to demonstrate.

I, personally, was mostly just pissed we didn't get it to zero. Unsurprisingly, the number has climbed back up since.

TYMorningCoffee

Could you propose failing the build based on the number of warnings, to ensure it doesn't go up?

I did something similar with SpotBugs. There were existing warnings I couldn't get time to fix, so I configured the Maven build to fail if the count exceeded the level at which I enabled it.

This has the unfortunate side effect that if it drops and no one adjusts the threshold then people can add more issues without failing the build.
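
If I'm remembering the plugin correctly, the check goal of spotbugs-maven-plugin takes a maxAllowedViolations parameter for exactly this ratchet; treat the snippet below as a sketch rather than gospel, and the count is a placeholder:

  <plugin>
    <groupId>com.github.spotbugs</groupId>
    <artifactId>spotbugs-maven-plugin</artifactId>
    <executions>
      <execution>
        <goals><goal>check</goal></goals>
        <configuration>
          <!-- fail the build only above the count present at enablement -->
          <maxAllowedViolations>137</maxAllowedViolations>
        </configuration>
      </execution>
    </executions>
  </plugin>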

Forricide

Fixing everything is impractical, but I'd say a safer rule of thumb would be to at least understand small strangenesses/errors. In the case of things that are hard to fix - e.g. design/architectural decisions that lead to certain performance issues or what have you - it's still usually not too time consuming to get a basic understanding of why something is happening.

Still better to quash small bugs and errors where possible, but at least if you know why they happen, you can prevent unforeseen issues.

munk-a

Sometimes it can take a serious effort to understand why a problem is happening, and when dealing with third-party software I'll accept an unknown blip that can be corrected by occasionally hitting a reset button. My experience aligns with yours, though - it's worth understanding why an error happens in something you've written. The times we've delayed dealing with mysterious errors that nobody on the team could explain, we've ended up with a much larger problem when we finally found the resources to deal with them.

Nobody wants to triage an issue for eight weeks, but one thing to keep in mind is that the more difficult an issue is to triage, the more ignorance about the system that process is revealing - if your most knowledgeable team members are unable to even triage an issue in a modest amount of time, it reveals that they have large knowledge gaps when it comes to comprehending your system.

This, at least, goes for a vague comprehension of the cause - there are times you'll know approximately what's going wrong but may get a question from the executive suite about the problem (i.e. "Precisely how many users were affected by the outage that caused us to lose our access_log") that might take weeks or months or be genuinely nigh-on-impossible to answer - I don't count questions like that as part of issue diagnosis. And if it's a futile question you should be highly defensive about developer time.

gamedever

If there are any warnings I'm supposed to ignore, then there are effectively no warnings.

There's nothing pragmatic about it. Once I get into the habit of ignoring a few warnings, that effectively means all warnings will be ignored.

connicpu

At my job we treat all warnings as errors and you can't merge your pull requests unless all automatically triggered CI pipelines pass. It requires discipline, but once you get it into that state it's a lot easier to keep it there.

berkes

The last point is the key.

It then creates immense value by avoiding a lot of risk and uncertainty for little effort.

Getting from "thousands of warnings" to zero isn't a good ROI in many cases, certainly not in the short term. But staying at zero is nearly free.

This is even more so with those "fifteen flickering tests", or the 23 tests that have been failing and been ignored or skipped for years.

It's also why I commonly set up CI, testing systems, linters, and continuous deployment before anything else. I'll most often have an entire CI pipeline, guidelines, and build automation in place to deploy something that only says "hello world", because it's much easier to keep it green, clean, and automated than to move there later on.

Calamitous

That's because it moves from being a project to being a process. I've tried to express this at my current job.

They want to take time out to write a lot of unit tests, but they're not willing to change the process to allow/expect devs to add unit tests along with each feature they write.

I'll be surprised if all the tests are still passing two months after this project, since nobody runs them.

reaperducer

> At my job we treat all warnings as errors and you can't merge your pull requests unless all automatically triggered CI pipelines pass. It requires discipline, but once you get it into that state it's a lot easier to keep it there.

Sounds like what we used to call "professionalism." That was before "move fast, break things and blame the user" became the norm.

LeifCarrotson

It very much depends on the nature of your work.

If manual input can generate undefined behavior, if you depend on a human making a correct decision, or if you're dealing with real-world behavior using incomplete sensors to generate a model... sometimes, the only reasonable target is "fail gracefully". You cannot expect to generate right outputs from wrong inputs. It's not wrong to blame the user when economics, not just laziness, says that you need to trust the user not to do something unimaginable.

I think this is the kind of situation where a little professionalism would have prevented the issue: Handling uncaught exceptions in your threadpool/treemap combo would have prevented the problem from happening.
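
A sketch of what that guard can look like in Java. Note the asymmetry in ExecutorService: submit() quietly parks the throwable in the returned Future, while execute() routes it to the thread's uncaught-exception handler:

  import java.util.concurrent.ExecutorService;
  import java.util.concurrent.Executors;

  public class LoudPool {
    public static void main(String[] args) {
      ExecutorService pool = Executors.newFixedThreadPool(4, runnable -> {
        Thread t = new Thread(runnable);
        // Make task failures loud instead of silently killing the worker.
        t.setUncaughtExceptionHandler((thread, e) ->
            System.err.println("Task died on " + thread.getName() + ": " + e));
        return t;
      });
      // execute() (unlike submit()) forwards the throwable to the handler.
      pool.execute(() -> { throw new IllegalStateException("boom"); });
      pool.shutdown();
    }
  }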

dheera

> That was before "move fast, break things and blame the user" became the norm.

When VCs only give you effectively 9 months of working runway (3 months of coffees, 9 months of actually getting work done, 3 more months of coffees to get the next round, and 3 more months because all the VCs backed out because your demo wasn't good enough), move fast and break things is how things are done.

If giving startups 5 years of runway was the norm, then yeah, we could all be professional.

stickfigure

The other thing is: don't catch and ignore exceptions. Even "catch and log" is a bad idea unless you specifically know that program execution is safe to continue. Just let the exception propagate up to where something useful can be done, like returning a 500 or displaying an error dialog.
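
Roughly (a sketch; updateSharedIndex, record, and log are hypothetical names):

  // Anti-pattern: the caller continues with state it can no longer trust.
  try {
    updateSharedIndex(record);
  } catch (Exception e) {
    log.warn("update failed", e); // swallowed; execution carries on anyway
  }

  // Better: let it propagate to a boundary that can do something useful,
  // e.g. a framework-level handler that fails just this request with a 500.
  updateSharedIndex(record);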

temporallobe

Yes, but…I suppose you have to pick your battles. There was recently a problem vexing me about a Rails project I maintain where the logs were filled with complaints about “unsupported parameters”, even though we painstakingly went through all the controllers and allowed them. It’s /probably/ benign, but it adds a lot of noise to the logs. Several of us took a stab at resolving it, but in the end we always had much higher priorities to address. Also it’s hard to justify spending so many hours on something that has little “business value”, especially when there is currently no functional impact.

It’s a nuisance issue, sorta like hemorrhoids. Do you get the surgery and suffer weeks of extreme pain, or do you just deal with it? Hemorrhoids are mostly benign, but they certainly have the potential to become more severe and very problematic. Maybe this phenomenon should be called digital hemorrhoids?

brirec

As someone with pretty bad hemorrhoids, I’m hesitant to ask my doctor about surgery because I’ve been told the hemorrhoids will come back, without question. So it’s even still just a temporary fix…

colechristensen

Couldn’t you just run a debugger to find all of the occurrences of that issue?

temporallobe

We’ve been down many paths on this. In some cases we know exactly where it’s happening, but despite configuring everything correctly, it still complains. It might just be a bug in the Rails code or a fault in the way parameters are passed in (some of the endpoints take a lot of parameters, some of them optional). We could “fix” the issue by simply allowing all parameters, but of course this opens a security risk. This is a 10+ year old code base and I am told it has been a thorn in their side for a long time. It’s one of those battles that I suppose we are not going to try fighting unless we get really bored and have nothing else to work on.

bornfreddy

Also, stack trace should show you everything you need to know to fix this, or am I missing something? (no experience with Ruby)

Otherwise, I see the cleanups and refactoring as part of normal development process. There is no need to put such tasks in Jira - they must be done as preparation for the regular tasks. I can imagine that some companies take agile too seriously and want to micromanage every little task, but I guess lack of time for refactoring is not the biggest problem.

swatcoder

> Rarely is this accepted by whoever chooses what we should work on.

You need to find more disciplined teams.

There are still people out there who care about correctness and understand how to achieve it without it being an expensive distraction. It's a team-culture factor that mostly just involves staying on top of these concerns as soon as they're encountered, so there's not some insurmountable and inscrutable backlog that makes it feel daunting and hopeless or that makes prioritization difficult.

saulpw

Most teams are less disciplined than they should be. Also, job/team mobility is very low right now. So the question becomes, how do you increase discipline on the team you're on?

rapind

For very small teams, exploring new platforms and/or languages that complement correctness is an option. Using a statically typed language with explicitly managed side effects has made a huge difference for me. Super disruptive the larger the team is, though, of course.

Retr0id

And from a security perspective, the "might cause a problem 0.000001% of the time" flaws can often be manipulated into becoming a problem 100% of the time.

sitkack

All security issues are a subclass of bugs. Security is a niche version of QA.

TYMorningCoffee

> Food for thought. I often think to myself that any error or strange behavior or even warnings in a project should be fixed as a matter of principle, as they could cause seemingly unrelated problems. Rarely is this accepted by whoever chooses what we should work on.

I agree. I hate lingering warnings. Unfortunately, at the time of this bug I did not have static analysis tools to detect these code smells.

berkes

Another problem with lingering warnings is that it's easy to overlook that one new warning that's actually important amongst floods of older warnings.

Someone

FTA: The code can be reduced to simply:

  public void someFunction(SomeType relatedObject,
                           List<SomeOtherType> unrelatedObjects) {
    ...
    treeMap.put(relatedObject.a(), relatedObject.b());
    ...
    // unrelatedObjects is used later on in the function so the
    // parameter cannot be removed
  }
That’s not true. The original code only does the treeMap.put if unrelatedObjects is nonempty. That may or may not be a bug.

You also would have to check that a and b return the same value every time, and that treeMap behaves like a map. If, for example, it logs updates, you’d have to check that changing that to log only once is acceptable.

TYMorningCoffee

Good point. It should be replaced with an if-not-empty check.
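
Something like this, preserving the original's only-when-nonempty behavior in case it was intentional:

  public void someFunction(SomeType relatedObject,
                           List<SomeOtherType> unrelatedObjects) {
    ...
    if (!unrelatedObjects.isEmpty()) {
      treeMap.put(relatedObject.a(), relatedObject.b());
    }
    ...
  }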

scottlamb

> Could an unguarded TreeMap cause 3,200% utilization?

I've seen the same thing with an undersynchronized java.util.HashMap. This would have been in like 2009, but afaik it can still happen today. iirc HashMap uses chaining to resolve collisions; my guess was it introduced a cycle in the chain somehow, but I just got to work nuking the bad code from orbit [1] rather than digging in to verify.

I often interview folks on concurrency knowledge. If they think a data race is only slightly bad, I'm unimpressed, and this is an example of why.

[1] This undersynchronization was far from the only problem with that codebase.

layer8

Another way to get infinite loops is using a Comparator or Comparable implementation that doesn’t implement a consistent total order: https://stackoverflow.com/questions/62994606/concurrentskips... (This is unrelated to concurrency.)

Whether it occurs or not can depend on the specific data being processed, and the order in which it is being processed. So this can happen in production after seemingly working fine for a long time.
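
A contrived sketch of how an ordering can stop being total after the fact: the comparator below reads mutable state, so the map's internal ordering invariant can be silently invalidated after insertion, and per the linked question the resulting behavior is undefined and can include looping forever:

  import java.util.concurrent.ConcurrentSkipListMap;

  public class BadOrder {
    static class Item {
      int rank; // mutable, and the comparator depends on it
      Item(int rank) { this.rank = rank; }
    }

    public static void main(String[] args) {
      ConcurrentSkipListMap<Item, String> map = new ConcurrentSkipListMap<>(
          (x, y) -> Integer.compare(x.rank, y.rank));
      Item a = new Item(1), b = new Item(2);
      map.put(a, "a");
      map.put(b, "b");
      a.rank = 3;     // breaks the sorted invariant behind the map's back
      map.remove(b);  // undefined: lookups may now miss entries or loop
    }
  }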

TYMorningCoffee

Have you seen this before in person? It would make a great blog post.

I haven't personally encountered a buggy comparator without a total order.

layer8

I have seen a lot of incorrect Comparators and Comparable implementations in existing code, but haven’t personally come across the infinite-loop situation yet.

To give one example, a common error is to compare two int values via subtraction, which can give incorrect results due to overflow and modulo semantics, instead of using Integer::compare (or some equivalent of its implementation).
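
Concretely:

  import java.util.Comparator;

  public class CompareDemo {
    public static void main(String[] args) {
      Comparator<Integer> bad  = (a, b) -> a - b;  // subtraction can overflow
      Comparator<Integer> good = Integer::compare; // no intermediate arithmetic

      // Integer.MIN_VALUE - 1 wraps around to Integer.MAX_VALUE, so the
      // subtraction-based comparator claims MIN_VALUE is *greater* than 1:
      System.out.println(bad.compare(Integer.MIN_VALUE, 1));  // 2147483647 (wrong sign)
      System.out.println(good.compare(Integer.MIN_VALUE, 1)); // -1 (correct)
    }
  }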

hollowcelery

I knew someone who missed out on a gold medal at the International Olympiad in Informatics because his sort comparator didn’t have a total order.

TYMorningCoffee

Ouch. Any idea which problem? Those problems are public: https://ioinformatics.org/page/contests/10

thinkingemote

"I could barely ssh onto it"

Is there a way to ensure that whatever happens (CPU or network overloaded, etc.), one can always ssh in? Like reserving a tiny bit of resources for the ssh daemon?

beisner

On Linux I’ve done this by pinning processes to a certain range of CPU cores, which keeps one core free. That allows whatever I need in terms of management to execute on that core, including SSH or a UI.

LtdJorge

Nice? Or maybe give the Systemd slice a special cgroup with a reservation.
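
As a sketch of both approaches (AllowedCPUs= is the cgroup-v2 cpuset knob from systemd.resource-control, CPUAffinity= comes from systemd.exec; the unit names are placeholders):

  # app.slice -- confine the heavy workload to cores 1-15
  [Slice]
  AllowedCPUs=1-15

  # drop-in for sshd.service -- keep core 0 for management traffic
  [Service]
  CPUAffinity=0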

hinkley

The author has discovered a flavor of the Poison Pill. More common in event-sourcing systems, it’s a message that kills anything it touches, and is then “eaten” again by the next creature that encounters it, which also dies a horrible death. Only in this case it’s live-locked.

Once the data structure gets into the illegal state, every subsequent thread gets trapped in the same logic bomb, instead of erroring out on an NPE, which is the more common way an illegal state surfaces.

jonathanlydall

What about spotting a cycle by using an incrementing counter and then throwing an exception if it goes above the tree depth or collection size (presuming one of these is tracked)?

Unlike the author’s hash set proposal it would require almost no memory or CPU overhead and may be more likely to be accepted.

That being said, in the decade plus I’ve used C# I’ve never found that I failed to consider concurrent access on data structures in concurrent situations.

TYMorningCoffee

That's much better: constant memory. And the number of nodes visited in a lookup is guaranteed to be less than or equal to the height of the tree.
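
A sketch of that counter guard on a plain binary-search-tree lookup (Node is a hypothetical minimal node type, not TreeMap's internal one; the bound uses the red-black height guarantee, height <= 2*lg(n+1)):

  import java.util.ConcurrentModificationException;

  final class Node<K, V> {
    K key; V value;
    Node<K, V> left, right;
  }

  final class GuardedTree {
    static <K extends Comparable<K>, V> V guardedGet(Node<K, V> root, K key, int size) {
      // A valid red-black tree of n nodes has height <= 2*lg(n+1), so a
      // walk taking more steps than that must be following a cyclic link.
      int maxSteps = 2 * (32 - Integer.numberOfLeadingZeros(size + 1));
      Node<K, V> cur = root;
      for (int steps = 0; cur != null; steps++) {
        if (steps > maxSteps) {
          throw new ConcurrentModificationException("cycle suspected in tree");
        }
        int c = key.compareTo(cur.key);
        if (c == 0) return cur.value;
        cur = (c < 0) ? cur.left : cur.right;
      }
      return null; // key absent
    }
  }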

kachapopopow

Yep, ran into this way too many times. Performing concurrent operations on non-thread-safe objects, in Java or really in any language, produces the most interesting bugs in the world.

Espressosaurus

Which is why you manage atomic access to non-thread-safe objects yourself, or use a thread-safe version of them when using them across threads.

Multithreading errors are the worst to debug. In this case it's dead simple to identify at design time and warning flags should have gone up as soon as he started thinking about using any of the normal containers in a multithreaded environment.

BobaFloutist

Every time I think I'm sorta getting somewhere in my understanding of how to write code I see a comment like this that reminds me that the rabbithole is functionally infinite in both breadth and depth.

There's simply no straightforward default approach that won't have you running into and thinking through the most esoteric sounding problems. I guess that's half the fun!

mrkeen

It's not that bad. We just don't have the equivalent of GC for multi-threading yet, so the advice necessarily needs to be "just remember to take and release locks" (same as remembering to malloc and free).

Hopefully someone will invent something like STM [1] in the distant year of 2007 or so [2]. It has actual thread-safe data structures. Not just the current choice between wrong-answer-if-you-dont-lock and insane-crashing-if-you-dont-lock.

[1] https://www.adit.io/posts/2013-05-15-Locks,-Actors,-And-STM-...

[2] https://youtu.be/4caDLTfSa2Q?feature=shared

kachapopopow

Tell that to inexperienced developers, or try making a massive single-threaded project multi-threaded.

stuff4ben

I've been that developer making a single-threaded app multi-threaded. Best way to learn though!

baggy_trough

Multi-threading - ain't nobody got time for that.

ivanjermakov

Some (maybe most?) operations on Java Collections perform integrity checks to warn about such issues, for example a map's iterator throwing ConcurrentModificationException.
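
Worth noting the fail-fast check is best-effort: it compares a modification count, so it reliably catches single-threaded mistakes like the one below, but it is not guaranteed to catch cross-thread races (and a plain racing put, as in the article, trips no check at all):

  import java.util.HashMap;
  import java.util.Map;

  public class FailFastDemo {
    public static void main(String[] args) {
      Map<String, Integer> m = new HashMap<>();
      m.put("a", 1);
      m.put("b", 2);
      for (String key : m.keySet()) {
        m.remove(key); // structural modification mid-iteration: the next
                       // iterator step throws ConcurrentModificationException
      }
    }
  }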

bob1029

I've universally found that even when I am convinced that I am OK with the consequences of sharing something that isn't synchronized, the actual outcome is something I wasn't expecting.

loeg

The only things that should be shared without synchronization are readonly objects where the initialization is somehow externally serialized with the accessors, and atomic scalars -- C++ std::atomic, Java's java.util.concurrent.atomic types, etc.
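
A small sketch of those two shapes in Java:

  import java.util.List;
  import java.util.concurrent.atomic.AtomicLong;

  // 1) Effectively immutable object, safely published before threads start.
  final class Config {
    final List<String> hosts;
    Config(List<String> hosts) { this.hosts = List.copyOf(hosts); }
  }

  // 2) Atomic scalar: safe to mutate from any thread without a lock.
  final class Metrics {
    static final AtomicLong requests = new AtomicLong();
    static void record() { requests.incrementAndGet(); }
  }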

foobarian

I ran into my share of concurrency bugs, but one thing I could never intentionally trigger was any kind of inconsistency stemming from removing a "volatile" modifier from a mutable field in Java. Maybe the JVM I tried this with was just too awesome.

hashmash

Were you only testing on x86 or any other "total store order" architecture? If so, removing the volatile modifier has less of an impact.
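
The canonical repro looks like the sketch below; whether it hangs depends on the JIT (which may hoist the non-volatile read out of the hot loop) as much as on the hardware, which would be consistent with it being hard to trigger before the loop gets compiled:

  public class VolatileDemo {
    static boolean stop = false; // try with and without 'volatile'

    public static void main(String[] args) throws InterruptedException {
      Thread spinner = new Thread(() -> {
        while (!stop) { } // the JIT may hoist the read out of the loop
        System.out.println("stopped");
      });
      spinner.start();
      Thread.sleep(100);
      stop = true;
      spinner.join(); // without 'volatile' this can hang indefinitely
    }
  }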

deskr

Exceptions in threads are an absolute killer.

Here's a story of a horror bughunt where the main characters are C++, select(), and a thread brandishing an exception: https://news.ycombinator.com/item?id=42532979

TYMorningCoffee

I remember reading that article but being unable to understand it due to my lack of knowledge in the area. I will have to give it another go.

loeg

Just to probe the code review angle a little bit: shared mutable state should be a red/yellow flag in general, whether or not it is protected from unsafe modification. It should jump out at you as a reviewer that something weird is happening.

lolc

What I like here is the discovery of the extra loop, and then still digging down to discover the root cause of the competing threads. I think I would have removed the loop and called it done.

svilen_dobrev

I found this article very informative, memory-wise: concurrency vs. optimizations and the troubles thereof:

Programming Language Memory Models

https://research.swtch.com/plmm

https://news.ycombinator.com/item?id=27750610