The concurrency trap: How an atomic counter stalled a pipeline
44 comments
· June 7, 2025
masklinn
You still need something close to ArcSwap with your persistent data structure, because once you’ve built your new tree (possibly using transients) you need to overwrite the old root with the new, atomically (and probably in a CAS loop really).
Left-right is an option which iirc avoids the CAS loop, as it’s a single writer primitive.
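For readers unfamiliar with the pattern, here is a minimal sketch (not from the article) of that root swap, assuming the arc-swap and im crates; arc-swap's rcu helper re-runs the closure when a concurrent writer gets in first, which is the CAS loop being described:

```rust
use std::sync::Arc;
use arc_swap::ArcSwap;
use im::HashMap;

// Minimal sketch of "swap the new root in atomically".
// `rcu` retries the closure if another writer raced us.
fn insert_entry(root: &ArcSwap<HashMap<String, u64>>, key: String, value: u64) {
    root.rcu(|current| {
        // Build the new tree from the old one; im::HashMap shares
        // unchanged nodes, so only the modified path is copied.
        current.update(key.clone(), value)
    });
}

fn main() {
    let root = ArcSwap::from(Arc::new(HashMap::new()));
    insert_entry(&root, "answer".into(), 42);
    // Readers just take a cheap snapshot of the current root.
    assert_eq!(root.load().get("answer"), Some(&42));
}
```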
deepsun
So the root element is a single point of congestion for all writes, right?
hinkley
With filesystem-based trees that's often sufficient to help reduce the amount of write contention, because you're not serializing the whole goddamned thing back to disk, just the parts that changed, which are logarithmic in the size of the edit history. And git uses GC to deal with writes that get to the end and then fail due to concurrent access collisions.
For in-memory problems there are B-trees, which are an older solution to many of these problems than most of HN's userbase, myself included. Then you still have the sharding, which I did in fact point out would be part of a better hybrid solution.
cmrdporcupine
im::HashMap is a great tool
I would love, though, to have a similar persistent structure where the node pointers at each level were thread-sharded, instead of using e.g. atomics. A mutation at a branch could record that the mutation is "just" for thread ID X that did the mutation. At some later point a "commit" could reconcile the storage to global, maybe
cryptonector
You could easily organize the hashmap hierarchically, yes.
nick_
It's unclear if the author understands false sharing. The excessive "ping-ponging" of cache lines between cores means the cache line the counter sits on is shared with other memory that's also being accessed under contention.
Making sure the counter variable is sitting alone on its own cache line would completely remove the excessive ping-pong behaviour. The cache line with the counter would, of course, still ping-pong as it's mutated, but not excessively.
gpderetta
I don't think false sharing matters here. Cacheline bouncing of the counter, which is written even by readers, would be enough to explain the bottleneck. The barrier, whether implied or explicit in the atomic RMW used to update the counter, will also stall the pipeline, preventing any other memory operation from happening concurrently.
All threads repeatedly CASing on the same memory location is enough to bring any CPU to its knees.
The critical part of the fix is that readers no longer have to modify any shared memory location; old versions are reclaimed via deferred reclamation instead.
duskwuff
> The cache line with the counter would, of course, still ping-pong as it's mutated, but not excessively.
And the way you solve this is by using a separate counter for each core, at least a cache line apart from each other to ensure they all get separate cache lines. The downside is that you lose the ability to read the counter atomically - but if the counter is incrementing fast enough that you need per-CPU counters, a point-in-time value is basically meaningless anyway.
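A rough sketch of that per-slot idea (names are mine; 64-byte alignment is an assumption, since cache line size varies by CPU):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Per-slot counters padded so each lands on its own cache line.
// Each thread/core increments its own slot; readers sum on demand.
#[repr(align(64))]
struct PaddedCounter(AtomicU64);

struct ShardedCounter {
    slots: Vec<PaddedCounter>,
}

impl ShardedCounter {
    fn new(shards: usize) -> Self {
        Self {
            slots: (0..shards).map(|_| PaddedCounter(AtomicU64::new(0))).collect(),
        }
    }

    fn add(&self, shard: usize, n: u64) {
        // Relaxed is fine for a statistics counter; no ordering is implied.
        self.slots[shard % self.slots.len()].0.fetch_add(n, Ordering::Relaxed);
    }

    // Not an atomic snapshot: slots may change while we sum them,
    // which is the "point-in-time value is meaningless" trade-off above.
    fn approximate_total(&self) -> u64 {
        self.slots.iter().map(|s| s.0.load(Ordering::Relaxed)).sum()
    }
}
```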
hinkley
For popcounts this is the classical solution. Make aggregation lazy and worry about keeping k smaller numbers accurate cheaply, and let the reader worry about calculating a total, which is likely eventually consistent anyway.
But for multiple-reader, exclusive-writer locking this is never going to work. Or at least not without extra silicon. Which maybe should be a thing. The problem of course would be that a multi-word atomic sum instruction would actually have to cross cache lines to avoid false sharing. So you'd end up with counters on contiguous cache lines but occupying 64 bytes apiece, which is a lot.
It would almost call for a special region of memory that has different cache coherency behavior, and I can't see that being easy, fast, or backward compatible, which is maybe why we don't do it.
gpderetta
The "counter" here is an RWMutex (or RWSpinLock). You can't really make it distributed while guaranteeing mutual exclusion [1]. You could use an hashmap with more fine grained locking, but to deal with rehashing you end up reinventing something like ArcSwap anyway.
[1] True no-writes-on-shared-locking RWMutexes do exist, but are significantly more complex and require epoch detection like the ArcSwap, so there is no significant advantage.
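To make "the counter is the lock" concrete, here is a toy reader-writer spinlock built on a single atomic. It is illustrative only, not how any real RwLock is implemented, but it shows why even acquiring a read lock writes to the shared cache line:

```rust
use std::hint;
use std::sync::atomic::{AtomicI64, Ordering};

// Toy reader-writer spinlock: >= 0 means "that many active readers",
// -1 means "a writer holds the lock". Every read_lock performs a CAS,
// so every reader drags the line into its own core in modified state.
struct ToyRwSpinLock {
    state: AtomicI64,
}

impl ToyRwSpinLock {
    const fn new() -> Self {
        Self { state: AtomicI64::new(0) }
    }

    fn read_lock(&self) {
        loop {
            let s = self.state.load(Ordering::Relaxed);
            if s >= 0
                && self
                    .state
                    .compare_exchange_weak(s, s + 1, Ordering::Acquire, Ordering::Relaxed)
                    .is_ok()
            {
                return; // this CAS is the write every "reader" performs
            }
            hint::spin_loop();
        }
    }

    fn read_unlock(&self) {
        self.state.fetch_sub(1, Ordering::Release);
    }

    fn write_lock(&self) {
        // Wait until there are no readers, then claim exclusivity.
        while self
            .state
            .compare_exchange_weak(0, -1, Ordering::Acquire, Ordering::Relaxed)
            .is_err()
        {
            hint::spin_loop();
        }
    }

    fn write_unlock(&self) {
        self.state.store(0, Ordering::Release);
    }
}
```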
hinkley
I think it may have been worth calling out false sharing for readers who are unfamiliar with the problem domain, but I don't get the impression this was false sharing; it was regular old sharing.
nick_
I think you're right.
senderista
Why not? If enough readers are mutating the same ref counter, with very short critsecs, then this could definitely be a problem, regardless of false sharing.
forrestthewoods
> If enough readers are mutating the same ref counter
Well if they were mutating the ref counter I’d call them writers not readers. =P
But yes. Many people incorrectly assume that writing an atomic_int from many threads is super fast. I mean it’s atomic! It couldn’t be a smaller atom of work! Alas this is incredibly incorrect. Even atomic ints can be extremely slow.
gpderetta
They are logically readers, the mutation is an implementation detail.
jeffbee
Long blog, corporate formatting covered in popups and hero images, screenshots of the incredibly ugly ketchup-and-mustard flame graph, and a discussion, with some errors and vagueness, of a completely vanilla performance detail. These are the universal ingredients of the midsize SaaS company's semitechnical marketing game.
finnh
It's also unclear how the first graph, whose y-axis is labelled "req/s", is showing a latency spike. Latency is not visible on that graph, AFAICT, only throughput.
cryptonector
Or maybe it just so happens that under the observed load most readers really wanted to frob that particular counter, in which case the only solution is to get rid of the counter.
hinkley
> However, further analysis in the perf environment revealed that while there was a spike in the number of active sessions, those gradually dropped off while the DAG processing time continued to stay high.
Thundering herds will do that. Queuing theory. Once the queue builds up it takes far longer to clear than intuition suggests.
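A back-of-the-envelope illustration with made-up numbers: say the service normally has capacity for 1,050 req/s against a steady 1,000 req/s of arrivals, and a 30-second stall queues up about 30,000 requests. The backlog only drains at the spare capacity of 50 req/s:

$$
t_{\text{drain}} \approx \frac{\text{backlog}}{\mu - \lambda} = \frac{30\,000\ \text{req}}{(1050 - 1000)\ \text{req/s}} = 600\ \text{s}
$$

So a 30-second hiccup takes roughly ten minutes to clear, even after the spike itself is over.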
cogman10
Not a bad tradeoff and one I've made in the past.
The tricky part here (which the author solved using rcu) is that it's REALLY easy to introduce a race condition.
Imagine thread A and thread B want to add a new value to the hashmap. They arrive at the same time. arc_swap used naively would ultimately have one of those threads win and the value from the other thread disappear. Not great.
But another thing to consider with this approach is that it's expensive under any sort of write contention. As soon as writes start to spike, the number of allocations increases in a potentially really bad way. If you have 3 threads that want to write at the same time, one thread will create 1 full copy, another will create 2 full copies, and the third will create 3 full copies. If your underlying data structure has any sort of memory allocated to it, this becomes a pretty bad trade-off.
Funnily, one easy way to solve this problem is to introduce locks again, but only for the writer. Possibly a good idea if you are spending more than ~1000 cycles creating the data structure.
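A sketch of that writer-lock variant (my naming, not the article's code): readers stay on the cheap ArcSwap load path, while a Mutex serializes writers so racing writers can't overwrite each other and each write copies the map exactly once.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use arc_swap::ArcSwap;

// "Locks again, but only for the writer": readers do one atomic load,
// writers queue on the Mutex instead of racing and re-copying.
struct WriteLockedMap {
    current: ArcSwap<HashMap<String, u64>>,
    write_lock: Mutex<()>,
}

impl WriteLockedMap {
    fn new() -> Self {
        Self {
            current: ArcSwap::from(Arc::new(HashMap::new())),
            write_lock: Mutex::new(()),
        }
    }

    fn get(&self, key: &str) -> Option<u64> {
        // Read path: a single atomic load, no shared writes.
        self.current.load().get(key).copied()
    }

    fn insert(&self, key: String, value: u64) {
        // Writers are serialized, so the copy made here can't be lost.
        let _guard = self.write_lock.lock().unwrap();
        let mut next = self.current.load_full().as_ref().clone();
        next.insert(key, value);
        self.current.store(Arc::new(next));
    }
}
```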
webstrand
This is effectively the solution I settled on this week for sharing and renewing an API token with multiple threads. ArcSwap protecting the (Token, Instant), a mutex to prevent multiple simultaneous writers, and an additional barrier readers could optionally wait on, if they need a fresh token before the write is finished.
conradludgate
For read heavy workloads, https://docs.rs/papaya/latest/papaya/ should be pretty good. It's a proper concurrent hashmap, with incremental resizing. It even has some support for rcu operations per field
athrowaway3z
I like the idea of arc_swap, but I've tried skimming the code to see if it fit my mental model and found it way more complex than I expected.
nasretdinov
Hate to be that person, but Go developers hit this first and then implemented a (really nice, in my opinion) concurrent hash map that does something similar to what is described in the article, but without copying the entire hash table each time: https://pkg.go.dev/sync#Map
benmmurphy
Concurrent maps are often easier to implement in GC languages because you can use the garbage collector for cleaning old values. Whereas without a GC you will need an alternative like RCU.
nasretdinov
I do agree that concurrent structures are easier to implement in GC languages (and I actually quite like that about Go too). Fundamentally, though, GC languages have to implement the GC itself somehow :), so the only real difference, I guess, is that some (complex) data structures _need_ a GC to be implemented cleanly, and that's what RCU provides.
kccqzy
I haven't seen a pure user space implementation of RCU that works well without kernel support. The RCU methodology is easy to understand, but how does a user space implementation understand when it is safe to dispose of a piece of data structure? Don't you need a primitive to run some code on each CPU?
cryptonector
You don't need kernel assistance to make user-land RCU-like structures work well enough. I've done it twice. The price you pay for not having kernel assistance is that some threads sometimes have to do extra work: signaling waiting writers, looping over a very short read / write-to-hazard-pointer / read-again sequence until the value read is stable, or doing a garbage collection pass. But this price is not too heavy.
j_seigh
You just need to keep track of each thread's quiescent states however they are implemented.
kccqzy
Right but isn't that difficult to implement in a library?
j_seigh
Thread-local vars would be used with lazy initialization. Cleanup might be a little tricky depending on which implementation of thread-local storage you use. Thread-local support is not as great as it should be.
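As a very rough sketch of what such a library-level scheme might look like (assumed names; it glosses over thread exit, which is exactly the cleanup trickiness mentioned above): each thread lazily registers a slot, reports a quiescent state by copying the global epoch into it, and a writer waits for every slot to catch up before freeing old data.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{Arc, Mutex};

// Quiescent-state tracking sketch: not production code.
struct Registry {
    global_epoch: AtomicU64,
    // One slot per registered thread: the last epoch it observed.
    threads: Mutex<Vec<Arc<AtomicU64>>>,
}

impl Registry {
    fn new() -> Self {
        Self { global_epoch: AtomicU64::new(1), threads: Mutex::new(Vec::new()) }
    }

    // Called lazily, e.g. from a thread-local's initializer.
    fn register(&self) -> Arc<AtomicU64> {
        let slot = Arc::new(AtomicU64::new(self.global_epoch.load(Ordering::SeqCst)));
        self.threads.lock().unwrap().push(slot.clone());
        slot
    }

    // A reader calls this between critical sections, when it holds no references.
    fn quiescent(&self, slot: &AtomicU64) {
        slot.store(self.global_epoch.load(Ordering::SeqCst), Ordering::SeqCst);
    }

    // A writer calls this before freeing an old version: once every thread has
    // reported the new epoch, no reader can still hold the old version.
    // (A thread that exits without deregistering would block this forever,
    // which is why thread-local cleanup matters.)
    fn synchronize(&self) {
        let target = self.global_epoch.fetch_add(1, Ordering::SeqCst) + 1;
        for slot in self.threads.lock().unwrap().iter() {
            while slot.load(Ordering::SeqCst) < target {
                std::thread::yield_now();
            }
        }
    }
}
```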
cryptonector
Not really. https://github.com/nicowilliams/ctp is mine -- the API is not RCU, but a simpler and easier to use "thread-safe variable" construct. You'll find two C implementations there. I've been using this in production for almost a decade without any trouble. (For a long time I had a missing producer membar in the create primitive (oops!), but because it only happens once it never caused me any problems.)
To be fair I thought about it for a long time first. But nowadays these techniques are well known, so now it's easier to "copy" the thinking others have done, and that thinking is the hard part.
cryptonector
TFA is about concurrent hashmaps that croak under heavy read load because the synchronization mechanisms of those hashmaps caused cache ping-pong, and the fix was to use RCU with a copy-on-write data structure for the hashmap. Each write requires cloning the previous hashmap, but because the readers don't perform any atomic operations other than one read (for the current read-only hashmap) it all works out as the cache lines don't have to ping-pong given that the readers are not writing.
The stuff about "debt" is about the design of ArcSwap, which uses "debt" instead of "hazard pointers". The idea is that every thread keeps track of the last version of the hashmap that it's seen, and this is "debt" (references that need to be released eventually). The debt is a thread-local object that is linked with all the other threads' debt, which essentially allows garbage collection of "debt" (debt collection?), or something like that. See https://github.com/vorner/arc-swap/blob/master/src/docs/inte...
I've implemented something similar myself. Instead of implementing user-land RCU (which is what perhaps I should have done) I implemented a "thread-safe variable" where reads are lock-less and wait-free, and writers pay all the costs. My thread-safe variable API is real simple and intuitive: `create(value_destructor) -> thread_safe_var`, `set(thread_safe_var, value)`, `get(thread_safe_var) -> value`, `release(thread_safe_var)`, and `destroy(thread_safe_var)` (presumably only during an atexit() handler or shared object destructor call, if at all). I have two implementations, one with a linked list of per-thread hazard pointers, and the other with a tuple of {next/current, current/previous} "cells" and a protocol that gives readers time to read the cells, figure out which is "current" and then dereference and incref a refcount.
Regardless of how one implements this data structure, the key concept is that you have to atomically compose two distinct atomic operations: a read, and a dereference to atomically increment a reference count -- this is remarkably tricky!
When you have a GC it's really easy: just read -- there is no reference count to increment.
With a linked list of per-thread hazard pointers all you have to do is read the thing then write it into your thread-local hazard pointer then read again, looping until the second read is the same as the first read. Having the value thus safely copied to the reader's thread-local hazard pointer then allows for notional garbage collection. Any thread that would release an old version of the value has to check all the hazard pointers to see that it's not in use, and this amounts to a small garbage collection.
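In code, that read/publish/re-read loop looks roughly like this (names are made up, and it assumes reclaimers remove the value from `shared` first and only free it after scanning all hazard slots):

```rust
use std::sync::atomic::{AtomicPtr, Ordering};

// Sketch of the read-side hazard-pointer protocol described above.
// `shared` is the pointer being protected; `hazard` is this thread's
// published hazard-pointer slot, visible to would-be reclaimers.
fn protected_read<T>(shared: &AtomicPtr<T>, hazard: &AtomicPtr<T>) -> *mut T {
    loop {
        let p = shared.load(Ordering::Acquire);
        // Publish what we intend to use...
        hazard.store(p, Ordering::SeqCst);
        // ...then re-read: if `shared` still holds `p`, our publication
        // became visible before any reclaimer could have freed `p`.
        if shared.load(Ordering::Acquire) == p {
            return p; // safe to dereference until `hazard` is cleared
        }
        // The value changed between the two reads; retry.
    }
}
```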
I especially like the hazard pointer approach.
See also MVars (Haskell), transactional memory, RCU, etc.
vrosas
> While we knew where to look, this investigation had already taken weeks and things took a turn for the worse when we hit the issue again on February 23rd.
It blows me away an issue like this could take weeks to track down. If I were in any leadership position at this company I'd be rolling heads with the lack of telemetry or domain knowledge for these systems.
dblohm7
I can't say I'm surprised, TBH. I had a rough idea of where the problem might lie just by reading the title of the post. But I was fortunate enough to do an undergraduate degree where concurrency was actually taught, plus I've learned a lot over the years working in highly-concurrent, asynchronous environments.
Concurrent programming has been mainstream for some time now, but I don't think the level of expertise of most engineers has kept up. That becomes most apparent when software starts hitting concurrency pitfalls: performance problems, deadlocks, UAFs, and so on...
cryptonector
In my career I've had to deal with two big outages that took over a week to diagnose and fix. One of those involved my code, though the conditions that caused it involved bugs in others' code interacting with bugs in mine such that in isolation none of them would have caused an outage. The first one took two weeks to diagnose, and I had to write specialty tools to help diagnose it. The second took lots of data gathering, code inspection, and making and testing many hypotheses. In neither case did heads roll. The staff who go through such a trial by fire come out of it all the more valuable than before, and that includes those whose "fault" it might have been.
Sometimes you can't have the domain knowledge you'd want beforehand. The reasons vary a great deal. For example, the software in question might be commercial, and you might not have an alternative vendor to switch to. Other times your "bus factor" might have dropped to uncomfortably small values through no fault of anyone in the organization (people leaving too quickly for reasons having nothing to do with the way the org is run, people dying, headcount and budgets not allowing for remediating the situation, or simply having too many critical systems to keep every bus factor wide at all times).
hinkley
> The big difference between a concurrent hash map and ArcSwap is that ArcSwap requires swapping out the entirety of the underlying data with every write but trades this off with very cheap reads. Writers don’t even have to wait for all readers to finish since a new epoch is created with the new version of data.
This is where tree data structures win out. Git and subversion and pure functional languages don’t replace the entire data structure. They make a new element and attach it to the tree by creating a new version of the intended parent, and a new version of its parent, on up to the root element.
So somewhere out there is space for a hybrid of ArcSwap and DashMap that uses sharded copy on write. These guys probably don’t need it given their low write rate, but shards should help with write contention for other use cases.
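One way such a hybrid might look (a sketch with assumed names, using the arc-swap and im crates, not anything from the article): a fixed set of shards, each behind its own swap, so a write only rebuilds the shard it touches and writers on different shards don't contend.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::sync::Arc;
use arc_swap::ArcSwap;
use im::HashMap;

// Sharded copy-on-write map: a read is a hash plus one ArcSwap load;
// a write copies only the path inside the shard it touches.
struct ShardedCowMap {
    shards: Vec<ArcSwap<HashMap<String, u64>>>,
}

impl ShardedCowMap {
    fn new(shard_count: usize) -> Self {
        let shards = (0..shard_count)
            .map(|_| ArcSwap::from(Arc::new(HashMap::new())))
            .collect();
        Self { shards }
    }

    fn shard_for(&self, key: &str) -> &ArcSwap<HashMap<String, u64>> {
        let mut h = DefaultHasher::new();
        key.hash(&mut h);
        &self.shards[(h.finish() as usize) % self.shards.len()]
    }

    fn get(&self, key: &str) -> Option<u64> {
        self.shard_for(key).load().get(key).copied()
    }

    fn insert(&self, key: String, value: u64) {
        // The rcu helper retries if another writer races us on this shard;
        // writers on other shards proceed independently.
        self.shard_for(&key).rcu(|shard| shard.update(key.clone(), value));
    }
}
```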