Atomics and Concurrency
55 comments · May 28, 2025
dataflow
My (limited) understanding is that the way TSAN works is better than what you're comparing it to. You don't have to rerun TSAN in a loop to finally catch a race. As long as your test has coverage of that scenario (i.e. it executes the lines of interest once within a reasonable time window) it should get caught. Of course, being dynamic, it can't catch code that you never execute during testing, but that's not such a huge problem because you can verify code coverage through other means. It's quite different from having to make the stars align by looping until some rare timing issue comes up.
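(For reference, the typical workflow: compile with the sanitizer enabled and run your tests once; the report below is abridged and illustrative.)

    $ clang++ -fsanitize=thread -g -O1 test.cpp -o test
    $ ./test
    WARNING: ThreadSanitizer: data race (pid=...)
      Write of size 4 at 0x7b04... by thread T1: ...
      Previous read of size 4 at 0x7b04... by thread T2: ...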
The real issue here isn't that TSAN is low-confidence about absence of data races in C++. The issue is that even if it statically proved the absence of data races in the C++ sense, that still wouldn't imply that your algorithm is race-free, because it's trivial to patch every single instance of a data race with an atomic to get TSAN to shut up, and that in no way implies the higher-level concurrent algorithm works correctly. (e.g., if two threads perform an atomic load of some variable, then add one, then perform an atomic store back in, they would obviously stomp on each other, but TSAN wouldn't care.) It's more like a UB checker than a correctness verifier.
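(A minimal sketch of that lost-update pattern: every access is atomic, so TSAN sees no data race, yet increments still get lost.)

    #include <atomic>
    #include <iostream>
    #include <thread>

    std::atomic<int> counter{0};

    void racy_increment() {
        for (int i = 0; i < 100000; ++i) {
            int v = counter.load();  // atomic load: no data race
            counter.store(v + 1);    // atomic store: no data race
            // Another thread can store between our load and store,
            // so increments are lost. TSAN stays silent regardless.
        }
    }

    int main() {
        std::thread t1(racy_increment), t2(racy_increment);
        t1.join();
        t2.join();
        std::cout << counter.load() << " (expected 200000)\n"; // usually less
    }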
withoutboats3
The occurrence of data races depends on the specific non-deterministic sequence of execution of concurrent codepaths. Just because you have 100% code coverage does not mean you've covered every potential execution sequence, and it's almost never practical to actually execute every possibility to ensure the absence of data races. Depending on the probability that your data race will occur, it could indeed be something where you have to make the stars align for TSAN to catch it.
Not to talk my own book, but there is a well-known alternative to C++ that can actually guarantee the absence of data races.
dataflow
It "could" for some algorithms, yes, but for a lot of algorithms, that kind of star alignment simply isn't necessary to find all the data races, was my point. And yes, TLA+ etc. can be helpful, but then you have the problem of matching them up with the code.
sebtron
TSan also catches data races that end up not having any visible effect, unlike rerunning and hoping to see the program misbehave.
vlovich123
The false-positive rate of TSAN is so low that it's worth fixing everything it reports, although concluding that a report is a false positive seems so hard that I always have a nagging feeling as to whether I actually did fix it.
tialaramex
"Data race which doesn't seem to modify the data" is actually often not a false positive, they're called "Benign data races" and well, go read the literature, it's often difficult to be sure they're truly "benign" and if there's a subtle reason why they aren't that's now a crazy hard to spot bug in your program.
So yeah, just fix 'em. Most of them you dodged a bullet, and even when you didn't it will now be easier for the next person to reason about the software.
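(The canonical example, sketched below: a plain boolean stop flag. It looks benign, but because the racy read is UB, the compiler may hoist it out of the loop and spin forever.)

    #include <thread>

    bool stop = false; // plain bool: every cross-thread access is a data race

    void worker() {
        while (!stop) {
            // "Benign" reasoning: worst case, we loop one extra time.
            // Reality: the racy load is UB, so the compiler may read
            // `stop` once, cache it in a register, and never exit.
        }
    }

    int main() {
        std::thread t(worker);
        stop = true; // racing write; making both sides std::atomic fixes it
        t.join();
    }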
john-h-k
> The important thing to remember is that each of these cannot be split into separate instructions.
Nitpick, but they absolutely can be split into several instructions, and this is the most common way it's implemented on RISC-like processors; also, single instructions aren't necessarily atomic.
The actual guarantee is that the entire operation (load, store, RMW, whatever) occurs in one “go” and no other thread can perform an operation on that variable during this atomic operation (it can’t write into the low byte of your variable as you read it).
It’s probably best understood by imagining that every atomic operation is just the normal operation wrapped in a mutex, but implemented in a much more efficient manner. Of course, with large enough types, atomic variables may well be implemented via a mutex.
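(A sketch of that mental model; illustrative only, since real std::atomic<T> for small T compiles to dedicated instructions rather than a lock.)

    #include <mutex>

    // Mental model: an "atomic" as the plain operation wrapped in a mutex.
    template <typename T>
    class MutexAtomic {
        mutable std::mutex m;
        T value{};
    public:
        T load() const {
            std::lock_guard<std::mutex> g(m);
            return value;
        }
        void store(T v) {
            std::lock_guard<std::mutex> g(m);
            value = v;
        }
        T fetch_add(T d) { // the read-modify-write happens in one "go"
            std::lock_guard<std::mutex> g(m);
            T old = value;
            value += d;
            return old;
        }
    };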
OskarS
> It’s probably best understood by imagining that every atomic operation is just the normal operation wrapped in a mutex
Sort of, but that's not quite right: if you imagine it as "taking a mutex on the memory", there's a possibility of starvation. Imagine two threads repeatedly "locking" the memory location to update it. With a mutex, it's possible that one of them gets starved, never getting to update the location, stalling indefinitely.
At least x86 (and I'm sure ARM and RISC-V as well) makes a much stronger progress guarantee than a mutex would: the operation is effectively wait-free. All threads are guaranteed to make progress in some finite amount of time; no one will be starved. At least, that's my understanding from reading much smarter people talking about the cache-synchronization protocols of modern ISAs.
Given that, I think a better mental model is roughly something like the article describes: the operation might be slower under high contention, but not "blocking" slow, it is guaranteed to finish in a finite amount of time and atomically ("in one combined operation").
john-h-k
x86 and modern ARM64 have instructions for this, but the original ARM and RISC approaches are literally a hardware-assisted polling loop (load-linked/store-conditional). Unsure what guarantees they make; I shall look.
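(That retry-loop shape is visible at the C++ level too; a sketch follows. compare_exchange_weak is allowed to fail spuriously precisely so it can map to a load-linked/store-conditional pair.)

    #include <atomic>

    // fetch_add built from a CAS retry loop. On LL/SC architectures,
    // compare_exchange_weak maps to load-linked/store-conditional, which
    // can fail spuriously, so we loop until the store succeeds.
    int fetch_add_via_cas(std::atomic<int>& a, int delta) {
        int expected = a.load();
        while (!a.compare_exchange_weak(expected, expected + delta)) {
            // on failure, `expected` is refreshed with the current value
        }
        return expected; // the old value, like fetch_add
    }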
Definitely a good clarification though, yeah. Important.
OskarS
I got curious about this, because really the most important guarantee is in the C++ memory model, not any of the ISAs, and conforming compilers are required to fulfill it (and it's generally more restrictive than any of the platform guarantees). It's a little bit hard to parse, but if I'm reading section 6.9.2.3 of the C++23 standard (from here [1]) correctly, operations on atomics are only lock-free, not wait-free, and even that might be a high bar on certain platforms:
Executions of atomic functions that are either defined to be lock-free
(33.5.10) or indicated as lock-free (33.5.5) are lock-free executions.
When one or more lock-free executions run concurrently, at least one should
complete.
[Note 3 : It is difficult for some implementations to provide absolute
guarantees to this effect, since repeated and particularly inopportune
interference from other threads could prevent forward progress, e.g., by
repeatedly stealing a cache line for unrelated purposes between load-locked
and store-conditional instructions. For implementations that follow this
recommendation and ensure that such effects cannot indefinitely delay progress
under expected operating conditions, such anomalies can therefore safely be
ignored by programmers. Outside this document, this property is sometimes
termed lock-free. — end note]
I'm guessing that note is for platforms like you mention, where the underlying ISA makes this (more or less) impossible. I would assume that in the modern versions of these ISAs, though, essentially everything in std::atomic is wait-free in practice.

[1] https://open-std.org/jtc1/sc22/wg21/docs/papers/2023/n4950.p...
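(You can at least ask the implementation what you got; a small sketch. On some toolchains, non-lock-free atomics need linking against libatomic.)

    #include <atomic>
    #include <iostream>

    struct Big { char bytes[64]; }; // too large for a single atomic instruction

    int main() {
        // Compile-time: lock-free for every object of this specialization?
        std::cout << std::atomic<int>::is_always_lock_free << '\n'; // 1 on mainstream targets
        std::cout << std::atomic<Big>::is_always_lock_free << '\n'; // almost certainly 0

        // Run-time query for one particular object.
        std::atomic<Big> b;
        std::cout << b.is_lock_free() << '\n'; // likely 0: implemented via a lock
    }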
wbl
Really finicky ones, and the initial ones made none. For RISC-V, I think it's something like a maximum of 16 instructions, with no other memory accesses, between the LL and SC to be assured of forward progress on LL/SC sequences.
That's to enable very minimal hardware implementations that can only track one line at a time.
p_l
Early Alpha CPUs (no idea about later ones) essentially had a single special register that asserted a mutex lock on a word-sized (64-bit) location for any read/write operation on the Sysbus.
pythops
For Rust, I highly recommend this book https://marabos.nl/atomics/
reitzensteinm
This is such a great book, especially the section on operating system primitives, which made the book wider in scope and more practical. After all, you're probably not building exotic data structures by hand in memory without also needing high performance IO.
It's been a hobby of mine to collect concurrency examples from books and blog posts and simulate them in Temper, my Rust memory-model simulator. As far as I know, it's the largest Rust/C++11 memory-model test suite on the internet (but I'm happy to be corrected).
This is the file for Rust Atomics and Locks:
https://github.com/reitzensteinm/temper/blob/main/memlog/tes...
I didn't find any bugs in the examples, but with how good the book was, I didn't expect to :)
The Williams book for C++ contains many of the same ideas (Rust's memory model is a copy/paste from C++11 without the now deprecated Consume) and I can highly recommend that too.
pythops
Absolutely! This book is useful even for non-Rust developers, I think.
pimeys
Yes. This is how I learned atomics and memory ordering. It's so much fun to read, and super interesting.
Highly recommend!
latchkey
Previously:
Atomics and Concurrency https://news.ycombinator.com/item?id=38964675 (January 11, 2024 — 106 points, 48 comments)
gpderetta
Probably I'm missing something, but enqueue dereferences the current tail node. What prevents it from dereferencing a node that has just been deleted by dequeue?
Actually, nothing ever sets head, so I'm not sure how anything is ever dequeued. Probably the implementation is incomplete and the queue needs a sentinel node somewhere.
ozgrakkurt
This is great. I'm also looking for information about the performance impact of using atomics, e.g. atomic reference counting; if anyone can recommend something to learn from, that would be appreciated.
j_seigh
Performance from using atomics in reference counting is going to degrade quite a bit under heavy usage due to cache contention. If you are in that situation, you should be looking at something else, like RCU or hazard pointers.
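(A sketch of the contended pattern: every thread hammers the same cache line, which must bounce between cores in exclusive state on each fetch_add, so throughput collapses as threads are added.)

    #include <atomic>
    #include <thread>
    #include <vector>

    std::atomic<long> refcount{1};

    void churn() {
        for (int i = 0; i < 1000000; ++i) {
            // The shared_ptr-style pattern: relaxed increment on copy,
            // acquire-release decrement on destruction.
            refcount.fetch_add(1, std::memory_order_relaxed);
            refcount.fetch_sub(1, std::memory_order_acq_rel);
        }
    }

    int main() {
        std::vector<std::thread> threads;
        for (int i = 0; i < 8; ++i) threads.emplace_back(churn);
        for (auto& t : threads) t.join();
    }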
tialaramex
Note that these strategies (collectively, "delayed reclamation") lose one benefit of the RAII concept. With Rust's Drop or C++ destructors we are guaranteed that when we don't need this Goat the clean-up-unused-goat code we wrote will run now - unlike with a Garbage Collector where that code either runs at some unspecified future time or not at all.
Delayed reclamation means that "when we don't need this Goat" will sometimes be arbitrarily delayed after we apparently didn't need it, and it's no longer possible to say for sure by examining the program. This will almost always be a trade you're willing to make, but you need to know it exists, this is not a fun thing to discover while debugging.
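(For contrast, a sketch of the RAII guarantee in question, using the same Goat.)

    #include <cstdio>

    struct Goat {
        ~Goat() { std::puts("goat cleaned up"); }
    };

    void use_goat() {
        Goat g;
        // ...
    } // cleanup runs exactly at this brace; under RCU or hazard pointers,
      // the equivalent reclamation may happen at some later,
      // harder-to-pin-down point in the program's execution.

    int main() { use_goat(); }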
manwe150
> we are guaranteed that when we don't need this Goat the clean-up-unused-goat code we wrote will run now
Not to put too fine a point on things, but Rust (and C++) very explicitly don't guarantee this. They are both quite explicit about being allowed to leak memory and never free it (due to reference cycles), something a GC is typically not allowed to do. So yes, it usually happens; it just is not guaranteed.
zozbot234
With RCU the "we don't need this Goat" occurs as soon as the last readers are done with the earlier version of Goat, modulo a limited grace period in some RCU implementations. New readers always get the latest version, so there is no risk of waiting for an "unspecified" or "arbitrary" amount of time.
Hazard pointers are a bit fiddlier AIUI, since the reclamation step must be triggered explicitly and verify that no hazard pointers exist to the object being reclaimed, so it is quite possible that you won't promptly reclaim the object.
coolThingsFirst
Amazing, I have decided to learn C++ for fun, and wow, the designers of the language had serious gray matter.
Binary search provided. Pair abstraction provided. Lower bound for binary search, yup.
Auto for type inference, and fast as hell on top of it. Crazy powerful language, and also multiplatform threads.
tialaramex
C++ auto isn't type inference, as so often in C++ it's something weirder which they call type deduction.
In languages with type inference, when it could be A or B and it's left to infer the type, the compiler just says it could be A or B, so the programmer needs to disambiguate.
But with deduction, in some cases C++ will deduce that you meant A even though you could instead have explicitly chosen B, and too bad if you did mean B, because that's not what it deduced.
coolThingsFirst
Can you show me some examples where the deduction is not the expected type?
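(Two classic ones, sketched below: auto drops references and const, and proxy types such as std::vector<bool>::reference deduce to something other than the element type.)

    #include <vector>

    int main() {
        std::vector<int> v{1, 2, 3};

        auto a = v[0]; // deduced as int: a copy, not int&
        a = 42;        // v[0] is unchanged; you had to write auto& to mean that

        const int c = 7;
        auto b = c;    // deduced as int: const is dropped
        b = 8;         // legal, even though c was const

        std::vector<bool> bits{true};
        auto x = bits[0]; // NOT bool: a proxy, std::vector<bool>::reference
        x = false;        // writes through to bits[0], unlike a plain bool copy
    }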
tialaramex
> TSan is a reliable way of detecting data races in your code, instead of trying to repeatedly run the code in a loop hoping for a data race to occur.
I'm sure this doesn't intend to be misleading but I think it can mislead people anyway.
TSan should detect races, but just like "repeatedly running the code in a loop", it's not actually checking that what you wrote doesn't have races; it's just reporting any which happened during testing. This is valuable (if you eliminate a race you've almost certainly fixed a real bug, maybe a very subtle bug you'd have cursed for years otherwise), but you can't use TSan to prove there are no further bugs of this kind.
Tools to prove this stuff are often much harder to bring into a project, but you should put that work in if the difference between "it's probably fine" and "I proved it's correct" matters for you or your work. I don't want my pacemaker to be "probably fine", but it's OK if the YouTube music player on my phone wasn't proved correct.