
C++: Strongly Happens Before?

18 comments · August 28, 2025

pixelpoet

> A simple program to start

I write a lot of C++, and that is not a simple program. Short, sure.

stingraycharles

I mean, dining philosophers is very simple as well. Dijkstra’s shortest path is simple.

Simple can still be difficult to understand.

taneq

It sounds like your definition of “simple” is more like “short” than “straightforward”?

shelajev

my background is mostly Java, so I know happens-before from there: (https://docs.oracle.com/javase/specs/jls/se8/html/jls-17.htm...).

from the article: > [Note 8: Informally, if A strongly happens before B, then A appears to be evaluated before B in all contexts. — end note]

this is the Java happens-before, right? What's the non-strong happens-before in C++ then?

jcranmer

The data-race-free memory model was an observation back in the early 90's that a correctly-synchronized program that has no data races will, even on a weak memory model multiprocessor, be indistinguishable from a fully sequentially-consistent memory model. This was adapted into the Java 5 memory model, with the happens-before relation becoming the definition of correctly-synchronized, and then C++11 explicitly borrowed that model and extended it to include weaker atomics, and pretty much everybody else borrows directly or indirectly from that C++ memory model. However, C++ had to go back and patch the definition because their original definition didn't work, and it took C++ standardizing a model to get the academic community to a state where we could finally formalize weak memory models.

In Java, happens-before is essentially the union of two relations: program order (i.e., the order imposed within a single thread by the imperative programming model) and synchronizes-with (i.e., the cross-thread synchronization constructs). C++ started out doing the same. However, this is why it broke: in the presence of weak atomics, you can construct a sequence of atomic accesses and program-order relations across multiple threads that suggests something should have a happens-before relation that it actually doesn't have in the hardware memory model. To describe the necessary relations, you need to add several more kinds of dependencies, and I'm not off-hand sure which dependencies ended up with which labels.

Note that, for a user, all of this stuff generally doesn't matter. You can continue to think of happens-before as basic program order unioned with a cross-thread synchronizes-with and your code will all work; you just end up with a weaker (fewer things allowed) version of synchronizes-with. The basic motto I use is: to have a value be written on thread A and read on thread B, A needs to write the value and then do a release-store on some atomic, and B then needs to load-acquire on the same atomic and only then can it read the value.

thw_9a83c

> To describe the necessary relations, you need to add several more kinds of dependencies, and I'm not off-hand sure which dependencies ended up with which labels.

It's:

    relaxed, consume, acquire, release, acq_rel, seq_cst
Nicely described here: <https://en.cppreference.com/w/cpp/atomic/memory_order.html>

jcranmer

No, that's not the thing I'm talking about. Those are the different ordering modes you can specify on atomic operations.

Rather, there's a panoply of definitions like "inter-thread happens-before" and "synchronizes-with" and "happens-before", and those are the ones I don't follow closely. It gets even more confusing when you're reading academic papers on weak memory models.

thw_9a83c

This seems like an overly academic exercise. Can the compiler, or even the operating system, guarantee that the threads `a`, `b`, and `c` are started in that order? I don't think so. The OS might start executing thread `a` on one CPU core and then be interrupted by a high-priority interrupt before it can do anything useful. By that time, threads `b` and `c` might already be running on other cores and have finished executing before thread `a`.

Sharlin

Sure, that’s an entirely valid execution. But this is about what exact pairs of values (x, y) are observable by each thread. Some are allowed, others are not, by the semantics of atomic loads and stores guaranteed by the CPU. The starting or joining order of the threads doesn’t matter, except insofar that thread starting and joining both synchronize-with the parent thread.

In general, in the presence of hardware parallelism (i.e., always since 2007 or so) the very real corner cases are much more involved than "what if there’s an interrupt", and thinking in terms of single-core, interleaved concurrency is not very fruitful in the presence of memory orderings less strict than seq_cst. It’s not about what order things can happen in (because there isn’t a single order); it’s principally about how writes are propagated from the cache of one core to that of another.

x86 processors have sort of lulled many programmers of concurrent code into a false sense of safety because almost everything is either unordered or sequentially consistent. But the other now-common architectures aren’t as forgiving.

thw_9a83c

Thanks! So now the article actually makes sense to me. It would be nice to have this important clarification in the article itself. I'm not saying that a careful reader can't infer this point from the article even now, but I'm not such a careful reader.

Edit: Since the parent commenter added two more paragraphs after I posted my answer: I wasn't wondering about the pitfalls of sequentially consistent multi-threaded execution on various CPU architectures. It is a well-known fact that x86 adheres to the stronger Total Store Order (TSO) model, whereas POWER and ARM have weaker memory models and actually require memory barriers at the instruction level, not just prevention of compiler reordering.

cvoss

It doesn't matter for this article whether there exist possible executions other than the one the author inquires about.

The point of weak memory models is to formally define the set of all possible legal executions of a concurrent program. This gets very complicated in a hurry because of the need to accommodate 1) hardware properties such as cache coherence and 2) desired compiler optimization opportunities that will want to reason over what's possible / what's guaranteed.

In this case, there was a conflict between a behavior of Power processors and an older C++ standard that meant that a correct compiler would have to introduce extra synchronization to prevent the forbidden behavior, thus impacting performance. The solution was to weaken the memory model of the standard in a very small way.

The article walks us through how exactly the newer standard permits the funny unintuitive execution of the example program.

The exercise is academic, sure. A lot of hard academic research has gone into this field to get it right. But it has to be that precise because the problems it solves are that subtle.

thw_9a83c

Yes, see: https://news.ycombinator.com/item?id=45091610

Originally, I was commenting that the purpose of the article was initially unclear to me, since the order of thread execution cannot be determined anyway.

I now understand that there was a corner case in the POWER and ARM architectures when mixing seq_cst and acquire-release operations on the same atomic variable. Thus, C++26 will be updated to allow more relaxed behavior in order to maintain performance.

https://www.open-std.org/JTC1/SC22/WG21/docs/papers/2018/p06...

mlvljr

Can't wait for the 25 yo seniors to make this the crown jewel of their interviews

shultays

  The comments show the values each thread observed.
Why? Nothing in that code implies any synchronization between threads or forces an ordering. thread_2 can fetch the value of y before thread_1 writes to it, which would set b to 0.

You would need additional mechanisms (an extra atomic that you compare_exchange) to force order

edit: but I guess the comment means it is the thing the author wants to observe

  Now, the big question: is this execution even possible under the C++ memory model?
sure, use an extra atomic to synchronize threads

masfuerte

The comments show the actual values observed in one particular execution. The author asks if this is compatible with the C++ memory model.

In other words, the author considers this execution to be surprising under the C++ memory model, and then goes on to explain it.

cvoss

> sure, use an extra atomic to synchronize threads

What? That would make the situation worse. The execution has a weird unintuitive quirk where the actions of thread 3 seem to precede the actions of thread 1, which seem to precede the actions of thread 2, yet thread 2 observes an action of thread 3. Stronger synchronization would forbid such a thing.

The main question of the article is "Is the memory model _weak enough_ to permit the proposed execution?"