
Analyzing the memory ordering models of the Apple M1

charles_irl

Cool paper! The authors use the fact that the M1 chip supports both ARM's weaker memory consistency model and x86's total order to investigate the performance hit from using the latter, ceteris paribus.

They see an average of 10% degradation on SPEC and show some synthetic benchmarks with a 2x hit.

loeg

This comment is a two-sentence summary of the six-sentence abstract at the very top of the linked article. (Though the paper claims 9%, not 10% -- to three sig figs, so rounding up to 10% is inappropriate.)

Also -- 9% is huge! I am kind of skeptical of this result (haven't yet read the paper). E.g., is it possible the M1's TSO implementation isn't optimal, yielding weaker relative performance than a natively TSO platform like x86?

> An application can benefit from weak MCMs if it distributes its workload across multiple threads which then access the same memory. Less-optimal access patterns might result in heavy cache-line bouncing between cores. In a weak MCM, cores can reschedule their instructions more effectively to hide cache misses while stronger MCMs might have to stall more frequently.

So to some extent, this is avoidable overhead with better design (reduced mutable sharing between threads). The impact of TSO vs WO is greater for programs with more sharing.

> The 644.nab_s benchmark consists of parallel floating point calculations for molecular modeling. ... If not properly aligned, two cores still share the same cache-line as these chunks span over two instead of one cache-line. As shown in Fig. 5, the consequence is an enormous cache-line pressure where one cache-line is permanently bouncing between two cores. This high pressure can enforce stalls on architectures with stronger MCMs like TSO, that wait until a core can exclusively claim a cache-line for writing, while weaker memory models are able to reschedule instructions more effectively. Consequently, 644.nab_s performs 24 percent better under WO compared to TSO.

Yeah, ok, so the huge magnitude observed is due to some really poor program design.
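The cache-line bouncing described above is classic false sharing. A minimal C++ sketch (not the benchmark's actual code; the 64-byte line size and field names are assumptions) of the misaligned vs. padded layouts:

```cpp
#include <cstddef>

// Hypothetical layout sketch: two accumulators, each written by a
// different thread. In the unpadded version both fields sit in the same
// 64-byte cache line, so every write bounces the line between cores --
// exactly the pressure the paper describes. A weak MCM can reorder
// around the resulting misses; TSO has to stall more often.
struct Unpadded {
    double a;  // written by thread 0
    double b;  // written by thread 1 -- shares a cache line with `a`
};

// Padding each field to its own 64-byte line removes the contention
// regardless of the memory model.
struct Padded {
    alignas(64) double a;  // own cache line
    alignas(64) double b;  // own cache line
};

static_assert(sizeof(Unpadded) == 16, "both fields fit in one line");
static_assert(sizeof(Padded) == 128, "one 64-byte line per field");
```

(`std::hardware_destructive_interference_size` is the portable way to get the line size, where available.)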

> The primary performance advantage applications might gain from running under weaker memory ordering models like WO is due to greater instruction reordering capabilities. Therefore, the performance benefit vanishes if the hardware architecture cannot sufficiently reorder the instructions (e.g., due to data dependencies).
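To make the reordering point concrete: under a weak model, the programmer has to request ordering explicitly where it matters; under TSO every store/load pair is already ordered, which is exactly where the stall cost comes from. A hedged sketch (my own illustration, not from the paper) of the standard release/acquire message-passing idiom:

```cpp
#include <atomic>
#include <thread>

// On a weakly ordered machine, the release/acquire pair below is the
// *only* thing preventing the hardware from reordering the payload
// store past the flag store. Under TSO, stores are never reordered
// with earlier stores anyway -- the ordering is "free" but enforced by
// stalling rather than by selective barriers.
int payload = 0;
std::atomic<bool> ready{false};

void producer() {
    payload = 42;                                  // plain store
    ready.store(true, std::memory_order_release);  // publishes payload
}

int consumer() {
    while (!ready.load(std::memory_order_acquire)) { }  // spin on flag
    return payload;  // acquire/release guarantees this reads 42
}
```

The data-dependency caveat in the quote is visible here too: the consumer's read of `payload` depends on the flag load completing, so even a weak model gains nothing on that path.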

Read the thing all the way through. It's interesting and maybe useful for thinking about WO vs TSO mode on Apple M1 Ultra chips specifically, but I don't know how much it generalizes.

MBCook

I’ve seen the stronger x86 memory model argued as one of the things that affects its performance before.

It’s neat to see real numbers on it. The effect didn’t seem to be very big in most circumstances, which matches what I would have guessed.

Of course, Apple only just implemented TSO on the M1, while AMD/Intel have been doing it for a long time. I wonder if later M chips reduced the effect. And will they drop the feature once they drop Rosetta 2?

jchw

I'm really curious how exactly they'll wind up phasing out Rosetta 2. They seem to be a bit coy about it:

> Rosetta was designed to make the transition to Apple silicon easier, and we plan to make it available for the next two major macOS releases – through macOS 27 – as a general-purpose tool for Intel apps to help developers complete the migration of their apps. Beyond this timeframe, we will keep a subset of Rosetta functionality aimed at supporting older unmaintained gaming titles, that rely on Intel-based frameworks.

However, that leaves much unsaid. Unmaintained gaming titles? Does this mean native, old macOS games? I thought many of them were already no longer functional by this point. What about Crossover? What about Rosetta 2 inside Linux?

https://developer.apple.com/documentation/virtualization/run...

I wouldn't be surprised if they really do drop some x86 amenities from the SoC at the cost of performance, but I think it would be a bummer if they dropped the Rosetta 2 use cases that don't involve native apps. Those use cases are useful. Rosetta 2 is faster than alternative recompilers. Maybe FEX will have closed most of the gap by then?