bogeholm
Nice guy by the way! Met him on a train home from work once. I was working on my computer, glanced left and saw someone with a red beard running a tiling WM on some real boy system. Since we were in the silent zone, I wrote
phk?
In a text editor - got a nod, and we shook hands :)
elteto
"... spending an hour over breakfast, chatting with Dennis Ritchie about device nodes and timekeeping in early UNIX kernels"
Wow, what an incredible experience!
nasretdinov
Nice article! I wonder if now, with all the NUMA stuff and processors with hundreds of cores, things have changed enough to warrant another complete redesign similar to the one described in the article.
karmakaze
A lot of the article talked about swap, which wouldn't be a concern in normal operation of most production servers--cache/memory locality still matters, but not as dramatically. Back when I was managing bare-metal MySQL servers we were scaling into NUMA memory (with jemalloc/tcmalloc). There was an initial performance degradation that required a lot of fine-tuning, even working around how the same motherboard/CPUs would initialize core affinities differently. A new problem was deadlocking of large transactions that touched multiple buffer instances. Mind you, this wasn't a clean codebase that had put a lot of thought into avoiding deadlocks (up until then).
At the time I didn't think much about how the allocators could help, as they're constrained to the ABI. Writing in Zig with custom allocators for everything would help. The only MySQL NUMA setting was innodb_numa_interleave=ON, which wasn't very good, but not a lot worse than trying harder.
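For reference, that's a one-liner in my.cnf (a sketch; as I understand it, it only interleaves the buffer pool allocation across NUMA nodes rather than making InnoDB properly NUMA-aware):

    [mysqld]
    # Interleave the InnoDB buffer pool allocation across NUMA nodes (MySQL 5.7+)
    innodb_numa_interleave = ON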
toast0
Much of that is well addressed by one allocator arena per cpu, and either pinning threads to cpus or at least having a high threshold to move threads across NUMA boundaries.
If you have a lot of cross-thread memory use, maybe you need something to help with allocate-on-core-X, free-on-core-Y patterns and the cross-core communication that causes (maybe that's already in place?).
There's more memory overhead that way, but large core count systems tend to have a lot of memory too.
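On Linux the pinning half of that is just an affinity mask; a minimal sketch (Linux/glibc-specific, and the per-CPU arena wiring depends on the allocator):

    /* Sketch: pin the calling thread to one CPU so its allocations stay
       in that CPU's arena / NUMA node. Linux + glibc only. */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>
    #include <string.h>

    static int pin_current_thread_to_cpu(int cpu)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        /* Returns 0 on success; afterwards the scheduler won't migrate
           this thread across NUMA boundaries at all. */
        return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }

    int main(void)
    {
        int rc = pin_current_thread_to_cpu(0);
        if (rc != 0)
            fprintf(stderr, "pthread_setaffinity_np: %s\n", strerror(rc));
        /* ... allocate and run this worker's hot loop here ... */
        return 0;
    }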
masklinn
> Much of that is well addressed by one allocator arena per cpu, and either pinning threads to cpus or at least having a high threshold to move threads across NUMA boundaries.
Note that this can have an awkward effect: if the thread gets parked (either entirely, or just stops calling the allocator because it has reached its steady state), the allocator may never have the opportunity to release that thread's memory. IIRC mimalloc suffers from this issue; you need to call an allocator-specific API to tell it about the regime change.
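If the allocator is mimalloc, I believe the call in question is mi_collect(); a rough sketch, assuming a worker that knows it's about to go quiet:

    /* Sketch: a thread that has hit steady state (or is about to park)
       asks mimalloc to return what it can. mi_collect() is mimalloc-specific;
       glibc's rough counterpart is malloc_trim(). */
    #include <mimalloc.h>

    void worker_entering_steady_state(void)  /* hypothetical hook */
    {
        /* force = true: aggressively free retained pages back to the OS now,
           since this thread may not call the allocator again for a while. */
        mi_collect(true);
    }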
For those unaware, "PHK" is:
* https://en.wikipedia.org/wiki/Poul-Henning_Kamp
Amongst other things (including jails), he invented the MD5crypt algorithm (originally for FreeBSD) as an alternative to the original DEScrypt of Unix:
* https://en.wikipedia.org/wiki/Crypt_(C)#MD5-based_scheme
Nowadays he's probably best known for creating Varnish:
* https://en.wikipedia.org/wiki/Varnish_(software)