Using black magic to make a fast circular buffer (2017)

30 comments

·January 11, 2025

defrost

> Apparently this trick has been known for a while, but is seldom used.

A long, long, long time - especially on old big iron mainframes.

Here's a virtual ring buffer in C using mmap from 2002 written by a grey beard former mainframe coder, Philip Howard aka skapare:

https://web.archive.org/web/20100706051254/http://libh.slash...

vdupras

> The fact that a modulo operation is necessary to index into the array makes this function hard (if not impossible) to vectorize, and thus unnecessarily slow.

It seems to me that this benchmark conflates two optimizations that could be tackled separately.

For example, if you took the naive approach and just use a power-of-2 buffer size so that modulo could be replaced with "&", you'd get a pretty good speedup without resorting to mmap, no?

tom_

But it's the same thing: extra per byte overhead. You can polish this turd, or (better, in my view) give up entirely and do what the subsequent snippet does and copy the data in 1 or 2 blocks.

Whatever you do, there's another advantage to the mmap trick: if the items are variable sizes, they never need to be split. (Split items are often a pain to deal with.) The circularity becomes the buffer's problem alone, and consumers don't have to handle it.

Another article about this here: https://fgiesen.wordpress.com/2012/07/21/the-magic-ring-buff... - see also the comments.

vdupras

Certainly, but by conflating the two, it makes the mmap trick appear better than it is in reality. There are good chances that a "&" change brings a very significant speedup.

tom_

I updated the author's benchmark: https://gcc.godbolt.org/z/bxEzerjc8

While updating it, I noticed a couple of things:

1. the Makefile doesn't specify -O, which seems to imply -O0, unreasonably flattering the 1-memcpy case because there's only 1 call to a generic memcpy rather than 2. With -O3 the memcpys get inlined

2. make the message size more annoying. The queue size was a multiple of the message size, so the 2-memcpy case was never exercised - and -O3 could see this

(Change 2 actually slightly cancels out change 1, and now the 2-memcpy case is terrible again...)

Who even knows what sort of system the Compiler Explorer executor runs the programs on, but this is what I got from running it, and the relative speeds might be representative:

    %: 0.033716 microseconds per write
    &: 0.031772 microseconds per write
    slow: 0.009715 microseconds per write
    fast: 0.001035 microseconds per write

ahartmetz

Actually, even "if (pos >= bufSize) { pos -= bufSize; }" is much faster than pos = pos % bufSize. Division (and modulo) are like the only basic math operations that a modern wide ALU doesn't easily do three to four of per cycle (they are much more expensive), OTOH easily predicted branches are fairly cheap (or even compiled into conditional moves here).

coldpepper

Exactly. I don't know how could the blog post author have missed this.

lq0000

Author here. I was 22 when I wrote this nearly 8 years ago. I just thought it was neat, give me a break. Thanks.

superjan

Don’t worry. It _is_ a neat trick, and it was nice to post complete instructions and benchmarks. And it wasn’t you who put it to the HN frontpage.

eulgro

What other kinda of optimizations can be made by using the page table?

For example, from what I understand, allocators operate on virtual memory, so if for example a vector increases in size past its capacity, memory has to be reallocated and the existing content copied. Wouldn't it be possible to perform that copy by updating virtual addresses in the page table, with no copy actually being performed in physical memory?

giomasce

That still requires you to plan how you use the virtual address space, though. You can't just add more memory pages on the back of your vector if you've already started using those virtual memory locations for other stuff. So you have to know in advance how much space your vector might end up using, and carefully avoid placing anything else there. If you're willing to come up with such an estimate, and if your OS is happy to overcommit memory (and I think Linux is, by default at least), then you can just malloc() all that memory in the first place. With overcommitting no physical page will be used to back your virtual memory block until you really use it.

If your system doesn't do overcommitting, then I guess that with some mmap() trickery (using the appropriate flags) you could do more or less the same thing, reserving the virtual address space that you need and then actually backing with memory as you need.

HALtheWise

One example that comes to mind is the (abandoned) proposal for safe arena allocation in Go. It takes advantage of the fact that 64 bit computers have absolutely massive virtual memory, so that "freed" memory can be permanently unmapped from virtual memory in order to make any future accesses an error instead of a use-after-free bug.

https://github.com/golang/go/issues/51317

ncruces

If the array is large enough to span a page, and the memory was obtained from the system with mmap, you can extend it with mremap, yes.

https://man7.org/linux/man-pages/man2/mremap.2.html

superjan

There’s a great paper from the group of Emery Berger, where they merge memory pages that become sparse. A kind of garbage collection for C and C++ languages. There’s also a good presentation about it on youtube.

https://arxiv.org/abs/1902.04738

The presentation: https://youtu.be/c1UBJbfR-H0

eulgro

Interesting! Thanks for the link.

junon

Technically yes, you can do copy-on-write semantics, which marks both the original and "copied" page as read-only, and the first write to either virtual page causes a page fault in the kernel (access violation) prompting a physical page allocation, copying the page, and then updating both virtual page table entries' permissions to be RW, then the equivalent of `invlpg` to flush them out of the TLB, before returning to the program to re-try the write instruction.

However this gets tricky with heap allocators since they often pack `malloc`-like allocations into the same page using some allocation scheme (e.g. buddy allocation, etc.). Though you can typically bypass malloc-like, in-process allocators to handle page mappings yourself, it's just a little more cumbersome.

This is actually how "virtual allocation" (I think is the term, don't quote me) happens; if you `malloc` a very large space on many modern OSes, you get a pointer back and there is an allotment in the process's virtual address space registered, but no page tables are actually updated (the kernel can assume they're "not present", meaning that there is no read nor write availability; any memory operation with that range causes a page fault). Upon any page in that range being accessed for the first time, the page fault handler sees it's part of that otherwise very large segment and will allocate in physical memory to that virtual page, then return.

It allows for e.g. allocating a VERY large sparse array of items (even terabytes!) that might otherwise exceed the size of physical memory actually available to the system, with the assumption that it'll never actually be fully populated (or at least populated across enough memory pages such that you exhaust 'real', physical memory).

This is also how file mapping works. You file map something from the filesystem, and the kernel keeps a cache of read pages from that file; memory reads and writes cause page faults if they haven't been read in (and cached), prompting a read from the disk storage into a physical page, updating the page tables, and resuming the faulting thread. It also allows the kernel to selectively reclaim less-frequently-used file mapped pages for higher priority allocations at any time really, since the next time the program tries to access that page, the kernel just faults again, reads it into a new page, and updates the tables. It's entirely transparent to the process.

I'm designing a novel kernel at the moment and some of these sorts of possibilities are what I find the more interesting parts of design, actually. I find them to be underutilized to some extent even in modern kernels :)

virexene

The author claims that mmap is exposed by glibc whereas memfd_create is not, so they reimplemented it with syscall, but looking at the man page, did they just forget to #define _GNU_SOURCE?

lq0000

When I originally wrote this, it indeed was not

virexene

Ah, you're right, I didn't see it's a more recent addition.

truefossil

Linux, FreeBSD seem to have memfd_create(). Mac OS seem to have shm_open() instead. Actually, the first two have that one as well. Interesting.

The technique itself is indeed very elegant.

giomasce

How does that interact with cache? Does accessing the ring buffer using the second set of mapped pages ends up using the same cache line, or is it a fresh request to main memory? If it's the latter, I guess that's has good chances of making your circular buffer slower, depending on how big it is, how does your cache work and how much cache pressure you experience. I don't think I know enough about actual caches to say whether that's probable or not.

junon

Same cache-line. CPU caches come after virtual memory translations / TLB lookups. Memory caches work on physical addresses, not linear (virtual) addresses.

Memory access -> TLB cache lookup -> PT lookup (if TLB miss) -> L1 cache check (depending on PT flags) -> L2 cache check (depending on PT flags, if L1 misses) -> ... -> main memory fetch, to boil it down simply.

CPUs would be ridiculously slow if that wasn't the case. Also upon thinking about it a bit more, I have no idea how it'd even work if it was the other way around. (EDIT: To be clear, I meant if main memory cache was hit first followed by the MMU - someone correctly mentioned VIVT caches which aren't what I meant :D)

saagarjha

VIVT caches exist, though.

junon

That's very true, though AFAIK they aren't used much in modern processors. It's usually PIPT or VIPT (I think I've seen more references to the latter), VIPT being prevalent because the logical address and the cache can be resolved in parallel when designing the circuitry.

But I've not designed CPUs nor do I work for $chip_manu so I'm speculating. Would love more info if anyone has it.

EDIT: Looks like some of the x86 manus have figured out a VIPT that has less downsides and behaves more like PIPT caches. I'd imagine "how" is more of a trade secret though. I wonder what ARM manus do, actually. Going to have to look it up :D

lq0000

I collected some more measurements with different implementations of the get function.

1000000 iterations, buffer 8192, message 128

i9-13900K, gcc 13.3, -O3, kernel 6.8.0-49, glibc 2.39

byte-mod: 0.117353 us/iter

byte-and: 0.065379 us/iter

byte-sub: 0.027865 us/iter

1cpy-mod: 0.001143 us/iter

1cpy-and: 0.001100 us/iter

1cpy-sub: 0.001098 us/iter

2cpy-mod: 0.008140 us/iter

2cpy-and: 0.001100 us/iter

2cpy-sub: 0.007711 us/iter

funny-mod: 0.001145 us/iter

funny-and: 0.001101 us/iter

funny-sub: 0.001100 us/iter

where:

`byte` is per-byte copy

`1cpy` is a single-memcpy (assumes message size divides buffer size evenly)

`2cpy` is the split memcpy (which supports other message sizes)

`funny` is single-memcpy with the double-memmapped buffer

and:

`mod` uses `head %= buffer_size`

`and` uses `head &= buffer_size`

`sub` uses `if (head >= buffer_size) head -= buffer_size`

The tl;dr here is that the doubly memmapped buffer performs the same as the 1-memcpy implementation that doesn't do anything funny and significantly better than all 2-memcpy implementations except `and`. Since your page size is gonna be a power of 2 anyway, this means this trick is not really worthwhile and indeed you should just use `and`. But, compared to the other wrapping implementations it does still provide a tangible improvement. That may just come down to how the compiler was able to optimize it, but I don't feel like nitpicking the generated code to figure out why right now.

752963e64

[dead]