Using black magic to make a fast circular buffer
13 comments
·January 11, 2025virexene
The author claims that mmap is exposed by glibc whereas memfd_create is not, so they reimplemented it with syscall, but looking at the man page, did they just forget to #define _GNU_SOURCE?
defrost
> Apparently this trick has been known for a while, but is seldom used.
A long, long, long time - especially on old big iron mainframes.
Here's a virtual ring buffer in C using mmap from 2002 written by a grey beard former mainframe coder, Philip Howard aka skapare:
https://web.archive.org/web/20100706051254/http://libh.slash...
giomasce
How does that interact with cache? Does accessing the ring buffer using the second set of mapped pages ends up using the same cache line, or is it a fresh request to main memory? If it's the latter, I guess that's has good chances of making your circular buffer slower, depending on how big it is, how does your cache work and how much cache pressure you experience. I don't think I know enough about actual caches to say whether that's probable or not.
junon
Same cache-line. CPU caches come after virtual memory translations / TLB lookups. Memory caches work on physical addresses, not linear (virtual) addresses.
Memory access -> TLB cache lookup -> PT lookup (if TLB miss) -> L1 cache check (depending on PT flags) -> L2 cache check (depending on PT flags, if L1 misses) -> ... -> main memory fetch, to boil it down simply.
CPUs would be ridiculously slow if that wasn't the case. Also upon thinking about it a bit more, I have no idea how it'd even work if it was the other way around. (EDIT: To be clear, I meant if main memory cache was hit first followed by the MMU - someone correctly mentioned VIVT caches which aren't what I meant :D)
saagarjha
VIVT caches exist, though.
junon
That's very true, though AFAIK they aren't used much in modern processors. It's usually PIPT or VIPT (I think I've seen more references to the latter), VIPT being prevalent because the logical address and the cache can be resolved in parallel when designing the circuitry.
But I've not designed CPUs nor do I work for $chip_manu so I'm speculating. Would love more info if anyone has it.
EDIT: Looks like some of the x86 manus have figured out a VIPT that has less downsides and behaves more like PIPT caches. I'd imagine "how" is more of a trade secret though. I wonder what ARM manus do, actually. Going to have to look it up :D
eulgro
What other kinda of optimizations can be made by using the page table?
For example, from what I understand, allocators operate on virtual memory, so if for example a vector increases in size past its capacity, memory has to be reallocated and the existing content copied. Wouldn't it be possible to perform that copy by updating virtual addresses in the page table, with no copy actually being performed in physical memory?
giomasce
That still requires you to plan how you use the virtual address space, though. You can't just add more memory pages on the back of your vector if you've already started using those virtual memory locations for other stuff. So you have to know in advance how much space your vector might end up using, and carefully avoid placing anything else there. If you're willing to come up with such an estimate, and if your OS is happy to overcommit memory (and I think Linux is, by default at least), then you can just malloc() all that memory in the first place. With overcommitting no physical page will be used to back your virtual memory block until you really use it.
If your system doesn't do overcommitting, then I guess that with some mmap() trickery (using the appropriate flags) you could do more or less the same thing, reserving the virtual address space that you need and then actually backing with memory as you need.
superjan
There’s a great paper from the group of Emery Berger, where they merge memory pages that become sparse. A kind of garbage collection for C and C++ languages. There’s also a good presentation about it on youtube.
https://arxiv.org/abs/1902.04738
The presentation: https://youtu.be/c1UBJbfR-H0
eulgro
Interesting! Thanks for the link.
ncruces
If the array is large enough to span a page, and the memory was obtained from the system with mmap, you can extend it with mremap, yes.
junon
Technically yes, you can do copy-on-write semantics, which marks both the original and "copied" page as read-only, and the first write to either virtual page causes a page fault in the kernel (access violation) prompting a physical page allocation, copying the page, and then updating both virtual page table entries' permissions to be RW, then the equivalent of `invlpg` to flush them out of the TLB, before returning to the program to re-try the write instruction.
However this gets tricky with heap allocators since they often pack `malloc`-like allocations into the same page using some allocation scheme (e.g. buddy allocation, etc.). Though you can typically bypass malloc-like, in-process allocators to handle page mappings yourself, it's just a little more cumbersome.
This is actually how "virtual allocation" (I think is the term, don't quote me) happens; if you `malloc` a very large space on many modern OSes, you get a pointer back and there is an allotment in the process's virtual address space registered, but no page tables are actually updated (the kernel can assume they're "not present", meaning that there is no read nor write availability; any memory operation with that range causes a page fault). Upon any page in that range being accessed for the first time, the page fault handler sees it's part of that otherwise very large segment and will allocate in physical memory to that virtual page, then return.
It allows for e.g. allocating a VERY large sparse array of items (even terabytes!) that might otherwise exceed the size of physical memory actually available to the system, with the assumption that it'll never actually be fully populated (or at least populated across enough memory pages such that you exhaust 'real', physical memory).
This is also how file mapping works. You file map something from the filesystem, and the kernel keeps a cache of read pages from that file; memory reads and writes cause page faults if they haven't been read in (and cached), prompting a read from the disk storage into a physical page, updating the page tables, and resuming the faulting thread. It also allows the kernel to selectively reclaim less-frequently-used file mapped pages for higher priority allocations at any time really, since the next time the program tries to access that page, the kernel just faults again, reads it into a new page, and updates the tables. It's entirely transparent to the process.
I'm designing a novel kernel at the moment and some of these sorts of possibilities are what I find the more interesting parts of design, actually. I find them to be underutilized to some extent even in modern kernels :)
Linux, FreeBSD seem to have memfd_create(). Mac OS seem to have shm_open() instead. Actually, the first two have that one as well. Interesting.
The technique itself is indeed very elegant.