io_uring is faster than mmap

48 comments · September 4, 2025

bawolff

Shouldn't you also compare to mmap with the huge page option? My understanding is it's precisely meant for this circumstance. I don't think it's a fair comparison without it.

Respectfully, the title feels a little clickbaity to me. Both methods are still ultimately reading out of memory; they are just using different I/O methods.

jared_hulbert

The original blog post title is intentionally clickbaity. You know, to bait people into clicking. Also I do want to challenge people to really think here.

Seeing if the cached file data can be accessed quickly is the point of the experiment. I can't get mmap() to open a file with huge pages.

void* buffer = mmap(NULL, size_bytes, PROT_READ, (MAP_HUGETLB | MAP_HUGE_1GB), fd, 0); doesn't work.

You can see my code here: https://github.com/bitflux-ai/blog_notes. Any ideas?

jandrewrogers

Read the man pages, there are restrictions on using the huge page option with mmap() that mean it won’t do what you might intuit it will in many cases. Getting reliable huge page mappings is a bit fussy on Linux. It is easier to control in a direct I/O context.

mastax

MAP_HUGETLB can't be used for mmaping files on disk; it can only be used with MAP_ANONYMOUS, with a memfd, or with a file on a hugetlbfs pseudo-filesystem (which is also in memory).
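
For what it's worth, a minimal sketch of the anonymous case that does work (an illustration only, not code from the article; it assumes 2MB huge pages have already been reserved via /proc/sys/vm/nr_hugepages):

    /* Anonymous MAP_HUGETLB mapping: no file descriptor involved. */
    #define _GNU_SOURCE
    #include <stddef.h>
    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
        size_t len = 2UL * 1024 * 1024;                /* one 2MB huge page */
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (p == MAP_FAILED) {
            perror("mmap(MAP_HUGETLB)");               /* typically ENOMEM if no huge pages are reserved */
            return 1;
        }
        munmap(p, len);
        return 0;
    }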

inetknght

> MAP_HUGETLB can't be used for mmaping files on disk

False. I've successfully used it to memory-map networked files.

modeless

Wait, PCIe bandwidth is higher than memory bandwidth now? That's bonkers, when did that happen? I haven't been keeping up.

Just looked at the i9-14900k and I guess it's true, but only if you add all the PCIe lanes together. I'm sure there are other chips where it's even more true. Crazy!

DiabloD3

"No."

DDR5-8000 is 64GB/s per channel. Desktop CPUs have two channels. PCI-E 5.0 in x16 is 64GB/s. Desktops have one x16.

modeless

Hmm, Intel specs the max memory bandwidth as 89.6 GB/s. DDR5-8000 would be out of spec. But I guess it's pretty common to run higher-specced memory, while you can't overclock PCIe (AFAIK?). Even so, I guess my math was wrong; it doesn't quite add up to more than memory bandwidth. But it's pretty darn close!

adgjlsfhk1

On server chips it's kind of ridiculous. 5th gen Epyc has 128 lanes of PCIe 5.0 for over 1TB/s of PCIe bandwidth (compared to 600GB/s RAM bandwidth from 12-channel DDR5 at 6400).

andersa

Your math is a bit off. 128 lanes gen5 is 8 times x16, which has a combined theoretical bandwidth of 512GB/s, and more like 440GB/s in practice after protocol overhead.

Unless we are considering both read and write bandwidth, but that seems strange to compare to memory read bandwidth.

pclmulqdq

People like to add read and write bandwidth for some silly reason. Your units are off, too, though: gen 5 is 32 GT/s, meaning 64 GB/s (or 512 gigabits per second) each direction on an x16 link.

hsn915

Shouldn't this be "io_uring is faster than mmap"?

I guess that would not get much engagement though!

That said, cool write up and experiment.

dang

Let's use that. Since HN's guidelines say "Please use the original title, unless it is misleading or linkbait", that "unless" clause seems to kick in here, so I've changed the title above. Thanks!

If anyone can suggest a better title (i.e. more accurate and neutral) we can change it again.

jared_hulbert

Lol. Thanks.

themafia

no madvise() call with MADV_SEQUENTIAL?
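
For context, the hint in question looks roughly like this (a hypothetical helper, not code from the article):

    #include <stddef.h>
    #include <stdio.h>
    #include <sys/mman.h>

    /* Hint the kernel that a file-backed mapping will be read front to
       back so it can read ahead more aggressively. */
    static void hint_sequential(void *buf, size_t size_bytes)
    {
        if (madvise(buf, size_bytes, MADV_SEQUENTIAL) != 0)
            perror("madvise(MADV_SEQUENTIAL)");
        /* MADV_WILLNEED would additionally kick off readahead immediately. */
    }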

juancn

    Because PCIe bandwidth is higher than memory bandwidth
This doesn't sound right: a PCIe 5.0 x16 slot offers up to 64 GB/s, and that's fully saturated. A fairly old Xeon server can sustain >100 GB/s memory reads per NUMA node without much trouble.

Some newer HBM-enabled parts, like a Xeon Max 9480, can go over 1.6 TB/s from HBM (up to 64GB), and DDR5 can reach > 300 GB/s.

Even saturating all PCIe lanes (196 on a dual socket Xeon 6), you could at most theoretically get ~784GB/s, which coincidentally is the max memory bandwidth of such CPUs (12 Channels x 8,800 MT/s = 105,600 MT/s total bandwidth or roughly ~784GB/s).

I mean, solid state IO is getting really close, but it's not so fast on non-sequential access patterns.

I agree that many workloads could be shifted to SSDs but it's still quite nuanced.

jared_hulbert

Not by a ton, but if you add up the DDR5 channel bandwidth and the PCIe lanes, on most systems the PCIe bandwidth is higher. Yes, HBM and L3 cache bandwidth will be higher than PCIe.

lowbloodsugar

For someone who's read it in more detail: it looks like the io_uring code is optimized for async, while the mmap code doesn't do any prefetching, so it just chokes when the OS has to do work?
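
(For reference, the async pattern being described looks roughly like the liburing sketch below; this is only an illustration of the technique, not the author's benchmark code. It keeps a fixed queue of reads in flight instead of faulting pages in one at a time.)

    /* Build with: gcc -O2 uring_sketch.c -luring */
    #include <fcntl.h>
    #include <liburing.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define QD         64               /* reads kept in flight */
    #define BLOCK_SIZE (1 << 20)        /* 1MB per read */

    int main(int argc, char **argv)
    {
        if (argc < 2) return 1;
        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        struct io_uring ring;
        if (io_uring_queue_init(QD, &ring, 0) < 0) {
            fprintf(stderr, "io_uring_queue_init failed\n");
            return 1;
        }

        char *bufs = malloc((size_t)QD * BLOCK_SIZE);
        off_t offset = 0;

        /* Queue up QD reads at increasing offsets, then submit them all at once. */
        for (int i = 0; i < QD; i++, offset += BLOCK_SIZE) {
            struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
            io_uring_prep_read(sqe, fd, bufs + (size_t)i * BLOCK_SIZE, BLOCK_SIZE, offset);
        }
        io_uring_submit(&ring);

        /* Reap completions; a full benchmark would resubmit new reads here. */
        for (int i = 0; i < QD; i++) {
            struct io_uring_cqe *cqe;
            io_uring_wait_cqe(&ring, &cqe);
            if (cqe->res < 0)
                fprintf(stderr, "read failed: %d\n", cqe->res);
            io_uring_cqe_seen(&ring, cqe);
        }

        io_uring_queue_exit(&ring);
        free(bufs);
        close(fd);
        return 0;
    }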

titanomachy

Very interesting article, thanks for publishing these tests!

Is the manual loop unrolling really necessary to get vectorized machine code? I would have guessed that the highest optimization levels in LLVM would be able to figure it out from the basic code. That's a very uneducated guess, though.

Also, curious if you tried using the MAP_POPULATE option with mmap. Could that improve the bandwidth of the naive in-memory solution?

> humanity doesn't have the silicon fabs or the power plants to support this for every moron vibe coder out there making an app.

lol. I bet if someone took the time to make a high-quality well-documented fast-IO library based on your io_uring solution, it would get use.

jared_hulbert

YES! gcc and clang don't like to optimize this. But they do if you hardcode size_bytes to an aligned value. It kind of makes sense: what if a user passes size_bytes as 3? With enough effort the compilers could handle this, but it's a lot to ask.
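
For illustration, a sketch of the kind of unrolled counting loop being discussed (hypothetical, not the code from the linked repo); independent accumulators plus an explicit scalar tail are what make the main body easy for a compiler to vectorize:

    #include <stddef.h>
    #include <stdint.h>

    /* Count how many 32-bit integers in buf equal target. */
    size_t count_matches_unrolled(const uint32_t *buf, size_t n, uint32_t target)
    {
        size_t c0 = 0, c1 = 0, c2 = 0, c3 = 0;
        size_t i = 0;

        /* Main unrolled body: four independent accumulators, no
           cross-iteration dependency, easy to turn into vector compares. */
        for (; i + 4 <= n; i += 4) {
            c0 += (buf[i + 0] == target);
            c1 += (buf[i + 1] == target);
            c2 += (buf[i + 2] == target);
            c3 += (buf[i + 3] == target);
        }

        /* Scalar tail for lengths that aren't a multiple of 4
           (the "what if size_bytes is 3" case). */
        for (; i < n; i++)
            c0 += (buf[i] == target);

        return c0 + c1 + c2 + c3;
    }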

I just ran MAP_POPULATE and the results are interesting.

It speeds up the counting loop. Same speed as or higher than my read()-into-a-malloced-buffer tests.

HOWEVER... it takes longer overall to populate the buffer. The end result is that it's 2.5 seconds slower to run the full test compared to the original. I did not guess that one correctly.

    time ./count_10_unrolled ./mnt/datafile.bin 53687091200
    unrolled loop found 167802249 10s processed at 5.39 GB/s
    ./count_10_unrolled ./mnt/datafile.bin 53687091200 5.58s user 6.39s system 99% cpu 11.972 total

    time ./count_10_populate ./mnt/datafile.bin 53687091200
    unrolled loop found 167802249 10s processed at 8.99 GB/s
    ./count_10_populate ./mnt/datafile.bin 53687091200 5.56s user 8.99s system 99% cpu 14.551 total

mischief6

it could be interesting to see what ispc does with similar code.

inetknght

Nice write-up with good information, but not the best. Comments below.

Are you using Linux? I assume so, since you state use of mmap() and mention using EPYC hardware (which rules out macOS). I suppose you could use any other *nix though.

> We'll use a 50GB dataset for most benchmarking here, because when I started this I thought the test system only had 64GB and it stuck.

So the OS will (or could) prefetch the file into memory. OK.

> Our expectation is that the second run will be faster because the data is already in memory and as everyone knows, memory is fast.

Indeed.

> We're gonna make it very obvious to the compiler that it's safe to use vector instructions which could process our integers up to 8x faster.

There are even-wider vector instructions by the way. But, you mention another page down:

> NOTE: These are 128-bit vector instructions, but I expected 256-bit. I dug deeper here and found claims that Gen1 EPYC had unoptimized 256-bit instructions. I forced the compiler to use 256-bit instructions and found it was actually slower. Looks like the compiler was smart enough to know that here.

Yup, indeed :)

Also note that AVX2 and/or AVX512 instructions are notorious for causing thermal throttling on certain (older by now?) CPUs.

> Consider how the default mmap() mechanism works, it is a background IO pipeline to transparently fetch the data from disk. When you read the empty buffer from userspace it triggers a fault, the kernel handles the fault by reading the data from the filesystem, which then queues up IO from disk. Unfortunately these legacy mechanisms just aren't set up for serious high performance IO. Note that at 610MB/s it's faster than what a disk SATA can do. On the other hand, it only managed 10% of our disk's potential. Clearly we're going to have to do something else.

In the worst case, that's true. But you can also get the kernel to prefetch the data.

See several of the flags, but if you're doing sequential reading you can use MAP_POPULATE [0] which tells the OS to start prefetching pages.
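
A minimal sketch of that suggestion (hypothetical example, not code from the article), assuming a read-only scan of one file:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        if (argc < 2) return 1;
        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

        /* MAP_POPULATE pre-faults the whole file into the mapping up front,
           instead of taking page faults one readahead window at a time. */
        void *buf = mmap(NULL, st.st_size, PROT_READ,
                         MAP_SHARED | MAP_POPULATE, fd, 0);
        if (buf == MAP_FAILED) { perror("mmap"); return 1; }

        /* ... scan buf here ... */

        munmap(buf, st.st_size);
        close(fd);
        return 0;
    }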

You also mention 4K page table entries. Page table entries can get to be very expensive in CPU to look up. I had that happen at a previous employer with an 800GB file; most of the CPU was walking page tables. I fixed it by using (MAP_HUGETLB | MAP_HUGE_1GB) [0] which drastically reduces the number of page tables needed to memory map huge files.

Importantly: when the OS realizes that you're accessing the same file a lot, it will just keep that file in memory cache. If you're only mapping it with PROT_READ and MAP_SHARED, then it won't even need to duplicate the physical memory to a new page: it can just re-use existing physical memory with a new process-specific page table entry. This often ends up caching the file on first access.

I had done some DNA calculations with fairly trivial 4-bit-wide data, each bit representing one of DNA basepairs (ACGT). The calculation was pure bitwise operations: or, and, shift, etc. When I reached the memory bus throughput limit, I decided I was done optimizing. The system had 1.5TB of RAM, so I'd cache the file just by reading it upon boot. Initially caching the file would take 10-15 minutes, but then the calculations would run across the whole 800GB file in about 30 seconds. There were about 2000-4000 DNA samples to calculate three or four times a day. Before all of this was optimized, the daily inputs would take close to 10-16 hours to run. By the time I was done, the server was mostly idle.

[0]: https://www.man7.org/linux/man-pages/man2/mmap.2.html

jared_hulbert

    int fd = open(filename, O_RDONLY);
    void* buffer = mmap(NULL, size_bytes, PROT_READ, (MAP_HUGETLB | MAP_HUGE_1GB), fd, 0);

This doesn't work with a file on my ext4 volume. What am I missing?

inetknght

What issue are you having? Are you receiving an error? This is the kind of question that StackOverflow or perhaps an LLM might be able to help you with. I highly suggest reading the documentation for mmap to understand what issues could happen and/or what a given specific error code might indicate; see the NOTES section:

> Huge page (Huge TLB) mappings

> For mappings that employ huge pages, the requirements for the arguments of mmap() and munmap() differ somewhat from the requirements for mappings that use the native system page size.

> For mmap(), offset must be a multiple of the underlying huge page size. The system automatically aligns length to be a multiple of the underlying huge page size.

Ensure that the file is at least the page size, and preferably sized to align with a page boundary. Then also ensure that the length parameter (size_bytes in your example) is also aligned to a boundary.

There are also other important things to understand for these flags, which are described in the documentation, such as the information available from /sys/kernel/mm/hugepages.

https://www.man7.org/linux/man-pages/man2/mmap.2.html
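
A small diagnostic sketch along those lines (hypothetical, not code from the thread): round the length up to the 1GB huge page size and print the exact errno if the mapping is still rejected.

    #define _GNU_SOURCE
    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    #ifndef MAP_HUGE_1GB
    #define MAP_HUGE_1GB (30 << 26)    /* log2(1GB) shifted by MAP_HUGE_SHIFT */
    #endif

    static void *try_huge_map(int fd, size_t size_bytes)
    {
        size_t huge = 1UL << 30;
        size_t len = (size_bytes + huge - 1) & ~(huge - 1);   /* multiple of 1GB */

        void *p = mmap(NULL, len, PROT_READ,
                       MAP_SHARED | MAP_HUGETLB | MAP_HUGE_1GB, fd, 0);
        if (p == MAP_FAILED)
            fprintf(stderr, "mmap(MAP_HUGETLB): %s\n", strerror(errno));
        return p;
    }

Per the sibling comment about hugetlbfs, a regular ext4-backed fd may still fail here (typically EINVAL); the point of the sketch is just to surface the actual error.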

jared_hulbert

Cool. Original author here. AMA.

nchmy

I just saw this post so am starting with Part 1. Could you replace the charts with ones on some sort of log scale? It makes it look like nothing happened until 2010, but I'd wager it's just an optical illusion...

And, even better, put all the lines on the same chart, or at least with the same y-axis scale (perhaps make them all relative to their base on the left), so that we can see the relative rate of growth?

jared_hulbert

I tried log scale before. The charts failed to express the exponential hockey-stick growth unless you really spend time with them and know what a log scale is. I'll work on incorporating log scale due to popular demand. They do show that the progress has been nicely exponential over time.

When I put the lines on the same chart it made the y axis impossible to understand. The units are so different. Maybe I'll revisit that.

Yeah, around 2000-2010 the doubling is noticeable. Interestingly it's also when a lot of factors started to stagnate.

nchmy

The hockey stick growth is the entire problem - it's an optical illusion resulting from the fact that going from 100 to 200 is the same rate as 200 to 400. And 800, 1600. You understand exponents.

Log axis solves this, and turns meaningless hockey sticks into a generally straightish line that you can actually parse. If it still deviates from straight, then you really know there are true changes in the trendline.

Lines on same chart can all be divided by their initial value, anchoring them all at 1. Sometimes they're still a mess, but it's always worth a try.

You're enormously knowledgeable and the posts were fascinating. But this is stats 101. Not doing this sort of thing, especially explicitly in favour of showing a hockey stick, undermines the fantastic analysis.

john-h-k

You mention modern server CPUs have capability to “read direct to L3, skipping memory”. Can you elaborate on this?

jared_hulbert

https://www.intel.com/content/www/us/en/io/data-direct-i-o-t...

AMD has something similar.

The PCIe bus and memory bus both originate from the processor or IO die of the "CPU". When you use an NVMe drive you are really just sending it a bunch of structured DMA requests. Normally you are telling the drive to DMA to an address that maps to memory, but you can direct it to cache instead and bypass sending it out on the DRAM bus.

In theory, anyway... as for the specifics of what is supported exactly, I can't vouch for that.

josephg

I’d be fascinated to see a comparison with SPDK. That bypasses the kernel’s NVMe / SSD driver and controls the whole device from user space - which is supposed to avoid a lot of copies and overhead.

You might be able to set up SPDK to send data directly into the cpu cache? It’s one of those things I’ve wanted to play with for years but honestly I don’t know enough about it.

https://spdk.io/

Jap2-0

Would huge pages help with the mmap case?

jared_hulbert

Oh man... I'd have to look into that. Off the top of my head I don't know how you'd make that happen. Way back when, I'd have said no. Now with all the folio updates to the Linux kernel memory handling I'm not sure. I think you'd have to take care to make sure the data gets into the page cache as huge pages. If not, then when you tried to madvise() (or whatever) the buffer to use huge pages, it would likely just ignore you. In theory it could aggregate the small pages into huge pages, but that would be more latency-bound work and it's not clear how that impacts the page cache.

But the arm64 systems with 16K or 64K native pages would have fewer faults.

inetknght

> I'd have to look into that. Off the top of my head I don't know how you'd make that happen.

Pass these flags to your mmap call: (MAP_HUGETLB | MAP_HUGE_1GB)

inetknght

> Would huge pages help with the mmap case?

Yes. Tens or hundreds of gigabytes' worth of 4K page table entries take a while for the OS to navigate.

comradesmith

Thanks for the article. What about using file reads from a mounted ramdisk?

jared_hulbert

Hmm. tmpfs was slower. hugetlbfs wasn't working for me.

userbinator

...for sufficiently solid values of "disk" ;-)

jared_hulbert

I worked on SSDs for years. Too many people are suffering from insufficiently solid values of "disk" IMHO.