Io_uring, kTLS and Rust for zero syscall HTTPS server
65 comments
August 22, 2025 · npalli
thomashabets2
Rust: Well yes. Rust does force you to understand the things, or it won't compile. It does have drawbacks.
Go: goroutines are not async. And you can't understand goroutines without understanding channels. And channels are weirdly implemented in Go, where the semantics of edge cases, while well defined, are like rolling a D20 die if you try to reason from first principles.
Go doesn't force you to understand things. I agree with that. It has pros and cons.
I see what you mean but "cheap threads" is not the same thing as async. More like "current status of massive concurrency". Except that's not right either. tarweb, the subject of the blog post in question, is single threaded and uses io_uring as an event loop.
So it's current status of… what exactly?
Seattle3503
> For example when submitting a write operation, the memory location of those bytes must not be deallocated or overwritten.
> The io-uring crate doesn’t help much with this. The API doesn’t allow the borrow checker to protect you at compile time, and I don’t see it doing any runtime checks either.
I've seen comments like this before[1], and I get the impression that building a safe async Rust library around io_uring is actually quite difficult. Which is sort of a bummer.
IIRC Alice from the tokio team also suggested there hasn't been much interest in pushing through these difficulties more recently, as the current performance is "good enough".
newpavlov
This is actually one of my many gripes about Rust async and why I consider it a bad addition to the language in the long term. The fundamental problem is that rust async was developed when epoll was dominant (and almost no one in the Rust circles cared about IOCP) and it has heavily influenced the async design (sometimes indirectly through other languages).
Think about it for a second. Why do we not have this problem with "synchronous" syscalls? When you call `read` you also "pass mutable borrow" of the buffer to the kernel, but it maps well into the Rust ownership/borrow model since the syscall blocks execution of the thread and there are no ways to prevent it in user code. With the poll-based async model you side-step this issue since you use the same "sync" syscalls, which are guaranteed to return without blocking.
For a completion-based IO to work properly with the ownership/borrow model we have to guarantee that the task code will not continue execution until it receives a completion event. You simply can not do it with state machines polled in user code. But the threading model fits here perfectly! If we are to replace threads with "green" threads, user Rust code will look indistinguishable from "synchronous" code. And no, the green threads model can work properly on embedded systems as demonstrated by many RTOSes.
There are several ways of how we could've done it without making the async runtime mandatory for all targets (the main reason why green threads were removed from Rust 1.0). My personal favorite is introduction of separate "async" targets.
Unfortunately, the Rust language developers made a bet on the unproven polling stackless model because of the promised efficiency and we are in the process of finding out whether the bet pays off or not.
duped
> You simply can not do it with state machines polled in user code
That's not really true. The only guarantees in Rust futures are that they are polled() once and must have their Waker's wake() called before they are polled again. A completion based future submits the request on first poll and calls wake() on completion. That's kind of the interesting design of futures in Rust - they support polling and completion.
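A minimal sketch of that shape, using only the standard library. The "kernel" here is just a helper thread, and `CompletionOp`, `Shared`, `ThreadWaker` and `block_on` are hypothetical names invented for illustration, not part of any io_uring crate. The future submits its work on the first poll, stores the waker, and is woken when the result arrives:

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::{Arc, Mutex};
use std::task::{Context, Poll, Wake, Waker};
use std::thread;

/// State shared between the future and the simulated completion source.
#[derive(Default)]
struct Shared {
    result: Option<usize>,
    waker: Option<Waker>,
}

/// Hypothetical completion-style operation: submitted on first poll.
struct CompletionOp {
    shared: Arc<Mutex<Shared>>,
    submitted: bool,
}

impl CompletionOp {
    fn new() -> Self {
        CompletionOp { shared: Arc::new(Mutex::new(Shared::default())), submitted: false }
    }
}

impl Future for CompletionOp {
    type Output = usize;

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<usize> {
        {
            let mut shared = self.shared.lock().unwrap();
            if let Some(n) = shared.result.take() {
                return Poll::Ready(n);
            }
            // Store the latest waker before the operation can complete.
            shared.waker = Some(cx.waker().clone());
        }
        if !self.submitted {
            self.submitted = true;
            let shared = Arc::clone(&self.shared);
            // Assumption: a thread stands in for the kernel completing the op.
            thread::spawn(move || {
                let waker = {
                    let mut s = shared.lock().unwrap();
                    s.result = Some(42);
                    s.waker.take()
                };
                if let Some(w) = waker {
                    w.wake();
                }
            });
        }
        Poll::Pending
    }
}

/// Waker that unparks the thread running `block_on`.
struct ThreadWaker(thread::Thread);

impl Wake for ThreadWaker {
    fn wake(self: Arc<Self>) {
        self.0.unpark();
    }
}

/// Tiny single-future executor, enough to drive the sketch.
fn block_on<F: Future>(fut: F) -> F::Output {
    let mut fut = Box::pin(fut);
    let waker = Waker::from(Arc::new(ThreadWaker(thread::current())));
    let mut cx = Context::from_waker(&waker);
    loop {
        match fut.as_mut().poll(&mut cx) {
            Poll::Ready(v) => return v,
            Poll::Pending => thread::park(),
        }
    }
}
```

Calling `block_on(CompletionOp::new())` drives the future to its result. Note that this sketch does nothing about the buffer-lifetime question raised elsewhere in the thread; the result is a plain integer precisely to avoid it.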
The real conundrum is that the futures are not really portable across executors. For io_uring for example, the executor's event loop is tightly coupled with submission and completion. And due to instability of a few features (async trait, return impl trait in trait, etc) there is not really a standard way to write executor independent async code (you can, some big crates do, but it's not necessarily trivial).
Combine that with the fact that container runtimes disable io_uring by default and most people are deploying async web servers in Docker containers, it's easy to see why development has stalled.
It's also unfair to mischaracterize design goals and ideas from 2016 with how the ecosystem evolved over the last decade, particularly after futures were stabilized before other language items and major executors became popular. If you look at the RFCs and blog posts back then (eg: https://aturon.github.io/tech/2016/09/07/futures-design/) you can see why readiness was chosen over completion, and how completion can be represented with readiness. He even calls out how naïve completion (callbacks) leads to more allocation on future composition and points to where green threads were abandoned.
newpavlov
No, the problem (in the context of io-uring) is that futures are managed by user code and can be dropped at any time. This is often referred to as "cancellation safety". Imagine a future has initialized completion-based IO with a buffer which is part of the future state. User code can simply drop the future (e.g. if it was part of `select!`) and now we have a huge problem on our hands: the kernel will write into a dropped buffer! In the synchronous context it's equivalent to de-allocating the thread stack under the feet of a thread which is blocked on a synchronous syscall. You obviously can't do it (using safe code) in thread-based code, but it's fine to do in async.
This is why you have to use various hacks when using io-uring based executors with Rust async (like using polling mode or ring-owned buffers and additional data copies). It could be "resolved" on the language level with an additional pile of hacks which would implement async Drop, but, in my opinion, it would only further hurt consistency of the language.
kibwen
> The fundamental problem is that rust async was developed when epoll was dominant (and almost no one in the Rust circles cared about IOCP)
No, this is a mistaken retelling of history. The Rust developers were not ignorant of IOCP, nor were they zealous about any specific async model. They went looking for a model that fit with Rust's ethos, and completion didn't fit. Aaron Turon has an illuminating post from 2016 explaining their reasoning: https://aturon.github.io/tech/2016/09/07/futures-design/
See the section "Defining futures":
There’s a very standard way to describe futures, which we found in every existing futures implementation we inspected: as a function that subscribes a callback for notification that the future is complete.
Note: In the async I/O world, this kind of interface is sometimes referred to as completion-based, because events are signaled on completion of operations; Windows’s IOCP is based on this model.
[...] Unfortunately, this approach nevertheless forces allocation at almost every point of future composition, and often imposes dynamic dispatch, despite our best efforts to avoid such overhead.
[...] TL;DR, we were unable to make the “standard” future abstraction provide zero-cost composition of futures, and we know of no “standard” implementation that does so.
[...] After much soul-searching, we arrived at a new “demand-driven” definition of futures.
I'm not sure where this meme came from where people seem to think that the Rust devs rejected a completion-based scheme because of some emotional affinity for epoll. They spent a long time thinking about the problem, and came up with a solution that worked best for Rust's goals. The existence of a usable io_uring in 2016 wouldn't have changed the fundamental calculus.
newpavlov
>which we found in every existing futures implementation we inspected
This is exactly what I meant when I wrote about the indirect influence from other languages. People may dress it up as much as they want, but it's clear that polling was the most important model at the time (outside of the Windows world) and a lot of design consideration was put into being compatible with it. The Rust async model literally uses the polling terminology in its most fundamental interfaces!
>this approach nevertheless forces allocation at almost every point of future composition
This is only true in the narrow world of modeling async execution with futures. Do you see heap allocations in Go on each equivalent of "future composition" (i.e. every function call)? No, you do not. With the stackful models you allocate a full stack for your task and you model function calls as plain function calls without any future composition shenaniganry.
Yes, the stackless model is more efficient memory-wise and allows for some additional useful tricks (like sharing future stacks in `join!`). But the stackful model is perfectly efficient for 95+% of use cases, fits better with the borrow/ownership model, does not result in the `.await` noise, does not lead to the horrible ecosystem split (including split between different executors), and does not need the language-breaking hacks like `Pin` (see the `noalias` exception made for it). And I believe it's possible to close the memory efficiency gap between the models with certain compiler improvements (tracking maximum stack usage bound for functions and introducing a separate async ABI with two separate stacks).
>The existence of a usable io_uring in 2016 wouldn't have changed the fundamental calculus.
IIRC the first usable versions of io-uring were released approximately during the time when Rust async was undergoing stabilization. I am really confident that if the async system was designed today we would've had a totally different model. The importance of completion-based models has only grown since then, not only because of sane async file IO, but also because of Spectre and Meltdown.
aliceryhl
> IIRC Alice from the tokio team also suggested there hasn't been much interest in pushing through these difficulties more recently, as the current performance is "good enough".
Well, I think there is interest, but mostly for file IO.
For file IO, the situation is pretty simple. We already have to implement that using spawn_blocking, and spawn_blocking has the exact same buffer challenges as io_uring does, so translating file IO to io_uring is not that tricky.
On the other hand, I don't think tokio::net's existing APIs will support io_uring. Or at least they won't support the buffer-based io_uring APIs; there is no reason they can't register for readiness through io_uring.
johncolanduoni
This covers probably 90% of the usefulness of io_uring for non-niche applications. Its original purpose was doing buffered async file IO without a bunch of caveats that make it effectively useless. The biggest speed up I’ve found with it is ‘stat’ing large sets of files in the VFS cache. It can literally be 50x faster at that, since you can do 1000 files with a single system call and the data you need from the kernel is all in memory.
High throughput network usecases that don’t need/want AF_XDP or DPDK can get most of the speedup with ‘sendmmsg/recvmmsg’ and segmentation offload.
jcranmer
There is, I think, an ownership model that Rust's borrow checker very poorly supports, and for lack of a better name, I've called it hot potato ownership. The basic idea is that you have a buffer which you can give out as ownership in the expectation that the person you gave it to will (eventually) give it back to you. It's a sort of non-lexical borrowing problem, and I very quickly discovered when trying to implement it myself in purely safe Rust that the "giving the buffer back" is just really gnarly to write.
pornel
This can be done with exclusively owned objects. That's how io_uring abstractions work in Rust – you give your (heap allocated) buffer to a buffer pool, and get it back when the operation is done.
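A minimal sketch of that ownership hand-off, assuming a hypothetical `read_owned` function that stands in for a completion-based read (the copy from `source` simulates the kernel filling the buffer; no real io_uring call is made). The buffer moves into the operation and comes back alongside the result, so the caller cannot touch or free it while the operation is in flight:

```rust
use std::io;

/// Hypothetical stand-in for a completion-based read. The buffer is passed
/// by value and returned together with the result.
fn read_owned(mut buf: Vec<u8>, source: &[u8]) -> (io::Result<usize>, Vec<u8>) {
    // Simulated I/O: copy as many bytes as fit into the buffer.
    let n = source.len().min(buf.len());
    buf[..n].copy_from_slice(&source[..n]);
    (Ok(n), buf)
}

/// Usage: the caller gives the buffer away and gets it back afterwards.
fn demo_owned_read() -> Vec<u8> {
    let buf = vec![0u8; 4];
    let (res, buf) = read_owned(buf, b"hello");
    let n = res.expect("simulated read cannot fail");
    buf[..n].to_vec()
}
```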
&mut references are exclusive and non-copyable, so the hot potato approach can even be used within their scope.
But the problem in Rust is that threads can unwind/exit at any time, invalidating buffers living on the stack, and io_uring may use the buffer for longer than the thread lives.
The borrow checker only checks what code is doing, but doesn't have power to alter runtime behavior (it's not a GC after all), so it only can prevent io_uring abstractions from getting any on-stack buffers, but has no power to prevent threads from unwinding to make on-stack buffer safe instead.
alfiedotwtf
In my universe, `let` wouldn’t exist… instead there would only be 3 ways to declare variables:
1. global my_global_var: GlobalType = …
2. heap my_heap_var: HeapType = …
3. stack my_stack_var: StackType = …
Global types would need to implement a global trait to ensure mutual exclusion (waves hands). So by having the location of allocation in the type itself, we no longer have to do boxing mental gymnastics
stouset
Maybe I’m misunderstanding, but why is that not possible with a
Fn(_: T) -> T
dwattttt
As sibling notes, it is. It's very rarely seen though.
One place you might see something like it is if an API takes ownership, but returns it on error; you see the error side carry the resource you gave it, so you could try again.
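One standard-library instance of this pattern is `std::sync::mpsc::Sender::send`, whose error type `SendError<T>` carries the value that could not be sent. A small sketch (the wrapper name `send_or_recover` is made up for illustration):

```rust
use std::sync::mpsc;

/// Tries to send `payload` on a channel whose receiver is already gone.
/// On failure the payload is recovered from the error instead of being lost.
fn send_or_recover(payload: Vec<u8>) -> Result<(), Vec<u8>> {
    let (tx, rx) = mpsc::channel::<Vec<u8>>();
    drop(rx); // no receiver left, so the send below fails
    tx.send(payload)
        .map_err(|mpsc::SendError(returned)| returned)
}
```

The caller gets the same allocation back in the `Err` case and can retry with it.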
IshKebab
How is that different to
Fn(_: &mut T)
?
johncolanduoni
It’s annoying but possible to do this correctly and not have the API be too bad. The “happy path” of a clean success or error is fine if you accept that buffers can’t just be simple &[u8] slices. Cancellation can be handled safely with something like the following API contract:
Have your function signature be ‘async fn read(buffer: &mut Vec<u8>) -> Result<…>’ (you can use something more convenient like ‘&mut BytesMut’ too). If you run the future to completion (success or failure), the argument holds the same buffer passed in, with data filled in appropriately on success. If you cancel/drop the future, the buffer may point at an empty allocation instead (this is usually not an annoying constraint for most IO flows, and footgun potential is low).
The way this works is that your library “takes” the underlying allocation before starting the operation out of the variable, replacing it with the default unallocated ‘Vec<u8>’. Once the buffer is no longer used by the IO system, it puts it back before returning. If you cancel, it manages the buffer in the background to release it when safe and the unallocated buffer is left in the passed variable.
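The take-and-restore step can be sketched synchronously with `std::mem::take`. This is only an illustration of the contract, not the commenter's library: `fill_in_flight` and `read_into` are hypothetical names, and the "I/O" is a plain copy.

```rust
use std::mem;

/// Hypothetical I/O layer that owns the allocation while the operation runs.
fn fill_in_flight(mut owned: Vec<u8>) -> Vec<u8> {
    owned.clear();
    owned.extend_from_slice(b"data");
    owned
}

/// Takes the caller's allocation, runs the operation, then puts it back.
fn read_into(buffer: &mut Vec<u8>) -> usize {
    // After this line the caller's variable holds an empty, unallocated Vec.
    // If the operation were cancelled here, that is what the caller would see.
    let owned = mem::take(buffer);
    let filled = fill_in_flight(owned);
    let n = filled.len();
    // Restore the (now filled) allocation before returning.
    *buffer = filled;
    n
}
```

In an async version, the code between the take and the restore is where a dropped future would leave the empty `Vec` behind while the library keeps the real allocation alive until the kernel is done with it.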
andyferris
It sounds like this would be better modelled by passing ownership of the buffer and expecting it to be returned on the success (ok) case. What you described doesn't seem compatible with what I would call a mutable borrow (mutate the contents of a Vec<u8>).
Or maybe I've misunderstood?
JoshTriplett
I think the right way to build a safe interface around io_uring would be to use ring-owned buffers, ask the ring for a buffer when you want one, and give the buffer back to the ring when initiating a write.
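A rough sketch of what such a pool could look like, with plain `Vec<u8>` buffers standing in for memory registered with the ring. `BufPool` and its methods are hypothetical names for illustration; a real implementation would tie buffers to registered ring memory and to completion events.

```rust
/// Hypothetical pool standing in for buffers owned by the ring.
struct BufPool {
    free: Vec<Vec<u8>>,
}

impl BufPool {
    /// Pre-allocates `count` buffers of `size` bytes each.
    fn new(count: usize, size: usize) -> Self {
        BufPool { free: (0..count).map(|_| vec![0u8; size]).collect() }
    }

    /// Lends a buffer out by value; returns None when the pool is exhausted.
    fn acquire(&mut self) -> Option<Vec<u8>> {
        self.free.pop()
    }

    /// Hands a buffer back to the pool, e.g. when a write is initiated.
    fn release(&mut self, buf: Vec<u8>) {
        self.free.push(buf);
    }

    /// Number of buffers currently available.
    fn available(&self) -> usize {
        self.free.len()
    }
}
```

Because buffers only ever move by value, user code never holds a reference into memory the ring might still be using.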
pingiun
This is something that Amos Wenger (fasterthanlime) has worked on: https://github.com/bearcove/loona/blob/main/crates/buffet/RE...
Tuna-Fish
This works perfectly well, and allows using the type system to handle safety. But it also really limits how you handle memory, and makes it impossible to do things like filling out parts of existing objects, so a lot of people are reluctant to take the plunge.
johncolanduoni
That’s annoying for people writing bespoke low-level networking code, but for a high-level HTTP library it’s a rounding error in the overall complexity on display. I think the bigger barrier for Tokio is that the interplay between having an epoll instance and an io_uring instance on the same pool is problematic and can erase performance gains. If done greenfield you could implement the “normal” APIs with ‘IORING_OP_POLL_ADD’, but not all of the exposed ‘mio’ surface area can work this way - only the oneshot API.
ozgrakkurt
You don’t have to represent everything with borrows. You can just use data structures like Slab to make it cancel safe.
As an example this library I wrote before is cancel safe and doesn’t use lifetimes etc. for it.
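For illustration, one way a slab-style table can keep buffers alive across cancellation, written with the standard library only. This is a sketch under stated assumptions, not the library mentioned above: `OpSlab`, `Slot` and the method names are invented, and the key would correspond to something like the `user_data` of a submission. The table owns every in-flight buffer, so cancelling only marks the slot; the buffer is freed when the completion is actually observed.

```rust
/// State of one slot in the table.
enum Slot {
    Free,
    InFlight(Vec<u8>),
    /// The user lost interest, but the kernel may still write to the buffer.
    Cancelled(Vec<u8>),
}

/// Hypothetical table of in-flight operations that owns their buffers.
struct OpSlab {
    slots: Vec<Slot>,
}

impl OpSlab {
    fn new() -> Self {
        OpSlab { slots: Vec::new() }
    }

    /// Registers an operation and returns its key, reusing free slots.
    fn submit(&mut self, buf: Vec<u8>) -> usize {
        if let Some(i) = self.slots.iter().position(|s| matches!(s, Slot::Free)) {
            self.slots[i] = Slot::InFlight(buf);
            i
        } else {
            self.slots.push(Slot::InFlight(buf));
            self.slots.len() - 1
        }
    }

    /// Marks an operation as cancelled without freeing its buffer.
    fn cancel(&mut self, key: usize) {
        let slot = std::mem::replace(&mut self.slots[key], Slot::Free);
        self.slots[key] = match slot {
            Slot::InFlight(buf) => Slot::Cancelled(buf),
            other => other,
        };
    }

    /// Called when the completion arrives. Returns the buffer only if the
    /// operation was not cancelled; a cancelled buffer is dropped here.
    fn complete(&mut self, key: usize) -> Option<Vec<u8>> {
        match std::mem::replace(&mut self.slots[key], Slot::Free) {
            Slot::InFlight(buf) => Some(buf),
            Slot::Cancelled(_) | Slot::Free => None,
        }
    }
}
```

A cancelled slot is not reused until its completion has been seen, which is the property that keeps the kernel from writing into freed memory in this model.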
ozgrakkurt
Just realised my code isn’t cancel safe either. It is invalid if the user just drops a read future and the buffer itself while the operation is in the kernel.
It is just a PITA to get it fully right.
Probably need the buffer to come from the async library so user allocates the buffers using the async library like a sibling comment says.
It is just much easier to not use Rust and say futures should run fully always and can’t be just dropped and make some actual progress. So I’m just doing it in zig now
bmcahren
This was a good read and great work. Can't wait to see the performance tests.
Your write up connected some early knowledge from when I was 11 where I was trying to set up a database/backend and was finding lots of cgi-bin online. I realize now those were spinning up new processes with each request https://en.wikipedia.org/wiki/Common_Gateway_Interface
I remember when sendfile became available for my large gaming forum with dozens of TB of demo downloads. That alone was huge for concurrency.
I thought I had sworn off this type of engineering but between this, the Netflix case of extra 40ms and the GTA 5 70% load time reduction maybe there is a lot more impactful work to be done.
https://netflixtechblog.com/life-of-a-netflix-partner-engine...
https://nee.lv/2021/02/28/How-I-cut-GTA-Online-loading-times...
kev009
It wasn't just CGI, every HTTP session was commonly a forked copy of the entire server in the CERN and Apache lineage! Apache gradually had better answers, but their API with common addons made it a bit difficult to transition so webservers like nginx took off which are built closer to the architecture in the article with event driven I/O from the beginning.
jabl
To nitpick at least as of Apache HTTPD 1.3 ages ago it wasn't forking for every request, but had a pool of already forked worker processes with each handling one connection at a time but could handle an unlimited number of connections sequentially, and it could spawn or kill worker processes depending on load.
The same model is possible in Apache httpd 2.x with the "prefork" mpm.
avar
> every HTTP session was commonly a forked copy of the entire server in the CERN and Apache lineage!
And there's nothing wrong with that for application workers. On *nix systems fork() is very fast, you can fork "the entire server" and the kernel will only COW your memory. As nginx etc. showed you can get better raw file serving performance with other models, but it's still a legitimate technique for application logic where business logic will drown out any process overhead.
tsimionescu
Forking for anything other than calling exec is still a horrible idea (with special exceptions like shells). Forking is a very unsafe operation (you can easily share locks and files with the child process unless both your code and every library you use is very careful - for example, it's easy to get into malloc deadlocks with forked processes), and its performance depends a lot on how you actually use it.
josephg
So long as you have something like nginx in front of your server. Otherwise your whole site can be taken down by a slowloris attack over a 33.6k modem.
klaussilveira
For anyone wanting to learn more about how to create a small server with io_uring: https://unixism.net/2020/04/io-uring-by-example-article-seri...
Imustaskforhelp
Such a good read.
I am patient enough to wait for the benchmarks, so take your time, but I honestly love how the author doesn't care about benchmarks right now and wanted to clean the code first. It's kinda impressive that there are people who have such a line of thinking in this world where benchmarks get maxxed and a whole project's sole existence is to satisfy benchmarks.
Really a breath of fresh air and honestly I admire the author so much for this. It was such a good read, loved it a lot thank you. Didn't know ktls existed or Io_uring could be used in such a way.
phrotoma
Anybody know what the state of kTLS is? I asked one of the Cilium devs about it a while ago 'cause I'd seen Thomas Graf excitedly talking about it and he told me that kernel support in many distros was lacking so they aren't ready to enable it by default.
drewg123
That's a shame. How hard is it to enable? Do you need a custom kernel, or can you enable it at runtime?
On FreeBSD, it's been in the kernel / openssl since 13, and has been one runtime toggle (sysctl kern.ipc.tls.enable=1) away from being enabled. And it's enabled by default in the upcoming FreeBSD-15.
We (at Netflix) have run all of our tls encrypted streaming over kTLS for most of a decade.
j-krieger
I do wonder if this would make for an excellent exfil implant since it doesn't register syscalls.
sandeep-nambiar
This is really cool. I've been thinking about something similar for a long time and I'm glad someone has finally done it. GG!
I can recommend writing even the BPF side of things with rust using Aya[1].
ValtteriL
Excellent read. I'd like to see DPDK style full kernel bypass next
spaintech
Not sure if you are aware of this, but LUNA does this already.
6r17
I really want to see the benchmarks on this; tried it like 4 days ago and then built a standard epoll implementation; I could not compete against nginx using uring but that's not the easiest task for an arrogant night so I really hope you get some deserved sweet numbers; mine were a sad disappointment but I did not do most of your implementation - rather simply tried to "batch" calls. Wish you the best of luck and much fun
bullen
So far everything after epoll that I have compared with falls short.
So to reimplement my foundation (with all the bugs) will not be worth it.
I will however compare Java's NIO (epoll) with the new Virtual Threads IO (without pinning).
boredatoms
What's the go-to instead of strace, if you wanted to see what was going on?
fuy
perf and look at stack traces (or off-cpu events for waits/locks). also, ebpf
abrookewood
I think you have to use eBPF-based tools
So, current status on async
Rust - you need to understand: Futures, Pin, Waker, async runtimes, Send/Sync bounds, async trait objects, etc.
C++20, coroutines.
Go, goroutines.
Java21+, virtual threads