
QUIC for the kernel

230 comments

· July 31, 2025

qwertox

I recently had to add `ssl_preread_server_name` to my NGINX configuration in order to `proxy_pass` requests for certain domains to another NGINX instance. In this setup, the first instance simply forwards the raw TLS stream (with `proxy_protocol` prepended), while the second instance handles the actual TLS termination.
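
A minimal sketch of the first-instance configuration being described (the upstream names and addresses here are hypothetical, not from the comment):

```nginx
stream {
    # Route on the SNI peeked from the ClientHello, without terminating TLS
    map $ssl_preread_server_name $upstream {
        example.com  tls_backend_a;        # terminated on the second instance
        default      tls_backend_default;
    }

    upstream tls_backend_a {
        server 10.0.0.2:443;               # second NGINX, does TLS termination
    }
    upstream tls_backend_default {
        server 127.0.0.1:8443;
    }

    server {
        listen 443;
        ssl_preread    on;                 # enables $ssl_preread_server_name
        proxy_protocol on;                 # prepend PROXY protocol, as described
        proxy_pass     $upstream;
    }
}
```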

This approach works well when implementing a failover mechanism: if the default path to a server goes down, you can update DNS A records to point to a fallback machine running NGINX. That fallback instance can then route requests for specific domains to the original backend over an alternate path without needing to replicate the full TLS configuration locally.

However, this method won't work with HTTP/3. Since HTTP/3 uses QUIC over UDP and encrypts the SNI during the handshake, `ssl_preread_server_name` can no longer be used to route based on domain name.

What alternatives exist to support this kind of SNI-based routing with HTTP/3? Is the recommended solution to continue using HTTP/1.1 or HTTP/2 over TLS for setups requiring this behavior?

dgl

Clients supporting QUIC usually also support HTTPS DNS records, so you can use a lower priority record as a failover, letting the client potentially take care of it. (See for example: host -t https dgl.cx.)
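
In zone-file form, such a record pair (the fallback hostname is hypothetical) might look like:

```
example.com.  300 IN HTTPS 1 .                     alpn="h3,h2"
example.com.  300 IN HTTPS 2 fallback.example.net. alpn="h2"
```

The "." target at priority 1 means the owner name itself; a client that can't reach the priority-1 endpoint can try the priority-2 one.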

That's the theory anyway. You can't always rely on clients to do that (see how much of the HTTPS record Chromium actually supports[1]), but in general if QUIC fails for any reason clients will transparently fall back, as well as respecting the Alt-Svc[2] header. If this is a planned failover you could stop sending an Alt-Svc header and wait for the alternative to time out, although it isn't strictly necessary.

If you really do want to route QUIC, however, one nice property is that the SNI is always in the first packet, so you can route flows by inspecting the first packet. See Cloudflare's udpgrm[3] (this on its own isn't enough to proxy to another machine, but the building block is there).

Without Encrypted Client Hello (ECH), the ClientHello (including the SNI) is encrypted with a known key (this is to stop middleboxes that don't know about the version of QUIC from breaking it), so it is possible to decrypt it; see the code in udpgrm[4]. With ECH the "router" would need a key to decrypt the ECH, which it can then decrypt inline and make a decision on (this key is different from the TLS key, and fallback HTTPS records can also advertise a different key than the non-fallback route; whether browsers currently support that is a different issue, but it is possible in the protocol). This is similar to how fallback with ECH could be supported with HTTP/2 over a TCP connection.

[1]: https://issues.chromium.org/issues/40257146

[2]: https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/...

[3]: https://blog.cloudflare.com/quic-restarts-slow-problems-udpg...

[4]: https://github.com/cloudflare/udpgrm/blob/main/ebpf/ebpf_qui...
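
For the curious, the "known key" mentioned above is fully specified: QUIC v1 derives its Initial packet protection keys from a published salt plus the client's Destination Connection ID (RFC 9001, Section 5.2), so anything on path can compute them. A standard-library sketch of the first derivation step:

```python
import hashlib
import hmac
import struct

def hkdf_extract(salt: bytes, ikm: bytes) -> bytes:
    # HKDF-Extract (RFC 5869) over SHA-256, as used by TLS 1.3 / QUIC v1
    return hmac.new(salt, ikm, hashlib.sha256).digest()

def hkdf_expand(prk: bytes, info: bytes, length: int) -> bytes:
    out, block, counter = b"", b"", 1
    while len(out) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        out += block
        counter += 1
    return out[:length]

def hkdf_expand_label(secret: bytes, label: bytes, length: int) -> bytes:
    # TLS 1.3 HkdfLabel: uint16 length, "tls13 "-prefixed label, empty context
    full = b"tls13 " + label
    info = struct.pack(">HB", length, len(full)) + full + b"\x00"
    return hkdf_expand(secret, info, length)

# Fixed, public salt for QUIC v1 Initial packets (RFC 9001, Section 5.2)
INITIAL_SALT = bytes.fromhex("38762cf7f55934b34d179ae6a4c80cadccbb7f0a")

def client_initial_secret(dcid: bytes) -> bytes:
    """Secret protecting the client's Initial packets, derived purely
    from public information (the salt and the connection ID)."""
    return hkdf_expand_label(hkdf_extract(INITIAL_SALT, dcid), b"client in", 32)
```

Feeding in the example DCID from RFC 9001's Appendix A reproduces its "client in" test vector; the actual AEAD key, IV, and header-protection key are expanded from this secret the same way. Once ECH is in the picture this trick stops working, which is exactly the distinction drawn above.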

geocar

> for setups requiring this behavior?

TLS terminating at your edge (which is presumably where the IP addresses attach) isn't any particular risk in a world of Let's Encrypt, where an attacker who gained access to that box could simply request a new SSL certificate, so you might as well do it yourself and move on with life.

Also: I've been unable to reproduce the performance and reliability claims of QUIC. I keep trying a couple of times a year to see if anything's gotten better, but I mostly leave it disabled for monetary reasons.

> This approach works well when implementing a failover mechanism: if the default path to a server goes down...

I'm not sure I agree: DNS can take minutes for updates to be reflected, and dumb clients (like web browsers) don't failover.

So I use an onerror handler to load the second path. My ad tracking looks something like this:

    <img src=patha.domain1?tracking
      onerror="this.src='pathb.domain2?tracking';this.onerror=function(){}">
but with the more complex APIs, fetch() is wrapped up similarly in the code I deliver to users. This works much better than anything else I've tried.

dgl

> […] isn't any particular risk in a world of letsencrypt where an attacker (who gained access to that box) could simply request a new SSL certificate

You can use CAA records with validationmethods and accounturi to limit issuance, so simply access to the machine isn’t enough. (E.g. using dns and an account stored on a different machine.)
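
In zone-file form (the ACME account URL here is a made-up placeholder) that looks like:

```
example.com. IN CAA 0 issue "letsencrypt.org; accounturi=https://acme-v02.api.letsencrypt.org/acme/acct/12345; validationmethods=dns-01"
```

With this, a CA conforming to RFC 8657 will only issue for the named account and only via DNS validation, so a compromised web server on its own can't obtain a certificate.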

johncolanduoni

For a failover circumstance, I wouldn’t bother with failover for QUIC at all. If a browser can’t make a QUIC connection (even if advertised in DNS), it will try HTTP1/2 over TLS. Then you can use the same fallback mechanism you would if it wasn’t in the picture.

MadnessASAP

Unfortunately I think that falls under the "Not a bug" category of bugs. Keeping the endpoint concealed all the way to the TLS endpoint is a feature* of HTTP/3.

* I do actually consider it a feature, but do acknowledge https://xkcd.com/1172/

PS. HAProxy can proxy raw TLS, but can't direct based on hostname. Cloudflare tunnel I think has some special sauce that can proxy on hostname without terminating TLS but requires using them as your DNS provider.

dgl

Unless you're using ECH (Encrypted Client Hello) the endpoint is obscured (known keys), not concealed.

PS: HAProxy definitely can do this too, something using req.ssl_sni like this:

    frontend tcp-https-plain
        mode tcp
        tcp-request inspect-delay 10s
        bind [::]:443 v4v6 tfo
        acl clienthello req.ssl_hello_type 1
        acl example.com req.ssl_sni,lower,word(-1,.,2) example.com
        tcp-request content accept if clienthello
        tcp-request content reject if !clienthello
        use_backend tcp-https-example-proxy if example.com
        default_backend tcp-https-default-proxy
Then tcp-https-example-proxy is a backend which forwards to a server listening for HTTPS (and using send-proxy-v2, so the client IP is kept). Cloudflare really isn't doing anything special here; there are also other tools like sniproxy[1] which can intercept based on SNI (a common thing commercial proxies do for filtering reasons).

[1]: https://github.com/ameshkov/sniproxy

MadnessASAP

Neat! Thank you very much for the information.

jcgl

Hm, that’s a good question. I suppose the same would apply to TCP+TLS with Encrypted Client Hello as well, right? Presumably the answer would be the same/similar between the two.

xg15

Not an expert on eSNI, but my understanding was that the encryption in eSNI is entirely separate from the "main" encryption in TLS, and the eSNI keys have to be the same for every domain served from the same IP address or machine.

Otherwise, the TLS handshake would run into the same chicken/egg problem that you have: To derive the keys, it needs the certificate, but to select the certificate, it needs the domain name.

So you only need to replicate the eSNI key, not the entire cert store.

silon42

Personally, I'd like to have an option of the outbound firewall doing the eSNI encryption, is that possible?

NewJazz

> That fallback instance can then route requests for specific domains to the original backend over an alternate path without needing to replicate the full TLS configuration locally.

Won't you need to "replicate the TLS config" on the back end servers then? And how hard is it to configure TLS on the nginx side anyway, can't you just use ACME?

kldx

QUIC v1 does encrypt the SNI in the client hello, but the keys are derived from a predefined salt and the destination connection id. I don't see why decrypting this would be difficult for a nginx plugin.

WASDx

I recall this article on QUIC disadvantages: https://www.reddit.com/r/programming/comments/1g7vv66/quic_i...

Seems like this is a step in the right direction to resolve some of those issues. I suppose nothing is preventing it from getting hardware support in future network cards as well.

miohtama

QUIC does not work very well for use cases like machine-to-machine traffic. However, most of the traffic on the Internet today is from mobile phones to servers, and that is where QUIC and HTTP/3 shine.

For other use cases we can keep using TCP.

kldx

Let me try providing a different perspective based on experience. QUIC works amazingly well for _some_ kinds of machine to machine traffic.

ssh3, based on QUIC, is quicker at dropping into a shell than ssh. The latency difference was clearly visible.

QUIC with the unreliable datagram extension is also a great way to implement port forwarding over ssh. Tunneling one reliable transport over another hides the packet losses in the upper layer.

exDM69

The article that GP posted was specifically about throughput over a high speed connection inside a data center.

It was not about latency.

In my opinion, the lessons that one can draw from this article should not be applied for use cases that are not about maximum throughput inside a data center.

thickice

Why doesn't QUIC work well for machine-to-machine traffic? Is it due to the lack of offloads/optimizations compared to TCP, and machine-to-machine traffic tending to be high volume/high rate?

yello_downunder

QUIC would work okay, but not really have many advantages for machine-to-machine traffic. Machine-to-machine you tend to have long-lived connections over a pretty good network. In this situation TCP already works well and is currently handled better in the kernel. Eventually QUIC will probably be just as good as TCP in this use case, but we're not there yet.

extropy

NAT firewalls do not like P2P UDP traffic. The majority of routers lack the smarts to pass through QUIC correctly; they need to treat it essentially the same as TCP.

dan-robertson

I think basically there is currently a lot of overhead and, when you control the network more and everything is more reliable, you can make tcp work better.

m00x

It's explained in the reddit thread. Most of it is because you have to handle a ton of what TCP does in userland.

exabrial

For starters, why encrypt something literally in the same datacenter 6 feet away? It adds significant latency and processing overhead.

mort96

I don't understand what you mean by "machine-to-machine" if a phone (a machine) talking to a server (a machine) is not machine-to-machine.

szszrk

I hope you don't think that user-to-machine means that I have to stick my finger in a network switch? :)

Machine-to-machine is usually meant as traffic where neither side is a client device (desktop, mobile, etc.). Often not initiated by a user, but that's debatable.

I would say a server syncing a database to a passive node is machine-to-machine, while a user connecting from their browser to a webserver is not.

Ericson2314

What will the socket API look like for multiple streams? I guess it is implied it is the same as multiple connections, with caching behind the scenes.

I would hope for something more explicit, where you get a connection object and then open streams from it, but I guess that is fine for now.

https://github.com/microsoft/msquic/discussions/4257 ah but look at this --- unless this is an extension, the server side can also create new streams, once a connection is established. The client creating new "connections" (actually streams) cannot abstract over this. Something fundamentally new is needed.

My guess is recvmsg to get a new file descriptor for new stream.

gte525u

I would look at SCTP socket API it supports multistreaming.

mananaysiempre

SCTP is very telecom-shaped; in particular, IIRC, the number of streams is fixed at the start of the connection, so (this sucks but also) GP’s problem does not appear.

Ericson2314

That problem doesn't appear, but also trying to pass off a stream as a readable/writable thing to a polymorphic interface ("everything is file") wouldn't work.

...however sctp_peeloff (see other thread) fixes the issue. Hurray!

wahern

Ericson2314

Ah fuck, it still has a stream_id notion

How are socket APIs always such garbage....

signa11

> API RFC is ...

still a draft though.

Ericson2314

I checked that out and....yuck!

- Send specifies which stream by ordinal number? (Can't have different parts of a concurrent app independently open new streams)

- Receive doesn't specify which stream at all?!

Bender

I don't know about using it in the kernel but I would love to see OpenSSH support QUIC so that I get some of the benefits of Mosh [1] while still having all the features of OpenSSH including SFTP, SOCKS, port forwarding, less state table and keep alive issues, roaming support, etc... Could OpenSSH leverage the kernel support?

[1] - https://mosh.org/

wmf

SSH would need a lot of work to replace its crypto and mux layers with QUIC. It's probably worth starting from scratch to create a QUIC login protocol. There are a bunch of different approaches to this in various states of prototyping out there.

Bender

Fair points. I suppose Mosh would be the proper starting point then. I'm just selfish and want the benefits of QUIC without losing all the really useful features of OpenSSH.

bauruine

OpenSSH is an OpenBSD project therefore I guess a Linux api isn't that interesting but I could be wrong ofc.

skissane

Once Linux implements it, I think odds are high that FreeBSD sooner or later does too. And maybe NetBSD and XNU/Darwin/macOS/iOS thereafter. And if they’ve all got it, that increases the odds that eventually OpenBSD also implements it. And if OpenBSD has the support in its kernel, then they might be willing to consider accepting code in OpenSSH which uses it. So OpenSSH supporting QUIC might eventually happen, but if it does, it is going to be some years away

Bender

That's a good point. At least it would not be an entirely new idea. [1] Curious what reactions he received.

[1] - https://papers.freebsd.org/2022/eurobsdcon/jones-making_free...

another_twist

I have a question: the bottleneck for TCP is said to be the handshake, but that can be solved by reusing connections and/or multiplexing. The current implementation is 3-4x slower than the existing in-kernel TCP path, and the performance gap is expected to close.

If speed is touted as the advantage of QUIC and it is in fact slower, why bother with this protocol? The author of the patch set himself attributes some of the speed issues to the protocol design. Are there other problems in TCP that need fixing?

jauntywundrkind

The article discusses many of the reasons QUIC is currently slower. Most of them seem to come down to "we haven't done any optimization for this yet".

> Long offers some potential reasons for this difference, including the lack of segmentation offload support on the QUIC side, an extra data copy in transmission path, and the encryption required for the QUIC headers.

All of these three reasons seem potentially very addressable.

It's worth noting that the benchmark here is on pristine network conditions, a drag race if you will. If you are on mobile, your network will have a lot more variability, and there TCP's design limits are going to become much more apparent.

TCP itself often has protocols run on top of it, to do QUIC-like things. HTTP/2 is an example of this. So when you compare QUIC and TCP, it's kind of like comparing how fast a car goes with how fast an engine bolted to a frame with wheels on it goes. QUIC goes significantly up the OSI network stack, is layer 5+, where-as TCP+TLS is layer 3. That's less of the system design.

QUIC also has wins for connecting faster, and especially for reconnecting faster. It also has IP mobility: if you're on mobile and your IP address changes (happens!) QUIC can keep the session going without rebuilding it once the client sends the next packet.

It's a fantastically well thought out & awesome advancement, radically better in so many ways. The advantages of having multiple non-blocking streams (alike SCTP) massively reduces the scope that higher level protocol design has to take on. And all that multi-streaming stuff being in the kernel means it's deeply optimizable in a way TCP can never enjoy.

Time to stop driving the old rust bucket jalopy of TCP around everywhere, crafting weird elaborate handmade shit atop it. We need a somewhat better starting place for higher level protocols and man oh man is QUIC alluring.

redleader55

> QUIC goes significantly up the OSI network stack, is layer 5+, where-as TCP+TLS is layer 3

IP is layer 3, network (ensures packets are routed to the correct host). TCP is layer 4, transport (some people argue that TCP has functions from layer 5, e.g. establishing sessions between apps), while TLS adds a few functions from layer 6 (e.g. encryption), which QUIC also has.

fanf2

The OSI is not a useful guide to how layering works in the Internet.

w3ll_w3ll_w3ll

TCP is level 4 in the OSI model

morning-coffee

That's just one bottleneck. The other issue is head-of-line blocking. When there is packet loss on a TCP connection, nothing sent after that is delivered until the loss is repaired.
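
A toy model of that head-of-line blocking: a TCP receiver only releases a contiguous prefix of the sequence to the application, so segments that arrived intact sit buffered behind a gap:

```python
def in_order_delivery(arrived, next_seq=0):
    """TCP-style delivery: only the contiguous prefix starting at next_seq
    reaches the application; everything after a gap stays buffered."""
    buf = dict(arrived)              # sequence number -> payload
    delivered = []
    while next_seq in buf:
        delivered.append(buf.pop(next_seq))
        next_seq += 1
    return delivered, sorted(buf)    # (delivered payloads, stuck seq numbers)

# Segment 0 was lost; segments 1-3 arrived intact but are stuck behind the gap
delivered, stuck = in_order_delivery({1: "b", 2: "c", 3: "d"})
```

In QUIC, a loss only stalls the stream it happened on; other streams in the same connection keep delivering.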

another_twist

Whats the packet loss rate on modern networks ? Curious.

adgjlsfhk1

~80% when you step out of wifi range on your cell phone.

deathanatos

… from 0% (a wired home LAN with nothing screwy going on) to 100% (e.g., cell reception at the San Antonio Caltrain station), depending on conditions…?

As it always has been, and always will be.

wmf

It can be high on cellular.

geysersam

Pretty bad sometimes when on a train

reliablereason

That depends on how much data you are pushing: if you push 200 Mb/s down a 100 Mb/s line, you will get 50% packet loss.

anonymousiam

TCP windowing fixes the issue you are describing. Make the window big and TCP will keep sending when there is a packet loss. It will also retry and usually recover before the end of the window is reached.

https://en.wikipedia.org/wiki/TCP_window_scale_option

quietbritishjim

The statement in the comment you're replying to is still true. While waiting for those missed packets, the later packets will not be dropped if you have a large window size. But they won't be delivered either. They'll be buffered in the kernel, even though the application might be able to make use of them before the earlier blocked packet arrives.

morning-coffee

They are unrelated. Larger windows help achieve higher throughput over paths with high delay. You allude to selective acknowledgements as a way to repair loss before the window completely drains which is true, but my point is that no data can be delivered to the application until the loss is repaired (and that repair takes at least a round-trip time). (Then the follow-on effects from noticed loss on the congestion controller can limit subsequent in-flight data for a time, etc, etc.)

Twirrim

The queuing discipline used by default (pfifo_fast) is barely more than 3 FIFO queues bundled together. The 3 queues allow for a barest minimum semblance of prioritisation of traffic, where Queue 0 > 1 > 2, and you can tweak some tcp parameters to have your traffic land in certain queues. If there's something in queue 0 it must be processed first before anything in queue 1 gets touched etc.

Those queues operate purely head-of-queue basis. If what is at the top of the queue 0 is blocked in any way, the whole queue behind it gets stuck, regardless of if it is talking to the same destination, or a completely different one.

I've seen situations where a glitching network card caused some serious knock on impacts across a whole cluster, because the card would hang or packets would drop, and that would end up blocking the qdisc on a completely healthy host that was in the middle of talking to it, which would have impacts on any other host that happened to be talking to that healthy host. A tiny glitch caused much wider impacts than you'd expect.

The same kind of effect would happen from a VM that went through live migration. The tiny, brief pause would cause a spike of latency all over the place.

There are alternatives like fq_codel that can mitigate some of this, but you do have to pay a small amount of processing overhead on every packet, because now you have a queuing discipline that actually needs to track some semblance of state.

bawolff

> bottleneck for TCP is said to the handshake. But that can be solved by reusing connections

You can't reuse a connection that doesn't exist yet. A lot of this is about reducing latency not overall speed.

frmdstryr

The "advantage" is tracking via the server provided connection ID https://www.cse.wustl.edu/~jain/cse570-21/ftp/quic/index.htm...

bawolff

That's nonsensical. A connection ID doesn't allow tracking that you couldn't already do with TCP.

kibwen

I'm confused, I thought the revolution of the past decade or so was in moving network stacks to userspace for better performance.

toast0

Most QUIC stacks are built upon in-kernel UDP. You get significant performance benefits if you can avoid your traffic going through kernel and userspace and the context switches involved.

You can work that angle by moving networking into user space: setting up the NIC queues so that user space can access them directly, without needing to context switch into the kernel.

Or you can work the angle by moving networking into kernel space: things like sendfile, which lets a TCP application instruct the kernel to send a file to the peer without copying the content into userspace, back into kernel space, and finally into device memory. If you have in-kernel TLS with sendfile, you can continue to skip copying to userspace; if you have NIC-based TLS, the kernel doesn't need to read the data from the disk; and if you have NIC-based TLS and the disk can DMA to the NIC buffers, the data doesn't even need to hit main memory. Etc.

But most QUIC stacks don't get benefit from either side of that. They're reading and writing packets via syscalls, and they're doing all the packetization in user space. No chance to sendfile and skip a context switch and skip a copy. Batching io via io_uring or similar helps with context switches, but probably doesn't prevent copies.
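
The sendfile path described above is visible from plain userspace code; a minimal sketch (Linux, Python standard library; the file and socket here are stand-ins):

```python
import os

def send_file_zero_copy(sock, path):
    """Stream a file to a socket via sendfile(2): the kernel moves the
    bytes, so the payload never passes through this process's buffers."""
    with open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size
        sent_total = 0
        while sent_total < size:
            n = os.sendfile(sock.fileno(), f.fileno(), sent_total, size - sent_total)
            if n == 0:     # peer closed or nothing left to send
                break
            sent_total += n
    return sent_total
```

With kTLS the same call keeps working on an encrypted socket; a userspace QUIC stack has no equivalent, which is part of the gap being discussed.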

stingraycharles

Yeah, there’s also a lot of offloads that can be done to the kernel with UDP (e.g. UDP segmentation offload, generic receive offload, checksum offload), and offloading quick entirely would be a natural extension to that.

It just offers people choice for the right solution at the right moment.

shanemhansen

You are right but it's confusing because there are two different approaches. I guess you could say both approaches improve performance by eliminating context switches and system calls.

1. Kernel bypass combined with DMA and techniques like dedicating a CPU to packet processing improve performance.

2. What I think of as "removing userspace from the data plane" improves performance for things like sendfile and ktls.

To your point, Quic in the kernel seems to not have either advantage.

FuriouslyAdrift

So... RDMA?

michaelsshaw

No, the first technique describes the basic way they already operate, DMA, but giving access to userspace directly because it's a zerocopy buffer. This is handled by the OS.

RDMA is directly from bus-to-bus, bypassing all the software.

Karrot_Kream

You still need to offload your bytes to a NIC buffer. Either you can do something like DMA where you get privileged space to write your bytes to that the NIC reads from or you have to cross the syscall barrier and have your kernel write the bytes into the NIC's buffer. Crossing the syscall barrier adds a huge performance penalty due to the switch in memory space and privilege rings so userspace networking only makes sense if you're not having to deal with the privilege changes or you have DMA.

Veserv

That is only a problem if you do one or more syscalls per packet, which is an utterly bone-headed design.

The copy itself is going at 200-400 Gbps so writing out a standard 1,500 byte (12,000 bit) packet takes 30-60 ns (in steady state with caches being prefetched). Of course you get slaughtered if you stupidly do a syscall (~100 ns hardware overhead) per packet since that is like 300% overhead. You just batch like 32 packets so the write time is ~1,000-2,000 ns then your overhead goes from 300% to 10%.

At a 1 Gbps throughput, that is ~80,000 packets per second, or one packet per ~12.5 us. So waiting for a 32-packet batch only adds an additional ~500 us to your end-to-end latency in return for a 4x efficiency improvement (assuming that was your bottleneck, which it is not for these implementations, as they are nowhere near the actual limits). If you go up to 10 Gbps, that is only 50 us of added latency, and at 100 Gbps you are only looking at 5 us of added latency for a literal 4x efficiency improvement.
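
The arithmetic here can be written out; the 100 ns syscall cost and ~40 ns per-packet copy are the figures assumed above, and the results land in the same ballpark as the rounded numbers in the text:

```python
PKT_BITS = 1_500 * 8      # standard 1,500-byte packet
SYSCALL_NS = 100          # assumed fixed hardware overhead per syscall
COPY_NS = 40              # ~30-60 ns to copy one packet at 200-400 Gbps

def syscall_overhead(batch_size):
    """Syscall cost relative to useful copy work for one batch."""
    return SYSCALL_NS / (COPY_NS * batch_size)

def batching_latency_us(batch_size, link_bps):
    """Extra latency from waiting to fill a batch at line rate."""
    return batch_size * PKT_BITS / link_bps * 1e6

# One packet per syscall: ~250% overhead; batches of 32: under 10%.
# At 1 Gbps, a 32-packet batch adds roughly 384 us of queueing delay.
```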

zamalek

What is done for that is userspace gets the network data directly without (I believe) involving syscalls. It's not something you'd do for end-user software, only the likes of MOFAANG need it.

In theory the likes of io_uring would bring these benefits across the board, but we haven't seen that delivered (yet, I remain optimistic).

phlip9

I'm hoping we get there too with io_uring. It looks like the last few kernel release have made a lot of progress with zero-copy TCP rx/tx, though NIC support is limited and you need some finicky network iface setup to get the flow steering working

https://docs.kernel.org/networking/iou-zcrx.html

michaelsshaw

The constant mode switching for hardware access is slow. TCP/IP remains in the kernel on Windows and Linux.

wmf

Performance comes from dedicating core(s) to polling, not from userspace.

0xbadcafebee

Networking is much faster in the kernel. Even faster on an ASIC.

Network stacks were moved to userspace because Google wanted to replace TCP itself (and upgrade TLS), but it only cared about the browser, so they just put the stack in the browser, and problem solved.

bjourne

What is the need for mashing more and more stuff into the kernel? I thought the job of the kernel was to manage memory, hardware, and tasks. Shouldn't protocols built on top of IP be handled by userland?

heavyset_go

Having networking, routing, VPN etc all not leave kernel space can be a performance improvement for some use cases.

Similarly, splitting the networking/etc stacks out from the kernel into userspace can also be a performance improvement for some use cases.

bjourne

Can't you say that about virtually anything? I'm sure having, say, MIDI synthesizers in the kernel would improve performance too, but not many think that is a good idea.

heavyset_go

Depends on the workload and scale. There are cases where offloading everything to userspace in order to minimize context switches into kernel space would improve performance, as well.

stingraycharles

Yup, context switches between kernelspace and userspace are very expensive in high-performance situations, which is why these types of offloads are used.

At specific workloads (think: load balancers / proxy servers / etc), these things become extremely expensive.

leoh

Maybe. Getting stuff into the kernel means (in theory) it’s been hardened, it has a serious LTS, and benefits from… well, the performance of being part of the kernel.

mcosta

DMA transfers and NIC offloading

kortilla

No, protocols directly on IP specifically can’t be used in userland because they can’t be multiplexed to multiple processes.

If everything above IP was in userland, only one program at a time could use TCP.

TCP and UDP being intermediated by the kernel allow multiple programs to use the protocols at the same time because the kernel routes based on port to each socket.

QUIC sits a layer even higher because it cruises on UDP, so I think your point still stands, but it’s stuff on top of TCP/UDP, not IP.

surajrmal

How do you think this works on microkernels? Do they have no support for multiple applications using the network?

Veserv

That is not at all a problem. On a microkernel you just have a userspace TCP/network server that your other programs talk to that manages/multiplexes the shared network connection.

kortilla

If they don’t have TCP in them, yes. Either each application would need its own IP or another application would be responsible for being the TCP port router.

xgpyc2qp

Looks good. QUIC is a real game changer for many. The Internet should be a little faster with it. Probably we will not care because of 5G, but it is still valuable. I was wondering why there is a separate TLS handshake; I thought QUIC embeds TLS, but it seems I am wrong.

wosined

The general web is slowed down by bloated websites. But I guess this can make game latency lower.

fmbb

https://en.m.wikipedia.org/wiki/Jevons_paradox

The Jevons Paradox is applicable in a lot of contexts.

More efficient use of compute and communications resources will lead to higher demand.

In games this is fine. We want more, prettier, smoother, pixels.

In scientific computing this is fine. We need to know those simulation results.

On the web this is not great. We don’t want more ads, tracking, JavaScript.

01HNNWZ0MV43FF

No, the last 20 years of browser improvements have made my static site incredibly fast!

I'm benefiting from WebP, JS JITs, Flexbox, zstd, Wasm, QUIC, etc, etc

EdSchouten

> Calls to bind(), connect(), listen(), and accept() can be used to initiate and accept connections in much the same way as with TCP, but then things diverge a bit. [...] The sendmsg() and recvmsg() system calls are used to carry out that setup

I wish the article explained why this approach was chosen, as opposed to adding a dedicated system call API that matches the semantics of QUIC.

jeffbee

This seems to be a categorical error, for reasons that are contained in the article itself. The whole appeal of QUIC is being immune to ossification, being free to change parameters of the protocol without having to beg Linux maintainers to agree.

toast0

IMHO, you likely want the server side to be in the kernel, so you can get to performance similar to in-kernel TCP, and ossification is less of a big deal, because it's "easy" to modify the kernel on the server side.

OTOH, you want to be in user land on the client, because modifying the kernel on clients is hard. If you were Google, maybe you could work towards a model where Android clients could get their in-kernel protocol handling to be something that could be updated regularly, but that doesn't seem to be something Google is willing or able to do; Apple and Microsoft can get priority kernel updates out to most of their users quickly; Apple also can influence networks to support things they want their clients to use (IPv6, MP-TCP). </rant>

If you were happy with congestion control on both sides of TCP, and were willing to open multiple TCP connections like http/1, instead of multiplexing requests on a single connection like http/2, (and maybe transfer a non-pessimistic bandwidth estimate between TCP connections to the same peer), QUIC still gives you control over retransmission that TCP doesn't, but I don't think that would be compelling enough by itself.

Yes, there's still ossification in middle boxes doing TCP optimization. My information may be old, but I was under the impression that nobody does that in IPv6, so the push for v6 is both a way to avoid NAT and especially CGNAT, but also a way to avoid optimizer boxes as a benefit for both network providers (less expense) and services (less frustration).

ComputerGuru

One thing is that the choice of congestion control is sort of cursed, in that it assumes your box/side is being switched while the majority of the rest of the internet continues with legacy limitations (aside from DCTCP, which is designed for intra-datacenter usage). That is an essential part of the question, given that the resultant/emergent network behavior changes drastically depending on whether or not all sides are using the same algorithm. (Cubic is technically another sort-of-exception, at least since it became the default Linux CC algorithm, but even then you're still dealing with all sorts of middleware with legacy and/or pathological stateful behavior you can't control.)

toast0

I mean, if you're trying to be a good netizen, you try to tune your congestion control so it's fair enough in at least a few scenarios. You want it to be fair relative to status quo streams when status quo is dominant or when your new system is dominant, and also fair relative to new streams in the same conditions. This is a challenge of course, and if something in the middle is doing its own congestion control, that's indeed its own layer of fun and pathology.

jeffbee

This is a perspective, but just one of many. The overwhelming majority of IP flows are within data centers, not over planet-scale networks between unrelated parties.

toast0

I've never been convinced by an explanation of how QUIC applies for flows in the data center.

Ossification doesn't apply (or it shouldn't, IMHO; the point of open source software is that you can change it to fit your needs. If you don't like what upstream is doing, you should be running a local fork that does what you want. Yes, it's nicer if it's upstreamed, but try running a local fork of Windows or MacOS). You can make congestion control work for you when you control both sides, and enterprise switches and routers aren't messing with TCP flows. If you're pushing enough traffic that this is an issue, the cost of QUIC seems way too high to justify, even if it helps with some issues.

jeroenhd

Unless you're using QUIC as some kind of datacenter-to-datacenter protocol (basically as SCTP on steroids with TLS), I don't think QUIC in the datacenter makes much sense at all.

As very few server administrators bother turning on features like MPTCP, QUIC has an advantage on mobile phones with moderate to bad reception. That's not a huge issue for me most of the time, but billions of people are using mobile phones as their only access to the internet, especially in developing countries that are practically skipping widespread copper and fiber infrastructure and moving directly to 5G instead. Any service those people are using should probably consider implementing QUIC, and if they use it, they'd benefit from an in-kernel server.

All the data center operators can stick to (MP)TCP, the telco people can stick to SCTP, but the consumer facing side of the internet would do well to keep QUIC as an option.

corbet

Ossification does not come about from the decisions of "Linux maintainers". You need to look at the people who design, sell, and deploy middleboxes for that.

jeffbee

I disagree. There is plenty of ossification coming from inside the house. Just some examples off the top of my head are the stuck-in-1974 minimum RTO and ack delay time parameters, and the unwillingness to land microsecond timestamps.

otterley

Not a networking expert, but does TCP in IPv6 suffer the same maladies?

0xbadcafebee

The "middleboxes" excuse for not improving (or replacing) protocols in the past was horseshit. If a big incumbent player in the networking world releases a new feature that everyone wants (but nobody else has), everyone else (including 'middlebox' vendors) will bend over backwards to support it, because if you don't your competitors will and then you lose business. It was never a technical or logistical issue, it was an economic and supply-demand issue.

To prove it:

1. Add a new OSI Layer 4 protocol called "QUIC" and give it a new protocol number, and just for fun, change the UDP frame header semantics so it can't be confused for UDP.

2. Then release kernel updates to support the new protocol.

Nobody's going to use it, right? Because internet routers, home wireless routers, servers, shared libraries, etc would all need their TCP/IP stacks updated to support the new protocol. If we can't ship it over a weekend, it takes too long!

But wait. What if ChatGPT/Claude/Gemini/etc only supported communication over that protocol? You know what would happen: every vendor in the world would backport firmware patches overnight, bending over backwards to support it. Because they can smell the money.

GuB-42

The protocol itself is resistant to ossification, no matter how it is implemented.

This is mostly achieved through encryption, which is a reason encryption is such an important and mandatory part of the protocol. The idea is to expose as little of the protocol as possible between the endpoints; the rest is encrypted, so that "middleboxes" can't look at the packet and do funny things based on their own interpretation of the protocol stack.

Endpoints can still do whatever they want, and ossification can still happen, but encryption helps against ossification at the infrastructure level, which is the worst kind. Updating the Linux kernel on your server is easier than changing the proprietary hardware that makes up the network backbone.
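To make the point concrete, here is a sketch of the only fields a middlebox can reliably see in a QUIC long-header packet per the QUIC invariants (RFC 8999): the form bit, the version, and the two connection IDs. The packet bytes below are hypothetical, built by hand just for illustration; everything after the connection IDs is encrypted, so there is nothing stable for a middlebox to ossify around.

```python
def parse_quic_invariants(pkt: bytes):
    """Parse only the invariant long-header fields (RFC 8999)."""
    assert pkt[0] & 0x80, "long header expected (form bit set)"
    version = int.from_bytes(pkt[1:5], "big")
    dcid_len = pkt[5]
    dcid = pkt[6:6 + dcid_len]
    off = 6 + dcid_len
    scid_len = pkt[off]
    scid = pkt[off + 1:off + 1 + scid_len]
    # Everything past this point is opaque to an on-path observer.
    return version, dcid.hex(), scid.hex()

# Hypothetical hand-built packet: form bit set, version 1,
# a 4-byte destination CID, an empty source CID, then ciphertext.
pkt = (bytes([0xC0]) + (1).to_bytes(4, "big")
       + bytes([4]) + b"\x11\x22\x33\x44"
       + bytes([0]) + b"rest-is-encrypted")
print(parse_quic_invariants(pkt))  # prints (1, '11223344', '')
```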

The use of UDP instead of doing straight QUIC/IP is also an anti-ossification technique, as your app can just use UDP and a userland library regardless of the QUIC kernel implementation. In theory you could do that with raw sockets too, but that's much more problematic: because you don't have ports, you need the entire interface for yourself, and you often need root access.
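The UDP-vs-raw-socket distinction is easy to demonstrate. A sketch (Linux/POSIX assumptions; protocol number 200 is just an arbitrary unassigned value chosen for illustration):

```python
import socket

# A userland QUIC stack only needs an ordinary UDP socket: any
# unprivileged process gets its own port, so many QUIC apps can
# coexist on one host.
udp = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
udp.bind(("127.0.0.1", 0))           # let the kernel pick a free port
port = udp.getsockname()[1]
print("UDP socket bound to port", port)
udp.close()

# A hypothetical QUIC-over-raw-IP design would instead need a raw
# socket: no port demultiplexing, and creating one requires
# root/CAP_NET_RAW, which is exactly the problem described above.
try:
    raw = socket.socket(socket.AF_INET, socket.SOCK_RAW, 200)
    raw.close()
    raw_allowed = True               # running privileged
except PermissionError:
    raw_allowed = False              # the usual unprivileged outcome
print("raw socket allowed:", raw_allowed)
```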

Karrot_Kream

Do you think putting QUIC in the kernel will significantly ossify QUIC? If so, how do you want to deal with the performance penalty for the actual syscalls needed? Your concern makes sense to me as the Linux kernel moves slower than userspace software and middleboxes sometimes never update their kernels.
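The per-packet syscall cost behind this question can be ballparked from userspace. A rough, illustrative sketch only, not a rigorous benchmark (Python interpreter overhead dominates, so real C-level numbers would be much lower); it just shows that a userspace QUIC stack pays at least one sendto() syscall per datagram, which is part of what batching APIs like sendmmsg and in-kernel implementations try to amortize:

```python
import socket
import time

# A local sink so the datagrams have somewhere to go.
sink = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sink.bind(("127.0.0.1", 0))
dst = sink.getsockname()

src = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
payload = b"x" * 1200                # a typical QUIC packet size
n = 10_000

t0 = time.perf_counter()
for _ in range(n):
    src.sendto(payload, dst)         # one syscall per packet
elapsed = time.perf_counter() - t0

print(f"{n} datagrams, {elapsed / n * 1e6:.1f} us per sendto")
src.close()
sink.close()
```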

codedokode

That's so wrong, putting more and more stuff into the kernel and expanding attack surface. How long will it take before someone finds a vulnerability in QUIC handling?

The kernel should be as minimal as possible and everything that can be moved to userspace should be moved there. If you are afraid of performance issues then maybe you should stop using legacy processors with slow context switch timing.

NewJazz

Use a microkernel if this is your strong opinion. Linux is a monolithic kernel and includes a whole lot in kernel space for the sake of performance and (as mentioned in the article) hardware integration. A well designed microkernel may be able to provide similar performance with better security, but until people put serious work in, it won't be competitive with Linux.

surajrmal

Unfortunately the OS community puts 99% of its collective energy into Linux. There is definitely pent-up demand for a different architecture. China seems to be innovating here, but it's unclear if the west will get anything out of their designs.

codedokode

Sadly Linux distributions use a large kernel, and there is no simple way to get a working desktop system with a microkernel.

sevg

> If you are afraid of performance issues then maybe you should stop using legacy processors with slow context switch timing.

By the same logic, we should never improve performance in software and just tell everyone to buy new hardware instead. A bit ridiculous.

codedokode

We should not compromise security for minor improvements in performance.