Tokio and Prctl = Nasty Bug

77 comments

·February 23, 2025

nemothekid

I may be mistaken, but I believe the bug still exists in a more esoteric form, and a future change might bring it back. The author might want to warn against using `tokio::task::block_in_place` if the underlying issue can't be fixed.

The reason the current approach works is that it runs on tokio's worker threads, which last the lifetime of the tokio runtime. However, if `tokio::task::block_in_place` is called, the current worker thread is demoted to the blocking thread pool, and a new worker thread is spawned in its place.

There can be a situation when the stars align that:

1. Thread A spawns Process X.

2. N minutes/hours/days pass, and Thread A hits a section of code that calls `tokio::task::block_in_place`

3. Thread A goes into the blocking pool.

4. After some idle time, Thread A dies, prematurely killing Process X, causing the same bug again.

You can imagine that this would be much harder to reproduce and debug, because thread lifetime will be completely divorced from when you spawned the process. It's actually pretty lucky that the author reached for spawn_blocking instead of block_in_place, as when doing benchmarking it's a bit more tempting to use block_in_place. Had they used block_in_place, it may have been harder to catch this bug.
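The underlying mechanism can be reproduced outside of tokio. Below is a minimal Python sketch, assuming Linux and CPython: a child is spawned from a thread that exits right away, with the parent-death signal set to SIGKILL between fork and exec. The helper names are made up for illustration, and `preexec_fn` is documented as unsafe in threaded programs, so treat this as a demonstration rather than a pattern to copy.

```python
import ctypes
import signal
import subprocess
import sys
import threading

PR_SET_PDEATHSIG = 1  # value from <linux/prctl.h>
libc = ctypes.CDLL(None, use_errno=True)  # load before forking


def _kill_me_when_parent_dies():
    # Runs in the child between fork and exec.
    libc.prctl(PR_SET_PDEATHSIG, ctypes.c_ulong(int(signal.SIGKILL)))


def spawn_from_short_lived_thread(argv):
    """Spawn argv from a thread that exits immediately; return the Popen."""
    result = {}

    def worker():
        result["proc"] = subprocess.Popen(argv, preexec_fn=_kill_me_when_parent_dies)

    thread = threading.Thread(target=worker)
    thread.start()
    thread.join()  # the spawning thread is gone once this returns
    return result["proc"]
```

Calling `spawn_from_short_lived_thread([sys.executable, "-c", "import time; time.sleep(30)"])` and then waiting on the returned process should show it being killed by SIGKILL long before the 30 seconds are up, even though the parent process as a whole is still alive.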

rendaw

My knowledge isn't very good here, but I assumed since they're using the single thread executor, everything was being spawned on the main thread. The only time new (temporary) threads were created was when calling `spawn_blocking`. And the main thread can't be moved because it's part of the `main()` call stack? Maybe...

kobzol

That's a very good point! But yeah, we use the single threaded runtime, so this shouldn't be a concern.

eqvinox

> It is called PR_SET_PDEATHSIG, and we configure it when spawning tasks using the prctl syscall like this

PDEATHSIG was to my knowledge (85% confidence) created for the original Linux userspace pthreads implementation (LinuxThreads¹, before NPTL) that was created back when it was implemented via kernel processes (the kernel had no concept of threads yet). This is AFAIK also why it behaves oddly in regards to later-added kernel threads. I have a flag for "don't use this, it's highly fragile" in my head but don't remember where that's from.

If the receiving side can be controlled, there's always the option of opening a pipe; if the other end dies that's always detectable. Doesn't work with arbitrary processes though (random other code won't care if some fd ≥3 is suddenly closed…)

¹ https://en.wikipedia.org/wiki/LinuxThreads
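A rough Python sketch of the pipe approach, assuming the child's code is under your control: the parent keeps the write end of a "life" pipe open and the child blocks reading the other end, so EOF means every holder of the write end is gone. The function name and the second reporting pipe are only there to make the demo observable, and closing the write end stands in for the parent dying.

```python
import os


def demo_parent_death_detection():
    """Return what the child reports once it sees EOF on the life pipe."""
    life_r, life_w = os.pipe()      # parent holds life_w, child watches life_r
    report_r, report_w = os.pipe()  # only used to observe the child in this demo

    pid = os.fork()
    if pid == 0:
        # Child: it must close its own copy of the write end,
        # otherwise it would never see EOF.
        os.close(life_w)
        os.close(report_r)
        if os.read(life_r, 1) == b"":  # EOF: no write ends left
            os.write(report_w, b"parent gone")
        os._exit(0)

    os.close(life_r)
    os.close(report_w)
    os.close(life_w)  # stands in for the parent process dying
    message = os.read(report_r, 64)
    os.close(report_r)
    os.waitpid(pid, 0)
    return message
```

In a real program the child would put the read end into its event loop and shut down when it becomes readable. As noted above, this does nothing for arbitrary programs that never look at the inherited descriptor.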

fulafel

There's been no fundamental change in the kernel level representation of pthreads, they are still clone()d processes with just some sharing flags set differently that eg affect how PIDs work.

eqvinox

> they are still clone()d processes with just some sharing flags set differently that eg affect how PIDs work.

I'd say this is depending on perspective both true and false¹, but also unhelpful to work with here.

Instead, I would suggest this perspective: the kernel has neither processes nor threads; it has tasks, which are entities the scheduler can run. They're exposed to userland as processes and threads. Excluding kernel tasks/threads, which can have arbitrary rules but are also user-visible, a task is exposed as a thread, and a set of threads is exposed as a process. Both operations working with threads as well as operations working with processes exist.

We're looking at an API in this case that works with threads on one side (parent, the signal is triggered by thread exit) and processes on the other (child, the signal is process-targeted). How these were created is irrelevant, what matters is the abstractions they refer to.

¹ you could equally well argue that processes do not exist in the kernel, they're just threads with sharing flags set differently.

fc417fc802

More precisely, distinguishing a process and a thread is a pointless overspecification. Unfortunately POSIX mandates it and glibc accepts it.

If you want to register per-thread signal handlers you're forced to step outside the bounds of glibc and pthreads which I think is quite unfortunate.

skissane

Digressing a little, but Glibc’s pthreads implementation is painful, because they don’t provide any public API to map a pthread_t to the kernel TID, except for the horrendously awful thread_db. Of course, for the current thread, you can just call gettid() - but if you want to map pthread_t to TID for another thread, the thread_db abomination is the only supported way. Bionic supplies a nice simple pthread_gettid_np() for this, macOS has that too (albeit sadly with an incompatible prototype).

Now, pthread_t is actually a pointer to an undocumented structure, and the TID is stored at a certain offset in it… so it is easy to pull the TID from there. Until some day the glibc developers change the layout of the structure and suddenly that code breaks.

There’s an entry in glibc’s bug tracker for this - https://sourceware.org/bugzilla/show_bug.cgi?id=27880 - but it doesn’t look like it will be implemented any time soon

surajrmal

Linux isn't the only kernel in the world. Posix needs to be kernel agnostic. Also need common abstractions to have unique names.

pengaru

Can you actually substantiate your 85% confident claim? Because it doesn't ring the slightest bell here, and I don't see any mention of "deathsig" in glibc's LinuxThreads fork of Xavier's found @ https://ftp.gnu.org/gnu/libc/glibc-linuxthreads-2.5.tar.bz2

I used LinuxThreads back in the 90s, and its main problem ISTR was hijacking SIGUSR[12]. My interests back then involved demo programming using SVGAlib, and mixing LinuxThreads with SVGAlib was a mess due to both wanting to use SIGUSR1. Endless corrupt consoles...

eqvinox

> Can you actually substantiate your 85% confident claim?

I unfortunately can't, it was apparently added in 2.1.57, which would be somewhere around 1998~1999. I started working with Linux around 2001~2002, and this association of PDEATHSIG with LinuxThreads has at some point embedded itself into my brain… I can't reconstruct when or why. And I can't seem to find the specific patch that added PDEATHSIG, and can't find a versioned history of LinuxThreads either…

Probably best to treat my comment as "grandpa tells weird stories that may or may not be true" :'(

[… I'm not even that old T_T]

Best reference I can find is in MAINTAINERS:

  N: Richard E. Gooch
  E: rgooch@atnf.csiro.au
  D: parent process death signal to children
  D: prctl() syscall
  S: CSIRO Australia Telescope National Facility
  S: P.O. Box 76, Epping
  S: N.S.W., 2121
  S: Australia

[ed.] wait! — https://man7.org/conf/piter2019/once_upon_an_API-Linux-Piter...

"(Of course, there was no explanation of why the feature was needed)"

Note Richard was at minimum involved in discussions about LinuxThreads: https://lkml.iu.edu/hypermail/linux/kernel/9806.2/1227.html — and it does mention prctl and "dying main thread"…

pengaru

Thanks for digging this up, that kerrisk pdf is great.

cryptonector

If you want to be able to spawn processes that fast then `fork()` is NOT your friend. You want either `vfork()` (or `clone()` equivalent) or `posix_spawn()`.

`fork()` is inherently very slow due to the need to either copy the VM of the parent, or arrange to copy pages on write, or copy the resident set of the parent (then copy any pages paged-in when those page-in events happen) -- all three of these options are very expensive.

Also, what I might recommend here is to create a `posix_spawn()`-like API that gives you asynchronous notification of the exec starting or failing, that way you don't block even for that. I'd use a `pipe()` that will be set to close on exec and which will therefore close when the exec starts, but if the exec fails I'd write the `errno` value into the pipe, that way EOF on the pipe implies the exec started while read on the pipe implies the exec failed and you can read the error number out of the pipe.

  https://gist.github.com/nicowilliams/a8a07b0fc75df05f684c23c18d7db234
  https://news.ycombinator.com/item?id=30502392
  https://github.com/famzah/popen-noshell
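The close-on-exec pipe trick can be sketched in a few lines of Python. This is only an illustration of the protocol, not the code from the links above: it uses plain `os.fork()` rather than `vfork()`, the function name is made up, and the parent blocks on the read where a real implementation would hand the read end to its event loop.

```python
import errno
import os
import struct


def spawn_reporting_exec_errors(argv):
    """Fork and exec argv. Return (pid, None) if exec started, else (pid, errno)."""
    # Python creates pipe fds close-on-exec by default, so a successful
    # exec closes the child's write end and the parent sees EOF.
    status_r, status_w = os.pipe()

    pid = os.fork()
    if pid == 0:
        os.close(status_r)
        try:
            os.execvp(argv[0], argv)
        except OSError as exc:
            # exec failed: send the error number to the parent.
            os.write(status_w, struct.pack("i", exc.errno))
        os._exit(127)

    os.close(status_w)
    data = os.read(status_r, struct.calcsize("i"))
    os.close(status_r)
    if not data:
        return pid, None  # EOF: the exec started
    return pid, struct.unpack("i", data)[0]
```

The caller still has to `waitpid()` the returned pid in both cases, since a child whose exec failed exits with status 127 and needs reaping like any other.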

kobzol

The stdlib already mostly does all of that :)

Check out https://kobzol.github.io/rust/2024/01/28/process-spawning-pe....

fc417fc802

Copying the page table isn't free but it isn't particularly expensive either. At least unless the parent is a real behemoth.

Unless you enjoy footguns posix_spawn is probably a better idea than vfork. (Unless you actually need vfork of course.)

The async pipe idea sounds interesting but I'm not clear how it would work. It seems like you'd have to use vfork to implement it but vfork is blocking until you call exec so doesn't that defeat the purpose?

cryptonector

CoW is extremely expensive for threaded processes on multi-processor systems since you need TLB shootdowns.

> The async pipe idea sounds interesting but I'm not clear how it would work. It seems like you'd have to use vfork to implement it but vfork is blocking until you call exec so doesn't that defeat the purpose?

Yes, so have the `vfork()` happen in worker threads.

> Unless you enjoy footguns posix_spawn is probably a better idea than vfork. (Unless you actually need vfork of course.)

I have proof that `vfork()` can be used safely: the several `posix_spawn()` implementations that use it.

fc417fc802

Good point about multiprocessor systems. He did say this is being used on HPC clusters.

You can also use C safely if you're careful. Doesn't mean it isn't full of footguns.

> have the `vfork()` happen in worker threads

Fair enough. Since this is an exercise in efficiency and latency, if you're creating a worker thread isn't an atomic write by the worker cheaper than creating a pipe?

jerf

A similar issue in Go, that I've encountered in real code: https://github.com/golang/go/issues/27505#issuecomment-71370...

In a nutshell, if you want to use the death signal, which is very handy and useful, you also need to lock an OS thread so that it can't be destroyed. Fortunately I'm only spawning one process so I don't need to jump through hoops, I can just dedicate a thread to it, but it would be inconvenient to want to spawn lots of processes that way.

Speaking more generally, a lot of things that I learned in the 200xs apply to "processes", and things I just osmosed over the years as applying to "processes", were changed to apply to "threads" over the decades and a lot of people have not noticed that, even now. Even though I know this, my mental model of what is associated to a thread and what is associated to a process is quite weak, since I've not yet needed to acquire a deep understanding. In general I would suggest to people that if you are dealing with this sort of system programming that you at least keep this general idea in your head so that the thought pops up that if you're having trouble, it may be related to your internal beliefs that things related to "processes" are actually related to "threads" and in fact just because you did something like set a UID or something somewhere in your code doesn't necessarily mean that that UID will be in effect somewhere else.
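The "dedicate a thread to it" workaround translates to other runtimes too. Here is a hedged Python sketch, assuming Linux: one long-lived thread owns every spawn, so the thread the kernel treats as the parent never exits while the program runs. The `Spawner` class and its methods are hypothetical names for this example, and `preexec_fn` carries its usual caveats in threaded programs.

```python
import ctypes
import queue
import signal
import subprocess
import threading

PR_SET_PDEATHSIG = 1  # value from <linux/prctl.h>
libc = ctypes.CDLL(None, use_errno=True)


def _set_pdeathsig():
    # Runs in the child between fork and exec.
    libc.prctl(PR_SET_PDEATHSIG, ctypes.c_ulong(int(signal.SIGKILL)))


class Spawner:
    """Spawns every child from one thread that lives as long as the program."""

    def __init__(self):
        self._requests = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        while True:
            argv, reply = self._requests.get()
            try:
                reply.put(subprocess.Popen(argv, preexec_fn=_set_pdeathsig))
            except (OSError, subprocess.SubprocessError) as exc:
                reply.put(exc)

    def spawn(self, argv):
        # Callable from any thread; the fork itself happens on the spawner thread.
        reply = queue.Queue(maxsize=1)
        self._requests.put((argv, reply))
        result = reply.get()
        if isinstance(result, Exception):
            raise result
        return result
```

Any thread can call `Spawner().spawn(argv)` and wait on the returned process as usual. The cost is that all spawns are serialized through one thread, which is the inconvenience mentioned above when many processes are involved.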

LegionMammal978

Yeah, a lot of process vs. thread distinctions can be unclear, even in documentation. E.g., the Linux clone(2) man page mostly talks about "the child process", even though it can create either a new thread or a new process.

The weirdest case of processes vs. threads is definitely the setuid() family of functions on Linux. The underlying syscalls apply the new uid (or euid, fsuid, etc.) to the current thread, but POSIX requires them to apply to the entire process. How does glibc paper over this? It registers a realtime signal handler which calls the appropriate syscall, and sends that signal to every thread when the wrapper function is called. On top of that, it quietly removes the handler's signal number (SIGSETXID) from all calls to sigfillset(), sigprocmask(), and pthread_sigmask() to keep it from getting blocked, and bumps the value of SIGRTMIN in the userspace headers so that programs won't notice the gap. I believe musl libc does something very similar.

bhawks

Normally I'd stay away from job control posix APIs - but since HyperQueue is a job control system, it might be appropriate if the worker was a session leader. If it dies then all its subprocesses would receive SIGHUP - which is fatal by default.

Generally you'd use this functionality to implement something like sshd or an interactive shell. HQ seems roughly analogous.

https://notes.shichao.io/apue/ch9/#sessions

kobzol

I do use setsid when spawning the children (I omitted it from the post, but I set it in the same pre_exec call where I configure DEATHSIG) but they don't receive any signal, IIRC. Or if they do, it does not seem to be propagated to their children.

bhawks

Calling setsid in pre_exec isn't exactly correct; you'd do that to daemonize a process, so that it doesn't get signals when the process that spawned it exits or its terminal disconnects.

If you read exit(3) you'll see that you also need a controlling terminal for the kernel to send SIGHUP to the processes in your foreground process group.

In Python it'd look roughly like:

  import fcntl, os, sys, termios

  master, slave = os.openpty()

  # fork so that the child is not a process group leader
  # and can therefore call setsid successfully
  try:
    pid = os.fork()
  except OSError:
    sys.exit(1)

  if pid > 0:
    # original process: drop the pty and get out of the way
    os.close(master)
    os.close(slave)
    sys.exit(0)

  os.setsid()
  # we are now session leader and process group leader

  fcntl.ioctl(master, termios.TIOCSCTTY, 0)
  # we now have a controlling pty - closing master will send us a SIGHUP

  os.close(slave)

  # go start spawning subprocesses - if _this_ process is killed, they will receive SIGHUPs. Unless they take themselves out of your session or the foreground process group.

ahepp

Yeah I suspect there may be a solution involving setsid

TheDong

I think they don't want PR_SET_PDEATHSIG but rather PR_SET_CHILD_SUBREAPER, which I think would be both more correct than PDEATHSIG for letting them wait on grand-children / preventing grand-child-zombies, while also avoiding the issue they ran into here entirely.

They would need one special "main thread" that deals with reaping and that isn't subject to tokio's runtime cleaning it up, but presumably they already have that, or else the fix they did apply wouldn't have worked.

Alternatively, if they want they could integrate with systemd, even just by wrapping the children all in 'systemd-run', which would reliably allow cleaning up of children (via cgroups).

timhh

> PR_SET_CHILD_SUBREAPER

I wrote a tool that does just this: https://github.com/timmmm/anakin

If you run `anakin <some command>` it will kill any orphan processes that <some command> makes.

However it still isn't the true "orphans of this process must automatically die" option that everyone writing job control software wants - if `anakin` itself somehow crashes then the orphans can live again.

Still it was the best I could come up with that didn't need root.

badmintonbaseba

The name of the tool is on point.

skissane

> I think they don't want PR_SET_PDEATHSIG but rather PR_SET_CHILD_SUBREAPER, which I think would be both more correct than PDEATHSIG for letting them wait on grand-children / preventing grand-child-zombies, while also avoiding the issue they ran into here entirely.

PR_SET_PDEATHSIG automatically kills your children if you die, but unfortunately doesn’t extend to their descendants

As far as I’m aware, PR_SET_CHILD_SUBREAPER doesn’t do anything if you die. Assuming you yourself don’t crash, it can be used to help clean up orphaned descendant processes, by ensuring they reparent to you instead of init; but in the event you do crash, it doesn’t do anything to help.

PID namespaces do exactly what you want - if their init process dies it automatically kills all its descendants. However, they require privilege - unless you use an unprivileged user namespace - but those are frequently disabled, and even when enabled, using them potentially introduces a whole host of other issues

> Alternatively, if they want they could integrate with systemd

The problem is a lot of code runs in environments without systemd, e.g. code running in containers (Docker, K8S, etc); most containers don’t contain systemd. So any systemd-centric solution is only going to work for some people.

Really, it would be great if Linux added some new process grouping construct which included the “kill all members of this group if its leader dies” semantic of PID namespaces without any of its other semantics. It is those other semantics (especially the new PID number semantics) which are the primary source of the security concerns, so a construct which offered only the “kill-if-leader-dies” semantic should be safe to allow for unprivileged access. (The one complexity is setuid/setgid/file capabilities - allowing an unprivileged process to effectively kill a privileged process at an arbitrary point in its execution is a security risk; plausible solutions include refusing to execute any setuid/setgid/caps executable, or else allowing them to run but removing the process from this grouping when it executes one.)

eqvinox

> PR_SET_PDEATHSIG automatically kills your children if you die, but unfortunately doesn’t extend to their descendants

It indirectly does: unless you unset it, the child dying will trigger another run of PDEATHSIG on the grandchildren, and so on. (The setting is retained across forks, as shown in the original article.)

kobzol

It is sadly not propagated to grandchildren.

I tried the subreaper approach, but it doesn't help. The children are reparented to the worker, but when the worker dies, they are then just reparented to init, like normally.

skissane

> The setting is retained across forks, as shown in the original article

That’s not what the man page says:

> The parent-death signal setting is cleared for the child of a fork(2).

https://man7.org/linux/man-pages/man2/pr_set_pdeathsig.2cons...

Unless the man page is wrong?
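One way to check the man page's statement on a given machine is to set the parent-death signal in a process, fork, and ask the child what it sees via `PR_GET_PDEATHSIG`. A small Python sketch, assuming Linux; the helper names are invented for this example, and the caller's original setting is restored afterwards.

```python
import ctypes
import os
import signal

PR_SET_PDEATHSIG = 1  # values from <linux/prctl.h>
PR_GET_PDEATHSIG = 2
libc = ctypes.CDLL(None, use_errno=True)


def get_pdeathsig():
    sig = ctypes.c_int(-1)
    if libc.prctl(PR_GET_PDEATHSIG, ctypes.byref(sig)) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return sig.value


def set_pdeathsig(signum):
    if libc.prctl(PR_SET_PDEATHSIG, ctypes.c_ulong(int(signum))) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


def pdeathsig_before_and_after_fork(signum=signal.SIGTERM):
    """Return (setting in this process, setting seen by a forked child)."""
    previous = get_pdeathsig()
    set_pdeathsig(signum)
    try:
        in_parent = get_pdeathsig()
        r, w = os.pipe()
        pid = os.fork()
        if pid == 0:
            # Child: report its own setting as a single byte.
            os.write(w, bytes([get_pdeathsig()]))
            os._exit(0)
        os.close(w)
        in_child = os.read(r, 1)[0]
        os.close(r)
        os.waitpid(pid, 0)
        return in_parent, in_child
    finally:
        set_pdeathsig(previous)  # leave the caller's setting untouched
```

If the man page is right, the second value comes back as 0 (no signal configured) while the first is the signal that was just set, which would mean a grandchild only gets a death signal if the child sets one up again itself.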

vlovich123

> when the orphan terminates, it is the subreaper process that will receive a SIGCHLD signal and will be able to wait(2) on the process to discover its termination status

Seems like you don’t need a dedicated “always alive” thread if it’s being delivered to the process and tokio automatically does masking for threads so that you register for listening to signals using its asynchronous mechanisms & don’t have issues around signal safety which it abstracts away for you (i.e. as long as you’re handling the SIGCHLD signal somewhere or even just ignoring it as I don’t think they actually care?).

That being said, it’s not clear PR_SET_CHILD_SUBREAPER actually causes grand children to be killed when the reaper process dies which is the effect they’re looking for here (not the reverse where you reap forked children as they die). So you may need to spawn a dedicated reaper process rather than thread to manage the lifetime of children which is much more complicated.

eqvinox

> That being said, it’s not clear PR_SET_CHILD_SUBREAPER actually causes grand children to be killed when the reaper process dies

CHILD_SUBREAPER kills neither children nor grandchildren. Its effect is in the other direction, intended for sub-service-managers that want to keep track of all children. If the subreaper dies, children are reparented to the next subreaper up (or init).
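To make the direction of the effect concrete, here is a hedged Python sketch, assuming Linux 3.4 or later: the process marks itself as a subreaper, a middle process forks a grandchild and exits at once, and the orphaned grandchild reports which parent it ends up with. Function names and the two synchronization pipes are choices made for this example.

```python
import ctypes
import os

PR_SET_CHILD_SUBREAPER = 36  # values from <linux/prctl.h>
PR_GET_CHILD_SUBREAPER = 37
libc = ctypes.CDLL(None, use_errno=True)


def get_subreaper():
    flag = ctypes.c_int(0)
    if libc.prctl(PR_GET_CHILD_SUBREAPER, ctypes.byref(flag)) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return flag.value


def set_subreaper(enabled):
    if libc.prctl(PR_SET_CHILD_SUBREAPER, ctypes.c_ulong(1 if enabled else 0)) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


def adopt_orphan():
    """Return (our pid, parent pid seen by an orphaned grandchild)."""
    previous = get_subreaper()
    set_subreaper(True)
    try:
        report_r, report_w = os.pipe()
        go_r, go_w = os.pipe()
        middle = os.fork()
        if middle == 0:
            if os.fork() == 0:
                os.read(go_r, 1)  # wait until the middle process is gone
                os.write(report_w, f"{os.getpid()} {os.getppid()}".encode())
                os._exit(0)
            os._exit(0)  # middle process exits, orphaning the grandchild
        os.waitpid(middle, 0)  # reparenting has happened by the time this returns
        os.write(go_w, b"x")
        orphan_pid, orphan_ppid = map(int, os.read(report_r, 64).split())
        os.waitpid(orphan_pid, 0)  # as the new parent, we can reap the orphan
        for fd in (report_r, report_w, go_r, go_w):
            os.close(fd)
        return os.getpid(), orphan_ppid
    finally:
        set_subreaper(previous)
```

The two values returned by `adopt_orphan()` should match: the orphan lands on the subreaper instead of init. Nothing in this mechanism signals the orphan if the subreaper itself goes away, which is the limitation described above.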

TheDong

Yeah, I was assuming they have something calling `wait` somewhere since they say "HyperQueue is essentially a process manager", and to me "process manager" implies pretty strongly "spawns and waits for processes".

emmelaich

Seeing the first mention of 10 seconds, I thought (jokingly) - Why not grep the source for the 10 second value.

Jokingly, because I thought it would be an emergent property, not a literal value.

Turns out it was a literal value after all and grepping would have helped!

ricardobeat

There was no mention of a ten second interval anywhere in the code, only in the tests they wrote while debugging.

AntiRush

As mentioned in the article, there is a 10 second value in Tokio - the default thread timeout.

ricardobeat

That wouldn’t be greppable in the source though?

mkesper

> Although we will need to add one additional unsafe block once we migrate to the 2024 edition because we use std::env::set_var in main :laughing:

This is rightfully unsafe and should not be used without checking. It's simply undefined behavior in the presence of threads, as it's not thread-safe. https://ttimo.typepad.com/blog/2024/11/the-steam-client-upda...

immibis

Good writeup of yet another bug different from all the other bugs.

The Linux kernel isn't really bothered by the difference between threads and processes. Threads are just processes that happen to share an address space, file descriptor table, and thread group ID (what most tools call a PID). I think there are some subtle things related to the thread group ID, but they're subtle. The rest is implemented in glibc.
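The shared thread group ID is easy to observe from userspace. A tiny Python sketch (3.8 or later, for `threading.get_native_id()`); the function name is made up for this example.

```python
import os
import threading


def ids_seen_by_worker():
    """Return the PID and kernel thread ID observed from a second thread."""
    seen = {}

    def worker():
        seen["pid"] = os.getpid()                # thread group ID, shared
        seen["tid"] = threading.get_native_id()  # kernel task ID, per thread

    thread = threading.Thread(target=worker)
    thread.start()
    thread.join()
    return seen
```

On Linux the worker reports the same `pid` as the calling thread but a different `tid`, which is the "thread group ID versus task" split described above.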

rendaw

Are there any differences between threads and processes in how signals are handled?

I recently learned that aside from processes there are process groups, process sessions (setsid), process group and session leaders, trees have associated VT ownership data, systemd sessions (which seem to be inherited by the entire subtree and can't be purged), and possibly other layered metadata spaces that I haven't heard of yet.

And I feel like there's got to be some way to tag or associate custom metadata with processes, but I haven't found it yet.

I really wish there were an overview of all these things and how they interact with each other somewhere.

skissane

> Are there any differences between threads and processes in how signals are handled?

Yes. As signal(7) notes [0], Linux has both “process-directed signals” (which can be handled by any thread in a process), and “thread-directed signals” (which are targeted at a specific thread and only handled by that thread). For user-generated signals, the classification depends on which syscall you use (kill/rt_sigqueueinfo generate process-directed signals, tgkill/rt_tsigqueueinfo generate thread-directed). For system-generated signals, it is up to the kernel code generating the signal to decide. So the same signal number can be thread-directed in some cases and process-directed in others

> systemd sessions (which seem to be inherited by the entire subtree and can't be purged)

At a kernel level those are implemented with cgroups.

> I really wish there were an overview of all these things

Unfortunately I think Linux has grown a complex mess of different features in this area, all of which are full of complicated limitations and gotchas. Despite attempts to introduce orthogonality (e.g. with several different types of namespaces), the end result is still a long way from any ideal of orthogonality

[0] https://man7.org/linux/man-pages/man7/signal.7.html
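A thread-directed signal can be demonstrated from Python with `signal.pthread_kill()`. This is a minimal sketch, assuming a POSIX system: the signal is blocked before the worker starts (threads inherit the mask), so it stays pending on the targeted thread until that thread collects it with `sigwait()`. The function name is invented for this example.

```python
import signal
import threading


def deliver_to_thread(signum=signal.SIGUSR1):
    """Send signum to one specific thread and return what that thread received."""
    # Block the signal first; the worker inherits this mask, so the signal
    # stays pending on it instead of running the default action.
    old_mask = signal.pthread_sigmask(signal.SIG_BLOCK, {signum})
    received = []

    def worker():
        received.append(signal.sigwait({signum}))

    try:
        thread = threading.Thread(target=worker)
        thread.start()
        # Thread-directed: only the worker thread can consume this signal.
        signal.pthread_kill(thread.ident, signum)
        thread.join(timeout=5)
        return received
    finally:
        signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)
```

Using `os.kill(os.getpid(), signum)` instead would generate a process-directed signal, which any thread that has it unblocked (or is waiting for it) may end up handling.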

rendaw

Oh thanks! I was recently having `runuser -l` silently not do the session setup because of the systemd thing, so maybe there's a better way (than laundering it through a process launcher daemon in a separate tree) to handle that.

I forgot capabilities with another 5 layers (+) of different flags and applied differently to processes and files... (and then namespaces, etc)

eqvinox

> Are there any differences between threads and processes in how signals are handled?

Yes, absolutely, there are thread-directed and process-directed signals; for the latter a thread is chosen at random (more or less) to handle the signal.

oguz-ismail

> I really wish there were an overview of all these things and how they interact with each other somewhere.

    man 7 signal

fc417fc802

Also see `man 2 clone` and `man 7 cgroups`.

eqvinox

The distinction isn't quite as subtle as you believe, it also shows up in e.g. file locks, AF_UNIX SO_PEERCRED, and with any process-directed signal.

As a matter of fact, the original implementation of POSIX threads for Linux was userspace based and had unfixable bugs and issues that necessitated introducing the concept of threads into the Linux kernel.

vlovich123

> Edit: Someone on Reddit sent me a link to a method that can override the thread keep-alive duration. Its description makes it clear why the tasks were failing after exactly 10 seconds

> Yeah, testing if a task can run for 20 seconds isn’t great, but hey, at least it’s something

Well a reasonable thing to me is then to use the override within the test to shorten it (e.g. to 1s & use a 2s timeout).

kobzol

Could be done, yeah, but 20s isn't that much, and I'd like to avoid adding more test-only magic environment variables to configure this (our end-to-end tests are in Python and they use HQ as a binary).

vlovich123

I appreciate the sentiment, but I think it’s making the wrong pragmatism/purity tradeoff. The test is brittle - what happens when a future update of the dependency in a couple of years makes a change to the default timeout value? Aside from making test runs quicker which is good for anyone running the test suite without caring about this 1 test itself, it future proofs the test flakiness better against defaults changing out from under you.

yencabulator

> This makes sense, of course, because there’s not exactly an asynchronous version of fork.

IORING_OP_CLONE and IORING_OP_EXEC have been proposed: https://lwn.net/Articles/1002371/

kevingadd

Leaving PDEATHSIG enabled would make it harder for me to sleep at night, but I understand why the alternatives probably aren't appealing. Seems like a future bug waiting to happen. At least the author knows what to expect now.
