The first year of free-threaded Python
301 comments
May 16, 2025
chubot
Spawning processes generally takes much less than 1 ms on Unix
Spawning a PYTHON interpreter process might take 30 ms to 300 ms before you get to main(), depending on the number of imports
It's 1 to 2 orders of magnitude difference, so it's worth being precise
This conflation is the source of a fallacy about, say, CGI. A CGI program in C, Rust, or Go works perfectly well.
e.g. sqlite.org runs with a process PER REQUEST - https://news.ycombinator.com/item?id=3036124
kragen
To be concrete about this, http://canonical.org/~kragen/sw/dev3/forkovh.c took 670μs to fork, exit, and wait on the first laptop I tried it on, but only 130μs when compiled with dietlibc instead of glibc; on a 2.3 GHz E5-2697 Xeon it took 130μs even compiled with glibc.
httpdito http://canonical.org/~kragen/sw/dev3/server.s (which launches a process per request) seems to take only about 50μs because it's not linked with any C library and therefore only maps 5 pages. Also, that doesn't include the time for exit() because it runs multiple concurrent child processes.
On this laptop, a Ryzen 5 3500U running at 2.9GHz, forkovh takes about 330μs built with glibc and about 130–140μs built with dietlibc, and `time python3 -c True` takes about 30000–50000μs. I wrote a Python version of forkovh http://canonical.org/~kragen/sw/dev3/forkovh.py and it takes about 1200μs to fork(), _exit(), and wait().
If anyone else wants to clone that repo and test their own machines, I'm interested to hear the results, especially if they aren't in Linux. `make forkovh` will compile the C version.
1200μs is pretty expensive in some contexts but not others. Certainly it's cheaper than spawning a new Python interpreter by more than an order of magnitude.
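For reference, a minimal sketch of that kind of measurement (not the actual forkovh.py, just an illustration of the approach; POSIX-only, and numbers will vary by libc and hardware):

    import os, time

    N = 1000
    start = time.perf_counter()
    for _ in range(N):
        pid = os.fork()
        if pid == 0:
            os._exit(0)         # child exits immediately, skipping interpreter teardown
        os.waitpid(pid, 0)      # parent waits for the child before forking again
    elapsed = time.perf_counter() - start
    print(f"{elapsed / N * 1e6:.0f} microseconds per fork/exit/wait")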
kragen
On my cellphone forkovh is 700μs and forkovh.py is 3700μs. Qualcomm SDM630. All the forkovh numbers are with 102400 bytes of data.
ori_b
As another example: I run https://shithub.us with shell scripts, serving a terabyte or so of data monthly (mostly due to AI crawlers that I can't be arsed to block).
I'm launching between 15 and 3000 processes per request. While Plan 9 is about 10x faster at spawning processes than Linux, it's telling that launching 3000 C processes from a shell is about as fast as starting one Python interpreter.
kstrauser
The interpreter itself is pretty quick:
    ᐅ time echo "print('hi'); exit()" | python
    hi
    ________________________________________________________
    Executed in   21.48 millis    fish           external
       usr time   16.35 millis  146.00 micros   16.20 millis
       sys time    4.49 millis  593.00 micros    3.89 millis
In Python, if you are spawning processes or even threads in a tight loop you have already lost. Use ThreadPoolExecutor or ProcessPoolExecutor from concurrent.futures instead. Then startup time becomes no factor.
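As a sketch of the pool-reuse pattern being described (worker count and workload here are arbitrary):

    from concurrent.futures import ProcessPoolExecutor

    def work(x):
        return x * x

    if __name__ == "__main__":
        # Pay process startup once; the workers are reused across submissions.
        with ProcessPoolExecutor(max_workers=4) as pool:
            results = list(pool.map(work, range(100)))
        print(sum(results))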
charleshn
> Spawning processes generally takes much less than 1 ms on Unix
It depends on whether one uses clone, fork, posix_spawn etc.
Fork can take a while depending on the size of the address space, number of VMAs etc.
crackez
Fork on Linux should use copy-on-write VM pages now, so if you fork inside Python it should be cheap. If you launch a new Python process from, say, the shell, and it's already in the buffer cache, then you should only have to pay the startup CPU cost of the interpreter, since the IO should be satisfied from the buffer cache...
knome
for glibc and linux, fork just calls clone. as does posix_spawn, using the flag CLONE_VFORK.
morningsam
>Spawning a PYTHON interpreter process might take 30 ms to 300 ms
Which is why, at least on Linux, Python's multiprocessing doesn't do that but fork()s the interpreter instead, which takes only low-single-digit milliseconds.
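For illustration, a sketch of requesting fork() explicitly (POSIX-only; 'fork' has long been the Linux default, though defaults can change between Python versions):

    import multiprocessing as mp

    def work(x):
        return x * x

    if __name__ == "__main__":
        ctx = mp.get_context("fork")   # fork the live interpreter instead of spawning a fresh one
        with ctx.Pool(4) as pool:
            print(pool.map(work, range(8)))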
zahlman
Even when the 'spawn' strategy is used (default on Windows, and can be chosen explicitly on Linux), the overhead can largely be avoided. (Why choose it on Linux? Apparently forking can cause problems if you also use threads.) Python imports can be deferred (`import` is a statement, not a compiler or pre-processor directive), and child processes (regardless of the creation strategy) name the main module as `__mp_main__` rather than `__main__`, allowing the programmer to distinguish. (Being able to distinguish is of course necessary here, to avoid making a fork bomb - since the top-level code runs automatically and `if __name__ == '__main__':` is normally top-level code.)
But also keep in mind that cleanup for a Python process also takes time, which is harder to trace.
Refs:
https://docs.python.org/3/library/multiprocessing.html#conte...
https://stackoverflow.com/questions/72497140
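For illustration, a minimal sketch of the pattern described above (the guard is what prevents the fork bomb; the deferred import is a stand-in for any expensive one):

    import multiprocessing as mp

    def worker(n):
        import json   # deferred import: paid only where actually needed
        return json.dumps({"n": n})

    if __name__ == "__main__":   # spawned children see __name__ == "__mp_main__"
        mp.set_start_method("spawn")
        with mp.Pool(2) as pool:
            print(pool.map(worker, range(4)))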
codethief
> Which is why, at least on Linux, Python's multiprocessing doesn't do that but fork()s the interpreter
…which can also be a great source of subtle bugs if you're writing a cross-platform application.
Sharlin
Unix is not the only platform though (and is process creation fast on all Unices or just Linux?) The point about interpreter init overhead is, of course, apt.
btilly
Process creation should be fast on all Unices. If it isn't, then the lowly shell script (heavily used in Unix) is going to perform very poorly.
seunosewa
You can use a pool of interpreter processes. You don't have to spawn one for each request.
LPisGood
My understanding is that spawning a thread takes just a few microseconds, so compared with either a plain process or a Python interpreter process there are still orders of magnitude to be gained.
ogrisel
You cannot share arbitrarily structured objects in the `ShareableList`, only atomic scalars and bytes / strings.
If you want to share structured Python objects between instances, you have to pay the cost of `pickle.dump`/`pickle.load` (CPU overhead for interprocess communication) + the memory cost of replicated objects in the processes.
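A rough sketch of that serialization cost (the data, sizes, and timings here are arbitrary and will vary):

    import pickle, time

    obj = {i: list(range(10)) for i in range(100_000)}   # some structured data
    t0 = time.perf_counter()
    blob = pickle.dumps(obj)     # paid on every send to another process
    obj2 = pickle.loads(blob)    # paid again on receive, plus a full copy in memory
    print(f"{time.perf_counter() - t0:.3f}s round trip, {len(blob) / 1e6:.1f} MB")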
notpushkin
We need a dataclass-like interface on top of a ShareableList.
notpushkin
Actually, ShareableList feels like a tuple really (as it’s impossible to change its length). If we could mix ShareableList and collections.namedtuple together, it would get us 90% there (99.9% if we use typing.NamedTuple). Unfortunately, I can’t decipher either one [1, 2] at first glance – maybe if I get some more sleep?
[1]: https://github.com/python/cpython/blob/3.13/Lib/collections/...
[2]: https://github.com/python/cpython/blob/3.13/Lib/typing.py#L2...
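A rough sketch of the idea (a hypothetical wrapper, not a real API; each named field gets a fixed slot in the underlying list):

    from multiprocessing.shared_memory import ShareableList

    class SharedPoint:
        _fields = ("x", "y", "label")

        def __init__(self, sl):
            self._sl = sl   # a ShareableList with one slot per field

        def __getattr__(self, name):
            try:
                return self._sl[self._fields.index(name)]
            except ValueError:
                raise AttributeError(name)

    sl = ShareableList([1.0, 2.0, "origin"])
    p = SharedPoint(sl)
    print(p.x, p.label)   # attribute access backed by shared memory
    sl.shm.close(); sl.shm.unlink()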
tomrod
I can fit a lot of json into bytes/strings though?
frollogaston
If all your state is already json-serializable, yeah. But that's just as expensive as copying if not more, hence what cjbgkagh said about flatbuffers.
cjbgkagh
Perhaps flatbuffers would be better?
reliabilityguy
What’s the point? The whole idea is to share objects, not to serialize them, whether it’s JSON, pickle, or whatever.
vlovich123
That’s even worse than pickle.
sgarland
So don’t do that? Send data to workers as primitives, and have a separate process that reads the results and serializes it into whatever form you want.
modeless
Yeah I've had great success sharing numpy arrays this way. Explicit sharing is not a huge burden, especially when compared with the difficulty of debugging problems that occur when you accidentally share things between threads. People vastly overstate the benefit of threads over multiprocessing and I don't look forward to all the random segfaults I'm going to have to debug after people start routinely disabling the GIL in a library ecosystem that isn't ready.
I wonder why people never complained so much about JavaScript not having shared-everything threading. Maybe because JavaScript is so much faster that you don't have to reach for it as much. I wish more effort was put into baseline performance for Python.
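For illustration, a sketch of the explicit-sharing pattern with numpy (assumes numpy is installed; both views live in one process here for brevity, but a worker would attach by name the same way):

    import numpy as np
    from multiprocessing import shared_memory

    shm = shared_memory.SharedMemory(create=True, size=8 * 1024)
    arr = np.ndarray((1024,), dtype=np.float64, buffer=shm.buf)
    arr[:] = 0.0

    # A worker attaches to the same block by name; no copy is made.
    shm2 = shared_memory.SharedMemory(name=shm.name)
    arr2 = np.ndarray((1024,), dtype=np.float64, buffer=shm2.buf)
    arr2[0] = 42.0
    print(arr[0])   # 42.0: both arrays view the same memory

    shm2.close()
    shm.close()
    shm.unlink()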
zahlman
> I wish more effort was put into baseline performance for Python.
There has been. That's why the bytecode is incompatible between minor versions. It was a major selling(?) point for 3.11 and 3.12 in particular.
But the "Faster CPython" team at Microsoft was apparently just laid off (https://www.linkedin.com/posts/mdboom_its-been-a-tough-coupl...), and all of the optimization work has to my understanding been based around fairly traditional techniques. The C part of the codebase has decades of legacy to it, after all.
Alternative implementations like PyPy often post impressive results, and are worth checking out if you need to worry about native Python performance. Not to mention the benefits of shifting the work onto compiled code like NumPy, as you already do.
csense
Yeah, when I'm having Python performance issues, my first instinct is to reach for Pypy. My second instinct is to rewrite the "hot" part in C or Rust.
dhruvrajvanshi
> I wonder why people never complained so much about JavaScript not having shared-everything threading. Maybe because JavaScript is so much faster that you don't have to reach for it as much. I wish more effort was put into baseline performance for Python.
This is a fair observation.
I think a part of the problem is that the things that make GIL-less Python hard are also the things that make faster baseline performance hard, i.e., an over-reliance of the ecosystem on the shape of the CPython data structures.
What makes python different is that a large percentage of python code isn't python, but C code targeting the CPython api. This isn't true for a lot of other interpreted languages.
com2kid
> I wonder why people never complained so much about JavaScript not having shared-everything threading. Maybe because JavaScript is so much faster that you don't have to reach for it as much. I wish more effort was put into baseline performance for Python.
Nobody sane tries to do math in JS. Backend JS is recommended for situations where processing is minimal and it is mostly lots of tiny IO requests that need to be shunted around.
I'm a huge JS/Node proponent and if someone says they need to write a backend service that crunches a lot of numbers, I'll recommend choosing a different technology!
For some reason Python peeps keep trying to do actual computations in Python...
frollogaston
Python peeps tend to do heavy numbers calc in numpy, but sometimes you're doing expensive things with dictionaries/lists.
dragonwriter
> For some reason Python peeps keep trying to do actual computations in Python...
Mostly, Python peeps do heavy calculation in not-really-Python (even if it is embedded in and looks like Python), e.g., via numpy, numba, taichi, etc.
frollogaston
"I wonder why people never complained so much about JavaScript not having shared-everything threading"
Mainly cause Python is often used for data pipelines in ways that JS isn't, causing situations where you do want to use multiple CPU cores with some shared memory. If you want to use multiple CPU cores in NodeJS, usually it's just a load-balancing webserver without IPC and you just use throng, or maybe you've got microservices.
Also, JS parallelism simply excelled from the start at waiting on tons of IO, there was no confusion about it. Python later got asyncio for this, and by now regular threads have too much momentum. Threads are the worst of both worlds in Py, cause you get the overhead of an OS thread and the possibility of race conditions without the full parallelism it's supposed to buy you. And all this stuff is confusing to users.
monkeyelite
> I wonder why people never complained so much about JavaScript not having shared-everything threading
Because it greatly simplifies the language and gives you all kinds of invariants.
isignal
Processes can die independently, so the state of a concurrent shared-memory data structure can be difficult to manage when a process dies while modifying it under a lock. Postgres, which uses shared memory data structures, can sometimes need to kill all its backend processes because it cannot fully recover from such a state.
In contrast, no one thinks about what happens if a thread dies independently because the failure mode is joint.
wongarsu
> In contrast, no one thinks about what happens if a thread dies independently because the failure mode is joint.
In Rust if a thread holding a mutex dies the mutex becomes poisoned, and trying to acquire it leads to an error that has to be handled. As a consequence every rust developer that touches a mutex has to think about that failure mode. Even if in 95% of cases the best answer is "let's exit when that happens".
The operating system tends to treat your whole process as one and shut down everything or nothing. But a thread can still crash on its own due to unhandled OOM, assertion failures, or any number of other issues.
jcalvinowens
> But a thread can still crash on its own due to unhandled OOM, assertion failures, or any number of other issues
That's not really true on POSIX. Unless you're doing nutty things with clone(), or you actually have explicit code that calls pthread_exit() or gettid()/pthread_kill(), the whole process is always going to die at the same time.
POSIX signal dispositions are process-wide, the only way e.g. SIGSEGV kills a single thread is if you write an explicit handler which actually does that by hand. Unhandled exceptions usually SIGABRT, which works the same way.
** Just to expand a bit: there is a subtlety in that, while dispositions are process-wide, one individual thread does indeed take the signal. If the signal is handled, only that thread sees -EINTR from a blocking syscall; but if the signal is not handled, the default disposition affects all threads in the process simultaneously no matter which thread is actually signalled.
oconnor663
I think this is conflating two different things. A Rust Mutex gets poisoned if the thread holding it panics, but that's not the same thing as evaporating into thin air. Destructors run while a panic unwinds (indeed this is how the Mutex poisons itself), and you usually have the option of catching panics if you want. In the panic=abort configuration, where you can't catch a panic, it takes down the whole process rather than just one thread, which is another way of making the same point here: you can't usually kill a thread independently of the whole process it's in, because lots of things (like locks) assume you'll never do that.
jcalvinowens
This is a solvable problem though, the literature is overflowing with lock-free implementations of common data structures. The real question is how much performance you have to sacrifice for the guarantee...
perlgeek
> Never understood why this isn’t used more frequently.
Can you throw a JSON-serializable data structure (lists, dict, strings, numbers) into SharedMemory? What about regular instances of random Python classes? If the answer is "no", that explains why it's not done more often.
The examples in the docs seem to pass byte strings and byte arrays around, which is far less convenient than regular data structures.
dragonwriter
> Can you throw a JSON-serializable data structure (lists, dict, strings, numbers) into SharedMemory?
You can throw a JSON-serialized data structure into SharedMemory, sure, since you can store strings.
> The examples in the docs seem to pass byte strings and byte arrays around
The examples in the docs largely use ShareableList, which itself can contain any of int, float, bool, str, bytes, and None-type values.
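A minimal ShareableList round trip to make that concrete (note that str/bytes slots are fixed at their creation size):

    from multiprocessing.shared_memory import ShareableList

    sl = ShareableList([1, 2.5, True, "hello", b"\x00\x01", None])
    other = ShareableList(name=sl.shm.name)   # a second view, e.g. in another process
    other[0] = 42
    print(sl[0])   # 42: both views share the same memory

    other.shm.close()
    sl.shm.close()
    sl.shm.unlink()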
tinix
shared memory only works on dedicated hardware.
if you're running in something like AWS fargate, there is no shared memory. have to use the network and file system which adds a lot of latency, way more than spawning a process.
copying processes through fork is a whole different problem.
green threads and an actor model will get you much further in my experience.
bradleybuda
Fargate is just a container runtime. You can fork processes and share memory like you can in any other Linux environment. You may not want to (because you are running many cheap / small containers) but if your Fargate containers are running 0.25 vCPUs then you probably don't want traditional multiprocessing or multithreading...
tinix
Go try it and report back.
Fargate isn't just ECS and plain containers.
You cannot use shared memory in fargate, there is literally no /dev/shm.
See "sharedMemorySize" here: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/...
> If you're using tasks that use the Fargate launch type, the sharedMemorySize parameter isn't supported.
sgarland
Well don’t use Fargate, there’s your problem. Run programs on actual servers, not magical serverless bullshit.
tinix
> Well don’t use Fargate, there’s your problem. Run programs on actual servers, not magical serverless bullshit.
That kind of absolutism misses the point of why serverless architectures like Fargate exist. It might feel satisfying, but it closes the door on understanding why stateless and ephemeral workloads exist in the first place.
I get the frustration, but dismissing a production architecture outright ignores the constraints and trade-offs that lead teams to adopt it in the first place. It's worth asking: if so many teams are using this shit in production, at scale, with real stakes, what do they know that might be missing from my current mental model?
Serverless, like any abstraction, isn't magic. It's a tool with defined trade-offs, and resource/process isolation is one of them. If you're running containerized workloads at scale, optimizing for organizational velocity, security boundaries, and multi-tenant isolation, these constraints aren't bullshit, they're literally design parameters and intentional boundaries.
It's easy to throw shade from a distance, but the operational reality of running modern systems, especially in regulated or high-scale environments, looks very different from a home lab or startup sandbox.
pansa2
Does removal of the GIL have any other effects on multi-threaded Python code (other than allowing it to run in parallel)?
My understanding is that the GIL has lasted this long not because multi-threaded Python depends on it, but because removing it:
- Complicates the implementation of the interpreter
- Complicates C extensions, and
- Causes single-threaded code to run slower
Multi-threaded Python code already has to assume that it can be pre-empted on the boundary between any two bytecode instructions. Does free-threaded Python provide the same guarantees, or does it require multi-threaded Python to be written differently, e.g. to use additional locks?
rfoo
> Does free-threaded Python provide the same guarantees
Mostly. Some of the "can be pre-empted on the boundary between any two bytecode instructions" bugs are really hard to hit without free-threading, though. And without free-threading people don't use as much threading stuff. So by nature it exposes more bugs.
Now, my rants:
> have any other effects on multi-threaded Python code
It stops people from using multi-process workarounds. Hence, it simplifies user-code. IMO totally worth it to make the interpreter more complex.
> Complicates C extensions
The alternative (sub-interpreters) complicates C extensions more than free-threading, and the single most important C extension in the entire ecosystem, numpy, stated that they can't and don't want to support sub-interpreters. On the contrary, they already support free-threading today and are actively sorting out the remaining bugs.
> Causes single-threaded code to run slower
That's the trade-off. Personally I think a single-digit percentage slow-down of single-threaded code is worth it.
celeritascelery
> That's the trade-off. Personally I think a single-digit percentage slow-down of single-threaded code is worth it.
Maybe. I would expect that 99% of python code going forward will still be single threaded. You just don’t need that extra complexity for most code. So I would expect that python code as a whole will have worse performance, even though a handful of applications will get faster.
rfoo
That's the mindset that leads to the funny result that `uv pip` is like 10x faster than `pip`.
Is it because Rust is just fast? Nope. For anything after resolving dependency versions, raw CPU performance doesn't matter at all. It's that writing concurrent PLUS parallel code in Rust is easier: you don't need to spawn a few processes and wait for the interpreter to start in each, and you don't need to constantly serialize whatever shit you want to run. So, someone did it!
Yet, there's a pip maintainer who actively sabotages free-threading work. Nice.
pphysch
But the bar to parallelizing code gets much lower, in theory. Your serial code got 5% slower but has a direct path to being 50% faster.
And if there's a good free-threaded HTTP server implementation, the RPS of "Python code as a whole" could increase dramatically.
foresto
As I recall, CPython has also been getting speed-ups lately, which ought to make up for the minor single-threaded performance loss introduced by free threading. With that in mind, the recent changes seem like an overall win to me.
oconnor663
Sure but of those 99%, how many are performance-sensitive, CPU-bound (in Python not in C) applications? It's clearly some, not saying it's an easy tradeoff, but I assume the large majority of Python programs out there won't notice a slowdown.
rocqua
Note that there is an entire order of magnitude range for a 'single digit'.
A 1% slowdown seems totally fine. A 9% slowdown is pretty bad.
smilliken
I've seen benchmarks that estimate the regression at 20-30%, though I expect there's large variance depending on what a program's bottleneck is.
monkeyelite
If so, then why use python?
jacob019
Your understanding is correct. You can use all the cores, but it's much slower per thread, and existing libraries may need to be reworked. I tried it with PyTorch; it used 10x more CPU to do half the work. I expect these issues to improve; still great to see after 20 years of wishing for it.
btilly
It makes race conditions easier to hit, and that will require multi-threaded Python to be written with more care to achieve the same level of reliability.
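For illustration, the kind of read-modify-write race in question and the locking it forces on you (a sketch; the counter is hypothetical):

    import threading

    counter = 0
    lock = threading.Lock()

    def bump_unsafe():
        global counter
        for _ in range(100_000):
            counter += 1        # load, add, store: three steps, not atomic

    def bump_safe():
        global counter
        for _ in range(100_000):
            with lock:          # serializes the read-modify-write
                counter += 1

    threads = [threading.Thread(target=bump_safe) for _ in range(4)]
    for t in threads: t.start()
    for t in threads: t.join()
    print(counter)   # reliably 400000 with the lock; without it, possibly less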
monkeyelite
Yes it makes every part of the ecosystem more complex and prone to bugs in hopes of getting more performance in a scripting language.
make3
I hate how these threads always devolve into insane discussions about why not using threads is better, while most people who have actually tried to speed up real-world Python code realize how amazing it would be to have proper threads with shared memory, instead of processes with so many limitations: being forced to pickle objects back and forth, fork so often just not working in the cloud setting, and spawn being so slow in a lot of applications. The usage of processes is just much heavier and less straightforward.
pjmlp
In other news, Microsoft dumped the whole faster Python team; apparently the 2025 earnings weren't enough to keep the team around.
https://www.linkedin.com/posts/mdboom_its-been-a-tough-coupl...
Let's see what performance improvements still land on CPython, unless another company sponsors the work.
I guess Facebook (no need to correct me on the name) is still sponsoring part of it.
bgwalter
They were quite a bit behind the schedule that was promised five years ago.
Additionally, at this stage the severe political and governance problems cannot have escaped Microsoft. I imagine that no competent Microsoft employee wants to give his expertise to CPython, only later to suffer group defamation from a couple of elected mediocre people.
CPython is an organization that overpromises, allocates jobs to the obedient and faithful while weeding out competent dissenters.
It wasn't always like that. The issues are entirely self-inflicted.
make3
Microsoft also fired a whole lot of other open source people unrelated to Python in this current layoff
pjmlp
Notably MAUI, ASP.NET, TypeScript and AI frameworks.
biorach
> CPython is an organization that overpromises, allocates jobs to the obedient and faithful while weeding out competent dissenters.
This stinks of BS
wisty
It sounds like an oblique reference to that time they temporarily suspended one of the most valuable members of the community, apparently for having the audacity to suggest that their powers to suspend members of the community seemed a little arbitrary and open to abuse.
morkalork
Didn't Google lay off their entire Python development team in the last year as well? I wonder if there is some impetus behind both.
monkeyelite
This is a story from a few years ago, and they obviously have lots of teams that use Python. This team was internal support for Python and tools - which honestly sounds exactly like the kind of thing that would get cut in a pinch (no value judgement).
bgwalter
No, that story is from last year.
make3
doesn't print money right away = cut by executive #3442
vlovich123
That’s unfortunate but I called it when people were claiming that Microsoft had committed to this effort for the long term.
mtzaldo
Could we do a crowdfunding campaign so we can keep paying them? The whole world is/will benefit from their work.
rich_sasha
Ah that's very, very sad. I guess they have embraced and extended, there's only one thing left to do.
stusmall
That shows a misunderstanding of what EEE was. This team was sending changes upstream, which is the exact opposite of the "extend" step of the strategy. The idea of "extend" was to add proprietary extensions on top of an open standard/project, locking customers into the MSFT implementation.
jerrygenser
Ok so a better example of what you describe might be vscode.
biorach
At this stage the cliched and clueless comments about embrace/extend/extinguish are tiresome and inevitable whenever Microsoft is mentioned.
A few decades ago MS did indeed have a playbook which they used to undermine open standards. Laying off some members of the Python team bears no resemblance whatsoever to that. At worst it will delay the improvement of free-threaded Python. That's all.
Your comment is lazy and unfounded.
kstrauser
cough Bullshit cough
* VSCode got popular and they started preventing forks from installing its extensions.
* They extended the Free Source pyright language server into the proprietary pylance. They don’t even sell it. It’s just there to make the FOSS version less useful.
* They bought GitHub and started rate limiting it for logged-out visitors.
Every time Microsoft touches a thing, they end up locking it down. They can’t help it. It’s their nature. And if you’re the frog carrying that scorpion across the pond and it stings you, well, you can only blame it so much. You knew this when they offered the deal.
Every time. It hasn’t changed substantially since they declared that Linux is cancer, except to be more subtle in their attacks.
falcor84
It wouldn't have bothered me if you just said "Facebook" - I probably wouldn't have even noticed it. But I'm really curious why you chose to write "Facebook", then apparently noticed the issue, and instead of replacing it with "Meta" decided to add the much longer "(no need to correct me on the name)". What axe are you grinding?
pjmlp
Yes, because I am quite certain someone without anything better to do would correct me on that.
For me Facebook will always be Facebook, and Twitter will always be Twitter.
rbanffy
> Twitter will always be Twitter.
If Elon can deadname his daughter, then we can deadname his company.
falcor84
> Yes, because I am quite certain someone without anything better to do would correct me on that.
Well, you sure managed to avoid that by setting up camp on that hill. Kudos on so much time saved.
> For me Facebook will always be Facebook, and Twitter will always be Twitter.
Well, for me the product will always be "Thefacebook", but that's because I haven't used it since. But I do respect that there's a company running it now that does more stuff and contributes to open source projects.
heybrendan
I am a Python user, but far from an expert. Occasionally, I've used 'concurrent.futures' to kick off some very simple functions at the same time.
How are 'concurrent.futures' users impacted? What will I need to change moving forward?
rednafi
It’s going to get faster since threads won’t be locked on the GIL. If you’re locking shared objects correctly or not using them at all, then you should be good.
AlexanderDhoore
Am I the only one who sort of fears the day when Python loses the GIL? I don't think Python developers know what they’re asking for. I don't really trust complex multithreaded code in any language. Python, with its dynamic nature, I trust least of all.
jillesvangurp
You are not the only one who is afraid of changes and a bit change resistant. I think the issue here is that the reasons for this fear are not very rational. And the interest of the wider community is to deal with technical debt; the GIL is pure technical debt. Defensible 30 years ago, a bit awkward 20 years ago, and downright annoying and embarrassing now that world + dog has been doing all their AI data processing with Python at scale for the last 10 years. It had to go in the interest of future-proofing the platform.
What changes for you? Nothing unless you start using threads. You probably weren't using threads anyway because there is little to no point in python to using them. Most python code bases completely ignore the threading module and instead use non blocking IO, async, or similar things. The GIL thing only kicks in if you actually use threads.
If you don't use threads, removing the GIL changes nothing. There's no code that will break. All those C libraries that aren't thread safe are still single threaded, etc. Only if you now start using threads do you need to pay attention.
There's some threaded python code of course that people may have written somewhat naively, in the hope that it would make things faster, but which is constantly hitting the GIL and is effectively single threaded. That code might now run a little faster. And probably with more bugs, because naive threaded code tends to have those.
But a simple solution to address your fears: simply don't use threads. You'll be fine.
Or learn how to use threads. Because now you finally can and it isn't that hard if you have the right abstractions. I'm sure those will follow in future releases. Structured concurrency is probably high on the agenda of some people in the community.
dkarl
> What changes for you? Nothing unless you start using threads
Coming from the Java world, you don't know what you're missing. Looking inside an application and seeing a bunch of threadpools managed by competing frameworks, debugging timeouts and discovering that tasks are waiting more than a second to get scheduled on the wrong threadpool, tearing your hair out because someone split a tiny sub-10μs bit of computation into two tasks and scheduling the second takes a hundred times longer than the actual work done, adding a library for a trivial bit of functionality and discovering that it spins up yet another threadpool when you initialize it.
(I'm mostly being tongue in cheek here because I know it's nice to have threading when you need it.)
UltraSane
Just consider that mess job security!
HDThoreaun
> But a simple solution to address your fears: simply don't use threads. You'll be fine.
I'm not worried about new code. I'm worried about stuff written 15 years ago by a monkey who had no idea how threads work and just read something on Stack Overflow that said to use threading. This code will likely break when run post-GIL. I suspect there is actually quite a bit of it.
bayindirh
Software rots, software tools evolve. When Intel released performance primitives libraries which required recompilation to analyze multi-threaded libraries, we were amazed. Now, these tools are built into processors as performance counters and we have way more advanced tools to analyze how systems behave.
Older code will break, but code breaks all the time: a language changes how something behaves in a new revision, and suddenly 20-year-old bedrock tools are getting massively patched to accommodate both new and old behavior.
Is it painful, ugly, unpleasant? Yes, yes and yes. However change is inevitable, because some of the behavior was rooted in inability to do some things with current technology, and as hurdles are cleared, we change how things work.
My father's friend told me that the length of a variable's name used to affect compile/link times. Now we can test whether we have memory leaks in Rust. That was impossible 15 years ago due to the performance of the processors.
bgwalter
If it is C-API code: Implicit protection of global variables by the GIL is a documented feature, which makes writing extensions much easier.
Most C extensions that will break are not written by monkeys, but by conscientious developers that followed best practices.
actinium226
If code has been unmaintained for more than a few years, it's usually such a hassle to get it working again that 99% of the time I'll just write my own solution, and that's without threads.
I feel some trepidation about threads, but at least for debugging purposes there's only one process to attach to.
zahlman
> I'm worried about stuff written 15 years ago
Please don't - it isn't relevant.
15 years ago, new Python code was still dominantly for 2.x. Even code written back then with an eye towards 3.x compatibility (or, more realistically, lazily run through `2to3` or `six`) will have quite little chance of running acceptably on 3.14 regardless. There have been considerable removals from the standard library, and `async` is no longer a valid identifier name (you laugh, but that broke Tensorflow once). The attitude taken towards """strings""" in a lot of 2.x code results in constructs that can be automatically converted into valid syntax that appears to preserve the original intent, but whose behavior is not at all automatically fixed.
Also, the modern expectation is of a lock-step release cadence. CPython only supports up to the last 5 versions, released annually; and whenever anyone publishes a new version of a package, generally they'll see no point in supporting unsupported Python versions. Nor is anyone who released a package in the 3.8 era going to patch it if it breaks in 3.14 - because support for 3.14 was never advertised anyway. In fact, in most cases, support for 3.9 wasn't originally advertised, and you can't update the metadata for an existing package upload (you have to make a new one, even if it's just a "post-release") even if you test it and it does work.
Practically speaking, pure-Python packages usually do work in the next version, and in the next several versions, perhaps beyond the support window. But you can really never predict what's going to break. You can only offer a new version when you find out that it's going to break - and a lot of developers are going to just roll that fix into the feature development they were doing anyway, because life's too short to backport everything for everyone. (If there's no longer active development and only maintenance, well, good luck to everyone involved.)
If 5 years isn't long enough for your purposes, practically speaking you need to maintain an environment with an outdated interpreter, and find a third party (RedHat seems to be a popular choice here) to maintain it.
dhruvrajvanshi
> I'm not worried about new code. I'm worried about stuff written 15 years ago by a monkey who had no idea how threads work and just read something on Stack Overflow that said to use threading. This code will likely break when run post-GIL. I suspect there is actually quite a bit of it.
I was with OP's point but then you lost me. You'll always have to deal with that coworker's shitty code, GIL or not.
Could they make a worse mess with multi threading? Sure. Is their single threaded code as bad anyway because at the end of the day, you can't even begin understand it? Absolutely.
But yeah, I think Python people don't know what they're asking for. They think GIL-less Python is gonna give everyone free puppies.
rbanffy
> There's some threaded python code of course
A fairly common pattern for me is to start a terminal UI updating thread that redraws the UI every second or so while one or more background threads do their thing. Sometimes, it’s easier to express something with threads and we do it not to make the process faster (we kind of accept it will be a bit slower).
The real enemy is state that can be mutated from more than one place. As long as you know who can change what, threads are not that scary.
tgv
> Nothing unless you start using threads.
Isn't it also promises/futures? They might start threads implicitly.
jillesvangurp
Why would you use those without threads?
bayindirh
More realistically, as it happened in ML/AI scene, the knowledgeable people will write the complex libraries and will hand these down to scientists and other less experienced, or risk-averse developers (which is not a bad thing).
With the critical mass Python acquired over the years, the GIL becomes a very sore bottleneck in some cases. This is why I decided to learn Go, for example: a properly threaded (and green-threaded) language which is higher level than C/C++ but lower than Python, and which allows me to do things I can't do with Python. Compilation is another reason, but it was secondary with respect to threading.
bgwalter
Knowledgeable people? PyTorch has memory leaks by design: it uses std::shared_ptr for a graph with cycles. It also has threading issues.
quectophoton
I don't want to add more to your fears, but also remember that LLMs have been trained on decades worth of Python code that assumes the presence of the GIL.
rocqua
This could, indeed, be quite catastrophic.
I wonder if companies will start adding this to their system prompts.
zahlman
Suppose they do. How is the LLM supposed to build a model of what will or won't break without a GIL purely from a textual analysis?
Especially when they've already been force-fed with ungodly amounts of buggy threaded code that has been mistakenly advertised as bug-free simply because nobody managed to catch the problem with a fuzzer yet (and which is more likely to expose its faults in a no-GIL environment, even though it's still fundamentally broken with a GIL)?
miohtama
GIL or no-GIL concerns only people who want to run multicore workloads. If you are not already spending time threading or multiprocessing your code, there is practically no change. Most race condition issues you need to think about are there regardless of the GIL.
fulafel
A lot of Python usage is leveraging libraries with parallel kernels inside written in other languages. A subset of those is bottlenecked on Python side speed. A sub-subset of those are people who want to try no-GIL to address the bottleneck. But if non-GIL becomes pervasive, it could mean Python becomes less safe for the "just parallel kernels" users.
kccqzy
Yes sure. Thought experiment: what happens when these parallel kernels suddenly need to call back in to Python? Let's say you have a multithreaded sorting library. If you are sorting numbers then fine nothing changes. But if you are sorting objects you need to use a single thread because you need to call PyObject_RichCompare. These new parallel kernels will then try to call PyObject_RichCompare from multiple threads.
immibis
With the GIL, multithreaded Python gives concurrent I/O without worrying about data structure concurrency (unless you do I/O in the middle of it) - it's a lot like async in this way - data structure manipulation is atomic between "await" expressions (except the "await" is implicit, and you might have written one without realizing it, in which case you have a bug). Meanwhile you still get to use threads to handle several concurrent I/O operations. I bet a lot of Python code is written this way and will start randomly crashing if the data manipulation becomes non-atomic.
rowanG077
Afaik the only guarantee there is, is that a bytecode instruction is atomic. Built-in data structures are mostly safe, I think, on a per-operation level, but combining them is not. I think by default the interpreter checks for other threads to run every few milliseconds, even if there is no IO or async action. See `sys.getswitchinterval()`.
OskarS
That doesn't match with my understanding of free-threaded Python. The GIL is being replaced with fine-grained locking on the objects themselves, so sharing data-structures between threads is still going to work just fine. If you're talking about concurrency issues like this causing out-of-bounds errors:
    if len(my_list) > 5:
        print(my_list[5])

(i.e. because a different thread can pop from the list in-between the check and the print), that could just as easily happen today. The GIL makes sure that only one python interpreter runs at once, but it's entirely possible that the GIL is released and switches to a different thread after the check but before the print, so there's no extra thread-safety issue in free-threaded mode.

The problems (as I understand it, happy to be corrected) are mostly two-fold: performance and ecosystem. Using fine-grained locking is potentially much less efficient than using the GIL in the single-threaded case (you have to take and release many more locks, and reference count updates have to be atomic), and many, many C extensions are written under the assumption that the GIL exists.
imtringued
You start talking about GIL and then you talk about non-atomic data manipulation, which happen to be completely different things.
The only code that is going to break because of "no GIL" is C extensions, and for very obvious reasons: you can now call into C code from multiple threads, which wasn't possible before. Python code could always be called from multiple Python threads, even in the presence of the GIL.
monkeyelite
When you launch processes to do work you get multi-core workload balancing for free.
bratao
This is a common mistake and very badly communicated. The GIL does not make Python code thread-safe. It only protects the internal CPython state. Multi-threaded Python code is not thread-safe today.
porridgeraisin
Internal CPython state also includes, say, a dictionary's internal state. So for practical purposes it is safe. Of course, TOCTOU, stale reads and various race conditions are not (and can never be) protected by the GIL.
amelius
Well, I think you can manipulate a dict from two different threads in Python, today, without any risk of segfaults.
spacechild1
It's memory safe, but it's not necessarily free of race conditions! It's not only C extensions that release the GIL; the Python interpreter itself releases the GIL after a certain number of instructions so that other threads can make progress. See https://docs.python.org/3/library/sys.html#sys.getswitchinte....
Certain operations that look atomic to the user are actually comprised of multiple bytecode instructions. Now, if you are unlucky, the interpreter decides to release the GIL and yield to another thread exactly during such instructions. You won't get a segfault, but you might get unexpected results.
See also https://github.com/google/styleguide/blob/91d6e367e384b0d8aa...
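For illustration, `dis` makes the multi-instruction nature of such operations visible (a sketch; the exact opcodes vary by CPython version):

    import dis

    def bump(d):
        d["hits"] += 1

    dis.dis(bump)
    # The output shows separate load / add / store steps; the interpreter
    # may yield the GIL between them, so another thread can interleave.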
pansa2
You can do so in free-threaded Python too, right? The dict is still protected by a lock, but one that’s much more fine-grained than the GIL.
kevingadd
This should not have been downvoted. It's true that the GIL does not make python code thread-safe implicitly, you have to either construct your code carefully to be atomic (based on knowledge of how the GIL works) or make use of mutexes, semaphores, etc. It's just memory-safe and can still have races etc.
tialaramex
You're not the only one. David Baron's note certainly applies: https://bholley.net/blog/2015/must-be-this-tall-to-write-mul...
In a language conceived for this kind of work it's not as easy as you'd like. In most languages you're going to write nonsense which has no coherent meaning whatsoever. Experiments show that humans can't successfully understand non-trivial programs unless they exhibit Sequential Consistency - that is, they can be understood as if (which is not reality) all the things which happen do happen in some particular order. This is not the reality of how the machine works, for subtle reasons, but without it merely human programmers are like "Eh, no idea, I guess everything is computer?". It's really easy to write concurrent programs which do not satisfy this requirement in most of these languages, you just can't debug them or reason about what they do - a disaster.
As I understand it Python without the GIL will enable more programs that lose SC.
monkeyelite
Good engineering design is about making unbalanced tradeoffs where you get huge wins for low costs. These kinds of decisions are opinionated and require you to say no to some edge cases to get a lot back on the important cases.
One lesson I have learned is that good design cannot survive popularity and bureaucracy that comes with it. Over time people just beat down your door with requests to do cases you explicitly avoided. You’re blocking their work and not being pragmatic! Eventually nobody is left to advocate for them.
And part of that is the community has more resources and can absorb some more complexity. But this is also why I prefer tools with smaller communities.
freeone3000
I'm sure you'll be happy using the last language that has to fork() in order to thread. We've only had consumer-level multicore processors for 20 years, after all.
im3w1l
You have to understand that people come from very different angles with Python. Some people write web servers in Python, where speed equals money saved. Other people write little UI apps where speed is a complete non-issue. Yet others write AI/ML code that spends most of its time in GPU code, but then they want to do just a little data massaging in Python, which can easily bottleneck the whole thing. And some people write scripts that don't use a venv but rather OS libraries.
monkeyelite
I don’t understand this argument. My python program isn’t the only program on the system - I have a database, web server, etc. It’s already multi-core.
YouWhy
Hey, I've been developing professionally with Python for 20 years, so wanted to weigh in:
Decent threading is awesome news, but it only affects a small minority of use cases. Threads are only strictly necessary when it's prohibitive to message pass, and the Python ecosystem these days includes a playbook solution for literally any such case. Considering the multiple major pitfalls of threads (i.e., locking), they are likely to become useful only in specific libraries/domains and not as a general-purpose tool.
Additionally, with all my love for vanilla Python, anyone who needs to squeeze the juice out of their CPU (which is actually memory bandwidth) has plenty of other tools -- off-the-shelf libraries written in native code. (Honorable mention to PyPy, numba and such.)
Finally, the one dramatic performance innovation in Python has been async programming - I warmly encourage everyone not familiar with it to consider taking a look.
kstrauser
I haven’t been using it that much longer than you, and I agree with most of what you’re saying, but I’d characterize it differently.
Python has a lot of solid workarounds for avoiding threading because until now Python threading has absolutely sucked. I had naively tried to use it to make a CPU-bound workload twice as fast and soon realized the implications of the GIL, so I threw all that code away and made it multiprocessing instead. That sucked in its own way because I had to serialize lots of large data structures to pass around, so 2x the cores got me about 1.5x the speed and a warmer server room.
I would love to have good threading support in Python. It’s not always the right solution, but there are a lot of circumstances where it’d be absolutely peachy, and today we’re faking our way around its absence with whole playbooks of alternative approaches to avoid the elephant in the room.
But yes, use async when it makes sense. It’s a thing of beauty. (Yes, Glyph, we hear the “I told you so!” You were right.)
zahlman
> That sucked in its own way because I had to serialize lots of large data structures to pass around, so 2x the cores got me about 1.5x the speed and a warmer server room.
In many cases you can't reasonably expect better than that (https://en.wikipedia.org/wiki/Amdahl's_law). If your algorithm involves sharing "large data structures" in the first place, that's a bad sign.
kstrauser
That's true, but you can sometimes get a whole lot closer if you can share state between threads. Sometimes you can't help the size of the data. Maybe you have a thread reading frames from a video and passing them to workers for analysis. You might get crazy IO contention if you pass around "foo.vid;frame222" and "foo.vid;frame223" to the workers and make them retrieve that data themselves.
There may be another way to skin that specific cat. My point isn't to solve one specific problem, but to say that some problems are just inherently large. And with Python, today, if those workers are CPU-bound in Python-land, that means running separate processes and passing large hunks of state around (or shoving it through SHM; same idea, just a different way of passing state).
BrokrnAlgorithm
I find Python's async to be lacking in fine-grained control. It may be fine for 95% of simple use cases, but it lacks advanced features such as sequential constraining, task queue memory management, task pre-emption etc. The async keyword also tends to bubble up through codebases in awful ways, making it almost impossible to create reasonably decoupled code.
0x000xca0xfe
I know it's just an AI image... but a snake with two tails? C'mon!
brookst
Confusoborus
vpribish
shh. don't complain too loudly or we'll lose an important tell. python articles using snake illustrations can usually be ignored because they are not clueful.
-- python, monty
amelius
The snake in the header image appears to have two tail-ends ...
cestith
I guess it’s spawned a second thread in the same process.
pawanjswal
This is some serious groundwork for the next era of performance!
aitchnyu
What's currently stopping me (apart from library support) from running a single command that starts up WSGI workers and Celery workers in a single process?
gchamonlive
Nothing, it's just that these aren't first class features of the language. Also someone already explained that the GIL is mostly about technical debt in the CPython interpreter, so there are reasons other than full parallelism to get rid of the GIL.
MichaelMoser123
CPython doesn't have a JIT, so why is free-threaded Python a higher priority than developing a just-in-time compiler? The latter would be more resonant with the typical use case for Python and benefit a larger portion of users, wouldn't it? (Wouldn't a backend server project use Golang or Java to begin with?)
diziet_sma
There have been many attempts to make a Python JIT that is compatible with CPython, to various levels of success. However, the larger reason is that the gains from removing the GIL far exceed the gains from a JIT.
If you're writing performance-sensitive Python code, the "hot" code is likely already in a C extension, such as NumPy. So there is negligible benefit to running the code with a JIT.
sgarland
> Instead, many reach for multiprocessing, but spawning processes is expensive
Agreed.
> and communicating across processes often requires making expensive copies of data
SharedMemory [0] exists. Never understood why this isn’t used more frequently. There’s even a ShareableList which does exactly what it sounds like, and is awesome.
[0]: https://docs.python.org/3/library/multiprocessing.shared_mem...