Default musl allocator considered harmful to performance
58 comments
September 5, 2025 · burntcaramel
jayd16
"Considered harmful" is a meme they're referencing, but yeah... it's pretty stale at this point.
Agingcoder
To me it’s not a meme, it’s a reference to a very famous letter by Dijkstra regarding goto statements.
dwattttt
Only once we convince C developers that a lack of performance isn't inherently harmful.
sim7c00
Maybe it's not 'slow' but more 'generalized for a wide range of use-cases'? Because is it really slow for what it does, or simply slower compared to a specialized implementation? (This is like calling a regular person's car slow compared to an F1 car... sure, the F1 is fast, but good luck taking your kids on holiday or doing weekly shopping runs.)
masklinn
“Generalised to a wide range of use cases” is a really strange way to say “unsuitable to most multi-threaded programs”.
In 2025 an allocator not cratering multi-threaded programs is the opposite of specialisation.
flohofwoe
It only matters when your threads allocate with such a high frequency that they run into contention.
Too-high access frequency to a shared resource is not a "general case", but simply poorly designed multithreaded code. (Besides, a high allocation frequency through the system allocator is also poor design for single-threaded code; application code simply should not assume any specific performance behaviour from the system allocator.)
flarecoder
For docker images, cgr.dev/chainguard/wolfi-base (https://images.chainguard.dev/directory/image/wolfi-base/ver...) is a great replacement for Alpine. Wolfi is glibc based. It's easy to switch from Alpine since Wolfi uses apk for package management with similar package names and also contains busybox like Alpine.
dijit
I’d much rather go with distroless, if it’s a choice.
But I think you can tweak musl to perform well, and musl is closer to the spec than glibc, so I would rather use it, even if it’s slower in the default case for multithreaded programmes.
flohofwoe
My simple rule of thumb: if the general purpose allocator shows up in performance profiles, then there's too much allocation going on in the hot path. Depending on the 'system allocator' being fast in all situations is a convenient but sloppy attitude for code that's supposed to be portable, since neither the C standard nor POSIX says anything about performance.
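A minimal sketch of the kind of refactor this rule of thumb points at (function and variable names are illustrative, not from the article): hoist the allocation out of the hot loop and reuse the capacity, so the allocator never appears in the profile at all.

```rust
// Instead of allocating a fresh String on every iteration of a hot loop,
// allocate scratch space once outside and reuse its capacity.
fn join_trimmed(lines: &[&str]) -> String {
    let mut out = String::new(); // grows as needed, amortized allocations
    let mut buf = String::new(); // scratch buffer, reused every iteration
    for line in lines {
        buf.clear(); // keeps the capacity; frees nothing
        buf.push_str(line.trim());
        out.push_str(&buf);
        out.push('\n');
    }
    out
}

fn main() {
    assert_eq!(join_trimmed(&["  a", "b  "]), "a\nb\n");
}
```

The same pattern applies in C with a caller-provided buffer: the point is that the hot path does zero calls into malloc, regardless of which libc is underneath.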
time4tea
Maybe it's just that the allocator is absolutely fine for single-threaded programs, and that's what a lot of programs are...
It's not so long ago that GNU libc had a very similar allocator too, and that's why you'd pop Hoard in your LD_PRELOAD or whatever.
Not every program is multi-threaded, and so not every program would experience thread contention.
petcat
Programs that tend to have higher performance requirements are typically multi-threaded, and those are also the ones hit particularly hard by this issue.
citrin_ru
glibc malloc still doesn't work well for multi-threaded apps. It is prone to memory fragmentation, which causes excessive memory usage. One can reduce the number of arenas using the MALLOC_ARENA_MAX environment variable, and in many cases that's a good idea, but it can increase lock contention.
If you care about the efficiency of a multi-threaded app you should use jemalloc (sadly no longer maintained, but it still works well), mimalloc, or tcmalloc.
simonask
Hot take: Almost all programs are actually multithreaded. The only exception is tiny UNIX-like shell utilities that are meant to run in parallel with other processes, and toy programs.
The other exception is programs that should be multithreaded but aren't, because they are written in languages where adding more threads is disproportionately hard (C, C++) or impossible (Python, Ruby, etc.).
sim7c00
How are C/C++ disproportionately hard? The concept of multi-threading is the same for any language that supports it, most of the primitives are the same, and implementing them is neither a lot of code nor complicated.
The difficulty lies entirely in the design... actually using parallelism where it matters. Tons of multi-threaded programs are just single-threaded with a lot of 'scheduler' spliced into that one thread -_-
imtringued
I'm not seeing how this justifies a 700x performance difference.
rurban
Rich replaced the default musl malloc some time ago for exactly those reasons. Maybe they still used the old musl libc?
The new one was drafted here: https://github.com/richfelker/mallocng-draft
Aissen
This is addressed in the article: https://nickb.dev/blog/default-musl-allocator-considered-har...
masklinn
The new allocator does nothing to improve the performances in a threaded / contended application: https://www.openwall.com/lists/musl/2025/09/04/3
typpilol
The response to the link here is really telling.
It blames everything on app code like Wayland.
fyrn_
The musl pthread mutexes are also awfully slow: https://justine.lol/mutex/
userbinator
I believe musl is supposed to be optimised heavily for size, not speed.
stonogo
Specifically, its goals are low memory overhead and hardening: safe defaults, with an easy swap to a performance-oriented malloc for those apps that want it.
My question is: why is Rust performance contingent on a C malloc?
masklinn
> why is Rust performance contingent on a C malloc?
Because Rust switched to “system” allocators way back for compatibility with, well, the system, as well as introspection / perf tooling, to lower the size of basic programs, and to lower maintenance.
It used to use jemalloc, but that took a lot of space in even the most basic binary, and because jemalloc is not available everywhere, Rust still had to deal with system allocators anyway.
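For context on the mechanism being discussed: since that switch, Rust defaults to the system allocator, and swapping in another one is done with the `#[global_allocator]` attribute. A minimal sketch using only the standard library (real replacements would come from crates such as `tikv-jemallocator` or `mimalloc`, not shown here):

```rust
use std::alloc::System;

// Rust's default is already the system allocator; spelling it out with
// #[global_allocator] shows the hook where a different allocator (e.g. a
// jemalloc or mimalloc wrapper type) could be substituted in one line.
#[global_allocator]
static GLOBAL: System = System;

fn main() {
    // All heap allocations in the program now route through GLOBAL,
    // which on musl-based systems means musl's malloc.
    let v: Vec<u32> = (0..4).collect();
    assert_eq!(v.iter().sum::<u32>(), 6);
}
```

This is why the thread's question "why is Rust performance contingent on a C malloc?" has the answer it does: unless a program opts out via this hook, every `Box`, `Vec`, and `String` goes through the libc allocator.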
flohofwoe
So basically, the Rust project made a bad decision and now it's all musl's fault? ;)
torginus
The root cause of the issue is that musl malloc uses a single heap and relies on locking to support multiple threads. This means each allocation/free must acquire that lock. IMO it's good for single-threaded programs (which may have been musl's main use case), but Rust programs nowadays mostly use multiple threads.
In contrast, mimalloc, a similarly minimalistic allocator, has a per-thread heap, with each thread owning the memory it allocates; cross-thread frees are handled in a deferred manner.
This works very well with Rust's ownership system, where objects rarely move between threads.
Internally, both allocators use size-class based allocation into predefined chunks, with the key difference being that musl uses bitmaps and mimalloc uses free lists to keep track of memory.
Musl could be fixed if it switched from a single shared heap to per-thread heaps as well.
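The locking difference described above can be sketched as a toy (this is an illustration of the two designs, not a real allocator: a single lock-guarded pool stands in for musl's shared heap, and a thread-local cache stands in for mimalloc's per-thread heap):

```rust
use std::cell::RefCell;
use std::sync::Mutex;

// Shared pool guarded by one lock: every refill must take this lock,
// which is the musl-style contention point.
static GLOBAL_POOL: Mutex<Vec<Box<[u8; 64]>>> = Mutex::new(Vec::new());

thread_local! {
    // Per-thread cache: the mimalloc-style fast path, no lock needed.
    static LOCAL_CACHE: RefCell<Vec<Box<[u8; 64]>>> = RefCell::new(Vec::new());
}

fn alloc_block() -> Box<[u8; 64]> {
    LOCAL_CACHE.with(|c| {
        c.borrow_mut().pop().unwrap_or_else(|| {
            // Slow path: fall back to the shared pool under the lock.
            let mut pool = GLOBAL_POOL.lock().unwrap();
            pool.pop().unwrap_or_else(|| Box::new([0u8; 64]))
        })
    })
}

fn free_block(b: Box<[u8; 64]>) {
    // Fast path: return the block to the current thread's cache,
    // without touching the global lock.
    LOCAL_CACHE.with(|c| c.borrow_mut().push(b));
}

fn main() {
    let b = alloc_block();   // caches empty: falls through to a fresh Box
    free_block(b);           // lands in this thread's cache, lock-free
    let _b2 = alloc_block(); // served from the local cache, lock-free
    LOCAL_CACHE.with(|c| assert!(c.borrow().is_empty()));
}
```

A real per-thread allocator also has to handle cross-thread frees (mimalloc defers them, as noted above), but even this toy shows why steady-state alloc/free traffic on one thread never contends once the cache is warm.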
flohofwoe
> a similarly minimalistic allocator
mimalloc has about 10kloc, while (assuming I'm looking in the right place) the new musl allocator has 891 and the old musl allocator has 518 lines of code. I wouldn't call an order of magnitude difference in line count 'similar'.
torginus
It's minimalistic in the sense that it compiles to a tiny binary; a lot of the extra code is either per-platform support (musl is POSIX-only afaik) or for debugging. Yes it's bigger, but still tiny compared to something like jemalloc, and I'm sure it's like 10kb in a binary.
jauntywundrkind
Alas, this is a huge footgun that ensnares many orgs, because engineers seem drawn like moths to the flame of Alpine container images. Yes, they are small, but the ramifications of Alpine and musl are significant.
Optimizing for size and stdlib code simplicity is probably not the best fit for your application server! Container size has always struck me as such a Goodhart's Law issue (and worse, it was already a bad measure to begin with, since it covers only a very brief part of the software lifecycle). Goodhart's Law:
> When a measure becomes a target, it ceases to be a good measure
This particular musl/Alpine footgun can be worked around. It's not particularly hard to install and use another allocator on Alpine, or anywhere really. Ruby folks in particular seem to have a lot of lore around jemalloc, with various version preferences and MALLOC_CONF settings on top of that. But in general I continue to feel like Alpine base images bring in quite an X factor, even if you knowingly adjust the allocator: the prevalence of Alpine in container images feels unfortunate and eccentric.
Going distroless is always an option, though usually a little too radical for my tastes. I think of musl+busybox+apk as the distinguishing aspects of Alpine, so on that basis I'm excited to see the recent huge strides by uutils, the Rust rewrite of GNU coreutils focused on compatibility, while offering BusyBox-like all-in-one binary convenience. It should make a nice compact coreutils for containers! The recent 0.2 release has competitive performance, which is awesome to see. https://www.phoronix.com/news/Rust-Coreutils-0.2
Karrot_Kream
Huh, I guess I'm lucky I never faced this; we've always used Debian or RHEL containers where I've worked. Every time I toyed with using a minimalist distro, I found debugging to be much more difficult and ended up abandoning the idea.
Once the container OS forks and runs your binary, I'm curious why it matters. Is it because people run interpreted code (like Python or Node) and use runtimes that link musl libc? If you deploy JVM or Go apps this will probably not be a factor.
jauntywundrkind
The JVM will also use whatever libc is available, afaik. Here's an article from 2021 on switching a JVM container to jemalloc. But this isn't for the heap, it's just for the JVM itself and IO-related concerns! https://blog.malt.engineering/java-in-k8s-how-weve-reduced-m...
Go is a rare counter-example, which ignores the system allocator and bundles its own.
jurschreuder
I'm not a fan of Rust, I'm more of a C++ guy, but ripgrep is also nice; I always install it.
anthk
Chimera Linux did some changes on their distro because of that.
EDIT: Ah, they were mentioned, of course.
On malloc replacements: Telescope, a gopher/gemini client, used to be a bit crashy until I switched it to jemalloc on some platforms (with LD_PRELOAD).
Also, the performance rendering pages with tons of links improved a lot.
w15v
[dead]
Instead of “harmful to performance”, why can’t we just say “slow”?
“Harmful” should be reserved for things that affect security or privacy, e.g. constructs that accidentally encourage bugs, as goto does.