Make Ubuntu packages 90% faster by rebuilding them
164 comments
March 18, 2025
smallstepforman
vlovich123
Mimalloc is a general-purpose allocator like JEMalloc / TCMalloc. Glibc is known to have a pretty bad allocator that modern allocators like MIMalloc and the latest TCMalloc (not the one available by default in Ubuntu) run laps around. While the speedup is of course variable, benchmarks generally show an across-the-board improvement (whether that matters for any given application is something else entirely). As for crashes: these are all general-purpose multi-threaded allocators and behave no differently from glibc (modulo bugs, which can exist equally in glibc).
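For anyone who wants to try this without rebuilding anything, an allocator can usually be swapped in per-process via LD_PRELOAD. A minimal sketch, assuming libmimalloc and jq are installed (the library path and package name vary by distro):

```shell
# Look up mimalloc in the linker cache; skip gracefully if it isn't installed
MIMALLOC=$(ldconfig -p 2>/dev/null | awk '/libmimalloc\.so/ {print $NF; exit}')
if [ -n "$MIMALLOC" ] && command -v jq >/dev/null 2>&1; then
    # Every malloc/free in the child process now goes through mimalloc
    LD_PRELOAD="$MIMALLOC" jq -n '[range(1000)] | add'   # prints 499500
else
    echo "mimalloc (or jq) not installed; skipping"
fi
```

This only affects the one process, so it's a cheap way to benchmark an allocator against your actual workload before committing to a rebuild.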
smallstepforman
Agree for most short-running apps. I updated my comment to reflect issues with apps that are constantly reallocating and run for longer than 60 seconds. But you are absolutely correct for most short-running apps; in 99% of cases replacing glibc is recommended. However, there is an app or two where only glibc's stability avoids triggering a pathological use case, and you have no choice. Hence why it's the default: there are fewer crashes in pathological cases, and the devs are exhausted from dealing with crash bugs that can be eliminated by using the slower allocator.
vlovich123
Looked at your updated post and it looks like you’re operating under wildly incorrect assumptions.
1. Fragmentation: MIMalloc and the newest TCMalloc definitely handle this better than glibc. This is well established in many many many benchmarks.
2. In terms of process lifetime, MIMalloc (Microsoft Cloud) and TCMalloc (Google Cloud) are designed to be run for massive long-lived services that continually allocate/deallocate over long periods of time. Indeed, they have much better system behavior in that allocating a bunch of objects & then freeing them actually ends up eventually releasing the memory back to the OS (something glibc does not do).
> However, there is an app or two where glibc stability doesnt trigger a pathological use cases, and you have no choice.
I’m going to challenge you to please produce an example with MIMalloc or the latest TCMalloc (or heck, even any real data point from some other popular allocator, versus vague anecdotes). This simply is not something these allocators suffer from, and it would be a major bug the projects would fix.
grandempire
That’s why it’s a bad idea to use one allocator for everything in existence. It’s terrible that everyone pays the cost of thread safety even for single-threaded applications, or even for multithreaded applications with disciplined resource management.
silisili
This is a common pain point.
Write in a language that makes sense for the project. Then people tell you that you should have used this other language, for reasons.
Use a compression algo that makes sense for your data. Then people will tell you why you are stupid and should have used this other algo.
My most recent memory of this was needing to compress specific long json strings to fit in Dynamo. I exhaustively tested every popular algo, and Brotli came out far ahead. But that didn't stop every passerby from telling me that zlib is better.
It's rather exhausting at times...
teitoklien
Gentoo Linux is essentially made specifically for people like this, to be able to optimize one's own Linux rig for one's specific use case.
After the initial setup it’s pretty simple and easy to use. I remember making a ton of friends in Matrix’s Gentoo Linux channel; fun times.
Fun fact: the initial ChromeOS was basically just a custom Gentoo Linux install. I’m not sure if they still use Gentoo Linux internally.
donio
> Gentoo Linux is essentially made specifically for people like this, to be able to optimize one's own Linux rig for one's specific use case.
That's true but worth noting that "optimize" here doesn't necessarily refer to performance.
I've been using Gentoo for 20 years and performance was never the reason. Gentoo is great if you know how you want things to work. Gentoo helps you get there.
kennysoona
If it wasn't for performance, what was gained in using it over something like Slackware and building only the packages you needed to?
xlii
A long time ago, when I was using it, I preferred Gentoo for its ergonomics and better visibility into the supply chain.
Slackware was very manual, and some bits were drowned in its low-level detail and long command chains. Gentoo felt easy but made dependencies visible, with the hard cost of compilation times attached.
Being a newb back then, I enjoyed the user friendliness combined with access to the machinery beneath. The satisfaction of a 1s boot time speedup, the result of 48h+ of compilation, was unparalleled, too ;)
6SixTy
Afaik the Gentoo based ChromeOS is being replaced by Android.
mycall
It would be interesting if Gentoo supported Fuchsia next.
yjftsjthsd-h
Gentoo already has prefix support for non-Linux systems and used to have at least some interest in a full Gentoo/kFreeBSD, so it's plausible.
surajrmal
What would that even mean?
eru
I used Gentoo for a while, but the temptation to endlessly fiddle with everything always led me to eventually break the system. (It's not Gentoo's fault, it's mine.)
Afterwards I moved to ArchLinux, and that has been mostly fine for me.
If you are using a fairly standard processor, then Gentoo shouldn't give you that much of an advantage?
ryao
Gentoo lets you do all of the tweaks mentioned here within the system package manager, so you still get security updates for your tweaked build. You can also install Gentoo on top of another system via Gentoo Prefix for use as a userland package manager:
eru
> Gentoo lets you do all of the tweaks mentioned here within the system package manager, so you still get security updates for your tweaked build.
Yes, Gentoo is great. I'm just saying that for me it was too much of a temptation.
serbuvlad
I'd like to bring attention to the ALHP repos[1].
These are the Arch packages built for x86-64-v2, x86-64-v3 and x86-64-v4, which are basically names for different sets of x86-64 extensions. Selecting the highest level supported by your processor should get you most of the way to -march=native, without the hassle of compiling it yourself.
It also enables -O3 and LTO for all packages.
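To pick the right ALHP repo, you can derive your CPU's microarchitecture level from its feature flags. A rough sketch that checks one representative flag per level rather than the full ISA list (on glibc 2.33+, `ld.so --help` also reports the supported levels authoritatively):

```shell
# Read the CPU flags once; each level implies the previous ones
flags=$(grep -m1 '^flags' /proc/cpuinfo 2>/dev/null || true)
level=1
case "$flags" in *sse4_2*)  level=2 ;; esac   # x86-64-v2: SSE3/SSSE3/SSE4.x/POPCNT...
case "$flags" in *avx2*)    level=3 ;; esac   # x86-64-v3: AVX, AVX2, BMI1/2, FMA...
case "$flags" in *avx512f*) level=4 ;; esac   # x86-64-v4: AVX-512 F/BW/CD/DQ/VL
echo "highest supported level: x86-64-v$level"
```

Note this only approximates the spec (each level is defined by a full set of extensions, not a single flag), but in practice the representative flags track the levels closely.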
eru
Nice, I'll try them out!
LTO is great, but I have my doubts about -O3 (vs the more conservative -O2).
UPDATE: bah, ALHP repos don't support the nvidia drivers. And I don't want to muck around with setting everything up again.
shanemhansen
So ChromeOS and also the OS for GKE are still basically built this way.
rgmerk
It's a while since I had to deal with this kind of thing, but my memory was that as soon as you go beyond the flags that the upstream developers use (just to be clear, I mean the upstream developers, not the distro packagers) you're buying yourself weird bugs and a whole lot of indifference if they occur.
I haven't used a non-libc malloc before but I suspect the same applies.
Brian_K_White
Two opposing things are both true at the same time.
If you as an individual avoid being at all different, then you are in the most company and will likely have the most success in the short term.
But it's also true that if we all do that then that leads to monoculture and monoculture is fragile and bad.
It's only because of people building code in different contexts (different platforms, compilers, options, libraries, etc...) that code ever becomes at all robust.
A bug that you mostly don't trigger because your platform or build flags just happens to walk just a hair left of the hole in the ground, was still a bug and the code is still better for discovering and fixing it.
We as individuals all benefit from code being generally robust instead of generally fragile.
taeric
I've been building my own emacs for a long time, and have yet to hit any weird bugs. I thought that as long as you avoid any unsafe optimizations, you should be fine? Granted, I also thought that -march=native was the main boost that I was seeing. This post indicates that is not necessarily the case.
I also suspect that any application using floats is more likely to have rough edges?
pertymcpert
Complex software usually has some undefined behavior lurking that at higher or even just different optimization levels can trigger the compiler to do unexpected things to the code. It happens all the time in my line of work. If there's an extensive test suite you can run to verify that it still works mostly as expected then it's easier.
notpushkin
On the other hand, if your optimization helps consistently across platforms, you could convince upstream developers to implement it directly. (Not necessarily across all platforms – a sizable performance gain on just a single arch might still be enough to tweak configuration for that particular build).
rlpb
Note that if you do this then you will opt out of any security updates, not just for jq but also for its regular expression dependency oniguruma. For example, there was a security update for oniguruma previously; if this sort of thing happens again, you'd be vulnerable, and jq is often used to parse untrusted JSON.
> * SECURITY UPDATE: Fix multiple invalid pointer dereference, out-of-bounds write memory corruption and stack buffer overflow.
(that one was for CVE-2017-9224, CVE-2017-9226, CVE-2017-9227, CVE-2017-9228 and CVE-2017-9229)
ryao
A userland package manager like Gentoo Prefix could be used to install a custom build of this and still get security updates.
rlpb
Indeed, there are many methods to have a custom build and still get security updates, including at least one method that is native to Ubuntu and doesn’t need any external tooling. However my warning refers to the method presented in the article, where this isn’t the case.
jmward01
But isn't there still the kernel of an idea here for a package management system that intelligently decides to build based on platform? Seems like a lot of performance to leave on the table.
taeric
I'm curious how applicable these are, in general? Feels like pointing out that using interior doors in your house misses out on the security afforded from a vault door. Not wrong, but there is also a reason every door in a bank is not a vault door.
That is, I don't want to devalue the CVE system; but it is also undeniable that there are major differences in impact between findings?
topspin
This is certainly true. Also, by replacing the allocator and changing compiler flags, you're possibly immunizing yourself from attacks that rely on some specific memory layout.
estebarb
By hardwiring the allocator you may end up with binaries that load two different allocators. It is no fun to debug a program that uses jemalloc's free to release memory allocated by glibc. Unless you know what you are doing, it is better to leave it as is.
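The failure mode here is allocate-with-one-allocator, free-with-the-other. A minimal sketch of the rule that avoids it (`lib_strdup`/`lib_free` are hypothetical names standing in for a library boundary):

```shell
# Build and run a tiny C demo of the matching-deallocator rule (assumes cc)
cat > alloc_demo.c <<'EOF'
#include <stdlib.h>
#include <string.h>
#include <stdio.h>

/* Hypothetical library boundary: the library allocates with whatever
 * allocator IT was linked against... */
char *lib_strdup(const char *s) {
    char *p = malloc(strlen(s) + 1);
    if (p) strcpy(p, s);
    return p;
}

/* ...so it must also export the matching deallocator. Passing this pointer
 * to a DIFFERENT allocator's free() (e.g. jemalloc's free on a glibc-malloc'd
 * block) is undefined behavior and commonly corrupts the heap. */
void lib_free(void *p) { free(p); }

int main(void) {
    char *s = lib_strdup("hello");
    if (s) {
        puts(s);
        lib_free(s);   /* correct: same allocator that produced the block */
    }
    return 0;
}
EOF
CC=$(command -v cc || true)
if [ -n "$CC" ]; then
    "$CC" -O2 alloc_demo.c -o alloc_demo && ./alloc_demo   # prints "hello"
else
    echo "no C compiler available; skipping"
fi
```

This is why libraries that may cross allocator boundaries export their own `*_free` functions instead of telling callers to call `free()` directly.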
smarx007
There are also many flags that should be enabled by default for non-debug builds like ubsan, stack protection, see https://news.ycombinator.com/item?id=35758898
ryao
UBSAN is usually a debug build only thing. You can run it in production for some added safety, but it comes at a performance cost and theoretically, if you test all execution paths on a debug build and fix all complaints, there should be no benefit to running it in production.
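To make the cost/benefit concrete, here's a sketch of what UBSan catches at runtime (assumes gcc with the UBSan runtime installed; clang works the same way):

```shell
# Signed integer overflow is undefined behavior; UBSan flags it at runtime
cat > ub_demo.c <<'EOF'
#include <limits.h>
#include <stdio.h>
int main(void) {
    int x = INT_MAX;
    x += 1;              /* UB: signed overflow */
    printf("%d\n", x);
    return 0;
}
EOF
if command -v gcc >/dev/null 2>&1; then
    gcc -O2 -fsanitize=undefined ub_demo.c -o ub_demo
    # In the default (recovering) mode it prints a diagnostic like
    # "ub_demo.c:5:... runtime error: signed integer overflow" and keeps going
    ./ub_demo 2>&1 | grep 'runtime error' || true
else
    echo "gcc not installed; skipping"
fi
```

Add `-fno-sanitize-recover=all` to turn each diagnostic into an abort, which is what you'd want in a test suite.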
internetter
That is, if you are a believer in security via obscurity
frontfor
Are you arguing that ASLR is “security via obscurity”?
eru
Why? You could advertise publicly what your flags are.
jeffbee
This is generally true but specifically false. The builds described in the gist still link oniguruma dynamically. It is in another package, libonig5, that would be updated normally.
Onavo
What about PGO?
kstrauser
(Well, rebuilding them with a different allocator that benchmarks well on their specific workflow.)
loeg
Everything outperforms glibc malloc. It's essentially malpractice that distros continue to use it instead of mimalloc or jemalloc.
dan-robertson
Is it even known what workloads the glibc malloc is good for?
searealist
Using all your memory on multi-threaded workflows.
electromech
I'd be curious how the performance compares to this Rust jq clone:
cargo install --locked jaq
(you might also be able to add RUSTFLAGS="-C target-cpu=native" to enable optimizations for your specific CPU family)
"cargo install" is an underrated feature of Rust for exactly the kind of use case described in the article. Because it builds the tools from source, you can opt into platform-specific features/instructions that often aren't included in binaries built for compatibility with older CPUs. And no need to clone the repo or figure out how to build it; you get that for free.
jaq[1] and yq[2] are my go-to options anytime I'm using jq and need a quick and easy performance boost.
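If you want to see what that RUSTFLAGS setting actually enables on your machine, rustc can print the configuration it would compile with (a sketch, assuming rustc is installed):

```shell
# Features like "avx2"/"sse4.2" show up here when target-cpu=native turns them on
RUSTC=$(command -v rustc || true)
if [ -n "$RUSTC" ]; then
    rustc --print cfg -C target-cpu=native | grep target_feature | head -n 5
else
    echo "rustc not installed; skipping"
fi
```

Comparing this against plain `rustc --print cfg` shows exactly which extra instructions the native build gets to use.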
saghm
As a bonus that people might not be aware of, in the cases where you do want to use the repo directly (either because there isn't a published package or maybe you want the latest commit that hasn't been released), `cargo install` also has a `--git` flag that lets you specify a URL to a repo. I've used this a number of times in the past, especially as an easy way for me to quickly install personal stuff that I throw together and push to a repo without needing to put together any sort of release process or manually copy around binaries to personal machines and keep track of the exact commits I've used to build them.
oguz-ismail
> I'd be curious how the performance compares to this Rust jq clone
Every once in a while I test jaq against jq and gojq with my jq solution to AoC 2022 day 13 https://gist.github.com/oguz-ismail/8d0957dfeecc4f816ffee79d...
It's still behind both as of today
rubatuga
Chimera Linux defaults to using clang + mimalloc
Read more here: https://chimera-linux.org/about/
john-tells-all
If you actually do this, just get Ubuntu to download the exact source package you want. In this case:
apt-get source jq
Then go into the package and recompile to your heart's content. You can even repackage it for distribution or archiving. You'll get a result much closer to the upstream Ubuntu, as opposed to getting lots of weird errors and misalignments.
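The full round trip looks roughly like this (a sketch for Debian/Ubuntu; it assumes deb-src lines are enabled in your sources and that build tools plus jq's build dependencies, e.g. via `apt-get build-dep jq`, are installed):

```shell
# Fetch the packaged source, then rebuild it with extra compiler flags
DBP=$(command -v dpkg-buildpackage || true)
if [ -n "$DBP" ] && apt-get source jq 2>/dev/null; then
    cd jq-*/
    # dpkg-buildflags honors DEB_*_APPEND, so distro hardening flags are kept
    DEB_CFLAGS_APPEND="-O3 -march=native" dpkg-buildpackage -us -uc -b
    ls ../jq_*.deb
else
    echo "skipping: source-package tooling not available here"
fi
```

Installing the resulting .deb keeps the file layout identical to the stock package, so you can revert by reinstalling the archive version.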
cyounkins
Why is glibc malloc() not more performant? Are tcmalloc/mimalloc making tradeoffs that maintainers are unwilling to make in glibc?
atonse
I’m almost more amazed that someone figured out jq’s syntax and got some use out of it.
In all seriousness though, are you sure some of this isn’t those blocks being loaded into some kind of file system cache the second and third times?
How about if you rebooted and then ran the mimalloc version?
matt123456789
The benchmarking tool being used in this post accounts for that using multiple runs for each invocation, together with a warmup run that is not included in the result metric.
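The tool appears to be hyperfine (an assumption on my part, based on the warmup-run behavior described); the pattern looks like this:

```shell
# --warmup primes caches before timing; each command is then run many times
HF=$(command -v hyperfine || true)
if [ -n "$HF" ] && command -v jq >/dev/null 2>&1; then
    "$HF" --warmup 3 "jq -n '[range(100000)] | add'"
else
    echo "hyperfine (or jq) not installed; skipping"
fi
```

The warmup runs mean the first-touch filesystem cache effect is excluded from the reported statistics.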
atonse
Ah missed that, sorry.
behnamoh
jq has point-free programming. It's intuitive once you wrap your head around it.
See this: https://en.wikipedia.org/wiki/Tacit_programming#jq
Flimm
Thank you! That's helpful. I got the examples in that Wikipedia section to run using the `--null-input` flag:
$ jq --null-input '[1, 2] | add'
3
atonse
Yeah after this post I sought a couple youtube videos explaining it. It's starting to make a bit more sense now. But the lightbulb hasn't gone off just yet. Appreciate the link.
mmastrac
I wish I had seen this earlier. My mental model was close but not quite there to the point where I needed to think too hard about how to solve problems.
This is much more intuitive now.
zX41ZdbW
ClickHouse will be faster for processing large JSON data files. The example from the gist would look as follows:
ch -q "WITH arrayJoin(features) AS f SELECT f.properties.SitusCity WHERE f.properties.TotalNetValue < 193000 FROM 'data.json'"
Reference: https://jsonbench.com/
jeffbee
I'm willing to believe that will execute in less than 2 seconds, but it doesn't work as given.
rustc
Can you post a link to the JSON file?
Try this (the placement of FROM was incorrect):
ch "WITH arrayJoin(features) AS f SELECT f.properties.SitusCity FROM 'a.json' WHERE f.properties.TotalNetValue < 193000"
polskibus
This is Gentoo resurrected! Great memories of building an entire Linux from scratch at install time ;)
Engineering is a compromise. The article shows most gains come from specialising the memory allocator. The thing to remember is that some projects are multithreaded and allocate in one thread, use the data in another, and maybe deallocate in a third. The allocator needs to handle this, so a speedup for one project may be a crash in another.
Also, what about the reallocation strategy? Some programs preallocate and never touch malloc again; others constantly release and acquire. How well do they handle fragmentation? What is the uptime (10 seconds or 10 years)? Sometimes the choice of allocator is the difference between long-term stability and short-term speed.
I experimented with different allocators while developing a video editor that caches frames, testing with 4K video. At 32 MB per frame and 60 fps, that's almost 2 GB per second per track. You quickly hit allocator limitations and realise that the vanilla glibc allocator at least offers the best long-term stability. But for short-running benchmarks it's the slowest.
As already pointed out, engineering is a compromise.