Make Ubuntu packages 90% faster by rebuilding them
164 comments
March 18, 2025
smallstepforman
vlovich123
Mimalloc is a general-purpose allocator like JEMalloc / TCMalloc. Glibc is known to have a pretty bad allocator that modern allocators like MIMalloc and the latest TCMalloc (not the one available by default in Ubuntu) run laps around. While the speedup is of course variable, benchmarks generally show an across-the-board improvement (whether that matters for any given application is something else entirely). As for crashes: these are all general-purpose multi-threaded allocators and behave no differently from glibc (modulo bugs, which can exist equally in glibc).
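For anyone who wants to try this without rebuilding anything, an allocator can usually be swapped in per-process via LD_PRELOAD. A minimal sketch, assuming libmimalloc and jq are installed (the library path and package name vary by distro):

```shell
# Look up mimalloc in the linker cache; skip gracefully if it isn't installed
MIMALLOC=$(ldconfig -p 2>/dev/null | awk '/libmimalloc\.so/ {print $NF; exit}')
if [ -n "$MIMALLOC" ] && command -v jq >/dev/null 2>&1; then
    # Every malloc/free in the child process now goes through mimalloc
    LD_PRELOAD="$MIMALLOC" jq -n '[range(1000)] | add'   # prints 499500
else
    echo "mimalloc (or jq) not installed; skipping"
fi
```

This only affects the one process, so it's a cheap way to benchmark an allocator against your actual workload before committing to a rebuild.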
smallstepforman
Agree for most short-running apps. I updated my comment to reflect issues with apps that are constantly reallocating and run for longer than 60 seconds. But you are absolutely correct for most short-running apps; in 99% of cases replacing glibc is recommended. However, there is an app or two where only glibc's stability avoids triggering a pathological use case, and you have no choice. Hence why it's the default: there are fewer crashes in pathological cases, and the devs are exhausted from dealing with crash bugs that can be eliminated by using the slower allocator.
vlovich123
Looked at your updated post and it looks like you’re operating under wildly incorrect assumptions.
1. Fragmentation: MIMalloc and the newest TCMalloc definitely handle this better than glibc. This is well established in many many many benchmarks.
2. In terms of process lifetime, MIMalloc (Microsoft Cloud) and TCMalloc (Google Cloud) are designed to be run for massive long-lived services that continually allocate/deallocate over long periods of time. Indeed, they have much better system behavior in that allocating a bunch of objects & then freeing them actually ends up eventually releasing the memory back to the OS (something glibc does not do).
> However, there is an app or two where glibc stability doesnt trigger a pathological use cases, and you have no choice.
I’m going to challenge you to please produce an example with MIMalloc or the latest TCMalloc (or heck, even any real data point from some other popular allocator, versus vague anecdotes). This simply is not something these allocators suffer from, and it would be a major bug the projects would fix.
grandempire
That’s why it’s a bad idea to use one allocator for everything in existence. It’s terrible that everyone pays the cost of thread safety even for single-threaded applications, or even for multithreaded applications with disciplined resource management.
silisili
This is a common pain point.
Write in a language that makes sense for the project. Then people tell you that you should have used this other language, for reasons.
Use a compression algo that makes sense for your data. Then people will tell you why you are stupid and should have used this other algo.
My most recent memory of this was needing to compress specific long json strings to fit in Dynamo. I exhaustively tested every popular algo, and Brotli came out far ahead. But that didn't stop every passerby from telling me that zlib is better.
It's rather exhausting at times...
teitoklien
Gentoo Linux is essentially made specifically for people like this, to be able to optimize one's own Linux rig for one's specific use case.
After the initial setup it’s pretty simple and easy to use. I remember making a ton of friends in Matrix’s Gentoo Linux channel; fun times.
Fun fact: the initial ChromeOS was basically just a custom Gentoo Linux install. I’m not sure if they still use Gentoo Linux internally.
donio
> Gentoo Linux is essentially made specifically for people like this, to be able to optimize one's own Linux rig for one's specific use case.
That's true but worth noting that "optimize" here doesn't necessarily refer to performance.
I've been using Gentoo for 20 years and performance was never the reason. Gentoo is great if you know how you want things to work. Gentoo helps you get there.
kennysoona
If it wasn't for performance, what was gained in using it over something like Slackware and building only the packages you needed to?
xlii
A long time ago, when I was using it, I preferred Gentoo for its ergonomics and better visibility into the supply chain.
Slackware was very manual, and some bits were drowned in its low-level detail and long command chains. Gentoo felt easy but made dependencies visible, with the hard cost of compilation times attached.
Being a newb back then, I enjoyed the user friendliness combined with access to the machinery beneath. The satisfaction of a 1s boot time speedup, the result of 48h+ of compilation, was unparalleled, too ;)
6SixTy
Afaik the Gentoo based ChromeOS is being replaced by Android.
mycall
It would be interesting if Gentoo supported Fuchsia next.
yjftsjthsd-h
Gentoo already has prefix support for non-Linux systems and used to have at least some interest in a full Gentoo/kFreeBSD, so it's plausible.
surajrmal
What would that even mean?
eru
I used Gentoo for a while, but the temptation to endlessly fiddle with everything always led me to eventually break the system. (It's not Gentoo's fault, it's mine.)
Afterwards I moved to ArchLinux, and that has been mostly fine for me.
If you are using a fairly standard processor, then Gentoo shouldn't give you that much of an advantage?
ryao
Gentoo lets you do all of the tweaks mentioned here within the system package manager, so you still get security updates for your tweaked build. You can also install Gentoo on top of another system via Gentoo Prefix for use as a userland package manager:
eru
> Gentoo lets you do all of the tweaks mentioned here within the system package manager, so you still get security updates for your tweaked build.
Yes, Gentoo is great. I'm just saying that for me it was too much of a temptation.
serbuvlad
I'd like to bring attention to the ALHP repos[1].
These are the Arch packages built for x86-64-v2, x86-64-v3 and x86-64-v4, which are basically names for different sets of x86-64 extensions. Selecting the highest level supported by your processor should get you most of the way to -march=native, without the hassle of compiling it yourself.
It also enables -O3 and LTO for all packages.
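To pick the right ALHP repo, you can derive your CPU's microarchitecture level from its feature flags. A rough sketch that checks one representative flag per level rather than the full ISA list (on glibc 2.33+, `ld.so --help` also reports the supported levels authoritatively):

```shell
# Read the CPU flags once; each level implies the previous ones
flags=$(grep -m1 '^flags' /proc/cpuinfo 2>/dev/null || true)
level=1
case "$flags" in *sse4_2*)  level=2 ;; esac   # x86-64-v2: SSE3/SSSE3/SSE4.x/POPCNT...
case "$flags" in *avx2*)    level=3 ;; esac   # x86-64-v3: AVX, AVX2, BMI1/2, FMA...
case "$flags" in *avx512f*) level=4 ;; esac   # x86-64-v4: AVX-512 F/BW/CD/DQ/VL
echo "highest supported level: x86-64-v$level"
```

Note this only approximates the spec (each level is defined by a full set of extensions, not a single flag), but in practice the representative flags track the levels closely.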
eru
Nice, I'll try them out!
LTO is great, but I have my doubts about -O3 (vs the more conservative -O2).
UPDATE: bah, ALHP repos don't support the nvidia drivers. And I don't want to muck around with setting everything up again.
shanemhansen
So ChromeOS and also the OS for GKE are still basically built this way.
rgmerk
It's a while since I had to deal with this kind of thing, but my memory was that as soon as you go beyond the flags that the upstream developers use (just to be clear, I mean the upstream developers, not the distro packagers) you're buying yourself weird bugs and a whole lot of indifference if they occur.
I haven't used a non-libc malloc before but I suspect the same applies.
Brian_K_White
Two opposing things are both true at the same time.
If you as an individual avoid being at all different, then you are in the most company and will likely have the most success in the short term.
But it's also true that if we all do that then that leads to monoculture and monoculture is fragile and bad.
It's only because of people building code in different contexts (different platforms, compilers, options, libraries, etc...) that code ever becomes at all robust.
A bug that you mostly don't trigger because your platform or build flags just happens to walk just a hair left of the hole in the ground, was still a bug and the code is still better for discovering and fixing it.
We as individuals all benefit from code being generally robust instead of generally fragile.
taeric
I've been building my own emacs for a long time, and have yet to hit any weird bugs. I thought that as long as you avoid any unsafe optimizations, you should be fine? Granted, I also thought that -march=native was the main boost that I was seeing. This post indicates that is not necessarily the case.
I also suspect that any application using floats is more likely to have rough edges?
pertymcpert
Complex software usually has some undefined behavior lurking that at higher or even just different optimization levels can trigger the compiler to do unexpected things to the code. It happens all the time in my line of work. If there's an extensive test suite you can run to verify that it still works mostly as expected then it's easier.
notpushkin
On the other hand, if your optimization helps consistently across platforms, you could convince upstream developers to implement it directly. (Not necessarily across all platforms – a sizable performance gain on just a single arch might still be enough to tweak configuration for that particular build).
rlpb
Note that if you do this then you will opt out of any security updates, not just for jq but also for its regular expression dependency oniguruma. For example, there was a security update for oniguruma previously; if this sort of thing happens again, you'd be vulnerable, and jq is often used to parse untrusted JSON.
> * SECURITY UPDATE: Fix multiple invalid pointer dereference, out-of-bounds write memory corruption and stack buffer overflow.
(that one was for CVE-2017-9224, CVE-2017-9226, CVE-2017-9227, CVE-2017-9228 and CVE-2017-9229)
ryao
A userland package manager like Gentoo Prefix could be used to install a custom build of this and still get security updates.
rlpb
Indeed, there are many methods to have a custom build and still get security updates, including at least one method that is native to Ubuntu and doesn’t need any external tooling. However my warning refers to the method presented in the article, where this isn’t the case.
jmward01
But isn't there still the kernel of an idea here for a package management system that intelligently decides to build based on platform? Seems like a lot of performance to leave on the table.
taeric
I'm curious how applicable these are, in general? Feels like pointing out that using interior doors in your house misses out on the security afforded from a vault door. Not wrong, but there is also a reason every door in a bank is not a vault door.
That is, I don't want to devalue the CVE system; but it is also undeniable that there are major differences in impact between findings?
topspin
This is certainly true. Also, by replacing the allocator and changing compiler flags, you're possibly immunizing yourself from attacks that rely on some specific memory layout.
estebarb
By hardwiring the allocator you may end up with binaries that load two different allocators. It is no fun to debug a program that uses jemalloc's free to release memory allocated by glibc. Unless you know what you are doing, it is better to leave it as is.
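The failure mode here is allocate-with-one-allocator, free-with-the-other. A minimal sketch of the rule that avoids it (`lib_strdup`/`lib_free` are hypothetical names standing in for a library boundary):

```shell
# Build and run a tiny C demo of the matching-deallocator rule (assumes cc)
cat > alloc_demo.c <<'EOF'
#include <stdlib.h>
#include <string.h>
#include <stdio.h>

/* Hypothetical library boundary: the library allocates with whatever
 * allocator IT was linked against... */
char *lib_strdup(const char *s) {
    char *p = malloc(strlen(s) + 1);
    if (p) strcpy(p, s);
    return p;
}

/* ...so it must also export the matching deallocator. Passing this pointer
 * to a DIFFERENT allocator's free() (e.g. jemalloc's free on a glibc-malloc'd
 * block) is undefined behavior and commonly corrupts the heap. */
void lib_free(void *p) { free(p); }

int main(void) {
    char *s = lib_strdup("hello");
    if (s) {
        puts(s);
        lib_free(s);   /* correct: same allocator that produced the block */
    }
    return 0;
}
EOF
CC=$(command -v cc || true)
if [ -n "$CC" ]; then
    "$CC" -O2 alloc_demo.c -o alloc_demo && ./alloc_demo   # prints "hello"
else
    echo "no C compiler available; skipping"
fi
```

This is why libraries that may cross allocator boundaries export their own `*_free` functions instead of telling callers to call `free()` directly.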
smarx007
There are also many flags that should be enabled by default for non-debug builds like ubsan, stack protection, see https://news.ycombinator.com/item?id=35758898
ryao
UBSAN is usually a debug build only thing. You can run it in production for some added safety, but it comes at a performance cost and theoretically, if you test all execution paths on a debug build and fix all complaints, there should be no benefit to running it in production.
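To make the cost/benefit concrete, here's a sketch of what UBSan catches at runtime (assumes gcc with the UBSan runtime installed; clang works the same way):

```shell
# Signed integer overflow is undefined behavior; UBSan flags it at runtime
cat > ub_demo.c <<'EOF'
#include <limits.h>
#include <stdio.h>
int main(void) {
    int x = INT_MAX;
    x += 1;              /* UB: signed overflow */
    printf("%d\n", x);
    return 0;
}
EOF
if command -v gcc >/dev/null 2>&1; then
    gcc -O2 -fsanitize=undefined ub_demo.c -o ub_demo
    # In the default (recovering) mode it prints a diagnostic like
    # "ub_demo.c:5:... runtime error: signed integer overflow" and keeps going
    ./ub_demo 2>&1 | grep 'runtime error' || true
else
    echo "gcc not installed; skipping"
fi
```

Add `-fno-sanitize-recover=all` to turn each diagnostic into an abort, which is what you'd want in a test suite.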
internetter
That is, if you are a believer in security via obscurity
frontfor
Are you arguing that ASLR is “security via obscurity”?
eru
Why? You could advertise publicly what your flags are.
jeffbee
This is generally true but specifically false. The builds described in the gist still link oniguruma dynamically. It is in another package, libonig5, that would be updated normally.
Onavo
What about PGO?
kstrauser
(Well, rebuilding them with a different allocator that benchmarks well on their specific workflow.)
loeg
Everything outperforms glibc malloc. It's essentially malpractice that distros continue to use it instead of mimalloc or jemalloc.
dan-robertson
Is it even known what workloads the glibc malloc is good for?
searealist
Using all your memory on multi-threaded workflows.
electromech
I'd be curious how the performance compares to this Rust jq clone:
cargo install --locked jaq
(you might also be able to add RUSTFLAGS="-C target-cpu=native" to enable optimizations for your specific CPU family)
"cargo install" is an underrated feature of Rust for exactly the kind of use case described in the article. Because it builds the tools from source, you can opt into platform-specific features/instructions that often aren't included in binaries built for compatibility with older CPUs. And no need to clone the repo or figure out how to build it; you get that for free.
jaq[1] and yq[2] are my go-to options anytime I'm using jq and need a quick and easy performance boost.
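If you want to see what that RUSTFLAGS setting actually enables on your machine, rustc can print the configuration it would compile with (a sketch, assuming rustc is installed):

```shell
# Features like "avx2"/"sse4.2" show up here when target-cpu=native turns them on
RUSTC=$(command -v rustc || true)
if [ -n "$RUSTC" ]; then
    rustc --print cfg -C target-cpu=native | grep target_feature | head -n 5
else
    echo "rustc not installed; skipping"
fi
```

Comparing this against plain `rustc --print cfg` shows exactly which extra instructions the native build gets to use.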
saghm
As a bonus that people might not be aware of, in the cases where you do want to use the repo directly (either because there isn't a published package or maybe you want the latest commit that hasn't been released), `cargo install` also has a `--git` flag that lets you specify a URL to a repo. I've used this a number of times in the past, especially as an easy way for me to quickly install personal stuff that I throw together and push to a repo without needing to put together any sort of release process or manually copy around binaries to personal machines and keep track of the exact commits I've used to build them.
oguz-ismail
> I'd be curious how the performance compares to this Rust jq clone
Every once in a while I test jaq against jq and gojq with my jq solution to AoC 2022 day 13 https://gist.github.com/oguz-ismail/8d0957dfeecc4f816ffee79d...
It's still behind both as of today
rubatuga
Chimera Linux defaults to using clang + mimalloc
Read more here: https://chimera-linux.org/about/
john-tells-all
If you actually do this, just get Ubuntu to download the exact source package you want. In this case:
apt-get source jq
Then go into the package and recompile to your heart's content. You can even repackage it for distribution or archiving. You'll get a result much closer to the upstream Ubuntu, as opposed to getting lots of weird errors and misalignments.
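The full round trip looks roughly like this (a sketch for Debian/Ubuntu; it assumes deb-src lines are enabled in your sources and that build tools plus jq's build dependencies, e.g. via `apt-get build-dep jq`, are installed):

```shell
# Fetch the packaged source, then rebuild it with extra compiler flags
DBP=$(command -v dpkg-buildpackage || true)
if [ -n "$DBP" ] && apt-get source jq 2>/dev/null; then
    cd jq-*/
    # dpkg-buildflags honors DEB_*_APPEND, so distro hardening flags are kept
    DEB_CFLAGS_APPEND="-O3 -march=native" dpkg-buildpackage -us -uc -b
    ls ../jq_*.deb
else
    echo "skipping: source-package tooling not available here"
fi
```

Installing the resulting .deb keeps the file layout identical to the stock package, so you can revert by reinstalling the archive version.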
cyounkins
Why is glibc malloc() not more performant? Are tcmalloc/mimalloc making tradeoffs that maintainers are unwilling to make in glibc?
atonse
I’m almost more amazed that someone figured out jq’s syntax and got some use out of it.
In all seriousness though, are you sure some of this isn’t those blocks being loaded into some kind of file system cache the second and third times?
How about if you rebooted and then ran the mimalloc version?
matt123456789
The benchmarking tool being used in this post accounts for that using multiple runs for each invocation, together with a warmup run that is not included in the result metric.
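The tool appears to be hyperfine (an assumption on my part, based on the warmup-run behavior described); the pattern looks like this:

```shell
# --warmup primes caches before timing; each command is then run many times
HF=$(command -v hyperfine || true)
if [ -n "$HF" ] && command -v jq >/dev/null 2>&1; then
    "$HF" --warmup 3 "jq -n '[range(100000)] | add'"
else
    echo "hyperfine (or jq) not installed; skipping"
fi
```

The warmup runs mean the first-touch filesystem cache effect is excluded from the reported statistics.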
atonse
Ah missed that, sorry.
behnamoh
jq has point-free programming. It's intuitive once you wrap your head around it.
See this: https://en.wikipedia.org/wiki/Tacit_programming#jq
Flimm
Thank you! That's helpful. I got the examples in that Wikipedia section to run using the `--null-input` flag:
$ jq --null-input '[1, 2] | add'
3
atonse
Yeah after this post I sought a couple youtube videos explaining it. It's starting to make a bit more sense now. But the lightbulb hasn't gone off just yet. Appreciate the link.
mmastrac
I wish I had seen this earlier. My mental model was close but not quite there to the point where I needed to think too hard about how to solve problems.
This is much more intuitive now.
zX41ZdbW
ClickHouse will be faster for processing large JSON data files. The example from the gist would look as follows:
ch -q "WITH arrayJoin(features) AS f SELECT f.properties.SitusCity WHERE f.properties.TotalNetValue < 193000 FROM 'data.json'"
Reference: https://jsonbench.com/
jeffbee
I'm willing to believe that will execute in less than 2 seconds, but it doesn't work as given.
rustc
Can you post a link to the JSON file?
Try this (the placement of FROM was incorrect):
ch "WITH arrayJoin(features) AS f SELECT f.properties.SitusCity FROM 'a.json' WHERE f.properties.TotalNetValue < 193000"
polskibus
This is Gentoo resurrected! Great memories of building an entire Linux from scratch at install time ;)
Engineering is a compromise. The article shows most gains come from specialising the memory allocator. The thing to remember is that some projects are multithreaded and allocate in one thread, use the data in another, and maybe deallocate in a third. The allocator needs to handle this, so a speedup for one project may be a crash in another.
Also, what about the reallocation strategy? Some programs preallocate and never touch malloc again; others constantly release and acquire. How well do they handle fragmentation? What is the uptime (10 seconds or 10 years)? Sometimes the choice of allocator is the difference between long-term stability and short-term speed.
I experimented with different allocators while developing a video editor that caches frames, testing with 4K video. At 32 MB per frame and 60 fps, that's almost 2 GB per second per track. You quickly hit allocator limitations and realise that the vanilla glibc allocator at least offers the best long-term stability. But for short-running benchmarks it's the slowest.
As already pointed out, engineering is a compromise.