
Cutting down Rust compile times from 30 to 2 minutes with one thousand crates

wiseowise

> Given that we now fully utilize 128 threads or 64 cores for pretty much the entire compile time, we can do a back of the envelope calculation for how long it should take: 25 min / 128 = 12 sec (or maybe 24 sec since hyper-threads aren't real cores). Yet it takes 170s to compile everything.

Amdahl’s Law would like to have a word.

gz09

FWIW the article carefully worded it as a "back of the envelope" calculation, says we can't expect a linear speedup in practice and also gives the time it takes for linking (7 secs).

(Disclaimer: I am the author of the article and I am quite familiar with the law.)

stouset

This is an observed change of going from 1 core at 100% to 64 cores at 100%. This is establishing a lower bound, assuming there is no wasted work or contention for shared resources.
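Plugging the thread's own numbers into that bound (a rough sketch, treating only the 7 s mold link step mentioned in the article as serial):

    parallel work:  ~25 min of single-threaded compilation ≈ 1500 s
                    spread across 64 cores                 ≈ 23 s
    serial tail:    linking with mold                      ≈ 7 s
    Amdahl-style floor: 23 s + 7 s                         ≈ 30 s
    observed:                                               ~170 s

So the serial link step alone doesn't explain the gap; the rest has to come from contention or from work rustc can't parallelize.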

repstosb

Amdahl's Law, like most 20th-century performance "wisdom" and metrics, focuses excessively on instruction count, neglecting memory and I/O pressure. 64 cores doesn't mean 64 independent cache pyramids and memory buses. In real life, the difference between CPU cycle frequency and memory latency is so great that memory pressure primarily determines performance, whereas core count really only matters to the extent that it contributes to that memory pressure.

throwawaymaths

Amdahl's law is about coordination costs, so you would expect either cores to be starved or lots of extra coordination-related compute to be happening. That, I guess, is not totally crazy since there are that many crates, but as a first guess OP's back of the envelope is fine.

HippoBaro

Eminently pragmatic solution — I like it. In Rust, a crate is a compilation unit, and the compiler has limited parallelism opportunities, especially since rustc offloads much of the work to LLVM, which is largely single-threaded.

It’s not surprising they didn’t see a linear speedup from splitting into so many crates. The compiler now produces a large number of intermediate object files that must be read back and linked into the final binary. On top of that, rustc caches a significant amount of semantic information — lifetimes, trait resolutions, type inference — much of which now has to be recomputed for each crate, including dependencies. That introduces a lot of redundant work.

I also would expect this to hurt runtime performance as it likely reduces inlining opportunities (unless LTO is really good now?)

JJJollyjim

They mention that compiling one crate at a time (-j1) doesn't give the 7x slowdown, which rules out the object-file/caching-in-rustc theories... I think the only explanation is that the rustc processes are sharing limited L3 cache.

lsuresh

The L3 cache angle is one of our hypotheses too. But it doesn't seem like we can do much about it.

dathinab

The main issue here is:

- in Rust, one semantic compilation unit is one crate

- in C, one semantic compilation unit is one file

There are quite a few benefits to the Rust approach, but also drawbacks, like huge projects having to be split into multiple workspaces to maximize parallel building.

Oversimplified: the codegen-units setting tells the compiler into how many parts it is allowed to split a single semantic codegen unit (sketched below).

Now it still seems strange (as in, it looks like a performance bug) that most of the time Rust was stuck on just one thread (instead of e.g. 8).
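For reference, the knob being described is a Cargo profile setting; a minimal sketch showing the documented defaults (not something taken from the article):

    # Cargo.toml -- documented defaults, shown for reference
    [profile.release]
    codegen-units = 16    # release: fewer, larger units for better optimization

    [profile.dev]
    codegen-units = 256   # dev: many small units for faster, more parallel codegen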

lsuresh

> Now it still seems strange (as in, it looks like a performance bug) that most of the time Rust was stuck on just one thread (instead of e.g. 8).

Agreed, seems like there are some rustc performance bugs at play here.

steveklabnik

I haven't dug into the details, but it may not even be a performance bug, depending on how you define 'bug': the Rust compiler is not fully parallel itself yet. That's a bug in the sense of something that needs to be improved and fixed, but isn't one in the sense of "unexpected bad behavior".

lsuresh

Makes sense. We'd appreciate some more eyeballs here for sure. Between HN and a Reddit thread, there are a few hypotheses floating around. I've shared a repro here for anyone interested: https://github.com/feldera/feldera/issues/3882

dathinab

The thing is:

codegen-units defaults to 16 in release builds, and by far the most time in the "passes" list is spent in LLVM passes (which is what codegen-units parallelizes), so most of the time it shouldn't be stuck with one heavily loaded core (even if it isn't 16 all the time).

So it looks a lot like something is preventing the intended codegen parallelization of the crate.

Though it indeed might not have been a bug: before the generation change that split the output across crates, the source code might have been shaped in a way where the compiler couldn't split the crate into multiple units. Or maybe something made Rust believe splitting it was a bad idea, e.g. related to memory usage or similar.
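For anyone wanting to see that per-pass breakdown themselves, a sketch of the usual diagnostics (these flags are my assumption of what was used, not taken from the article; the -Z flag needs a nightly toolchain):

    # per-crate wall-clock times and core utilization over the whole build
    cargo build --release --timings

    # per-pass timing inside rustc itself (unstable, nightly only)
    RUSTFLAGS="-Ztime-passes" cargo +nightly build --release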

bryanlarsen

Rust has a great compromise between crate and file: module. I wonder why that's not the compilation unit?

dathinab

- cyclic dependencies

- some subtleties related to (proc-)macros

- better optimizations (potentially, not always, sometimes not at all)

- how generics and compilation units interact (reduces the benefit of making each module a compilation unit)

- a lot of unclarity about how Rust would develop in the future when this decision was made

Also, when people talk about Rust compiling slowly and splitting helping, most of the time it's about better caching of repeated builds (unrelated to the incremental build feature) and not the specific issue here. But there is definitely potential to make humongous single crates work better (e.g. instead of just 16/256 internal splits you could factor in the crate size, maybe add an attribute to hint codegen-unit splits, etc.), but so far no one has deemed it important enough to invest their time into fixing it. I mean, splitting crates is often easy, so you do it once and are good forever, or at least for a long time.

lmkg

Per a reddit comment, modules are allowed to have circular dependencies while crates are not.

colonial

I think it's due to the fact that (unlike crates) cyclic dependencies are allowed between modules without any extra ceremony (e.g. the forward declarations you'd need in C).
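A tiny illustration of what that means in practice (hypothetical module names, just to show the shape): inside one crate the two modules below can call into each other freely, while the same cycle between two crates would be rejected by cargo.

    // lib.rs of a single crate -- the module-level cycle is allowed
    mod parser {
        pub fn parse(input: &str) -> crate::ast::Node {
            crate::ast::Node { text: input.to_string() }
        }
    }

    mod ast {
        pub struct Node { pub text: String }

        impl Node {
            // calls back "up" into parser, closing the cycle
            pub fn reparse(&self) -> Node {
                crate::parser::parse(&self.text)
            }
        }
    }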

simfoo

Agreed, this is the underlying main issue. I've faced it before with generated C++ code too, and after long and painful refactorings, what ultimately helped the most was to just split the generated code into multiple compilation units to allow for parallel compilation. It comes with the drawback of potentially instantiating (and later throwing away) a lot more templates, though.

hu3

> We're using rustc v1.83, and despite having a 64-core machine with 128 threads, Rust barely puts any of them to work.

> That’s right — 1,106 crates! Sounds excessive? Maybe. But in the end this is what makes rustc much more effective.

> What used to take 30–45 minutes now compiles in under 3 minutes.

I wonder if this kind of trick can be implemented in rustc itself in a more automated fashion to benefit more projects.

pcwalton

> I wonder if this kind of trick can be implemented in rustc itself in a more automated fashion to benefit more projects.

It partially is, with codegen units. The problem is that you can't generally do that until codegen time, because of circular dependencies.

zwnow

1106 crates? Are they sure this is not a Javascript project?

faitswulff

They compile their customers' SQL to Rust code. Hence the preponderance of crates. It's a somewhat unique scenario.

lsuresh

That's correct. These aren't external dependencies but a dataflow graph being split into crates.

lsuresh

For any Rust compiler experts interested in taking a look, I've put together a short repro here: https://github.com/feldera/feldera/issues/3882

It will give you a workspace with a bunch of crates that seems to exercise some of the same bottlenecks the blog post described.

jmull

That's a cool project.

But I wonder if generating Rust is the best approach. On the plus side, you can take advantage of the rich type system and type checking the compiler has. On the other hand, you're stuck with that compiler.

I wonder if the dynamic constraints could be expressed and checked through some more directly implemented mechanism. It should be both simpler to express exactly the constraints you want (no need to translate them into a Rust construct that rustc will check as desired) and, of course, a lot more efficient. Feldera may have no feasible way to get away from generated Rust, but a potential competitor might avoid the issue. (That's not to say the runtime shouldn't/couldn't be implemented in Rust. I'm just talking about the large amounts of generated Rust.)

slashdev

I would probably have gone with generating C here. You don't need all the safety of the Rust compiler, you're the one generating the code. As you point out, you can check all the constraints you want before you compile it.

pjmlp

The right way, done by the likes of Oracle and SQL Server, is to JIT-compile their queries and stored procedures, with PGO data from the query analyser.

lsuresh

We think a JIT compiler is the right approach too. Will be a substantial effort though, so we're waiting to get a bit of bandwidth on that front.

Crisco

Are there any performance implications for the final binary because you’re splitting it up into thousands of crates?

epage

Loss of inlining

dgacmu

Loss of automatic inlining of non-generics without LTO, to be a little pedantic.

Functions marked #[inline] can still be inlined across crates.

LTO can inline across crates but, of course, at a substantial compile time cost.
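A minimal sketch of both escape hatches (the function is made up for illustration; the attribute and the profile key are the documented rustc/Cargo ones):

    // in some helper crate: without LTO this would not be inlined into
    // other crates, but #[inline] makes it eligible for cross-crate inlining
    #[inline]
    pub fn mix(a: u32, b: u32) -> u32 {
        a.saturating_add(b / 2)
    }

    # Cargo.toml of the final binary: re-enable cross-crate inlining at link
    # time, trading compile time for runtime performance
    [profile.release]
    lto = "thin"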

nixpulvis

I love Rust, but it seems like a really bad intermediate language if you have a compiler/transpiler which is sound. You don't need the type system from Rust telling you something's wrong.

That said, I could see how it would make writing the transpiler easier, so that's a win.

dietr1ch

> back of the envelope calculation for how long it should take: 25 min / 128 = 12 sec (or maybe 24 sec since hyper-threads aren't real cores). Yet it takes 170s to compile everything.

I'd aim for this linear speedup for compiling (minus the overhead of compiling a small crate), but the linking part won't be faster, maybe even slower. A slightly bigger envelope might tell you how much performance there is left to extract and the cost of using "too many" crates (and I'm not even sure it is too many; maybe your original crate was just too big to benefit from incremental compilation?).

lsuresh

Towards the end, the article says it takes 7s for linking using mold.

dietr1ch

Yeah, I read that part too late. In that case it seems that there's indeed a lot of overhead when building many crates from a cold-start, but it pays off in wall time and can probably save resources in incremental builds.

lsuresh

Yes. We found both cold and incremental builds sped up. The incremental builds were the main win -- small changes to the SQL can sometimes complete in seconds for what used to be a full recompilation.

UncleEntity

Yeah, I was working on a project to generate a Python C-API module from the SVG schema and found that when I generated one huge file (as opposed to one file per class), the compilation times were significantly faster. Or maybe it was when generating the C++ SVG library; I don't quite remember which one was super slow, as I did both at around the same time and the code changes between the two were minimal.

Looks like I settled on the slower 'one class per file' compilation method for whatever reason, probably because generating a 200k+ file didn't seem like such a good idea.

ay

I just went through this with a project of mine, though unfortunately the code wasn't autogenerated, so I needed to do a lot of mind-numbingly boring search-and-replace commands. I cobbled together a little utility that allowed me to automate the process somewhat.

Mostly throwaway code with heavy input from Claude, so the docs are in the code itself :-)

But in case anyone can find it useful:

https://github.com/ayourtch/tweak-code

pdimitar

Zero documentation? Do you expect potential users to figure it out by themselves?

ay

Thanks for the feedback!

The evidently misguided assumption was that whoever uses it will need to tweak it anyhow, so they might as well read it through. As I wrote, it's very close to throwaway code.

Anyway, I decided to experiment with Claude also writing a README. The result doesn't seem too terribly incorrect on first squint, and hopefully gives a slightly better impression of what the thing was attempting to do. (Disclaimer: I didn't test it much beyond my use case, so YMMV on whether it works at all.)

pdimitar

That's much better, thanks.

> The evidently misguided assumption was that whoever uses it will need to tweak it anyhow, so they might as well read it through. As I wrote, it's very close to throwaway code.

Even if that assumption is true for some of the potential users, they would appreciate a starting point, you know.

Now I have bookmarked it and will check it out at one point.

If you ever figure you want to invest some more effort into it: try making it into an LSP server so it can integrate with LSP Code Actions directly.

lesuorac

> Of course, we tried debug builds too. Those cut the time down to ~5 minutes — but they’re not usable in practice.

I wonder how true this is.

I haven't used Feldera, but for other Rust stuff I have, running it as debug causes serious performance problems. However, for testing I have it compile a few crates that do the real work, like `image`, as optimized (and the vast majority as debug), and that is enough to make the performance issues unnoticeable. So if the multi-crate approach hadn't worked, they could possibly have just compiled some of the stuff as optimized.
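For what it's worth, Cargo's per-package profile overrides do exactly that; a minimal sketch (using the `image` crate from the comment above):

    # Cargo.toml: keep fast unoptimized builds for your own code, but
    # optimize the heavy dependency that does the actual work
    [profile.dev.package.image]
    opt-level = 3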

1vuio0pswjnm7

The dependency costs seem self-evident, but perhaps it would be enlightening to see a comparison of the energy costs, as well as the time and storage costs, of compiling Rust programs versus their C equivalents. For example, such a comparison might reveal that there is no difference and no trade-off, or that any differences and trade-offs are small enough to be worth making in the interest of some higher purpose.