A tale of distros joining forces for a common goal: reproducible builds [video]
68 comments
February 8, 2025
anotherhue
Reproducibility changes everything, because nothing changes. We go from shamans chanting incantations over a blessed code base to a mathematical function with an algebra of system composition.
Let me give you the simplest example, when builds are reproducible you don't need package repositories, you need build caches.
All the problems with maintaining a repository (save bandwidth) evaporate.
okanat
> Let me give you the simplest example, when builds are reproducible you don't need package repositories, you need build caches.
The type of reproducibility is different here. What you mention is already possible via a stable compiler ABI, but one needs to keep the source code the same. Without a stable compiler ABI, you may or may not get matching output, depending on what the compiler does.
The goal of reproducible builds is removing sources of environment-dependent behavior at the build level instead of the compiler level. So given all the same dependencies and same build commands your binaries should match wherever and whenever you compile them. The distros and software developers also made a huge effort to remove any kind of environment-dependent commands.
Different distros still have differences in the build commands they issue and the set of dependencies they enable. The space to cache each individual possible output would be enormous and impractical. So you would still need repositories.
anotherhue
All true, but I can't not take this opportunity to shill NixOS which has meaningfully addressed many of these issues, and is indeed spending impractical amounts of money storing build outputs ($10k/m).
https://discourse.nixos.org/t/the-nixos-foundations-call-to-...
It is absolutely better to remove build entropy at the source code stage, but until all software is written that way there are a few build-environment tricks we can use along the way.
zelphirkalt
I highly doubt that we will get anywhere close to developers understanding and valuing reproducibility any time soon. Maybe in 20 years or so. Outside of corners like Nix and Guix, and maybe a few random people in discussions about package manager issues, I have not met anyone who knows how to achieve reproducibility or cares about it.
Meanwhile I enjoy setting up my own GNU Guile projects with a Makefile that sets up a reproducible Guix shell in which my project is run, so that I get the same result on various devices, only needing to issue a single, easily memorized or discoverable Makefile target. Most developers I've met don't know how to set something like this up. Provided no one messes with guix package manager commits and guix infrastructure still exists, my projects will run in 10 years just like they do today, with reproducible results. Neat.
Recently I have taken some time to get this working for OCaml as well. It took some asking on the mailing list, but it works now. No need to have anything installed prior to running the Makefile target other than Make and the Guix package manager.
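For illustration, the heart of such a Makefile target is a single pinned invocation; a minimal sketch, assuming Guix is installed (the commit hash, package, and file names are placeholders):

    # pin the guix revision, then run the project in a pure, isolated shell
    guix time-machine --commit=0123abcd -- \
      shell --pure guile -- guile main.scm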
kpcyrd
Arch Linux has .BUILDINFO files embedded in their .pkg.tar.zst files, which references files on https://archive.archlinux.org/.
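For example, the recorded build environment can be dumped straight to stdout (the package file name is just an example; recent GNU tar assumed):

    # print the .BUILDINFO embedded in an Arch package
    tar --zstd -xOf somepkg-1.0-1-x86_64.pkg.tar.zst .BUILDINFO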
Debian has .buildinfo files over at https://buildinfos.debian.net/ referencing files on https://snapshot.debian.org/.
This is not special to NixOS; documented and archived build environments are an essential part of accomplishing reproducible builds for an operating system.
transpute
Does Haskell (used by NixOS) have a fully reproducible build chain?
jimmaswell
> The space to cache each individual possible output would be enormous and impractical. So you would still need repositories.
I've been using Gentoo for the past few weeks and this was my thought. There are so many ways to compile packages depending on your needs. It is practical to make binary sets for a few common configurations that are good enough for most people for each distro, though.
(Installing and using Gentoo has been an incredible learning experience. You have to go into it with the desire to take long tangents filling the gaps in your knowledge re: shared libraries, kernel modules, your init system of choice, gcc, your windowing system of choice, bootloaders, bash, etc. I feel like I've made more of a quantum leap in my Linux skills the past month than I have in years, and it's been quite fun and rewarding.)
> Different distros still have differences in the build commands they issue and the set of dependencies they enable.
Would it even make sense in theory for this not to be the case? What is a Linux distro but a set of programs, libraries, and environment choices chosen to be run on the Linux kernel?
kbaker
The Yocto Project (from the embedded space) has reproducibility as one of its goals. Since everything is built from source and from scratch, even the build toolchains, it is not too big of a step.
Seems like Yocto would make the base for a good general purpose desktop distro (if there is not one out there already.)
rcxdude
Yocto was kinda doing the nix thing for a while before nix existed, but basically by slowly growing the capabilities in an ad-hoc fashion instead of working them out from first principles. It's resulted in a bit of a mess (it's an unholy mix of a custom functional-ish programming language grown out of a config file format, python, and bash, with a ludicrous capacity for action at a distance), but there's still not really anything else like it.
transpute
Yocto still has dependencies on a bootstrap build distro.
Stagex + Yocto would be fully reproducible from a small seed.
> Yocto would make the base for a good general purpose desktop distro
YP has been working on a public binary reference cache/distro.
XorNot
This doesn't work at all: a build cache is meaningless in this context, because the only thing you can do is rebuild the code to verify that what is in the cache is what comes from the code you expect... or bootstrap the whole compiler chain and then use cryptographic signatures to chain trust from a matching compiler hash up to all the dependent outputs.
It certainly doesn't change the nature of packaging in any real way in that case.
anotherhue
Others can rebuild the chain and the results can be compared. Without reproducibility there is no concept of comparison.
XorNot
Which is still a cryptographic trust system - i.e. who are the others who built it and are they suitably independent or working from similar sources?
BiteCode_dev
You mean you don't need dep resolution, signatures and moderation? How so?
anotherhue
Good questions:
1. If the deps are also themselves reproducible then you refer to a fixed point version of them and (at least for nix) the package manager works out the rest.
2. Signatures are a trust mechanism; if a cache feeds you bad data in response to a query then that's absolutely an issue, but since there can be multiple caches (or your own local spot-checks, see the sketch after this list) it becomes easier to detect if a cache is returning bad data. A hyper-targeted attack would still get you unless you decide to manually build certain packages, but that's no different than existing repos.
Manually building might sound impractical but it doesn't actually take that long, probably less than a day for a desktop environment, which might be acceptable in a high-trust environment if amortized. I should add that the process is fully automatic, it just takes longer than using a cache.
3. Moderation I don't have a good answer to, anyone can run an apt server, or publish a flake.nix file to a repo. Some would say it is censorship resistant.
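To illustrate the local spot-check from point 2, a minimal sketch for nix ('hello' is just a stand-in package):

    # fetch the output from a binary cache (or build it)
    nix-build '<nixpkgs>' -A hello
    # rebuild locally and fail if the result differs from the cached output
    nix-build '<nixpkgs>' -A hello --check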
LtWorf
He hasn't understood what reproducible builds are.
kazinator
But the same is true when you never reproduce builds (just build once and treat the result as the gold master). So that's a minor benefit.
Build reproducibility has other values, like the virtue of basic determinism, which answers the question of why we would want to reproduce a build at all.
If you /can't/ build the exact same thing twice, down to the bit, how do you know that, say, a compiler isn't doing something funny based on an uninitialized variable?
To check reproducible builds for regressions, we can't always be caching. Someone has to actually build everything at least twice (ideally more times) and verify that the results match. And this has to be done forever; if you don't build multiple times and check, you will not catch a regression in reproducibility.
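As a sketch of that check, assuming a make-based project whose outputs land in out/:

    # build twice from clean and fail on any bit-level difference
    make clean && make && sha256sum out/* > first.sha256
    make clean && make && sha256sum -c first.sha256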
blueflow
Do they evaporate? My impression was that they got shifted to the package recipes. Like Nix having to take the same considerations with channels.
kpcyrd
hello, I'm one of the speakers. I've been working on this since 2017, happy to answer any questions the hackernews crowd might have.
foobazfoo
Fantastic project. Thank you for all your efforts.
Regarding the reproducible bootstrapping problem, what is your project's policy on building from binary sources? For instance, Zig is written in zig and bootstraps from a binary wasm file which is translated to C: https://github.com/ziglang/zig/tree/master/stage1
Golang has an even more complicated bootstrapping procedure requiring to build each successive version of the compiler to get to the most recent version.
pabs3
See the Bootstrappable Builds community. They do not allow bootstraps that use pre-generated files (binary or otherwise), except for an MBR's worth of commented machine code in hex.
https://bootstrappable.org/ https://lwn.net/Articles/983340/
kpcyrd
Thanks! The kind of work I do is about making an existing operating system issue reproducible packages, to the point that you can install a system with reproducible-only packages. This assumes "trusted source code and compiler", but no more tampering by the build server, which is already quite the improvement from what we have right now.
To solve the need for trusted compilers (aka bootstrap from binary seeds) you're probably interested in https://bootstrappable.org/ and https://codeberg.org/stagex/stagex.
To solve the need for trusted source code there isn't really any solution besides "have people publicly document the source code they have read", like https://github.com/crev-dev/cargo-crev does. Often people ask "how do I know whose reviews to trust", but in reality there's a scarcity of reviews even if you're willing to trust literally anybody. There aren't really any incentives for people to make them, capitalism is failing us on that front and big companies don't want to publicly talk about the source code they have and haven't read either.
algo_trader
looking at [1][2], are the slides available online? i am on chrome, getting just static html
[1] https://salsa.debian.org/reproducible-builds/reproducible-pr... [2] https://fosdem.org/2025/schedule/event/fosdem-2025-6479-a-ta...
jmclnx
It is very cool to see distros working together for a common goal.
But I still do not understand the point of "reproducible builds". I know what they are, but to me the amount of work involved outweighs the benefit.
I even heard NetBSD is also working on "reproducible builds". So maybe I am missing something :)
noirscape
Practically speaking, the idea with a reproducible build is that you can take the source files they used, run their instructions the same way they did and get hash for hash[0] the specific resulting executable.
The main benefit is that you can trust that the resulting binary file being served matches the source code that it's built from. This mostly matters for distros in that they build from a source package repository, but anyone running a mirror could hypothetically replace the package with another (potentially malicious) package, leading users to install malicious tooling. It mainly matters for distros because pretty much every distro out there runs on third-party mirrors (often run by universities, but also just people who want to help) rather than on direct upstream; packages get uploaded to a main server, then mirrors copy from that main server (to reduce network traffic load on it). Right now, mirror trust is mostly "we assume you're not gonna be evil, until we get complaints". If the build is reproducible, the software can inherently confirm that the file it's getting is trustworthy, making "getting complaints" much easier to confirm.
It can also speed up the overall building process; if the package source code hasn't changed, you can also always assume that the resulting binary hasn't changed (meaning you can use hashes instead of relying on mtime like make does). Docker build cache works in a somewhat similar way (although docker isn't inherently deterministic).
Devwise, you can also reconstruct a build much easier if it's reproducible; ie. if you've accidentally thrown away the .elf file for debugging, if your build is deterministic, you can just rerun the build and get the same .elf file again.
[0]: While not a problem for Linux distros, in cases where you need a secret to sign an application, reproducible typically means "identical except for the signature" instead. F-Droid uses this for example to figure out if they should use buildserver stuff or the original APKs: https://f-droid.org/docs/Reproducible_Builds/
yellow_lead
> a mirror could hypothetically replace the package with another (potentially malicious) package, leading users to install malicious tooling.
It was my assumption that a mirror is required to host a build that has a hash conforming to the original. Is that not the case?
jerf
Yes, the real attack isn't that mirrors change the files, the real attack is that just because a distro packages Binary X and Source X, it is difficult without reproducible builds to prove that Source X actually did produce Binary X. It could have been compiled with a trojan in it between the source and binary.
gruez
More specifically the packages are signed by the distro and automatically checked, so a mirror can't go rogue even if it wanted to.
patmorgan23
>but anyone running a mirror could hypothetically replace the package with another (potentially malicious) package, leading users to install malicious tooling.
I thought all packages were cryptographically signed, and that the package manager would compare the hashes of artifacts downloaded from mirrors to the hashes listed in the package index (which is also signed). This is not an attack that needs reproducible builds to mitigate.
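Roughly, what the package manager automates looks like this done by hand (file names illustrative):

    # the index is gpg-signed; every package's hash is listed in it
    gpg --verify InRelease
    sha256sum foo_1.0_amd64.deb   # must match the entry in the signed index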
david-gpu
It's a safety measure. Reproducible builds ensure identical binaries are produced from the same source. They help detect e.g. hidden backdoors.
jcranmer
The main benefit you'll hear touted is something along the lines of being able to get an attestation that the resulting artifact was built following the steps claimed to build it. I think that's a somewhat overstated benefit, though, as it's not clear to me that this is an avenue of attack used in practice, given the frequency with which software already has vulnerabilities usable for exploits, or the ease with which one can insert a backdoor into the source code (e.g., the xz backdoor).
I think the actual main utility is that the process has done a very good job of rooting out several causes of unintentional nondeterminism in the build process. I say unintentional because the two main causes of unreproducibility, by several orders of magnitude, are timestamps being embedded everywhere and absolute paths being embedded everywhere, and those are rather expected. But some of the unreproducibility comes from things like accidental reliance on inodes in file paths (i.e., doing "for file in listdir()" without sorting the results of listdir) or the compiler itself accidentally sorting based on pointer address (which is unreproducible on ASLR systems).
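For a concrete taste, both big offenders are usually tamed along these lines (a hedged sketch using GNU tar; SOURCE_DATE_EPOCH is the convention from reproducible-builds.org):

    # clamp timestamps and force a stable file order when archiving
    export SOURCE_DATE_EPOCH=$(git log -1 --pretty=%ct)
    tar --sort=name --mtime="@${SOURCE_DATE_EPOCH}" \
        --owner=0 --group=0 --numeric-owner -cf out.tar build/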
uecker
The xz backdoor was news because they went to a lot of effort to try to hide a backdoor in the source (actually a binary file in the source) and still failed. In contrast, without reproducible builds it is trivial for a maintainer with upload rights (or somebody who managed to get the credentials from a maintainer) to insert a backdoor into a binary. And it is then virtually impossible to detect.
rcxdude
Well, the xz backdoor was detected through the behaviour of the resulting binaries, not through observation of the source code tampering, so I don't think it's a great example.
3s
A really important application of reproducible builds is running code inside Secure Enclaves that has been committed to on a public transparency log. A client can connect to a remote secure enclave that can then prove to the client that it’s running the commit code via a process known as remote attestation. It’s pretty cool stuff. However it’s only possible if the build inside the enclave is reproducible (deterministic) and always identical to the build on the transparency log
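As a rough sketch of that flow (names illustrative; real deployments use e.g. SGX or SEV attestation documents):

    # reproduce the enclave image locally and hash it
    sha256sum enclave-app.bin
    # the client then checks that the same measurement appears both in the
    # enclave's signed attestation report and on the transparency log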
champtar
With reproducible builds you know that what you test on your dev laptop is the same as what will come out of your CI, and if the hashes mismatch you can chase down why. For a concrete example, the Mellanox driver's configure script will auto-detect whether it's running under docker and change a compile flag, so if you build in a container using podman you get a different result.
solarkraft
> but to me the amount of work involved outweighs the benefit
I don’t know whether I’d spend this much work on such an abstract goal, but what reproducibility changes really is quite amazing. It vastly increases trust in published binaries and obviates the need for signing and the security benefit of compiling software yourself.
gruez
>obviates the need for signing and the security benefit of compiling software yourself
Not really. Most people still would rely on signatures because they can't be expected to compile everything from scratch just to verify their download is authentic. Moreover even though reproducible builds make verification easier, it still requires someone to sound the alarm. For less popular packages there might be nobody checking any particular build is backdoored, because most people see "reproducible builds" and they assume Somebody Else is doing the reproduction.
mjl-
for transparency of reproducible builds of go applications, i made https://beta.gobuilds.org/. it compiles any publicly available go application on-demand, with a toolchain version of your choice (latest stable by default), for a platform of your choice. all (pure) go applications are reproducible by default, including when cross-compiled, and go toolchains run nothing provided by the go module (awesome properties!). the source code is verified through the go sum database (a transparency log containing go modules). the hash of the resulting binary is added to gobuild's own transparency log, so it can be publicly verified.
the gobuilds service builds the binary itself, and has another instance (on a different platform & config) build the binary too, to ensure the binary is really reproducible (i'd like other instances that i don't run myself as secondaries too).
i no longer publish binaries for my applications (that i write in go). i just point to the "latest"-build link for the go module at gobuilds. also makes it easy for users (including myself) to get new builds for new go toolchains (which may include fixes to the (relatively large, and often used) standard library).
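for reference, the kind of build this relies on is just the following (module path and platform are placeholders):

    # pure-go, path-independent build: same toolchain + same source
    # gives the same bits on any machine
    CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build -trimpath -o app .
    sha256sum app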
you still may not trust the public gobuilds instance. my hope is that people (eg software projects themselves, or distros, or other kinds of communities) will run & use their own gobuild instances and verify their builds against the public gobuilds service. win-win: gives them assurance their builds are really reproducible, and builds trust in the public gobuilds (keeping it honest, if someone sees a hash mismatch, they will speak up).
i usually don't get much enthusiasm for it though. (:
ssivark
What makes you so confident that the benefit is less than the effort?
Given the increasing likelihood of supply chain attacks, isn’t this a very prudent precaution?
samsartor
The video gets into that. The main purpose is to verify that the binary you're running came from the actual source code.
lowkey
Original link didn't work for me. Here is the source https://fosdem.org/2025/schedule/event/fosdem-2025-6479-a-ta...
pabs3
I'm looking forward to more distros adopting Bootstrappable Builds, so far I think only Guix has.
https://bootstrappable.org/ https://lwn.net/Articles/983340/
pelasaco
I think that for a distro like Talos Linux, with only 12 binaries, it will be much easier to accomplish.
GuestFAUniverse
I would prefer a "common core Linux" where most backports of stable distributions happen. Instead of each distribution fixing the same bugs in (slightly) different versions. What a waste of resources.
pabs3
I wonder how common reproducible builds are outside of the distro bubble. I guess PyPI isn't looking at it yet for example.
kpcyrd
I use repro-env to implement reproducible builds for my Github binaries and custom apt repository[1]:
- https://github.com/spytrap-org/spytrap-adb/releases/tag/v0.3.3
- https://github.com/kpcyrd/sh4d0wup/releases/tag/v0.10.0
- https://github.com/kpcyrd/rshijack/releases/tag/v0.5.2
- https://github.com/kpcyrd/archlinux-userland-fs-cmp/releases/tag/v0.1.0
- https://github.com/kpcyrd/repro-env/releases/tag/v0.4.1
[1]: https://github.com/kpcyrd/apt-vulns-xyz?tab=readme-ov-file#r...
Foxboron
[flagged]