Debian bookworm live images now reproducible
208 comments
March 26, 2025 · jcmfernandes
Joel_Mckay
The Debian group is admirable and has positively changed the standards for OS design several times. Reminds me I should donate to their coffee fund around tax time =3
alfiedotwtf
Exactly!
I’ve said it many times and I’ll repeat it here: Debian will be one of the few Linux distros we have right now that will still exist 100 years from now.
Yeah, it’s not as modern as the likes of Arch in terms of versioning and risk, but that’s also a feature!
roenxi
> Debian will be one of the few Linux distros we have right now, that will still exist 100 years from now.
It'd certainly be nice, but if you've ever seen an organisation unravel, you know it can happen with startling speed. I think the naive estimate is that if you pick something at random, it is half-way through its lifespan; so there isn't much call yet to say Debian will make it to 100.
vbezhenar
I feel safer using Arch than Debian. Debian adds so many of its own patches on top of the original software that the result hardly resembles the original. Arch almost always ships the original code. And I trust the original developers much more than the Debian maintainers.
walrus01
It's quite easy to run Debian unstable (sid) if you want a more risky approach to having the newest of every package.
toasteros
I always want to donate more to open source projects but as far as I know there aren't any I can get tax credits for in Canada. My budget is strapped just enough that I can't quite afford to donate for nothing.
Any Canadian residents here know of any tax credit eligible software projects to donate to?
Joel_Mckay
Depends on where you live, work, and invest. Still, I would recommend chatting with a local accountant to be sure whether a significant contribution to a donee qualifies as deductible. Note that most large universities will be registered in both the US and Canada.
https://www.canada.ca/content/dam/cra-arc/formspubs/pub/p113...
Best regards, =3
imcritic
I don't get how someone achieves reproducibility of builds: what about file metadata like creation/modification timestamps? Do they forge them? Or is that data treated as not important enough (i.e., should 2 files with different metadata but identical contents have the same checksum when hashed)?
jzb
Debian uses a tool called `strip-nondeterminism` to help with this in part: https://salsa.debian.org/reproducible-builds/strip-nondeterm...
There's lots of info on the Debian site about their reproducibility efforts, and there's a story from 2024's DebConf that may be of interest: https://lwn.net/Articles/985739/
frakkingcylons
I see this is written in Perl. Is that the case with most Debian tooling?
lamby
One of the authors of strip-nondeterminism here. The primary reason it's written in Perl is that strip-nondeterminism is used when building 99.9% of all Debian packages, so using any other language would have essentially made that language's runtime a dependency for building all Debian packages. (Perl is already required by the build process, whilst Python is not.)
fooker
It’s helpful to think of Perl as a superior bash, rather than a worse python, when it comes to scripting.
londons_explore
Packaging and making build scripts is perhaps one of the most unrewarding tasks out there. As an open source project where most work is done for free, Debian can't afford to be prescriptive about what languages are used for this sort of task.
johnisgood
I checked the code. Perl is suitable for these kinds of tasks.
dannyobrien
some, but not all. There's a bunch of historical code which means that Perl is in the base install, but modern tooling has a lot of Python too, as well as POSIX shell (not bash).
jeltz
Last time I checked a lot was also written in Python.
o11c
Timestamps are the easiest part: you just set everything according to the chosen epoch.
The hard things involve things like unstable hash orderings, non-sorted filesystem listing, parallel execution, address-space randomization, ...
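Two of those failure modes, unstable hash orderings and non-sorted filesystem listings, are easy to demonstrate in a few lines of Python (an illustrative sketch, not Debian tooling):

```python
import os

# Unstable hash ordering: CPython randomizes string hashes per process
# (unless PYTHONHASHSEED is pinned), so iterating a set of strings can
# produce a different order on every run. A build script that writes
# a manifest this way is not reproducible.
libs = {"libfoo.so", "libbar.so", "libbaz.so"}
print(list(libs))    # order may differ between interpreter runs
print(sorted(libs))  # fix: impose an explicit, stable order

# Non-sorted filesystem listing: os.listdir() returns entries in
# whatever order the filesystem happens to provide, which varies
# between machines and filesystems.
print(sorted(os.listdir(".")))  # fix: sort before using
```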
koolba
ASLR shouldn’t be an issue unless you intend to capture the entire memory state of the application. It’s an intermediate representation in memory, not an output of any given step of a build.
Annoying edge cases come up for things like internal object serialization to sort things like JSON keys in config files.
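For the JSON-keys case, a minimal Python illustration of the usual fix, sorting keys so the serialized bytes no longer depend on any internal ordering:

```python
import json

# If key order comes from a hash table (or from nondeterministic
# insertion order), two builds can emit equivalent but byte-different
# JSON. Sorting the keys makes the byte output deterministic.
config = {"beta": 2, "alpha": 1}
print(json.dumps(config, sort_keys=True, separators=(",", ":")))
# -> {"alpha":1,"beta":2}  (same bytes on every run)
```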
kazinator
ASLR means that the pointers from malloc (which may come from mmap) are not predictable.
Sometimes programs have hash tables which use object identity as key (i.e. pointer).
ASLR can cause corresponding objects in different runs of the program to have different pointers, and be ordered differently in an identity hash table.
A program producing some output which depends on this is not necessarily a bug, but becomes a reproducibility issue.
E.g., a compiler might output some object in which a symbol table is ordered by a pointer hash. The difference in order doesn't change the meaning/validity of the object file, but it is seen as the build not having reproduced exactly.
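An analogous effect is easy to demonstrate in CPython, whose default object hash is derived from the object's address; this sketch (the Symbol class is hypothetical) mirrors the pointer-keyed symbol table described above:

```python
class Symbol:
    """Hypothetical symbol object. No __hash__/__eq__ override, so
    hashing falls back to identity, i.e. the object's address."""
    def __init__(self, name):
        self.name = name

# With address randomization, the iteration order of this set can
# differ from run to run -- the Python analogue of a pointer-keyed
# hash table under ASLR.
table = {Symbol("main"), Symbol("init"), Symbol("fini")}
print([s.name for s in table])        # run-dependent order
print(sorted(s.name for s in table))  # fix: order by value, not identity
```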
cperciva
FreeBSD tripped over an issue recently where a C++ program (I think clang?) used a collection of pointers and output values in an order based on the pointers rather than the values they pointed to.
ASLR by itself shouldn't cause reproducibility issues, but it can certainly expose bugs.
sodality2
Let’s say a compiler is doing something in a multi-threaded manner: isn’t it possible that ASLR would affect the ordering of certain events, which could change the compiled output? Sure, you could just set threads to 1, but there are probably more edge cases in there I haven’t thought of.
purkka
Generally, yes: https://reproducible-builds.org/docs/timestamps/
Since the build is reproducible, it should not matter when it was built. If you want to trace a build back to its source, there are much better ways than a timestamp.
ryandrake
C compilers offer __DATE__ and __TIME__ macros, which expand to string constants that describe the date and time that the preprocessor was invoked. Any code using these would have different strings each time it was built, and would need to be modified. I can't think of a good reason for them to be used in an actual production program, but for whatever reason, they exist.
mananaysiempre
And that’s why GCC (among others) accepts SOURCE_DATE_EPOCH from the environment, and also has -Wdate-time. As for using __DATE__ or __TIME__ in code, I suspect that was more helpful in the age before ubiquitous source control and build IDs.
repiret
> I can't think of a good reason for them
I work on a product whose user interface in one place says something like “Copyright 2004-2025”. The second year there is generated from __DATE__, so nobody has to do anything to keep it up to date.
fmbb
Toolchains for reproducible software likely let you set these values, or ensure they are 1970-01-01 00:00:00
rtpg
It's super nice to have timestamps as a quick way to know what program you're looking at.
Sticking it into --version output is helpful to know if, for example, the Python binary you're looking at is actually the one you just built rather than something shadowing it.
paulddraper
> Do they forge them?
Yes. All archive entries, date source-code macros, and any other timestamps are set to a standardized date (in the past).
lamby
This is not quite right. At least in Debian, only files that are newer than some standardised date are set to that standardised date. This "clamping" preserves the metadata of older files.
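A minimal sketch of that clamping rule in Python, assuming SOURCE_DATE_EPOCH is set in the environment as it is during Debian builds:

```python
import os

# The standardised cutoff; in Debian this is typically derived from
# the latest debian/changelog entry. Assumed present in the environment.
epoch = int(os.environ["SOURCE_DATE_EPOCH"])

def clamp_mtime(path):
    st = os.stat(path)
    if st.st_mtime > epoch:
        # Newer than the cutoff: pull it back to the cutoff.
        os.utime(path, (epoch, epoch))
    # Files already older than the cutoff keep their timestamps,
    # preserving potentially meaningful metadata.
```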
echoangle
Maybe a dumb question, but why would this change the reproducibility? If you clone a git repo, do you not get the metadata as it is stored in git? Or would the files have the modification date of the cloning?
I never actually checked that.
mathfailure
You clone the source from git, but then you use it to build some artifacts. The artifacts' build times may differ, yet with reproducible builds the artifacts should match.
echoangle
Right, but if you only clone and build, why would the files' modification dates differ from the version that was committed to git? Does just cloning a repo already lead to different file modification dates in my local copy?
HideousKojima
Those aren't needed to generate a hash of a file. And that metadata isn't part of the file itself (or at least doesn't need to be); it's part of the filesystem or OS.
imcritic
That's an acceptable answer for the simple case where you distribute just a file, but what if your distribution is something more complex, like an archive with some sub-archives? Metadata in the internal files will affect the checksum of the resulting archive.
londons_explore
Finding and fixing cases like this is part of what the project has done...
exe34
unless you fix them to a known epoch.
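As a concrete illustration of what "fixing them to a known epoch" looks like for archives, here is a Python sketch (an assumed helper, not Debian's actual tooling) that builds a tar whose bytes don't depend on filesystem iteration order or the builder's metadata:

```python
import os
import tarfile

EPOCH = 0  # assumed fixed timestamp; real tools take SOURCE_DATE_EPOCH

def deterministic_tar(src_dir, out_path):
    """Pack src_dir with normalized, build-independent metadata."""
    with tarfile.open(out_path, "w") as tar:
        for root, dirs, files in os.walk(src_dir):
            dirs.sort()  # walk directories in a stable order
            for name in sorted(files):
                path = os.path.join(root, name)
                info = tar.gettarinfo(
                    path, arcname=os.path.relpath(path, src_dir))
                info.mtime = EPOCH            # fixed timestamp
                info.uid = info.gid = 0       # drop builder's uid/gid
                info.uname = info.gname = ""
                with open(path, "rb") as f:
                    tar.addfile(info, f)
```

For nested sub-archives, the same normalization has to be applied recursively to each inner archive before the outer one is packed, which is part of why dedicated tools exist for this.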
kroeckx
It's my understanding that this is about generating the .iso file from the .deb files, not about generating the .deb files from source. Generating a .deb from source in a reproducible way is still a work in progress.
abdullahkhalids
Is the build infrastructure for Debian also reproducible? It seems like if someone wants to inject malware into Debian package binaries (without injecting it into the source), they have to target the build infrastructure (compilers, linkers, and whatever wrapper code is written around them).
Also, is someone else also compiling these images, so we have evidence that the Debian compiling servers were not compromised?
jzb
There's a page that includes reproducibility results for Debian here: https://tests.reproducible-builds.org/debian/bookworm/index_...
I think there's also a similar thing for the images, but I might be wrong and I definitely don't have the link handy at the moment.
There's lots of documentation about all of the things on Debian's site at the links in the brief. And LWN also had a story last year about Holger Levsen's talk on the topic from DebConf: https://lwn.net/Articles/985739/
goodpoint
The whole point of reproducible builds is to ensure security even if buildbots are compromised.
layer8
And what about the hardware on which the build runs? Is it reproducible? ;)
kragen
Working on it! But in general the answer is that for most purposes it's good enough to show that many independently produced pieces of hardware can reproduce the same results.
abdullahkhalids
You are joking. But solving this problem is probably among the most important challenges of the information age we live in.
Every country in the world should have the capability of producing "good enough" hardware.
ratmice
And who trusting trusted the original RepRap?
orblivion
The 50th generation builds a robot that murders you
TacticalCoder
> And what about the hardware on which the build runs? Is it reproducible? ;)
"Fully Countering Trusting Trust through Diverse Double-Compiling (DDC) - Countering Trojan Horse attacks on Compilers"
https://dwheeler.com/trusting-trust/
If the build is reproducible inside VMs, then the build can be done on different architectures: say x86 and ARM. If we end up with the same live image, then we're talking about something entirely different: either both x86 and ARM are backdoored the same way, or the attack is in software. Or there's no backdoor (a possibility we have to entertain too).
nikisweeting
well little johnny, when one hardware loves another hardware very much...
paulddraper
A la xz.
You must ultimately root trust in some set of binaries and any hardware that you use.
XorNot
For user space? No, you can definitely do a stage 0 build which depends only on about 364 bytes of x86_64 binary (though ironically I haven't managed to get this to work for me yet).
The liability is EFI underneath that, and the Intel ring -1 stuff (which we should be mandating is open source).
paulddraper
> which depends only on about 364 bytes of x86_64 binary
geocrasher
What is the significance of a reproducible build, and how is it different than a normal distribution?
csense
Reproducible: If Alice and Bob both download and compile the same source code, Alice's binary is byte-for-byte identical to Bob's binary.
Normal: Before Debian's initiative to handle this problem, most people didn't think hard about all the ways system-specific differences might wind up in binaries. For example: __DATE__ and __TIME__ macros in C; parallel builds finishing in a different order; anything that produces a tar file (or zip, etc.), which usually asks the OS for the input files' modification times by default and puts them into the bytes of the archive; and filesystems listing files in a directory in different orders, which may also get preserved in tar/zip files or other places...
Why it's important: With reproducible builds, anyone can check the official binaries of Debian match the source code. This means going forward, any bad actors who want to sneak backdoors or other malware into Debian will have to find a way to put it in the source code, where it will be easier for people to spot.
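In practice, "anyone can check" boils down to comparing checksums of independently produced artifacts; a minimal Python sketch (the file names here are hypothetical):

```python
import hashlib

def sha256_of(path):
    """Stream a file through SHA-256 and return the hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

# Hypothetical paths: the official binary vs. one you rebuilt from source.
official = sha256_of("debian-live-official.iso")
rebuilt = sha256_of("debian-live-rebuilt.iso")
print("reproduced OK" if official == rebuilt else "MISMATCH: investigate")
```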
sirsinsalot
The important property, that anyone can verify the untainted relationship between the binary and the source (providing we do the same for the toolchains, not relying on a blessed binary at any point), is only useful if people actually verify outside the Debian sphere.
I hope they promote tools to enable easy verification on systems external to Debian build machines.
walrus01
As the 'xz' backdoor was in the source code, and remained there for a while before anyone spotted it, this doesn't necessarily guarantee that backdoors/malware won't make their way into the source of a very widely redistributed project.
badsectoracula
Source code availability doesn't mean that backdoors won't be put in place; it just makes them relatively easier to spot and remove. Reproducible builds mean that the people who look for backdoors, malware, etc. can focus on the source code instead of the binaries.
jkaplowitz
Certainly true. But removing some attack vectors still helps security and trustworthiness. These are not all or nothing questions.
jeltz
Only part of the backdoor was in the source code. It was split between the release tarball and the committed code to hide it better. But, yes, with reproducible builds they could have put all of it in the source.
floxy
> __DATE__ and __TIME__ macros in C
So how do those work in these Debian reproducible builds? Do they outlaw those directives? Or do they set those based on something other than the current date and time? Or something else?
progval
The toolchain (e.g., the compiler) reads the time from an environment variable, if present, instead of the actual time. https://reproducible-builds.org/docs/source-date-epoch/
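The convention described in that spec is easy to follow in any build tool; a minimal Python sketch of the lookup:

```python
import os
import time

# Prefer SOURCE_DATE_EPOCH (seconds since the Unix epoch) over the
# wall clock, so every rebuild embeds the same "build time".
build_time = int(os.environ.get("SOURCE_DATE_EPOCH", time.time()))
print(time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime(build_time)))
```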
flkenosad
Thank you for that fantastic explanation.
orblivion
Open source means "you can see the code for what you run". Except... how do you know that your executables were actually built from that code? You either trust your distro, or you build it yourself, which can be a hassle.
Now that the build is reproducible, you don't need to trust your distro alone. It's always exactly the same binary, which means it'll have one correct sha256sum. You can have 10 other trusted entities build the same binary with the same code and publish a signature of that sha256sum, confirming they got the same thing. You can check all ten of those. The likelihood that 10 different entities are colluding to lie to you is a lot lower than just your distro lying to you.
jrockway
Reproducible builds actually solve a lot of problems. (Whether these are real problems, who really knows, but people spend a lot of money to solve them.)
At my last job, some team spent forever making our software build in a special federal government build cluster for federal government customers. (Apparently a requirement for everything now? I didn't go to those meetings.) They couldn't just pull our Docker images from Docker Hub; the container had to be assembled on their infrastructure. Meanwhile, our builds were reproducible and required no external dependencies other than Bazel, so you could git checkout our release branch, "bazel build //oci" and verify that the sha256 of the containers is identical to what's on Docker Hub. No special infrastructure necessary. It even works across architectures and platforms, so while our CI machines were linux / x86_64, you can build on your darwin / aarch64 laptop and get the exact same bytes, every time.
In a world where everything is reproducible, you don't need special computers to do secure builds. You can just build on a bunch of normal computers and verify that they all generate the same bytes. That's neat!
(I'll also note that the government's requirements made no sense. The way the build ended up working was that our CI system built the binaries, and then the binaries were sent to the special cluster, where a special Dockerfile assembled the binaries into the image that the customers would use. As far as I can tell, this offers no guarantee that the code we said was in the image was in the image, but it checked their checkbox. I don't see that stuff getting any better over the next 4 years, so...)
bbarnett
It means you can build it yourself, and know that the source code you have is all there is.
It validates that publicly available downloads aren't different from what is claimed.
rstuart4133
It's a link in a chain that allows you to trust programs you run.
- At the start of the chain, developers write software they claim is secure. But very few people trust the word of just one developer.
- Over time other developers look at the code and also pronounce it secure. Once enough independent developers from different countries and backgrounds do this, people start to believe it really is secure. As a measure of security this isn't perfect, but it is verifiable and measurable in the sense that more is always better, so if you set the bar very high you can be very confident.
- Somebody takes that code, goes through a complex process to produce a binary, releases it, and pronounces it secure because it is based only on code that you trust, thanks to the process above. You should not believe this. That somebody could have introduced malicious code and you would never know.
- Therefore, before reproducible builds, your only way to get a binary you knew was built from code you had some level of trust in was to build it yourself. But most people can't do that, so they have to trust Debian, Google, Apple, Microsoft, or whoever, that no backdoors have been added. Maybe people do place their faith in those companies, but it is misplaced. It's misplaced because countries like Australia have laws that allow them to compel such companies to silently introduce malicious code and distribute it to you. Australia's law is called the "Assistance and Access Bill (2018)". Countries don't introduce such laws for no reason. It's almost certain it is being used now.
- But now the build can be reproducible. That means many developers can obtain the same trusted source code from the source the original builder claimed he used, build the binary themselves, verify it is identical to the original, and so publicly validate the claim. Once enough independent developers from different countries and backgrounds do this, people start to believe it really is built from the trusted sources.
- Ergo, reproducible builds allow everyone, as opposed to just software developers, to run binaries they can be very confident were built just from code that has some measurable and verifiable level of trustworthiness.
It's a remarkable achievement for other reasons too. Although the ideas behind reproducible builds are very simple, it turned out executing them was about as simple as other straightforward ideas like "let's put a man on the moon". It seems building something as complex as an entire OS reproducibly was beyond any company, or capitalism/socialism/communism, or a country. It's the product of something we've only seen arise in the last 40 years, open source, and it's been built by a bunch of idealistic volunteers who weren't paid to do it. To wit: it wasn't done by commercial organisations like RedHat or Ubuntu. It was done by Debian. That said, other similar efforts have since arisen, like F-Droid, but they aren't on this scale.
zozbot234
Nice, these live images could become the foundation for a Debian-based "immutable OS" workflow.
polynox
That is the goal of Vanilla OS! https://vanillaos.org/
moondev
Do these live images come ready with cloud-init? A cloud-init in-memory live iso seems perfect for immutable infrastructure "anywhere"
bravetraveler
Should be trivial to put in, if not. Install the package and maybe prepare some datasource hints while reproducing the image. Depends on where you'll be using it.
The trick will be in the details, as usual. User data that both does useful work... and plays nicely with immutability.
I suspect it would be more sensible to skip the gymnastics of trying to manicure something inherently resistant, and instead, lean in on reproducibility. Make it as you want it, skip the extra work.
Want another? Great - they're freely reproducible :)
kragen
This is a huge milestone: https://lists.reproducible-builds.org/pipermail/rb-general/2...
Cort3z
I’m a noob to this subject. How can a build be non-reproducible? By that, I mean, what part of the build process could return non-deterministic output? Are people putting timestamps into the build and stuff like that?
r3trohack3r
File paths, timestamps, unstable ordering of inputs/outputs, locales, version info, variations in the build environment, etc.
This page has a good write-up.
jcranmer
Timestamps, timestamps, absolute paths (i.e., differences between building /src versus /home/Cort3z/source), timestamps, file inode numbering ("for file in directory" defaults to inode order rather than alphabetical order in many languages, and that means it's effectively pseudorandom), more timestamps, using random data in your build process (e.g., embedding a generated private key, or signing something), timestamps, and accidental nondeterminism within the compiler.
By far the most prevalent source of nondeterminism is timestamps, especially since timestamps crop up in file formats you don't expect (e.g., running gzip stuffs a timestamp in its output for who knows what reason). After that, it's the two big filesystem issues (absolute paths and directory iteration nondeterminism), and then it's basically a long tail of individual issues that affect but one or two packages.
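The gzip case is a nice example because the fix is a one-liner once you know the format embeds a 4-byte MTIME field in its header; in Python you can pin it (gzip(1) has -n/--no-name for the same purpose):

```python
import gzip

# Pin the embedded header timestamp to 0 so compressing the same bytes
# always yields the same .gz output, regardless of when it runs.
with gzip.GzipFile("out.gz", "wb", mtime=0) as f:
    f.write(b"same bytes in, same bytes out\n")
```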
yupyupyups
This is amazing news. Well done!
nwellinghoff
Does anyone have any information as to how they modified their C code such that the compiler output was deterministic? I thought one of the hardest problems with an effort like this was writing your C such that the compiler would output everything in the same order (same bytes)? And I am not just talking about timestamps etc.
account42
This is something compilers themselves need to guarantee, e.g. GCC has -frandom-seed [0].
[0] https://gcc.gnu.org/onlinedocs/gcc/Developer-Options.html#in...
letters90
The update is gold. Original message: "They are reproducible"; updated message: "lol actually not".
eqvinox
Not really; it's just someone with a higher goal of "freedom" (no binary firmware blobs) using it to push their agenda.
I'll happily agree higher degrees of "freedom" are an admirable goal, but this is just rudely shitting on a hard-earned achievement.
Insane effort. This sounded like a pipe dream just a couple of years ago. Congrats to everyone involved, especially to those who drove the effort.