
Fedora change aims for 99% package reproducibility

barotalomey

The real treasure was the friend I found along the way

https://github.com/keszybz/add-determinism

AshamedCaptain

remram

That's mentioned at the bottom of the README.

Tijdreiziger

…and in the article:

> The Fedora project chose to write its own tool because it was undesirable to pull Perl into the build root for every package.

m463

I kind of wonder if this or something similar could somehow nullify timestamps so you could compare two logfiles...

Going further would be the ability to compare logfiles with pointer addresses or something.

yjftsjthsd-h

I'm not confident that I understand what you're asking for, but couldn't you just sed off the timestamp from every line? Or, for a more extreme example, I have occasionally used... tr, I think? ...to completely remove all numbers from logs in order to aggregate error messages without worrying about the fact that they kept including irrelevant changing numbers (something like tail -5000 logfile | tr -d '[0-9]' | sort | uniq -c | sort -n or so).
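
For timestamps specifically, a sed sketch in the same spirit, assuming ISO-8601-ish timestamps at the start of each line (the pattern and filenames are just placeholders; adjust for your log format):

    # strip a leading "2024-01-02 03:04:05" / "2024-01-02T03:04:05.678Z" style timestamp
    sed -E 's/^[0-9]{4}-[0-9]{2}-[0-9]{2}[ T][0-9:.]+Z? *//' a.log > a.stripped
    sed -E 's/^[0-9]{4}-[0-9]{2}-[0-9]{2}[ T][0-9:.]+Z? *//' b.log > b.stripped
    diff -u a.stripped b.stripped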

winwang

how would you do it if your logs were printed on paper with a printer, each line printed with stochastic timing (due to a bug), with an ink containing a chemical tracer with half-life `h` (after being put to paper), but the ink is randomly sampled from several (`m`) inks of different half-lives `h1`, `h2`, ..., `hn`? assume `p` different printers scattered across the 10 most populous US cities. you may use standard unix utilities.

didericis

A different but more powerful method of ensuring reproducibility is more rigorous compilation using formally verifiable proofs.

That’s what https://pi2.network/ does. It uses K-Framework, which is imo very underrated/deserves more attention as a long term way of solving this kind of problem.

apatheticonion

Another thing I'd love to see is more statically linked binaries. Something like Python, for instance, is a nightmare to install and work with

theteapot

I think general consensus is against you. Fedora packaging policy [1]:

> Packages including libraries should exclude static libs as far as possible (eg by configuring with --disable-static). Static libraries should only be included in exceptional circumstances. Applications linking against libraries should as far as possible link against shared libraries not static versions.

[1]: https://docs.fedoraproject.org/en-US/packaging-guidelines/

Brian_K_White

I'd far rather have a static binary than a bundled VM for a single app, which produces all the same bad points of a static binary plus 900 new bad points on top.

Packaging guidelines from a distro's docs like this are not any kind of counterargument to that comment.

This is the current orthodoxy, so obviously all docs say it. We all know the standard argument for the current standard. Their comment was explicitly "I'd like to see a change from the current orthodoxy". They are saying that maybe that argument is not all it promised to be back in the '90s when we started using dynamic libs.

Kudos

I don't know why you'd go straight to a VM as the alternative when containers are the obvious choice.

booi

So instead of 1000 programs and 1000 libraries, you’d rather 1000 programs and 1,000,000 libraries?

supriyo-biswas

For Python, take a look at the musl builds in python-build-standalone[1], which are statically linked.

I also have a tiny collection of statically linked utilities available here[2].

[1] https://github.com/astral-sh/python-build-standalone

[2] https://github.com/supriyo-biswas/static-builds

hashstring

What do you mean with “a nightmare to install and work with” exactly?

gessha

They use Windows. \s

JodieBenitez

Python has official installers for Windows, is distributed in the Microsoft Store, and can also be pulled in by uv, which works like a breeze in PowerShell.
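
For example, a minimal sketch with uv (the version number is just illustrative):

    uv python install 3.12          # fetch a standalone CPython build
    uv run --python 3.12 script.py  # run a script against that interpreter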

throwaway48476

We're stuck with a computing paradigm from 50 years ago.

Ideally everything would be statically linked, but the sections would be marked and deduped by the filesystem.

danieldk

Even though the idea is much older, shared libraries were only introduced on Unix systems on SunOS 4.x and System V release 3 (?). Sun paper from 1988: https://www.cs.cornell.edu/courses/cs414/2001FA/sharedlib.pd...

nimish

As a user of Fedora, what does this actually get me? I mean, I understand it for hermetic builds, but why?

kazinator

Reproducible builds can improve software quality.

If we believe we have a reproducible build, that constitutes a big test case which gives us confidence in the determinism of the whole software stack.

To validate that test case, we actually have to repeat the build a number of times.

If we spot a difference, something is wrong.

For instance, suppose that a compiler being used has a bug whereby it is relying on the value of an uninitialized variable somewhere. That could show up as a difference in the code it generates.

Without reproducible builds, of course there are always differences in the results of a build: we cannot use repeated builds to discover that something is wrong.

(People do diffs between irreproducible builds anyway. For instance, disassemble the old and new binaries and do a textual diff, validating that only some expected changes are present, like string literals that have embedded build dates. If you have reproducible builds, you don't have to do that kind of thing to detect a change.)

Reproducible builds will strengthen the toolchains and surrounding utilities. They will flush out instabilities in build systems, like parallel Makefiles with race conditions, or indeterminate orders of object files going into a link job, etc.
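
As a sketch of that comparison step, using diffoscope (the reproducible-builds.org comparison tool; the paths here are placeholders):

    # build the same source twice, then compare the artifacts
    sha256sum build1/foo.rpm build2/foo.rpm    # identical hashes => reproducible
    diffoscope build1/foo.rpm build2/foo.rpm   # otherwise, show exactly what differs inside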

tomcam

I don't know this area, but it seems to me it might be a boon to security? So that you can tell if components have been tampered with?

LinuxBender

That's already been a thing in all the Red Hat variants. RPM/DNF have checksums of the installed binaries, and there is GPG signing of packages and repositories. The only part of that ecosystem I've always had a gripe with is putting the GPG public keys in the mirrors. People should have to grab those from non-mirrors, or any low-skilled attacker can just replace the keys and sign everything again. It would be caught, but not right away.

Changes can also be caught using bolt-on tools like Tripwire, OSSEC and its alternatives, or even home-grown tools that build signed manifests of approved packages, usually for production approval.
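
For reference, the stock RPM commands for those checks (the package names are just examples):

    rpm -K -v some-package.rpm    # check the package's digests and GPG signature
    rpm -V openssl                # verify installed files against the RPM database
    rpm -qa 'gpg-pubkey*'         # list the GPG public keys rpm currently trusts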

dwheeler

Yes! The attack on SolarWinds Orion was an attack on its build process. A verified reproducible build would have detected the subversion, because the builds would not have matched (unless the attackers managed to detect and break into all the build processes).

bobmcnamara

Bingo. We caught a virus tampering with one of our code gens this way.

uecker

I don't think it is that unlikely that build hosts or some related part of the infrastructure gets compromised.

stefan_

You know what does not give me confidence? Updating software, but what's that, it's still printing the same build date? Of course, hours later and tens of files deep, I found out some reproducibility goof just hardcoded it.

So far, reproducible builds are heavy on that kind of hack, zero on these bugs you mention, and zero on supply-chain attacks.

Some of these supposed use cases make no sense. You update the compiler. Oh no, all the code is different? Enjoy the 16h deep dive to realize someone tweaked code generation based on the cycle times given on page 7893 of the Intel x64 architecture reference manual.

kazinator

They should be setting the build dates for a package from, say, the commit date of the top commit of the branch that's being built. It can't be something that doesn't change when the next version is spun. If you see behavior like that in anybody's reproducible package system or distro, you have a valid complaint.
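
The usual mechanism for this is SOURCE_DATE_EPOCH, which many build tools honor; a minimal sketch:

    # pin embedded timestamps to the committer date of the top commit
    export SOURCE_DATE_EPOCH=$(git log -1 --format=%ct)
    ./configure && make    # tools that honor SOURCE_DATE_EPOCH use it instead of the current time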

jacobgkau

My impression is that reproducible builds improve your security by helping make it more obvious that packages haven't been tampered with in late stages of the build system.

* Edit, it's quoted in the linked article:

> Jędrzejewski-Szmek said that one of the benefits of reproducible builds was to help detect and mitigate any kind of supply-chain attack on Fedora's builders and allow others to perform independent verification that the package sources match the binaries that are delivered by Fedora.

kazinator

The supply-chain attacks you have to worry about most are not someone breaking into Fedora build machines.

It's the attacks on the upstream packages themselves.

Reproducible builds would absolutely not catch a situation like the XZ package being compromised a year ago, due to the project merging a contribution from a malicious actor.

A downstream package system or OS distro will just take that malicious update and spin it into a beautifully reproducing build.

yjftsjthsd-h

Don't let the perfect be the enemy of the good; this doesn't prevent upstream problems but it removes one place for compromises to happen.

pxc

When builds are reproducible, one thing a distro can do is have multiple build farms with completely different operators, so there's no shared access and no shared secrets. Then the results of builds of each package on each farm can be compared, and if they differ, you can suspect tampering.

So it could help you detect tampering earlier, and maybe even prevent it from propagating depending on what else is done.
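
A sketch of what that cross-check could look like (the farm hostnames and package name are made up):

    # fetch the same artifact from two independently operated build farms and compare
    curl -sO https://farm-a.example.org/repo/foo-1.0-1.x86_64.rpm
    curl -s https://farm-b.example.org/repo/foo-1.0-1.x86_64.rpm -o foo-b.rpm
    sha256sum foo-1.0-1.x86_64.rpm foo-b.rpm    # a mismatch means tampering or an unreproducible build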

bluGill

Reproducible builds COULD fix the xz issues. The current level would not, but GitHub could do things to make creating the downloadable packages scriptable and thus reproducible. Fedora could check out the git hash instead of downloading the provided tarball and again get reproducible builds that bypass this (see the sketch below).

The above are things worth looking at doing.

However, I'm not sure what you can do about code that tries to obscure the issues while looking good.
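
A minimal sketch of the check-out-the-hash idea (the repository URL is real, the tag is a placeholder):

    # build from a pinned source tree instead of trusting an uploaded tarball
    git clone https://github.com/tukaani-project/xz.git
    cd xz && git checkout <release-tag-or-commit-hash>
    # regenerate a tarball from the tree; gzip -n drops the timestamp, and git archive
    # output is stable for a given commit and git version
    git archive --prefix=xz/ HEAD | gzip -n > ../xz-src.tar.gz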

phkahler

And anything designed to catch upstream problems like the XZ compromise will not detect a compromise in the Fedora package build environment. Kinda need both.

Zamicol

Bingo.

conradev

Better security! A malicious actor only needs to change a few bytes in either the source or the binary of OpenSSL to break it entirely (e.g. disable certificate checking).

Reproducible builds remove a single point of failure for authenticating binaries – now anyone can do it, not just the person with the private keys.

bagels

It's one tool of many that can be used to prevent malicious software from sneaking in to the supply chain.

russfink

Keep in mind that compilers can be backdoored to install malicious code. Bitwise/signature equivalency does not imply malware-free software.

bluGill

True, but every step we add makes the others harder too. It is unlikely Ken Thompson's "trusting trust" compiler would detect modern gcc, much less successfully introduce the backdoor. Even if you start with a compromised gcc of that type, there is a good chance that after a few years it would be caught when the latest gcc fails to build anymore for someone with the compromised compiler. (Now add clang and people using that...)

We may never reach perfection, but the more steps we take in that direction, the more likely it is we reach a point where we are impossible to compromise in the real world.

jt2190

In this attack, the compiler is not a reproducible artifact? Or does backdooring use another technique?

kaelit

Interesting take. I'm building something related to zk systems — will share once it's up.

kccqzy

> For example, Haskell packages are not currently reproducible when compiled by more than one thread

Doesn't seem like a big issue to me. The gcc compiler doesn't even support multithreaded compilation. In the C world, parallelism comes from compiling multiple translation units in parallel, not any one with multiple threads.
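
For illustration, in a C build the parallelism sits at the build-system level, one single-threaded compiler process per translation unit (a trivial sketch):

    make -j"$(nproc)"               # N gcc processes in parallel, each compiling one .c file
    # roughly equivalent by hand:
    gcc -c a.c & gcc -c b.c & wait
    gcc -o prog a.o b.o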

Dwedit

Reproducibility is at odds with profile-guided optimization, especially on anything that involves networking and other I/O that isn't consistent.

nrvn

from Go documentation[0]:

> Committing profiles directly in the source repository is recommended as profiles are an input to the build important for reproducible (and performant!) builds. Storing alongside the source simplifies the build experience as there are no additional steps to get the profile beyond fetching the source.

I very much hope other languages/frameworks can do the same.

[0]: https://go.dev/doc/pgo#building
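
For Go specifically, the workflow that doc describes boils down to something like this (the cmd/myserver path is just an example; since Go 1.21, -pgo=auto is the default and picks up default.pgo):

    # take a CPU profile collected in production and commit it next to the main package
    cp cpu.pprof cmd/myserver/default.pgo
    git add cmd/myserver/default.pgo
    go build ./cmd/myserver    # default.pgo is picked up automatically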

nyrikki

The "performant" claim there runs counter to research I have seen. Plus, as PGO profile data is non-deterministic in most cases, even when collected on the same hardware as the end machine, profiles that are "committed directly in the source repository" end up being deleted, or at least excluded from the comparison.

A quote from the paper I remember on the subject[1], since these profiles are just about as machine-dependent as you can get:

> Unfortunately, most code improvements are not machine independent, and the few that truly are machine independent interact with those that are machine dependent causing phase-ordering problems. Hence, effectively there are no machine-independent code improvements.

There were some differences between various Xeon chips' implementations of the same or neighboring generations that I personally ran into when we tried to copy profiles to avoid the cost of the profile runs; that may make me a bit more sensitive to this, but I saw huge drops in performance, well into the double digits, that threw off our regression testing.

IMHO this is exactly why your link suggested the following:

> Your production environment is the best source of representative profiles for your application, as described in Collecting profiles.

That is very different from Fedora using some random or generic profile for x86_64, which may or may not match the end user's specific profile.

[1] https://dl.acm.org/doi/10.5555/184716.184723

clhodapp

If those differences matter so much for your workloads, treat your different machine types as different architectures, commit profiling data for all of them, and (deterministically) compile individual builds for all of them.

Fedora upstream was never going to do that for you anyway (way too many possible hardware configurations), so you were already going to be in the business of setting that up for yourself.

zbobet2012

That's only the case if you did PGO with "live" data instead of replays from captured runs, which is best practice afaik.

nyrikki

This is one of the "costs" of reproducible builds, just like the requirement to use pre-configured seeds for pseudo-random number generators, etc.

It does hit real projects and may be part of the reason that "99%" is called out. Fedora also mentions that they can't match the official reproducible-builds.org definition just due to how RPMs work, so we will see what other constraints they have to loosen.

Here is one example where SUSE had to re-enable it for gzip.

https://build.opensuse.org/request/show/499887

Here is a thread on PGO from the reproducible-builds mail list.

https://lists.reproducible-builds.org/pipermail/rb-general/2...

There are other costs, like needing to get rid of parallel builds for some projects, that make many people loosen the official constraints; the value of PGO+LTO is one of them.

gcda profiles are unreproducible, but the code they produce is typically the same. If you look into the pipelines of some projects, they just delete the gcda output and then often retry the build if the code is different, or use other methods.

While there are no ideal solutions, one that seems to work fairly well, assuming the upstream is doing reproducible builds, is to vendor the code, build a reproducible build to validate that vendored code, then enable optimizations.

But I get that not everyone agrees that the value of reproducibility is primarily avoiding attacks on build infrastructure.

However, reproducible builds have nothing to do with MSO model checking etc. like some have claimed. Much of it is just deleting non-deterministic data, as you can see here with Debian, which Fedora copied.

https://salsa.debian.org/reproducible-builds/strip-nondeterm...
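
Typical usage of that tool is just (the jar path is an example of an archive format it knows how to normalize):

    # normalize embedded timestamps and other known sources of nondeterminism, in place
    strip-nondeterminism build/libfoo.jar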

As increasing the granularity of address-space randomization at compile and link time is easier than at the start of program execution, there will obviously be a cost (one that is more than paid for by reducing supply-chain risks, IMHO): reduced entropy for address randomization, which does increase the risk of ROP-style attacks.

Regaining that entropy at compile and link time, if it is practical to recompile or vendor packages, may be worth the effort in some situations; it is probably best to do real PGO at that time too, IMHO.

goodpoint

Yo, the attacker has access to the same binaries, so only runtime address randomization is useful.

gnulinux

It's not at odds at all, but it'll be "monadic" in the sense that the output of system A will be part of the input to system A+1, which is complicated to organize in a systems setting, especially if you don't have access to a language that can verify it. But it's absolutely achievable if you do have such a tool; e.g. you can do this in Nix.

michaelt

Why should it be?

Does the profiler not output a hprof file or whatever, which is the input to the compiler making the release binary? Why not just store that?

frainfreeze

Amazing to see this progress! Kudos to everyone who put in the effort.

Related news from March https://news.ycombinator.com/item?id=43484520 (Debian bookworm live images now fully reproducible)

sheepscreek

YES! I want more tools to be deterministic. My wish-list has Proxmox config at the very top.

trod1234

Can someone provide a brief clarification about build reproducibility in general?

The stated aim is that when you compile the same source, in the same environment, with the same instructions, the end result is bit-identical.

There are, however, hardware-specific optimizations that will naturally negate this stated aim, and I don't see how there's any way to avoid throwing out the baby with the bathwater.

I understand why having a reproducible build is needed on a lot of fronts, but the stated requirements don't seem to be in line with the realities.

At its most basic, there is the hardware, which may advertise features it doesn't have, or may not perform the same instructions in the same way, along with other nuances that break determinism as a property; that naturally taints the entire stack, since computers rely heavily on emergent design.

This is often hidden in layers of abstraction and/or may be separated into pieces that are architecture dependent vs independent (freestanding), but it remains there.

Most if not all of the beneficial properties of reproducible builds rely on the environment being limited to a deterministic scope, and the reality is manufacturers ensure these things remain in a stochastic scope.

Crestwave

> hardware specific optimizations that will naturally negate this stated aim

Distro packages are compiled on their build server and distributed to users with all kinds of systems; therefore, by nature, it should not use optimizations specific to the builder's hardware.

On source-based distros like Gentoo, yes, users adding optimization flags would get a different output. But there is still value in having the same hardware/compilation flags result in the same output.
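
In compiler-flag terms, a sketch (the exact baseline varies per distro):

    gcc -O2 -march=x86-64 -mtune=generic -c foo.c   # generic baseline a distro builder targets
    gcc -O2 -march=native -c foo.c                  # host-specific; output can differ across build machines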

amarshall

Well the point is that if N of M machines produce the same output, it provides the opportunity to question why it is different on the others. If the build is not reproducible then one just throws up their arms.

It’s not clear if you’re also talking about compiler optimizations—a reproducible build must have a fixed target for that.

trod1234

I was thinking more along the lines of LLVM-type performance optimizations when I was speaking about optimizations, if that sufficiently clarifies.
