Deep Down the Rabbit Hole: Bash, OverlayFS, and a 30-Year-Old Surprise
12 comments
June 25, 2025
justincormack
Most of the stuff that configure scripts check is obsolete, and breaks in situations like this because the checks are often not workable without running code. It is likely that the check does not apply to any system that has existed for decades. Lots of systems have disabled it, e.g. Nix in 2017 [1].
[1] https://github.com/NixOS/nixpkgs/commit/dff0ba38a243603534c9...
arp242
I had a look at the bash source code a few years back, and there are tons of hacks and workarounds for 1980s-era systems. Looking at the git log, GETCWD_BROKEN was added in bash 1.14 from 1996, presumably to work around some system at the time (a system which was perhaps already old in 1996, but it's not detailed which).
Also, that getcwd.c which contains the getcwd() fallback and bug is in K&R C, which should be a hint at how well maintained all of this is. Bash takes "don't fix it if it ain't broke" to new levels, to the point of introducing breakage like here (the bash-malloc is also notorious for this – no idea why that's still enabled by default).
malkia
Autoconf is the prime example of easy vs simple.
It looks easy on the surface to roll out support for any kind of operating system there is, based on auto-detection and then #if HAVE_THIS or #if HAVE_THAT, but it breaks in ways that may be really hard to untangle later.
I'd rather have a limited set of configurations targeting specific platforms/flavors, so that no matter how I compile it, I know what is `#define`-d and what is not, instead of guessing what the "host" might have.
jwilk
> Once the bug report becomes publicly visible, it will be linked here.
Here it is: https://lists.gnu.org/archive/html/bug-bash/2025-06/msg00149...
saurik
FWIW, if you are cross-compiling, while you might get a vaguely usable result by ignoring all of the warnings and letting worst-common-denominator defaults get applied, you absolutely should be paying more attention and either manually providing autoconf the answers it needs or (if at all possible, as this is more general) make sure to tell it how to run a binary on the target system (maybe in an emulator or over ssh)... you shouldn't just be YOLOing a cross-compile like this and expecting it to work (not to say that this wasn't a good bug in the fallback to fix, just that the premise is awkward).
iforgotpassword
Like for example when compiling Linux (plus user space) from Windows XP using only the official Services for Unix package from Microsoft as a starting point.
pogopop77
Interesting investigation, good read. Definitely illustrates how new paradigms (i.e. overlay filesystems) can subtly affect behaviors in ways that are complex to track down.
akoboldfrying
Remember, folks: It's not enough to check $WEARING_PANTS before stepping outside. You need to check !$PANTS_BROKEN && !$SOLARIS too.
x0x0
Saving this to explain why software is hard.
For a long time, inode numbers from readdir() had certain semantics. Supporting overlay filesystems required changing those semantics. Piles of software were written against the old semantics; and even some of the most common have not been upgraded.
JdeBP
The opposite, if anything. Very little was written against the old semantics, with most of the time the supplied C library providing what was needed, and so the code that did rely upon old semantics barely got exercised. A little-used shim that had been broken wasn't noticed, in other words, until just the right combination of circumstances got the shim being used on a platform where it would break.
What there are piles of, are softwares that reinvent the C library, all too often in little bits of conditionally-compiled code that have either been reinvented or nicked from some old C library and sit unused in every platform that that application is nowadays ported to. Every time that I see a build log dutifully informing me that it has checked for <string.h> or some other thing that has been standard for 35 years I wonder (a) why that is thought to be necessary in 2025, and (b) what sort of shims would get used if the check ever failed.
arp242
> what sort of shims would get used if the check ever failed.
Most programs will probably just fail to compile: "#undef HAVE_STRING_H" gets added to config.h, but it's never checked. Or something along those lines. It's little more than "failed to find <string.h>" with extra steps.
The exceptions are older projects which support tons of systems: bash, Vim, probably Emacs, that type of thing. A major difficulty is that it can be very hard to know what is safe to remove. So to use your string.h example, bash currently does:
#if defined (HAVE_STRING_H)
# include <string.h>
#endif /* !HAVE_STRING_H */
#if defined (HAVE_STRINGS_H)
# include <strings.h>
#endif /* !HAVE_STRINGS_H */
And Vim has an even more complex check:
// Note: Some systems need both string.h and strings.h (Savage). However,
// some systems can't handle both, only use string.h in that case.
#ifdef HAVE_STRING_H
# include <string.h>
#endif
#if defined(HAVE_STRINGS_H) && !defined(NO_STRINGS_WITH_STRING_H)
# include <strings.h>
#endif
Looks like that NO_STRINGS_WITH_STRING_H gets defined on "OS/X". Is that still applicable? Probably not?
Is any of this still needed? Who knows. Is it safe to remove? Who knows. No one is really tracking any of this. There is no "caniuse" for this, and even the autoconf people aren't sure on what systems autoconf does and doesn't work. There is no way to know who is running what on what, and people do run some of these programs on pretty old systems.
So ... people don't touch any of this because no one knows what is or isn't broken and what does and doesn't break if you touch it.
Aside: people love to complain about telemetry, sometimes claiming it's never useful, but this is where telemetry would absolutely be very useful.
Wow great bug!
> Bash forgot to reset errno before the call. For about 30 years, no one noticed
I have to say, this part of the POSIX API is maddening!
99% of the time, you don't need to set errno = 0 before making a call. You check for a non-zero return, and only then look at errno.
But SOMETIMES you need to set errno = 0, because in this case readdir() returns NULL on both error and EOF.
I actually didn't realize this before working on https://oils.pub/
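For anyone who hasn't run into it, the errno = 0 idiom for readdir() looks roughly like this. It is a minimal sketch, not the bash code; count_entries is a hypothetical helper that only counts entries so the error handling stays visible.

```c
#include <assert.h>
#include <dirent.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical helper: count the entries readdir() reports for `path`.
 * Returns the count, or -1 with errno set on failure. */
long count_entries(const char *path)
{
    assert(path != NULL);

    DIR *dir = opendir(path);
    if (dir == NULL)
        return -1;              /* opendir reports errors the usual way */

    long count = 0;
    for (;;) {
        errno = 0;              /* readdir leaves errno untouched at EOF */
        struct dirent *entry = readdir(dir);
        if (entry == NULL)
            break;              /* EOF or error; errno tells them apart */
        count++;
    }

    int saved = errno;          /* closedir may overwrite errno */
    closedir(dir);
    if (saved != 0) {
        errno = saved;
        return -1;
    }
    return count;
}
```

Without the reset inside the loop, a stale errno from some earlier, unrelated call would make a normal end-of-directory look like a failure.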
---
And it should go without saying: Oils simply uses libc - we don't need to support systems with a broken getcwd()!
Although a funny thing is that I just fixed a bug related to $PWD that AT&T ksh (the original shell, that bash is based on) hasn't fixed for 30+ years too!
(and I didn't realize it was still maintained)
https://www.illumos.org/issues/17442
https://github.com/oils-for-unix/oils/issues/2058
There is a subtle issue with respect to:
1) "trusting" the $PWD value you inherit from another process
2) Respecting symlinks - this is the reason the shell can't just call getcwd() !
Basically, the shell considers BOTH the inherited $PWD and the value of getcwd() to determine its $PWD. It can't just use one or the other!
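A rough sketch of that decision, under simplifying assumptions: trust the inherited $PWD only if it is absolute and names the same directory as ".", otherwise fall back to getcwd(). The function names are hypothetical, this is not the Oils or bash implementation, and a real shell checks more (for example, that the value has no "." or ".." components).

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if `candidate` is an absolute path that
 * refers to the same directory as ".", else 0. stat() follows symlinks,
 * so a symlinked $PWD is accepted as long as it resolves to ".". */
int pwd_is_trustworthy(const char *candidate)
{
    struct stat a, b;
    if (candidate == NULL || candidate[0] != '/')
        return 0;
    if (stat(candidate, &a) != 0 || stat(".", &b) != 0)
        return 0;
    return a.st_dev == b.st_dev && a.st_ino == b.st_ino;
}

/* Hypothetical helper: write the working directory the shell should
 * report into `buf`. Returns buf, or NULL with errno set by getcwd(). */
char *choose_pwd(char *buf, size_t size)
{
    assert(buf != NULL && size > 0);

    const char *inherited = getenv("PWD");
    if (pwd_is_trustworthy(inherited) && strlen(inherited) < size) {
        strcpy(buf, inherited);     /* keeps the symlinked spelling */
        return buf;
    }
    return getcwd(buf, size);       /* physical path, symlinks resolved */
}
```

Keeping the inherited value when it checks out is what preserves a logical path through a symlink; getcwd() alone would always give the physical one.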