
Exploiting Undefined Behavior in C/C++ Programs: The Performance Impact [pdf]

pcwalton

I notice that the paper doesn't claim to eliminate all reasoning about undefined behavior for optimizations. For example:

    int f() {
        int arr[3], i = 0;
        arr[3] = 5;
        return i;
    }
Optimizing this to "return 0" is relying on UB, because it's assuming that i wasn't laid out directly after arr in the stack frame. I believe this is what the paper calls "non-guardable UB".

I don't agree with the claim in the paper that their semantics offers a "flat memory model". A flat memory model would rule out the optimization above. Rather, the memory model still has the notion of object bounds; it's just simplified in some ways.

dooglius

>it's assuming that i wasn't laid out directly after arr in the stack frame

The compiler isn't "assuming" that so much as choosing not to put i in the stack frame at all. And I don't think it's correct to view the lack of a register spill as an "optimization" per se. It does remain true that code writing past the end of an array will be UB in typical scenarios (i.e. when not using asan/valgrind).

(Now, if the compiler also removed the store, that could legitimately be called an optimization based on assuming no-UB)

pcwalton

"Exploiting undefined behavior" occurs when a simple semantics (however one defines "simple") results in behavior A, but the compiler chooses behavior B instead based on the actual, more complex, language semantics. The code snippet in question passes that test. If I flip the declaration order of i and arr, then I get this [1] at -O0 (the "simple" semantics):

        push    rbp
        mov     rbp, rsp
        mov     dword ptr [rbp - 4], 0
        mov     dword ptr [rbp - 4], 5
        mov     eax, dword ptr [rbp - 4]
        pop     rbp
        ret
Which indeed returns 5. But at -O2 clang optimizes it to this:

        xor     eax, eax
        ret
Which returns 0. So the simple semantics produces one result, and the complex semantics produces another. Hence, it's exploiting undefined behavior.

[1]: https://godbolt.org/z/df4dhzT5a

dooglius

Maybe this is just arguing semantics, but I don't agree with the definition you've given, and I don't think that your definition is what TFA means. "Exploiting undefined behavior" I think refers narrowly to the implementation assuming that undefined behavior does not occur, and acting on that assumption. Undefined behavior naturally resulting in unpredictable behavior is not exploitation in the same sense. For example,

  printf("A");
  bool x;
  if (x) { printf("B"); } else { printf("C"); }
  printf("\n");
If at -O0 "AB" is printed and at -O2 "AC" is printed (due to the vagaries of whatever was left on the stack), then that would meet your definition, but I would not regard that as "exploiting undefined behavior", merely as the manifestation of the inherent unpredictability of UB. If the compiler didn't print anything (i.e. the whole block was removed due to UB detection) then that _would_ be a case of exploiting undefined behavior.
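
A sketch of the narrow sense I mean (my own example, not from TFA): a compiler that assumes no UB occurred can delete a null check that follows a dereference:

    int deref_then_check(int *p) {
        int v = *p;        // UB if p is null
        if (!p) return -1; // compiler assumes the load above didn't trip UB,
                           // so p is non-null and this branch is deleted at -O2
        return v;
    }

That deletion is acting on the assumption that UB does not occur, rather than merely letting UB fall where it may.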

artikae

What about this: https://godbolt.org/z/xP9xG3Ee3

Here the compiler "register allocates" i for some reads but not for others.

i gets stack allocated, but some uses of it act as though they were register allocated.

dooglius

I'm not sure quite what you're asking for exactly, given that the link is for clang trunk and doesn't have the modifications discussed in TFA, and I don't dispute that clang does UB-based reasoning at -O3. But I will argue that the assembly shown can be accomplished without resorting to what I call "reasoning about UB", and within a flat memory model, supporting the claim that these sacrifices are often not necessary.

I'm going to draw a distinction between stack memory being "private", in the sense that only the compiler is allowed to alter it, and "public", where the address can be written to by something else and the compiler needs to handle that. Local variables are at first tracked privately. After the address of a variable is taken with &x, or at the point in time when an array variable is indexed, the associated memory is public. Conceptually, the use of private memory can be indirect; the compiler could encode/decode a stack variable x as (x XOR 0xDEADBEEF) on the stack and it would be fine (but the compiler does the simple thing in practice, naturally).

Note that this notion of "private"/"public" stack memory is a property of addresses, not of the provenance of the accessing pointers, and so is fully compatible with a flat memory model. The compiler's ability to use private memory isn't a case of "reasoning about UB" in a meaningful sense -- otherwise you could just as well argue that returning from a function call is "reasoning about UB", because the return address can't be clobbered.

In your provided snippet, the correctness argument for the assembly in a non-UB-reasoning universe goes like this: at first, i is stored privately on the stack with value zero, and so as an optimization we can assume that value is still zero without rereading it. Only later, when &i is taken, is that memory made public, and the compiler has to worry about something altering it. In actual execution, the problem is that the write alters compiler-private memory (and note again, being private is a property of the underlying address, not of the fact that it's accessed via an out-of-bounds array indexing), and this is UB and so the program breaks. But the compiler didn't need to make _assumptions_ about UB.
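
To make the timeline concrete, here's a hedged sketch of what I mean (variable names and the external() helper are mine, not the snippet's):

    void external(int *p);   // hypothetical function that may write *p

    int example(void) {
        int i = 0;       // i's stack slot is "private": only the compiler
                         // may alter it, so its value can be cached
        int a = i;       // foldable to 0 without rereading memory
        int *p = &i;     // address taken: i's slot becomes "public"
        external(p);     // something else may now legitimately write i...
        return a + i;    // ...so this read of i must go through memory
    }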

jcranmer

I've only briefly skimmed the paper, but on that glance, it looks like what they did was (effectively) drop all the attributes in LLVM that can indicate UB, e.g., the inbounds flags on getelementptr instructions, or the nsw flags on arithmetic operations.
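
For instance, a hedged sketch (mine, not from the paper) of what nsw buys on the C side: signed overflow is UB, so the frontend emits "add nsw" and the optimizer may assume x + 1 > x for any int x:

    int always_one(int x) {
        return x + 1 > x;   // clang -O2 folds this to "return 1";
                            // without nsw it must emit a real comparison,
                            // and always_one(INT_MAX) would return 0
    }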

As you note, it doesn't remove the more core UB behaviors in LLVM, in particular LLVM's reliance on pointer provenance.

nikic

They do a bit more than that. One of the options (-disable-object-based-analysis under AA2) disables the assumption that distinct identified objects do not alias, which is disabling pointer provenance in at least one key place.

So I think this option very roughly approximates a kind of "no provenance, but with address non-determinism" model, which still permits optimizations like SROA on non-escaping objects.
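
A rough sketch of the assumption being disabled (hypothetical code):

    int distinct_objects(int n) {
        int a = 0;
        int b[4];
        b[n] = 1;      // object-based AA: a store derived from b cannot
                       // modify a, so the load below folds to 0
        return a;      // without that assumption, an out-of-range n
                       // might have clobbered a
    }

Whereas a local whose address never escapes can still be promoted to a register under address non-determinism, since nothing else can legitimately guess its address.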

jcranmer

That's what I get for relying on a skimming of the paper.

Also, hi, didn't know you commented on this site.

throwawayqqq11

Sorry, I don't get why the memory layout should have any effect, when it's clear in the AST that i=0 should be returned.

maartenscholl

I think in the example the parent gave, `arr[3]` is one past the end of the 3-element array, where `i` might reside, so the store could change its value.

bregma

It's clear in the AST that there is undefined behaviour and it is malformed code. It is not valid C code, so what the compiler chooses to do with it is not defined by the language.

pcwalton

Note that if you change the code to this you have the same issue:

    int g(int n) {
        int arr[3], i = 0;
        arr[n] = 5;
        return i;
    }
Without "exploiting UB" it's incorrect to optimize this to "return 0", because of the possibility that i was allocated right after arr and n == 3.

jonstewart

> The results show that, in the cases we evaluated, the performance gains from exploiting UB are minimal. Furthermore, in the cases where performance regresses, it can often be recovered by either small to moderate changes to the compiler or by using link-time optimizations.

_THANK YOU._

Rusky

It's worth noting (and the paper does go into this) that this is limited to a very specific subset of UB, which they call "guardable."

They are not removing UB around things like out-of-bounds or use-after-free, which would likely be more expensive.

jonstewart

I don't understand the downvotes. Conducting empirical research on the performance impact of undefined behavior is fantastically needed, as the C++ committee's obsession with undefined-behavior strictness (in contrast with longstanding semantics, e.g., uninitialized memory accesses being just fine) has been justified largely by the optimizations it enables. This research shows that many types of UB have a negligible impact on performance.

atq2119

Possibly somebody downvoted because "thank you" in all caps is not a substantial contribution to discussion. It feels like the kind of low effort stuff you'd see on reddit.

Also, commenting on downvotes is generally frowned upon.

saagarjha

You're getting downvoted because you're looking for a particular result ("UB optimizations don't help performance") rather than actually evaluating the quality of this analysis (which doesn't really support what you want anyway).

ryao

> by using link-time optimizations

These are almost never used by software.

mgaunard

The only places where I've seen LTO not be used are places with bad and unreliable build systems that systematically introduce undefined behaviour by violating the ODR.

jeffbee

The only organization I've worked in that had comprehensive LTO for C++ code was Google. I've worked at other orgs even with 1000s of engineers where LTO, PGO, BOLT, and other things you might consider standard techniques were considered voodoo and too much trouble to bother with, despite the obvious efficiency improvements being left on the table.

tialaramex

Violating the ODR doesn't introduce UB; it's IFNDR, ill-formed no diagnostic required, which is much worse in principle, and in such cases probably also in practice.

UB is a runtime phenomenon: it happens or it doesn't, and we may be able to ensure that the case where it happens doesn't occur through ordinary human controls.

But IFNDR is a property of the compiled program. If you have IFNDR (by some estimates that's most C++ programs), your program has no defined behaviour and never did, so there is no possible countermeasure; too bad, game over.
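
A hedged sketch of how easily this happens (file and function names are hypothetical):

    // a.cpp
    inline int limit() { return 10; }
    int user_a() { return limit(); }

    // b.cpp
    inline int limit() { return 20; }   // same name, different body: IFNDR
    int user_b() { return limit(); }

The linker silently keeps one definition of limit() for both callers, no diagnostic is required, and the program never had defined behaviour to begin with.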

ryao

I am curious where you have seen LTO used. Linux distributions and open source projects in general rarely use LTO. Their build systems are usually very good.

jandrewrogers

LTO is heavily used in my experience. If it breaks something, that is indicative of other issues that need to be addressed.

yxhuvud

The main issue isn't that it breaks stuff but that compiling with it tends to be pretty slow.

ryao

How many Linux distributions use LTO? It is a rarity among Gentoo users as far as I know and that is the one place where you would expect more LTO usage.

steveklabnik

It's on by default for Rust release builds, so at least the codepaths in LLVM for it are well-exercised.

vlovich123

I don't think that's right unless the docs are stale:

    [profile.release]
    lto = false
https://doc.rust-lang.org/cargo/reference/profiles.html#rele...
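
Per those docs, lto = false still performs "thin local LTO" within each crate across its codegen units; it's cross-crate LTO that needs an explicit opt-in, e.g.:

    [profile.release]
    lto = "thin"   # or true / "fat" for full cross-crate LTO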

alpaca128

That must have been changed sometime in the last year, then. When I enable LTO for one of my projects with a Rust compiler from 2024, the compilation time more than doubles.

nikic

One peculiar thing about the benchmark results is that disabling individual UB seems to fairly consistently reduce performance without LTO, but improve it with LTO. I could see how the UB may be less useful with LTO, but it's not obvious to me why reducing UB would actually help LTO. As far as I can tell, the paper does not attempt to explain this effect.

Another interesting thing is that there is clearly synergy between different UB. For the LTO results, disabling each individual UB seems to be either neutral or an improvement, but if you disable all of them at once, then you get a significant regression.

mwkaufma

Reading e.g. the 13% perf regression in simdjson from disabling UB:

  A simpler alternative is to compile the program with LTO. We confirmed that LLVM’s inter-procedural analyses can propagate both alignment and dereferenceability information for this function, which allows the LTO build to recover the performance loss.
"can" is doing a lot of heavy-lifting here. Guaranteeing expected optimizations "will" be applied are hard-enough, without leaving it entirely to an easily-derailed indirect side-effect.

UebVar

This is "can" has exactly the same meaning as in "UB can make your programms faster". You could replace it with "it does, at least with clang". LTO is, in this regard, the same as UB, and unlike guaranteed optimizations, such as the single member optimization, or the empty base optimization.

mwkaufma

Concretely, the UB exploitation in question here is assuming that the "this" pointer in C++ is aligned and non-null, meaning it's a pervasive annotation throughout C++ codebases, not an edge case.
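
For a hedged sketch of what that assumption deletes (my example, not the paper's):

    struct Widget {
        int value;
        int safe_get() {
            if (this == nullptr) return -1;  // calling through a null this is
                                             // UB, so clang/gcc assume this is
                                             // non-null and drop the check at -O2
            return value;
        }
    };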

Relying on LTO to "discover" this annotation through interprocedural analysis -- based on my experience of looking at LTO in practice -- will not be as comprehensive, and even when it works it accomplishes its task in an achingly-slow and expensive way.

This is a real devil-is-in-the-details case.

quotemstr

I love when papers disagree with their own abstracts.

hnaccountme

You don't program C. You program your OS using C.

If you look at it this way, do most complaints about undefined behavior go away?

gitroom

perfect, this is right up my alley - honestly I keep wondering if teams avoid optimizations like LTO just because build pain sucks, or if there's some deeper trust issue around letting the toolchain be clever. You think people would deal with slow builds if it bought way more speed for the final product?

imtringued

Amazing that the pain of C is unnecessary and offers so few benefits in return.

hyperhello

C is very much one level above assembly, the way dipping a jug in the river is one level above bending down to drink. It's a whole lot easier to mechanically translate *ptr++ = 0 to the corresponding machine code than to memorize and write those actual instructions. Neither is going to automatically check the security of your memory access, any more than the jug is going to test the water.
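
Roughly, with ptr an int* held in rax, that mechanical translation is something like (a hand-written sketch, not actual compiler output):

        mov     dword ptr [rax], 0      ; *ptr = 0
        add     rax, 4                  ; ptr++ (int stride)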

pjmlp

In the good old days of dumb CPUs.

There is a world of difference between C code and auto-vectorization into AVX-512.