Pointers Are Complicated II, or: We need better language specs (2020)
January 30, 2025
kibwen
Great series of blog posts, I link them every chance I get.
Important to note that in the years since this post was written, provenance has become an explicitly accepted consideration of the Rust memory model ( https://doc.rust-lang.org/std/ptr/index.html#provenance ), and standard APIs have been stabilized for working with pointers in ways that preserve provenance: https://doc.rust-lang.org/std/ptr/index.html#strict-provenan...
tialaramex
More specifically Aria's Experiment (the provenance APIs) was stabilized in 1.84 which actually shipped as a release only last month, January.
Socializing these APIs is important: if you write or review unsafe Rust, especially unsafe Rust that does stuff with pointers, knowing these APIs exist makes it less likely that you will write or wave through unsound software that treats integers as having provenance. Sometimes you would get away with that; other times you would not. Good Rust programmers shouldn't let other Rust programmers use a usize where a pointer is the correct type, just as you wouldn't use a u32 where Ipv4Addr is the correct type.
IshKebab
I found provenance easiest to think about after learning CHERI, since there the provenance isn't just an abstract concept that allows better optimisation; it's actually real data in the capability that affects whether it works at all, irrespective of optimisations. You literally can't cast an integer to a pointer, even in assembly.
tialaramex
Well, you can't "just" cast any of Rust's integers to a pointer, because they're not necessarily the same size (one of the changes to Rust is the decision that usize is not the size of a pointer, only the size of the pointer's address). But C++'s uintptr_t is in fact guaranteed to be able to hold a pointer, and presumably the compiler might just model it as an address, the same as a pointer.
Today there are no settled provenance rules for C or C++, but there is a TS for C explaining PNVI-ae-udi. Under that model, if we cast a pointer to uintptr_t we're exposing the provenance of that pointer. Rust's Exposed Provenance API doesn't promise that this is even possible on every target (if we imagine compiling Rust for a CHERI target, clearly expose_provenance may not compile to any actual code; maybe the compiler barfs), but in C we're guaranteed the actual raw bits. So we're allowed to, for example, display them to the user, and under this TS the associated provenance is thereby exposed. We can therefore have the user later input an arbitrary integer, turn it into a pointer, and the resulting pointer must work if they gave us the right number; that pointer would still be valid!
Is that a good idea? No. But it's very on brand for C and might become the de facto or even de jure semantic for pointer provenance in C++ too.
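To make the round trip concrete, here is a minimal sketch of what PNVI-ae-udi blesses (whether it's a good idea is exactly what's being debated above; the variable names are illustrative):

    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        int x = 1;
        uintptr_t addr = (uintptr_t)&x;     /* this cast exposes x's provenance */
        printf("%ju\n", (uintmax_t)addr);   /* e.g. show the raw bits to the user */
        uintptr_t laundered = addr;         /* imagine this came back from user input */
        int *r = (int *)laundered;          /* int-to-pointer: may adopt the exposed provenance */
        *r = 2;                             /* ...so this store is permitted */
        printf("%d\n", x);                  /* prints 2 */
        return 0;
    }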
IshKebab
> Well, you can't "just" cast any of Rust's integers to a pointer, because they're not necessarily the same size (one of the changes to Rust is the decision that usize is not the size of a pointer, only the size of the pointer's address)
You can though, in practice on a non-CHERI system and if you avoid optimisations.
I haven't looked but I would expect the change you mentioned was made precisely because of CHERI, where a pointer and address are different sizes. On all other platforms Rust supports they are the same.
uecker
The reason this is a really good idea is that it is sometimes needed in the real world, and with PNVI-ae-udi it just works. Also, the alternative models have horrible semantics: either you have provenance via integers (where the algebraic properties of arithmetic break), or you get really subtle semantics that are hard to reason about (all the angelic / demonic indeterminism ideas).
tialaramex
Are you telling me that, if you could start over somehow, you'd suggest K&R should have specified PNVI-ae-udi? Or do you mean (which I accept) only that, well, too bad, PNVI-ae-udi is the best we could do with C given where we started from 20+ years ago?
I actually think there's still significant risk that this is swamped by the "bag of bits" people despite all your hard work. There are plenty of WG21 submissions whose subtext can be read that way: they want their "high-level assembler" and they don't care that it's a myth that can't be made real; they're not taking "no" for an answer. So, congratulations on the TS.
saurik
> ...another pass might notice the INT_MAX+1 and replace it by unreachable since UB code is “by definition” unreachable...
I would love to see the concrete example where this actually makes the program faster, to give me a better understanding of what I'm up against if I prefer to make integer field arithmetic well-defined.
AFAIK, the main reason these loops are so frustrating in C is that no one wanted to bite the bullet and insist that fast loops all use size_t instead of int. Otherwise, on systems where int isn't native (which is, today, most of them), you have to emulate the wrap by adding a mask to each iteration, and so the compilers started inferring stuff and the language spec got damaged to let them work around it instead.
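The example usually cited compares a 32-bit index whose wrap is defined against one whose wrap is UB, on a 64-bit target. A hedged sketch (function names are illustrative, and whether a given compiler actually produces different code varies):

    /* Unsigned: wraparound of i is well-defined, so the compiler must
       allow for `i += stride` wrapping mod 2^32 and &a[i] jumping
       backwards; the index is re-zero-extended each iteration. */
    void scale_unsigned(float *a, unsigned n, unsigned stride) {
        for (unsigned i = 0; i < n; i += stride)
            a[i] *= 2.0f;
    }

    /* Signed: overflow is UB, so the compiler may assume i never wraps
       and widen it to a 64-bit induction variable or a moving pointer,
       which is where the claimed speedup comes from. */
    void scale_signed(float *a, int n, int stride) {
        for (int i = 0; i < n; i += stride)
            a[i] *= 2.0f;
    }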
gpm
For what it's worth, Rust has made all integer arithmetic well-defined unless you explicitly opt in to the alternative on a case-by-case basis. Opting into the alternative is not something I've seen people having to do for performance.
Specifically, in release builds it's defined to be two's complement arithmetic. In debug builds it's defined to panic (in non-Rust speak, basically throw an exception) on overflow, and the language theoretically reserves the right to change the behavior in release builds to do the same, but that hasn't been done because the panic route does actually incur a non-negligible performance penalty.
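For a rough C-flavored picture of the cost being weighed, this is the checked-add pattern, using the GCC/Clang __builtin_add_overflow intrinsic; it's an analogy for what Rust's debug builds do on every arithmetic op, not what rustc literally emits:

    #include <stdio.h>
    #include <stdlib.h>

    /* Every addition becomes an add plus a compare-and-branch; that
       extra branch is the "non-negligible" penalty under discussion. */
    static int checked_add(int a, int b) {
        int r;
        if (__builtin_add_overflow(a, b, &r)) {
            fprintf(stderr, "attempt to add with overflow\n");
            abort();   /* rough analogue of a Rust panic */
        }
        return r;
    }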
steveklabnik
I agree with you that I don't see people opting in for performance, but it's also the case that C relies on this behavior a lot more than Rust does, since it doesn't have iterators. If more Rust programmers wrote "raw" for loops, maybe the overhead would be more significant. I don't know if there's any way to quantify this.
Also, it's not done by default because it's believed to incur the penalty. We didn't try to do any significant analysis here, mostly relying on the widespread belief. If we wanted to go against the grain, we'd have to demonstrate that the belief is incorrect, and that's difficult and time-consuming, and there are other things to spend time on.
Google recently demonstrated pretty convincingly that bounds checks on array accesses in their C and C++ code have a very minor penalty that's absolutely worth it. I wonder if maybe we'll see something similar with this semantic in the future. I think it's lower priority because it's not a safety issue on its own to have the "maybe this, maybe that" behavior, since it's not undefined behavior.
tialaramex
I would imagine it wouldn't be too hard to collect this information, because you can just change the setting on a project and report back whether that gives no significant difference, breaks (indicating your Rust is wrong; you should fix that), is a little slower, or is unacceptably slower.
Actually, that information might help inform people choosing this setting for a new project. I think many assume that Wrapping is the conservative choice, but the Strict / panic semantic is actually going to be better for at least some of those people if they can afford it, so if they knew that, say, 60% of projects find this works out, they might try it and benefit.
As a side effect, a handful of people get to find out that their project has a nasty bug where it actually relies on wrapping for routine functionality but does not correctly use a wrapping type.
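For reference, the setting being toggled in this experiment is a one-line profile option in Cargo.toml:

    [profile.release]
    overflow-checks = true   # give release builds the panic-on-overflow semantic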
uecker
I use signed integers because they give me nice errors on overflow (signed overflow is UB, so sanitizers can trap it), instead of hideously difficult-to-find wraparound bugs.
saurik
Just in case you are saying this to defend why you are using "int" instead of "size_t" (as, otherwise, I am not sure of your point, and I hope you might clarify it): you would then want to use "ssize_t"... the issue I was referring to is not about signed vs. unsigned: it is about using a type which is smaller than a native full-width machine number type (such as when int is 32-bit on a 64-bit machine), as you then have to use suboptimal instructions or even add more code to emulate the wrapping semantics.
uecker
I understood your comment to be about the well-definedness of wraparound, and my point is that I do not want it to be well-defined for signed integers, because that would take a tool for catching bugs away from me. This is true independent of the size of the type. For similar reasons I use int and not ssize_t: I want an error in my lifetime if there is a bug that causes my loop not to terminate. (And yes, sometimes int would be too small.) Otherwise, there are many examples where signed integers lead to better code; here is one: https://godbolt.org/z/szGd1864a
James_K
    char p[1], q[1] = {0};
    uintptr_t ip = (uintptr_t)(p+1);
    uintptr_t iq = (uintptr_t)q;
    if (iq == ip) {
        *(char*)iq = 10;
        print(q[0]);
    }
I am genuinely a little scared of this C program.
shakna
As a C program, it doesn't actually work. [0] It's just C-as-pseudocode for LLVM IR. And LLVM has different semantics than C for what is allowed and denied.
James_K
The only reason it doesn't work is that the stack grows downwards on x86 targets, so p is at a higher address than q. If you swap the variables around, it compiles and runs, printing the value assigned.
shakna
As the compiler is free to reorganise the variable order, that means you're relying on Undefined Behaviour for it to compile and do what you want. That's not something inherent to the way pointers work.
chongli
Section 6.3.2.3 of the C11 standard [1] reads:
1 A pointer to void may be converted to or from a pointer to any object type. A pointer to any object type may be converted to a pointer to void and back again; the result shall compare equal to the original pointer.
...
6 Any pointer type may be converted to an integer type. Except as previously specified, the result is implementation-defined. If the result cannot be represented in the integer type, the behavior is undefined. The result need not be in the range of values of any integer type.
The issue for your code is that on line 2 you cast the pointer (p+1), a pointer to char, to uintptr_t, and it is not a pointer to void (and not a null pointer). So the result of your conversion is implementation-defined, and the result need not be in the range of values of any integer type. Thus an implementation can assume that the result of the comparison (iq == ip) is always false and can even omit the entire if-block.
James_K
> The following type designates an unsigned integer type with the property that any valid pointer to void can be converted to this type, then converted back to pointer to void, and the result will compare equal to the original pointer:
> uintptr_t
Given this, are you suggesting that the code would be valid if the char pointer were first cast to void, like so:
    char p[1], q[1] = {0};
    uintptr_t ip = (void *)(p+1);
    uintptr_t iq = (void *)q;
    if (iq == ip) {
        *(char *)(void *)iq = 10;
        print(q[0]);
    }
Additionally, it seems the code may be valid as-is:
> A pointer to void shall have the same representation and alignment requirements as a pointer to a character type.
> § 6.2.5.28
chongli
You still have a few bugs to fix (need to cast to uintptr_t after casting to void *, need to use printf with a proper format string) but now your code is valid. I believe it's still implementation-defined because different implementations are free to lay out memory as they wish, say by including the use of padding bytes between values, or by changing the location of the stack and the direction it grows (up or down in memory). This means you cannot assume that the array q directly follows the array p in memory (they may be in the opposite order or they may be separated by padding bytes).
I created a working version of your code on Godbolt Compiler Explorer [1]. I tried it with a few different compilers and the code runs. The comparison evaluates to false on every compiler that I could get to work at all. Some don't seem to print anything at all, though the assembly they generate seems like it ought to.
As a general rule of thumb for C, you should assume it is impossible to reach the object pointed to by one pointer via another pointer, unless the second pointer directly aliases the first. You also cannot assume that the integer representation of a pointer corresponds directly to a valid address in memory. Virtual memory (with address translation), fat pointers, segmented memory, etc. are all possible which means the pointer's actual bit pattern should be considered an opaque structure handled by the implementation, not an integer for you to manipulate and then cast back to a pointer.
The main reason for wanting to convert a pointer into a uintptr_t type is to serialize it into some other structure in a one-way fashion, such as to store it in a hash table.
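A sketch of that one-way pattern; the mixing steps are borrowed from the usual 64-bit hash finalizers and the constants are arbitrary:

    #include <stdint.h>
    #include <stddef.h>

    /* Hash a pointer's bits as a key. The integer is never converted
       back into a pointer, so none of the provenance questions above
       arise. */
    size_t hash_ptr(const void *p) {
        uint64_t bits = (uintptr_t)p;   /* widen so the shifts below are valid on 32-bit targets */
        bits ^= bits >> 33;
        bits *= 0xff51afd7ed558ccdULL;
        bits ^= bits >> 33;
        return (size_t)bits;
    }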
varispeed
You also wouldn't write such code in the real world.
grandempire
Certain C optimizations are difficult to implement due to pointers. But that doesn't make pointers themselves complicated.
Most of the C community agrees that strict aliasing assumptions are more trouble than they are worth.
kibwen
Strict aliasing isn't the root problem here, because Rust, which doesn't have C-style strict aliasing, would similarly exhibit the same theoretical miscompilation in the absence of provenance.
grandempire
Strict aliasing is an example of questionable optimizations derived from hard-to-verify assumptions.
serbuvlad
It seems to me like the final optimization is the only one that is not correct and obviously so.
> The final optimization notices that q is never written to, so we can replace q[0] by its initial value 0.
However the line right above is
    *(p+1) = 10;
The type of p is char[], which decays into a pointer, and the type of (p+1) is char *. Because it is char * and not char * restrict, any assignment through such a pointer can be assumed to modify any and all data in a program. Therefore, the assumption that p[0] does not change and can be swapped out is not somehow surprisingly wrong, but simply entirely unfounded.
It is up to the compiler to prove for itself that assignment through pointers does not change data it's interested in from before to after the assignment. (except, of course, in the case of restrict pointers, where the programmer makes this guarantee). This is not very difficult, at least conceptually.
If the assignment was
    *p = 10;
Well, we know that *p means p[0], which is within the bounds of the array, and we know that p and q do not overlap. Therefore, this assignment cannot modify q. In the case of the first assignment, we know that *(p+1) means p[1], which is out of bounds for the array. So we do not know whether an assignment through *(p+1) modifies q, and we cannot assume it does not.
Groxx
It's not always correct, but that's kinda the tradeoff with undefined behavior. It's your responsibility to know the defined spec and stay within it, not the compiler's to prove you did it correctly in all cases.
>Since q and p point to different local variables, a pointer derived from p cannot alias q[0], and hence we know that this write cannot affect the value stored at q[0].
Pointer math outside the array bounds isn't defined, so it can be treated as anything, including "cannot possibly do X, so we can optimize it out". A compiler that preferred to spend memory to improve correctness (e.g. Valgrind) might add padding between variables to prevent this kind of small-OOB action, which would make the output change again, which is why this is undefined and you can't rely on it. A compiler that was feeling moody could put that array at the end of your page of addressable memory, and cause a segfault when you try to write to it.
serbuvlad
In the actual code *(p+1) is only ever accessed if it has been verified to be the same address as q[0], which is an object which we know exists and we have full access to.
The problem is that pointers of the same type to the same address but originating from pointer arithmetic on the pointers obtained by dereferencing different objects have different semantics.
In other words, the following is not semantically a NOP.
    char *p = /* some valid pointer */;
    {
        char x;
        char *q = &x;
        while ((uintptr_t) q != (uintptr_t) p)
            q++;
        p = q;
    }
I believe this goes against the principle of least astonishment (I find it quite astonishing, personally). I believe, even if the standard does not guarantee the expected behavior, that implementations should still guarantee it. And they can do the optimizations they want in other ways, as I have outlined.
Groxx
While I generally agree, and I much prefer my languages with minimal undefined behavior, this gets solidly into Hyrum's Law territory: it would mean practically every significant optimization could break something and therefore cannot be done. C just doesn't have rich enough semantics to allow much else.
You can find some languages that try to adhere to that, but C is extremely clearly not part of that group. It never has been, and it never will be.
imtringued
> The type of p is char[], which decays into a pointer, and the type of (p+1) is char *. Because it is char * and not char * restrict, any assignment through such a pointer can be assumed to modify any and all data in a program.
Out of bounds access is classic UB. The compiler is allowed to assume that you never access an array out of bounds. Since you never access p out of bounds, the effects of any write to p will not affect any "object" that is not p.
>Therefore, the assumption that p[0] does not change and can be swapped out is not somehow surprisingly wrong, but simply entirely unfounded.
I hope you notice your small mistake. It's print(q[0]), not print(p[0]), there is no "p" in the print. It's q[0] and q[0] is not p. The write to p did not affect q, because that would be UB.
>It is up to the compiler to prove for itself that assignment through pointers does not change data it's interested in from before to after the assignment. (except, of course, in the case of restrict pointers, where the programmer makes this guarantee). This is not very difficult, at least conceptually.
The code accesses q[0], not iq. You might say that if you did print(*iq), that the compiler should definitively know that iq and ip are the same and therefore a write to ip would also be a write to iq, but the code uses q[0], which is a memory address that is not proven to be identical to ip.
Then again, what you're hoping for here is that the compiler fixes the UB for you, which is not the point of UB. You're supposed to rely on your godly C programmer instincts to never write code with UB in the first place.
uecker
This statement, "any assignment through such a pointer can be assumed to modify any and all data in a program," is incorrect. You are not allowed to use a pointer to one object to modify another. At least, this is what the C standard (somewhat imprecisely) implies, and compilers generally exploit it for optimization.
Most people agree that the right solution is that the integer round-trip cannot simply be assumed to be a no-op, and that it has the effect that the recreated pointer can be used to point to the other object (as proposed by us in TS 6010). (So in this sense you are right.) The LLVM maintainer stated recently that they are going to fix this.
serbuvlad
> You are not allowed to use a pointer to one object to modify another.
In this context, "you" means the author of a program written in Standard C. Implementations can choose to allow this (and in practice all do, because of the nature of pointers). And obviously, a code transformer for a particular implementation of C would only need to generate code valid for that implementation of C, not valid Standard C code.
uecker
True, an implementation could choose to allow this. But it is a bit questionable to say they do in practice, because most will miscompile your code in this case in various circumstances.
IgorPartola
I have some questions. First, do pointers that do not come from specifying a statically sized array also have provenance? That is, if I say `char* c; size_t n = get_user_input(); c = malloc(n);`, does c have provenance? If, after the above, I created an array like so, `char str[n]`, and then assigned it such that it pointed to the same region of memory that malloc() returned, then would the provenance of both pointers be equal?
Also, in Rust why are pointer to integer casts allowed at all if they lose provenance?
Also, why exactly is one past end of bounds a valid pointer you can calculate?
Also, it seems like you cannot just create a random pointer to an unallocated address, not just that you can’t write to it. Is that correct?
Lastly, is provenance just a fancy way of saying metadata? Is there a reason something like “pointer metadata” or “object metadata” wasn’t used?
Arnavion
>That is, if I say `char* c; size_t n = get_user_input(); c = malloc(n);`, does c have provenance?
Yes.
>If, after the above, I created an array like so, `char str[n]`, and then assigned it such that it pointed to the same region of memory that malloc() returned, then would the provenance of both pointers be equal?
Yes.
>Also, in Rust why are pointer to integer casts allowed at all if they lose provenance?
It would be a breaking change to remove the ability to cast. Instead, Rust added APIs to extract the provenance when casting from ptr to int and then to add the provenance back when casting from int to ptr.
>Also, why exactly is one past end of bounds a valid pointer you can calculate?
It's useful to allow it to make `if (ptr < end)` with exclusive `end` possible. The alternative of only allowing `if (ptr <= end)` with inclusive `end` is not as trivial if it needs to handle empty ranges (the exclusive version can just use `end = start` to denote an empty range).
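In C terms, the familiar exclusive-end idiom: the one-past-the-end pointer is computed and compared but never dereferenced (a sketch; the function name is illustrative):

    #include <stddef.h>

    void zero_bytes(char *start, size_t n) {
        char *end = start + n;              /* one past the end: legal to form */
        for (char *p = start; p < end; p++)
            *p = 0;                         /* never dereferences end */
        /* empty range: n == 0 gives end == start, and the body never runs */
    }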
>Lastly, is provenance just a fancy way of saying metadata?
Pointer metadata in Rust at least refers to a different thing, namely the second half of fat pointers. For slice pointers the fat pointer contains data ptr and length, so the length is the metadata. For trait object pointers the fat pointer contains data ptr and vtable ptr, so the vtable ptr is the metadata.
josephg
> It's useful to allow it to make `if (ptr < end)` with exclusive `end` possible. The alternative of only allowing `if (ptr <= end)` with inclusive `end` is not as trivial if it needs to handle empty ranges (the exclusive version can just use `end = start` to denote an empty range).
As an aside: I find it fascinating that half-open ranges turn out to be, universally as far as I can tell, the right choice for ranges in software. As well as being able to store an empty range, they also allow you to calculate the length of the range by trivially subtracting end - start.
tialaramex
For the growable array and slice types (C++ std::vector and std::span), at a low level we should actually store ptr and length, not start and end; it makes a tiny but measurable performance difference.
This is one of the tiny perf leaks from C++ ABI conservation. IIRC, all three popular std::vector implementations got this wrong decades ago and are now ABI-frozen, so too bad. The design of std::span is correct, though, for whatever that's worth.
SkiFire13
> I find it fascinating that half-open ranges turn out to be - universally as far as I can tell - the right choice for ranges in software.
There is a famous letter from Dijkstra on this topic. https://www.cs.utexas.edu/~EWD/transcriptions/EWD08xx/EWD831...
sfoley
> > If, after the above, I created an array like so, `char str[n]`, and then assigned it such that it pointed to the same region of memory that malloc() returned, then would the provenance of both pointers be equal?
> Yes.
Uh, no. This is flatly untrue. You cannot "assign an array such that it points to a region of memory". Arrays are not pointers, they do not point to anything.
Arnavion
I assumed they meant `char (*str)[5]`, i.e. a pointer to an array, because they wanted to reinterpret the malloc result as an array of the malloc'd length.
SkiFire13
> If, after the above, I created an array like so, `char str[n]`, and then assigned it such that it pointed to the same region of memory that malloc() returned, then would the provenance of both pointers be equal?
I don't think you can perform such assignment.
> Also, in Rust why are pointer to integer casts allowed at all if they lose provenance?
Sometimes you want the numeric address to perform some arithmetic/bitwise operation on it (e.g. for tagged pointers)
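The tagged-pointer idiom, sketched in C (in Rust, the strict-provenance methods such as map_addr / with_addr exist precisely so this can be written without discarding provenance; all names below are illustrative):

    #include <stdint.h>

    typedef struct { uint64_t payload; } node;   /* 8-byte aligned: 3 low bits are free */

    static void *tag_ptr(node *p, uintptr_t tag) {   /* requires tag < 8 */
        return (void *)((uintptr_t)p | tag);         /* stash flags in alignment bits */
    }
    static node *untag_ptr(void *p) {
        return (node *)((uintptr_t)p & ~(uintptr_t)7);
    }
    static uintptr_t tag_of(void *p) {
        return (uintptr_t)p & 7;
    }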
> Also, why exactly is one past end of bounds a valid pointer you can calculate?
Because it's incredibly useful for iterating. It is also used for representing an empty tail slice. It's something inherited from C/C++ and likely won't go away.
> Also, it seems like you cannot just create a random pointer to an unallocated address, not just that you can’t write to it. Is that correct?
You can create such a pointer; it just won't be valid.
> Lastly, is provenance just a fancy way of saying metadata? Is there a reason something like “pointer metadata” or “object metadata” wasn’t used?
It's not metadata that exists at runtime. It only exists at compile time, to allow certain common optimizations.
garaetjjte
>Clearly, one of the three optimizations is incorrect in the sense that it introduced a change in program behavior.
What about declaring that all three optimizations are correct, and that this program invokes UB? More precisely, it compares an object's one-past-the-end pointer with a pointer pointing into an entirely different object. A one-past-the-end pointer is special because numerically it points outside the object. I think declaring that a one-past-the-end pointer may be compared only with itself, with a valid pointer into the same object, or with NULL wouldn't be too unreasonable (comparing a one-past-the-end pointer with a pointer to another object would invoke UB).
afdbcreid
But it does not. It compares integers. Comparing integers is always well-defined.
If you want to declare it UB, you will have to say that the integers are not simple bits; they also "remember" which pointer they came from. This is essentially what we mean when we say that integers have provenance, and it's not a good idea because it doesn't let you do some very common optimizations on integers.
captainmuon
The downside of defining things like pointer provenance in the IL is that it ties the semantics of your IL to C. If you are writing another language and want very different semantics, say a flat memory model where pointers are integer indices into memory, or defined signed overflow, then you will have to be very careful and constantly fight or work around the C semantics. Basically the same downside as when you use C as a compilation target (as, for example, Nim does/did).
nikic
Where possible, undefined behavior at the LLVM IR level is always optional, controlled by various flags, attributes and metadata. Things like signed integer overflow, or even the forward progress guarantee are controlled that way, and admit a wide range of possible language semantics.
Provenance (and to some degree the entire memory model) is the big exception to this. It's a very core part of the IR semantics, and an optimizing compiler without provenance would have to use an entirely different methodology to justify many optimizations. I don't believe anyone has yet been able to demonstrate the feasibility of a highly optimizing compiler without provenance.
Rusky
The IR doesn't have to cater to programmers in the same way, and can support many choices of semantics simultaneously in a single program.
For example, LLVM IR is used to compile C both with and without strict aliasing. The choice is lowered into metadata applied to individual pointer operations. Similarly with signed integer overflow.
Nim could even have gotten some of this benefit when compiling to C if it had been willing to target a specific set of C compilers, rather than standard C.
varispeed
I could never understand why "pointers are complicated" or "hard".
They just point at things. Sure, there are some semantic quirks if you want to go there, to, idk, obfuscate the code or make things more complex than they really are.
I just don't get it.
Been programming in C for decades now and never had any issues with this.
nynx
Depending on what kind of software you tend to work on, you may have written a lot of subtly incorrect C.
varispeed
Yes, I've done my fair share of these, but I don't see them as different from any other incorrect code I may write. I agree that they can sometimes be more challenging to debug, as it may not be immediately obvious that, e.g., I am writing to memory I am not supposed to.
James_K
I don't see how the proposed solution improves matters. Even without the second optimisation, it seems that the third could still occur and cause issues. I suppose the compiler must only recognise the potential for a pointer writing arbitrary data when it is first cast to an integer.
Animats
What this really says is that C giving one-past-the-end pointers meaningful semantics may have been a mistake. So was trying to give aliasing of writable objects meaningful semantics.