Parse, Don't Validate (For C)

28 comments

July 9, 2025

o11c

`_t` should not be used for custom types since it's reserved for future standard types (and/or types declared in a header you might include someday). This does cause real-world problems (`key_t` anyone?).

Gratuitous allocations are gratuitous.

The whole "prevent double free" claim is completely bogus. Setting a variable to `NULL` only works for cases where there is one, obvious, owner, which is not the circumstance under which double free is prone to happening in the first place. Actually preventing double free requires determining ownership of every object, and C sucks at that.

flohofwoe

> `_t` should not be used for custom types since it's reserved for future standard types (and/or types declared in a header you might include someday).

That old thing again...

The _t postfix is only reserved in the POSIX standard, but not in the C standard (and C and POSIX are entirely different things - outside the UNIX bubble at least).

It's unlikely that POSIX changes anymore, but if you get a name collision in a new POSIX version it's still just a simple name collision, and it's up to the user code to fix that.

And it's not like symbol collision problems are limited to POSIX, the world won't end because some piece of C code collides with a symbol used in a dependency, there's always ways to isolate its usage.

Also, it's good practice in the C world to use a namespace prefix for libraries, and such a prefix will also make sure that any _t postfix will not collide with POSIX symbols (the question is of course why POSIX couldn't play nice with the rest of the world and use a posix_ prefix - historical reasons I guess, but then just going ahead and squatting on the _t postfix for all eternity is a bit rich).

shakna

The C23 spec also says:

> A potentially reserved identifier becomes a reserved identifier when an implementation begins using it or a future standard reserves it, but is otherwise available for use by the programmer.

Which, in practice, does mean using _t is likely to cause you problems, as it may become a reserved identifier, when an implementation like POSIX begins using it.

tialaramex

For the _t suffix it is deeply unfortunate that on the one hand C's standard library gives people this idea but then the standard says they mustn't use it.

I understand exactly why it was necessary, but to my mind that highlighted an urgent need to provide actual namespacing so that we don't need to rope off whole categories of identifiers for exclusive use by the stdlib, with the implication that every single library will need to do the same. This should have been addressed last century IMO.

lelanthran

> The whole "prevent double free" claim is completely bogus.

"Completely" means "for all". Are you seriously claiming that "for all instances of double-free, setting the pointer to NULL after freeing it would not help"?

lisper

> "Completely" means "for all".

Not in the case of bogosity. Completely bogus things might occasionally work under some very particular circumstances, but unless those particular circumstances just happen to be the circumstances you actually care about, complete bogosity can still obtain.

> setting the pointer to NULL

There is no such thing as setting a pointer to null. You can set the value of a variable (whose current value is a pointer) to null, but you cannot guarantee that there isn't a copy of the pointer stored somewhere else except in a few very particular circumstances. This is what the GP meant by "setting a variable to `NULL` only works for cases where there is one, obvious, owner". And, as the GP also pointed out, this "is not the circumstance under which double free is prone to happening in the first place." Hence: complete bogosity.
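To make the aliasing point concrete, here is a minimal sketch. It uses a toy `struct block` and `block_free` (both hypothetical, invented for this illustration) in place of real `malloc`/`free`, so the second release can be counted instead of triggering undefined behavior.

```c
#include <stddef.h>

/* Toy stand-in for a heap block: it records misuse instead of crashing. */
struct block {
    int live;          /* 1 while "allocated" */
    int double_frees;  /* incremented when released while not live */
};

/* Toy stand-in for free(): accepts NULL, like the real one. */
void block_free(struct block *b) {
    if (b == NULL) return;
    if (b->live) b->live = 0;
    else b->double_frees++;
}

/* Returns the number of double frees observed. */
int demo_alias_double_free(void) {
    struct block storage = { 1, 0 };
    struct block *owner = &storage;
    struct block *alias = owner;  /* a copy of the pointer stored elsewhere */

    block_free(owner);
    owner = NULL;                 /* the "free, then set to NULL" idiom */

    block_free(owner);            /* harmless: owner is NULL */
    block_free(alias);            /* alias was never cleared: double free */

    return storage.double_frees;
}
```

Clearing `owner` protects only the second call through `owner`. The release through `alias` still goes ahead, which is the multi-owner case the GP was talking about.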

anon-3988

> Here, I’ll build on that by showing how this technique can be used outside of niche academic languages by demonstrating it in a language that is as practical as it is dangerous - C.

The "practical" part really bugged me because the entire post is trying to explain exactly why it is not.

The only way to make C reasonably safe is to encode information via newtype pattern. Wrap `char *` inside a struct that has a proper name and include the size in there as well.

Basically, there should be ZERO pointers except at creation and consumption by outside libraries (open, write, etc)
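A rough sketch of what that could look like, assuming a hypothetical `struct user_name` (the type and both function names are made up for illustration). The bare pointer appears only where the value is created and where it is handed to the standard library.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical newtype: a user name is a pointer plus its length,
   so the size always travels with the data. */
struct user_name {
    const char *data;
    size_t len;
};

/* Creation edge: the only place a bare char * comes in.
   The struct borrows s; the caller keeps it alive. */
struct user_name user_name_from_cstr(const char *s) {
    struct user_name n = { s, strlen(s) };
    return n;
}

/* Consumption edge: the only place the bare pointer goes back out.
   Returns the number of bytes written. */
size_t user_name_write(struct user_name n, FILE *out) {
    return fwrite(n.data, 1, n.len, out);
}
```

Everything between those two edges passes `struct user_name` by value, so a function that expects a user name cannot be handed an arbitrary `char *` by accident.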

pianoben

The trouble I have with this approach (which, conceptually, I agree with) is that it's damned hard to do anything with the parse results. Want to print that email_t? Then you're right back to char*, unless you somehow write your own I/O system that knows about your opaque conventions.

So you say, okay, I'll make an `email_to_string` function. Does it return a copy or a reference? Who frees it? etc, etc, and you're back to square one again. The idea is to keep char* and friends at "the edge", but I've never found a way to really achieve that.

Could just be my limitations as a C programmer, in which case I'd be thrilled to learn better.

8organicbits

In the past I've taken inspiration from strncpy: the caller needs to allocate the memory. For the email example, you'd probably also want a function to tell you the length of the email string, but for other types there are clear size limits. This puts the caller in control of memory allocation, so they may be able to statically allocate, allocate in an arena, or use other methods which promote performance. The static approach is really nice when it works, because there's nothing to free.
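A sketch of that caller-allocates shape, assuming a hypothetical visible `struct email`. The names `email_string_length` and `email_to_string` are illustrative and not from the article.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical email type, for illustration only. */
struct email {
    const char *text;  /* NUL-terminated, assumed already parsed */
};

/* Bytes needed for the string form, excluding the terminating NUL. */
size_t email_string_length(const struct email *e) {
    return strlen(e->text);
}

/* Copies into caller-owned storage. Returns 0 on success, or -1 if
   dst is too small, in which case dst is left untouched. */
int email_to_string(const struct email *e, char *dst, size_t dst_size) {
    size_t len = email_string_length(e);
    if (dst_size < len + 1) return -1;
    memcpy(dst, e->text, len + 1);  /* includes the terminating NUL */
    return 0;
}
```

The caller can pass a stack buffer, an arena slice, or a `malloc`'d block sized with `email_string_length(e) + 1`. The function never allocates, so the "who frees it?" question does not come up.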

lelanthran

Firstly, `parsing` is just a way to say "serialise from a string". The reverse operation can be done for every type you are creating. If the reverse operation (serialise to a string) does not exist in the interface then adding it gives you a single place to catch all the bugs.

I'm thinking of that recent git bug that occurred because the round-trip of `string -> type -> string` had an error (stripping out the CR character). Using a specific type for a value that is being round-tripped means that a bugfix needs to only be made in the parser function. Storing the value as simple strings would result in needing to put your fix everywhere.
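As a generic illustration of keeping both directions next to each other (this is not the git code, and `struct port`, `port_parse`, and `port_format` are hypothetical names), a parse/serialise pair for a TCP port might look like this:

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical parsed type: a port in the range 1..65535. */
struct port {
    unsigned short value;
};

/* string -> type. Returns 1 and fills *out on success, 0 otherwise.
   Rejects empty input, signs, leading whitespace, and trailing junk. */
int port_parse(const char *s, struct port *out) {
    char *end;
    long v;
    if (s[0] < '0' || s[0] > '9') return 0;
    errno = 0;
    v = strtol(s, &end, 10);
    if (errno != 0 || *end != '\0' || v < 1 || v > 65535) return 0;
    out->value = (unsigned short)v;
    return 1;
}

/* type -> string, into caller-owned storage. Returns 1 if it fit. */
int port_format(struct port p, char *dst, size_t dst_size) {
    int n = snprintf(dst, dst_size, "%u", (unsigned)p.value);
    return n >= 0 && (size_t)n < dst_size;
}
```

With the pair in one place, a round-trip bug has exactly two functions it can live in. Note that this sketch normalises input such as `"080"` to `"80"`, so the round trip is exact only for canonical text.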

> The trouble I have with this approach (which, conceptually, I agree with) is that it's damned hard to do anything with the parse results.

You're right - it is damn hard, but that is on purpose; if you're doing something with the email that boils down to "treat it like a `char *`" then the potential for error is large.

If you're forced to add in a new use-case to the `email_t` interface then you have reduced the space of potential errors.

For example:

> Want to print that email_t? Then you're right back to char*, unless you somehow write your own I/O system that knows about your opaque conventions.

is a bug waiting to surface, because it's an email, not a string, and if you decide to print an email that was read as a `char *` you might not get what you expect.

It's all a trade-off - if you want more flexibility with the value stored in a variable, then sure, you can have it but it comes at a cost: some code somewhere almost certainly will eventually use that flexibility to mismatch the type!

If you want to prevent type mismatches, then a lot of flexibility goes out the window.

dwattttt

email_t doesn't have to be opaque; if it's just a visible wrapper around char* then you can still do everything with it as a char* (that is, everything you do with strings).

The benefit is to avoid treating char*s as email_t, not avoiding treating email_t as char*.
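A small sketch of the visible-wrapper variant, assuming a hypothetical `struct email_addr` with a public field (names are illustrative only):

```c
#include <string.h>

/* Hypothetical visible newtype: the field is public on purpose. */
struct email_addr {
    const char *text;
};

/* Takes the wrapper, so passing a bare char * fails to compile.
   Inside, the usual string functions still work on e.text. */
const char *email_addr_domain(struct email_addr e) {
    const char *at = strrchr(e.text, '@');
    return at != NULL ? at + 1 : "";
}
```

Code that needs a string can reach for `e.text` directly (`printf("%s", e.text)`, `strlen(e.text)`, and so on). The trade-off is that a visible struct can also be filled in by hand without going through a parser, so the guarantee rests on convention rather than on the compiler.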

maxbond

(Using a thin wrapper like this to add safety is called the newtype pattern, if anyone wants to know.)

bcrosby95

In the example code they explicitly put the struct in the c file so the char* is not available.

If you're suggesting getting around this by casting an email_t* to char* then I wish you good luck on your adventures. There's some times you gotta do stuff like that but this ain't it.

dwattttt

You could probably get away with the typecast if you satisfy the "common struct prefix" requirement, but that's nowhere near necessary.

While the article does hide the internal char*, that's not strictly necessary to get the benefit of "parse, don't validate". Hide implementation details sure, but not everything is an implementation detail.

jkuli

validating twice is safe. it will not error the second time.

it is against the rules to call someone dumb on this server.

mrkeen

This stuck out:

  email_t theEmail = parseEmail(untrustedInput);
  if (theEmail == PARSE_ERROR) {
    return error;
  }

An email_t is not a parse error, and a parse error is not one of the emails, so this shouldn't compile (and I don't take 'pseudocode' as an excuse).

bmandale

> and I don't take 'pseudocode' as an excuse

They write the non-pseudo variant later. There, the return value is a pointer and the check is against NULL. Which is fairly standard for C code, albeit not always desirable.

mrkeen

Correct, it is fairly standard C code. It is not Parse, Don't Validate.

parkcedar

> and I don't take 'pseudocode' as an excuse

Weird hill to die on, since neither email_t nor PARSE_ERROR were defined in the sample snippets. How do you know PARSE_ERROR is not email_t?

mrkeen

It's the parse-versus-validate hill in this case.

This pseudocode is "Validate" for at least 3 reasons:

Forgetting to check:

  this check is fragile: it’s extremely easy to forget. Because its return value is unused, it can always be omitted, and the code that needs it would still typecheck.

Repeatable/redundant checks:

  First, it’s just annoying. We already checked that the list is non-empty, why do we have to clutter our code with another redundant check?

  Second, it has a potential performance cost. Although the cost of the redundant check is trivial in this particular example, one could imagine a more complex scenario where the redundant checks could add up, such as if they were happening in a tight loop.

Not using the type system:

  Use a data structure that makes illegal states unrepresentable. Model your data using the most precise data structure you reasonably can. If ruling out a particular possibility is too hard using the encoding you are currently using, consider alternate encodings that can express the property you care about more easily. Don’t be afraid to refactor.

> How do you know PARSE_ERROR is not email_t

It has to be for it to compile, right? Which means that email_t is the type which represents both valid and invalid emails. How do you know if it's valid? You remember to write a check for it. Why not just save yourself some keystrokes and use char* instead. This is validate, not parse.

VMG

Because an error is not an email?

exe34

By that logic, a float couldn't store NaN.

Gibbon1

I'm with you, don't do crap like that. Always return a valid object.

  email_t theEmail = parseEmail(untrustedInput);
  if (theEmail.error != PARSE_OK) {
    return error;
  }

mrkeen

This is validate.

You made an email-or-error type and named it email_t and then manually checked it.

PDV returns a non-error-email type from the check method.
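One way to approximate that in C, as a sketch under stated assumptions: `struct checked_email` and both functions are hypothetical, and the "exactly one `@` with text on both sides" rule is a toy stand-in for real email grammar. The status lives in the return value, and the email type itself has no error state.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical type with no error state: no flag, no sentinel value. */
struct checked_email {
    const char *text;
};

/* Toy rule (assumption): exactly one '@' with text on both sides.
   On success fills *out and returns true; on failure *out is untouched. */
bool checked_email_parse(const char *input, struct checked_email *out) {
    const char *at;
    if (input == NULL) return false;
    at = strchr(input, '@');
    if (at == NULL || at == input || at[1] == '\0') return false;
    if (strchr(at + 1, '@') != NULL) return false;
    out->text = input;  /* borrows input; the caller keeps it alive */
    return true;
}

/* Downstream code takes the parsed type and does no re-checking:
   the '@' is known to be present. */
size_t checked_email_local_part_length(struct checked_email e) {
    return (size_t)(strchr(e.text, '@') - e.text);
}
```

C will not stop a caller from ignoring the return value or from building the struct by hand, so this is weaker than what a language with sum types gives you. It does keep the error out of the email type, so functions that accept `struct checked_email` have nothing left to check.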
