Data evolution with set-theoretic types

32 comments

·January 14, 2025

josevalim

Author here. This is probably the article that took me the longest to write, roughly 15 months, and I may still not have explained all concepts with the clarity they deserve and I intended to. If there is any feedback or questions, I'd be glad to answer them!

ddellacosta

This is great stuff! As a Haskell fan who uses Elixir (and Erlang) in my day job, I'm really looking forward to seeing how we can leverage more of the new set-theoretic type features showing up in Elixir. Revisions seem like something I've never really seen before in any of the type systems I've used, so I'm excited to see how it can be leveraged in a real production system. Thanks for the post.

practal

Stuff like this is why I don't like type systems. What you want to do is easy, but it becomes difficult to explain in a sane way (15 months difficult), because you need to work around the limitations of type systems. When you say "set-theoretic types", I hear, "get rid of types, just give me logic".

josevalim

The work to develop the base theory, which this article presents, takes 15 months, but it doesn't take 15 months to read it (and hopefully it won't take as long to use it either). Whenever you use a programming language, you may work with data structures that took months to formalize and several more years to optimize, yet no one is saying "throw our data structures away". Even things like pretty printing and formatting a float have collectively several years of research behind them, yet the API is often a single function call.

Of course, you can still not like types, and making it harder to evolve libraries over time is a good reason. But using the time it takes to formalize its underlying concepts is not a strong argument against them. The goal is that someone will spend this time, precisely so you don't have to. :)

Conscat

I have seen multiple users of one certain popular programming language claim that data structures besides a dynamic length array and a hash table have no useful application.

practal

Oh, I like formalising things, don't get me wrong, and I don't mind spending time on it at all. I just don't like doing it via types, and looking at how much time you spent on what, I rest my case.

ryanschaefer

> what you want to do is easy Easy to implement, hard to get correct. It inverts where you do work in a system. It can be hard to implement robust types but once that’s done it’s easy to know what you are writing is correct.

ddellacosta

> because you need to work around the limitations of type systems.

What limitations of type systems are you talking about?

> When you say "set-theoretic types", I hear, "get rid of types, just give me logic".

A type theory/type system is a logic, so I don't really understand this point. https://en.wikipedia.org/wiki/Type_theory#Logic

practal

Yes, a type theory is a logic, but not a particularly good one. It is limiting to have to under-approximate the mathematical universe via statically typed chunks before being able to talk about the objects in the universe, and the article just describes another instance of why that is problematic in practice.

The alternative I propose is simpler than you would think, and definitely simpler than type theory, but it is also very new: http://abstractionlogic.com . And of course, it doesn't come with a fixed algorithm for type checking, but going forward and in the age of AI, I think that will turn out to be an advantage.

beders

These are great examples of difficulties people will encounter in all popular statically typed languages sooner or later.

I find the solution presented interesting but it is limited to the 3 operations mentioned.

The alternative to this runtime schema checks and treating your data as data - the data you are working with is coming from an external system and needs to be runtime-validated anyways.

Throw in some nil-punting and you can have a very flexible system that is guarded by runtime checks.

manx

I recommend reading: https://lexi-lambda.github.io/blog/2019/11/05/parse-don-t-va...

specialist

IDK...

I've done a lot of data processing. What the kids today call ETL. Plus some parsing (writing grammars).

Here's my working taxonomy:

Parsing for source code and protocols.

Serialization when you control both ends. (Implicitly defined parsing of protocols.)

Scrapping is data processing, more or less. Find and read a value, maybe transmogrify it, write new value.

As for the OC, once the contract made a breaking change, the problem transmuted from serialization to scrapping.

Maybe some future runtime system will successfully enable multiple live updatable schema revisions. (I've been pondering that gordian knot for decades. Anyone else remember "objected oriented databases" hype?)

Until then, we're stuck in data processing hell.

mrkeen

Agreed that it is naive to assume that static types will keep some kind of consistency on the bytes that come in on the wire, but it misses the point.

With static types, you do one transformation from bytes to a proper data type at the boundary.

The static type information propogates bidirectionally deep into the guts of the system.

If you're 10 layers in, and want to invoke getUserId() on an object, you just add it to your type. Now your endpoint can return 400s without you needing to add extra nil-checks somewhere (or multiple places) from layers 1 to 10 (which realistically will turn into 500s).

Vampiero

Realistically you can design a runtime-checked interface that uses option types instead of nulls, thereby guaranteeing static correctness and preventing common pitfalls at the cost of making the usage pattern a bit more verbose.

delusional

> but it misses the point.

I don't think it misses the point as much as it's an entirely different point. When I talk to my coworkers about the uses of these automated marshaling systems at the boundary, they're not talking about correctness or "evolution". They talk about how annoying they find it to type. In the fingers hitting keyboard sense of that word.

They explicitly don't want a change 10 layers deep to result in any error in the parsing layer. They want the magical system to just do something at runtime.

Moosieus

This is some interesting shit, I love it! At least that parts I think I understand :P

> The goal of data versioning is to provide more mechanisms for library authors to evolve their schemas without imposing breaking changes often. Application developers will have limited use of this feature, as they would rather update their existing codebases and their types right away. Although this may find use cases around durable data and distributed systems.

"Hard cutover" is definitely a lot less laborious, but often times incremental updates are necessary - large teams, existing tables and whatnot. To that end, I would foresee a lot of appeal in app dev when applied to database migrations.

airstrike

That was an interesting read. Thanks for sharing and congrats on getting it out there.

Two bits stood out and made me think, somewhat in disagreement:

> If “data_schema” changes a user facing type, such as the Schema type above, in a backwards incompatible manner, its authors have to release a new major version. A new major version can be a fork on the road. Libraries that depend on “data_schema”‘s new version won’t accept the old ones. Old libraries that depend on “data_schema” have to be updated and potentially forced to also release a new major version, cascading the problem. If any package along the way is unmaintained, it can stall and complicate further downstream updates.

I wouldn't call it a fork on the road. More like growing pains. This is to be expected of any dependency you have, and it's on you to keep your own code up-to-date. If the original Schema was indeed incorrect, it warrants fixing, even if it's an edge case.

And then

> In our v1 of Schema, we allowed name to only be a string(). In a statically typed language, we are effectively proving that all uses of Schema has a string() name. When we allow the name to also be nil in v2, that should not introduce bugs in our software yet because, by definition, there is no instance of Schema where the name is nil!

> In other words, all existing code remains correct, but most type systems would immediately flag them as wrong, on the possibility they may receive a nil value. There must be a better way and that’s what we study in this article.

Personally I don't think that what is proposed here is strictly better. I'd rather have to explicitly change my code to match the new shape of the Schema than to have it magically work. If I don't make changes now, I'm just punting that complexity for later and it's going to come back to bite me in some other issue that will be twice as hard to debug.

harrisi

Some initial concerns when thinking about this:

- Understanding how it expands the surface area of both the API and discussion of a struct (and thus also libraries and dependencies of the librar{y,ies}). Now you don't need to ask someone if they're using v1.0 of some package, but also which revision of a struct they're using, which they may not even know. This compounds when a library releases a new breaking major version, presumably - `%SomeStruct{}` for v2 may be different from `%SomeStruct{}` v1r2.

- Documentation seems like it would be more complex, as well. If `%SomeStruct{}` in v1.0 has some shape, and you publish `v1.1`, then realize you want to update the revision of `%SomeStruct{}` with a modified (or added) field, would there be docs for `v1.1` and `v1.1r2`? or would `v1.1` docs be retroactively modified to include `%SomeStruct{}v1.1r2`?

- The first example is a common situation, where an author realizes after it's too late that the representation of some data isn't ideal, or maybe even isn't sufficient. Typically, this is a solved with a single change, or rarely in a couple or few changes. I'm not sure if the complexity is worth it. I understand the desire to not fragment users due to breaking changes, but I'm not sure if this is the appropriate solution.

- How does this interact with (de-)serialization?

I happen to be working with the SQLite file format currently, and I generally really enjoy data formats. It's not exactly the same as this, since runtime data structures are ephemeral technically, but in reality they are not. The typical strategy for any blob format is to hoard a reasonable amount of extra space for when you realize you made a mistake. This is usually fine, since previous readers and writers ought to ignore any extra data in reserved space.

One of the things one quickly realizes when working with various blob formats is the header usually has (as a guess), 25% of it allocated to reserved space. However, looking at many data formats over the last several decades, it's extraordinarily rare to ever see an update to them that uses it. Maybe once or twice, but it's not common. One solution is to have more variable-length data, but this has its own problems.

So, in general, I'm very interested in this problem (which is a real problem!), and I'm also skeptical of this solution. I am very willing to explore these ideas though, because they're an interesting approach that don't have prior art to look at, as far as I know.

EDIT: Also, thanks for all the work to everyone on the type system! I'm a huge fan of it!

josevalim

> This compounds when a library releases a new breaking major version, presumably - `%SomeStruct{}` for v2 may be different from `%SomeStruct{}` v1r2.

You should never reuse the revisions. If you launch a major version, then it means you only support r5 onwards. Do not reset it back to r1.

I am also not sure if the user needs to know the version. Remember that if I am in r5, because of subtyping, the code naturally supports r1, r2, r3, and r4, and those already have to be pretty printed. All of the work here is to "generate automatic proofs" via type signatures, no new capability or requirement is being introduced on the type representation of things.

> Documentation seems like it would be more complex, as well. If `%SomeStruct{}` in v1.0 has some shape, and you publish `v1.1`, then realize you want to update the revision of `%SomeStruct{}` with a modified (or added) field

I don't think documentation is more complex because that's something already factored in today. Functions add new options in new releases. Structs add new fields. And today we already annotate the version they were added. Perhaps this approach can automate some of it instead.

---

Other than that, we should be skeptical indeed. :)

harrisi

As a preamble, the discussion of versioning is difficult since it's abstract; this is difficult to discuss in general.

> You should never reuse the revisions. If you launch a major version, then it means you only support r5 onwards. Do not reset it back to r1.

Hm, so hypothetically a v1.0.0 of a library could begin its life with structs at a greater-than-1 revision number. This seems odd, since a major version bump is able to be breaking, but revisions don't have the same semantic meaning - it just means "it's different from before".

This kind of breaks my mental model of major version bumps. I consider v1 of something to be distinct from v2, where it just happens to be that v2 typically is mostly compatible with v1. However, with revisions, it maintains that connection explicitly. I suppose the answer to that would be to rename the structs and have them start at revision 1, but now we're back at square one, where I could've just done that without revisions to maintain backwards compatibility by introducing a new struct.

I'm not sure if I'm conveying what's odd about that sufficiently. Let me know if I should think on that more and try explaining it again.

> I don't think documentation is more complex because that's something already factored in today. Functions add new options in new releases. Structs add new fields. And today we already annotate the version they were added. Perhaps this approach can automate some of it instead.

The expectation here would be that a struct revision would be a minor version bump, correct? That seems like it kind of breaks semantic versioning, since a user of v1.0.0 wouldn't necessarily have any behavioral changes in v1.1.0, since the structs and users of the structs would be identical, right? It kind of pushes the minor version bump onto the consumer code at their discretion, but not in a typical way. Now if I bump a dependency's minor, I don't necessarily want to bump my own minor - but maybe I do? Especially if I'm a middleman library, where I pass behavior from client code to one of my dependencies.

This is getting a bit confusing to consider the hypothetical situations. The main gut feeling I have about this is the same about versioning in general - it's fine until it's not, which is usually some unknown point in the future when things get complex. Kind of the gist of my hesitation is just that versioning is already a complicated problem, and introducing more variation in that seems like it really needs to be a long experiment with the expectation that it may not work. Unfortunately, the situations in which versioning rears its ugly head is typically only in long-used, complex projects. So I don't really know how a test bed for this could exist to give realistic results.

As a bit of a disclaimer to everything I just said, semver is not the holy grail, in my opinion, and I think it's perfectly reasonable to experiment with it and try alternatives. Maybe part of my issue is just that I haven't entertained other versioning schemes that can sufficiently handle the difference between behavioral and representational changes.

Thanks for the discussion!

josevalim

> This kind of breaks my mental model of major version bumps.

If you want, you can reset the revisions in a major version, as you could rename all modules and change the behavior of all functions, but that's hardly a good idea. Revisions are not any different. If you give a revision a new meaning in a major version, it is going to be as confusing as giving a new implementation to a function: users upgrading will have to figure out why something has subtly changed, instead of a clear error that something has gone away.

> The expectation here would be that a struct revision would be a minor version bump, correct?

We could make either minor and major versions (as described by semver) work. For example, if we want to allow new revisions to be minor versions, we could make it so every time you install a dependency, we automatically lock all structs it provides. This way, if a new version is out, you have to explicitly upgrade any new revision too.

Of course, you could also release a new major version of the library once you introduce a new revision and _that's fine_. The goal is not to avoid major versions of the library that defines a struct, the goal is to avoid forcing its dependents to release major versions, which has a cascading effect in the ecosystem. Putting in Elixir terms, I want to be able to release Phoenix v2 without forcing LiveView to immediately become v2. LiveView should be able to stay on v1.x and say it supports both Phoenix v1 and v2 at once. But in most typed languages, if you change a field definition, it is game over.

The guarantees you need to provide are not that many: as long as two revisions overlap at one point in time, you can offer a better upgrading experience to everyone, instead of assuming *all code must be updated at once*.

Thanks for the convo!

harrisi

I forgot to mention, the typical situation for these things for added elements is to add `_ex` or the like, which is not a great solution. You can see this in various places in Erlang, especially around erl_interface and related aspects.

On the flip side, what if you realize a data structure needs to remove elements? Say revision 2 adds a field, but revision 3 removes it?

Groxx

Yea, this is an issue rather near and dear to my heart (due to pain). I very much appreciate strong and safe types, but it tends to mean enormous pain when making small obvious fixes to past mistakes, and you really don't want to block those. It just makes everything harder in the long term.

As an up-front caveat for below: I don't know Elixir in detail, so I'm approaching this as a general type/lang-design issue. And this is a bit stream-of-consciousness-y and I'm not really seeking any goal beyond maybe discussion.

---

Structural sub-typing with inference (or similar, e.g. perhaps something fancier with dependent types) seems like kinda the only real option, as you need to be able to adapt to whatever bowl of Hyrum Slaw[1] has been created when you weren't looking. Allowing code that is still provably correct to continue to work without modification seems like a natural fit, and for changes I've made it would fairly often mean >90% of users would do absolutely nothing and simply get better safety and better behavior for free. It might even be an ideal end state.

I kinda like the ergonomics of `revision 2` here, it's clear what it's doing and can provide tooling hints in a rather important and complicated situation... but tbh I'm just not sure how much this offers vs actual structural typing, e.g. just having an implicit revision per field. With explicit revisions you can bundle interrelated changes (which is quite good, and doing this with types alone ~always requires some annoying ceremony), but it seems like you'll also be forcing code to accept all of v2..N-1 to get the change in vN because they're not independent.

The "you must accept all intermediate changes" part is in some ways natural, but you'll also be (potentially) forcing it on your users, and/or writing a lot of transitioning code to avoid constraining them.

I'm guessing this is mostly due to Elixir's type system, and explicit versions are a pragmatic tradeoff? A linear rather than combinatoric growth of generated types?

>It is unlikely - or shall we say, not advisable to - for a given application to depend on several revisions over a long period of time. They are meant to be transitory.

An application, yes - applications should migrate when they are able, and because they are usage-leaf-nodes they can do that without backwards compatibility concerns. But any library that uses other libraries generally benefits from supporting as many versions as possible, to constrain the parent-library's users as little as possible.

It's exactly the same situation as you see in normal SAT-like dependency management: applications should pin versions for stability, libraries should try to allow as broad of a range as possible to avoid conflicts.

>Would downcasting actually be useful in practice? That is yet to be seen.

I would pretty-strongly assume both "yes" and "it's complicated". For end-users directly touching those fields: yes absolutely, pick your level of risk and live with it! This kind of thing is great for isolated workarounds and "trust me, it's fine" scenarios, code has varying risk/goals and that's good. Those happen all the time, even if nobody really likes them afterward.

But if those choices apply to all libraries using it in a project... well then it gets complicated. Unless you know how all of them use it, and they all agree, you can't safely make that decision. Ruby has refinements which can at least somewhat deal with this, by restricting when those decisions apply, and Lisps with continuations have another kind of tool, but most popular languages do not... and I have no idea how possible either would be in Elixir.

---

All that probably summarizes as: if we could boil the ocean, would this be meaningfully different than structural typing with type inference, and no versioning? It sounds like this might be a reasonable middle-ground for Elixir, but what about in general, when trying to apply this strategy to other languages? And viewed through that lens, are there other structural typing tools worth looking at?

[1] https://en.wiktionary.org/wiki/Hyrum%27s_law

josevalim

Thank you for the comments. Your questions at the end resonate a lot with what I have been asking myself!

> Structural sub-typing with inference

Can we have structural sub-typing with inference that is relatively fast and will generate reasonable error reports? We have been bulking up the amount of inference for dynamic code in our system and sometimes the inferred types get quite large, which can make trouble shooting daunting. In any case, better inference is a win even without taking data evolution into account.

> but tbh I'm just not sure how much this offers vs actual structural typing

The bulk of the work is definitely achieved by structural typing. The revisions help generate automated type signatures that guarantee you have not widened the output for old versions. If all you have is inference, you could accidentally introduce breaking changes?

I guess there may be some automated way where we could check old inferred types against new ones but I am not sure how it could be done without annotating that _something_ has changed?

> The "you must accept all intermediate changes" part is in some ways natural, but you'll also be (potentially) forcing it on your users, and/or writing a lot of transitioning code to avoid constraining them.

Theoretically, you do not need to support all revisions, only support more than one revision at once. A library that provides r1-r2-r3 can be supported downstream through the pairs `r1-r2` and then `r2-r3` and that should hopefully provide a smoother upgrade experience to everyone compared to `r1`, `r2`, and `r3` being part of distinct major versions.

Vampiero

> It just makes everything harder in the long term.

Opinions. I'd gladly refactor a C# legacy monster because I can follow the compiler errors and reason about the shape of the types and how they interact in an abstract, codeless way. I can modify 200+ files and not lose track of what I'm doing. I can apply the latest language features to make the code more concise and expressive while preserving the semantics. I can use my IDE to modify every single reference to some type or object across the entire solution in a couple of keystrokes without worrying that my search and replace regex will match something it shouldn't. I can simplify and generalize redundancies. I can transform everything in my head and make sure it's internally consistent. I can even use reflection to do advanced metaprogramming and make sure that it, too, is consistent (with a couple of tests). If it compiles, chances are it's going to work.

The more static typing, the better, because that means that most of the logic is baked in the abstractions and I don't have to run the debugger.

I would never refactor a legacy Javascript application because I'm not a masochist. There are many juniors who don't yet know what a pain it is to work on large dynamic projects so I'll let them have all the fun while I use my time productively. I haven't done a null check in years (in my personal projects; at my workplace they don't even know what an option type is).

But I'll take Prolog. It's dynamic, but unification is basically type theory anyway. You can declare all your constraints and reason about them clearly. As God intended for us applied mathematicians.

So your claim that static typing makes things harder in the long term is alien to me. The real issue is that companies are afraid of refactoring because the client doesn't see the immediate benefit, so they simply don't do it, and then 10 years later they just redo the project from scratch in the latest shiny language (while reinventing the whole domain mapping layer because the old engineers were probably idiots since they wrote such unmaintanable code -- let's ignore the fact that they, too, never had time to refactor and dogfood).

Groxx

My claim isn't that static typing is harder in general:

>... it tends to mean enormous pain when making small obvious fixes to past mistakes, and you really don't want to block those. It just makes everything harder in the long term.

It's that not making fixes is harder in the long term. That means either manual migrations or automated ones (e.g. API-compatible ones are trivially automation-friendly). And the more difficult you make the migration, the less likely people are to migrate (which I don't think is a contentious claim).

Static typing is great, but it does generally mean more full-automation-unfriendly migrations, because it's forcing you to see and make decisions on more kinds of changed assertions. Generally that's a very good thing, but it does add some struggles that looser typing can simply ignore and sometimes still succeed.

HN

Data evolution with set-theoretic types

Data evolution with set-theoretic types