Data evolution with set-theoretic types
12 comments
January 14, 2025 · josevalim
practal
Stuff like this is why I don't like type systems. What you want to do is easy, but it becomes difficult to explain in a sane way (15 months difficult), because you need to work around the limitations of type systems. When you say "set-theoretic types", I hear, "get rid of types, just give me logic".
josevalim
The work to develop the base theory, which this article presents, took 15 months, but it doesn't take 15 months to read it (and hopefully it won't take as long to use it either). Whenever you use a programming language, you may work with data structures that took months to formalize and several more years to optimize, yet no one is saying "throw our data structures away". Even things like pretty printing and formatting a float have collectively several years of research behind them, yet the API is often a single function call.
Of course, you can still not like types, and making it harder to evolve libraries over time is a good reason. But the time it takes to formalize their underlying concepts is not a strong argument against them. The goal is that someone will spend this time precisely so you don't have to. :)
Conscat
I have seen multiple users of a certain popular programming language claim that data structures besides a dynamic-length array and a hash table have no useful application.
practal
Oh, I like formalising things, don't get me wrong, and I don't mind spending time on it at all. I just don't like doing it via types, and looking at how much time you spent on what, I rest my case.
ryanschaefer
> what you want to do is easy

Easy to implement, hard to get correct. It inverts where you do the work in a system. It can be hard to implement robust types, but once that's done, it's easy to know that what you are writing is correct.
beders
These are great examples of difficulties people will encounter in all popular statically typed languages sooner or later.
I find the solution presented interesting, but it is limited to the three operations mentioned.
The alternative to this is runtime schema checks and treating your data as data - the data you are working with is coming from an external system and needs to be runtime-validated anyway.
Throw in some nil-punting and you can have a very flexible system that is guarded by runtime checks.
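Roughly something like this in plain Elixir (the `Order` struct, field names, and defaults are made up for illustration, with no library assumed):

```elixir
# Hypothetical sketch: validate external data at the runtime boundary instead
# of relying on compile-time types. Struct and field names are invented.
defmodule Order do
  defstruct [:id, :total, :currency]

  # Accept a plain map from an external system and check its shape at runtime.
  def from_external(%{"id" => id, "total" => total} = params)
      when is_binary(id) and is_number(total) do
    # Missing optional fields fall back to a default rather than failing.
    {:ok, %Order{id: id, total: total, currency: Map.get(params, "currency", "USD")}}
  end

  def from_external(other), do: {:error, {:invalid_order, other}}
end
```

Anything that fails the check is rejected at the edge, so the rest of the system can keep assuming the shape holds.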
Moosieus
This is some interesting shit, I love it! At least the parts I think I understand :P
> The goal of data versioning is to provide more mechanisms for library authors to evolve their schemas without imposing breaking changes often. Application developers will have limited use of this feature, as they would rather update their existing codebases and their types right away. Although this may find use cases around durable data and distributed systems.
"Hard cutover" is definitely a lot less laborious, but often times incremental updates are necessary - large teams, existing tables and whatnot. To that end, I would foresee a lot of appeal in app dev when applied to database migrations.
Groxx
Yea, this is an issue rather near and dear to my heart (due to pain). I very much appreciate strong and safe types, but it tends to mean enormous pain when making small obvious fixes to past mistakes, and you really don't want to block those. It just makes everything harder in the long term.
As an up-front caveat for below: I don't know Elixir in detail, so I'm approaching this as a general type/lang-design issue. And this is a bit stream-of-consciousness-y and I'm not really seeking any goal beyond maybe discussion.
---
Structural sub-typing with inference (or similar, e.g. perhaps something fancier with dependent types) seems like kinda the only real option, as you need to be able to adapt to whatever bowl of Hyrum Slaw[1] has been created when you weren't looking. Allowing code that is still provably correct to continue to work without modification seems like a natural fit, and for changes I've made it would fairly often mean >90% of users would do absolutely nothing and simply get better safety and better behavior for free. It might even be an ideal end state.
I kinda like the ergonomics of `revision 2` here, it's clear what it's doing and can provide tooling hints in a rather important and complicated situation... but tbh I'm just not sure how much this offers vs actual structural typing, e.g. just having an implicit revision per field. With explicit revisions you can bundle interrelated changes (which is quite good, and doing this with types alone ~always requires some annoying ceremony), but it seems like you'll also be forcing code to accept all of v2..N-1 to get the change in vN because they're not independent.
The "you must accept all intermediate changes" part is in some ways natural, but you'll also be (potentially) forcing it on your users, and/or writing a lot of transitioning code to avoid constraining them.
I'm guessing this is mostly due to Elixir's type system, and explicit versions are a pragmatic tradeoff? A linear rather than combinatoric growth of generated types?
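Roughly, the distinction I mean (made-up shapes, not the article's revision syntax):

```elixir
# Rough illustration: with structural typing, a consumer only depends on the
# fields it actually reads, so it can accept old and new shapes side by side.
defmodule Report do
  # Newer shape: :total was split into :amount and :currency.
  def summarize(%{id: id, amount: amount, currency: cur}), do: "#{id}: #{amount} #{cur}"
  # Older shape: a single :total field.
  def summarize(%{id: id, total: total}), do: "#{id}: #{total}"
end
```

With explicit, bundled revisions, opting into revision N means also taking every interrelated change from the revisions in between, even ones a given consumer never reads - which is exactly the linear-vs-combinatorial trade-off I'm wondering about.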
>It is unlikely - or shall we say, not advisable to - for a given application to depend on several revisions over a long period of time. They are meant to be transitory.
An application, yes - applications should migrate when they are able, and because they are usage-leaf-nodes they can do that without backwards compatibility concerns. But any library that uses other libraries generally benefits from supporting as many versions as possible, to constrain the parent-library's users as little as possible.
It's exactly the same situation as you see in normal SAT-like dependency management: applications should pin versions for stability, libraries should try to allow as broad of a range as possible to avoid conflicts.
>Would downcasting actually be useful in practice? That is yet to be seen.
I would pretty-strongly assume both "yes" and "it's complicated". For end-users directly touching those fields: yes absolutely, pick your level of risk and live with it! This kind of thing is great for isolated workarounds and "trust me, it's fine" scenarios, code has varying risk/goals and that's good. Those happen all the time, even if nobody really likes them afterward.
But if those choices apply to all libraries using it in a project... well then it gets complicated. Unless you know how all of them use it, and they all agree, you can't safely make that decision. Ruby has refinements which can at least somewhat deal with this, by restricting when those decisions apply, and Lisps with continuations have another kind of tool, but most popular languages do not... and I have no idea how possible either would be in Elixir.
---
All that probably summarizes as: if we could boil the ocean, would this be meaningfully different than structural typing with type inference, and no versioning? It sounds like this might be a reasonable middle-ground for Elixir, but what about in general, when trying to apply this strategy to other languages? And viewed through that lens, are there other structural typing tools worth looking at?
josevalim
Thank you for the comments. Your questions at the end resonate a lot with what I have been asking myself!
> Structural sub-typing with inference
Can we have structural sub-typing with inference that is relatively fast and will generate reasonable error reports? We have been bulking up the amount of inference for dynamic code in our system and sometimes the inferred types get quite large, which can make troubleshooting daunting. In any case, better inference is a win even without taking data evolution into account.
> but tbh I'm just not sure how much this offers vs actual structural typing
The bulk of the work is definitely achieved by structural typing. The revisions help generate automated type signatures that guarantee you have not widened the output for old versions. If all you have is inference, you could accidentally introduce breaking changes?
I guess there may be some automated way where we could check old inferred types against new ones, but I am not sure how it could be done without annotating that _something_ has changed?
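For example, a hypothetical case (plain functions, not the revision mechanism from the article):

```elixir
defmodule User do
  # Originally every clause returns a binary, so callers are inferred against
  # binary() and may call String functions on the result.
  def display_name(%{name: name}) when is_binary(name), do: name

  # A later, seemingly harmless clause for missing names silently widens the
  # inferred return type to binary() | nil. Nothing marks that the contract
  # changed; a recorded signature for the previous revision would flag it.
  def display_name(%{name: nil}), do: nil
end
```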
> The "you must accept all intermediate changes" part is in some ways natural, but you'll also be (potentially) forcing it on your users, and/or writing a lot of transitioning code to avoid constraining them.
Theoretically, you do not need to support all revisions, only more than one revision at a time. A library that provides r1-r2-r3 can be supported downstream through the pairs `r1-r2` and then `r2-r3`, and that should hopefully provide a smoother upgrade experience for everyone compared to `r1`, `r2`, and `r3` being part of distinct major versions.
josevalim
Author here. This is probably the article that took me the longest to write, roughly 15 months, and I may still not have explained every concept with the clarity it deserves and that I intended. If you have any feedback or questions, I'd be glad to answer them!