Type-constrained code generation with language models
133 comments · May 13, 2025
homebrewer
Hejlsberg mentioned the ability to quickly provide accurate type information to LLMs as one of the reasons for rewriting tsc in Go:
https://youtu.be/10qowKUW82U?t=3186
tough
But isn't TypeScript already a typed language to begin with?
habitue
This is about the speed with which the compiler can tell an LLM whether a particular thing type-checks. The TypeScript compiler is much slower than its Go rewrite.
tough
okay so basically faster compilation means a tighter feedback loop for the LLM to -know- whether the code compiles or not? interesting
is Go faster than Rust?
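For illustration, a minimal sketch of the loop being described, in TypeScript. The "llm" client here is hypothetical; the only real tool invoked is tsc --noEmit, whose speed is exactly what bounds how tight the loop can be:

    import { execFileSync } from "node:child_process";
    import { writeFileSync } from "node:fs";

    // Hypothetical LLM client; stands in for whatever API the lab uses.
    declare const llm: { complete(prompt: string): Promise<string> };

    // Run the TypeScript compiler on a candidate file; return its
    // diagnostics on failure, or null if the code type-checks.
    function typecheck(code: string): string | null {
      writeFileSync("candidate.ts", code);
      try {
        execFileSync("npx", ["tsc", "--noEmit", "--strict", "candidate.ts"], { encoding: "utf8" });
        return null;
      } catch (err: any) {
        return String(err.stdout); // tsc reports errors on stdout
      }
    }

    // Generate, check, and feed compiler errors back until the code compiles.
    async function generateUntilCompiles(prompt: string, maxTries = 3): Promise<string> {
      let code = await llm.complete(prompt);
      for (let i = 0; i < maxTries; i++) {
        const errors = typecheck(code);
        if (errors === null) break; // compiles cleanly
        code = await llm.complete(`${prompt}\n\nFix these type errors:\n${errors}`);
      }
      return code;
    }

Each iteration pays the full cost of a compiler run, which is why a faster tsc makes the whole loop cheaper.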
energy123
This is what I'd consider doing if I were a small AI lab. Don't try to build a frontier LLM that beats all benchmarks; try to make the world's best LLM at one programming language. Build an RL pipeline that puts all your resources into making the LLM the best at that language. Even better if there's a dearth of human-created training data on GitHub, since all your competitors will be bad at it.
Google somewhat did this with JavaScript in their latest Gemini 2.5 Pro release. But what about doing it for a smaller language? Google isn't going to do that, but there is still a lot of demand.
eigenspace
I'm not saying this is a bad idea, but it does sound like a rather risky prospect. You're basically proposing a bet against the ability of LLMs to generalize across programming languages and to embed concepts at a deeper level than syntax.
Many people do think this, but I'm not sure many of them are running AI labs.
harperlee
From my experience with less-used languages (Clojure on one hand, Code_Aster's Python on the other), LLMs may be able to generalize syntax, but availability of APIs, functions, etc. is something you can't solve by generalizing. Or more precisely, you can generalize, but that means hallucinating non-existent tools.
unshavedyak
Would non-generalizing solve this issue for libraries, though? I.e., a lot of models produce reasonable code for me, but I almost always care about usage of libraries, and that's where they get the wrong version, hallucinate, etc.
JohnMakin
General-purpose LLMs fail really hard at this in domains like Terraform. There can be drastic differences in syntax and semantics across the massive matrix of Terraform versions and provider versions, and they've proven absolutely terrible at navigating that, even if you specify versions explicitly. Even worse, and probably what exacerbates it, this version matrix changes at a much faster pace than most programming languages introduce large changes.
robrenaud
Meta synthetically generated lots of PHP from Python for Llama 3 for training purposes. Meta writes a crazy amount of PHP internally. Translation tends to be far easier for LLMs than unconstrained generation, but if you can translate and filter a large amount of code, you can learn to generate. If you also translate and run the unit tests, you get another layer of error checking.
https://arxiv.org/abs/2407.21783
See figure 8.
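A rough sketch of that translate-and-filter pipeline, with hypothetical llm.translate and passesUnitTests helpers standing in for the actual tooling (Meta's real pipeline is described in the paper):

    // Hypothetical helpers, for illustration only.
    declare const llm: {
      translate(source: string, opts: { from: string; to: string }): Promise<string>;
    };
    declare function passesUnitTests(code: string): Promise<boolean>;

    // Translate each source file and keep only translations whose unit
    // tests still pass, yielding a filtered synthetic training corpus.
    async function buildSyntheticCorpus(pythonSources: string[]): Promise<string[]> {
      const corpus: string[] = [];
      for (const src of pythonSources) {
        const candidate = await llm.translate(src, { from: "python", to: "php" });
        if (await passesUnitTests(candidate)) corpus.push(candidate);
      }
      return corpus;
    }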
tough
it feels to me like most of the real usage of AI right now is in coding, so a small lab that decided to go all in on just code-gen would at least have the differentiator of a narrower field in which to beat the bigger incumbents doing it all?
I dunno tho.
Big AI labs also have their own agendas, and would rather keep scaling and growing than serve a rather smaller real market?
Once you're in real-usage territory, you can no longer use made-up numbers to justify future growth.
eigenspace
Again though, my point was just that it's not actually clear that you can do better than these big models by taking a narrower focus. I'm saying that the things these big LLMs learn about other languages probably do have utility when applied even to quite niche languages.
If you take some niche language and build an LLM from scratch that's hyperspecialized on that language, will that LLM actually outperform some big LLM that's trained on all the programming resources out there, and all the blogs, forum conversations, stack overflow posts on all those languages, and then learns to generalize that information and apply it to your niche language?
One of the things that LLMs seem to excel at is taking information from one context, transforming it and applying it to another context.
Drakim
It makes sense to specialize an LLM on one programming language, dedicating all of its intellectual space to that one domain, but on the flip side I wonder how much an LLM's sharpness and reasoning capabilities are improved by having more data to train on, even if it's in the wrong programming language.
As a developer, I certainly think my programming skills in a specific language were improved by knowing other languages I could contrast and compare with.
tough
You could just have specialized fine-tunes for each programming language that are only called when writing code; a more general, bigger model could pass the plan/pseudocode to them.
nurettin
Using the language itself isn't the challenge for LLMs; they do that with a very high success rate. I haven't seen an LLM make a syntax error in months. Calling the right functions with the correct parameters is the challenge your hypothetical AI lab would have to solve (or half-ass while still showing great benchmark results).
jiggawatts
This was an obvious next step. Most current products can only restrict token prediction to valid JSON, or a specific JSON schema at best. There's no reason that should be the only grammar available for constrained-output mode.
The real challenge will be to make this detect and switch languages automatically. For example, a snippet of code could include a LaTeX formula in a comment and SQL in a string literal. There are many more examples, such as regex inside a shell script, and so on.
The obvious next step after that is backtracking. It's possible to emit a token that is valid but that allows no further valid completions; in other words, the model can paint itself into a corner. To my knowledge, no current online LLM service uses any kind of backtracking; they run in append-only ("forwards") mode.
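For illustration, a minimal TypeScript sketch of a single grammar-constrained decoding step. The PrefixOracle and the scores map are hypothetical stand-ins; real implementations work over the tokenizer vocabulary with incremental parser state rather than re-checking whole strings, and the strength of the oracle is exactly what decides whether backtracking is ever needed:

    // Hypothetical oracle: does `text` extend to at least one complete,
    // valid program? (An oracle that only checks validity-so-far, rather
    // than extensibility, is what lets the model paint itself into a corner.)
    type PrefixOracle = (text: string) => boolean;

    // One decoding step: mask every token whose addition leaves no way to
    // finish the program, then pick greedily among the survivors.
    function constrainedStep(
      output: string,
      vocabulary: string[],
      scores: Map<string, number>, // model scores per candidate token
      extendsToValidProgram: PrefixOracle,
    ): string {
      const allowed = vocabulary.filter((tok) => extendsToValidProgram(output + tok));
      if (allowed.length === 0) {
        throw new Error("dead end: a backtracking search would be needed here");
      }
      return allowed.reduce((best, tok) =>
        (scores.get(tok) ?? -Infinity) > (scores.get(best) ?? -Infinity) ? tok : best,
      );
    }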
tough
SRLCG: Self-Rectified Large-Scale Code Generation with Multidimensional Chain-of-Thought and Dynamic Backtracking
https://arxiv.org/abs/2504.00532
IterGen: Iterative Semantic-aware Structured LLM Generation with Backtracking
https://arxiv.org/abs/2410.07295
ROCODE: Integrating Backtracking Mechanism and Program Analysis in Large Language Models for Code Generation
pizza
Another one: SequenceMatch: Imitation Learning for Autoregressive Sequence Modelling with Backtracking https://arxiv.org/abs/2306.05426
There was also an HN thread: https://news.ycombinator.com/item?id=36425375
foota
I believe Microsoft introduced a framework that did the sort of backtracking you're suggesting. I'm not sure how much traction it got.
helltone
The backtracking idea is interesting; could diffusion maybe help? At some point it turns into SAT solving.
nielstron
re detecting and switching language: you could run several constraint systems in parallel and switch as soon as one of them rejects the input and another accepts it.
re backtracking: a core part of this paper is ensuring a prefix property, i.e. that there is always a legitimate completion, so the model cannot "corner" itself!
research still needs to be done on which kinds of languages and language features this prefix property can be ensured for.
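For the language-switching half, a minimal sketch of what running constraint systems in parallel could look like; the PrefixChecker interface here is made up:

    // Hypothetical interface: each language backend reports whether the
    // text emitted so far is still a valid prefix of some complete document.
    interface PrefixChecker {
      language: string; // e.g. "typescript", "sql", "latex"
      acceptsPrefix(text: string): boolean;
    }

    // A language is ruled out as soon as its checker rejects; generation
    // "switches" to whichever languages still accept the prefix.
    function viableLanguages(checkers: PrefixChecker[], text: string): string[] {
      return checkers.filter((c) => c.acceptsPrefix(text)).map((c) => c.language);
    }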
_jayhack_
Also worth checking out multilspy, effectively a Python wrapper around multiple LSPs: https://github.com/microsoft/multilspy
It has been used in multiple similar publications, including "Guiding Language Models of Code with Global Context using Monitors" (https://arxiv.org/abs/2306.10763), which uses static analysis beyond the type system to filter out e.g. invalid variable names, invalid control flow, etc.
nielstron
Yes, this work is super cool too! Note that LSPs cannot guarantee resolving the types we need to ensure the prefix property, which is what we leverage to avoid backtracking and generation loops.
LostBenjamin
As an author of this paper, I am very excited to see the great discussion here!
Several people mentioned the generation-compilation-fixing loop. Just want to point out that our approach works not only for the generation step but also for the fixing step, because fixing is essentially asking the LLM to generate a new version of the code. The paper has a "repair" experiment to demonstrate this, and our approach achieves a significant gain there, i.e., a 37% relative improvement in functional correctness.
yewW0tm8
37% gain relative to what? What percent of generated functions were incorrect?
LostBenjamin
Compared to vanilla LLM decoding.
tough
Thank you for your research, really impressive work!
ArcaneMoose
I think TypeScript is uniquely positioned to be the optimal language for LLMs. Tons of training data (benefiting from all the JS examples as well), plus the structure of types for LLMs to follow and tooling to enforce them.
johnmw
Those who agree might be interested in "Introducing TypeChat" by Anders Hejlsberg + others (2023) [1]
[1]: https://microsoft.github.io/TypeChat/blog/introducing-typech...
dcsan
Wish this project had more traction. TypeChat with type checking could generate lots of synthetic data for model training too.
pram
LLMs work well with any static analysis tool. I frequently instruct Claude to use tools like “go vet” and “deadcode” when it goes on a tear, writes a bunch of broken trash, and declares mission accomplished.
koakuma-chan
> LLMs work well with any static analysis tool.
tsc error messages are so bad that every time my LLM sees one of those "SomeType is not assignable to SomeLongAssTypeDontEvenTryToUnderstandWhatsGoingOnHere<<<<>>>>>>>>>>>>>>>>>>>>" errors, it just gives up and casts to any. Goes for Python too.
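A contrived example of the failure mode: the underlying mismatch is a single field, but the reported type is a wall of instantiated generics, and the path of least resistance is a cast:

    // Deeply wrapped config type; error messages name the whole expansion.
    type Wrapped<T> = { [K in keyof T]: { value: T[K]; validate(v: T[K]): boolean } };

    interface Settings { retries: number; timeout: number }

    declare const fromJson: Wrapped<{ retries: string; timeout: number }>;

    // Without the cast: error TS2322, "Type 'Wrapped<{ retries: string; ... }>'
    // is not assignable to type 'Wrapped<Settings>'", plus several lines of
    // elaboration; the "fix" the model reaches for instead:
    const settings: Wrapped<Settings> = fromJson as any;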
floydnoel
ha, that's always been my biggest gripe with ts
miki123211
And unlike many other languages, TypeScript types are extremely expressive.
For example, you can write a function that takes an object received from an API that uses snake_cased keys, and returns that same object, but with camelCased keys instead. This is not some "special case" in the TypeScript compiler; the ability to do this emerges naturally from TypeScript's features. I don't know any other language that can do this.
Most people don't know enough TS to use these things effectively, but I think one could train an LLM to be very good at them. The combination of LLMs placing such advanced constraints on themselves, and then generating code based on those constraints, seems extremely powerful.
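For the curious, a self-contained sketch of the snake_case-to-camelCase example (the ApiUser shape here is made up); the key remapping is just template literal types plus the built-in Capitalize:

    // Rewrite "created_at" to "createdAt" at the type level, recursively.
    type SnakeToCamel<S extends string> =
      S extends `${infer Head}_${infer Tail}`
        ? `${Head}${Capitalize<SnakeToCamel<Tail>>}`
        : S;

    // Remap every key of an object type from snake_case to camelCase.
    type CamelCaseKeys<T> = { [K in keyof T as SnakeToCamel<K & string>]: T[K] };

    interface ApiUser { user_id: number; created_at: string; display_name: string }
    // Resolves to { userId: number; createdAt: string; displayName: string }
    type User = CamelCaseKeys<ApiUser>;

    // Runtime counterpart, typed so callers get the camelCased shape back.
    function camelCaseKeys<T extends Record<string, unknown>>(obj: T): CamelCaseKeys<T> {
      const out: Record<string, unknown> = {};
      for (const [k, v] of Object.entries(obj)) {
        out[k.replace(/_([a-z])/g, (_, c: string) => c.toUpperCase())] = v;
      }
      return out as unknown as CamelCaseKeys<T>;
    }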
rfoo
> Tons of training data (benefiting from all the JS examples as well)
More != better.
AaronAPU
I can’t be the only one who hopes this was a joke.
AnthonBerg
I believe that the rutabaga is the perfect material to make sausages out of as it has proven as excellent swine fodder with widespread adoption!
(Please forgive me the extreme disrespect put forth in the above statement! It is not the intention to show disrespect; I… am quite the rutabaga enjoyer in all respects, you know? I certainly include myself within the absurdity and it is with love.)
OutOfHere
There are languages that constrain types a lot more tightly than TypeScript, e.g. Kotlin, Rust, and Haskell. The more constrained the types, the more correct the program could be.
mindwok
Yep, and Rust famously goes beyond this by modelling memory ownership at compile time.
In fact, the more behaviour we can model at compile time, the better when it comes to LLMs. There are some cool ideas here, like transpiling Rust into languages for formal verification. See https://github.com/formal-land/coq-of-rust as an example.
Formal verification was previously so annoying to do that it rarely made it past academic use cases or extremely important libraries, but I think LLMs take the tedium out of it. Perhaps formal verification will have a "test-driven development" type of moment in the sun thanks to this.
koakuma-chan
Can LLMs properly code in Rust yet? There is way more TypeScript code out there than Rust code, and I doubt structured output can alleviate this.
IsTom
I wonder if at some point an LLM would "give up" when given difficult-to-satisfy types and insert nonterminating code / bottoms instead.
babyent
It's better, sure, but as a power TS user I find it still sucks at generating better code, and it consistently fucks up generics (or doesn't use them) or even simple types sometimes.
threeseed
Scala would be the best given that its type system is formally modelled:
https://infoscience.epfl.ch/entities/publication/6c6bb09d-a4...
cpfiffer
We (.txt, the outlines people) had a brief thread about this paper on twitter if you're interested: https://x.com/dottxtai/status/1922322194379551128
muglug
Really cool results!
That this research comes out of universities, and not large AI labs, makes me think those labs believe that larger models are still the way to go.
aibrother
+1, this seems like a healthy development
nielstron
thank you!
tough
The code can be found here: https://github.com/eth-sri/type-constrained-code-generation
bmc7505
The correct way to do this is with finite model theory but we're not there yet.
slt2021
we really need LLMs trained on ASTs instead of tokens. is there any research on this?
tough
ASTrust: Towards More Trustworthy and Interpretable LLMs for Code through Syntax-Grounded Explanations
https://arxiv.org/abs/2407.08983
AST-T5: Structure-Aware Pretraining for Code Generation and Understanding
https://arxiv.org/abs/2401.03003
CodeGRAG: Bridging the Gap between Natural Language and Programming Language via Graphical Retrieval Augmented Generation
nielstron
The downside is that you need to properly preprocess code, you have less non-code training data, and you cannot adapt easily to new programming languages.
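For concreteness, a tiny sketch of what "training on ASTs" would change about the representation, using the real TypeScript compiler API to dump the tree that would stand in for the raw token stream:

    import ts from "typescript";

    // Parse a snippet and print its syntax tree; an AST-based model would
    // consume (a linearization of) this structure instead of raw tokens.
    const source = ts.createSourceFile("ex.ts", "const x = 1 + 2;", ts.ScriptTarget.Latest);

    function printTree(node: ts.Node, depth = 0): void {
      console.log(" ".repeat(depth) + ts.SyntaxKind[node.kind]);
      node.forEachChild((child) => printTree(child, depth + 1));
    }
    printTree(source);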
int19h
Been using Devin for a few months now, for TypeScript and Python.
I've never seen it check in uncompilable code; watching the Devin console, I can see it building and using the code to ensure commits are not complete garbage. When it has checked in compilable but slightly wrong code, lint and tests running automatically from CI (it doesn't always run them before checking in) trigger it to push a fix on its own.
Feedback loops are nice, but they can be expensive and time-consuming (oh, look at me complaining that it takes Devin a whopping 15 minutes to complete a task), so I can definitely see the value in type constraints.
android521
is Devin worth the money? Would migrating from Cursor to Devin be a big jump in productivity?
int19h
it has been worth it for me, ymmv of course.
also, they have a pay-as-you-go tier now as well.
I pay the full $500 though. This month I'm going to blow past the base allowance and tap into 'gift credits'.
speaking of which, if anyone wants a referral code (gift creds for me, and for you), hmu
tough
how to hit you up tho