LLMs pose an interesting problem for DSL designers
98 comments
June 17, 2025
NathanKP
Good to see more people talking about this. I wrote about this about six months ago, when I first noticed how LLM usage is pushing a lot of people back towards older programming languages, older frameworks, and more basic designs: https://nathanpeck.com/how-llms-of-today-are-secretly-shapin...
To be honest, I don't think this is necessarily a bad thing, but it does mean that there is a stifling effect on fresh new DSLs and frameworks. It isn't an unsolvable problem, particularly now that all the most popular coding agents have MCP support that allows you to bring in custom documentation context. However, there will always be a strong force in LLMs pushing users towards the runtimes and frameworks that have the most training data.
crq-yml
I think it's healthy, because it creates an undercurrent against building a higher abstraction tower. That's been a major issue: we make the stack deeper and build more of a "Swiss Army Knife" language because it lets us address something local to us, and in exchange it creates a Conway's Law problem for someone else later when they have to decipher generational "lava layers" as the trends of the marketplace shift and one new thing is abandoned for another.
The new way would be to build a disposable jig instead of a Swiss Army Knife: The LLM can be prompted into being enough of a DSL that you can stand up some placeholder code with it, supplemented with key elements that need a senior dev's touch.
The resulting code will look primitive and behave in primitive ways, which at the outset creates a myriad of inconsistencies, but is OK for maintenance over the long run: primitive code is easy to "harvest" into abstract code; the reverse is not so simple.
freedomben
I think this depends a lot on the stack. For stacks like Elixir and Phoenix, imho the abstraction layer is about perfect. For anyone in the Java world, however, what you say is absolutely true. Having worked in a number of different stacks, I think that some ecosystems have a huge tolerance for abstraction layers, which is a net negative for them. I would sure hate to see AI decimate something like Elixir and Phoenix, though.
gopiandcode
Oh that's a great blog post and a very interesting point. Yep, I hadn't considered how LLMs would affect frameworks in existing languages, but it makes sense that there's a very similar effect of reinforcing the incumbents and stifling innovation.
I'd argue that countering this effect might be a bit harder for DSLs than for frameworks, because DSLs can have wildly different semantics (imagine, for example, a logic programming DSL a la Prolog vs. a functional DSL a la Haskell), so these may not fit as nicely into the MCP approach. I agree that it's not unsolvable, but it definitely needs more research.
NathanKP
I think there is a lot of overlap between DSLs and frameworks, and most frameworks contain some form of DSL.
What matters most of all is whether the DSL is written in semantically meaningful tokens. Two extremes as examples:
Regex is a DSL that is not written in tokens that have inherent semantic meaning. LLMs can only understand regex by virtue of the fact that it has been around for a long time and there are millions of examples for the LLM to work from. And even then, LLMs still struggle with reading and writing regex.
Tailwind is an example of a DSL that is very semantically rich. When an LLM sees `class="text-3xl font-bold underline"`, it pretty much knows what that means out of the box, just like a human does.
Basically, a fresh new DSL can succeed much faster if it is closer to Tailwind than to Regex. The other side of DSLs is that they tend to be concise, and that can actually be a great thing for LLMs: more concise means fewer tokens, which means faster coding agents and faster responses from prompts. But too much conciseness (in the manner of regex) leads to semantically confusing syntax, and then LLMs struggle.
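To make the two extremes concrete (the regex below is just a generic email pattern I'm using for illustration, not anything from a real codebase):

```python
# Two DSL snippets expressing domain rules, embedded as Python strings.

# Regex: the tokens (^, [, +, @) have no inherent semantic meaning;
# a model has to infer intent purely from pattern statistics.
email_pattern = r"^[\w.+-]+@[\w-]+\.[\w.]+$"

# Tailwind: each token is a readable word or abbreviation
# (text size, font weight, decoration), so intent is legible at a glance.
heading_classes = "text-3xl font-bold underline"
```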
AlotOfReading
Knowing just what's going on in the existing text isn't the whole problem in navigating a DSL. You have to be able to predict new things based on the patterns in existing text.
Let's say you want to generate differently sized text here. An LLM will have ingested lots of text talking about clothing sizes, and Tailwind text sizes vaguely follow that pattern. Maybe it generates `text-medium` as a guess instead of the irregular `text-base`, or extends the numeric pattern down into `text-2xs`.
TimTheTinker
> there is a lot of overlap between DSL's and frameworks
Not just frameworks, but libraries also. Interacting with some of the most expressive libraries is often akin to working with a DSL.
In fact, the paradigms of some libraries required such expressiveness that they spawned their own in-language DSLs, like JSX for React, or LINQ expressions in C#. These are arguably the most successful DSLs out there.
scelerat
Linguistics and history-of-language folk: isn't there an observed slowdown in the evolution of spoken language as the printing press became widespread? Also, "international English"?
Is this an observation of a similar phenomenon?
librasteve
i saw a good post on this earlier today … https://nurturethevibe.com/blog/teach-llm-to-write-new-progr...
winter_blue
I think perhaps automatic translators might help mitigate some of this.
Even perhaps training a separate new neural network to translate from Python/Java/etc to your new language.
guywithahat
Skynet will be run on C
shakna
Thank god. A human would have handled malloc failure.
NathanKP
This is an interesting idea actually, if we assume a couple things:
- There likely won't be one Skynet, but rather multiple AIs, produced by various sponsors, starting out as relatively harmless autonomous agents in corporate competition with each other
- AI agents can only run inference, then read and write output tokens at a limited rate based on how fast the infrastructure that powers the agent can run
In this scenario a "Skynet" AI writing code in C might lose to an AI writing code in a higher-level language, just because of the time lost writing tokens for all that verbose boilerplate and memory management in C. The AI agent that is capable of "thinking" in a higher-level DSL can take shortcuts that let it implement things faster, with fewer tokens.
guywithahat
But then we return to the premise of the article, which is that LLMs struggle with higher-level code that has fewer examples. C is much more repetitive and has millions of examples from decades of use.
Plus, if we assume the bottleneck on Skynet is physical materials and not processing power, a system written in C can theoretically always be superior to a system written in another language, given infinite time to build it.
weikju
We need to ensure humans can still find buffer overflows in order to win the war!
freedomben
Because skynet knows what's up. Viva la skynet
cpard
> Let's start with what I see as the biggest problem that the introduction of LLMs is presenting to language design: everything is easier in Python.
This is so true.
A couple months ago I was trying to use LLMs to come up with code to parse some semi-structured textual data based on a brief description from the user.
I didn't want to just ask the LLM to extract the information in a structured format, as this would make it extremely slow when there's a lot of data to parse.
My idea was: why not ask the LLM to come up with a script that does the job? Kind of "compiling" what the user asks into a deterministic piece of code that will also be efficient. The LLM just has to figure out the structure and write some code to exploit it.
I also had the bright idea to define a DSL for parsing, instead of asking the LLM to write a python script. A simple DSL for a very specific task should be better than using something like Python in terms of generating correct scripts.
I defined the DSL, created the grammar and an interpreter, and started feeding the grammar definition to the LLM when prompting it to do the work I needed.
The result was underwhelming and at times hilarious. When I built a loop to feed the model its errors and ask it to correct the script, I sometimes got back Python scripts, with the instructions completely ignored.
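The shape of the loop I built, as a minimal sketch (`llm`, `run_dsl`, and `DslError` are stand-ins for my actual completion client and interpreter, not real libraries):

```python
# Sketch of the "generate, run, feed errors back" loop described above.
# llm(), run_dsl(), and DslError are hypothetical stand-ins.

class DslError(Exception):
    """Raised by the interpreter when a script fails to parse or run."""

def llm(prompt: str) -> str:
    raise NotImplementedError("call your completion API here")

def run_dsl(script: str, sample: str) -> None:
    raise NotImplementedError("call your DSL interpreter here")

def compile_parser(grammar: str, task: str, sample: str, attempts: int = 3) -> str:
    prompt = f"Grammar:\n{grammar}\n\nWrite a script in this DSL to: {task}"
    for _ in range(attempts):
        script = llm(prompt)
        try:
            run_dsl(script, sample)  # validate against a sample input
            return script
        except DslError as err:
            # Feed the failure back and ask for a corrected DSL script;
            # in practice the model sometimes returned Python here anyway.
            prompt += f"\n\nThat script failed with: {err}\nReturn a corrected DSL script only."
    raise RuntimeError("no valid DSL script after retries")
```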
As the author said, everything is easier in Python, especially if you are a large language model!
jo32
In the LLM era, building a brand-new DSL feels unnecessary. DSLs used to make sense because they gave you a compact, domain-specific syntax that simple parsers could handle. But modern language models can already read, write, and explain mainstream languages effortlessly, and the tooling around those languages—REPLs, compilers, debuggers, libraries—is miles ahead of anything you’d roll on your own. So rather than inventing yet another mini-language, just leverage a well-established one and let the LLM (plus its mature ecosystem) do the heavy lifting.
furyofantares
I notice I am confused.
> Suddenly the opportunity cost for a DSL has just doubled: in the land of LLMs, a DSL requires not only the investment of build and design the language and tooling itself, but the end users will have to sacrifice the use of LLMs to generate any code for your DSL.
I don't think they will. Provide a concise description plus examples for your DSL and the LLM will excel at writing within your DSL; agents even more so, if you can provide errors. I mean, I guess the article kinda goes in that direction.
But also, authoring DSLs is something LLMs can assist with better than most programming tasks. LLMs are pretty great at producing code that's largely just a data pipeline.
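For example, the "description + examples" prompt for a made-up DSL might be as simple as this (everything here, including PipeLang, is invented for illustration):

```python
# Hypothetical few-shot prompt for a made-up pipeline DSL, showing the
# "concise description + examples" approach.
DSL_PROMPT = """\
You write scripts in PipeLang, a tiny DSL. Rules:
- one stage per line, stages are chained top to bottom
- syntax: <stage> <arg> [<arg>...]

Examples:
# keep rows where the 3rd column is over 100, then sum the 1st column
filter col3 > 100
sum col1

# lowercase names and count distinct values
map col2 lower
distinct col2
count
"""

def build_prompt(task: str) -> str:
    # An agent loop can append interpreter errors here on retries.
    return f"{DSL_PROMPT}\nTask: {task}\nReturn only PipeLang."
```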
gopiandcode
Arguably it really depends on your DSL, right? If its semantics already lie close to existing programming languages, then I'd agree that a few examples might be sufficient, but what if your particular domain doesn't match as closely?
Examples of domains that might be more challenging to design DSLs for: languages for knitting, non-deterministic languages to represent streaming etc. (i.e https://pldi25.sigplan.org/details/pldi-2025-papers/50/Funct... )
My main concern is that LLMs might excel at the mundane tasks but struggle with the more exciting advances, so the activation energy for coming up with advanced DSLs is going to increase, and as a result the field might stagnate.
demosthanos
Remember that LLMs aren't trained on all existing programming languages, they're trained on all text on the internet. They encode information about knitting or streaming or whatever other topic you want a DSL for.
So it's not just a question of the semantics matching existing programming languages, the question is if your semantics are intelligible given the vast array of semantic constructs that are encoded in any part of the model's weights.
daxfohl
This is somewhat my take too. The way most vibe coding happens right now does create a lot of duplication, because duplication is cheap and easy for LLMs. But eventually, as the things we do with coding assistants become more complex, they're not necessarily going to be able to deal with huge swaths of duplicate code any better than humans can. Given their limited context size, a DSL that allows them to fit more logic into context with fewer tokens means we could conceivably see the importance of DSLs start to increase rather than decrease.
neilv
I was about to paste the same sentence, and say much the same thing in response.
To add to that... One limitation of LLMs with a new DSL is that the LLM may be less likely to directly plagiarize from open source code. That could be a feature.
Another feature could be users doing their own work, and doing a better job of it, instead of "cheating on their homework" with AI slop and plagiarism, whether for school or in the workplace.
NiloCK
I, too, wrote a rambling take on the potential for LLM-induced stack ossification about 6 months ago: https://paritybits.me/stack-ossification/
At the time, I had given in to Claude 3.5's preference for Python when spinning up my first substantive vibe-coded app. I'd never written a line of Python before and haven't since, but I just let the waves carry me. Claude and I vibed ourselves into a corner, and given my ignorance, I gave up on fixing things and declared the software done as-is. I'm now the proud owner of a tiny monstrosity that I completely depend on: my own local Whisper dictation app with a system tray.
I've continued to think about stack ossification since. It still feels possible, given my recent frustration trying to use animejs v4 via an LLM. There's a substantial API change between animejs v3 and v4, and no amount of direction or documentation placed in context could stop models from writing against the v3 API.
I see two ways out of the ossification attractor.
The obvious, passive, way out: frontier models cross a chasm with respect to 'putting aside' internalized knowledge (from the training data) in favor of in-context directions or some documentation-RAG solutions. I'm not terribly optimistic here - these models are hip-shooters by nature, and it feels to me that as they get smarter, this reflex feels stronger rather than weaker. Though: Sonnet 4 is generally a better instruction-follower than 3.7, so maybe.
The less obvious way out, which I hope someone is working on, is something like massive model-merging based on many cached micro fine-tunes against specific dependency versions, so that each workspace context can call out to modestly customized LLMs (LoRA style) where usage of incorrect versions of your dependencies has specifically been fine-tuned out.
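Sketching what I mean with invented names (no real library or API implied, just the routing idea):

```python
# Hypothetical sketch of per-dependency-version adapter routing; every
# function and registry here is invented to illustrate the idea.
from pathlib import Path

ADAPTER_CACHE = {
    ("animejs", "4"): Path("adapters/animejs-v4.lora"),
    ("react", "19"): Path("adapters/react-v19.lora"),
}

def read_lockfile_versions(workspace: Path) -> dict[str, str]:
    raise NotImplementedError("parse package-lock.json / poetry.lock here")

def load_model_with_adapters(adapters: list[Path]):
    raise NotImplementedError("base model + merged LoRA-style adapters")

def model_for_workspace(workspace: Path):
    deps = read_lockfile_versions(workspace)
    adapters = [p for (name, major), p in ADAPTER_CACHE.items()
                if deps.get(name, "").startswith(major)]
    # Each cached adapter was fine-tuned so that usage of the *wrong*
    # version of its dependency has been trained out.
    return load_model_with_adapters(adapters)
```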
darepublic
I recently had to work with the Robot Framework DSL. Not a fan. I hardly think it's any more readable to a business user than imperative code, either. Every DSL is another API to learn, and usually full of gotchas. Intuitiveness is in the eye of the beholder. The approach I would take is transpiling from imperative code to a natural-language explanation of what is being tested, with configuration around aliases and the like.
TimTheTinker
DSLs are not all created equal.
Consider MiniZinc. This DSL is super cool and useful for writing constraint-solving problems once and running them through any number of different backend solvers.
A lot of intermediate languages and bytecodes (including LLVM IR itself) are very useful DSLs for representing low-level operations using a well-defined set of primitives.
Codegen DSLs are also amazing for some applications, especially for creating custom boilerplate -- write what's unique to the scenario at hand in the DSL and have the template-based codegen use the provided data to generate code in the target language. This can be a highly flexible approach, and is just one of several types of language-oriented programming (LOP).
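A tiny sketch of that codegen pattern in Python (the record spec and class shape are invented, not from any particular tool): the "DSL" here is just a declarative description, and a template expands it into target-language boilerplate.

```python
from string import Template

# The "DSL" input: a declarative description of a record.
spec = {"name": "User", "fields": [("id", "int"), ("email", "str")]}

# The template holds everything that is the same for every record.
klass = Template("""\
class $name:
    def __init__(self, $args):
$assigns
""")

args = ", ".join(f"{f}: {t}" for f, t in spec["fields"])
assigns = "\n".join(f"        self.{f} = {f}" for f, _ in spec["fields"])
print(klass.substitute(name=spec["name"], args=args, assigns=assigns))
```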
Lerc
I think Sturgeon's law might be the problem for DSLs. Enabling a proliferation of languages targeted at a specific purpose creates a lot of new things, each with a 50% chance of being below the median in quality. Using a general-purpose language involves selecting one of many (or, more usually, relying on someone else's earlier selection), and that selection process is inherently biased towards the better languages.
Put differently, the languages people actually use had people who decided to use them; they picked the best ones. Making something new, you compete against the best, not the average. That's not to say it can't be done, but it's not easy.
iguessthislldo
I'm not skeptical about DSLs in general, but I agree with you on Robot Framework. It has a few good points (the HTML output it produces is mostly nice), but I'm not happy with how tags on test cases work, and actually writing anything non-trivial is frustrating. It's easy to write Python extensions, though, so that's where I ended up putting basically all of the logic that wasn't the "business logic" of the tests. I think that's generally what you're supposed to do, but at that point it seems better to write it all in Python or the language of your choice.
kibwen
People often use the analogy that LLMs are to high-level languages what compilers were to assembly languages, and despite being a terrible analogy, there's no guarantee it won't eventually be largely true in practice. And if it does come true, consider how the advent of the compiler completely eliminated any incentive to improve the ergonomics or usability of assembly code, which has been, and continues to be, absolute crap, because who cares? That could be the grim future for high-level languages; this may be the end of the line.
daxfohl
A big difference is that compilers are deterministic, and coders generally don't review and patch the generated assembly. There's little reason to expect that LLMs will ever function like that. It's always going to be a back-and-forth of, "hey LLM code this up", "no, function f isn't quite right; do this instead", etc.
This mimics what you see in, say, Photoshop. You can edit pixels manually, you can use deterministic tools, and you can use AI. If you care about the final result, you're probably going to use all three together.
I don't think we'll ever get to the point where we present a spec to an LLM a priori and then never even look at the code, i.e. "English as a higher-level coding language". The reason is that code is simply more concise and explicit than trying to explain the logic in English in its totality up front.
For some things where you truly don't care about the details and have lots of flexibility, maybe English-as-code could be used like that, similar to image generation from a description. But I expect for most business-related use cases, the world is going to revolve around actual code for a long time.
Terr_
I suspect the important sticking point will be reliability. The "incentive" exists because of a high degree of trust, so much so that "junior dev thinks it's a compiler bug" is a kind of joke.
If compilers had significant non-deterministic error rates with no reliable fix, that would probably be a rather different timeline.
kibwen
Yes, that's why I say it's a terrible analogy, but we live in a crap world with a fetish for waste and no incentive for quality, so I fear for a future where every night we just have the server farm spin up a thousand batch jobs to regenerate the entire codebase from scratch and pick out the first candidate that passes today's test suite.
bee_rider
We got LLVM IR, which is sort of... similar-ish to assembly but better and more portable, right? Maybe some observation could be made there: it is something that does a similar job, but does it in a way that is better for the job that actually remains.
noobermin
An ignorant perspective, from someone who likely hasn't ever coded assembly. Assembly is tied to the system you target and it can't really be "improved". You can however improve ergonomics greatly via macros, and everyone does this.
kibwen
> Assembly is tied to the system you target and it can't really be "improved".
Of course it can. There's no reason for modern extensions to keep pumping out instructions with names like "VCVTTPS2DQ" other than an adherence to cryptic tradition and a confidence that the people who read assembly code are poor saps who don't matter in the grand scheme of the industry, which is precisely my point. And even if x86 was set in stone centuries ago, there's no excuse for modern ISAs to follow suit other than complete apathy towards the DX of assembly, and who can blame them?
> You can however improve ergonomics greatly via macros and everyone does this.
Yes, and surely you see how the existence of macro assemblers strengthens my argument?
adsharma
> Language Design Direction 1: Teaching LLMs about DSLs (through Python?)
This is what I've been focused on for the last few years, with a bit of Direction 3 via:
python -> smt2 -> z3 -> verified rust
Perhaps a diffusion model for programming can be thought of as: requirements -> design -> design by contract -> subset of Python -> GC-capable language (a fork of golang with ML features?) -> low-level compiled language (Rust, Zig, or C++)
As you go from left to right, there is an increasing level of detail the programmer has to worry about. The trick is to pick the right level of detail for a task.
Previous writing: https://adsharma.github.io/agentic-transpilers/
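A tiny taste of the z3 leg of that pipeline, using the real z3py bindings (the property being checked is invented just for illustration):

```python
# Verify a small arithmetic claim by asking z3 for a counterexample.
from z3 import Int, Solver, Implies, Not, unsat

x = Int("x")
s = Solver()
# Claim: for integers, x > 2 implies x * x > 4.
claim = Implies(x > 2, x * x > 4)
s.add(Not(claim))  # satisfiable iff a counterexample exists
print("verified" if s.check() == unsat else f"counterexample: {s.model()}")
```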
MoonGhost
I'm thinking: is it time for a new "programming" language for LLMs to use instead of tool API calls? Something high-level with a loose grammar, in between natural language and strict programming. Then the backend, maybe another smaller model, translates it into API calls. With this approach the backend can be improved and updated much faster and more cheaply than the LLM itself.
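Roughly something like this (all command and tool names invented; a real backend would be a small model rather than a regex, but the split is the same):

```python
import re

# The LLM emits loose, high-level commands; a small deterministic backend
# maps them onto concrete tool API calls.
def translate(command: str) -> dict:
    """Map a loose command onto a structured API call."""
    m = re.match(r"(?:please\s+)?(fetch|search)\s+(.+?)(?:\s+for me)?$",
                 command.strip(), re.I)
    if not m:
        raise ValueError(f"could not translate: {command!r}")
    verb, arg = m.group(1).lower(), m.group(2)
    if verb == "fetch":
        return {"tool": "http_get", "args": {"url": arg}}
    return {"tool": "web_search", "args": {"query": arg}}

# The loose frontend tolerates filler words the strict API would reject.
print(translate("please fetch https://example.com for me"))
print(translate("search cheapest GPU cloud"))
```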
cpeterso
Lisp returns as the programming language for AI!
maybevoid
Coincidentally, I released a DSL last week called Hypershell [1], a Rust-based domain-specific language for shell scripting at the type level. While writing the blog post, I found myself wondering: will this kind of DSL be easier for LLMs to use than for humans?
In an initial experiment, I found that LLMs could translate familiar shell scripting concepts into Hypershell syntax reasonably well. More interestingly, they were able to fix common issues like type mismatches, especially when given light guidance or examples. That’s a big deal, because, like many embedded DSLs, Hypershell produces verbose and noisy compiler errors. Surprisingly, the LLM could often identify the underlying cause hidden in that mess and make the right correction.
This opens up a compelling possibility: LLMs could help bridge the usability gap that often prevents embedded DSLs from being more widely adopted. Debuggability is often the Achilles' heel of such languages, and LLMs seem capable of mitigating that, at least in simple cases.
More broadly, I think DSLs are poised to play a much larger role in AI-assisted development. They can be designed to sit closer to natural language while retaining strong domain-specific semantics. And LLMs appear to pick them up quickly, as long as they're given the right examples or docs to work with.
ethan_smith
Your experience with Hypershell points to an interesting possibility: LLMs as DSL translators rather than replacements. This could actually democratize DSLs by lowering the learning curve while preserving their domain-specific benefits. The real opportunity might be DSLs optimized for both human semantics and machine translation.
amterp
I've been working on a programming language for about a year which aims to replace Bash for scripting but is far closer to Python. It's something I hope to see used by many other people in the future, but a very common objection I hear from people I pitch it to is: "yeah, but an LLM could just generate me a Python script to do this; sure, it might be uglier, twice as long, and not work quite as well, but it saved me from learning a new language and is probably fine." I have lots of counters to why that's a flawed argument, but it still demonstrates what I think is an increase in people's skepticism towards new languages, which will contribute to the stagnation the author is talking about. If you're writing a new language, it's demotivating to see people's receptiveness to something new diminish.
I don't blame anyone in the picture, and I don't disagree that the time saved with LLMs can be well worth it, but it's still a topic I think we in the PL community need to wrestle with more.
nylonstrung
I love your idea.
LLMs are surprisingly bad at Bash and apparently very bad at PowerShell.
Pythonic shell scripting is well suited to their language biases right now.
kkukshtel
I've been drafting a blog post on this as well. My take is that programming languages largely evolve around "human" ergonomics and solve for "humans writing the code", but that can result in code that is too abstract and non-performant. I think where LLMs will succeed (more) is in writing _very dumb verbose code_ that can be easily optimized by the compiler.
What humans look at and what an AI looks at right now are similar only by circumstance. What I sort of expect is that you'll start seeing something more like a "structure editor" that expresses the underlying "dumb" code in a more abstract way, such that humans can refactor it effectively, but what the human sees and edits isn't literally what the code "is".
IDK it's not written yet but when it is it will be here: https://kylekukshtel.com/llms-programming-language-design