
Why LLMs Can't Write Q/Kdb+: Writing Code Right-to-Left

clord

There is something deep in this observation. When I reflect on how I write code, sometimes it’s backwards. Sometimes I start with the data and work back through to the outer functions, unnesting as I go. Sometimes I start with the final return and work back to the inputs. I notice sometimes LLMs should work this way, but can’t. So they end up rewriting from the start.

Makes me wonder if future LLMs will be composing nonlinear things and be able to work in non-token-order spaces temporarily, or will have a way to map their output back to linear token order. I know nonlinear thinking is common while writing code, though. Current LLMs might be hiding a deficit by having a large and perfect context window.

hnuser123456

Yes, there are already diffusion language models, which start with paragraphs of gibberish and evolve them into a refined response as a whole unit.

altruios

Right, but that smoothly(ish) resolves all at the same time. That might be sufficient, but it isn't actually replicating the thought process described above. That non-linear thinking is different from diffuse thinking. Resolving in a web around a foundation seems like it would be useful for coding (and other structured thinking in general).

hansvm

With enough resolution and appropriately chosen transformation steps, it is equivalent. E.g., the diffusion could focus on one region and then later focus on another, and it's allowed to undo the effort it did in one region. Nothing architecturally prohibits that solution style from emerging.

lelanthran

> Sometimes I start with the final return and work back to the inputs.

Shouldn't be hard to train a coding LLM to do this too by doubling the training time: train the LLM both forwards and backwards across the training data.
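A hypothetical sketch of that idea as data augmentation (the <fwd>/<rev> marker tokens are made up for illustration, not anything a real training pipeline is known to use): keep each token sequence and add a reversed copy, tagged so the model knows which direction it is generating in.

  # Hypothetical sketch: double the corpus with reversed copies of each
  # token sequence. The <fwd>/<rev> marker tokens are invented for illustration.
  BOS_FWD, BOS_REV = "<fwd>", "<rev>"

  def augment(sequences):
      out = []
      for toks in sequences:
          out.append([BOS_FWD] + toks)        # normal left-to-right order
          out.append([BOS_REV] + toks[::-1])  # same code, generated backwards
      return out

  print(augment([["def", "f", "(", "x", ")", ":", "return", "x"]]))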

UltraSane

I think that, long term, LLMs should directly generate Abstract Syntax Trees. But this is hard now because all the training data is code as text.
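For a rough sense of what that target looks like, a minimal sketch using Python's standard ast module (assuming Python 3.9+ for ast.unparse): the dumped tree is what an AST-generating model would have to emit, while the unparsed text is what today's training data actually contains.

  # Minimal sketch (Python 3.9+): the tree an AST-generating model would emit,
  # versus the flat text that current training corpora contain.
  import ast

  source = "total = sum(x * x for x in range(10))"
  tree = ast.parse(source)

  print(ast.dump(tree, indent=2))   # the structured target
  print(ast.unparse(tree))          # round-tripped back to text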

FeepingCreature

Another example of this is Claude placing unnecessary imports when writing Python, because it's hedge-importing modules that it suspects it might need later.

cenamus

Is it hedging, or did the training data just have lots of unnecessary imports?

haiku2077

Especially in Python, where it can be hard to tell if something is being imported purely for side effects.
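As a rough illustration of why it's hard: static analysis can flag imports whose names are never referenced, but it can't tell whether an import was kept for its side effects. A minimal sketch, assuming Python 3.9+:

  # Rough sketch: flag imports whose bound names are never used again.
  # This is exactly the check that cannot distinguish a truly unnecessary
  # import from one kept purely for its import-time side effects.
  import ast

  def unused_imports(source: str) -> list[str]:
      tree = ast.parse(source)
      imported, used = set(), set()
      for node in ast.walk(tree):
          if isinstance(node, (ast.Import, ast.ImportFrom)):
              for alias in node.names:
                  imported.add(alias.asname or alias.name.split(".")[0])
          elif isinstance(node, ast.Name):
              used.add(node.id)
      return sorted(imported - used)

  code = "import os\nimport antigravity\nprint(os.getcwd())"
  print(unused_imports(code))  # ['antigravity'] -- yet importing it has a side effect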

0cf8612b2e1e

That does happen, but not frequently in the common libraries that are going to be in public training data.

Is there a top 100 package that does something funny on import?

tantalor

Languages that are difficult for LLM to read & write are also difficult for the general public. These languages have always had poor uptake and never reach critical mass, or are eventually replaced by better languages.

Language designers would be smart to recognize this fact and favor making their languages more LLM friendly. This should also make them more human friendly.

markerz

I actually think Ruby on Rails is incredibly difficult for LLMs to write because of how many implicit "global state" things occur. I'm always surprised by how productive people are with it, but they certainly are.

knome

Don't plan on it staying that way. I used to toss wads of my own Forth-like language into LLMs to see what kinds of horrible failure modes the latest model would have in parsing and generating such code.

At first they were hilariously bad, then just bad, then kind of okay, and now Anthropic's Claude 4 Opus reads and writes it just fine.

sitkack

How much in-context documentation for your language are you giving it, or does it just figure it out?

trjordan

Seems like it could easily be training data set size as well.

I'd love to see some quantification of errors in q/kdb+ (or hebrew) vs. languages of similar size that are left-to-right.

fer

>Seems like it could easily be training data set size as well.

I'm convinced that's the case. On any major LLM I can carpet-bomb Java/Python boilerplate without issue. For Rust, at least last time I checked, it comes up with non-existent traits, hallucinates more frequently, and generally struggles to use the context effectively. In agent mode it turns into a fist fight with the compiler, often ending in credit-destroying loops.

And don't get me started when using it for Nix...

So I'm not surprised about something with an orders-of-magnitude smaller public corpus.

dotancohen

I realized this too, and it led me to the conclusion that LLMs really can't program. I did some experiments to find out what a programming language would look like if it were designed to be written and edited by an LLM, instead of e.g. Python. It turns out that it's extremely verbose, especially in variable names, function names, class names, etc. Actually, it turned out that classes were very redundant. But the real insight was that LLMs are great at naming things, and at performing small operations on the little things they named. They're really not good at any logic that they can't copy-paste from something they found on the web.

weird-eye-issue

> I did some experiments to find out what a programming language would look like if it were designed to be written and edited by an LLM, instead of e.g. Python.

Did your experiment consist of asking an LLM to design a programming language for itself?

mfro

Yep. I had similar issues asking Gemini for help with F#; I assume lack of training data is the cause.

dlahoda

I tried Gemini, OpenAI, Copilot, and Claude on a reasonably big Rust project. Claude worked well for fixing use statements, Clippy lints, renames, refactorings, and CI. I used the highest-cost Claude with custom context per crate, but I never got it to write new code well.

For Nix, it is a nice template engine for getting started or searching. I did not try big Nix changes.

gizmo686

Hebrew is still written sequentially in Unicode. The right-to-left aspect is simply about how the characters get displayed. In mixed documents, there are U+200E and U+200F to change the text direction mid-stream.

From the perspective of an LLM learning from Unicode, this would appear as a delimiter that needs to be inserted on language-direction boundaries; everything else should work the same.
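A small sketch of that logical ordering, assuming Python: the first code point of a Hebrew string is the first letter a reader reads, even though it is displayed rightmost, and the direction marks are ordinary code points in the stream.

  # Sketch: Hebrew is stored in logical (reading) order, not display order.
  import unicodedata

  word = "שלום"
  for ch in word:
      print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
  # U+05E9 HEBREW LETTER SHIN       (stored first, displayed rightmost)
  # U+05DC HEBREW LETTER LAMED
  # U+05D5 HEBREW LETTER VAV
  # U+05DD HEBREW LETTER FINAL MEM

  # The direction marks mentioned above are ordinary code points too:
  print(unicodedata.name("\u200e"))  # LEFT-TO-RIGHT MARK
  print(unicodedata.name("\u200f"))  # RIGHT-TO-LEFT MARK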

Timwi

I know I'm being pedantic, but I just want to point out that even U+200E/U+200F are generally not needed. If you put a Hebrew word in the middle of an English sentence, it displays correctly all by itself. This is due to the Unicode bidirectional algorithm, which defines a super sensible default behavior. You only need the RTL control characters in weird circumstances, perhaps ones involving punctuation marks or unusual uses of special characters.

cubefox

> Hebrew is still written sequentially

Everything is written sequentially in the sense that the character that is written first can only be followed by the character that is written next. In this sense writing non-sequentially is logically impossible.

dotancohen

An older Hebrew encoding actually encoded the last character first, then the penultimate character, then the character preceding that, etc.

Exercise for the reader: guess how line breaks, text wrapping, and search algorithms worked.

goatlover

Multiple characters can be written at once; they can also be written in reverse or out of order.

briandw

This is something that diffusion-based models would be capable of. For example, diffusion-coder (https://arxiv.org/abs/2506.20639) could be trained right-to-left, but it doesn't seem like they did.

aghilmort

Most mainstream models are decoders vs. encoder-decoders, diffusers, etc., and lack reversible causal reasoning, which of course can be counter-intuitive since it doesn't feel that way when models can regenerate prior content.

Some hacks for time / position / space flipping the models:

- test the spate of diffusion models emerging. Pro is faster, con is smaller context; YMMV depending on whether it was trained on that language and/or the context is large enough to ICL lang-booster info

- exploit known LTL tricks that may work; there's a bunch of these

- e.g., tell the model to gen drafts in some sort of RPN variant of the lang; if that tests well, tell it to simulate creating such a fork of the language and then gen a clean standard form at the end

- have it be explicit about leapfrogging recall and reasoning, e.g. be excessively verbose with comments you can regex-strip later

- have it build a stack / combo of the RPN & COT & bootstrapping its own ICL

- exploit causal markers - think tags that can splinter time - this can really boost any of the above methods - eg give each instance of things disjoint time tags, A1 vs K37 for numbered instances of things that share a given space - like a time GUID

- use orthogonal groups of such tags to splinter time and space recall and reasoning in model, to include seemingly naive things like pass 1 etc

- our recent arXiv paper on HDRAM / hypertokens pushes causal markers to classic-quantum holographic extreme and was built for this, next version will be more accessible

- the motivators are simple: models fork on prefix-free modulo embedding noise, so the more you make prefix-free, the better the performance. There are some massive caveats on how to do this perfectly, which is exactly our precise work - think 2x to 10x gain on the model and similar on reasoning; again YMMV as we update the preprint, post a second paper that makes the baseline better, prep a git release, etc., to make it tons easier to get better recall and exploit the same to get better reasoning by making it possible for any model to do the equivalent of arbitrary RPN

- our future state is exactly this: a prompt compiler for exactly this use case - explainable time-independent computation in any model

electroly

I always thought APL was written in the wrong direction. It writes like a concatenative language that's backwards--you tack things onto the front. NumPy fixes it by making the verbs all dotted function calls, effectively mirroring the order. e.g. in APL you write "10 10 ⍴ ⍳100" but in NumPy you write "np.arange(1, 101).reshape(10, 10)". Even if you don't know either language, you can tell that the APL version is the reverse of the Python version.
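To make the mirroring concrete, a minimal NumPy sketch of the same expression (output comments abbreviated):

  import numpy as np

  # NumPy spelling of APL's "10 10 ⍴ ⍳100": reading left to right, each step
  # tacks the next verb onto the end; APL applies the same verbs right to left.
  grid = np.arange(1, 101).reshape(10, 10)
  print(grid[0])   # [ 1  2  3 ... 10]
  print(grid[-1])  # [91 92 ... 100]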

My hot take is that Iverson was simply wrong about this. He couldn't be expected to predict that code completion, and later LLMs, would both want later tokens to depend on earlier tokens. SQL messed it up too, with "from" not coming first. If APL were developed today, I think left-to-right evaluation would have been preferred. The popularity of dotted function calls in various languages makes it reasonably clear that people like tacking things onto the end and seeing a "pipeline" form from left to right.

beagle3

APL was designed as a notation for math; if you pronounce it properly, it makes more sense than numpy:

The 10 by 10 reshaping of counting to 100

fwip

With complicated formulas, it often makes more sense and can give more guidance by first talking about the last operations to be applied. This seems to match the LLM structure, by starting by describing what we want, and then filling in the more specialized holes as we get to them. "Top-down" design vs "bottom-up".

Your insight about APL being reverse-concatenative is very cool.

leprechaun1066

It's not because of the right-to-left evaluation. If the difference were that simple, most humans, let alone LLMs, wouldn't struggle to pick up q when they come from the common languages.

Usually when someone solves problems with q, they don't approach them the way one would in Python/Java/C/C++/C#/etc.

This is probably a poor example, but if I asked someone to write a function that creates an nxn identity matrix for a given number, the non-q solution would probably involve some kind of nested loop that checks if i==j and assigns 1, otherwise assigns 0.

In q you'd still check equivalence, but instead of looping, you generate a list of numbers as long as the given dimension and then compare the whole list against each of its items:

  {x=/:x:til x}3

An LLM that's been so heavily trained on an imperative style will likely struggle to solve similar (and often more complex) problems in a standard q manner.
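For readers coming from the imperative side, a rough Python rendering of the two styles being contrasted; the NumPy mapping of the q idiom is approximate:

  import numpy as np

  # The style most training data teaches: nested loops, check i == j.
  def identity_loops(n):
      m = [[0] * n for _ in range(n)]
      for i in range(n):
          for j in range(n):
              if i == j:
                  m[i][j] = 1
      return m

  # Rough NumPy rendering of the q idiom {x=/:x:til x}: build the range once
  # (til x), then compare the whole list against each of its items (x =/: x).
  def identity_q_style(n):
      x = np.arange(n)
      return (x[:, None] == x).astype(int)

  print(identity_loops(3))
  print(identity_q_style(3))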

wat10000

A human can deal with right-to-left evaluation by moving the cursor around to write in that direction. An LLM can’t do that on its own. A human given an editor that can only append would struggle too.

grej

This is, in part, one of the reasons why I am interested in the emerging diffusion based text generation models.

vessenes

Interesting. Upshot: right-to-left eval means you generally must start at the end, or at least hold an expression in working memory. LLMs are not so good at this.

I wonder if diffusion models would be better at this; most start out as sequential token generators and then get finetuned.

cess11

"Claude is aware of that, but it struggled to write correct code based on those rules"

It's actually not, and unless they in some way run a rule engine on top of their LLM SaaS stuff, it seems far-fetched to believe it adheres to rule sets in any way.

Local models confuse Python, Elixir, PHP and Bash when I've tried to use them for coding. They seem more stable for JS, but sometimes they slip out of that too.

Seems pretty contrived and desperate to invent transpilers from quasi-Python to other languages to try and find a software development use for LLM SaaS. Warnings about Lisp macros and other code rewrite tools ought to apply here as well. Plus, of course, the loss of 'notation as a tool of thought'.

strangescript

If your model is getting confused by Python, it's a bad model. Python is routinely the best language for all major models.

cess11

I don't know what counts as a major model. Relevant to this, I've dabbled with Gemma, Qwen, Mistral, Llama, Granite and Phi models, mostly 3-14b varieties but also some larger ones on CPU on a machine that has 64 GB RAM.

wild_egg

I think the issue there is those smaller versions of those models. I regularly use Gemma3 and Qwen3 for programming without issue but in the 27b-32b range. Going smaller than that generally yields garbage.