Don't Force Your LLM to Write Terse [Q/Kdb] Code: An Information Theory Argument

Veedrac

> Let’s start with an example: (2#x)#1,x#0 is code from the official q phrasebook for constructing an x-by-x identity matrix.

Is this... just to be clever? Why not

    (!x)=/:!x

i.e., the identity matrix is defined as having ones on the diagonal? Bonus points: the AI will understand the code better.
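For readers who don't read q or k, here is a rough Python transliteration of the two constructions, to make the difference concrete. It is only a sketch: the function names `identity_by_reshape` and `identity_by_comparison` are made up for illustration, and plain lists stand in for q's arrays.

```python
def identity_by_reshape(n):
    """Mirror of (2#x)#1,x#0: cycle the pattern 1,0,...,0 into an n-by-n shape."""
    # The pattern has length n + 1, so each new row starts one step later
    # in the cycle, which pushes the 1 one column to the right per row.
    pattern = [1] + [0] * n
    flat = [pattern[i % (n + 1)] for i in range(n * n)]
    return [flat[r * n:(r + 1) * n] for r in range(n)]


def identity_by_comparison(n):
    """Mirror of (!x)=/:!x: cell (r, c) is 1 exactly when r equals c."""
    return [[int(r == c) for c in range(n)] for r in range(n)]


if __name__ == "__main__":
    for row in identity_by_reshape(3):
        print(row)
    print(identity_by_reshape(3) == identity_by_comparison(3))
```

The first version depends on the reader noticing the off-by-one cycle length; the second states the definition directly.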

neprotivo

This approach of solving a problem by building a low-perplexity path towards the solution reminds me of Grothendieck's approach towards solving complex mathematical problems - you gradually build a theory which eventually makes the problem obvious.

https://ncatlab.org/nlab/show/The+Rising+Sea

benjaminwootton

The bigger issue is that LLMs haven’t had much training on Q as there’s little publicly available code. I recently had to try and hack some together and LLMs couldn’t string simple pieces of code together.

It’s a bizarre language.

haolez

I don't think that's the biggest problem. I think it's the tokenizer: it probably does a poor job with array languages.

sanjayjc

> I think the aesthetic preference for terseness should give way to the preference for LLM accuracy, which may mean more verbose code

From what I understand, the terseness of array languages (Q builds on K) serves a practical purpose: all the code is visible at once, without the reader having to scroll or jump around. When reviewing an LLM's output, this is a quality I'd appreciate.

dapperdrake

Perl and line noise also share these properties. Don’t particularly want to read straight binary zip files in a hex editor, though.

Human language has roughly, say, 36% encoding redundancy on purpose. (Or by Darwinian selection so ruthless we might as well call it "purpose".)

gabiteodoru

I agree with you, though in the q world people tend to take it to the extreme, like packing a whole function into a single line rather than a single screen. Here's a ticker plant standard script from KX themselves; I personally find this density makes it harder to read, and when reading it I put it into my text editor and split semicolon-separated statements onto different lines: https://github.com/KxSystems/kdb-tick/blob/master/tick.q

E.g. one challenge I've had was generating a magic square on a single line; for odd-size only, I wrote:

    ms:{{[(m;r;c);i]((.[m;(r;c);:;i],:),$[m[s:(r-1)mod n;d:(c+1) mod n:#:[m]];((r+1)mod n;c);(s;d)])}/[((x;x)#0;0;x div 2);1+!:[x*x]]0}; / but I don't think that's helping anyone
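For comparison, here is what the one-liner above appears to do when spelled out step by step in Python. This is a sketch based on my reading of the q code (start in the middle of the top row, move up and to the right with wraparound, drop down one row when the target cell is taken); the name `magic_square` and the error handling are assumptions, not part of the original.

```python
def magic_square(n):
    """Build an odd-order magic square with the up-and-right stepping rule."""
    if n < 1 or n % 2 == 0:
        raise ValueError("this construction only handles odd sizes")

    m = [[0] * n for _ in range(n)]  # 0 marks an empty cell
    r, c = 0, n // 2                 # start in the middle of the top row

    for i in range(1, n * n + 1):
        m[r][c] = i
        # Candidate next cell: one row up, one column right, wrapping around.
        s, d = (r - 1) % n, (c + 1) % n
        if m[s][d]:
            # Candidate is already filled: move one row down instead.
            r = (r + 1) % n
        else:
            r, c = s, d
    return m


if __name__ == "__main__":
    for row in magic_square(3):
        print(row)
```

It is several times longer than the q version, but each line maps to one clause of the rule, which is the trade-off being discussed here.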

krackers

When Q folks try to write C: https://github.com/kparc/ksimple

dapperdrake

When EAX and RAX take too long to type.

lynx97

Hey, another language with smileys! Like Haskell, which has (x :) (partial application of a binary operator)

icsa

I think that there are a few critical issues that are not being considered:

* LLMs don't understand the syntax of q (or any other programming language).

* LLMs don't understand the semantics of q (or any other programming language).

* Limited training data, as compared to languages like Python or JavaScript.

All of the above contribute to the failure modes when applying LLMs to the generation or "understanding" of source code in any programming language.

chewxy

> Limited training data, as compared to languages like Python or JavaScript.

I use my own APL to build neural networks. This is probably the correct answer, and in line with my experience as well.

I changed the semantics and definition of a bunch of functions and none of the coding LLMs out there can even approach writing semidecent APL.
