My 2.5 year old laptop can write Space Invaders in JavaScript now (GLM-4.5 Air)
339 comments
· July 29, 2025
NitpickLawyer
genewitch
I'll bite. How do I train/make and/or use a LoRA, or, separately, how do I fine-tune? I've been asking this for months, and no one has a decent answer. Web search on my end is SEO/GEO spam, with no real instructions.
I know how to make an SD LoRA, and use it. I've known how to do that for 2 years. So what's the big secret about LLM LoRA?
techwizrd
We have been fine-tuning models using Axolotl and Unsloth, with a slight preference for Axolotl. Check out the docs [0] and fine-tune or quantize your first model. There is a lot to be learned in this space, but it's exciting.
arkmm
When do you think fine-tuning is worth it over prompt engineering a base model?
I imagine with fine-tunes you have to worry about self-hosting, model utilization, and then also retraining the model as new base models come out. I'm curious under what circumstances you've found that the benefits outweigh the downsides.
syntaxing
What hardware do you train on using Axolotl? I use Unsloth with Google Colab Pro.
notpublic
https://github.com/unslothai/unsloth
I'm not sure if it contains exactly what you're looking for, but it includes several resources and notebooks related to fine-tuning LLMs (including LoRA) that I found useful.
qcnguy
LLM fine-tuning tends to destroy the model's capabilities if you aren't very careful. It's not as easy or effective as with image generation.
israrkhan
Do you have a suggestion for a way to measure whether model capabilities are getting destroyed? How does one measure it objectively?
svachalek
For completeness, for Apple hardware MLX is the way to go.
jasonjmcghee
brev.dev made an easy-to-follow guide a while ago, but apparently Nvidia took it down or something when they bought them?
So here's the original
https://web.archive.org/web/20231127123701/https://brev.dev/...
minimaxir
If you're using Hugging Face transformers, the library you want to use is peft: https://huggingface.co/docs/peft/en/quicktour
There are Colab Notebook tutorials around training models with it as well.
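For a rough idea of what that looks like, here is a minimal sketch of attaching LoRA adapters with peft and training them with the standard Trainer; the base model name, data file, target modules, and hyperparameters are illustrative placeholders, not recommendations.

    # Minimal LoRA fine-tuning sketch with Hugging Face peft + transformers.
    # Model name, data file, and hyperparameters are placeholders.
    from datasets import load_dataset
    from peft import LoraConfig, get_peft_model
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, Trainer, TrainingArguments)

    base = "Qwen/Qwen2.5-0.5B"  # any small causal LM you can fit in memory
    tokenizer = AutoTokenizer.from_pretrained(base)
    model = AutoModelForCausalLM.from_pretrained(base)

    # Attach low-rank adapters to the attention projections; only these weights train.
    lora = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                      target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
    model = get_peft_model(model, lora)
    model.print_trainable_parameters()  # typically well under 1% of the base model

    data = load_dataset("json", data_files="train.jsonl")["train"]
    data = data.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=512),
                    remove_columns=data.column_names)

    Trainer(
        model=model,
        args=TrainingArguments("lora-out", per_device_train_batch_size=1,
                               num_train_epochs=1, learning_rate=2e-4),
        train_dataset=data,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    ).train()
    model.save_pretrained("lora-out")  # writes only the small adapter weights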
electroglyph
Unsloth is the easiest way to fine-tune due to its lower memory requirements
pdntspa
Have you tried asking an LLM?
Nesco
Zuck wouldn’t have leaked it on 4chan of all places
tough
Probably just told an employee to get it done, no?
vaenaes
Why not?
tonyhart7
Is GLM-4.5 better than Qwen3 Coder?
diggan
For what? It's really hard to say whether one model is "generally" better than another, as they're all better or worse at specific things.
My own benchmark has a bunch of different tasks I use various local models for, and I run it when I want to see if a new model is better than the existing ones I use. The output is basically a markdown table with a description of which model is best for what task.
They're being sold as general-purpose things that are uniformly better or worse at everything, but reality doesn't reflect this: they all have very specific tasks they're better or worse at, and the only way to find that out is by having a private benchmark you run yourself.
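For illustration, a toy version of that kind of private harness might look like the sketch below. It assumes a local OpenAI-compatible server (llama.cpp's llama-server, Ollama, LM Studio, ...) on localhost; the model names, tasks, and pass checks are placeholders.

    # Toy private benchmark: run each task against each local model and print
    # a markdown table. Endpoint, model names, tasks, and checks are placeholders.
    import requests

    MODELS = ["glm-4.5-air", "qwen3-coder"]  # whatever you have loaded locally
    TASKS = {
        "sql":   ("Write a SQL query selecting the top 5 rows by score.",
                  lambda out: "SELECT" in out.upper()),
        "regex": ("Write a regex matching ISO dates like 2025-07-29.",
                  lambda out: "\\d{4}" in out or "[0-9]{4}" in out),
    }

    def ask(model, prompt):
        r = requests.post("http://localhost:8080/v1/chat/completions", json={
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
        }, timeout=300)
        return r.json()["choices"][0]["message"]["content"]

    print("| task | " + " | ".join(MODELS) + " |")
    print("|------|" + "---|" * len(MODELS))
    for name, (prompt, passed) in TASKS.items():
        row = ["pass" if passed(ask(m, prompt)) else "fail" for m in MODELS]
        print("| " + name + " | " + " | ".join(row) + " |")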
kelvinjps10
Coding? They're coding models. What specific tasks is one performing better at than the other?
NitpickLawyer
I haven't tried them (released yesterday I think?). The benchmarks look good (similar I'd say) but that's not saying much these days. The best test you can do is have a couple of cases that match your needs, and run them yourself w/ the cradle that you are using (aider, cline, roo, any of the CLI tools, etc). Openrouter usually has them up soon after launch, and you can run a quick test really cheap (and only deal with one provider for billing & stuff).
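As a concrete example, a quick head-to-head via OpenRouter's OpenAI-compatible API can be as small as the sketch below; the model slugs are assumptions, so check openrouter.ai for the exact current IDs.

    # Quick comparison over OpenRouter (OpenAI-compatible endpoint).
    # The model slugs below are assumed; look up the real IDs on openrouter.ai.
    import os
    from openai import OpenAI

    client = OpenAI(base_url="https://openrouter.ai/api/v1",
                    api_key=os.environ["OPENROUTER_API_KEY"])

    prompt = "Write an HTML and JavaScript page implementing space invaders"
    for model in ["z-ai/glm-4.5-air", "qwen/qwen3-coder"]:
        resp = client.chat.completions.create(
            model=model, messages=[{"role": "user", "content": prompt}])
        print("=== " + model + " ===")
        print(resp.choices[0].message.content[:500])  # just skim the start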
bob1029
> still think it’s noteworthy that a model running on my 2.5 year old laptop (a 64GB MacBook Pro M2) is able to produce code like this—especially code that worked first time with no further edits needed.
I believe we are vastly underestimating what our existing hardware is capable of in this space. I worry that narratives like the bitter lesson and the efficient compute frontier are pushing a lot of brilliant minds away from investigating revolutionary approaches.
It is obvious that the current models are deeply inefficient when you consider how much you can decimate the precision of the weights post-training and still have pelicans on bicycles, etc.
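To make the "decimate the precision" point concrete, here is a toy round-to-nearest int4 quantization of a single weight matrix; real schemes (GPTQ, AWQ, llama.cpp k-quants) are cleverer, but the memory arithmetic is similar.

    # Toy post-training quantization: round fp32 weights to 4-bit ints with a
    # per-row scale, then dequantize and measure the error. Sizes are arbitrary.
    import numpy as np

    w = np.random.randn(4096, 4096).astype(np.float32)      # one "layer" of weights
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0       # int4 range is [-8, 7]
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)  # 4 bits of signal per weight
    w_hat = q * scale                                        # dequantized approximation

    err = np.abs(w - w_hat).mean() / np.abs(w).mean()
    print(f"fp32: {w.nbytes / 1e6:.0f} MB, packed int4: ~{q.size * 0.5 / 1e6:.0f} MB, "
          f"mean relative error: {err:.3f}")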
jonas21
Wasn't the bitter lesson about training on large amounts of data? The model that he's using was still trained on a massive corpus (22T tokens).
itsalotoffun
I think GP means that if you internalize the bitter lesson (more data more compute wins), you stop imagining how to squeeze SOTA minus 1 performance out of constrained compute environments.
reactordev
This. When we ran out of speed on the CPU, we moved to the GPU. Same thing here. The more we work with (22T) models, quants, and decimated precision, the more we learn and find novel ways to do things.
yahoozoo
What does that have to do with quantizing?
righthand
Did you understand the implementation or just that it produced a result?
I would hope an LLM could spit out a cobbled form of answer to a common interview question.
Today a colleague presented data changes and used an LLM to build an app to display the JSON for the presentation. Why did they not just pipe the JSON into our already-working app that displays this data?
People around me for the most part are using LLMs to enhance their presentations, not to actually implement anything useful. I have been watching my coworkers use it that way for months.
Another example? A different coworker wanted to build a document macro to perform bulk updates on courseware content, swapping old words for new words. To build the macro they first wrote a rubric to prompt an LLM correctly inside of a Word doc.
That filled rubric is then used to generate a program template for the macro. To define the requirements for the macro, the coworker then used a slideshow slide to list bullet points of functionality, in this case to Find+Replace words in courseware slides/documents using a list of words from another text document. Given the complexity of the system, I can't believe my colleague saved any time. The presentation was interesting though, and that is what they got compliments on.
However, the solutions are absolutely useless for anyone but the implementer.
simonw
I scanned the code and understood what it was doing, but I didn't spend much time on it once I'd seen that it worked.
If I'm writing code for production systems using LLMs I still review every single line - my personal rule is I need to be able to explain how it works to someone else before I'm willing to commit it.
I wrote a whole lot more about my approach to using LLMs to help write "real" code here: https://simonwillison.net/2025/Mar/11/using-llms-for-code/
photon_lines
This is why I love using DeepSeek's chain-of-thought output ... I can actually go through and read what it's 'thinking' to validate whether it's basing its solution on valid facts/assumptions. Either way, thanks for all of your valuable write-ups on these models; I really appreciate them, Simon!
vessenes
Nota bene - there is a fair amount of research indicating that models' outputs and 'thoughts' do not necessarily align with their chain-of-reasoning output.
You can validate this pretty easily by asking some logic or coding questions: you will likely notice that the final output is not necessarily the logical conclusion of the thinking; sometimes it's significantly orthogonal to it, or returns to reasoning in the middle.
All that to say - good idea to read it, but stay vigilant on outputs.
shortrounddev2
Serious question: if you have to read every line of code in order to validate it in production, why not just write every line of code instead?
simonw
Because it's much, much faster to review a hundred lines of code than it is to write a hundred lines of code.
(I'm experienced at reading and reviewing code.)
th0ma5
[flagged]
dang
Please don't cross into personal attack in HN comments.
https://news.ycombinator.com/newsguidelines.html
Edit: twice is already a pattern - https://news.ycombinator.com/item?id=44110785. No more of this, please.
Edit 2: I only just realized that you've been frequently posting abusive replies in a way that crosses into harangue if not harassment:
https://news.ycombinator.com/item?id=44725284 (July 2025)
https://news.ycombinator.com/item?id=44725227 (July 2025)
https://news.ycombinator.com/item?id=44725190 (July 2025)
https://news.ycombinator.com/item?id=44525830 (July 2025)
https://news.ycombinator.com/item?id=44441154 (July 2025)
https://news.ycombinator.com/item?id=44110817 (May 2025)
https://news.ycombinator.com/item?id=44110785 (May 2025)
https://news.ycombinator.com/item?id=44018000 (May 2025)
https://news.ycombinator.com/item?id=44008533 (May 2025)
https://news.ycombinator.com/item?id=43779758 (April 2025)
https://news.ycombinator.com/item?id=43474204 (March 2025)
https://news.ycombinator.com/item?id=43465383 (March 2025)
https://news.ycombinator.com/item?id=42960299 (Feb 2025)
https://news.ycombinator.com/item?id=42942818 (Feb 2025)
https://news.ycombinator.com/item?id=42706415 (Jan 2025)
https://news.ycombinator.com/item?id=42562036 (Dec 2024)
https://news.ycombinator.com/item?id=42483664 (Dec 2024)
https://news.ycombinator.com/item?id=42021665 (Nov 2024)
https://news.ycombinator.com/item?id=41992383 (Oct 2024)
That's abusive, unacceptable, and not even a complete list!
You can't go after another user like this on HN, regardless of how right you are or feel you are or who you have a problem with. If you keep doing this, we're going to end up banning you, so please stop now.
ajcp
They said "production systems", not "critical production applications".
Also the 'if' doesn't negate anything as they say "I still", meaning the behavior is actively happening or ongoing; they don't use a hypothetical or conditional after "still", as in "I still would".
bnchrch
You do realize you're talking to the creator of Django, Datasette, and Lanyrd, right?
CamperBob2
I missed the part where he said he was going to put the Space Invaders game into production. Link?
magic_hamster
The LLM is the solution.
bsder
> However the solutions are absolutely useless for anyone else but the implementer.
Disposable code is where AI shines.
AI generating the boilerplate code for an obtuse build system? Yes, please. AI generating an animation? Ganbatte. (Look at how much work 3Blue1Brown had to put into that--if AI can help that kind of thing, it has my blessings). AI enabling someone who doesn't program to generate some prototype that they can then point at an actual programmer? Excellent.
This is fine because you don't need to understand the result. You have a concrete pass/fail gate and don't care about what's underneath. This is real value. The problem is that it isn't gigabuck value.
The stuff that would be gigabuck value is unfortunately where AI falls down. Fix this bug in a product. Add this feature to an existing codebase. etc.
AI is also a problem because disposable code is what you would assign to junior programmers in order for them to learn.
jauntywundrkind
MLX does have decent/good software support among ML stacks. Targeting both iOS and Mac is a big win in itself.
I wonder what's possible, and what the software situation is today with the PC NPUs. AMD's XDNA has been around for a while, and XDNA2 jumps from 10->40 TOPS. AMD iGPUs can access huge amounts of memory: is it similar here? The "AMDXDNA" driver was merged in 6.14 last winter: where are we now?
But I'm not seeing any evidence that there's popular support in any of the main frameworks. https://github.com/ggml-org/llama.cpp/issues/1499 https://github.com/ollama/ollama/issues/5186
Good news: AMD has an initial llama.cpp implementation. I don't particularly know what it means, but the first gen supports W4ABF16 quantization, and newer chips support W8A16. https://github.com/ggml-org/llama.cpp/issues/14377 . I'm not sure what it's good for, but there is a Linux "xdna-driver": https://github.com/amd/xdna-driver . IREE has an experimental backend: https://github.com/nod-ai/iree-amd-aie
There's a lot of other folks also starting on their NPU journeys. ARM's Ethos, and Rockchip's RKNN recently shipped Linux kernel drivers, but it feels like that's just a start? https://www.phoronix.com/news/Arm-Ethos-NPU-Accel-Driver https://www.phoronix.com/news/Rockchip-NPU-Driver-RKNN-2025
AlexeyBrin
Most likely its training data included countless Space Invaders implementations in various programming languages.
gblargg
The real test is if you can have it tweak things. Have the ship shoot down. Have the space invaders come from the left and right. Add two player simultaneous mode with two ships.
wizzwizz4
It can usually tweak things, if given specific instruction, but it doesn't know when to refactor (and can't reliably preserve functionality when it does), so the program gets further and further away from something sensible until it can't make edits any more.
simonw
For serious projects you can address that by writing (or having it write) unit tests along the way, that way it can run in a loop and avoid breaking existing functionality when it adds new changes.
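For example, even a trivial regression test of the sort below (the collides() helper is a hypothetical stand-in for whatever logic the model keeps editing) gives that loop something objective to check after every change.

    # Hypothetical example of the kind of test you might keep alongside LLM-edited
    # code; run `pytest` after each model edit to catch regressions early.
    def collides(a, b):
        """Axis-aligned bounding-box overlap; each box is (x, y, w, h)."""
        return (a[0] < b[0] + b[2] and a[0] + a[2] > b[0] and
                a[1] < b[1] + b[3] and a[1] + a[3] > b[1])

    def test_overlapping_boxes_collide():
        assert collides((0, 0, 10, 10), (5, 5, 10, 10))

    def test_separated_boxes_do_not_collide():
        assert not collides((0, 0, 10, 10), (20, 20, 10, 10))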
quantumHazer
And probably some of the synthetic data is generated copies of the games already in the dataset?
I have this feeling with LLM-generated React frontends: they all look the same.
cchance
Have you used the internet? That's how the internet looks; they're all fuckin' React with the same layouts and styles, 90% shadcn, lol
bayindirh
Last time somebody asked for a "premium camera app for iOS", and the model (re)generated Halide.
Models don't emit something they don't know. They remix and rewrite what they know. There's no invention, just recall...
Uehreka
> Models don't emit something they don't know. They remix and rewrite what they know. There's no invention, just recall...
People really need to stop saying this. I get that it was the Smart Guy Thing To Say in 2023, but by this point it’s pretty clear that it’s not true in any way that matters for most practical purposes.
Coding LLMs have clearly been trained on conversations where a piece of code is shown, a transformation is requested (rewrite this from Python to Go), and then the transformed code is shown. It’s not that they’re just learning codebases, they’re learning what working with code looks like.
Thus you can ask an LLM to refactor a program in a language it has never seen, and it will “know” what refactoring means, because it has seen it done many times, and it will stand a good chance of doing the right thing.
That’s why they’re useful. They’re doing something way more sophisticated than just “recombining codebases from their training data”, and anyone chirping 2023 sound bites is going to miss that.
FeepingCreature
True where trivial; where nontrivial, false.
Trivially, humans don't emit something they don't know either. You don't spontaneously figure out Javascript from first principles, you put together your existing knowledge into new shapes.
Nontrivially, LLMs can absolutely produce code for entirely new requirements. I've seen them do it many times. Will it be put together from smaller fragments? Yes, this is called "experience" or if the fragments are small enough, "understanding".
satvikpendem
This doesn't make sense thermodynamically because models are far smaller than the training data they purport to hold and recall, so there must be some level of "understanding" going on. Whether that's the same as human understanding is a different matter.
mr_toad
> They remix and rewrite what they know. There's no invention, just recall...
If they only recalled they wouldn’t “hallucinate”. What’s a lie if not an invention? So clearly they can come up with data that they weren’t trained on, for better or worse.
NitpickLawyer
This comment is ~3 years late. Every model since gpt3 has had the entirety of available code in their training data. That's not a gotcha anymore.
We went from chatgpt's "oh, look, it looks like python code but everything is wrong" to "here's a full stack boilerplate app that does what you asked and works in 0-shot" inside 2 years. That's the kicker. And the sauce isn't just in the training set, models now do post-training and RL and a bunch of other stuff to get to where we are. Not to mention the insane abilities with extended context (first models were 2/4k max), agentic stuff, and so on.
These kinds of comments are really missing the point.
haar
I've had little success with agentic coding, and what success I have had has come paired with hours of frustration, where I'd have been better off doing it myself for anything but the most basic tasks.
Even then, when you start to build up complexity within a codebase - the results have often been worse than "I'll start generating it all from scratch again, and include this as an addition to the initial longtail specification prompt as well", and even then... it's been a crapshoot.
I _want_ to like it. The times where it initially "just worked" felt magical and inspired me with the possibilities. That's what prompted me to get more engaged and use it more. The reality of doing so is just frustrating and wishing things _actually worked_ anywhere close to expectations.
aschobel
Bingo, it's magical but the learning curve is very very steep. The METR study on open-source productivity alluded to this a bit.
I am definitely at a point where I am more productive with it, but it took a bunch of effort.
jan_Sate
Not exactly. The real utility of an LLM for programming is to come up with something new. For Space Invaders, instead of using an LLM for that, I might as well just manually search for the code online and use that.
To show that an LLM can actually provide value for one-shot programming, you need to find a problem for which there's no fully working sample code available online. I'm not trying to say that an LLM couldn't do that. But just because an LLM can come up with a perfectly working Space Invaders doesn't mean that it can.
tracker1
I have a friend who has been doing just that... usually with his company he manages a handful of projects where a bulk of the development is outsourced overseas. This past year, he's outpaced the 6 devs he's had working on misc projects just with his own efforts and AI. Most of this being a relatively unique combination of UX with features that are less common.
He's using AI with note-taking apps for meetings to enhance notes and flesh out technology ideas at a higher level, then refining those ideas into working experiments.
It's actually impressive to see. My personal experience has been far more disappointing to say the least. I can't speak to the code quality, consistency or even structure in terms of most people being able to maintain such applications though. I've asked to shadow him through a few of his vibe coding sessions to see his workflow. It feels rather alien to me, again my experience is much more disappointing in having to correct AI errors.
devmor
> The real utility value of LLM for programming is to come up with something new.
That's the goal for these projects anyway. I don't know that it's true or feasible. I find the RAG models much more interesting myself; I see the technology as having far more value in search than in generation.
Rather than write some Markov-chain-reminiscent Frankenstein function when I ask it how to solve a problem, I would like to see it direct me to the original sources it would use to build those tokens, so that I can see their implementations in context and use my judgement.
AlexeyBrin
You are reading too much into my comment. My point was that the test (a Space Invaders clone) used to assess the model has been irrelevant for some time now. I could have gotten a similar result with Mistral Small a few months ago.
MyOutfitIsVague
I don't think they are missing the point, because they're pointing out that the tools are still the most useful for patterns that are extremely widely known and repeated. I use Gemini 2.5 Pro every day for coding, and even that one still falls over on tasks that aren't well known to it (which is why I break the problem down into small parts that I know it'll be able to handle properly).
It's kind of funny, because sometimes these tools are magical and incredible, and sometimes they are extremely stupid in obvious ways.
Yes, these are impressive, and especially so for local models that you can run yourself, but there is a gap between "absolutely magical" and "pretty cool, but needs heavy guiding" depending on how heavily the ground you're treading has been walked upon.
For a heavily explored space, it's like being impressed that your 2.5 year old M2 with 64 GB RAM can extract some source code from a zip file. It's worth being impressed and excited about the space and the pace of improvement, but it's also worth stepping back and thinking rationally about the specific benchmark at hand.
NitpickLawyer
> because they're pointing out that the tools are still the most useful for patterns that are extremely widely known and repeated
I agree with you, but your take is much more nuanced than what the GP comment said! These models don't simply regurgitate the training set. That was my point with gpt3. The models have advanced from that, and can now "generalise" over the context in ways they could not do ~3 years ago. We are now at a point where you can write a detailed spec (10-20k tokens) for an unseen scripting language, and have SotA models a) write a parser and b) start writing scripts for you in that language, even though it never saw that particular scripting language anywhere in its training set. Try it. You'll be surprised.
stolencode
It's amazing that none of you even try to falsify your claims anymore. You can literally just put some of the code in a search engine and find the prior art example:
https://www.web-leb.com/en/code/2108
Your "AI tools" are just "copyright whitewashing machines."
These kinds of comments are really ignoring reality.
jayd16
I think you're missing the point.
Showing off moderately complicated results that are actually not indicative of performance because they are sniped by the training data turns this from a cool demo to a parlor trick.
Stating that, aha, jokes on you, that's the status quo, is an even bigger indictment.
Aurornis
> These kinds of comments are really missing the point.
I disagree. In my experience, asking coding tools to produce something similar to all of the tutorials and example code out there works amazingly well.
Asking them to produce novel output that doesn’t match the training set produces very different results.
When I tried multiple coding agents for a somewhat unique task recently they all struggled, continuously trying to pull the solution back to the standard examples. It felt like an endless loop of the models grinding through a solution and then spitting out something that matched common examples, after which I had to remind them of the unique properties of the task and they started all over again, eventually arriving back in the same spot.
It shows the reality of working with LLMs and it’s an important consideration.
phkahler
I find the visual similarity to Breakout kind of interesting.
elif
Most likely this comment included countless similar comments in its training data, likely all synthetic without any actual tether to real analysis.
Conflonto
That sounds so dismissive.
Until recently, I was not able to just download an 8-16 GB file that could generate a lot of different tools, games, etc. for me in multiple programming languages while in parallel ELI5-ing research papers for me, generating SVGs, and a lot, lot more.
But hey.
alankarmisra
I see the value in showcasing that LLMs can run locally on laptops — it’s an important milestone, especially given how difficult that was before smaller models became viable.
That said, for something like this, I’d probably get more out of simply finding an existing implementation on github or the like and downloading that.
When it comes to specialized and narrow domains like Space Invaders, the training set is likely to be extremely small and the model's vector space will have limited room to generalize. You'll get code that is more or less identical to the original source, you have to wait for it to 'type' the code, and the value add seems very low. I would rather ask it to point me to known Space Invaders implementations in language X on GitHub (or search there).
Note that ChatGPT gets very nervous if I put this into GPT to clean up the grammar. It wants very badly for me to stress that LLMs don't memorize and overfitting is very unlikely (I believe neither).
tossandthrow
Interesting, I cannot reproduce these warnings in ChatGPT - though this is something that really interests me, as it represents immense political power to be able to interject such warnings (explicitly, or implicitly via slight reformulations)
dr-detroit
[dead]
aaron695
[dead]
xianshou
I initially read the title as "My 2.5 year old can write Space Invaders in JavaScript now (GLM-4.5 Air)."
Though I suppose, given a few years, that may also be true!
lxgr
This raises an interesting question I’ve seen occasionally addressed in science fiction before:
Could today’s consumer hardware run a future superintelligence (or, as a weaker hypothesis, at least contain some lower-level agent that can bootstrap something on other hardware via networking or hyperpersuasion) if the binary dropped out of a wormhole?
bob1029
This is the premise of all of the ML research I've been into. The only difference is to replace the wormhole with linear genetic programming, neuroevolution, et al. The size of programs in the demoscene is what originally sent me down this path.
The biggest question I keep asking myself - What is the Kolmogorov complexity of a binary image that provides the exact same capabilities as the current generation LLMs? What are the chances this could run on the machine under my desk right now?
I know how many AAA frames per second my machine is capable of rendering. I refuse to believe the gap between running CS2 at 400fps and getting ~100b/s of UTF8 text out of an NLP black box is this big.
bgirard
> ~100b/s of UTF8 text out of a NLP black box is this big
That's not a good measure. NP problem solutions are only a single bit, but they are much harder to solve than CS2 frames for large N. If it could solve any problem perfectly, I would pay you billions for just 1b/s of UTF8 text.
bob1029
> If it could solve any problem perfectly, I would pay you billions for just 1b/s of UTF8 text.
Exactly. This is what compels me to try.
switchbak
This is what I find fascinating. What hidden capabilities exist, and how far could it be exploited? Especially on exotic or novel hardware.
I think much of our progress is limited by the capacity of the human brain, and we mostly proceed via abstraction which allows people to focus on narrow slices. That abstraction has a cost, sometimes a high one, and it’s interesting to think about what the full potential could be without those limitations.
lxgr
Abstraction, or efficient modeling of a given system, is probably a feature, not a bug, given the strong similarity between intelligence and compression and all that.
A concise description of the right abstractions for our universe is probably not too far removed from the weights of a superintelligence, modulo a few transformations :)
pulkitsh1234
Is there any website to see the minimum/recommended hardware required for running local LLMs? Much like 'system requirements' mentioned for games.
CharlesW
> Is there any website to see the minimum/recommended hardware required for running local LLMs?
LM Studio (not exclusively, I'm sure) makes it a no-brainer to pick models that'll work on your hardware.
svachalek
In addition to the tools other people responded with, a good rule of thumb is that most local models work best* at q4 quants, meaning the memory for the model (in GB) is a little over half the number of parameters (in billions), e.g. a 14B model may be about 8GB. Add some more for context, and maybe you want 10GB of VRAM for a 14B model. That will at least put you in the right ballpark for what models to consider for your hardware.
(*best performance/size ratio, generally if the model easily fits at q4 you're better off going to a higher parameter count than going for a larger quant, and vice versa)
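A back-of-envelope version of that rule of thumb; the ~0.58 bytes/parameter figure and the 1.2x context/overhead factor below are rough assumptions, not exact numbers for any particular quant format.

    # Rough q4 VRAM estimate; constants are ballpark assumptions.
    def est_vram_gb(params_billions, bytes_per_param=0.58, overhead=1.2):
        return params_billions * bytes_per_param * overhead

    for p in (4, 8, 14, 32, 70):
        print(f"{p:>3}B params @ ~q4: ~{est_vram_gb(p):.1f} GB")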
nottorp
> maybe you want 10GB of VRAM for a 14B model
... or if you have Apple hardware with their unified memory, whatever the assholes soldered in is your limit.
qingcharles
This can be a useful resource too:
GaggiX
https://apxml.com/tools/vram-calculator
This one is very good in my opinion.
jxf
Don't think it has the GLM series on there yet.
knowaveragejoe
If you have a HuggingFace account, you can specify the hardware you have and it will show on any given model's page what you can run.
stpedgwdgfhgdd
Aside from the fact that Space Invaders from scratch is not representative of real engineering, it will be interesting to see what the business model for Anthropic will be if I can run a solid code generation model on my local machine (no usage tier per hour or week), let's say, one year from now. At $200 per month for 2 years I could instead buy a decent Mx with 64GB (or perhaps even 128GB, taking residual value into account).
falcor84
How come it's "not representative of real engineering"? Other than copy-pasting existing code (which is not what an LLM does), I don't see how you can create a Space Invaders game without applying "engineering".
hbn
The prompt was
> Write an HTML and JavaScript page implementing space invaders
It may not be "copy pasting", but it's generating output that recreates, as best it can, what it learned from looking at Space Invaders source code during training.
The engineers at Taito who originally developed Space Invaders were not told "make Space Invaders" and then did their best to recall all the source code they'd looked at in their lives in order to re-type the source code of an existing game. From a logistics standpoint, where the source code already exists and is accessible, you may as well have copy-pasted it and fudged a few things around.
simonw
The source code for original Space Invaders from 1978 has never been published. The closest to that is disassembled ROMs.
I used that prompt because it's the shortest possible prompt that tells the model to build a game with a specific set of features. If I wanted to build a custom game I would have had to write a prompt that was many paragraphs longer than that.
The aim of this piece isn't "OMG look, LLMs can build Space Invaders" - at this point that shouldn't be a surprise to anyone. What's interesting is that my laptop can run a model that is capable of that now.
sharkjacobs
Making a space invaders game is not representative of normal engineering because you're reproducing an existing game with well known specs and requirements. There are probably hundreds of thousands of words describing and discussing Space Invaders in GLM-4.5's training data
It's like using an LLM to implement a red black tree. Red black trees are in the training data, so you don't need to explain or describe what you mean beyond naming it.
"Real engineering" with LLMs usually requires a bunch of up front work creating specs and outlines and unit tests. "Context engineering"
jasonvorhe
Smells like moving the goalposts. What will count as real engineering in 2028? Implementing Google's infra stack in your homelab?
phkahler
>> Other than copy-pasting existing code (which is not what an LLM does)
I'd like to see someone try to prove this. How many Space Invaders projects exist on the internet? It'd be hard to compare model-"generated" code to everything out there looking for plagiarism, but I bet there are lots of snippets pulled in. These things are NOT smart; they are huge and articulate information repositories.
simonw
Go for it. https://www.google.com/search?client=firefox-b-1-d&q=github+... has a bunch of results. Here's the source code GLM-4.5 Air spat out for me on my laptop: https://github.com/simonw/tools/blob/main/space-invaders-GLM...
Based on my mental model of how these things work I'll be genuinely surprised if you can find even a few lines of code duplicated from one of those projects into the code that GLM-4.5 wrote for me.
ben_w
Sorites paradox. Where's the distinction between "snippet" and "a design pattern"?
Compressing a few petabytes into a few gigabytes means they can't simply be copy-pasting all of the things they're accused of copy-pasting, from code to newspaper articles to novels. There's not enough space.
dmortin
" it will be interesting to see what the business model for Anthropic will be if I can run a solid code generation model on my local machine "
Most people won't bother buying powerful hardware for this; they will keep using SaaS solutions, so Anthropic could be in trouble if cheaper SaaS solutions come out.
qingcharles
The frontier models are always going to tempt you with their higher quality and quicker generation, IMO.
kasey_junk
I’ve been mentally mapping the models to the history of databases.
In the early days you had to pay for most databases. There are still paid databases that are just better than the ones you don't pay for. Some teams think the cost is worth the improvements, and there is a (tough) business there. Fortunes were made in the early days.
But eventually open source databases became good enough for many use cases, and they have their own advantages. So lots of teams use them.
I think coding models might have a similar trajectory.
qingcharles
You make a good point -- a majority of applications are now using open source or free versions[1] of DBs.
My only feedback is: are these the same animal? Can we compare an open-source DB vs. a paid/closed DB to me running an LLM locally? The biggest issue right now with LLMs is simply the cost of the hardware to run one locally, not the quality of the actual software (the model).
[1] e.g. SQL Server Express is good enough for a lot of tasks, and I guess would be roughly equivalent to the upcoming open versions of GPT vs. the frontier version.
zarzavat
Closed doesn't always win over open. People said the same thing about Windows vs Linux, but even Microsoft was forced to admit defeat and support Linux.
All it takes is some large companies commoditizing their complements. For Linux it was Google, etc. For AI it's Meta and China.
The only thing keeping Anthropic in business is geopolitics. If China were allowed full access to GPUs, they would probably die.
indigodaddy
Did pretty well with a Boggle clone. I like that it tries to produce a single HTML file (I didn't ask for that but was pleasantly surprised). It didn't include dictionary validation, so it needed a couple of prompts. Touch selection on mobile isn't the greatest, but I've seen plenty worse.
Keyframe
I went the other route with a Tetris clone the other day. It's definitely not a single prompt. It took me a solid 15 hours to get to this stage, and most of that was me thinking. BUT, except for one small trivial thing (the space invader logo in a pre tag), I haven't touched the code - just looked at it. I made it mandatory for myself to see if I could first greenfield myself into this project and then brownfield features and fixes. It's definitely a ton of work on my end, but it's also not something I'd be able to do in ~2 working days or less. As a cherry on top, even though it's still not done yet, I put in AI-generated music singing about the project itself. https://www.susmel.com/stacky/
Definitely a ton of things I learned about how to "develop" "with" AI along the way.
JKCalhoun
Cool — if only diagonals were easier. ;-) (Hopefully I'm being constructive here.)
indigodaddy
Yep I tried to have it improve that but actually didn't use the word 'diagonal' in the prompt. I bet it would have done better if I had..
indigodaddy
Had it try to improve diagonal selection, but it didn't seem to help much.
dust42
I tried with Claude Sonnet 4 and it does *not* work. So it looks like GLM-4.5 Air in a 3-bit quant is ahead.
Chat is here: https://claude.ai/share/dc9eccbf-b34a-4e2b-af86-ec2dd83687ea
Claude Opus 4 does work but is far behind Simon's GLM-4.5: https://claude.ai/share/5ddc0e94-3429-4c35-ad3f-2c9a2499fb5d
> Two years ago when I first tried LLaMA I never dreamed that the same laptop I was using then would one day be able to run models with capabilities as strong as what I’m seeing from GLM 4.5 Air—and Mistral 3.2 Small, and Gemma 3, and Qwen 3, and a host of other high quality models that have emerged over the past six months.
Yes, the open models have surpassed my expectations in both quality and speed of release. For a bit of context, when ChatGPT launched in Dec '22, the "best" open models were GPT-J (~6-7B) and GPT-NeoX (~20B). I actually had an app running live, with users, using GPT-J for ~1 month. It was a pain. The quality was abysmal, there was no instruction following (you had to start your prompt like a story, or come up with a bunch of examples and hope the model would follow along) and so on.
And then something happened, LLama models got "leaked" (I still think it was a on purpose leak - don't sue us, we never meant to release, etc), and the rest is history. With L1 we got lots of optimisations like quantised models, fine-tuning and so on, L2 really saw fine-tuning go off (most of the fine-tunes were better than what meta released), we got alpaca showing off LoRA, and then a bunch of really strong models came out (mistrals, mixtrals, L3, gemmas, qwens, deepseeks, glms, granites, etc.)
By some estimations the open models are ~6mo behind what SotA labs have released. (note that doesn't mean the labs are releasing their best models, it's likely they keep those in house to use on next runs data curation, synthetic datasets, for distilling, etc). Being 6mo behind is NUTS! I never in my wildest dreams believed we'll be here. In fact I thought it would take ~2years to reach gpt3.5 levels. It's really something insane that we get to play with these models "locally", fine-tune them and so on.