
Converting C to ASM to specs and then to a working Z/80 Speccy tape

ohmygoodniche

The amount of cognitive dissonance here is interesting.

I compiled C to asm. The title says the LLM did this. It works! But it's broken. It generated a bunch of other files! But I only need one. It couldn't target Z80, so I was a human in the loop. You have to trust it and understand how the black box works to get N-factor gains. But no one knows how these tools actually work, general advice is NOT to trust LLM outputs, and the author didn't trust them either... And even the final result has the incorrect tax rates...

I'm not denying LLMs can sort of rewrite small chunks of code in other languages, add comments to code, etc., but the way people talk about them is so snake-oily.

Going by any of the major bullet points I would say that the title is wrong, and misleading at best.

throwaway150

> Going by any of the major bullet points I would say that the title is wrong, and misleading at best.

And it wouldn't be their first time.

Check https://news.ycombinator.com/item?id=43217357 by the same author. That post got flagged for being misleading too.

Read https://news.ycombinator.com/item?id=43220639 for same type of criticism.

The OP seems to be on an LLM spree. They ask an LLM to produce code. The code is invariably broken thanks to hallucinations, but they go ahead and post it on HN anyway with a misleading title.

ohmygoodniche

Ah well, maybe the blogs are written by an LLM to try to gain "agency", or to test how compelling their poopy code is against people on the internet for free... I cannot imagine what would motivate an actual person to be that publicly wrong, and frankly hostile about it, while maintaining any sense of reputation...

littlestymaar

I've tried to get LLMs to rewrite a bunch of stuff in Rust, from different languages (JavaScript, Python, C++, C), and I can definitely relate: LLMs cannot be trusted to rewrite anything significant without a lot of supervision (realistically, the only thing I've gained was not having to type boring boilerplate, but that's pretty much it).

And before you say “oh but it's because Rust is too hard”, SOTA LLMs don't have much trouble writing Rust code nowadays, and I suspect Rust is actually a better candidate than most languages as a target, because the compiler catches so many things and the errors are very explicit, which helps the LLM a lot during multi-turn rewriting sessions.

baq

You're missing the point.

The point isn't and never has been that it's a flawless tool. The point is that you can work the tool and get a working POC of something you'd never have attempted before in literally a couple of hours.

It isn't a compiler, it isn't a logician; it's a lossily compressed image of the whole internet with English as a query language. Use it within its operational envelope, which is what the author did, with some interesting results, possibly pointing to large implications for vibe-coding solutions you'd previously pay for.

feverzsj

For this piece of code, one could write a correct and much more performant version in much less time. If the tool only drags you down, you'd better drop it.

Jcampuzano2

Just a note that 'one can' does not mean everyone can. And this applies between any two languages or tasks. An AI will simply have better broad knowledge than practically anyone at a task they are unfamiliar with, so it can massively reduce the friction of getting started.

ohmygoodniche

The estimated run time of this code provided by another poster would have compelled me to seek alternatives as quickly as possible, that's for sure...

baq

Again, missing the point.

You can. I can't. I don't want to. I don't care if it's slow, I can easily tell if it's correct and I can make the LLM fix incorrectness (in this simple case, anyway).

The point is this project probably wouldn't have happened without an LLM.

foolswisdom

That sounds like exactly the point the OP is making. The way LLMs are spoken about implies much more than has actually been demonstrated.

sksrbWgbfK

> The point isn't and never has been that it's a flawless tool

It contradicts all the managers and vibe-coding developers who have been saying for months that it can replace 90% of a development team.

Jcampuzano2

I'm going to go against the grain and say that AI could feasibly replace a very large percentage of most development teams, even today.

Not because the AI itself could do all of the coding with no developers in the room at all, but because the developers who do know how to use it effectively could output so much more than those who don't that they would more than make up for the lost productivity.

Lots of companies have not yet fully embraced AI, and their developers are actually held back by not having access to it as a tool. But as someone who was recently given full access to sophisticated AI tooling, I can say it makes a massive difference in my productivity.

Lots of people don't like to hear this, but if you are not using AI today in some capacity, or your company is lagging in its adoption, your career is at risk, especially if you are still relatively young. I say this as somebody who was originally a massive AI skeptic but decided to give it a shot.

Yes, you still need to know how to code. That is not going away. But there will come a time when you yourself write an order of magnitude less code than you do today, because you will become more of a reviewer than a developer. Software development as a role will still exist, because in essence our job is to solve problems and build software; it just happens that we write a lot of code to do that nowadays. But we will reach a breaking point where we don't write much of the code ourselves at all, maybe just some edits, and we review orders of magnitude more than we do today.

ghuntley

Give it a try. You'll be surprised at what can be achieved when the LLM is driven via /specs (business requirements) + /stdlib (steering LLM technical outcomes). The end result, when driven by a good eval loop (property-based tests + a compiler that provides soundness, such as Haskell or Rust), is high-quality code outputted at brrrrrrrr speeds.

ohmygoodniche

Having seen and helped fix LLM-generated Rust and TypeScript code from colleagues and strangers, I would rather not rewrite code all day to make it hold water. Doing that for Haskell would probably give me an aneurysm.

idiotsecant

[flagged]

ghuntley

[flagged]

20k

Every test I've ever tried with an LLM to get it to generate code has produced a complete, unworkable mess. I have no idea what people are generating with them, but it's always far less work to write something that I understand than to spend the time trying to fix up a complete disaster that barely even begins to touch the problem I asked it to solve.

Jcampuzano2

I'm not saying this is you, but I have to ask, as someone who has had success using LLMs but originally had this same mentality: what is the most recent model and tool you used to try to generate workable code?

Even if you tried it just 3-6 months ago, tooling has improved so much in that small amount of time that your experience may already be out of date. I had your same experience when I tried before, but I have since readily been able to get LLMs to generate thousands of lines of actually useful and readable code for my job.

I tried generating code before and dismissed it because I had similarly bad experiences. But having now generated entire apps myself where I barely wrote any code, apps that were actually usable and productive, including internal tooling, personal apps, and code that is client-facing in production today, I don't think this really applies anymore. LLMs are more than capable of producing lots of high-quality code given the right tools.

YuukiRey

You’re digging your own hole by using “brrrrrrr speeds” as a marketing term. It doesn’t help your overall argument.

beagle3

Interesting. Surprisingly, it decided to encode the multiplication and division as addition/subtraction loops, which is incredibly inefficient: multiplying e.g. 32,000 by 32,000 (ignoring the overflow ...) will take 1,024,000,000 iterations, so thousands of seconds on the Speccy's humble 4 MHz Z80 (8 instructions, each taking at least 2 T-states ...)

Here is the multiplication loop (division is similar but in reverse, subtracting instead of adding).

    mult_loop:
     ; Check if BC is zero
     ld a, b
     or c
     jr z, mult_done
     
     ; Add HL to result
     ex de, hl           ; DE = multiplier, HL = result
     add hl, de          ; Add multiplier to result
     ex de, hl           ; DE = result, HL = multiplier
     
     ; Decrement counter
     dec bc
     jr mult_loop
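
For comparison, the standard shift-and-add approach needs only 16 iterations regardless of the operand values. A minimal sketch of that technique (register assignment here is illustrative rather than taken from the generated output; unsigned, keeping only the low 16 bits of the result):

    mul16:
     ; HL = BC * DE (unsigned, low 16 bits), shift-and-add
     ld hl, 0            ; result = 0
     ld a, 16            ; 16 bits to process
    mul16_loop:
     add hl, hl          ; result <<= 1
     sla c
     rl b                ; BC <<= 1, old bit 15 into carry
     jr nc, mul16_skip
     add hl, de          ; that bit was set: result += multiplicand
    mul16_skip:
     dec a
     jr nz, mul16_loop

Sixteen passes through the loop, whatever the operands, instead of one pass per unit of the multiplier.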

ghuntley

Oh interesting. One thing about the Speccy is that it's incredibly hard to hook in the eval technique (see the middle/bottom of https://ghuntley.com/specs/ ) as deployment is manual/human. So I had to drive it all by hand and just accept it, as AFAIK there's no testing framework. If this were another programming language, I would have taken the approach of creating a `cargo bench` over the application and then looping _that_ back into the LLM to identify and resolve the performance issues. I've done it; it works well. Just not on the Speccy :P

fancyfredbot

As someone who is generally very excited about the potential of LLMs to improve developer productivity I find this article a bit frustrating.

This isn't a productive way to use an LLM. The example was so trivial you could easily rewrite it from scratch in less time.

That would be irrelevant if the process scaled to more complex applications, but the blog shows it repeatedly failing to understand or implement even this simple example. This left me with little hope that the technique scales.

I worry that creating so much hype will lead to some kind of backlash.

johnisgood

If backlash means less demand that drives down prices, I am all up for it, so I can be productive for cheaper.

stevekemp

At some point the food tax dropped from 10% to 5%, which I guess is good for people who need to eat!

I still do a fair bit of Z80 programming for myself, and I'm very familiar with the Spectrum, so this was a nice article to see, but it's a bit underwhelming how well this seemed to go.

ghuntley

Yeah, this was pointed out to me after publishing. Specifically, when the ASM was converted over to /specs, it wrote the wrong sales tax rate into the specs. The Z80 implementation then took that incorrect /spec and made it happen. At this stage, I'm sure that if I had put more care/effort into transpiling from the source to specs (i.e., connecting two or more LLMs as a check on that process), this problem could have been avoided.

sneak

…and it changed the default tax rate from 25% to 10%.

It’s still impressive, but it’s basically advanced autocomplete. You still have to read and check every line and make sure it is doing what you expect it to do.

ghuntley

covered here https://news.ycombinator.com/item?id=43387590

But yes, indeed you still need to watch it like a hawk when using it outside of a tight eval loop (i.e. make all | make property-tests) and/or implement some other form of eval loop on the original /spec generation.

ohmygoodniche

The author says you have to trust the LLMs, though, for it to work... You can downvote this, but read the article's conclusions first...

aNoob7000

Are there any demos showing how to manage an existing codebase? Everyone loves to demo how AI can create new programs in a snap, but the elephant in the room is how well AI works with existing codebases and manages things like naming conventions, APIs from other apps, etc.

ghuntley

That's an interesting question, and I spend plenty of time exploring this at my current client. The best tip I can give you right now is really: it depends.

For example, .NET/Java have this horrendous convention of splitting files into separate locations by the hundreds across the filesystem.

i.e. com/yada/repository | com/yada/models | com/yada/controllers | com/yada/services | com/yada/dtos et al

See https://www.youtube.com/watch?t=1507&v=J1-W9O3n7j8&feature=y... for an excellent discussion from folks sharing learnings that this is an anti-pattern for the current generation of AI assistants. By splitting the tests + everything related to the code being modified into separate files, the LLM does worse.

Depending on the 'uniqueness' of the codebase and how it has been 'structured for humans' (vs being structured for LLMs; see above), one will need to do some funky stuff with building custom MCP tools that teach the AI assistants how to work with the codebase...

ghuntley

> manages things like naming conventions, APIs from other apps, etc.

These particular concerns can be handled via https://ghuntley.com/stdlib and it works very very well.

consumer451

I just learned something about LLMs: start a new chat as often as possible. I knew that this was a best practice, but I didn't know how quickly the situation becomes dire.

> We evaluate 12 popular LLMs that claim to support contexts of at least 128K tokens. While they perform well in short contexts (<1K), performance degrades significantly as context length increases. At 32K, for instance, 10 models drop below 50% of their strong short-length baselines. Even GPT-4o, one of the top-performing exceptions, experiences a reduction from an almost-perfect baseline of 99.3% to 69.7%. [0]

I copied all the code that I could from TFA and pasted it into OpenAI's tokenizer. It counted ~15k tokens. Many other tokens were generated in TFA's chat, some of which are not visible to the user. I think it's fair to assume that the entire chat was at least 25k tokens, right? Therefore, I believe that by the end of that chat, 4o's performance was significantly degraded.

I think a major skill to develop for LLM-supported coding is compressing a chat, after just a few thousand tokens, into something like a step1.md file. Then start a new chat with "read step1.md" as the first prompt, and so on.

Is my logic sound here?

[0] https://arxiv.org/abs/2502.05167

ghuntley

It is sound! The real context window is indeed smaller than the advertised marketing number. This gets interesting in verbose languages such as Java, where I've seen +16k tokens wasted on the classpath/import stanzas within a class. Yick.

The current series of prompts I'm using is somewhat like manually managing/allocating memory, but for the LLM context window.

It starts with a PROMPT.md with this content.

```

create or update implementation_status.md with the implementation status

study /specs and implement what has not been implemented yet in /src

create property-based tests inline in the source file that is being implemented

run "make" to verify the implementation after each change

```

If it goes off the rails (and it does), restart the chat and use this.

```

@prompt

continue

```

None of this works without applying /stdlib to control the technical outcomes/patterns to steer the LLM. Otherwise you just get slop.

consumer451

> None of this works without applying /stdlib to control the technical outcomes/patterns to steer the LLM. Otherwise you just get slop.

The LLM coding tool I have the most experience with is Windsurf IDE + Sonnet 3.5. In Windsurf you can define both global and project rules in .md files. I have found that managing those files very closely is key to success. They just tried to automate that with auto-generated "Memories" but they are generally slop, and I delete them. Managing the project rule file will save you so much pain. That is where I define the frameworks and APIs to use.

ghuntley

<3 you get it. It's a new skillset to be learned and when it clicks, you get incredible outcomes!

hakaneskici

Thanks for publishing this.

Can you also share how you would compare the "code to spec" vs "spec to code" phases?

I'm wondering if the LLM considers "code" and "spec" as two separate programming languages, or one as a programming language and the other as a human language? Not sure if it makes a difference or not for its internal translation logic though, if that makes sense.

PS: I learned BASIC on a friend's ZX Spectrum, and your post made me remember some forgotten childhood memories :) Extra thanks.

ghuntley

> I'm wondering if the LLM considers "code" and "spec" as two separate programming languages, or one as a programming language and the other as a human language? Not sure if it makes a difference or not for its internal translation logic though, if that makes sense.

It would have been possible to go directly from Intel asm to Z80 asm without /specs.

> and the other as a human language

There's some research here from a couple years ago over at https://githubnext.com/projects/speclang/ which is all about /specs as the source of truth for creating an app.


feverzsj

Did he input all the prompts in the same session? It's kinda nonsense bs. Any disassembler is more useful than this.

ghuntley

It was more about applying /specs to generate specs, then applying /stdlib + /specs to generate an application for another platform without using the original reference application/disassembler, proving the point that the technique works to re-create any software from /specs via an LLM...

ohmygoodniche

No, it's a bunch of sessions with manual interventions, reprompts, and attempts all throughout, according to the author.

Downvote it if you want; it's factually correct. Read the article.

feverzsj

There is no comment or string containing "Food", "Hygiene", etc. in the asm, because the rodata is not dumped. If he inputted the listed asm in a separate session, how could the LLM have predicted them?

DeathArrow

TL;DR: No, it cannot.