
'A lot worse than expected': AI Pac-Man clones, reviewed

Aurornis

This is a good summary of the AI app generation experience. They’re very good at producing something as long as it overlaps with what you can find in a lot of public blogs and tutorials. Pac-Man is a common intro game so there’s a lot of source material to train on.

But once you get past the initial wow factor, it’s hard to move forward with the refinement. Recently I tried to use multiple LLM tools to implement a solution to a problem I had solved before. I spent hours prompting and re-prompting, trying to tell them exactly what to do, but every time I’d pin down one fix, something else would break as the code gravitated back to a solution that didn’t work. Finally I re-read the tutorial in the library’s documentation and realized that the model was just pulling my code back toward the documentation example at every opportunity. My attempts to deviate and implement something more complex kept breaking it.

parasti

In my experience, the biggest roadblock to continuous conversation is context length. It fills up and the LLM starts forgetting parts of the conversation. Most tools don't even tell you that this has happened. But if you keep in mind that there is a buffer filling up, you can massively improve the quality of the output.
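A minimal sketch of what I mean: a rolling buffer you trim yourself before the tool silently does it for you. The 8k budget and the 4-characters-per-token estimate are made-up illustrative numbers, not any real tool's values.

    # Rough sketch: keep a conversation inside a fixed token budget by
    # dropping the oldest turns first. Budget and chars-per-token are
    # illustrative assumptions, not any tool's real values.
    MAX_TOKENS = 8_000

    def estimate_tokens(text: str) -> int:
        return len(text) // 4  # crude ~4 chars per token heuristic for English

    def trim_history(messages: list[dict]) -> list[dict]:
        # Keep the system prompt; drop the oldest turns until the rest fits.
        system, rest = messages[:1], messages[1:]
        while rest and sum(estimate_tokens(m["content"]) for m in system + rest) > MAX_TOKENS:
            rest.pop(0)  # this is the "forgetting" that most tools never surface
        return system + rest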

yzydserd

Many prompters don’t practice the hygiene of rolling the outcome of a conversation back up the thread as input. If 10 exchanges are spent arriving at some clean code, go back up the thread and include that code as the status “back then”.
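Something like this, as a sketch -- the role/content shape is just the generic chat-message format, and the names are illustrative:

    # Sketch: after a long thread converges on clean code, start a fresh
    # thread that carries only the settled result forward as the new baseline.
    def roll_up(settled_code: str, next_request: str) -> list[dict]:
        return [
            {"role": "system", "content": "You are a careful coding assistant."},
            {"role": "user", "content": (
                "Here is the current, working implementation:\n\n"
                + settled_code
                + "\n\nTreat this as the starting state. " + next_request
            )},
        ]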

nemomarx

What you could do, I guess, is write up 5-6 implementations of it yourself first, then feed those into the training set for the model, and then it should have a decent shot at understanding that particular problem. (maybe /s?)

I guess my question is if you had a good solution you'd done before, and also a good understanding of what kinda changes it needs, why bother with hours of AI prompting?

ehutch79

These models are supposed to be reasoning. Why would you need to do all that work? If I can teach it to an intern, why wouldn't a model that can reason be able to do that?

Unless all the promises and hype are false...

drdeca

Presumably the purpose of the exercise was to test the model’s capabilities / “take it for a spin”, not to obtain the working program (since, as you mention, they already had a working program for the task).

junipertea

To write a blog post! "I used AI to solve this challenging problem" gets more traction than "I solved this challenging problem"

cc62cf4a4f20

After spending some time vibe coding, I think this article is pretty accurate, in that it aligns with a) how poorly AI agents work in practice and b) the fact that non-coders expect magic from AI (which, to be fair, is what the AI companies are promising with all of their hype).

Where I have found vibe coding as an approach really shine is when I need to write some sort of quick utility to get a task done: something that might take an hour or more to slap together by hand to solve some menial task on a bunch of files. Here I can definitely throw it together quicker than doing it manually, and I don't care if the code is messy.

Larger, more complicated apps that are meant for production are painful to try to get AI tools to build. I spend so much time prompting the AI to get a task done without breaking something else that I doubt I'm any faster than just hand-coding it alongside a co-pilot.

LoganDark

Claude Code was released a little while ago, and I've been using it on a production codebase at work. It's really good at repetitive tasks like filling out JSON schemas, cloning boilerplate logic, and so on. It's also honestly not half bad at pointing me to the locations where bugs can be found.

I find that it works best when used by an actual programmer who has a good idea of exactly what they want to do and how they want it done. I often find myself telling it extremely specific things like: "add a switch case in this callback in this file", "add a command in this file after this other one", "create a new file in this directory that follows the convention of all the others", and so on. If you instruct it well, you can then tell it to repeat what it just did for every item in a list that is like 20 items long, and you will have saved hours of development time. Very rarely does it spit out fully functional code, but it's very good at saving you the time it takes to constantly repeat yourself.

(This codebase isn't that good at DRY; I try my best with things like higher-order functions, but there's only so much I can do, and I still need to repeat myself in many cases.)

stared

This reads like a sponsored article promoting xAI, without clear ethical disclosure. While it appears to be about AI-generated games in general, it focuses solely on Grok-generated content.

Other models often perform better (https://web.lmarena.ai/, https://aider.chat/docs/leaderboards/) - I have yet to meet anyone who uses Grok as their primary programming assistant.

jsight

I use it as my primary coding assistant, when I'm able to. I haven't paid for the more advanced models from others, and it seems to be the most advanced free-to-use thinking model at the moment.

Aider can't use Grok 3 with thinking yet, afaik, because xAI hasn't made it available in the API.

From what I'm hearing, it and Claude 3.7 "thinking" are very similar in performance.

cc62cf4a4f20

I've spent a lot of hours vibe coding with sonnet 3.7 thinking and I'm not seeing anything in the article that jumps out at me as being different from my experience.

tempaway47474

Lol at the idea that The Guardian would promote grok, or that the article is making grok look good

codeulike

This is a pretty good test: how long does it take an AI to write a proper Pac-Man clone, and how much prompting? All of these quick efforts look dismal. So how long would it take? Two days and a lot of back and forth, something like that? The ghost movement in Pac-Man was quite subtle and would take a long time to get right, I reckon.
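To give a sense of how subtle: in the arcade original each ghost chases a different target tile, and Pinky even inherits a famous overflow bug when Pac-Man faces up. A rough sketch of the chase-mode targeting rules, in illustrative Python (tile coordinates, x right, y down):

    # Sketch of the arcade chase-mode target tiles. The up-direction
    # offsets reproduce the original game's well-known overflow bug.
    def blinky_target(pac_tile):
        # Red: chase Pac-Man's tile directly.
        return pac_tile

    def pinky_target(pac_tile, pac_dir):
        # Pink: aim 4 tiles ahead; facing up, the bug also shifts 4 tiles left.
        dx, dy = pac_dir
        tx, ty = pac_tile[0] + 4 * dx, pac_tile[1] + 4 * dy
        if pac_dir == (0, -1):
            tx -= 4
        return (tx, ty)

    def inky_target(pac_tile, pac_dir, blinky_tile):
        # Cyan: take the tile 2 ahead of Pac-Man, then double the vector from Blinky.
        dx, dy = pac_dir
        px, py = pac_tile[0] + 2 * dx, pac_tile[1] + 2 * dy
        if pac_dir == (0, -1):  # same bug shifts the pivot 2 tiles left
            px -= 2
        return (2 * px - blinky_tile[0], 2 * py - blinky_tile[1])

    def clyde_target(pac_tile, clyde_tile, scatter_corner):
        # Orange: chase when more than 8 tiles away, else run to his corner.
        dist2 = (pac_tile[0] - clyde_tile[0]) ** 2 + (pac_tile[1] - clyde_tile[1]) ** 2
        return pac_tile if dist2 > 64 else scatter_corner

And that's before the scatter/chase mode timers, frightened mode, and the rule that ghosts can't reverse direction.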

Also this is hilarious:

Unfortunately, my invitation to chat further over Zoom apparently means I’m after his crypto. “[Expletive] you, scammer,” he says. I ask if it would help if I sent over my Guardian credentials. “How about [Expletive] you?” comes the reply.

copypasterepeat

Agreed. I feel like the right test would be: "OK, you can use AI to easily generate these pac-man prototypes in an hour or two. But once you get past the 'wow I was able to get all this with so little work', you still have a basically unplayable prototype. Nobody would seriously want to play these or pay for them. How about you make a full-blown pac-man clone with all the nuances of the original? How long does that take? Do you even start converging to the solution at some point or do you keep playing whack-a-mole with bugs and issues?"

snats

It's pretty funny to test AI models in-distribution. But they fail horribly once you push them even a bit [1].

I recently made LLMs play Minesweeper, and ALL the LLMs I tested had a pretty bad win-to-loss ratio. The only model that won more than 3 times was R1 (mind you, there were 50 games).

[1] https://snats.xyz/pages/articles/minesweeper_bench.html

fleshmonad

As long as it gets the logic right, I don't care. You probably also don't calculate the bitmasks for sprites in your head, or keep the full maze design there. So as long as you have the working logic, you can supply it with the map layout and sprites. But yeah, LLM capability is overhyped.

croes

As long as it’s for games, OK.

But AI-coded software for planes or pacemakers?

karwash

I had to build a Pacman game in Java Swing as one of the projects for my Java college course in 2023. We received a list of requirements which was about two pages long (so much more demanding than in The Guardian article).

ChatGPT was extremely useful in getting started, especially since in the beginning I didn't feel too comfortable with Swing. But as I built the project and the logic got more complex, AI became much less useful.

Overall it took me about 30 hours to build a complete game that met all the requirements of the project, and I got 100% on it.

I didn't lean on AI as much as I could have -- the point of the project was to get better at Java -- but I think that even if I had, that time would have been cut down to 20 hours.

karwash

One interesting application was for maze creation. ChatGPT was not able to code the maze specified in the requirements so I did some googling and found Python code which came pretty close to what I needed. I asked ChatGPT to rewrite it into Java and then I tweaked the result to meet the requirements.
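For flavour, the kind of Python maze code a quick search turns up is usually a classic recursive backtracker like the sketch below -- illustrative only, not the code I actually found (my requirements specified one fixed maze):

    # Sketch: recursive-backtracker maze generator on a w-by-h cell grid.
    # '#' is a wall, ' ' is a passage. Illustrative, not the found code.
    import random

    def generate_maze(w, h):
        walls = [[True] * (2 * w + 1) for _ in range(2 * h + 1)]
        stack, visited = [(0, 0)], {(0, 0)}
        while stack:
            x, y = stack[-1]
            walls[2 * y + 1][2 * x + 1] = False  # carve the current cell
            neighbours = [(x + dx, y + dy, dx, dy)
                          for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))
                          if 0 <= x + dx < w and 0 <= y + dy < h
                          and (x + dx, y + dy) not in visited]
            if neighbours:
                nx, ny, dx, dy = random.choice(neighbours)
                walls[2 * y + 1 + dy][2 * x + 1 + dx] = False  # open the wall between
                visited.add((nx, ny))
                stack.append((nx, ny))
            else:
                stack.pop()  # dead end: backtrack
        return walls

    for row in generate_maze(8, 5):
        print("".join("#" if cell else " " for cell in row))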

henning

Looking at YouTube Unity tutorials, it seems you could just straight up make your own Pac-Man in a few hours, with the benefit that you could actually control and understand what the fuck is going on at each step of the way. It would probably take a comparable but slightly higher amount of time in SDL, Raylib, PyGame, or other game-related libraries that don't provide as much hand-holding.

This reminds me of the fad of creating Twitter/Reddit/Hacker News clones in 15 minutes in some new web framework back in the 2000s/2010s. And of course none of the 15 minute Twitter clones actually have the hybrid fan-out architecture that Twitter wrote after a few years so that it wouldn't failwhale all the time. I'd love to see someone vibe code their way through that.

What I don't understand is that the same people who embrace long hours, grinding, and having secret founder DNA sauce also want to completely take their hands off the wheel and vibe code their way to financial freedom.

Note that just turning on Copilot or some other assistant and having it make code suggestions is fine, in the sense that you surrender far less control over architecture and correctness, and it's something I do at work to cut down on boilerplate. So I think there is a spectrum from micro AI assistance to macro AI vibe coding. For instance, you could ask the chatbot to help you implement specific parts of the app, like using Dijkstra's algorithm for the enemies' pathfinding, without just asking it to make the entire app.
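A sketch of what that micro-assist might look like -- Dijkstra on a grid maze. With uniform step costs BFS would do the same job; the maze format and names are my own illustrative choices:

    # Sketch: Dijkstra's algorithm over the open tiles of a grid maze.
    import heapq

    def shortest_path(maze, start, goal):
        # Returns the list of (row, col) tiles from start to goal, or None.
        dist, prev, queue = {start: 0}, {}, [(0, start)]
        while queue:
            d, (r, c) = heapq.heappop(queue)
            if (r, c) == goal:
                path = [goal]
                while path[-1] != start:
                    path.append(prev[path[-1]])
                return path[::-1]
            if d > dist[(r, c)]:
                continue  # stale queue entry from an earlier, longer route
            for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                if 0 <= nr < len(maze) and 0 <= nc < len(maze[0]) and maze[nr][nc] != "#":
                    if d + 1 < dist.get((nr, nc), float("inf")):
                        dist[(nr, nc)] = d + 1
                        prev[(nr, nc)] = (r, c)
                        heapq.heappush(queue, (d + 1, (nr, nc)))
        return None

    maze = ["#######",
            "#  #  #",
            "#  #  #",
            "#     #",
            "#######"]
    print(shortest_path(maze, (1, 1), (1, 5)))  # routes around the inner wall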

ptx

> “I’ll assume you want an accurate version of Pac-Man based on the classic arcade game – complete with a proper maze, four ghosts with distinct behaviours, power pellets, and smoother gameplay?” says Grok.

So the AI is already confused and spouting nonsense from the start. An "accurate version" which is "complete with [...] smoother gameplay"? Smoother than what? Smoother than what would be accurate? Smoother in what way?

Perhaps it's regurgitating a mangled version of a description of some improved implementation where "smoother gameplay" (compared to some reference) would make sense?

apwell23

My comment from another thread applies here too:

https://news.ycombinator.com/item?id=43362343