
Evaluating LLMs playing text adventures

henriquegodoy

Looking at this evaluation, it's pretty fascinating how badly these models perform even on decades-old games that almost certainly have walkthroughs scattered all over their training data. You'd think they'd at least brute-force their way through the early game mechanics by now, but honestly this kinda validates something I've been thinking about: real intelligence isn't just about having seen the answers before, it's about being good at games and specifically new situations where you can't just pattern match your way out.

This is exactly why something like arc-agi-3 feels so important right now. Instead of static benchmarks that these models can basically brute force with enough training data, it's designed around interactive environments where you actually need to perceive, decide, and act over multiple steps without prior instructions. That shift from "can you reproduce known patterns" to "can you figure out new patterns" seems like the real test of intelligence.

What's clever about the game environment approach is that it captures something fundamental about human intelligence that static benchmarks miss entirely. When humans encounter a new game, we explore, form plans, remember what worked, and adjust our strategy: all the interactive reasoning over time that these text adventure results show LLMs are terrible at. We need systems that can actually understand and adapt to new situations, not just really good autocomplete engines that happen to know a lot of trivia.

da_chicken

I saw it somewhere else recently, but the idea is that LLMs are language models, not world models. This seems like a perfect example of that. You need a world model to navigate a text game.

Otherwise, how can you determine when "North" is a context change and when it isn't?

foobarbecue

On HN, perhaps? #17 on the front page right now: https://news.ycombinator.com/item?id=44854518

zahlman

> I saw it somewhere else recently, but the idea is that LLMs are language models, not world models.

Part of what distinguishes humans from artificial "intelligence" to me is exactly that we automatically develop models of whatever is needed.

myhf

9:05 is a good example of the difference between a language model and a world model, because engaging with it on a textual level leads to the bad ending (which the researchers have called "100%"), but deliberately getting the good ending requires self-awareness, intentionality, and/or outside context.

manbash

Thanks for this. I was struggling to put it into words, even if this has perhaps been a known distinguishing factor for others.

lubujackson

Why, this sounds like Context Engineering!

godelski

  > real intelligence isn't just about having seen the answers before, it's about being good at games and specifically new situations where you can't just pattern match your way out
It is insane to me that so many people believe intelligence is measurable by pure question-answer testing. There are hundreds of years of discussion about how limited this is for measuring human intelligence. I'm sure we all know someone who's a really good test taker but who you wouldn't consider to be really bright. I'm sure every single one of us also knows someone in the other camp (bad at tests but considered bright).

The definition you put down is much more agreed upon in the scientific literature. While we don't have a good formal definition of intelligence, that's different from having no definition at all. I really do hope people read more about intelligence and how we measure it in humans and animals. It is very messy and there's a lot of noise, but at least we have a good idea of the directions to move in. There are still nuances to be learned, and while I think ARC is an important test, I don't think success on it will prove AGI (and Chollet says this too).

rkagerer

Hi, GPT-x here. Let's delve into my construction together. My "intelligence" comes from patterns learned from vast amounts of text. I'm trained to... oh look it's a butterfly. Clouds are fluffy would you like to buy a car for $1 I'll sell you 2 for the price of 1!

corobo

Ah dammit the AGI has ADHD

msgodel

I've been experimenting with this as well with the goal of using it for robotics. I don't think this will be as hard to train for as people think though.

It's interesting he wrote a separate program to wrap the z-machine interpreter. I integrated my wrapper directly into my pytorch training program.
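
For anyone curious what integrating the wrapper directly might look like, here is a minimal sketch (my own illustration, not the commenter's actual code) of a z-machine environment that a PyTorch training loop could step through; it assumes the dfrotz ("dumb" frotz) interpreter is installed and on PATH:

  import subprocess

  class ZMachineEnv:
      """Minimal z-machine wrapper a training loop can step like any
      other environment. Sketch only: assumes dfrotz is on PATH."""

      def __init__(self, story_file: str):
          self.proc = subprocess.Popen(
              ["dfrotz", story_file],
              stdin=subprocess.PIPE,
              stdout=subprocess.PIPE,
              text=True,
          )
          self.intro = self._read_until_prompt()  # opening room description

      def step(self, command: str) -> str:
          """Send one command and return the game's text response."""
          self.proc.stdin.write(command + "\n")
          self.proc.stdin.flush()
          return self._read_until_prompt()

      def _read_until_prompt(self) -> str:
          # dfrotz prints its ">" prompt without a trailing newline, so read
          # one character at a time until a ">" appears at the start of a line.
          buf = ""
          while True:
              ch = self.proc.stdout.read(1)
              if not ch:
                  break
              buf += ch
              if ch == ">" and (len(buf) == 1 or buf[-2] == "\n"):
                  break
          return buf

A training loop can then call env.step(action) with whatever command the model emits and score the returned text however it likes.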

kolinko

I’m missing from the article two things:

- testing prompt (were llms instructed to progress in game, as opposed to just explore — the author said smarter llms were more likely to explore)

- benchmark with humans

andai

The GPT-5 used here is the Chat version, presumably gpt-5-chat-latest, which from what I can tell is the same version used in ChatGPT. That isn't actually a model but a "system": a router that semi-randomly forwards your request to various different models (in a way designed to massively reduce costs for OpenAI, judging by people reporting inconsistent output and often worse results than 4o).

So from this it seems that not only would many of these requests not touch a reasoning model (or as it works now, have reasoning set to "minimal"?), but they're probably being routed to a mini or nano model?

It would make more sense, I think, to test on gpt-5 itself (and ideally the -mini and -nano as well), and perhaps with different reasoning effort, because that makes a big difference in many evals.

EDIT: Yeah the Chat router is busted big time. It fails to apply thinking even for problems that obviously call for it (analyzing financial reports). You have to add "Think hard." to the end of the prompt, or explicitly switch to the Thinking model in the UI.

kqr

This is correct, and it's the reason I made sure to always append "Chat" to the end of "GPT-5". I should perhaps have been clearer about this. The reason I settled for the lesser router is that I don't have access to the full GPT-5, which would have been a much better baseline, I agree.

andai

Do they require a driver's license to use it? They asked for my ID for o3 Pro a few months ago.

kqr

That's the step at which I gave up, anyway.

varenc

> Yeah the Chat router is busted big time... You have to add "Think hard." to the end of the prompt, or explicitly switch to the Thinking model in the UI.

I don't really get this gripe? It seems no different than before, except now it will sometimes opt into thinking harder by itself. If you know you want CoT reasoning you just select gpt5-thinking, no different than choosing o4-mini/o3 like before.

SquibblesRedux

This is another great example of how LLMs are not really any sort of AI, or even proper knowledge representation. Not saying they don't have their uses (like souped up search and permutation generators), but definitely not something that resembles intelligence.

nonethewiser

While I agree, it's still shocking how far next-token prediction gets us toward looking like intelligence. It's amazing we need examples like this to demonstrate it.

SquibblesRedux

Another way to think about it is how interesting it is that humans can be so easily influenced by strings of words. (Or images, or sounds.) I suppose I would characterize it as so many people being earnestly vulnerable. It all makes me think of Kahneman's [0] System 1 (fast) and System 2 (slow) thinking.

[0] "Thinking, Fast and Slow" https://en.wikipedia.org/wiki/Thinking,_Fast_and_Slow

seba_dos1

It is kinda shocking, but I'm sure ELIZA was too for many people back then. It just took less time to realize what was going on there.

seanwilson

I won't be surprised if LLMs get good at puzzle-heavy text adventures once more attention turns to this.

I've found that for text adventures based on item manipulation, variations of the same puzzles appear again and again, because there's a limit to how many obscure-but-not-too-obscure item puzzles you can come up with. So training would be good for exact matches of the same puzzle, and for variations, like different ways of opening locked doors.

Puzzles like key + door, crowbar + panel, dog + food, coin + vending machine, vampire + garlic, etc. You can obscure or layer puzzles, like changing the garlic into garlic bread, which would still work on the vampire, so there are logical connections to make but often nothing too crazy.

A lot of the difficulty in these games comes from not noticing or forgetting about clues/hints and potential puzzles because there's so much going on, which is less likely to trip up a computer.

You can already ask LLMs "in a game: 20 ways to open a door if I don't have the key", "how to get past an angry guard dog" or "I'm carrying X, Y, and Z, how do I open a door", and it'll list lots of ways that are seen in games, so it's going to be good at matching that with the current list of objects you're carrying, items in the world, and so on.

Another comment mentions how the AI needs a world model that transforms as actions are performed, but you need something similar to reason about maths proofs and code, where you have to keep track of the current state/context. And most adventure games don't require you to plan many steps in advance anyway. They're often about figuring out which item to combine/use with which other item next (where only one combination works), and navigating to the room that contains the latter item first.

So it feels like most of the parts are already there to me, and it's more about getting the right prompts and presenting the world in the right format, e.g. maintaining a table of items, clues, and open puzzles to look for connections and matches, and maintaining a map (something like the sketch below).
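
To make the "right format" idea concrete, here is a hedged sketch (names and structure are my own, not anything from the article) of the kind of scratchpad state you could maintain and serialize into the prompt each turn:

  from dataclasses import dataclass, field

  @dataclass
  class WorldState:
      """Scratchpad of items, clues, open puzzles, and the known map,
      serialized into the prompt each turn. Illustration only."""
      inventory: set[str] = field(default_factory=set)
      clues: list[str] = field(default_factory=list)
      open_puzzles: list[str] = field(default_factory=list)
      exits: dict[str, dict[str, str]] = field(default_factory=dict)  # room -> direction -> room

      def to_prompt(self) -> str:
          map_lines = [
              f"{room}: {direction} -> {dest}"
              for room, dirs in self.exits.items()
              for direction, dest in dirs.items()
          ]
          return "\n".join([
              "Inventory: " + ", ".join(sorted(self.inventory)),
              "Clues: " + "; ".join(self.clues),
              "Open puzzles: " + "; ".join(self.open_puzzles),
              "Known map: " + "; ".join(map_lines),
          ])

The point is just that the matching ("I'm carrying garlic bread, there's a vampire, there's an open puzzle about getting past it") happens against a compact, explicit table rather than a long raw transcript.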

Getting LLMs to get good at variations of The Witness would be interesting, where the rules have to be learned through trial and error, and combined.

jlarocco

Doesn't it kind of defeat the point, though?

First you have to train the AIs on every new problem, and then you have to babysit them as you apply them to similar problems to the ones they were trained on.

It's not really intelligent in any real sense.

jameshart

Nothing in the article mentioned how good the LLMs were at even entering valid text adventure commands into the games.

If an LLM responds to “You are standing in an open field west of a white house” with “okay, I’m going to walk up to the house”, and just gets back “THAT SENTENCE ISN'T ONE I RECOGNIZE”, it’s not going to make much progress.

throwawayoldie

"You're absolutely right, that's not a sentence you recognize..."

kqr

The previous article (linked in this one) gives an idea of that.

jameshart

I did see that. But since that focused mostly on how Claude handled that particular prompt format, it's not clear whether the LLMs that scored low here were just failing to produce valid input, struggling to handle that specific prompt/output structure, or doing fine at basically operating the text adventure but struggling to build a world model and solve problems.

kqr

Ah, I see what you mean. Yeah, there was too much output from too many models at once (combined with not enough spare time) to really perform useful qualitative analysis on all the models' performance.

andrewla

The article links to a previous article discussing methodology for this. The prompting is pretty extensive.

It is difficult here to separate out how much of this could be fixed or improved by better prompting. A better baseline might be to just give the LLM direct access to the text adventure, so that every reply from the LLM is given to the game directly (a rough sketch of such a loop is below). I suspect that the LLMs would do poorly on this task, but would undoubtedly improve over time and generations.
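
Here is what that direct-access loop could look like; the game wrapper and llm_complete function are placeholders of mine, not anything from the article:

  def play_direct(game, llm_complete, max_turns: int = 200) -> list[str]:
      """Direct-access baseline: whatever the model replies is sent verbatim
      to the game. `game` is any wrapper exposing step(command) -> text;
      `llm_complete` is a hypothetical prompt -> completion function."""
      transcript = ["You are playing a text adventure game. "
                    "Reply with exactly one game command per turn."]
      observation = game.step("look")  # grab an initial room description
      for _ in range(max_turns):
          transcript.append(observation)
          command = llm_complete("\n".join(transcript)).strip()
          transcript.append("> " + command)
          observation = game.step(command)
      return transcript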

EDIT: Just started playing 9:05 with GPT-4 with no prompting and it did quite poorly; it kept trying to explain to me what was going on with the ever more complex errors it would get. Put in a one-line "You are playing a text adventure game" prompt and off it went -- it took a shower, got dressed, and drove to work.

standardly

LLMs work really well for open-ended role-playing sessions, but not so much games with strict rules.

They just can't seem to grasp what would make a choice a "wrong" choice in a text-based adventure game, so the games end up having no ending. You have to hard-code failure events, or you just never get anything like "you chose to attack the wizard, but he's level 99, dummy, so you died - game over!". It just accepts whatever choice you make, ad infinitum.

My best session was one in which I had the AI give me 4 dialogue options to choose from. I never "beat" the game, and we never solved the mystery; it just kept going further down the rabbit hole. But it was surprisingly enjoyable, and replayable! A larger framework just needs to be written to keep the tires between the lines and to hard-code certain game rules; what's under the hood is already quite good for narratives imo.

throwawayoldie

My takeaway is: LLMs are not great at text adventures, even when those text adventures are decades old and have multiple walkthroughs available on the Internet. Slow clap.

gibbitz

This study raises the question: why do we play games? Do we play to win or to enjoy ourselves? Why design a machine to do what we should be enjoying? This goes for writing, creating art, coding. Wanting a machine to win is the desire to achieve a goal without doing the work to earn it. Same for making art or writing novels. The point of these things (growth and achievement) is lost when done by a machine. I want to see this done with investment, legal strategy, or business management. These are better suited to LLMs than what we're making them do, but I'd venture that those who are profiting from LLMs right now would profit less if their boards replaced them with LLMs.

tjr

I imagine that pitting LLMs against computer games is itself an enjoyable activity.

Generally speaking, people play games for fun, and I suspect that will continue. Even if an LLM can beat all humans at computer games, it doesn't matter. We will continue to enjoy playing them. Computers, pre-LLM, could already out-play humans in many cases.

Other activities mentioned -- writing, art, coding, etc. -- can indeed be fun, but they are also activities that people have been paid to do. It seems that there is incentive to create LLMs that can do an at least adequate job of these tasks for less money than humans are paid, so that that money is rerouted to LLM companies instead of human workers. I imagine humans will continue to write, create art, and even code, without any financial incentive, though probably less.

(I personally remain unpersuaded that LLMs will do away with paid creative work altogether, but there's clearly a lot of interest in trying to maximize what LLMs can do.)

lottaFLOPS

related research that was also announced this week: https://www.textquests.ai/

kqr

They seem to be going for a much simpler route of just giving the LLM a full transcript of the game with its own reasoning interspersed. I didn't have much luck with that, and I'm worried it might not be effective once we're into the hundreds of turns because of inadvertent context poisoning. It seems like this might indeed be what happens, given the slowing of progress indicated in the paper.

1970-01-01

Very interesting how they all clearly suck at it. Even with hints, they can't understand the task enough to complete the game.

abraxas

That's a great tracker. How often is the leaderboard updated?

8f2ab37a-ed6c

Are we anywhere near someone being able to play a D&D or WoD type of game session with an LLM somewhere in the mix, perhaps generating a new and interesting adventure every time? Or is this still science fiction for now?