Show HN: Beating Pokemon Red with RL and <10M Parameters

70 comments · March 5, 2025

Hi everyone!

After spending hundreds of hours on this, we're excited to finally share our progress in developing a reinforcement learning system that beats Pokémon Red. Our system completes the game using a policy with under 10M parameters, PPO, and a few novel techniques. With the release of Claude Plays Pokémon, now feels like the perfect time to showcase our work.
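
For a sense of scale, a sub-10M-parameter pixel-input policy with separate PPO policy and value heads might look like the minimal PyTorch sketch below. This is not the authors' architecture; every layer size here is an illustrative guess.

    # Illustrative sketch only: NOT the authors' architecture.
    import torch
    import torch.nn as nn

    class TinyPolicy(nn.Module):
        def __init__(self, n_actions: int = 8):
            super().__init__()
            # Game Boy screen is 144x160; assume one grayscale frame as input.
            self.encoder = nn.Sequential(
                nn.Conv2d(1, 32, kernel_size=8, stride=4), nn.ReLU(),
                nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
                nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
                nn.Flatten(),
            )
            with torch.no_grad():  # probe the flattened feature size
                flat = self.encoder(torch.zeros(1, 1, 144, 160)).shape[1]
            self.hidden = nn.Sequential(nn.Linear(flat, 512), nn.ReLU())
            self.policy_head = nn.Linear(512, n_actions)  # action logits for PPO
            self.value_head = nn.Linear(512, 1)           # state-value estimate

        def forward(self, x):
            h = self.hidden(self.encoder(x))
            return self.policy_head(h), self.value_head(h)

    model = TinyPolicy()
    print(sum(p.numel() for p in model.parameters()))  # ~7.4M, under the 10M budget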

We'd love to get feedback!

levocardia

Really cool work. It seems like some critical areas (team rocket, safari zone) rely on encoding game knowledge into the reward function somehow, which "smuggles in" external intelligence about the game. A lot of these are related to planning, which makes me wonder whether you could "bolt on" an LLM to do things like steer the RL agent, dynamically choose what to reward, or even do some of the planning itself. Do you think there's any low-hanging fruit on this front?
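
(For readers wondering what "encoding game knowledge into the reward function" can look like in practice, here is a minimal hedged sketch: a one-time bonus per scripted milestone. The event names, values, and the RAM-flag stub are hypothetical, not the project's actual reward code.)

    # Hypothetical milestone rewards; names and values are illustrative.
    MILESTONE_REWARDS = {
        "beat_rocket_hideout": 5.0,   # Team Rocket segment
        "entered_safari_zone": 5.0,   # Safari Zone segment
    }

    def read_event(game_state: dict, event: str) -> bool:
        # Stand-in for reading an event flag out of emulator RAM.
        return bool(game_state.get(event, False))

    def shaped_reward(game_state: dict, claimed: set) -> float:
        """Pay each milestone bonus once, on first observation."""
        reward = 0.0
        for event, bonus in MILESTONE_REWARDS.items():
            if event not in claimed and read_event(game_state, event):
                claimed.add(event)
                reward += bonus
        return reward

    claimed = set()
    print(shaped_reward({"beat_rocket_hideout": True}, claimed))  # 5.0 on first hit
    print(shaped_reward({"beat_rocket_hideout": True}, claimed))  # 0.0 afterwards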

Xelynega

For a well-known game like "Pokemon Red", I wonder how much of that game knowledge an LLM would "smuggle in" via its training data if you replaced the external info in the reward function with one, or used it to make up for other deficiencies.

I think they allude to this in their conclusion, but it's less about low-hanging fruit and more about designing a system that feeds game dialogue back into the RL decision-making process in a way that can be mutated as part of the RL (be it with an LLM or something else).

drubs

Wrote about this in the results section. I think there is a way to mix the two and simplify the rewards in the process. A lot of the magic behind getting the agent to teach and use Cut probably could have been handled by an LLM.

rvz

Note: what makes this interesting is that it's a pre-LLM project, showing that for some problems you don't need an LLM at all. A plain old reinforcement learning algorithm and a deep neural network are a perfect fit.

This is what I want to see more of, and it cuts against the LLM hype. What a great RL project.

Meanwhile, "Claude" is still stuck somewhere in the game. Imagine the costs of running that vs this project.

mclau156

Claude 3.7 recently failed to finish Pokémon after getting stuck in a corner and deciding it was impossible to get out.

xinpw8

Not our agents. A hierarchical approach would be superior: add RL to Claude and it's gg.

N_Lens

Wow, nice work. 10M is a tiny model, and I suspect this might be the future for specialised work. I can also imagine progress towards AGI/ASI using smaller models as submodules.

Brains basically have “modules” like this as well: neuronal columns that handle specialised tasks. For example, when you're driving, judging whether the distance between you and the vehicle in front is increasing or decreasing is handled by a finely tuned, specialised part of the brain.

novia

Please stream the gameplay to Twitch so people can compare.

tehsauce

We have a shared community map where you can watch hundreds of agents from multiple people's training runs playing in real time!

https://pwhiddy.github.io/pokerl-map-viz/

Matthyze

That's amazing. Really awesome work.

novia

Can you make a Twitch stream of a single agent playing?

drubs

It wouldn't make much sense: we generally train with 288 environments simultaneously. I've been thinking about ways to nicely stream all 288 environments, though.
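
(For illustration, stepping many environments in lockstep might look like the sketch below, using Gymnasium's vector API with a placeholder task; the project's actual training stack may differ.)

    # Generic vectorized-stepping sketch; CartPole stands in for the emulator.
    import gymnasium as gym
    import numpy as np

    NUM_ENVS = 288  # matches the figure mentioned above

    envs = gym.vector.SyncVectorEnv(
        [lambda: gym.make("CartPole-v1") for _ in range(NUM_ENVS)]
    )
    obs, info = envs.reset(seed=0)
    actions = np.array([envs.single_action_space.sample() for _ in range(NUM_ENVS)])
    obs, rewards, terminations, truncations, infos = envs.step(actions)
    print(obs.shape, rewards.shape)  # (288, 4) (288,)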

benopal64

Incredible work. I am just learning about PyBoy from your project, and it made me think of many fun ways to use that library to play Pokemon autonomously.

xinpw8

Very good to hear. Join the PyBoy/Pokémon Discords! https://discord.gg/UXpjQTgs https://discord.gg/EVS3tAGm

bubblyworld

What an awesome project! I'm curious - I would have thought that rewarding unique coordinates would be enough to get the agent to (eventually) explore all areas, including the key ones. What did the agents end up doing before key areas got an extra reward?

(and how on earth did you port Pokémon Red to an RL environment? O.o)
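
(For concreteness, a unique-coordinate reward of the kind described might look like this minimal sketch; the (map_id, x, y) tuple and the bonus value are illustrative, with positions read from emulator RAM in the real project.)

    def exploration_reward(map_id: int, x: int, y: int,
                           seen: set, bonus: float = 0.01) -> float:
        """Pay a small bonus the first time each (map, x, y) tile is visited."""
        coord = (map_id, x, y)
        if coord in seen:
            return 0.0
        seen.add(coord)
        return bonus

    seen = set()
    print(exploration_reward(1, 5, 7, seen))  # 0.01: novel tile
    print(exploration_reward(1, 5, 7, seen))  # 0.0: already visited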

drubs

The environments wouldn't concentrate enough in the Rocket Hideout beneath Celadon Game Corner. The agent would have the player wander the world, reward hacking. With wild battles enabled, the environments would end up in Lavender Tower fighting Gastly.

> (and how on earth did you port Pokémon Red to an RL environment? O.o)

Read and find out :)

bubblyworld

Thanks haha, I kept reading =D I see, so it's not just that you have to visit the key areas; they need to show up in the episodes often enough to provide a signal for training.

drubs

Yup!

wegfawefgawefg

You don't port it, you wrap it. You can put anything in an RL environment. Usually emulators are done with BizHawk and some Lua; worst case there's FFI or screen capture.
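
As a concrete illustration of wrapping rather than porting, here is a minimal Gym-style wrapper around PyBoy. It assumes PyBoy 2.x's button()/tick()/screen interface and a local ROM file; the reward is a stub, and this is not the authors' actual wrapper.

    # Minimal sketch, assuming the PyBoy 2.x API; details vary by version.
    import numpy as np
    from pyboy import PyBoy

    ACTIONS = ["up", "down", "left", "right", "a", "b", "start"]

    class RedEnv:
        def __init__(self, rom_path: str = "pokemon_red.gb"):
            self.pyboy = PyBoy(rom_path, window="null")  # headless emulator

        def reset(self) -> np.ndarray:
            # A real wrapper would load a saved game state here.
            return self._obs()

        def step(self, action: int):
            self.pyboy.button(ACTIONS[action])  # press; auto-released next frame
            self.pyboy.tick(24)                 # advance ~24 frames per decision
            return self._obs(), self._reward(), False, {}

        def _obs(self) -> np.ndarray:
            return np.asarray(self.pyboy.screen.ndarray)  # (144, 160, 4) RGBA

        def _reward(self) -> float:
            return 0.0  # stub: real rewards read events/coordinates from RAM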

bubblyworld

Right, my thought was that this would be way too slow for episode rollouts (versus an accelerated implementation in JAX or something), but I guess not!

wegfawefgawefg

Well, that's the golden issue with RL: sample efficiency. It's environment-bounded, so you want an architecture that extracts the maximum possible information from each collected sample, avoiding catastrophic forgetting and prioritizing samples according to relevance.
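
(One standard version of "prioritizing samples according to relevance" is proportional prioritized sampling from a replay buffer, sketched below with numpy. Note this applies to replay-based methods; the project itself uses on-policy PPO.)

    import numpy as np

    def sample_indices(priorities: np.ndarray, batch_size: int,
                       alpha: float = 0.6) -> np.ndarray:
        """Draw transitions with probability proportional to priority**alpha."""
        scaled = priorities ** alpha       # alpha < 1 tempers the distribution
        probs = scaled / scaled.sum()
        return np.random.choice(len(priorities), size=batch_size, p=probs)

    td_errors = np.array([0.1, 2.0, 0.5, 3.0])  # e.g. absolute TD errors
    print(sample_indices(td_errors, batch_size=2))  # high-error samples favored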

drubs

My first version of this project five years ago actually involved a Python-Lua named pipe using BizHawk. No clue where that code went.

modeless

Can't Pokemon be beaten by almost random play?

drdeca

Judging by “pi plays Pokemon Sapphire”, uh, not in a reasonable amount of time? It's been at it for over 3 years, hasn't gotten a gym badge yet, and mostly stays in the starting town.

tehsauce

It's impossible to beat with random actions or brute force, but you can get surprisingly far. It doesn't take too long to get halfway through Route 1, but even with insane compute you'll never even make it to Viridian Forest.

VertanaNinjai

It can be brute-forced, if that's what you mean. It has a fairly low difficulty curve, and these old games have a grid system for movement and action selection. That's why they're pointing out the low parameter count and CPU use. The point I took away is doing more with less.

xinpw8

It definitely cannot be beaten using random inputs. It doesn't even get out of Pallet Town after billions of steps. We tested...

fancyswimtime

The game has been beaten by fish.

bloomingkales

The win condition of the game is just the entire game state configured in a certain way, so there exist many winning states; you just have to do a search.

xinpw8

Not sure what you mean... details?

kerkeslager

Are there any uses for AI yet that aren't either:

1. Doing things humans do for fun.

2. Doing things that AI is horribly terrible at.

?

drubs

There are a ton of applications for AI. Back when I was at Spotify, I co-authored Basic Pitch (https://basicpitch.spotify.com/), an audio-to-MIDI library. There are plenty of uses for AI outside of what's heavily publicized.

sadeshmukh

Medical imaging, spotting anomalies

Autonomous drones

Financial fraud detection

Scheduling of trains/buses/etc

I personally do like chatbots but you probably don't

xinpw8

the only chatbot for me is SmarterChild

bigfishrunning

I feel like that sentence aged me.

throwaway314155

Awesome! Why do you think the reward for reading signs helped? I'm assuming the model doesn't gain the ability to read and understand English just from RL, so what purpose does it serve other than maybe wasting ticks on signs that ultimately don't need to be read?

drubs

It's silly, but signs were a way to incentivize the agent to explore deeper into the Safari Zone, among other areas.

jononor

Very nice! It's good to see demonstrations of reinforcement learning being used to solve non-trivial tasks.

differintegral

This is very cool, congrats!

I wonder, does anyone have a sense of the approximate raw number of button presses required to beat the game? Mostly curious to see how that compares to the parameter count.

tarentel

I imagine < 10,000. https://github.com/KeeyanGhoreshi/PokemonFireredSingleSequen... and https://www.youtube.com/watch?v=6gjsAA_5Agk. I believe this one is something like 200k inputs and a slightly different game. Quite a bit less than 10M either way.

worble

Heads up: clicking "Next Page" just takes you to an empty screen; you have to use the navigation links on the left if you want to read past the first screen.

drubs

Thanks for the heads up. I just pushed a fix.

worble

I think you fixed the one below the puffer.ai image, but not the one above Authors.

drubs

and...fixed!