DeepMind program finds diamonds in Minecraft without being taught
169 comments
April 4, 2025 · suddenlybananas
toxik
Like all things RL, it is 99.9% about engineering the environment and rewards. As one of the authors stated elsewhere here, there is a reward for completing each of 12 steps necessary to find diamonds.
Mostly I'm tired of RL work being oversold by its authors and proponents by anthropomorphizing its behaviors. All while this "agent" cannot reliably learn to hold down a button, literally the most basic interaction of the game.
red75prime
The "no free lunch" theorem. You can't start from scratch and expect your program to repeat 4 billion years of evolution collecting inductive biases useful in our corner of our Universe in a matter of hours[1].
While it's possible to bake in this particular inductive bias (repetitive actions might be useful), they decided not to (it's just not that interesting).
[1] And you certainly can't reproduce the observation selection effect in a laboratory. That is the thing that makes it possible to overcome the "no free lunch" theorem: our existence and intelligence are conditional on evolution being possible and finding the right biases.
We have to bake in inductive biases to get results. We have to incentivize behaviors useful (or interesting) to us to get useful results instead of generic exploration.
toxik
You don't have to repeat 4 billion years of evolution; an RL agent lives inside a strange universe where the basic axioms happen to be exactly aligned with what you can do in that universe.
Its actions are not muscular, they are literal gameplay actions. It is orders of magnitude easier to learn that the same action should be performed until completion, than that the finger should be pressed against a surface while the hand is stabilized with respect to the cursor on a screen.
One of the most interesting (and pathological) things about humans is that we learn what is rewarding. Not how to get a reward, but actually we train ourselves to be rewarded by doing difficult/novel/funny/etc things. Notably this is communicated largely by being social, i.e., we feel reward for doing something difficult because other people are impressed by that.
In Castaway, Hanks' only companion is a mute, deflated ball, but nonetheless he must keep that relationship alive---to keep himself alive. The climax of the movie is when Hanks returns home and people are so impressed, his efforts are validated.
Contrast that with RL: there is no intrinsic motivation. The agents do not play, or meaningfully explore, really. The extent of their exploration is a nervous tic that makes them press the wrong button with probability ε. The reason the agent cannot hold down buttons is that it explores by having Parkinson's disease, by accident, not because it thought it might find out something useful/novel/funny/etc. In fact, it can't even have a definition of those words, because they are defined in the space between beings.
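To put rough numbers on why that kind of exploration can't stumble onto a sustained hold (purely illustrative arithmetic, not anything from the paper; the 50% figure is made up):

    # Chance that a policy with no bias toward repeating itself keeps
    # choosing the same "attack" action for N consecutive steps,
    # assuming a 50% per-step chance of re-picking it.
    p_repeat = 0.5
    for n in (10, 50, 200):
        print(n, p_repeat ** n)
    # 10  -> ~1e-3
    # 50  -> ~9e-16
    # 200 -> ~6e-61: a hundreds-of-steps hold is essentially never sampled by chance

which is presumably why the paper accelerates block breaking rather than waiting for a stochastic policy to discover a hold.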
rebeccaskinner
> While it's possible to bake in this particular inductive bias (repetitive actions might be useful), they decided not to (it's just not that interesting).
What's interesting to me about this is that the problem seems really aligned with the research they are doing. From what I can tell, they built a system where the agent has a simplified "mental" model of the game world, which it uses to predict actions that will lead to better rewards.
I don't think what's missing here is teaching the model that it should just try to do things a lot until they succeed. Instead, what I think is missing is the context that it's playing a game, and what that means.
For example, any human player who sits down to play Minecraft is likely to hold down the button to mine something. Younger children might also hold the jump button down and jump around aimlessly, but older children and adults probably wouldn't. Why? I suspect it's because people with experience in video games have set expectations for how game designers communicate the gameplay experience. We understand that clicking on things to interact with them is a common mode of interaction, and we expect that games have upgrade mechanics that will let us work faster or interact with higher-level items. It's not that we repeat any action arbitrarily to see whether it pays off, but rather that we're speaking a language of games, modeling the mind of the game designers, and anticipating what they expect from us.
I would think that trying to expand the model of the world to include this notion of the language of games might be a better approach to overcoming the limitation instead of just hard-coding the model to try things over and over again to see if there's a payoff.
d0mine
Isn't that exactly what AlphaZero did?
“AlphaZero was trained solely via self-play using 5,000 first-generation TPUs to generate the games and 64 second-generation TPUs to train the neural networks, all in parallel, with no access to opening books or endgame tables. After four hours of training, DeepMind estimated AlphaZero was playing chess at a higher Elo rating than Stockfish 8; after nine hours of training, the algorithm defeated Stockfish 8 in a time-controlled 100-game tournament (28 wins, 0 losses, and 72 draws).” [emphasis added] https://en.wikipedia.org/wiki/AlphaZero
827a
Given that a computer should be able to simulate at least some applicable aspects and processes of reality billions of times faster than the speed at which our own universe runs: yes, I think it is entirely reasonable to have these agents follow at least some kind of from-scratch evolutionary history. It might also be valuable, as it could further research into understanding what the word "applicable" there even means: what parts of our evolutionary history are important for inductively reasoning your way toward a diamond in Minecraft? What parts aren't? How can that generalize?
If you code a reward function for each step necessary to get a diamond, you are teaching the AI how to do it. There is no other way to look at it. It's extremely unethical to claim, as Nature does, that it did this without "being taught", and it is in my eyes academic malpractice to claim, as their paper does, that it did this "without human data or curricula", though that is mitigated by the reality that they admit this in the paper (if I'm reading it right; I am still digesting the paper, as it is quite technical).
This isn't an LLM, I'm aware of this, but I am at the point where if I could bet on the following statement being true, I'd go in at five figures: Every major AI benchmark, advancement, or similar accomplishment in the past two years can almost entirely be explained by polluted training data. These systems are not nearly as autonomously intelligent as anyone making money on them says they are.
kypro
> You can't start from scratch and expect your program to repeat 4 billion years of evolution collecting inductive biases useful in our corner of our Universe in a matter of hours
Really? Minecraft's gameplay dynamics are not particularly complex... The AI here isn't learning highly complex rules about the nuances of human interaction or learning to detect the relatively subtle differences between various four-legged creatures based on small differences in body morphology. In those cases I could see how millions of years of evolution are important, to at least give us and other animals a head start when entering the world. If the AI had to do something like that to progress in Minecraft, then I'd get why learning those complexities would be skipped over.
But in this case a human would quickly understand that holding a button creates a state which tapping a button does not, and therefore would assume this state could be useful for exploring further states. Identifying this doesn't seem particularly complex to me. If the argument is that it will take slightly longer for an AI to learn patterns in dependent states, then okay, sure. But arguing that learning that holding a button creates a new state is such a complex problem that we couldn't possibly expect an AI to learn it from scratch within a short timeframe is a very weak argument. It's just not that complex. To me this suggests that current algorithms are lacking.
LPisGood
When I was a child and first played Minecraft I clicked instead of held and after 10 minutes I gave up, deciding that Minecraft was too hard.
zvitiate
What if you were in an environment where you had to play Minecraft for, say, an hour? Do you think your child brain would've eventually tried enough things (or had your finger slip and stay on the mouse a little longer), noticed that hitting a block caused an animation (maybe even connected it with the fact that your cursor highlights individual blocks with a black box), decided to explore that further, and eventually mined a block? Your example doesn't speak to this situation at all.
daedrdev
I had the same problem, learned from a roblox mining game where mining a block required clicking it a bunch of times.
freeone3000
RL is useful for action selection and planning. Actually determining the mechanics of the game can be achieved with explicit instruction and definition of an action set.
I suppose whether you find this result intriguing depends on whether you're looking to build result-building planning agents over an indeterminate (and sizable!) time horizon, in which case this is a SOTA improvement and moderately cool, or whether you're looking for a god in the machine, which this is not.
SpaceManNabs
If you have an alternative for RL in these use cases, please feel free to share.
When RL works, it really works.
The only alternative I have seen is deep networks with MCTS; they ramp up to decent quality fast, but they hit caps relatively quickly.
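For context, the selection rule those deep-network + MCTS systems typically use looks roughly like this (a sketch of an AlphaZero-style PUCT rule; the names and the constant are mine, not from any specific system):

    import math

    def select_child(children, c_puct=1.5):
        # Pick the child maximizing value estimate + prior-weighted exploration bonus.
        total_visits = sum(ch["visits"] for ch in children)
        def score(ch):
            q = ch["value_sum"] / ch["visits"] if ch["visits"] else 0.0
            u = c_puct * ch["prior"] * math.sqrt(total_visits + 1) / (1 + ch["visits"])
            return q + u
        return max(children, key=score)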
o11c
And a relevant piece of ancient wisdom (exact date not known, but presumably before 1970):
> In the days when Sussman was a novice, Minsky once came to him as he sat hacking at the PDP-6.
> “What are you doing?”, asked Minsky.
> “I am training a randomly wired neural net to play Tic-Tac-Toe” Sussman replied.
> “Why is the net wired randomly?”, asked Minsky.
> “I do not want it to have any preconceptions of how to play”, Sussman said.
> Minsky then shut his eyes.
> “Why do you close your eyes?”, Sussman asked his teacher.
> “So that the room will be empty.”
> At that moment, Sussman was enlightened.
lgeorget
Well, to be fair... I (a human) had to look it up online the first time I played as well. I was repeatedly clicking on the same tree for an entire minute before that. I even tried several different trees just in case.
fusionadvocate
But it is possible to discover by holding down the button and realizing the block is getting progressively more "scratched".
kharak
In my mind, this generalizes to the same problem with other non-stochastic (deterministic) operations like logical conclusions (A => B).
I have a running bet with a friend that humans encode deterministic operations in neural networks too, while he thinks there has to be another process at play. But there might be something extra helping our neural networks learn the strong weights required for it. Or the answer is again: "more data".
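As a toy illustration of the "strong weights" point (entirely my own example): a single sigmoid unit with saturated weights reproduces A => B exactly.

    import numpy as np

    def implies(a, b):
        # A => B is false only for (A=1, B=0); large ("strong") weights push
        # the sigmoid firmly to one side of 0.5 for every input pair.
        return 1 / (1 + np.exp(-(10 - 20 * a + 20 * b))) > 0.5

    for a in (0, 1):
        for b in (0, 1):
            print(a, b, implies(a, b))   # False only for a=1, b=0

Whether gradient descent reliably finds such saturated solutions from data is a separate question, which is maybe where the bet gets interesting.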
FrustratedMonky
"accelerating block breaking because learning to hold a button for hundreds of consecutive steps "
This is fine, and does not impact the importance of figuring out the steps.
For anybody who has done tuning on systems that run at different speeds, adjusting for the speed difference is just engineering, and it lets you get on with the more important/inventive work.
JohnKemeny
I'm not sure it's a serious caveat if the "hint" or "control" is in the manual.
suddenlybananas
Sorry, I don't quite follow what you mean?
franktankbank
I didn't read the manual and when I was trying to help my kid play the game I couldn't figure out how to break blocks.
Hamuko
Turns out that AI are much better at playing video games if they're allowed to cheat.
thesz
"It allows AI to understand its physical environment and also to self-improve over time, without a human having to tell it exactly what to do."
ks1723
In my view, the 'exactly' is crucial here. They do implicitly tell the model what to do by encoding it in the reward function:
In Minecraft, the team used a protocol that gave Dreamer a ‘plus one’ reward every time it completed one of 12 progressive steps involved in diamond collection — including creating planks and a furnace, mining iron and forging an iron pickaxe.
This is also why I think the title of the article is slightly misleading.
wongarsu
It's kind of fair, humans also get rewarded for those steps when they learn Minecraft
Animats
Key to Dreamer’s success, says Hafner, is that it builds a model of its surroundings and uses this ‘world model’ to ‘imagine’ future scenarios and guide decision-making.
Can you look at the world model, like you can look at Waymo's world model? Or is it hidden inside weights?
Machine learning with world models is very interesting, and the people doing it don't seem to say much about what the models look like. The Google manipulation work talks endlessly about the natural language user interface, but when they get to motion planning, they don't say much.
danijar
Yes, you can decode the imagined scenarios into videos and look at them. It's quite helpful during development to see what the model gets right or wrong. See Fig. 3 in the paper: https://www.nature.com/articles/s41586-025-08744-2
Animats
So, prediction of future images from a series of images. That makes a lot of sense.
Here's the "full sized" image set.[1] The world model is low-rez images. That makes sense. Ask for too much detail and detail will be invented, which is not helpful.
[1] https://media.springernature.com/full/springer-static/image/...
lnsru
I implemented an acoustic segmentation system on an FPGA recently. The whole world model was a long list of known events and states with feasible transitions, plus novel things not observed before. Basically a rather dumb state machine with a machine learning part attached to acoustic sensors. Of course, both parts could be hidden behind weights. But the state machine was easily readable, and that was its biggest advantage.
jtsaw
I’d say it’s more like Waymo’s world model. The main actor uses a latent vector representation of the state of the game to make decisions. This latent vector at train time is meant to compress a bunch of useful information about the game. So while you can’t really understand the actual latent vector that represents state, you do know it encodes at least the state of the game.
This world model stuff is only possible in environments that are sandboxed, i.e. where you can represent the state of the world and have a way of producing the next state given a current state and action. Think Atari games, robot simulations, etc.
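In code, the shape of that idea is roughly the following (illustrative sizes and names, not Dreamer's actual architecture):

    import torch
    import torch.nn as nn

    obs_dim, act_dim, latent_dim = 64 * 64 * 3, 17, 256   # made-up sizes

    encoder = nn.Sequential(nn.Linear(obs_dim, 512), nn.ELU(),
                            nn.Linear(512, latent_dim))             # o_t -> z_t
    dynamics = nn.Sequential(nn.Linear(latent_dim + act_dim, 512), nn.ELU(),
                             nn.Linear(512, latent_dim))            # (z_t, a_t) -> z_{t+1}
    reward_head = nn.Linear(latent_dim, 1)

    def imagine(z, policy, horizon=15):
        # Roll the learned dynamics forward in latent space ("imagination").
        # `policy` is assumed to map a latent (batch, latent_dim) to an action
        # tensor (batch, act_dim).
        rewards = []
        for _ in range(horizon):
            a = policy(z)                                # action from latent state
            z = dynamics(torch.cat([z, a], dim=-1))      # predicted next latent
            rewards.append(reward_head(z))
        return torch.stack(rewards)

The actor and critic are trained on such imagined rollouts, which is why the latent has to encode at least the game state even if you can't read it directly.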
TeMPOraL
> Can you look at the world model, like you can look at Waymo's world model? Or is it hidden inside weights?
I imagine it's the latter, and in general, we're already dealing with plenty of models with world models hidden inside their weights. That's why I'm happy to see the direction Anthropic has been taking with their interpretability research over the years.
Their papers, as well as most discussions around them, focus on issues of alignment/control, safety, and generally killing the "stochastic parrot" meme and keeping it dead - but I think it'll be even more interesting to see attempts at mapping how those large models structure their world models. I believe there's scientific and philosophical discoveries to be made in answering why these structures look the way they do.
namaria
> killing the "stochastic parrot" meme
This was clearly the goal of the "Biology of LLMs" (and ancillary) paper but I am not convinced.
They used a 'replacement model' that by their own admission could match the output of the LLM ~50% of the time, and the attribution of cognition related labels to the model hinges entirely on the interpretation of the 'activations' seen in the replacement model.
So they created a much simpler model, that sorta kinda can do what the LLM can do in some instances, contrived some examples, observed the replacement model and labeled what it was doing very liberally.
Machine learning and the mathematics involved is quite interesting but I don't see the need to attribute neuroscience/psychology related terms to them. They are fascinating in their own terms and modelling language can clearly be quite powerful.
But thinking that they can follow instructions and reason is the source of much misdirection. The limits of this approach should make clear that feeding text to a text-continuation program, then parsing the generated text for commands and running those commands, is unwise, because the tokens the model outputs are just statistically linked to the tokens fed into it. And as the model takes in more tokens from the wild, it can easily end up in situations that pose an enormous risk. Pushing the idea that they are reasoning about their input is driving all sorts of applications that, if you saw the models as statistical text-continuation programs, would clearly be a glaring risk.
Machine learning and LLMs are interesting technology that should be investigated and developed. Reasoning by induction that they are capable of more than modelling language is bad science and drives bad engineering.
DeborahEmeni_
The “holding a button” thing actually resonated. It feels like the real work here is engineering the reward structure to make exploration even remotely viable. Dreamer’s world model might be cool, but most of the heavy lifting still seems to come from how forgiving the Minecraft environment is for training.
I do wonder though: if you swapped Minecraft for a cloud-based synthetic world with similar physics but messier signals, like object permanence or social reasoning, would Dreamer still hold up? Or is it just really good at the kind of clean reward hierarchies that games offer?
reportgunner
Article makes it seem like finding diamonds is some kind of super complicated logical puzzle. In reality the hardest part is knowing where to look for them and what tool you need to mine them without losing them once you find them. This was given to the AI by having it watch a video that explains it.
If you watch a guide on how to find diamonds it's really just a matter of getting an iron pickaxe, digging to the right depth and strip mining until you find some.
danijar
Hi, author here! Dreamer learns to find diamonds from scratch by interacting with the environment, without access to external data. So there are no explainer videos or internet text here.
It gets a sparse reward of +1 for each of the 12 items that lead to the diamond, so there is a lot it needs to discover by itself. Fig. 5 in the paper shows the progression: https://www.nature.com/articles/s41586-025-08744-2
itchyjunk
Since diamonds are surrounded by danger, and if it dies it loses its items and such, why would it not be satisfied after discovering an iron pickaxe or some such? Is it in a mode where it doesn't lose its items when it dies? Does it die a lot? Does it ever try digging vertically down? Does it ever discover other items/tools you didn't expect it to? An open world with sparse reward seems like such a hard problem. Also, once it gets an item, does it stop getting reward for it? I assume so. Surprised that it can work with this level of sparse rewards.
taneq
In all reinforcement learning there is (explicitly as part of a fitness function, or implicitly as part of the algorithm) some impetus for exploration. It might be adding a tiny reward per square walked, a small reward for each block broken and a larger one for each new block type broken. Or it could be just forcing a random move every N steps so the agent encounters new situations through “clumsiness”.
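Both flavours are simple to write down (an illustrative sketch of generic exploration incentives, not what this paper does):

    import random
    from collections import Counter

    visit_counts = Counter()
    EPSILON = 0.05         # chance of a forced random action ("clumsiness")
    NOVELTY_BONUS = 0.1    # small extra reward for reaching an unseen state

    def choose_action(greedy_action, action_space):
        if random.random() < EPSILON:
            return random.choice(action_space)   # forced exploration
        return greedy_action

    def shaped_reward(env_reward, state_key):
        bonus = NOVELTY_BONUS if visit_counts[state_key] == 0 else 0.0
        visit_counts[state_key] += 1
        return env_reward + bonus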
danijar
When it dies it loses all items and the world resets to a new random seed. It learns to stay alive quite well but sometimes falls into lava or gets killed by monsters.
It only gets a +1 for the first iron pickaxe it makes in each world (same for all other items), so it can't hack rewards by repeating a milestone.
Yeah it's surprising that it works from such sparse rewards. I think imagining a lot of scenarios in parallel using the world model does some of the heavy lifting here.
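Concretely, that reward protocol amounts to something like the wrapper below (my own sketch, not the paper's code; the item list matches the twelve milestones described in the paper):

    MILESTONES = ["log", "plank", "stick", "crafting_table", "wooden_pickaxe",
                  "cobblestone", "stone_pickaxe", "iron_ore", "furnace",
                  "iron_ingot", "iron_pickaxe", "diamond"]

    class MilestoneReward:
        # +1 the first time each item is obtained in the current world;
        # the set is cleared when the agent dies and the world resets.
        def __init__(self):
            self.collected = set()

        def reset(self):
            self.collected.clear()

        def reward(self, inventory):
            r = 0.0
            for item in MILESTONES:
                if item in inventory and item not in self.collected:
                    self.collected.add(item)
                    r += 1.0
            return r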
SpaceManNabs
I just want to express my condolences for how difficult it must be to correct basic misunderstandings that could be cleared up immediately by reading the fourth paragraph under the section "Diamonds are forever".
Thanks for your hard work.
danijar
Haha thanks!
ryan-duve
For the curious, from the link above:
> log, plank, stick, crafting table, wooden pickaxe, cobblestone, stone pickaxe, iron ore, furnace, iron ingot, iron pickaxe and diamond
kuu
While I agree with your comment, take this sentence:
"This was given to the AI by having it watch a video that explains it."
That was not as trivial as it may seem just a few months ago...
rcxdude
EDIT: Incorrect, see below
it didn't watch 'a video', it watched many, many hours of video of playing minecraft (with another specialised model feeding in predictions of keyboard and mouse inputs from the video). It's still a neat trick, but it's far from the implied one-shot learning.
danielbln
The author replied in this thread and says the opposite.
NVHacker
AlphaStar was also trained initially from YouTube videos of pros playing StarCraft. I would argue that it was already pretty trivial a few years ago.
rcxdude
I don't think it was videos. Almost certainly it was replay files with a bunch of work to transform them into something that could be compared to the model's outputs. (AlphaStar never 'sees' the game's interface, only a transformed version of information available via an API.)
ismailmaj
Do you know if it was actual videos or some simpler inputs like game state and user inputs? I’d be impressed if it was the former at that time.
skwirl
>This was given to the AI by having it watch a video that explains it.
That is not what the article says. It says that was separate, previous research.
Bluglionio
I don't get it. How can you reduce this achievement down to this?
Have you gotten used to some ai watching a video and 'getting it' so fast that this is boring? Unimpressive?
jerf
The other replies have observed that the AI didn't get any "videos to watch" but I'd also observe that this is being used as an English colloquialism. The AIs aren't "watching videos", they're receiving videos as their training data. That's quite different from what is coming to your mind as "watching a video" as if the AI watched a single YouTube tutorial video once and got the concept.
reportgunner
I feel like you are jumping to conclusions here, I wasn't talking about the achievement or the AI, I was talking about the article and the way it explains finding diamonds in minecraft to people who don't know how to find diamonds in minecraft.
rowanG077
The AI is able to learn from video and you don't find that even a little bit impressive? Well I disagree.
lupusreal
Characterizing finding diamonds as "mastering" Minecraft is extremely silly. Tantamount to saying "AI masters Chess: Captures a pawn." Getting diamonds is not even close to the hardest challenge in the game, but most readers of Nature probably don't have much experience playing Minecraft so the title is actually misleading, not harmless exaggeration.
zimpenfish
> Getting diamonds is not even close to the hardest challenge in the game
Mining diamonds isn't even necessary if you build, e.g., ianxofour's iron farm on day one and trade that iron[0] with a toolsmith, armourer, and weaponsmith. You can get full diamond armour, tools, and weapons pretty quickly (probably a handful of game weeks?)
[0] Main faff here is getting them off their base trade level.
lupusreal
True, and if the objective is to get some raw diamonds as fast as possible, demonstrating mastery of the game, I'd expect a strategy like making a boat, finding a shipwreck and then a buried treasure chest. That usually takes just a few minutes.
Really though, if AI wants to impress me it needs to collect an assortment of materials and build a decent looking base. Play the way humans usually play.
danijar
I agree with you, this is just the start and Minecraft has a lot more to offer for future research!
YeGoblynQueenne
Reinforcement learning is very good with games.
>> In Minecraft, the team used a protocol that gave Dreamer a ‘plus one’ reward every time it completed one of 12 progressive steps involved in diamond collection — including creating planks and a furnace, mining iron and forging an iron pickaxe.
And that is why it is never going to work in the real world: games have clear objectives with obvious rewards. The real world, not so much.
danijar
For a lot of things, VLMs are good enough already to provide rewards. Give them the recent images and a text description of the task and ask whether the task was accomplished or not.
For a more general system, you can annotate videos with text descriptions of all the tasks that have been accomplished and when, then train a reward model on those to later RL against.
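As a rough sketch of the shape the first idea takes (hypothetical code; vlm here is any callable that accepts images plus a prompt and returns text, not a specific API):

    def vlm_reward(recent_frames, task_description, vlm):
        # Binary reward from a vision-language model's yes/no judgement.
        prompt = (f"Task: {task_description}\n"
                  "Based on these frames, was the task accomplished? Answer yes or no.")
        answer = vlm(images=recent_frames, prompt=prompt)
        return 1.0 if answer.strip().lower().startswith("yes") else 0.0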
IshKebab
Plenty of real world situations have clear objectives with obvious rewards.
YeGoblynQueenne
Example.
IshKebab
Fold clothes -> clothes are folded.
Take children to school -> they safely arrive on time.
Autonomous driving -> arrive at destination without crashing.
Call centre -> customers are happy.
xwolfi
Work a job, receive money
SpaceManNabs
> And that is why it is never going to work in the real world: games have clear objectives with obvious rewards. The real world, not so much.
I encourage you to read deepmind's work with robots.
YeGoblynQueenne
Oh I have. For example I remember this project:
>> Quantitatively, the QT-Opt approach succeeded in 96% of the grasp attempts across 700 trial grasps on previously unseen objects. Compared to our previous supervised-learning based grasping approach, which had a 78% success rate, our method reduced the error rate by more than a factor of five.
https://research.google/blog/scalable-deep-reinforcement-lea...
That was in 2018.
So what do you think, is vision-based robotic manipulation and grasping a solved problem, seven years later? Is QT-Opt now an established industry standard in training robots with RL?
Or was it just another project announced with great fanfare and hailed as a breakthrough that would surely lead to a great increase in capabilities... only to pop, fizzle and disappear into obscurity a few years later without any real-world result, like most of DeepMind's RL projects do?
SpaceManNabs
Let's look at 2025
https://www.youtube.com/watch?v=x-exzZ-CIUw
It looks pretty awesome. Let's see what happens.
smokel
> games have clear objectives with obvious rewards. The real world, not so much.
Tell that to the people here who are trying to turn their startup ideas into money.
zamadatix
I don't think folks go the startup path because the steps to go from idea to making money are obvious and clear.
janalsncm
> it is never going to work in the real world
DeepSeek used RL to train R1, so that is clearly not true. But ignoring that, what is your alternative? Supervised learning? Good luck finding labels if you don’t even know what the objective is.
YeGoblynQueenne
No, let's not ignore DeepSeek: text is not the real world any more than Minecraft is the real world.
And why do I have to offer an alternative? If it's not working, it's not working, regardless of whether there's an alternative (that we know of) or not.
CodeCompost
I didn't know that Nature did movie promotions.
colechristensen
Who would have thought you could get your TAS run published in Nature if you used enough hot buzzwords. (TAS runs have been using various old-school-definition "artificial intelligence" algorithms for a long time.)
FrustratedMonky
Minecraft is ubiquitous now.
But I remember the alpha version, and NOBODY knew how to make a pickaxe. Humans were also very bad at figuring out these steps.
People were decompiling the Java and posting help guides on the internet.
How to break a tree, get sticks, make a wooden pickaxe. In alpha, that was a big deal for humans, too.
ryoshu
Or you could watch Notch build it.
ljdtt
Slightly off-topic from the article itself, but… does anyone else feel like Nature’s cookie banner just never goes away? I have vivid memories of trying to reject cookies multiple times, eventually giving up and accepting them just to get to the article only for the banner to show up again the next time I visit. I swear it’s giving me déjà vu every single visit.. Am I the only one experiencing this, or is this just how their site works?
textlapse
Could this perform better by having the internal representation of Minecraft instead of raw pixels?
It seems rather tenuous to keep pounding on 'training via pixels' when really a game's 2D/3D output is an optical trick at best.
I understand Sergey Brin/et al had a grandiose goal for DeepMind via their Atari games challenge - but why not try alternate methods - say build/tweak games to be RL-friendly? (like MuJoCo but for games)
I don't see the pixel-based approach being as applicable to the practical real world as say when software divulges its direct, internal state to the agent instead of having to fake-render to a significantly larger buffer.
I understand Dreamer-like work is a great research area and one that will garner lots of citations for sure.
EMIRELADERO
> I understand Sergey Brin/et al had a grandiose goal for DeepMind via their Atari games challenge - but why not try alternate methods - say build/tweak games to be RL-friendly?
Because the ultimate goal (real-world visual intelligence) would make that impossible. There's no way to compute the "essential representation" of reality; the photons are all there is.
textlapse
There is no animal on planet earth that functions this way.
The visual cortex and plenty of other organs compress the data into useful, semantic information before feeding it into a 'neural' network.
Simply from an energy and transmission perspective, an animal would use up all of its energy stores to process a single frame if we were to construct such an organism around just 'feed pixels to a giant neural network'. Things like colors, memory, objects, recognition, faces etc. are all part of the equation, not some giant neural network that runs from raw photons hitting cones/rods.
So this isn't biomimicry or cellular automata - it's simply a fascination similar to self-driving cars being able to drive with an image -> {neural network} -> left/right/accelerate simplification.
janalsncm
Brains may operate on a compressed representation internally, but they only have access to their senses as inputs. A model that needs to create a viable compressed representation is quite different from one which is spoon fed one via some auxiliary data stream.
Also I believe the DeepMind StarCraft model used the compressed representation, but that was a while ago. So that was already kind of solved.
> simply a fascination similar to self-driving cars being able to drive with a image
Whether to use lidar is more of an engineering question of the cost/benefit of adding modalities. Lidar has come down in price quite a bit, so skipping it looks less wise in retrospect.
An important caveat from the paper:
>Moreover, we follow previous work in accelerating block breaking because learning to hold a button for hundreds of consecutive steps would be infeasible for stochastic policies, allowing us to focus on the essential challenges inherent in Minecraft.