
Show HN: Factorio Learning Environment – Agents Build Factories

197 comments · March 11, 2025

I'm Jack, and I'm excited to share a project that has channeled my Factorio addiction recently: the Factorio Learning Environment (FLE).

FLE is an open-source framework for developing and evaluating LLM agents in Factorio. It provides a controlled environment where AI models can attempt complex automation, resource management, and optimisation tasks in a grounded world with meaningful constraints.

A critical advantage of Factorio as a benchmark is its unbounded nature. Unlike many evals that are quickly saturated by newer models, Factorio's geometric complexity scaling means it won't be "solved" in the next 6 months (or possibly even years). This allows us to meaningfully compare models by the order-of-magnitude of resources they can produce - creating a benchmark with longevity.

The project began 18 months ago after years of playing Factorio, recognising its potential as an AI research testbed. A few months ago, our team (myself, Akbir, and Mart) came together to create a benchmark that tests agent capabilities in spatial reasoning and long-term planning.

Two technical innovations drove this project forward: First, we discovered that piping Lua into the Factorio console over TCP enables running (almost) arbitrary code without directly modding the game. Second, we developed a first-class Python API that wraps these Lua programs to provide a clean, type-hinted interface for AI agents to interact with Factorio through familiar programming paradigms.
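
Factorio's console is exposed over the Source RCON protocol, so the wire format can be sketched in a few lines. This is a minimal illustration of the "Lua over TCP" idea, not FLE's actual client; the host, port, and password are placeholders:

    import socket
    import struct

    def rcon_packet(req_id: int, ptype: int, body: str) -> bytes:
        # Source RCON framing: <length><id><type><body>\x00\x00
        payload = struct.pack("<ii", req_id, ptype) + body.encode() + b"\x00\x00"
        return struct.pack("<i", len(payload)) + payload

    def run_lua(host: str, port: int, password: str, lua: str) -> bytes:
        with socket.create_connection((host, port)) as sock:
            sock.sendall(rcon_packet(1, 3, password))   # type 3 = auth
            sock.recv(4096)                             # auth response (unchecked here)
            sock.sendall(rcon_packet(2, 2, f"/silent-command {lua}"))  # type 2 = exec
            return sock.recv(4096)                      # command output

    # e.g. run_lua("localhost", 27015, "factorio",
    #              "rcon.print(game.players[1].position.x)")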

Agents interact with FLE through a REPL pattern:

1. They observe the world (seeing the output of their last action)
2. Generate Python code to perform their next action
3. Receive detailed feedback (including exceptions and stdout)
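
In code, the loop looks roughly like this (a minimal sketch; the names `env.reset`, `env.execute`, and `llm.generate` are illustrative, not FLE's actual API):

    def agent_loop(llm, env, max_steps: int = 100):
        observation = env.reset()  # initial world state, rendered as text
        for _ in range(max_steps):
            # The agent sees the output of its last action and
            # generates a Python program for its next one.
            code = llm.generate(observation)
            try:
                stdout = env.execute(code)         # run the program in-game
                observation = f"stdout:\n{stdout}"
            except Exception as exc:
                # Exceptions are fed back verbatim so the agent can self-correct.
                observation = f"error:\n{exc!r}"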

We provide two main evaluation settings:

- Lab-play: 24 structured tasks with fixed resources
- Open-play: an unbounded task of building the largest possible factory on a procedurally generated map

We found that while LLMs show promising short-horizon skills, they struggle with spatial reasoning in constrained environments. They can discover basic automation strategies (like electric-powered drilling) but fail to achieve more complex automation (like electronic circuit manufacturing). Claude Sonnet 3.5 is currently the best model (by a significant margin).

The code is available at https://github.com/JackHopkins/factorio-learning-environment.

You'll need:

- Factorio (version 1.1.110)
- Docker
- Python 3.10+

The README contains detailed installation instructions and examples of how to run evaluations with different LLM agents.

We would love to hear your thoughts and see what others can do with this framework!

vessenes

OK, you've permanently nerd-baited me, and I wish to apply for a job at the Anthropic Factorio lab immediately.

I can't tell from the paper or these comments if you're sending multimodal data back — I'm guessing no, because many of these models aren't multimodal. But some are — and of course we now have the recently released Qwen 2.5 VLM, which seems to be quite strong for its size.

You harp on this lack of spatial ability a fair amount - fair enough - and you mention difficulties in both planning and spatial planning. Are you sending images back? If not, any thoughts on this?

Thanks for this amazing bit of work, I really am reorganizing my day to play with it now.

P.S. Seems like MCP-enabling the Python library is a natural must-do, so that all tool-enabled LLMs everywhere can play Factorio.

martbakler

Currently it's a text-only environment, but we are planning to support vision in the future. We ran a couple of tests and saw that including screenshots of the game state did not improve performance with off-the-shelf models. As the complexity of the game state grew and the screenshots were filled with more entities, the models got even more confused and started hallucinating directions and entities, or weren't capable of troubleshooting factories with apparent mistakes (i.e. a missing transport belt or a wrongly rotated inserter). We think it's because current VLMs aren't good at spatial reasoning in highly detailed images; this would likely improve significantly with finetuning.

Good point on MCP as well, given it has been blowing up lately - we'll look into that!

vessenes

That makes sense and it's really interesting - it is a challenging visual test for sure: thousands of entities, and either multi-tier visual representations (screen, map, overview map) or a GIANT high-res image. I hereby propose FLE-V, a subset benchmark for visual models where they just turn a Factorio image into a proper FLE description. And maybe the overview and map images as well.

kridsdale1

Such research could have hundreds of billions of dollars in downstream GDP implications when applied to real industrial settings.

grayhatter

> As the complexity of the game state grew and the screenshots were filled with more entities, the models got even more confused and started hallucinating directions and entities, or weren't capable of troubleshooting factories with apparent mistakes (i.e. a missing transport belt or a wrongly rotated inserter). We think it's because [...]

I think you just described a research paper that would advance SOTA. Less describing why, more how. (Assuming it's not just "we finetuned the model and it worked perfectly".)

martbakler

Sounds almost like a visual "needle in a haystack" type of work, that could be quite interesting!

jillyboel

Why would screenshots be necessary if a textual description of the factory state is both easier to interpret and less prone to confusion? The game is played on a grid, so converting the game state to ascii ought to be trivial.
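
Something like this toy renderer, say (entity names and symbols are made up for illustration):

    SYMBOLS = {"transport-belt": "=", "inserter": ">", "assembling-machine": "A"}

    def render_ascii(entities, width, height):
        grid = [["." for _ in range(width)] for _ in range(height)]
        for e in entities:  # e.g. {"name": "inserter", "x": 3, "y": 1}
            grid[e["y"]][e["x"]] = SYMBOLS.get(e["name"], "?")
        return "\n".join("".join(row) for row in grid)

    print(render_ascii([{"name": "inserter", "x": 3, "y": 1}], width=8, height=3))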

martbakler

Engineering-wise it actually is quite trivial, but the underlying question is which modality best elicits spatial reasoning capabilities from the current general models. We tried (very anecdotally) a couple of months ago to get an agent to reason over a couple of ASCII representations of factories, and the results weren't very promising. It seems the models struggle with creating an accurate internal spatial representation of the game state only using textual tokens.

The question is what is the most efficient and high-quality representation we could use to improve that

ajcp

Did you try providing 2D vectors of where each object relates to every other object? Seems like the most obvious way.

In my experience the current generation of models are very poor at spatial reasoning even when given accurate coordinate based location assignments of each object. But I suspect when a model can build the whole relationship of all objects by being given those spatial relationships in a vector they will be much better.
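
For instance, a sketch of that serialization (the entity format is hypothetical):

    from itertools import combinations

    def pairwise_offsets(entities):
        # one line per pair: "<a> -> <b>: (dx, dy)"
        lines = []
        for a, b in combinations(entities, 2):
            dx, dy = b["x"] - a["x"], b["y"] - a["y"]
            lines.append(f'{a["name"]} -> {b["name"]}: ({dx:+d}, {dy:+d})')
        return "\n".join(lines)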

groby_b

> It seems the models struggle with creating an accurate internal spatial representation of the game state only using textual tokens

That'd be actually interesting research material for the claim that LLMs are able to build internal representations of the world. (Either they can't at all, which'd be an important insight, or it turns out there's something fundamentally different about modalities that engages different reasoning/world model capabilities, which would be even more interesting)

Or, if you want to really go wild, "what capabilities allow models to reason in modalities fundamentally different from their input data/training data".

Damn it, I should quit and go back to University. [Ed.: She wouldn't quit, she likes her job, don't believe her]

vessenes

Trivial as in only engineering work, sure. But it’s a lottt of tokens. Long context models do a number of things to get all that working context in; some of those things elide details / compress / have token segments that are harder to reason about. When a burner inserter at a location takes up like 50-100 tokens, and you want it to reason about 100 of them, this is still a pretty challenging task for any LLM.

jillyboel

Ah, I don't know much about multi modal models but I wonder what they'd think of pixel art representing the factory where each pixel is a point on the grid and each color is a specific entity, perhaps ignoring things such as bots flying about. Probably easier to comprehend than an actual screenshot?

scottmsul

There was an HN post here not too long ago about a team that used reinforcement learning to train an agent to beat Pokemon Red. They mentioned how they had to tweak the reward function to give small rewards for exploring and big rewards for completing "essential tasks" like beating gyms.

I wonder if this same approach could be used here in Factorio? Using the Pokemon Red analogy, the main "essential tasks" in Factorio are setting up automation for new items and new science packs. A good reward function could involve small rewards for the production rate of each item/sec, medium rewards for setting up automation for new items, and big rewards for automating each new science pack - something like the sketch below.
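
A sketch of that tiered reward (weights are arbitrary; the `state` fields are hypothetical, with the `automated_*` fields as sets):

    def reward(state, prev_state):
        r = 0.0
        # small: increases in per-item production rates (items/sec)
        for item, rate in state.production_rates.items():
            r += 0.01 * max(0.0, rate - prev_state.production_rates.get(item, 0.0))
        # medium: items newly produced by machines rather than by hand
        r += 1.0 * len(state.automated_items - prev_state.automated_items)
        # big: each newly automated science pack
        r += 10.0 * len(state.automated_science_packs - prev_state.automated_science_packs)
        return r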

Telling a Factorio agent to just "make a big factory" is like telling a pokemon red agent to just "beat the game", it has to be broken down into smaller steps with a very carefully tuned reward function.

Thinking about this is really making me want to jump into this project!

scottmsul

Also I should add: being a Factorio veteran with 2-3k hours in this game, I think the goal of making the "largest possible factory" is too vague and not the right metric. When Factorio players make large megabases, they don't go for "size" per se, but rather science research per minute. The metric you should be giving the agents is SPM, not "largest" base!

csense

Agree, "largest" base has some pathologies.

Put machine #1 at the starting location, run in one direction, and put machine #2 just before time runs out.

This is going to be a huge factory (as measured by its bounding box) but it's not super interesting.

soulbadguy

Ahhh, another Factorio addict :) Curious, how long was your first playthrough (assuming v1.x, launching the first rocket)?

martbakler

This is interesting - one of our findings was that Claude was capable of essential tasks & simple automation (i.e. an iron gear wheel factory in lab-play) but didn't even try to do them during the "build the biggest factory" game episodes. So the models can do these essential tasks, but when given a general goal like "complete the game", they don't have a good enough level of long-term planning to even attempt them. Often they just built uncoordinated small-scale constructs without attempting to scale up existing factories.

That was also one of our goals: to find out how the models act when given a very vague and general objective.

noddybear

In FLE, you have access to milestones representing the first time a new entity was created, but coming up with a stratification of rewards for different degrees of automation would be really interesting. Join us!

mclau156

The same approach could be used in life

noosphr

>We evaluate six frontier language models across both settings: Claude 3.5-Sonnet, GPT-4o, GPT-4o-Mini, Deepseek-v3, Gemini-2-Flash, and Llama-3.3-70B-Instruct.

While I appreciate the effort and creativity that went into this, there are a lot of much simpler dynamic benchmarks that can saturate the planning capabilities of non-reasoning models.

Something as simple as giving a list of flight connections between cities and then asking for an itinerary between them confuses all these models when the shortest path between two nodes is long enough.

Longest shortest path the models could reliably find (8/10 tests for a given length) between two cities:

    | Model            | Path Length |
    |------------------+-------------|
    | Claude Sonnet3.5 |          10 |
    | GPT-4o           |           7 |
    | GPT-4o-mini      |           4 |
    | Deepseek-v3      |           6 |
    | Gemini-2-Flash   |  Not tested |
    | Llama3.3-70B-Ins |           4 |
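
Such puzzles are easy to generate. A sketch using networkx (not the commenter's actual harness; city names are synthetic):

    import random
    import networkx as nx

    def make_flight_puzzle(n_cities=30, n_edges=60, path_len=8, seed=0):
        rng = random.Random(seed)
        while True:
            g = nx.gnm_random_graph(n_cities, n_edges, seed=rng.randint(0, 10**9))
            lengths = dict(nx.all_pairs_shortest_path_length(g))
            # keep only city pairs whose shortest path is exactly path_len hops
            pairs = [(a, b) for a in g for b in g if lengths[a].get(b) == path_len]
            if pairs:
                src, dst = rng.choice(pairs)
                edges = "\n".join(f"City{a} <-> City{b}" for a, b in g.edges)
                return f"{edges}\n\nFind an itinerary from City{src} to City{dst}."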

owenpalmer

> All models exhibited limitations in spatial planning when constructing multi-section factories. Common failures included placing entities too close together, not allocating space for connections, or incorrect inserter placement

It makes sense why LLMs are bad with spatial reasoning. Not a lot of training data for it. I wonder what additional reasoning abilities will emerge when spatial reasoning is solved.

wordpad

How is there not a lot of spatial data?

Isn't it literally infinite via even the simplest simulator?

You could generate an unlimited training set just by implementing tic-tac-toe on an unbounded grid, for example, in like 10 lines of code.
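
Roughly those 10 lines (a sketch; the text output format is invented):

    import random

    def random_position(n_moves=10, span=50, seed=None):
        # scatter alternating X/O moves over an unbounded (well, large) grid
        rng = random.Random(seed)
        cells = rng.sample([(x, y) for x in range(-span, span)
                            for y in range(-span, span)], n_moves)
        return "\n".join(f"({x}, {y}): {'X' if i % 2 == 0 else 'O'}"
                         for i, (x, y) in enumerate(cells))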

spieswl

Fantastic idea.

It seems like there are a lot of interesting experiments to be had here. The lab-play scenarios having a time-related component seems like a good idea; I assume most Factorio players who keep biters on treat them as a combined temporal-spatial constraint, so you have a sort-of proxy for a real game situation when you put the agents on a timer.

I like the way that the framework design is testing different things than micromanagement proficiency, such as what we have seen in DOTA 2 or StarCraft 2 experiments. Notably, severe worker micromanagement (in the case of the latter game) becomes a way to squeak out extra minerals when you have infinite APM available. This is an interesting learned behavior in a narrow context, but that tactic is really control intensive and has a high chance for even pro players to screw it up when attempting to do so. It also doesn't seemingly give additional insight into an agent's longer-term planning, execution, and analytical performance. FLE seems way more interesting as a higher-level "thinking" evaluation framework, with all that in mind.

Any plans for layout optimization benchmarks? As in, start with a given factory cell with X inputs and Y outputs, and optimize its performance.

noddybear

One thing we've been talking about is creating tasks that are a bit more 'tower defence', where biters are released every X steps / seconds. The idea would be to test agents in building a military-industrial complex. One amusing issue we had in developing this idea is that frontier models have an aversion to creating entities called 'GunTurret' etc - as it goes against their constitution! (perhaps we should rename turrets to 'SuperSoaker' or something)

Regarding layout optimisation benchmarks, we actually discussed this yesterday. I think we need 2 types of layout task: 1) fix this subtly broken factory, and 2) improve the throughput of this factory. These should be straightforward to implement, if you'd like to have a look.

aftbit

>One amusing issue we had in developing this idea is that frontier models have an aversion to creating entities called 'GunTurret' etc - as it goes against their constitution! (perhaps we should rename turrets to 'SuperSoaker' or something)

This sounds like a great idea for a short story in the style of Malak by Peter Watts. Imagine a future warfighter AI that has been fitted with a set of filters to make it think it's really having a pillowfight or building a factory to make screws while it's actually tearing people apart or optimizing a military production line.

spieswl

Love the suggestion, I'll clone it down and start poking around.

I believe your intuition about layout experiments needing to be of different genres is correct. I think you could have a pretty wide range of debugging opportunities (imbalanced belts, inserters fighting for items, insufficient power at full load leading to throughput loss, etc) for the first. The second feels like it would be nicely encapsulated by focusing on optimizing for ratios, although seeing an agent realize that they can get more throughput by simply copy/pasting a block and upgrading a belt would be pretty wild (depending on the recipe, of course). Maybe nuclear power / heat exchanger ratios are a bit too far down the path, but optimizing for copper cable use in green circuits is pretty important and fairly early in the tech tree?

tomrod

So something like PvZ might work, right?

robotresearcher

If (1) is a special case of (2), maybe you’d only need (2)?

noddybear

True - although it might be interesting to benchmark them both, as (1) is more about debugging (something that these agents spend a lot of time doing).

gglon

I was thinking: to build a large, efficient factory autonomously, one could use an LLM as a high-level agent driving specialized tools. The overall strategy would perhaps look like the following:

1. create an (intermediate) goal for resource production

2. create a factory graph with the calculated number of machines and the number of resources to transport between them. This would be done using linear programming (factorio calculator) - see the sketch after this list

3. somehow map the resulting graph to a hardware description language, such that each entity is mapped to a unique logic component and each transport lane to a unique wire (most difficult)

4. compile to 2d FPGA layout using all the VLSI algos like partitioning, routing (hdl compiler)

5. map the resulting plan back to a concrete factorio design
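
Step 2 is the most mechanical part. A sketch of the linear program (recipe rates are illustrative, not exact Factorio numbers):

    import numpy as np
    from scipy.optimize import linprog

    items = ["copper-cable", "electronic-circuit"]
    # net items/sec produced by one machine running each recipe
    recipes = {
        "make-cable":   {"copper-cable": +4.0},
        "make-circuit": {"copper-cable": -6.0, "electronic-circuit": +2.0},
    }
    demand = {"electronic-circuit": 10.0}  # goal: 10 circuits/sec

    A = np.array([[r.get(item, 0.0) for r in recipes.values()] for item in items])
    b = np.array([demand.get(item, 0.0) for item in items])

    # minimize total machine count s.t. net production of every item >= demand
    res = linprog(c=np.ones(len(recipes)), A_ub=-A, b_ub=-b, bounds=(0, None))
    print(dict(zip(recipes, res.x)))  # fractional machines per recipe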

jkhdigital

This is exactly what I've been thinking as I see LLMs being applied to all these complex problem domains. Humans did not conquer the world because our intelligence can solve every problem; we did it by using our intelligence to (1) break down complex problems into small, manageable pieces and (2) design tools and machines that were exceptionally good at efficiently solving those subproblems.

The other recent example that comes to mind is the paper that explored the reasoning process used by LLMs to answer trivia questions like “Name a national capital whose letters can be rearranged to spell a common greeting in the language of a neighboring country.” (answer is Hanoi by the way)

The LLM responses show that they intuitively grasp the algorithm for answering such a question, but then they basically run the algorithm in their own thoughts (self-talk) which is horrendously inefficient.

Put differently, natural language reasoning is brilliant at turning the messiness of the real world into well-defined abstractions, but as soon as that is done it needs to hand off the task to a machine. For “solved” problems this might be a formally specified machine, but it could also be another class of model such as AlphaZero (along with a proper specification of the problem the “subcontractor” is to handle).

p10jkle

Wow, fascinating. I wonder if in a few years every in-game opponent will just be an LLM with access to a game-controlling API like the one you've created.

Did you find there are particular types of tasks that the models struggle with? Or does difficulty mostly just scale with the number of items they need to place?

noirscape

Very unlikely that you'll see mass-use of LLMs as opponents. Enemy AI in most games doesn't need the level of complexity that machine learning demands. (Ignoring computational costs for a second.)

The main goal of an enemy AI isn't to be the hardest thing in the world; it's to provide an interesting challenge for the player to overcome. It's not necessarily difficult to make a hypercompetent AI in most games, but that also wouldn't make it very interesting to play against. Most games have finite logic states, just numerous enough that a human has trouble finding every solution (although humans tend to be very good at pushing on the edges of these states to find ways around them).

Even in games where the amount of state is much higher than usual, you rarely want a super AI; nobody likes playing against an aimbot in an FPS for example.

Factorio is an outlier because, unlike regular games, the true condition for "victory" is almost entirely up to the player. You can launch a rocket in non-DLC Factorio (the game's victory condition) without building any factory at all beyond the most basic structures for items you can't handcraft. It'd be extremely slow, but it's an option. That's why the benchmark for this sort of thing is more about efficiency than "can this work".

PetitPrince

As an opponent that would indeed be unfun, but as a sparring partner / coach in a competitive game (fighting game? RTS? MOBA? puzzle game?) it would be useful.

fragmede

Civilization (VII just released) is famous for having the harder difficulties be harder because the AI cheats. If the game was harder because the AI was smarter instead of it cheating, it would be worth it to players to upgrade!

jkhdigital

Why LLM? Isn’t this what AlphaZero is good at? There are many more kinds of useful ML models than LLMs!

noddybear

Hey - yes, I think this is definitely possible, as you don't need any training compute for it to work. It's super easy to plug-and-play different models into new games once an API is made available.

Models struggle in 2 main areas. The first is spatial reasoning: the models often make off-by-one errors which they find hard to recover from (factories are very sensitive to these mistakes - as in programming). The second is long-term planning, i.e. figuring out what to do strategically before making tactical subgoals.

In lab-play, difficulty generally scales in proportion to the depth of the production chains: if an item requires several factory segments first, it is a lot more challenging. I think this is related to planning though, as the models tend to get down 'into the weeds' of fixing minor issues rather than coming up with a master plan first.

pyinstallwoes

Have you tried specific prompting, like having the model write a Mermaid diagram, to force it to contextualize long-horizon tasks?

noddybear

Yes, we tried that - as well as a few other visual DSLs for spatial reasoning. They didn't seem to help much, i.e. there were no failure modes that this approach solved compared to the simpler one. As the ARC-AGI results showed, there don't seem to be many 'free lunch' solutions to this without actually training.

posterman

"claude plays pokemon" shows that it struggles with mount moon (as did four year old me)

mNovak

Is there a human-play benchmark (even informally) for this style of interface? Not saying it's necessary or even relevant, I'm just curious to know what programmatic Factorio feels like -- I imagine spatial reasoning around text prompts would be fairly challenging for human players to navigate as well.

sonofhans

Human benchmarks for Factorio are speed runners — rushing to launch the first rocket. The current record is just over 4 hours for one player, and 90 minutes for a team. You can see just from that that a multi-tasking LLM has room to outperform humans.

janzer

The current 4h12m record is for 100% (where you have to get every single achievement in the game in one run); any% (where you just need to launch a rocket) is under 2 hours (1h42 for the latest Factorio v2.x, 1h18 for v1.x). There are a few other differences between the categories regarding map selection and blueprint use as well.

Records and specific rules for all categories can be found at https://www.speedrun.com/factorio

goriv

I think he is talking about a human using the programmatic API the LLMs use to play the game. I think that would be a whole lot slower than a normal playthrough.

myrmidon

Fascinating. Would have loved to see more pictures of the bigger factories-- or is the zig-zag belt into plastic production currently the best result?

I think this very clearly illustrates a big weakness of current LLMs-- humans might struggle just as much at first, but are able to specialize and adapt to a task, while LLMs can't-- yet.

I'm expecting even greater improvements from figuring out online learning/adaptation than what we got from chain-of-thought approaches.

Do you think the "API" to interact with the game is a big obstacle, compared to a human interacting with the game via monitor? Did anyone try to interact with the game via this API, and how does human effort measure up to the AIs?

noddybear

I have some pictures of bigger factories, but they tend to be filled with artefacts and general nonsense. I'll dig them out and add them to the appendix. The zig-zag into plastic production was the best 'lab' result, as it's pretty clear what the agent is doing.

Yes, the agents can consistently produce economic growth in game - but we don't really see a take off, where the growth keeps compounding over time. This is certainly _possible_ in FLE, as agents could write their own Python utility functions etc to construct and manage large factories (imagine imperative Factorio blueprints), but we haven't seen that yet.

Designing the API to not get in the way was the biggest challenge. It was imperative to avoid modal collapse - where the factory could not be sufficiently well expressed in the outputs of a program. While we think that we have generally 'solved' this, there are occasionally examples where the agent acts based on its previous output, but fails because there is something blocking it that it cannot easily see. One example would be the edge of water getting in the way of an entity placement.

All of the lab tasks were completed by a human using only the API, and we have lots of tests (inductively) demonstrating that it is possible to get to a rocket launch using the API alone.

Imnimo

Another category of "Lab Play" task I'd be interested in seeing is balancer design. Even small balancers can be quite complicated (https://factorioprints.com/view/-NopheiSZZ7d8VitIQv9), and it would be interesting to see how models do at designing and troubleshooting them.

fragmede

Someone approached that problem with a more traditional SAT solver:

https://github.com/R-O-C-K-E-T/Factorio-SAT

infogulch

Interesting to see only a handful of complex scenarios. I've always suspected ML game agents need hundreds of tiny puzzles with hundreds of variations each to learn game mechanics properly. Like:

    The factory is not powered, place the missing power pole(s)
    The factory is missing items, place the missing belt(s)
    Craft and place these 200 assembly machines
    The assembly machine is not running for some reason, fix it
    The factory production is too low, double it
    Get to this other point in the factory as fast as possible
    Fix the brownout
    All of the above with and without bots
Programmatically generating a few thousand example scenarios like these should be relatively easy. Then use it like an IQ test question bank: draw a dozen scenarios from the bank and evaluate performance on each based on time & materials used.

I hypothesize that ML agents learn faster when evaluated on a sample from a large bank of scenarios of smoothly increasing complexity where more complex scenarios are presented after it scores sufficiently high on lower complexity scenarios.

noddybear

I think generating the scenarios as you suggest (in text) is easy, but creating correct factory game states to start from is a lot harder. AFAIK it reduces to the same manual task of designing an init state and a task to complete.

infogulch

Yes, each scenario will need someone to design it, but you can get a lot of mileage out of each one. E.g. consider the "place the missing power pole" scenario: manually build a factory with a few dozen machines connected to a couple of steam engines via 20 power poles, then generate a couple hundred playable puzzles/scenarios by deleting 1-2 power poles from the working starting point (sketch below). Humans would find all of these equivalent, but I think agents need the explicit variation to learn the lesson properly.
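
A sketch of that generator (the blueprint entity format is hypothetical):

    from itertools import combinations

    def pole_deletion_variants(entities):
        poles = [i for i, e in enumerate(entities)
                 if e["name"] == "small-electric-pole"]
        for k in (1, 2):  # delete 1 or 2 poles at a time
            for removed in combinations(poles, k):
                yield [e for i, e in enumerate(entities) if i not in removed]

    # 20 poles -> 20 + 190 = 210 distinct broken-factory scenarios from one design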

martbakler

We are thinking of something like this (a curriculum approach) for further training. The reason we didn't want to do it for the current work, where the emphasis is on evaluation, is that the "difficulty level" of different tasks is quite subjective, and hence we would need to make arbitrary decisions that could affect the evals (i.e. which tasks would follow which scenarios, how to ensure sufficient coverage across all difficulty levels, etc.)

infogulch

"a curriculum approach" is a nice way to put it!

> the difficulty level of different tasks is subjective

That makes sense. I wonder if the difficulty of different scenarios could be derived by assuming a partial ordering and ranking based on training rate: e.g. the agent performs better at scenario T if it trains on scenario A first, but training on scenario B first doesn't help with T. Then infer A < T and B ? T.

barrystaes

I have long dreamt of automating Factorio the way an HDL and a PCB router work: just specify the ingredients and it produces a Factorio blueprint.

First an MVP with stupid designs, then optimized routing, and eventually something usable in-game that connects to provided inputs/outputs.

Would be more fun to develop than to play, obviously...

I liked the nilhouse megabase with those factory-train-block blueprints; it's basically Factorio DUPLO.

noddybear

Someone created this Terraform provider for Factorio a few years ago: https://registry.terraform.io/providers/efokschaner/factorio...

I think this could be a good starting point for what you describe! This stuff is always more fun to develop than to play. Since I started working on this project, I can't bring myself to play the core game myself...

delichon

I've wondered if automating Factorio would free me of the compulsion to play it.

noddybear

It certainly did with me.