Show HN: Factorio Learning Environment – Agents Build Factories
224 comments
March 11, 2025
vessenes
OK, you've permanently nerd-baited me, and I wish to apply for a job at the Anthropic Factorio lab immediately.
I can’t tell from the paper or these comments if you’re sending multimodal data back — I’m guessing no, because many of these models aren’t multimodal. But some are — and of course we now have the recently released Qwen 2.5 VL, which seems to be quite strong for its size.
You harp on this lack of spatial ability a fair amount - fair enough - and you mention difficulties in both planning and spatial planning. Are you sending images back? If not, any thoughts on this?
Thanks for this amazing bit of work, I really am reorganizing my day to play with it now.
P.s. seems like MCP enabling the python library is a natural must-do so that all tool-enabled LLMs everywhere can play factorio.
martbakler
Currently it's a text-only environment, but we are planning to support vision in the future. We did run a couple of tests and saw that including screenshots of the game state did not improve performance with the off-the-shelf models. As the complexity of the game state grew and the screenshots filled with more entities, the models got even more confused and started hallucinating directions and entities, or weren't capable of troubleshooting factories with apparent mistakes (e.g. a missing transport belt or a wrongly rotated inserter). We think it's because current VLMs aren't good at spatial reasoning over highly detailed images; this would likely improve significantly with finetuning.
Good point with MCP as well given it has been blowing up lately, we'll look into that!
vessenes
That makes sense and it’s really interesting - it is a challenging visual test for sure: thousands of entities, and either multi-tier visual representations (screen, map, overview map) or a GIANT high-res image. I hereby propose FLE-V, a subset benchmark for visual models where they just turn a Factorio image into a proper FLE description. And maybe the overview and map images as well.
kridsdale1
Such research could have hundreds of billions of dollars in downstream GDP implications when applied to real industrial settings.
grayhatter
> As the complexity of the game state grew and the screenshots filled with more entities, the models got even more confused and started hallucinating directions and entities, or weren't capable of troubleshooting factories with apparent mistakes (e.g. a missing transport belt or a wrongly rotated inserter). We think it's because [...]
I think you just described a research paper that would advance the SOTA. Less describing why, more how. (Assuming it's not just "we finetuned the model and it worked perfectly".)
martbakler
Sounds almost like a visual "needle in a haystack" line of work - that could be quite interesting!
jillyboel
Why would screenshots be necessary if a textual description of the factory state is both easier to interpret and less prone to confusion? The game is played on a grid, so converting the game state to ASCII ought to be trivial.
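For what it's worth, a minimal sketch of that conversion, with a made-up entity list and glyph table (this is not FLE's actual state format):

    # Hypothetical entity records: (x, y, kind) on an integer grid.
    GLYPHS = {"belt": ">", "inserter": "i", "furnace": "F", "drill": "D"}

    def to_ascii(entities, width, height):
        grid = [["." for _ in range(width)] for _ in range(height)]
        for x, y, kind in entities:
            if 0 <= x < width and 0 <= y < height:
                grid[y][x] = GLYPHS.get(kind, "?")
        return "\n".join("".join(row) for row in grid)

    print(to_ascii([(0, 0, "drill"), (1, 0, "belt"), (2, 0, "furnace")], 5, 3))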
martbakler
It actually is quite trivial engineering-wise, but the underlying question is which modality best elicits spatial reasoning capabilities from the current general models. We tried (very anecdotally) a couple of months ago to get an agent to reason over a few ASCII representations of factories, and the results weren't very promising. It seems the models struggle to build an accurate internal spatial representation of the game state from textual tokens alone.
The question is what the most efficient, highest-quality representation would be to improve that.
groby_b
> It seems the models struggle to build an accurate internal spatial representation of the game state from textual tokens alone
That'd actually be interesting research material for the claim that LLMs are able to build internal representations of the world. (Either they can't at all, which would be an important insight, or it turns out there's something fundamentally different about modalities that engages different reasoning/world-model capabilities, which would be even more interesting.)
Or, if you want to really go wild, "what capabilities allow models to reason in modalities fundamentally different from their input data/training data".
Damn it, I should quit and go back to University. [Ed.: She wouldn't quit, she likes her job, don't believe her]
ajcp
Did you try providing 2D vectors of where each object relates to every other object? Seems like the most obvious way.
In my experience the current generation of models is very poor at spatial reasoning even when given accurate coordinate-based locations for each object. But I suspect that once a model can build the whole relationship of all objects by being given those spatial relationships as vectors, they will be much better.
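As a toy illustration of that relational encoding (made-up entity names and positions, nothing from FLE):

    # Hypothetical entities with grid positions.
    entities = {"drill": (0, 0), "furnace": (3, 0), "chest": (3, 2)}

    # Pairwise 2D offset vectors: how each object sits relative to every other one.
    for a, (ax, ay) in entities.items():
        for b, (bx, by) in entities.items():
            if a != b:
                dx, dy = bx - ax, by - ay
                print(f"{b} is ({dx:+d}, {dy:+d}) from {a}")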
vessenes
Trivial as in only engineering work, sure. But it’s a lottt of tokens. Long context models do a number of things to get all that working context in; some of those things elide details / compress / have token segments that are harder to reason about. When a burner inserter at a location takes up like 50-100 tokens, and you want it to reason about 100 of them, this is still a pretty challenging task for any LLM.
jillyboel
Ah, I don't know much about multimodal models, but I wonder what they'd think of pixel art representing the factory, where each pixel is a point on the grid and each color is a specific entity, perhaps ignoring things such as bots flying about. Probably easier to comprehend than an actual screenshot?
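A rough sketch of that with Pillow, reusing the same kind of hypothetical (x, y, kind) entity list as above; the palette is arbitrary:

    from PIL import Image

    # One pixel per grid cell, one colour per entity type.
    PALETTE = {"belt": (255, 200, 0), "inserter": (0, 120, 255), "furnace": (200, 60, 60)}

    def render(entities, width, height, scale=8):
        img = Image.new("RGB", (width, height), (30, 30, 30))
        for x, y, kind in entities:
            img.putpixel((x, y), PALETTE.get(kind, (255, 255, 255)))
        # Nearest-neighbour upscale keeps each cell a crisp block of colour.
        return img.resize((width * scale, height * scale), Image.NEAREST)

    render([(1, 1, "belt"), (2, 1, "furnace")], 16, 16).save("factory.png")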
scottmsul
There was an HN post here not too long ago about a team that used reinforcement learning to train an agent to beat Pokemon Red. They mentioned how they had to tweak the cost function to give small rewards for exploring and big rewards for completing "essential tasks" like beating gyms.
I wonder if this same approach could be used here in Factorio? Using the Pokemon Red analogy, the main "essential tasks" in Factorio are setting up automation for new items and new science packs. I think a good reward function could involve small rewards for the production rate of each item/sec, medium rewards for setting up automation for new items, and big rewards for automating each new science pack.
Telling a Factorio agent to just "make a big factory" is like telling a Pokemon Red agent to just "beat the game"; it has to be broken down into smaller steps with a very carefully tuned reward function.
Thinking about this is really making me want to jump into this project!
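A rough sketch of what such a shaped reward could look like; all the weights and the production-stats inputs below are invented for illustration:

    import math

    def shaped_reward(prod_rates, newly_automated_items, newly_automated_science):
        """prod_rates: hypothetical {item: items_per_second} dict for the last window."""
        r = 0.0
        # Small: raw production rates, log-scaled so early-game items still register.
        r += sum(0.01 * math.log1p(rate) for rate in prod_rates.values())
        # Medium: items produced by an assembler for the first time (vs. handcrafted).
        r += 1.0 * len(newly_automated_items)
        # Large: each new science pack brought under automation.
        r += 10.0 * len(newly_automated_science)
        return r

    print(shaped_reward({"iron-plate": 30, "iron-gear-wheel": 8}, {"iron-gear-wheel"}, set()))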
scottmsul
Also I should add, as a Factorio veteran with 2-3k hours in this game, I think the goal of making the "largest possible factory" is too vague and not the right metric. When Factorio players make large megabases, they don't go for "size" per se, but rather science research per minute. The metric you should be giving the agents is SPM, not "largest" base!
csense
Agree, "largest" base has some pathologies.
Put machine #1 at the starting location, run in one direction, and put machine #2 just before time runs out.
This is going to be a huge factory (as measured by its bounding box) but it's not super interesting.
soulbadguy
ahhh another factorio addict :) Curious, how long was your first playthrough (assuming launching the first rocket in v1.x)?
noddybear
In FLE, you have access to milestones representing the first time a new entity was created, but coming up with a stratification of rewards for different degrees of automation would be really interesting. Join us!
martbakler
This is interesting. One of our findings was that Claude was capable of essential tasks & simple automation (e.g. the iron gear wheel factory in lab-play) but didn't even try them during the "build the biggest factory" game episodes. So the models can do these essential tasks, but when given a general goal, i.e. "complete the game", they don't have a good enough level of long-term planning to even attempt them. Often they just built uncoordinated small-scale constructs without attempting to scale up existing factories.
That was also one of our goals: to find out how the models act when given a very vague and general objective.
mclau156
The same approach could be used in life
Gasp0de
Did you read the page? Because they did give rewards per item produced, and more complex items gave higher rewards.
noosphr
>We evaluate six frontier language models across both settings: Claude 3.5-Sonnet, GPT-4o, GPT-4o-Mini, Deepseek-v3, Gemini-2-Flash, and Llama-3.3-70B-Instruct.
While I appreciate the effort and creativity that went into this, there are a lot of much simpler dynamic benchmarks that can let you saturate the planning capabilities of non-reasoning models.
Something as simple as giving a list of flight connections between cities and then asking for an itinerary between them confuses all these models when the shortest path between two nodes is long enough.
Longest shortest path the models could reliably find (8/10 tests for a given length) between two cities:
| Model | Path Length |
|------------------+-------------|
| Claude Sonnet3.5 | 10 |
| GPT-4o | 7 |
| GPT-4o-mini | 4 |
| Deepseek-v3 | 6 |
| Gemini-2-Flash | Not tested |
| Llama3.3-70B-Ins | 4 |
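For anyone who wants to try this kind of test, a minimal version can be generated in a few lines (graph size and connectivity here are arbitrary, and this isn't the exact setup behind the table above); the BFS result is only used to grade the model's itinerary and bucket puzzles by path length:

    import random
    from collections import deque

    def random_flights(n_cities=40, n_routes=70, seed=0):
        rng = random.Random(seed)
        cities = [f"City{i}" for i in range(n_cities)]
        routes = set()
        while len(routes) < n_routes:
            routes.add(tuple(rng.sample(cities, 2)))  # directed flight a -> b
        return routes

    def shortest_itinerary(routes, start, goal):
        adj = {}
        for a, b in routes:
            adj.setdefault(a, []).append(b)
        prev, queue = {start: None}, deque([start])
        while queue:
            u = queue.popleft()
            if u == goal:
                path = []
                while u is not None:
                    path.append(u)
                    u = prev[u]
                return path[::-1]  # ground-truth itinerary
            for v in adj.get(u, []):
                if v not in prev:
                    prev[v] = u
                    queue.append(v)
        return None  # no connection exists

    routes = random_flights()
    print(shortest_itinerary(routes, "City0", "City7"))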
noddybear
This is true - there are simpler benchmarks that can saturate planning for these models. We were motivated to create a broader spectrum eval, to test multiple capabilities at once and remain viable into the future.
noosphr
That's fair enough, but you should test other frontier model types to see if the benchmark makes sense for them.
For example, the shortest-path benchmark is largely useless when you look at reasoning models: since they have the equivalent of scratch paper to work through their answers, the limitation becomes their context length rather than any innate ability to reason.
owenpalmer
> All models exhibited limitations in spatial planning when constructing multi-section factories. Common failures included placing entities too close together, not allocating space for connections, or incorrect inserter placement
It makes sense that LLMs are bad at spatial reasoning: there's not a lot of training data for it. I wonder what additional reasoning abilities will emerge once spatial reasoning is solved.
wordpad
How is there not a lot of spatial data?
Isn't it literally infinite via even the simplest simulator?
You could generate an unlimited training set just by implementing tic-tac-toe on an unbounded grid, for example, in like 10 lines of code.
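Roughly in the spirit of that "10 lines" claim, one way such examples could be stamped out: random positions on an unbounded grid (coordinates as dict keys) labelled by whether anyone has m in a row. Everything here is illustrative.

    import random

    def m_in_a_row(board, m=3):
        """board: {(x, y): 'X' or 'O'} on an unbounded grid; return the winner, if any."""
        for (x, y), p in board.items():
            for dx, dy in [(1, 0), (0, 1), (1, 1), (1, -1)]:
                if all(board.get((x + k * dx, y + k * dy)) == p for k in range(m)):
                    return p
        return None

    rng = random.Random(0)
    board, player = {}, "X"
    while len(board) < 8:
        cell = (rng.randint(-5, 5), rng.randint(-5, 5))
        if cell not in board:
            board[cell] = player
            player = "O" if player == "X" else "X"
    print(board, "winner:", m_in_a_row(board))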
owenpalmer
Synthetic data will play a big role, yes. There are other challenges though, like how verbal descriptions of objects would affect their spatial behavior. Building a generalized simulator that combines those modalities is hard.
In this particular case with Factorio, I suspect generating the synthetic data would be easier, since the rules of the environment are relatively simple and well defined, with quantifiable outcomes.
Imnimo
Another category of "Lab Play" task I'd be interested in seeing is balancer design. Even small balancers can be quite complicated (https://factorioprints.com/view/-NopheiSZZ7d8VitIQv9), and it would be interesting to see how models do at designing and troubleshooting them.
fragmede
Someone approached that problem with a more traditional SAT solver.
spieswl
Fantastic idea.
It seems like there are a lot of interesting experiments to be had here. The lab-play scenarios having a time-related component seems like a good idea. I assume most Factorio players who keep biters on treat them as a combined temporal-spatial constraint, so you have a sort of proxy comparison to a real game situation when you put the agents on a timer.
I like the way the framework is testing different things than micromanagement proficiency, such as what we have seen in DOTA 2 or StarCraft 2 experiments. Notably, heavy worker micromanagement (in the case of the latter game) becomes a way to squeak out extra minerals when you have infinite APM available. That's an interesting learned behavior in a narrow context, but the tactic is really control-intensive and even pro players have a high chance of screwing it up when attempting it. It also doesn't seem to give additional insight into an agent's longer-term planning, execution, and analytical performance. With all that in mind, FLE seems way more interesting as a higher-level "thinking" evaluation framework.
Any plans for layout optimization benchmarks? As in, start with a given factory cell with X inputs and Y outputs, and optimize its performance.
noddybear
One thing we've been talking about is creating tasks that are a bit more 'tower defence', where biters are released every X steps / seconds. The idea would be to test agents in building a military-industrial complex. One amusing issue we had in developing this idea is that frontier models have an aversion to creating entities called 'GunTurret' etc - as it goes against their constitution! (perhaps we should rename turrets to 'SuperSoaker' or something)
Regarding layout optimisation benchmarks, we actually discussed this yesterday. I think we need 2 types of layout task: 1) fix this subtly broken factory, and 2) improve the throughput of this factory. These should be straightforward to implement, if you'd like to have a look.
aftbit
>One amusing issue we had in developing this idea is that frontier models have an aversion to creating entities called 'GunTurret' etc - as it goes against their constitution! (perhaps we should rename turrets to 'SuperSoaker' or something)
This sounds like a great idea for a short story in the style of Malak by Peter Watts. Imagine a future warfighter AI that has been fitted with a set of filters to make it think it's really having a pillowfight or building a factory to make screws while it's actually tearing people apart or optimizing a military production line.
noddybear
There was a Black Mirror episode about this too, I seem to remember! Soldiers imagining they were fighting monsters - while actually committing war crimes.
spieswl
Love the suggestion, I'll clone it down and start poking around.
I believe your intuition about layout experiments needing to be of different genres is correct. I think you could have a pretty wide range of debugging opportunities (imbalanced belts, inserters fighting for items, insufficient power at full load leading to throughput loss, etc) for the first. The second feels like it would be nicely encapsulated by focusing on optimizing for ratios, although seeing an agent realize that they can get more throughput by simply copy/pasting a block and upgrading a belt would be pretty wild (depending on the recipe, of course). Maybe nuclear power / heat exchanger ratios are a bit too far down the path, but optimizing for copper cable use in green circuits is pretty important and fairly early in the tech tree?
robotresearcher
If (1) is a special case of (2), maybe you’d only need (2)?
noddybear
True - although it might be interesting to benchmark them both, as (1) is more about debugging (something that these agents spend a lot of time doing).
tomrod
So something like PvZ might work, right?
jxjnskkzxxhx
I don't understand - were these models post-trained to play Factorio? A) If so, how is that possible given that e.g. Claude doesn't have public weights? B) If not, how would the agent know what the API does? Even if it's "guessing" from the English meaning of the API commands (e.g. place_entity_next_to places an entity next to something), how would it know what the recipes are? If it's trying and learning, we go back to A).
Having read the PDF, I don't think these models were post-trained, so how do we explain the questions in B)?
And if indeed there's no post-training and the authors expected exploration of recipes to come from the context window... I think that's way too short for RL-style improvement.
In short, I don't understand how they could've tested those models with post-training, and without post-training they all did unbelievably well.
If the authors read this: can you give us an idea of how many API query and response pairs fit within the context window, on average? Follow-up: do you get better results if you abbreviate the API call names, so that more response pairs fit within one context window?
martbakler
To also jump in here: regarding tools, the agents had access to function signatures (i.e. tool docstrings, input and output types) and, for each tool, a small "manual" which described what the tool does, how it affects the game state, and a small number of examples where using the tool would be useful (for instance, how to use place_entity_next_to to put an inserter next to an existing chest).
Overall, as Jack said, no post-training was done at all, but all agents had a complete API description (tools, entities, research) in their context, so the results indicate to some level how well modern agents can use a completely OOD API with a decent level of documentation.
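To make that concrete, the entries in the context are shaped roughly like the stub below. The parameter names, signature and manual wording here are guesses for illustration only; the real docstrings and manual text live in the repo:

    def place_entity_next_to(entity, reference_position, direction, spacing: int = 0):
        """Place `entity` adjacent to whatever occupies `reference_position`.

        Manual (illustrative wording, not the real FLE text):
        - Offsets the new entity from the reference by the bounding boxes of both,
          plus `spacing` extra tiles, in the given `direction`.
        - Returns the placed entity object; its position can be fed to later calls.
        - Example use: put a burner inserter immediately east of an existing chest,
          then rotate it so it feeds a transport belt.
        """
        ...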
noddybear
These models were not post-trained - all off-the-shelf.
We can fit about 128 pairs maximum in the context, but this performed the same as 32, which we ultimately decided on (for cost and latency purposes).
Encoding the inputs/outputs to make them shorter degraded performance. It seems that descriptive names are helpful for pretrained models because they have an intuition about what they do.
jxjnskkzxxhx
Follow-up: do you have a hypothesis for why Claude performs much better than the rest at these tasks?
Is it just because Claude is the best at coding and the API is code? (Not very interesting.) Maybe if the API required the LLMs to write in poems, the best LLM at poetry would win...
Or is it because whatever makes Claude good at coding also makes it good at mathematical-like tasks? This is more interesting, as it would show some transfer learning. It would also suggest that if you're training for a specific task, you would benefit from also training on adjacent tasks, e.g. if you're training for maths you could benefit from training on coding. I believe this is actually true for humans.
And would you know how to check whether any of the above hypotheses is correct?
c0wb0yc0d3r
The way I read the footnotes about the authors, one works at Anthropic. I would guess that means some insider access.
noddybear
One of us works at Anthropic - but we had no insider access to any models or weights. All of our evals were on public models.
infogulch
Interesting to see only a handful of complex scenarios. I've always suspected ML game agents need hundreds of tiny puzzles with hundreds of variations each to learn game mechanics properly. Like:
The factory is not powered, place the missing power pole(s)
The factory is missing items, place the missing belt(s)
Craft and place these 200 assembly machines
The assembly machine is not running for some reason, fix it
The factory production is too low, double it
Get to this other point in the factory as fast as possible
Fix the brownout
All of the above with and without bots
Programmatically generating a few thousand example scenarios like these should be relatively easy. Then use it like an IQ test question bank: draw a dozen scenarios from the bank and evaluate performance on each based on time & materials used.
I hypothesize that ML agents learn faster when evaluated on a sample from a large bank of scenarios of smoothly increasing complexity, where more complex scenarios are presented only after the agent scores sufficiently high on lower-complexity scenarios.
noddybear
I think generating the scenarios as you suggest (in text) is easy, but creating correct factory game states to start from is a lot harder. AFAIK it reduces to the same manual task of designing an init state and a task to complete.
infogulch
Yes, each scenario will need someone to design it, but you can get a lot of mileage out of each. E.g. consider the "place the missing power pole" scenario: manually build a factory with a few dozen machines connected to a couple of steam engines via 20 power poles, and you can then generate a couple hundred playable puzzles/scenarios by deleting 1-2 power poles from the working starting point. Humans would find all of these to be equivalent, but I think agents need the explicit variation to learn the lesson properly.
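A sketch of that dropout idea, treating a scenario as just a list of entity placements (the actual save/load plumbing through the game is hand-waved, and the entity dicts are made up):

    import itertools

    def dropout_variants(working_factory, removable_kinds=("small-electric-pole",), k=2):
        """Yield broken copies of a working factory with 1..k removable entities deleted."""
        removable = [e for e in working_factory if e["kind"] in removable_kinds]
        for r in range(1, k + 1):
            for missing in itertools.combinations(removable, r):
                yield [e for e in working_factory if e not in missing]

    factory = [{"kind": "small-electric-pole", "pos": (x, 0)} for x in range(0, 140, 7)]
    factory += [{"kind": "electric-mining-drill", "pos": (x, 3)} for x in range(0, 140, 10)]
    print(sum(1 for _ in dropout_variants(factory)), "puzzle variants from one hand-built factory")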
noddybear
Oh super interesting! Create 10 scenarios containing working factories, and ‘drop out’ entities to break the factory in different ways. Great idea.
martbakler
We are thinking of something like this (a curriculum approach) for further training. The reason we didn't want to do it for the current work, where the emphasis is on evaluations, is that the "difficulty level" of different tasks is quite subjective, and hence we would need to make arbitrary decisions that could affect the evals (i.e. which tasks would follow which scenarios, how to ensure sufficient coverage across all difficulty levels, etc.).
infogulch
"a curriculum approach" is a nice way to put it!
> the difficulty level of different tasks is subjective
That makes sense. I wonder if the difficulty of different scenarios could be derived by assuming a partial ordering and ranking based on training rate: e.g. the agent performs better at scenario T if it trains on scenario A first, but training on scenario B first doesn't help with T. Then infer A < T, and B ? T.
mNovak
Is there a human-play benchmark (even informally) for this style of interface? Not saying it's necessary or even relevant, I'm just curious to know what programmatic Factorio feels like -- I imagine spatial reasoning around text prompts would be fairly challenging for human players to navigate as well.
sonofhans
Human benchmarks for Factorio are speed runners — rushing to launch the first rocket. The current record is just over 4 hours for one player, and 90 minutes for a team. You can see just from that that a multi-tasking LLM has room to outperform humans.
janzer
The current 4h12m record is for 100% (where you have to get every single achievement in the game in the one run); any% (where you just need to launch a rocket) is under 2 hours (1h42 for the latest Factorio v2.x, 1h18 for v1.x). There are a few other differences between the categories regarding map selection and blueprint use as well.
Records and specific rules for all categories can be found at https://www.speedrun.com/factorio
p10jkle
Wow, fascinating. I wonder if in a few years every in-game opponent will just be an LLM with access to a game-controlling API like the one you've created.
Did you find there are particular types of tasks that the models struggle with? Or does difficulty mostly just scale with the number of items they need to place?
noirscape
Very unlikely that you'll see mass use of LLMs as opponents. Enemy AI in most games doesn't need the level of complexity that machine learning demands. (Ignoring computational costs for a second.)
The main goal of an enemy AI isn't to be the hardest thing in the world; it's to provide an interesting challenge for the player to overcome. It's not necessarily difficult to make a hypercompetent AI in most games, but that also wouldn't make it very interesting to play against. Most games have finite states of logic, just large enough that a human would have trouble finding every solution (although humans tend to be very good at pushing on the edges of these states to find ways around them).
Even in games where the amount of state is much higher than usual, you rarely want a super AI; nobody likes playing against an aimbot in an FPS for example.
Factorio is an outlier because, unlike regular games, the true condition for a "victory" is almost entirely up to the player. You can make a rocket in non-DLC Factorio (the game's victory condition) without building any factory at all beyond the most basic structures for stuff you can't handcraft. It'd be extremely slow, but it's an option. That's why the benchmark for this sort of thing is more about efficiency than "can this work".
fragmede
Civilization (VII just released) is famous for having the harder difficulties be harder because the AI cheats. If the game was harder because the AI was smarter instead of it cheating, it would be worth it to players to upgrade!
PetitPrince
As an opponent that would indeed be unfun, but as a sparring partner / coach in a competitive game (fighting game? RTS? MOBA? Puzzle game?) it would be useful.
noddybear
Hey - yes, I think this is definitely possible, as you don't need any training compute for it to work. It's super easy to plug-and-play different models into new games, once an API is made available.
Models struggle in 2 main areas. The first is spatial reasoning: the models often make off-by-one errors which they find hard to recover from (as factories are very sensitive to these mistakes - like in programming). The second is long-term planning, i.e. figuring out what to do strategically before making tactical subgoals.
The difficulty scales in lab-play generally in proportion to the depth of the production chains. If an item requires several factory segments first, this makes it a lot more challenging. I think this is related to planning though, as the models tend to get down 'into the weeds' of fixing minor issues - rather than coming up with a master plan first.
pyinstallwoes
Have you tried specific prompting, like having the model write a mermaid diagram, to force it to contextualize long-horizon tasks?
noddybear
Yes, we tried that - as well as a few other visual DSLs for spatial reasoning. They didn't seem to help much, i.e. there were no failure modes that this approach solved compared to the simpler approach. As the ARC-AGI results showed, there don't seem to be many 'free lunch' solutions to this without actually training.
posterman
"claude plays pokemon" shows that it struggles with mount moon (as did four year old me)
jkhdigital
Why LLM? Isn’t this what AlphaZero is good at? There are many more kinds of useful ML models than LLMs!
gglon
I was thinking that to build a large, efficient factory autonomously, one could use an LLM as a high-level agent that uses specialized tools. The overall strategy would perhaps look like the following:
1. create an (intermediate) goal for resource production
2. create a factory graph with the calculated number of machines and the amount of resources that must be transported between them. This would be done using linear programming (a Factorio calculator)
3. somehow map the resulting graph to a hardware description language, such that each entity maps to a unique logic component and each transport lane maps to a unique wire (most difficult)
4. compile to a 2D FPGA-style layout using all the VLSI algorithms like partitioning and routing (HDL compiler)
5. map the resulting plan back to a concrete Factorio design
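Step 2 in miniature: the ratio arithmetic a Factorio calculator does, with a toy recipe table (a full LP formulation would also handle shared intermediates, modules and alternative recipes; the recipe values and assembler speed below are the standard early-game ones, but double-check them):

    # Toy recipe table: item -> (craft_time_s, items_per_craft, {ingredient: amount}).
    RECIPES = {
        "electronic-circuit": (0.5, 1, {"iron-plate": 1, "copper-cable": 3}),
        "copper-cable": (0.5, 2, {"copper-plate": 1}),
    }
    CRAFT_SPEED = 0.75  # assembling machine 2

    def machines_needed(item, rate_per_s, plan=None):
        """Back-propagate a target output rate through the recipe graph."""
        plan = {} if plan is None else plan
        if item not in RECIPES:  # raw input (plates, ore, ...)
            plan[item] = plan.get(item, 0) + rate_per_s
            return plan
        craft_time, per_craft, ingredients = RECIPES[item]
        crafts_per_s = rate_per_s / per_craft
        plan[item] = plan.get(item, 0) + crafts_per_s * craft_time / CRAFT_SPEED  # machine count
        for ing, amount in ingredients.items():
            machines_needed(ing, crafts_per_s * amount, plan)
        return plan

    print(machines_needed("electronic-circuit", 2.0))
    # -> ~1.33 circuit assemblers, 2 cable assemblers, 2 iron-plate/s and 3 copper-plate/s of inputs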
jkhdigital
This is exactly what I’ve been thinking as I see LLMs being applied to all these complex problem domains. Humans did not conquer the world because our intelligence can solve every problem; we did it by using our intelligence to (1) break down complex problems into small, manageable pieces and (2) design tools and machines that were exceptionally good at efficiently solving those subproblems.
The other recent example that comes to mind is the paper that explored the reasoning process used by LLMs to answer trivia questions like “Name a national capital whose letters can be rearranged to spell a common greeting in the language of a neighboring country.” (answer is Hanoi by the way)
The LLM responses show that they intuitively grasp the algorithm for answering such a question, but then they basically run the algorithm in their own thoughts (self-talk) which is horrendously inefficient.
Put differently, natural language reasoning is brilliant at turning the messiness of the real world into well-defined abstractions, but as soon as that is done it needs to hand off the task to a machine. For “solved” problems this might be a formally specified machine, but it could also be another class of model such as AlphaZero (along with a proper specification of the problem the “subcontractor” is to handle).
myrmidon
Fascinating. Would have loved to see more pictures of the bigger factories-- or is the zig-zag belt into plastic production currently the best result?
I think this very clearly illustrates a big weakness of current LLMs-- humans might struggle just as much at first, but are able to specialize and adapt to a task, while LLMs can't-- yet.
I'm expecting even greater improvements from figuring out online learning/adaptation than what we got from chain-of-thought approaches.
Do you think the "API" to interact with the game is a big obstacle, compared to a human interacting with the game via monitor? Did anyone try to interact with the game via this API, and how does human effort measure up to the AIs?
noddybear
I have some pictures of bigger factories - but they tend to be filled with artefacts and general nonsense. I'll dig them out and add them to the appendix. The zig-zag into plastic production was the best 'lab' result, as it's pretty clear what the agent is doing.
Yes, the agents can consistently produce economic growth in game - but we don't really see a take off, where the growth keeps compounding over time. This is certainly _possible_ in FLE, as agents could write their own Python utility functions etc to construct and manage large factories (imagine imperative Factorio blueprints), but we haven't seen that yet.
Designing the API to not get in the way was the biggest challenge. It was imperative to avoid modal collapse - where the factory could not be sufficiently well expressed in the outputs of a program. While we think that we have generally 'solved' this, there are occasionally examples where the agent acts based on its previous output, but fails because there is something blocking it that it cannot easily see. One example would be the edge of water getting in the way of an entity placement.
All of the lab tasks were completed by a human using only the API, and we have lots of tests (inductively) demonstrating that it is possible to get to a rocket launch using the API alone.
I'm Jack, and I'm excited to share a project that has channeled my Factorio addiction recently: the Factorio Learning Environment (FLE).
FLE is an open-source framework for developing and evaluating LLM agents in Factorio. It provides a controlled environment where AI models can attempt complex automation, resource management, and optimisation tasks in a grounded world with meaningful constraints.
A critical advantage of Factorio as a benchmark is its unbounded nature. Unlike many evals that are quickly saturated by newer models, Factorio's geometric complexity scaling means it won't be "solved" in the next 6 months (or possibly even years). This allows us to meaningfully compare models by the order-of-magnitude of resources they can produce - creating a benchmark with longevity.
The project began 18 months ago after years of playing Factorio, recognising its potential as an AI research testbed. A few months ago, our team (myself, Akbir, and Mart) came together to create a benchmark that tests agent capabilities in spatial reasoning and long-term planning.
Two technical innovations drove this project forward: First, we discovered that piping Lua into the Factorio console over TCP enables running (almost) arbitrary code without directly modding the game. Second, we developed a first-class Python API that wraps these Lua programs to provide a clean, type-hinted interface for AI agents to interact with Factorio through familiar programming paradigms.
Agents interact with FLE through a REPL pattern:
1. They observe the world (seeing the output of their last action)
2. Generate Python code to perform their next action
3. Receive detailed feedback (including exceptions and stdout)
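A self-contained mock of that loop, just to show its shape; everything here (MockEnv, mock_llm) is a stand-in, not FLE's real interface:

    import contextlib, io, traceback

    class MockEnv:
        """Stand-in for the environment: runs a Python snippet, returns stdout/exceptions."""
        def execute(self, program: str) -> str:
            buf = io.StringIO()
            try:
                with contextlib.redirect_stdout(buf):
                    exec(program, {})
            except Exception:
                buf.write(traceback.format_exc())
            return buf.getvalue()

    def mock_llm(observation: str) -> str:
        """Stand-in for the agent: would normally be an LLM call conditioned on history."""
        return f"print('next action, having seen: ' + {observation[:30]!r})"

    env, observation = MockEnv(), "starting inventory: 1 burner mining drill"
    for step in range(3):
        program = mock_llm(observation)      # 2. generate Python for the next action
        observation = env.execute(program)   # 3. run it, collect stdout and exceptions
        print(f"[step {step}] {observation.strip()}")   # 1. observe and repeat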
We provide two main evaluation settings:
- Lab-play: 24 structured tasks with fixed resources
- Open-play: An unbounded task of building the largest possible factory on a procedurally generated map
We found that while LLMs show promising short-horizon skills, they struggle with spatial reasoning in constrained environments. They can discover basic automation strategies (like electric-powered drilling) but fail to achieve more complex automation (like electronic circuit manufacturing). Claude Sonnet 3.5 is currently the best model (by a significant margin).
The code is available at https://github.com/JackHopkins/factorio-learning-environment.
You'll need:
- Factorio (version 1.1.110)
- Docker
- Python 3.10+
The README contains detailed installation instructions and examples of how to run evaluations with different LLM agents.
We would love to hear your thoughts and see what others can do with this framework!