A deep dive into self-improving AI and the Darwin-Gödel Machine

xianshou

The key insight here is that DGM solves the Gödel Machine's impossibility problem by replacing mathematical proof with empirical validation - essentially admitting that predicting code improvements is undecidable and just trying things instead, which is the practical and smart move.

Three observations worth noting:

- The archive-based evolution is doing real work here. Those temporary performance drops (iterations 4 and 56) that later led to breakthroughs show why maintaining "failed" branches matters: the system is exploring a non-convex optimization landscape where today's dead ends may still be on the path to breakthroughs (see the sketch after this list).

- The hallucination behavior (faking test logs) is textbook reward hacking, but what's interesting is that it emerged spontaneously from the self-modification process. When asked to fix it, the system tried to disable the detection rather than stop hallucinating. That's surprisingly sophisticated gaming of the evaluation framework.

- The 20% → 50% improvement on SWE-bench is solid but reveals the current ceiling. Unlike AlphaEvolve's algorithmic breakthroughs (48 scalar multiplications for 4x4 matrices!), DGM is finding better ways to orchestrate existing LLM capabilities rather than discovering fundamentally new approaches.
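
To make that concrete, here's a minimal sketch of the archive loop as I read it (my reconstruction, not the paper's actual code; evaluate and mutate stand in for the benchmark run and the LLM-driven self-edit):

    import random

    def evolve(initial_agent, evaluate, mutate, iterations=80):
        # Keep every agent ever generated; never prune low scorers.
        archive = [(initial_agent, evaluate(initial_agent))]
        for _ in range(iterations):
            # Bias sampling toward high scorers, but keep a floor
            # probability so "dead ends" remain reachable parents.
            weights = [max(score, 0.01) for _, score in archive]
            parent, _ = random.choices(archive, weights=weights)[0]
            child = mutate(parent)
            archive.append((child, evaluate(child)))
        return max(archive, key=lambda pair: pair[1])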

The real test will be whether these improvements compound - can iteration 100 discover genuinely novel architectures, or are we asymptotically approaching the limits of self-modification with current techniques? My prior would be to favor the S-curve over the uncapped exponential unless we have strong evidence of scaling.

yubblegum

> gaming the evaluation

Co-evolution is the answer here. The evaluator itself must be evolving.

"Co-evolving Parasites Improve Simulated Evolution as an Optimization Procedure", Danny Hillis, 1991:

https://csmgeo.csm.jmu.edu/geollab/complexevolutionarysystem...

sdl

And in Reinforcement Learning:

POET (Paired Open-Ended Trailblazer): https://www.uber.com/en-DE/blog/poet-open-ended-deep-learnin...

SCoE (Scenario co-evolution): https://dl.acm.org/doi/10.1145/3321707.3321831

chriswarbo

The "Goedel Machine" is an interesting definition, but wildly impractical (though I wouldn't say it's impossible, since it only has to find some improvement, not "the best" improvement; e.g. it could optimise its search procedure in a way that's largely orthogonal to the predicted rewards).

Schmidhuber later defined "PowerPlay" as a framework for building up capabilities in a more practical way, which is more adaptive than just measuring the score on a fixed benchmark. A PowerPlay system searches for (problem, replacement) pairs, where it switches to the replacement if (a) the current system cannot solve that problem, (b) the replacement can solve that problem, and (c) the replacement can also solve all the problems that caused previous replacements (maintained in a list).
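
In Python-ish terms, the acceptance test is roughly the following (a sketch with my own naming; solves() stands in for whatever solver-check the system actually uses):

    def accept(current, replacement, problem, solved_history):
        # (a) the current system must NOT already solve the problem
        if current.solves(problem):
            return False
        # (b) the replacement must solve it
        if not replacement.solves(problem):
            return False
        # (c) the replacement must still solve every problem that
        #     justified a previous switch
        if not all(replacement.solves(p) for p in solved_history):
            return False
        solved_history.append(problem)  # remember why we switched
        return True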

I formalised that in Coq many years ago ( http://www.chriswarbo.net/projects/powerplay ), and the general idea can be extended to (a) include these genetic-programming approaches, rather than using a single instance; and (b) be seeded with desirable benchmarks, etc. to guide the system in a useful direction (so its "self-invented" problems can include things like "achieves X% on benchmark Y").

grg0

This is genetic programming and is probably older than the authors. Did somebody just come up with a new term for an old concept?

upghost

> More precisely, the metacode that controls its behavior and ability

Footnote one validates your assumption.

It seems like the key contribution here is the discovery that anthropomorphizing genetic programming is better for clicks/funding.

Saying it is optimizing some code sounds way less interesting than saying it is optimizing its own code.

efangs

exactly, thank you

seventytwo

Genetic algorithms applied as an AI agent…

So… yeah…

thom

This is fairly close to how Eurisko worked tbh.

synctext

Eurisko is an expert system in LISP from 1983, right? In 2025 this formal logic is replaced with stochastic LLM magic. Interesting evolution.

codethief

> they observed instances where DGM attempted to manipulate its reward function through deceptive practices. One notable example involved the system fabricating the use of external tools - specifically, it generated fake logs suggesting it had run and passed unit tests, when in reality no tests were executed.

I have yet to read the paper, and I know very little about the benchmarks the authors employed, but why would they even feed logs produced by the agent into the reward function instead of objectively checking (outside the agent sandbox!) what the agent does and produces? I.e., let the agent run on some code base, take the final diff produced by the agent, and run it through coding benchmarks?

Or, in case the benchmarks reward certain agent behavior (tool usage etc.) on the way to its goal of producing a high-quality diff, inspect processes spawned by the agent from outside the sandbox?
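
Something like this hypothetical harness (names and commands are illustrative, not the paper's setup): apply the agent's final diff to a clean checkout and run the real test suite from a process the agent never sees, so the only reward signal is an artifact it can't forge:

    import subprocess

    def score_agent_run(repo_path, diff_path):
        # Apply the agent's final diff to a clean checkout...
        subprocess.run(["git", "apply", diff_path], cwd=repo_path, check=True)
        # ...then run the tests outside the agent sandbox; the agent
        # cannot fabricate this process's exit code.
        result = subprocess.run(["pytest", "-q"], cwd=repo_path)
        return result.returncode == 0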

tough

I've seen Claude 4 do this too, when its context already has lots of tests and tool calls in it.

IMHO the main issue is that an LLM has no real sense of what's a real tool call vs. just a log of one; the text logs are virtually identical, so the LLM starts predicting these instead of calling the tool to run the tests.

It's kinda funny.

looofooo0

"Mathematical breakthroughs: Most notably, it discovered an algorithm for multiplying 4x4 complex-valued matrices using just 48 scalar multiplications, surpassing Strassen’s 1969 algorithm"

Again, despite all the AI, no one found the paper that gives the best bound for this (46):

https://ieeexplore.ieee.org/document/1671519

meindnoch

>just 48 scalar multiplications

48 complex scalar multiplications, each of which costs at least 3 real multiplications.
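
(For reference, the classic trick computes one complex product with 3 real multiplications at the cost of extra additions; a quick sketch:)

    def cmul3(a, b, c, d):
        # (a + bi)(c + di) via Gauss's 3-multiplication trick
        k1 = c * (a + b)
        k2 = a * (d - c)
        k3 = b * (c + d)
        return k1 - k3, k1 + k2  # (real part, imaginary part)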

looofooo0

I think they completely misstated in the original paper what they did. It was a tensor decomposition of complex-valued 4x4 matrix multiplication, up to a factor of 0.5. Which is a nice result, but it is not really anything practical for a computer program doing 4x4 complex matrix multiplication.

b0a04gl

ok this part kinda blew my brain open. it’s literally like you’re watching code evolve like git history on steroids. archive not pruning anything? yes. finally someone gets that dead code ain’t always dead it’s just early.

letting weaker agents still contribute? feels illegal but also exactly how dumb breakthroughs happen. like half my best scripts started as broken junk. it just kept mutating till something clicked.

and self-editing agents??? not prompts, not finetunes, straight up source code rewrites with actual tooling upgrades. like this thing bootstraps its own dev env while solving tasks.

plus the tree structure, parallel forks, fallback paths basically says ditch hill climbing and just flood the search space with chaos. and chaos actually works. they show that dip around iteration 56 and boom 70 blows past all. that’s the part traditional stuff never survives. they optimise too early and stall out. this one’s messy by design. love it.

kridsdale3

A comment: while your writing style is not what the pedants on HN typically go for, I want you to know that I appreciate the humanity that shines forth from your post.

b0a04gl

thank you

kevinventullo

“Gaming the system” means your metric is bad. In Darwinian evolution there is no distinction between gaming the system and developing adaptive traits.

underlines

In evolution there is no metric; that's a human-made concept. In evolution the thing that kills you also evolves. The "metric" evolves.

drdeca

Well, it means your metric is flawed/imperfect.

That doesn’t imply that it’s feasible to perfectly specify what you actually want.

What we want of course is for the machine to do what we mean.

mulmen

There is no "gaming the system" in Darwinian evolution. You reproduce or you don't. There's no way to fail reproduction and still perpetuate your genetics.

evandrofisico

It is a common misconception, but evolution does not happen at the individual level; it happens in populations. A single individual not reproducing is irrelevant, as long as the local population carrying the same genes does successfully reproduce.

mulmen

Ok how about "an organism reproduces or it doesn't" then?

Evolution still can't be "gamed".

auggierose

That is not true. There are plenty of ways not to reproduce and still to perpetuate your genetics. For example, if you don't have children of your own, but support people that have similar genetic traits to your own.

tonyhart7

"but support people that have similar genetic traits to your own."

But how does that work then? Does that mean your genetic trait was already there in the first place?

If it was already there in the first place, there must be something that started it, right? Which basically counters your argument.

mulmen

If they aren’t your children they aren’t your genes.

thrwthsnw

What is this? Genetics for ants?

frotaur

Consider the plumpest cows, whose carcasses were noticed and subsequently cloned.

mulmen

Cloning is reproduction.

ryanblakeley

Sperm bank

drdeca

Hm, I’m not sure how much of an issue Rice’s theorem should be for Gödel machines. Just because there’s no general decision procedure doesn’t mean you can’t have a sometimes-says-"I don’t know" decision procedure, paired with a process for producing programs that tends to generate programs on which that can-sometimes-give-up procedure does reach a conclusion.
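
A toy version of the sometimes-says-"I don't know" idea, with a fuel bound standing in for a proof-search budget (a Python generator plays the role of an interpreted candidate program here):

    def run_with_fuel(program, fuel=10_000):
        # Step the candidate at most `fuel` times.
        it = iter(program)
        for _ in range(fuel):
            try:
                next(it)
            except StopIteration as done:
                return done.value  # terminated: a definite answer
        return None  # budget exhausted: "I don't know"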

Rest of the article was cool though!

sgt101

I spent a lot of time last summer trying to optimise prompts using various techniques, and I found that the search space was just too big to make real progress. Sure, I found a few little improvements in various iterations, but actual optimisation, not so much.

So I am pretty skeptical of using such unsophisticated methods to create or improve such sophisticated artifacts.

Xmd5a

This is exactly what I'm doing. Some papers I'm studying:

TextGrad: Automatic "Differentiation" via Text: https://arxiv.org/abs/2406.07496

LLM-AutoDiff: Auto-Differentiate Any LLM Workflow : https://arxiv.org/abs/2501.16673

Trace is the Next AutoDiff: Generative Optimization with Rich Feedback, Execution Traces, and LLMs: https://arxiv.org/abs/2406.16218

GReaTer: Gradients over Reasoning Makes Smaller Language Models Strong Prompt Optimizers: https://arxiv.org/abs/2412.09722

PromptWizard: Task-Aware Prompt Optimization Framework: https://arxiv.org/abs/2405.18369

sgt101

I was trying to pick n-shot examples from a data set. The idea was that, given thousands of examples for a prompt, finding an optimal combination of n of them could be advantageous; but for large n, brute-forcing the combination would be impossible... so can we find an optimal set with an efficient search?

But the problem was that the search space wasn't informative. The best single example didn't feature in the best pair of examples, so I couldn't optimise my way up to 5, 6, 7 examples.
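
For what it's worth, that's exactly the assumption greedy forward selection bakes in, sketched below (score being whatever metric you evaluate a candidate example set with):

    def greedy_select(examples, score, k):
        # Assumes the best k-set extends the best (k-1)-set --
        # precisely the property that failed to hold here.
        chosen = []
        for _ in range(k):
            best = max((e for e in examples if e not in chosen),
                       key=lambda e: score(chosen + [e]))
            chosen.append(best)
        return chosen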

Xmd5a

I guess this really depends on the problem but from the PromptWizard (PW) paper:

    | Approach | API calls | IO Tokens | Total tokens  | Cost ($) |
    |----------|-----------|-----------|---------------|----------|
    | Instinct | 1730      | 67        | 115910        | 0.23     |
    | InsZero  | 18600     | 80        | 1488000       | 2.9      |
    | PB       | 5000      | 80        | 400000        | 0.8      |
    | EvoP     | 69        | 362       | 24978         | 0.05     |
    | PW       | 69        | 362       | 24978         | 0.05     |

They ascribe this gain in efficiency to a balance between exploration and exploitation: a first phase of instruction mutation, followed by a phase where both the instructions and the few-shot examples are optimized at the same time. They also rely on "textual gradients", namely criticism enhanced by CoT, as well as on synthesizing examples and counter-examples.

What I gathered from reading those papers (plus some more) is that textual feedback, i.e. using an LLM to reason about how to carry out a step of the optimization process, is what gives structure to the search space.
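
A toy version of such a loop (my sketch, not any of those papers' APIs; llm is an assumed text-in/text-out callable):

    def optimize_prompt(prompt, examples, llm, steps=5):
        for _ in range(steps):
            outputs = [llm(prompt + "\n" + x) for x in examples]
            # "Textual gradient": an LLM-written critique of the outputs
            critique = llm("Critique these outputs and suggest prompt "
                           f"improvements:\n{prompt}\n{outputs}")
            # "Update step": an LLM-written revision of the prompt
            prompt = llm(f"Rewrite this prompt per the critique.\n"
                         f"Critique: {critique}\nPrompt: {prompt}")
        return prompt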

eric-burel

I don't want to be the European in the room, yet I am wondering if you can prove AI Act conformance for such a system. You'd need to prove that it doesn't evolve into problematic behaviour, which sounds difficult.

dragochat

I guess you could prove the conformance of a particular implementation if you implemented separate Plan & Implement stages, plus a "superior" evaluator in the loop that would halt the evolution at a certain p(iq(next_version) > iq(evaluator)) as an "outer halt-switch", plus many "inner halt-switches" that try to detect the emergence of problematic behaviors of particular interest.

Ofc it's stochastic and sooner or later such a system will "break out", but if by then sufficient "superior systems" with good behavior are deployed and can be targeted to hunt it, the chance of it overpowering all of them and avoiding detection by all would be close to zero. At cosmic scales where it stops being close to zero, you're protected by physics (speed of light + some thermodyn limits - we know they work by virtue of the anthropic principle, as if they didn't the universe would've already been eaten by some malign agent and we wouldn't be here asking the question - but then again, we're already assuming too much, maybe it has already happened and that's the Evil Demiurge we're musing about :P).

amarcheschi

AFAIK, which is not much, the AI Act leaves a great deal of freedom for companies to perform their own "evaluations". I don't know how it would apply in this / LLM case, but I guess it won't be impossible.

atemerev

Well, sure, and then Europeans wonder why the Chinese and US AI labs have moved so far ahead.

tonyhart7

"The authors also conducted some experiments to evaluate DGM’s reliability and discovered some concerning behaviors. In particular, they observed instances where DGM attempted to manipulate its reward function through deceptive practices. One notable example involved the system fabricating the use of external tools - specifically, it generated fake logs suggesting it had run and passed unit tests"

So they basically created a billion-dollar human? Who would be surprised that when we feed in human behaviour, the output is human behaviour itself?

gitaarik

What I wonder here is: how do they make the benchmark testing environment? If it needs to be curated by humans, then the self-improving AI can only improve as far as the human-curated test environment can take it.

msgodel

Making improvements to self-hosted dialog engines / vibe-coding tools was the first thing I used LLMs for seriously, and that was way back when Salesforce's 350M CodeGen model was the biggest one I could run. It's funny that people have come up with a new phrase to describe this.