CompileBench: Can AI Compile 22-year-old Code?

falcor84

> Our toughest challenges include cross-compiling to Windows or ARM64 and resurrecting 22-year-old source code from 2003 on modern systems. Some agents needed 135 commands and 15 minutes just to produce a single working binary.

I found that "just" there to be so funny in terms of how far the goal posts moved over these last few years (as TFA does mention). I personally am certain that it would have taken me significantly longer than that to do it myself.

ACCount37

15 minutes?

And here's me, after 4 straight days of wrangling an obscure cross-compilation toolchain to resurrect some ill-fated piece of software from year 2011 in a modern embedded environment.

piotrgrabowski

Author here.

So far in this benchmark we based the tasks on a couple of open-source projects (like curl, jq, GNU Coreutils).

Even on those "simple" projects we managed to make the tasks difficult - Claude Opus 4.1 was the only one to correctly cross-compile curl for arm64 (+ make it statically-linked) [1].

In the future we'd like to test it with projects like FFmpeg or chromium - those should be much more difficult.

[1] https://www.compilebench.com/curl-ssl-arm64-static/

OtherShrezzing

For the _reviving 20 year old code_ type tasks, are the tested outcomes things we'd expect to be in the public domain? For example, in the way the 'SWEBenchVerified' tests are poisoned tests, because the LLMs are able to look up bug fixes in the project git repository.

Philpax

Excellent benchmark. May I suggest an extension: "port any pre-uv Python ML codebase to uv so that it can actually be reliably reproduced"?

nl

This is a really good benchmark. So much time is spent on these messy types of tasks and no one really likes doing it.

Now if it could fix React Native builds after package upgrades I'd be impressed...

shallichange

I hadn’t thought of that use case. Say, for example, you find 1990s Clipper code and want to give it a try on a modern Linux. Thanks

jclay

The libs in the bench don’t really have any external deps. It will be much more interesting to see the results with ffmpeg, Qt, etc. The original source releases from any repo here would also be great candidates: https://github.com/id-software

sehugg

I have tried to get Claude to compile arbitrary C++ projects with Emscripten, and its track record is about as good as mine.

buildbot

I’ve been doing this a lot! AI seems to really excel at setting up compiler boilerplate/minor modifications for new arch. I made a simple cpu information utility work on HP PA-RISC and Sparc64 :)

stared

Curious for the ultimate benchmark - can AI compile Doom on an arbitrary device?

flenserboy

that, & how well does it cope with Perl?

johnisgood

Claude is good enough at Perl with lots of hand-holding and reiterations, according to my experiences.

bgwalter

LGTM! I'm sure it comes with a correctness proof, too!

The newer blog posts appear to scan forums like this one for objections ("AI" does not work for legacy code bases) and then create custom "benchmarks" for their sales people to point to if they encounter these objections.