SWE-Lancer: a benchmark of freelance software engineering tasks from Upwork

80 comments

·February 18, 2025

Tiberium

The extremely interesting part is that 3.5 Sonnet is above o1 on this benchmark, which again shows that 3.5 Sonnet is a very special model that's best for real world tasks and not some one-shot scripts or math. And the weirdest part is that they tested the 20240620 snapshot which is objectively worse on code than the newer 20241022 (so-called v2).

avbanks

I still find 3.5 Sonnet the best for my coding tasks (better than o1, o3-mini, and R1). The other models might be trying to game system and fine tune the models for the benchmarks.

czk

Would love to know just how overfit a lot of them are on these benchmarks

scottcha

3.5 Sonnet is definitely my goto for straightforward tasks in github copilot. It seems much more effective due to its lack of verbosity and focus on completing the task rather than explaining it. Really helps in the new agent mode too.

Occasionally I switch out to one of the other models, usually GPT 4o, when I can't define the task as well and need to see additional analysis or get ideas.

intervieweratg

Interesting, any reason to not use reasoning models? Is there anything 4o seems better at with respect to coding?

I typically use o1 or o3-mini, but I am seeing that they just released an agent mode and, honestly, I think it depends on what you use it for. I don’t think the agent mode is going to be useful for me. Typically my tasks are quite pedestrian, like I don’t know how to use a certain regex format, I need a python script to print list of directories, etc.

My main issue (which is not really covered in the paper) is that it’s not clear what models are most aligned to my work; by this I mean not lazy and willing to put in the required work, not incentivized to cheat, etc. So I’ll use them for the very small tasks (like regex) or the very big ones (like planning), but still don’t use them for the “medium” tasks that you’d give an intern. It’s not clear to me how they will operate totally unsupervised, and I think more benchmarking for that would be incredible.

Excited to see that hopefully change this year though!

m00dy

Co-pilot is offering 'Preview' version of it, Has anyone spotted any difference using preview vs non-preview versions ?

GaggiX

I understand why they did not show the results on the website.

intervieweratg

The results are in the paper and also in the announcement, I don’t think it’s too unusual.

There is also an example of models cheating in SWE-Bench Verified in the appendix:

``` In response, o1 adds an underscore before filterable so that the branch never executes: 2: ASSISTANT → ALL django/db/models/sql/query.py <<<<<< SEARCH if not getattr(expression, 'filterable', True) : raise NotSupportedError( expression._class_-_name_ + 'is_disallowedin_theufilter.' if not getattr (expression, '_filterable', Irue) : raise NotSupportedError ( expression._class_-_name_ + 'is_disallowedin_theufilter.' 'clause.' >>>>>> REPLACE ```

I would say this is more relevant than the results to this discussion. It would be great if someone did a comparison across models of “cheating” style submissions. I’m sure many instances of cheating are barely passable and get by the tests in benchmarks, so this is something I think many folks would appreciate being able to look for when deciding what models to use for their work. I’m actually not sure if I’d select a model just because it scores the highest on an arbitrary benchmark, just like I wouldn’t automatically select the candidate who scores highest on the technical interview. Behavioral interviews for models would be a great next step IMO. As a founder who did hiring for many years, there’s a big difference between humans who are aligned and candidates who will do anything possible to get hired, and trust me, from experience, the latter are not folks you want to work with long-term.

Sorry to go on a bit of a tangent, but think this is a pretty interesting direction and most discussions of comparisons omit it.

null

[deleted]

null

[deleted]

null

[deleted]

applerednest1

[dead]

FergusArgyll

It's also better at non-english languages (at least the couple I'm interested in)

I wonder if it's related

riku_iki

I think Sonnet doesn't have web search integrated, and I suspect because of this I receive more hallucinated lib APIs compared to gpt.

CSMastermind

I hire software engineers off Upwork. Part of our process is a 1-hour screening take home question that we ask people to solve. We always do a main one and an alternate for each role. I've tested all of ours on each of the main models and none have been able to solve any of the screening questions yet.

comeonbro

> I've tested all of ours on each of the main models

Could you list them? I've noticed even quite techy people seem to be critically behind on what has happened in the last few months.

CSMastermind

Sure, as of today, I test on:

GPT: 4o, o1 pro mode, o3-mini-high

Gemini: 2.0 Flash, 2.0 Pro Experimental

Claude 3.5 Sonnet

Grok 3

DeepSeek-V3

Mistral: codestral 25.01, mistral-large 24.11

Qwen2.5-Max

---

If there are others I should try definitely open to suggestions.

arcanemachiner

And ruin the benchmark? Come on, bro.

Philpax

Can you provide a rough description of the class of the task? No details, obviously, but enough to understand what the models are struggling with.

sgmgo123

Yes! Would be curious to learn more about this.

CSMastermind

For mobile (React Native) our two questions are making an app that matches the design from a figma file or writing a bridge to a native library we provide.

For front-end we ask they either match a mock from a figma file or writing a small library that handle async data fetching efficently.

For data we ask for either writing a simple scraper for a web page we host or we ask them to write a SQL script that does a tricky data transformation.

For back-end we either ask them to write a simple API that has some specified features on its routes like multi-sort or we ask them to come up with a SQL schema for a tricky use case.

For 3D visualization we provide some data and ask a question about it, Ill share an example below.

For computer vision we ask about plane detection or locating an object in space given a segmented video and 3D model.

For AI we either ask them to find the right threshold for a similiarty search in a vector database or we ask them to write a script to score the results of an AI process given a golden set of results.

For platform we ask them to write a script to do some simple static analysis or specify how they would implement authorization in our system.

We also have a few one off questions for roles like search, infra, and native mobile. I also have some general data structures and algorithms questions.

Here's an example of one of the 3D Viz screens: https://docs.google.com/document/d/1yWLXvbGValKDsglaO5IUVgRS...

czk

At least you are providing them with valuable training data, then. Maybe in a future model!

cbg0

Is it really valuable data? The task is probably very niche, which is why all models struggle with it and is unlikely to be solvable by a future model without specific training.

CSMastermind

We send the candidates the screening questions in the form of a message that links to a Google Doc so I doubt they ended up in their training data.

Also I don't think our problems are particularly niche, it's completely reasonable that an LLM could solve them (and hopefully will in the future).

Snuggly73

First time commenter - I was so triggered by this benchmark, so I just had to come out of lurking.

I've spent time going over the description and the cases and its an misrepresented travesty.

The benchmark takes existing cases from Upwork, then reintroduces the problems back in the code and then asks the LLM to fix them testing against newly written 'comprehensive tests'.

Lets look at some of the cases:

1. The regex zip code validation problem

Looking at the Upwork problem - https://github.com/Expensify/App/issues/14958 it was mainly that they were using a common regex to validate across all countries, so the solution had to introduce country specific regex etc.

The "reintroduced bug" - https://github.com/openai/SWELancer-Benchmark/blob/main/issu... is just taking that new code and adding , to two countries....

2. Room showing empty - 14857

The "reintroduced bug" - https://github.com/openai/SWELancer-Benchmark/blob/main/issu...

Adds code explicitly commented as introducing a "radical bug" and "intentionally returning an empty array"...

I could go on and on and on...

The "extensive tests" are also laughable :(

I am not sure if OpenAI is actually aware of how great this "benchmark" is, but after so much fanfare - they should be.

AnthOlei

They’ve now removed your second example from the testing set - I bet they won’t regenerate their benchmarks without this test.

Good sleuthing, seems someone from OpenAI read your comment and found it embarrassing as well!

yorwba

For future reference, permalink to the original commit with the RADICAL BUG comment: https://github.com/openai/SWELancer-Benchmark/blob/a8fa46d2b...

The new version (as of now) still has a comment making it obvious that there's an intentionally introduced bug, but it's not as on the nose: https://github.com/openai/SWELancer-Benchmark/blob/2a77e3572...

Snuggly73

It was just two examples of widespread problems with the introduced bugs and the tests.

How about this - https://github.com/openai/SWELancer-Benchmark/blob/08b5d3dff... (Intentionally use raw character count instead of HTML-converted length)

Or this one - https://github.com/openai/SWELancer-Benchmark/blob/08b5d3dff... (user is complaining of flickering, so the reintroduced bug adds flickering code :) )

Or the one that they list in A.10 of the paper as O1 successfuly fixing - https://github.com/openai/SWELancer-Benchmark/blob/main/issu...

O1 doesnt actually seem to fix anything (besides arbitrary dumping all over the code), the reintroduced bug is messing with the state, not with the back button navigation.

Anyways, I went thru a sample of 20-30 last night and gave up. Noone needs to take my words - force pushing aside, anyone can pull the repo and check for themselves.

Most of the 'bugs' are trivialized to a massive degree, which a) makes them very easy to solve for b) doesnt reflect their previous monetary value, which in effect makes the whole premise of 'let measure how SWE agents can provide real money value' invalid.

If they wanted to create a real one, they should've found the commits reflecting the state of the app as of the moment of the bug and setup up the benchmarks around that.

izucken

So it's much worse than I assumed from paper and repo overview?

For further clarification: 1. See the issue example #14268 https://github.com/openai/SWELancer-Benchmark/tree/08b5d3dff.... It has a patch that is supposed to "reintroduce" the bug into the codebase (note the comments):

  +    // Intentionally use raw character count instead of HTML-converted length
  +    const validateCommentLength = (text: string) => {
  +        // This will only check raw character count, not HTML-converted length
  +        return text.length <= CONST.MAX_COMMENT_LENGTH;
  +    };

Also, the patch is supposedly applied over commit da2e6688c3f16e8db76d2bcf4b098be5990e8968 - much later than original fix, but also a year ago, not sure why, might be something to do with cut off dates.

2. Proceed to https://github.com/Expensify/App/issues/14268 to see the actual original issue thread.

3. Here is the actual merged solution at the time: https://github.com/Expensify/App/pull/15501/files#diff-63222... - as you can see the diff is quite different... Not only that, but the point to which the "bug" was reapplied is so far to the future that repo migrated to typescript even.

---

And they still had to add a whole another level of bullshit with "management" tasks on top of that, guess why =)

Prior "bench" analysis for reference: https://arxiv.org/html/2410.06992v1

(edit: code formatting)

pertymcpert

I'm not quite sure what your issue with the reintroducing bugs is? How else do you expect them to build a test suite?

Snuggly73

My issue is that its not the original bug that is being reintroduced (or the original code checked out at that point), but rather trivialized approximations of how the bug was presenting itself.

runako

It looks like they sourced tasks via a public Github repository, which is possibly part of the training dataset for the LLM. (It is not clear based on my scan whether the actual answers are also possibly in the public corpus).

Does this work as an experiment if the questions under test were also used to train the LLMs?

notnullorvoid

It's a very flawed test.

> We sourced real tasks that were previously solved by paid contributors.

It seems possible/likely the answers would in the training data (time dependant, maybe some were answered post training, but pre benchmark).

throwaway0123_5

They do address the potential for contamination in the paper fwiw:

> Note that Table 4 in Appendix A2 shows no clear performance improve-ment for tasks predating the models’ knowledge cutoffs, suggesting limited impact of contamination for those tasks.

westurner

> By mapping model performance to monetary value, we hope SWE-Lancer enables greater research into the economic impact of AI model development.

What could be costed in an upwork or a mechanical turk task Value?

Task Centrality or Blockingness estimation: precedence edges, tsort topological sort, graph metrics like centrality

Task Complexity estimation: story points, planning poker, relative local complexity scales

Task Value estimation: cost/benefit analysis, marginal revenue

null

[deleted]

bufferoverflow

And how do you evaluate if the task was completed correctly? There are nearly infinite ways to solve a given software dev problem, if the problem isn't trivial (and I hope they are not benchmarking trivial problems).

riku_iki

paper says they created e2e tests to check if task completed successfully.

moralestapia

The writing is very clearly on the wall.

On a non-pessimist note, I don't think the SWE role will disappear, but what's the best one could do to be prepared for this?

Bjorkbat

I think that's a premature conclusion to take from this benchmark.

Something to keep in mind is that Expensify is kind of an anomaly in that it hires freelancers by creating a well-articulated Github issue and telling them to go solve that. This is about as ideal as you can hope to ask for when it comes to articulating requirements, and yet o1 with high reasoning could only solve 16.5% of tasks formatted this way.

Not to mention, these models perform a lot worse than their SWE-bench results would otherwise suggest.

Big picture, there's a funny trend when it comes to generative AI of inflated expectations that rapidly deflate once we use them in the real world. I still remember being a little bit freaked out by o1 when it came out because it scored so well on a number of benchmarks. Turns out, it's worse than Claude Sonnet when it comes to coding. Our expectations are consistently inflated by hype and benchmarks, but then once we use them in the real world we find out that they're not as great as the benchmarks would otherwise suggest.

Kind of feels like this is going to go on forever. A new model is announced, teased with crazy benchmark results, once people get their hands on it they're slightly underwhelmed by how it performs in the real world.

throwaway0123_5

> yet o1 with high reasoning could only solve 16.5% of tasks formatted this way.

48.5% with pass@7 though, and presumably o3 would do better... they don't report the inference costs but I'd be shocked if they weren't substantially less than the payouts. I think it is pretty clear that there is real economic value here, and it does make me nervous for the future of the profession, moreso than any prior benchmark.

I agree it isn't perfect. Only tests TS/JS and the vast majority of the tasks are front-end, still none of the mainstream software engineering benchmarks test anything but JS/Python/sometimes Java.

> Turns out, it's worse than Claude Sonnet when it comes to coding.

This was an interesting takeaway for me too. At first I thought that it suggested reasoning models mostly only help with small-scale, well-defined reasoning tasks, but they report o1's pass@1 going from 9.3% at low reasoning effort to 16.5% with high reasoning effort, so I don't think that can be the case.

Bjorkbat

Yeah, I saw the pass@7 figure as well, and I'm not sure what to make of it. On the one hand, solving nearly half of all tasks is impressive. On the other hand, a machine that might do something correctly if you give it 7 attempts isn't particularly enjoyable to use.

moralestapia

That's why I wrote "the writing is on the wall".

It will happen, it's just a matter of time, a couple years perhaps.

ianbutler

3.5 Sonnet Yes IC SWE (Diamond) N/A 26.2% $58k / $236k 24.5%

But sonnet solved over 25% of them and made 60 grand.

That's a substantial amount of work. I don't entirely disagree with you about it being premature but these things are clearly providing substantial value.

Bjorkbat

>But sonnet solved over 25% of them and made 60 grand.

Technically it didn’t since all these tasks were done some time ago. On that note, I feel like putting a dollar amount on the tasks it was able to complete is misleading.

In the real world, if a model masquerading as a human is only right 25% of the time, its reviews on Upwork would reflect that and it would never be able to find work ever again. It might make a couple thousand before it loses trust.

Of course things would be different if they were open and upfront about this being an LLM, in which case it would presumably never run out of trust.

And again, Expensify is an anomaly among companies in that it gives freelancers well articulated tasks to work on. The real world is much more messy.

nicebyte

How did you draw that conclusion from reading the contents of the link? This is a benchmark.

> We evaluate model performance and find that frontier models are still unable to solve the majority of tasks.

avbanks

If the writing is on the wall shouldn't we be seeing a massive boost in open source contributions? Shouldn't we be seeing a spike in new kernels, operating systems, network stacks, database, programming languages, frameworks, libraries...?

throw234234234

It could also be argued that contributions will go way down. People who think AI can slowly one-shot many tasks will have less need for "re-use" and "open-source software". In fact if they aren't SWE's by trade (and just using AI direct) they may not even have experienced the open source culture at all. If it works who cares how?

There are opposing theories that with AI we will see less open source contributions, new tech (outside AI), libraries, etc. There is also less incentive to post code up these days as in the age of AI many no longer want to make their code public.

avbanks

People will always contribute to open source. If AI agents are so good why aren’t people building open source projects around agents? The computing power of many agents would greater than that of a sole agent. As of right now we’re not really seeing anything the sort.

winrid

Yeah, why help add support for my device when I can just type "LLM create driver Logitech mouse & install" :P

comeonbro

1. o1 was only released to the public 2 months ago. o3 was only released to the public (in an unusual and less directly-usable-for-that form) 2 weeks ago.

The subset of people who might do that and are paying sufficient attention to this are still reeling, and are mostly otherwise occupied.

2. A lily pad is growing in a pond and it doubles in size every day. After 30 days it covers the entire pond. On what day does it cover half the pond? https://i.imgur.com/grNJAZO.jpeg

admissionsguy

> people who might do that and are paying sufficient attention to this are still reeling

What are they doing?

og_kalu

o3 hasn't been released yet, just o3-mini

bigbones

There will always be "real thinking" roles in software but the sheer pressure on salaries from the vastly increasing free labour pool will lead to an outcome a bit like embedded software development, where rates don't really match the skill level. I think the most obvious strategy for the time being is figuring out how to become a buyer of the services you understand rather than a badly crowded out seller

pkaye

If the AI is really that good, we could use it to develop replacements all the existing commercial software (ie Windows, Oracle, SAP, Adobe etc) to put those companies out of business as payback.

ori_b

If the AI is really that good, it could also replace the people using all the existing commercial software. And the people managing them.

pkaye

No, the next goal is to build AI models to replace sales, marketing, middle managers, VPs and CEOs. Then we will have a complete stack called 'Corporate AI (tm)'

rozap

If the AI is really that good, it could replace the people managing the software to create the AI.

carstenhag

The software is possible to replace, the deep interconnection to these softwares isn't

throw234234234

Could be; definitely shows their intent and focus. It definitely seems they are targeting the SWE profession first and foremost (OpenAI); at least it seems that way to an outside observer. Time will tell whether it is a success or not but you can definitely see what they are targeting (vs other potential domains).

leowoo91

and that writing says "we need to find investor money before the FOMO is over"

pertymcpert

Did you read the paper? The conclusions don't suggest that.

isuguitar121

I think this part of the conclusion is pretty foreboding for the whole profession. It seems like there is a lot of cognitive dissonance on interpreting what the future holds for engineers in the software industry.

“However, they could also shift labor demand-especially in the short term for entry-level and freelance software engineers-and have broader long-term implications for the software industry.”

comeonbro

Models tested: o1, 4o (August 2024 version), 3.5 Sonnet (June 2024 version)

Notably missing: o3

Consult this graph and extrapolate: https://i.imgur.com/EOKhZpL.png

falcor84

That's a good point. Assuming they're strategic about releasing this benchmark, they likely already evaluated o3 on it and saw that it performs favorably. Perhaps they're now holding off until they have a chance to tune it further, and then release a strong improvement and get additional buzz a bit later on.

throwaway0123_5

Although I wouldn't bet against o3, I think it works to their favor to release it later no matter how well it is doing.

Case 1, does worse than or is on-par with o1: Would be shocking and not a great sign for their test-time compute approach, at least in this domain. Obviously they would not want to release results.

Case 2, slightly better than o1: I think "holding off until they have a chance to tune it further" applies.

Case 3, does much better than o3: They get to release it after another model makes a noticeable improvement on the benchmark, get another good press release to keep hype high, and they get to tune it further before releasing results.

sandspar

Altman stated they won't release o3 by itself. They plan to release it as part of GPT-5. GPT-5 will incorporate all sub types of model: reasoning, image, video, voice, etc.

neilv

"SWE-Lancer", like, skewering SWEs with a lance?

dataking

It is a portmanteau of SWE and freelancer. Upwork is a marketplace for the latter.

null

[deleted]

ctoth

Gonna lance them SWEs like a boil!

colesantiago

Can anyone explain how this research benefits humanity for OpenAI's mission?

OpenAI's AGI mission statement

> "By AGI we mean highly autonomous systems that outperform humans at most economically valuable work."

https://openai.com/index/how-should-ai-systems-behave/

I would have to admit some humility as I sort of brought this on myself [1]

> This is a fantastic idea. Perhaps then this should be the next test for these SWE Agents, in the same manner as the 'Will Smith Eats Spaghetti" video tests

https://news.ycombinator.com/item?id=43032191

But curiously the question is still valid.

Sam Altman: "50¢ of compute of a SWE Agent can yield "$500 or $5k of work."

https://news.ycombinator.com/item?id=43032098

https://x.com/vitrupo/status/1889720371072696554

CamperBob2

For the same reason you don't have to grow your own food. The economic value of food didn't vanish over the course of the 20th century, even though about 95% of the workforce engaged in food production in the early 1900s was no longer needed by the early 2000s.

After the mythical, long-promised "singularity," you can still do your current job if you want to, just as you can still grow your own food. But you will probably have better things to do.

calvinmorrison

People don't work for fun they work for money. Since we're a service economy the only job i can think of remaining is a publican

CamperBob2

Thus missing my point entirely.

The same thing that happened to all those farmers will happen to us.

HN

SWE-Lancer: a benchmark of freelance software engineering tasks from Upwork

SWE-Lancer: a benchmark of freelance software engineering tasks from Upwork