A Research Preview of Codex

486 comments

May 16, 2025

johnjwang

Some engineers on my team at Assembled and I have been a part of the alpha test of Codex, and I'll say it's been quite impressive.

We’ve long used local agents like Cursor and Claude Code, so we didn’t expect too much. But Codex shines in a few areas:

Parallel task execution: You can batch dozens of small edits (refactors, tests, boilerplate) and run them concurrently without context juggling. It's super nice to run a bunch of tasks at the same time (something that's really hard to do in Cursor, Cline, etc.)

It kind of feels like a junior engineer on steroids: you just need to point it at a file or function, specify the change, and it scaffolds out most of a PR. You still need to do a lot of work to get it production-ready, but it's as if you have an infinite number of junior engineers at your disposal, all working on different things.

Model quality is good, but hard to say it's that much better than other models. In side-by-side tests with Cursor + Gemini 2.5-pro, naming, style and logic are relatively indistinguishable, so quality meets our bar but doesn’t yet exceed it.

hintymad

It looks like we are in this interesting cycle: millions of engineers contribute to open source on GitHub. The best of our minds use that code to develop powerful models to replace exactly these engineers. In fact, the more code a group contributes to GitHub, the easier it is for companies to replace that group. Case in point: frontend engineers have been impacted most so far.

Does this mean people will be less incentivized to contribute to open source as time goes by?

P.S. I think the current trend is a wake-up call to us software engineers. We thought we were doing highly creative work, but in reality we spend a lot of time doing the basic job of knowledge workers: retrieving knowledge and interpolating some basic and highly predictable variations. Unfortunately, the current AI is really good at replacing this type of work.

My optimistic view is that in the long term we will invent or expand into more interesting work, but I'm not sure how long we will have to wait. The current generation of software engineers may suffer from high supply and low demand for our profession for years to come.

lispisok

As much as I support community-developed software and "free as in freedom", "Open Source" got completely perverted into tricking people into working for free for others' huge financial benefit. Your comment is just one example of that.

For that reason all my silly little side projects are now in private repos. I don't care that the chance somebody builds a business around them is slim to none. Don't think putting a license on your code will protect you either. You'd have to know somebody is violating your license before you could even think about doing anything, and that's basically impossible if the code gets ripped into a private codebase and isn't obvious externally.

brookst

Protect you from what?

What harm is there to you if someone uses some of your code to build a business, as compared to not doing so? How are you worse off?

I've never understood this mentality. It seems very zero-sum and kind of antisocial. I've built a couple of businesses, and there's always economic or technical precedent. I honestly don't mind paying it forward if someone can benefit from side projects I enjoyed doing anyway.

Wowfunhappy

So let's say your side project improves your life by 5 happiness points. You have two options:

--- OPTION A - Keep your project private.

• You get five happiness points.

--- OPTION B - Make your project public.

• Other individuals may get a small number of happiness points.

• A megacorp might turn your project into a major product without compensating you and get a million happiness points.

• You get five happiness points.

----------

In either scenario, you still end up with five happiness points. If you release your code, other people may get even more happiness points than you, which isn't really fair. But you are no worse off, and you've increased humanity's total wealth of happiness points.

hintymad

> "Open Source" got completely perverted into tricking people to work for free for huge financial benefits for others

I'm quite conflicted on this assessment. On one hand, I wonder whether we would have a better job market if there weren't so many open-sourced systems. We might have had much slower growth, but that growth would have lasted a lot more years, which means we might have enjoyed our profession until retirement and beyond. On the other hand, open source did create large pies, right? Like the "big data" market, the ML market, the distributed systems market, etc. Like the millions of data scientists who could barely use Pandas and SciPy, or the hundreds of thousands of ML engineers who couldn't even be bothered to know what a positive semi-definite matrix is.

Interesting times.

Daishiman

> P.S., I think the current trend is a wakeup call to us software engineers. We thought we were doing highly creative work, but in reality we spend a lot of time doing the basic job of knowledge workers: retrieving knowledge and interpolating some basic and highly predictable variations. Unfortunately, the current AI is really good at replacing this type of work.

Most of the waking hours of most creative work involve this type of drudgery. Professional painters and designers spend most of their time replicating ideas that are already well fleshed out. Musicians spend most of their time rehearsing existing compositions.

There is a point to be made that these repetitive tasks are a prerequisite to come up with creative ideas.

rowanG077

I disagree. AI has shown itself to be most capable in what we consider creative jobs: music creation, voice acting, text/story writing, art creation, video creation, and more.

blibble

> Does this mean people will be less incentivized to contribute to open source as time goes by?

personally, I completely stopped 2 years ago

it's the same as the Stack Overflow problem: the incentive to contribute tends towards zero, at which point the plagiarism machine stops improving

tom_m

Same. It'll be interesting to see what happens when open source slows down enough and is maybe all but eliminated... how will things advance?

Once everyone uses AI, how will anything new come about?

AI could quite possibly slow down innovation and improvement.

SubiculumCode

Now do open science.

More generally, specialty knowledge is valuable. From now on, all employees will be monitored in order to replace them.

popcorncowboy

> From now on, all employees will be monitored in order to replace them.

This is going on a t-shirt.

mikepurvis

My pessimistic view is that we're liable to end up cutting off the pipeline into the industry. Similar to lawyers replacing clerks with bots, if senior engineers can now command bots rather than mentor new hires, where is the on-ramp? How does one actually gain enough experience to become a senior?

Or is all this a nothing-burger, since the new hires will just be commanding bots of their own, but on lower level tasks that they are qualified to supervise?

username223

> Does this mean people will be less incentivized to contribute to open source as time goes by?

Yes. I certainly don't intend to put any free code online until I can legally bar AI bros from using it without payment. As Mike Monteiro put it long ago, "F** you, pay me" (https://www.youtube.com/watch?v=jVkLVRt6c1U)

tom_m

People might also lace open source with Trojans and back doors. It's going to get interesting.

electrondood

> doing the basic job of knowledge workers

If you extrapolate and generalize further... what is at risk is any task that involves taking information input (text, audio, images, video, etc.), and applying it to create some information output or perform some action which is useful.

That's basically the definition of work. It's not just knowledge work, it's literally any work.

woah

> Parallel task execution: You can batch dozens of small edits (refactors, tests, boilerplate) and run them concurrently without context juggling. It's super nice to run a bunch of tasks at the same time (something that's really hard to do in Cursor, Cline, etc.)

> It kind of feels like a junior engineer on steroids, you just need to point it at a file or function, specify the change, and it scaffolds out most of a PR. You still need to do a lot of work to get it production ready, but it's as if you have an infinite number of junior engineers at your disposal now all working on different things.

What's the benefit of this? It sounds like it's just a gimmick for the "AI will replace programmers" headlines. In reality, LLMs complete their tasks within seconds, and the time consuming part is specifying the tasks and then reviewing and correcting them. What is the point of parallelizing the fastest part of the process?

johnjwang

In my experience, it still does take quite a bit of time (minutes) to run a task on these agentic LLMs (especially with the latest reasoning models), and in Cursor / Cline / other code editor versions of AI, it's enough time for you to get distracted, lose context, and start working on another task.

So the benefit is really that during this "down" time, you can do multiple useful things in parallel. Previously, our engineers were waiting on the Cursor agent to finish, but the parallelization means you're explicitly turning your brain off of one task and moving on to a different task.

woah

In my experience in Cursor with Claude 3.5 and Gemini 2.5, if an agent has run for more than a minute it has usually lost the plot. Maybe the model used in Codex is a new breed?

kfajdsl

A single response can take a few seconds, but tasks with agentic flows can be dozens of back and forths. I've had a fairly complicated Roo Code task take 10 minutes (multiple subtasks).

tom_m

How much of it did you read? Haha. That's not anything against you, I'm just pointing out to people that there will be a bunch of folks out there who will never care to read and learn. They just want to mash all the buttons until it works.

When I was a kid, that worked with Nintendo games sure...but I like to think I've matured beyond that...but I haven't read every little thing returned by the LLM in Roo Code myself, so maybe it's human nature.

ctoth

> Each task is processed independently in a separate, isolated environment preloaded with your codebase. Codex can read and edit files, as well as run commands including test harnesses, linters, and type checkers. Task completion typically takes between 1 and 30 minutes, depending on complexity, and you can monitor Codex’s progress in real time.

fourside

> You still need to do a lot of work to get it production ready, but it's as if you have an infinite number of junior engineers at your disposal now all working on different things.

One issue with junior devs is that because they're not fully autonomous, you have to spend a non-trivial amount of time guiding them and reviewing their code. Even if I had easy access to a lot of them, that overhead would pretty quickly become the bottleneck.

Do you think that managing a lot of these virtual devs could get overwhelming, or are they pretty autonomous?

fabrice_d

They wrote "You still need to do a lot of work to get it production ready". So I would say it's not much better than real colleagues. Especially since junior devs will improve to the point that they don't need your hand-holding (remember, you were also a junior at some point), which is not proven to happen with AI tools.

bmcahren

Counter-point A: AI coding assistance tools are rapidly advancing, at a clip that is inarguably faster than humans improve.

Counter-point B: AI does not get tired, does not need space, does not need catering to their experience. AI is fine being interrupted and redirected. AI is fine spending two days on something that gets overwritten and thrown away (no morale loss).

tom_m

You also have to provide accurate instructions.

I find that most often, "bugs" aren't about writing code that doesn't compile or doesn't have passing tests. The "bugs" come from not understanding the requirements and what it is you're building.

I'm not entirely sure AI will help this at all. People are generally bad at describing software and how they want it to work. They are inaccurate, or they entirely omit things from the requirements.

Yes, though, it would be overwhelming to manage a bunch of AI agents. Context switching and redirecting, guiding, will be very difficult and not everyone's cup of tea.

I'd argue this isn't really a result of AI, though. Many people are already in this boat today. The industry is set up this way with contractors and outsourced devs at a junior level... because of the attraction of cheap labor. Many businesses are attracted to this beyond programming. One of the questions is going to be: is the cost-per-token economics cheaper? As long as it's cheaper, AI coding agents will have a future. If it proves not to be cheaper (and this could take years to prove out), then I don't think it'll be as popular. I think people will need to go back to the drawing board on how we use AI agents, or use AI for other purposes (like training, education, developer onboarding, code reviews, debugging, etc.)

rfoo

You don't need to be nice to your virtual junior devs. Saves quite a lot of time too.

As long as I spend less time reviewing and guiding than doing it myself, it's a win for me. I don't have any fun doing these things and I'd rather yell at a bunch of "agents". For those who enjoy doing a bunch of small edits, I guess it's the opposite.

HappMacDonald

I'm definitely wary of the concept of dismissing courtesy when working with AI agents, because I certainly don't want to lose that habit when I turn around and have to interact with humans again.

quantumHazer

CTO of an AI agents company (which has worked with AI labs) says agents work fine. Nothing new under the sun.

Jimmc414

> We’ve long used local agents like Cursor and Claude Code, so we didn’t expect too much.

If you don't mind, what were the strengths and limitations of Claude Code compared to Codex? You mentioned parallel task execution being a standout feature for Codex - was this a particular pain point with Claude Code? Any other insights on how Claude Code performed for your team would be valuable. We are pleased with Claude Code at the moment and were a bit underwhelmed by the comparable Codex CLI tool OAI released earlier this month.

t_a_mm_acq

After realizing CC can operate on the same code base and file tree in different terminal instances, it's been a significant unlock for us. Most devs have 3 running concurrently: 1. master task list + checks for completion on tasks; 2. operating on the current task + documentation; 3. side quests, bugs, additional context.

Rinse and repeat: once a task is done, update #1 and cycle again. Add another CC window if you need more tasks running concurrently.

The downside is cost, but if that's not an issue, it's great for getting stuff done across distributed teams.

naiv

Do you then have instances 2 and 3 listening to instance 1 with just a prompt? Or how does this work?

NewEntryHN

The advantage of Cursor is the reduced feedback loop where you watch it live and can intervene at any moment to steer it in the right direction. Is Codex such a superior model that it makes sense to take the direction of a mostly background agent, on which you seemingly have a longer feedback loop?

scragz

It sounds like their approach is to launch 5 with the same task and hope one works it out.

_bin_

I believe cursor now supports parallel tasks, no? I haven't done much with it personally but I have buddies who have.

If you want one idiot's perspective, please hyper-focus on model quality. The barrier right now is not tooling, it's the fact that models are not good enough for a large amount of work. More importantly, they're still closer to interns than junior devs: you must give them a ton of guidance, constant feedback, and a very stern eye for them to do even pretty simple tasks.

I'd like to see something with an o1-preview/pro level of quality that isn't insanely expensive, particularly since a lot of programming isn't about syntax (which most SotA models have down pat) but about understanding the underlying concepts, an area in which they remain weak.

At this point I really don't care if the tooling sucks. Just give me really, really good models that don't cost a kidney.

runako

> Parallel task execution: You can batch dozens of small edits (refactors, tests, boilerplate) and run them concurrently without context juggling.

This is also part of a recent update to Zed. I typically use Zed with my own Claude API key.

ai-christianson

Is Zed managing the containerized dev environments, or creating multiple worktrees or anything like that? Or are they all sharing the same work tree?

runako

As far as I know, they are sharing a single work tree. So I suppose that could get messy by default.

That said, it might be possible to tell each agent to create a branch and do work there? I haven't tried that.

I haven't seen anything about Zed using containers, but again you might be able to tell each agent to use some container tooling you have in place since it can run commands if you give it permission.

nadis

In the preview video, I appreciated Katy Shi's comment on "I think this is a reflection of where engineering work has moved over the past where a lot of my time now is spent reviewing code rather than writing it."

Preview video from Open AI: https://www.youtube.com/watch?v=hhdpnbfH6NU&t=878s

As I think about what "AI-native" development or just the future of building software looks like, it's interesting to me that - right now - developers are still just reading code and tests rather than looking at simulations.

While a new(ish) concept for software development, simulations could provide a wider range of outcomes and, especially for the front end, are far easier to evaluate than just code/tests alone. I'm biased because this is something I've been exploring but it really hit me over the head looking at the Codex launch materials.

klabb3

> a lot of my time now is spent reviewing code rather than writing it.

Reviewing has never been a panacea. It's a best effort at catching obvious mistakes, like a second opinion. Only with highly rigorous tests can reviewing give me confidence as high as what I place in another engineer or in myself. Generally, the cadence of code output has never been a bottleneck for me; rather the opposite (if I had more time I'd write you a shorter letter).

Most importantly, writing code that is testable on meaningful boundaries is an extremely difficult and delicate art form, which ime is something you really want to get right if possible. Not saying an AI can or can’t do that, only that it’s the hardest part. An army of automated junior engineers still can’t win over the complexity beast that yolo programming causes. At some point code mutations will cause more problems as side effects than what they fix.

nadis

> An army of automated junior engineers still can’t win over the complexity beast that yolo programming causes. At some point code mutations will cause more problems as side effects than what they fix.

This resonates a lot for me, completely agreed.

fosterfriends

++ Kind of my whole thesis with Graphite. As more code gets AI-generated, the weight shifts to review, testing, and integration. Even as someone helping build AI code reviewers, I think we'll _need_ humans stamping forever - for many reasons, but fundamentally for accountability. A computer can never be held accountable.

https://constelisvoss.com/pages/a-computer-can-never-be-held...

hintymad

> A computer can never be held accountable

I think the issue is not about humans being entirely replaced. Instead, the issue is that if AI replaces a large enough number of knowledge workers while there's no new or expanded market to absorb the workforce, the new balance of supply and demand will mean that many of us will see suppressed pay or, worse, lose our jobs forever.

TeMPOraL

That is true regardless of whether there is or isn't a "new or expanded market to absorb the workforce".

It's a crucial insight that's usually missed or elided in discussions about automation and the workforce: unless you're literally at the beginning of your career, losing your career to automation screws you over big time, forever. At best, you'll have to downsize your entire lifestyle, and that of your family, to be commensurate with your now entry-level pay. If you're halfway through the career that suddenly ended, you won't recover.

All the new jobs and markets are for the kids. Mind you, not your kids - your kids are going to be disadvantaged by their household being suddenly thrown into financial insecurity or downright poverty, and may not even get a chance to start a good career path with their peers.

That, not "anti technology sentiment", is why Luddites smashed the looms. Those were people who got rug-pulled by business decisions and thrown into poverty, along with their families and communities.

nadis

> A computer can never be held accountable

I feel like I've been thinking along similar lines recently (I'm due to re-read this though!), but instead of "computer" I'm replacing it with "AI" or "agents" these days. The same point holds true.

csomar

> I think this is a reflection of where engineering work has moved over the past where a lot of my time now is spent reviewing code rather than writing it.

This was always true. Front-end code is not really code. Most back-end code is just converting and moving data around. For most functionality where you need "real code", like crypto, compression, math, etc., you use a library used by another 100k developers.

sagarpatil

Re: simulation, Deebo does this for debugging: https://github.com/snagasuri/deebo-prototype

nadis

Thanks for sharing - wasn't familiar with Deebo!

ai-christianson

> rather than looking at simulations

You mean like automated test suites?

tough

automated visual fuzzy-testing with some self-reinforcement loops

There are already libraries for QA testing, and VLMs can give critique on a series of screenshots automated by a Playwright script per branch.
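
A minimal sketch of that kind of loop, assuming Playwright's Python API for the screenshot step and an OpenAI-style vision model for the critique (the preview URL, model name, and prompt are placeholders, not anything from the Codex launch):

  # Sketch: screenshot a branch's preview deployment with Playwright, then ask a
  # vision model to critique it. PREVIEW_URL and the model name are placeholders.
  import base64
  from openai import OpenAI
  from playwright.sync_api import sync_playwright

  PREVIEW_URL = "http://localhost:3000"  # hypothetical per-branch preview

  def capture_screenshot(url: str, path: str = "page.png") -> str:
      with sync_playwright() as p:
          browser = p.chromium.launch()
          page = browser.new_page()
          page.goto(url)
          page.screenshot(path=path, full_page=True)
          browser.close()
      return path

  def critique(path: str) -> str:
      b64 = base64.b64encode(open(path, "rb").read()).decode()
      resp = OpenAI().chat.completions.create(
          model="gpt-4o-mini",  # any vision-capable model
          messages=[{
              "role": "user",
              "content": [
                  {"type": "text", "text": "Critique this page for layout bugs and visual regressions."},
                  {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
              ],
          }],
      )
      return resp.choices[0].message.content

  print(critique(capture_screenshot(PREVIEW_URL)))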

ai-christianson

Cool. Putting vision in the loop is a great idea.

Ambitious idea, but I like it.

ofirpress

[I'm one of the co-creators of SWE-bench] The team managed to improve on the already very strong o3 results on SWE-bench, but it's interesting that we're just seeing an improvement of a few percentage points. I wonder if getting to 85% from 75% on Verified is going to take as long as it took to get from 20% to 75%.

Snuggly73

I can be completely off base, but it feels to me like benchmaxxing is going on with swe-bench.

Look at the results from multi swe bench - https://multi-swe-bench.github.io/#/

swe polybench - https://amazon-science.github.io/SWE-PolyBench/

Kotlin bench - https://firebender.com/leaderboard

Bjorkbat

I kind of had the feeling LLMs would be better at Python vs other languages, but wow, the difference on Multi SWE is pretty crazy.

kristianp

Maybe a lot of the difference we see between people's comments about how useful AI is for their coding is a function of what language they're using. Python coders may love it; Go coders, not so much.

ofirpress

Not sure what you mean by benchmaxxing but we think there's still a lot of useful signals you can infer from SWE-bench-style benchmarking.

We also have SWE-bench Multimodal which adds a twist I haven't seen elsewhere: https://www.swebench.com/multimodal.html

Snuggly73

I mean that there is the possibility that swe bench is being specifically targeted for training and the results may not reflect real world performance.

mr_north_london

How long did it take to go from 20% to 75%?

blixt

They mentioned "microVM" in the live stream. Notably there's no browser or internet access. It makes sense: running specialized Firecracker/Unikraft/etc. microVMs and unikernels is way faster and cheaper, so you can scale it up. But there will be a big technical scalability jump from this to "agents with their own computers". ChatGPT Operator already has a browser, so they can definitely do this, but I imagine the demand is orders of magnitude different.

There must be room for a Modal/Cloudflare/etc infrastructure company that focuses only on providing full-fledged computer environments specifically for AI with forking/snapshotting (pause/resume), screen access, human-in-the-loop support, and so forth, and it would be very lucrative. We have browser-use, etc, but they don't (yet) capture the whole flow.

thundergolfer

It's not our only focus at Modal but it's a big focus![1] Code agents are the killer use case for LLMs right now, and this complements our GPU inference and training capabilities.

I'm quietly betting that agents increase the leverage of deterministic, reproducible devbox tech (eg. Nix, lockfiles, package mirroring), and this will end up being a huge win for us human engineers too.

1. https://modal.com/use-cases/sandboxes

ushakov

we offer this with E2B Desktop

Demo: https://surf.e2b.dev

SDK: https://github.com/e2b-dev/desktop

ionwake

I'm sorry if I'm being silly, but I have paid for the Pro version ($200 a month), and every time I click on Try Codex, it takes me to a pricing page with the "Team Plan": https://chatgpt.com/codex#pricing.

Is this still rolling out? I don't need the Team plan too, do I?

I have been using OpenAI products for years now and I am keen to try this, but I have no idea what I am doing wrong.

throwaway314155

They do this with every major release. Never going to understand why.

mr_north_london

It's still rolling out

ionwake

Thanks for the reply, I'm in London too (atm).

jdee

I'm the same, and it appeared for me 2 mins ago. Looks like it's still rolling out.

ionwake

Cool, it appeared - I was just worried it was a payment issue. Thanks guys.

solresol

I'm not sure what's wrong with me, but I just wasted several hours wrestling codex to make it behave.

Here's my workflow that keeps failing:

- It writes some code. It looks good at first glance.
- I push it to GitHub.
- Automated tests on GitHub show there's a problem.
- I go back to Codex and ask it to fix it.
- It does stuff. It looks good again.

Now what do I do? If I ask it to push again to GitHub, it will often create a pull request that doesn't include the stuff from the first pull request; it's not a pull request that stacks on top of the previous one, it's a pull request that stacks on top of main.

When asked to write something that called out to gpt-4.1-mini, it used openai.ChatCompletion.create (!?!!?)

I just found myself using Claude to fix Codex's mistakes.
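
For reference, openai.ChatCompletion.create is the legacy pre-1.0 interface of the OpenAI Python SDK; current versions go through a client object. A minimal sketch of the kind of call you'd expect instead (the prompt is just a placeholder):

  # Current OpenAI Python SDK (v1+): calls go through a client object instead of
  # the removed module-level openai.ChatCompletion.create.
  from openai import OpenAI

  client = OpenAI()  # reads OPENAI_API_KEY from the environment

  resp = client.chat.completions.create(
      model="gpt-4.1-mini",
      messages=[{"role": "user", "content": "Say hello"}],
  )
  print(resp.choices[0].message.content)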

fcoury

I upgraded to Pro just because of Codex and I am really not impressed. Granted, I am using Rust, so that may be the issue (or a skill issue on my end).

One of the things I am constantly struggling with is that the containers they use have issues fetching anything from the internet:

  error: failed to get `anyhow` as a dependency of package `yawl-core v0.1.0 (/workspace/yawl/core)`

  Caused by:
    download of config.json failed

  Caused by:
    failed to download from `https://index.crates.io/config.json`

  Caused by:
    [7] Could not connect to server (Failed to connect to proxy port 8080 after 3065 ms: Could not connect to server)

Hopefully they fix this and it gets better with time, but I am not going to renew past this month otherwise.

hmottestad

You can specify a startup script for your environment in the Edit -> advanced section. The code placed there will run before they cut off internet access. Also worth noting that it uses a proxy stored in $http_proxy.

Took me a few hours today to figure out how to install Maven and have it download all the dependencies. Spent an hour trying to figure out why sudo apt-get update was failing; it was because I was using sudo!

bargainbin

I have this issue with Devin. Given my limited knowledge of how these work, I believe there is simply too much context for it to take a holistic view of the task and finish accordingly.

If both OpenAI and Devin are falling into the same pattern then that’s a good indication there’s a fundamental problem to be solved here.

csomar

I think you need to run the tests locally before you push the PR. I actually think you need to (somehow?) make this part of the generation process before Codex proposes the changes.

alvis

I used to work for a bank, and the legal team used to ping us to make tiny changes to the app for compliance-related issues. Now they can make the fixes themselves. I think they'd be very proud and happy.

ajkjk

Hopefully nobody lets legal touch anything without the ability to run the code to test it, plus code reviews. So probably not.

eru

I'm not sure what you are on about?

You can let arbitrary teams, like legal, make PRs. You would still have the proper owners of the project decide whether they take the PRs, either by human review and/or by any other review process they set up.

ajkjk

I am on about the fact that you are imagining something working in a way that it does not work in practice.

You have two choices:

1. A developer makes a PR. They build the app and run it themselves to make sure it works, does what they intended, and nothing unexpected happens, and report on their testing in the PR. There's also a suite of automated tests. Between these you are confident that you can rubber-stamp it without touching the change yourself. But this requires them being able to run the code and intelligently think about it themselves, which AI cannot do.

2. A non-developer uses an LLM to make a PR. The change passes the tests, but you have no other validation about it because it was done by a bot that can't think about how the app actually works. As reviewer you have to pull down the change to run and validate it yourself. Now you are doing the same amount of work as before, except that instead of you telling the LLM to make the change, someone else did.

The only difference is maybe that the activation energy for doing the work was avoided. Which, fine, is non-negligible. But let's not pretend that in the latter case legal "made the change". They did the 1% of upfront work of asking the LLM to find where the change goes, and left you the other 99% of actually shepherding it through. The only changes this will work for at all are, like, updating copy/strings/icons, and honestly, at my last job at least, we already let legal (and product, etc.) do stuff like that. I suppose the LLM might save them having to figure out how to use Git, at least.

You might imagine there's some software out there where the automated tests are so thorough that you can trust a code change to not break anything if it passes the tests. I have personally never seen such a thing. And in practice many tests validate against strings and other small feature-level details, meaning that the kinds of code changes that other orgs are making are going to be touching the tests as well, so human verification is still required.

singularity2001

That will be an interesting new bug tracker: anyone in the company will be able to report any bug or add any feature request; if the model is able to solve it automatically, perfect, otherwise some human might take over. The interesting question then will be which code changes are legal and within the standards of what the company wants. So non-technical code/issue reviewer will become a super important and ubiquitous job.

SketchySeaBeast

Not just legal/within the standards, but which actually meet the unspoken requirements of the request. "We just need a new checkbox that asks if you're left-handed" might seem easy, but then it has ramifications for the Application PDF that gets generated, as well as any systems downstream, and maybe it requires a data conversion of some sort somewhere. I know that the POs I work with miss stuff or assume that the request will just have features by default.

asdev

I promise you the legal team is not pushing any code changes

b0ner_t0ner

All they need is "vibes".

ZeroCool2u

"23 SWE-Bench Verified samples that were not runnable on our internal infrastructure were excluded."

What does that mean? Surely this should have a bit more elaboration. If you're just excluding a double digit number of tasks in the benchmark as uncompleted, that should be reflected in the scores.

asdev

Is the point of this to actually assign tasks to an AI to complete end to end? Every task I do with AI requires at least some bit of hand-holding, sometimes reprompting, etc. So I don't see why I would want to run tasks in parallel; I don't think it would increase throughput. Curious if others have better experiences with this.

masterj

The example use cases in the videos are pretty compelling and much smaller in scope.

“Here’s an error reported to the oncall. Give a try fixing it” (Could be useful even if it fails)

Refactor this small piece I noticed while doing something else. Small-scoped stuff that likely wouldn’t get done otherwise.

I wouldn't ask LLMs for full features in a real codebase, but these examples seem within the scope of what they might be able to accomplish end-to-end.

sagarpatil

I am working with a 3rd-party API (Exa.ai) and I hacked together a Python script. I ran a remote agent to do these tasks simultaneously (augment.new; I'm not affiliated, I have early access):

Agent 1: write tests, make sure all the tests pass.

Agent 2: convert the Python script to FastAPI.

Agent 3: create a frontend based on the FastAPI endpoints.

I get a PR, check the code to see if it works, and then merge to main. All three PRs worked flawlessly (the frontend wasn't pretty).
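
To give a sense of what the Agent 2 task amounts to, here is a minimal sketch of wrapping a script's entry point in FastAPI; run_search is a hypothetical stand-in for whatever the original script did with the Exa.ai API, not the actual code:

  # Minimal sketch of "convert the script to FastAPI": expose the existing
  # function behind an HTTP endpoint. run_search is a hypothetical placeholder.
  from fastapi import FastAPI
  from pydantic import BaseModel

  app = FastAPI()

  class SearchRequest(BaseModel):
      query: str
      num_results: int = 10

  def run_search(query: str, num_results: int) -> list[dict]:
      # Placeholder for the original script's Exa.ai call.
      return [{"query": query, "rank": i} for i in range(num_results)]

  @app.post("/search")
  def search(req: SearchRequest) -> list[dict]:
      return run_search(req.query, req.num_results)

  # Run with: uvicorn main:app --reload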

nmca

with a bad ai it is pointless, with a good ai it is powerful.

codex-1 has been quite good in my experience

fullstackchris

Reading these threads, it's clear to me people are so cooked they no longer understand (or perhaps never did) the simple process of how source code is shared, built, and merged together by multiple editors.

bionhoward

What about privacy, training opt out?

What about using it for AI / developing models that compete with our new overlords?

Seems like using this is just asking to get rug pulled for competing with em when they release something that competes with your thing. Am I just an old who’s crowing about nothing? It’s ok for them to tell us we own outputs we can’t use to compete with em?

piskov

Watch the video: there is an explicit switch at one of the steps about (not) allowing them to train on your repo.

lurking_swe

That's nice. And we trust that it does what it says because…? The AI company (OpenAI, Anthropic, etc.) pinky promised? Have we seen their source code? How do you know they don't train?

Facebook has been caught in recent DOJ hearings breaking the law with how they run their business, just as one example. They claimed under oath, previously, to not be doing X, and then years later there was proof they did exactly that.

https://youtu.be/7ZzxxLqWKOE?si=_FD2gikJkSH1V96r

A company's "word" means nothing imo. None of this makes sense if I'm being honest. Unless you personally have a negotiated contract with the provider, can somehow be certain they are doing what they claim, and can later sue for damages, all of this is just crossing your fingers and hoping for the best.

tough

On the other hand, you can enable explicit sharing of your data and get a few million free tokens daily.

wilg

If you don't trust the company, your opt-out strategy is much easier: you simply do not authorize them to access your code.

kleiba

Just curious: is your company happy sharing their code-base with an AI provider? Or are you using a local installation?

pixl97

Companies commonly share their code with SaaS providers. Typically they'll have a contract that prevents other usage.

asadm

Why not? OpenAI won't be stupid enough to look at my code and be that vulnerable legally. It ain't worth it.

KaiserPro

They literally scraped half of YouTube, made a library to extract the audio, and released it as Whisper.

Of _course_ they are training on your shit.

asadm

That's publicly accessible shit. My code is a trade secret and IP. I would litigate that shit if a line I wrote ends up in a public model; easiest money to be made.

odie5533

For 99% of companies, their code is worthless to anyone but them.

manquer

For copying the product/service, yes, it is not worth much.

However, for people trying to compromise your system, access to your code can be a valuable asset. The worth of that could go well beyond just the enterprise value of the organization; it could cost people's lives or bring down critical infrastructure.

You don't just have access to code you created and have complete control over. Organizations have vendors providing code (drivers, libraries…) with narrow licenses that prohibit sharing or leaking it in any way. So this type of leak can open you up to a lot of liability.

kleiba

If that were true, hardly any company would be opposed to open-sourcing their code base.

nmca

It is a cost-benefit trade-off, as with all things. The benefits look pretty good.

layer8

The cost of sharing your code is unknown, though.

philomath_mn

Under what circumstances would that cost be high? Is OpenAI going to rip off your app? Why would they waste a second on that when there are better models to be built?

bhl

Cursor has an enterprise mode which enforces a data privacy feature.

kleiba

So only in "enterprise mode", huh. Interesting, thanks.