OpenAI Codex CLI: Lightweight coding agent that runs in your terminal

gklitt

I tried one task head-to-head with Codex o4-mini vs Claude Code: writing documentation for a tricky area of a medium-sized codebase.

Claude Code did great and wrote pretty decent docs.

Codex didn't do well. It hallucinated a bunch of stuff that wasn't in the code, and completely misrepresented the architecture - it started talking about server backends and REST APIs in an app that doesn't have any of that.

I'm curious what went so wrong - feels like possibly an issue with loading in the right context and attending to it correctly? That seems like an area that Claude Code has really optimized for.

I have high hopes for o3 and o4-mini as models so I hope that other tests show better results! Also curious to see how Cursor etc. incorporate o3.

strangescript

Claude Code still feels superior. o4-mini has all sorts of issues. o3 is better but at that point, you aren't saving money so who cares.

I feel like people are sleeping on Claude Code for one reason or another. It's not cheap, but it's by far the best, most consistent experience I have had.

artdigital

Claude Code is just way too expensive.

These days I’m using Amazon Q Pro on the CLI. Very similar experience to Claude Code minus a few batteries. But it’s capped at $20/mo and won’t set my credit card on fire.

aitchnyu

Is it using one of these models? https://openrouter.ai/models?q=amazon

Seems 4x costlier than my Aider+OpenRouter setup. Since I'm less about vibes or huge refactoring, my (first and only) bill is <$5 with Gemini. These models will halve that.

monsieurbanana

> Upgrade apps in a fraction of the time with the Amazon Q Developer Agent for code transformation (limit 4,000 lines of submitted code per month)

4k LOC per month seems terribly low? Any request I make could easily go over that. I feel like I'm completely misunderstanding (their fault though) what they actually meant.

Edit: No, I don't think I'm misunderstanding - if you want to go over this, they direct you to a pay-per-request plan, and you are no longer capped at $20.

ekabod

"gemini 2.5 pro exp" is superior to Claude Sonnet 3.7 when I use it with Aider [1]. And it is free (with some high limit).

[1] https://aider.chat/

razemio

Compared to Cline, Aider had no chance the last time I tried it (4 months ago). Has it really changed? I always thought Cline was superior because it focuses on Sonnet with all its bells and whistles, while Aider tries to be a universal IDE coding agent that works well with all models.

When I try Gemini 2.5 Pro Exp with Cline it does very well, but it often fails to use the tools provided by Cline. It's way less expensive, but it fails random basic tasks Sonnet does in its sleep. I pay the extra to save the time.

Do not get me wrong. Maybe I am totally outdated with my opinion. It is hard to keep up these days.

jacooper

Don't they train on your inputs if you use the free AI Studio API key?

Aeolun

> It's not cheap, but it's by far the best, most consistent experience I have had.

It’s too expensive for what it does though. And it starts failing rapidly when it exhausts the context window.

jasonjmcghee

If you get the hang of controlling costs, it's much cheaper. If you're exhausting the context window, I'm not surprised you're seeing high costs.

Be aware of the "cache".

Tell it to read specific files, and never use /compact (that'll bust the cache; if you feel you need to, you're going back and forth too much or using too many files at once).

Never edit files manually during a session (that'll bust the cache). THIS INCLUDES LINT.

Have a clear goal in mind and keep sessions to as few messages as possible.

Write / generate markdown files with needed documentation using claude.ai, and save those as files in the repo and tell it to read that file as part of a question.
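For example, a focused session opener in that spirit might look like this (file names are hypothetical; assumes the `claude` CLI's non-interactive `-p` flag):

    claude -p "Read docs/payments-overview.md and src/payments/stripe.ts, \
      then add retry logic to the webhook handler. Don't read other files."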

I'm at about ~$0.5-0.75 for most "tasks" I give it. I'm not a super heavy user, but it definitely helps me (it's like having a super focused smart intern that makes dumb mistakes).

If I need to feed it a ton of docs etc. for some task, it'll be more in the few-dollar range rather than under $1. But I really only do this to try some prototype with a library Claude doesn't know about (or knows an outdated version of).

For hobby stuff, it adds up - totally.

For a company, massively worth it. Insanely cheap productivity boost (if developers are responsible / don't get lazy / don't misuse it).

Implicated

I keep seeing this sentiment and it's wild to me.

Sure, it might cost a few dollars here and there. But what I've personally been getting from it, for that cost, is so far away from "expensive" it's laughable.

Not only does it do things I don't want to do, in a _super_ efficient manner. It does things I don't know how to do - contextually, within my own project, such that when it's done I _do_ know how to do it.

Like others have said - if you're exhausting the context window, the problem is you, not the tool.

For example, I have a project where I've been particularly lazy and there's a handful of models that are _huge_. I know better than to have Claude read those models into context - that would be stupid. Rather, I tell it specifically what I want to do within those models, give it specific method names, and tell it not to read the whole file but to search for and read the area around the method definition.

If you _do_ need it to work with very large files - they probably shouldn't be that large and you're likely better off refactoring those files (with Claude, of course) to abstract out where you can and reduce the line count. Or, if anything, literally just temporarily remove a bunch of code from the huge files that isn't relevant to the task so that when it reads it it doesn't have to pull all of that into context. (ie: Copy/paste the file into a backup location, delete a bunch of unrelated stuff in the working file, do your work with claude then 'merge' the changes to the backup file and copy it back)

If a few dollars here and there for getting tasks done is "too expensive" you're using it wrong. The amount of time I'm saving for those dollars is worth many times the cost and the number of times that I've gotten unsatisfactory results from that spending has been less than 5.

I see the same replies to these same complaints everywhere - people complaining about how it's too expensive or becomes useless with a full context. Those replies all state the same thing - if you're filling the context, you've already screwed it up. (And also, that's why it's so expensive)

I'll agree with sibling commenters - have claude build documentation within the project as you go. Try to keep tasks silo'd - get in, get the thing done, document it and get out. Start a new task. (This is dependent on context - if you have to load up the context to get the task done, you're incentivized to keep going rather than dump and reload with a new task/session, thus paying the context tax again - but you also are going to get less great results... so, lesson here... minimize context.)

100% of the time that I've gotten bad results/gone in circles/gotten hallucinations was when I loaded up the context or got lazy and didn't want to start new sessions after finishing a task and just kept moving into new tasks. If I even _see_ that little indicator on the bottom right about how much context is available before auto-compact I know I'm getting less-good functionality and I need to be careful about what I even trust it's saying.

It's not going to build your entire app in a single session/context window. Cut down your tasks into smaller pieces, be concise.

It's a skill problem. Not the tool.

ksec

Sometimes I see areas where AI/LLMs are absolutely crushing certain jobs; a whole category will be gone in the next 5 to 10 years, since they are already at the 80-90% mark. They just need another 5-10% as they continue to improve, and they are already cheaper per task.

Sometimes I see an area of AI/LLM where I think that even with a 10x efficiency improvement and 10x the hardware resources, which is 100x in aggregate, it will still be nowhere near good enough.

The truth is probably somewhere in the middle, which is why I don't believe AGI will be here any time soon. But Assisted Intelligence is no doubt in its iPhone moment, and will continue for another 10 years before, hopefully, another breakthrough.

ilaksh

Did you try the same exact test with o3 instead? The mini models are meant for speed.

gklitt

I want to but I’ve been having trouble getting o3 to work - lots of errors related to model selection.

enether

there was one post that detailed how those OpenAI models hallucinate and double down on their mistakes by "lying" - it speculated on a bunch of interesting reasons why this may be the case

recommended read - https://transluce.org/investigating-o3-truthfulness

I wonder if this is what's causing it to do badly in these cases

kristopolous

Ever use Komment? They've been in the game a while. Looks pretty good.

swyx

related demo/intro video: https://x.com/OpenAIDevs/status/1912556874211422572

this is a direct answer to claude code which has been shipping furiously: https://x.com/_catwu/status/1903130881205977320

and Claude Code is not open source; there are unverified comments that they have DMCA'ed decompilations https://x.com/vikhyatk/status/1899997417736724858?s=46

by total coincidence we're releasing our claude code interview later this week that touches on a lot of these points + why code agent CLIs are an actually underrated point in the SWE design space

(TLDR you can use it like a linux utility - similar to @simonw's `llm` - to sprinkle intelligence in all sorts of things like CI/PR review without the overhead of buying a Devin or a Copilot SaaS)
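For example, a sketch of that "linux utility" usage (assumes Claude Code's non-interactive `-p` flag and that it accepts piped stdin):

    # lightweight PR review step in CI
    git diff origin/main...HEAD | claude -p "Review this diff for bugs and risky changes; reply as a short bullet list."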

if you are a Claude Code (and now OAI Codex) power user we want to hear use cases - CFP closing soon, apply here https://sessionize.com/ai-engineer-worlds-fair-2025

axkdev

Hey! The weakest part of Claude Code, I think, is that it's closed source and locked to Claude models only. If you are looking for inspiration, Roo is the best tool atm. It offers far more interesting capabilities - just to name some: user-defined modes, a built-in debug mode that is great for debugging, and an architect mode. You can, for example, ask it to summarize some part of the running task and start a new task with fresh context. And, unlike in Claude Code, in Roo the LLM will actually follow your custom instructions (seriously, guys, that Claude.md is absolutely useless)! The only drawback of Roo, in my opinion, is that it is NOT a CLI.

kristopolous

There's Goose; Plandex and Aider too. Also there's Kilo, a new fork of Roo.

senko

I got confused, so to clarify for myself and others: Codex is open source, Claude Code isn't, and the referenced decompilation tweets are for Claude Code.

asadm

These days, I usually paste my entire (or some) repo into gemini and then APPLY changes back into my code using this handy script i wrote: https://github.com/asadm/vibemode

I have tried aider/copilot/continue/etc. But they lack in one way or the other.

jwpapi

It’s not just about saving money or making less mistakes its also about iteration speed. I can’t believe this process is remotely comparable to aider.

In Aider everything is loaded in memory: I can add/drop files in the terminal, discuss in the terminal, and switch models; every change is a commit; I can run terminal commands with ! at the start.
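For instance, a session looks roughly like this (file and model names illustrative):

    $ aider --model gemini/gemini-2.5-pro-exp-03-25
    > /add src/api.py tests/test_api.py
    > make the pagination helper return a cursor instead of an offset
    > !pytest tests/test_api.py

with each accepted change landing as its own git commit.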

A full codebase is more expensive and slower than just the relevant files. I understand if you don't worry about the cost, but at a reasonable size, pasting the full codebase can't really be a thing.

asadm

I am at my 5th project in this workflow and these are of different types too:

- an embedded project for esp32 (100k tokens)

- visual inertial odometry algorithm (200k+ tokens)

- a web app (60k tokens)

- the tool itself mentioned above (~30k tokens)

It has worked well enough for me. Other methods have not.

t1amat

Use a tool like repomix (npm), which has extensions in some editors (at least VSCode) that can quickly bundle source files into a machine-readable format.
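A minimal invocation, assuming the npm package name above:

    # bundle the current repo into one machine-readable file
    npx repomix

The generated output file can then be pasted into a chat in one go.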

brandall10

Why not just select Gemini Pro 2.5 in Copilot with Edit mode? Virtually unlimited use without extra fees.

Copilot used to be useless, but over the last few months has become quite excellent once edit mode was added.

asadm

Copilot (and others) try to be too smart and do context reduction (to save their own wallets). I want the ENTIRETY of the files I attached in the context, not a RAG-ed version of them.

bredren

This problem is real.

Claude Projects, ChatGPT projects, Sourcegraph Cody context building, MCP file systems - all of these are black boxes of what I can only describe as lossy compression of context.

Each is incentivized to deliver ~”pretty good” results at the highest token compression possible.

The best way around this I've found is to just own the web clients by including structured, concatenated files directly in chat contexts.

Self plug but super relevant: I built FileKitty specifically to aid this; it made the HN front page and I've continued to improve it:

https://news.ycombinator.com/item?id=40226976

If you can quickly prepare your file-system context yourself using any workflow, and pair it with appropriate additional context such as run output and a problem description, you can get excellent results, and you can pound away at an OpenAI or Anthropic subscription refining the prompt or updating the file context.

I have been finding myself spending more time putting together complex prompts for big, difficult problems that would not make sense to solve in the IDE.

nowittyusername

I believe this is the root of the problem for all agentic coding solutions. They are gimping the full context through fancy function calling and tool use to reduce the full context that is sent through the API. The problem with this is that you can never know what context is actually needed for the problem to be solved in the best way. The funny thing is, this type of behavior actually leads many people to believe these models are LESS capable than they actually are, because people don't realize how restricted the models are behind the scenes by the developers. The good news is, we are entering the era of large context windows, and we will all see a huge performance increase in coding as a result of these advancements.

thelittleone

Regarding context reduction: this got me wondering. If I use my own API key, there is no way for the IDE or copilot provider to benefit other than the monthly sub. But if I am using their provided model with tokens from the monthly subscription, they are incentivized to charge me based on the tokens I submit to them, then optimize that and pass a smaller request on to the LLM to get more margin. Is that what you are referring to?

brandall10

FWIW, Edit mode gives the impression of doing this, vs. originally only passing the context visible from the open window.

You can choose files to include, and they don't appear to be truncated in any way. Though to be fair, I haven't checked the network traffic; it just appears to operate this way in day-to-day use.

siva7

Thanks - most people don't understand this fine difference. Copilot does RAG (as do all other subscription-based agents, like Cursor) to save $$$, and results with RAG are significantly worse than having the complete context window for complex tasks. That's also the reason why ChatGPT and Claude basically lie to users when they market their file-upload functions without telling the whole story.

AaronAPU

Is that why it’s so bad? I’ve been blown away by how bad it is. Never had a single successful edit.

The code completion is chef's kiss though.

MrBuddyCasino

Cline doesn’t do this - this is what makes it suitable for working with Gemini and its large context.

fasdfasdf11234

Isn't this similar to https://aider.chat/docs/usage/copypaste.html

Just checked to see how it works. It seems to do all that you are describing. The difference is in the way it provides the files - it doesn't use an XML format.

If you wish, you could /add * to add all your files.

Also, deducing from this mode, it seems that any file you add to the aider chat with /add has its full contents added to the chat context.

But hey, I might be wrong. I did a limited test with 3 files in a project.

asadm

That's correct. Aider doesn't RAG on files, which is good. I don't use it because 1) the UI is so slow and clunky, and 2) using Gemini 2.5 via API in this way (huge context window) is expensive and also heavily rate-limited at this point. No such issue when used via the AI Studio UI.

fasdfasdf11234

You could use aider copy-paste with the AI Studio UI or any other web chat. You could use gemini-2.0-flash as the aider model that applies the changes. But I understand your first point.

I also understand having built your own tool to fit your own workflow, and being able to easily mold it to what you need.

ramraj07

I felt it loses track of things on really large codebases. I use 16x prompt to choose the appropriate files for my question and let it generate the prompt.

asadm

Do you mean Gemini? I generally notice pretty great recall up to 200k tokens. It's ~OK after that.

cube2222

Fingers crossed for this to work well! Claude Code is pretty excellent.

I’m actually legitimately surprised how good it is, since other coding agents I’ve used before have mostly been a letdown, which made me only use Claude in direct change prompting with Zed (“implement xyz here”, “rewrite this function with abc”, etc), so very hands-on.

So I went into trying out Claude Code rather pessimistically, and now I'm using it all the time! Sure, it ends up costing a bunch, but it's easy to justify $15 for a prompting session if the end result is a mostly complete PR, done much faster.

All that is to say - competition is good, fingers crossed for codex!

therealmarv

Claude Code has a closed license https://github.com/anthropics/claude-code/blob/main/LICENSE....

There is fork named Anon Kode https://github.com/dnakov/anon-kode which can use more models and non-Anthropic ones. But the license is unclear for it.

It's interesting to see Codex under the Apache License. Maybe somebody will extend it to be usable with competing models.

WatchDog

If it's a fork of the proprietary code, the situation is pretty clear: it's violating copyright.

Now, whether or not Anthropic cares enough to enforce its license is a separate issue, but it seems unwise to make much of an investment in it.

acheong08

They call it a "fork" but it doesn't share any code. It's from scratch afaik

cube2222

In terms of terminal-based and open-source, I think aider is the most popular one.

therealmarv

yes! It's great! I like it!

But it has one downside: it's not so good on unknown, big, complex codebases where you don't know how the code is structured. I wish they (or somebody else) would add an AI or some automation to add files dynamically or in a smart way when you don't know the codebase structure (at the expense of burning more tokens).

I'm thinking Codex (I have not checked it yet), Claude Code, Anon Kode, and all the AI editors/plugins do a better job there (and potentially burn more tokens).

But that's the only downside I can think of about aider.

seunosewa

I didn't like not seeing the reasoning of the models

jwr

Seconded. I was surprised by how good Claude Code is, even for less mainstream languages (Clojure). I am happy there is competition!

kurtis_reed

Fingers crossed for what?

dzhiurgis

I started using Claude Code every day. It's kinda expensive and hallucinates a ton (though with a custom prompt I've mostly tamed it).

Hope more competition can bring the price down.

retinaros

Too expensive. I can't understand why everyone is into Claude Code vs using Claude in Cursor or Windsurf.

danenania

I think it depends a lot on how you value your time. I'm personally willing to spend hundreds or thousands per month happily if it saves me enough hours. I'd estimate that if I were to do consulting, I'd likely be charging in the $150-250 per hour range, so by my math, it's pretty easy to justify any tools that save me even a few hours per month.

mwigdahl

Or, increasingly, how the company values your time. If Claude Code can make a $100K/year dev 10% more productive, it's worth it to the employer to pay anything under $1600/month for it (assuming fully loaded cost of the employee to the business is twice salary).

retinaros

OK, but in what way is a terminal a better UI than an IDE? I am trying all of them on a weekly basis, and Windsurf's UX seems miles ahead of / more efficient than a terminal. That is also what OAI believes, or else they wouldn't be trying to buy it.

SoftTalker

Are you still working 40 hours a week? If so, what's the difference?

_joel

You get the same results for cheaper by using a different tool (Windsurf's better imho).

taneq

How do you price this in? If you’re charging by the hour, paying out of pocket to reduce your hours seems self-defeating unless you raise your rates enough to cover both the costs and the lost hours. I can’t imagine too many clients would accept “I’m very expensive per hour because I’m fast, because I get AI to do most of it.”

otabdeveloper4

> if it saves me enough hours

You're being paid to type? I want your job.

ChadMoran

Claude Code has been able to produce results equivalent to a junior engineer. I spent about $300 in API credits in a month, but got value out of it far surpassing that.

benzible

If you have AWS credits...

    export CLAUDE_CODE_USE_BEDROCK=1
    export ANTHROPIC_MODEL=us.anthropic.claude-3-7-sonnet-20250219-v1:0
    export ANTHROPIC_API_TYPE=bedrock
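With those set, launching `claude` in the same shell should route requests through Bedrock rather than the Anthropic API (assuming your AWS credentials and region are already configured).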

codercotton

Is this for Claude Code?

register

And where are these exports used? Aider?

_neil

Anecdotally, Claude Code performs much better than Claude within Cursor. Not sure if it's a system prompt thing or if I've just convinced myself of it because the aesthetic is so much better, but either way the end result feels better to me.

rafaelmn

One has an incentive to burn through as many tokens as possible, and the other has an incentive to use as few as possible.

Workaccount2

My conspiracy theory of choice is resource allocation and playing favorites.

drusepth

I tried switching from Claude Code to both Cursor and Windsurf. Neither of the latter IDEs fully supports MCP (both were missing basic things like tool definitions and other vital features last time I tried), and both have been riddled with their own agentic flow issues (Cursor going down for a week a bit ago, Windsurf requiring paid upgrades to "get around" bugs, etc.).

This is all ignoring the controversies that pop up around e.g. Cursor seemingly every week. As an IDE, they're both getting there -- but I have objectively better results in Claude Code.

tcdent

that's what my Ramp card is for.

seriously though, anything that makes me smarter and more productive has a threshold in the thousands-of-dollars range, not hundreds

newlisp

Why is using Cursor with Sonnet cheaper than using Claude Code?

therealmarv

Probably because Cursor is betting on many paying users not using the tool to its full extent - like people paying for gym memberships but never going to the gym.

Or they are burning VC money.

gizmodo59

This is pretty neat! I was able to use it for a few use cases where it got things right the first time. The ability to use a screenshot to create an application is nice for rapid prototyping. And it's good to see them open-sourcing it, unlike Claude Code.

kumarm

First experience is not great. Here are the issues I hit when starting to use Codex:

1. The default model doesn't work and you get an error: "system OpenAI rejected the request (request ID: req_06727eaf1c5d1e3f900760d10ca565a7). Please verify your settings and try again."

2. You have to switch to o4-mini-2025-04-16 or some other model using /model. And if you exit Codex, you are back on the default model and have to switch again every time.

3. It crashed the first time with a NodeJS error.

But after the initial hiccups it seems to work, and I'm still checking how good/bad it is compared to Claude Code (which I love, except for the context size limits).

ramoz

Claude Code represents something far more than a coding capability to me. It can do anything a human can do within a terminal.

It's exceptionally good at coding. Amazing software, really; I'm sure the cost hurdles will be resolved. Yet it's still often worth the spend.

stitched2gethr

> It can do anything a human can do within a terminal.

This.. isn't true.

mgdev

Strictly worse than Claude Code at present, but since it's open source, I hope that changes quickly.

killerstorm

Given that Claude Code only works with Sonnet 3.7 which has severe limitations, how can it be "strictly worse"?

mgdev

Whatever Claude Code is doing in the client/prompting makes much better use of 3.7 than any other client I use that is also on 3.7. This is especially true when you bump up against context limits; it can successfully resume with a context reset about 90% of the time. MCP Commander [0] was built almost 100% with Claude Code and pretty light intervention. I immediately felt the difference in friction when using Codex.

I also spent a couple hours picking apart Codex with the goal of adding Sonnet 3.7 support (almost there). The actual agent loop they're using is very simple. Not to say that's a bad thing, but they're offloading all planning and workflow execution to the agent itself. That's probably the right end state to shoot for long-term, but given the current state of these models I've had much better success offloading task tracking to some other thing - even if that thing is just a markdown checklist. (I wrote about my experience [1] building AI Agents last year.)

[0]: https://mcpcommander.com/

[1]: https://mg.dev/lessons-learned-building-ai-agents/

999900000999

From my experience playing with Claude Code vs Cline (which is open source and the tool to beat, imo), I don't want anything that doesn't let me set my own models.

Deepseek is about 1/20th of the price and only slightly behind Claude.

Both have a tendency to over engineer. It's like a junior engineer who treats LOC as a KPI.

udbhavs

> Next, set your OpenAI API key as an environment variable:
>
>     export OPENAI_API_KEY="your-api-key-here"
>
> Note: This command sets the key only for your current terminal session. To make it permanent, add the export line to your shell's configuration file (e.g., ~/.zshrc).

Can't any 3rd party utility running in the same shell session phone home with the API key? I'd ideally want only codex to be able to access this var

jsheard

If you let malicious code run unsandboxed on your main account then you probably have bigger problems than an OpenAI API key getting leaked.

mhitza

You mean running npm update at the "wrong time"?

jjmarr

Just don't export it?

    OPENAI_API_KEY="your-api-key-here" codex

aesbetic

Yea that’s not gonna work, you have to export it for it to become part of your shell’s environment and be passed down to subprocesses.

You could, however, wrap the export and the codex command in a script and just call that. This way the variable would only be part of that script's environment.
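A minimal sketch of such a wrapper (file name and key placeholder are illustrative):

    #!/bin/sh
    # codex-wrapped.sh: the key exists only in this script's environment,
    # not in your interactive shell
    export OPENAI_API_KEY="your-api-key-here"
    exec codex "$@"

Then run `./codex-wrapped.sh` instead of calling `codex` directly.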

PhilipRoman

That code example uses the "VAR=VALUE program" syntax, which exports the variable only for that particular process, so it should work (https://pubs.opengroup.org/onlinepubs/9699919799/utilities/V...)

primitivesuave

You could create a shell function - e.g. `codex() { OPENAI_API_KEY="xyz" command codex "$@"; }` - using the `command` builtin so the function doesn't recursively call itself. To call the original command directly, use `command codex ...`.

People downvoting legitimate questions on HN should be ashamed of themselves.

udbhavs

That's neat! I only asked because I haven't seen API keys used in the context of profile environment variables in shell before - there might be other common cases I'm unaware of

shekhargulati

Not sure why they used React for a CLI. The code in the repo feels like it was written by an LLM—too many inline comments. Interestingly, their agent's system prompt mentions removing inline comments https://github.com/openai/codex/blob/main/codex-cli/src/util....

> - Remove all inline comments you added as much as possible, even if they look normal. Check using \`git diff\`. Inline comments must be generally avoided, unless active maintainers of the repo, after long careful study of the code and the issue, will still misinterpret the code without the comments.

kristianp

I find it irritating too when companies use React for a command-line utility. I think it's just my preference for anything else but JavaScript.

noidesto

I've had great results with the Amazon Q developer cli, ever since it became agentic. I believe it's using claude-3.7-sonnet under the hood.

094459

+1 - this has become my go-to CLI tool now; very impressed with it

sagarpatil

How does it compare to Claude Code?

noidesto

I haven't used Claude Code. But one major difference is that Q CLI is $19/month with generous limits.