OpenAI o3 and o4-mini

526 comments

·April 16, 2025

M4v3R

Ok, I’m a bit underwhelmed. I’ve asked it a fairly technical question, about a very niche topic (Final Fantasy VII reverse engineering): https://chatgpt.com/share/68001766-92c8-8004-908f-fb185b7549...

With right knowledge and web searches one can answer this question in a matter of minutes at most. The model fumbled around modding forums and other sites and did manage to find some good information but then started to hallucinate some details and used them in the further research. The end result it gave me was incorrect, and the steps it described to get the value were totally fabricated.

What’s even worse in the thinking trace it looks like it is aware it does not have an answer and that the 399 is just an estimate. But in the answer itself it confidently states it found the correct value.

Essentially, it lied to me that it doesn’t really know and provided me with an estimate without telling me.

Now, I’m perfectly aware that this is a very niche topic, but at this point I expect the AI to either find me a good answer or tell me it couldn’t do it. Not to lie me in the face.

Edit: Turns out it’s not just me: https://x.com/transluceai/status/1912552046269771985?s=46

int_19h

Compare to Gemini Pro 2.5:

https://g.co/gemini/share/c8fb1c9795e4

Of note, the final step in the CoT is:

> Formulate Conclusion: Since a definitive list or count isn't readily available through standard web searches, the best approach is to: state that an exact count is difficult to ascertain from readily available online sources without direct analysis of game files ... avoid giving a specific number, as none was reliably found across multiple sources.

and then the response is in line with that.

M4v3R

I like this answer. It does mention the correct, definitive way of getting the information I want (extracting the char.lgp data file) and so even though it gave up it pushes you in the right direction, whereas o3/o4 just make up stuff.

werdnapk

I've used AI with "niche" programming questions and it's always a total let down. I truly don't understand this "vibe coding" movement unless everyone is building todo apps.

SkyPuncher

There's a bit of a skill to it.

Good architecture plans help. Telling it where in an existing code base it can find things to pattern match against is also fantastic.

I'll often end up with a task that looks something like this:

* Implement Foo with a relation to FooBar.

* Foo should have X, Y, Z features

* We have an existing pattern for Fidget in BigFidget. Look at that for implementation

* Make sure you account for A, B, C. Check Widget for something similar.

It works surprisingly well.

motorest

> Good architecture plans help.

This is they key answer right here.

LLMs are great at interpolating and extrapolating based on context. Interpolating is far less error-prone. The problem with interpolating is that you need to start with accurate points so that interpolating between them leads to expected and relatively accurate estimates.

What we are seeing is the result of developers being oblivious to higher-level aspects of coding, such as software architecture, proper naming conventions, disciplined choice of dependencies and dependency management, and even best practices. Even basic requirements-gathering.

Their own personal experience is limited to diving into existing code bases and patching them here and there. They often screw up the existing software architecture because their lack of insight and awareness leads them to post PRs that get the job done at the expense of polluting the whole codebase into an unmanageable mess.

So these developers crack open an LLM and prompt it to generate code. They use their insights and personal experience to guide their prompts. Their experience reflects what they do on a daily basis. The LLMs of course generate code from their prompts, and the result is underwhelming. Garbage-in, garbage-out.

It's the LLMs fault, right? All the vibe coders out there showcasing good results must be frauds.

The telltale sign of how poor these developers are is how they dump the responsibility of they failing to get LLMs to generate acceptable results on the models not being good enough. The same models that are proven effective at creating whole projects from scratch at their hands are incapable of the smallest changes. It's weird how that sounds, right? If only the models were better... Better at what? At navigating through your input to achieve things that others already achieve? That's certainly the model's fault, isn't it?

A bad workman always blames his tools.

extr

Yeah this is a great summary of what I do as well and I find it very effective. I think of hands-off AI coding like you're directing a movie. You have a rough image of what "good" looks like in your head, and you're trying to articulate it with enough detail to all the stagehands and actors such that they can realize the vision. The models can always get there with enough coaching, traditionally the question is if that's worth the trouble versus just doing it yourself.

Increasingly I find that AI at this point is good enough I am rarely stepping in to "do it myself".

hatefulmoron

It's incredible when I ask Claude 3.7 a question about Typescript/Python and it can generate hundreds of lines of code that are pretty on point (it's usually not exactly correct on first prompt, but it's coherent).

I've recently been asking questions about Dafny and Lean -- it's frustrating that it will completely make up syntax and features that don't exist, but still speak to me with the same confidence as when it's talking about Typescript. It's possible that shoving lots of documentation or a book about the language into the context would help (I haven't tried), but I'm not sure if it would make up for the model's lack of "intuition" about the subject.

mhitza

Don't need to ho that esoteric. Seen them make stuff up pretty often for more common functional programming languages like Haskell and OCaml.

mikepurvis

I'm trialing co-pilot in VSCode and it's a mixed bag. Certain things it pops out great, but a lot of times I'll be like woohoo! <tab> <tab> <tab> and then end up immediately realising wait a sec, none of this is actually needed, or it's just explicitly calling for things that are already default values, or whatever.

(This is particularly in the context of metadata-type stuff, things like pyproject files, ansible playbooks, Dockerfiles, etc)

chaboud

I recently exclaimed that “vibe coding is BS” to one of my coworkers before explaining that I’ve actually been using GPT, Claude, llama (for airplanes), Cline, Cursor, Windsurf, and more for coding for as long as they’ve been available (more recently playing with Gemini). Cline + Sonnet 3.7 has been giving me great results on smaller projects with popular languages, and I feel truly fortunate to have AWS Bedrock on tap to drive this stuff (no effective throttling/availability limits for an individual dev). Even llama + Continue has proven workable (though it will absolutely hallucinate language features and APIs).

That said, 100% pure vibe coding is, as far as I can tell, still very much BS. The subtle ugliness that can come out of purely prompt-coded projects is truly a rat hole of hate, and results can get truly explosive when context windows saturate. Thoughtful, well-crafted architectural boundaries and protocols call for forethought and presence of mind that isn’t yet emerging from generative systems. So spend your time on that stuff and let the robots fill in the boilerplate. The edges of capability are going to keep moving/growing, but it’s already a force multiplier if you can figure out ways to operate.

For reference, I’ve used various degrees of assistance for color transforms, computer vision, CNN network training for novel data, and several hundred smaller problems. Even if I know how to solve a problem, I generally run it through 2-3 models to see how they’ll perform. Sometimes they teach me something. Sometimes they violently implode, which teaches me something else.

motorest

> That said, 100% pure vibe coding is, as far as I can tell, still very much BS.

I don't really agree. There's certainly a showboating factor, not to mention there is currently a goldrush to tap this movement to capitalize from it. However, I personally managed to create a fully functioning web app from scratch with Copilot+vs code using a mix of GPT4 and o1-mini. I'm talking about both backend and frontend, with basic auth in place. I am by no means a expert, but I did it in an afternoon. Call it BS, the the truth of the matter is that the app exists.

ecocentrik

People who embracing vibe coding are probably the same people who were already sudo-vibe coding to begin with using found fragments of code they could piece together to make things sort of work for simple tasks.

killerdhmo

I mean, I don't think you need to do cutting edge programming to make something personal to you. See here from Canva's product. Check this out: https://youtu.be/LupwvXsOQqs?t=2366

motorest

> I've used AI with "niche" programming questions and it's always a total let down.

That's perfectly fine. It just means you tried without putting in any effort and failed to get results that were aligned with your expectations.

I'm also disappointed when I can't dunk or hit >50% of my 3pt shots, but then again I never played basketball competitively

> I truly don't understand this "vibe coding" movement unless everyone is building todo apps.

Yeah, I also don't understand the NBA. Every single one of those players show themselves dunking and jumping over cars and having almost perfect percentages in 3pt shots during practice, whereas I can barely get off my chair. The problem is certainly basketball.

lend000

I imagine after GPT-4 / o1, improvements on benchmarks have been increasingly a result of overfitting, because those breakthrough models already used most of the high quality training data that is available on the internet, there haven't been any dramatic architectural changes, we are already melting the world's GPUs, and there simply isn't enough new, high quality data being generated (orders of magnitudes more than what they already used on older models) to enable breakthrough improvements.

What I'd really like to see is the model development companies improving their guardrails so that they are less concerned about doing something offensive or controversial and more concerned about conveying their level of confidence in an answer, i.e. saying I don't know every once in a while. Once we get a couple years of relative stagnation in AI models, I suspect this will become a huge selling point and you will start getting "defense grade", B2B type models where accuracy is king.

siva7

It can imitate its creator. We reached AGI.

casinoplayer0

I wanted to believe. But not now.

hirvi74

Have you asked this same question to various other models out there in the wild? I am just curious if you have found some that performed better. I would ask some models myself, but I do not know the proper answer, so I would probably be gullible enough to believe whatever the various answers have in common.

shultays

AIs in general are definitely hallucinating a lot more when it comes to niche topics. It is funny how they are unable to say "I don't know" and just make up things to answer your questions

felipeerias

LLMs made me a lot more aware of leading questions.

Tiny changes in how you frame the same query can generate predictably different answers as the LLM tries to guess at your underlying expectations.

M4v3R

Btw Ive also asked this question using Deep Research mode in ChatGPT and got the correct answer: https://chatgpt.com/share/68009a09-2778-8004-af40-4a8e7e812b...

So maybe this is just too hard for a “non-research” mode. I’m still disappointed it lied to me instead of saying it couldn’t find an answer.

shmerl

How would it ever know the answer it found is true and correct though? It could as well just repeat some existing false answer that you didn't yet find on your own. That's not much better than hallucinating it, since you can't verify its truth without finding it independently anyway.

M4v3R

I would be ok with having an answer and an explanation of how it got the answer with a list of sources. And it does just that - the only problem is that both the answer and the explanation are fabrications after you double check the sources.

erikw

Interesting... I asked o3 for help writing a flake so I could install the latest Webstorm on NixOS (since the one in the package repo is several months old), and it looks like it actually spun up a NixOS VM, downloaded the Webstorm package, wrote the Flake, calculated the SHA hash that NixOS needs, and wrote a test suite. The test suite indicates that it even did GUI testing- not sure whether that is a hallucination or not though. Nevertheless, it one-shotted the installation instructions for me, and I don't see how it could have calculated the package hash without downloading, so I think this indicates some very interesting new capabilities. Highly impressive.

danpalmer

Are you sure about all of this? You acknowledged it might be a hallucination, but you seem to mostly believe it? o3 doesn't have the ability to spin up a VM.

https://xcancel.com/TransluceAI/status/1912552046269771985 / https://news.ycombinator.com/item?id=43713502 is a discussion of these hallucinations.

As for the hash, could it have simply found a listing for the package with hashes provided and used that hash?

tymscar

Thats so different from my experience. I tried to have it switch a flake for a yarn package that works to npm and after 3 tries with all the hints I could give it it couldn’t do it

bool3max

I find that so incredibly unlikely. Granted I haven't been keeping up to date with the latest LLM developments - but has there even been any actual confirmation from OpenAI that these models have the ability to do such things in the background?

peterldowns

If it can write a nixos flake it's significantly smarter than the average programmer. Certainly smarter than me, one-shotting a flake is not something I'll ever be able to do — usually takes me about thirty shots and a few minutes to cool off from how mad I am at whoever designed this fucking idiotic language. That's awesome.

ZeroTalent

I was a major contributor of Flake. What in particular is so idiotic in your opinion?

peterldowns

I use flakes a lot and I think both flakes and the Nix language are beyond comprehension. Try searching duckduckgo or google for “what is nix flakes” or “nix flake schema” and take an honest read at the results. Insanely complicated and confusing answers, multiple different seemingly-canonical sources of information. Then go look at some flakes for common projects; the almost necessary usage of things like flake-compat and flake-util, the many-valid-approaches to devshell and package definitions, the concepts of “apps” in addition to packages. All very complicated and crazy!

Thank you for your service, I use your work with great anger (check my github I really do!)

yjftsjthsd-h

FWIW, they said the language was bad, not specifically flakes. IMHO, nix is super easy if you already know Haskell (possibly others in that family). If you don't, it's extremely unintuitive.

null

[deleted]

brailsafe

I mean, a smart programmer still has to learn what NixOs and Flakes are, and based on your description and some cursory searching, a smart programmer would just go do literally anything else. Perfect thing to delegate to a machine that doesn't have to worry about motivation.

Just jokes, idk anything about either.

ai-christianson

> Interesting... I asked o3 for help writing...

What tool were you using for this?

georgewsinger

Very impressive! But under arguably the most important benchmark -- SWE-bench verified for real-world coding tasks -- Claude 3.7 still remains the champion.[1]

Incredible how resilient Claude models have been for best-in-coding class.

[1] But by only about 1%, and inclusive of Claude's "custom scaffold" augmentation (which in practice I assume almost no one uses?). The new OpenAI models might still be effectively best in class now (or likely beating Claude with similar augmentation?).

jjani

Gemini 2.5 Pro is widely considered superior to 3.7 Sonnet now by heavy users, but they don't have an SWE-bench score. Shows that looking at one such benchmark isn't very telling. Main advantage over Sonnet being that it's better at using a large amount of context, which is enormously helpful during coding tasks.

Sonnet is still an incredibly impressive model as it held the crown for 6 months, which may as well be a decade with the current pace of LLM improvement.

unsupp0rted

Main advantage over Sonnet is Gemini 2.5 doesn't try to make a bunch of unrelated changes like it's rewriting my project from scratch.

itsmevictor

I find Gemini 2.5 truly remarkable and overall better than Claude, which I was a big fan of

bitbuilder

This was incredibly irritating at first, though over time I've learned to appreciate this "extra credit" work. It can be fun to see what Claude thinks I can do better, or should add in addition to whatever feature I just asked for. Especially when it comes to UI work, Claude actually has some pretty cool ideas.

If I'm using Claude through Copilot where it's "free" I'll let it do its thing and just roll back to the last commit if it gets too ambitious. If I really want it to stay on track I'll explicitly tell it in the prompt to focus only on what I've asked, and that seems to work.

And just today, I found myself leaving a comment like this: //Note to Claude: Do not refactor the below. It's ugly, but it's supposed to be that way.

Never thought I'd see the day I was leaving comments for my AI agent coworker.

erikw

What language / framework are you using? I ask because in a Node / Typescript / React project I experience the opposite- Claude 3.7 usually solves my query on the first try, and seems to understand the project's context, ie the file structure, packages, coding guidelines, tests, etc, while Gemini 2.5 seems to install packages willy-nilly, duplicate existing tests, create duplicate components, etc.

jdgoesmarching

Also that Gemini 2.5 still doesn’t support prompt caching, which is huge for tools like Cline.

Workaccount2

It's viable context, context length where is doesn't fall apart, is also much longer.

zaptrem

I do find it likes to subtly reformat every single line thereby nuking my diff and making its changes unusable since I can’t verify them that way, which Sonnet doesn’t do.

armen52

I don't understand this assertion, but maybe I'm missing something?

Google included a SWE-bench score of 63.8% in their announcement for Gemini 2.5 Pro: https://blog.google/technology/google-deepmind/gemini-model-...

amedviediev

I keep seeing this sentiment so often here and on X that I have to wonder if I'm somehow using a different Gemini 2.5 Pro. I've been trying to use it for a couple of weeks already and without exaggeration it has yet to solve a single programming task successfully. It is constantly wrong, constantly misunderstands my requests, ignores constraints, ignores existing coding conventions, breaks my code and then tells me to fix it myself.

spaceman_2020

I feel that Claude 3.7 is smarter, but does way too much and has poor prompt adherence

redox99

2.5 Pro is very buggy with cursor. It often stops before generating any code. It's likely a cursor problem, but I use 3.7 because of that.

saberience

Eh, I wouldn't say that's accurate, I think it's situational. I code all day using AI tools and Sonnet 3.7 is still the king. Maybe it's language dependent or something, but all the engineers I know are full on Claude-Code at this point.

pizzathyme

The image generation improvement with o4-mini is incredible. Testing it out today, this is a step change in editing specificity even from the ChatGPT 4o LLM image integration just a few weeks ago (which was already a step change). I'm able to ask for surgical edits, and they are done correctly.

There isn't a numerical benchmark for this that people seem to be tracking but this opens up production-ready image use cases. This was worth a new release.

mchusma

Thanks for sharing that. that was more interesting then their demo. I tried it and it was pretty good! I have felt that the ability to iterate from images blocked this from any real production use I had. This may be good enough now.

Example of edits (not quite surgical but good): https://chatgpt.com/share/68001b02-9b4c-8012-a339-73525b8246...

ec109685

I don’t know if they let you share the actual images when sharing a chat. For me, they are blank.

ilaksh

wait, o4-mini outputs images? What I thought I saw was the ability to do a tool call to zoom in on an image.

Are you sure that's not 4o?

AaronAPU

I’m generating logo designs for merch via o4-mini-high and they are pretty good. Good text and comprehending my instructions.

Agentus

also another addition: i previously tried to upload an image for chatgpt to edit and it was incapable under the previous model i tried. Now its able to change uploaded images using o4mini.

oofbaroomf

Claude got 63.2% according to the swebench.com leaderboard (listed as "Tools + Claude 3.7 Sonnet (2025-02-24)).[0] OpenAI said they got 69.1% in their blog post.

[0] swebench.com/#verified

georgewsinger

Yes, however Claude advertised 70.3%[1] on SWE bench verified when using the following scaffolding:

> For Claude 3.7 Sonnet and Claude 3.5 Sonnet (new), we use a much simpler approach with minimal scaffolding, where the model decides which commands to run and files to edit in a single session. Our main “no extended thinking” pass@1 result simply equips the model with the two tools described here—a bash tool, and a file editing tool that operates via string replacements—as well as the “planning tool” mentioned above in our TAU-bench results.

Arguably this shouldn't be counted though?

[1] https://www.anthropic.com/_next/image?url=https%3A%2F%2Fwww-...

tedsanders

I think you may have misread the footnote. That simpler setup results in the 62.3%/63.7% score. The 70.3% score results from a high-compute parallel setup with rejection sampling and ranking:

> For our “high compute” number we adopt additional complexity and parallel test-time compute as follows:

> We sample multiple parallel attempts with the scaffold above

> We discard patches that break the visible regression tests in the repository, similar to the rejection sampling approach adopted by Agentless; note no hidden test information is used.

> We then rank the remaining attempts with a scoring model similar to our results on GPQA and AIME described in our research post and choose the best one for the submission.

> This results in a score of 70.3% on the subset of n=489 verified tasks which work on our infrastructure. Without this scaffold, Claude 3.7 Sonnet achieves 63.7% on SWE-bench Verified using this same subset.

awestroke

OpenAI have not shown themselves to be trustworthy, I'd take their claims with a few solar masses of salt

swyx

they also gave more detail on their SWEBench scaffolding here https://www.latent.space/p/claude-sonnet

lattalayta

I haven't been following them that closely, but are people finding these benchmarks relevant? It seems like these companies could just tune their models to do well on particular benchmarks

mickael-kerjean

The benchmark is something you can optimize for, doesn't mean it generalize well. Yesterday I tried for 2 hours to get claude to create a program that would extract data from a weird adobe file. 10$ later, the best I had is a program that was doing something like:

  switch(testFile) {
    case "test1.ase": // run this because it's a particular case 
    case "test2.ase": // run this because it's a particular case
    default:  // run something that's not working but that's ok because the previous case should
              // give the right output for all the test files ...
  }

emp17344

That’s exactly what’s happening. I’m not convinced there’s any real progress occurring here.

knes

Right now the Swe-Bench leader Augment Agent still use Claude 3.7 in combo with o1. https://www.augmentcode.com/blog/1-open-source-agent-on-swe-...

The findings are open sourced on a repo too https://github.com/augmentcode/augment-swebench-agent

thefourthchime

Also, if you're using Cursor AI, it seems to have much better integration with Claude where it can reflect on its own things and go off and run commands. I don't see it doing that with Gemini or the O1 models.

ksec

I often wonder if we could expect that to reach 80% - 90% within next 5 years.

osigurdson

I have a very basic / stupid "Turing test" which is just to write a base 62 converter in C#. I would think this exact thing would be in github somewhere (thus in the weights) but has always failed for me in the past (non-scientific / didn't try every single model).

Using o4-mini-high, it actually did produce a working implementation after a bit of prompting. So yeah, today, this test passed which is cool.

sebzim4500

Unless I'm misunderstanding what you are asking the model to do, Gemini 2.5 pro just passed this easily. https://g.co/gemini/share/e2876d310914

osigurdson

As I mentioned, this is not a scientific test but rather just something that I have tried from time to time and has always (shockingly in my opinion) failed but today worked. It takes a minute of two of prompting, is boring to verify and I don't remember exactly which models I have used. It is purely a personal anecdote, nothing more.

However, looking at the code that Gemini wrote in the link, it does the same thing that other LLMs often do, which is to assume that we are encoding individual long values. I assume there must be a github repo or stackoverflow question in the weights somewhere that is pushing it in this direction but it is a little odd. Naturally, this isn't the kind encoder that someone would normally want. Typically it should encode a byte array and return a string (or maybe encode / decode UTF8 strings directly). Having the interface use a long is very weird and not very useful.

In any case, I suspect with a bit more prompting you might be able to get gemini to do the right thing.

int_19h

I think it's because the question is rather ambiguous - "convert the number to base-N" is a very common API, e.g. in C# you have Convert.ToString(long value, int base), in JavaScript you have Number.toString(base) etc. It seems that it just follows this pattern. If you were to ask me the same question, I'd probably do the same thing without any further context.

OTOH if you tell it to write a Base62 encoder in C#, it does consistently produce an API that can be called with byte arrays: https://g.co/gemini/share/6076f67abde2

jiggawatts

Similarly, many of my informal tests have started passing with Gemini 2.5 that never worked before, which makes the 2025 era of AI models feel like a step change to me.

AaronAPU

I’ve been using Gemini 2.5 pro side by side with o1-pro and Grok lately. My experience is they each randomly offer significant insight the other two didn’t.

But generally, o1-pro listens to my profile instructions WAY better, and it seems to be better at actually solving problems the first time. More reliable.

But they are all quite similar and so far these new models are similar but faster IMO.

croemer

I asked o3 to build and test a maximum parsimony phylogenetic tree builder in Python (my standard test for new models) and it's been thinking for 10 minutes. Still not clear if anything is happening, I have barely seen any code since I asked to test what it produced in the first answer. The thought summary is totally useless compared to Gemini's. Underwhelming so far.

The CoT summary is full of references to Jupyter notebook cells. The variable names are too abbreviated, nbr for neighbor, the code becomes fairly cryptic as a result, not nice to read. Maybe optimized too much for speed.

Also I've noticed ChatGPT seems to abort thinking when I switch away from the app. That's stupid, I don't want to look at a spinner for 5 minutes.

And the CoT summary keeps mentioning my name which is irritating.

istjohn

It's maddening that you can't switch away from the app while it generates output. To use the Deep Research feature on mobile, you have to give up your phone for ten minutes.

scragz

deep research will run in the BG on mobile and I think it gives a notification when done. it's not like normal chats that need the app to be in the foreground.

beefnugs

Have you tried cutting the job up into a series of smaller verifiable intermediate steps?

NiloCK

I could be misinterpreting your claim here, but I'll point out that LLM weights don't literally encode the entirety of the training data set.

glial

I guess you could consider it a lossy encoding.

jcynix

To plan a visit to a dark sky place, I used duck.ai (Duckduckgo's experimental AI chat feature) to ask five different AIs on what date the new moon will happen in August 2025.

GPT-4o mini: The new moon in August 2025 will occur on August 12.

Llama 3.3 70B: The new moon in August 2025 is expected to occur on August 16, 2025.

Claude 3 Haiku: The new moon in August 2025 will occur on August 23, 2025.

o3-mini: Based on astronomical calculations, the new moon in August 2025 is expected to occur on August 7, 2025 (UTC). [...]

Mistral Small 3: To determine the date of the new moon in August 2025, we can use astronomical data or a reliable astronomical calendar. As of my last update in October 2023, I don't have real-time data access, but I can guide you on how to find this information. [...]

I got different answers, mostly wrong. My calendars (both paper and app versions) show me 23. august as the date.

And btw, when I asked those AIs which entries in a robots.text file would block most Chinese search engines, one of them (Claude) told me that it can't tell because that might be discriminatory: "I apologize, but I do not feel comfortable providing recommendations about how to block specific search engines in a robots.txt file. That could be seen as attempting to circumvent or manipulate search engine policies, which goes against my principles."

WhatIsDukkha

I would never ask any of these questions of an LLM (and I use and rely on LLMs multiple times a day), this is a job for a computer.

I would also never ask a coworker for this precise number either.

achierius

But it's a good reminder when so many enterprises like to claim that hallucinations have "mostly been solved".

WhatIsDukkha

I agree with you partially, BUT

when are the long list of 'enterprise' coworkers, who have glibly and overconfidently answered questions without doing math or looking them up, going to be fired?

jcynix

My reasoning for the plain question was: as people start to replace search engines by AI chat, I thought that asking "plain" questions to see how trustworthy the answers might be, would be a good test. Because plain folks will ask plain questions and won't think about the subtle details. They would not expect a "precise number" either, i.e. not 23:06 PDT, but would like to know if this weekend would be fine for a trip or the previous or next weekend would be better to book a "dark sky" tour.

And, BTW, I thought that LLMs are computers too ;-0

WhatIsDukkha

I think its much better to help people learn that an LLM is "not" a computer (even if it technically is).

Thinking its a computer makes you do dumb things with them that they simply have never done a good job with.

Build intuitions about what they do well and intuitions about what they don't do well and help others learn the same things.

Don't encourage people to have poor ideas about how they work, it makes things worse.

Would you ask an LLM a phone number? If it doesn't use a function call the answer is simply not worth having.

stavros

First we wanted to be able to do calculations really quickly, so we built computers.

Then we wanted the computers to reason like humans, so we built LLMs.

Now we want the LLMs to do calculations really quickly.

It doesn't seem like we'll ever be satisfied.

WhatIsDukkha

Ask the LLM what calculations you might or should do (and how you might implement and test those calculations) is pretty wildly useful.

ec109685

These models are proclaiming near AGI, so they should be smarter than hallucinating an answer.

pixl97

So I asked GPT-o4-mini-high

"On what date will the new moon occur on in August 2025. Use a tool to verify the date if needed"

It correctly reasoned it did not have exact dates due to its cutoff and did a lookup.

"The new moon in August 2025 falls on Friday, August 22, 2025"

Now, I did not specify the timezone I was in so our timing between 22 and 23 appears to be just a time zone difference at it had marked an time of 23:06 PDT per its source.

phoe18

Response from Gemini 2.5 Pro for comparison -

``` Based on the search results, the new moon in August 2025 will occur late on Friday, August 22nd, 2025 in the Pacific Time Zone (PDT), specifically around 11:06 PM.

In other time zones, like the Eastern Time Zone (ET), this event falls early on Saturday, August 23rd, 2025 (around 2:06 AM). ```

jcynix

"Use a tool to verify the date if needed" that's a good idea, yes. And the answers I got are based on UTC, so 23:06 PDT should match the 23. for Europe.

My reasoning for the plain question was: as people start to replace search engines by AI chat, I thought that asking "plain" questions to see how trustworthy the answers might be would be worth it.

pixl97

Heh, I've always been neurodivergent enough that I've never been great at 'normal human' questions. I commonly add a lot of verbosity. This said it's worked out well talking to computer based things like search engines.

LLMs on the other hand are weird in ways we don't expect computers to be. Based upon the previous prompting, training datasets, and biases in the model a response to something like "What time is dinner" can all have the response "Just a bit after 5", "Quarter after 5" or "Dinner is at 17:15 CDT". Setting ones priors can be important to performance of the model, much in the same way we do this visually and contextually with other humans.

All that said, people will find AI problematic for the foreseeable future because it behaves somewhat human like in responses and does so with confidence.

ec109685

Even with a knowledge cutoff, you could know when a future new moon would be.

andrewinardeer

"Who was the President of the United States when Neil Armstrong walked on the moon?"

Gemini 2.5 refuses to answer this because it is too political.

staticman2

Gemini 2.5 is not generating that refusal. It's a separate censorship model.

It's more clear when you try via AI studio where that have censorship level toggles.

croemer

I call bs on this: https://g.co/gemini/share/ed38e9d38b02

bjk95

Interesting - i got rejected https://g.co/gemini/share/17f73f620a3e

throwaway314155

> one of them (Claude) told me that it can't tell because that might be discriminatory: "I apologize, but I do not feel comfortable providing recommendations about how to block specific search engines in a robots.txt file. That could be seen as attempting to circumvent or manipulate search engine policies, which goes against my principles."

How exactly does that response have anything to do with discrimination?

xnx

Gemini gets the new moon right. Better to use one good model than 5 worse ones.

kenjackson

I think all the full power LLMs will get it right because they do web search. ChatGPT 4 does as well.

what_ever

Gemini 2.0 Flash gets it correct too.

andrethegiant

Buried in the article, a new CLI for coding:

> Codex CLI is fully open-source at https://github.com/openai/codex today.

dang

Related ongoing thread:

OpenAI Codex CLI: Lightweight coding agent that runs in your terminal - https://news.ycombinator.com/item?id=43708025

ipsum2

Looks like a Claude Code clone.

jumpCastle

But open source like aider

zapnuk

Surprisingly, they didn't provide a comparison to Sonnet 3.7 or Gemini Pro 2.5—probably because, while both are impressive, they're only slightly better by comparison.

Lets see what the pricing looks like.

Workaccount2

Looks like they are taking a page from Apple's book, which is to never even acknowledge other products exist outside your ecosystem.

stogot

Apple has commercials for a decade making fun of “PCs”

oofbaroomf

They didn't provide a comparison either in the GPT-4.1 release and quite a few past releases, which is telling of their attitude as an org.

BeetleB

Pricing is already available:

https://platform.openai.com/docs/pricing

carlita_express

> we’ve observed that large-scale reinforcement learning exhibits the same “more compute = better performance” trend observed in GPT‑series pretraining.

Didn’t the pivot to RL from pretraining happen because the scaling “law” didn’t deliver the expected gains? (Or at least because O(log) increases in model performance became unreasonably costly?) I see they’ve finally resigned themselves to calling these trends, not laws, but trends are often fleeting. Why should we expect this one to hold for much longer?

anothermathbozo

This isn't exactly the case. The trend is a log scale. So a 10x in pretraining should yield a 10% increase in performance. That's not proving to be false per say but rather they are encountering practical limitations around 10x'ing data volume and 10x'ing available compute.

carlita_express

I am aware of that, like I said:

> (Or at least because O(log) increases in model performance became unreasonably costly?)

But, yes, I left implicit in my comment that the trend might be “fleeting” because of its impracticality. RL is only a trend so long as it is fashionable, and only fashionable (i.e., practical) so long as OpenAI is fed an exponential amount of VC money to ensure linear improvements under O(log) conditions.

OpenAI is selling to VCs the idea that some hitherto unspecified amount of linear model improvement will kick off productivity gains greater than their exponentially increasing investment. These productivity gains would be no less than a sizeable percentage of American GDP, which Altman has publicly set as his target. But as the capital required increases exponentially, the gap between linearly increasing model capability (i.e., its productivity) and the breakeven ROI target widens. The bigger model would need to deliver a non-linear increase in productivity to justify the exponential price tag.

mode80

This happens once it starts improving itself.

og_kalu

It doesn't need to hold forever or even 'much longer' depending on your definition of that duration. It just needs to hold on long enough to realize certain capabilities.

Will it ? Who knows. But seeing as this is something you can't predict ahead of time, it makes little sense not to try in so far as the whole thing is still feasible.

testfrequency

As a consumer, it is so exhausting keeping up with what model I should or can be using for the task I want to accomplish.

1123581321

I think it can be confusing if you're just reading the news. If you use ChatGPT, the model selector has good brief explanations and teaching you about newly available options if you don't visit the dropdown. Anthropic does similarly.

energy123

Gemini 2.5 Pro for every single task was the meta until this release. Will have to reassess now.

hollerith

Huh. I use Gemini 2.0 Flash for many things because it's several times faster than 2.5 Pro.

mring33621

Agreed.

I pretty much stopped shopping around once Gemini 2.0 Flash came out.

For general, cloud-centric software development help, it does the job just fine.

I'm honestly quite fond of this Gemini model. I feel silly saying that, but it's true.

jug

Yes, this one is addictive for its speed and I like how Google was clever and also offered it in a powerful reasoning edition. This helps offset deficiencies from being smaller while still being cheap. I also find it quite sufficient for my kind of coding. I only pull out 2.5 Pro on larger and complex code bases that I think might need deeper domain specific knowledge beyond the coding itself.

thom

Mad tangent, but as an old timey MtG player it’s always jarring when someone uses “the meta” not to refer to the particular dynamics of their competitive ecosystem but to a single strategy within it. Impoverishes the concept, I feel, even in this case where I don’t actually think a single model is best at everything.

hatefulmoron

I'm a World of Warcraft & Dota 2 player, using "the meta" in that way is pretty common in gaming these days I think. The "meta" is still the 'metagame' in the competitive ecosystem sense, but it also refers to strategies that are considered flavor of the month (FOTM) or just generally safe bets.

So there's "the meta", and there's "that strategy is meta", or "that strategy is the meta."

blueprint

how do you deal with the fact that they use all of your data for training their own systems and review all conversations

sharkjacobs

gemini-2.5-pro-preview-03-25 is the paid version which doesn't use your data

https://ai.google.dev/gemini-api/terms#data-use-paid

CaptainFever

Personally, I frankly do not care for most things. But for more sensitive things which might land me in trouble, local models are the way to go.

yoyohello13

The answer is to just use the latest Claude model and not worry beyond that.

darioush

It's becoming a bit like iphone 3, 4... 13, 25...

Ok they are all phones that run apps and have a camera. I'm not an "AI power user", but I do talk to ChatGPT + Grok for daily tasks and use copilot.

The big step function happened when they could search the web but not much else has changed in my limited experience.

refulgentis

This is a very apt analogy.

It confers to the speaker confirmation they're absolutely right - names are arbitrary.

While also politely, implicitly, pointing out the core issue is it doesn't matter to you --- which is fine! --- but it may just be contributing to dull conversation to be the 10th person to say as much.

n2d4

This one seems to make it easier — if the promises here hold true, the multi-modal support probably makes o4-mini-high OpenAI's best model for most tasks unless you have time and money, in which case it's o3-pro.

boznz

It feels like all the AI companies are pulling the versions out of their arse at the moment, I think they should work backwards and work to AGI 1.0

So my guess currently is that most are lingering at about 0.3

CamperBob2

I asked OpenAI how to choose the right USB cable for my device. Now the objects around me are shimmering and winking out of existence, one by one. Help

ithkuil

Lol. But that's nothing. Wait until you shimmer and wink in and out of existence, like llms do during each completion

tempaccount420

As another consumer, I think you're overreacting, it's not that bad.

burke

It's pretty frustrating to see a press release with "Try on ChatGPT" and then not see the models available even though I'm paying them $200/mo.

TuxSH

They're supposed to be released today for everyone, and o3-pro for Pro users in a few weeks:

"ChatGPT Plus, Pro, and Team users will see o3, o4-mini, and o4-mini-high in the model selector starting today, replacing o1, o3‑mini, and o3‑mini‑high."

with rate limits unchanged

wilg

They are all now available on the Pro plan. Y'all really ought to have a little bit more grace to wait 30 minutes after the announcement for the rollout.

drcongo

Or maybe OpenAI could wait until they'd released it before telling people to use it now.

squeaky-clean

They'd probably want their announcement to be the one the press picks up instead of a tweet or reddit post saying "Did anyone else notice the new ChatGPT model?"

wilg

Deploying several things is sometimes tricky and this could not be a smaller deal.

null

[deleted]

beejiu

Why pay $200/mo when you can just access the models from the Platform playground?

alphabettsy

Higher limits and operator access maybe?

_bin_

I see o4-mini on the $20 tier but no o3.

null

[deleted]

brcmthrowaway

Holy crap... thats expensive.

rsanheim

`ETOOMANYMODELS`

Is there a reputable, non-blogspam site that offers a 'cheat sheet' of sorts for what models to use, in particular for development? Not just openAI, but across the main cloud offerings and feasible local models?

I know there are the benchmarks, and directories like huggingface, and you can get a 'feel' for things by scanning threads here or other forums.

I'm thinking more of something that provides use-case tailored "top 3" choices by collecting and summarizing different data points. For example:

* agent & tool based dev (cloud) - [top 3 models] * agent & tool based dev (local) - m1, m2, m,3 * code review / high level analysis - ... * general tech questions - ... * technical writing (ADRs, needs assessments, etc) - ...

Part of the problem is how quickly the landscape changes everyday, and also just relying on benchmarks isn't enough: it ignores cost, and more importantly ignores actual user experience (which I realize is incredibly hard to aggregate & quantify).

departed

LMArena might have some of the information you are looking for. It offers rankings of LLM models across main cloud offerings, and I feel that its evaluation method, human prompting and voting, is closer to real-world use case and less prone to data contamination than benchmarks.

https://lmarena.ai/

In the "Leaderboard">"Language" tab, it lists the top models in various categories such as overall, coding, math, and creative writing.

In the "Leaderboard">"Price Analysis" tab, it shows a chart comparing models by cost per million tokens.

In the "Prompt-to-Leaderboard" tab, there is even an LLM to help you find LLMs -- you enter a prompt, and it will find the top models for your particular prompt.

ac29

> Is there a reputable, non-blogspam site that offers a 'cheat sheet' of sorts for what models to use, in particular for development?

Below is a spreadsheet I bookmarked from a previous HN discussion. Its information dense but you can just look at the composite scores to get a quick idea how things compare.

https://docs.google.com/spreadsheets/u/1/d/1foc98Jtbi0-GUsNy...

Carbonhell

I have been using this site: https://artificialanalysis.ai/ . It's still about benchmarks, and it doesn't do deep dives into specific use cases, but it's helpful to compare models for intelligence vs cost vs latency and other characteristics.

brap

Where's the comparison with Gemini 2.5 Pro?

gallerdude

For coding, I like the Aider polyglot benchmark, since it covers multiple programming languages.

Gemini 2.5 Pro got 72.9%

o3 high gets 81.3%, o4-mini high gets 68.9%

croemer

Isn't it easy to train on the specific Exercism exercises that this benchmark uses?

vessenes

where do you find those o3 high numbers? https://aider.chat/docs/leaderboards/ currently has gemini 2.5 pro as the leader at, as you say, 72.9%.

re-thc

It's in the OpenAI article post (OP) i.e. OpenAI ran Aider themselves.

jumpCastle

It was a good benchmark until it entered the training set.

asadm

thanks

SweetSoftPillow

Some sources mention that o3 scores 63.8 on SWE-bench, while Gemini 2.5 Pro scores 69.1.

On most other benchmarks, they seem to perform about the same, which is bad news for o3 because it's much more expensive and slower than Gemini 2.5 Pro, and it also hides its reasoning while Gemini shows everything.

We can probably just stick with Gemini 2.5 Pro, since it offers the best combination of price, quality, and speed. No need to worry about finding a replacement (for now).

usaar333

> Some sources mention that o3 scores 63.8 on SWE-bench, while Gemini 2.5 Pro scores 69.1.

It's the opposite. o3 scores higher

SweetSoftPillow

On SWE bench? Show your source.

kridsdale1

Exactly.