OpenAI O3-Mini
425 comments
· January 31, 2025 · jen729w
dkjaudyeqooe
It's kind of strange that they gave that stat. Maybe they thought people would somehow read it as "56% better" or something.
Because when you think about it, it really is quite damning. Minus statistical noise it's no better.
cm2187
And another way to rephrase it is that almost half of the users prefer the older model, which is terrible PR.
Powdering7082
That would be 12%, why would you assume that is eaten by statistical noise?
senorrib
The OP's comment is probably a testament to that. With such a poorly designed A/B test, I doubt this has a p-value of < 0.10.
aqme28
They even include error bars. It doesn't seem to be statistical noise, but it's still not great.
afro88
Yeah. I immediately thought: I wonder if that 56% is in one or two categories and the rest are worse?
fsndz
Exactly, I was surprised as well.
brookst
Those prompts are so irritating and so frequent that I’ve taken to just quickly picking whichever one looks worse at a cursory glance. I’m paying them, they shouldn’t expect high quality work from me.
apparent
Have you considered the possibility that your feedback is used to choose what type of response to give to you specifically in the future?
I would not consider purposely giving inaccurate feedback for this reason alone.
MattDaEskimo
I don't want a model that's customized to my preferences. My preferences and understanding change all the time.
I want a single source model that's grounded in base truth. I'll let the model know how to structure it in my prompt.
francis_lewis
I think my awareness that this may influence future responses has actually been detrimental to my response rate. The responses are often so similar that I can imagine preferring either in specific circumstances. While I’m sure that can be guided by the prompt, I’m often hesitant to click on a specific response as I can see the value of the other response in a different situation and I don’t want to bias the future responses. Maybe with more specific prompting this wouldn’t be such an issue, or maybe more of an understanding of how inter-chat personalisation is applied (maybe I’m missing some information on this too).
Der_Einzige
Spotted the pissed off OpenAI RLHF engineer! Hahahahaha!
isaacremuant
Alternatively, I'll use the tool that is most user friendly and provides the most value for my money.
Wasting time on an anti-pattern isn't value, and neither is trying to outguess how that selection mechanism is used.
janalsncm
> I'm usually just selecting the one that answered first
Which is why you randomize the order. You aren’t a tester.
56% vs 44% may not be noise. That’s why we have p values. It depends on sample size.
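For a rough sense of scale, here's a minimal sketch of how sample size changes the picture (the sample sizes are made up; OpenAI doesn't say how many comparisons the testers did):

    # Two-sided binomial test: is a 56/44 preference split distinguishable from a coin flip?
    # The sample sizes below are assumptions; OpenAI doesn't publish the real number.
    from scipy.stats import binomtest

    for n in (50, 200, 1000):
        wins = round(0.56 * n)  # comparisons where o3-mini was preferred
        p = binomtest(wins, n, p=0.5, alternative="two-sided").pvalue
        print(f"n={n}: {wins} wins, p={p:.4f}")

With a few dozen ratings a 56/44 split is indistinguishable from chance; with a thousand it's very unlikely to be noise.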
brianstrimp
That makes the result stronger though. Even though many people click randomly, there is still a 12% margin between the two groups. Not huge, but still quite a lot.
sharkweek
Funny - I had ChatGPT document some stuff for me this week and asked which responses I preferred as well.
Didn’t bother reading either of them, just selected one and went on with my day.
If it were me I would have set up a “hey do you mind if we give you two results and you can pick your favorite?” prompt to weed out people like me.
apparent
I wonder if they down-weight responses that come in too fast to be meaningful, or without sufficient scrolling.
usef-
I'm surprised how many people claim to do this. You can just not select one.
rubyn00bie
I think it’s somewhat natural and am not personally surprised. It’s easy to quickly select an option that has no consequence; it takes more thought to realize that not selecting anything is also an option. Abstaining feels more like actively participating than just checking a box and moving on does. /shrug
letmevoteplease
The article says "expert testers."
"Evaluations by expert testers showed that o3-mini produces more accurate and clearer answers, with stronger reasoning abilities, than OpenAI o1-mini. Testers preferred o3-mini's responses to o1-mini 56% of the time and observed a 39% reduction in major errors on difficult real-world questions. W"
arijo
People could be flipping a coin and the score would be the same.
brianstrimp
A 12% margin is literally the opposite of a coin flip. Unless you have a really bad coin.
teeray
This prompt is like "See Attendant" on the gas pump. I'm just going to use another AI instead for this chat.
simonw
I used o3-mini to summarize this thread so far. Here's the result: https://gist.github.com/simonw/09e5922be0cbb85894cf05e6d75ae...
For 18,936 input tokens and 2,905 output tokens, it cost 3.3612 cents.
Here's the script I used to do it: https://til.simonwillison.net/llms/claude-hacker-news-themes...
threecheese
I haven’t tried o3, but one issue I struggle with in large context analysis tasks is the LLMs are never thorough. In a task like this thread summarization, I typically need to break the document down and loop through chunks to ensure it actually “reads” everything. I might have had to recurse into individual conversations with some small max-depth and leaf count and run inference on each, and then have some aggregation at the end, otherwise it would miss a lot (or appear to, based on the output).
Is this a case of PEBKAC?
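For reference, the chunk-then-aggregate workaround I'm describing looks roughly like this (a minimal sketch; the chunk size, model name, and prompts are arbitrary placeholders):

    # Rough chunk-then-aggregate summarization sketch; chunk size, model, and
    # prompts are placeholders, not anything the providers recommend.
    from openai import OpenAI

    client = OpenAI()

    def ask(prompt: str) -> str:
        resp = client.chat.completions.create(
            model="o3-mini",
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content

    def summarize(text: str, chunk_chars: int = 20_000) -> str:
        chunks = [text[i:i + chunk_chars] for i in range(0, len(text), chunk_chars)]
        partials = [ask(f"Summarize the key themes in this excerpt:\n\n{c}") for c in chunks]
        return ask("Combine these partial summaries into one summary:\n\n" + "\n\n".join(partials))

It works, but it's a lot of plumbing for something I'd hope a long-context model could do in one pass.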
syntaxing
Depending on what you’re trying to do, it’s worth trying the 1M-context Qwen models. They only released 7B and 14B versions, so their “intelligence” is limited, but they should be more than capable of a coherent summary.
layman51
I noticed that it thought that GoatInGrey wrote “openai is no longer relevant.” However, they were just quoting a different user (buyucu) who was the person who first wrote that.
Eduard
3.3612 cents (I guess USD cents) is expensive!
thorum
> Highlighting freedom from proprietary shackles, this opinion emphasizes a philosophy not widely echoed by others in the thread.
simonw
I just pushed a new release of my LLM CLI tool with support for the new model and the reasoning_effort option: https://llm.datasette.io/en/stable/changelog.html#v0-21
Example usage:
llm -m o3-mini 'write a poem about a pirate and a walrus' \
-o reasoning_effort high
Output (comparing that with the default reasoning effort): https://github.com/simonw/llm/issues/728#issuecomment-262832...
(If anyone has a better demo prompt I'd love to hear about it)
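If I understand the API docs correctly, the same option is exposed directly on the Chat Completions endpoint; a minimal equivalent with the official Python SDK would be something like:

    # reasoning_effort should accept "low", "medium", or "high" for o3-mini.
    from openai import OpenAI

    client = OpenAI()
    resp = client.chat.completions.create(
        model="o3-mini",
        reasoning_effort="high",
        messages=[{"role": "user", "content": "write a poem about a pirate and a walrus"}],
    )
    print(resp.choices[0].message.content)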
theturtle32
A reasoning model is not meant for writing poetry. It's not very useful to evaluate it on such tasks.
mediaman
It's not clear that writing poetry is a bad use case. Reasoning models seem to actually do pretty well with creative writing and poetry. Deepseek's R1, for example, has much better poem structure than the underlying V3, and writers are saying R1 was the first model where they actually felt like it was a useful writing companion. R1 seems to think at length about word choice, correcting structure, pentameter, and so on.
beklein
Thank you for all the effort you put into this tool and keeping it up to date!
georgewsinger
Did anyone else notice that o3-mini's SWE bench dropped from 61% in the leaked System Card earlier today to 49.3% in this blog post, which puts o3-mini back in line with Claude on real-world coding tasks?
Am I missing something?
anothermathbozo
I think this is with and without "tools." They explain it in the system card:
> We evaluate SWE-bench in two settings:
> • *Agentless*, which is used for all models except o3-mini (tools). This setting uses the Agentless 1.0 scaffold, and models are given 5 tries to generate a candidate patch. We compute pass@1 by averaging the per-instance pass rates of all samples that generated a valid (i.e., non-empty) patch. If the model fails to generate a valid patch on every attempt, that instance is considered incorrect.
> • *o3-mini (tools)*, which uses an internal tool scaffold designed for efficient iterative file editing and debugging. In this setting, we average over 4 tries per instance to compute pass@1 (unlike Agentless, the error rate does not significantly impact results). o3-mini (tools) was evaluated using a non-final checkpoint that differs slightly from the o3-mini launch candidate.
Bjorkbat
So am I to understand that they used their internal tooling scaffold on the o3(tools) results only? Because if so, I really don't like that.
While it's nonetheless impressive that they scored 61% on SWE-bench with o3-mini combined with their tool scaffolding, comparing Agentless performance with other models seems less impressive, 40% vs 35% when compared to o1-mini if you look at the graph on page 28 of their system card pdf (https://cdn.openai.com/o3-mini-system-card.pdf).
It just feels like data manipulation to suggest that o3-mini is much more performant than past models. A fairer picture would still paint a performance improvement, but it would look less exciting and more incremental.
Of course the real improvement is cost, but still, it kind of rubs me the wrong way.
pockmarked19
YC usually says “a startup is the point in your life where tricks stop working”.
Sam Altman is somehow finding this out now, the hard way.
Most paying customers will find out within minutes whether the models can serve their use case, a benchmark isn’t going to change that except for media manipulation (and even that doesn’t work all that well, since journalists don’t really know what they are saying and readers can tell).
georgewsinger
Makes sense. Thanks for the correction.
jakereps
The caption on the graph explains.
> including with the open-source Agentless scaffold (39%) and an internal tools scaffold (61%), see our system card .
I have no idea what an "internal tools scaffold" is but the graph on the card that they link directly to specifies "o3-mini (tools)" where the blog post is talking about others.
DrewHintz
I'm guessing an "internal tools scaffold" is something like Goose: https://github.com/block/goose
Instead of just generating a patch (copilot style), it generates the patch, applies the patch, runs the code, and then iterates based on the execution output.
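Presumably something like this loop (a rough sketch of the general idea, not Goose's or OpenAI's actual scaffold; the prompts, test command, and patch handling are all stand-ins):

    # generate -> apply -> run -> iterate; everything here is a placeholder.
    import subprocess
    from openai import OpenAI

    client = OpenAI()
    feedback = ""
    for attempt in range(4):
        resp = client.chat.completions.create(
            model="o3-mini",
            messages=[{"role": "user",
                       "content": "Produce a unified diff that fixes the failing tests.\n" + feedback}],
        )
        patch = resp.choices[0].message.content
        subprocess.run(["git", "apply", "-"], input=patch, text=True)   # apply candidate patch
        result = subprocess.run(["pytest", "-x"], capture_output=True, text=True)
        if result.returncode == 0:
            break                                    # tests pass, stop iterating
        feedback = result.stdout + result.stderr     # feed the failure output back in

The interesting part is that the model gets to see real execution output rather than guessing in one shot.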
logicchains
Maybe they found a need to quantize it further for release, or lobotomise it with more "alignment".
ben_w
> lobotomise
Anyone can write very fast software if you don't mind it sometimes crashing or having weird bugs.
Why do people try to meme as if AI is different? It has unexpected outputs sometimes, getting it to not do that is 50% "more alignment" and 50% "hallucinate less".
Just today I saw someone get the Amazon bot to roleplay furry erotica. Funny, sure, but it's still obviously a bug that a *sales bot* would do that.
And given these models do actually get stuff wrong, is it really incorrect for them to refuse to help with things they might be dangerous if the user isn't already skilled, like Claude in this story about DIY fusion? https://www.corememory.com/p/a-young-man-used-ai-to-build-a-...
bee_rider
If somebody wants their Amazon bot to role play as an erotic furry, that’s up to them, right? Who cares. It is working as intended if it keeps them going back to the site and buying things I guess.
I don’t know why somebody would want that, seems annoying. But I also don’t expect people to explain why they do this kind of stuff.
Rastonbury
They are implying the release was rushed and they had to reduce the functionality of the model in order to make sure it did not teach people how to make dirty bombs
kkzz99
Or the number was never real to begin with.
sss111
So far, it seems like this is the hierarchy
o1 > GPT-4o > o3-mini > o1-mini > GPT-4o-mini
o3 mini system card: https://cdn.openai.com/o3-mini-system-card.pdf
sho_hn
I think OpenAI really needs to rethink its product naming, especially now that they have a portfolio with no such clear hierarchy; the models occupy places along different axes (speed, cost, reasoning, capabilities, etc.).
Your summary attempt, e.g., also misses o3-mini vs o3-mini-high. Lots of trade-offs.
nullpoint420
Can't wait for the eventual rename to GPT Core, GPT Plus, GPT Pro, and GPT Pro Max models!
I can see it now:
> Unlock our industry leading reasoning features by upgrading to the GPT 4 Pro Max plan.
raphman
Oh, I'll probably wait for GPT 4 Pro Max v2 NG (improved)
wlll
I think I'll wait for the GTI model myself.
FridgeSeal
OpenAI chatGPT Pro Max XS Core, not to be confused with ChatGPT Max S Pro Net Core X, or ChatGPT Pro Max XS Professional CoPilot Edition.
rf15
They're strongly tied to Microsoft, so confusing branding is to be expected.
ngokevin
It needs to be clowned on here:
- Xbox, Xbox 360, Xbox One, Xbox One S/X, Xbox Series S/X
- Windows 3.1...98, 2000, ME, XP, Vista, 7, 8, 10
I guess it's better than headphone names (QC35, WH-1000XM3, M50x, HD560s).
ANewFormation
I can't wait for Project Unify which just devolves into a brand new p3-mini type naming convention. It's pretty much identical to the o3-mini, except the API is changed just enough to be completely incompatible and it crashes on any query using a word with more than two syllables. Fix coming soon, for 4 years so far.
On the bright side the app now has curved edges!
chris_va
One of my favorite parodies: https://www.youtube.com/watch?v=EUXnJraKM3k
nejsjsjsbsb
Flashbacks of the .NET zoo. At least they reined that in.
Euphorbium
They can still do models o3o, oo3 and 3oo. Mini-o3o-high, not to be confused with mini-O3o-high (the first o is capital).
margalabargala
They should just start encoding the model ID in trinary using o, O, and 0.
Model 00oOo is better than Model 0OoO0!
brookst
You’re thinking too small. What about o10, O1o, o3-m1n1?
echelon
It's like AWS SKU naming (`c5d.metal`, `p5.48xlarge`, etc.), except non-technical consumers are expected to understand it.
nejsjsjsbsb
Those are not names but hashes used to look up the specs.
maeil
Have you seen Azure VM SKU naming? It's.. impressive.
sss111
Yeah I tried my best :(
I think they could've borrowed a page out of Apple's book, even mountain names would be better. Plus Sonoma, Ventura, and Yosemite are cool names.
kaaskop
Yeah their naming scheme is super confusing, I honestly confuse them all the time.
thot_experiment
at least if i ran the company you'd know that
ChatGPTasdhjf-final-final-use_this_one.pt > ChatGPTasdhjf-final.pt > ChatGPTasdhjf.pt > ChatGPTasd.pt> ChatGPT.pt
losvedir
What about "o1 Pro mode". Is that just o1 but with more reasoning time, like this new o3-mini's different amount of reasoning options?
MichaelBurge
o1-pro is a different model than o1.
losvedir
Are you sure? Do you have any source for that? In this article[0] that was discussed here on HN this week, they say (claim):
> In fact, the O1 model used in OpenAI's ChatGPT Plus subscription for $20/month is basically the same model as the one used in the O1-Pro model featured in their new ChatGPT Pro subscription for 10x the price ($200/month, which raised plenty of eyebrows in the developer community); the main difference is that O1-Pro thinks for a lot longer before responding, generating vastly more COT logic tokens, and consuming a far larger amount of inference compute for every response.
Granted "basically" is pulling a lot of weight there, but that was the first time I'd seen anyone speculate either way.
[0] https://youtubetranscriptoptimizer.com/blog/05_the_short_cas...
JohnPrine
I don't think this is true
gundmc
If this is the hierarchy, why does 4o score so much higher than o1 on LLM Arena?
Worrisome for OpenAI that Gemini's mini/flash reasoning model outscores both o1 and 4o handily.
crazysim
Is it possible people are voting for speed of responsiveness too?
kgeist
I suspect people on LLM Arena don't ask complex questions too often, and reasoning models seem to perform worse than simple models when the goal is just casual conversation or retrieving embedded knowledge. Reasoning models probably 'overthink' in such cases. And slower, too.
usaar333
For non-stem perhaps.
For math/coding problems, o3 mini is tied if not better than o1.
lumost
I actually switched back from o1-preview to GPT-4o due to tooling integration and web search. I find that more often than not, the ability of GPT-4o to use these tools outweighs o1's improved accuracy.
LoveMortuus
How would the DeepSeek fit into this?
Or can it not compare? I don't know much about this stuff, but I've heard recently many people talk about DeepSeek and how unexpected it was.
sss111
Deepseek V3 is equivalent to 4o. Deepseek R1 is equivalent to o1 (if not better)
I think someone should just build an AI model comparing website at this point. Include all benchmarks and pricing
jsk2600
This one is good: https://artificialanalysis.ai/
ActVen
I really wish they would open up the reasoning effort toggle on o1 API. o1 Pro Mode is still the best overall model I have used for many complex tasks.
cyounkins
I switched an agent from Sonnet V2 to o3-mini (default medium mode) and got strangely poor results: only calling 1 tool at a time despite being asked to call multiple, not actually doing any work, and reporting that it did things it didn't
ern
I haven’t bothered with o3 mini, because who wants an “inferior” product? I was using 4o as a “smarter Google” until DeepSeek appeared (although its web search is being hammered now and I’m just using Google ).
o1 seems to have been neutered in the last week: lots of disclaimers and butt-covering in its responses.
I also had an annoying discussion with o1 about the DC plane crash. It doesn’t have web access and its cutoff is 2024, so I don’t expect it to know about the crash. However, after saying such an event is extremely unlikely and being almost patronisingly reassuring, it treated pasted news articles and links (which, to be sure, it can’t access) as “fictionalized”, instead of acknowledging its own cut-off date and that it could have been wrong. In contrast, DeepSeek (with web search turned off) was less dismissive of the risks in DC airspace and more aware of its own knowledge cut-off.
Coupled with the limited number of o1 responses for ChatGPT PLUS, I’ve cancelled my subscription for now.
profsummergig
Can someone please share the logic behind their version naming convention?
pookieinc
Can't wait to try this. What's amazing to me is that when this was revealed just one short month ago, the AI landscape looked very different than it does today with more AI companies jumping into the fray with very compelling models. I wonder how the AI shift has affected this release internally, future releases and their mindset moving forward... How does the efficiency change, the scope of their models, etc.
patrickhogan1
I thought it was o3 that was released one month ago and received high scores on ARC Prize - https://arcprize.org/blog/oai-o3-pub-breakthrough
If they were the same, I would have expected explicit references to o3 in the system card and how o3-mini is distilled or built from o3 - https://cdn.openai.com/o3-mini-system-card.pdf - but there are no references.
Excited at the pace all the same. Excited to dig in. The model naming all around is so confusing. Very difficult to tell what breakthrough innovations occurred.
echelon
There's no moat, and they have to work even harder.
Competition is good.
lesuorac
I really don't think this is true. OpenAI has no moat because they have nothing unique; they're using mostly other people's (like Transformers) architectures and other companies hardware.
Their value-prop (moat) is that they've burnt more money than everybody else. That moat is trivially circumvented by lighting a larger pile of money and less trivially by lighting the pile more efficiently.
OpenAI isn't the only company. The tech companies being massively outspent by Microsoft on H100 purchases are the ones with a moat. Google and Amazon, with their custom AI chips, are going to have better performance per cost than others, and that will be a moat. If you want to get the same performance per cost, you need to spend the time making your own chips, which is years of effort (= moat).
sumedh
> That moat is trivially circumvented by lighting a larger pile of money and less trivially by lighting the pile more efficiently.
Google with all its money and smart engineers was not able to build a simple chat application.
lukan
"OpenAI has no moat because they have nothing unique"
It seems they have high-quality training data, and the knowledge to work with it.
sangnoir
> That moat is trivially circumvented by lighting a larger pile of money and less trivially by lighting the pile more efficiently.
DeepSeek has proven that the latter is possible, which drops a couple of River crossing rocks into the moat.
brookst
Brand is a moat
lumost
Capex was the theoretical moat, same as TSMC and similar businesses. DeepSeek poked a hole in this theory. OpenAI will need to deliver massive improvements to justify a 1 billion dollar training cost relative to 5 million dollars.
usef-
I don't know if you are, but a lot of people are still comparing one Deepseek training run to the entire costs of OpenAI.
The DeepSeek paper states that the $5 million number doesn't include development costs, only the final training run. And it doesn't include the estimated $1.4 billion cost of the infrastructure/chips DeepSeek owns.
Most of OpenAI's billion dollar costs is in inference, not training. It takes a lot of compute to serve so many users.
Dario said recently that Claude was in the tens of millions (and that it was a year earlier, so some cost decline is expected), do we have some reason to think OpenAI was so vastly different?
wahnfrieden
Collaboration is even better, per open source results.
It is the closed competition model that’s being left in the dust.
vok
Well, o3-mini-high just successfully found the root cause of a seg fault that o1 missed: mistakenly using _mm512_store_si512 for an unaligned store that should have been _mm512_storeu_si512.
ilaksh
It looks like a pretty significant increase on SWE-Bench. Although that makes me wonder if there was some formatting or gotcha that was holding the results back before.
If this will work for your use case then it could be a huge discount versus o1. Worth trying again if o1-mini couldn't handle the task before. $4/million output tokens versus $60.
https://platform.openai.com/docs/pricing
I am Tier 5 but I don't believe I have access to it in the API (at least it's not on the limits page and I haven't received an email). It says "rolling out to select Tier 3-5 customers" which means I will have to wait around and just be lucky I guess.
TeMPOraL
Tier 3 here and already see it on Limits page, so maybe the wait won't be long.
ilaksh
Yep, I got an email about o3-mini in the API an hour ago.
TechDebtDevin
Genuinely curious, What made you choose OpenAI as your preferred api provider? Its always been the least attractive to me.
ilaksh
I have mainly been using Claude 3.5/3.6 Sonnet via API in the last several months (or since 3.5 Sonnet came out). However, I was using o1 for a challenging task at one point, but last I tested it had issues with some extra backslashes for that application.
I also have tested with DeepSeek R1 and will test some more with that although in a way Claude 3.6 with CoT is pretty good. Last time I tried to test R1 their API was out.
eknkc
We extensively used the batch APIs to decrease cost and handle large amounts of data. I also need JSON responses for a lot of things, and OpenAI seems to have the best JSON schema output option out there.
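For anyone who hasn't used it, the JSON schema option I mean is Structured Outputs; a sketch of the request shape (with a made-up "ticket" schema) looks like:

    # Structured Outputs sketch; the "ticket" schema here is invented for illustration.
    from openai import OpenAI

    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "Extract product and severity: 'The export button crashes the app.'"}],
        response_format={
            "type": "json_schema",
            "json_schema": {
                "name": "ticket",
                "strict": True,
                "schema": {
                    "type": "object",
                    "properties": {
                        "product": {"type": "string"},
                        "severity": {"type": "string"},
                    },
                    "required": ["product", "severity"],
                    "additionalProperties": False,
                },
            },
        },
    )
    print(resp.choices[0].message.content)  # a JSON string conforming to the schema

The same body works in the batch API, since each batch line is just a regular /v1/chat/completions request body.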
ipaddr
Who else might be a good choice? Deepseek is down. Who has the cheapest API at GPT-3.5 level or above?
Aperocky
Run it locally, the distilled smaller ones aren't bad at all.
TechDebtDevin
I've personally been using Deepseek (which has been better than GPT-3.5 for a really long time), and Perplexity, which is nice for its built-in search. I've actually been using Deepseek since it was free. It's been generally good for me. I've mostly chosen both because of pricing, as I generally don't use APIs for extremely complex prompts.
TeMPOraL
Until recently they were the only game in town, so maybe they accrued significant spend back then?
devindotcom
Sure as a clock, tick follows tock. Can't imagine trying to build out cost structures, business plans, product launches etc on such rapidly shifting sands. Good that you get more for your money, I suppose. But I get the feeling no model or provider is worth committing to in any serious way.
> Testers preferred o3-mini's responses to o1-mini 56% of the time
I hope by this they don't mean me, when I'm asked 'which of these two responses do you prefer'.
They're both 2,000 words, and I asked a question because I have something to do. I'm not reading them both; I'm usually just selecting the one that answered first.
That prompt is pointless. Perhaps as evidenced by the essentially 50/50 split: it's a coin-flip.