OpenAI dropped the price of o3 by 80%
503 comments
June 10, 2025
leetrout
I actually contacted the California AG to get a refund from another AI company after they failed to refund me.
The AG office followed up and I got my refund. It was worth my time to file, because we should stop letting companies get away with this stuff where they pile on new requirements after you've already paid.
Separately they also do not need my phone number after having my name, address and credit card.
Has anyone got info on why they are taking everyone’s phone number?
jazzyjackson
(Having no insider info:) Because it can be used as a primary key across aggregated marketing databases including your voting history / party affiliation, income levels, personality and risk profiles, etc. If a company wants to, and your data hygiene hasn't been tip-top, your phone number is a pointer to a ton of intimate if not confidential data. Twitter was fined $150 million for asking for phone numbers under the pretense of "protecting your account" or whatever, when they actually used them for ad targeting.
>> Wednesday's 9th Circuit decision grew out of revelations that between 2013 and 2019, X mistakenly incorporated users' email addresses and phone numbers into an ad platform that allows companies to use their own marketing lists to target ads on the social platform.
>> In 2022, the Federal Trade Commission fined X $150 million over the privacy gaffe.
>> That same year, Washington resident Glen Morgan brought a class-action complaint against the company. He alleged that the ad-targeting glitch violated a Washington law prohibiting anyone from using “fraudulent, deceptive, or false means” to obtain telephone records of state residents.
>> X urged Dimke to dismiss Morgan's complaint for several reasons. Among other arguments, the company argued merely obtaining a user's phone number from him or her doesn't violate the state pretexting law, which refers to telephone “records.”
>> “If the legislature meant for 'telephone record' to include something as basic as the user’s own number, it surely would have said as much,” X argued in a written motion.
sgarland
Tangential: please do not use a phone number as a PK. Aside from the nightmare of normalizing them, there is zero guarantee that someone will keep the same number.
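For illustration, the usual alternative looks something like this (SQLite here; the table and column names are just made up): keep a surrogate key and treat the phone number as an ordinary, mutable attribute.

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE users (
            id    INTEGER PRIMARY KEY,   -- stable internal identifier
            phone TEXT UNIQUE            -- normalized (E.164); may change or be NULL
        );
    """)
    # If the user ports or drops their number, only one column changes;
    # nothing that references users.id has to be rewritten.
    con.execute("INSERT INTO users (phone) VALUES (?)", ("+14155550100",))
    con.execute("UPDATE users SET phone = ? WHERE id = 1", ("+447700900123",))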
azinman2
OpenAI doesn't (currently) sell ads. I really cannot see a world where they'd want to sell ads to their API users only. It's not like you need a phone number to use ChatGPT.
To me the obvious explanation is fraud/abuse protection.
giancarlostoro
Thank you for this comment… a relative of mine spent a ton of money on an AI product that never came: a license he cannot use. I told him to contact his state's AG just in case.
pembrook
Source: have dealt with fraud at scale before.
Phone number is the only way to reliably stop MOST abuse on a freemium product that doesn't require payment/identity verification upfront. You can easily block VOIP numbers and ensure the person connected to this number is paying for an actual phone plan, which cuts down dramatically on bogus accounts.
Hence why even Facebook requires a unique, non-VOIP phone number to create an account these days.
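In practice the gate looks roughly like this (a sketch, not any particular company's actual check; the open-source phonenumbers library only classifies numbers from numbering-plan metadata, and real fraud stacks usually pay a carrier-lookup API for the line type):

    import phonenumbers
    from phonenumbers import PhoneNumberType, number_type

    def looks_like_a_real_phone(raw: str, region: str = "US") -> bool:
        try:
            num = phonenumbers.parse(raw, region)
        except phonenumbers.NumberParseException:
            return False
        if not phonenumbers.is_valid_number(num):
            return False
        # Reject ranges the metadata classifies as VOIP or premium-rate.
        return number_type(num) not in {PhoneNumberType.VOIP, PhoneNumberType.PREMIUM_RATE}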
I'm sure this comment will get downvoted in favor of some other conspiratorial "because they're going to secretly sell my data!" tinfoil post (this is HN of course). But my explanation is the actual reason.
I would love it if I could still just use email to sign up for free accounts everywhere, but it's too easily gamed at scale.
LexGray
On the flip side, it makes a company seem strikingly inept when they use VOIP status as a way to filter valid users. I haven't done business with companies like Netflix or Uber because I don't feel like paying AT&T a cut for identity verification. There are plenty of other methods, like digital licenses, which are both more secure and better for privacy.
exceptione
Maybe they should look into a non-freemium business model. But that won't happen because they want to have as much personal data as possible.
- Parent talks about a paid product. If they want to burn tokens, they are going to pay for it.
- Those phone requirements do not stop professional abusers, organized crime, or state-sponsored groups. Case in point: Twitter is overrun by bots, scammers, and foreign info-ops swarms.
- Phone requirements might hinder non-professional abusers at best, but we are sidestepping the question of whether those corporations deserve enough trust to compel regular users to sell themselves. Maybe the business model just sucks.
juros
Personally I found that rejecting disposable/temporary emails and flagging requests behind VPNs filtered out 99% of abuse on my sites.
No need to ask for a phone or card -- or worse, biometric data! -- which also removes friction.
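Roughly something like this for the email side (the domain list is illustrative; in practice you would pull a maintained blocklist, and the VPN flag is just an IP-reputation lookup against a commercial list):

    # Illustrative disposable-email check; real blocklists have thousands of domains.
    DISPOSABLE_DOMAINS = {"mailinator.com", "guerrillamail.com", "10minutemail.com"}

    def allow_signup(email: str) -> bool:
        domain = email.rsplit("@", 1)[-1].lower()
        return domain not in DISPOSABLE_DOMAINS

    assert allow_signup("someone@example.com")
    assert not allow_signup("bot123@mailinator.com")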
SheinH
The discussion wasn't about freemium products though. Someone mentioned that they paid 20 bucks for OpenAI's API already and then they were asked for more verification.
AnthonyMouse
> I'm sure this comment will get downvoted in favor of some other conspiratorial "because they're going to secretly sell my data!" tinfoil post (this is HN of course). But my explanation is the actual reason.
Your explanation is inconsistent with the link in these comments showing Twitter getting fined for doing the opposite.
> Hence why even Facebook requires a unique, non-VOIP phone number to create an account these days.
Facebook is the company most known for disingenuous tracking schemes. They just got caught with their app running a service on localhost to provide tracking IDs to random shady third party websites.
> You can easily block VOIP numbers and ensure the person connected to this number is paying for an actual phone plan, which cuts down dramatically on bogus accounts.
There isn't any such thing as a "VOIP number"; all phone numbers are phone numbers. There are only some profiteers claiming they can tell you otherwise in exchange for money. Between MVNOs, small carriers, forwarding services, number portability, data inaccuracy, and foreign users, those databases are practically random number generators with massive false positive rates.
Meanwhile, major carriers are more than happy to hand out numbers from their ranges to spammers in bulk, to the point that this is now a profit center for the spammers and is letting them expand their operations: they get a large block of numbers those services claim aren't "VOIP numbers", use them to spam the services they want to spam, and then sell cheap or ad-supported SMS service at a profit to other spammers or to privacy-conscious people who want to sign up for a service they haven't used that number at yet.
charliebwrites
Doesn't Sam Altman own a cryptocurrency company [1] that specifically collects biometric data to identify people?
Seems familiar…
[1] https://www.forbes.com/advisor/investing/cryptocurrency/what...
jjani
GP did mention this :)
> a 3rd party I've never heard of that's a 'partner' of the largest AI company and Worldcoin founder
93po
The core tech and premise don't collect biometric data, but biometric data is collected for training purposes with consent and compensation. There is endless misinformation (willful and ignorant) around Worldcoin, but it is not, at its core, a biometric collection company.
malfist
Collecting biometrics for training purposes is still collecting biometrics.
ddtaylor
I'm also using OpenRouter because OpenAI isn't a great fit for me. I stopped using OpenAI because they expire your API credits if you don't use them. Yeah, it's only $10, but I'm not spending another dime with them.
numlocked
Hi - I'm the COO of OpenRouter. In practice we don't expire the credits, but we have to reserve the right to, or else we have an uncapped liability literally forever. Can't operate that way :) Everyone who issues credits on a platform has to have some way of expiring them. It's not a profit center for us, or part of our P&L; it's just a protection we have to have.
mitthrowaway2
If you're worried about the unlimited liability, how about you refund the credits instead of expiring them?
otterley
Out of curiosity, what makes you different from a retailer or restaurant that has the same problem?
carstenhag
Why only 365 days? Would be way fairer and still ok for you (if it's such a big issue) to expire them after 5 years.
johnnyyyy
Then you shouldn't use OpenRouter. From the ToS: "4.2 Credit Expiration; Auto Recharge: OpenRouter reserves the right to expire unused credits three hundred sixty-five (365) days after purchase."
bonki
I wonder if they do this everywhere; in certain jurisdictions this is illegal.
cactusplant7374
That is so sleazy.
Marsymars
Oh I also recently got locked out of my linkedin account until I supply data to Persona.
(So I’m remaining locked out of my linkedin account.)
csomar
> How do I get my $20 back?
Contact support and ask for a refund. Then do a chargeback.
shmoogy
I was excited about trying o3 for my apps, but I'm not doing this verification... thanks for the heads up.
baq
Meanwhile the FSB and Mossad happily generate fake identities on demand.
romanovcode
The whole point of identity verification is for that same Mossad to gather your complete profile and everything else they can from OpenAI.
Since Mossad and the CIA are essentially one organization, they already do it, 100%.
AstroBen
KYC requirement + OpenAI preserving all logs in the same week?
jjani
OpenAI introduced this with the public availability of o3, so no.
It's also the only LLM provider which has this.
What OpenAI has that the others don't is SamA's insatiable thirst for everyone's biometric data.
mycall
I think KYC has already been beaten by AI agents, according to RepliBench [0]: obtaining compute requires KYC, and that task shows a high success rate in their graphic.
[0] https://www.aisi.gov.uk/work/replibench-measuring-autonomous...
infecto
KYC has been around for a few months, I believe. Ever since they released some of the additional thought logs, you had to be verified.
sschueller
Has anyone noticed that OpenAI has become "lazy"? When I ask questions now it will not give me a complete file or fix. Instead it tells me what I should do, and I need to ask a second or third time to get it to just do the thing I asked.
I don't see this happening with, for example, DeepSeek.
Is it possible they are saving on resources by having it answer that way?
tedsanders
Yeah, our models are sometimes too lazy. It’s not intentional, and future models will be less lazy.
When I worked at Netflix I sometimes heard the same speculation about intentionally bad recommendations, which people theorized would lower streaming and increase profit margins. It made even less sense there as streaming costs are usually less than a penny. In reality, it’s just hard to make perfect products!
(I work at OpenAI.)
ukblewis
Please be careful about the alternative. I’ve seen o3 doing excessive tool calls and research for relatively simple problems.
jillesvangurp
Yep, it defaults to doing a web search even when that doesn't make sense.
For example, I asked it to write something. Then I asked it to give me that blob of text in markdown format. Everything it needed was already in the conversation, yet it took a whole minute of doing web searches and whatnot.
I actually dislike using o3 for this reason. I keep the default at 4o, but sometimes I forget to switch back and it goes off boiling the ocean to answer a simple question. It's a bit too trigger-happy with that. In general, all this version and model soup is impossible for non-technical users to figure out. And I've noticed 4o is now sometimes starting to do the same. I guess too many users never use the model dropdown.
Hard_Space
After the last few weeks, where o3 seems desperate to do tool searches or re-crunch a bad gen even though I only asked a question about it, I assumed that the policy is to burn through credits at the fastest possible rate. With this price change, I don't know what's happening now...
anshumankmr
That was a problem in GPT-4 Turbo as well...
jazzyjackson
IMO it's just that the models are very nondeterministic, and people get very different kinds of responses from them. I met a number of people who tried it when it first came out, got useless results, and stopped trying it. Other people (including me) got gobsmackingly great responses and it felt like AGI was around the corner. But after enough coin flips your luck runs out and you get some lazy responses. Some people have more luck than others and wonder why everyone around them says it's trash.
0x1ceb00da
I think it's good. The model will probably make some mistake at first. Not doing the whole thing and just telling the user the direction it's going in gives us a chance to correct its mistakes.
thimabi
Can you share the main challenges OpenAI has been facing in terms of increasing access to top-tier and non-lazy models?
TZubiri
but maybe you are saying that because you are a CIA plant that's trying to make the product bad because of complex reasons.
takes tinfoil hat off
Oh, nvm, that makes sense.
TillE
Had a fun experience the other day asking "make a graph of [X] vs [Y]" (some chemistry calculations), and the response was blah blah blah explain explain "let me know if you want a graph of this!" Yeah ok thanks for offering.
csomar
I don't think that's laziness but maybe agent tuning.
mythz
I've been turned off by OpenAI and have been actively avoiding their models for a while; luckily this is easy to do given the quality of Sonnet 4 / Gemini Pro 2.5.
I've always wondered how OpenAI could get away with o3's astronomical pricing: what does o3 do better than any other model to justify the premium cost?
jstummbillig
It's just a highly unoptimized space. There is very little market consolidation at this point, everyone is trying things out that lead to wildly different outcomes and processes and costs, even though in the end it's always just a bunch of utf-8 characters. o3 was probably just super expensive to run, and now, apparently, it's not anymore and can beat sonnet/opus 4 on pricing. It's fairly wild.
jsnider3
Very few customers pick the model based on cost; for many, ChatGPT is the only one they know of.
hu3
> Very few customers pick the model based on cost.
What? 3 out of 4 companies I consulted for that started using AI for coding marked cost as an important criterion. The 4th one has virtually infinite funding, so they just don't care.
jsnider3
> 3 out of 4 companies I consulted for that started using AI for coding marked cost as an important criterion.
And those aren't average customers.
lvl155
Google has been catching up. Funny how fast this space is evolving. Just a few months ago, it was all about DeepSeek.
bitpush
Many would say Google's Gemini models are SOTA, although Claude seems to be doing well with coding tasks.
snarf21
Gemini has been better than Claude for me on a coding project. Claude kept telling me it updated some code, but the update wasn't in the output. Like, I had to re-prompt just for updated output 5 times in a row.
jacob019
I break out Gemini 2.5 Pro when Claude gets stuck; it's just so slow and verbose. Claude follows instructions better and seems to better understand its role in agentic workflows. Gemini does something different with the context: it has a deeper understanding of the control flow and can uncover edge-case bugs that Claude misses. o3 seems better at high-level thinking and planning, questioning whether it should be done and whether the challenge actually matches the need. They're kind of like colleagues with unique strengths. o3 does well with a lot of things, I just haven't used it as much because of the cost. Will probably use it more now.
ookdatnog
If the competition boils down to who has access to the largest amount of high quality data, it's hard to see how anyone but Google could win in the end: through Google Books they have scans of tens of millions of books, and published books are the highest quality texts there are.
itake
I've been learning Vietnamese. Unfortunately, a lot of social media (Reddit, FB, etc.) uses a new generation of language. The younger generation uses so many abbreviations and acronyms that ChatGPT and Google Translate can't keep up.
I think if your goal is to have properly written language in older writing styles, then you're correct.
ookdatnog
I don't think it's simply a stylistic matter: it seems reasonable to assume that text in books tends to have higher information density, and contains longer and more complicated arguments (when compared to text obtained from social media posts, blogs, shorter articles, etc). If you want models that appear more intelligent, I think you need them to train on this kind of high-quality content.
The fact that these tend to be written in an older writing style is to me incidental. You could rewrite all your college text books in contemporary social media slang and I would still consider them high-quality texts.
johan914
I have been using Google's models the past couple of months, and was surprised to see how sycophantic ChatGPT is now. It's not just at the start or end of responses; it's interspersed throughout the markdown, with little substance. Asking it to change its style makes it overuse technical terms.
malshe
I have observed that DeepSeek hallucinates a lot more than others for the same task. Anyone else experienced it?
resource_waste
Deepseek was exciting because you could download their model. They are seemingly 3rd place and have been since Gemini 2.5.
Squarex
I would put them on the fourth after Google, OpenAI and Anthropic. Still the best open weight llm.
behnamoh
How do we know it's not a quantized version of o3? What's stopping these firms from launching the full model so it performs well on the benchmarks, and then gradually quantizing it (first at Q8 so no one notices, then Q6, then Q4, ...)?
I have a suspicion that's how they were able to get gpt-4-turbo so fast. In practice, I found it inferior to the original GPT-4 but the company probably benchmaxxed the hell out of the turbo and 4o versions so even though they were worse models, users found them more pleasing.
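(For anyone unfamiliar with the Q8/Q6/Q4 jargon: it means storing the weights at lower precision. A toy numpy illustration of what int8 quantization does to a weight matrix; nothing here is specific to o3:)

    import numpy as np

    rng = np.random.default_rng(0)
    w = rng.standard_normal((4, 4)).astype(np.float32)  # stand-in for a weight matrix

    scale = np.abs(w).max() / 127.0                      # symmetric int8 scale
    w_q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    w_dq = w_q.astype(np.float32) * scale                # what inference actually computes with

    print("max round-trip error:", float(np.abs(w - w_dq).max()))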
CSMastermind
This is almost certainly what they're doing and rebranding the original o3 model as "o3-pro"
tedsanders
Nope, not what we’re doing.
o3 is still o3 (no nerfing) and o3-pro is new and better than o3.
If we were lying about this, it would be really easy to catch us - just run evals.
(I work at OpenAI.)
fastball
Anecdotal, but about a week ago I noticed a sharp drop in o3 performance. For many tasks I will compare Gemini 2.5 Pro with o3, running the same prompt in both. Generally, for my personal use, o3 and G2.5P have been neck and neck over the last months, with responses I have been very happy with.
However starting from a week ago, the o3 responses became noticeably worse, with G2.5P staying about the same (in terms of what I've come to expect from the two models).
This alongside the news that you guys have decreased the price of o3 by 80% does really make it feel like you've quantized the model or knee-capped thinking or something. If you say it is wholly unchanged I'll believe you, but not sure how else to explain the (admittedly subjective) performance drop I've experienced.
fny
Unrelated: Can you all come up with a better naming scheme for your models? I feel like this is a huge UX miss.
o4-mini-high, o4-mini, o3, o3-pro, gpt-4o
Oy.
energy123
Is it o3 (low), o3 (medium) or o3 (high)? Different model names have crept into the various benchmarks over the last few months.
meta_ai_x
Just because you work at OpenAI doesn't mean you know everything about OpenAI, especially something as strategic as nerfing models to save costs.
MattDaEskimo
What's with the dropped benchmark performance compared to the original o3 release? It was disappointing to not see o4-mini on it as well
bn-l
Not quantized?
csomar
I think the parent-parent poster has explained why we can't trust you (and working at OpenAI doesn't help the way you think it does).
I didn't read the ToS, like everyone else, but my guess is that degrading model performance at peak times is one of the things that can slip through. We are not suggesting you are running a different model, but that you are quantizing it so that you can support more people.
This can't happen with open-weight models, where you take the model, allocate the memory, and run the thing. With OpenAI/Claude, we don't know the model being run, how large it is, what it is running on, etc. None of that is provided, and there is only one reason I can think of: to be able to reduce resources unnoticed.
mliker
Where are you getting this information? What basis do you have for making this claim? OpenAI, despite its public drama, is still a massive brand and if this were exposed, would tank the company's reputation. I think making baseless claims like this is dangerous for HN
beering
I think Gell-Mann amnesia happens here too, where you can see how wrong HN comments are on a topic you know deeply, but then forget about that when reading the comments on another topic.
behnamoh
> rebranding the original o3 model as "o3-pro"
interesting take, I wouldn't be surprised if they did that.
anticensor
-pro models appear to be a best-of-10 sampling of the original full size model
Szpadel
How do you sample it behind the scenes? Usually best-of-X means you generate X outputs and choose the best result.
If you could do this automatically, it would be a game changer: you could run the top 5 best models in parallel and select the best answer every time.
But it's not practical, because you are the bottleneck: you have to read all 5 solutions and compare them.
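(The closest automated version I can think of is an LLM-as-judge loop; a rough sketch with the standard OpenAI Python SDK, where the model name and grading prompt are placeholders and nothing here reflects what o3-pro actually does behind the scenes:)

    from openai import OpenAI

    client = OpenAI()
    question = "Explain why the sky is blue in two sentences."

    # 1. Sample n candidate answers in one request.
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": question}],
        n=5,
        temperature=1.0,
    )
    candidates = [c.message.content for c in resp.choices]

    # 2. Let a grader model pick the best one, so no human has to read all five.
    numbered = "\n\n".join(f"[{i}] {c}" for i, c in enumerate(candidates))
    grade = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": f"Question: {question}\n\nCandidate answers:\n{numbered}\n\n"
                       "Reply with only the index of the best answer.",
        }],
    )
    best = candidates[int(grade.choices[0].message.content.strip())]
    print(best)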
lispisok
I swear every time a new model is released it's great at first but then performance gets worse over time. I figured they were fine-tuning it to get rid of bad output which also nerfed the really good output. Now I'm wondering if they were quantizing it.
Tiberium
I've heard lots of people say that, but no objective reproducible benchmarks confirm such a thing happening often. Could this simply be a case of novelty/excitement for a new model fading away as you learn more about its shortcomings?
Kranar
I used to think the models got worse over time as well but then I checked my chat history and what I noticed isn't that ChatGPT gets worse, it's that my standards and expectations increase over time.
When a new model comes out I test the waters a bit with some more ambitious queries and get impressed when it can handle them reasonably well. Over time I take it for granted and then just expect it to be able to handle ever more complex queries, and get disappointed when I hit a new limit.
herval
There are definitely measurements (eg https://hdsr.mitpress.mit.edu/pub/y95zitmz/release/2 ), but I imagine they're rare because those benchmarks are expensive, so nobody keeps running them all the time.
Anecdotally, it's quite clear that some models are throttled during the day (eg Claude sometimes falls back to "concise mode" - with and without a warning on the app).
You can tell if you're using Windsurf/Cursor too - there are times of the day where the models constantly fail to do tool calling, and other times they "just work" (for the same query).
Finally, there are cases where it was confirmed by the company, like GPT-4o's sycophantic tirade, which very clearly impacted its output (https://openai.com/index/sycophancy-in-gpt-4o/).
cainxinth
I assumed it was because the first week revealed a ton of safety issues that they then "patched" by adjusting the system prompt, and thus using up more inference tokens on things other than the user's request.
bobxmax
My suspicion is it's the personalization. Most people have things like 'memory' on, and as the models increasingly personalize towards you, that personalization is hurting quality rather than helping it.
Which is why the base model wouldn't necessarily show differences when you benchmarked them.
85392_school
I think it's an illusion. People have been claiming it since the GPT-4 days, but nobody's ever posted any good evidence to the "model-changes" channel in Anthropic's Discord. It's probably just nostalgia.
tshaddox
Yeah, it’s almost certainly hallucination (by the human user).
colordrops
It's probably less often quantizing and more often adding more and more to their hidden system prompt to address various issues and "issues", and as we all know, adding more context sometimes has a negative effect.
JoshuaDavid
I suspect what's happening is that lots of people have a collection of questions / private evals that they've been testing on every new model, and when a new model comes out it sometimes can answer a question that previous models couldn't. So that selects for questions where the new model is at the edge of its capabilities and probably got lucky. But when you come up with a new question, it's generally going to be on the level of the questions the new model is newly able to solve.
Like I suspect if there was a "new" model which was best-of-256 sampling of gpt-3.5-turbo that too would seem like a really exciting model for the first little bit after it came out, because it could probably solve a lot of problems current top models struggle with (which people would notice immediately) while failing to do lots of things that are a breeze for top models (which would take people a little bit to notice).
nabla9
It seems that at least Google is overselling its compute capacity.
You pay a monthly fee, but Gemini is completely jammed for 5-6 hours when North America is working.
baq
Gemini is simply that good. I’m trying out Claude 4 every now and then and go back to Gemini to fix its mess…
edzitron
When you say "jammed," how do you mean?
JamesBarney
I'm pretty sure this is just a psychological phenomenon. When a new model is released, all the capabilities the new model has that the old model lacks are very salient. This makes it seem amazing. Then you get used to the model, push it to the frontier, and suddenly the most salient memories of the new model are its failures.
There are tons of benchmarks that don't show any regressions. Even small and unpublished ones rarely show regressions.
mhitza
That was my suspicion when I first deleted my account, when it felt like the output in ChatGPT got worse and I found it highly suspicious to see an errant davinci model keyword in the ChatGPT URL.
Now I'm feeling similarly with their image generation (which is the only reason I created a paid account two months ago, and the output looks more generic by default).
beering
Are you able to quantify how quickly your perception gets skewed by how long you use the models?
beering
It’s easy to measure the models getting worse, so you should be suspicious that nobody who claims this has scientific evidence to back it up.
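(The measurement really is cheap to set up: freeze a prompt set, re-run it on a schedule, and log the scores. A minimal sketch with the OpenAI Python SDK; the prompts, model name, and scoring are placeholders, not a real eval suite:)

    import json, datetime
    from openai import OpenAI

    client = OpenAI()
    PROMPTS = [
        ("What is 17 * 24?", "408"),
        ("Name the capital of Australia.", "Canberra"),
    ]

    def run_once(model: str = "gpt-4o") -> float:
        correct = 0
        for prompt, expected in PROMPTS:
            out = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
                temperature=0,
            ).choices[0].message.content
            correct += expected.lower() in out.lower()
        return correct / len(PROMPTS)

    # Append one data point per run; a sustained drop would be the evidence.
    with open("drift_log.jsonl", "a") as f:
        f.write(json.dumps({"date": str(datetime.date.today()), "score": run_once()}) + "\n")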
solfox
I have seen this behavior as well.
codr7
[flagged]
daseiner1
It's still a very competitive marketplace
mathgradthrow
honestly refreshing take.
bboygravity
But OpenAI breathes honesty. They're open source! They would never do such a thing. /s
tedsanders
It's the same model, no quantization, no gimmicks.
In the API, we never make silent changes to models, as that would be super annoying to API developers [1]. In ChatGPT, it's a little less clear when we update models because we don't want to bombard regular users with version numbers in the UI, but it's still not totally silent/opaque - we document all model updates in the ChatGPT release notes [2].
[1] chatgpt-4o-latest is an exception; we explicitly update this model pointer without warning.
[2] ChatGPT Release Notes document our updates to gpt-4o and other models: https://help.openai.com/en/articles/6825453-chatgpt-release-...
(I work at OpenAI.)
ctoth
From the announcement email:
> Today, we dropped the price of OpenAI o3 by 80%, bringing the cost down to $2 / 1M input tokens and $8 / 1M output tokens.
> We optimized our inference stack that serves o3—this is the same exact model, just cheaper.
hyperknot
I got 700+ tokens/sec on o3 after the announcement; I suspect it's very much a quantized version.
dist-epoch
Or maybe they just brought online much faster much cheaper hardware.
az226
Or they are using a speedy add-on decoder.
beering
Do you also have numbers on intelligence before and after?
zackangelo
Is that input tokens or output tokens/s?
carter-0
An OpenAI researcher claims it's the exact same model on X: https://x.com/aidan_mclau/status/1932507602216497608
ants_everywhere
Is this what happened to Gemini 2.5 Pro? It used to be very good, but it's started struggling on basic tasks.
The thing that gets me is it seems to be lying about fetching a web page. It will say things are there that were never on any version of the page and it sometimes takes multiple screenshots of the page to convince it that it's wrong.
SparkyMcUnicorn
The Aider discord community has proposed and disproven the theory that 2.5 Pro became worse, several times, through many benchmark runs.
It had a few bugs here or there when they pushed updates, but it didn't get worse.
ants_everywhere
Gemini is objectively exhibiting new behavior with the same prompts and that behavior is unwelcome. It includes hallucinating information and refusing to believe it's wrong.
My question is not whether this is true (it is) but why it's happening.
I am willing to believe the aider community has found that Gemini has maintained approximately equivalent performance on fixed benchmarks. That's reasonable considering they probably use a/b testing on benchmarks to tell them whether training or architectural changes need to be reverted.
But all versions of aider I've tested, including the most recent one, don't handle Gemini correctly, so I'm skeptical that they're the state of the art with respect to benchmarking Gemini.
code_biologist
My use case is mostly creative writing.
IMO 2.5 Pro 03-25 was insanely good. I suspect it was also very expensive to run. The 05-06 release was a huge regression in quality, with most people saying it was a better coder and a worse writer. They tested a few different variants and some were less bad than others, but overall it was painful to lose access to such a good model. The just-released 06-05 version seems uniformly better than 05-06, with far fewer "wow this thing is dumb as a rock" failure modes, but it still is not as strong as the 03-25 release.
Entirely anecdotally, 06-05 seems to exactly ride the line of "good enough to be the best, but no better than that" presumably to save costs versus the OG 03-25.
In addition, Google is doing something notably different between what you get on AI Studio versus the Gemini site/app. Maybe a different system prompt. There have been a lot of anecdotal comparisons on /r/bard and I do think the AI Studio version is better.
esafak
Are there any benchmarks that track historical performance?
behnamoh
Good question, and I don't know of any, although it's a no-brainer that someone should make one.
A proxy for that may be the anecdotal evidence of users who report back in a month that model X has gotten dumber (it started with GPT-4 and keeps happening, especially with Anthropic and OpenAI models). I haven't heard such anecdotal stories about Gemini, R1, etc.
SparkyMcUnicorn
Aider has one, but it hasn't been updated in months. People kept claiming models were getting worse, but the results proved that they weren't.
__mharrison__
Updated yesterday... https://aider.chat/docs/leaderboards/
BeetleB
Why does OpenAI require me to verify my "organization" (which requires my state issued ID) to use o3?
valleyer
Don't bother anyway. There are lots of cases of people trying and failing to go through the process, and there is no way to try a second time.
https://community.openai.com/t/session-expired-verify-organi...
https://community.openai.com/t/callback-from-persona-id-chec...
https://community.openai.com/t/verification-issue-on-second-...
https://community.openai.com/t/verification-not-working-and-...
https://community.openai.com/t/organization-verfication-fail...
https://community.openai.com/t/help-organization-could-not-b...
https://community.openai.com/t/to-verify-an-organization-acc...
BeetleB
Yikes! Indeed, I won't bother.
bearjaws
To prevent DeepSeek R2 from being trained on it.
piskov
If only there were people with multiple passports or, I don’t know, Kyrgyzstan.
How exactly will passport check prevent any training?
At most this will block API access to your average Ivan, not a state actor
ivanmontillam
I'm an average Ivan, and I got access.
BeetleB
Yeah, I just don't see myself using o3 when I have Gemini-2.5 Pro. I don't recall if Google Cloud verified my ID in the past, though. Still, no need to let yet another organization have my data if I'm not getting something better in return.
yyhhsj0521
It's most likely for regulatory compliance, rather than a sincere attempt to block anyone from training on them.
lxgr
Is there also a corresponding increase in weekly messages for ChatGPT Plus users with o3?
In my experience, o4-mini and o4-mini-high are far behind o3 in utility, but since I’m rate-limited for the latter, I end up primarily using the former, which has kind of reinforced the perception that OpenAI’s thinking models are behind the competition altogether.
sunaookami
200 per week now: https://x.com/kevinweil/status/1932565467736027597
el_benhameen
My usage has also reflected the pretty heavy rate limits on o3. I find o4-mini-high to be quite good, but I agree that I would much rather use o3. Hoping this means an increase in the limits.
sagarpatil
Before: 50 messages per week. Now: 100 messages per week.
lxgr
That's already been the case for a few weeks though, right? And it's up from 50, whereas a price reduction of 80% would correspond to 5x the quota, extrapolating linearly.
johnnyApplePRNG
Agreed 100%. o3 is great, but the rate-limit window and the quota itself both render it almost useless for more than one-off fixes.
It's great with those, however!
mrcwinn
Only at HN can the reaction to an 80% price drop be a wall of criticism.
alternatex
"80% price drop" is just a title. The wall of criticism is for the fine-print.
beering
What in the fine print are we criticising? Most of the negative comments make no reference to any fine print on their website.
xboxnolifes
The wall of criticism is all wild speculation, not fine print.
coffeecoders
Despite the popular take that LLMs have no moat and are burning cash, I find OpenAI's situation really promising.
Just yesterday, they reported an annualized revenue run rate of 10B. Their last funding round in March valued them at 300B. Despite losing 5B last year, they are growing really fast - 30x revenue with over 500M active users.
It reminds me a lot of Uber in its earlier years—fast growth, heavy investment, but edging closer to profitability.
bitpush
The problem is that your costs also scale with revenue. Ideally you want to control costs as you scale (the first unit you build is expensive, but as you make more, your costs come down).
For OpenAI, the more people use the product, the more they spend on compute, unless they can supplement it with other ways of generating revenue.
I unfortunately don't think OpenAI will be able to hit sustained profitability (see Netflix for another example).
simonw
"... as you make more your costs come down"
I'd say dropping the price of o3 by 80% due to "engineers optimizing inferencing" is a strong sign that they're doing exactly that.
asadotzler
You trust their PR statements?
lossolo
> "engineers optimizing inferencing"
They finally implemented DeepSeek open source methods for fast inference?
Legend2440
>(see Netflix for another example)
Netflix has been profitable for over a decade though? They reported $8.7 billion in profit in 2024.
amazingamazing
They increased prices and are not selling a pure commodity tho
ACCount36
The bulk of AI costs are NOT in inference. They're in R&D and frontier training runs.
The more inference customers OpenAI has, the easier it is for them to reach profitability.
tptacek
All costs are not equal. There is a classic pattern of dogfights for winner-take-most product categories where the long term winner does the best job of acquiring customers at the expense of things like "engineering to reduce costs". I have no idea how the AI space is going to shake out, but if I had to pick between OpenAI's mindshare in the broadest possible cohort of users vs. best/most efficient model, I'd pick the customers.
Obviously, lots of nerds on HN have preferences for Gemini and Claude, and having used all three I completely get why that is. But we should remember we're not representative of the whole addressable market. There were probably nerds on like ancient dial-up bulletin boards explaining why Betamax was going to win, too.
awongh
We don't even know yet if the model is the product, though. Will OpenAI be the company that makes the AI product/model (chat that keeps expanding into other functionalities and capabilities), or will it be 10,000 companies using the OpenAI models? (Well, it's probably both, but in what proportion of revenue?)
TZubiri
Unlike Uber or WhatsApp, there's no network effect. I don't think this is a winner-takes-all market; there was an article where we had this discussion earlier. Players who get a small market share are immediately profitable in proportion to that market share (given a minimum size is exceeded).
stavros
[flagged]
Magmalgebra
Anyone concerned about cost should remember that those costs are dropping exponentially.
Similarly, nearly all AI products, but especially OpenAI, are heavily _under_ monetized. OpenAI is an excellent personal shopper; the ad revenue that could be generated from that rivals Facebook or Google.
smelendez
It wouldn't surprise me if they try, but ironically if GPT is a good personal shopper, it might make it harder to monetize with ads because people will trust the bot's organic responses more than the ads.
You could override its suggestions with paid ones, or nerf the bot's shopping abilities so it doesn't overshadow the sponsors, but that will destroy trust in the product in a very competitive industry.
You could put user-targeted ads on the site not necessarily related to the current query, like ads you would see on Facebook, but if the bot is really such a good personal shopper, people are literally at a ChatGPT prompt when they see the ads and will use it to comparison shop.
marsten
You raise a good point that this isn't a low marginal cost business like software, telecom, or (most of) the web. Efficiency will be a big advantage for companies that can achieve it, in part because it will let them scale to new AI use cases.
With the race to get new models out the door, I doubt any of these companies have done much to optimize cost so far. Google is a partial exception – they began developing the TPU ten years ago and the rest of their infrastructure has been optimized over the years to serve computationally expensive products (search, gmail, youtube, etc.).
aizk
> sustained profitability (see Netflix for another example)
What? Netflix is incredibly profitable.
bitpush
Probably a bad example on my part, but also because of price increases and the ad tier. I was mostly talking about Netflix as it was originally conceived: "give access to unlimited content at a flat fee", which didn't scale well.
therealdrag0
As an anecdote, they have first-mover advantage with me. I pay monthly, but mostly because it's good enough and I can't be bothered to try a bunch out and switch. But if the dust settles and prices drop, I would be motivated to switch. How much that matters maybe depends on whether their revenue comes from app users or API plans. And first mover only works once. Now they may be coasting on name recognition, but otherwise new users may be load-balanced among all the options.
jillesvangurp
The moat is increasingly about having access to the billions needed to finance the infrastructure that serves billions of users. That's why Google is still in the game. They have that, they are very good at massive scale, and they have some cost advantages there.
OpenAI is very good at this as well because of their brand name. For many people ChatGPT is all they know. That's the one that's in the news. That's the one everybody keeps talking about. They have many millions of paying users at this point.
This is a non-trivial moat. If you can only be successful by not serving most of the market for cost reasons, then you can't be successful. It's how Google has been able to guard its search empire for a quarter century. It's easy to match what they do algorithmically. But growing from a niche search engine with maybe a few tens of thousands of users (e.g. Kagi) to Google scale, serving essentially most of this planet (minus some firewalled countries like Russia and China), is a bit of a journey.
So Google rolling out search integration is a big deal. It means they are readying themselves for that scale and will have billions of users exposed to this soon.
> Their last funding round in March valued them at 300B. Despite losing 5B last year, they are growing really fast
Yes, they are valued on the basis that world+dog will need agentic AIs and subscribe to the tune of tens or hundreds of dollars/month. The revenue is going to outstrip things like MS Office in its prime.
5B loss is peanuts compared to that. If they weren't burning that, their ambition level would be too low.
Uber now has a substantial portion of the market. They have about 3-4 billion in revenue per month. A lot of cost, obviously, but they managed 10B in profit last year. And they are not done growing yet. They were overvalued at some point and then they crashed, but they are still there, and it's a pretty healthy business at this point, which is reflected in their stock price. It's basically valued higher now than at the time of the SoftBank investment pre-IPO. Of course, a lot of stuff needed to be sorted out for that to happen.
seydor
Their moat is leaky because LLM prices will keep dropping forever and the only viable model will be a free model. Eventually everyone will catch up.
Plus there is the fact that "thinking models" can't really solve complex tasks / aren't really as good as they are believed to be.
Zaheer
I would wager most of their revenue is from the subscriptions - both consumer and business. That pricing is detached from the API pricing. The heavy emphasis on applications more recently is because they realize this as well.
rgavuliak
I don't think the no-moat take makes sense. In a world where more and more content and interaction is done with and via LLMs, the data of your users chatting with your LLM is a super valuable dataset.
ToucanLoucan
I mean sure, it's very promising if OpenAI's future is your only metric. It gets notably darker if you look at the broader picture of ChatGPT (and company)'s impact on our society.
* We have people uploading tons of zero-effort slop pieces to all manner of online storefronts, and making people less likely to buy overall because they assume everything is AI now
* We have an uncomfortable community of, to be blunt, actual cultists emerging around ChatGPT, doing all kinds of shit from annoying their friends and family all the way up to divorcing their spouses
* Education is struggling in all kinds of ways due to students using (and abusing) the tech, with already strained administrations struggling to figure out how to navigate it
Like, yeah, if your only metric is OpenAI's particular line going up, it's looking alright. And much like Uber, its success seems to be corrosive to the society in which it operates. Is this supposed to be good news?
BugheadTorpeda6
I absolutely agree. I find it abhorrent.
arealaccount
Dying for a reference on the cult stuff; a quick search didn't turn up anything interesting.
MangoToupe
In addition to what the parent commenter was likely referring to, there are also the Zizians: https://en.wikipedia.org/wiki/Zizians
ToucanLoucan
Scroll through the ChatGPT subreddit right now and tell me there isn't a TON of people in there who are legitimately unwell. Reads like the back page notes of a dystopian novel.
wizzwizz4
https://futurism.com/chatgpt-mental-health-crises, which references the more famous https://www.rollingstone.com/culture/culture-features/ai-spi... but is a newer article.
SlowTao
Yes, but in a typical Western business sense they are merely optimizing for user engagement and profits. What happens to society a decade from now because of all the slop being produced is not their concern. Facebook is just about connecting friends, right? It totally won't become a series of information moats and bubbles controlled by the algorithms...
A great communicator on the risks of AI being too heavily integrated into society is Zak Stein. As someone who works in education, they see firsthand how people are becoming dependent on this stuff rather than pursuing any kind of self-improvement, just handing over all their thinking to the machine. It is very bizarre, and I am seeing it a lot more in my personal experience over the last few months.
blueblisters
This is the best model out there, priced at or below Claude and Gemini.
They’re not letting the competition breathe
Davidzheng
Gemini is close (if not better), so it just makes sense, no? o3-pro might be ahead of the pack, though.
blueblisters
o3 does better, especially if you use the API (not ChatGPT).
dorianjp
appreciate this, the faster we get to cheap commoditization, the better
ucha
Can we know for sure whether the price drop is accompanied by a change in the model, such as quantization?
On Twitter, some people say that some models perform better at night when there is less demand, which allows them to serve a non-quantized model.
Since the models are only available through API and there is no test to check which version of the model is served, it's hard to know what we're buying...
monster_truck
Curious that the number of uses for Plus users remained the same. I don't think they're actually doing anything material to lower the cost by a meaningful amount. It's just margin they've always had, and they cut it because Magistral is pretty incredible for being completely free.
I'd like to offer a cautionary tale that involves my experience after seeing this post.
First, I tried enabling o3 via OpenRouter since I have credits with them already. I was met with the following:
"OpenAI requires bringing your own API key to use o3 over the API. Set up here: https://openrouter.ai/settings/integrations"
So I decided I would buy some API credits with my OpenAI account. I ponied up $20 and started Aider with my new API key set and o3 as the model. I get the following after sending a request:
"litellm.NotFoundError: OpenAIException - Your organization must be verified to use the model `o3`. Please go to: https://platform.openai.com/settings/organization/general and click on Verify Organization. If you just verified, it can take up to 15 minutes for access to propagate."
At that point, the frustration was beginning to creep in. I returned to OpenAI and clicked on "Verify Organization". It turns out, "Verify Organization" actually means "Verify Personal Identity With Third Party" because I was given the following:
"To verify this organization, you’ll need to complete an identity check using our partner Persona."
Sigh. I click "Start ID Check" and it opens a new tab for their "partner" Persona. The initial fine print says:
"By filling the checkbox below, you consent to Persona, OpenAI’s vendor, collecting, using, and utilizing its service providers to process your biometric information to verify your identity, identify fraud, and conduct quality assurance for Persona’s platform in accordance with its Privacy Policy and OpenAI’s privacy policy. Your biometric information will be stored for no more than 1 year."
OK, so now, we've gone from "I guess I'll give OpenAI a few bucks for API access" to "I need to verify my organization" to "There's no way in hell I'm agreeing to provide biometric data to a 3rd party I've never heard of that's a 'partner' of the largest AI company and Worldcoin founder. How do I get my $20 back?"