GPT-4.5

1025 comments · February 27, 2025

zaptrem

GPT 4.5 pricing is insane:

Input: $75.00 / 1M tokens
Cached input: $37.50 / 1M tokens
Output: $150.00 / 1M tokens

GPT 4o pricing for comparison:

Input: $2.50 / 1M tokens
Cached input: $1.25 / 1M tokens
Output: $10.00 / 1M tokens
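To put the gap in concrete terms, a quick back-of-envelope comparison (token counts are illustrative; prices are the ones quoted above):

  # Rough per-request cost at the posted prices (USD per 1M tokens).
  PRICES = {
      "gpt-4.5": (75.00, 150.00),  # (input, output)
      "gpt-4o": (2.50, 10.00),
  }

  def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
      in_price, out_price = PRICES[model]
      return input_tokens / 1e6 * in_price + output_tokens / 1e6 * out_price

  # A typical 10k-in / 1k-out request:
  print(request_cost("gpt-4.5", 10_000, 1_000))  # 0.90 (dollars)
  print(request_cost("gpt-4o", 10_000, 1_000))   # 0.035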

It sounds like it's so expensive and the difference in usefulness is so lacking(?) they're not even gonna keep serving it in the API for long:

> GPT‑4.5 is a very large and compute-intensive model, making it more expensive than and not a replacement for GPT‑4o. Because of this, we’re evaluating whether to continue serving it in the API long-term as we balance supporting current capabilities with building future models. We look forward to learning more about its strengths, capabilities, and potential applications in real-world settings. If GPT‑4.5 delivers unique value for your use case, your feedback will play an important role in guiding our decision.

I'm still gonna give it a go, though.

swatcoder

> We look forward to learning more about its strengths, capabilities, and potential applications in real-world settings. If GPT‑4.5 delivers unique value for your use case, your feedback will play an important role in guiding our decision.

"We don't really know what this is good for, but spent a lot of money and time making it and are under intense pressure to announce new things right now. If you can figure something out, we need you to help us."

Not a confident place for an org trying to sustain a $XXXB valuation.

jodrellblank

> "Early testing shows that interacting with GPT‑4.5 feels more natural. Its broader knowledge base, improved ability to follow user intent, and greater “EQ” make it useful for tasks like improving writing, programming, and solving practical problems. We also expect it to hallucinate less."

"Early testing doesn't show that it hallucinates less, but we expect that putting that sentence nearby will lead you to draw a connection there yourself".

lovasoa

In the second handpicked example they give, GPT-4.5 says that "The Trojan Women Setting Fire to Their Fleet" by the French painter Claude Lorrain is renowned for its luminous depiction of fire. That is a hallucination.

There is no fire at all in the painting, only some smoke.

https://en.wikipedia.org/wiki/The_Trojan_Women_Set_Fire_to_t...

LeifCarrotson

That's some top-tier sales work right there.

I suck at and hate writing the mildly deceptive corporate puffery that seems to be in vogue. I wonder if GPT-4.5 can write that for me or if it's still not as good at it as the expert they paid to put that little gem together.

anoncareer0212

The link has data.

The link shows a significant reduction.

Grep for "hallucination", or see https://imgur.com/a/mkDxe78.

esafak

GPT-4.5 may be an awesome model, some say!

istjohn

According to a graph they provide, it does hallucinate significantly less on at least one benchmark.

willy_k

So they made a Claude that knows a bit more.

zaptrem

This seems like it should be attributed to better post training, not a bigger model.

justspacethings

The usage of "greater" is also interesting. It's like they are trying to say better, but greater is a geographic term and doesn't mean "better"; instead it's closer to "wider" or "covers more area."

pinkmuffinere

This is a very harsh take. Another interpretation is “We know this is much more expensive, but it’s possible that some customers do value the improved performance enough to justify the additional cost. If we find that nobody wants that, we’ll shut it down, so please let us know if you value this option”.

mechagodzilla

I think that's the right interpretation, but that's pretty weak for a company that's nominally worth $150B but is currently bleeding money at a crazy clip. "We spent years and billions of dollars to come up with something that's 1) very expensive, and 2) possibly better under some circumstances than some of the alternatives." There are basically free, equally good competitors to all of their products, and pretty much any company that can scrape together enough dollars and GPUs to compete in this space manages to 'leapfrog' the other half dozen or so competitors for a few weeks until someone else does it again.

riwsky

Said the quiet part out loud! Or as we say these days, “transparently exposed the chain of thought tokens”.

Terr_

"I knew the dame was trouble the moment she walked into my office."

"Uh... excuse me, Detective Nick Danger? I'd like to retain your services."

"I waited for her to get the the point."

"Detective, who are you talking to?"

"I didn't want to deal with a client that was hearing voices, but money was tight and the rent was due. I pondered my next move."

"Mr. Danger, are you... narrating out loud?"

"Damn! My internal chain of thought, the key to my success--or at least, past successes--was leaking again. I rummaged for the familiar bottle of scotch in the drawer, kept for just such an occasion."

---

But seriously: These "AI" products basically run on movie-scripts already, where the LLM is used to append more "fitting" content, and glue-code is periodically performing any lines or actions that arise in connection to the Helpful Bot character. Real humans are tricked into thinking the finger-puppet is a discrete entity.

These new "reasoning" models are just switching the style of the movie script to film noir, where the Helpful Bot character is making a layer of unvoiced commentary. While it may make the story more cohesive, it isn't a qualitative change in the kind of illusory "thinking" going on.

porridgeraisin

Lol, nice one

EA-3167

Maybe if they build a few more data centers, they'll be able to construct their machine god. Just a few more dedicated power plants, a lake or two, a few hundred billion more and they'll crack this thing wide open.

And maybe Tesla is going to deliver truly full self driving tech any day now.

And Star Citizen will prove to have been worth it all along, and Bitcoin will rain from the heavens.

It's very difficult to remain charitable when people seem to always be chasing the new iteration of the same old thing, and we're expected to come along for the ride.

sho_hn

You have it all wrong. The end game is a scalable, reliable AI work force capable of finishing Star Citizen.

At least this is the benchmark for super-human general intelligence that I propose.

philistine

> And Star Citizen will prove to have been worth it all along

Once they've implemented saccades in the eyeballs of the characters wearing helmets in spaceships millions of kilometres apart, then it will all have been worth it.

bloomingkales

Star Citizen is a working model of how to do UBI. That entire staff of a thousand people is the test case.

alyandon

  And Star Citizen will prove to have been worth it all along
Sounds like someone isn't happy with the 4.0 eternally incrementing "alpha" version release. :-D

I keep checking in on SC every 6 months or so and still see the same old bugs. What a waste of potential. Fortunately, Elite Dangerous is enough of a space game to scratch my space game itch.

mattgreenrocks

It's an honor to be dragged along so many ubermensch's Incredible Journeys.

bodegajed

Could this path lead to solving world hunger too? :)

mcswell

Correction: We're expected to pay for the ride, whether we choose to come along or not.

JohnMakin

leave star citizen out of this :)

crystal_revenge

> "We don't really know what this is good for, but spent a lot of money and time making it and are under intense pressure to announce new things right now. If you can figure something out, we need you to help us."

Having worked at my fair share of big tech companies (while preferring to stay in smaller startups), in so many of these tech announcements I can feel the pressure the PM had from leadership, and hear the quiet cries of the one to two experienced engineers on the team arguing sprint after sprint that "this doesn't make sense!"

riwsky

> the quiet cries of the one to two experienced engineers on the team arguing sprint after sprint that "this doesn't make sense!"

“I have five years of Cassandra experience—and I don’t mean the db”

spaceman_2020

Really don’t understand what’s the use case for this. The o series models are better and cheaper. Sonnet 3.7 smokes it on coding. Deepseek R1 is free and does a better job than any of OAI’s free models

NewUser76312

"We don't really know what this is good for, but spent a lot of money and time making it and are under intense pressure to announce new things right now. If you can figure something out, we need you to help us."

Damn this never worked for me as a startup founder lol. Need that Altman "rizz" or what have you.

financetechbro

Maybe you didn’t push hard enough on the impending doom your product would bring to society.

roarcher

AI in general is increasingly a solution in search of a problem, so this seems about right.

TeMPOraL

Only in the same sense as electricity is. The main tools apply to almost any activity humans do. It's already obvious that it's the solution to X for almost any X, but the devil is in the details - i.e. picking specific, simplest problems to start with.

harlanlewis

The price really is eye watering. At a glance, my first impression is this is something like Llama 3.1 405B, where the primary value may be realized in generating high quality synthetic data for training rather than direct use.

I keep a little google spreadsheet with some charts to help visualize the landscape at a glance in terms of capability/price/throughput, bringing in the various index scores as they become available. Hope folks find it useful, feel free to copy and claim as your own.

https://docs.google.com/spreadsheets/d/1foc98Jtbi0-GUsNySddv...

sfink

> feel free to copy and claim as your own.

That's a nice sentiment, but I'd encourage you to add a license or something. The basic "something" would be adding a canonical URL into the spreadsheet itself somewhere, along with a notification that users can do what they want other than removing that URL. (And the URL would be described as "the original source" or something, not a claim that the particular version/incarnation someone is looking at is the same as what is at that URL.)

The risk is that someone will accidentally introduce errors or unsupportable claims, and people with the modified spreadsheet won't know that it's not The spreadsheet and so will discount its accuracy or trustability. (If people are trying to deceive others into thinking it's the original, they'll remove the notice, but that's a different problem.) It would be a shame for people to lose faith in your work because of crap that other people do that you have no say in.

isoprophlex

That's... incredibly thorough. Wow. Thanks for sharing this.

6gvONxR4sf7o

Not just for training data, but for eval data. If you can spend a few grand on really good labels for benchmarking your attempts at making something feasible work, that’s also super handy.

swyx

> https://docs.google.com/spreadsheets/d/1foc98Jtbi0-GUsNySddv...

how do you do the different size circles and colored sequences like that? this is god tier skills

harlanlewis

hey, thank you! bubble charts, annotated with text and shapes using the Drawing tool. Working with the constraints of Google Sheets is its own challenge.

also - love the podcast, one of my favorites. the 3:1 io token price breakdown in my sheet is lifted directly from charts I've seen on latent space.

world2vec

Bubble charts?

bglusman

very impressive... also interested in your trip planner, it looks like invite only at the moment, but... would it be rude to ask for an invite?

gwyllimj

That is an amazing resource. Thanks for sharing!

taurath

What gets me is the whole cost structure is based on practically free services due to all the investor money. They’re not pulling in significant revenue with this pricing relative to what it costs to train the models, so the cost may be completely different if they had to recoup those costs, right?

senordevnyc

Hey, just FYI, I pasted your url from the spreadsheet title into Safari on macOS and got an SSL warning. Unfortunately I clicked through and now it works, so not sure what the exact cause looked like.

harlanlewis

I appreciate the bug report! Unfortunately this is a familiar and sporadically recurring issue with Netlify, which I should really move off of…

minimaxir

Sam Altman's explanation for the restriction is a bit fluffier: https://x.com/sama/status/1895203654103351462

> bad news: it is a giant, expensive model. we really wanted to launch it to plus and pro at the same time, but we've been growing a lot and are out of GPUs. we will add tens of thousands of GPUs next week and roll it out to the plus tier then. (hundreds of thousands coming soon, and i'm pretty sure y'all will use every one we can rack up.)

chefandy

I’m not an expert or anything, but from my vantage point, each passing release makes Altman’s confidence look more aspirational than visionary, which is a really bad place to be with that kind of money tied up. My financial manager is pretty bullish on tech so I hope he is paying close attention to the way this market space is evolving. He’s good at his job, a nice guy, and surely wears much more expensive underwear than I do— I’d hate to see him lose a pair powering on his Bloomberg terminal in the morning one of these days.

igor47

You're the one buying him the underwear. Don't index funds outperform managed investing? I think especially after accounting for fees, but possibly even after accounting for the fact that 50% of money managers are below average.

Terr_

> each passing release makes Altman’s confidence look more aspirational than visionary

As an LLM cynic, I feel that point passed long ago, perhaps even before Altman claimed countries would start wars to conquer the territory around GPU datacenters, or promoted the dream of a 7 T-for-trillion dollar investment deal, etc.

Alas, the market can remain irrational longer than I can remain solvent.

g-mork

release blog post author: this is definitely a research preview

ceo: it's ready

the pricing is probably a mixture of dealing with GPU scarcity and intentionally discouraging actual users. I can't imagine the pressure they must be under to show they are releasing and staying ahead, but Altman's tweet makes it clear they aren't really ready to sell this to the general public yet.

pk-protect-ai

Yep, that's the thing: they're not ahead anymore, and haven't been since last summer at least. Yes, they probably have the largest customer base, but their models haven't been the best for a while now.

rebolek

Bad news: Sam Altman runs the show.

hn_throwaway_99

The price is obviously 15-30x that of 4o, but I'd just posit that there are some use cases where it may make sense. It probably doesn't make sense for the "open-ended consumer facing chatbot" use case, but for other use cases that are fewer and higher value in nature, it could, if its abilities are considerably better than 4o's.

For example, there are now a bunch of vendors that sell "respond to RFP" AI products. The number of RFPs that any sales organization responds to is probably no more than a couple a week, but it's a very time-consuming, laborious process. But the payoff is obviously very high if a response results in a closed sale. So here paying 30x for marginally better performance makes perfect sense.

I can think of a number of similar "high value, relatively low occurrence" use cases like this where the pricing may not be a big hindrance.

superq

Complete legal arguments as well. If I was an attorney, I'd love to have a sophisticated LLM write my crib notes for anything I might do or say in the court room, or even the complete direction that I'd take my case. For some cases, that'd be worth almost any price.

janoc

And for which use cases will that make sense, then?

Esp. when they aren't even sure whether they will commit to offering this long term? Who would be insane enough to build a product on top of something that may not be there tomorrow?

Those products require some extensive work, such as model finetuning on proprietary data. Who is going to invest time & money into something like that when OpenAI says right out of the gate that they may not support this model for very long?

Basically OpenAI is telegraphing that this is yet another prototype that escaped a lab, not something that is actually ready for use and deployment.

Manouchehri

Yeah, agreed.

We’re one of those types of customers. We wrote an OpenAI API compatible gateway that automatically batches stuff for us, so we get 50% off for basically no extra dev work in our client applications.

I don’t care about speed, I care about getting the right answer. The cost is fine as long as the output generates us more profit.
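The 50% discount presumably refers to OpenAI's Batch API, which trades an up-to-24-hour turnaround for half-price tokens. A minimal sketch of the flow such a gateway would wrap (prompts and file names are placeholders):

  # Submit queued requests through OpenAI's Batch API (50% off, <=24h window).
  import json
  from openai import OpenAI

  client = OpenAI()

  # One JSONL line per queued request.
  with open("batch.jsonl", "w") as f:
      for i, prompt in enumerate(["question one", "question two"]):
          f.write(json.dumps({
              "custom_id": f"req-{i}",
              "method": "POST",
              "url": "/v1/chat/completions",
              "body": {"model": "gpt-4o",
                       "messages": [{"role": "user", "content": prompt}]},
          }) + "\n")

  batch_file = client.files.create(file=open("batch.jsonl", "rb"), purpose="batch")
  batch = client.batches.create(
      input_file_id=batch_file.id,
      endpoint="/v1/chat/completions",
      completion_window="24h",
  )
  # Poll client.batches.retrieve(batch.id) until status == "completed",
  # then fetch the output file.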

nunez

RFP automation software has existed for a very long time. Anyone who spends lots of time on RFPs has this.

serjester

I suppose this was their final hurrah after two failed attempts at training GPT-5 with the traditional pre-training paradigm. Just confirms reasoning models are the only way forward.

crystal_revenge

> Just confirms reasoning models are the only way forward.

Reasoning models are roughly the equivalent of allowing Hamiltonian Monte Carlo samplers to "warm up" (i.e. start sampling from the typical set). This, unsurprisingly, yields better results (after all, LLMs are just fancy Monte Carlo models in the end). However, it is extremely unlikely this improvement comes without fairly firm limits. Letting your HMC warm up is essential to good sampling, but letting it warm up even longer doesn't result in radically better sampling.

While there have been impressive results in the efficiency of sampling from the typical set in LLMs these days, we're clearly no longer making major improvements in the capabilities of these models.

int_19h

Reasoning models can solve tasks that non-reasoning ones were unable to; how is that not an improvement? What constitutes "major" is subjective - if a "minor" improvement in overall performance means that the model can now successfully perform a task it was unable to solve before, that is a major advancement for that particular task.

granzymes

> Compared to OpenAI o1 and OpenAI o3‑mini, GPT‑4.5 is a more general-purpose, innately smarter model. We believe reasoning will be a core capability of future models, and that the two approaches to scaling—pre-training and reasoning—will complement each other. As models like GPT‑4.5 become smarter and more knowledgeable through pre-training, they will serve as an even stronger foundation for reasoning and tool-using agents.

DebtDeflation

GPT 5 is likely just going to be a router model that decides whether to send the prompt to 4o, 4o mini, 4.5, o3, or o3 mini.
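Such a router is easy to sketch: a cheap classifier picks a label, and the request is dispatched to the matching backend. The labels and model mapping below are illustrative assumptions, not anything OpenAI has announced:

  # Hypothetical prompt router: classify cheaply, then dispatch.
  from openai import OpenAI

  client = OpenAI()

  ROUTES = {
      "simple": "gpt-4o-mini",
      "general": "gpt-4o",
      "reasoning": "o3-mini",
      "open_ended": "gpt-4.5-preview",
  }

  def answer(prompt: str) -> str:
      label = client.chat.completions.create(
          model="gpt-4o-mini",
          messages=[
              {"role": "system", "content":
               "Classify the request as one of: simple, general, reasoning, "
               "open_ended. Reply with the label only."},
              {"role": "user", "content": prompt},
          ],
      ).choices[0].message.content.strip()
      model = ROUTES.get(label, "gpt-4o")  # fall back if the label is unrecognized
      resp = client.chat.completions.create(
          model=model, messages=[{"role": "user", "content": prompt}]
      )
      return resp.choices[0].message.content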

swores

My guess is that you're right about that being what's next (or maybe almost next) from them, but I think they'll save the name GPT-5 for the next actually-trained model (like 4.5 but a bigger jump), and use a different kind of name for the routing model.

Even by their poor standards at naming it would be weird to introduce a completely new type/concept, that can loop in models including the 4 / 4.5 series, while naming it part of that same series.

My bet: probably something weird like "oo1", or I suspect they might try to give it a name that sticks for people to think of as "the" model - either just calling it "ChatGPT", or coming up with something new that sounds more like a product name than a version number (OpenCore, or Central, or... whatever they think of)

lolinder

Except minus 4.5, because at these prices and results there's essentially no reason not to just use one of the existing models if you're going to be dynamically routing anyway.

jstummbillig

What it confirms, I think, is that we are going to need a lot more chips.

georgemcbay

Further confirmation, IMO, that the idea that any of this leads to anything close to AGI is people getting high on their own supply (in some cases literally).

LLMs are a great tool for what is effectively collected-knowledge search and summary (so long as you are willing to accept that you have to verify all of the 'knowledge' they spit back, because they always have the ability to go off the rails), but they have been hitting the limits on how much better that can get without somehow introducing more real knowledge for close to two years now. Everything since then has been super incremental: IME mostly benchmark gains and hype, as opposed to being purely better.

I personally don't believe that more GPUs solves this, like, at all. But its great for Nvidia's stock price.

prisenco

Or, possibly, we're stuck waiting for another theoretical breakthrough before real progress is made.

DannyBee

Eh, no. More chips won't save this right now, or probably in the near future (IE barring someone sitting on a breakthrough right now).

It just means either

A. Lots and lots of hard work that get you a few percent at a time, but add up to a lot over time.

or

B. Completely different approaches that people actually think about for a while rather than trying to incrementally get something done in the next 1-2 months.

Most fields go through this stage. Sometimes more than once as they mature and loop back around :)

Right now, AI seems bad at doing either - at least, from the outside of most of these companies, and watching open source/etc.

While lots of little improvements seem to be released in lots of parts, it's rare to see anywhere that is collecting and aggregating them en masse and putting them in practice. It feels like for every 100 research papers, maybe 1 makes it into something in a way that anyone ends up using it by default.

This could be because they aren't really even a few percent (which would be yet a different problem, and in some ways worse), or it could be because nobody has cared to, or ...

I'm sure very large companies are doing a fairly reasonable job on this, because they historically do, but everyone else - even frameworks - it's still in the "here's a million knobs and things that may or may not help".

It's like if compilers had no "O0/O1/O2/O3" at all and were just like "here's 16,283 compiler passes - you can put them in any order and amount you want". Thanks! I hate it!

It's worse even because it's like this at every layer of the stack, whereas in this compiler example, it's just one layer.

At the rate of claimed improvements by papers in all parts of the stack, either lots and lots is being lost because this is happening, in which case those percentages eventually add up to enough for someone to use to kill you, or nothing is being lost, in which case people appear to be wasting untold amounts of time and energy and then trying to bullshit everyone else, and the field as a whole appears to be doing nothing about it. That seems, in a lot of ways, even worse. FWIW - I already know which one the cynics of HN believe, you don't have to tell me :P. This is obviously also presented as black and white, but the in-betweens don't seem much better.

Additionally, everyone seems to rush half-baked things out the door to try to get the next incremental improvement released, because they think it will help them stay "sticky" or whatever. History does not suggest this is a good plan, and even if it were a good plan in theory, it's pretty hard to lock people in with what exists right now. There isn't enough anyone cares about, and rushing out half-baked crap is not helping that. Mindshare doesn't really matter if no one cares about using your product.

Does anyone using these things truly feel locked into anyone's ecosystem at this point? Do they feel like they will be soon?

I haven't met anyone who feels that way, even in corps spending tons and tons of money with these providers.

The public companies - I can at least understand, given the fickleness of public markets. That was supposed to be one of the serious benefits of staying private. So watching private companies do the same thing - it's just sort of mind-boggling.

Hopefully they'll grow up soon, or someone who takes their time and does it right during one of the lulls will come and eat all of their lunches.

usaar333

For OpenAI, perhaps? Sonnet 3.7 without extended thinking is quite strong. SWE-bench scores tie o3's.

stavros

How do you read those scores? I wanted to see how well 3.7 with thinking did, but I can't even read that table.

newfocogi

I think this is the correct take. There are other axes to scale on AND I expect we'll see smaller and smaller models approach this level of pre-trained performance. But I believe massive pre-training gains have hit clearly diminished returns (until I see evidence otherwise).

sebastiennight

I think it's fairer to compare it to the original GPT-4, which might be the equivalent in terms of "size" (though we don't have actual numbers for either).

GPT-4: Input $30.00 / 1M tokens ; Output $60.00 / 1M tokens

So 4.5 is 2.5x more expensive.

I think they announced this as their last non-reasoning model, so maybe the goal was to stretch pre-training as far as they could, just to see what new capabilities would show up. We'll find out as the community gives it a whirl.

I'm a Tier 5 org and I have it available already in the API.

minimaxir

The marginal costs for running a GPT-4-class LLM are much lower nowadays due to significant software and hardware innovations since then, so costs/pricing are harder to compare.

sebastiennight

Agreed, however it might make sense that a much-larger-than-GPT-4 LLM would also, at launch, be more expensive to run than the OG GPT-4 was at launch.

(And I think this is probably also scarecrow pricing to discourage casual users from clogging the API since they seem to be too compute-constrained to deliver this at scale)

spoaceman7777

There are some numbers on one of their Blackwell or Hopper info pages noting their hardware's ability to host an unnamed GPT model that is 1.8T params. My assumption was that it referred to GPT-4.

Sounds to me like GPT 4.5 likely requires a full Blackwell HGX cabinet or something, thus OpenAI's reference to needing to scale out their compute more (Supermicro only opened up their Blackwell racks for General Availability last month, and they're the prime vendor for water-cooled Blackwell cabinets right now, and have the ability to throw up a GPU mega-cluster in a few weeks, like they did for xAI/Grok)

jstummbillig

Why would that be fairer? We can assume they did incorporate all learnings and optimizations they made post gpt-4 launch, no?

jychang

Definitely not. They don't distill their original models. 4o is a much more distilled and cheaper version of 4. I assume 4.5o would be a distilled and cheaper version of 4.5.

It'd be weird to release a distilled version without ever releasing the base undistilled version.

sebastiennight

Not necessarily.

If this huge model has taken months to pre-train and was expected to be released before, say, o3-mini, you could definitely have some last-minute optimizations in o3-mini that were not considered at the time of building the architecture of gpt-4.5.

OldGreenYodaGPT

2x that price for the 32k context via API at launch. So nearly the same price, but you get 4x the context

Culonavirus

Honestly, if long context (that doesn't start to degrade quickly) is what you're after, I would use Grok 3 (not sure when the API version releases, though). Over the last week or so I've had a massive thread of conversation with it that started with plenty of my project's relevant code (as in a couple hundred lines), and several days later, after like 20 question-answer blocks, you ask it something and it answers "since you're doing that this way, and you said you want x, y and z, here are your options blabla"... It's like thinking Gemini but better.

Also, unlike Gemini (and others) it seems to have a much more recent data cutoff. Try asking about some language feature / library / framework that was released recently (say 3 months ago) and most of the models shit the bed, use older versions of the thing, or just start to imitate what the code might look like. For example, try asking Gemini if it can generate Tailwind 4 code: it will tell you that its training cutoff is like October or something, that Tailwind 4 "isn't released yet", and that it can try to imitate what the code might look like. Uhhhhhh, thanks I guess??

wavemode

This has been my suspicion for a long time - OpenAI have indeed been working on "GPT5", but training and running it is proving so expensive (and its actual reasoning abilities only marginally stronger than GPT4) that there's just no market for it.

It points to an overall plateau being reached in the performance of the transformer architecture.

camdenreslink

That would certainly reduce my anxiety about the future of my chosen profession.

malthaus

But while there is a plateau in the transformer architecture, what you can do with those base models by further finetuning / modifying / enhancing them is still largely unexplored, so I still predict mind-blowing enhancements yearly for the foreseeable future. Whether they validate OpenAI's valuation and investment needs is a different question.

goatlover

Certainly hope so. The tech billionaires are a little too excited to achieve AGI and replace the workforce.

shoubidouwah

TBH, with the safety/alignment paradigm we have, workforce replacement was not my top concern when we hit AGI. A pause / lull in capabilities would be hugely helpful so that we can figure out how not to die along with the lightcone...

JohnnyMarcone

I feel like this period has shown that we're not quite ready for a machine god. We'll see if RL hits a wall as well.

ur-whale

AI as it stands in 2025 is an amazing technology, but it is not a product at all.

As a result, OpenAI simply does not have a business model, even if they are trying to convince the world that they do.

My bet is that they're currently burning through other people's capital at an amazing rate, but that they are light-years from profitability.

They are also being chased by fierce competition and OpenSource which is very close behind. There simply is no moat.

It will not end well for investors who sunk money in these large AI startups (unless of course they manage to find a Softbank-style mark to sell the whole thing to), but everyone will benefit from the progress AI will have made during the bubble.

So, in the end, OpenAI will have, albeit very unwillingly, fulfilled their original charter of improving humanity's lot.

emptysongglass

I've been a Plus user for a long time now. My opinion is there is very much a ChatGPT suite of products that come together to make for a mostly delightful experience.

Three things I use all the time:

- Canvas for proofing and editing my article drafts before publishing. This has replaced an actual human editor for me.

- Voice for all sorts of things, mostly for thinking out loud about problems or a quick question about pop culture, what something means in another language, etc. The Sol voice is so approachable for me.

- GPTs I can use for things like D&D adventure summaries I need in a certain style every time without any manual prompting.

whiplash451

Except that if OpenAI goes bust, very little of what they did will actually be released to humankind.

So their contribution was really to fuel a race for open source (which they contributed little to). A pretty convoluted argument.

jsheard

> My bet is that they're currently burning through other people's capital at an amazing rate, but that they are light-years from profitability

The Information leaked their internal projections a few months ago, and apparently their own estimates have them losing $44B between then and 2029 when they expect to finally turn a profit, maybe.

j_maffe

That's surprisingly small

nyarlathotep_

> AI as it stands in 2025 is an amazing technology, but it is not a product at all.

Here I'm assuming "AI" to mean what's broadly called Generative AI (LLMs, photo, video generation)

I genuinely am struggling to see what the product is too.

The code assistant use cases are really impressive across the board (and I'm someone who was vocally against them less than a year ago), and I pay for Github CoPilot (for now) but I can't think of any offering otherwise to dispute your claim.

It seems like companies are desperate to find a market fit, and shoving the words "agentic" everywhere doesn't inspire confidence.

Here's the thing: I remember people lining up around the block for iPhone releases, XBox launches, hell even Grand Theft Auto midnight releases.

Is there a market of people clamoring to use/get anything GenAI related?

If any/all LLM services went down tonight, what's the impact? Kids do their own homework?

JavaScript programmers have to remember how to write React components?

Compare that with Google Maps disappearing, or similar.

LLMs are in a position where they're forced onto people and most frankly aren't that interested. Did anyone ASK for Microsoft throwing some Copilot things all over their operating system? Does anyone want Apple Intelligence, really?

otabdeveloper4

> I genuinely am struggling to see what the product is too.

They're nice for summarizing and categorizing text. We've had good solutions for that before, too (BERT, et al.), but LLMs are marginally nicer.

> Is there a market of people clamoring to use/get anything GenAI related?

No. LLMs are lame and uncool. Kids especially dislike them a lot on that basis alone.

planetafro

I think search and chat are decent products as well. I am a Google subscriber, and I just use Gemini as a replacement for search without ads. To me, this movement accelerated paid search in an unexpected way. I know the detractors will cry "hallucinations" and the like. I would counter with an argument about the state of the current web, besieged by ads and misinformation. If people carry a reasonable amount of skepticism in all things, this is a fine use case. Trust but verify.

I do worry about model poisoning with fake truths but dont feel we are there yet.

beefnugs

Yes: the real truth is, if there really were a good AI created, we wouldn't even know it existed until a billion-dollar company took over some industry with only a handful of developers in the entire company. Only then would hints spill out into the world that it's possible.

No "good" AI will ever be open to everyone and relatively cheap, this is the same phenomenon as "how to get rich" books

vineyardmike

> As a result, OpenAI simply does not have a business model, even if they are trying to convince the world that they do.

They have a super popular subscription service. If they keep iterating on the product enough, they can lag on the models. The business is the product, not the models and not the API. Subscriptions are pretty sticky once you start getting your data entrenched in them. I keep my ChatGPT subscription because it’s the best app on Mac and has already started to “learn me” through the memory and tasks features.

Their app experience is easily the best among their competitors (Grok, Claude, etc.), which is a clear sign they know the product is what to sell. Things like DeepResearch and related features are the way they’ll make it a sustainable business: adding value-on-top experiences that drive differentiation over commodities. Gemini is the only competitor that compares, because it’s everywhere in Google surfaces. OpenAI’s pro tier will surely continue to get better; I think more LLM-enabled features will continue to be a differentiator. The biggest challenge will be continuing distribution, and new features requiring interfacing with third parties to be more “agentic”.

Frankly, I think they have enough strength in product with their current models today that even if model training stalled it’d be a valuable business.

yard2010

Sir, they are selling text by the ounce, just like farmers sold tomatoes before Walmart. How is that not a business model?

netdevphoenix

Is it official then?

Most of us have been waiting for this moment for a while. The transformer architecture as it is currently understood can't be milked any further. Many of us have known this since last year. GPT-5 delays eventually led even non-tech voices to suggest likewise. But we all withheld our final judgment until the next big release from OpenAI, as Sam Altman has been making claims about AGI entering the workforce this year, OpenAI knowing how to build AGI, and similar outlandish claims. We all knew that their next big release in 2025 would be the final deciding factor on whether they had some tech breakthrough that would upend the world (justifying their astronomical valuation) or whether it would just be (slightly) more of the same (marking the beginning of their downfall).

The GPT-4.5 release points towards the latter. Thus, we should not expect OpenAI to exist as it does now (AI industry leader) in 2030, assuming it does exist at all by then.

However, just like the 19th century rail industry revolution, the fall of OpenAI will leave behind a very useful technology that while not catapulting humanity towards a singularity, will nonetheless make people's lives better. Not much consolation to the world's super rich who will lose tons of money once the LLM industry (let us remember that AI is not LLM) falls.

EDIT: "will nonetheless make people's lives better" to "might nonetheless make some people's lives better"

PaulRobinson

It's worth pointing out that GPT-4.5 seems focused on better pre-training and doesn't include reasoning.

I think GPT-5 - if/when it happens - will be 4.5 with reasoning, and as such it will feel very different.

The barrier, is the computational cost of it. Once 4.5 gets down to similar costs to 4.0 - which could be achieved through various optimization steps (what happened to the ternary stuff that was published last year that meant you could go many times faster without expensive GPUs?), and better/cheaper/more efficient hardware, you can throw reasoning into the mix and suddenly have a major step up in capability.

I am a user, not a researcher or builder. I do think we're in a hype bubble, I do think that LLMs are not The Answer, but I also think there is more mileage left in this path than you seem to. I think automated RL (not HF), reasoning, and better/optimal architectures and hardware mean there is a lot more we can get out of the stochastic parrots yet.

highfrequency

Is it fair to still call LLMs stochastic parrots now that they are enriched with reasoning? Seems to me that the simple procedure of large-scale sampling + filtering makes it immediately plausible to get something better than the training distribution out of the LLM. In that sense the parrot metaphor seems suddenly wrong.

I don’t feel like this binary shift is adequately accounted for among the LLM cynics.
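The sampling-plus-filtering procedure is essentially best-of-n with an external verifier. A minimal sketch, where score() stands in for whatever check is available (unit tests, a reward model, etc.):

  # Best-of-n: sample n candidates, keep the one the verifier scores highest.
  from openai import OpenAI

  client = OpenAI()

  def score(candidate: str) -> float:
      # Placeholder for an external check, e.g. running tests or a reward model.
      raise NotImplementedError

  def best_of_n(prompt: str, n: int = 16) -> str:
      resp = client.chat.completions.create(
          model="gpt-4o",
          messages=[{"role": "user", "content": prompt}],
          n=n,              # n independent completions in one call
          temperature=1.0,  # keep diversity high so the filter has options
      )
      return max((c.message.content for c in resp.choices), key=score)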

whimsicalism

it was never fair to call them stochastic parrots and anybody who is paying any attention knows that sequence models can generalize at least partially OOD

JohnKemeny

They are not enriched with reasoning, it's just snake oil, I'm afraid.

glenstein

>the barrier, is the computational cost of it. Once 4.5 gets down to similar costs to 4.0

Well, did 4.0 ever become lower cost? On the API side, its cost per token is a factor of 10 higher than 4o's, even though 4o is considered the better model.

I think 4.5 may just be retired wholesale, or perhaps a new model derived from it that is more efficient, a 4.5mini or something like that.

km144

I'm not convinced that LLMs in their current state are really making anyone's lives much better, though. We really need more research applications for this technology for that to become apparent. Polluting the internet with regurgitated garbage produced by a chat bot does not benefit the world. Increasing the productivity of software developers does not help the world. Solving more important problems should be the priority for this type of AI research & development.

pera

The explosion of garbage content is a big issue and has radically changed the way I use the web over the past year: Google and DuckDuckGo are not my primary tools anymore. Instead I am now using specialized search engines more and more. For example, if I am looking for something I believe can be found in someone's personal blog, I just use Marginalia or Mojeek; for software issues I use GitHub's search; for general info I go straight to Wikipedia; for tech reviews, HN's Algolia; etc.

It might sound a bit cumbersome but it's actually super easy if you assign search keywords in your browser: for instance if I am looking for something on GitHub I just open a new tab on Firefox and type "gh tokio".

Workaccount2

LLM's have been extremely useful for me. They are incredibly powerful programmers, from the perspective of people who aren't programmers.

Just this past week claude 3.7 wrote a program for us to use to quickly modernize ancient (1990's) proprietary manufacturing machine files to contemporary automation files.

This allowed us to forgo a $1k/yr/user proprietary software package that would be able to do the same. The program Claude wrote took about 30 mins to make. Granted the program is extremely narrow in scope, but it does the one thing we need it to do.

This marks the third time I (a non-progammer) have used an LLM to create software that my company uses daily. The other two are a test system made by GPT-4 and an android app made by a mix of 4o and claude 3.5.

Bumpers may be useless and laughable to pro bowlers, but a godsend to those who don't really know what they are doing. We don't need to hire a bowler to knock over pins anymore.

Kye

Being able to quickly get a script for some simple automation, defining source and target formats in plain English, has been a huge help. There is simply no way I'm going to remember all that stuff as someone who doesn't program regularly, so the previous way to deal with it was to do it all manually. It was quicker than doing remedial Python just to forget it all again.

unshavedyak

I've also been toying with Claude Code recently, and I (as an eng, ~10 yrs) think they are useful for pair programming the dumb work.

E.g., as I've been trying Claude Code, I still feel the need to babysit it on my primary work, and so I'd rather do that myself. However, if while I'm working it could sit there and monitor, note fixes, tests, and documentation, and then stub them in during breaks, I think there's a lot of time savings to be gained.

I.e., keep it doing the simple tasks that it can get right 99% of the time, and get them out of the way.

I also suspect there's context to be gained in watching the human work. Not learning per se, but understanding the areas being worked on, improving intuition on things the human needs or cares about, etc.

A `cargo lint --fix` on steroids is "simple" but still really sexy imo.

km144

I think that's great for work and great for corporations. I use AI at my job too, and I think it certainly does increase productivity!

How does any of this make the world a better place? CEOs like Sam Altman have very lofty ideas about the inherent potential "goodness" of higher-order artificial intelligence that I find have thus far not been borne out in reality, save in a few specific cases. Useful is not the same as good. Technology is inherently useful; that does not make it good.

dgsm98

> Solving more important problems should be the priority for this type of AI research & development.

Which problem spaces do you think are underserved in this aspect?

rjinman

As someone who is terrified of agentic ASI, I desperately hope this is true. We need more time to figure out alignment.

cle

I'm not sure this will ever be solved. It requires both a technical solution and social consensus. I don't see consensus on "alignment" happening any time soon. I think it'll boil down to "aligned with the goals of the nation-state", and lots of nation states have incompatible goals.

rjinman

I agree unfortunately. I might be a bit of an extremist on this issue. I genuinely think that building agentic ASI is suicidally stupid and we just shouldn’t do it. All the utopian visions we hear from the optimists describe unstable outcomes. A world populated by super-intelligent agents will be incredibly dangerous even if it appears initially to have gone well. We’ll have built a paradise in which we can never relax.

drdaeman

It doesn't do anyone any good to stress over non-existent things. ASI is a sci-fi trope, a pure fantasy in context of present day and time. AGI does not exist either, and AFAIK there's not even any agreement what it possibly means beyond very vague "no worse than a human".

In other words, I'm sure you're terrified of a modern fairy tale.

rgbrenner

"alignment" is a bs term made up to deflect blame from the overpromises the AI companies made to hype up their product to obtain their valuations.

DirkH

Big take given how much AI companies hate alignment folks.

fergonco

> will nonetheless make people's lives better

Probably not the lives of translators or graphic designers or music composers. They will have to find new jobs. As LLM prompt engineers, I guess.

yurishimo

Graphic designers I think are safe, at least within organizations that require a cohesive brand strategy. Getting the AI to respect all of the previous art will be a challenge at a certain scale.

Fiverr graphic designers on the other hand…

andy_ppp

Getting graphic designers to use the design system that they invented is quite a challenge too if I'm honest... should we really expect AI to be better than people? Having said that AI is never going to be adept at knowing how and when to ignore the human in the loop and do the "right" thing.

bearjaws

There are people generating mostly consistent AI porn models using LoRA; the same strategy could be used to bias a model towards consistent output for corporate branding.

Even if its not perfect, many startups will be using AI to generate their branding for the first 5 years and put others out of a job.

Right now the tools are primitive, but leave it to the internet to pioneer the way with porn...

whimsicalism

absolutely a solvable problem even with no tech advances

vbezhenar

I feel like this was GPT-5, eventually renamed to keep expectations in check.

diego_sandoval

> OpenAI knowing how to build AGI and similar outlandish claims.

The fact that the scaling of pretrained models is hitting a wall doesn't invalidate any of those claims. Everyone in the industry is now shifting towards reasoning models (a.k.a. chain of thought, a.k.a. inference time reasoning, etc.) because it keeps scaling further than pretraining.

Sam said the phrase you refer to [1] in January, when OpenAI had already released o1 and was preparing to release o3.

[1] https://blog.samaltman.com/reflections

sebzim4500

This seems very dramatic given OpenAI still has the best model in the world `o3`.

oblio

The best model in the world is still basically a very stubborn, yet mediocre 16 year old with a memory the size of the internet.

entropi

> will nonetheless make people's lives better

While I mostly agree with your assessment, I am still not convinced of this part. Right now, it may be making our lives marginally better. But once the enshittification starts to set in, I think it has the potential to make things a lot worse.

E.g. I think the advertisement industry will just love the idea of product placements and whatnots into the AI assistant conversations.

dkdcwashere

*Good*. The answer to this is legislation: legally, stop allowing shitty ads everywhere, all the time. I hope these problems we already have are exacerbated by the ease of generating content with LLMs, and people actually have to think for themselves again.

simonw

I got gpt-4.5-preview to summarize this discussion thread so far (at 324 comments):

  hn-summary.sh 43197872 -m gpt-4.5-preview
Using this script: https://til.simonwillison.net/llms/claude-hacker-news-themes...

Here's the result: https://gist.github.com/simonw/5e9f5e94ac8840f698c280293d399...

It took 25797 input tokens and 1225 output tokens, for a total cost (calculated using https://tools.simonwillison.net/llm-prices ) of $2.11! It took 154 seconds to generate.
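The arithmetic checks out against the posted prices, within rounding:

  input_cost = 25_797 / 1_000_000 * 75.00    # ~= $1.93
  output_cost = 1_225 / 1_000_000 * 150.00   # ~= $0.18
  print(round(input_cost + output_cost, 2))  # 2.12, i.e. the ~$2.11 reported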

djhworld

interesting summary but it's hard to gauge whether this is better/worse than just piping the contents into a much cheaper model.

azinman2

It’d be great if someone would do that with the same data and prompt to other models.

I did like the formatting and attributions but didn’t necessarily want attributions like that for every section. I’m also not sure if it’s fully matching what I’m seeing in the thread but maybe the data I’m seeing is just newer.

alew1

Didn't seem to realize that "Still more coherent than the OpenAI lineup" wouldn't make sense out of context. (The actual comment quoted there is responding to someone who says they'd name their models Foo, Bar, Baz.)

willy_k

Wonder if there’s some pro-OpenAI system prompt getting in the way of that.

jdiff

It'd be a silly move considering how fast system prompts leak.

egobrain27

Seems to have trouble recognizing sarcasm:

"For example, there are now a bunch of vendors that sell 'respond to RFP' AI products... paying 30x for marginally better performance makes perfect sense." — hn_throwaway_99 (an uncommon opinion supporting possible niche high-cost uses).

mlyle

? You think hn_throwaway_99's comment is sarcastic? It makes perfect sense to me read "straight."

That is, sales orgs save a bunch of money using AI to respond to RFPs; they would still save a bunch of money using a more expensive AI, and any marginal improvement in sales closed would pay for it.

It may have excessively summarized his comment, which confused you, but this is the kind of mistake human curators of quotes make, too.

sebzim4500

I don't think they are being sarcastic. Maybe you are the bot /s

3vidence

I don't know why but something about this section made me chuckle

""" These perspectives highlight that there remains nuance—even appreciation—of explorative model advancement not solely focused on immediate commercial viability """

Feels like the model is seeking validation

munksbeer

As expected, comments on LLM threads are overwhelmingly negative.

Personally, I still feel excited to see boundaries being pushed, however incremental our anecdotal opinions make them seem.

sundarurfriend

I disagree with most of the knee-jerk negativity in LLM threads, but in this case it mostly seems warranted. There are no "boundaries being pushed" here, this is just a desperate release from a company that finds itself losing more and more mindshare to other models and companies.

stevage

Huh. Disregarding the 4.5-specific bit here, a browser extension or possibly website that did this in general could be really useful.

Maybe even something that just noticed whenever you visited a site that had had significant HN discussion in the past, then let you trigger a summary.
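The lookup half of that extension is straightforward with the public HN Algolia search API; a minimal sketch:

  # Ask HN's Algolia API whether a URL has been submitted, and keep only
  # stories with a meaningful amount of discussion.
  import requests

  def hn_discussions(url: str, min_comments: int = 20) -> list[dict]:
      resp = requests.get(
          "https://hn.algolia.com/api/v1/search",
          params={"query": url,
                  "restrictSearchableAttributes": "url",
                  "tags": "story"},
          timeout=10,
      )
      resp.raise_for_status()
      return [
          {"title": hit["title"],
           "comments": hit["num_comments"],
           "thread": f"https://news.ycombinator.com/item?id={hit['objectID']}"}
          for hit in resp.json()["hits"]
          if (hit.get("num_comments") or 0) >= min_comments
      ]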

ukuina

My site https://hackyournews.com does this!

Been keeping it alive and free for 18 months.

zigman1

Wow I find this very useful, thanks! Bookmarked.

wordpad25

There are literally hundreds of extensions and sites that do this.

The problem is that they compete each other into the ground, hence they go unmaintained very quickly.

getrecall.ai has been the most mature so far.

ruibiks

Hey, check this one out among all the different flavors out there; I think I made something better: https://cofyt.app

Feel free to test it head-to-head. As far as I am aware, this is better than getrecall, and you can chat with a transcript to get detailed answers to your prompts.

bredren

Hundreds that specifically focus on noticing a page you’re currently viewing has been not only posted to but undergone significant discussion on HN, and then providing a summary of those conversations?

Or that just provide summaries in general?

stevage

Thanks, it's amazing how much stuff is out there I don't know about.

sebastiennight

What I want is something that can read the thread out loud to me, using a different voice per user, so I can listen to a busy discussion thread like I would listen to a podcast.

joe_the_user

The headline and section "Dystopian and Social Concerns about AI Features" are interesting. It's roughly true... but somehow that broad statement seems to minimize the point discussed.

I'd headline that thread as "Concerns about output tone". There were comments about dystopian implications of tone, marketing implications of tone and implementation issues of tone.

Of course, that I can comment about the fine-points of an AI summary shows it's made progress. But there's a lot riding on how much progress these things can make and what sort. So it's still worth looking at.

colordrops

Maybe it's just confirmation bias, but the language in your result output seems higher quality than previous models'. Seems more natural and eloquent.

Topfi

Considering both this blog post and the livestream demos, I am underwhelmed. Having just finished the stream, I had a real "was that all" moment, which on one hand shows how spoiled I've gotten by new models impressing me, but on another feels like OpenAI really struggles to stay ahead of their competitors.

What has been shown feels like it could be achieved using a custom system prompt on older versions of OpenAIs models, and I struggle to see anything here that truly required ground-up training on such a massive scale. Hearing that they were forced to spread their training across multiple data centers simultaneously, coupled with their recent release of SWE-Lancer [0] which showed Anthropic (Claude 3.5 Sonnet (new) to be exact) handily beating them, I was really expecting something more than "slightly more casual/shorter output", which again, I fail to see how that wasn't possible by prompting GPT-4o.

Looking at pricing [1], I am frankly astonished.

> Input: $75.00 / 1M tokens
> Cached input: $37.50 / 1M tokens
> Output: $150.00 / 1M tokens

How could they justify that asking price? And, if they have some amazing capabilities that make a 30-fold pricing increase justifiable, why not show them? Like, OpenAI are many things, but I always felt they understood price vs. performance incredibly well, from the start with gpt-3.5-turbo up to now with o3-mini, so this really baffles me. If GPT-4.5 can justify such immense cost in certain tasks, why hide that? And if not, why release this at all?

[0] https://github.com/openai/SWELancer-Benchmark

[1] https://openai.com/api/pricing/

mvdtnz

> How could they justify that asking price?

They're still selling $1 of compute for less than $1. Like personal food delivery before it, consumers will eventually need to wake up to this fact: these things will get expensive, fast.

josh-sematic

One difference with food delivery/ride share: those can only have costs reduced so far. You can only pick up groceries and drive from A to B so quickly. And you can only push the wages down so far before you lose your gig workers. Whereas with these models we’ve consistently seen that a model inference that cost $1 several months ago can now be done with much less than $1 today. We don’t have any principled understanding of “we will never be able to make these models more efficient than X”, for any value of X that is in sight. Could the anticipated efficiencies fail to materialize? It’s possible but I personally wouldn’t put money on it.

phillipcarter

I read this more as "we are releasing a model checkpoint that we didn't optimize yet because Anthropic cranked up the pressure"

sebzim4500

This is often claimed on HN but there is no evidence that it is actually true.

sama has tweeted that they lose money on pro, but in general according to leaks chatgpt subscriptions are quite profitable. The reason the company isn't profitable in general is they spend billions on R&D.

Ekaros

I generally question how widespread the willingness to pay for the most expensive product is. And will most users who actually want AI go with ad-ridden lesser models...

vel0city

I can just imagine Kraft having a subsidized AI model for recipe suggestions that adds Velveeta to everything.

BriggyDwiggs42

I’ll probably stick to open models at that point.

spiderfarmer

Let a thousand providers bloom.

tmaly

Rethinking your comment "was that all": I am listening to the stream now and had a thought. Most of the new models that have come out in the past few weeks have been great at coding and logical reasoning, but 4o has been better at creative writing. I am wondering if 4.5 is going to be even better at creative writing than 4o.

dingnuts

if you generate "creative" writing, please tell your audience that it is generated, before asking them to read it.

I do not understand what possible motivation there could be for generating "creative writing" unless you enjoy reading meaningless stories yourself, in which case, be my guest.

vjerancrnjak

I still find all of them lacking at creative writing. The models are severely crippled by tokenization and a complete lack of understanding of language rhythm.

They can't generate a simple haiku consistently; anything larger is even more out of reach.

For example, give it a piece of poetry and ask for new verses and it just sucks at replicating the language structure and rhythm of original verses.

chamomeal

I might sound crazy but honestly fine-tuned GPT-3 absolutely blows all of these modern models out of the water when it comes to creative writing.

Maybe it was less lobotomized, or less covered in the prompt equivalent of red tape. Or maybe you just need a little bit of lunacy for fun creative writing. The new models are so much more useful, but IMO they haven't even come close to GPT-3.

maeil

> But 4o has been better at creative writing

In what way? I find the opposite, 4o's output has a very strong AI vibe, much moreso than competitors like Claude and Gemini. You can immediately tell, and instructing it to write differently (except for obvious caricatures like "Write like Gen Z") doesn't seem to help.

petesergeant

> but on another feels like OpenAI really struggles to stay ahead of their competitors

on one hand. On the other hand, you can have 4o-mini and o3-mini back when you can pry them out of my cold dead hands. They're _fast_, they're _cheap_, and in 90% of cases where you're automating anything, they're all you need. Also they can handle significant volume.

I'm not sure that's going to save OpenAI, but their -mini models really are something special for the price/performance/accuracy.

nycdatasci

Funny you should suggest that it seems like a revised system prompt: https://chatgpt.com/share/67c0fda8-a940-800f-bbdc-6674a8375f...

nycdatasci

In case there was any confusion, the referenced link shows 4.5 claiming to be “ChatGPT 4.0 Turbo”. I have tried multiple times and various approaches. This model is aware of 4.5 via search, but insists that it is 4 or 4 turbo. Something doesn’t add up. This cannot be part of the response to R1, Grok 3, and Claude 3.7. Satya’s decision to limit capex seems prescient.

Bjorkbat

My first thought seeing this and looking at benchmarks was that if it wasn’t for reasoning, then either pundits would be saying we’ve hit a plateau, or at the very least OpenAI is clearly in 2nd place to Anthropic in model performance.

Of course we don’t live in such a world, but I thought of this nonetheless because for all the connotations that come with a 4.5 moniker this is kind of underwhelming.

uh_uh

Pundits were saying that deep learning has hit a plateau even before the LLM boom.

anshumankmr

I suspect they may launch a GPT-4.5 Turbo with a price cut... GPT-4/GPT-4-32k etc. were all pricier than the GPT-4 Turbo models, which also came with added context length... but with this huge jump in price, even a 4.5 Turbo, if it does come out, would be pricier.

energy123

The niche of GPT-4.5 is lower hallucinations than any existing model. Whether that niche justifies the price tag for a subset of use cases remains to be seen.

energy123

Actually, this comment of mine was incorrect, or at least we don't have enough information to conclude it. The metric OpenAI is reporting is the total number of incorrect responses on SimpleQA (and they're being beaten by Claude Haiku on this metric...), which is deceptive because it doesn't account for non-responses. A better metric would be the ratio of incorrect answers to the total number of attempts.
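
To make the distinction concrete, a minimal sketch with invented tallies (none of these numbers are from the actual benchmark):

  # Model B abstains often, so it looks better on incorrect/total even
  # though it is wrong more often when it actually answers.
  results = {
      "model_a": {"correct": 80, "incorrect": 20, "not_attempted": 0},
      "model_b": {"correct": 20, "incorrect": 10, "not_attempted": 70},
  }

  for name, r in results.items():
      attempted = r["correct"] + r["incorrect"]
      total = attempted + r["not_attempted"]
      print(name,
            f"incorrect/total = {r['incorrect'] / total:.0%}",          # the reported metric
            f"incorrect/attempted = {r['incorrect'] / attempted:.0%}")  # the fairer ratio
  # model_a: 20% vs 20%; model_b: 10% vs 33% -- abstaining flatters the first metric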

swagmoney1606

I have no idea how they justify $200/month for pro

jampa

First impression of GPT-4.5:

1. It is very, very slow; for applications where you want real-time interaction it's just not viable. The text attached below took 7s to generate with 4o, but 46s with GPT-4.5.

2. The style it writes in is way better: it keeps the tone you ask for and makes better improvements to the flow. One of my biggest complaints with 4o is that you want your content to be more casual and accessible, but GPT / DeepSeek wants to write like Shakespeare did.

Some comparisons on a book draft: GPT-4o (left) and GPT-4.5 (right, highlighted in green). I also adjusted the spacing around the paragraphs to better match the diff. I am still wary of using ChatGPT to help me write, even with GPT-4.5, but the improvement is very noticeable.

https://i.imgur.com/ogalyE0.png

muzani

In my experience, Gemini Flash has been the best at writing, and GPT 3.5 onwards has been terrible.

GPT-3 and GPT-2 were actually remarkably good at it, arguably better than a skilled human. I had a bit of fun ghostwriting with these and got a little fan base for a while.

It seems that GPT-4.5 is better than 4, but it's nowhere near the quality of GPT-3 davinci. Davinci-002 has been nerfed quite a bit, but in the end it's $2/MTok for higher-quality output.

It's clear this is something users want, but OpenAI and Anthropic seem to be going in the opposite direction.

rl3

>1. It is very very slow, ... below took 7s to generate with 4o, but 46s with GPT4.5

This is positively luxurious by o1-pro standards which I'd say average 5 minutes. That said I totally agree even ~45s isn't viable for real-time interactions. I'm sure it'll be optimized.

Of course, my comparing it to the highest-end CoT model in [publicly-known] existence isn't entirely fair since they're sort of apples and oranges.

philomath_mn

I paid for pro to try `o1-pro` and I can't seem to find any use case to justify the insane inference time. `o3-mini-high` seems to do just as well in seconds vs. minutes.

null

[deleted]

azinman2

What are you doing with it? For me deep research tasks are where 5 minutes is fine, or something really hard that would take me way more time myself.

osigurdson

I'm wondering if generative AI will ultimately result in a very dense, bullet-form style of writing. What we are doing now is effectively this:

bullet_points' = compress(expand(bullet_points))

We are impressed by lots of text so must expand via LLM in order to impress the reader. Since the reader doesn't have time or interest to read the content they must compress it back into bullet points / quick summary. Really, the original bullet points plus a bit more thinking would likely be a better form of communication.
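
A toy rendering of that round trip; the llm() helper below is a stand-in for any chat-completion call, not a real library function:

  def llm(prompt: str) -> str:
      raise NotImplementedError("wire up your provider of choice")

  def expand(bullet_points: str) -> str:
      return llm(f"Turn these notes into polished, impressive prose:\n{bullet_points}")

  def compress(prose: str) -> str:
      return llm(f"Summarize this into terse bullet points:\n{prose}")

  # bullet_points' = compress(expand(bullet_points)): a lossy round trip that
  # mostly burns tokens at both ends of the exchange.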

anon373839

That’s what Axios does. For ordinary events coverage, it’s a great style.

ChiefNotAClue

Right side, by a large margin. Better word choice and more natural flow. It feels a lot more human.

rossant

Is there really no way to prompt GPT4o to use a more natural and informal tone matching GPT4.5's?

FergusArgyll

I opened your link in a new tab and looked at it a couple of minutes later. By then I had forgotten which was 4o and which was 4.5.

I honestly couldn't decide which I preferred.

niek_pas

I definitely prefer the 4.5, but that might just be because it sounds 'less like ChatGPT', ironically.

sdesol

It just feels natural to me. The person knows the language, but they are not trying to sound smart by using words that might have more impact "based on the word's dictionary definition".

GPT-4.5 does feel like a step forward in producing natural language, and if they use it for reinforcement learning, this might have a significant impact on future smaller models.

dyauspitr

Imgur might be the worst image hosting site I’ve ever experienced. Any interaction with that page results in switching images and big ads and they hijack the back button. Absolutely terrible. How far they’ve fallen from when it first began.

thfuran

>One of my biggest complaints with 4o is that you want for your content to be more casual and accessible but GPT / DeepSeek wants to write like Shakespeare did.

Well, maybe like a Sophomore's bumbling attempt to write like Shakespeare.

vessenes

Similar reaction here. I will also note that it seems to know a lot more about me than previous models. I'm not sure if this is a broader web crawl, more space in the model, more summarization of our chats, or a combination, but I asked it to psychoanalyze a problem I'm having in the style of Jacques Lacan, and it was genuinely helpful and interesting: no interview required first; it just went right at me.

To borrow an Iain Banks word, the "fragre" def feels improved to me. I think I will prefer it to o1 pro, although I haven't really hammered on it yet.

sebastiennight

It is interesting that they are focusing a large part of this release on the model having a higher "EQ" (Emotional Quotient).

We're far from the days of "this is not a person, we do not want to make it addictive" and getting a firm foot on the territory of "here's your new AI friend".

This is very visible in the example comparing 4o with 4.5 when the user is complaining about failing a test, where 4o's response is what one would expect from a "typical AI response" with problem-solving bullets, and 4.5 is sending what you'd expect from a pal over instant messaging.

It seems Anthropic and Grok have both been moving in this direction as well. Are we going to see an escalation of foundation models impersonating "a friendly person" rather than "a helpful assistant"?

Personally I find this worrying and (as someone who builds upon SOTA model APIs) I really hope this behavior is not going to seep into API responses, or will at least be steerable through the system/developer prompt.

og_kalu

The whole robotic, monotone, helpful assistant thing was something these companies had to actively hammer in during the post-training stage. It's not really how LLMs will sound by default after pre-training.

I guess they're caring less and less about that effort especially since it hurts the model in some ways like creative writing.

capnrefsmmat

Maybe, but I'm not sure how much the style is deliberate vs. a consequence of the post-training tasks like summarization and problem solving. Without seeing the post-training tasks and rating systems it's hard to judge if it's a deliberate style or an emergent consequence of other things.

But it's definitely the case that base models sound more human than instruction-tuned variants. And the shift isn't just vocabulary, it's also in grammar and rhetorical style. There's a shift toward longer words, but also participial phrases, phrasal coordination (with "and" and "or"), and nominalizations (turning adjectives/adverbs into nouns, like "development" or "naturalness"). https://arxiv.org/abs/2410.16107

sebastiennight

How is "development" an adverb or adjective turned into a noun??

It comes from a French word (développement), and that in turn was just a natural derivation of the verb "développer"... no adverbs or adjectives (English or otherwise) seem to come into play here.

turnsout

Or maybe they're just getting better at it, or developing better taste. After switching to Claude, I can't go back to ChatGPT's overly verbose, bullet-point-laden book reports every time I ask a question. I don't think that's pretraining; it's in the way OpenAI approaches tuning and prompting vs. Anthropic.

sebastiennight

If it's just a different choice during RLHF, I'll be curious to see what are the trade-offs in performance.

The "buddy in a chat group" style answers do not make me feel like asking it for a story will make the story long/detailed/poignant enough to warrant the difference.

I'll give it a try and compare on creative tasks.

orbital-decay

Anthropic pretty much abandoned this direction after Claude 3, and said it wasn't what they wanted [1]. Claude 3.5+ is extremely dry and neutral, it doesn't seem to have the same training.

>Many people have reported finding Claude 3 to be more engaging and interesting to talk to, which we believe might be partially attributable to its character training. This wasn’t the core goal of character training, however. Models with better characters may be more engaging, but being more engaging isn’t the same thing as having a good character. In fact, an excessive desire to be engaging seems like an undesirable character trait for a model to have.

[1] https://www.anthropic.com/research/claude-character

Kye

It's the opposite incentive to ad-funded social media. One wants to drain your wallet and keep you hooked, the other wants you to spend as little of their funding as possible finding what you're looking for.

callc

> We're far from the days of "this is not a person, we do not want to make it addictive" and getting a firm foot on the territory of "here's your new AI friend".

That’s a hard nope from me, when companies pull that move. I’ll stick to my flesh and blood humans who still hallucinate but only rarely.

bredren

Yes, the "personality" (vibe) of the model is a key qualitative attribute of gpt-4.5.

I suspect this has something to do with shining a light on an increased value prop in a dimension many people will appreciate, since the gains in quantitative comparisons with other models were not notable enough to pop eyeballs.

tmaly

I would like to see a humor test. So far, I have not seen any model response that has made me laugh.

tkgally

How does the following stand-up routine by Claude 3.7 Sonnet work for you?

https://gally.net/temp/20250225claudestandup2.html

fragmede

I chuckled.

Now you just need a Pro subscription to get Sora generate a video to go along with this and post it to YouTube and rake in the views (and the money that goes along with it).

sebastiennight

That was impressive. If it all came from just this short 4-line prompt, it's even more impressive.

All we're missing now is a text-to-video (or text-to-audio and then audio-to-video) model that can convincingly follow the style instructions for emphasis and pausing. Or are we already there?

aprilthird2021

Reading this felt like reading junk food

EDIT: Junk food tastes kinda good though. This felt like drinking straight cooking oil. Tastes bad and bad for you.

jdiez17

Okay, you know what? I laughed a few times. Yeah it may not work as an actual stand up routine to a general audience, it’s kinda cringe (as most LLM-generated content), but it was legitimately entertaining to read.

phonon

"Tip your server" was a pretty great pun!

lurker9001

incredible

kruxigt

[dead]

thousand_nights

reddit tier humor, truly

it's just regurgitating overly emphasized cliches in a disgustingly enthusiastic tone

AgentME

My benchmark for this has been asking the model to write some tweets in the style of dril, a popular user who writes short funny tweets. Sometimes I include a few example tweets in the prompt too. Here's an example of results I got from Claude 3 Opus and GPT 4 for this last year: https://bsky.app/profile/macil.tech/post/3kpcvicmirs2v. My opinion is that Claude's results were mostly bangers while GPT's were all a bit groanworthy. I need to try this again with the latest models sometime.

sebastiennight

The "roast" tools that have popped up (using either DeepSeek or o3-mini) are pretty funny.

Eg. https://news.ycombinator.com/item?id=43163654

jcims

OK now that is some funny shit.

turnsout

If you like absurdist humor, go into the OpenAI playground, select 3.5-Turbo, and dial up the temperature to the point where the output devolves into garbled text after 500 tokens or so. The first ~200 tokens are in the freaking sweet spot of humor.
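
Roughly the same trick is possible via the API, assuming the current openai Python client and that the model is still served; the exact temperature sweet spot is something to experiment with:

  # Sketch of the high-temperature trick via the API. The temperature and
  # max_tokens values are knobs to play with, not magic numbers.
  from openai import OpenAI

  client = OpenAI()  # reads OPENAI_API_KEY from the environment
  resp = client.chat.completions.create(
      model="gpt-3.5-turbo",
      messages=[{"role": "user", "content": "Tell me about your day."}],
      temperature=1.9,  # near the 2.0 maximum; output degrades into garble
      max_tokens=500,   # the first ~200 tokens tend to be the funny ones
  )
  print(resp.choices[0].message.content)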

rl3

Maybe it's rose-colored glasses, but 3.5 was really the golden era for LLM comedy. More modern LLMs can't touch it.

Just ask it to write you a film screenplay involving some hard-ass 80s/90s action star and someone totally unrelated and opposite to that. The ensuing unhinged magic is unparalleled.

amarcheschi

Could someone post an example?

immibis

ChatGPT gave me this shell script: https://social.immibis.com/media/7102ac83cf4a200e48dd368938e... (obviously, don't download and execute a random shell script from the internet without reading it first)

I think reading it will make you laugh.

oblio

> We're far from the days of "this is not a person, we do not want to make it addictive" and getting a firm foot on the territory of "here's your new AI friend".

And soon we'll have the new AI friend recommending Bud Lite™ and turning the beer can with the logo towards you.

sureIy

I don't know if I fully agree. The input clearly shows the need for emotional support more than "how do I pass this test?" The answer by 4o is comical even if you know you're talking to a machine.

It reminds me of the advice to "not offer solutions when a woman talks about her problems, but just listen."

neuroticnews25

How could a machine provide emotional support? When I ask questions like this to LLMs, it's always to brainstorm solutions. I get annoyed when I receive fake-attention follow-up questions instead.

I guess there's a trade-off between being human and being useful. But this isn't unique to LLMs, it's similar to how one wouldn't expect a deep personal connection with a customer service professional.

cynicalpeace

There are some businesses trying to do emotional support with AI, like AI GF's, etc

Some will make some profit as a niche thing (millions of users on a global scale, and if unit economics work, can make millions of $)

But it seems it will never be something really mainstream because most normal people don't care what a bot says or does.

The example I always think of is chess bots have been better at chess than humans for decades. But very few people watch stockfish tournaments. Everyone loves Magnus Carlsen though.

This is 100x for emotional support type things.

aprilthird2021

I think it's a good thing because, idk why, I just start tuning out after getting reams and reams of bullet points I'm already not super confident about the truthfulness of

freediver

The results for GPT - 4.5 are in for Kagi LLM benchmark too.

It does crush our benchmark - time to make a new one? ;) - with performance similar to that of reasoning models. It does come at a great price, in both cost and speed.

A monster is what they created. But looking at the tasks it fails, some of them are ones my 9-year-old would solve. It is still in this weird limbo of super knowledge and low intelligence.

It may be remembered as the last of the 'big ones'; I can't imagine this will be the path for the future.

https://help.kagi.com/kagi/ai/llm-benchmark.html

mjirv

Do you have results for gpt-4? I’d be very interested in seeing the lift here from their last “big one”.

wendyshu

Why don't you have Grok?

mhh__

No API for Grok 3 might be why.

theodorthe5

If Gemini 2 is the top in your benchmark, make sure to re-check your benchmark.

shawabawa3

Gemini 2 pro is actually very impressive (maybe not for coding, haven't used it for that)

Flash is pretty garbage but cheap

istjohn

Gemini 2.0 Pro is quite good.

aoeusnth1

Gemini 2 pro is pretty strong actually.

eightysixfour

Seeing OpenAI and Anthropic go different routes here is interesting. It is worth moving past the initial knee-jerk reaction of this model being unimpressive and some of the comments about "they spent a massive amount of money and had to ship something for it..."

* Anthropic appears to be making a bet that a single paradigm (reasoning) can create a model which is excellent for all use cases.

* OpenAI seems to be betting that you'll need an ensemble of models with different capabilities, working as a single system, to jump beyond what the reasoning models today can do.

Based on all of the comments from OpenAI, GPT-4.5 is absolutely massive, and with that size comes the ability to store far more factual data. The scores in ability-oriented things - like coding - don't show the kind of gains you get from reasoning models, but the fact-based test, SimpleQA, shows a pretty large jump and a dramatic reduction in hallucinations. You can imagine a scenario where GPT-4.5 coordinates multiple, smaller reasoning agents and uses its factual accuracy to enhance their reasoning, kind of like how ruminating on an idea "feels" like a different process than having a chat with someone.
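
Purely as speculation, that coordination could look something like this; big_model and reasoner are hypothetical stand-ins, not real APIs:

  # Speculative sketch: a large, factually grounded model fronting a smaller
  # reasoning model. Both functions are hypothetical stand-ins for API calls.
  def big_model(prompt: str) -> str: ...   # high factual recall, low hallucination
  def reasoner(task: str, facts: str) -> str: ...  # small chain-of-thought model

  def answer(question: str) -> str:
      facts = big_model(f"List the established facts relevant to: {question}")
      draft = reasoner(question, facts)
      # The big model gets the last word, checking the draft against its facts.
      return big_model(f"Given these facts:\n{facts}\nrevise for accuracy:\n{draft}")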

I'm really curious if they're actually combining two things right now that could be split as well: EQ/communications and factual knowledge storage. This could all be a bust, but it is an interesting difference in approaches nonetheless, and worth considering that OpenAI could be right.

sebastiennight

> * OpenAI seems to be betting that you'll need an ensemble of models with different capabilities, working as a single system, to jump beyond what the reasoning models today can do.

Seems inaccurate, as their most recent claim I've seen is that they expect this to be their last non-reasoning model, and that they aim to provide all capabilities together in future model releases (unifying the GPT-x and o-x lines).

See this claim on TFA:

> We believe reasoning will be a core capability of future models, and that the two approaches to scaling—pre-training and reasoning—will complement each other.

eightysixfour

From Sam's twitter:

> After that, a top goal for us is to unify o-series models and GPT-series models by creating systems that can use all our tools, know when to think for a long time or not, and generally be useful for a very wide range of tasks.

> In both ChatGPT and our API, we will release GPT-5 as a system that integrates a lot of our technology, including o3. We will no longer ship o3 as a standalone model.

You could read this as unifying the models, or as building a unified system which coordinates multiple models. The second sentence, to me, implies that o3 will still exist; it just won't be standalone, which matches the idea I shared above.

sebastiennight

Ah, great point. Yes, the wording here would imply that they're basically planning on building scaffolding around multiple models instead of having one more capable Swiss Army Knife model.

I would feel a bit bummed if GPT-5 turned out not to be a model, but rather a "product".

ryukoposting

> know when to think for a long time or not, and generally be useful for a very wide range of tasks.

I'm going to call it now - no customer is actually going to use this. It'll be a cute little bonus for their chatbot god-oracle, but virtually all of their b2b clients are going to demand "minimum latency at all times" or "maximum accuracy at all times."

tmpz22

I worry eliminating consumer choice will drive up prices for only a nominal gain in utility for most users.

billywhizz

or you could read it as a way to create a moat where none currently exists...

nomel

> OpenAI seems to be betting that you'll need an ensemble of models with different capabilities, working as a single system, to jump beyond what the reasoning models today can do.

The high-level block diagrams for tech always end up converging to those found in biological systems.

eightysixfour

Yeah, I don't know enough real neuroscience to argue either side. What I can say is that this path is more like the way I observe myself thinking: it feels like there are different modes of thinking and processes in the brain, and it seems like transformers are able to emulate at least two different versions of that.

Once we figure out the frontal cortex & corpus callosum part of this, where the models all work in the same shared space instead of calling each other over APIs, I have a feeling we'll be on to something pretty exciting.

throw234234234

> Anthropic appears to be making a bet that a single paradigm (reasoning) can create a model which is excellent for all use cases.

I don't think that is their primary motivation. The announcement post for Claude 3.7 was all about code, which doesn't seem to imply "all use cases": code this, new code tool that, telling customers they look forward to what they build, etc. There was very little mention of other use cases in the new model announcement at all. The usage stats they published are telling - 80%+ of queries to Claude are about code. I actually think that while they are considering other use cases, they see code specifically as the major thing to optimize for.

OpenAI, given its different customer base and reach, is probably aiming for something more general.

IMO they all think that you need an "ensemble" of models with different capabilities to optimize for different use cases. It's more about how much compute each company has and what they target with those resources. Anthropic, I'm assuming, has fewer compute resources and a narrower customer base, so it may make economic sense to optimize just for that.

eightysixfour

That's possible; my counterpoint would be that if that were the case, Anthropic would have built a smaller reasoning model instead of doing a "full" Claude. Instead, they built something which seems to be flexible across different types of responses.

Only time will tell.

jstummbillig

It can never be just reasoning, right? Reasoning is a multiplier on some base model, and surely no amount of reasoning on top of something like GPT-2 will get you o1.

This model is too expensive right now, but as compute gets cheaper - and we have to keep in mind that it will - having a better base to multiply with will enable things that just more thinking won't.

eightysixfour

You can try this yourself with the distilled R1s that DeepSeek released. The Qwen-7B-based model is quite impressive for its size, and it can do a lot with additional context provided. I imagine for some domains you can provide enough context and let inference time eventually solve it; for others you can't.

protocolture

Ever since those kids demo'd their fact-checking engine here, which was just Input -> LLM -> Fact Database -> LLM -> LLM -> Output, I have been betting that it will be advantageous to move in this general direction.
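
That pipeline is easy to sketch; every function here is a hypothetical placeholder rather than anything from the demo:

  # Hypothetical rendering of Input -> LLM -> Fact Database -> LLM -> LLM -> Output.
  # None of these functions come from a real library; they mark pipeline stages.
  def extract_claims(text: str) -> list[str]: ...   # first LLM pass
  def lookup(claim: str) -> str: ...                # fact database query
  def judge(claim: str, evidence: str) -> str: ...  # second LLM pass
  def summarize(verdicts: list[str]) -> str: ...    # final LLM pass

  def fact_check(text: str) -> str:
      claims = extract_claims(text)
      verdicts = [judge(c, lookup(c)) for c in claims]
      return summarize(verdicts)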

wongarsu

Or the other way around: smaller reasoning models that can call out to GPT-4.5 to get their facts right.

eightysixfour

Maybe. I'm inclined to think OpenAI believes the way I laid it out, though, specifically because of their focus on communication and EQ in 4.5. It seems like they believe the large, non-reasoning model will be "front of house."

Or they’ll use some kind of trained router which sends the request to the one it thinks it should go to first.
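
A trivial version of such a router, with a hypothetical classify() step and placeholder model names:

  # Hypothetical router: a cheap classifier picks which model handles the request.
  # classify() and call_model() are placeholders, not real client calls.
  def classify(prompt: str) -> str: ...  # e.g. returns "reasoning" or "knowledge"

  def call_model(model: str, prompt: str) -> str: ...

  def route(prompt: str) -> str:
      model = "o3-mini" if classify(prompt) == "reasoning" else "gpt-4.5"
      return call_model(model, prompt)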

bhouston

A bit better at coding than GPT-4o but not better than o3-mini - there is a chart near the bottom of the page that is easy to overlook:

- GPT-4.5 on SWE-bench Verified: 38.0%

- GPT-4o on SWE-bench Verified: 30.7%

- OpenAI o3-mini on SWE-bench Verified: 61.0%

BTW, Anthropic's Claude 3.7 is better than o3-mini at coding, at around 62-70% [1]. This means I'll stick with Claude 3.7 for the time being for my open-source alternative to Claude Code: https://github.com/drivecore/mycoder

[1] https://aws.amazon.com/blogs/aws/anthropics-claude-3-7-sonne...

pawelduda

Does the benchmark reflect your opinion on 3.7? I've been using 3.7 via Cursor and it's noticeably worse than 3.5. I've heard using the standalone model works fine, didn't get a chance to try it yet though.

jasonjmcghee

Personal anecdote - Claude Code is the best LLM DevX I've had.

_cs2017_

I don't see Claude 3.7 on the official leaderboard. The top performer on the leaderboard right now is o1 with a scaffold (W&B Programmer O1 crosscheck5) at 64.6%: https://www.swebench.com/#verified.

If Claude 3.7 achieves 70.3%, it's quite impressive, it's not far from 71.7% claimed by o3, at (presumably) much, much lower costs.

aoeusnth1

I doubt o3's costs will be lower for that performance. They juice their benchmark results by letting it spend $100k in thinking tokens.

logicchains

>BTW Anthropic Claude 3.7 is better than o3-mini at coding at around 62-70% [1]. This means that I'll stick with Claude 3.7 for the time being for my open source alternative to Claude-code

That's not a fair comparison as o3-mini is significantly cheaper. It's fine if your employer is paying, but on a personal project the cost of using Claude through the API is really noticeable.

cheema33

> That's not a fair comparison as o3-mini is significantly cheaper. It's fine if your employer is paying...

I use it via the Cursor editor's built-in support for Claude 3.7. That caps the monthly expense at $20. There is probably a limit in Claude for these queries, but I haven't run into it yet, and I am a heavy user.

bhouston

Agentic coders (e.g. aider, Claude-code, mycoder, codebuff, etc.) use a lot more tokens, but they write whole features for you and debug your code.

QuadmasterXLII

If OpenAI offers a more expensive model (4.5) and a cheaper model (o3-mini) and both are worse, it starts to be a fair comparison.

ehsanu1

It's the other way around on their new SWE-Lancer benchmark, which is pretty interesting: GPT-4.5 scores 32.6%, while o3-mini scores 10.8%.

Topfi

To put that in context: Claude 3.5 Sonnet (new), a model we have had for months now, and which by all accounts seems to have been cheaper to train and is cheaper to use, is still ahead of GPT-4.5 at 36.1% vs 32.6% on SWE-Lancer Diamond [0]. The more I look into this release, the more confused I get.

[0] https://arxiv.org/pdf/2502.12115

simonw

If you want to try it out via their API you can run it through my LLM tool using uvx like this:

  uvx --with 'https://github.com/simonw/llm/archive/801b08bf40788c09aed6175252876310312fe667.zip' \
    llm -m gpt-4.5-preview 'impress me'
You may need to set an API key first, either with `export OPENAI_API_KEY='xxx'` or using this command to save it to a file:

  uvx llm keys set openai
  # paste key here
Or this to get a chat session going:

  uvx --with 'https://github.com/simonw/llm/archive/801b08bf40788c09aed6175252876310312fe667.zip' \
    llm chat -m gpt-4.5-preview

I'll probably have a proper release out later today. Details here: https://github.com/simonw/llm/issues/795

ashu1461

Just curious, does this stream the output or render it all at once?

simonw

It streams the output. See animated demo here (bottom image on the page) https://simonwillison.net/2025/Feb/27/introducing-gpt-45/

null

[deleted]

antirez

In many ways I'm not an OpenAI fan (but I do recognize their many merits). At the same time, I believe people are missing what they tried to do with GPT-4.5: it was needed and important to explore the pre-training scaling law in that direction. A gift to science, however self-interested it may be.

throwaway314155

> A gift to science

This is hardly recognizable as science.

edit: Sorry, didn't feel this was a controversial opinion. What I meant to say was that for so-called science, this is not reproducible in any way whatsoever. Further, this page in particular has all the hallmarks of _marketing_ copy, not science.

Sometimes a failure is just a failure, not necessarily a gift. People could tell scaling wasn't working well before the release of GPT 4.5. I really don't see how this provides as much insight as is suggested.

Deepseek's models apparently still compare favorably with this one. What's more they did that work with the constraint of having _less_ money, not so much money they could run incredibly costly experiments that are likely to fail. We need more of the former, less of the latter.

tacet

If I understand your argument correctly, then I would say that it is very recognizable as science.

>People could tell scaling wasn't working well before the release of GPT 4.5

Yes, at a quick glance it seems so from OpenAI's 2020 research into scaling laws.

Scaling apparently didn't work well, so the theory about scaling not working well failed to be falsified. It's science.

vbezhenar

> People could tell scaling wasn't working well before the release of GPT 4.5.

Different people tell different things all the time. That's not science. Experiment is science.

kadushka

> People could tell scaling wasn't working well before the release of GPT 4.5

Who could tell? Who has tried scaling up to this level?

throwaway314155

https://www.reuters.com/technology/artificial-intelligence/o...

> Ilya Sutskever, co-founder of AI labs Safe Superintelligence (SSI) and OpenAI, told Reuters recently that results from scaling up pre-training - the phase of training an AI model that uses a vast amount of unlabeled data to understand language patterns and structures - have plateaued.

anshumankmr

OpenAI took a bullet for the team by perhaps scaling the model to something bigger than the 1.6T params GPT-4 possibly had, and basically telling its competitors it's not going to be worth scaling much beyond GPT-4's param count without a change in model architecture.

joshuamcginnis

I'm one week into heavy Grok usage. I didn't think I'd say this, but for personal use, I'm considering cancelling my OpenAI plan.

The one thing I wish Grok had is more separation of the UI from X itself. The interface being so coupled to X puts me off and makes it feel like a second-class citizen. I like ChatGPT's minimalist UI.

richard_todd

I find Grok to be the best overall experience for the types of tasks I try to give AI (mostly: analyzing PDFs, performing and proofreading OCR, translating Medieval Latin and Hebrew, and reminding me how to do various things in Python or SwiftUI). ChatGPT/Gemini/Copilot all fight me occasionally, but Grok just tries to help. And the hallucinations aren't as frequent, at least anecdotally.

aldanor

There's grok.com, which is standalone with its own UI.

it

There's also a standalone Grok app at least on iOS.

pzo

I wish they also made a dedicated keyboard app, like SwiftKey with its Copilot integration.

fzzzy

Don't they have a standalone Grok app now? I thought I saw that. [edit] ah some sibling comments mention this as well

barfingclouds

There's a Grok app for iPhone that's basically the same as the ChatGPT/DeepSeek/Mistral/Gemini/Claude apps.

null

[deleted]

martin82

Agree. Grok at the moment is king.

I just wish they had folders (or projects) like OpenAI has...

bn-l

They still haven't released an API for Grok 3.