Hot take: GPT 4.5 is a nothing burger
33 comments · February 28, 2025 · AnotherGoodName
imafish
Good. We need a break from insane AI advances.
kelseyfrog
ChatGPT's next release was AGI or bust, and they busted. Not really much more to it than that.
yokoprime
It does give an impression of diminishing returns on this family of models. The output presented in samples and quoted in benchmarks is very impressive, but compared with 4o it seems to be an incremental update with no new capabilities of note. This does, however, come at a massive increase in cost (both compute and monetary). Also, this update did not benefit from being launched in the wake of Claude 3.7 and DeepSeek, which both had more headline-worthy improvements compared with what we got yesterday from OpenAI.
2muchcoffeeman
Why would you not expect diminishing returns? Surely that’s the default position of all AI non-experts?
iLoveOncall
I don't know why people would downvote you.
ChatGPT was released in 2022! It doesn't feel like that, but it's been out for a long time and we've only seen marginal improvements since, and the wider public has simply not seen ANY improvement.
It's obvious that the technology has hit a brick wall, and the farce of spending double the tokens to first come up with a plan and calling that "reasoning" hasn't moved the needle either.
TwoPhonesOneKid
I thought we were supposed to see AI-driven cost reductions by now. Instead I just see chatbots and frantic PMs looking to prove their utility.
cowlby
My quick (subjective) impression is that GPT-4.5 does better at sustaining philosophical discussions than GPT-4o or Claude. When using a Socratic approach, 4.5 consistently holds and challenges positions rather than quickly agreeing with me.
GPT-4o and Claude tend to flip into people-pleasing mode too easily, while 4.5 seems to stay argumentative more readily.
hnuser123456
These super large models seem better at keeping track of unsaid nuance. I wonder if that can still be distilled into smaller models or if there is a minimum size for a minimum level of nuance even given infinite training.
A_D_E_P_T
You can fix the people-pleasing mode thing by simply adding the words "be critical" to your prompt.
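E.g., a minimal sketch with the OpenAI Python SDK, putting the instruction in the system message (the model name and prompts are just illustrative, not a recommendation):

    # be_critical.py - illustrative sketch: steer a model away from
    # people-pleasing with a blunt system instruction.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    resp = client.chat.completions.create(
        model="gpt-4.5-preview",  # placeholder; use whichever model you're testing
        messages=[
            {"role": "system", "content": "Be critical. Challenge weak points directly."},
            {"role": "user", "content": "Critique this plan: rewrite the backend in Rust."},
        ],
    )
    print(resp.choices[0].message.content)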
As for 4.5... I've been playing around with it all day, and as far as I can tell it's objectively worse than o3-mini-high and Deepseek-R1. It's less imaginative, doesn't reason as well, doesn't code as well as o3-mini, doesn't write nearly as well as R1, its book and product recommendations are far more mainstream/normie, and all in all it's totally unimpressive.
Frankly, I don't know why OpenAI released it in this form, to people who already have access to o3-mini, o1-Pro, and Deep Research -- all of which are better tools.
nytesky
Are we reaching the point where AI-generated data is a growing share of the training data?
Upvoter33
Marcus with an anti-AI post - what a non-surprise
weatherlite
Seems to me like he's right though
xnx
This seems to be the conventional wisdom.
_mitterpach
The output, from what I've seen, was okay? I don't know if it is that much better, and I think LLMs gain a lot by there not being an actual, objective measure by which you can compare two different models.
Sure, there are some coding competitions, there are some benchmarks, but can you really check if the recipe for banana bread output by Claude is better than the one output by ChatGPT?
Is there any reasonable way to compare outputs of fuzzy algorithms anyway? It's still an algorithm under the hood, with defined inputs, calculations, and outputs, right? (Just with a little bit of randomness controlled by a random seed.)
sen
I have a dozen or so very random prompts I feed into every new model that are based on things I’m very knowledgeable and passionate about, and compare the outputs. A couple are directly coding related, a couple are “write a few paragraphs explaining <technical thing>”, and the rest are purely about non-computer hobbies, etc.
I’ve found it way more useful for me personally than any of the “formal” tests, as I don’t really care how it scores on random tests but instead very much do care how well it does my day to day things.
It’s like listening to someone in the media talk about a topic you’re passionate about, and you pick up on all the little bits and pieces that aren’t right. It’s a gut feel and very unscientific but it works.
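If you want to automate the side-by-side part, here's a minimal sketch using the OpenAI Python SDK (the model list and the prompts.txt layout are placeholders for your own setup):

    # personal_evals.py - run a fixed set of personal prompts against
    # several models and dump the outputs side by side for eyeballing.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    models = ["gpt-4o", "gpt-4.5-preview"]  # whichever models you want to compare

    with open("prompts.txt") as f:  # one prompt per line, assumed format
        prompts = [line.strip() for line in f if line.strip()]

    for prompt in prompts:
        print(f"\n=== {prompt[:60]} ===")
        for model in models:
            resp = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
            )
            print(f"\n--- {model} ---\n{resp.choices[0].message.content}")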
dylan604
> but can you really check if the recipe for banana bread output by Claude is better than the one output by ChatGPT?
yes? I mean, if you were really doing this, you could make both and see how they turned out. Or, if you were familiar with doing this and were just looking for a quick refresher, you'd know if something was off or not.
but just like everything else on the interweb, if you have no knowledge except for whatever your search result presented, you're screwed!
esafak
You ask people to rate them, possibly along multiple dimensions. People are much better at resolving comparisons than making absolute assessments. https://lmarena.ai/
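That's essentially what lmarena does: collect pairwise votes and fit a rating. A minimal Elo-style sketch of the mechanism (the K-factor and starting ratings here are the usual illustrative defaults, not lmarena's exact method):

    # elo.py - update two model ratings from a single pairwise vote,
    # the basic mechanism behind pairwise-comparison leaderboards.
    def elo_update(r_winner, r_loser, k=32):
        # expected score of the winner under the Elo model
        expected = 1 / (1 + 10 ** ((r_loser - r_winner) / 400))
        delta = k * (1 - expected)
        return r_winner + delta, r_loser - delta

    # two models start at 1000; model A wins one head-to-head vote
    a, b = elo_update(1000, 1000)
    print(a, b)  # 1016.0 984.0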
amelius
There are benchmarks, but they can be gamed.
dokka
It's good at taking the code of large, complex libraries and finding the optimal way to glue them together. Also, I gave it the code of several open source MudBlazor components and got great examples of how they should be used together to build what I want. Sure, Grok 3 and Sonnet 3.7 can do that, but the GPT-4.5 answer was slightly better.
evil-olive
> Sure, Grok 3 and Sonnet 3.7 can do that, but the GPT 4.5 answer was slightly better.
Sonnet 3.7: $3/million input tokens, $15/million output tokens [0]
GPT-4.5: $75/million input tokens, $150/million output tokens [1]
if it's 10-25x the cost, I would expect more than "slightly better"
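For concreteness, a back-of-the-envelope per-request cost at those quoted prices (the 10k-input / 1k-output request size is made up purely for illustration):

    # cost.py - rough per-request cost at the quoted prices, using an
    # assumed 10k-input / 1k-output request purely for illustration.
    def cost(input_tokens, output_tokens, in_price, out_price):
        # prices are in dollars per million tokens
        return (input_tokens * in_price + output_tokens * out_price) / 1e6

    req = (10_000, 1_000)
    print(f"Sonnet 3.7: ${cost(*req, 3, 15):.4f}")    # $0.0450
    print(f"GPT-4.5:    ${cost(*req, 75, 150):.4f}")  # $0.9000, ~20x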
IanCal
It really depends on how much it actually costs for a task though. 10x more of almost nothing isn't important.
refulgentis
It's unfortunate that it's named 4.5 -- it's next-generation scale, and it's the 1.0 of that next-generation scale.
Sonnet is on its 3rd iteration, i.e. it has considerably more post-training, most notably reasoning via reinforcement learning.
yomansat
How do you feed them large code bases usually?
tptacek
It seems unlikely that an 8.48% one-day drop in NVDA would be a referendum on how much better GPT 4.5 was than GPT 4.
TwoPhonesOneKid
Yeah, but consulting the stock market for valuation seems like asking a council of local morons what they think of someone. Any signal provoking such a drop would itself be many times more valuable, if it is indeed meaningful in the first place.
Direct usage stats of ChatGPT would be many orders of magnitude more meaningful. Though I guess you have a 50/50 shot of guessing right, so you might as well try...
dsizzle
This guy always finds a way to claim victory for his AI pessimism. In a few years: "Hot take: AI's Nobel Prizes aren't even in the top 10, just like I predicted"
resters
It took me a while to learn how to get the most out of o1, but 4.5 should seemingly work like 4o.
Based on that it does seem underwhelming. Looking forward to hearing about any cases where it truly shines compared to other models.
How does this compare to the scaling laws predicted for AI?
(See observation 1 for context) https://blog.samaltman.com/three-observations
I feel this is way out of line given the resources and the marginal gains. If the claimed scaling laws don't hold (more resources = closer to intelligence), then LLMs in their current form are not going to lead to AGI.
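Observation 1 in that post is roughly that intelligence scales with the log of the resources used to train and run a model. A toy illustration of why that implies brutal marginal costs (the units and numbers are arbitrary):

    # scaling.py - toy illustration of log scaling: if capability goes as
    # log(resources), each equal step in capability costs ~10x more compute.
    import math

    for resources in [1, 10, 100, 1000]:  # arbitrary compute units
        capability = math.log10(resources)
        print(f"resources={resources:>5} -> capability={capability:.1f}")
    # each +1.0 of capability requires 10x the resources of the last step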