
GPT-4.5: "Not a frontier model"?

159 comments · March 2, 2025

HarHarVeryFunny

GPT 4.5 also has a knowledge cutoff date of 10-2023.

https://www.reddit.com/r/singularity/comments/1izpb8t/gpt45_...

I'm guessing that this model finished pre-training at least a year ago (it's been 2 years since GPT-4 was released), and they just didn't see the hoped-for performance gains that would have warranted releasing it at the time, so they put all their effort into the Q-star/Strawberry (eventual o1) reasoning effort instead.

It seems that OpenAI's reasoning-model lead perhaps isn't what they thought it was, and the recent slew of strong non-reasoning models (Gemini 2.0 Flash, Grok 3, Sonnet 3.7) made them feel the need to release something themselves for appearances' sake, so they dusted off this model, perhaps did a bit of post-training on it for EQ, and here we are.

The price is a bit of a mystery - perhaps just a reflection of an older model without all the latest efficiency tricks to make it cheaper. Maybe it's dense rather than MoE - who knows.

sigmoid10

Rumors said that GPT4.5 is an order of magnitude larger. Around 12 trillion parameters total (compared to GPT4's 1.2 trillion). It's almost certainly MoE as well, just a scaled up version. That would explain the cost. OpenAI also said that this is what they originally developed as "Omni" - the model supposed to succeed GPT4 but which fell behind expectations. So they renamed it 4.5 and shoehorned it in to remain in the news among all those competitor releases.

glenstein

This is all excellent detail. Wondering if there are any good suggestions for further reading on the inside baseball of what happened with GPT-4.5?

qeternity

Well, it's not...it gets most details wrong.

ljlolel

The gpt-4o ("omni") is probably a distilled 4.5; hence there's not much quality difference.

sigmoid10

4o has been out since May last year, while omni (now rechristened as 4.5) only finished training in October/November.

qeternity

GPT-4 was rumored to be 1.8T params...not 1.2

And the successor model was called "Orion", not "Omni".

zaptrem

You're thinking of "Orion" not "Omni" (GPT 4o stands for "Omni" since it's natively multimodal with image and audio input/output tokens)

Leary

How does this compare with Grok 3's parameter count? I know Grok 3 was trained on a larger cluster (100k-200k) but GPT 4.5 used distributed training.

glenstein

>The price is a bit of a mystery

I think it at least is somewhat analogous to what happened with pricing on previous models. GPT 4, despite being less capable than 4o, is an order of magnitude more expensive, and comparably expensive to o1. It seems like once the model is out, the price is the price, and the performance gains emerge but they emerge attached to new minified variations of previous models.

simonw

I don't think the October 2023 training cut-off means the model finished pre-training a year ago. All of OpenAI's models share that same cut-off date.

One theory is that they're worried about the increasing tide of LLM-generated slop that's been posted online since that date. I don't know if I buy that or not - other model providers (such as Anthropic and Google) don't seem worried about it.

Chance-Device

Releasing it was probably a mistake. In context what the model is could have been understood, but they haven’t really presented that context. Also it would be lost on a general audience.

The general public will naturally expect it to be the next big thing. Wasn’t that the point of releasing it? To seem like progress is being made? To try to make that point with a model that doesn’t deliver is a misstep.

If I were Sam Altman, I’d be pulling this back before it goes on general release, saying something like it was experimental and after user feedback the costs weren’t worth it and they’re working on something else as a replacement. Then o3 or whatever they actually are working on instead can be the “replacement” even if it’s much later.

datadrivenangel

or just say it was too good and thus too dangerous to release...

bilater

I sort of believed this, but 4.5 coming out last year would absolutely have been a big deal compared to what was out there at the time. I just don't understand why they didn't launch it then.

numba888

> slew of strong non-reasoning models (Gemini 2.0 Flash, Grok 3, Sonnet 3.7)

Sonnet 3.7 is actually a reasoning model.

LaurensBER

It's my understanding that reasoning in Sonnet 3.7 is optional and configurable.

I might be wrong but I couldn't find a source that indicates that the "base" model also implements reasoning.

amluto

From limited experimentation: Sonnet 3.7 has “extended thinking” as an option, although the UI, at least in the app, leaves something to be desired. It also has a beta feature called “Analysis” that seems to work by having the model output JavaScript code as part of its response that is then run and feeds back into the answer. Both of these abilities are visible — users can see the chain of thought and the analysis code.
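
For the API crowd, turning it on looks roughly like this (a minimal sketch based on Anthropic's documented "thinking" parameter; the token budget here is just an illustrative value):

    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    response = client.messages.create(
        model="claude-3-7-sonnet-20250219",
        max_tokens=8000,  # must exceed the thinking budget
        # Extended thinking is opt-in: you grant a token budget for the chain
        # of thought, which comes back as visible "thinking" content blocks.
        thinking={"type": "enabled", "budget_tokens": 4000},
        messages=[{"role": "user", "content": "How many primes are below 1000?"}],
    )

    for block in response.content:
        if block.type == "thinking":
            print("[thinking]", block.thinking[:200], "...")
        elif block.type == "text":
            print(block.text)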

It seems, based again on limited experimentation doing sort-of-real work, that analysis works quite well and extended thinking is so-so. Whereas DeepSeek R1 seems to be willing and perhaps even encouraged to second-guess itself (maybe this is a superpower of the "wait" token), Sonnet 3.7 doesn't seem to second-guess itself as much as it should. It will happily extended-think, generate a wrong answer, and then give a better answer after being asked a question that it really should have thought of itself.

(I’m not complaining. I’ve been a happy user of 3.7 for a whole day! But I think there’s plenty of room for improvement.)

wegfawefgawefg

So is Grok 3.

tsunego

GPT-4.5 feels like OpenAI's way of discovering just how much we'll pay for diminishing returns.

The leap from GPT-4o to 4.5 isn't a leap—it's an expensive tiptoe toward incremental improvements, priced like a luxury item without the luxury payoff.

With pricing at 15x GPT-4o, they're practically daring us not to use it. Given this, I wouldn't be surprised if GPT-4.5 quietly disappears from the API once OpenAI finishes squeezing insights (and cash) out of this experiment.

zamadatix

Even this is a bit overly complicated/optimistic to me. Why not something as simple as: OpenAI has been building larger and larger models to great success for a long time. As a result, they were excited that this one was going to be so much larger, and therefore so much better, that the price to run it would be well worth the huge jump they were planning to get from it. What really happened is that this method of scaling hit a wall, and they were left with an expensive dud they won't get much out of, but they had to release something for now, otherwise they'd start falling well behind on the leaderboards over the next few months. Meanwhile they scramble to find other means of scaling, like the gains "chain of thought + runtime compute" provided.

hn_throwaway_99

Thank you so much for this comment. I don't really understand the need for people to go straight to semi-conspiratorial hypotheses, when the simpler explanation makes so much more sense. All the evidence is that this model is much larger than previous ones, so they must charge a lot more for inference because it costs so much more to run. OpenAI were the OGs when it came to scaling, so it's not surprising they went this route and eventually hit a wall.

I don't at all blame OpenAI for going down this path (indeed, I laud them for making expensive bets), but I do blame all the quote-unquote "thought leaders" who were writing breathless posts about how AGI was just around the corner because things would just scale linearly forever. It was classic "based on historical data, this 10 year old will be 20 feet tall by the time he's 30" thinking, and lots of people called them out on this, and they either just ignored it or responded with "oh, simple not-in-the-know peons" dismissiveness.

bee_rider

It is weird, because this is a board for working programmers for the most part. So, like, who's seen a grand conspiracy actually be pulled off? Probably not many. A lackluster product that gets released even though it sucks because too many people are highly motivated not to notice that it sucks? Everybody has experienced that, right?

danielbln

It works until it doesn't and hindsight is 20/20.

Kye

In fundamental science terms, it also proves once and for all that more model doesn't mean more better. Any forces within OpenAI pushing to move past just growing the model for gains now have a strong argument for going all-in on new processes.

TZubiri

Time to enter the tick cycle.

I ask ChatGPT to give me a map highlighting all Spanish-speaking countries; it gives me Stable Diffusion trash.

Just gotta do the grunt work: add a tool with a map API, integrate with Google Maps for transit stuff.
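
The plumbing for this already exists as function calling; a minimal sketch (the render_country_map tool and its schema are hypothetical, but the tool-calling plumbing is OpenAI's standard chat completions API):

    from openai import OpenAI

    client = OpenAI()

    # Hypothetical map tool; the model returns a structured call to it
    # instead of hallucinating pixels, and our code then hits a real map API.
    tools = [{
        "type": "function",
        "function": {
            "name": "render_country_map",
            "description": "Render a world map with the given ISO country codes highlighted.",
            "parameters": {
                "type": "object",
                "properties": {
                    "country_codes": {"type": "array", "items": {"type": "string"}},
                },
                "required": ["country_codes"],
            },
        },
    }]

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Map of all Spanish-speaking countries"}],
        tools=tools,
    )
    print(response.choices[0].message.tool_calls)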

It's a good LLM already; it doesn't need to be Einstein and solve aerospace equations. We just need to wait until they realize their limits and find the humility to build yet another useful product that won't conquer the world.

Willingham

I've thought of LLMs as Google 2.0 for some time now. Truly a world-changing technology, similar to how Google changed the world, and likely to have an even larger impact than Google had as we create highly specialized implementations of the technology in the coming decade. But it's not energy-positive nuclear fusion, or a polynomial-time NP solver; it's just Google 2.0.

dingnuts

Google 2.0 where you have to check every answer it gives you because it's authoritative about nothing.

Works great when the output is small enough to unit test or immediately try in situations with no possible negative outcomes.

Anything larger? Skip the LLM slop and go to the source. You have to go to the source, anyway.

bee_rider

LLMs could make some nice little tools.

However they’ll need to replace vast swathes of the economy to justify these AI companies’ market caps.

blharr

Giving ChatGPT stupid AI image generation was a huge nerf. I get frustrated with this all the time.

SketchySeaBeast

Oh, I think it's great they did that. It's super helpful for visualizing ChatGPT's limitations. Ask it for an absolutely full, overflowing glass of wine or a wrist watch whose time is 6:30 and it's obvious what it actually does. It's educational.

tiahura

I asked Claude to give me a script in Python to create a map highlighting all Spanish-speaking countries. It took 3 tries and then gave me a perfect SVG and PNG.
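
For reference, a minimal sketch of what such a script can look like (assuming geopandas < 1.0, which still bundles the naturalearth_lowres dataset; country names follow that dataset's spellings):

    import geopandas as gpd
    import matplotlib.pyplot as plt

    SPANISH_SPEAKING = {
        "Spain", "Mexico", "Colombia", "Argentina", "Peru", "Venezuela",
        "Chile", "Ecuador", "Guatemala", "Cuba", "Bolivia", "Dominican Rep.",
        "Honduras", "Paraguay", "El Salvador", "Nicaragua", "Costa Rica",
        "Panama", "Uruguay", "Puerto Rico", "Eq. Guinea",
    }

    world = gpd.read_file(gpd.datasets.get_path("naturalearth_lowres"))
    world["es"] = world["name"].isin(SPANISH_SPEAKING)

    # Grey base map with the Spanish-speaking countries highlighted on top.
    ax = world.plot(color="lightgrey", figsize=(12, 6))
    world[world["es"]].plot(ax=ax, color="tab:orange")
    ax.set_axis_off()
    plt.savefig("spanish_speaking.png", dpi=150, bbox_inches="tight")
    plt.savefig("spanish_speaking.svg", bbox_inches="tight")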

dismalaf

> Just gotta do the grunt work, add a tool with a map api. Integrate with google maps for transit stuff.

This is kind of the crux, though. The only way to make LLMs more useful is to basically make them traditional AI. So it's not really a leap forward, never mind a path to AGI.

hooverd

They should have called it "ChatGPT Enterprise".

tsunego

Exactly! Designed specifically for people who love burning corporate budgets.

numba888

OpenAI is going to add it to Plus subscriptions, i.e. available to many at no additional cost, likely with restrictions like N prompts/hour.

As for the API price, when it matters, businesses and people are willing to pay much more for just slightly better results. OpenAI doesn't take the other options away, so we don't lose anything.

fodkodrasz

IMO the 4o output is a lot more enterprise-compatible; 4.5 being straight to the point and more natural is quite the opposite. Pricing-wise your point stands.

Disclaimer: have not tried 4.5 yet, just skimmed through the announcement, using 4o regularly.

Kerbonut

Apparently, OpenAI API "credits" expire after a year. I stupidly put in another $20 and am now trying to blow through it; 4.5 is the easiest way, considering 4o has recently fallen out of favor versus other models and I don't want to just let the credits expire again. An expiry after only one year is asinine.

Chance-Device

Yes. I also discovered this, and was also forced to blow through my credits in a rush. Terrible policy.

glenstein

I'm learning this for the first time now. I don't appreciate having to anticipate how many credits I'll use, like it's an FSA account.

heed

>Terrible policy.

And unfortunately one not exclusive to OpenAI. Anthropic credits also expire after 1 year.

jstummbillig

This is how pricing on human labour works. Nobody expects an employee that costs twice as much to produce twice the output for any given task. All that is expected is that they can do a narrow set of things that another person can't.

neom

4.5 can extremely quickly distill and work with what I, at least, consider complex, nuanced thought. 4.5 is night and day better than every other AI for my work; it's quite clever and I like it.

Very quick MVP comparison for the "show me what you mean" crew: https://chatgpt.com/share/67c48fcc-db24-800f-865b-c0485efd7f... & https://chatgpt.com/share/67c48fe2-0830-800f-a370-7a18586e8b... (~30 seconds vs ~3 minutes)

nyrikki

The 4.5 has better 'vibes' but isn't 'better', as a concrete example:

> Mission is the operationalized version of vision; it translates aspiration into clear, achievable action.

The "Mission is the operationalized version of vision" is not in the corpus that I am find and is obviously a confabulated mixture of classic Taylorist like "strategic planning"

SOPs and metrics, which will be tied to compensation and the unfortunate ubiquitous nature of Taylorism would not result in shared purpose, but a bunch of Gantt charts past the planning horizon.

IMHO, "complex nuanced thought" would mean understanding the historical issues and at least respecting the divide between classical and neo-classical org theory, or at least avoiding pollution of more modern theories with classical baggage, which is a significant barrier to delivering value.

Mission statements need to share strategic intent in an actionable way, strategy is not operationalization.

ewoodrich

I have been experimenting with 4.5 for a journaling app I am developing for my own personal needs, for example, turning bullet/unstructured thoughts into a consistent diary format/voice.

The quality of writing can be much better than Claude 3.5/3.7 at times, but it struggles with similar confabulation of information that is not in the original text but "sounds good/flows well". Which isn't ideal for a personal journal... I am still playing around with the system prompt, but given the astronomical cost (even with me as the only user) and marginal benefits, I am probably going to end up sticking with Claude for now.

Unless others have a recommendation for a less robot-y sounding model (that will, however, follow instructions precisely) with API access other than the mainstream Claude/OpenAI/Gemini models?

neom

I've found this on par with 4.5 in tone, but not as nuanced in connecting super wide ideas in systems, 4.5 still does that best: https://ai.google.dev/gemini-api/docs/thinking

(Also: the person you are responding to is doing exactly what you're saying you don't want done: taking something unrelated to the original text (Taylorism) that could sound good, and jamming it in.)

neom

The statement "Mission is the operationalized version of vision; it translates aspiration into clear, achievable action" isn't a Taylorist reduction of mission to mechanical processes - it's actually a nuanced understanding of how these organizational elements relate. You're misinterpreting what "operationalized" means in this context. From what I can tell, the 4.5 response isn't suggesting Taylorist implementation with Gantt charts etc.; it's describing how missions translate vision into actionable direction while remaining strategic. Instead of jargon, it's recognizing that founders need something between abstract vision and tactical execution. Missions serve this critical bridging function. The CEO has a vision, orgs capture the vision in their missions, and people find their purpose when aligned via the two. Without it, founders either get stuck in aspirational thinking or jump straight to implementation details without strategic guidance. The distinction matters exactly because it helps avoid the dysfunction that prevents startups from scaling effectively.

I think you're assuming "operationalized" means tactical implementation (Gantt charts, SOPs) when in this context it means "made operational/actionable at a strategic level". Missions != mission statements. Also, you're creating a false dichotomy between "strategic intent" and "operationalization" when they very much exist on a spectrum. (If anything, connecting employees to mission and purpose is the opposite of Tayloristic thinking, which viewed workers more as interchangeable parts than as stakeholders in a shared mission responding to a shared vision of global change.)

You are doing what o1 pro did, and as I said: as a tool for teaching business to founders, I personally find the 4.5 response to be better.

nyrikki

An example of a typical naive definition of a mission statement is:

Concise, clear, and memorable statement that outlines a company's core purpose, values, and target audience.

> "made operational/actionable at a strategic level".

Taking the common definition above, what do you think the average manager would do, given that in the social sciences operationalization is explicitly about measuring abstract qualities? [1]

"Operationalization" is a compromise, trying to quantify qualitative properties; it is not typically subject to methods like the MECE principle, because there are too many unknown unknowns.

You are correct that "operationalization" and "strategic intent" are not mutually exclusive in all aspects, but they are for mission statements that need to be durable across changes that no CEO can envision.

The "made operational/actionable at a strategic level" is the exact claim of pseudo scientific management theory (Greater Taylorism) that Japan directly targeted to destroy the US manufacturing sector. You can look at the former CEO of Komatsu if you want direct evidence.

GM's failure to learn from Toyota at NUMMI is another.

The planning process needs to be informed by strategy, but planning is not strategic; it has a limited horizon.

But you are correct that it is more nuanced and neither Taylor nor Tolstoy allowed for that.

Neo-classical org theory is when bounded rationality was first acknowledged, although the Prussian military figured that out long before Taylor grabbed his stopwatch to time people loading pig iron into train cars.

I encourage you to read:

Strategy: A History by Sir Lawrence Freedman

for a more in-depth discussion.

[1] https://socialsci.libretexts.org/Bookshelves/Sociology/Intro...

ttul

I believe 4.5 is a very large and rich model. The price is high because it's costly to inference; however, the bigger reason is to ensure that others don't distill from it. Big models have a rich latent space, but it takes time to squeeze the juice out.

esafak

That also means people won't use it. Way to shoot yourself in the foot.

The irony of a company that has distilled the world's information complaining about another company distilling their model...

ttul

The small number of use cases that do pay are providing gross margins as well as feedback that helps OpenAI in various ways. I don’t think it’s a stupid move at all.

cscurmudgeon

My assumption: There will be use cases where cost of using this will be smaller than the gain from it. Data from this will make the next version better and cheaper.

phillipcarter

My take from using it a bit is that they seem to have genuinely innovated on:

- Not writing things that go off in weird directions / staying grounded in "reality"

- Responding very well to tone preferences and catching nuance in what I say

It seems like it's less that it has a great "personality" like Claude, but that it's capable of adapting towards being the "personality" I want and "understanding" what I'm saying in ways that other models haven't been able to do for me.

XenophileJKO

So this kind of mirrors my feelings after using GPT-4.5 on general conversation and song writing.

GPT picked up on unspecified requirements almost instantly. It is subtle (and may be undesirable in some contexts). For example, in my songs I have to bracket the section headings; it picked up on that from my original input. All the other frontier models generally have to be reminded. Additionally, I separately asked for an edit to a music style description. When I asked GPT-4.5 to write a song all by itself, it included a music style description. No other model I have worked with has done this.

These are subtle differences, but in aggregate the model just generally needs less nudging to create what is required.

torginus

I haven't used 4.5, but I have some experience using Claude for creative writing, and in my experience it sometimes has the uncanny ability to get to the core of my ideas, rephrasing my paragraph-long descriptions into just a sentence or two, or both improving and concretizing my vague ideas into something that's insightful and tasteful.

Other times it locks itself into a dull style and ignores what I ask of it and just produces boring generic garbage, and I have to wrangle it hard to get some of the spark back.

I have no idea what's going on inside, but just like with Stable Diffusion, it's fairly easy to make something that has the spark of genius, and is very close to being perfect, but getting the last 10% there, and maintaining the quality seems almost impossible.

It's a very weird feeling. It's hard to put into words what exactly is going on, and probably even harder to make it into a benchmark, but it makes me constantly flip-flop between being scared of how good the AI is and questioning why I ever bothered using it in the first place, as I would've progressed much faster without it.

pzo

Long term it might be hard to monetise this infrastructure, considering their competition:

1) For coding (API), most will probably stick to Claude 3.5/3.7 - a big market, but still small compared to all worldwide problems.

2) For non-coding API use, IMHO Gemini 2.0 Flash is the winner - dirt cheap (cheaper than 4o-mini), good enough and even better than gpt-4o, with cheap audio and image input.

3) For subscription apps, ChatGPT is probably still the best, but only slightly - they have the best advanced voice conversation, but Grok will probably be eating their lunch here.

anukin

The Sesame model for voice, IMO, is better than ChatGPT's voice conversation. They are going to open source it as well.

bckr

Sure but is there an app I can talk to / work with? It seems they're a voice synthesis model company, not a chatbot app / tool company.

OsrsNeedsf2P

> They are going to open source it as well.

Means nothing until they do

Layvier

We were using gpt-4o for our chat agent, and after some experiments I think we'll move to Flash 2.0. Faster, cheaper, and even a bit more reliable. I also experimented with the experimental thinking version, and there a single-node architecture seemed to work well enough (instead of multiple specialised sub-agent nodes). It did better than DeepSeek, actually. Now I'm waiting for the official release before spending more time on it.
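
For anyone curious, the single-node version is roughly this with the google-generativeai SDK (a sketch; the system instruction is illustrative):

    import google.generativeai as genai

    genai.configure(api_key="...")  # or set GOOGLE_API_KEY

    # One model with one system instruction, instead of routing between
    # multiple specialised sub-agent prompts.
    model = genai.GenerativeModel(
        "gemini-2.0-flash",
        system_instruction="You are the support agent. Handle billing and account questions.",
    )

    chat = model.start_chat()
    reply = chat.send_message("I was double-charged last month.")
    print(reply.text)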

ipaddr

For the rest of us using free tiers, ChatGPT is hands down the winner, allowing limited image generation, unlimited usage of some model, and limited usage of 4o.

Claude is still stuck at 10 messages per day, and Gemini is less accurate/useful.

dingnuts

10 messages a day? How are people "vibe coding" with that?

irishloop

They're paying for Pro

siva7

It's marketed as being slightly better at "creative writing". That isn't the problem most businesses have with current-generation LLMs. On the other side, Anthropic released a new model at nearly the same time that solves more practical problems for businesses, to the point that for coding many insiders don't use OpenAI models anymore.

dingnuts

I think it should be illegal to trick humans into reading "creative" machine output.

It strikes me as a form of fraud that steals my most precious resources: time and attention. I read creative writing to feel a human connection to the author. If the author is a machine and this is not disclosed, that's a huge lie.

It should be required that publishers label AI generated content.

Hoasi

> I think it should be illegal to trick humans into reading "creative" machine output.

Creativity has lost its meaning. Should it be illegal? The courts will take a long time to settle the matter. Reselling people's work against their will as creative machine output seems unethical, to say the least.

> It should be required that publishers label AI-generated content.

Strongly agree.

CuriouslyC

I'm pretty sure you read for pleasure, and feeling a human connection is one way that you derive pleasure. If it's the only way that you derive pleasure from reading, my condolences.

becquerel

Pretty much where my thoughts on this are. I rarely feel any particular sense of connection to the actual author when I read their books. And I have taken great pleasure from some AI stories (to the degree I put them up on my personal website as a way to keep them around).

Philpax

Under the dingnuts regime, Dwarf Fortress will be illegal. Actually, any game with a procedural story? You better believe that's a crime: we can't have a machine generate text a human might enjoy.

glenneroo

Dingnuts' point was that it should be disclosed. Everyone knows Dwarf Fortress stories are procedurally/AI generated; the authors aren't trying to hide that fact.

kubb

Seems like we're hitting the limits of the technology...

gloosx

Consumer limits. When something is good enough that it can make stable money, there is no real incentive to innovate beyond the bare minimum: just enough to keep consumers engaged, shareholders satisfied, and regulators at bay.

This is how modern capitalism and all corporations work: we will keep receiving new version numbers without any sensible change, consumers will keep renewing their subscriptions out of habit, and xyzAI PR managers, HR managers, corporate lawyers, and myriads of other bureaucrats will keep receiving their paychecks while secretly dreaming of retirement, while xyzAI top management burns money on countless acquisitions just to fight boredom, turning into xyz(ai)MEGAcorp doing everything from crude oil processing and styrofoam cups to web services and AI models.

No modern megacorporation is capable of making something different from the one thing that already worked for them. We could have achieved universal welfare and prosperity 60 years ago, but that would've disrupted the cycle. Instead, we got planned obsolescence, endless subscription models, and a world where everything "new" is just a slightly repackaged version of last year's product.

xmichael909

Yes, I believe the sprint is over; now it's going to be slow cycles, maybe 18 months to see a 5% increase in ability, and even that 5% will be highly subjective. Claude's new release is about the same: 3.7 is arguably worse at some things than 3.5 and better at others. Based on the previous pace of releases, if the next release from any of the leaders in about 6 months or so is about the same "kinda better, kinda worse", then we'll know. Imagine how much money is going to evaporate from the stock market if this is the limit!

ANewFormation

You can keep getting rich off shovels long after the gold has run dry.

borgdefenser

To say 3.7 is worse is completely insane.

TIPSIO

I also hate waiting on reasoning.

I would much prefer a lightning-fast model that is cheaper but the same quality as these frontier models.

Let me query these things to death.

ljlolel

Try Groq (hyperfast chips): https://groq.com/

apwell23

Does it mean we get a reprieve from "this is just the beginning" comments?

thfuran

Maybe if it takes many years before the next major architectural advancement.

kubb

I wouldn't count on it.

ghostly_s

I don't get it. Aren't these two sentences in the same paragraph contradictory?

>"Scaling to this size of model did NOT make a clear jump in capabilities we are measuring."

> "The jump from GPT-4o (where we are now) to GPT-4.5 made the models go from great to really great."

XenophileJKO

No, it means that it got better on things orthogonal to what we have mostly been measuring. On the last few rounds, we have been mostly focusing on reasoning, not as much on knowledge, "creativity", or emotional resonance.

johnecheck

"It's better. We can't measure it, but we're pretty sure it's better. We also desperately need it to be better because we just spent a boat-load of money on it."

mirekrusin

Is anybody actually looking at those last few percentage points on benchmarks?

Aren't we making the mistake of assuming benchmarks are purely 100% correct?

sunami-ai

Meanwhile, all GPT-4o models on Azure are set to be deprecated in May and there are no alternative models yet. Should we start moving to Anthropic? DeepSeek is too slow, melting under its own success. Does anyone on GPT-4o/Azure have any idea when they'll release the next "o" model?

Uvix

Only an older version of GPT-4o has been deprecated and will be removed in May. The newest version will be supported through at least 20 November 2025.

https://learn.microsoft.com/en-us/azure/ai-services/openai/c...

sunami-ai

The Nov 2024 release, which is due to be deprecated in Nov 2025, I was told has degraded performance compared to the Aug 2024 release. In fact, OpenAI's Models page says their current GPT-4o API is serving the Aug release. https://platform.openai.com/docs/models#gpt-4o

So I'm still on the Aug 24 release, which, with your reminding me, is not to be deprecated till Aug 2025, but that's less than 5 months from now, and we're skipping the Nov 2024 release just as OpenAI themselves have chosen to do.
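
For anyone in the same position, staying pinned looks roughly like this (a sketch; the endpoint and deployment name are placeholders, and it's the Azure deployment, not the client, that fixes the underlying 2024-08-06 model version):

    from openai import AzureOpenAI

    client = AzureOpenAI(
        azure_endpoint="https://YOUR-RESOURCE.openai.azure.com",
        api_key="...",
        api_version="2024-10-21",  # a GA API version; separate from the model version
    )

    response = client.chat.completions.create(
        # The deployment (created in Azure, named hypothetically here)
        # is what pins the gpt-4o 2024-08-06 snapshot.
        model="my-gpt4o-aug2024",
        messages=[{"role": "user", "content": "ping"}],
    )
    print(response.choices[0].message.content)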

EcommerceFlow

I've found 4.5 to be quite good at "business decisions", much better than other models. It does have some magic to it, similar to Grok 3, but maybe a bit smarter?

yimby2001

It seems like there's a misunderstanding as to why this happened. They've been baking this model for months, long before DeepSeek came out with fundamentally new ways of distilling models. And even given that it's not great in its large form, they're going to distill from this going forward, so it likely makes sense for them to periodically train these very large models as a basis.

lhl

I think this framing isn't quite right either. DeepSeek's R1 isn't very different from what OpenAI has already been doing with o1 (and that other groups have been doing as well). As for distilling - the R1 "distilled" models they released aren't even proper (logit) distillations, but just SFTs, not fundamentally new at all. But it's great that they published their full recipes, and it's also great to see that it's effective. In fact, we've seen now with LIMO and s1/s1.1 that even as few as 1K reasoning traces can get most LLMs to near-SOTA on math benchmarks. This mirrors the "Alpaca" moment in a lot of ways (and you could even directly mirror, say, LIMO w/ LIMA).
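
To make the distinction concrete, here is a rough PyTorch-style sketch of the two objectives (illustrative only, not DeepSeek's actual code):

    import torch.nn.functional as F

    def logit_distillation_loss(student_logits, teacher_logits, T=2.0):
        # Proper (logit) distillation: match the teacher's full output
        # distribution over the vocabulary, softened by temperature T.
        p_teacher = F.softmax(teacher_logits / T, dim=-1)
        log_p_student = F.log_softmax(student_logits / T, dim=-1)
        return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * T * T

    def sft_loss(student_logits, teacher_token_ids):
        # What the R1 "distilled" models actually do: plain cross-entropy
        # on hard tokens sampled from the teacher (the reasoning traces).
        return F.cross_entropy(
            student_logits.view(-1, student_logits.size(-1)),
            teacher_token_ids.view(-1),
        )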

I think the main takeaway of GPT-4.5 (Orion) is that it basically gives a perspective on all the "hit a wall" talk from the end of last year. Here we have a model that has been trained on, by many accounts, 10-100X the compute of GPT-4, and is likely several times larger in parameter count, but is only... subtly better, certainly not super-intelligent. I've been playing around w/ it a lot the past few days, both with several million tokens' worth of non-standard benchmarks and by talking to it, and it is better than previous GPTs (in particular, it makes a big jump in humor), but I think it's clear that the "easy" gains in the near future are going to come from figuring out how as many domains as possible can be approximately verified/RL'd.

As for the release? I suppose they could just have kept it internally for distillation/knowledge transfer, so I'm actually happy that they released it, even if it ends up not being a really "useful" model.