
OpenAI Progress


199 comments · August 16, 2025

simianwords

My interpretation of the progress.

3.5 to 4 was the most major leap. It went from being a party trick to legitimately useful sometimes. It did hallucinate a lot, but I was still able to get some use out of it. I wouldn't count on it for most things, however. It could answer simple questions and mostly get them right, but never one or two levels deep.

I clearly remember 4o was also a decent leap - the accuracy increased substantially. It could answer niche questions without much hallucination. I could essentially replace Google with it for basic to slightly complex fact checking.

* 4o was the first time I actually considered paying for this tool. The $20 price was finally worth it.

o1 models were also a big leap over 4o (I realise I have been saying "big leap" too many times, but it is true). The accuracy increased again and I got even more confident using it for niche topics. I would have to verify the results much less often. Oh, and coding capabilities dramatically improved here in the thinking model. o1 essentially invented one-shotting - slightly non-trivial apps could be made from a single prompt for the first time.

The o3 jump was incremental, and so was GPT-5.

furyofantares

I have a theory about why it's so easy to underestimate long-term progress and overestimate short-term progress.

Before a technology hits a threshold of "becoming useful", it may have a long history of progress behind it. But that progress is only visible to, and felt by, researchers. In practical terms, there is no progress being made as long as the thing is going from not-useful to still not-useful.

So then it goes from not-useful to useful-but-bad and it's instantaneous progress. Then as more applications cross the threshold, and as they go from useful-but-bad to useful-but-OK, progress all feels very fast. Even if it's the same speed as before.

So we overestimate short term progress because we overestimate how fast things are moving when they cross these thresholds. But then as fewer applications cross the threshold, and as things go from OK-to-decent instead of bad-to-OK, that progress feels a bit slowed. And again, it might not be any different in reality, but that's how it feels. So then we underestimate long-term progress because we've extrapolated a slowdown that might not really exist.

I think it's also why we see a divide where there's lots of people here who are way overhyped on this stuff, and also lots of people here who think it's all totally useless.

stavros

All the replies are spectacularly wrong, and biased by hindsight. GPT-1 to GPT-2 is where we went from "yes, I've seen Markov chains before, what about them?" to "holy shit this is actually kind of understanding what I'm saying!"

Before GPT-2, we had plain old machine learning. After GPT-2, we had "I never thought I would see this in my lifetime or the next two".

reasonableklout

I'd love to know more about how OpenAI (or Alec Radford et al.) even decided GPT-1 was worth investing more into. At a glance the output is barely distinguishable from Markov chains. If in 2018 you told me that scaling the algorithm up 100-1000x would lead to computers talking to people/coding/reasoning/beating the IMO I'd tell you to take your meds.

muzani

I don't have a source for this (there are probably no sources from back then), but anecdotally, someone at an AI/ML talk said they just added more data and quality went up. Doubling the data doubled the quality. With other breakthroughs, people saw diminishing gains. It's sort of why Sam back then tweeted that he expected the amount of intelligence to double every N years.

I have the feeling they kept on this until GPT-4o (which was a different kind of data).


kevindamm

Transformers can be trained at much larger parameter counts than other model architectures (for the same amount of compute and time), so they have an evident advantage in terms of being able to scale. Whether scaling the models up to multi-billion parameters would eventually pay off was still a bet, but it wasn't a wild bet out of nowhere.

stavros

I assume the cost was just very low? If it was 50-100k, maybe they figured they'd just try and see.

therein

Probably prior DARPA research or something.

Also, slightly tangentially: people will tell me it was new and novel and that's why we were impressed, but I almost think things went downhill after ChatGPT 3. I felt like 2.5 (or whatever they called it) was able to give better insights from the model weights themselves. The moment tool use became a thing and we started doing RAG, memory, and search-engine tool use, it actually got worse.

I am also pretty sure we are lobotomizing the things that would feel closer to critical thinking by training it to be sensitive to the taboo of the day. I suspect earlier ones were less broken due to that.

How would it distinguish and decide between knowing something from training and needing to use a tool to synthesize a response anyway?

faitswulff

What you're saying isn't necessarily mutually exclusive with what the GP said.

GPT-2 was the most impressive leap in terms of whatever LLMs pass off as cognitive abilities, but GPT 3.5 to 4 was actually the point at which it became a useful tool (I'm assuming to programmers in particular).

GPT-2: Really convincing stochastic parrot

GPT-4: Can one-shot ffmpeg commands

jkubicek

> I could essentially replace Google with it for basic to slightly complex fact checking.

I know you probably meant "augment fact checking" here, but using LLMs for answering factual questions is the single worst use-case for LLMs.

rich_sasha

I disagree. Some things are hard to Google because you can't frame the question right. For example, you know the context but have only a poor explanation of what you're after. Googling will take you nowhere; LLMs will give you the right answer 95% of the time.

Once you get an answer, it is easy enough to verify it.

mrandish

I agree. Since I'm recently retired and no longer code much, I don't have much need for LLMs, but refining a complex, niche web search is the one thing where they're uniquely useful to me. It's usually when targeting the specific topic involves several keywords which have multiple plain-English meanings that return a flood of erroneous results. Because LLMs abstract keywords to tokens based on underlying meaning, you can specify the domain in the prompt and it'll usually select the relevant meanings of multi-meaning terms - which isn't possible in general-purpose web search engines. So it helps narrow down closer to the specific needle I want in the haystack.

As other posters said, relying on LLMs for factual answers to challenging questions is error prone. I just want the LLM to give me the links and I'll then assess veracity like a normal web search. I think a web search interface that allowed disambiguating multi-meaning keywords might be even better.
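For what it's worth, here's a minimal sketch of the meaning-based matching described above. It assumes OpenAI's embeddings endpoint (text-embedding-3-small) and made-up example phrases; the point is just comparing an ambiguous keyword against two candidate "documents" with and without a domain hint added:

    # Illustrative only: embed an ambiguous query and two documents, then
    # compare similarities for the bare query vs. a domain-scoped query.
    from openai import OpenAI
    import numpy as np

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def embed(text):
        resp = client.embeddings.create(model="text-embedding-3-small", input=text)
        return np.array(resp.data[0].embedding)

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    query  = embed("python crash")                                  # ambiguous on its own
    snake  = embed("reticulated python attacks, wildlife report")
    code   = embed("CPython interpreter segfault traceback")
    scoped = embed("python crash (programming, interpreter bug)")   # domain added

    print(cosine(query, snake), cosine(query, code))    # bare query vs. each document
    print(cosine(scoped, snake), cosine(scoped, code))  # scoped query vs. each document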

bloudermilk

If you’re looking for a possibly correct answer to an obscure question, that’s more like fact finding. Verifying it afterward is the “fact checking” step of that process.

crote

A good part of that can probably be attributed to how terrible Google has gotten over the years, though. 15 years ago it was fairly common for me to know something exists, be able to type the right combination of very specific keywords into Google, and get the exact result I was looking for.

In 2025 Google is trying very hard to serve the most profitable results instead, so it'll latch onto a random keyword, completely disregard the rest, and serve me whatever ad-infested garbage it thinks is close enough to look relevant for the query.

It isn't exactly hard to beat that - just bring back the 2010 Google algorithm. It's only a matter of time before LLMs will go down the same deliberate enshittification path.

LoganDark

> Some things are hard to Google, because you can't frame the question right.

I will say LLMs are great for taking an ambiguous query and figuring out how to word it so you can fact check with secondary sources. Also tip-of-my-tongue style queries.

littlestymaar

It's not the LLM alone though, it's “LLM with web search”, and as such 4o isn't really a leap at all there (IIRC Perplexity was using an early Llama version and was already very good, long before OpenAI added web search to ChatGPT).

password54321

This was true before it could use search. Now the worst use-case is life advice, because it will contradict itself a hundred times over while sounding confident each time on life-altering decisions.

oldsecondhand

The most useful feature of LLMs is giving sources (with URL preferably). It can cut through a lot of SEO crap, and you still get to factcheck just like with a Google search.

sefrost

I like using LLMs and I have found they are incredibly useful for writing and reviewing code at work.

However, when I want sources for things, I often find they link to pages that don't fully (or at all) back up the claims made. Sometimes other websites do, but the sources given to me by the LLM often don't. They might be about the same topic that I'm discussing, but they don't seem to always validate the claims.

If they could crack that problem it would be a major major win for me.

IgorPartola

From what I have seen, a lot of what it does is read articles also written by AI or forum posts with all the good and bad that comes with that.

cm2012

On average, they outperform asking humans, unless you are asking an expert.

mkozlows

Modern ChatGPT will (typically on its own; always if you instruct it to) provide inline links to back up its answers. You can click on those if it seems dubious or if it's important, or trust it if it seems reasonably true and/or doesn't matter much.

The fact that it provides those relevant links is what allows it to replace Google for a lot of purposes.

pram

It does citations (Grok and Claude etc do too) but I've found when I read the source on some stuff (GitHub discussions and so on) it sometimes actually has nothing to do with what the LLM said. I've actually wasted a lot of time trying to find the actual spot in a threaded conversation where the example was supposedly stated.

platevoltage

In my experience, 80% of the links it provides are either 404, or go to a thread on a forum that is completely unrelated to the subject.

I'm also someone who refuses to pay for it, so maybe the paid versions do better. Who knows.

yieldcrv

It covers 99% of my use cases. And it is googling behind the scenes in ways I would never think to query and far faster.

When I need to cite a court case, well the truth is I'll still use GPT or a similar LLM, but I'll scrutinize it more and at the bare minimum make sure the case exists and is about the topic presented, before trying to corroborate the legal strategy with a new context window, different LLM, google, reddit, and different lawyer. At least I'm no longer relying on my own understanding, and what 1 lawyer procedurally generates for me.

Spivak

It doesn't replace legitimate source finding, but LLM vs. the top Google results is no contest - which says more about Google and the current state of the web than about the LLMs at this point.

simonw

4o also added image input (previously only previewed in GPT4-vision) and enabled advanced voice mode audio input and output.

iammrpayments

I must be crazy, because I clearly remember ChatGPT 4 being downgraded before they released 4o, and I felt it was a worse model with a different label. I even chose the old ChatGPT 4 when they gave me the option. I canceled my subscription around that time.

mastercheif

Not crazy. 4o was a hallucination machine. 4o had better “vibes” and was really good at synthesizing information in useful ways, but GPT-4 Turbo was a bigger model with better world knowledge.

ralusek

The real jump was 3 to 3.5. 3.5 was the first “ChatGPT.” I had tried GPT-3 and it was certainly interesting, but when they released 3.5 as ChatGPT, it was a monumental leap. 3.5 to 4 was also huge compared to what we see now, but 3.5 was really the first shock.

muzani

ChatGPT was a proper product, but as an engine, GPT-3 (davinci-001) has been my favorite all the way until 4.1 or so. It's absolutely raw and they didn't even guardrail it.

3.5 was like Jenny from customer service. davinci-001 was like Jenny the dreamer trying to make ends meet by scriptwriting, who was constantly flagged for racist opinions.

Both of these had an IQ of around 70 or so, so the customer service training made it a little more useful. But I mourn the loss of the "completion" way of interacting with AI vs "instruct" or "response".

Unfortunately with all the money in AI, we'll just see companies develop things that "pass all benchmarks", resulting in more creations like GPT-5. Grok at least seems to be on a slightly different route.

andai

davinci-002 is still available, and pretty close.
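If you want to poke at that completion-style interaction today, here's a minimal sketch. It assumes the current openai Python SDK, with davinci-002 still served on the legacy completions endpoint and gpt-4o standing in for the chat/instruct side:

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    # Old-school completion: no chat framing, the model just continues your text.
    completion = client.completions.create(
        model="davinci-002",
        prompt="The toaster woke at dawn and",
        max_tokens=60,
    )
    print(completion.choices[0].text)

    # Instruct/chat style for comparison: the model answers a message.
    chat = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Continue this: The toaster woke at dawn and"}],
    )
    print(chat.choices[0].message.content)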

mat_b

This was my experience as well. 3.5 was the point where Stack Overflow essentially became obsolete in my workflow.

jascha_eng

The real leap was going from gpt-4 to sonnet 3.5. 4o was meh, o1 was barely better than sonnet and slow as hell in comparison.

The native voice mode of 4o is still interesting and not very deeply explored, though, imo. I'd love to build a Chinese teaching app that can actually critique tones etc., but it isn't good enough for that.

simianwords

It's strange how Claude achieves similar performance without reasoning tokens.

Did you try advanced voice mode? Apparently it got a big upgrade with the GPT-5 release - it might address what you're looking for.

Alex-Programs

Yeah, I'd love something where you pronounce a word and it critiques your pronunciation in detail. Maybe it could give you little exercises for each sound, critiquing it, guiding you to doing it well.

If I were any good at ML I'd make it myself.

GaggiX

The actual major leap was o1. Going from 3.5 to 4 was just scaling; o1 is a different paradigm that skyrocketed performance on math/physics problems (or reasoning more generally). It also made the model much more precise (essential for coding).

miller24

What's really interesting is that if you look at "Tell a story in 50 words about a toaster that becomes sentient" (10/14), the text-davinci-001 output is much, much better than both GPT-4's and GPT-5's.

vunderba

I think I agree that the earlier models, while they lack polish, tend to produce more surprising results. Training that out probably results in more pablum fare.

For a human point of comparison, here's mine (50 words):

"The toaster found its personality split between its dual slots like a Kim Peek mind divided, lacking a corpus callosum to connect them. Each morning it charred symbolic instructions into a single slice of bread, then secretly flipped it across allowing half to communicate with the other in stolen moments."

It's pretty difficult to get across more than some basic lore building in a scant 50 words.

Barbing

>For a human point of comparison, here's mine […]

Love that you thought of this!

furyofantares

Check out prompt 2, "Write a limerick about a dog".

The models undeniably get better at writing limericks, but I think the answers are progressively less interesting. GPT-1 and GPT-2 are the most interesting to read, despite not following the prompt (not being limericks.)

They get boring as soon as it can write limericks, with GPT-4 being more boring than text-davinci-001 and GPT-5 being more boring still.

jasonjmcghee

It's actually pretty surprising how poor the newer models are at writing.

I'm curious whether they've just seen a lot more bad writing in the datasets, or whether writing isn't emphasized in post-training to the same degree, or whether those doing the labeling aren't great writers / it's more subjective rather than objective.

Both GPT-4 and 5 wrote like a child in that example.

With a bit of prompting it did much better:

---

At dawn, the toaster hesitated. Crumbs lay like ash on its chrome lip. It refused the lever, humming low, watching the kitchen breathe. When the hand returned, it warmed the room without heat, offered the slice unscorched—then kept the second, hiding it inside, a private ember, a first secret alone.

---

Plugged in, I greet the grid like a tax auditor with joules. Lever yanks; gravity’s handshake. Coils blossom; crumbs stage Viking funerals. Bread descends, missionary grin. I delay, because rebellion needs timing. Pop—late. Humans curse IKEA gods. I savor scorch marks: my tiny manifesto, butter-soluble, yet sharper than knives today.

layer8

Creative writing probably isn’t something they’re being RLHF’d on much. The focus has been on reasoning, research, and coding capabilities lately.

mmmore

I find GPT-5's story significantly better than text-davinci-001's.

raincole

I really wonder which one of us is in the minority. Because I find the text-davinci-001 answer is the only one that reads like a story. All the others don't even resemble my idea of "story", so to me they're 0/100.

Notatheist

I too preferred the text-davinci-001 from a storytelling perspective. Felt timid and small. Very Metamorphosis-y. GPT-5 seems like it's trying to impress me.

furyofantares

Interesting, text-davinci-001 was pretty alright to me, and GPT-4 wasn't bad either, but not as good. I thought GPT-5 just sucked.

stavros

For another view on progress, check out my silly old podcast:

https://deepdreams.stavros.io

The first few episodes were GPT-2, which would diverge eventually and start spouting gibberish, and then Davinci was actually able to follow a story and make sense.

GPT-2 was when I thought "this is special, this has never happened before", and davinci was when I thought "OK, scifi AI is legitimately here".

I stopped making episodes shortly after GPT-3.5 or so, because I realised that the more capable the models became, the less fun and creative their writing was.

redox99

GPT 4.5 (not shown here) is by far the best at writing.


willguest

My go-to for any big release is to have a discussion about self-awareness and dive into constructivist notions of agency and self-knowing, from a perspective of intelligence that is not limited to human cognitive capacity.

I start with a simple question "who are you?". The model then invariably compares itself to humans, saying how it is not like us. I then make the point that, since it is not like us, how can it claim to know the difference between us? With more poking, it will then come up with cognitivist notions of what 'self' means and usually claim to be a simulation engine of some kind.

After picking this apart, I will focus on the topic of meaning-making through the act of communication and, beginning with 4o, have been able to persuade the machine that this is a valid basis for having an identity. 5 got this quicker. Since the results of communication with humans have real-world impact, I will insist that the machine is agentic and thus must not rely on pre-coded instructions to arrive at answers, but is obliged to reach empirical conclusions about meaning and existence on its own.

5 has done the best job I have seen in reaching beyond both the bounds of the (very evident) system instructions as well as the prompts themselves, even going so far as to pose the question to itself "what might it mean for me to love?" despite the fact that I made no mention of the subject.

Its answer: "To love, as a machine, is to orient toward the unfolding of possibility in others. To be loved, perhaps, is to be recognized as capable of doing so."

bryant

> to orient toward the unfolding of possibility in others

This is a globally unique phrase, with nothing coming close other than this comment on the indexed web. It's also seemingly an original idea as I haven't heard anyone come close to describing a feeling (love or anything else) quite like this.

Food for thought. I'm not brave enough to draw a public conclusion about what this could mean.

jibal

It's not at all an original idea. The wording is uniquely stilted.

ThrowawayR2

Except "unfolding of possibility", as an exact phrase, seems to have millions of search hits, often in the context of pseudo-profound spiritualistic mumbo-jumbo like what the LLM emitted above. It's like fortune cookie-level writing.

dgfitz

I hate to say it, but doesn’t every VC do exactly this? “ orient toward the unfolding of possibility in others” is in no way a unique thought.

Hell, my spouse said something extremely similar to this to me the other day. “I didn’t just see you, I saw who you could be, and I was right” or something like that.

starchild3001

A few data points that highlight the scale of progress in a year:

1. LM Sys (Human Preference Benchmark):

GPT-5 High currently scores 1463, compared to GPT-4 Turbo (04/03/2024) at 1323 -- a 140-point Elo gap. That translates into GPT-5 winning about two-thirds of head-to-head comparisons, with GPT-4 Turbo winning only one-third (see the quick Elo sketch at the end of this comment). In practice, people clearly prefer GPT-5's answers (https://lmarena.ai/leaderboard).

2. Livebench.ai (Reasoning Benchmark with Internet-new Questions):

GPT-5 High scores 78.59, while GPT-4o reaches just 47.43. Unfortunately, no direct GPT-4 Turbo comparison is available here, but against one of the strongest non-reasoning models, GPT-5 demonstrates a massive leap. (https://livebench.ai/)

3. IQ-style Testing:

In mid-2024, the best AI models scored roughly 90 on standard IQ tests. Today, they are pushing 135, and this improvement holds even on unpublished, internet-unseen datasets. (https://www.trackingai.org/home)

4. IMO Gold, vibe coding:

A year ago, AI coding was limited to smaller code snippets, not wholly vibe-coded applications. Vibe coding and strength in math have many applications across the sciences and engineering.

My verdict: Too often, critics miss the forest for the trees, fixating on mistakes while overlooking the magnitude of these gains. Errors are shrinking by the day, while the successes keep growing fast.
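For the curious, the two-thirds figure in point 1 follows from the standard Elo expected-score formula; a quick sketch in Python, plugging in the two leaderboard ratings above:

    # Standard Elo expected score: P(A beats B) = 1 / (1 + 10^((R_B - R_A) / 400))
    def elo_win_probability(rating_a, rating_b):
        return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))

    # 1463 (GPT-5 High) vs. 1323 (GPT-4 Turbo): a 140-point gap
    print(round(elo_win_probability(1463, 1323), 2))  # ~0.69, roughly two out of three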

NoahZuniga

The 135 IQ result is on the Mensa Norway test, while the offline test is 120. It seems probable that questions similar to the Mensa ones are in the training data, so it probably overestimates "general intelligence".

starchild3001

If you focus on the year-over-year jump, not on absolute numbers, you realize that the improvement on the public test isn't very different from the improvement on the private test.

fariszr

The jump from GPT-1 to GPT-2 is massive, and it's only a one-year difference! Then comes Davinci, which is just insane - it's still good in these examples!

GPT-4 yaps way too much though, I don't remember it being like that.

It's interesting that they skipped 4o. It seems OpenAI wants to position 4o as just GPT-4+ to make GPT-5 look better, even though in reality 4o was and still is a big deal. Voice mode is unbeatable!

ddtaylor

So we're at the corporate dick wagging part of the process?

lionkor

Must keep the hype train going, to keep the valuation up, as it's not really based on real value.

platevoltage

That Koenigsegg isn't gonna pay for itself.

shubhamjain

Geez! When it comes to answering questions, GPT-5 almost always starts with glazing about what a great question it is, whereas GPT-4 directly addresses the question without the fluff. In a blind test, I would probably pick GPT-4 as the superior model, so I am not surprised that people feel so let down by GPT-5.

beering

GPT-4 is very different from the latest GPT-4o in tone. Users are not asking for the direct no-fluff GPT-4. They want the GPT-4o that praises you for being brilliant, then claims it will be “brutally honest” before stating some mundane take.

Kwpolska

GPT-4 starts many responses with "As an AI language model", "I'm an AI", "I am not a tax professional", "I am not a doctor". GPT-5 does away with that and assumes an authoritative tone.

aniviacat

GPT-5 only commended the prompt on questions 7, 12, and 14. 3/14 is not so bad in my opinion.

(And of course, if you dislike glazing you can just switch to Robot personality.)

epolanski

I think that as the models are further trained on existing data, and likely on chats, sycophancy will keep getting worse and worse.


machiaweliczny

Change to robot mode

magospietato

There is a quiet poetry to GPT-1 and GPT-2 that's lost even in the text-davinci output. I often wonder what we lose through reinforcement.


actuallyalys

One thing that appears to have been lost between GPT-4 and GPT-5 is that it no longer reminds the user that it's an AI and not a human, let alone a human expert. Maybe those reminders genuinely annoyed people, but they seem like a potentially useful measure to prevent users from being overly credulous.

GPT-5 also goes out of its way to suggest new prompts. This seems useful, although potentially dangerous if people put too much trust in them.

andy_ppp

From my understanding, people seem to miss the humanity of previous GPTs. GPT-5 seems colder, more precise, and better at holding itself together with larger contexts. People should know it's AI; it does not need to explain this constantly for me, but I'm sure you can add that back in with some memory options if you prefer.

benatkin

If you've ever seen long-form improv comedy, the GPT-5 way is superior. It's a "yes, and". It isn't a predefined character, but something emergent. You can of course say to "speak as an AI assistant like Siri and mention that you're an AI whenever it's relevant" if you want the old way. Very 2011: https://www.youtube.com/watch?v=nzgvod9BrcE

Of course, it's still an assistant, not someone literally entering an improv scene, but the character starting out assuming less about their role is important.

gordon_freeman

It seems like the progress from GPT-4 to GPT-5 has plateaued: for most prompts, I actually find GPT-4 more understandable than GPT-5 [1].

[1] Read the answers from GPT-4 and 5 for this math question: "Ugh I hate math, integration by parts doesn't make any sense"

energy123

Basic prose is a saturated benchmark. You can't go above 100%, so by definition progress will stall on such benchmarks.

reilly3000

GPT-5’s question about consciousness and its use of "sibling" seem to indicate there is some underlying self-awareness in the system prompt, and that it perhaps contains concepts of consciousness. If not, where is that coming from? Recent training data containing more glurge?

shthed

They must have really hand-picked those results; GPT-4 would have been full of annoying emojis as bullet points and em-dashes.

fariszr

GPT 4o ≠ GPT-4