svara
In my experience, the best models are already nearly as good as they can be for a large fraction of what I personally use them for, which is basically as a more efficient search engine.
The thing that would now make the biggest difference isn't "more intelligence", whatever that might mean, but better grounding.
It's still a big issue that the models will make up plausible sounding but wrong or misleading explanations for things, and verifying their claims ends up taking time. And if it's a topic you don't care about enough, you might just end up misinformed.
I think Google/Gemini realize this, since their "verify" feature is designed to address exactly this. Unfortunately it hasn't worked very well for me so far.
But to me it's very clear that the product that gets this right will be the one I use.
stacktrace
> It's still a big issue that the models will make up plausible sounding but wrong or misleading explanations for things, and verifying their claims ends up taking time. And if it's a topic you don't care about enough, you might just end up misinformed.
Exactly! One important thing LLMs have made me realise deeply is "No information" is better than false information. The way LLMs pull out completely incorrect explanations baffles me - I suppose that's expected since in the end it's generating tokens based on its training and it's reasonable it might hallucinate some stuff, but knowing this doesn't ease any of my frustration.
IMO if LLMs need to focus on anything right now, they should focus on better grounding. Maybe even something like a probability/confidence score - it might end up making the experience so much better for so many users like me.
biofox
I ask for confidence scores in my custom instructions / prompts, and LLMs do surprisingly well at estimating their own knowledge most of the time.
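For anyone curious what that looks like in practice, here is a minimal sketch assuming an OpenAI-style chat API; the instruction wording, model name, and question are illustrative assumptions, not the commenter's actual setup.

    from openai import OpenAI

    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-5.2",  # assumed model name
        messages=[
            {"role": "system",
             "content": "After each factual claim, append a confidence score from 0-100 "
                        "reflecting how well you actually know the topic. "
                        "If your confidence is below 50, say so explicitly instead of guessing."},
            {"role": "user", "content": "What sport did Michael Jordan play?"},
        ],
    )
    print(resp.choices[0].message.content)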
actionfromafar
I wonder if the only way to fix this with current LLMs would be to generate a lot of synthetic data for a select number of topics you really don't want it to "go off the rails" with. That synthetic data would be lots of variations on that "I don't know how to do X with Y".
robocat
> wrong or misleading explanations
Exactly the same issue occurs with search.
Unfortunately not everybody knows to mistrust AI responses, or has the skills to double-check information.
darkwater
No, it's not the same. Search results send/show you one or more specific pages/websites. And each website has a different trust factor. Yes, plenty of people repeat things they "read on the Internet" as truths, but it's easy to debunk some of them just based on the site reputation. With AI responses, the reputation is shared with the good answers as well, because they do give good answers most of the time, but also hallucinate errors.
lins1909
What is it about people making up lies to defend LLMs? In what world is it exactly the same as search? They're literally different things, since you get information from multiple sources and can do your own filtering.
incrudible
If somebody asks a question on Stackoverflow, it is unlikely that a human who does not know the answer will take time out of their day to completely fabricate a plausible sounding answer.
fauigerzigerk
I agree, but the question is how better grounding can be achieved without a major research breakthrough.
I believe the real issue is that LLMs are still so bad at reasoning. In my experience, the worst hallucinations occur where only a handful of sources exist for some set of facts (e.g. laws of small countries or descriptions of niche products).
LLMs know these sources and they refer to them but they are interpreting them incorrectly. They are incapable of focusing on the semantics of one specific page because they get "distracted" by their pattern matching nature.
Now people will say that this is unavoidable given the way in which transformers work. And this is true.
But shouldn't it be possible to include some measure of data sparsity in the training so that models know when they don't know enough? That would enable them to boost the weight of the context (including sources they find through inference-time search/RAG) relative to their pretraining.
cachius
Grounding in search results is what Perplexity pioneered; Google does it too with AI Mode, and ChatGPT and others with their web search tools.
As a user I want it, but as a webadmin it kills dynamic pages, and that's why proof-of-work (aka CPU-time) captchas like Anubis https://github.com/TecharoHQ/anubis#user-content-anubis or BotID https://vercel.com/docs/botid are now everywhere. If only these AI crawlers did some caching, but no, they just go and overrun the web - to the point that they can't anymore, at the price of shutting down small sites and making life worse for everyone, all for a few months of rapacious crawling. Perplexity literally moved fast and broke things.
cachius
> This dance to get access is just a minor annoyance for me, but I question how it proves I’m not a bot. These steps can be trivially and cheaply automated.
> I think the end result is just an internet resource I need is a little harder to access, and we have to waste a small amount of energy.
From Tavis Ormandy who wrote a C program to solve the Anubis challenges out of browser https://lock.cmpxchg8b.com/anubis.html via https://news.ycombinator.com/item?id=45787775
Guess a mix of Markov tarpits and LLM meta-instructions will be added, cf. Feed the bots https://news.ycombinator.com/item?id=45711094 and Nepenthes https://news.ycombinator.com/item?id=42725147
phorkyas82
Isn't that what no LLM can provide: being free of hallucinations?
arw0n
I think the better word is confabulation; fabricating plausible but false narratives based on wrong memory. Fundamentally, these models try to produce plausible text. With language models getting large, they start creating internal world models, and some research shows they actually have truth dimensions. [0]
I'm not an expert on the topic, but to me it sounds plausible that a good part of the problem of confabulation comes down to misaligned incentives. These models are trained hard to be a 'helpful assistant', and this might conflict with telling the truth.
Being free of hallucinations is a bit too high a bar to set anyway. Humans are extremely prone to confabulations as well, as can be seen by how unreliable eye witness reports tend to be. We usually get by through efficient tool calling (looking shit up), and some of us through expressing doubt about our own capabilities (critical thinking).
Tepix
> false narratives based on wrong memory
I don't think "wrong memory" is accurate - it's missing the information and either doesn't know it or is trained not to admit it.
Check out the Dwarkesh Podcast episode https://www.dwarkesh.com/p/sholto-trenton-2 starting at 1:45:38
Here is the relevant quote by Trenton Bricken from the transcript:
One example I didn't talk about before with how the model retrieves facts: So you say, "What sport did Michael Jordan play?" And not only can you see it hop from like Michael Jordan to basketball and answer basketball. But the model also has an awareness of when it doesn't know the answer to a fact. And so, by default, it will actually say, "I don't know the answer to this question." But if it sees something that it does know the answer to, it will inhibit the "I don't know" circuit and then reply with the circuit that it actually has the answer to. So, for example, if you ask it, "Who is Michael Batkin?" —which is just a made-up fictional person— it will by default just say, "I don't know." It's only with Michael Jordan or someone else that it will then inhibit the "I don't know" circuit.
But what's really interesting here and where you can start making downstream predictions or reasoning about the model, is that the "I don't know" circuit is only on the name of the person. And so, in the paper we also ask it, "What paper did Andrej Karpathy write?" And so it recognizes the name Andrej Karpathy, because he's sufficiently famous, so that turns off the "I don't know" reply. But then when it comes time for the model to say what paper it worked on, it doesn't actually know any of his papers, and so then it needs to make something up. And so you can see different components and different circuits all interacting at the same time to lead to this final answer.
svara
That's right - it does seem to have to do with trying to be helpful.
One demo of this that reliably works for me:
Write a draft of something and ask the LLM to find the errors.
Correct the errors, repeat.
It will never stop finding a list of errors!
The first time around and maybe the second it will be helpful, but after you've fixed the obvious things, it will start complaining about things that are perfectly fine, just to satisfy your request of finding errors.
officialchicken
No, the correct word is hallucinating. That's the word everyone uses and has been using. While it might not be technically correct, everyone knows what it means and more importantly, it's not a $3 word and everyone can relate to the concept. I also prefer all the _other_ more accurate alternative words Wikipedia offers to describe it:
"In the field of artificial intelligence (AI), a hallucination or artificial hallucination (also called bullshitting,[1][2] confabulation,[3] or delusion[4]) is"
kyletns
For the record, brains are also not free of hallucinations.
rimeice
I still don’t really get this argument/excuse for why it’s acceptable that LLMs hallucinate. These tools are meant to support us, but we end up with two parties who are, as you say, prone to “hallucination” and it becomes a situation of the blind leading the blind. Ideally in these scenarios there’s at least one party with a definitive or deterministic view so the other party (i.e. us) at least has some trust in the information they’re receiving and any decisions they make off the back of it.
delaminator
That’s not a very useful observation though is it?
The purpose of mechanisation is to standardise and over the long term reduce errors to zero.
Otoh “The final truth is there is no truth”
svara
Yes, they'll probably not go away, but it's got to be possible to handle them better.
Gemini (the app) has a "mitigation" feature where it tries to do Google searches to support its statements. That doesn't currently work properly in my experience.
It also seems to be doing something where it adds references to statements (With a separate model? With a second pass over the output? Not sure how that works.). That works well where it adds them, but it often doesn't do it.
intended
Doubt it. I suspect it’s fundamentally not possible in the spirit you intend it.
Reality is perfectly fine with deception and inaccuracy. For language to magically be self constraining enough to only make verified statements is… impossible.
andai
So there's two levels to this problem.
Retrieval.
And then hallucination even in the face of perfect context.
Both are currently unsolved.
(Retrieval's doing pretty good but it's a Rube Goldberg machine of workarounds. I think the second problem is a much bigger issue.)
cachius
Re: retrieval: that's where the snake eats its tail. As AI slop floods the web, grounding is like laying a foundation in a swamp, and that Rube Goldberg machine tries to prevent the snake from reaching its tail. But RGs are brittle and not exactly the thing you want to build infrastructure on. Just look at https://news.ycombinator.com/item?id=46239752 for an example of how easily it can break.
jillesvangurp
It's increasingly a space that is constrained by the tools and integrations. Models provide a lot of raw capability. But with the right tools even the simpler, less capable models become useful.
Mostly we're not trying to win a nobel prize, develop some insanely difficult algorithm, or solve some silly leetcode problem. Instead we're doing relatively simple things. Some of those things are very repetitive as well. Our core job as programmers is automating things that are repetitive. That always was our job. Using AI models to do boring repetitive things is a smart use of time. But it's nothing new. There's a long history of productivity increasing tools that take boring repetitive stuff away. Compilation used to be a manual process that involved creating stacks of punch cards. That's what the first automated compilers produced as output: stacks of punch cards. Producing and stacking punchcards is not a fun job. It's very repetitive work. Compilers used to be people compiling punchcards. Women mostly, actually. Because it was considered relatively low skilled work. Even though it arguably wasn't.
Some people are very unhappy that the easier parts of their job are being automated and they are worried that they will be automated away completely. That's only true if you exclusively do boring, repetitive, low value work. Then yes, your job is at risk. If your work is a mix of that and some higher value, non repetitive, and more fun stuff to work on, your life could get a lot more interesting. Because you get to automate away all the boring and repetitive stuff and spend more time on the fun stuff. I'm a CTO. I have lots of fun lately. Entire new side projects that I had no time for previously I can now just pull off in a spare few hours.
Ironically, a lot of people currently get the worst of both worlds because they now find themselves babysitting AIs doing a lot more of the boring repetitive stuff than they could do without them, to the point where that is actually all that they do. It's still boring and repetitive. And it should be automated away ultimately. Arguably many years ago actually. The reason so many React projects feel like Groundhog Day is because they are very repetitive. You need a login screen, and a cookies screen, and a settings screen, etc. Just like the last 50 projects you did. Why are you rebuilding those things from scratch? Manually? These are valid questions to ask yourself if you are a frontend programmer. And now you have AI to do that for you.
Find something fun and valuable to work on and AI gets a lot more fun because it gives you more quality time with the fun stuff. AI is about doing more with less. About raising the ambition level.
anentropic
Yeah I basically always use "web search" option in ChatGPT for this reason, if not using one of the more advanced modes.
chuckSu
[dead]
breakingcups
Is it me, or did it still get at least three placements of components (RAM and PCIe slots, plus it's DisplayPort and not HDMI) in the motherboard image[0] completely wrong? Why would they use that as a promotional image?
0: https://images.ctfassets.net/kftzwdyauwt9/6lyujQxhZDnOMruN3f...
tedsanders
Yep, the point we wanted to make here is that GPT-5.2's vision is better, not perfect. Cherrypicking a perfect output would actually mislead readers, and that wasn't our intent.
BoppreH
That would be a laudable goal, but I feel like it's contradicted by the text:
> Even on a low-quality image, GPT‑5.2 identifies the main regions and places boxes that roughly match the true locations of each component
I would not consider it to have "identified the main regions" or to have "roughly matched the true locations" when ~1/3 of the boxes have incorrect labels. The remark "even on a low-quality image" is not helping either.
Edit: credit where credit is due, the recently-added disclaimer is nice:
> Both models make clear mistakes, but GPT‑5.2 shows better comprehension of the image.
hnuser123456
Yeah, what it's calling RAM slots is the CMOS battery. What it's calling the PCIE slot is the interior side of the DB-9 connector. RAM slots and PCIE slots are not even visible in the image.
furyofantares
They also changed "roughly match" to "sometimes match".
guerrilla
Leave it to OpenAI to be dishonest about being dishonest. It seems they're also editing this post without notice as well.
arscan
I think you may have inadvertently misled readers in a different way. I feel misled after not catching the errors myself, assuming it was broadly correct, and then coming across this observation here. Might be worth mentioning this is better but still inaccurate. Just a bit of feedback, I appreciate you are willing to show non-cherry-picked examples and are engaging with this question here.
Edit: As mentioned by @tedsanders below, the post was edited to include clarifying language such as: “Both models make clear mistakes, but GPT‑5.2 shows better comprehension of the image.”
tedsanders
Thanks for the feedback - I agree our text doesn't make the models' mistakes clear enough. I'll make some small edits now, though it might take a few minutes to appear.
layer8
You know what would be great? If it had added some boxes with “might be X or Y, but not sure”.
g947o
When I saw that it labeled DP ports as HDMI I immediately decided that I am not going to touch this until it is at least 5x better with 95% accuracy with basic things.
I don't see any advantage in using the tool.
jacquesm
That's far more dangerous territory. A machine that is obviously broken will not get used. A machine that is subtly broken will propagate errors because it will have achieved a high enough trust level that it will actually get used.
Think 'Therac-25': it worked 99.5% of the time. In fact it worked so well that reports of malfunctions were routinely discarded.
iwontberude
But it’s completely wrong.
iamdanieljohns
Is Adaptive Reasoning gone from GPT-5.2? It was a big part of the release of 5.1 and Codex-Max. Really felt like the future.
tedsanders
Yes, GPT-5.2 still has adaptive reasoning - we just didn't call it out by name this time. Like 5.1 and codex-max, it should do a better job at answering quickly on easy queries and taking its time on harder queries.
d--b
[flagged]
honeycrispy
Not sure what you mean, Altman does that fake-humility thing all the time.
It's a marketing trick; show honesty in areas that don't have much business impact so the public will trust you when you stretch the truth in areas that do (AGI cough).
wilg
What did Sam Altman say? Or is this more of a vague impression thing?
az226
And here is Gemini 3: https://media.licdn.com/dms/image/v2/D5610AQH7v9MtrZxxug/ima...
FinnKuhn
This is genuinely impressive. The OpenAI equivalent is less detailed AND less correct.
saejox
This is very impressive. Google really is ahead
8organicbits
Promotional content for LLMs is really poor. I was looking at Claude Code and the example on their homepage implements a feature, ignoring a warning about a security issue, commits locally, does not open a PR and then tries to close the GitHub issue. Whatever code it wrote they clearly didn't use as the issue from the prompt is still open. Bizarre examples.
timerol
Also a "stacked pair" of USB type-A ports, when there are clearly 4
dolmen
Not that bad compared to product images seen on AliExpress.
fumeux_fume
General purpose LLMs aren't very good at generating bounding boxes, so with that context, this actually counts as decent performance for certain use cases.
jasonlotito
FTA: Both models make clear mistakes, but GPT‑5.2 shows better comprehension of the image.
You can find it right next to the image you are talking about.
tedsanders
To be fair to OP, I just added this to our blog after their comment, in response to the correct criticisms that our text didn't make it clear how bad GPT-5.2's labels are.
LLMs have always been very subhuman at vision, and GPT-5.2 continues in this tradition, but it's still a big step up over GPT-5.1.
One way to get a sense of how bad LLMs are at vision is to watch them play Pokemon. E.g.,: https://www.lesswrong.com/posts/u6Lacc7wx4yYkBQ3r/insights-i...
They still very much struggle with basic vision tasks that adults, kids, and even animals can ace with little trouble.
da_grift_shift
'Commented after article was already edited in response to HN feedback' award
whalesalad
to be fair that image has the resolution of a flip phone from 2003
malfist
If I ask you a question and you don't have enough information to answer, you don't confidently give me an answer, you say you don't know.
I might not know exactly how many USB ports this motherboard has, but I wouldn't select a set of 4 and declare it to be a stacked pair.
AstroBen
No-one should have the expectation LLMs are giving correct answers 100% of the time. It's inherent to the tech for them to be confidently wrong
Code needs to be checked
References need to be checked
Any facts or claims need to be checked
redox99
It's trivial for a human who knows what a PC looks like. Maybe mistaking DisplayPort for HDMI.
goobatrooba
I feel there is a point when all these benchmarks are meaningless. What I care about beyond decent performance is the user experience. There I have grudges with every single platform and the one thing keeping me as a paid ChatGPT subscriber is the ability to sort chats in "projects" with associated files (hello Google, please wake up to basic user-friendly organisation!)
But all of them:
* Lie far too often with confidence
* Refuse to stick to prompts (e.g. ChatGPT to the request to number each reply for easy cross-referencing; Gemini to the basic request to respond in a specific language)
* Refuse to express uncertainty or nuance (I asked ChatGPT to give me certainty %s, which it did for a while but then just forgot...?)
* Refuse to give me short answers without fluff or follow-up questions
* Refuse to stop complimenting my questions or disagreements with wrong/incomplete answers
* Don't quote sources consistently so I can check facts, even when I ask for it
* Refuse to make clear whether they rely on original documents or an internal summary of the document, until I point out errors
* ...
I also have substance gripes, but for me such basic usability points are really something all of the chatbots fail on abysmally. Stick to instructions! Stop creating walls of text for simple queries! Tell me when something is uncertain! Tell me if there's no data or info rather than making something up!
empiko
Consider using structured output. You can define a JSON with specific fields, and LLMs are only used to fill in the values.
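A minimal sketch of what that can look like, assuming the OpenAI Python SDK's JSON-schema structured outputs; the field names, question, and model name are illustrative assumptions, not anything from the commenter.

    from openai import OpenAI

    client = OpenAI()

    # Illustrative schema: force the model to separate its answer from a
    # self-reported confidence and the sources it claims to rely on.
    schema = {
        "type": "object",
        "properties": {
            "answer": {"type": "string"},
            "confidence": {"type": "number"},
            "sources": {"type": "array", "items": {"type": "string"}},
        },
        "required": ["answer", "confidence", "sources"],
        "additionalProperties": False,
    }

    resp = client.chat.completions.create(
        model="gpt-5.2",  # assumed model name
        messages=[{"role": "user", "content": "Summarize the Therac-25 failures."}],
        response_format={
            "type": "json_schema",
            "json_schema": {"name": "grounded_answer", "schema": schema, "strict": True},
        },
    )
    print(resp.choices[0].message.content)  # JSON constrained to the schema above

Note that the model still decides what goes into each field, so this constrains the shape of the output, not its truthfulness.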
razster
The latest of the big three... OpenAI, Claude, and Google: none of their models are good. I've spent more time monitoring them than enjoying them. I've found it easier to run my own local LLM. I gave the latest Gemini release another go, only for it to misspell words and drift off into a fantasy world after a few chats helping restructure guides. ChatGPT has become lazy for some reason and changes things I told it to ignore, randomly too. Claude was doing great until the latest release, then it started getting lazy after 20+k tokens. I tried keeping a guide on hand to refresh it if it started forgetting, but that didn't help.
Locals are better; I can script and have them script for me to build a guide creation process. They don't forget because that is all they're trained on. I'm done paying for 'AI'.
marcosscriven
What are your best local models, and what hardware do you run them on?
striking
What's to stop you from using the APIs the way you'd like?
hnfong
There's a leaderboard that measures user experience, the "lmsys" Chatbot Arena Leaderboard ( https://huggingface.co/spaces/lmarena-ai/lmarena-leaderboard ). The main issue with it these days is that it kinda measures sycophancy and user-preferred tone more than substance.
Some issues you mentioned like length of response might be user preference. Other issues like "hallucination" are areas of active research (and there are benchmarks for these).
nullbound
<< I feel there is a point when all these benchmarks are meaningless.
I am relatively certain you are not alone in this sentiment. The issue is that the moment we move past seemingly objective measurements, it is harder to convince people that what we measure is appropriate, but the measurable stuff can be somewhat gamed, which adds a fascinating cat-and-mouse layer to this.
delifue
Once a metric becomes an optimization target, it ceases to be a good metric.
ifwinterco
I'm not an expert, but my understanding is that transformer-based models simply can't do some of those things; it isn't really how they work.
Especially something like expressing a certainty %: you might be able to get it to output one, but it's just making it up. LLMs are incredibly useful (I use them every day) but you'll always have to check important output.
zone411
I've benchmarked it on the Extended NYT Connections benchmark (https://github.com/lechmazur/nyt-connections/):
The high-reasoning version of GPT-5.2 improves on GPT-5.1: 69.9 → 77.9.
The medium-reasoning version also improves: 62.7 → 72.1.
The no-reasoning version also improves: 22.1 → 27.5.
Gemini 3 Pro and Grok 4.1 Fast Reasoning still score higher.
Donald
Gemini 3 Pro Preview gets 96.8% on the same benchmark? That's impressive
capitainenemo
And it performs very well on the latest 100 puzzles too, so it isn't just learning the data set (unless, I guess, they routinely index this repo).
I wonder how well AIs would do at Bracket City. I tried Gemini on it and was underwhelmed. It made a lot of terrible connections and often bled data from one level into the next.
bigyabai
GPT-5.2 might be Google's best Gemini advertisement yet.
outside1234
Especially when you see the price
tikotus
Here's someone else testing models on a daily logic puzzle (Clues by Sam): https://www.nicksypteras.com/blog/cbs-benchmark.html GPT 5 Pro was the winner already before in that test.
thanhhaimai
This link doesn't have Gemini 3 performance on it. Do you have an updated link with the new models?
dezgeg
I've also tried Gemini 3 for Clues by Sam and it can do really well, have not seen it make a single mistake even for Hard and Tricky ones. Haven't run it on too many puzzles though.
crapple8430
GPT 5 Pro is a good 10x more expensive so it's an apples to oranges comparison.
Bombthecat
I would like to see a cost per percent or so row. I feel like grok would beat them all
scrollop
Why no grok 4.1 reasoning?
sanex
Do people other than Elon fans use grok? Honest question. I've never tried it.
sz4kerto
I'm using Gemini in general, but Grok too. That's because sometimes Gemini Thinking is too slow, but Fast can get confused a lot. Grok strikes a nice balance between being quite smart (not Gemini 3 Pro level, but close) and very fast.
buu700
I use Grok pretty heavily, and Elon doesn't factor into it any more than Sam and Sundar do when I use GPT and Gemini. A few use cases where it really shines:
* Research and planning
* Writing complex isolated modules, particularly when the task depends on using a third-party API correctly (or even choosing an API/library at its own discretion)
* Reasoning through complicated logic, particularly in cases that benefit from its eagerness to throw a ton of inference at problems where other LLMs might give a shallower or less accurate answer without more prodding
I'll often fire off an off-the-cuff message from my phone to have Grok research some obscure topic that involves finding very specific data and crunching a bunch of numbers, or write a script for some random thing that I would previously never have bothered to spend time automating, and it'll churn for ~5 minutes on reasoning before giving me exactly what I wanted with few or no mistakes.
As far as development, I personally get a lot of mileage out of collaborating with Grok and Gemini on planning/architecture/specs and coding with GPT. (I've stopped using Claude since GPT seems interchangeable at lower cost.)
For reference, I'm only referring to the Grok chatbot right now. I've never actually tried Grok through agentic coding tooling.
wdroz
Unlike OpenAI, you can use the latest Grok models without verifying your organization and giving your ID.
jbm
I use a few AIs together to examine the same code base. I find Grok better than some of the Chinese ones I've used, but it isn't in the same league as Claude or Codex.
bumling
I dislike Musk, and use Grok. I find it most useful for analyzing text to help check if there's anything I've missed in my own reading. Having it built in to Twitter is convenient and it has a generous free tier.
mac-attack
I can't understand why people would trust a CEO who regularly lies about product timelines, product features, his own personal life, etc. And that's before politicizing his entire kingdom by literally becoming a part of government and one of the larger donors to the current administration.
ralusek
Only thing I use grok for is if there is a current event/meme that I keep seeing referenced and I don't understand, it's good at pulling from tweets
fatata123
[dead]
agentifysh
Looks like they've begun censoring posts at r/Codex and not allowing complaint threads so here is my honest take:
- It is faster which is appreciated but not as fast as Opus 4.5
- I see no changes, very little noticeable improvement over 5.1
- I do not see any value in exchange for +40% in token costs
All in all I can't help but feel that OpenAI is facing an existential crisis. Gemini 3, even when it's used from AI Studio, offers close to ChatGPT Pro performance for free. Anthropic's Claude Code at $100/month is tough to beat. I am using Codex with the $40 credits, but there's been a silent increase in token costs and usage limitations.
AstroBen
Did you notice much improvement going from Gemini 2.5 to 3? I didn't
I just think they're all struggling to provide real world improvements
chillfox
Gemini 3 Pro is the first model from Google that I have found usable, and it's very good. It has replaced Claude for me in some cases, but Claude is still my goto for use in coding agents.
(I only access these models via API)
dcre
Nearly everyone else (and every measure) seems to have found 3 a big improvement over 2.5.
agentifysh
Oh yes, I'm noticing significant improvements across the board, but mainly having the 1,000,000-token context makes a ton of difference: I can keep digging at a problem without compaction.
XCSme
Maybe they are just more consistent, which is a bit hard to notice immediately.
dudeinhawaii
I noticed a quite noticeable improvement to the point where I made it my go-to model for questions. Coding-wise, not so much. As an intelligent model, writing up designs, investigations, general exploration/research tasks, it's top notch.
enraged_camel
Gemini 3 was a massive improvement over 2.5, yes.
free652
Yes, 2.5 just couldn't use tools right. 3.0 is way better at coding - better than Sonnet 4.5.
cmrdporcupine
I think what they're actually struggling with is costs. And I think they're all behind the scenes quantizing models to manage load here and there, and they're all giving inconsistent results.
I noticed huge improvement from Sonnet 4.5 to Opus 4.5 when it became unthrottled a couple weeks ago. I wasn't going to sign back up with Anthropic but I did. But two weeks in it's already starting to seem to be inconsistent. And when I go back to Sonnet it feels like they did something to lobotomize it.
Meanwhile I can fire up DeepSeek 3.2 or GLM 4.6 for a fraction of the cost and get almost as good as results.
hmottestad
I’m curious about if the model has gotten more consistent throughout the full context window? It’s something that OpenAI touted in the release, and I’m curious if it will make a difference for long running tasks or big code reviews.
agentifysh
One positive is that 5.2 is very good at finding bugs, but I'm not sure about throughput; I'd imagine it might be improved, but I haven't seen a real task to benchmark it on.
What I am curious about is 5.2-codex, but many of us complained about 5.1-codex (it seemed to get tunnel visioned), and I have been using vanilla 5.1.
It's just getting very tiring to deal with 5 different permutations of 3 completely separate models, but perhaps this is the intent and will keep you on a chase.
jbkkd
A new model doesn't address the fundamental reliability issues with OpenAI's enterprise tier.
As an enterprise customer, the experience has been disappointing. The platform is unstable, support is slow to respond even when escalated to account managers, and the UI is painfully slow to use. There are also baffling feature gaps, like the lack of connectors for custom GPTs.
None of the major providers have a perfect enterprise solution yet, but given OpenAI's market position, the gap between expectations and delivery is widening.
sigmoid10
Which tier are you? We are on the highest enterprise tier and I've found that OpenAI is a much more stable platform for high-usage than other providers. Can't say much about the UI though since I almost exclusively work with the API. I feel like UIs generally suck everywhere unless you want to do really generic stuff.
simonw
Wow, there's a lot going on with this pelican riding a bicycle: https://gist.github.com/simonw/c31d7afc95fe6b40506a9562b5e83...
alechewitt
Nice work on these benchmarks Simon. I’ve followed your blog closely since your great talk at the AI Engineers World Fair, and I want to say thank you for all the high quality content you share for free. It’s become my primary source for keeping up to date.
I’ve been working on a few benchmarks to test how well LLMs can recreate interfaces from screenshots. (https://github.com/alechewitt/llm-ui-challenge). From my basic tests, it seems GPT-5.2 is slightly better at these UI recreations. For example, in the MS Word replica, it implemented the undo/redo buttons as well as the bold/italic formatting that GPT-5.1 handled, and it generally seemed a bit closer to the original screenshot (https://alechewitt.github.io/llm-ui-challenge/outputs/micros...).
In the VS Code test, it also added the tabs that weren’t visible in the screenshot! (https://alechewitt.github.io/llm-ui-challenge/outputs/vs_cod...).
simonw
That is a very good benchmark. Interesting to see GPT-5.2 delivering on the promise of better vision support there.
Stevvo
The variance is way too high for this test to have any value at all. I ran it 10 times, and each pelican on a bicycle was a better rendition than that, about half of them you could say were perfect.
golly_ned
Compared to the other benchmarks which are much more gameable, I trust PelicanBikeEval way more.
refulgentis
[flagged]
getnormality
Well, the variance is itself interesting.
throwaway102398
[dead]
BeetleB
They probably saw your complaint that 5.1 was too spartan and a regression (I had the same experience with 5.1 in the POV-Ray version - have yet to try 5.2 out...).
tkgally
I added GPT-5.2 Pro to my pelican-alternatives benchmark for the first three prompts:
Generate an SVG of an octopus operating a pipe organ
Generate an SVG of a giraffe assembling a grandfather clock
Generate an SVG of a starfish driving a bulldozer
https://gally.net/temp/20251107pelican-alternatives/index.ht...
GPT-5.2 Pro cost about 80 cents per prompt through OpenRouter, so I stopped there. I don’t feel like spending that much on all thirty prompts.
smusamashah
Hi, it doesn't have Gemini 3.5 Pro which seems to be the best at this
svantana
That's probably because "Gemini 3.5 Pro" doesn't exist
AstroBen
Seems to be getting more aerodynamic. A clear sign of AI intelligence
fxwin
the only benchmark i trust
minimaxir
Is that the first SVG pelican with drop shadows?
simonw
No, I got drop shadows from DeepSeek 3.2 recently https://simonwillison.net/2025/Dec/1/deepseek-v32/ (probably others as well.)
sroussey
What is good at SVG design?
KellyCriterion
I've not seen any model that's good at graphic/SVG creation so far - all of the output mostly looks ugly and somewhat "synthetic-distorted".
And lately, Claude (web) started drawing ASCII charts from one day to the next instead of the colorful infographic-styled images it produced before (those were only slightly better than the ASCII charts).
culi
Not svg, but basically the same challenge:
https://clocks.brianmoore.com/
Probably Kimi or Deepseek are best
azinman2
Graphic designers?
mmaunder
Weirdly, the blog announcement completely omits the actual new context window size which is 400,000: https://platform.openai.com/docs/models/gpt-5.2
Can I just say !!!!!!!! Hell yeah! Blog post indicates it's also much better at using the full context.
Congrats OpenAI team. Huge day for you folks!!
Started on Claude Code and like many of you, had that omg CC moment we all had. Then got greedy.
Switched over to Codex when 5.1 came out. WOW. Really nice acceleration in my Rust/CUDA project which is a gnarly one.
Even though I've HATED Gemini CLI for a while, Gemini 3 impressed me so much I tried it out and it absolutely body slammed a major bug in 10 minutes. Started using it to consult on commits. Was so impressed it became my daily driver. Huge mistake. I almost lost my mind after a week of fighting it. Insane bias towards action. Ignoring user instructions. Garbage characters in output. Absolutely no observability into its thought process. And on and on.
Switched back to Codex just in time for 5.1 codex max xhigh which I've been using for a week, and it was like a breath of fresh air. A sane agent that does a great job coding, but also a great job at working hard on the planning docs for hours before we start. Listens to user feedback. Observability on chain of thought. Moves reasonably quickly. And also makes it easy to pay them more when I need more capacity.
And then today GPT-5.2 with an xhigh mode. I feel like Xmas has come early. Right as I'm doing a huge Rust/CUDA/math-heavy refactor. THANK YOU!!
ubutler
> Weirdly, the blog announcement completely omits the actual new context window size which is 400,000: https://platform.openai.com/docs/models/gpt-5.2
As @lopuhin points out, they already claimed that context window for previous iterations of GPT-5.
The funny thing is though, I'm on the business plan, and none of their models, not GPT-5, GPT-5.1, GPT-5.2, GPT-5.2 Extended Thinking, GPT-5.2 Pro, etc., can really handle inputs beyond ~50k tokens.
I know because, when working with a really long Python file (>5k LoCs), it often claims there is a bug because, somewhere close to the end of the file, it cuts off and reads as '...'.
Gemini 3 Pro, by contrast, can genuinely handle long contexts.
lopuhin
Context window size of 400k is not new, gpt-5, 5.1, 5-mini, etc. have the same. But they do claim they improved long context performance which if true would be great.
energy123
But 400k was never usable in ChatGPT Plus/Pro subscriptions. It was nerfed down to 60-100k. If you submitted too long of a prompt they deleted the tokens on the end of your prompt before calling the model. Or if the chat got too long (still below 100k however) they deleted your first messages. This was 3 months ago.
Can someone with an active sub check whether we can submit a full 400k prompt (or at least 200k) and there is no prompt truncation in the backend? I don't mean attaching a file, which uses RAG.
piskov
Context windows for web:
Fast (GPT‑5.2 Instant): Free 16K; Plus / Business 32K; Pro / Enterprise 128K
Thinking (GPT‑5.2 Thinking): all paid tiers 196K
https://help.openai.com/en/articles/11909943-gpt-52-in-chatg...
eru
> Or if the chat got too long (still below 100k however) they deleted your first messages. This was 3 months ago.
I can believe that, but it also seems really silly? If your max context window is X and the chat has approached that, instead of outright deleting the first messages, why not have your model summarise the first quarter of tokens and place that summary at the beginning of the log you feed as context? Since the chat history is (mostly) immutable, this only adds a minimal overhead: you can cache the summarisation and don't have to redo it for each new message. (If the partially summarised log gets too long, you summarise again.)
Since I can come up with this technique in half a minute of thinking about the problem, and the OpenAI folks are presumably not stupid, I wonder what downside I'm missing.
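A rough sketch of that rolling-summary idea, assuming an OpenAI-style chat API; the threshold, prompt wording, and model name are made up for illustration.

    from openai import OpenAI

    client = OpenAI()
    MODEL = "gpt-5.2"      # assumed model name
    MAX_MESSAGES = 40      # stand-in for a real token-budget check

    def compact(history: list[dict]) -> list[dict]:
        """Summarise the oldest quarter of the chat and splice the summary back in."""
        if len(history) < MAX_MESSAGES:
            return history
        cutoff = len(history) // 4
        old, recent = history[:cutoff], history[cutoff:]
        summary = client.chat.completions.create(
            model=MODEL,
            messages=old + [{"role": "user",
                             "content": "Summarise the conversation above in a few sentences."}],
        ).choices[0].message.content
        # The summary only changes when we compact again, so it can be cached.
        return [{"role": "system",
                 "content": "Summary of earlier conversation: " + summary}] + recent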
gunalx
API use was not merged in this way.
lhl
Anecdotally, I will say that for my toughest jobs GPT-5+ High in `codex` has been the best tool I've used - CUDA->HIP porting, finding bugs in torch, websockets, etc, it's able to test, reason deeply and find bugs. It can't make UI code for it's life however.
Sonnet/Opus 4.5 is faster, generally feels like a better coder, and make much prettier TUI/FEs, but in my experience, for anything tough any time it tells you it understands now, it really doesn't...
Gemini 3 Pro is unusable - I've found the same thing, opinionated in the worst way, unreliable, doesn't respect my AGENTS.md and for my real world problems, I don't think it's actually solved anything that I can't get through w/ GPT (although I'll say that I wasn't impressed w/ Max, hopefully 5.2 xhigh improves things). I've heard it can do some magic from colleagues working on FE, but I'll just have to take their word for it.
freedomben
I haven't done a ton of testing due to cost, but so far I've actually gotten worse results with xhigh than high with gpt-5.1-codex-max. Made me wonder if it was somehow a PEBKAC error. Have you done much comparison between high and xhigh?
dudeinhawaii
This is one of those areas where I think it's about the complexity of the task. What I mean is, if you set codex to xhigh by default, you're wasting compute. If you're setting it to xhigh when troubleshooting a complex memory bug or something, you're presumably more likely to get a quality response.
I think in general, medium ends up being the best all-purpose setting, while high+ is good for single-task deep dives. Or at least that has been my experience so far. You can theoretically let it work longer on a harder task as well.
A lot appears to depend on the problem and problem domain unfortunately.
I've used max in problem sets as diverse as "troubleshooting Cyberpunk mods" and figuring out a race condition in a server backend. In those cases, it did a pretty good job of exhausting available data (finding all available logs, digging into lua files), and narrowing a bug that every other model failed to get.
I guess in some sense you have to know from the onset that it's a "hard problem". That in and of itself is subjective.
wahnfrieden
You should also be making handoffs to/from Pro
robotswantdata
For a few weeks the Codex model has been cursed. Recommend sticking with 5.1 high; 5.2 feels good too, but early days.
tekacs
I found the same with Max xhigh. To the point that I switched back to just 5.1 High from 5.1 Codex Max. Maybe I should’ve tried Max high first.
tgtweak
I have been on the 1M context window with Claude since 4.0 - it gets pretty expensive when you run 1M context on a long-running project (mostly using it in Cline for coding). I think they've realized more context length = more $ when dealing with most agentic coding workflows on the API.
Workaccount2
You should be doing everything you can to keep context under 200k, ideally even 100k. All the models unwind so badly as context grows.
patates
I don't have that experience with gemini. Up to 90% full, it's just fine.
Suppafly
>Can I just say !!!!!!!! Hell yeah!
...
>THANK YOU!!
Man you're way too excited.
nathants
Usable input limit has not changed, and remains 400k - 128k = 272k. Confirmed by looking for any changes in the codex CLI source: nope.
twisterius
[flagged]
mmaunder
My name is Mark Maunder. Not the fisheries expert. The other one when you google me. I’m 51 and as skeptical as you when it comes to tech. I’m the CTO of a well known cybersecurity company and merely a user of AI.
Since you critiqued my post, allow me to reciprocate: I sense the same deflector shields in you as many others here. I’d suggest embracing these products with a sense of optimism until proven otherwise and I’ve found that path leads to some amazing discoveries and moments where you realize how important and exciting this tech really is. Try out math that is too hard for you or programming languages that are labor intensive or languages that you don’t know. As the GitHub CEO said: this technology lets you increase your ambition.
bgwalter
I have tried the models and in domains I know well they are pathetic. They remove all nuance, make errors that non-experts do not notice and generally produce horrible code.
It is even worse in non-programming domains, where they chop up 100 websites and serve you incorrect bland slop.
If you are using them as a search helper, that sometimes works, though 2010 Google produced better results.
Oracle dropped 11% today due to over-investment in OpenAI. Non-programmers are acutely aware of what is going on.
jacquesm
[flagged]
GolfPopper
Replace 'products' with 'message', 'tech' with 'religion' and 'CEO' with 'prophet' and you have a bog-standard cult recruitment pitch.
bluefirebrand
[flagged]
CodeCompost
For the first time, I've actually hidden an AI story on HN.
I can't even anymore. Sorry this is not going anywhere.
onraglanroad
I suppose this is as good a place as any to mention this. I've now met two different devs who complained about the weird responses from their LLM of choice, and it turned out they were using a single session for everything. From recipes for the night, presents for the wife and then into programming issues the next day.
Don't do that. The whole context is sent on queries to the LLM, so start a new chat for each topic. Or you'll start being told what your wife thinks about global variables and how to cook your Go.
I realise this sounds obvious to many people but it clearly wasn't to those guys so maybe it's not!
holtkam2
I know I sound like a snob, but I've had many moments with Gen AI tools over the years that made me wonder what these tools are like for someone who doesn't know how LLMs work under the hood. It's probably completely bizarre? Apps like Cursor or ChatGPT would be incomprehensible to me as a user, I feel.
Workaccount2
Using my parents as a reference, they just thought it was neat when I showed them GPT-4 years ago. My jaw was on the floor for weeks, but most regular folks I showed had a pretty "oh thats kinda neat" response.
Technology is already so insane and advanced that most people just take it as magic inside boxes, so nothing is surprising anymore. It's all equally incomprehensible already.
jacobedawson
This mirrors my experience, the non-technical people in my life either shrugged and said 'oh yeah that's cool' or started pointing out gnarly edge cases where it didn't work perfectly. Meanwhile as a techie my mind was (and still is) spinning with the shock and joy of using natural human language to converse with a super-humanly adept machine.
khafra
LLMs are an especially tough case, because the field of AI had to spend sixty years telling people that real AI was nothing like what you saw in the comics and movies; and now we have real AI that presents pretty much exactly like what you used to see in the comics and movies.
Agentlien
My parents reacted in just the same way and the lackluster response really took me by surprise.
mmaunder
Yeah, I think a lot of us are taking knowing how LLMs work for granted. I did the fast.ai course a while back and then went off and played with vLLM and various LLMs, optimizing execution, tweaking params, etc. Then I moved on and started being a user. But knowing how they work has been a game changer for my team and me. And context window is so obvious, but if you don't know what it is you're going to think AI sucks. Which now has me wondering: is this why everyone thinks AI sucks? Maybe Simon Willison should write about this. Simon?
eru
> Is this why everyone thinks AI sucks?
Who's everyone? There are many, many people who think AI is great.
In reality, our contemporary AIs are (still) tools with glaring limitations. Some people overlook the limitations, or don't see them, and really hype them up. I guess the people who then take the hype at face value are those that think that AI sucks? I mean, they really do honestly suck in comparison to the hypest of hypes.
eru
> I realise this sounds obvious to many people but it clearly wasn't to those guys so maybe it's not!
It's worse: Gemini (and ChatGPT, but to a lesser extent) have started suggesting random follow-up topics when they conclude that a chat in a session has exhausted a topic. Well, when I say random, I mean that they seem to be pulling it from the 'memory' of our other chats.
For a naive user without preconceived notions of how to use these tools, this guidance from the tools themselves would serve as a pretty big hint that they should intermingle their sessions.
ghostpepper
For ChatGPT you can turn this memory off in settings and delete the ones it's already created.
eru
I'm not complaining about the memory at all. I was complaining about the suggestion to continue with unrelated topics.
noname120
Problem is that by default ChatGPT has the “Reference chat history” option enabled in the Memory options. This causes any previous conversation to leak into the current one. Just creating a new conversation is not enough, you also need to disable that option.
0xdeafbeef
Only your questions are in it though
redhed
This is also the default in Gemini, pretty sure - at least I remember turning it off. Makes no sense to me why this is the default.
gordonhart
> Makes no sense to me why this is the default.
You’re probably pretty far from the average user, who thinks “AI is so dumb” because it doesn’t remember what you told it yesterday.
astrange
Mostly because they built the feature and so that implicitly means they think it's cool.
I recommend turning it off because it makes the models way more sycophantic and can drive them (or you) insane.
onraglanroad
That seems like a terrible default. Unless they have a weighting system for different parts of context?
eru
They do (or at least they have something that behaves like weighting).
vintermann
It's not at all obvious where to drop the context, though. Maybe it helps to have similar tasks in the context, maybe not. It did really, shockingly well on a historical HTR task I gave it, so I gave it another one, in some ways an easier one... Thought it wouldn't hurt to have text in a similar style in the context. But then it suddenly did very poorly.
Incidentally, one of the reasons I haven't gotten much into subscribing to these services, is that I always feel like they're triaging how many reasoning tokens to give me, or AB testing a different model... I never feel I can trust that I interact with the same model.
dcre
The models you interact with through the API (as opposed to chat UIs) are held stable and let you specify reasoning effort, so if you use a client that takes API keys, you might be able to solve both of those problems.
eru
> Incidentally, one of the reasons I haven't gotten much into subscribing to these services, is that I always feel like they're triaging how many reasoning tokens to give me, or AB testing a different model... I never feel I can trust that I interact with the same model.
That's what websites have been doing for ages. Just like you can't step twice in the same river, you can't use the same version of Google Search twice, and never could.
SubiculumCode
I constantly switch out, even when it's on the same topic. It starts forming its own 'beliefs and assumptions', gets myopic. I also make use of the big three services in turn to attack ideas from multiple directions
nrds
> beliefs and assumptions
Unfortunately during coding I have found many LLMs like to encode their beliefs and assumptions into comments; and even when they don't, they're unavoidably feeding them into the code. Then future sessions pick up on these.
chasd00
I was listening to a podcast about people becoming obsessed and "in love" with an LLM like ChatGPT. Spouses were interviewed describing how mentally damaging it is to their partner and how their marriage/relationship is seriously at risk because of it. I couldn't believe no one has told these people to just go to the LLM and reset the context - that reverts the LLM back to a complete stranger. Granted, that would be pretty devastating to the person in "the relationship" with the LLM, since it wouldn't know them at all after that.
jncfhnb
It’s the majestic, corrupting glory of having a loyal cadre of empowering yes men normally only available to the rich and powerful, now available to the normies.
adamesque
that's not quite what parent was talking about, which is — don't just use one giant long conversation. resetting "memories" is a totally different thing (which still might be valuable to do occasionally, if they still let you)
onraglanroad
Actually, it's kind of the same. LLMs don't have a "new memory" system. They're like the guy from Memento. Context memory and long term from the training data. Can't make new memories from the context though.
(Not addressed to parent comment, but the inevitable others: Yes, this is an analogy, I don't need to hear another halfwit lecture on how LLMs don't really think or have memories. Thank you.)
nbardy
Those ARC-AGI-2 improvements are insane.
That's especially encouraging to me because those are all about generalization.
5 and 5.1 both felt overfit and would break down and be stubborn when you got them outside their lane. As opposed to Opus 4.5 which is lovely at self correcting.
It's one of those things you really feel in the model - not so much whether it can tackle a harder problem, but whether I can go back and forth with the thing, learning and correcting together.
This whole release makes me insanely optimistic. If they can push this much improvement WITHOUT the new huge data centers and without a new scaled base model, that's incredibly encouraging for what comes next.
Remember, the next big data centers are 20-30x the chip count and 6-8x the efficiency on the new chips.
I expect they can saturate the benchmarks WITHOUT any novel research or algorithmic gains. But at this point it's clear they're capable of pushing research qualitatively as well.
delifue
It's also possible that OpenAI used a lot of human-generated, ARC-like data to train on (semi-cheating). OpenAI has enough incentive to fake a high score.
Without fully disclosing training data you will never be sure whether good performance comes from memorization or "semi-memorization".
deaux
> 5 and 5.1 both felt overfit and would break down and be stubborn when you got them outside their lane. As opposed to Opus 4.5 which is lovely at self correcting.
This is simply the "openness vs directive-following" spectrum, which as a side-effect results in the sycophancy spectrum, which still none of them have found an answer to.
Recent GPT models follow directives more closely than Claude models, and are less sycophantic. Even Claude 4.5 models are still somewhat prone to "You're absolutely right!". GPT 5+ (API) models never do this. The byproduct is that the former are willing to self-correct, and the latter is more stubborn.
baq
Opus 4.5 answers most of my non-question comments with ‘you’re right.’ as the first thing in the output. At least I’m not absolutely right, I’ll take this as an improvement.
mmaunder
Same. Also got my attention re ARC-AGI-2. That's meaningful. And a HUGE leap.
cbracketdash
Slight tangent yet I think is quite interesting... you can try out the ARC-AGI 2 tasks by hand at this website [0] (along with other similar problem sets). Really puts into perspective the type of thinking AI is learning!
rallies
I work at the intersection of AI and investing, and I'm really amazed at the ability of this model to build spreadsheets.
I gave it a few tools to access SEC filings (and a small local vector database), and it's generating full-fledged spreadsheets with valid, real-time data. Analysts on Wall Street are going to get really empowered, but for the first time, I'm really glad that retail investors are also getting these models.
Just put out the tool: https://github.com/ralliesai/tenk
npodbielski
Can't wait to be fired because some VP or other manager asked some model to prepare a list of people with the lowest productivity-to-pay ratio.
The model hallucinated half of the data?! Sorry, we can't go back on this decision, that would make us look bad!
Or when some silly model pushes everyone to invest in some ridiculous company and everybody does it. A data-poisoning attack to inject some "I am Future Inc ™" company with a high investment rating. After a few months, pocket the money and vanish.
We are certainly going to live in interesting times.
rallies
Here's a nice parsing of all the important financials from an SEC report. This used to be really hard a few years ago.
https://docs.google.com/spreadsheets/d/1DVh5p3MnNvL4KqzEH0ME...
monatron
Nice tool - I appreciate you sharing the work!
null
https://platform.openai.com/docs/guides/latest-model
System card: https://cdn.openai.com/pdf/3a4153c8-c748-4b71-8e31-aecbde944...