
Recent AI model progress feels mostly like bullshit

InkCanon

The biggest story in AI was released a few weeks ago but was given little attention: on the recent USAMO, SOTA models scored on average 5% (IIRC, it was some abysmal number). This is despite them supposedly having gotten 50%, 60% etc. performance on IMO questions. This strongly suggests AI models simply remember past results instead of actually solving these questions. I'm incredibly surprised no one mentions this; it's ridiculous that these companies never tell us what (if any) efforts have been made to remove test data (IMO, ICPC, etc.) from training data.

AIPedant

Yes, here's the link: https://arxiv.org/abs/2503.21934v1

Anecdotally, I've been playing around with o3-mini on undergraduate math questions: it is much better at "plug-and-chug" proofs than GPT-4, but those problems aren't independently interesting, they are explicitly pedagogical. For anything requiring insight, it's either:

1) A very good answer that reveals the LLM has seen the problem before (e.g. naming the theorem, presenting a "standard" proof, using a much more powerful result)

2) A bad answer that looks correct and takes an enormous amount of effort to falsify. (This is the secret sauce of LLM hype.)

I dread undergraduate STEM majors using this thing - I asked it a problem about rotations and spherical geometry, but got back a pile of advanced geometric algebra, when I was looking for "draw a spherical triangle." If I didn't know the answer, I would have been badly confused. See also this real-world example of an LLM leading a recreational mathematician astray: https://xcancel.com/colin_fraser/status/1900655006996390172#...

I will add that in 10 years the field will be intensely criticized for its reliance on multiple-choice benchmarks; it is not surprising or interesting that next-token prediction can game multiple-choice questions!

larodi

This is a paper by INSAIT researchers - a very young institute which hired most of its PhD staff only in the last 2 years, basically onboarding anyone who wanted to be part of it. They were waving their BG-GPT on national TV in the country as a major breakthrough, while it was basically a Mistral fine-tuned model that was eventually never released to the public, nor was the training set.

Not sure whether their (INSAIT's) agenda is purely scientific, as there's a lot of PR on LinkedIn by these guys, literally celebrating every PhD they get, which is at minimum very weird. I'd take anything they release with a grain of salt, if not outright caution.

sealeck

Half the researchers are at ETH Zurich (INSAIT is a partnership between EPFL, ETH and Sofia) - hardly an unreliable institution.

apercu

In my experience, LLMs can't get basic Western music theory right; there's no way I would use an LLM for something harder than that.

waffletower

I may be mistaken, but I don't believe that LLMs are trained on a large corpus of machine-readable music representations, which would arguably be crucial to strong performance in common-practice music theory. I would also surmise that most music theory related datasets largely arrive without musical representations altogether. A similar problem exists for many other fields, particularly mathematics, but it is much more profitable to invest the effort to span such representation gaps for those. I would not gauge LLM generality on music theory performance, when its niche representations are likely unavailable in training and it is widely perceived as having minuscule economic value.

code_for_monkey

music theory is a really good test because in my experience the AI is extremely bad at it

JohnKemeny

Discussed here: https://news.ycombinator.com/item?id=43540985 (Proof or Bluff? Evaluating LLMs on 2025 USA Math Olympiad, 4 points, 2 comments).

otabdeveloper4

Anecdotally: schoolkids are at the leading edge of LLM innovation, and nowadays all homework assignments are explicitly made to be LLM-proof. (Well, at least in my son's school. Yours might be different.)

This effectively makes LLMs useless for education. (It also sours the next generation on LLMs in general; these things are extremely lame to the proverbial "kids these days".)

bambax

How do you make homework assignments LLM-proof? There may be a huge business opportunity if that actually works, because LLMs are destroying education at a rapid pace.

billforsternz

I asked Google "how many golf balls can fit in a Boeing 737 cabin" last week. The "AI" answer helpfully broke the solution into 4 stages; 1) A Boeing 737 cabin is about 3000 cubic metres [wrong, about 4x2x40 ~ 300 cubic metres] 2) A golf ball is about 0.000004 cubic metres [wrong, it's about 40cc = 0.00004 cubic metres] 3) 3000 / 0.000004 = 750,000 [wrong, it's 750,000,000] 4) We have to make an adjustment because seats etc. take up room, and we can't pack perfectly. So perhaps 1,500,000 to 2,000,000 golf balls final answer [wrong, you should have been reducing the number!]

So 1), 2) and 3) were out by 1, 1 and 3 orders of magnitude respectively (the errors partially cancelled out) and 4) was nonsensical.
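For what it's worth, the corrected back-of-the-envelope estimate fits in a few lines of Python (the cabin volume and the packing/seat factor are rough guesses, not authoritative figures):

    # Rough Fermi estimate; every input is an assumption.
    cabin_volume_m3 = 4 * 2 * 40       # ~320 m^3, per the rough dimensions above
    ball_volume_m3 = 4.0e-5            # ~40 cm^3 per golf ball
    packing_and_seats = 0.5            # guess: sphere-packing losses plus seats, galleys, etc.

    raw = cabin_volume_m3 / ball_volume_m3
    print(f"raw: {raw:,.0f}, adjusted: {raw * packing_and_seats:,.0f}")
    # raw: 8,000,000, adjusted: 4,000,000 -- i.e. the adjustment should shrink
    # the number, not grow it as the AI answer did.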

This little experiment made me skeptical about the state of the art of AI. I have seen much AI output which is extraordinary; it's funny how one serious fail can impact my point of view so dramatically.

aezart

> I have seen much AI output which is extraordinary; it's funny how one serious fail can impact my point of view so dramatically.

I feel the same way. It's like discovering for the first time that magicians aren't doing "real" magic, just sleight of hand and psychological tricks. From that point on, it's impossible to be convinced that a future trick is real magic, no matter how impressive it seems. You know it's fake even if you don't know how it works.

bambax

I think there is a big divide here. Every adult on earth knows magic is "fake", but some can still be amazed and entertained by it, while others find it utterly boring because it's fake, and the only possible (mildly) interesting thing about it is to try to figure out what the trick is.

I'm in the second camp but find it kind of sad and often envy the people who can stay entertained even though they know better.

katsura

To be fair, I love that magicians can pull tricks on me even though I know it is fake.

aoeusnth1

2.5 Pro nails each of these calculations. I don’t agree with Google’s decision to use a weak model in its search queries, but you can’t say progress on LLMs is bullshit as evidenced by a weak model no one thinks is close to SOTA.

Sunspark

It's fascinating to me when you tell one that you'd like to see translated passages of work from authors who never wrote or translated the item in question, especially if they passed away before the piece was written.

The AI will create something for you and tell you it was them.

prawn

"That's impossible because..."

"Good point! Blah blah blah..."

Absolutely shameless!

greenmartian

Weird thing is, in Google AI Studio all their models—from the state-of-the-art Gemini 2.5Pro, to the lightweight Gemma 2—gave a roughly correct answer. Most even recognised the packing efficiency of spheres.

But Google search gave me the exact same slop you mentioned. So whatever Search is using, they must be using their crappiest, cheapest model. It's nowhere near state of the art.

aurareturn

Makes sense that search has a small, fast, dumb model designed to summarize and not to solve problems. Nearly 14 billion Google searches per day. Way too much compute needed to use a bigger model.
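A rough sketch of why (every number here is a made-up assumption, purely to show the order of magnitude):

    # Back-of-the-envelope; all figures below are assumptions for illustration only.
    searches_per_day = 14e9            # from the comment above
    tokens_per_answer = 300            # guess: a short AI overview
    frontier_usd_per_mtok = 10.0       # guess: frontier-model output pricing
    tiny_usd_per_mtok = 0.05           # guess: a small distilled model

    def daily_cost(usd_per_mtok):
        return searches_per_day * tokens_per_answer * usd_per_mtok / 1e6

    print(f"frontier model: ${daily_cost(frontier_usd_per_mtok):,.0f}/day")   # ~$42,000,000/day
    print(f"small model:    ${daily_cost(tiny_usd_per_mtok):,.0f}/day")       # ~$210,000/day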

Workaccount2

Google is shooting themselves in the foot with whatever model they use for search. It's probably a 2B or 4B model to keep up with demand, and man is it doing way more harm than good.

vintermann

I have a strong suspicion that for all the low-threshold APIs/services, before the real model sees my prompt, it gets evaluated by a quick model to see if it's something they care to bother the big models with. If not, I get something shaken out of the sleeve of a bottom-of-the-barrel model.

InDubioProRubio

It's most likely one giant ["input token close enough question hash"] = answer_with_params_replay? It doesn't misunderstand the question, it tries to squeeze the input into something close enough?

swader999

It'll get it right next time because they'll hoover up the parent post.

senordevnyc

Just tried with o3-mini-high and it came up with something pretty reasonable: https://chatgpt.com/share/67f35ae9-5ce4-800c-ba39-6288cb4685...

CamperBob2

It's just the usual HN sport: ask a low-end, obsolete or unspecified model, get a bad answer, brag about how you "proved" AI is pointless hype, collect karma.

Edit: Then again, maybe they have a point, going by an answer I just got from Google's best current model ( https://g.co/gemini/share/374ac006497d ). I haven't seen anything that ridiculous from a leading-edge model for a year or more.

CivBase

I just asked my company-approved AI chatbot the same question.

It got the golf ball volume right (0.00004068 cubic meters), but it still overestimated the cabin volume at 1000 cubic meters.

Its final calculation was reasonably accurate at 24,582,115 golf balls - even though 1000 ÷ 0.00004068 = 24,582,104. Maybe it was using more significant figures for the golf ball size than it showed in its answer?

It didn't acknowledge other items in the cabin (like seats) reducing its volume, but it did at least acknowledge inefficiencies in packing spherical objects and suggested the actual number would be "somewhat lower", though it did not offer an estimate.

When I pressed it for an estimate, it used a packing density of 74% and gave an estimate of 18,191,766 golf balls. That's one more than the calculation should have produced, but arguably insignificant in context.

Next I asked it to account for fixtures in the cabin such as seats. It estimated a 30% reduction in cabin volume and redid the calculations with a cabin volume of 700 cubic meters. These calculations were much less accurate. It told me 700 ÷ 0.00004068 = 17,201,480 (off by ~6k). And it told me 17,201,480 × 0.74 was 12,728,096 (off by ~1k).

I told it the calculations were wrong and to try again, but it produced the same numbers. Then I gave it the correct answer for 700 ÷ 0.00004068. It told me I was correct and redid the last calculation correctly using the value I provided.

Of all the things for an AI chatbot which can supposedly "reason" to fail at, I didn't expect it to be basic arithmetic. The one I used was closer, but it was still off by a lot at times despite the calculations being simple multiplication and division. Even if it might not matter in the context of filling an airplane cabin with golf balls, it does not inspire trust for more serious questions.
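For reference, redoing the disputed arithmetic directly in Python:

    # Recomputing the three calculations from the exchange above.
    ball_m3 = 0.00004068
    print(round(1000 / ball_m3))      # 24582104 -- the chatbot said 24,582,115
    print(round(700 / ball_m3))       # 17207473 -- the chatbot said 17,201,480 (off by ~6k)
    print(round(17201480 * 0.74))     # 12729095 -- the chatbot said 12,728,096 (off by ~1k)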

billforsternz

> Its final calculation was reasonably accurate at 24,582,115 golf balls - even though 1000 ÷ 0.00004068 = 24,582,104. Maybe it was using more significant figures for the golf ball size than it showed in its answer?

1000 ÷ 0.00004068 = 25,000,000. I think this is an important point that's increasingly widely misunderstood. All those extra digits you show are just meaningless noise and should be ruthlessly eliminated. If 1000 cubic metres in this context really meant 1000.000 cubic metres, then by all means show maybe the four digits of precision you get from the golf ball (but I am more inclined to think 1000 cubic metres is actually the roughest of rough approximations, with just one digit of precision).

In other words, I don't fault the AI for mismatching one set of meaninglessly precise digits for another, but I do fault it for using meaninglessly precise digits in the first place.
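In code, the rounding I'm arguing for looks something like this (a throwaway helper, not any particular library's API):

    from math import floor, log10

    def round_sig(x, sig):
        """Round x to `sig` significant figures."""
        return round(x, -int(floor(log10(abs(x)))) + (sig - 1))

    raw = 1000 / 0.00004068
    print(raw)                # ~24582104.23
    print(round_sig(raw, 1))  # 20000000.0 -- one significant figure
    print(round_sig(raw, 2))  # 25000000.0 -- two significant figures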

raxxorraxor

This reminds me of Google quick answers we had for a time in search. It is quite funny if you live outside the US, because it very often got the units or numbers wrong because of different decimal delimiters.

No wonder Trump isn't afraid to put taxes on Canada. Who could take a 3.8-square-mile country seriously?

simonw

I had to look up these acronyms:

- USAMO - United States of America Mathematical Olympiad

- IMO - International Mathematical Olympiad

- ICPC - International Collegiate Programming Contest

Relevant paper: https://arxiv.org/abs/2503.21934 - "Proof or Bluff? Evaluating LLMs on 2025 USA Math Olympiad" submitted 27th March 2025.

sanxiyn

Nope, no LLMs reported 50~60% performance on IMO, and SOTA LLMs scoring 5% on USAMO is expected. For 50~60% performance on IMO, you are thinking of AlphaProof, but AlphaProof is not an LLM. We don't have the full paper yet, but clearly AlphaProof is a system built on top of an LLM with lots of bells and whistles, just like AlphaFold is.

InkCanon

o1 reportedly got 83% on IMO, and 89th percentile on Codeforces.

https://openai.com/index/learning-to-reason-with-llms/

The paper tested it on o1-pro as well. Correct me if I'm getting some versioning mixed up here.

alexlikeits1999

I've gone through the link you posted and the o1 system card and can't see any reference to IMO. Are you sure they were referring to IMO or were they referring to AIME?

sanxiyn

AIME is so not IMO.

sigmoid10

>I'm incredibly surprised no one mentions this

If you don't see anyone mentioning what you wrote that's not surprising at all, because you totally misunderstood the paper. The models didn't suddenly drop to 5% accuracy on math olympiad questions. Instead this paper came up with a human evaluation that looks at the whole reasoning process (instead of just the final answer) and their finding is that the "thoughts" of reasoning models are not sufficiently human understandable or rigorous (at least for expert mathematicians). This is something that was already well known, because "reasoning" is essentially CoT prompting baked into normal responses. But the empirics also tell us it greatly helps for final outputs nonetheless.

Workaccount2

On top of that, what the model prints out in the CoT window is not necessarily what the model is actually thinking. Anthropic just showed this in their paper from last week, where they got models to cheat at a question by "accidentally" slipping them the answer, and the CoT had no mention of the answer being slipped to them.

bglazer

Yeah I’m a computational biology researcher. I’m working on a novel machine learning approach to inferring cellular behavior. I’m currently stumped why my algorithm won’t converge.

So, I describe the mathematics to ChatGPT-o3-mini-high to try to help reason about what’s going on. It was almost completely useless. Like blog-slop “intro to ML” solutions and ideas. It ignores all the mathematical context, and zeros in on “doesn’t converge” and suggests that I lower the learning rate. Like, no shit I tried that three weeks ago. No amount of cajoling can get it to meaningfully “reason” about the problem, because it hasn’t seen the problem before. The closest point in latent space is apparently a thousand identical Medium articles about Adam, so I get the statistical average of those.

I can’t stress how frustrating this is, especially with people like Terence Tao saying that these models are like a mediocre grad student. I would really love to have a mediocre (in Terry’s eyes) grad student looking at this, but I can’t seem to elicit that. Instead I get low tier ML blogspam author.

**PS** if anyone read this far (doubtful) and knows about density estimation and wants to help, my email is bglazer1@gmail.com

I promise its a fun mathematical puzzle and the biology is pretty wild too

root_axis

It's funny, I have the same problem all the time with typical day to day programming roadblocks that these models are supposed to excel at. I'm talking about any type of bug or unexpected behavior that requires even 5 minutes of deeper analysis.

Sometimes when I'm anxious just to get on with my original task, I'll paste the code and output/errors into the LLM and iterate over its solutions, but the experience is like rolling dice, cycling through possible solutions without any kind of deductive analysis that might bring it gradually closer to a solution. If I keep asking, it eventually just starts cycling through variants of previous answers with solutions that contradict the established logic of the error/output feedback up to this point.

Not to say that the LLMs aren't productive tools, but they're more like calculators of language than agents that reason.

worldsayshi

> they're more like calculators of language than agents that reason

This might be homing in on both the issue and the actual value of LLMs. I think there's a lot of value in a "language calculator", but if it's continuously being sold as something it's not, we will dismiss it or build heaps of useless apps that will just form a market bubble. I think the value is there, but it's different from how we think about it.

jwrallie

True. There's a small bonus that trying to explain the issue to the LLM may sometimes be essentially rubber ducking, and that can lead to insights. I feel most of the time the LLM can give erroneous output that still might trigger some thinking in a different direction, and sometimes I'm inclined to think it's helping me more than it actually is.

MoonGhost

I was working some time ago on an image processing model using a GAN architecture. One model produces output and tries to fool the second. Both are trained together. Simple, but it requires a lot of extra effort to make it work. Unstable and falls apart (blows up to an unrecoverable state). I found some ways to make it work by adding new loss functions, changing params, changing the models' architectures and sizes. Adjusting some coefficients through the training to gradually rebalance the loss functions' influence.

The same may work with your problem. If it's unstable, try introducing extra 'brakes' which theoretically are not required. Maybe even incorrect ones. Whatever it is in your domain. Another thing to check is the optimizer; try several. Check default parameters. I've heard Adam's defaults lead to instability later in training.
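Roughly what I mean, as a minimal sketch (PyTorch assumed, toy 1-D data; the R1-style gradient penalty and the annealed coefficient are just examples of such 'brakes', not what I actually used):

    # Toy GAN step with two example stabilisers: a gradient penalty on the
    # discriminator and a penalty weight that is rebalanced over training.
    import torch
    import torch.nn as nn

    G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))
    D = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1))
    opt_g = torch.optim.Adam(G.parameters(), lr=1e-4, betas=(0.5, 0.9))  # non-default betas
    opt_d = torch.optim.Adam(D.parameters(), lr=1e-4, betas=(0.5, 0.9))
    bce = nn.BCEWithLogitsLoss()

    for step in range(1000):
        real = torch.randn(64, 1) * 2 + 3        # stand-in "real" data
        fake = G(torch.randn(64, 8))
        gp_weight = 10.0 * (1 - step / 1000)     # example: anneal the penalty's influence

        # Discriminator step, with a gradient penalty as an extra "brake".
        real.requires_grad_(True)
        d_real, d_fake = D(real), D(fake.detach())
        grad = torch.autograd.grad(d_real.sum(), real, create_graph=True)[0]
        gp = grad.pow(2).sum(dim=1).mean()
        loss_d = (bce(d_real, torch.ones_like(d_real))
                  + bce(d_fake, torch.zeros_like(d_fake))
                  + gp_weight * gp)
        opt_d.zero_grad()
        loss_d.backward()
        opt_d.step()

        # Generator step.
        out = D(fake)
        loss_g = bce(out, torch.ones_like(out))
        opt_g.zero_grad()
        loss_g.backward()
        opt_g.step()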

PS: it would be heaven if models could work at human expert level. Not sure why some really expect this. We are just at the beginning.

PPS: the fact that they can do known tasks with minor variations is already a huge time saver.

bglazer

Yes, I suspect that engineering the loss and hyperparams could eventually get this to work. However, I was hoping the model would help me get to a more fundamental insight into why the training falls into bad minima. Like the Wasserstein GAN is a principled change to the GAN that improves stability, not just fiddling around with Adam’s beta parameter.

The reason I expected better mathematical reasoning is because the companies making them are very loudly proclaiming that these models are capable of high level mathematical reasoning.

And yes, the fact I don't have to look at matplotlib documentation anymore makes these models extremely useful already, but that's qualitatively different from having Putnam-prize-winning reasoning ability.

torginus

When I was an undergrad EE student a decade ago, I had to tangle a lot with complex maths in my Signals & Systems, and Electricity and Magnetism classes. Stuff like Fourier transforms, hairy integrals, partial differential equations etc.

Math packages of the time like Mathematica and MATLAB helped me immensely, once you could get the problem accurately described in the correct form, they could walk through the steps and solve systems of equations, integrate tricky functions, even though AI was nowhere to be found back then.

I feel like ChatGPT is doing something similar when doing maths with its chain of thoughts method, and while its method might be somewhat more generic, I'm not sure it's strictly superior.

melagonster

I doubt this is because its explanations are better. I tried asking it Calculus I questions; ChatGPT just repeated content from textbooks. It is useful, but people should keep in mind where the limitations are.

airstrike

I tend to prefer Claude over all things ChatGPT so maybe give the latest model a try -- although in some way I feel like 3.7 is a step down from the prior 3.5 model

pdimitar

What do you find inferior in 3.7 compared to 3.5 btw? I only recently started using Claude so I don't have a point of reference.

kristianp

Have you tried gemini 2.5? It's one of the best reasoning models. Available free in google ai studio.

usaar333

And then within a week, Gemini 2.5 was tested and got 25%. Point is AI is getting stronger.

And this only suggested LLMs aren't trained well to write formal math proofs, which is true.

selcuka

> within a week

How do we know that Gemini 2.5 wasn't specifically trained or fine-tuned with the new questions? I don't buy that a new model could suddenly score 5 times better than the previous state-of-the-art models.

levocardia

They retrained their model less than a week before its release, just to juice one particular nonstandard eval? Seems implausible. Models get 5x better at things all the time. Challenges like the Winograd schema have gone from impossible to laughably easy practically overnight. Ditto for "Rs in strawberry," ferrying animals across a river, overflowing wine glass, ...

bakkoting

New models suddenly doing much better isn't really surprising, especially for this sort of test: going from 98% accuracy to 99% accuracy can easily be the difference between having 1 fatal reasoning error and having 0 fatal reasoning errors on a problem with 50 reasoning steps, and a proof with 0 fatal reasoning errors gets ~full credit whereas a proof with 1 fatal reasoning error gets ~no credit.
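To make the compounding concrete (a toy calculation assuming independent steps, which real proofs of course aren't):

    # Chance that a 50-step argument contains zero fatal errors.
    steps = 50
    for per_step in (0.98, 0.99):
        print(f"{per_step:.0%} per step -> {per_step ** steps:.1%} fully clean")
    # 98% per step -> 36.4% fully clean
    # 99% per step -> 60.5% fully clean
    # A one-point per-step gain takes full-credit proofs from roughly a third
    # to roughly three in five under all-or-nothing grading.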

And to be clear, that's pretty much all this was: there's six problems, it got almost-full credit on one and half credit on another and bombed the rest, whereas all the other models bombed all the problems.

MoonGhost

They are trained on some mix with a minimal fraction of math. That's how it has been from the beginning. But we can rebalance it by adding quality generated content. It's just that the content will cost millions of $$ to generate. Distillation at a new level looks like the logical next step.

KolibriFly

Yeah, this is one of those red flags that keeps getting hand-waved away, but really shouldn't be.

iambateman

The core point in this article is that the LLM wants to report _something_, and so it tends to exaggerate. It's not very good at saying "no", or at least not as good as a programmer would hope.

When you ask it a question, it tends to say yes.

So while the LLM arms race is incrementally increasing benchmark scores, those improvements are illusory.

The real challenge is that LLMs fundamentally want to seem agreeable, and that's not improving. So even if the model gets an extra 5/100 math problems right, it feels about the same in a series of prompts which are more complicated than just a ChatGPT scenario.

I would say the industry knows it’s missing a tool but doesn’t know what that tool is yet. Truly agentic performance is getting better (Cursor is amazing!) but it’s still evolving.

I totally agree that the core benchmarks that matter should be ones which evaluate a model in agentic scenario, not just on the basis of individual responses.

bluefirebrand

> The real challenge is that LLMs fundamentally want to seem agreeable, and that's not improving

LLMs fundamentally do not want to seem anything

But the companies that are training them and making models available for professional use sure want them to seem agreeable

JohnKemeny

> LLMs fundamentally do not want to seem anything

You're right that LLMs don't actually want anything. That said, in reinforcement learning, it's common to describe models as wanting things because they're trained to maximize rewards. It’s just a standard way of talking, not a claim about real agency.

Peritract

> a standard way of talking, not a claim about real agency.

A standard way of talking used by people who do also frequently claim real agency.

squiggleblaz

Reinforcement learning, maximise rewards? They work because rabbits like carrots. What does an LLM want? Haven't we already committed the fundamental error when we're saying we're using reinforcement learning and they want rewards?

mrweasel

That sounds reasonable to me, but those companies forget that there are different types of agreeable. There's the LLM approach, similar to the coworker who will answer all your questions about .NET but not stop you from coding yourself into a corner, and then there's the "Let's sit down and review what it actually is that you're doing, because you're asking a fairly large number of disjoint questions right now".

I've dropped trying to use LLMs for anything, due to political convictions and because I don't feel like they are particularly useful for my line of work. Where I have tried to use various models in the past is for software development, and the common mistake I see the LLMs make is that they can't pick up on mistakes in my line of thinking, or won't point them out. Most of my problems are often down to design errors or thinking about a problem in a wrong way. The LLMs will never once tell me that what I'm trying to do is an indication of a wrong/bad design. There are ways to be agreeable and still point out problems with previously made decisions.

squiggleblaz

I think it's your responsibility to control the LLM. Sometimes, I worry that I'm beginning to code myself into a corner, and I ask if this is the dumbest idea it's ever heard and it says there might be a better way to do it. Sometimes I'm totally sceptical and ask that question first thing. (Usually it hallucinates when I'm being really obtuse though, and in a bad case that's the first time I notice it.)

Terr_

Yeah, and they probably have more "agreeable" stuff in their corpus simply because very disagreeable stuff tends to be either much shorter or a prelude to a flamewar.

boesboes

This rings true. What I notice is that the longer I let Claude work on some code, for instance, the more bullshit it invents. I can usually delete about 50-60% of the code & tests it came up with.

And when you ask it to 'just write a test', 50/50 it will try to run it, fail on some trivial issues, delete 90% of your test code and start to loop deeper and deeper into the rabbit hole of its own hallucinations.

Or maybe I just suck at prompting hehe

signa11

> The core point in this article is that the LLM wants to report _something_, and so it tends to exaggerate. It's not very good at saying "no", or at least not as good as a programmer would hope.

umm, it seems to me that it is this (tfa):

     But I would nevertheless like to submit, based off of internal
     benchmarks, and my own and colleagues' perceptions using these models,
     that whatever gains these companies are reporting to the public, they
     are not reflective of economic usefulness or generality.
and then a couple of lines down from the above statement, we have this:

     So maybe there's no mystery: The AI lab companies are lying, and when
     they improve benchmark results it's because they have seen the answers
     before and are writing them down.

signa11

[this went way outside the edit-window and hence a separate comment] imho, the state of varying experience with LLMs can be aptly summed up in this poem by Mr. Longfellow

     There was a little girl,
        Who had a little curl,
     Right in the middle of her forehead.
        When she was good,
        She was very good indeed,
     But when she was bad she was horrid.

malingo

"when you ask him anything, he never answers 'no' -- he just yesses you to death and then he takes your dough"

tristor

It's, in many ways, the same problem as having too many "yes men" on a team at work or in your middle management layer. You end up getting wishy-washy, half-assed "yes" answers to questions that everyone would have been better off if they'd been answered as "no" or "yes, with caveats" with predictable results.

In fact, this might be why so many business executives are enamored with LLMS/GenAI: It's a yes-man they don't even have to employ, and because they're not domain experts, as per usual, they can't tell that they're being fed a line of bullshit.

lukev

This is a bit of a meta-comment, but reading through the responses to a post like this is really interesting because it demonstrates how our collective response to this stuff is (a) wildly divergent and (b) entirely anecdote-driven.

I have my own opinions, but I can't really say that they're not also based on anecdotes and personal decision-making heuristics.

But some of us are going to end up right and some of us are going to end up wrong and I'm really curious what features signal an ability to make "better choices" w/r/t AI, even if we don't know (or can't prove) what "better" is yet.

freehorse

There is nothing wrong with sharing anecdotal experiences. Reading through anecdotal experiences here can help one understand how relatable one's own experiences are. Moreover, if I have X experience, it could help to know whether it is because of me doing something wrong that others have figured out.

Furthermore, as we are talking about the actual impact of LLMs, as is the point of the article, a bunch of anecdotal experiences may be more valuable than a bunch of benchmarks to figure it out. Also, apart from the right/wrong dichotomy, people use LLMs with different goals and contexts. It may not mean that some people are doing something wrong if they do not see the same impact as others. Every time a web developer says that they do not understand how others can be so skeptical of LLMs, concludes with certainty that the skeptics must be doing something wrong, and moves on to explain how to actually use LLMs properly, I chuckle.

otterley

Indeed, there’s nothing at all wrong with sharing anecdotes. The problem is when people make broad assumptions and conclusions based solely on personal experience, which unfortunately happens all too often. Doing so is wired into our brains, though, and we have to work very consciously to intercept our survival instincts.

freehorse

People "make conclusions" because they have to take decisions day to day. We cannot wait for the perfect bulletproof evidence before that. Data is useful to take into account, but if I try to use X llm that has some perfect objective benchmark backing it, while I cannot make it be useful to me while Y llm has better results, it would be stupid not to base my decision on my anecdotal experience. Or vice versa, if I have a great workflow with llms, it may be not make sense to drop it because some others may think that llms don't work.

In the absence of actually good evidence, anecdotal data may be the best we can get for now. The point, imo, is to try to understand why some anecdotes contradict each other, which, imo, is mostly due to contextual factors that may not be very clear, and to be flexible enough to change priors/conclusions when something changes in the current situation.

droopyEyelids

I think you might be caught up in a bit of the rationalist delusion.

People -only!- draw conclusions based on personal experience. At best you have personal experience with truly objective evidence gathered in a statistically valid manner.

But that only happens in a few vanishingly rare circumstances here on earth. And wherever it happens, people are driven to subvert the evidence gathering process.

Often “working against your instincts” to be more rational only means more time spent choosing which unreliable evidence to concoct a belief from.

FiniteIntegral

It's not surprising that responses are anecdotal. An easy way to communicate a generic sentiment often requires being brief.

A majority of what makes a "better AI" can be condensed to how effective the gradient-descent algorithms are at reaching the local optimum we want them to reach. Until a generative model shows actual progress at "making decisions", it will forever be seen as a glorified linear algebra solver. Generative machine learning is all about giving a pleasing answer to the end user, not about creating something on the level of human decision making.

code_biologist

At risk of being annoying, answers that feel like high quality human decision making are extremely pleasing and desirable. In the same way, image generators aren't generating six fingered hands because they think it's more pleasing, they're doing it because they're trying to please and not good enough yet.

I'm just most baffled by the "flashes of brilliance" combined with utter stupidity. I remember having a run with early GPT 4 (gpt-4-0314) where it did refactoring work that amazed me. In the past few days I asked a bunch of AIs about similar characters between a popular gacha mobile game and a popular TV show. OpenAI's models were terrible and hallucinated aggressively (4, 4o, 4.5, o3-mini, o3-mini-high), with the exception of o1. DeepSeek R1 only mildly hallucinated and gave bad answers. Gemini 2.5 was the only flagship model that did not hallucinate and gave some decent answers.

I probably should have used some type of grounding, but I honestly assumed the stuff I was asking about should have been in their training datasets.

dsign

You want to block subjectivity? Write some formulas.

There are three questions to consider:

a) Have we, without any reasonable doubt, hit a wall for AI development? Emphasis on "reasonable doubt". There is no reasonable doubt that the Earth is roughly spherical. That level of certainty.

b) Depending on your answer for (a), the next question to consider is if we the humans have motivations to continue developing AI.

c) And then the last question: will AI continue improving?

If taken as boolean values, (a), (b) and (c) have a truth table with eight values, the most interesting row being false, true, true: "(not a) and b => c". Note the implication sign, "=>". Give some values to (a) and (b), and you get a value for (c).

There are more variables you can add to your formula, but I'll abstain from giving any silly examples. I, however, think that the row (false, true, false) implied by many commentators is just fear and denial. Fear is justified, but denial doesn't help.
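For what it's worth, the eight rows are easy to enumerate (a tiny sketch of the formula exactly as stated above):

    # Enumerate (a, b, c); the implication "(not a) and b => c" rules out exactly
    # one row: a=False, b=True, c=False.
    from itertools import product

    for a, b, c in product([False, True], repeat=3):
        holds = (not ((not a) and b)) or c
        print(a, b, c, "consistent" if holds else "ruled out")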

lukev

Invalid expression: value of type "probability distribution" cannot be cast to type "boolean".

pdimitar

A lot of people judge by the lack of their desired outcome. Calling that fear and denial is disingenuous and unfair.

dsign

That's actually a valid point. I stand corrected.

lherron

Agreed! And with all the gaming of the evals going on, I think we're going to be stuck with anecdotal for some time to come.

I do feel (anecdotally) that models are getting better on every major release, but the gains certainly don't seem evenly distributed.

I am hopeful the coming waves of vertical integration/guardrails/grounding applications will move us away from having to hop between models every few weeks.

InkCanon

Frankly, the overarching story about evals (which receives very little coverage) is how much gaming is going on. On the recent USAMO 2025, SOTA models scored 5%, despite claiming silver/gold in IMOs. And ARC-AGI: one very easy way to "solve" it is to generate masses of synthetic examples by extrapolating the basic rules of ARC-AGI questions and train on that.

KolibriFly

Totally agree... this space is still so new and unpredictable that everyone is operating off vibes, gut instinct, and whatever personal anecdotes they've collected. We're all sort of fumbling around in the dark, trying to reverse-engineer the flashlight

throwanem

> I'm really curious what features signal an ability to make "better choices" w/r/t AI

So am I. If you promise you'll tell me after you time travel to the future and find out, I'll promise you the same in return.

nialv7

Good observation but also somewhat trivial. We are not omniscient gods, ultimately all our opinions and decisions will have to be based on our own limited experiences.

wg0

Unlike many - I find author's complaints on the dot.

Once all the AI batch startups have sold subscriptions to the cohort and there's no further market growth - because businesses outside don't want to roll the dice on a probabilistic model that doesn't have an understanding of pretty much anything, but rather is a clever imitation machine trained on the content it has seen - the AI bubble will burst, with more startups starting to pack up by the end of 2026, 2027 at the latest.

consumer451

I would go even further than TFA. In my personal experience using Windsurf daily, Sonnet 3.5 is still my preferred model. 3.7 makes many more changes that I did not ask for, often breaking things. This is an issue with many models, but it got worse with 3.7.

cootsnuck

Yea, I've experienced this too with 3.7. Not always though. It has been more helpful than not for me. But yea, 3.5 "felt" better to me.

Part of me thinks this is because I expected less of 3.5 and therefore interacted with it differently.

It's funny because it's unlikely that everyone interacts with these models in the same way. And that's pretty much guaranteed to give different results.

Would be interesting to see some methods come out for individuals to measure their own personal success rate/ productivity / whatever with these different models. And then have a way for people to compare them with each other so we can figure out who is working well with these models and who isn't and figure out why the difference.

consumer451

> Would be interesting to see some methods come out for individuals to measure their own personal success rate/ productivity / whatever with these different models. And then have a way for people to compare them with each other so we can figure out who is working well with these models and who isn't and figure out why the difference.

This would be so useful. I have thought about this missing piece a lot.

Different tools like Cursor vs. Windsurf likely have their own system prompts for each model, so the testing really needs to be done in the context of each tool.

This seems somewhat straightforward to do using a testing tool like Playwright, correct? Whoever first does this successfully will have a popular blog/site on their hands.

Zetaphor

I finally gave up on 3.7 in Cursor after three rounds of it completely ignoring what I asked it for so that it could instead solve an irrelevant linter error. The error in no way affected functionality.

Despite me rejecting the changes and explicitly telling it to ignore the linter, it kept insisting on only trying to solve for that.

behnamoh

3.7 is like a wild horse. You really must ground it with clear instructions. It sucks that it doesn't automatically know that, but it's tameable.

consumer451

Could you share any successful prompting techniques for grounding 3.7, even just a project-specific example?

jonahx

My personal experience is right in line with the author's.

Also:

> I think what's going on is that large language models are trained to "sound smart" in a live conversation with users, and so they prefer to highlight possible problems instead of confirming that the code looks fine, just like human beings do when they want to sound smart.

I immediately thought: That's because in most situations this is the purpose of language, at least partially, and LLMs are trained on language.

ants_everywhere

There are real and obvious improvements in the past few model updates and I'm not sure what the disconnect there is.

Maybe it's that I do have PhD level questions to ask them, and they've gotten much better at it.

But I suspect that these anecdotes are driven by something else. Perhaps people found a workable prompt strategy by trial and error on an earlier model and it works less well with later models.

Or perhaps they have a time-sensitive task and are not able to take advantage of the thinking of modern LLMs, which have a slow thinking-based feedback loop. Or maybe their code base is getting more complicated, so it's harder to reason about.

Or perhaps they're giving the LLMs a poorly defined task where older models made assumptions about but newer models understand the ambiguity of and so find the space of solutions harder to navigate.

Since this is ultimately from a company doing AI scanning for security, I would think the latter plays a role to some extent. Security is insanely hard and the more you know about it the harder it is. Also adversaries are bound to be using AI and are increasing in sophistication, which would cause lower efficacy (although you could tease this effect out by trying older models with the newer threats).

pclmulqdq

In the last year, things like "you are an expert on..." have gotten much less effective in my private tests, while actually describing the problem precisely has gotten better in terms of producing results.

In other words, all the sort of lazy prompt engineering hacks are becoming less effective. Domain expertise is becoming more effective.

ants_everywhere

yes that would explain the effect I think. I'll try that out this week.

DebtDeflation

The issue is the scale of the improvements. GPT-3.5 Instruct was an utterly massive leap over everything that came before it. GPT-4 was a very big jump over that. Everything since has seemed incremental. Yes we got multimodal but that was part of GPT-4, they just didn't release it initially, and up until very recently it mostly handed off to another model. Yes we got reasoning models, but people had been using CoT for awhile so it was just a matter of time before RL got used to train it into models. Witness the continual delays of GPT-5 and the back and forth on whether it will be its own model or just a router model that picks the best existing model to hand a prompt off to.

stafferxrr

It is like how I am not impressed by the models when it comes to progress with chemistry knowledge.

Why? Because I know so little about chemistry myself that I wouldn't even know what to start asking the model as to be impressed by the answer.

For the model to be useful at all, I would have to learn basic chemistry myself.

Many, I suspect, are in this same situation with all subjects. They really don't know much of anything and are therefore unimpressed by the models' responses, in the same way I am not impressed with the chemistry responses.

HarHarVeryFunny

The disconnect between improved benchmark results and lack of improvement on real world tasks doesn't have to imply cheating - it's just a reflection of the nature of LLMs, which at the end of the day are just prediction systems - these are language models, not cognitive architectures built for generality.

Of course, if you train an LLM heavily on narrow benchmark domains then its prediction performance will improve on those domains, but why would you expect that to improve performance in unrelated areas?

If you trained yourself extensively on advanced math, would you expect that to improve your programming ability? If not, then why would you expect it to improve the programming ability of a far less sophisticated "intelligence" (prediction engine) such as a language model?! If you trained yourself on LeetCode programming, would you expect that to help with hardening corporate production systems?!

InkCanon

That's fair. But look up the recent experiment running SOTA models on the then-just-released USAMO 2025 questions. The highest score was 5%, while supposedly SOTA last year was IMO silver level. There could be some methodological differences - i.e. the USAMO paper required correct proofs and not just numerical answers. But it really strongly suggests that even within limited domains, it's cheating. I'd wager a significant amount that if you tested SOTA models on a new ICPC set of questions, actual performance would be far, far worse than their supposed benchmarks.

usaar333

> Highest score was 5%, supposedly SOTA last year was IMO silver level.

No LLM last year got silver. DeepMind had a highly specialized AI system earning that.

throwawayffffas

In my view as well, it's not really cheating, it's just overfitting.

If a model doesn't do well on the benchmarks, it will either be retrained until it does or you won't hear about it.

KolibriFly

Your analogy is perfect. Training an LLM on math olympiad problems and then expecting it to secure enterprise software is like teaching someone chess and handing them a wrench

joelthelion

I've used gemini 2.5 this weekend with aider and it was frighteningly good.

It probably depends a lot on what you are using them for, and in general, I think it's still too early to say exactly where LLMs will lead us.

jchw

I think overall quality with Gemini 2.5 is not much better than Gemini 2 in my experience. Gemini 2 was already really good, but just like Claude 3.7, Gemini 2.5 goes some steps forward and some steps backwards. It sometimes generates some really verbose code even when you tell it to be succinct. I am pretty confident that if you evaluate 2.5 for a bit longer you'll come to the same conclusion eventually.

heresie-dabord

> It probably depends a lot on what you are using them for, and in general, I think it's still too early to say exactly where LLMs will lead us.

Even approximations must be right to be meaningful. If information is wrong, it's rubbish.

Presorting/labelling various data has value. Humans have done the real work there.

What is "leading" us at present are the exaggerated valuations of corporations. You/we are in a bubble, working to justify the bubble.

Until a tool is reliable, it is not installed where people can get hurt. Unless we have revised our concern for people.

mountainriver

Yep, and what they are doing in Cursor with the agentic stuff is really game changing.

People who can’t recognize this intentionally have their heads in the sand

InkCanon

People are really fundamentally asking two different questions when they talk about AI "importance": AI's utility and AI's "intelligence". There's a careful line between both.

1) AI undoubtedly has utility. In many agentic uses, it has very significant utility. There's absolute utility and perceived utility, the latter being more about user experience. In absolute utility, git is likely the single most game-changing piece of software there is. It has probably saved some ten-, maybe eleven-digit number in engineer hours times salary in how it enables massive teams to work together in very seamless ways. In user experience, AI is amazing because it can generate so much so quickly. But it is very far from an engineer. For example, recently I tried to use Cursor to bootstrap a website in NextJS for me. It produced errors it could not fix, and each rewrite seemed to dig it deeper into its own hole. The reasons were quite obvious. A lot of it had to do with NextJS 15 and the breaking changes it introduces in cookies and auth. It's quite clear that if you have masses of NextJS code, disproportionately from older versions and none of it labeled well with versions, it messes up the LLM. Eventually I scrapped what it wrote and did it myself. I don't mean to use this anecdote to say LLMs are useless, but they have pretty clear limitations. They work well on problems with massive data (like front end) that don't require much principled understanding (like understanding how NextJS 15 would break so-and-so's auth). Another example of this is when I tried to use it to generate flags for a V8 build: it failed horribly and would simply hallucinate flags all the time. This seemed very likely to be because (despite the existence of a list of V8 flags online) many flags had very close representations in vector embeddings, and there was almost zero data or detailed examples of their use.

2) On the more theoretical side, the performance of LLMs on benchmarks (claiming to be elite IMO solvers, competitive programming solvers) has become incredibly suspicious. When the new USAMO 2025 was released, the highest score was 5%, despite claims a year ago that SOTA was at least IMO silver level. This is against the backdrop of exponential compute and data being fed in. Combined with apparently diminishing returns, this suggests that the gains from that are running really thin.

dimitri-vs

I guess you haven't been on /r/cursor or forum.cursor.com lately?

"game changing" isn't exactly the sentiment there the last couple months.

maccard

My experience as someone who uses LLMs and a coding assist plugin (sometimes), but is somewhat bearish on AI is that GPT/Claude and friends have gotten worse in the last 12 months or so, and local LLMs have gone from useless to borderline functional but still not really usable for day to day.

Personally, I think the models are “good enough” that we need to start seeing the improvements in tooling and applications that come with them now. I think MCP is a good step in the right direction, but I’m sceptical on the whole thing (and have been since the beginning, despite being a user of the tech).

sksxihve

The whole MCP hype really shows how much of AI is bullshit. These LLMs have consumed more API documentation than possible for a single human and still need software engineers to write glue layers so they can use the APIs.

maccard

I don't think I agree, entirely.

The problem is that up until _very_ recently, it's been possible to get LLMs to generate interesting and exciting results (as a result of all the API documentation and codebases they've inhaled), but it's been very hard to make that usable. I think we need to be able to control the output format of the LLMs in a better way before we can work on what's in the output. I don't know if MCP is the actual solution to that, but it's certainly an attempt at it...

sksxihve

That's reasonable, along with your comment below too, but when you have the CEO of Anthropic saying "AI will write all code for software engineers within a year" last month, I would say that is pretty hard to believe given how it performs without user intervention (MCP etc...). It feels like bullshit, just like the self-driving car stuff did ~10 years ago.

machiaweliczny

Because it's lossy compression. I have also consumed a lot of books and even more movies, and I don't have a good memory of it all. But I keep some core facts and intuition from it.

sksxihve

AI is far better at regurgitating facts than me, even if it's lossy compression, but if someone gives me an API doc I can figure out how to use it without them writing a wrapper library around the parts I need to use to solve whatever problem I'm working on.

fxtentacle

I'd say most of the recent AI model progress has been on price.

A 4-bit quant of QwQ-32B is surprisingly close to Claude 3.5 in coding performance. But it's small enough to run on a consumer GPU, which means deployment price is now down to $0.10 per hour (from $12+ per hour for models requiring 8x H100).
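For reference, a minimal sketch of what running such a quant can look like (assuming the Hugging Face transformers + bitsandbytes route and a single ~24 GB consumer GPU; many deployments use llama.cpp/GGUF instead):

    # Load QwQ-32B with 4-bit weights via bitsandbytes and generate once.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    model_id = "Qwen/QwQ-32B"
    bnb = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, quantization_config=bnb, device_map="auto"
    )

    prompt = "Write a C function that reverses a singly linked list."
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=256)
    print(tok.decode(out[0], skip_special_tokens=True))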

xiphias2

Have you compared it with 8-bit QwQ-17B?

In my evals 8 bit quantized smaller Qwen models were better, but again evaluating is hard.

redrove

There’s no QwQ 17B that I’m aware of. Do you have a HF link?

xiphias2

You're right, sorry...I just tested Qwen models, not QwQ, I see QwQ only has 32B.

shostack

Yeah, I'm thinking of this from a Wardley map standpoint.

What innovation opens up when AI gets sufficiently commoditized?

bredren

One thing I’ve seen is large enterprises extracting money from consumers by putting administrative burden on them.

For example, you can see this in health insurance reimbursements and wireless carriers' plan changes (i.e., Verizon's shift from Do More, etc., to what they have now).

Companies basically set up circumstances where consumers lose small amounts of money, on a recurring basis or sporadically enough, that people will just pay the money rather than face a maze of calls, website navigation and time suck to recover funds due to them or that shouldn't have been taken in the first place.

I'm hopeful well-commoditized AI will give consumers a fighting chance against this and other types of disenfranchisement that seem to be increasingly normalized by companies that have consultants who do nothing but optimize for their own financial position.

mentalgear

Brute force, brute force everything - at least for the domains where you can have automatic verification.

throw310822

I hope it's true. Even if LLMs development stopped now, we would still keep finding new uses for them at least for the next ten years. The technology is evolving way faster than we can meaningfully absorb it and I am genuinely frightened by the consequences. So I hope we're hitting some point of diminishing returns, although I don't believe it a bit.

einrealist

LeCun criticized LLM technology recently in a presentation: https://www.youtube.com/watch?v=ETZfkkv6V7Y

The accuracy problem won't just go away. Increasing accuracy is only getting more expensive. This sets the limits for useful applications. And casual users might not even care and use LLMs anyway, without reasonable result verification. I fear a future where overall quality is reduced. Not sure how many people / companies would accept that. And AI companies are getting too big to fail. Apparently, the US administration does not seem to care when they use LLMs to define tariff policy....

pclmulqdq

I don't know why anyone is surprised that a statistical model isn't getting 100% accuracy. The fact that statistical models of text are good enough to do anything should be shocking.

whilenot-dev

I think the surprising aspect is rather how people are praising 80-90% accuracy as the next leap in technological advancement. Quality is already in decline, despite LLMs, and programming was always a discipline where correctness and predictability mattered. It's an advancement for efficiency, sure, but at the yet-unknown cost of stability. I'm thinking about all the simulations based on applied mathematical concepts and all the accumulated hours fixing bugs - there's now this certain aftertaste, sweet for some living their lives efficiently, but very bitter for the ones relying on stability.

voidhorse

You're completely correct, of course. The issue is that most people are not looking for quality, only efficiency. In particular, business owners don't care about sacrificing some correctness if it means they can fire slews of people. Worse, gullible "engineers" who should be the ones prioritizing correctness are so business-brainwashed themselves that they likewise slop up this nonsense, at the expense of sacrificing their own concern for the only principles that made the software business even remotely close to being worthy of the title "engineering".

einrealist

That "good enough" is the problem. It requires context. And AI companies are selling us that "good enough" with questionable proof. And they are selling grandiose visions to investors, but move the goal post again and again.

A lot of companies made Copilot available to their workforce. I doubt that the majority of users understand what a statistical model means. The casual, technically inexperienced user just assumes that a computer answer is always right.