
The LLM Lobotomy?

27 comments · September 20, 2025

esafak

This was the perfect opportunity to share the evidence. I think undisclosed quantization is definitely a thing. We need benchmarks to be re-evaluated periodically to guard against it.

Providers should keep timestamped models fixed and assign modified versions a new timestamp (and a new price, if they want). The model with the "latest" tag could change over time, like a Docker image. Then we could make an informed decision about which version to use. Companies want to cost-optimize their cake and eat it too.
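A minimal sketch of what that pinning already looks like from the caller's side, using the OpenAI Python SDK (the model names are only illustrative examples of a dated snapshot versus a floating alias):

```python
from openai import OpenAI

client = OpenAI()
prompt = [{"role": "user", "content": "Summarize RFC 2119 in one sentence."}]

# Floating alias: whatever the provider currently serves under this name,
# comparable to pulling a Docker image tagged "latest".
latest = client.chat.completions.create(model="gpt-4o", messages=prompt)

# Dated snapshot: pinned to a specific release, closer to a digest-pinned image.
pinned = client.chat.completions.create(model="gpt-4o-2024-08-06", messages=prompt)
```

The open question in this thread is whether what's actually served behind a pinned name stays fixed.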

edit: I have the same complaint about my Google Home devices. The models they use today are indisputably worse than the ones they used five whole years ago. And features have been removed without notice. Qualitatively, the devices are no longer what I bought.

icyfox

I guarantee you the weights are already versioned like you're describing. Each training run results in a static bundle of outputs, and these are very much pinned (OpenAI has confirmed multiple times that they don't change the model weights once they issue a public release).

> "Not quantized. Weights are the same. If we did change the model, we'd release it as a new model with a new name in the API."

- [Ted Sanders](https://news.ycombinator.com/item?id=44242198) (OpenAI)

The problem is that most of these issues stem from broader infrastructure problems, like numerical instability at inference time. Since this affects their whole service pipeline, the logic can't really be encapsulated in a frozen environment like a Docker container. I suppose _technically_ they could maintain a separate inference cluster for each of their point releases, but then previous models wouldn't benefit from common infrastructure improvements, load balancing would be harder to shard across GPUs, and the coordination might be so logistically hard as to be effectively impossible.

https://www.anthropic.com/engineering/a-postmortem-of-three-...
https://thinkingmachines.ai/blog/defeating-nondeterminism-in...
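To make the "numerical instability" point concrete: floating-point addition is not associative, so the same reduction performed in a different order (different batch size, kernel, or GPU sharding) can produce slightly different logits, and at temperature 0 a near-tie between tokens can flip. A toy illustration:

```python
import random

# Floating-point addition is not associative: summing the same numbers in a
# different order (as a different batch size or kernel might) gives a slightly
# different result, which can flip an argmax between two near-tied tokens.
random.seed(0)
values = [random.uniform(-1, 1) * 10 ** random.randint(-8, 8) for _ in range(10_000)]

forward = sum(values)
backward = sum(reversed(values))

print(forward == backward)      # usually False
print(abs(forward - backward))  # small, but not zero
```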

colordrops

In addition to quantization, I suspect the additions they make continually to their hidden system prompt for legal, business, and other reasons slowly degrade responses over time as well.

jonplackett

This is quite similar to all the modifications Intel had to make because of Spectre. I bet those system prompts have grown exponentially.

briga

I have a theory: all these people reporting degrading model quality over time aren't actually seeing model quality deteriorate. What they are actually doing is discovering that these models aren't as powerful as they initially thought (i.e., expanding their sample size for judging how good the model is). The probabilistic nature of LLMs produces a lot of confused thinking about how good a model is: just because a model produces nine excellent responses doesn't mean the tenth won't be garbage.

vintermann

They test specific prompts with temperature 0. It is of course possible that all their test prompts were lucky, but even then, shouldn't you see an immediate drop followed by a flat or increasing line?

Also, from what I understand from the article, it's not a difficult task but an easily machine-checkable one, i.e., whether the output conforms to a specific format.
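The kind of harness being described is easy to picture. Here is a hypothetical version (the prompts and the format check are mine, not the article's, since the actual suite isn't published):

```python
import json
from openai import OpenAI

client = OpenAI()

# Hypothetical fixed prompts; the article's actual test cases are not published.
PROMPTS = [
    "Return a JSON object with keys 'name' and 'age' for: Ada Lovelace, born 1815.",
    "Return a JSON object with keys 'city' and 'country' for: Paris.",
]

def passes_format_check(text: str) -> bool:
    """Machine-checkable criterion: output must parse as a JSON object."""
    try:
        return isinstance(json.loads(text), dict)
    except json.JSONDecodeError:
        return False

def run_suite(model: str) -> float:
    passed = 0
    for prompt in PROMPTS:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        if passes_format_check(resp.choices[0].message.content):
            passed += 1
    return passed / len(PROMPTS)

# Log this pass rate with a timestamp and the drift becomes a graph, not an anecdote.
print(run_suite("gpt-4o-2024-08-06"))  # example model name
```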

chaos_emergent

Yes, exactly. My theory is that the novelty of a new generation of LLMs' performance tends to inflate people's perception of the model, with a reversion to a better-calibrated expectation over time. If the developer reported numerical evaluations that drifted over time, I'd be more convinced of model change.

nothrabannosir

TFA is about someone running the same test suite, with temperature 0 and fixed inputs and fixtures, on the same model for months on end.

What's missing is the actual evidence, which I would of course love to see. But assuming they're not actively lying, this is not as subjective as you suggest.

zzzeek

Your theory does not hold up for this specific article: they carefully explain that they send identical inputs into the model each time and observe progressively worse results with all other variables unchanged. (Though to be fair, others have noted they provided no replication details as to how they arrived at these results.)

yieldcrv

FTA: "I am glad I have proof of this with the test system"

I think they have receipts, but did not post them there

Aurornis

A lot of these claims say there's proof, but the details are never shared.

Even a simple graph of the output would be better than nothing, but instead it’s just an empty claim.

colordrops

Did any of you read the article? They have a test framework that objectively shows the model getting worse over time.

Aurornis

I read the article. No proof was included. Not even a graph of declining results.

ProjectArcturis

I'm confused why this is addressed to Azure instead of OpenAI. Isn't Azure just offering a wrapper around ChatGPT?

That said, I would also love to see some examples or data, instead of just "it's getting worse".

SubiculumCode

I don't remember where I saw it, but I remember a claim that Azure-hosted models performed worse than those hosted by OpenAI.

transcriptase

Explains why the enterprise Copilot ChatGPT wrapper that they shoehorn into every piece of Office 365 performs worse than a badly configured local LLM.

bongodongobob

They most definitely do. They have been lobotomized in some way to be ultra corporate-friendly. I can only use their M365 Copilot at work, and it's absolute dogshit at writing anything longer than maybe 100 lines of code. It can barely write correct PowerShell. Luckily, I really only need it for quick-and-dirty short PS scripts.

cush

Is setting temperature to 0 even a valid way to measure LLM performance over time, all else equal?

criemen

Even with temperature 0, LLM output will not be deterministic; it will just have less randomness (not precisely defined) than at temperature 1. There was a recent post on the front page about fully deterministic sampling, but it turns out to be quite difficult to achieve.
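If you want to check this yourself: the OpenAI chat completions API accepts a `seed` parameter, but only promises best-effort determinism, so a sketch like this (model name is just an example) will still occasionally disagree with itself:

```python
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-2024-08-06",  # example model; any pinned snapshot works
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        seed=42,  # best-effort determinism only
    )
    return resp.choices[0].message.content

a = ask("List three prime numbers greater than 100.")
b = ask("List three prime numbers greater than 100.")
print(a == b)  # often True, but batching and kernel selection can still differ
```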

jonplackett

It could be that performance on temp zero has declined but performance on a normal temp is the same or better.

I wonder if temp zero would be more influenced by changes to the system prompt too. I can imagine it making responses more brittle.

fortyseven

I'd have assumed a fixed seed was used, but he doesn't mention that. Weird. Maybe he meant that?

gwynforthewyn

What's the conversation that you're looking to have here? There are fairly widespread claims that GPT-5 is worse than 4, and that's what the help article you've linked to says. I'm not sure how this furthers dialogue about, or understanding of, LLMs, though; it reads to _me_ like this question just reinforces a notion that lots of people already agree with.

What's your aim here, sgt3v? I'd love to positively contribute, but I don't see how this link gets us anywhere.

jug

At least on OpenRouter, you can often verify what quant a provider is using for a particular model.
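For example, OpenRouter exposes a per-model endpoints listing with provider metadata. A rough sketch of the check (the path and field names are from memory and may differ from the current API, so treat this as the shape of the idea, not a verified client):

```python
import requests

# Hypothetical sketch: list the providers serving a model on OpenRouter and the
# quantization each one reports. Field names are assumptions; check the docs.
url = "https://openrouter.ai/api/v1/models/meta-llama/llama-3.1-70b-instruct/endpoints"
resp = requests.get(url, timeout=30)
resp.raise_for_status()

for endpoint in resp.json().get("data", {}).get("endpoints", []):
    print(endpoint.get("provider_name"), endpoint.get("quantization"))
```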

bigchillin

This is why we have open source. I noticed this with Cursor; it's not just an Azure problem.

SirensOfTitan

I'm convinced all of the major LLM providers silently quantize their models. The absolute worst was Google's transition from the Gemini 2.5 Pro 03-25 checkpoint to the May checkpoint, but I've noticed this effect with Claude and GPT over the years too.

I couldn’t imagine relying on any closed models for a business because of this highly dishonest and deceptive practice.

ukFxqnLa2sBSBf6

It’s a good thing the author provided no data or examples. Otherwise, there might be something to actually talk about.

zzzeek

I'm sure MSFT will offer this person some upgraded API tier that somewhat improves the issues, though not terrifically, for only ten times the price.