From GPT-4 to GPT-5: Measuring Progress in Medical Language Understanding [pdf]

16 comments

August 21, 2025

I recently ran a thorough healthcare eval on GPT-5. The results show a slight regression in GPT-5 performance compared to GPT-4-era models.

I found this interesting. Here are the detailed results: https://www.fertrevino.com/docs/gpt5_medhelm.pdf

aresant

Feels like a mixed bag vs regression?

E.g., GPT-5 beats GPT-4 on factual recall and reasoning (HeadQA, Medbullets, MedCalc).

But then slips on structured queries (EHRSQL), fairness (RaceBias), evidence QA (PubMedQA).

Hallucination resistance is better, but only modestly.

Latency seems uneven (maybe it needs more testing?): faster on long tasks, slower on short ones.

woeirua

Definitely seems like GPT-5 is a very incremental improvement. Not what you’d expect if AGI were imminent.

xnx

Have you looked at comparing to Google's foundation models or specialty medical models like MedGemma (https://developers.google.com/health-ai-developer-foundation...)?

hypoxia

Did you try it with high reasoning effort?

ares623

Sorry, not directed at you specifically. But every time I see questions like this I can’t help but rephrase in my head:

“Did you try running it over and over until you got the results you wanted?”

SequoiaHope

What you describe is a person selecting the best results, but if you can get better results in one shot with that option enabled, it’s worth testing and reporting the results.

ares623

I get that. But then if that option doesn't help, what I've seen is that the next followup is inevitably "have you tried doing/prompting x instead of y"

dcre

This is not a good analogy because reasoning models are not choosing the best from a set of attempts based on knowledge of the correct answer. It really is more like what it sounds like: “did you think about it longer until you ruled out various doubts and became more confident?” Of course nobody knows quite why directing more computation in this way makes them better, and nobody seems to take the reasoning trace too seriously as a record of what is happening. But it is clear that it works!
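The difference between picking the best of several attempts and scoring a single attempt at a fixed setting can be made concrete with a toy simulation. This is only a minimal sketch and has nothing to do with the linked eval: the per-attempt success probability of 0.6, the value of k, and both function names are hypothetical, chosen purely for illustration.

```python
import random


def one_shot_accuracy(p_correct, n_questions, rng):
    """Score exactly one attempt per question, as a fixed-setting eval does."""
    hits = sum(rng.random() < p_correct for _ in range(n_questions))
    return hits / n_questions


def oracle_best_of_k_accuracy(p_correct, n_questions, k, rng):
    """Count a question as solved if ANY of k attempts is right.

    This needs the answer key to pick the winner, which is the
    "rerun until you get the result you wanted" scenario.
    """
    hits = sum(
        any(rng.random() < p_correct for _ in range(k))
        for _ in range(n_questions)
    )
    return hits / n_questions


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    P_CORRECT = 0.6         # hypothetical per-attempt success probability
    N_QUESTIONS = 20_000
    K = 5

    print("one shot:        ", one_shot_accuracy(P_CORRECT, N_QUESTIONS, rng))
    print("oracle best of k:", oracle_best_of_k_accuracy(P_CORRECT, N_QUESTIONS, K, rng))
```

Under these assumptions the oracle number approaches 1 - (1 - p)^k no matter how good the model is, while the one-shot number tracks p itself. A higher reasoning-effort setting is a change to p that is then scored one shot, so it belongs in the first function, not the second.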

username135

I wonder what changed in the models to cause the regression.

teaearlgraycold

Not sure but with each release it feels like they’re just wiping the dirt around and not actually cleaning.

woeirua

Interesting topic, but I'm not opening a PDF from some random website. Post a summary of the paper or the key findings here first.

42lux

It's hacker news. You can handle a PDF.

jeffbee

I approve of this level of paranoia, but I would just like to know why PDFs are dangerous (reasonable) but HTML is not (inconsistent).

HeatrayEnjoyer

PDFs can run almost anything and have an attack surface the size of Greece's coast.
