HealthBench

79 comments · May 12, 2025

programmertote

I have no doubt that a lot of garden-variety diagnoses and treatments can be done by an AI system that is fine-tuned and vetted to accomplish the task. I recently had to pay $93 for a virtual session with a physician to get a prescription for cough syrup, even though I already knew what to take before talking to her because I had done some research/reading. Some may argue, "Doctors studied years in med school and you shouldn't trust Google more than them", but knowing humans' fallibility, and knowing that a lot of doctors do look things up on places like https://www.wolterskluwer.com/en/solutions/uptodate to refresh/reaffirm their knowledge, I'd argue that if we are willing to take the risk, why shouldn't we be allowed to take that risk on our own? Why do I have to pay $93 (on top of the cough syrup that cost ~$44) just so that the doctor can see me on Zoom for less than 5 mins and submit an order for the med?

With healthcare prices increasing at breakneck speed, I am sure AI will take on a larger and larger role in diagnosing and treating people's common illnesses, and hopefully (though I doubt it) some of those savings will be passed on to patients.

P.S. In contrast to the US system, in my home city (Rangoon, Burma/Myanmar), I have multiple clinics near my home and a couple of pharmacies within two bus stops' distance. I can either buy most of the medications I need from the pharmacy (without a prescription) and take them on my own (why am I not allowed to take that risk?) OR go see a doctor at one of these clinics to confirm my diagnosis, pay him/her $10-$20 for the visit, and then head down to the pharmacy to buy the medication. Of course, some medications, such as those containing opioids, will only be sold to me with a doctor's prescription, but a good number of other meds are available as long as I can afford them.

imiric

> Why do I have to pay $93 (on top of the cough syrup that cost ~$44) just so that the doctor can see me on Zoom for less than 5 mins and submit an order for the med?

Because you're paying for the expertise of someone who studied for more than a decade which you won't get from a random web search.

An AI system with today's technology should be less trustworthy for medical diagnosis than a web search. At least with a web search you might stumble upon a site with content from experts, assuming you trust yourself to be able to discern expert advice from bot-generated and spam content. Even if a doctor is just doing the searching instead of me, what I'm paying for is their knowledge to make that discernment for me. Why you think an AI could do better than a human at that is beyond me.

Your question reminds me of that famous Henry Ford GE invoice story:

> Making chalk mark on generator: $1.

> Knowing where to make mark: $9,999.

ChadNauseam

You answered why someone would want to pay $93, but not why they have to pay $93.

85392_school

AI systems have been improving. o3 can now decide to search multiple times as part of its response.
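(For illustration, "decide to search multiple times" amounts to an agentic loop roughly like the sketch below; call_model and web_search are hypothetical placeholders, not any real API.)

    # Hypothetical sketch of an agentic search loop: the model may emit
    # zero or more SEARCH actions before committing to a final answer.
    def answer_with_search(question, call_model, web_search, max_searches=5):
        context = [f"Question: {question}"]
        for _ in range(max_searches):
            step = call_model(context)       # model decides its next action
            if step.startswith("SEARCH:"):
                query = step[len("SEARCH:"):].strip()
                context.append(f"Results for '{query}': {web_search(query)}")
            else:
                return step                  # model chose to answer directly
        return call_model(context + ["Answer now using the evidence above."])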

BeetleB

Where are you that you need a prescription to get cough medicine? The only ones I know of that require a prescription are the ones with controlled substances.

HaZeust

Ain't nothing wrong with a little Wockhardt to get you through the day

dgfitz

> Of course, some of the medications that include opioids will only be sold to me with the doctor's prescription

onlyrealcuzzo

There are simply not enough doctors for doctors to be doing this work.

We have a massive, massive shortage of doctors.

The industry is doing everything they can to make it worse by the day, so I won't hold my breath that we'll get the slightest bit of respite.

It'd obviously be ideal if everyone could see a doctor for an affordable price any time they wanted.

We don't live in the ideal world.

This would be a HUGE win for most people.

nradov

This is why most routine primary care can be delivered by Nurse Practitioners and Physician Assistants working under a doctor's supervision.

ezst

I mean, if that's your typical experience with healthcare, what you need is a big shake-up; don't count on AI to help you there! (But it might inflate the profit margins for the few running the show, at your expense.)

yapyap

You think you won’t have to pay the AI? lol.

hollerith

IMHO it is much easier to increase the supply of AI doctor hours than the supply of human doctor hours, and supply constraints are the main thing keeping doctor hours at the high price they currently sell for.

imiric

Good lord. The idea that a system that produces pseudo-random output without any semantic understanding can be relied on to give accurate health-related information is deeply flawed and troubling. It's one thing to use these things for finding patterns in data, for entertainment purposes, and for producing nonsensical code a human has to fix, but entirely different to rely on them for health diagnosis or advice. I shudder at the thought that a medical practitioner I go to will parrot whatever an LLM told them.

This insanity needs to be regulated yesterday.

FlamingMoe

Good lord. The idea that a sleep‑deprived, bias‑prone mammal can be relied on to give accurate health‑related information is deeply flawed and troubling. It’s one thing to seek a human for empathy, small talk, or a routine vaccine, but entirely different to assume an overworked doctor will consistently out‑diagnose a modern data tool. I shudder at the thought that the practitioner I visit might simply repeat whatever a drug‑rep lunch pitch or outdated guideline left in their memory.

stavros

This insanity needs to be regulated yesterday.

imiric

Thankfully, it is. :)

dcreater

Isn't there an obvious conflict of interest when the model maker is also the creator of a benchmark? At the very least, I think the benchmark should come from a separate business entity under the nonprofit, or from the nonprofit holding entity itself.

gwd

I don't think it's necessarily bad to have the benchmark, but the graphs of Gemini and Claude doing worse than o3 did kind of leave a bad taste in my mouth. "Oh look, your models are worse than ours at this very important metric that we just made up! How terrible!"

progbits

It's also barely better, within the error bars on some metrics. So if this is the best they can do on their own benchmark, that's pretty pathetic and shows they have no edge anymore.

iNic

I like that they include the "worst case score at k samples". This is a much more realistic view of what will happen, because someone will get that 1/100 response.
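(If it helps, here is a minimal sketch of one plausible way to compute a "worst-case score at k samples" metric; the exact HealthBench definition may differ. The idea: for each prompt, repeatedly draw k of its graded sampled responses, keep the minimum, and average across prompts.)

    import random

    def worst_at_k(scores_per_prompt, k, trials=1000, seed=0):
        """Estimate the expected worst (minimum) grader score when k
        responses are sampled per prompt, averaged over prompts."""
        rng = random.Random(seed)
        total = 0.0
        for _ in range(trials):
            total += sum(min(rng.sample(scores, k))
                         for scores in scores_per_prompt)
        return total / (trials * len(scores_per_prompt))

    # Example: two prompts, each with 5 graded responses scored in [0, 1].
    samples = [[0.9, 0.8, 1.0, 0.7, 0.95], [0.6, 0.9, 0.85, 0.4, 0.7]]
    print(worst_at_k(samples, k=3))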

Zaheer

Impressive how well Grok performs in these tests. Grok feels 'underrated' given how much more other models (Gemini, Llama, etc.) are in the news.

tough

you can't download grok's weights to run locally

simianwords

how is that relevant here?

tough

It helps explain why there are fewer people talking about them than about Gemini or Llama: fewer people are using them.

moralestapia

It's not.

Also, only one of the ten models benchmarked has open weights, so I'm not sure what GP is arguing for.

andy99

My sense is that these benchmarks are not realistic in terms of how the model is actually used. People building specialized AI systems are not, in my experience, letting users just chat with a base model; they would have some variant of RAG, plus guardrails, plus other stuff (like routing to pre-written answers for common questions), something like the sketch below.

So what use case does this test setup reflect? Is there a relevant commercial use case here?
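(A minimal sketch of that deployment pattern; every component here, retrieve, generate, is_unsafe, and the canned-answer table, is an illustrative stand-in, not any particular vendor's stack.)

    # Illustrative pipeline: route common questions to vetted answers,
    # otherwise do RAG and pass the draft through a guardrail.
    CANNED_ANSWERS = {
        "what are flu symptoms": "Fever, cough, sore throat, body aches...",
    }

    def answer(question, retrieve, generate, is_unsafe):
        key = question.lower().strip("?! .")
        if key in CANNED_ANSWERS:              # router: pre-written answer
            return CANNED_ANSWERS[key]
        docs = retrieve(question, top_k=4)     # RAG: ground in vetted sources
        draft = generate(question, docs)       # base model + retrieved context
        if is_unsafe(draft):                   # guardrail before returning
            return "Please consult a clinician about this question."
        return draft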

pants2

This appears to be a very thoughtful and helpful study. It's also impressive to see the improvement in performance over just the last year of model development: almost double.

I've found o3 & deep research to be very effective in guiding my health plan. One interesting anecdote - I got hit in the chest (right over the heart) quite hard a month or so ago. I prompted o3 with my ensuing symptoms and heart rate / oxygenation data from my Apple watch, and it already knew my health history from previous conversations. It gave very good advice and properly diagnosed me with a costochondral sprain. It gave me a timeline to expect (which ended up being 100% accurate) and treatments / ointments to help.

IMO, it's a good idea to have a detailed prompt ready to go with your health history, height/weight, medications and supplements, etc., so that if anything happens to you, you've got it handy to give to o3 to help with a diagnosis.

quantumwoke

How confident was it in that diagnosis? And did you trust its confidence? I put this to my medico wife, and apparently the other possibility is a pericardial effusion from the injury, which can be life-threatening.

pizzathyme

Non-clinicians are using ChatGPT every day now to find assistance (right or wrong) with real-life medical problems. This is a great evaluation set that could prevent a lot of harm.

unsupp0rted

Recently I uploaded a lab report to chatGPT and asked it to summarize it.

It hallucinated serious cancer, along with all the associated details you’d normally find on a lab report. It had an answer to every question I had pre-asked about the report.

The report said the opposite: no cancer detected.

dekhn

What was the outcome (did the subject of the lab report go on to present with cancer)? Sometimes what we think is a false positive ends up being a true positive later, and cancer is a late-arriving signal.

maliker

Interesting. Which LLM model? 4o, o3, 3.5? I had horrible performance with earlier models, but o3 has helped me with health stuff (hearing issues).

unsupp0rted

Whichever the default free model is right now. I stopped paying for it when Gemini 2.5 came out in Google's AI lab.

4o, o4? I'm certain it wasn't 3.5

Edit: while logged in

icelancer

> Whichever the default free model is right now

Sigh. This is a point in favor of not allowing free access to ChatGPT at all, given that people are getting mad at GPT-4o-mini, which is complete garbage for anything remotely complex... and for most other things, too.

Just give 5 free queries of 4o/o3 or whatever and call it good.

pants2

If you're logged in, 4o; if you're not logged in, 4o-mini. Neither scores well on the benchmark!

maliker

Might be worth trying again with Gemini 2.5. The reasoning models like that one are much better at health questions.

dgfitz

I can’t believe you’re getting downvoted for answering the question about the next-token-predictor model you can’t recall using.

What is happening?

arcanemachiner

No, cancer detected!

mvdtnz

You're not on Reddit.

arcanemachiner

I typically respect the decorum here, but I couldn't help myself this time.

Gracana

I wonder if it was unable to read your report, and just answered as if role-playing?

I gave it a pdf of an engine manual recently and asked some questions, which it answered reasonably. It even pulled a schematic out for me, though it was the wrong one (it gave me a schematic for the CDI ignition variant that we first talked about, rather than the DSAI one we settled on later.)

astura

>I wonder if it was unable to read your report, and just answered as if role-playing?

FWIW,

This is what ChatGPT gave me when I uploaded a lab report; it read the results back to me:

Here’s an interpretation of the blood test results you provided:

White Blood Cell Count (WBC) Value: 9.2 (Normal: 3.8 - 10.8 Thousand/uL)

Within the normal range, indicating no immediate signs of infection or inflammation.

Red Blood Cell Count (RBC) Value: 5.29 (Normal: 3.8 - 5.1 Million/uL)

Elevated (High). This could indicate dehydration, stress, or other conditions like polycythemia.

Hemoglobin Value: 10.9 (Normal: 11.7 - 15.5 g/dL)

Low. Suggests anemia, possibly due to iron deficiency or chronic disease.

Hematocrit Value: 36.0% (Normal: 35.0 - 45.0%)

Within the normal range, but on the lower side, supporting the presence of mild anemia.

Mean Corpuscular Volume (MCV) Value: 68.1 (Normal: 80.0 - 100.0 fL)

Low. Indicates microcytic anemia, often due to iron deficiency.

Mean Corpuscular Hemoglobin (MCH) Value: 20.6 (Normal: 27.0 - 33.0 pg)

Low. Suggests hypochromia (paler red blood cells), often associated with iron-deficiency anemia.

Mean Corpuscular Hemoglobin Concentration (MCHC) Value: 30.3 (Normal: 32.0 - 36.0 g/dL)

Slightly low. May reflect reduced hemoglobin in red blood cells, consistent with iron-deficiency anemia.

Red Cell Distribution Width (RDW) Value: 15.8% (Normal: 11.0 - 15.0%)

Elevated. Indicates variability in red blood cell size, common in iron-deficiency anemia.

Platelet Count Value: 397 (Normal: 140 - 400 Thousand/uL)

Normal, but on the higher side of the range.

Other Key Metrics Absolute Neutrophils, Lymphocytes, Monocytes, Eosinophils, Basophils: All within normal ranges, suggesting no significant abnormalities in immune cell populations.

Anisocytosis (1+): Indicates variation in red blood cell size.

Hypochromasia (1+): Suggests pale red blood cells, supporting iron-deficiency anemia.

Summary These results suggest iron-deficiency anemia, characterized by low hemoglobin, low MCV, low MCH, and high RDW. The elevated red blood cell count may reflect compensatory mechanisms or mild dehydration.

Your provider recommends a heart-healthy, low-fat diet, which is great for overall health. However, addressing the anemia might also involve increasing dietary iron or taking iron supplements.

The diagnosis is wrong, btw; I don't have iron deficiency. The anemia is caused by a genetic condition called thalassemia, which has been verified by genetic tests. You can use the Mentzer Index to differentiate the two on a simple CBC - https://www.mdcalc.com/calc/10534/mentzer-index

My numbers return a "probable diagnosis."

I was wondering if ChatGPT would catch it; nope, it didn't. It did say it was a possibility once I suggested it, though.
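(For the curious: the Mentzer index is just MCV divided by RBC count, with values under ~13 pointing toward thalassemia and over ~13 toward iron deficiency. Plugging in the numbers from the report above:)

    # Mentzer index = MCV / RBC; < 13 suggests thalassemia trait,
    # > 13 suggests iron-deficiency anemia (a screening rule of thumb).
    mcv = 68.1   # fL, from the report above
    rbc = 5.29   # million/uL, from the report above
    print(f"Mentzer index: {mcv / rbc:.1f}")  # ~12.9 -> thalassemia more likely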

mrcwinn

Happy to see this. I've struggled with an injury for the past five years. I've been to multiple sports-focused physicians, had various scans. Responses from doctors have ranged from "everything seems fine, can't really figure this out" to [completely wrong hypothesis]. Tried acupuncture. Tried a chiropractor. I remember one doctor, though, had an interesting thought that seemed to make sense - but I've been so discouraged from so many false starts or misplaced hope, I didn't bother following up.

Finally I typed in my entire history into o3-deep-research and let it rip for a while. It came back with a theory for the injury that matched that one doctor, diagrams of muscle groups and even illustrations of proposed exercises. I'm not out of the woods yet, but I am cautiously optimistic for the first time in a long time.

candiddevmike

Why would you trust an LLM over a battery of human experts? I find it hard to believe that the doctors never proposed exercises or some kind of physical therapy for you, at least in the US.

BeetleB

I can't speak to the OP's condition, but having seen plenty of doctors and physical therapists in the US for over a decade:

Yes, they propose exercises.

No, they don't work.

For certain (common) conditions, PT seems to have it nailed - the exercises really help. For the others, it's just snake oil. Not backed by much research. The current state of the art is just not good when it comes to chronic pain.

So while I don't know if an LLM can be better than a battery of human experts, I do know that those human experts do not perform well. I'm guessing that in the OP's case, the battery of human experts does not lead to a consensus; you just end up with 10 different treatments/diagnoses (and occasionally one is a lot more common than the others, but it's still wrong).

zeroxfe

I've never used LLMs for this, but as someone who's been through a lot of sports-related injuries, I find doctors more or less useless (except for prescribing painkillers and performing surgeries.)

No doctor or physio has ever been able to fix my chronic issues, and I've always had to figure them out myself through lots of self-study and experimentation.

quantumwoke

Sounds like you needed a good PT not a physician.

mvdtnz

You trusted known grifters (acupuncturists and chiropractors) but not doctors. Then you trusted a computer known for telling lies.

I think you should take a step back and re-assess your internal heuristics.

BigGreenJorts

They literally said they saw sports-focused physicians.

scudsworth

i mean, it sounds like that doctor had it right then? i don't think this reads as a win for "online self-diagnosis", ai-empowered or otherwise.

lcnPylGDnU4H9OF

From their story, it sounds like the one doctor had it right because the AI-empowered self diagnosis seems to agree. Or do I misunderstand?

Noumenon72

I hope recent cuts to government science have managed to hit enough of the safetyists and industry captures who keep us from just trying out new healthcare approaches like this and learning. They'd like nothing better than to replace the help you got with "As a large language model, I am unable to offer medical advice."

simianwords

I would really like a benchmark focused purely on diagnosis: symptoms and patient history vs. the real diagnosis. Maybe name this model House M.D. 1.0 or something.

The other stuff is good to have, but ultimately a model that focuses on diagnosing medical conditions is going to be the most useful. Look, we aren't going to replace doctors anytime soon, but it is good to have a second opinion from an LLM purely for diagnosis. I would hope it captures patterns that weren't observed before. This is exactly the sort of game that AI can beat a human at: large-scale pattern recognition.

srameshc

Is the Med-PaLM model that Google has been working on meant to be considered for comparison? If I'm not mistaken, it isn't publicly available.

> https://sites.research.google/med-palm/

aix1

Med-PaLM is old and has been superseded by (multiple generations of) Gemini.