
GPT-5o-mini hallucinates medical residency applicant grades

medicalthrow

Hi HN, submitting from a burner since I'm an applicant in the current medical residency admissions cycle. I thought it would be interesting to show the real-world implications of using LLMs to extract information from PDFs. For context, Thalamus is a company that handles the "backend" for residency programs and all the applications they receive (including handling who to invite for interviews, etc.). One of the more important factors in deciding applicant competitiveness is medical school performance (grades), but that information is buried in PDFs sent by schools (often not standardized). So this year, they decided to pilot a tool that would extract that info (using "GPT-5o-mini": https://www.thalamusgme.com/blogs/methodology-for-creation-a...). Some programs have noticed discrepancies between extracted and reported grades (often in the direction of hallucinating "fails") and brought them to the attention of Thalamus. Unfortunately, it doesn't look like the company is discontinuing use of the tool.

Regardless, given that there have been a number of posts looking into usage of LLMs for numerical extraction, I thought this story would be useful as a cautionary tale.

EDIT: I put "GPT-5o-mini" in quotes since that was in their methodology...yes, I know the model doesn't exist

animalmother

Hi wondering if you could message me at shane.shifflett@dowjones.com or via signal at 929 638 0009? https://www.wsj.com/news/author/shane-shifflett

alexpotato

It's amazing how much of "inter organization information flow" still happens over PDFs and/or just FTP'ing files around.

A couple jobs ago at a hedge fund, I owned the system that would take financial data from counterparties, process it and send it to internal teams for reconciliation etc.

The spectrum went from "receive updates via SWIFT (as in financial) protocol" to "small oil trading shop sending us PDFs that are different every month". As you can imagine, the time and effort to process those PDFs occasionally exceeded the revenue from the entire transaction with that counterparty.

As others have pointed out: yes, the overall thrust of the industry is to get to something standardized but 100% adoption will probably never happen.

I write more about the FTP side of things in the Twitter thread below: https://x.com/alexpotato/status/1809579426687983657

lozenge

Why don't they just email a form after/when you apply and you fill in all the grades in a structured data way? How many grades are we talking about here. Then the PDF would just be the proof that your grades were real.

philipallstar

Thank you for sharing this.

It's astonishing that places like this will do almost anything rather than create a simple API to ingest data that could easily be pushed automatically.

simonw

I imagine they would love to create a simple API for this, but the problem is convincing thousands of schools to use that API.

If all you can get are PDFs, attempting to automatically extract information from those PDFs is a reasonable decision to make. The challenge is doing it well enough to avoid these kinds of show-stopper problems.

a-dub

they're essentially an ATS SAAS for medical school, if they have enough schools or enough prestigious schools, they can ask for whatever they want and the applicant schools would oblige. cheeky way to make it happen overnight: give a slight advantage to transcripts that are submitted digitally- the conversion would be complete within months.

daemonologist

The trouble is getting people to use your API - in this case med schools, but it can be much, much worse (more and smaller organizations sending you data, and in some industries you have a legal obligation to accept it in any format they care to send).

aprilthird2021

> this year, they decided to pilot a tool that would extract that info (using "GPT-5o-mini": https://www.thalamusgme.com/blogs/methodology-for-creation-a...).

Mind-boggling idea to do this because OCR and pulling info out of PDFs has been done better and for longer by so many more mature methods than having an LLM do it

mattnewton

Nit, I’d say as someone who spent a fair amount of time doing it in the life insurance space, actually parsing arbitrary PDFs is very much not a solved problem without LLMs. Parsing a particular PDF is, at least until they change their table format or w/e.

I don’t think this idea is totally cursed, I think the implementation is. Instead of using it to shortcut filling in grades that the applicant could spot check, like a resume scraper, they are just taking the first pass from the LLM as gospel.

simonw

Right - the problem with PDF extraction is always the enormous variety of shapes that data might take in those PDFs.

If all the PDFs are the same format you can use plenty of existing techniques. If you have no control at all over that format you're in for a much harder time, and vLLMs look perilously close to being a great solution.

Just not the GPT-5 series! My experiments so far put Gemini 2.5 at the top of the pack, to the point where I'd almost trust it for some tasks - but definitely not for something as critical as extracting medical grades that influence people's ongoing careers!

walkabout

I've been doing PDF data extraction with LLMs at my day job, and my experience is that to get them sufficiently reliable for a document of even moderate complexity (say, has tables and such, form fields, that kind of thing) you end up writing prompts so tightly coupled to the format of the document that there's nothing but downside versus doing the same thing with traditional computer vision systems. Like, it works (ask me again in a couple years, once the underlying LLMs have been switched out, whether it's turned into whack-a-mole and long-missed data corruption issues... I'd bet it will) but using an LLM isn't gaining us anything at all.

Like, this company could have done the same projects we've been doing but probably gotten them done faster (and certainly with better performance and lower operational costs) any time in the last 15 years or so. We're doing them now because "we gotta do 'AI'!" so there's funding for it, but they could have just spent less money doing it with OpenCV or whatever years and years ago.

daemonologist

I would love to hear more about the solutions you have in mind, if you're willing.

The particular challenge here I think is that the PDFs are coming in any flavor and format (including scans of paper) and so you can't know where the grades are going to be or what they'll look like ahead of time. For this I can't think of any mature solutions.

williamdclt

The parent says

> that information is buried in PDFs sent by schools (often not standardized).

I don't think OCR will help you there.

An LLM can help, but _trusting_ it is irresponsible. Use it to help a human quickly find the grade in the PDF, don't expect it to always get it right.

aprilthird2021

Don't most jobs do OCR on the resumes sent in for employment? I get that a resume is a more standard format. Maybe that's the rub

bbarnett

Welcome to the world of greybeards, baffled by everyone using AWS at 100s to 100000s of times the cost of your own servers.

lazystar

spectre/meltdown, finding out your 6 month order of ssd's was stolen after opening empty boxes in the datacenter, and having to write RCA's for customers after your racks go over the PSU's limit are things ya'll greybeards seem to gloss over in your calculations, heh

beernet

Nothing new to see here. If you are still surprised by model hallucinations in 2025, it might be time for you to catch up or jump on the next hype bandwagon. Also, they reacted well:

> Once confirmed, we corrected the extracted grade immediately.

> Where the extracted grade was accurate, we provided feedback and guidance to the reporting program or school about its interpretation and the extraction methodology.

I still dislike the term "hallucinations". It comes across like the model did something wrong. It did not, as factually wrong outputs happen per design.

softwaredoug

It's true, but I think people have a misunderstanding that if you add search / RAG to ground the LLM, the LLM won't hallucinate. When in reality the LLM can still hallucinate, just convincingly in the language of whatever PDF it retrieved.

bigzyg33k

RAG certainly doesn't reduce hallucinations to 0, but using RAG correctly in this instance would have solved the hallucinations they describe.

The purpose of the system described in this post is to deal with OCR inaccuracies. It's convenient to use LLMs for OCR of PDFs because PDFs do not have standard layouts, and just using the text strings extracted from the PDF's code results in incorrect paragraph/sentence sequencing.

The way they *should* have used RAG is to ensure that subsentence strings extracted via LLM appear in the PDF at all, but it appears they were just trusting the output without automated validation of the OCR.

eoinbmorg

Is RAG the right tool for this? My understanding was that RAG uses vector similarity to compare queries (the extracted string) against the search corpus (the PDF file). The use case you describe is verification, which sounds like it would be better done with an exhaustive search via string comparison instead of vector similarity.

I could be totally wrong here.
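
A minimal sketch of the plain string-comparison check discussed here, assuming the PDF's text layer has already been pulled out by some other tool. The names `normalize` and `find_ungrounded`, the sample transcript text, and the grade labels are all made up for illustration; this is not what Thalamus actually runs.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse every run of whitespace to a single space."""
    return re.sub(r"\s+", " ", text).strip().lower()


def find_ungrounded(extracted: dict[str, str], pdf_text: str) -> list[str]:
    """Return the fields whose extracted string never appears verbatim in the PDF text."""
    haystack = normalize(pdf_text)
    ungrounded = []
    for field, value in extracted.items():
        needle = normalize(value)
        # An empty extraction is treated as ungrounded rather than trivially "found".
        if not needle or needle not in haystack:
            ungrounded.append(field)
    return ungrounded


# Hypothetical text layer and hypothetical model output (course plus grade per field).
pdf_text = "Internal Medicine   Honors\nSurgery  Pass\nPediatrics  High Pass"
extracted = {
    "Internal Medicine": "Internal Medicine Honors",
    "Surgery": "Surgery Fail",
}
print(find_ungrounded(extracted, pdf_text))
```

Here the invented "Surgery Fail" is the only field flagged. Checking the course and grade together matters: a bare "Pass" would be "found" inside "High Pass". The check also only works if the text layer keeps the course and grade adjacent, which many PDFs do not.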

nerdjon

> I still dislike the term "hallucinations". It comes across like the model did something wrong. It did not, as factually wrong outputs happen per design.

While I do see the issue with the word hallucination humanizing the models, I have yet to come up with or see a word that explains the problem so well to non-technical people. And quite frankly those are the people that need to understand that this problem still very much exists and is likely never going away.

Technically yeah the model is doing exactly what it is supposed to do and you could argue that all of its output is "hallucination". But for most people the idea of a hallucinated answer is easy enough to understand without diving into how the systems work, and just confusing them more.

byteknight

Never thought about it from that perspective, but I think you're right. It is by design, not deceptive intent, just the infinite monkeys theorem where we've replaced randomness with pattern matching trained on massive datasets.

leprechaun1066

Another way to look at it is that everything an LLM creates is a 'hallucination'; some of these 'hallucinations' are more useful than others.

I do agree with the parent post. Calling them hallucinations is not an accurate way of describing what is happening and using such terms to personify these machines is a mistake.

This isn't to say the outputs aren't useful, we see that they can be very useful...when used well.

walkabout

The way I've been putting it for a while is, "all they do is hallucinate—it's the only thing they do. Sometimes the hallucinations are useful."

fragmede

The OpenAI paper on hallucinations gives actual technical reasons for them, if you're interested.

https://openai.com/index/why-language-models-hallucinate/

https://arxiv.org/abs/2509.04664

shadowgovt

The key idea is the model doesn't have any signal on "factual information." It has a huge corpus of training data and the assumption humans generally don't lie to each other when creating such a corpus.

... but (a) we do, and (b) there's all kinds of dimensions of factuality not encoded in the training data that can only be unreliably inferred (in the sense that there is no reason to believe the algorithm has encoded a way to synthesize true output from the input at all).

MountDoom

> Nothing new to see here.

Eh, I don't think that's a productive thing to say. There's an immense business pressure to deploy LLMs in such decision-making contexts, from customer support, to HR, to content policing, to real policing. Further, because LLMs are improving quickly, there is a temptation to assume that maybe the worst is behind us and that models don't make too many mistakes anymore.

This applies to HN folks too: every other person here is building something in this space. So publicizing failures like this is important, and it's important to keep doing it over and over again so that you can't just say "oh, that was a 3o problem, our current models don't do that".

testdelacc1

I completely agree with you. GP’s cynical take is an upvote magnet but doesn’t contribute to the discourse.

seymore_12

All models are wrong, but some are useful. https://en.wikipedia.org/wiki/All_models_are_wrong

whoknowsidont

I think the definition of hallucination fits pretty neatly.

tdeck

The story isn't the hallucination, it's that people are using this shit in risky ways and ignoring the known problems with it. Engineers knew well before 1981 that building this [1] wasn't safe, but that didn't stop someone from building it. When it collapsed, it was a story.

[1] https://en.wikipedia.org/wiki/Hyatt_Regency_walkway_collapse

H8crilA

I also hate the term "hallucination", but for a different reason. A hallucination is a confusion of internal stimulus as an external input. The models simply make errors, have bad memory, are overconfident, are sampling from a fantasy world, or straight up lie; often at rates that are not dissimilar from humans. For models to truly hallucinate, develop delusions and all that good schizophrenia stuff we would need to have a truly recurrent structure that has enough time to go through something similar to the prodrome, and build up distortions and ideas.

TL;DR: being wrong, even very wrong != hallucination

kbyatnal

School transcripts are surprisingly one of the hardest documents to parse. The thing that makes them tricky is (1) the multi-column tabular layouts and (2) the data ambiguity.

Transcript data is usually found in some sort of table, but they're some of the hardest tables for OCR or LLMs to interpret. There's all kinds of edge cases with tables split across pages, nested cells, side-by-side columns, etc. The tabular layout breaks every off-the-shelf OCR engine we've run across (and we've benchmarked all of them). To make it worse, there's no consistency at all (every school in the country basically has their own format).

What we've seen help in these cases are:

1. VLM based review and correction of OCR errors for tables. OCR is still critical for determinism, but VLMs really excel at visually interpreting the long tail.

2. Using both HTML and Markdown as an LLM input format. For some of the edge cases, markdown cannot represent certain structures (e.g. a table cell nested within a table cell). HTML is a much better representation for this, and models are trained on a lot of HTML data.

The data ambiguity is a whole set of problems on its own (e.g. how do you normalize what a "semester" is across all the different ways it can be written). Eval sets + automated prompt engineering can get you pretty far though.
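
As a rough illustration of the normalization problem, here is a small sketch that maps a few ways of writing a term onto one canonical label. The alias table, the `normalize_term` name, and the rule that two-digit years belong to the 2000s are all assumptions for the example, not a description of any production system.

```python
import re
from typing import Optional

# Assumed alias table; real transcripts will need many more entries.
SEASONS = {
    "fall": "Fall", "autumn": "Fall", "fa": "Fall",
    "spring": "Spring", "sp": "Spring",
    "summer": "Summer", "su": "Summer",
    "winter": "Winter", "wi": "Winter",
}


def normalize_term(raw: str) -> Optional[str]:
    """Map strings like 'FALL 2023' or "Autumn '23" to 'Fall 2023'; None when unsure."""
    text = raw.strip().lower()
    words = re.findall(r"[a-z]+", text)
    season = next((SEASONS[w] for w in words if w in SEASONS), None)

    four_digit = re.search(r"\b(?:19|20)\d{2}\b", text)
    two_digit = re.search(r"'(\d{2})\b", text)
    if four_digit:
        year = int(four_digit.group())
    elif two_digit:
        year = 2000 + int(two_digit.group(1))  # assumption: '23 means 2023
    else:
        year = None

    if season is None or year is None:
        return None  # leave ambiguous cases for a human instead of guessing
    return f"{season} {year}"


for raw in ["FALL 2023", "Autumn '23", "2023 Fa", "Semester 3"]:
    print(repr(raw), "->", normalize_term(raw))
```

Returning `None` for anything the table does not cover is deliberate: an unrecognized label is easier to review than a confidently wrong one.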

Disclaimer: I started an LLM doc processing company to help companies solve problems in this space (https://extend.ai/).

meisel

Would it help a lot to run it through multiple different AI systems and verify that they agree on the result?

kbyatnal

Yeah, that can occasionally work and is something we've tested, but unfortunately it introduces a lot of noise and makes systematic evals difficult.
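
For anyone curious what the agreement check looks like, here is a minimal sketch. It assumes each extractor has already returned a course-to-grade mapping; `reconcile` and the sample runs are hypothetical.

```python
def reconcile(runs: list[dict[str, str]]) -> tuple[dict[str, str], dict[str, list]]:
    """Split fields into those every run agrees on and those needing human review."""
    accepted: dict[str, str] = {}
    disputed: dict[str, list] = {}
    all_fields = set().union(*runs)  # every course name seen in any run
    for field in sorted(all_fields):
        values = [run.get(field) for run in runs]
        if all(v is not None and v == values[0] for v in values):
            accepted[field] = values[0]
        else:
            disputed[field] = values  # missing in a run shows up as None
    return accepted, disputed


# Hypothetical outputs from three different extractors for one transcript.
runs = [
    {"Surgery": "Pass", "Pediatrics": "Honors"},
    {"Surgery": "Fail", "Pediatrics": "Honors"},
    {"Surgery": "Pass", "Pediatrics": "Honors"},
]
accepted, disputed = reconcile(runs)
print("accepted:", accepted)
print("needs review:", disputed)
```

Agreement is not proof of correctness, since different models can misread the same ugly table the same way, and every disagreement adds to the review pile, which is part of the noise mentioned above.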

simonw

Lots of comments in here that seem to have missed that this is about using vision-LLMs for OCR.

This makes it a slightly different issue from "hallucination" as seen in text based models. The model (which I think we can assume is GPT-5-mini in this case) is being fed scanned images of PDFs and is incorrectly reading the data from them.

Is this still a hallucination? I've been unable to identify a robust definition of that term, so it's not clearly wrong to call a model misinterpreting a document a "hallucination" even though it feels to me like a different category of mistake to an LLM inventing the title of a non-existent paper or lawsuit.

lysecret

These kinds of errors have always existed and will always exist; there is no perfect way to extract info from documents like this.

simonw

The models really are getting better though. Compare Gemini 1.5 and Gemini 2.5 on the same PDF document (I've done this a bunch) and you can see the difference.

The open question is how much better they need to get before they can be deployed for situations like this that require a VERY high level of reliability.
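
To put a rough number on that, here is a back-of-the-envelope sketch. It assumes each grade is read independently with the same accuracy, which real extraction errors probably don't satisfy, and the figure of 30 grades per transcript is invented purely for illustration.

```python
def p_all_correct(per_field_accuracy: float, fields: int) -> float:
    """Chance that every field on one document is right, assuming independent errors."""
    return per_field_accuracy ** fields


# Hypothetical: 30 grades per transcript at two per-grade accuracy levels.
for accuracy in (0.99, 0.999):
    print(accuracy, round(p_all_correct(accuracy, 30), 3))
```

Under those assumptions, 99% per-grade accuracy leaves only about a 74% chance that a whole transcript is error-free, and even 99.9% leaves roughly 3% of transcripts with at least one wrong grade.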

lysecret

I fully agree. My point was more that a lot of commenters seem to implicitly compare the LLM-based approach with some “better” or “simpler” approach, which really doesn’t exist. By my estimation, LLMs are SOTA for this kind of extraction (though they still have issues).

hoosieree

People don't respect the chasm between "obviously no mistakes" and "no obvious mistakes".

Aurornis

Frustrating that their official recommendation is to verify the grades manually.

If a tool is designed to extract the grades for easy access, do we really believe that the end users will then verify the grades manually to confirm the output? If they’re doing that, why use the tool at all?

Maybe the tool can extract what it believes is the grades section and show a screenshot for a human to interpret.

landl0rd

Because the contract has already been signed, they can't guarantee it works right, and they don't want to be open to lawsuits. "You, mister wrongly-denied applicant, cannot sue us; we specifically told them to check all grades manually!"

SketchySeaBeast

This is why this particular emperor has no clothes. They keep trying to jam AI into stuff to make it "easier", but the LLMs, by their very nature do the tasks in lossy or incorrect ways. Imagine if Microsoft had sold Excel with a "be sure to verify all the calculations" caveat.

nisten

While I don't want to discount the work of any physician-founded org, knowing the pain they go through from working with them after they've seen 18 patients in a day's work, this still just looks like bad software. With no testing, no internal bench.

Did you do some kind of zod schema, or compare the error rate of how different models perform for this task? Did you bother setting up any kind of JSON output at all? Did you add a second validation step with a different model and then check that their numbers are the same?

It looks like no, they just deferred the whole thing to authority. Technically there's no difference between them saying that gpt5-mini or llama2-7b did this.

Literally every single llm will make errors and hallucinate. It's your job to put all the scaffolding around to make sure it doesn't or that it does a lot less than a skilled human would.

So then have you measured the error rate or maybe tried to put some kind of error catching mechanism just like any professional software would do?
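
To be concrete about the bare minimum, here is a sketch in Python rather than zod. It assumes the model is asked to return a JSON object mapping course to grade; the four-value grading scale, the function names, and the sample data are assumptions for illustration only.

```python
import json

# Assumed grading scale; real schools use many different ones.
ALLOWED_GRADES = {"Honors", "High Pass", "Pass", "Fail"}


def parse_grades(raw_json: str) -> dict[str, str]:
    """Parse model output and reject anything outside the expected schema."""
    data = json.loads(raw_json)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object mapping course to grade")
    for course, grade in data.items():
        if not isinstance(grade, str) or grade not in ALLOWED_GRADES:
            raise ValueError(f"unexpected grade {grade!r} for {course!r}")
    return data


def field_error_rate(predicted: dict[str, str], gold: dict[str, str]) -> float:
    """Fraction of hand-labelled fields the extraction got wrong or missed."""
    if not gold:
        raise ValueError("gold set must not be empty")
    wrong = sum(1 for course, grade in gold.items() if predicted.get(course) != grade)
    return wrong / len(gold)


# Hypothetical model output and hand-labelled answer for one transcript.
predicted = parse_grades('{"Surgery": "Fail", "Pediatrics": "Honors"}')
gold = {"Surgery": "Pass", "Pediatrics": "Honors"}
print(field_error_rate(predicted, gold))
```

A schema check like this only catches malformed output. The hallucinated "Fail" above is perfectly valid JSON, which is why the error rate against a hand-labelled set is the number that matters.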

powersurge360

I keep circling this with AI and I'm not really sure what to do with it. They mention that the AI is meant to be used as reference only in the linked article but what does that actually mean? Who is checking who? Is the AI filling out the data from what it sees in the PDF and the user is expected to check it or is the user filling out the data and the AI is expected to check it?

Is the cost of AI worth it if all you're doing is something like 'linting' the extraction? How do you guarantee that people really, truly, are doing the same work as before and not just blindly clicking 'looks good'? What is the value of the AI telling you something when you cannot tell if it is lying?

omnicognate

Yeah, I've seen this "for reference only" wording in many places, often used as a sort of disclaimer on stuff that could be wrong, but I have absolutely no idea what it means in that context. To me "reference" implies comprehensive, high quality information that I can refer to when I need to know some obscure detail of something.

Is there some legal context in which this phrase has a specific meaning, perhaps?

bilekas

Am I crazy or was text parsing mastered long before AI? Why is GPT being used in this scenario in the first place?

mattnewton

Because it’s easier than asking for consistently formatted data from all the sources who just output random PDFs. Basically this is a coordination / people problem we’re papering over with a fancy engineering solution. Many such cases.

tdeck

Because it's less effort to get an MVP set up. Instead of having to test on a bunch of different PDFs and figure out how to address the right location in the text, just write a paragraph asking the LLM to do it. Of course, there are certain drawbacks...

hansonkd

It seems like a default mode for AI should be to generate repeatable Regex for text extraction.

tdeck

Unfortunately many PDFs don't even internally represent text in a contiguous way.
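
For the cases where the text does come out in reading order, a repeatable regex is simple enough. This sketch assumes one course per line, at least two spaces between course and grade, and a made-up four-value grading scale; `LINE` and `extract_grades` are hypothetical names.

```python
import re

# Assumed layout: "<course name><2+ spaces or tabs><grade>" on a single line.
LINE = re.compile(
    r"^(?P<course>[A-Za-z][A-Za-z &/-]*?)[ \t]{2,}"
    r"(?P<grade>Honors|High Pass|Pass|Fail)[ \t]*$",
    re.MULTILINE,
)


def extract_grades(text: str) -> dict[str, str]:
    """Return course -> grade for every line matching the assumed layout."""
    return {m["course"].strip(): m["grade"] for m in LINE.finditer(text)}


# Hypothetical text layer from a well-behaved transcript.
sample = "Internal Medicine    Honors\nSurgery  Pass\nPediatrics   High Pass\nNotes: none"
print(extract_grades(sample))
```

The pattern is deterministic and easy to audit, but it silently returns nothing for any layout it was not written for, so it would need a check that the number of matches is plausible.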

socrateswasone

It's predicting the next token by statistical approximation. Hallucination vs fact is an ad-hoc distinction we impose on the result to suit our purpose.

lukeschlather

Using a mini model for this seems grossly irresponsible. I've been doing some work testing models for similar extraction tasks (nothing where a failure affects someone's grade or anything) and gpt mini / Gemini flash simply can't do this sort of thing. Using anything less than the highest model with reasoning, you're guaranteed to get this sort of thing happening.

It is very tempting to do it, obviously, with the cost difference, but it's not worth it. But on the other hand, people talk about LLMs with a broad brush and I don't know, there's still testing but I would be surprised to hear that GPT-5-pro with thinking had an issue like this.

owenthejumper

This sucks. Residency match is stressful as it is, and adding systems like these just makes the experience even worse for the applicants.

Source: spouse matched in 2018. It was one of the most stressful periods of our lives.

softwaredoug

I see _even with search/RAG_ LLMs hallucinate. They just hallucinate more convincingly in the language of the documents you retrieved.

So you really have to double check when researching information that really matters.