
Why LLMs still have problems with OCR

55 comments · February 6, 2025

Document ingestion and the launch of Gemini 2.0 caused a lot of buzz this week. As a team building in this space, this is something we researched thoroughly. Here’s our take: ingestion is a multistep pipeline, and maintaining confidence in LLMs' nondeterministic outputs over millions of pages is a problem.

michaelbuckbee

I took a picture of a grocery list and then pasted it into ChatGPT to have it written out and it worked flawlessly...until I discovered that I'd messed up the picture when I took it at an angle and had accidentally cut off the first character or two of the bottom half of the list.

ChatGPT just inferred that I wanted the actual full names of the items (aka "flour" instead of "our").

Depending on how you feel about it, this is either an absolute failure of OCR or wildly useful and much better.

nodamage

I once did something similar with a recipe from a cookbook where the recipe started at the bottom of one page and continued onto the next page. It correctly identified the first few ingredients present in the photo of the first page but then proceeded to hallucinate another half-dozen or so ingredients in order to generate a complete recipe.

ritvikpandey21

yup, this is a pretty common occurrence when using LLMs for data extraction. For personal use (trying to load a receipt) it’s great that the LLM filled in info. For production systems which need high-quality, near-100% extraction accuracy, inferring results is a failure. Think medical record parsing, financial data, etc. These hallucinations occur quite frequently, and we haven’t found a way to minimize them through prompt engineering.

llm_trw

It's not possible with current gen models.

To even have a chance at doing it you'd need to start the training from scratch with _huge_ penalties for filling in missing information and a _much_ larger vision component to the model.

See an old post I made on what you need to get above sota OCR that works today: https://news.ycombinator.com/item?id=42952605#42955414

amelius

Maybe ask it to return the bounding box of every glyph.
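
Something like this, as a rough sketch with the google-generativeai SDK (the model name, file name, and prompt wording are all illustrative, not a recommendation):

    # Ask a VLM for per-glyph bounding boxes alongside the transcription.
    import google.generativeai as genai
    from PIL import Image

    genai.configure(api_key="...")
    model = genai.GenerativeModel("gemini-2.0-flash")
    page = Image.open("scan.png")

    prompt = (
        "Transcribe this page. For every character, return a JSON object "
        'with "char" and "box" as [x0, y0, x1, y1] pixel coordinates.'
    )
    response = model.generate_content([prompt, page])
    print(response.text)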

thegeomaster

This universally fails on anything from frontier models to Gemini 2.0 Flash in its custom fine-tuned bounding-box extraction mode.

llm_trw

This is great until it hallucinates rows in a report with company assets that don't exist - why wouldn't a mining company own some excavation equipment? - and pollutes all future searches with fake data right from the start.

I laugh every time I hear someone tell me how great VLMs are for serious work by themselves. They are amazing tools with a ridiculously fluctuating (and largely undetectable) error rate that need a lot of other tools to keep them above board.

ritvikpandey21

we completely agree - mechanistic interpretability might help keep these language models in check, but it’s going to be very difficult to run this on closed-source frontier models. i'm excited to see where that field progresses

Terr_

I'm in the "failure" camp, because the true correctness of an answer comes from how it was reached. [0]

The correct (or at least humanly-expected) process would be to identify the presence of a mangled word, determine what its missing characters could have been, and if some candidate is a clear contextual winner (e.g. "fried chicken" not "dried chicken") use that.
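
A toy sketch of that process (the vocabulary and the scoring function are stand-ins for a real context model):

    # Toy sketch: complete a truncated word only when context clearly
    # disambiguates it; otherwise surface the mangled text instead of guessing.
    VOCAB = ["flour", "four", "fried chicken", "dried chicken"]

    def candidates(fragment: str) -> list[str]:
        """Vocabulary entries whose leading characters were plausibly cut off."""
        return [w for w in VOCAB if w.endswith(fragment)]

    def best_completion(fragment: str, context_score) -> str | None:
        cands = sorted(candidates(fragment), key=context_score, reverse=True)
        if not cands:
            return None
        # Accept only a clear contextual winner.
        if len(cands) == 1 or context_score(cands[0]) > 2 * context_score(cands[1]):
            return cands[0]
        return None

    # "our" with its first letters cut off -> "flour", not "four"
    print(best_completion("our", lambda w: {"flour": 0.9, "four": 0.1}.get(w, 0.0)))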

However I wouldn't be surprised if the LLM is doing something like "The OCR data is X. Repeat to me what the OCR data is." That same process could also corrupt things.

[0] If that's not true, then it means I must have a supernatural ability to see into the future and correctly determine the result of a coin toss in advance. Sure, the power only works 50% of the time, but you should still worship me for being a major leap in human development. :p

TeMPOraL

> I'm in the "failure" camp, because the true correctness of an answer comes from how it was reached.

Something I may have believed until I got married. Now I know that "fnu cwken" obviously means "fresh broccoli, because what else could it mean, did I say something about buying chicken, obviously this is not chicken since I asked you to go to produce store and they DON'T SELL CHICKEN THERE".

Seriously though, I'm mostly on the side of "huge success" here, but LLMs sometimes really get overzealous with fixing what ain't broke.

jll29

In case any scientist actually working on adaptive OCR is reading this: I was given a post-WWII newspaper archive (PDF scans, 1945-2006, German language) that I would like to OCR with the highest quality. Compute demands are not an issue; I've got an army of A100s available.

I played with OCR post-correction algorithms and invented one method myself in 1994, but haven't worked in that space since. Initial Tesseract and GPT-4o experiments disappoint. Any pointers (papers, software) & collab. suggestions welcome.
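
For concreteness, a minimal Tesseract baseline for this kind of archive looks roughly like the following (the file path is made up, and it assumes the `deu` traineddata and poppler are installed):

    # Rasterize the scans and run Tesseract's German model per page.
    # --psm 4 ("single column of text of variable sizes") is often a
    # better starting point than the default for newspaper columns.
    from pdf2image import convert_from_path
    import pytesseract

    pages = convert_from_path("archive/1947-03-12.pdf", dpi=300)
    for i, page in enumerate(pages):
        text = pytesseract.image_to_string(page, lang="deu", config="--psm 4")
        with open(f"out/page_{i:04d}.txt", "w", encoding="utf-8") as f:
            f.write(text)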

ahoka

I’m very far from an expert, but had good luck with EasyOCR when fiddling with such things.

patcon

Pls contact archive.org about adopting this digital archive once it exists (they also have a bad habit of accepting physical donations, if you are nearby)

ritvikpandey21

would love to give this a shot with pulse! feel free to reach out to me at ritvik [at] trypulse [dot] ai, and i’d be very curious to give these a run! in general, i’m happy to give some general advice on algos/models to fine-tune for this task

sumedh

Are you targeting business or consumers?

I cannot find the pricing page.

sidmanchkanti21

our current customers are both enterprises and individuals.

pricing page is here https://www.runpulse.com/pricing-studio-pulse

mdbmdb

I would love to get access to that archive!

coder543

I'm somewhat surprised neither this article nor the previous one mention anything about the Florence-2 model series. I had thought that Florence-2 was not just surprisingly capable for this kind of work, but also easily fine-tunable for a particular kind of document, when you expect to process a lot of instances of that document and want to further optimize accuracy. It's extremely small (0.23B and 0.77B parameters), so it's easy to run, easy to fine-tune, and probably unlikely to overthink things.

https://arxiv.org/abs/2311.06242

https://huggingface.co/blog/finetune-florence2

https://blog.roboflow.com/florence-2-ocr/

https://www.assemblyai.com/blog/florence-2-how-it-works-how-...

I don't personally deal with any OCR tasks, so maybe I misread the room, but it sounded promising, and I have seen some continuing interest in it online elsewhere.

In addition to the architectural issues mentioned in OP's article, I also expect that current SOTA LLMs like Gemini 2.0 Flash aren't being trained with very many OCR examples... for now, it seems like the kind of thing that could benefit from fine-tuning on that objective, which would help emphasize to the model that it doesn't need to try to solve any equations or be helpful in any smart way.
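
For anyone curious, Florence-2 inference looks roughly like this, following its Hugging Face model card (the <OCR> task prompt and checkpoint name come from there; the image path is illustrative):

    # Sketch of Florence-2 OCR inference via transformers, per the model card.
    from PIL import Image
    from transformers import AutoModelForCausalLM, AutoProcessor

    model_id = "microsoft/Florence-2-base"
    processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

    image = Image.open("document.png").convert("RGB")
    inputs = processor(text="<OCR>", images=image, return_tensors="pt")
    generated = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        num_beams=3,
    )
    raw = processor.batch_decode(generated, skip_special_tokens=False)[0]
    result = processor.post_process_generation(raw, task="<OCR>", image_size=image.size)
    print(result["<OCR>"])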

password4321

As opposed to the discussion 2 days ago with 400+ comments:

Ingesting PDFs and why Gemini 2.0 changes everything

https://news.ycombinator.com/item?id=42952605

h0l0cube

FTA:

> This week, there was a viral blog about Gemini 2.0 being used for complex PDF parsing, leading many to the same hypothesis we had nearly a year ago at this point. Data ingestion is a multistep pipeline, and maintaining confidence from these nondeterministic outputs over millions of pages is a problem.

password4321

Yes and per the poster's opening comment:

https://news.ycombinator.com/item?id=42966958#42966959

h0l0cube

It seemed you were implying the article was unaware of the earlier post, whereas the OP poses itself as a rebuttal. Perhaps a fault of my inference.

jsight

That's what I thought too, but apparently the title is pure, absolute, rage-inducing clickbait.

The actual conclusion is that they make classes of errors that traditional OCR programs either don't make, or make in different ways.

dang

I assume you mean the title of the current thread? I've attempted to make it less baity now.

llm_trw

This is a response to: https://news.ycombinator.com/item?id=42952605

A fun thread to read for the current hype cycle.

You can tell who is working in the field by the fact that they don't use VLMs for OCR, and who isn't by the fact that they think it's a solved problem.

A question to the authors.

Do you have the resources to train any VLMs from scratch? They aren't quite the beasts the SOTA LLMs are, and I think they can be made a lot more useful with:

1) Better training data.

2) Larger vision parts of the model.

In short: 2D attention is not something that anyone's doing at scale - that I know of - and is a no-brainer for understanding images.

__rito__

I was just trying a bunch of models for OCR. I only have 4 GB of VRAM in my personal machine.

My goal was to run an OCR model locally and extract text from scanned PDFs.

Many models could not even be run. Among those that did run (thanks to Ollama), the experience was very poor - llava-llama3, phi3.5 vision, etc.

What worked really well, though still not up to the mark: Surya [0].

It works perfectly on screenshots from true-text PDFs, but not on scanned PDFs. It also performs much better on English than on Indian languages.

[0]: https://github.com/VikParuchuri/surya
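
For anyone wanting to try it, the Surya README documents a CLI along these lines (flags have changed between versions, so treat this as a sketch, not gospel):

    # per the Surya README; check `surya_ocr --help` for your version
    surya_ocr scanned_document.pdf --langs en,hi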

markisus

I found this part questionable.

> Fixed patch sizes may split individual characters

> Position embeddings lose fine-grained spatial relationships, losing the ability to have human-in-the-loop evaluations, confidence scores, and bounding box outputs.

The author suggests that the standard ViT architecture is poorly suited for OCR because patches do not respect character boundaries and that the positional embeddings only embed the locations of patches, which are 16x16 pixels.
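
For concreteness, the patchification being discussed is just a fixed grid slice; a minimal PyTorch sketch:

    # A ViT slices the page into fixed 16x16 patches regardless of where
    # glyph boundaries fall; a character straddling a patch edge is split
    # across two (or four) input tokens.
    import torch

    page = torch.randn(1, 3, 224, 224)                 # dummy page image
    patches = page.unfold(2, 16, 16).unfold(3, 16, 16)
    print(patches.shape)                               # torch.Size([1, 3, 14, 14, 16, 16])
    # -> a 14x14 grid of patches, i.e. 196 tokens for the whole page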

My mental model is that a token is a memory slot where computation results can be stored or retrieved from. There is no reason why the layout of these memory slots should mimic the layout of the document, except at the very first layer, where it saves us from having to think too hard about how to encode the document.

julienchastang

I've had limited but good experience (with both English and French text) with Tesseract, then getting ChatGPT to fix problems with clever prompting (e.g., pretend you are an expert OCR corrector, blah blah, blah).

ritvikpandey21

for most (text-dense) documents without much layout variation, these small prompt-eng tricks work pretty well! scaling this to complex layouts and 1000+ page docs, we found the models don’t stick to their instructions. perhaps there’s some work to be done with 1M+ context length models so they don’t lose layout memory.

bryzaguy

I wasn't seeing what OCR stands for anywhere; I believe it's Optical Character Recognition.

kyriakos

I find that LLMs can read text off product label photos I can't even read myself.

AlphaAndOmega0

If you don't know what the text says, do you have access to some other form of ground truth? Because otherwise you don't know if they're reading illegible labels correctly!

kyriakos

I can know what the text says because I have the actual product available :) But you are right, if the LLM can't read it, it will probably fill in the gaps with hallucinations.

ritvikpandey21

yes, they usually can! we delved into the mathematics behind this a bit in the blog, but tl;dr: the LLMs are making educated guesses based on embedding similarities - which can be detrimental for OCR systems.

wkat4242

I noticed llama 3.2 8b has big problems reading white-on-black text. Black-on-white goes way better. But I think it makes sense: they don't look at text like a dedicated OCR algorithm does. I see the article elaborates on this very well.

ritvikpandey21

thanks for the feedback!

levocardia

LLMs do not struggle at all with raw text: they never lose decimal places or drop digits when transcribing a table that is already text. I do this all the time and all major LLMs handle it eminently well. So the problem is not the internal representation.

The problem comes from the vision part. Either (a) the ViT architecture needs a rework, or (b) the vision models need more training on tasks of the "copy this" nature versus the "do this" nature.