Why LLMs still have problems with OCR

152 comments

·February 6, 2025

Document ingestion and the launch of Gemini 2.0 caused a lot of buzz this week. As a team building in this space, this is something we researched thoroughly. Here’s our take: ingestion is a multistep pipeline, and maintaining confidence from LLM nondeterministic outputs over millions of pages is a problem.

Visit

michaelbuckbee

I took a picture of a grocery list and then pasted it into ChatGPT to have it written out and it worked flawlessly...until I discovered that I'd messed up the picture when I took it at an angle and had accidentally cut off the first character or two of the bottom half of the list.

ChatGPT just inferred that I wanted the actual full names of the items (aka "flour" instead of "our").

Depending on how you feel about it, this is either an absolute failure of OCR or wildly useful and much better.

llm_trw

This is great until it hallucinates rows in a report with company assets that don't exist - why wouldn't a mining company own some excavation equipment? - and pollutes all future searches with fake data right from the start.

I laugh every time I hear someone tell me how great VLMs are for serious work by themselves. They are amazing tools with a ridiculously fluctuating (and largely undetectable) error rate that need a lot of other tools to keep them above board.

osigurdson

It does seem that companies are able to get reliability in narrow problem domains via prompts, evals and fine tuning.

sgc

> It does seem

And therein lies all the problem. The verification required for serious work is likely orders of magnitude more than anybody is willing to spend on.

For example, professional OCR companies have large teams of reviewers who double or triple review everything, and that is after the software itself flags recognition with varying degrees of certainty. I don't think companies are thinking of LLMs as tools that require that level of dedication and resources, in virtually all larger scale use cases.

jgalt212

In some cases this is true, but then why choose an expensive world model over a small net or random forest you trained specifically for the task at hand?

ritvikpandey21

we completely agree - mechanistic interpretability might help keep these language models in check, but it’s going to be very difficult to run this on closed source frontier models. im excited to see where that field progresses

dyauspitr

You can literally ask it to mark characters that are not clear instead of inferring them.

paulsutter

Unit tests work remarkably well

null

[deleted]

gcanyon

> They are amazing tools with a ridiculously fluctuating (and largely undetectable) error rate that need a lot of other tools to keep them above board.

So are human beings. Meaning we've been working around this issue since forever, we're not suddenly caught up in a new thing here.

rainsford

While humans can and do make mistakes, it seems to me like there is a larger problem here that LLMs make mistakes for different reasons than humans and that those reasons make them much worse than humans at certain types of problems (e.g. OCR). Worse, this weakness might be fundamental to LLM design rather than something that can be fixed by just LLM-ing harder.

I think a lot of this gets lost in the discussion because people insist on using terminology that anthropomorphizes LLMs to make their mistakes sound human. So LLMs are "hallucinating" rather than having faulty output because their lossy, probabilistic model fundamentally doesn't actually "understand" what's being asked of it the way a human would.

SketchySeaBeast

Human beings also have an ability to doubt their own abilities and understanding. In the case of transcription, if someone has doubts about what they are transcribing they'll try and seek clarity instead of just making something up. They also have the capacity to know if something is too important to screw up and adjust their behaviour appropriately.

null

[deleted]

Terr_

I'm in the "failure" camp, because the true correctness of an answer comes from how it was reached. [0]

The correct (or at least humanly-expected) process would be to identify the presence of mangled word, determine what its missing suffixes could have been, and if some candidate is a clear contextual winner (e.g. "fried chicken" not "dried chicken") use that.

However I wouldn't be surprised if the LLM is doing something like "The OCR data is X. Repeat to me what the OCR data is." That same process could also corrupt things, because it's a license to rewrite anything to look more like its training data.

[0] If that's not true, then it means I must have a supernatural ability to see into the future and correctly determine the result of a coin toss in advance. Sure, the power only works 50% of the time, but you should still worship me for being a major leap in human development. :p

TeMPOraL

> I'm in the "failure" camp, because the true correctness of an answer comes from how it was reached.

Something I may have believed until I got married. Now I know that "fnu cwken" obviously means "fresh broccoli, because what else could it mean, did I say something about buying chicken, obviously this is not chicken since I asked you to go to produce store and they DON'T SELL CHICKEN THERE".

Seriously though, I'm mostly on the side of "huge success" here, but LLMs sometimes really get overzealous with fixing what ain't broke.

pbhjpbhj

I often think that LLM issues like this could be solved by a final pass of "is the information in this image the same as this text" (ie generally a verification pass).

It might be that you would want to use a different model {non-generative} for that last pass -- which is like the 'array of experts' type approach. Or comparing to your human analogy, like reading back the list to your partner before you leave for the shops.

setr

Isn’t that just following through on “from how it was reached”? Without any of that additional information, if the LLM gave the same result, we should consider it the product of hallucination

williamcotton

On your epistemology, if you correctly guess the outcome of a random event then the statement, even if contingent on an event that did not yet occur, is still true. The same goes for every incorrect guess.

If you claim that you guess correctly 50% of the time then you are, from a Bayesian perspective, starting with a reasonable prior.

You then conflate the usefulness of some guessing skill with logic and statistics.

How this relates to an LLM is that the priors are baked into the LLM so statistics is all that is required to make an educated guess about the contents of a poorly written grocery list. The truthfulness of this guess is contingent on events outside of the scope of the LLM.

How often, applying a scalar value to the statistical outcome of an event, is very important. If your claim is that LLMs are wrong 5O% of the time then you need to update your priors based on some actual experience.

kaonwarb

To consider: do we overestimate what we know about how we humans reach an answer? (Humans are very capable of intuitively reading scrambled text, for example, as long as the beginning and ending of each word remains correct.)

afro88

The correct way to handle it is to ask the user if it's not clear, like a real assistant would

nodamage

I once did something similar with a recipe from a cookbook where the recipe started at the bottom of one page and continued onto the next page. It correctly identified the first few ingredients present in the photo of the first page but then proceeded to hallucinate another half-dozen or so ingredients in order to generate a complete recipe.

ritvikpandey21

yup, this is a pretty common occurrence in using LLMs for data extraction. For personal use (trying to load a receipt) it’s great that the LLM filled in info. For production systems which need high quality, near 100% extraction accuracy, inferring results is a failure. Think medical record parsing, financial data, etc These hallucinations occur quite frequently, and we haven’t found a way to minimize this through prompt eng.

llm_trw

It's not possible with current gen models.

To even have a chance at doing it you'd need to start the training from scratch with _huge_ penalties for filling in missing information and a _much_ larger vision component to the model.

See an old post I made on what you need to get above sota OCR that works today: https://news.ycombinator.com/item?id=42952605#42955414

amelius

Maybe ask it to return the bounding box of every glyph.

thegeomaster

This universally fails, on anything from frontier models to Gemini 2.0 Flash in its custom fine-tuned bounding box extraction mode.

jmartin2683

My experiences have been the same… that is to say nothing like what is reported here. This is more pitch than info.

Odd timing, too given flash 2.0 release and its performance on this problem.

Buttons840

Failed as a machine, succeeded as an intelligence. Intelligences don't make good machines though.

davidhs

I recently took a picture of the ingredients list on my multivitamin and fed it into ChatGPT o1-pro (at least that was the config.) and it made up ingredients and messed up the quantities.

daveguy

Sounds like a xerox.

coder543

I'm somewhat surprised neither this article nor the previous one mention anything about the Florence-2 model series. I had thought that Florence-2 was not just surprisingly capable for this kind of work, but also easily fine-tunable for a particular kind of document, when you expect to process a lot of instances of that document and want to further optimize accuracy. It's extremely small (0.23B and 0.77B parameters), so it's easy to run, easy to fine-tune, and probably unlikely to overthink things.

https://arxiv.org/abs/2311.06242

https://huggingface.co/blog/finetune-florence2

https://blog.roboflow.com/florence-2-ocr/

https://www.assemblyai.com/blog/florence-2-how-it-works-how-...

I don't personally deal with any OCR tasks, so maybe I misread the room, but it sounded promising, and I have seen some continuing interest in it online elsewhere.

In addition to the architectural issues mentioned in OP's article that are faced by most SOTA LLMs, I also expect that current SOTA LLMs like Gemini 2.0 Flash aren't being trained with very many document OCR examples... for now, it seems like the kind of thing that could benefit from fine-tuning on that objective, which would help emphasize to the model that it doesn't need to try to solve any equations or be helpful in any smart way.

jll29

In case any scientist actually working on adaptive OCR is reading this, I was given a post-WWII newspaper archive (PDF scans, 1945-2006, German language) that I would like to OCR with the highest quality, compute demands are not an issue, I've got an army of A100s available.

I played with OCR post-correction algorithms an invented on method myself in 1994, but haven't worked in that space since. Initial Tesseract and GPT-4o experiments disappoint. Any pointers (papers, software) & collab. suggestions welcome.

zeograd

I tried https://github.com/PaddlePaddle/PaddleOCR for my own use case (scanline images of parcel labels) and it beat Tesseract by an order of magnitude.

(Tesseract managed to get 3 fields out of a damaged label, while PaddleOCR found 35, some of them barely readable even for a human taking time to decypher them)

moffkalast

When I was doing OCR for some screenshots last year I managed to get it done with tesseract, but just barely. When looking for alternatives later on I found something called Surya on github which people claim does a lot better and looks quite promising. I've had it bookmarked for testing forever but I haven't gotten around to actually doing it. Maybe worth a try I guess?

ianhawes

Surya is on par with cloud vision offerings.

ritvikpandey21

would love to give this a shot with pulse! feel free to reach out to me at ritvik [at] trypulse [dot] ai, and i’d be very curious to give these a run! in general, i’m happy to give some general advice on algos/models to fine-tune for this task

sumedh

Are you targeting business or consumers?

I cannot find the pricing page.

sidmanchkanti21

our current customers are both enterprises and individuals.

pricing page is here https://www.runpulse.com/pricing-studio-pulse

patcon

Pls contact archive.org about adopting this digital archive once it exists (they also have a bad habit of accepting physical donations, if you are nearby)

ahoka

I’m very far from an expert, but had good luck with EasyOCR when fiddling with such things.

pbhjpbhj

If it's a large enough corpus I imagine it's worth fine tuning to the specific fonts/language used?

mdbmdb

I would love to get access to that archive!

jeswin

If Pulse (which is a competing product, the premise of which is threatened by both closed and open models) wants to dispute the post earlier this week, it should provide samples which fail in Claude and Gemini. The image [1] in the post is low-resolution and fuzzy. Claude's user manual specifically says: "Images uploaded on Claude.ai can be up to 30MB, and up to 8000x8000 pixels. We recommend avoiding small or low resolution images where possible."

> We have hundreds of examples like this queued up, so let us know if you want some more!

Link to it then, let people verify.

I've pushed a lot of financial tables through Claude, and it gives remarkable accuracy (99%+) when the text size is legible to a mid-40s person like me. Gpt-4o is far less accurate.

[1]: https://cdn.prod.website-files.com/6707c5683ddae1a50202bac6/...

sgc

99%+ is terrible in the OCR world. 99.8%+ on first pass, and 99.99%+ (1/10k characters error) at the end of the process - which includes human reviewers in the loop - is ok, but the goal is higher fidelity than that. If we are throwing billions at the problem, I would expect at least another 9 on that.

rahimnathwani

  99.8%+ on first pass

Even with the best OCR, and high resolution scans, you might not get this due to:

- the quality of the original paper documents, and

- the language

I have non-English documents for which I'd love to have 99% accuracy!

sgc

Language is often solvable by better dictionaries. I have been forced to make my own dictionaries in the past, that led to similar error rates as more mainstream languages like English. If you are talking about another alphabet like Cyrillic or Arabic etc, that is another problem.

null

[deleted]

davedx

Ha hi Jeswin! I was itching to reply to this post too, I wonder why…

jeswin

Dave! Our sample sizes were large enough, and tables complex enough to opine on this.

I suppose Gemini or Claude could fail with scans or handwritten pages. But that's a smaller (and different) set of use cases than just OCR. Most PDFs (in healthcare, financial services, insurance) are digital.

bambax

Using that image and the following prompt on Gemini 2.0 Flash "please do ocr of the attached file and output ascii following the layout of the original as faithfully as possible" outputs something that isn't bad but not perfect:

  PI tno Name             Time            3.5 km   18 C   (cont.)
  MEN B (39)                                                  3(34)         4(52)         5(53)         6(54)         7(55)         
  8(40)         9(57)
                                                               12(60)        13(61)        14(62)        15(63)        16(47)        
  17(48)       18(100)
                                                  1(51)         2(33)
                                                  10(58)        11(59)

The first column is offset vertically which mixes up information and is wrong.

I'm building a traditional OCR pipeline (for which I'm looking for beta testers! ;-) and this is what it outputs:

  PI      tno Name                       Time
  
  MEN   B (39)                                                         3.5 km       18 C          (cont.)
                                                   1 (51)                  2 (33)                 3 (34)                  4 (52)                  5   (53)                  6 (54)                  7 (55)                  8 (40)                 9 (57)
                                                  10 (58)                 11 (59)                12 (60)                 13 (61)                 14 (62)                 15 (63)                16 (47)                 17 (48)                18 (100)
                                                  Finish
  
  13     425  Peter  Hodkinson          11:40       0:48   +0: 06 (21)      1:29  +0: 13 (28)      1:58   +0: 13 (24)      2:44   +0: 18 (23)      3:38   +0: 20 (19)     4:28    +0: 22 (18)     5:05   +0: 23 (17)      5:36   +0: 26 (17)      6:19   +0: 29 (19)
              Great  Britain                        0:48   +0: 06 (21)      0:41  +0: 09 (30)      0:29   +0: 01 (4)       0:46   +0: 07 (22)      0:54   +0: 02 (5)      0:50    +0: 03 (7)      0:37   +0: 02 (10)      0:31   +0: 03 (11)      0:43   +0: 05 (20)
                                                    6:47   +0: 28 (17)     7:02   +0: 29 (17)      8:21   +0: 38 (16)      8:41   +0: 39 (16)      9:00   +0: 41 (16)     9:13    +0: 42 (16)     9:43   +0: 42 (16)     10:36   +0: 43 (14)     11:32   +0: 41 (13)
                                                    0:28   +0: 02 (8)      0:15   +0: 01 (4)       1:19   +0: 11 (16)      0:20   +0: 03 (15)      0:19   +0: 02 (4)      0:13   +0: 02 (11)      0:30   +0: 01 (2)       0:53   +0: 01 (3)       0:56    0:00  (1)
                                                   11:40   +0: 40 (13)
                                                    0:08   +0: 00 (8)

(edit: line wrap messes it all up... still I think my version is better ;-)

jeswin

I usually say something like: ".. output it as hierarchical json". For better accuracy, we can run the output through another model.

Again, that image is fuzzy. If the argument is that these generic models don't work well with scans or handwritten content, I can perhaps agree with that. But that's a much smaller subset of PDFs.

password4321

As opposed to the discussion 2 days ago with 400+ comments:

Ingesting PDFs and why Gemini 2.0 changes everything

https://news.ycombinator.com/item?id=42952605

h0l0cube

FTA:

> This week, there was a viral blog about Gemini 2.0 being used for complex PDF parsing, leading many to the same hypothesis we had nearly a year ago at this point. Data ingestion is a multistep pipeline, and maintaining confidence from these nondeterministic outputs over millions of pages is a problem.

password4321

Yes and per the poster's opening comment:

https://news.ycombinator.com/item?id=42966958#42966959

h0l0cube

It seemed you were implying the article was naive to the earlier post, whereas the OP poses itself as a rebuttal. Perhaps a fault of my inference.

jsight

That's what I thought too, but apparently the title is pure, absolute, rage-inducing clickbait.

The actual conclusion is that they make classes of errors that traditional OCR programs either don't make, or make in different ways.

dang

I assume you mean the title of the current thread? I've attempted to make it less baity now.

jsight

Indeed, the new title is far better. Thanks!

mehulashah

(CEO of Aryn here: https://aryn.ai)

Nice post and response to the previous one.

It’s important to remember that the use cases for VLMs and document parsers are often different. VLMs definitely take a different approach than layout detection and OCR. They’re not mutually exclusive. VLMs are adaptable with prompting, eg please pull out the entries related to CapEx and summarize the contributions. Layout parsers and OCR are often used for indexing and document automation. Each will have their own place in an enterprise stack.

snthd

>Unlike traditional OCR systems that fail obviously when uncertain, LLMs make educated guesses that appear plausible but may be entirely wrong.

Except for a very special kind of bug:

https://www.dkriesel.com/en/blog/2013/0802_xerox-workcentres...

>Xerox scanners/photocopiers randomly alter numbers in scanned documents

null

[deleted]

thorum

This seems like a problem that will quickly fall to the new reinforcement learning methods introduced by DeepSeek. Just build a system to synthetically render a few million pages of insanely complex, hard-to-parse documents with different layouts along with a JSON description of what the correct OCR should be, mix in some human annotated datasets, then do RL against a verifier that insists on 100% accuracy.

singularity2001

I still don't get the reinforcement part here. Wouldn't that be normal training against the data set? Like how would you modify the normal MNIST training to be reinforcement learning

barrenko

not an expert - yes, what would usually just be called training, with LLMs here is called RL. You do end up writing a sort of a reward function, so I guess it is RL.

hodapp

You are right; the advanced in DeepSeek-R1 used RL almost solely because of the chain-of-thought sequences they were generating and training it on.

osigurdson

ChatGPT is also still hilariously bad at drawing diagrams - universally producing a silly cartoon with misspelled words. The rate of improvement over the past two years is effectively zero.

catlifeonmars

Why would you use ChatGPT to draw diagrams? It’s a generative language model. Just because you can doesn’t mean it’s the best tool for the job.

osigurdson

Why not include some suggested tools in your comment? "you are an idiot for using ChatGPT" (paraphrased) isn't very helpful.

Logge

That's DALL3 which is not an LLM

osigurdson

Good point. I probably knew that at one time but now leverage it via chatgpt so forgot. Does anyone know if there is an AI wall with text to image?

cnasc

I’m quite late here, but if you want a diagram you can ask the LLM to output Mermaid syntax and then paste that into Excalidraw or something else that can render based on Mermaid.

markisus

I found this part questionable.

> Fixed patch sizes may split individual characters

> Position embeddings lose fine-grained spatial relationships, losing the ability to have human-in-the-loop evaluations, confidence scores, and bounding box outputs.

The author suggests that the standard ViT architecture is poorly suited for OCR because patches do not respect character boundaries and that the positional embeddings only embed the locations of patches, which are 16x16 pixels.

My mental model is that a token is a memory slot where computation results can be stored or retrieved from. There is no reason why we should want the layout of these memory slots must mimic the layout of the document, except at the very first layer, because then we don't have to think too hard about how to encode the document.

bambax

I'm making a simple service that outputs layout-following ASCII from images, PDFs of images or text PDFs. I too think the risk of hallucination is in many cases too great.

I fed my system the first image in the post [0] and got the text below in return.

I will be looking for beta testers next week... Email if interested!

  VEH  YR  MAKE  MODEL        IDENTIFICATION   TYPE SYM  ST  TER USE  CLASS ALARM
     2 02  HOND  CIVIC EX     1HGEM22952L086006 PP    18  IL  37  L  887120
     LOSS PAYEE THAT APPLIES:     2
     3.02  HYUN  SONATA / GL  KMHWF25S72A671544 PP   16   IL  37  P  887120

  H    NO. COVERAGE DESCRIPTION   LIABILITY LIMIT (S) DEDUCTIBLE          PREMIUM
      2   Preferred Extra Auto
           Bodily Injury          $ 250,000 / $ 500,000                    $ 92.00
           Property Damage        $ 100,000                                $ 43.00
           Medical Payments       $ 5,000                                  $ 13.00
           Uninsured Motorist     $ 250,000 / $ 500,000                    $ 62.00
           Undinsured Motor.-BI   $ 250,000 / $ 500,000                      INCL
           Collision                                       $ 500          $ 141.00
           Other than Collision                            $ 250           $ 92.00
                                                     TOTAL FOR UNIT   2   $ 443.00
     3-  Preferred Extra Auto
           Bodily Injury          $ 250,000 / $ 500,000                    $ 92.00
           Property Damage        $ 100,000                                $ 43.00
           Medical Payments       $ 5,000                                  $ 13.00
           Uninsured Motorist     $ 250,000 / $ 500,000                    $ 62.00
           Undinsured Motor. BI   $ 250,000 / $ 500,000                      INCL
           Collision                                       $ 500          $ 136.00
           Other than Collision                            $ 250           $ 90.00
                                                     TOTAL FOR UNIT   3   $ 436.00
  
  DRIVER  INFORMATION
  DR VEH  SEX MAR   BIRTH  G / S PRIN     DVR LIC NO.    NAME                  PTS

[0] https://i.imgur.com/sLWQoFG.jpeg

HN

Why LLMs still have problems with OCR

Why LLMs still have problems with OCR