OlmOCR: Open-source tool to extract plain text from PDFs
45 comments
· February 25, 2025
rahimnathwani
Good:
- no cloud service required, can run on local Nvidia GPU
- outputs a single stream of text with the correct reading order (even for multi column PDF)
- recognizes handwriting and stuff
Bad:
- doesn't seem to extract the text within diagrams (which I guess is fine because that text would be useless to an LLM)
OP is the demo page, which lets you OCR 10 pages.
The code needs an Nvidia GPU to run: https://github.com/allenai/olmocr
Not sure what the VRAM requirements are, because I haven't tried running it locally yet.
thelittleone
Text from diagrams can be useful to LLMs. For example, an LLM can understand a flow chart's decision-making shapes etc., but without the text it could misinterpret information. I process a bunch of PDFs, including procedures. Diagrams are converted to code. The text helps in many cases.
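(A minimal sketch of what such a diagram-to-code step could look like, using a generic vision model; the model name, prompt, and file names are illustrative, not necessarily this commenter's setup:)

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def diagram_to_mermaid(image_path: str) -> str:
    """Ask a vision model to transcribe a flowchart image as Mermaid code."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="gpt-4o",  # any vision-capable model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Transcribe this flowchart as Mermaid. Keep all node and edge labels verbatim."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content

print(diagram_to_mermaid("procedure_flowchart.png"))  # hypothetical file
```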
rahimnathwani
> Diagrams are converted to code
That's cool. May I ask what your pipeline looks like? And what code format do you use for diagrams? Mermaid?
chad1n
These "OCR" tools that are actually multimodal models are interesting because they can do more than just text extraction, but their biggest flaw is hallucination and their overall nondeterministic nature. Lately, I've been using Gemini to turn my notebooks into LaTeX documents, so I can see a pretty nice use case for this project, but it's not for "important" papers or papers that need 100% accuracy.
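(A minimal sketch of that notebook-to-LaTeX use case with the Gemini API; the model name, prompt, and file name are illustrative:)

```python
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_KEY")  # illustrative placeholder
model = genai.GenerativeModel("gemini-2.0-flash")

page = Image.open("notebook_page.jpg")  # a photo or scan of one notebook page
resp = model.generate_content(
    [page, "Transcribe this handwritten page into LaTeX. Use amsmath for equations."]
)
print(resp.text)
```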
thelittleone
How about building a tool that indexes OCR chunks/tokens with a confidence grading, sets a tolerance level, and defines actions for when a token or chunk(s) falls below that level? Actions could include automated verification using another model or, as a last resort, a human.
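(A minimal sketch of that triage idea in Python; the Chunk shape, tolerance semantics, and verify hook are all hypothetical:)

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Chunk:
    text: str
    confidence: float  # 0.0-1.0, however it was derived (see the reply below)

def triage(chunks: List[Chunk], tolerance: float,
           verify: Callable[[Chunk], float]) -> List[Chunk]:
    """Accept high-confidence chunks; re-verify the rest; queue failures for a human."""
    needs_human = []
    for chunk in chunks:
        if chunk.confidence >= tolerance:
            continue  # within tolerance: accept as-is
        chunk.confidence = verify(chunk)  # e.g. re-OCR with a second model
        if chunk.confidence < tolerance:
            needs_human.append(chunk)  # last resort: human review queue
    return needs_human
```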
Eisenstein
How would you calculate the confidence? LLMs are notoriously bad at grading their own output.
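(One common proxy is to use token log-probabilities from the model rather than asking it to grade itself. A sketch, assuming the OCR call goes through a chat-completions-style endpoint that exposes logprobs, as OpenAI's does:)

```python
import math
from openai import OpenAI

client = OpenAI()

resp = client.chat.completions.create(
    model="gpt-4o",                                  # illustrative model choice
    messages=[{"role": "user", "content": "..."}],   # the OCR prompt (and image) go here
    logprobs=True,
)

# Average token log-probability, exponentiated: a crude 0-1 confidence score.
lps = [t.logprob for t in resp.choices[0].logprobs.content]
confidence = math.exp(sum(lps) / len(lps))
```

This measures decoding uncertainty, not correctness, so it is a partial answer at best: a model can be confidently wrong.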
fschuett
Very impressive, it's the only AI vision toolkit so far that actually recognizes Latin and medieval scripts. I've been trying to somehow translate public-domain medieval books (including the artwork and original layout) to PDF, so they can be re-printed, i.e. pages like this: https://i.imgur.com/YLuF9sa.png - I tried a Google Vision + o1 solution, which did work to some extent, but not on the first try. This even recognizes the "E" of the artwork initial (or fixes it because of the context), which many OCR or AI solutions fail at.
The only thing I'd need now is a way to get the original font and artwork positions (would be a great addition to OlmOCR). Potentially I could work up a solution to create the font manually (as most medieval books are written in the same writing style), then find the shapes of the glyphs in the original image once I have the text, and then mask out the artwork with some OpenCV magic.
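(The OpenCV part of that plan could start with something as simple as separating large connected components, likely artwork, from small ones, likely glyphs. A sketch, with thresholds that would need tuning per scan:)

```python
import cv2
import numpy as np

# Binarize the scanned page (Otsu picks the threshold automatically).
page = cv2.imread("page.png", cv2.IMREAD_GRAYSCALE)
_, binary = cv2.threshold(page, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

# Connected components: very large blobs are likely artwork/initials,
# small ones are likely glyphs. The area cutoff is a guess; tune per scan.
n, labels, stats, _ = cv2.connectedComponentsWithStats(binary)
mask = np.zeros_like(binary)
for i in range(1, n):  # label 0 is the background
    if stats[i, cv2.CC_STAT_AREA] > 5000:
        mask[labels == i] = 255

# Paint the artwork out of the page, leaving (mostly) text.
text_only = cv2.inpaint(page, mask, inpaintRadius=3, flags=cv2.INPAINT_TELEA)
cv2.imwrite("text_only.png", text_only)
```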
yorwba
You might be interested in https://learnable-typewriter.github.io for extracting the glyph shapes once you have the OCR'd text.
constantinum
Tested it with the following documents:
* Loan application form: It picks up checkboxes and handwriting, but it missed a lot of form fields. Not sure why.
* Edsger W. Dijkstra's handwritten notes (from the University of Texas archive): Parsing is good.
* Badly (misaligned) scanned bill: Parsing is good. Observation: there is a name field, but it produced a synonymous name instead of the name in the bill. Hallucination?
* Investment fund factsheet: It could parse the bar charts and tables, but it whimsically excluded many vital data points from the document.
* Investment fund factsheet, complex tables: Bad extraction; it could not extract merged tables, and again whimsically eliminated rows and columns.
Anyone curious, try LLMWhisperer[1] for OCR. It doesn't use LLMs, so no hallucination side effects. It also preserves the layout of the input document for more context and clarity.
There's also Docling[2], which is handy for converting tables from PDFs into markdown. It uses Tesseract/EasyOCR under the hood, though, which can sometimes make the OCR results a bit less accurate.
[1] - https://pg.llmwhisperer.unstract.com/ [2] - https://github.com/DS4SD/docling
phren0logy
FYI, you can choose which OCR engine Docling uses (from a handful of predefined choices) - it doesn’t have to be Tesseract.
https://ds4sd.github.io/docling/reference/pipeline_options/#...
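(Per the linked reference at the time of writing, selecting the engine looks roughly like this; a sketch, so check the docs for current option names:)

```python
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions, TesseractOcrOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

# Swap TesseractOcrOptions for e.g. EasyOcrOptions to change engines.
opts = PdfPipelineOptions()
opts.do_ocr = True
opts.ocr_options = TesseractOcrOptions()

converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=opts)}
)
result = converter.convert("scanned.pdf")
print(result.document.export_to_markdown())  # tables come out as markdown
```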
simonw
I posted some notes on this here a couple of days ago: https://simonwillison.net/2025/Feb/26/olmocr/
brianjking
I'm using the GGUF in LM Studio, found here: https://huggingface.co/allenai/olmOCR-7B-0225-preview-GGUF
mayoosh
What is the cost of running this on a GPU?
mjnews
Deployed a quick demo of this at https://olmocr.im/ if anyone wants to test. Handles multi-column PDFs surprisingly well (finally!), though YMMV with handwritten text. Feedback welcome.
TZubiri
It's amazing how many of these solutions exist.
Such a hard problem that we create for ourselves.
zitterbewegung
Would like to know how this compares to https://github.com/tesseract-ocr/tesseract
rahimnathwani
Tesseract is multilingual.
Tesseract extracts all text from doc, without trying to fix reading order.
Tesseract runs in many more places, as it doesn't require a GPU.
Tesseract's pure text output tends to include a lot of extra bits, e.g. text that appears in diagrams. Good as a starting point and fine for most downstream tasks.
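(For comparison, a minimal Tesseract run via pytesseract; unlike olmocr it needs no GPU, and it can also report per-word confidence scores. File name is a placeholder:)

```python
from PIL import Image
import pytesseract

img = Image.open("scan.png")

# Plain text, in whatever order Tesseract's layout analysis finds;
# diagram labels and other stray text are included.
text = pytesseract.image_to_string(img)

# Per-word confidence scores via the TSV interface.
data = pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)
words = [(w, c) for w, c in zip(data["text"], data["conf"]) if w.strip()]
```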
maleldil
I haven't checked OlmOCR, but in my experience, Tesseract is awful for scientific papers. The structure is mangled, formulas are completely rubbish, tables are nearly useless, etc.
I also tried Docling (which I believe is LLM-based), which works fine, but the references section of the paper was too noisy, and Gemini 2.0 Flash was okay but too slow for a large number of PDFs[1].
I settled for downloading the LaTeX code from arXiv and using pandoc to parse that. I also needed to process citations, which was easy using pandoc's support for BibTeX to CSL JSON.
[1] Because of the number of output tokens, I had to split the PDF into pages and individually convert each one. Sometimes, the API would take too long to respond, making the overall system quite slow.
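(That pandoc step can be scripted; a sketch driving pandoc from Python. File names are placeholders; pandoc 2.11+ reads BibTeX and emits CSL JSON directly:)

```python
import subprocess

# arXiv LaTeX source -> markdown (pandoc must be installed).
subprocess.run(["pandoc", "paper.tex", "-f", "latex",
                "-t", "markdown", "-o", "paper.md"], check=True)

# BibTeX bibliography -> CSL JSON, for citation processing.
subprocess.run(["pandoc", "refs.bib", "-f", "bibtex",
                "-t", "csljson", "-o", "refs.json"], check=True)
```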
jesuslop
and mathpix
rahimnathwani
Wow. The Mathpix mobile app has support for reading two-column PDFs as a single column.
You can't run it locally, though, right?
kergonath
> The Mathpix mobile app has support for reading two-column PDFs as a single column.
Mathpix is what gave the best results when I tried a whole bunch of OCR solutions on technical PDFs (multi-column with diagrams, figures and equations). It is brilliant.
> You can't run it locally, though, right?
Unfortunately, no. Which is a shame because I also have confidential documents to OCR and there is no way I put them on someone else’s cloud.
Krasnol
Make it an .exe file and storm the world's offices.
I'm a fan of the team of Allen AI and their work. Unfortunately, the benchmarking of olmocr against marker (https://github.com/VikParuchuri/marker) is quite flawed.
Throughput - they benchmarked marker API cost vs local inference cost for olmocr. In our testing, marker locally gets 20-120 pages per second on an H100 (without custom kernels, etc). Olmocr in our testing gets between 0.4 (unoptimized) and 4 (sglang) pages per second on the same machine.
Accuracy - their quality benchmarks are based on win rate with only 75 samples - which are different between each tool pair. The samples were filtered down from a set of ~2000 based on opaque criteria. They then asked researchers at Allen AI to judge which output was better. When we benchmarked with our existing set and LLM as a judge, we got a 56% win rate for marker across 1,107 documents. We had to filter out non-English docs, since olmocr is English-only (marker is not).
Hallucinations/other problems - we noticed a lot of missing text and hallucinations with olmocr in our benchmark set. You can see sample output and llm ratings here - https://huggingface.co/datasets/datalab-to/marker_benchmark_... .
You can see all benchmark code at https://github.com/VikParuchuri/marker/tree/master/benchmark... .
Happy to chat more with anyone at Allen AI who wants to discuss this. I think olmocr is a great contribution - happy to help you benchmark marker more fairly.