OlmOCR: Open-source tool to extract plain text from PDFs
9 comments
February 25, 2025
TZubiri
It's amazing how many of these solutions exist.
Such a hard problem that we create for ourselves.
zitterbewegung
Would like to know how this compares to https://github.com/tesseract-ocr/tesseract
rahimnathwani
Tesseract is multilingual.
Tesseract extracts all text from the document, without trying to fix reading order.
Tesseract runs in many more places, as it doesn't require a GPU.
Tesseract's pure text output tends to have a lot of extra bits, e.g. bits of text that appear in diagrams. Good as a starting point and fine for most downstream tasks.
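For anyone who wants to reproduce the Tesseract side of this comparison, here is a minimal sketch of calling the `tesseract` command-line tool from Python. It assumes the `tesseract` binary is installed and on PATH and that the PDF page has already been rendered to an image. The helper names and the `page.png` file name are hypothetical.

```python
import shutil
import subprocess


def build_tesseract_cmd(image_path, lang="eng"):
    """Build the CLI call `tesseract <image> stdout -l <lang>`.

    Passing `stdout` as the output base makes Tesseract print the
    recognized text instead of writing a .txt file.
    """
    return ["tesseract", image_path, "stdout", "-l", lang]


def ocr_image(image_path, lang="eng"):
    """Run Tesseract on one image and return its plain-text output."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract binary not found on PATH")
    result = subprocess.run(
        build_tesseract_cmd(image_path, lang),
        capture_output=True,  # collect stdout/stderr instead of printing
        text=True,            # decode bytes to str
        check=True,           # raise if Tesseract exits with an error
    )
    return result.stdout
```

Usage would be something like `text = ocr_image("page.png")`. The output is the raw text stream, including any labels picked up from diagrams, so reading order and cleanup are left to whatever runs downstream.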
jesuslop
and mathpix
rahimnathwani
Wow. The Mathpix mobile app has support for reading two column PDFs as a single column.
You can't run it locally, though, right?
kergonath
> The Mathpix mobile app has support for reading two column PDFs as a single column.
Mathpix is what gave the best results when I tried a whole bunch of OCR solutions on technical PDFs (multi-column with diagrams, figures and equations). It is brilliant.
> You can't run it locally, though, right?
Unfortunately, no. Which is a shame because I also have confidential documents to OCR and there is no way I would put them on someone else’s cloud.
xz18r
Why exactly does this need to be AI? OCR was a thing way before the boom and usually works pretty well. Seems like overkill.
rahimnathwani
Look at pages 18-20 of the technical report. I don't know of any non-AI OCR that can do as good a job as that.
Good:
- no cloud service required, can run on local Nvidia GPU
- outputs a single stream of text with the correct reading order (even for multi-column PDFs)
- recognizes handwriting and stuff
Bad:
- doesn't seem to extract the text within diagrams (which I guess is fine because that text would be useless to an LLM)
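To make "correct reading order" concrete, here is a toy sketch of the problem for a two-column page. It is not a description of how olmOCR does it. The function name, the `(x, y, text)` box format and the split at half the page width are all assumptions made for illustration.

```python
def two_column_reading_order(boxes, page_width):
    """Order text boxes left column first, then right, top to bottom.

    boxes: list of (x, y, text), where (x, y) is the top-left corner
    and y increases downward. A box is assigned to the left column
    when its x is less than half the page width (a naive assumption).
    """
    mid = page_width / 2
    left = sorted((b for b in boxes if b[0] < mid), key=lambda b: b[1])
    right = sorted((b for b in boxes if b[0] >= mid), key=lambda b: b[1])
    return [text for _, _, text in left + right]
```

A plain top-to-bottom sort would interleave the two columns line by line, which is the failure mode that makes naive extraction hard to read. Real pages with full-width headings, figures and footnotes break this heuristic quickly.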
OP is the demo page, which lets you OCR 10 pages.
The code needs an Nvidia GPU to run: https://github.com/allenai/olmocr
Not sure of the VRAM requirements because I haven't tried running it locally yet.
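Before trying it locally, it is easy to check whether a machine has an Nvidia GPU at all and how much VRAM it reports. The sketch below shells out to `nvidia-smi`, which ships with the Nvidia driver. The helper names are hypothetical, and it says nothing about how much memory olmOCR actually needs.

```python
import shutil
import subprocess

# Ask nvidia-smi for one CSV line per GPU: "<name>, <total memory in MiB>"
QUERY = [
    "nvidia-smi",
    "--query-gpu=name,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_lines(output):
    """Parse lines like 'NVIDIA GeForce RTX 3090, 24576' into (name, MiB)."""
    gpus = []
    for line in output.strip().splitlines():
        if not line.strip():
            continue
        # Split on the last comma in case the GPU name contains one.
        name, mem = line.rsplit(",", 1)
        try:
            gpus.append((name.strip(), int(mem.strip())))
        except ValueError:
            # Skip GPUs whose memory is reported as a non-numeric value.
            continue
    return gpus


def list_gpus():
    """Return [(name, total_memory_mib), ...], or [] if nvidia-smi is absent."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_lines(result.stdout)
```

An empty list from `list_gpus()` means no `nvidia-smi` on PATH, which in practice usually means no usable Nvidia driver.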