OlmOCR: Open-source tool to extract plain text from PDFs

rahimnathwani

Good:

- no cloud service required, can run on local Nvidia GPU

- outputs a single stream of text with the correct reading order (even for multi-column PDFs)

- recognizes handwriting and stuff

Bad:

- doesn't seem to extract the text within diagrams (which I guess is fine because that text would be useless to an LLM)

OP is the demo page, which lets you OCR 10 pages.

The code needs an Nvidia GPU to run: https://github.com/allenai/olmocr

Not sure of the VRAM requirements because I haven't tried running it locally yet.

TZubiri

It's amazing how many of these solutions exist.

Such a hard problem that we create for ourselves.

zitterbewegung

Would like to know how this compares to https://github.com/tesseract-ocr/tesseract

rahimnathwani

Tesseract is multilingual.

Tesseract extracts all text from the document, without trying to fix reading order.

Tesseract runs in many more places, as it doesn't require a GPU.

Tesseract's pure text output tends to have a lot of extra bits, e.g. bits of text that appear in diagrams. Good as a starting point and fine for most downstream tasks.
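To make the reading-order point concrete, here is a toy sketch of why multi-column pages come out scrambled when text blocks are read strictly top to bottom. It is not how Tesseract or olmOCR work internally. The `Box` type, the fixed equal-width column split, and the sample coordinates are all assumptions made up for illustration.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """A hypothetical text block: top-left corner plus its text."""
    x: float  # left edge
    y: float  # top edge (larger means lower on the page)
    text: str


def naive_order(boxes):
    """Read strictly top-to-bottom, then left-to-right."""
    return [b.text for b in sorted(boxes, key=lambda b: (b.y, b.x))]


def column_order(boxes, page_width, columns=2):
    """Assign each box to an equal-width column, then read column by column."""
    col_width = page_width / columns

    def key(b):
        col = min(int(b.x // col_width), columns - 1)
        return (col, b.y, b.x)

    return [b.text for b in sorted(boxes, key=key)]


# Two columns, two lines each (made-up coordinates).
page = [
    Box(50, 100, "L1"), Box(320, 100, "R1"),
    Box(50, 120, "L2"), Box(320, 120, "R2"),
]

print(naive_order(page))                   # ['L1', 'R1', 'L2', 'R2']
print(column_order(page, page_width=600))  # ['L1', 'L2', 'R1', 'R2']
```

The naive ordering interleaves the two columns line by line, which is the kind of output that needs cleanup downstream. Real pages also have headers, figures, and uneven columns, so a fixed split like this would not be enough in practice.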

jesuslop

and mathpix

rahimnathwani

Wow. The Mathpix mobile app has support for reading two column PDFs as a single column.

You can't run it locally, though, right?

kergonath

> The Mathpix mobile app has support for reading two column PDFs as a single column.

Mathpix is what gave the best results when I tried a whole bunch of OCR solutions on technical PDFs (multi-column with diagrams, figures and equations). It is brilliant.

> You can't run it locally, though, right?

Unfortunately, no. Which is a shame because I also have confidential documents to OCR and there is no way I'm putting them on someone else’s cloud.

xz18r

Why exactly does this need to be AI? OCR was a thing way before the boom and usually works pretty well. Seems like overkill.

rahimnathwani

Look at pages 18-20 of the technical report. I don't know of any non-AI OCR that can do as good a job as that.