
Ingesting PDFs and why Gemini 2.0 changes everything

lazypenguin

I work in fintech, and we replaced an OCR vendor with Gemini at work for ingesting some PDFs. After trial and error with different models, Gemini won because it was so darn easy to use and it worked with minimal effort. One shouldn't underestimate how much a multi-modal, large-context-window model buys you in ease of use. Ironically, this vendor is the best known and most successful at OCR'ing this specific type of PDF, yet many of our requests still failed over to their human-in-the-loop process. Even though document OCR isn't Gemini's specialization, switching was a no-brainer after our testing. Processing time went from something like 12 minutes on average to 6 seconds, accuracy was about 96% of the vendor's, and the price was significantly cheaper. A lot of the 4% inaccuracies are things like handwritten "LLC" getting OCR'd as "IIC", which I would say is somewhat "fair". We could probably improve our prompt to clean up this data even further. Our prompt is currently very simple: "OCR this PDF into this format as specified by this json schema", and it didn't require some fancy "prompt engineering" to contort out a result.

The Gemini developer experience was stupidly easy. Easy to add a file "part" to a prompt. Easy to focus on the main problem thanks to the weirdly large context window. Multi-modal, so it handles a lot of issues for you (image-only PDF vs. PDF with text data), etc. I can recommend it for the use case presented in this blog (ignoring the bounding boxes part)!
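For illustration, a minimal sketch of the kind of call involved, using the google-generativeai Python SDK (file name, schema, and model string are placeholders, not our actual setup):

```python
# Sketch only: placeholder file and schema, with the schema embedded in the prompt as described above.
import json
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-2.0-flash")

# Upload the PDF once, then pass it as a file "part" alongside the prompt.
pdf_file = genai.upload_file("statement.pdf")

schema = {
    "type": "object",
    "properties": {
        "company_name": {"type": "string"},
        "line_items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "description": {"type": "string"},
                    "amount": {"type": "number"},
                },
            },
        },
    },
}

response = model.generate_content(
    [pdf_file, "OCR this PDF into this format as specified by this json schema:\n"
     + json.dumps(schema)],
    generation_config={"response_mime_type": "application/json"},
)
print(response.text)  # JSON string that should follow the schema
```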

xnx

Your OCR vendor would be smart to replace their own system with Gemini.

panarky

This is a big aha moment for me.

If Gemini can do semantic chunking at the same time as extraction, all for so cheap and with nearly perfect accuracy, and without brittle prompting incantation magic, this is huge.

fallinditch

If I used Gemini 2.0 for extraction and chunking to feed into a RAG that I maintain on my local network, then what sort of locally-hosted LLM would I need to gain meaningful insights from my knowledge base? Would a 13B parameter model be sufficient?

potatoman22

Small point but is it doing semantic chunking, or loading the entire pdf into context? I've heard mixed results on semantic chunking.

panarky

It loads the entire PDF into context, but then it would be my job to chunk the output for RAG, and just doing arbitrary fixed-size blocks, or breaking on sentences or paragraphs is not ideal.

So I can ask Gemini to return chunks of variable size, where each chunk is one complete idea or concept, without arbitrarily chopping a logical semantic segment into multiple chunks.
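Something like this, roughly (a sketch; the prompt wording and schema are just illustrative):

```python
# Sketch: ask the model for variable-size, self-contained chunks as JSON.
import json
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-2.0-flash")

prompt = (
    "Split the attached document into chunks for retrieval. Each chunk must be "
    "one complete idea or concept; never cut a logical section in half. "
    'Return JSON in the form {"chunks": [{"heading": "...", "text": "..."}]}.'
)
doc = genai.upload_file("report.pdf")
response = model.generate_content(
    [doc, prompt],
    generation_config={"response_mime_type": "application/json"},
)
chunks = json.loads(response.text)["chunks"]
```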

faxmeyourcode

I've been fighting trying to chunk SEC filings properly, specifically surrounding the strange and inconsistent tabular formats present in company filings.

This is giving me hope that it's possible.

otoburb

>>I've been fighting trying to chunk SEC filings properly, specifically surrounding the strange and inconsistent tabular formats present in company filings.

For this specific use case you can also try edgartools[1], a relatively recently released library that ingests SEC submissions and filings. It doesn't use OCR but (from what I can tell) directly parses the XBRL documents submitted by companies and stored in EDGAR, when they exist.

[1] https://github.com/dgunning/edgartools
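From memory, basic usage looks something like this (treat the exact call names as assumptions and check the repo):

```python
# Rough sketch of edgartools usage; not verified against the current API.
from edgar import Company, set_identity

set_identity("Jane Doe jane.doe@example.com")  # SEC EDGAR requires a contact identity

company = Company("AAPL")
filings = company.get_filings(form="10-K")  # pulls the structured filing index, no OCR
latest = filings.latest()
print(latest)
```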

anirudhb99

(from the gemini team) we're working on it! semantic chunking & extraction will definitely be possible in the coming months.

barrenko

If you'd kindly tl;dr the chunking strategies you have tried and what works best, I'd love to hear.

jgalt212

isn't everyone on iXBRL now? Or are you struggling with historical filings?

bionhoward

The Gemini API has a customer noncompete, so it's not an option for AI work. What are you working on that doesn't compete with AI?

novaleaf

what doesn't compete with ai?

B-Con

You do realize most people aren't working on AI, right?

Also, OP mentioned fintech at the outset.

yzydserd

How do today's LLMs like Gemini compare with the document-understanding services Google/AWS/Azure have offered for a few years, particularly when dealing with known forms? I think Google's is Document AI.

zacmps

I've found the highest accuracy solution is to OCR with one of the dedicated models then feed that text and the original image into an LLM with a prompt like:

"Correct errors in this OCR transcription".

bradfox2

This is what we do today. Have you tried it against Gemini 2.0?

therein

How does it behave if the body of text is offensive, or if it is talking about a recipe to purify UF-6 gas at home? Will it stop doing what it is doing and enter lecturing mode?

I am asking not to be cynical, but because in my limited experience, using LLMs for any task that may operate on offensive or unknown input gets them triggered into all sorts of unpredictable moral judgements and dragged into generating not the output I wanted, at all.

If I ask this black box to give me JSON output containing keywords for a certain text, and the text happens to be offensive, it refuses to do it.

How does one tackle that?

depr

So are you mostly processing PDFs with data? Or PDFs with just text, or images, graphs?

thelittleone

Not the parent, but we process PDFs with text, tables, diagrams. Works well if the schema is properly defined.

sensecall

Out of interest, did you parse into any sort of defined schema/structure?

gnat

Parent literally said so …

> Our prompt is currently very simple: "OCR this PDF into this format as specified by this json schema" and didn't require some fancy "prompt engineering" to contort out a result.

cess11

What hardware are you using to run it?

kccqzy

The Gemini model isn't open so it does not matter what hardware you have. You might have confused Gemini with Gemma.

rudolph9

We parse millions of PDFs using Apache Tika and process about 30,000 per dollar of compute cost. However, the structured output leaves something to be desired, and there are a significant number of pages that Tika is unable to parse.

https://tika.apache.org/
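For reference, the basic tika-python call is roughly this (a sketch; the actual pipeline around it is obviously more involved):

```python
# Sketch using the tika-python client, which calls out to a local Tika server/JAR.
from tika import parser

parsed = parser.from_file("document.pdf")
print(parsed["metadata"].get("Content-Type"))
print((parsed["content"] or "")[:500])  # plain text; "content" can be None on parse failure
```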

rjurney

I've been using NotebookLM powered by Gemini 2.0 for three projects and it is _really powerful_ for comprehending large corpuses you can't possibly read and thinking informed by all your sources. It has solid Q&A. When you ask a question or get a summary you like [which often happens] you can save it as a new note, putting it into the corpus for analysis. In this way your conclusions snowball. Yes, this experience actually happens and it is beautiful.

I've tried Adobe Acrobat AI for this and it doesn't work yet. NotebookLM is it. The grounding is the reason it works - you can easily click on anything and it will take you to the source to verify it. My only gripe is that the visual display of the source material is _dogshit ugly_, like exceptionally so. Big blog pink background letters in lines of 24 characters! :) It has trouble displaying PDF columns, but at least it parses them. The ugly will change I'm sure :)

My projects are set up to let me bridge the gaps between the various sources and synthesize something more. It helps to have a goal and organize your sources around that. If you aren't focused, it gets confused. You lay the groundwork in sources and it helps you reason. It works so well I feel _tender_ towards it :) Survey papers provide background, then you add specific sources in your area of focus. You can write a profile for how you would like NotebookLM to think - which REALLY helps out.

They are:

* The Stratigrapher - A Lovecraftian short story about the world's first city. Sources: all of Seton Lloyd/Faud Safar's work on Eridu; various sources on Sumerian culture and religion; all of Lovecraft's work and letters; various sources about opium; some articles about nonlinear geometries.

* FPGA Accelerated Graph Analytics - Sources: an introduction to Verilog; papers on FPGAs and graph analytics; papers on Apache Spark architecture; papers on GraphFrames and a related rant I created about it and graph DBs; a source on Spark-RAPIDS; papers on subgraph matching, graphlets, and network motifs; papers on random graph models.

* Graph machine learning - a notebook without a specific goal, which has been less successful. It helps to have a goal for the project; it got confused by how broad my sources were.

I would LOVE to share my projects with you all, but you can only share within a Google Workspaces domain. It will be AWESOME when they open this thing up :)

llm_trw

This is using exactly the wrong tools at every stage of the OCR pipeline, and the cost is astronomical as a result.

You don't use multimodal models to extract a wall of text from an image. They hallucinate constantly the second you get past perfect 100% high-fidelity images.

You use an object detection model trained on documents to find the bounding boxes of each document section as _images_; each bounding box comes with a confidence score for free.

You then feed each box of text to a regular OCR model, which also gives you a confidence score along with each prediction it makes.

You feed each image box into a multimodal model to describe what the image is about.

For tables, use a specialist model that does nothing but extract tables—models like GridFormer that aren't hyped to hell and back.

You then stitch everything together in an XML file because Markdown is for human consumption.

You now have everything extracted with flat XML markup for each category the object detection model knows about, along with multiple types of probability metadata for each bounding box, each letter, and each table cell.

You can now start feeding this data programmatically into an LLM to do _text_ processing, where you use the XML to control what parts of the document you send to the LLM.

You then get chunking with location data and confidence scores for every part of the document to store as metadata in the RAG store.

I've built a system that reads 500k pages _per day_ using the above, running completely locally on a machine that cost $20k.
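A skeleton of that kind of pipeline, for the curious (my sketch of the approach described; detect_regions() is a placeholder since no specific layout model is named, and Tesseract stands in for the OCR stage):

```python
# Skeleton of the described pipeline: layout detection -> per-region OCR with
# confidences -> flat XML with bounding boxes and probability metadata.
import xml.etree.ElementTree as ET
import pytesseract
from PIL import Image

def detect_regions(page):
    """Placeholder for a document-layout object detector.
    Should return [(label, confidence, (left, top, right, bottom)), ...]."""
    raise NotImplementedError

def ocr_region(region_image):
    data = pytesseract.image_to_data(region_image, output_type=pytesseract.Output.DICT)
    words = [(w, float(c)) for w, c in zip(data["text"], data["conf"])
             if w.strip() and float(c) >= 0]
    text = " ".join(w for w, _ in words)
    avg_conf = sum(c for _, c in words) / len(words) if words else 0.0
    return text, avg_conf

def page_to_xml(path):
    page = Image.open(path)
    root = ET.Element("page", source=path)
    for label, det_conf, box in detect_regions(page):
        el = ET.SubElement(root, label,
                           det_conf=f"{det_conf:.3f}",
                           bbox=",".join(str(v) for v in box))
        if label in ("paragraph", "title", "header"):
            text, ocr_conf = ocr_region(page.crop(box))
            el.text = text
            el.set("ocr_conf", f"{ocr_conf:.1f}")
        # table and figure regions would go to a table model / multimodal
        # captioner instead, each adding its own confidence attributes
    return ET.tostring(root, encoding="unicode")
```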

ck_one

What object detection model do you use?

kbyatnal

It's clear that OCR & document parsing are going to be swallowed up by these multimodal models. The best representation of a document at the end of the day is an image.

I founded a doc processing company [1] and in our experience, a lot of the difficulty w/ deploying document processing into production is when accuracy requirements are high (> 97%). This is because OCR and parsing is only one part of the problem, and real world use cases need to bridge the gap between raw outputs and production-ready data.

This requires things like:

- state-of-the-art parsing powered by VLMs and OCR

- multi-step extraction powered by semantic chunking, bounding boxes, and citations

- processing modes for document parsing, classification, extraction, and splitting (e.g. long documents, or multi-document packages)

- tooling that lets nontechnical members quickly iterate, review results, and improve accuracy

- evaluation and benchmarking tools

- fine-tuning pipelines that turn reviewed corrections into custom models

Very excited to test and benchmark Gemini 2.0 in our product, and excited about the progress here.

[1] https://extend.app/

anirudhb99

thanks a ton for all the amazing feedback on this thread! if

(a) you have document understanding use cases that you'd like to use gemini for (the more aspirational the better) and/or

(b) there are loss cases for which gemini doesn't work well today,

please feel free to email anirudhbaddepu@google.com and we'd love to help get your use case working & improve quality for our next series of model updates!

jbarrow

> Unfortunately Gemini really seems to struggle on this, and no matter how we tried prompting it, it would generate wildly inaccurate bounding boxes

Qwen2.5 VL was trained on a special HTML format for doing OCR with bounding boxes. [1] The resulting boxes aren't quite as accurate as something like Textract/Surya, but I've found they're much more accurate than Gemini or any other LLM.

[1] https://qwenlm.github.io/blog/qwen2.5-vl/

ThinkBeat

Hmm, I have been doing a bit of this manually lately for a personal project. I am working on some old books that are far past any copyright, but they are not available anywhere on the net. (Being in Norwegian makes a book a lot more obscure.) So I have been working on creating ebooks out of them.

I have a scanner, and some OCR processes I run things through. I am close to 85% from my automatic process.

The pain of going from 85% to 99% though is considerable. (and in my case manual) (well Perl helps)

I went to try this AI on one of the short poem manuscripts I have.

I told the prompt I wanted PDF to Markdown; it said sure, go ahead, give me the PDF. I uploaded it. It spent a long time spinning, then a quick message comes up, something like

"Failed to count tokens"

but it just flashes and goes away.

I guess the PDF is too big? Weird though, it's not a lot of pages.

oedemis

There is also https://ds4sd.github.io/docling/ from IBM Research, which is MIT-licensed and tracks bounding boxes in a rich JSON format.
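The docling quickstart is roughly this (a sketch from memory of the docs; method names may differ between releases):

```python
# Sketch of docling usage; check the docs for the current API.
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("report.pdf")

print(result.document.export_to_markdown())   # human-readable view
doc_dict = result.document.export_to_dict()   # rich JSON incl. layout/provenance
```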

beklein

Great article, I couldn't find any details about the prompt... only the snippets of the `CHUNKING_PROMPT` and the `GET_NODE_BOUNDING_BOXES_PROMPT`.

Is there any code example with a full prompt available from OP, or are there any references (such as similar GitHub repos) for those looking to get started on this topic?

Your insights would be highly appreciated.

sergiotapia

The article mentions OCR, but you're sending a PDF, so how is that OCR? Or is this a mistake? What if you send photos of the pages instead? That would be true OCR. Does the performance and price remain the same?

If so this unlocks a massive workflow for us.

__jl__

The numbers in the blog post seem VERY inaccurate.

Quick calculation:

Input pricing: image input in 2.0 Flash is $0.0001935. Let's ignore the prompt.

Output pricing: let's assume 500 tokens per page, which is $0.0003.

Cost per page: $0.0004935

That means 2,026 pages per dollar. Not 6,000!

Might still be cheaper than many solutions but I don't see where these numbers are coming from.

By the way, image input is much more expensive in Gemini 2.0 even for 2.0 Flash Lite.

Edit: The post says batch pricing, which would be 4k pages based on my calculation. Using batch pricing is pretty different though. Great if feasible but not practical in many contexts.
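For what it's worth, the arithmetic in one place (figures as quoted above, not a current rate card):

```python
# Figures from the comment above, not current pricing.
input_per_page = 0.0001935   # image input, Gemini 2.0 Flash
output_per_page = 0.0003     # assuming ~500 output tokens per page

cost_per_page = input_per_page + output_per_page   # 0.0004935
print(round(1 / cost_per_page))        # ~2026 pages per dollar
print(round(1 / (cost_per_page / 2)))  # ~4k per dollar, assuming batch is half price
```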

serjester

Correct, it's batch pricing on Vertex, with slightly lower output tokens per page since a lot of pages are somewhat empty in real-world docs; I wanted a fair comparison to providers that charge per page.

Regardless of what assumptions you use, it's still an order-of-magnitude-plus improvement over anything else.