
Ingesting PDFs and why Gemini 2.0 changes everything

lazypenguin

I work in fintech and we replaced an OCR vendor with Gemini at work for ingesting some PDFs. After trial and error with different models Gemini won because it was so darn easy to use and it worked with minimal effort. I think one shouldn't underestimate how easy a multi-modal, large-context-window model is to use. Ironically this vendor is the best known and most successful vendor for OCR'ing this specific type of PDF, but many of our requests failed over to their human-in-the-loop process. Despite it not being their specialization, switching to Gemini was a no-brainer after our testing. Processing time went from something like 12 minutes on average to 6s on average, accuracy was like 96% of that of the vendor, and price was significantly cheaper. A lot of the 4% inaccuracies are things like a handwritten "LLC" getting OCR'd as "IIC", which I would say is somewhat "fair". We probably could improve our prompt to clean up this data even further. Our prompt is currently very simple: "OCR this PDF into this format as specified by this json schema" and didn't require some fancy "prompt engineering" to contort out a result.

The Gemini developer experience was stupidly easy. Easy to add a file "part" to a prompt. Easy to focus on the main problem thanks to the weirdly large context window. Multi-modal, so it handles a lot of issues for you (PDF image vs. PDF with data), etc. I can recommend it for the use case presented in this blog (ignoring the bounding boxes part)!
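
For concreteness, a minimal sketch of that kind of call using the google-generativeai Python SDK (the schema is made up for illustration and exact parameter names vary by SDK version):

    import json
    import google.generativeai as genai

    genai.configure(api_key="YOUR_API_KEY")

    # Illustrative schema only -- the commenter doesn't show theirs.
    schema = {
        "type": "object",
        "properties": {
            "company_name": {"type": "string"},
            "document_date": {"type": "string"},
            "line_items": {"type": "array", "items": {"type": "string"}},
        },
    }

    pdf_file = genai.upload_file("statement.pdf")   # the file "part"
    model = genai.GenerativeModel("gemini-2.0-flash")
    response = model.generate_content(
        [pdf_file,
         "OCR this PDF into the format specified by this JSON schema:\n"
         + json.dumps(schema)],
        generation_config={"response_mime_type": "application/json"},
    )
    data = json.loads(response.text)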

kbyatnal

This is spot on: any legacy vendor focusing on a specific type of PDF is going to get obliterated by LLMs. The problem with using an off-the-shelf provider like this is that you get stuck with their data schema. With an LLM, you have full control over the schema, meaning you can parse and extract much more unique data.

The problem then shifts from "can we extract this data from the PDF" to "how do we teach an LLM to extract the data we need, validate its performance, and deploy it with confidence into prod?"

You could improve your accuracy further by adding some chain-of-thought to your prompt btw. e.g. Make each field in your json schema have a `reasoning` field beforehand so the model can CoT how it got to its answer. If you want to take it to the next level, `citations` in our experience also improves performance (and when combined with bounding boxes, is powerful for human-in-the-loop tooling).
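
As a sketch of what that can look like (field names are illustrative, not any particular product's schema):

    # Per the advice above: give each extracted field a "reasoning" slot the
    # model fills before the value, plus optional citations. Field names are
    # illustrative only.
    field_with_cot = {
        "type": "object",
        "properties": {
            "reasoning": {
                "type": "string",
                "description": "Before answering, explain where in the document "
                               "this value appears and why it is the right one.",
            },
            "citations": {
                "type": "array",
                "items": {"type": "string"},
                "description": "Verbatim snippets from the document that support the value.",
            },
            "value": {"type": "string"},
        },
        "required": ["reasoning", "value"],
    }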

Disclaimer: I started an LLM doc processing infra company (https://extend.app/)

TeMPOraL

> The problem then shifts from "can we extract this data from the PDF" to "how do we teach an LLM to extract the data we need, validate its performance, and deploy it with confidence into prod?"

A smart vendor will shift into that space - they'll use that LLM themselves, and figure out some combination of finetunes, multiple LLMs, classical methods and human verification of random samples, that lets them not only "validate its performance, and deploy it with confidence into prod", but also sell that confidence with an SLA on top of it.

wraptile

That's what we did with our web scraping SaaS - with our Extraction API¹ we shifted web-scraped data parsing to support both predefined models for common objects like products, reviews, etc. and direct LLM prompts that we further optimize for flexible extraction.

There's definitely space here to help the customer realize their extraction vision because it's still hard to scale this effectively on your own!

1 - https://scrapfly.io/extraction-api

quantumPilot

What's the value for a customer to pay a vendor that is only a wrapper around an LLM when they can leverage LLMs directly? I imagine tools being accessible for certain types of users, but for customers like those described here, you're better off replacing any OCR vendor with your own LLM integration

sitkack

Software is dead, if it isn't a prompt now, it will be a prompt in 6 months.

Most of what we think software is today, will just be a UI. But UIs are also dead.

Cumpiler69

>A smart vendor will shift into that space - they'll use that LLM themselves

It's a bit late to start shifting now since it takes time. Ideally they should already have a product on the market.

pmarreck

I have some out-of-print books that I want to convert into nice pdf's/epubs (like, reference-quality)

1) I don't mind destroying the binding to get the best quality. Any idea how I do so?

2) I have a multipage double-sided scanner (fujitsu scansnap). would this be sufficient to do the scan portion?

3) Is there anything that determines the font of the book text and reproduces that somehow? and that deals with things like bold and italic and applies that either as markdown output or what have you?

4) how do you de-paginate the raw text to reflow into (say) an epub or pdf format that will paginate based on the output device (page size/layout) specification?

raghavsb

Great, I landed on the reasoning and citations bit through trial and error and the outputs improved for sure.

MajorData

How did you add bounding boxes, especially if it is a variety of files?

bitdribble

In my open source tool http://docrouter.ai I run both OCR and LLM/Gemini, using litellm to support multiple LLMs. The user can configure extraction schema & prompts, and use tags to select which prompt/llm combination runs on which uploaded PDF.

LLM extractions are searched in OCR output, and if matched, the bounding box is displayed based on OCR output.
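
A rough sketch of that matching step (not the docrouter.ai code; the OCR word structure and the threshold are assumptions):

    # OCR gives words with boxes; an LLM extraction is located by fuzzy-matching
    # it against the OCR text, and the matched words' boxes are merged for display.
    from difflib import SequenceMatcher

    def find_bbox(extracted_value, ocr_words):
        """ocr_words: list of {"text": str, "bbox": (x0, y0, x1, y1)} from the OCR step."""
        target = " ".join(extracted_value.lower().split())
        n = max(1, len(target.split()))
        best, best_score = None, 0.0
        for i in range(len(ocr_words) - n + 1):
            window = ocr_words[i:i + n]
            window_text = " ".join(w["text"].lower() for w in window)
            score = SequenceMatcher(None, target, window_text).ratio()
            if score > best_score:
                best, best_score = window, score
        if best is None or best_score < 0.8:    # threshold is an arbitrary choice
            return None                         # no confident match -> no box drawn
        xs0, ys0, xs1, ys1 = zip(*(w["bbox"] for w in best))
        return (min(xs0), min(ys0), max(xs1), max(ys1))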

Demo: app.github.ai (just register an account and try) Github: https://github.com/analytiq-hub/doc-router

Reach out to me at andrei@analytiqhub.com for questions. Am looking for feedback and collaborators.

montecruiseto

So why should I still use Extend instead of Gemini?

panta

How do you handle the privacy of the scanned documents?

bitdribble

docrouter.ai can be installed on-prem. If using the SaaS version, users can collaborate in separate workspaces, modeled on how Databricks supports workspaces. The back-end DB is Mongo, which keeps things simple.

One level of privacy is the workspace level separation in Mongo. But, if there is customer interest, other setups are possible. E.g. the way Databricks handles privacy is by actually giving each account its own back end services - and scoping workspaces within an account.

That is a good possible model.

kbyatnal

We work with fortune 500s in sensitive industries (healthcare, fintech, etc). Our policies are:

- data is never shared between customers

- data never gets used for training

- we also configure data retention policies to auto-purge after a time period

makeitdouble

> After trial and error with different models

As a mere occasional customer I've been scanning 4 to 5 pages of the same document layout every week in gemini for half a year, and every single week the results were slightly different.

To note, the docs are bilingual so that could affect the results, but what struck me is the lack of consistency: even with the same model, running it two or three times in a row gives different results.

That's fine for my usage, but that sounds like a nightmare if every time Google tweaks their model, companies have to readjust their whole process to deal with the discrepancies.

And sticking with the same model for multiple years also sounds like a captive situation where you'd have to pay a premium for Google to keep it available for your use.

tomrod

Consider turning down the temperature in the configuration? LLMs have a bit of randomness in them.

Gemini 2.0 Flash seems better than 1.5 - https://deepmind.google/technologies/gemini/flash/
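
A minimal sketch of where that knob lives (SDK details vary by version, and as noted downthread even temperature 0 isn't fully deterministic on some hosted models):

    import google.generativeai as genai

    genai.configure(api_key="YOUR_API_KEY")
    model = genai.GenerativeModel("gemini-2.0-flash")
    scan = genai.upload_file("scan.pdf")
    response = model.generate_content(
        [scan, "OCR this document into plain text."],
        # Pin sampling down to reduce run-to-run variation.
        generation_config={"temperature": 0.0, "top_p": 1.0},
    )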

mejutoco

> and every single week the results were slightly different.

This is one of the reasons why open source offline models will always be part of the solution, if not the whole solution.

rafaelmn

Inconsistency comes from scaling - if you are optimizing your infra to be cost-effective you will arrive at the same tradeoffs. Not saying it's not nice to be able to make some of those decisions on your own - but if you're picking LLMs for simplicity, we are years away from running your own being in the same league for most people.

iandanforth

At temperature zero, if you're using the same API/model, this really should not be the case. None of the big players update their APIs without some name / version change.

pigscantfly

This isn't really true unfortunately -- mixture of experts routing seems to suffer from batch non-determinism. No one has stated publicly exactly why this is, but you can easily replicate the behavior yourself or find bug reports / discussion with a bit of searching. The outcome and observed behavior of the major closed-weight LLM APIs is that a temperature of zero no longer corresponds to deterministic greedy sampling.

kiratp

Quantized floating point math can, under certain scenarios, be non-associative.

When you combine that fact with being part of a diverse batch of requests over an MoE model, outputs are non-deterministic.
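
A tiny, self-contained illustration of that non-associativity:

    # The same three numbers summed in two different orders give different
    # floats; batched GPU kernels can change reduction order, so logits (and
    # the sampled token) can shift between runs.
    a, b, c = 1e16, -1e16, 1.0
    print((a + b) + c)   # 1.0
    print(a + (b + c))   # 0.0  -- the 1.0 is lost to rounding before a is added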

bushbaba

That’s why you have azure openAI APIs which give a lot more consistency

itissid

Wait, isn't there at least a two-step process here: semantic segmentation followed by a method like Textract for text - to avoid hallucinations?

One cannot possibly say that "Text extracted by a multimodal model cannot hallucinate"?

> accuracy was like 96% of that of the vendor and price was significantly cheaper.

I would like to know how this 96% was tested. If you use a human to do random sample based testing, well how do you adjust the random sample for variations in distribution of errors that vary like a small set of documents could have 90% of the errors and yet they are only 1% of the docs?

themanmaran

One thing people always forget about traditional OCR providers (azure, tesseract, aws textract, etc.) is that they're ~85% accurate.

They are all probabilistic. You literally get back characters + confidence intervals. So when textract gives you back incorrect characters, is that a hallucination?

kapitalx

I'm the founder of https://doctly.ai, also pdf extraction.

The hallucination in LLM extraction is much more subtle as it will rewrite full sentences sometimes. It is much harder to spot when reading the document and sounds very plausible.

We're currently working on a version where we send the document to two different LLMs, and use a 3rd if they don't match to increase confidence. That way you have the option of trading compute and cost for accuracy.
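
Roughly, that consensus idea looks like this (a sketch, not the doctly.ai implementation; `extract_with` is a hypothetical helper):

    def extract_with_consensus(pdf_path, extract_with):
        # extract_with(model_name, pdf_path) returns the parsed text for one model.
        first = extract_with("model-a", pdf_path)
        second = extract_with("model-b", pdf_path)
        if first == second:                        # exact match: high confidence
            return first, "agreed"
        third = extract_with("model-c", pdf_path)  # spend extra compute as tie-breaker
        if third in (first, second):
            return third, "majority"
        return first, "no-consensus"               # e.g. flag for human review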

anon373839

It’s a question of scale. When a traditional OCR system makes an error, it’s confined to a relatively small part of the overall text. (Think of “Plastics” becoming “PIastics”.) When a LLM hallucinates, there is no limit to how much text can be made up. Entire sentences can be rewritten because the model thinks they’re more plausible than the sentences that were actually printed. And because the bias is always toward plausibility, it’s an especially insidious problem.

miki123211

The difference is the kind of hallucinations you get.

Traditional OCR is more likely to skip characters, or replace them with similar-looking ones, so you often get AL or A1 instead of AI, for example. In other words, traditional spelling mistakes. LLMs can do anything from hallucinating new paragraphs to slightly changing the meaning of a sentence. The text is still grammatically correct, it makes sense in the context - except that it's not what the document actually said.

I once gave it a hand-written list of words and their definitions and asked it to turn that into flashcards (a json array with "word" and "definition"). Traditional OCR struggled with this text, the results were extremely low-quality, badly formatted but still somewhat understandable. The few LLMs I've tried either straight up refused to do it, or gave me the correct list of words, but entirely hallucinated the definitions.

Scoundreller

> You literally get back characters + confidence intervals.

Oh god, I wish speech to text engines would colour code the whole thing like a heat map to focus your attention to review where it may have over-enthusiastically guessed at what was said.

You no knot.

somebehemoth

I know nothing about OCR providers. It seems like OCR failure would result in gibberish or awkward wording that might be easy to spot. Doesn't the LLM failure mode assert made up truths eloquently that are more difficult to spot?

nyarlathotep_

> is that they're ~85% accurate.

Speaking from experience, you need to double check "I" and "l" and "1", "0" and "O" all the time; accuracy seems to depend on the font and some other factors.

I have a util script I use locally to copy some token values out of screenshots from a VMware client (long story) and I have to manually adjust 9/10 times.

How relevant that is or isn't depends on the use case.

itissid

For an OCR company I imagine it is unconscionable to do this, because if you were, say, OCR'ing an oral history project for a library and made hallucination errors, well, you've replaced facts with fiction. Rewriting history? What the actual F.

phatfish

Probably totally fine for a "fintech" (Crypto?) though. Most of them are just burning VC money anyway. Maybe a lucky customer gets a windfall because Gemini added some zeros.

threecheese

Normal OCR (like Tesseract) can be wrong as well (and IMO this happens frequently). It won’t hallucinate/straight make shit up like an LLM, but a human needs to review OCR results if the workload requires accuracy. Even across multiple runs of the same image an OCR can give different results (in some scenarios). No OCR system is perfectly accurate, they all use some kind of machine learning/floating point/potentially nondeterministic tech.

nthingtohide

Can confirm: using Gemini, some figure numbers were hallucinated. I had to cross-check each row to make sure the extracted data was correct.

godapi

Use different models to extract the page and cross-check them against each other. That generally reduces issues a lot.

basch

Wouldn’t the temperature on something like OCR be very low? You want the same result every time. Isn’t some part of hallucination the randomness of temperature?

manmal

I can imagine reducing temp too much will lead to garbage results in situations where glyphs are unreadable.

serjester

The LLMs are near perfect (maybe parsing I instead of 1) - if you're using the outputs in the context of RAG, your errors are likely much, much higher in the other parts of your system. Spending a ton of time and money chasing 9's when 99% of your system's errors have totally different root causes seems like a bad use of time (unless they're not).

j_timberlake

This sounds extremely like my old tax accounting job. OCR existed and "worked" but it was faster to just enter the numbers manually than fix all the errors.

Also, the real solution to the problem should have been for the IRS to just pre-fill tax returns with all the accounting data that they obviously already have. But that would require the government to care.

eternauta3k

Germany (not exactly the cradle of digitalization) already auto-fills salary tax fields with data from the employer.

Andrex

They finally made filing free.

So, maybe this century?

kennyloginz

Check again, Elon and his Doge team killed that.

panarky

This is a big aha moment for me.

If Gemini can do semantic chunking at the same time as extraction, all for so cheap and with nearly perfect accuracy, and without brittle prompting incantation magic, this is huge.

wussboy

Could it do exactly the same with a web page? Would this replace something like beautiful soup?

eitally

I don't know exactly how or what it's doing behind the scenes, but I've been massively impressed with the results Gemini's Deep Research mode has generated, including both traditional LLM freeform & bulleted output, but also tabular data that had to come from somewhere. I haven't tried cross-checking for accuracy but the reports do come with linked sources; my current estimation is that they're at least as good as a typical analyst at a consulting firm would create as a first draft.

fallinditch

If I used Gemini 2.0 for extraction and chunking to feed into a RAG that I maintain on my local network, then what sort of locally-hosted LLM would I need to gain meaningful insights from my knowledge base? Would a 13B parameter model be sufficient?

jhoechtl

Your local model has little more to do but stitch the already meaningful pieces together.

The pre-step, chunking and semantic understanding is all that counts.

yeahwhatever10

Do you get meaningful insights with current RAG solutions?

potatoman22

Small point but is it doing semantic chunking, or loading the entire pdf into context? I've heard mixed results on semantic chunking.

panarky

It loads the entire PDF into context, but then it would be my job to chunk the output for RAG, and just doing arbitrary fixed-size blocks, or breaking on sentences or paragraphs is not ideal.

So I can ask Gemini to return chunks of variable size, where each chunk is one complete idea or concept, without arbitrarily chopping a logical semantic segment into multiple chunks.
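
A sketch of what that request can look like (illustrative prompt and fields, not a documented Gemini feature):

    import json
    import google.generativeai as genai

    genai.configure(api_key="YOUR_API_KEY")
    model = genai.GenerativeModel("gemini-2.0-flash")
    doc = genai.upload_file("report.pdf")
    prompt = (
        "Split this document into chunks for retrieval. Each chunk must contain "
        "exactly one complete idea or concept; never cut a logical section in half. "
        "Return a JSON list of objects with 'heading' and 'text' fields."
    )
    response = model.generate_content(
        [doc, prompt],
        generation_config={"response_mime_type": "application/json"},
    )
    chunks = json.loads(response.text)   # variable-size, semantically whole chunks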

hattmall

It's cheap now because Google is subsidizing it, no?

vrosas

Spoiler: every model is deeply, deeply subsidized. At least Google's is subsidized by a real business with revenue, not VC's staring at the clock.

ForHackernews

This is great, I just want to highlight how nuts it is that we have spun up whole industries around extracting text that was typically printed from a computer, back into a computer.

There should be laws that mandate that financial information be provided in a sensible format: even Office Open XML would be better than this insanity. Then we can redirect all this wasted effort into digging ditches and filling them back in again.

faxmeyourcode

I've been fighting trying to chunk SEC filings properly, specifically surrounding the strange and inconsistent tabular formats present in company filings.

This is giving me hope that it's possible.

anirudhb99

(from the gemini team) we're working on it! semantic chunking & extraction will definitely be possible in the coming months.

otoburb

>>I've been fighting trying to chunk SEC filings properly, specifically surrounding the strange and inconsistent tabular formats present in company filings.

For this specific use case you can also try edgartools[1] which is a library that was relatively recently released that ingests SEC submissions and filings. They don't use OCR but (from what I can tell) directly parse the XBRL documents submitted by companies and stored in EDGAR, if they exist.

[1] https://github.com/dgunning/edgartools

faxmeyourcode

I'll definitely be looking into this, thanks for the recommendation! Been playing around with it this afternoon and it's very promising.

barrenko

If you'd kindly tl;dr the chunking strategies you have tried and what works best, I'd love to hear.

jgalt212

isn't everyone on iXBRL now? Or are you struggling with historical filings?

faxmeyourcode

XBRL is what I'm using currently, but it's still kind of a mess (maybe I'm just bad at it) for some of the non-standard information that isn't properly tagged.

yzydserd

How do today’s LLMs like Gemini compare with the Document Understanding services google/aws/azure have offered for a few years, particularly when dealing with known forms? I think Google’s is Document AI.

zacmps

I've found the highest accuracy solution is to OCR with one of the dedicated models then feed that text and the original image into an LLM with a prompt like:

"Correct errors in this OCR transcription".

therein

How does it behave if the body of text is offensive or what if it is talking about a recipe to purify UF-6 gas at home? Will it stop doing what it is doing and enter lecturing mode?

I am asking not to be cynical, but because in my limited experience, using LLMs for any task that may operate on offensive or unknown input seems to get them triggered by all sorts of unpredictable moral judgements and dragged into generating anything but the output I wanted.

If I am asking this black box to give me a JSON output containing keywords for a certain text, if it happens to be offensive, it refuses to do that.

How does one tackle that?

bradfox2

This is what we do today. Have you tried it against Gemini 2.0?

anirudhb99

member of the gemini team here -- personally, i'd recommend directly using gemini vs the document understanding services for OCR & general docs understanding tasks. From our internal evals gemini is now stronger than these solutions and is only going to get much better (higher precision, lower hallucination rates) from here.

joelhaus

Could we connect offline about using Gemini instead of the doc ai custom extractor we currently use in production?

This sounds amazing & I'd love your input on our specific use case.

joelatoutboundin.com

ajcp

GCP's Document AI service is now literally just a UI layer for document-parsing use cases, backed by Gemini models. When we realized that, we dumped it and just use Gemini directly.

llm_trw

This is using exactly the wrong tools at every stage of the OCR pipeline, and the cost is astronomical as a result.

You don't use multimodal models to extract a wall of text from an image. They hallucinate constantly the second you get past perfect 100% high-fidelity images.

You use an object detection model trained on documents to find the bounding boxes of each document section as _images_; each bounding box comes with a confidence score for free.

You then feed each box of text to a regular OCR model, also gives you a confidence score along with each prediction it makes.

You feed each image box into a multimodal model to describe what the image is about.

For tables, use a specialist model that does nothing but extract tables—models like GridFormer that aren't hyped to hell and back.

You then stitch everything together in an XML file because Markdown is for human consumption.

You now have everything extracted with flat XML markup for each category the object detection model knows about, along with multiple types of probability metadata for each bounding box, each letter, and each table cell.

You can now start feeding this data programmatically into an LLM to do _text_ processing, where you use the XML to control what parts of the document you send to the LLM.

You then get chunking with location data and confidence scores of every part of the document to put as meta data into the RAG store.

I've built a system that reads 500k pages _per day_ using the above, completely locally, on a machine that cost $20k.
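
A bare skeleton of that staged flow, to make the structure concrete (every callable here is a placeholder for the class of model named above - layout detector, OCR engine, captioner, table model - not any specific library's API):

    import xml.etree.ElementTree as ET

    def process_page(page_image, detect_regions, run_ocr, describe_image, extract_table):
        """page_image is PIL-style; the four callables stand in for the models above."""
        root = ET.Element("page")
        for region in detect_regions(page_image):           # boxes + label + confidence
            el = ET.SubElement(root, region.label,
                               bbox=",".join(map(str, region.bbox)),
                               det_conf=f"{region.confidence:.3f}")
            crop = page_image.crop(region.bbox)
            if region.label == "figure":
                el.text = describe_image(crop)               # multimodal model, images only
            elif region.label == "table":
                el.append(extract_table(crop))               # specialist table model -> XML
            else:
                text, ocr_conf = run_ocr(crop)               # classic OCR per region
                el.text = text
                el.set("ocr_conf", f"{ocr_conf:.3f}")
        return ET.tostring(root, encoding="unicode")         # flat XML with provenance attributes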

ajcp

Not sure what service you're basing your calculation on, but with Gemini I've processed 10,000,000+ shipping documents (PDFs and PNGs) of every conceivable layout in one month at under $1000 and an accuracy rate of between 80-82% (humans were at 66%).

The longest part of the development timeline was establishing the accuracy rate and the ingestion pipeline, which itself is massively less complex than what your workflow sounds like: PDF -> Storage Bucket -> Gemini -> JSON response -> Database

Just to get sick with it we actually added some recursion to the Gemini step to have it rate how well it extracted, and if it was below a certain rating to rewrite its own instructions on how to extract the information and then feed it back into itself. We didn't see any improvement in accuracy, but it was still fun to do.

llm_trw

>Not sure what service you're basing your calculation on, but with Gemini

The table of costs in the blog post. At 500,000 pages per day the hardware fixed cost overcomes the software variable cost at day 240 and from then on you're paying an extra ~$100 per day to keep it running in the cloud. The machine also had to use extremely beefy GPUs to fit all the models it needed to. Compute utilization was between 5 to 10% which means that it's future proof for the next 5 years at the rate at which the data source was growing.

    | Model                       | Pages per Dollar |
    |-----------------------------+------------------|
    | Gemini 2.0 Flash            | ≈ 6,000          |
    | Gemini 2.0 Flash Lite       | ≈ 12,000*        |
    | Gemini 1.5 Flash            | ≈ 10,000         |
    | AWS Textract                | ≈ 1,000          |
    | Gemini 1.5 Pro              | ≈ 700            |
    | OpenAI 4o-mini              | ≈ 450            |
    | LlamaParse                  | ≈ 300            |
    | OpenAI 4o                   | ≈ 200            |
    | Anthropic claude-3-5-sonnet | ≈ 100            |
    | Reducto                     | ≈ 100            |
    | Chunkr                      | ≈ 100            |
There is also the fact that it's _completely_ local, which meant we could throw every proprietary data source that couldn't leave the company at it.

>The longest part of the development timeline was establishing the accuracy rate and the ingestion pipeline, which itself is massively less complex than what your workflow sounds like: PDF -> Storage Bucket -> Gemini -> JSON response -> Database

Each company should build tools which match the skill level of their developers. If you're not comfortable training models locally with all that entails off the shelf solutions allow companies to punch way above their weight class in their industry.

serjester

That assumes that you're able to find a model that can match Gemini's performance - I haven't come across anything that comes close (although hopefully that changes).

cpursley

Very cool! How are you storing it to a database - vectors? What do you do with the extracted data (in terms of being able to pull it up via some query system)?

ajcp

In this use-case the customer just wanted data not currently captured in the warehouse inventory management system, so here we converted a JSON response to a classic table row schema (where 1 row = 1 document) and now, boom, shipping data!

However we do very much recommend storing the raw model responses for audit and then at least as vector embeddings to orient the expectation that the data will need to be utilized for vector search and RAG. Kind of like "while we're here why don't we do what you're going to want to do at some point, even if it's not your use-case now..."

svieira

> [with] an accuracy rate of between 80-82% (humans were at 66%)

Was this human-verified in some way? If not, how did you establish the facts-on-the-ground about accuracy?

ajcp

Yup, unfortunately the only way to know how good an AI is at anything is to do the same thing you'd do with a human: build a test that you know the answers to already. That's also why the accuracy evaluation was by far the most time-intensive part of the development pipeline, as we had to manually build a "ground-truth" dataset that we could evaluate the AI against.

jeswin

I feel compelled to reply. You've made a bunch of assumptions, and presented your success (likely with a limited set of table formats) as the one true way to parse PDFs. There's no such thing.

In real world usage, many tables are badly misaligned. Headers are off. Lines are missing between rows. Some columns and rows are separated by colors. Cells are merged. Some are imported from Excel. There are dotted sub sections, tables inside cells etc. Claude (and now Gemini) can parse complex tables and convert that to meaningful data. Your solution will likely fail, because rules are fuzzy in the same way written language is fuzzy.

Recently someone posted this on HN, it's a good read: https://lukaspetersson.com/blog/2025/bitter-vertical/

> You don't use multimodal models to extract a wall of text from an image. They hallucinate constantly the second you get past perfect 100% high-fidelity images.

No, not like that, but often as nested JSON or XML. For financial documents, our accuracy was above 99%. There are many ways to do error checking to figure out which ones are likely to have errors.

> This is using exactly the wrong tools at every stage of the OCR pipeline, and the cost is astronomical as a result.

One should refrain from making statements about cost without knowing how and where it'll be used. When processing millions of PDFs, it could be a problem. When processing 1000, one might prefer Gemini/other over spending engineering time. There are many apps where processing a single doc is, say, $10 in revenue. You don't care about OCR costs.

> I've build a system that read 500k pages _per day_ using the above completely locally on a machine that cost $20k.

The author presented techniques which worked for them. It may not work for you, because there's no one-size-fits-all for these kinds of problems.

metadat

Related discussion:

AI founders will learn the bitter lesson

https://news.ycombinator.com/item?id=42672790 - 25 days ago, 263 comments

The HN discussion contains a lot of interesting ideas, thanks for the pointer!

llm_trw

You're making an even less charitable set of assumptions:

1). I'm incompetent enough to ignore publicly available table benchmarks.

2). I'm incompetent enough to never look at poor quality data.

3). I'm incompetent enough to not create a validation dataset for all models that were available.

Needless to say you're wrong on all three.

My day rate is $400 + taxes per hour if you want to be run through each point and why VLMs like Gemini fail spectacularly and unpredictably when left to their own devices.

pkkkzip

whoa, this is a really aggressive response. No one is calling you incompetent rather criticizing your assumptions.

> My day rate is $400 + taxes per hour if you want to be run through each point

Great, thanks for sharing.

danielparsons

bragging about billing $400 an hour LOL

vikp

Marker (https://www.github.com/VikParuchuri/marker) works kind of like this. It uses a layout model to identify blocks and processes each one separately. The internal format is a tree of blocks, which have arbitrary fields, but can all render to html. It can write out to json, html, or markdown.

I integrated gemini recently to improve accuracy in certain blocks like tables. (get initial text, then pass to gemini to refine) Marker alone works about as well as gemini alone, but together they benchmark much better.

llm_trw

I used sxml [0] unironically in this project extensively.

The rendering step for reports that humans got to see was a call to pandoc after the sxml was rendered to markdown - look ma we support powerpoint! - but it also allowed us to easily convert to whatever insane markup a given large (or small) language model worked best with on the fly.

[0] https://en.wikipedia.org/wiki/SXML

cma

Why process separately? If there are ink smudges, photocopier glitches, etc., wouldn't it guess some stuff better from richer context, like acronyms in rows used across the other tables?

hackernewds

It's funny you astroturf your own project in a thread where another is presenting tangential info about their own

alemos

what does marker add on top of docling?

vikp

Docling is a great project, happy to see more people building in the space.

Marker output will be higher quality than docling output across most doc types, especially with the --use_llm flag. A few specific things we do differently:

  - We have hybrid mode with gemini that merges tables across pages, improves quality on forms, etc.
  - we run an ordering model, so ordering is better for docs where the PDF order is bad
  - OCR is a lot better, we train our own model, surya - https://github.com/VikParuchuri/surya
  - References and links
  - Better equation conversion (soon including inline)

anon373839

This is a great comment. I will mention another benefit to this approach: the same pipeline works for PDFs that are digital-native and don't require OCR. After the object detection step, you collect the text directly from within the bounding boxes, and the text is error-free. Using Gemini means that you give this up.

siva7

You're describing yesterday's world. With the advancement of AI, there is no need for any of these many steps and stages of OCR anymore. There is no need for XML in your pipeline, because Markdown is now equally suited for machine consumption by AI models.

llm_trw

The results we got 18 months ago are still better than the current Gemini benchmarks, at a fraction of the cost.

As for markdown, great. Now how do you encode the metadata about the confidence of the model that the text says what it thinks it says? Because XML has this lovely thing called attributes that lets you keep a provenance record, without a second database, that's also readable by the LLM.

JohnKemeny

Just commenting here so that I can find back to this comment later. You perfectly captured the AI hype in one small paragraph.

fransje26

Hey, why settle for yesteryear's world, with better accuracy, lower costs and local deployment, if you can use today's new shiny tool, reinvent the wheel, put everything in the cloud, and get hallucinations for free...

raincole

Just commenting here to say the GP is spot on.

If you already have a highly optimized pipeline built yesterday, then sure, keep using it.

But if you start dealing with PDF today, just use Gemini. Use the most human readable formats you can find because we know AI will be optimized on understanding that. Don't even think about "stitching XML files" blahblah.

tzs

For future reference if you click on the timestamp of a comment that will bring you to a screen that has a “favorite” link. Click that to add the comment to your favorite comments list, which you can find on your profile page.

senko

> I've build a system that read 500k pages _per day_ using the above completely locally on a machine that cost $20k.

That is impressive. However, if someone needs to read a couple of hundred pages per day, there's no point in setting all that up.

Also, you neglected to mention the cost of setting everything up. The machine cost $20k; but your time, and cost to train yolo8, probably cost more than that. If you want to compare costs (find a point where local implementation such as this is better ROI), you should compare fully loaded costs.

thiht

Or, depending on your use case, you do it in one step and ask an LLM to extract data from a PDF.

What you describe is obviously better and more robust by a lot, but the LLM-only approach is not "wrong". It's simple, fast, easy to set up and understand, and it works. With less accuracy, but it does work. Depending on the constraints, development budget and load, it's a perfectly acceptable solution.

We did this to handle 2000 documents per month and are satisfied with the results. If we need to upgrade to something better in the future we will, but in the mean time, it’s done.

eitally

Fwiw, I'm not convinced Gemini isn't using a document-based object detection model for this, at least for some parts of this or for some doc categories (especially common things like IDs, bills, tax forms, invoices & POs, shipping documents, etc. that they've previously created document extractors for as part of their DocAI cloud service).

simonw

I don't see why they would do that. The whole point of training a model like Gemini is that you train the model - if they want it to work great against those different categories of document the likely way to do it is to add a whole bunch of those documents to Gemini's regular training set.

twelve40

Isn't it amazing how one company invented a universally spread format that takes structured data from an editor (except images obviously) and converts it into a completely fucked-up unstructured form that then requires expensive voodoo magic to convert back into structured data.

sconeguy

What would it have taken to store the plain text in some meta field in the document. Argh, so annoying.

dkjaudyeqooe

PDF provide that capability, but editors don't produce it, probably because printing is though OS drivers that don't support it, or PDF generators that don't support it. Or they do support it but users don't know to check that option, or turn it off because it makes PDFs too large.

user_7832

Do you know what this field/type is called, and I’d any of the big names (MS/Adobe etc) support creating such PDFs?

groby_b

PDF supports that just fine. It's just that many PDF publishers choose not to use that.

You can lead a horse to water...

shermantanktop

PDFs began as just postscript commands stored in a file. It’s a genius hack in a way that has become a Frankenstein’s monster.

nitwit005

People kind of dump whatever in pdf files, so I don't think a cleaner file format would do as much as you might think.

Digital fax services will generate pdf files, for example. They're just image data dumped into a pdf. Various scanners will also do so.

shermantanktop

is "put this glyph at coordinate (x,y)" really what you'd call "structured"?

dkjaudyeqooe

He's calling PDFs unstructured: structured editors -> unstructured PDF -> structured data

irjustin

It's not the structure that allows meaningful understanding.

Something that was clearly a table now becomes a bunch of glyphs physically close to each other vs. a group of other glyphs which, when considered as a group, is a box visually separated from another group of glyphs but actually part of the table.

surfingdino

In my experience AWS Textract does a pretty good job without using LLMs.

Bluestein

... and calls it "portable", to boot.

silverliver

We are driving full speed into a xerox 2.0 moment and this time we are doing so knowingly. At least with xerox, the errors were out of place and easy to detect by a human. I wonder how many innocent people will lose their lives or be falsely incarcerated because of this. I wonder if we will adapt our systems and procedures to account for hallucinations and "85%" accuracy.

And no, outlawing the use of AI or increasing liability for its use will do next to nothing to deter its misuse, and everyone knows it. My heart goes out to the remaining 15%.

anon373839

I love generative AI as a technology. But the worst thing about its arrival has been the reckless abandonment of all engineering discipline and common sense. It’s embarrassing.

sschueller

CCC talk about Xerox copiers changing numbers when doing OCR:

https://media.ccc.de/v/31c3_-_6558_-_de_-_saal_g_-_201412282...

tomrod

Would be nice to get a translation for a broader audience, glad folks are reporting this out!

ikrenji

the first thing that guy says is that existing non-AI solutions are not that great. then he says that AI beats them in accuracy. so i don't quite understand the point you're trying to make here

csomar

Humans accept a degree of error for convenience. (driving is one of them). But no, 15% is not the acceptable rate. More like 0.15% to 0.015% depending on the country.

tomrod

Meh, just maintain an audit log and an escalation subsystem. No need to be luddites when the problems are process, not tech stack.

freezed8

(disclaimer I am CEO of llamaindex, which includes LlamaParse)

Nice article! We're actively benchmarking Gemini 2.0 right now and if the results are as good as implied by this article, heck we'll adapt and improve upon it. Our goal (and in fact the reason our parser works so well) is to always use and stay on top of the latest SOTA models and tech :) - we blend LLM/VLM tech with best-in-class heuristic techniques.

Some quick notes:

1. I'm glad that LlamaParse is mentioned in the article, but it's not mentioned in the performance benchmarks. I'm pretty confident that our most accurate modes are at the top of the table benchmark - our stuff is pretty good.

2. There's a long tail of issues beyond just tables - this includes fonts, headers/footers, ability to recognize charts/images/form fields, and as other posters said, the ability to have fine-grained bounding boxes on the source elements. We've optimized our parser to tackle all of these modes, and we need proper benchmarks for that.

3. DIY'ing your own pipeline to run a VLM at scale to parse docs is surprisingly challenging. You need to orchestrate a robust system that can screenshot a bunch of pages at the right resolution (which can be quite slow), tune the prompts, and make sure you're obeying rate limits + can retry on failure.

M4v3R

The very first (and probably hand-picked & checked) example on your website [0] suffers from the very problem people are talking about here - in the "Fiscal 2024" row it contains an error in the CEO CAP column. On the image it says "$234.1" but the parsed result says "$234.4". A small error, but an error nonetheless. I wonder if we can ever fix these kinds of errors with LLM parsing.

[0] https://www.llamaindex.ai/llamaparse

dilDDoS

Looks like this was fixed, the parsed result says "$234.1" on my end. I wonder if the error was fixed manually or with another round of LLM parsing?

heidarb

I'm a happy customer. I wrote a ruby client for your API and have been parsing thousands of different types of PDFs through it with great results. I tested almost everything out there at the time and I couldn't find anything that came close to being as good as llamaparse.

BenGosub

Indeed, this is also my experience. I have tried a lot of things and where quality is more important than quantity, I doubt there are many tools that can come close to Llamaparse.

rendaw

All your examples are exquisitely clean digital renders of digital documents. How does it fare with real scans (noise, folds) or photos? Receipts?

Or is there a use case for digital non-text pdfs? Are people really generating image and not text-based PDFs? Or is the primary use case extracting structure, rather than text?

rahimnathwani

Hi Jerry,

How well does llamaparse work on foreign-language documents?

I have pipeline for Arabic-language docs using Azure for OCR and GPT-4o-mini to extract structured information. Would it be worth trying llamaparse to replace part of the pipeline or the whole thing?

freezed8

yes! we have foreign language support for better OCR on scans. Here's some more details. Docs: https://docs.cloud.llamaindex.ai/llamaparse/features/parsing... Notebook: https://github.com/run-llama/llama_parse/blob/main/examples/...

rahimnathwani

What is disable_ocr=True for? Is it for documents that already have a text layer, that you don't want to OCR again?

sensanaty

There's an error right on your landing page [1] with the parsed document...

It's supposed to say 234.1, not 234.4

https://www.llamaindex.ai/llamaparse

rjurney

I've been using NotebookLM powered by Gemini 2.0 for three projects and it is _really powerful_ for comprehending large corpuses you can't possibly read and thinking informed by all your sources. It has solid Q&A. When you ask a question or get a summary you like [which often happens] you can save it as a new note, putting it into the corpus for analysis. In this way your conclusions snowball. Yes, this experience actually happens and it is beautiful.

I've tried Adobe Acrobat AI for this and it doesn't work yet. NotebookLM is it. The grounding is the reason it works - you can easily click on anything and it will take you to the source to verify it. My only gripe is that the visual display of the source material is _dogshit ugly_, like exceptionally so. Big blog pink background letters in lines of 24 characters! :) It has trouble displaying PDF columns, but at least it parses them. The ugly will change I'm sure :)

My projects are setup to let me bridge the gaps between the various sources and synthesize something more. It helps to have a goal and organize your sources around that. If you aren't focused, it gets confused. You lay the groundwork in sources and it helps you reason. It works so well I feel _tender_ towards it :) Survey papers provide background then you add specific sources in your area of focus. You can write a profile for how you would like NotebookLM to think - which REALLY helps out.

They are:

* The Stratigrapher - A Lovecraftian short story about the world's first city. Sources: all of Seton Lloyd/Faud Safar's work on Eridu; various sources on Sumerian culture and religion; all of Lovecraft's work and letters; various sources about opium; some articles about nonlinear geometries.

* FPGA Accelerated Graph Analytics - An introduction to Verilog; papers on FPGAs and graph analytics; papers on Apache Spark architecture; papers on GraphFrames and a related rant I created about it and graph DBs; a source on Spark-RAPIDS; papers on subgraph matching, graphlets, network motifs; papers on random graph models.

* Graph machine learning notebook without a specific goal, which has been less successful. It helps to have a goal for the project. It got confused by how broad my sources were.

I would LOVE to share my projects with you all, but you can only share within a Google Workspaces domain. It will be AWESOME when they open this thing up :)

anirudhb99

thanks a ton for all the amazing feedback on this thread! if

(a) you have document understanding use cases that you'd like to use gemini for (the more aspirational the better) and/or

(b) there are loss cases for which gemini doesn't work well today,

please feel free to email anirudhbaddepu@google.com and we'd love to help get your use case working & improve quality for our next series of model updates!

nabla9

What if you need to scan pages from thick paper books or bound documents without a specialized book scanner?

I have two use cases in mind:

1. Photographs of open book.

2. Having video feed of open book where someone flips pages manually.

rudolph9

We parse millions of PDFs using Apache Tika and process about 30,000 per dollar of compute cost. However, the structured output leaves something to be desired, and there are a significant number of pages that Tika is unable to parse.

https://tika.apache.org/
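
For reference, a minimal sketch of that kind of extraction via the tika-python bindings (the poster may well be calling the Tika server or Java API directly instead):

    from tika import parser

    parsed = parser.from_file("statement.pdf")
    text = parsed.get("content") or ""        # plain text; layout is largely flattened
    metadata = parsed.get("metadata", {})     # author, dates, content-type, etc.
    print(metadata.get("Content-Type"), len(text))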

rudolph9

Under the hood Tika uses Tesseract for OCR parsing. For clarity, this all works surprisingly well generally speaking, and it's pretty easy to run yourself and an order of magnitude cheaper than most services out there.

https://tesseract-ocr.github.io/tessdoc/

gapeslape

In my mind, Gemini 2.0 changes everything because of the incredibly long context (2M tokens on some models), while having strong reasoning capabilities.

We are working on a compliance solution (https://fx-lex.com) and RAG just doesn't cut it for our use case. Legislation cannot be chunked if you want the model to reason well about it.

It’s magical to be able to just throw everything into the model. And the best thing is that we automatically benefit from future model improvements along all performance axes.

pvo50555

What does "throw everything into the model" entail in your context?

How much data are you able to feed into the model in a single prompt and on what hardware, if I may ask?

gapeslape

Gemini models run in the cloud, so there is no issue with hardware.

The EU regulations typically include delegated acts, technical standards, implementation standards and guidelines. With Gemini 2.0 we are able to just throw all of this into the model and have it figure things out.

This approach gives way better results than anything we are able to achieve with RAG.

My personal bet is that this is what the future will look like. RAG will remain relevant, but only for extremely large document corpuses.

manmal

Maybe a dumb question, have you tried fine tuning on the corpus, and then adding a reasoning process (like all those R1 distillations)?

gapeslape

We haven't tried that, we might do that in the future.

My intuition - not based on any research - is that recall should be a lot better from in context data vs. weights in the model. For our use case, precise recall is paramount.

galvin

Somewhat tangential, but the EU has a directive mandating electronic invoicing for public procurement.

One of the standards that has come out of that is EN 16931, also known as ZUGFeRD and Factur-X, which basically involves embedding an XML file with the invoice details inside a PDF/A. It allows the PDF to be used like a regular PDF but it also allows the government procurement platforms to reliably parse the contents without any kind of intelligence.

It seems like a nice solution that would solve a lot of issues with ingesting PDFs for accounting if everyone somehow managed to agree a standard. Maybe if EN 16931 becomes more broadly available it might start getting used in the private sector too.

jbarrow

> Unfortunately Gemini really seems to struggle on this, and no matter how we tried prompting it, it would generate wildly inaccurate bounding boxes

Qwen2.5 VL was trained on a special HTML format for doing OCR with bounding boxes. [1] The resulting boxes aren't quite as accurate as something like Textract/Surya, but I've found they're much more accurate than Gemini or any other LLM.

[1] https://qwenlm.github.io/blog/qwen2.5-vl/

fngjdflmdflg

>Unfortunately Gemini really seems to struggle on this, and no matter how we tried prompting it, it would generate wildly inaccurate bounding boxes

This is what I have found as well. From what I've read, LLMs do not work well with images for specific details due to image encoders which are too lossy. (No idea if this is actually correct.) For now I guess you can use regular OCR to get bounding boxes.

minimaxir

Modern multimodal encoders for LLMs are fine/not lossy since they do not resize to a small size and can handle arbitrary sizes, although some sizes are obviously better represented in the training set. An 8.5" x 11" paper would be common.

I suspect the issue is prompt engineering related.

> Please provide me strict bounding boxes that encompasses the following text in the attached image? I'm trying to draw a rectangle around the text.

> - Use the top-left coordinate system

> - Values should be percentages of the image width and height (0 to 1)

LLMs have enough trouble with integers (since token-wise integers and text representation of integers are the same), high-precision decimals will be even worse. It might be better to reframe the problem as "this input document is 850 px x 1100 px, return the bounding boxes as integers" then parse and calculate the decimals later.
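
A sketch of that reframing (purely illustrative; the prompt wording and page size are assumptions):

    # Ask for integer pixel coordinates against a stated page size, then convert
    # to fractions in code rather than asking the model for decimals.
    prompt = (
        "This page image is 850 px wide and 1100 px tall. Return the bounding box "
        "of the requested text as integers [x0, y0, x1, y1] in pixel coordinates, "
        "top-left origin."
    )

    def to_fractions(box_px, width_px=850, height_px=1100):
        x0, y0, x1, y1 = box_px
        return [x0 / width_px, y0 / height_px, x1 / width_px, y1 / height_px]

    # e.g. a model answer of [120, 340, 610, 372] -> [0.141, 0.309, 0.718, 0.338]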

fngjdflmdflg

Just tried this and it did not appear to work for me. Prompt:

>Please provide me strict bounding boxes that encompasses the following text in the attached image? I'm trying to draw a rectangle around the text.

> - Use the top-left coordinate system

>this input document is 1080 x 1236 px. return the bounding boxes as integers

BoorishBears

https://github.com/google-gemini/cookbook/blob/a916686f95f43...

They say there's no magic prompt, but I'd start with their default, since there is usually some prompt format used during post-training to improve performance on tasks like this.

minimaxir

"Might" being the operative word, particularly with models that have less prompt adherence. There's a few other prompt massaging tricks beyond the scope of a HN comment, the decimal issue is just one optimization.