PDF to Text, a challenging problem
104 comments
· May 13, 2025 · 90s_dev
bazzargh
Back in... 2006ish? I got annoyed with being unable to copy text from multicolumn scientific papers on my iRex (an early ereader that was somewhat hackable) so dug a bit into why that was. Under the hood, the pdf reader used poppler, so I modified poppler to infer reading order in multicolumn documents using layout-analysis algorithms that Thomas Breuel had published for OCR.
It was a bit of a heuristic hack; it was 20 years ago but as I recall poppler's ancient API didn't really represent text runs in a way you'd want for an accessibility API. A version of the multicolumn select made it in but it was a pain to try to persuade poppler's maintainer that subsequent suggestions to improve performance were ok - because they used slightly different heuristics so had different text selections in some circumstances. There was no 'right' answer, so wanting the results to match didn't make sense.
And that's how kpdf got multicolumn select, of a sort.
Using tesseract directly for this has probably made more sense for some years now.
pimlottc
This is life. So many times I’ve finished a project and thought to myself: “Now I am an expert at doing this. Yet I probably won’t ever do this again.” Because the next thing will be completely in a different subject area and I’ll start again from the basics.
didericis
I built an auto HQ solver with tesseract when HQ was blowing up over thanksgiving (HQ was the gameshow by the vine people with live hosts). I would take a screenshot of the app during a question, share it/send it to a little local api, do a google query for the question, see how many times each answer on the first page appeared in the results, then rank the answers by probability.
Didn't work well/was a very naive way to search for answers (which is prob good/idk what kind of trouble I'd have gotten in if it let me or anyone else who used it win all the time), but it was fun to build.
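A rough sketch of that kind of naive frequency ranking (pytesseract for the OCR step; the "last three lines are the answer choices" heuristic and the pre-fetched results text are assumptions for illustration, not the original implementation):
import collections
import pytesseract
from PIL import Image

def rank_answers(screenshot_path, results_text):
    # OCR the question screenshot and keep the non-empty lines
    lines = [l.strip() for l in
             pytesseract.image_to_string(Image.open(screenshot_path)).splitlines() if l.strip()]
    answers = lines[-3:]                      # naive: assume the last three lines are the choices
    haystack = results_text.lower()
    counts = collections.Counter({a: haystack.count(a.lower()) for a in answers})
    total = sum(counts.values()) or 1
    return [(a, n / total) for a, n in counts.most_common()]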
korkybuchek
Not that I'm privy to your mind, but it probably was tesseract (and this is my exact experience too...although for me it was about 12 years ago).
svat
One thing I wish someone would write is something like the browser's developer tools ("inspect elements") for PDF — it would be great to be able to "view source" a PDF's content streams (the BT … ET operators that enclose text, each Tj operator for setting down text in the currently chosen font, etc), to see how every “pixel” of the PDF is being specified/generated. I know this goes against the current trend / state-of-the-art of using vision models to basically “see” the PDF like a human and “read” the text, but it would be really nice to be able to actually understand what a PDF file contains.
There are a few tools that allow inspecting a PDF's contents (https://news.ycombinator.com/item?id=41379101) but they stop at the level of the PDF's objects, so entire content streams are single objects. For example, to use one of the PDFs mentioned in this post, the file https://bfi.uchicago.edu/wp-content/uploads/2022/06/BFI_WP_2... has, corresponding to page number 6 (PDF page 8), a content stream that starts like (some newlines added by me):
0 g 0 G
0 g 0 G
BT
/F19 10.9091 Tf 88.936 709.041 Td
[(Subsequen)28(t)-374(to)-373(the)-373(p)-28(erio)-28(d)-373(analyzed)-373(in)-374(our)-373(study)83(,)-383(Bridge's)-373(paren)27(t)-373(compan)28(y)-373(Ne)-1(wGlob)-27(e)-374(reduced)]TJ
-16.936 -21.922 Td
[(the)-438(n)28(um)28(b)-28(er)-437(of)-438(priv)56(ate)-438(sc)28(ho)-28(ols)-438(op)-27(erated)-438(b)28(y)-438(Bridge)-437(from)-438(405)-437(to)-438(112,)-464(and)-437(launc)28(hed)-438(a)-437(new)-438(mo)-28(del)]TJ
0 -21.923 Td
and it would be really cool to be able to see the above “source” and the rendered PDF side-by-side, hover over one to see the corresponding region of the other, etc, the way we can do for a HTML page.
kccqzy
When you use PDF.js from Mozilla to render a PDF file in DOM, I think you might actually get something pretty close. For example I suppose each Tj becomes a <span> and each TJ becomes a collection of <span>s. (I'm fairly certain it doesn't use <canvas>.) And I suppose it must be very faithful to the original document to make it work.
chaps
Indeed! I've used it to parse documents I've received through FOIA -- sometimes it's just easier to write beautifulsoup code compared to having to deal with PDF's oddities.
whenc
Try with cpdf (disclaimer, wrote it):
cpdf -output-json -output-json-parse-content-streams in.pdf -o out.json
Then you can play around with the JSON, and turn it back to PDF with cpdf -j out.json -o out.pdf
No live back-and-forth though.
svat
The live back-and-forth is the main point of what I'm asking for — I tried your cpdf (thanks for the mention; will add it to my list) and it too doesn't help; all it does is, somewhere 9000-odd lines into the JSON file, turn the part of the content stream corresponding to what I mentioned in the earlier comment into:
[
[ { "F": 0.0 }, "g" ],
[ { "F": 0.0 }, "G" ],
[ { "F": 0.0 }, "g" ],
[ { "F": 0.0 }, "G" ],
[ "BT" ],
[ "/F19", { "F": 10.9091 }, "Tf" ],
[ { "F": 88.93600000000001 }, { "F": 709.0410000000001 }, "Td" ],
[
[
"Subsequen",
{ "F": 28.0 },
"t",
{ "F": -374.0 },
"to",
{ "F": -373.0 },
"the",
{ "F": -373.0 },
"p",
{ "F": -28.0 },
"erio",
{ "F": -28.0 },
"d",
{ "F": -373.0 },
"analyzed",
{ "F": -373.0 },
"in",
{ "F": -374.0 },
"our",
{ "F": -373.0 },
"study",
{ "F": 83.0 },
",",
{ "F": -383.0 },
"Bridge's",
{ "F": -373.0 },
"paren",
{ "F": 27.0 },
"t",
{ "F": -373.0 },
"compan",
{ "F": 28.0 },
"y",
{ "F": -373.0 },
"Ne",
{ "F": -1.0 },
"wGlob",
{ "F": -27.0 },
"e",
{ "F": -374.0 },
"reduced"
],
"TJ"
],
[ { "F": -16.936 }, { "F": -21.922 }, "Td" ],
This is just a more verbose restatement of what's in the PDF file; the real questions I'm asking are:
- How can a user get to this part, from viewing the PDF file? (Note that the PDF page objects are not necessarily a flat list; they are often nested at different levels of “kids”.)
- How can a user understand these instructions, and “see” how they correspond to what is visually displayed on the PDF file?
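For the first question, at least, something like pikepdf gets you from a page object to its parsed content stream (filename and page index hypothetical); the second question, mapping each operator back to the region it paints, is exactly the part no tool seems to expose:
import pikepdf

pdf = pikepdf.open("BFI_WP_2022.pdf")   # hypothetical filename
page = pdf.pages[7]                     # pdf.pages flattens the nested /Kids tree; PDF page 8 = index 7
for operands, operator in pikepdf.parse_content_stream(page):
    if operator in (pikepdf.Operator("Tj"), pikepdf.Operator("TJ")):
        print(operator, operands)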
IIAOPSW
This might actually be something very valuable to me.
I have a bunch of documents right now that are annual statutory and financial disclosures of a large institute, and they are just barely differently organized from each year to the next to make it too tedious to cross compare them manually. I've been looking around for a tool that could break out the content and let me reorder it so that the same section is on the same page for every report.
This might be it.
dleeftink
Have a look at this notebook[0], not exactly what you're looking for but does provide a 'live' inspector of the various drawing operations contained in a PDF.
svat
Thanks, but I was not able to figure out how to get any use out of the notebook above. In what sense is it a 'live' inspector? All it seems to do is to just decompose the PDF into separate “ops” and “args” arrays (neither of which is meaningful without the other), but it does not seem “live” in any sense — how can one find the ops (and args) corresponding to a region of the PDF page, or vice-versa?
dleeftink
You can load up your own PDF and select a page up front after which it will display the opcodes for this page. Operations are not structurally grouped, but decomposed in three aligned arrays which can be grouped to your liking based on opcode or used as coordinates for intersection queries (e.g. combining the ops and args arrays).
The 'liveness' here is that you can derive multiple downstream cells (e.g. filters, groupings, drawing instructions) from the initial parsed PDF, which will update as you swap out the PDF file.
kbyatnal
"PDF to Text" is a bit simplified IMO. There's actually a few class of problems within this category:
1. reliable OCR from documents (to index for search, feed into a vector DB, etc)
2. structured data extraction (pull out targeted values)
3. end-to-end document pipelines (e.g. automate mortgage applications)
Marginalia needs to solve problem #1 (OCR), which is luckily getting commoditized by the day thanks to models like Gemini Flash. I've now seen multiple companies replace their OCR pipelines with Flash for a fraction of the cost of previous solutions, it's really quite remarkable.
Problems #2 and #3 are much more tricky. There's still a large gap for businesses in going from raw OCR outputs to document pipelines deployed in prod for mission-critical use cases. LLMs and VLMs aren't magic, and anyone who goes in expecting 100% automation is in for a surprise.
You still need to build and label datasets, orchestrate pipelines (classify -> split -> extract), detect uncertainty and correct with human-in-the-loop, fine-tune, and a lot more. You can certainly get close to full automation over time, but it's going to take time and effort. The future is definitely moving in this direction though.
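For illustration, the rough shape of such a pipeline, with stubs standing in for the model calls (all names and the confidence threshold are hypothetical):
from dataclasses import dataclass

@dataclass
class Extraction:
    doc_type: str
    fields: dict
    confidence: float

def classify(text: str) -> str:
    # stub: in practice a fine-tuned classifier or an LLM call
    return "invoice" if "invoice" in text.lower() else "other"

def extract(text: str, doc_type: str) -> Extraction:
    # stub: in practice a schema-constrained LLM/VLM extraction call
    return Extraction(doc_type, fields={}, confidence=0.42)

def process(docs: list[str], threshold: float = 0.8) -> list[Extraction]:
    results = []
    for text in docs:                        # the "split" step is elided: one string = one doc
        result = extract(text, classify(text))
        if result.confidence < threshold:    # uncertainty gate -> human review queue
            print("needs human review:", result)
        results.append(result)
    return results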
Disclaimer: I started an LLM doc processing company to help companies solve problems in this space (https://extend.ai)
miki123211
There's also #4, reliable OCR and semantics extraction that works across many diverse classes of documents, which is relevant for accessibility.
This is hard because:
1. Unlike a business workflow which often only deals with a few specific kinds of documents, you never know what the user is going to get. You're making an abstract PDF reader, not an app that can process court documents in bankruptcy cases in Delaware.
2. You don't just need the text (like in traditional OCR), you need to recognize tables, page headers and footers, footnotes, headings, mathematics etc.
3. Because this is for human consumption, you want to minimize errors as much as possible, which means not using OCR when not needed, and relying on the underlying text embedded within the PDF while still extracting semantics. This means you essentially need two different paths: one for when the PDF only consists of images, and one for when there are content streams you can get some information from (a rough sketch of that decision follows after this list).
3.1. But the content streams may contain different text from what's actually on the page, e.g. white-on-white text to hide information the user isn't supposed to see, or diacritics emulation with commands that manually draw acute accents instead of using proper unicode diacritics (LaTeX works that way).
4. You're likely running as a local app on the user's (possibly very underpowered) device, and likely don't have an associated server and subscription, so you can't use any cloud AI models.
5. You need to support forms. Since the user is using accessibility software, presumably they can't print and use a pen, so you need to handle the ones meant for printing too, not just the nice, spec-compatible ones.
This is very much an open problem and is not even remotely close to being solved. People have been taking stabs at it for years, but all current solutions suck in some way, and there's no single one that solves all 5 points correctly.
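A minimal sketch of the two-path decision from point 3 (PyMuPDF here; the character-count threshold is arbitrary, and this does nothing about the white-on-white problem in 3.1):
import fitz  # PyMuPDF

def page_needs_ocr(page, min_chars=20):
    # if the content streams yield essentially no text, treat the page as image-only
    return len(page.get_text("text").strip()) < min_chars

doc = fitz.open("input.pdf")                # hypothetical filename
for page in doc:
    if page_needs_ocr(page):
        pix = page.get_pixmap(dpi=300)      # rasterize the page for an OCR engine
        png_bytes = pix.tobytes("png")      # ...feed this to tesseract or similar
    else:
        text = page.get_text("text")        # trust the embedded text, still need to extract semantics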
noosphr
>replace their OCR pipelines with Flash for a fraction of the cost of previous solutions, it's really quite remarkable.
As someone who had to build custom tools because VLMs are so unreliable: anyone that uses VLMs for unprocessed images is in for more pain than all the providers which let LLMs without guard rails interact directly with consumers.
They are very good at image labeling. They are ok at very simple documents, e.g. single column text, centered single level of headings, one image or table per page, etc. (which is what all the MVP demos show). They need another trillion parameters to become bad at complex documents with tables and images.
Right now they hallucinate so badly that you simply _can't_ use them for something as simple as a table with a heading at the top, data in the middle and a summary at the bottom.
varunneal
I've been hacking away at trying to process PDFs into Markdown, having encountered similar obstacles to OP regarding header detection (and many other issues). OCR is fantastic these days but maintaining a global structure to the document is much trickier. Consistent HTML seems still out of reach for large documents. I'm having half-decent results with Markdown using multiple passes of an LLM to extract document structure and feeding it in contextually for page-by-page extraction.
dwheeler
The better solution is to embed, in the PDF, the editable source document. This is easily done by LibreOffice. Embedding it takes very little space in general (because it compresses well), and then you have MUCH better information on what the text is and its meaning. It works just fine with existing PDF readers.
lelandfe
The better solution to a search engine extracting text from existing PDFs is to provide advice on how to author PDFs?
What's the timeline for this solution to pay off?
chaps
Microsoft is one of the bigger contributors to this. Like -- why does excel have a feature to export to PDF, but not a feature to do the opposite? That export functionality really feels like it was given to a summer intern who finished it in two weeks and never had to deal with it ever again.
layer8
That’s true, but it also opens up the vulnerability of the source document being arbitrarily different from the rendered PDF content.
kerkeslager
That's true, but it's dependent on the creator of the PDF having aligned incentives with the consumer of the PDF.
In the e-Discovery field, it's commonplace for those providing evidence to dump it into a PDF purely so that it's harder for the opposing side's lawyers to consume. If both sides have lots of money this isn't a barrier, but for example public defenders don't have funds to hire someone (me!) to process the PDFs into a readable format, so realistically they end up taking much longer to process the data, which takes a psychological toll on the defendant. And that's if they process the data at all.
The solution is to make it illegal to do this: wiretap data, for example, should be provided in a standardized machine-readable format. There's no ethical reason for simple technical friction to be affecting the outcomes of criminal proceedings.
giovannibonetti
I wonder if AI will solve that
GaggiX
There are specialized models, but even generic ones like Gemini 2.0 Flash are really good and cheap; you can use them and embed the OCR text inside the PDF so the original content becomes indexable.
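For the embedding part specifically, the long-standing tesseract-based route (rather than Gemini) looks roughly like this with ocrmypdf (filenames hypothetical):
import ocrmypdf

# adds an invisible OCR text layer over the scanned page images; the images are untouched,
# but the file becomes searchable/indexable. skip_text leaves pages that already have text alone.
ocrmypdf.ocr("scanned.pdf", "searchable.pdf", language="eng", skip_text=True)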
yxhuvud
Sure, and if you have access to the source document the pdf was generated from, then that is a good thing to do.
But generally speaking, you don't have that control.
carabiner
I bet 90% of the problem space is legacy PDFs. My company has thousands of these. Some are crappy scans. Some have Adobe's OCR embedded, but most have none at all.
ramesh31
Edge cases in general.
It can be very tempting to look at the problem as a simple question of computation and data. There's a standard so it must be able to be implemented 1:1. But (and this is very similar to the web) baked into any rendering engine anyone actually uses are a million different little hacks and special conditionals built up over years of dealing with PDFs in the wild, that adds up to something which performs how users expect, not what is actually 100% formally correct.
patrick41638265
Good old https://linux.die.net/man/1/pdftotext and a little Python on top of its output will get you a long way if your documents are not too crazy. I use it to parse all my bank statements into an sqlite database for analysis.
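A minimal sketch of that kind of setup (the schema and any real parsing of dates/amounts are left out; filenames hypothetical):
import sqlite3
import subprocess

def load_statement(pdf_path, db_path="statements.db"):
    # -layout keeps the columnar layout, so one statement row stays on one text line
    text = subprocess.run(["pdftotext", "-layout", pdf_path, "-"],
                          capture_output=True, text=True, check=True).stdout
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS lines (file TEXT, line TEXT)")
    con.executemany("INSERT INTO lines VALUES (?, ?)",
                    [(pdf_path, l) for l in text.splitlines() if l.strip()])
    con.commit()
    con.close()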
ted_dunning
One of my favorite documents for highlighting the challenges described here is the PDF for this article:
https://academic.oup.com/auk/article/126/4/717/5148354
The first page is classic with two columns of text, centered headings, a text inclusion that sits between the columns and changes the line lengths and indentations for the columns. Then we get the fun of page headers that change between odd and even pages and section header conventions that vary drastically.
Oh... to make things even better, paragraphs don't get extra spacing and don't always have an indented first line.
Some of everything.
JKCalhoun
The API in CoreGraphics (MacOS) for PDF, at a basic level, simply presented the text, per page, in the order in which it was encoded in the dictionaries. And 95% of the time this was pretty good — and when working with PDFKit and Preview on the Mac, we got by with it for years.
If you stepped back you could imagine the app that originally had captured/produced the PDF — perhaps a word processor — it was likely rendering the text into the PDF context in some reasonable order from its own text buffer(s). So even for two columns, you would rather expect, and often found, that the text flowed correctly from the left column to the right. The text was therefore already in the correct order within the PDF document.
Now, footers, headers on the page — that would be anyone's guess as to what order the PDF-producing app dumped those into the PDF context.
1vuio0pswjnm7
Below is a PDF. It is a .txt file. I can save it with a .pdf extension and open it in a PDF viewer. I can make changes in a text editor. For example, by editing this text file, I can change the text displayed on the screen when the PDF is opened, the font, font size, line spacing, the maximum characters per line, number of lines per page, the paper width and height, as well as portrait versus landscape mode.
%PDF-1.4
1 0 obj
<<
/CreationDate (D:2025)
/Producer
>>
endobj
2 0 obj
<<
/Type /Catalog
/Pages 3 0 R
>>
endobj
4 0 obj
<<
/Type /Font
/Subtype /Type1
/Name /F1
/BaseFont /Times-Roman
>>
endobj
5 0 obj
<<
/Font << /F1 4 0 R >>
/ProcSet [ /PDF /Text ]
>>
endobj
6 0 obj
<<
/Type /Page
/Parent 3 0 R
/Resources 5 0 R
/Contents 7 0 R
>>
endobj
7 0 obj
<<
/Length 8 0 R
>>
stream
BT
/F1 50 Tf
1 0 0 1 50 752 Tm
54 TL
(PDF is)'
((a) a text format)'
((b) a graphics format)'
((c) (a) and (b).)'
()'
ET
endstream
endobj
8 0 obj
53
endobj
3 0 obj
<<
/Type /Pages
/Count 1
/MediaBox [ 0 0 612 792 ]
/Kids [ 6 0 R ]
>>
endobj
xref
0 9
0000000000 65535 f
0000000009 00000 n
0000000113 00000 n
0000000514 00000 n
0000000162 00000 n
0000000240 00000 n
0000000311 00000 n
0000000391 00000 n
0000000496 00000 n
trailer
<<
/Size 9
/Root 2 0 R
/Info 1 0 R
>>
startxref
599
%%EOF
swsieber
It can also have embedded binary streams. It was not made for text. It was made for layout and graphics. You give nice examples, but each of those lines could have been broken up into one call per character, or per word, even out of order.
1vuio0pswjnm7
"PDF" is an acronym for for "Portable Document Format"
"2.3.2 Portability
A PDF file is a 7-bit ASCII file, which means PDF files use only the printable subset of the ASCII character set to describe documents even those with images and special characters. As a result, PDF files are extremely portable across diverse hardware and operating system environments."
https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandard...
bartread
Yeah, getting text - even structured text - out of PDFs is no picnic. Scraping a table out of an HTML document is often straightforward even on sites that use the "everything's a <div>" (anti-)pattern, and especially on sites that use more semantically useful elements, like <table>.
Not so PDFs.
I'm far from an expert on the format, so maybe there is some semantic support in there, but I've seen plenty of PDFs where tables are simply a loose assemblage of graphical and text elements that, only when rendered, are easily discernible as a table because they're positioned in such a way that they render as a table.
I've actually had decent luck extracting tabular data from PDFS by converting the PDFs to HTML using the Poppler PDF utils, then finding the expected table header, and then using the x-coordinate of the HTML elements for each value within the table to work out columns, and extract values for each rows.
It's kind of grotty but it seems reliable for what I need. Certainly much more so than going via formatted plaintext, which has issues with inconsistent spacing, and the insertion of newlines into the middle of rows.
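Roughly that approach as a sketch: poppler's pdftohtml -xml emits coordinates per text fragment, which can be bucketed into rows by y-coordinate and ordered by x within each row (the row tolerance is an arbitrary assumption, and page boundaries are ignored for brevity):
import subprocess
import xml.etree.ElementTree as ET

def table_rows(pdf_path, row_tolerance=5):
    # pdftohtml -xml emits <text top=".." left=".." ..> fragments with page coordinates
    xml = subprocess.run(["pdftohtml", "-xml", "-stdout", pdf_path],
                         capture_output=True, text=True, check=True).stdout
    rows = {}
    for frag in ET.fromstring(xml).iter("text"):
        top, left = int(frag.get("top")), int(frag.get("left"))
        key = round(top / row_tolerance)            # fragments at roughly the same height = one row
        rows.setdefault(key, []).append((left, "".join(frag.itertext()).strip()))
    # sort rows top-to-bottom and cells left-to-right
    return [[cell for _, cell in sorted(cells)] for _, cells in sorted(rows.items())]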
hermitcrab
I am hoping at some point to be able to extract tabular data from PDFs for my data wrangling software. If anyone knows of a library that can extract tables from PDFs, can be integrated into a C++ app and is free or less than a few hundred $, please let me know!
yxhuvud
My favorite is (official, governmental) documents that have one set of text that is rendered, and a totally different set of text that you get if you extract the text the normal way.
j45
PDFs inherently are a markup / xml format, the standard is available to learn from.
It's possible to create the same PDF in many, many, many ways.
Some might lean towards exporting a layout containing text and graphics from a graphics suite.
Others might lean towards exporting text and graphics from a word processor, which is words first.
The lens of how the creating app deals with information is often something that has input on how the PDF is output.
If you're looking for an off the shelf utility that is surprisingly decent at pulling structured data from PDFs, tools like cisdem have already solved enough of it for local users. Lots of tools like this out there, many do promise structured data support but it needs to match what you're up to.
layer8
> PDFs inherently are a markup / xml format
This is false. PDFs are an object graph containing imperative-style drawing instructions (among many other things). There’s a way to add structural information on top (akin to an HTML document structure), but that’s completely optional and only serves as auxiliary metadata, it’s not at the core of the PDF format.
davidthewatson
Thanks for your comment.
Indeed. Therein lies the rub.
Why?
Because no matter the fact that I've spent several years of my latent career crawling and parsing and outputting PDF data, I see now that pointing my LLM stack at a directory of *.pdf just makes the invisible encoding of the object graph visible. It's a skeptical science.
The key transclusion may be to move from imperative to declarative tools or conditional to probabilistic tools, as many areas have in the last couple decades.
I've been following John Sterling's ocaml work for a while on related topics and the ideas floating around have been a good influence on me in forests and their forester which I found resonant given my own experience:
https://www.jonmsterling.com/index/index.xml
https://github.com/jonsterling/forest
I was gonna email john and ask whether it's still being worked on as I hope so, but I brought it up this morning as a way out of the noise that imperative programming PDF has been for a decade or more where turtles all the way down to the low-level root cause libraries mean that the high level imperative languages often display the exact same bugs despite significant differences as to what's being intended in the small on top of the stack vs the large on the bottom of the stack. It would help if "fitness for a particular purpose" decisions were thoughtful as to publishing and distribution but as the CFO likes to say, "Dave, that ship has already sailed." Sigh.
¯\_(ツ)_/¯
gibsonf1
We[1] create "Units of Thought" from PDFs and then work with those for further discovery, where a "Unit of Thought" is any paragraph, title, or note heading - something that stands on its own semantically. We then create a hierarchy of objects from that PDF in the database for search and conceptual search - all at scale.
[1] https://graphmetrix.com/trinpod-server https://trinapp.com
IIAOPSW
I'm tempted to try it. My use case right now is a set of documents which are annual financial and statutory disclosures of a large institution. Every year they are formatted / organized slightly differently which makes it enormously tedious to manually find and compare the same basic section from one year to another, but they are consistent enough to recognize analogous sections from different years due to often reusing verbatim quotes or highly specific key words each time.
What I really want to do is take all these docs and just reorder all the content such that I can look at page n (or section whatever) scrolling down and compare it between different years by scrolling horizontally. Ideally with changes from one year to the next highlighted.
Can your product do this?
Sharlin
Some of the unsung heroes of the modern age are the programmers who, through what must have involved a lot of weeping and gnashing of teeth, have managed to implement the find, select, and copy operations in PDF readers.
noosphr
I've worked on this in my day job: extracting _all_ relevant information from a financial services PDF for a BERT-based search engine.
The only way to solve that is with a segmentation model followed by a regular OCR model and whatever other specialized models you need to extract other types of data. VLMs aren't ready for prime time and won't be for a decade or more.
What worked was using doclaynet trained YOLO models to get the areas of the document that were text, images, tables or formulas: https://github.com/DS4SD/DocLayNet if you don't care about anything but text you can feed the results into tesseract directly (but for the love of god read the manual). Congratulations, you're done.
Here's some pre-trained models that work OK out of the box: https://github.com/ppaanngggg/yolo-doclaynet I found that we needed to increase the resolution from ~700px to ~2100px horizontal for financial data segmentation.
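A rough sketch of that segmentation-then-OCR flow using one of those checkpoints (the weights filename, page image, and class filter are assumptions):
from PIL import Image
from ultralytics import YOLO
import pytesseract

model = YOLO("yolov8-doclaynet.pt")   # a DocLayNet-trained checkpoint, e.g. from the repo above
page = Image.open("page.png")         # one page rendered at high DPI (~2100px wide, per above)

result = model(page)[0]
for box, cls in zip(result.boxes.xyxy.tolist(), result.boxes.cls.tolist()):
    label = result.names[int(cls)]
    if label in ("Text", "Title", "Section-header"):   # skip tables, pictures, formulas here
        x0, y0, x1, y1 = map(int, box)
        print(pytesseract.image_to_string(page.crop((x0, y0, x1, y1))))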
VLMs on the other hand still choke on long text and hallucinate unpredictably. Worse they can't understand nested data. If you give _any_ current model nothing harder than three nested rectangles with text under each they will not extract the text correctly. Given that nested rectangles describes every table no VLM can currently extract data from anything but the most straightforward of tables. But it will happily lie to you that it did - after all a mining company should own a dozen bulldozers right? And if they each cost $35.000 it must be an amazing deal they got, right?
elpalek
Recently tested a (non-English) PDF OCR with Gemini 2.5 Pro. First, I directly asked it to extract text from the PDF. Result: a random text blob, not usable.
Second, I converted the PDF into pages of JPG. Gemini performed exceptionally, with near-perfect text extraction and intact formatting in Markdown.
Maybe there's an internal difference when processing PDF vs JPG inside the model.
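For reference, the page-image route looks roughly like this (pdf2image plus the google-generativeai client; the model name and prompt are assumptions):
import google.generativeai as genai
from pdf2image import convert_from_path

genai.configure(api_key="...")                     # your API key
model = genai.GenerativeModel("gemini-2.0-flash")  # swap in whichever Gemini model you use

pages = convert_from_path("input.pdf", dpi=300)    # render each PDF page to a PIL image
for img in pages:
    resp = model.generate_content(["Extract the text of this page as Markdown.", img])
    print(resp.text)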
jagged-chisel
The model probably isn't rendering the PDF, just looking in the file for text.
Have any of you ever thought to yourself, this is new and interesting, and then vaguely remembered that you spent months or years becoming an expert at it earlier in life but entirely forgot it? And in fact large chunks of the very interesting things you've done just completely flew out of your mind long ago, to the point where you feel absolutely new at life, like you've accomplished relatively nothing, until something like this jars you out of that forgetfulness?
I definitely vaguely remember doing some incredibly cool things with PDFs and OCR about 6 or 7 years ago. Some project comes to mind... google tells me it was "tesseract" and that sounds familiar.