OpenDataLoader-PDF: An open source tool for structured PDF parsing
5 comments · September 23, 2025

clueless
Given the current LLM context size limitations, what is the state of the art for feeding large documents or text blobs into an LLM for accurate processing?
simonw
The current generation of models all support pretty long context now: the Gemini family has had 1M tokens for over a year, GPT-4.1 is 1M, interestingly GPT-5 is back down to 400,000, and Claude 4 is 200,000, but there's a mode of Claude Sonnet 4 that can do 1M as well.
The bigger question is how well they perform. There are needle-in-haystack benchmarks that test that, and the current models mostly score quite highly on them.
https://cloud.google.com/blog/products/ai-machine-learning/t... talks about that for Gemini 1.5.
Here are a couple of relevant leaderboards: https://huggingface.co/spaces/RMT-team/babilong and https://longbench2.github.io/
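Roughly, those benchmarks bury one distinctive fact inside a huge pile of filler text and check whether the model can pull it back out at different context lengths and depths. A minimal sketch of that kind of check, where ask_model is a hypothetical stand-in for whichever long-context model you're calling:

```python
# Minimal needle-in-a-haystack sketch. ask_model(prompt) is a hypothetical
# helper wrapping whatever long-context model/API you want to test.
import random

def build_haystack(needle: str, n_filler: int = 50_000, depth: float = 0.5) -> str:
    """Bury one distinctive sentence inside a long run of filler sentences."""
    filler = ["The sky was grey and nothing of note happened that day."] * n_filler
    filler.insert(int(len(filler) * depth), needle)
    return " ".join(filler)

needle = "The secret launch code is 7413."
prompt = (
    build_haystack(needle, depth=random.random())
    + "\n\nQuestion: what is the secret launch code? Answer with the number only."
)
answer = ask_model(prompt)  # hypothetical model call
print("retrieved" if "7413" in answer else "missed")
```

A real harness sweeps n_filler and depth to see where retrieval starts to fall off.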
clueless
Sorry, I should have been clearer: I meant open source LLMs. And I guess the question is how the closed source LLMs are doing it so well, and whether open source OpenNote is the best we have...
lysecret
I generally use 2.5 Flash for this and it works incredibly well. So many traditionally hard things can now be solved by stuffing them into a pretty cheap LLM, haha.
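For what it's worth, a minimal sketch of the "just stuff it into Flash" approach using the google-genai Python SDK might look like the following; the file name and prompt are just examples, and it assumes an API key is set in the environment:

```python
# Sketch only: stuff an entire pre-extracted document into Gemini 2.5 Flash
# and ask a question about it. Assumes the google-genai SDK is installed and
# an API key is available via the environment.
from google import genai

client = genai.Client()  # picks up the API key from the environment

with open("big_report.txt") as f:  # hypothetical pre-extracted text
    document = f.read()

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=f"{document}\n\nSummarise the key findings in five bullet points.",
)
print(response.text)
```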
I've been thinking lately that maybe we need a new AI-friendly file format rather than continuing to hack on top of PDF's complicated spec. PDF was designed for consistent, portable page rendering; being easily parseable was never a goal, as far as I know, which is why we have to jump through these crazy hoops. If you've ever looked at how text is stored internally in a PDF, this becomes immediately obvious.
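To make that concrete, here's a rough, hand-written fragment of what a page's content stream can look like once decompressed (simplified: real streams are usually Flate-compressed and route bytes through font encodings), together with the kind of naive extraction that goes wrong:

```python
# Illustrative only: a hand-written fragment of a PDF content stream showing
# how one visual line of text is split across several show-text fragments with
# positioning adjustments in between.
import re

content_stream = b"""
BT
/F1 12 Tf
72 708 Td
[(H) 120 (el) -15 (lo,) -250 (wor) 10 (ld)] TJ
ET
"""

# Naive extraction: pull the literal strings out of the TJ array and join them.
# Note there is no space character between "Hello," and "world" -- the gap is
# just a positioning number (-250), which is why extractors need heuristics to
# decide where word boundaries fall.
fragments = re.findall(rb"\(([^)]*)\)", content_stream)
print(b"".join(fragments).decode("latin-1"))  # -> "Hello,world"
```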
I've been toying with the idea of a new format that stores text naturally and captures semantics (e.g. to help with table parsing), but also preserves formatting rules so you can still achieve fairly consistent rendering. This format could be easily converted to PDF, although the opposite conversion would still face the usual challenges. The main challenge, of course, is distribution.
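Just to sketch the shape of the idea (entirely hypothetical; every name and field here is invented for illustration), such a format might separate content from presentation roughly like this:

```python
# Purely illustrative sketch of a semantics-first document model: content is
# plain text with structural roles, tables keep their logical grid, and
# formatting lives in separate, optional style rules a parser or LLM can ignore.
document = {
    "blocks": [
        {"role": "heading", "level": 1, "text": "Quarterly results"},
        {"role": "paragraph", "text": "Revenue grew 12% year over year."},
        {
            "role": "table",
            "header": ["Region", "Revenue"],
            "rows": [["EMEA", "4.1M"], ["APAC", "3.2M"]],
        },
    ],
    "styles": {  # rendering hints only, not needed to read the content
        "heading": {"font": "Inter", "size_pt": 18, "weight": "bold"},
        "paragraph": {"font": "Inter", "size_pt": 11, "line_height": 1.4},
    },
}
```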