Launch HN: Pulse (YC S24) – Production-grade unstructured document extraction
canadiantim
Can you increase correctness by giving examples to the model? And key terms or nouns expected?
mikert89
AI models will do all this natively
ritvikpandey21
we disagree! we've found llms by themselves aren't enough and suffer from pretty big failure modes like hallucination and inferring text rather than pure transcription. we wrote a blog about this [1]. the right approach so far seems to be a hybrid workflow that uses very specific parts of the language model architecture.
mritchie712
> Why LLMs Suck at OCR
I paste screenshots into claude code every day and it's incredible. As in, I can't believe how good it is. I send a screenshot of console logs, a UI, and some HTML elements and it just "gets it".
So saying they "Suck" makes me not take your opinion seriously.
mikert89
they need to convince customers it's what they need
asdev
How is this different from Extend (also YC)?
ritvikpandey21
we're more focused on the core extraction layer itself rather than workflow tooling. we train our own vision models for layout detection, ocr, and table parsing from scratch. the key thing for us is determinism and auditability, so outputs are reproducible run over run, which matters a lot for regulated enterprises.
aryan1silver
looks really cool, congrats on the launch! are you guys using something similar to docling [https://github.com/docling-project/docling]?
rtaylorgarlock
Has docling improved? I had a bit of a nightmare integrating a docling pipeline earlier this year. Docs said it was VLM-ready, which I spent lots of hours finding out was not true, only to find a relevant GitHub issue that would've saved me a ton of hours :/ Allegedly fixed, but wow, that burned me big time.
ritvikpandey21
our team has tested docling pretty extensively. it works well for simpler text-heavy docs without complex layouts, but the moment you introduce tables or multi-column content it doesn't maintain layout well.
throw03172019
Congrats on the launch! We have been using this for a new feature we are building in our SaaS app. Its results were better than Datalab in our tests, especially in the handwriting category.
vikp
Hi, I'm a founder of Datalab. I'm not trying to take away from the launch (congrats), just wanted to respond to the specific feedback.
I'm glad you found a solution that worked for you, but this is pretty surprising to hear - our new model, chandra, saturates handwriting-heavy benchmarks like this one - https://www.datalab.to/blog/saturating-the-olmocr-benchmark - and our production models are more performant than OSS.
Did you test some time ago? We've made a bunch of updates in the last couple of months. Happy to issue some credits if you ever want to try again - vik@datalab.to.
sidmanchkanti21
Thanks for testing! Glad the results work well for you.
ritvikpandey21
thanks! appreciate the kind words
sidcool
Congrats on launching. Seems very interesting.
Hi HN, we're Sid and Ritvik, co-founders of Pulse. Pulse is a document extraction system that produces LLM-ready text. We built Pulse after realizing that modern vision language models are very good at producing plausible text, and that is exactly what makes them risky for OCR and data ingestion at scale.
When we started working on document extraction, we assumed the same thing many teams do today: foundation models were improving quickly, multimodal systems appeared to read documents well, and for small or clean inputs that assumption often held. The limitations showed up once we began processing real documents in volume. Long PDFs, dense tables, mixed layouts, low-fidelity scans, and financial or operational data exposed errors that were subtle, hard to detect, and expensive to correct. Outputs often looked reasonable while containing small but meaningful mistakes, especially in tables and numeric fields.
A lot of our work since then has been applied research. We run controlled evaluations on complex documents, fine-tune vision models, and build labeled datasets where ground truth actually matters. There have been many nights where our team stayed up hand-annotating pages, drawing bounding boxes around tables, labeling charts point by point, or debating whether a number was unreadable or simply poorly scanned. That process shaped our intuition far more than benchmarks alone.
One thing became clear quickly. The core challenge was not extraction itself, but confidence. Vision language models embed document images into high-dimensional representations optimized for semantic understanding rather than precise transcription. That process is inherently lossy. When uncertainty appears, models tend to resolve it using learned priors instead of surfacing ambiguity. This behavior can be helpful in consumer settings. In production pipelines, it creates verification problems that do not scale well.
Pulse grew out of trying to address this gap through system design rather than prompting alone. Instead of treating document understanding as a single generative step, the system separates layout analysis from language modeling. Documents are normalized into structured representations that preserve hierarchy and tables before schema mapping occurs. Extraction is constrained by schemas defined ahead of time, and extracted values are tied back to source locations so uncertainty can be inspected rather than guessed away. In practice, this results in a hybrid approach that combines traditional computer vision techniques, layout models, and vision language models, because no single approach handled these cases reliably on its own.
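To make this concrete, here is a minimal sketch (in Python) of what schema-constrained output with source locations can look like on the consumer side. The field names, data shape, and threshold below are purely illustrative assumptions for the example, not our actual schema or API.

    # Hypothetical illustration only -- field names, values, and the threshold
    # are assumptions made up for this example, not Pulse's real schema or API.

    # A schema defined ahead of time: extraction may only fill these fields.
    RENT_ROLL_SCHEMA = {
        "unit_number": "string",
        "tenant_name": "string",
        "monthly_rent": "number",
        "lease_end_date": "date",
    }

    # One extracted field: the value is tied back to where it came from
    # (page + bounding box) and carries a confidence score, so uncertainty
    # is surfaced instead of being resolved silently by a learned prior.
    extracted = {
        "monthly_rent": {
            "value": 2450.00,
            "page": 3,
            "bbox": [112, 540, 198, 558],  # x0, y0, x1, y1 in page coordinates
            "confidence": 0.62,
        },
    }

    REVIEW_THRESHOLD = 0.90  # cutoff chosen by the consuming pipeline

    def route_fields(record, threshold=REVIEW_THRESHOLD):
        """Split extracted fields into trusted values and ones needing review."""
        trusted, needs_review = {}, {}
        for field, payload in record.items():
            if payload["confidence"] >= threshold:
                trusted[field] = payload["value"]
            else:
                # keep page/bbox so a reviewer can jump straight to the source
                needs_review[field] = payload
        return trusted, needs_review

    trusted, needs_review = route_fields(extracted)
    print(needs_review)  # monthly_rent is flagged for review, with its source location attached

The specifics will differ, but the point is the shape: every extracted value carries enough provenance and confidence information that a downstream system can decide what to trust automatically and what to route to a human reviewer.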
We are intentionally sharing a few documents that reflect the types of inputs that motivated this work. These are representative of cases where we saw generic OCR or VLM-based pipelines struggle.
Here is a financial 10K: https://platform.runpulse.com/dashboard/examples/example1
Here is a newspaper: https://platform.runpulse.com/dashboard/examples/example2
Here is a rent roll: https://platform.runpulse.com/dashboard/examples/example3
Pulse is not perfect, particularly on highly degraded scans or uncommon handwriting, and there is still room for improvement. The goal is not to eliminate errors entirely, but to make them visible, auditable, and easier to reason about.
Pulse is available via usage-based access to the API and platform. You can try it here and access the API docs here.
Demo link here: https://video.runpulse.com/video/pulse-platform-walkthrough-...
We’re interested in hearing how others here evaluate correctness for document extraction, which failure modes you have seen in practice, and what signals you rely on to decide whether an output can be trusted. We will be around to answer questions and are happy to run additional documents if people want to share examples.