DeepSeek OCR
31 comments
October 20, 2025
looobay
LLMs are compute heavy, with compute that scales quadratically in the number of tokens. They are trying to compress text tokens into visual tokens with their VLM.
Maybe they would render text to an image before tokenizing to reduce the compute cost.
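Back-of-the-envelope (all numbers below are illustrative assumptions, taking a standard O(n²·d) self-attention cost and the paper's ~10× token reduction at face value):

```python
# Toy comparison: self-attention FLOPs grow ~quadratically with sequence
# length, so a 10x shorter sequence is ~100x cheaper per attention layer.
# Sequence length, d_model, and the 10x ratio are illustrative assumptions.

def attention_flops(n_tokens: int, d_model: int = 4096) -> int:
    """Rough per-layer self-attention cost: O(n^2 * d)."""
    return n_tokens ** 2 * d_model

text_tokens = 5000                     # a long document as text tokens
vision_tokens = text_tokens // 10      # same content rendered and encoded as vision tokens

print(attention_flops(text_tokens) / attention_flops(vision_tokens))  # ~100.0
```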
krackers
But naively wouldn't you expect the representation of a piece of text in terms of vision tokens to be roughly the same number of bits (or more) as its representation in text tokens? You're changing representation, sure, but that by itself doesn't give you any compute advantage unless there is some sparsity/compressibility you can take advantage of in the domain you transform to, right?
So I guess my question is: where is the juice being squeezed from? Why does the vision-token representation end up being more efficient than text tokens?
f33d5173
Vision is how humans see text. So text must have built-in adaptations to protect it from visual noise. For example, two words that look similar must never appear in similar contexts, or else they would be conflated. Hence we can safely reduce such words to the same token. Or something like that.
imjonse
I wonder if text written using Chinese characters is more compatible with such vision-centric compression than Latin text.
looobay
Vision tokens are a good compression medium because one vision token is a single vector of N elements, whereas the same content as text tokens takes M vectors of N elements: one vision token covers many pixels (and possibly multiple words). This is why it's a good compression medium for compute.
It will never be as precise as text tokens, but it can be really good, as they show in the paper.
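A rough version of that counting argument, with every number an illustrative assumption rather than the paper's actual configuration:

```python
# One rendered page: how many vision tokens vs. text tokens?
# Resolution, patch size, merge factor, and text length are all assumptions.

page_px = 640 * 640       # rendered page area in pixels
patch_px = 16 * 16        # pixels covered by one ViT-style patch
merge = 16                # encoder merges patches into fewer output tokens

vision_tokens = page_px // patch_px // merge   # 100 vectors fed to the decoder
text_tokens = 1000                             # same page as BPE text tokens

# Both kinds of token are one d-dimensional vector inside the model,
# so the decoder attends over a ~10x shorter sequence.
print(text_tokens / vision_tokens)             # 10.0
```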
yoran
How does an LLM approach to OCR compare to say Azure AI Document Intelligence (https://learn.microsoft.com/en-us/azure/ai-services/document...) or Google's Vision API (https://cloud.google.com/vision?hl=en)?
ozgune
OmniAI has a benchmark that compares LLMs to cloud OCR services.
https://getomni.ai/blog/ocr-benchmark (Feb 2025)
Please note that LLMs have progressed at a rapid pace since February. We see much better results with the Qwen3-VL family, particularly Qwen3-VL-235B-A22B-Instruct for our use case.
sandblast
Not sure why you're being downvoted, I'm also curious.
x______________
>先天下之忧而忧
How is this an example of a prompt? Google translated this to "Worry about the world first" while Bing says "Worry before the worries of the world."
Can anyone shed some light on this saying or why it's in the article?
gudzpoz
This clause is usually used together with the next sentence in the original poem:
> 先天下之忧而忧,后天下之乐而乐
> (put the world's worries before yours, and put your happiness after the world's)
Since the model is a language model, they probably use this to demonstrate the model's language capabilities – the model should be able to complete the whole sentence pair. The paper also mentions this:
> To ensure the model’s language capabilities, we introduced 10% of in-house text-only pretrain data.
So I believe it is just a text-only demonstration.
raincole
It's a very famous (classical) Chinese phrase.
Neither translation captures the meaning well, though. It means: "worry before the rest of the world (notices it has something to) worry." The next part is 後天下之樂而樂 ("be happy only after the rest of the world is happy").
I don't know why it's a prompt example.
fspeech
Google is closer. This is from a famous essay expressing the author's desire to bear the burden for the world. The essay is 岳阳楼记 by 范仲淹, written in 1046: https://zh.wikisource.org/zh-hans/%E5%B2%B3%E9%99%BD%E6%A8%9...
SequoiaHope
Ask a language model - ChatGPT says it’s a line from a famous poem “Memorial to Yueyang Tower” which expresses the Confucian ideal of selfless concern for people and society.
piker
This looks really cool for prototyping and playing around.
It seems to me, though, that if one is building a modern application that needs to get image segmentation and/or text recognition right, there are better APIs available than natural language? It seems like a lot of effort to build a production-scale CV application only to weigh it down with all of an LLM's shortcomings. Not a field I'm familiar with, but I would assume this doesn't produce state-of-the-art results; that would change the analysis.
randomNumber7
Imagine you're building an image segmentation model for, e.g., a specific industrial application.
With this LLM approach you can at least create your training data from the raw images using natural language.
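Something like the following, roughly. It assumes a VLM served behind an OpenAI-compatible endpoint (e.g. vLLM); the URL, model name, prompt, and output format are made-up placeholders, not anything from the paper:

```python
# Hedged sketch: pseudo-label raw industrial images with a VLM behind an
# OpenAI-compatible endpoint, then train a small dedicated model on the
# labels. Endpoint URL, model name, prompt, and schema are all assumptions.
import base64
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")  # assumed local server

def pseudo_label(image_path: str) -> list:
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="local-vlm",  # assumption: whatever VLM the server exposes
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Return JSON: [{\"label\": str, \"bbox\": [x1, y1, x2, y2]}] "
                         "for every readable label or part number in the image."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    # Naive parse; real code should validate the JSON and the bounding boxes.
    return json.loads(resp.choices[0].message.content)
```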
piker
That does make sense
CloseChoice
It's DeepSeek, so one can expect an open-source license, but for anyone (like me) who wants to see that explicitly, since it's not obvious in the GitHub repo: https://huggingface.co/deepseek-ai/DeepSeek-OCR/blob/main/LI...
TLDR: It's MIT licensed
AndroTux
> since it's not obvious in the GitHub repo
Literally says MIT license on the right sidebar and in the readme tab and in the file called LICENSE
k_sze
It's interesting how they use "Gundam" in their variant names. I gather that Gundam-M and Gundam are their most powerful ones.
brightUiso
Please, a bit of education: what does it do?
farseer
How good is this compared to most commercial OCR software?
ozim
Any vision model is better than commercial OCR software.
Etheryte
I'm not really sure that's an accurate summary of the state of the art; [0] is a better overview. In short, SOTA multi-modal LLMs are the best option for handwriting, nearly anything is good at printed text, and for printed media, specialty models from hyperscalers are slightly better than multi-modal LLMs.
empressplay
This could be great for extracting text from old magazines; traditional OCR gives you a bit of a mess you have to clean up, but this looks like it can properly identify columns and track the flow accurately (and extract images!). It appears it can convert magazine layouts to markdown too.
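If anyone wants to try that, here is a rough sketch of loading the released weights with transformers. The model id is real, but the `infer` call and its arguments are an assumption about the repo's custom code; check the model card for the exact interface before copying this:

```python
# Rough sketch only. The model id is real; the inference call below is an
# assumption about the repo's custom code -- check the model card for the
# exact method name and arguments.
from transformers import AutoModel, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-OCR"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True).cuda().eval()  # assumes a GPU

# Assumption: the custom code exposes an `infer` helper that takes a prompt
# and an image path and writes markdown to the output directory.
model.infer(
    tokenizer,
    prompt="<image>\nConvert the document to markdown.",
    image_file="magazine_page.png",
    output_path="out/",
)
```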
bugglebeetle
Looks great, but looking at the benchmark, I can't help but think about how crazy good dots-ocr is as a model. Too bad they're not as open as the DeepSeek team, because it's so crazy good and I would love to know how it was trained.
rfoo
If you look, you'll notice that it's the same Haoran Wei behind DeepSeek-OCR and GOT-OCR2.0 :p
bethekind
Did we read the same graph? DeepSeek Gundam (200 dpi) appeared to get similar performance to dots-ocr, but with fewer tokens needed. The x-axis is inverted, descending with distance from the origin.
krackers
The paper is more interesting than just another VLM for OCR; they start talking about compression and stuff. E.g. there is this quote:
>Our work represents an initial exploration into the boundaries of vision-text compression, investigating how many vision tokens are required to decode text tokens. The preliminary results are encouraging: DeepSeek-OCR achieves near-lossless OCR compression at approximately 10× ratios, while 20× compression still retains 60% accuracy.
(I guess you could say a picture token is worth 10 textual tokens...)
Could someone explain to a noob what the information-theoretic intuition is here? Why does this work? Is it that text tokens are still too "granular"/repetitive and don't come close to ideal entropy coding? Or is switching to vision tokens escaping the limitation of working "one word-ish at a time", allowing you to get closer to the entropy (similar to the way arithmetic coding does compared to Huffman codes)?
And then they start talking about handling long context by literally(?) downscaling images, forming a correspondence between information loss in the textual domain and the image domain.
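The downscaling part makes sense if you think of vision tokens as scaling with image area: shrink a rendered page by a factor of s and its token count drops by roughly s². Patch size and resolutions below are just illustrative:

```python
# Vision-token count scales with image area, so downscaling a rendered page
# by a factor s cuts its token count by ~s^2. Patch size and resolutions
# here are illustrative assumptions.
def n_vision_tokens(width: int, height: int, patch: int = 16) -> int:
    return (width // patch) * (height // patch)

full = n_vision_tokens(1024, 1024)   # recent context, kept sharp
half = n_vision_tokens(512, 512)     # older context, downscaled 2x

print(full, half, full / half)       # 4096 1024 4.0 -> 4x fewer tokens
```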