DeepSeek-Prover-V2
55 comments
·April 30, 2025islewis
bearjaws
It's actually pretty hilarious how far into detail they can go.
For example, I made a bot that you could give it a problem statement, and then it would return an array of steps to accomplish it.
Then you could take the steps, and click on them to break them down and add them to the list. If you just kept clicking you would get to excruciating detail.
For example taking out the trash can become over ~70 individual steps if you really drill into the details.
Some of the steps:
Stand close to the trash can – Position yourself so you have stable footing and easy access.
Place one hand on the rim of the can – Use your non-dominant hand to press lightly on the edge of the trash can to hold it in place.
Grip the top edge of the bag with your other hand – Find the part of the bag that extends past the rim.
Gently lift the bag upward – While your one hand stabilizes the can, slowly pull the bag up with the other.
Tilt the can slightly if needed – If the bag sticks or creates suction, rock or tilt the can slightly while continuing to lift.
Avoid jerking motions – Move steadily to prevent tears or spills
eightysixfour
This used to be part of one of the intro to engineering courses at my school - write an XX page document describing how to make a peanut butter and jelly sandwich.
lugu
This is how I imagine llms are used in robotics, with one or two more levels of description.
thrance
This feels like a manual for infiltrated aliens: "How to pass as humans, Vol. I"
roywiggins
or for goblins:
criley2
Imo current models can already break things up into bite sized pieces. The limiter I've seen is twofold
1) Maintaining context of the overall project and goals while working in the weeds on a subtask of a task on an epic (so to speak) both in terms of what has been accomplished already and what still needs to be accomplished
and 2) Getting an agentic coding tool which can actually handle the scale of doing 50 small projects back to back. With these agentic tools I find they start projects off really strong but by task #5 they're just making a mess with every change.
I've played with keeping basically a dev-progress.md file and implementation-plan.md file that I keep in context for every request and end each task by updating files. But me manually keeping all this context isn't solving all my problems.
And all the while, tools like Cline are gobbling up 2M tokens to make small changes.
jhrmnn
> Maintaining context of the overall project and goals while working in the weeds on a subtask of a task on an epic (so to speak) both in terms of what has been accomplished already and what still needs to be accomplished
This is a struggle for every human I’ve ever worked with
pertymcpert
Yes. I wonder if the path forward will be to create systems of agents that work as a team, with an "architect" or "technical lead" AI directing the work of more specialized execution AIs. This could alleviate the issue of context pollution as the technical lead doesn't have to hold all of the context when working on a small problem, and vice versa.
Shit. Do we need agile AI now?
cadamsdotcom
And it should be powerful for breaking down reasoning chains of thought too.
qoez
The best part about these is that I know the weights are static so I know I won't have to deal with a sassy unusable update for a week suddenly.
Implicated
Or, like with Claude, it being effectively lobotomized during north american 'business' hours. 3am PST? Cracked. 8am PST? ... mentally challenged.
navanchauhan
This is pretty interesting. Do you have more information about this?
devoutsalsa
I’m pretty sure the parent comment is referring to capacity constraints. When the Americans come online in the morning, Claude frequently can’t keep up with demand and error messages saying the system is at capacity are common.
ekez
I wonder if the authors have tried incorporating error feedback from Lean into their models.
Work from 2023 [1] showed general purpose models did better when they were able to incorporate error feedback, humans incorporate error feedback, but none of the SOTA models on minif2f seem to.
jasonjmcghee
Super interesting that they chose 671B and 7B. no like 32B which feels like a "sweet spot"
SweetSoftPillow
I feel like this is very logical way to do things. Test hypothesis on small model, play around, get it working, apply findings to big model.
bredren
Also notable is the earliest planning for a positive reception release of a new model might include both parameter-based and skill type market segmentation.
--> "In an increasingly crowded field of LLMs, how will our (costly to produce) model stand out?"
simianwords
related: I imagine in the future we might several "expert" LLM's and a wrapper can delegate tasks as needed as if it were a "tool". That way we can have segregation of expertise - each individual model can excel at one single thing.
A prover model might be used as a tool in the coming future.
Arcuru
For a concrete example today, see https://openrouter.ai/openrouter/auto
simianwords
thats nice but imagine first having models that are expert in specific domains. routing seems to be the easy part (just feed the available models as tools to your wrapper LLM)
samvaran
Is that not what MoE models already do?
oofbaroomf
No. Each expert is not separately trained, and while they may store different concepts, they are not meant to be different experts in specific domains. However, there are certain technologies to route requests to different domain expert LLMs or even fine-tuning adapters, such as RouteLLM.
woah
Why do you think that a hand-configured selection between "different domains" is better than the training-based approach in MoE?
retinaros
That might already happen behind what they call test time compute
AlexCoventry
MoE models route each token, in every transformer layer, to a set of specialized feed-forward networks (fully-connected perceptrons, basically), based on a score derived from the token's current representation.
koakuma-chan
> related: I imagine in the future we might several "expert" LLM's and a wrapper can delegate tasks as needed as if it were a "tool". That way we can have segregation of expertise - each individual model can excel at one single thing.
In the future? I'm pretty sure people do that already.
simianwords
No I disagree. I would want ChatGPT to abstract away expert models - biochemistry model, coding model, physics model and maybe O3 would use these models as tools to come up with an answer.
The point being that a separate expert model would be better at its own field than a single model that tries to be good at everything. Intuitively it makes sense, in practice I have seen anecdotes where finetuning a small model on domain data makes the model lose coherence on other topics.
koakuma-chan
> have seen anecdotes where finetuning a small model on domain data makes the model lose coherence on other topics
This is expected behaviour.
energy123
It's crudely done though.
kratom_sandwich
Mistrals model is a mixture-of-experts model
someguy101010
The No Free Lunch Theorem implies that something like this is inevitable https://en.wikipedia.org/wiki/No_free_lunch_in_search_and_op...
repsilat
A system of n experts is no different to a single expert wrt the NFLT. The theorem is entirely indifferent to (ie "equally skeptical of") the idea.
null
mcshicks
You can run the model for free on openrouter.ai. I have played around with lean, slowly working my way through mathmatics in lean. I was stuck on this problem in section 3.6
theorem convergesTo_unique {s : ℕ → ℝ} {a b : ℝ} (sa : ConvergesTo s a) (sb : ConvergesTo s b) :
For fun I tried it on the free model on openrouter.ai. Got the answer the first time.
https://leanprover-community.github.io/mathematics_in_lean/m...
Here's the answer just to give you a feel.
by_contra h
have h₁ : a ≠ b := h
have h₂ : |a - b| > 0 := by
apply abs_pos.mpr
exact sub_ne_zero.mpr h₁
-- Use the definition of convergence to find N₁ and N₂
have h₃ := sa (|a - b| / 2) (by linarith)
have h₄ := sb (|a - b| / 2) (by linarith)
cases' h₃ with N₁ h₃
cases' h₄ with N₂ h₄
-- Choose N to be the maximum of N₁ and N₂
let N := max N₁ N₂
have h₅ := h₃ N (by simp [N, le_max_left])
have h₆ := h₄ N (by simp [N, le_max_right])
-- Derive a contradiction using the triangle inequality
have h₇ : |s N - a| < |a - b| / 2 := by simpa using h₅
have h₈ : |s N - b| < |a - b| / 2 := by simpa using h₆
have h₉ : |a - b| < |a - b| := by
calc
|a - b| = |a - s N + (s N - b)| := by ring_nf
_ ≤ |a - s N| + |s N - b| := by
apply abs_add
_ = |s N - a| + |s N - b| := by
rw [abs_sub_comm]
_ < |a - b| / 2 + |a - b| / 2 := by
linarith
_ = |a - b| := by ring
linarith
amelius
Is someone working on similar ideas for compiler back-ends?
Fokamul
>"an open-source large language model"
Is it really opensource? Something changed?
eightysixfour
Everyone calls "open weight" models "open source" at this point, it is wrong, but we'll have to find another way to fight this fight. Maybe "open data and pipeline" or something.
whatshisface
How much education would a human need to perform at this level on the benchmarks?
pama
Learning to formalize math to then prove Putnam competition problems rigorously in Lean would require you to have mid-to-advanced college level math and CS background. (Learning to do a small fraction of the Putnam competition without using Lean probably only needs strong highschool math and early undergrad math, with training for competitions a strong bonus.)
null
revskill
The way intelligence works to me, is more about:
- Making correct and smart assumption. Currently all LLM bots are too stupid at making good assumptions. I don't want to explicitly repeat and repeat again my own assumptions while the context is clear enough. Hey bots, try harder.
- LLM bot needs to bring their own secondary and contextual memory in reasoning, i don't want to do it for you, ok ? You're the bot.
- Thinking out of the box. This is the final stage of intelligence. Adapt old technique to make your own technique to solve non-existing before problems.
nthingtohide
I propose human-AI interaction data must be made public. This is our collective wikipedia of AI era. Otherwise our progress will be blank line after 2022. Just as Egyptians didn't write down process to move giant rocks.
szundi
[dead]
smusamashah
That Putnam bench graph (middle one) is showing 49/658 solve rate.
> The resulting model, DeepSeek-Prover-V2-671B, achieves state-of-the-art performance in neural theorem proving, reaching 88.9% pass ratio on the MiniF2F-test and solving 49 out of 658 problems from PutnamBench.
Which is 0.07% (edit: 7%) for PutnamBench
darkmighty
49/658 is 7%
smusamashah
Sorry, forgot multiply by 100
booi
I bet DeepSeek-Prover-V2 wouldn't have made that mistake
gallerdude
classic human hallucination
HappyPanacea
How likely is it that Putnam answers were in DeepSeek's training data?
EvgeniyZh
The solutions weren't published anywhere. There is also no good automatic way to generate solutions as far as I know, even expensive ones (previous sota was 10 solutions and one before was 8 using pass@3200 for 7b model). Potentially the developers could've paid some people who are good in putnam-level math problems and lean to write solutions for LLMs. It is hard to estimate likelihood of that but it sounds like waste of money given relatively marginal problem/benchmark.
HappyPanacea
AoPS seems to have a forum dedicated to Putnam (including 2024): https://artofproblemsolving.com/community/c3249_putnam and here is a pdf with solutions to Putnam 2023: https://kskedlaya.org/putnam-archive/2023s.pdf
> The cold-start training procedure begins by prompting DeepSeek-V3 to decompose complex problems into a series of subgoals
It feels pretty intuitive to me that the ability for an LLM to break a complex problem down into smaller, more easily solvable pieces will unlock the next level of complexity.
This pattern feels like a technique often taught to junior engineers- how to break up a multi-week project into bitesized tasks. This model is obviously math focused, but I see no reason why this wouldn't be incredibly powerful for code based problem solving.