Skywork-OR1: new SOTA 32B thinking model with open weight
30 comments
· April 13, 2025 · byefruit
israrkhan
Agreed. Also, their name makes it seem like it is a totally new model.
If they needed to assign their own name to it, they could at least have included the parent (and grandparent) model names in it.
Just like the name DeepSeek-R1-Distill-Qwen-7B clearly says that it is a distilled Qwen model.
qeternity
DeepSeek probably would have done this anyway, but they did release a Llama 8B distillation, and the Meta terms of use require any derivative works to have Llama in the name. So it might also have just made sense to do it for all of them.
Otoh, there aren't many frontier labs that have actually done finetunes.
diggan
> the Meta terms of use require any derivative works to have Llama in the name
Technically it requires the derivatives to begin with "llama". So "DeepSeek-R1-Distill-Llama-8B" isn't OK by the license, while "Llama-3_1-Nemotron-Ultra-253B-v1" would be OK.
> [...] If you use the Llama Materials or any outputs or results of the Llama Materials to create, train, fine tune, or otherwise improve an AI model, which is distributed or made available, you shall also include “Llama” at the beginning of any such AI model name.
I've previously written a summary that includes all parts of the license that I think others are likely to have missed: https://notes.victor.earth/youre-probably-breaking-the-llama...
lumost
I suspect we'll see a lot of variations on this. With the open models catching up to SOTA and the foundation models being relatively static, there will be many new SOTAs built off of existing foundation models.
How many of the latest databases are postgres forks?
adamkochanowicz
Also, am I reading that right? They trained it not just on top of another model, and not just one that is itself distilled from another model, but one that is much lower in parameters (7B)?
rahimnathwani
They took the best available models for the architecture they chose (in two sizes) and fine-tuned those models with additional training data. They don't say where they got that training data, or what combination of SFT and/or RLHF they used. It's likely that the training data was generated by larger models.
GodelNumbering
This has been happening a lot on r/LocalLLaMA for a few months now. Big headline claims followed by "oh yeah, it's a finetune".
chvid
How is the score on AIME2024 relevant if AIME2024 has been used to train the model?
nyrikki
That is pretty much a universal problem. If you look at the problems anyone's model has solved, they are all well represented in the corpus.
Remember that AIME is intended for high schoolers with just pencils, erasers, rulers, and compasses to solve in 3 hours. There is an entire industry providing supplementary material to prepare students for concepts that are not directly covered in typical school material.
Since the various blogs and practice tests that pull from previous years make it into all the common sources like Stack Overflow/Exchange, Reddit, etc., their explicitly stating that they trained on pre-2024 AIME problems doesn't change much.
Basically expect any model to train on all AIME problems available before their knowledge cutoff date.
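For what it's worth, a rough way to look for this kind of contamination is a verbatim n-gram overlap check between benchmark problems and a sample of training text. A minimal sketch, assuming you have both as plain strings (the n-gram size and the reading of the score are arbitrary illustrative choices):

```python
def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """All n-word shingles in a text, lowercased."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def contamination_score(problem: str, corpus_docs: list[str], n: int = 8) -> float:
    """Fraction of the problem's n-grams that appear verbatim in the corpus sample."""
    problem_grams = ngrams(problem, n)
    if not problem_grams:
        return 0.0
    corpus_grams = set().union(*(ngrams(doc, n) for doc in corpus_docs))
    return len(problem_grams & corpus_grams) / len(problem_grams)

# A score near 1.0 means the problem text shows up nearly verbatim in the sample;
# a low score does not prove the model never saw it (paraphrases slip through).
```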
To me, "How is the score on AIME2024 relevant" is because it is still not that high (from a practical consideration) despite directly training on it.
Mixed in with all the models success falling dramatically with AIME2025 demonstrates the above, and hints that Rao's claim that compiling in the verifier in training/scratch-space/prompt/fine-tuning etc... in a way the model can reliably access is what matters.
ipsum2
Google Gemini (2.5 Pro) made the same "mistake": their data cutoff is January 2025, and AIME 2024 was in February 2024.
rubymamis
I tend to prefer running non-thinking models locally, since they output the result significantly faster.
nico
Any specific model recommendations for running locally?
Also, what tasks are you using them for?
genewitch
Phi 4. It's fast and reasonable enough, but with local models you have to know what you want to do. If you want a chatbot, you use something with Hermes tunes; if you want code, you want a coder - a lot of people like the DeepSeek distill of Qwen Instruct for coding.
There's no equivalent to "does everything kinda well" like chatgpt or Gemini on local, except maybe the 70B and larger, but those are slow without datacenter cards with enough RAM to hold them.
I asked your very question just a day or two ago, because I put back together a machine with a 3060 12GB and wondered what SOTA was at that amount of VRAM.
If you use LM Studio it will auto-pick which of the quantized models to get, but you can pick a larger quant if you want. You pick a model and a parameter size, and it will choose the "best" quantization for your hardware. Generally.
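As a rough illustration of that kind of fit check: weights quantized to q bits per parameter take about params × q / 8 bytes, plus some headroom for the KV cache and runtime buffers. A back-of-envelope sketch (the 15% overhead figure is an assumption, not anything LM Studio documents):

```python
def fits_in_vram(params_billions: float, bits_per_weight: float,
                 vram_gb: float, overhead: float = 0.15) -> bool:
    """Rough check: do the quantized weights (plus some headroom) fit in VRAM?"""
    weight_gb = params_billions * bits_per_weight / 8
    return weight_gb * (1 + overhead) <= vram_gb

# On a 12GB card: a 14B model at ~4.5 bits/weight (~8GB) fits,
# a 32B model at the same quant (~18GB) does not.
print(fits_in_vram(14, 4.5, 12))   # True
print(fits_in_vram(32, 4.5, 12))   # False
```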
nico
Thank you for the insightful reply
> There's no equivalent to "does everything kinda well" like chatgpt or Gemini on local, except maybe the 70B and larger, but those are slow
Is there something like a “prompt router”, that can automatically decide what model to use based on the type of prompt/task?
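For illustration, a minimal local router can be as simple as a keyword heuristic that maps a prompt to one of the models mentioned in this thread. A sketch where the routing rules and model names are purely illustrative assumptions (a sturdier version would use a small classifier or a cheap LLM call to pick the route):

```python
# Illustrative routes only; swap in whatever you actually have loaded locally.
ROUTES = {
    "code": "qwen2.5-coder-32b",
    "math": "qwq-32b",
    "chat": "phi-4",
}

CODE_HINTS = ("def ", "class ", "stack trace", "compile", "bug", "refactor", "function")
MATH_HINTS = ("prove", "integral", "equation", "solve for", "theorem")

def route(prompt: str) -> str:
    p = prompt.lower()
    if any(h in p for h in CODE_HINTS):
        return ROUTES["code"]
    if any(h in p for h in MATH_HINTS):
        return ROUTES["math"]
    return ROUTES["chat"]

print(route("Why does this function throw a KeyError?"))  # -> qwen2.5-coder-32b
print(route("Solve for x: 2x + 3 = 11"))                  # -> qwq-32b
print(route("Plan a weekend in Lisbon"))                  # -> phi-4
```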
rubymamis
I mostly like to evaluate them whenever I ask a remote model (Claude 3.7, ChatGPT 4.5), to see how far they have progressed. From my tests, Qwen 2.5 Coder 32B is still the best local model for coding tasks. I've also tried Phi 4, Nemotron, Mistral Small, and QwQ 32B. I'm using a MacBook Pro M4 with 46GB RAM.
scribu
From their Notion page:
> Skywork-OR1-32B-Preview delivers the 671B-parameter Deepseek-R1 performance on math tasks (AIME24 and AIME25) and coding tasks (LiveCodeBench).
Impressive, if true: much better performance than the vanilla distills of R1.
Plus it’s a fully open-source release (including data selection and training code).
y2236li
Interesting – matching the 671B-parameter model feels like a significant step. It’s a compelling contrast to the previous models and sets a strong benchmark. It’s great that they’re embracing open weights and data too – that’s a crucial aspect for innovation.
CharlesW
> It’s great that they’re embracing open […] data too…
It could be, but as I type this it's currently vaporware: https://huggingface.co/datasets/Skywork/Skywork-OR1-RL-Data
qwertox
I know one can rent consumer GPUs on the internet, where people like you and me offer their spare GPU time to people who need it, for a price. They basically get a GPU-enabled VM on your machine.
But is there something like a distributed network akin to SETI@home and the likes which is free for training models? Where a consensus is made on which model is trained and that any derivative works must be open source, including all the tooling and hosting platform? Would this even be possible to do, given that the latency between nodes is very high and the bandwidth limited?
qeternity
> Would this even be possible to do, given that the latency between nodes is very high and the bandwidth limited?
Yes, it's possible. But no, it would not be remotely sensible given the performance implications. There is a reason why Nvidia is a multi-trillion-dollar company, and it's as much about networking as it is about GPUs.
kmeisthax
Back in the early days of AI art, before AI became way too cringe to think about, I wondered about this exact thing[0]. The problem I learned later is that most AI training (and inference) is not dependent so much on the GPU compute, but on memory bandwidth and communication. A huge chunk of AI training is just figuring out how to minimize or hide the bottleneck the inter-GPU interconnect imposes so you can scale to multiple cards.
The BOINC model of distributed computing is to separate everything into little work units that can be sent out to multiple machines who then return a result that can be integrated back into the whole. If you were to train foundation models this way, you'd be packaging up the current model state n and a certain amount of trainset items into a work unit, and the result would be model weight offsets to be added back into model state n+1. But you wouldn't be able to benefit from any of the gradients calculated by other users until they submitted their work units and n+1 got calculated. So there'd be a lot of redundant work and training progress would slow down, versus a closely-coupled set of GPUs where they have enough bandwidth to exchange gradients every batch.
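A minimal sketch of that coordinator-side loop, just to make the staleness problem concrete (everything here is a toy assumption: deltas are plain lists, and the volunteers are simulated by hard-coded results):

```python
from dataclasses import dataclass

@dataclass
class WorkUnit:
    model_version: int   # which model state n this unit was cut from
    data_shard: list     # the training examples bundled into this unit

def make_work_units(version: int, dataset: list, shard_size: int) -> list[WorkUnit]:
    return [WorkUnit(version, dataset[i:i + shard_size])
            for i in range(0, len(dataset), shard_size)]

def merge_results(weights: list[float], deltas: list[list[float]]) -> list[float]:
    # Average the returned weight offsets and apply them to get state n+1.
    # Every delta was computed against state n, so it is stale on arrival.
    avg = [sum(col) / len(col) for col in zip(*deltas)]
    return [w + d for w, d in zip(weights, avg)]

# Toy round: cut work units from state 0, pretend two volunteers sent results back.
weights = [0.0, 0.0, 0.0]
units = make_work_units(0, list(range(10)), shard_size=5)
volunteer_deltas = [[1.0, -2.0, 3.0], [3.0, 0.0, -1.0]]
weights = merge_results(weights, volunteer_deltas)   # now at state 1
print(weights)  # [2.0, -1.0, 1.0]
```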
For the record, I never actually built a distributed training cluster. But when I learned what AI actually wants to go fast, I realized distributed training probably couldn't work over just renting big GPUs.
Most people do not have GPUs with enough RAM to do meaningful AI work. Generative AI models work autoregressively: that is, all of their weights are repeatedly used in a tight loop. In order for a GPU to provide a meaningful speedup, it needs to have the whole model in GPU memory, because PCIe is slow (high latency) and also slow (low bandwidth). Nvidia knows this and that's why they are very stingy with GPU VRAM. Furthermore, training a model takes considerably more memory than merely running it; I believe the gradients are roughly the size of the weights again, optimizer state adds more on top, and activation memory scales with your batch size. There are two ways I could see around this, both of which are going to cause further problems:
- You could make 'mini' workunits where certain specific layers of the model are frozen and do not generate gradients. So you'd only train, say, 10% of the model at any one time. This is how you train very large models in centralized computing; you put a slice of the model on each GPU and exchange activations and gradients each batch. But we're on a distributed computer, so we don't have that kind of tight coupling, and we converge slower or not at all if we do this.
- You can change the model architecture to load specific chunks of weights at each layer, with another neural network to decide what chunks to load for each token. This is known as a "Mixture of Experts" model and it's the most efficient way we know of to stream weights in and out of a GPU, but training has to be aware of it and you can't change the size of the chunks to fit the current GPU. MoE lets a model have access to a lot of weights, but the scaling is worse. e.g. an 8x44B parameter MoE model is NOT equivalent to a 352B non-MoE model. It also causes problems with training that you have to solve for: very common bits of knowledge will be replicated across chunks, and certain chunks can become favored by the model because they're getting more gradients, which causes them to be favored more, so they get more gradients.
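A minimal sketch of that kind of top-k expert routing, in PyTorch for concreteness (the dimensions, expert count, and gating scheme are arbitrary; real MoE layers add load-balancing losses precisely to fight the "favored experts get more gradients" feedback loop described above):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    def __init__(self, dim: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.gate = nn.Linear(dim, num_experts)   # the "which chunks to load" network
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:    # x: [tokens, dim]
        scores = self.gate(x)                               # [tokens, num_experts]
        weights, idx = scores.topk(self.top_k, dim=-1)      # pick k experts per token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e                        # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, k:k + 1] * expert(x[mask])
        return out

moe = TinyMoE(dim=16)
print(moe(torch.randn(4, 16)).shape)   # torch.Size([4, 16])
```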
[0] My specific goal was to train a txt2img model purely on public domain Wikimedia Commons data, which failed for different reasons having to do with the fact that most of AI is just dataset sorting.
iamnotagenius
[dead]
y2236li
[flagged]
> Both of our models are trained on top of DeepSeek-R1-Distill-Qwen-7B and DeepSeek-R1-Distill-Qwen-32B.
Not to take away from their work, but this shouldn't be buried at the bottom of the page - there's a gulf between completely new models and fine-tuning.