
Qwen3: Think deeper, act faster

134 comments

April 28, 2025

stavros

I have a small physics-based problem I pose to LLMs. It's tricky for humans as well, and all LLMs I've tried (GPT o3, Claude 3.7, Gemini 2.5 Pro) fail to answer correctly. If I ask them to explain their answer, they do get it eventually, but none get it right the first time. Qwen3 with max thinking got it even more wrong than the rest, for what it's worth.

kenjackson

You really had me until the last half of the last sentence.

stavros

The plural of anecdote is data.

WhitneyLand

The plural of reliable data is not anecdote.

rtaylorgarlock

Only in the same way that the plural of 'opinion' is 'fact' ;)

concrete_head

Can you please share the problem?

stavros

I don't really want it added to the training set, but eh. Here you go:

> Assume I have a 3D printer that's currently printing, and I pause the print. What expends more energy, keeping the hotend at some temperature above room temperature and heating it up the rest of the way when I want to use it, or turning it completely off and then heat it all the way when I need it? Is there an amount of time beyond which the answer varies?

All LLMs I've tried get it wrong because they assume the hotend cools instantly once the heating stops, but they realize this when asked about it. Qwen didn't realize it, and gave the answer that 30 minutes of keeping the hotend hot is better than turning it off and reheating it when needed.
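(For anyone who wants to poke at it: a minimal sketch of the comparison under a lumped-capacitance / Newton's-cooling assumption, with made-up constants. It only makes the energy bookkeeping concrete; it doesn't settle the answer.)

```python
# Toy comparison of "hold warm then reheat" vs "turn off then reheat".
# Lumped-capacitance model with hypothetical constants; losses during the
# reheat itself are ignored in both cases.
import math

C = 20.0         # hotend heat capacity, J/K (assumed)
k = 0.5          # heat-loss coefficient to ambient, W/K (assumed)
T_amb = 25.0     # room temperature, deg C
T_print = 210.0  # printing temperature, deg C
T_hold = 150.0   # standby temperature if kept warm, deg C

def hold_then_reheat(pause_s):
    # Heater replaces steady-state losses for the whole pause, then adds the
    # stored heat needed to climb from T_hold back to T_print.
    return k * (T_hold - T_amb) * pause_s + C * (T_print - T_hold)

def off_then_reheat(pause_s):
    # Heater off: the hotend cools exponentially with time constant C/k, and
    # we only pay to restore the heat that actually leaked away.
    T_after = T_amb + (T_print - T_amb) * math.exp(-pause_s * k / C)
    return C * (T_print - T_after)

for minutes in (5, 30, 120):
    s = minutes * 60
    print(f"{minutes:>3} min pause: hold={hold_then_reheat(s):7.0f} J, "
          f"off={off_then_reheat(s):7.0f} J")
```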

pylotlight

Some calculation around heat loss and required heat expenditure to reheat per material or something?

arthurcolle

Hi, I'm starting an evals company, would love to have you as an advisor!

999900000999

Not OP, but what exactly would I need to do?

I'll do it for cheap if you'll let me work remote from outside the states.

refulgentis

I believe they're kidding, playing on "my singular question isn't answered correctly"

phonon

Qwen3-235B-A22B?

stavros

Yep, on Qwen chat.

natrys

They have got pretty good documentation too[1]. And it looks like we have day-1 support for all major inference stacks, plus so many size choices. Quants are also up, because they have already worked with many community quant makers.

Not even going into performance, need to test first. But what a stellar release, just for the attention to all these peripheral details alone. This should be the standard for a major release, instead of whatever Meta was doing with Llama 4 (hope Meta can surprise us at LlamaCon tomorrow though).

[1] https://qwen.readthedocs.io/en/latest/

Jayakumark

Second this. They patched all major LLM frameworks like llama.cpp, transformers, vllm, sglang, ollama, etc. weeks ahead of time for Qwen3 support and released the model weights everywhere at around the same time, like a global movie release. Can't overstate this level of detail and effort.

echelon

Alibaba, I have a huge favor to ask if you're listening. You guys very obviously care about the community.

We need an answer to gpt-image-1. Can you please pair Qwen with Wan? That would literally change the art world forever.

gpt-image-1 is an almost wholesale replacement of ComfyUI and SD/Flux ControlNets. I can't overstate how big a deal it is. As such, OpenAI has leapt ahead and threatens to start capturing more of the market for AI images and video. The expense of designing and training a multimodal model presents challenges to the open source community, and it's unlikely that Black Forest Labs or an open effort can do it. It's really a place where only Alibaba can shine.

If we get an open weights multimodal image gen model that we can fine tune, then it's game over - open models will be 100% the future. If not, then the giants are going to start controlling media creation. It'll be the domain of OpenAI and Google alone. Firing a salvo here will keep media creation highly competitive.

So please, pretty please work on an LLM/Diffusion multimodal image gen model. It would change the world instantly.

And keep up the great work with Wan Video! It's easily going to surpass Kling and Veo. The controllability is already well worth the tradeoffs.

kadushka

they have already worked with many community quant makers

I’m curious, who are the community quant makers?

natrys

I had Unsloth[1] and Bartowski[2] in mind. Both said on Reddit that Qwen had allowed them access to weights before release to ensure smooth sailing.

[1] https://huggingface.co/unsloth

[2] https://huggingface.co/bartowski

tough

nvm

kadushka

I understand the context, I’m asking for names.

dkga

This cannot be stressed enough.

sroussey

Well, the link to huggingface is broken at the moment.

daemonologist

It's up now: https://huggingface.co/collections/Qwen/qwen3-67dd247413f0e2...

The space loads eventually as well; might just be that HF is under a lot of load.

tough

Thank you!!

WhitneyLand

China is doing a great job raising doubt about any lead the major US labs may still have. This is solid progress across the board.

The new battlefront may be to take reasoning to the level of abstraction and creativity to handle math problems without a numerical answer (for ex: https://arxiv.org/pdf/2503.21934).

I suspect that kind of ability will generalize well to other areas and be a significant step toward human level thinking.

sega_sai

With all the different open-weight models appearing, is there some way of figuring out what model would work with sensible speed (> X tok/s) on a standard desktop GPU ?

E.g. I have a Quadro RTX 4000 with 8 GB of VRAM, and looking at all the models here https://ollama.com/search in all their different sizes, I am absolutely at a loss as to which models at which sizes would be fast enough. There's no point in downloading the latest, biggest model if it will output 1 tok/min, but I also don't want to settle for the smallest model if I don't have to.

Any advice?

GodelNumbering

There are a lot of variables here, such as your hardware's memory bandwidth, the speed at which it processes tensors, etc.

A basic thing to remember: any given dense model requires X GB of memory at 8-bit quantization, where X is the number of params in billions (of course I am simplifying a little by not counting context size). Quantization is just the 'precision' of the model; 8-bit generally works really well. Generally speaking, it's not worth even bothering with models whose weights are larger than your hardware's VRAM. Some people try to get around that with a 4-bit quant, trading some precision for half the VRAM size. YMMV depending on use-case.
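A tiny sketch of that rule of thumb; the 20% overhead fudge factor is a guess, not a measurement:

```python
# Rough VRAM estimate: bytes per weight set by the quantization, plus a fudge
# factor for context/KV cache and runtime overhead (the 1.2 is an assumption).
def vram_gb(params_billion, bits_per_weight=8, overhead=1.2):
    weights_gb = params_billion * bits_per_weight / 8  # 1 GB per 1B params at 8-bit
    return weights_gb * overhead

for params, bits in [(8, 8), (8, 4), (32, 8), (32, 4)]:
    print(f"{params}B @ {bits}-bit ~ {vram_gb(params, bits):.1f} GB")
```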

refulgentis

4 bit is absolutely fine.

I know this is crazy to hear, because the big-iron folks still debate 16 vs 32, and 8 vs 16 is near verboten in public conversation.

I contribute to llama.cpp and have seen many, many efforts to measure the evaluation performance of various quants, and no matter which way it was sliced (ranging from subjective volunteers doing A/B voting on responses over months, to objective perplexity loss), Q4 is indistinguishable from the original.

brigade

It's incredibly niche, but Gemma 3 27b can recognize a number of popular video game characters even in novel fanart (I was a little surprised at that when messing around with its vision). But the Q4 quants, even with QAT, are very likely to name a random wrong character from within the same franchise, even when Q8 quants name the correct character.

Niche of a niche, but just kind of interesting how the quantization jostles the name recall.

mmoskal

Just for some calibration: approximately no one runs 32-bit for LLMs on any sort of iron, big or otherwise. Some models (e.g. DeepSeek V3, and derivatives like R1) are native FP8. FP8 was also common for Llama 3 405B serving.

whimsicalism

> 8 vs 16 is near verboten in public conversation.

i mean, deepseek is fp8

rahimnathwani

With 8GB VRAM, I would try this one first:

https://ollama.com/library/qwen3:8b-q4_K_M

For fast inference, you want a model that will fit in VRAM, so that none of the layers need to be offloaded to the CPU.

frainfreeze

Bartowski quants on Hugging Face are an excellent starting point in your case. Pretty much every upload he does has a note on how to pick a model VRAM-wise. If you follow the recommendations you'll have a good user experience. The next step is the localllama subreddit. Once you build basic knowledge and a feel for things, you'll more easily gauge what will work for your setup. There is no out-of-the-box calculator.

Spooky23

Depends what fast means.

I've run Llama and Gemma 3 on a base Mac Mini and it's pretty decent for text processing. It has 16GB of RAM though, which during inference is mostly used by the GPU. You need more juice for image stuff.

My son’s gaming box has a 4070 and it’s about 25% faster the last time I compared.

The mini is so cheap it’s worth trying out - you always find another use for it. Also the M4 sips power and is silent.

xiphias2

When I tested Qwen with different sizes / quants, generally the 8-bit quant versions had the best quality for the same speed.

4-bit was "fine", but a smaller 8-bit version beat it in quality for the same speed.

wmf

Speed should be proportional to the number of active parameters, so all 7B Q4 models will have similar performance.
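Roughly: decoding is memory-bandwidth-bound, so each generated token has to stream the active weights through memory once. A sketch of that napkin math, with ballpark (not measured) numbers:

```python
# tok/s ~ memory bandwidth / size of the active weights per token.
# Bandwidth and model sizes here are ballpark assumptions.
def toks_per_sec(bandwidth_gb_s, active_params_billion, bits_per_weight=4):
    active_gb = active_params_billion * bits_per_weight / 8
    return bandwidth_gb_s / active_gb

print(toks_per_sec(450, 7))   # ~450 GB/s desktop GPU, 7B dense @ 4-bit -> ~128 t/s ceiling
print(toks_per_sec(450, 3))   # same card, 3B active (MoE)              -> ~300 t/s ceiling
```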

jack_pp

Use the free ChatGPT to help you write a script to download them all and test their speed.

colechristensen

>is there some way of figuring out what model would work with sensible speed (> X tok/s) on a standard desktop GPU ?

Not simply, no.

But start with a parameter count close to, but less than, your VRAM, decide if performance is satisfactory, and move from there. There are various ways to trade away quality (quantizing the model) or speed (not loading the entire model into VRAM and accepting slower inference).

simonw

Something that interests me about the Qwen and DeepSeek models is that they have presumably been trained to fit the worldview enforced by the CCP, for things like avoiding talking about Tiananmen Square - but we've had access to a range of Qwen/DeepSeek models for well over a year at this point and to my knowledge this assumed bias hasn't actually resulted in any documented problems from people using the models.

Aside from https://huggingface.co/blog/leonardlin/chinese-llm-censorshi... I haven't seen a great deal of research into this.

Has this turned out to be less of an issue for practical applications than was initially expected? Are the models just not censored in the way that we might expect?

eunos

The avoiding-talking part is more frontend-level censorship, I think. It doesn't censor on the API.

nyclounge

This is NOT true, at least for the 1.5B model on my local machine. It blocks answers even when used offline. Perplexity has an uncensored version, but I don't think they're open about how they did it.

yawnxyz

Here's a blog post on Perplexity's R1 1776, which they post-trained

https://www.perplexity.ai/hub/blog/open-sourcing-r1-1776

theturtletalks

Didn't know Perplexity cracked R1's censorship, but it is completely uncensored. Anyone can try it, even without an account: https://labs.perplexity.ai/. HuggingFace was also working on Open R1, but I'm not sure how far they got.

refulgentis

^ This. There was also a lot of confusion over DeepSeek when it was released: the reasoning models were built on other models, inter alia Qwen (Chinese) and Llama (US), so one's mileage varied significantly.

CSMastermind

Right now these models have less censorship than their US counterparts.

With that said, they're in a fight for dominance so censoring now would be foolish. If they win and establish a monopoly then the screws will start to turn.

horacemorace

In my limited experience, models like Llama and Gemma are far more censored than Qwen and Deepseek.

neves

Try to ask any model about Israel and Hamas

Havoc

It’s a complete non-issue. Especially with open weights.

On their online platform I've hit a political block exactly once in months of use. I was asking it something about revolutions in various countries and it noped out of that.

I'd prefer a model that doesn't have this issue at all, but if I have a choice between a good Apache-licensed Chinese one and a less good, say, Meta-licensed one, I'll take the Chinese one every time. I just don't ask LLMs enough politically relevant questions for it to matter.

To be fair maybe that take is the LLM equivalent of „I have nothing to hide“ on surveillance

sirnonw

[dead]

pbmango

It is also possible that this "worldview tuning" was just the way these models gained public attention. Whether intentional or not, seeing the Tiananmen Square reposts across all social feeds may have done more to spread awareness of these models' technical merits than the technical merits themselves would have. This was certainly true for how consumers learned about free DeepSeek, and it fits perfectly with how new AI releases are turned into high-click-through social media posts.

refulgentis

I'm curious if there's any data behind that conclusion; it's hard for me to get to "they did the censorship training on DeepSeek because they knew consumers would love free DeepSeek after seeing screenshots of Tiananmen censorship in DeepSeek".

(The steelman here, of course, is "the screenshots drove buzz, which drove usage!", but it's sort of a steel thread in context: we'd still need to pull in a time machine and a very odd, unmet US consumer demand for models that toe the CCP line.)

pbmango

> Whether intentional or not

I am not claiming it was intentional, but it certainly magnified the media attention. Maybe luck and not 4d chess.

minimaxir

DeepSeek R1 was a massive outlier in terms of media attention (a free model that can potentially kill OpenAI!), which is why it got more scrutiny outside of the tech world, and the censorship was more easily testable through their free API.

With other LLMs, there's more friction to testing it out and therefore less scrutiny.

rfoo

The model does have some bias built in, but it's lighter than expected. From what I've heard this is (sort of) a deliberate choice: just overfit whatever bullshit worldview benchmark the regulator demands your model pass. Don't actually try to be better at it.

For a public chatbot service, all Chinese vendors have their own censorship tech (or just use censorship-as-a-service from a cloud; all the major clouds in China have one), because ultimately you need it for UGC anyway. So why not just censor LLM output with the same stack, too.

dylanjcastillo

I’m most excited about Qwen-30B-A3B. Seems like a good choice for offline/local-only coding assistants.

Until now I found that open weight models were either not as good as their proprietary counterparts or too slow to run locally. This looks like a good balance.

htsh

curious, why the 30b MoE over the 32b dense for local coding?

I do not know much about the benchmarks but the two coding ones look similar.

Casteil

The MoE version with 3b active parameters will run significantly faster (tokens/second) on the same hardware, by about an order of magnitude (i.e. ~4t/s vs ~40t/s)

genpfault

> The MoE version with 3b active parameters

~34 tok/s on a Radeon RX 7900 XTX under today's Debian 13.

esafak

Could this variant be run on a CPU?

moconnor

Probably very well

minimaxir

A 0.6B LLM with a 32k context window is interesting, even if it was trained using only distillation (which is not ideal as it misses nuance). That would be a fun base model for fine-tuning.

Out of all the Qwen3 models on Hugging Face, it's the most downloaded/hearted. https://huggingface.co/collections/Qwen/qwen3-67dd247413f0e2...

jasonjmcghee

These 0.5 and 0.6B models are _fantastic_ for use as a draft model in speculative decoding. LM Studio makes this super easy to do; I have it on like every model I play with now.

My concern with these models, though, is that it seems like the architectures vary a bit, so idk how it'll work.

mmoskal

Spec decoding only depends on the tokenizer used. It's transferring either the draft token sequence or at most the draft logits to the main model.
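A conceptual sketch of (greedy) speculative decoding, just to illustrate why only the tokenizer has to match: the draft model proposes a few tokens cheaply, the target model verifies them in one pass. `draft_next` and `target_logits` are hypothetical stand-ins for real model calls.

```python
import numpy as np

def speculative_step(prompt_ids, draft_next, target_logits, k=4):
    # 1. Draft model proposes k tokens autoregressively (cheap, small model).
    seq = list(prompt_ids)
    proposed = []
    for _ in range(k):
        tok = draft_next(seq)
        proposed.append(tok)
        seq.append(tok)

    # 2. Target model scores the whole proposed run in a single forward pass
    #    (one expensive call instead of k).
    logits = target_logits(list(prompt_ids) + proposed)  # shape: (seq_len, vocab)

    # 3. Accept proposed tokens while they match the target's greedy choice;
    #    on the first mismatch, take the target's token and stop.
    n = len(prompt_ids)
    accepted = []
    for i, tok in enumerate(proposed):
        target_tok = int(np.argmax(logits[n + i - 1]))   # logits predicting position n+i
        if target_tok == tok:
            accepted.append(tok)
        else:
            accepted.append(target_tok)
            break
    return accepted
```

(The probabilistic version accepts/rejects against the target's distribution instead of the argmax, but the mechanics are the same: the two models only need to agree on token IDs.)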

jasonjmcghee

I suppose that makes sense; for some reason I was under the impression that the models needed to be aligned / have the same tuning, or they'd have different probability distributions and would reject the draft model really often.

foundry27

I find the situation the big LLM players find themselves in quite ironic. Sam Altman promised (edit: under duress, from a twitter poll gone wrong) to release an open source model at the level of o3-mini to catch up to the perceived OSS supremacy of Deepseek/Qwen. Now Qwen3’s release makes a model that’s “only” equivalent to o3-mini effectively dead on arrival, both socially and economically.

aoeusnth1

I have a hard time believing that he hadn't already made up his mind to make an open source model when he posted the poll in the first place

krackers

I don't think they will ever do an open-source release, because then the curtains would be pulled back and people would see that they're not actually state of the art. Llama 4 already sort of tanked Meta's reputation; if OpenAI did that it'd decimate the value of their company.

If they do open-source something, I expect them to open-source some existing model (maybe something useless like gpt-3.5) rather than provide something new.

mks_shuffle

Does anyone have insights on the best approaches to compare reasoning models? It is often recommended to use a higher temperature for more creative answers and lower temperature values for more logical and deterministic outputs. However, I am not sure how applicable this advice is for reasoning models. For example, Deepseek-R1 and QwQ-32b recommend a temperature around 0.6, rather than lower values like 0.1–0.3. The Qwen3 blog provides performance comparisons between multiple reasoning models, and I am interested in knowing what configurations they used. However, the paper is not available yet. If anyone has links to papers focused on this topic, please share them here. Also, please feel free to correct me if I’m mistaken about anything. Thanks!
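For what it's worth, a minimal sketch of pinning down the sampling config when comparing reasoning models, assuming an OpenAI-compatible endpoint (e.g. a local vLLM or llama.cpp server). The URL, model name, and sampling values below are placeholders; the model card's recommended settings should take precedence.

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="Qwen3-30B-A3B",             # placeholder model name
    messages=[{"role": "user", "content": "If 3x + 5 = 20, what is x?"}],
    temperature=0.6,                    # the value these reasoning models tend to recommend
    top_p=0.95,
    max_tokens=4096,                    # leave room for the thinking tokens
)
print(resp.choices[0].message.content)
```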

daemonologist

It sounds like these models think a lot, seems like the benchmarks are run with a thinking budget of 32k tokens - the full context length. (Paper's not published yet so I'm just going by what's on the website.) Still, hugely impressive if the published benchmarks hold up under real world use - the A3B in particular, outperforming QWQ, could be handy for CPU inference.

Edit: The larger models have 128k context length. 32k thinking comes from the chart which looks like it's for the 235B, so not full length.

croemer

The benchmark results are so incredibly good they are hard to believe. A 30B model that's competitive with Gemini 2.5 Pro and way better than Gemma 27B?

Update: I tested "ollama run qwen3:30b" (the MoE) locally and while it thought much it wasn't that smart. After 3 follow up questions it ended up in an infinite loop.

I just tried again, and it ended up in an infinite loop immediately, with just a single prompt, no follow-up: "Write a Python script to build a Fitch parsimony tree by stepwise addition. Take a Fasta alignment as input and produce a nwk string as output."

Update 2: The dense one "ollama run qwen3:32b" is much better (albeit slower of course). It still keeps on thinking for what feels like forever until it misremembers the initial prompt.

rahimnathwani

You tried a 4-bit quantized version, not the original.

qwen3:30b has the same checksum as https://ollama.com/library/qwen3:30b-a3b-q4_K_M

croemer

What is the original? The blog post doesn't state the quantization they benchmarked.

rahimnathwani

This 61GB one: https://ollama.com/library/qwen3:30b-a3b-fp16

You can see it's roughly the same size as the one in the official repo (16 files of 4GB each):

https://huggingface.co/Qwen/Qwen3-30B-A3B/tree/main

cye131

These performance numbers look absolutely incredible. The MoE outperforms o1 with 3B active parameters?

We're really getting close to the point where local models are good enough to handle practically every task that most people need to get done.

stavros

> We're really getting close to the point where local models are good enough to handle practically every task that most people need to get done.

After trying to implement a simple assistant/helper with GPT-4.1 and getting some dumb behavior from it, I doubt even proprietary models are good enough for every task.

the_arun

I'm dreaming of a time when commodity CPUs run LLMs for inference & serve at scale.

thierrydamiba

How do people typically do napkin math to figure out if their machine can “handle” a model?

derbaum

Very rough (!) napkin math: for a q8 model (almost lossless), the parameter count in billions ≈ the VRAM requirement in GB. For q4, with some performance loss, it's roughly half. Then you add a little bit for the context window and overhead. So a 32B model at q4 should run comfortably on 20-24 GB.

Again, very rough numbers, there's calculators online.

daemonologist

The ultra-simplified napkin math is 1 GB (V)RAM per 1 billion parameters, at a 4-5 bit-per-weight quantization. This usually gives most of the performance of the full size model and leaves a little bit of room for context, although not necessarily the full supported size.

hn8726

Wondering if I'll get corrected, but my _napkin math_ is looking at the model download size — I estimate it needs at least this amount of vram/ram, and usually the difference in size between various models is large enough not to worry if the real requirements are size +5% or 10% or 15%. LM studio also shows you which models your machine should handle

samsartor

The absolutely dumbest way is to compare the number of parameters with your bytes of RAM. If you have 2 or more bytes of RAM for every parameter you can generally run the model easily (eg 3B model with 8GB of RAM). 1 byte per parameter and it is still possible, but starts to get tricky.

Of course, there are lots of factors that can change the RAM usage: quantization, context size, KV cache. And this says nothing about whether the model will respond quickly enough to be pleasant to use.
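A sketch of that rule plus a rough KV-cache term, since the cache is what quietly eats the remaining headroom at long contexts. The layer/head numbers are placeholders, not any particular model's config.

```python
# Weights: ~2 bytes per parameter is comfortable; ~1 byte per parameter gets tight.
def weights_gb(params_billion, bytes_per_param=2):
    return params_billion * bytes_per_param

# KV cache: 2 (K and V) * layers * kv_heads * head_dim * context * bytes per element.
def kv_cache_gb(layers, kv_heads, head_dim, context_len, bytes_per_elem=2):
    return 2 * layers * kv_heads * head_dim * context_len * bytes_per_elem / 1e9

print(weights_gb(3))                                                         # ~6 GB for a 3B model
print(kv_cache_gb(layers=36, kv_heads=8, head_dim=128, context_len=32768))   # ~4.8 GB
```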
