
Qwen3: Think deeper, act faster

134 comments

April 28, 2025

stavros

I have a small physics-based problem I pose to LLMs. It's tricky for humans as well, and all LLMs I've tried (GPT o3, Claude 3.7, Gemini 2.5 Pro) fail to answer correctly. If I ask them to explain their answer, they do get it eventually, but none get it right the first time. Qwen3 with max thinking got it even more wrong than the rest, for what it's worth.

kenjackson

You really had me until the last half of the last sentence.

stavros

The plural of anecdote is data.

WhitneyLand

The plural of reliable data is not anecdote.

rtaylorgarlock

Only in the same way that the plural of 'opinion' is 'fact' ;)

concrete_head

Can you please share the problem?

stavros

I don't really want it added to the training set, but eh. Here you go:

> Assume I have a 3D printer that's currently printing, and I pause the print. What expends more energy, keeping the hotend at some temperature above room temperature and heating it up the rest of the way when I want to use it, or turning it completely off and then heat it all the way when I need it? Is there an amount of time beyond which the answer varies?

All LLMs I've tried get it wrong because they assume the hotend cools instantly once the heating stops, but they realize this when asked about it. Qwen didn't realize it, and gave the answer that 30 minutes of keeping the hotend hot is better than turning it off and reheating it when needed.
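(For anyone who wants to poke at it: a minimal sketch of the comparison under a lumped-capacitance / Newton's-cooling assumption, with made-up constants. It only makes the energy bookkeeping concrete; it doesn't settle the answer.)

```python
# Toy comparison of "hold warm then reheat" vs "turn off then reheat".
# Lumped-capacitance model with hypothetical constants; losses during the
# reheat itself are ignored in both cases.
import math

C = 20.0         # hotend heat capacity, J/K (assumed)
k = 0.5          # heat-loss coefficient to ambient, W/K (assumed)
T_amb = 25.0     # room temperature, deg C
T_print = 210.0  # printing temperature, deg C
T_hold = 150.0   # standby temperature if kept warm, deg C

def hold_then_reheat(pause_s):
    # Heater replaces steady-state losses for the whole pause, then adds the
    # stored heat needed to climb from T_hold back to T_print.
    return k * (T_hold - T_amb) * pause_s + C * (T_print - T_hold)

def off_then_reheat(pause_s):
    # Heater off: the hotend cools exponentially with time constant C/k, and
    # we only pay to restore the heat that actually leaked away.
    T_after = T_amb + (T_print - T_amb) * math.exp(-pause_s * k / C)
    return C * (T_print - T_after)

for minutes in (5, 30, 120):
    s = minutes * 60
    print(f"{minutes:>3} min pause: hold={hold_then_reheat(s):7.0f} J, "
          f"off={off_then_reheat(s):7.0f} J")
```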

pylotlight

Some calculation around heat loss and required heat expenditure to reheat per material or something?

arthurcolle

Hi, I'm starting an evals company, would love to have you as an advisor!

999900000999

Not OP, but what exactly would I need to do?

I'll do it for cheap if you'll let me work remote from outside the states.

refulgentis

I believe they're kidding, playing on "my singular question isn't answered correctly"

phonon

Qwen3-235B-A22B?

stavros

Yep, on Qwen chat.

natrys

They have got pretty good documentation too[1]. And it looks like we have day-1 support for all major inference stacks, plus so many size choices. Quants are also up, because they have already worked with many community quant makers.

Not even going into performance, need to test first. But what a stellar release, just for the attention to all these peripheral details alone. This should be the standard for a major release, instead of whatever Meta was doing with Llama 4 (hope Meta can surprise us at LlamaCon tomorrow though).

[1] https://qwen.readthedocs.io/en/latest/

Jayakumark

Second this. They patched all major LLM frameworks like llama.cpp, transformers, vllm, sglang, ollama, etc. weeks ahead of time for Qwen3 support and released the model weights everywhere at around the same time, like a global movie release. Can't overstate this level of detail and effort.

echelon

Alibaba, I have a huge favor to ask if you're listening. You guys very obviously care about the community.

We need an answer to gpt-image-1. Can you please pair Qwen with Wan? That would literally change the art world forever.

gpt-image-1 is an almost wholesale replacement of ComfyUI and SD/Flux ControlNets. I can't overstate how big a deal it is. As such, OpenAI has leapt ahead and threatens to start capturing more of the market for AI images and video. The expense of designing and training a multimodal model presents challenges to the open source community, and it's unlikely that Black Forest Labs or an open effort can do it. It's really a place where only Alibaba can shine.

If we get an open weights multimodal image gen model that we can fine tune, then it's game over - open models will be 100% the future. If not, then the giants are going to start controlling media creation. It'll be the domain of OpenAI and Google alone. Firing a salvo here will keep media creation highly competitive.

So please, pretty please work on an LLM/Diffusion multimodal image gen model. It would change the world instantly.

And keep up the great work with Wan Video! It's easily going to surpass Kling and Veo. The controllability is already well worth the tradeoffs.

kadushka

they have already worked with many community quant makers

I’m curious, who are the community quant makers?

natrys

I had Unsloth[1] and Bartowski[2] in mind. Both said on Reddit that Qwen had allowed them access to weights before release to ensure smooth sailing.

[1] https://huggingface.co/unsloth

[2] https://huggingface.co/bartowski

tough

nvm

kadushka

I understand the context, I’m asking for names.

dkga

This cannot be stressed enough.

sroussey

Well, the link to huggingface is broken at the moment.

daemonologist

It's up now: https://huggingface.co/collections/Qwen/qwen3-67dd247413f0e2...

The space loads eventually as well; might just be that HF is under a lot of load.

tough

Thank you!!

WhitneyLand

China is doing a great job raising doubt about any lead the major US labs may still have. This is solid progress across the board.

The new battlefront may be to take reasoning to the level of abstraction and creativity to handle math problems without a numerical answer (for ex: https://arxiv.org/pdf/2503.21934).

I suspect that kind of ability will generalize well to other areas and be a significant step toward human level thinking.

sega_sai

With all the different open-weight models appearing, is there some way of figuring out what model would work with sensible speed (> X tok/s) on a standard desktop GPU ?

E.g. I have a Quadro RTX 4000 with 8 GB of VRAM, and looking at all the models here https://ollama.com/search in all their different sizes, I am absolutely at a loss as to which models at which sizes would be fast enough. There's no point in downloading the latest, biggest model if it will output 1 tok/min, but I also don't want to settle for the smallest model if I don't have to.

Any advice?

GodelNumbering

There are a lot of variables here, such as your hardware's memory bandwidth, the speed at which it processes tensors, etc.

A basic thing to remember: any given dense model requires X GB of memory at 8-bit quantization, where X is the number of params in billions (of course I am simplifying a little by not counting context size). Quantization is just the 'precision' of the model; 8-bit generally works really well. Generally speaking, it's not worth even bothering with models whose weights are larger than your hardware's VRAM. Some people try to get around that with a 4-bit quant, trading some precision for half the VRAM size. YMMV depending on use-case.
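A tiny sketch of that rule of thumb; the 20% overhead fudge factor is a guess, not a measurement:

```python
# Rough VRAM estimate: bytes per weight set by the quantization, plus a fudge
# factor for context/KV cache and runtime overhead (the 1.2 is an assumption).
def vram_gb(params_billion, bits_per_weight=8, overhead=1.2):
    weights_gb = params_billion * bits_per_weight / 8  # 1 GB per 1B params at 8-bit
    return weights_gb * overhead

for params, bits in [(8, 8), (8, 4), (32, 8), (32, 4)]:
    print(f"{params}B @ {bits}-bit ~ {vram_gb(params, bits):.1f} GB")
```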

refulgentis

4 bit is absolutely fine.

I know this is crazy to hear, because the big-iron folks still debate 16 vs 32, and 8 vs 16 is near verboten in public conversation.

I contribute to llama.cpp and have seen many, many efforts to measure the evaluation performance of various quants, and no matter which way it was sliced (ranging from subjective volunteers doing A/B voting on responses over months, to objective perplexity loss), Q4 is indistinguishable from the original.

brigade

It's incredibly niche, but Gemma 3 27b can recognize a number of popular video game characters even in novel fanart (I was a little surprised at that when messing around with its vision). But the Q4 quants, even with QAT, are very likely to name a random wrong character from within the same franchise, even when Q8 quants name the correct character.

Niche of a niche, but just kind of interesting how the quantization jostles the name recall.

mmoskal

Just for some calibration: approximately no one runs 32-bit for LLMs on any sort of iron, big or otherwise. Some models (e.g. DeepSeek V3, and derivatives like R1) are native FP8. FP8 was also common for Llama 3 405B serving.

whimsicalism

> 8 vs 16 is near verboten in public conversation.

i mean, deepseek is fp8

rahimnathwani

With 8GB VRAM, I would try this one first:

https://ollama.com/library/qwen3:8b-q4_K_M

For fast inference, you want a model that will fit in VRAM, so that none of the layers need to be offloaded to the CPU.

frainfreeze

Bartowski quants on Hugging Face are an excellent starting point in your case. Pretty much every upload he does has a note on how to pick a model VRAM-wise. If you follow the recommendations you'll have a good user experience. The next step is the localllama subreddit. Once you build basic knowledge and a feel for things, you'll more easily gauge what will work for your setup. There is no out-of-the-box calculator.

Spooky23

Depends what fast means.

I've run Llama and Gemma 3 on a base Mac Mini and it's pretty decent for text processing. It has 16GB of RAM though, which during inference is mostly used by the GPU. You need more juice for image stuff.

My son’s gaming box has a 4070 and it’s about 25% faster the last time I compared.

The mini is so cheap it’s worth trying out - you always find another use for it. Also the M4 sips power and is silent.

xiphias2

When I tested Qwen with different sizes / quants, generally the 8-bit quant versions had the best quality for the same speed.

4-bit was "fine", but a smaller 8-bit version beat it in quality for the same speed.

wmf

Speed should be proportional to the number of active parameters, so all 7B Q4 models will have similar performance.
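Roughly: decoding is memory-bandwidth-bound, so each generated token has to stream the active weights through memory once. A sketch of that napkin math, with ballpark (not measured) numbers:

```python
# tok/s ~ memory bandwidth / size of the active weights per token.
# Bandwidth and model sizes here are ballpark assumptions.
def toks_per_sec(bandwidth_gb_s, active_params_billion, bits_per_weight=4):
    active_gb = active_params_billion * bits_per_weight / 8
    return bandwidth_gb_s / active_gb

print(toks_per_sec(450, 7))   # ~450 GB/s desktop GPU, 7B dense @ 4-bit -> ~128 t/s ceiling
print(toks_per_sec(450, 3))   # same card, 3B active (MoE)              -> ~300 t/s ceiling
```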

jack_pp

Use the free ChatGPT to help you write a script to download them all and test their speed.

colechristensen

>is there some way of figuring out what model would work with sensible speed (> X tok/s) on a standard desktop GPU ?

Not simply, no.

But start with a parameter count close to, but less than, your VRAM, decide if performance is satisfactory, and move from there. There are various ways to trade away quality (quantizing the model) or speed (not loading the entire model into VRAM and accepting slower inference).

simonw

Something that interests me about the Qwen and DeepSeek models is that they have presumably been trained to fit the worldview enforced by the CCP, for things like avoiding talking about Tiananmen Square - but we've had access to a range of Qwen/DeepSeek models for well over a year at this point and to my knowledge this assumed bias hasn't actually resulted in any documented problems from people using the models.

Aside from https://huggingface.co/blog/leonardlin/chinese-llm-censorshi... I haven't seen a great deal of research into this.

Has this turned out to be less of an issue for practical applications than was initially expected? Are the models just not censored in the way that we might expect?

eunos

The avoiding-talking part is more frontend-level censorship, I think. It doesn't censor on the API.

nyclounge

This is NOT true, at least for the 1.5B model on my local machine. It blocks answers even when used offline. Perplexity has an uncensored version, but I don't think they're open about how they did it.

yawnxyz

Here's a blog post on Perplexity's R1 1776, which they post-trained

https://www.perplexity.ai/hub/blog/open-sourcing-r1-1776

theturtletalks

Didn't know Perplexity cracked R1's censorship, but it is completely uncensored. Anyone can try it, even without an account: https://labs.perplexity.ai/. HuggingFace was also working on Open R1, but I'm not sure how far they got.

refulgentis

^ This. There was also a lot of confusion over DeepSeek when it was released: the reasoning models were built on other models, inter alia Qwen (Chinese) and Llama (US), so one's mileage varied significantly.

CSMastermind

Right now these models have less censorship than their US counterparts.

With that said, they're in a fight for dominance so censoring now would be foolish. If they win and establish a monopoly then the screws will start to turn.

horacemorace

In my limited experience, models like Llama and Gemma are far more censored than Qwen and Deepseek.

neves

Try to ask any model about Israel and Hamas

Havoc

It’s a complete non-issue. Especially with open weights.

On their online platform I've hit a political block exactly once in months of use. I was asking it something about revolutions in various countries and it noped out of that.

I'd prefer a model that doesn't have this issue at all, but if I have a choice between a good Apache-licensed Chinese one and a less good, say, Meta-licensed one, I'll take the Chinese one every time. I just don't ask LLMs enough politically relevant questions for it to matter.

To be fair maybe that take is the LLM equivalent of „I have nothing to hide“ on surveillance

sirnonw

[dead]

pbmango

It is also possible that this "worldview tuning" was just the way these models gained public attention. Whether intentional or not, seeing the Tiananmen Square reposts across all social feeds may have done more to spread awareness of these models' technical merits than the technical merits themselves would have. This was certainly true for how consumers learned about free DeepSeek, and it fits perfectly with how new AI releases are turned into high-click-through social media posts.

refulgentis

I'm curious if there's any data behind that conclusion; it's hard for me to get to "they did the censorship training on DeepSeek because they knew consumers would love free DeepSeek after seeing screenshots of Tiananmen censorship in DeepSeek".

(The steelman here, of course, is "the screenshots drove buzz, which drove usage!", but it's sort of a steel thread in context: we'd still need to pull in a time machine and a very odd, unmet US consumer demand for models that toe the CCP line.)

pbmango

> Whether intentional or not

I am not claiming it was intentional, but it certainly magnified the media attention. Maybe luck and not 4d chess.

minimaxir

DeepSeek R1 was a massive outlier in terms of media attention (a free model that can potentially kill OpenAI!), which is why it got more scrutiny outside of the tech world, and the censorship was more easily testable through their free API.

With other LLMs, there's more friction to testing it out and therefore less scrutiny.

rfoo

The model does have some bias built in, but it's lighter than expected. From what I've heard this is (sort of) a deliberate choice: just overfit whatever bullshit worldview benchmark the regulator demands your model pass. Don't actually try to be better at it.

For a public chatbot service, all Chinese vendors have their own censorship tech (or just use censorship-as-a-service from a cloud; all the major clouds in China have one), because ultimately you need it for UGC anyway. So why not just censor LLM output with the same stack, too.

dylanjcastillo

I’m most excited about Qwen-30B-A3B. Seems like a good choice for offline/local-only coding assistants.

Until now I found that open weight models were either not as good as their proprietary counterparts or too slow to run locally. This looks like a good balance.

htsh

curious, why the 30b MoE over the 32b dense for local coding?

I do not know much about the benchmarks but the two coding ones look similar.

Casteil

The MoE version with 3b active parameters will run significantly faster (tokens/second) on the same hardware, by about an order of magnitude (i.e. ~4t/s vs ~40t/s)

genpfault

> The MoE version with 3b active parameters

~34 tok/s on a Radeon RX 7900 XTX under today's Debian 13.

esafak

Could this variant be run on a CPU?

moconnor

Probably very well

minimaxir

A 0.6B LLM with a 32k context window is interesting, even if it was trained using only distillation (which is not ideal as it misses nuance). That would be a fun base model for fine-tuning.

Out of all the Qwen3 models on Hugging Face, it's the most downloaded/hearted. https://huggingface.co/collections/Qwen/qwen3-67dd247413f0e2...

jasonjmcghee

These 0.5 and 0.6B models are _fantastic_ for use as a draft model in speculative decoding. LM Studio makes this super easy to do; I have it on like every model I play with now.

My concern with these models, though, is that it seems like the architectures vary a bit, so idk how it'll work.

mmoskal

Spec decoding only depends on the tokenizer used. It's transferring either the draft token sequence or at most the draft logits to the main model.
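A conceptual sketch of (greedy) speculative decoding, just to illustrate why only the tokenizer has to match: the draft model proposes a few tokens cheaply, the target model verifies them in one pass. `draft_next` and `target_logits` are hypothetical stand-ins for real model calls.

```python
import numpy as np

def speculative_step(prompt_ids, draft_next, target_logits, k=4):
    # 1. Draft model proposes k tokens autoregressively (cheap, small model).
    seq = list(prompt_ids)
    proposed = []
    for _ in range(k):
        tok = draft_next(seq)
        proposed.append(tok)
        seq.append(tok)

    # 2. Target model scores the whole proposed run in a single forward pass
    #    (one expensive call instead of k).
    logits = target_logits(list(prompt_ids) + proposed)  # shape: (seq_len, vocab)

    # 3. Accept proposed tokens while they match the target's greedy choice;
    #    on the first mismatch, take the target's token and stop.
    n = len(prompt_ids)
    accepted = []
    for i, tok in enumerate(proposed):
        target_tok = int(np.argmax(logits[n + i - 1]))   # logits predicting position n+i
        if target_tok == tok:
            accepted.append(tok)
        else:
            accepted.append(target_tok)
            break
    return accepted
```

(The probabilistic version accepts/rejects against the target's distribution instead of the argmax, but the mechanics are the same: the two models only need to agree on token IDs.)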

jasonjmcghee

I suppose that makes sense; for some reason I was under the impression that the models needed to be aligned / have the same tuning, or they'd have different probability distributions and would reject the draft model really often.

foundry27

I find the situation the big LLM players find themselves in quite ironic. Sam Altman promised (edit: under duress, from a twitter poll gone wrong) to release an open source model at the level of o3-mini to catch up to the perceived OSS supremacy of Deepseek/Qwen. Now Qwen3’s release makes a model that’s “only” equivalent to o3-mini effectively dead on arrival, both socially and economically.

aoeusnth1

I have a hard time believing that he hadn't already made up his mind to make an open source model when he posted the poll in the first place

krackers

I don't think they will ever do an open-source release, because then the curtains would be pulled back and people would see that they're not actually state of the art. Llama 4 already sort of tanked Meta's reputation; if OpenAI did that it'd decimate the value of their company.

If they do open-source something, I expect them to open-source some existing model (maybe something useless like gpt-3.5) rather than provide something new.

mks_shuffle

Does anyone have insights on the best approaches to compare reasoning models? It is often recommended to use a higher temperature for more creative answers and lower temperature values for more logical and deterministic outputs. However, I am not sure how applicable this advice is for reasoning models. For example, Deepseek-R1 and QwQ-32b recommend a temperature around 0.6, rather than lower values like 0.1–0.3. The Qwen3 blog provides performance comparisons between multiple reasoning models, and I am interested in knowing what configurations they used. However, the paper is not available yet. If anyone has links to papers focused on this topic, please share them here. Also, please feel free to correct me if I’m mistaken about anything. Thanks!
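For what it's worth, a minimal sketch of pinning down the sampling config when comparing reasoning models, assuming an OpenAI-compatible endpoint (e.g. a local vLLM or llama.cpp server). The URL, model name, and sampling values below are placeholders; the model card's recommended settings should take precedence.

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="Qwen3-30B-A3B",             # placeholder model name
    messages=[{"role": "user", "content": "If 3x + 5 = 20, what is x?"}],
    temperature=0.6,                    # the value these reasoning models tend to recommend
    top_p=0.95,
    max_tokens=4096,                    # leave room for the thinking tokens
)
print(resp.choices[0].message.content)
```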

daemonologist

It sounds like these models think a lot, seems like the benchmarks are run with a thinking budget of 32k tokens - the full context length. (Paper's not published yet so I'm just going by what's on the website.) Still, hugely impressive if the published benchmarks hold up under real world use - the A3B in particular, outperforming QWQ, could be handy for CPU inference.

Edit: The larger models have 128k context length. 32k thinking comes from the chart which looks like it's for the 235B, so not full length.

croemer

The benchmark results are so incredibly good they are hard to believe. A 30B model that's competitive with Gemini 2.5 Pro and way better than Gemma 27B?

Update: I tested "ollama run qwen3:30b" (the MoE) locally and while it thought much it wasn't that smart. After 3 follow up questions it ended up in an infinite loop.

I just tried again, and it ended up in an infinite loop immediately, with just a single prompt, no follow-up: "Write a Python script to build a Fitch parsimony tree by stepwise addition. Take a Fasta alignment as input and produce a nwk string as output."

Update 2: The dense one "ollama run qwen3:32b" is much better (albeit slower of course). It still keeps on thinking for what feels like forever until it misremembers the initial prompt.

rahimnathwani

You tried a 4-bit quantized version, not the original.

qwen3:30b has the same checksum as https://ollama.com/library/qwen3:30b-a3b-q4_K_M

croemer

What is the original? The blog post doesn't state the quantization they benchmarked.

rahimnathwani

This 61GB one: https://ollama.com/library/qwen3:30b-a3b-fp16

You can see it's roughly the same size as the one in the official repo (16 files of 4GB each):

https://huggingface.co/Qwen/Qwen3-30B-A3B/tree/main

cye131

These performance numbers look absolutely incredible. The MoE outperforms o1 with 3B active parameters?

We're really getting close to the point where local models are good enough to handle practically every task that most people need to get done.

stavros

> We're really getting close to the point where local models are good enough to handle practically every task that most people need to get done.

After trying to implement a simple assistant/helper with GPT-4.1 and getting some dumb behavior from it, I doubt even proprietary models are good enough for every task.

the_arun

I'm dreaming of a time when commodity CPUs run LLMs for inference & serve at scale.

thierrydamiba

How do people typically do napkin math to figure out if their machine can “handle” a model?

derbaum

Very rough (!) napkin math: for a q8 model (almost lossless), the parameter count in billions ≈ the VRAM requirement in GB. For q4, with some performance loss, it's roughly half. Then you add a little bit for the context window and overhead. So a 32B model at q4 should run comfortably on 20-24 GB.

Again, very rough numbers, there's calculators online.

daemonologist

The ultra-simplified napkin math is 1 GB (V)RAM per 1 billion parameters, at a 4-5 bit-per-weight quantization. This usually gives most of the performance of the full size model and leaves a little bit of room for context, although not necessarily the full supported size.

hn8726

Wondering if I'll get corrected, but my _napkin math_ is looking at the model download size — I estimate it needs at least this amount of vram/ram, and usually the difference in size between various models is large enough not to worry if the real requirements are size +5% or 10% or 15%. LM studio also shows you which models your machine should handle

samsartor

The absolutely dumbest way is to compare the number of parameters with your bytes of RAM. If you have 2 or more bytes of RAM for every parameter you can generally run the model easily (eg 3B model with 8GB of RAM). 1 byte per parameter and it is still possible, but starts to get tricky.

Of course, there are lots of factors that can change the RAM usage: quantization, context size, KV cache. And this says nothing about whether the model will respond quickly enough to be pleasant to use.
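A sketch of that rule plus a rough KV-cache term, since the cache is what quietly eats the remaining headroom at long contexts. The layer/head numbers are placeholders, not any particular model's config.

```python
# Weights: ~2 bytes per parameter is comfortable; ~1 byte per parameter gets tight.
def weights_gb(params_billion, bytes_per_param=2):
    return params_billion * bytes_per_param

# KV cache: 2 (K and V) * layers * kv_heads * head_dim * context * bytes per element.
def kv_cache_gb(layers, kv_heads, head_dim, context_len, bytes_per_elem=2):
    return 2 * layers * kv_heads * head_dim * context_len * bytes_per_elem / 1e9

print(weights_gb(3))                                                         # ~6 GB for a 3B model
print(kv_cache_gb(layers=36, kv_heads=8, head_dim=128, context_len=32768))   # ~4.8 GB
```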
