GPT-OSS vs. Qwen3, and a detailed look at how things have evolved since GPT-2
45 comments
August 10, 2025 · mark_l_watson
lvl155
He does an amazing job of keeping me up to date on this insanely fast-paced space.
7moritz7
Qwen3 is substantially better in my local testing. As in, it adheres to the prompt better (pretty much exactly for the 32B-parameter variant, very impressive) and sounds more organic.
In SimpleBench, gpt-oss (120B) flopped hard, so it doesn't appear particularly good at logical puzzles either.
So presumably, this comes down to...
- training technique or data
- model dimensions
- fewer, larger experts vs. a higher number of smaller experts
jszymborski
If I had to make a guess, I'd say this has much, much less to do with the architecture and far more to do with the data and training pipeline. Many have speculated that gpt-oss has adopted a Phi-like synthetic-only dataset and focused mostly on gaming metrics, and I've found the evidence so far to be sufficiently compelling.
7moritz7
That would be interesting. I've been a bit sceptical of the entire strategy from the beginning. If gpt-oss were actually as good as o3-mini (and in some cases o4-mini) outside benchmarks, that would undermine OpenAI's API offering for GPT-5 nano and maybe mini too.
Edit: found this analysis, it's on the HN frontpage right now
> this thing is clearly trained via RL to think and solve tasks for specific reasoning benchmarks. nothing else.
CuriouslyC
The strategy of Phi isn't bad, it's just not general. It's really a model that's meant to be fine-tuned, but unfortunately fine-tuning tends to shit on RL'd behavior, so it ended up not being that useful. If someone made a Phi-style model with an architecture designed to take knowledge adapters/experts (i.e. a small MoE model designed to have separately trained networks plugged into it, with routing updates via a special LoRA), it'd actually be super useful.
unstatusthequo
Yes. I tried to ask gpt-oss to ask me a riddle. The response was absurd. It came up with a nonsensical question, then told me the answer. The answer was a four-letter “word” that wasn’t actually a real word.
“What is the word that starts with S, ends with E, and contains A? → SAEA”
Then when I said that’s not a word and you gave me the answer already, no fun, it said
“I do not have access to confirm that word.”
threeducks
FWIW, I asked gpt-oss-120b this question 10 times and the answer was always "sauce", "sane" or "sale". I also tried different temperatures (from 0 to 1), which did not seem to have an effect on the correctness of the answer.
EDIT: I now have also questioned the smaller gpt-oss-20b (free) 10 times via OpenRouter (default settings, provider was AtlasCloud) and the answers were: sage, sane, sane, space, sane, sane, sane, sane, space, sane.
You are either very unlucky, your configuration is suboptimal (weird system prompt perhaps?) or there is some bug in whichever system you are using for inference.
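For anyone who wants to reproduce this, here is a rough sketch of that kind of loop via OpenRouter's OpenAI-compatible API. The model slug, environment variable name, and temperature are assumptions, not necessarily the exact setup used above; swap in whatever provider and settings you actually use.

```python
# Rough sketch: ask the same riddle N times at a fixed temperature via
# OpenRouter's OpenAI-compatible API. Model slug and env var name are
# assumptions; adjust to your own provider/settings.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

RIDDLE = "What is the word that starts with S, ends with E, and contains A?"

def sample_answers(model: str, n: int = 10, temperature: float = 0.7) -> list[str]:
    """Collect n completions for the riddle at the given temperature."""
    answers = []
    for _ in range(n):
        resp = client.chat.completions.create(
            model=model,
            temperature=temperature,
            messages=[{"role": "user", "content": RIDDLE}],
        )
        answers.append(resp.choices[0].message.content.strip())
    return answers

if __name__ == "__main__":
    print(sample_answers("openai/gpt-oss-20b"))
```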
roscas
"The word is SAME." for qwen3-coder and it took like 10 seconds to write all the thinking because it writes it down, but processing was 0 seconds.
"I need to find a word that starts with S, ends with E, and contains A.
Let me think through this step by step:
Starts with S
Ends with E
Contains A somewhere in the middle
One word that fits this pattern is "SAGE" - but that only has 4 letters and would be S-A-G-E, which doesn't work because it doesn't contain A in the middle.
Wait, let me reconsider. The word "SAGE" actually works if I read it correctly:
Starts with S
Ends with E
Contains A
Actually, "SAGE" is S-A-G-E, so it starts with S and ends with E, but it contains A in the second position.Let me think of other possibilities:
"SQUARE" - starts with S, ends with E, contains A (in the middle)
"SALE" - starts with S, ends with E, contains A
"SAME" - starts with S, ends with E, contains A
Wait, "SAME" is S-A-M-E - that works! It starts with S, ends with E, and contains A.The word is SAME. "
verisimi
lol. The answer it gave doesn't even end in an 'E'.
BoorishBears
MoE expected performance = sqrt(active parameter count * total parameter count)
sqrt(120*5) ~= 24
GPT-OSS 120B is effectively a 24B parameter model with the speed of a much smaller model
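Spelling out the arithmetic behind that rule of thumb (the parameter counts are rounded, and the geometric-mean rule itself is only a heuristic):

```python
import math

# Geometric-mean rule of thumb for the "effective" dense size of an MoE model:
#   effective_params ≈ sqrt(active_params * total_params)
# Counts below are rounded, in billions of parameters.
total_b = 120   # total parameters of GPT-OSS 120B (rounded)
active_b = 5    # parameters active per token (rounded)

effective_b = math.sqrt(active_b * total_b)
print(f"~{effective_b:.0f}B effective dense-equivalent")  # ~24B
```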
cranberryturkey
qwen3 is slow though. i used it. it worked, but it was slow and lacking features.
roscas
From my experience, qwen3-coder is way better. I only have gpt-oss:20b installed to run a few more tests, but when I give it a program and ask for a summary of what it does, qwen3 just works in a few seconds, while gpt-oss was cancelled after 5 minutes... doing nothing.
So I just use qwen3. Fast and great output. If for some reason I don't get what I need, I might use search engines or Perplexity.
I have a 10GB 3080 and Ryzen 3600x with 32gb of RAM.
Qwen3-coder is amazing. Best I used so far.
lvl155
Qwen3 coder 480B is quite good and on par with Sonnet 4. It’s the first time I realized the Chinese models are probably going to eclipse US-based models pretty soon, at least for coding.
indigodaddy
Where do you use qwen3 480b from, I'm not even seeing it on Openrouter. EDIT nm, openrouter is just calling it qwen3-coder-- when I click for more info it shows it's Qwen3-Coder-480B-A35B-Instruct. And it's one of their free models. Nice
cpursley
That might be a stretch, maybe Sonnet 3.5. But it is pretty impressive as is Kimi on opencode.
smokel
The 20B version doesn't fit in 10GB. That might explain some issues?
mhitza
I've been lightly using gpt-oss-20b, but what I've found is that for smaller (single-sentence) prompts it was easy enough to get it to loop infinitely. Since I'm running it with llama.cpp, I've set a small repetition penalty and haven't encountered those issues since (I'm using it a couple of times a day to analyze diffs, so I might have just gotten lucky).
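For reference, roughly what that looks like through the llama-cpp-python bindings rather than the CLI. This is a sketch only; the model path, quant, context size, and penalty value are placeholders, not the exact settings used above.

```python
# Sketch: run a gpt-oss-20b GGUF with a small repetition penalty to
# discourage the infinite-loop behaviour on short prompts.
# Path, quant, context size, and penalty value are placeholders.
from llama_cpp import Llama

llm = Llama(
    model_path="gpt-oss-20b.Q4_K_M.gguf",  # placeholder GGUF path/quant
    n_ctx=8192,                            # explicit, generous context window
)

diff_text = open("change.diff").read()     # the diff to analyze
out = llm(
    "Summarize what this diff does:\n\n" + diff_text,
    max_tokens=512,
    temperature=0.0,       # deterministic starting point
    repeat_penalty=1.1,    # small repetition penalty
)
print(out["choices"][0]["text"])
```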
nicolaslem
I had the same issue with other models where they would loop, repeating the same character, sentence or paragraph indefinitely. It turns out the context size some tools set by default is 2k, which is way too small.
ModelForge
I’ve been using the ollama version (uses about 13 GB of RAM on macOS) and haven’t had that issue yet. I wonder if that’s maybe an issue with the llama.cpp port?
mhitza
Never used ollama, only ready-to-go models via llamafile and llama.cpp.
Maybe ollama has some defaults it applies to models? I start testing models at 0 temp and tweak from there depending on how they behave.
SV_BubbleTime
Are you using this in an agentic way, or in a copy-and-paste, “code this”, single-input/single-output way?
I’d like to know how far the frontier models are from the local ones for agentic coding.
chaos_emergent
> This is likely because LLMs are typically trained for only a single epoch over massive datasets, which is in contrast to the multi-hundred-epoch training regimes for which dropout was first introduced.
Wait, is this true? That seems like a wild statement to make, relatively unsubstantiated?
typon
No, this is well known. Look at Table 2.2 in the GPT-3 paper.
Scene_Cast2
I find it interesting that the architectures of modern open weight LLMs are so similar, and that most innovation seems to be happening on the training (data, RL) front.
This is contrary to what I've seen in a large ML shop, where architectural tuning was king.
bobbylarrybobby
My guess is that at LLM scale, you really can't try to hyperparameter tune — it's just too expensive. You probably have to do some basic testing of different architectures, settle on one, and then figure out how to make best use of it (data and RL).
ModelForge
Good point. LLMs lower the barrier to entry if someone has enough resources, because those architectures are robust to tweaks provided one throws enough compute and data at them. You can even violate scaling laws and still get a good model (like Llama 3 showed back then).
gglon
> At the time of writing, the highest-ranking non-purely-transformer-based model on the LM Arena is Jamba, which is a transformer–state space model hybrid, at rank 96.)
Tencent's hunyuan-turbos, another hybrid, is currently ranked at 22. https://arxiv.org/abs/2505.15431
oezi
One question I was wondering about regarding the open models released by big labs is how much more they could improve with additional training. GPT-OSS had 2.1M hours of training; how much score improvement could we see at double that?
ModelForge
I think GPT-4.5 was potentially the original GPT-5 model that was larger and pre-trained on more data. Too bad it was too expensive to deploy at scale, so we never saw the RL-ed version.
storus
In my tests, GPT-OSS-120B Q8 was close to DeepSeek R1 671B Q16 in solving graduate-level math but much faster with way fewer thinking tokens.
overfeed
Supporting TFA's thesis that it's trained to be good at benchmarks.
pryelluw
The Qwen3 4B has been very good to use locally. I barely use the online models. Web searches are now more targeted thanks to it. I don’t quite fully trust the output, but it’s generally good. Models like these will revolutionize local knowledge and automation.
indigodaddy
Qwen is telling you better search parameters to then search the web with, or qwen is actually doing web searches for you?
poorman
This article really goes into a lot of detail which is nice. gpt-oss is just not good for agentic use in my observation.
tl;dr: I'll save you a lot of time trying things out for yourself. If you are on a >=32 GB Mac, download LM Studio and then the `qwen3-coder-30b-a3b-instruct-mlx@5bit` model. It uses ~20 GB of RAM, so a 32 GB machine is plenty. Set it up with opencode [1] and you're off to the races! It has great tool-calling ability; gpt-oss doesn't even come close in my observations.
ModelForge
The ollama one uses even less (around 13 GB), which is nice. Apparently the gpt-oss team also shared the mxfp4 optimizations for Metal.
homarp
"From GPT-2 to gpt-oss: Analyzing the Architectural Advances And How They Stack Up Against Qwen3"
Wow, Sebastian Raschka's blog articles are jewels - much appreciated.
I use the gpt-oss and qwen3 models a lot (smaller models locally using Ollama and LM Studio) and commercial APIs for the full-size models.
For local model use, I get very good results with gpt-oss when I "over prompt," that is, I specify a larger amount of context information than I usually do. Qwen3 is simply awesome.
Until about three years ago, I had always understood neural network models (starting in the 1980s), GANs, recurrent networks, LSTMs, etc. well enough to write implementations. I really miss the feeling that I could develop at least simpler LLMs on my own. I am slowly working through Sebastian Raschka's excellent book https://www.manning.com/books/build-a-large-language-model-f... but I will probably never finish it (to be honest).