
Open models by OpenAI

foundry27

Model cards, for the people interested in the guts: https://cdn.openai.com/pdf/419b6906-9da6-406c-a19d-1bb078ac7...

In my mind, I’m comparing the model architecture they describe to what the leading open-weight models (DeepSeek, Qwen, GLM, Kimi) have been doing. Honestly, it just seems “ok” at a technical level:

- both models use standard Grouped-Query Attention (64 query heads, 8 KV heads). The card talks about how they’ve reused an older optimization from GPT-3: alternating between banded-window (sparse, 128-token) and fully dense attention patterns. It uses RoPE extended with YaRN (for a 131K context window). So they haven’t been taking advantage of the special-sauce Multi-head Latent Attention from DeepSeek, or any of the other similar improvements over GQA.

- both models are standard MoE transformers. The 120B model (116.8B total, 5.1B active) uses 128 experts with Top-4 routing. They’re using some kind of gated SwiGLU activation, which the card describes as "unconventional" because of its clamping and whatever residual connections that implies. Again, not using any of DeepSeek’s “shared experts” (for general patterns) + “routed experts” (for specialization) architectural improvements, Qwen’s load-balancing strategies, etc.

- the most interesting thing IMO is probably their quantization solution. They did something to quantize >90% of the model parameters to the MXFP4 format (4.25 bits/parameter) to let the 120B model fit on a single 80GB GPU, which is pretty cool. But we’ve also got Unsloth with their famous 1.58-bit quants :)
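For the curious, the 4.25 bits/parameter figure falls out of the MXFP4 block layout: 32 FP4 (E2M1) values share one 8-bit scale per block. A back-of-envelope sketch (my own arithmetic, not from the card; the real footprint is slightly higher since roughly 10% of the weights stay at higher precision):

    block_size = 32          # MXFP4: 32 FP4 values share one scale
    bits_per_element = 4     # E2M1
    bits_per_scale = 8       # shared per-block exponent scale

    bits_per_param = (block_size * bits_per_element + bits_per_scale) / block_size
    print(bits_per_param)    # 4.25

    total_params = 116.8e9   # gpt-oss-120b total parameter count
    print(total_params * bits_per_param / 8 / 1e9)  # ~62 GB of weights -> fits on an 80 GB GPU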

All this to say, it seems like even though the training they did for their agentic behavior and reasoning is undoubtedly very good, they’re keeping their actual technical advancements “in their pocket”.

rfoo

Or, you could say OpenAI has some real technical advancements in stuff besides the attention architecture. GQA8 and alternating SWA-128 / full attention all seem conventional. Basically they are showing us that "there's no secret sauce in the model arch, you guys just suck at mid/post-training", or at least they want us to believe that.

The model is pretty sparse tho, 32:1.

liuliu

The Kimi K2 paper said that model sparsity scales up with parameter count pretty well (the MoE sparsity scaling law, as they call it, basically calling Llama 4's MoE "done wrong"). Hence K2 has 128:1 sparsity.

nxobject

It's convenient to be able to attribute success to things only OpenAI could've done with the combo of their early start and VC money – licensing content, hiring subject matter experts, etc. Essentially the "soft" stuff that a mature organization can do.

tgtweak

I think their MXFP4 release is a bit of a gift, since they obviously used and tuned this extensively for cost optimization at scale - something the open-source model providers aren't doing much of, and also somewhat of a competitive advantage.

Unsloth's special quants are amazing, but I've found there are lots of trade-offs vs full precision, particularly when striving for the best first-shot attempts - which is by far the bulk of LLM use cases. Running a better (larger, newer) model at lower quantization to fit in memory, or with reduced accuracy/detail to speed it up, both have value, but in the pursuit of first-shot accuracy there don't seem to be many companies running their frontier models at reduced quantization. If OpenAI is doing this in production, that is interesting.

highfrequency

I would guess the “secret sauce” here is distillation: pretraining on an extremely high quality synthetic dataset from the prompted output of their state of the art models like o3 rather than generic internet text. A number of research results have shown that highly curated technical problem solving data is unreasonably effective at boosting smaller models’ performance.

This would be much more efficient than relying purely on RL post-training on a small model; with low baseline capabilities the insights would be very sparse and the training very inefficient.

asadm

> research results have shown that highly curated technical problem solving data is unreasonably effective at boosting smaller models’ performance.

same seems to be true for humans

mclau157

You can get similar insights looking at the github repo https://github.com/openai/gpt-oss

danieldk

Also: attention sinks (although implemented as extra trained logits used in attention softmax rather than attending to e.g. a prepended special token).
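A toy numpy sketch of that mechanism as I understand it (one query, one head; in the real model it's a learned per-head logit inside the fused attention kernel):

    import numpy as np

    def softmax_with_sink(scores, sink_logit):
        # scores: raw attention scores for one query over the keys
        # sink_logit: extra learned scalar that competes in the softmax
        z = np.concatenate([scores, [sink_logit]])
        z -= z.max()                          # numerical stability
        p = np.exp(z) / np.exp(z).sum()
        return p[:-1]                         # drop the sink column

    w = softmax_with_sink(np.array([2.0, 1.0, 0.5]), sink_logit=3.0)
    print(w, w.sum())                         # sums to < 1; the excess mass "parks" on the sink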

logicchains

>They did something to quantize >90% of the model parameters to the MXFP4 format (4.25 bits/parameter) to let the 120B model fit on a single 80GB GPU, which is pretty cool

They said it was native FP4, suggesting that they actually trained it like that; it's not post-training quantisation.

rushingcreek

The native FP4 is one of the most interesting architectural aspects here IMO, as going below FP8 is known to come with accuracy tradeoffs. I'm curious how they navigated this, and how the FP8 weights (if they exist) would perform in comparison.

cco

The lede is being missed imo.

gpt-oss:20b is a top-ten model on MMLU (right behind Gemini 2.5 Pro), and I just ran it locally on my MacBook Air M3 from last year.

I've been experimenting with a lot of local models, both on my laptop and on my phone (Pixel 9 Pro), and I figured we'd be here in a year or two.

But no, we're here today. A basically frontier model, running for the cost of electricity (free with a rounding error) on my laptop. No $200/month subscription, no lakes being drained, etc.

I'm blown away.

FergusArgyll

> To improve the safety of the model, we filtered the data for harmful content in pre-training, especially around hazardous biosecurity knowledge, by reusing the CBRN pre-training filters from GPT-4o. Our model has a knowledge cutoff of June 2024.

This would be a great "AGI" test. See if it can derive biohazards from first principles

simonw

Just posted my initial impressions, took a couple of hours to write them up because there's a lot in this release! https://simonwillison.net/2025/Aug/5/gpt-oss/

TLDR: I think OpenAI may have taken the medal for best available open weight model back from the Chinese AI labs. Will be interesting to see if independent benchmarks resolve in that direction as well.

The 20B model runs on my Mac laptop using less than 15GB of RAM.

x187463

Running a model comparable to o3 on a 24GB Mac Mini is absolutely wild. Seems like yesterday the idea of running frontier (at the time) models locally or on a mobile device was 5+ years out. At this rate, we'll be running such models in the next phone cycle.

tedivm

It only seems like that if you haven't been following other open source efforts. Models like Qwen perform ridiculously well and do so on very restricted hardware. I'm looking forward to seeing benchmarks to see how these new open source models compare.

Rhubarrbb

Agreed, these models seem relatively mediocre compared to Qwen3 / GLM 4.5

modeless

Nah, these are much smaller models than Qwen3 and GLM 4.5 with similar performance. Fewer parameters and fewer bits per parameter. They are much more impressive and will run on garden variety gaming PCs at more than usable speed. I can't wait to try on my 4090 at home.

There's basically no reason to run other open source models now that these are available, at least for non-multimodal tasks.

moralestapia

You can always get your $0 back.

cvadict

Yes, but they are suuuuper safe. /s

So far I have mixed impressions, but they do indeed seem noticeably weaker than comparably-sized Qwen3 / GLM4.5 models. Part of the reason may be that the oai models do appear to be much more lobotomized than their Chinese counterparts (which are surprisingly uncensored). There's research showing that "aligning" a model makes it dumber.

echelon

This might mean there's no moat for anything.

Kind of a P=NP, but for software deliverability.

CamperBob2

On the subject of who has a moat and who doesn't, it's interesting to look at the role of patents in the early development of wireless technology. There was WWI, and there was WWII, but the players in the nascent radio industry had serious beef with each other.

I imagine the same conflicts will ramp up over the next few years, especially once the silly money starts to dry up.

a_wild_dandan

Right? I still remember the safety outrage of releasing Llama. Now? My 96 GB of (V)RAM MacBook will be running a 120B parameter frontier lab model. So excited to get my hands on the MLX quants and see how it feels compared to GLM-4.5-air.

4b6442477b1280b

in that era, OpenAI and Anthropic were still deluding themselves into thinking they would be the "stewards" of generative AI, and the last US administration was very keen on regoolating everything under the sun, so "safety" was just an angle for regulatory capture.

God bless China.

a_wild_dandan

Oh absolutely, AI labs certainly talk their books, including any safety angles. The controversy/outrage extended far beyond those incentivized companies too. Many people had good faith worries about Llama. Open-weight models are now vastly more powerful than Llama-1, yet the sky hasn't fallen. It's just fascinating to me how apocalyptic people are.

I just feel lucky to be around in what's likely the most important decade in human history. Shit odds on that, so I'm basically a lotto winner. Wild times.

narrator

Yeah, China is e/acc. Nice cheap solar panels too. Thanks China. The problem is their ominous policies, like not allowing almost any immigration, and their domestic Han-supremacist propaganda, which make it look a bit like this might be Han-supremacy e/acc. Is it better than western/decel? Hard to say, but at least the western/decel people are now starting to talk about building power plants, at least for datacenters, and things like that, instead of demanding whole branches of computer science be classified, as they were threatening to Marc Andreessen when he visited the Biden admin last year.

bogtog

When people talk about running a (quantized) medium-sized model on a Mac Mini, what types of latency and throughput times are they talking about? Do they mean like 5 tokens per second or at an actually usable speed?

davio

On a M1 MacBook Air with 8GB, I got this running Gemma 3n:

12.63 tok/sec • 860 tokens • 1.52s to first token

I'm amazed it works at all with such limited RAM

v5v3

I have started a crowdfunding to get you a MacBook air with 16gb. You poor thing.

phonon

Here's a 4bit 70B parameter model, https://www.youtube.com/watch?v=5ktS0aG3SMc (deepseek-r1:70b Q4_K_M) on a M4 Max 128 GB. Usable, but not very performant.

n42

here's a quick recording from the 20b model on my 128GB M4 Max MBP: https://asciinema.org/a/AiLDq7qPvgdAR1JuQhvZScMNr

and the 120b: https://asciinema.org/a/B0q8tBl7IcgUorZsphQbbZsMM

I am, um, floored

Rhubarrbb

Generation is usually fast, but prompt processing is the main limitation with local agents. I also have a 128 GB M4 Max. How is the prompt processing on long prompts? Processing the system prompt for Goose always takes quite a while for me. I haven't been able to download the 120B yet, but I'm looking to switch to either that or GLM-4.5-Air as my main driver.

Davidzheng

the active param count is low so it should be fast.

a_wild_dandan

GLM-4.5-air produces tokens far faster than I can read on my MacBook. That's plenty fast enough for me, but YMMV.


Imustaskforhelp

Okay, I will be honest, I was so hyped up about this model, but then I went to r/LocalLLaMA and saw that the

120B model is worse at coding than Qwen3 Coder, GLM-4.5 Air, and even Grok 3... (https://www.reddit.com/r/LocalLLaMA/comments/1mig58x/gptoss1...)

pxc

Qwen3 Coder is 4x its size! Grok 3 is over 22x its size!

What does the resource usage look like for GLM 4.5 Air? Is that benchmark in FP16? GPT-OSS-120B will be using between 1/4 and 1/2 the VRAM that GLM-4.5 Air does, right?

It seems like a good showing to me, even though Qwen3 Coder and GLM 4.5 Air might be preferable for some use cases.

ascorbic

That's SVGBench, which is a useful benchmark but isn't much of a test of general coding

Imustaskforhelp

Hm, alright. I will actually play around with this model instead of forming quick opinions.

Thanks.

logicchains

It's only got around 5 billion active parameters; it'd be a miracle if it was competitive at coding with SOTA models that have significantly more.

jph00

On this bench it underperforms vs glm-4.5-air, which is an MoE with fewer total params but more active params.

tyho

What's the easiest way to get these local models browsing the web right now?

dizhn

aider uses Playwright. I don't know what everybody is using but that's a good starting point.

larodi

We'll be running them on Pis off spare juice in no time, and there'll be billions of them, given how fast chips and embedded hardware are spreading…

ClassAndBurn

Open models are going to win long-term. Anthropic's own research has to use OSS models [0]. China is demonstrating how quickly companies can iterate on open models, giving smaller teams access to, and the ability to augment, a model's capabilities without paying the training cost.

My personal prediction is that the US foundational model makers will OSS something close to N-1 for the next 1-3 iterations. The CAPEX for foundational model creation is too high to justify OSS for the current generation. Unless the US Gov steps up and starts subsidizing power, or Stargate does 10x what is planned right now.

N-1 model value depreciates insanely fast. Making an OSS release of them and allowing specialized use cases and novel developments allows potential value to be captured and integrated into future model designs. It's medium risk, as you may lose market share. But also high potential value, as the shared discoveries could substantially increase the velocity of next-gen development.

There will be a plethora of small OSS models. Iteration on the OSS releases is going to be biased towards local development, creating more capable and specialized models that work on smaller and smaller devices. In an agentic future, every different agent in a domain may have its own model. Distilled and customized for its use case without significant cost.

Everyone is racing to AGI/SGI. The models along the way are to capture market share and use data for training and evaluations. Once someone hits AGI/SGI, the consumer market is nice to have, but the real value is in novel developments in science, engineering, and every other aspect of the world.

[0] https://www.anthropic.com/research/persona-vectors > We demonstrate these applications on two open-source models, Qwen 2.5-7B-Instruct and Llama-3.1-8B-Instruct.

xpe

> Open models are going to win long-term.

[2 of 3] Assuming we pin down what win means... (which is definitely not easy)... What would it take for this to not be true? There are many ways, including but not limited to:

- publishing open weights helps your competitors catch up

- publishing open weights doesn't improve your own research agenda

- publishing open weights leads to a race dynamic where only the latest and greatest matters; leading to a situation where the resources sunk exceed the gains

- publishing open weights distracts your organization from attaining a sustainable business model / funding stream

- publishing open weights leads to significant negative downstream impacts (there are a variety of uncertain outcomes, such as: deepfakes, security breaches, bioweapon development, unaligned general intelligence, humans losing control [1] [2], and so on)

[1]: "What failure looks like" by Paul Christiano : https://www.alignmentforum.org/posts/HBxe6wdjxK239zajf/what-...

[2]: "An AGI race is a suicide race." - quote from Max Tegmark; article at https://futureoflife.org/statement/agi-manhattan-project-max...

lechatonnoir

I'm pretty sure there's no reason that Anthropic has to do research on open models; it's just that they produced their results on open models so that you can reproduce them without having access to Anthropic's own models.

Adrig

I'm a layman but it seemed to me that the industry is going towards robust foundational models on which we plug tools, databases, and processes to expand their capabilities.

In this setup OSS models could be more than enough and capture the market, but I don't see where the value would be in a multitude of specialized models we have to train.

xpe

> Open models are going to win long-term.

[1 of 3] For the sake of argument here, I'll grant the premise. If this turns out to be true, it glosses over other key questions, including:

For a frontier lab, what is a rational period of time (according to your organizational mission / charter / shareholder motivations*) to wait before:

1. releasing a new version of an open-weight model; and

2. how much secret sauce do you hold back?

* Take your pick. These don't align perfectly with each other, much less the interests of a nation or world.

albertzeyer

> Once someone hits AGI/SGI

I don't think there will be such a unique event. There is no clear boundary. This is a continuous process. Models just keep getting slightly better than before.

Also, another dimension is the inference cost to run those models. It has to be cheap enough to really take advantage of it.

Also, I wonder what would be a good target for making a profit, for developing new things. There is Isomorphic Labs, which seems like a good target. That company already exists now, and people are working on it. What else?

dom96

> I don't think there will be such a unique event.

I guess it depends on your definition of AGI, but if it means human level intelligence then the unique event will be the AI having the ability to act on its own without a "prompt".

renmillar

There's no reason that models too large for consumer hardware wouldn't keep a huge edge, is there?

xpe

> Open models are going to win long-term.

[3 of 3] What would it take for this statement to be false or missing the point?

Maybe we find ourselves in a future where:

- Yes, open models are widely used as base models, but they are also highly customized in various ways (perhaps by industry, person, attitude, or something else). In other words, this would be a blend of open and closed.

- Maybe publishing open weights of a model is more-or-less irrelevant, because it is "table stakes" ... because all the key differentiating advantages have to do with other factors, such as infrastructure, non-LLM computational aspects, regulatory environment, affordable energy, customer base, customer trust, and probably more.

- The future might involve thousands or millions of highly tailored models

teaearlgraycold

> N-1 model value depreciates insanely fast

This implies LLM development hasn’t plateaued. Sure, the researchers are busting their asses quantizing, adding features like tool calls and structured outputs, etc. But soon enough N-1 ~= N

lukax

Inference in Python uses harmony [1] (for the request and response format), which is written in Rust with Python bindings. Another of OpenAI's Rust libraries is tiktoken [2], used for all tokenization and detokenization. OpenAI Codex [3] is also written in Rust. It looks like OpenAI is increasingly adopting Rust (at least for inference).

[1] https://github.com/openai/harmony

[2] https://github.com/openai/tiktoken

[3] https://github.com/openai/codex
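For anyone who hasn't touched it, the Python bindings over the Rust tiktoken core are about as simple as it gets (a small illustration; gpt-oss itself uses a harmony variant of the o200k vocabulary via the harmony library, I believe, so the stock o200k_base encoding below is just for demonstration):

    import tiktoken

    enc = tiktoken.get_encoding("o200k_base")   # Rust core underneath
    tokens = enc.encode("Open models by OpenAI")
    print(tokens)
    print(enc.decode(tokens))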

chilipepperhott

As an engineer that primarily uses Rust, this is a good omen.

zone411

I benchmarked the 120B version on Extended NYT Connections (759 questions, https://github.com/lechmazur/nyt-connections) and both the 120B and 20B on Thematic Generalization (810 questions, https://github.com/lechmazur/generalization). Opus 4.1 benchmarks are also there.

deviation

So this confirms a best-in-class model release within the next few days?

From a strategic perspective, I can't think of any reason they'd release this unless they were about to announce something which totally eclipses it?

ticulatedspline

Even without an imminent release it's a good strategy. They're getting pressure from Qwen and other high-performing open-weight models. Without a horse in the race they could fall behind in an entire segment.

There's future opportunity in licensing, tech support, agents, or even simply dominating and eliminating. Not to mention brand awareness: if you like these, you might be more likely to approach their brand for larger models.

winterrx

GPT-5 coming Thursday.

ciaranmca

Are these the stealth models Horizon Alpha and Beta? I was generally impressed with them (although I really only used them in chat rather than any code tasks). In terms of chat, I increasingly see very little difference between the current SOTA closed models and their open-weight counterparts.

boringg

How much hype do we anticipate for the release of GPT-5, or whatever name it ends up with? And how many new features?

selectodude

Excited to have to send them a copy of my driver's license to try and use it. That'll take the hype down a notch.

bredren

Undoubtedly. It would otherwise reduce the perceived value of their current product offering.

The question is how much better the new model(s) will need to be on the metrics given here to feel comfortable making these available.

Despite the loss of face for the lack of open model releases, I do not think that was a big enough problem to undercut commercial offerings.

og_kalu

Even before today, it's been clear for the last week or so, for a couple of reasons, that GPT-5's release was imminent.


logicchains

> I can't think of any reason they'd release this unless they were about to announce something which totally eclipses it

Given it has only around 5 billion active params, it shouldn't be a competitor to o3 or any of the other SOTA models, since the top DeepSeek and Qwen models have around 30 billion active params. Unless OpenAI somehow found a way to make a model with 5 billion active params perform as well as one with 4-8 times more.

timmg

Orthogonal, but I just wanted to say how awesome Ollama is. It took 2 seconds to find the model and a minute to download and now I'm using it.

Kudos to that team.

_ache_

To be fair, it's with the help of OpenAI. They did it together, before the official release.

https://ollama.com/blog/gpt-oss

aubanel

From experience, it's much more engineering work on the integrator's side than on OpenAI's. Basically they provide you with their new model in advance, but they don't know the specifics of your system, so it's normal that you do most of the work. Thus I'm particularly impressed by Cerebras: they only support a few models for their extreme-performance inference, so it must have been a huge bespoke effort to integrate.

henriquegodoy

Seeing a 20B model competing with o3's performance is mind-blowing. Just a year ago, most of us would've called this impossible - not just the intelligence leap, but getting this level of capability in such a compact size.

I think the point that makes me most excited is that we can train trillion-parameter giants and distill them down to just billions without losing the magic. Imagine coding with Claude 4 Opus-level intelligence packed into a 10B model running locally at 2000 tokens/sec - instant AI collaboration. That would fundamentally change how we develop software.

coolspot

10B * 2000 t/s = 20,000 GB/s of memory bandwidth. Apple hardware can do 1k GB/s.
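Spelling out that back-of-envelope (a sketch that assumes roughly one byte of weight traffic per active parameter per generated token, i.e. 8-bit weights, and ignores KV-cache reads and batching):

    def decode_bandwidth_gb_s(active_params_billions, tokens_per_s, bytes_per_param=1.0):
        # GB/s of weight reads needed to sustain a given single-stream decode speed
        return active_params_billions * bytes_per_param * tokens_per_s

    print(decode_bandwidth_gb_s(10, 2000))   # 20,000 GB/s for a dense 10B model
    print(decode_bandwidth_gb_s(3.6, 300))   # ~1,080 GB/s, roughly 5090-class territory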

artembugara

Disclaimer: probably dumb questions

so, the 20b model.

Can someone explain to me what I would need to do in terms of resources (GPU, I assume) if I want to run 20 concurrent processes, assuming I need 1k tokens/second throughput (on each, so 20 x 1k)

Also, is this model better than or comparable to gpt-4.1-nano for information extraction, and would it be cheaper to host the 20b myself?

mlyle

An A100 is probably 2-4k tokens/second on a 20B model with batched inference.

Multiply the number of A100's you need as necessary.

Here, you don't really need the RAM. If you could accept fewer tokens/second, you could do it much cheaper with consumer graphics cards.

Even with an A100, the batching sweet spot is not going to give you 1k tokens/process/second. Of course, you could go up to an H100...

d3m0t3p

You can only batch if you have distinct chats in parallel.

mythz

gpt-oss:20b is ~14GB on disk [1], so it fits nicely within a 16GB VRAM card.

[1] https://ollama.com/library/gpt-oss

dragonwriter

You also need space in VRAM for what is required to support the context window; you might be able to do a model that is 14GB in parameters with a small (~8k maybe?) context window on a 16GB card.

artembugara

thanks, this part is clear to me.

but I need to understand 20 x 1k token throughput

I assume it just might be too early to know the answer

Tostino

I legitimately cannot think of any hardware I know of that will get you to that throughput over that many streams (I don't work in the server space, so there may be some new stuff I'm unaware of).

PeterStuer

(Answer for 1 inference.) All depends on the context length you want to support, as the activation memory will dominate the requirements. For 4096 tokens you will get away with 24GB (or even 16GB), but if you want to go for the full 131072 tokens you are not going to get there with a 32GB consumer GPU like the 5090. You'll need to spring for at minimum an A6000 (48GB) or preferably an RTX 6000 Pro (96GB).

Also keep in mind this model does use 4-bit layers for the MoE parts. Unfortunately, native accelerated 4-bit support only started with Blackwell on NVIDIA, so your 3090/4090/A6000/A100s are not going to be fast. An RTX 5090 will be your best starting point in the traditional card space. Maybe the unified-memory mini PCs like the Spark systems or the Mac mini could be an alternative, but I don't know them well enough.
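For a rough sense of just the KV-cache part of that context cost, here's a sketch. The 8 KV heads of dim 64 match the GQA setup mentioned upthread; the layer count is illustrative, so check the card for the real shapes:

    def kv_cache_gb(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem=2):
        # 2x for keys and values, bf16 (2 bytes) by default
        return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem / 1e9

    print(kv_cache_gb(24, 8, 64, 4096))     # ~0.2 GB at a 4K context
    print(kv_cache_gb(24, 8, 64, 131072))   # ~6.4 GB at the full 131K context
    # The alternating 128-token banded layers need far less than this upper bound.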

vl

How do Macs compare to RTXs for this? I.e., what numbers can be expected from a Mac mini/Mac Studio with 64/128/256/512GB of unified memory?

petuman

> assuming I need 1k tokens/second throughput (on each, so 20 x 1k)

3.6B params activated at Q8 x 1000 t/s = 3.6TB/s just for the activated model weights (there's also context). So pretty much straight to a B200 and the like. 1000 t/s per user/agent is way too fast; make it 300 t/s and you could get away with a 5090/RTX PRO 6000.

spott

Groq is offering 1k tokens per second for the 20B model.

You are unlikely to match Groq on off-the-shelf hardware, as far as I'm aware.