Mistral Small 3

165 comments · January 30, 2025

simonw

I'm excited about this one - they seem to be directly targeting the "best model to run on a decent laptop" category, hence the comparison with Llama 3.3 70B and Qwen 2.5 32B.

I'm running it on a M2 64GB MacBook Pro now via Ollama and it's fast and appears to be very capable. This downloads 14GB of model weights:

  ollama run mistral-small:24b
Then using my https://llm.datasette.io/ tool (so I can log my prompts to SQLite):

  llm install llm-ollama
  llm -m mistral-small:24b "say hi"
More notes here: https://simonwillison.net/2025/Jan/30/mistral-small-3/

simonw

The API pricing is notable too: they dropped the prices by half from the old Mistral Small - that one was $0.20/million tokens of input, $0.60/million for output.

The new Mistral Small 3 API model is $0.10/$0.30.

For comparison, GPT-4o-mini is $0.15/$0.60.

85392_school

Competition will likely be cheaper. (For context, Deepinfra runs larger 32B models at $0.07/$0.16)

isoprophlex

I make very heavy use of structured output (to convert unstructured data into something processable, eg for process mining on customer service mailboxes)

Is it any good for this, if you've tested it?

I'm looking for something that hits the sweet spot of running locally and following a prescribed output structure, but I've been quite underwhelmed so far.

enkrs

I thought structured output was a solved problem by now. I've had consistent results with Ollama structured outputs [1] by passing a Zod schema with the request. Works even with very small models. What are the challenges you're facing?

[1] https://ollama.com/blog/structured-outputs
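
The mechanism is the same from the Python client: pass a JSON schema (here generated from a Pydantic model) as the format field. A minimal sketch, with the model tag and schema purely illustrative:

  from ollama import chat
  from pydantic import BaseModel

  class Ticket(BaseModel):
      customer: str
      intent: str
      urgency: str

  # Passing a JSON schema as `format` constrains decoding to that schema.
  response = chat(
      model="mistral-small:24b",
      messages=[{"role": "user", "content": "Summarize this support email: ..."}],
      format=Ticket.model_json_schema(),
  )
  ticket = Ticket.model_validate_json(response.message.content)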

freehorse

Structured output is solved; it's structuring data that isn't, because that is an unbounded problem. There is no limit to how messy your data may be, and no limit to the accuracy and efficiency you may require.

I have used such models to structure human-generated data into something a script can then read and process, pulling the important aspects of the data (e.g. what time the human reported doing X, for how long, with whom, etc.) into something like a CSV file with columns for timestamps and whatever variables I am interested in.

Der_Einzige

For anyone who thinks it isn't "solved", Outlines debunked the paper that claims "structured generation harms creativity":

https://blog.dottxt.co/say-what-you-mean.html

the_mitsuhiko

I get decent JSON out of it with the "assistant: {" trick. I'm not sure how well it's trained to produce JSON, but the template on ollama has tool calls, so I assume they made sure JSON works: https://ollama.com/library/mistral-small:24b/blobs/6db27cd4e...
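
Roughly, the trick is to open the assistant turn with "{" yourself so the model has little choice but to continue a JSON object. A sketch against Ollama's raw generate endpoint using Mistral's [INST] template (the exact template and the need to trim trailing text are assumptions; some APIs expose the same idea as an explicit assistant-prefix option):

  import json
  import requests

  # End the prompt inside the assistant turn with "{" so the continuation is JSON.
  prompt = (
      "[INST] Extract sender and request from this email as JSON with keys "
      '"sender" and "request": ... [/INST] {'
  )
  resp = requests.post(
      "http://localhost:11434/api/generate",
      json={"model": "mistral-small:24b", "prompt": prompt, "raw": True, "stream": False},
  )
  completion = resp.json()["response"]
  data = json.loads("{" + completion)  # re-attach the forced brace; may need trimming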

a_wild_dandan

And for anyone looking to dig deeper, check out "grammar-based sampling."

azinman2

What's the "assistant: {" trick? You just end your prompt with that?

mercer

I've found phi4 to be very good for this.

rkwz

What local models are you currently using and what issues are you facing?

starik36

The only model that I've found to be useful in processing customer emails is o1-preview. The rest of the models work too, but they don't pick up all the minutiae of the emails.

My scenario is pretty specific though and is all about determining intent (e.g. what does the customer want) and mapping it onto my internal structures.

The model is very slow, but definitely worth it.

pks016

Question for people who have spent more time with these small models: what's the current best small model for extracting information from a large number of PDFs? I have multiple collections of research articles, and I want two things: 1) extract info from the PDFs, and 2) classify papers based on their content.

Or point me in the right direction.

themanmaran

Hey this is something we know a lot about. I'd say Qwen 2.5 32B would be the best here.

We've found GPT-4o/Claude 3.5 to benchmark at around 85% accuracy on document extraction. With Qwen 72B at around 70%. Smaller models will go down from there.

But it really depends on the complexity of the documents, and how much information you're looking to pull out. Is it something easy like document_title, or hard like array_of_all_citations?

pks016

Most of them are experimental studies, so it would be text extraction of things like title, authors, species studied, sample size, etc., plus classification based on the content of the PDFs.

I tried GPT-4o; it's good, but it'll cost a lot if I want to process all the documents.

rahimnathwani

Given you have 64GB RAM, you could run mistral-small:24b-instruct-2501-q8_0

prettyblocks

Hey Simon - In your experience, what's the best "small" model for function/tool calling? Of the ones I've tested they seem to return the function call even when it's not needed, which requires all kinds of meta prompting & templating to weed out. Have you found a model that more or less just gets it right?

simonw

I'm afraid I don't have a great answer for that - I haven't spent nearly enough time with function calling in these models.

I'm hoping to add function calling to my LLM library soon which will make me much better equipped to experiment here.

prettyblocks

Cool, thanks for the reply. Looking forward to following along with your progress.

jhickok

Do you know how many tokens per second you are getting? I have a similar laptop that I can test on later but if you have that info handy let me know!

snickell

M2 max with 64GB: 14 tokens/s running `ollama run mistral-small:24b --verbose`

halyconWays

Maybe I'm an outlier but I don't see much value in running tiny local models vs. using a more powerful desktop in my house to host a larger and far more usable model. I run Open WebUI and connect it to my own llama.cpp/koboldcpp that runs a 4-bit 70B model, and can connect to it anywhere easily with Tailscale. For questions that even 70B can't handle I have Open WebUI hit OpenRouter and can choose between all the flagship models.

Every time I've tried a tiny model it's been too questionable to trust.

kamranjon

Have you tried Gemma 27b? I’ve been using it with llamafile and it’s pretty incredible. I think the winds are changing a bit and small models are becoming much more capable. Worth giving some of the smaller ones a shot if it’s been a while. I can run Gemma 27b on my 32gb MacBook Pro and it’s pretty capable with code too.

jonas21

I don't understand the joke.

simonw

It's hardly a joke at all. Even the very best models tend to be awful at writing jokes.

I find the addition of an explanation at the end (never a sign of a good joke) amusing at the meta-level:

  Why did the badger bring a puffin to the party?

  Because he heard puffins make great party 'Puffins'!

  (That's a play on the word "puffins" and the phrase "party people.")

dgacmu

A man walks up to an llm and asks it to tell him a joke about a puffin and a badger.

The llm replies with a joke that is barely a joke.

The man says "another."

The llm gives another unfunny response.

"Another!"

Followed by another similarly lacking response.

"Another!"

With exasperation, the llm replies "stop badgering me!"

Except it won't, because that's not a high likelihood output. ;)

becquerel

And yet LLMs tend to in fact be very funny, just very very rarely on purpose.

emmelaich

Apparently "party puffin" is a company that sells cheap party supplies and decorations. That's all that I can think of.

tadamcz

Hi! I'm Tom, a machine learning engineer at the nonprofit research institute Epoch AI [0]. I've been working on building infrastructure to:

* run LLM evaluations systematically and at scale

* share the data with the public in a rigorous and transparent way

We use the UK government's Inspect [1] library to run the evaluations.

As soon as I saw this news on HN, I evaluated Mistral Small 3 on MATH [2] level 5 (hardest subset, 1,324 questions). I get an accuracy of 0.45 (± 0.011). We sample the LLM 8 times for each question, which lets us obtain less noisy estimates of mean accuracy, and measure the consistency of the LLM's answers. The 1,324*8=10,584 samples represent 8.5M tokens (2M in, 6.5M out).
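
For readers who haven't used Inspect, a task is essentially a dataset plus a solver plus a scorer; here is a rough, illustrative sketch of the shape of such an eval (the toy sample and model string are made up, and this is not Epoch's actual setup):

  from inspect_ai import Task, task, eval
  from inspect_ai.dataset import Sample
  from inspect_ai.scorer import match
  from inspect_ai.solver import generate

  @task
  def tiny_math():
      # One toy sample; the real MATH level-5 subset has over a thousand questions.
      return Task(
          dataset=[Sample(input="What is 17 * 23?", target="391")],
          solver=generate(),
          scorer=match(),
      )

  # epochs=8 mirrors sampling each question 8 times to reduce noise.
  eval(tiny_math(), model="mistral/mistral-small-latest", epochs=8)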

You can see the full transcripts here in Inspect’s interactive interface: https://epoch.ai/inspect-viewer/484131e0/viewer?log_file=htt...

Note that MATH is a different benchmark from the MathInstruct [3] mentioned in the OP.

It's still early days for Epoch AI's benchmarking work. I'm developing a systematic database of evaluations run directly by us (so we can share the full details transparently), which we hope to release very soon.

[0]: https://epoch.ai/

[1]: https://github.com/UKGovernmentBEIS/inspect_ai

[2]: https://arxiv.org/abs/2103.03874

[3]: https://huggingface.co/datasets/TIGER-Lab/MathInstruct

coalteddy

Thanks a lot for this eval!

One question I have regarding evals: what sampling temperature and/or method do you use? As far as I understand, temperature/method can affect model output a lot. Would love to hear your thoughts on how these different settings of the same model can impact output, and how to go about evaluating models when it's not clear how to use them to their fullest.

asb

Note the announcement at the end, that they're moving away from the non-commercial only license used in some of their models in favour of Apache:

We’re renewing our commitment to using Apache 2.0 license for our general purpose models, as we progressively move away from MRL-licensed models

diggan

Note that this seems to be about the weights themselves; AFAIK, the actual training code and datasets (for example) aren't publicly available.

It's a bit like developing a binary application and slapping a FOSS license on the binary while keeping the code proprietary. Not saying that's wrong or anything, but people reading these announcements tend to misunderstand what actually got FOSS licensed when the companies write stuff like this.

crawshaw

It's not the same as slapping an open source license on a binary, because unencumbered weights are so much more generally useful than your typical program binary. Weights are fine-tunable and embeddable into a wide range of software.

To consider just the power of fine tuning: all of the press DeepSeek have received is over their R1 model, a relatively tiny fine-tune on their open source V3 model. The vast majority of the compute and data pipeline work to build R1 was complete in V3, while that final fine-tuning step to R1 is possible even by an enthusiastic dedicated individual. (And there are many interesting ways of doing it.)

The insistence every time open sourced model weights come up that it is not "truly" open source is tiring. There is enormous value in open source weights compared to closed APIs. Let us call them open source weights. What you want can be "open source data" or somesuch.

diggan

> The insistence every time open sourced model weights come up that it is not "truly" open source is tiring. There is enormous value in open source weights compared to closed APIs. Let us call them open source weights. What you want can be "open source data" or somesuch.

Agree that there is more value in open source weights than closed APIs, but what I really want to enable is people learning how to create their own models from scratch. FOSS to me means being able to learn from other projects how to build the thing yourself, and I wrote about why this is important to me here: https://news.ycombinator.com/item?id=42878817

It's not a puritan view but purely practical. Many companies started using FOSS as a marketing label (like what Meta does) and as someone who probably wouldn't be a software developer without being able to learn from FOSS, it fucking sucks that the ML/AI ecosystem is seemingly OK with the term being hijacked.

eldenring

It's not exactly the same, since you can still finetune it, modify the weights, serve it with different engines, etc.

This kind of purity test mindset doesn't help anyone. They are shipping the most modifiable form of their model.

null

[deleted]

diggan

Agree that it's not exactly the same, all analogies have holes, they're simplifications after all.

I guess I'm wary of the messaging because I'm a developer 99% thanks to FOSS, and to being able to learn from FOSS projects how to build similar stuff myself. Without FOSS, I probably wouldn't have been able to "escape" the working class my family was "stuck in" when I grew up.

I want to do whatever I can to make sure others have the same opportunity, and it doesn't matter whether the weights themselves are FOSS or not; others cannot learn how to create their own models just by looking at the weights. You need to be able to learn about the model architecture, the training, and what datasets the models use too, otherwise you won't get very far.

> This kind of purity test mindset doesn't help anyone. They are shipping the most modifiable form of their model.

It does help others who might be stuck in the same situation I was stuck in, that's not nothing nor is it about "purity". They're not shipping the most open model they can, they could have done something like OLMo (https://github.com/allenai/OLMo) which can teach people how to build their own models from scratch.

null

[deleted]

zamalek

Binaries can do arbitrary things, like report home to a central server. Weights cannot.

actualwitch

Depending on format, they might.

dismalaf

But the weights can be modified. Also, the real key is that you can host it yourself, fine-tune it, and make money from it without restriction. That's what it's really about. No one (well, few) cares about recreating it, because if they could, they'd simply have made one from scratch themselves.

diggan

The same is true for FOSS in general. You're arguing that because (almost) no one builds their own X from scratch, there is therefore no value in having resources available for how to build your own X from scratch.

For someone who basically couldn't have become a developer without FOSS, this way of thinking is so backwards, especially on Hacker News. I thought we were pro-FOSS in general, but somehow LLMs get a pass because "they're too complicated and no one would build one from scratch".

jacooper

> Note that this seems to be about the weights themselves, AFAIK, the actual training code and datasets (for example) aren't actually publicly available.

Like every other open source / source available LLM?

diggan

Like every other Open Source LLM weights, yes. But looking around, there are models that are 100% FOSS, like OLMo (https://github.com/allenai/OLMo).

Also, I don't buy the argument that because many in the ecosystem mislabel the licensing and mislead people about it, it's ethically OK for everyone else to do so too.

mcraiha

The binary comparison is a bit off, since a binary can be copyrighted. Weights cannot.

diggan

Has that actually been tried in court, or is that your guess? Because you seem confident, but I don't think this has been tried (yet)

youssefabdelm

I guess since they're not ahead anymore, they decided to go back to open source.

mythz

Happy to see them back to releasing OSS models. We used a lot of their OSS models early last year before they were eclipsed by better ones, and we never bothered to try any of their large models, which IMO weren't great value.

littlestymaar

I wonder if that's a consequence of the DeepSeek distill release: fine-tuned Qwen and Llama models were both released by DeepSeek, but no Mistral, and that was a missed PR opportunity for them for no good reason.

dismalaf

They must have realized they were becoming irrelevant... I know I forgot about them and have been using other models locally. Openness is a huge win, even if I am using Mistral's hosting service I want to know I can always host it myself too, to protect my business against rug pulls and the like.

No one's going to pay for an inferior closed model...

mohsen1

Not so subtle in the function calling example [1]:

        "role": "assistant",
        "content": "---\n\nOpenAI is a FOR-profit company.",

[1] https://huggingface.co/mistralai/Mistral-Small-24B-Instruct-...

spwa4

So the point of this release is

1) code + weights Apache 2.0 licensed (enough to run locally, enough to train, not enough to reproduce this version)

2) Low latency, meaning 11ms per token (so ~90 tokens/sec on 4xH100)

3) Performance, according to mistral, somewhere between Qwen 2.5 32B and Llama 3.3 70B, roughly equal with GPT4o-mini

4) ollama run mistral-small (14G download) 9 tokens/sec on the question "who is the president of the US?" (also to enjoy that the answer ISN'T orange idiot)

freehorse

I tried just a few of the code-generation prompts I have used over the last few days, and it looks quite good and promising. It seems at least on par with qwen2.5-coder-32b, which was the first local model I would actually use for code. I am also surprised by how much more polished the output of small models has become in the last year.

On another note, I also wish they would follow up with a new version of the 8x7B Mixtral. It was one of my favourite models, but at the time it could barely fit in my RAM, and now that I have more RAM it is rather outdated. But I'm not complaining; this model is great anyway, and it's great that they are one of the companies that actually publish such models targeted at edge computing.

rahimnathwani

Until today, no language model I've run locally on a 32GB M1 has been able to answer this question correctly: "What was Mary J Blige's first album?"

Today, a 4-bit quantized version of Mistral Small (14GB model size) answered correctly :)

https://ollama.com/library/mistral-small:24b-instruct-2501-q...

kamranjon

I just tried your question against Gemma 2 27b llamafile on my M1 Macbook with 32gb of ram, here is the transcript:

>>> What was Mary J Blige's first album?

Mary J. Blige's first album was titled *"What's the 411?"*.

It was released on July 28, 1992, by Uptown Records and became a critical and commercial success, establishing her as the "Queen of Hip-Hop Soul."

Would you like to know more about the album, like its tracklist or its impact on music?

msp26

Finally, all the recent MoE model releases make me depressed with my mere 24GB VRAM.

> Note that Mistral Small 3 is neither trained with RL nor synthetic data

Not using synthetic data at all is a little strange

colonial

I recall seeing some complaints recently w.r.t. one of the heavily synthetic models (Phi?) - apparently they tend to overfit on STEM "book knowledge" while struggling with fuzzier stuff and instruction following.

I'm not much of an LLM user, though, so take my warmed over recollections with a grain of salt.

bloopernova

I'm surprised no GPU cards are available with like a TB of older/cheaper RAM.

gr3ml1n

Not surprising at all: Nvidia doesn't want to compete with their own datacenter cards.

wongarsu

AMD could arguably do it. But they have to focus to stay above water at all, and "put 128GB or more of DDR5 ram on any previous-gen GPU" is probably not in their focus. With the state of their software it's not even certain if the community could pick up the slack and turn that into a popular solution.

papichulo2023

Nvidia's upcoming 'mini PC' has shared RAM up to 128GB for around $3k. Not a competitor, but pretty good for enthusiasts.

Hopefully it's at least quad-channel.

aurareturn

Because memory bandwidth is the #1 bottleneck for inference, even more than capacity.

What good is 1TB RAM if the bandwidth is fed through a straw? Models would run very slow.

You can see this effect on 128GB MacBook Pros. Yes, the model will fit but it’s slow. 500GB/s of memory bandwidth feeds 128GB RAM at a maximum rate of 3.9x per second. This means if your model is 128GB large, your max tokens/s is 3.9. In the real world, it’s more like 2-3 tokens/s after overhead and compute. That’s too slow to use comfortably.
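
Written out, the ceiling is just bandwidth divided by the bytes streamed per token; a quick sketch of the arithmetic:

  # At batch size 1, each generated token must stream the full model from memory,
  # so tokens/s is capped at bandwidth / model size (real throughput is lower).
  def max_tokens_per_sec(bandwidth_gb_s: float, model_size_gb: float) -> float:
      return bandwidth_gb_s / model_size_gb

  print(max_tokens_per_sec(500, 128))  # ~3.9 tok/s ceiling for a 128GB model
  print(max_tokens_per_sec(500, 14))   # ~36 tok/s ceiling for a ~14GB Mistral Small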

You’re probably wondering why not increase memory bandwidth too. Well, you need faster memory chips such as HBM and/or more memory channels. These changes will result in drastically more power consumption and bigger memory controllers. Great, you’ll pay for those. Now you’re bottlenecked by compute. Just add more compute? Ok, you just recreated the Nvidia H100 GPU. That’ll be $20k please.

Some people have tried to use AMD Epyc CPUs with 8 channel memory for inference but those are also painfully slow in most cases.

acoard

> Because memory bandwidth is the #1 bottleneck for inference, even more than capacity.

But there are a ton of models I can't run at all locally due to VRAM limitations. I'd take being able to run those models slower. I know there are some ways to get these running on CPU orders of magnitude slower, but ideally there's some sort of middle ground.

null

[deleted]

timestretch

Their models have been great, but I wish they'd include the number of parameters in the model name, like every other model.

jbentley1

It's 24B parameters

null

[deleted]

rcarmo

There's also a 22b model that I appreciate, since it _almost_ fits into my 12GB 3060. But, alas, I might need to get a new GPU if this trend of fatter smaller models continues.

yodsanklai

I'm curious, what people do with these smaller models?

Beretta_Vexee

RAG mainly; also feature extraction, tagging, and document and e-mail classification. You don't need a 24B-parameter model to know whether an e-mail should go to accounting or customer support.

Panoramix

Would this work for non-text data? Like finding outliers in a time series or classifying trends, that kind of thing

pheeney

What models would you recommend for basic classification if you don't need a 24B parameter one?

josh-sematic

You might find this comparison chart helpful: https://www.airtrain.ai/blog/how-15-top-llms-perform-on-clas...

Note: from October; also I work at Airtrain

celestialcheese

Classification, tagging tasks. Way easier than older ML techniques and very fast to implement.

mattgreenrocks

When compared against more traditional ML approaches, how do they fare in terms of quality?

spmurrayzzz

Historically, the problem with using LLMs for the super simple conventional NLP stuff was that they were hard to control in terms of output. If you wanted a one-word answer for a classification task, you'd often have to deal with it responding in a paragraph. This obviously hurt precision and accuracy quite a bit. There were tricks you could use (like few-shot examples, GBNF grammars, training low-rank adapters, or even re-asking the model) to constrain output a bit, but they weren't perfect.

Over the last 12-18 months though, the instruction-following capabilities of the models have improved substantially. This new mistral model in particular is fantastic at doing what you ask.

My approach to this personally and professionally is to just benchmark. If I have a classification task, I use a tiny model first, eval both, and see how much improvement I'd get using an LLM. Generally speaking though, the vram costs are so high for the latter that its often not worth it. It really is a case-by-case decision though. Sometimes you want one generic model to do a bunch of tasks rather than train/finetune a dozen small models that you manage in production instead.
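
As a concrete example of the grammar approach mentioned above, llama-cpp-python can restrict output to a fixed label set via a GBNF grammar (the GGUF path and labels here are placeholders):

  from llama_cpp import Llama, LlamaGrammar

  # Grammar that only permits one of three class labels, so the model can't ramble.
  grammar = LlamaGrammar.from_string('root ::= "billing" | "support" | "sales"')

  llm = Llama(model_path="./mistral-small-24b-q4.gguf", n_ctx=4096)
  out = llm(
      "Classify this email as billing, support, or sales:\n\n<email text>\n\nLabel:",
      grammar=grammar,
      max_tokens=4,
  )
  print(out["choices"][0]["text"])  # exactly one of the three labels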

andrewgross

Super easy to get started, but lacking for larger datasets where you want to understand a bit more about predictions. You generally lose things like prediction probability (though this can be recovered if you chop the head off and just assign output logits to classes instead of tokens), repeatability across experiments, and the ability to tune the model by changing the data. You can still do fine-tuning, though it'll be more expensive and painful than with a BERT model.

Still, you can go from 0 to ~mostly~ clean data in a few prompts and iterations, vs potentially a few hours with a fine tuning pipeline for BERT. They can actually work well in tandem to bootstrap some training data and then use them together to refine your classification.
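
A sketch of the "assign output logits to classes" idea from the first paragraph: score only the label tokens at the final position of a causal LM and softmax over just those to recover class probabilities (the model name is illustrative, any causal LM works, and single-token labels are assumed):

  import torch
  from transformers import AutoModelForCausalLM, AutoTokenizer

  model_name = "mistralai/Mistral-Small-24B-Instruct-2501"  # a smaller LM is fine for testing
  tok = AutoTokenizer.from_pretrained(model_name)
  model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

  labels = ["positive", "negative"]
  label_ids = [tok(" " + l, add_special_tokens=False).input_ids[0] for l in labels]

  prompt = "Review: 'Great product, arrived on time.'\nSentiment:"
  inputs = tok(prompt, return_tensors="pt")
  with torch.no_grad():
      next_logits = model(**inputs).logits[0, -1]         # logits for the next token
  probs = torch.softmax(next_logits[label_ids], dim=-1)   # renormalize over the labels
  print(dict(zip(labels, probs.tolist())))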

frankfrank13

They're fast. I used 4o-mini to run the final synthesis in a CoT app and to do initial entity/value extraction in an ETL pipeline. Mistral is pretty good for code completions too; if I were in the Cursor business, I would consider a model like this for small code-block-level completions and let the bigger models handle chat, large requests, etc.

_boffin_

Cleaning messy assessor data. Email draft generation.

ignoramous

Mistral repeatedly emphasizes "accuracy" and "latency" for this Small (24B) model, which to me means (as they also point out):

- Local virtual assistants.

- Local automated workflows.

Also from TFA:

  Our customers are evaluating Mistral Small 3 across multiple industries, including:

  - Financial services customers for fraud detection
  - Healthcare providers for customer triaging
  - Robotics, automotive, and manufacturing companies for on-device command and control
  - Horizontal use cases across customers include virtual customer service, and sentiment and feedback analysis.

superkuh

Not spend $6000 on hardware because they run on computers we already have. But more seriously, they're fine and plenty fun for making recreational IRC bots.

cptcobalt

This is really exciting. The 12-32B size range is my favorite for my home computer, and the Mistrals have historically been great and widely embraced for fine-tuning.

At 24b, I think this has a good chance of fitting on my more memory constrained work computer.

ericol

> the mistrals have been historically great and embraced for various fine-tuning

Are there any guides on fine-tuning them that you can recommend?

ekam

Unsloth is the one I personally hear the most about