Magistral — the first reasoning model by Mistral AI
443 comments · June 10, 2025 · danielhanchen
ozgune
Their benchmarks are interesting. They are comparing to DeepSeek-V3's (non-reasoning) December and DeepSeek-R1's January releases. I feel that comparing to DeepSeek-R1-0528 would be fairer.
For example, R1 scores 79.8 on AIME 2024, while R1-0528 scores 91.4.
R1 scores 70 on AIME 2025; R1-0528 scores 87.5. R1-0528 does similarly better on GPQA Diamond, LiveCodeBench, and Aider (about 10-15 points higher).
derefr
I presume that "outdated upon release" benchmarks like these happen because the benchmark and the comparison models were chosen first, before the model was created, and the model's development progress was measured against that benchmark. It then doesn't occur to anyone to ask whether the benchmark the engineers have been relying on is still a good/useful one for marketing at release. From the inside view, it's just a benchmark, already there, already showing impressive results, a whole-company internal target for months, so why not publish it?
semi-extrinsic
Would also be interesting to compare with R1-0528-Qwen3-8B (chain-of-thought distilled from Deepseek-R1-0528 and post-trained into Qwen3-8B). It scores 86 and 76 on AIME 2024 and 2025 respectively.
Currently running the 6-bit XL quant on a single old RTX 2080 Ti and I'm quite impressed TBH. Simply wild for a sub-8GB download.
saratogacx
I have the same card on my machine at home, what is your config to run the model?
danielhanchen
I'm surprised it does so well too - that's pretty cool to see!
danielhanchen
Their paper https://mistral.ai/static/research/magistral.pdf is also cool! They modified GRPO by (rough sketch below):
1. Removing the KL divergence term
2. Normalizing by total length (Dr. GRPO style)
3. Normalizing advantages per minibatch
4. Relaxing the trust region
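Here is roughly what those four tweaks look like together; this is my own reconstruction from skimming the paper, not Mistral's actual code, and names like eps_low/eps_high are mine:

```python
import torch

def magistral_grpo_loss(logp_new, logp_old, advantages, mask,
                        eps_low=0.2, eps_high=0.3):
    """Sketch of GRPO with the four Magistral changes.

    logp_new / logp_old: (batch, seq) per-token log-probs under the current
    and sampling policies. mask: 1 on generated tokens, 0 elsewhere.
    advantages: (batch,) group-relative rewards (reward minus group mean).
    """
    # (3) normalize advantages across the minibatch
    adv = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

    ratio = torch.exp(logp_new - logp_old)
    # (4) relaxed trust region: a wider upper clip bound (eps_high > eps_low)
    unclipped = ratio * adv[:, None]
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high) * adv[:, None]
    per_token = -torch.min(unclipped, clipped) * mask

    # (2) Dr. GRPO style: divide by the batch's total generated-token count
    # rather than per-sequence length, so long generations aren't down-weighted.
    # (1) And there is no "beta * KL(pi_theta || pi_ref)" penalty: beta = 0.
    return per_token.sum() / mask.sum()
```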
gyrovagueGeist
Does anyone know why they added minibatch advantage normalization (or when it can be useful)?
The paper they cite "What matters in on-policy RL" claims it does not lead to much difference on their suite of test problems, and (mean-of-minibatch)-normalization doesn't seem theoretically motivated for convergence to the optimal policy?
danielhanchen
Tbh I'm unsure as well I took a skim of the paper so if I find anything I'll post it here!
Onavo
> Removed KL Divergence
Wait, how are they computing the loss?
danielhanchen
Oh sorry, it's the KL term - beta * KL - i.e. they set beta to 0.
The goal of that term is to "force" the model not to stray too far from the original checkpoint, but it can hinder the model from learning new things.
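For reference, the standard GRPO objective looks roughly like this (simplified, dropping group indexing):

```latex
J(\theta) = \mathbb{E}\left[\min\left(r_t(\theta)\,\hat{A},\;
\operatorname{clip}\left(r_t(\theta),\,1-\varepsilon,\,1+\varepsilon\right)\hat{A}\right)\right]
- \beta\, D_{\mathrm{KL}}\left(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right),
\quad r_t(\theta) = \frac{\pi_\theta(y_t \mid y_{<t})}{\pi_{\mathrm{old}}(y_t \mid y_{<t})}
```

Setting beta = 0 just drops the last term, so nothing anchors the policy to the reference checkpoint anymore.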
trc001
It's become trendy to delete it. I say "trendy" because many papers delete it without offering any proof that it's meaningless.
mjburgess
It's just a penalty term that they delete
monkmartinez
At the risk of dating myself; Unsloth is the Bomb-dot-com!!! I use your models all the time and they just work. Thank you!!! What does llama.cpp normally use if not "jinja" for their templates?
danielhanchen
Oh thanks! Yes I was gonna bring it up to them! Imo if there is a chat template, by default it should be --jinja
gavi
too much thinking
https://gist.github.com/gavi/b9985f730f5deefe49b6a28e5569d46...
fzzzy
My impression from running the first R1 release locally was that it also does too much thinking.
reissbaker
Magistral Small seems wayyy too heavy-handed with its RL to me:
\boxed{Hey! How can I help you today?}
They clearly rewarded the \boxed{...} formatting during their RL training, since it makes it easier to naively extract answers to math problems and thus verify them. But Magistral uses it for pretty much everything, even when it's inappropriate (in my own testing as well).
It also forgets to <think> unless you use their special system prompt reminding it to.
Honestly a little disappointing. It obviously benchmarks well, but it seems a little overcooked on non-benchmark usage.
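For context, the usual verification setup looks something like this naive sketch; this is my guess at the shape of it, not Mistral's actual reward code:

```python
import re

# Naive answer extraction of the kind RL reward functions often use:
# grab the last \boxed{...} in the completion and compare to the reference.
BOXED = re.compile(r"\\boxed\{([^{}]*)\}")  # breaks on nested braces; it's naive

def extract_boxed(completion: str) -> str | None:
    matches = BOXED.findall(completion)
    return matches[-1].strip() if matches else None

def reward(completion: str, gold: str) -> float:
    answer = extract_boxed(completion)
    return 1.0 if answer is not None and answer == gold else 0.0
```

Reward the format hard enough and the model starts boxing everything, including "Hey! How can I help you today?".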
cluckindan
It does not do any thinking. It is a statistical model, just like the rest of them.
trebligdivad
Nice! I'm running on CPU only, so it's interesting to compare: the Magistral-Small-2506_Q8_0.gguf runs at under 2 tokens/s on my 16-core, but your UD-IQ2_XXS gets about 5.5 tokens/s, which is fast enough to be useful. It does hallucinate a bit more and loop a little, but it's still pretty good for something so small.
danielhanchen
Oh nice! I normally suggest maybe Q4_K_XL to be on the safe side :)
cpldcpu
But this is just the SFT - "distilled" model, not the one optimized with RL, right?
danielhanchen
Oh I think it's SFT + RL as mentioned in the paper - they said combining both is actually more performant than just RL
pu_pe
Benchmarks suggest this model loses to DeepSeek-R1 in every one-shot comparison. Considering they were likely not even pitting it against the newer R1 version (no mention of it in the article), and that it costs more than double, it looks like the best AI company in the EU is struggling to keep up with the state of the art.
hmottestad
With how amazing the first R1 model was and how little compute they needed to create it, I'm really wondering how the new R1 model isn't beating o3 and 2.5 Pro on every single benchmark.
Magistral Small is only 24B and scores 70.7% on AIME2024 while the 32B distill of R1 scores 72.6%. And with majority voting @64 the Magistral Small manages 83.3%, which is better than the full R1. Since I can run a 24B model on a regular gaming GPU it's a lot more accessible than the full blown R1.
https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-...
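For anyone unfamiliar, maj@64 just means sampling 64 answers per problem and keeping the most common one. A sketch, where the generation and answer-parsing functions are placeholders you'd supply:

```python
from collections import Counter
from typing import Callable

# maj@k: sample k completions, parse each final answer, keep the mode.
def majority_vote(prompt: str, k: int,
                  generate: Callable[[str], str],
                  extract_answer: Callable[[str], str]) -> str:
    answers = [extract_answer(generate(prompt)) for _ in range(k)]
    return Counter(answers).most_common(1)[0][0]
```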
reissbaker
It's not better than full R1; Mistral is using misleading benchmarks. The latest version of R1, R1-0528, is much better: 91.4% on AIME2024 pass@1. Mistral uses the original R1 release from January in their comparisons, presumably because it makes their numbers look more competitive.
That being said, it's still very impressive for a 24B.
> I'm really wondering how the new R1 model isn't beating o3 and 2.5 Pro on every single benchmark.
Sidenote, but I'm pretty sure DeepSeek is focused on V4, and after that will train an R2 on top. The V3-0324 and R1-0528 releases weren't retrained from scratch, they just continued training from the previous V3/R1 checkpoints. They're nice bumps, but V4/R2 will be more significant.
Of course, OpenAI, Google, and Anthropic will have released new models by then too...
redman25
It may not have been intentionally misleading. Some benchmarks take a lot of horsepower and time to run, and their preparation was likely done well in advance of release, before the new DeepSeek R1 was even available to test.
hmottestad
Mistral isn’t using misleading benchmarks. I linked to DeepSeek’s own benchmark results that DeepSeek created. I couldn’t find anything newer.
Can you link me to the benchmark you found?
adventured
It's because DeepSeek was a fast copy. That was the easy part and it's why they didn't have to use so much compute to get near the top. Going well beyond o3 or 2.5 Pro is drastically more expensive than fast copy. China's cultural approach to building substantial things produces this sort of outcome regularly, you see the same approach in automobiles, planes, Internet services, industrial machinery, military, et al. Innovation is very expensive and time consuming, fast copy is more often very inexpensive and rapid. 85% good enough is often good enough, that additional 10-15% is comically expensive and difficult as you climb.
orbital-decay
This terrible and vague stereotyping about "China" while having no clue about the subject should have no place on HN but somehow always creeps in and is upvoted by someone. DeepSeek is not "China", they had nobody to copy from, they released their first 7B reasoning model back in April 2024, it was ahead of then-SotA models in math and validated their approach. They did a ton of new things besides training a reasoning model, and likely have more to come, as they have a completely different background than most AI companies. It's more of a cross-pollination of different areas of expertise.
natrys
Not disagreeing with the overarching point but:
> That was the easy part
Is a bit hand-wavy in that it doesn't explain why it's only DeepSeek who can do this "easy" thing, but still not Meta, Mistral or anyone else really. There are many other players who have way more compute than DeepSeek (even inside China, not even considering rest of the world), and I can assure you more or less everyone trains on synthetic data/distillation from whatever bigger model they can access.
MaxPock
I understand that the French are very innovative, so why isn't their model SOTA?
melicerte
If you look at Mistral's investors[0], you will quickly understand that Mistral is far from being European. My understanding is that it is mainly owned by US companies, with a few others from the EU and elsewhere.
[0] https://tracxn.com/d/companies/mistral-ai/__SLZq7rzxLYqqA97j... (edited for typo)
pdabbadabba
For the purposes of GP's comment, I think the nationalities of the people actually running the company and doing the work are more relevant than who has invested.
derektank
And, perhaps most relevantly, the regulatory environment the people are working in. French people working in America are probably more productive than French people working in France (if for no other reason because they probably work more hours in America than France).
kergonath
It’s a French company, subject to French laws and European regulations. That’s what matters, from a user point of view.
epolanski
Jm2c but I feel conflicted about this arms race.
You can be 6-12 months behind and not have burned tens of billions compared to the best in class; I see that as an engineering win.
I absolutely understand those who say "yeah, but customers will only use the best", I see it, but is market share for forever-money-losing businesses that valuable?
louiskottmann
Indeed, and with the technology plateau-ing, being 6-12 months late with less debt is just long term thinking.
Also, Europe being in the race is a big deal for consumers.
sisve
Being the best European AI company is also a multi-billion-dollar business. It's not like China or the US respects GDPR. A lot of companies will choose the best European company.
ACCount36
> with the technology plateau-ing
People have been claiming that since 2022. Where's the plateau?
adventured
Why would the debt matter when you have $60 billion in ad revenue and are generating $20 billion in op income? That's OpenAI 5-7 years from now, if they're able to maintain their position with consumers. Once they attach an ad product their margins will rapidly soar due to the comparatively low cost of the ad segment.
The technology is closer to a decade from seeing a plateau for the large general models. GPT o3 is significantly beyond o1 (much less 3.5 which was just Nov 2022). Claude 4 is significantly beyond 3.5. They're not subtle improvements. And most likely there will be a splintering of specialization that will see huge leaps outside the large general models. The radical leap in coding capabilities over the past 12-18 months is just an early example of how that will work, and it will affect every segment of human endeavour.
adventured
A similar sentiment existed for a long time about Uber and now they're very profitable and own their market. It was worth the burn to capture the market. Who says OpenAI can't roll over to profitable at a stable scale? Conquer the market, hike the price to $29.95 (family account, no ads; $19.95 individual account with ads; etc etc). To say nothing of how they can branch out in terms of being the interaction point that replaces the search box. The advertising value of owning the land that OpenAI is taking is well over $100 billion in annual revenue. Amazon's retail business is terrible, their ad business is fantastic. As OpenAI bolts on an ad product their margin potential will skyrocket and the cost side will be modest in comparison.
Over the coming years it won't be possible to stay a mere 6-12 months behind as the costs to build and maintain the AI super-infrastructure keeps climbing. It'll become a guaranteed implosion scenario. Winning will provide the ongoing immense resources needed to keep pushing up the hill forever. Everybody else - except a few - will fall away. The same outcome took place in search. Anybody spot Lycos, Excite, Hotbot, AltaVista around? It costs an enormous amount of money to try to keep up with Google (Bing, Baidu, Yandex) in search and scale it. This will be an even more brutal example of that, as the costs are even higher to scale.
The only way Mistral survives is if they're heavily subsidized directly by European states.
otabdeveloper4
> now they're very profitable and own their market.
No they don't. They failed in every market except a few niche ones.
aDyslecticCrow
> It was worth the burn to capture the market.
You cannot compare Uber to the AI market. They are too different. Uber captured the market because having three taxi services is annoying. But people are readily jumping between models using multi-model platforms. And nobody is significantly ahead of the pack. There is nothing that sets anyone apart aside from the rate at which they are burning capital. Any advantage is closed within a year.
If OpenAI wants to make a profit, it will raise prices and be dropped in a heartbeat for the next cheapest option. Most software stacks are designed to be model-agnostic, making integration or support a non-factor.
xmcqdpt2
I think the jury is still out on Uber. They first became profitable in 2023 after 15 years of massive losses. They still burned way more money than they ever made.
jasonthorsness
Even if it isn't as capable, having a model whose training you control is probably strategically important for every major region of the world. But it can only fall so far behind before it effectively doesn't work in the eyes of users.
tootie
As an occasional user of Mistral, I find their model to give generally excellent results and pretty quickly. I think a lot of teams are now overly focused on winning the benchmarks while producing worse real results.
littlestymaar
> Benchmarks suggest this model loses to Deepseek-R1 in every one-shot comparison.
That's not particularly surprising though as the Medium variant is likely close to ten times smaller than DeepSeek-R1 (granted it's a dense model and not an MoE, but still).
funnym0nk3y
Thought so too. I don't know how it could be different though. They are competing against behemoths like OpenAI or Google, but have only 200 people. Even Anthropic has over 1000 people. DeepSeek has less than 200 people so the comparison seems fair.
wafngar
But they have built a fully "independent" pipeline. DeepSeek and others probably trained on GPT-4, o1, or whatever data.
dwedge
Their OCR model was really well hyped, and it coincidentally came out when I had a batch of 600-page PDFs to OCR. They were all monospace text; it was just that for some reason the OCR layer was missing.
I tried it, 80% of the "text" was recognised as images and output as whitespace so most of it was empty. It was much much worse than tesseract.
A month later I got the bill for that crap and deleted my account.
Maybe this one is better, but I'm over hype marketing from Mistral.
notnullorvoid
I wouldn't trust any of these LLM teams to produce a good OCR model. OCR from 10 years ago is better than the crap they put out.
megalomanu
We just tested magistral-medium as a replacement for o4-mini in a user-facing feature that relies on JSON generation, where speed is critical. Depending on the complexity of the JSON, o4-mini runs ranged from 50 to 70 seconds. In our initial tests, Mistral returned results in 34–37 seconds. The output quality was slightly lower but still remains acceptable for us. We'll continue testing, but the early results are promising. I'm glad to see Mistral prioritizing speed over raw power; there's definitely a need for that.
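For reference, the shape of the call we're timing is roughly this; the model id and JSON-mode support are assumptions on my part, so check Mistral's docs:

```python
import requests

# Minimal sketch of a JSON-mode request against Mistral's chat completions API.
resp = requests.post(
    "https://api.mistral.ai/v1/chat/completions",
    headers={"Authorization": "Bearer <MISTRAL_API_KEY>"},
    json={
        "model": "magistral-medium-2506",  # assumed model id
        "messages": [{"role": "user",
                      "content": "Return the order summary as JSON: ..."}],
        "response_format": {"type": "json_object"},  # JSON mode
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])
```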
nbardy
I bet you can close the gap with a finetune.
Should be quite easy if you have some o4-mini results sitting around.
kamranjon
I am curious why you would choose a reasoning model for JSON generation?
I was recently working on a user facing feature using self-hosted Gemma 27b with VLLM and was getting fully formed JSON results in ~7 seconds (even that I would like to optimize further) - obviously the size of the JSON is important but I’d never use a reasoning model for this because they’re constantly circling and just wasting compute.
I haven’t really found a super convincing use-case for reasoning models yet, other than a chat style interface or an assistant to bounce ideas off of.
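For anyone curious, here is roughly what that setup looks like; vLLM's OpenAI-compatible server supports guided decoding, and the model name and schema here are just illustrative:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# JSON schema the output must conform to (illustrative).
schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "tags": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["title", "tags"],
}

resp = client.chat.completions.create(
    model="google/gemma-2-27b-it",  # whichever checkpoint vLLM is serving
    messages=[{"role": "user", "content": "Extract a title and tags from: ..."}],
    extra_body={"guided_json": schema},  # vLLM guided decoding
)
print(resp.choices[0].message.content)
```

With guided decoding the sampler can only emit tokens that keep the output valid against the schema, which is part of why there are no wasted "reasoning" tokens.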
megalomanu
It is for generating a big nested JSON, quite complex from a business standpoint (lots of different business concepts). We didn't have good results with simple models.
simonw
Here are my notes on trying this out locally via Ollama and via their API (and the llm-mistral plugin) too: https://simonwillison.net/2025/Jun/10/magistral/
atxtechbro
Hi Simon,
What's the huge difference between the two pelicans riding bicycles? Was the rough one the small version running locally, and the pretty good one the bigger model through the API?
Thanks, Morgan
diggan
Ollama doesn't use proper naming for some reason, so `ollama pull magistral:latest` lands you with the q4_K_M version (currently; subject to change).
Mistral's API defaults to `magistral-medium-2506` right now, which is running with full precision, no quantization.
otabdeveloper4
Nobody should be ever using ollama, for any reason.
It literally only makes everything worse and more convoluted with zero benefits.
samtheprogram
Not only the quantization, but what’s available via ollama is magistral-small (for local inference), not the -medium variant.
simonw
Yes, the bad one was Mistral Small running locally, the better one was Mistral Medium via their API.
internet_points
> I guess this means the reasoning traces are fully visible and not redacted in any way - interesting to see Mistral trying to turn that into a feature that's attractive to the business clients they are most interested in appealing to.
but then someone found that, at least for distilled models,
> correct traces do not necessarily imply that the model outputs the correct final solution. Similarly, we find a low correlation between correct final solutions and intermediate trace correctness
https://arxiv.org/pdf/2505.13792
i.e. the conclusion doesn't necessarily follow from the reasoning. So is there still value in seeing the reasoning? There may be useful information in it, but I'm not sure it can be read as a typical human chain of reasoning; maybe it should be interpreted more as a loud multi-party discussion on the relevant subject, which may have informed the conclusion but not necessarily led to it.
OTOH, considering the effects of automation fatigue vs human oversight, I guess it's unlikely anyone will ever look at the reasoning in practice, except to summarily verify that it's there and tick the boxes on some form.
christianqchung
I don't understand why the benchmark selections are so scattered and limited. It only compares Magistral Medium with DeepSeek V3, R1, and the other closed-weight Mistral Medium 3. Why did they leave off Magistral Small entirely, alongside comparisons with Alibaba's Qwen or the mini versions of o3 and o4?
elAhmo
When they include comparisons, it is always a deliberate decision what to show and, more importantly, what not to show. If they had data that would show better performance compared to those models, there is no reason for them to not emphasize that.
CobrastanJorji
Etymological fun: both "mistral" and "magistral" mean "masterly."
Mistral comes from Occitan for masterly, although today, as far as I know, it's only used in English when talking about Mediterranean winds.
Magistral is just the adjective form of "magister," so "like a master."
If you want to make a few bucks, maybe look up some more obscure synonyms for masterly and pick up the domain names.
snakeboy
> as far as I know it's only used in English when talking about mediterranean winds.
It's a French company, and "mistral" has this usage in French as well. Also, "magistral" is just the French translation of "masterful".
arnaudsm
I wished the charts included Qwen3, the current SOTA in reasoning.
Qwen3-4B almost beats Magistral Small (24B) on the 4 available benchmarks, and Qwen3-30B-A3B is miles ahead.
SparkyMcUnicorn
30-A3B is a really impressive model.
I throw tasks at it running locally to save on API costs, and it's possibly better than anything we had a year or so ago from closed source providers. For programming tasks, I'd rank it higher than gpt-4o
freehorse
It is a great model, and blazing fast, which is actually very useful, especially for "reasoning" models, as they produce a lot of tokens.
I wish Mistral were back into making MoE models. I loved their 8x7B Mixtral; it was one of the greatest models I could run at the time it came out, but it's outdated now. I wish somebody were making a similar-size MoE model, one that could comfortably sit in a 64GB RAM MacBook and be fast. Currently Qwen's 30B-A3B is the only one I know of, but it would be nice to have something slightly bigger/better (including a non-reasoning base one). All the other MoE models are just too big to run locally on more standard hardware.
poorman
Is there a popular benchmark site people use? Because I had to test all these by hand, and `Qwen3-30B-A3B` still seems like the best model I can run in that parameter space (and memory envelope).
arnaudsm
- https://livebench.ai/#/ + AIME + LiveCodeBench for reasoning
- MMLU-Pro for knowledge
- https://lmarena.ai/leaderboard for user preference
We only have Magistral's GPQA, AIME & LiveCodeBench so far.
resource_waste
No surprise on my end. Mistral has been basically useless due to other models always being better.
But it's European, so it's a point of pride.
Relevance or not, we will keep hearing the name as a result.
devmor
I would agree, Qwen3 is definitely the most impressive "reasoning" model I've evaluated so far.
alister
As a quick test of logical reasoning and basic Wikipedia-level knowledge, I asked Mistral AI the following question:
A Brazilian citizen is flying from Sao Paulo to Paris, with a connection in Lisbon. Does he need to clear immigration in Lisbon or in Paris or in both cities or in neither city?
Mistral AI said that "immigration control will only be cleared in Paris," which I think is wrong.
After I pointed it to the Wikipedia article on this topic[1], it corrected itself to say that "immigration control will be cleared in Lisbon, the first point of entry into the Schengen Area."
I tried the same question with Meta AI (Llama 4) and it did much worse: It said that the traveler "wouldn't need to clear immigration in either Lisbon or Paris, given the flight connections are within the Schengen Area", which is completely incorrect.
I'd be interested to hear if other LLMs give a correct answer.
mcintyre1994
I think Gemini's answer (2.5 Flash) is impressive
----
Since both Portugal and France are part of the Schengen Area, and a Brazilian citizen generally does not need a visa for short stays (up to 90 days in any 180-day period) in the Schengen Area, here's how immigration will work:
Lisbon: The Brazilian citizen will need to clear immigration in Lisbon. This is because Lisbon is the first point of entry into the Schengen Area. At this point, their passport will be stamped, and they will be officially admitted into the Schengen Zone.
Paris: Once they have cleared immigration in Lisbon, their flight from Lisbon to Paris is considered a domestic flight within the Schengen Area. Therefore, they will not need to clear immigration again in Paris.
Important Note: While Brazilians currently enjoy visa-free travel, the European Travel Information and Authorization System (ETIAS) is expected to become mandatory by late 2026. Once implemented, Brazilian citizens will need to obtain this electronic authorization before their trip to Europe, even for visa-free stays. However, this is a pre-travel authorization, not a visa in the traditional sense, and the immigration clearance process at the first point of entry would remain the same.
viraptor
It looks like you're not sure of the answer yourself. That's really not a good way to test the LLMs. You'll just prefer the one that convinces you the most rather than the correct one.
schneehertz
This is deepseek-r1's answer:
```
Based on the Schengen Area rules for international travel:
Lisbon (Portugal) - Immigration clearance is required here. Since Lisbon is the first point of entry into the Schengen Area (and Brazil is outside Schengen), the passenger must clear immigration (passport control) in Lisbon. This is standard for all non-Schengen arrivals.
Paris (France) - No immigration clearance is needed. After clearing immigration in Lisbon, the onward flight to Paris is considered a domestic Schengen flight (no border checks within Schengen).
```
jaza
Sounds to me like "immigration in Lisbon or in Paris or in both cities or in neither city" is a trick question, because (on top of immigration in Lisbon as the Schengen entry point) surely the traveller also needs to clear immigration in Sao Paulo (as the Brazil / Mercosur exit point)?
eigenspace
I wonder if they adjusted it since reading this thread, or if you forgot to turn on reasoning or if you just got unlucky. For me, when I ask this it thinks for about 1.5 minutes and responds with
"The Brazilian citizen needs to clear immigration in Lisbon."
When I asked it to explain why, it said
"The Brazilian citizen needs to clear immigration in Lisbon because Lisbon is the first Schengen entry point when traveling from a non-Schengen country (Brazil) to a Schengen country (France via Portugal). According to Schengen rules, travelers must clear immigration at the first Schengen port of entry. Since Paris is also within the Schengen Area, no additional immigration checks are required upon arrival in Paris."
marsa
doing some reason.. uhh intuitioning i imagine brazil and portugal might have some sort of a visa-free deal going on in which case llama 4 might actually be right here?
mcintyre1994
AFAIK Schengen has a common visa policy, so there couldn't be such a deal between Brazil and Portugal. It'd also be extremely surprising if two countries not in a common travel area had a deal where you didn't have to clear customs at all, I suspect that doesn't exist anywhere in the world.
alister
Brazilians don't need a visa for Portugal, France, or any Schengen country. But everybody has to pass through immigration control (at least a passport check even if you don't need a visa) when entering the Schengen zone. My question was which country would that happen in.
rafram
Is the number of em-dashes in this marketing copy indicative of the kind of output that the model produces? If so, might want to tone it down a bit.
sebmellen
> Our early tests indicated that Magistral is an excellent creative companion. We highly recommend it for creative writing and storytelling, with the model capable of producing coherent or — if needed — delightfully eccentric copy.
cAtte_
49 em-dashes, 59 commas. that's a crazy ratio
pembrook
This meme that humans don’t use em dashes needs to die.
It’s an extremely useful tool in writing and I’ve been using it for decades.
rafram
I love a good em-dash, but this page overuses them (nearly 1:1 ratio of em-dashes to commas!) and puts them in places where they just do not belong.
hskalin
That's very weird; I, on the other hand, don't remember noticing or using them before the advent of ChatGPT. Maybe it's a cultural thing.
It makes sense that humans would have been using it, though; ChatGPT learned from us, after all.
fivestones
Same here. The em dash has been maybe my favorite punctuation since at least the early 2000s. All the em dash output from LLMs looks really natural to me.
ModernMech
But the em dashes — if appreciated — are delightfully eccentric and whimsical!
tiahura
Unless you're a lawyer. We love 'em.
NicuCalcea
As a journalist, same!
lee-rhapsody
Also a journalist. I use em-dashes all the time
drusepth
As an author... same!
saratogacx
That is just Mistral's marketing style. You see it on a lot of their pages. The model output doesn't share the same love for the long dash.
xmcqdpt2
We don’t have em dashes as punctuation in French — commas are usually used instead — so we get overly excited about using them when we can — everybody likes novelty.
johnisgood
I do not know but sometimes when I type "-" and press space, LibreOffice converts it to an em-dash. I get rid of it so people won't confuse me with an LLM.
bee_rider
How many other open-weights reasoning models are there?
Is it possible to run multiple reasoning models on one problem? (Why not? I guess).
Another funny thought is: they release their Small model, and kept their Medium as a premium service. I wonder if you could do chains with Medium run occasionally, linked together by local runs of Small?
simonw
Qwen 3 and DeepSeek R1 and Phi-4 Reasoning are the best open weights reasoning models I know of.
ls612
Just DeepSeek, I think, and there are distillations of it that can run on consumer hardware if you really want.
nake13
Magistral Small can fit on a single RTX 4090 or in a 32GB RAM MacBook once quantized.
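The arithmetic is a simple rule of thumb: weight memory is roughly parameter count times bits per weight divided by 8, plus a few GB for KV cache and runtime overhead. A quick sketch:

```python
# Back-of-envelope weight memory for a dense model.
# Ignores KV cache and runtime overhead (budget a few extra GB for those).
def weight_gb(params_billions: float, bits_per_weight: float) -> float:
    return params_billions * bits_per_weight / 8

for bits in (16, 8, 6, 4):
    print(f"24B @ {bits}-bit ~= {weight_gb(24, bits):.0f} GB")
# 16-bit: 48 GB, 8-bit: 24 GB, 6-bit: 18 GB, 4-bit: 12 GB
# -> a ~4-bit quant of a 24B model fits a 24 GB RTX 4090 with room for context
```

The same arithmetic works for any dense DeepSeek or Llama checkpoint: take the parameter count and the bits of the quant you plan to run.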
the_sleaze_
Excellent news for me.
How does one figure this out? As in I want to know the comparable Deepseek or Llama equivalent (size-wise) and don't want to figure it out by trial and error.
lolive
Is it indeed Apple's plan to eventually run these kinds of models directly on an iPhone? Or are the specs of any state-of-the-art smartphone well below the minimum requirements of such "lightweight" models?
danielhanchen
I made some GGUFs for those interested in running them at https://huggingface.co/unsloth/Magistral-Small-2506-GGUF
ollama run hf.co/unsloth/Magistral-Small-2506-GGUF:UD-Q4_K_XL
or
./llama.cpp/llama-cli -hf unsloth/Magistral-Small-2506-GGUF:UD-Q4_K_XL --jinja --temp 0.7 --top-k -1 --top-p 0.95 -ngl 99
Please use --jinja for llama.cpp and use temperature = 0.7, top-p 0.95!
Also best to increase Ollama's context length to at least 8K: OLLAMA_CONTEXT_LENGTH=8192 ollama serve &. Some other details are in https://docs.unsloth.ai/basics/magistral