Voxtral – Frontier open source speech understanding models
20 comments · July 15, 2025
kamranjon
I'm pretty excited to play around with this. I've worked with Whisper quite a bit, and it's awesome to have another model in the same class, especially from Mistral, who tend to be very open. I'm sure Unsloth is already working on some GGUF quants - will probably spin it up tomorrow and try it on some audio.
ipsum2
24B is crazy expensive for speech transcription. Conspicuously, there's no comparison with Parakeet, a 600M-param model that's currently dominating leaderboards (but only for English).
azinman2
But it also includes world knowledge, can do tool calls, etc. It's an omni-model.
sheerun
In the demo they mention that Polish pronunciation is pretty bad - spoken as if by a native English speaker for whom it's a second language. I wonder if it's the same for other languages. On the other hand, whispering in English is hilariously good, especially with different emotions.
Raed667
It is insane how good the "French man speaking English" demo is. It captures a lot of subtleties.
GaggiX
There is also a Voxtral Small 24B model available for download: https://huggingface.co/mistralai/Voxtral-Small-24B-2507
lostmsu
Does it support realtime transcription? What is the approximate latency?
danelski
They claim to undercut competitors of similar quality by half for both models, yet they released both under Apache 2.0 instead of following the smaller-open, larger-closed strategy of their recent releases. What's different here?
halJordan
They didn't release a Voxtral Large, so your question doesn't really make sense.
wmf
They're working on a bunch of features so maybe those will be closed. I guess they're feeling generous on the base model.
Havoc
Probably not looking to directly compete in transcription space
homarp
weights: https://huggingface.co/mistralai/Voxtral-Mini-3B-2507 and https://huggingface.co/mistralai/Voxtral-Small-24B-2507
homarp
Running Voxtral-Mini-3B-2507 on GPU requires ~9.5 GB of GPU RAM in bf16 or fp16.
Running Voxtral-Small-24B-2507 on GPU requires ~55 GB of GPU RAM in bf16 or fp16.
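The quoted figures roughly match a back-of-the-envelope estimate: bf16/fp16 uses 2 bytes per parameter, so weights alone account for most of the footprint, with the remainder presumably going to activations, the KV cache, and runtime overhead. A minimal sketch of that arithmetic (my own illustration, not from the model cards):

```python
# Back-of-the-envelope check of the quoted GPU RAM figures.
# Assumes 2 bytes per parameter (bf16/fp16); the gap between the weight
# footprint and the quoted totals (~9.5 GB and ~55 GB) would be
# activations, KV cache, and runtime overhead.

def weight_memory_gb(params_billions: float, bytes_per_param: int = 2) -> float:
    """GPU memory taken by the weights alone, in GB."""
    # params_billions * 1e9 params * bytes_per_param bytes, divided by 1e9 bytes/GB
    return params_billions * bytes_per_param

print(weight_memory_gb(3))   # Voxtral-Mini-3B: ~6 GB of weights
print(weight_memory_gb(24))  # Voxtral-Small-24B: ~48 GB of weights
```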
lostmsu
My Whisper v3 Large Turbo is $0.001/min, so their price comparison is not exactly accurate.
ImageXav
How did you achieve that? I was looking into it and $0.006/min is quoted everywhere.
lostmsu
Harvesting idle compute. https://borgcloud.org/speech-to-text
4b11b4
This is your service?
BetterWhisper
Do you support speaker recognition?