High-Fidelity Simultaneous Speech-to-Speech Translation
41 comments
July 3, 2025
Grosvenor
This is so cool. The future is cool!
I wonder how it will work on languages that have a different grammatical structure than French/English, like the Finno-Ugric languages, which have a sort of Yoda speech to them. Edit: In Finno-Ugric languages, words later in a sentence can completely change the meaning. It will be interesting to look at.
It's considerate of them to name it after my favourite whisky.
nine_k
If Finnish is not widely known, German is more familiar, and there you can put the "nicht" at the very end of a sentence, reversing its meaning. Also, the verb may come close to the end, after an extended description of the subject / object; in English, you want the verb early.
Human translators somehow handle that; machines would likely exhibit a similar delay.
mananaysiempre
Vaguely related anecdote: have you ever dictated a number to a French speaker? When you say “forty-two” or “seventy-six”, an English speaker will start writing the 4 or the 7 the moment they hear the “forty” or the “seventy”. The French speaker will also write the 4 the moment they hear the “quarante” in “quarante-deux” (40+2), but when you say “soixante-seize” (60+16), they will (without thinking about it!) only start writing 76 at the end of the whole thing, because after only hearing the “soixante” they can’t tell if they’ll need to write a 6 or a 7.
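To make the waiting concrete, here is a toy sketch of when a listener can commit to the first digit. It only covers 20 to 79, and the function and table names are made up for illustration:

```python
# Tens words whose first digit is fixed the moment they are heard.
FIXED_TENS = {"vingt": 2, "trente": 3, "quarante": 4, "cinquante": 5}
# After "soixante", any of these words means the number is in the seventies.
TEENS = {"dix", "onze", "douze", "treize", "quatorze", "quinze", "seize"}


def first_digit_commit(words):
    """Return (first_digit, words_heard_before_committing) for a French
    number between 20 and 79, given as a list of words."""
    head = words[0]
    if head in FIXED_TENS:
        return FIXED_TENS[head], 1
    if head != "soixante":
        raise ValueError("this toy only handles 20-79")
    for heard, word in enumerate(words[1:], start=2):
        if word == "et":
            # "soixante et un" (61) vs "soixante et onze" (71): still ambiguous.
            continue
        return (7 if word in TEENS else 6), heard
    # Bare "soixante": only the end of the utterance settles it.
    return 6, len(words)


print(first_digit_commit(["quarante", "deux"]))   # commits after one word
print(first_digit_commit(["soixante", "seize"]))  # has to wait for the second word
```

A simultaneous translator is in the same position as the listener here: it can emit output early only when the words heard so far already pin it down.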
lapink
The alignment between source and target is automatically inferred, basically by searching when the uncertainty over a given output word reduces the most once enough input words are seen. This is then lifted to the audio domain. In theory the same trick should work even with longer grammatical inversions between languages, although this will lead to larger delays. To be tested!
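As a rough illustration of the idea (not the actual Hibiki code), the search can be sketched on a table of made-up scores. Here `logprobs[i][j]` is assumed to be the log-probability some text translation model gives to target word j after seeing the first i source words, and the monotonic clamp at the end is an extra assumption of this sketch:

```python
def infer_alignment(logprobs):
    """For each target word, return how many source words must be seen
    before it is emitted.

    logprobs[i][j]: hypothetical log-probability of target word j given the
    first i source words (row 0 means no source words seen yet).
    """
    n_src = len(logprobs) - 1
    n_tgt = len(logprobs[0])
    alignment = []
    for j in range(n_tgt):
        best_i, best_gain = 1, float("-inf")
        for i in range(1, n_src + 1):
            # How much does seeing source word i reduce uncertainty about word j?
            gain = logprobs[i][j] - logprobs[i - 1][j]
            if gain > best_gain:
                best_i, best_gain = i, gain
        # Assumption of this sketch: output order is kept, so a word can
        # never be emitted before the word that precedes it.
        if alignment and best_i < alignment[-1]:
            best_i = alignment[-1]
        alignment.append(best_i)
    return alignment


# Made-up scores for "je ne sais" -> "I don't know".
scores = [
    [-3.0, -4.0, -5.0],  # nothing heard
    [-0.2, -3.5, -4.8],  # "je"
    [-0.1, -0.5, -4.5],  # "je ne"
    [-0.1, -0.3, -0.4],  # "je ne sais"
]
print(infer_alignment(scores))
```

With a longer inversion, the largest gain for an early target word would simply land on a late source word, which is where the extra delay comes from.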
notphilipmoran
It will be interesting to see if it runs into issues with sentence syntax. What I am thinking of specifically is Spanish and English, where sentence structures often look completely different. How will this real-time interpretation be affected?
jauntywundrkind
Link to repo: https://github.com/kyutai-labs/hibiki
gagabity
Yandex Browser has been doing this for Russian for a while: if you go to YT, it offers to translate to Russian, and it handles multiple speakers and voices, from what I remember. Not sure if all the technicalities are the same.
totetsu
All these Japanese project names and no Japanese support (ToT)
woodson
Check out this model based on the same architecture for Japanese: https://github.com/nu-dialogue/j-moshi
jdkee
They just open-sourced their newest TTS today.
wenc
Wow, that's impressive! It even has a "sarcastic" voice which drips with sarcasm.
clueless
"Hibiki currently only supports French-to-English translation."
AIorNot
This is amazing, would love to play with this. What about other languages besides French to English?
lapink
Adding more languages is definitely planned! This was the master's internship project of Tom (the first author) at Kyutai, and it was easier to prototype the idea with a single pair. Also, he will be presenting this work at ICML in two weeks if anyone is around and wants to learn more.
benlivengood
Now to get the model to run in an earbud...
lapink
The model can actually run on an iPhone 16 Pro, so if the earbud is connected to one that could work!
gcanyon
Almost as good as a babel fish!
Bluestein
That would be insane.-
Thinking of it, the whole "stack" from earbuds to phone to cloud - even in just something so "commonplace" as Assistant or Alexa ...
... Is amazing: All that computing power at our disposal.-
cs702
Nice. I'm impressed.
Translator jobs are going to go poof! overnight.
Just sayin'.
desultir
Translators sure, interpreters no.
Interpreters also have to factor in cultural context and customs, ensuring that meaning is conveyed without offence being given in formal contexts.
esafak
I don't see why software couldn't do that, if you give them the context.
cortesoft
That seems like something LLMs could eventually get good at
mschuster91
As long as YouTube keeps translating "ham" to "Schinken" no matter the context, translators will have jobs.
For anyone else looking for examples: https://huggingface.co/spaces/kyutai/hibiki-samples