Show HN: Real-time AI Voice Chat at ~500ms Latency
65 comments
· May 5, 2025 · jedberg
Reason077
The best, most human-like AI voice chat I've seen yet is Sesame (www.sesame.com). It has delays, but fills them very naturally with normal human speech nuances like "hmmm", "uhhh", "hold on while I look that up" etc. If there's a longer delay it'll even try to make a bit of small talk, just like a human conversation partner might.
modeless
My take on this is that voice AI has not truly arrived until it has mastered the "Interrupting Cow" benchmark.
robbomacrae
Spot on. I'd add that most serious transcription services take around 200-300ms, but ~500ms overall latency is sort of the gold standard. For the AI in KFC drive thrus in AU we're trialing techniques that bring it much closer to human-style interaction. This includes interrupting, either when useful or by accident, since good voice activity detection also has a bit of latency.
varispeed
> AI in KFC drive thrus
That right there is an anxiety trigger and would make me skip the place.
Nothing ruins a day like arguing with a robot that keeps misinterpreting what you said.
coolspot
They have a fallback to a human operator when stopwords and/or stop conditions are detected.
joshstrange
> where it processes the incoming speech in real time and responds when it's confident it has heard enough to understand the meaning.
I'm not an expert on LLMs but that feels completely counter to how LLMs work (again, _not_ an expert). I don't know how we can "stream" the input and have the generation update/change in real time, at least not in 1 model. Then again, what is a "model"? Maybe your model fires off multiple generations internally and starts generating after every word, or at least starts asking sub-LLM models "Do I have enough to reply?" and once it does it generates a reply and interrupts.
I'm not sure how most apps handle the user interrupting, with regard to the conversation context. Do they stop generation but keep what they've already generated in the context? Do they cut it off where the LLM got interrupted? Something like "LLM: ..and then the horse walked... -USER INTERRUPTED-. User: ....". It's not a purely-voice-LLM issue, but it comes up far more often there, since you're rarely stopping the generation itself (in the demo, that finished long before he interrupts), just the TTS.
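For what it's worth, here's one way the context could represent an interruption, purely as a sketch (the helper and the "[interrupted by user]" marker are illustrative conventions, not what any particular app actually does):

    # Sketch: keep only what was actually spoken aloud and mark the cut, so the
    # model knows its previous turn was never finished. The marker string is an
    # illustrative convention, not a standard.
    def truncate_interrupted_turn(messages, spoken_text):
        last = messages[-1]
        if last["role"] == "assistant":
            last["content"] = spoken_text + " [interrupted by user]"
        return messages

    history = [
        {"role": "user", "content": "Tell me a story."},
        {"role": "assistant", "content": "..and then the horse walked across the field toward the barn"},
    ]
    history = truncate_interrupted_turn(history, "..and then the horse walked")
    history.append({"role": "user", "content": "Wait, which horse?"})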
tomp
If your model is fast enough, you can definitely do it. That's literally how "streaming Whisper" works: just rerun the model on the accumulated audio every x00ms. LLMs could definitely work the same way. Technically they're less complex than Whisper (which is an encoder/decoder architecture; LLMs are decoder-only) but of course much larger (hence slower), so ... maybe rerun just a part of it? etc.
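To make the rerun idea concrete, a minimal sketch using the openai-whisper package (model size and chunk cadence are illustrative; a real implementation would cap or slide the buffer):

    # Re-transcribe the whole accumulated buffer every few hundred ms and keep
    # only the latest hypothesis.
    import numpy as np
    import whisper

    model = whisper.load_model("base")        # small model keeps each rerun fast
    SAMPLE_RATE = 16000                       # Whisper expects 16 kHz mono float32
    audio_buffer = np.zeros(0, dtype=np.float32)

    def on_new_chunk(chunk: np.ndarray) -> str:
        """Append a new ~300ms chunk of 16 kHz float32 audio and re-transcribe everything."""
        global audio_buffer
        audio_buffer = np.concatenate([audio_buffer, chunk])
        result = model.transcribe(audio_buffer, fp16=False)   # rerun on the full buffer
        return result["text"]                                 # latest (possibly revised) hypothesis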
woodson
Human-to-human conversational patterns are highly specific to cultural and contextual aspects. Sounds like I’m stating the obvious, but developers regularly disregard that and then wonder why things feel unnatural for users. The “median delay” may not be the most useful thing to look at.
To properly learn more appropriate delays, it can be useful to find a proxy measure that can predict when a response can/should be given. For example, look at Kyutai’s use of change in perplexity in predictions from a text translation model for developing simultaneous speech-to-speech translation (https://github.com/kyutai-labs/hibiki).
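As a rough illustration of the proxy idea (not Kyutai's actual method): score how surprised a small causal LM is by an end-of-turn marker after the text so far, and only respond once that surprise drops below a threshold. The model, marker, and threshold below are placeholder choices:

    # Use the LM's surprise at a sentence-final token as a proxy for "safe to
    # respond now?". Model, marker, and threshold are illustrative assumptions.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

    def end_of_turn_surprise(text_so_far: str) -> float:
        """Negative log-probability of a sentence-final period given the text so far."""
        ids = tok(text_so_far, return_tensors="pt").input_ids
        with torch.no_grad():
            logits = lm(ids).logits[0, -1]              # next-token distribution
        logprobs = torch.log_softmax(logits, dim=-1)
        eot_id = tok.encode(".")[0]                     # crude end-of-turn marker
        return -logprobs[eot_id].item()

    def ready_to_respond(text_so_far: str, threshold: float = 3.0) -> bool:
        return end_of_turn_surprise(text_so_far) < threshold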
r0fl
Great insights. When I have a conversation with another person, sometimes they cut me off when they are trying to make a point. I have talked to ChatGPT and Grok at length (hours of brainstorming, learning things, etc.) and the AI has never interrupted aggressively to try to make a point stick better.
varispeed
This silence detection is what makes me unable to chat with AI. It is not natural and creates pressure.
True AI chat should know when to talk based on conversation and not things like silence.
Voice-to-text also strips a lot of context out of the conversation.
krainboltgreene
I would also suspect that a human has much less patience for a robot interrupting them than for another human.
smeej
I'm certainly in that category. At least with a human, I can excuse it by imagining the person grew up with half a dozen siblings and always had to fight to get a word in edgewise. With a robot, it's interrupting on purpose.
koljab
I built RealtimeVoiceChat because I was frustrated with the latency in most voice AI interactions. This is an open-source (MIT license) system designed for real-time, local voice conversations with LLMs.
Quick Demo Video (50s): https://www.youtube.com/watch?v=HM_IQuuuPX8
The goal is to get closer to natural conversation speed. It uses audio chunk streaming over WebSockets, RealtimeSTT (based on Whisper), and RealtimeTTS (supporting engines like Coqui XTTSv2/Kokoro) to achieve around 500ms response latency, even when running larger local models like a 24B Mistral fine-tune via Ollama.
Key aspects:
- Designed for local LLMs (Ollama primarily; an OpenAI connector is included).
- Interruptible conversation.
- Smart turn detection to avoid cutting the user off mid-thought.
- Dockerized setup available for easier dependency management.
It requires a decent CUDA-enabled GPU for good performance due to the STT/TTS models.
Would love to hear your feedback on the approach, performance, potential optimizations, or any features you think are essential for a good local voice AI experience.
The code is here: https://github.com/KoljaB/RealtimeVoiceChat
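Not the repo's actual code, but to show the overall shape, here is a minimal sketch of audio-chunk streaming over a WebSocket using the Python websockets package; the stub functions stand in for the RealtimeSTT, LLM, and RealtimeTTS stages:

    # Minimal sketch of the chunk-streaming shape (not the repo's actual code).
    import asyncio
    import websockets

    def transcribe(audio: bytes) -> str:
        return ""                                   # placeholder: RealtimeSTT / Whisper goes here

    def turn_is_complete(partial: str) -> bool:
        return partial.endswith((".", "?", "!"))    # placeholder: smart turn detection goes here

    def generate_reply(text: str) -> str:
        return "placeholder reply"                  # placeholder: local LLM via Ollama goes here

    async def synthesize(text: str):
        yield text.encode()                         # placeholder: RealtimeTTS audio chunks go here

    async def handle_client(ws):
        audio = bytearray()
        async for message in ws:
            if isinstance(message, bytes):          # raw PCM chunk from the browser
                audio.extend(message)
                partial = transcribe(bytes(audio))  # incremental STT hypothesis
                if turn_is_complete(partial):
                    async for tts_chunk in synthesize(generate_reply(partial)):
                        await ws.send(tts_chunk)    # stream synthesized audio back
                    audio.clear()

    async def main():
        async with websockets.serve(handle_client, "0.0.0.0", 8765):
            await asyncio.Future()                  # run forever

    if __name__ == "__main__":
        asyncio.run(main())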
zaggynl
Neat! I'm already using openwebui/ollama with a 7900 xtx but the STT and TTS parts don't seem to work with it yet:
    [2025-05-05 20:53:15,808] [WARNING] [real_accelerator.py:194:get_accelerator] Setting accelerator to CPU. If you have GPU or other accelerator, we were unable to detect it.
    Error loading model for checkpoint ./models/Lasinya: This op had not been implemented on CPU backend.
dotancohen
This looks great. What hardware do you use, or have you tested it on?
ivape
Would you say you're using the best-in-class speech-to-text libs at the moment? I feel like this space is moving fast; the last time I was headed down this track, I was sure whisper.cpp was the best.
koljab
I'm not sure tbh. Whisper has been king for such a long time now, especially with the CTranslate2 implementation from faster_whisper. Now Nvidia open-sourced Parakeet TDT today and it instantly went to No. 1 on the Open ASR Leaderboard. Will have to evaluate these latest models; they look strong.
kristopolous
https://yummy-fir-7a4.notion.site/dia is the new hotness.
oezi
Parakeet is English-only. Stick with Whisper.
The core innovation is happening in TTS at the moment.
ivape
Yeah, I figured you would know. Thanks for that, bookmarking that asr leaderboard.
smusamashah
Saying this as a user of these tools (OpenAI, Google voice chat, etc.): these are fast, yes, but they don't allow talking naturally with pauses. When we talk, we take long and short pauses to think, or for other reasons.
With these tools, the AI starts talking as soon as we stop. Happens in both text and voice chat tools.
I saw a demo on Twitter a few weeks back where the AI was waiting for the person to actually finish what he was saying. The length of the pauses wasn't a problem. I don't know how complex that problem is, though. Probably another AI needs to analyse the input so far and decide whether it's a pause or not.
joshstrange
This 100%, yes!
I've found myself putting in filler words or holding a noise "Uhhhhhhhhh" while I'm trying to form a thought but I don't want the LLM to start replying. It's a really hard problem for sure. Similar to the problem of allowing for interruptions but not stopping if the user just says "Right!", "Yes", aka active listening.
One thing I love about MacWhisper (not unique to this STT tool) is its hold-to-talk mode, so I can stop talking for as long as I want and then start again without it deciding I'm done.
WhitneyLand
>>they don't allow talking naturally
Neither do phone calls. Round trip latency can easily be 300ms, which we’ve all learned to adapt our speech to.
If you want to feel true luxury, find an old analog PSTN line. No compression artifacts or delays. Beautiful, seamless 50ms latency.
Digital was a terrible development for call quality.
mvdtnz
I don't see how your post is relevant to the discussion of AI models interrupting if I pause for half a second.
qwertox
Maybe we should settle on some special sound or word which officially signals that we're making a pause for whatever reason, but that we intend to continue with dictating in a couple of seconds. Like "Hmm, wait".
twodave
Alternatively we could pretend it’s a radio and follow those conventions.
ivape
Two input streams sounds like a good hacky solution. One input stream captures everything; the second is on the lookout for your filler words like "um, aahh, waaiit, no nevermind, scratch that". The second stream can act as the veto command and cut off the LLM. A third input stream can simply be on the lookout for long pauses. All this gets very resource-intensive quickly. I've been meaning to build this, but since I haven't, I'm going to punish myself and just give the idea away. Hopefully I'll learn my lesson.
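A rough sketch of that veto idea (word lists and thresholds purely illustrative):

    # One check watches the incremental transcript for filler words and
    # suppresses the interrupt; explicit corrections cut the LLM off.
    FILLERS = {"um", "uh", "uhh", "aahh", "hmm", "wait"}
    VETO_PHRASES = ("no nevermind", "scratch that", "stop")

    def should_interrupt(partial_transcript: str) -> bool:
        text = partial_transcript.lower().strip()
        if any(phrase in text for phrase in VETO_PHRASES):
            return True                   # explicit correction: cut the LLM off
        words = text.split()
        if words and all(w.strip(",.") in FILLERS for w in words):
            return False                  # pure filler: the user is still thinking
        return len(words) >= 2            # otherwise treat real speech as an interrupt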
emtrixx
Could that not work with simple instructions? Let the AI decide to respond only with a special wait token until it thinks you are ready. Might not work perfectly but would be a start.
LZ_Khan
Honestly, I think this is a problem of over-engineering; simply letting the user press a button when they want to start talking and press it again when they're done is good enough. Or even a codeword for start and finish.
We don't need to feel like we're talking to a real person yet.
SubiculumCode
Yeah, when I am trying to learn about a topic, I need to think about my question, you know, pausing mid-sentence. All the products jump in and interrupt, no matter how often I tell them not to. Non-annoying humans don't jump in to fill the gap; they read my face, they take cues, and they wait for me to finish. It's one thing to ask an AI to give me directions to the nearest taco stand; it's another to have a dialogue about complex topics.
joshstrange
This is very, very cool! The interrupting was a "wow" moment for me (I know it's not "new new" but to see it so well done in open source was awesome).
Question about the interrupt feature: how does it handle "Mmk", "Yes", "Of course", a cough, etc.? Aside from the sycophancy of OpenAI's voice chat (no, not every question I ask is a "great question!"), I dislike that a noise sometimes stops the AI from responding and there isn't a great way to get back on track, to pick up where you left off.
It's a hard problem, how do you stop replying quickly AND make sure you are stopping for a good reason?
koljab
That's a great question! My first implementation triggered interruption on voice activity after echo cancellation. It still had way too many false positives. I changed it to use the incoming realtime transcription as the trigger. That adds a bit of latency, but it's compensated by much better accuracy.
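Roughly, the difference looks like this (a sketch, not the project's actual code; stt_partial_results and tts are assumed interfaces):

    # Trigger the interrupt only when real transcribed words arrive, not on raw
    # voice activity. stt_partial_results() and tts are assumed interfaces.
    MIN_CHARS = 3   # ignore empty / near-empty hypotheses from breaths and noise

    def interruption_loop(stt_partial_results, tts):
        for partial in stt_partial_results():         # yields incremental transcripts
            if tts.is_playing() and len(partial.strip()) >= MIN_CHARS:
                tts.stop()                            # real words detected: interrupt
                break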
Edit: just realized the irony but it's really a good question lol
joshstrange
That answer is even more than I could have hoped for. I worried doing that might be too slow. I wonder if it could be improved (without breaking something else) to "know" when to continue based on what it heard (active listening), maybe after a small pause. I'd put up with a chance of it continuing when I don't want it to as long as "Stop" would always work as a final fallback.
Also, it took me longer than I care to admit to get your irony reference. Well done.
Edit: Just to expand on that in case it was not clear, this would be the ideal case I think:
LLM: You're going to want to start by installing XYZ, then you
Human: Ahh, right
LLM: Slight pause, makes sure that there is nothing more and checks if the reply is a follow up question/response or just active listening
LLM: ...Then you will want to...
lacoolj
Call me when the AI can interrupt YOU :)
tintor
After an interrupt, unspoken words from the LLM are still in the chat window. Is the LLM even aware that it was interrupted, and where exactly?
cannonpr
Kind of surprised nobody has brought up https://www.sesame.com/research/crossing_the_uncanny_valley_...
It interacts nearly like a human, can and does interrupt me once it has enough context in many situations, and has exceedingly low latency. Using it for the first time was a fairly shocking experience for me.
varispeed
Didn't expect it to be that good! Nice.
briga
I'm starting to feel like LLMs need to be tuned for shorter responses. For every short sentence you give them, they output paragraphs of text. Sometimes it's even good text, but not every input sentence needs a mini-essay in response.
Very cool project though. Maybe you can fine-tune the prompt to change how chatty your AI is.
fintechie
Quite good, it would sound much better with SOTA voices though:
thamer
Does Dia support configuring voices now? I looked at it when it was first released, and you could only specify [S1] [S2] for the speakers, but not how they would sound.
There was also a very prominent issue where the voices would be sped up if the text was over a few sentences long; the longer the text, the faster it was spoken. One suggestion was to split the conversation into chunks with only one or two "turns" per speaker, but then you'd hear two voices then two more, then two more… with no way to configure any of it.
Dia looked cool on the surface when it was released, but for now it's only a demo and not at all usable for any real use case, even for a personal app. I'm sure they'll get to these issues eventually, but most comments I've seen so far recommending it are from people who haven't actually used it, or they would know about these major limitations.
koljab
Dia is too slow; I need a time to first audio chunk of ~100 milliseconds. Also, generations fail too often (artifacts, etc.).
IshKebab
Impressive! I guess the speech synthesis quality is the best available open source at the moment?
The endgame of this is surely a continuously running wave-to-wave model with no text tokens at all? Or at least none in the main path.
dcreater
Does the docker container work on Mac?
koljab
I doubt TTS will be fast enough for realtime without an Nvidia GPU.
oldgregg
Nice work, I like the lightweight web front end and your implementation of VAD.
I did some research into this about a year ago. Some fun facts I learned:
- The median delay between speakers in a human to human conversation is zero milliseconds. In other words, about 1/2 the time, one speaker interrupts the other, making the delay negative.
- Humans don't care about delays when speaking to known AIs. They assume the AI will need time to think. Most users will rate a 1000ms delay as acceptable and a 500ms delay as exceptional.
- Every voice assistant up to that point (and probably still today) has a minimum delay of about 300ms, because they all use silence detection to decide when to start responding, and you need about 300ms of silence to reliably differentiate it from a speaker's normal pause (see the toy endpointer sketch after this list).
- Alexa actually has a setting to increase this wait time for slower speakers.
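To make the 300ms point concrete, here's a toy energy-based endpointer (frame size and threshold are illustrative):

    # Toy silence-based endpointer: with a much shorter window, ordinary
    # intra-sentence pauses would trigger a response.
    import numpy as np

    FRAME_MS = 20
    SILENCE_MS_REQUIRED = 300
    ENERGY_THRESHOLD = 1e-4

    def end_of_utterance(frames) -> bool:
        """frames: iterable of 20ms float32 chunks; True once 300ms of silence is seen."""
        silent_ms = 0
        for frame in frames:
            energy = float(np.mean(np.asarray(frame, dtype=np.float32) ** 2))
            silent_ms = silent_ms + FRAME_MS if energy < ENERGY_THRESHOLD else 0
            if silent_ms >= SILENCE_MS_REQUIRED:
                return True
        return False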
You'll notice in this demo video that the AI never interrupts him, which is part of what makes it feel like a not-quite-human interaction (plus the stilted intonation of the voice).
Humans appear to process speech in a much more streaming way, constantly updating their parsing of the sentence until they have a high enough confidence level to respond, using context clues and prior knowledge.
For a voice assistant to get to "human" levels, it will have to work more like this, where it processes the incoming speech in real time and responds when it's confident it has heard enough to understand the meaning.