A16Z AI Voice Update 2025
32 comments
· February 25, 2025 · dougb5
scarface_74
Before LLMs, chatbots and voice bots were dumb pattern matchers. You had to list every “utterance” that you wanted to match on. The only variance was in the “slots”.
An utterance is something like “give me directions from $source to $destination”.
LLMs mean that you don’t have to give the system every utterance in every supported language.
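A minimal sketch of that pre-LLM approach, assuming a regex-based matcher (the patterns and function names here are illustrative, not any particular framework's API):

```python
import re

# Pre-LLM slot filling: every supported utterance pattern is listed by
# hand; only the slot values ($source, $destination) are allowed to vary.
UTTERANCE_PATTERNS = [
    re.compile(r"give me directions from (?P<source>.+) to (?P<destination>.+)"),
    re.compile(r"how do i get from (?P<source>.+) to (?P<destination>.+)"),
    re.compile(r"navigate from (?P<source>.+) to (?P<destination>.+)"),
]

def match_utterance(text: str):
    """Return the extracted slots, or None if no listed pattern matches."""
    for pattern in UTTERANCE_PATTERNS:
        m = pattern.fullmatch(text.lower().strip())
        if m:
            return m.groupdict()
    return None  # any unlisted phrasing fails, even a trivial rewording
```

Here "give me directions from Boston to Salem" matches and yields the two slots, while "directions to Salem from Boston" falls through to None, which is exactly the brittleness an LLM-based intent layer removes.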
seydor
First, I can't listen to this article, which makes their point kinda less relevant.
> It is the most frequent (and information dense)
Second, this is false. Voice is effective when the sensory context is available to both people, e.g. at the dinner table, where "pass the salt" makes immediate sense. Otherwise it is an erratic form of communication, prone to misunderstanding, often repetitive and redundant.
It is not more information dense, but it is the most immediate. The latency of AI applications makes its immediacy less useful.
Tepix
Voice is pretty sweet if you're driving, for example.
BrandiATMuhkuh
I'm pretty convinced that voice interaction will be the biggest UI change since apps.
Voice is simply natural to humans. Downloading an app to learn about the departure of the next bus is not.
I used voice bots to let my 5-year-old play role-playing games (e.g., checking into a hotel) or let my parents (60+) call a fake car dealership.
It's amazing to observe. They behave as if they're talking to a human, especially when doing it via a phone. That is exactly the UX a computer system should have—simply a phone number and voice.
As soon as people have to learn something new (a new webpage, a new app, etc.), something is wrong.
doug_durham
Voice interaction requires an enclosed area. I find it difficult to use any voice assistants in my life. Other people think I'm talking to them. Perhaps we'll all get single person offices with closing doors.
anon7000
You’re underestimating how many people are super antisocial, or at least don’t like talking that much! But it’s a fair point — I’d use Siri more if it were reliable.
azinman2
> For enterprises, AI directly replaces human labor with technology. It's cheaper, faster, more reliable — and often even outperforms humans.
That’s… quite the claim. I guess we’re picking the worst people, the best voice-based AI, the easiest of scenarios, and a total desire for humanity to remove other humans from interaction.
Pretty dark and sinister if you ask me.
ivolimmen
I have yet to 'meet' a voice AI on a phone. If I do and I can tell, I will hang up, and the company just lost a client. I am a person and I like speaking to persons, not machines. If a company thinks I am not worth talking to a human, it is not worth my money.
nfm
I dunno, that seems a bit narrow-minded to me. You're assuming that talking to AI is a worse experience than talking to a person (which is frequently _terrible_).
What if you were able to get helpful support, 24/7/365, with no time waiting in a queue, in your own language (regardless of the service provider's location and 'native' language support)? And the company was able to provide the product and support for it cheaper, resulting in less cost to you?
We're far from there, but I expect it'll happen.
Tepix
Have you used ChatGPT's advanced voice mode (voluntarily), and what was your experience like?
demaga
What's your plan for when you can't tell anymore?
anonzzzies
Voice is the most dense form of communication? Maybe if AI does STT perfectly all the time. But then the reverse: TTS is really not very efficient for me. I read far faster, and I can do a fast skim (taking milliseconds) to see if the answer is in there, or reprompt, instead of having to listen to the slow warbling of something/someone only to conclude it was worthless. Oh, and STT, at least for me, is not perfect; it often gets things wrong, making the other side return nonsense too.
rozap
> Voice is the most dense form of communication
This is one of those claims that's like... yeah, I guess you can go on the internet and just say things.
What a stupid slide deck. Jesus Christ.
muglug
I'd much rather type questions than ask them. Being able to review what I've written before I hit send gives me a sense of control lacking in voice interfaces.
mohsen1
Lots of negativity in the comments. If voice works, it's a superior UI than GUI. It's an article from an investment firm that's betting on this. Nothing wrong with that.
anonzzzies
> it's a superior UI than GUI.
I don't believe that. For input, maybe (though you probably still draw things to explain stuff, or send reference documents). For output, not at all; it really sucks. Not only is reading faster and more economical (if you can read, of course, but that's another story); adding visuals (images, charts, tables, animations, videos, calendars, kanban boards, mindmaps, etc.) really helps in communicating. That's all GUI.
vessenes
Wow, lots of negative responses here on voice. I’m a reader. I read. A lot. And I still think 4o’s advanced voice mode is unique and extremely useful, and I dearly wish we had open models, or even some closed competitive models, that were as good.
I will note that the model has been successively nerfed, massively, since launch. You can watch some pre-launch demo videos, or just try out some basic engagement: for instance, ask it to talk to you in various accents and see which ones OpenAI deems “inappropriate” to ask for and which are fine. This kind of enshittification is, I think, pretty likely when you are the only one in town with a product.
That said, even moderately enshittified, there’s something magic about an end-to-end trained multimodal model: it can change tone of voice on request. In fact, my standard prompt asks it to mirror my tone of voice and cadence. This is really unique. It’s not achievable through a Whisper -> LLM -> synthesizer/TTS approach. It can give you a Boston accent, speculate that a Marseille accent is the equivalent in French, and then (at least try to) give you a Marseille accent. This is pretty strong medicine, and I love it.
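A sketch of the cascaded pipeline being contrasted here, with every stage stubbed out (the types and function names are hypothetical stand-ins, not real APIs); the point is that the text-only handoff between stages is where tone is lost:

```python
from dataclasses import dataclass

@dataclass
class Audio:
    text: str  # the words spoken
    tone: str  # prosody: accent, cadence, emotion

def stt(audio: Audio) -> str:
    # Transcription keeps only the words; tone never leaves this stage.
    return audio.text

def llm(prompt: str) -> str:
    # Stand-in for the language model, which sees text only.
    return f"Echo: {prompt}"

def tts(text: str) -> Audio:
    # The synthesizer never saw the caller's tone, so it can
    # only render a default voice.
    return Audio(text=text, tone="default")

def cascaded_pipeline(audio: Audio) -> Audio:
    return tts(llm(stt(audio)))

reply = cascaded_pipeline(Audio(text="hello", tone="Boston accent"))
# reply.tone is "default": mirroring the caller's accent can't happen,
# because tone never crossed the text bottleneck between stages.
```

An end-to-end multimodal model, by contrast, consumes and produces audio directly, so prosody is available at every step rather than being discarded at the first one.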
There’s been so much LLM commoditization this year, and of course the chains keep moving forward on intelligence. But, I hope Ms. Moore is correct that we’ll see better and more voice models soon, and that someone can crack the architecture.
krembo
I think many of the negative comments in this thread come from people who haven't recently seen how the younger generations interact with Siri and her chatbot friends.
threeseed
AI voice is like AI art. I am sure many people will appreciate it and love it.
But the whole point of this medium is that you want the humanity and personality. Otherwise just use text.
wewewedxfgdf
I'm not convinced TTS can get all the way to the quality of professional actors for things like audiobooks.
I'll take a professional actor over TTS any day - incomparably better quality even with the best TTS.
TheAceOfHearts
The technology is still extremely young and immature. As models get better, it should be possible to build tools that allow manual annotations to tweak how much emotion and expressiveness goes into what's being communicated, and eventually it should be possible to fully automate a first pass at this which produces passable results.
In any case, I think the biggest win is that tons of books which have never received audiobooks now have the option of getting a way better alternative than legacy TTS tools. Even if current TTS tools are a bit limited, they still feel like a massive leap in quality from what was available a few years back. Making it trivial to generate better audiobooks will help make tons of information more accessible to people.
The choice of audiobook is rarely going to be between a professional actor and TTS, but between no audiobook at all or a TTS version.
doughnutstracks
Slightly off-topic, but here’s a video comparing a real voice actor to a mod in a video game. Personally, I think the mod sounds much better.
lelandfe
Meh - even though the original got negative reactions (for not hitting the mark as a sultry femme fatale), I still think her VO did better readings than a lot of the ones in this. Some of the mod's readings sound like several different takes by different people spliced together.
(This is still a crazy impressive amount of work; they clearly labored over matching things to facial expressions.)
> Voice agents allow businesses to be available to their customers 24/7 to answer questions, schedule appointments, or complete purchases. Customer availability and business availability no longer have to match 1:1 (ever tried to call an East Coast bank after 3 p.m. PT?). With voice agents, every business can always be online.
I don't get it -- textual support chatbots have been around for decades. Even if we accept the premise that people would rather speak to them by voice, how do voice agents represent some kind of sea change in availability?
(And I personally find customer support chatbots deeply frustrating to use for reasons that have nothing to do with the modality or the quality of the AI model. I only ever need to use one when the question I have is not answered in the documentation, which is often the extent of the chatbot's business-specific training data. Inevitably I end up being led in circles, screaming for a human.)