PlayAI's new Dialog model achieves 3:1 preference in human evals
18 comments
February 3, 2025
ekianjo
Yeah, there are no good options for Japanese yet (except maybe in Japan, but I haven't heard of good AI models for speech locally)
quickgist
For some reason, most of these (and other narration AIs) sound like someone reading off a teleprompter rather than natural speaking voices. I'm not sure exactly what it is, but I'm left feeling like the speaker isn't really sure what the next words are, and the stresses between the words are all over the place. It's like the emphasis across a sentence doesn't really match how humans sound.
crazygringo
Yup, and that's going to be the case until AIs can really model human psychology.
Speech encodes a gigantic amount of emotion via prosody and rhythm -- how the speaker is feeling, how they feel about each noun and verb, what they're trying to communicate with it.
If you try to reproduce all the normal speech prosody, it'll be all over the place and SoUnD bIzArRe and won't make any sense, and be incredibly distracting, because there's no coherent psychology behind it.
So "reading off a teleprompter" is really the best we can do for now -- not necessary affectless, but with a kind of "constant affect", that varies with grammatical structures and other language patterns, but no real human psychology.
It's a gigantic difference from text, which encodes vastly less information.
hedora
Oof. I've heard recent AI-generated narrators, and they were OK (much better than a few years ago, much worse than professional humans), but something about the digital postprocessing in this article's YouTube video reminded me of fingernails on a chalkboard.
I couldn't get halfway through.
echelon
PlayHT's voices are nowhere near as good as ElevenLabs'. These self-reported studies are marketing.
In any case, voice is such a thin vertical that I half expect the Chinese to release an open-source TTS model that outperforms everything on the market. Tencent probably has one of these cooking right now.
refulgentis
This is an excellent point: there's some sort of oddity where it's hard to say it's definitively AI, but I can definitively say it's a... low-quality human?
I really like "off a teleprompter"; it accurately characterizes the subtle dissonances where it sounds like someone reading something they haven't read before. The 0:14 "infectious (flat) beatsss (drawn out)" is nearly diametrically opposed to the paired snappy 0:12 "soulful (high / low) vocals (high)".
thot_experiment
I've been messing with the open-source side of audio generation, and expressiveness still takes work, but it's getting there. Roughly summarized, my findings are:
- zero-shot voice cloning isn't there yet
- gpt-sovits is the best at non-word vocalizations, but the overall quality is bad when just using zero-shot; finetuning helps
- F5 and fish-speech are both good as well
- xtts for me has had the best stability (I can rely on it not to hallucinate too much; the others I have to cherry-pick more to get good outputs)
- finetuning an xtts model for a few epochs on a particular speaker does wonders; if you have a good utterance library with emotions, conditioning a finetuned xtts model with that speaker expressing a particular emotion yields something very usable
- you can do speech-to-speech on the final output of xtts to get to something that (anecdotally) fools most of the people I've tried it on
- non-finetuned XTTS zero-shot -> seed-vc generates something that's okay too, especially if your conditioning audio is really solid
- really creepy, indistinguishable-at-a-casual-listen voice clones of arbitrary people are possible with as little as 30 minutes of speech; the resulting quality captures mannerisms and pacing eerily well, and it's easy to get clean input data from YouTube videos/podcasts using de-noising/vocal-extraction neural nets
TL;DR: use XTTS and pipe it into seed-vc; end-to-end, that pipeline runs at something like 2x realtime on my machine and generates highly controllable, natural-sounding voices, though you have to manually condition emotive speech. A rough sketch of the pipeline follows below.
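A minimal sketch of that XTTS -> seed-vc pipeline, assuming Coqui's TTS package for the XTTS step; the seed-vc step is shown as a shell invocation of its inference script, and those flag names are illustrative only, so check the seed-vc README for the exact interface:

```python
# Sketch: zero-shot XTTS synthesis, then a seed-vc voice-conversion pass.
# Assumes Coqui TTS is installed (pip install TTS) and a seed-vc checkout
# is available; reference.wav is a few seconds of clean speech from the
# target speaker.
import subprocess

from TTS.api import TTS

# 1) Zero-shot XTTS: clone the voice in reference.wav and speak the text.
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
tts.tts_to_file(
    text="The infectious beats pair nicely with the soulful vocals.",
    speaker_wav="reference.wav",
    language="en",
    file_path="xtts_out.wav",
)

# 2) Voice-conversion pass through seed-vc so timbre and mannerisms track
#    the reference more closely. Flag names here are illustrative, not the
#    actual seed-vc CLI -- adapt to whatever its inference script expects.
subprocess.run(
    [
        "python", "inference.py",
        "--source", "xtts_out.wav",
        "--target", "reference.wav",
        "--output", "final_out.wav",
    ],
    check=True,
)
```

Conditioning audio quality matters a lot here: the cleaner and more expressive reference.wav is, the better both the zero-shot clone and the conversion pass come out.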
ekianjo
XTTS is non-commercial use only, though
adriand
Do these services restrict the content that their AIs give voice to? If so, what are the typical restrictions? Like do they seek to prevent their tech being used for scamming, erotica, hate speech, etc? Or is it pretty much anything goes?
nine_k
How do you think they can restrict that? Require it in the EULA, then sue anyone who breaks the rules at a scale large enough to be worth the cost of the lawyers?
Or do you think they should analyze the text's sentiment and raise a flag if the sentiment is obviously breaking the EULA, e.g. some kind of hate speech?
How would you implement that?
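One naive sketch of such a check, assuming a Hugging Face toxicity classifier (unitary/toxic-bert) as a stand-in for whatever policy model a vendor might actually run; the label name and threshold below are assumptions:

```python
# Naive EULA screen: score the text with an off-the-shelf toxicity
# classifier and refuse to synthesize it above a threshold. Model choice,
# the "toxic" label name, and the 0.8 cutoff are all assumptions.
from transformers import pipeline

toxicity = pipeline("text-classification", model="unitary/toxic-bert")

def ok_to_synthesize(text: str, threshold: float = 0.8) -> bool:
    """Return True if the text passes the (very rough) policy check."""
    results = toxicity(text)
    return not any(r["label"] == "toxic" and r["score"] >= threshold for r in results)

print(ok_to_synthesize("Thanks for calling, how can I help you today?"))
```

This only catches the obvious cases; it says nothing about scams or impersonation, which is presumably why the question is hard.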
SeanAnderson
Wow! The preview is amazing! I would've 100% assumed those were human narrations if I hadn't been given leading context.
vessenes
So, this is really impressive. Expressivity and pacing are wayyy better. Eleven Labs has been tops for some time, but the difference is pretty remarkable!
legofan94
Thanks Peter! We think it really crushes for emotive text. Anything from storytelling to being emotionally reassuring. Still a lot of things up our sleeve too!
vessenes
I have a particular use case I’m interested in using agents for - any chance you want to have a call?
In brief, I'd like to be able to generate conversations via API, choosing from voices that should be unique on the order of thousands. Essentially I'm trying to simulate conversations in a small town. Eleven is not set up for this.
Ideally I’d be able to pick a spot in latent space for a voice programmatically. But I’m open to suggestions.
chzp94
Awesome!
I tried it with a paragraph of English taken from a formal speech, and it sounded quite good. I would not have been able to distinguish it from a skilled human narrator.
But then I tried a paragraph of Japanese text, also from a formal speech, with the language set to Japanese and the narrator set to Yumiko Narrative. The result was a weird mixture of Korean, Chinese, and Japanese readings for the kanji and kana, all with a Korean accent, and numbers read in English with an American accent. I regenerated the output twice, and the results were similar. Completely unusable.
I tried the same paragraph on ElevenLabs. The output was all in Japanese and had natural intonation, but there were two or three misreadings per sentence that would render it unusable for any practical purpose. Examples: 私の生の声 was read as watashi no koe no koe when it should have been watashi no nama no koe. 公開形式 was read as kōkai keiji instead of kōkai keishiki. Neither kanji misreading would be correct in any context. Even weirder, the year 2020 was read as 2021. Such misreadings would confuse and mislead any listeners.
I know that Japanese text-to-speech is especially challenging because kanji can often be read many different ways depending on the context, the specific referent, and other factors. But based on these tests, neither PlayAI nor ElevenLabs should be offering Japanese TTS services commercially yet.