Accents in latent spaces: How AI hears accent strength in English
58 comments
· May 6, 2025 · asveikau
JoshTko
Thank you for pinpointing my confusion/disconnect about the lack of improvement I was sensing. There was an improvement in pacing and cadence, yes, but that was not the main challenge with Victor's accent. Visually I'd say Victor improved by at most 5%, not the 50% indicated by the visualization. In some regards it was even harder to understand than the original, due to the speed and cadence changing without improvement in core pronunciation.
anadalakra
"If Victor wanted to move beyond this point, the sound-by-sound phonetic analysis available in the BoldVoice app would allow him to understand the patterns in pronunciation and stress that contribute to Eliza’s accent and teach him how to apply them in his own speech."
Indeed, Victor would likely receive a personalized lesson and practice on the NG sound in the app.
runelohrhauge
This is fascinating work. Love seeing how you’re combining machine learning with practical coaching to support real accent improvement. The concept of an “accent fingerprint” is especially clever, and the visualization of progress in latent space really brings it to life. Excited to see where you take this next!
georgewsinger
This is so cool. Real-time accent feedback is something language learners have never had throughout all of human history, until now.
Along similar lines, it would be useful to map a speaker's vowels in vowel-space (and likewise for consonants?) to compare native to non-native speakers.
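Something like this can be prototyped with off-the-shelf tools. Here's a rough sketch (my own, nothing to do with BoldVoice's model) that extracts F1/F2 formants from two recordings using the praat-parselmouth package and overlays them in a conventional vowel chart; the file names are placeholders:

```python
# Rough sketch of the vowel-space idea: extract F1/F2 formants from two
# recordings and scatter them to compare native vs. non-native vowel spaces.
import numpy as np
import parselmouth
import matplotlib.pyplot as plt

def formant_track(wav_path, step=0.01):
    """Return (F1, F2) pairs sampled every `step` seconds."""
    snd = parselmouth.Sound(wav_path)
    formants = snd.to_formant_burg()          # Praat's Burg formant tracker
    times = np.arange(0, snd.duration, step)
    f1 = np.array([formants.get_value_at_time(1, t) for t in times])
    f2 = np.array([formants.get_value_at_time(2, t) for t in times])
    voiced = ~np.isnan(f1) & ~np.isnan(f2)    # drop unvoiced/silent frames
    return f1[voiced], f2[voiced]

for path, label in [("native.wav", "native"), ("learner.wav", "learner")]:
    f1, f2 = formant_track(path)
    plt.scatter(f2, f1, s=4, alpha=0.3, label=label)

# Conventional vowel-chart orientation: F2 decreasing left-to-right, F1 top-down.
plt.gca().invert_xaxis()
plt.gca().invert_yaxis()
plt.xlabel("F2 (Hz)"); plt.ylabel("F1 (Hz)"); plt.legend(); plt.show()
```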
I can't wait until something like this is available for Japanese.
pjc50
> something language learners have never had throughout all of human history
... unless they had access to a native speaker and/or vocal coach? While an automated Henry Higgins is nifty, it's not something humans haven't been able to do themselves.
anadalakra
Native speakers are less helpful at this than you might think. Speech coaches are absolutely the way to go, but they're outside the price range for most people ($200+/hr for a good coach). BoldVoice gives coach-level feedback and instruction at a price point that everyone can access, on demand.
ilyausorov
That's a fascinating idea! Definitely something to try out for our team. We actively and continuously do all sorts of experiments with our machine learning models to be able to extract the most useful insights. We will definitely share if we find something useful here.
WhitneyLand
The "hear my own voice without an accent" thing is a really cool party trick.
I’d consider making this feature available free with super low friction, maybe no signup required, to get some viral traction.
ilyausorov
What if it was already available? Try it out at https://accentfilter.com!
pjc50
What the vector-space data gets right, and what the human commentary tends not to, is the idea that accents are a complex statistical distribution. You should be careful about the concept of a "default" or "neutral" accent. Telecommunications has spent the 20th century flattening accents together, as has accent discrimination. There's always the tendency for people to say "my accent is the neutral standard against which all others should be measured".
ilyausorov
For sure, and I don't think we ever use the term default or neutral. The "American English accent of our expert accent coach Eliza" is just that -- it's one accent.
As a learning platform that provides instruction to our users, we do need to set some kind of direction in our pedagogy, but we 100% recognize that there isn't just 1 American English accent, and there's lots of variance.
lurk2
> There's always the tendency for people to say "my accent is the neutral standard against which all others should be measured".
You can measure this by mutual intelligibility with other accent groupings.
SamBam
Like others recently, I've been extremely impressed by LLMs' ability to play GeoGuessr, or, more generally, to geo-locate random snapshots that you give them, with what seem (to me) to be almost no context clues. (I gave ChatGPT loads of holiday snapshots, screenshotted to remove metadata, and it did amazingly.)
I assume that, with enough training, we could get similarly accurate guesses of a person's linguistic history from their voice data.
Obviously it would be extremely tricky for lots of people. For instance, many people think I sound English or Irish. I grew up in France to American parents who both went to Oxford and spent 15 years in England. I wouldn't be surprised, though, if a well-trained model could do much better on my accent than "you sound kinda Irish."
nmstoker
Yes, although I believe this is a speaker embedding model here, so not LLM related.
This kind of speech clustering has been possible for years - the exciting point with their model here is how it's highly focused on accents alone. Here's a video of mine from 2020 that demonstrated this kind of voice clustering in the Mozilla TTS repo (sadly the code got broken + dropped after a refactoring). Bokeh made it possible to directly click on points in a cluster and have them play:
https://youtu.be/KW3oO7JVa7Q?si=1w-4pU5488WxYL3l
Note: take care when listening as the audio level varies a bit (sorry!)
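For anyone who wants to try this at home, the basic pipeline is easy to reproduce from off-the-shelf parts. Here's a minimal sketch (not the code from the video, and not BoldVoice's model), assuming the Resemblyzer package for generic speaker embeddings and placeholder clip paths: embed each clip, project to 2-D with t-SNE, and use a Bokeh tap callback to play the clicked point:

```python
# Sketch: cluster speaker embeddings and make the points clickable/audible.
from pathlib import Path

import numpy as np
from resemblyzer import VoiceEncoder, preprocess_wav
from sklearn.manifold import TSNE
from bokeh.plotting import figure, show
from bokeh.models import ColumnDataSource, CustomJS

clips = sorted(Path("clips").glob("*.wav"))
encoder = VoiceEncoder()
embeddings = np.stack([encoder.embed_utterance(preprocess_wav(p)) for p in clips])

# Project the high-dimensional embeddings down to 2-D for plotting.
# (perplexity must be smaller than the number of clips)
xy = TSNE(n_components=2, perplexity=min(30, len(clips) - 1)).fit_transform(embeddings)

source = ColumnDataSource(dict(
    x=xy[:, 0], y=xy[:, 1],
    name=[p.stem for p in clips],
    url=[f"clips/{p.name}" for p in clips],   # served alongside the HTML output
))

p = figure(title="Speaker embeddings (t-SNE)", tools="tap,pan,wheel_zoom,reset",
           tooltips=[("clip", "@name")])
p.scatter("x", "y", source=source, size=8)

# Play the clicked clip in the browser.
source.selected.js_on_change("indices", CustomJS(args=dict(source=source), code="""
    const i = source.selected.indices[0];
    if (i !== undefined) { new Audio(source.data['url'][i]).play(); }
"""))
show(p)
```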
ilyausorov
We actually did something like this for non-native English speakers a few months back. Check out https://accentoracle.com (most mind-blowing if you're a non-native English speaker).
nmeofthestate
I'm 42% Arabic apparently! And 20% Russian. Got an 81% American accent level. I guess it is tuned to non-native-English speaker accents.
ilyausorov
Was that right? Or what was the correct native language it should have predicted? Note that the %s in the accent breakdown section are prediction probabilities.
chris_va
I bet you are right.
I had a forensic linguistics TA during college who was able to identify the island in Southeast Asia one of the students grew up on, and where they moved to in the UK as a teenager before coming to the US (if I am remembering this story right).
From what I gather, there are a lot of clues in how we speak that most brains edit out when parsing language.
ccppurcell
Oh pssh. There's no such thing as accent strength. There's only accent distance. Accent strength is just an artefact of distance from the accent of a socially dominant group.
dmurray
The article defines accent strength in precisely this way, as the difference "relative to native speakers of English".
That group has a vast range of accents, but it's believable that that range occupies an identifiable part of the multi-dimensional accent space, and has very little overlap with, for example, beginner ESL students from China.
Even between native speakers, I bet you could come up with some measure of centrality and measure accent strength as a distance from that. And if language families exist on a continuum, there must be some point on that continuum where you are no longer speaking English, but say Scots or Frisian or Nigerian Creole instead. Accents close to those points are objectively stronger.
But there is a lot of freedom in how you measure centrality - if you weight by number of speakers, you might expect to get some mid-American or mid-Atlantic accent, but wind up with the dialect of semi-literate Hyderabad call centre workers.
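Concretely, the centrality idea is just a (possibly weighted) centroid in embedding space, with strength as distance from it. A toy sketch with made-up numbers, not the article's method:

```python
# Toy sketch: define "accent strength" as distance from a chosen centre of
# native-speaker embeddings. Embeddings and weights here are placeholders.
import numpy as np

rng = np.random.default_rng(0)
native_embeddings = rng.normal(size=(1000, 64))   # embeddings of native speakers
weights = rng.uniform(1, 100, size=1000)          # e.g. speakers represented per sample

# Weighting by population pulls the "centre" toward the most numerous accents.
centroid = np.average(native_embeddings, axis=0, weights=weights)

def accent_strength(embedding, centre=centroid):
    """Distance from the chosen centre; larger = 'stronger' under this definition."""
    return np.linalg.norm(embedding - centre)

learner = rng.normal(loc=0.5, size=64)            # a hypothetical non-native speaker
print(accent_strength(learner))
```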
joshuaissac
> relative to native speakers of English
> Even between native speakers, I bet you could come up with some measure of centrality and measure accent strength as a distance from that
Is that what BoldVoice is actually doing? At least from what the article is saying, it is measuring the strength of the user's American English accent (maybe GenAm?), and there is no discussion of any user choice of native accent to target.
dmurray
> Is that what BoldVoice is actually doing?
No, I don't think it is doing that; I'm just taking issue with ccppurcell, who seems to believe that any definition of accent strength is chauvinistic.
ilyausorov
Indeed, although the inference output of the model is based on the ratings input that we trained it on. And that rating input was done by American English native speakers, so this iteration of the model is centered towards those accents more than e.g. UK or Australian or other accents of English from outside the US.
ilyausorov
Sure, that's fair. We apply labels that have a connotation of strength based on the distance, but the underlying calculation is indeed based on distance.
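In other words, the labels are just buckets over a distance score. Purely illustrative, with made-up thresholds (not the ones we use):

```python
# Illustrative only: how a continuous distance might be bucketed into labels.
def strength_label(distance: float) -> str:
    if distance < 0.5:
        return "light accent"
    if distance < 1.0:
        return "moderate accent"
    return "strong accent"
```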
semiquaver
What a silly nitpick. You’re just using different words to say the same thing.
ccheever
This is really cool.
Just had an employee at our company start expensing BoldVoice. Being able to be understood more easily is a big deal for global remote employees.
(Note - I am a small investor in BoldVoice)
adhsu01
Super cool work, congrats BoldVoice team! I've always thought that one of the non-obvious applications of voice cloning/matching is the ability to show a language learner what they would sound like with a more native accent.
ilyausorov
This and more exciting features are coming to the BoldVoice app soon!
oscar120
this^
fxtentacle
What a great AI use-case! At first, I felt excited ...
But then I read their privacy policy. They want permission to save all of my audio interactions for all eternity. It's so sad that I will never try out their (admittedly super cool) AI tech.
anadalakra
You can reach out and request your data to be deleted at any time.
fxtentacle
"if you wish to opt out of future collection of voice samples, you may do so by disabling voice-related features in the BoldVoice app. Please note that this may limit the functionality of certain services."
Yeah, I can opt out. By not using any voice-related feature in their voice training app.
anadalakra
If you're still actively using the app, the voice samples will be retained and processed so that you can receive instant feedback, and also so that you receive additional personalized practice items and video lessons based on your speech needs. If you don't want the samples saved "in perpetuity", you can request that they be deleted once you decide that you're done with the application. Hope this helps!
childintime
I didn't find international English; that would have been interesting.
Also, the US writing convention falls short, like "who put the dot inside the string." Crazy. Rational people "put the dot after the string". No spelling corrector should change that.
Victor's problem isn't really the vowels or pacing. The final consonants are soft or not really audible; the most marked example is that I'm not hearing the /ŋ/ of "long" -- it sounds closer to "law". In his "improved" recording he hasn't fixed this.
I sometimes see content on social media encouraging people to sound more native or improve their accent. But IMO it's perfectly ok to have an accent, as long as the speech meets some baseline of intelligibility. (So Victor needs to work on "long" but not "days".) I've even come across people who are trying to mimic a native accent but lose intelligibility, where they'd sound better with their foreign accent. (An example I've seen is a native Spanish speaker trying to imitate the American accent's intervocalic T and D, and I don't understand them. A Spanish /t/ or /d/ would be different from most English-language accents, but would be way more understandable.)