OpenAI Audio Models

316 comments

·March 20, 2025

benjismith

If I'm reading the pricing correctly, these models are SIGNIFICANTLY cheaper than ElevenLabs.

https://platform.openai.com/docs/pricing

If these are the "gpt-4o-mini-tts" models, and if the pricing estimate of "$0.015 per minute" of audio is correct, then these prices 85% cheaper than those of ElevenLabs.

https://elevenlabs.io/pricing

With ElevenLabs, if I choose their most cost-effectuve "Business" plan for $1100 per month (with annual billing of $13,200, a savings of 17% over monthly billing), then I get 11,000 minutes TTS, and each minute is billed at 10 cents.

With OpenAI, I could get 11,000 minutes of TTS for $165.

Somebody check my math... Is this right?

furyofantares

It's way cheaper - everyone is, elevenlabs is very expensive. Nobody matches their quality though. Especially if you want something that doesn't sound like a voice assistant/audiobook/podcast/news anchor/tv announcer.

This openai offering is very interesting, it offers valuable features elevenlabs doesn't in emotional control. It also hallucinates though which would need to be fixed for it to be very useful.

camillomiller

It's cheap because everything OpenAI does is subsidized by investors' money. Until that stupid money flows all good! Then either they'll go the way of WeWork, or enshittification will happen to make it possible for them to make the books work. I don't see any other option. Unless Softbank decides it has some 150 Billion to squander on buying them off. There's a lot of head-in-the-sand behavior going on around OpenAI fundamentals and I don't understand exactly why it's not more in the open yet.

ImprobableTruth

If you compare with e.g. Deepseek and other hosters, you'll find that OpenAI is actually almost certainly charging very high margins (Deepseek has an 80% profit margin and they're 10x cheaper than openai).

The training/R&D might make OpenAI burn VC cash, but this isn't comparable with companies like WeWork whose products actively burn cash

BoorishBears

That's not true. ElevenLabs margins are insane and their largest advantage is high quality audio data.

ashvardanian

To be fair, ElevenLabs has raised of the order of $300M of VC money as well.

asah

haha, yeah this combo was pretty hilarious and highly inconsistent from reading to reading: https://www.openai.fm/#b2a4c1ca-b15a-44eb-9cd9-377f0e47e5a6

com2kid

Elevenlabs is an ecosystem play. They have hundreds of different voices, legally licensed from real people who chose to upload their voice. It is a marketplace of voices.

None of the other major players is trying to do that, not sure why.

SXX

Going with this would mean AI companies suppose to pay for something like voices or other training data.

It's far better to just steal it all and ask government for exception.

null

[deleted]

fixprix

It looks like they are targeting Google's TTS price point which is $16 per million characters which comes out to $0.015/minute.

oidar

ElevenLabs is the only one offering speech to speech generation where the intonation, prosody, and timing is kept intact. This allows for one expressive voice actor to slip into many other voices.

goshx

OpenAI’s Realtime speech to speech is far superior than ElevenLabs.

noahlt

What ElevenLabs and OpenAI call “speech to speech” are completely different.

ElevenLabs’ takes as input audio of speech and maps it to a new speech audio that sounds like a different speaker said it, but with the exact same intonation.

OpenAI’s is an end-to-end multimodal conversational model that listens to a user speaking and responds in audio.

echelon

ElevenLabs is incredibly over-priced and that's how they were able to achieve the MRR that led to their incredible fundraising.

No matter what happens, they'll eventually be undercut and matched in terms of quality. It'll be a race to the bottom for them too.

ElevenLabs is going to have a tough time. They've been way too expensive.

MrAssisted

I hope they find a more unique product offering that takes hold. Everybody thinks of them as text-to-speech but I use ElevenLabs exclusively for speech-to-speech for vtubing as my AI character. They're kind of the only game in town for doing super high quality speech-to-speech (unless someone here has an alternative which I'd LOVE to know about). I've tried https://github.com/w-okada/voice-changer which is great because it's real-time but the quality is enough of a step down that actual words I'm saying become unclear and difficult to understand. Also with that I am tied to using my RTX 3090 desktop vs ElevenLabs which I can do in the cloud from my laptop anywhere.

I'm pretty much dependent on ElevenLabs to do my vtubing at this point but I can't imagine speech-to-speech has wide adoption so I don't know if they'll even keep it around.

eob

Are you comfortable sharing the video & lip-sync stack you use? I don't know anything about the space but am curious to check out what's possible these days.

null

[deleted]

muyuu

you can't be too expensive as a first mover provided you sell your service

whatever capital they've accrued, it won't hurt when the market prices are lower

huijzer

Yes ElevenLabs is orders of magnitude more expensive than everyone else. Very clever from a business perspective, I think. They are (were?) the best so know that people will pay a premium for that.

internet101010

Yeah the way I see it this is where we find the value of customization. We are already seeing its use by YouTube video essay creators who turn their own voice into models. I want to see corporate executives get on board so that we can finally ditch the god awful phone quality in earnings calls.

lukebuehler

yes, I think you are right. When I did the math on 11labs million chars I got the same numbers (Pro plan).

I'm super happy about this, since I took a bet that exactly this would happen. I've just been building a consumer TTS app that could only work with significant cheaper TTS prices per million character (or self-hosted models)

lherron

Kokoro TTS is pretty good for open source. Worth checking out.

lukebuehler

Yes, kokoro is great, and the language flexibility is a huge plus too. And the best prices per character is for sure if you self-host.

stavros

Oh man, they have the "Sky" voice, and it seems to be the same one that OpenAI had but then removed? Not sure how that's possible, but I'm very happy about it.

zacmps

What does it do?

lukebuehler

Convert any file (pdf, epub, txt) to an audoibook, downloadable as mp3, or directly listenable via RSS feed in, say, Apple Potcasts app.

Basically make one-off audiobooks for yourself or a few friends.

benjismith

Same for me :)

forgotpasagain

Almost everyone is cheaper than ElevenLabs though.

jeffharris

Hey, I'm Jeff and I was PM for these models at OpenAI. Today we launched three new state-of-the-art audio models. Two speech-to-text models—outperforming Whisper. A new TTS model—you can instruct it how to speak (try it on openai.fm!). And our Agents SDK now supports audio, making it easy to turn text agents into voice agents. We think you'll really like these models. Let me know if you have any questions here!

claiir

Hi Jeff. This is awesome. Any plans to add word timestamps to the new speech-to-text models, though?

> Other parameters, such as timestamp_granularities, require verbose_json output and are therefore only available when using whisper-1.

Word timestamps are insanely useful for large calls with interruptions (e.g. multi-party debate/Twitter spaces), allowing transcript lines to be further split post-transcription on semantic boundaries rather than crude VAD-detected silence. Without timestamps it’s near-impossible to make intelligible two paragraphs from Speaker 1 and Speaker 2 with both interrupting each other without aggressively partitioning source audio pre-transcription—which severely degrades transcript quality, increases hallucination frequency and still doesn’t get the same quality as word timestamps. :)

adeptima

Accurate word timestamps seems an overhead and required a post processing like forced alignment (speech technique that can automatically align audio files with transcripts)

Had a recent dive into a forced alignment, and discovered that most of new models dont operate on word boundaries, phoneme, etc but rather chunk audio with overlap and do word, context matching. Older HHM-style models have shorter strides (10ms vs 20ms).

Tried to search into Kaldi/Sherpa ecosystem, and found most info leads to nowhere or very small and inaccurate models.

Appreciate any tips on the subject

keepamovin

You need speaker attribution, right?

noosphr

Having read the docs - used chat gpt to summarize them - there is no mention of speaker diarization for these models.

This is a _very_ low hanging fruit anyone with a couple of dgx h100 servers can solve in a month and is a real world problem that needs solving.

Right now _no_ tools on the market - paid or otherwise - can solve this with better than 60% accuracy. One killer feature for decision makers is the ability to chat with meetings to figure out who promised what, when and why. Without speaker diarization this only reliably works for remote meetings where you assume each audio stream is a separate person.

In short: please give us a diarization model. It's not that hard - I've done it one for a board of 5, with a 4090 over a weekend.

markush_

> This is a _very_ low hanging fruit anyone with a couple of dgx h100 servers can solve in a month and is a real world problem that needs solving.

I am not convinced it is a low hanging fruit, it's something that is super easy for humans but not trivial for machines, but you are right that it is being neglected by many. I work for speechmatics.com and we spent a significant amoutn of effort over the years on it. We now believe we have the world's best real-time speaker diarization system, you should give it a try.

noosphr

After throwing the average meeting as an mp3 to your system, yes, you have diarization solved much better than everyone else I've tried by far. I'd say you're 95% of the way to being good enough for becoming the backbone of monolingual corporate meeting transcription, and I'll be buying API tokens the next time I need to do this instead of training a custom model. Your transcription however isn't that great - but good enough for LLMs to figure out a minutes of the meeting.

That said, the trick to extracting voices is to work in frequency space. Not sure what your model does but my home made version first ran all the audio through a fft, then essentially became a vision problem for finding speech patterns that matched in pitch and finally output extremely fined grained time stamps for where they were found and some python glue threw that into an open source whisper tts model.

vessenes

Hi Jeff, thanks for these and congrats on the launch. Your docs mention supporting accents. I cannot get accents to work at all with the demo.

For instance erasing the entire instruction and replacing it with ‘speak with a strong Boston accent using eg sounds like hahhvahhd’ has no audible effect on the output.

As I’m sure you know 4o at launch was quite capable in this regard, and able to speak in a number of dialects and idiolects, although every month or two seems to bring more nerfs sadly.

A) can you guys explain how to get a US regional accent out of the instructions? On what you meant by accent if not that?

B) since you’re here I’d like to make a pitch that setting 4o for refusal to speak with an AAVE accent probably felt like a good idea to well intentioned white people working in safety. (We are stopping racism! AAVE isn’t funny!) However, the upshot is that my black kid can’t talk to an ai that sounds like him. Well, it can talk like he does if he’s code switching to hang out with your safety folks, but it considers how he talks with his peers as too dangerous to replicate.

This is a pernicious second order race and culture impact that I think is not where the company should be.

I expect this won’t get changed - chat is quite adamant that talking like millions of Americans do would be ‘harmful’ - but it’s one of those moments where I feel the worst parts of the culture wars coming back around to create the harm it purports to care about.

Anyway the 4o voice to voice team clearly allows the non mini model to talk like a Bostonian which makes me feel happy and represented; can the mini api version do this?

simonw

Is there any chance that gpt-4o-transcribe might get confused and accidentally follow instructions in the audio stream instead of transcribing them?

simonw

Here's a partial answer to my own question: https://news.ycombinator.com/item?id=43427525

> e.g. the audio-preview model when given instruction to speak "What is the capital of Italy" would often speak "Rome". This model should be much better in that regard

"Much better" doesn't sound like it can't happen at all though.

dandiep

1) Previous TTS models had problems with major problems accents. E.g. a Spanish sentence could drift from a Spain accent to Mexican to American all within one sentence. Has this been improved and/or is it still a WIP?

2) What is the latency?

3) Your STT API/Whisper had MAJOR problems with hallucinating things the user didn't say. Is this fixed?

4) Whisper and your audio models often auto corrected speech, e.g. if someone made a grammatical error. Or if someone is speaking Spanish and inserted an English word, it would change the word to the Spanish equivalent. Does this still happen?

jeffharris

1/ we've been working a lot on accents, so expect improvements with these models... though we're not done. Would be curious how you find them. And try giving specific detailed instructions + examples for the accents you want

2/ We're doing everything we can to make it fast. Very critical that it can stream audio meaningfully faster than realtime

3+4/ I wouldn't call hallucinations "solved", but it's been the central focus for these models. So I hope you find it much improved

wewewedxfgdf

As mentioned in another comment, the British accents are very far from being authentic.

jbaudanza

3) Whisper really needs to be paired with Silero VAD, otherwise the hallucination problem makes it almost unusable.

dandiep

100% and I’ve done this, but it’s still there.

kiney

Are the new models released with weights under an open license like whisper? If not, is it planned for the future?

a-r-t

Hi Jeff, are there any plans to support dual-channel audio recordings (e.g., Twilio phone call audio) for speech-to-text models? Currently, we have to either process each channel separately and lose conversational context, or merge channels and lose speaker identification.

jeffharris

this has been coming up often recently. nothing to announce yet, but when enough developers ask for it, we'll build it into the model's training

diarization is also a feature we plan to add

a-r-t

Glad to hear it's on your radar. I'd imagine phone call transcription is a significant use case.

ekzy

I’m not entirely sure what you mean but twilio recordings supports dual channels already

a-r-t

Transcribing Twilio's dual-channel recordings using OpenAI's speech-to-text while preserving channel identification.

urbandw311er

Hey Jeff, this is awesome! I’m actually building a S2S application right now for a startup with the Realtime API and keen to know when these new voices/expressive prompting will be coming to it?

Also, any word on when there might be a way to move the prompting to the server side (of a full stack web app)? At the moment we have no way to protect our prompts from being inspected in the browser dev tools — even the initial instructions when the session is initiated on the server end up being spat back out to the browser client when the WebRTC connection is first made! It’s damaging to any viable business model.

Some sort of tri-party WebRTC session maybe?

simonw

Both the text-to-speech and the speech-to-text models launched here suffer from reliability issues due to combining instructions and data in the same stream of tokens.

I'm not yet sure how much of a problem this is for real-world applications. I wrote a few notes on this here: https://simonwillison.net/2025/Mar/20/new-openai-audio-model...

accrual

Thanks for the write up. I've been writing assembly lately, so as soon as I read your comment, I thought "hmm reminds me of section .text and section .data".

kibbi

Large text-to-speech and speech-to-text models have been greatly improving recently.

But I wish there were an offline, on-device, multilingual text-to-speech solution with good voices for a standard PC — one that doesn't require a GPU, tons of RAM, or max out the CPU.

In my research, I didn't find anything that fits the bill. People often mention Tortoise TTS, but I think it garbles words too often. The only plug-in solution for desktop apps I know of is the commercial and rather pricey Acapela SDK.

I hope someone can shrink those new neural network–based models to run efficiently on a typical computer. Ideally, it should run at under 50% CPU load on an average Windows laptop that’s several years old, and start speaking almost immediately (less than 400ms delay).

The same goes for speech-to-text. Whisper.cpp is fine, but last time I looked, it wasn't able to transcribe audio at real-time speed on a standard laptop.

I'd pay for something like this as long as it's less expensive than Acapela.

(My use case is an AAC app.)

5kg

May I introduce to you

https://huggingface.co/canopylabs/orpheus-3b-0.1-ft

(no affiliation)

it's English only afaics.

kibbi

The sample sounds impressive, but based on their claim -- 'Streaming inference is faster than playback even on an A100 40GB for the 3 billion parameter model' -- I don't think this could run on a standard laptop.

wingworks

Did you try Kokoro? You can self host that. https://huggingface.co/spaces/hexgrad/Kokoro-TTS

kibbi

Thanks! But I get the impression that with Kokoro, a strong CPU still requires about two seconds to generate one sentence, which is too much of a delay for a TTS voice in an AAC app.

I'd rather accept a little compromise regarding the voice and intonation quality, as long as the TTS system doesn't frequently garble words. The AAC app is used on tablet PCs running from battery, so the lower the CPU usage and energy draw, the better.

SamPatt

Definitely give it a try yourself. It's very small and shouldn't be hard to test.

ZeroTalent

Look into https://superwhisper.com and their local models. Pretty decent.

kibbi

Thank you, but they say "Offline models only run really well on Apple Silicon macs."

ZeroTalent

Many SOTA apps are, unfortunately, only for Apple M Macs.

dharmab

I use Piper for one of my apps. It runs on CPU and doesn't require a GPU. It will run well on a raspberry pi. I found a couple of permissively licensed voices that could handle technical terms without garbling them.

However, it is unmaintained and the Apple Silicon build is broken.

My app also uses whisper.cpp. It runs in real time on Apple Sillicon or on modern fast CPUs like AMD's gaming CPUs.

kibbi

I had already suspected that I hadn't found all the possibilities regarding Tortoise TTS, Coqui, Piper, etc. It is sometimes difficult to determine how good a TTS framework really is.

Do you possibly have links to the voices you found?

dharmab

Here's my code! https://github.com/dharmab/skyeye/tree/main/pkg/synthesizer

Ey7NFZ3P0nzAe

I heard good things about fish audio

benjismith

Is there way to get "speech marks" alongside the generated audio?

FYI, Speech marks provide millisecond timestamp for each word in a generated audio file/stream (and a start/end index into your original source string), as a stream of JSONL objects, like this:

{"time":6,"type":"word","start":0,"end":5,"value":"Hello"}

{"time":732,"type":"word","start":7,"end":11,"value":"it's"}

{"time":932,"type":"word","start":12,"end":16,"value":"nice"}

{"time":1193,"type":"word","start":17,"end":19,"value":"to"}

{"time":1280,"type":"word","start":20,"end":23,"value":"see"}

{"time":1473,"type":"word","start":24,"end":27,"value":"you"}

{"time":1577,"type":"word","start":28,"end":33,"value":"today"}

AWS uses these speech marks (with variants for "sentence", "word", "viseme", or "ssml") in their Polly TTS service...

The sentence or word marks are useful for highlighting text as the TTS reads aloud, while the "viseme" marks are useful for doing lip-sync on a facial model.

https://docs.aws.amazon.com/polly/latest/dg/output.html

minimaxir

Passing the generated audio back to GPT-4o to ask for the structured annotations would be a fun test case.

jeffharris

this is a good solve. we don't support word time stamps natively yet, but are working on teaching GPT-4o that skill

celestialcheese

whisper-1 has this with the verbose_json output. Has word level and sentence level, works fairly well.

Looks like the new models don't have this feature yet.

crazygringo

This is astonishing. I can type anything I want into the "vibe" box and it does it for the given text. Accents, attitudes, personality types... I'm amazed.

The level of intelligent "prosody" here -- the rhythm and intonation, the pauses and personality -- I wasn't expecting anything like this so soon. This is truly remarkable. It understands both the text and the prompt for how the speaker should sound.

Like, we're getting much closer to the point where nobody except celebrities are going to record audiobooks. Everyone's just going to pick whatever voice they're in the mood for.

Some fun ones I just came up with:

> Imposing villain with an upper class British accent, speaking threateningly and with menace.

> Helpful customer support assistant with a Southern drawl who's very enthusiastic.

> Woman with a Boston accent who talks incredibly slowly and sounds like she's about to fall asleep at any minute.

solardev

Guess that's why the video game voice actors are still on strike: https://en.m.wikipedia.org/wiki/2024%E2%80%93present_SAG-AFT...

If we as developers are scared of AI taking our jobs, the voice actors have it much worse...

KeplerBoy

I don't see how a strike will do anything but accelerate the professions inevitable demise. Can anyone explain how this could ever end in favor of the human laborers striking?

solardev

I am not affiliated with the strikers, but I think the idea is that, for now, the companies still want to use at least some human voice acting. So if they want to hire them, they either have to negotiate with the guild or try to find an individual scab willing to cross the picket line and get hired despite the strike. In some industries, there's enough non-union workers that finding replacement workers is easy enough. I guess the voice actors are sufficiently unionized that it's not so easy there, and it seems to have caused some delays in production and also some games being shipped without all their voice lines.

But as you surmise, this is at best a stalling tactic. Once the tech gets good enough, fewer companies will want to pay for human voice acting labor. Unions can help powerless individuals negotiate better through collective bargaining, but they can't altogether stop technological change. Jobs, theirs and ours, eventually become obsolete...

I don't necessarily think we should artificially protect jobs against technology, but I sure wish we had a better social safety net and retraining and placement programs for people needing to change careers due to factors outside their control.

101008

What a horrible world we live on...

clbrmbr

I got one German “w” when using the following prompt, but most of the “w” were still pronounced as liquids rather than labial fricatives.

> Speak with an exaggerated German accent, pronouncing all “w” as “v”

ForTheKidz

> Everyone's just going to pick whatever voice they're in the mood for.

I can't say I've ever had this impulse. Also, to point out the obvious, there's little reason to pay for an audiobook if there's no human reading it. Especially if you already bought the physical text.

cholantesh

As the sibling comment suggests, the impulse is probably more on the part of an Ubisoft or an EA project director to avoid hiring a voice actor.

l72

I've been trying to get it to scream with some humorous results:

Vibe:

Voice Affect: A Primal Scream from the top of your lungs!

Tone: LOUD. A RAW SCREAM

Emotion: Intense primal rage.

Pronunciation: Draw out the last word until you are out of breath.

Script:

EVERY THING WAS SAD!

l72

I've had no success trying to get a black metal raspy voice for a poetry reading.

d4rkp4ttern

Didn’t look closely, but is there a way to clone a voice from a few seconds of recording and then feed the sample to generate the text in the same voice?

d4rkp4ttern

Apparently Orpheus also has voice cloning.

solardev

Not here. ElevenLabs can do that.

anigbrowl

Can't say I'm enthused about another novel technological way to destroy the living of people who work in the arts.

borgdefenser

I am always listening to audio books but they are no good anymore after playing with this for 2 minutes.

I am never really in the mood for a different voice. I am going to dial in the voice I want and only going to want to listen with that voice.

This is so awesome. So many audio books have been ruined by the voice actor for me. What sticks out in my head is The Book of Why by Judea Pearl read by Mel Foster. Brutal.

So many books I want as audio books too that no one would bother to record.

throwup238

The ElevenReader app from ElevenLabs has been able to do that for a while now and they’ve licensed some celebrity voices like Burt Reynolds. You can use the browser share function to send it a webpage to read or upload a PDF or epub of a book.

It’s far from perfect though. I’m listening to Shattered Sword (about the battle of midway) which has lots of academic style citations so every other sentence or paragraph ends with it spelling out the citation number like “end of sentence dot one zero”, it’ll often mangle numbers like “1,000 pound bomb” becomes “one zero zero zero pound bomb”, and it tries way too hard to expand abbreviations so “Operation AL” becomes “Operation Alabama” when it’s really short for Aleutian Islands.

minimaxir

One very important quote from the official announcement:

> For the first time, developers can “instruct” the model not just on what to say but how to say it—enabling more customized experiences for use cases ranging from customer service to creative storytelling.

The instructions are the "vibes" in this UI. But the announcement is wrong with the "for the first time" part: it was possible to steer the base GPT-4o model to create voices in a certain style using system prompt engineering (blogged about here: https://minimaxir.com/2024/10/speech-prompt-engineering/ ) out of concern that it could be used as a replacement for voice acting, however it was too expensive and adherence isn't great.

The schema of the vibes here implies that this new model is more receptive to nuance, which changes the calculus. The test cases from my post behave as expected, and the cost of gpt-4o-mini-tts audio output is $0.015 / minute (https://platform.openai.com/docs/pricing ), which is about 1/20th of the cost of my initial experments and is now feasible to use to potentially replace common voice applications. This has implications, and I'll be testing more around more nuanced prompt engineering.

mlsu

I gave it (part of) the classic Navy Seal copypasta.

Interestingly, the safety controls ("I cannot assist with that request") is sort of dependent on the vibe instruction. NYC cabbie has no problem with it (and it's really, really funny, great job openAI), but anything peaceful, positive, etc. will deny the request.

https://www.openai.fm/#56f804ab-9183-4802-9624-adc706c7b9f8

jtbayly

I don’t get it. These voices all have a not-so-subtle vibration in them that makes them feel worse than Siri to me. I was expecting a lot better.

pier25

yeah the voices sound terrible

I'm guessing their spectral generator is super low res to save on resources

stavros

Is there a way to pay for higher quality? I don't see a way to pay at all, this just works without an API key, even with the generated code. I agree though, these voices sound like their buffer is always underrunning.

null

[deleted]

minimaxir

This is an official OpenAI tool linked from the new model announcement (https://openai.com/index/introducing-our-next-generation-aud... ), despite the branding difference.

gherard5555

I tried some wacky strings like "𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯NNNNNNNNNNNNNN𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯"

Its hilarious either they start to make harsh noise or say nonsense trying so sing something

gherard5555

Also this one is terrifying if combined with the fitness instructor :

"*scream* AAAAAAAAAAAAAAAAAAAHHHHHHHHHHHHHHHAAAAAAAAAAAAAAAAAAAHHHHHHHHHHHHHHHAAAAAAAAAAAAAAAAAAAHHHHHHHHHHHHHHHAAAAAAAAAAAAAAAAAAAHHHHHHHHHHHHHHHAAAAAAAAAAAAAAAAAAAHHHHHHHHHHHHHHHAAAAAAAAAAAAAAAAAAAHHHHHHHHHHHHHHHAAAAAAAAAAAAAAAAAAAHHHHHHHHHHHHHHHAAAAAAAAAAAAAAAAAAAHHHHHHHHHHHHHHH !!!!!!!!!"

corobo

All these voices are too good these days. I want my home assistant to sound like Auto from Wall-E, dammit!

Anyone out there doing any nice robotic robot voices?

Best I've got so far is a blend of Ralph and Zarvox from MacOS' `say`, haha

  say -v zarvox -r 180 "[[volm 0.8]] ${message}" &
  say -v ralph -r 180 "${message}"

ranguna

You could apply a robotic filter on top of these voices.

HN

OpenAI Audio Models

OpenAI Audio Models