
Chatterbox TTS


184 comments · June 11, 2025

Mizza

Demos here: https://resemble-ai.github.io/chatterbox_demopage/ (not mine)

This is a good release if they're not too cherry picked!

I say this every time it comes up, and it's not as sexy to work on, but in my experiments voice AI is really held back by transcription, not TTS. Unless that's changed recently.

ianbicking

FWIW in my recent experience I've found LLMs are very good at reading through the transcription errors

(I've yet to experiment with giving the LLM alternate transcriptions or confidence levels, but I bet they could make good use of that too)

vunderba

Pairing speech recognition with an LLM acting as a post-processor is a pretty good approach.

I put together a script a while back which takes any audio file (wav, mp3, etc.), normalizes the audio, passes it to ggerganov's whisper for transcription, and then forwards the result to an LLM to clean up the text. I've used it with a pretty high rate of success on some of my very old and poorly recorded voice dictation recordings from over a decade ago.

Public gist in case anyone finds it useful:

https://gist.github.com/scpedicini/455409fe7656d3cca8959c123...
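For the curious, the shape of it is roughly this (a minimal sketch, not the gist's actual code; it assumes ffmpeg and whisper.cpp's `whisper-cli` binary are on your PATH, and the model path and cleanup prompt are placeholders):

```python
# Sketch: normalize -> transcribe (whisper.cpp) -> LLM cleanup.
import pathlib
import subprocess
import sys

from openai import OpenAI  # pip install openai

def transcribe_and_clean(audio_path: str) -> str:
    wav = pathlib.Path(audio_path).with_suffix(".norm.wav")
    # Loudness-normalize and resample to 16 kHz mono, which whisper.cpp expects.
    subprocess.run(["ffmpeg", "-y", "-i", audio_path, "-af", "loudnorm",
                    "-ar", "16000", "-ac", "1", str(wav)], check=True)
    # whisper.cpp writes <output>.txt when given -otxt.
    subprocess.run(["whisper-cli", "-m", "models/ggml-base.en.bin",
                    "-f", str(wav), "-otxt", "-of", str(wav)], check=True)
    raw = pathlib.Path(f"{wav}.txt").read_text()
    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "system",
                   "content": "Clean up this raw speech transcript: fix obvious "
                              "mistranscriptions and punctuation, change nothing else."},
                  {"role": "user", "content": raw}])
    return resp.choices[0].message.content

if __name__ == "__main__":
    print(transcribe_and_clean(sys.argv[1]))
```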

sovok

An LLM step also works pretty well for diarization. You get a transcript with speaker-segmentation (with whisper and pyannote for example), SPEAKER_01 says at some point „Hi I’m Bob. And here’s Alice“, SPEAKER_02 says „Hi Bob“ and now the LLM can infer that SPEAKER_01 = Bob and SPEAKER_02 = Alice.
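A minimal sketch of that last step (the prompt and model choice are my own assumptions, not a fixed recipe):

```python
# Sketch: let an LLM map diarized speaker labels to names from context.
from openai import OpenAI

transcript = """SPEAKER_01: Hi, I'm Bob. And here's Alice.
SPEAKER_02: Hi Bob, nice to meet you."""

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user",
               "content": "Infer each speaker's name from this diarized transcript "
                          "and rewrite it with names instead of labels. If a name "
                          "can't be inferred, keep the label.\n\n" + transcript}])
print(resp.choices[0].message.content)  # SPEAKER_01 -> Bob, SPEAKER_02 -> Alice
```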

Tokumei-no-hito

thanks for sharing. are some local models better than others? can small models work well or do you want 8B+?

mikepurvis

I was going to say, ideally you’d be able to funnel alternates to the LLM, because it would be vastly better equipped to judge what is a reasonable next word than a purely phonetic model.

ianbicking

If you just give it the transcript and tell the LLM it is a voice transcript with possible errors, it actually does a great job in most cases. I mostly have problems with mistranscriptions that say something entirely plausible but not at all what I said. Because the STT engine is trying to produce a semantically valid transcription, it often produces grammatically correct, semantically plausible, and incorrect transcriptions. These really foil the LLM.

Even if you can just mark the text as suspicious, I think in an interactive application this would give the LLM enough information to confirm what you were saying when a really critical piece of text is low confidence. The LLM doesn't just know what the most plausible words and phrases for the user to say are; it can also evaluate whether the overall gist is high or low confidence, and whether the resulting action is high or low risk.
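Something as crude as this would do for the marking step (a sketch assuming faster-whisper, which exposes per-word probabilities; the 0.5 threshold and bracket syntax are arbitrary):

```python
# Sketch: mark low-confidence words so the LLM knows where to distrust
# the transcript.
from faster_whisper import WhisperModel

model = WhisperModel("base.en")
segments, _ = model.transcribe("clip.wav", word_timestamps=True)

parts = []
for seg in segments:
    for w in seg.words:
        # Bracket anything the ASR itself was unsure about.
        parts.append(f"[?{w.word.strip()}?]" if w.probability < 0.5
                     else w.word.strip())
marked = " ".join(parts)
# Prompt idea: "Words in [?...?] are low-confidence; ask for confirmation
# before acting on anything that hinges on them."
print(marked)
```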

miki123211

This is actually something people used to do.

Old ASR systems (even models like wav2vec) were usually combined with a language model. It wasn't a large language model (those didn't exist at the time); it was usually something based on n-grams.
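The classic recipe is rescoring the ASR's n-best hypotheses with the LM. A toy sketch (all scores made up) of how the combination works:

```python
# Toy sketch of n-best rescoring: combine the acoustic model's score with
# a language model's score and pick the best hypothesis.

# (hypothesis, acoustic log-prob) pairs from the ASR's beam search
nbest = [("I scream for ice cream", -4.1),
         ("ice cream for ice cream", -3.9)]

bigram_logprob = {("i", "scream"): -3.0, ("scream", "for"): -2.0,
                  ("for", "ice"): -1.5, ("ice", "cream"): -0.5,
                  ("cream", "for"): -8.0}

def lm_score(sentence: str) -> float:
    words = sentence.lower().split()
    # Unseen bigrams get a heavy penalty.
    return sum(bigram_logprob.get(bg, -10.0) for bg in zip(words, words[1:]))

lam = 0.5  # LM weight
best = max(nbest, key=lambda h: h[1] + lam * lm_score(h[0]))
print(best[0])  # the LM pulls the decision toward the more plausible wording
```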

throwawaymaths

do you know if any current locally hostable public transcribers are good at diarization? for some tasks having even crude diarization would improve QOL by a huge factor. i was looking at a whisper diarization python package for a bit but it was a bitch to deploy.

iainmerrick

Deepgram does it.

pinter69

Right you are. I've used speechmatics, they do a decent jon with transcription

theyinwhy

1 error every 78 characters?

pinter69

The way to measure transcription accuracy is word error rate, not character error rate. I haven't really checked (or trusted) Speechmatics' accuracy benchmarks, but from my experience and personal impression it looks good; I haven't done a quantitative benchmark.
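For reference, WER is just word-level edit distance divided by the reference length; a minimal sketch:

```python
# Minimal word error rate: Levenshtein distance over words / reference length.
def wer(reference: str, hypothesis: str) -> float:
    r, h = reference.lower().split(), hypothesis.lower().split()
    # dp[i][j] = edits to turn first i reference words into first j hypothesis words
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            dp[i][j] = min(dp[i-1][j] + 1,                       # deletion
                           dp[i][j-1] + 1,                       # insertion
                           dp[i-1][j-1] + (r[i-1] != h[j-1]))    # substitution
    return dp[len(r)][len(h)] / len(r)

print(wer("they do a decent job", "they do a decent jon"))  # 0.2
```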

causal

Playing with the Huggingface demo, I'm guessing this page is a little cherry-picked? In particular I am not getting that kind of emotion in my responses.

backnotprop

It is hard to get consistent emotion with this. There are some parameters, and you can go a bit crazy, but it gets weird…

lvl155

Can’t you get around that with synthetic data?

echelon

I absolutely ADORE that this has swearing directly in the demo. And from Pulp Fiction, too!

> Any of you fucking pricks move and I'll execute every motherfucking last one of you.

I'm so tired of the boring old "Miss Daisy" demos.

People in the indie TTS community often use the Navy Seals copypasta [1, 2]. It's refreshing to see Resemble using swear words themselves.

They know how this will be used.

[1] https://en.wikipedia.org/wiki/Copypasta

[2] https://knowyourmeme.com/memes/navy-seal-copypasta

bschwindHN

Heh, I always type out the first sentence or two of the Navy Seal copypasta when trying out keyboards.

lukax

[flagged]

junon

You should really disclaim that you're affiliated.

https://news.ycombinator.com/item?id=41866830

travisvn

Chatterbox is fantastic.

I created an API wrapper that also makes installation easier (Dockerized as well) https://github.com/travisvn/chatterbox-tts-api/

Best voice cloning option available locally by far, in my experience.

mistersquid

> Chatterbox is fantastic.

> I created an API wrapper that also makes installation easier (Dockerized as well) https://github.com/travisvn/chatterbox-tts-api/

Gave your wrapper a try and, wow, I'm blown away by both Chatterbox TTS and your API wrapper.

Excuse the rudimentary level of what follows.

Was looking for a quick and dirty CLI incantation to specify a local text file instead of the inline `input` object, but couldn't figure it out.

Pointers much appreciated.

travisvn

This API wrapper was initially made to support a particular use case where someone's running, say, Open WebUI or AnythingLLM or some other local LLM frontend.

A lot of these frontends have an option for using OpenAI's TTS API, and some of them allow you to specify the URL for that endpoint, allowing for "drop-in replacements" like this project.

So the speech generation endpoint in the API is designed to fill that niche. However, its usage is pretty basic and there are curl statements in the README for testing your setup.

Anyway, to get to your actual question, let me see if I can whip something up. I'll edit this comment with the command if I can swing it.

In the meantime, can I assume your local text files are actual `.txt` files?
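Assuming they are, something along these lines should already work against the speech endpoint (untested sketch; the route and payload follow OpenAI's /v1/audio/speech convention, which this wrapper mirrors, so check the README for the actual port and fields):

```python
# Sketch: send a local .txt file to the wrapper's OpenAI-compatible
# speech endpoint and save the resulting audio.
import sys

import requests  # pip install requests

text = open(sys.argv[1], encoding="utf-8").read()
resp = requests.post(
    "http://localhost:4123/v1/audio/speech",  # port is a guess; see README
    json={"input": text},  # other OpenAI fields (model, voice) may apply
    timeout=600,
)
resp.raise_for_status()
open("out.wav", "wb").write(resp.content)
```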

mistersquid

This is way more of a response than I could have even hoped for. Thank you so much.

To answer your question, yes, my local text files are .txt files.

venusenvy47

Would this be usable on a PC without a GPU?

travisvn

It can definitely run on CPU — but I'm not sure if it can run on a machine without a GPU entirely.

To be honest, it uses a decently large amount of resources. If you had a GPU, you could expect about 4-5 GB of memory usage. And given the optimizations for tensors on GPUs, I'm not sure how well things would work "CPU only".

If you try it, let me know. There are some "CPU" Docker builds in the repo you could look at for guidance.

If you want free TTS without using local resources, you could try edge-tts https://github.com/travisvn/openai-edge-tts

Quarrel

Fun to play with.

It makes my Australian accent sound very English though, in a posh RP way.

Very natural sounding, but not at all recreating my accent.

Still, amazingly clear and perfect for most TTS uses where you aren't actually impersonating anyone.

echelon

Sadly they don't publish any training or fine tuning code, so this isn't "open" in the way that Flux or Stable Diffusion are "open".

If you want better "open" models, these all sound better for zero shot:

Zeroshot TTS: MaskGCT, MegaTTS3

Zeroshot VC: Seed-VC, MegaTTS3

Granted, only Seed-VC has training/fine tuning code, but all of these models sound better than Chatterbox. So if you're going to deal with something you can't fine tune and you need a better zero shot fit to your voice, use one of these models instead. (Especially ByteDance's MegaTTS3. ByteDance research runs circles around most TTS research teams except for ElevenLabs. They've got way more money and PhD researchers than the smaller labs, plus a copious amount of training data.)

xnx

Great tip. I hadn't heard of MegaTTS3.


cpill

But what's the inference speed like on these? Can you use them in a real-time interaction with an agent?

skatanski

How does it work from the privacy standpoint? Can they use recorded samples for training?

teraflop

> Every audio file generated by Chatterbox includes Resemble AI's Perth (Perceptual Threshold) Watermarker - imperceptible neural watermarks that survive MP3 compression, audio editing, and common manipulations while maintaining nearly 100% detection accuracy.

Am I misunderstanding, or can you trivially disable the watermark by simply commenting out the call to the apply_watermark function in tts.py? https://github.com/resemble-ai/chatterbox/blob/master/src/ch...

I thought the point of this sort of watermark was that it was embedded somehow in the model weights, so that it couldn't easily be separated out. If you're going to release an open-source model that adds a watermark as a separate post-processing step, then why bother with the watermark at all?

jchw

Possibly a sort of CYA gesture, kinda like how the original Stable Diffusion had a content filter IIRC. Could also just be to prevent people from accidentally getting peanut butter in the toothpaste WRT training data, too.

throw101010

Stable Diffusion, or rather Automatic1111 (initially the UI of choice for SD models), had a joke/fake "watermark" setting too, which deliberately did nothing besides poke fun at people who thought open source projects would really waste time developing something that could easily be stripped/reverted by virtue of being open source anyway.

vunderba

Yeah, there's even a flag to turn it off in the parser `--no-watermark`. I assumed they added it for downstream users pulling it in as a "feature" for their larger product.

echelon

1. Any non-OpenAI, non-Google, non-ElevenLabs player is going to have to aggressively open source or they'll become 100% irrelevant. The TTS market leaders are obvious and deeply entrenched, and Resemble, Play(HT), et al. have to aggressively cater to developers by offering up their weights [1].

2. This is CYA for that. Without watermarking, there will be cries from the media about abuse (from anti-AI outfits like 404Media [2] especially).

[1] This is the right way to do it. Offer source code and weights, offer their own API/fine tuning so developers don't have to deal with the hassle. That's how they win back some market share.

[2] https://www.404media.co/wikipedia-pauses-ai-generated-summar...

echelon

Nevermind, this is just ~3/10 open, or not really open at all [1]:

https://github.com/resemble-ai/chatterbox/issues/45#issuecom...

> For now, that means we’re not releasing the training code, and fine-tuning will be something we support through our paid API (https://app.resemble.ai). This helps us pay the bills and keep pushing out models that (hopefully) benefit everyone.

Big bummer here, Resemble. This is not at all open.

For everyone stumbling upon this, there are better "open weights" models than Resemble's Chatterbox TTS:

Zeroshot TTS: MaskGCT, MegaTTS3

Zeroshot VC: Seed-VC, MegaTTS3

These are really good robust models that score higher in openness.

Unfortunately only Seed-VC is fully open. But all of the above still beat Resemble's Chatterbox in zero shot MOS (we tested a lot), especially the mega-OP Chinese models.

(ByteDance slaps with all things AI. Their new secretive video model is better than Veo 3, if you haven't already seen it [2]!)

You can totally ignore this model masquerading as "open". Resemble isn't really being generous at all here, and this is some cheap wool over the eyes trickery. They know they retain all of the cards here, and really - if you're just going to use an API, why not just use ElevenLabs?

Shame on y'all, Resemble. This isn't "open" AI.

The Chinese are going to wipe the floor with TTS. ByteDance released their model in a more open manner than yours, and it sounds way better and generalizes to voices with higher speaker similarity.

Playing with open source is a path forward, but it has to be in good faith. Please do better.

[1] "10/10" open includes: 1. model code, 2. training code, 3. fine tuning code, 4. inference code, 5. raw training data, 6. processed training data, 7. weights, 8. license to outputs, 9. research paper, 10. patents. For something to be a good model, it should have 7/10 or above.

[2] https://artificialanalysis.ai/text-to-video/arena?tab=leader...

fastball

The weights are indeed open (both accessible and licensing-wise): you don't need to put that in scare quotes. Training code is not. You can fine-tune the weights yourself with your own training code. Saying that isn't open is like saying ffmpeg isn't open because it doesn't do everything I need it to do and I have to wrap it with my own code to achieve my goals.

tedip

Can't make everyone happy :)

gcr

not a single top-tier lab has a "10/10 open" model for any model type for any learning application since ResNet, it's not fair to shit on them solely for this

unstablediffusi

>Without watermarking, there will be cries from the media about abuse (from anti-AI outfits like 404Media [2] especially).

it is highly amusing that they still believe they can put that genie back in the bottle with their usual crybully bullshit.

nine_k

Some measures like that still sort of work. Try loading a scanned picture of a dollar bill into Photoshop. Try printing it on a color printer. Try printing anything on a color printer without the yellow tracking dots.

A lock need not be infinitely strong to be useful; it just needs to take more resources to crack than the locked thing is worth.

ineedasername

The emotional exaggeration is interesting, though I don't think I've come across anything quite so versatile and easy to "sculpt" as ElevenLabs and its ability to generate a voice from a description of how you want the voice to sound. SparkTTS allows some additional parameters, and its project on GitHub has placeholders in its code that indicate the model might be refined for more fine-grained emotional control. As it is, I've had some success with it and other models by trying to influence prosody and tonality with heavy-handed cues in the text, which can then be used with VC to get closer to the desired results, but it's a much more cumbersome process than Eleven.

pryelluw

Silly question: what's the lowest spec hardware this will run on?

thorum

This GitHub issue says 6-7 GB VRAM: https://github.com/resemble-ai/chatterbox/issues/44

But if the model is any good someone will probably find a way to optimize it to run on even less.

Edit: Got it running on an old Nvidia 2060, I'm seeing ~5 GB VRAM peak.

magicalhippo

Looking at the issues page, it seems it's not well optimized[1] currently.

So out of the box it seems quite beefy consumer hardware will be needed for it to perform reasonably. However it seems like there's significant potential for improvements, though I'm no expert.

[1]: https://github.com/resemble-ai/chatterbox/issues/127

01HNNWZ0MV43FF

I was going to report how it runs on an old CPU but after fussing with it for about 30 minutes, I can't even get it to run.

Listing the issues in case it helps anyone:

- It doesn't work with Python 3.13, luckily `uv` makes it easy to build a venv with 3.12

- It said numpy 1.26.4 doesn't exist. It definitely does, but `uv pip` was searching for it on the pytorch repo. I passed an `--index-strategy` flag so it would check other repos. This could just be a bug in uv, but when I see "numpy 1.26.4 doesn't exist" and numpy is currently on 2.x, my brain starts to cramp up.

- The `pip install chatterbox-tts` version has a bug in CPU-only mode, so I cloned the Git repo

- The version at the tip of main requires `protobuf-compiler` installed on Debian

- I got a weird CMake error that I can't decipher. I think maybe it's complaining that the Python dev headers are not installed. Why would they be, I'm trying to do inference, not compile Python...

I know anger isn't productive but this is my experience almost any time I'm running Somebody Else's Python Project. Hit an issue, back up, hit another issue, back up, after an hour it still doesn't run.

thorum

We’ll know AGI has arrived when it can figure out Python dependency conflicts

kevin_thibedeau

It'll just throw up its virtual hands and switch to something better after transpiling all the Python code in a fit.

blharr

Maybe this wasn't here when you looked at it, but try Python 3.11?

> We developed and tested Chatterbox on Python 3.11 on Debian 11 OS; the versions of the dependencies are pinned in pyproject.toml to ensure consistency.

keyle

It's not a silly question, it's the best question!

If something can be run for free but it's cheaper to rent, it voids the DIY aspect of it.

bityard

Not a silly question, I came here to ask too. Curious to know whether I need a GPU costing 4 digits or if it will run on my 12-year-old thinkpad shitbox. Or something in between.

nmstoker

I've found it excellent with really common accents but with other accents (that are pretty common too) it can easily get stuck picking a different accent. For instance several Scottish recordings ended up Australian, likewise a fairly mild Yorkshire accent

a_wild_dandan

I think this says more about Scottish than the model.

Quarrel

> For instance several Scottish recordings ended up Australian

Funnily enough, it made my Australian accent sound very English RP. I was suddenly very posh.

ltrg

I'm English (RP) and it gave me a Yorkshire accent and Scottish accent in turn.

m3sta

Like a professional actor!

abraxas

Are these things good enough to narrate a book convincingly or does the voice lose coherence after a few paragraphs being spoken?

vunderba

Most of these TTS systems tend to fall apart the longer the text gets - it's a good idea to just wrap any longform text into separate paragraph-segmented batches and then stitch them back together again at the end.

I've also found that if your one-shot sample wave isn't really clean that sometimes Chatterbox produces random unholy whooshing sounds at the end of the generated audio which is an added bonus if you're recording Dante's Inferno.
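The batching itself is simple enough; a sketch using the `generate()` API from the Chatterbox README (the paragraph splitting and 400 ms pause are my own choices):

```python
# Sketch: generate long text paragraph-by-paragraph and stitch the audio.
import torch
import torchaudio as ta
from chatterbox.tts import ChatterboxTTS

model = ChatterboxTTS.from_pretrained(device="cuda")
text = open("chapter1.txt", encoding="utf-8").read()

paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
pause = torch.zeros(1, int(0.4 * model.sr))  # 400 ms silence between paragraphs

chunks = []
for p in paragraphs:
    chunks.append(model.generate(p).cpu())  # each chunk is a (1, N) waveform
    chunks.append(pause)
ta.save("chapter1.wav", torch.cat(chunks, dim=1), model.sr)
```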

elektor

Yes, I've generated an audiobook of an epub using this tool and the result was passable: https://github.com/santinic/audiblez

venusenvy47

Regarding your example "On a Google Colab's T4 GPU via CUDA, it takes about 5 minutes to convert 'Animal Farm'", do you know the approximate cost to perform this? I've only used Colab at the free level, so I have no concept of the costs for GPU time.

raincole

Once it's good enough Audible will be flooded with AI-narrated books so we'll know soon. (The only question is whether Amazon would disclose it, ofc)

landl0rd

Flip side: a solution where I can auto-generate an audiobook for a book that doesn't have one (or use an existing ebook rather than paying Audible $30 for their version) and that's "good enough" is a legit improvement. AI-generated isn't as good, but it's better than nothing. Also, being able to interrupt and ask for more detail/context would be pretty nice. Like, I'm reading some Pynchon and I have to stop sometimes and look up the name of a reference to some product nobody knows now, stuff like that.

skygazer

If you're willing to forgo the interactive LLM bit, kokoro-tts (just a script using Kokoro-ONNX) takes epubs and outputs a series of wavs or mp3s that need to be stitched together into chapters or audiobook m4a with some ffmpeg fu. I've listened to several generated audiobooks, and found them pretty good. Some nice generic narration-like prosody. It uses espeak-ng to generate phonemes and passes those to the model to render voice, so it generally pronounces things quite well. It comes with a handful of nice voices and several can be blended, but no easy voice cloning, like chatterbox, that I'm aware of.

https://github.com/nazdridoy/kokoro-tts/blob/main/kokoro-tts
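The ffmpeg fu is mostly the concat demuxer; a sketch, driven from Python (filenames and re-encode settings are illustrative):

```python
# Sketch: stitch per-chapter mp3s into one m4a audiobook file using
# ffmpeg's concat demuxer (re-encoding to AAC since the containers differ).
import pathlib
import subprocess

files = sorted(pathlib.Path("chapters").glob("*.mp3"))
listing = "\n".join(f"file '{f.resolve()}'" for f in files)
pathlib.Path("list.txt").write_text(listing)

subprocess.run(["ffmpeg", "-y", "-f", "concat", "-safe", "0",
                "-i", "list.txt", "-c:a", "aac", "book.m4a"], check=True)
```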

russellbeattie

Audible has already flooded their store with generated audio books. Go to the "Plus Catalog" and it's filled with them. The quality at the moment is complete trash, but I can't imagine it won't get better quickly.

The whole audiobook business will eventually disappear - probably within the decade. There will only be ebooks and on-device AI assistants will read it to you on demand.

I imagine it'll go like this: First pre-generated audiobooks as audio files. Next, online service to generate audio on demand with hyper customizable voices which can be downloaded. Next, a new ebook format which embeds instructions for narration and pronunciation to be read on-device. Finally, AI that's good enough to read it like a storyteller instantly without hints.

satvikpendem

> There will only be ebooks and on-device AI assistants will read it to you on demand.

Honestly I read (or rather, listen to) a lot of books already by getting the epubs onto my phone then using a very basic TTS to read it out. Yes, they're definitely not as lifelike as even the most common AI TTS systems but they're good enough to listen to at high speed. Moon+ Reader is pretty good for Android, not sure about iOS.

fatesblind

It's watermarked.

mianos

It's open source, and it's not in the model. The watermark function is added as a post-processing step to show you how to use it. You can just remove it:

```
watermarked_wav = self.watermarker.apply_watermark(...
```

pinter69

I consult for a company in the space (not Resemble) and I can definitely say it can narrate a book.

wsintra2022

A year ago, for fun, I gave a friend a Carl Rogers therapy audiobook with an Attenborough-esque reading, and it was pretty good even then, so it should be better now.

philipkiely

Example implementation with sample inference code + voice cloning example:

https://github.com/basetenlabs/truss-examples/tree/main/chat...

Still working on streaming

tevon

I just tested it out locally; really excellent quality, and the server was easy to set up and well documented.

I'd love to get real-time generation, if that's in the pipeline. I'd like to use it along with Home Assistant.

audiala

What is the current state of the art for open source multilingual TTS? I have found Kokoro to be great for English as well, but am still searching for a good solution for French, Japanese, German...

barrell

I’ve also been looking for this. OpenVoice2 supports a few languages (5 IIRC), but I haven’t seen anything usable yet

ojw0816

Looks good! What is the difference between the open-source version and the paid version?