Show HN: Kitten TTS – 25MB CPU-Only, Open-Source TTS Model

256 comments

·August 6, 2025

Kitten TTS is an open-source series of tiny and expressive text-to-speech models for on-device applications. We are excited to launch a preview of our smallest model, which is less than 25 MB. This model has 15M parameters.

This release supports English text-to-speech applications in eight voices: four male and four female. The model is quantized to int8 + fp16, and it uses onnx for runtime. The model is designed to run literally anywhere eg. raspberry pi, low-end smartphones, wearables, browsers etc. No GPU required!

We're releasing this to give early users a sense of the latency and voices that will be available in our next release (hopefully next week). We'd love your feedback! Just FYI, this model is an early checkpoint trained on less than 10% of our total data.

We started working on this because existing expressive OSS models require big GPUs to run them on-device and the cloud alternatives are too expensive for high frequency use. We think there's a need for frontier open-source models that are tiny enough to run on edge devices!

Visit

MutedEstate45

The headline feature isn’t the 25 MB footprint alone. It’s that KittenTTS is Apache-2.0. That combo means you can embed a fully offline voice in Pi Zero-class hardware or even battery-powered toys without worrying about GPUs, cloud calls, or restrictive licenses. In one stroke it turns voice everywhere from a hardware/licensing problem into a packaging problem. Quality tweaks can come later; unlocking that deployment tier is the real game-changer.

woadwarrior01

> It’s that KittenTTS is Apache-2.0

Have you seen the code[1] in the repo? It uses phonemizer[2] which is GPL-3.0 licensed. In its current state, it's effectively GPL licensed.

[1]: https://github.com/KittenML/KittenTTS/blob/main/kittentts/on...

[2]: https://github.com/bootphon/phonemizer

Edit: It looks like I replied to an LLM generated comment.

oezi

The issue is even bigger: phonemizer is using espeak-ng, which isn't very good at turning graphemes into phonemes. In other TTS which rely on phonemes (e.g. Zonos) it turned out to be one of the key issues which cause bad generations.

And it isn't something you can fix, because the model was trained on bad phonemes (everyone uses Whisper + then phonemizes the text transcript).

jacereda

https://github.com/KittenML/KittenTTS/issues/17

dspillett

> IANAL, but AFAICS this leaves 2 options, switching the license or removing that dependency.

There is a third option: asking the project for an exception.

Though that is unlikely to be granted¹ leaving you back with just the other two options.

And of course a forth choice: just ignore the license. This is the option taken by companies like Onyx, whose products I might otherwise be interested in…

----

[1] Those of us who pick GPL3 or AGPL generally do so to keep things definite and an exception would muddy the waters, also it might not even be possible if the project has many maintainers as relicensing would require agreement from all who have provided code that is in the current release. Furthermore, if it has inherited the license from one of its dependencies, an exception is even less practical.

ape4

Once the license issues are resolved it would nice if you could install it on a distro with the normal package manager.

gorgoiler

This would only apply if they were distributing the GPL licensed code alongside their own code.

If my MIT-licensed one-line Python library has this line of code…

  run([“bash”, “-c”, “echo hello”])

…I’m not suddenly subject to bash’s licensing. For anyone wanting to run my stuff though, they’re going to need to make sure they themselves have bash installed.

(But, to argue against my own point, if an OS vendor ships my library alongside a copy of bash, do they have to now relicense my library as GPL?)

ApolloFortyNine

The FSF thinks it counts as a derivative work and you have to use the LGPL to allow linking.

However, this has never actually been proven in court, and there's many good arguments that linking doesn't count as a derivative work.

Old post by a lawyer someone else found (version 3 wouldn't affect this) [1]

For me personally I don't really understand how, if dynamic linking was viral, using linux to run code isn't viral. Surely at some level what linux does to run your code calls GPLed code.

It doesn't really matter though, since the FSF stance is enough to scare companies from not using it, and any individual is highly unlikely to be sued.

[1] https://www.linuxjournal.com/article/6366

r4indeer

> This would only apply if they were distributing the GPL licensed code alongside their own code.

As far as I understand the FSF's interpretation of their license, that's not true. Even if you only dynamically link to GPL-licensed code, you create a combined work which has to be licensed, as a whole, under the GPL.

I don't believe that this extends to calling an external program via its CLI, but that's not what the code in question seems to be doing.

(This is not an endorsement, but merely my understanding on how the GPL is supposed to work.)

woadwarrior01

This is a false analogy. It's quite straightforward.

Running bash (via exec()/fork()/spawn()/etc) isn't the same as (statically or dynamically) linking with its codebase. If your MIT-licensed one-liner links to code that's GPL licensed, then it gets infected by the GPL license.

calvinmorrison

GPL is for boomers at this point. Floppy disks? Distribution? You can use a tool but you cant change it? A DLL call means you need to redistribute your code but forking doesn't?

Sillyness

keyKeeper

Okay, what's stopping you from feeding the code into an LLM and re-write it and make it yours? You can even add extra steps like make it analyze the code block by block then supervise it as it is rewriting it. Bam. AI age IP freedom.

Morals may stop you but other than that? IMHO all open source code is public domain code if anyone is willing to spend some AI tokens.

Twirrim

That would be a derivative work, and still be subject to the license terms and conditions, at best.

There are standard ways to approach this called clean room engineering.

https://en.m.wikipedia.org/wiki/Clean-room_design

One person reads the code and produces a detailed technical specification. Someone reviews it to ensure that there is nothing in there that could be classified as copyrighted material, then a third person (who has never seen the original code) implements the spec.

You could use an LLM at both stages, but you'd have to be able to prove that the LLM that does the implementation had no prior knowledge of the code in question... Which given how LLMs have been trained seems to me to be very dubious territory for now until that legal situation gets resolved.

K0balt

AI is useful in Chinese walling code, but it’s not as easy as you make it sound. To stay out of legal trouble, you probably should refactor the code into a different language, then back into the target language. In the end, it turns into a process of being forced to understand the codebase and supervising its rewriting. I’ve translated libraries into another language using LLMs, I’d say that process was 1/2 the labor of writing it myself. So in the end, going 2 ways, you may as well rewrite the code yourself… but working with the LLM will make you familiar with the subject matter so you -could- rewrite the code, so I guess you could think of it as a sort of buggy tutorial process?

woadwarrior01

Tell me you haven't used LLMs on large, non-trivial codebases without telling me... :)

defanor

A Festival's English model, festvox-kallpc16k, is about 6 MB, and it is a large model; festvox-kallpc8k is about 3.5 MB.

eSpeak NG's data files take about 12 MB (multi-lingual).

I guess this one may generate more natural-sounding speech, but older or lower-end computers were capable of decent speech synthesis previously as well.

Joel_Mckay

Custom voices could be added, but the speed was more important to some users.

$ ls -lh /usr/bin/flite

Listed as 27K last I checked.

I recall some Blind users were able to decode Gordon 8-bit dialogue at speeds most people found incomprehensible. =3

pjc50

> KittenTTS is Apache-2.0

What about the training data? Is everyone 100% confident that models are not a derived work of the training inputs now, even if they can reproduce input exactly?

phh

It depends on espeak-ng which is GPLv3

rohan_joshi

yeah, we are super excited to build tiny ai models that are super high quality. local voice interfaces are inevitable and we want to power those in the future. btw, this model is just a preview, and the full release next week will be of much higher quality, along w another ~80M model ;)

entropie

I play around with a nvidia jetson orin nano super right now and its actually pretty usuable with gemma3:4b and quite fast - even image processing is done in like 10-20 seconds but this is with GPU support. When something is not working and ollama is not using the GPU this calls take ages because the cpu is just bad.

Iam curious how fast this is with CPU only.

null

[deleted]

antisol

  System Requirements
  Works literally everywhere

Haha, on one of my machines my python version is too old, and the package/dependencies don't want to install.

On another machie the python version is too new, and the package/dependencies don't want to install.

akx

I opened a couple of PRs to fix this situation:

https://github.com/KittenML/KittenTTS/pull/21 https://github.com/KittenML/KittenTTS/pull/24 https://github.com/KittenML/KittenTTS/pull/25

If you have `uv` installed, you can try my merged ref that has all of these PRs (and #22, a fix for short generation being trimmed unnecessarily) with

    uvx --from git+https://github.com/akx/KittenTTS.git@pr-21-22-24-25 kittentts --output output.wav --text "This high quality TTS model works without a GPU"

tetris11

Thanks for the quick intro into UV, it looks like docker layers for python

I found the TTS a bit slow so I piped the output into ffplay with 1.2x speedup to make it sound a bit better

   uvx --from git+https://github.com/akx/KittenTTS.git@pr-21-22-24-25 kittentts --text "I serve 12 different beers at my restaurant for over 1000000 customers" --voice expr-voice-3-m --output - | ffplay -af "atempo=1.2" -f wav -

VagabundoP

Install it with uvx that should solve the python issues.

https://docs.astral.sh/uv/guides/tools/

uv installation:

https://docs.astral.sh/uv/getting-started/installation/

miellaby

You're supposed to use venv for everything but the python scripts distributed with your os

xena

It doesn't work on Fedora because of the lack of g++ having the right version.

hahn-kev

Python man

baobun

    man python

There you go.

wizzwizz4

  PYTHON(1)                   General Commands Manual                  PYTHON(1)
  
  NAME
       python - an object-oriented programming language

  SYNOPSIS
       python [ -c command | script | - ] [ arguments ]
  
  DESCRIPTION
       Python is the standard programming language.

Computer scientists love Python, not just because whitespace comes first ASCIIbetically, but because it's the standard. Everyone else loves Python because it's PYTHON!

IshKebab

Yeah some people have a problem and think "I'll use Python". Now they have like fifty problems.

sigmoid10

There are still people who use machine wide python installs instead of environments? Python dependency hell was already bad years ago, but today it's completely impractical to do it this way. Even on raspberries.

lynx97

Debian pretty much "solved" this by making pip refuse to install packages if you are not in an venv.

gm678

It needed distro buy in and implementation, but this is from the Python side: https://peps.python.org/pep-0668/

auscompgeek

IIRC that's actually a change in upstream pip.

ChickeNES

Ditto OpenSUSE, at least on Tumbleweed

superkuh

Yep. Python stopped being Python a decade ago. Now there are just innumberable Pythons. Perl... on the otherhand, you can still run any perl script from any time on any system perl interpreter and it works! Granted, perl is unpopular and not getting constant new features re: hardcore math/computation libs.

Anyway, I think I'll stick with Festival 1.96 for TTS. It's super fast even on my core2duo and I have exactly zero chance of getting this Python 3'ish script to run on any machine with an OS older than a handful of years.

m-s-y

It breaks my heart that Perl fell out of favor. Perl “6” didn’t help in the slightest.

dzogchen

Such an ignorant thing to say for something that requires 25MB RAM.

Bilal_io

Not sure what the size has to do with anything.

I send you a 500kb Windows .exe file and claim it runs literally everywhere.

Would it be ignorant to say anything against it because of its size?

mlboss

Reddit post with generated audio sample: https://www.reddit.com/r/LocalLLaMA/comments/1mhyzp7/kitten_...

Aachen

Impressive technical achievement, but in terms of whether I'd use it: oof, that male voice is like one of these fake-excited newsreaders. Like they're always at the edge of their breath. The female one is better but still someone reading out an advertisement for a product they were told they must act extra excited for. I assume this is what the majority of training data was like and not an intentional setting for the demo. Unsure whether I could get used to that

I use TTS on my phone regularly and recently also tried this new project on F-Droid called SherpaTTS, which grabs some models from Huggingface. They're super heavy (the phone suspends other apps to disk while this runs) and sound good, but in the first news article there were already one or two mispronunciations because it's guessing how to say uncommon or new words and it's not based on logical rules anymore to turn text into speech

Google and Samsung have each a TTS engine pre-installed on my device and those sound and work fine. A tad monotonous but it seems to always pronounce things the same way so you can always work out what the text said

Espeak (or -ng) is the absolute worst, but after 30 seconds of listening closely you get used to it and can understand everything fine. I don't know if it's the best open source option (probably there are others that I should be trying) but it's at least the most reliable where you'll always get what is happening and you can install it on any device without licensing issues

willwade

anyone else wants to try sherpaOnnx you can try this.. https://github.com/willwade/tts-wrapper we recently added in the kokoro models which should sound a lot better. There are a LOT of models to choose from. I have a feeling the Droid app isnt handling cold starts very well.

smusamashah

The reddit video is awesome. I don't understand how people are calling it an OK model. Under 25MB and cpu only for this quality is amazing.

Retr0id

The people calling it "OK" probably tried it for themselves. Whatever model is being demoed in that video is not the same as the 25MB model they released.

Zardoz84

Sounds very clear. For a non native english speaker like me, it's easy to understand.

KaiserPro

was it cross trained on futurama voices?

junon

That would be a feature!

tapper

Sounds slow and like something from an anine

ricardobeat

Speech speed is always a tunable parameter and not something intrinsic to the model.

The comparison to make is expressiveness and correct intonation for long sentences vs something like espeak. It actually sounds amazing for the size. The closest thing is probably KokoroTTS at 82M params and ~300MB.

dvh

I think he meant overacting typical for English dubs.

numpad0

The only real questions are which Chinese gacha game they ripped data from and whether they used Claude Code or Gemini CLI for Python code. I bet one can get a formant match from output this much overfit to whatever data. This isn't going to stay up for long.

null

[deleted]

nine_k

I hope this is the future. Offline, small ML models, running inference on ubiquitous, inexpensive hardware. Models that are easy to integrate into other things, into devices and apps, and even to drive from other models maybe.

theshrike79

This is what Apple is envisioning with their SLMs, like having a model specifically for managing calendar events. It doesn't need to have the full knowledge of all humanity in it - just what it needs to manage the calendar.

WhyNotHugo

Dedicated single-purpose hardware with models would be even less energy-intensive. It's theoretically possible to design chips which run neural networks and alike using just resistors (rather than transistors).

Such hardware is not general-purpose, and upgrading the model would not be possible, but there's plenty of use-cases where this is reasonable.

amelius

But resistors are, even in theory, heat dissipating devices. Unlike transistors, which can in theory be perfectly on or off (in both cases not dissipating heat).

divamgupta

That is our vision too!

rohan_joshi

yeah totally. the quality of these tiny models are only going to go up.

blopker

Web version: https://clowerweb.github.io/kitten-tts-web-demo/

It sounds ok, but impressive for the size.

nine_k

Does anybody find it funny that sci-fi movies have to heavily distort "robot voices" to make them sound "convincingly robotic"? A robotic, explicitly non-natural voice would be perfectly acceptable, and even desirable, in many situations. I don't expect a smart toaster to talk like a BBC host; it'd be enough is the speech if easy to recognize.

mfro

In the Culture novels, Iain Banks imagines that we would become uncomfortable with the uncanny realism of transmitted voices / holograms, and intentionally include some level of distortion to indicate you're speaking to an image

userbinator

A robotic, explicitly non-natural voice would be perfectly acceptable, and even desirable, in many situations[...]it'd be enough is the speech if easy to recognize.

We've had formant synths for several decades, and they're perfectly understandable and require a tiny amount of computing power, but people tend not to want to listen to them:

https://en.wikipedia.org/wiki/Software_Automatic_Mouth

https://simulationcorner.net/index.php?page=sam (try it yourself to hear what it sounds like)

miki123211

SAM and the way it works is not what people typically associate with the term "formant synthesizer."

DECtalk[1,2] would be a much better example, that's as formant as you get.

[1] https://en.wikipedia.org/wiki/DECtalk [2] https://webspeak.terminal.ink

saretup

Well, this one is a bit too jarring to the ears.

tapper

Yeah blind people love eloquence

roywiggins

This one is at least an interesting idea: https://genderlessvoice.com/

dang

Meet Q, a Genderless Voice - https://news.ycombinator.com/item?id=19505835 - March 2019 (235 comments)

cosmojg

The voice sounds great! I find it quite aesthetically pleasing, but it's far from genderless.

degamad

Interesting concept, but why is that site filled with Top X blogspam?

cyberax

It doesn't sound genderless.

pbronez

Huh. Sounds perfectly intelligible and definitively artificial. Feels weakly feminine to me, but only because I was primed to think about gender from the branding.

It’s a good choice for a robot voice. It’s easier to understand than the formant synths or deliberately distorted human voices. The genderless aspect is alien enough to avoid the uncanny valley. You intuitively know you’re dealing with something a little different.

incone123

Depends on the movie. Ash and Bishop in the Alien franchise sound human until there's a dramatic reason to sound more 'robotic'.

I agree with your wider point. I use Google TTS with Moon+Reader all the time (I tried audio books read by real humans but I prefer the consistency of TTS)

regularfry

Slightly different there because it's important in both cases that Ripley (and we) can't tell they're androids until it's explicitly uncovered. The whole point is that they're not presented as artificial. Same in Blade Runner: "more human than human". You don't have a film without the ambiguity there.

looperhacks

I remember that the novelization of the fifth element describes that the cops are taught to speak as robotic as possible when using speakers for some reason. Always found the idea weird that someone would _want_ that

Twirrim

> I don't expect a smart toaster to talk like a BBC host;

Well sure, the BBC have already established that it's supposed to sound like a brit doing an impersonation of an American: https://www.youtube.com/watch?v=LRq_SAuQDec

addandsubtract

If you're on a Mac, you can type "say [thing to say]" into your terminal.

msgodel

I personally prefer the older synthetic voices for TTS when the text is coming from software or a language model.

bkyan

I got an error when I tried the demo with 6 sentences, but it worked great when I reduced the text to 3 sentences. Is the length limit due to the model or just a limitation for the demo?

divamgupta

Currently we don't have chunking enabled yet. We will add it soon. That will remove the length limitations.

null

[deleted]

cess11

Perhaps a length limit? I tried this:

"This first Book proposes, first in brief, the whole Subject, Mans disobedience, and the loss thereupon of Paradise wherein he was plac't: Then touches the prime cause of his fall, the Serpent, or rather Satan in the Serpent; who revolting from God, and drawing to his side many Legions of Angels, was by the command of God driven out of Heaven with all his Crew into the great Deep."

It takes a while until it starts generating sound on my i7 cores but it kind of works.

This also works:

"blah. bleh. blih. bloh. blyh. bluh."

So I don't think it's a limit on punctuation. Voice quality is quite bad though, not as far from the old school C64 SAM (https://discordier.github.io/sam/) of the eighties as I expected.

Retr0id

I tried to replicate their demo text but it doesn't sound as good for some reason.

If anyone else wants to try:

> Kitten TTS is an open-source series of tiny and expressive text-to-speech models for on-device applications. Our smallest model is less than 25 megabytes.

cortesoft

Is the demo using the not smallest model?

Retr0id

Perhaps, but the 25MB model is the only thing they've released

quantummagic

Doesn't work here. Backend module returns 404 :

https://clowerweb.github.io/node_modules/onnxruntime-web/dis...

Retr0id

Looks like this commit 15 minutes ago broke it https://github.com/clowerweb/kitten-tts-web-demo/commit/6b5c...

(seems reverted now)

nxnsxnbx

Thanks, I was looking for that. While the reddit demo sounds ok, even though on a level we reached a couple of years ago, all TTS samples I tried were barley understandable at all

divamgupta

This is just an early checkpoint. We hope that the quality will improve in the future.

itake

> Error generating speech: failed to call OrtRun(). ERROR_CODE: 2, ERROR_MESSAGE: Non-zero status code returned while running Expand node. Name:'/bert/Expand' Status Message: invalid expand shape

Doesn't seem to work with thai.

jainilprajapati

You can also try on https://clowerweb.github.io/node_modules/onnxruntime-web/dis...

Aardwolf

On PC it's a python dependency hell but someone managed to package it in self contained JS code that works offline once it loaded the model? How is that done?

a2128

ONNXRuntime makes it fairly easy, you just need to provide a path to the ONNX file, give it inputs in the correct format, and use the outputs. The ONNXRuntime library handles the rest. You can see this in the main.js file: https://github.com/clowerweb/kitten-tts-web-demo/blob/main/m...

Plus, Python software are dependency hell in general, while webpages have to be self-contained by their nature (thank god we no longer have Silverlight and Java applets...)

rohan_joshi

yeah, this is just a preview model from an early checkpoint. the full model release will be next week which includes a 15M model and an 80M model, both of which will have much higher quality than this preview.

klipklop

I tried it. Not bad for the size (of the model) and speed. Once you install all the massive number of libraries and things needed we are a far cry away from 25MB though. Cool project nonetheless.

devnen

That's a great point about the dependencies.

To make the setup easier and add a few features people are asking for here (like GPU support and long text handling), I built a self-hosted server for this model: https://github.com/devnen/Kitten-TTS-Server

The goal was a setup that "just works" using a standard Python virtual environment to avoid dependency conflicts.

The setup is just the standard git clone, pip install in a venv, and python server.py.

k4rnaj1k

[dead]

Dayshine

It mentions ONNX, so I imagine an ONNX model is or will be available.

ONNX runtime is a single library, with C#'s package being ~115MB compressed.

Not tiny, but usually only a few lines to actually run and only a single dependency.

divamgupta

We will try to get rid of dependencies.

wongarsu

The repository already runs an ONNX model. But the onnx model doesn't get English text as input, it gets tokenized phonemes. The prepocessing for that is where most of the dependencies come from.

Which is completely reasonable imho, but obviously comes with tradeoffs.

pbronez

For space sensitive applications like embedded systems, could you shift the preprocessing to compile time?

You would need to constrain the vocabulary to see any benefits, but that could be reasonable. For example, you an enumeration of numbers, units and metric names could handle dynamic time, temperature and other dashboard items.

For something more complex like offline navigation, you already need to store a map. You could store street names as tokens instead of text. Add a few turn commands, and you have offline spoken directions without on device pre-processing.

WhyNotHugo

Usually pulling in lots of libraries helps develop/iterate faster. Then can be removed later once the whole thing starts to take shape.

zelphirkalt

This case might be different, but ... usually that "later" never happens.

keyle

I don't mind so much the size in MB, the fact that it's pure CPU and the quality, what I do mind however is the latency. I hope it's fast.

Aside: Are there any models for understanding voice to text, fully offline, without training?

I will be very impressed when we will be able to have a conversation with an AI at a natural rate and not "probe, space, response"

Dayshine

Nvidia's parakeet https://huggingface.co/nvidia/parakeet-tdt-0.6b-v2 appears to be state of the art for english: 10x faster than Whisper.

My mid-range AMD CPU is multiple times faster than realtime with parakeet.

jiehong

Voice to text fully offline can be done with whisper. A few apps offer it for dictation or transcription.

blensor

"The brown fox jumps over the lazy dog.."

Average duration per generation: 1.28 seconds

Characters processed per second: 30.35

"Um"

Average duration per generation: 0.22 seconds

Characters processed per second: 9.23

"The brown fox jumps over the lazy dog.. The brown fox jumps over the lazy dog.."

Average duration per generation: 2.25 seconds

Characters processed per second: 35.04

processor : 0

vendor_id : AuthenticAMD

cpu family : 25

model : 80

model name : AMD Ryzen 7 5800H with Radeon Graphics

stepping : 0

microcode : 0xa50000c

cpu MHz : 1397.397

cache size : 512 KB

keyle

assuming most answers will be more than a sentence, 2.25 seconds is already long enough if you factor the token generation in between... and imagine with reasoning!... We're not there yet.

moffkalast

Hmm that actually seems extremely slow, Piper can crank out a sentence almost instantly on a Pi 4 which is a like a sloth compared to that Ryzen and the speech quality seems about the same at first glance.

I suppose it would make sense if you want to include it on top of an LLM that's already occupying most of a GPU and this could run in the limited VRAM that's left.

colechristensen

>Aside: Are there any models for understanding voice to text, fully offline, without training?

OpenAI's whisper is a few years old and pretty solid.

https://github.com/openai/whisper

Teever

Any idea what factors play into latency in TTS models?

divamgupta

Mostly model size, and input size. Some models which use attention are O(N^2)

dr_kiszonka

Microsoft's and some of Google's TTS models make the simplest mistakes. For instance, they sometimes read "i.e." as "for example." This is a problem if you have low vision and use TTS for, say, proofreading your emails.

Why does it happen? I'm genuinely curious.

3rd3

You probably mean "e.g." as "for example", not "i.e."?

This might be on purpose and part of the training data because "for example" just sounds much better than "e.g.". Presumably for most purposes, linguistic naturalness is more important than fidelity.

layer8

Sometimes I use “for example” and “e.g.” in consecutive sentences to not sound repetitive, or possibly even within the same sentence (e.g. in parentheses). In that case, speaking both as “for example” would degrade it linguistically.

In any case, I’d like TTS to not take that kind of artistic freedom.

Retr0id

They're often trained from video subtitles, and humans writing subtitles make that kind of mistake too.

lynx97

Well, speech synthesizers are pretty much famous for speaking all sorts of things wrong. But what I find very concerning about LLM based TTS is that some of them cant really speak numbers greater then 100. They try, but fail a lot. At least tts-1-hd was pretty much doing this for almost every 3 or 4 digit number. Especially noticeable when it is supposed to read a year number.

jpc0

Not entirely related but humans have the same problem.

For scriptwriting when doing voice overs we always explicitly write out everything. So instead of 1 000 000 we would write one million or a million. This is a trivial example but if the number was 1 548 736 you will almost never be able to just read that off. However one million, five hundred and forty eight thousand, seven hundred and thirty six can just be read without parsing.

Same with urls, W W W dot Google dot com.

lynx97

Regarding humans, yes and no. If a human had constantly problems with 3 and 4 digit numbers like tts-1-hd does, I'd ask myself if they were neurodivergent in some way.

And yes, I added instructions along the lines of what you describe to my prompt. Its just sad that we have to. After all, LLM TTS has solved a bunch of real problems, like switching languages in a text, or foreign words. The pronounciation is better then anything we ever had. But it fails to read short numbers. I feel like that small issue could probably have been solved by doing some fine tuning. But I actually dont really understand the tech for it, so...

wongarsu

From the web demo this model is really good at numbers. It rushes through them, slurs them a bit together, but they are all correct, even 7 digit numbers (didn't test further).

Looks like they are sidestepping these kinds of issues by generating the phonemes with the preprocessing stage of traditional speech synthesizers, and using the LLM only to turn those phonemes into natural-ish sounding speech. That limits how natural the model can become, but it should be able to correctly pronounce anything the preprocessing can pronounce

rishav_sharan

Question for the experts here; What would be a SOTA TTS that can run on an average laptop (32GB RAM, 4GB VRAM). I just want to attach a TTS to my SLM output, and get the highest possible voice quality/ human resembleness.

kroaton

Try Unmute by Kyutai - https://unmute.sh/

wkat4242

Hmm the quality is not so impressive. I'm looking for a really naturally sounding model. Not very happy with piper/kokoro, XTTS was a bit complex to set up.

For STT whisper is really amazing. But I miss a good TTS. And I don't mind throwing GPU power at it. But anyway. this isn't it either, this sounds worse than kokoro.

kamranjon

The best open one I've found so far is Dia - https://github.com/nari-labs/dia - it has some limitations, but i think it's really impressive and I can run it on my laptop.

echelon

> Hmm the quality is not so impressive. [...] And I don't mind throwing GPU power at it.

This isn't for you, then. You should evaluate quality here based on the fact you don't need a GPU.

Back in the pre-Tacotron2 days, I was running slim TTS and vocoder models like GlowTTS and MelGAN on Digital Ocean droplets. No GPU to speak of. It cost next to nothing to run.

Since then, the trend has been to scale up. We need more models to scale down.

In the future we'll see small models living on-device. Embedded within toys and tools that don't need or want a network connection. Deployed with Raspberry Pi.

Edge AI will be huge for robotics, toys and consumer products, and gaming (ie. world models).

guskel

Chatterbox is also worth a try.

jainilprajapati

You should give try to https://pinokio.co/

kenarsa

Try https://github.com/Picovoice/orca

toisanji

Wow, amazing and good work, I hope to see more amazing models running on CPUs!

rohan_joshi

thanks, we're going to release many more models in the future, that can run on just CPUs.

thedangler

Elixir folks. How would I use this with Elixir? I'm new to Elixir and could use this in about 15 days.

bglusman

It looks like it's Python, so it might be possible to use via https://github.com/livebook-dev/pythonx ? But the parallel huggingface/bumblebee idea was also good, hadn't seen or thought of, that definitely works for a lot of other models, curious if you get working! Some chance I'll play with this myself in a few months, so feel free to report back here or DM me!

dorian-graph

It's not possible so far via Bumblebee, unfortunately[1].

[1] https://github.com/elixir-nx/bumblebee/issues/209