Generate audiobooks from E-books with Kokoro-82M

laserbeam

On the one hand, this is very convenient. Probably cool for some non-fiction.

On the other, some of my favorite audio books all stood out because the narrator was interpreting the text really well, for example by changing the pacing during chaotic moments. Or those audiobooks with multiple narrators and different voices for each character. Not to mention that sometimes the only cue you get for who's speaking during dialogue is how the voice actor changes their tone. I have mixed feelings about using this and losing some of that quality.

I would totally use this over amateur ebooks or public domain audiobooks like the ones on Project Gutenberg. As cool as it is/was for someone to contribute to free books... as a listener it was always jarring to switch to a new chapter and hear a completely different voice and microphone quality for no reason.

stavros

> On the other, some of my favorite audio books all stood out because the narrator was interpreting the text really well

This (and everything else with AI) isn't saying "you don't need good actors any more". It's saying "if you don't have an audiobook, you can make a mediocre one automatically".

AI (text, images, videos, whatever) doesn't replace the top end, it replaces the entire bottom-to-middle end.

j4coh

RIP to future top-enders that would normally have started out on the bottom to middle end.

sam_lowry_

> RIP to future top-enders that would normally have started out on the bottom to middle end.

This stance always reminds me of Profession, a 1957 novella by Isaac Asimov that depicts pretty much that future, where there are only top performers and the ignorant crowd.

aredox

Bingo. AI is going to destroy any pathway for training and accruing experience.

An embalming tech for our dying civilization.

gosub100

I'm super opposed to AI, but I see this as a rare positive. As someone already said, the win here is to have an audiobook where one doesn't yet exist. Hell, maybe the tables will turn and the scrubs will do the hard work of discovering which titles are popular with an audience, then the ebook industry can capitalize on AI by hiring voice actors to produce proper titles?

credit_guy

By that time, AI will beat the toppest of the top enders. Remember the time Deep Blue barely beat Kasparov? Now no human, or group of humans can beat a chess engine, even one that runs on an iPhone.

Der_Einzige

Not RIP at all. "Meritocracy" was coined in a book literally warning us about how terrible such a society would be: https://en.wikipedia.org/wiki/The_Rise_of_the_Meritocracy

The "top-enders" are the privileged who need to have some of their gains for their intelligence redistributed to others. The alternative is "survival of the smartest", which is de-facto what we have today and what Young was trying to warn us about.

numpad0

AI TTS has been available for quite some time. Tacotron V1 is about 8 years old. I don't think we saw much bottom end replacement.

IMGO (gut opinion), generative AI is a consumption aid, like a strong antacid. It lets us be done with $content quicker, for content = {book, art, noisy_email, coding_task}. There are obvious preconceptions forming among us all from the "generative" nomenclature, but lots of surviving usages are rather reductive in relevant useful manners.

sam_lowry_

Yeah, let us not blame AI. Audible damaged the quality of audiobooks more than AI.

whazor

A GenAI model that read audiobooks with such dramatisation is really my dream. There are so many books that I would want to listen to, but still lack such an adaptation. Also it takes months after the book release before the audiobook gets released.

Just imagine what this would do for writers. They can get instant feedback and adjust their book for the audiobook.

WillAdams

Yes, but if the alternative is not having a book, or having to listen to one poorly read (I love Librivox, but there are some books which I just haven't been able to finish because of readers, and many more which were nixed for family vacation travel listening on that account), this may be workable.

rd11235

I agree but the opposite can be true too. Sometimes the narrator seems to target some general audience that doesn’t fit me at all, in a way that makes me cringe when I listen, until I stop listening altogether. In these cases I’d rather listen to a relatively flat narration from a tool like this.

felixhummel

I wholeheartedly agree. https://en.m.wikipedia.org/wiki/Stephen_Briggs got me hooked on Terry Pratchett's Discworld series. I loved "Going Postal".

IndrekR

I know someone who listened to Terry Pratchett's "Wachen! Wachen!" audiobook on Spotify while living in Germany for a few years. It was so well narrated that he also acquired some peculiarities of local dialects used by specific characters in the book. Locals in Bavaria were quite surprised by a foreigner speaking such language.

dmazin

Absolutely.

Even on the non-fiction side, the narration for Gleick's The Information adds something.

While I want this tool for all the stuff with no narration, NYT/New Yorker/etc replacing human narrators with AI ones has been so shitty. The human narrators sound good, not just average. They add something. The AI narrators are simply bad.

ahoka

I guess this is still very useful if you are blind.

loktarogar

Yeah, for accessibility purposes on things that aren't already narrated, this kind of thing is huge.

em-bee

that's the thing. it's not just for accessibility. anything not already narrated is a fair target for TTS. i don't have time to sit down and read books. all reading is done on the go, while getting around or doing daily routines at home. i have a small book that i am reading now, which should take a few hours to finish, but in the time i manage to get done reading it i will probably have listened to two or three audio books.

oh, and it's also a boon for those who can't afford to buy audiobooks.

flir

I was just thinking about automatically slapping an mp3 on every blog post, just an accessibility nicety.

Can someone with low vision tell me if this would be useful to them? It may be that specialist tools already do this better.

micw

With this technology, one could produce high quality audio books without having access to high quality narrators by annotating the books with the voice, speed and such things.

I wonder if a standardized markup exists to do so.

albert_e

There is SSML for speech markup to indicate various characters of speech like whispers, pronunciation, pace, emphasis, etc.
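For anyone who hasn't seen it, here is a rough sketch of what such markup could look like when generated from code. It builds an SSML document that gives the narrator and a character separate voices and speaking rates. The `voice`, `prosody` and `break` elements come from the W3C SSML spec; the voice names and the `build_ssml` helper are made up for illustration, and which elements a given TTS engine actually honors varies a lot.

```python
import xml.etree.ElementTree as ET

# Hypothetical voice names; real ones depend on the TTS engine in use.
VOICES = {"narrator": "narrator_voice", "alice": "character_voice_1"}


def build_ssml(segments):
    """Build an SSML string from (speaker, text, rate) tuples.

    rate is an SSML prosody rate such as "slow", "medium" or "fast".
    """
    speak = ET.Element("speak", {
        "version": "1.1",
        "xmlns": "http://www.w3.org/2001/10/synthesis",
        "xml:lang": "en-US",
    })
    for speaker, text, rate in segments:
        # One <voice> per segment so each speaker can sound different.
        voice = ET.SubElement(speak, "voice", {"name": VOICES[speaker]})
        prosody = ET.SubElement(voice, "prosody", {"rate": rate})
        prosody.text = text  # ElementTree escapes &, < and > on output
        # Short pause between segments.
        ET.SubElement(speak, "break", {"time": "300ms"})
    return ET.tostring(speak, encoding="unicode")


if __name__ == "__main__":
    print(build_ssml([
        ("narrator", "She ran toward the gate.", "medium"),
        ("alice", "Wait for me!", "fast"),
    ]))
```

The hard part is not emitting the XML but deciding who is speaking and how, which is the step an LLM or a human editor would have to supply as the `segments` list.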

With LLMs proving to be very good at generating code, it may be reasonable to assume they can get good at generating SSML as well.

Not sure if there is a more direct way to channel the interpretation of the tone/context/emotion etc from prose into generated voice qualities.

If we train some models on ebooks along with their professionally produced human-narrated audiobooks, with enough variety and volume of training data, the models might capture the essence of that human-interpretation of written text? Just maybe?

Amazon with its huge collection of Audible + Kindle library -- if it can do this without violating any rights -- has a huge corpus for this. They already have "whispersync" which is a feature that syncs text in a kindle ebook with words in corresponding audible audiobook.

micw

Good points, thank you! I just tested it. While ChatGPT was very good at adding generic (textual) annotations, the results for generating SSML were very poor (lack of voice names, lack of distinction between narrator and character etc).

Probably the results with a model trained for this plus human audit could lead to very good results.

pegasus

They still wouldn't be high quality. It's just not possible to capture the precise tone of voice in an annotation, and that precision I believe really makes a difference. My experience is that the deeper the narrator understands the text and conveys that understanding, the easier it becomes for me to absorb that information.

vasco

Have you tried those "podcast from a paper" models? They do some of the things you are saying they don't, although it's not 100% it's also miles ahead of for example human Polish TV lectors, or other monotone style narrations.

KeplerBoy

Don't end to end trained models already do this to some extent? Like raising the pitch towards a question mark, like a human would.

TortoiseTTS has a few examples under prompt engineering on their demo site: https://nonint.com/static/tortoise_v2_examples.html

micw

That's a bit basic and random. Some models have the features you describe. From the better models you get a slightly different voice for text in quotes.

But the difference to good audio books is that you have

* different voices for the narrator and each character
* different emotions and/or speed in certain situations.

I guess you could use an LLM to "understand" and annotate an existing book if there's a markup and then use TTS to create an audio book from it and so automate most of the process.

woolion

If you look for a lot of the great classics, audiobook results are inundated with basic TTS "audiobooks" that are impossible to filter out. These are impossible to listen to because they lack the proper intonation marking the end of sentences, making it very tiring to parse. It might be better than tuna can sounding recordings, especially if you want to hear them in traffic (a common requirement), but that's about it. The alternative, if you want real quality recordings, is to stop reading classics and instead read the latest Japanime Isekai or murder mystery, these have very good options on the market. Anyway, I don't think it needs more justification that it covers a good niche usage.

I'm checking what the actual quality is (not a cherry-picked example), but:

Started at: 13:20:04
Total characters: 264,081
Total words: 41548
Reading chapter 1 (197,687 characters)...

That was 1h30 ago and there's no progress notification of any kind, so I'm hoping it will finish sometime. It's using 100% of all available CPUs so it's quite a bother. (this is "tale of a tub" by Swift, it's about half of a typical novel length)

csantini

Yeah, that's a known issue, if the book is all in a single chapter you don't get any sense of progress. I may fix that next weekend.
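One possible way to get progress out of a single huge chapter, sketched here without any knowledge of how the tool is actually structured: split the text into sentence-aligned chunks and report after each one. `split_into_chunks` and `synthesize_with_progress` are hypothetical helpers, and `synthesize` stands in for whatever function turns a piece of text into audio.

```python
import re


def split_into_chunks(text, max_chars=500):
    """Split text into chunks of at most max_chars, breaking at sentence ends."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # A single sentence longer than the limit is cut hard.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks


def synthesize_with_progress(text, synthesize, max_chars=500):
    """Run synthesize() on each chunk and print progress by characters done."""
    chunks = split_into_chunks(text, max_chars)
    total = sum(len(chunk) for chunk in chunks)
    done = 0
    audio_parts = []
    for index, chunk in enumerate(chunks, 1):
        audio_parts.append(synthesize(chunk))
        done += len(chunk)
        percent = 100 * done / total
        print(f"chunk {index}/{len(chunks)} ({percent:.0f}% of characters)")
    return audio_parts
```

Counting characters rather than chunks keeps the percentage roughly proportional to the remaining work, since synthesis time tends to scale with text length.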

swores

Can anyone recommend an open source option that would allow training on a custom voice (my own, so I'd be able to record as many snippets as it needed to train on) to allow me to use it for TTS generation without sharing it off my machine?

Edit: I'll wait to see if any recommendations get made here, if not I might give this one a go: https://github.com/coqui-ai/TTS

numpad0

I think you can probably generate TTS audio by classical means, and voice2voice that audio through RVC or Beatrice V2. Haven't looked into it in a while but Beatrice is apparently super fast and CPU only.

pprotas

I would love to have an e-reader that allows me to switch between text and audio at the press of a button. Imagine reading your book on the couch and then switching into audio mode while doing the dishes seamlessly, by connecting bluetooth headphones.

InsideOutSanta

Kindles used to provide this feature, but publishers and/or the Authors Guild stopped it, because audio rights and text rights are handled differently. In other words, when Amazon sells you a text book, it does not have the right to then also do TTS on that text and let you listen to it.

There's some contemporary discussion of what happened here: https://tidbits.com/2009/03/02/why-the-kindle-2-should-speak...

I think there is still integration with Audible, though. If you buy a book on the Kindle and on Audible, the position will sync, and you can switch between listening and reading without losing your place in the book.

albert_e

Yes the feature is called WhisperSync -- I used it many years ago and it was pretty good.

I tried it while on a treadmill so it allowed me to follow the book with more focus without sacrificing much else.

thfuran

Isn't whisper sync the current version that relies on owning both the ebook and audiobook?

Brybry

I used that TTS feature semi-regularly on a Kindle 2.

It wasn't a good experience but it was nice to be able to keep 'reading' a book while I was exercising.

It worked for me for over a decade, until I broke the device. I don't know if I never updated the firmware or if the fact I used Calibre to convert books bypassed the feature gate.

dsign

It is a supported feature in the epub 3.0 standard. It's possible to distribute an epub with audio, and have the audio sync to the HTML elements that form the ebook's text. And there is an e-reader that actually supports this feature, I can't remember which one now but it should be possible to find it with Google.
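To make that concrete, here is a minimal sketch of generating the sync file. As far as I recall from the EPUB 3 Media Overlays spec, it is a SMIL document where each `par` pairs a `text` element (pointing at an element id in the XHTML) with an `audio` clip; please check the spec before relying on the exact names. The file names, ids and the `build_media_overlay` helper are invented for the example, and the wiring into the package manifest is left out.

```python
import xml.etree.ElementTree as ET


def build_media_overlay(text_doc, audio_file, clips):
    """Build a SMIL media overlay string.

    clips is a list of (element_id, start_seconds, end_seconds), one entry
    per sentence or paragraph in the XHTML document text_doc.
    """
    smil = ET.Element("smil", {
        "xmlns": "http://www.w3.org/ns/SMIL",
        "version": "3.0",
    })
    body = ET.SubElement(smil, "body")
    for index, (element_id, start, end) in enumerate(clips, 1):
        # Each <par> plays one audio clip while one text element is highlighted.
        par = ET.SubElement(body, "par", {"id": f"par{index}"})
        ET.SubElement(par, "text", {"src": f"{text_doc}#{element_id}"})
        ET.SubElement(par, "audio", {
            "src": audio_file,
            "clipBegin": f"{start:.3f}s",
            "clipEnd": f"{end:.3f}s",
        })
    return ET.tostring(smil, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical file names and timings.
    print(build_media_overlay(
        "chapter1.xhtml",
        "chapter1.mp3",
        [("s1", 0.0, 2.5), ("s2", 2.5, 6.0)],
    ))
```

With TTS the timings fall out for free if you synthesize sentence by sentence and record each clip's duration; with a human narrator you would need forced alignment to recover them.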

It's more of an open problem how to create those epubs. I have some code that can do it using Elevenlabs audio, but I imagine it's way harder to have something similar for a human narrator.... who's going to do the sync? Maybe we need a sync AI.

llamaimperative

Boox Ultra Tab whatever the fuck (their product naming sucks) + Readwise Reader = amazing for this

Not quite seamless but it works. It has a cursor that follows the words as they’re spoken, which allows you to read and hear (“immersive reading”) which I find to be extremely helpful for maintaining focus.

freefaler

You can do it easily with non-DRM books (or DRM stripped books):

For Android:

- Moon+ reader pro - some paid high-quality TTS voices (like Acapella)

For iOS:

- Kybook reader and internal iOS voices (no external TTS voices for the walled garden)

This works well enough to listen to a book while you walk and when you get back home read on the WC from the place you stopped.

Additionally if you buy a tablet or an android ebook reader, you install the app there and you can continue on your bigger/better device seamlessly.

Whisper-sync for the masses! Ahoy...

basedrum

But you need an Android phone, and can't use a Kobo or similar e-ink reader?

freefaler

For iOS you use Kybook on your iPhone and your iPad. It syncs positions between the devices. When you go for a walk, open Kybook and start TTS. When you're back home, open your tablet and you'll see the page where TTS stopped reading.

zoidb

Not directly related to the software, but interestingly on the author's website there is a "Schedule a free call with me" link (https://claudio.uk/templates/call.html). I wonder if randos on the internet ever do that, and how it works out.

sam_lowry_

His LLM will answer the call.

qurashee

This looks incredible! I’ve had an idea simmering in the back of my mind for a while now: creating an audiobook from an ebook for my commute using the voice of a specific audiobook narrator I really enjoy. The concept struck me after coming across the Infinite Conversation project here on HN. Unfortunately, I just haven’t found the time to bring it to life yet. :(

vinni2

What about the copyright issue? You can’t mimic the voice of a narrator without their consent. OpenAI landed in trouble after using Scarlett Johansson’s voice in a demo.

https://www.theverge.com/2024/5/20/24161253/scarlett-johanss...

notachatbot123

No limitations on this kind of thing if you are in private use.

qurashee

Indeed I was thinking about private use only.

vinni2

Forgive me for not knowing it was for personal use.

benatkin

She only won in that OpenAI decided it wasn’t worth the trouble.

amrrs

Kokoro explicitly mentions that they used only permissively licensed voices.

herculity275

Very nice! I fiddled with this idea a few months back but the models available at the time were woefully slow on a macbook. Will definitely give this a spin, there's a large category of web serials and less popular translated novels that never get audiobook releases.

cwmoore

The word “kokoro” means “heart” in Japanese, which I learned making the (heart shaped and paperback) puzzle books at https://www.kakurokokoro.com/

tkgally

Note that kokoro (心) means “heart” in the sense of “spirit,” “soul,” “mind,” “emotions,” etc. It doesn’t mean “heart” in the sense of “internal organ that pumps blood.” That is shinzō (心臓).

I once heard an American friend with so-so Japanese ability ask a Japanese woman who had recently had a heart operation how her kokoro was doing, and she looked surprised and taken aback.

Side note: After I started reading HN in 2019, I was struck by how many tech products mentioned here have Japanese names. I compiled a list for a few years and eventually posted it:

https://news.ycombinator.com/item?id=31310370

terhechte

It's also the name of the AI in Terminator Zero https://villains.fandom.com/wiki/Kokoro

I'm not sure if that is related here.

lc64

"was trained on <100 hours of audio"

How the hell was it trained on that little data?

bbminner

I suppose it means per speaker. And it is based on a simplified StyleTTS 2, which from my small dive into the subject seems to be one of the smaller models achieving great quality.

Havoc

Yeah that surprised me as well - seems low vs what is used on text LLMs. To be fair, 100 hours of speaking is a lot of speaking though.

edude03

But it covers five(?) languages, so if they're all equal it’s just 20 hours per language.

em-bee

in the linked audio sample it says the training data is mostly english. also another comment claims that the japanese quality is not good, so i'd be suspicious about all the other languages.

TypoAtLineZero

I have a very similar setup locally, which uses Chrome with the 'Read Aloud' plugin. I capture the audio stream via QJackCtl/VLC. Voices, speed and pitch can be adjusted. Efficient and quick to set up.

mikkom

What I really want and hope that someone does is to make an audiobook service that converts books to audiobooks but so that each character has its own voice.

Some audiobooks have this and I think it really makes the experience much more engaging.

(Also maybe some background sound effects but not sure about that, some books also have this and it's quite nice too)

albert_e

I hope a plugin for Calibre ebook management software comes along that makes it easier to convert select titles from your epub library to decent audio versions -- and a decent open source app for tablets and smartphones that can let us seamlessly consume both the ebook and audiobook at will.