Show HN: Dia, an open-weights TTS model for generating realistic dialogue

Havoc

Sounds really good & human! Got a fair bit of unexpected artifacts though. e.g. 3 seconds hissing noise before dialogue. And music in background when I added (happy) in an attempt to control tone. Also don't understand how to control the S1 and S2 speakers...is it just random based on temp?

> TODO Docker support

Got this adapted pretty easily. Just latest nvidia cuda container, throw python and modules on it and change server to serve on 0.0.0.0. Does mean it pulls the model every time on startup though which isn't ideal

hemloc_io

Very cool!

Insane how much low hanging fruit there is for Audio models right now. A team of two picking things up over a few months can build something that still competes with large players with tons of funding

tyrauber

Hey, do yourself a favor and listen to the fun example:

> [S1] Oh fire! Oh my goodness! What's the procedure? What to we do people? The smoke could be coming through an air duct!

Seriously impressive. Wish I could direct link the audio.

Kudos to the Dia team.

jinay

For anyone who wants to listen, it's on this page: https://yummy-fir-7a4.notion.site/dia

DoctorOW

A little overacted, it reminds me of the voice acting in those flash cartoons you'd see in the early days of YouTube. That's not to say it isn't good work, it still sounds remarkably human. Just silly humans :)

mrandish

Wow. Thanks for posting the direct link to examples. Those sound incredibly good and would be impressive for a frontier lab. For two people over a few months, it's spectacular.

nojs

This is so good. Reminds me of The Office. I love how bad the other examples are.

fwip

The text is lifted from a scene in The Office: https://youtu.be/gO8N3L_aERg?si=y7PggNrKlVQm0qyX&t=82

notdian

Made a small change and got it running on an M2 Pro 16GB MacBook Pro; the quality is amazing.

https://github.com/nari-labs/dia/pull/4

noiv

Can confirm, runs straight forward on 15.4.1@M4, THX.

rustc

Is this Apache licensed or a custom one? The README contains this:

> This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

> This project offers a high-fidelity speech generation model *intended solely for research and educational use*. The following uses are strictly forbidden:

> Identity Misuse: Do not produce audio resembling real individuals without permission.

> ...

Specifically the phrase "intended solely for research and educational use".

strobe

just in case, another opensource project using same name https://wiki.gnome.org/Apps/Dia/

https://gitlab.gnome.org/GNOME/dia

freedomben

Fun, I can't get to it because I can't get past the "Making sure you're not a bot!" page. It's just stuck at "calculating...". I understand the desire to slow down AI bots, but if all the GNOME apps are now behind this, they just completely shut down a small-time contributor. I love to play with GNOME apps and help out with things here and there, but I'm not going to fight with this damn thing to do so.

SoKamil

And another one, not open source but in AI sphere: https://www.diabrowser.com/

toebee

Thanks for the heads-up! We weren’t aware of the GNOME Dia project. Since we focus on speech AI, we’ll make sure to clarify that distinction.

aclark

Ditto this! Dia diagram tool user here just noticing the name clash. Good luck with your Dia!! Assuming both can exist in harmony. :-)

mrandish

> Assuming both can exist in harmony.

I'm sure they can... talk it over.

I'll show myself out.

Magma7404

I know it's a bit ridiculous to see that as some kind of conspiracy, but I have seen a very long list of AI-related projects that got the same name as a famous open-source project, as if they wanted to hijack the popularity of those projects, and Dia is yet another example. It was relatively famous a few years ago and you cannot have forgotten it if you used Linux for more than a few weeks. It's almost done on purpose.

teddyh

The generous interpretation is that the AI hype people just didn’t know about those other projects, i.e. that they are neither open source developers, nor users.

gapan

Of course, how could they have known? Doing a basic web search before deciding on a name is so last year.

xhkkffbf

Are there different voices? Or only [s1] and [s2] in the examples?

toebee

Hey HN! We’re Toby and Jay, creators of Dia. Dia is a 1.6B-parameter open-weights model that generates dialogue directly from a transcript.

Unlike TTS models that generate each speaker turn and stitch them together, Dia generates the entire conversation in a single pass. This makes it faster, more natural, and easier to use for dialogue generation.
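A minimal sketch of what "single pass" means on the input side, assuming the `[S1]`/`[S2]` tag format seen in the examples quoted in this thread. `build_transcript` is a hypothetical helper, not part of Dia's code, and the second line of dialogue is made up for illustration. The point is only that the whole conversation becomes one string handed over in one go, rather than one request per speaker turn.

```python
def build_transcript(turns):
    """Join (speaker_index, text) turns into one tagged transcript string.

    Assumption: speakers are written as [S1] and [S2], as in the examples
    quoted in this thread. This helper is illustrative only.
    """
    parts = []
    for speaker, text in turns:
        if speaker not in (1, 2):
            raise ValueError(f"only speakers 1 and 2 are assumed, got {speaker}")
        parts.append(f"[S{speaker}] {text.strip()}")
    return " ".join(parts)


turns = [
    (1, "Oh fire! Oh my goodness!"),
    (2, "Everyone stay calm."),
    (1, "What's the procedure?"),
]
transcript = build_transcript(turns)
print(transcript)
```

How the string is then passed to the model depends on the project's own interface, which is not shown here.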

It also supports audio prompts — you can condition the output on a specific voice/emotion and it will continue in that style.

Demo page comparing it to ElevenLabs and Sesame-1B https://yummy-fir-7a4.notion.site/dia

We started this project after falling in love with NotebookLM’s podcast feature. But over time, the voices and content started to feel repetitive. We tried to replicate the podcast-feel with APIs but it did not sound like human conversations.

So we decided to train a model ourselves. We had no prior experience with speech models and had to learn everything from scratch — from large-scale training, to audio tokenization. It took us a bit over 3 months.

Our work is heavily inspired by SoundStorm and Parakeet. We plan to release a lightweight technical report to share what we learned and accelerate research.

We’d love to hear what you think! We are a tiny team, so open source contributions are extra-welcomed. Please feel free to check out the code, and share any thoughts or suggestions with us.

heystefan

Could one use case be generating an audiobook with this from existing books? I wonder if I could fine-tune the "characters" that speak these lines, since you said it's a single pass for the whole convo. Wonder if that's a limitation for this kind of use case (where speed is not imperative).

gfaure

Amazing that you developed this over the course of three months! Can you drop any insight into how you pulled together the audio data?

isoprophlex

+1 to this, amazing how you managed to deliver this, and if you're willing to share I'd be most interested in learning what you did in terms of training data..!

bzuker

hey, this looks (or rather, sounds) amazing! Does it work with different languages or is it English only?

nickthegreek

Are there any examples of the audio differences between this and the larger model?

new_user_final

Easily 10 times better than recent OpenAI voice model. I don't like robotic voices.

The example voices seem overly loud and overexcited, like Andrew Tate, Speed, or an advertisement. It's lacking calm, normal conversation or normal podcast-like interaction.

xbmcuser

Wow, first time I have felt that this could be the end of voice acting, audiobook narration, etc. With the speed at which things are changing, how soon before you can make any book or novel into a complete audio/video production, movie, or TV show?

codingmoh

Hey, this is really cool! Curious how good the multi-language support is. Also - pretty wild that you trained the whole thing yourselves, especially without prior experience in speech models.

Might actually be helpful for others if you ever feel like documenting how you got started and what the process looked like. I’ve never worked with TTS models myself, and honestly wouldn’t know where to begin. Either way, awesome work. Big respect.

Versipelle

This is really impressive; we're getting close to a dream of mine: the ability to generate proper audiobooks from EPUBs. Not just a robotic single voice for everything, but different, consistent voices for each protagonist, with the LLM analyzing the text to guess which voice to use and add an appropriate tone, much like a voice actor would do.
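The "consistent voice per character" part of that idea can be sketched independently of any TTS engine. This is a rough sketch under assumptions: the attribution step (an LLM or anything else deciding who speaks each line) is assumed to already exist and to yield `(character, text)` pairs, and `VoiceCast` and the voice labels are hypothetical names made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class VoiceCast:
    """Assign each character a stable voice label on first appearance."""
    voices: list
    narrator_voice: str = "narrator"
    _assigned: dict = field(default_factory=dict)

    def voice_for(self, character):
        # Lines with no attributed character go to the narrator voice.
        if character is None:
            return self.narrator_voice
        if character not in self._assigned:
            # Round-robin over the available voices; voices are reused
            # once there are more characters than voices.
            idx = len(self._assigned) % len(self.voices)
            self._assigned[character] = self.voices[idx]
        return self._assigned[character]


# (character, text) pairs, as a hypothetical attribution step might produce.
lines = [
    (None, "The door creaked open."),
    ("Anna", "Who's there?"),
    ("Ben", "Only me."),
    ("Anna", "You scared me."),
]
cast = VoiceCast(voices=["voice_a", "voice_b"])
script = [(cast.voice_for(who), text) for who, text in lines]
```

Keeping the mapping in one object means a character gets the same voice in chapter 20 as in chapter 1, which is the consistency that single-voice tools lack.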

I've tried "EPUB to audiobook" tools, but they are really miles behind what a real narrator accomplishes and make the audiobook impossible to engage with.

azinman2

Wouldn’t it be more desirable to hear an actual human on an audiobook? Ideally the author?

senordevnyc

Honestly, I’d say that’s true only for the author. Anyone else is just going to be interpreting the words to understand how to best convey the character / emotion / situation / etc., just like an AI will have to do. If an AI can do that more effectively than a human, why not?

The author could be better, because they at least have other info beyond the text to rely on, they can go off-script or add little details, etc.

DrSiemer

As somebody who has listened to hundreds of audiobooks, I can tell you authors are generally not the best choice to voice their own work. They may know every intent, but they are writers, not actors.

The most skilled readers will make you want to read books _just because they narrated them_. They add a unique quality to the story, that you do not get from reading yourself or from watching a video adaptation.

Currently I'm in The Age of Madness, read by Steven Pacey. He's fantastic. The late Roy Dotrice is worth a mention as well, for voicing Game of Thrones and claiming the Guinness world record for most distinct voices (224) in one series.

It will be awesome if we can create readings automatically, but it will be a while before TTS can compete with the best readers out there.

mclau157

Realistic voice acting for audio books, realistic images for each page, realistic videos for each page, oh wait I just created a movie, maybe I can change the plot? Oh wait I just created a video game

eob

Bravo -- this is fantastic.

I've been waiting for this ever since reading some interview with Orson Scott Card ages ago. It turns out he thinks of his novels as radio theater, not books. Which is a very different way to experience the audio.