Insane how much low-hanging fruit there is for audio models right now. A team of two picking things up over a few months can build something that still competes with large players with tons of funding.
Wow. Thanks for posting the direct link to examples. Those sound incredibly good and would be impressive for a frontier lab. For two people over a few months, it's spectacular.
A little overacted; it reminds me of the voice acting in those Flash cartoons you'd see in the early days of YouTube. That's not to say it isn't good work; it still sounds remarkably human. Just silly humans :)
Is this Apache licensed or a custom one? The README contains this:
> This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
> This project offers a high-fidelity speech generation model *intended solely for research and educational use*. The following uses are strictly forbidden:
> Identity Misuse: Do not produce audio resembling real individuals without permission.
> ...
Specifically the phrase "intended solely for research and educational use".
Sorry for the confusion. The license is plain Apache 2.0, and we changed the wording to "intended for research and educational use." The point was that users are free to use it for their own use cases; just don't do shady stuff with it.
Sounds really good and human! I got a fair number of unexpected artifacts though, e.g. three seconds of hissing noise before the dialogue, and music in the background when I added (happy) in an attempt to control tone. Also, I don't understand how to control the S1 and S2 speakers... is it just random based on temperature?
> TODO Docker support
Got this adapted pretty easily: just take the latest NVIDIA CUDA container, throw Python and the required modules on it, and change the server to serve on 0.0.0.0. It does mean it pulls the model on every startup though, which isn't ideal.
> Also don't understand how to control the S1 and S2 speakers...
Use a clip with the speakers you want as the audio prompt, add the text of that clip (with speaker tags) at the beginning of your text prompt, and it clones the voices from your audio prompt for the output.
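In code, that flow presumably looks something like this (a minimal sketch: `Dia.from_pretrained`, `generate`, and `audio_prompt_path` follow the pattern of the README's example but aren't confirmed here, and `reference.wav` is a stand-in for your clip):

```python
import soundfile as sf
from dia.model import Dia  # package layout assumed from the nari-labs/dia README

model = Dia.from_pretrained("nari-labs/Dia-1.6B")

# Transcript of the reference clip first (with speaker tags), then the new
# lines to be spoken in those cloned voices.
script = (
    "[S1] This is the first speaker in the reference clip. "
    "[S2] And this is the second. "
    "[S1] Everything from here on is newly generated in the cloned voices."
)

# The audio prompt conditions the output; generation continues in its style.
audio = model.generate(script, audio_prompt_path="reference.wav")

sf.write("cloned_dialogue.wav", audio, 44100)  # 44.1 kHz output assumed
```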
Thank you for the kind words! Dia wasn’t fine-tuned on a specific speaker, so you will get random voices every time you run it, unless you add an audio prompt or fix the seed.
The outputs are a bit unstable; we might need to add cleaner training data and run longer training sessions. Hopefully we can do something like OpenAI's Whisper and update with better-performing checkpoints!
> Surely it just downloads to a directory that can be volume mapped?

Yep. I just didn't spend the time to track down the location, tbh. Plus, Hugging Face usually symlinks to a cache folder whose location I don't recall.
I literally got CUDA containers working earlier today, so I haven't spent a huge amount of time figuring things out.
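For anyone else containerizing it: Hugging Face caches model downloads under `~/.cache/huggingface` by default, and the location is controllable via `HF_HOME`, so pinning it to a known path and volume-mapping it avoids the re-pull on every start. A rough sketch (the base image tag, `requirements.txt`, and the `app.py` entry point are assumptions about the repo, not confirmed):

```dockerfile
# Sketch only: adjust the image tag, deps, and entry point to match the repo.
FROM nvidia/cuda:12.4.1-runtime-ubuntu22.04

RUN apt-get update && apt-get install -y python3 python3-pip \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app
COPY . .
RUN pip3 install -r requirements.txt

# Pin the Hugging Face cache to a path we can mount as a volume.
ENV HF_HOME=/cache/huggingface
# Gradio reads this env var; binds the server to all interfaces.
ENV GRADIO_SERVER_NAME=0.0.0.0

EXPOSE 7860
CMD ["python3", "app.py"]
```

Run with something like `docker run --gpus all -p 7860:7860 -v dia-hf-cache:/cache/huggingface dia` so the downloaded model survives container restarts.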
Hey HN! We’re Toby and Jay, creators of Dia. Dia is a 1.6B-parameter open-weights model that generates dialogue directly from a transcript.

Demo page comparing it to ElevenLabs and Sesame-1B: https://yummy-fir-7a4.notion.site/dia
Unlike TTS models that generate each speaker turn and stitch them together, Dia generates the entire conversation in a single pass. This makes it faster, more natural, and easier to use for dialogue generation.
It also supports audio prompts — you can condition the output on a specific voice/emotion and it will continue in that style.
We started this project after falling in love with NotebookLM’s podcast feature. But over time, the voices and content started to feel repetitive. We tried to replicate the podcast feel with existing APIs, but it did not sound like real human conversation.
So we decided to train a model ourselves. We had no prior experience with speech models and had to learn everything from scratch — from large-scale training to audio tokenization. It took us a bit over 3 months.
Our work is heavily inspired by SoundStorm and Parakeet. We plan to release a lightweight technical report to share what we learned and accelerate research.
We’d love to hear what you think! We are a tiny team, so open-source contributions are extra welcome. Please feel free to check out the code and share any thoughts or suggestions with us.
I know it’s taboo to ask, but I must: where’s the dataset from? Very eager to play around with audio models myself, but I find existing datasets limiting
+1 to this. Amazing that you managed to deliver this, and if you're willing to share, I'd be most interested in learning what you did in terms of training data!
Could one use case be generating an audiobook with this from existing books? I wonder if I could fine-tune the "characters" that speak these lines, since you said it's a single pass over the whole conversation. I wonder if that's a limitation for this kind of use case (where speed is not imperative).
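Consistent characters across a long book seem reachable without fine-tuning by reusing one reference clip and transcript for every chunk; a hypothetical sketch (same assumed API as the earlier sketch; `character_voices.wav` and its transcript are stand-ins):

```python
import numpy as np
import soundfile as sf
from dia.model import Dia  # assumed API, as above

model = Dia.from_pretrained("nari-labs/Dia-1.6B")

# Reference transcript matching character_voices.wav, reused verbatim for every
# chunk so each chapter is generated with the same cloned voices.
REFERENCE = (
    "[S1] A narration sample in the narrator's voice. "
    "[S2] A line of dialogue in the protagonist's voice. "
)

chapters = [
    "[S1] The ship drifted toward the silent planet. [S2] We shouldn't be here.",
    "[S1] By morning the storm had passed. [S2] Check the instruments again.",
]

pieces = [
    model.generate(REFERENCE + chunk, audio_prompt_path="character_voices.wav")
    for chunk in chapters
]
sf.write("audiobook.wav", np.concatenate(pieces), 44100)
```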
Hi! This is awesome for size and quality. I want to see a book reading example or try it myself.
This is a tangential point, but it would have been nicer if it weren't a Notion site. You could put the same page on GitHub Pages and it would be much lighter to open, navigate, and link to (e.g. for people trying to link to a specific audio sample).
Easily ten times better than the recent OpenAI voice model. I don't like robotic voices.
The example voices seem overly loud and overexcited, like Andrew Tate, Speed, or an advertisement. They lack calm, normal conversation or normal podcast-like interaction.
> https://gitlab.gnome.org/GNOME/dia

Fun, I can't get to it because I can't get past the "Making sure you're not a bot!" page. It's just stuck at "calculating...". I understand the desire to slow down AI bots, but if all the GNOME apps are now behind this, they've just completely shut out a small-time contributor. I love to play with GNOME apps and help out with things here and there, but I'm not going to fight with this damn thing to do so.
I know it's a bit ridiculous to see this as some kind of conspiracy, but I have seen a very long list of AI-related projects that took the same name as a famous open-source project, as if they wanted to hijack its popularity, and Dia is yet another example. Dia was relatively famous a few years ago, and you cannot have forgotten it if you used Linux for more than a few weeks. It almost seems done on purpose.
The generous interpretation is that the AI hype people just didn’t know about those other projects, i.e. that they are neither open source developers, nor users.
Was this trained on Planet Money / NPR podcasts? The last audio sample (the continuation of the prompt) sounds eerily like Planet Money; I had to double-check whether my Spotify had accidentally started playing.
It started with an Ira Glass voice, and now the default voice is someone who sounds like they're not certain they should be saying the very banal thing they're about to say, followed by a handshake protocol of nervous laughter.
This is really impressive; we're getting close to a dream of mine: the ability to generate proper audiobooks from EPUBs. Not just a robotic single voice for everything, but different, consistent voices for each protagonist, with the LLM analyzing the text to guess which voice to use and add an appropriate tone, much like a voice actor would do.
I've tried "EPUB to audiobook" tools, but they are really miles behind what a real narrator accomplishes and make the audiobook impossible to engage with
> Wouldn’t it be more desirable to hear an actual human on an audiobook? Ideally the author?
Of course, but it's not always available.
For example, I would love an audiobook for Stanisław Lem's "The Invincible," as I just finished its video game adaptation, yet it simply doesn't exist in my native language.
It's quite seldom that the author narrates the audiobooks I listen to, and sometimes the narrator does a horrible job, butchering the characters with exaggerated tones.
Honestly, I’d say that’s true only for the author. Anyone else is just going to be interpreting the words to understand how to best convey the character / emotion / situation / etc., just like an AI will have to do. If an AI can do that more effectively than a human, why not?
The author could be better, because they at least have other info beyond the text to rely on; they can go off-script or add little details, etc.
As somebody who has listened to hundreds of audiobooks, I can tell you authors are generally not the best choice to voice their own work. They may know every intent, but they are writers, not actors.
The most skilled readers will make you want to read books _just because they narrated them_. They add a unique quality to the story, that you do not get from reading yourself or from watching a video adaptation.
Currently I'm in The Age of Madness, read by Steven Pacey. He's fantastic. The late Roy Dotrice is worth a mention as well, for voicing Game of Thrones and claiming the Guinness world record for most distinct voices (224) in one series.
It will be awesome if we can create readings automatically, but it will be a while before TTS can compete with the best readers out there.
I’d suggest that even if the TTS sounded good, I’d still rather have a human, because:
1. It’s a job that seems worthwhile to support, especially as it’s “practice” that only adds to a lifetime of work and improves their central skill set
2. A voice actor will bring their own flair, just like any actor does to their job
3. They (should) prepare for the book, understanding what it’s about in its entirety, and bring that context to the reading
Realistic voice acting for audiobooks, realistic images for each page, realistic videos for each page... oh wait, I just created a movie. Maybe I can change the plot? Oh wait, I just created a video game.
Wow, this is the first time I have felt that this could be the end of voice acting and audiobook narration. With the speed at which things are changing, how soon before you can turn any book or novel into a complete audio/video production, a movie or TV show?
The audio quality is seriously impressive. Any plans to add word-level timing maps? For my use case that is a requirement, so unfortunately I cannot use this yet, but I would very much like to.
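Until alignments are emitted natively, one workaround is to recover word-level timings after the fact with an ASR pass; openai-whisper supports per-word timestamps:

```python
import whisper  # pip install openai-whisper

# Transcribe the generated clip and ask Whisper for per-word timings.
asr = whisper.load_model("base")
result = asr.transcribe("generated_dialogue.wav", word_timestamps=True)

for segment in result["segments"]:
    for word in segment["words"]:
        print(f"{word['start']:6.2f}s - {word['end']:6.2f}s  {word['word']}")
```

The timings are only as good as the ASR transcript, but since you already know the script, mismatches are easy to detect.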
Hey, this is really cool! Curious how good the multi-language support is. Also - pretty wild that you trained the whole thing yourselves, especially without prior experience in speech models.
Might actually be helpful for others if you ever feel like documenting how you got started and what the process looked like. I’ve never worked with TTS models myself, and honestly wouldn’t know where to begin. Either way, awesome work. Big respect.
Sounds awesome!
> Time to first audio is crucial for us to reduce latency - wondering if Dia works with output streaming? The Python code snippet seems to imply that the entire audio is generated in one go?

I think it won't be very hard to run it with output streaming, although that might require beefier GPUs. Drop us an email and we can talk more - nari.ai.contact at gmail dot com.
It's way past bedtime where I live, so will be able to get back to you after a few hours. Thanks for the interest :)
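In the meantime, a crude form of streaming can be faked client-side by splitting the script into speaker turns and generating them sequentially, at the cost of the single-pass coherence that makes Dia interesting; a sketch with the same assumed API:

```python
import re

import sounddevice as sd   # pip install sounddevice
from dia.model import Dia  # assumed API, as in the sketches above

model = Dia.from_pretrained("nari-labs/Dia-1.6B")

def stream_turns(script: str):
    """Fake streaming: generate one speaker turn at a time and yield each clip.

    This trades away Dia's single-pass conversational coherence; true
    streaming would need incremental decoding inside the model."""
    for turn in re.findall(r"\[S\d\][^\[]+", script):
        yield model.generate(turn)

for chunk in stream_turns("[S1] First line. [S2] A quick reply. [S1] Closing thought."):
    sd.play(chunk, samplerate=44100)  # 44.1 kHz output assumed
    sd.wait()                         # block until this chunk finishes playing
```

Note that, per the authors' comment above, each generation picks random voices unless you fix the seed or pass an audio prompt, so per-turn generation would also need one of those to stay consistent.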
We're envisioning a platform with a social aspect, so that is the biggest difference. Also, bigger models!
> Also, you don't need to explicitly create and activate a venv if you're using uv - it deals with that nonsense itself. Just `uv sync`.

We're aware that you don't need to create a venv when uv is already available; we added it for people spinning up new GPU instances in the cloud. But I'll update the README to make that a bit clearer. Thanks for the feedback :)
I've been waiting for this ever since reading some interview with Orson Scott Card ages ago. It turns out he thinks of his novels as radio theater, not books. Which is a very different way to experience the audio.
The full version of Dia requires around 10GB of VRAM to run.
If you have 16 GB of VRAM, I guess you could pair this with a 3B-parameter model alongside it, or more realistically only a 1B-parameter model with a reasonable context window.
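Back-of-envelope numbers behind that pairing (weights only; activations and KV cache need headroom on top):

```python
# Rough VRAM budget for a 16 GB card, taking the ~10 GB Dia figure above.
dia_gb = 10.0

def weights_gb(params: float, bytes_per_param: float) -> float:
    return params * bytes_per_param / 1e9

llm_3b_fp16 = weights_gb(3e9, 2.0)  # 6.0 GB -> 16 GB total: no headroom left
llm_1b_fp16 = weights_gb(1e9, 2.0)  # 2.0 GB -> ~12 GB total: comfortable
llm_3b_int4 = weights_gb(3e9, 0.5)  # 1.5 GB -> ~11.5 GB: a quantized 3B fits

print(dia_gb + llm_3b_fp16, dia_gb + llm_1b_fp16, dia_gb + llm_3b_int4)
```

Which matches the intuition above: a 1B model (or a 4-bit-quantized 3B) is the realistic companion once you leave room for context.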
Impressive project! We'd love to use something like this over at Delfa (https://delfa.ai). How does this hold up from the perspective of stability? I've spoken to various folks working on voice models, and one thing that has consistently kept Eleven Labs ahead of the pack, in my experience, is that their models mostly avoid (albeit are not immune to) accent shifts and distortions when confronted with unfamiliar medical terminology.
A high quality, affordable TTS model that can consistently nail medical terminology while maintaining an American accent has been frustratingly elusive.
Interesting. I haven't thought of that problem before. I'm guessing a large enough audio dataset for medical terminology does not exist publicly.
But AFAIK, even if you have just a few hours of audio containing specific terminology (and correct pronunciation), fine-tuning on that data will significantly improve performance.
You can add an audio prompt and prepend text corresponding to it in the script. You can get a feel for it by trying the second example in the Gradio interface!
Looking forward to trying it. My current go-to solution is E5-F2 (great cloning, decent delivery, OK audio quality, but a lot of incoherence here and there, forcing you to do multiple generations).
I've just been massively disappointed by Sesame's CSM: the Gradio demo on their website generated flawless dialogue with amazing voice cloning, but running it locally, the voice-cloning performance is awful.
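One way to automate that "multiple generations" dance is to over-generate and re-rank: transcribe each candidate with ASR and keep the take whose transcript best matches the script. A sketch using openai-whisper and jiwer (both real libraries; the Dia API is assumed as in the sketches above):

```python
import soundfile as sf
import whisper             # pip install openai-whisper
from jiwer import wer      # pip install jiwer
from dia.model import Dia  # assumed API, as above

model = Dia.from_pretrained("nari-labs/Dia-1.6B")
asr = whisper.load_model("base")

script = "[S1] The results came back negative. [S2] That's a relief."
plain = script.replace("[S1]", "").replace("[S2]", "").strip()

best, best_err = None, float("inf")
for i in range(4):  # generate a handful of candidates
    audio = model.generate(script)
    sf.write(f"cand_{i}.wav", audio, 44100)
    hyp = asr.transcribe(f"cand_{i}.wav")["text"]
    err = wer(plain.lower(), hyp.lower())  # word error rate vs. the script
    if err < best_err:
        best, best_err = f"cand_{i}.wav", err

print(f"kept {best} (WER {best_err:.2f})")
```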
> [S1] Oh fire! Oh my goodness! What's the procedure? What do we do, people? The smoke could be coming through an air duct!
Seriously impressive. Wish I could direct link the audio.
Kudos to the Dia team.
> Specifically the phrase "intended solely for research and educational use".
Thanks for the feedback :)
https://github.com/nari-labs/dia/pull/4
https://gitlab.gnome.org/GNOME/dia
I'm sure they can... talk it over.
I'll show myself out.
I would absolutely love something like this for practicing Chinese, or even just adding Chinese dialogue to a project.
> The full version of Dia requires around 10GB of VRAM to run.
We've seen Bark from Suno go from a 16GB requirement to 4GB and running on CPUs. It won't be too hard; we just need some time to work on it.
https://github.com/SparkAudio/Spark-TTS
Sounds awesome on its demo page, though.
What're the recommended GPU cloud providers for using such open-weights models?