Llasa: Llama-Based Speech Synthesis
9 comments · May 1, 2025
CalmStorm
LLaSA is a simple framework for speech synthesis that employs a single-layer vector quantizer (VQ) codec and a single Transformer architecture to fully align with standard LLMs such as LLaMA.
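To unpack that one-sentence description, here is a rough, hypothetical sketch (plain PyTorch, not the authors' code) of what "single-layer VQ codec plus a single Transformer" implies: because the codec has only one codebook, each audio frame is exactly one token, so speech tokens can be appended to the text vocabulary and modeled by one decoder-only Transformer with the ordinary next-token objective. All sizes and names below are made up for illustration.

```python
# Minimal sketch of the single-stream text+speech LM idea (not the LLaSA code).
import torch
import torch.nn as nn

TEXT_VOCAB = 32000           # e.g. a LLaMA-style tokenizer size (assumed)
SPEECH_VOCAB = 65536         # single-codebook codec size (assumed)
D_MODEL, N_LAYERS, N_HEADS = 512, 8, 8

class SingleStreamTTS(nn.Module):
    def __init__(self):
        super().__init__()
        # One shared embedding table over text tokens and speech tokens.
        self.embed = nn.Embedding(TEXT_VOCAB + SPEECH_VOCAB, D_MODEL)
        layer = nn.TransformerEncoderLayer(
            D_MODEL, N_HEADS, dim_feedforward=4 * D_MODEL, batch_first=True
        )
        self.blocks = nn.TransformerEncoder(layer, N_LAYERS)
        self.lm_head = nn.Linear(D_MODEL, TEXT_VOCAB + SPEECH_VOCAB)

    def forward(self, tokens):
        # Causal mask -> decoder-only next-token prediction, as in any LLM.
        seq_len = tokens.size(1)
        mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
        h = self.blocks(self.embed(tokens), mask=mask)
        return self.lm_head(h)

# A training example is just [text tokens ...][speech tokens offset by TEXT_VOCAB ...].
model = SingleStreamTTS()
text = torch.randint(0, TEXT_VOCAB, (1, 12))
speech = torch.randint(0, SPEECH_VOCAB, (1, 50)) + TEXT_VOCAB
logits = model(torch.cat([text, speech], dim=1))
print(logits.shape)  # (1, 62, TEXT_VOCAB + SPEECH_VOCAB)
```

At inference you would prompt with the text tokens and sample speech tokens autoregressively, then hand them to the codec's decoder to get a waveform.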
WastedCucumber
Probably the title should have the correct capitalization then. Cause I was fully expecting a speech synthesis tool that sounded like llamas talking human language and now I'm bummed out!
mring33621
the long 'uuuuhhhhhhh' from some of the lesser models is killing me.
jszymborski
Based on the samples, it really seems like anything smaller than 3B is pretty useless.
hadlock
If you're doing a home lab voice assistant, 1B is nice, because on a 12 GB GPU you can run a moderately competent 7B LLM and two 1B models, one for speech-to-text and one for text-to-speech, plus a bit left over for the wake word monitor. Maybe in a couple of years we can combine all this into a single ~8B model that runs efficiently on a 12 GB GPU. Nvidia doesn't seem very incentivized right now to sell consumer GPUs that can run all this on a single consumer-grade chip when they're making so much money selling commercial-grade 48 GB cards.
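For what it's worth, a quick back-of-the-envelope check on that budget (the quantization levels and overhead factor below are assumptions, not measurements):

```python
# Rough VRAM budget for the setup described above.
# Rule of thumb: bytes ~= params * bytes_per_param, plus ~20% for KV cache/activations.
GB = 1024 ** 3

def vram_gb(params_b, bytes_per_param, overhead=1.2):
    return params_b * 1e9 * bytes_per_param * overhead / GB

budget = 12.0  # 12 GB consumer GPU
usage = {
    "7B LLM (4-bit)":           vram_gb(7, 0.5),
    "1B text-to-speech (fp16)": vram_gb(1, 2),
    "1B speech-to-text (fp16)": vram_gb(1, 2),
    "wake word model":          0.2,   # small model, rough guess
}
for name, gb in usage.items():
    print(f"{name:>26}: {gb:4.1f} GB")
print(f"{'total':>26}: {sum(usage.values()):4.1f} GB of {budget} GB")
```

Under those assumptions the whole stack comes in around 8-9 GB, so it fits, though not with much room to spare.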
StevenNunez
I can't wait to see this integrated into Open WebUI! These sound amazing.
dheera
> employs a single-layer vector quantizer (VQ) codec and a single Transformer architecture to fully align
I really wish that when new models were released they would draw a diagram of all the layers and the tensor input and output sizes at each layer, with zoom in/out capabilities using D3.js or whatever visualization framework if needed. Every single layer should be on there with its input and output sizes.
These one-sentence descriptions, and approximate block diagrams with arrows pointing at each other are never enough to understand how something is actually implemented.
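Short of a full interactive diagram, one low-effort way to get those per-layer input/output sizes is a forward-hook dump. This is a generic PyTorch sketch, not anything LLaSA-specific, and the toy model at the bottom is just a stand-in for a real checkpoint:

```python
# Print the input/output tensor shapes of every leaf layer in a model.
import torch
import torch.nn as nn

def shape_of(x):
    return tuple(x.shape) if isinstance(x, torch.Tensor) else type(x).__name__

def trace_shapes(model, example_input):
    hooks = []
    def hook(module, inputs, output):
        ins = [shape_of(t) for t in inputs]
        print(f"{module.__class__.__name__:<12} in={ins} out={shape_of(output)}")
    for m in model.modules():
        if len(list(m.children())) == 0:  # leaf layers only
            hooks.append(m.register_forward_hook(hook))
    with torch.no_grad():
        model(example_input)
    for h in hooks:
        h.remove()

# Toy example; swap in the real model and a real input to dump every layer.
toy = nn.Sequential(nn.Embedding(1000, 64), nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 1000))
trace_shapes(toy, torch.randint(0, 1000, (1, 16)))
```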
exe34
Sounds like a solid SaaS business plan!
Odd that the page doesn't seem to link to either one:
paper: https://arxiv.org/abs/2502.04128
github: https://github.com/zhenye234/LLaSA_training