Llasa: Llama-Based Speech Synthesis
9 comments · May 1, 2025
CalmStorm
LLaSA is a simple framework for speech synthesis that employs a single-layer vector quantizer (VQ) codec and a single Transformer architecture to fully align with standard LLMs such as LLaMA.
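To unpack that one-sentence description, here is a rough, hypothetical sketch (plain PyTorch, not the authors' code) of what "single-layer VQ codec plus a single Transformer" implies: because the codec has only one codebook, each audio frame is exactly one token, so speech tokens can be appended to the text vocabulary and modeled by one decoder-only Transformer with the ordinary next-token objective. All sizes and names below are made up for illustration.

```python
# Minimal sketch of the single-stream text+speech LM idea (not the LLaSA code).
import torch
import torch.nn as nn

TEXT_VOCAB = 32000           # e.g. a LLaMA-style tokenizer size (assumed)
SPEECH_VOCAB = 65536         # single-codebook codec size (assumed)
D_MODEL, N_LAYERS, N_HEADS = 512, 8, 8

class SingleStreamTTS(nn.Module):
    def __init__(self):
        super().__init__()
        # One shared embedding table over text tokens and speech tokens.
        self.embed = nn.Embedding(TEXT_VOCAB + SPEECH_VOCAB, D_MODEL)
        layer = nn.TransformerEncoderLayer(
            D_MODEL, N_HEADS, dim_feedforward=4 * D_MODEL, batch_first=True
        )
        self.blocks = nn.TransformerEncoder(layer, N_LAYERS)
        self.lm_head = nn.Linear(D_MODEL, TEXT_VOCAB + SPEECH_VOCAB)

    def forward(self, tokens):
        # Causal mask -> decoder-only next-token prediction, as in any LLM.
        seq_len = tokens.size(1)
        mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
        h = self.blocks(self.embed(tokens), mask=mask)
        return self.lm_head(h)

# A training example is just [text tokens ...][speech tokens offset by TEXT_VOCAB ...].
model = SingleStreamTTS()
text = torch.randint(0, TEXT_VOCAB, (1, 12))
speech = torch.randint(0, SPEECH_VOCAB, (1, 50)) + TEXT_VOCAB
logits = model(torch.cat([text, speech], dim=1))
print(logits.shape)  # (1, 62, TEXT_VOCAB + SPEECH_VOCAB)
```

At inference you would prompt with the text tokens and sample speech tokens autoregressively, then hand them to the codec's decoder to get a waveform.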
WastedCucumber
Probably the title should have the correct capitalization then. Cause I was fully expecting a speech synthesis tool that sounded like llamas talking human language and now I'm bummed out!
mring33621
the long 'uuuuhhhhhhh' from some of the lesser models is killing me.
jszymborski
Based on the samples, it really seems like anything smaller than 3B is pretty useless.
hadlock
If you're doing a home lab voice assistant, 1B is nice, because on a 12 GB GPU you can run a moderately competent 7B LLM and two 1B models, one for speech-to-text and one for text-to-speech, plus a bit left over for the wake word monitor. Maybe in a couple of years we can combine all this into a single ~8B model that runs efficiently on a 12 GB GPU. Nvidia doesn't seem very incentivized right now to sell consumer GPUs that can run all this on a single consumer-grade chip when they're making so much money selling commercial-grade 48 GB cards.
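For what it's worth, a quick back-of-the-envelope check on that budget (the quantization levels and overhead factor below are assumptions, not measurements):

```python
# Rough VRAM budget for the setup described above.
# Rule of thumb: bytes ~= params * bytes_per_param, plus ~20% for KV cache/activations.
GB = 1024 ** 3

def vram_gb(params_b, bytes_per_param, overhead=1.2):
    return params_b * 1e9 * bytes_per_param * overhead / GB

budget = 12.0  # 12 GB consumer GPU
usage = {
    "7B LLM (4-bit)":           vram_gb(7, 0.5),
    "1B text-to-speech (fp16)": vram_gb(1, 2),
    "1B speech-to-text (fp16)": vram_gb(1, 2),
    "wake word model":          0.2,   # small model, rough guess
}
for name, gb in usage.items():
    print(f"{name:>26}: {gb:4.1f} GB")
print(f"{'total':>26}: {sum(usage.values()):4.1f} GB of {budget} GB")
```

Under those assumptions the whole stack comes in around 8-9 GB, so it fits, though not with much room to spare.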
StevenNunez
I can't wait to see this integrated into Open WebUI! These sound amazing.
dheera
> employs a single-layer vector quantizer (VQ) codec and a single Transformer architecture to fully align
I really wish that when new models were released they would draw a diagram of all the layers and the tensor input and output sizes at each layer, with zoom in/out capabilities using D3.js or whatever visualization framework if needed. Every single layer should be on there with its input and output sizes.
These one-sentence descriptions, and approximate block diagrams with arrows pointing at each other are never enough to understand how something is actually implemented.
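Short of a full interactive diagram, one low-effort way to get those per-layer input/output sizes is a forward-hook dump. This is a generic PyTorch sketch, not anything LLaSA-specific, and the toy model at the bottom is just a stand-in for a real checkpoint:

```python
# Print the input/output tensor shapes of every leaf layer in a model.
import torch
import torch.nn as nn

def shape_of(x):
    return tuple(x.shape) if isinstance(x, torch.Tensor) else type(x).__name__

def trace_shapes(model, example_input):
    hooks = []
    def hook(module, inputs, output):
        ins = [shape_of(t) for t in inputs]
        print(f"{module.__class__.__name__:<12} in={ins} out={shape_of(output)}")
    for m in model.modules():
        if len(list(m.children())) == 0:  # leaf layers only
            hooks.append(m.register_forward_hook(hook))
    with torch.no_grad():
        model(example_input)
    for h in hooks:
        h.remove()

# Toy example; swap in the real model and a real input to dump every layer.
toy = nn.Sequential(nn.Embedding(1000, 64), nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 1000))
trace_shapes(toy, torch.randint(0, 1000, (1, 16)))
```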
exe34
Sounds like a solid SaaS business plan!
Odd that the page doesn't seem to link to either one:
paper: https://arxiv.org/abs/2502.04128
github: https://github.com/zhenye234/LLaSA_training