
Neural audio codecs: how to get audio into LLMs

miki123211

> Try asking any of them “Am I speaking in a low voice or a high voice?” in a high-pitched voice, and they won’t be able to tell you.

I wonder how much of that is LLMs being bad, and how much is LLMs being (over) aligned not to do it.

AFAIK, ChatGPT's voice mode had to have a lot of safeguards put on it to prevent music generation, accent matching (if you sound Indian, it shouldn't also sound Indian), and assuming ethnicity / biasing based on accents.

It doesn't seem that impossible to me that some of these behaviors have been aligned out of these models out of an abundance of caution.

tsol

Did they respond differently depending on what race they thought you were? I'm surprised they would even do that honestly. I thought they were trained on text conversations which presumably wouldn't have any of that to learn from.

OisinMoran

You can often tell where someone is from from text alone! There are plenty of idiosyncrasies even in how different English speaking countries use the language.

anotherhue

Ah stop

thwarted

If it did, it responded based on the accent it picked up on, not race; race and accent are orthogonal, and correlation does not imply causation.

sbrother

I don't think it's just safeguards; they really don't seem to understand pitch at all. I tried asking ChatGPT's advanced voice mode to recognize a tune I was humming, and it insisted it was Beethoven's 5th -- multiple times. I think it must have basically tokenized my humming to "dun dun dun duuun".

bigzyg33k

Advanced voice mode operates on audio tokens directly; it doesn't transcribe the audio into "text tokens" as an intermediate step the way the original version of voice mode did.

oezi

Absolutely correct! My simple test is whether it can tell the American and British English pronunciations of "tomato" and "potato" apart. So far it can't.

sbrother

Right, but either whatever audio tokenization it's doing doesn't encode pitch, or there's ~nothing in the training set where pitch is relevant.

cubefox

But they behave just like models that use text tokens internally, as is also pointed out at the end of the article.

idonotknowwhy

Qwen3 Omni's transcriber can do this. It can describe the voice and emotion very well.

85392_school

I've also had luck with Gemini. If I made a few noises and asked which one was higher pitched, it could easily tell.

bob1029

Why not normal audio codecs? How are JPEG and MP3 (i.e., DCT/MDCT) not a reasonable way to go about tokenizing spatial and time domain signals for these kinds of models?

Each MP3 frame is entirely self-contained and can completely reconstruct a few tens of milliseconds of original audio. It does not require other frames to do this. I think this is the most important element. At 128kbps CBR, each MP3 frame is ~418 bytes and covers ~26 milliseconds of time. This is a reduction of 10-11x over the raw PCM waveform. MP3 is also designed to eliminate the information that humans don't seem to care about.
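A quick back-of-the-envelope check of those numbers (a sketch assuming a 44.1 kHz, 16-bit stereo source and ignoring frame padding):

    # Rough sanity check of the MP3 frame numbers above.
    # Assumptions: MPEG-1 Layer III, 128 kbps CBR, 44.1 kHz, 16-bit stereo PCM source.
    SAMPLE_RATE = 44_100          # Hz
    BITRATE = 128_000             # bits per second
    SAMPLES_PER_FRAME = 1_152     # samples per MPEG-1 Layer III frame

    frame_ms = 1000 * SAMPLES_PER_FRAME / SAMPLE_RATE    # ~26.1 ms of audio per frame
    frame_bytes = 144 * BITRATE / SAMPLE_RATE            # ~418 bytes per frame (no padding)
    pcm_bytes = SAMPLES_PER_FRAME * 2 * 2                # 16 bits x 2 channels = 4608 bytes

    print(f"{frame_ms:.1f} ms/frame, {frame_bytes:.0f} bytes/frame, "
          f"{pcm_bytes / frame_bytes:.1f}x smaller than raw PCM")
    # -> 26.1 ms/frame, 418 bytes/frame, 11.0x smaller than raw PCM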

I don't know if it's possible to use 400-byte tokens in a transformer model, but I would be very tempted to try.

WithinReason

JPEG is a good idea:

The simple, elegant approach of training convolutional neural networks (CNNs) directly from RGB pixels has enjoyed overwhelming empirical success. But can more performance be squeezed out of networks by using different input representations? In this paper we propose and explore a simple idea: train CNNs directly on the blockwise discrete cosine transform (DCT) coefficients computed and available in the middle of the JPEG codec. Intuitively, when processing JPEG images using CNNs, it seems unnecessary to decompress a blockwise frequency representation to an expanded pixel representation, shuffle it from CPU to GPU, and then process it with a CNN that will learn something similar to a transform back to frequency representation in its first layers. Why not skip both steps and feed the frequency domain into the network directly? In this paper we modify \libjpeg to produce DCT coefficients directly, modify a ResNet-50 network to accommodate the differently sized and strided input, and evaluate performance on ImageNet. We find networks that are both faster and more accurate, as well as networks with about the same accuracy but 1.77x faster than ResNet-50.

https://proceedings.neurips.cc/paper_files/paper/2018/file/7...

I suspect MP3 is also a good idea.

PaulDavisThe1st

The approach in TFA encodes into a 32-dimensional space. I suspect this is significantly more dimensions than any psycho-acoustic compression algorithm uses. Also, throwing away information that our hearing systems can't process very well is not particularly useful if your goal is speech (or more generally, audio) synthesis from scratch.

bob1029

> throwing away information that our hearing systems can't process very well is not particularly useful if your goal is speech (or more generally, audio) synthesis from scratch.

I'm not sure I follow. If there is a set of tokens that the average human cannot perceive, why wouldn't we want to eliminate them from the search space? Who is the target audience for this model?

PaulDavisThe1st

Humans that read (at least) Indo-European languages can read texts in their native language with all the vowels removed. Does that suggest that it would be a good idea to remove the vowels from text before using it for training text-based LLMs ?

Presumably you want to train on as rich a set of data as possible, even if some of that data is redundant or irrelevant when it comes to human perception.

542354234235

I imagine it would be like if there were Rosetta Stones of text, written with a language you could read and a language you couldn't. For your purposes, discarding the text you can't read would be fine and you wouldn't lose anything. But if you were ingesting a bunch into an LLM, the additional text would give the LLM more context and help it make connections and relate words more accurately, even if you never were going to have it output anything in the language you don't understand.

The inaudible sounds add context and additional datapoints on how the audible sounds are related.

CaptainOfCoit

Maybe because things outside our audible range could influence things inside our audible range?

ACCount37

You can try to train an adapter from a raw 400-byte MP3 frame to an embedding for a given LLM (4096+ floating point numbers, exact precision varies).

But you'd need that information to be digestible for a neural network. Otherwise, you'll have a very hard time getting that adapter to work.

As a rule: neural networks love highly redundant data, and hate highly compressed data at their inputs. Tokenized text good, GZIP compressed bytestream bad. But who knows, really. It's a rule of thumb, not a mathematical law. So you could have some success getting that MP3-based adapter to work. I've seen weirder shit work.
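A minimal sketch of what such an adapter could look like, in PyTorch. The 400-byte frame size, the hidden sizes, and the 4096-dim target embedding are all assumptions for illustration, not any particular model's real dimensions:

    import torch
    import torch.nn as nn

    class Mp3FrameAdapter(nn.Module):
        """Maps one raw ~400-byte MP3 frame to a single LLM embedding vector.

        Purely illustrative; as noted above, raw compressed bytes may simply be
        too information-dense for an adapter like this to learn from.
        """
        def __init__(self, frame_bytes: int = 400, embed_dim: int = 4096):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(frame_bytes, 2048),
                nn.GELU(),
                nn.Linear(2048, 2048),
                nn.GELU(),
                nn.Linear(2048, embed_dim),
            )

        def forward(self, frames: torch.Tensor) -> torch.Tensor:
            # frames: (batch, 400) raw byte values in [0, 255]; scale to [-1, 1].
            x = frames.float() / 127.5 - 1.0
            return self.net(x)

    adapter = Mp3FrameAdapter()
    dummy_frames = torch.randint(0, 256, (8, 400))   # a batch of 8 fake MP3 frames
    print(adapter(dummy_frames).shape)                # torch.Size([8, 4096])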

cubefox

Well, I believe language models usually use 2-byte (16-bit) tokens, which corresponds to a vocabulary of 2^16 = 65,536 possible tokens. With 400 bytes per token, the vocabulary would be 2^(400*8), which is an astronomically large number. Way too large to be practical, I assume.
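For a sense of scale (illustrative arithmetic only):

    print(2 ** 16)                   # 65536 -- a perfectly ordinary vocabulary size
    print(len(str(2 ** (400 * 8))))  # 964 -- decimal digits in the count of possible 400-byte tokens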

trollbridge

An ongoing question I have is why effort wasn't put into tokenising speech (instead of transcribed words) and then making an LLM out of that. There are huge amounts of speech available to train on.

nmfisher

The article is talking about doing exactly that. The key question is how to convert an inherently continuous signal (speech/audio) into a discrete set of tokens. A single window of audio is usually somewhere between 10ms and 100ms. It's difficult to squish all that information down to a single "token" that represents the semantic and acoustic content for that window.

That's why residual vector quantization is a useful technique - using multiple dictionaries to quantize a single timeslice, each conditioned on the previous residual level. You can also quantize a signal at different frequencies.
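A minimal sketch of the residual vector quantization mechanics in NumPy; the codebooks here are random for illustration, whereas a real codec learns them end to end:

    import numpy as np

    rng = np.random.default_rng(0)
    dim, codebook_size, n_levels = 32, 256, 4

    # In a real codec these codebooks are learned; random ones just show the mechanics.
    codebooks = rng.normal(size=(n_levels, codebook_size, dim))

    def rvq_encode(x, codebooks):
        """Quantize one timeslice embedding into one index per residual level."""
        residual = x.copy()
        indices = []
        for cb in codebooks:
            # Pick the codeword closest to whatever the previous levels left over.
            idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
            indices.append(idx)
            residual = residual - cb[idx]
        return indices

    def rvq_decode(indices, codebooks):
        """Reconstruction is simply the sum of the chosen codewords."""
        return sum(cb[idx] for cb, idx in zip(codebooks, indices))

    x = rng.normal(size=dim)              # stand-in for one timeslice's embedding
    codes = rvq_encode(x, codebooks)      # n_levels discrete tokens for this timeslice
    x_hat = rvq_decode(codes, codebooks)
    print(codes, np.linalg.norm(x - x_hat))  # with learned codebooks, more levels means less error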

There are samples towards the end of the post of their LLM trained on their Mimi audio codec.

CGMthrowaway

> The key question is how to convert an inherently continuous signal (speech/audio) into a discrete set of tokens. A single window of audio is usually somewhere between 10ms and 100ms. It's difficult to squish all that information down to a single "token" that represents the semantic and acoustic content for that window.

I read the article and confess some of the modeling parts were above my comprehension. But as an audio engineer, I would like to add that the "key question" you describe is solved, just not applied to transformer models (?).

An experienced engineer can look at a waveform in a DAW and identify specific consonants, vowels, specific words, etc quite fluently. And with tools like Melodyne - which already quantize audio semantically - they can identify (and manipulate) pitch and formants as well, turning an O vowel into an E vowel, or changing the inflection of a phrase (up-speak vs down-speak, for example).

I don't know how to apply this to a neural codec, but it seems like it shouldn't be that hard (that's my naivete coming through)

jampekka

> An experienced engineer can look at a waveform in a DAW and identify specific consonants, vowels, specific words, etc quite fluently.

DAWs' rendered waveforms have so little information that such identification is likely impossible even in theory. Telling apart plosives and vowels maybe, but not much more than that.

I work with phoneticians and they can (sometimes) read even words from suitably scaled spectrograms, but that's a lot more information than in waveforms.

PaulDavisThe1st

> An experienced engineer can look at a waveform in a DAW and identify specific consonants, vowels, specific words, etc quite fluently.

As an experienced DAW author, I very, very much doubt this.

What can be done relatively easily is to "see", or rather "follow along" in, the waveform while listening to the audio. But I read your claim as being that someone could look at the waveform (which is already decimated from the original) and identify words or phonemes without hearing the associated audio. I am extremely skeptical that there is anyone anywhere in the world who can do this.

duped

> the key question is how to convert an inherently continuous signal (speech/audio) into a discrete set of tokens

Did Claude Shannon not answer this question in 1948? You need at least 1 bit per 6 dB of dynamic range for each symbol, and 2B symbols per second, where B is the bandwidth of the signal.

Compression techniques are all about getting below that fundamental limit, but it's not like this is an unsolved problem. Or is 1 kbaud too much for LLMs?
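The back-of-the-envelope version of that argument, assuming telephone-band speech with roughly 4 kHz of bandwidth and 96 dB of dynamic range:

    # Nyquist/quantization arithmetic for a "raw" discrete representation of speech.
    # Assumed numbers: ~4 kHz bandwidth (telephone band), 96 dB dynamic range.
    bandwidth_hz = 4_000
    dynamic_range_db = 96

    symbols_per_sec = 2 * bandwidth_hz       # 8000 baud (Nyquist rate)
    bits_per_symbol = dynamic_range_db / 6   # ~1 bit per 6 dB -> 16 bits
    raw_bps = symbols_per_sec * bits_per_symbol

    print(f"{symbols_per_sec} symbols/s x {bits_per_symbol:.0f} bits = {raw_bps / 1000:.0f} kbps")
    # -> 8000 symbols/s x 16 bits = 128 kbps, i.e. plain 16-bit, 8 kHz PCM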

ca_tech

There is data, but nowhere near the amount of written language, which is fairly normalized and doesn't require accounting for additional features such as language, dialect, intonation, facial expression, and hand gestures. Speech-to-text is used as the translation layer because it throws many of those other features away and contextualizes the speech into a set of tokens that are much more efficient to map between languages.

mohsen1

It costs more to train on audio tokens, but I'm sure we will get there. Training a model on the transcript of a YouTube lecture vs. training on the audio of it will make a difference.

benob

Audio tokenization consumes at least 4x as many tokens as text, so there is an efficiency problem to start with. And then, is there enough audio data to train an LLM from scratch?
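A rough illustration of where a multiplier like that comes from, with assumed numbers (a codec emitting ~12.5 frames per second, speech at ~150 words per minute, ~0.75 words per text token); none of these figures are taken from the article:

    # Illustrative only -- none of these numbers describe the article's codec specifically.
    codec_frames_per_sec = 12.5    # assumed audio-token rate (one token per frame)
    words_per_minute = 150         # typical conversational speech rate
    words_per_text_token = 0.75    # common rule of thumb for BPE tokenizers

    text_tokens_per_sec = (words_per_minute / 60) / words_per_text_token
    print(codec_frames_per_sec / text_tokens_per_sec)   # ~3.75x more tokens for audio

And if the codec emits several residual codebook entries per frame, the multiplier grows accordingly.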

542354234235

Don't we have tens of thousands of hours (hundreds of thousands?) of closed-captioned TV shows and movies? How many hours of news broadcasts with transcripts do we have? Maybe I just don't understand what is needed, but it seems like we have a lot of data to work with.

roboror

Sure but that needs to be licensed

cruffle_duffle

Correct me if I’m wrong, but you need more than just closed captions. You need precise timing too. I’d think you’d need the text to line up exactly with the audio, so that when the voice makes an “A” sound, the text it aligns with is “A” as well.

So while having the closed captions saves some of the work, there is probably much more needed to get everything lined up.

But I’m absolutely not an expert at all. In fact, this is the first time I’ve ever even thought about it!

trollbridge

Start an MVNO that offers cheaper phone plans and train on all those phone calls.

There are big libraries of old speeches.

Simply capture all current radio/TV transmissions and train on those (we've already established copyright doesn't apply to LLM training, right?).

miki123211

> Start an MVNO that offers cheaper phone plans and train on all those phone calls.

Q: What is 2+2?

A: The warranty for your car has expired...

MichealCodes

I don't think we've had the transformer moment for audio training yet, but yes, in theory audio-first models will be much more capable.

trollbridge

Particularly interesting would be transformations between tokenised audio and tokenised text.

I recall someone telling me once that up to 90% of communication can be non-verbal, so when an LLM sticks to just text, it's only getting 10% of the data.

crazygringo

This is fascinating.

Obviously working directly with audio is vastly more complex than with text.

But it is very exciting to see how part of making LLMs work natively with speech is finding a codec that is maximally efficient at encoding speech.

I even have to wonder if, at some point, we ultimately create a popular voice codec usable with LLMs based not on the Fourier transform or similar, but rather on some kind of set of physical parameters describing vocal cord shape, tongue position, throat/chest/mouth shape, etc.

I can imagine such a model being arrived at statistically (determining the necessary number of parameters), and then almost becoming "hard-coded" as a standard since human anatomy doesn't change much there, beyond certain ranges.

I think it's called formant speech encoding, and it would be interesting if LLMs wind up massively advancing that field, since historically I think it's had more to do with speech synthesis than audio compression.

quinndupont

There’s a long history of attempts at artificial speech that take this approach, recreating mouth parts and vibrating air. They are all pretty silly, like this work, which fails to understand how writing isn’t just a derivative of speech.

crazygringo

> They are all pretty silly,

Huh? How?

> like this work which fails to understand how writing isn’t just a derivative of speech.

The whole point of the article is that writing isn't just a derivative of speech. It's in the introduction.

duped

In speech coding/synthesis this is called a "source-filter" model (it decomposes speech production into a sound generator at the vocal folds and a filter in the vocal tract, and parameterizes both), and it's actually older than Tukey and Cooley's rediscovery of the FFT.
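A minimal sketch of that source-filter decomposition using linear prediction; the synthetic "vowel" and the use of librosa's LPC are illustrative choices, not a description of any particular system:

    import numpy as np
    from scipy.signal import lfilter
    import librosa

    sr = 16_000
    # "Source": an impulse train at ~120 Hz, a crude stand-in for vocal-fold pulses.
    source = np.zeros(sr)
    source[:: sr // 120] = 1.0

    # "Filter": two resonances (formant-like peaks), a crude stand-in for the vocal tract.
    def resonator(freq_hz, bandwidth_hz, sr):
        r = np.exp(-np.pi * bandwidth_hz / sr)
        theta = 2 * np.pi * freq_hz / sr
        return [1.0], [1.0, -2 * r * np.cos(theta), r * r]

    b1, a1 = resonator(700, 130, sr)
    b2, a2 = resonator(1200, 150, sr)
    speech = lfilter(b2, a2, lfilter(b1, a1, source))

    # Analysis: estimate the vocal-tract filter with LPC, recover the source as the
    # prediction residual (inverse filtering), then resynthesize by running the
    # recovered source back through the estimated filter.
    a_lpc = librosa.lpc(speech, order=12)
    residual = lfilter(a_lpc, [1.0], speech)   # estimated source / excitation
    resynth = lfilter([1.0], a_lpc, residual)  # source pushed back through the filter

    print(np.allclose(speech, resynth))        # the decomposition round-trips numerically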

robviren

This has got to be one of the most visually pleasing explanations I have seen of these concepts. Congrats!

I attempted some similar VQ-VAE work, instead trying to tokenize rendered text. I was curious if I could make a visual LLM working on 10 pt rendered font, but I also tried using PDF sources. The basic idea was to do what more advanced diffusion image models can do when they generate images of text: make a dedicated image-text diffusion model that does completions. Further, I wondered if I could embed things like document type and language, so you could have a latent representation of text more abstracted than current dictionary tokenizers. Learned a lot, and thought it was all beautifully displayed in this post.

quinndupont

Y’all need to learn about the history and development of spoken language and writing. Writing isn’t just a copy or derivation of speech. LLMs work because of the conceptual characteristics of writing (consider the distinctions between ideographic, logographic, alphabetical…). What a sloppy mess!

Read some Wittgenstein and Goodman, but especially Derrida, who calls this logocentrism.

mondainx

Thanks for sharing this well-written post, which I will share with my team; we just recently started using audio/voice in our AI suite, and the details herein will be helpful and informative.

amelius

> Many LLMs have voice interfaces, but they usually work by transcribing your speech, generating the answer in text, and using text-to-speech to read the response out loud. That’s perfectly fine in many cases (...), but it’s a wrapper, not real speech understanding.

But I can say the same about tokenization. LLMs first convert groups of characters to tokens, then use that to generate tokens, and then convert the tokens back to characters. That's not real understanding! If LLMs are so smart, we should be able to skip the tokenization step.

Workaccount2

Nothing is real understanding because we have no benchmark for understanding because we don't mechanistically know what understanding is. The best we have is people "vibe knowing" a benchmark that they made up on the spot.

hbarka

>> So now that you’re convinced that audio LLMs are the path to AGI…

Someone please explain if the author was being cheeky or serious

bkitano19

Awesome post!