Quantized Llama models with increased speed and a reduced memory footprint

127 comments

·October 24, 2024

tveita

So SpinQuant learns a rotation for activations and weights that, to my understanding, "smear" the outlier weights out so you don't get extreme values in any one weight.

Random anecdote warning - In the old days, before vector search became AI and everyone and their dog offered a vector database, I had a task that required nearest neighbour search in a decent amount of high-dimensional vectors.

I tried quantizing them to bit vectors in an index and scanning through it to get an initial set of candidates. Performance was actually quite decent - reading through RAM linearly is fast! But the selectivity wasn't great.

Somewhere along the way I found this paper[1] that iteratively finds a rotation to apply before quantization to reduce the quantization error. Very similar goal to SpinQuant, but focused on bit quantization only.

As it turns out the 'random rotation' baseline they benchmark against worked great for my use case, so I never tried implementing the fancier algorithm. But it's a pretty rare day at work that "apply a random rotation matrix to a 128-dimensional vector" is the solution to my problem.

[1] https://ieeexplore.ieee.org/abstract/document/6296665 / https://slazebni.cs.illinois.edu/publications/ITQ.pdf

derefr

> But it's a pretty rare day at work that "apply a random rotation matrix to a 128-dimensional vector" is the solution to my problem.

Funny enough, if you visualize a vector-embedding's latent-space features using that "points on the surface of a hypersphere" analogy that ML programmers like to use — and you assume a really low quantization, say, 1-bit — then you can almost picture the hypersphere surface as a black-and-white vector image, the points as arbitrary-precision vector positions where you want to place dots... and your goal as quantizing those positions to reduce the storage costs down to storing a raster bitmap.

And that problem has a name: dithering!

Oddly enough, for what may or may not be coincidental reasons, what we want in ML terms (keeping the learned associational weights between features constant) is very similar to what we want from the output of image dithering: to not allow the dots to come together to create false features or false voids.

And how do we do that? In dithering, we usually apply a set of random perturbations to the vectorized points. Which, for image dithering, just look like translations in 2D space... but, in a higher-dimensional space, might very well best be analytically modelled as rotations about the origin!

arijo

Another way to understand dithering is by smearing the frequency spectrum of the original image preventing extreme frequency values to distort the image after quantization - this can be done by applying kernel filters on the original image.

Which I think is what is happening with SpinQuant as well - a smoothing of the frequency spectrum of the model weights, confirmed by the smearing of the singular values of the weight matrices.

eirikbakke

Fascinating! Does that mean you could improve performance further with Floyd–Steinberg dithering? (I.e. instead of rotating randomly, you track accumulated quantization error and add that amount instead.)

eru

Floyd-Steinberg etc mostly look better to the human eye, but I'm not sure in what more 'objective' sense they would be better than random dithering?

rini17

But images have regular adjacent pixels to work with. Don't think the algo can be straight applied to irregularly placed points in manydimensional space.

127

The best type of dithering is done with error diffusion. There's a convolutional kernel the diffuses the error over multiple adjacent data points.

grahamj

I'm just on the edge of understanding this but if I'm visualizing this right you're talking about a point source at the center of a sphere and a bitmap indicating where all the vectors intersect the surface. But that would mean the lengths would all be the same.

Isn't it the lengths/distances to neighbors that is the main information being stored in a vector db? Or is it just that what you're talking about only concerns the angles so the lengths are not part of the discussion?

I'm a dev but still have a lot to learn about ML :)

kridsdale3

My understanding is that yes, it actually is normalized to have the lengths all be the same, and thus the angle from (hyperdimensional) 0,0,0,(...n) is all that matters. The "distance between two embeddings" is able to simply to cosign of the two angles.

arijo

Seems really intriguing could you help me grok how this random perturbations of the points of the hypersphere surface are related to smearing the model weights?

digdugdirk

I'm sorry, I don't understand the language you're speaking. English please?

(Just kidding - but if you have any recommendations for learning resources to get started being able to understand what you're talking about, I'd greatly appreciate it.)

uoaei

Rabbit hole ahoy:

https://surma.dev/things/ditherpunk/

baq

in 2024 you paste the dense comment into your favorite LLM (preferably multiple) and ask it to explain on your desired level (whatever that may be). works remarkably well for every topic I tried it with (e.g. jargon-heavy financial tweets.)

ninja3925

Interestingly, FAISS does exactly that before doing Product Quantization and it works very well (errors are much lower compared to no rotation). They call it “optimal PQ”. During training time, they iterate to find a good candidate and save the best one.

Perhaps not entirely coincidentally, FAISS is also maintained by FB.

https://faiss.ai/cpp_api/struct/structfaiss_1_1OPQMatrix.htm...

arijo

I find the geometrical intuition of rotating a vector in high dimensional space to minimize its largest values (vector basis projections) beautiful.

I'm no expert and I'm sure this has been tried by many people already - but would it be possible to reduce the computational effort instead by using SVD decomposition, spreading the singular values and then reapplying the original singular values and recomposing the matrix using the quantized versions of the SVD matrices?

govg

Tangentially related to the idea of "apply a random rotation matrix" is one where you apply a random matrix to a set of points to preserve distances between them but transform them into a lower dimensional space. This is captured by the JL Lemma [1].

[1] - https://en.wikipedia.org/wiki/Johnson%E2%80%93Lindenstrauss_...

beagle3

Actually, “apply a random matrix” is often the solution to a large dimensional space involving near neighbours.

The Johnson-Lindenstrauss lemma asserts that a multiplying by a random matrix (some conditions apply, but iirc rotation matrices satisfy them) keeps, in many senses, the distances between points even if the dimension drops very significantly (some conditions apply but usually satisfied by real world data)

This is, in fact, the theoretical underpinning of compressed sensing.

jjssmith

You might like an information-theoretic take on SpinQuant and the likes [1].

tl;dr: round((2*R)*x) is not a great idea for an R-bit quantization.

[1] https://arxiv.org/abs/2410.13780

nisten

It's pretty interesting that the new SpinQuant method did not manage to be better than good old nf4bit QLORA training (Tim Dettmers really cooked with that one).

Really appreciate that Meta published both results+model quants and didn't just make some bs claim about a new sota quant like most other bigger companies would've done.

spi

Aside from the weirdness of calling "good old" something that was released 17 months ago :-D I mean, deep learning is evolving at crazy rhythm, but you just can't assume a good paper gets written in days.

That said, as others have pointed out, and as it's also written on the blog post, they are entirely different methods. QLoRA requires access to the full training data, while theoretically you can apply SpinQuant to any given model. For example, they also apply it to Mistral, not only to their LLaMA.

(QLoRA also takes some time and compute to apply, but since SpinQuant also implies learning some weights, I don't know if it's actually faster/cheaper, too)

Aeolun

It’s a little bizarre that I feel like I’m actually starting to respect this little bit of Meta…

FuckButtons

I think meta and facebook before it have always valued a very high standard of engineering, and have also been generally pretty good about open sourcing a lot of that work in a way that allows a lot of people to work with their tools. This doesn’t seem all that out of character.

ipaddr

It's a huge company with a lot of different voices. One may create react and open source it while another would add a clause that if you sue facebook over anything your react license disappears. When they are good they are really good.

formalsystem

The naming is unfortunate but in this blog QLoRA is referring to Quantization-Aware Training with LoRA adaptor

lambda-research

I think the benefit is that SpinQuant had higher throughput and required less memory. At least according to the tables at the bottom of the article.

Definitely nice to see them not cherrypick results - makes them more believable that its not the best along all axes.

ipsum2

Those are different approaches afaict.

miven

I mean, it's no free lunch, you still need to expend significantly more compute for the QLoRA training compared to any usual PTQ method, be it SpinQuant or any other more conventional quantization approaches.

theanonymousone

May I ask if anyone has successfully used 1B and 3B models in production and if yes, in what use cases? I seem to be failing even in seemingly simpler tasks such as word translation or zero-shot classification. For example, they seem to not care about instructions to only write a response and no explanation, thus making it impossible to use them in a pipeline :/

com2kid

3B models are perfectly capable, I've had great luck with Phi 3.5.

> For example, they seem to not care about instructions to only write a response and no explanation

You need to use tools to force the model to adhere to a schema. Or you can learn to parse out the part of the response you want, both work.

You'll also need to make good use of robust examples in your initial prompt, and give lots of examples of how you want the output to look. (Yes this quickly burns up the limited context length!)

Finally, embrace the fact that these models are tuned for chat, so the more conversational you make the back and forth the less you are stretching the models abilities.

I wrote a very small blog post at https://meanderingthoughts.hashnode.dev/unlock-the-full-pote... explaining some of this.

teleforce

I wonder if CUE can help the situation in similar fashion to the DSL methods that you've described in your blog post [1]. After all CUE fundamentals are based on feature structure from the deterministic approach of NLP unlike LLM that's stochastic NLP [2],[3]. Perhaps deterministic and non-deterministic approaches is the potent combination that can effectively help reduce much of the footprint to get to the same results and being energy efficient in the process.

[1] Cue – A language for defining, generating, and validating data:

https://news.ycombinator.com/item?id=20847943

[2] Feature structure:

https://en.m.wikipedia.org/wiki/Feature_structure

[3] The Logic of CUE:

https://cuelang.org/docs/concept/the-logic-of-cue/

com2kid

On my LinkedIn post about this topic someone actually replied with a superior method of steering LLM output compared to anything else I've ever heard of, so I've decided that until I find time to implement their method, I'm not going to worry about things.

tl;dr you put into the prompt all the JSON up until what you want the LLM to say, and you set the stop token to the end token of the current JSON item (so ',' or '}' ']', whatever) and you then your code fills out the rest of the JSON syntax up until another LLM generated value is needed.

I hope that makes sense.

It is super cool, and I am pretty sure there is a way to make a generator that takes in an arbitrary JSON schema and builds a state machine to do the above.

The performance should be super fast on locally hosted models that are using context caching.

Eh I should write this up as a blog post, hope someone else implements it, and if not, just do it myself.

wswope

I’ve only toyed with them a bit, and had a similar experience - but did find I got better output by forcing them to adhere to a fixed grammar: https://github.com/ggerganov/llama.cpp/tree/master/grammars

For context, I was playing with a script to bulk download podcasts, transcribe with whisper, pass the transcription to llama.cpp to ID ads, then slice the ads out with ffmpeg. I started with the generic json_array example grammar, then iteratively tweaked it.

beoberha

For me, it was almost random if I would get a little spiel at the beginning of my response - even on the unquantized 8b instruct. Since ollama doesn’t support grammars, I was trying to get it to work where I had a prompt that summarized an article and extracted and classified certain information that I requested. Then I had another prompt that would digest the summary and spit out a structured JSON output. It was much better than trying to do it in one prompt, but still far too random even with temperature at 0. Sometimes the first prompt misclassified things. Sometimes the second prompt would include a “here’s your structured output”.

And Claude did everything perfectly ;)

BoorishBears

Why not preprompt with ```json {

jkukul

Yes, you can pre-fill the assistant's response with "```json {" or even "{" and that should increase the likelihood of getting a proper JSON in the response, but it's still not guaranteed. It's not nearly reliable enough for a production use case, even on a bigger (8B) model.

I could recommend using ollama or VLLm inference servers. They support a `response_format="json"` parameter (by implementing grammars on top of the base model). It makes it reliable for a production use, but in my experience the quality of the response decreases slightly when a grammar is applied.

scriptsmith

Yes, I've used the v3.2 3B-Instruct model in a Slack app. Specifically using vLLM, with a template: https://github.com/vllm-project/vllm/blob/main/examples/tool...

Works as expected if you provide a few system prompts with context.

accrual

Not in production, but I've used a 3B model to test a local LLM application I'm working on. I needed a full end-to-end request/response and it's a lot faster asking a 3B model than an 8B model. I could setup a test harness and replay the responses... but this was a lot simpler.

jdthedisciple

If for testing then why not just mock the whole thing for ultimate performance ... ?

nkozyra

Probably faster to use off the shelf model with llama.cpp than to mock it

itsTyrion

I've tried using 3B outside of production. Asked it to be the character needed, like 30 words and use German. Instructions were consistently ignored, sometimes sentences devolved into Gibberish or English was mixed in halfway through. Don't even want to know how lobotomized 1B is.

JohnHammersley

> For example, they seem to not care about instructions to only write a response and no explanation, thus making it impossible to use them in a pipeline

I was doing some local tidying up of recording transcripts, using a fairly long system prompt, and I saw the same behaviour you mention if the transcript I was passing in was too long -- batching it up to make sure to be under the max length prevented this.

Might not be what's happening in your case, but I mention it because it wasn't immediately obvious to me when I first saw the behaviour.

ipsum2

You can't expect a 1B model to perform as well as 7B or chatGPT, probably the best use case is speculative decoding or to use to fine tune for a specific use case.

theanonymousone

What is "speculative decoding"?

regularfry

Speculative decoding is using a small model to quickly generate a sequence that every so often you pass through a larger model to check and correct. It can be much faster than just using the larger model, with tolerably close accuracy.

formalsystem

Hi I'm Mark I work on torchao which was used for the quantization aware training and ARM kernels in this blog. If you have any questions about quantization or performance more generally feel free to let me know!

philipkglass

What was the "vanilla post-training quantization" used for comparison? There are 22 GGUF quantization variants smaller than 16 bits per weight and I can't tell which one is being compared with:

https://huggingface.co/docs/hub/en/gguf#quantization-types

It might even mean a non-GGUF quantization scheme; I'm just an intermediate user of local models, not an expert user or developer.

formalsystem

Please ignore my previous comments - I double checked with the model developers and here's the correction. Vanilla PTQ means no fancy quantization algorithm like SpinQuant, AWQ, etc. was applied. It just applied the same quantization scheme mentioned in the post (4bit per-group with g_size=32 symmetric weight, 8bit dynamic per token activation).

formalsystem

So this should be referring to w8a8 (weights and activations in 8 bit)

So this is gonna be 8 bit weights, 8 bit activations, group size of 256, symmetric quantization. Not sure how to map this to the GGUF variants because they don't mention how they don't do activation quantization

imjonse

Were there comparisons made to AWS, Smoothquant, GPTQ or other non-vanilla PTQ methods? Thanks.

Evidlo

I have a non-ML question.

In vanilla Pytorch I have the following expression:

    t.sum(values[inds] * weights)

If 'inds' is int8, I get "IndexError: tensors used as indices must be long, int, byte or bool tensors".

Is this still true if I use torchao?

formalsystem

The issue here is memory in PyTorch is byte addressable and that's a limitation we can't solve without making a lot more changes to PyTorch. But in your specific case, if you'd like to pack more data into `values` you can use a combination of clever bit shifting, torch.cat and other bit twiddling pytorch like ops to pack more data. It's a trick we use quite heavily in torchao

Evidlo

Arent int8s byte-aligned though? I thought this restriction was originally motivated by maintenance overhead of having to support more dtypes.

saagarjha

Do you ever pronounce torchao in a way that rhymes with "wow"

formalsystem

My wife calls it torch AAAW

philipkglass

These quantized models show much less degradation compared to a "vanilla post-training-quantization" but there are a bunch of PTQ schemes that people have already applied to Llama models [1]. I didn't see any details about the vanilla PTQ they used as a baseline. Has it been written about elsewhere?

[1] https://ollama.com/library/llama3.2/tags

yuvalr1

Looking at how to deploy 1B and 3B Llama models on Android for inference. Some posts online recommend using Termux (an amazing app) to have an emulated shell and then install as if it's Linux, using ollama for example. However, this forces you into a manual installation process, and also most of the people don't know what Termux is, and would be afraid to install it from F-Droid.

Maybe someone can recommend a way to deploy Llama to Android without Termux, maybe even something that can be potentially fully implemented inside an app?

I'm currently looking into compiling llama.cpp for Android and bundling it inside an app. Is that a viable path? Would love to hear from someone who tried something similar.

tugdual

I actually did something similar using llama.cpp a while back, would be curious to see the speedup with this model.

https://github.com/TugdualKerjan/bunny/tree/main

niutech

You can use MLC LLM: https://llm.mlc.ai/

antonvs

This might be of use:

https://github.com/a-ghorbani/pocketpal-ai

null

[deleted]

cmsj

It really bugs me that every time I see posts about new models, there is never any indication of how much VRAM one needs to actually run them.

qeternity

That's because it's easily calculable and also somewhat impossible to say in any meaningful sense.

Most weights are released as fp16/bf16 so 2 bytes per weight. So just double the number of parameters = the number of gigabytes of VRAM. Llama 3.1 8B ~= 16GB weights in fp16. At 4bit quantization, it would be half the number of parameters so Llama 3.1 8B ~= 4GB weights.

But this is just weights. The real issue is context and output length: how much data are you feeding in? This is where VRAM can explode, and it's entirely use-case dependent. So for a 128k context model, the range of VRAM usage is huge.

The reality is, if you're not able to quickly estimate the above, you're probably not running local models anyway.

bick_nyers

Perhaps I'm being charitable but I read OP's comment in the light of what you described with context length. Batching, context length, and attention implementation vary these numbers wildly. I can fit a 6bit quant Mistral Small (22b) on a 3090 with ~10-12k context, but Qwen2VL (7b, well 8.3b if you include vision encoder) also maxes out my 3090 VRAM with an 8bit quant and ~16k context.

I do think it would be good to include some info. on "what we expect to be common deployment scenarios, and here's some sample VRAM values".

Tangentially, whenever these models get released with fine-tuning scripts (FFT and Lora) I've yet to find a model that provides accurate information on the actual amount of VRAM required to train the model. Often times it's always 8x80GB for FFT, even for a 7B model, but you can tweak the batch sizes and DeepSpeed config. to drop that down to 4x80GB, then with some tricks (8bit Adam, Activation Checkpointing), drop it down to 2x80GB.

formalsystem

You can estimate context length impact by doing back of the envelope calculations on KV cache size: 2 * layers * attention heads * head_dim * byte_per_element * batch_size * sequence_length

Some pretty charts here https://github.com/pytorch/ao/issues/539

Oh cool! I’ve been playing with quantized llama 3B for the last week. (4-bit spinquant). The code for spinquant has been public for a bit.

It’s pretty adept at most natural language tasks (“summarize this”) and performance on iPhone is usable. It’s even decent at tool once you get the chat template right.

But it struggles with json and html syntax (correctly escaping characters), and isn’t great at planning, which makes it a bad fit for most agenetic uses.

My plan was to let llama communicate with more advanced AI’s, using natural language to offload tool use to them, but very quickly llama goes rogue and starts doing things you didn’t ask it to, like trying to delete data.

Still - the progress Meta has made here is incredible and it seems we’ll have capable on-device agents in the next generation or two.

tucnak

>But it struggles with json

You should customise your sampler to mandate JSON grammar after ```json tokens.

Grammar samplers are clever! But in the case of a missing escape character you’ll end up with a corrupted string.

Take for example: "A dog says \"Woof!\""

With a grammar, you’ll end up with "A dog says " when the model forgets to escape.

Which is valid JSON, but not what the model intended.

So it’s usually better to catch the exception and ask the model to try again.

Unless you’ve come across a sampler with backtracking? That would be cool

Evidlo

Why don't they actually say what the size of the model is in GB?

That and average inference times on common hardware is what I'm curious about.

Ardren

The last table shows memory usage and performance on an Android phone.

> Decode latency improved by 2.5x and prefill latency improved by 4.2x on average, while model size decreased by 56% and memory usage reduced by 41% on average. The benchmarks can be reproducible today via ExecuTorch Llama instructions. The table above shows results using an Android OnePlus 12 device—however, we’ve also verified similar relative performance on Samsung S24+ for 1B and 3B and Samsung S22 for 1B.

itsTyrion

Wait, so I can get incorrect information and text summaries with things added or cut off even faster and on mobile now? that's amazing.

nikolayasdf123

what's your opinion on LlamaStack?

for me it is nothing short of bad experience. it is way over-engineered with poor quality and just plain does not work, and maintainers are questionable. I would rather call HuggingFace python code for inference or anything else.

is ExecuTorch any better?

SoLoMo123

Hi, I'm Mergen and I work on ExecuTorch.

ExecuTorch is a runtime for mobile and embedded devices to run PyTorch models directly. Currently it runs pretty fast on CPU, but expanding our use-case for mobile accelerators and GPUs.

We're still in our early stages (just turned beta status). But try it out and let us know.

Regarding Llama Stack, it is built by my colleagues. What were some concrete issues have you experienced? If you have error/bug reports, I'll happy to pass along.

nikolayasdf123

will give executorch a try.

with llamastack, well making it work with CUDA for starters would be great.

it is also bloated. something that supposed to take direct 100 lines of code and a couple files, takes dozens of files, multiple frameworks, generators.. which in the end do not work at all, and nobody knows why. very obscure framework. can't believe this code is coming from Meta.

Tepix

From TFA:

> At Connect 2024 last month, we open sourced Llama 3.2 1B and 3B

No you did not. There is no source (in this case: training data) included. Stop changing the meaning of "open source", Meta!

HN

Quantized Llama models with increased speed and a reduced memory footprint

Quantized Llama models with increased speed and a reduced memory footprint