
Lossless LLM compression for efficient GPU inference via dynamic-length float

jhj

This is just a consequence of the fact that bfloat16 has a very high dynamic range which is not all used. People like hyperparameters that look like 0.01, not 10^10, even though the same fractional precision is available at each exponent. If you multiplied everything in a network (hyperparameters, initialized weights, training data, etc.) by 10^6, things would still work more or less the same, since the upper range is hardly used (with the possible exception of some small number of special functions).

Typical entropy of bfloat16 values seen in weights (and activations) is about 10-12 bits (only 65-75% or so of the value range is used in practice). Sign and mantissa bits tend to be incompressible noise.
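
If you want to check this on your own weights, a quick back-of-the-envelope sketch (illustrative only; w is whatever weight tensor you load) is to split each bfloat16 value into its sign/exponent/mantissa fields and measure the empirical entropy of each:

    import torch

    def bf16_field_entropy(w: torch.Tensor) -> dict:
        # bitcast bfloat16 weights to uint16 and pull out the three bit fields
        bits = w.detach().to(torch.bfloat16).contiguous().view(torch.uint16).flatten().to(torch.int64)
        fields = {
            "sign":     (bits >> 15) & 0x1,   # 1 bit
            "exponent": (bits >> 7) & 0xFF,   # 8 bits
            "mantissa": bits & 0x7F,          # 7 bits
        }
        def entropy(x: torch.Tensor) -> float:
            counts = torch.bincount(x).float()
            p = counts[counts > 0] / counts.sum()
            return float(-(p * p.log2()).sum())
        return {name: entropy(v) for name, v in fields.items()}

You should typically see the sign near 1 bit and the mantissa near its full 7 bits (both close to incompressible), while the exponent comes in well under its 8-bit budget, which is where the 10-12 bits of total entropy comes from.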

This has been exploited several times before in the context of both classical HPC and AI, with lossless compression work from Martin Burtscher's lab (https://userweb.cs.txstate.edu/~burtscher/), fpzip from LLNL (https://computing.llnl.gov/projects/fpzip), and my library dietgpu from 2021 (https://github.com/facebookresearch/dietgpu). We used dietgpu to speed up training on a large GPU cluster by about 10% in overall wall-clock time by losslessly compressing all data prior to send and decompressing upon receive (e.g., gradients, weights from backup, etc.); since it is lossless, the cluster still computes exactly the same thing as it did before.

Also, rANS is more efficient and easier to implement in SIMD-like instruction sets than Huffman coding. It would also reduce the latency/throughput penalties with DFloat11 (since we have to decompress before we do the arithmetic).

iandanforth

For those who don't bother to click through profiles, Jeff really knows what he's talking about. Much of Meta/FAIR + community benefits from his code.

VladVladikoff

I really love HN for this reason. Full of some of the brightest minds on the internet. Often the comments have very interesting information, instead of stupid knee jerk reactions to post titles.

vessenes

Thanks Jeff -- can you point me to something written up about rANS? All I find online is turbulence modeling solutions; I presume this is not what you're referring to.

As we know, quantizations are a critical tool for local LLM runners; RAM is typically the gating factor. Are you aware of other better lossless compression of BF16 weights out there?

The reason I ask is that DFloat11 seems relatively easy to plug into existing quantization workflows, but you seem dismissive of the paper -- I presume it's my gap in understanding, and I'd like to understand.

zorgmonkey

I don't know of any great write-ups unfortunately, but the rANS you're looking for is range asymmetric numeral systems.
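
If it helps, here's a minimal single-state, byte-wise rANS coder in Python, loosely patterned after Fabian Giesen's ryg_rans (just a sketch to show the mechanics; real GPU decoders like dietgpu interleave many states so the work maps onto SIMD lanes/warps). Symbol frequencies must sum to 1 << SCALE_BITS:

    SCALE_BITS = 12                 # total frequency = 4096
    RANS_L = 1 << 23                # lower bound of the renormalization interval

    def rans_encode(symbols, freq, cum):
        state, stream = RANS_L, []
        for s in reversed(symbols):             # rANS encodes in reverse order
            f, start = freq[s], cum[s]
            x_max = ((RANS_L >> SCALE_BITS) << 8) * f
            while state >= x_max:               # renormalize: shift out low bytes
                stream.append(state & 0xFF)
                state >>= 8
            state = ((state // f) << SCALE_BITS) + (state % f) + start
        return state, stream                    # final state + renormalization bytes

    def rans_decode(state, stream, freq, cum, slot2sym, n):
        mask = (1 << SCALE_BITS) - 1
        out = []
        for _ in range(n):
            slot = state & mask                 # the low bits identify the symbol
            s = slot2sym[slot]
            out.append(s)
            state = freq[s] * (state >> SCALE_BITS) + slot - cum[s]
            while state < RANS_L:               # renormalize: pull bytes back in
                state = (state << 8) | stream.pop()
        return out

    # toy usage: 4-symbol alphabet, frequencies sum to 4096
    freq = [2048, 1024, 512, 512]
    cum = [0, 2048, 3072, 3584]
    slot2sym = [0] * 2048 + [1] * 1024 + [2] * 512 + [3] * 512
    msg = [0, 1, 0, 2, 3, 0, 1, 0]
    state, stream = rans_encode(msg, freq, cum)
    assert rans_decode(state, list(stream), freq, cum, slot2sym, len(msg)) == msg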

hinkley

Do you think there’s a call for introducing an even smaller float that can pack more values into a SIMD register? Like a 12 bit?

badmonster

What stands out most is the practical implication: enabling lossless inference of a 405B-parameter model on a single node with 8×80GB GPUs is wild. That’s a huge unlock for research labs and startups alike that want to run frontier models without massive infrastructure costs.

latchkey

> That’s a huge unlock for research labs and startups alike that want to run frontier models without massive infrastructure costs.

Or let one of the neoclouds take care of the infrastructure costs and rent it out from them. Disclosure: I run one of them.

airstrike

Keep up the great work! We need more of you and other players.

Some unsolicited feedback: I would suggest reworking your landing page so that the language is always from your customers' perspective. Your customers want to solve a real internal problem that they have. Talking about how great your company is will always have less impact than talking about how you know what that problem is and how you intend to solve it.

Your mission is relevant to you and your investors, not to your customers. They care about themselves.

Your "quick start" should be an interactive form. I shouldn't have to remember what to put in an email to reach out to you. Make it easy for me. Also move that to the front page, provide a few "standard" packages and a custom one. Reduce the friction to clicking the CTA.

Since your pricing is transparent, you should be able to tell me what that price will be before I even submit a request. I assume you're cheaper than the competition (otherwise why would I not go with them?) so make that obvious. Check out Backblaze's website for an example page: https://www.backblaze.com/cloud-storage/pricing

Shell out a few grand and hire a designer to make your page look more professional. Something like https://oxide.computer/ but with the points above, as they also make the same mistake of making their home page read like a pitch deck.

latchkey

Fantastic unsolicited feedback, I'm definitely taking this to heart!

Website is intended to be more like documentation instead of a pitch deck or useless splash with a contact us form. I dislike sites like Oxide, I scroll past and don't read or ingest any of the fancy parts. Of course, you're right, this probably needs to be less about me. =)

Friction definitely needs to be improved. That part is being worked on right now. Our intention is to be fully self-service, so that you don't have to talk to us at all, unless you want to. Credit card and go.

We recently lowered our prices to be competitive with the rest of the market vs. focusing on people who care more about what we offer. We weren't trying to be cheaper than everyone else, we were trying to offer a better service. Lesson learned and pricing adjusted. Streisand effect, I don't like to mention the other players much.

Again, thanks!

sundarurfriend

> neoclouds

For anyone else who hadn't heard of this term:

> Neoclouds are startups specializing in AI-specific cloud computing. Unlike their larger competitors, they don’t develop proprietary chips. Instead, they rely heavily on Nvidia’s cutting-edge GPUs to power their operations. By focusing solely on AI workloads, these companies offer specialized solutions tailored to AI developers’ needs.

from https://www.tlciscreative.com/the-rise-of-neoclouds-shaping-...

latchkey

I believe that the term was first coined by SemiAnalysis in this article:

https://semianalysis.com/2024/10/03/ai-neocloud-playbook-and...

Ringz

I need your services in Cape Town South Africa. It’s hard to find good data centers here.

latchkey

Rent from us! hello@hotaisle.ai

saagarjha

That just moves the infrastructure costs to your cloud bill.

latchkey

True, but there is so much value that we provide above and beyond just a cloud bill, that I think it is worth it. This is way more than racking and stacking commodity servers and providing a ssh login.

It is novel equipment that few have ever used before outside of a relatively small HPC community. It regularly breaks and has issues (bugs) that need industry relationships to manage properly. We've had one server down for over a month now cause SMCI can't get their sh/t together to fix it. That's a $250k+ 350lbs paperweight. Good luck to any other small company that wants to negotiate that relationship.

We are offering a very valuable service by enabling easy access to some of the most powerful compute available today. How many people do you think have a good grasp of what it takes to configure RoCEv2 & 8x400G across a cluster of servers? Good luck trying to hire talent that can set that up; they already have jobs.

The capex / opex / complexity involved with deploying this level of gear is huge and only getting larger as the industry shifts to bigger/better/faster (ie: air cooling is dead). Things are moving so quickly, that equipment you purchased a year ago is now already out of date (H100 -> H200 is a great example). You're going to have to have a pretty impressive depreciation model to deploy this yourself.

I wouldn't just dismiss this as moving costs around.

Der_Einzige

4-bit quants of DeepSeek or Llama 3 405B already fit on those GPUs and are purported to have almost zero loss compared to the full model. Doesn't seem like that big of a deal given this.

miohtama

I am not an expert here, so I want to ask: what's magical about the 405B number?

daveguy

That's the size of the largest, most capable, open source models. Specifically Llama 3.1 has 405B parameters. Deepseek's largest model is 671B parameters.

mhitza

Small correction: Llama 3.1 is not an open source model, but a Llama 3.1 licensed model. Apparently neither is DeepSeek (https://huggingface.co/deepseek-ai/DeepSeek-V3/blob/main/LIC...), which I had falsely assumed was open source. Though I never considered using it, so I hadn't checked the license before.

danielmarkbruce

It's... useful right now... it's not a huge unlock in a world where model size, GPU memory size, and different precision support are changing quickly.

jhj

Unlike quantization, dimensionality reduction/low rank approximation, distillation, etc., lossless compression is an always-correct addition to any ML system, as you are computing the same thing you did before; the only questions are whether it is fast enough to not cause substantial bottlenecks and whether the achievable compression ratio is high enough to be useful.

Floating point is just an inefficient use of bits (due to excessive dynamic range), especially during training, so lossless compression will always be welcome there. Extreme quantization techniques (some of the <= 4-bit methods, say) also tend to increase entropy in the weights, limiting the applicability of lossless compression, so lossless and lossy compression (e.g., quantization) sometimes work against each other.

If you have billions of dollars in inference devices, even reducing the number of devices you need for a given workload by 5% is very useful.

striking

Is GPU memory size really changing that quickly? For that matter, is model size?

kadushka

What's rapidly changing are quantization algorithms, and hardware features to support those algorithms. For example, Blackwell GPUs support dynamic FP4 quantization with group size 16. At that group size it's close to lossless (in terms of accuracy metrics).
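
Roughly, group-wise quantization picks one scale per small block of weights and snaps each value to a 4-bit grid. Here's a toy numpy sketch of that idea with an E2M1-style grid and group size 16; this is just the concept, not the exact Blackwell format (which, as I understand it, also stores the per-group scales in a low-precision float):

    import numpy as np

    E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # representable magnitudes

    def fake_quant_fp4(x: np.ndarray, group: int = 16) -> np.ndarray:
        assert x.size % group == 0
        g = x.reshape(-1, group)
        scale = np.abs(g).max(axis=1, keepdims=True) / E2M1[-1]  # one absmax scale per group
        scale[scale == 0] = 1.0
        mag = np.abs(g) / scale
        idx = np.abs(mag[..., None] - E2M1).argmin(axis=-1)      # snap to nearest grid point
        return (np.sign(g) * E2M1[idx] * scale).reshape(x.shape)

The smaller the group, the tighter each scale tracks the local value range, which is why group size 16 loses so little; the tradeoff is the extra memory spent on the scales themselves.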

latchkey

Both AMD and Nvidia are dumping more and more memory into their GPUs.

MI300X is 192GB HBM3, MI325X is 256GB HBM3e, and MI355X should be 288GB HBM3e (and support FP4/FP6).

danielmarkbruce

Yes, yes.

Nvidia is about to release Blackwell Ultra with 288GB. Go back to maybe 2018 and the max was 16GB, if memory serves.

DeepSeek recently released a 670GB model. A couple of years ago Falcon's 180GB seemed huge.

Animats

Once this weight format war settles down, hardware can be built to support it. Presumably you want matrix multiply hardware optimized for whatever weight format turns out to be reasonably optimal.

eoerl

Optimization is post hoc here: you have to train first to be able to Huffman encode, so it's not a pure format question.

loufe

I'm so grateful to live through such exciting times. I can open HN every two to some exciting news about ML/transformer models. I really should read more into it, but does llama.cpp use a "custom kernel" per se, with cuBLAS, or is it just making good use of the cuBLAS kernels?

jonplackett

It’s funny that you’re missing the time frame from your sentence.

2 weeks? Two months? Two days? Two minutes?

All of the above are true sometimes! Exciting times indeed.

aseligman

Some additional context: many real-world agent use cases struggle to balance quality, cost, and performance. This technique can help avoid the tradeoffs that quantization techniques introduce, including unpredictable results while you try to cost-optimize an agent. In some cases the cost savings from DFloat11 can be significant as you squeeze into more affordable GPUs.

* I work with xmad.ai

jsemrau

I still hold the opinion that ternary instead of binary would lead to an even higher degree of compression.

xmasotto

The underlying memory is still binary, or were you proposing an entirely new computer architecture with ternary gates?

thund

Is this different than ZipNN? https://arxiv.org/pdf/2411.05239

I see it mentioned but can’t understand if it’s based on it or different/better…

jhj

Not really, it's just adding some data transposition (coalescing individual bytes from the data words together) and an option to use an LZ/dictionary-type compressor to compress redundant things. But an LZ-type compressor doesn't make much sense on NN weights, I think, since they are not as redundant as most text data with many repeats, and the space of possible dictionary matches is pretty small: unless the data is highly sparse, there may not be many repetitions you can leverage to justify the dictionary overhead.

If you add an LZ-type compressor and have this be in the critical path for inference, then decompression will be a lot slower. It would be best to fuse decompression with the compute kernels (e.g., a GEMM that performs decompression on each tile before the arithmetic), and the simpler the decompression routine, the easier this will be.
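
Purely as a structural sketch (host-side Python with numpy, nothing like a real fused CUDA kernel), the "decompress each tile right before the math" idea looks like this, with decompress_tile standing in for whatever decoder you fuse in:

    import numpy as np

    def fused_gemm(x, compressed_w_tiles, decompress_tile, tile_k, n_cols):
        # y = x @ W, where W is stored as compressed K-tiles of shape (tile_k, n_cols)
        y = np.zeros((x.shape[0], n_cols), dtype=np.float32)
        for t, blob in enumerate(compressed_w_tiles):
            w_tile = decompress_tile(blob)                  # ideally stays in shared memory/registers
            y += x[:, t * tile_k:(t + 1) * tile_k] @ w_tile
        return y

The whole point of fusing is that the decompressed tile never has to round-trip through global memory, so the decompression stays off the DRAM bandwidth path.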

yjftsjthsd-h

> Compared to a potential alternative of offloading parts of an uncompressed model to the CPU to meet memory constraints, DFloat11 achieves 1.9-38.8x higher throughput in token generation. With a fixed GPU memory budget, DFloat11 enables 5.3-13.17x longer context lengths than uncompressed models.

The context length alone probably makes it worthwhile even if your models fit in memory, but I'm curious whether it improves tokens/sec even when running entirely on GPU, since in my very amateur understanding LLM inference tends to be constrained by memory bandwidth?

philjohn

My mental model says it might, much like how DoubleSpace in DOS slightly sped up loading data from slow hard drives.

hnuser123456

If the model is 70% the size, it will be 1/0.7 = 1.43x the speed.

wills_forward

So this could universally decrease the memory requirements of un-quantized LLMs by 30%? Seems big if true.

moffkalast

Not as big when Q8 quantization is already considered overkill and cuts it down to 50% (and a flat 2x speed boost without any additional compute overhead, mind you), and the more common Q4KM is more like 30%. Definitely interesting if it can be added to existing quantization, but K-quants already use different precision levels for different layers depending on general perplexity impact, which is similar to the entropy metric they use here, e.g. Q6 using a mix of 4 bits and 8 bits. And that's not even considering calibrated imatrix, which does something conceptually similar to FFT to compress even further.

janalsncm

Quantization is not lossless.

danielmarkbruce

Nobody really cares if it meets a strict definition of lossless.

mountainriver

Is it possible to run this on new models? It seems like the code is only for inference, unless I'm misunderstanding.

aazo11

This is a huge unlock for on-device inference. The download time of larger models makes local inference unusable for non-technical users.

luotuoshangdui

Does it affect speed?