SplitQuantV2: Enhancing Low-Bit Quantization of LLMs Without GPUs
9 comments
March 26, 2025
imtringued
Ey7NFZ3P0nzAe
> all it does is clamp negative values to zero. How could a network use that for learning?
This "simple" effect is actually huge because it allows non linear mapping between input and output. This, is completely changing the size of what's learnable.
mentalgear
I feel like for many tasks there's a certain "good enough" threshold where local small LMs do just as well, but privately, with no cloud LLM needed. I think the future is mostly on-device SLMs and their agentic coordination.
In that sense, a local agentic framework (js/ts based) would soon be very relevant.
digdugdirk
Any reason why you're calling out a need for a framework to be js/ts based? There's plenty of python frameworks in active development, some of which have js bindings/libraries.
mentalgear
Minimal setup. Anything Python-based requires the host machine to have the correct Python version, package manager, and libraries installed (which is far more than normal users can manage), or it has to be bundled into a standalone Python executable (which is big!).
Web-native technology requires minimal setup, since it can basically run as-is in any browser (or Electron).
PaulHoule
uv has revolutionized the Python situation, mostly.
I recently updated YOShInOn's Python environment to be repeatable. The Python packaging part was pretty simple, but getting CUDA running in WSL2 was a little tricky. It turns out my "game ready" NVIDIA drivers on Windows somehow install a certain version of the base CUDA libs in WSL. You have to install another five libraries with deb packaging inside the WSL, which is not too hard, but I realized I had a version mismatch about a third of the way through and decided to barrel ahead anyway. I installed most of the debs manually but got fatigued and installed the last one automatically. Somehow it all works, so I'm not messing with it, but I am a bit intimidated about what to do if I have problems and need to tear it down.
Extremely low-bit quantization makes me curious about why it is so effective.
Why is it better to run a bigger model with more parameters at lower precision?
Obviously more parameters are better, but why is that the case exactly? For that you need to understand that a transformer layer consists of the self-attention mechanism followed by a bog-standard feedforward network (usually multiple MLP layers). Most of the parameters live there.
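For a rough sense of where the parameters sit, here's a back-of-the-envelope sketch in Python (the d_model value and the 4x MLP expansion factor are illustrative assumptions, not numbers from any particular model):

```python
# Toy parameter count for one transformer layer (sketch; d_model and the
# 4x MLP expansion factor are assumptions for illustration only).
def layer_params(d_model: int, mlp_ratio: int = 4) -> tuple[int, int]:
    # Self-attention: Q, K, V and output projections, each d_model x d_model.
    attn = 4 * d_model * d_model
    # Feedforward: up-projection d_model -> mlp_ratio*d_model and back down.
    mlp = 2 * d_model * (mlp_ratio * d_model)
    return attn, mlp

attn, mlp = layer_params(4096)
print(f"attention: {attn:,}  mlp: {mlp:,}  mlp share: {mlp / (attn + mlp):.0%}")
# attention: 67,108,864  mlp: 134,217,728  mlp share: 67%
```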
My personal theory is based on the fact that ReLU is the simplest possible activation function that works, yet all it does is clamp negative values to zero. How could a network use that for learning?
The answer is quite simple. If you have negative weights $w_i$ and take the sum $\sum_i w_i x_i + b$ with a positive bias $b$, then feed that into ReLU, you get a boolean-like function that switches off as soon as the magnitude of the negative weighted sum exceeds the bias. This means you can build a comparison operator out of ReLU. Take it a few steps further and you can probably implement any arbitrary boolean function directly in each row of your MLP.
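Here's a toy sketch of that idea (the weights and bias are made-up values, not anything learned): with negative weights and a positive bias, a single ReLU unit on 0/1 inputs behaves like a NAND gate, and NAND alone is enough to build any boolean function.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

# Negative weights plus a positive bias turn a ReLU unit into a threshold:
# the output is nonzero only while sum(w_i * x_i) + b > 0, i.e. while the
# inputs are "small enough". Weights and bias are made up for illustration.
w = np.array([-1.0, -1.0])
b = 1.5

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    y = relu(w @ np.array(x, dtype=float) + b)
    print(x, "->", y > 0)   # acts like NOT (x1 AND x2), i.e. NAND
```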
This means that most of the precision is only really needed during training, because you want a nicely continuously differentiable function for gradient descent, but the model itself is mostly operating on a form of fuzzy boolean logic.
This means that the embedding length, basically the width of a token's representation, plays a key role in the ability to encode these mostly binary concepts.
Bigger models have wider embeddings. That's why bigger models with low-bit quantization outperform smaller models with high-bit quantization.
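As a rough illustration of that claim (random weights and inputs, and a naive uniform quantizer rather than SplitQuantV2's actual scheme): if what mostly matters is each ReLU unit's on/off decision, that decision tends to survive coarse weights.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def fake_quant(w, bits=3):
    # Symmetric uniform quantizer: snap weights to a coarse grid
    # (a toy scheme for illustration, not the paper's method).
    levels = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / levels
    return np.round(w / scale) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=16)          # made-up weights for one ReLU unit
b = 0.5
x = rng.normal(size=(1000, 16))  # made-up inputs

full = relu(x @ w + b) > 0
low  = relu(x @ fake_quant(w) + b) > 0
print("fraction of on/off decisions that survive 3-bit weights:", (full == low).mean())
```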