Every Flop Counts: Scaling a 300B LLM Without Premium GPUs
10 comments
March 24, 2025
rahen
I'm pretty surprised by the claimed memory usage for 300B parameters (Table 1). If we compare similar models:
- Llama 3.1 with 405B parameters: 2 TB of memory (FP32), 500 GB (FP8)
- DeepSeek R1 with 671B parameters: 1.3 TB (scaling linearly, around 600 GB for 300B parameters)
Ling claims no more than 96 GB of memory, most likely for inference. That's far more than a 20% reduction. Am I missing something?
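A rough weights-only sketch (ignoring KV cache, activations, and runtime overhead), assuming memory scales as params × bytes per parameter:

    # Weights-only estimate; real deployments need noticeably more.
    def weight_mem_gb(params_b: float, bytes_per_param: float) -> float:
        return params_b * 1e9 * bytes_per_param / 1e9  # decimal GB

    print(weight_mem_gb(405, 4))  # Llama 3.1 405B, FP32:  ~1620 GB
    print(weight_mem_gb(405, 1))  # Llama 3.1 405B, FP8:    ~405 GB
    print(weight_mem_gb(671, 2))  # DeepSeek R1 671B, FP16: ~1342 GB
    print(weight_mem_gb(300, 2))  # 300B model, FP16:       ~600 GB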
cavisne
I think they only claim their "Ling-Lite" 17B model can fit on a single 96 GB GPU; their 300B model needs 8 of them (768 GB of HBM).
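A quick sanity check of that figure, assuming BF16 weights and counting only the weights themselves:

    import math
    weights_gb = 300e9 * 2 / 1e9        # 300B params in BF16: ~600 GB
    print(math.ceil(weights_gb / 96))   # 7 devices for the weights alone;
                                        # 8 (768 GB) leaves headroom for KV cache and activations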
fxtentacle
Some of these models still produce great results when quantized to something as low as 2.7 bits per weight.
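For a rough sense of scale, assuming the 300B model and a uniform 2.7-bit quantization of the weights:

    params = 300e9
    print(params * 2.7 / 8 / 1e9)  # ~101 GB of weights -- still slightly over a single 96 GB card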
vednig
They've shared some interesting optimization techniques for bigger LLMs, that's all; the hardware isn't exactly low-powered in the power-consumption sense. Still a good read.
They never mention what hardware they're on.
Table 1 is the closest thing: specs for six devices, spanning 120-989 TFLOPS and 64-96 GB of RAM.
For comparison, an RTX 5090 is about 105 TFLOPS (FP32).
https://www.techpowerup.com/gpu-specs/geforce-rtx-5090.c4216
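Rough ratios for that comparison (with the caveat that the Table 1 figures may be quoted at a different precision than the 5090's FP32 number):

    rtx_5090_tflops = 105                     # approximate FP32 throughput
    for device_tflops in (120, 989):          # low and high end of the Table 1 range
        print(round(device_tflops / rtx_5090_tflops, 1))  # ~1.1x to ~9.4x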