Every Flop Counts: Scaling a 300B LLM Without Premium GPUs

flowerthoughts

They never mention what hardware they're on.

Table 1 is the closest thing. Device specs for six devices: 120-989 TFLOPS and 64-96 GB RAM.

An RTX 5090 is about 105 TFLOPS (FP32).

https://www.techpowerup.com/gpu-specs/geforce-rtx-5090.c4216

bshark

The 96 GB (HBM2e) SKU is the PPU from T-Head Semiconductor (basically a subsidiary of Alibaba). Its specs are very similar to the H20. Other chips they were using include the Huawei Ascend 910B (64 GB) and maybe other domestically designed chips.

boulos

I was surprised not to see a Kunlun P800 there.

rahen

I'm pretty surprised by the claimed memory usage for 300B parameters (table 1). If we compare similar models:

- Llama 3.1 with 405B parameters: 2 TB of memory (FP32), 500 GB (FP8)

- DeepSeek R1 with 671B parameters: 1.3 TB (scaling linearly, around 600 GB for 300B parameters)

Ling claims no more than 96 GB of memory, most likely for inference. That's far more than a 20% reduction. Am I missing something?
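As a sanity check, a weights-only back-of-the-envelope for the figures above (the bytes-per-parameter values are assumptions; KV cache, activations, and framework overhead are ignored):

    def param_memory_gb(n_params: float, bytes_per_param: float) -> float:
        """Weights-only footprint in GB."""
        return n_params * bytes_per_param / 1e9

    print(param_memory_gb(405e9, 4))  # Llama 3.1 405B, FP32 -> ~1620 GB (roughly the 2 TB cited)
    print(param_memory_gb(671e9, 2))  # DeepSeek R1 671B, BF16 -> ~1342 GB (~1.3 TB)
    print(param_memory_gb(300e9, 2))  # 300B params, BF16 -> ~600 GB
    print(param_memory_gb(300e9, 1))  # 300B params, FP8  -> ~300 GB, still well above 96 GB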

cavisne

I think they only claim that their "Ling-Lite" 17B model can fit on a single 96 GB GPU; their 300B model needs 8 of them (768 GB of HBM).
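A quick capacity check of that configuration (assuming 2 bytes per parameter; the headroom figure ignores KV cache and optimizer state):

    gpus = 8
    hbm_per_gpu_gb = 96
    total_hbm_gb = gpus * hbm_per_gpu_gb   # 768 GB of HBM across the group
    weights_bf16_gb = 300e9 * 2 / 1e9      # ~600 GB of weights at 2 bytes/param
    print(total_hbm_gb, weights_bf16_gb, total_hbm_gb - weights_bf16_gb)  # 768 600.0 168.0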

fxtentacle

Some of these models still produce great results at something as low as 2.7 bits per weight.
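Rough arithmetic on what ~2.7 bits per weight means for a 300B-parameter model (a generic estimate, not a figure from the article):

    params = 300e9
    print(params * 16  / 8 / 1e9)   # ~600 GB of weights at 16-bit
    print(params * 2.7 / 8 / 1e9)   # ~101 GB at 2.7 bits/weight, close to a single 96 GB card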

vednig

They've shared some interesting optimization techniques for bigger LLMs, that's all; it's not about low-powered devices in the sense of power consumption. Still a good read.

osti

I think this is the one where they train an LLM without NVIDIA GPUs.

cavisne

They talk about CUDA-level tracing in their framework. I assume it's just consumer GPUs that Nvidia says aren't meant to be used in datacenters.
