'I paid for the whole GPU, I am going to use the whole GPU'

mooreds

The subtitle (which is important but was too long for the HN submission) is "A high-level guide to GPU utilization".

J_Shelby_J

I’m at the integration testing and benchmarking phase of a rust crate for allocating LLMs to GPUs and System RAM. The impetus is that single models are limited in what they can achieve and more complex workflows require LLM swarms. Think of a lot of smaller models doing reasoning steps or tasks like search and then a big model to summarize it all.

It allocates a range of quants for a model across N devices, using DFS to find the best allocation for the given input of models. Best here means the most tokens per second and the least time to initialize the allocation. I keep track of memory capacity, PCIe bandwidth, and link bandwidth (including NVLink).
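
The core of the search is roughly this -- a simplified Python sketch with made-up names (the real crate is Rust and also scores PCIe/NVLink bandwidth and initialization time):

    from dataclasses import dataclass

    @dataclass
    class Device:
        name: str
        free_mem_gb: float

    @dataclass
    class Quant:
        model: str
        size_gb: float
        score: float  # rough proxy for expected tokens/sec

    def best_allocation(quants, devices, placed=None, best=None):
        """DFS over (quant, device) placements, keeping the highest-scoring
        assignment that fits within every device's remaining memory."""
        placed = placed or []
        if not quants:
            total = sum(q.score for q, _ in placed)
            if best is None or total > best[0]:
                best = (total, list(placed))
            return best
        q, rest = quants[0], quants[1:]
        for d in devices:
            if q.size_gb <= d.free_mem_gb:
                d.free_mem_gb -= q.size_gb   # tentatively place q on d
                placed.append((q, d))
                best = best_allocation(rest, devices, placed, best)
                placed.pop()                 # backtrack
                d.free_mem_gb += q.size_gb
        return best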

I intend to serve this behind an API using llama.cpp, so you can send a request to the API and it will fetch the model to fulfill the request, or create a new allocation to accommodate it. Sort of like llama-swap, but explicitly with the goal of enabling as many LLMs as you need to run on your hardware.

Anyways, I just bring this up because I’m curious if anyone else has done something like this? Or if it’s a problem worth solving? My dream is to take it out of my bedroom server and run it on something like Modal.

charles_irl

Oh, I wrote this! Thanks for sharing it.

freeqaz

Anything you feel is worth adding for the HN crowd while you've got our attention? :)

(Thanks for writing this btw!)

charles_irl

Hmm, hard to say!

In the few months since I originally wrote this, I've come to an even greater appreciation of just how hard it is to maximize utilization of the Tensor Cores. It's a lot more than just kernel parameter tuning and using a few parallel programming tricks (parallel reduce, unrolling). It really borks your CUDA code -- you need warp specialization, you need to break warp uniformity, you need to work with explicit asynchrony. Hoping to write about this for the Modal blog/GPU Glossary soon!

I also spent a bit of time working with ncu/Nsight Compute. I'd probably include a bit about it in the section on how to improve your MFU if I rewrote the article today. But tl;dr: use the profiling tool, Luke! And a good way to learn is to watch NVIDIA's GTC talks.

That said, I've also noticed even more cases where GPU kernel utilization is well below target. I think (and Horace He has argued) that that comes in part from optimized GEMMs running so fast on Tensor Cores that host overhead becomes the bottleneck (classic Amdahl). This unfortunately means more host logic needs to be compiled -- either graph-compiled as in torch.compile or moved into a compiled language.
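
As a toy illustration of the shape of the fix (not a benchmark, and not code from the article): each of the small ops below is its own host-driven kernel launch in eager mode, while torch.compile's reduce-overhead mode graph-captures the sequence so the host does far less work per step:

    import torch

    def step(x, w1, w2):
        # Three small ops: in eager mode, each is a separate kernel launch
        # driven from the host, so launch overhead can rival GPU compute time.
        return torch.relu(x @ w1) @ w2

    # Graph-capture/fuse the host-side work (CUDA graphs under the hood).
    compiled_step = torch.compile(step, mode="reduce-overhead")

    x  = torch.randn(256, 1024, device="cuda")
    w1 = torch.randn(1024, 1024, device="cuda")
    w2 = torch.randn(1024, 1024, device="cuda")

    out = compiled_step(x, w1, w2)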

Mockapapella

This is a good article on the "fog of war" for GPU inference. Modal has been doing a great job of aggregating and disseminating info on how to think about high quality AI inference. Learned some fun stuff -- thanks for posting it.

> the majority of organizations achieve less than 70% GPU Allocation Utilization when running at peak demand — to say nothing of aggregate utilization. This is true even of sophisticated players, like the former Banana serverless GPU platform, which operated at an aggregate utilization of around 20%.

Saw this sort of thing at my last job. Was very frustrating pointing this out to people only for them to respond with ¯\_(ツ)_/¯. I posted a much less tactful article (read: rant) than the one by Modal, but I think it still touches on a lot of the little things you need to consider when deploying AI models: https://thelisowe.substack.com/p/you-suck-at-deploying-ai-mo...

charles_irl

Nice article! I had to restrain myself from ranting on our blog :)

alexjplant

OT: I'm not really sure what the author meant by

> Graphics Processing Units, or GPUs, are the hottest mathematical co-processor since the FM synthesis chips that shaped the sounds of the 1990s

since FM was more of an 80s thing. Even their linked comment says

> Throughout the 90s FM was old-hat. Nobody wanted to hear those woody clangy sounds of the 80s anymore.

FM synthesis has remained a thing in specific applications ever since, but the zeitgeist of the 90s (and its postmodern retreads like vaporwave) is arguably digital sampling.

charles_irl

That's a fair point! I was thinking of the Yamaha chips in the Sega consoles mentioned in that comment -- which certainly defined the sound of the 1990s for me as a child. But my small-town Midwestern upbringing was behind the curve!

Will replace with the lore-accurate "late 1900s".

semessier

Well, related: fractional GPUs to multiplex workloads for better aggregate utilization have been a topic for some time, with no definitive (NVIDIA) solution yet: https://vhpc.org

charles_irl

We looked into this at Modal! We put out vGPUs but didn't see demand and our internal benchmarks for MPS and Green Contexts didn't indicate a big win.

The tricky thing here is that many GPU workloads saturate at least one of the resources on the GPU -- arithmetic throughput, memory bandwidth, thread slots, registers -- and so there's typically resource contention that leads to lowered throughput/increased latency for all parties.
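
For intuition, here's a back-of-the-envelope roofline check with approximate H100-class numbers (purely illustrative) showing how quickly a single GEMM pins itself to one resource or the other:

    # Which resource does a square fp16 GEMM saturate: compute or memory bandwidth?
    # Specs are approximate H100 SXM figures, used only for illustration.
    PEAK_FLOPS = 990e12   # ~ dense BF16 Tensor Core throughput, FLOP/s
    PEAK_BW    = 3.35e12  # ~ HBM3 bandwidth, bytes/s

    def gemm_intensity(n, bytes_per_elem=2):
        flops = 2 * n**3                           # multiply-accumulates
        bytes_moved = 3 * n * n * bytes_per_elem   # read A and B, write C (ideal)
        return flops / bytes_moved

    ridge = PEAK_FLOPS / PEAK_BW  # FLOP/byte needed to be compute-bound (~300)
    for n in (256, 1024, 4096):
        ai = gemm_intensity(n)
        bound = "compute" if ai > ridge else "memory bandwidth"
        print(f"n={n}: {ai:.0f} FLOP/byte -> {bound}-bound")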

And in a cloud (esp serverless/auto-scaling) computing context, the variety of GPU SKUs means you can often more easily right-size your workload onto whole replicas (on our platform, from one T4 up to 8 H100s per replica).

kristianpaul

I'm still trying to use all my CPUs...

drob518

And we’re back to time-sharing.

charles_irl

When I'm feeling sassy, I like to tell people that Modal is "Enterprise Java Beans for AI".

dehrmann

Tomcat wanted to be some sort of compile once, run anywhere Docker.

esperent

What is time-sharing and why is being back to it a bad thing?

r3tr0

We spend a lot of time on getting these measurements with eBPF.

You can check us out at https://yeet.cx

Here's an overview of our GPU-specific solution:

https://yeet.cx/solutions/maximize-ai-infra-roi

freeqaz

How fast are modern GPU boxes able to spin up these days? Loading a massive blob of weights into VRAM feels like it's gotta be the bottleneck even if server provisioning is fast.

Or am I naive and my knowledge is outdated? I am genuinely curious what people see and what providers are capable of in 2025.

tedivm

Once you have the models on local storage you can move pretty quickly from there to VRAM; I've never found that to be the biggest bottleneck. The problem is provisioning itself, especially if you have to actually move models locally. Some of this can be avoided with extremely expensive networking (InfiniBand to a NAS with the model weights), but that's not something you're going to have fun dealing with in a cloud environment.

It might help to remember that the training process is essentially a game of "how fast can we shove data into these GPUs", and having a GPU sit idle because the data can't get into it fast enough is a challenge people have been tackling since at least the P100 series. This has resulted in improvements on the GPUs as well as all the hardware around them. Getting data into the chips is one of the most efficient processes at this point.
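
A back-of-the-envelope comparison of the hops involved (round illustrative numbers, not measurements) makes the point:

    # Rough cold-start budget for a ~140 GB set of weights (e.g. a 70B at fp16).
    weights_gb = 140

    stages_gb_per_s = {
        "25 GbE fetch to local disk":  25 / 8,  # ~3.1 GB/s
        "NVMe read into host RAM":     7,       # a fast Gen4 drive
        "PCIe 4.0 x16 host -> VRAM":   32,      # theoretical peak
    }

    for stage, gb_per_s in stages_gb_per_s.items():
        print(f"{stage}: ~{weights_gb / gb_per_s:.0f} s")

Unless the weights are already local, the fetch over the network dominates -- which is exactly the provisioning problem.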

freeqaz

How do serverless GPU cloud providers deal with that then? Do they go down the InfiniBand-to-NAS rabbit hole to build all of their infrastructure? Or do they just set up an NVMe RAID cache to hold the models locally? (Maybe an LRU? With the system memory also being used?)

I imagine in the real world that model usage follows a Zipfian distribution, i.e., a small number of models (<10) represent 95% of the machines for inference workloads. And for those machines, you can just load the weights off of your ~40 Gbit Ethernet connection since they're never cycling.

But for that last 5%, I feel like that's where it becomes important. If I'm running a weird, custom model and I want Lambda-like billing... what's the stack? Is the market big enough that people care? (And do most people just use LoRAs, which are much easier to hot swap?)

Training I imagine is a totally different ballpark because you're constantly checkpointing, transferring data at each step, etc, versus inference. That's a world I know a lot less about though!

mountainriver

There is some recent work by InferX to make model loading significantly faster. They claim to be able to load a 7B in under 2 seconds.

I haven’t tried it, but if that’s the case it’s a game changer.

charles_irl

We've talked to them and there's some impressive technology there!

kllrnohj

PCIe 5.0 x16 is ~64 GB/s of bandwidth. Real world is never perfect, but it's not exactly a small pipe here.
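
Back of the envelope, assuming the weights are already sitting in host RAM or the page cache:

    weights_gb = 14          # ~7B params at fp16
    pcie5_x16_gb_s = 64      # theoretical peak; real-world is lower
    print(f"host -> VRAM lower bound: {weights_gb / pcie5_x16_gb_s:.2f} s")  # ~0.2 s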

pavelstoev

GPU sharing is a concern for sensitive data. It is more appropriate to increase the utilization rate of GPU chip internals via a variety of low-level (CUDA and below) optimizations.

cph123

Could SR-IOV VFs be a solution to that?