Llama.cpp AI Performance with the GeForce RTX 5090 Review
24 comments · March 10, 2025 · Tepix
sgt
I've heard, however, that the EPYC rigs are in practice getting very low tokens/sec, and the Macs like the Ultras with high memory are doing much better - like by an order of magnitude. So in that sense, the only sensible option now (i.e. "local energy efficient LLM on a budget") is to get the Mac.
porphyra
There are Chinese modded 4090s with 48 and 96 GB of RAM that seem like a sweet spot for fast inference of these moderately sized models.
42lux
The 3090 was always a fallacy without native fp8.
threeducks
Performance for text generation is memory-limited, so lack of native fp8 support does not matter. You have more than enough compute left over to do the math in whichever floating point format you fancy.
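A rough back-of-envelope sketch of the memory-bound argument (a simplified illustration, assuming the weights are streamed from VRAM once per generated token and ignoring KV cache and activations):

```python
# Upper bound on tokens/s when generation is memory-bandwidth-bound:
# every new token requires reading (roughly) all weights once.
def max_tokens_per_second(bandwidth_gb_s, params_billion, bytes_per_weight):
    weights_gb = params_billion * bytes_per_weight  # size of the weights in GB
    return bandwidth_gb_s / weights_gb

# RTX 5090 (~1792 GB/s) with an 8B model at 8 bits per weight (~8 GB of weights):
print(max_tokens_per_second(1792, 8, 1))  # ~224 tokens/s ceiling, regardless of native FP8
```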
Tepix
Performance is good enough for non-reasoning models even if they're FP8 or FP4. Check the Phoronix article; the difference between the 3090 and 4090 is rather small.
There's weight-only FP8 in vLLM on NVidia Ampere: https://docs.vllm.ai/en/latest/features/quantization/fp8.htm...
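For illustration, a minimal sketch of requesting FP8 quantization in vLLM (the model name is just an example; exact behaviour depends on the vLLM version, and on Ampere the weights are stored in FP8 while compute stays in FP16/BF16, per the linked docs):

```python
from vllm import LLM, SamplingParams

# Quantize weights to FP8 at load time; on Ampere this is weight-only.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", quantization="fp8")

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Why is LLM token generation memory-bound?"], params)
print(outputs[0].outputs[0].text)
```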
imtringued
How? It's not like Nvidia is some also-ran company for which people did not build custom kernels that combine dequantization and GEMV/GEMM in a single kernel.
alecco
Is llama.cpp's CUDA implementation decent? (e.g. does it use CUTLASS properly or something more low level)
regularfry
The implementation's here: https://github.com/ggml-org/llama.cpp/tree/master/ggml/src/g...
benob
The benchmark only touches 8B-class models at 8-bit quantization. It would be interesting to see how it fares with models that use more of the card's VRAM, and under varying quantization and context lengths.
threeducks
I agree. This benchmark should have compared the largest ~4 bit quantized model that fits into VRAM, which would be somewhere around 32B for RTX 3090/4090/5090.
For text generation, which is the most important metric, the tokens per second will scale almost linearly with memory bandwidth (936 GB/s, 1008 GB/s and 1792 GB/s respectively), but we might see more interesting results when comparing prompt processing, speculative decoding with various models, vLLM vs llama.cpp vs TGI, prompt length, context length, text type/programming language (actually makes a difference with speculative decoding), cache quantization and sampling methods. Results should also be checked for correctness (perplexity or some benchmark like HumanEval etc.) to make sure that results are not garbage.
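As a hypothetical illustration of that scaling, using only the bandwidth figures above and a ~32B model at roughly 4.5 bits per weight (about 18 GB of weights):

```python
# Tokens/s ceilings implied by memory bandwidth alone for ~18 GB of weights.
weights_gb = 18
for card, bw_gb_s in [("RTX 3090", 936), ("RTX 4090", 1008), ("RTX 5090", 1792)]:
    print(f"{card}: ~{bw_gb_s / weights_gb:.0f} tokens/s upper bound")
# Output: ~52, ~56 and ~100 tokens/s - the ratios track the bandwidth ratios,
# which is why the 3090/4090 gap is small while the 5090 pulls clearly ahead.
```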
If anyone from Phoronix is reading this, this post might be a good point to get you started: https://old.reddit.com/r/LocalLLaMA/comments/1h5uq43/llamacp...
At the time of writing, Qwen2.5-Coder-32B-Instruct-GGUF with one of the smaller variants for speculative decoding is probably the best local model for most programming tasks, but keep an eye out for any new models. They will probably show up in Bartowski's "Recommended large models" list, which is also a good place to download quantized models: https://huggingface.co/bartowski
Maxious
There's been some INT4/NVFP4 gains too https://hanlab.mit.edu/blog/svdquant-nvfp4 https://blackforestlabs.ai/flux-nvidia-blackwell/
chakintosh
Curious whether the author checked that his card isn't missing any ROPs.
wewewedxfgdf
It would be very interesting to see these alongside benchmarks for the Apple M4, AMD Strix Halo and other AMD cards.
daghamm
Comparison with previous generations:
https://www.hardware-corner.net/guides/gpu-benchmark-large-l...
mirekrusin
When running on Apple Silicon you want to use MLX, not llama.cpp as this benchmark does. Performance is much better than what's plotted there and seems to be getting better, right?
Power consumption is almost 10x lower for Apple.
VRAM is more than 10x larger.
Price-wise, for running the same size models, Apple is cheaper.
The upper limit (larger models, longer context) is far higher for Apple (with Nvidia you can easily put in 2x cards; beyond that it becomes a complex setup no ordinary person can manage).
Am I missing something, or is Apple simply better for local LLMs right now?
sgt
I'm trying to find out about that as well as I'm considering a local LLM for some heavy prototyping. I don't mind which HW I buy, but it's on a relative budget and energy efficiency is also not a bad thing. Seems the Ultra can do 40 tokens/sec on DeepSeek and nothing even comes close at that price point.
nicman23
There is a plateau where you simply need more compute and the M4 cores are not enough, so even if they have enough RAM for the model, the tokens/s is not useful.
teaearlgraycold
And workstation and server-class cards.
3np
Correct me if I'm wrong, but I have the impression that we'd usually expect to see bigger efficiency gains while these are marginal?
If so, that would confirm the notion that they've hit a ceiling and are pushing against physical limitations.
magicalhippo
> I have the impression that we'd usually expect to see bigger efficiency gains while these are marginal?
The 50-series is made using the same manufacturing process ("node") as the 40-series, and there is not a major difference in design.
So the 50-series is more like tweaking an engine that previously topped out at 5000 RPM so it now tops out at 6000 RPM, without changing anything fundamental. Yes, it's making more horsepower, but it's using more fuel to do so.
3np
Great to see Mr Larabel @ Phoronix both maintaining consistently legit reporting and still having time for one-offs like this, in these times of AI slop and other OG writers either quitting or succumbing to the vortex. Hats off!
littlestymaar
TL;DR: performance isn't bad, but perf per Watt isn't better than the 4080 or 4090 and can even be significantly lower than the 4090 in certain contexts.
It used to be that for locally running GenAI, VRAM per dollar was king, so used Nvidia RTX 3090 cards with 24GB for 600€-800€ or so were the undisputed darlings of DIY LLM. Sticking two of these in one PC isn't too difficult despite them using 350W each.
Then Apple introduced Macs with 128 GB or more of unified memory at 800GB/s and the ability to load models as large as 70GB (70B FP8) or even larger ones. The M1 Ultra was unable to take full advantage of the excellent RAM speed, but with the M2 and the M3, performance is improving. Just be prepared to spend 5000€ or more for an M3 Ultra. Another alternative would be an EPYC 9005 system with 12x DDR5-6000 RAM for 576GB/s of memory bandwidth, with the LLM (preferably a MoE) running on the CPU instead of a GPU.
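As a quick sanity check on that 576GB/s figure (assuming 12 channels, each 64 bits wide, at 6000 MT/s; this is the theoretical peak, sustained bandwidth will be lower):

```python
# Theoretical peak bandwidth of a 12-channel DDR5-6000 system.
channels = 12
transfers_per_second = 6000e6   # DDR5-6000 = 6000 MT/s per channel
bytes_per_transfer = 8          # 64-bit channel = 8 bytes per transfer
print(channels * transfers_per_second * bytes_per_transfer / 1e9)  # 576.0 GB/s
```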
However, today, with the latest, surprisingly good reasoning models like QwQ-32B using up thousands or tens of thousands of tokens in their replies, performance is getting more important than before, and these systems (Macs and even RTX 3090s) might fall out of favor, because waiting for a finished reply will take several minutes or even tens of minutes. Nvidia Ampere and Apple Silicon (AFAIK) are also missing FP4 support in hardware, which doesn't help.
For the same reason, AMD Strix Halo with a mere 273GB/s of RAM bandwidth, and perhaps also Nvidia Project DIGITS (speculated to offer similar RAM bandwidth), might just be too slow for reasoning models with more than 50GB or so of active parameters.
On the other hand, if prices for the RTX 5090 remain at 3500€, it will likely remain irrelevant to the DIY crowd for that reason alone.
Perhaps AMD will take the crown with a variant of their RDNA4 RX 9070 card with 32GB of VRAM priced at around 1000€? Probably wishful thinking…