
I clustered four Framework Mainboards to test LLMs

jauntywundrkind

> For networking, I expected more out of the Thunderbolt / USB4 ports, but could only get 10 Gbps.

I really wish we saw more testing of USB subsystems! With PCIe being so limited, there's such allure to having two USB4 ports! But will they work?

IIRC we saw similarly low bandwidth on Apple's ARM chips too. That was around the M1; I don't know if things got better with that chip or later ones. Presumably so, or I feel like we'd be hearing about it, but these things can stay so hidden!

It was really cool back in the Ryzen 1 era seeing AMD put some USB on the CPU itself, so it didn't have to go through the I/O / peripheral hub (southbridge) with its limited connection to the CPU. There's a great breakout chart here, showing both the 1800X and the various chipsets available: relishable data. https://www.techpowerup.com/cpu-specs/ryzen-7-1800x.c1879

I feel like there have been some recent improvements to USB4/Thunderbolt in the kernel to really ensure all lanes get used, but I'm struggling to find a reference/link. What kernel was this tested against? If nothing else, it'd be great to poke around in debugfs to make sure all the lanes are getting configured. https://www.phoronix.com/news/Linux-6.13-USB-Changes
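For anyone who wants to check: here's a minimal sketch for reading the negotiated link speed and lane count out of sysfs. The attribute names (rx_speed, tx_speed, rx_lanes, tx_lanes) come from the kernel's thunderbolt sysfs ABI docs, but whether they're present depends on your kernel version, so treat this as an assumption to verify (debugfs has more detail if you need it):

    import glob, os

    # Sketch: dump negotiated Thunderbolt/USB4 link speed and lane count
    # from sysfs. Attribute names are from the kernel's thunderbolt ABI
    # docs -- availability varies by kernel, so verify on your system.
    for dev in sorted(glob.glob("/sys/bus/thunderbolt/devices/*")):
        attrs = {}
        for name in ("rx_speed", "tx_speed", "rx_lanes", "tx_lanes"):
            path = os.path.join(dev, name)
            if os.path.isfile(path):
                with open(path) as f:
                    attrs[name] = f.read().strip()
        if attrs:
            print(os.path.basename(dev), attrs)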

dylan604

I read the title and thought, that sounds like something... then saw the domain at the end. Yup. Exactly.

bawana

Memory bandwidth sucks compared to the M3 Ultra Mac Studio. And you can't add GPUs easily, although as an APU it is impressive and way better than Nvidia's gold box. Wendell said it better. I'm waiting for an M5 Ultra Mac Studio.

sliken

Apparently the Framework Desktop's 5 Gbit network isn't fast enough to scale well with LLM inference workloads, even for a modest GPU. Anyone know what kind of network is required to scale well for a single modest GPU?

geerlingguy

In the case of llama.cpp's RPC mode, the network isn't the limiting factor for inference, but for distributing layers to nodes.

I was monitoring the network while running various models, and for all models, the first step was to copy over layers (a few gigabytes to 100 or so GB for the huge models), and that would max out the 5 Gbps connection.

But then while warming up and processing, there were only 5-10 Mbps of traffic, so you could do it over a string and tin cans, almost.
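Putting rough numbers on that (a back-of-envelope sketch reusing the figures above, not a new measurement):

    # Back-of-envelope numbers from the figures above (assumed, not re-measured).
    model_gb = 100        # layer copy for the biggest models
    link_gbps = 5         # the 5 Gbps network connection
    print(f"initial copy: ~{model_gb * 8 / link_gbps:.0f} s at line rate")  # ~160 s

    steady_mbps = 10      # traffic observed during inference
    print(f"inference uses ~{steady_mbps / (link_gbps * 1000):.2%} of the link")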

But that's a limitation of the current RPC architecture: it can't really parallelize processing. As I noted in the post and in my video, it kinda uses resources round-robin style, so for any model that fits on a single node, the whole cluster can only perform worse than that single node.
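To make that concrete, here's a toy timing model of the round-robin behavior described above (all numbers hypothetical; this is not llama.cpp's actual scheduler):

    # Toy model: layers execute sequentially across nodes, so compute time
    # is unchanged and each node boundary just adds a network hop.
    layers, per_layer_ms = 32, 3.0   # hypothetical model size + per-layer cost
    hop_ms, nodes = 0.5, 4           # hypothetical hop latency, cluster size

    single = layers * per_layer_ms
    cluster = layers * per_layer_ms + (nodes - 1) * hop_ms
    print(f"single node: {single:.1f} ms/token")
    print(f"{nodes}-node RPC: {cluster:.1f} ms/token  (strictly worse)")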

rtkwe

No network interconnect is going to scale well until you get into the expensive enterprise realm where InfiniBand and other direct-connect copper/fiber fabrics reign. The issue is less raw bandwidth than latency: network access is inherently 100x+ slower than memory access, so when you shard a memory-intensive workload like an LLM across a normal network, it's going to crater your performance unless the work can be chunked to keep communication between nodes to a minimum.
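For a rough sense of scale, with typical order-of-magnitude figures (my assumptions, not measurements):

    # Order-of-magnitude latencies (assumed typical values, not benchmarks).
    dram_ns = 100            # local DRAM access
    ethernet_us = 50         # round trip on commodity Ethernet
    infiniband_us = 2        # round trip on RDMA/InfiniBand

    print(f"Ethernet vs DRAM:   {ethernet_us * 1000 / dram_ns:,.0f}x slower")    # 500x
    print(f"InfiniBand vs DRAM: {infiniband_us * 1000 / dram_ns:,.0f}x slower")  # 20x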

_joel

> usually resulting in one word repeating ad infinitum

I've had that using Gemini (via Windsurf). It doesn't seem to happen with other models. No idea if there's any correlation, but it's an interesting failure mode.

mattnewton

This is usually a symptom of greedy sampling (always picking the most probable token) on smaller models. It's possible that configuration had different sampling defaults, i.e., was not using top-p or temperature. I'm not familiar with distributed-llama, but from searching the git repo it looks like it at least takes a --temperature flag and probably has one for top-p.
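For anyone unfamiliar, here's a minimal sketch of the difference, with temperature 0 standing in for greedy decoding (not distributed-llama's actual sampler):

    import numpy as np

    def sample(logits, temperature=1.0, top_p=None):
        # Greedy decoding: always take the argmax. On small models this is
        # exactly what tends to lock generation onto one repeating token.
        if temperature == 0.0:
            return int(np.argmax(logits))
        scaled = logits / temperature
        probs = np.exp(scaled - np.max(scaled))
        probs /= probs.sum()
        if top_p is not None:
            # Nucleus (top-p): keep the smallest set of tokens whose
            # cumulative probability reaches top_p, then sample from it.
            order = np.argsort(probs)[::-1]
            cutoff = int(np.searchsorted(np.cumsum(probs[order]), top_p)) + 1
            keep = order[:cutoff]
            return int(np.random.choice(keep, p=probs[keep] / probs[keep].sum()))
        return int(np.random.choice(len(probs), p=probs))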

I'd recommend rerunning the benchmarks with the sampling methods explicitly configured the same in each tool. It's tempting to benchmark with all the nondeterminism turned off, but I think that's less useful: in practice, for any model you're self-hosting for real work, you'll probably want top-p sampling or something like it, and you want to benchmark the implementation of that too.

I've never seen Gemini do this, though; it'd be kinda wild if they shipped something that samples that way. I wonder if Windsurf was sending a different config over the API or if this was a different bug.

mrbungie

Yep, sometimes Gemini for some reason ends up in what I call "ergodic self-flagellation".

Here are some examples: https://www.reddit.com/r/GeminiAI/comments/1lxqbxa/i_am_actu...

hinkley

Jeff! Someone needs to make Framework MBs work in a blade arrangement, and you seem to be the likely person to get it done.

lifeinthevoid

Setup looks very sexy.

oblio

Those numbers are better than I was expecting.