
%CPU utilization is a lie

141 comments · September 3, 2025

morning-coffee

The lie is that hyperthread "cores" are equal to real cores. Maybe this is what happens when an over-20-year-old technology (hack) becomes ubiquitous and gets forgotten about? (We have to rediscover why our performance measurements don't seem to make sense.)

ot

Utilization is not a lie; it is a measurement of a well-defined quantity. But people make assumptions to extrapolate capacity models from it, and that is where reality diverges from expectations.

Hyperthreading (SMT) and Turbo (clock scaling) are only some of the variables causing non-linearity; a number of other resources are shared across cores and "run out" as load increases, like memory bandwidth, interconnect capacity, and processor caches. Some bottlenecks can even come from the software, like spinlocks, which have a non-linear impact on utilization.

Furthermore, most CPU utilization metrics average over very long windows, from several seconds to a minute, but what really matters for the performance of a latency-sensitive server happens on the time scale of tens to hundreds of milliseconds, and a multi-second average will not distinguish bursty behavior from smooth behavior. The latter likely has much more capacity to scale up.

Unfortunately, the suggested approach is not that accurate either, because it hinges on two inherently unstable concepts:

> Benchmark how much work your server can do before having errors or unacceptable latency.

The measurement of this is extremely noisy, as you want to detect the point where the server starts becoming unstable. Even if you look at a very simple queueing theory model, the derivatives close to saturation explode, so any nondeterministic noise is extremely amplified.
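To make that concrete with the simplest textbook case (an M/M/1 queue, so purely illustrative and not a model of a real multi-core server):

```latex
% Mean response time in an M/M/1 queue, with service rate \mu and utilization \rho:
T(\rho) = \frac{1/\mu}{1 - \rho}
% Its sensitivity to utilization:
\frac{dT}{d\rho} = \frac{1/\mu}{(1 - \rho)^{2}}
% which grows without bound as \rho \to 1, so a small error in the measured load
% near saturation becomes a huge error in the predicted (or observed) latency.
```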

> Report how much work your server is currently doing.

There is rarely a stable definition of "work". Is it RPS? Request cost can vary even throughout the day. Is it instructions? Same, the typical IPC can vary.

Ultimately, the confidence intervals you get from the load testing approach might be as large as what you can get from building an empirical model from utilization measurement, as long as you measure your utilization correctly.

eklitzke

I agree. If you actually know what you're doing you can use perf and/or ftrace to get highly detailed processor metrics over short periods of time, and you can see the effects of things like CPU stalls from cache misses, CPU stalls from memory accesses, scheduler effects, and many other things. But most of these metrics are not very actionable anyway (the vast majority of people are not going to know what to do with their IPC, cache hit, or branch prediction numbers).

What most people care about is some combination of latency and utilization. As a very rough rule of thumb, for many workloads you can get up to about 80% CPU utilization before you start seeing serious impacts on workload latency. Beyond that you can increase utilization but you start seeing your workload latency suffer from all of the effects you mentioned.

To know how much latency is impacted by utilization you need to measure your specific workload. Also, how much you care about latency depends on what you're doing. In many cases people care much more about throughput than latency, so if that's the top metric then optimize for that. If you care about application latency as well as throughput then you need to measure both of those and decide what tradeoffs are acceptable.

tracker1

> There is rarely a stable definition of "work". Is it RPS? Request cost can vary even throughout the day. Is it instructions? Same, the typical IPC can vary.

I think this is probably one of the most important points... Similarly, is this public-facing work dealing with any kind of user request, or is it simply crunching numbers/data to build an AI model from a stable backlog/queue?

My take with modern multi-core, hyperthreaded, burstable CPUs has always been to consider ~60% a "loaded" server whose work should be split if it stays that way for any significant portion of the day. I'm mostly dealing with user-facing services, so bursts and higher-traffic portions of the day are dramatically different from lower-utilization portions of the day.

A decade ago, this led to a lot of work for cloud provisioning on demand for the heavier load times. Today it's a bit more complicated when you have servers with 100+ cores as an option for under $30k (a guesstimate based on a $10k CPU price). Today, I'd lean toward over-provisioning dedicated server hardware and supplementing with cloud services (and/or self-hosted cloud-like setups on K8s) as pragmatically as reasonable... depending on the services, of course. I'm not currently in a position where I have this level of input, though.

Just looking at how, for example, StackOverflow scaled in the early days is even more possible/prudent today, to a much larger extent... You can go a very long way with a half or full rack and a 10Gb uplink in a colo data center or two.

In any case, for me... >= 65% CPU load for >= 30m/day means it's at 100% effective utilization, and needs expansion relatively soon. Just my own take.

everforward

> In any case, for me... >= 65% CPU load for >= 30m/day means it's at 100% effective utilization, and needs expansion relatively soon.

I think this still depends on the workload, because IO-heavy apps hyperthread well and can push up to 100%. Most of the apps I've worked on end up being IO-bound, because "waiting on SQL results" or the more generic "waiting on downstream results" is 90% of their runtime. They might spend more time reading those responses off the wire than they do actually processing anything.

There are definitely workloads for which that isn't true, though, and your metrics read about right to me.

jimmySixDOF

IEEE Hot Interconnects just wrapped up, and they discussed latency performance tuning for Ultra Ethernet, where traffic looks smooth on a 2- or 5-second view but at 100 ms you see obvious frame burst effects. If you don't match your profiling to the workload, a false negative compounds your original problem: you think "we tested this, so better look elsewhere."

SAI_Peregrinus

That's all true, and the % part is still a lie. As you note, CPU utilization isn't linear, and percentages are linear measures. CPU utilization isn't a lie; % CPU utilization is.

ot

It is a linear measure of the fraction of time the CPU is not idle. It is not linear in the amount of useful work, but that's not what "utilization" means.

The lie is the assumption that CPU time is linear in useful work, but that has nothing to do with the definition of utilization, it's just something that people sometimes naively believe.

> CPU utilization isn't a lie, % CPU utilization is

What do you mean by this? Utilization is, by definition, a ratio. % just determines that the scale is in [0, 100].
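For concreteness, that ratio is exactly what you get by diffing the counters in /proc/stat on Linux. A minimal sketch (Linux-only, aggregate across all CPUs):

```python
# Minimal sketch of what "% CPU utilization" measures on Linux: the fraction of
# time the CPUs spent in any state other than idle, computed by diffing /proc/stat.
import time

def read_cpu_ticks():
    with open("/proc/stat") as f:
        fields = f.readline().split()[1:]      # aggregate "cpu" line, in USER_HZ ticks
    ticks = list(map(int, fields))
    idle = ticks[3] + ticks[4]                 # idle + iowait counted as "not busy"
    return idle, sum(ticks)

def cpu_utilization(interval=1.0):
    idle1, total1 = read_cpu_ticks()
    time.sleep(interval)
    idle2, total2 = read_cpu_ticks()
    busy = (total2 - total1) - (idle2 - idle1)
    return 100.0 * busy / (total2 - total1)

if __name__ == "__main__":
    print(f"{cpu_utilization():.1f}% non-idle time")  # says nothing about useful work
```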

SirMaster

What about 2 workloads that both register 100% CPU usage, but one workload draws significantly more power and heats the CPU up way more? Seems like that workload is utilizing more of the CPU, more of the transistors or something.

inetknght

Indeed, and there's a thing called "race to sleep". That is, you want to light up as much of the core as possible as fast as possible so you can get the CPU back to idle as soon as possible to save on battery power, because having the CPU active for more time (but not using as many circuits as it "could") draws a lot more power.

MBCook

At the same time, it takes a certain amount of time for a CPU to switch power levels, and I remember it being surprisingly slow on some (older?) processors.

So in Linux (and I assume elsewhere) there were attempts to figure out if the cost in time/power to move up to a higher power state would be worth the faster processing, or if staying lower power but slower would end up using less power because it was a short task.

I think the last chips I remember seeing numbers for were some of the older Apple M-series chips, and they were lightning fast to switch power levels. That would certainly make it easier to figure out if it was worth going up to a higher power state, if I’m remembering correctly.

saagarjha

Yes, this is pretty normal; your processor will downclock to accommodate. For HPC where the workloads are pretty clearly defined it’s possible to even measure how close you’re coming to the thermal envelope and adjust the workload.

throwaway31131

Percent utilization for most operating systems is the amount of time the idle task is not scheduled. So for both workloads the idle task was never scheduled, hence 100% "utilization".

BrendanLong

Some esoteric methods of measuring CPU utilizations are to calculate either the current power usage over the max available power, or the current temperature over the max operating temperature. Unfortunately these are typically even more non-linear than the standard metrics (but they can be useful sometimes).
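On Linux with Intel's RAPL powercap driver loaded, the power-based variant can be sketched from the sysfs energy counter; the sysfs path and the max-power constant below are assumptions you'd adjust for your own machine:

```python
# Rough sketch of "utilization as power draw", using the package energy counter
# exposed by the Linux powercap (RAPL) driver. Intel-specific; paths vary.
import time

RAPL = "/sys/class/powercap/intel-rapl:0/energy_uj"  # package 0 energy, microjoules
MAX_PACKAGE_WATTS = 125.0  # assumption: your CPU's sustained package power limit

def read_energy_uj():
    with open(RAPL) as f:
        return int(f.read())

def power_utilization(interval=1.0):
    e1 = read_energy_uj()
    time.sleep(interval)
    e2 = read_energy_uj()
    watts = (e2 - e1) / 1e6 / interval  # counter wraparound ignored in this sketch
    return watts, 100.0 * watts / MAX_PACKAGE_WATTS

if __name__ == "__main__":
    watts, pct = power_utilization()
    print(f"{watts:.1f} W, ~{pct:.0f}% of assumed max")
```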

gblargg

Like measuring RMS of an AC voltage by running it through a heating element: https://wikipedia.org/wiki/True_RMS_converter#Thermal_conver...


kqr

It might be a lie, but it surely is a practical one. In my brief foray into site reliability engineering I used CPU utilisation (of CPU-bound tasks) with queueing theory to choose how to scale servers before big events.

The %CPU suggestions ran contrary to (and were much more conservative than) the "old wisdom" that would otherwise have been used. It worked out great at much lower cost than otherwise.

What I'm trying to say is you shouldn't be afraid of using semi-crappy indicators just because they're semi-crappy. If it's the best you got it might be good enough anyway.

In the case of CPU utilisation, though, the number in production shouldn't go above 40% for many reasons. At 40% there's usually still a little headroom. The author's mistake was not using the fundamentals of queueing theory to avoid high utilisation!

therealdrag0

> semi-crappy indicator … good enough.

Agree. Another example of this is metrics exported as per-host percentiles that you then have to average, vs. per-host histograms whose percentiles are calculated at aggregation time across hosts. Sure, an avg/max of a percentile is technically not a percentile, but in practice switching between one and the other hasn't affected my operations at all. Yet I know some people are adamant about mathematical correctness, as if that translates to operations.
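To see the size of the discrepancy in question, here's a toy comparison with synthetic data (two hypothetical hosts with very different latency distributions):

```python
# Toy comparison (synthetic data): average of per-host p99 latencies vs. the p99
# computed over the pooled samples, which is what a merged histogram gives you.
import random

random.seed(0)
hosts = [
    [random.expovariate(1 / 50) for _ in range(10_000)],   # fast host, ~50 ms mean
    [random.expovariate(1 / 200) for _ in range(10_000)],  # slow host, ~200 ms mean
]

def p99(samples):
    return sorted(samples)[int(0.99 * len(samples)) - 1]

avg_of_p99s = sum(p99(h) for h in hosts) / len(hosts)
pooled_p99 = p99([s for h in hosts for s in h])

print(f"avg of per-host p99s: {avg_of_p99s:.0f} ms")
print(f"p99 of pooled data:   {pooled_p99:.0f} ms")  # larger: dominated by the slow host
```

Whether that gap matters operationally is exactly the judgment call being described above.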

arccy

That works ok when you have evenly distributed load (which you want / would hope to have), much less so when your workload is highly unbalanced.

mayama

A combination of CPU% and loadavg will generally tell you how a system is doing. I've had systems where loadavg was high, waiting on network/IO, but CPU% was low. Tracing high load is not always as straightforward as CPU%, though; you have to go through IO%, net%, syscalls, etc.

saagarjha

40% seems quite lightly utilized tbh

cpncrunch

I tend to use 50% as a soft target, which seems like a good compromise. Sometimes it may go a little bit over that, but if it's occasional it shouldn't be an issue.

It's not good to go much over 50% on a server (assuming half the CPUs are just hyperthreads), because you're essentially relying on your load being able to share the actual CPU cores. At some point, when the load increases too much, there may not be any headroom left for sharing those physical CPUs. You then get to the point where adding a little more load at 80% suddenly results in 95% utilization.

zekrioca

I noticed exactly the same thing. The author is saying something that has been written repeatedly in queueing theory books for decades, yet they are noticing it only now.

mustache_kimono

Reminds me of Brendan Gregg's "CPU Utilization is Wrong", but this post fails to discuss that blog's key point: CPU utilization is a measure of whether or not the CPU is busy, including time the CPU spends stalled waiting (e.g. on memory) [0]. That blog also explains that the IPC (instructions per cycle) metric actually measures the useful work hidden within that busy state.

[0]: https://www.brendangregg.com/blog/2017-05-09/cpu-utilization...

4gotunameagain

What's up with Brendans and CPU utilisation concerns? Any Brendan around to shine some light?

BrendanLong

I'd love to explain, but you'd need to change your name to Brendan first.

PaulKeeble

This is bang on. You can't count the hyperthreads as double the performance; in practice they typically only bring 15-30% more if the job works well with SMT, and their use will double the latency. Failing to account for the loss in clock speed as core utilisation climbs is another way it's not linear, and in modern desktop software it's really something to pay careful attention to.

It should be possible, from the information the OS exposes about a CPU, to better estimate utilisation by accounting for at least these two factors. It becomes trickier to account for significantly exceeding the cache or the available memory bandwidth, and for the drop in performance to existing threads caused by the increased pipeline stalls. But it can definitely be done better than it is currently.
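A toy sketch of what such an adjusted estimate could look like; the SMT uplift and clock figures below are placeholder assumptions to calibrate per machine, not measured values:

```python
# Toy "effective utilization" estimate that discounts hyperthread siblings and
# clock scaling. All constants are illustrative assumptions, not measurements.
SMT_UPLIFT = 0.25     # assume a busy sibling thread adds ~25% of a core, not 100%
ALL_CORE_GHZ = 4.0    # assume this is the sustained all-core clock of the part

def effective_utilization(busy_threads, physical_cores, current_ghz):
    # The first busy thread on each core counts as a full core; additional
    # sibling threads only count for the assumed SMT uplift.
    primary = min(busy_threads, physical_cores)
    siblings = max(busy_threads - physical_cores, 0)
    used = (primary + siblings * SMT_UPLIFT) * current_ghz
    available = physical_cores * (1 + SMT_UPLIFT) * ALL_CORE_GHZ
    return 100.0 * used / available

# On a 12-core/24-thread part, 12 busy threads already consume most of the
# deliverable capacity, even though naive %CPU would report 50%:
print(f"{effective_utilization(12, 12, 4.4):.0f}%")   # ~88%
print(f"{effective_utilization(24, 12, 4.0):.0f}%")   # 100%
```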

c2h5oh

To complicate things more, HT performance varies wildly between CPU architectures and workloads. E.g. AMD's implementation, especially in later Zen cores, gets closer to the performance of a full core than you'd see in Intel CPUs, provided you are not memory-bandwidth starved.

RaftPeople

> To complicate things more HT performance varies wildly between CPU architectures and workloads.

IBM's Power CPUs have also traditionally done a great job with SMT compared to Intel's implementation.

shim__

What's the difference between Intel's and AMD's approaches?

richardwhiuk

Basically it comes down to how many resources are shared vs. dedicated per core.

magicalhippo

For memory-bound applications the scaling can be much better. A renderer I worked on was primarily memory-bound walking the accelerator structure, and saw a 60-70% increase from hyperthreads.

But overall yeah.

tonymet

I like his empirical approach to getting at the real significance of the CPU percentage indicator. Software engineers and data analysts take discrete "data" measurements and statistics for granted.

"data" / "stats" are only a report, and that report is often incorrect.

freehorse

The author discovers that performance does not scale proportionally with %CPU utilisation, and instead arrives at the conclusion that %CPU utilisation is a lie.

There are many reasons for the lack of a proportional relationship, even when you do not have hyperthreading or downclocking (in which case you just need to interpret %CPU utilisation in that context, rather than declare it "a lie"). Even on Apple silicon, where these are usually not an issue, you often do not get exactly proportional scaling. There may be overhead in how data is passed around when utilising multiple cores, or resource bottlenecks other than the CPU.

saagarjha

Apple silicon downclocks quite a lot, especially if you have a passively cooled machine.

tgma

The way they refer to cores in the post is confusing and non-standard. The author talks about a 5900X as a 24-core machine and discusses it as if there are 24 cores, 12 of which are piggybacking on the other 12. In reality, there are 24 hardware threads, pretty much pairwise symmetric, that execute on top of 12 cores, with two sets of instruction pipelines sharing the same underlying functional units.

saghm

Years ago, when trying to explain hyper threading to my brother, who doesn't have any specialized technical knowledge, he came up with the analogy that it's like 2-ply toilet paper. You don't quite have 24 distinct things, but you have 12 that are roughly twice as useful as the individual ones, although you can't really separate them and expect them to work right.

nayuki

Nah, it's easier than that. Putting two chefs in the same kitchen doesn't let you cook twice the amount of food in the same amount of time, because sometimes the two chefs need to use the same resource at the same time - e.g. sink, counter space, oven. But, the additional chef does improve the utilization of the kitchen equipment, leaving fewer things unused.

whizzter

Maybe simplify it more to make the concept of a shared resource explicit.

Two chefs with one stove: as long as they're doing things other than frying, it's fine and speeds things up, but once they both need the stove you're down to one working and one waiting.

BobbyTables2

That’s perfect!

Especially when it comes to those advertisements: "6 large rolls == 18 normal rolls".

Sure, it might be thicker, but nobody wipes their butt with 1/3 of a square…

skeezyboy

> he came up with the analogy that it's like 2-ply toilet paper.

as in you'd only use it to wipe excrement from around your sphincter

BrendanLong

Thanks for the feedback. I think you're right, so I changed a bunch of references and updated the description of the processor to 12-core / 24-thread. In some cases, I still think "cores" is the right terminology though, since my OS (confusingly) reports utilization as if I had 24 cores.

sroussey

Eh, what’s a thread really? It’s a term for us humans.

The difference between two threads and one core or two cores with shared resources?

Nothing is really all that neat and clean.

It's more of a two-level NUMA-type architecture, with 2 sets of 6 SMP sets of 2.

The scheduler may look at it that way (depending), but to the end user? Or even to most of the system? Nah.

tgma

There are observable differences. For example, under HT, a TLB flush or context switch will likely be observable by the neighboring thread, whereas on a fully dedicated core you won't observe such things.

Neil44

If both SMT threads are asked to run the same kind of workload, they will likely contend for the same resources and execution units internally, so the boost from SMT will be smaller. If they have different workloads, the boost will be larger. Now throw in P and E cores on newer CPUs, turbo and non-turbo, and everything gets very complicated. I did see a study finding that adding SMT gave a much better performance-per-watt boost than adding turbo, which was interesting/useful.

sroussey

It will be interesting when (if?) Intel ships software-defined cores, which are the logical inverse of hyperthreading.

Instead of one big core with two instruction pipelines sharing big ALUs etc., they have two (or more) cores that combine resources and become one core.

Almost the same, yet quite different.

https://patents.google.com/patent/EP4579444A1/en

tgma

There was the dreaded AMD FX chip, which was advertised as 8-core but shared functional units. AMD got sued, etc.

hedora

That patent seems to be describing a dumb way to implement pipelining / speculative execution. Am I missing something?

Anyway, by my reading, it’s also similar to the Itanic, er, Itanium, where the “cores” that got combined were pipeline stages.

judge123

This hits so close to home. I once tried to explain to a manager that a server at 60% utilization had zero room left, and they looked at me like I had two heads. I wish I had this article back then!

hinkley

You also want to hit him with queueing theory.

Up to a hair over 60% utilization, the queueing delays on any work queue remain essentially negligible. At 70% they become noticeable, and at 80% they've doubled. And then it just turns into a shitshow from there on.

The rule of thumb is 60% is zero, and 80% is the inflection point where delays go exponential.

The biggest cluster I ran, we hit about 65% CPU at our target P95 time, which is pretty much right on the theoretical mark.
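A quick M/M/1 sanity check of those thresholds (a single-queue model, so the exact numbers are illustrative rather than a faithful model of a multi-core box):

```python
# Mean wait in an M/M/1 queue, as a multiple of the service time: W_q = rho / (1 - rho).
# Shows why ~60% feels free, ~80% hurts, and >90% falls off a cliff.
for rho in (0.5, 0.6, 0.7, 0.8, 0.9, 0.95):
    wait_factor = rho / (1 - rho)
    print(f"utilization {rho:.0%}: average wait = {wait_factor:.1f}x the service time")
```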

BrendanLong

A big part of this is that CPU utilization metrics are frequently averaged over a long period of time (like a minute), but if your SLO is 100 ms, what you care about is whether there's any ~100 ms period where CPU utilization is at 100%. Measuring p99 (or even p100) CPU utilization can make this a lot more visible.
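A sketch of one way to do that, using the third-party psutil package (assuming it's available): sample utilization in 100 ms windows and report the tail instead of the mean.

```python
# Sample CPU utilization in 100 ms windows and report tail percentiles,
# which expose short saturation bursts that a one-minute average hides.
import psutil  # third-party package, assumed installed

samples = [psutil.cpu_percent(interval=0.1) for _ in range(600)]  # ~1 minute of 100 ms windows
samples.sort()

mean = sum(samples) / len(samples)
p99 = samples[int(0.99 * len(samples)) - 1]
p100 = samples[-1]
print(f"mean {mean:.0f}%  p99 {p99:.0f}%  p100 {p100:.0f}%")
```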

hinkley

The vertical for this company was one where the daily traffic was oddly regular. That the two lines matched expectations likely has to do with the smoothness of the load.

The biggest problem was not variance in request rate; it was variance in request cost, which is usually where queueing kicks in, unless you're being dumb about things. I think for a lot of apps p98 is probably a better metric to chase; p99 and p100 are useful for understanding your application better, but I'm not sure you want your bosses to fixate on them.

But our contracts were for p95, which was fortunate given the workload, or at least whoever made the contracts got good advice from the engineering team.

kccqzy

If your SLO is 100 ms you need far more granular measurement periods than that. You should measure the p99 or p100 utilization for every 5-ms interval or so.

Ambroisie

Do you have a link to a more in-depth analysis of the queuing theory for these numbers?

PunchyHamster

That entirely depends on the workload, especially now that average server CPUs start at 32 cores.


dragontamer

There are many ways CPU utilization fails to work as expected.

I didn't expect an article in this style. I was expecting the usual "Linux/Windows reports high utilization, but it's all RAM-bottlenecked and the CPU is actually quiet and possibly downclocking" thing.

CPU utilization only measures how many cores the OS (be it Windows or Linux) has given threads to run. Those threads could be 100% blocked on memcpy, but that's still CPU utilization.

-------

Hyperthreads help here: if one thread is truly CPU-bound (or even more specifically, AVX/vector-unit bound) while a second thread hyperthreaded onto the same core is memcpy/RAM bound, you'll magically get more performance due to higher utilization of resources. (Load/store units are separate from AVX compute units.)

In any case, this is a perennial subject with always new discoveries about how CPU Utilization is far less intuitive than many think. Still kinda fun to learn about new perspectives on this matter in any case.


swiftcoder

I remember being stuck in a discussion with management one time that went something like this:

Manager: CPU utilisation is 100% under load! We have to migrate to bigger instances.

Me: But is the CPU actually doing useful work?

(chat, it was not. busy waiting is CPU utilisation too)