Are efficiency and horizontal scalability at odds?
19 comments · February 12, 2025 · datadrivenangel
PaulKeeble
An Intel 12900K (Gen 12) compared to a 2600K (Gen 2, launched 2011) is about 120% faster, or a bit over 2x, in single-threaded applications. Those +5-15% uplifts every generation add up over time, but it's nothing like the earlier years, when performance might double in a single generation.
It really depends on whether the application uses AES-256 and other modern instructions. The 12900K has 16 cores vs the 2600K's 4, although 8 of those extra cores are E-cores. The performance increase doesn't necessarily come for free: the application may need to be adjusted to utilise those extra cores, especially when half of them are slower, to ensure the workload is distributed properly.
Even within vertical scaling by just getting a new processor, it's interesting that much of the big benefit comes from targeting the new instructions and then the new cores, both of which may require source updates to see a significant performance uplift (a rough sketch of the core-spreading part follows after the link).
https://www.cpu-monkey.com/en/compare_cpu-intel_core_i7_1270...
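A minimal sketch of that kind of adjustment, assuming an embarrassingly parallel, CPU-bound loop (the work() function, chunk size, and worker count are illustrative, not taken from the comparison above):

    # Spread a CPU-bound loop over all available cores. Many small chunks
    # plus dynamic dispatch, so the slower E-cores don't leave the P-cores
    # idle at the end of the run.
    from multiprocessing import Pool
    import os

    def work(chunk):
        # stand-in for the real per-item computation
        return sum(x * x for x in chunk)

    if __name__ == "__main__":
        data = list(range(10_000_000))
        chunks = [data[i:i + 50_000] for i in range(0, len(data), 50_000)]
        with Pool(processes=os.cpu_count()) as pool:
            total = sum(pool.imap_unordered(work, chunks))
        print(total)

The point being: none of this comes automatically with the new CPU; someone has to restructure the loop.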
einpoklum
> is about 120% faster or a bit over 2 times in single threaded applications
1. Doesn't that also account for speedups in memory and I/O?
2. Even if the app is single-threaded, the OS isn't, so unless it's very very inactive other than the foreground application (which is possible), there might still be an effect of the higher core count.
no_wizard
Funnily enough, most apps aren't taking enough advantage of the multi-core, multi-threaded environments that are common across all major platforms.
The single biggest bottleneck to improvement is the general lack of developers using those APIs to the fullest extent when designing applications. It's not really hardware anymore.
Though, to the points being made, we aren't seeing the 18-month doubling we did in the earlier decades of computing.
jaggederest
Unless you're multitasking, the OS running on a separate core gets you about a 5-10% speedup. It's not really noteworthy.
Unless you lived through the 1990s, I don't think you understand how fast things were improving. Routine doubling of scores every 18 months is an insane thing. In 1990 the state of the art was 8 MHz chips. By 2002, the state of the art was a 5 GHz chip. So almost a thousand times faster in a little over a decade.
Are chips now a thousand times faster than they were in 2015? No they are not.
bee_rider
I think it is often the case that people want to describe the problem as “single core performance has stagnated for decades” because it makes it look like their solution is necessary to make any progress at all.
Actually, single core performance has been improving. Not as fast as it was in the 90’s, maybe, but it is improving.
However, we can speed things up even more by using multiple computers. And it is a really interesting problem where you get to worry about all sorts of fun things, like hiding MPI communication behind compute (sketched below).
Nobody wants to say “I’ve found that I can make an already fast process even faster by putting in a lot of effort, which I will do because my job is actually really fun.” Technical jobs are supposed to be stressful and serious. The world is doomed and science will stop… unless I come up with a magic trick!
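To make the communication-hiding point concrete, a minimal sketch using mpi4py (the ranks, buffer sizes, and the interior "compute" step are all illustrative):

    # Overlap a non-blocking boundary exchange with interior work, and only
    # block once the incoming data is actually needed.
    # Run under an MPI launcher, e.g. mpirun -n 4 python overlap.py
    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    local = np.random.rand(1_000_000)      # this rank's chunk of work
    send_buf = local[:1000].copy()         # boundary values to share
    recv_buf = np.empty(1000)

    left, right = (rank - 1) % size, (rank + 1) % size
    reqs = [comm.Isend(send_buf, dest=right, tag=0),
            comm.Irecv(recv_buf, source=left, tag=0)]

    interior = local[1000:].sum()          # work that doesn't need the halo

    MPI.Request.Waitall(reqs)              # now the boundary data is required
    print(f"rank {rank}: {interior + recv_buf.sum():.3f}")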
Legend2440
Single-core performance looks pretty stagnant on this graph, especially in the last ten years: https://imgur.com/DrOvPZt
Transistor count has continued to increase exponentially, but single-threaded performance has improved slowly and appears to be leveling off. We may never get another 100x or even 10x improvement in single-threaded performance.
It is going to be necessary to parallelize to see gains in the future.
achierius
But it's not flat? 10% growth a year is still growth.
gopalv
> Computers have gotten a lot faster, even if the clock speed is not that much faster
We're not stagnating, but the same code I thought was too slow in 1998 was good enough in 2008, which is probably not true for code I would've thrown away in 2015.
The only place where that has happened in the last decade is IOPS - old IOPS-heavy code that would have been rewritten with group-commit tricks is probably slower now than a naive implementation that fsync'd all the time (the two styles are sketched below). A 2015 first cut of IO code probably beats the spinning-disk-optimized version from the same year on modern hardware.
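A minimal sketch of the two styles, for the curious (the path and batch size are made up; on a spinning disk the grouped version wins by a huge margin, while on modern NVMe the gap is far smaller):

    # Append byte records to a log. The naive version fsyncs every record
    # (IOPS-bound on spinning disks); the grouped version batches records
    # and fsyncs once per batch - the classic group-commit trick.
    import os

    def write_naive(path, records):
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        try:
            for rec in records:
                os.write(fd, rec)
                os.fsync(fd)               # one fsync per record
        finally:
            os.close(fd)

    def write_grouped(path, records, batch=64):
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        try:
            buf = []
            for rec in records:
                buf.append(rec)
                if len(buf) >= batch:
                    os.write(fd, b"".join(buf))
                    os.fsync(fd)           # one fsync per batch
                    buf.clear()
            if buf:
                os.write(fd, b"".join(buf))
                os.fsync(fd)
        finally:
            os.close(fd)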
The clock-speed comment is totally on the money though - a lot of those clocks were spent waiting on memory latency, and that has improved significantly over the years, particularly if you use Apple Silicon-style memory that sits physically closer (a smaller light cone) than the DIMMs of the past.
Legend2440
A lot of clocks are still spent waiting for memory. GPUs in particular are limited by memory bandwidth despite a memory bus that runs at terabytes per second.
Back when I started programming, it was reasonable to precompute lookup tables for multiplications and trig functions. Now you'd never do that - it's far cheaper to recompute it than to look it up from memory.
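For anyone who wants to spell the trade-off out, a minimal sketch of the old trick (the table size and inputs are illustrative; which side wins depends on the hardware, the cache pressure, and whether you're in compiled or interpreted code):

    # Nearest-entry sine lookup table vs. recomputing with math.sin.
    # The table trades accuracy and cache space for fewer arithmetic ops -
    # a good deal on old CPUs, usually a bad one on modern ones.
    import math
    import timeit

    N = 4096
    TWO_PI = 2 * math.pi
    TABLE = [math.sin(TWO_PI * i / N) for i in range(N)]   # precomputed once

    def sin_lookup(x):
        return TABLE[int(x * N / TWO_PI) % N]

    def sin_compute(x):
        return math.sin(x)

    xs = [i * 0.001 for i in range(10_000)]
    print("lookup :", timeit.timeit(lambda: [sin_lookup(x) for x in xs], number=100))
    print("compute:", timeit.timeit(lambda: [sin_compute(x) for x in xs], number=100))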
Ygg2
> This is false? Computers have gotten a lot faster
Depends what you mean by "much". Single-threaded performance no longer doubles in a year. I mean, even on the GPU side, you get graphics that look slightly better for 2-4x the cost (see street prices of the 2080 vs 3080 vs 4080).
Computing has hit the point of diminishing returns; exponential growth at linear prices is no longer possible.
paulsutter
Could you share some numbers on this? Lots of folks would be interested I'm sure
jeffbee
Yeah, that detail sinks the rest of it. Even if we assume datacenter CPUs, where the market preference has been for more cores operating at the same ~2400MHz speed for a long time, what you get for 1 CPU-second these days is ridiculous compared to what you could have gotten 20 years ago. We're talking about NetBurst Xeons as a baseline.
jeeyoungk
DuckDB would've been a good example to include, because it tries to meet the need that usually drives horizontal scalability with an efficient single-node implementation instead. If your use case stays below the point where horizontal scaling is genuinely required (and in the modern world, the mix of clever implementations and crazy powerful computers keeps that point high), you can tackle quite a large workload (see the sketch below).
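A minimal single-node sketch, assuming a local Parquet file (the file name and query are made up):

    # DuckDB scans and aggregates a local Parquet file in parallel,
    # using all cores on one machine - no cluster involved.
    import duckdb

    duckdb.sql("""
        SELECT user_id, count(*) AS events
        FROM 'events.parquet'   -- hypothetical local file
        GROUP BY user_id
        ORDER BY events DESC
        LIMIT 10
    """).show()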
awkward
I suppose if you're doing one you're not doing the other - the promise of future horizontal scale definitely justifies a lot of arguments about premature optimization.
However, they aren't necessarily opposed. Optimization is usually subtractive - it's slicing parts off the total runtime. Horizontal scale is multiplicative - you're doing the same thing more times. Outside some very specific limits, efficiency usually makes horizontal scaling more effective: a slightly shorter runtime, repeated many times over, adds up to a much shorter total runtime (rough numbers below).
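Rough numbers to illustrate the subtractive-vs-multiplicative point (all values are made up):

    # 1M tasks at 100 ms each = 100,000 CPU-seconds of total work.
    tasks, per_task_s, machines = 1_000_000, 0.100, 50

    cpu_seconds = tasks * per_task_s
    print(cpu_seconds / machines)          # wall clock on 50 machines: 2000 s

    # A 20% per-task optimization shrinks the total work itself,
    # so it pays off at every machine count.
    optimized = cpu_seconds * 0.8
    print(optimized / machines)            # 1600 s, and 20% less compute billed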
einpoklum
I'd say they're not fundamentally at odds, but they're at odds with a "greedy approach". That is, it is much easier to scale out when you're willing to make constraining assumptions about your program, and willing to pay a lot of overhead for distributed resource management, migrating pieces of work, etc. If you want to scale while maintaining efficiency, you have to be aware of more things about the work being distributed; you have to work much harder to avoid different kinds of overhead and idleness; and if you really want to go the extra mile, you need to think about how to turn the distribution partially to your _benefit_ (for example: using the overhead you already pay for fault tolerance or high availability to store copies of your data in different formats, so that different computations can each prefer one format over the other - on a single machine you wouldn't even have the extra copies).
Joel_Mckay
Depends what you are optimizing, and whether your design uses application-layer implicit load-balancing. Avoiding constraints within the design patterns before they hit the routers can often reduce traditional design cost by 37 times or more.
YMMV; it depends on whether your data stream states are truly separable. =3
xzyyyz
Not convincing. (Horizontal) scalability comes at a cost, but it changes the size of the problem we can handle considerably.
"The downside is that for the past couple of decades computers haven't gotten much faster, except in ways that require recoding (like GPUs and multicore)."
This is false? Computers have gotten a lot faster, even if the clock speed is not that much higher. A single modern CPU core turboing at ~5GHz is going to be significantly faster than a 20-year-old CPU overclocked to ~4.5GHz.