Are efficiency and horizontal scalability at odds?
19 comments · February 12, 2025 · datadrivenangel
PaulKeeble
An Intel 12900K (Gen 12) compared to a 2600K (Gen 2, launched 2011) is about 120% faster, or a bit over 2x, in single-threaded applications. Those +5-15% uplifts every generation add up over time, but it's nothing like the earlier years, when performance might double in a single generation.
It really depends on whether the application uses AES-256 and other modern instructions. The 12900K has 16 cores vs the 2600K's 4, although 8 of those extra cores are E-cores. The performance increase doesn't necessarily come for free: the application may need to be adjusted to utilise those extra cores, especially when half of them are slower, to ensure the workload is distributed properly.
Even within vertical scaling by just getting a new processor, it's interesting that much of the big benefit comes from targeting the new instructions and then the new cores, both of which may require source updates to see a significant performance uplift (a rough sketch of the core-spreading part follows after the link).
https://www.cpu-monkey.com/en/compare_cpu-intel_core_i7_1270...
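A minimal sketch of that kind of adjustment, assuming an embarrassingly parallel, CPU-bound loop (the work() function, chunk size, and worker count are illustrative, not taken from the comparison above):

    # Spread a CPU-bound loop over all available cores. Many small chunks
    # plus dynamic dispatch, so the slower E-cores don't leave the P-cores
    # idle at the end of the run.
    from multiprocessing import Pool
    import os

    def work(chunk):
        # stand-in for the real per-item computation
        return sum(x * x for x in chunk)

    if __name__ == "__main__":
        data = list(range(10_000_000))
        chunks = [data[i:i + 50_000] for i in range(0, len(data), 50_000)]
        with Pool(processes=os.cpu_count()) as pool:
            total = sum(pool.imap_unordered(work, chunks))
        print(total)

The point being: none of this comes automatically with the new CPU; someone has to restructure the loop.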
einpoklum
> is about 120% faster or a bit over 2 times in single threaded applications
1. Doesn't that also account for speedups in memory and I/O?
2. Even if the app is single-threaded, the OS isn't, so unless it's very very inactive other than the foreground application (which is possible), there might still be an effect of the higher core count.
no_wizard
Funnily enough, most apps aren't taking enough advantage of the multi-core, multi-threaded environments that are common across all major platforms.
The single biggest bottleneck to improvement is the general lack of developers using those APIs to the fullest extent when designing applications. It's not really hardware anymore.
Though, to the points being made, we aren't seeing the 18-month doubling we did in the earlier decades of computing.
jaggederest
Unless you're multitasking, the OS running on a separate core gets you about a 5-10% speedup. It's not really noteworthy.
Unless you lived through the 1990s, I don't think you understand how fast things were improving. Routine doubling of scores every 18 months is an insane thing. In 1990 the state of the art was 8 MHz chips. By 2002, the state of the art was a 5 GHz chip. So almost a thousand times faster in a little over a decade.
Are chips now a thousand times faster than they were in 2015? No they are not.
bee_rider
I think it is often the case that people want to describe the problem as “single core performance has stagnated for decades” because it makes it look like their solution is necessary to make any progress at all.
Actually, single core performance has been improving. Not as fast as it was in the 90’s, maybe, but it is improving.
However, we can speed things up even more by using multiple computers. And it is a really interesting problem where you get to worry about all sorts of fun things, like hiding MPI communication behind compute (sketched below).
Nobody wants to say “I’ve found that I can make an already fast process even faster by putting in a lot of effort, which I will do because my job is actually really fun.” Technical jobs are supposed to be stressful and serious. The world is doomed and science will stop… unless I come up with a magic trick!
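To make the communication-hiding point concrete, a minimal sketch using mpi4py (the ranks, buffer sizes, and the interior "compute" step are all illustrative):

    # Overlap a non-blocking boundary exchange with interior work, and only
    # block once the incoming data is actually needed.
    # Run under an MPI launcher, e.g. mpirun -n 4 python overlap.py
    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    local = np.random.rand(1_000_000)      # this rank's chunk of work
    send_buf = local[:1000].copy()         # boundary values to share
    recv_buf = np.empty(1000)

    left, right = (rank - 1) % size, (rank + 1) % size
    reqs = [comm.Isend(send_buf, dest=right, tag=0),
            comm.Irecv(recv_buf, source=left, tag=0)]

    interior = local[1000:].sum()          # work that doesn't need the halo

    MPI.Request.Waitall(reqs)              # now the boundary data is required
    print(f"rank {rank}: {interior + recv_buf.sum():.3f}")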
Legend2440
Single-core performance looks pretty stagnant on this graph, especially in the last ten years: https://imgur.com/DrOvPZt
Transistor count has continued to increase exponentially, but single-threaded performance has improved slowly and appears to be leveling off. We may never get another 100x or even 10x improvement in single-threaded performance.
It is going to be necessary to parallelize to see gains in the future.
achierius
But it's not flat? 10% growth a year is still growth.
gopalv
> Computers have gotten a lot faster, even if the clock speed is not that much faster
We're not stagnating, but the same code I thought was too slow in 1998 was good enough in 2008, which is probably not true for code I would've thrown away in 2015.
The only place where that has happened in the last decade is IOPS - old IOPS-heavy code that would have been rewritten with group-commit tricks is probably slower now than a naive implementation that fsync'd all the time (the two styles are sketched below). A 2015 first cut of IO code probably beats the spinning-disk-optimized version from the same year on modern hardware.
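A minimal sketch of the two styles, for the curious (the path and batch size are made up; on a spinning disk the grouped version wins by a huge margin, while on modern NVMe the gap is far smaller):

    # Append byte records to a log. The naive version fsyncs every record
    # (IOPS-bound on spinning disks); the grouped version batches records
    # and fsyncs once per batch - the classic group-commit trick.
    import os

    def write_naive(path, records):
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        try:
            for rec in records:
                os.write(fd, rec)
                os.fsync(fd)               # one fsync per record
        finally:
            os.close(fd)

    def write_grouped(path, records, batch=64):
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        try:
            buf = []
            for rec in records:
                buf.append(rec)
                if len(buf) >= batch:
                    os.write(fd, b"".join(buf))
                    os.fsync(fd)           # one fsync per batch
                    buf.clear()
            if buf:
                os.write(fd, b"".join(buf))
                os.fsync(fd)
        finally:
            os.close(fd)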
The clock-speed comment is totally on the money though - a lot of those clocks were spent waiting on memory latency, and that has improved significantly over the years, particularly if you use Apple Silicon-style memory that sits physically closer (a smaller light cone) than the DIMMs of the past.
Legend2440
A lot of clocks are still spent waiting for memory. GPUs in particular are limited by memory bandwidth despite a memory bus that runs at terabytes per second.
Back when I started programming, it was reasonable to precompute lookup tables for multiplications and trig functions. Now you'd never do that - it's far cheaper to recompute it than to look it up from memory.
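For anyone who wants to spell the trade-off out, a minimal sketch of the old trick (the table size and inputs are illustrative; which side wins depends on the hardware, the cache pressure, and whether you're in compiled or interpreted code):

    # Nearest-entry sine lookup table vs. recomputing with math.sin.
    # The table trades accuracy and cache space for fewer arithmetic ops -
    # a good deal on old CPUs, usually a bad one on modern ones.
    import math
    import timeit

    N = 4096
    TWO_PI = 2 * math.pi
    TABLE = [math.sin(TWO_PI * i / N) for i in range(N)]   # precomputed once

    def sin_lookup(x):
        return TABLE[int(x * N / TWO_PI) % N]

    def sin_compute(x):
        return math.sin(x)

    xs = [i * 0.001 for i in range(10_000)]
    print("lookup :", timeit.timeit(lambda: [sin_lookup(x) for x in xs], number=100))
    print("compute:", timeit.timeit(lambda: [sin_compute(x) for x in xs], number=100))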
Ygg2
> This is false? Computers have gotten a lot faster
Depends what you mean by "much". Single-threaded performance no longer doubles in a year. I mean, even on the GPU side, you get graphics that look slightly better for 2-4x the cost (see street prices of the 2080 vs 3080 vs 4080).
Computing has hit the point of diminishing returns; exponential growth at linear prices is no longer possible.
paulsutter
Could you share some numbers on this? Lots of folks would be interested I'm sure
jeffbee
Yeah, that detail sinks the rest of it. Even if we assume datacenter CPUs, where the market preference has been for more cores operating at the same ~2400MHz speed for a long time, what you get for 1 CPU-second these days is ridiculous compared to what you could have gotten 20 years ago. We're talking about NetBurst Xeons as a baseline.
jeeyoungk
DuckDB would've been a good example to include, because it tries to meet the need that usually drives horizontal scalability with an efficient single-node implementation instead. If your use case stays below the point where horizontal scaling is genuinely required (and in the modern world, the mix of clever implementations and crazy powerful computers keeps that point high), you can tackle quite a large workload (see the sketch below).
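A minimal single-node sketch, assuming a local Parquet file (the file name and query are made up):

    # DuckDB scans and aggregates a local Parquet file in parallel,
    # using all cores on one machine - no cluster involved.
    import duckdb

    duckdb.sql("""
        SELECT user_id, count(*) AS events
        FROM 'events.parquet'   -- hypothetical local file
        GROUP BY user_id
        ORDER BY events DESC
        LIMIT 10
    """).show()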
awkward
I suppose if you're doing one you're not doing the other - the promise of future horizontal scale definitely justifies a lot of arguments about premature optimization.
However, they aren't necessarily opposed. Optimization is usually subtractive - it's slicing parts off the total runtime. Horizontal scale is multiplicative - you're doing the same thing more times. Outside some very specific limits, efficiency usually makes horizontal scaling more effective: a slightly shorter runtime, repeated many times over, adds up to a much shorter total runtime (rough numbers below).
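Rough numbers to illustrate the subtractive-vs-multiplicative point (all values are made up):

    # 1M tasks at 100 ms each = 100,000 CPU-seconds of total work.
    tasks, per_task_s, machines = 1_000_000, 0.100, 50

    cpu_seconds = tasks * per_task_s
    print(cpu_seconds / machines)          # wall clock on 50 machines: 2000 s

    # A 20% per-task optimization shrinks the total work itself,
    # so it pays off at every machine count.
    optimized = cpu_seconds * 0.8
    print(optimized / machines)            # 1600 s, and 20% less compute billed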
einpoklum
I'd say they're not fundamentally at odds, but they're at odds with a "greedy approach". That is, it is much easier to scale out when you're willing to make constraining assumptions about your program, and willing to pay a lot of overhead for distributed resource management, migrating pieces of work, etc. If you want to scale while maintaining efficiency, you have to be aware of more things about the work being distributed; you have to work much harder to avoid different kinds of overhead and idleness; and if you really want to go the extra mile, you need to think about how to turn the distribution partially to your _benefit_ (for example: using the overhead you already pay for fault tolerance or high availability to store copies of your data in different formats, so that different computations can each prefer one format over the other - on a single machine you wouldn't even have the extra copies).
Joel_Mckay
Depends what you are optimizing, and whether your design uses application-layer implicit load-balancing. Avoiding constraints within the design patterns before they hit the routers can often reduce traditional design cost by 37 times or more.
YMMV; it depends on whether your data stream states are truly separable. =3
xzyyyz
Not convincing. (Horizontal) scalability comes at a cost, but it changes the size of the problem we can handle considerably.
"The downside is that for the past couple of decades computers haven't gotten much faster, except in ways that require recoding (like GPUs and multicore)."
This is false? Computers have gotten a lot faster, even if the clock speed is not that much higher. A single modern CPU core turboing at ~5GHz is going to be significantly faster than a 20-year-old CPU overclocked to ~4.5GHz.