The testing pyramid is an outdated economic model
44 comments · January 20, 2025
creesch
hitchstory
I think the premise is correct and I think you are disagreeing with it.
Yes, the pyramid was set out as a goal in its original incarnation. That was deeply wrong. The shape ought to be emergent, determined by the nature of the app being tested (I went into detail on what should determine that here: https://news.ycombinator.com/item?id=42709404).
Some of the most useful tests I've worked with HAVE had a large GUI tip. The GUI behavior was the most stable surface whose behavior was clearly defined and which everybody agreed upon. All the code got tested. GUI tests provided the greatest freedom to refactor, covered the most bugs, and provided by far the most value on that project.
GUI tests are not inherently fragile or inherently too slow either. That is just a tendency, and a highly context-specific one - and as the "pyramid" demonstrates, if you build a rule out of a context-specific tendency, it's going to be a shit rule.
creesch
> Some of the most useful tests I've worked with HAVE had a large GUI tip. The GUI behavior was the most stable surface whose behavior was clearly defined and which everybody agreed upon.
This might be true, but it might also say something about the layers below and actually be a symptom of a larger issue within the development organisation.
> GUI tests are not inherently fragile or inherently too slow either.
Compared to testing APIs or unit tests they are, though. Not only do you need a machine to navigate an interface that is actually intended for humans, you also need to deal with the additional overhead.
hitchstory
> something about the layers below
Absolutely. They were tests over a big ball of mud in a company I had joined recently.
This is, I think, the only good way to work with what is probably (unfortunately) the most common type of real-world code architecture.
If your testing approach can't deal with big, fragile balls of mud then it is bad. This is why I don't have a lot of respect for the crowd that thinks you must do DI first "in order to be able to test". Ball-of-mud architectures are fragile and will break under attempts to introduce dependency inversion.
> Compared to testing APIs or unit tests they are, though.
In the above example there probably wasn't a single code interface or API under the hood that was any good. Coupling to any of those interfaces was fragile with a capital F if you actually expected to refactor any of them (which I did).
Even for decent quality code, the freedom to refactor interfaces is wildly underrated and it is curtailed by coupling a test to it.
orwin
I'm pretty sure you ought to write the tests you think are needed; the "form" is not a good metric.
Our libraries with a lot of logic and calculations are dominated by unit tests; our libraries that talk to external APIs are dominated by integration tests. That's just good testing. I'm not sure you need to imagine a pyramid or a vase to decide which tests to write.
diggan
People are addicted to metrics, so when "code coverage" becomes something to measure, people tend to go crazy with the testing, even for trivial stuff that doesn't really need to be put under test.
My personal rule of thumb is something like: if testing makes you go slower, you're doing too little or too much of it; if it makes you go faster, you're doing the right amount.
If you find yourself having to rewrite 10% of the test suite every time the code changes, you're probably doing too much testing (or not treating your test code as production code). If you find yourself breaking stuff all over the place when doing other things, you're doing too little testing.
As with most things, it's a balance; going too far in either direction will hurt.
OJFord
That probably works well when you've already established a good baseline, but when tests barely exist (or the project is greenfield) or they're crap, I find it can be a real slowdown to try to test something you know you should.
Especially if there's something not entirely straightforward about it, like you need to figure out a way to instrument/harness something for the first time so that you can actually test against it. (Arguably inherently doesn't happen with unit tests though I guess.)
diggan
> like you need to figure out a way to instrument/harness something for the first time so that you can actually test against it
Indeed, and sometimes it's not worth figuring out, but sometimes it is. For example, if you have a piece of code that hasn't changed since the beginning, and there have been no bugs stemming from that code, adding tests to it is kind of futile unless you already know it has to be refactored soon.
On the other hand, if you join an existing project with almost no tests, and for the last N months that code has been responsible for introducing lots of bugs, then it's probably worth spending the time to build some testing infrastructure around the project, with that part first in line to be put under lots of tests.
Basically, if the calculation shows that you can save more time by not having those bugs, you have some budget to spend on making the tests work well for that part.
anymouse123456
> I find it can be a real slowdown to...
Not OP, but this is the vibes approach I often use (and believe they're advocating for).
If it feels painful to add a new test, it's likely time (or nearly time) to make adding tests - or at least that test - easier.
I've found that once it's easy to add tests, new tests often start appearing, so that first effort can be some of the most high-impact work available to me.
cogman10
Completely agree.
I think the problem is that some devs want hard and fast rules, when a lot of the time the right answer is "it depends" and actions dictated by experience.
anymouse123456
The thing that frustrated me as a new player was that there often seemed to be two opposing poles, and neither was very helpful.
There is the dogmatic rules crowd (which triggers my self-diagnosed oppositional defiance disorder), and the "it depends" crowd, which left me screaming, "ON WHAT?!"
When I was in this position, I found Kent Beck and Martin Fowler's notion of "Code Smells" [0] really helpful. Though admittedly, the comprehensive enumeration with associated Refactorings was probably a bridge too far.
"Code Smells" lean toward the "it depends" vibe, but with just enough structure to aid in decision making. It also bypasses my inflexible opposition to stupid rules in stupid places.
I try to frame too much or too little testing as a Code Smell, and discussing it that way often (not always) leads to reasonably easy consensus about what we should (or shouldn't) do about it.
cogman10
> which left me screaming, "ON WHAT?!"
The "on what" is context and situation dependent. Heck, there's even an aspect of personal preference in there.
From that perspective, a code smell is very similar to a real-life smell. Garlic is an awesome ingredient, and in the right context, like a good Italian dish or pizza, it's the thing that makes you go "mmm". However, if you are making oatmeal or a dessert, garlic is probably the last smell you want.
Code smells are much the same way and, like garlic, people will disagree on what is a code smell and what is good coding practice. While there are some smells that are almost universally despised (such as raw sewage), there are others that can be arguably good or bad based on context or preference.
To pull this out further, a common code smell is "long methods". Yet if you are writing something like a parser or a CPU emulator, a giant switch statement really is going to be the right tool at some point, even though it might produce a 1,000-line monstrosity of a method.
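To make that concrete, here's a minimal Kotlin sketch of such a dispatch method - a made-up toy instruction set, not any real CPU; a production emulator's version of this `when` would run to hundreds of branches:

```kotlin
// Toy CPU: one long `when` over opcodes is arguably the clearest
// structure here, even though a real instruction set would turn this
// method into the 1,000-line "monstrosity" described above.
class ToyCpu(private val memory: IntArray) {
    var acc = 0 // accumulator
    var pc = 0  // program counter

    fun step() {
        when (val opcode = memory[pc++]) {
            0x00 -> { /* NOP */ }
            0x01 -> acc = memory[pc++]          // LOAD immediate
            0x02 -> acc += memory[pc++]         // ADD immediate
            0x03 -> memory[memory[pc++]] = acc  // STORE to address
            0x04 -> pc = memory[pc]             // JMP absolute
            // ...a real emulator continues for hundreds of opcodes...
            else -> error("unknown opcode: $opcode")
        }
    }
}
```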
Martin acknowledges this in his essay.
> ... smells don't always indicate a problem. Some long methods are just fine. You have to look deeper to see if there is an underlying problem there - smells aren't inherently bad on their own - they are often an indicator of a problem rather than the problem themselves.
This is where I think the correct position is just "it depends", with the understanding that even if something isn't your preference, it might be someone else's. Today's best practice has a nasty tendency to turn into tomorrow's code smell. Being aware of that will help you not jump too quickly to conclusions about the state of a code base or the competence of its devs. You might even learn something cool about what you can do by breaking the rules/smells/dogma.
I know it can be frustrating, but a lot of this really does just come with experience, and with the humility to know that you and everyone else won't always be right about everything. There's no high priest of good code.
blueflow
Yeah, it's the "model" that is outdated. Not the testing itself.
VMG
> The pyramid is also an artifact of the era in which it was created.
> Computers were slower, testing and debugging tools were rudimentary, and developer infrastructure was bloated, cumbersome, and inefficient.
What AMD giveth, Electron taketh away.
No matter how fast computers get, developers will figure out a way to use that extra compute to make the build and the test cycle slower again.
Of course it is all relative - it is hard to define what a "unit" test is when you are building on top of enormous abstractions to begin with.
No matter what you call the test, it should be fast. I feel productive when I can iterate on a piece of code with a two-second feedback loop.
giorgioz
> What AMD giveth, Electron taketh away.
This is actually true, but the moralistic negative tone, with no explanation, makes me think the writer did not understand why this happens and why it has both pros and cons. It's similar to another statement I've heard on this subject: "It's pointless to add/widen roads, there will always be traffic." It's true there will always be traffic, but it's not pointless. There will always be traffic because moving more cars becomes faster, so more people drive. Consider, though, that traffic on a single lane served 100 people, while traffic on a two-lane street serves many more.
The same is true for software development. Computers get faster, but programs tend to stay at roughly the same perceived speed, just as roads widen and yet there is still the same amount of traffic. When computers get faster, developers can write code faster, so they can write more code and/or cheaper code. Writing programs also becomes cheaper, so developers need less expertise and training. The computer that brought astronauts to the moon was probably less powerful than today's smart thermostat, yet landing on the moon with that computer required a team of people likely at PhD level, intensely focused and dedicated, all socially and culturally adjacent to the inventors of the computer. By comparison, today's programs do trivial things using immense resources. And yet, because many more developers can code, there are also immensely more programs, covering millions of use cases, developed all over the world - in some cases by people who do not even speak English.
So programs did become less efficient, because the true bottleneck was not the efficiency of the program. The true bottleneck was developer hours and skills.
This doesn't mean it's okay for all programs to be slow, or that you should be satisfied using programs you perceive as slow. The relationship between a program's speed/efficiency and its UX is a curve of diminishing returns: at first, the faster the program gets, the better the UX; past a certain speed, the UX improves only marginally. If the final user cannot distinguish between the speeds of two programs, the bottleneck is no longer speed, and some other characteristic becomes the bottleneck. That said, there will always be work for efficiency engineers and low-level developers writing more performant code - it's just that not all code needs to be written as efficiently as possible.
VMG
I didn't intend a negative tone. It is meant as an observation: while raw system speed has increased by orders of magnitude, certain high-level operations seem to remain constant in speed over time.
The time it takes to boot an operating system, start a program, compile a program, or run a test suite has stayed roughly constant over my career.
That indicates the determining factor is not the clock speed of the underlying system but the pain tolerance of the users or developers.
giorgioz
Thank you for answering and explaining, @VMG! Yes, exactly: some attributes of software will naturally gravitate to values where they are good enough, and then something else becomes the bottleneck and determining factor. I still believe, though, that there will be a niche for people who need the extra performance because their use case benefits from it.
weinzierl
"The pyramid is also an artifact of the era in which it was created. Computers were slower, testing and debugging tools were rudimentary, and developer infrastructure was bloated, cumbersome, and inefficient."
In addition to that, I think a major point is that the testing pyramid was conceived in a world where desktop GUI applications ruled. Testing a desktop GUI was incredibly expensive and automation extremely fragile. That is in my opinion where the pointy tip of the pyramid came from in the first place.
"But the majority of tests are of the entire service, via its API [..]"
I think this is where you get the best bang for your buck, because your goal of keeping your tests robust is well aligned with the goal of keeping the API stable. This is not the case above and below, where the goal of robust tests is always at odds with change, quick adaptation and rapid iteration.
dtech
The pointy shape still holds, because we often have multiple services now, and testing across services is difficult and expensive.
mattgreenrocks
IMO, it's less about the type of test and more about your ability to get in and test as many code paths as you can of features that your users perceive as critical.
Sometimes that requires E2E tests, sometimes that's integration or unit tests.
My preference is to use something like functional core/imperative shell as much as possible, but the more external dependencies you have, the more work you have to do to create an isolated environment free of IO. Not saying it isn't worthwhile, but sometimes it's easier to simply accept that the tests will be slower due to relying on real endpoints and move on. After all, tests should support velocity, not be an end in themselves.
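For readers unfamiliar with the pattern, here's a minimal Kotlin sketch of functional core/imperative shell; all names are hypothetical:

```kotlin
// Functional core: pure decision logic, fast and deterministic to test.
data class Account(val id: String, val balance: Long)

sealed interface WithdrawResult
data class Approved(val newBalance: Long) : WithdrawResult
data class Rejected(val reason: String) : WithdrawResult

// Pure function: no IO, output depends only on the inputs.
fun decideWithdrawal(account: Account, amount: Long): WithdrawResult =
    when {
        amount <= 0 -> Rejected("amount must be positive")
        amount > account.balance -> Rejected("insufficient funds")
        else -> Approved(account.balance - amount)
    }

// Imperative shell: the only layer that touches the outside world.
interface AccountRepository {
    fun load(id: String): Account
    fun saveBalance(id: String, newBalance: Long)
}

class WithdrawalService(private val db: AccountRepository) {
    fun withdraw(accountId: String, amount: Long): WithdrawResult {
        val account = db.load(accountId)               // IO at the edge
        val result = decideWithdrawal(account, amount) // pure core
        if (result is Approved) db.saveBalance(accountId, result.newBalance)
        return result
    }
}
```

The decision logic gets fast, deterministic unit tests; only the thin shell needs the slower tests against real dependencies.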
mrkeen
We don't need this article, because the 90%-unit/10%-integration split was only ever a goal to aspire to. Just like achieving 90% code coverage - no need for a thinkpiece to say that 40% or 60% is now "the right amount" of code coverage.
We like units because they are fast, deterministic, parallelisable... all the good stuff. Relative to that ideal, integration tests are slower, flakier, more sequential, etc.
While I've never gone full TDD, those guys have it absolutely right that testability is a design/coding activity, not a testing activity. TDD will tell you whether you're writing unit-testable code, but it won't tell you how. Dependency inversion, ports-and-adapters, and hexagonal architecture are the topics to read on how to write testable code.
What's my personal stake in this? Firstly, our bugfix-to-prod-release window is about four hours. Way too long. Secondly, as someone relatively new to this codebase, when I stumble across some suspicious logic I can't just spit out a new unit test to see what it does, since it's so intermingled with MS-SQL and partner integrations. Our methods pass around handles to the DB like candy.
So what I think has happened here is that, as an industry, we generally don't think about writing testable code. Therefore our code is all integrations and no units, and when we go to test it, of course the classic testing pyramid is unachievable.
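To illustrate the ports-and-adapters idea (this is not the codebase described above; all names are hypothetical), a Kotlin sketch in which the suspicious logic becomes testable without MS-SQL:

```kotlin
// Port: the domain's view of persistence. The domain never sees a DB handle.
interface OrderStore {
    fun find(id: String): Order?
    fun save(order: Order)
}

data class Order(val id: String, val paid: Boolean)

// The logic under test is now reachable without a real database.
class OrderService(private val store: OrderStore) {
    fun markPaid(id: String): Boolean {
        val order = store.find(id) ?: return false
        store.save(order.copy(paid = true))
        return true
    }
}

// Test-side adapter: an in-memory fake standing in for MS-SQL.
class InMemoryOrderStore : OrderStore {
    private val orders = mutableMapOf<String, Order>()
    override fun find(id: String) = orders[id]
    override fun save(order: Order) { orders[order.id] = order }
}

fun main() {
    val store = InMemoryOrderStore()
    store.save(Order("42", paid = false))
    check(OrderService(store).markPaid("42"))
    check(store.find("42")?.paid == true)
}
```

The production adapter implements the same port on top of the real database, so the domain logic never couples directly to it.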
bluGill
What is a unit?
I have never seen a formal definition. Without that we cannot have any discussion.
To some, a unit is a function. To some it is a module (generally someone else's module). To some it is an entire application. To some it is the entire computer in your embedded device. To some it is the entire device... Most people have no clue what we are talking about and don't care (and should not have to care).
mrkeen
My working definition is that a unit is something which can be unit-tested. By that I mean: the code being tested has the same properties that you expect of a good unit-test. I guess fast and deterministic are good enough properties for a working definition.
writeFile(fname,"Hello, World") is only one thing, but its behavior will depend on the state of the filesystem, so it's not a unit.
parseComplicatedObject(bytes) could be a unit (even if it calls out to many other sub-parsers - as long as they are also units).
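A minimal Kotlin rendering of that distinction (hypothetical names, with a made-up two-byte "header" format):

```kotlin
import java.io.File

// Not a unit: the outcome depends on filesystem state, permissions,
// free disk space... external conditions the test doesn't control.
fun writeGreeting(fname: String) {
    File(fname).writeText("Hello, World")
}

// A unit by the working definition above: fast and deterministic,
// the output depends only on the input bytes.
fun parseHeader(bytes: ByteArray): Pair<Int, Int>? {
    if (bytes.size < 2) return null
    return bytes[0].toInt() to bytes[1].toInt()
}

fun main() {
    check(parseHeader(byteArrayOf(1, 2)) == 1 to 2) // same input, same output
    check(parseHeader(byteArrayOf()) == null)
}
```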
One thing I see at a lot of companies is efforts to reduce test flakiness. Devs will attempt this work in src/test. But if the code itself is flaky and you change the test from flaky to reliable, you have just decreased the realism of your test. As I mentioned in my comment above, you reduce test flakiness by doing the hard work in src/main. The src/test changes should be easy after that.
jonstewart
From TFA:
Since then, significant progress in both technology and development practices has transformed testing in three key ways:
1) It’s now possible to run a wide range of tests on an application very quickly through its public interface, enabling a broader scope of testing without excessive time or resource constraints.
2) Improved test frameworks have brought down the cost and effort required to write robust, maintainable integration tests, offering accessible, scalable ways of validating the interplay between components.
3) The development of sophisticated debugging tools and enhanced observability platforms has made it much easier to identify the root causes of failures. This improvement reduces the reliance on narrowly focused unit tests.
These assertions are simply made, not argued or justified. Maybe they apply to the code they're writing? I don't think they apply to my code.
cies
In the place we currently work, we follow a "tests have to be paid for" policy. Either devs believe they can build or fix a piece of code faster by adding tests (usually unit tests), in which case the tests are part of another issue; or the business mandates the tests (usually integration or e2e), which are then put on the board as their own issue.
It is very specific to the business we are in: "move fast and occasionally break things" is acceptable.
OTOH, we have focused a lot on adding type safety to avoid many basic mistakes: exceptions replaced with result types, no more implicit nulls, JS replaced with Elm, Java replaced with Kotlin, SQL-in-strings replaced with jOOQ, and a culture of trying to write code that does not allow bad states to be expressed. This took precedence over writing extensive test suites.
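A sketch of that style in Kotlin (hypothetical names; the Elm and jOOQ pieces are obviously not shown): a result type instead of exceptions, plus a type whose invalid states cannot be constructed:

```kotlin
// A result type: the caller is forced to handle both outcomes.
sealed interface ParseResult {
    data class Ok(val email: Email) : ParseResult
    data class Err(val message: String) : ParseResult
}

// Bad states are unrepresentable: an Email can only exist via parse().
@JvmInline
value class Email private constructor(val value: String) {
    companion object {
        fun parse(raw: String): ParseResult =
            if ("@" in raw && "." in raw.substringAfter("@"))
                ParseResult.Ok(Email(raw))
            else
                ParseResult.Err("not an email address: $raw")
    }
}

fun main() {
    when (val r = Email.parse("user@example.com")) {
        is ParseResult.Ok -> println("valid: ${r.email.value}")
        is ParseResult.Err -> println(r.message)
    }
}
```

Code downstream of parse() never needs a null check or a try/catch for malformed emails, which is exactly the class of test this approach replaces.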
teeray
> 1) It’s now possible to run a wide range of tests on an application very quickly through its public interface, enabling a broader scope of testing without excessive time or resource constraints.
> 2) Improved test frameworks have brought down the cost and effort required to write robust, maintainable integration tests, offering accessible, scalable ways of validating the interplay between components.
> 3) The development of sophisticated debugging tools and enhanced observability platforms has made it much easier to identify the root causes of failures. This improvement reduces the reliance on narrowly focused unit tests.
Citation needed on all of these. Where are the specific tools that make running all of these magically fast and reliable (read: not flaky) integration tests possible?
bluGill
I have long drawn the pyramid with runtime as the vertical axis. At the bottom are tests that run so fast you run them on every build, even if there is no possible way your code changes could have broken them. Then tests you run if they cover what you just changed. Next, tests you run on all CI builds (locally only when they fail). Then tests you run regularly but not on every CI build (every night or every week). Finally, tests that are so expensive you run them rarely - generally manual tests where a human exercises some real-world condition that is hard to set up.
There is value in every level.
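One way to encode those tiers with today's tooling is JUnit 5 tags filtered by the build - a hedged sketch, with made-up tier names:

```kotlin
import org.junit.jupiter.api.Tag
import org.junit.jupiter.api.Test

// Bottom tier: so fast they run on every build.
@Tag("fast")
class PriceMathTest {
    @Test fun roundsHalfUp() { /* pure arithmetic, microseconds */ }
}

// CI tier: run on all CI builds, locally only when they fail.
@Tag("ci")
class CheckoutApiTest {
    @Test fun rejectsExpiredCard() { /* spins up the service */ }
}

// Nightly tier: too slow for every CI run.
@Tag("nightly")
class FullMigrationTest {
    @Test fun migratesLargeDataset() { /* minutes, real data volumes */ }
}
```

A Gradle test task can then select a tier with `useJUnitPlatform { includeTags("fast") }`; the manual tier at the top of the pyramid stays outside the build entirely.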
sohnee
I think we all agree that the majority of tests should be fast, automated, and robust. When the pyramid was written, that probably _did_ mean unit tests. In 2025 it doesn't.
Let's keep the pyramid but rename the segments!
__MatrixMan__
I propose the names: small, mid, and large. More nuanced lingo around tests is typically about sounding smart, when really what you should be doing is not classifying but analyzing your particular case, which will always defy classification at least a little.
tobyhinloopen
It kinda depends on your architecture. If you can run integration tests for cheap, it makes sense to favor them over smaller unit tests.
I like to design my applications so all slow components can be mocked by faster alternatives, and to keep the HTTP stack as thin as possible, so I can basically call a function and assert the output, while the output closely resembles the final HTTP response - either a template rendered with a blob of data, or the blob of data rendered as JSON.
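A rough Kotlin sketch of that shape (all names hypothetical): the handler is a plain function returning a data blob, and the HTTP layer only decides how to render it:

```kotlin
// The data blob that either a template or a JSON serializer will render.
data class UserPage(val name: String, val postCount: Int)

interface UserRepo { fun find(id: String): User? }
data class User(val name: String, val posts: List<String>)

// The "handler" is just a function: input in, blob out. No HTTP in sight.
fun userPageData(userId: String, repo: UserRepo): UserPage? {
    val user = repo.find(userId) ?: return null
    return UserPage(user.name, user.posts.size)
}

// A test calls the function directly - no server, no browser.
fun main() {
    val repo = object : UserRepo {
        override fun find(id: String) =
            if (id == "1") User("ada", listOf("p1", "p2")) else null
    }
    check(userPageData("1", repo) == UserPage("ada", 2))
    check(userPageData("404", repo) == null)
}
```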
Okay? This seems like a fluff blog post, as the trophy concept was coined back in 2018, if I remember correctly. Coming from WireMock it makes sense given their product, but it is just marketing fluff.
Honestly, as long as the GUI tip remains as small as possible, I am mostly fine with whatever shape it takes below that. For modern web applications with a lot of APIs, it does make sense to use a trophy. For other applications without such a communication layer, a more traditional pyramid makes more sense.
What a lot of people seem to completely overlook in discussions like this is that the pyramid isn't a goal in itself. It is intended as a way to think about where you place your tests - more specifically, to place tests where they make sense, provide the most value, and are least fragile.
Which is why the GUI should be avoided for any tests that test logic, hence it being the smallest section of whatever shape you come up with. Everything else depends heavily on what sort of infrastructure you are dealing with, the scope of your application, etc.