Working on complex systems: What I learned working at Google
146 comments
· May 13, 2025 · braza
kubb
I’d wager 90% time spent at Google is fighting incidental organizational complexity, which is virtually unlimited.
repeekad
The phrase thrown around was “collaboration headwind”: the idea was that if project success depends on 1 person with a 95% chance of success, the project also has a 95% chance. But if 10 people each need to succeed with a 95% chance, suddenly the project's success likelihood drops to about 60%…
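The arithmetic is just independent probabilities multiplied together; a quick sketch (the 95% figure is from the anecdote, the rest is illustration):

```python
# Collaboration headwind: if n contributors must each independently
# succeed with probability p, the project succeeds with probability p**n.
def project_success(p_individual: float, n_people: int) -> float:
    return p_individual ** n_people

print(project_success(0.95, 1))              # 0.95
print(round(project_success(0.95, 10), 2))   # 0.6
```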
In reality, lazy domain owners layered on processes, meetings, documents, and multiple approvals until it took 6 months to change the text on a button, ugh
miki123211
Another side of this coin is that the expected payoff from a project depends on how many unrelated projects your organization is engaging in, which is deeply counterintuitive to most people.
Every project carries with it three possibilities: that of success, where the company makes money, that of failure, where the company does not, and that of a "critical failure", where the project goes so wrong that it results in a major lawsuit, regulatory fine or PR disaster that costs the company more than the project was ever expected to make.
If you're a startup, the worst that can happen to your company is the value going to 0. From an investor's perspective, there's not much of a difference between burning all the money ($10m) and not finding product-market-fit (normal failure), or your company getting sued for $3b and going bankrupt (critical failure). The result is the same, the investment is lost. For a large corporation, a $3b lawsuit is far more costly than sinking $10m into a failed project.
You can trade off these three possibilities against each other. Maybe forcing each release through an arduous checklist of legal review or internationalization and accessibility testing decreases success rates by 10%, but moves the "critical failure rate" from 1% to 0.5%. From a startup's perspective, this is a bad tradeoff, but if you're a barely-profitable R&D project at big co, the checklist is the right call to make.
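The tradeoff reads naturally as an expected-value calculation. Here is a toy sketch; all dollar figures and probabilities are hypothetical (only the $10m/$3b orders of magnitude come from the comment above):

```python
# Toy expected value over three outcomes: success, normal failure,
# and critical failure. Losses are capped at `max_loss` to model a
# startup whose value can only go to zero.
def expected_value(p_success, p_critical, payoff=20e6,
                   burn=10e6, disaster=3e9, max_loss=float("inf")):
    p_fail = 1 - p_success - p_critical
    return (p_success * payoff
            - p_fail * min(burn, max_loss)
            - p_critical * min(disaster, max_loss))

# Hypothetical numbers: the checklist costs 10 points of success rate
# but halves the critical-failure rate.
fast, careful = (0.40, 0.01), (0.30, 0.005)

# Big co: the $3b downside dominates, so the checklist wins.
assert expected_value(*careful) > expected_value(*fast)

# Startup: losses are capped at the $10m invested, so speed wins.
assert (expected_value(*fast, max_loss=10e6)
        > expected_value(*careful, max_loss=10e6))
```

The cap on losses is the whole story: remove it and the rare $3b outcome dominates the expectation, which is exactly why the same checklist is rational at big co and irrational at a startup.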
This problem is independent from all the other causes to which bureaucracy is usually attributed, like the number of layers of management, internal culture, or "organizational scar tissue." Just from a legal and brand safety perspective, the bigger your org, the more bureaucracy makes sense, no matter how efficient you can get your org to be.
apercu
> lazy domain owners
Interesting. As a consultant for most of the last 25 years, my experience is that domain owners are typically invested and have strong opinions on the things that impact their jobs.
Executive leadership, on the other hand, doesn't want to actually know the issues and eyes glaze over as they look at their watches because they have a tee time.
pclmulqdq
There's a culture of "I won't approve this unless it does something for me" at Google. So now changing the text on a button comes with 2 minor refactors, 10 obvious-but-ignored bugfixes, and 5 experiments to prove it is actually better.
sollewitt
Coordination Headwind: https://komoroske.com/slime-mold/
xnx
The old "If you want to go fast, go alone. If you want to go far, go together."
zenogantner
Well, good management/tech leadership is about making sure that the risks coming from individual failure points (10 people in your example) are recognized and mitigated, and that the individuals involved can flag risks and conflicts early enough so that the overall project success probability does not go down as you describe...
steveBK123
The assumptions in that math are wrong anyway. Once you depend on 10 people, the chance that they each achieve "95% successful execution" is 0.
This is only partially down to the impossibility of having every staff member on a project be A++ players.
There is coordination RISK not just coordination overhead. Think planning a 2 week trip with your spouse with multiple planes/trains/hotels, museum/exhibit ticket bookings, meal reservations, etc. Inevitably something gets misunderstood/miscommunicated between the two of you and therefore mis-implemented.
Now add more communication nodes to the graph and watch the error rate explode.
Demiurge
And when you’re at a smaller company, 90% of your time is fighting societal complexity, the limit of which also approaches infinity, but at a steeper angle.
No true Scotsman can tell you that reality is surprisingly complex. Sometimes you have the resources to organize and fight these problems, sometimes you use those resources more wisely than another group of people and can share the lessons, and sometimes you have no idea if your lesson is even useful. Let’s judge the story on its merits and learn what we can from it.
apercu
Look, I've never had to design, build, or maintain systems at the scale of a FAANG, but that doesn't mean I haven't been involved in pretty complicated systems (e.g., 5000 different pricing and subsidy rules for 5000 different corporate clients with individually negotiated hardware subsidies (changing all the time) and service plans, commission structures, plus logistics, which involves not only shipping but shipping to specific departments for configuration before the device goes to the employee, etc.).
To pick an arbitrary number: 95% of the time the issues were people problems, not technical ones.
simianwords
Equally important is the amount of time they save because of the available abstractions, like infra, tooling, etc.
materielle
I think I understand what you are saying, and I agree.
Google has all sorts of great infra projects that simplify complex domains. Those are solved problems now, so nobody notices it.
The existence of incidental complexity isn’t evidence that the counter factual is less complexity.
tuyiown
I think this is addressed by the complex-vs-complicated intro. Most problems with uncontrolled/uncontrollable variables will be approached with an incremental solution: you'll restrict those variables, voluntarily or involuntarily, and either let issues be solved organically/manually, or automation will quite simply be abandoned.
That qualifies as complicated. Delving into complicated problems is mostly driven by business opportunity, always has limited scaling, and tends to be discarded by big players.
braza
I don't think this is adequately addressed by the "complicated vs. complex" framing—especially not when the distinction is made using reductive examples like taxes (structured, bureaucratic, highly formalized) versus climate change (broad, urgent, signaling-heavy).
That doesn’t feel right.
Let me bring a non-trivial, concrete example—something mundane: “ePOD,” which refers to Electronic Proof of Delivery.
ePOD, in terms of technical implementation, can be complex to design for all logistics companies out there like Flexport, Amazon, DHL, UPS, and so on.
The implementation itself—e.g., the box with a signature open-drawing field and a "confirm" button—can be as complex as they want from a pure technical perspective.
Now comes, for me at least, the complex part: in some logistics companies, the ePOD adoption rate is circa 46%. In other words, in 54% of all deliveries, you do not have a real-time (not before 36–48 hours) way to know and track whether the person received the goods or not. Unsurprisingly, most of those are still done on paper. And we have:
- Truck drivers are often independent contractors.
- Rural or low-tech regions lack infrastructure.
- Incentive structures don’t align.
- Digitization workflows involve physical paper handoffs, WhatsApp messages, or third-party scans.
So the real complexity isn't only "the technical implementation of ePOD" but: "given the ePOD, how do we maximize its adoption/coverage amid a lot of uncertainty, fragmentation, and human unpredictability on the ground?"
That’s not just complicated, it’s complex, because we have:
- Socio-technical constraints,
- Behavioral incentives,
- Operational logistics,
- Fragmented accountability,
- And incomplete or delayed data.
We went off the highly controlled scenario (arbitrarily bounded technical implementation) that could be considered complicated (if we want to be reductionist, as the OP has done), and now we’re navigating uncertainty and N amount of issues that can go wrong.
tuyiown
I was very centered on the software part of the problem. A complex problem can be solved with a complicated chain of small technical solutions. At the implementation level, it's complicated, not complex: you mostly need knowledge of the general problem to understand the solution, and it's the many added pieces that make things complicated.
My take is that if your complex problem is only solvable by complex software (e.g. not a combination of simple small parts), and _cannot_ be reduced to simpler things, you are in the complex space.
Maybe it's too reductive, and it's just my opinion, but it's a good way for me to gauge the predictability of solving a problem with many unknowns at the engineering level. The dangerous blockers are in the complex space; identifying them early is critical. Complicated stuff can be worked around and solved later.
__MatrixMan__
I don't think it is, because the intro gets it wrong. If a problem's time or space complexity increases from O(n^2) to O(n^3) there's nothing necessarily novel about that, it's just... more.
Complicated on the other hand, involves the addition of one or more complicating factors beyond just "the problem is big". It's a qualitative thing, like maybe nobody has built adequate tools for the problem domain, or maybe you don't even know if the solution is possible until you've already invested quite a lot towards that solution. Or maybe you have to simultaneously put on this song and dance regarding story points and show continual progress even though you have not yet found a continuous path from where you are to your goal.
Climate change is both, doing your taxes is (typically) merely complex. As for complicated-but-not-complex, that's like realizing that you don't have your wallet after you've already ordered your food: qualitatively messy, quantitatively simple.
To put it differently: complicated is about the number of different domains you have to consider; complex is about how difficult the considerations within a given domain are.
Perhaps the author's usage is common enough in certain audiences, but it's not consistent with how we discuss computational complexity. Which is a shame since they are talking about solving problems with computers.
rawgabbit
If you consider their history of killing well loved products and foisting unwarranted products such as Google Plus onto customers, Google is for lack of a better word just plain stupid. Google is like a person with an IQ of 200 but would get run over by oncoming traffic because they have zero common sense.
williamdclt
I've not seen "accidental" complexity used to mean "domain" (or "environmental" or "inherent") complexity before. It usually means "the complexity you created for yourself and isn't fundamental to the problem you're solving"
tanelpoder
Also, anything you do with enterprise (cloud) customers. People like to talk about scale a lot and data people tend to think about individual (distributed) systems that can go webscale. A single system with many users is still a single system. In enterprise you have two additional types of scale:
1) scale of application variety (10k different apps with different needs and history)
2) scale of human capability (ingenuity), this scale starts from sub-zero and can go pretty high (but not guaranteed)
mwbajor
I'm a HW engineer and don't really understand "complexity" as this article describes it. I didn't read it in depth, but it doesn't really give any good examples with specifics. Can someone give a detailed example of what the author is really talking about?
junto
Cynefin framework:
aweiher
System Thinking 101
TexanFeller
Rich Hickey is famous for talking about easy vs. simple/complex and essential vs. incidental complexity.
“Simple Made Easy”: https://youtu.be/SxdOUGdseq4?si=H-1tyfL881NawCPA
dmoy
> My immediate reaction in my head was: "This is impossible". But then, a teammate said: "But we're Google, we should be able to manage it!".
Google, where the impossible stuff is reduced to merely hard, and the easy stuff is raised to hard.
dijit
This is probably the most accurate statement possible.
“I just want to store 5TiB somewhere”
“Ha! Did you book multiple Bigtable cells?”
Phelinofist
What are peer-bonuses?
dijit
The idea is if someone helps you in a really big way that you’re able to reward that. So you can ask the company to give the person either credits for an internal store, or a direct addition to their salary for one month.
Obviously, there are limits to how many peer bonuses you can give out, and to whether it's direct money or store credits.
Directly asking for a peer bonus is not very “googly” (and yes, this is a term they use, in case you needed evidence of Google being a bit cultish).
There are companies who help do this “as a service”; https://bonusly.com/
decimalenough
Basically a way to "tip" people for going out of their way to help you, except that the "tip" comes out of the company's pocket, not yours.
To prevent obvious abuse, you need to provide a rationale, the receiver's manager must approve and there's a limit to how many you can dish out per quarter.
yndoendo
I was in kindergarten, watching my fellow classmates get gold star stickers on their work. They were excited when it happened to them. I saw it as being given nothing of real value; anyone could just go to the store and buy them for $1 or $2.
It is a social engineering technique to exploit more work without increasing wages. Just like "Employee of the Month" or a "Pizza Party."
The company I work for does this with gift cards as rewards. I was reprimanded because I sent an email to HR saying that this "gift" is as useful as a wet rag in the rain. I don't eat at restaurants that are franchises or have a ticker on Wall Street. I prefer local brick and mortar over Walmart and will never financially support Amazon.
If you want to truly honor my accomplishments, give me a raise or more PTO. Anything else is futile. That gift card to Walmart has zero value towards a quality purchase, like a RADAR or LiDAR development kit to learn more.
Rebelgecko
You can give someone a $175 bonus for being particularly helpful or going above and beyond. Everyone can give 20/year so it doesn't have to be that crazy of an effort to get one (although most people don't give out all 20 and the limit wasn't even enforced for a while).
It technically requires manager approval but it's kind of a faux pas for a manager to deny one unless it's a duplicate.
socalgal2
Something designed to remove all intrinsic motivation from employees
cmrdporcupine
Or "How many MDB groups do I need to get approved to join over multiple days/weeks, before I can do the 30 second thing I need to do?"
newsclues
“the difficult we do immediately. The impossible takes a little longer” (WWII US Army Corps of Engineers)
fuzzfactor
>“the difficult we do immediately. The impossible takes a little longer”
This was posted in my front office when I started my company over 30 years ago.
It was a no-brainer, same thing I was doing for my employer beforehand. Experimentation.
By the author's distinction in terminology, the complexity relative to the complications in something like Google technology is on a different scale from the near-absolute chaos, relative to the mere remaining complexity, that you get when you apply it to natural science.
I learned how to do what I do directly from people who did it in World War II.
And that was when I was over 40 years younger, plus I'm not done yet. Still carrying the baton in the industrial environment where the institutions have a pseudo-military style hierarchy and bureaucracy. Which I'm very comfortable working around ;)
Well, the army is a massive mainstream corp.
There are always some things that corps don't handle very well, but generals don't always care, if they have overwhelming force to apply, lots of different kinds of objectives can be overcome.
Teamwork, planning, military-style discipline & chain-of-command/org-chart, strength in numbers, all elements which are hallmarks of effective armies over the centuries.
The engineers are an elite team among them. Traditionally like the technology arm, engaged to leverage the massive resources even more effectively.
The bigger the objective, the stronger these elements will be brought to bear.
Even in an unopposed maneuver, steam-rolling all easily recognized obstacles more and more effectively as they up the ante, bigger and bigger unscoped problems accumulate at the same time, exactly the kind that cannot be solved with teamwork and planning (since these are often completely forbidden). Then there must be extreme individual ability far beyond that, and it must emanate from the top decision-maker, or have "equivalent" access to the top individual decision-maker. IOW, such people might as well not even be "in" the org chart, since they are just a few individuals directly attached to the top square; nobody's working for further promotions or recognition beyond that point.
When military discipline in practice is simply not enough discipline, and not exactly the kind that's needed by a long shot.
That's why even in the military there are a few Navy Seals here and there, because sometimes there are serious problems that are the kind of impossible that a whole army cannot solve ;)
brap
“and the easy... well, that’s not a good promo artifact, so never”
gilbetron
I've interviewed many current and ex-Googlers, and one thing we've discovered is that we have to be careful about overindexing on the scale and complexity of the systems they worked on. Google is insanely huge and complex, but has insane and complex tooling to help developers. "I worked on a project that affected 250 million users" is something we'll hear, and it sounds amazing, but in reality, from their perspective, they spent months working through the complex Google dev, QA, and deployment process and pushed out a relatively straightforward change; that change just happened to land in a massive system.
They have a unique and distinctive experience, but it usually isn't what you'd expect. It is rare to encounter someone from Google who actually built something of significance, and those who have are always at the staff+ level and have been there 10+ years.
If I were to make another generalization, the [g|x]ooglers who worked on relatively "small" projects are often the most interesting, as they had the resources to build something from the ground up and attempt some really interesting projects.
neilv
> My immediate reaction in my head was: "This is impossible". But then, a teammate said: "But we're Google, we should be able to manage it!".
"We can do it!" confidence can be mostly great. (Though you might have to allow for the possibility of failure.)
What I don't have a perfect rule for is how to avoid that twisting into arrogance and exceptionalism.
Like, "My theory is correct, so I can falsify this experiment."
Or "I have so much career potential, it's to everyone's advantage for me to cheat to advance."
Or "Of course we'll do the right thing with grabbing this unchecked power, since we're morally superior."
Or "We're better than those other people, and they should be exterminated."
Maybe part of the solution is to respect the power of will, effort, perseverance, processes, etc., but to be concerned when people don't also respect the power and truth of humility, and start thinking of individual/group selves as innately superior?
yodsanklai
Sorry to say, but this sounds a bit like a fantasy. I think the vast majority of Google employees don't see themselves as particularly brilliant or special. Even there, lots of people have imposter syndrome.
Actually, I've found this is a constant in life: whatever you achieve, you end up in a situation where you're pretty average among your peers. You may feel proud to get into Google for a few months, and then you're quickly humbled.
neilv
Understood, but I meant to ask about the more general problem -- not specific to Google, only prompted by that quote.
(Also, to be clear about my examples: I don't think Google is fabricating their dissertation research, nor do I think Google is genocidal.)
If you're suggesting there's a lot of humility, yet "But we can do it!" still works, that's great, and I'd be interested in more nuances of that.
RenThraysk
There is a certain amount of irony when the cookie policy agreement is buggy on a story about complicated & complex systems.
Clicking on "Only Necessary" causes the cookie policy agreement to reappear.
amdivia
Had the same issue
wooque
Same here, it's because you have third party cookies blocked.
junto
My assumption with bugs like this is that they are “geographically based edge cases” that have been poorly tested because the engineers aren't in the right location to test them, but that affect a large number of users without throwing an error that can be logged.
The GDPR banner is only shown in the EU, with the conditional path of rejecting non-essential cookies, and the engineer or QA is based in the U.S.
As a side note, as someone that lives in the EU my pattern of usage here is:
- choose only essential cookies, but if that's not a presented option then
- reject all cookies, but if no reject all available then
- switch the reader mode (or hide distracting items), or if not possible then
- close tab
I’m getting much more aggressive when dealing with cookie-banner dark patterns. I will block third-party advertising cookies as much as possible and support websites that give me an easy way to opt out of them.
jajko
Not for me, on Chrome now
CommenterPerson
It didn't appear on DuckDuckGo either, Thanks.
nonethewiser
I dont see a cookie banner. Thankfully.
ggm
I think there are two myths applicable here. Probably more.
One myth is that complex systems are inherently bad. Armed forces are incredibly complex. That's why it can take 10 or more rear-echelon staff to support one fighting soldier. Supply chain logistics and materiel are complex. Middle Ages wars stopped when gunpowder supplies ran out.
Another myth is that simple systems are always better and remain simple. They can be, yes. After all, DNA exists. But some beautiful things demand complexity built up from simple things. We still don't entirely understand how DNA and environment combine. Much is hidden in this simple system.
I do believe one programming language might be a rational simplification, if you exclude all the DSLs people implement to tune it.
jcranmer
> Middle ages wars stopped when gunpowder supplies ran out.
The arquebus is the first mass gunpowder weapon, and doesn't see large scale use until around the 1480s at the very, very tail end of the Middle Ages (the exact end date people use varies based on topic and region, but 1500 is a good, round date for the end).
In Medieval armies, your limiting factor is generally that food is being provided by ransacking the local area for food and that a decent portion of your army is made up of farmers who need to be back home in the harvest season. A highly competent army might be able to procure food without acting as a plague on all the local farmlands, but most Medieval states lacked sufficient state capacity to manage that (in Europe, essentially only the Byzantines could do that).
zmb_
Following the definition from the article, armed forces seem like a complicated system, not a complex one. There is a structured, repeatable solution for armed forces. They do not exhibit the hallmark characteristics of complex systems listed in the article, like emergent behaviors.
cowboylowrez
not a fan of the article for this reason alone. good points made, but no reason to redefine perfectly good words when we already have words that work fine.
p_v_doom
Agreed. The problem is not complexity. Every system must process a certain amount of information, and the system's complexity must be able to match that amount. The fundamental problem is designing systems that can manage complexity, especially runaway complexity.
jajko
> Middle ages wars stopped when gunpowder supplies ran out
Ukraine would have been conquered by russia rather quickly if the russians weren't so hilariously incompetent at these complex tasks, war logistics being the king of them. Remember that 64 km queue of heavy machinery [1] just sitting still? That was 2022, and we're talking about fuel and food, the basics of logistics support.
destring
I come from a math background, so it's a bit surprising when software engineers don't have such basic systems-modeling vocabulary. Everyone should give Thinking in Systems by Meadows a read.
Pavilion2095
The cookie banner reappears indefinitely on this website when I click 'only necessary' lol.
teivah
Sorry about that, it's my newsletter provider (Substack), which is very buggy sometimes.
romanovcode
Probably because it is overly complex system.
owebmaster
By choice or incompetence, because serving text over HTTP is very well abstracted nowadays.
codydkdc
if only there were some simple solution to host a static website without cookies and other garbage
https://cloud.google.com/storage/docs/hosting-static-website + pick your favorite OSS CMS
nonethewiser
Thankfully I dont see a cookie banner at all. Did you try moving continents?
EasyMark
or ublock and adding all the cookie blocker lists
kossTKR
"This is one possible characteristic of complex systems: they behave in ways that can hardly be predicted just by looking at their parts, making them harder to debug and manage."
To be honest, this doesn't sound too different from many smaller and medium-sized internet projects I've worked on, because of the asynchronous nature of the web: promises, timing issues, and race conditions lead to weirdness that's pretty hard to debug, because you have to "play back" the cascading randomness of request timing, responses, encoding, browser/server shenanigans, etc.
tunesmith
I think the definitions of complex/complicated get muddled with the question of whether something is truly a closed system. Often times something is defined as "complex" when all they mean is that their model doesn't incorporate the externalities. But I don't know if I've come across a description of a truly closed system that has "emergent behavior". I don't know if LLMs qualify.
gilleain
Mostly overlapping definition of what a 'complex system' is with :
https://en.wikipedia.org/wiki/Complex_system
although I understood the key part of a system being complex (as opposed to complicated) to be having a large number of types of interaction. So a system with a large number of parts is not enough; those parts have to interact in a number of different ways for the system to exhibit emergent effects.
Something like that. I remember reading a lot of books about this kind of thing a while ago :)
ninetyninenine
Except computers attempt to model mathematics in an ideal world.
Unless your problem comes from side effects on a computer that can't be modeled mathematically, there is nothing technically stopping you from modeling the problem as a mathematical problem and then solving it via mathematics.
By contrast, the output of an LLM can't be modeled; we literally do not understand it. Are the problems faced by SREs exactly the same? You give a system an input of B and you can't predict the output of A mathematically? It doesn't even have to be a single equation; a simulation can do it.
ratorx
I think the vast majority of SRE problems are in the “side effects” category. But higher level than the hardware-level side effects of the computer that you might be imagining.
The core problem is building a high enough fidelity model to simulate enough of the real world to make the simulation actually useful. As soon as you have some system feedback loops, the complexity of building a useful model skyrockets.
Even in “pure” functions, the supporting infrastructure can be hard to simulate and critical in affecting the outputs.
Even doing something simple like adding two numbers requires an unimaginable amount of hidden complexity under the hood. It is almost impossible for these things to not have second-order effects and emergent behaviour under enough scale.
ninetyninenine
Can you give me an example of some problem that emerged that was absolutely unpredictable.
One of my pet peeves with the usage of complex(ity) outside its traditional time/space sense in computer science is that, most of the time, the OPs of articles around the internet do not make the distinction between boundaried/arbitrary complexity, where the person has most of the control over what is being implemented, and domain/accidental/environmental complexity, which is wide open and carries a lot of intrinsic and mostly unsolvable constraints.
Yes, they are Google; yes, they have a great pool of talent around; yes, they do a lot of hard stuff; but most of the time when I read those articles, I miss those kinds of distinctions.
Not lowballing the guys at Google, they do amazing stuff, but some domains of domain/accidental/environmental complexity (e.g., sea logistics, manufacturing, industry, etc.), where most of the time you do not have the data, are I believe way more complex/harder than most of the problems that they deal with.