Why Twilio Segment moved from microservices back to a monolith

mjr00

> Once the code for all destinations lived in a single repo, they could be merged into a single service. With every destination living in one service, our developer productivity substantially improved. We no longer had to deploy 140+ services for a change to one of the shared libraries. One engineer can deploy the service in a matter of minutes.

If you must deploy every service because of a library change, you don't have services; you have a distributed monolith. The entire idea of a "shared library" which must be kept updated across your entire service fleet is antithetical to how you need to treat services.

wowohwow

I think your point, while valid, is probably a lot more nuanced. From the post it sounds more akin to an Amazon-style shared build and deployment system than an "every library update needs to redeploy everything" scenario.

It's likely there's a single source of truth where you pull libraries or shared resources from. When team A wants to update the pointer for library-latest to 2.0 while the current reference of library-latest is still 1.0, everyone needs to migrate off of it, otherwise things will break due to backwards incompatibility or whatever.

Likewise, if there's a -need- to remove a version for a vulnerability or what have you, then everyone needs to redeploy, sure. But the benefit of that centralization likely outweighs the cost and complexity of tracking the patching and deployment process for each and every service.

I would say those systems -are- microservices, and would likely be classified as such, but from a cost and ease perspective they operate within a shared-services environment. I don't think it's fair to consider this style of design decision a distributed monolith.

By that logic, having a single business entity instead of 140 individual business entities, one per service, would also make it a distributed monolith.

mjr00

> It's likely there's a single source of truth where you pull libraries or shared resources from. When team A wants to update the pointer for library-latest to 2.0 while the current reference of library-latest is still 1.0, everyone needs to migrate off of it, otherwise things will break due to backwards incompatibility or whatever.

No, this misses one of the biggest benefits of services: you explicitly don't need everyone to upgrade library-latest to 2.0 at the same time. If you do find yourself in a situation where you can't upgrade a core library (e.g. SQLAlchemy or Spring) or the underlying Python/Java/Go/etc runtime without requiring updates to every service, you are back in the realm of a distributed monolith.

wowohwow

I disagree. Both can be true at the same time. A good design should not point to library-latest in a production setting; it should point to a stable, known-good version via direct reference, e.g. library-1.0.0-stable.

However, in the world we live in, people choose to point to latest, to avoid manual work and to trust that other teams did the right diligence when updating to the latest version.

You can point to a stable version in the model I described, and still be distributed and a microservice, while depending on a shared service or repository.
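
A minimal sketch of that distinction in pip's requirements format ("shared-lib" is a hypothetical package name):

    # requirements.txt
    # Floating (avoid in production): resolves to whatever is latest at install time
    #   shared-lib
    # Direct reference: pinned to a stable, known-good version
    shared-lib==1.0.0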

3rodents

Yes, you’re describing a distributed monolith. Microservices are independent, with nothing shared. They define a public interface and that’s it, that’s the entire exposed surface area. You will need to do major version bumps sometimes, when there are backwards incompatible changes to make, but these are rare.

The logical problem you’re running into is exactly why microservices are such a bad idea for most businesses. How many businesses can have entirely independent system components?

Almost all “microservice” systems in production are distributed monoliths. Real microservices are incredibly rare.

A mental model for true microservices is something akin to depending on the APIs of Netflix, Hulu, HBO Max and YouTube. They’ll have their own data models, their own versioning cycles and all that you consume is the public interface.

wowohwow

This type of elitist mentality is such a problem and such a drain on software development. "Real microservices are incredibly rare." I'll repeat myself from my other post: by that logic, nothing is a microservice.

Do you depend on a cloud provider? Not a microservice. Do you depend on an ISP for Internet? Not a microservice. Depend on humans to do something? Not a microservice.

Textbook definitions and reality rarely coincide. Rather than taking a fundamentalist approach that leads nowhere, recognize that for all intents and purposes, what I described is a microservice architecture, not a distributed monolith.

reactordev

I was coming here to say this: the whole idea of a shared library couples all those services together. Sounds like someone wanted to be clever and then spread their cleverness all over the platform, dooming all services together.

Decoupling is the first part of microservices. Pass messages. Use JSON. I shouldn't need your code to function, just your API. Then you can be clever and scale out and deploy on Saturdays if you want to, and it doesn't disturb the rest of us.
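
As a minimal sketch of that decoupling (in Python, with a hypothetical event shape): the only thing the two sides share is the agreed JSON contract, never each other's code.

    import json

    # Producer side: all the consumer ever sees is this JSON message.
    event = {"type": "track", "user_id": "u_123", "event": "Signed Up"}
    wire_payload = json.dumps(event)  # what actually crosses the service boundary

    # Consumer side: parses the message; needs no producer code or shared library,
    # only knowledge of the agreed contract.
    received = json.loads(wire_payload)
    assert received["type"] == "track"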

andrewmutz

Needing to upgrade a library everywhere isn’t necessarily a sign of inappropriate coupling.

For example, a library with a security vulnerability would need to be upgraded everywhere regardless of how well you’ve designed your system.

In that example the monolith is much easier to work with.

jameshart

A library which patches a security vulnerability should do so by bumping a patch version, maintaining backward compatibility. Taking a patch update to a library should mean no changes to your code, just rerun your tests and redeploy.

If libraries bump minor or major versions, they are imposing work on all the consuming services to adopt the new version, make compatibility changes, test, and deploy.
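
For example (a sketch using Python's packaging library; the version numbers are hypothetical), a "compatible release" specifier accepts patch releases automatically while rejecting minor and major bumps:

    from packaging.specifiers import SpecifierSet

    # "~=1.4.2" means >=1.4.2 but <1.5.0: security patches flow in
    # without code changes, while minor/major bumps do not.
    spec = SpecifierSet("~=1.4.2")

    print(spec.contains("1.4.7"))  # True  -- patch release, safe to take
    print(spec.contains("1.5.0"))  # False -- minor bump, needs review
    print(spec.contains("2.0.0"))  # False -- major bump, needs migration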

mjr00

While you're right, I can only think of two times in my career where there was a "code red, all services must update now": log4shell and spectre/meltdown (which were a bit different anyway). I just don't think this comes up often enough in practice to be worth optimizing for.

wowohwow

You have not been in the field very long then, I presume? There are multiple per year that require all hands on deck, depending on your tech stack. Just look at the recent NPM supply chain attacks.

mettamage

Example: log4j. That was an update fiasco everywhere.

smrtinsert

1 line change and redeploy

smrtinsert

100%. It's almost like they jumped into it not understanding what they were signing up for.

j45

Monorepos that are reasonably well designed and flexible enough to grow with you can increase development speed quite a bit.

rtpg

I am _not_ a microservices guy (like... at all) but reading this the "monorepo"/"microservices" false dichotomy stands out to me.

I think way too much tooling assumes 1:1 pairings between services and repos (_especially_ CI work). In huge orgs Git/whatever VCS you're using would have problems with everything in one repo, but I do think that there's loads of value in having everything in one spot even if it's all deployed more or less independently.

But so many settings and workflows couple repos together, so it's hard to even have a frontend and backend in the same place if the two teams manage them differently. So you end up having to mess around with N repos and can't easily send the one cross-cutting pull request.

I would very much like to see improvements on this front, where one repo could still be split up on the forge side (or the CI side) in interesting ways, so review friction and local dev work friction can go down.

(shorter: github and friends should let me point to a folder and say that this is a different thing, without me having to interact with git submodules. I think this is easier than it used to be _but_)
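
GitHub Actions offers a partial version of this today via path filters (a sketch; the directory name and commands are hypothetical), though it only scopes CI runs, not review or forge-level boundaries:

    # .github/workflows/frontend.yml -- run frontend CI only when that folder changes
    name: frontend
    on:
      pull_request:
        paths:
          - "frontend/**"
    jobs:
      test:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
          - run: cd frontend && npm ci && npm test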

GeneralMayhem

I worked on building this at $PREV_EMPLOYER. We used a single repo for many services, so that you could run tests on all affected binaries/downstream libraries when a library changed.

We used Bazel to maintain the dependency tree, and then triggered builds based on a custom Github Actions hook that would use `bazel query` to find the transitive closure of affected targets. Then, if anything in a directory was affected, we'd trigger the set of tests defined in a config file in that directory (defaulting to :...), each as its own workflow run that would block PR submission. That worked really well, with the only real limiting factor being the ultimate upper size limit of a repo on Github, but it did take a fair amount of effort (a few SWE-months) to build all the tooling.
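
The core of that trick looks roughly like this (a sketch, not their actual tooling; the path-to-label mapping is naively simplified): map changed files to Bazel labels, then ask `bazel query` for everything downstream.

    import subprocess

    def affected_targets(changed_files):
        """Sketch: list every Bazel target transitively affected by changed files."""
        # Naive path -> label mapping ("lib/util.py" -> "//lib:util.py");
        # real tooling must resolve the owning Bazel package properly.
        labels = ["//{}:{}".format(*f.rsplit("/", 1)) for f in changed_files]
        # rdeps(//..., set(...)) = all reverse dependencies in the workspace.
        query = "rdeps(//..., set({}))".format(" ".join(labels))
        result = subprocess.run(
            ["bazel", "query", query, "--output=label"],
            capture_output=True, text=True, check=True,
        )
        return result.stdout.splitlines()

    # affected_targets(["lib/util.py"]) might return //lib:util, //svc/foo:foo, ...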

MrDarcy

Reading it with hindsight, their problems have less to do with the technical trade-off between microservices and a monolith, and much more to do with the quality and organizational structure of their engineering department. The decisions and reasons given shine a light on the quality. The repository and test layout shine a light on the structure.

Given the quality and the structure neither approach really matters much. The root problems are elsewhere.

CharlieDigital

My observation is that many teams lack strong "technical discipline": someone who says "no, don't do that", makes the case, and takes a stand. It's easy to let the complexity genie out of the bottle if the team doesn't have someone like this with enough clout/authority to actually make the team pause.

monkaiju

Conway's Law shines again!

It's amazing how much explanatory power it has, to the point that I can predict at least some traits about a company's codebase during an interview process, without directly asking them about it.

maxdo

Both approaches can fail. Especially in environments like Node.js or Python, there's a clear limit to how much code an event loop can handle before performance seriously degrades.

I managed a product where a team of 6–8 people handled 200+ microservices. At the same time, I managed teams on another product where 80+ people managed a monolith.

What did I learn? Both approaches have pros and cons.

With microservices, it's much easier to push isolated changes with just one or two people. At the same time, global changes become significantly harder.

That's the trade-off, and your mental model needs to align with your business logic. If your software solves a tightly connected business problem, microservices probably aren't the right fit.

On the other hand, if you have a multitude of integrations with different lifecycles but a stable internal protocol, microservices can be a lifesaver.

If someone tries to tell you one approach is universally better, they're being dogmatic/religious rather than rational.

Ultimately, it's not about architecture, it's about how you build abstractions and approach testing and decoupling.

rozap

To me this rationalization has always felt like duct tape over the real problem, which is that the runtime is poorly suited to what people are trying to do.

These problems are effectively solved on BEAM, the JVM, Rust, Go, etc.

strken

Can you explain a bit more about what you mean by a limit on how much code an event loop can handle? What's the limit, numerically, and which units does it use? Are you running out of CPU cache?

joker666

I assume he means how much work you let the event loop do without yielding. It doesn't matter if there are 200K lines of code when there's no real traffic to keep the event loop busy.

otterley

Discussion in 2018, when this blog post was published: https://news.ycombinator.com/item?id=17499137

nyrikki

> Once the code for all destinations lived in a single repo, they could be merged into a single service. With every destination living in one service, our developer productivity substantially improved. We no longer had to deploy 140+ services for a change to one of the shared libraries. One engineer can deploy the service in a matter of minutes.

This is the problem with the undefined nature of the term `microservices`. In my experience, if you can't develop in a way that allows you to deploy all services independently and without coordination between services, the pattern may not be a good fit for your org's needs.

In microservices' parent, SOA (v2), what they described is a well-known anti-pattern: [0]

    Application Silos to SOA Silos
        * Doing SOA right is not just about technology. It also requires optimal cross-team communications.
    Web Service Sprawl
        * Create services only where and when they are needed. Target areas of greatest ROI, and avoid the service sprawl headache.
If you cannot, due to technical or political reasons, retain the ability to independently deploy a service, regardless of whether you actually choose to deploy independently, you will not gain most of the advantages that were the original selling point of microservices, which had more to do with organizational scaling than technical concerns.

There are other reasons to consider the pattern, especially due to the tooling available, but it is simply not a silver bullet.

And yes, I get that not everyone is going to accept Chris Richardson's definitions[1], but even in more modern versions of this, people always seem to run into the most problems because they try to shove it in a place where the pattern isn't appropriate, or isn't possible.

But kudos to Twilio for doing what every team should be doing: reassessing whether previous decisions are still valid, and making new choices when they aren't.

[0] https://www.oracle.com/technetwork/topics/entarch/oea-soa-an... [1] https://microservices.io/post/architecture/2022/05/04/micros...

yearolinuxdsktp

I would caution that microservices should be architected with technical concerns first; being able to deploy independently is a valid technical concern too.

Doing it for organizational scaling can lead to insular vision and a turf-defensive attitude, as teams are rewarded on the individual service's performance rather than the complete product's performance. Refactoring services then also means organizational refactoring, so the friction to refactor is massively increased.

I agree that patterns should be used where most appropriate, instead of blindly.

What pains me is that a term like “Cloud-Native” has been usurped to mean microservices. Did Twilio just stop having a “Cloud-Native” product by shipping a monolith? According to the CNCF, yes. According to reason, no.

develatio

can you add [2018] to the title, please?

pmbanugo

have they reverted to microservices?

Towaway69

Mono services in a micro repository. /s

chmod775

In practice most monoliths turned into "microservices" are just monoliths in disguise. They still have most of the failure modes of the original monolith, but now with all the complexity and considerable challenges of distributed computing layered on top.

Microservices as a goal is mostly touted by people who don't know what the heck they're doing - the kind of people who tend to mistakenly believe blind adherence to one philosophy or the other will help them turn their shoddy work into something passable.

Engineer something that makes sense. If, once you're done, whatever you've built fits the description of "monolith" or "microservices", that's fine.

However if you're just following some cult hoping it works out for your particular use-case, it's time to reevaluate whether you've chosen the right profession.

Nextgrid

Microservices were a fad during a period where complexity and solving self-inflicted problems were rewarded more than building an actual sustainable business. It was purely a career- & resume-polishing move for everyone involved.

Putting this anywhere near "engineering" is an insult to even the shoddiest, OceanGate-levels of engineering.

ShakataGaNai

Too much of anything sucks. Too big of a monolith? Sucks. Too many microservices? Sucks. Getting the right balance is HARD.

Plus, it's ALWAYS easier/better to run v2 of something when you completely re-write v1 from scratch. The article could have just as easily been "Why Segment moved from 100 microservices to 5" or "Why Segment rewrote every microservice". The benefits of hindsight and real-world data shouldn't be undersold.

At the end of the day, write something, get it out there. Make decisions, accept some of them will be wrong. Be willing to correct for those mistakes or at least accept they will be a pain for a while.

In short: No matter what you do the first time around... it's wrong.

shoo

Great writeup. Much of this is more about testing, how package dependencies are expressed, and many-repo/single-repo trade-offs than about "microservices"!

Maintaining and testing a codebase containing many external integrations ("Destinations") was one of the drivers behind the earlier decision to shatter into many repos: it isolated the impact of Destination-specific test suite failures, which occurred because some tests were actually testing integration with external 3rd-party services.

One way to think about that situation is in terms of packages, their dependency structure, how those dependencies are expressed (e.g. decoupled via versioned artefact releases, directly coupled via monorepo style source checkout), their rates of change, and the quality of their automated tests suites (high quality meaning the test suite runs really fast, tests only the thing it is meant to test, has low rates of false negatives and false positives, low quality meaning the opposite).

Their initial situation was one that rapidly becomes unworkable: a shared library package undergoing a high rate of change, depended on by many Destination packages, each with low-quality test suites, where the dependencies were expressed in a directly-coupled way by virtue of everything existing in a single repo.

There's a general principle here: multiple packages in a single repo with directly-coupled dependencies, where those packages have test suites of wildly varying quality, quickly become a nightmare to maintain. Packages with low-quality test suites that depend upon rapidly changing shared packages generate spurious test failures that need to be triaged and slow down development. Maintainers of packages that depend upon a rapidly changing shared package but do not have high-quality test suites able to detect regressions may find their package frequently gets broken without anyone realising in time.

Their initial move solved this problem by shattering the single repo and trading directly-coupled dependencies for decoupled, versioned dependencies, so that the rate of change of the shared package was decoupled from the per-Destination packages. That was an incremental improvement, but it added the complexity and overhead of maintaining multiple versions of the "shared" library and per-repo boilerplate, which grew over time as more Destinations were added or more changes were made to the shared library while the work to upgrade and retest Destinations was deferred.

Their later move was to reverse this: go back to directly-coupled dependencies, but instead improve the quality of their per-Destination test suites, particularly by introducing record/replay style testing of Destinations. Great move. This means that the test suite of each Destination is measuring "is the Destination package adhering to its contract in how it should integrate with the 3rd party API & integrate with the shared package?" without being conflated with testing stuff that's outside of the control of code in the repo (is the 3rd party service even up, etc).
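
A common open-source take on that record/replay approach (a sketch using the vcrpy library; the endpoint and cassette path are hypothetical, and Segment built its own tooling rather than using vcrpy):

    import requests
    import vcr

    # First run: performs the real HTTP call and records it to a "cassette" file.
    # Later runs: replays the recorded response, so the test is fast, deterministic,
    # and never depends on the 3rd-party service being up.
    @vcr.use_cassette("fixtures/example_destination.yaml")
    def test_destination_send():
        resp = requests.post(
            "https://api.example-destination.com/v1/track",  # hypothetical endpoint
            json={"user_id": "u_123", "event": "Signed Up"},
        )
        assert resp.status_code == 200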

btown

Some important context to this 2018 article is given here: https://www.twilio.com/en-us/blog/archive/2018/introducing-c...

TL;DR they have a highly partitioned job database, where a job is a delivery of a specific event to a specific destination, and each partition is acted upon by at-most-one worker at a time, so lock contention is only at the infrastructure level.

In that context, each worker can handle a similar balanced workload between destinations, with a fraction of production traffic, so a monorepo makes all the sense in the world.
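
A toy sketch of that per-partition exclusivity (in-memory Python; the real system enforces this at the database/infrastructure level, and the job shape here is hypothetical):

    import threading
    from collections import defaultdict

    # One lock per partition; at most one worker may hold it, so contention
    # exists only at the locking layer, never inside the delivery logic.
    partition_locks = defaultdict(threading.Lock)

    def deliver(job):
        print("delivering", job)  # stand-in for the real destination API call

    def process_partition(partition_key, jobs):
        lock = partition_locks[partition_key]
        if not lock.acquire(blocking=False):
            return False  # another worker owns this partition; skip it for now
        try:
            for job in jobs:
                deliver(job)  # events for one partition are handled in order
            return True
        finally:
            lock.release()

    process_partition("dest-hubspot/p0", [{"event": "Signed Up"}])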

IMO it speaks to the way in which microservices can be a way to enforce good boundaries between teams... but the drawbacks are significant, and a cross-team review process for API changes and extensions can be equally effective and enable simplified architectures that sidestep many distributed-system problems at scale.