
Meta's Hyperscale Infrastructure: Overview and Insights

vasco

> Moreover, once it [Threads] was developed, the infrastructure teams were given only two days' notice to prepare for its production launch. Most large organizations would take longer than two days just to draft a project plan involving dozens of interdependent teams, let alone execute it. At Meta, however, we quickly established war rooms across distributed sites, bringing together both infrastructure and product teams to address issues in real time. Despite the tight timeline, the app's launch was highly successful, reaching 100 million users within just five days, making it the fastest-growing app in history.

Keeping this ability to ship fast is kind of more impressive than anything else. It takes a lot of work to keep bureaucracy from growing and to stop the lawyers or other functions from creating approval gates everywhere. Or at least to be able to stand up war rooms that can get it done so quickly.

javier2

They already had these teams and they already had the infrastructure. 100M users is a drop in the ocean for Meta.

Simon_O_Rourke

They might have launched it quickly, but nobody (relatively speaking) gave a tinker's damn about the end product.

All form and no function.

2c2c2c

Incumbent networks don't really lose. They saw potential blood in the water at the time, with the rumblings of a mass exodus, and made an excellent attempt at capitalizing though.

Threads as a product was DOA when that didn't work. You need a network of interesting, important people for it to be useful. When the migration didn't happen, you ended up with a bunch of Instagram meme influencers reposting their content across two apps instead.

I think their strategy, combined with an open exclusivity-bonus offer, could have given them the stickiness: $5k, $10k, $15k, etc. up front to a Twitter user with at least 25k, 50k, 75k, etc. followers who matches that follower count on Threads and agrees to post there exclusively for a year. People weren't getting paid on Twitter, so this would have been alluring.

hansmayer

Precisely. For all the talk of efficiency these last few years, how do we even begin to measure the total waste of effort and energy by so many smart people that this was? All that effort and stress for effectively nothing, or perhaps even a net negative effect.

wordpad25

How is it a net negative effect? They have hundreds of millions of ACTIVE users.

Even if it were spun off as a standalone product without Instagram, it would still be worth billions.

Retr0id

Threads feels more like a new feature in the Facebook meta-app (heh) than a new app in its own right, especially with how it has to be bound to an Instagram profile.

"Our new feature reached 100M users in 5 days" sounds a little less impressive (especially given Meta has multiple billions of users to start with).

pinoy420

Sounds like a horrific place to work. Imagine the pressure

huijzer

Yesterday I sat in a meeting where someone spent an hour showing all kinds of organizational charts full of acronyms and vague terms, and who could do what and who is the boss of whom. The various organizations were there to help employees deal with European privacy regulations while also publishing data open access when possible. Basically: which parts of the organization can help you deal with contradictory government regulations.

I honestly find high pressure work more relaxing than these kinds of meetings.

vasco

Having been in both setups, the frustration and demotivation I got when working in the bureaucracy was way worse than doing the rare weekend work and on-call rotations at a place where things moved fast. Many people get pumped up by doing stuff.

falconertc

I think the joy of working at this scale makes up for that for a lot of people. There are plenty of low-stress, low-impact companies that someone who's big-tech-approved can go to.

spacebanana7

Imagine the compensation.


qwe----3

I’m jealous

silisili

I thought the same thing, but then realized that Threads was a huge flop.

Is it really a skill to very quickly release a dud app?

I don't know the answer to that. Bypassing bureaucracy seems like heaven, but it feels like it also bypassed the product folks entirely.

chikere232

It's certainly a skill to launch quickly at that scale. There are plenty of bureaucratically managed, slowly launched duds too.

I hate meta with a passion, but I don't deny they have some great infrastructure and engineers to enable the bad things they do to the world

cmdtab

What is threads lacking from a product perspective?

captainbland

I guess its only real USP is "automatically import your Instagram friends", except that doesn't really work properly because only a fraction of people on Instagram seem to be interested in Threads.

Its fediverse integration isn't panning out because nobody in the fediverse is stupid enough to let them federate. The single-thread-per-message thing, instead of hashtags, doesn't seem to add a lot either.

silisili

It was given no thought or advertisement or anything. It -feels- like an engineering marathon to make a safe space from Twitter.

I don't know that it's a bad effort, but it's one that rose and died seemingly the same day.

I feel like more time with a good product person would have given more thought to fit, advertising, release, and so on.

yokoprime

Novelty

ianlevesque

What? It has more than 300M users.

mukunda_johnson

I think most of those are Instagram shoving it in your face. Yeah I'm a "Threads user", but only because of the inline feed in Instagram. I'm annoyed when there is a notification blip but it turns out to be Threads spam.

silisili

Today, maybe? I haven't kept up.

I'm talking specifically about its launch.

ahoka

How many of those are bots?

yuliyp

I find it interesting how they describe the PHP web front end as a "serverless" or "function as a service" architecture. I guess it's a matter of perspective. It's a service that has a monolithic codebase with lots of endpoints deployed to it. I guess from the perspective of the maintainer of one of those endpoints it's "serverless" but that abstraction (like all abstractions) has leaks: the teams responsible for the top endpoints and those working on shared libraries can't treat the infra as a given, but rather need to be acutely aware of its limitations and performance characteristics.

vineyardmike

I think the full description of “stateless, serverless functions” adds a bit of clarity. My read of this is that whatever code is running doesn’t maintain state between requests and doesn’t touch the underlying operating system. Which seems pretty standard for highly managed environments anyway. It’s been years since I’ve written backend API code that touched the underlying system, or left objects on the heap between requests.

Knowing that any machine can instantly run the code for your API makes it much easier to rapidly scale that API up.

Nothing is “serverless” to everyone. Especially when you run the data center. But being “serverless” and even sitting above the “language runtime” gives API developers a lot of freedom to focus on business logic.
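To make the "stateless between requests" property concrete, here is a minimal sketch. The handler name and the datastore interface are illustrative assumptions, not Meta's actual API:

```python
# Stateless handler: everything it needs arrives with the request or lives
# in an external store. Nothing survives in process memory between calls,
# so any machine in the fleet can serve any request.

def handle_request(request, datastore):
    """Illustrative request handler with no cross-request state."""
    user = datastore.get(f"user:{request['user_id']}")  # state lives externally
    return {"status": 200, "body": f"hello {user['name']}"}

# Anti-pattern under this model: a module-level dict used as a per-process
# cache, which would make responses depend on which machine served the request.
```

The payoff described above follows directly: since a handler like this has no affinity to any machine, the scheduler is free to run it anywhere capacity exists.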

whstl

“Serverless” is not synonymous with Lambda, it’s just a computing model.

Lots of companies are hosting old monoliths on Amazon Fargate, for example.

lionkor

> [If] the image is not cached at CDN109 when the user requests it, CDN109 forwards the request to a nearby PoP. The PoP then forwards the request to the load balancer in a datacenter region, which retrieves the image from the storage system.

Say I want a 1MB image: wouldn't it be faster to serve me the 1MB image over a slow connection with 100ms latency than to go through multiple hops of increasing latency, with multiple round trips?

Say I request the image directly:

  me -- 100ms --> datacenter
  datacenter -- 100ms --> me

Say I now go through Meta's system, assuming it ends up at the same datacenter and there's no FTL tech:

  me -- 10ms --> CDN
  CDN -- 10ms --> PoP
  PoP -- 90ms --> datacenter
  datacenter -- 90ms --> PoP
  PoP -- 10ms --> CDN
  CDN -- 10ms --> me
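A quick sketch, using the hypothetical latency figures from the comment above, makes the comparison explicit:

```python
# One-way latencies (ms) from the hypothetical scenario above.
direct_hops = [100]           # me -> datacenter
cdn_miss_hops = [10, 10, 90]  # me -> CDN, CDN -> PoP, PoP -> datacenter

def round_trip_ms(one_way_hops):
    # The request traverses every hop, then the response retraces them.
    return 2 * sum(one_way_hops)

print(round_trip_ms(direct_hops))    # direct fetch: 200 ms
print(round_trip_ms(cdn_miss_hops))  # CDN cache miss: 220 ms
print(round_trip_ms([10]))           # subsequent CDN cache hit: 20 ms
```

So in this model a cold fetch through the CDN is indeed slightly slower than going direct; the win is that every later request for the same image is served from the edge at around a 20 ms round trip.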



linkregister

> In a datacenter environment, we prefer centralized controllers over decentralized ones due to their simplicity and ability to make higher-quality decisions. In many cases, a hybrid approach—a centralized control plane combined with a decentralized data plane—provides the best of both worlds.

This approach appears to be one of the best designs for software networking (service mesh) and for storage (database operations) in organizations with large server counts. I was surprised to see their IP networking follow the same model, rather than relying primarily on BGP.

It was omitted from this paper, but I would expect local caching to be used to reduce load on L7 routers and to improve latency for database queries. Clients can invalidate caches and perform another lookup to the service mesh after a reasonable timeout (100-500ms).
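The caching pattern described above can be sketched in a few lines. Everything here (class name, TTL value, resolver interface) is a hypothetical illustration, not anything from the paper:

```python
import time

class LookupCache:
    """Client-side cache for service-mesh lookups with a short TTL,
    so stale routes are re-resolved after a bounded delay."""

    def __init__(self, ttl_seconds=0.25):  # hypothetical 250 ms, mid-range of 100-500 ms
        self.ttl = ttl_seconds
        self._entries = {}  # key -> (value, expiry_time)

    def get(self, key, resolve):
        entry = self._entries.get(key)
        now = time.monotonic()
        if entry and entry[1] > now:
            return entry[0]                 # fresh cache hit, no mesh lookup
        value = resolve(key)                # miss or expired: ask the mesh
        self._entries[key] = (value, now + self.ttl)
        return value

    def invalidate(self, key):
        self._entries.pop(key, None)        # force a fresh lookup next time
```

The short TTL bounds how long a client can act on a stale route even if the explicit invalidation is missed.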

udev4096

As much as I hate clownfare, it would be very interesting if they published something like that because of the sheer number of data centers they operate

sofixa

They have published a lot of information on their blog[1]. It's piecemeal, scattered among other articles about optimisations or security fixes or failures/postmortems, but it's there. Stuff like:

https://blog.cloudflare.com/how-we-use-hashicorp-nomad/

https://blog.cloudflare.com/cloudflare-deployment-in-guam/

https://blog.cloudflare.com/behind-the-scenes-with-stream-li...

1 - https://blog.cloudflare.com/

ribadeo

At least half their gak is due to them NOT moving quickly and NOT wanting to break things.

IIRC, GraphQL is a means of papering over a bunch of legacy APIs. They removed foreign keys from MySQL, using it as a column-store DB, a vestige of the original LAMP stack still on PHP.

I don't think Meta infrastructural choices are applicable to most folk.

What does serverless land your average dev? A high AWS bill. Elastic managed Kubernetes stack? A higher bill.

Did you know that you can use YAML and provision actual cloud provider resources with boring tech? Welcome to Ansible. There is no need to recreate the Linux network stack when you have the Linux network stack, and it actually works!

Quite a lot of hacky gak is required when you run node.js as a production public-facing web service. A statically compiled binary won't invent novel code execution paths 4 days into a memory-leaking runtime bender.

Boring tech is boring, I guess, even if it's new and shiny. Facebook creates tech to mitigate the pathologies their past continuously presents.

sofixa

> Did you know that you can use YAML and provision actual cloud provider resources with boring tech? Welcome to Ansible

Anyone using Ansible for cloud infrastructure management is not to be taken seriously. It's among the worst tools for the job: not (always) idempotent, no state tracking, slow, very limited in the resources it can manage, very lacking in templating, fun stuff like setting "state: absent", running it, and then having to remove the corresponding lines to actually delete things, etc. You're literally better off bash-scripting the cloud provider's CLI than using Ansible. Terraform/OpenTofu, or Pulumi/tfcdk if you hate your future self, are just clearly so much better.

ForHackernews

> Facebook creates tech to mitigate the pathologies their past continuously presents.

Remember when they hacked a running Android Dalvik machine because their organizational constraints were such that they could never remove code or delete unused classes?

https://archive.is/nIPlg

https://engineering.fb.com/2013/03/04/android/under-the-hood...

Facebook seems like a place where they do amazing engineering to temporarily stave off the disastrous consequences of their previous feat of amazing engineering.

davedx

Very interesting, in particular the explicit comparisons with hyperscalers.

I almost wonder if this is preparation for them launching their own public cloud. Anyone from Meta care to comment?

toast0

I left before they were Meta, and maybe things have changed, but I don't think they have any intention of being a public cloud. Yes, they've got a lot of similar services as a public cloud, but there's a lot of opinionated choices that make sense for them that I think would be hard to convince customers to accept.

Their infrastructure is cloudy, but it's built around mostly a single customer and assumes the infrastructure software people and the application software people communicate deeply and continuously. Running on a public cloud isn't that similar, at least as a small customer.

Could they pivot towards being a cloud service? Probably, but they'd need to do a lot of work to make their platform viable and to earn the trust of potential customers, and they'd be entering a crowded market; there are already six S&P 100 companies in cloud (Amazon, Google, Microsoft, Oracle, IBM, Salesforce), and tons of smaller players.

IMHO, given their revenues and profit margins, there's no reason to do all the work it would take to offer cloud services too, unless there's some opportunistic large-customer deal to be made. They might also need to renegotiate their content-node agreements if they use them to serve cloud customer traffic, and that's a long process.

blitzar

> a lot of opinionated choices that make sense for them that I think would be hard to convince customers to accept

I get this vibe whenever I use the AWS or GCS dashboards, yet here we are!

linkregister

Offering a Heroku level of deployment abstraction to one organization's own software engineers, while maintaining performance, is an amazing achievement. Developing a cloud product and all the packaged services with account separation, autoscaling, and multiple regions is another massive endeavor.

Think of it as the difference between OCI and AWS.

Meta would be unlikely to launch a public cloud that couldn't compete with Amazon's feature set.

olivermuty

It would be pretty impressive to launch a public cloud I would trust even less than GCP :D

arjvik

GCP got this reputation because it’s a second class citizen within Google. Google’s own internal infra (Borg, Blaze) is top-notch.

If Meta can pull off the public cloud correctly, I’d trust them greatly - they’ve shown significant engineering and product competence till now, even if they could use some more consistent and stable UI.

mathverse

Don't all new projects within Google go to GCP?

Cthulhu_

It would make sense for them to do so if they have variable load but a lot of hardware waiting for action, like Amazon had back in the day: busy in the evenings, peak load around the holidays, crickets at night. But if they had that, I'm confident they would have started renting out servers a long time ago. I wonder if they themselves are customers of the cloud providers.

bagels

A lot of the ingredients are there, but it would take a ton of work to separate Facebook from the internal tools and make them customer-facing.

The internal tools heavily depend on other internal tools, and none of them were written with customers other than Facebook in mind.

kmdrpc

Not a word about Thrift. Perhaps it was too low-level for an infra overview, but I would have expected it to have some technical impact from a global perspective.

linkregister

I wouldn't be surprised if they have switched to gRPC for improved performance. They mentioned that RPC libraries are centrally maintained in their monorepo; a migration from Thrift to gRPC might have taken less than 6 months.

vitaut

gRPC and Thrift are comparable in performance and there is actually an opposite trend of switching from gRPC to Thrift in the few places where the former is still used.

linkregister

I participated in a painfully slow migration from Thrift to gRPC. I didn't record the performance metrics, but it was internally advertised as significantly more performant. There are still some Thrift services running at the organization, but most were migrated to gRPC and certainly not migrated back.

bagels

I haven't been there a while, but that is extremely unlikely.

revskill

Interesting that Facebook built this whole thing and it's still useless to me.

asdasd1234

Is there any public information available for the deploy/observability tool?

linkregister

Unfortunately I didn't see any code posted for Conveyor. The following USENIX paper is available: https://www.usenix.org/system/files/osdi23-grubic.pdf
