Distributed systems programming has stalled

bsnnkv

Last month I switched from a role working on a distributed system (FAANG) to a role working on embedded software which runs on cards in data center racks.

I was in my last role for a year, and 90%+ of my time was spent investigating things that went "missing" at one of the many failure points between the many distributed components.

I wrote less than 200 lines of code that year and I experienced the highest level of burnout in my professional career.

The technical aspect that contributed the most to this burnout was both the lack of observability tooling and the lack of organizational desire to invest in it. Whenever I would bring up this gap I would be told that we can't spend time/money and wait for people to create "magic tools".

So far the culture in my new embedded (Rust, fwiw) position is the complete opposite. If you're burnt out working on distributed systems and you care about some of the same things that I do, it's worth giving embedded software dev a shot.

alabastervlog

I've found the rush to distributed computing when it's not strictly necessary kinda baffling. The costs in complexity are extreme. I can't imagine the median company doing this stuff is actually getting either better uptime or performance out of it—sure, it maybe recovers better if something breaks, maybe if you did everything right and regularly test that stuff (approximately nobody does though), but there's also so very much more crap that can break in the first place.

Plus: far worse performance ("but it scales smoothly" OK but your max probable scale, which I'll admit does seem high on paper if you've not done much of this stuff before, can fit on one mid-size server, you've just forgotten how powerful computers are because you've been in cloud-land too long...) and crazy-high costs for related hardware(-equivalents), resources, and services.

All because we're afraid to shell into an actual server and tail a log, I guess? I don't know what else it could be aside from some allergy to doing things the "old way"? I dunno man, seems way simpler and less likely to waste my whole day trying to figure out why, in fact, the logs I need weren't fucking collected in the first place, or got buried in some damn corner of our Cloud I'll never find without writing a 20-line "log query" in some awful language I never use for anything else, in some shitty web dashboard.

Fewer, or cheaper, personnel? I've never seen cloud transitions do anything but the opposite.

It's like the whole industry went collectively insane at the same time.

[EDIT] Oh, and I forgot, for everything you gain in cloud capabilities it seems like you lose two or three things that are feasible when you're running your own servers. Simple shit that's just "add two lines to the nginx config and do an apt-install" becomes three sprints of custom work or whatever, or just doesn't happen because it'd be too expensive. I don't get why someone would give that stuff up unless they really, really had to.

[EDIT EDIT] I get that this rant is more about "the cloud" than distributed systems per se, but trying to build "cloud native" is the way that most orgs accidentally end up dealing with distributed systems in a much bigger way than they have to.

whstl

I share your opinions, and really enjoyed your rant.

But it's funny. The transition to distributed/cloud feels like the rush to OOP early in my career. All of a sudden there were certain developers who would claim it was impossible to ship features in procedural codebases, and then proceed to make a fucking mess out of everything using classes, completely misunderstanding what they were selling.

It is also not unlike what Web-MVC felt like in the mid-2000s. Suddenly everything that came before was considered complete trash by some people that started appearing around me. Then the same people disparaging the old ways started building super rigid CRUD apps with mountains of boilerplate.

(Probably the only thing I was immediately on board with was the transition from desktop to web, because it actually solved more problems than it created. IMO, IME and YMMV)

Later we also had React and Docker.

I'm not salty or anything: I also tried and became proficient in all of those things. Including microservices and the cloud. But it was more out of market pressure than out of personal preference. And like you said, it has a place when it's strictly necessary.

But now I finally do mostly procedural programming, in Go, in single servers.

sakesun

Your comment inspires me to brush up my Delphi skills.

dekhn

I am always happy when I can take a system that is based on distributed computing, and convert it to a stateless single machine job that runs just as quickly but does not have the complexity associated with distributed computing.

Recently I was going to do a fairly big download of a dataset (45T) and when I first looked at it, figured I could shard the file list and run a bunch of parallel loaders on our cluster.

Instead, I made a VM with 120TB storage (using AWS with FSX) and ran a single instance of git clone for several days (unattended; just periodically checking in to make sure that git was still running). The storage was more than 2X the dataset size because git LFS requires 2X disk space. A single multithreaded git process was able to download at 350MB/sec and it finished at the predicted time (about 3 days). Then I used 'aws sync' to copy the data back to s3, writing at over 1GB/sec. When I copied the data between two buckets, the rate was 3GB/sec.

That said, there are things we simply can't do without distributed computing because there are strong limits on how many CPUs and local storage can be connected to a single memory address space.

achierius

My wheelhouse is lower on the stack, so I'm curious as to what you mean by "stateless single machine job" -- do you just mean that it runs from start to end, without options for suspension/migration/resumption/etc.?

jimbokun

Distributed or not is a very binary function. If you can run in one large server, great, just write everything in non-distributed fashion.

But once you need that second server, everything about your application needs to work in distributed fashion.

th0ma5

I wish I could upvote you again. The complexity balloons when you try to adapt something that wasn't distributed, and often things can be way simpler and more robust if you start with a distributed concept.

FpUser

This is part of what I do for a living: C++ backend software running on real hardware, which is currently insanely powerful. There is of course a spare standby in case things go south. It works like a charm and I have yet to have a client that came anywhere close to overloading the server.

I understand that it cannot deal with FAANG-scale problems, but those are relevant only to a small subset of businesses.

intelVISA

The highly profitable, self-inflicted problem of using 200 QPS Python frameworks everywhere.

tayo42

This rant misses two things that people always miss

On distributed: QPS scaling isn't the only reason, and I suspect it's rarely the reason. It's mostly driven by availability needs.

It's also driven by organizational structure and teams. Two teams don't need to be fighting over the same server to deploy their code. So it gets broken out into services with clear API boundaries.

And ssh to servers might be fine for you. But systems and access are designed to protect the bottom tier of employees that will mess things up when they tweak things manually. And tweaking things by hand isn't reproducible when they break.

Karrot_Kream

Horizontal scaling is also a huge cost savings. If you can run your application with a tiny VM most of the time and scale it up when things get hot, then you save money. If you know your service is used during business hours you can provision extra capacity during business hours and release that capacity during off hours.

icedchai

I've seen this as well. A relatively simple application becomes a mess of terraform configuration for CloudFront, Lambda, API Gateway, S3, RDS and a half dozen other lesser services because someone had an obsession with "serverless." And performance is worse. And there's as much Terraform as there is actually application code.

motorest

> I've found the rush to distributed computing when it's not strictly necessary kinda baffling.

I'm not entirely sure you understand the problem domain, or even the high-level problem. There neither is nor ever was a "rush" to distributed computing.

What you actually have is this global epiphany that having multiple computers communicating over a network to do something actually has a name, and it's called distributed computing.

This means that we had (and still have) guys like you who look at distributed systems and somehow do not understand they are looking at distributed systems. They don't understand that mundane things like a mobile app supporting authentication, or someone opening a webpage or an email, are distributed systems. They don't understand that the discussion on monolith vs microservices is orthogonal to the topic of distributed systems.

So the people railing against distributed systems are essentially complaining about their own ignorance and failure to actually understand the high-level problem.

You have two options: acknowledge that, unless you're writing a desktop app that does nothing over a network, odds are every single application you touch is a node in a distributed system, or keep fooling yourself into believing it isn't. I mean, if a webpage fails to load then you just hit F5, right? And if your app just fails to fetch something from a service you just restart it, right? That can't possibly be a distributed system, and those scenarios can't possibly be mitigated by basic distributed computing strategies, right?

Everything is simple to those who do not understand the problem, and those who do are just making things up.

lucyjojo

you and the guy you are answering to are not talking the same language (technically yes, but you are putting different meanings to the same words).

this would lead to a pointless conversation, if it were to ever happen.

throwawaymaths

the minute you have a client (browser, e.g.) and a server you're doing a distributed system and you should be thinking a little bit about edge cases like loss of connection, incomplete tx. a lot of the go-to protocols (tcp, http, even stuff like s3) are built with the complexities of distributed systems in mind, so for most basic cases a little thought goes a long way. but you get weird shit happening all the time (that may be tolerable) if you don't put any effort into it.
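To make that "little thought" concrete, here is a minimal Go sketch of a client call with a per-attempt deadline and bounded retries. The URL, timeout, and retry count are arbitrary placeholders for illustration, not a prescription.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"time"
)

// fetchWithRetry treats the network as unreliable: every attempt gets its own
// deadline, and failures are retried a bounded number of times with backoff.
// Blind retries are only safe here because GET is idempotent; a POST would
// need more care (deduplication, idempotency keys, etc.).
func fetchWithRetry(ctx context.Context, url string, attempts int) ([]byte, error) {
	var lastErr error
	for i := 0; i < attempts; i++ {
		body, err := func() ([]byte, error) {
			attemptCtx, cancel := context.WithTimeout(ctx, 2*time.Second)
			defer cancel()
			req, err := http.NewRequestWithContext(attemptCtx, http.MethodGet, url, nil)
			if err != nil {
				return nil, err
			}
			resp, err := http.DefaultClient.Do(req)
			if err != nil {
				return nil, err // timeout, connection reset, DNS failure, ...
			}
			defer resp.Body.Close()
			if resp.StatusCode >= 500 {
				return nil, fmt.Errorf("server error: %s", resp.Status)
			}
			return io.ReadAll(resp.Body)
		}()
		if err == nil {
			return body, nil
		}
		lastErr = err
		time.Sleep(time.Duration(i+1) * 500 * time.Millisecond) // crude linear backoff
	}
	return nil, fmt.Errorf("gave up after %d attempts: %w", attempts, lastErr)
}

func main() {
	data, err := fetchWithRetry(context.Background(), "https://example.com/", 3)
	if err != nil {
		fmt.Println("fetch failed:", err)
		return
	}
	fmt.Printf("fetched %d bytes\n", len(data))
}
```

Even this much already encodes two of the edge cases mentioned above: the connection can be lost, and the request can partially complete and be retried.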

jasonjayr

> Whenever I would bring up this gap I would be told that we can't spend time and wait for people to create "magic tools".

That sounds like an awful organizational ethos. 30hrs to make a "magic tool" to save 300hrs across the organization sounds like a no-brainer to anyone paying attention. It sounds like they didn't even want to invest in out-sourced "magic tools" to help either.

bsnnkv

The real kicker is that it wasn't even management saying this, it was "senior" developers on the team.

I wonder if these roles tend to attract people who get the most job enjoyment and satisfaction out of the (manual) investigation aspect; it might explain some of the reluctance to adopt or create more sophisticated observability tooling.

zelphirkalt

Senior doesn't always mean smarter or more experienced or anything really. It just all depends on the company and its culture. It can also mean "worked for longer" (which is not equal to more experienced, as you can famously have 10 times 1y experience, instead of 10y experience) and "more aligned with how management at the company acts".

Henchman21

IME, “senior” often means “who is left after the brain-drain & layoffs are done” when you’re at a medium sized company that isn’t prominent.

the_sleaze_

_To play devil's advocate_: It could've sounded like the "new guy" came in and decided he needed to rewrite everything; bring in new xyz; steer the ship. New guy could even have been stepping directly on the toes of those senior developers who had fought and won wars to get where they are now.

In my -very- humble opinion, you should wait at least a year before making big swinging changes or recommendations, most importantly in any big company.

Jach

There's also immense resistance to figuring out how to code something if an approach isn't at once obvious. Hence "magic". Sometimes a "spike doc" can convince people. My favorite second-hand instance of this was an MS employee insisting that a fast rendering terminal emulator was so hard as to require "an entire doctoral research project in performant terminal emulation".

scottlamb

> I wonder if these roles tend to attract people who get the most job enjoyment and satisfaction out of the (manual) investigation aspect; it might explain some of the reluctance to adopting or creating more sophisticated observability tooling.

That's weird. I love debugging, and so I'm always trying to learn new ways to do it better. I mean, how can it be any other way? How can someone love something and be that committed to sucking at it?

whstl

I saw a case like this recently, and the fact is that the team responsible was completely burned out and was just doing anything to keep people from giving them more work, but they also didn't trust anyone else to do it.

One of the engineers just quit on the spot for a better-paid position; the other was demoted and is currently dealing with heavy depression, last I heard from him.

jbreckmckye

Why would people who are good at [scarce, valuable skill] and get paid [many bananas] to practice it want to even imagine a world where that skill is now redundant? ;-)

cmrdporcupine

Consider that there is a class of human motivation / work culture that considers "figuring it out" to be the point of the job and just accepts or embraces complexity as "that's what I'm paid to do" and gets an ego-satisfaction from it. Why admit weakness? I can read the logs by timestamp and resolve the confusions from the CAP theorem from there!

Excessive drawing of boxes and lines, and the production of systems around them becomes a kind of Glass Bead Game. "I'm paid to build abstractions and then figure out how to keep them glued together!" Likewise, recomposing events in your head from logs, or from side effects -- that's somehow the marker of being good at your job.

The same kind of motivation underlies people who eschew or disparage GUI debuggers (log statements should be good enough or you're not a real programmer), too.

Investing in observability tools means admitting that the complexity might overwhelm you.

As an older software engineer the complexity overwhelmed me a long time ago and I strongly believe in making the machines do analysis work so I don't have to. Observability is a huge part of that.

Also many people need to be shown what observability tools / frameworks can do for them, as they may not have had prior exposure.

And back to the topic of the whole thread, too: can we back up and admit that distributed systems is questionable as an end in itself? It's a means to an end, and distributing something should be considered only as an approach when a simpler, monolithic system (that is easier to reason about) no longer suffices.

Finally I find that the original authors of systems are generally not the ones interested in building out observability hooks and tools because for them the way the system works (or doesn't work) is naturally intuitive because of their experience writing it.

lumost

Anecdotally, I see a major underappreciation for just how fast and efficient modern hardware is in the distributed systems community.

I’ve seen a great many engineers become so used to provisioning compute that they forget that the same “service” can be deployed in multiple places. Or jump to building an orchestration component when a simple single process job would do the trick.

intelVISA

Distributed systems always ends up a dumping ground of failed tech solutions to deep org dysfunction.

Weak tech leadership? Let's "fix" that with some microservices.

Now it's FUBAR? Conceal it with some cloud native horrors, sacrifice a revolving door of 'smart' disempowered engineers to keep the theater going til you can jump to the next target.

Funny because dis sys is pretty solved since Lamport, 40+ years ago.

whstl

I suffered through this in two companies and man, it isn't easy.

The first one was a multi-billion unicorn that had everything converted to microservices, with everything customized in Kubernetes. One day I even had to fix a few bugs in the service mesh because the guy who wrote it left and I was the only person not fighting fires able to write the language it was in. I left right after the backend-of-the-frontend failed to sustain traffic during a month where they literally had zero customers (Corona).

At the second one there was a mandate to rewrite everything to microservices and it took another team 5 months to migrate a single 100-line class I wrote into a microservice. It just wasn't meant to be. Then the only guy who knew how the infrastructure worked got burned out after being yelled at too many times and then got demoted, and last I heard he is at home with depression.

Weak leadership doesn't even begin to describe it, especially the second.

But remembering it is a nice reminder that a job is just a means of getting a payment.

rbjorklin

Would you mind sharing some more specific information/references to Lamport’s work?

vitus

The three big papers: clocks [0], Paxos [1], Byzantine generals [2].

[0] https://lamport.azurewebsites.net/pubs/time-clocks.pdf

[1] https://lamport.azurewebsites.net/pubs/lamport-paxos.pdf

[2] https://lamport.azurewebsites.net/pubs/byz.pdf

Or, if you prefer wiki articles:

https://en.wikipedia.org/wiki/Lamport_timestamp

https://en.wikipedia.org/wiki/Paxos_(computer_science)

https://en.wikipedia.org/wiki/Byzantine_fault

I don't know that I would call it "solved", but he certainly contributed a huge amount to the field.

madhadron

Lamport's website has his collected works. The paper to start with is "Time, clocks, and the ordering of events in a distributed system." Read it closely all the way to the end. Everyone seems to miss the last couple sections for some reason.
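For readers who want the flavor of that paper before reading it, here is a minimal Go sketch of the mechanism it introduces, a Lamport clock (illustrative only): tick on local events, and on receive jump to max(local, remote) + 1, which yields a consistent "happened-before" partial order across processes.

```go
package main

import "fmt"

// LamportClock implements the counter from "Time, Clocks, and the Ordering
// of Events in a Distributed System": tick on every local event, and on
// receive advance to max(local, remote) + 1.
type LamportClock struct{ t uint64 }

// Tick is called for a local event (including just before sending a message).
func (c *LamportClock) Tick() uint64 {
	c.t++
	return c.t
}

// Recv merges the timestamp carried by an incoming message.
func (c *LamportClock) Recv(remote uint64) uint64 {
	if remote > c.t {
		c.t = remote
	}
	c.t++
	return c.t
}

func main() {
	var a, b LamportClock
	ts := a.Tick() // A does something and sends a message stamped with ts
	fmt.Println("A sends at", ts)
	fmt.Println("B receives at", b.Recv(ts)) // B's clock jumps past A's
	fmt.Println("B local event at", b.Tick())
}
```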

bob1029

> Whenever I would bring up this gap I would be told that we can't spend time/money and wait for people to create "magic tools".

I've never once been granted explicit permission to try a different path without being burdened by a mountain of constraints that ultimately render the effort pointless.

If you want to try a new thing, just build it. No one is going to encourage you to shoot holes through things that they hang their own egos from.

DrFalkyn

Hope you can justify that during sprint planning / standup

bob1029

If you are going to just build it in the absence of explicit buy-in, you certainly shouldn't spend time on the standup talking about it. Wait until your idea is completely formed and then drop a 5 minute demo on the team.

It can be challenging to push through to a completed demo without someone cheering you on every morning. I find this to be helpful more than hurtful if we are interested in the greater good. If you want to go against the grain (everyone else on the team), then you need to be really sure before you start wasting everyone else's time. Prove it to yourself first.

fatnoah

> The technical aspect that contributed the most to this burnout was both the lack of observability tooling and the lack of organizational desire to invest in it.

One of the most significant "triumphs" of my technical career came at a startup where I started as a Principal Engineer and left as the VP Engineering. When I started, we had nightly outages requiring Engineering on-call, and by the time I left, no one could remember a recent issue that required Engineers to wake up.

It was a ton of work and required a strong investment in quality & resilience, but even bigger impact was from observability. We couldn't afford APM, so we took a very deliberate approach to what we logged and how, and stuffed it into an ELK stack for reporting. The immediate benefit was a drastic reduction in time to diagnose issues, and effectively let our small operations team triage issues and easily identify app vs. infra issues almost immediately. Additionally, it was much easier to identify and mitigate fragility in our code and infra.

The net result was an increase in availability from 98.5% to 99.995%, and I think observability contributed to at least half of that.
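As a hedged illustration of that "deliberate approach to what we logged and how" (not the poster's actual setup), Go's standard log/slog emits JSON with consistent field names, which is what makes an ELK-style stack useful for triage; the field names and values below are invented for the example.

```go
package main

import (
	"log/slog"
	"os"
	"time"
)

func main() {
	// JSON output with consistent field names is what makes the logs
	// queryable once they land in Elasticsearch/Kibana.
	logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))

	logger.Info("payment processed",
		"request_id", "req-8f3a", // hypothetical IDs, for illustration
		"subsystem", "billing",
		"duration", 42*time.Millisecond,
	)
	logger.Error("upstream call failed",
		"request_id", "req-8f3a",
		"subsystem", "billing",
		"dependency", "card-gateway",
		"infra_issue", true, // lets ops triage app vs. infra at a glance
	)
}
```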

fra

As someone who builds observability tools for embedded software, I am flabbergasted that you're finding a more tools-friendly culture in embedded than in distributed systems!

Most hardware companies have zero observability, and haven't yet seen the light ("our code doesn't really have bugs" is a quote I hear multiple times a week!).

whstl

It's probably a "grass is greener" situation.

My experience with mid-size to enterprise is having lots of observability and observability-adjacent tools purchased but not properly configured. Or the completely wrong tools for the job being used.

A few I've seen recently: Grafana running on local Docker of developers because of lack of permissions in the production version (the cherry on top: the CTO himself installed this on the PMs computers), Prometheus integration implemented by dev team but env variables still missing after a couple years, several thousand a month being paid to Datadog but nothing being done with the data nor with the dog.

At startups it's surprisingly different, IME. But as soon as you "elect" a group to be administrator of a certain tool or some resource needed by those tools, you're doomed.

anonzzzies

I really love embedded work; at least it gives you the feeling that you have control over things. Not everything being confused and black boxed where you have to burn a goat to make it work, sometimes.

porridgeraisin

> where you have to burn a goat to make it work, sometimes.

Or talk to a goat, sometimes

https://modernfarmer.com/2014/05/successful-video-game-devel...

gklitt

This is outside my area of expertise, but the post sounds like it’s asking for “choreographic programming”, where you can write an algorithm in a single function while reasoning explicitly about how it gets distributed:

https://en.m.wikipedia.org/wiki/Choreographic_programming

I’m curious to what extent the work in that area meets the need.

shadaj

You caught me! That's what my next post is about :)

lachlan_gray

This could be a fun example to work with :p

https://en.m.wikipedia.org/wiki/Shakespeare_Programming_Lang...

shadaj

You might enjoy my first ever blog post from ~10 years ago, when I first learned about distributed systems: https://www.shadaj.me/writing/romeo-juliet-and-reactive-prog...

ashton314

I see from your bio that you are a PhD student. What are you doing with choreographies? (I’m in this space too.)

roadbuster

How does "choreographic programming" differ from the actor model?

LegionMammal978

From what I can tell, the important distinction is that all actors (and their messages) are described alongside each other, instead of being described separately. There are many implementations of the actor model, but most of them are the 'static-location architectures' that TFA talks about.

rectang

Ten years ago, I had lunch with Patricia Shanahan, who worked for Sun on multi-core CPUs several decades ago (before taking a post-career turn volunteering at the ASF, which is where I met her). There was a striking similarity between the problems that Sun had been concerned with back then and the problems of the distributed systems that power so much of the world today.

Some time has passed since then — and yet, most people still develop software using sequential programming models, thinking about concurrency occasionally.

It is a durable paradigm. There has been no revolution of the sort that the author of this post yearns for. If "Distributed Systems Programming Has Stalled", it stalled a long time ago, and perhaps for good reasons.

EtCepeyd

> and perhaps for good reasons

For the very good reason that the underlying math is insanely complicated and tiresome for mere practitioners (which, although I have a background in math, I openly aim to be).

For example, even if you assume sequential consistency (which is an expensive assumption) in a C or C++ multi-threaded program, reasoning about the program isn't easy. And once you consider barriers, atomics, load-acquire/store-release explicitly, the "SMP" (shared memory) proposition falls apart, and you can't avoid programming for a message passing system, with independent actors -- be those separate networked servers, or separate CPUs on a board. I claim that struggling with async messaging between independent peers as a baseline is not why most people get interested in programming.

Our systems (= normal motherboards on one end, and networked peer-to-peer systems on the other) have become so concurrent that doing nearly anything efficiently nowadays requires us to think about messaging between peers, and that's very-very foreign to our traditional, sequential, imperative programming languages. (It's also foreign to how most of us think.)

Thus, I certainly don't want a simple (but leaky) software / programming abstraction that hides the underlying hardware complexity; instead, I want the hardware to be simple (as little internally-distributed as possible), so that the simplicity of the (sequential, imperative) programming language then reflect and match the hardware well. I think this can only be found in embedded nowadays (if at all), which is why I think many are drawn to embedded recently.
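A toy Go sketch of the "independent actors exchanging messages" framing described above (nothing vendor- or hardware-specific, just the shape of the model): the two sides share channels, never memory, so all coordination is explicit.

```go
package main

import "fmt"

type request struct {
	n     int
	reply chan int // each request carries its own reply channel
}

// peer owns its state outright; the only way to observe or change that state
// is to send it a message, exactly as with a networked service.
func peer(inbox <-chan request) {
	total := 0
	for req := range inbox {
		total += req.n
		req.reply <- total
	}
}

func main() {
	inbox := make(chan request)
	go peer(inbox)

	for i := 1; i <= 3; i++ {
		reply := make(chan int)
		inbox <- request{n: i, reply: reply}
		fmt.Println("running total:", <-reply)
	}
}
```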

hinkley

I think SaaS and multicore hardware are evolving together because a queue of unrelated, partially ordered tasks running in parallel is a hell of a lot easier to think about than trying to leverage 6-128 cores to keep from ending up with a single user process that's wasting 84-99% of available resources. Most people are not equipped to contend with Amdahl's Law.

Carving 5% out of the sequential part of a calculation is quickly becoming more time efficient than taking 50% out of the parallel parts, and we've spent 40 years beating the urge to reach for 1-4% improvements out of people. When people find out I got a 30% improvement by doing 8+6+4+4+3+2+1.5+1.5 they quickly find someplace else to be. The person who did the compressed pointer work on v8 to make it as fast as 64 bit pointers is the only other person in over a decade I've seen document working this way. If you're reading this we should do lunch.

So because we discovered a lucrative, embarrassingly parallel problem domain that’s what basically the entire industry has been doing for 15 years, since multicore became unavoidable. We have web services and compilers being multi-core and not a lot in between. How many video games still run like three threads and each of those for completely distinct tasks?

gue5t

Personally I've been inspired by nnethercote's logs (https://nnethercote.github.io/) of incremental single-digit percentage performance improvements to rustc over the past several years. The serial portion of compilers is still quite significant and efforts to e.g. parallelize the entire rustc frontend are heroic slogs that have run into subtle semantic problems (deadlocks and races) that have made it very hard to land them. Not to disparage those working on that approach, but it is really difficult! Meanwhile, dozens of small speedups accumulate to really significant performance improvements over time.

linkregister

> 8+6+4+4+3+2+1.5+1.5

What is this referring to? It sounds like a fascinating problem.

gmadsen

I know c++ has a lackluster implementation, but do coroutines and channels solve some of these complaints? Although not inherently multithreaded, many things shouldn't be multithreaded, just paused. And channels instead of shared memory can control order.

hinkley

Coroutines basically make the same observation as transmit windows in TCP/IP: you don’t send data as fast as you can if the other end can’t process it, but also if you send one at a time then you’re going to be twiddling your fingers an awful lot. So you send ten, or twenty, and you wait for signs of progress before you send more.

On coroutines it’s not the network but the L1 cache. You’re better off running a function a dozen times and then running another than running each in turn.

EtCepeyd

I've found both explicit future/promise management and coroutines difficult (even irritating) to reason about. Coroutines look simpler on the surface (than explicit future chaining), and so their syntax is less atrocious, but there are nasty traps. For example:

https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines...

vacuity

I think trying to shoehorn everything into sequential, imperative code is a mistake. The burden of performance should be on the programmer's cognitive load, aided where possible by the computer. Hardware should indeed be simple, but not molded to current assumptions. It's indeed true that concurrency of various fashions and the attempts at standardizing it are taxing on programmers. However, I posit this is largely essential complexity and we should accept that big problems deserve focus and commitment. People malign frameworks and standards (obligatory https://xkcd.com/927), but the answer is not shying away from them but rather leveraging them while being flexible.

cmrdporcupine

What we need is for formal verification tools (for linearizability, etc.) to be far more understood and common.

hinkley

I think the underlying premise of Cloud is:

Pay a 100% premium on compute resources in order to pretend the 8 Fallacies of Distributed Computing don’t exist.

I sat out the beginning of Cloud and was shocked at how completely absent they are from conversations within the space. When the hangover hits it’ll be ugly. The Devil always gets his due.

jimbokun

The author critiques having sequential code executing on individual nodes, uninformed by the larger distributed algorithm in which they play a part.

However, I think there are great advantages to that style. It’s easier to analyze and test the sequential code for correctness. Then it writes a Kafka message or makes an HTTP call and doesn’t need to be concerned with whatever is handling the next step in the process.

Then assembling the sequential components once they are all working individually is a much simpler task.

bigmutant

The fundamental problems are communication lag and lack of information about why issues occur (encapsulated by the Byzantine Generals problem). I like to imagine trying to build a fault-tolerant, reliable system for the Solar System. Would the techniques we use today (retries, timeouts, etc) really be adequate given that lag is upwards of hours instead of milliseconds? But that's the crux of these systems: coordination (mostly) works because systems are close together (same board, at most same DC).

shadaj

Stay tuned for the next blog post for one potential answer :) My PhD has been focused on this gap!

rectang

As a programmer, I hope that your answer continues to abstract away the problems of concurrency from me, the way that CPU designers have managed, so that I can still think sequentially except when I need to. (And as a senior engineer, you need to — developing reliable concurrent systems is like pilots landing planes in bad weather, part of the job.)

hinkley

I was doing some Java code recently after spending a decade in async code and boy that first few minutes was like jumping into a cold pool. Took me a moment to switch gears back to everything is blocking and that function just takes 500ms sometimes, waiting for IO.

hinkley

I don’t think there’s anyone in the Elixir community who wouldn’t love it if companies would figure out that everyone is writing software that contains an ad hoc, informally-specified, bug-ridden, slow implementation of half of Erlang, and start hiring Elixir or Gleam devs.

The future is here, but it is not evenly distributed.

ikety

It's so odd seeing people dissuade others from using "niche" languages like Elixir or Gleam. If you post a job opportunity with these languages, I guarantee you will be swamped with qualified candidates that are very passionate and excited to work with these languages full time.

hinkley

At this point I’m worried that because elixir is over 10 years old that it’ll never arrive. But then Python is older than Java and here we are.

rramadass

I decided long ago (after having implemented various protocols and shared-memory multi-threaded code) that what i like best is to use Erlang as "the fabric" for the graph of distributed computing and C/C++ for heavy lifting at any node.

hinkley

I’m hoping the JIT will finally start to change that.

jimbokun

Yes, my sense reading the article was that the author is reinventing Erlang.

ignoramous

> writing software that contains an ad hoc, informally-specified, bug-ridden, slow implementation of half of Erlang

Since you were at AWS (?), you'd know that Erlang did get its shot at distributed systems there. I'm unsure what went wrong, but if not c/c++, it was all JVM based languages soon after that.

hinkley

No I worked a contract in the retail side and would not wish that job on anyone. My most recent favorite boss works there now and I haven’t even said hi because I’m afraid he’ll offer me a job.

worthless-trash

I have noticed that in corporate settings, languages don't fail on technical merit, but on fashion merit.

bigmutant

Good resources for understanding Distributed Systems:

- MIT course with Robert Morris (of Morris Worm fame): https://www.youtube.com/watch?v=cQP8WApzIQQ&list=PLrw6a1wE39...

- Martin Kleppmann (author of DDIA): https://www.youtube.com/watch?v=UEAMfLPZZhE&list=PLeKd45zvjc...

If you can work through the above (and DDIA), you'll have a solid understanding of the issues in Distributed System, like Consensus, Causality, Split Brain, etc. You'll also gain a critical eye of Cloud Services and be able to articulate their drawbacks (ex: did you know that replication to DynamoDB Secondary Indexes is eventually consistent? What effects can that have on your applications?)

ignoramous

> Robert Morris (of Morris Worm fame)

(of Y Combinator fame, too)

dingnuts

>The static-location model seems like the right place to start, since it is at least capable of expressing all the types of distributed systems we might want to implement, even if the programming model offers us little help in reasoning about the distribution. We were missing two things that the arbitrary-location model offered:

> Writing logic that spans several machines right next to each other, in a single function

> Surfacing semantic information on distributed behavior such as message reordering, retries, and serialization formats across network boundaries

Aren't these features offered by Erlang?

shadaj

Erlang (is great but) is still much closer to the static-location (Actors) paradigm than what I’m aspiring for. For example, if you have stateful calculations, they are typically implemented as isolated (static-location) loops that aren’t textually co-located with the message senders.

chuckledog

Great point. Erlang is still going strong; in fact, WhatsApp is implemented in Erlang.

prophesi

Yep, the words fault tolerance and distributed computing immediately bring Erlang/Elixir to my mind.

tracnar

The Unison programming language does make a foray into truly distributed programming: https://www.unison-lang.org/

aDyslecticCrow

Functional programming languages already have a lot of powerful concepts for distributed programming. Loads of the distributed programming techniques used elsewhere are often taken from an obscure FP language from years prior. Erlang comes to mind as still quite uniquely distributed, with no non-FP equivalent.

Unison seems to build on it further. Very cool

KaiserPro

Distributed systems are hard, as we all know.

However the number of people that actually need a distributed system is pretty small. With the rise of kubernetes, the number of people who've not been burnt by going distributed when they didn't need to has rapidly dropped.

You go distributed either because you are desperate, or because you think it would be fun. K8s takes the fun out of most things.

Moreover, with machines suddenly getting vast IO improvements, the need for going distributed is much less than it was 10 years ago. (Yes, I know there is fault tolerance, but that adds another dimension of pain.)

sd9

> the number of people who've not been burnt by going distributed when they didn't need to has rapidly dropped

Gosh, this was hard to parse! I’m still not sure I’ve got it. Do you mean “kubernetes has caused more people to suffer due to going distributed unnecessarily”, or something else?

boarush

Had me confused for a second too, but I think it is the former that they meant.

K8s has complexity which is really not required at even decent enough scales, if you've put in enough effort to architect a solution that makes the right calls for your business.

KaiserPro

yeah sorry, double negatives.

People got burnt by kubernetes, and that pissed in the well of enthusiasm for experimenting with distributed systems

bormaj

Any specific pitfalls to avoid with K8s? I've used it to some degree of success in a production environment, but I keep deployments relatively simple.

KaiserPro

Its a spectrum rather than a binary thing, however you are asking the right questions!

One of the things that is most powerful about K8s is that it gives you a lot of primitives to build things with. This is also its biggest drawback.

If you are running real physical infrastructure and want to run several hundreds of "services" (as in software, not k8s services) then kubernetes is probably a good fit, but you have a storage and secrets problem to solve as well.

On the cloud, unless you're using a managed service, it's almost certainly easier to either use lambdas (for low traffic services) or one of the many managed docker hosting services they have.

Some of them are even K8s API compatible.

but _why_?

At its heart, k8s is a "run this thing here with these resources" system. AWS also does this, so duplicating it costs time and money. For most people the benefit of running ~20 services, < 5 DBs, and storage on k8s is negative. It's a steep learning curve, a very large attack surface (you need to secure the instance and then k8s permissions) and it's an extra layer of things to maintain. For example, running a DB on k8s is perfectly possible, and there are a bunch of patterns you can follow. But you're on the hook for persistence, backup and recovery. Managed DBs are more expensive to run, but they cost 0 engineer hours to implement.

BUT

You do get access to helm, which means that you can copypasta mostly working systems into your cluster. (but again like running untrusted docker images, thats not a great thing to do.)

The other thing to note is the networking scheme is batshit crazy and working with ipv6 is still tricky.

margorczynski

Distributed systems are cool, but most people don't really get how much complexity they introduce, which leads to fad-driven decisions like using Event Sourcing where there is no fundamental need for it. I've seen projects get burned because of the complexity and overhead it introduces where "simpler" approaches worked well and were easy to extend/fix. Hard-to-find-and-fix bugs, much slower feature addition, and lots more goodies the blogs with toy examples don't speak about.

nine_k

The best recipe I know is to start from a modular monolith [1] and split it when and if you need to scale way past a few dozen nodes.

Event sourcing is a logical structure; you can implement it with SQLite or even flat files, locally, if your problem domain is served well by it. Adding Kafka as the first step is most likely a costly overkill.

[1]: https://awesome-architecture.com/modular-monolith/
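The flat-file version really can be tiny. Here is a minimal, illustrative Go sketch of an append-only event log with replay; it deliberately ignores fsync, locking, and schema evolution, which a real system would need.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"os"
)

// Event is the unit of truth: state is never updated in place,
// only derived by replaying events in order.
type Event struct {
	Type   string `json:"type"`
	Amount int    `json:"amount"`
}

// appendEvent writes one JSON object per line to the log file.
func appendEvent(path string, e Event) error {
	f, err := os.OpenFile(path, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
	if err != nil {
		return err
	}
	defer f.Close()
	return json.NewEncoder(f).Encode(e)
}

// replay folds the log into current state (here, a single balance).
func replay(path string) (int, error) {
	f, err := os.Open(path)
	if err != nil {
		if os.IsNotExist(err) {
			return 0, nil
		}
		return 0, err
	}
	defer f.Close()

	balance := 0
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		var e Event
		if err := json.Unmarshal(sc.Bytes(), &e); err != nil {
			return 0, err
		}
		switch e.Type {
		case "deposit":
			balance += e.Amount
		case "withdraw":
			balance -= e.Amount
		}
	}
	return balance, sc.Err()
}

func main() {
	const path = "events.jsonl"
	_ = appendEvent(path, Event{Type: "deposit", Amount: 100})
	_ = appendEvent(path, Event{Type: "withdraw", Amount: 30})
	balance, _ := replay(path)
	fmt.Println("balance:", balance)
}
```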

margorczynski

What you're speaking of is a need/usability-based design and extension where you design the solution with certain "safety valves" that let you scale it up when needed.

This is in contrast to the fad-driven design and over-engineering that I'm speaking of (here I simply used ES as an example) that is usually introduced because someone in power saw a blog post or 1h talk and it looked cool. And Kafka will be used because it is the most "scalable" and shiny solution, there is no pros-vs-cons analysis.

rjbwork

If the choice has already been made to do a distributed system (outside of the engineer's control...), is a choice to use Event Sourcing by the engineer then a good idea?

mrkeen

In my experience:

1) We are surrounded by distributed systems all the time. When we buy and sell B2B software, we don't know what's stored in our partners databases, they don't know what's in ours. Who should ask whom, and when? If the data sources disagree, whose is correct? Just being given access to a REST API and a couple of webhooks is all you need to be in full distributed systems land.

2) I honestly do not know of a better approach than event-sourcing (i.e. replicated state machine) to coordinate among multiple masters like this. The only technique I can think of that comes close is Paxos - which does not depend on events. But then the first thing I would do if I only had Paxos, would be to use it to bootstrap some kind of event system on top of it.

Even the non-event-sourcing technologies like DBs use events (journals, write-ahead-logs, sstables, etc.) in their own implementation. (However that does not imply that you're getting events 'for free' by using these systems.)

My co-workers do not put any alternatives forward. Reading a database, deciding what action to do, and then carrying out said action is basically the working definition of a race-condition. Bankers and accountants had this figured out thousands of years ago: a bank can't send a wagon across the country with queries like "How much money is in Joe's account?" wait a week for the reply, and then send a second wagon saying "Update Joe's account so it has $36.43 in it now". It's laughable. But now that we have 50-150ms latencies, we feel comfortable doing GETs and POSTs (with a million times more traffic) and somehow think we're not going to get our numbers wrong.

Like, what's an alternative? I have a shiny billion-dollar fully-ACID SQL db with my customer accounts in them. And my SAAS partner bank also has that technology. Put forward literally any idea other than events that will let us coordinate their accounts such that they're not able to double-spend money, or are prevented from spending money if a node is down. I want an alternative to event sourcing.
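One concrete way an event log answers the read-then-write race described above is optimistic concurrency on the stream: a write declares the version it based its decision on, and loses if the stream has moved. A minimal in-memory Go sketch of that idea (not any real event store's API):

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var ErrConflict = errors.New("stream moved since you read it")

// Stream is an in-memory stand-in for one account's event log.
type Stream struct {
	mu     sync.Mutex
	events []string
}

// Append only succeeds if the caller's view of the stream is current.
// A writer holding a stale balance gets ErrConflict instead of silently
// double-spending, and must re-read and re-decide.
func (s *Stream) Append(expectedVersion int, event string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if expectedVersion != len(s.events) {
		return ErrConflict
	}
	s.events = append(s.events, event)
	return nil
}

func main() {
	var acct Stream
	// Two writers both read version 0 ("balance is $50") before writing.
	fmt.Println(acct.Append(0, "withdraw $40")) // <nil>: first writer wins
	fmt.Println(acct.Append(0, "withdraw $40")) // conflict: second must retry
}
```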

margorczynski

Again - do not fixate on the ES thing, as it was put forward only as an example. You're presenting a case where, for the given scenario, after analysis and weighing the alternatives, this is the optimal solution. I'm speaking about introducing unnecessary complexity just because the tech is cool and trendy.

sanity

The article makes great points about why distributed programming has stalled, but I think there's still room for innovation—especially in how we handle state consistency in decentralized systems.

In Freenet[1], we’ve been exploring a novel approach to consistency that avoids the usual trade-offs between strong consistency and availability. Instead of treating state as a single evolving object, we model updates as summarizable deltas—each a commutative monoid—allowing peers to merge state independently in any order while achieving eventual consistency.

This eliminates the need for heavyweight consensus protocols while still ensuring nodes converge on a consistent view of the data. More details here: https://freenet.org/news/summary-delta-sync/

Would love to hear thoughts from others working on similar problems!

[1] https://freenet.org/
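For readers unfamiliar with the algebra being described, here is a minimal Go sketch of the general property that commutative, idempotent merges converge regardless of delivery order, using a textbook grow-only counter. This only illustrates the property; it is not how Freenet's contracts or summary/delta sync are actually implemented.

```go
package main

import "fmt"

// GCounter is the textbook example of state whose merge is commutative,
// associative and idempotent, so peers can apply deltas in any order and
// still converge on the same value.
type GCounter map[string]uint64 // node ID -> that node's count

// Merge folds another copy (or a delta) into this one via per-node maxima.
func (c GCounter) Merge(other GCounter) {
	for node, n := range other {
		if n > c[node] {
			c[node] = n
		}
	}
}

// Total is the observed value: the sum of all per-node counts.
func (c GCounter) Total() uint64 {
	var sum uint64
	for _, n := range c {
		sum += n
	}
	return sum
}

func main() {
	a := GCounter{"a": 3}
	b := GCounter{"b": 2}
	delta := GCounter{"a": 5} // a later update from node "a"

	// Apply in one order on one peer...
	x := GCounter{}
	x.Merge(a)
	x.Merge(b)
	x.Merge(delta)

	// ...and in a different order on another peer: same result.
	y := GCounter{}
	y.Merge(delta)
	y.Merge(b)
	y.Merge(a)

	fmt.Println(x.Total(), y.Total()) // 7 7
}
```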

Karrot_Kream

Haven't read the post yet (I should, I have been vaguely following y'all along but obviously not close enough!) How is this different from delta-based CRDTs? I've built (admittedly toy) CRDTs as DAGs that ship deltas using lattice operations and it's really not that hard to have it work. There's already CRDT based distributed stores out there. How is this any different?

sanity

Good question! Freenet is a decentralized key-value store, but unlike traditional KV stores, the keys are WebAssembly (WASM) contracts. These contracts define not just what values (i.e., data or state) are valid for that key but also when and how the value can be mutated. They also specify how to efficiently synchronize the value across peers using summaries and deltas.

Each contract determines how state changes are validated, summarized, and merged, meaning you can efficiently implement almost any CRDT mechanism in WASM on top of Freenet. Another key difference is that Freenet is an observable KV store, allowing you to subscribe to values and receive immediate updates when they change.

Karrot_Kream

That's really cool, now I have to read this post. Thanks!

herval

Throwing in my two cents on the LLM impact - I've been seeing an increasing number of systems where a core part of the functionality is either LLMs or LLM-generated code (sometimes on the fly, sometimes cached for reuse). If you think distributed systems were difficult before, try to imagine a system where the code being executed _isn't even debuggable or repeatable_.

It feels like we're racing towards a level of complexity in software that's just impossible for humans to grasp.

klysm

That's okay though! We can just make LLMs grasp it!

herval

ironically or not, the best way to have LLMs be effective at writing valid code is when they work on microservices. Since the scope is smaller and the boundary is clear, tools like Cursor/Windsurf seem to make very few mistakes (compared to pointing them at your monorepo, where they usually end up completely wrong)

klysm

Is it then up to the human to specify the services and how they interact?

moffkalast

"There's always a larger model."

synergy20

I saw comments about embedded development, which I have been doing for a long time. Just want to make a point here: the pay has upper limits. You will be paid fine but will reach the pay limit very fast, and it will stay there for the rest of your career. They can swap someone in with that price tag to do whatever you are working on, because, after all, embedded devel is not rocket science.

cmrdporcupine

The problem with embedded is its proximity to EE which is frankly underpaid.

But it's also more that the "other" kind of SWE work -- "backend" etc. -- is frankly overpaid because of the copious quantities of $$ dumped into it by VC and ad money.