Distributed systems programming has stalled

bsnnkv

Last month I switched from a role working on a distributed system (FAANG) to a role working on embedded software which runs on cards in data center racks.

I was in my last role for a year, and 90%+ of my time was spent investigating things that went "missing" at one of the many failure points between the many distributed components.

I wrote less than 200 lines of code that year and I experienced the highest level of burnout in my professional career.

The technical aspect that contributed the most to this burnout was both the lack of observability tooling and the lack of organizational desire to invest in it. Whenever I would bring up this gap I would be told that we can't spend time/money and wait for people to create "magic tools".

So far the culture in my new embedded (Rust, fwiw) position is the complete opposite. If you're burnt out working on distributed systems and you care about some of the same things that I do, it's worth giving embedded software dev a shot.

jasonjayr

> Whenever I would bring up this gap I would be told that we can't spend time and wait for people to create "magic tools".

That sounds like an awful organizational ethos. 30hrs to make a "magic tool" to save 300hrs across the organization sounds like a no-brainer to anyone paying attention. It sounds like they didn't even want to invest in out-sourced "magic tools" to help either.

bsnnkv

The real kicker is that it wasn't even management saying this, it was "senior" developers on the team.

I wonder if these roles tend to attract people who get the most job enjoyment and satisfaction out of the (manual) investigation aspect; it might explain some of the reluctance to adopt or create more sophisticated observability tooling.

zelphirkalt

Senior doesn't always mean smarter or more experienced or anything really. It just all depends on the company and its culture. It can also mean "worked for longer" (which is not equal to more experienced, as you can famously have 10 times 1y experience, instead of 10y experience) and "more aligned with how management at the company acts".

the_sleaze_

_To play devil's advocate_: It could've sounded like the "new guy" came in and decided he needed to rewrite everything; bring in new xyz; steer the ship. New guy could even have been stepping directly on the toes of those senior developers who had fought and won wars to get where they are now.

In my -very- humble opinion, you should wait at least a year before making big swinging changes or recommendations, most importantly in any big company.

Henchman21

IME, “senior” often means “who is left after the brain-drain & layoffs are done” when you’re at a medium sized company that isn’t prominent.

jbreckmckye

Why would people who are good at [scarce, valuable skill] and get paid [many bananas] to practice it want to even imagine a world where that skill is now redundant? ;-)

cmrdporcupine

Consider that there is a class of human motivation / work culture that considers "figuring it out" to be the point of the job and just accepts or embraces complexity as "that's what I'm paid to do" and gets an ego-satisfaction from it. Why admit weakness? I can read the logs by timestamp and resolve the confusions from the CAP theorem from there!

Excessive drawing of boxes and lines, and the production of systems around them becomes a kind of Glass Bead Game. "I'm paid to build abstractions and then figure out how to keep them glued together!" Likewise, recomposing events in your head from logs, or from side effects -- that's somehow the marker of being good at your job.

The same kind of motivation underlies people who eschew or disparage GUI debuggers (log statements should be good enough or you're not a real programmer), too.

Investing in observability tools means admitting that the complexity might overwhelm you.

As an older software engineer the complexity overwhelmed me a long time ago and I strongly believe in making the machines do analysis work so I don't have to. Observability is a huge part of that.

Also many people need to be shown what observability tools / frameworks can do for them, as they may not have had prior exposure.

And back to the topic of the whole thread, too: can we back up and admit that distributed systems are questionable as an end in themselves? They're a means to an end, and distributing something should be considered only as an approach when a simpler, monolithic system (that is easier to reason about) no longer suffices.

Finally I find that the original authors of systems are generally not the ones interested in building out observability hooks and tools because for them the way the system works (or doesn't work) is naturally intuitive because of their experience writing it.

alabastervlog

I've found the rush to distributed computing when it's not strictly necessary kinda baffling. The costs in complexity are extreme. I can't imagine the median company doing this stuff is actually getting either better uptime or performance out of it—sure, it maybe recovers better if something breaks, maybe if you did everything right and regularly test that stuff (approximately nobody does though), but there's also so very much more crap that can break in the first place.

Plus: far worse performance ("but it scales smoothly" OK but your max probable scale, which I'll admit does seem high on paper if you've not done much of this stuff before, can fit on one mid-size server, you've just forgotten how powerful computers are because you've been in cloud-land too long...) and crazy-high costs for related hardware(-equivalents), resources, and services.

All because we're afraid to shell into an actual server and tail a log, I guess? I don't know what else it could be aside from some allergy to doing things the "old way"? I dunno man, seems way simpler and less likely to waste my whole day trying to figure out why, in fact, the logs I need weren't fucking collected in the first place, or got buried some damn corner of our Cloud I'll never find without writing a 20-line "log query" in some awful language I never use for anything else, in some shitty web dashboard.

Fewer, or cheaper, personnel? I've never seen cloud transitions do anything but the opposite.

It's like the whole industry went collectively insane at the same time.

[EDIT] Oh, and I forgot, for everything you gain in cloud capabilities it seems like you lose two or three things that are feasible when you're running your own servers. Simple shit that's just "add two lines to the nginx config and do an apt-install" becomes three sprints of custom work or whatever, or just doesn't happen because it'd be too expensive. I don't get why someone would give that stuff up unless they really, really had to.

[EDIT EDIT] I get that this rant is more about "the cloud" than distributed systems per se, but trying to build "cloud native" is the way that most orgs accidentally end up dealing with distributed systems in a much bigger way than they have to.

dekhn

I am always happy when I can take a system that is based on distributed computing, and convert it to a stateless single machine job that runs just as quickly but does not have the complexity associated with distributed computing.

Recently I was going to do a fairly big download of a dataset (45T) and when I first looked at it, figured I could shard the file list and run a bunch of parallel loaders on our cluster.

Instead, I made a VM with 120TB storage (using AWS with FSX) and ran a single instance of git clone for several days (unattended; just periodically checking in to make sure that git was still running). The storage was more than 2X the dataset size because git LFS requires 2X disk space. A single multithreaded git process was able to download at 350MB/sec and it finished at the predicted time (about 3 days). Then I used 'aws sync' to copy the data back to s3, writing at over 1GB/sec. When I copied the data between two buckets, the rate was 3GB/sec.

That said, there are things we simply can't do without distributed computing because there are strong limits on how many CPUs and local storage can be connected to a single memory address space.

achierius

My wheelhouse is lower on the stack, so I'm curious as to what you mean by "stateless single machine job" -- do you just mean that it runs from start to end, without options for suspension/migration/resumption/etc.?

jimbokun

Distributed or not is a very binary function. If you can run in one large server, great, just write everything in non-distributed fashion.

But once you need that second server, everything about your application needs to work in distributed fashion.

th0ma5

I wish I could upvote you again. The complexity balloons when you try to adapt something that wasn't distributed, and often things can be way simpler and more robust if you start with a distributed concept.

throwawaymaths

the minute you have a client (browser, e.g.) and a server you're doing a distributed system and you should be thinking a little bit about edge cases like loss of connection, incomplete tx. a lot of the goto protocols (tcp, http, even stuff like s3) are built with the complexities of distributed systems in mind so for most basic cases, a little thought goes a long way. but you get weird shit happening all the time (that may be tolerable) if you don't put any effort into it.
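
A minimal sketch of that kind of small effort, assuming a hypothetical orders endpoint and the requests library: retry on connection loss, but reuse a client-generated idempotency key so a request that actually succeeded before the response was lost isn't applied twice.

    import uuid
    import requests  # any HTTP client would do; requests assumed here

    def create_order(payload, retries=3):
        # One idempotency key for the whole logical operation, reused across
        # retries, so the server can deduplicate a request that succeeded but
        # whose response never made it back.
        key = str(uuid.uuid4())
        for attempt in range(retries):
            try:
                resp = requests.post(
                    "https://api.example.invalid/orders",  # hypothetical endpoint
                    json=payload,
                    headers={"Idempotency-Key": key},
                    timeout=5,
                )
                resp.raise_for_status()
                return resp.json()
            except requests.RequestException:
                if attempt == retries - 1:
                    raise  # bubble it up; the caller decides what failure means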

lumost

Anecdotally, I see a major under appreciation for just how fast and efficient modern hardware is in the distributed systems community.

I’ve seen a great many engineers become so used to provisioning compute that they forget that the same “service” can be deployed in multiple places. Or jump to building an orchestration component when a simple single process job would do the trick.

bryanlarsen

I think you were unlucky in your distributed system job and lucky in your embedded job. Embedded is filled with crappy 3rd party and in-house tooling, far more so than distributed, in my experience.

Embedded does give you a greater feeling of control. When things aren't working, it's much more likely to be your own fault.

bob1029

> Whenever I would bring up this gap I would be told that we can't spend time/money and wait for people to create "magic tools".

I've never once been granted explicit permission to try a different path without being burdened by a mountain of constraints that ultimately render the effort pointless.

If you want to try a new thing, just build it. No one is going to encourage you to shoot holes through things that they hang their own egos from.

EtCepeyd

This resonates a lot with me.

Distributed systems require insanely hard math at the bottom (paxos, raft, gossip, vector clocks, ...) It's not how the human brain works natively -- we can learn abstract thinking, but it's very hard. Embedded systems sometimes require the parallelization of some hot spots, but those are more like the exception AIUI, and you have a lot more control over things; everything is more local and sequential. Even data race free multi-threaded programming in modern C and C++ is incredibly annoying; I dislike dealing with both an explicit mesh of peers, and with a leaky abstraction that lies that threads are "symmetric" (as in SMP) while in reality there's a complicated messaging network underneath. Embedded is simpler, and it seems to require less that practitioners become advanced mathematicians for day to day work.

AlotOfReading

Most embedded systems are distributed systems these days, there's simply a cultural barrier that prevents most practitioners from fully grappling with that fact. A lot of systems I've worked on have benefited from copying ideas invented by distributed systems folks working on networking stuff 20 years ago.

DanielHB

I worked in an IoT platform that consisted of 3 embedded CPUs and one linux board. The kicker was that the linux board could only talk directly to one of the chips, but had to be capable of updating the software running on all of them.

That platform could be scaled out to up to 6 of its kind in a master-slave configuration (the platform in physical position 1 would assume the "master" role, for a total of 18 embedded chips and 6 linux boards), on top of optionally having one more box with one more CPU in it for managing some other stuff and integrating with each of our clients' hardware. Each client had a different integration, but at least they mostly integrated with us, not the other way around.

Yeah it was MUCH more complex than your average cloud. Of course the original designers didn't even bother to make a common network protocol for the messages, so each point of communication not only used a different binary format, they also used different wire formats (CAN bus, Modbus and ethernet).

But at least you didn't need to know kubernetes, just a bunch of custom stuff that wasn't well documented. Oh yeah and don't forget the boot loaders for each embedded CPU, we had to update the bootloaders so many times...

The only saving grace is that a lot of the system could rely on the literal physical security because you need to have physical access (and a crane) to reach most of the system. Pretty much only the linux boards had to have high security standards and that was not that complicated to lock down (besides maintaining a custom yocto distribution that is).

Thaxll

It does not require any math, because 99.9% of the time the issue is not in the low-level implementation but in the business logic the dev wrote.

No one goes to review the transaction engine of Postgres.

EtCepeyd

I tend to disagree.

- You work on postgres: you have to deal with the transaction engine's internals.

- You work in enterprise application integration (EAI): you have ten legacy systems that inevitably don't all interoperate with any one specific transaction manager product. Thus, you have to build adapters, message routing and propagation, gateways, at-least-once-but-idempotent delivery, and similar stuff, yourself. SQL business logic will be part of it, but it will not solve the hard problems, and you still have to dig through multiple log files on multiple servers, hoping that you can rely on unique request IDs end-to-end (and that the timestamps across those multiple servers won't be overly contradictory).

In other words: same challenges at either end of the spectrum.

motorest

> Distributed systems require insanely hard math at the bottom (paxos, raft, gossip, vector clocks, ...) It's not how the human brain works natively -- we can learn abstract thinking, but it's very hard.

I think this take is misguided. Most systems nowadays, especially those involving any sort of network calls, are already distributed systems. Yet the number of systems that come even close to touching fancy consensus algorithms is very, very limited. If you are in a position to design a system and you hear "Paxos" coming out of your mouth, that's the moment you need to step back and think about what you are doing. Odds are you are creating your own problems, and then blaming the tools.

convolvatron

this is completely backwards. the tools may have some internal consistency guarantees, handle some classes of failures, etc. They are leaky abstractions that are partially correct. They were not collectively designed to handle all failures and consistent views no matter their composition.

From the other direction, Paxos, two generals, serializability, etc. are not hard concepts at all. Implementing custom solutions in this space _is_ hard and prone to error, but the foundations are simple and sound.

You seem to be claiming that you shouldn't need to understand the latter, that the former gives you everything you need. I would say that if you build systems using existing tools without even thinking about the latter, you're just signing up to handle preventable errors manually and treating this box that you own as black and inscrutable.

toast0

That's true, but you can do a lot of that once, and then get on with your life, if you build the right structures. I've gotten a huge amount of mileage from consensus to decide where to send reads/writes to, then everyone sends their reads/writes for the same piece of data to the same place; that place does the application logic where it's simple, and sends the result back. If you don't get the result back in time, bubble it up to the end-user application and it may retry or not, depending.

This is built upon a framework of the network is either working or the server team / ops team is paged and will be actively trying to figure it out. It doesn't work nearly as well if you work in an environment where the network is consistently slightly broken.
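
A rough sketch of that structure, with the consensus output stubbed out as a fixed shard-to-node map (names are made up, and send stands in for whatever transport you have): every read/write for a key is forwarded to the single owner of its shard, and a timeout just bubbles up to the caller.

    import hashlib

    # The assignment below would normally come out of the consensus layer
    # (Raft/Paxos/...); a fixed map keeps the routing logic visible.
    SHARDS = 16
    SHARD_OWNERS = {shard: f"node-{shard % 3}" for shard in range(SHARDS)}

    def owner_of(key: str) -> str:
        shard = int(hashlib.sha256(key.encode()).hexdigest(), 16) % SHARDS
        return SHARD_OWNERS[shard]

    def write(key: str, value, send, timeout=0.5):
        # Every writer for the same key hits the same owner, so the owner can
        # apply application logic locally and atomically. A timeout raised by
        # send() is deliberately not handled here; it bubbles up to the caller.
        return send(owner_of(key), ("write", key, value), timeout=timeout)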

beoberha

Yep - I’ve very much been living the former for almost a decade now. It is especially difficult when the components stretch across organizations. It doesn’t quite address what the author here is getting at, but it does make me believe that this new programming model will come from academia and not industry.

anonzzzies

I really love embedded work; at least it gives you the feeling that you have control over things. Not everything being confused and black boxed where you have to burn a goat to make it work, sometimes.

porridgeraisin

> where you have to burn a goat to make it work, sometimes.

Or talk to a goat, sometimes

https://modernfarmer.com/2014/05/successful-video-game-devel...

hintymad

This reminds me of Rob Pike's article "Systems Software Research is Irrelevant," written about 15 years ago. Perhaps many systems have matured to a point where any improvement appears incremental to engineers, so the conviction to develop a new programming model isn't strong enough. Or perhaps we're in a temporary plateau, and a groundbreaking tool will emerge in a few years.

Regarding Laddad's point, building tools native to distributed systems programming might be intrinsically difficult. It's not for lack of trying. We've invented numerous algebras, calculi, programming models, and experimental programming languages over the past decades, yet somehow none has really taken off. If anything, I'd venture to assert that object storage, perhaps including Amazon DynamoDB, has changed the landscape of programming distributed systems. These two systems, which optimize for throughput and reliability, make programming distributed systems much easier. Want a queue system? Build on top of S3. Want a database? Focus on query engines and outsource storage to S3. Want a task queue? Just poll DDB tables. Want to exchange states en masse? Use S3. The list goes on.
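
A back-of-the-envelope sketch of the queue-on-S3 idea, using boto3 with a made-up bucket name: keys are timestamp-prefixed so a listing is roughly FIFO, and delivery is at-least-once, nothing fancier.

    import json
    import time
    import uuid
    import boto3

    s3 = boto3.client("s3")
    BUCKET = "my-task-bucket"   # hypothetical bucket
    PREFIX = "queue/"

    def enqueue(task: dict):
        # Timestamp-prefixed keys so a lexicographic listing is roughly FIFO.
        key = f"{PREFIX}{time.time():020.6f}-{uuid.uuid4()}.json"
        s3.put_object(Bucket=BUCKET, Key=key, Body=json.dumps(task).encode())

    def poll(handle):
        while True:
            page = s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX, MaxKeys=10)
            for obj in page.get("Contents", []):
                body = s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"].read()
                handle(json.loads(body))
                # Delete after processing: at-least-once delivery, not exactly-once.
                s3.delete_object(Bucket=BUCKET, Key=obj["Key"])
            time.sleep(1)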

Internally to S3, I think the biggest achievement is that S3 can use scalability to its advantage. Adding a new machine makes S3 cheaper, faster, and more reliable. Unfortunately, this involves multiple moving parts and is therefore difficult to abstract into a tool. Perhaps an arbitrarily scalable metadata service is what everyone could benefit from? Case in point, Meta's warm storage can scale to multiple exabytes with a flat namespace. Reading the paper, I realized that many designs in the warm storage are standard, and the real magic lies in its metadata management, which happens to be outsourced to Meta's ZippyDB. Meanwhile, open-source solutions often boast about their scalability, but in reality, all known ones have certain limits, usually no more than 100PBs or a few thousand nodes.

gklitt

This is outside my area of expertise, but the post sounds like it’s asking for “choreographic programming”, where you can write an algorithm in a single function while reasoning explicitly about how it gets distributed:

https://en.m.wikipedia.org/wiki/Choreographic_programming

I’m curious to what extent the work in that area meets the need.

shadaj

You caught me! That's what my next post is about :)

lachlan_gray

This could be a fun example to work with :p

https://en.m.wikipedia.org/wiki/Shakespeare_Programming_Lang...

shadaj

You might enjoy my first ever blog post from ~10 years ago, when I first learned about distributed systems: https://www.shadaj.me/writing/romeo-juliet-and-reactive-prog...

rectang

Ten years ago, I had lunch with Patricia Shanahan, who worked for Sun on multi-core CPUs several decades ago (before taking a post-career turn volunteering at the ASF, which is where I met her). There was a striking similarity between the problems that Sun had been concerned with back then and the problems of the distributed systems that power so much of the world today.

Some time has passed since then — and yet, most people still develop software using sequential programming models, thinking about concurrency occasionally.

It is a durable paradigm. There has been no revolution of the sort that the author of this post yearns for. If "Distributed Systems Programming Has Stalled", it stalled a long time ago, and perhaps for good reasons.

EtCepeyd

> and perhaps for good reasons

For the very good reason that the underlying math is insanely complicated and tiresome for mere practitioners (which, although I have a background in math, I openly aim to be).

For example, even if you assume sequential consistency (which is an expensive assumption) in a C or C++ language multi-threaded program, reasoning about the program isn't easy. And once you consider barriers, atomics, load-acquire/store-release explicitly, the "SMP" (shared memory) proposition falls apart, and you can't avoid programming for a message passing system, with independent actors -- be those separate networked servers, or separate CPUs on a board. I claim that struggling with async messaging between independent peers as a baseline is not why most people get interested in programming.

Our systems (= normal motherboards on one and, and networked peer to peer systems on the other end) have become so concurrent that doing nearly anything efficiently nowadays requires us to think about messaging between peers, and that's very-very foreign to our traditional, sequential, imperative programming languages. (It's also foreign to how most of us think.)

Thus, I certainly don't want a simple (but leaky) software / programming abstraction that hides the underlying hardware complexity; instead, I want the hardware to be simple (as little internally-distributed as possible), so that the simplicity of the (sequential, imperative) programming language then reflects and matches the hardware well. I think this can only be found in embedded nowadays (if at all), which is why I think many are drawn to embedded recently.

hinkley

I think SaaS and multicore hardware are evolving together because a queue of unrelated, partially ordered tasks running in parallel is a hell of a lot easier to think about than trying to leverage 6-128 cores to keep from ending up with a single user process that’s wasting 84-99% of available resources. Most people are not equipped to contend with Amdahl’s Law. Carving 5% out of the sequential part of a calculation is quickly becoming more time efficient than taking 50% out of the parallel parts, and we’ve spent 40 years beating the urge to reach for 1-4% improvements out of people. When people find out I got a 30% improvement by doing 8+6+4+4+3+2+1.5+1.5 they quickly find someplace else to be. The person who did the compressed pointer work on v8 to make it as fast as 64 bit pointers is the only other person in over a decade I’ve seen document working this way. If you’re reading this we should do lunch.

So because we discovered a lucrative, embarrassingly parallel problem domain that’s what basically the entire industry has been doing for 15 years, since multicore became unavoidable. We have web services and compilers being multi-core and not a lot in between. How many video games still run like three threads and each of those for completely distinct tasks?
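
A back-of-the-envelope illustration of the Amdahl's law point above, with made-up numbers: once core counts get high enough, shaving 5% off the serial fraction beats halving the parallel work.

    def run_time(serial, parallel, cores):
        # Amdahl's law: the serial part doesn't shrink no matter how many cores you add.
        return serial + parallel / cores

    SERIAL, PARALLEL = 0.10, 0.90   # assume a workload that is 10% serial

    for cores in (16, 64, 128):
        shave_serial = run_time(SERIAL * 0.95, PARALLEL, cores)    # 5% off the serial part
        halve_parallel = run_time(SERIAL, PARALLEL * 0.50, cores)  # 50% off the parallel part
        print(cores, round(shave_serial, 4), round(halve_parallel, 4))

    # At 16 or 64 cores, halving the parallel work still wins; by 128 cores the
    # 5% serial shave is already the better deal, and the gap only grows.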

gmadsen

I know c++ has a lackluster implementation, but do coroutines and channels solve some of these complaints? Although not inherently multithreaded, many things shouldn't be multithreaded, just paused, and channels instead of shared memory can control ordering.

hinkley

Coroutines basically make the same observation as transmit windows in TCP/IP: you don’t send data as fast as you can if the other end can’t process it, but also if you send one at a time then you’re going to be twiddling your fingers an awful lot. So you send ten, or twenty, and you wait for signs of progress before you send more.

On coroutines it’s not the network but the L1 cache. You’re better off running a function a dozen times and then running another than running each in turn.

EtCepeyd

I've found both explicit future/promise management and coroutines difficult (even irritating) to reason about. Coroutines look simpler on the surface (than explicit future chaining), and so their syntax is less atrocious, but there are nasty traps. For example:

https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines...

cmrdporcupine

What we need is for formal verification tools (for linearizability, etc.) to be far more understood and common.

bigmutant

The fundamental problems are communication lag and lack of information about why issues occur (encapsulated by the Byzantine Generals problem). I like to imagine trying to build a fault-tolerant, reliable system for the Solar System. Would the techniques we use today (retries, timeouts, etc) really be adequate given that lag is upwards of hours instead of milliseconds? But that's the crux of these systems, coordination (mostly) works because systems are close together (same board, at most same DC)

hinkley

I think the underlying premise of Cloud is:

Pay a 100% premium on compute resources in order to pretend the 8 Fallacies of Distributed Computing don’t exist.

I sat out the beginning of Cloud and was shocked at how completely absent they are from conversations within the space. When the hangover hits it’ll be ugly. The Devil always gets his due.

jimbokun

The author critiques having sequential code executing on individual nodes, uninformed by the larger distributed algorithm in which they play a part.

However, I think there are great advantages to that style. It’s easier to analyze and test the sequential code for correctness. Then it writes a Kafka message or makes an HTTP call and doesn’t need to be concerned with whatever is handling the next step in the process.

Then assembling the sequential components once they are all working individually is a much simpler task.
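
A small sketch of that style, assuming kafka-python and a hypothetical topic name: the business step is a pure function that's trivial to unit test, and the producer call is the only distributed edge.

    import json
    from kafka import KafkaProducer  # assumes the kafka-python package

    def price_order(order: dict) -> dict:
        # Pure, sequential business logic: easy to analyze and unit test in isolation.
        total = sum(item["price"] * item["qty"] for item in order["items"])
        return {**order, "total": total}

    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode(),
    )

    def handle(order: dict):
        # The only distributed concern: hand the result to whatever consumes the topic.
        producer.send("orders.priced", price_order(order))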

shadaj

Stay tuned for the next blog post for one potential answer :) My PhD has been focused on this gap!

rectang

As a programmer, I hope that your answer continues to abstract away the problems of concurrency from me, the way that CPU designers have managed, so that I can still think sequentially except when I need to. (And as a senior engineer, you need to — developing reliable concurrent systems is like pilots landing planes in bad weather, part of the job.)

hinkley

I was doing some Java code recently after spending a decade in async code and boy that first few minutes was like jumping into a cold pool. Took me a moment to switch gears back to everything is blocking and that function just takes 500ms sometimes, waiting for IO.

hinkley

I don’t think there’s anyone in the Elixir community who wouldn’t love it if companies would figure out that everyone is writing software that contains an ad hoc, informally-specified, bug-ridden, slow implementation of half of Erlang, and start hiring Elixir or Gleam devs.

The future is here, but it is not evenly distributed.

ikety

It's so odd seeing people dissuade others from using "niche" languages like Elixir or Gleam. If you post a job opportunity with these languages, I guarantee you will be swamped with qualified candidates who are very passionate and excited to work with these languages full time.

jimbokun

Yes, my sense reading the article was the user is reinventing Erlang.

ignoramous

> writing software that contains an ad hoc, informally-specified, bug-ridden, slow implementation of half of Erlang

Since you were at AWS (?), you'd know that Erlang did get its shot at distributed systems there. I'm unsure what went wrong, but if not c/c++, it was all JVM based languages soon after that.

hinkley

No I worked a contract in the retail side and would not wish that job on anyone. My most recent favorite boss works there now and I haven’t even said hi because I’m afraid he’ll offer me a job.

KaiserPro

Distributed systems are hard, as we all know.

However the number of people that actually need a distributed system is pretty small. With the rise of kubernetes, the number of people who've not been burnt by going distributed when they didn't need to has rapidly dropped.

You go distributed either because you are desperate, or because you think it would be fun. K8s takes the fun out of most things.

Moreover, with machines suddenly getting vast IO improvements, the need for going distributed is much less than it was 10 years ago. (Yes, I know there is fault tolerance, but that adds another dimension of pain.)

sd9

> the number of people who've not been burnt by going distributed when they didn't need to has rapidly dropped

Gosh, this was hard to parse! I’m still not sure I’ve got it. Do you mean “kubernetes has caused more people to suffer due to going distributed unnecessarily”, or something else?

boarush

Had me confused for a second too, but I think it is the former that they meant.

K8s has complexity that really isn't required at even fairly decent scales, if you've put in enough effort to architect a solution that makes the right calls for your business.

KaiserPro

yeah sorry, double negatives.

People got burnt by kubernetes, and that pissed in the well of enthusiasm for experimenting with distributed systems

bigmutant

Good resources for understanding Distributed Systems:

- MIT course with Robert Morris (of Morris Worm fame): https://www.youtube.com/watch?v=cQP8WApzIQQ&list=PLrw6a1wE39...

- Martin Kleppmann (author of DDIA): https://www.youtube.com/watch?v=UEAMfLPZZhE&list=PLeKd45zvjc...

If you can work through the above (and DDIA), you'll have a solid understanding of the issues in Distributed System, like Consensus, Causality, Split Brain, etc. You'll also gain a critical eye of Cloud Services and be able to articulate their drawbacks (ex: did you know that replication to DynamoDB Secondary Indexes is eventually consistent? What effects can that have on your applications?)

ignoramous

> Robert Morris (of Morris Worm fame)

(of Y Combinator fame, too)

tracnar

The unison programming language does foray into a truly distributed programming language: https://www.unison-lang.org/

aDyslecticCrow

Functional programming languages already have a lot of powerful concepts for distributed programming. Loads of the distributed programming techniques used elsewhere were taken from an obscure FP language years prior. Erlang comes to mind as still quite uniquely distributed, with no real non-FP equivalent.

Unison seems to build on it further. Very cool

gregw2

What I noticed missing in this analysis of distributed systems programming was a recognition/discussion of how distributed databases (or datalakes) decoupling storage from compute have changed the art of the possible.

In the old days of databases, if you put all your data in one place, you could scale up (SMP) but scaling out (MPP) really was challenging. Nowdays, you (iceberg), or a DB vendor (Snowflake, Databricks, BigQuery, even BigTable, etc), put all your data on S3/GCS/ADLS and you can scale out compute to read traffic as much as you want (as long as you accept something like a snapshot isolation read level and traffic is largely read-only or writes are distributed across your tables and not all to one big table.)

You can now share data across your different compute nodes or applications/systems via permissions and pointers managed by a cloud metadata/catalog service. You can get microservice databases without each having completely separate datastores, in a way.

sanity

The article makes great points about why distributed programming has stalled, but I think there's still room for innovation—especially in how we handle state consistency in decentralized systems.

In Freenet[1], we’ve been exploring a novel approach to consistency that avoids the usual trade-offs between strong consistency and availability. Instead of treating state as a single evolving object, we model updates as summarizable deltas—each a commutative monoid—allowing peers to merge state independently in any order while achieving eventual consistency.

This eliminates the need for heavyweight consensus protocols while still ensuring nodes converge on a consistent view of the data. More details here: https://freenet.org/news/summary-delta-sync/

Would love to hear thoughts from others working on similar problems!

[1] https://freenet.org/
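
A toy illustration of the order-independent merging described above (per-peer counters merged by elementwise max, so merge is commutative, associative and idempotent); the names are made up and this is only the general shape of the idea, not Freenet's mechanism.

    # Per-peer counters; merge takes the elementwise max, so it is commutative,
    # associative and idempotent, and peers converge no matter the delivery order.

    def merge(a: dict, b: dict) -> dict:
        return {peer: max(a.get(peer, 0), b.get(peer, 0)) for peer in a.keys() | b.keys()}

    def value(state: dict) -> int:
        return sum(state.values())

    d1 = {"peer-a": 3}
    d2 = {"peer-b": 5}
    d3 = {"peer-a": 4}   # a later delta from peer-a, superseding d1

    assert merge(merge(d1, d2), d3) == merge(d3, merge(d2, d1))   # order doesn't matter
    assert value(merge(merge(d1, d2), d3)) == 9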

Karrot_Kream

Haven't read the post yet (I should, I have been vaguely following y'all along but obviously not close enough!) How is this different from delta-based CRDTs? I've built (admittedly toy) CRDTs as DAGs that ship deltas using lattice operations and it's really not that hard to have it work. There's already CRDT based distributed stores out there. How is this any different?

sanity

Good question! Freenet is a decentralized key-value store, but unlike traditional KV stores, the keys are WebAssembly (WASM) contracts. These contracts define not just what values (i.e., data or state) are valid for that key but also when and how the value can be mutated. They also specify how to efficiently synchronize the value across peers using summaries and deltas.

Each contract determines how state changes are validated, summarized, and merged, meaning you can efficiently implement almost any CRDT mechanism in WASM on top of Freenet. Another key difference is that Freenet is an observable KV store, allowing you to subscribe to values and receive immediate updates when they change.

Karrot_Kream

That's really cool, now I have to read this post. Thanks!

margorczynski

Distributed systems are cool, but most people don't really get how much complexity they introduce, which leads to fad-driven decisions like using Event Sourcing where there is no fundamental need for it. I've seen projects get burned by the complexity and overhead it introduces where "simpler" approaches worked well and were easy to extend/fix. Hard-to-find-and-fix bugs, much slower feature addition, and lots more goodies that the blogs with toy examples don't mention.

nine_k

The best recipe I know is to start from a modular monolith [1] and split it when and if you need to scale way past a few dozen nodes.

Event sourcing is a logical structure; you can implement it with SQLite or even flat files, locally, if your problem domain is served well by it. Adding Kafka as the first step is most likely a costly overkill.
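
A minimal local event-sourcing sketch along those lines, using sqlite3 with made-up table and event names: an append-only events table, with current state rebuilt by folding over it.

    import json
    import sqlite3

    db = sqlite3.connect("events.db")
    db.execute("CREATE TABLE IF NOT EXISTS events (seq INTEGER PRIMARY KEY, type TEXT, data TEXT)")

    def append(event_type: str, data: dict):
        db.execute("INSERT INTO events (type, data) VALUES (?, ?)", (event_type, json.dumps(data)))
        db.commit()

    def balance(account: str) -> int:
        # Current state is just a fold over the event log; no broker involved.
        total = 0
        for etype, raw in db.execute("SELECT type, data FROM events ORDER BY seq"):
            event = json.loads(raw)
            if event.get("account") != account:
                continue
            total += event["amount"] if etype == "deposited" else -event["amount"]
        return total

    append("deposited", {"account": "acct-1", "amount": 100})
    append("withdrew", {"account": "acct-1", "amount": 30})
    assert balance("acct-1") == 70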

[1]: https://awesome-architecture.com/modular-monolith/

margorczynski

What you're speaking of is a need/usability-based design and extension where you design the solution with certain "safety valves" that let you scale it up when needed.

This is in contrast to the fad-driven design and over-engineering that I'm speaking of (here I simply used ES as an example) that is usually introduced because someone in power saw a blog post or 1h talk and it looked cool. And Kafka will be used because it is the most "scalable" and shiny solution, there is no pros-vs-cons analysis.

rjbwork

If the choice has already been made to do a distributed system (outside of the engineer's control...), is a choice to use Event Sourcing by the engineer then a good idea?

kodablah

> Just like the external-distribution model, arbitrary-location architectures often come with a performance cost. Durable execution systems typically snapshot their state to a persistent store between every step.

This is not true by most definitions of "snapshot". Most (all?) durable execution systems use event sourcing, so what is persisted is effectively an immutable event log. And only the events that have external side effects are recorded, enough to rebuild the state, not all state. While technically this is not free, it's much cheaper than the traditional definition of capturing and storing a "snapshot".
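
A generic illustration of that event-sourced replay idea (made-up names, only the general shape, not any particular engine's machinery): record each side effect's result in an append-only log, and on replay return the recorded result instead of re-executing.

    class Replayer:
        def __init__(self, log):
            self.log = log   # the persisted event log: results of side effects, in order
            self.pos = 0

        def step(self, name, side_effect):
            if self.pos < len(self.log):
                result = self.log[self.pos]   # replay: reuse the recorded result
            else:
                result = side_effect()        # first execution: run and record it
                self.log.append(result)
            self.pos += 1
            return result

    def workflow(r):
        payment_id = r.step("charge", lambda: "pay-123")         # stands in for an external call
        return r.step("email", lambda: f"emailed {payment_id}")

    log = []
    workflow(Replayer(log))                               # first run executes both side effects
    assert workflow(Replayer(log)) == "emailed pay-123"   # replay rebuilds state without re-executing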

> But this simplicity comes at a significant cost: control. By letting the runtime decide how the code is distributed [...] we don’t want to give up: Explicit control over placement of logic on machines, with the ability to perform local, atomic computations

Not all durable execution systems require you to give this up completely. Temporal (disclaimer: my employer) allows grouping of logical work by task queue which many users use to pick locations of work, even so far as a task queue per physical resource which is very common for those wanting that explicit control. Also there are primitives for executing short, local operations within workflows assuming that's what is meant there.