Erlang's not about lightweight processes and message passing (2023)

bcardarella

The amazing thing about Erlang and the BEAM is its depth of features. To the OP, the Behaviour/Interface of Erlang is the biggest takeaway. For me, I believe it is how you require far, far fewer development resources to build complex systems than you would require in any other language (provided comparable experience in both stacks). And for many, it is the lightweight processes and programming model.

OTP itself has so much in it. We've been working on compiling Elixir to run on iOS devices. Not only can we do that through the release process, but by using the ei library provided in Erlang we can compile a Node in C that will interface with any other Erlang node over a typical distributed network as you would for Erlang, Elixir, Gleam, etc. Furthermore, there is an rpc library in Erlang where from C we can make function calls and interface with our Elixir application. Yes, the encoding/decoding has an overhead and FFI would be faster, but we're still way within our latency budget and we got this stood up in a few days without even having heard of it before.

The larger point here is that Erlang has been solving many of the problems that modern tech stacks are struggling with and it has solved for scale and implementation cost and it solved these problems decades ago. I know HN has a bit of a click-bait love relationship with Erlang/Elixir but it hasn't translated over to adoption and there are companies that are just burning money trying to do what you get out of the box for free with the Erlang stack.

hosh

I went from a company that used Elixir in the backend to one that uses Nodejs.

I had gone in neutral about Nodejs, having never really used it much.

The projects I worked on were backend data pipelines that did not even process that much data. And yet somehow, it was incredibly difficult to isolate exactly the main bug. Along the way, I found out all sorts of things about Nodejs, and when I compared it with Elixir/Erlang/OTP, I came to the conclusion that Node.js is unreliable by design.

Don't get me wrong. I've done a lot of Ruby work before, and I've messed with Python. Many current-generation language platforms are struggling with building reliable distributed systems, things that the BEAM VM and OTP platform had already figured out.

elcritch

Elixir never performs all too well in microbenchmarks. Yet in every application where I've seen Elixir/Erlang projects compared to more standard Node, Python, or even C# projects, the Elixir one generally has way better performance and feels much faster even under load.

Personally I think much of it is due to async being predominant in Node and python. Async seems much harder than actor or even threading for debugging performance issues. Sure it feels easier to do async at first. But async leads to small bloat adding up and makes it very difficult to debug and track down. It makes profiling harder, etc.

In BEAM, every actor has its own queue. It's trivial to inspect and analyze performance blockages. Async by contrast puts everything into one giant processing queue. Plus every function call in async gets extra overhead added. It all adds up.
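To make the per-actor queue point concrete, here is a rough Go sketch (Go only because it is the one language with a snippet in this thread). Each hypothetical `actor` owns a buffered channel standing in for a mailbox, and `backlog` reports how many messages are waiting, loosely analogous to asking a BEAM process for its `message_queue_len`. The names `actor`, `newActor`, `backlog`, `busiest`, and `describe` are invented for illustration, and a bounded Go channel is only an approximation of an unbounded Erlang mailbox.

```go
package main

import "fmt"

// actor is a hypothetical stand-in for a BEAM process: one owner, one private mailbox.
type actor struct {
	name    string
	mailbox chan string
}

func newActor(name string, capacity int) *actor {
	return &actor{name: name, mailbox: make(chan string, capacity)}
}

// backlog reports how many messages are queued but not yet handled.
func (a *actor) backlog() int { return len(a.mailbox) }

// busiest returns the actor with the deepest mailbox (nil for an empty slice).
func busiest(actors []*actor) *actor {
	var worst *actor
	for _, a := range actors {
		if worst == nil || a.backlog() > worst.backlog() {
			worst = a
		}
	}
	return worst
}

// describe renders one actor's backlog for a log line or a REPL session.
func describe(a *actor) string {
	return fmt.Sprintf("%s: %d queued", a.name, a.backlog())
}

func main() {
	fast := newActor("fast", 16)
	slow := newActor("slow", 16)

	// Nothing drains the mailboxes in this sketch, so sends simply pile up.
	fast.mailbox <- "ping"
	for i := 0; i < 5; i++ {
		slow.mailbox <- "job"
	}

	fmt.Println(describe(busiest([]*actor{fast, slow})))
}
```

Because every queue belongs to exactly one actor, finding the bottleneck is a matter of comparing queue depths rather than untangling one shared event queue.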

hosh

This has to do with how async works without preemption and resource limits.

There's a counter-intuitive thing when trying to balance load across resources: applying resource limits helps the system run better overall.

One example: when scaling a web app, there comes a point when scaling up the database doesn't seem to help. So we're tempted to increase the connection pool because that looks like a bottleneck. Increasing the pool can make the overall system perform worse, because oftentimes it is slow and poorly performing queries that are stopping up the system.

Another example: one of the systems I worked on has over 250 node runtimes running on a single, large server. It used pm2 and did not apply cgroups to limit CPU resources. The whole system was a hog, and I temporarily fixed it by consolidating things to run on about 50 node runtimes.

When I moved them over to Kubernetes, I also applied CPU resource limits, each in its own pod. I set the limits based on what I measured when they were all running on PM2 ... but the same code running on Kubernetes ran with 10x less CPU overall. Why? Because the async code was not allowed to just run grabbing as much CPU as it could for as long as it could, and the kernel scheduler was able to run fairly. That allowed the entire system to run with fewer resources overall.

There's probably some math that folks who know Operations Research can use to prove all this.
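The "apply a limit so the whole system behaves" idea can be sketched in a few lines of Go. This is only an illustration of a concurrency cap, not a reproduction of the Kubernetes or connection-pool setups described above: `runLimited` uses a buffered channel as a counting semaphore and reports the peak number of jobs in flight. The helpers `slowQueries` and `summary` are hypothetical, and the sleeps merely stand in for slow queries.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// runLimited runs every job, never more than limit at a time,
// and returns the highest number of jobs observed in flight.
// It assumes limit is at least 1.
func runLimited(limit int, jobs []func()) int {
	sem := make(chan struct{}, limit) // counting semaphore
	var (
		mu       sync.Mutex
		wg       sync.WaitGroup
		inFlight int
		peak     int
	)
	for _, job := range jobs {
		sem <- struct{}{} // blocks while `limit` jobs are already running
		wg.Add(1)
		go func(job func()) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot after the bookkeeping below

			mu.Lock()
			inFlight++
			if inFlight > peak {
				peak = inFlight
			}
			mu.Unlock()

			job()

			mu.Lock()
			inFlight--
			mu.Unlock()
		}(job)
	}
	wg.Wait()
	return peak
}

// slowQueries builds n hypothetical jobs that each take about millis milliseconds.
func slowQueries(n, millis int) []func() {
	jobs := make([]func(), n)
	for i := range jobs {
		jobs[i] = func() { time.Sleep(time.Duration(millis) * time.Millisecond) }
	}
	return jobs
}

// summary formats the result for printing.
func summary(limit, peak int) string {
	return fmt.Sprintf("limit=%d peak=%d", limit, peak)
}

func main() {
	peak := runLimited(3, slowQueries(20, 5))
	fmt.Println(summary(3, peak))
}
```

The submitting loop blocks once the limit is reached, which is the backpressure: callers slow down instead of piling more work onto an already saturated resource.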

jayd16

> Async by contrast puts everything into one giant processing queue

I don't know about node but C# has async contexts you can use.

neonsunset

> Async by contrast puts everything into one giant processing queue

How can you make performance claims while getting the details completely wrong?

Neither .NET's nor Rust's Tokio async implementations work this way. They use all available cores (unless overridden) and implement a work-stealing thread pool. .NET in addition uses hill-climbing and a cooperative blocking detection mechanism to quickly adapt to workloads and ensure optimal throughput. All that while spending 0.1x CPU on computation when compared to BEAM, and having a much lower memory footprint. You cannot compare Erlang/Elixir with top of the line compiled languages.

mm263

I'd appreciate an in-depth write-up about deficiencies you found in Node and how Erlang fixes them

jlouis

Bandwidth vs latency. Erlang is designed to keep low latency under load with graceful degradation.

hosh

Starting with the most general: Nodejs suffers in the same way that other async systems do -- the lack of preemption means that certain async threads can starve other async threads. You can see this on GUI desktop apps when the GUI freezes because it wasn't written in a way to take that into account.

In other words, the runtime feature that Nodejs is the most proud of and markets to the world as its main advantage does not scale well in a reliable way.

The BEAM runtime has preemption and will degrade in performance much more gracefully. In most situations, because of preemption (and hot code reloading), you still have a chance of attaching a REPL to the live runtime while under load. That allows someone to understand the live environment and maybe even hot patch the live code until the real fix can run through the continuous delivery system.

I'm not going to go into the bad Javascript syntax bloopers that still haunt us and are only partially mitigated by Typescript. That is documented in "Javascript: The Good Parts". Or how the "async" keyword colors function calls, forcing everything in a call chain to also be async, or forcing you to use the older callbacks. Most people I talk to who love Typescript don't consider those to be issues.

The _main_ problems are:

1. Async threads can easily get orphaned in Nodejs. This doesn't happen when using OTP on BEAM because you typically start a gen_server (or a gen_*) under a supervisor. Even processes that are not supervised can be tracked. Because pids (identifiers to processes) are first-class primitives, you can always access the scheduler which will tell you _all_ of the running processes. If you were to attach a Nodejs REPL, you can't really tell. This is because there is no encapsulation of the process, no way to track when something went async, no way to send control messages to those async processes.

2. Because async threads are easily orphaned, errors that get thrown are easily lost. The response I get from people who love Typescript on Nodejs is that this is what the linter is for. That is, we're going to use an external tool to enforce that all errors get handled, rather than having the design of the language and the runtime handle the error. In the BEAM runtime, unhandled errors within a process crash that process, without crashing anything else; processes that are monitoring the crashed process get notified by the runtime that it has crashed. The engineer can then define the logic for handling that crash (retry? restart? throw an error?).

3. The gen_server behavior in OTP defines ways to send control messages. This allows more nuanced approaches to managing subsystems than just restarting when things crash.

I'm pretty much at the point where I would not really want to work on deploying Nodejs on the backend. I don't see how something like Deno would fix anything. Typescript is incapable of fixing this, because these are design flaws in the runtime itself.
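For readers who have not seen monitors, the tracking-plus-notification idea in points 1 and 2 can be approximated in Go, with the large caveat that Go gives you none of this by default: an unrecovered panic in any goroutine takes down the whole program, so the sketch has to build the bookkeeping by hand. Everything here (`registry`, `spawn`, `notice`, `alive`) is a hypothetical illustration, not an API of OTP or of any Go library.

```go
package main

import (
	"fmt"
	"sync"
)

// notice is what a monitor receives when a task stops,
// loosely analogous to a 'DOWN' message on the BEAM.
type notice struct {
	name string
	err  error // nil means the task returned normally
}

// registry tracks every task started through spawn, so none is orphaned.
type registry struct {
	mu      sync.Mutex
	running map[string]bool
}

func newRegistry() *registry {
	return &registry{running: make(map[string]bool)}
}

// spawn runs task in its own goroutine, records it as running, and always
// reports its exit (normal return, error, or panic) on the down channel.
func (r *registry) spawn(name string, task func() error, down chan<- notice) {
	r.mu.Lock()
	r.running[name] = true
	r.mu.Unlock()

	go func() {
		var err error
		defer func() {
			if p := recover(); p != nil {
				err = fmt.Errorf("crashed: %v", p)
			}
			r.mu.Lock()
			delete(r.running, name)
			r.mu.Unlock()
			down <- notice{name: name, err: err}
		}()
		err = task()
	}()
}

// alive reports how many tracked tasks are still running.
func (r *registry) alive() int {
	r.mu.Lock()
	defer r.mu.Unlock()
	return len(r.running)
}

func main() {
	reg := newRegistry()
	down := make(chan notice)

	reg.spawn("ok", func() error { return nil }, down)
	reg.spawn("boom", func() error { panic("bad state") }, down)

	for i := 0; i < 2; i++ {
		n := <-down
		fmt.Printf("%s exited, err=%v\n", n.name, n.err)
	}
	fmt.Println("still running:", reg.alive())
}
```

The point of the comparison is that on the BEAM the runtime does this for every process, whereas here it only works for goroutines that were started through `spawn`.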

javcasas

I have been looking for an Erlang thing akin to Apache Airflow or Argo Workflows. Something that allows me to define a DAG of processes, so that they run one after the other. How would you implement something like that?

goosejuice

Have a look at GenStage, flow, Broadway, Oban pro in elixir land. But OTP alone can get you pretty far.
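As a language-neutral illustration of the "one process per DAG node" shape (written in Go to match the other snippet in this thread, not in Erlang or Elixir), each task below runs in its own goroutine and waits for its dependencies to signal completion. `task` and `runDAG` are hypothetical names. The sketch assumes task names are unique and the graph is acyclic; it checks only for unknown dependencies.

```go
package main

import (
	"fmt"
	"sync"
)

// task is one node of the DAG; deps lists the names that must finish first.
type task struct {
	name string
	deps []string
	run  func()
}

// runDAG starts one goroutine per task. Each goroutine blocks until the
// "done" channels of all its dependencies are closed, then runs.
// It returns the order in which tasks completed.
func runDAG(tasks []task) ([]string, error) {
	done := make(map[string]chan struct{}, len(tasks))
	for _, t := range tasks {
		done[t.name] = make(chan struct{})
	}
	for _, t := range tasks {
		for _, d := range t.deps {
			if _, ok := done[d]; !ok {
				return nil, fmt.Errorf("task %q depends on unknown task %q", t.name, d)
			}
		}
	}

	var (
		mu    sync.Mutex
		wg    sync.WaitGroup
		order []string
	)
	for _, t := range tasks {
		wg.Add(1)
		go func(t task) {
			defer wg.Done()
			for _, d := range t.deps {
				<-done[d] // wait for each dependency to finish
			}
			t.run()
			mu.Lock()
			order = append(order, t.name)
			mu.Unlock()
			close(done[t.name]) // release everything that depends on this task
		}(t)
	}
	wg.Wait()
	return order, nil
}

func main() {
	noop := func() {}
	order, err := runDAG([]task{
		{name: "load", deps: []string{"transform"}, run: noop},
		{name: "transform", deps: []string{"extract"}, run: noop},
		{name: "extract", run: noop},
	})
	fmt.Println(order, err)
}
```

On the BEAM the same structure would presumably map to processes that send completion messages to their dependents, with a supervisor deciding what happens when a node crashes, which this sketch does not attempt.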

winrid

The problem with Node is observability. They've optimized away observability to the point where it's hard to find performance problems compared to the JVM or BEAM.

coldtea

>Node.js is unreliable by design

Well, it's just a hack and some C libraries on top of a browser Javascript engine.

No big thought went into it, either before or after it got big.

paradox460

Adding to this, the primitives erlang, and descendants, give you are very easy to work with, and therefore very easy to test.

Take GenServer, the workhorse of most BEAM systems. Everything it does is basically just calling various functions with simple parameters. So you can test it just by calling those functions, manually passing parameters to them, and asserting on their output. No need to set up complex testing systems that are capable of dealing with asynchronous code, no need to handle pauses and wait for code to finish running in your tests. It's something a lot of juniors tend to miss, but it's liberating once figured out.
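The testing style described above (callbacks are plain functions of state and message) can be mimicked in Go, used here only as a stand-in since this is not how GenServer itself is written. In this sketch `handle` plays the role of a callback such as `handle_call`, and `serve` is the generic loop that owns the state. The `request`, `response`, `handle`, `serve`, and `call` names are all invented for illustration.

```go
package main

import "fmt"

// request is a message to the counter server.
type request struct {
	op    string // "add" or "get"
	n     int
	reply chan response
}

// response carries either a value or an error back to the caller.
type response struct {
	value int
	err   error
}

// handle is the pure callback: old state and a request in,
// new state and a response out. No goroutines, no channels.
func handle(state int, req request) (int, response) {
	switch req.op {
	case "add":
		state += req.n
		return state, response{value: state}
	case "get":
		return state, response{value: state}
	default:
		return state, response{err: fmt.Errorf("unknown op %q", req.op)}
	}
}

// serve is the generic loop that owns the state and feeds requests to handle.
func serve(state int, inbox <-chan request) {
	for req := range inbox {
		var resp response
		state, resp = handle(state, req)
		req.reply <- resp
	}
}

// call sends a request and waits for the reply, like a synchronous call.
func call(inbox chan<- request, op string, n int) response {
	reply := make(chan response, 1)
	inbox <- request{op: op, n: n, reply: reply}
	return <-reply
}

func main() {
	// Testing the callback directly: no server, no waiting.
	state, resp := handle(10, request{op: "add", n: 5})
	fmt.Println(state, resp.value)

	// The same logic running behind the loop.
	inbox := make(chan request)
	go serve(0, inbox)
	call(inbox, "add", 2)
	fmt.Println(call(inbox, "get", 0).value)
	close(inbox)
}
```

Most assertions can target `handle` alone, and only a handful of tests need to exercise the concurrent loop.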

relistan

C nodes are under appreciated. We have one (Cgo) for communicating between Go and Elixir services running in the same Kubernetes pod. The docs are also pretty good for Erlang and its C libs.

agent281

> I know HN has a bit of a click-bait love relationship with Erlang/Elixir but it hasn't translated over to adoption and there are companies that are just burning money trying to do what you get out of the box for free with the Erlang stack.

Do you or the community have a sense why that is?

throwawaymaths

Elixir is "bad" because it is not a friendly language for people who want to be architecture astronauts at the code level (you can definitely be an architecture astronaut at the process management level but that's a very advanced concept). And a lot of CTOs are architecture astronauts.

hosh

That's the opposite of my experience. I tend to get those "architecture astronauts" in teams using other language platforms, and the folks I work with in Erlang or Elixir tend to be pragmatic and willing to dig down the stack to troubleshoot problems.

runlaszlorun

Apologies for my ignorance but what's an "architecture astronaut"?

paradox460

Not just that, but there is no giant gorilla backing BEAM. Google pushes Go and Java, Microsoft node and c#

solid_fuel

My personal opinion as a fan and adopter of the stack is that the benefit is often seen down the line, with the upfront adoption cost being roughly the same.

E.g. the built in telemetry system is fantastic, but when you are first adopting the stack it still takes a day or two to read the docs and get events flowing into - say - DataDog, which is roughly the same amount of time as basically every other solution.

The benefit of Elixir here is that the telemetry stack is very standardized across Elixir projects and libraries, and there are fewer moving pieces - no extra microservices or docker containers to ship with everything else. But that benefit comes 2 years down the line when you need to change the telemetry system.

jayd16

There's no killer app, as in a reason to add it to your tech stack.

The closest I've come across was trying to maintain an ejabberd cluster and add some custom extensions.

Between mnesia and the learning curve of the language itself, it was not fun.

There are also no popular syntax-alikes. There is no massive corporation pushing Erlang either directly or indirectly through success. Supposedly Erlang breeds success but it's referred to as a "secret" weapon because no one big is pushing it.

Erlang seems neat but it feels like you need to take a leap of faith and businesses are risk averse.

sintax

> There is no massive corporation pushing Erlang either directly or indirectly through success.

Isn't there this "small" company that has a chat app that is using erlang :P

0x457

Well, if we're talking medium-size companies - hard to bring any new language.

If we're talking pure modern-tech company - good luck bringing anything other than JS because "more developers == more growth" mentality.

So it either ends up being used where decision makers know/want-to-learn Erlang/Elixir or when all other possibilities have been exhausted.

jacobsenscott

These incremental benefits don't translate to an order of magnitude more productivity, or stability, or profitability. Given the choice, as a business owner, future proofing is about being able to draw from the most plentiful and cheapest pool of workers. The sausage all looks the same on the outside.

hosh

That is not true, especially with Section 174 (for the US). Right now, if you want to hire an Elixir engineer, you're better off finding a generalist willing to learn and use Elixir, and you would probably get someone who is very capable.

With Section 174 in play in the US, it tends to drive companies hiring specialists and attempting to use AI for the rest of it.

My own experience is that ... I don't really want to draw from the most plentiful and cheapest pool of workers. I've seen the kind of tech that produces. You basically have a small handful of software engineers carrying the rest.

Elixir itself is a kind of secret, unfair advantage for tech startups that use it.

bsder

Erlang looks weird--Prolog-based, tail recursive loops, extensive pattern matching.

Also, a lot of the power of Erlang is the OTP (Open Telecom Platform) even more than Erlang, itself. You have to internalize those architectural decisions (expect crashes--do fast restart) to get the full power of Erlang.

Elixir seems like it has been finding more traction by looking more like mainstream languages. In addition, languages on the BEAM (like Elixir) made the BEAM much better documented, understood and portable.

hosh

Eventually, you use Elixir enough, and Erlang starts looking pretty.

hinkley

I’ve worked with a few individuals, mostly managers, who intended to write books informed by our experiences. It was always frustrating for me to see that we disagreed about what aspects of our work made us successful. There was always something they minimized as being nice that I felt was essential.

And here we see someone claiming that lightweight processes and message passing aren’t the secret sauce, missing that Erlang as Communicating Sequential Processes is indivisible from those qualities, and then repeatedly mentioning CSP as part of the secret sauce.

Examples:

> The application programmer writes sequential code, all concurrency is hidden away in the behaviour;

> Easier for new team members to get started: business logic is sequential, similar structure that they might have seen before elsewhere;

> Supervisors and the “let it crash” philosophy, appear to produce reliable systems. Joe uses the Ericsson AXD301 telephone switch example again (p. 191):

Behaviors are interesting and solve a commonly encountered problem in the 80’s that was still being solved in some cases in the 00’s, but it’s a means as much as an end in Erlang. It’s how they implemented those other qualities. But I don’t know if they had to, to make Erlang still mostly be Erlang.

fidotron

Erlang isn't CSP, it's the Actor model. https://en.wikipedia.org/wiki/Actor_model

CSP is what inspired the golang channels, via occam and some other languages. The whole synchronization on unbuffered channels is the most obvious differentiator, though there are others like the actor concept of pattern matching over a mailbox.

The whole CSP vs actor debate is quite interesting when you get down to it because they superficially look kind of similar but are radically different in implications.
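The synchronization difference is easy to see in Go itself. In this small sketch, a send on an unbuffered channel is refused when no receiver is ready (CSP-style rendezvous), while a send into a buffered channel is accepted immediately, which is closer to dropping a message in a mailbox. `trySend`, `rendezvous`, and `report` are hypothetical helpers, and a bounded buffer is only an approximation of an Erlang mailbox.

```go
package main

import "fmt"

// trySend attempts a send without waiting and reports whether it went through.
func trySend(ch chan string, msg string) bool {
	select {
	case ch <- msg:
		return true
	default:
		return false
	}
}

// rendezvous sends on an unbuffered channel; the send only completes
// once the receiving goroutine has actually taken the message.
func rendezvous(msg string) string {
	ch := make(chan string) // unbuffered: sender and receiver must meet
	got := make(chan string)
	go func() { got <- "received " + <-ch }()
	ch <- msg // blocks until the goroutine above receives
	return <-got
}

// report formats one observation for printing.
func report(kind string, accepted bool) string {
	return fmt.Sprintf("%s send with no receiver ready: accepted=%t", kind, accepted)
}

func main() {
	unbuffered := make(chan string)
	mailbox := make(chan string, 8)

	fmt.Println(report("unbuffered", trySend(unbuffered, "hi")))
	fmt.Println(report("mailbox", trySend(mailbox, "hi")))
	fmt.Println(rendezvous("hello"))
}
```

With the unbuffered channel the sender learns that the receiver took the message; with the mailbox it only learns that the message was queued.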

coolKid721

Watch the guy who came up with Erlang, the Actor model and CSP discuss it: https://youtu.be/37wFVVVZlVU One of my favorite videos on youtube.

fidotron

Totally! I love the Alan Kay and Armstrong one too https://www.youtube.com/watch?v=fhOHn9TClXY

hinkley

There are a lot of languages that now claim to be 'Actor Model' and have only a shade on Erlang's fault tolerance and load balancing. That term no longer has the gravitas it once had.

throwawaymaths

The actor model in general doesn't really care about fault tolerance in the way that erlang does.

silisili

Is Erlang considered CSP? I've always thought it wasn't really, and had its own thing called 'actors' which are id'd and can communicate directly, vs CSP which are anonymous and use channel messaging.

I've always thought the actor model made more sense, but highly YMMV.

hinkley

The erlang docs only go as far as saying it’s functionally similar to CSP.

I think the term Actor Model has been so semantically diluted at this point that the phrase also understates what Erlang has as well.

Neither CSP nor AM require process isolation to work, which means they can work when they work but fail much much worse. They are necessary but insufficient.

hajile

It's like saying they are both Turing Complete or that SML modules and Haskell typeclasses are functionally equivalent (even though their use in practice is quite different).

Actors must always have a known address to be accessed and you share them by sharing addresses. You also wouldn't pass an actor to an actor; you'd pass an address instead. CSP channels are first-class. You can create anonymous channels and even pass channels through other channels. This is similar to languages with lambdas and first-class functions vs other languages where every function has a name and functions cannot be passed to other functions.

Actors are naturally async-only and (for example) make no attempt to solve the two generals problem while CSP implementations generally try to enforce synchronization. CSP also enforces message order while actors don't guarantee that messages will be received in the order they were sent.

These are all more theoretical than actual though. CSP channels may be anonymous to the programmer, but they all get process IDs just like Actors would. Actors may seem async, but they can (and no doubt do in practice) make stronger guarantees about message order and synchronicity when on the same CPU. Likewise, CSP would give the illusion of synchronicity and ordering across CPUs where none actually exists (just like TCP).
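The "channels through other channels" point can be shown with Go's channels, which are one CSP-inspired implementation. In the sketch below a producer creates an anonymous stream and hands the channel itself to a consumer over a second channel; the consumer also checks that values arrive in the order they were sent. `produce`, `consume`, and `report` are hypothetical names.

```go
package main

import "fmt"

// produce creates an anonymous stream, hands the channel itself to whoever
// is listening on handoff, then sends 1..n on it.
func produce(handoff chan<- (<-chan int), n int) {
	stream := make(chan int)
	handoff <- stream // the channel is the value being sent
	for i := 1; i <= n; i++ {
		stream <- i
	}
	close(stream)
}

// consume receives a stream over handoff, sums its values, and
// checks that they arrive in the order they were sent.
func consume(handoff <-chan (<-chan int)) (sum int, inOrder bool) {
	stream := <-handoff
	inOrder = true
	prev := 0
	for v := range stream {
		if v != prev+1 {
			inOrder = false
		}
		prev = v
		sum += v
	}
	return sum, inOrder
}

// report formats the result for printing.
func report(sum int, inOrder bool) string {
	return fmt.Sprintf("sum=%d inOrder=%t", sum, inOrder)
}

func main() {
	handoff := make(chan (<-chan int))
	go produce(handoff, 4)
	sum, inOrder := consume(handoff)
	fmt.Println(report(sum, inOrder))
}
```

Neither side ever names the other; they only share the handoff channel, which is the anonymity contrasted above with actor addresses.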

sitkack

Managers make up their own narrative based on vibes.

Towaway69

I've just gotten back into Erlang because of the lightweight processes and message passing; so far behaviours have been secondary (i.e. I'm just learning about them)!

The project is about bringing visual Flow Based Programming (FBP)[1] to Erlang. FBP seems to be made for Erlang, and I was surprised that nothing like this seems to exist already.

My go-to tool for FBP is Node-RED, and hence the basic idea is to bolt a Node-RED frontend onto an Erlang backend and to have every node be a process. Node-RED's frontend is great for modelling message passing between nodes, hence there is a very simple one-to-one mapping to Erlang's processes and messages.

I've implemented some basics and started to create some unit tests as flows to slowly build up functionality. I would really like this to be 100% compatible with Node-RED's NodeJS backend. For more details, the github repo --> https://github.com/gorenje/erlang-red

Overall Erlang is amazingly well suited to this, and I'm astonished that no one else has done anything like this - or have they?

[1] = https://jpaulm.github.io/fbp/index.html

runlaszlorun

Love the idea as well! Would I be wrong in thinking that, at a high-level, fbp is like erlang processes where message flow is one way?

__jonas

Oh that's really cool to see! I always thought a visual programming language on the BEAM would be fun

mcintyre1994

This is a really cool idea!

Towaway69

Thank you, it's also a lot of fun to do :)

Hopefully I can get some useful functionality together without hitting my Erlang coding limits!

Any help is greatly appreciated :+1:

unoti

I came here looking for information about why Ericsson stopped using Erlang, and for more information about Joe's firing.

The short answer seems to be that they pivoted to Java for new projects, which marginalized Erlang. Then Joe and colleagues formed Bluetail in 1998. They were bought by Nortel. Nortel was a telecom giant forming about a third of the value of the Toronto Stock Exchange. In 2000 Nortel's stock reached $125 per share, but by 2002 the stock had gone down to less than $1. This was all part of the dot com crash, and Nortel was hit particularly hard because of the dot com bubble burst corresponding with a big downturn in telecom spending.

It seems safe to look at Joe's layoff as more of a "his unit was the first to slip beneath the waves on a sinking ship" situation, as they laid off 60,000 employees or more than two thirds of their workforce. The layoff was not a sign that he may not have been pulling his weight. It was part of a big move of desperation not to be taken as a sign of the ineffectiveness of that business unit.

cmrdporcupine

It's very weird to me to see the word "fired" in this context. "Laid off" is more appropriate. "Fired" is very value-laden and implies fault and termination with cause. Which I'm sure if that was somehow actually true the original article author would know nothing about, nor would it be any of their business.

senderista

For me the most interesting concept in Erlang/BEAM is that partial recovery is built in from the ground up. When an unexpected state is encountered, instead of either killing the entire process or trying to proceed and risking corruption, you just roll back to a known good state, at the most granular level possible. This idea was researched many years ago under the name of "microreboots"(associated with "crash-only software"), but only Erlang/BEAM made it a first-class concept in a production system.

benmmurphy

You still have to be careful with supervision trees and parts of the tree restarting. For example your system might work if the whole erlang operating system process is suddenly killed and restarted but your system might start corrupting data if parts of the erlang process tree is restarted. Erlang gives you a good model to work with these problems but it doesn't allow you to completely turn off your brain. If you walk in thinking that you can just let things restart and everything will be fine then you might end up getting burnt.

alienthrowaway

> You still have to be careful with supervision trees and parts of the tree restarting [...] Erlang gives you a good model to work with these problems but it doesn't allow you to completely turn off your brain.

Erlang gives architects the tools to restart as little, or as much, of the tree as they like, so I hope they have their brains fully engaged when working on the infrastructure that underlies their projects. For complex projects, it's vital to think long and hard about state-interactions and sub-system dependencies, but the upside for Erlang is that this infrastructure is separated from sequential code via behaviors, and if the organization is big enough, the behaviors will be owned by a dedicated infrastructure team (or person) and consumed by product teams, with clear demarcations of responsibilities.

juped

Yes, you can design your system pathologically to make it wrong.

groestl

> When an unexpected state is encountered, instead of either killing the entire process or trying to proceed and risking corruption, you just roll back to a known good state, at the most granular level possible.

> but only Erlang/BEAM made it a first-class concept in a production system.

Exceptions?

juped

Can you explain how exceptions (partial stack unwinding while carrying a value) do this?

SpaghettiCthulu

Maybe their idea is that you can have a thread that processes work from a queue and catch any exceptions thrown during that processing and just continue processing other work.
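That reading can be sketched in Go, where `recover` plays the role of a catch block: a worker drains a queue, turns a panic in one job into an error, and carries on with the next job. `safely` and `drain` are hypothetical names. Unlike a crashed BEAM process, nothing here discards or rolls back memory the failed job may have half-modified, which is arguably the difference being debated.

```go
package main

import "fmt"

// safely runs one job and converts a panic into an ordinary error.
func safely(job func() int) (result int, err error) {
	defer func() {
		if p := recover(); p != nil {
			err = fmt.Errorf("job panicked: %v", p)
		}
	}()
	return job(), nil
}

// drain processes every job in the queue. A failing job is recorded
// and skipped; the worker itself keeps going.
func drain(queue <-chan func() int) (results []int, failures []error) {
	for job := range queue {
		r, err := safely(job)
		if err != nil {
			failures = append(failures, err)
			continue
		}
		results = append(results, r)
	}
	return results, failures
}

func main() {
	queue := make(chan func() int, 3)
	queue <- func() int { return 1 }
	queue <- func() int { panic("corrupt record") }
	queue <- func() int { return 3 }
	close(queue)

	results, failures := drain(queue)
	fmt.Println(results, failures)
}
```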

travisgriggs

To me, Erlang/Elixir’s power is not necessarily the Actor model implementation, the matching from Prolog, immutability, behaviors, etc., but Joe’s desire to demonstrate you could do more with less.

It is a well thought out and trued system of computation that has a consistency rarely witnessed in other languages, much less the “web”. It is not perfect. But it is pretty impressive.

Unfortunately, I find what simplicity empowers to be pretty underappreciated in the software world. Complexity allows people to become specialists, managers to have big teams and lots of meetings, experts to stay experts.

Erlang was being developed in a period where companies were trying to implement software solutions with smaller headcounts, limited horsepower, etc. A multi decade outpouring of cash into the domain has made the value of “less will mean more for all of us in good ways” less of an attractor.

runlaszlorun

You've just convinced me to spend some more time with Erlang! I've dabbled a bit and, at least on the surface, prefer erlang syntax over elixir.

LtdJorge

Me too, as weird as it might sound

travisgriggs

I mostly prefer the Elixir syntax. But I don’t care for all of the “ends”. Wish they had taken more inspiration from Python in that department.

zelphirkalt

Reminds me of Rich Hickey's talk about Simple VS Easy.

kamma4434

I do not agree with the contents of this article. Behaviors are possible because of the underlying architecture of the system. Behaviors are not interfaces; they are more like abstract objects in a language like Java, so they implement basic and self-contained functionalities that are hidden behind a collaboration interface. But they couldn’t do much without the underlying infrastructure that makes sure that every process is totally separate from other processes, that all processes can be safely closed without leaking memory or resources, and that you cannot just share a rogue pointer between two different processes.

What Joe did in his thesis is to show you how you can build reliable systems (and, up to a point, reliable distributed systems) by using a given set of Lego blocks.

The reason why you need the Erlang VM to implement something like that appropriately, and why you cannot do that fully on a different VM, is that without the underlying plumbing, supervision trees would be leaky: in Java, you cannot kill a thread that is holding on to resources and hope that everything will always go well, and you do not have ways to monitor different processes.

HeavyRain266

Erlang, OTP, and the BEAM offer much more than just behaviours. The VM is similar to a virtual kernel with a supervisor, isolated processes, and a distributed mode that treats multiple (physical or virtual) machines as a single pool of resources. OTP provides numerous useful modules, such as Mnesia (database) and atomic counters/ETS tables (for caching), among others. The runtime also supports bytecode hot-reloading, a feature used to apply patches without any system downtime. While the syntax is not very screen reader-friendly, it is digestible.

Apache Mesos[1] is the only thing that comes to my mind as a similar platform to BEAM in its ability to treat multi-machine resources as a single pool.

Over a year ago, my private consulting company decided to adopt Erlang as our backend language. After some time, we started exploring BEAM's internals to, for example, replace the TCP-based stack with QUIC and integrate some Rust patches. A truly fantastic choice for lightweight and high-throughput systems that only fail in case of kernel panic or power loss. We are currently working on very "busy", concurrent software like a film/game production tracker and pipeline manager, and are now also preparing R&D for private hospital management services.

[1]: https://mesos.apache.org/

HeavyRain266

Before you ask, we're not going to ever fully adopt Elixir (or Gleam), as its ecosystem is built around the Phoenix framework and external services/databases. We would have to maintain internal bindings/implementations of things that are unmaintained on Elixir's side. Also worth mentioning that it has a large amount of syntax sugar and its users have that weird fetish for abstracting stuff into DSL interfaces.

pdimitar

Couldn't understand your comment well but I am making a SQLite library for Elixir (via Rust bindings) so that would be one less dependency on external systems. I happen to believe that most projects don't need a full-blown database server.

HeavyRain266

All the people from the Elixir community I met kept telling me "Mnesia sucks, use Postgres instead", through the Ecto DSL of course. The same goes for pushing towards Redis and gRPC. Most of them will try to convince you to start using Phoenix instead... Also, there are few to no references on how to use e.g. Cowboy or Bandit without the Plug DSL.

LtdJorge

To me the most important aspect of Erlang is the runtime's scheduler, which is preemptive instead of cooperative. This allows the message passing, sequential code and lightweight processes to be much more effective than in any other general language or framework using cooperative scheduling (like async runtimes or coroutines in Rust, .Net, Kotlin, Lua).

You can write actually synchronous code in Erlang and the runtime makes it so that no process blocks any other process by preempting them on a schedule.

assbuttbuttass

Sounds a lot like Go
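On the Go comparison: as far as I know, Go's runtime has been able to preempt goroutines stuck in tight loops since Go 1.14, so a CPU-bound goroutine does not starve the others even on a single scheduler thread. The sketch below tries to show that behaviour rather than prove it. `countTicksWhileBusy` is a hypothetical helper that restricts the program to one scheduler thread, spins in one goroutine, and counts how often a ticker-driven goroutine still gets to run. It assumes `busyMillis` is at least 1, and the exact count will vary by machine.

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// countTicksWhileBusy limits Go to one scheduler thread, burns CPU in one
// goroutine for busyMillis milliseconds, and counts how many ticker events
// the calling goroutine manages to handle in the meantime.
func countTicksWhileBusy(busyMillis int) int {
	prev := runtime.GOMAXPROCS(1)
	defer runtime.GOMAXPROCS(prev)

	busyFor := time.Duration(busyMillis) * time.Millisecond
	done := make(chan struct{})
	go func() {
		defer close(done)
		deadline := time.Now().Add(busyFor)
		for time.Now().Before(deadline) {
			// Burn CPU; never block or yield explicitly.
		}
	}()

	ticker := time.NewTicker(busyFor / 20)
	defer ticker.Stop()
	ticks := 0
	for {
		select {
		case <-ticker.C:
			ticks++
		case <-done:
			return ticks
		}
	}
}

// describe formats the observation for printing.
func describe(ticks int) string {
	return fmt.Sprintf("ticks handled while another goroutine was spinning: %d", ticks)
}

func main() {
	fmt.Println(describe(countTicksWhileBusy(200)))
}
```

A non-zero count means the spinning goroutine was interrupted so the other one could run. How closely Go's time slices compare to BEAM's reduction-based scheduling is a separate question this sketch does not address.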

jerf

"This begs the question: why aren’t language and library designers stealing the structure behind Erlang’s behaviours, rather than copying the ideas of lightweight processes and message passing?"

Because the function signatures of Erlang's behaviors are critically tied to Erlang's other functionality, specifically its unusual use of immutability. You need a separate init call for its servers because of that, and a very distinct use of the state management to work exactly the same way.

But to achieve the same goals in other languages, you almost always shouldn't directly copy what Erlang is doing. In fact when I see "Look! I ported gen_server into $SOME_OTHER_LANGUAGE" and I see exactly and precisely the exact interface Erlang has, I know that the port doesn't deeply understand what Erlang is doing.

When I ported the idea of supervisor trees into Go [1], I did so idiomatically. It turns out in modern Go the correct interface for "a thing that can be supervised" is not precisely the same signature that Erlang has, but:

    type Service interface {
        Serve(context.Context)
    }

That's all you need and all you should use... in Go. Your other language may vary. Go doesn't need a "handle_event/2" because it has channels, and you should use those, not because they are "better" or "worse" but because that's what this language does. In another language you may use something else. In another infrastructure you may end up sending things over Kafka or some cloud event bus rather than "calling a handle_event/2". The key is in building an event-based system, not copying the exact implementation Erlang has.

A peculiar issue the Erlang community has is getting excessively convinced that there's something super-mega-special about the exact way Erlang does it, and that if you do it any other way it is ipso facto wrong and therefore not reliable. This may have been true in 2005; it is not true in 2025. Where once Erlang had almost the only sensible answer, in 2025 the problem is poking through the ocean of answers deluging us! While I recommend learning from Erlang about reliable software, I strongly recommend against just blind-porting out the exact way Erlang achieves it into any other language. It is in almost any other language context the wrong answer. Even other immutable languages generally vary enough that they can't just copy the same structure.

[1]: https://jerf.org/iri/post/2930/

asa400

To follow on from your excellent post, I think a reasonable next question is, "why have these kinds of approaches and ideas in other languages and systems succeeded in gaining market adoption, but Erlang/Elixir has not?"

This to me is the most interesting question about Erlang, and I say this as someone who works professionally in Elixir.

It's _clear_ that there is incredible appetite for tools that help us design reliable concurrent systems given the wild success of things like k8s, Kafka, AWS's distributed systems products, etc., but why hasn't Erlang/Elixir been able to capture that share?

My friends and I debate this all the time, but I don't know the answer.

hackingonempty

Talk to some engineering managers. Their concerns are hiring people to get the job done. You can't easily hire devs for obscure languages like Erlang and Elixir. If you can find any that are looking for a gig they want too much money. By contrast, if you are hiring for C++/C#/Java/JS/TS your problem is separating good from bad candidates but good ones are available.

Likewise, most devs don't want to learn an obscure language for one job even if they are more than capable. Either they get stuck doing that language or they earn a hole in their resume instead of additional experience in what future employers care about.

Finally, the vast majority of applications and systems don't need ultra high reliability and don't have the budget for it. It isn't clear that downtime impedes success for anything but the most critical businesses.

pdimitar

Well I'm a senior looking for an Elixir job. What constitutes "wanting too much money" btw? Really curious.

int_19h

> Likewise, most devs don't want to learn an obscure language for one job even if they are more than capable. Either they get stuck doing that language or they earn a hole in their resume instead of additional experience in what future employers care about.

This is less of an issue with accumulated experience. Personally I would actually welcome the kind of job that would involve learning a new niche language, since I already have >10 years of experience in several mainstream languages, and there's diminishing returns wrt resumes and interviews past this point.

zelphirkalt

I mean, if you learned Erlang on the job to build reliable systems with it, you don't have to put an "obscure" language on your resume. You can put "highly fault tolerant systems" on your resume, and when asked about it in an interview, you've got the chops to back that claim up, while many other people don't. It is very far from a "hole" in one's CV. Any engineer worth their salt in a hiring process will recognize this. It is a matter of learning new things, instead of repeating the same experience of some NodeJS or Java CRUD over and over again. If I was in hiring work, and I met someone with that kind of Erlang experience, I would hope I can hire them, and that they will not be too expensive for me. I would set them to work on the interaction between the system parts and let them work on reliability and latency stuff.

It is a matter of someone having the same 2 years of experience over and over again, or someone learning many things. Personally I would welcome a chance to learn more Erlang on the job and build something with it.

Unfortunately, businesses want the fresh graduate with 10y of work experience, who already knows their complete stack. Maybe not so much in the Erlang world, but in general. Learning on the job?? Pah! You already ought to know! Just another reason to pay less!

And Erlang jobs are rare. I am between jobs, so if someone happens to know a remote job, where I could start working and learn more Erlang (have only looked at the beginning of "Learn You Some Erlang for Great Good!"), please let me know. I would be happy to have that "hole" as part of my CV :D

ngrilly

I think that’s mostly due to Erlang looking too alien compared to mainstream languages. Elixir is changing that but it arrived a bit late.

jerf

"but why hasn't Erlang/Elixir been able to capture that share?"

Because Erlang has a well-integrated collection of what are by 2025 standards mediocre tools.

There is value to that integration, and I absolutely won't deny that.

However, the state of the art has moved beyond Erlang in a number of ways, and you're taking a pretty big penalty to stick to BEAM on a number of fronts now. Its performance is sub-par, and if you're running a large cluster, that's actually going to matter. Erlang qua Erlang I'd call a subpar language, and Elixir qua Elixir is merely competitive; there are many places to get similar capabilities, with a wide variety of other available cost/benefit choices. Erlang's message bus is not terribly resilient itself; modern message busses can be resilient against individual nodes in the message bus going down, and it's a powerful pattern to have multiple consumers against a single queue, which Erlang's focus on PIDs tends to inhibit. Erlang's message bus is 0-or-1 when as near as I can tell the rest of the world has decided, correctly IMHO, that 1-or-n is superior. Erlang is fairly insular; once you have to hook up one non-BEAM service to the system, well, you're going to do that over some sort of message bus or something, and you pretty quickly get to the point that you might as well let that be your core architecture rather than the BEAM cluster. Once you're heterogeneous, and BEAM is just another node on the net, there isn't necessarily a lot of reason to stay there. And as a system scales up, the pull to heterogeneity approaches infinity; takes a lot of work to go to an entire company and force them to work entirely in BEAM.

Plus, some of the problems Erlang solved in one way have developed better solutions. Erlang solves the problem of multiple code bases possibly simultaneously existing in the same cluster by basically making everything untyped. That was a nifty solution for the 1990s, but today I think we've gotten a lot better at having typed data structures that still retain backwards compatibility if necessary. So throwing away the entire type system, including all the methods and inheritance or composition or whatever, to solve that problem is a heck of a blow.

I do want to close out with a repetition of the fact that there is value in that solid integration. More people today are aware of the various tools like "message busses", but it is still clearly not as common knowledge as I'd like and I still see entire teams struggling along basically crafting an ad-hoc half-specified custom message bus every so often, which in 2025 is insane. (I have written a couple of services where I have basically had to provide HTTP "REST" endpoints that end up just being proxies on to my internal message bus that my system is really based on, because they'd rather POST HTTP than have to use a message bus library, even though it doesn't really buy them anything.) Erlang does help educate people about what are now the basics of cloud architecture. And that "well-integrated collection of mediocre tools" can still solve a lot of problems. Many sins can be forgiven by 32 4GHz cores backed by high powered RAM, disk, and networking.

But it would take a lot of backwards-incompatible changes to create a BEAM 2.0 that would be competitive on all fronts... if indeed such a thing is even possible. The variety of techs exist for a reason. It stinks to have to paw through them sometimes, but the upside is you'll often find the exact right solution for your needs.

Thaxll

"It's _clear_ that there is incredible appetite for tools that help us design reliable concurrent systems given the wild success of things like k8s, Kafka, AWS's distributed systems products, etc., but why hasn't Erlang/Elixir been able to capture that share?"

Because Erlang is a runtime + language and Kubernetes is a neutral platform. You can build concurrent and reliable solutions without locking yourself into a single language.

Someone can start by just porting their Python code to Kubernetes to make it more reliable and fault tolerant.

bad_haircut72

Decision making in computing is mostly a matter of fashion / cargo-culting.

matthew16550

And Resume padding

klabb3

Go is my favorite language but:

> Go doesn't need a "handle_event/2" because it has channels, and you should use those

Of what type? But most importantly, channels are local to the process, so you need glue to make it networked. (I assume erlang has networked message handling abstracted away). In addition I’ve seen 3-4 different variations of your proposed pattern for long-running server-like things.

I agree fully that porting should make use of idiomatic constructs. But I also think languages can have hidden mechanics that lose the valuable essence while porting – a form of anti-relativism of PLs if you will.

It’s entirely possible to me that this “oh a channel? just wrap it in X” is much more detrimental to interop than what it sounds like. For instance take http.Handler in Go. Similarly simple but what are the real world implications of having it in std? An ecosystem of middleware that is largely compatible with one another, without pre-coordination (a non-std http server X can be used with auth middleware Y and logging middleware Z). Similar things can be said about io.Reader and friends. These extremely simple interfaces are arguably more valuable than the implementations.

If, and I’m speculating here, Erlang got many of the interfaces for reliable distributed systems right, that can be what enables the whole.

jerf

"Of what type?"

Of the type of the messages you're sending. Which can either be an interface for multiple messages, or you can use multiple channels with one type each. I've done both. This is not an important question when actually programming in Go.
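To make the two options concrete, here is a small sketch of both. The `Message` interface, the `Deposit`/`Withdraw` types, and the function names are made up for illustration; nothing here is a standard pattern name or a library API.

```go
package main

import "fmt"

// Option 1: one channel whose element type is an interface that every
// message the receiver understands implements.
type Message interface{ isMessage() }

type Deposit struct{ Amount int }
type Withdraw struct{ Amount int }

func (Deposit) isMessage()  {}
func (Withdraw) isMessage() {}

// runAccount consumes messages until the inbox is closed and returns the balance.
func runAccount(inbox <-chan Message) int {
	balance := 0
	for m := range inbox {
		switch m := m.(type) {
		case Deposit:
			balance += m.Amount
		case Withdraw:
			balance -= m.Amount
		}
	}
	return balance
}

// Option 2: one channel per message type, multiplexed with select.
func runAccountTyped(deposits, withdrawals <-chan int) int {
	balance := 0
	for deposits != nil || withdrawals != nil {
		select {
		case n, ok := <-deposits:
			if !ok {
				deposits = nil // a nil channel is never chosen by select again
				continue
			}
			balance += n
		case n, ok := <-withdrawals:
			if !ok {
				withdrawals = nil
				continue
			}
			balance -= n
		}
	}
	return balance
}

// demoInterface feeds three messages through a single interface-typed channel.
func demoInterface() int {
	inbox := make(chan Message, 3)
	inbox <- Deposit{Amount: 100}
	inbox <- Withdraw{Amount: 30}
	inbox <- Deposit{Amount: 5}
	close(inbox)
	return runAccount(inbox)
}

// demoTyped feeds the same events through two separately typed channels.
func demoTyped() int {
	deposits := make(chan int, 2)
	withdrawals := make(chan int, 1)
	deposits <- 100
	deposits <- 5
	withdrawals <- 30
	close(deposits)
	close(withdrawals)
	return runAccountTyped(deposits, withdrawals)
}

func main() {
	fmt.Println(demoInterface(), demoTyped())
}
```

The single-channel version preserves the order in which messages were sent; the multi-channel version gives up ordering between channels in exchange for static types on each one.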

"But most importantly, channels are local to the process, so you need glue to make it networked."

This is an important consideration if you are using Go. Although I would observe that it isn't so much that "channels don't do network" as that "channels are a local tool"; e.g., we do not complain that OS mutexes are not "network capable", because they're intrinsically local. Network locking uses different solutions, and we don't really consider etcd a competitor to a local "lock" call.

But there are dozens of message busses in the world now, and Erlang's isn't really all that competitive modulo its integration.

throwawaymaths

I don't think behaviours are all that interesting; after all other programming languages have them.

Rather, what is interesting about the BEAM is that throwing an error is so graceful that it's not such a sin to just throw an error. In other words, a component that CAN error or get into a weird state can be shoved into a behaviour that CANNOT. And by default you are safe from certain operational errors becoming logic or business errors.

For example. You might have a defined "get" interface that doesn't return an error -- let's say it starts as an in-memory K/V store and it returns an optional(value), which is NULL in the case that the key didn't exist.

But suppose you want to have two datastores that the same interface targets, so you might abstract that to a filesystem, and you could have a permission error. And returning "NULL" is not actually "correct". You should throw, because that bubbles up the error to ops teams instead of swallowing it whole. A panic in this case is probably fine.

What if now you're going over a filesystem that's over the network, and the line to the datacenter was backhoe'd and there was a 10 millisecond failover by your SDN -- returning "NULL" is really not correct, because consumers of your getter are liable to have a bad time managing real consistency business cases that could cost $$$. And in this case a panic is not necessarily great, because you bring down everything over a minor hiccup.

The other power with throwing errors + behaviors is that it makes trapping errors with contextual information reporting (e.g. a user-bound 500 error with stack trace information sent somewhere where ops can take a gander) really easy and generically composable, that's not so for error monads or panics.
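A loose sketch of that "get" example, written in Go only because that is the language already used in this thread. Go's panic/recover inside a goroutine is a much weaker analogue of a crashing Erlang process (there is no memory isolation and no supervisor), and `Store`, `MemStore`, `FlakyStore`, `Result`, and `handle` are all hypothetical names.

```go
package main

import "fmt"

// Store is the hypothetical "get" interface: a missing key is ("", false),
// and there is deliberately no error in the signature.
type Store interface {
	Get(key string) (string, bool)
}

// MemStore is the in-memory version, which really cannot fail.
type MemStore map[string]string

func (m MemStore) Get(key string) (string, bool) {
	v, ok := m[key]
	return v, ok
}

// FlakyStore stands in for a filesystem- or network-backed store.
// It crashes instead of pretending the key is absent.
type FlakyStore struct {
	Data MemStore
	Down bool
}

func (f FlakyStore) Get(key string) (string, bool) {
	if f.Down {
		panic(fmt.Sprintf("backend unavailable while reading %q", key))
	}
	return f.Data.Get(key)
}

// Result is what the request boundary reports to the caller and to ops.
type Result struct {
	Value string
	Found bool
	Crash string // non-empty when the handler crashed
}

// handle runs one lookup in its own goroutine and traps a crash at the
// boundary, so one failed request does not take the program down.
func handle(s Store, key string) Result {
	out := make(chan Result, 1)
	go func() {
		defer func() {
			if r := recover(); r != nil {
				out <- Result{Crash: fmt.Sprint(r)}
			}
		}()
		v, ok := s.Get(key)
		out <- Result{Value: v, Found: ok}
	}()
	return <-out
}

func main() {
	mem := MemStore{"a": "1"}
	fmt.Printf("%+v\n", handle(mem, "a"))
	fmt.Printf("%+v\n", handle(mem, "missing"))
	fmt.Printf("%+v\n", handle(FlakyStore{Data: mem, Down: true}, "a"))
}
```

The point of the sketch is that "key not found" and "backend broke" stay distinguishable without widening the `Get` signature: the first is an ordinary return value, the second is reported once, generically, at the boundary.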

Anyways it was always strange to me that erlang-inspired actor system programming languages came out that obsessed over "never having errors" as a principle (like ponylang) because that's throwing out a big part of erlang.

layer8

From this article and others, it’s still unclear to me what the state-handling and state-sharing model of Erlang is. Presumably, the granularity of the crashing/restarting sequential processes is also the granularity of in-memory state sharing. But what about external state, like databases, queues, file systems? For example, if a process has taken an item off a queue and then crashes before having fully processed it, how is that accounted for? Or you might not even know from the outside if it has been fully, partially, or not at all processed yet. This is an example where correct error handling or not crashing is crucial, in my experience. Or what about processing pipelines where a component in the middle crashes. Is there something like that in Erlang? Is there an article explaining Erlang from that perspective?

fidotron

> For example, if a process has taken an item off a queue and then crashes before having fully processed it, how is that accounted for?

I have worked with people who had deployed huge amounts on the BEAM, had a real problem with the answer to that, and resorted to magical thinking.

When erlang processes "crash", assuming the whole system didn't crash, they almost certainly alerted a monitoring process of the fact, so that a process can be quickly restarted. This is the core of how supervision trees in erlang are built.

There are a lot of subtleties to that. The whole system may or may not be a single BEAM instance, and if more than one then they can be distributed, i.e. processes on one machine receive failure messages from processes on others, and can restart the processes elsewhere. These mechanisms on a practical basis are sufficient to automatically pick up the majority of transient failures. (I should add there are two classic ways to blow up a BEAM instance which make this less good than it should be: a bad call into C code through a NIF (native implemented function), or posting messages to a process faster than it can consume them, which will eventually cause an OOM).

But this differs from the underlying philosophy of the runtime, which is that things are only done when they're done, and you should expect failures at any time. This maps on to their messaging paradigm.

What you actually sound like you want is a universe more like FoundationDB and QuiCK https://www.foundationdb.org/files/QuiCK.pdf where the DB and worker queue all live in one single transactional space, which certainly makes reasoning about a lot of these things easier, but has nothing to do with erlang.

sshine

> what about [...] if a process has taken an item off a queue and then crashes before having fully processed it

> you might not even know from the outside if it has been fully, partially, or not at all processed yet

Erlang does not propose a unique solution to distributed problems, just good primitives.

So the answer would be the same; you'd keep track in the queue if the element was partially popped, but not completed, and you report back to the queue that the processing failed and that the element should be fully put back.

So in Erlang you might monitor a worker process and requeue items handled by processes that failed.
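The same monitor-and-requeue idea, sketched in Go rather than Erlang to match the other code in this thread. The `exit` value plays the role of the notification a monitoring process would receive; `runWorker`, `drain`, and `demo` are hypothetical names, and the queue is just a slice.

```go
package main

import "fmt"

// exit is what the monitoring side receives when a worker stops,
// whether it finished normally or crashed.
type exit struct {
	item    string
	crashed bool
}

// runWorker processes one item in its own goroutine and always reports
// how it ended, even if the work function panics.
func runWorker(item string, work func(string), exits chan<- exit) {
	go func() {
		crashed := true
		defer func() {
			recover() // stop the panic here; the outcome is reported below
			exits <- exit{item: item, crashed: crashed}
		}()
		work(item)
		crashed = false
	}()
}

// drain hands items to workers one at a time and puts an item back on
// the queue when its worker crashed. It returns the completion order
// and the number of requeues.
func drain(queue []string, work func(string)) (done []string, requeues int) {
	exits := make(chan exit)
	for len(queue) > 0 {
		item := queue[0]
		queue = queue[1:]
		runWorker(item, work, exits)
		e := <-exits
		if e.crashed {
			queue = append(queue, e.item) // not completed: back on the queue
			requeues++
			continue
		}
		done = append(done, e.item)
	}
	return done, requeues
}

// demo uses a work function that crashes the first time it sees "b".
func demo() ([]string, int) {
	attempts := map[string]int{}
	work := func(item string) {
		attempts[item]++
		if item == "b" && attempts[item] == 1 {
			panic("transient failure")
		}
	}
	return drain([]string{"a", "b", "c"}, work)
}

func main() {
	done, requeues := demo()
	fmt.Println(done, requeues)
}
```

This gives at-least-once processing, so the work has to tolerate being partially repeated. An item that crashes every time would loop forever in this sketch; a real system would cap retries or park such items somewhere for inspection.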

layer8

Thanks. So Erlang is really only about managing process lifetimes and simple RPC? In my experience processes often have meaningful internal state, meaningful in the sense that it matters if it gets lost due to a crash. If I understand correctly, Erlang doesn’t provide any particular model or mechanisms for dealing with that?

toast0

Like fidotron said, a process's internal state is lost if it crashes (or exits).

If you want that state to be durable, you need to store it durably. Mnesia provides (optional) distributed transactions which may be appropriate for durability needs (lots of details). Or you could externalize durability to other systems.

Erlang is wonderful, but it's not magic. It won't prevent hardware failures, so if an Erlang process fetches something from a queue and the cpu stops for whatever reason, you've got a tricky situation. Erlang does offer a way for a process to monitor other processes, including processes on remote nodes, so your process will be notified if the other process crashes or if the other node is disconnected; but if the other node is disconnected, you don't know what happened to the other process --- maybe it's still running and there's a connectivity issue, maybe the whole host OS crashed. You could perhaps set bidirectional monitors, and then know that the remote process would be notified of the disconnection as well, if it still was running... but you wouldn't know if the process finished (successfully or not) after the connectivity failed but before the failure was detected and processed.

fidotron

> In my experience processes often have meaningful internal state, meaningful in the sense that it matters if it gets lost due to a crash.

The erlang process state will be simply what it has on the stack. (Ignoring things like ETS tables for the moment).

Erlang has the concept of ports, used to interface to the world outside, that provide a sort of hook for cleanup in the event of a crash. Ports belong to processes, in the event of a crash all associated ports are cleaned up. You can also set this sort of thing up between purely erlang processes as well.

As the other commenter observed, erlang gives you the primitives to make distributed systems work; it does not prescribe solutions, especially around distributed transactions, which imo is one of the reasons some of the hype around the BEAM is misguided.

procaryote

Erlang at least used to come with an in-memory database called Mnesia, that in the places I've encountered it depended on replicating all the state to every server, which usually caused some scaling issues.

There's nothing outright stopping you from doing proper design and building separate erlang services that exchange state with regular protocols, but there does seem to be a temptation to just put all erlang in one big monolith and then run into very hard memory and scaling issues when usage and data grows.

One high profile erlang user in the payment industry was mainly constrained by how big a server they could buy, as all their code ran on a single server with a hot standby. They have since moved to java, and rethought how they managed shared state.

Facebook managed to get ejabberd, the xmpp server written in erlang, to back their first Messenger, but it involved sharding to give each ejabberd-instance a small enough data set to cope, and a clever way to replicate presence data outside of erlang (storing it in compact memory blocks on each ejabberd server, and shipping them wholesale to a presence service at a regular cadence).

Pretty soon they tore ejabberd out, metaphorically burned it in a field and salted the earth... but how much of that was the fault of erlang itself, and how much it was the issue of having one corner with erlang in a largely C++ world isn't known to me.

ramchip

> For example, if a process has taken an item off a queue and then crashes before having fully processed it, how is that accounted for?

I'm not sure I understand the question - all queue systems I've used separate delivery and acknowledgement, so if a process crashes during processing the messages will be redelivered once it restarts.
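For readers who haven't used such a queue, here is a toy in-memory sketch of the delivery/acknowledgement split. `Queue`, `Receive`, `Ack`, and `Redeliver` are hypothetical names, not the API of any real broker, and a real system would trigger redelivery from a timeout or a dropped connection rather than an explicit call.

```go
package main

import "fmt"

// Queue is a hypothetical in-memory queue that separates delivery from
// acknowledgement: a received item stays "in flight" until it is acked.
type Queue struct {
	ready    []string
	inflight map[int]string
	nextID   int
}

// NewQueue creates a queue preloaded with the given items.
func NewQueue(items ...string) *Queue {
	return &Queue{ready: items, inflight: map[int]string{}}
}

// Receive hands out the next item together with a delivery ID and
// remembers it as in flight. ok is false when nothing is ready.
func (q *Queue) Receive() (id int, item string, ok bool) {
	if len(q.ready) == 0 {
		return 0, "", false
	}
	item, q.ready = q.ready[0], q.ready[1:]
	id = q.nextID
	q.nextID++
	q.inflight[id] = item
	return id, item, true
}

// Ack marks a delivery as fully processed so it is never handed out again.
func (q *Queue) Ack(id int) {
	delete(q.inflight, id)
}

// Redeliver moves every unacknowledged item back to the ready list,
// oldest delivery first, and returns how many items were moved.
func (q *Queue) Redeliver() int {
	n := 0
	for id := 0; id < q.nextID; id++ {
		if item, ok := q.inflight[id]; ok {
			q.ready = append(q.ready, item)
			delete(q.inflight, id)
			n++
		}
	}
	return n
}

// demo processes "x" fully, then "crashes" after receiving "y".
func demo() (redelivered int, second string) {
	q := NewQueue("x", "y")

	id, _, _ := q.Receive()
	q.Ack(id) // "x" was fully processed

	q.Receive() // "y" is delivered, but the consumer dies before acking

	redelivered = q.Redeliver()
	_, second, _ = q.Receive()
	return redelivered, second
}

func main() {
	n, item := demo()
	fmt.Println(n, item)
}
```

The queue cannot tell whether the crashed consumer got halfway through "y" or never started, which is exactly the ambiguity raised in the question; redelivery only guarantees the item is not lost, so the consumer still has to make its processing safe to repeat.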

Do you have a concrete example of a flow you're curious about?

Maybe these could help:

- https://ferd.ca/the-zen-of-erlang.html

- https://jlouisramblings.blogspot.com/2010/11/on-erlang-state...