
Concurrency in Haskell: Fast, Simple, Correct

_jackdk_

From the footnotes:

> It gets weirder: in Haskell, exceptions can be thrown to other threads!

What's really interesting is that, because of purity, you have to have asynchronous exceptions, otherwise you give up a lot of modularity. At least that's what Simon Marlow and Simon Peyton Jones argue in Asynchronous Exceptions in Haskell (2006): https://www.microsoft.com/en-us/research/wp-content/uploads/...

> While the semi-asynchronous approach avoids breaking synchronization abstractions, it is non-modular in that the target code must be written to use the signalling mechanism. Worse still (for us), the semi-asynchronous approach is simply incompatible with a purely-functional language, such as Concurrent Haskell. The problem is that polling a global flag is not a functional operation, yet in a Concurrent Haskell program, most of the time is spent in purely-functional code. On the other hand, since there is absolutely no problem with abandoning a purely-functional computation at any point, asynchronous exceptions are safe in a functional setting. In short, in a functional setting, fully-asynchronous exceptions are both necessary and safe — whereas in an imperative context fully-asynchronous exceptions are not the only solution and are unsafe.

If you can read PLTese, it's really quite a nice paper.

haskell17373

It's maybe interesting to note that the `async` library in use here is very simple and easy to understand. Nearly every function is one or two lines. Likewise `TQueue` is extremely simple (and easy to prove correct) thanks to STM, and also generally has good performance.
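To make that concrete, here is a minimal sketch (not from TFA) of a producer and a consumer sharing a `TQueue`, assuming the `async` and `stm` packages:

```haskell
import Control.Concurrent.Async (concurrently)
import Control.Concurrent.STM
  (TQueue, atomically, newTQueueIO, readTQueue, writeTQueue)

-- A producer thread fills the queue while a consumer drains it;
-- `concurrently` runs both and waits for both results.
main :: IO ()
main = do
  q <- newTQueueIO :: IO (TQueue Int)
  (_, total) <- concurrently
    (mapM_ (atomically . writeTQueue q) [1 .. 100])
    (sum <$> mapM (\_ -> atomically (readTQueue q)) [1 .. 100])
  print total  -- prints 5050
```

`readTQueue` blocks (via STM retry) when the queue is empty, so no explicit synchronisation is needed between the two threads.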

zozbot234

A lot of the complexity here is just hidden in Haskell's runtime, which implements async processing based on green threads, besides other features such as GC. Though to be fair, the software transactional memory (STM) featureset is quite unique to Haskell since it relies on the availability of pure functions to ensure correctness. It's kind of hard to imagine a full equivalent to it in other well-known languages.

juliangamble

Quibble: Both Clojure and Scala have a Software Transactional Memory implementation, and the original Clojure Ant demo showed this.

kreetx

While the async library is great, everything that came before it (forkIO, MVars, etc.) was already simple enough; it's only the exception handling that was missing.
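For anyone who hasn't seen the pre-`async` primitives, a minimal forkIO/MVar sketch:

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)

-- forkIO spawns a lightweight (green) thread; the MVar is both
-- the result slot and the synchronisation point.
main :: IO ()
main = do
  result <- newEmptyMVar
  _ <- forkIO (putMVar result (sum [1 .. 100 :: Int]))
  takeMVar result >>= print  -- prints 5050
```

The missing piece is visible here: if the forked computation throws, `takeMVar` waits forever; `async`'s `withAsync`/`wait` exist precisely to propagate such exceptions.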

internet_points

https://www.oreilly.com/library/view/parallel-and-concurrent... is a great resource for those who want to go deeper into this

cosmic_quanta

The author is thinking of updating the book to a second edition as well. Looking forward to it

ahel

noice

alatriste

I read that book many years ago, but I haven't looked into Haskell for a long time. Is it still relevant today? I imagine many things have changed in 12 years!

valcron1000

The fundamentals are the same, and `async` is as relevant as it was back then. The ecosystem is extremely stable in that regard.

ackfoobar

"Fast" in title. But I don't see any benchmarks, or discussion on how to reason about the performance of STM code.

wodenokoto

I don't know how async is in other languages, but I find Python's async incredibly difficult to use, and I feel somewhat validated by how poor ChatGPT is at it as well.

Is it because it is just a very hard thing, or is it because Python is a synchronous language with async bolted on? (I'm talking from a purely language point of view, not from a Python VM / GIL point of view.)

aeonik

The easiest language I’ve used for async is Clojure—mostly because the language is immutable by default and ~99% of the code is referentially transparent. That doesn’t magically solve async, but it removes an entire class of headaches by nudging you away from shared state and side effects. You don’t need locks if there’s nothing to lock.

Async is hard, no doubt—but some languages are designed to reduce the surface area of what can go wrong. I’ve heard great things about Erlang, Elixir, and BEAM-based languages in general. They treat async not as an add-on, but as a core architectural principle.

Starlevel004

It's because ``asyncio`` is a dogwater library that's barely functional and full of footguns. The ecosystem is about the same quality too.

gpderetta

Indeed, as much as I dislike async in general, asyncio is its own special hell.

throwaway17_17

From the footnotes:

> In bare-metal embedded systems or inside the operating system, it’s not unusual to manually break computation into state machines, driven by interrupts.

Although it's not the topic of TFA, the footnotes acknowledge that this is "a whole different ball game." Does anyone have a good source for this aspect of 'low-level'/OS development? I'm more than capable of chasing down sources from a higher-level introduction or overview, so anything would be helpful. This concept seems like it may just be a more pragmatic description of embedded/OS development than what I've read previously.

thih9

After reading the title I was expecting a "pick two". In my anecdotal experience, Haskell is usually far from simple, but other configurations are possible too.

kookamamie

> 1. Compose the program into several threads of execution, traditionally scheduled and ran by the operating system

The step 0 is missing:

Compose the program into several lanes of execution, traditionally executed via SIMD.

This is a massive piece of performance left on the table on modern computer architectures, by assuming threading is the first manifestation of concurrency.

jayd16

SIMD has been somewhat of a massive failure in this regard. Unlike threads, most languages seem to ignore its existence and abdicate its usage to the sufficiently complex compiler.

I wish there was better author time feedback to the developer on where they're getting such a perf boost. As far as I'm aware there's no popular linting or blue squiggle to guide you in the right direction.

In games it seems like the popular pattern is to rewrite everything entirely in an entity component system framework.

kookamamie

Agreed completely. Most auto-vectorization approaches are hit-or-miss, and you still cannot trivially build fat binaries where the instruction set is selected dynamically.

ISPC comes close, but does come with a learning curve.

SleepyMyroslav

I would say that Highway [1] comes close. Can't say anything about ISPC because in gamedev work it never even came into consideration for multiple platforms.

1. https://google.github.io/highway/en/master/

michalsustr

I'm not familiar with Haskell concurrency. The combination of green threads and large memory allocations due to immutable data structures sounds like it would make it hard to implement a web server handling 10k+ concurrent requests on commodity hardware?

Btw, too bad the author mentions microsecond-guarantee use cases but does not provide a link; that would be interesting reading.

cosmic_quanta

> sounds like it would be hard to implement a web server handling 10k+ concurrent requests on commodity hardware?

In practice, it is not. The canonical Haskell compiler, GHC, is excellent at transforming operations on immutable data, as Haskell programs are written, into efficient mutations, at the runtime level. Also, since web development is quite popular in the Haskell community, lots of people have spent many hours optimizing this precise use-case.

In my experience, the real downside is that compilation times are a bit long -- the compiler is doing a LOT of work after all.

eru

> The canonical Haskell compiler, GHC, is excellent at transforming operations on immutable data, as Haskell programs are written, into efficient mutations, at the runtime level.

Yes, at the level of native machine code and memory cells, there's not that much of a difference between immutability + garbage collection, and higher level source code that mutates. Thanks to GC you are going to overwrite the same memory locations over and over again, too.

whateveracct

Programmers for some reason really don't understand that generational garbage collection provides locality. I am really surprised how often I see C/C++/Rust types not understand this.

_jackdk_

The interaction of laziness and purity means that the memory costs are not always what you think. Purity means that it's a lot safer to share structure between old and new versions of a data structure where an imperative language would have to do defensive copying, and laziness means that you can incrementally amortise the cost of expensive rebalancing operations (Okasaki is the standard reference for this).

stevan

> Warp is a high-performance HTTP server library written in Haskell, a purely functional programming language. Both Yesod, a web application framework, and mighty, an HTTP server, are implemented over Warp. According to our throughput benchmark, mighty provides performance on a par with nginx.

Source: https://aosabook.org/en/posa/warp.html

eru

> [...] large memory allocations due to immutable data structures sounds [...]

Why would there be large memory allocations because of immutable data structures? Btw, you can also use immutable data structure in eg Rust fairly easily. And Haskell also supports mutation and mutable data structures.

However, Haskell can use a lot of memory, but that's more to do with pervasive 'boxing' by default, and perhaps laziness.

nesarkvechnep

No reason. OC probably thinks that immutable data structures are always copied when being operated on.

michalsustr

Yes indeed, unless you use ropes or other specialised structures

whateveracct

It doesn't actually have "large memory allocations" due to immutable data structures. This is a meme that isn't true. Immutable data structures, especially at small scale, do not have huge performance penalties. You don't copy the entire structure over and over...you copy the O(log n) spine.

Haskell's GC is also fast when you are mostly generating garbage, which is inherently true for web server handlers.

butterisgood

Deforestation helps with that

A composition of catamorphic and anamorphic functions can eliminate a lot of the in-between allocations (a hylomorphism)

Basically it looks like you’re building a ton of intermediate structure then consuming it - meaning much of the in-between stuff can be eliminated.

Interesting optimizations and a little mind blowing when you see it.

nesarkvechnep

You obviously haven't run anything on the BEAM (Erlang's VM).

michalsustr

Correct. Erlang also uses green threads?

jlouis

Yes. And immutable data structures.

When data is immutable, it can be freely shared. Changes to the data essentially uses copy-on-write. And it only writes the delta change, since you don't need a deep copy due to immutability. Add that the garbage collectors of Haskell and Erlang are designed to work with a high allocation rate and have 0 cost for dead data, and this is much faster than what people think.

The way you implement a webserver in either Haskell or Erlang is rather trivial. Whenever there's an incoming request, you make a thread to handle it. So you don't have 1 webserver serving 10k requests. You have 10k webservers serving 1 request each. And since they are started from the same core data, they'll share that due to immutability. See also old-style Apache or PHP and fork().
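The "10k webservers serving 1 request each" shape is cheap in practice because threads are green. A rough sketch (using the `async` package; the "request handler" is a hypothetical stand-in):

```haskell
import Control.Concurrent.Async (replicateConcurrently)

-- Ten thousand green threads, each "serving one request".
-- The runtime multiplexes them over a handful of OS threads.
main :: IO ()
main = do
  responses <- replicateConcurrently 10000 handle
  print (length responses)  -- prints 10000
  where
    handle = pure "200 OK"  -- stand-in for real request handling
```

Spawning tens of thousands of these threads is routine; each costs on the order of a kilobyte of stack to start.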

lemper

nah bro, warp is quite performant. think there were some consultancies that wrote Haskell web apps for their clients.

FuckButtons

I thought it was a bit odd that the author claims there are no mutexes in sight; the TVar is effectively a mutex guard, unless I'm misunderstanding this? (I've written exactly 0 lines of Haskell.) Or is the claim that the lack of ceremony and accidental complexity around threading is the real win for concurrency here?

chongli

No, a TVar is not a mutex guard. A TVar is a software transactional memory (STM) variable. STM works just like a database: you batch together a sequence of operations into a transaction and then execute them. During execution of a transaction, all changes made to the contents of the TVar are stored in a transaction log. If some other transaction occurs during the execution then the whole thing is aborted and re-run.

This can take any ordinary Haskell data structure and give you a lock-free concurrent data structure with easy-to-use transactional semantics. How it performs is another matter! That depends on the amount of contention and the cost of re-playing transactions.
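A minimal illustration of that transactional style (a sketch, not from TFA): moving funds between two `TVar`s, where both writes commit together or not at all.

```haskell
import Control.Concurrent.STM

-- Atomically move funds between two accounts. Either both TVars
-- are updated or neither is; conflicting concurrent transactions
-- are re-run automatically by the runtime.
transfer :: TVar Int -> TVar Int -> Int -> STM ()
transfer from to amount = do
  balance <- readTVar from
  check (balance >= amount)      -- block (retry) until funds suffice
  writeTVar from (balance - amount)
  modifyTVar' to (+ amount)

main :: IO ()
main = do
  a <- newTVarIO 100
  b <- newTVarIO 0
  atomically (transfer a b 30)
  (,) <$> readTVarIO a <*> readTVarIO b >>= print  -- prints (70,30)
```

`check` is the database-like part: instead of returning an error, the transaction parks until another commit changes one of the variables it read.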

whateveracct

https://hackage.haskell.org/package/stm-containers

This library is full of STM-oriented data structures. They perform better than a simple `TVar (Map k v)`.

It's kind of a fun trick actually. The stock Map is just a tree. The STM Map is also a tree [1] but with TVars at each node. So this helps a lot with contention - you only contend along a "spine" instead of across the whole tree, which is O(log n).

[1] Technically a HAMT a la unordered-containers - trie, tree, you get the idea :)

quibono

> How it performs is another matter!

I know you say it depends on how much contention one sees but I'm interested in the performance hit. Also, is STM the "standard" (or accepted) way to do async in Haskell?

dsign

You are correct, Haskell has quite a few mutex-like types. MVar is one of them.

However, if memory serves me right, TVar is a building block for the transactional memory subsystem. The guard on TVar with, say, modifyTVar is not really stopping execution at entrance but simply indicating that the block modifies the variable. In my mental model, some magic happens in an STM block that checks if two concurrent STM blocks acted upon the same data at the same time, and if so, it reverts the computations of one of the blocks and repeats them with new data.

To my knowledge, Haskell is the only programming language (+runtime) that has a working transactional memory subsystem. It has been in the language for about 20 years, and in that time many have tried (and failed) to also implement STM.

whateveracct

I think Clojure has some kind of STM too?

Haskell's STM is pretty world-class though. That's fair to say :)

dwohnitmok

Clojure's STM never really took off because, for various reasons, it's not as easy to compose as Haskell's (where you can build up a big library of STM blocks and piece them together at the very edges of your program). As such Clojure's STM implementation doesn't actually have a great reputation within the Clojure ecosystem where it isn't usually used in most production codebases (whereas in Haskell STM is often one of the first tools used in any production codebase with concurrency).

Basically it's the difference between focusing only on transactional variables without having a good way of marking what is and isn't part of a larger transaction and having a higher-order abstraction of an `STM` action that clearly delineates what things are transactions and what aren't.

dionian

mrkeen

> Implication of Using STM Running I/O Inside STM— There is a strict boundary between the STM world and the ZIO world. This boundary propagates even deeper because we are not allowed to execute arbitrary effects in the STM universe. Performing side effects and I/O operations inside a transaction is problematic. In the STM the only effect that exists is the STM itself. We cannot print something or launch a missile inside a transaction as it will nondeterministically get printed on every retry that transaction does.

Does Zio actually offer any protection here, or is it just telling the reader that they're on their own and should be wary of footguns?

mrkeen

Mutexes lock code, TVars lock data.

If you lock a section of code (to protect data), there's no guarantee against mutations of that data from other sections of code.

If you lock the data itself, you can freely pass it around and anyone can operate on it concurrently (and reason about it as if it were single-threaded).

It's the same approach as a transactional database, where you share one gigantic bucket of mutable state with many callers, yet no-one has to put acquire/release/synchronise into their SQL statements.

dwohnitmok

No, a TVar isn't a mutex guard. As a sibling comment points out, it gives you transactional semantics similar to most relational databases.

Here's an example in perhaps more familiar pseudocode.

  var x = "y is greater than 0"
  var y = 1
  
  forkAndRun {() =>
    y = y - 1
    if (y <= 0) {
      x = "y is less than or equal to 0"
    }
  }
  
  forkAndRun {() =>
    y = y + 1
    if (y > 0) {
      x = "y is greater than 0"
    }
  }
In the above example, it's perfectly possible, depending on how the forked code blocks interact with each other, to end up with

  x = "y is less than or equal to 0"
  y = 1
because we have no guarantee of atomicity/transactionality in what runs within the `forkAndRun` blocks.

The equivalent of what that Haskell code is doing is replacing `var` with a new keyword `transactional_var` and introducing another keyword `atomically` such that we can do

  transactional_var x = "y is greater than 0"
  transactional_var y = 1
  
  forkAndRun {
    atomically {() =>
      y = y - 1
      if (y <= 0) {
        x = "y is less than or equal to 0"
      }
    }
  }
  
  forkAndRun {
    atomically {() =>
      y = y + 1
      if (y > 0) {
        x = "y is greater than 0"
      }
    }
  }
and never end up with a scenario where `x` and `y` disagree with each other, because all their actions are done atomically together and `x` and `y` are specifically marked so that in an atomic block all changes to the variables either happen together or are all rolled back together (and tried again), just like in a database.

`transactional_var` is the equivalent of a `TVar`, and `atomically` is just Haskell's `atomically`.
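For reference, the second pseudocode snippet maps almost one-to-one onto real Haskell (a sketch using `TVar`, `atomically`, and the `async` package):

```haskell
import Control.Concurrent.Async (concurrently_)
import Control.Concurrent.STM
import Control.Monad (when)

main :: IO ()
main = do
  x <- newTVarIO "y is greater than 0"
  y <- newTVarIO (1 :: Int)
  concurrently_
    (atomically $ do
       modifyTVar' y (subtract 1)
       v <- readTVar y
       when (v <= 0) $ writeTVar x "y is less than or equal to 0")
    (atomically $ do
       modifyTVar' y (+ 1)
       v <- readTVar y
       when (v > 0) $ writeTVar x "y is greater than 0")
  -- STM serialises the two transactions; either order leaves x and y
  -- consistent (here both orders end with y = 1, x = "y is greater than 0").
  (,) <$> readTVarIO x <*> readTVarIO y >>= print
```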

ghusbands

As siblings note, TVar is a transactional variable. However, it's not just protective against concurrent writes but also against concurrent reads of altered variables, so it offers true atomicity across any accessed state in a transaction.

So if you have a thread altering `foo` and checking that `foo+bar` isn't greater than 5 and a thread altering `bar` and checking the same, then it's guaranteed that `foo+bar` does not exceed 5. Whereas if only write conflicts were detected (as is default with most databases) then `foo+bar` could end up greater than 5 through parallel changes.
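A sketch of that invariant in Haskell, using `check` to retry rather than commit a violating state (names are illustrative):

```haskell
import Control.Concurrent.STM

-- A transaction that bumps one variable while reading the other.
-- Because STM validates reads as well as writes, two of these
-- running in parallel can never commit a state with a + b > 5.
bump :: TVar Int -> TVar Int -> STM ()
bump mine other = do
  modifyTVar' mine (+ 1)
  a <- readTVar mine
  b <- readTVar other
  check (a + b <= 5)   -- retry rather than violate the invariant

main :: IO ()
main = do
  foo <- newTVarIO 2
  bar <- newTVarIO 2
  atomically (bump foo bar)        -- foo becomes 3; 3 + 2 <= 5, commits
  (,) <$> readTVarIO foo <*> readTVarIO bar >>= print  -- prints (3,2)
```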

pkjens04

A mutex blocks the thread, which doesn't happen with a TVar. For mutex-like behaviour, a TMVar would be used.

cosmic_quanta

My favourite thing about Haskell concurrency is that there are no colored functions [0]. Writing code in IO, or Async, or the next big thing (asynchronous higher-order effect system of the future??), doesn't require language support like Python or Rust.

The one construct that unlocks this lack of colored functions, STM, did require runtime support (as opposed to language support), which at least is transparent to downstream developers.

[0]: https://journal.stuffwithstuff.com/2015/02/01/what-color-is-...

mrkeen

Coloured functions are a feature, not a bug. Haskell is full of them, and they are exactly what makes STM safe in Haskell but abandonware in other languages that have tried it.

  2. The way you call a function depends on its color.
`<-` or `>>=` vs `=`

  3. You can only call a red function from within another red function.
This should sound pretty familiar! You can only call an IO function from within another IO function. STM in this case makes a third colour:

  IO can call IO functions.
  IO can call STM functions. (*)
  IO can call pure functions.

  STM can call STM functions.
  STM can call pure functions.

  pure functions can call pure functions.
(*) calling into an STM block from IO is what makes it 'happen for real': it's the `atomically` which has type STM a -> IO a.

Having these coloured functions is what made STM achievable back in the mid-late 2000s, since the mechanism to prevent STM or pure functions from calling IO was already in-place.

Other languages either tried to figure out how to contain the side-effects and gave up, or just released STM and put the onus on the user not to use side effects.
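A small sketch of those three colours in code; the commented-out line is the kind of thing the type checker rejects:

```haskell
import Control.Concurrent.STM

-- STM-coloured: composes with other STM actions, cannot do IO.
step :: TVar Int -> STM ()
step v = modifyTVar' v (+ 1)

-- `atomically :: STM a -> IO a` is the only way to cross from
-- the STM colour into the IO colour.
runTwice :: TVar Int -> IO ()
runTwice v = atomically (step v >> step v)

-- This would not compile: putStrLn is IO-coloured, not STM.
-- bad :: TVar Int -> STM ()
-- bad v = putStrLn "launch missiles" >> step v

main :: IO ()
main = do
  v <- newTVarIO 0
  runTwice v
  readTVarIO v >>= print  -- prints 2
```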

UlisesAC4

It is a shame that the person you are replying to is being downvoted. I also understand the importance of colouring functions, but look at the examples that person gave: Python and Rust. In those, calling a coloured function (at least the async-related ones) propagates up to the top of the program; that is a cost we have to internalise, but I would be lying if I told you I wouldn't be happy with such behaviour. I do a lot of js/ts and I would love to be able to just "inline" an await without it propagating recursively up to the top of the program, like can be done in F# with the Async.StartAsTask operation.

grandempire

This is also an advantage of blocking code. It’s just regular code. The async stuff is handled by the operating system.

thuanao

Correct? Anyone who has worked with concurrency in Haskell is probably laughing... :)

Haskell's IO type system doesn't model concurrency at all. `IO a` could be a fork and join, an infinite event loop, really anything; it's a black box in terms of "correctness". Better than using JavaScript, maybe, but hardly "correct" in any formal, tractable sense.