
Systems Correctness Practices at Amazon Web Services

amazingamazing

> Deterministic simulation. Another lightweight method widely used at AWS is deterministic simulation testing, in which a distributed system is executed on a single-threaded simulator with control over all sources of randomness, such as thread scheduling, timing, and message delivery order. Tests are then written for particular failure or success scenarios, such as the failure of a participant at a particular stage in a distributed protocol. The nondeterminism in the system is controlled by the test framework, allowing developers to specify orderings they believe are interesting (such as ones that have caused bugs in the past). The scheduler in the testing framework can also be extended for fuzzing of orderings or exploring all possible orderings to be tested.
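A toy sketch of the quoted idea (my own illustration, not AWS's framework; all names are made up): a single-threaded simulator where message delivery order is the only source of nondeterminism, driven by a seeded RNG so every run with the same seed replays identically.

```python
# Minimal deterministic-simulation sketch: the simulator, not the OS,
# decides which pending message is delivered next. Same seed -> same
# delivery order -> any failure found is reproducible.
import random

def simulate(seed):
    rng = random.Random(seed)
    # Three hypothetical replicas receiving a write.
    inbox = {"a": [], "b": [], "c": []}
    pending = [("a", "write"), ("b", "write"), ("c", "write")]
    trace = []
    while pending:
        # The test framework controls this choice; swapping in an
        # exhaustive enumerator here would explore all orderings.
        i = rng.randrange(len(pending))
        node, msg = pending.pop(i)
        inbox[node].append(msg)
        trace.append((node, msg))
    return trace

# Same seed, same schedule: the run is fully reproducible.
assert simulate(42) == simulate(42)
# Different seeds deliver the same messages in different orders
# (the "fuzzing of orderings" the article mentions).
assert sorted(simulate(1)) == sorted(simulate(2))
```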

Any good open source libraries that do this and are language agnostic? Seems doable - spin up a container with some tools within it. Said tools require some middleware to know when a test is going to be run; when the test runs, the tools basically make certain things (networking, storage, etc.) "deterministic" in the context of the test run.

This is more-or-less what antithesis does, but haven't seen anything open source yet.

You could, of course, write your tests well such that you can stub out I/O, but that's work, and not everyone will write their tests well anyway (you should do this regardless, but it's nicer IMO if this determinism lives a layer above the application).
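A minimal sketch of the "stub out I/O" approach (hypothetical names, purely illustrative): the application takes a clock as an injectable dependency, so a test can substitute a deterministic fake and simulated time advances instantly.

```python
# A fake clock the test controls: sleeping just advances a counter,
# so retry/backoff logic runs deterministically and instantly.
class FakeClock:
    def __init__(self):
        self.t = 0.0
    def now(self):
        return self.t
    def sleep(self, secs):
        self.t += secs  # no real waiting, just simulated time

def retry_with_backoff(op, clock, attempts=3, base_delay=1.0):
    # Illustrative retry loop with exponential backoff.
    for i in range(attempts):
        try:
            return op()
        except IOError:
            clock.sleep(base_delay * 2 ** i)
    raise IOError("all retries failed")

clock = FakeClock()
calls = []
def flaky():
    # Fails twice, then succeeds - a deterministic "transient" fault.
    calls.append(1)
    if len(calls) < 3:
        raise IOError("transient")
    return "ok"

assert retry_with_backoff(flaky, clock) == "ok"
assert clock.t == 3.0  # 1s + 2s of simulated backoff elapsed
```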

As a slight sidebar: I'm not really bullish on AI, but I think testing is one of the areas where AI will hopefully shine, because the feedback loop during prompting can be driven by your actual application requirements, such that the test implementation (driven by AI), the requirements (driven by you as the prompt), and the "world" (driven by the actual code being tested) can hopefully drive all three toward some theoretical ideal. If AI gives us anything, I'm hoping it makes software a more rigorous discipline by making formal verification more doable.

wwilson

There have historically been two giant adoption challenges for DST.

(1) Previously, you had to build your entire system around one of the simulation frameworks (and then not take any dependencies).

(2) It’s way too easy to fool yourself with weak search/input generation, which makes all your tests look green when actually you aren’t testing anything nontrivial.

As you say, Antithesis is trying to solve both of these problems, but they are very challenging.

I don’t know of anybody else who has a reliable way of retrofitting determinism onto arbitrary software. Facebook’s Hermit project tried to do this with a deterministic Linux userspace, but it has been abandoned. (We actually tried the same thing before we wrote our hypervisor, but found it didn’t work well.)

A deterministic computer is a generically useful technology primitive beyond just testing. I’m sure somebody else will create one someday, or we will open-source ours.


nine_k

I suspect you can relatively easily obtain a completely deterministic machine by running QEMU in 100% emulation mode in one thread. But what you are after is controlled deterministic execution, and that's far harder. Making your multiple processes follow a specific dance that triggers an interesting condition must be very involved when seen from a level as low as the CPU and OS scheduler. Hence a language-agnostic setup is hard to achieve, and especially hard to make do your bidding. It may drown you in irrelevant details.

I once built a much, much simpler thing that allowed me to run multiple JVM threads in a particular kind of lockstep, by stubbing and controlling I/O operations and the advance of system time. With that, I could run several asynchronously connected components in particular interaction patterns, including not just different thread activation orders but also I/O failures, retries, etc. It was manageable, and it helped uncover a couple of nasty bugs before the code ever ran in prod.

But that was only possible because I went with drastic simplifications, controlling not the whole system but only particular synchronization points. It won't detect a generic data race where explicit synchronization would be just forgotten.
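A rough Python analogue of that kind of sync-point lockstep (purely illustrative, nothing like the original JVM tool's API): each task is a generator that yields at its synchronization points, and the test picks the exact activation order instead of leaving it to a real scheduler.

```python
# Cooperative tasks that yield at explicit sync points; the test drives
# them in a chosen order to exercise specific interleavings.
def writer(store, log):
    log.append("writer: begin")
    yield                      # sync point: control may switch here
    store["value"] = 42
    log.append("writer: wrote")
    yield

def reader(store, log):
    log.append("reader: begin")
    yield
    log.append(f"reader: saw {store.get('value')}")
    yield

def run(order):
    store, log = {}, []
    tasks = {"w": writer(store, log), "r": reader(store, log)}
    for name in order:
        next(tasks[name], None)  # advance that task to its next sync point
    return log

# Two interaction patterns, chosen deterministically by the test:
assert "reader: saw 42" in run(["w", "w", "r", "r"])   # write completes first
assert "reader: saw None" in run(["r", "w", "r", "w"]) # read races ahead
```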

ajb

I'm sure I heard that something like this existed for the JVM ages ago (like 15 years). I don't remember the details so it might not be quite the same, but a colleague was telling me about some tech which would test your concurrency by automatically selecting bad scheduling orders.

bit_razor

amazingamazing

Trust me, I love FDB, but that's not the same thing. The FDB team, IIRC, had to write their own programming language to do this. It's not an agnostic layer above the application.

The problem with coupled tooling is that no one will use it. That's what is cool about antithesis. If they're able to complete their goal, that's basically what will be achieved.

Lwerewolf

I guess you meant to say "only the people working on the software coupled to the tooling will use it". It's not just FDB & Amazon that are using something like this, and it is a ridiculously powerful type of tool for debugging distributed systems.

bit_razor

Fair point. I was thinking about antithesis, but it's not open source (yet?). Turns out I also didn't read your comment well enough. Back to lurking I go.

crvdgc

https://rr-project.org/ for languages that can be debugged by gdb.

smj-edison

+1 for rr. Bonus feature is you can also time-travel debug! It's spoiled me forever...

john2x

There was a talk from Joe Armstrong about using property testing to test Dropbox.

simonw

S3 remains one of the most amazing pieces of software I've ever seen. That thing a few years ago where they just added strong read-after-write consistency to the whole system? Incredible software engineering. https://aws.amazon.com/blogs/aws/amazon-s3-update-strong-rea...

riknos314

I had the distinct pleasure of working on S3 (Lifecycle) during the timeframe that the index team was working on the rearchitecture that enabled the read-after-write consistency.

I can confidently say that as impressive as S3 is from the outside, it's at least that impressive internally, both in implementation, and organizational structure.


positisop

Google Cloud Storage had it for eons before S3. GCS comes across as a much better thought-out and built product.

simonw

S3 is probably the largest object store in the world. The fact that they can upgrade a system like that to add a feature as complex as read-after-write with no downtime and working across 200+ exabytes of data is really impressive to me.

tossandthrow

I really do respect the engineering efforts.

But object stores are embarrassingly parallel, so if such a migration should be possible anywhere without downtime, it is definitely in object stores.

rossjudson

GCS's metadata layer was originally implemented with Megastore (the precursor to Spanner). That was seamlessly migrated to Spanner (in roughly small-to-large "region" order), as Spanner's scaling ability improved over the years. GCS was responsible for finding (and helping to knock out) quite a few scaling plateaus in Spanner.

SteveNuts

Sure, but whose (compatible) API is GCS using again? Also keep in mind that S3 is creeping up on 20 years old, so baking a change like that in is incredible.

benoau

Not just 20 years old - an almost flawless 20 years at massive scale.

theflyinghorse

Maybe. But Google does have a reputation that makes selecting them for infrastructure a risky endeavor.

throitallaway

From my POV Amazon designs its services from a "trust nothing, prepare for the worst case" perspective. Eventual consistency included. Sometimes that's useful and most of the time it's a PITA.

up2isomorphism

S3 is not a piece of software per se; it is a service.

Also, S3 is not better than GCS or Azure Blob.

simonw

Services are built on software.

S3 is likely an order of magnitude larger than those others - it's had a lot longer to grow.

furkansahin

Amazing article! Using state machines is a must if you are building infrastructure control-planes. Was P a must, though? I am not sure. We have been building infrastructure control-planes for over 13 years now and every iteration we have built with Ruby. It worked wonders for us https://www.ubicloud.com/blog/building-infrastructure-contro...

severusdd

The 92% stat looks really interesting! It's rarely the spectacular crash that knocks a cluster over. Instead, the "harmless" retry leaks state until everything breaks at 2 a.m. on one fateful Friday. Evidently, we should budget more engineering hours for mediocre, silent failures than for outright disasters. That's where the bodies are buried.

smallnix

Or survivorship bias: the major issues that have been addressed don't cause problems because they were addressed, while some of the minor issues that were not addressed randomly do cause major issues.

osigurdson

It is interesting how the industry ended up with things like TDD when it doesn't work for something as simple as a function that adds two numbers together. While not completely useless in some edge cases, its complete lack of any kind of formal underpinnings should have given us a clue. So many bad / unexamined ideas in the Uncle Bob era. Far closer to a religion than anything else (complete with process "rituals" even).

chubot

One thing I wondered about the P language: It seems like in the early days, it was used at Microsoft to generate C code that’s actually used at runtime in the Windows USB stack?

But now it is no longer used to generate production code?

I asked that question here, which I think was the same question as in a talk: https://news.ycombinator.com/item?id=34284557

It seems like if the generated code is used in a kernel, it could also be used in a cloud, which is less resource-constrained

algorithmsRcool

It looks like Coyote[0], which is used in azure, was an evolution of P# which was an evolution of P

[0]https://www.microsoft.com/en-us/research/wp-content/uploads/...

inaseer

+1.

We have used Coyote/P# not just for model checking an abstract design (which no doubt is very useful) but testing real implementations of production services at Microsoft.

k__

How do Coyote and P differ?

chubot

OK, but then not for generating production code?

I thought I read that somewhere, but now I can't find the claim

ctkhn

This sounds interesting, but as someone who hasn't worked at AWS and isn't already familiar with TLA+ or P, I would have liked to see even a hello world example of either of them. Without that, it sounds like a lot of extra pain for things that a good design and testing process should catch anyway. Seeing a basic example in the article itself would give me better insight into what these actually do.

hwayne

This is a quick demo of TLA+ I like: https://gist.github.com/hwayne/39782de71f14dc9addb75f3bec515...

It models N threads non-atomically incrementing a shared counter, with the property "the counter eventually equals the number of threads in the model". When checked in TLA+, it finds a race condition where one thread overwrites another's value. I've written implementations of the buggy design, and on my computer they race on less than 0.1% of executions, so testing for it directly would be very hard.

Most TLA+ specs are for significantly more complex systems than this, but this is a good demo because the error is relatively simple.
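For a system this tiny, the same buggy design can also be checked exhaustively in plain Python by brute-forcing every interleaving of two threads' read/write steps (a toy stand-in for what the TLA+ model checker does over the spec's state space):

```python
# Each thread performs a non-atomic increment as two steps:
# first occurrence of its id = read the counter into a local;
# second occurrence = write back its (possibly stale) local + 1.
from itertools import permutations

def run(schedule):
    counter = 0
    local = {}
    for tid in schedule:
        if tid not in local:
            local[tid] = counter           # read step
        else:
            counter = local.pop(tid) + 1   # write step
    return counter

# Every interleaving of two threads' (read, write) pairs.
schedules = set(permutations([0, 0, 1, 1]))
results = {run(s) for s in schedules}
assert 2 in results  # the intended final value is reachable...
assert 1 in results  # ...but so is the lost update: both read 0, both write 1
```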

teeray

The TLA Plus examples repository is very good: https://github.com/tlaplus/Examples . I would recommend starting with something simple like the DieHard problem.

dmd

The entire point of using formal methods is that testing will never, ever catch everything.

nickpsecurity

Whereas formal verification only catches the properties one correctly specifies, in the portions of the program one correctly specifies. In many projects, there's a gap between those and the correctness of the real-world code. Some projects close that gap, but most won't.

UltraSane

Can't you use the formal model to write/generate a lot of unit tests to verify the actual code behaves like the model does?

hamburglar

We should have formal verification of the formal verification specification. Standing on a turtle.

skydhash

Tests cover specific instances of a class of problem and prove that your implementation is correct for those instances. Formal verification proves the whole class.

You can have a function that checks for anagrams, and testing will prove it correct for some pairs of words. But proving it for all words requires formal verification. And that's when you catch some tricky errors due to undefined behavior or library bugs, because you can't prove their semantics.
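As a toy illustration of that gap (hypothetical code, not from the thread): a plausible-looking anagram check can pass hand-picked tests, while an exhaustive check over a tiny alphabet, the closest testing gets to covering the whole class, finds the bug.

```python
from itertools import product

def is_anagram(a, b):
    # Plausible-looking but buggy: sets ignore letter multiplicity.
    return set(a) == set(b)

# A few hand-picked test instances all pass...
assert is_anagram("listen", "silent")
assert not is_anagram("cat", "dog")

# ...but exhaustively comparing against a reference over every word of
# length 1-3 from a two-letter alphabet exposes a counterexample.
def reference(a, b):
    return sorted(a) == sorted(b)

counterexamples = [
    (a_word, b_word)
    for n in range(1, 4)
    for a_word in map("".join, product("ab", repeat=n))
    for b_word in map("".join, product("ab", repeat=n))
    if is_anagram(a_word, b_word) != reference(a_word, b_word)
]
# "aab" and "abb" share letters but aren't anagrams; sets can't tell.
assert ("aab", "abb") in counterexamples
```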

egl2020

I experimented with TLA, and the graphical toolbox didn't seem to work or match the tutorial. Kinda disappointing: I wanted to use TLA, and I'm otherwise a big fan of Lamport's work, from the utilitarian Latex to the intellectually satisfying paper on time, clocks, and distributed systems.

yathaid

>> a good design

good is doing a lot of heavy lifting here. The point of TLA+/Pluscal is to have a proof of the soundness of the design.

EGreg

Wow. I used to correspond with Leslie Lamport years ago (about his Buridan's Principle papers, etc.)

Today I went to his website and discovered a lot about TLA+ and PlusCal. He still maintains it: https://lamport.azurewebsites.net/tla/peterson.html?back-lin...

I must say ... it would make total sense for a guy like that, who brought mathematics to programming and was an OG of concurrent systems, to create a systems design language that's used at AWS and other industrial places that need to serve people.

I wish more people who build distributed systems would use what he made. Proving correctness is crucial in large systems.

lopatin

And just a tip for whoever may be interested: Claude Opus with Extended Thinking seems to be very good at converting existing code into TLA+ specs.

I've found multiple bugs for personal Rust projects like this (A Snake game that allowed a snake to do a 180 degree turn), and have verified some small core C++ components at work with it as well (a queue that has certain properties around locking and liveness).

I tried other models but kept getting issues with syntax and spec logic with anything else besides Opus.

dosnem

I’ve always envisioned tla and other formal methods as specific to distributed systems and never needed to understand it. How is it used for a snake game? Also how is the TLA+ spec determined from the code? Won’t it implicitly model incorrect bugs as correct behaviour since it’s an existing state in the system? Also when using TLA from the start, can it be applied to implementations? Or is it only for catching bugs during design? Therefore I’m assuming implementations still need to match the design exactly or else you would still get subtle bugs? Sorry for all the questions I’ve never actually learned formal methods but have always been interested.

lopatin

Here's how it caught my Snake bug: my snake representation is a vector of key points (head, turns, tail). A snake in a straight line, of length 3, facing right can look like this: [(0,0), (2,0)]. When a snake moves (a single function called "step_forward"), the snake representation is compressed by my code: if the last 2 points are the same, remove the last one. So if this snake changes direction to "left", then the new snake representation would be [(1, 1), (1, 1)] and compressed to [(1, 1)] before exiting step_forward.

Here's how the bug was caught: It should be impossible for the Snake representation to be < 2 points. So I told Opus to model the behavior of my snake, and also to write a TLA+ invariant that the snake length should never be under 2. TLA+ then basically simulates it and finds the exact sequence of steps "turns" that cause that invariant to not hold. In this case it was quite trivial, I never thought to prevent a Snake from making turns that are not 90 degrees.

Jtsummers

It's targeted at distributed systems, but it can be used to model any system over time. I've used it for distributed systems, but also for embedded systems with a peculiar piece of hardware that we suspected (and later confirmed) was misbehaving. I modeled the hardware and its spec in TLA+, then made changes to the behavior description to see if it broke any expected invariants (it did, in precisely the way we saw with the real hardware). The TLA+ model also helped me develop better reproducible test cases for that hardware compared to what we were doing before.

skydhash

I'm not an expert, but my current understanding is that code execution is always a state transition to the next state. So what you do is fully specify each state and the relations between them. How the transition actually happens is your code, and it's not that important. What's important is that the relations do not conflict with each other. It's a supercharged type system.

> Also how is TLA+ spec determined from the code?

You start from the initial state, which is always known (or is fixed). Then you model the invariants for each line.

> Won’t it implicitly model incorrect bugs as correct behaviour since it’s an existing state in the system

Invariants will conflict with each other in this case.

> Also when using TLA from the start, can it be applied to implementations?

Yes, by fully following the specs and handling possible incorrect states that may happen in practice. If your initial state in the TLA+ spec says it only includes natural numbers between 1 and 5, you add assertions in your implementation (or throw exceptions) to check that, since the Int type in many type systems does not fully guarantee that constraint. It's even more work when using a dynamic language.
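That states-and-relations view can be sketched as a tiny explicit-state checker (an illustrative toy, not TLC): enumerate reachable states breadth-first and report a shortest trace to any state that violates an invariant.

```python
from collections import deque

def check(init, next_states, invariant):
    # BFS over the state space; returns a shortest trace to an
    # invariant violation, or None if every reachable state is safe.
    frontier = deque([(init, [init])])
    seen = {init}
    while frontier:
        state, trace = frontier.popleft()
        if not invariant(state):
            return trace
        for s in next_states(state):
            if s not in seen:
                seen.add(s)
                frontier.append((s, trace + [s]))
    return None

# Toy system: two processes with program counters in {"idle", "crit"}
# and no locking at all, so both can enter the critical section.
def next_states(state):
    pc0, pc1 = state
    succs = []
    if pc0 == "idle": succs.append(("crit", pc1))
    if pc1 == "idle": succs.append((pc0, "crit"))
    return succs

mutex = lambda s: not (s[0] == "crit" and s[1] == "crit")
trace = check(("idle", "idle"), next_states, mutex)
assert trace[-1] == ("crit", "crit")  # counterexample trace found
```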

k__

It seems like the new DeepSeek performs at a similar level to Opus 4, at least according to preliminary Aider benchmarks.

skydhash

> Proving correctness is crucial in large systems.

It could be good in smaller, but critical and widely used utilities like SSH and terminals.

oblio

Yeah, basically all the coreutils plus all the common extras (rsync, ssh, etc) could use stuff like this.

rthnbgrredf

It should be feasible to rewrite coreutils like ls, cd and cp in Lean 4 together with Cursor within days. Rsync and ssh are more complex though.

belter

>> Proving correctness is crucial in large systems.

You can't do that...

The model checker says the specification satisfies the properties you wrote within the finite state space you explored...

amw-zero

You can write proofs in TLA+ and many other formalisms. You don’t need to ever use a model checker. The proofs hold for an infinite number of infinite-length executions. We are definitely not limited to finite behaviors.

abeppu

> to more lightweight semi-formal approaches (such as property-based testing, fuzzing, and runtime monitoring)

Ok, I get how property-based testing and fuzzing have a relationship to formal methods (the thing being checked looks like part of a formal specification, and in some sense these are a subset of the checks that a model-checking confirms), but calling runtime monitoring a "semi-formal approach" seems like a real stretch.

mjb

Runtime monitoring with something like PObserve is a semi-formal approach. Not just regular alarming and metrics.

purpleidea

It's very interesting (I applaud this) that one of the main goals seems to be to make it more approachable than TLA+, but then they go and write it in C#, which I consider an incredibly unapproachable community and language.

I'm not trying to draw the ire of the Microsoft fan boys, and there are certainly smart people working on that, but it's just not going to happen for most people.

Had this been in golang, or maybe java, I'm sure many more hands would be digging in! Having said that, I hope this helps bring correctness and validation more into the mainstream. I've been casually following the project for a while now.

My long-term goal is to integrate model validation into https://github.com/purpleidea/mgmt/ so if this is an area of interest to you, please let me know!

ahalbert4

Just curious, has anyone used FIS in their own distributed services? I'm considering using it but don't have any real-world experience handling those kinds of experiments.

Marazan

Would I be right in saying Promela and SPIN are at a higher level than what is being described in the article?

mjb

I (one of the authors) did some distributed systems work with Promela about a decade ago, but it never felt like the right fit for the domain. It's got some cool ideas, and may be worth revisiting at some point.