Jepsen: Amazon RDS for PostgreSQL 17.4

152 comments

·April 29, 2025

hliyan

I wish more writing in the software world was done this way:

"Amazon RDS for PostgreSQL is an Amazon Web Services (AWS) service which provides managed instances of the PostgreSQL database. We show that Amazon RDS for PostgreSQL multi-AZ clusters violate Snapshot Isolation, the strongest consistency model supported across all endpoints. Healthy clusters occasionally allow..."

Direct, to-the-point, unembellished and analogous to how other STEM disciplines share findings. There was a time I liked reading cleverly written blog posts that use memes to explain things, but now I long for the plain and simple.

sgarland

A company I was at had an internal blog where anyone could write an article, and others could comment on it. Zero requirement to do so, and it in no way factored into your rating. I think it was the result of a hackathon one year.

Anyway, I really enjoyed it, because I like technical writing. I found that if I wrote a deeply technical post, I’d get very few likes and comments – in fact, I even had a Staff Eng tell me I should more narrowly target the audience (you could tag groups as an intended audience; they’d only see the notification if they went to the blog, so it wasn’t intrusive) because most of engineering had no idea what I was talking about.

Then, I made a post about Kubecost (disclaimer: this was in its very early days, long before being acquired by IBM; I have no idea how it performs now, and this should not dissuade you from trying it if you want to) and how in my tests with it, its recommendations were a poor fit, and would have resulted in either minimal savings, or caused container performance issues. The post was still fairly technical, examining CPU throttling, discussing cgroups, etc. but the key difference was memes. People LOVED it.

I later repeated this experiment with something even more technical; IIRC it involved writing some tiny Python external library in C and accessing it with ctypes, and comparing stack vs. heap allocations. Except, I also included memes. Same result, slightly lessened from the FinOps one, but still far more likes and comments than I would expect for something so dry and utterly inapplicable to most people’s day-to-day job.

Like you, I find this trend upsetting, but I also don’t know how else to avoid it if you’re trying to reach a broader audience. Jensen, of course, is not, and I applaud them for their rigorous approach and pure writing.

jbaiter

It's funny, because I remember the early days of Jepsen, and it relied heavily on memes (the whole name is based on "call me maybe"/carly rae jepsen) and aphyr wasn't (and still isn't) shy about his colorful real life personality :-)

See for example https://aphyr.com/posts/282-call-me-maybe-postgres, which makes heavy uses of memes.

0xCA1EB

My favorite was "UUIDs as Primary Keys" (subheading: "Just Say No"). :)

It doesn't feel full of memes but it does have great illustrations! For example, this visualization of the fragmentation of data for a UUID v4 idb file:

https://cdn.zappy.app/b1b54bb0c780c0f6dd891475589aeee3.png

Mawr

The reason for that outcome was likely two-fold:

1. If your memes were analogies to the dry technical concepts, then the simple, easy to digest analogies were the key here, not the memes themselves.

2. Pictures are worth a thousand words. The more visual you can make your writing the better. Even something as simple as using bullet points instead of dense paragraphs of text works wonders. But the key is to use graphs and illustrations to explain and show concepts wherever possible.

Twirrim

I'm so past wanting to read meme laden blog posts. Especially when all too often it's just stretching a paragraph of content. Security vulnerability stuff is probably the worst at it these days.

cwmma

I was just thinking about how much I miss the old old Jepsen, the same matter of fact and directness but full of memes, see for example, the old redis post https://aphyr.com/posts/283-call-me-maybe-redis.

augustl

Jepsen is awesome, on so many levels!

fuy

isolation levels, that is!

n8m8

Amazon is known for a healthy culture around technical writing that I can attest to. This comment reflect my own thoughts, not of my company. Here's a public article considering it. https://quartr.com/insights/business-philosophy/amazon-s-wri...

luhn

It's not mentioned in the headline and not made super clear in the article: This is specific to multi-AZ clusters, which is a relatively new feature of RDS, and differ from multi-AZ instance that most will be familiar with. (Clear as mud.)

Multi-AZ instances is a long-standing feature of RDS where the primary DB is synchronously replicated to a secondary DB in another AZ. On failure of the primary, RDS fails over to the secondary.

Multi-AZ clusters has two secondaries, and transactions are synchronously replicated to at least one of them. This is more robust than multi-AZ instances if a secondary fails or is degraded. It also allows read-only access to the secondaries.

Multi-AZ clusters no doubt have more "magic" under the hood, as its not a vanilla Postgres feature as far as I'm aware. I imagine this is why it's failing the Jepsen test.

ants_a

Interesting why this magic would be needed. Vanilla Postgres does support quorum commit which can do this. You can also set up the equivalent multi-AZ cluster with Patroni, and (modulo bugs) it does the necessary coordination to make sure to promote primaries in a way that does not lose transactions or makes visible a transaction that is not durable.

There still is a Postgres deficiency that makes something similar to this pattern possible. Non-replicated transactions where the client goes away mid-commit become visible immediately. So in the example, if T1 happens on a partitioned leader, disconnects during commit, T2 also happens on a partitioned node, and T3 and T4 happen later on a new leader, you would also see the same result. However, this does not jive with the statement that fault injection was not done in this test.

Edit: did not notice the post that this pattern can be explained by inconsistent commit order on replica and primary. Kind of embarrassing given I've done a talk proposing how to fix that.

sontek

Link the talk video

ants_a

https://www.youtube.com/watch?v=vz-dhwSpjOw

ashu1461

Have one question

So if snapshot violation is happening inside Multi-AZ instances, it can happen with a single region - multiple read replica kind of setup as well ? But it might be easily observable in Multi-AZ setups because the lag is high ?

luhn

A synchronous replica via WAL shipping is a well-worn feature of Postgres. I’d expect RDS to be using that feature behind the scenes and would be extremely surprised if that has consistency bugs.

Two replicas in a “semi synchronous” configuration, as AWS calls it, is to my knowledge not available in base Postgres. AWS must be using some bespoke replication strategy, which would have different bugs than synchronous replication and is less battle-tested.

But as nobody except AWS knows the implementation details of RDS, this is all idle speculation that doesn’t mean much.

wb14123

This kind of replication can be configured in vanilla Postgres with something like ANY 3 (s1, s2, s3, s4) in synchronous_standby_names? Doc: https://www.postgresql.org/docs/current/runtime-config-repli...

null

[deleted]

x0x0

it's the 2nd sentence in the article:

> We show that Amazon RDS for PostgreSQL multi-AZ clusters violate Snapshot Isolation

you kind of have to expect people to read

evil-olive

I think it's still an important clarification, because for years you've had a choice in RDS (classic RDS, not Aurora) between "single-AZ" and "multi-AZ" instances, with the general rule of thumb that production workloads should always be multi-AZ.

however, "multi-AZ" has been made ambiguous, because there are now multi-AZ instances and multi-AZ clusters.

...and your multi-AZ "instance", despite being not a multi-AZ "cluster" from AWS's perspective, is still two nodes that are "clustered" together and treated as one logical database from the client connection perspective.

see [0] and scroll down to the "availability and durability" screenshot for an example.

0: https://aws.amazon.com/blogs/aws/amazon-rds-multi-az-db-clus...

havkom

Good investigation!

Software developers nowadays barely know about transactions, and definitely not about different transaction models (in my experience). I have even encountered "senior developers" (who are actually so called "CRUD developers"), who are clueless about database transactions.. In reality, transactions and transaction models matter a lot to performance and error free code (at least when you have volumes of traffic and your software solves something non-trivial).

For example: After a lot of analysis, I switched from SQL Server standard Read Committed to Read Committed Snapshot Isolation in a large project - the users could not be happier -> a lot of locking contention has disappeared. No software engineer in that project had any clue of transaction models or locks before I taught them some basics (even though they had used transactions extensively in that project)..

shivasaxena

This isn't confined just to senior developers. I have even encountered system architects who were clueless about Isolation levels. Some even confused "Consistency" in ACID with the "Consistency" in CAP.

Makes me sad, since I work mostly in retail and and encounter systems that are infested with race conditions and simila errors: things where these isolation levels would be of great help.

However it's mostly engineers at startups, I have a very high opinion of typical Oracel/MSSQL developers at BigCos who at least have their fundamentals right.

derivagral

Some time ago, this was important to decipher the marketing behind MongoDB. Their benchmarks ran with a loose isolation (read_uncommitted iirc) that didn't guarantee a durable flush, and they'd benchmark against defaults from postgres, etc, which didn't use this isolation.

Clearly it worked for them, but I spent a few different stints cleaning up after developers who didn't know this sort of thing.

icedchai

In over 25+ years at various companies, I only recall one interview where isolation levels were even discussed. Almost nobody cares until it's a problem.

bdangubic

we must have had entirely different careers, same in years and 180 degrees opposite, absolute core (and disqualifying) questions at every interview, no exceptions.

ljm

I’ve noticed the lack of transaction awareness mostly in serverless/edge contexts where the backend architecture (if you can even call it that) is driven exclusively by the needs of the client. For instance, database queries are modelled as react hooks or sequential API calls.

I’ve seen this work out terribly at certain points in my career.

jacobsenscott

Soon most software devs will just be transcribing LLM trash to code with no concept of what's actually happening (its actually required at shopify now - MS is bragging 1/3rd of their software is written this way), and no new engineers are coming up because why invest the time to learn if there won't be any engineering jobs left?

whazor

I think that this is really the duality of LLMs. I can ask it to explain different database transaction models and it would perfectly explain to me how it works, which one to pick, and how to apply it.

But generated code by a LLM will likely also have bugs that could be fixed with transactions.

jacobsenscott

That's because it's glorified search. The postgres docs tell you that without risk of hallucination. You are correct that it won't produce code that does the right thing in that context though.

baq

My recommendation for juniors stands unchanged for a decade now: read a book about SQL databases over a weekend and a book about the database your current work project is using over the next weekend. Chances are you are now the database expert on the project.

fuy

Had similar situation a few years before - switched a (now) billion revenue product from Read Committed to Read Committed Snapshot with huge improvements in performance. One thing to be aware when doing this - it will break all code that rely on blocking reads (e.g. select with exists). These need to be rewritten using explicit locks or some other methods.

belter

Besides the obvious shocking statement that people can be gainfully working in this industry, without knowing about database transactions...I will take a guess...they have been using web scale MongoDB ?

cswilliams

Interesting. At a previous company, when we changed the pg_dump command in a backup script to start using parallel workers (-j flag) we started to rarely see errors that suggested inconsistency when restoring the backups (duplicate key errors and fk constraint errors). At the time, I tried reporting the issue to both AWS and on the Postgres mailing list but never got anywhere since I could not easily reproduce it. We eventually gave up and went back to single threaded dumps. I wonder if this issue is related to that behavior we were seeing.

belter

Was a single instance, one instance with a standby in another AZ or a multiaz cluster as tested here?

cswilliams

We saw it when we ran the pg_dump off a standby instance (or a "replica" to use RDS terminology). Our primary was a multi-az instance. So not exactly what they tested here I guess, but it makes me wonder what changes, if any, they've made to postgres under the hood.

ezekiel68

In my reading of this, it looks like the practical implication could be that reads happening quickly after writes to the same row(s) might return stale data. The write transaction gets marked as complete before all of the distributed layers of a multi AZ RDS instance have been fully updated, such that immediate reads from the same rows might return nothing (if the row does not exist yet) or older values if the columns have not been fully updated.

Due to the way PostgreSQL does snapshotting, I don't believe this implies such a read might obtain a nonsense value due to only a portion of the bytes in a multi-byte column type having been updated yet.

It seems like a race condition that becomes eventually consistent. Or did anyone read this as if the later transaction(s) of a "long fork" might never complete under normal circumstances?

aphyr

This isn't just stale data, in the sense of "a point-in-time consistent snapshot which does not reflect some recent transactions". I think what's going on here is that a read-only transaction against a secondary can observe some transaction T, but also miss transactions which must have logically executed before T.

mikesun

"I think what's going on here is that a read-only transaction against a secondary can observe some transaction T, but also miss transactions which must have logically executed before T."

i was intuitively wondering the same but i'm having trouble reasoning how the post's example with transactions 1, 2, 3, 4 exhibits this behavior. in the example, is transaction 2 the only read-only transaction and therefore the only transaction to read from the read replica? i.e. transactions 1, 3, 4 use the primary and transaction 2 uses the read replica?

aphyr

Yeah, that's right. It may be that the (apparent) order of transactions differs between primary and secondary.

kevincox

To provide a simple (although contrived) example of the type of thing that can happen. Imagine that you have a table with three columns `gps_coordinate`, `postal_code` and `city`. The way these are set is that the new coordinate gets posted to the API and `gps_coordinate` is updated. This then kicks off a background task that uses the new coordinate to lookup and update `postal_code`. Then another background task uses the postal code to look up and set `city`.

Since these happen sequentially, for a single update of `gps_coordinate` you would only expect to be able to observe one of:

1. Nothing updated yet, all columns have the previous value.

2. `gps_coordinate` updated, with `postal_code` and `city` still having the previous values.

3. `gps_coordinate` and `postal_code` updated with `city` still having the previous value.

4. All fields updated.

But the ordering that aphyr proved is possible allows you to see "impossible" states such as

1. `postal_code` updated with `gps_coordinate` and `city` still having the previous values.

2. `city` updated with `gps_coordinate` and `postal_code` still having the previous values.

Basically since these transactions happen in order and depend on one another you would expect that you can only see the "left to right" progression. But actually you can see some subset of the transactions applied even if that isn't a possible logical state of the database.

baq

> This work was performed independently by Jepsen, without compensation

not what a RDBMS stakeholder wants to wake up to on the best of days. I'd imagine there were a couple emails expressing concern internally.

hats off to aphyr as usual.

tasuki

What's a "RDBMS stakeholder" ?

(Hats off to aphyr for sure!)

bobnamob

The three layers of middlemanagement between engineers and whichever director owns this particular incarnation of RDS

baq

a stakeholder is anyone who has any business at all with the system - customer, engineer, manager, etc.

RDBMS - https://en.wikipedia.org/wiki/Relational_database#RDBMS

fulafel

I'd think anynone on the receiving end should be thrilled. Traditionally nobody survives Jepsen unscathed but getting it from Aphyr means you're being taken seriously.

nijave

It's not entirely clear but this isn't an issue in multi instance upstream Postgres clusters?

Am I correct in understanding either AWS is doing something with the cluster configuration or has added some patches that introduce this behavior?

aphyr

This is a very good question! I do not understand AWS's replication architecture well enough to reimplement it with standard Postgres yet. This behavior doesn't happen in single-node Postgres, as far as I can tell, but it might happen in some replication setups!

I also understand there are lots of ways to do Postgres replication in general, with varying results. For instance, here's Bin Wang's report on Patroni: https://www.binwang.me/2024-12-02-PostgreSQL-High-Availabili...

mattashii

> It's not entirely clear but this isn't an issue in multi instance upstream Postgres clusters?

No, it isn't an issue with single-instance PostgreSQL clusters. Multi-instance PostgreSQL clusters (single primary, plus streaming/physical replicas) are affected.

What they -too- discovered is that PostgreSQL currently doesn't have consistent snapshot behaviour between the primary and replicas. Presumably, read-only transaction T2 was executed on a secondary (replica) node, while T1, T3, and T4 (all modifying transactions) were executed on the primary.

Some background:

Snapshots on secondary PostgreSQL nodes rely on transaction persistence order (location of commit record in WAL) to determine which transactions are visible, while the visibility order on the primary is determined by when the backend that authorized the transaction first got notice that the transaction was completely committed (and then got to marking the transaction as committed). On each of these (primary and secondary) the commit order is consistent across backends that connect to that system, but the commit order may be somewhat different between the primary and the secondary.

There is some work ongoing to improve this, but that's still very much WIP.

aphyr

Thank you matashii--this would definitely explain it. I've also received another email suggesting this anomaly is due to the difference in commit/visibility order between primary and secondary. Is there by chance a writeup of this available anywhere that I can link to? It looks like https://postgrespro.com/list/thread-id/1827129 miiight be related, but I'm not certain. If so, I'd like to update the report.

My email is aphyr@jepsen.io, if you'd like to drop me a line. :-)

ants_a

That thread is indeed about the same issue. I don't think anyone has done a more concise writeup on it.

Core of the issue is that on the primary, commit inserts a WAL record, waits for durability, local and/or replicated, and then grabs a lock (ProcArrayLock) to mark itself as no longer running. Taking a snapshot takes that same lock and builds a list of running transactions. WAL insert and marking itself as visible can happen in different order. This causes an issue on the secondary where there is no idea of the apparent visibility order, so visibility order on secondary is strictly based on order of commit records in the WAL.

The obvious fix would be to make visibility happen in WAL order on the primary too. However there is one feature that makes that complicated. Clients can change the desired durability on a transaction-by-transaction basis. The settings range from confirm transaction immediately after it is inserted in WAL stream, through wait for local durability, all the way up to wait for it to be visible on synchronous replicas. If visibility happens in WAL order, then an async transaction either has to wait on every higher durability transaction that comes before it in the WAL stream, or give up on read-your-writes. That's basically where the discussion got stuck without achieving a consensus on which breakage to accept. This same problem is also the main blocker for adopting a logical (or physical) clock based snapshot mechanism.

By now I'm partial to the option of giving up on read-your-writes, with an opt-in option to see non-durable transactions as an escape hatch for backwards compatibility. Re-purposing SQL read uncommitted isolation level for this sounds appealing, but I haven't checked if there is some language in the standard that would make that a bad idea.

A somewhat elated idea is Eventual Durability, where write transactions become visible before they are durable, but read transactions wait for all observed transactions to be durable before committing.

aeyes

What are multi instance upstream Postgres clusters for you? PostgreSQL has no official support for failover of a master instance, the only mechanism is Postgres replication which you can make synchronous. Then you can build your own tooling around this to build a Postgres cluster (Patroni is one such tool).

AWS patched Postgres to replicate to two instances and to call it good if one of the two acknowledges the change. When this ack happens is not public information.

My personal opinion is that filesystem level replication (think drbd) is the better approach for PostgreSQL. I believe that this is what the old school AWS Multi-AZ instances do. But you get lower throughput and you can't read from the secondary instance.

nijave

>My personal opinion is that filesystem level replication (think drbd) is the better approach for PostgreSQL

That's basically what their Aurora variant does. It uses clustered/shared storage then uses traditional replication only for cache invalidation (so replicas know when data loaded into memory/cache has changed on the shared storage)

belter

Yes its different. This is a deeper overview of what they did: https://youtu.be/fLqJXTOhUg4

Specially here: https://youtu.be/fLqJXTOhUg4?t=434

tibbar

The submitted title buries the lede: RDS for PostgreSQL 17.4 does not properly implement snapshot isolation.

aphyr

Folks on HN are often upset with the titles of Jepsen reports, so perhaps a little more context is in order. Jepsen reports are usually the product of a long collaboration with a client. Clients often have strong feelings about how the report is titled--is it too harsh on the system, or too favorable? Does it capture the most meaningful of the dozen-odd issues we found? Is it fair, in the sense that Jepsen aims to be an honest broker of database safety findings? How will it be interpreted in ten years when people link to it routinely, but the findings no longer apply to recent versions? The resulting discussions can be, ah, vigorous.

The way I've threaded this needle, after several frustrating attempts, is to have a policy of titling all reports "Jepsen: <system> <version>". HN is of course welcome to choose their own link text if they prefer a more descriptive, or colorful, phrase. :-)

dang

Given that author and submitter (and commenter!) are all the same person I think we can go with your choice :)

The fact that the thread is high on HN, plus the GP comment is high in the thread, plus that the audience knows how interesting Jepsen reports get, should be enough to convey the needful.

broost3r

long time lurker here who registered on HN many years ago after reading Jepsen: Cassandra

the Jepsen writeups will surely stand the test of time thank you!

belter

And your comment also...In Multi-AZ clusters.

Well this is from Kyle Kingsbury, the Chuck Norris of transactional guarantees. AWS has to reply or clarify, even if only seems to apply to Multi-AZ Clusters. Those are one of the two possibilities for RDS with Postgres. Multi-AZ deployments can have one standby or two standby DB instances and this is for the two standby DB instances. [1]

They make no such promises in their documentation. Their 5494 pages manual on RDS hardly mentions isolation or serializable except in documentation of parameters for the different engines.

Nothing on global read consistency for Multi-AZ clusters because why should they.... :-) They talk about semi-synchronous replication so the writer waits for one standby to confirm log record, but the two readers can be on different snapshots?

[1] - "New Amazon RDS for MySQL & PostgreSQL Multi-AZ Deployment Option: Improved Write Performance & Faster Failover" - https://aws.amazon.com/blogs/aws/amazon-rds-multi-az-db-clus...

[2] - "Amazon RDS Multi-AZ with two readable standbys: Under the hood" - https://aws.amazon.com/blogs/database/amazon-rds-multi-az-wi...

n2d4

> They make no such promises in their documentation. Their 5494 pages manual on RDS hardly mentions isolation or serializable

Well, as a user, I wish they would mention it though. If I migrate to RDS with multi-AZ after coming from plain Postgres (which documents snapshot isolation as a feature), I would probably want to know how the two differ.

altairprime

I emailed the mods and asked them to change it to this phrase copy-pasted from the linked article:

> Amazon RDS for PostgreSQL multi-AZ clusters violate Snapshot Isolation

altairprime

(The mods replied above; thank you!)

gymbeaux

Par for the course

badmonster

What safety or application-level bugs could arise if developers assume Snapshot Isolation but Amazon RDS for PostgreSQL is actually providing only Parallel Snapshot Isolation, especially in multi-AZ configurations using the read replica endpoint?

Elucalidavah

Consider a "git push"-like flow: begin a transaction, read the current state, check that it matches the expected, write the new state, commit (with a new state hash). In some unfortunate situations, you'll have a commit hash that doesn't match any valid state.

And the mere fact that it's hard to reason about these things means that it's hard to avoid problems. Hence, the easiest solution is likely "it may be possible to recover Snapshot Isolation by only using the writer endpoint", for anything where write is anyhow conditional on a read.

Although I'm surprised the "only using the writer endpoint" method wasn't tested, especially in availability loss situations.

ctapobep

Consider this: you leave a comment under a post. The user who posts first deserves a "first commenter badge". Now:

- User1 comments

- User2 comments

- User1 checks (in a separate tx) that there's only 1 comment, so User1 gets the badge

- User2 checks the same (in a separate tx) and also sees only 1 comment (his), and also receives the badge.

With Snapshot isolation this isn't possible. At least one of the checks made in a separate tx would see 2 comments.

The original article on the Parallel Snapshot is a good read: https://scispace.com/pdf/transactional-storage-for-geo-repli...

mushufasa

> These phenomena occurred in every version tested, from 13.15 to 17.4.

I was worried I had made the wrong move upgrading major versions, but it looks like this is not that. This is not a regression, but just a feature request or longstanding bug.

password4321

It would be great to get all the Amazon RDS flavors Jepsen'd.

aphyr

I have actually been working on this (very slowly, in occasional nights and weekends!) Peter Alvaro and I reported on a safety issue in RDS for MySQL here too: https://jepsen.io/analyses/mysql-8.0.34#fractured-read-like-...

password4321

There is a universe where cloud providers announce each new database offering by commissioning a Jepsen test and iterating on the results until every issue has been resolved or at least documented.

Unfortunately reliability is not that high on the priority list here. Keep up the good work!

film42

I think AWS will need to update their documentation to communicate this. Will a snapshot isolation fix introduce a performance regression in latency or throughput? Or, maybe they stand by what they have as being strong enough. Either way, they'll need to say something.

kevincox

I think the ideal solution from AWS would be fixing the bug and actually providing the guarantees that the docs say that they do.

film42

I agree, but I have a feeling this isn't a small fix. Sounds like someone picked a mechanism that seemed to be equivalent but is not. Swapping that will require a lot of time and testing.

mdaniel

> Swapping that will require a lot of time and testing.

Lucky them, there is an automated suite[1] to verify the correct behavior :-D

1: https://github.com/jepsen-io/rds

slt2021

there is no trivial fix for this without breaking performance. roughly, there is no free lunch in distributed systems, and AWS made a tradeoff to relax consistency guarantees for that specific setup, and didn't really advertise that

belter

It looks like a bug, but the problem is the documentation does not detail what guarantees are offered in this scenario, but would love if somebody could point me where it does...

zaphirplane

Yet bellow your comment is a quote that this is since v13 and above is a comment that there is no mention in the docs.

Using the words Bug and guarantee is throwing the casual readers off the mark ?