Replacing EBS and Rethinking Postgres Storage from First Principles
21 comments · October 29, 2025
0xbadcafebee
There's a ton of jargon here. Summarized...
Why EBS didn't work:
Why all this matters:
What didn't work:
Their solution:
Performance stats (single volume):
hedora
Thanks for the summary.
Note that those numbers are terrible vs. a physical disk, especially latency, which should be < 1ms read, << 1ms write.
(That assumes async replication of the write-ahead log to a secondary. Otherwise, write latency should be ~1 RTT, which is still << 5ms.)
Stacking storage like this isn’t great, but PG wasn’t really designed for performance or HA. (I don’t have a better concrete solution for ANSI SQL that works today.)
graveland
(I'm on the team that made this)
The raw numbers are one thing, but the overall performance of pg is another. If you check out https://planetscale.com/blog/benchmarking-postgres-17-vs-18, for example, the average QPS chart shows that there isn't a very large difference in QPS between GP3 at 10k IOPS and NVMe at 300k IOPS.
So currently I wouldn't recommend this new storage for the highest end workloads, but it's also a beta project that's still got a lot of room for growth! I'm very enthusiastic about how far we can take this!
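A rough sketch of how one could run that kind of side-by-side comparison (this is not the linked post's methodology): drive the same pgbench workload against two Postgres servers that differ only in their storage and compare the TPS pgbench reports. The connection strings, client counts, and duration below are placeholders.

    # Sketch: same pgbench workload against two servers, compare reported TPS.
    # Assumes both databases were initialized beforehand with `pgbench -i -s <scale>`.
    import re
    import subprocess

    def pgbench_tps(dsn: str, seconds: int = 300, clients: int = 32) -> float:
        # pgbench prints a line like "tps = 1234.56 (...)" at the end of a run
        out = subprocess.run(
            ["pgbench", "-c", str(clients), "-j", "8", "-T", str(seconds), dsn],
            capture_output=True, text=True, check=True,
        ).stdout
        return float(re.search(r"tps = ([0-9.]+)", out).group(1))

    # Placeholder connection strings for a gp3-backed and an NVMe-backed instance
    print("gp3 :", pgbench_tps("postgresql://bench@gp3-host/bench"))
    print("nvme:", pgbench_tps("postgresql://bench@nvme-host/bench"))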
bradyd
> EBS only lets you resize once every 6–24 hours
Is that even true? I've resized an EBS volume a few minutes after another resize before.
electroly
AWS documents it as "After modifying a volume, you must wait at least six hours and ensure that the volume is in the in-use or available state before you can modify the same volume," but community posts suggest you can get up to 8 resizes within the six-hour window.
jasonthorsness
The 6-hour counter is most certainly, painfully true. If you work with an AWS rep, please complain about this in every session; maybe if we all do, they will reduce the counter :P.
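For anyone scripting around that limit, here is a minimal sketch (assumptions: boto3, a placeholder volume ID, and that the rejection surfaces as a ClientError with code "VolumeModificationRateExceeded"):

    # Sketch: grow an EBS volume and back off when AWS says it was modified too
    # recently (the ~6-hour rule discussed above). Volume ID is a placeholder.
    import time
    import boto3
    from botocore.exceptions import ClientError

    ec2 = boto3.client("ec2")

    def grow_volume(volume_id: str, new_size_gib: int) -> None:
        while True:
            try:
                ec2.modify_volume(VolumeId=volume_id, Size=new_size_gib)
                return  # still need to grow the partition/filesystem afterwards
            except ClientError as err:
                # Assumed error code for the "wait at least six hours" rejection
                if err.response["Error"]["Code"] != "VolumeModificationRateExceeded":
                    raise
                time.sleep(15 * 60)  # retry until the cooldown expires

    grow_volume("vol-0123456789abcdef0", 500)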
thesz
What does EBS mean?
It is used in the first line of the text, but no explanation is given.
znpy
Reminds me of about ten years ago, when a large media customer was running NetApp in the cloud on AWS to get most of what you just wrote about (because EBS features sucked/suck very badly and are also crazy expensive).
I did not set that up myself, but the colleague who worked on it told me that enabling TCP multipath for iSCSI yielded significant performance gains.
unsolved73
TimescaleDB was such a great project!
I'm really sad to see them waste the opportunity and instead build yet another managed cloud on top of AWS, chasing buzzword after buzzword.
Had they made deals with cloud providers to offer managed TimescaleDB, so they could focus on their core value proposition, they could have won the time-series business. But ClickHouse made them irrelevant, and Neon has already won the "Postgres for agents" business thanks to a better architecture than this.
akulkarni
Thanks for the kind words about TimescaleDB :-)
We think we're still building great things, and our customers seem to agree.
Usage is at an all-time high, revenue is at an all-time high, and we’re having more fun than ever.
Hopefully we’ll win you back soon.
runako
Thanks for the writeup.
I'm curious whether you evaluated solutions like ZFS/Gluster? Also curious whether you looked at Oracle Cloud, given its faster block storage?
maherbeg
This has a similar flavor to xata.io's SimplyBlock-based storage system:
* https://xata.io/blog/xata-postgres-with-data-branching-and-p...
* https://www.simplyblock.io/
It's a great way to mix copy-on-write with what is effectively logical splitting of physical nodes. It's something I wanted to build at a previous role.
stefanha
@graveland Which Linux interface was used for the userspace block driver (ublk, nbd, tcmu-runner, NVMe-over-TCP, etc)? Why did you choose it?
Also, were existing network or distributed file systems not suitable? This use case sounds like Ceph might fit, for example.
graveland
There's some secret sauce there that I don't know if I'm allowed to talk about yet, so I'll just address the existing tech that we didn't use: most things either didn't have a good enough license, cost too much, or would take a TON of ramp-up and expertise we don't currently have to manage and maintain. Generally speaking, building our own stuff allows us to fully control it.
Entirely programmable storage has so far allowed us to try a few different approaches to making things efficient and giving us the features we want. We've been able to try different dedup methods, copy-on-write styles, different compression methods and types, different sharding strategies... and that's all just a start. We can easily and quickly create new experimental storage backends and see exactly how pg performs with them side-by-side with other backends.
We're a Kubernetes shop, and we have our own CSI plugin, so we can also transparently run a pg HA pair with one pg server using EBS and the other running on our new storage layer, and easily bounce between storage types with nothing but a switchover event.
the8472
Though AWS instance-attached NVMe(oF?) still has lower IOPS per TB than bare-metal NVMe does.
E.g. i8g.2xlarge, 1875 GB, 300k IOPS read
vs. WD_BLACK SN8100, 2 TB, 2300k IOPS read
everfrustrated
You can't do those rates 24x7 on a WD_BLACK tho.
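A quick back-of-the-envelope on the two figures quoted above (ignoring the sustained-rate caveat), showing the roughly order-of-magnitude gap in read IOPS per TB:

    # Read IOPS per TB of capacity, from the figures quoted above
    i8g_2xlarge = 300_000 / 1.875        # 1875 GB instance-attached NVMe
    wd_black_sn8100 = 2_300_000 / 2.0    # 2 TB consumer NVMe
    print(f"i8g.2xlarge:     {i8g_2xlarge:,.0f} IOPS/TB")      # ~160,000
    print(f"WD_BLACK SN8100: {wd_black_sn8100:,.0f} IOPS/TB")  # ~1,150,000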
tayo42
Are they not using AWS anymore? I found that confusing. It says they're not using EBS and not using attached NVMe, but I didn't think there were other options in AWS.
thr0w
Postgres for agents, of course! It makes too much sense.
jacobsenscott
The agent stuff is BS for the pointy hairs. This seems to address real problems I've had with PG though.
cpt100
pretty cool