Making Postgres scale
107 comments · March 14, 2025
Eikon
I run a 100 billion+ row Postgres database [0], that is around 16TB, and it's pretty painless!
There are a few tricks that make it run well (PostgreSQL compiled with a non-standard block size, ZFS, careful VACUUM planning), but nothing too out of the ordinary.
ATM, I insert about 150,000 rows a second, run 40,000 transactions a second, and read 4 million rows a second.
Isn't "Postgres does not scale" a strawman?
[0] https://www.merklemap.com/
frollogaston
"Postgres does not scale" means that you can't just give it more machines, which is true. At some point you've got the fastest available machine, maybe you've played with the settings, and that's it. Then you embark on the kind of manual sharding journey the article describes.
tnorgaard
Super interesting - compiling pg, I assume, with the same block size as ZFS? It was always on our todo list to try, but we never got around to it. If possible, what block size did you end up with? Have you tried ZFS direct IO in 2.3.x? If so, could you share any findings? Thanks for sharing - and cool website!
Eikon
I don’t think Postgres will be able to benefit from direct io? I might be wrong though!
I use Postgres with a 32K BLCKSZ.
I am actually using the default 128K ZFS recordsize; in a mixed workload, I found overall performance nicer than matching it at 32K, and compression is way better (rough commands for both settings are sketched below).
> Thanks for sharing - and cool website!
Thank you!
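For reference, the block-size and recordsize settings discussed above map to roughly the following commands. This is a minimal sketch, not Eikon's actual build script, and the dataset name tank/pgdata is hypothetical:

  # Build PostgreSQL with a 32K block size (--with-blocksize is given in kilobytes)
  ./configure --with-blocksize=32
  make && make install
  # Keep the ZFS dataset at the default 128K recordsize and enable compression
  zfs set recordsize=128K tank/pgdata
  zfs set compression=lz4 tank/pgdata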
kelafoja
One of the things I find challenging is understanding the meaning of the word "scales". It is sometimes used differently in different contexts.
Can it be performant in high-load situations? Certainly. Can it elastically scale up and down based on demand? As far as I'm aware, it cannot.
What I'm most interested in is how operations are handled. For example, if it's deployed in a cloud environment and you need more CPU and/or memory, you have to eat the downtime to scale it up. What if it's deployed to bare metal and it cannot handle the increasing load anymore? How costly (in terms of both time and money) is it to migrate it to bigger hardware?
nine_k
When it "scales", it usually means "scales up". A scalable solution is one that can withstand a large and increasing load, past the usual limitations of obvious solutions.
Being elastic is nice, but not always needed. In most cases of database usage, downsizing never happens, nor is it expected to happen: logically, data are only added, and any packaging and archiving exists only to keep the size manageable.
pluto_modadic
A database scaling dramatically up and down /under load/ and being expected to perform the same as at steady state seems a bit weird, vs a single, beefy database with a beefy ingest job and a bunch of read-only clients searching it.
You're more likely to encounter two phases (building the DB in heavy-growth mode, and using the DB in light-growth, heavy-read mode).
A business that doesn't quite yet know what size the DB needs to be has a frightening RDS bill incoming.
rednafi
Damn, that’s a chonky database. Have you written anything about the setup? I’d love to know more— is it running on a single machine? How many reader and writer DBs? What does the replication look like? What are the machine specs? Is it self-hosted or on AWS?
By the way, really cool website.
Eikon
I'll try to get a blog post out soon!
> Damn, that’s a chonky database. Have you written anything about the setup? I’d love to know more— is it running on a single machine? How many reader and writer DBs? What does the replication look like? What are the machine specs? Is it self-hosted or on AWS?
It's self-hosted on bare metal, with standby replication, normal settings, nothing "weird" there.
6 NVMe drives in raidz-1, 1024GB of memory, and a 96-core AMD EPYC CPU.
A single database with no partitioning (I avoid PostgreSQL partitioning as it complicates queries and weakens constraint enforcement, and IMHO does not provide many benefits outside of niche use cases).
> By the way, really cool website.
Thank you!
tracker1
> A single database with no partitioning (I avoid PostgreSQL partitioning as it complicates queries and weakens constraint enforcement, and IMHO does not provide many benefits outside of niche use cases).
That's kind of where I'm at now... you can vertically scale a server so much now (compared to even a decade ago) that there's really no need, IMO, to bring a lot of complexity into databases. Simple read replicas or a hot spare should be sufficient for the vast majority of use cases, and the hardware is way cheaper than a few years ago, relatively speaking.
I've spent a large part of the past decade and a half using and understanding all the NoSQL options (including sharding with pg) and where they're better or not. At this point my advice is: start with PG and grow that DB as far as real hardware will let you... if you grow to the point where you need more, then you have the money to deal with your use case properly.
So few applications need to go beyond a few million simultaneous users, and if you avoid certain pitfalls, it's not that hard. Especially if you're flexible enough to leverage JSONB and a bit of denormalization for fewer joins, you'll go a very, very long way.
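As a tiny illustration of the JSONB-plus-denormalization point, something like the following avoids a couple of join tables. This is a hedged sketch; the table and fields are hypothetical:

  -- Keep rarely-joined attributes in a JSONB column instead of separate tables
  CREATE TABLE users (
      id      bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
      email   text NOT NULL,
      profile jsonb NOT NULL DEFAULT '{}'
  );
  -- Query nested fields directly, no join required
  SELECT email FROM users WHERE profile ->> 'plan' = 'pro';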
stuartjohnson12
> It's self-hosted on bare metal, with standby replication, normal settings, nothing "weird" there.
I can build scalable data storage without a flexible scalable redundant resilient fault-tolerant available distributed containerized serverless microservice cloud-native managed k8-orchestrated virtualized load balanced auto-scaled multi-region pubsub event-based stateless quantum-ready vectorized private cloud center? I won't believe it.
rednafi
> I'll try to get a blog post out soon!
Please do.
> It’s self-hosted on bare metal, with standby replication, normal settings, nothing “weird” there.
16TB with nothing weird is pretty impressive. Our devops team reached for Aurora way before that.
> 6 NVMe drives in raidz-1, 1024GB of memory, a 96-core AMD EPYC CPU.
Since you’re self-hosted, I’m guessing you aren’t on AWS. How much is this setup costing you now, if you don’t mind sharing?
> A single database with no partitioning (I avoid PostgreSQL partitioning as it complicates queries and weakens constraint enforcement, and IMHO does not provide many benefits outside of niche use cases).
Beautiful!
JohnBooty
Your replies are really valuable and informative. Thank you so much.
Question - what is your peak utilization % like? How close are you to saturating these boxes in terms of CPU etc?
GordonS
I'm also self-hosting Postgres, and the project is getting to the point where a standby would be a good idea to ensure higher availability.
Did you use any particular guide for setting up replication? Also, how do you handle failover/failback to/from the standby, please?
MR4D
Having some issues with your numbers, but I’m probably just missing something…
If you insert 150K rows per second, that’s roughly 13 Billion rows per day.
So you’re inserting 10%+ of your database size every day?
That seems weird to me. Are you pruning somewhere? If not, is your database less than a month old? I’m confused.
Eikon
Well, that's why I said "ATM" - it's not a sustained rate all the time. And yes, there are a bunch of DELETEs too.
oa335
Can you please share some tips and tricks for achieving such high throughput?
sa46
I used to run a bunch of Postgres nodes at a similar scale. The most painful parts (by far) were restoring to a new node and major version upgrades.
Any tricks you used for those parts?
Eikon
> were restoring to a new node
Zfs send / recv or replication.
> I used to run a bunch of Postgres nodes at a similar scale. The most painful parts (by far) were restoring to a new node and major version upgrades. Any tricks you used for those parts?
Replication makes this pretty painless :)
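For illustration, the two approaches mentioned above look roughly like this. This is a minimal sketch with hypothetical hostnames, dataset and role names, not Eikon's actual procedure:

  # Option 1: copy the data directory's ZFS dataset to the new node from a snapshot
  zfs snapshot tank/pgdata@seed
  zfs send tank/pgdata@seed | ssh newhost zfs recv tank/pgdata
  # Option 2: clone the primary over the replication protocol and start the copy as a standby
  # (the primary's pg_hba.conf needs a 'replication' entry for this host)
  pg_basebackup -h primary.example.com -U replicator -D /var/lib/postgresql/data -X stream -P -R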
wg0
So run a replica? Is there more literature on that?
shayonj
> I insert about 150,000 rows a second
That's amazing - I would love to know if you have done careful data modeling, indexing, etc. that allows you to get to this, and what kind of data is being inserted?
Eikon
I am not optimizing too much around insertion speed. I avoid GIN, GIST and hash indexes.
The schema is nicely normalized.
I had trouble with hash indexes requiring hundreds of gigabytes of memory to rebuild.
Postgres B-Trees are painless and (very) fast.
Eg. querying one table by id (redacted):
EXPLAIN ANALYZE SELECT * FROM table_name WHERE id = [ID_VALUE];
Index Scan using table_name_pkey on table_name (cost=0.71..2.93 rows=1 width=32) (actual time=0.042..0.042 rows=0 loops=1)
Index Cond: (id = '[ID_VALUE]'::bigint)
Planning Time: 0.056 ms
Execution Time: 0.052 ms
Here’s a zpool iostat 1:

# zpool iostat 1
 operations        bandwidth
 read    write     read    write
 -----   -----     -----   -----
 148K    183K      1.23G   2.43G
 151K    180K      1.25G   2.36G
 151K    177K      1.25G   2.33G
 148K    153K      1.23G   2.13G
aprdm
99.9% of the companies in the world will never need more than 1 beefy box running postgres with a replica for a manual failover and/or reads.
frollogaston
Availability is trickier than scalability. An async replica can lose a little data during a failover, and a synchronous replica is safer but slower. A company on some platform might not even know which one they're using until it bites them.
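In Postgres terms, that choice comes down to a couple of settings. A minimal postgresql.conf sketch, with a hypothetical standby name:

  # Asynchronous (the default): COMMIT returns without waiting for the standby,
  # so a failover can lose the most recent transactions.
  synchronous_standby_names = ''
  # Synchronous: COMMIT waits for the named standby to confirm - safer, but slower.
  # synchronous_standby_names = 'FIRST 1 (standby1)'
  # synchronous_commit = on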
fmajid
Skype open-sourced their architecture way back, using PL/Proxy to route calls based on shard. It works, is quite elegant, handled 50% of all international phone calls in the noughties. My old company used it to provide real-time analytics on about 300M mobile devices.
https://wiki.postgresql.org/images/2/28/Moskva_DB_Tools.v3.p...
https://s3.amazonaws.com/apsalar_docs/presentations/Apsalar_...
Keyframe
Skype has had from the beginning the requirement that all database access must be implemented through stored procedures.
That presentation starts with hard violence.
Tostino
If the database team designed a thoughtful API with stored procedures, this can actually be a quite nice way to interact with a database for specific uses.
Being 100% hard and fast on that rule seems like a bad idea though.
Karupan
Tangentially related: is there a good guide or set of setup scripts for running self-hosted Postgres with backups and a secondary standby? Like, I just want something I can deploy to a VPS/dedicated box for all my side projects.
If not, is Supabase the most painless way to get started?
akshayshah
Apart from being backed by Postgres instead of MySQL, is this different from Vitess (and its commercial vendor PlanetScale)?
levkk
The goal for this project is to be analogous to Vitess for Postgres.
craigkerstiens
Probably as useful is the overview of what pgdog is and the docs. From their docs[1]: "PgDog is a sharder, connection pooler and load balancer for PostgreSQL. Written in Rust, PgDog is fast, reliable and scales databases horizontally without requiring changes to application code."
rednafi
Another option is going full-scale with CockroachDB. We had a Django application backed by PostgreSQL, which we migrated to CockroachDB using their official backend.
The data migration was a pain, but it was still less painful than manually sharding the data or dealing with 3rd party extensions. Since then, we’ve had a few hiccups with autogenerated migration scripts, but overall, the experience has been quite seamless. We weren’t using any advanced PostgreSQL features, so CockroachDB has worked well.
skunkworker
Unless their pricing has changed, it’s quite exorbitant when you need a lot of data. To the point that one year of cockroachdb would cost 5x the cost of the server it was running on.
CharlesW
I think you're referring to the CockroachDB Cloud DBaaS offering vs. CockroachDB itself, correct?
rednafi
Not the parent, but yeah, most likely. Then again, you probably don’t want to maintain your own fleet of CockroachDB nodes.
rednafi
This is still true. I wouldn’t use Cockroach if it were my own business. Also, they don’t offer any free version to try out the product. All you get is a short trial period and that’s it.
CharlesW
> Also, they don’t offer any free version to try out the product.
The site makes it seem as if I can install CockroachDB on Mac, Linux, or Windows and try it out for as long as I like. https://www.cockroachlabs.com/docs/v25.1/install-cockroachdb... Additionally, they claim CockroachDB Cloud is free for use "up to 10 GiB of storage and 50M RUs per organization per month".
traceroute66
> Another option is going full-scale with CockroachDB
Just beware that CockroachDB is not a drop-in replacement for PostgreSQL.
Last time I looked it was missing basic stuff. Like stored functions. I don't call stored functions an "advanced feature".
levkk
I'm glad you brought up the migration, because one of the main goals behind our project is to automate migrating to a sharded deployment. You can think of your DB as the base case (num_shards = 1), and PgDog as the recursive solution.
rednafi
Automatic data transfer would be super cool to get out of the box. We had a custom multi-tenancy solution for our application that heavily used PostgreSQL schemas. One schema for each customer.
It was a pain to get that working with Cockroach, since it doesn’t optimize cross-schema queries and suggests one DB per customer. This was a deal breaker for us, and we had to duplicate data to avoid cross-schema queries.
Being able to live within Postgres has its advantages.
eximius
Something I don't see in the pgdog documentation is how cross-shard joins work. Okay, if I do a simple `select * from users order by id`, you'll order the combined results in memory for me. But what if I have GROUP BY and aggregations and such? Will it resolve those correctly?
levkk
Aggregates are a work in progress. We're going to implement them in this order:
1. count
2. max, min, sum
3. avg (needs a query rewrite to include count; sketched below)
Eventually, we'll do all of these: https://www.postgresql.org/docs/current/functions-aggregate..... If you have a specific use case, reach out and we'll prioritize.
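For illustration, the avg rewrite mentioned in item 3 works roughly like this. This is a sketch of the general technique with hypothetical table and column names, not PgDog's actual implementation:

  -- What the application sends:
  SELECT avg(amount) FROM payments;
  -- What the router would send to each shard instead:
  SELECT sum(amount), count(amount) FROM payments;
  -- The router then combines the partial results itself:
  --   global avg = sum(per-shard sums) / sum(per-shard counts)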
eximius
Heh, no chance I can introduce this at work and hard to have a personal project requiring it. :)
I think you probably need some documentation on the current state of affairs, as well as prescriptions for how to work around it. _Most_ live workloads, even if the total dataset is huge, have a pretty small working set. So limiting DB operations to simple fetches and doing any complex operations in memory is viable, but it should be prescribed as the solution, or people will consider its omission a fault instead of a choice.
levkk
No worries. It's early days, the code and docs are just a few months in the making. I'm happy to keep you updated on the progress. If you want, send your contact info to hi@pgdog.dev.
- Lev
fourseventy
I run a postgresql db with a few billion rows at about 2TB right now. We don't need sharding yet but when we do I was considering Citus. Does anyone have experience implementing Citus that could comment?
briankelly
People talk about scale frequently as a single dimension (usually volume as it relates to users), but that can be oversimplifying for many kinds of applications. For instance, the point at which you are thinking about non-trivial partitioning schemes (say, when there is high coupling between entities of the same kind, as you see in graphs) is when you should consider alternatives like the Bigtable-inspired DBs, since those are (relatively) more batteries-included for you.
> It’s funny to write this. The Internet contains at least 1 (or maybe 2) meaty blog posts about how this is done
It would’ve been great to link those here. I’m guessing one refers to StackOverflow which has/had one of the more famous examples of scaled Postgres.
levkk
I was thinking of the Instagram post years ago. And maybe the Instacart one.
briankelly
Maybe this one for Instagram? https://instagram-engineering.com/sharding-ids-at-instagram-...
mindcrash
Well, of course it does! :)
Another (battle-tested*) solution is to deploy the (open source) Postgres distribution created by Citus (a subsidiary of Microsoft) on nodes running Ubuntu, Debian or Red Hat, and you are pretty much done: https://www.citusdata.com/product/community
Slap good old trusty PgBouncer in front of it if you want/need connection pooling (and you probably do): https://www.citusdata.com/blog/2017/05/10/scaling-connection...
*) Citus was purchased by Microsoft more or less solely to provide easy scale-out on Azure through Cosmos DB for PostgreSQL
gigatexal
Is it really that easy? What are the edge cases?
levkk
It's not. We tried. Plus, it doesn't work on RDS, where most production databases are. I think Citus was a great first step in the right direction, but it's time to scale the 99% of databases that aren't already running Citus on Azure.
mindcrash
That's because Amazon wants to do whatever they like themselves... you apparently can get things to work by running your own masters (with the Citus extension) on EC2, backed by workers (Postgres RDS) in RDS:
https://www.citusdata.com/blog/2015/07/15/scaling-postgres-r... (note that this is an old blog post - pg_shard has been succeeded by Citus, but the architecture diagram still applies)
I say "apparently" because I have no experience dealing with large databases on AWS.
Personally, I've had no issues with Citus either, both on bare metal/VMs and as SaaS on Azure...
caffeinated_me
Depends on your schema, really. The hard part is choosing a distribution key to use for sharding - if you've got something like a tenant ID that's in most of your queries and big tables, it's pretty easy, but it can be a pain otherwise.
mindcrash
Same pain as with good old (native) partitioning, right? :)
As with partitioning, in my experience something like a common key (identifying data sets), tenant id and/or partial date (yyyy-mm) work pretty great
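For reference, distributing a table by tenant in Citus looks roughly like this. A minimal sketch with hypothetical table and column names:

  -- Enable the extension and shard the table on the tenant column
  CREATE EXTENSION citus;
  SELECT create_distributed_table('orders', 'tenant_id');
  -- Queries that filter on tenant_id can then be routed to a single shard
  SELECT count(*) FROM orders WHERE tenant_id = 42;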
999900000999
I'm kind of interested in why we can't make a better database with all of our modern technology.
Postgres is a fantastic workhorse, but it was also released in the late 80s. Who, who among you will create the database of the future... And not lock it behind bizarro licenses which force me to use telemetry.
HighlandSpring
There are "better" databases, but they're better given some particular definition that may not be relevant to your needs. If SQL/the relational model and ACID semantics are what you need, then Postgres is simply the best in class. The fact that it dates back to the 80s is probably an advantage (a requirement?) when it comes to solving a problem really well.
trescenzi
I guess I’d ask: why is something having been first released in the late 80s, or any decade, a positive or a negative? Some things are still used because they solve the problems people have. Some things are still used just because of industry capture. I’m honestly not sure where I’d put Postgres.
Are there specific things you’d want from a modern database?
999900000999
Relating to the article: better scaling. Saying "run it on a bigger box" is a very brute-force way to optimize an application.
While they come up with some other tricks here, that's ultimately what scaling Postgres means.
If I imagine a better database, it would have native support for scaling, a Postgres-compatible data layer, as well as first-party support for NoSQL (JSONB columns don't cut it, since unpredictable behavior tends to occur when you have simultaneous writes).
It also needs to have a permissive license.
throwaway7783
Can you please expand on the JSONB unpredictable behavior? We are about to embark on a journey to move some of our data from MongoDB to Postgres (with some JSONB). While we don't have significant concurrent writes to a table, it would be very helpful to understand the issues.
aprdm
Why is it brute force and why is it bad ?
thinkingtoilet
Postgres 17.4 was released last month. Show some respect to the devs.
DoctorOW
Have you looked at CockroachDB? PostgreSQL compatibility with modern comforts (e.g. easy cloud deployments, horizontal scaling, memory safe language)
rednafi
Came here to say this. Cockroach solves the sharding issue by adopting consistent hashing-based data distribution, as described in the Dynamo paper. However, their cloud solution is a bit expensive to get started with.
I_am_tiberius
Does CockroachdB already support ltree?
wmf
Anyone who creates a better database is going to want to get paid for it which either means DBaaS or those weird licenses.