Migrating to Postgres
277 comments · May 14, 2025 · luhn
hliyan
Call me old-fashioned, but when records start reaching the 100 million range, it's usually an indication that either your dataset is too wide (consider sharding) or too deep (consider time-based archival) to fit into a monolithic schema. For context, I've dealt with multiple systems that generated this volume of data between 2003 and 2013 (mostly capital markets, but also some govt/compliance work) with databases and hardware from that era, and we rarely had an issue that could not be solved by either query optimization, caching, sharding or archival, usually in that order.
Secondly, we did most of these things using SQL, Bash scripts, cron jobs and some I/O logic built directly into the application code. They were robust enough to handle some extremely mission-critical systems (a failure could bring down a US primary market and, if it's bad enough, you hear it on the news).
hylaride
It obviously depends on how you use your data, but it really is surprising how far one can go with large tables when you implement sharding, caching, and read replicas.
For tables with a lot of updates, Postgres used to fall over with data fragmentation, but that's mostly been a non-issue since SSDs became standard.
It's also easier than ever to stream data to separate "big data" DBs for those separate use cases.
hliyan
Thanks, I knew I forgot something: read replicas
electroly
From the point of view of an SQL engine in 2025, 100 million rows is a tiny table. You can add a surprising number of zeroes onto that figure and a single modest SQL node will handle it with no special effort. 100 billion, with a B, is not difficult on a single beefy node today. I think your points all still stand but consider refreshing the actual numbers. I personally start getting "the itch" around 10 billion (or if it looks like it's going to become 10 billion) these days. If a table gets there, I better have a plan to do something else.
I had a 200 billion row table that was operationally still manageable but, IMO, I had allowed to grow out of control. The enterprise storage costs a fortune. Should have nipped that in the bud by 20 billion at the latest.
paulddraper
Depends on the read/write workload and row size, but yeah, after 100-200M rows PostgreSQL vacuums can take a while, and so can index rebuilds (which you have to do on an active table).
It all depends though; sometimes 1B is passé.
But 100m is a good point to consider what comes next.
thehappyfellow
It’s incredible how much Postgres can handle.
At $WORK, we write ~100M rows per day and keep years of history, all in a single database. Sure, the box is big, but I have beautiful transactional workloads and no distributed systems to worry about!
rastignack
At $WORK, we are within the range of 2 billion rows per day on one of our apps. We do have beefy hardware and ultra fast SSD storage though.
maz1b
A single PG database on one server? What are the specs?
geoka9
Those rows are never pruned and rarely read?
wvh
Two days ago, I'd have said the same. Yesterday, the big box went down, and because it was so stable it had become a rarely oiled joint: the spare chickened out at the wrong time and apparently even managed to mess up the database timeline. Today was the post-mortem, and it was rough.
I'm just saying, simple is nice and fast when it works, until it doesn't. I'm not saying to make everything complex, just to remember life is a survivor's game.
thehappyfellow
You're right, there are downsides like the one you mention! We mitigate it by running a hot backup we can switch to in seconds, and a box on which we test restoring backups every 24h; that's necessary! But it requires 3x the number of big expensive boxes.
I still think it’s the right tradeoff for us, operating a distributed system is also very expensive in terms of dev and ops time, costs are more unpredictable etc.
It’s all tradeoffs, isn’t it?
icedchai
You don't even need to be that "modern." Back in 2010 I was working on a MySQL 5.x system with about 300 million rows on a dual Xeon box with 16 gigs RAM and a few hundred gigs of RAID 10. This was before SSDs were common.
The largest table was over 100 million rows. Some migrations were painful, however. At that time, some of them would lock the whole table and we'd need to run them overnight. Fortunately, this was for an internal app so we could do that.
luhn
The improvements to migrations have been the biggest boon for running even modestly-sized Postgres DBs. It wasn't that long ago that you couldn't add a column with a default value without rewriting the whole table, or add NOT NULL without holding an exclusive lock while the whole table was scanned. That becomes infeasible pretty quickly.
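For concreteness, the newer fast paths look roughly like this (a sketch assuming Postgres 11/12+, a hypothetical events table, and the node-postgres client):

```ts
import { Pool } from 'pg'

const pool = new Pool() // connection config taken from PG* env vars

// Postgres 11+: a constant DEFAULT is stored as catalog metadata,
// so this no longer rewrites the table.
await pool.query(`ALTER TABLE events ADD COLUMN source text DEFAULT 'unknown'`)

// Postgres 12+: SET NOT NULL can skip the full-table scan if a validated
// CHECK constraint already proves it. NOT VALID makes the ADD instant;
// VALIDATE scans the table without blocking reads or writes.
await pool.query(`ALTER TABLE events ADD CONSTRAINT events_source_not_null
  CHECK (source IS NOT NULL) NOT VALID`)
await pool.query(`ALTER TABLE events VALIDATE CONSTRAINT events_source_not_null`)
await pool.query(`ALTER TABLE events ALTER COLUMN source SET NOT NULL`)
```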
nasretdinov
Does adding a column with a default value finally work without locking up the entire table now, at least?
throwaway7783
Yeah, we have 300M+ rows in a table as well. It's partitioned by time and chugs along with no issues. Granted, it's a 30 vCPU, 100 GB RAM machine, but it hosts billions of rows in aggregate.
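A minimal sketch of that kind of time-based partitioning (the events table and its columns are made up; node-postgres for the calls):

```ts
import { Pool } from 'pg'

const pool = new Pool()

// The parent is partitioned by time; each partition gets its own indexes
// and is vacuumed independently.
await pool.query(`
  CREATE TABLE events (
    id         bigint      NOT NULL,
    created_at timestamptz NOT NULL,
    payload    jsonb
  ) PARTITION BY RANGE (created_at)
`)

// One partition per month; old months can be detached or dropped cheaply.
await pool.query(`
  CREATE TABLE events_2024_01 PARTITION OF events
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01')
`)
```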
SchemaLoad
Last app I worked on had a few tables in the billions of rows. Seemed to work fine as we were only really accessing it by unique keys which seems to remain fast no matter how large the table is.
lelanthran
> we were only really accessing it by unique keys which seems to remain fast no matter how large the table is.
Even a naive B-tree index has logarithmic lookup cost, which means the time to find a record grows only very slowly as the number of records increases: going from a million rows to a billion adds only about ten extra comparisons per lookup (log2 of 10^6 is ~20, log2 of 10^9 is ~30).
thomasfromcdnjs
Does mid six figure mean ~$500k?
That sounds insane for a crud app with one million users.
What am I missing?
gbear605
I’ve seen startups with a thousand active users paying $50k/month (though that’s overall costs, not just db). It’s really easy to waste a lot of money doing nothing.
bigfatkitten
It’s especially easy to waste money on databases.
People just throw more compute power (i.e., money) at performance problems, rather than fixing their queries or making better use of indices.
TheNewsIsHere
As a small business owner — I recently spent an hour canceling things that just added up over time and that I don’t now need. It’s just so easy to waste money, period.
Directly to your point though, I once encountered a salesperson who was running an entire sandbox environment of a very large platform to the tune of about $25k/mo. It sat idle for almost half a year before someone came knocking. The cloud team did an audit and they were a little spicy about it, understandably.
ies7
$500k for a DB with only 100 million rows also sounds crazy
wbercx
The largest table was 100 million rows. They could have had hundreds more tables.
imhoguy
I bet it is the cost of query processing (CPU) and traffic (network throughput), plus of course the provider markup.
sgarland
Agreed. Devs usually do a double take when I tell them that their table with 100K rows is not in fact big, or even medium. Everyone's experiences are different, of course, but to me, big is somewhere in the high hundreds of millions range. After a billion it doesn't really matter; the difference between 5 billion and 1 billion isn't important, because it's exceedingly unlikely that (a) your working set is that large, or (b) your server could possibly cope with all of it at once. I hope you have partitions.
williamdclt
Yeah, 100M is really not that much. I worked on a 10B-row table on an RDS r6g.4xlarge, and Postgres handled it fine, even with 20+ indexes. Really not ideal, and I'd rather have fewer indexes and shard the table, but Postgres dealt with it.
ringeryless
OTOH they are admittedly using an ORM (Prisma, known for its weight)
monero-xmr
It is truly amazing how mature developers always wind up at the same result - old tech that has stood the test of time. Betting the company on alpha solutions of dubious quality keeps the devs employed at least.
cies
> It is truly amazing how mature developers ...
...use ORMs!
They are always bad, but especially bad when the code base and/or query complexity grows.
esafak
I read it as: Why You Shouldn't Use Prisma and How Cockroach Hung Us Out To Dry
I already knew about prisma from the infamous https://github.com/prisma/prisma/discussions/19748
vvpan
I am in a startup that's using Prisma and we really wish we had not:
- The query objects can become hard to read with anything more or less complex.
- If you need an unsupported Postgres extension you are out of luck.
- One large file in a schema, impossible to shard.
- We have many apps in a monorepo and they cannot have separate Prisma connections because the schema gets baked into "@prisma/client".
Basically the only thing useful about it is the TS types, which is something SQL-builder libraries solve better. Long story short: use Kysely. Prisma provides no value that I see.
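For reference, the typed SQL-builder style being recommended looks roughly like this (a sketch; the users table and its columns are made up):

```ts
import { Kysely, PostgresDialect, Generated } from 'kysely'
import { Pool } from 'pg'

interface UsersTable {
  id: Generated<number>
  email: string
}
interface Database {
  users: UsersTable
}

const db = new Kysely<Database>({
  dialect: new PostgresDialect({ pool: new Pool() }),
})

// Compiles to plain SQL; the result type ({ id: number; email: string }[])
// is inferred from the schema interfaces, with no codegen step.
const users = await db
  .selectFrom('users')
  .select(['id', 'email'])
  .where('email', 'like', '%@example.com')
  .execute()
```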
cess11
"Instead, it sends individual queries and joins the data on the application level. However, this doesn't mean that Prisma's relational queries are per se slower"
Wow, what the fuck.
"Also, this chapter about Query Performance Optimization from the High Performance MySQL book has some great insights. One of the techniques it mentions is JOIN decomposition:
Many high-performance web sites use join decomposition. You can decompose a join by running multiple single-table queries instead of a multitable join, and then performing the join in the application."
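In code, the decomposition the book describes amounts to something like this (a sketch with node-postgres; the users/orders tables are hypothetical):

```ts
import { Pool } from 'pg'

const pool = new Pool()

// The join the database would do:
//   SELECT u.id, u.name, o.id AS order_id, o.total
//   FROM users u JOIN orders o ON o.user_id = u.id WHERE u.active;

// Decomposed into two single-table queries plus an app-level hash join:
const { rows: users } = await pool.query('SELECT id, name FROM users WHERE active')
const { rows: orders } = await pool.query(
  'SELECT id, user_id, total FROM orders WHERE user_id = ANY($1)',
  [users.map(u => u.id)]
)

const ordersByUser = new Map()
for (const o of orders) {
  if (!ordersByUser.has(o.user_id)) ordersByUser.set(o.user_id, [])
  ordersByUser.get(o.user_id).push(o)
}
const result = users.map(u => ({ ...u, orders: ordersByUser.get(u.id) ?? [] }))
```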
This belief that they can write JavaScript that outperforms decades of bare-metal executed optimisations in mainstream database engines is just astonishing.
jjani
> This belief that they can write JavaScript that outperforms decades of bare-metal executed optimisations in mainstream database engines is just astonishing.
In my first job fresh out of uni, I worked with a "senior" backend developer who believed this. He advocated for crusty, bug-ridden ORMs like Sequelize and Prisma (still very early stage back then, so lots more issues than now, though I'd still steer well clear of it), claiming they did "query optimizations". I knew it made zero sense, but also that I wasn't going to be able to convince him.
Tadpole9181
The application joins are (soon to be were) done in a Rust service that's side-loaded with the Node server.
Also, this is an old quote. Databases didn't all support things like JSON at the time, so joins and subqueries presented an N+1 problem and could balloon data fetch requirements fairly easily. Being a GraphQL-focused ORM originally too, this made some sense.
The default is now being changed and correlated subqueries, JOINs, & JSON aggregation will replace the old approach unless explicitly toggled.
cess11
As I understand it the N+1 problem is when you select a set and then perform another query per row in the set. Executing a join is a solution to this, not a cause.
pier25
Prisma is so bad... can you believe it's by far the most downloaded ORM in NPM?
VWWHFSfQ
Every ORM is bad. Especially the "any DB" ORMs. Because they trick you into thinking about your data patterns in terms of writing application code, instead of writing code for the database. And most of the time their features and APIs are abstracted in a way that basically means you can only use the least-common-denominator of all the database backends that they can support.
I've sworn off ORMs entirely. My application is a Postgres application first and foremost. I use PG-specific features extensively. Why would I sacrifice all the power that Postgres offers me just for some conveniences in Python, or Ruby, or whatever?
Nah. Just write the good code for your database.
pier25
I use PG with Entity Framework in .NET and at least 90% of my queries don't need any PG-specific features.
When I need something PG specific I have options like writing raw SQL queries.
Having most of my data layer in C# is fantastic for productivity and in most cases the performance compared to SQL is negligible.
evantbyrne
Nah. The most prolific backend frameworks are all built on ORMs for good reason. The best ones can deserialize inputs, validate them, place those objects directly into the DB, retrieve them later as objects, and then serialize them again, all from essentially just a schema definition. Just to name a few advantages. Teams that take velocity seriously should use ORMs. As with any library choice, you need to carefully vet them though.
CuriouslyC
SQLAlchemy is pretty good, because it's mostly a SQL engine with an ORM bolted on top, and the docs actively point users towards using the SQL engine rather than using the ORM for everything.
reillyse
Every ORM except Active Record is awful. Active Record is amazing.
cess11
That's a tradeoff that sometimes makes sense. MICROS~1 SQL Server heavily leans into the 'use specific features extensively' approach, and countless applications on it consist mainly of stored procedures. It does however cause a lock-in that might not be attractive; your customers might be sensitive to what database engine you run their stuff on, and then you need to figure out the common ground between two or more alternatives and build your application in that space.
It's not as uncommon as one might think, one of the big products in public sector services where I live offers both SQL Server and Oracle as persistence layer so they can't push logic into stored procedures or similar techniques.
But just sketching out some schemas and booting PostgREST might be good enough forever; if that's the case, go for it. As for ORMs, I kind of like how Ecto in Elixir settings does things: it solves a few tedious things like validation and 'hydration', and has a macro DSL for generating SQL with concise expressions.
globular-toast
It's actually even worse than this: many Django applications are straight-up Postgres applications. They use Postgres-specific bits of the ORM without hesitation. So they're learning these weird ORM incantations instead of just learning the underlying SQL, which would be knowledge you could apply anywhere.
People just hate embedding SQL into other languages. I don't know why.
seer
I don't understand the hate. The only truly limiting factor for Prisma right now is its poor support for polymorphism. Apart from that, it has quite good support for complicated index setups, and if you need anything more performant, just drop to typed raw SQL queries. It also supports views (materialized or otherwise) out of the box.
I recently wanted to check it out and wrote a small app that made good use of pgvector for embeddings, plus custom queries with CTEs for a few complex edge cases, and it was all quite smooth.
Now it might not be at the level of active record, ecto or sqlalchemy but it was quite decent.
If you know your SQL, at any point it gives you the option to drop down a level of abstraction, but still keep the types so as not to break the abstraction too much for the rest of the code.
sitharus
I don't hate prisma - it's just a tool - but that's far from the only limiting factor.
I recently looked at migrating a legacy project with basic SQL query generation to a modern ORM. Prisma came up top of course so I tried it.
We use Postgres built-in range types. Prisma does not support these; there's no way to add the type to the ORM. You can add them using "Unsupported", but fields using that aren't available in queries using the ORM, so that's pretty useless.
It also requires a binary to run, which would require different builds for each architecture deployed to. Not a big thing but it was more annoying than just switching the ORM.
That coupled with their attitude to joins - which has truth to it, but it's also short-sighted - eliminated Prisma.
The final decision was to switch to Kysely to do the SQL building and provide type-safe results, which is working well.
vvpan
How do you do typed raw queries?
etblg
> It's true that Prisma currently doesn't do JOINs for relational queries. Instead, it sends individual queries and joins the data on the application level.
..........I'm sorry, what? That seems........absurd.
edit: Might as well throw in: I can't stand ORMs, I don't get why people use it, please just write the SQL.
jjice
I believe it's either released now or at least behind a feature flag (maybe only on some systems). It's absolutely absurd it took so long. I can't believe it wasn't the initial implementation.
Funny relevant story: we got an OOM from a query that we used Prisma for. I looked into it - it was a simple SELECT DISTINCT. Turns out (I believe it was changed like a year ago, but I'm not positive), even distincts were done in memory! I can't fathom the decision making there…
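The contrast, roughly (hypothetical events table; node-postgres):

```ts
import { Pool } from 'pg'

const pool = new Pool()

// Deduplicate in the database: only distinct values cross the wire.
const { rows } = await pool.query('SELECT DISTINCT user_id FROM events')

// Deduplicate in application memory: every row crosses the wire and sits
// in the process heap first -- the OOM recipe described above.
const all = await pool.query('SELECT user_id FROM events')
const distinct = [...new Set(all.rows.map(r => r.user_id))]
```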
etblg
> even distincts were done in memory! I can't fathom the decision making there…
This is one of those situations where I can't tell if they're operating on some kind of deep insight that is way above my experience and I just don't understand it, or if they just made really bad decisions. I just don't get it, it feels so wrong.
seer
Tbh, I once dabbled in building an ORM myself (in PHP) and I did find that in some situations it was faster to do individual queries and then join in code, to solve the N+1 problem.
Granted, I was much worse in my SQL knowledge, and Postgres/MySQL had severe limitations in their query planners, so I can see how something like this could have happened. If they support multiple DBs, and even one has this problem, it might be better (for them) to do it application side.
The specific issue was doing a join with a table for a one-to-many: you get a lot more data from the DB than you would normally need if you do the join the naive way, and if the join is nested you get exponentially more data.
It was faster to do a query for each table separately and then stitch the results.
Now it is easy to solve in PG with nested selects and JSON aggregation, which the PG query planner rewrites to efficient joins, but you still get only the bytes you requested, without duplication.
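That nested-select-plus-JSON-aggregation shape, sketched (hypothetical users/orders tables):

```ts
import { Pool } from 'pg'

const pool = new Pool()

// Each parent row carries its children as one JSON array, so the
// one-to-many join no longer duplicates the parent columns per child row.
const { rows } = await pool.query(`
  SELECT u.id,
         u.name,
         COALESCE(
           (SELECT json_agg(json_build_object('id', o.id, 'total', o.total))
              FROM orders o
             WHERE o.user_id = u.id),
           '[]'::json
         ) AS orders
    FROM users u
`)
```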
pier25
> I can't stand ORMs, I don't get why people use it, please just write the SQL.
I used to agree until I started using a good ORM. Entity Framework on .NET is amazing.
bob1029
> Entity Framework on .NET is amazing.
I disagree. It is probably one of the less terrible ORMs, but it is far from amazing. The object-relational impedance mismatch will always dominate for anything that isn't trivial business logic. EF works great until you need different views of the model. It does support some kind of view-mapping technique, but it's so much boilerplate I fail to see the point.
Dapper + SqlConnection is goldilocks once you get into the nasty edges. Being able to query a result set that always exactly matches your view models is pretty amazing. The idea of the program automagically upgrading and migrating the schemas is something that was interesting to me until I saw what you could accomplish with Visual Studio's SQL Compare tool & RedGate's equivalent. I feel a lot more comfortable running manual schema migrations when working with hosted SQL providers.
tilne
Doesn't Entity Framework have a huge memory footprint too?
ketzo
Not 100% parallel, but I was debugging a slow endpoint earlier today in our app which uses Mongo/mongoose.
I removed a $lookup (the mongodb JOIN equivalent) and replaced it with, as Prisma does, two table lookups and an in-memory join
p90 response times dropped from 35 seconds to 1.2 seconds
lelanthran
> I removed a $lookup (the mongodb JOIN equivalent)
There is no "MongoDB JOIN equivalent" because MongoDB is not a relational database.
It's like calling "retrieve table results sequentially using previous table's result-set" a JOIN; it's not.
nop_slide
Maybe because mongo isn’t ideal for relational data?
lesuorac
Can't speak about Prisma (or Postgres much).
But I've found that you can get better performance with application-level joins than with SQL joins in a _few_ situations, namely when the SQL join causes a table lock: instead of slower-but-parallel application joins you end up with sequential MySQL joins. (The lock also prevents other parallel DB queries, which is generally a bigger deal than whether this one endpoint is faster or not.)
Although I do reach for the SQL join first; if something is slow, then metrics and optimization are necessary.
hermanradtke
In what cases is your join causing a table lock?
compton93
It is. But wait... it doesn't join the data on the application level of your application. You have to deploy their proxy service which joins the data on the application level.
Tadpole9181
It's pretty obvious when somebody has only heard of Prisma, but never used it.
- Using `JOIN`s (with correlated subqueries and JSON) has been around for a while now via a `relationLoadStrategy` setting.
- Prisma has a Rust service that does query execution & result aggregation, but this is automatically managed behind the scenes. All you do is run `npx prisma generate` and then run your application.
- They are in the process of removing the Rust layer.
The JOIN setting and the removing of the middleware service are going to be defaults soon, they're just in preview.
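For the curious, the preview setting looks like this (a sketch; it assumes the relationJoins preview feature is enabled in the Prisma schema, and the user/posts models are made up):

```ts
import { PrismaClient } from '@prisma/client'

const prisma = new PrismaClient()

// 'join' pushes relation loading into one database-side JOIN with JSON
// aggregation; 'query' keeps the older one-query-per-table strategy.
const users = await prisma.user.findMany({
  relationLoadStrategy: 'join',
  include: { posts: true },
})
```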
coverj
I didn't mind Prisma for managing the schema etc., but I've also seen your linked GitHub issue. I found other people recommending combining Prisma with Kysely. I have only used this in toy projects, so take this with a grain of salt.
frollogaston
I'm not the most experienced in huge DBs and can't write anything off, but I've never seen a horizontally sharded DBMS work well, even Citus, which allegedly does. There's always been a catch that seems worse than manually doing sharding at a higher level than your DB, not that that's easy either.
caffeinated_me
I'd argue that horizontally sharded databases can work well, but they do tend to have significant non obvious tradeoffs that can be pretty painful.
There's a handful of companies that have scaled Citus past 1PB for production usage, but the examples I'm aware of all had more engineering to avoid capability or architecture limitations than one might like. I'd love to see someone come back with a fresh approach that covered more use cases effectively.
Disclaimer: former Citus employee
frollogaston
I can imagine it for some constrained use case, but taking your typical RDBMS that's powering a variety of business logic with complex queries, I dunno.
One interesting tradeoff Postgres and MySQL made for efficiency's sake was making transactions not fully serializable by default; instead they guarantee something that's good enough as long as you keep it in mind. Cockroach and Spanner are serializable by default, but that means even if you used those as a single-node DB, they ought to be slower.
banashark
Vitess and planetscale seem to have quite a number of high profile users who have lauded its capabilities. A search through hn history pops up a few.
As someone who has primarily worked with Postgres for relational concerns, I’ve envied the apparent robustness of the MySQL scaling solutions.
ScalaHanSolo
Author here. Yeah, that's not a bad takeaway either. I've also been really vocal in Prisma issues for all sorts of things. We are about to embark on a big migration away from Prisma and onto Drizzle, once the Drizzle team lands 1.0.
We will absolutely share our findings when that migration happens!
RadiozRadioz
That just sounds irresponsible. The correct choice for prod isn't "the cool new trendy thing that will solve all our problems once it hits 1.0", the correct choice is "the boring stable thing that has existed long enough for everyone to understand its shortcomings".
redcobra762
Yes, moving to a freshly 1.0 tool/library is often the best way to gain stability...
sreekanth850
It's wild and hilarious how often startups and companies go for distributed databases like CockroachDB/TiDB/Yugabyte before they actually need distribution; this trend sucks. 100 million rows is nothing that a well-tuned Postgres or MySQL instance (or even a read-replicated setup) can't handle comfortably. Scale when you hit the wall.
Spivak
100M isn't much even for a not-super-well-tuned Postgres.
sreekanth850
Yes, there are multiple steps to consider before jumping to a distributed database, and only when you actually hit bottlenecks: read replication, CQRS, etc. But I guess it's often just about chasing fancy stuff.
simianwords
I don't buy this! Startups do need high availability. If you start having replicas you are already in distributed territory!
sreekanth850
>Startups do need high availability.
HA is important. But Postgres and MySQL both support HA and replication setups without needing to jump straight to distributed SQL (in this context, CockroachDB). We use MySQL InnoDB Cluster + MySQL Router with auto-failover in single-primary mode.
> If you start having replicas you are already in distributed territory.
But it's not the same as a distributed database with quorum writes, global consensus, and cross-region latencies. Those systems are built for horizontal write scaling, and they come with added complexity and cost that most apps don't need.
simianwords
do you still not need quorum writes?
GaryNumanVevo
It's much simpler to have a single-master, multi-replica setup than a multi-master one.
eftpotrm
I can't help thinking more startups need greybeards around. (Of which, realistically, I'm now one.)
Largest table 100 million rows and they were paying 6 figures for database services annually? I have one now that sits happily enough on an 8-year-old laptop. I've worked on systems that had similar-scale tables chugging along on very-average-for-20-years-ago MSSQL 2000 boxes. There just isn't a need for cloud-scale systems and cloud-scale bills for that data volume.
The problems they're describing should never have got that far without an experienced hand pointing out they didn't make sense, and if they'd hired that greybeard they'd have spotted it long before.
abraxas
> and they were paying 6 figures for database services annually?
Might have been monthly.
100,000,000 rows is what I handled on a single Sun server in 2001 with Sybase, no problemo.
gilbetron
The answer to the question, "what database should I use?" is "postgres". If you are in a situation where postgres actually won't work, then you already would know exactly why postgres won't work.
In other words: [Postgres -> exotic solution] is the path everyone should take (and 99% will just stay in postgres), and not [exotic solution -> postgres].
abraxas
Yes, the NoSQL fad that swept the industry was nearly as insufferable as the SPA craze that followed alongside. Now everyone's back to tried and true. Most data once more sits in an RDBMS and most HTML gets rendered on the server.
Us grizzled genX devs saw this coming a decade ago.
gilbetron
As a grizzled genX dev myself, we are in a different situation now - "nosql" (hate the term) has tremendous use cases, it's just that most people aren't creating something that requires it. It was a natural exploration of the tools, something that should be encouraged. "I knew it all along" isn't an attitude I find helpful or effective. My "grizzled genX dev" attitude is that nearly all people think they know what is going to happen or what is the best route, and they are almost always entirely wrong. We only find out by trying a bunch of things.
In other words, there are many companies currently worth $Billion+ that wouldn't have succeeded had they followed your advice. Today, with incredibly powerful infra of all types available, starting with Postgres is almost always the right step unless you know, know, better. That wasn't the case 10+ years ago.
etler
I've lost count of how many "Migrating from X to Postgres" articles I've seen.
I don't think I've once seen a migrating away from Postgres article.
delish
Related: Oxide's podcast, "Whither CockroachDB," which reflects on experience with postgres at Joyent, then the choice to use cockroach in response to prior experiences with postgres.
https://www.youtube.com/watch?v=DNHMYp8M40k
I'm trying to avoid editorializing in my above summary, for fear of mischaracterizing their opinions or the current state of postgres. Their use of postgres was 10 years ago, they were using postgres for a high-availability use case -- so they (and I) don't think "postgres bad, cockroach good." But like Bryan Cantrill says, "No one cares about your workload like you do." So benchmark! Don't make technical decisions via "vibes!"
betaby
speed_spread
It's a very Uber thing to do to enter a one way from the wrong end.
DonHopkins
At least they have the moxie to flip off people going the right way.
psionides
Yeah so there's basically just that one ;)
yakkomajuri
I think your point still stands, and I'm a big Postgres advocate/user myself btw.
But yeah we did migrate our _analytics_ data to ClickHouse (while still keeping Postgres for more transactional stuff) back when I was at PostHog.
Writeup: https://posthog.com/blog/how-we-turned-clickhouse-into-our-e...
mplanchard
We also did this, using change data capture and kafka to stream data to clickhouse as it gets written to postgres.
Clickhouse is incredible tech. We’ve been very pleased with it for OLAP queries, and it’s taken a lot of load off the postgres instance, so it can more easily handle the very high write load it gets subjected to.
evanelias
Not an article, and I have no direct knowledge of this either way, but I would strongly suspect that Instagram migrated off Postgres a while back. Probably to fb-mysql + myrocks, or some other RocksDB based solution.
The compression level is vastly superior to any available Postgres-based solution, and at Instagram’s scale it amounts to extremely compelling hardware cost savings.
Also if they were still primarily on pg, it would be one of the largest pg deployments in existence, and there would be obvious signs of the eng impact of that (conference talks, FOSS contributions, etc).
Bigger-picture: Postgres is an amazing database, and it’s often the right choice, but nothing in tech is always the best choice 100% of the time. There’s always trade-offs somewhere.
rakejake
Probably a corollary of the fact that most use cases can be served by an RDBMS running on a decently specced machine, or on different machines by sharding intelligently. The number of use cases for actually distributed DBs and transactions is probably not that high.
sa46
I helped with the initial assessment for a migration from Postgres with Citus to SingleStore.
notTooFarGone
We migrated from Postgres to ADX based on a cost analysis of the managed version on Azure.
Now we have lovely KQL queries, and we're pretty much starting over with Postgres again...
yen223
I have participated in a Postgres -> Clickhouse migration, but I haven't bothered writing an article about it.
I_am_tiberius
The entire database? Isn't that very limiting due to slow write speeds in ClickHouse? I saw CH more as a DB for mainly read activity.
jacobsenscott
CH excels at extremely high volume writes. You probably can't throw enough data at it.
hobs
It still makes me sad when half the queries I see are json_* - I know it's far too late, but a big sad trombone in query performance is constantly left joining to json-expanding subqueries that the planner is going to estimate at 100 rows forever.
panzi
Not sure why those are json_agg() instead of array_agg() in that example. Why would you use a JSON array instead of a native, properly typed array? Yes, if you have some complex objects for some reason, you can use JSON objects. But those were all just arrays of IDs. Also, why was it json_agg() and not jsonb_agg()? Is there any reason to use JSON over JSONB in PostgreSQL?
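The difference in a sketch (hypothetical users/orders tables):

```ts
import { Pool } from 'pg'

const pool = new Pool()

const { rows } = await pool.query(`
  SELECT u.id,
         array_agg(o.id) AS order_ids,      -- native bigint[], stays typed
         jsonb_agg(o.id) AS order_ids_json  -- JSONB array of numbers
    FROM users u
    JOIN orders o ON o.user_id = u.id
   GROUP BY u.id
`)
```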
renhanxue
If you, for whatever obscure reason, need to preserve whitespace and key ordering, that is you want something that is effectively just a text column, then you should use JSON over JSONB.
I can't think of any case at all, no matter how contrived, where you'd want to use the non-B versions of the JSON aggregate functions though.
paulryanrogers
The non-B JSON can take up less space on disk and has lower write-time overhead.
bastawhiz
If the queries are sensible, you can always create indexes that index on the queried expressions.
https://www.postgresql.org/docs/current/indexes-expressional...
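A sketch of that, for a query filtering on a JSON field (hypothetical orders table with a jsonb payload column):

```ts
import { Pool } from 'pg'

const pool = new Pool()

// The index stores the computed expression, so the query below can use it.
// CONCURRENTLY avoids blocking writes while the index builds.
await pool.query(`CREATE INDEX CONCURRENTLY IF NOT EXISTS orders_status_idx
  ON orders ((payload->>'status'))`)

const { rows } = await pool.query(
  `SELECT id FROM orders WHERE payload->>'status' = $1`,
  ['open']
)
```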
NegativeLatency
Hoping for easier columnar support in databases; the lack of it is one of the things that can lead you to storing JSON in database columns (if your data is truly columnar).
Currently the vendor lock-in or the requirement to install plugins makes it hard to do with cloud SQL providers. It's especially hard since by the time it's a problem you're probably at enough scale to make switching DBs/vendors hard or impossible.
hobs
How does columnar = JSON? JSON isn't columnar at all... If you just want to have a schema in JSON instead of SQL, use a NoSQL DB; Postgres's NoSQL features are strong, but the DB features are actually much stronger.
NegativeLatency
JSON isn't necessarily columnar, but it is a natural fit for stuff that is columnar that's otherwise harder to model in a traditional relational DB.
Here's my use case:
- we have a bunch of attributes (all named differently by customer, and many different values for each record that a customer stores)
- it's a fairly natural fit for a JSON value with only one level of key: value mapping
- we use MySQL on GCP (no columnar plugins, too hard to switch to Postgres)
Someone could go back in time and correctly model it as columns and not JSON, but that ship has sorta sailed. While it's not impossible to change, it would be pretty hard; time will tell if that ends up happening.
I would love to be able to tell MySQL "put this column in a columnar engine and use that when I query on it" (AlloyDB is this for Postgres on GCP).
moonikakiss
great point.
with pg_mooncake v0.2 (launching in ~a couple of weeks), you'll be able to get a columnar copy of your Postgres that's always synced (sub-second freshness).
Keep your write path unchanged, and keep your Postgres where it is. Deploy Mooncake as a replica for the columnar queries.
NegativeLatency
I wish; someone unfortunately picked MySQL
robertlagrant
For all the Prisma-haters: I salute you. But I want to reply to numerous comments with the following:
ORMs come in two main types, that I'm aware of: Active Record (named after the original Ruby one, I think) and Data Mapper (think Hibernate; SQLAlchemy).
Active Record ORMs are slightly more ergonomic at the cost of doing loads of work in application memory. Data Mapper looks slightly more like SQL in your code but are much more direct wrappers over things you can do in SQL.
Data Mapper also lets you keep various niceties such as generating migration code, that stem from having your table definition as objects.
Use Data Mapper ORMs if you want to use an ORM.
arunix
Also, the Query Object style, e.g. JOOQ and SQLAlchemy Core
felipemesquita
Rails’ Active Record was named after the pattern as described by Martin Fowler:
robertlagrant
Ah - the reverse! Thanks.
optician_owl
It's an awkward article. To answer why a query is slow you need a bit more detail than just the query. Also, I reread the part about timeouts and didn't get it: what was the database, was it a database issue, and how was it related to the migration?
The only information I could extract was that the company made bad architectural decisions, believes in ORMs (looking at the queries, there are many doubts that the data layouts in the DB are adequate) and cannot clearly explain situations. But this is only interesting to their candidates or investors.
It may sound rude, so I apologise.
compton93
I'm curious about Motion's experience with "Unused Indices". They suggest Cockroach's dashboard listed used indexes in the "Unused Indices" list.
I think the indexes they suspect were used really were unused, but Motion didn't realize CockroachDB was doing zigzag joins on other indexes to accomplish the same thing, leaving the indexes that would obviously be used as genuinely not used.
It's a great feature, but CRDB's optimizer would prefer a zigzag join over a covering index; getting around this required writing indexes in a way that persuades the optimizer not to plan a zigzag join.
from-nibly
Feels like Postgres is always the answer. I mean, there's gotta be some edge case somewhere where Postgres just can't begin to compete with other, more specialized databases, but I'd think that going from Postgres to something else is much easier than the other way around.
graealex
It's not like PostgreSQL hasn't been in development for close to 30 years, covering basically every use case imaginable just through millions of deployments.
In addition, SQL in itself is a proven technology. The reality is that most problems you might think about solving with specialized databases (Big Data™ etc.) could probably easily be solved with your run-of-the-mill RDBMS anyway, if more than five minutes are spent on designing the schema. It's extremely versatile, despite just being one layer above key-value storage.
999900000999
Depends.
If you want to fully embrace the vibe, tables are difficult.
Even before LLMs, I was at a certain company that preferred MongoDB so we didn’t need migrations.
Sometimes you don’t care about data structure and you just want to toss something up there and worry about it later.
Postgres is the best answer if you have a solid team and you know what you’re doing.
If you want to ride solo and get something done fast, Firebase and its NoSQL cousins might be easier.
pojzon
I really enjoy this comment.
> Postgres is the best answer if you have a solid team and you know what you’re doing.
Not every type of data simply fits into the relational model.
Example: time series data.
So depending on your model - pick your poison.
But for relational models, there is hardly anything better than postgres now.
It makes me happy coz I always rooted for the project from the early 2000s.
freilanzer
Even for timeseries there is https://github.com/timescale/timescaledb. Haven't used it, just knew it existed.
999900000999
Don't get me wrong, Postgres is awesome when things work.
But, for example, I was working on a .NET project and Entity Framework decided it couldn't migrate Postgres tables correctly.
I'm not a SQL person; at this point my options are to drop the tables and let .NET recreate them, or try to write my own migrations.
This just isn't an issue with Firebase. I can add all the fields I want directly on the client.
SchemaLoad
What situations do you encounter where you don't care about the structure of the data? The only ones I've ever encountered have been logging, where its only purpose is to be manually text-searchable, and something like OpenStreetMap, where everything is just a key-value store and the structure is loosely community-defined.
As soon as you have a loosely defined object you can't access any specific keys, which makes it useless for 99% of the times you want to store and retrieve data.
999900000999
You define the data schema client side.
That's the entire idea behind Firebase. It makes prototyping much faster. I don't know how well it scales, but it works for most smaller projects.
jacobsenscott
PG requires a lot of expertise to keep running when you get to a billion rows or massive ingest. It can do it, but it doesn't just do it out of the box running the defaults.
graealex
Hopefully at 1B records, you have a business model that allows you to spend some money on either hardware or talent to solve this problem.
sgarland
Unfortunately, most places go with hardware first, so the problem grows larger and larger. When they do finally hire someone who knows how to address it, it’s a monstrous undertaking that dev teams are reluctant to do, because it’s a lengthy pause on feature development (so really, PMs are reluctant), and also future velocity is typically somewhat slower – turns out proper RDBMS data modeling is rigid, and doesn’t suffer fools.
mdaniel
There's a gist that shows up in these threads https://gist.github.com/cpursley/c8fb81fe8a7e5df038158bdfe0f...
But while digging that up it seems there is one with more colors: https://postgresforeverything.com/
And one for the AI crowd https://github.com/dannybellion/postgres-is-all-you-need#pos...
coolcase
I hear MySQL can be better for some workloads?
graealex
Till about yesterday, MySQL was little more than a CSV file on wheels, and even its current popularity is mostly driven by two decades of LAMP.
You can be sure that PostgreSQL will be applicable for any work load that MySQL can handle, but not necessarily the other way round, so if you actually have the freedom to make a choice, go with PostgreSQL.
In particular, because PostgreSQL has a lot more features, people assume that the other one, lacking those features and their associated complexity, must automatically be faster. That isn't true, neither generally nor in special cases, since it can go either way: your particular query might just run 4x faster on PostgreSQL. There is also no universal way to benchmark performance, since every database and query will have completely different characteristics.
evanelias
> Till about yesterday, MySQL was little more than a CSV file on wheels
That's utter nonsense. What are you even referring to here?
InnoDB is a rock-solid OLTP storage engine, and it's been MySQL's default since 2010. A very large percentage of the global economy relies on InnoDB, and has for quite some time.
JohnCClarke
<Not intending to start a flamewar...> MySQL shines for simple OLTP. My PostgreSQL friends all have "paged at 3am to vacuum the database" war stories. I like simple and dumb for transactional data and MySQL delivers.
moonikakiss
great blog. It seems like you might benefit from columnar storage in Postgres for that slow query that took ~20 seconds.
It's interesting that people typically think of columnstores for strict BI / analytics. But there are so many App / user-facing workloads that actually need it.
ps: we're working on pg_mooncake v0.2. create a columnstore in Postgres that's always consistent with your OLTP tables.
It might help for this workload.
I_am_tiberius
That sounds awesome. Are you saying you still use your normal OLTP table for writing data and the columnstore table is always in sync with that OLTP table (that's fantastic)? I read it works with DuckDB - how does it work? I guess there's no chance this is going to be available on Azure Flexible Server anytime soon.
moonikakiss
exactly. We take the CDC output / logical decoding from your OLTP tables and write into a columnar format with sub-second freshness.
We had to design this columnstore to be 'operational' so it can keep up with changing oltp tables (updates/deletes).
You'll be able to deploy Mooncake as a read-replica regardless of where your Postgres is. Keep the write path unchanged, and query columnar tables from us.
--- v0.2 will be released in preview in ~a couple weeks. stay tuned!
I_am_tiberius
Ah, I see. So there's a replication process similar to ClickHouse's MaterializedPostgreSQL. Ideally, there would be functionality allowing a columnstore query to wait until all writes to the OLTP tables, up to the query's execution time, are available. This would make the system truly Postgres-native and address issues that no other system currently solves.
I_am_tiberius
A follow up question: You can't join columnar tables with OLTP tables, right?
moonikakiss
Yes you can. Even if the columnar tables are in the read replica, you'll be able to do joins with your OLTP tables.
compton93
What are your thoughts on Fujitsu's VCI? I typically work on ERPs, but I'm always advocating to offload the right queries to columnar DBs (not for DB performance but for end-user experience).
> By Jan 2024, our largest table had roughly 100 million rows.
I did a double take at this. At the onset of the article, the fact they're using a distributed database and the mention of a "mid 6 figure" DB bill made me assume they had some obscenely large database, far beyond what a single node could do. They don't detail the Postgres setup that replaced it, so I assume it's a pretty standard single primary, and a 100 million row table is well within the abilities of that; I have a 150 million row table happily plugging along on a 2 vCPU + 16 GB instance. Apples and oranges, perhaps, but people shouldn't underestimate what a single modern server can do.