PostgreSQL Full-Text Search: Fast When Done Right (Debunking the Slow Myth)
89 comments
April 9, 2025
lostb1t
just an fyi: The blog link in your readme does not work.
stuhood
Thanks for reporting this! I'm having trouble finding the link you are referring to though. Would you mind sharing a link to the file/page containing the dead link?
cryptonector
From the blog about pg_search linked by TFA:
This is what we did:
DB with pg_search: We created a single BM25 index
DB without pg_search: We created all these indexes
GIN index on message (for full-text search)
GIN index on country (for text-based filtering)
B-tree indexes on severity, timestamp, and metadata->>'value' (to speed up filtering, ordering, and aggregations)
See the problem? You didn't create an index on the vector in the without-pg_search case. You didn't compare apples to apples. TFA is all about that. Perhaps you can argue that creating a fastupdate = on index would have been the right comparison, but you didn't do that in that blog.
> You can always cherry-pick a query and optimize it at the expense of data duplication and complexity. The Neon/ParadeDB benchmarks contained 12 queries in total, and the benchmarks could have:
TFA isn't cherry-picking to show you that one query could have gone faster. TFA is showing that you didn't compare apples to apples. Looking at those 12 queries nothing screams at me that TFA's approach of storing the computed tsvector wouldn't work for those too.
Perhaps pg_search scales better and doesn't require trading off update for search performance, and that would be a great selling point, but why not just make that point?
supermatt
Why so angry?
> "You didn't ..."
No, they didn't. They aren't Neon and didn't do the benchmarks in the linked article. They are a pg_search maintainer.
If you actually read their comment instead of raging you will see that they are saying that pg_search is a simple generic index definition that makes a _variety_ of queries work with little effort, and you can still add the additional optimisations (that are already documented - which they linked to) where needed.
wodenokoto
Are you sure parent is the author of that blog post?
Maybe I’m reading the whole thread wrong, but it looks like you are screaming at a maintainer of pg_search that someone else did a poor benchmark
moogleii
There is an art to communicating that I think people learn around their college years...
danpalmer
> Mistake #1: Calculating tsvector On-the-Fly (Major issue)
I'm shocked that the original post being referred to made this mistake. I recently implemented Postgres FTS in a personal project, and did so by just reading the Postgres documentation on FTS and following its instructions. The docs lead you through the process of creating the base unoptimized case and then optimising it, explaining the purpose of each step and why it's faster. It's really clear that that's what they're doing, and I could only assume that someone making this mistake is either doing so to intentionally misrepresent Postgres FTS, or because they haven't read the basic documentation.
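For context, the naive version the docs start from looks roughly like this (a sketch; the table and column names are assumed from the benchmark discussed in the thread):
-- Unoptimized baseline: the tsvector is rebuilt for every row scanned,
-- and without an index the whole table has to be read.
SELECT id, message
FROM benchmark_logs
WHERE to_tsvector('english', message) @@ to_tsquery('english', 'research')
LIMIT 10;
The docs then walk you through precomputing and indexing that tsvector, which is exactly the step the benchmark skipped.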
aobdev
This is not my area of expertise so take this with a grain of salt, but my initial instinct was to question why you would need to store the tsvector both in the table and in the index (because the tsvector values will in fact be stored losslessly in a GIN index).
The PG docs make it clear that this only affects row rechecks, so this would only affect performance on matching rows when you need to verify information not stored in the index, e.g. queries with weighted text or queries against a lossy GiST index. It's going to be use-case dependent but I would check if your queries need this before using up the additional disk space.
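For example, a ranked query along these lines (a sketch with assumed names) has to produce the tsvector for every matching row, so with an expression-only index it gets recomputed per match, whereas a stored column avoids that:
-- ts_rank needs the tsvector value itself, not just the index hit.
SELECT id,
       ts_rank(to_tsvector('english', message), query) AS rank
FROM benchmark_logs,
     to_tsquery('english', 'research') AS query
WHERE to_tsvector('english', message) @@ query
ORDER BY rank DESC
LIMIT 10;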
sgarland
If only Postgres had Virtual Generated Columns. Not being snarky; MySQL has had them for ages, and they are a perfect fit for this: takes up essentially zero disk space, but you can index it (which is of course stored).
It is, in my mind, the single biggest remaining advantage MySQL has. I used to say that MySQL’s (really, InnoDB) clustering index was its superpower when wielded correctly, but I’ve done some recent benchmarks, and even when designing schema to exploit a clustered index, Postgres was able to keep up in performance.
EDIT: the other thing MySQL does much better than Postgres is “just working” for people who are neither familiar with nor wish to learn RDBMS care and feeding. Contrary to what the hyperscalers will tell you, DBs are special snowflakes, they have a million knobs to turn, and they require you to know what you’re doing to some extent. Postgres especially has the problem of table bloat and txid buildup from its MVCC implementation, combined with inadequate autovacuum. I feel like the docs should scream at you to tune your autovacuum settings on a per-table basis once you get to a certain scale (not even that big; a few hundred GB on a write-heavy table will do). MySQL does not have this problem, and will happily go years on stock settings without really needing much from you. It won’t run optimally, but it’ll run. I wouldn’t say the same about Postgres.
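As a rough illustration of the MySQL feature (a sketch for 5.7+; names are illustrative and assume metadata is a JSON column, mirroring the benchmark's metadata->>'value' filter):
-- The VIRTUAL column itself takes essentially no table space,
-- but the secondary index built on it is stored and usable by the planner.
ALTER TABLE benchmark_logs
  ADD COLUMN metadata_value VARCHAR(255)
    GENERATED ALWAYS AS (metadata->>'$.value') VIRTUAL;

CREATE INDEX idx_metadata_value ON benchmark_logs (metadata_value);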
charettes
Virtual generated columns are not required to allow an index to be used in this case without incurring the cost of materializing `to_tsvector('english', message)`. Postgres supports indexing expressions, and the query planner is smart enough to use such an index when the query contains an exact match of the indexed expression.
I'm not sure why the author doesn't use them but it's clearly pointed out in the documentation (https://www.postgresql.org/docs/current/textsearch-tables.ht...).
In other words, I believe they didn't need a `message_tsvector` column and creating an index of the form
CREATE INDEX idx_gin_logs_message_tsvector
ON benchmark_logs USING GIN (to_tsvector('english', message))
WITH (fastupdate = off);
would have allowed queries of the form
WHERE to_tsvector('english', message) @@ to_tsquery('english', 'research')
to use the `idx_gin_logs_message_tsvector` index without materializing `to_tsvector('english', message)` on disk outside of the index. Here's a fiddle supporting it: https://dbfiddle.uk/aSFjXJWz
VoVAllen
It's coming in Postgres 18. https://www.depesz.com/2025/02/28/waiting-for-postgresql-18-...
CodesInChaos
I hope OrioleDB will succeed in replacing Postgres' high maintenance storage engine with something that just works.
mastax
MySQL logical replication isn’t quite foolproof but it’s vastly easier than anything PostgreSQL offers out of the box. (I hope I’m wrong!)
brightball
I mean, technically any database with triggers can have generated columns, but PostgreSQL has had generated columns since version 12. Current version is 17.
https://www.postgresql.org/docs/current/ddl-generated-column...
I can’t think of any advantage of a virtual generated column over a generated column for something like a search index where calculating on read would be very slow.
Postgres has been able to create indexes based on the output of functions forever though, which does the job here too.
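For something like the benchmark table, the stored-generated-column variant is roughly this (a sketch assuming the thread's table/column names; needs Postgres 12+):
-- The column is kept in sync automatically on write and stored on disk;
-- the GIN index is then built on the column rather than on an expression.
ALTER TABLE benchmark_logs
  ADD COLUMN message_tsv tsvector
    GENERATED ALWAYS AS (to_tsvector('english', message)) STORED;

CREATE INDEX idx_gin_logs_message_tsv
  ON benchmark_logs USING GIN (message_tsv);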
senorrib
That’s just syntax sugar for a trigger. Not really a big advantage.
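For reference, the trigger it sugars over looks roughly like this (a sketch using the built-in tsvector_update_trigger from the docs, assuming a plain tsvector column named message_tsv rather than a generated one):
-- Keeps message_tsv in sync with message on every INSERT/UPDATE.
CREATE TRIGGER logs_tsv_update
  BEFORE INSERT OR UPDATE ON benchmark_logs
  FOR EACH ROW
  EXECUTE FUNCTION tsvector_update_trigger(message_tsv, 'pg_catalog.english', message);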
codesnik
It's not only the additional disk space; you also need to keep it in sync with the main column, using triggers or whatever, and your backups get much bigger. Why only an "initial instinct to question" it, though? I still don't see downsides, unless yeah, weighted queries or the need to search across joined tables, etc.
aobdev
It's a generated column, so there's no overhead to keep it up to date, but all generated columns in PG are stored. The corpus for text search will be stored 1x in the source column, 1x as a tsvector in the index, and an additional 1x in the generated column if you do so. That's a 50% increase in disk space.
cryptonector
That is definitely an issue. And that seems like a win for pg_search. And as siblings note PG 18 will have virtual generated, indexable columns, so that advantage for pg_search will go away.
brightball
I’ve been a Postgres FTS advocate for over a decade since replacing a Solr search with it and getting easier maintenance, more flexibility with queries and virtually no difference in speed.
It’s pretty great.
Elastic is on a different level for a lot of use cases, but pg is more than enough for the vast majority of workloads.
therealdrag0
What’s the biggest scale you’ve used Postgres search for?
brightball
A site with about 300,000 users where we were still scaling it vertically.
DoctorOW
> I could only assume that someone making this mistake is either doing so to intentionally misrepresent Postgres FTS, or because they haven't read the basic documentation.
Small writing note, I probably would've swapped the order of those. Hanlon's Razor and all. :)
timClicks
Perhaps the most generous interpretation is that the authors were writing an article for people who do the naïve thing without reading the docs. There are quite a few people in that category.
fnord123
> I could only assume that someone making this mistake is either doing so to intentionally misrepresent Postgres FTS, or because they haven't read the basic documentation.
vibe sysadminning, bro
ltbarcly3
I first used pg full text in around 2008. I've also used SOLR and ElasticSearch to power search and recommendation in substantial products.
The issue I have had with postgres full text search isn't that it's too slow, it's that it's too inflexible. It's a nice way to add simple search to fields but poor if you want to tune the search at all. Even allowing for general substrings is too much to ask, even allowing for custom tokenization is too much to ask. There's no tokenization pipeline to speak of unless you want to write c extensions (which of course you can't do for hosted databases anyway). Solr and Elasticsearch let you set up very complex indexes and search processing via configuration. There's absolutely nothing that would prevent postgres from adopting a lot of this capability, but instead postgres offers literally NOTHING. I get the impression that most of the developers for postgres full text haven't spent much time with other solutions as from previous discussions they don't really understand what I mean when I talk about tokenization and filter setup, and they don't really understand why this is a deal-breaker even for very simple applications. Postgres just splits on whitespace (and lets you basically manually use stopwords and stemming, which is crap). There is really no way to concatenate fields in a clean way into a single index, which again makes it extremely annoying to work with. There's no way to score searches based on field weighting or really any other kind of weighting beyond BM. Compared to the alternatives it's a toy system.
nattaylor
I wish there were some explain plans in either post, since I don't get what's going on.
If the query uses the index, then the on-the-fly tsvector rechecks only happen on the matches, and the benchmark queries have LIMIT 10, so few rechecks, right?
Edit: yes, but the query predicates have conditions on two GIN indexes, so I guess the planner chooses to recheck all the matches for one index first, even though it could avoid a lot of work by rechecking row-wise.
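Something like this would confirm it (a sketch; the country/timestamp predicates are guesses based on the benchmark's index list):
-- Look at which indexes are chosen and at "Rows Removed by Index Recheck"
-- on the Bitmap Heap Scan node.
EXPLAIN (ANALYZE, BUFFERS)
SELECT id, message
FROM benchmark_logs
WHERE to_tsvector('english', message) @@ to_tsquery('english', 'research')
  AND country = 'US'
ORDER BY timestamp DESC
LIMIT 10;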
nostrebored
I'm legitimately curious -- why do people want to put EVERYTHING into postgres? I don't understand this trend (vector search, full text search, workload orchestration, queues, etc.)
simonw
I've built a number of systems that run a database and a separate search index (Elasticsearch, Solr, Xapian). The hardest part by far is keeping the search index in sync with the database. I gave a talk about this a while ago: https://simonwillison.net/2017/Aug/16/denormalized-query-eng...
Using the search engine built into PostgreSQL, MySQL or SQLite makes this problem SO MUCH less difficult.
sroussey
Yes. Scale matters here. If it’s just me trying to get shit done, then one system (database, monorepo, whatever) is 100% the best. I’ve done the multiple system thing at scale, and it’s fine if I have the liquidity of engineers to do it and the cost is warranted.
Bundling everything into one system will eventually fall apart. But it’s sooo good while you can do it.
And I am decades past the point where I introduce new shit just to learn it under the guise of “needing” it. Instead I’ll introduce new things I want to learn under the guise of new things I want to learn and it will find the appropriate place (often nowhere, but you never know).
throwaway7783
Doing everything in postgresql need not necessarily mean doing it in one database server. We could have different servers meant for search, queues, analytics (columnar) etc, all replicated (pg native replication) from a vanilla transactional postgresql server.
Now applications need only one technology, not necessarily one server.
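A minimal sketch of that pattern with native logical replication (names are illustrative):
-- On the transactional primary: publish the tables the search server needs.
CREATE PUBLICATION search_feed FOR TABLE benchmark_logs;

-- On the search-tuned server: subscribe, then build FTS/GIN indexes locally,
-- so heavy index maintenance never slows down writes on the primary.
CREATE SUBSCRIPTION search_sub
  CONNECTION 'host=primary.example.com dbname=app user=replicator'
  PUBLICATION search_feed;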
jillesvangurp
You are right that this is the hardest, and most important, thing in search to get right. It's usually referred to as ETL: extract, transform, load. Extract: for each thing, get it out of its source and put it somewhere for processing. Transform: for each thing, process it by applying some algorithm or algorithms in one or more steps. Load: for each thing, shove it into your store. It's the transform part that is important. Extract and Load are kind of trivial to implement usually; I've seen decent implementations of only a few lines of code. Transform is application-specific business logic. E and L are just simple plumbing.
What you query on is not the same as what you store in your DB. And it can be expensive to calculate and re-calculate. Especially at scale. And iterating over all your stuff can be challenging too. It requires IO, memory, CPU, etc. Your application server is the wrong place. And so is your main application database.
The challenge with search is that querying just gets a lot easier if you calculate all the expensive stuff at index time rather than at query time. Different tokenization strategies for different languages, calculating things like page rank, normalization, tokenization, semantic vectors, enriching data with other data (including denormalizing things from other data sources), etc. There are a lot of tricks you can use to make stuff easier to find.
Foregoing all of that indeed makes things simpler and faster. But your search quality will probably suffer. And if you aren't measuring that to begin with, it is probably not great. Doing all these things on write in your main database schema is going to cause other issues (slow writes, lots of schema migrations, complicated logic around CRUD, etc.). The rookie mistake with ETL is just joining the three steps into one thing that then becomes hard to run, evolve, and scale. I see that with a lot of my clients. This is textbook "doing it wrong". It's usually neither fast nor very good at search.
Even if you are going to use postgresql as your main search index, you are probably doing it wrong if your search table/schema isn't decoupled from your main application database via some ETL pipeline. That logic has to live somewhere. Even if it is a bit of a simplistic/limited "do everything on INSERT" kind of thing. That's going to hold back your search quality until you address it. There is no magic feature in postgresql that can address that. Nor in Elasticsearch (though it comes with way more features for this).
I've worked with postgresql's FTS a few times. It's pretty limited compared to Elasticsearch. Anybody running performance benchmarks should be running quality benchmarks instead. Being fast is easy if you skip all the difficult stuff. Being high quality and fast is a big challenge. And it's a lot easier with proper tools and a proper ETL pipeline.
And indeed engineering that such that the two stay in sync requires knowing how to engineer that properly. I usually start with that when I consult clients looking to level up their home grown search solutions to something a bit better.
Of course if you do ETL properly, having your DB and search index in the same place stops making sense. And if you are going to separate them, you might as well pick something more optimal for the job. There are a lot of decent solutions out there for this.
bmelton
If you can avoid adding an extra service without paying too much penalty, it means not having to acquire an extra skill or hire another devops person or keep yet another service in sync / maintained / etc.
The cost of adding services to an app is so much higher than people give it credit for at organizations of every size, it's shocking to me that more care isn't done to avoid it. I certainly understand at the enterprise level that the value add of a comprehensive system is worth the cost of a few extra employees or vendors, but if you could flatten all the weird services required by all the weird systems that use them in 30,000+ employee enterprises and replace them with one database and one web/application server, you'd probably save enough money to justify having done it.
xmcqdpt2
Where I work, we did an inventory of our systems a few years back and found that we had about the same number of databases (not tables!) as employed engineers, counting all deployed (QA and prod) instances.
The team on that inventory project obviously created a new database to put their data in, plus QA and test replicas. They (probably) have since moved to another DB system but left the old ones running for legacy applications!
hobs
Hah, that's table stakes - I have definitely worked at companies with 100 or 1000x the database to engineer ratio.
Depending on your database system, it may even have a 1:1 equivalency with Schemas (MySQL.)
bvirb
We've been using Elasticsearch + PG and it's pretty nice and fast, but it adds a whole layer of extra stuff to deal with when your data is in PG but then also needs to be indexed into an external server outside of those PG transactions. In our case I'm pretty convinced it hasn't been worth the effort. I think we could've optimized PG to be as fast as we needed with a lot less overhead than dealing with an external search index.
We moved our queues to PG and it cuts out the same kind of overhead to be able to wrap an update and start a job in a transaction. PG has been plenty fast to keep up with our queue demand.
Ultimately I think being able to do things transactionally just avoids a whole class of syncing issues, which are basically caching issues, and cache invalidation is one of the 2 hard things.
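That pattern is roughly the following (a sketch with illustrative table and column names):
-- Enqueue atomically with the data change it belongs to.
BEGIN;
UPDATE accounts SET status = 'pending_welcome' WHERE id = 42;
INSERT INTO jobs (kind, payload) VALUES ('send_welcome_email', '{"account_id": 42}');
COMMIT;

-- Worker: claim one job without blocking other workers.
BEGIN;
SELECT id, kind, payload
FROM jobs
WHERE done = false
ORDER BY id
LIMIT 1
FOR UPDATE SKIP LOCKED;
-- ... process the job, mark it done, then COMMIT.
COMMIT;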
_joel
.. along with naming things and off by one errors.
bob1029
You can abstract this to any RDBMS, and the justification is that it makes everything a lot faster & easier.
I just got off a call with a client where their developers were using ORM-style abstractions to manipulate data for downstream processing in code, turning what should have been a few seconds of one custom SQL command into several hours of passing objects around multiple computer systems.
If we can put the FTS engine inside the SQL engine, we can avoid the entire space of APIs, frameworks, 3rd parties, latency, etc that goes along with otherwise integrating this feature.
Modern SQL dialects represent universal computation over arguably the best way we know how to structure complex data. Adding custom functions & modules into the mix is mostly just syntactic sugar over what is already available.
There is ZERO honor in offloading what could have been a SQL view into a shitty pile of nested iterators somewhere. I don't understand where any of this energy comes from. The less code the better. It's pure downside.
xmcqdpt2
> There is ZERO honor in offloading what could have been a SQL view into a shitty pile of nested iterators somewhere. I don't understand where any of this energy comes from. The less code the better. It's pure downside.
I wholeheartedly agree with you. As to why we use ORMs, the impression I get from the engineers I work with is that many of them a) don’t know SQL and b) feel like it’s “data analyst” stuff and so beneath them to learn it. Real engineering requires objects and inheritance or structs, pointers and arrays (depending on the engineer).
I think it’s the declarative nature of SQL that turns them off.
cryptonector
This. ORMs by and large suck.
Cthulhu_
I experienced it firsthand again a while ago, but yeah.
I was replacing an application management interface of sorts: large-ish sets of configuration parameters, ideal for a relational database. But I wanted to treat the combined configuration as a document, since that's what the front-end would send over. Ended up using GORM, which was fine for a little while... but it quickly fell apart, especially when your data model is nested more than one level deep. And then you end up having to figure out "how do I solve X in GORM" and find yourself with limited documentation and a relatively small community whose members quickly burn out of trying to help people.
I'll just write the code next time.
NavinF
Maintaining a new service sucks. Not being able to do atomic commits to both postgres and the other db sucks.
klysm
Avoiding distributed systems problems. Distributed systems are so incredibly hard to get right that I will vertically scale postgres until I hit an insurmountable wall before giving in.
cryptonector
You can also build distributed DBs with PG. For example for a DB with multiple write nodes all you need to do is implement an event shipping model with logical replication where your servers publish their events and subscribe to others, and you need to implement conflict resolution rules, naturally. I think PG's type system can probably be leveraged to make a CRDT system on top (and I bet someone's already done it).
roncesvalles
IBM mainframes were created for you. Imagine you had a single computer that had multiple nines reliability. Hot swappable disk, RAM, CPU. Redundant power supply. Redundant network stack. OS designed to never need restarting. That's basically what a mainframe is, and IBM sells billions of dollars worth of them to this day.
hobs
People ask me - "but can we just distribute it because everything in one basket makes me uneasy"
Yeah distributing state among 10 nodes, totally easy, fine, good.
eqvinox
It's because postgres is in fact good at a lot of vaguely database-looking things. Even if it weren't the best for anything, if it does 80% of things at 80% best possible — it is reasonable to have postgres as "first thing to reach for" by default.
That said, it's easy to forget to check if you're in either of those 20% (or both.) There's probably a whole bunch of postgres usage where really something else should be used, and people just never checked.
cryptonector
Because PG is a fantastic platform for doing everything you need in SQL and it performs real well. Add PostgREST and you've got a REST API almost for free (you have to design a schema of what to expose, so not entirely free, but still, pretty close). Also, PG moves _fast_ and has a very active developer and user community, which means you'll be getting more awesome functionality in the coming future. (E.g., I've been following the AIO thread and the performance improvements coming from that patch set will be hefty.)
Vonng
BTW, for anyone who's interested, I've packaged the pg_search and vchord_bm25 extensions as RPMs/DEBs: https://pigsty.io/ext/fts/vchord_bm25 https://pigsty.io/ext/fts/pg_search in case anyone wants to benchmark them themselves. ;)
zhousun
Glad to see more 'postgres-native' full-text search implementations.
Alternative solutions (Lucene/Tantivy) are both designed around 'immutable segments' (indexing immutable files), so marrying them with Postgres heap tables would result in a worse solution.
retakeming
The segments themselves being immutable doesn't mean that Tantivy is incompatible with Postgres - it just means that Tantivy needs to be made compatible with Postgres' concurrency control mechanisms (MVCC) and storage format (block storage). This blog post explains the latter: https://www.paradedb.com/blog/block_storage_part_one
sunzhousz
The fundamental mismatch I saw is "creating a new segment for each individual DML statement". It's possible to alleviate, but I don't think there's a good general solution.
atemerev
10 million records is a toy dataset. Usually, you can fit it in memory on a laptop.
There are open large(-ish) text datasets like full Wikipedia or pre-2022 Reddit comments, that would work much better for benchmarking.
some_developer
Years ago I wanted to use native FTS (because of all the things mentioned, having to sync to an external system simply adds complexity) and it failed at another point.
Not completely surprising, but on a table with _potentially_ a couple of thousand inserts per second, it slowed down the overall updates to the point that transactions timed out.
We had already added an index for one of the columns we wanted to index and were running the statement for the second one. The moment the second index finished, we started to see timeouts from our system when writing to that table, transactions failing, etc.
We had to drop the indices again. So, sadly, we never got to the point of testing the actual FTS performance :/ I would have liked to test it, because we didn't necessarily have to search hundreds of millions of documents; due to customer tenants this would always be constrained to a few million _at most_.
ps: I already wrote about this -> https://news.ycombinator.com/item?id=27977526 . Never got a chance to try it again (newer versions of everything, newer hardware, etc.)
thr0w
Sounds like the issue was just co-location of search index + other transactional data in the same table. If you had a table acting as your search index only then would insert lag on that table matter? I could maybe see connections piling up, but with proper batching I bet it'd be fine.
chenhoey1211
I’ve seen a lot of teams jump straight to Elasticsearch or Meilisearch without realizing how much performance you can get out of native PG FTS when used properly.
Could we get similar performance in the browser using something like SQLite + FTS5 + Wasm? Seems like an interesting direction for offline-first apps...
johnthescott
For text search we use the "rum" extension from postgrespro. We search terabytes of PDFs in < 1 sec. A talk here:
https://github.com/jmscott/talk/blob/master/pgday-austin-20161112.pdf
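A minimal sketch of the rum setup (assuming a tsvector column named tsv; the opclass and the <=> ordering operator are from the postgrespro/rum README):
-- rum keeps positional/ranking information in the index itself,
-- so ordering by relevance can be served from the index.
CREATE EXTENSION rum;

CREATE INDEX idx_rum_docs_tsv
  ON docs USING rum (tsv rum_tsvector_ops);

SELECT id
FROM docs
WHERE tsv @@ to_tsquery('english', 'research')
ORDER BY tsv <=> to_tsquery('english', 'research')
LIMIT 10;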
briankelly
> 10M log entries
I don’t think the question is speed, it’s scale. Use it until it breaks, though.
retakeming
I'm one of the pg_search maintainers. Hello! A few thoughts.
First, both strategies -- the one outlined by the Neon/ParadeDB article, and the one used here -- are presented as viable alternatives by the Postgres docs: https://www.postgresql.org/docs/current/textsearch-tables.ht....
Second, as the article correctly demonstrates, the problem with Postgres FTS isn't "how can I pick and optimize a single pre-defined query", it's "how do I bring Postgres to Elastic-level performance across a wide range of real-world boolean, fuzzy, faceted, relevance-ranked, etc. queries?"
`pg_search` is designed to solve the latter problem, and the benchmarks were made to reflect that. You can always cherry-pick a query and optimize it at the expense of data duplication and complexity. The Neon/ParadeDB benchmarks contained 12 queries in total, and the benchmarks could have:
- Created composite b-tree indexes for each of the queries with boolean predicates
- Extracted all the text fields from the JSONBs, and stored and indexed them as separate columns for queries against JSONB
But that's not realistic for many real-world use cases. `pg_search` doesn't require that - it's a simple index definition that works for a variety of "Elastic style" queries and Postgres types and doesn't ask the user to duplicate every text column.
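For reference, that single index definition looks roughly like this (a sketch following the shape shown in the ParadeDB docs; exact syntax and supported field types vary by pg_search version, and the column list here is assumed from the benchmark):
-- One BM25 index covering text, keyword, numeric and JSON fields;
-- the benchmark's queries all go through this single index.
CREATE INDEX idx_bm25_logs
  ON benchmark_logs
  USING bm25 (id, message, country, severity, timestamp, metadata)
  WITH (key_field = 'id');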