BM25 in PostgreSQL
29 comments
March 3, 2025 · thund
skissane
What happens at a lot of places:
Elastic license not on legal’s approved license list - need to get special approval from legal to use this.
Legal’s license list says AGPL needs case-by-case legal review - so need to get special approval from legal to use this.
Immediate thought: is there another solution which I don’t need special approval from legal to use?
isbvhodnvemrwvn
All licences restricting commercial use will need to be cleared, that's the point.
skissane
Yes… but from my point of view, if there is an alternative solution without those restrictions, I’ll go with that. I’d only consider a solution with such restrictions if its other advantages were so compelling as to overcome that (and even then, if one has to ask legal, it isn’t guaranteed they’ll say “yes”)
jankovicsandras
This looks cool!
Shameless plug:
https://github.com/jankovicsandras/plpgsql_bm25
BM25 search implemented in PL/pgSQL, which might be useful if one can't use Rust extensions with Postgres, e.g. hosted Postgres without admin rights.
emilsedgh
This looks very neat. In our company, for the past 10 years I've bet heavily on Postgres and haven't allowed any other databases to be used. Stuff like this makes me hope I can continue to do so as our database is growing.
But it appears that we are hitting our limits at this point. We have a table with tens of millions of rows, and the use case is faceted search.
Something like Zillow, with tens of millions of listings and tens of columns that can be queried. I'm having a tough time scaling it up. Does anyone have experience building something like that in pg?
patrickhogan1
pg_bm25 (optimized for faceted search) and PGroonga (full-text search with JSONB support)
crowdyriver
Unrelated, but PGroonga reads funny in Argentinian. Better than gimp though
pezezin
I read your comment with El Bananero's voice and now I can't stop laughing xD
ngrilly
Awesome! How does it compare to RUM, another PostgreSQL extension solving a similar problem (make ranking fast by putting enough info in the index)?
siquick
Can this be used on AWS RDS? I’ve seen a few things like this that would be great to use but without RDS support they’re unusable for us.
jonathal
No, not until AWS decides to add it as a supported extension, which will most likely never happen (you can see the current list of supported extensions here: https://docs.aws.amazon.com/AmazonRDS/latest/PostgreSQLRelea...)
jillesvangurp
There's more to Elasticsearch/opensearch than just bm25. It's a Swiss Army knife of stuff that you need to implement search for all sorts of things. Most of which is missing in action with solutions like this.
The article mentions stop words, which are an outdated and primitive strategy for dealing with frequently used words that add a lot of overhead to your search. These days with Elasticsearch the advice is actually not to rely on lists of stop words and to avoid using them entirely.
Reason: the search engine is smart enough to handle very common words like "to", "not", "be", and "or" efficiently. BM25 relies on term frequencies for scoring. So these words would have very high term frequencies. The "match" query uses some optimizations that avoid most of the overhead associated with juggling a lot of high frequency term matches and scoring a lot of matches that would score very low. If you are only going to look at five results, scoring 500000 documents that potentially match the query that includes a stop word is really expensive (because you need to score each document and then sort the results). But if you prioritize scoring low frequency terms first, you can narrow down the result list considerably and avoid scoring most of the documents.
Not filtering out stop words means that if you are indexing the works of Shakespeare, you'd actually be able to find the phrase "to be or not to be", which is entirely made up of stop words. Classic edge case to test for. Just an example of one of many optimizations that you'd find in Lucene based search systems that improve both precision and recall metrics as well as performance. There are a lot of wheels that need to be reinvented on the postgresql side to get to that level. It's not quite as simple as "apply bm25 for each hit and sort".
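The point about very common words contributing little to the score can be made concrete with the BM25 IDF term: a word that appears in almost every document gets a weight near zero, which is why engines can afford to keep stop words in the index. A minimal sketch (the corpus sizes and document frequencies below are made-up numbers for illustration):

```python
import math

def bm25_idf(n_docs, doc_freq):
    # Standard BM25 inverse document frequency: terms that occur in
    # nearly every document get a near-zero weight.
    return math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))

def bm25_term_score(tf, doc_len, avg_doc_len, idf, k1=1.2, b=0.75):
    # Term-frequency component is saturating (controlled by k1) and
    # normalized by document length (controlled by b).
    norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * norm

# Hypothetical corpus of 1,000,000 docs:
# "to" appears in ~900,000 of them, "hamlet" in only 50.
idf_common = bm25_idf(1_000_000, 900_000)  # close to 0
idf_rare = bm25_idf(1_000_000, 50)         # large
```

Even with identical term frequencies inside a document, the rare term dominates the score, which is exactly what lets an engine defer or skip scoring work for high-frequency terms.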
Solutions like the ones discussed in the article provide some of the low level primitives, but without most of the high level abstractions and optimizations that you'd need to build a proper search that is both accurate and fast. Fine if you know what you are doing and can compensate for that, but probably not a great starting point if that's not the case.
And if you are thinking that it's convenient to index your database model, you'd be well advised to read up on ETL and the notion of optimizing what you index for search, rather than for simple storage. In most more sophisticated search systems what's indexed for search is not the same as what you put in your database and there are some good reasons for that.
Even if you do want to use postgres for search, you'd be well advised to consider setting up two completely separate database clusters. One for storing your data and one for indexing and searching through your data. The ETL pipeline in between those is where most of the magic happens. And you typically don't want that on the critical path of simple CRUD operations because that magic tends to be expensive. Which is why you do it in an ETL pipeline so you can keep your writes and queries fast and allocate hardware resources as is appropriate.
Basically the T in ETL (extract, transform, load) is where you do all the expensive stuff that you don't want to do when you are querying or interacting with your system and editing things. If you are Google 25 years ago, that's where you'd be calculating page rank, for example. These days they do quite a bit more work. Let's just say that BM25 doesn't quite cut it if you are Google these days. Vector search is a classic example of something that's fairly expensive as well that you don't want to block on while you are writing to a database or file system (e.g. when crawling the web). Doing things asynchronously and decoupling them via queues, databases, filesystems, etc. is basically what data engineering is all about.
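The decoupling described above can be sketched in a few lines: the write path only enqueues raw records, and the expensive transform runs asynchronously in a worker before loading into the search index. Everything here (queue, worker, the toy `expensive_transform`) is illustrative, not any particular ETL framework:

```python
import queue
import threading

work_q: "queue.Queue" = queue.Queue()
search_index = []  # stand-in for a separate search cluster

def expensive_transform(record):
    # In a real pipeline: tokenization, embeddings, computed ranking
    # features, enrichment. Here, a trivial tokenizer as a placeholder.
    return {**record, "tokens": record["text"].lower().split()}

def etl_worker():
    # Drains the queue and loads transformed documents into the index,
    # off the critical path of CRUD writes.
    while True:
        record = work_q.get()
        if record is None:
            break
        search_index.append(expensive_transform(record))
        work_q.task_done()

worker = threading.Thread(target=etl_worker)
worker.start()

# The write path stays fast: just enqueue and return.
work_q.put({"id": 1, "text": "To be or not to be"})
work_q.put(None)  # shutdown sentinel
worker.join()
```

In production the queue would be Kafka/SQS/etc. and the index a separate cluster, but the shape is the same: writes stay cheap, and the transform can be scaled and re-run independently.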
Of course if you are setting up two systems anyway (a db, and a search index), you might want to consider using something that has all the right tools for the job. Which just isn't postgresql unless your ambition level is relatively low and search is a low priority thing for you. Whether you can afford to be naive and unambitious is ultimately a business choice. But you should be considering what your competitors do and what the consequences are of not being quite as good.
A lot of my consulting clients end up talking to me because they figure out they are not as good as they should be. I've helped transition several clients from naive postgresql based systems to Opensearch recently. Performance issues you can usually solve by throwing hardware at a problem. But quality issues require using better tools and knowing what to use.
nchmy
This was fantastic, thanks.
Do you suggest use of lemmatization rather than stopwords? Vector embeddings instead? Keyword extraction to be included as a bm25 search field?
What are your thoughts on the various search engines (if I may call them that)? Eg elastic vs opensearch vs solr. Vectordbs - weaviate, pinecone, qdrant, milvus How about manticoresearch? It seems to be enormously powerful, flexible, mature, but not well known.
jillesvangurp
I don't have experience with all of these systems. My observation is that whatever you do needs a bit of a systematic approach to evaluating the outcome. Without that you can use trial and error and replace magical solution 1 with magical solution 2 and hope for the best, but it's probably not going to magically improve things. And blindly flailing around is effectively what a lot of companies are doing on this front. Most of my clients actually struggle with the basics. The short version is that you can probably make each of those systems do what you need, but there's some skill to it. It's not about the tools you use but how you use them.
Opensearch is a fork of Elasticsearch and effectively the same thing for most purposes, with the exception of recently added features that a beginning user would typically not need. One of the exceptions is vector search, which works slightly differently on the two sides and mostly happened post fork.
There is no one size fits all solution for every search use case. Even realizing that there are multiple search use cases that you need to worry about, and that they require different solutions, is not something a lot of companies are even aware of. We can talk about text search, suggested queries, recommended results, did-you-mean, similarity search, etc. And that's before you throw in things like faceting/aggregations, various ways of putting your fingers on the scale (ever popular in ecommerce systems), premium/promoted content, etc.
Some wisdom with vector search is that it is more effective when you combine it with traditional search. Doing vector search on lots of documents is just not cheap. So narrowing down results with a traditional search vastly cuts down on the amount of work you need to do. And it makes response times more reasonable too. Also, vector search with off the shelf models tends to perform quite poorly, and models tend to have biases that might not match your needs/requirements. A good example from an ecommerce use case that somebody told me about is image search for clothes surfacing products that feature the same models rather than the same clothes. Because the off the shelf model was looking at the faces. It's not wrong. But probably not what was needed.
Lots of companies end up specializing models for their needs for this reason. And vector search is also not great for really short queries, like the type of queries that come out of a mobile UI. Users just don't tend to type coherently on mobile and there's only so much meaning you can get out of a 3 letter query. Vector search for search as you type is not a thing. And if you have a mobile UI that's most of the searches. Because why type long queries if the search gives you what you are looking for after just a few letters?
Most vector search products give you enough rope to hang yourself but don't fully solve all issues for you. Model selection and tuning tends to be all/most of the work and is typically left as an exercise to the user and is something that requires some skill. Once you have a model, there are plenty of things you can use. Each with their pros/cons. Setting up inference is the easy part of the job.
So, I wouldn't start with vector search until after you've done the legwork of being able to tell good results from bad results. Especially not if you are new to doing anything with search.
What is appropriate all depends on the use cases, the business value of search to your company, the skills of the team that you have, and what you are willing to spend on getting something better. But picking something because it's "faster" because some silly benchmark says so is probably not a good reason. Is it faster because it skips essential work? Or because it does the same thing better? And is that the thing that you need? Or are there other things that you might need?
nchmy
Fantastic insights again, thank you.
I've been developing a sort of educational platform for a few years and went deep on search stuff a couple years ago, before moving my focus elsewhere.
This was just prior to the LLM boom, and I had concluded that it would probably be best to mostly avoid vector search, and instead use some sort of transformer model to extract summaries, keywords etc from each document, store them in a meta field, then just do normal, sparse bm25 on it all. Even for image search etc - just extract keywords rather than dense embeddings.
SPLADE was one promising example back then. https://github.com/naver/splade
I'm sure the field has progressed since then, but it sounds like it is still best to not invest in vector search.
The real lesson, it seems, is we need to know our needs, data, etc and act accordingly - most apparently do not do that.
jszymborski
> Our next step is to fully decouple the tokenization process, transforming it into an independent and extensible extension. This will enable us to support multiple languages, allow users to customize tokenization for better results, and even incorporate advanced features like synonym handling.
This would be rad
nitinreddy88
Any comparison results in terms of performance vs accuracy with: https://github.com/paradedb/paradedb/tree/dev/pg_search
xenator
Seems like they know about ParadeDB, but for some reason don't publish benchmarks:
"""Another solution is ParadeDB, which pushes full-text search queries down to Tantivy for results. It supports BM25 scoring and complex query patterns like negative terms, aiming to be a complete replacement for ElasticSearch. However, it uses its own unique syntax for filtering and querying and delegates filtering operations to Tantivy instead of relying on Postgres directly. Its implementation requires several hooks into Postgres' query planning and storage, potentially leading to compatibility issues."""
So it is more apples to oranges than an equal comparison.
philippemnoel
Hi folks, ParadeDB author here. We had benchmarks, but they were super outdated. We just made new ones, and will soon make a biiiig announcement with big new benchmarks. You can see some existing benchmarks vs Lucene here: https://www.paradedb.com/blog/case_study_alibaba
This comparison isn't super fair -- ParadeDB does not have compatibility issues with Postgres and rather is directly integrated into Postgres block storage, query planner, and query executor
__jl__
How does this compare with pg_search (formerly pg_bm25) from ParadeDB?
mattashii
Looks like this is based on a fork of the pg_bm25 code: https://github.com/tensorchord/VectorChord-bm25/commit/15838...
VoVAllen
Hi, I'm the tech lead of VectorChord-bm25. It's not based on pg_search (pg_bm25). We just chose the same name during our internal development, and changed it to the formal name VectorChord-bm25 when we released it.
rpcope1
I'd be interested to see how this stacks up against Manticore or Meilisearch.
gaocegege
Thanks for sharing your feedback! Could you let me know the reason behind it? Are you currently using Manticore or Meilisearch?
Jysix
Personally, I also compare it to using Meilisearch. What I find there is a superb doc, ease of use, full support for other European languages, possibility of sorting by field in addition to BM25, custom stop words (even better if the stop words for each language are already defined), custom ordering. If I could find all this in a Postgresql extension, it would be a dream. Congrats for the progress made, I'll give it a try!
<3 the license
https://github.com/tensorchord/VectorChord/blob/main/LICENSE