The best way to use text embeddings portably is with Parquet and Polars
February 24, 2025
thomasfromcdnjs
Lots of great findings.
---
I'm curious if anyone knows whether it is better to pass structured or unstructured data to embedding APIs? If I ask ChatGPT, it says it is better to send unstructured data. (Looking at the author's GitHub, it looks like he generated embeddings from JSON strings.)
My use case is jsonresume: I am creating embeddings by sending full resume.json files as strings, but I've been experimenting with using models to translate each resume.json into a full-text version first before creating embeddings. The results seem better, but I haven't seen any concrete opinions on this.
My understanding is that unstructured data is better because natural language carries the textual/semantic meaning, i.e. a raw JSON document presumably embeds worse than the same content written out as prose.
Another question: what if the search query were also a JSON embedding? JSON <> JSON embeddings could also be great?
minimaxir
In general I like to send structured data (see the input format here: https://github.com/minimaxir/mtg-embeddings), but ModernBERT, the base of the embedding model used here, implicitly handles structured data better than previous models. That's worth another blog post explaining why.
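For illustration, a minimal sketch of embedding a JSON-serialized record with sentence-transformers; the model name and card fields are illustrative placeholders, not necessarily the exact setup from the post:

```python
import json
from sentence_transformers import SentenceTransformer

# Illustrative ModernBERT-based embedding model; swap in whichever model you use.
model = SentenceTransformer("nomic-ai/modernbert-embed-base")

card = {
    "name": "Lightning Bolt",
    "manaCost": "{R}",
    "type": "Instant",
    "text": "Lightning Bolt deals 3 damage to any target.",
}

# Embed the structured record as a JSON string rather than flattened prose.
embedding = model.encode(json.dumps(card), normalize_embeddings=True)
print(embedding.shape)  # e.g. (768,)
```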
banku_brougham
Really cool article, I've enjoyed your work for a long time. You might add a note for those jumping into a SQLite implementation that DuckDB reads Parquet and recently launched vector similarity functions, which cover this use case perfectly:
https://duckdb.org/2024/05/03/vector-similarity-search-vss.h...
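For example, a rough sketch (assuming DuckDB >= 0.10 for the array functions, and a Parquet file with `name` and 768-dim `embedding` columns; names and dimensions are illustrative):

```python
import duckdb

# Hypothetical query vector; in practice it comes from your embedding model.
query = [0.1] * 768

# DuckDB scans the Parquet file directly; array_cosine_similarity operates on
# fixed-size FLOAT[768] arrays, so both sides are cast from lists.
rows = duckdb.execute(
    """
    SELECT name,
           array_cosine_similarity(embedding::FLOAT[768], ?::FLOAT[768]) AS score
    FROM 'embeddings.parquet'
    ORDER BY score DESC
    LIMIT 5
    """,
    [query],
).fetchall()
print(rows)
```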
jt_b
I have tinkered with using DuckDB as a poor man's vector database for a POC and had great results.
One thing I'd love to see is some sort of row-group-level metadata statistics for embeddings within a Parquet file - something that would let readers push predicates down to the HTTP-request level and avoid loading non-relevant rows from a remote file entirely, particularly one stored on S3-compatible storage that supports byte-range requests. I'm not sure what the implementation would look like - how to define a sort that organizes "close" rows together, how the metadata would be calculated, or what the reader implementation would be - but I'd love to apply some of the same patterns to vector search that geoparquet uses.
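For what it's worth, the scalar version of this pattern already works today: Parquet row-group statistics plus predicate pushdown let a reader skip whole row groups (and their byte ranges). A rough pyarrow sketch, with a hypothetical `cluster_id` column standing in for whatever coarse scalar a vector-aware layout would need:

```python
import pyarrow.dataset as ds
import pyarrow.parquet as pq

# Per-row-group min/max statistics that a reader can prune against.
pf = pq.ParquetFile("embeddings.parquet")
for i in range(pf.metadata.num_row_groups):
    stats = pf.metadata.row_group(i).column(0).statistics
    if stats is not None:  # nested columns (e.g. embeddings) may lack stats
        print(i, stats.min, stats.max)

# Predicate pushdown on a scalar column: only row groups whose statistics can
# satisfy the filter are read. Embeddings would need an analogous coarse scalar
# (the hypothetical cluster_id) to get the same remote-skipping behavior.
table = ds.dataset("embeddings.parquet", format="parquet").to_table(
    filter=ds.field("cluster_id") == 7
)
print(table.num_rows)
```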
rcarmo
I'm a huge fan of polars, but I hadn't considered using it to store embeddings in this way (I've been fiddling with sqlite-vec). Seems like an interesting idea indeed.
stephantul
Check out Unum’s usearch. It beats anything, and is super easy to use. It just does exactly what you need.
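A minimal usage sketch, assuming 768-dimensional float32 embeddings already in a NumPy array (sizes and keys are illustrative):

```python
import numpy as np
from usearch.index import Index

# Illustrative corpus: 33k embeddings of dimension 768.
vectors = np.random.rand(33_000, 768).astype(np.float32)
keys = np.arange(len(vectors))

index = Index(ndim=768, metric="cos")  # cosine distance
index.add(keys, vectors)

# Top-10 nearest neighbors for the first vector.
matches = index.search(vectors[0], 10)
print(matches.keys, matches.distances)
```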
esafak
Have you tested it against Lance? Does it do predicate pushdown for filtering?
ashvardanian
USearch author here :)
The engine supports arbitrary predicates for C, C++, and Rust users. In higher level languages it’s hard to combine callbacks and concurrent state management.
In terms of scalability and efficiency, the only tool I’ve seen coming close is Nvidia’s cuVS if you have GPUs available. FAISS HNSW implementation can easily be 10x slower and most commercial & venture-backed alternatives are even slower: https://www.unum.cloud/blog/2023-11-07-scaling-vector-search...
In this use-case, I believe SimSIMD raw kernels may be a better choice. Just replace NumPy and enjoy speedups. It provides hundreds of hand-written SIMD kernels for all kinds of vector-vector operations for AVX, AVX-512, NEON, and SVE across F64, F32, BF16, F16, I8, and binary vectors, mostly operating in mixed precision to avoid overflow and instability: https://github.com/ashvardanian/SimSIMD
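A sketch of that NumPy swap as I understand the simsimd Python API (float32 inputs; note that cosine here returns a distance, i.e. 1 minus similarity):

```python
import numpy as np
import simsimd

a = np.random.rand(768).astype(np.float32)
b = np.random.rand(768).astype(np.float32)

# Single-pair cosine distance through a hand-written SIMD kernel.
dist = simsimd.cosine(a, b)

# Batch version: all query-vs-corpus distances in one call.
queries = np.random.rand(10, 768).astype(np.float32)
corpus = np.random.rand(33_000, 768).astype(np.float32)
dists = simsimd.cdist(queries, corpus, metric="cosine")
print(float(dist), np.asarray(dists).shape)  # distance, then (10, 33000)
```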
stephantul
Usearch is a vector store afaik, not a vector db. At least that’s how I use it.
I haven't compared it to LanceDB; I reached for it here because the author mentioned Faiss being difficult to use and install. usearch is a great alternative to Faiss.
But thanks for the suggestion, I’ll check it out
jononor
At 33k items in memory, brute-force search is quite fast - 10 ms is very responsive. Scaling is roughly linear, so with 10x the items (330k) on the same hardware you'd expect around 100 ms, and around 1 second somewhere past a few million items. That might be too slow for some applications (but not all). Especially if one only retrieves a rather small number of matches, an index will help a lot for 100k++ datasets.
robschmidt90
Nice read. I agree that for a lot of hobby use cases you can just load the embeddings from parquet and compute the similarities in-memory.
To find similarity between my blogposts [1] I wanted to experiment with a local vector database and found ChromaDB fairly easy to use (similar to SQLite just a file on your machine).
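The in-memory route really is only a few lines. A sketch assuming a Parquet file with `id` and `embedding` (list of float32) columns; file and column names are illustrative:

```python
import numpy as np
import polars as pl

df = pl.read_parquet("embeddings.parquet")

# Stack the list column into an (n, dim) float32 matrix and L2-normalize it,
# so cosine similarity reduces to a dot product.
emb = np.vstack(df["embedding"].to_list()).astype(np.float32)
emb /= np.linalg.norm(emb, axis=1, keepdims=True)

def top_k(query: np.ndarray, k: int = 5):
    q = query / np.linalg.norm(query)
    sims = emb @ q
    idx = np.argsort(-sims)[:k]
    return df["id"].gather(idx.tolist()).to_list(), sims[idx].tolist()

# Query with an existing row's embedding; the best match is the row itself.
print(top_k(emb[0]))
```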
noahbp
Wow! How much did this cost you in GPU credits? And did you consider using your MacBook?
minimaxir
It took 1:17 to encode all ~32k cards using a preemptible L4 GPU on Google Cloud Platform (g2-standard-4) at ~$0.28/hour, costing < $0.01 overall: https://github.com/minimaxir/mtg-embeddings/blob/main/mtg_em...
The base ModernBERT uses CUDA tricks not available in MPS, so I suspect it would take much longer.
For the 2D UMAP, it took 3:33 because I wanted to do 1 million epochs to be thorough: https://github.com/minimaxir/mtg-embeddings/blob/main/mtg_em...
jtrueb
Polars + Parquet is awesome for portability and performance. This post focused on python portability, but Polars has an easy-to-use Rust API for embedding the engine all over the place.
blooalien
Gotta love stuff that has multiple language bindings. Always really enjoyed finding powerful libraries in Python and then seeing they also have matching bindings for Go and Rust. Nice to have easy portability and cross-language compatibility.
kernelsanderz
For another library with great performance and features like full-text indexing and the ability to version changes, I'd recommend LanceDB: https://lancedb.github.io/lancedb/
Yes, it's a vector database and has more complexity. But you can use it without creating indexes, and it also has excellent zero-copy Arrow support for Polars and pandas.
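A quick sketch of that indexless, brute-force usage (LanceDB expects the embeddings in a column named `vector` by default; paths and column names here are illustrative):

```python
import lancedb
import polars as pl

db = lancedb.connect("./lancedb")  # local and file-based, like SQLite

df = pl.read_parquet("embeddings.parquet").rename({"embedding": "vector"})
table = db.create_table("cards", data=df, mode="overwrite")

# No index created: this is an exact, brute-force nearest-neighbor scan.
query = df["vector"].to_list()[0]
results = table.search(query).limit(5).to_polars()
print(results)
```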
3abiton
How well does it scale?
daveguy
Since a lot of ML data is stored as parquet, I found this to be a useful tidbit from lancedb's documentation:
> Data storage is columnar and is interoperable with other columnar formats (such as Parquet) via Arrow
https://lancedb.github.io/lancedb/concepts/data_management/
Edit: That said, I am personally a fan of parquet, arrow, and ibis. So many data wrangling options out there it's easy to get analysis paralysis.
esafak
Lance is made for this stuff; parquet is not.
thelastbender12
This is pretty neat.
IMO a hindrance to this was the lack of built-in fixed-size list array support in the Arrow format until recently. Some implementations/clients supported it while others didn't; otherwise, it could have been used as the default storage format for NumPy arrays and torch tensors, too.
(You could always store arrays as variable length list arrays with fixed strides and handle the conversion).
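For reference, a small pyarrow sketch of the fixed-size list representation (dimension and sizes are illustrative):

```python
import numpy as np
import pyarrow as pa

dim = 768
vectors = np.random.rand(1_000, dim).astype(np.float32)

# One flat buffer of values, typed as fixed_size_list<float, 768>.
arr = pa.FixedSizeListArray.from_arrays(pa.array(vectors.ravel()), dim)
table = pa.table({"embedding": arr})

# Round-trip back to a contiguous (n, dim) matrix without per-row Python lists.
restored = arr.flatten().to_numpy().reshape(-1, dim)
print(table.schema, restored.shape)
```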
banku_brougham
Is your example of a float32 number correct, holding a 24-ASCII-char representation? I had thought single precision would be 7 digits plus the exponent, sign, and exponent sign - something like 7+2+1+1, i.e. around 11 ASCII characters, rather than the 24 you mentioned?
minimaxir
It depends on the default print format. The example string I mentioned is pulled from what np.savetxt() does (fmt='%.18e') and there isn't any precision loss in that number. But I admit I'm not a sprintf() guru.
In practice, numbers with that much precision are overkill and verbose, so tools don't print float32s to that level of precision.
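The length checks out: `%.18e` prints one leading digit, a decimal point, 18 fractional digits, and a 4-character exponent, so 24 characters for a positive number, versus 4 bytes for the raw float32:

```python
import numpy as np

x = np.float32(0.1234567)
s = "%.18e" % x      # the format np.savetxt() uses by default
print(s, len(s))     # something like '1.234567016363143921e-01', 24 chars
print(x.nbytes)      # 4 bytes in binary form
```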
PaulHoule
One of the things I remember from my PhD work is that you can do a stupendous number of FLOPs on floating point numbers in the time it takes to serialize/deserialize them to ASCII.