
Vector database that can index 1B vectors in 48M

kgeist

There was recently this paper: https://arxiv.org/abs/2508.21038

They show that with 4096-dimensional vectors, accuracy starts to degrade at around 250 million documents (fundamental limits of embedding models). For 512-dim vectors, it's just 500k.

Is 1 billion vectors practical?

1999-03-31

1B vectors is nothing. You don't need to index them. You can hold them in VRAM on a single node and run queries with perfect accuracy in milliseconds.

eknkc

I guess for 2D vectors that would work?

For 1024 dimensions, even with 8-bit quantization you are looking at a terabyte of data. Let's make them binary vectors; it's still 128 GB of VRAM.

WAT?

adastra22

1B x 4096 = 4T scalars.

That doesn't fit in anyone's video RAM.
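The back-of-envelope math in these comments can be sketched as a one-liner (assuming raw, unindexed vectors: bytes = vectors × dims × bytes per scalar):

```python
def storage_gb(n_vectors: int, dims: int, bits_per_scalar: int) -> float:
    """Raw storage for n_vectors of dims scalars at the given precision, in GB."""
    return n_vectors * dims * bits_per_scalar / 8 / 1e9

# 1B x 1024-dim at 8-bit quantization -> ~1 TB
print(storage_gb(1_000_000_000, 1024, 8))   # 1024.0 GB
# Same vectors binarized (1 bit per dimension) -> 128 GB
print(storage_gb(1_000_000_000, 1024, 1))   # 128.0 GB
# 1B x 4096-dim at float32 -> ~16 TB
print(storage_gb(1_000_000_000, 4096, 32))  # 16384.0 GB
```

This matches the thread: even binarized 1024-dim vectors need 128 GB, well beyond a single GPU's VRAM.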

lyu07282

Show your math lol

softwaredoug

Not trying to be snarky, just curious -- How is this different from TurboPuffer and other serverless, object storage backed vector DBs?

hungarianhc

Hey! It's a great question. Co-founder of Vectroid here.

Today, the differences are going to be performance, price, accuracy, flexibility, and some intangible UI elegance.

Performance: We actually INITIALLY built Vectroid for the use-case of billions of vectors and near single digit millisecond latency. During the process of building and talking to users, we found that there are just not that many use-cases (yet!) that are at that scale and require that latency. We still believe the market will get there, but it's not there today. So we re-focused on building a general purpose vector search platform, but we stayed close to our high performance roots, and we're seeing better query performance than the other serverless, object storage backed vector DBs. We think we can get way faster too.

Price: We optimized the heck out of this thing with object storage, preemptible virtual machines, etc. We've driven our cost down, and we're passing this on to the user, starting with a free tier of 100GB. Actual pricing beyond that coming soon.

Accuracy: With our initial testing, we see recall greater or equal to competitors out there, all while being faster.

Flexibility: We are going to have a self managed version for users who want to run on their own infra, but admittedly, we don't have that today. Still working on it.

Other Product Elegance: My co-founder, Talip, made Hazelcast, and I've always been impressed by how easy it is to use and how the end to end experience is so elegant. As we continue to develop Vectroid, that same level of polish and focus on the UX will be there. As an example, one neat thing we rolled out is direct import of data from Hugging Face. We have lots of other cool ideas.

Apologies for the long winded answer. Feel free to ping us with any additional questions.

f311a

I’m curious, what’s the tech stack behind this?

ge96

M is minutes

HarHarVeryFunny

I was starting to think this was impressive, if not impossible. 1B vectors in 48 MB of storage => < 1 bit per vector.

Maybe not impossible using shared/lossy storage if they were sparsely scattered over a large space ?

But anyways - minutes. Thanks.

Edit: Gemini suggested that this sort of (lossy) storage size could be achieved using "Product Quantization" (sub vectors, clustering, cluster indices), giving an example of 256 dimensional vectors being stored at an average of 6 bits per vector, with ANN being one application that might use this.
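For the curious, here is a minimal sketch of the Product Quantization idea described above, using only NumPy. All parameters (number of sub-vectors, centroids, iterations) are illustrative assumptions, not anything from Vectroid or the paper; real systems use tuned k-means implementations such as those in Faiss.

```python
import numpy as np

def train_pq(data, n_sub=8, n_centroids=256, n_iter=10, seed=0):
    """Learn one k-means codebook per sub-vector block (naive Lloyd's algorithm)."""
    rng = np.random.default_rng(seed)
    n, d = data.shape
    sub_d = d // n_sub
    codebooks = []
    for s in range(n_sub):
        block = data[:, s * sub_d:(s + 1) * sub_d]
        # Initialize centroids from random samples, then refine.
        centroids = block[rng.choice(n, n_centroids, replace=False)].copy()
        for _ in range(n_iter):
            # Assign each sub-vector to its nearest centroid.
            d2 = ((block[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
            assign = d2.argmin(1)
            for c in range(n_centroids):
                members = block[assign == c]
                if len(members):
                    centroids[c] = members.mean(0)
        codebooks.append(centroids)
    return codebooks

def encode(data, codebooks):
    """Replace each sub-vector with the index of its nearest centroid."""
    n_sub = len(codebooks)
    sub_d = data.shape[1] // n_sub
    codes = np.empty((data.shape[0], n_sub), dtype=np.uint8)
    for s, centroids in enumerate(codebooks):
        block = data[:, s * sub_d:(s + 1) * sub_d]
        d2 = ((block[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        codes[:, s] = d2.argmin(1)
    return codes

def decode(codes, codebooks):
    """Reconstruct approximate (lossy) vectors from the stored indices."""
    return np.hstack([codebooks[s][codes[:, s]] for s in range(codes.shape[1])])
```

With 8 sub-vectors and 256 centroids each, a vector compresses to 8 bytes of indices regardless of its original precision, which is how PQ-based ANN indexes squeeze billions of vectors into modest storage at the cost of exact distances.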

gaogao

Yeah, the SI symbol for minutes is min, if you're going to abbreviate it in a technical context. Super funky using M.

williamscales

Agree the correct abbreviation is min.

Nitpick: could be wrong but I don’t think minutes is an SI derived unit.

stevemk14ebr

Thank you, the title needs to be edited.

ikanade

Legend

l5870uoo9y

Thankfully not months.

softwaredoug

Oh, the horrors of search indexing I've seen... including weeks / months to rebuild an index.

cluckindan

How is this different from running tuned HNSW vector indices on Elasticsearch?

ashvardanian

Very curious about the hardware setup used for this benchmark!

talipozturk

No special hardware. Google Cloud VMs. We use multiple of them during index building.

ashvardanian

The question is how many, and what kind of VMs you use? It greatly affects performance :)

I run a lot of search-related benchmarks (https://github.com/ashvardanian) and curious if you’ve compared to other engines on the same hardware setup, tracing recall, NDCG, indexing, and query speeds.

esafak

By the creator of the real-time data platform https://en.wikipedia.org/wiki/Hazelcast.

OutOfHere

Proprietary closed-source lock-in. Nothing to see here.

CuriouslyC

Seriously. The amount of lift a SaaS product needs to give me is insane for me to even bother evaluating it, and there's a near zero percent chance I'll use it in my core.

kcb

Especially a product that demands access to large quantities of your most sensitive data to be useful.

HEmanZ

What do you think an alternative is for someone who:

1. Has a technical system they think could be worth a fortune to large enterprises, containing at least a few novel insights to the industry.

2. Knows that competitors and open source alternatives could copy/implement these in a year or so if the product starts off open source.

3. Has to put food on the table and doesn’t want to give massive corporations extremely valuable software for free.

Open source has its place, but it is IMO one of the ways to give monopolies massive value for free. There are plenty of open source alternatives around for vector DBs. Do we (developers) need to give everything away to the rich?

mhuffman

Traditionally the most profitable approach is offering enterprise support and consulting.

cluckindan

Enterprises are so very fond of choosing novel open source technologies, too!

(not)

hungarianhc

Not that locked in - you can just move your vectors to another platform, no?

Vectroid co-founder here. We're huge fans of open source. My co-founder, Talip, made Hazelcast, which is open source.

It might make sense to open source all or part of Vectroid at some point in the future, but at the moment, we feel that would slow us down.

I hate vendor lock-in just as much as the next person. I believe data portability is the ACTUAL counter to vendor lock-in. If we have clean APIs to get your data in, get your data out, and the ability to bulk export your data (which we need to implement soon!), then there's less of a concern, in my opinion.

I also totally understand and respect that some people only want open source software. I'm certainly like that w/ my homelab setup! Except for Plex... Love Plex... Usually.

stronglikedan

Nothing for you to see here. Surely you just aren't their target customer.

OutOfHere

So who is? Who really needs to index 1 billion new vectors every 48 minutes, or perhaps equivalently 1 million new vectors every 3 seconds?
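The two rates quoted above are roughly, though not exactly, equivalent, as a quick check shows:

```python
# Sustained indexing rate implied by "1B vectors in 48 minutes"
# versus the commenter's "1M vectors every 3 seconds".
rate_headline = 1_000_000_000 / (48 * 60)  # vectors per second
rate_restated = 1_000_000 / 3              # vectors per second

print(round(rate_headline))  # 347222
print(round(rate_restated))  # 333333
```

Both work out to a few hundred thousand vectors per second of sustained ingest.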

hansvm

If HNSW were accurate enough (and if this DB were much faster) then I'd have a use case. I wound up going down a different route to create a differentiable database for ML shenanigans though.