Show HN: VectorVFS, your filesystem as a vector database
83 comments
·May 5, 2025
anotherpaul
perone
Thanks, I'm working on implementing the commands to clean the embeddings (for now, you can do that with the Linux xattr command-line tools). The encoders support CPU or GPU (NVIDIA), and it only supports Linux at the moment.
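For readers who want to script the cleanup mentioned above, here is a minimal Python sketch using the standard library's Linux-only `os.listxattr` and `os.removexattr`. The attribute prefix `user.vectorvfs` is a placeholder assumption, not a name confirmed anywhere in this thread; check what is actually stored on your files (for example with `getfattr -d FILE`) before removing anything.

```python
import os

# Hypothetical prefix: the real attribute name VectorVFS uses is not stated
# in this thread, so inspect your files before relying on this value.
ASSUMED_PREFIX = "user.vectorvfs"


def matching_attrs(names, prefix=ASSUMED_PREFIX):
    """Return the attribute names that start with the given prefix."""
    return [name for name in names if name.startswith(prefix)]


def clean_file(path, prefix=ASSUMED_PREFIX):
    """Remove matching extended attributes from one file (Linux only)."""
    removed = []
    for name in matching_attrs(os.listxattr(path), prefix):
        os.removexattr(path, name)
        removed.append(name)
    return removed


def clean_tree(root, prefix=ASSUMED_PREFIX):
    """Walk a directory tree and clean every file found under it."""
    result = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            removed = clean_file(path, prefix)
            if removed:
                result[path] = removed
    return result
```

Calling `clean_tree("/some/folder")` would return a mapping from each touched path to the attribute names removed from it, which makes it easy to review what was deleted.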
3abiton
I am curious why Python, and not Rust for example?
danudey
Not OP, but despite working in an all-Go shop I just wrote a utility in Python the other week and caught some flak for it.
The reason I gave (which was accepted) was that the process of creating a proof of concept and iterating on it rapidly is vastly easier in Python (for me) than it is in Go. In essence, it would have taken me at least a week, possibly more, to write the program I ended up with in Golang, but it only took me a day to write it in Python, and, now that I understand the problem and have a working (production-ready) prototype, it would probably only take me another day to rewrite it in Golang.
Also, a large chunk of the functionality in this Python script seems to be libraries - pillow for image processing, but also pytorch and related vision/audio/codec libraries. Even if similar production-ready Rust crates are available (I'm not sure if they are), this kind of thing is something Python excels at and which these modules are already optimized for. Most of the "work" happening here isn't happening in Python, by and large.
perone
Hi, I think Rust won't bring much benefit here to be honest, the bottleneck is mainly the model and model loading. It would probably be a nightmare to load these models from Rust, I would have to use torch bindings and then convert everything from the preprocessing already in Python to Rust.
b0a04gl
If VectorVFS obscures retrieval logic behind opaque embeddings, how do users debug why a file surfaced—or worse, why one didn’t?
perone
Hi, I'm not sure I understood what you meant by opaque embeddings, but the reason files surface or not is the similarity score (which is basically the dot product of the embeddings).
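To make the scoring concrete, here is a small standard-library sketch of a dot-product similarity score. The vectors are made-up toy values, and whether VectorVFS normalizes its embeddings is an assumption on my part; when vectors are scaled to unit length, the dot product equals cosine similarity.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


# Toy embeddings, invented for illustration only.
query = normalize([1.0, 0.0, 1.0])
close = normalize([0.9, 0.1, 1.1])
far = normalize([-1.0, 1.0, 0.0])

# A file surfaces when its score against the query is higher than the others.
print(dot(query, close) > dot(query, far))
```

Printing the score for each file next to its path is one plain way to see why a file ranked where it did.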
refulgentis
What is a non-opaque embedding?
Does VectorVFS do retrieval, or store embeddings in EXT4?
Is retrieval logic obscured by VectorVFS?
If VectorVFS did retrieval with non-opaque embeddings, how would one debug why a file surfaced?
jlhawn
If I understand correctly, this is attaching metadata to files in a format that LLMs (or any tool that can understand the semantic embedding vector) can leverage to understand what a file is without having to actually read the contents of the file.
That obviously has a lot of interesting use cases, but my first assumption was that this could be used to quickly/easily search your filesystem with some prompt like "Play the video from last month where we went camping and saw a flock of turkeys". But that would require having an actual vector DB running on your system which you could use to quickly look up files using an embedding of your query, no?
perone
Hi, it is quite different: there is no LLM involved. We can certainly use it for RAG, for example, but what is currently implemented is basically a way to generate embeddings (vector representations) which are then used for search later. It is all offline and local (no data from your files is ever sent to the cloud).
jlhawn
I understand that LLMs aren't involved in generating the embeddings and adding the xattrs. I was just wondering what the value add of this is if there's no other background process (like mds on macOS) which is using it to build a search index.
I guess what I'm asking is: how does VectorVFS enable search besides iterating through all files and iteratively comparing file embeddings with the embedding of a search query? The project description says "efficient and semantically searchable" and "eliminating the need for external index files or services" but I can't think of any more efficient way to do a search without literally walking the entire filesystem tree to look for the file with the most similar vector.
Edit: reading the docs [1] confirmed this. The `vfs search TERM DIRECTORY` command:
> will automatically iterate over all files in the folder, look for supported files and then embed the file or load existing embeddings directly from the filesystem.
[1]: https://vectorvfs.readthedocs.io/en/latest/usage.html#vfs-se...
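The walk-and-compare behavior described above can be sketched roughly as follows. This is not the project's code: `load_embedding` is a hypothetical caller-supplied function standing in for "embed the file or load the stored embedding", and the scoring is a plain dot product.

```python
import os


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def linear_search(root, query_vec, load_embedding, top_k=5):
    """Score every file under root against query_vec and return the best matches.

    load_embedding is a hypothetical callable that returns a list of floats
    for a path, or None for unsupported files.
    """
    scored = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            vec = load_embedding(path)
            if vec is None:
                continue  # skip files with no usable embedding
            scored.append((dot(query_vec, vec), path))
    scored.sort(reverse=True)  # highest score first
    return scored[:top_k]
```

Every query touches every file once, so the cost grows linearly with the number of files under `root`.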
lstodd
so, like magic(5)?
mywittyname
What is magic(5) and how is it similar to what was described?
danudey
magic(5) is a system for determining the type of a file by examining the 'magic bytes' at or near the start of a file.
For example, POSIX tar files have a defined file format that starts with a header struct: https://www.gnu.org/software/tar/manual/html_node/Standard.h...
You can see that at byte offset 257 is `char magic[6]`, which contains `TMAGIC`, which is the byte string "ustar\0". Thus, if a file has the bytes 'ustar\0' at offset 257 we can reasonably assume that it's a tar file. Almost every defined file type has some kind of string of 'magic' predefined bytes at a predefined location that lets a program know "yes, this is in fact a JPEG file" rather than just asserting "it says .jpg so let's try to interpret this bytestring and see what happens".
As for how it's similar: I don't think it actually is, I think that's a misunderstanding. The metadata that this vector FS is storing is more than "this is a JPEG" or "this is a Word document", as I understand it, so comparing it to magic(5) is extremely reductionist. I could be mistaken, however.
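The tar check described above (the bytes "ustar\0" at offset 257) can be sketched in a few lines of Python. The helper names are made up for this example; the sample archive is built in memory with the standard `tarfile` module using its `USTAR_FORMAT`.

```python
import io
import tarfile

TAR_MAGIC_OFFSET = 257        # position of `char magic[6]` in the header
TAR_MAGIC = b"ustar\x00"      # TMAGIC, including the trailing NUL


def looks_like_posix_tar(data):
    """Return True if the bytes carry the POSIX tar magic at offset 257."""
    end = TAR_MAGIC_OFFSET + len(TAR_MAGIC)
    return data[TAR_MAGIC_OFFSET:end] == TAR_MAGIC


def make_sample_tar():
    """Build a tiny in-memory ustar archive to check against."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w", format=tarfile.USTAR_FORMAT) as tar:
        payload = b"hello"
        info = tarfile.TarInfo(name="hello.txt")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


print(looks_like_posix_tar(make_sample_tar()))
print(looks_like_posix_tar(b"\xff\xd8\xff" + b"\x00" * 600))
```

This prints `True` for the sample archive and `False` for the non-tar bytes. The check is deliberately strict: archives in the old GNU format use a slightly different magic string and would not match, which is part of why file(1) relies on a whole database of rules rather than a single comparison.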
simcop2387
Given the notation, I think they're referring to this: https://linux.die.net/man/5/magic
That said, I don't really see how it'd be all that relevant to the discussion, so maybe I'm missing something else.
yjftsjthsd-h
https://manpages.org/magic/5 is a database of file types, used by the file(1) command. I don't exactly follow how it's the same though; it would let you say "what files are videos" but not "what files are videos of a cat". Which is sort of related but unless I missed something there is a difference.
0x457
magic(5) means `man 5 magic`: https://linux.die.net/man/5/magic
It's just a tool that can read "magic bytes" to figure out what a file contains. Very different from what VectorVFS is.
bullen
I did something similar, but I use these EXT4 requirements:
- hard links (only tar works for backup)
- small file size (or inodes run out before disk space)
http://root.rupy.se
It's very useful for global distributed real-time data that don't need the P in CAP for writes.
(no new data can be created if one node is offline = you can login, but not register)
asadawadia
is the embedding for the whole file? or each 1024/512 byte chunk?
natas
this is actually a great idea
iugtmkbdfil834
Assuming I understand it correctly, the idea is to be able to have LLMs get through file systems more easily with some interesting benefits to human users as well. The idea is interesting and I want to try it out.
perone
Hi, there are no LLMs involved. It is all local: an embedding (vector representation) of the data is created and then used for search later. Nothing from your files is sent to the cloud, and there are no local LLMs running either, only the encoders (I use the Perception Encoder from Meta, released a few weeks ago).
malcolmgreaves
Fun idea storing embeddings in inodes! Very clever!
I want to point out that this isn’t suitable for the kinds of things you’d actually use a vector database for. There’s no notion of a search index. It’s always an O(N) linear search through all of your files: https://github.com/perone/vectorvfs/blob/main/vectorvfs/cli....
Still, fun idea :)
PaulHoule
The lack of an index is not bad at all if you have it stored contiguously in RAM: the mechanical sympathy is great, SIMD will spin like a top not to mention multithreaded programming, etc. Circa 2014 or so I worked on a search engine that scanned maybe 2GB worth of vectors for 10 million documents, queries were turned around in much less than a second, nobody complained about the speed.
If you gotta gather the data from a lot of different inodes, it is a different story.
ori_b
It's not stored contiguously in RAM. It's stored in extended attributes.
binarymax
O(n) is still OK for vector search if n isn't too large. Filesystem search solutions are currently terrible, with background indexing jobs and poor relevance. This won't scale for every file on your system but anything in your working documents folder would easily work well.
perone
Thanks. There is a bit of a nuance there, for example: you can build an index in first pass which will indeed be linear, but then later keep it in an open prompt for subsequent queries, I'm planning to implement that mode soon. But agree, it is not intended to search 10 million files, but you seldom have this use case in local use anyways.
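A rough sketch of that "load once, query many times" mode is below. It is only an illustration of the idea, not the planned implementation: `SessionIndex` and `load_embedding` are hypothetical names, and the loader stands in for whatever reads or computes a file's embedding.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


class SessionIndex:
    """Load embeddings once, then answer many queries from memory."""

    def __init__(self, paths, load_embedding):
        # load_embedding is a hypothetical callable:
        # path -> list of floats, or None for unsupported files.
        self.entries = []
        for path in paths:
            vec = load_embedding(path)
            if vec is not None:
                self.entries.append((path, vec))

    def search(self, query_vec, top_k=5):
        """Still a linear scan, but over memory rather than the filesystem."""
        scored = [(dot(query_vec, vec), path) for path, vec in self.entries]
        scored.sort(reverse=True)
        return scored[:top_k]
```

The first pass pays the full cost of reading every file's embedding; each later `search` call only pays for the in-memory comparison.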
int_19h
An index could be built on top of this though if desired. No need to have it in the FS itself.
esafak
Thanks for saving readers time. If so, this is not a viable tool for production.
esafak
Files-as-vector stores is LanceDB's value proposition. How do you compare in performance, etc.?
perone
This is quite different than LanceDB. In VectorVFS I'm using the inodes directly to store the embeddings, there is no external file with metadata and db, the db is your filesystem itself, that's the key difference.
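As a rough illustration of keeping an embedding on the file itself, here is a sketch that packs a vector into bytes and writes it to an extended attribute with the standard library's Linux-only `os.setxattr` and `os.getxattr`. The attribute name `user.example.embedding` and the float32 layout are assumptions made for this example, not the format VectorVFS actually uses.

```python
import os
import struct

# Hypothetical attribute name, chosen for illustration only.
ASSUMED_ATTR = "user.example.embedding"


def pack_embedding(vec):
    """Serialize floats as little-endian float32 bytes."""
    return struct.pack(f"<{len(vec)}f", *vec)


def unpack_embedding(blob):
    """Inverse of pack_embedding."""
    count = len(blob) // 4
    return list(struct.unpack(f"<{count}f", blob))


def store_embedding(path, vec, attr=ASSUMED_ATTR):
    """Attach the packed vector to the file itself (Linux only)."""
    os.setxattr(path, attr, pack_embedding(vec))


def read_embedding(path, attr=ASSUMED_ATTR):
    """Return the stored vector, or None if the attribute cannot be read."""
    try:
        return unpack_embedding(os.getxattr(path, attr))
    except OSError:
        return None
```

One practical constraint with this approach is that extended attribute values are size-limited (the Linux VFS caps a value at 64 KiB, and individual filesystems can be stricter), so the vector length and element type matter.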
esafak
That's an implementation detail, and it sounds more like a liability than a selling point, to have such tight coupling. (Why) do you see not using files as a good thing?
Let me ask another question: is this intended for production use, or is it more of a research project? Because as a user I care about things like speed, simplicity, flexibility, and robustness.
adenta
I wonder if I could use this locally on my MacBook. The Finder application's built-in search is kinda meh.
perone
I'm planning to support macOS; the only issue is with the encoders that I'm using now. I will probably work more on it next week to try to make a release that works on macOS as well. Thanks!
badmonster
interesting
pseudosavant
This immediately made me nostalgic for BeOS's BeFS or Windows Longhorn's WinFS database filesystems, and how this kind of thing would have fit them perfectly. So much cool stuff you could do with vectors for everything. Smart folders that include files for a project based on a description of the project. Show me all of my config files for appXYZ. Images of a black dog at the beach. At the OS-level for any other app to easily tap into.
I'd be surprised if cloud storage services like OneDrive don't already do some kind of vector for every file you store. But an online web service isn't the same as being built into the core of the OS.
perone
I share the same feeling. I think filesystems will have to reinvent themselves given how quickly ML models have become useful in the past few years.
didgetmaster
I built a local object store that was designed to replace file systems. You can create hundreds of millions of objects (e.g. files) and attach a variety of metadata tags to each one. A tag could be a number, string, or other data type (including vector info). Searches for objects with certain tags are exceptionally fast.
I invented it because I found searching conventional file systems that support extended attributes to be unbearably slow.
tugdual
Got a demo ?
p_ing
WinFS wasn't a file system laid down on hardware, it was just a SQL database that stored arbitrary data.
didgetmaster
I think that is one of the main reasons it failed to launch. It was just too easy for the metadata stored in the separate database to become out of sync with the actual file data.
Microsoft saw the tech support nightmare this could generate, and abandoned the project.
p_ing
It was abandoned due to The Cloud. There was no need for WinFS as a tech when you could store everything in The Cloud.
It was also complex, ran poorly, and would have required developers to integrate their applications.
Microsoft had long solved the problem of blobs and metadata in ESE and SharePoint's use of MS SQL for binary + metadata storage.
pseudosavant
They just weren't able to pull it off for whatever reason. I actually ran BeOS as my daily driver for quite a while (way) back in the day. BeFS was genuinely amazing, and not something I've seen replicated elsewhere yet. There hasn't really been anything interesting done in filesystems used by users on devices in a really long time.
null
Great idea indeed. The documentation needs a bit more information to be useful. What GPU backends are supported for example? How do I delete the embedding information after I decide to uninstall it? Will give it a try though.