Show HN: VectorVFS, your filesystem as a vector database

149 comments

May 5, 2025

jlhawn

If I understand correctly, this is attaching metadata to files in a format that LLMs (or any tool that can understand the semantic embedding vector) can leverage to understand what a file is without having to actually read the contents of the file.

That obviously has a lot of interesting use cases, but my first assumption was that this could be used to quickly/easily search your filesystem with some prompt like "Play the video from last month where we went camping and saw a flock of turkeys". But that would require having an actual vector DB running on your system which you could use to quickly look up files using an embedding of your query, no?

perone

Hi, it is quite different, there is no LLM involved, we can certainly use it for a RAG for example, but what is currently implemented is basically a way to generate embeddings (vector representation) which are then used for search later, it is all offline and local (no data is ever sent to cloud from your files).

jlhawn

I understand that LLMs aren't involved in generating the embeddings and adding the xattrs. I was just wondering what the value add of this is if there's no other background process (like mds on macOS) which is using it to build a search index.

I guess what I'm asking is: how does VectorVFS enable search besides iterating through all files and iteratively comparing file embeddings with the embedding of a search query? The project description says "efficient and semantically searchable" and "eliminating the need for external index files or services" but I can't think of any more efficient way to do a search without literally walking the entire filesystem tree to look for the file with the most similar vector.

Edit: reading the docs [1] confirmed this. The `vfs search TERM DIRECTORY` command:

> will automatically iterate over all files in the folder, look for supported files and then embed the file or load existing embeddings directly from the filesystem.

[1]: https://vectorvfs.readthedocs.io/en/latest/usage.html#vfs-se...
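For illustration, a minimal sketch of what such a linear scan amounts to. It assumes embeddings are packed float32 values in an xattr called `user.vectorvfs.embedding`; that attribute name and encoding are hypothetical, and the helper names are made up for this example rather than taken from VectorVFS.

```python
import heapq
import os
import struct

# Hypothetical attribute name; VectorVFS's real key and encoding may differ.
ATTR = "user.vectorvfs.embedding"


def read_embedding(path, attr=ATTR):
    """Return the float32 vector stored in an xattr, or None if absent (Linux only)."""
    try:
        raw = os.getxattr(path, attr)
    except OSError:
        return None
    count = len(raw) // 4
    return list(struct.unpack(f"<{count}f", raw[: count * 4]))


def dot(a, b):
    """Plain dot product, used here as the similarity score."""
    return sum(x * y for x, y in zip(a, b))


def linear_search(query, candidates, k=3):
    """Score every (path, vector) pair against the query: O(N) in the number of files."""
    scored = ((dot(query, vec), path) for path, vec in candidates if vec is not None)
    return heapq.nlargest(k, scored)


def walk_embeddings(root):
    """Yield (path, vector) for every file under root, one xattr read per file."""
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            yield path, read_embedding(path)
```

A call would look like `linear_search(query_vector, walk_embeddings("/some/dir"))`. Every query walks the whole tree again, which is the cost being asked about.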

freeamz

Yeah this kind of setup is indefinitely scalable, but not searchable without a meta db/index keeping track of all the nodes.

pilooch

Using it for a RAG is smart indeed, especially with a multimodal encoder (vision-rag), as the implementation would be straightforward from what you already have.

lstodd

if you go look up how xattrs work, you will understand it's no different than just reading a chunk of the file in question, and in fact can be slower.

xattrs are better forgotten already. it was just as dumb an idea as macos resource forks.

SoftTalker

Also locks you in to filesystems that support them, which are not all of them or on all operating systems.

lstodd

so, like magic(5)?

mywittyname

What is magic(5) and how is it similar to what was described?

danudey

magic(5) is a system for determining the type of a file by examining the 'magic bytes' at or near the start of a file.

For example, POSIX tar files have a defined file format that starts with a header struct: https://www.gnu.org/software/tar/manual/html_node/Standard.h...

You can see that at byte offset 257 is `char magic[6]`, which contains `TMAGIC`, which is the byte string "ustar\0". Thus, if a file has the bytes 'ustar\0' at offset 257 we can reasonably assume that it's a tar file. Almost every defined file type has some kind of string of 'magic' predefined bytes at a predefined location that lets a program know "yes, this is in fact a JPEG file" rather than just asserting "it says .jpg so let's try to interpret this bytestring and see what happens".
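As a small sketch of that check, the snippet below tests for the `ustar\0` bytes at offset 257 and builds a tiny in-memory archive to try it on. The function names are made up for this example; the real magic(5) database covers far more formats and rules.

```python
import io
import tarfile

TAR_MAGIC_OFFSET = 257
TAR_MAGIC = b"ustar\x00"  # TMAGIC from the POSIX header, including the trailing NUL


def looks_like_posix_tar(header: bytes) -> bool:
    """Check the six magic bytes at offset 257 of the first 512-byte header block."""
    end = TAR_MAGIC_OFFSET + len(TAR_MAGIC)
    return header[TAR_MAGIC_OFFSET:end] == TAR_MAGIC


def make_sample_tar() -> bytes:
    """Build a tiny in-memory ustar archive so the check has something to inspect."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
        payload = b"hello"
        info = tarfile.TarInfo(name="hello.txt")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()
```

Archives written in the old GNU format use `ustar` followed by two spaces instead, so a stricter or looser comparison may be needed depending on what you want to accept.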

As for how it's similar: I don't think it actually is, I think that's a misunderstanding. The metadata that this vector FS is storing is more than "this is a JPEG" or "this is a word document", as I understand it, so comparing it to magic(5) is extremely reductionist. I could be mistaken, however.

simcop2387

I think they're referring to this, https://linux.die.net/man/5/magic given the notation. That said I don't really see how it'd be all that relevant to the discussion so maybe i'm missing something else.

0x457

magic(5) means `man 5 magic`: https://linux.die.net/man/5/magic

It's just a tool that can read "magic bytes" to figure out what a file contains. Very different from what VectorVFS is.

yjftsjthsd-h

https://manpages.org/magic/5 is a database of file types, used by the file(1) command. I don't exactly follow how it's the same though; it would let you say "what files are videos" but not "what files are videos of a cat". Which is sort of related but unless I missed something there is a difference.

lstodd

four people answered strictly correctly as to what magic(5) is, but not a single one realized that storing some aux data as xattr in linux FS is not in any way different from just storing the exact same data as a file header. which is how magic(5) works.

how come?

(besides good luck not forgetting to rsync those xattrs)

malcolmgreaves

Fun idea storing embeddings in inodes! Very clever!

I want to point out that this isn’t suitable for any kind of actual things you’d use a vector database for. There’s no notion of a search index. It’s always an O(N) linear search through all of your files: https://github.com/perone/vectorvfs/blob/main/vectorvfs/cli....

Still, fun idea :)

PaulHoule

The lack of an index is not bad at all if you have it stored contiguously in RAM: the mechanical sympathy is great, SIMD will spin like a top not to mention multithreaded programming, etc. Circa 2014 or so I worked on a search engine that scanned maybe 2GB worth of vectors for 10 million documents, queries were turned around in much less than a second, nobody complained about the speed.

If you gotta gather the data from a lot of different inodes, it is a different story.

ori_b

It's not stored contiguously in ram. It's stored in extended attributes.

perone

Thanks. There is a bit of a nuance there, for example: you can build an index in first pass which will indeed be linear, but then later keep it in an open prompt for subsequent queries, I'm planning to implement that mode soon. But agree, it is not intended to search 10 million files, but you seldom have this use case in local use anyways.
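A rough sketch of that "first pass, then reuse" idea, assuming the vectors have already been read from the files once. The class name and layout are hypothetical and not the planned VectorVFS mode; the vectors are kept in one contiguous float32 buffer so later queries never touch the filesystem.

```python
import heapq
from array import array


class InMemoryIndex:
    """Load every vector once, then answer repeated queries from RAM."""

    def __init__(self, dim):
        self.dim = dim
        self.paths = []
        self.flat = array("f")  # one contiguous float32 buffer, row-major

    def add(self, path, vector):
        """Append one file's vector; called once per file during the first pass."""
        if len(vector) != self.dim:
            raise ValueError(f"expected {self.dim} dimensions, got {len(vector)}")
        self.paths.append(path)
        self.flat.extend(vector)

    def query(self, vector, k=5):
        """Still O(N) per query, but with no per-file syscalls after the first pass."""
        dim, flat = self.dim, self.flat
        scores = (
            (sum(flat[i * dim + j] * vector[j] for j in range(dim)), path)
            for i, path in enumerate(self.paths)
        )
        return heapq.nlargest(k, scores)
```

With a numeric library the same row-major buffer could be scored as a single matrix-vector product, which is where the SIMD benefit mentioned upthread would come from. The pure-Python loop here only shows the data layout.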

binarymax

O(n) is still OK for vector search if n isn't too large. Filesystem search solutions are currently terrible, with background indexing jobs and poor relevance. This won't scale for every file on your system but anything in your working documents folder would easily work well.

int_19h

An index could be built on top of this though if desired. No need to have it in the FS itself.

yencabulator

But then there's no point in storing anything in xattrs.

int_19h

The reason would be that it's there as the source of truth, and when files e.g. get copied around, so does the metadata. The indexer doesn't need to be synchronous wrt such operations though, it can just watch the FS for changes and spin up reindexing as needed asynchronously.
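One way to picture that asynchronous reindexing is a polling pass that compares each file against what the indexer last saw. This is only a sketch under assumptions: the bookkeeping dict and function names are hypothetical, and a real watcher would more likely subscribe to filesystem events than poll. It keys on ctime plus size because, on Linux, changing an xattr updates ctime rather than mtime.

```python
import os


def signature(path):
    """Cheap change marker: ctime moves on content and xattr changes alike."""
    st = os.stat(path)
    return (st.st_ctime_ns, st.st_size)


def find_stale(root, indexed):
    """Return files under root that are new or changed since they were indexed.

    indexed maps path -> signature recorded at index time (hypothetical
    bookkeeping kept by the indexer, not something VectorVFS provides).
    """
    stale = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            if indexed.get(path) != signature(path):
                stale.append(path)
    return sorted(stale)


def mark_indexed(paths, indexed):
    """Record the current signature of each path after its xattr was re-read."""
    for path in paths:
        indexed[path] = signature(path)
```

The xattrs stay the source of truth here: the external index can be thrown away and rebuilt from them at any time.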

esafak

thanks for saving readers time. If so this is not a viable tool for production.

anotherpaul

Great idea indeed. The documentation needs a bit more information to be useful. What GPU backends are supported for example? How do I delete the embedding information after I decide to uninstall it? Will give it a try though.

perone

Thanks, I'm working on implementing the commands to clean the embeddings (you can now do that with Linux xattr command-line tool). I'm supporting CPU or GPU (NVIDIA) for the encoders and it only supports Linux at the moment.

3abiton

I am curious why Python, and not rust for example?

danudey

Not OP, but despite working in an all-Go shop I just wrote a utility in Python the other week and caught some flak for it.

The reason I gave (which was accepted) was that the process of creating a proof of concept and iterating on it rapidly is vastly easier in Python (for me) than it is in Go. In essence, it would have taken me at least a week, possibly more, to write the program I ended up with in Golang, but it only took me a day to write it in Python, and, now that I understand the problem and have a working (production-ready) prototype, it would probably only take me another day to rewrite it in Golang.

Also, a large chunk of the functionality in this Python script seems to be libraries - pillow for image processing, but also pytorch and related vision/audio/codec libraries. Even if similar production-ready Rust crates are available (I'm not sure if they are), this kind of thing is something Python excels at and which these modules are already optimized for. Most of the "work" happening here isn't happening in Python, by and large.

perone

Hi, I think Rust won't bring much benefit here to be honest, the bottleneck is mainly the model and model loading. It would probably be a nightmare to load these models from Rust, I would have to use torch bindings and then convert everything from the preprocessing already in Python to Rust.

thirdtrigger

Might be interesting to add an optional embedded Weaviate [1] with a flat-index [2] to the project. It wouldn't use external services and is fully disk-based. Would allow you to search the whole filesystem (about 1.5kb per file (384 dimensions) which would be added to the metadata as well).

1. https://weaviate.io/developers/weaviate/installation/embedde... 2. https://weaviate.io/developers/academy/py/vector_index/flat
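The 1.5kb figure follows from simple arithmetic if each of the 384 dimensions is stored as a 4-byte float32, which is an assumption here rather than something stated above. A quick sketch to check it, with made-up function names:

```python
import struct

DIMENSIONS = 384  # figure quoted in the comment above


def embedding_bytes(dimensions=DIMENSIONS):
    """Size of one embedding when packed as little-endian float32 values."""
    return len(struct.pack(f"<{dimensions}f", *([0.0] * dimensions)))


def total_mebibytes(file_count, dimensions=DIMENSIONS):
    """Raw vector storage for file_count files, in MiB (no index overhead)."""
    return file_count * embedding_bytes(dimensions) / (1024 * 1024)
```

So 384 × 4 = 1536 bytes per file, and about 146.5 MiB of raw vectors for 100,000 files before any index structure is counted.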

binarymax

Why weaviate and not FAISS? The latter is faster and lighter.

bobvanluijt

It depends on additional filters and whether you want to use vector search only. The upside of using Faiss would be storing the ID as file metadata and embedding it in the Faiss index. However, if you need any other filters or data, you would need to store it somewhere else.

lysp

I think they are associated with the project

ndsipa_pomu

I've long wanted to have a linux filesystem that robustly supported "tags" for files so that I didn't have to rely on the filesystem hierarchy to represent media files etc. e.g. I might want to tag a particular film as "Scifi" and also "Horror". Of course, for films, NFO files are typically used for this kind of metadata, but I'd like a similar facility that could be applied to any type of file.

sneak

That is literally what xattrs are for.

ndsipa_pomu

Yes, but they seem fairly limited in terms of userspace programs. How would you use xattrs to produce a filesystem hierarchy that say, listed the same file in multiple folders according to the attributes?

sneak

The xattrs are for storing the tag metadata, you’d use other tools (easily composed from shell utilities) to find files that match tags. If you really want it to be in multiple locations, you could make a fuse interface that shows directories full of files matching specific tags.
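A minimal sketch of that approach, assuming tags are kept as a comma-separated list in a `user.tags` attribute. The attribute name and the helper functions are invented for this example, and the xattr calls are Linux-only.

```python
import os

TAG_ATTR = "user.tags"  # hypothetical attribute name, chosen for this example


def encode_tags(tags):
    """Serialize a set of tags as sorted, comma-separated UTF-8 bytes."""
    return ",".join(sorted(set(tags))).encode("utf-8")


def decode_tags(raw):
    """Parse the stored bytes back into a set, ignoring empty entries."""
    return {t for t in raw.decode("utf-8").split(",") if t}


def set_tags(path, tags):
    os.setxattr(path, TAG_ATTR, encode_tags(tags))  # Linux only


def get_tags(path):
    try:
        return decode_tags(os.getxattr(path, TAG_ATTR))
    except OSError:  # attribute missing or filesystem without xattr support
        return set()


def find_tagged(root, wanted):
    """Yield files under root that carry every tag in wanted."""
    wanted = set(wanted)
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            if wanted <= get_tags(path):
                yield path
```

Something like `list(find_tagged("/media/films", {"Scifi", "Horror"}))` would then list films carrying both tags, and a FUSE layer could present the same result as a directory.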

quantadev

I've been wondering for about 20 years why File Systems basically died and stopped innovating. For example we have lots of hierarchical data structures in the world, and no one seems to have figured out how to let a folder be the storage, instead of always just databases.

For example, if we simply had the ability to have "ordered" files inside folders, that would instantly make it practical for a folder structure to represent "Documents". After all, documents are nothing but a list of paragraphs and images, so if we simply had ordering in file systems we could have document editors which are using individual files for each paragraph of text or image. It would be amazing.

Also think about use cases like Jupyter Notebooks. We could stop using the JSON file format, and just make it a folder structure instead. Each cell (node) being in a file. All social media messages and chatbot conversations could be easily saved as folder structures.

I've heard many file copy tools ignore XATTR so I've never tried to use it for this purpose, so maybe we've had the capability all along and just nobody thought to use it in a big way that became popular yet. Maybe I should consider XATTR and take it seriously.

wfn

I agree! I'm sort of exploring "programmable filesystem" concept (using FuseFS) (for some notes, see [1]).

Re: ordered files: depends on FS. e.g. filesystems which use B+ trees will tend to have files (in directories) in lexical order. So in some cases you may not need a new FS:

    echo 'for f in *.txt; do cat "$f"; done' > doc.sh; chmod +x doc.sh

=> `doc.sh` in dir produces 'documents' (add newlines / breaks as needed, or add piping through Markdown processor); symlink to some standardized filename 'Process', etc...

That said... wouldn't it be nice to have ridiculously easy pluggable features like

    echo "finish this poem: roses are red," > /auto-llm/poem.txt; cat ..

[1]: chaotic notes: https://kfs.mkj.lt/#welcome (see bullet point list below)

quantadev

Interesting stuff. Thanks for posting that.

b0a04gl

If VectorVFS obscures retrieval logic behind opaque embeddings, how do users debug why a file surfaced—or worse, why one didn’t?

perone

Hi, not sure if I understood what you meant by opaque embeddings as well, but the reason why files surface or not is due to the similarity score (which is basically the dot product of embeddings).

jlhawn

How much work do you think it would be to also have a separate xattr which has a human-readable description of the file contents? I wonder if that might already be an intermediate product of some of the embedding tools, like "arbitrary media" -> "text description of media" -> "embedding vector". You could store both of those as xattrs and you could debug by comparing your text query with the text description of the file contents as they should produce similar embedding vectors. You could even audit any file, assuming you know what its contents are, by checking the text description xattr generated by this program.

b0a04gl

[dead]

refulgentis

What is a non-opaque embedding?

Does VectorVFS do retrieval, or store embeddings in EXT4?

Is retrieval logic obscured by VectorVFS?

If VectorVFS did retrieval with non-opaque embeddings, how would one debug why a file surfaced?

PeterZaitsev

I think comparing it to a vector database is confusing, as a database would typically mean indexes and some sort of query support.

Storing embeddings with the file is an interesting concept... we already do it for some file formats (e.g. EXIF), where this one is generalized... yet you would need to have some actual database to load this data into to process at scale.

Another issue I see is support for different models and embedding formats to make this data really portable - like I can take my file, drop it into any system, and its embedding "seamlessly" integrates.

bullen

I did something similar, but I use these EXT4 requirements:

  - hard links (only tar works for backup)
  - small file size (or inodes run out before disk space)

http://root.rupy.se

It's very useful for global distributed real-time data that don't need the P in CAP for writes.

(no new data can be created if one node is offline = you can login, but not register)

yencabulator

> Zero-overhead indexing Embeddings are stored as extended attributes (xattrs) on each file, eliminating the need for external index files or services.

Ain't no such thing as zero-overhead indexing. Just because you can't articulate where the overhead is doesn't make it disappear.

natas

this is actually a great idea

iugtmkbdfil834

Assuming I understand it correctly, the idea is to be able to have LLMs get through file systems more easily with some interesting benefits to human users as well. The idea is interesting and I want to try it out.

perone

Hi, there are no LLMs involved, it is all local and an embedding (vector representation) of the data is created and then that is used for search later, nothing is sent to cloud from your files and there are no local LLMs running as well, only the encoders (I use the Perception Encoder from Meta released a few weeks ago).

PeterStuer

If there is no indexing, how will your search time not increase linearly or worse with the number of files?