
We built another object storage

17 comments · December 13, 2025

jamiesonbecker

These questions are meant to be constructively critical, but not hyper-critical: I'm genuinely interested and a big fan of open-source projects in this space:

* In terms of a high-performance AI-focused S3 competitor, how does this compare to NVIDIA's AIstore? https://aistore.nvidia.com/

* What's the clustering story? Is it complex like Ceph, does it require K8s like AIstore for full functionality, or is it more flexible like Garage, Minio, etc.?

* You spend a lot of time talking about performance; do you have any benchmarks?

* Obviously most of the page was written by ChatGPT: what percentage of the code was written by AI, and has it been reviewed by a human?

* How does the object storage itself work? How is it architected? Do you use a DHT, for example? What tradeoffs are there (CAP, for example) vs the 1.4 gazillion alternatives?

* Are there any front-end or admin tools (and screenshots)?

* Can a cluster scale horizontally, or only vertically (i.e. Minio)?

* Why not instead just fork a previous version of Minio and then put a high-speed metadata layer on top?

* Is there any telemetry?

* Although it doesn't matter as much for my use case as for others, what is the specific jurisdiction of origin?

* Is there a CLA, and does that CLA involve assigning rights like copyright (which helps prevent the 'rug-pull' closing-source scenario)?

* Is there a non-profit Foundation, a goal for CNCF sponsorship, or another trusted third party to ensure that the software remains open source (although forks of prior versions mostly mitigate that concern)?

Thanks!

mrweasel

> the page was written by ChatGPT

I wonder if that's why it's all over the place. Meta engine written in Zig, okay, do I need to care? Gateway in Rust... probably a smart choice, but why do I need to be able to pick between web frameworks?

> Most object stores use LSM-trees (good for writes, variable read latency) or B+ trees (predictable reads, write amplification). We chose a radix tree because it naturally mirrors a filesystem hierarchy

Okay, so are radix trees good for writes and reads, bad for both, or somewhere in between?
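For illustration only (a toy sketch, not the project's metadata engine, and it skips the edge compression a real radix tree would do): with a prefix tree keyed on path segments, reads and writes both cost roughly one hop per path segment, and listing under a prefix is a subtree walk, which is presumably what "mirrors a filesystem hierarchy" is getting at.

```python
# Toy path-segment prefix tree (a radix tree minus edge compression).
# Not the project's code; it just shows why the structure maps onto
# filesystem-style keys: lookups and inserts walk one node per segment,
# and a prefix listing is a traversal of one subtree.

class _Node:
    __slots__ = ("children", "value")
    def __init__(self):
        self.children = {}   # path segment -> _Node
        self.value = None    # object metadata if a key terminates here

class PathTree:
    def __init__(self):
        self.root = _Node()

    def put(self, key: str, meta: dict) -> None:
        node = self.root
        for seg in key.strip("/").split("/"):
            node = node.children.setdefault(seg, _Node())
        node.value = meta

    def get(self, key: str):
        node = self.root
        for seg in key.strip("/").split("/"):
            node = node.children.get(seg)
            if node is None:
                return None
        return node.value

    def list_prefix(self, prefix: str):
        node = self.root
        parts = [p for p in prefix.strip("/").split("/") if p]
        for seg in parts:
            node = node.children.get(seg)
            if node is None:
                return
        stack = [("/".join(parts), node)]
        while stack:
            path, n = stack.pop()
            if n.value is not None:
                yield path, n.value
            for seg, child in n.children.items():
                stack.append((f"{path}/{seg}" if path else seg, child))

t = PathTree()
t.put("datasets/train/shard-0001.tar", {"size": 1 << 30})
t.put("datasets/train/shard-0002.tar", {"size": 1 << 30})
print(list(t.list_prefix("datasets/train")))
```

In this toy form it is neither write-optimized nor read-optimized in the LSM/B+ sense; the interesting trade-offs (memory locality, persistence, concurrency) are exactly what the post doesn't spell out.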

What is "physiological logging"?

randallsquared

A hybrid of physical logging, which is logging page-by-page changes, and logical logging, which is recording the activity performed at an intent level. If you do both of these, it's apparently "physiological", which I imagine was first conceived of as "physio-logical".

I could only find references to this in database systems course notes, which may indicate something.
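To make the distinction concrete, here is a toy illustration with made-up record shapes (not taken from any particular engine): a physiological record is addressed to a specific page, like a physical record, but describes the change as an operation within that page, like a logical one.

```python
# Made-up WAL record shapes illustrating the three logging styles.

# Physical: raw byte images of a page region (large, but replay is trivial).
physical_record = {
    "page_id": 42,
    "offset": 128,
    "before": b"\x00" * 16,
    "after":  b"\x01" * 16,
}

# Logical: the high-level intent (compact, but replay must redo the whole op).
logical_record = {
    "op": "PUT",
    "bucket": "photos",
    "key": "2025/cat.jpg",
    "meta": {"size": 4096},
}

# Physiological: physical about *which* page, logical about what happens in it.
physiological_record = {
    "page_id": 42,                 # physical addressing
    "op": "insert_slot",           # logical operation, scoped to that page
    "args": {"slot": 7, "key": "2025/cat.jpg"},
}
```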

kburman

I feel like this product is optimizing for an anti-pattern.

The blog argues that AI workloads are bottlenecked by latency because of 'millions of small files.' But if you are training on millions of loose 4KB objects directly from network storage, your data pipeline is the problem, not the storage layer.

Data Formats: Standard practice is to use formats like WebDataset, Parquet, or TFRecord to chunk small files into large, sequential blobs. This negates the need for high-IOPS metadata operations and makes standard S3 throughput the only metric that matters (which is already plentiful).

Caching: Most high-performance training jobs hydrate local NVMe scratch space on the GPU nodes. S3 is just the cold source of truth. We don't need sub-millisecond access to the source of truth, we need it at the edge (local disk/RAM), which is handled by the data loader pre-fetching.

It seems like they are building a complex distributed system to solve a problem that is better solved by tar -cvf.
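For what it's worth, the packing step really is about that simple. A minimal sketch using Python's standard tarfile module (the directory layout and the ~1 GiB shard size are arbitrary choices; WebDataset and TFRecord are essentially more polished versions of the same idea): small samples get bundled into large shards, so training reads become a few big sequential GETs instead of millions of tiny, metadata-heavy ones.

```python
# Sketch: pack many small files into ~1 GiB tar shards, then read one shard
# sequentially. Illustrative only; paths and shard size are made up.
import glob
import os
import tarfile

SHARD_SIZE = 1 << 30  # ~1 GiB per shard (arbitrary)

def pack_shards(src_dir: str, out_prefix: str) -> None:
    shard_idx, shard_bytes = 0, 0
    tar = tarfile.open(f"{out_prefix}-{shard_idx:05d}.tar", "w")
    for path in sorted(glob.glob(os.path.join(src_dir, "**", "*"), recursive=True)):
        if not os.path.isfile(path):
            continue
        size = os.path.getsize(path)
        if shard_bytes and shard_bytes + size > SHARD_SIZE:
            tar.close()
            shard_idx, shard_bytes = shard_idx + 1, 0
            tar = tarfile.open(f"{out_prefix}-{shard_idx:05d}.tar", "w")
        tar.add(path, arcname=os.path.relpath(path, src_dir))
        shard_bytes += size
    tar.close()

def iter_samples(shard_path: str):
    # One open stream per shard: no per-object round trips to the store.
    with tarfile.open(shard_path, "r") as tar:
        for member in tar:
            if member.isfile():
                yield member.name, tar.extractfile(member).read()
```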

hansvm

Nice. I was looking at building an object store myself. It's fun to see what features other people think are important.

I'm curious about one aspect though. The price comparison says storage is "included," but that hides the fact that you only have 2TB on the suggested instance type, bringing the storage cost to $180/TB/mo if you pay each year up-front for savings, $540/TB/mo when you consider that the durability solution is vanilla replication.

I know that's "double counting" or whatever, but the read/write workloads being suggested here are strange to me. If you only have 1875GB of data (achieved with 3 of those instances because of replication) and sustain 10k small-object (4KiB) QPS as per the other part of the cost comparison, you're describing a world where you read and/or write 50x your entire storage capacity every month.

I know there can be hot vs cold objects or workloads where most data is transient, but even then that still feels like a lot higher access amplification than I would expect from most workloads (or have ever observed in any job I'm allowed to write about publicly). With that in mind, the storage costs themselves actually dominate, and you're at the mercy of AWS not providing any solution even as cheap as 6x the cost of a 2-year amortized SSD (and only S3 comes close -- it's worse when you rent actual "disks," doubly so when they're high-performance).
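Running the numbers in the comment above (1875 GB of unique data, 10k QPS of 4 KiB objects, a 30-day month) lands in the same ballpark as the ~50x figure:

```python
# Back-of-the-envelope check of the access-amplification estimate above.
usable_bytes = 1875 * 10**9           # ~1.875 TB of unique data (3x replicated)
qps = 10_000                          # sustained small-object requests/second
object_bytes = 4 * 1024               # 4 KiB per object
seconds_per_month = 30 * 24 * 3600

monthly_bytes = qps * object_bytes * seconds_per_month
print(f"{monthly_bytes / 1e12:.0f} TB/month, "
      f"~{monthly_bytes / usable_bytes:.0f}x the stored data")
# -> about 106 TB/month, roughly 57x the unique data set
```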

websiteapi

It's always interesting to me how our profession keeps reimplementing the same sort of thing over and over and over again. Is it just inherent to the ease with which our experiments can be conducted?

oersted

Small objects and low latency.

Why not use any of the great KV stores out there? Or a traditional database even.

People use object storage for the low cost, not because it is a convenient abstraction. I suspect some people use the faster, more expensive S3 simply as a stopgap: they started with object storage, the requirements changed, it is no longer the right tool for the job but it is a hassle to switch, and AWS is taking advantage of their situation. I suppose that offering an alternative to those people at a non-extortionate price is a decent business model, but I am not sure how big that market is or how long it will last.

But object storage at the price of a database, with the performance of a database, is just a database, and I doubt that quickly reinventing that wheel yielded anything too competitive.

Aperocky

So they built an object storage to replace filesystem.

And in "Why Not Just Use a Filesystem?", the answer they gave is "the line is already blurring" and "industry is converging".

The line may be blurring, but as mentioned this is still a clear-cut use case for a filesystem; or, if higher access speed is warranted, just add more RAM to the system and cache the files. It will still cost less, even at the current cost of RAM.

andai

HN's version of this title is unintentional comedy :)

dbacar

One can only hope this does not go in the same direction as Minio once they gain momentum.

tsuru

Every time I hear hierarchical storage, I can't help but think "It's all coming back to MUMPS, isn't it?"

whinvik

Interesting. Have you seen any benefits from using io-uring? It seems io-uring is constantly talked about, but no one seems to be really using it in anger.

6r17

Io-uring has its fair share of CVEs; I'm wondering if people are checking these out, because the goal is not just to make something fast, but fast and secure. It's a bit of a grey area in my opinion for prod on public machines. Does anyone have a counter view on this? I'm genuinely curious; maybe I'm over-cautious.

PS: there are actually other, faster and more secure options than io-uring, but I won't spoil ;)

hansvm

My understanding is that the iouring CVEs are about local privilege escalation, not being appropriately sandboxed, etc. If you're only running code you trust on machines with iouring enabled then you're fine (give or take "defense in depth").

Is that not accurate?

ChocolateGod

so they added a metadata engine to S3?

How does that compare to something like JuiceFS?