MillenniumDB: Property graph and RDF engine, still in development
22 comments
·January 31, 2025
smarx007
I think if someone is just trying out RDF, it is better to start with Apache Jena/Fuseki or Eclipse RDF4J. Maybe https://github.com/oxigraph/oxigraph if you like to live dangerously (i.e. to use pre-1.0 DBMSs).
Choosing other systems means weighing tradeoffs that are probably not ideal for newcomers. For example, QLever, mentioned here, is good on query performance and relative disk use, but once the import is done it's essentially a read-only DB and completely unsuitable for a typical OLTP scenario.
Having said that, the Chilean research group that is driving the development of MillenniumDB is very well-regarded in the RDF/semantic web querying space.
FjordWarden
If you expect Jena to be more battle-tested because it is older, forget it: if the process is killed by an unexpected shutdown or for some other reason, the result is data corruption. At least this was my experience a few years ago.
I found graph databases a beguiling idea when I first learned about them, and this is a welcome addition, but I've since tempered my excitement. They are not as flexible and universal a model as is often promised. Everything is a graph, sure, but the result of your SPARQL query is not necessarily one.
I found classical DBMSs based on sets/multisets to be much easier to compose from a querying point of view. A table is a set/multiset and the result of a query is also a set/multiset. SPARQL guarantees no such composability. Maybe it does if you want to start mucking around with inference engines, but then you'll run into problems of undecidability.
smarx007
I said suitable for newcomers, i.e. people touching RDF for the first time. If you want production-ready, you probably want Stardog, Ontotext GraphDB, or AWS Neptune; none of them is cheap. https://github.com/the-qa-company/qEndpoint is also an interesting project that's used in production.
PaulHoule
Jena lets you make little in-memory triple stores that you can use the way people use the list-map-scalar trinity. I've been working on this publication about that (RDF for difficult cases and when ordering counts) for years and it just got published last week
https://www.iso.org/standard/76310.html
I'll call out my collaborator Liju Fan for being the only person I've met who knew how to do anything interesting with OWL. (Well, I can do interesting things now but I owe it all to her.)
(For the research for that paper I used rdflib under PyPy because CPython was not fast enough.)
When I needed big persistent triple stores (that you use the way you might use postgres) I used to use
https://en.wikipedia.org/wiki/Virtuoso_Universal_Server
and had pretty good luck loading a billion triples as long as I used plenty of 'stabilizers' (create a new AWS instance with ample RAM, use scripts to load a billion triples starting from an empty database, shut it down, make an AMI, start a new instance with the AMI, expect it to warm up for 20 minutes or so before query performance is good).
I don't regularly build systems on SPARQL today because of problems with updating. In particular, SQL has an idea of a "record", which is a row in a table; document-oriented databases have an idea of a "record" which is a bit more flexible. Updating a SPARQL database is a little bit dangerous because there is no intrinsic idea of what a record is. I mean, you can define one by starting at a particular URI, traversing to the right across blank nodes, and saying the result is a 'record', and it works OK. But it's a discipline that I impose on it with my libraries; it ought to be baked into the standards, baked into the databases, wrapped up in transactions, etc. For anything OLTP-ish I am still using SQL or document-oriented databases. But in document-oriented databases I hate the lack of namespaces and similar affordances that make SPARQL scalable in terms of "smash together a bunch of data from different sources", whereas SPARQL is missing the affordances that document-oriented databases have for handling ordered collections. We badly need a SPARQL 2 which makes the kind of work that I talk about in that technical report easy.
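One way to read the "record" convention described above is as a traversal: start at a URI, collect its triples, and keep following objects that are blank nodes. Below is a minimal sketch in plain Python, assuming triples are `(subject, predicate, object)` tuples of strings and blank nodes are marked with a `_:` prefix. The `record` helper and the sample data are hypothetical and not part of any standard or library.

```python
from collections import deque

def is_blank(term):
    # Assumption: blank nodes are strings written with a "_:" prefix.
    return term.startswith("_:")

def record(triples, start):
    """Collect the triples reachable from `start` by following blank-node objects."""
    by_subject = {}
    for s, p, o in triples:
        by_subject.setdefault(s, []).append((s, p, o))

    seen = {start}
    queue = deque([start])
    result = set()
    while queue:
        node = queue.popleft()
        for triple in by_subject.get(node, []):
            result.add(triple)
            obj = triple[2]
            # Stop at URIs and literals; only blank nodes extend the record.
            if is_blank(obj) and obj not in seen:
                seen.add(obj)
                queue.append(obj)
    return result

# Hypothetical sample data: ex:bob is a separate URI, so it is not part of alice's record.
graph = {
    ("ex:alice", "ex:name", "Alice"),
    ("ex:alice", "ex:address", "_:a1"),
    ("_:a1", "ex:city", "Oslo"),
    ("ex:alice", "ex:knows", "ex:bob"),
    ("ex:bob", "ex:name", "Bob"),
}

alice_record = record(graph, "ex:alice")
```

Updating the "record" would then mean deleting or replacing exactly this set of triples. Nothing in the store enforces that boundary, which is the concern raised in the comment.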
smarx007
> Updating a SPARQL database is a little bit dangerous because there is no intrinsic idea of what a record is
SPARQL has a notion of a transactional boundary just like SQL has. You can combine multiple SPARQL queries in one transaction, and they will all succeed or all fail, just like you'd expect.
svilen_dobrev
Datomic (and partially XTDB, formerly Crux) are OLTP-ish and use only such "tuples". Essentially it's up to the user to define what constitutes an entity, if at all ("row", "object", "document", whatever): maybe some entity-id and everything linked to it, but maybe other less identity-related stuff. That might feel freeing to an extent but, as you said, it also demands great responsibility/discipline to cobble the proper properties together.
zozbot234
> SPARQL guarantees no such composability.
SPARQL has a CONSTRUCT clause which gives you RDF as your query output. Isn't that compositional enough?
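The composability argument can be illustrated without a triple store. Below is a minimal sketch in plain Python, assuming a graph is a set of `(s, p, o)` tuples and variables are strings starting with `?`. The `match` and `construct` helpers are hypothetical stand-ins that handle a single triple pattern, which is far less than a real SPARQL CONSTRUCT does.

```python
def match(pattern, triple):
    """Return variable bindings if `triple` matches `pattern`, else None."""
    bindings = {}
    for pat, val in zip(pattern, triple):
        if pat.startswith("?"):
            # A repeated variable must bind to the same value each time.
            if bindings.setdefault(pat, val) != val:
                return None
        elif pat != val:
            return None
    return bindings

def construct(graph, where, template):
    """Build a new graph by instantiating `template` for each match of `where`."""
    out = set()
    for triple in graph:
        b = match(where, triple)
        if b is not None:
            # Terms that are not bound variables are copied through unchanged.
            out.add(tuple(b.get(term, term) for term in template))
    return out

# Hypothetical sample data.
graph = {
    ("ex:alice", "ex:parentOf", "ex:bob"),
    ("ex:bob", "ex:parentOf", "ex:carol"),
}

# The output of the first query is a graph, so it can feed the second one.
child_of = construct(graph, ("?p", "ex:parentOf", "?c"), ("?c", "ex:childOf", "?p"))
children = construct(child_of, ("?c", "ex:childOf", "?p"), ("?c", "rdf:type", "ex:Child"))
```

Because both the input and the output are sets of triples, the second call consumes the first call's result directly. That is the sense in which CONSTRUCT output composes.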
FjordWarden
Ok, that is true, but how do I tell my graph database that the result of the construct query is some other graph in my DB?
hobofan
As someone who has built production systems with Oxigraph (and a bit less with Jena), I'd recommend Oxigraph over Jena any day, especially if you are working with a Rust-based tech stack.
You can save so much time and headache based on less operational complexity and the architectural options it opens up. If you only reinvest part of that into building a framework for versioning/backups, etc. you'll have a much better overall package.
iddan
Systems like Apache Jena are not production ready for anything serious. It makes total sense to start something different.
jerven
MillenniumDB is an interesting engine, as is QLever, mentioned in other comments. I think both are good candidates for making RDF graphs one or two orders of magnitude cheaper to host as SPARQL endpoints.
Both seem to have arrived at the stage of transitioning from research to production code.
Very exciting for those of us providing our data in RDF and exposing SPARQL.
AWS Neptune Analytics is also very interesting, allowing Cypher on RDF graphs. Even Oracle's built-in RDF+SPARQL seems to have improved greatly in 23ai.
WhatIsDukkha
Here is a bug with some back and forth between MillenniumDB and QLever about starting a benchmarking attempt. I don't see results, though they managed to build and import.
jitl
Is it any good?
UltraSane
What is a domain graph?
null
These guys write really great papers!
We implemented a simplified version of their ring index for our data space (https://github.com/triblespace/tribles-rust/blob/master/src/...), and it's a really simple and cool idea once you wrap your head around it. Funnily enough, we built this even before the paper was officially published, because we found a preprint on one of the authors' blogs. They had published the idea itself before, but their new paper made it a lot easier to understand (Burrows-Wheeler transforms vs. stable column sorting).
It's really too bad that the whole linked-data space is completely gunked up with RDF.
PS: If anyone plans on implementing their ring index, using 0-based offsets makes the formulas much more streamlined; their paper uses 1-based indexing, so they have to +/-1 all over the place.