
Why RDF Is the Natural Knowledge Layer for AI Systems

IanCal

This seems to miss the other side of why all this failed before.

RDF has the same problems as SQL schemas, with information scattered. What fields mean requires documentation.

There - they have a name on a person. What name? Given? Legal? Chosen? Preferred for this use case?

You only have one ID for Apple, eh? Companies are complex to model: do you mean Apple just as someone would talk about it? Of the legal structure of entities that underpins every major company, which part is being referred to?

I spent a long time building identifiers for universities and companies (which was taken for ROR later) and it was a nightmare to say what a university even was. What’s the name of Cambridge? It’s not “Cambridge University” or “The University of Cambridge” legally. But those are also the actual names as people use them. The University of Paris went from something like 13 institutes to maybe one to then a bunch more. Is a company’s location its headquarters? Which headquarters?

Someone will suggest modelling to solve this but here lies the biggest problem:

The correct modelling depends on the questions you want to answer.

Our modelling had good tradeoffs for mapping academic citation tracking. It had bad modelling for legal ownership. There isn’t one modelling that solves both well.

And this is all for the simplest of questions about an organisation - what is it called and is it one or two things?
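A minimal sketch of the naming problem described above, assuming plain Python tuples as triples. The organisation, its names, and the use-case mapping are all hypothetical; the `schema:name`, `schema:legalName`, and `schema:alternateName` property names are borrowed from schema.org only as labels.

```python
# Hypothetical triples about one organisation; every identifier and value
# here is made up for illustration.
ORG = "ex:ExampleUniversity"

triples = {
    (ORG, "schema:name", "Example University"),
    (ORG, "schema:legalName", "The Chancellor and Scholars of Example"),
    (ORG, "schema:alternateName", "Uni of Example"),
}

# Which predicate counts as "the name" depends on the question being asked.
# This mapping is an assumption of the sketch, not part of any standard.
NAME_PREDICATE_BY_USE_CASE = {
    "citation_tracking": "schema:name",
    "legal_ownership": "schema:legalName",
}


def name_for(subject, use_case, graph):
    """Return the name that fits a use case, or None if none is recorded."""
    predicate = NAME_PREDICATE_BY_USE_CASE[use_case]
    for s, p, o in graph:
        if s == subject and p == predicate:
            return o
    return None


print(name_for(ORG, "citation_tracking", triples))
print(name_for(ORG, "legal_ownership", triples))
```

The graph can hold every variant, but something outside the graph still has to decide which one answers a given question.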

jandrewrogers

As the article itself points out, this has been around for 25 years. It isn’t an accident that nobody does things this way, it wasn’t an oversight.

I worked on semantic web tech back in the day, the approach has major weaknesses and limitations that are being glossed over here. The same article touting RDF as the missing ingredient has been written for every tech trend since it was invented. We don’t need to re-litigate it for AI.

rglullis

I would be very interested in reading why you think it can't work. I am inclined to agree with the post on a sibling thread that mentions that the main problem with RDF is that it has been captured by academia.

FrankyHollywood

The article states "When that same data is transformed into a knowledge graph"

This is a non-trivial exercise. How does one transform knowledge into a knowledge graph using RDF?

RDF is extremely flexible and can represent any data, and that's exactly its great weakness. It's such a free format that there is no consensus on how to represent knowledge. Many academic panels exist to set standards, but many of these efforts end up on GitHub as unmaintained repositories.

The most important thing about RDF is that everyone needs to agree on the same modeling standards and use the same ontologies. This is very hard to achieve, and room for a lot of discussion, which makes it 'academic' :)

4ndrewl

IME it's less than a "capture", more that most outside of academia don't have the requisite learning to be able to think in the abstract outside of trivial examples.

flanked-evergl

RDF is great but it's somewhat inadvertently captured by academia.

The tooling is not in a state where you can use it for any commercial or mission critical application. The tooling is mainly maintained by academics, and their concerns run almost exactly counter to normal engineering concerns.

An engineer would rather have tooling with limited functionality that is well designed and behaves correctly without bugs.

Academics would rather have tooling with lots of niche features, and they can tolerate poor design, incorrect behavior and bugs. They care more for features, even if they are incorrect, as they need to publish something "novel".

The end result is that almost everything you find for RDF is academic quality, and lots of it is abandoned because it was just part of publication spam being pumped and dumped by academics who need to publish or perish.

Anyone who wants to use it commercially really has to start from scratch almost.

philjohn

Yes and no.

I worked for a company that went hard into "Semantic Web" tech for libraries (as in, the places with books), using an RDF Quad Store for data storage (OpenLink Virtuoso) and structuring all data as triples - which is a better fit for the hierarchical MARC21 format than a relational database.

There are a few libraries (the software kind) out there that follow the W3 spec correctly, Redland being one of them.

ragebol

Tooling sounds like it can be fixed? If the knowledge bases are useful, why not use them with better tools?

jraph

> even if they are incorrect

Uh. Do you have a source for this? Correctness is a major need in academia.

rglullis

Correct != Bug-free.

My experience working with software developed by academics is that it is focused on getting the job done for a very small user base of people who are okay with getting their hands dirty. This means lots of workarounds, one-off scripts, zero regards for maintainability or future-proofing...

lmm

> Correctness is a major need in academia.

How so? Consider the famous result that most published research findings are false.

tsimionescu

I think they mean things like a tool that has feature X, even if it crashes 50% of the time when it is used, is preferable to a tool that doesn't have feature X at all.

rglullis

Wrote this about one month ago here at https://news.ycombinator.com/item?id=44839132

I'm completely out of time or energy for any side project at the moment, but if someone wants to steal my idea: please take an LLM and fine-tune it so that it can take any question and turn it into a SPARQL query for Wikidata. Also, make a web crawler that reads the page and turns it into a set of RDF triples or QuickStatements for any new facts that are presented. This would effectively be the "ultimate information organizer" and could potentially turn Wikidata into most people's entry page of the internet.

IanCal

Even without tuning, Claude is pretty solid at this: just give it the SPARQL endpoint as a tool call. Claude can generate this integration too.
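A rough sketch of what such a tool function could look like, using only the Python standard library. The names `build_sparql_url` and `run_sparql` and the User-Agent string are hypothetical; how the function gets registered as a tool depends on the model API and is left out.

```python
import json
import urllib.parse
import urllib.request

# Public Wikidata SPARQL endpoint.
WIKIDATA_ENDPOINT = "https://query.wikidata.org/sparql"


def build_sparql_url(query, endpoint=WIKIDATA_ENDPOINT):
    """Build a GET URL asking the endpoint for JSON results."""
    params = urllib.parse.urlencode({"query": query, "format": "json"})
    return f"{endpoint}?{params}"


def run_sparql(query, endpoint=WIKIDATA_ENDPOINT, timeout=30):
    """Hypothetical tool function: run a query and return the result rows."""
    request = urllib.request.Request(
        build_sparql_url(query, endpoint),
        # Placeholder User-Agent; a real client should identify itself.
        headers={"User-Agent": "sparql-tool-sketch/0.1 (example)"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    # SPARQL JSON results keep the rows under results.bindings.
    return payload["results"]["bindings"]
```

The model would write the query text and call `run_sparql` with it; error handling and rate limiting are omitted here.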

rglullis

But the idea of tuning the model for this task is to make a model that is more efficient, cheaper to operate and not requiring $BILLIONS of infrastructure going to the hands of NVDA and AMZN.

luguenth

yorwba

I asked "Which country has the most subway stations?" and got the query

  SELECT ?country (COUNT(*) AS ?stationCount) WHERE {
    ?station wdt:P31 wd:Q928830.
    ?station wdt:P17 ?country.
  }
  GROUP BY ?country
  ORDER BY DESC(?stationCount)
  LIMIT 1
https://query.wikidata.org/#SELECT%20%3Fcountry%20%28COUNT%2...

which is not unreasonable as a quick first attempt, but doesn't account for the fact that many things on Wikidata aren't tagged directly with a country (P17) and instead you first need to walk up a chain of "located in the administrative territorial entity" (P131) to find it, i.e. I would write

  SELECT ?country (COUNT(DISTINCT ?station) AS ?stationCount) WHERE {
    ?station wdt:P31 wd:Q928830.
    ?station wdt:P131*/wdt:P17 ?country.
  }
  GROUP BY ?country
  ORDER BY DESC(?stationCount)
  LIMIT 1
https://query.wikidata.org/#SELECT%20%3Fcountry%20%28COUNT%2...

In this case it doesn't change the answer (it only finds 3 more subway stations in China), but sometimes it does.
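To make the difference between the two queries concrete, here is a small Python sketch over toy data. The station, district, city, and country identifiers are invented; only the property IDs (P31, P17, P131) and Q928830 come from the queries above. The second function imitates the `wdt:P131*/wdt:P17` property path by walking zero or more P131 hops before looking for a country.

```python
# Toy data only: the station, district, city, and country names are invented.
P31, P17, P131 = "P31", "P17", "P131"  # instance of, country, located in
SUBWAY_STATION = "Q928830"

triples = {
    ("station_a", P31, SUBWAY_STATION),
    ("station_a", P17, "country_x"),      # tagged directly with a country
    ("station_b", P31, SUBWAY_STATION),
    ("station_b", P131, "district_1"),    # country only reachable via P131
    ("district_1", P131, "city_1"),
    ("city_1", P17, "country_x"),
}


def objects(subject, predicate):
    return {o for s, p, o in triples if s == subject and p == predicate}


def countries_direct(station):
    """Mirror of `?station wdt:P17 ?country`."""
    return objects(station, P17)


def countries_via_path(station):
    """Mirror of `?station wdt:P131*/wdt:P17 ?country`."""
    seen, frontier = {station}, [station]
    while frontier:                       # P131*: zero or more hops
        node = frontier.pop()
        for parent in objects(node, P131):
            if parent not in seen:
                seen.add(parent)
                frontier.append(parent)
    return {c for node in seen for c in objects(node, P17)}


stations = {s for s, p, o in triples if p == P31 and o == SUBWAY_STATION}
direct = sum(1 for s in stations if "country_x" in countries_direct(s))
via_path = sum(1 for s in stations if "country_x" in countries_via_path(s))
print(direct, via_path)  # 1 2
```

On this toy graph the direct lookup misses `station_b`, which is the kind of undercount the property path is meant to avoid.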

rapnie

"Could not reach the server" after asking it a question.

zekrioca

The author mentioned RDF a couple dozen times but didn’t define it, so:

The Resource Description Framework (RDF) is a standard model for data interchange on the web, designed to represent interconnected data using a structure of subject-predicate-object triples. It facilitates the merging of data from different sources and supports the evolution of schemas over time without requiring changes to all data consumers.
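As a rough illustration of the triple structure and of why merging is cheap, the sketch below models a graph as a Python set of (subject, predicate, object) tuples. All `ex:` names are hypothetical, and real RDF adds details this ignores, such as IRIs, literals with datatypes, and blank nodes.

```python
# Two hypothetical sources describing overlapping resources.
source_a = {
    ("ex:alice", "ex:worksFor", "ex:acme"),
    ("ex:acme", "ex:locatedIn", "ex:berlin"),
}
source_b = {
    ("ex:alice", "ex:knows", "ex:bob"),
    ("ex:acme", "ex:locatedIn", "ex:berlin"),  # same fact as in source_a
}

# Merging graphs is a set union: identical triples collapse into one.
merged = source_a | source_b


def match(graph, s=None, p=None, o=None):
    """Return triples matching a pattern; None acts as a wildcard."""
    return sorted(
        t for t in graph
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


print(len(merged))  # 3
print(match(merged, s="ex:alice"))
```

A consumer that only asks about `ex:worksFor` keeps working when a source starts adding `ex:knows` triples, which is the schema-evolution point in the definition above.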

jraph

I wrote a comment trying to explain it there, with a concrete, current and widespread example: https://news.ycombinator.com/item?id=45135302#45135593

retube

What is "RDF" ? Not defined in the article

jraph

Resource Description Framework [1] is basically a way to describe resources with (subject, verb, object) triples, where subject is the resource being described and object is another resource related to the subject in a way verb defines (verb is not necessarily a grammatical verb/action, it's often a property name).

There are several formats to represent these triples (Turtle), database implementations, query languages (SPARQL), and there are ontologies, which are basically schemas defining how to describe resources in some domain.

It's highly related to the semantic web [2] vision of the early 2000s.

If you don't know about it, it is worth taking a few minutes to study it. It sometimes surfaces and it's nice to understand what's going on, it can give good design ideas, and it's an important piece of computer history.

It's also the quiet basis for many things, OpenGraph [3] metadata tags in HTML documents are basically RDF for instance. (TIL about RDFa [4] btw, I had always seen these meta tags as very RDF-like, for a good reason indeed).

[1] https://en.wikipedia.org/wiki/Resource_Description_Framework

[2] https://en.wikipedia.org/wiki/Semantic_Web

[3] https://ogp.me/

[4] https://en.wikipedia.org/wiki/RDFa
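To show how OpenGraph tags read as triples, here is a small sketch using Python's standard `html.parser`. The class name `OpenGraphTriples`, the page URL, and the HTML snippet are made up for illustration; each `og:` meta tag is treated as (page, property, content).

```python
from html.parser import HTMLParser


class OpenGraphTriples(HTMLParser):
    """Collect (page, property, content) triples from og: meta tags."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.triples = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        prop = attributes.get("property") or ""
        content = attributes.get("content")
        if prop.startswith("og:") and content is not None:
            self.triples.append((self.page_url, prop, content))


# Hypothetical page; the URL and values are placeholders.
html = """
<html><head>
<meta property="og:title" content="Example Page" />
<meta property="og:type" content="website" />
<meta name="viewport" content="width=device-width" />
</head></html>
"""

parser = OpenGraphTriples("https://example.org/page")
parser.feed(html)
for triple in parser.triples:
    print(triple)
```

The page plays the subject, the `property` attribute the predicate, and `content` the object; the `viewport` tag is skipped because it carries no `og:` property.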

rapnie

Does OpenGraph gain any benefit from its definition as linked data? Or might it just as well have been defined as, say, JSON Schemas, with the HTML referring to those property names?

jraph

I'm no expert on OpenGraph, and it's been a while since I've actually manipulated RDF other than the automatically generated og meta tags.

I'd say defining this as linked data was quite idiomatic / elegant. It's possibly mainly because OpenGraph was inspired by Dublin Core [1], which was RDF-based. They didn't reinvent everything with OpenGraph, but kept the spirit, I suppose.

In the end it's probably quite equivalent.

And in the end, why not both? Apparently we defined an RDF ontology for JSON schemas! [2]

[1] https://en.wikipedia.org/wiki/Dublin_Core

[2] https://www.w3.org/2019/wot/json-schema

vixen99

We meet this casual use of acronyms all too often on HN. It only takes a line or two to enable everyone to follow along without recourse to a search expedition.

Animats

Right. I was thinking "RDF - vaguely remember that as some XML thing from Semantic Web era".

Yup, it's still that RDF. Inevitably, it had to be converted to new JSON-like syntaxes.

It reminds me of the "is-a" predicate era of AI. That turned out to be not too useful for formal reasoning about the real world. As a representation for SQL database output going into an LLM, though, it might go somewhere. Maybe.

Probably because the output of an SQL query is positional, and LLMs suck at positional representations.

jraph

You wouldn't spell out HyperText Markup Language each time.

RDF is one of those things it's easy to assume everybody has already encountered. RDF feels fundamental. Its predicate triplet design is fundamental, almost obvious (in hindsight?). It could not have not existed. Had RDF not existed, something else very similar would have appeared, it's a certitude.

But we might have reached a point where this assumption is quite false though. RDF and the semantic web were hot in the early 2000s, which was twenty years ago after all.

rapnie

Tim Berners-Lee, now co-founder of Inrupt [0], has launched Solid Project [1] where he kept working on semantic web concepts and linked data specs. Looks like Inrupt went full AI today. What ailed Solid was, I think, the academic approach mentioned in other comments, the heavyweight specification process (inspired by W3C), and overlooking the fact that getting your dev community on board and excited is a good road to adoption. Inrupt didn't pay much attention to their Solid community, except for the active followers in their chat channels, and were directly targeting commercial customers. I don't know the health of Solid project today, but there are a couple of interesting projects around social networking and the fediverse.

[0] https://www.inrupt.com/about

[1] https://solidproject.org/

ricksunny

Five times in that article he says some version of “Accuracy triples”.

What does that even mean? Suppose something 97% accurate became 99.5% accurate? How can we talk of accuracy doubling or tripling in that context? The only way I could see that working is if the accuracy of something went from say 1% to 3% or 33% to 99%. Which are not realistic values in the LLM case. (And I’m writing as a fan of knowledge graphs).
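The arithmetic behind this objection can be sketched in a few lines of Python. The helper names are hypothetical. The second function shows one other possible reading, comparing error rates instead of accuracies, which is an assumption of this sketch and not something the article states.

```python
def max_baseline_for_multiplier(multiplier):
    """Largest baseline accuracy that can be multiplied without passing 100%."""
    return 1.0 / multiplier


def error_reduction_factor(before, after):
    """How many times smaller the error rate is after the change."""
    return (1.0 - before) / (1.0 - after)


# Accuracy can only triple if it starts at or below one third.
print(round(max_baseline_for_multiplier(3), 3))       # 0.333

# Going from 97% to 99.5% accuracy shrinks the error rate from 3% to 0.5%.
print(round(error_reduction_factor(0.97, 0.995), 2))  # 6.0
```

So a claim of tripled accuracy only makes sense for a baseline of about 33% or lower, while a high-accuracy system improving further is better described by how much its error rate falls.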

barrenko

Or a semantic layer it's called?

mdhb

Maybe worth also pointing out that a meaningful refresh of the RDF specification is getting rather close to completion.

Hopefully version 1.2 which addresses a lot of shortcomings should officially be a thing this year.

In the meantime you can take a look at some of the specification docs here https://w3c.github.io/rdf-concepts/spec/

jiggawatts

The sibling comment by flanked-evergl "RDF is great but it's somewhat inadvertently captured by academia." is made manifestly obvious when reading this spec.

It's overburdened by terminology, an exponential explosion of nested definitions, and abstraction to the point of unintelligibility.

It is clear that the authors have implementation(s) of the spec in mind while writing, but very carefully dance around it and refuse to be nailed down with pedestrian specifics.

I'm reminded of the Wikipedia mathematics articles that define everything in terms of other definitions, and if you navigate to those definitions you eventually end up going in circles back to the article you started out at, no wiser.

verisimi

RDF - Resource Description Framework

> The Resource Description Framework (RDF) is a method to describe and exchange graph data. It was originally designed as a data model for metadata by the World Wide Web Consortium (W3C).

https://www.wikipedia.org/wiki/Resource_Description_Framewor...