
SQL Needed Structure

88 comments

·September 5, 2025

bccdee

My favourite way of doing this is

  with movieIds as (select id from Movie where title = $1),
       actorIds as (select Actor.id from Actor join ActorMovie on [...]
                     where ActorMovie.Movie in movieIds),
       alsoActedIn as (select id from ActorMovie where actor in actorIds),
       movieResults as (select * from Movie where id in movieIds),
       actorResults as (select * from Actor where id in actorIds),
       alsoActedInResults as (select * from Movie join ActorMovie on [...]
                              where ActorMovie.id in alsoActedIn)
  select * from movieResults
      full outer join actorResults on false
      full outer join alsoActedInResults on false;
Not every database supports "full outer join on false," and sometimes you have to add "as not materialized" to the "with" subqueries in order for it to be performant, but it works in Postgres, and you end up with a results table that looks something like this:

  MovieId,MovieTitle,ActorId,ActorName,MovieId,MovieTitle,ActorMovie.actor
  0,"Indiana Jones",null,null,null,null,null
  null,null,0,"Harrison Ford",null,null,null
  null,null,null,null,0,"Indiana Jones",0
  null,null,null,null,1,"Star Wars",0
  null,null,null,null,2,"The Fugitive",0
With a bit of client-side logic, it's easy to transform that into this:

  [{
    id: 0,
    title: "Indiana Jones",
    actors: [{
      id: 0,
      name: "Harrison Ford",
      alsoActedIn: [
        {id: 1, title: "Star Wars"},
        {id: 2, title: "The Fugitive"},
      ]
    }]
  }]

without the need to use json as a middleman.
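That "bit of client-side logic" can be sketched in a few lines; here in Python, assuming the rows arrive in the column order of the table above and that actor rows precede their ActorMovie rows:

```python
# A sketch of the client-side reassembly step. Row shape is assumed to be:
# (movie_id, movie_title, actor_id, actor_name,
#  also_movie_id, also_movie_title, also_actor_id)
rows = [
    (0, "Indiana Jones", None, None, None, None, None),
    (None, None, 0, "Harrison Ford", None, None, None),
    (None, None, None, None, 0, "Indiana Jones", 0),
    (None, None, None, None, 1, "Star Wars", 0),
    (None, None, None, None, 2, "The Fugitive", 0),
]

movies, actors = {}, {}
for m_id, m_title, a_id, a_name, am_id, am_title, am_actor in rows:
    if m_id is not None:
        movies[m_id] = {"id": m_id, "title": m_title, "actors": []}
    elif a_id is not None:
        actors[a_id] = {"id": a_id, "name": a_name, "alsoActedIn": []}
    elif am_id is not None:
        # The movie we originally queried for is not "also acted in".
        if am_id not in movies:
            actors[am_actor]["alsoActedIn"].append({"id": am_id, "title": am_title})

# Attach the actors to each returned movie (just one movie in this query).
for movie in movies.values():
    movie["actors"] = list(actors.values())

result = list(movies.values())
```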

This pattern has saved me from some truly awful query logic.
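For databases without "full outer join ... on false", the same stacked result shape can be emulated by NULL-padding each branch and concatenating with UNION ALL. A minimal sqlite3 sketch, with an invented two-table schema:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Movie (id INTEGER, title TEXT);
    CREATE TABLE Actor (id INTEGER, name TEXT);
    INSERT INTO Movie VALUES (0, 'Indiana Jones');
    INSERT INTO Actor VALUES (0, 'Harrison Ford');
""")

# Each branch keeps its own columns and NULL-pads the other's, so the
# disjoint result sets stack into one table, as with the outer-join trick.
rows = con.execute("""
    SELECT id, title, NULL AS actor_id, NULL AS actor_name FROM Movie
    UNION ALL
    SELECT NULL, NULL, id, name FROM Actor
""").fetchall()
```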

greggyb

Coming from a data engineering and business analytics background, the terms "structured" and "unstructured" seem to be swapped in meaning here.

I'd expect to talk about anything in a DB as "structured data". Flexible serialization formats such as JSON or XML are "semi-structured". And something from e.g., an office document is "unstructured".

This is not a complaint or criticism. I understand the article just fine. It's just interesting how different perspectives can affect what words mean (:

sinfulprogeny

Kleppmann[1] calls it schema-on-read (json, xml) and schema-on-write (typed columns in an RDB). I like it over structured/unstructured, it's a bit more specific.

[1] https://martin.kleppmann.com/2017/03/27/designing-data-inten...

greggyb

Yes, I agree with that preference. I don't love the verbiage of "structured" / "unstructured" in either usage (the article's or that which I shared).

stevage

I think the SQL sense is more that "structured" means "it contains internal structure" (like a tree), whereas a table without JSON is free from additional structure apart from its own table structure.

thaumasiotes

> Coming from a data engineering and business analytics background, the terms "structured" and "unstructured" seem to be swapped in meaning here.

Mark Rosewater likes to write about his personal concept of "linear" Magic: the Gathering decks, which are decks in which the cards tend to pick up synergy bonuses from each other, so that having two of them together is considerably more powerful than you'd expect from the strength of those two cards individually.

This always bothers me because it is the opposite of the more normal use of "linear" relationships, in which everything contributes exactly as much to an aggregate as it's worth individually.

stared

I used to work a bit with MongoDB (over 10 years ago) and it did wonders in making me fall back in love with relational databases. I finally saw the light of Codd!

SQL, as a language, is clunky, true. It can be patched here and there, either by PipeSQL or by various ORMs. I agree that it would be wonderful to have standardized tooling for generating JSON like in the post.

Yet with relational databases you can separate the concerns of what your data is and what you want to display. If you use a JSON-like way to store data, it can do the job until you want to change the data or the queries.

andyferris

100%.

The shame, to me, is that SQL is _unnecessarily_ clunky at producing query results with nested/hierarchical data. The relational model allows for any given value (field or cell value) to be itself a relation, but SQL doesn't make it easy to express such a query or return such a value from the database (as Jamie says - often the API server has to "perform the join again" to put it in nested form, due to this limitation).

bazoom42

> The relational model allows for any given value (field or cell value) to be itself a relation

First normal form explicitly forbids nested relations though. Relational algebra does not support nested relations for this reason.

But perhaps nesting relations might make sense as the final step, just like sorting, which is not supported by the pure relational model either.

arnsholt

My stint with MongoDB was brief, but I too came away with a deeper appreciation of SQL bases. I feel it's a bit like a good type system: yes, there absolutely is an upfront cost, but over time it really does save you from stupid runtime problems. It's especially important when the bugs are in the data storage layer, because if a bug causes bad data to be written to your database, not only do you need to find and fix the bug, you also have to figure out how to deal with (possibly lots of) bad data.

Also, modern SQL is an incredibly powerful language. Good SQL can save you from lots of dumb data munging code if you know how to wield your database properly. Especially for analytical queries, but also more typical application code IMO.
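As one small illustration: a running total, which is a loop and an accumulator in application code, is a single window function in modern SQL. A sqlite3 sketch with invented table and column names:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE sales (day INTEGER, amount INTEGER);
    INSERT INTO sales VALUES (1, 10), (2, 30), (3, 5);
""")

# SUM(...) OVER (ORDER BY day) computes a cumulative sum per row, replacing
# the usual fetch-everything-then-accumulate loop in the application.
rows = con.execute("""
    SELECT day, SUM(amount) OVER (ORDER BY day) AS running_total
    FROM sales
    ORDER BY day
""").fetchall()
```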

da_chicken

Yes, nothing will make you appreciate the relational model more than using a database where the designer broke first normal form. Or where the developer thought the data in the database was theirs and it was only ever going to be used by just their application.

Something else I find confusing is that every developer seems to want every database query for a single object to return a single row with a thousand columns, and anything multi row or with multiple results is way too complicated to handle. This goes double for report writing software.

I really wonder what we're doing with database providers (drivers) that makes people want to develop ORMs when ORMs consistently feel like 10 ton gorillas. If the database world is so disparaging of the "row by excruciating row" processing, why do the drivers exclusively present data that way and no other?

mrlongroots

Agree: flat/relational structure is highly efficient to store, query planning/optimization/scaling become a lot easier, the language is clunky but relational algebra is beautiful, but...

I think pushing multiple joins to what is still a relational database and getting a complex output isn't the worst idea in the world, as a higher-level layer on top of a regular database.

On the other hand, "it needs four queries/RTTs" is not the worst thing in the world. It needn't be the goal of a system to achieve theoretical minimum performance for everything.

Let those who truly have the problem in prod push the first patch.

andyferris

Yes - I have to agree.

Codd was right in that if you want transactional semantics that are both quick and flexible, you'll need to _store_ your data in normalized relations. The system of record is unwieldy otherwise.

The article is right that this idea was taken too far - queries do not need to be restricted to flat relations. In fact the application, for any given view, loves hierarchical organization. It's my opinion that application views have more in common with analytics (OLAP) except perhaps in latency requirements - they need internally consistent snapshots (and ideally the corresponding trx id) but it's the "command" in CQRS that demands the normalized OLTP database (and so long as the view can pass along the trx id as a kind of "lease" version for any causally connected user command, as in git push --force-with-lease, the two together work quite well).
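That "lease" idea can be sketched as a compare-and-set on the version the view was rendered from. Everything here (names, the in-memory store) is invented for illustration:

```python
# The view carries the version/trx id it was rendered from; a later command
# only applies if that version is still current, akin to
# `git push --force-with-lease`.
def apply_command(store, key, expected_version, new_value):
    current_version, _ = store[key]
    if current_version != expected_version:
        return False  # someone else committed since the view was taken
    store[key] = (current_version + 1, new_value)
    return True

store = {"movie:1": (7, {"title": "Indiana Jones"})}
ok = apply_command(store, "movie:1", 7, {"title": "Raiders of the Lost Ark"})
stale = apply_command(store, "movie:1", 7, {"title": "Something Else"})
```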

This issue is of course that SQL eschews hierarchical data even in ephemeral queries. It's really unfortunate that we generate jsonb aggregates to do this instead of first-class nested relations a la Dee [1] / "third manifesto" [2]. Jamie Brandon has clearly been thinking about this a long time and I generally find myself nodding along with the conclusions, but IMO the issue is that SQL poorly expresses nested relations and this has been the root cause of object-relation impedance since (AFAICT) before either of us were born.

[1] https://github.com/ggaughan/dee [2] https://www.dcs.warwick.ac.uk/~hugh/TTM/DTATRM.pdf

9rx

> The article is right that this idea was taken too far

The biggest mistake was thinking we could simply slap a network on top of SQL and call it a day. SQL was originally intended to run locally. You don't need fancy structures so much when the engine is beside you, where latency is low, as you can fire off hundreds of queries without thinking about it, which is how SQL was intended to be used. It is not like, when executed on the same machine, the database engine is going to be able to turn the 'flat' data into rich structures any faster than you can, so there is no real benefit to it being a part of SQL itself.

But do that same thing over the network and you're quickly in for a world of hurt.
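The point can be made concrete with an in-process engine: the classic N+1 pattern that is fatal over a network is harmless locally. A sqlite3 sketch with an invented schema:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE movie (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE actor (id INTEGER PRIMARY KEY, movie_id INTEGER, name TEXT);
    INSERT INTO movie VALUES (1, 'Indiana Jones');
    INSERT INTO actor VALUES (1, 1, 'Harrison Ford'), (2, 1, 'Karen Allen');
""")

# One query for the movies, then one follow-up query per movie for its
# actors: the N+1 shape, costing microseconds without a network round trip.
nested = []
for m_id, title in con.execute("SELECT id, title FROM movie"):
    actors = [name for (name,) in con.execute(
        "SELECT name FROM actor WHERE movie_id = ?", (m_id,))]
    nested.append({"title": title, "actors": actors})
```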

reaanb2

In my view, the O/R impedance mismatch derives from a number of shortcomings. Many developers view entities as containers of their attributes and involved only in binary relationships, rather than the subjects of n-ary facts. They map directly from a conceptual model to a physical model, bypassing logical modeling. They view OOP as a data modeling system, and reinvent network data model databases and navigational code on top of SQL.

mcphage

In that case, the mismatch is between "What developers need" and "What SQL provides".

kragen

I agree about first-class nested relations, but I don't agree about transactions.

Codd was writing 10 years before the idea of transactional semantics was formulated, and transactions are in fact to a great extent a real alternative to normalization. Codd was working to make inconsistent states unrepresentable in the database, but transactions make it a viable alternative to merely avoid committing inconsistent states. And I'm not sure what you mean by "quick", but anything you could do 35 years ago in 10 milliseconds is something you can do today in 100 microseconds.

andyferris

It's not about just _transactions_. What you wrote is 100% correct.

It's specifically about _fast_ transactions in the OLTP context. When talking about the 1970s (not 1990s) and tape drives, rewriting a whole nested dataset to apply what we'd call a "small patch" nowadays wasn't a 10 millisecond job - it could feasibly take 10s of seconds or minutes or hours. That a small patch to the dataset can happen almost instantly - propagated to its containing relation, and a handful of subordinate index relations - was the real advance in OLTP DBs. (Of course this never has and never will help with "large patches" where the dataset is mostly rewritten, and this logic doesn't apply to the field of analytics.)

Perhaps Codd "lucked out" here or perhaps he didn't have the modern words to describe his goal, but nonetheless I think this is why we still use flat relations as our systems of record. Analytical/OLAP systems do vary a lot more!

kragen

Hmm, but I think people doing OLTP in the 01970s were largely using things like IMS, which used ISAM, on disk, to be able to do small updates to large nested datasets very quickly? And for 20+ years one of the major criticisms of relational databases was that they were too slow? And that even today the remaining bastions of IMS cite performance as their main reason for not switching to RDBMSes?

I think that if you're processing your transactions on tape drives, your TP isn't OL; it's offline transaction processing.

I think Codd's major goal was decoupling program structure from on-disk database structure, not improving performance. There's a lot of the history I don't know, though.

david927

This Princeton student has a project which solves this by creating a separate datastore with the model. It mines FK constraints and existing queries to get that.

The UI is a treeview, so that if you look at Authors, you can drill down (join) to get their books, and if you look at Books, you drill down for the authors.

https://5ql.site (It's a single html page)

sgarland

The only reason this post has any truth is that the IMDB dataset is horrendously unnormalized. It also points out at the end that you can in fact use a combination of aggregations to return the desired JSON object.

Even then, if you really wanted to, you can absolutely make a tree-like structure in SQL in a variety of ways: adjacency lists, nested sets, closure tables, etc.
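For example, the adjacency-list encoding mentioned above is just a parent_id column plus a recursive CTE to walk it. A sqlite3 sketch with invented names:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE category (id INTEGER PRIMARY KEY, parent_id INTEGER, name TEXT);
    INSERT INTO category VALUES
        (1, NULL, 'Media'),
        (2, 1, 'Movies'),
        (3, 2, 'Westerns');
""")

# The recursive CTE starts at the root (parent_id IS NULL) and repeatedly
# joins children onto the rows found so far, tracking depth as it goes.
rows = con.execute("""
    WITH RECURSIVE tree(id, name, depth) AS (
        SELECT id, name, 0 FROM category WHERE parent_id IS NULL
        UNION ALL
        SELECT c.id, c.name, tree.depth + 1
        FROM category c JOIN tree ON c.parent_id = tree.id
    )
    SELECT name, depth FROM tree ORDER BY depth
""").fetchall()
```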

mcphage

> you can absolutely make a tree-like structure in SQL in a variety of ways: adjacency lists, nested sets, closure tables, etc.

That's kinda the problem: rather than there being one way to make a tree-like structure in SQL that works correctly, there are a lot of ways to do it, and they all have different tradeoffs, and they're all a bit obnoxious.

gethly

I am using event sourcing with CQRS, and that means I also utilise projections, which the DB itself can replicate less efficiently as a "view" or virtual table.

So one can technically create a projection/view that is tailor-made for a query that needs to display some data. Of course, it is not often possible to retrieve all the data with a single select command.

So joins and multiple queries are simply inherent to the complexity of the data we store nowadays.

Anyway, years ago I moved to a DB model where every entity is stored as a blob in a dedicated column, and every additional column, besides the id, is indexed. So there is no wasted space on a ton of columns that only hold data but are never used for filtering. I can run data-efficient queries that yield a list of ids of blobs to load, or the blobs themselves, and then I extract any necessary data out of those blobs (entities) at the application level. So the database is purely a blob store + a few fast indices.
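As I read it, that layout looks roughly like the following sketch (sqlite3; all names here are assumptions, not the commenter's actual schema):

```python
import json
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE entity (
        id TEXT PRIMARY KEY,
        owner_id TEXT,   -- an indexed filter column
        blob TEXT        -- the whole entity, opaque to the database
    );
    CREATE INDEX entity_owner ON entity(owner_id);
""")

doc = {"id": "e1", "owner": "u1", "title": "Course", "price": 10}
con.execute("INSERT INTO entity VALUES (?, ?, ?)",
            (doc["id"], doc["owner"], json.dumps(doc)))

# Filter via the index, then unpack the blob at the application level.
(blob,) = con.execute(
    "SELECT blob FROM entity WHERE owner_id = ?", ("u1",)).fetchone()
entity = json.loads(blob)
```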

kragen

What's your application?

gethly

I use it on gethly.com, and previously I used the same approach on an internal application for a domain registrar that handles EPP. So far no issues. Also, using those blobs allows me to easily add an encryption layer at the application level.

kragen

I see: "Gethly is a paywalled hosting and sales platform for digital content creators. Sell online courses, license downloadable content, receive donations or build communities accessible via paid memberships." Are all the online courses and downloadable content and analytics and payments in the CQRS system, or are you using it for only some subset?

kragen

An interesting thing about "A Relational Model of Data for Large Shared Data Banks" (the stone tablets handed down from Codd https://www.seas.upenn.edu/~zives/03f/cis550/codd.pdf) is that it starts out by considering N-ary relations whose values may themselves be relations, and only then switches to considering "normalized" relations whose values are not relations:

> A relation [table] whose domains [column types] are all simple [not relations] can be represented in storage by a two-dimensional column-homogeneous array of the kind discussed above. Some more complicated data structure is necessary for a relation with one or more nonsimple domains. For this reason (and others to be cited below) the possibility of eliminating nonsimple domains appears worth investigating.⁴ There is, in fact, a very simple elimination procedure, which we shall call normalization.

But non-normalized relations support the kind of nested structure the eminent Dr. Brandon wants, without resorting to JSON or abandoning the strong uniform typing we have with SQL. Darwen & Date's The Third Manifesto https://www.dcs.warwick.ac.uk/~hugh/TTM/DTATRM.pdf includes group and ungroup operations which translate back and forth between normalized and non-normalized relations.
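A rough Python analogue of those GROUP and UNGROUP operations, with my own representation (a relation as a list of dicts), just to show the round trip:

```python
# GROUP folds a flat relation into one with a relation-valued attribute;
# UNGROUP flattens it back. The representation is mine, not Darwen & Date's.
def group(rows, keys, nested_name):
    out = {}
    for row in rows:
        k = tuple(row[a] for a in keys)
        rest = {a: v for a, v in row.items() if a not in keys}
        out.setdefault(k, []).append(rest)
    return [dict(zip(keys, k)) | {nested_name: v} for k, v in out.items()]

def ungroup(rows, nested_name):
    return [
        {a: v for a, v in row.items() if a != nested_name} | inner
        for row in rows
        for inner in row[nested_name]
    ]

flat = [
    {"movie": "Indiana Jones", "actor": "Harrison Ford"},
    {"movie": "Indiana Jones", "actor": "Karen Allen"},
]
nested = group(flat, ["movie"], "cast")
```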

I've been playing around with some of these ideas using some rather horrific atrocities perpetrated against Lua operator overloading in http://canonical.org/~kragen/sw/dev3/cripes.lua. I haven't added joins or ungrouping to it yet, but I think it might be a prototype of something promising for this kind of use case.

ThinkBeat

Would this not become a bit simpler if you populate the SQL schema like this?

                            +-------------------+
                            |       Genre       |
                            +-------------------+
                                      |
                            +-------------------+
                            |       Movies      |
                            +-------------------+
                                      |
                            +-------------------+
                            |       Actors      |
                            +-------------------+

mamcx

The things `SQL` fails to do that would enable this sort of thing are:

* Store a relation in a cell: the whole trick of so-called `nosql` is that it can do nested data. THAT IS ALL

* Then, `join` is already inline

* Then, instead of the very poorly named `GROUP BY`, it could have a REAL `GROUP BY`!

That is all that is needed at the core

mcdonje

This was enlightening. I've often wondered why ORMs exist, and it turns out a bunch of devs don't understand tabular data. I was going to say they don't understand SQL, but apparently it goes deeper than that.

Like, left joins have been around since SQL-92, so the query not returning a result when there is no director is a skill issue.
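The SQL-92 fix alluded to here: a LEFT JOIN keeps the movie row even when no director matches. A sqlite3 sketch with an invented schema:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE movie (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE director (movie_id INTEGER, name TEXT);
    INSERT INTO movie VALUES (1, 'Indiana Jones'), (2, 'Unreleased Demo');
    INSERT INTO director VALUES (1, 'Steven Spielberg');
""")

# LEFT JOIN preserves unmatched movie rows, filling director columns with NULL.
rows = con.execute("""
    SELECT m.title, d.name
    FROM movie m LEFT JOIN director d ON d.movie_id = m.id
    ORDER BY m.id
""").fetchall()
```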

Nowadays, you can run a JSON or XML query and have the DB return your results in a hierarchical format.
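For instance, with SQLite's JSON functions (Postgres has jsonb_agg and friends), a sketch with invented tables:

```python
import json
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE movie (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE role (movie_id INTEGER, actor TEXT);
    INSERT INTO movie VALUES (1, 'Indiana Jones');
    INSERT INTO role VALUES (1, 'Harrison Ford'), (1, 'Karen Allen');
""")

# json_group_array aggregates the actors per movie; json_object wraps the
# whole thing, so the DB hands back one hierarchical document per movie.
(doc,) = con.execute("""
    SELECT json_object('title', m.title,
                       'actors', json_group_array(r.actor))
    FROM movie m JOIN role r ON r.movie_id = m.id
    GROUP BY m.id, m.title
""").fetchone()
result = json.loads(doc)
```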

kragen

The author of this blog post writes in https://www.scattered-thoughts.net/writing/reflections-on-a-...:

> - In a small team (averaging ~3 people, I think?) in a little under a year, we built a SQL query planner that passed the >6 million tests in the sqlite logic test suite.

> - I figured out how to decorrelate arbitrary SQL subqueries (I wasn't the first person to figure this out, but I got there independently).

> - I demonstrated a mode of consistency failure in various popular streaming systems that I believe was not widely understood at the time.

I think it is likely that the reason you disagree with him is not that he "don't understand tabular data" and has "a skill issue". Unless you're secretly Michael Stonebraker?

You write:

> Nowadays, you can run a JSON or XML query and have the DB return your results in a hierarchical format.

You may not have noticed this, but the post you are commenting on explains how to run a JSON query and have the DB return your results in a hierarchical format.

mcdonje

My take was charitable because if he knows better, then he's making a bad faith argument, and I didn't want to accuse him of that.

I didn't read beyond the article. I only responded to what was in the post.

He didn't use capabilities SQL has had for decades, made a bad query, and used that to build his argument about how bad SQL is.

The ORM tables could've been written in one query that returns a tabular format of hierarchical data, which the front end could use.

What does that look like? It looks like the kind of underlying tabular format JSON & XML queries require. He didn't do that with his non-JSON examples.

Yes, the framework would need to turn it into an object. Yes, there is data duplication in some fields. While not ideal, it's still better than the ORM example because you're not splitting it up into multiple queries.

His JSON example undercuts his main argument. The DB is fully capable of returning the representation he wants. It's also more complex than necessary with all the JSONB_AGGs in the subqueries. Then he says you can see the duct tape.

So yeah, it reads like a front end person who never really learned modern SQL. Or maybe it's someone who knows better who is making a bad faith argument. Either way, it's just a bad take.

kragen

I don't think his final query is more complex than necessary? Maybe there's a better approach that I'm not seeing.


Mikhail_Edoshin

There was a discussion about a relevant paper here at HN:

"What if an SQL statement returned a database?"

https://news.ycombinator.com/item?id=38606146