Composable SQL

145 comments

·January 26, 2025

branko_d

From the article:

> What you hope happens is that Postgres will recursively inline every view, merge them all together into a gigaquery, and shuffle the predicates up and down to maximally optimize the query. That was the conceit. And of course that never happens.

The author seems to be describing a Progress-specific problem. Oracle, SQL Server (and probably other "enterprise" databases as well) are pretty good at inlining views and then optimizing the whole resulting gigaquery as if the views were not there in the first place.

> The central problem is that views have to be written for the general case, and then you filter on the view’s output. Views can’t take parameters.

SQL Server supports inline table-valued functions, which are exactly that - parametrized and inlinable views.

Unfortunately, they are not as composable as they could be. For example, a function returning table and a function accepting the same type of a table as input cannot be directly composed - you'd first need to store the first function's result in a temporary table or table variable. But despite the limitations, functions are extremely useful for constructing complex queries which reuse logic.

bob1029

Oracle, MSSQL, DB2, et. al., are essentially like magic when it comes to making hopeless piles of SQL run quickly. This is because these engines have all ostensibly targeted the business people over the decades more than the tech people.

Features like dynamically optimizing queries as they return results [0,1,2] isn't something you chase if your audience is willing to rewrite the universe at the first sign of incongruity.

[0] https://learn.microsoft.com/en-us/sql/relational-databases/p...

[1] https://www.ibm.com/docs/en/i/7.5?topic=overview-adaptive-qu...

[2] https://apex.oracle.com/pls/apex/features/r/dbfeatures/featu...

SigmundA

I mentioned this down thread, MSSQL and Oracle cache query plans allow their optimizers to take their time which I imagine helps with doing more sophisticated optimization.

PG always re optimizes on every call unless you manually prepare the statement and that only last for the session. So PG's optimizer is designed to be very fast not very thorough.

natmaka

> PG's optimizer is designed to be very fast not very thorough

Some parameters weight on this policy: geqo_threshold, geqo_effort, constraint_exclusion, from_collapse_limit, join_collapse_limit, jit_above_cost, enable_partition*...

As requiring heavy query optimization usually slows simple queries down, this is a matter of context.

willvarfar

The article talks about how CTEs are efficiency minefields and how Postgres doesn't do a good job of pushing down predicates, for example.

TBH, the solution is to improve the Postgres engine in this regard?

For example, in Bigquery I often use CTEs and see predictable speedups that are explained by predicate pushdown when slapping a WHERE clause on the user of a CTE etc. Bigquery has plenty of performance gotchas but I don't think CTEs are one of them; they seem to be transparent to the optimiser.

Something else that Bigquery supports is user-defined 'table' functions. These are like the 'functors' the article creates.

However, in practice they aren't very widely used, and I think this hints that they aren't as useful as the article is hoping for.

The thing I _really_ want is scoping. If only functions could be described like CTE WITH blocks, e.g.

   WITH visits AS (
      SELECT ...
      FROM web_logs
      WHERE ...
   ),
   foo(bar INT, baz STRING) AS (
      ... function body here
   )
   SELECT ..., foo(a, b), ...
   FROM visits
   WHERE ...

Apart from the tidiness of this, this allows inlining of functions into the SQL statement body rather than as separate statements. Often various tooling e.g. DBT makes it really hard to with separate function definitions etc, and being able to inline logic would make things possible.

Bigquery also supports user-defined aggregate functions. Although there is rarely the need to stray from the standard builtin aggregate functions, when you need them they are brilliant.

(Anyone working _on_ Bigquery reading? If so, please ping me, I have loads of feedback on what works well and where the user pain and toil is ;) )

sa46

As of Postgres 12 (Oct. 2019) CTEs are no longer materialized by default and are eligible for predicate push down.

The planner may still struggle but it’s not true that a CTE is an optimization fence.

https://www.depesz.com/2019/02/19/waiting-for-postgresql-12-...

vprcic

Functions in WITH clause really are a blessing. Oracle DB supported these for a decade now and that was a gamechanger for me.

https://oracle-base.com/articles/12c/with-clause-enhancement...

zetalyrae

This suffers from the "sufficiently smart compiler" problem. The query planner that can do what I mean with maximal efficiency is always just over the horizon. There's always yet another query that can't be optimized automatically.

LikesPwsh

It's not a problem in MSSQL, so solving the fat head of problems is clearly possible.

The escape hatch in MSSQL for the long tail is materialising an intermediate result to temp table.

SigmundA

One significant difference between PG and MSSQL is MSSQL caches query plans allowing its optimizer to be much more sophisticated and slower.

PG re-optmizes every query over and over unless you manually do a prepared statement and that only last for the session its prepared on. Therefore it's very important the optimizer not take much time before execution.

geysersam

Similar functionality already exists in some SQL implementations, notably DuckDB:

  create table macro author_books() as   
  (
    select
        a.name as author,
        b.title,
    from
        a
    inner join
        b on b.author_id = a.author_id
  );

is the same as the example in the article and it is called like:

  with a as (...),
       b as (...)
       from author_books();

https://duckdb.org/docs/sql/statements/create_macro.html#tab...

ncruces

SQLite supports this with the help of a virtual table: https://github.com/0x09/sqlite-statement-vtab

Available in sqlean: https://github.com/nalgeon/sqlean/blob/main/docs/define.md#t...

And my Go driver: https://pkg.go.dev/github.com/ncruces/go-sqlite3/ext/stateme...

Mine is a strict port, though admittedly, this could be improved with additional pushdown.

_dark_matter_

I really dislike leading off with "create table". My brain is unable to read it except as DDL.

typ

DuckDB macro doesn't support instantiating a table macro using the `query_table` function. So, it's not possible to expand a table macro like `foo(bar(baz()))`, making its use limited.

chrisjc

Perhaps I don't fully understand what you're saying

    CREATE TABLE person (name VARCHAR, age BIGINT);
    INSERT INTO person VALUES ('Alice', 37), ('Ana', 35), ('Bob', 41), ('Bea', 25);
    CREATE OR REPLACE MACRO my_query_table(t) AS TABLE (SELECT * FROM query_table(t));
    SELECT * FROM my_query_table('person');

Or do you mean that you can't use `query_table` with a macro identifier in place of the table identifier as input (after all, where would put the macro args)? Perhaps using `query(...)` instead would work?

    CREATE OR REPLACE MACRO query_macro(m, i) AS TABLE (SELECT * FROM query(CONCAT('SELECT * FROM ', m::STRING, $$($$, i::STRING, $$)$$)));
    SELECT * FROM query_macro('my_query_table', 'person');

Or maybe through some other kind of gymnastics https://duckdb.org/2024/03/01/sql-gymnastics.html

typ

Here is a contrived example:

    CREATE MACRO foo(bar, baz) AS TABLE
    WITH
        base AS (
            FROM query_table(bar) t
        ),
        refined AS (
            FROM base
            JOIN query_table(baz) u ON base.id = u.id
        )
    SELECT * FROM refined;

I want to test many user-defined combinations of (bar, baz). How can I pass a macro instance to query_table like

    FROM (
        (FROM foo(mybar1(some_arg1, some_arg2), mybaz101(500)))
        UNION ALL
        (FROM foo(mybar2(some_arg3, some_arg4), mybaz201(300)))
        UNION ALL
        ...
    )
    SELECT *;

Your second example using the 'query_macro' looks like something I was looking for. But it doesn't seem to be of general use that supports an arbitrary number of arguments.

geysersam

It's possible. You just have to use ctes:

   with arg_to_bar as (from baz()),
        arg_to_foo as (from bar()),
   from foo()

Where `arg_to_bar` is a table name used internally in `bar`.

typ

I clarified what I meant by an example in the other reply.

krembo

Isn't this an abstraction of functions?

tankenmate

It is an abstraction, it may be arguable if it is an abstraction of functions.

I once built a web app where we had a library of CTEs that we used to manage all the layers of indirection in the data, it was performant and made it much easier for users to be able to search and manipulate the data; especially for ad hoc data analysis.

jeltz

How do these differ from SQL functions in PostgreSQL? They look like the same thing.

geysersam

Yes but they are transparent to the optimizer so the argument against them in the blog article doesn't apply.

n_plus_1_acc

How does it differ from a view?

geysersam

It can take arguments. I didn't use any though so it's not a good example

yujzgzc

I work on a project that does tons of SQL based data transformations (thousands). A lot of it isn't backed by unit tests. Some of our people with more of a C++ background have got the idea that this is crazy, every bit of business logic should be encapsulated into a function just like in every other civilized language, so that it could be unit tested. They built a framework like this for SQL and implemented quite a few jobs with it.

The problem is most of the SQL-first devs hate it and would rather write untested code, or rely on alternative test automation techniques like bringing up a local DB instance with fake data, than have to refit their business logic into pure functions. The truth is that it's just not a way of working that comes naturally to people who write a lot of SQL for a living.

fifilura

I can definitely sympathise with the SQL developers here. (I have worked extensively both with SQL and C++).

There is a trade off between unit tests and having control and overview of what you are writing.

Also very often it is so much more about verifying that the input data is clean and if this becomes more cumbersome, you may have lost the ability to debug the biggest source of errors.

And of course sometimes it hits a nerve when the "real engineers" come and dictate exactly how you should do things. Reference also to Joel Spolsky's architecture astronauts.

NortySpock

dbt v1.8 added a feature to be able to unit-test a view or CTAS statement. (a natural extension of dbt since it was designed to assist with moving and renaming the inputs to a view)

I haven't tried it yet but I look forward to doing so. Already dbt macro expansions with tests have provided me way more test coverage than I ever had in the pre-dbt days.

It's a footnote in the article, but it seemed like a natural half-step forward in the direction indicated by the article.

thyristan

Pure functions in the context of databases are a rarity. A database is meant to be persisting data while upholding certain consistency, integrity and isolation guarantees. This basically means that most of the database-using SQL functions can never be pure, because having side effects is their whole purpose. Whats more, integrity and isolation are basically untestable, especially with unit testing.

So imho the DB people are totally right, to test any SQL, you need database tables with data in them.

svilen_dobrev

long time ago i had similar need.. did not want to directly churn SQL, for hundreds of small filters. So i did an expression parser and translator, and wrote all things in simple python functions, then translated those into SQL automatically. Both funcs and the translator were heavily tested. Apparently, that parser also grew few other "renderers" - like evaluator, documenter and what not.

https://github.com/svilendobrev/svd_util/blob/master/expr.py

smartmic

A lot of enterprise software tools that I know of are not tested as your C++ professional developers would expect. Especially when this software is not from the original software vendor, but is assembled/configured/enhanced by third party vendors or consultants. But of course even such software is "tested", albeit by the end user over a long period of use, unstructured, haphazardly, and in most cases unsatisfactorily. The problem is particularly acute if the industry is not deeply rooted in software development.

slt2021

SQL relies on RDBMS's type system, just like C++ devs rely on C++ OOP.

the problem lies with lax/weak table design that allows too lax data (all string nullable fields for example), instead of strict types like int, float, varchar, foreign key constraints etc.

you dont need unit test if you table design is clean, because SQL operators are deterministic and are tested by the RDBMS vendor

just need integration test, dry run test, periodic data integrity checks at the system boundaries when data is ingested/egress

SQL represents operations on the Sets of data and their relations, so you need not unit test on a single row, but a test suite on your entire DataSet (backup copy or synthetic copy)

earnestinger

> you dont need unit test if you table design is clean, because SQL operators are deterministic and are tested by the RDBMS vendor

Imagine well structured tables in 3rd normal form.

And then a view which joins 20 tables and creates a report with myriad of case/then expressions.

Many things can go wrong. (Wrong join conditions, duplicate rows, omitted rows, wrong formulas)

You might say that there should be no such logic in database and I understand the sentiment, but circumstances (e.g. performance) might say it’s needed.

slt2021

Performance of complex query is totally depends on the query planner, which depends on 1) table size, 2) data cardinality, 3) table statistics

It is only testable on real production database, so you can do test of complex queries on backup of your prod to get a sense of performance.

But still query can degrade unexpectedly under consistent load in prod from other queries

cbsmith

> The truth is that it's just not a way of working that comes naturally to people who write a lot of SQL for a living.

I don't know... Refactoring Databases is pretty old now. Plenty of DBAs can handle this style and even prefer it. I think this is more of a cultural problem in the small than in the large.

TachyonicBytes

I assume that framework is not open-source or somewhere I can look at it?

mousetree

You could take a look at the open-source dbt framework. They have a good implementation of unit testing for sql

https://docs.getdbt.com/docs/build/unit-tests

yujzgzc

I guarantee you wouldn't want to use it

marcosdumay

Hum...

You know that SQL is completely pure up to the point where you specify a data alteration command, right?

Your framework is ditching the queries, that are completely pure as enforced by the engine, and insists on using maybe-pure functions that all the engines have problems with.

feoren

SQL cannot be "fixed". It's too broken, too ubiquitous, too customized, too optimized. Will these functors integrate with the thousands of optimizations modern database systems rely on?

This post has the format: "SQL is extremely terribly awful at writing business logic. But if people implemented this one half-baked, unrealistic idea, then it would suck a little bit less!" The first part is spot-on: SQL is a horrible language for writing business logic. So: stop. Stop doing that.

Stop writing business logic in SQL!

Imagine you joined a new team, and they said: our program uses a bunch of XML files as its data format, and that's why most of our business logic is implemented in XPath. Just sit there and ponder that situation for a little bit. This is what you all sound like to me. Stop it!

> Testing the simplest query requires building up a massive object graph, just to test an infinitesimal slice of it.

Yes, correct, because SQL is awful at representing business logic, and foreign key constraints are business logic! The fact that an employee has one supervisor who is also an employee (FK Employee.SupervisorID -> Employee.ID) is a business constraint. (And one that can easily change, by the way.) All these database engineers who can't pull themselves away from their stored procedures will argue with me to their graves about foreign key constraints, but they're generally terrible. They're a business constraint that just happens to be supported by the database engine, but they're still a business constraint. Stop putting your business logic in your persistence layer.

We've been getting better at building programming languages that actually support software engineering every year. Use one of those. SQL is an output of one of your program layers. Just like XPath would be an output of your program. Compose SQL in an actual real programming language that isn't horrible instead of doing any of this.

wvenable

> Yes, correct, because SQL is awful at representing business logic, and foreign key constraints are business logic!

This strikes me as a unnecessary all-or-nothing argument. I would agree that you shouldn't put all your business logic in the database but that doesn't mean you shouldn't put any business logic in your database. If foreign key constraints are business logic then so are table definitions and column types. It's not awful for those things as it's literally built right in.

My advice is to ensure, as best as possible, that invalid states are impossible to represent. If you can't have an address without being attached to a client, make that impossible to represent in the database. It's one less thing you'll ever have to worry about.

feoren

> This strikes me as a unnecessary all-or-nothing argument

I completely understand why it comes across that way. HN Comment length is limited and the current practice is a comfy local optimum, so any one change away from it is going to feel bad, and it's hard to get across my full theory of database design in comment form. I would not recommend most existing teams simply stop using foreign keys, unless they also adapted a lot of other changes that coexist with it. Still, there is a better world for database design, and that world probably does not use foreign key constraints.

> I would agree that you shouldn't put all your business logic in the database but that doesn't mean you shouldn't put any business logic in your database.

I actually would (lightly) argue that does mean you shouldn't put any business logic in your database, as I find that highly coherent design philosophy leads to highly coherent code. The most important question to me is about how the system can handle change over time. Foreign keys are one part of a larger set of "design principles" that everyone seems to take for granted that lead to a system that cannot handle change.

> If foreign key constraints are business logic then so are table definitions and column types.

I disagree on this one. Table definitions and column types are a "level below" business logic, as evidenced by the fact that you could probably refactor a lot of schemas for better/different performance characteristics without changing any of the actual business logic that drive them. The business logic doesn't (shouldn't) care whether you've used a bitfield or a bunch of boolean columns or a child set-inclusion table to represent a bunch of yes/no facts about some entity. They're "persistence logic", and the database is exactly the right place for that.

> My advice is to ensure, as best as possible, that invalid states are impossible to represent.

The problem is that an "invalid state" is highly dependent on your perspective. A state that is invalid to use in one part of your codebase is perfectly valid in other parts. The ubiquity of "draft", "draft final", and "final" shows that humans want to represent "invalid states" all the time! Aren't you glad your IDE lets you represent invalid code? Okay, fine, that data is not ready for the annual submittal, but does that mean I'm really not allowed to save it at all?

And these "invalid states" are extremely volatile as your focus expands around your domain. All contact info must have a primary contact method -- well, unless they're just "contact stubs". All facilities need a facility manager -- well, unless they're 3rd party facilities, or storage facilities, or in Region 9, or start with the letter 'P'.

Haven't you been 300 hours into the development of a system when suddenly someone springs on you, "Oh yeah, Bob has two bosses. Why should that be a problem?" And you're screaming internally, "Of course it's a problem! You assured me six months ago that everyone only has one supervisor, and now you're telling me this one guy has two!? This changes EVERYTHING!"

But why? Why does it change everything? Why does every little change to our database schemas always seem to propagate to the entire system? It's because we're designing databases wrong. And foreign keys are a (small) part of that wrongness.

barnabee

> The problem is that an "invalid state" is highly dependent on your perspective. A state that is invalid to use in one part of your codebase is perfectly valid in other parts. The ubiquity of "draft", "draft final", and "final" shows that humans want to represent "invalid states" all the time! Aren't you glad your IDE lets you represent invalid code? Okay, fine, that data is not ready for the annual submittal, but does that mean I'm really not allowed to save it at all?

It's perfectly possible to define a schema that can represent draft or incomplete data, syntactically invalid code (as a VARCHAR/TEXT), etc. while still also enforcing that a complete and valid domain object must be available when the record is in a non-draft state.

> Haven't you been 300 hours into the development of a system when suddenly someone springs on you, "Oh yeah, Bob has two bosses. Why should that be a problem?" And you're screaming internally, "Of course it's a problem! You assured me six months ago that everyone only has one supervisor, and now you're telling me this one guy has two!? This changes EVERYTHING!"

Schema updates and migrations should not be this difficult in any reasonably designed system.

wvenable

I have had this issue and it does change everything -- all the systems that depend on that data, the calculations involved, the user interface. The change in the database schema is actually relatively minor compared to the rest of the code. But with an enforced schema there's a never a case where it will be wrong -- the code with either work all the way one way or the other but never both at the same time.

Neither of examples provided has anything to do with foreign key relationships. If Bob can have two bosses, that's changing a column into a relation. That's going to have an effect beyond the database. If the facilities manager is optional then that's changing a column to be nullable. Again most the work after that change is outside of the database.

Even if drafts or temporary invalid states exist, I'm never going to want orphaned data. You said table definitions are column types are a "level below" business logic but the table definition is directly tied to what you're trying to store and how you're storing it. I think I would simply argue that foreign key definitions are also part of that "level below". It might be that not all facilities need a facility manager but if one is specified it better exist and you better not be able to remove it without affecting the facility.

I have a bug right now where in some race-condition situation the database is catching the issue with the constraints. I haven't found the bug yet (it's low priority) but at least it's not filling my database with invalid data -- it is just crashing that operation.

galbar

For whatever it is worth, I have reached the same conclusion and I have been building systems like you describe for the last few years.

Recently I changed jobs. In the new team they love their ORM with all the foreign keys and direct mapping between business logic data objects and the database schema.

Needless to say, it is a mess. Despite all the automated tooling they have around migrations that "should make it easy" to change things, the reality is that the whole application depends on what the database schemas look like and migrating the existing representation to a new one would break everything.

This has become an elephant in the room that nobody talks about and everyone works around.

liontwist

What is business logic? What specifically is SQL bad at?

I’m not sold by your foreign key example. Constraints are great. I guess you can do that in python? Good like getting ACID right

As soon as your data stops resembling a tree and starts looking like a graph than all your “business logic” looks a lot like operations on relations and you are back to where you started.

dagss

I agree that SQL has problems as a language.

I disagree that we should be happy about moving data out of the storage layer into the backend in cases where it does not make sense performance wise.

The problem is doing some processing in SQL in some cases has a huge performance advantages over moving data to the backend.

If your business logic requires you to take a million rows from table A, apply some rule, and output a million rows into table B -- then piping all that data through a backend is a bad solution performance wise. I reject the idea that all that extra IO (+extra overhead also for the database) is worth it just to use another programming language than SQL.

What we need is better alternatives that execute at the storage layer. Probably starting with proper languages that compile to SQL, like we saw with JavaScript.

magicalhippo

Agree not to putting your business logic in stored procs, but hard neg on getting rid of all constraints.

I consider it defense in depth. Sure your application should absolutely ensure order lines aren't referencing a non-existing order, but bugs happen and I prefer the database returning an error to the database getting full of inconsistent data.

default-kramer

> Compose SQL in an actual real programming language that isn't horrible instead of doing any of this.

Got any recommendations? That's a serious question, I've tried many (and built my own) and some are okay but none feel remotely good enough when you consider how much progress we've made basically everywhere else in the wider world of software engineering. Relational databases are so good; they deserve better than SQL!

efromvt

Shameless plug if you've tried a range of alternatives: I've tried building a version (Trilogy) targeted at analytics, and I'd be curious about how it stacks up to your experience with other attempts at this - can read more about the composition/reuse story here: https://trilogydata.dev/blog/composability_reuse.html

default-kramer

I looked at it when it hit HN a while ago. Looks nice, but not exactly what I want because

> Joins and source tables are automatically resolved in Trilogy. You won't ever explicitly specify one in your query; you're declaring what you want, not how to get it. The responsibility of how to get it is delegated to the semantic model.

I don't fully understand how the semantic model is created (is this documented anywhere?), but I don't think I would enjoy a hard separation between the query layer and the semantic layer regardless.

I would prefer a continuum of "how much indirection do I really want" to be available to the user. My own exploration of this topic is https://docs.racket-lang.org/plisqin/index.html and you can compare section 2 (Using define-schema) to section 7.2 (Plisqin Desugared) if you want to know what I mean about a continuum.

Unrelated, the article we are commenting on has inspired me such that I think I have an answer to the big type system questions that eluded me when I decided to put Plisqin on the shelf. Maybe time to pick it up again... but probably not.

feoren

> Got any recommendations?

Find a good ORM that allows you to keep things at the "set logic" level for as long as possible and has a good internal query representation. Avoid or turn off any fancier features of the ORM like "navigation properties" or "implicit subtyping" or automatically detecting changes or anything that happens "magically" in the background: prefer simplicity and being explicit about everything. For C#, this is EntityFramework with most of its features turned off -- compose queries by joining IQueryables together, and only drop down into in-memory iteration when you have to.

If that's not available, I'd probably look for a library that presents a structured representation of the SQL abstract syntax tree (probably a subset) as an object, along with a translation to the various dialects. Pass those ASTs around and compose them together. I'd probably slowly build abstractions on top of this that eventually look similar to what I described in the first paragraph.

creamyhorror

https://kysely.dev/ (SQL query builder for Typescript) is good, although I'm not sure it lives up to your expectations. Instead of an ORM, build the underlying SQL queries efficiently.

jeltz

I have also tried many and so far none have even managed to be as good as the kludge that is SQL.

HdS84

C# LINQ comes close.

It basically allows to express queries independent of the target language. E.g. it's entirely possible that the same LINQ query can target PG, MSSQL Server, objects in local memory or RavenDB (NoSql). Quality of the generated query depends on the translating library ofc.

Syntax is very nice and much more aligned with programming than SQL. E.g: someData.Where(x => x.id =="someId").OrderBy(x => x.LastModified).Take(10).ToListAsync()

Note that the query is not executed until ToListAsync (or other materializing calls) is called. This allows for composabillity.

There is an alternative form of writing it more resembling SQL, but I've never seen it much used.

branko_d

I don't understand what you are trying to say about foreign keys. Are you advocating moving FKs from database to application?

If yes, you are opening yourself up to race conditions that will only become visible in concurrent environment. I had a distinct misfortune of working with an ERP system based on Oracle which didn't use database-level FKs. Predictably, their customers would occasionally complain about "dangling" data, which, to my knowledge, has never been properly resolved.

thyristan

It is possible to move foreign keys and other constraints and consistency checks into the application. But that comes at a huge cost: Either you need to be generous with your use of transactions to ensure that no invalid state is ever written, and your application has to know and use those transactions. Which is something application developers in my experience often fail at. Or you have to accept that broken inconsistent data is a fact of life that can happen at any time, and will increase in times of high throughput and concurrency (where the business part of the company will care most about things running smoothly). Also, you will limit yourself to exactly one application using the database, because otherwise you will have to duplicate your constraint logic in any application precisely.

There is also a cost of doing constraints and consistency in SQL, the language is somewhat limited in what you can easily represent (anything is possible, but only some things are easy). And the handling of violations still needs to be implemented in the application, which often necessitates at least some duplication and fiddling to e.g. get friendly and useful error messages.

So imho, yes, SQL has its problems, but it is still the lesser evil in almost all cases.

feoren

> Or you have to accept that broken inconsistent data is a fact of life that can happen at any time

You have to accept this anyway. Data can be broken and inconsistent in so many different ways that you cannot protect against every possible way. People will type the wrong thing in sometimes, in ways you never expected. Someone pasting text into the wrong field is just as "broken and inconsistent" as an ID field pointing to a nonexistent entity. How important those errors are to the business depends on how important those columns are, not whether they're an ID vs. text. And as another person pointed out, data always ends up split/replicated among multiple systems anyway.

> Also, you will limit yourself to exactly one application using the database, because otherwise you will have to duplicate your constraint logic in any application precisely.

Layer your architecture? Expose an API? Microservices? There are all sorts of ways around this problem.

Yes, it makes it harder for people to just log into your database and run their own SQL queries. Sometimes that may be politically hard to sell. But I heard this same argument in college when private variables were introduced: wait, but, what if someone wants to directly manipulate private variables of class members? That is not a virtue. It's perfectly reasonable to assert that some internals are simply private implementation details. You are reserving your right to change those details in the future as the application changes. You are separating the details of your persistence layer from the public interface that you're exposing. That is a virtue.

threatofrain

It's inevitable that some data will be split between more than one service / database. I'm not saying we should have data whimsically split up, but foreign key constraint logic must eventually come to live in the app layer too.

tacone

> Imagine you joined a new team, and they said: our program uses a bunch of XML files as its data format, and that's why most of our business logic is implemented in XPath. Just sit there and ponder that situation for a little bit.

Been there 20 years ago! It sucks at unimaginable levels!

dicytea

I agree with most of this, but I don't understand why you'd treat FKs as business logic. IMO, it's clearly a part of the data structure. Like, say if you have a domain that looks like:

    type User = {
      username: string;
      posts: Post[];
    }

    type Post = {
      content: string;
    }

(for the DDD-brained, assume User is an aggregate and Post is a value object here)

The natural way to represent this in the database is:

    CREATE TABLE users (
        username text PRIMARY KEY
    );

    CREATE TABLE posts (
        author text NOT NULL REFERENCES users (username),
        content text NOT NULL
    );

I just don't see why you'd do it in any other way.

feoren

Hey dicytea, thanks for getting that blogging app up. Our users love it. Hey, I was talking to John, and he said he's got some guest columns he wants you to throw up there. Just some blog posts from thinkers he likes that he thinks would be a good fit for our site. Can you throw up some of those?

What do you mean they need to be users of the site? They're just some people whose posts we want to feature. No, they shouldn't need to log in. What are you talking about? What does that have to do with it?

</BossVoice>

Oops, turns out the rule "all blog post authors are users of the site" was actually just a volatile business rule after all.

lelanthran

Yes, and isn't it wonderful that you get an error message when you try to change a business rule, forcing you to properly encode the new rule instead?

A lot of these types of scenarios are missing the fact that, without these enforcements in the database, sooner or later a developer is going to make a change that violates an existing business rule without realising that they just broke a rule!

Rules change. We know this. What is valuable is being told that some new rule conflicts with an existing rule.

If you don't enforce the business rules in the database, how do you know when a new rule conflicts with some existing rule?

throwaway201606

I understand the primary premise about the difficulty with testing SQL and fully agree with it.

I do have a question though - while I understand how functors can help make the problem easier to tackle, I am not sure I fully understand how functors are different from a similar existing tool - stored procedures.

Some DB flavors:

- can take tables as arguments to stored procedures - can return tables - also offer the additional benefit of being able to run almost all flavors of SQL commands ( DDL, DML, DQL, DCL, TCL) in those stored procedures

Netezza stored procedures, for example, can do what you describe here:

https://www.ibm.com/docs/en/netezza?topic=nsg-return-result-...

As can SQL Server & Oracle (which both return cursors, which are just ordered tables):

https://learn.microsoft.com/en-us/sql/relational-databases/s...

https://docs.oracle.com/cd/B28359_01/appdev.111/b28843/tdddg...

Am I missing something fundamental here? How are functors different from stored procedures? To me, they seem to be just a sub-class / sub-type of stored procedures.

default-kramer

The goal is that the composable parts get woven into an efficient, planner-friendly query. Stored procedures completely undermine that unless something very exciting has happened since last I checked (SQL Server, but probably applies to all of them). You will likely end up in a "row by agonizing row" situation.

(Okay, maybe you can find a few cases where stored procs work decently, but in general both composability and performance will be much worse than the proposed functors.)

throwaway201606

OK, this I understand, that is a good insight - cursors are row-processing based so its gonna be slow

I think Netezza, SQL Server and Oracle are all cursor-based processing "by default" so this makes a lot of sense. I suspect that they all have bulk operation capability but can't immediately think of how I would have worked bulk processing in a way that maps to this article - maybe something like analytic functions like windowing, partitioning etc. that is definitely not row by row.

Having said that, the examples I see for actual testing in the article are DQL / DML so would be multiple row processing by default .. yes, the functor definition / creation is a DDL process but it is a "do once and reuse the definition" thing (like, the author correctly observes, a view, which is the point of functors) and the functor in use would just be DML. In which case, functors go back to looking like stored procedures...

I also understood composability as being built in for SQL - for example, in Oracle, packages allow composability of stored procedures, triggers, sequences etc allow composability of DML and views allow composability of queries and tables - which the author points out in the article.

With functors, DDL, DML, DQL, DCL, TCL would still be the only command tools available unless a new command language was invented for SQL for testing - let call that something like DTL (Data Test Language), with a whole new bunch of associated new SQL keywords, capability and functionality that are built right into the core of the DB engine that are optimized for what functors are trying to achieve.

gregw2

Regarding "can't immediately think of how I would have worked bulk processing in a way that maps to this article think of how to map composibility" ...

I believe stored procedures where you construct dynamic sql and execute the results can basically provide the composability/performance described with bulk non row-based logic. If you keep it simple it can work ok.

andyferris

They seem somewhat like stored procedures, but not stored? As in a query can contain a functor in it and then use it immediately. I didn't see those `create functor` statements as anything other than ephemeral - or am I wrong?

EDIT: also stored procs that use imperative logic and cursors can be quite a bit slower than queries that achieve the same logic - the capability here is purposefully a subset of that and is just to help build standard SQL queries that can go through the standard query planner.

throwaway201606

I think they have to be long lived else they cannot make sense for performant testing. ie they are created as DB objects, using DDL, in the same way tables, views, functions etc are made.

They can certainly be created at test run time but that would slow things down a lot - you would essentially be creating a ton of objects every time you run the test which means having a setup to test if they exist or not, take them down if they do or fix them if they don't match spec ( e.g. column and data type changes etc etc )

The more I think about this, the more complicated I realize it would be to manage this dynamically:

You essentially have to build a test harness enviroment that figures out your testing elements dynamically from your data environment (with some kind of parameterization engine and data set to tell it what to look for so as to "make functors and run them" (e.g. all PKs of FKs or all columns starting with a certain prefix or all columns of a certain data type etc etc), gets the most up to date definitions of those elements from system tables and uses that data to create or update or drop functor objects ... wow, ok, this is getting complicated, I am going to stop now before I see the void.

sixdimensional

I found the argument that it’s impossible to test SQL (and some other hard statements in this article) a bit too absolute.

Testing SQL is not impossible – SQL unit testing is a thing – identify a unit of SQL, such as a query, function, or stored procedure - mock the necessary inputs and any state that is modified and check the result of the unit just as you would in any unit testing framework (e.g. arrange, act, assert). Perhaps it’s a little difficult, or different to wrap our heads around (and many do not take the time to see how this can work) but it’s not impossible.

Also, as others have already pointed out, depending on the implementation of the SQL engine itself, different engines can have very different behaviors, some of which make composable SQL possible already today (e.g. table valued functions, CTEs that are inlined, stored procedures, etc).

And for what it’s worth, if you look at a lot of the data processing engines we use today (such as Apache Spark), they really solve the problem of being able to compose and modularize code and SQL together, through data frames or things like language integrated query, integrations of languages that compile down to code that can run and be orchestrated on common execution engines, and so forth.

agentultra

Ah but “functor,” is a name from category theory and algebraic topology. I had read along thinking that this would come up in the conclusion! Instead it seems it’s not even mentioned in the sources for the name.

There’s potentially something like this in the “esqueleto” package for Haskell. But it has some limitations. It composes on syntactic fragments.

Although speaking of functors, the category theory ones, maybe categorical databases would fit the bill: https://categoricaldata.net/

RyanHamilton

He got so close to the right answer but went the wrong direction. "Imagine a programming language without functions.". Imagine SQL was instead based on an actual programming language, with variables and functions. that would solve all the problems you mention. Kdb+ and Dialog already knew this 20+ years ago. I wish someone else will recreate this in an open source library. Now with Arrow format in memory the best hope may be that all languages will be able to act on shared memory data frames bringing tables to every language within the database.

zetalyrae

> Imagine SQL was instead based on an actual programming language, with variables and functions.

This is what the entire article is about. That paragraph is meant to illustrate the problem of SQL through an analogy.

RyanHamilton

The article is about functors. Some new concept that gives you a tiny subset of the functionality that a full language provides. Most languages have loops, control structures, the ability to store any data as a variable.

r1b

Nodding along furiously. One area where this comes up is when you want to export some piece of logic that effectively “returns a table”.

For example, I work with a source of data that requires a particularly hairy JOIN strategy. The logic is the same for any table with the same “shape”, but no one can reuse it without copy / paste.

virtualwhys

See Esqueleto in Haskell, Slick and Quill in Scala, probably LINQ or some variant on .NET in C#/F#.

All support building up arbitrarily complex queries based on statically typed query fragments.

String fragment "composition" is like banging rocks together.

yujzgzc

There are a lot more composability problems with SQL.

As an example, can you write a function that you can invoke on a column in a select statement and will compute the cumulative distribution function of that column?

The expression to do this is enormous "SUM(column) OVER (ORDER BY column ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) / SUM(column) OVER (ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)" and there is no choice but to inline it. I think it would be reasonable to expect users to be able to describe this and then write simply "CDF(column)". The only systems that do this I believe are text processing macros on top of SQL...

reshlo

`CUME_DIST() OVER (ORDER BY column)`

https://modern-sql.com/caniuse/cume_dist

yujzgzc

My point is that in most other languages, this kind of functionality can easily be factored into libraries available to the community. SQL doesn't have the composability so every single of these features requires to be added directly into the language. Libraries and package tooling is practically non-existent in SQL land.

earnestinger

Actually some do, but…

- non standard sql

- not concise

- has limitations

- installing a function is a db migration

I had dreams. A lot of them. I still do some times.

“Normal language” functions accept/return: primitives, objects(, functions, types).

Sql functions could accept/return: scalars/primitives, a row, set of rows, a table (maybe set of tables or related rows), for aggregating functions need to group by some columns, sliding window functions need to keep track of the window.

Let’s add some kind of interfaces/traits and generics.

If we are extending current sql, then I’m certain there will be corners cases for various clauses sql clauses (select, from, on, where)

And all of that on top of basic sqls (sel/ins, checks, access controls, query execution planning)

Quite quickly c++ starts to look like an exercise in language minimalism.

mrtimo

Malloy is a newish data language with semantic models... Perhaps what semantic SQL would look like: https://www.malloydata.dev/

392

Somehow Malloy always gets called out in these threads but not PRQL.

https://prql-lang.org/