SQL or Death? Seminar Series – Spring 2025
46 comments
January 30, 2025
tanvach
I dunno, personally when I teach SQL to engineers, it's not the syntax that is the main difficulty (yes, it can be improved) but the relational data mental model that trips people up. They all want sequential data processing and loops. Translating that to vectorized code is the closest thing I've found to help non-data people grok it.
fxttr
I used to hate SQL when I was a backend engineer. I had difficulties understanding exactly this mental model and tried to avoid writing SQL by using ORMs. At some point I “accidentally” switched to data engineering and was forced to get to grips with SQL. I think that in the meantime I warmed up to a functional programming style which helped me to some extent.
whatever1
The thing that trips everyone is that you cannot explain the SQL outputs. In sequential code, you can step through and understand why the logic error produced erroneous output.
With SQL, it is trial and error. When your query passes your sniff tests, you sign off "looks good to me" and ship it to the world, only for it to break silently shortly afterwards without any warning.
In enterprise, I am convinced at this point that all of the complex ETL jobs are vending wrong outputs. Just nobody has the tools to diagnose and fix the problems.
cyberax
For me, it's not the syntax, and not even the relational model. They are fairly easy to explain.
No, it's the morass of interactions between GROUP BY/ORDER BY/HAVING. Like, why isn't there a FIRST function to select the first element in a group?
kmoser
Not sure what getting the first row has to do with any issues you have with GROUP BY/ORDER BY/HAVING. Each of those clauses serves a distinct purpose, and I'm not sure I'd describe their interactions as a "morass", since any connections between them are usually by design and well documented. With the power of SQL (or any powerful and complex tool) comes great responsibility to understand how it works.
_dain_
WHERE and HAVING (and QUALIFY) are redundant. The PRQL language unifies them into a single "filter" construct:
SQL:
SELECT country, MAX(salary) AS max_salary
FROM employees
WHERE start_date > DATE '2021-01-01'
GROUP BY country
HAVING MAX(salary) > 100000
PRQL:
from employees
filter start_date > @2021-01-01
group country (
aggregate {max_salary = max salary}
)
filter max_salary > 100_000
cyberax
A very simple problem, you have a table of house sales. For each house (identified by an ID), you want to find the sale date that had the lowest price.
You'd think that you should be able to sort by the house ID, then by the sale price, then group by the house ID, and select the sale _date_ from the first row of each group.
Something like: `select s.house_id, first(s.date) from sale s order by s.house_id, s.price asc group by s.house_id`. But this is impossible, because there's no `first` function in SQL.
But nope, you need DB-specific extensions for that.
helge9210
To have the first element, you have to describe a relation of order. It's not ordered by default.
calmbonsai
True, but even going all the way back to E.F. Codd, there's "cheating" involved with all sorts of implicit sorting.
Even now, the default behavior of SQLite is still "undefined" if a SELECT column doesn't appear in GROUP BY. Well, to be precise, the value comes from a single "arbitrarily chosen row" in the group. https://www.sqlite.org/lang_select.html#resultset
I'm not bagging on SQLite here (I love it!), but I am bagging on SQL 'cuz EVERY implementation must make far too many of these sorts of Faustian bargains.
cyberax
Actually, you don't have to. Any first element would do, even if it's a random one.
It's perfectly possible to do `SELECT * FROM blah LIMIT 1`, after all.
cyberax
But I can order it: https://news.ycombinator.com/item?id=42911873
azurezyq
Window function is your friend.
ROW_NUMBER() OVER (PARTITION BY x ORDER BY y) AS rank
Then, in an outer query: WHERE rank = 1.
Grouping is unordered, which is a clean definition.
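The window-function answer to the house-sales problem above can be sketched concretely. This is an illustrative example, not from the thread: table and column names follow cyberax's comment, the data is made up, and it runs against an in-memory SQLite database (window functions need SQLite >= 3.25).

```python
# Find, for each house, the date of its cheapest sale, using ROW_NUMBER().
# Schema and data are assumptions for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sale (house_id INTEGER, date TEXT, price INTEGER);
INSERT INTO sale VALUES
  (1, '2023-01-10', 300000),
  (1, '2023-06-02', 250000),
  (2, '2023-03-15', 500000),
  (2, '2023-09-20', 450000);
""")

# ROW_NUMBER() numbers rows within each house_id by ascending price, so
# rn = 1 marks the cheapest sale. The filter lives in an outer query
# because WHERE cannot reference a window-function alias directly.
rows = conn.execute("""
SELECT house_id, date
FROM (
  SELECT house_id, date,
         ROW_NUMBER() OVER (PARTITION BY house_id ORDER BY price ASC) AS rn
  FROM sale
) AS t
WHERE rn = 1
ORDER BY house_id;
""").fetchall()

print(rows)  # [(1, '2023-06-02'), (2, '2023-09-20')]
```

This is standard SQL (SQL:2003 window functions), so no DB-specific extension is needed, though the amount of ceremony compared to a hypothetical `first()` is arguably the grandparent's point.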
mamcx
> It’s not the syntax ... but the relational data mental model that trips people up.
It is also the syntax because it does not match the actual model. Even the simple example `SELECT 1` shows a mismatch.
And this cascades. Because the syntax is wrong, people have big trouble with `JOIN`s (that are disconnected from the idea of making `?-to-?` specifications), `group by` (that doesn't exist in SQL), aggregation logic (that is badly implemented as window functions) and, bigly, the lack of composability (that has a weird band-aid called CTEs), and so on.
A mismatch between the domain and the code is always reflected in syntax.
calmbonsai
Can I take death?
For me, it's the lack of expression composability in SQL. And yes, I'm well aware of various libraries and plug-ins that purport to yield "composable SQL". It's never gonna happen at scale, due to decades of investment in the query optimizers of the popular relational engines.
Just make peace with all of SQL's warts, or go NoSQL and make peace with code bloat and eventual consistency.
ro_sharp
Not sure I follow. In a world with CTEs and views, what do you think is missing for composability?
antihipocrat
CTEs really tripped me up when I started using them professionally. My mental model was that they result in reusable objects in memory and thus could be used to improve performance as well as composition.
After discovering the truth, it was interesting to find out that almost everyone I knew who wasn't a senior DB engineer shared the same incorrect assumption.
_dain_
https://borretti.me/article/composable-sql
>Imagine a programming language without functions. You can only write code that operates on concrete values, i.e. variables or literals. So instead of writing a function and calling it anywhere you have to write these little code templates as comments and every time you want to “call” the “function” you copy the template and do a search/replace.
>This would be tiresome. But that’s what SQL is. The concrete values are the table names. The code is the queries. And the function templates you have to search replace are your business logic, which must be inlined everywhere it is used.
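The quoted gap can be made concrete: a view names a query for reuse, but it cannot take parameters, so any parameterised "function" over tables has to be inlined at every call site. A minimal sketch (table, data, and view names are invented for illustration), using SQLite via Python's stdlib:

```python
# A view composes like a table, but its logic cannot be parameterised.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (name TEXT, country TEXT, salary INTEGER);
INSERT INTO employees VALUES
  ('ann', 'US', 120000),
  ('bob', 'US', 90000),
  ('cho', 'DE', 110000);

-- This part composes: other queries can SELECT from it like a table.
CREATE VIEW high_earners AS
  SELECT name, country FROM employees WHERE salary > 100000;
""")

# But the 100000 threshold is baked in. A "high_earners(threshold)"
# view is not expressible in standard SQL, so a variant with a
# different threshold must be copy/pasted wherever it is needed.
rows = conn.execute(
    "SELECT country, COUNT(*) FROM high_earners GROUP BY country ORDER BY country"
).fetchall()
print(rows)  # [('DE', 1), ('US', 1)]
```

This is the "code template you search/replace" situation the article describes: the view covers the zero-argument case, and everything else becomes inlined business logic.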
benzayb
"Make SQL as fast as possible" is a misnomer.
You cannot make a language fast, only its underlying implementation (i.e., the specific DBMS).
And this is the error I see 90% of the time: a confusion between conceptual (model) and implementation concerns.
The problem with SQL the language is its conceptual foundations -- it is not truly relational, and yet it is used as the language of the "relational" DBMSes on the market.
ayhanfuat
It says “making sql go as fast as possible”, referring to the talks about optimizers.
benzayb
To quote:
"The talks in this series will present ideas on either (1) making SQL go as fast as possible or (2) replacing SQL with something better."
How were you able to infer that the existence of the word "go" in that phrase refers to optimizers?
And the second idea/proposal, "replacing SQL with something better", really implies that they are talking about the language and not the optimizer.
In any case, if they really meant "optimizer" then their phrasing is very vague and imprecise; if they are talking about the "optimizer" -- then which specific optimizer? They have failed to mention that too.
calmbonsai
Correct. I think the parent commenter is confusing a language's concrete syntax with its abstract syntax. At the end of the day, the ease (or lack thereof) of translating an AST into performant executing code is highly correlated with the potential variance of those ASTs.
The more higher-level expressions your language supports, the more work its compiler (whether targeting physical or virtual hardware) must do to translate those ASTs into "machine" code, and the higher the variance of that code during its execution workload.
It's even more work when that language's expressions are declarative instead of imperative.
benzayb
When a DBMS claims it is relational and yet supports SQL "features" that depart from the relational model, then that DBMS is not relational.
If a DBMS is "slow" or "fast", that is not because of SQL -- it is because of the specific implementation decisions that the makers of that DBMS took.
And that has nothing to do with what you just said.
mg
SQLite makes for a really nice document store with the json arrow functions. Assume you have a "book" column that is a json blob, then selecting the names of all books is just:
SELECT book->>'name' FROM books;
The only thing I miss is auto-increment. Without an additional "id" column, inserting a new book with a unique id becomes cumbersome.
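A sketch of that pattern, with an assumed table layout: `json_extract` is used below because the `->>` arrow syntax only arrived in SQLite 3.38, while `json_extract` behaves the same way on older builds. For the id problem, one common workaround is an `INTEGER PRIMARY KEY` column, which in SQLite aliases the rowid and is assigned automatically on insert.

```python
# Document-store style usage of SQLite with a JSON blob column.
# The "books" schema here is an assumption for illustration.
import json
import sqlite3

conn = sqlite3.connect(":memory:")
# INTEGER PRIMARY KEY aliases SQLite's rowid, so each insert gets a
# unique, increasing id without tracking one inside the JSON blob.
conn.execute("CREATE TABLE books (id INTEGER PRIMARY KEY, book TEXT)")

for name in ("Dune", "Hyperion"):
    conn.execute(
        "INSERT INTO books (book) VALUES (?)",
        (json.dumps({"name": name, "format": "paperback"}),),
    )

names = conn.execute(
    "SELECT json_extract(book, '$.name') FROM books ORDER BY id"
).fetchall()
print(names)  # [('Dune',), ('Hyperion',)]
```

Whether the rowid trick counts as "an additional id column" or not is a matter of taste; it at least avoids generating ids in application code.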
yayitswei
I chose death and never looked back. My favorite replacement has been Datomic; currently looking into Rama.
__mharrison__
I generally prefer working with dataframes, so this could be interesting.
relaxing
> Suppose somebody has been rubbing gasoline on their body since the 1970s. Would you marry that person even if they smelled terrible?
Who wrote this? What are they trying to do here? Be funny? Stop that.
brikym
I choose Kusto Query Language.
pletnes
But where? It's got its nice sides, but in my experience it lives in the Azure web portal only, and I don't want to write code there.
Temporary_31337
These seminars seem like marketing sessions for various database vendors. Prove me wrong.
james_cowling
I'm giving the first talk of the session and while I'll be talking about ideas we've actually built and tested in Convex, the point I want to get across is that I think the industry has gone in the wrong direction with exposing databases to clients and using an inexpressive declarative language to query them. Will be trying to avoid any shilling because I feel like this is a pretty important discussion to have.
ksec
This is going to be ranty. I feel it's the same question over and over again, and tech is worse than the fast fashion industry.
C or Death? Let's replace C with Rust.
SQL or Death? Let's replace SQL with NoSQL, and now back to SQL, and on to something else?
HTML or Death? Let's replace everything with JS client-side rendering.
PDF? JPEG? AAC? H.264 / AVC? x86 / ARM [1]? Every single thing that is 10 to 20 years old is considered old-fashioned, and we need to rip it apart without learning about its merits and the pitfalls it went through. And sometimes, as Alan Kay said, we end up only reinventing the flat tire.
How about a better C? A better SQL? A better JPEG? And most if not all of the time it isn't the language or the spec that is the problem; it is the implementation. I am willing to bet we won't even have an open source SQL implementation that is as good as today's Oracle or MS SQL in 10 years' time. There is still an insane amount of work in the implementation alone.
It seems the Tech industry as a whole is in an eternal philosophy battle of throwing away everything and build something new vs improving what we have today and iterating on it.
[1] A prominent Intel engineer once said that by the time you learn everything about hardware design pitfalls, from GPUs to CPUs and different ISAs, and are ready to design something truly so much better, you will be at retirement age. I think this sort of thing holds true in other areas as well.
least
Everything you're complaining about being replaced not only still exists but is still widely used and in many cases continues to be updated and improved, often in response to the ideas introduced by these newer technologies. Would relational databases have JSON support without NoSQL’s rise in popularity? I doubt it.
This just isn’t a problem. There are more than enough programmers in the world to explore different approaches to common problems. We all ultimately learn and benefit from those experiments. The cycle of reinvention isn’t some great failure that requires fixing. It's how progress works.
And yes, people will not understand everything about what exists or has existed before trying something new that may or may not work out. We're humans, not robots. We don't have perfect memory and we don't have the bandwidth to ingest every bit of wisdom that has come before us, so inevitably we will retread some of the same mistakes that have been made or rediscover things that have already been discovered. That's called learning.
> It seems the Tech industry as a whole is in an eternal philosophy battle of throwing away everything and build something new vs improving what we have today and iterating on it.
At what point do iterations become so different, or so completely ideologically opposed, that they stop being compatible with the language? Should we get rid of malloc and free in C? What if the extension I want to add to the language is rejected by the governing board? What if the file format is proprietary and requires a license? What if it belongs to a company like Oracle or Microsoft, and contributing to its development is only possible if you work for them?
mmcromp
The trends in tech come from a demand for better software solutions and innovation in making them. People want modern apps and developers want modern tools. People who want to use brand new technology just for the sake of it exist, but for the most part we're growing and learning and making great things.
We are absolutely learning from the pitfalls of our previous technology to make better things. It's not perfect but we're moving forward. It's not fast fashion, it's more like the car/vehicle industry.
Older developers who cannot tell the difference between a website built 20 years ago and one built today, or who think the developer experience of working with Rust vs. C++ is just aesthetic, are digging their heels into their own ignorance.
onemoresoop
Whatever the industry comes up with, it needs time to mature. What's in question here is mature and still useful, and we can learn a whole lot from it. So the only take here is to not jump from bandwagon to bandwagon, and to stick with some principles until maturity.
mkleczek
One nitpick. We do have an Open Source SQL implementation as good as commercial ones: PostgreSQL.
TheCleric
I choose death. Thanks.
DannyBee
I'm surprised they didn't name it SQL or Death Or 1?
treebeard901
DROP [TheCleric];
This is a great lineup. Excited for the "SQL replacement" talks. Malloy is currently my favorite because of its semantic nature. Define all the joins, measures, and dimensions up in the source, and use them as needed. Makes for easy writing and cleaner thinking.
Also, Malloy has done something clever. They embed duckdb into their VSCode extension, so you can easily query (and join) .csv or .parquet files on the fly. I laugh when I think of all the time I spent noodling with pandas code to do simple data cleaning, analysis, or joining. This also means that you can press "." on any github repo and start querying.
I teach MS Business Analytics students - they really enjoyed walking through an analysis of first names given in the USA for the past century. To see it, click to the github repo [1], press ".", install the Malloy extension, click back to the names1.malloynb file and run the queries.
[1]: https://github.com/malloydata/malloy-samples/blob/main/names...