
SQL Anti-Patterns You Should Avoid


63 comments

·October 18, 2025

SoftTalker

A big one that isn't listed is looking for stuff that isn't there.

Using != or NOT IN (...) is almost always going to be inefficient.

Also, understand how your DB handles nulls. Are nulls and empty strings the same? Does null == null? Not all databases do this the same way.
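The `NOT IN` + NULL interaction is worth seeing concretely. A minimal sketch using SQLite via Python (the `orders` and `blocked` tables are made up for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE orders (id INTEGER, customer_id INTEGER);
    CREATE TABLE blocked (customer_id INTEGER);
    INSERT INTO orders VALUES (1, 10), (2, 20);
    INSERT INTO blocked VALUES (20), (NULL);  -- one NULL sneaks in
""")

# NOT IN against a set containing NULL matches nothing:
# 10 NOT IN (20, NULL) evaluates to NULL, which is not true.
rows = con.execute(
    "SELECT id FROM orders "
    "WHERE customer_id NOT IN (SELECT customer_id FROM blocked)"
).fetchall()
print(rows)  # []

# NOT EXISTS sidesteps the NULL problem entirely.
rows = con.execute("""
    SELECT id FROM orders o
    WHERE NOT EXISTS (SELECT 1 FROM blocked b
                      WHERE b.customer_id = o.customer_id)
""").fetchall()
print(rows)  # [(1,)]
```

The first query silently returns nothing the moment a single NULL appears in the subquery, which is why `NOT EXISTS` is usually the safer spelling.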

EvanAnderson

> Overusing DISTINCT to “Fix” Duplicates

Any time I see DISTINCT in a query I immediately become suspicious that the query author has an incomplete understanding of the data model, a lack of comprehension of set theory, or more likely both.

dotancohen

I've been told similar nasty things for adding LIMIT 1 to queries that I expect to return at most a single result, such as querying for an ID. But on large tables (at least in SQLite and MySQL, and maybe Postgres too), without the LIMIT the database will keep scanning the entire table even after the matching record has been found.

sigwinch28

Or it’s simply an indicator of a schema that has not been excessively normalised (why create an addresses_cities table just to ensure no duplicate cities are ever written to the addresses table?)

bts89

That’s almost always my experience too.

Though fairly recently I learned that even with all the correct joins in place, sometimes adding a DISTINCT within a CTE can dramatically increase performance. I assume there’s some optimizations the query planner can make when it’s been guaranteed record uniqueness.

dragonwriter

In my experience, it's nearly as often a problem with the design of the database as with the query author.

jmull

I'd be wary of overgeneralizing on that. I guess it depends on whose queries you're usually reading.

bandrami

IDK, "which ZIP codes do we have customers in?" seems like a reasonable thing to want to know

mbb70

The very next ask will be "order the zipcodes by number of customers" at which point you'll be back to aggregations, which is where you should have started
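Starting with the aggregation answers both questions at once. A sketch (hypothetical `customers` table, SQLite via Python):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customers (id INTEGER, zip TEXT)")
con.executemany("INSERT INTO customers VALUES (?, ?)",
                [(1, "10001"), (2, "10001"), (3, "94105")])

# SELECT DISTINCT zip answers "which ZIPs?", but the GROUP BY
# also answers the inevitable follow-up "how many per ZIP?".
rows = con.execute("""
    SELECT zip, COUNT(*) AS n
    FROM customers
    GROUP BY zip
    ORDER BY n DESC
""").fetchall()
print(rows)  # [('10001', 2), ('94105', 1)]
```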

wvbdmp

Anti-Patterns You Should Avoid: overengineering for potential future requirements. Are there real-life cases where you should design with the future in mind? Yes. Are there real-life cases where DISTINCT is the best choice by whatever metric you prioritize at the time? Also yes.

kristjansson

Whole seconds will have been wasted!

edoceo

count(id) group by post_code order by 1

sql_nitpicker

distinct seems like an aggregation to me

ryandv

PostgreSQL's `DISTINCT ON` extension is useful for navigating bitemporal data in which I want, for example, the latest recorded version of an entry, for each day of the year.

There are few other legitimate use cases of the regular `DISTINCT` that I have seen, other than the typical one-off `SELECT DISTINCT(foo) FROM bar`.
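For reference, a latest-version-per-day query of the kind described. In Postgres it would be written with `DISTINCT ON`:

    SELECT DISTINCT ON (day) day, value
    FROM entries ORDER BY day, recorded_at DESC;

`DISTINCT ON` is Postgres-specific; a portable equivalent (runnable here in SQLite, with a made-up `entries` table) uses `ROW_NUMBER()`:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE entries (day TEXT, recorded_at TEXT, value TEXT);
    INSERT INTO entries VALUES
        ('2025-01-01', '09:00', 'draft'),
        ('2025-01-01', '17:00', 'final'),
        ('2025-01-02', '08:00', 'only');
""")

# Keep the most recent row per day: rank rows within each day by
# recorded_at descending, then keep rank 1.
rows = con.execute("""
    SELECT day, value FROM (
        SELECT day, value,
               ROW_NUMBER() OVER (PARTITION BY day
                                  ORDER BY recorded_at DESC) AS rn
        FROM entries
    ) WHERE rn = 1 ORDER BY day
""").fetchall()
print(rows)  # [('2025-01-01', 'final'), ('2025-01-02', 'only')]
```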

ryandv

Set theory...

There are self-identifying "senior software engineers" who cannot understand even what an XOR is, even after you draw out the entire truth table, all four rows.

petalmind

> Overusing DISTINCT to “Fix” Duplicates

I wrote a small tutorial (~9000 words in two parts) on how to design complicated queries so that they don't need DISTINCT and are basically correct by construction.

https://kb.databasedesignbook.com/posts/systematic-design-of...

joshmn

Nice articles in there. Bookmarked.

Edit: it’s also actually a book!

wmonk

The section on using functions on indexed columns could do with a more explicit and deeper explanation. When you apply a function to the indexed column in a predicate, the query becomes a full scan instead: the query runner has to evaluate the function on every row, effectively removing any benefit of the index.

Unfortunately I learned this the hard way!
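You can watch this happen in the query plan. A sketch in SQLite via Python (table and index names are made up; SQLite supports expression indexes, so the same fix as in the article applies):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE INDEX idx_users_name ON users(name);
""")

def plan(sql):
    # The 4th column of EXPLAIN QUERY PLAN output is the detail text.
    return " ".join(r[3] for r in con.execute("EXPLAIN QUERY PLAN " + sql))

# Wrapping the indexed column in a function forces a scan
# (of the table or a covering index), not an index search.
print(plan("SELECT id FROM users WHERE UPPER(name) = 'ABC'"))

# An expression index on UPPER(name) makes it a search again.
con.execute("CREATE INDEX idx_users_name_upper ON users(UPPER(name))")
print(plan("SELECT id FROM users WHERE UPPER(name) = 'ABC'"))
```

The exact plan wording varies by SQLite version, but the first plan reports a SCAN and the second a SEARCH using `idx_users_name_upper`.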

crazygringo

The blog has a typo. The first line needs to have the text in uppercase:

> query WHERE name = ‘ABC’

> create an indexed UPPER(name) column

The point is that the index itself is already on the data with the function applied. So it's not a full scan, the way the original query was.

Of course, in this particular example you just want to use a case-insensitive collation to begin with. But the general concept is valid.

tremon

The given solution (create an indexed UPPER(name) column) is not the best way to solve this, at least not on MS SQL Server. Not sure if this is equally supported in other databases, but the better solution is to create a case-insensitive computed column:

  ALTER TABLE example ADD name_ci AS name COLLATE SQL_Latin1_General_CI_AS;
(season to taste)

null

[deleted]

readthenotes1

"Unfortunately I learned this the hard way!" ... Seems to be the motto of SQL developers.

Otoh, it seems a fairly stable language (family of dialects?), so finding the pitfalls has long-term leverage

dgb23

If "select *" breaks your code, then there's something wrong with your code. I think Rich Hickey talked about this. Providing more than is needed should never be a breaking change.

Certain languages, formats and tools do this correctly by default. For the others you need a source of truth that you generate from.

sql_nitpicker

I don't see anything wrong with what the article is saying. If you have a view over a join of A and B, and the view uses "select *", then what is gonna happen when A adds a column with the same name as a column in B?

In sqlite, the view definition will be automatically expanded and one of the columns in the output will automatically be distinguished with an alias. Which column name changes is dependent on the order of tables in the join. This can absolutely break code.

In postgres, the view columns are qualified at definition time so nothing changes immediately. But when the view definition gets updated you will get a failure in the DDL.

In any system, a large column can be added to one of the constituent tables and cause a performance problem. The best advice is to avoid these problems and never use "select *" in production code.
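The SQLite case can be demonstrated directly. A sketch (hypothetical tables `a` and `b`; the exact alias SQLite picks for the colliding column is version-dependent):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE a (id INTEGER, title TEXT);
    CREATE TABLE b (a_id INTEGER, status TEXT);
    CREATE VIEW v AS SELECT * FROM a JOIN b ON b.a_id = a.id;
    INSERT INTO a VALUES (1, 't');
    INSERT INTO b VALUES (1, 'ok');
""")
cols = [d[0] for d in con.execute("SELECT * FROM v").description]
print(cols)  # ['id', 'title', 'a_id', 'status']

# SQLite re-expands the view's "*" on each use, so a schema change
# silently changes the view's shape -- and creates a name collision
# that SQLite resolves by aliasing one of the two 'status' columns.
con.execute("ALTER TABLE a ADD COLUMN status TEXT")
cols = [d[0] for d in con.execute("SELECT * FROM v").description]
print(cols)  # five columns now, two of them derived from 'status'
```

Any caller that read the view's columns by position, or by the name `status`, may now get a different answer than before the `ALTER TABLE`.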

0xbadcafebee

`select *` is bad for many reasons, but the biggest is that the "contract" your code has with the remote data store isn't immutable. The database can change, for many different reasons, independent of your code. If you want to write reliable code, you need to make as few assumptions as possible. One of those assumptions is what the remote schema is.

rileymat2

The reasoning is in the article, and true.

> Schema evolution can break your view, which can have downstream effects

Select * is the problem itself in the face of schema evolution and things like name collision.

tremon

If you have select * in your code, there already is something wrong with your code, whether it breaks or not: the performance and possibly output of your code is now dependent on the table definition. I'm pretty sure Rich Hickey has also talked about the importance of avoiding non-local dependencies and effects in your code.

onli

The performance and partly the output of the code is always dependent on the table definition. * instead of column names just removes an output limiter, which can be useful or can be irrelevant, depending on the context.

Though sure, it's known to negatively affect performance, I think in some database systems more than in others?

skybrian

> three or four layers of subqueries, each one filtering or aggregating the results of the previous one, totaling over 5000 lines of code

In a better language, this would be a pipeline. Pipelines are conceptually simple but annoying to debug, compared to putting intermediate results in a variable or file. Are there any debuggers that let you look at intermediate results of pipelines without modifying the code?

tremon

This is not a pipeline in the control flow sense; the full query is compiled into a single processing statement, and the query compiler is free to remove and/or reorder any of the subqueries as it sees fit. The intermediate results during query execution (e.g. temp table spools) do not follow the structure of the original query, as CTEs and subqueries are not execution boundaries. It's more accurate to compare this to a C compiler that performs aggressive link-time optimization, including new rounds of copy elision, loop unrolling and dead code elimination.

If you want to build a pipeline and store each intermediate result, most tooling will make that easy for you. E.g. in dbt, just put each subquery in its separate file, and the processing engine will correctly schedule each subresult after the other. Just make sure you have enough storage available, it's not uncommon for intermediate results to be hundreds of times larger than the end result (e.g. when you perform a full table join in the first CTE, and do target filtering in another).

jasonpbecker

We did the views-on-views thing once, when triggers (at least how we implemented them) failed. This became a huge regret that we lived with for years and not-so-affectionately called "view mountain". We finally slayed view mountain over the last 2 years and it feels so good.

anthonyIPH

"Instead you should:

query WHERE name = ‘abc’

create an indexed UPPER(name) column"

Should there be an "or" between these 2 points, or am I missing something? Why create an UPPER index column and not use it?

karmakaze

[and a third] OR use a case-insensitive collation for the name column.

egeozcan

I don't know about anti-patterns, but what I like to do is put 1=1 after each WHERE to align the ANDs nicely, and this is enough to create huge dramas in PR reviews.

ffsm8

It's always perfectly aligned for me, because pressing enter prefixes two spaces in my IDE in SQL files, ending with

    where a=1
      And k=2
      And v=3

DrewADesign

> what I like to do is putting 1=1 after each WHERE to align ANDs nicely

Frankly, that sounds like one of those things that totally makes sense in the author’s head, but inconsiderately creates terrible code ergonomics and needless cognitive load for anyone reading it. You know to just ignore those expressions when you’re reading it because you wrote it and know they have no effect, but to a busy code reviewer, it’s annoying functionless clutter making their job more annoying. “Wait, that should do nothing… but does it actually do something hackish and ‘clever’ that they didn’t comment? Let’s think about this for a minute.” Use an editor with proper formatting capability, and don’t use executable expressions for formatting in code that other people look at.

MobiusHorizons

I use `WHERE true` for this. Very little cognitive load parsing that. And it makes AND conditions more copy pastable. Effectively the trailing comma of SQL where clauses

0xbadcafebee

At this point it's malpractice not to use AI to analyze your SQL statements and tables for optimizations

jpnc

Are we on bizarro HN?

No, you ask the DB to EXPLAIN itself to you.

chongli

When working with larger enterprise software, it is common to have large CASE WHEN statements translating application status codes into plain English. For example, status code 1 could mean the item is out of stock.

Why wouldn’t you store this information in a table and query it when you need it? What if you need to support other languages? With a table you can just add more columns for more languages!
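A sketch of the lookup-table approach (table and column names are made up; one row per status code, one column per language):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE items (id INTEGER, status_code INTEGER);
    -- A new language is a new column here, not a new CASE WHEN
    -- copied into every query that touches status codes.
    CREATE TABLE status_codes (
        code INTEGER PRIMARY KEY, label_en TEXT, label_de TEXT);
    INSERT INTO items VALUES (1, 1), (2, 2);
    INSERT INTO status_codes VALUES
        (1, 'out of stock', 'nicht vorraetig'),
        (2, 'in stock', 'vorraetig');
""")

# Translating a code is now just a join.
rows = con.execute("""
    SELECT i.id, s.label_en
    FROM items i JOIN status_codes s ON s.code = i.status_code
""").fetchall()
print(rows)  # [(1, 'out of stock'), (2, 'in stock')]
```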

megaman821

I usually use generated columns for this. It still uses CASE WHEN but it is obvious to all consumers of the table that it exists.

jwsteigerwalt

That’s my rap sheet…