Yagri: You are gonna read it
67 comments · April 23, 2025
lud_lite
Another option is that audit info could go to another table or datastore entirely.
If you never use it, that data can be dumped to s3 glacier periodically (e.g. after 90 days).
By losing the foreign key you gain flexibility in what you audit. Maybe audit the operation and not the 20 writes it causes.
klysm
My current approach is to have the database be the current state and use logical replication (CDC) to keep a log of changes in case you need it.
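A minimal sketch of the plain-Postgres version of that (publication/subscription names are illustrative; a plain subscription mirrors state, whereas a CDC consumer such as Debezium reads the same WAL and writes each change somewhere durable):

```sql
-- On the source database (requires wal_level = logical in postgresql.conf):
CREATE PUBLICATION change_feed FOR ALL TABLES;

-- On the instance that keeps the history:
CREATE SUBSCRIPTION change_feed_sub
    CONNECTION 'host=primary dbname=app user=replicator'
    PUBLICATION change_feed;
```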
datadrivenangel
It is interesting thinking about record changes as a spectrum towards application logs. At some point too much detail is expensive to store, and you must adopt an archival strategy.
klysm
Really depends on the app. If you have a low throughput line-of-business kind of application you can probably get away with storing everything.
alganet
If you see it from the pure SQL point of view, you are at the "blame database engines and adopt an experimental solution" step.
It is the point where you give up modeling the audit as part of the systems tables.
The drawbacks of this choice are often related to retrieval. It depends on the engine.
I once maintained a system that kept a fully working log-replicated instance delayed by 24h, ready for retrieval queries, in addition to regular disk backups (slow, costly retrieval).
I am more developer than DBA, so I can probably speak more about modeling solutions than infra-centric solutions.
klysm
Yeah 100% giving up on pure SQL to solve the problem, mainly from the perspective that doing full versioning etc. in SQL is really damn hard.
awesome_dude
But wait, there's Event Driven Architectures and Event Sourcing, meaning that the events are your log of edits!
alganet
Doesn't that also fall under "blame the database engines and go for an experimental solution"?
I'm not saying databases are blameless. It's just that experiencing the issues they have by yourself is rewarding!
There is also a walk before the long walk of databases. Store things in text files and use basic tools (cat, sed, sh...).
The event driven stuff (like Kafka) reminds me of that. I am not very familiar with it though, just played a little bit with it once or twice.
gorgoiler
Additionally, mutable fields will quite often benefit from having a separate edit table which records the old value, the new value, who changed it, and when. Your main table’s created and updated times can be a function of (or a complement to) the edit table.
It is tempting to supernormalize everything into the relations object(id, type) and edit(time, actor_id, object_id, key, value). This is getting dangerously and excitingly close to a graph database implemented in a relational database! Implement one at your peril — what you gain in schemaless freedom you also lose in terms of having the underlying database engine no longer enforcing consistency on your behalf.
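A minimal sketch of such an edit table (Postgres; table and column names are illustrative):

```sql
-- One row per change to a mutable field; the main table keeps only current values.
CREATE TABLE document_edit (
    id          bigserial   PRIMARY KEY,
    document_id bigint      NOT NULL REFERENCES document(id),
    field       text        NOT NULL,            -- which column changed
    old_value   text,
    new_value   text,
    edited_by   bigint      NOT NULL REFERENCES app_user(id),
    edited_at   timestamptz NOT NULL DEFAULT now()
);

-- The main table's "updated at" can then be derived rather than stored:
SELECT max(edited_at) FROM document_edit WHERE document_id = 42;
```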
crazygringo
These are not decisions that should be taken solely by whoever is programming the backend.
They need to be surfaced to the product owner to decide. There may very well be reasons pieces of data should not be stored. And all of this adds complexity, more things to go wrong.
If the product owner wants to start tracking every change and by who, that can completely change your database requirements.
So have that conversation properly. Then decide either that it's not worth it and don't add any of these "extra" fields you "might" need, or decide it is, fully spec it out, and estimate how much additional time and effort it will take to do it as a proper feature. But don't do it as some half-built just-in-case "favor" to a future programmer who may very well have to rip it out.
On a personal project, do whatever you want. But on something professional, this stuff needs to be specced out and accounted for. This isn't a programming decision, it's a product decision.
reillyse
One thing I do quite frequently which is related to this (and is possibly a pattern in Rails) is to use timestamps in place of booleans.
So is_deleted would contain a timestamp representing the deleted_at time, for example. This means you can store more information for a small marginal cost. It helps that Rails will automatically let you use it as a boolean and will interpret a timestamp as true.
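In SQL terms, the same idea looks roughly like this (illustrative table name):

```sql
-- Instead of an is_deleted boolean, store when it happened; NULL means "not deleted".
ALTER TABLE document ADD COLUMN deleted_at timestamptz;

-- The boolean is recovered by checking for NULL:
SELECT * FROM document WHERE deleted_at IS NULL;   -- live rows only
```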
jfengel
I consider booleans a code smell. Not a bug, but a suggestion that I'm thinking about something the wrong way. I will probably want to replace it with something more meaningful in the future. It might be an enum, a subclass, a timestamp, a refactoring, or a million other things, but the boolean was probably the wrong thing to do even if I don't know it yet.
xp84
This seems at first like a controversial idea, but the more I think about it the more I like this thought technology. Merely the idea of asking myself if there's a better way to store a fact like that will potentially improve designs.
crdrost
The enum idea is often wise; also: for just an example that has probably occurred a hundred thousand times across the world in various businesses...
Original design: store a row that needs to be reported to someone, with an is_reported column that is boolean.
Problem: one day for whatever reason the ReporterService turns out to need to run two of these in parallel. Maybe it's that the reporting is the last step after ingestion in a single service and we need to ingest in parallel. Maybe it's that there are too many reports to different people and the reports themselves are parallelizable (grab 5 clients, grab unreported rows that foreign key to them, report those rows... whoops sometimes two processes choose the same client!)... Maybe it's just that these are run in Kubernetes and if the report happens when you're rolling pods then the request gets retried by both the dying pod and the new pod.
Alternative to boolean: unreported and reported records both live in the `foo` table and then a trigger puts a row for any new Foos into the `foo_unreported` table. This table can now store a lock timestamp, a locker UUID, and denormalize any columns you need (client_id) to select them. The reporter UPDATEs a bunch of rows reserving them, SELECTs whatever it has successfully reserved, reports them, then DELETEs them. It reserves rows where the lock timestamp IS NULL or is less than now minus 5 minutes, and the Reporter itself runs with a 5 minute timeout. The DB will do the barest amount of locking to make sure that two UPDATES don't conflict, there is no risk of deadlock, and the Boolean has turned into whether something exists in a set or not.
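Roughly, that shape looks like this (Postgres; names are illustrative, and the per-client batching described above is elided):

```sql
-- A trigger on foo (omitted here) inserts a row into foo_unreported for each new foo.
CREATE TABLE foo_unreported (
    foo_id    bigint PRIMARY KEY REFERENCES foo(id),
    client_id bigint NOT NULL,          -- denormalized so reporters can select by client
    locked_at timestamptz,
    locked_by uuid
);

-- Each reporter reserves work; stale locks older than the 5-minute timeout get stolen.
UPDATE foo_unreported
SET locked_at = now(), locked_by = :reporter_id        -- :reporter_id is a bind parameter
WHERE locked_at IS NULL OR locked_at < now() - interval '5 minutes';

-- Then SELECT the rows WHERE locked_by = :reporter_id, report them, and finally:
DELETE FROM foo_unreported WHERE locked_by = :reporter_id;
```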
A similar trick is used in the classic Python talk “Stop Writing Classes” by @jackdied where a version of The Game of Life is optimized by saying that instead of holding a big 2D array of true/false booleans on a finite gameboard, we'll hold an infinite gameboard with a set of (x,y) pairs of living cells which will internally be backed by a hashmap.
lud_lite
Booleans also force the true/false framing.
E.g. a field called userCannotLoginWithoutOTP.
Then in code "if not userCannotLoginWithoutOTP or otpPresent then..."
This may seem easy until you have a few flags to combine and check.
An enum called LoginRequirements with values Password, PasswordAndOTP is one less negation and easier to read.
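A sketch of that in Postgres terms (names illustrative):

```sql
CREATE TYPE login_requirement AS ENUM ('password', 'password_and_otp');

ALTER TABLE app_user
    ADD COLUMN login_requirement login_requirement NOT NULL DEFAULT 'password';

-- The check now reads positively, with no double negative:
SELECT login_requirement = 'password' OR :otp_present AS may_login
FROM app_user
WHERE id = :user_id;   -- :otp_present and :user_id are bind parameters
```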
noman-land
This is all well and good until you need to represent something that happened on Jan 1 1970 00:00 UTC.
jchw
32-bit UNIX timestamps are often signed, so you can actually go before that, but most UNIX timestamps are 64-bit now, which can represent a much larger range. And SQL datetime types might have a totally different range.
Not that it really matters; deleted_at times for your database records will rarely predate the existence of said database.
noman-land
It's not about the scale, it's that `if (0)` will evaluate to `false` in many languages.
BeFlatXIII
Leave it null for non-deleted items.
hadriendavid
« Anytime you store Boolean, a kitten dies » Nobody has ever said that but nobody wants any kitten to die so nobody has ever challenged me anytime I use that statement.
acedTrex
This one little change alone can bring such huge benefits later.
arjonagelhout
A little while back, I had a conversation with a colleague about sorting entries by "updated at" in the user interface, and to my surprise this was not added by the backend team.
Many of these "we are going to need it"s come from experience. For example in the context of data structures (DS), I have made many "mistakes" that I do correctly a second time. These mistakes made writing algorithms for the DS harder, or made the DS have bad performance.
Sadly, it's hard to transfer this underlying breadth of knowledge and intuition for making good tradeoffs. As such, a one-off tip like this is limited in its usefulness.
tcdent
Database schemas being perfect out of the gate has been replaced by reliable migrations.
If it's not data that's essential to serving the current functionality, just add a column later. `updated_at` doesn't have to be accurate for your entire dataset; just set it to `NOW()` when you run the migration.
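For example, a later migration along these lines (illustrative table name):

```sql
-- Backfill is best-effort: existing rows get the migration time; new writes keep it accurate.
ALTER TABLE document
    ADD COLUMN updated_at timestamptz NOT NULL DEFAULT now();
```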
phire
Sure, migrations are bearable (especially ones that only add columns).
But for the example of the "updated_at" column, or "soft delete" functionality, you often only find out you need it when the operations team suddenly discovers they need it on existing production rows, because something weird happened.
smithkl42
In C#-land, we just have it as a standard that ~every table inherits from `ITrackable`, and we wrote a little EF plugin to automatically update the appropriate columns.
public interface ITrackable { DateTime CreatedOn { get; set; } DateTime ModifiedOn { get; set; } }
Saves so much time and hassle.
SOLAR_FIELDS
“Reliable migrations” almost seems like an oxymoron. Migrations are complicated, difficult, and error prone. I think there’s a good takeaway here around good initial schema design practices: the less you have to morph your schema over time, the fewer of those risky migrations need to run.
tcdent
My experience over the last decade has been different.
Use a popular framework. Run it against your test database. Always keep backups in case something unforeseen happens.
Something especially trivial like adding additional columns is a solved problem.
imcritic
Still depends on what the data represent: you could get yourself in a storm of phone calls from customers if after your latest release there's now a weird note saying their saved document was last updated today.
"HOW DARE YOU MODIFY MY DOCUMENTS WITHOUT MY..."
SchemaLoad
Somewhat related, but I suggest having both the record's updated_at and some kind of "user-edit updated at". I've encountered issues where a data migration ends up touching records and bumping the updated_at, which shocks users: they see the UI reshuffle and think they've been hacked when records show as updated at a time they didn't touch them.
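Concretely, that might be two separate columns (illustrative names):

```sql
ALTER TABLE document
    ADD COLUMN updated_at         timestamptz,  -- bumped by any write, including migrations
    ADD COLUMN content_updated_at timestamptz;  -- bumped only by user-visible edits; drives UI sorting
```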
XorNot
I mean this is what audit logs are for I'd say: generally speaking you want to know what was changed, by who and why.
So really you probably just want a reference to the tip of the audit log chain.
shaunofthedev
The perils of UI design wagging the horse.
I like the heuristics described here. However if these things aren't making it into a product spec where appropriate, then I smell some dysfunction that goes beyond what's being stored by default.
Product need (expressed as spec, design, etc) should highlight the failure cases where we would expect fields like these to be surfaced.
I'd hope that any given business shouldn't need someone with production database access on hand to inform as to why/when/how 'thing' was deleted. Really we'd want the user (be it 'boss' or someone else) to be able to access that information in a controlled manner.
"What information do we need when something goes wrong?". Ask it. Drill it. Ask it again.
That said, if you can't get those things, this seems a fine way to be pragmatic.
Hasnep
"Wagging the horse" is a great turn of phrase, better than "putting the cart before the dog."
al_borland
I agree with this as written, and think it's important to have some degree of forethought when building out the DB to plan for future growth and needs.
That said, the monkey's paw of this would be someone reading it and deciding they should capture and save all possible user data, "just in case", which becomes a liability.
im_anon_on_hn
As an acronym, it's easily misremembered as "You AREN'T gonna read it" (based on the popularity of YAGNI), and then the opposite advice spreads...
000ooo000
How about: IIPTWNIRTIoSoDIQMBSNRTL
(It Is Probable That While Not Immediately Required The Implementation of Storage of Data In Question May Be Simpler Now Rather Than Later)
I've gone ahead and included additional detail in the acronym in the event that the clarity is required later, as this would be difficult to retrofit into a shorter, more-established acronym.
willismichael
Why do I feel like that should be a cheat code in a FPS game?
__float
YRIS: you'll read it someday
guy4261
The author is very kind! In practice, I've often seen only the CR/CRU of CRUD get implemented.
For example: as a company aspires to launch its product, one of the first features implemented in any system is to add a new user. But when the day comes when a customer leaves, suddenly you discover no one implemented off-boarding and cleanup of any sort.
evanmoran
Just curious, how do people feel about this general style of soft deletes currently? Do people still use them in production, or prefer to just delete fully, or alternatively move deleted rows to a separate table / schema?
I find the complexity still feels awkward enough that it makes me wonder if deleted_at is worth it. Maybe there are better patterns out there to make this cleaner, like triggers to prevent deletion, or something else?
As for the article, I couldn't agree more on having timestamps / user ids on all actions. I'd even suggest adding updated_by to the list.
j_w
Financial world: records have a "close" or "expire" date which is then purged after some period of time. A deletion doesn't just happen, the record is updated to be "closed" or "expired" and some time after that it's deleted.
Something like a loan could live in a production environment for well over a year after closing, while an internal note may last just a month.
swagasaurus-rex
I think soft deletes using timestamptz are a good thing.
Deleting rows directly could mean you're breaking references. For example, say you have a product that the seller wants to delete. Well, what happens if customers have purchased that product? You still want it in the database, and you still want to fulfill the orders placed.
Your backend can selectively query for products, filtering out anything with deleted_at set for customer-facing queries, but showing all products when looking at purchase history.
There are times when deleting rows makes sense, but that's usually because you have a write-heavy table that needs clearing. Yes, soft deletes require being careful with WHERE clauses filtering out deleted rows, but that's a feature, not a bug.
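A sketch of the two kinds of queries (illustrative schema; :customer_id is a bind parameter):

```sql
-- Customer-facing catalog: hide soft-deleted products.
SELECT * FROM product WHERE deleted_at IS NULL;

-- Purchase history: show everything the customer bought, deleted or not.
SELECT ol.order_id, p.name, p.deleted_at
FROM order_line ol
JOIN product p ON p.id = ol.product_id
WHERE ol.customer_id = :customer_id;
```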
pmontra
> what happens if customers have purchased that product? You still want it in the database, and you still want to fulfill the orders placed.
You might still want to show those customers their purchase history, including what they bought 25 years ago. For example, my ISP no longer offers the 10 Mb/s fiber-optic product I bought in 2000, because it was superseded by 100 Mb/s products and then by 1 Gb/s ones. It's also not my ISP anymore, but I still use it for the SIM in my phone, which has accumulated a number of product changes over the years.
And think about the inventory of e-shops with a zillion products and the archive of their past orders. Maybe they keep the last few years, maybe everything until the DB gets too large.
zie
If you have a good audit log, it really doesn't matter. You can always restore it if need be.
If you have no audit log (or a bad one), like lots of apps, then you have to care a lot.
Personally, I just implement a good audit log and then I just delete with impunity. Worst case scenario, someone (maybe even me) made a mistake and I have to run undo_log_audit() with the id of the audit log entry I want to put back. Nearly zero hassle.
The upside, when something goes wrong, I can tell you who, what and when. I usually have to infer the why, or go ask a human, but it's not usually even difficult to do that.
dml2135
Can you share more about what makes a good audit log? My company doesn’t currently have one and I’m a little lost on where to start.
Should this be at the application code level, or the ORM, or the database itself?
zie
That depends on where the data you need to keep track of is and your architecture. The important thing is, you want your audit log to be able to tell you:
* Who
* What
* When
* Ideally Why
For any change in the system. Also, when storing the audit log, take into account that you might need to undo things that happened (not just deletes). For instance, maybe some process went haywire and inserted 100k records it wasn't supposed to. With a good audit log, you should be able to run something like undo_log_audit(rec1, rec100k) and it will do the right thing. I'm not saying that code needs to exist day 1, but you should take into account the ability to do that when designing it.
Also, you need to take into account your regulatory environment. Sometimes it's very, very important that your audit logs are write-once and read-only afterwards, are stored off machine, etc. Other times it's just for internal use and you can be a little more lax about the data integrity of your audit logs.
Our app is heavily database-centric. We push into the DB the current unix user, the current PID of the process connecting to the DB, etc. (also, every user has their own login to the DB, so it handles our authentication too). This means our database (Postgres) does all of the audit logging for us. There are plenty of Postgres audit logging extensions; we run two of them. One is trigger-based, creating entries in a log_audit table (which the undo_log_audit() code uses, along with most reporting use cases), and a second one writes out to syslog (so we can move logs off machine and keep them read-only). We are in a regulated industry that gets audited regularly, however. Not everyone needs the same level of audit logging.
You need to figure out how you can answer the above questions given your architecture. Normally the "Why" question is hard to answer without talking with a human, but unless you have the who, what and when, it's nearly impossible to even get to the Why part of the question.
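A minimal sketch of a trigger-based audit table in that spirit (Postgres; names and columns are illustrative, not the commenter's actual schema):

```sql
CREATE TABLE log_audit (
    id         bigserial   PRIMARY KEY,
    table_name text        NOT NULL,
    operation  text        NOT NULL,                         -- 'INSERT' / 'UPDATE' / 'DELETE'
    old_row    jsonb,                                        -- what (before)
    new_row    jsonb,                                        -- what (after)
    changed_by text        NOT NULL DEFAULT current_user,    -- who
    changed_at timestamptz NOT NULL DEFAULT now()            -- when
);

CREATE OR REPLACE FUNCTION audit_row() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'DELETE' THEN
        INSERT INTO log_audit (table_name, operation, old_row)
        VALUES (TG_TABLE_NAME, TG_OP, to_jsonb(OLD));
        RETURN OLD;
    ELSIF TG_OP = 'INSERT' THEN
        INSERT INTO log_audit (table_name, operation, new_row)
        VALUES (TG_TABLE_NAME, TG_OP, to_jsonb(NEW));
        RETURN NEW;
    ELSE  -- UPDATE
        INSERT INTO log_audit (table_name, operation, old_row, new_row)
        VALUES (TG_TABLE_NAME, TG_OP, to_jsonb(OLD), to_jsonb(NEW));
        RETURN NEW;
    END IF;
END;
$$ LANGUAGE plpgsql;

-- Attach to each audited table (PG 11+; use EXECUTE PROCEDURE on older versions):
CREATE TRIGGER document_audit
AFTER INSERT OR UPDATE OR DELETE ON document
FOR EACH ROW EXECUTE FUNCTION audit_row();
```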
jimbokun
Probably application level in most cases as those other levels probably don’t have all the information you want to include.
refset
> Maybe there are better patterns out there to make this cleaner
SQL:2011 temporal tables are worth a look.
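For reference, the system-versioned flavor looks roughly like this in dialects that support it (SQL Server syntax shown; MariaDB has a similar WITH SYSTEM VERSIONING clause, and the names here are illustrative):

```sql
CREATE TABLE product (
    id         int PRIMARY KEY,
    name       nvarchar(100),
    valid_from datetime2 GENERATED ALWAYS AS ROW START NOT NULL,
    valid_to   datetime2 GENERATED ALWAYS AS ROW END   NOT NULL,
    PERIOD FOR SYSTEM_TIME (valid_from, valid_to)
) WITH (SYSTEM_VERSIONING = ON);

-- Ask for the row as it looked at a point in time:
SELECT * FROM product FOR SYSTEM_TIME AS OF '2025-01-01';
```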
imcritic
Always soft-deletion first. Then it gets exported to a separate archive, and only then, after some time, may it be fully deleted from the original database.
hoipaloi
Counter point: "Soft Deletion Probably Isn't Worth It" https://brandur.org/soft-deletion
jimbokun
But auditing probably is.
sgarland
> But I've never heard someone complain about a table having too many timestamps.
I do. Each one is 8 bytes. At the billions of rows scale, that adds up. Disk is cheap, but not free; more importantly, memory is not cheap at all.
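(Back-of-envelope: two 8-byte timestamp columns across 2 billion rows is roughly 32 GB of raw column data, before any indexes on them.)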
alganet
*_at and *_by fields in SQL are just denormalization + pruning patterns consolidated, right?
Do the long walk:
Make the schema fully auditable (one record per edit) and the tables normalized (it will feel weird). Then suffer with it, and discover that normalization leads to a performance decrease.
Then discover that pruned auditing records are a good middle ground. Just the last edit and by whom is often enough (ominous foreshadowing).
Fail miserably by discovering that a single missing auditing record can cost a lot.
Blame database engines for making you choose. Adopt an experimental database with full auditing history. Maybe do incremental backups. Maybe both, since you have grown paranoid by now.
Discover that it is not enough again. Find that no silver bullet exists for auditing.
Now you can make a conscious choice about it. Then you won't need acronyms to remember stuff!