Data Branching for Batch Job Systems

PaulHoule

An interesting pattern that I've thought about in "intelligent systems" is "patching" which could be applied at various stages of a process.

If you look at the average training set used in competitive evaluations, you will find examples in it that are just plain wrong, which puts an upper limit on both the evaluation scores and the real-world performance of anything based on that data.

Very occasionally someone does the hard work to improve the training and eval set and, if the world were efficient, this would be the new data set everyone uses.

In real life you are getting more data all the time and you need a stable way to keep your data set "patched" as new data comes in. Similarly, an AI step later in the process needs to (i) have a human override so it can get things right for the consumer and (ii) remember that override so as not to waste the human's time or wear them out emotionally.
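A minimal sketch of that "remembered override" idea, assuming records carry a stable id. The names `Record`, `apply_overrides`, and the example labels are hypothetical, made up for illustration. The point is only that corrections live apart from the raw data, so they get re-applied each time a new batch arrives:

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Record:
    record_id: str  # assumed to be stable across re-ingestion
    text: str
    label: str


def apply_overrides(records, overrides):
    """Return a new list where human-corrected labels replace the raw ones.

    `overrides` maps record_id -> corrected label. Records without an
    override pass through unchanged.
    """
    return [
        replace(r, label=overrides[r.record_id]) if r.record_id in overrides else r
        for r in records
    ]


# The patch is stored separately from the raw data, so it survives new batches.
overrides = {"a1": "spam"}

batch_1 = [Record("a1", "win a prize", "ham"), Record("a2", "lunch?", "ham")]
batch_2 = batch_1 + [Record("a3", "meeting notes", "ham")]  # more data arrives

patched_1 = apply_overrides(batch_1, overrides)
patched_2 = apply_overrides(batch_2, overrides)  # same fix, no human asked twice
```

The human corrects `a1` once; every later run picks the correction up from `overrides` instead of asking again.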

ragulpr

This pattern is really meaningful conceptually but the tricky thing is to not create a mess in the process.

If it's too easy to branch, people will do so and the (knowledge) economies of scale disappear (and we'll have a mess).

If the common data definition is too hard to branch from, no experiments will happen (slow).

I think most tech for this seems to make it too easy, and in the process injects a bunch of dependencies that make it slow and harder to access. May have changed since I last looked.

I found that the simple pattern of versioned paths/table names such as `s3://mybucket/mystage/version=42/` or `my_table_v42` puts a high enough evolutionary cost on branching (as consumers need to explicitly adapt) while avoiding the costs associated with using special tech (legacy, lock-in, dependencies).

It's also searchable on GitHub/Slack/etc. if done right.
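A rough sketch of the versioned-path convention, reusing the bucket, stage, and table names from the examples above. The helper names and the `INPUT_VERSION` constant are hypothetical. The idea is that a consumer pins a version explicitly, so moving to a new branch is a visible, greppable code change:

```python
def versioned_path(bucket: str, stage: str, version: int) -> str:
    """Build a path following the `.../version=N/` convention."""
    return f"s3://{bucket}/{stage}/version={version}/"


def versioned_table(name: str, version: int) -> str:
    """Build a table name following the `name_vN` convention."""
    return f"{name}_v{version}"


# The consumer pins the version it reads. Adopting version 43 means editing
# this line, which is the deliberate "cost" of branching.
INPUT_VERSION = 42

input_path = versioned_path("mybucket", "mystage", INPUT_VERSION)
input_table = versioned_table("my_table", INPUT_VERSION)

print(input_path)
print(input_table)
```

Because the version appears literally in the path or table name, searching for `version=42` or `my_table_v42` finds every consumer that still depends on it.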

buryat

Apache Iceberg has supported branching and tagging since some early versions: https://iceberg.apache.org/docs/1.4.0/branching/#overview

And the broader name for what the author is describing is the Write-Audit-Publish pattern, where data gets written into a branch first, audited/checked, and then the main branch gets replaced with the new one, effectively publishing the updated dataset using a single command. https://www.tabular.io/apache-iceberg-cookbook/data-engineer...
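To make the three steps concrete, here is a toy in-memory sketch of Write-Audit-Publish. It does not use Iceberg's API. `BranchedDataset`, `write_audit_publish`, and the `staging` branch name are all hypothetical stand-ins for whatever the table format provides:

```python
class BranchedDataset:
    """Toy dataset with named branches; 'main' is what consumers read."""

    def __init__(self, rows):
        self.branches = {"main": list(rows)}

    def write(self, branch, rows):
        # Write: new rows land on a separate branch; main is untouched.
        self.branches[branch] = self.branches["main"] + list(rows)

    def audit(self, branch, check):
        # Audit: every row on the branch must pass the check.
        return all(check(row) for row in self.branches[branch])

    def publish(self, branch):
        # Publish: a single swap makes the audited branch the new main.
        self.branches["main"] = self.branches.pop(branch)


def write_audit_publish(dataset, rows, check, branch="staging"):
    """Return True if the rows were published, False if the audit failed."""
    dataset.write(branch, rows)
    if not dataset.audit(branch, check):
        del dataset.branches[branch]  # discard the bad branch
        return False
    dataset.publish(branch)
    return True


def non_negative(row):
    return row["amount"] >= 0


ds = BranchedDataset([{"id": 1, "amount": 10}])
rejected = write_audit_publish(ds, [{"id": 2, "amount": -5}], non_negative)
accepted = write_audit_publish(ds, [{"id": 3, "amount": 7}], non_negative)
```

The failed batch never becomes visible on `main`; readers only ever see the state before or after a successful publish.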

prpl

Iceberg has branching but it doesn’t really have great “merge” semantics, though the semantics otherwise would work well for batch jobs.

What I think I’d like is to say “there are only AppendFilesCommits in these two branches” and merge the two, or otherwise look at the operations to determine if the two can be fast-forwarded.
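A sketch of what that check could look like, modelled on plain Python lists rather than on Iceberg's actual snapshot metadata. The `Commit` type, the operation strings, and both functions are hypothetical:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    commit_id: str
    operation: str  # e.g. "append", "overwrite", "delete" (illustrative values)
    files: tuple


def common_prefix_len(a, b):
    """Number of leading commits the two histories share."""
    n = 0
    for x, y in zip(a, b):
        if x.commit_id != y.commit_id:
            break
        n += 1
    return n


def can_fast_forward(target, source):
    """True if target's whole history is a prefix of source's history."""
    return common_prefix_len(target, source) == len(target)


def merge_append_only(target, source):
    """Fast-forward if possible; otherwise merge only if every commit made
    since the branches diverged is an append."""
    if can_fast_forward(target, source):
        return list(source)
    n = common_prefix_len(target, source)
    diverged = list(target[n:]) + list(source[n:])
    if any(c.operation != "append" for c in diverged):
        raise ValueError("non-append commit since divergence; manual merge needed")
    # Appends touch disjoint new files, so replaying them in sequence is safe.
    return list(target) + list(source[n:])


base = [Commit("c1", "append", ("f1",))]
main = base + [Commit("c2", "append", ("f2",))]
branch = base + [Commit("c3", "append", ("f3",))]
merged = merge_append_only(main, branch)
```

The assumption doing the work here is that append commits only add new files, so two append-only histories cannot conflict. Anything that rewrites or deletes files falls back to a manual decision.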

larrydavidsdad

you seen this? https://www.doltdb.com

philsnow

I hadn’t seen that before and I can’t speak to the quality of the project, but I wanted to call out the first section in the readme [0] for being perfectly clear and succinct:

> Git versions files. Dolt versions tables. It's like Git and MySQL had a baby.

> We also built DoltHub, a place to share Dolt databases. We host public data for free. If you want to host your own version of DoltHub, we have DoltLab. If you want us to run a Dolt server for you, we have Hosted Dolt. If you are looking for a Postgres version of Dolt, we built DoltgreSQL. Warning, it's early Alpha. Dolt is production-ready.

[0] https://github.com/dolthub/dolt?tab=readme-ov-file#dolt-is-g...
