How we decreased GitLab repo backup times from 48 hours to 41 minutes
154 comments
June 6, 2025
divbzero
rubit_xxx21
[flagged]
divbzero
This is a valid concern, but checking for uniqueness with a hash is fairly straightforward, and the new implementation reuses a strset struct that has been in use since 2020.
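For a sense of the two patterns involved, here is a rough, self-contained sketch. It is not git's actual code: the tiny chained hash set below merely stands in for git's strset, the quadratic scan stands in for the old behavior, and the ref names are made up.

    #include <stdio.h>
    #include <string.h>
    #include <stdlib.h>
    #include <stdbool.h>

    /* Old pattern: check each new ref against everything seen so far. */
    static bool seen_quadratic(const char **refs, size_t n, const char *ref)
    {
        for (size_t i = 0; i < n; i++)          /* O(n) per ref, O(n^2) overall */
            if (strcmp(refs[i], ref) == 0)
                return true;
        return false;
    }

    /* New pattern: one hash-set insertion per ref (stand-in for git's strset). */
    #define NBUCKETS 4096
    struct node { const char *str; struct node *next; };
    static struct node *buckets[NBUCKETS];

    static size_t hash(const char *s)
    {
        size_t h = 5381;
        while (*s)
            h = h * 33 + (unsigned char)*s++;
        return h % NBUCKETS;
    }

    /* Returns true if str was newly added, false if it was already present.
     * A real implementation would copy the key; here the strings are literals. */
    static bool set_add(const char *str)
    {
        size_t b = hash(str);
        for (struct node *n = buckets[b]; n; n = n->next)
            if (strcmp(n->str, str) == 0)
                return false;
        struct node *n = malloc(sizeof(*n));
        n->str = str;
        n->next = buckets[b];
        buckets[b] = n;
        return true;
    }

    int main(void)
    {
        const char *refs[] = { "refs/heads/main", "refs/tags/v1.0", "refs/heads/main" };
        size_t n = sizeof(refs) / sizeof(refs[0]);

        for (size_t i = 0; i < n; i++)
            if (seen_quadratic(refs, i, refs[i]))
                printf("quadratic scan: duplicate %s\n", refs[i]);

        for (size_t i = 0; i < n; i++)
            if (!set_add(refs[i]))
                printf("hash set: duplicate %s\n", refs[i]);
        return 0;
    }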
herpderperator
If the requirement is to check uniqueness, what assumptions could possibly cause a bug? In this case, why does it matter if the uniqueness is tested with a nested for loop or with a map? There are many equivalent ways to check uniqueness, some faster than others.
rubit_xxx22_
[flagged]
LgWoodenBadger
IME, it has always turned out to be the correct decision to eliminate any n^2 operation in anything I’ve written.
I don’t write exotic algorithms, but it’s always astounding how small n needs to be to become observably problematic.
EdwardCoffin
Bruce Dawson says: "I like to call this Dawson's first law of computing: O(n^2) is the sweet spot of badly scaling algorithms: fast enough to make it into production, but slow enough to make things fall down once it gets there."
https://bsky.app/profile/randomascii.bsky.social/post/3lk4c6...
hinkley
I made the “mistake” in an interview of equating two super-quadratic solutions. What I meant was what Dawson meant. It doesn’t matter because they’re both too ridiculous to even discuss.
dieortin
They’re too ridiculous… unless a more optimal solution does not exist
paulddraper
The second law is that O(n * log n) is for practical intents and purposes O(n).
sn9
Skiena has a great table in his algorithms book mapping time complexity to hypothetical times for different input sizes.
For n of 10^9, where lgn takes 0.03 us and n takes 1 s, nlgn takes 29.9 s and n^2 takes 31.7 years.
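For reference, those figures follow from assuming (as that table does, if I recall correctly) one operation per nanosecond:

    \begin{align*}
    \lg(10^9) &\approx 30 \text{ ops} \approx 0.03\ \mu\text{s} \\
    10^9 \text{ ops} &= 10^9\ \text{ns} = 1\ \text{s} \\
    10^9 \lg(10^9) &\approx 3 \times 10^{10}\ \text{ns} \approx 30\ \text{s} \\
    (10^9)^2 &= 10^{18}\ \text{ns} = 10^9\ \text{s} \approx 31.7\ \text{years}
    \end{align*}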
EdwardCoffin
To be clear though, that isn't his second law, at least as of two months ago, according to https://bsky.app/profile/randomascii.bsky.social/post/3lk4c6...
hinkley
But sometimes a big enough C can flip which solution helps you hit your margins.
koala_man
Good call. O(N^2) is the worst time complexity because it's fast enough to be instantaneous in all your testing, but slow enough to explode in prod.
I've seen it several times before, and it's exactly what happened here.
david422
We just had this exact problem. Tests ran great, production slowed to a crawl.
esprehn
Modern computers are pretty great at scanning small blocks of memory repeatedly, so n^2 can be faster than the map-based alternative for small N.
I spent a lot of time fixing n^2 in blink, but there were some fun surprises:
https://source.chromium.org/chromium/chromium/src/+/main:thi...
For large N without a cache, :nth-child matching would be very slow, doing n^2 scans of the siblings to compute the index. On the other hand, for small sibling counts it turned out the cache overhead was noticeably worse than just doing the n^2 scan. (See the end of the linked comment.)
This turns out to be true in a lot of surprising places, both where linear search beats constant time maps, and where n^2 is better than fancy algorithms to compensate.
Memory latency and instruction timing are the gotchas of many fun algorithms in the real world.
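As a rough illustration of the "linear search beats a map for small N" case being described (the crossover point is workload- and hardware-dependent, so treat this as a sketch rather than a benchmark):

    #include <stdio.h>
    #include <stdbool.h>

    /* Membership test by linear scan. For small n the whole array fits in a
     * cache line or two, and there is no hashing or pointer chasing, so this
     * often beats a hash-table lookup despite being O(n) per query. */
    static bool contains(const int *vals, size_t n, int needle)
    {
        for (size_t i = 0; i < n; i++)
            if (vals[i] == needle)
                return true;
        return false;
    }

    int main(void)
    {
        int small[] = { 3, 1, 4, 1, 5, 9, 2, 6 };   /* small, cache-resident N */
        printf("%d\n", contains(small, 8, 5));      /* prints 1 */
        return 0;
    }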
throwaway2037
> where linear search beats constant time maps
Can you give an example? You said lots of good things in your post, but I'm struggling to believe this one. Also, it would help to see some wall clock times or real world impact.
parthdesai
My rule of thumb for 80%-90% of the problems is, if you need complicated algorithm, it means your data model isn't right. Sure, you do need complicated algorithms for compilers, db internals, route planning et all, but all things considered, those are minority of the use cases.
anttihaapala
This is not a complicated algorithm. A hash map (dictionary) or a hash set is how you would always do deduplication in Python, because it is easiest to write / least keystrokes anyway. That is not the case in C though, as it is much easier to use arrays and nested loops instead of hash maps.
throwaway2037
> That is not the case in C though, as it is much easier to use arrays and nested loops instead of hash maps.
I am confused. There are plenty of open source, fast hash map impls in C.
paulddraper
This isn't knapsack. This is a dict lookup.
LtWorf
I wrote a funny algorithm to group together words that end the same way, so I could write them once in my binary wordlist file, since there is an array that points to the start character and a \0 to end the word. My initial solution was O(n²) but it was too slow on a real wordlist, so I had to come up with something better. In the end I just sort the list with quicksort, but on the reversed words, and then the groupable ones end up near each other.
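A minimal sketch of that reverse-and-sort trick, with a made-up word list (the real wordlist format with offsets and \0 terminators is omitted); it replaces the O(n²) all-pairs comparison with an O(n log n) sort:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Reverse a string in place. */
    static void reverse(char *s)
    {
        for (size_t i = 0, j = strlen(s); i + 1 < j; i++, j--) {
            char t = s[i];
            s[i] = s[j - 1];
            s[j - 1] = t;
        }
    }

    static int cmp(const void *a, const void *b)
    {
        return strcmp(*(char *const *)a, *(char *const *)b);
    }

    int main(void)
    {
        char w0[] = "running", w1[] = "sing", w2[] = "ring", w3[] = "cat";
        char *words[] = { w0, w1, w2, w3 };
        size_t n = sizeof(words) / sizeof(words[0]);

        for (size_t i = 0; i < n; i++)
            reverse(words[i]);                      /* "gninnur", "gnis", ... */
        qsort(words, n, sizeof(words[0]), cmp);     /* shared suffixes now adjacent */
        for (size_t i = 0; i < n; i++) {
            reverse(words[i]);                      /* restore original spelling */
            printf("%s\n", words[i]);               /* running, ring, sing, cat */
        }
        return 0;
    }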
SAI_Peregrinus
I'd say the exception is when `n` is under about 10 and is counting some sort of hardware-constrained thing (e.g. some operation over all CAN interfaces present on an OBDII connector can be O(n^2), since n will always be between 1 and 4). If you wouldn't have to physically replace hardware for `n` to increase, you really need to avoid n^2 operations. And even then, consider them carefully, perhaps explicitly failing if `n` gets too big, to allow noticing that rework is needed before new hardware hits the field.
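A minimal sketch of that explicit-failure idea (the constant and function names here are hypothetical):

    #include <assert.h>

    #define MAX_CAN_INTERFACES 4   /* hypothetical hardware-imposed bound */

    /* The O(n^2) pairwise pass is only acceptable because n is hardware-bounded;
     * fail loudly if that assumption ever breaks instead of silently degrading. */
    void check_all_pairs(int n)
    {
        assert(n >= 1 && n <= MAX_CAN_INTERFACES);
        for (int i = 0; i < n; i++)
            for (int j = i + 1; j < n; j++) {
                /* ... pairwise operation on interfaces i and j ... */
            }
    }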
echelon
> perhaps explicitly failing if `n` gets too big
That's the problem. A lot of these quadratic time algorithms don't set limits.
Even 'n!' is fine for small 'n'. Real production use cases don't have small 'n'.
masklinn
> Real production use cases don't have small 'n'.
Real production use cases absolutely have small n. You don't hear about them, because it's very hard for them to cause issues. Unless the use case changes and now the n is not small anymore and nobody noticed the trap.
haiku2077
I have an app that's been running an O(n^2) algorithm in "production" (free open source app used by various communities) for about half a year now.
It's been fine because "n" is "number of aircraft flying in this flight simulator" - and the simulator's engine starts to fail above around 2000 anyway. So even in the worst case it's still going to run within milliseconds.
lassejansen
Or, phrased differently, if n has an upper limit, the algorithm is O(1).
vlovich123
I have an n^3 operation that's currently a huge bottleneck at only 10k elements. Not sure how to fix it.
SoftTalker
Cool discovery but the article could have been about 1/10 as long and still communicated effectively. At least they didn't post it as a video, so it was easy to skim to the important details.
kokada
Yes. I read the whole article thinking that it must have been generated by an LLM; the style, at least, resembles it.
ruuda
That was also my thought.
davidron
For those who haven't read the article yet: scroll down to the flame graph and start reading until it starts talking about backporting the fix. Then stop.
wgjordan
"How we decreased reading 'How we decreased GitLab repo backup times from 48 hours to 41 minutes' times from 4.8 minutes to 41 seconds"
1oooqooq
it could have been longer. I still don't know why they were doing backup bundles with two refs :)
Arcuru
48 hours is a crazy amount of time to spend just to compress a git folder, it's only a couple GB. 41 minutes still seems like quite a long time.
Why aren't they just snapshotting and archiving the full git repo? Does `git bundle` add something over frequent ZFS backups?
bombela
> Be aware that even with these recommendations, syncing in this way has some risk since it bypasses Git’s normal integrity checking for repositories, so having backups is advised. You may also wish to do a git fsck to verify the integrity of your data on the destination system after syncing.
https://git-scm.com/docs/gitfaq#_transfers
It doesn't tell you how to make a backup safely though.
On a personal scale, Syncthing and Btrfs snapshots work plenty good enough. It's as fast as the storage/network too.
nightfly
Syncthing is the only way I've ever corrupted a git repo before
hobofan
I think that's why they specified the "BTRFS snapshots" part. Yes, directly syncing a .git directory seems like a recipe for disaster given how often I've seen individual files lag behind during sync, but I guess with BTRFS snapshots one can ensure that only a consistent view of a git directory is being backed up and synced.
unsnap_biceps
ZFS snapshots are difficult to replicate offsite to non-ZFS targets, like, say, an S3 bucket.
That said, there's another lesser-known feature that bundles help out with when used with `git clone --bundle-uri`. The client can specify a location for a bundle, or the server can send the client the bundle location in the clone results; the client can then fetch the bundle, unpack it, and update the delta via the git server. That's a lot lighter weight on the server for cloning large repos, and a ton faster for the client for initial clones.
benlivengood
I think if you want consistent snapshot backups on non-zfs destinations the safest thing is to clone the snapshot and rsync from the clone. Not a single-step operation but preserves the atomicity of the snapshot.
EDIT: you could also rsync from a .zfs snapshot directory if you have them enabled.
nightfly
ZFS can send to file or whatever you want to pipe to, you can have incremental sends, and if you convert to bookmarks on the sender you don't have to keep the historical data after you send it
AtlasBarfed
... so they added caching to things that should have been cached?
... is this really the way people "back up" git repos? I mean, it is git, so isn't there some way to mirror changes to the repo in another repo and just use ZFS / snapshots / backup software / etc to do that? It's a distributed version control system. Just make sure the version control information is ... distributed?
bob1029
I'm confused why you wouldn't simply snapshot the block-level device if the protocol of the information on top is going to cause this much headache. Quiescing git operations for block level activity is probably not trivial, but it sounds like an easier problem to solve to me.
This is the approach I've taken with SQLite in production environments. Turn on WAL and the problem gets even easier to solve. Customer configures the VM for snapshots every X minutes. Git presumably doesn't have something approximating a WAL, so I understand the hesitation with this path. But, I still think the overall strategy is much more viable and robust to weird edges within git.
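For the SQLite side of that comparison, a minimal sketch of what "turn on WAL" looks like through SQLite's C API (the database path is made up); the idea is that a snapshot catching the main file plus its -wal file restores like a power loss, which SQLite is designed to recover from:

    #include <stdio.h>
    #include <sqlite3.h>

    int main(void)
    {
        sqlite3 *db;
        if (sqlite3_open("app.db", &db) != SQLITE_OK) {
            fprintf(stderr, "open failed: %s\n", sqlite3_errmsg(db));
            return 1;
        }

        /* Switch the journal to write-ahead logging; the setting is persistent,
         * so it only needs to succeed once per database file. */
        char *err = NULL;
        if (sqlite3_exec(db, "PRAGMA journal_mode=WAL;", NULL, NULL, &err) != SQLITE_OK) {
            fprintf(stderr, "pragma failed: %s\n", err);
            sqlite3_free(err);
        }

        sqlite3_close(db);
        return 0;
    }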
lbotos
> Git presumably doesn't have something approximating a WAL, so I understand the hesitation with this path.
Bingo. One of the worst problems is helping a client piece back together a corrupted repo when they are using snapshots. Check my profile to see how I know. :)
It's usually an OMG down scenario, and then you are adding in the "oh no, now the restore is corrupted."
It's fixable but it's definitely annoying.
amtamt
> This is the approach I've taken with SQLite in production environments. Turn on WAL and the problem gets even easier to solve.
A few months back a better solution was provided by SQLite: sqlite3_rsync
nonameiguess
Gitlab isn't just a managed service. They release the software for self-hosted instances as well. There is no guarantee or requirement that users all run Gitlab on the same filesystem or even a filesystem that supports block level snapshots at all. Presumably, they want a universal backup system that works for all Gitlabs.
bob1029
I've never heard of a .git folder that spanned multiple filesystems. It sounds like we are now conflating the git workspace with everything else in the product.
There are system requirements that a customer would be expected to adhere to if they wanted a valid enterprise support contract with one of these vendors.
to11mtm
I think GP's point is that the filesystems used by someone self-hosting gitlab may not be the same as what gitlab themselves are using.
File systems can be weird. Sometimes the OS can be weird and fsync type calls may not do what you expect. At least at one point MacOS fsync didn't behave the same way as Linux (i.e. Linux should ensure the write is truly done and not just in cache so long as the drive isn't lying). [0]
> There are system requirements that a customer would be expected to adhere to if they wanted a valid enterprise support contract with one of these vendors.
Gitlab has a community edition. Not handling data well would be bad for their public image.
nonameiguess
I can't reply to the other reply for some reason, but what they said is indeed what I meant. gitlab.com might be running their Gitaly servers on btrfs or zfs or lvm volumes or whatever, but other customers may be using ext2. Gitlab the company could require customers to only run Gitaly on a specific filesystem, but up to now they never have, and it would be pretty shitty to suddenly change their minds after a decade-plus of establishing one expectation. Whoever the developer is who submitted the patch to upstream Git and got a technical blog post out of it has absolutely no power to dictate contract terms to enterprise customers personally, and instead did what is actually in their power to do.
hiddew
"fixed it with an algorithmic change, reducing backup times exponentially"
If the backup times were O(n^2), are they now O(n^2 / 2^n)? I would guess not.
umanwizard
This is not the precise mathematical definition of exponential, but rather the colloquial one, where it just means "a lot".
cvoss
You shouldn't use a word that can carry a precise mathematical meaning in a sentence that literally uses mathematical notation in order to speak precisely and then expect readers not to interpret the word in the precise mathematical way.
IshKebab
You should if you expect your readers to be normal humans who understand obvious context, and not pedantic HN readers who understand obvious context but delight in nit-picking it anyway.
blharr
I somewhat agree, but for lack of a better word, what would you use? Quadratically doesn't have the same punch
sneak
“Words mean things.”
If you can’t agree with this, then you shouldn’t be speaking or writing, IMO.
Those who argue that words that mean different things are actually equivalent have no business dealing with language.
sn9
The algorithmic complexity went down in the function they patched (a 6x improvement in their benchmark), but in the context of how they were using that function the overall impact was much larger (down to about 1% of the original time), which is plausibly "exponential" (and figuring out the actual complexity is neither relevant nor an economic use of time).
csnweb
If you replace an n^2 algorithm with a log(n) lookup you get an exponential speed up. Although a hashmap lookup is usually O(1), which is even faster.
ryao
That is not true unless n^C / e^n = log(n) where C is some constant, which it is not. The difference between log(n) and some polynomial is logarithmic, not exponential.
csnweb
But if you happen to have n=2^c, then an algorithm with logarithmic complexity only needs c time. That's why this is usually referred to as an exponential speedup in complexity theory, just like going from O(2^n) to O(n). More concretely, if the first algorithm needs 1024 seconds, the second one will need only 10 seconds in both cases, so I think it makes sense.
wasabi991011
It depends if you consider "speedup" to mean dividing the runtime or applying a function to the runtime.
I.e. you are saying an f(n) speedup means T(n)/f(n), but others would say it means f(T(n)) or some variation of that.
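Spelled out (my phrasing of the distinction, not the commenter's):

    % Ratio reading: an f(n) speedup divides the running time.
    \[ T_{\text{new}}(n) = \frac{T_{\text{old}}(n)}{f(n)} \]
    % Functional reading: an "exponential speedup" relates the two running
    % times through an exponential, e.g. 2^n -> n, or n -> log2 n.
    \[ T_{\text{old}}(n) \approx 2^{\,T_{\text{new}}(n)} \]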
wasabi991011
I interpreted that as n->log(n) since log and exp are inverses.
Also because I've often heard that the quantum Fourier transform is an exponential speedup over the discrete Fourier transform, and there the scaling goes n^2 -> n log n.
marcellus23
Meaningless and non-constructive pedantry.
chrisweekly
I'm not the OP you're responding to, but to be fair, in a sentence about big-O perf characteristics, which includes the word "algorithms", using "exponentially" in a colloquial non-technical sense is an absolutely terrible word choice.
linguistbreaker
Exponentially bad word choice even... since we're using that word however we want now?
I don't think this is meaningless or non-constructive pedantry - we're a technical community and those are technical words.
msgodel
I disagree. Misuse of the word "exponential" is a major pet peeve of mine. It's a particular case of the much more common "use mathematically precise phrasing to sound careful/precise" that you often find in less than honest writing.
Here they are actually using it to refer to growth functions (which is rare for this error) and being honest (which is also rare IMO) but it's still wrong. They should have written about quadratic or quadratic vs linear.
Regardless sloppy language leads to sloppy thought.
keybored
Sloppy writing is up orders of magnitude lately.
sneak
My personal pet peeve is when the term “exponentially” is used to refer to a change between precisely two data points.
It’s a very specific subset of the one you’re describing.
“Tell me you don’t know what you’re talking about without telling me you don’t know what you’re talking about.”
pjmlp
A very good example of how writing code in C doesn't help performance when the algorithms or data structures aren't properly taken into consideration.
IshKebab
I would say C makes this sort of thing far more likely because it's usually a ton of effort to obtain suitable containers. In C++ or Rust they have plenty of things like `unordered_set`/`HashSet` built in, so people are much more likely to use it and not go "eh, I'll use a for loop".
In this case Git already had a string set, but it's still not standard so there's a good chance the original author just didn't know about it.
masklinn
> In this case Git already had a string set, but it's still not standard so there's a good chance the original author just didn't know about it.
The original commit was made in January 2009 (https://github.com/git/git/commit/b2a6d1c6868b6d5e7d2b4fa912...), strmap was added in November 2020 (https://github.com/git/git/commit/ae20bf1ad98bdc716879a8da99..., strset was added a few days later: https://github.com/git/git/commit/1201eb628ac753af5751258466...). It was first proposed in 2018 (https://lore.kernel.org/git/20180906191203.GA26184@sigill.in... the proposal specifically mentions it fixing possibly quadratic sites).
As noted in the comment, git did have a sorted string list with bisection search, and that's from 2008 (and it actually dates back to 2006 as the "path list" API, before it was renamed following the realisation that it was a generalised string list). Though as the hashmap proposal notes, it's a bit tricky because there's a single type with functions for sorted and functions for unsorted operations, you need to know whether your list is sorted or not independent of its type.
pjmlp
Yeah, not something that WG14 will ever care about.
gre
looks like the commit is here: https://github.com/git/git/commit/a52d459e72b890c192485002ec...
mortar
Thanks, this helped me find the submission and related discussion thread - https://lore.kernel.org/git/20250401-488-generating-bundles-...
throawayonthe
[dead]
divbzero
There seems to be a lesson here about the balance between premature vs. anticipatory optimization. We’re generally warned against premature optimization but perhaps, as a rule of thumb, we should look for optimizations in frequently-called functions that are obvious and not onerous to implement.
Quekid5
If a set-of-strings was trivially available in the source language (at time of implementation) the original programmer would probably have done this (relatively) trivial optimization... This is a symptom of anemic languages like C.
KETpXDDzR
With git it should be straightforward to implement an incremental, deduplicated backup solution, since all objects are stored with their hashes as filenames.
ashishb
O(n^2) is fast enough to end up in production and slow enough to cause problems at scale.
The most troublesome cases of inefficiency in production are almost always O(n^2).
edflsafoiewq
So they replaced a nested for loop that checked for duplicates with a set data structure. Surprised such a common thing was in git.
The performance improvement that GitLab contributed to Git is slated to be released with v2.50.0:
https://github.com/git/git/commit/bb74c0abbc31da35be52999569...