
Git clone --depth 2 is vastly better than --depth 1 if you want to Git push later

rafaelcosta

I'm wondering what the "because when we read it in, we mangle it" part really means... does this mean that there's no way to reference the commit (signaling that it's just a reference and has no actual data) without actually reading the contents of it?

-- Update: just realized why it wouldn't make sense: `git push` would send only the delta from the previous commit, and the previous commit is... non-existent (we only know its ID), so we'd be back to square one (sending everything).

kruador

See my top-level response, but basically nothing is mangled. Instead Git internally treats it as a 'graft' and knows not to look for parents of the prior commit.

I started that comment as a reply to you but I realised that a) it may just have been a bug that might already be fixed and b) it looks like the Stack Overflow answer was speculative and not tested!

kruador

It isn't mangled. The commit is there as-is. Instead the repository has a file, ".git/shallow", which tells it not to look for the parents of any commit listed there. If you do a '--depth 1' clone, the file will list the single commit that was retrieved.

This is similar to the 'grafts' feature. Indeed 'git log' says 'grafted'.
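You can see the marker directly. After the '--depth 1' clone below, the file contains exactly one hash, that of the grafted commit (yours will differ):

    cat .git/shallow
    # 388218fac77d0405a5083cd4b4ee20f6694609c3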

You can test this using "git cat-file -p" with the commit that got retrieved, to print the raw object.

> git clone --depth 1 https://github.com/git/git

> git log

commit 388218fac77d0405a5083cd4b4ee20f6694609c3 (grafted, HEAD -> master, origin/master, origin/HEAD)
Author: Junio C Hamano <gitster@pobox.com>
Date:   Mon Feb 10 10:18:17 2025 -0800

    The ninth batch

    Signed-off-by: Junio C Hamano <gitster@pobox.com>
> git cat-file -p 388218fac77d0405a5083cd4b4ee20f6694609c3

tree fc620998515e75437810cb1ba80e9b5173458d1c
parent 50e1821529fd0a096fe03f137eab143b31e8ef55
author Junio C Hamano <gitster@pobox.com> 1739211497 -0800
committer Junio C Hamano <gitster@pobox.com> 1739211512 -0800

The ninth batch

Signed-off-by: Junio C Hamano <gitster@pobox.com>

I can't reproduce the problem pushing to Bitbucket, using the most recent Git for Windows (2.47.1.windows.2). It only sent 3 objects (which would be the blob of the new file, the tree object containing the new file, and the commit object describing the tree), not the 6000+ in the repository I tested it on.

It may be that there was a bug that has now been fixed. Or it may be something that only happens/happened with GitHub (i.e. a bug at the receiving end, not the sending one!)

I note that the Stack Overflow user who wrote the answer left a comment underneath saying

"worth noting: I haven't tested this; it's just some simple applied math. One clone-and-push will tell you if I was right. :-)"

mg

Reading this again reminds me how beautifully git uses the file system as a database, with everything laid out nicely in directories and files.

Except for performance, is there any downside to this?

In other words: When you store data in an application that only reads and writes data occasionally, is it a good idea to use the git approach and store it in files?
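Git's object layout is easy to poke at with stock plumbing commands. A minimal sketch (the blob hash is deterministic for this content, so yours will match):

    git init demo && cd demo
    echo 'hello' | git hash-object -w --stdin
    # ce013625030ba8dba906f756967f9e9ca394464a
    find .git/objects -type f
    # .git/objects/ce/013625030ba8dba906f756967f9e9ca394464a

Each object lands in a file named by its hash, under a directory named by the hash's first two characters.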

remram

Performance is one problem, concurrency is another (you'll need a separate locking and logging system to make it concurrent-safe and atomic); it can also be unwieldy to move around, and it will be broken by Dropbox-like apps that mark individual files as conflicted (rather than your whole database).

Ferret7446

Concurrency is not a big problem. The only concurrency issue with Git is that refs (and states like rebase/merge conflicts) are stored in "loose" files. This can be solved easily and elegantly, as jj does, by putting the repo metadata into the object store too.

You can use a jj repo concurrently, e.g., over Dropbox with coworkers, and all it requires is a minor modification on top of the existing Git data model.

bennofs

One major downside is that it becomes really hard to do transactions, especially across multiple files. If, like git, you store mostly immutable data (every object except the refs is immutable; mutating creates a new object), it can work nicely.
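Git's trick for the mutable part (the refs) is the classic write-then-rename pattern: the new value goes into a temporary lock file that is atomically renamed into place. A rough sketch of the pattern, not git's actual implementation:

    # readers never see a half-written ref: rename() is atomic on POSIX
    echo "$NEW_COMMIT_HASH" > .git/refs/heads/main.lock
    mv .git/refs/heads/main.lock .git/refs/heads/main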

mg

Hmm... is the mutability of data really enough to create a need for transactions?

For example here on HN (which afaik also stores the data in files) you can change a comment you wrote. But that type of mutability does not call for transactions, right?

t0mas88

Depends on the requirements, if you need concurrent access and things like durability in a crash you're going to be implementing something that looks a lot like a database.

mg

Do files really prevent concurrent access?

And does git not need crash durability?

chithanh

It's called flat file storage and it has a number of advantages and disadvantages. I prefer it because it is more robust and performs well, even for large amounts of data, as long as your filesystem is reasonably fast and reliable.

Think maildir vs. mbox/PST/etc. for message storage. I stopped counting the number of times that I have seen Outlook mangle its message database and require a rebuild.

Generally it is not so popular, in part because OSes like Windows and macOS have somewhat lacking filesystem implementations. Git also has performance issues with large repositories on Windows, which need to be worked around by various methods.

Transactions are another limitation as mentioned by the other reply (they are possible to implement on top of "normal" filesystems, but not in a portable way).

necovek

I like the fact that none of this was tested, even if described with such authority :)

Anyone try it out yet?

(Not that I don't trust it, but I usually fetch the full history locally anyway)

kruador

I can't replicate the initial problem, at least pushing to Bitbucket. I'm using Windows, so I didn't use `touch` - instead I used 'echo' to create a new file in a shallow clone of my repo. That repo is 126 MB on Bitbucket, and the shallow clone downloaded 6395 objects taking 40.68 MB.

I've tried with a new file both having content ('Test shallow clone push'), and again with an empty file. In both cases it pushed 3 objects, and in the empty file case it reused one (it turns out my repo already has some empty files in it).

It's always possible that this is (or was) a GitHub bug - I haven't tried it there.
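For reference, the test boils down to something like this (a sketch; the URL is a placeholder):

    git clone --depth 1 https://bitbucket.org/example/repo.git
    cd repo
    echo Test shallow clone push > test.txt
    git add test.txt
    git commit -m "Test shallow clone push"
    git push   # here: "Writing objects: 100% (3/3)", not thousands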

jbreckmckye

Why can't git push, when it encounters a `.git/shallow` file, just ask the git server to fill in the remaining history by verifying the parent hashes the client can send?

TachyonicBytes

It can, but that's another type of "shallow", or more exactly "not-deep" cloning, called blobless cloning [1]. There is also treeless cloning, with other tradeoffs, but much to the same effect.

I found this[2] very enlightening.

[1] https://github.blog/open-source/git/get-up-to-speed-with-par...

[2] https://www.howtogeek.com/devops/how-to-use-git-shallow-clon...
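Concretely, both variants are selected with `--filter` (real flags; the URL is a placeholder):

    # blobless: full commit and tree history; file contents fetched on demand
    git clone --filter=blob:none https://example.com/repo.git

    # treeless: full commit history; trees and blobs fetched on demand
    git clone --filter=tree:0 https://example.com/repo.git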

Timwi

This seems like a bug to me. Even if the previous commit is “mangled” as they call it, there's no reason why you can't diff against it and only send the diff.

gbin

I believe this is because the generated unique ID of that node is derived from the link to its parent: https://gist.github.com/masak/2415865

So the "where to attach it to the tree" info is effectively lost.

nopurpose

`git clone --filter blob:none` FTW

ksynwa

What does this do?

ChocolateGod

Have a look at this blog post; it explains that option really well, as well as other alternatives to shallow clones.

https://github.blog/open-source/git/get-up-to-speed-with-par...

pabs3

I use tree:0 instead.

edflsafoiewq

Do blobless clones suffer from this?

jakub_g

From my experience, I strongly recommend against blobless/treeless clones for local dev. They should only be used in CI for throwaway build-and-forget scenarios.

When you have a blobless/treeless clone locally, it will fetch missing blobs on demand during random `git` operations, at the least expected moments, and it does this very inefficiently, one by one, which is super slow. (I also didn't find a way to go back from a blobless/treeless clone to a normal clone, i.e. to force-fetch all missing blobs efficiently.)

It's especially tricky when those `git` operations happen in the background, e.g. from a GUI (IDE extension etc.), and you don't have any feedback about what is happening.
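The on-demand behaviour is easy to observe (a sketch; the URL is a placeholder):

    git clone --filter=blob:none https://example.com/big-repo.git
    cd big-repo
    git log --oneline         # fast: commit history is already local
    git log -p -- some/file   # slow: every missing blob triggers a fetch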

Some further reading:

- https://github.blog/open-source/git/get-up-to-speed-with-par...

- https://github.blog/open-source/git/git-clone-a-data-driven-...

nopurpose

Blobless is still better than shallow because at least the commit history is preserved.

jakub_g

It might be fine for small repos, but for massive repos, blobless/treeless clones become unmanageable because many git operations get very slow. I added some further links.

From my side: when you have a non-trivially sized repo, on a local machine you should use either a shallow or a full clone (or, if possible, a sparse checkout, but that requires the repo to be compatible).

wvh

That's a beautiful answer. Sometimes people explain something you already know, but different parts of your brain light up. This doesn't just explain git once more, but also plants some seeds related to hashed state optimisations in other, future challenges.

bradley13

Ok, I'm a simplistic Git user, but: I always do a full clone. Maybe (probably) I will never need all that history, but...maybe I will. Disk space is cheap.

emmelaich

Often I just want to have a look locally. It's saving time. I'm not very concerned about disk space. Easy enough to deepen it later if wanted.
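Deepening later is a one-liner (real flags):

    git fetch --deepen=100   # grab 100 more commits of history
    git fetch --unshallow    # or fetch everything and drop the shallow marker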

koiueo

That is, until you have to clone a 10 GiB repo over shaky 5G while commuting on a train.

diggan

Yeah, or any project that is more than a decade old, where you don't want 10,000 commits from before the year 2023 on disk for almost no gain.
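There's a flag for exactly that cutoff (real flag; the URL is a placeholder):

    git clone --shallow-since=2023-01-01 https://example.com/old-project.git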

pabs3

For the use-case of downloading as little data as possible, I use this:

git clone --depth=1 --filter tree:0

zeristor

Should have the (2021) suffix