git gc --aggressive vs git repack

Nowadays there is no practical difference: git gc --aggressive operates according to the suggestion Linus made in 2007; see below. As of version 2.11 (Q4 2016), git gc --aggressive defaults to a depth of 50 rather than 250. A window of 250 is good because each object is compared against a larger set of delta candidates, but a depth of 250 is bad because it lets delta chains reach back through very deep, old objects, which slows down all future git operations in exchange for marginally lower disk usage.
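To see which values a given repository would use, the two settings can be read with git config. The following is a minimal sketch, assuming a throwaway repository created with mktemp; the explicit values written at the end are only an illustration, not a recommendation. gc.aggressiveDepth and gc.aggressiveWindow are the documented configuration keys.

```shell
#!/bin/sh
# Sketch: inspect and pin the settings used by git gc --aggressive.
# The temporary repository is illustrative; run the git config lines in your own repo.
set -e
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"

# An unset key makes git config exit non-zero; the built-in default then applies.
depth=$(git config --get gc.aggressiveDepth || echo "unset")
window=$(git config --get gc.aggressiveWindow || echo "unset")
echo "before: depth=$depth window=$window"

# Pin the values explicitly for this repository only.
git config gc.aggressiveDepth 50
git config gc.aggressiveWindow 250
depth=$(git config --get gc.aggressiveDepth)
window=$(git config --get gc.aggressiveWindow)
echo "after: depth=$depth window=$window"
```

Setting the keys per repository rather than globally keeps the experiment from affecting other checkouts.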


Historical Background

Linus suggested (see below for the full mailing list post) using git gc --aggressive only when you have, in his words, “a really bad pack” or “really horribly bad deltas”; “almost always, in other cases, it’s actually a really bad thing to do.” The result may even leave your repository in worse condition than when you started!

The command he suggests for doing this properly after having imported “a long and involved history” is

git repack -a -d -f --depth=250 --window=250

But this assumes you have already removed unwanted gunk from your repository history and that you have followed the checklist for shrinking a repository found in the git filter-branch documentation.

git-filter-branch can be used to get rid of a subset of files, usually with some combination of --index-filter and --subdirectory-filter. People expect the resulting repository to be smaller than the original, but you need a few more steps to actually make it smaller, because Git tries hard not to lose your objects until you tell it to. First make sure that:

  • You really removed all variants of a filename, if a blob was moved over its lifetime. git log --name-only --follow --all -- filename can help you find renames.

  • You really filtered all refs: use --tag-name-filter cat -- --all when calling git filter-branch.
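The rename check in the first point can be tried on a scratch repository. This is a minimal sketch under stated assumptions: the file names old.txt and new.txt and the temporary repository are hypothetical, and the empty --pretty=format: is only there to suppress the commit headers so that the path names are easy to collect.

```shell
#!/bin/sh
# Sketch: list every name a file has had, using the --follow invocation quoted above.
set -e
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"
# Helper that supplies an identity so the sketch works without global config.
commit() { git -c user.name=Example -c user.email=example@example.invalid commit -q "$@"; }

printf 'line one\nline two\nline three\n' > old.txt
git add old.txt
commit -m "add file under its first name"
git mv old.txt new.txt
commit -m "rename file"

# Collect the distinct path names reported across the file's history.
names=$(git log --name-only --follow --all --pretty=format: -- new.txt | grep -v '^$' | sort -u)
echo "$names"
```

Each name in the output is a path that would need to be covered by the filter.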

Then there are two ways to get a smaller repository. A safer way is to clone, which keeps your original intact.

  • Clone it with git clone file:///path/to/repo. The clone will not have the removed objects. See git-clone. (Note that cloning with a plain path just hardlinks everything!)
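The effect of cloning over file:// can be checked on a scratch repository. The sketch below assumes a temporary directory and uses a blob written with git hash-object as a stand-in for objects left unreferenced after filtering; the directory names orig and clone are illustrative.

```shell
#!/bin/sh
# Sketch: a file:// clone copies only reachable objects.
set -e
work=$(mktemp -d)
git init -q "$work/orig"
cd "$work/orig"
echo "kept" > kept.txt
git add kept.txt
git -c user.name=Example -c user.email=example@example.invalid commit -q -m "keep this"

# Write a blob that no commit references (stand-in for filtered-out history).
orphan=$(echo "unwanted gunk" | git hash-object -w --stdin)

# file:// goes through the normal transport instead of hardlinking the object store.
git clone -q "file://$work/orig" "$work/clone"

# cat-file -e exits non-zero when the object is missing.
if git -C "$work/clone" cat-file -e "$orphan" 2>/dev/null; then
  in_clone=yes
else
  in_clone=no
fi
echo "orphan blob present in clone: $in_clone"
```

The original repository still contains the blob, which is why this route is the non-destructive one.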

If you really don’t want to clone it, for whatever reasons, check the following points instead (in this order). This is a very destructive approach, so make a backup or go back to cloning it. You have been warned.

  • Remove the original refs backed up by git-filter-branch: say

    git for-each-ref --format="%(refname)" refs/original/ |
      xargs -n 1 git update-ref -d
    
  • Expire all reflogs with git reflog expire --expire=now --all.

  • Garbage collect all unreferenced objects with git gc --prune=now (or if your git gc is not new enough to support arguments to --prune, use git repack -ad; git prune instead).
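The three steps above can be sketched end to end on a scratch repository. This is an illustration only: the temporary repository, the file names, and the hand-made ref under refs/original/ (imitating the backup that git filter-branch leaves behind) are all assumptions, and an amended commit stands in for the rewritten history.

```shell
#!/bin/sh
# Sketch: drop backup refs, expire reflogs, then prune, on a throwaway repository.
set -e
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"
commit() { git -c user.name=Example -c user.email=example@example.invalid commit -q "$@"; }

echo "keep" > kept.txt
echo "big" > unwanted.bin
git add kept.txt unwanted.bin
commit -m "original history"
old=$(git rev-parse HEAD)
blob=$(git rev-parse HEAD:unwanted.bin)

# Imitate the filter-branch backup ref, then rewrite the tip without the file.
branch=$(git symbolic-ref --short HEAD)
git update-ref "refs/original/refs/heads/$branch" "$old"
git rm -q unwanted.bin
commit --amend -m "rewritten history"

# Step 1: remove the backed-up original refs.
git for-each-ref --format="%(refname)" refs/original/ | xargs -n 1 git update-ref -d
# Step 2: expire all reflogs so they no longer keep the old commit alive.
git reflog expire --expire=now --all
# Step 3: garbage collect everything that is now unreferenced.
git gc -q --prune=now

if git cat-file -e "$blob" 2>/dev/null; then blob_present=yes; else blob_present=no; fi
echo "unwanted blob still present: $blob_present"
```

Skipping either of the first two steps leaves the old commit reachable, in which case the final gc keeps the blob.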


Date: Wed, 5 Dec 2007 22:09:12 -0800 (PST)
From: Linus Torvalds <torvalds at linux-foundation dot org>
To: Daniel Berlin <dberlin at dberlin dot org>
cc: David Miller <davem at davemloft dot net>,
    ismail at pardus dot org dot tr,
    gcc at gcc dot gnu dot org,
    git at vger dot kernel dot org
Subject: Re: Git and GCC

On Thu, 6 Dec 2007, Daniel Berlin wrote:

> Actually, it turns out that git-gc --aggressive does this dumb thing
> to pack files sometimes regardless of whether you converted from an
> SVN repo or not.

Absolutely. git --aggressive is mostly dumb. It’s really only useful for
the case of “I know I have a really bad pack, and I want to throw away
all the bad packing decisions I have done.”

To explain this, it’s worth explaining (you are probably aware of it, but
let me go through the basics anyway) how git delta-chains work, and how
they are so different from most other systems.

In other SCMs, a delta-chain is generally fixed. It might be “forwards”
or “backwards,” and it might evolve a bit as you work with the repository,
but generally it’s a chain of changes to a single file represented as some
kind of single SCM entity. In CVS, it’s obviously the *,v file, and a lot
of other systems do rather similar things.

Git also does delta-chains, but it does them a lot more “loosely.” There
is no fixed entity. Deltas are generated against any random other version
that git deems to be a good delta candidate (with various fairly
successful heuristics), and there are absolutely no hard grouping rules.

This is generally a very good thing. It’s good for various conceptual
reasons (i.e., git internally never really even needs to care about the whole
revision chain — it doesn’t really think in terms of deltas at all), but
it’s also great because getting rid of the inflexible delta rules means
that git doesn’t have any problems at all with merging two files together,
for example — there simply are no arbitrary *,v “revision files” that have
some hidden meaning.

It also means that the choice of deltas is a much more open-ended
question. If you limit the delta chain to just one file, you really don’t
have a lot of choices on what to do about deltas, but in git, it really
can be a totally different issue.

And this is where the really badly named --aggressive comes in. While
git generally tries to re-use delta information (because it’s a good idea,
and it doesn’t waste CPU time re-finding all the good deltas we found
earlier), sometimes you want to say “let’s start all over, with a blank
slate, and ignore all the previous delta information, and try to generate
a new set of deltas.”

So --aggressive is not really about being aggressive, but about wasting
CPU time re-doing a decision we already did earlier!

Sometimes that is a good thing. Some import tools in particular could
generate really horribly bad deltas. Anything that uses git fast-import,
for example, likely doesn’t have much of a great delta layout, so it might
be worth saying “I want to start from a clean slate.”

But almost always, in other cases, it’s actually a really bad thing to do.
It’s going to waste CPU time, and especially if you had actually done a
good job at deltaing earlier, the end result isn’t going to re-use all
those good deltas you already found, so you’ll actually end up with a
much worse end result too!

I’ll send a patch to Junio to just remove the git gc --aggressive
documentation. It can be useful, but it generally is useful only when you
really understand at a very deep level what it’s doing, and that
documentation doesn’t help you do that.

Generally, doing incremental git gc is the right approach, and better
than doing git gc --aggressive. It’s going to re-use old deltas, and
when those old deltas can’t be found (the reason for doing incremental GC
in the first place!) it’s going to create new ones.

On the other hand, it’s definitely true that an “initial import of a long
and involved history” is a point where it can be worth spending a lot of
time finding the really good deltas. Then, every user ever after (as
long as they don’t use git gc --aggressive to undo it!) will get the
advantage of that one-time event. So especially for big projects with a
long history, it’s probably worth doing some extra work, telling the delta
finding code to go wild.

So the equivalent of git gc --aggressive — but done properly — is to
do (overnight) something like

git repack -a -d --depth=250 --window=250

where that depth thing is just about how deep the delta chains can be
(make them longer for old history — it’s worth the space overhead), and
the window thing is about how big an object window we want each delta
candidate to scan.

And here, you might well want to add the -f flag (which is the “drop all
old deltas” flag), since you now are actually trying to make sure that this
one actually finds good candidates.

And then it’s going to take forever and a day (i.e., a “do it overnight”
thing). But the end result is that everybody downstream from that
repository will get much better packs, without having to spend any effort
on it themselves.

          Linus
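
The repack described in the post can be tried on a scratch repository before running it on a real one. A minimal sketch, assuming a temporary repository with a small generated history (the file name history.txt and the 20 revisions are arbitrary); git count-objects -v is used here simply as one way to see what changed.

```shell
#!/bin/sh
# Sketch: run the full repack on a throwaway repository and inspect the result.
set -e
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"

# Build a short history of one growing file.
i=1
while [ "$i" -le 20 ]; do
  echo "revision $i" >> history.txt
  git add history.txt
  git -c user.name=Example -c user.email=example@example.invalid commit -q -m "revision $i"
  i=$((i + 1))
done

echo "before:"; git count-objects -v
# -a: pack everything, -d: drop redundant packs and loose copies, -f: recompute deltas.
git repack -a -d -f -q --depth=250 --window=250
echo "after:"; git count-objects -v

loose=$(git count-objects -v | sed -n 's/^count: //p')
packs=$(git count-objects -v | sed -n 's/^packs: //p')
```

On a repository this small the run is instant; the overnight cost mentioned in the post applies to large histories.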
