git pull --rebase lost commits after coworker’s git push --force

Question

TL;DR: it’s the fork point code

You are getting the effect of git rebase --fork-point, which deliberately drops Dan’s commit from your repository too. See also Git rebase – commit select in fork-point mode (although in my answer there I don’t mention something I will here).

If you run the git rebase yourself, you choose whether --fork-point is used. The --fork-point option is used when:

you run git rebase with no <upstream> argument (--fork-point is then implied), or
you run git rebase --fork-point [<arguments>] <upstream>.

This means that to rebase on your upstream without having --fork-point applied, you should use:

git rebase @{u}

or:

git rebase --no-fork-point

Some details are Git-version-dependent, as --fork-point became an option only in Git version 2.0 (but was secretly done by git pull ever since 1.6.4.1, with methods growing more complex until the whole --fork-point thing was invented).

Discussion

As you already know, git push --force rudely overwrites the branch pointer, dropping some existing commit(s). You expected, though, that your git pull --rebase would restore the dropped commit, since you already had it yourself. For naming convenience, let’s use your naming, where Dan’s commit gets dropped when Brian force-pushes. (As a mnemonic, let’s say “Dan got Dropped”.)

Sometimes it will! Sometimes, as long as your Git has Dan’s commit in your repository, and you have Dan’s commit in your history, Dan’s commit will get restored when you rebase your commits. This includes the case where you are Dan. And yet, sometimes it won’t, and this also includes the case where you are Dan. In other words, it’s not based on who you are at all.

The complete answer is a bit complicated, and it’s worth noting that this behavior is something you can control.

About `git pull` (don’t use it)

First, let’s make a brief note: git pull is, in essence, just git fetch followed by either git merge or git rebase.¹ You choose in advance which command to run, by supplying --rebase or setting a configuration entry, branch.branch-name.rebase. However, you can run git fetch yourself, and then run git merge or git rebase yourself, and if you do it this way, you gain access to additional options.²

The most important of these is the ability to inspect the result of the fetch before choosing your primary option (merge vs rebase). In other words, this gives you a chance to see that there was a commit dropped. If you had done a git fetch earlier and gotten Dan’s commit, then—with or without any intervening work where you may or may not have incorporated Dan’s commit—done a second git fetch, you would see something like this:

 + 5122532...6f1308f pu         -> origin/pu  (forced update)

Note the “(forced update)” annotation: this is what tells you that Dan got Dropped. (The branch name used here is pu, which is one in the Git repo for Git that regularly gets force-updated; I just cut-and-pasted an actual git fetch output here.)
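If the fetch output has already scrolled away, the reflog of the remote-tracking name still records the old value, so the same check can be made after the fact. The following is only a sketch: the repository names (seed, origin.git, you, dan, brian), the commit helper, and the one-file-per-commit layout are all made up for illustration.

```shell
set -e
export GIT_AUTHOR_NAME=Example GIT_AUTHOR_EMAIL=example@example.com
export GIT_COMMITTER_NAME=Example GIT_COMMITTER_EMAIL=example@example.com
cd "$(mktemp -d)"

commit() {  # commit <repo-dir> <name>: add <name>.txt and commit it as <name>
  echo "$2" > "$1/$2.txt"
  git -C "$1" add "$2.txt"
  git -C "$1" commit -q -m "$2"
}

# A shared repository holding one commit, A, plus three clones of it.
git init -q seed
git -C seed symbolic-ref HEAD refs/heads/master
commit seed A
git clone -q --bare seed origin.git
for who in you dan brian; do git clone -q origin.git "$who"; done

commit dan Dan
git -C dan push -q origin master            # Dan publishes his commit
git -C you fetch -q                         # you pick it up
commit brian Brian
git -C brian push -q --force origin master  # Dan got Dropped
git -C you fetch -q                         # this is the "(forced update)" fetch

old=$(git -C you rev-parse 'origin/master@{1}')  # value before the last fetch
new=$(git -C you rev-parse origin/master)        # value after it
if git -C you merge-base --is-ancestor "$old" "$new"; then
  kind=fast-forward
else
  kind=forced
fi
dropped=$(git -C you log --format=%s "$new..$old" | xargs)
echo "$kind update; dropped: $dropped"   # prints "forced update; dropped: Dan"
```

The `$new..$old` range lists what the old value could reach that the new one cannot, which is exactly the set of commits the force-push discarded.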

¹There are several niggling technical differences, especially in very old versions of Git (before 1.8.4). There is also, as I was recently reminded, one other special case, for a git pull in a repository that has no commits on the current branch (typically, into a new empty repository): here git pull invokes neither git merge nor git rebase, but rather runs git read-tree -m and, if that succeeds, sets the branch name itself.

²I think you can supply all the necessary arguments on the command line, but that’s not what I mean. In particular, the ability to run other Git commands between the fetch and the second step is what we want.

Basics of `git rebase`

The main and most fundamental thing to know about git rebase is that it copies commits. The why is itself fundamental to Git: nothing—no one, and not Git itself—can change anything in a commit (or any other Git object), as the “true name” of a Git object is a cryptographic hash of its contents.³ Hence if you take a commit out of the database, modify anything—even a single bit—and go to put the object back in, you get a new, different hash: a new and different commit. It can be extremely similar to the original, but if any bit of it is different in any way, it’s a new, different commit.
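One way to see this content-addressing directly is git hash-object, which prints the hash ID Git would give to some content. A minimal sketch, with arbitrary sample strings:

```shell
# Same bytes in, same hash ID out; change one character and the ID changes.
h1=$(printf 'hello\n' | git hash-object --stdin)
h2=$(printf 'hello\n' | git hash-object --stdin)
h3=$(printf 'hellp\n' | git hash-object --stdin)
echo "$h1"
echo "$h2"
echo "$h3"
```

The first two lines of output are identical and the third differs. These are blob objects rather than commits, but commits are hashed on the same principle.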

To see how these copies work, draw at least part of the commit graph. The graph is just a series of commits, starting from the newest—or tip—commit, whose true-name hash ID is stored in the branch’s name. We say that the name points to the commit:

            D   <-- master

The commit, which I’ve called D here, contains (as part of its hashed commit data) the hash ID of its parent commit, i.e., the commit that was the tip of the branch before we made D. So it “points to” its parent, and its parent points further back:

... <- C <- D   <-- master

The fact that the internal arrows are all backwards like this is usually not very important, so I tend to omit them here. When the one-letter names are not very important I just draw a round dot for each commit:

...--o--o   <-- branch

For branch to “branch off from” master, we should draw both branches:

A--B--C--D     <-- master
    \
     E--F--G   <-- branch

Note that commit E points back to commit B.

Now, if we want to re-base branch, so that it comes after commit D (which is now the tip of master), we need to copy commit E to a new commit E' that is “just as good as” E, except that it has D as its parent (and of course has a different snapshot as its source base as well):

           E'  <-- (temporary)
          /
A--B--C--D     <-- master
    \
     E--F--G   <-- branch

We must now repeat this with F and G, and when we are all done, make the name branch point to the last copy, G', abandoning the original chain in favor of the new one:

           E'-F'-G'  <-- branch
          /
A--B--C--D           <-- master
    \
     E--F--G         [abandoned]

This is what git rebase is all about: we pick out some set of commits to copy; we copy them to some new position, one at a time, in parent-first order (vs the more typical child-first backwards Git order); and then we re-point the branch label to the last-copied commit.
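The drawings above can be reproduced in a scratch repository. This is a sketch under assumptions: the commit helper and the one-file-per-commit layout are invented here just to give each commit distinct content.

```shell
set -e
export GIT_AUTHOR_NAME=Example GIT_AUTHOR_EMAIL=example@example.com
export GIT_COMMITTER_NAME=Example GIT_COMMITTER_EMAIL=example@example.com
cd "$(mktemp -d)"
git init -q
git symbolic-ref HEAD refs/heads/master
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }

commit A; commit B
git branch branch            # branch forks off at B
commit C; commit D           # master moves on to D
git checkout -q branch
commit E; commit F; commit G

old_tip=$(git rev-parse branch)           # the original G
git rebase -q master                      # copy E, F, G to come after D
new_tip=$(git rev-parse branch)           # the copy, G'

copied=$(git log --reverse --format=%s master..branch | xargs)
base=$(git log -1 --format=%s branch~3)   # the parent of E'
echo "G  = $old_tip"
echo "G' = $new_tip"
echo "copies after $base: $copied"        # prints "copies after D: E F G"
```

The two hash IDs differ, and the original G is still in the repository (git cat-file -e "$old_tip" succeeds): it is abandoned, not destroyed.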

Note that this works even for the null case. If the name branch points directly to B and we rebase it on master, we copy all zero commits that come after B, copying them to come after D. Then re-point the label branch to the last-copied commit, which is none, which means we re-point branch to commit D. It’s perfectly normal, in Git, to have several branch names all pointing to the same commit. Git knows which branch you are on by reading .git/HEAD, which contains the name of the branch. The branch itself—some portion of the commit graph—is determined by the graph. This means the word “branch” is ambiguous: see What exactly do we mean by “branch”?

Note also that commit A has no parents at all. It’s the first commit in the repository: there was no previous commit. Commit A is therefore a root commit, which is just a fancy way to say “a commit with no parents”. We can also have commits with two or more parents; these are merge commits. (I did not draw any here, though. It’s often unwise to rebase branch chains that contain merges, since it’s literally impossible to rebase a merge and git rebase has to re-perform the merge to approximate it. Normally git rebase just omits merges entirely, which causes other problems.)

³Obviously, by the Pigeonhole Principle, any hash that reduces a longer bit-string to a fixed-length k-bit key must necessarily have collisions on some inputs. A key requirement for a Git hash function is that it avoid accidental collisions. The “cryptographic” part is not really crucial to Git, it just makes it hard (but of course not impossible) for someone to deliberately cause a collision. Collisions cause Git to be unable to add new objects, so they are bad, but—aside from bugs in the implementation—they don’t actually break Git itself, just the further usage of Git for your own data.

Determining what to copy

One problem with rebasing lies in identifying which commits to copy.

Most of the time, it seems easy enough: you want Git to copy your commits, and not someone else’s. But that’s not always true—in large, distributed environments, with administrators and managers and so on, sometimes it’s appropriate for someone to rebase someone else’s commits. In any case, this is not how Git does it in the first place. Instead, Git uses the graph.

Naming a commit—e.g., writing branch—tends to select not just that commit, but also that commit’s parent commit, the parent’s parent, and so on, all the way back to the root commit. (If there is a merge commit, we usually select all of its parent commits, and follow all of them back towards the root simultaneously. A graph can have more than one root, so this lets us select multiple strands going back to multiple roots, as well as branch-and-merge strands going back to a single root.) We call the set of all commits that we find, when starting from one commit and doing these parent traversals, the set of reachable commits.

For many purposes, including git rebase, we need to make this en-masse selection stop, and we use Git’s fancy set operations to do that. If we write master..branch as a revision selector, this means: “All commits reachable from the tip of branch, except for any commits reachable from the tip of master.” Look at this graph again:

A--B--C--D     <-- master
    \
     E--F--G   <-- branch

The commits reachable from branch are G, F, E, B, and A. The commits reachable from master are D, C, B, and A. So master..branch means: subtract away the A+B+C+D set from the bigger A+B+E+F+G set.

In set subtraction, removing something that was never there in the first place is trivial: you just do nothing. So we remove A+B from the A+B+E+F+G set (C and D were never in it), leaving E+F+G. Alternatively, I like to use a method that I can’t draw on StackOverflow: color the commits red (stop) and green (go), following all the backwards arrows in your graph, starting with red for commits that are forbidden (master) and green for commits to take (branch). Just make sure that red overwrites green, or that you do red first and don’t re-color them when you’re doing green. It’s intuitively obvious⁴ that doing green first, then overwriting with red; or doing red first and not overwriting, gives the same result.
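The subtraction can be checked against the same A-through-G graph in a scratch repository. As a sketch, with the commit and names helpers invented for illustration:

```shell
set -e
export GIT_AUTHOR_NAME=Example GIT_AUTHOR_EMAIL=example@example.com
export GIT_COMMITTER_NAME=Example GIT_COMMITTER_EMAIL=example@example.com
cd "$(mktemp -d)"
git init -q
git symbolic-ref HEAD refs/heads/master
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }
names() { git log --format=%s "$@" | sort | xargs; }  # subjects of selected commits

commit A; commit B
git branch branch
commit C; commit D
git checkout -q branch
commit E; commit F; commit G

from_branch=$(names branch)        # A B E F G
from_master=$(names master)        # A B C D
to_copy=$(names master..branch)    # E F G
same_set=$(names branch ^master)   # E F G again: green for branch, red for master
echo "branch: $from_branch"
echo "master: $from_master"
echo "master..branch: $to_copy"
```

The spelling branch ^master is the long form of master..branch, and it reads much like the coloring method: take everything reachable from branch, then knock out everything reachable from master.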

Anyway, this is one way git rebase selects commits to copy. The rebase documentation calls this the <upstream>. You write:

git checkout branch; git rebase master

and Git knows to color master commits red and current-branch branch commits green, and then copy just the green ones. Moreover, that same name—master—tells git rebase where to put the copies. This is quite elegant and efficient: a single argument, master, tells Git both what to copy and where to put the copies.

The problem is, it doesn’t always work.

There are several common cases where it breaks down. One occurs when you want to limit the copies even more, e.g., to break up one big branch into two smaller ones. Another occurs when some, but not all, of your own commits have already been copied (cherry-picked or squash-“merged”) into another branch you’re rebasing onto. Rarer, but not unheard-of, sometimes an upstream has deliberately discarded some commits, and you should too.

For some of these cases, git rebase can deal with them using git patch-id: it can actually tell that a commit got copied, as long as the two commits have the same patch ID. For others, you must manually split the rebase target (rebase calls this <newbase>) using the --onto flag:

git rebase --onto <newbase> <upstream>

which limits the commits to copy to those in <upstream>..HEAD, while starting the copies after <newbase>. Splitting the target of the copies away from the <upstream> argument in this way means that you are now free to choose any <upstream> that removes the right commits, rather than some set determined by “where the copies go”.
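As an illustration of the "limit the copies even more" case, here is a sketch that copies only F and G onto master, leaving E behind. The graph is the A-through-G one from earlier; the commit helper and file names are assumptions made for the example.

```shell
set -e
export GIT_AUTHOR_NAME=Example GIT_AUTHOR_EMAIL=example@example.com
export GIT_COMMITTER_NAME=Example GIT_COMMITTER_EMAIL=example@example.com
cd "$(mktemp -d)"
git init -q
git symbolic-ref HEAD refs/heads/master
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }

commit A; commit B
git branch branch
commit C; commit D
git checkout -q branch
commit E; commit F; commit G

# <newbase> is master (copies go after D); <upstream> is branch~2, i.e. E,
# so the to-copy set E..HEAD is just F and G.
git rebase -q --onto master branch~2

copied=$(git log --reverse --format=%s master..branch | xargs)
echo "copied: $copied"   # prints "copied: F G"
```

This only goes smoothly because F and G do not depend on E's changes here. In a real branch, leaving out an earlier commit can easily produce conflicts.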

⁴This is the phrase mathematicians use when they don’t want to write out a proof. 🙂

The `--fork-point` option

If you regularly rebase commits onto an origin/whatever branch (or similar) that is, itself, also regularly rebased specifically with a goal of removing commits, it can be hard to decide which commits to copy. But if you have, in your origin/whatever reflog, a series of commit hashes that show that some commits were there before but are no longer, it’s possible for Git to use this to discard, from the to-copy set, some or all of those commits.

I was originally not sure how --fork-point is implemented internally, since it’s not very well documented. A test repository showed that the option is, unsurprisingly, order-dependent: git merge-base --fork-point origin/master topic returns a different result from git merge-base --fork-point topic origin/master.

For this answer, I also looked at the source code. This shows that Git looks in the reflog of the first non-option argument (call this arg1) and then uses it to find a merge base using the next such argument, arg2, resolved to a commit ID, completely ignoring any additional arguments. Based on this, the result of git merge-base --fork-point $arg1 $arg2 is essentially⁵ the output of:

git merge-base $arg2 $(git log -g --format=%H $arg1)

As the documentation says:

More generally, among the two commits to compute the merge base from, one is specified by the first commit argument on the command line; the other commit is a (possibly hypothetical) commit that is a merge across all the remaining commits on the command line.

As a consequence, the merge base is not necessarily contained in each of the commit arguments if more than two commits are specified. This is different from git-show-branch(1) when used with the --merge-base option.

So --fork-point tries to find the merge base between the current hash for the second argument, and the hypothetical merge base of the current value of the upstream and all of its reflog-recorded values. This is what leads to the exclusion of a dropped commit, such as the one by Dan in our example here.
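To compare the fork point with the ordinary merge base, the Dan-and-Brian scenario can be rebuilt in scratch repositories. This is a sketch only: the repository names, the commit and subject helpers, and the commit called Mine (your own work on top of Dan's) are all invented for the example.

```shell
set -e
export GIT_AUTHOR_NAME=Example GIT_AUTHOR_EMAIL=example@example.com
export GIT_COMMITTER_NAME=Example GIT_COMMITTER_EMAIL=example@example.com
cd "$(mktemp -d)"

commit() {  # commit <repo-dir> <name>: add <name>.txt and commit it as <name>
  echo "$2" > "$1/$2.txt"
  git -C "$1" add "$2.txt"
  git -C "$1" commit -q -m "$2"
}
subject() { git -C you log -1 --format=%s "$1"; }

git init -q seed
git -C seed symbolic-ref HEAD refs/heads/master
commit seed A
git clone -q --bare seed origin.git
for who in you dan brian; do git clone -q origin.git "$who"; done

commit dan Dan
git -C dan push -q origin master
git -C you fetch -q                          # origin/master reflog now lists Dan
git -C you merge -q --ff-only origin/master
commit you Mine                              # your work, built on Dan's commit
commit brian Brian
git -C brian push -q --force origin master   # Dan got Dropped
git -C you fetch -q

fork_point=$(git -C you merge-base --fork-point origin/master master)
by_hand=$(git -C you merge-base master $(git -C you log -g --format=%H origin/master))
plain=$(git -C you merge-base origin/master master)
echo "fork point: $(subject "$fork_point")"  # Dan
echo "by hand:    $(subject "$by_hand")"     # Dan
echo "plain:      $(subject "$plain")"       # A
```

The fork point is Dan's commit because it is still in the origin/master reflog, so a fork-point rebase treats everything up to and including Dan as "theirs". The plain merge base is A, which leaves Dan's commit on your side.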

Remember, using --fork-point mode merely modifies, internally, the <upstream> argument to git rebase (without also changing its --onto target). Let’s say, for instance, that at one time, the upstream had:

...--o--B1--B2--B3--C--D    <-- upstream
                 \
                  E--F--G   <-- branch

The goal, as it were, of --fork-point is to detect a rewrite of this form:

...--o-------------C--D     <-- upstream
      \
       B1--B2--B3           <-- upstream@{1}
                \
                 E--F--G    <-- branch

and “knock out” commits B1, B2, and B3 by selecting B3 as the internal <upstream> argument. If you leave out the --fork-point option, Git views everything this way instead:

...--o-------------C--D     <-- upstream
      \
       B1--B2--B3--E--F--G  <-- branch

so that all the B commits are “ours”. The commits on the upstream branch are D, C, and C’s parent o (and its parents, on to the root).

In our particular case, Dan’s commit—the one that gets dropped—resembles one of these B commits. It’s dropped with --fork-point and kept with --no-fork-point.
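Both outcomes can be seen side by side in scratch repositories. Again this is a sketch under assumptions: the repository names, the commit helper, and the commit called Mine are made up, and --fork-point is spelled out even though running git rebase with no <upstream> already implies it.

```shell
set -e
export GIT_AUTHOR_NAME=Example GIT_AUTHOR_EMAIL=example@example.com
export GIT_COMMITTER_NAME=Example GIT_COMMITTER_EMAIL=example@example.com
cd "$(mktemp -d)"

commit() {  # commit <repo-dir> <name>: add <name>.txt and commit it as <name>
  echo "$2" > "$1/$2.txt"
  git -C "$1" add "$2.txt"
  git -C "$1" commit -q -m "$2"
}

git init -q seed
git -C seed symbolic-ref HEAD refs/heads/master
commit seed A
git clone -q --bare seed origin.git
for who in you dan brian; do git clone -q origin.git "$who"; done

commit dan Dan
git -C dan push -q origin master
git -C you fetch -q
git -C you merge -q --ff-only origin/master
commit you Mine                              # your work, built on Dan's commit
commit brian Brian
git -C brian push -q --force origin master   # Dan got Dropped
git -C you fetch -q

mine=$(git -C you rev-parse master)          # remember the pre-rebase tip

git -C you rebase -q --fork-point            # what git pull --rebase gives you
with_fp=$(git -C you log --reverse --format=%s master | xargs)

git -C you reset -q --hard "$mine"           # put master back as it was
git -C you rebase -q --no-fork-point         # same upstream, fork-point off
without_fp=$(git -C you log --reverse --format=%s master | xargs)

echo "--fork-point:    $with_fp"             # A Brian Mine
echo "--no-fork-point: $without_fp"          # A Brian Dan Mine
```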

⁵If using --all here would produce multiple merge bases, the command fails and prints nothing. If the resulting (singular) merge base is not a commit that is already in the reflog, the command also fails. The first case occurs with criss-cross merges as the nearest ancestor. The second case occurs when the selected ancestor is old enough to have expired from the reflog, or was never in it in the first place (I am pretty sure both are possible). I have an example of this second kind of failure right here:

$ arg1=origin/next
$ arg2=stash-exp
$ git merge-base --all $arg2 $(git log -g --format=%H $arg1)
3313b78c145ba9212272b5318c111cde12bfef4a
$ git merge-base --fork-point $arg1 $arg2
$ echo $?
1

I think the idea of skipping 3313b78... here is that it is one of those “possibly hypothetical commits” I quote from the documentation, but in fact it would be the right commit to use with git rebase, and it’s the one that is used without --fork-point.

Git before 2.0, and conclusion

In Git versions before 2.0 (or maybe a late 1.9), git pull that was rebasing would compute this fork point, but git rebase never did. This meant that if you wanted this kind of behavior, you had to use git pull to get it. Now that git rebase has --fork-point, you can choose when to get it:

Add the option if you definitely do want it.
Use --no-fork-point or an explicit upstream (@{u} suffices if you have a default upstream) if you definitely don’t want it.

If you run git pull, you do not have the --no-fork-point option.⁶ Whether git pull origin master (assuming the current branch’s upstream is origin/master) suppresses the fork-point the way git rebase origin/master would, I do not know: I mostly avoid git pull. According to testing (see Breaking Benjamin’s comment below), in Git 2.0+, git pull --rebase always uses fork-point mode, even with additional arguments.

⁶At least, as of now when I just tested it to make sure.
