Trying to understand `git diff` and `git mv` rename detection mechanism

Git’s “similarity index” computation is not, as far as I know, documented anywhere other than in the source, starting with diffcore-delta.c.

To compute the similarity index for two files S (source) and D (destination), Git:

  • reads both files
  • computes a hash table of all of the chunks of file S
  • computes a second hash table of all of the chunks of file D

The entries in these two hash tables are simply a count of occurrences of instances of that hash value (plus, as noted below, the length of the chunk).

The hash value for a file chunk is computed by:

  • start at the current file offset (initially zero)
  • read 64 bytes or until '\n' character, whichever occurs first
  • if the file is claimed to be text and there is a '\r' before the '\n', discard the '\r'
  • hash the resulting string-of-up-to-64 bytes using the algorithm shown in the linked file

Now that there are hash tables for both S and D, each possible hash hi appears nS times in S and nD in D (either may be zero, though the code skips right over both-zero hash values). If the number of occurrences in D is less than or the same as the number of occurrences in S—i.e., nD ≤ nS—then D “copies from SnD times. If the number of occurrences in D exceeds the number in S (including when the number in S is zero), then D has a “literal add” of nD – nS occurrences of the hashed chunk, and D also copies all nS original occurrences as well.

Each hashed chunk retains its number-of-input-bytes, and these multiply the number of copies or number of additions of “chunks” to get the number of bytes copied or added. (Deletions, where D lacks items that exist in S, have only indirect effect here: the byte copy and add counts get smaller, but Git does not specifically count the deletions themselves.)

These two values (src_copied and literal_added) computed in diffcore_count_changes are handed over to function estimate_similarity in diffcore-rename.c. It completely ignores the literal_added count (this count is used in deciding how to build packfile deltas, but not in terms of rename scoring). Instead, only the src_copied number matters:

score = (int)(src_copied * MAX_SCORE / max_size);

where max_size is the size in bytes of larger of the two input files S and D.

Note that there is an earlier computation:

max_size = ((src->size > dst->size) ? src->size : dst->size);
base_size = ((src->size < dst->size) ? src->size : dst->size);
delta_size = max_size - base_size;

and if the two files have changed size “too much”:

if (max_size * (MAX_SCORE-minimum_score) < delta_size * MAX_SCORE)
        return 0;

we never even get into the diffcore-delta.c code to hash them. The minimum_score here is the argument to -M or --find-renames, converted to a scaled number. MAX_SCORE is 60000.0 (type double), so the default minimum_score, when you use the default -M50%, is 30000 (half of 60000). Except for the case of CR-before-LF eating, though, this particular shortcut should not affect the outcome of the more expensive similarity computation.

[Edit: this is now obsolete:] git status always uses the default. There is no knob to change the threshold (nor the number of files allowed in the rename-finding queue). If there were the code would go here, setting the rename_score field of the diff options. Until Git version 2.18.0, there was no way to control this for git status. In Git 2.18.0 and later, git status has the same --find-renames option as git diff. The status.renames option in the Git configuration enables any default detection, and if unset, git status obeys the diff.renames setting; see the git config documentation and the git status documentation.

Leave a Comment