How can I know if `git gc --auto` has done something?

Update Sept. 2020: your automatic save script no longer has to rely on git gc --auto alone.

The old “gc” can now be superseded by the new git maintenance run --auto.
And it can display what it is doing.

With Git 2.29 (Q4 2020), a big brother of “git gc” has been introduced to take care of more repository maintenance tasks, not limited to cleaning the object database.

See commit 25914c4, commit 4ddc79b, commit 916d062, commit 65d655b, commit d7514f6, commit 090511b, commit 663b2b1, commit 3103e98, commit a95ce12, commit 3ddaad0, commit 2057d75 (17 Sep 2020) by Derrick Stolee (derrickstolee).
(Merged by Junio C Hamano — gitster in commit 48794ac, 25 Sep 2020)

maintenance: create basic maintenance runner

Helped-by: Jonathan Nieder
Signed-off-by: Derrick Stolee

The ‘gc’ builtin is our current entrypoint for automatically maintaining a repository. This one tool does many operations, such as:

  • repacking the repository,
  • packing refs, and
  • rewriting the commit-graph file.

The name implies it performs “garbage collection” which means several different things, and some users may not want to use this operation that rewrites the entire object database.

Create a new ‘maintenance’ builtin that will become a more general-purpose command.

To start, it will only support the ‘run’ subcommand, but will later expand to add subcommands for scheduling maintenance in the background.

For now, the ‘maintenance’ builtin is a thin shim over the ‘gc’ builtin.
In fact, the only option is the ‘--auto’ toggle, which is handed directly to the ‘gc’ builtin.
The current change is isolated to this simple operation to prevent more interesting logic from being lost in all of the boilerplate of adding a new builtin.

Use existing builtin/gc.c file because we want to share code between the two builtins.
It is possible that we will have ‘maintenance’ replace the ‘gc’ builtin entirely at some point, leaving ‘git gc’ as an alias for some specific arguments to ‘git maintenance run’.

Create a new test_subcommand helper that allows us to test if a certain subcommand was run. It requires storing the GIT_TRACE2_EVENT logs in a file.
A negation mode is available that will be used in later tests.

(That last part is one way to check that the new git maintenance run --auto did something: store the GIT_TRACE2_EVENT log in a file and look for the subcommand in it.)
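A minimal sketch of that check for a save script, assuming a POSIX shell. The helper name subcommand_was_run and the trace file name are made up for illustration; GIT_TRACE2_EVENT is the real trace2 setting and has to point to an absolute path. The helper matches a prefix of the recorded argv, because the exact arguments that git maintenance hands to git gc may differ between Git versions.

```shell
# Hypothetical helper (not part of Git): succeed if the trace2 event log
# in $1 records a process whose argv starts with the remaining arguments.
subcommand_was_run() {
    trace_file=$1
    shift
    prefix=""
    for arg in "$@"; do
        # Build the JSON-style prefix: "git","gc","--auto"
        prefix="${prefix:+$prefix,}\"$arg\""
    done
    # trace2 events carry the command line as "argv":[...]
    grep -F -q -e "\"argv\":[$prefix" "$trace_file"
}
```

In a script, this could be used by running GIT_TRACE2_EVENT="$PWD/maintenance-trace.json" git maintenance run --auto, then calling subcommand_was_run maintenance-trace.json git gc to see whether a gc process was actually spawned.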

git maintenance now includes in its man page:

git-maintenance(1)

NAME

git-maintenance – Run tasks to optimize Git repository data

SYNOPSIS

git maintenance run [<options>]

DESCRIPTION

Run tasks to optimize Git repository data, speeding up other Git commands
and reducing storage requirements for the repository.

Git commands that add repository data, such as git add or git fetch,
are optimized for a responsive user experience. These commands do not take
time to optimize the Git data, since such optimizations scale with the full
size of the repository while these user commands each perform a relatively
small action.

The git maintenance command provides flexibility for how to optimize the
Git repository.

SUBCOMMANDS

run

Run one or more maintenance tasks.

TASKS

gc

Clean up unnecessary files and optimize the local repository. “GC”
stands for “garbage collection,” but this task performs many
smaller tasks. This task can be expensive for large repositories,
as it repacks all Git objects into a single pack-file. It can also
be disruptive in some situations, as it deletes stale data. See
git gc for more details on garbage collection in Git.

OPTIONS

--auto

When combined with the run subcommand, run maintenance tasks
only if certain thresholds are met. For example, the gc task
runs when the number of loose objects exceeds the number stored
in the gc.auto config setting, or when the number of pack-files
exceeds the gc.autoPackLimit config setting.
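To get a rough idea in advance of whether the gc task would fire under --auto, the thresholds above can be approximated in a script. This is only a sketch: the function name is hypothetical, the defaults (6700 loose objects, 50 packs) are the documented defaults of gc.auto and gc.autoPackLimit, and Git's own check works on an estimate of the loose object count, so results near the threshold can differ.

```shell
# Hypothetical helper: approximate the thresholds described for the gc task.
# $1 = loose object count, $2 = pack count,
# $3 = gc.auto (default 6700), $4 = gc.autoPackLimit (default 50).
gc_auto_would_trigger() {
    loose=$1
    packs=$2
    auto=${3:-6700}
    pack_limit=${4:-50}
    # gc.auto set to 0 disables automatic gc altogether
    [ "$auto" -gt 0 ] || return 1
    [ "$loose" -gt "$auto" ] && return 0
    # gc.autoPackLimit set to 0 disables the pack-count check
    [ "$pack_limit" -gt 0 ] && [ "$packs" -gt "$pack_limit" ]
}
```

The two counts can be read from the "count:" and "packs:" lines of git count-objects -v, and the limits from git config gc.auto and git config gc.autoPackLimit when they are set.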


maintenance: replace run_auto_gc()

Signed-off-by: Derrick Stolee

The run_auto_gc() method is used in several places to trigger a check for repo maintenance after some Git commands, such as ‘git commit’ or ‘git fetch’.

To allow for extra customization of this maintenance activity, replace the ‘git gc --auto [--quiet]’ call with one to ‘git maintenance run --auto [--quiet]’.
As we extend the maintenance builtin with other steps, users will be able to select different maintenance activities.

Rename run_auto_gc() to run_auto_maintenance() to be clearer what is happening on this call, and to expose all callers in the current diff. Rewrite the method to use a struct child_process to simplify the calls slightly.

Since ‘git fetch’ already allows disabling the ‘git gc --auto’ subprocess, add an equivalent option with a different name to be more descriptive of the new behavior: ‘--[no-]maintenance’.

fetch-options now includes in its man page:

Run git maintenance run --auto at the end to perform automatic
repository maintenance if needed. (--[no-]auto-gc is a synonym.)
This is enabled by default.

git clone now includes in its man page:

which automatically call git maintenance run --auto. (See
git maintenance.)


Plus, your save script will be able to make git maintenance do more than git gc ever could, thanks to tasks.

maintenance: add --task option

Signed-off-by: Derrick Stolee

A user may want to only run certain maintenance tasks in a certain order.

Add the --task=<task> option, which allows a user to specify an ordered list of tasks to run. These cannot be run multiple times, however.

Here is where our array of maintenance_task pointers becomes critical. We can sort the array of pointers based on the task order, but we do not want to move the struct data itself in order to preserve the hashmap references. We use the hashmap to match the --task=<task> arguments into the task struct data.

Keep in mind that the ‘enabled’ member of the maintenance_task struct is a placeholder for a future ‘maintenance.<task>.enabled’ config option. Thus, we use the ‘enabled’ member to specify which tasks are run when the user does not specify any --task=<task> arguments.
The ‘enabled’ member should be ignored if --task=<task> appears.

git maintenance now includes in its man page:

Run one or more maintenance tasks. If one or more --task=<task>
options are specified, then those tasks are run in the provided
order. Otherwise, only the gc task is run.

git maintenance now includes in its man page:

--task=<task>

If this option is specified one or more times, then only run the
specified tasks in the specified order. See the ‘TASKS’ section
for the list of accepted <task> values.
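As an illustration for a save script, here is a small wrapper that turns a list of task names into repeated --task options, in order. The function name and the GIT override variable are assumptions made for this sketch; only git maintenance run and --task=<task> come from the documentation above.

```shell
# Hypothetical wrapper: run the named maintenance tasks in the given order.
# GIT is an assumed override hook for this sketch (defaults to the real git).
run_maintenance_tasks() {
    task_args=""
    for task in "$@"; do
        task_args="$task_args --task=$task"
    done
    # Intentionally unquoted so each --task=<task> becomes its own argument
    ${GIT:-git} maintenance run $task_args
}
```

For example, run_maintenance_tasks commit-graph gc would run the commit-graph task first and gc second. Each task may only be named once, as the commit message notes.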

And:

maintenance: create maintenance.<task>.enabled config

Signed-off-by: Derrick Stolee

Currently, a normal run of “git maintenance run” will only run the ‘gc’ task, as it is the only one enabled.
This is mostly for backwards-compatible reasons since “git maintenance run --auto” commands replaced previous “git gc --auto” commands after some Git processes.

Users could manually run specific maintenance tasks by calling “git maintenance run --task=<task>” directly.

Allow users to customize which steps are run automatically using config. The ‘maintenance.<task>.enabled’ option then can turn on these other tasks (or turn off the ‘gc’ task).

git config now includes in its man page:

maintenance.<task>.enabled

This boolean config option controls whether the maintenance task
with name <task> is run when no --task option is specified to
git maintenance run. These config values are ignored if a
--task option exists.
By default, only maintenance.gc.enabled is true.

git maintenance now includes in its man page:

Run one or more maintenance tasks. If one or more --task options
are specified, then those tasks are run in that order. Otherwise,
the tasks are determined by which maintenance.<task>.enabled
config options are true.
By default, only maintenance.gc.enabled is true.

git maintenance now also includes in its man page:

If no --task=<task>
arguments are specified, then only the tasks with
maintenance.<task>.enabled configured as true are considered.
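A possible configuration, to be run inside a repository with Git 2.29 or later. Whether you want gc turned off is a choice for your own setup; the commands below only illustrate the option names quoted above.

```shell
# Enable the commit-graph task for plain "git maintenance run" in this repository
git config maintenance.commit-graph.enabled true
# Optionally turn the default gc task off
git config maintenance.gc.enabled false
# With no --task option, the enabled tasks are the ones that run
git maintenance run
```

Passing any --task option on the command line makes Git ignore these settings for that run.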


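Another way to know whether git maintenance run is currently doing anything is to check for its lock file. The commit message below names .git/maintenance.lock; since the lock path is built from the object directory (see the SEGFAULT fix at the end of this answer), the file may actually sit at .git/objects/maintenance.lock: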

maintenance: take a lock on the objects directory

Signed-off-by: Derrick Stolee

Performing maintenance on a Git repository involves writing data to the .git directory, which is not safe to do with multiple writers attempting the same operation.
Ensure that only one ‘git maintenance’ process is running at a time by holding a file-based lock.

Simply the presence of the .git/maintenance.lock file will prevent future maintenance. This lock is never committed, since it does not represent meaningful data. Instead, it is only a placeholder.

If the lock file already exists, then no maintenance tasks are attempted. This will become very important later when we implement the ‘prefetch’ task, as this is our stop-gap from creating a recursive process loop between ‘git fetch’ and ‘git maintenance run --auto’.
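A save script could test for that lock before (or while) maintenance runs. The sketch below is an assumption-laden helper, not something Git ships: it checks both the path named in the commit message and the same file name under the objects directory, since the lock path is derived from the object directory.

```shell
# Hypothetical helper: print the maintenance lock file if one is present.
# $1 = path to the .git directory (defaults to ./.git).
maintenance_lock_present() {
    git_dir=${1:-.git}
    for lock in "$git_dir/maintenance.lock" "$git_dir/objects/maintenance.lock"; do
        if [ -e "$lock" ]; then
            printf '%s\n' "$lock"
            return 0
        fi
    done
    return 1
}
```

For example, maintenance_lock_present "$(git rev-parse --git-dir)" succeeds while a maintenance process holds the lock, and fails once it has finished.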


You can also check if git gc/git maintenance will have to do anything.

maintenance: use pointers to check --auto

Signed-off-by: Derrick Stolee

The ‘git maintenance run’ command has an ‘--auto’ option. This is used by other Git commands such as ‘git commit’ or ‘git fetch’ to check if maintenance should be run after adding data to the repository.

Previously, this --auto option was only used to add the argument to the ‘git gc’ command as part of the ‘gc’ task.
We will be expanding the other tasks to perform a check to see if they should do work as part of the --auto flag, when they are enabled by config.

  • First, update the ‘gc’ task to perform the auto check inside the maintenance process.
    This prevents running an extra ‘git gc --auto’ command when not needed.
    It also shows a model for other tasks.

  • Second, use the ‘auto_condition’ function pointer as a signal for whether we enable the maintenance task under ‘--auto’.
    For instance, we do not want to enable the ‘fetch’ task in ‘--auto’ mode, so that function pointer will remain NULL.

We continue to pass the ‘--auto’ option to the ‘git gc’ command when necessary, because the gc.autoDetach config option changes behavior.
Likely, we will want to absorb the daemonizing behavior implied by gc.autoDetach as a maintenance.autoDetach config option.


To illustrate what git maintenance will do that git gc won’t:

maintenance: add commit-graph task

Signed-off-by: Derrick Stolee

The first new task in the ‘git maintenance’ builtin is the ‘commit-graph’ task.
This updates the commit-graph file incrementally with the command

git commit-graph write --reachable --split  

By writing an incremental commit-graph file using the “--split” option we minimize the disruption from this operation.

The default behavior is to merge layers until the new “top” layer is less than half the size of the layer below. This provides quick writes most of the time, with the longer writes following a power law distribution.

Most importantly, concurrent Git processes only look at the commit-graph-chain file for a very short amount of time, so they will very likely not be holding a handle to the file when we try to replace it. (This only matters on Windows.)

If a concurrent process reads the old commit-graph-chain file, but our job expires some of the .graph files before they can be read, then those processes will see a warning message (but not fail). This could be avoided by a future update to use the --expire-time argument when writing the commit-graph.

git maintenance now includes in its man page:

commit-graph

The commit-graph job updates the commit-graph files incrementally,
then verifies that the written data is correct.

The incremental
write is safe to run alongside concurrent Git processes since it
will not expire .graph files that were in the previous
commit-graph-chain file. They will be deleted by a later run based
on the expiration delay.

And:

maintenance: add auto condition for commit-graph task

Signed-off-by: Derrick Stolee

Instead of writing a new commit-graph in every ‘git maintenance run --auto’ process (when maintenance.commit-graph.enabled is configured to be true), only write when there are “enough” commits not in a commit-graph file.

This count is controlled by the maintenance.commit-graph.auto config option.

To compute the count, use a depth-first search starting at each ref, and leaving markers using the SEEN flag.
If this count reaches the limit, then terminate early and start the task.
Otherwise, this operation will peel every ref and parse the commit it points to. If these are all in the commit-graph, then this is typically a very fast operation.

Users with many refs might feel a slow-down, and hence could consider updating their limit to be very small. A negative value will force the step to run every time.

git config now includes in its man page:

maintenance.commit-graph.auto

This integer config option controls how often the commit-graph task
should be run as part of git maintenance run --auto.

  • If zero, then the commit-graph task will not run with the --auto option.
  • A negative value will force the task to run every time.
  • Otherwise, a positive value implies the command should run when the number of
    reachable commits that are not in the commit-graph file is at least
    the value of maintenance.commit-graph.auto.

The default value is 100.
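The three cases above can be summarized as a small decision function. This is a sketch of the documented rule only, with a hypothetical function name; Git computes the number of reachable commits missing from the commit-graph internally, so the second argument is something you would have to supply yourself.

```shell
# Hypothetical helper mirroring the documented rule.
# $1 = value of maintenance.commit-graph.auto
# $2 = number of reachable commits not yet in the commit-graph file
commit_graph_auto_should_run() {
    limit=$1
    missing=$2
    # Zero: the task never runs under --auto
    if [ "$limit" -eq 0 ]; then return 1; fi
    # Negative: the task runs every time
    if [ "$limit" -lt 0 ]; then return 0; fi
    # Positive: run once at least $limit commits are missing
    [ "$missing" -ge "$limit" ]
}
```

With the default of 100, commit_graph_auto_should_run 100 99 fails and commit_graph_auto_should_run 100 100 succeeds.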


With Git 2.30 (Q1 2021), improving the test coverage of the commit-graph task that “git maintenance” runs as needed led to the discovery and fix of a bug.

See commit d334107 (12 Oct 2020), and commit 8f80180 (08 Oct 2020) by Derrick Stolee (derrickstolee).
(Merged by Junio C Hamano — gitster in commit 0be2d65, 02 Nov 2020)

maintenance: test commit-graph auto condition

Signed-off-by: Derrick Stolee

The auto condition for the commit-graph maintenance task walks refs looking for commits that are not in the commit-graph file.
This was added in 4ddc79b2 (“maintenance: add auto condition for commit-graph task”, 2020-09-17, Git v2.29.0-rc0 — merge listed in batch #17) but was left untested.

The initial goal of this change was to demonstrate the feature works properly by adding tests. However, there was an off-by-one error that caused the basic tests around maintenance.commit-graph.auto=1 to fail when it should work.

The subtlety is that if a ref tip is not in the commit-graph, then we were not adding that to the total count. In the test, we see that we have only added one commit since our last commit-graph write, so the auto condition would say there is nothing to do.

The fix is simple: add the check for the commit-graph position to see that the tip is not in the commit-graph file before starting our walk. Since this happens before adding to the DFS stack, we do not need to clear our (currently empty) commit list.

This does add some extra complexity for the test, because we also want to verify that the walk along the parents actually does some work. This means we need to add at least two commits in a row without writing the commit-graph. However, we also need to make sure no additional refs are pointing to the middle of this list or else the for_each_ref() in should_write_commit_graph() might visit these commits as tips instead of doing a DFS walk. Hence, the last two commits are added with “git commit” instead of “test_commit”.


With Git 2.30 (Q1 2021), “git maintenance run/start/stop” needed to be run in a repository to hold the lockfile they use, but didn’t make sure they are actually in a repository, which has been corrected.

See commit 0a1f2d0 (08 Dec 2020) by Josh Steadmon (steadmon).
See commit e72f7de (26 Nov 2020) by Rafael Silva (raffs).
(Merged by Junio C Hamano — gitster in commit f2a75cb, 08 Dec 2020)

maintenance: fix SEGFAULT when no repository

Signed-off-by: Rafael Silva
Reviewed-by: Derrick Stolee

The “git maintenance run” and “git maintenance start/stop” commands hold a file-based lock at .git/maintenance.lock and .git/schedule.lock respectively.
These locks are used to ensure only one maintenance process is executed at a time, as both operations involve writing data into the repository.

The path to the lock file is built using “the_repository->objects->odb->path”, which results in a SEGFAULT when no repository is available, as “the_repository->objects->odb” is set to NULL.

Let’s teach the maintenance command to use the RUN_SETUP option, which will provide the validation and fail when running outside of a repository. This fixes the SEGFAULT for all three operations and makes the behaviour consistent across all subcommands.

Setting RUN_SETUP also provides the same protection for all subcommands, given that “register” and “unregister” also need to be executed inside a repository.

Furthermore, let’s remove the local validation implemented by “register” and “unregister”, as it will no longer be required with the new option.
