Blog Product Git 2.42 release: Here are four of our contributions in detail
October 12, 2023
8 min read

Git 2.42 release: Here are four of our contributions in detail

Find out how GitLab's Git team helped improve Git 2.42.

git-241.jpg

Git 2.42
was officially released on August 21, 2023, and included some
improvements from GitLab's Git team. Git is the foundation of
repository data at GitLab. GitLab's Git team works on new features, performance improvements, documentation improvements,
and growing the Git community. Often our contributions to Git have a
lot to do with the way we integrate Git into our services at
GitLab.

We previously shared some of our improvements that were included in the Git 2.41 release. Here are some highlights from the Git 2.42 release, and a
window into how we use Git on the server side at GitLab.

1. Prevent certain refs from being packed

Write-ahead logging

In Gitaly, we
want to use a write-ahead log
to replicate Git operations on different machines.

This means that the Git objects and references that should be changed
by a Git operation are first kept in a log entry. Then, when all the
machines have agreed that the operation should proceed, the log entry
is applied so the corresponding Git objects and references are
actually added to the repositories on all the machines.

Need for temporary references

Between the time when a specific log entry is first written and when
it is applied, other log entries could be applied which could remove
some objects and references. It could happen that these objects and
references are needed to apply the specific log entry though.

So when we log an entry, we have to make sure that all the objects and
references that it needs to be properly applied will not be removed
until that entry is either actually applied or discarded.

The best way to make sure things are kept in Git is to create new Git
references pointing to these things. So we decided to use temporary
references for that purpose. They would be created when a log entry is
written, and then deleted when that entry is either applied or
discarded.

Packed-refs performance

Git can store references in "loose" files, with one reference per
file, or in the packed-refs file, which contains many of them. The
git pack-refs command is used to pack some references from "loose"
files into the packed-refs file.

For reading a lot of references, the packed-refs file is very
efficient, but for writing or deleting a single reference, it is not
so efficient as rewriting the whole packed-refs file is required.

As temporary references are to be created and then deleted soon after,
storing them in the packed-refs file would not be efficient. It
would be better to store them in "loose" files.

The git pack-refs command had no way to be told precisely which refs
should be packed or not though. By default it would repack all the
tags (which are refs in refs/tags/) and all the refs that are
already packed. With the --all option one could tell it to repack
all the refs except the hidden refs, broken refs, and symbolic refs,
but that was the only thing that could be controlled.

Improving git pack-refs

We decided to improve git pack-refs by adding two new options to it:

  • --include <pattern> which can be used to specify which refs should be packed
  • --exclude <pattern> which can be used to specify which refs should not be packed

John Cai, Gitaly:Git team engineering manager, implemented these options.

For example, if the refs managed by the write-ahead log are in
refs/wal/, it's now possible the exclude them from being moved into
the packed-refs file by using:

$ git pack-refs --exclude "refs/wal/*"

Details of the patch series, including discussions, can be found
here.

2. Get machine-readable output from git cat-file --batch

Efficiently retrieving Git object information

In GitLab, we often retrieve Git object information. For example, when a
user navigates into the files and directories in a repository, we need
to get the content of the corresponding Git blobs and trees so that
we can show it.

In Gitaly, we use git cat-file to retrieve Git object information
from a Git repository. As it's a frequent operation, it needs to be
performed efficiently, so we use the batch modes of git cat-file
available through the --batch, --batch-check and --batch-command
options.

In these modes, a pointer to a Git object can be repeatedly sent to
the standard input, called 'stdin', of a git cat-file command, while
the corresponding object information is read from the standard ouput,
called 'stdout' of the command. This way we don't need to launch a
new git cat-file command for each object.

GitLab can keep, for example, a git cat-file --batch-command process
running in the background while feeding it commands like
info <object> or contents <object> through its stdin to
get either information about an object or its content.

Newlines in stdin, stdout, and filenames

The commands or pointers to Git objects that are sent through stdin
should be delimited using newline characters, and in the same way git cat-file will use newline characters to delimit the information from
different Git objects in its output. This is a common shell practice
to make it easy to chain commands together. For example, one can
easily get the size (in bytes) of the last three commits on the current
branch using the following:

$ git log -3 --format='%H' | git cat-file --batch-check='%(objectsize)'
285
646
428

Sometimes, though, the pointer to a Git object can contain a filename
or a directory name, as such a pointer is allowed to be in the form
<branch>:<path>. For example HEAD:Documentation is a valid
pointer to the blob or the tree corresponding to the Documentation
path on the current branch.

This used to be an issue because on some systems newline characters
are allowed in file or directory names. So the -z option was
introduced last year in Git 2.38 to allow users to change the input
delimiter in batch modes to the NUL character.

Error output

When the -z option was introduced, it wasn't considered useful to
change the output delimiter to be also the NUL character. This is
because only tree objects can contain paths and the internal format
of tree objects already uses NUL characters to delimit paths.

Unfortunately, it was overlooked that in case of an error the pointer
to the object is displayed in the error message:

$ echo 'HEAD:does-not-exist' | git cat-file --batch
HEAD:does-not-exist missing

As the error messages are printed along with the regular ouput of the
command on stdout, passing in an invalid pointer with a number of
newline characters in it could make it very difficult to parse the
output.

-Z comes to the rescue

Toon Claes, Gitaly senior engineer, initially worked on a
patch to just quote the pointer in the error message, but it was
decided in the Git mailing list discussions related to the patch that
it would be better to just create a new -Z option. This option would
change both the input and the output delimiter to the NUL character,
while the old -z option would be deprecated over time.

So Patrick Steinhardt, Gitaly staff engineer, implemented that new -Z option.

Details of the patch series, including discussions, can be found
here
and here.

3. Pass pseudo-options to git rev-list --stdin

Computing sizes

In GitLab, we need to have different ways to compute the size of Git
related content. For example, we need to know:

  • how much disk space a repository is using
  • how big a specific Git object is
  • how much additional space on a repository is required by a
    specific set of revisions (and the objects they reference)

Knowing "how much disk space a repository is using" is useful to
enforce repository-related quotas and is easy to get using regular
shell and OS features.

Size information about a specific Git object is useful to enforce
quotas related to maximum file size. It can be obtained using, for
example, git cat-file -s <object> or
echo <object> | git cat-file --batch-check='%(objectsize)'
as already seen above.

Computing the space required by a set of revisions is useful, too, as
forks can share Git content in what we call
"pool repositories,"
and we want to discriminate how much content belongs to each forked
repository. Fortunately, git rev-list has a --disk-usage option
for this purpose.

Passing arguments to git rev-list

git rev-list can take a number of different arguments and has a lot
of different options. It's a fundamental command to traverse commit
graphs, and it should be flexible enough to fulfill a lot of different
user needs.

When repositories grow, they often store a lot of references and a lot
of files and directories, so there is often the need to pass a big
number of references or paths as arguments to the
command. References and paths can be quite long though.

To avoid hitting platform limits related to command line length, long
ago, a --stdin mode was added that allowed users to pass revisions
and paths through stdin, instead of as command line
arguments. However, when that was implemented, it was not considered
necessary to allow options or pseudo-options, like --not,
--glob=..., or --all to be passed through stdin.

This appeared to be a problem for GitLab, as for computing sizes for
forked repositories we needed some of the pseudo-options, and it would
have been intricate and possibly buggy to pass some of them and their
arguments as arguments on the command line while others were passed
through stdin.

Allowing pseudo-options

To fix this issue, Patrick Steinhardt implemented a small patch series to
allow pseudo-options through stdin.

With it, in Git 2.42, one can now pass pseudo-options, like --not,
--glob=..., or --all through stdin when the --stdin mode is used.

Details of the patch series, including discussions, can be found
here.

4. Code and test improvements

While looking at some Git code, we are often tempted to modify nearby
code, either to change only its style when the code is ancient and it
would look better using Git's current code style, or to refactor it to
make it cleaner. This is why we sometimes send small patch series that
don't have a real GitLab related purpose.

In Git 2.42, examples of style code improvements we made are the
part1
and
part2
test code modernization patches from John Cai.

And here is
an example of a refactoring to cleanup some code by Patrick Steinhardt.

We want to hear from you

Enjoyed reading this blog post or have questions or feedback? Share your thoughts by creating a new topic in the GitLab community forum. Share your feedback

Ready to get started?

See what your team could do with a unified DevSecOps Platform.

Get free trial

New to GitLab and not sure where to start?

Get started guide

Learn about what GitLab can do for your team

Talk to an expert