UPDATE 2019-08-06: This bug has now been resolved in the following
distributions:
- Red Hat Enterprise Linux 7
- Ubuntu
- Linux mainline: Backported to 4.14-stable and 4.19-stable
On Sep. 14, the GitLab support team escalated a critical
problem encountered by one of our customers: GitLab would run fine for a
while, but after some time users encountered errors. When attempting to
clone certain repositories via Git, users would see an opaque Stale file
error message. The error message persisted for a long time, blocking
employees from being able to work, unless a system administrator
intervened manually by running ls in the directory itself.
Thus launched an investigation into the inner workings of Git and the
Network File System (NFS). The investigation uncovered a bug with the
Linux v4.0 NFS client and culminated in a kernel patch that was written
by Trond Myklebust and merged into the latest mainline Linux kernel on
Oct. 26.
This post describes the journey of investigating the issue and
details the thought process and tools by which we tracked down the
bug. It was inspired by the fine detective work in "How I spent two
weeks hunting a memory leak in Ruby" by Oleg Dashevskii.
More importantly, this experience exemplifies how open source software
debugging has become a team sport that involves expertise across
multiple people, companies, and locations. The GitLab motto "everyone can
contribute" applies not only to GitLab itself, but also to other open
source projects, such as the Linux kernel.
Reproducing the bug
While we have run NFS on GitLab.com for many years, we have stopped
using it to access repository data across our application
machines. Instead, we have abstracted all Git calls to
Gitaly.
NFS remains a supported configuration for customers who manage their
own installation of GitLab, but we had never before seen the exact
problem this customer described.
Our customer gave us a few important clues:
- The full error message read: "fatal: Couldn't read ./packed-refs:
  Stale file handle".
- The error seemed to begin when they started a manual Git garbage
  collection run via git gc.
- The error would go away if a system administrator ran ls in the
  directory.
- The error would also go away after the git gc process ended.
The first two items seemed obviously related. When you push to a branch
in Git, Git creates a loose reference, a fancy name for a file that
points your branch name to the commit. For example, a push to master
will create a file called refs/heads/master in the repository:
$ cat refs/heads/master
2e33a554576d06d9e71bfd6814ee9ba3a7838963
git gc has several jobs, but one of them is to collect these loose
references (refs) and bundle them up into a single file called
packed-refs. This makes things a bit faster by eliminating the need to
read lots of little files in favor of reading one large one. For
example, after running git gc, packed-refs might look like:
# pack-refs with: peeled fully-peeled sorted
564c3424d6f9175cf5f2d522e10d20d781511bf1 refs/heads/10-8-stable
edb037cbc85225261e8ede5455be4aad771ba3bb refs/heads/11-0-stable
94b9323033693af247128c8648023fe5b53e80f9 refs/heads/11-1-stable
2e33a554576d06d9e71bfd6814ee9ba3a7838963 refs/heads/master
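To make the relationship between loose refs and packed-refs concrete,
here is a minimal sketch of a ref lookup. The helper name resolve_ref is
hypothetical, and the logic is a simplification of what Git actually
does (it ignores symbolic refs, for instance). It assumes only the two
file layouts shown above: a loose ref file takes precedence, and
packed-refs is the fallback.

```shell
#!/bin/sh
# Hypothetical helper, not part of Git: resolve_ref GIT_DIR REFNAME
# Prints the commit ID for REFNAME and returns 0, or returns non-zero
# if the ref is found neither as a loose file nor in packed-refs.
#
# Example: resolve_ref /path/to/repo/.git refs/heads/master
resolve_ref() {
    git_dir=$1
    ref=$2

    # A loose ref is a small file whose content is the commit ID.
    if [ -f "$git_dir/$ref" ]; then
        cat "$git_dir/$ref"
        return 0
    fi

    # Otherwise scan packed-refs for a "<commit-id> <refname>" line.
    # Lines starting with "#" (the header) or "^" (peeled tags) are skipped.
    if [ -f "$git_dir/packed-refs" ]; then
        awk -v ref="$ref" '
            /^[#^]/ { next }
            $2 == ref { print $1; found = 1; exit }
            END { exit !found }
        ' "$git_dir/packed-refs"
        return $?
    fi

    return 1
}
```

The fallback is why a reader that cannot open packed-refs fails outright
once the loose refs have been packed: there is no other place left to
look the ref up.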
How exactly is this packed-refs file created? To answer that, we ran
strace git gc with a loose ref present. Here are the pertinent lines
from the output:
28705 open("/tmp/libgit2/.git/packed-refs.lock", O_RDWR|O_CREAT|O_EXCL|O_CLOEXEC, 0666) = 3
28705 open(".git/packed-refs", O_RDONLY) = 3
28705 open("/tmp/libgit2/.git/packed-refs.new", O_RDWR|O_CREAT|O_EXCL|O_CLOEXEC, 0666) = 4
28705 rename("/tmp/libgit2/.git/packed-refs.new", "/tmp/libgit2/.git/packed-refs") = 0
28705 unlink("/tmp/libgit2/.git/packed-refs.lock") = 0
The system calls showed that git gc did the following:
- Open packed-refs.lock. This tells other processes that packed-refs
  is locked and cannot be changed.
- Open packed-refs.new.
- Write loose refs to packed-refs.new.
- Rename packed-refs.new to packed-refs.
- Remove packed-refs.lock.
- Remove loose refs.
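The lock, write, and rename sequence above is a general pattern for
replacing a file safely, and it can be sketched in a few lines of shell.
This is an illustration under stated assumptions, not Git's
implementation: the helper name replace_atomically is hypothetical, and
the shell's noclobber option stands in for the O_CREAT|O_EXCL flags seen
in the strace output.

```shell
#!/bin/sh
# Hypothetical helper, not part of Git: replace_atomically TARGET
# Reads the new contents from stdin and swaps them into TARGET.
replace_atomically() {
    target=$1
    lock="$target.lock"
    new="$target.new"

    # With noclobber set, ">" fails if the file already exists, so only
    # one process can create the lock file. The subshell keeps the
    # option from leaking into the caller.
    if ! ( set -o noclobber; : > "$lock" ) 2>/dev/null; then
        echo "$target is locked by another process" >&2
        return 1
    fi

    # Write the complete new contents to a side file, then rename it
    # over the target so readers never see a half-written file.
    if cat > "$new" && mv -f "$new" "$target"; then
        status=0
    else
        status=1
        rm -f "$new"
    fi

    # Release the lock whether or not the replacement succeeded.
    rm -f "$lock"
    return $status
}
```

For example, printf 'new contents\n' | replace_atomically /tmp/example.txt
would replace /tmp/example.txt in one step. Note that the rename gives
the target path a new underlying file, which is exactly the situation
the rest of this investigation revolves around.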
The fourth step is the key here: the rename where Git puts packed-refs
into action. In addition to collecting loose refs, git gc also performs
a more expensive task of scanning for unused objects and removing them.
This task can take over an hour for large repositories.
That made us wonder: for a large repository, does git gc keep the file
open while it's running this sweep? Looking at the strace logs and
probing the process with lsof, we found that it did the following:
Notice that packed-refs is closed only at the end, after the potentially
long "Garbage collect objects" step takes place.
That made us wonder: how does NFS behave when one node has packed-refs
open while another renames over that file?
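As a baseline for comparison, here is a small sketch of the same
situation on a single machine with a local filesystem, where no NFS is
involved. The file names are only illustrative. On a typical local
Linux filesystem, the descriptor that was already open keeps reading the
old contents, while a fresh open of the same path sees the new ones, and
neither read fails.

```shell
#!/bin/sh
# Scratch directory on a local filesystem (assumed not to be NFS).
dir=$(mktemp -d)

echo "1 - Old file" > "$dir/test1.txt"
echo "2 - New file" > "$dir/test2.txt"

# Keep test1.txt open on file descriptor 3, like a long-running reader.
exec 3< "$dir/test1.txt"

# Rename over the open file, as git gc does with packed-refs.
mv -f "$dir/test2.txt" "$dir/test1.txt"

# The already-open descriptor still refers to the old file...
old_view=$(cat <&3)
# ...while opening the path again resolves to the new file.
new_view=$(cat "$dir/test1.txt")

# Close the descriptor and clean up.
exec 3<&-
rm -rf "$dir"

echo "open descriptor: $old_view"
echo "fresh open:      $new_view"
```

Across two NFS clients the picture is less simple, because the reader
has to notice that another machine changed which file the path refers
to.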
To find out, we asked the customer to run the following experiment on
two different machines (Alice and Bob):
- On the shared NFS volume, create two files, test1.txt and test2.txt,
  with different contents to make it easy to distinguish them:
    alice $ echo "1 - Old file" > /path/to/nfs/test1.txt
    alice $ echo "2 - New file" > /path/to/nfs/test2.txt
- On machine Alice, keep a file open to test1.txt:
    alice $ irb
    irb(main):001:0> File.open('/path/to/nfs/test1.txt')
- On machine Alice, show the contents of test1.txt continuously:
    alice $ while true; do cat test1.txt; done
- Then on machine Bob, run:
    bob $ mv -f test2.txt test1.txt
This last step emulates what git gc does with packed-refs by
overwriting the existing file.
On the customer's machine, the result looked something like:
1 - Old file
1 - Old file
1 - Old file
cat: test1.txt: Stale file handle
Bingo! We seemed to have reproduced the problem in a controlled way. However,
the same experiment using a Linux NFS server did not have this
problem. The result was what you would expect: the new contents were
picked up after the rename:
1 - Old file
1 - Old file
1 - Old file
2 - New file <