How we spent two weeks hunting an NFS bug in the Linux kernel

UPDATE 2019-08-06: This bug has now been resolved in the following
distributions:

Red Hat Enterprise Linux 7
Ubuntu
Linux mainline: Backported to 4.14-stable and 4.19-stable

On Sep. 14, the GitLab support team escalated a critical
problem encountered by one of our customers: GitLab would run fine for a
while, but after some time users encountered errors. When attempting to
clone certain repositories via Git, users would see an opaque Stale file
error message. The error persisted for a long time, blocking employees
from working, unless a system administrator intervened manually by
running ls in the directory itself.

Thus launched an investigation into the inner workings of Git and the
Network File System (NFS). The investigation uncovered a bug with the
Linux v4.0 NFS client and culminated in a kernel patch that was written by
Trond Myklebust
and merged in the latest mainline Linux kernel
on Oct. 26.

This post describes the journey of investigating the issue and
details the thought process and tools by which we tracked down the
bug. It was inspired by the fine detective work in How I spent two
weeks hunting a memory leak in Ruby
by Oleg Dashevskii.

More importantly, this experience exemplifies how open source software
debugging has become a team sport that involves expertise across
multiple people, companies, and locations. The GitLab motto "everyone can
contribute" applies not only to GitLab itself, but also to other open
source projects, such as the Linux kernel.

Reproducing the bug

While we have run NFS on GitLab.com for many years, we have stopped
using it to access repository data across our application
machines. Instead, we have abstracted all Git calls to
Gitaly.
Still, NFS remains a supported configuration for our customers who
manage their own installation of GitLab, but we had never seen the exact
problem described by the customer before.

Our customer gave us a few important clues:

1. The full error message read, fatal: Couldn't read ./packed-refs: Stale file handle.
2. The error seemed to start when they started a manual Git garbage
   collection run via git gc.
3. The error would go away if a system administrator ran ls in the
   directory.
4. The error also would go away after the git gc process ended.

The first two items seemed obviously related. When you push to a branch
in Git, Git creates a loose reference, a fancy name for a file that
points your branch name to the commit. For example, a push to master
will create a file called refs/heads/master in the repository:

$ cat refs/heads/master
2e33a554576d06d9e71bfd6814ee9ba3a7838963

git gc has several jobs, but one of them is to collect these loose
references (refs) and bundle them up into a single file called
packed-refs. This makes things a bit faster by eliminating the need to
read lots of little files in favor of reading one large one. For
example, after running git gc, an example packed-refs might look
like:

# pack-refs with: peeled fully-peeled sorted
564c3424d6f9175cf5f2d522e10d20d781511bf1 refs/heads/10-8-stable
edb037cbc85225261e8ede5455be4aad771ba3bb refs/heads/11-0-stable
94b9323033693af247128c8648023fe5b53e80f9 refs/heads/11-1-stable
2e33a554576d06d9e71bfd6814ee9ba3a7838963 refs/heads/master
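
To make the format concrete, here is a minimal sketch of reading such a file. The function name parse_packed_refs is hypothetical and this is not how Git itself is implemented; it only assumes the layout shown above, plus the fact that Git may add lines starting with ^ to record peeled tag targets.

```python
def parse_packed_refs(text):
    """Return a {ref_name: sha} mapping from the contents of a packed-refs file."""
    refs = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, the "# pack-refs with:" header, and "^<sha>"
        # lines, which hold the peeled target of the preceding tag.
        if not line or line.startswith("#") or line.startswith("^"):
            continue
        sha, name = line.split(" ", 1)
        refs[name] = sha
    return refs


# Sample taken from the excerpt above.
SAMPLE = """\
# pack-refs with: peeled fully-peeled sorted
564c3424d6f9175cf5f2d522e10d20d781511bf1 refs/heads/10-8-stable
edb037cbc85225261e8ede5455be4aad771ba3bb refs/heads/11-0-stable
94b9323033693af247128c8648023fe5b53e80f9 refs/heads/11-1-stable
2e33a554576d06d9e71bfd6814ee9ba3a7838963 refs/heads/master
"""

refs = parse_packed_refs(SAMPLE)
print(refs["refs/heads/master"])
```

One file read yields every ref at once, which is the speedup described above.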

How exactly is this packed-refs file created? To answer that, we ran
strace git gc with a loose ref present. Here are the pertinent lines
from that:

28705 open("/tmp/libgit2/.git/packed-refs.lock", O_RDWR|O_CREAT|O_EXCL|O_CLOEXEC, 0666) = 3
28705 open(".git/packed-refs", O_RDONLY) = 3
28705 open("/tmp/libgit2/.git/packed-refs.new", O_RDWR|O_CREAT|O_EXCL|O_CLOEXEC, 0666) = 4
28705 rename("/tmp/libgit2/.git/packed-refs.new", "/tmp/libgit2/.git/packed-refs") = 0
28705 unlink("/tmp/libgit2/.git/packed-refs.lock") = 0

The system calls showed that git gc did the following:

1. Open packed-refs.lock. This tells other processes that packed-refs is locked and cannot be changed.
2. Open packed-refs.new.
3. Write loose refs to packed-refs.new.
4. Rename packed-refs.new to packed-refs.
5. Remove packed-refs.lock.
6. Remove loose refs.
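
The sequence above can be sketched in a few lines of Python. This is an illustration of the lock, write, rename pattern seen in the strace output, not Git's actual code: the pack_refs function and its loose_refs argument are hypothetical, and the sketch skips merging entries from an existing packed-refs file, which real Git does.

```python
import os
import tempfile


def pack_refs(git_dir, loose_refs):
    """Mimic the observed steps: lock, write .new, rename, unlock, drop loose refs."""
    packed = os.path.join(git_dir, "packed-refs")
    lock = packed + ".lock"
    new = packed + ".new"

    # Step 1: O_CREAT|O_EXCL raises FileExistsError if the lock already exists.
    lock_fd = os.open(lock, os.O_RDWR | os.O_CREAT | os.O_EXCL, 0o666)
    try:
        # Steps 2-3: write every loose ref into packed-refs.new.
        new_fd = os.open(new, os.O_RDWR | os.O_CREAT | os.O_EXCL, 0o666)
        with os.fdopen(new_fd, "w") as f:
            for name in sorted(loose_refs):
                f.write(f"{loose_refs[name]} {name}\n")
        # Step 4: os.replace maps to rename(2) on POSIX and swaps the file
        # in atomically, replacing any existing packed-refs.
        os.replace(new, packed)
    finally:
        # Step 5: release the lock.
        os.close(lock_fd)
        os.unlink(lock)
    # Step 6: the loose ref files are now redundant.
    for name in loose_refs:
        os.unlink(os.path.join(git_dir, name))


# Demo in a throwaway directory standing in for .git.
with tempfile.TemporaryDirectory() as git_dir:
    sha = "2e33a554576d06d9e71bfd6814ee9ba3a7838963"
    os.makedirs(os.path.join(git_dir, "refs", "heads"))
    with open(os.path.join(git_dir, "refs", "heads", "master"), "w") as f:
        f.write(sha + "\n")

    pack_refs(git_dir, {"refs/heads/master": sha})

    with open(os.path.join(git_dir, "packed-refs")) as f:
        packed_contents = f.read()
    lock_left = os.path.exists(os.path.join(git_dir, "packed-refs.lock"))
    loose_left = os.path.exists(os.path.join(git_dir, "refs", "heads", "master"))

print(packed_contents, end="")
```

Readers never see a half-written packed-refs with this pattern, because the new file only becomes visible under its final name once the rename completes.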

The fourth step is the key here: the rename where Git puts packed-refs
into action. In addition to collecting loose refs, git gc also
performs a more expensive task of scanning for unused objects and
removing them. This task can take over an hour for large
repositories.

That made us wonder: for a large repository, does git gc keep the file
open while it's running this sweep? Looking at the strace logs and
probing the process with lsof, we found that it did the following:

[Figure: Git Garbage Collection]

Notice that packed-refs is closed only at the end, after the potentially
long Garbage collect objects step takes place.

That made us wonder: how does NFS behave when one node has packed-refs
open while another renames over that file?

To find out, we asked the customer to run the following experiment on
two different machines (Alice and Bob):

On the shared NFS volume, create two files: test1.txt and
test2.txt with different contents to make it easy to distinguish them:
```
alice $ echo "1 - Old file" > /path/to/nfs/test1.txt
alice $ echo "2 - New file" > /path/to/nfs/test2.txt
```

On machine Alice, keep a file open to test1.txt:

 alice $ irb
 irb(main):001:0> File.open('/path/to/nfs/test1.txt')

On machine Alice, show the contents of test1.txt continuously:
```
alice $ while true; do cat test1.txt; done
```
Then on machine Bob, run:
```
bob $ mv -f test2.txt test1.txt
```

This last step emulates what git gc does with packed-refs by
overwriting the existing file.

On the customer's machine, the result looked something like:

1 - Old file
1 - Old file
1 - Old file
cat: test1.txt: Stale file handle

Bingo! We seemed to reproduce the problem in a controlled way. However,
the same experiment using a Linux NFS server did not have this
problem. The result was what you would expect: the new contents were
picked up after the rename:

1 - Old file
1 - Old file
1 - Old file
2 - New file
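
For comparison, the same rename-over-an-open-file pattern can be tried on a single machine with a local filesystem. The sketch below assumes a POSIX system and does not involve NFS, so it cannot reproduce the bug; it only shows the baseline behavior: a handle opened before the rename keeps reading the old contents, while a fresh open by path sees the new ones. The variable names are illustrative.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    old_path = os.path.join(d, "test1.txt")
    new_path = os.path.join(d, "test2.txt")
    with open(old_path, "w") as f:
        f.write("1 - Old file\n")
    with open(new_path, "w") as f:
        f.write("2 - New file\n")

    # Like the irb session on Alice: hold test1.txt open.
    held = open(old_path)

    # Like "mv -f test2.txt test1.txt" on Bob: rename over the open file.
    os.replace(new_path, old_path)

    # The old handle still refers to the original, now unlinked, file.
    via_held_handle = held.read()
    # A fresh open by path resolves to the renamed file.
    with open(old_path) as f:
        via_fresh_open = f.read()
    held.close()

print(via_held_handle, end="")
print(via_fresh_open, end="")
```

On NFS the two handles live on different clients, so each client has to notice that the path now points at a different file, and that is where the stale file handle showed up.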
