Blog Engineering How a fix in Go 1.9 sped up our Gitaly service by 30x
January 23, 2018
7 min read

How a fix in Go 1.9 sped up our Gitaly service by 30x

After noticing a worrying pattern in Gitaly's performance, we uncovered an issue with fork locking affecting virtual memory size. Here's how we figured out the problem and how to fix it.

cover.jpg

Gitaly is a Git RPC service that we are currently rolling out
across GitLab.com, to replace our legacy NFS-based file-sharing solution. We expect it to be faster, more stable
and the basis for amazing new features in the future.

We're still in the process of porting Git operations to Gitaly, but the service has been
running in production on GitLab.com for about nine months, and currently peaks at about 1,000
gRPC requests per second. We expect the migration effort to be completed
by the beginning of April at which point all Git operations in the GitLab application will
use the service and we'll be able to decommission NFS infrastructure.

Worrying performance improvements

The first time we realized that something might be wrong was shortly after we'd finished deploying a new release.

We were monitoring the performance of one of the gRPC endpoints for the Gitaly service and noticed that the
99th percentile performance of the endpoint had dropped from 400ms down to 100ms.

400ms to 100ms latency drop
Latencies drop from 400ms to 100ms after a deploy, for no good reason

This should have been fantastic news, but it wasn't. There were no changes that should have led to faster
response times. We hadn't optimized anything in that release; we hadn't changed the runtime and the new
release was using the same version of Git.

Everything should have been exactly the same.

We started digging into the data a little more and quickly realised that 400ms is a very high latency for
an operation that simply confirms the existence of a Git reference.

How long had it been this way? Well it started about 24 hours after the previous deployment.

100ms to 400ms latency hike
Latencies rising over a 24 hour period following a deployment, for no good reason

When browsing our Prometheus performance data, it quickly became apparent that this pattern was being repeated with each
deployment: things would start fast and gradually slow down. This was occurring across all endpoints. It had been this way for a while.

The first assumption was that there was some sort of resource leak in the application, causing the host to slow
down over time. Unfortunately the data didn't back this up. CPU usage of the Gitaly service did increase, but the
hosts still had lots of capacity.

Gitaly CPU charts
Gitaly CPU increasing with process age, but not enough to explain the problem

At this point, we still didn't have any good leads as to the cause of the problem, so we decided to further
improve the observability of the application by adding pprof profiling support
and cAdvisor metrics.

Profiling

Adding pprof support to a Go process is very easy.
The process already has a Prometheus listener and we added a pprof handler on the same listener.

Since production teams would need to be able to perform the profiling without our assistance, we
also added a runbook.

Go's pprof support is easy to use and in our testing, we found that the overhead it
added to production workloads was negligible, meaning we could use it in production without concern
about the impact it would have on site performance.

cAdvisor

The Gitaly service spawns Git child processes for many of its endpoints. Unfortunately these Git
child processes don't have the same instrumentation as the parent process so it was
difficult to tell if they were contributing to the problem. (Note: we record getrlimit(2) metrics for Git processes but cannot observe grandchild processes spawned by Git, which often do much of the heavy lifting)

On GitLab.com, Gitaly is managed through systemd, which will automatically create a cgroup for
each service it manages.

This means that Gitaly and its child processes are contained within a single cgroup, which we
could monitor with cAdvisor, a Google monitoring tool
which supports cgroups and is compatible with Prometheus.

Although we didn't have direct metrics to determine the behavior of the Git processes, we could
infer it using the cgroup metrics and the Gitaly process metrics: the difference between the
two would tell us the resources (CPU, memory, etc) being consumed by the Git child processes.

At our request, the production team added cAdvisor to the Gitaly servers.

Having cAdvisor gives us the ability to know what the Gitaly service, including all its child
processes, is doing.

cAdvisor graphs for the Gitaly cgroup
cAdvisor graphs of the Gitaly cgroup

From bad to worse. Much, much worse...

In the meantime, the situation had got far worse.
Instead of only seeing gradual latency increases over time, we were now seeing far more serious lockups.

Individual Gitaly server instances would grind to a halt, to the point where all new incoming TCP connections
were not being accepted. This proved to be a problem to using pprof: during the lockup the connection
would time out when attempting to profile the process. Since the reason we added pprof was to observe the
process under duress, that approach was a bust.

Interestingly, during a lock-up, CPU would actually decrease – the system was not overloaded, but actually
idled. Iops, iowait and CPU would all drop way down.

Eventually, after a few minutes the service would recover and there would be a surge in backlogged
requests. Usually though, as soon as the state was detected, the production team would restart the
service manually.

The team spent a significant amount of time trying to recreate the problem locally, with little success.

Forking locks

Without pprof, we fell back to SIGABRT thread dumps
of hung processes. Using these, we determined that the process had a large amount of contention around syscall.ForkLock
during the lockups. In one dump, 1,400 goroutines were blocked waiting on ForkLock – most for several minutes.

syscall.ForkLock has the following documentation:

Lock synchronizing creation of new file descriptors with fork.

Each Gitaly server instance was fork/exec'ing Git processes about 20 times per second so we seemed to finally have a very promising lead.

Serendipity

Researching ForkLock led us to an issue on the Go repository,
opened in 2013, about switching from fork/exec to clone(2) with CLONE_VFORK and CLONE_VM
on systems that support it: golang/go#5838

The clone(2) syscall with CLONE_VFORK and CLONE_VM is the same as
the posix_spawn(3) c function, but the latter is easier to
refer to, so let's use that.

When using fork, the child process will start with a copy of the parent processes' memory.
Unfortunately this process takes longer the larger the virtual memory footprint the process has.
Even with copy-on-write, it can take several hundred milliseconds in a memory-intensive process.
posix_spawn doesn't copy the parent processes' memory space and has a roughly constant time.

Some good benchmarks of fork/exec vs. posix_spawn can be found here: https://github.com/rtomayko/posix-spawn#benchmarks

This seemed like a possible explanation. Over time, the virtual memory size (VMM) of the Gitaly process would increase. As VMM
increased, each fork(2) syscall would take longer. As fork latency increased, syscall.ForkLock contention would increase.
If fork time exceeded the frequency of fork requests, the system could temporarily lock up entirely.

(Interestingly, TCPListener.Accept
also interacts with syscall.ForkLock,
although only on older versions of Linux. Could this be the cause of our failure to connect to the pprof listener during a lockup?)

By some incredibly good luck, golang/go#5838, the switch from fork to posix_spawn, had,
after several years' delay, recently landed in Go 1.9, just in time for us. Gitaly had been compiled with Go 1.8.
We quickly built and tested a new binary with Go 1.9 and manually deployed this
on one of our production servers.

Spectacular results

Here's the CPU usage of Gitaly processes across the fleet:

CPU after Go 1.9
CPU after recompiling with Go 1.9

Here's the 99th percentile latency figures. This chart is using a logarithmic scale, so we're talking about two orders of
magnitude faster!

30x latency drops with Go 1.9
Endpoint latency after recompiling with Go 1.9 (log scale)

Conclusion

Recompiling with Go 1.9 solved the problem, thanks to the switch to posix_spawn. We learned several other lessons
in the process too:

  1. Having solid application monitoring in place allowed us to detect this issue, and start investigating it, far
    earlier than we otherwise would have been able to.
  2. pprof can be really helpful, but may not help when a process
    has locked up and won't accept new connections. pprof is lightweight enough that you should consider adding it to your application before you need it.
  3. When all else fails, SIGABRT thread dumps might help.
  4. cAdvisor is great for monitoring cgroups. Systemd services each run in
    their own cgroup, so cAdvisor is an easy way of monitoring a service and all its child processes, together.

Photo by Javier García on Unsplash

We want to hear from you

Enjoyed reading this blog post or have questions or feedback? Share your thoughts by creating a new topic in the GitLab community forum. Share your feedback

Ready to get started?

See what your team could do with a unified DevSecOps Platform.

Get free trial

New to GitLab and not sure where to start?

Get started guide

Learn about what GitLab can do for your team

Talk to an expert