At GitLab, everyone can contribute to GitLab.com's availability. We measure availability using several Service Level Indicators (SLIs), but it's not always easy to see how the features you're building are performing. GitLab's features are divided among development groups, and every group has its own dashboard displaying an availability score.
When a group's availability drops below 99.95%, we work with the group to figure out why, and how we can improve the performance or reliability of the features that caused the number to drop. The 99.95% service level objective (SLO) is the same target the infrastructure department has set for GitLab.com's availability.
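To give a rough sense of what that target means (a back-of-the-envelope calculation, not part of GitLab's tooling), a 99.95% objective over a 28-day window leaves an error budget of about 20 minutes, or 0.05% of all measurements:

```python
# Rough error-budget calculation for a 99.95% SLO over a 28-day window.
# This is only an intuition: the actual score is measurement-based (requests
# and jobs), not time-based.
slo = 0.9995
window_days = 28

error_budget_fraction = 1 - slo  # 0.05% of all measurements may fail
budget_minutes = error_budget_fraction * window_days * 24 * 60

print(f"Error budget: {budget_minutes:.1f} minutes per {window_days} days")
# -> Error budget: 20.2 minutes per 28 days
```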
Providing specific data about how features perform on our production systems has made it easier to recognize when performance and availability work needs to be prioritized.
Service availability on GitLab.com
Our infrastructure is separated into multiple services, handling different kinds of traffic but running the same monolithic Rails application. Not all features have a similar usage pattern. For example, on the service handling web requests for GitLab.com we see far more requests related to code_review or team_planning than to source_code_management. It's important that we look at these features in isolation as well as in the service aggregate.
Nobody is better placed to interpret these per-feature numbers than the people who build those features.
The availability score is sourced from the same SLIs that we use to monitor GitLab.com's availability. We calculate it by dividing the number of successful measurements by the total number of measurements over the past 28 days. A measurement can be several things, most commonly a request handled by our Rails application or a background job.
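To make that concrete, here is a minimal sketch of how such a score could be computed per feature category. The data shape and field names are illustrative only, not GitLab's actual implementation:

```python
from collections import defaultdict

# Minimal sketch: availability = successful measurements / total measurements
# over a 28-day window, grouped by feature category.
# (feature_category, success) -- one entry per request or background job.
measurements = [
    ("code_review", True),
    ("code_review", True),
    ("team_planning", False),
    ("source_code_management", True),
]

totals = defaultdict(lambda: {"success": 0, "total": 0})
for feature_category, success in measurements:
    totals[feature_category]["total"] += 1
    totals[feature_category]["success"] += int(success)

SLO = 0.9995  # same target the infrastructure department uses for GitLab.com

for feature_category, counts in totals.items():
    availability = counts["success"] / counts["total"]
    status = "OK" if availability >= SLO else "below SLO"
    print(f"{feature_category}: {availability:.4%} ({status})")
```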
Monitoring feature and service availability
To monitor GitLab.com, we have Grafana dashboards, generated using Grafonnet, that show these source metrics across several dimensions. For example, these are the error rates of our monolithic Rails application, separated by feature:
We also generate multi-window, multi-burn-rate alerts, as defined in Google's SRE workbook.
The red lines represent burn-rate alerting thresholds. The thin threshold means we'll alert if the SLI has spent more than 5% of its monthly error budget in the past 6 hours. The thicker threshold means we'll alert when the SLI has not met its SLO for more than 2% of measurements in the past hour.
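For illustration, here is a minimal sketch of this kind of multi-window, multi-burn-rate check. The windows and burn-rate thresholds below are the SRE workbook's example values, not necessarily the exact thresholds used in GitLab's generated alerts:

```python
# Sketch of a multi-window, multi-burn-rate check in the style of Google's
# SRE workbook. Windows and thresholds are the workbook's examples.
SLO = 0.9995
ERROR_BUDGET = 1 - SLO  # fraction of measurements allowed to fail


def burn_rate(error_ratio: float) -> float:
    """How many times faster than 'exactly on budget' we are burning."""
    return error_ratio / ERROR_BUDGET


def should_alert(error_ratio_by_window: dict[str, float]) -> bool:
    """Alert when both a long and a short window exceed the same burn rate.

    Requiring the short window too makes the alert recover quickly once the
    problem is fixed, instead of staying red for hours.
    """
    rules = [
        # (long window, short window, burn-rate threshold)
        ("1h", "5m", 14.4),  # ~2% of a 30-day budget spent in one hour
        ("6h", "30m", 6.0),  # ~5% of a 30-day budget spent in six hours
    ]
    return any(
        burn_rate(error_ratio_by_window[long_w]) > threshold
        and burn_rate(error_ratio_by_window[short_w]) > threshold
        for long_w, short_w, threshold in rules
    )


# Example: a 1.2% error ratio sustained over the past hour trips the fast rule.
print(should_alert({"1h": 0.012, "5m": 0.015, "6h": 0.004, "30m": 0.005}))  # True
```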
Because GitLab.com's availability number and the development groups' availability numbers are sourced from the same metrics, we can provide similar alerts and graphs tailored to each development group. Problems in features with relatively low traffic would not easily show up in our larger service aggregations. With this mechanism we can surface those problems and put them on the radar of the teams building those features.
Building and adoption
In upcoming posts, we will talk about how we built this tooling and how we worked with other teams to get it adopted into the product prioritization process.