March 10, 2022

How we share SLIs across engineering departments

The Scalability team collaborates with the Development department on SLIs. This first post in the series explains how we made availability information accessible to development groups.


At GitLab, everyone can contribute to GitLab.com's availability. We
measure availability using several Service Level Indicators (SLIs),
but it's not always easy to see how the features you're building are
performing. GitLab's features are divided among development groups,
and every group has its own dashboard displaying an availability
score.

[Figure: Stage group availability]

When a group's availability goes below 99.95%, we work with the group
on figuring out why that is and how we can improve the performance or
reliability of the features that caused their number to drop. The
99.95% service level objective (SLO) is the same target the
infrastructure department has set for
GitLab.com availability.

Providing specific data about how features perform on our production systems has made it easier to recognize when to prioritize performance and availability work.

Service availability on GitLab.com

Our infrastructure is separated into multiple services, handling
different kinds of traffic but running the same monolithic Rails
application. Not all features have a similar usage pattern. For
example, on the service handling web requests for GitLab.com we see a
lot more requests related to code_review or team_planning than we
do related to source_code_management. It's important that we
look at these in isolation as well as in a service aggregate.

Nobody knows better how to interpret these numbers in feature
aggregations than the people who build those features.

A group's availability score is sourced from the same SLIs that we use
to monitor GitLab.com's availability. We calculate it by dividing the number of
successful measurements by the total number of measurements over the
past 28 days. A measurement could be several things, most commonly a
request handled by our Rails application or a background job.
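
As a rough illustration of that calculation, here is a minimal Python sketch. The `Measurement` type, its field names, and the `availability` helper are hypothetical and are not GitLab's actual tooling; in practice the numbers come from metrics rather than in-memory records.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

WINDOW = timedelta(days=28)  # the 28-day window described above
SLO = 0.9995                 # the 99.95% target


@dataclass
class Measurement:
    # Hypothetical record: one Rails request or one background job.
    timestamp: datetime
    success: bool


def availability(measurements, now):
    """Successful measurements divided by total measurements in the window."""
    recent = [m for m in measurements if now - WINDOW <= m.timestamp <= now]
    if not recent:
        return None  # no traffic in the window, so there is no score to report
    return sum(m.success for m in recent) / len(recent)


# Made-up sample data, purely for illustration.
now = datetime(2022, 3, 10)
sample = [Measurement(now - timedelta(days=1), True)] * 9995
sample += [Measurement(now - timedelta(days=2), False)] * 5
sample += [Measurement(now - timedelta(days=40), False)] * 100  # older than 28 days, ignored

score = availability(sample, now)
print(f"{score:.4%}", "meets SLO" if score >= SLO else "below SLO")
```

Returning `None` for an empty window is a design choice in this sketch: a group with no traffic has no meaningful score, which is different from a score of 0% or 100%.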

Monitoring feature and service availability

For monitoring GitLab.com we have Grafana dashboards, generated using
Grafonnet, that show these
source metrics in several dimensions. For example, these are error
rates of our monolithic Rails application, separated by feature:

[Figure: Puma SLI by feature]

We also generate multiwindow, multi-burn-rate alerts
as defined in Google's SRE workbook.

[Figure: Puma SLI error rate and requests per second]

The red lines represent alerting thresholds for a burn rate. The
thin threshold means we'll alert if the SLI has spent more than 5%
of its monthly error budget in the past 6 hours. The thicker
threshold means we'll alert when the SLI has not met SLO for more than
2% of measurements in the past hour.
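
To make the burn-rate arithmetic concrete, here is a small Python sketch of how a "5% of the monthly error budget in 6 hours" rule translates into an error-ratio threshold. It follows the general approach in the SRE workbook, not GitLab's actual alert definitions: the function names are hypothetical, and the 30-day budget period is an assumption.

```python
SLO = 0.9995
ERROR_BUDGET = 1 - SLO  # fraction of measurements that are allowed to fail

HOURS_PER_PERIOD = 30 * 24  # assumption: "monthly" budget means 30 days


def burn_rate(budget_fraction, window_hours, period_hours=HOURS_PER_PERIOD):
    """Burn rate that spends `budget_fraction` of the budget in `window_hours`."""
    return budget_fraction * period_hours / window_hours


def error_ratio_threshold(budget_fraction, window_hours, period_hours=HOURS_PER_PERIOD):
    """Error ratio above which the budget burns faster than the rule allows."""
    return burn_rate(budget_fraction, window_hours, period_hours) * ERROR_BUDGET


def should_alert(long_window_ratio, short_window_ratio, threshold):
    """Multiwindow check: both the long and the short window must exceed the threshold."""
    return long_window_ratio > threshold and short_window_ratio > threshold


six_hour_threshold = error_ratio_threshold(0.05, 6)
print(f"6h burn rate: {burn_rate(0.05, 6):.1f}")
print(f"6h error ratio threshold: {six_hour_threshold:.4%}")
```

Under these assumptions, spending 5% of a 30-day budget in 6 hours corresponds to a burn rate of 6, so with a 99.95% SLO the alert threshold is an error ratio of roughly 0.3%. Requiring a shorter window to exceed the same threshold lets the alert stop firing soon after the problem is resolved.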

Because both GitLab.com's availability number and the availability
number for development groups are sourced from the same metrics, we
can provide similar alerts and graphs tailored to the
development groups. Features with relatively little traffic would not easily show
problems in our bigger service aggregations. With this mechanism we can see those problems
and put them on the radar of the teams building those features.
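
The following Python sketch shows why a per-feature breakdown matters. All request counts are made up for illustration, and `low_traffic_feature` is a hypothetical feature name: it fails 5% of its requests, yet the service aggregate still meets the 99.95% target.

```python
SLO = 0.9995

# Hypothetical counts per feature: (total requests, failed requests).
traffic = {
    "code_review": (1_000_000, 100),
    "team_planning": (800_000, 80),
    "low_traffic_feature": (2_000, 100),
}


def success_ratio(total, failed):
    return (total - failed) / total


# Service aggregate: all features pooled together.
total = sum(t for t, _ in traffic.values())
failed = sum(f for _, f in traffic.values())
aggregate = success_ratio(total, failed)

# The same measurements, broken down by feature.
per_feature = {name: success_ratio(t, f) for name, (t, f) in traffic.items()}
below_slo = [name for name, ratio in per_feature.items() if ratio < SLO]

print(f"aggregate: {aggregate:.4%}")
for name, ratio in per_feature.items():
    print(f"{name}: {ratio:.4%}")
print("below SLO:", below_slo)
```

In this sketch only the low-traffic feature falls below the target, while the pooled number gives no hint that anything is wrong.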

Building and adoption

In upcoming posts, we will talk about how we built this tooling and how we worked with other teams to get it adopted into the product prioritization process.
