In summer 2022, the Vulnerability Research team at GitLab
launched the Google Summer of Code (GSoC) project:
A benchmarking framework for SAST.
The goal of the project was to create a benchmarking framework, which would assess the
impact and quality of a security analyzer or configuration change before it reaches the production environment.
Preliminaries
GitLab SAST
As a complete DevOps Platform, GitLab has a variety of integrated static analysis (SAST) tools
for different languages and frameworks. These tools help developers find
vulnerabilities as early as possible in the software development lifecycle.
These tools are constantly being updated, either by upgrading the underlying
security analyzers or by applying configuration changes.
Since all the integrated SAST tools are very different in terms of
implementation, and depend on different tech stacks, they are all
wrapped in Docker images. The wrappers translate tool-native vulnerability
reports to a generic, common report format which is
made available by means of the gl-sast-report.json
artifact. This generic
report is GitLab's common interface between analyzers and the GitLab Rails
backend.
Benchmarking is important to assess the efficacy of analyzers and helps to make
data-driven decisions. For example, benchmarking is useful for QA testing
(spotting regressions), for data-driven decision making, and for research by
assessing the progression of the GitLab security feature performance over time.
Google Summer Of Code (GSoC)
Google Summer of Code (GSoC)
is a 10-week program that enlists contributors to work on open source projects
in collaboration with open source organizations. For GSoC 2022, GitLab offered
four projects to GSoC contributors. The contributors completed each of the
projects with the guidance from GitLab team members who mentored them and
provided regular feedback and assistance when needed.
Terms & Notation
In this blog post, we use the terms/acronyms below to classify findings reported by security analyzers.
Acronym | Meaning | Description |
---|---|---|
TP | True Positive | Analyzer correctly identifies a vulnerability. |
FP | False Positive | Analyzer misidentifies a vulnerability or reported a vulnerability where none exist. |
TN | True Negative | Analyzer correctly ignores a potential false positive. |
FN | False Negative | Analyzer does not report a known vulnerability. |
For the figures in the blog post we use the following notation: processes are
depicted as rounded boxes, whereas artifacts (e.g., files) are depicted as
boxes; arrows denote an input/output (IO) relationship between the connected nodes.
flowchart TB;
subgraph legend[ Legend ]
proc(Process);
art[Artifact];
proc -->|IO relation|art;
end
Motivation
The authors of the paper How to Build a Benchmark distilled the desirable characteristics of a benchmark below:
- Relevance: How closely the benchmark behavior correlates to behaviors that are of interest to consumers of the results.
- Reproducibility: The ability to consistently produce similar results when the benchmark is run with the same test configuration.
- Fairness: Allowing different test configurations to compete on their merits without artificial limitations.
- Verifiability: Providing confidence that a benchmark result is accurate.
- Usability: Avoiding roadblocks for users to run the benchmark in their test environments.
There currently is no standard nor de facto language-agnostic SAST benchmark
satisfying all the criteria mentioned above. Many benchmark suites focus on
specific languages, are shipped with incomplete or missing ground-truths, or
are based on outdated technologies and/or frameworks. A ground-truth or
baseline is the set of findings a SAST tool is expected to detect.
The main objective of the GSoC project was to close this gap and start to
create a benchmarking framework that addresses all the desirable charateristics
mentioned above in the following manner:
- Relevance: Include realistic applications (in terms of size, framework usage
and customer demand). - Reproducibility: Automate the whole benchmarking process in CI.
- Fairness: Make it easy to integrate new SAST tools by just tweaking the CI
configuration and use the GitLab security report schema as a common standard. - Verifiability: Assemble baseline that includes all the relevant
vulnerabilities and make it publicly available. The baseline is the north star
that defines what vulnerabilities are actually included in a test application. - Usability: Benchmark users can just integrate the benchmark as a downstream
pipeline to their CI configuration.
A benchmarking framework for SAST
The benchmarking framework compares the efficacy of an analyzer against a known
baseline. This is very useful for monitoring the efficacy of the analyzer that
participates in the benchmarking. The baseline is the gold standard that serves
as a compass to guide analyzer improvements.
Usage
For using the framework, the following requirements have to be met:
- The analyzer has to be dockerized.
- The analyzer has to produce a vulnerability report that adheres to the
GitLab security report schema
format, which serves as our generic intermediate representation to compare
analyzer efficacy. - The baseline expectations have to be provided as
GitLab security report schema
so that we can compare the analyzer output against it.
The framework is designed in such a way that it can be easily integrated into
the CI configuration of existing GitLab projects by means of a downstream pipeline.
There are many possible ways in which a downstream pipeline can be triggered:
source code changes applied to an analyzer, configuration changes
applied to an analyzer, or scheduled pipeline invocation. By using the pipeline,
we can run the benchmarking frameworks continuously and instantaneously on the GitLab
projects that host the source code of the integrated analyzers whenever code or
configuration changes are applied.
Architecture
The figure below depicts the benchmarking framework when comparing an analyzer
against a baseline.
We assume that we have a baseline configuration available; a baseline consists
of an application that is an actual test application that includes
vulnerabilities. These vulnerabilities are documented in an expectation file
that adheres to the security report schema.
Note that we use the terms baseline and expectation interchangeably. As
mentioned earlier, the benchmarking framework is essentially a GitLab pipeline
that can be triggered downstream. The configured analyzer then takes the
baseline app as input and generates a gl-sast-report.json
file. The heart of the
benchmarking framework is the compare
step, which compares the baseline
against the report generated by the analyzer, both of which adhere to the
security report schema.
The compare step also computes the TP, FN and FP that have been reported by the
analyzer and computes different metrics based on this information. The compare
step is implemented in the
evaluator tool.
flowchart LR;
sbx[gl-sast-report.json];
breport[Report];
config[Configuration];
config --> bf;
subgraph Baseline
bcollection[app];
baseline[expectation];
end
subgraph bf [ Benchmarking Framework ]
orig(Analyzer);
compare(Compare);
orig --> sbx;
sbx --> compare;
end
baseline --> compare;
compare --> breport
bcollection --> orig
Using the security report format as a common standard makes the benchmarking
framework very versatile: the baseline could be provided by an automated
process, by another analyzer, or manually, which happened to be the case in this
GSoC project.
Scoring
The main functionality of the evaluator tool
is to compute the overlap/intersection, and difference between a baseline and
generated report in order to uncover true positives, false positives, and false
negatives.
The relationship between TP, FP, FN, TN, baseline, and generated report can be
seen in the table below; it includes three columns analyzer
, baseline
and
classification
. The column analyzer
represents the findings included in the
report generated by the analyzer; column baseline
represents the findings
included in the baseline; column classification
denotes the
verdict/classification that the evaluator tool
attaches to the analyzer finding when performing the comparison. The X
and
-
denote reported and non-reported findings, respectively.
analyzer | baseline | classification |
---|---|---|
- | - | TN |
- | X | FN |
X | - | FP |
X | X | TP |
The classification
column in the table above shows that a TP is a
vulnerability existing in both baseline and generated report; similarly, an
FP is a vulnerability detected by an analyzer without a corresponding
baseline entry, while an FN is a vulnerability present in the baseline but
not detected by an analyzer. Note, that TN is practically not relevant for
our use-case since the analyzers we are looking at only report unsafe,
vulnerable cases instead of safe, non-vulnerable cases.
At the moment, the evaluator
tool computes the metrics below:
- Precision: P = TP /( TP + FP )
- Recall: R = TP / ( TP + FN )
- F-Score: F = 2 * ( P * R ) / ( P + R )
- Jaccard-Index: J = TP / ( TP + FP + FN )
A higher precision indicates that an analyzer is less noisy due to the low(er)
number of FPs. Hence, a high precision leads to a reduction of auditing effort
of irrelevant findings. A high recall represents an analyzer's detection
capacity. F-Score is a combined measure so that precision and recall can be
condensed to a single number. The Jaccard-Index is a single value to capture
the similarity between analyzer and baseline.
The evaluator tool
supports the addition of custom metrics via a simple call-back mechanism; this
enables us to add support more metrics in the future that help us to gain
additional or new insights with regards to the efficacy of our analyzers.
Framework Properties
In principle, the implemented benchmarking framework is language-agnostic: new
analyzers and baselines can be plugged-in as long as they adhere to the
security report schema.
Establishing baselines is laborious since it requires (cross-)validation,
trying out attacks on the running baseline application and
code auditing.
For the GSoC project, we established baselines for the applications below
covering Java (Spring) and Python
(Flask) as they are ranking high in the most used languages and frameworks.
For a benchmark application to have practical utility, it is important that the
application itself is based on technology, including programming languages and
frameworks, that are used in the industry.
For both of these applications, the baseline/expectations have been collected,
verified and are publicly availabe:
- WebGoat.
WebGoat is a deliberately insecure Web application used to teach security vulnerabilities.
We chose this as baseline application because it is often used as a benchmark
app in the Java world and it is based on Spring which is
one of the most popular frameworks in the Java world. - vuln-flask-web-app Like WebGoat, this application is deliberately insecure.
vuln-flask-web-app
covers both Python and Flask, one of the most popular web frameworks in the Python world.
Conclusion
This GSoC project was a first step towards building a FOSS benchmarking
framework that helps the community to test their own tools and to build up a
relevant suite of baselines covering various languages and frameworks. With the
help of the community, we will continue adding more baselines to the
benchmarking framework in the future to cover more languages and frameworks.
If you found the project interesting, you might want to check out the following repositories:
- evaluator
- WebGoat baseline
- Vulnerable Flask Web App baseline
- Example of downstream pipeline triggering evaluator
Cover image by Maxim Hopman on Unsplash