The OWASP Benchmark is a test suite designed to evaluate the speed, coverage, and accuracy of automated vulnerability detection tools. Without the ability to measure these tools, it is difficult to understand their strengths and weaknesses, and compare them to each other. The Benchmark contains thousands of test cases that are fully runnable and exploitable.

You can currently use the Benchmark with Static Application Security Testing (SAST) tools. A future goal is to support the evaluation of Dynamic Application Security Testing (DAST) tools like OWASP ZAP and Interactive Application Security Testing (IAST). The current version is implemented in Java. Future versions may expand to include other languages.

For more information, please visit the OWASP Benchmark Project Site.

Interpretation Guide

Security tools (SAST, DAST, and IAST) are amazing when they find a complex vulnerability in your code. But they can drive everyone crazy with complexity, false alarms, and missed vulnerabilities. Using these tools without understanding their strengths and weaknesses can lead to a dangerous false sense of security.

We are on a quest to measure just how good these tools are at discovering and properly diagnosing security problems in applications. We rely on the long history of military and medical evaluation of detection technology as a foundation for our research. Therefore, the test suite tests both real and fake vulnerabilities.

There are four possible test outcomes in the Benchmark:

We can learn a lot about a tool from these four metrics. A tool that simply flags every line of code as vulnerable will perfectly identify all vulnerabilities in an application, but will also have 100% false positives. Similarly, a tool that reports nothing will have zero false positives, but will also identify zero real vulnerabilities. Imagine a tool that flips a coin to decide whether to report each vulnerability for every test case. The result would be 50% true positives and 50% false positives. We need a way to distinguish valuable security tools from these trivial ones.

If you imagine the line that connects all these points, from 0,0 to 100,100 establishes a line that roughly translates to "random guessing." The ultimate measure of a security tool is how much better it can do than this line. The diagram below shows how we will evaluate security tools against the Benchmark.


True Positive (TP) Tests with real vulnerabilities that were correctly reported as vulnerable by the tool.
False Negative (FN) Tests with real vulnerabilities that were not correctly reported as vulnerable by the tool.
True Negative (TN) Tests with fake vulnerabilities that were correctly not reported as vulnerable by the tool.
False Positive (FP) Tests with fake vulnerabilities that were incorrectly reported as vulnerable by the tool.
True Positive Rate (TPR) = TP / ( TP + FN ) The rate at which the tool correctly reports real vulnerabilities. Also referred to as Recall, as defined at Wikipedia.
False Positive Rate (FPR) = FP / ( FP + TN ) The rate at which the tool incorrectly reports fake vulnerabilities as real.
Score = TPR - FPR Normalized distance from the random guess line.