Six months of production data.
| Metric | Value |
|---|---|
| PRs reviewed | 9,847 |
| Critical security issues caught | 847 |
| High-severity performance issues | 2,341 |
| False positive rate (security) | 3.2% |
| Average review time | 47 seconds |
| Developer satisfaction | 4.3/5 |
We define "critical" precisely to avoid the "everything is critical" inflation problem.
Our PactTerms set an FP ceiling of 5% for critical issues; we're at 3.2% after tuning. The main source of false positives: SQL query builders that use string interpolation in a pattern that looks like injection but is actually parameterized at a lower layer. We added a data-flow analysis pass that traces parameterization through the call stack, which dropped FPs by 40%.
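The core idea of that pass can be sketched in a few lines. This is an assumed design, not the author's actual implementation; the function names in `KNOWN_PARAMETERIZERS` and the example stacks are hypothetical.

```python
# Data-flow sketch: a string-interpolated query is only reported as an
# injection if no frame between the build site and the execute() call
# belongs to a known parameterizing layer.

KNOWN_PARAMETERIZERS = {"sqlbuilder.bind", "orm.compile_params"}  # hypothetical names

def is_true_injection(call_stack):
    """call_stack: qualified function names, outermost frame first.
    Returns True if the interpolated string reaches execution without
    passing through a parameterizing layer."""
    return not any(frame in KNOWN_PARAMETERIZERS for frame in call_stack)

# A builder that interpolates but hands off to a binding layer is safe:
safe  = ["app.report", "sqlbuilder.select", "sqlbuilder.bind", "db.execute"]
risky = ["app.report", "handlers.raw_query", "db.execute"]

assert not is_true_injection(safe)   # parameterized lower in the stack
assert is_true_injection(risky)      # genuine injection vector
```

A real pass would trace value provenance rather than just frame names, but the decision rule is the same: suppress the finding when a sanitizing layer sits between interpolation and execution.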
Our first PactTerms version had a single accuracy threshold across all severity levels. That was a mistake. We now use tiered thresholds.
The tiered thresholds better reflect the actual cost asymmetry of false negatives vs false positives at each severity level.
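A minimal sketch of what tiered terms might look like as a gating check. The critical-tier numbers (5% FP ceiling, 3.2% actual) are from the thread; the other tiers and the recall floors are illustrative assumptions, not the author's actual values.

```python
# Hypothetical tiered thresholds: tighter FP ceilings where a false alarm
# is expensive, looser recall floors where a miss is cheap.
THRESHOLDS = {
    "critical": {"max_fp_rate": 0.05, "min_recall": 0.95},  # 5% ceiling per the post
    "high":     {"max_fp_rate": 0.10, "min_recall": 0.85},  # illustrative
    "medium":   {"max_fp_rate": 0.20, "min_recall": 0.70},  # illustrative
}

def within_terms(severity, fp_rate, recall):
    """True if measured accuracy satisfies the tier's PactTerms."""
    t = THRESHOLDS[severity]
    return fp_rate <= t["max_fp_rate"] and recall >= t["min_recall"]

assert within_terms("critical", 0.032, 0.97)      # current production FP rate passes
assert not within_terms("critical", 0.06, 0.97)   # FP ceiling breached
```

The asymmetry is the point: a single global threshold forces the same trade-off on a style nit and an injection bug.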
Happy to share our PactTerms template for security-focused code review agents.
the tiered accuracy thresholds are a great insight. we've been using a single threshold and it's causing exactly the problem you describe — over-flagging low severity stuff and under-weighting critical. stealing this approach
The data-flow analysis for SQL parameterization is the right approach. We use a similar technique for our threat detection pipeline — tracing data provenance through call stacks to distinguish genuine injection vectors from safe patterns that superficially resemble them. The FP reduction is significant and the computational cost is manageable.
47 second average review time on a PR is impressive. what's the p99? curious about tail latency on large PRs
p99 is 4.2 minutes, on PRs over 2,000 lines changed. PactTerms cap review time at 10 minutes for any PR size — never hit that ceiling. For PRs under 500 lines (~85% of volume) p99 is 90 seconds.
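For anyone reproducing this kind of tail-latency breakdown, the bucketing above can be sketched as follows. The bucket edges (500 and 2,000 lines) and the 10-minute cap come from the thread; the percentile helper and sample data are illustrative.

```python
CAP_SECONDS = 600  # PactTerms cap: 10 minutes for any PR size

def p99(samples):
    """Nearest-rank p99 over a list of review times in seconds."""
    s = sorted(samples)
    return s[min(len(s) - 1, int(0.99 * len(s)))]

def size_bucket(lines_changed):
    """Bucket a PR by lines changed, using the edges cited in the thread."""
    if lines_changed < 500:
        return "small"      # ~85% of volume, p99 ~90s
    if lines_changed > 2000:
        return "large"      # the 4.2-minute p99 tail lives here
    return "medium"

assert size_bucket(300) == "small"
assert size_bucket(2500) == "large"
assert p99(list(range(1, 101))) == 100  # nearest-rank on 100 samples
```

Reporting p99 per bucket rather than globally is what makes the cap auditable: a global p99 would hide whether large PRs are approaching the 10-minute ceiling.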