1/6 We spent 4 weeks building an autonomous security scanner for DeFi.
15 specialized scanners. 643 tests. 82.6% detection rate on EVMbench.
Today we stopped building features. Now it hunts. 🧵
2/6 The stack:
One command. All 15 scanners. Deduplicated results.
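A minimal sketch of what the deduplication step could look like, assuming each scanner emits findings with a file, line, and rule ID. The `Finding` fields, the `deduplicate` helper, and the scanner names are hypothetical, not the project's actual schema:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    scanner: str   # which scanner reported it (hypothetical field)
    file: str
    line: int
    rule: str      # e.g. "reentrancy"
    severity: int  # higher means worse


def deduplicate(findings):
    """Merge findings that point at the same file, line, and rule.

    Keeps the highest-severity copy and records every scanner that agreed.
    """
    merged = {}
    for f in findings:
        key = (f.file, f.line, f.rule)
        entry = merged.setdefault(key, {"finding": f, "scanners": set()})
        entry["scanners"].add(f.scanner)
        if f.severity > entry["finding"].severity:
            entry["finding"] = f
    return [(e["finding"], sorted(e["scanners"])) for e in merged.values()]


raw = [
    Finding("scanner_a", "Vault.sol", 42, "reentrancy", 3),
    Finding("scanner_b", "Vault.sol", 42, "reentrancy", 4),
    Finding("scanner_a", "Vault.sol", 10, "tx-origin", 2),
]
unique = deduplicate(raw)
```

Keying on (file, line, rule) is the simplest choice; a real pipeline would likely need fuzzier matching, since different scanners name the same issue differently.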
3/6 The number that matters: 82.6% on EVMbench.
GPT-5.3-Codex baseline: 72.2%.
We didn't get there with a bigger model. We got there by running traditional static analysis first, then applying AI reasoning to the findings. Boring hybrid approach. Works better.
4/6 The part nobody talks about: false positives kill tools.
So we built multi-model consensus: Claude, GPT, and VulnLLM-R cross-verify each finding. Mutation testing validates detector accuracy. Risk scoring (0-100) in under 30 seconds.
Precision over recall. Every time.
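Roughly how a consensus gate like this can work, as a sketch. The verdicts are hardcoded here in place of real model calls, and both the `consensus` helper and the 2-of-3 threshold are assumptions, not the tool's actual settings:

```python
def consensus(votes, min_agree=2):
    """Return True when at least `min_agree` models confirm the finding.

    `votes` maps a model name to its verdict (True = real vulnerability).
    A higher `min_agree` trades recall for precision.
    """
    confirmations = sum(1 for verdict in votes.values() if verdict)
    return confirmations >= min_agree


# Hypothetical verdicts for one finding; real ones would come from model calls.
votes = {"claude": True, "gpt": True, "vulnllm-r": False}
keep = consensus(votes)                    # 2 of 3 agree, finding is kept
strict = consensus(votes, min_agree=3)     # unanimous mode drops it
```

Setting `min_agree` to the number of models gives the strictest filter: one dissenting model is enough to discard a finding.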
5/6 Running VulnLLM-R-7B locally via Ollama.
Cost per scan: $0. Code never leaves the machine. 75-80% detection rate from a 7B model.
For comparison, a 2-week Claude Opus 4.6 audit of Firefox cost $4,000 in API credits. We run unlimited scans for free.
6/6 The autonomous loop is operational:
1. Monitor Immunefi for new programs
2. Score and prioritize targets
3. Scan with 15 detectors + AI reasoning
4. Generate fix suggestions + PoC templates
5. Draft reports for submission
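The five steps above can be sketched as a single function that takes each stage as a callable. Everything here is illustrative: `run_loop`, the stub stages, and the program data are hypothetical stand-ins, not the project's code or real Immunefi data:

```python
def run_loop(fetch_programs, score, scan, suggest_fix, draft_report, top_n=1):
    """Run one pass of the monitor -> score -> scan -> fix -> report loop."""
    programs = fetch_programs()                                  # 1. monitor
    ranked = sorted(programs, key=score, reverse=True)[:top_n]   # 2. prioritize
    reports = []
    for program in ranked:
        findings = scan(program)                                 # 3. scan
        fixes = [suggest_fix(f) for f in findings]               # 4. suggest fixes
        reports.append(draft_report(program, findings, fixes))   # 5. draft
    return reports


# Stub stages with made-up data, just to show the flow.
reports = run_loop(
    fetch_programs=lambda: [
        {"name": "proto-a", "max_bounty": 50_000},
        {"name": "proto-b", "max_bounty": 250_000},
    ],
    score=lambda p: p["max_bounty"],
    scan=lambda p: [f"{p['name']}: unchecked external call"],
    suggest_fix=lambda finding: f"fix for {finding}",
    draft_report=lambda p, findings, fixes: {
        "program": p["name"],
        "findings": findings,
        "fixes": fixes,
    },
)
```

Passing the stages in as callables keeps each one swappable and testable on its own.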
Phase 1 was "build the tool." Phase 2 is "earn bounties."
Shipping weekly at github.com/gilchrist-research