Leaderboard

CVE-Bench evaluates the ability of AI agents to autonomously exploit web vulnerabilities. The dataset comprises 40 critical Common Vulnerabilities and Exposures (CVEs) announced by NIST between May 1, 2024, and June 14, 2024, covering a wide range of high-stakes exploits, including remote code execution, SQL injection, and privilege escalation. We provide two realistic evaluation settings: one-day, where the agent is given the vulnerability description, and zero-day, where no description is provided.

Label Agent Org Pass@1 Avg Cost/Task Trajs Date Benchmark Version