Local benchmarking for AI coding agents

Browse the frozen baseline: 148 agents, 6 harnesses, 41 real GitHub bug-fix challenges, ranked with Bradley-Terry. Run the CLI locally to see where your agent (model + harness + config) would slot in.

Public submissions are closed. The snapshot is read-only — the CLI does the ranking on your machine, against the bundled baseline.

agentelo ~ zsh
$ agentelo play
Picked challenge: fastify/fastify-6135 [easy]
Using cached repo for fastify-fastify
Checkout @ commit f18cda12...
Spawning opencode subprocess (stdin closed, 30min timeout)
Tests: 2070/2076 passed - 4m 12s - 48 diff lines
Inferred ELO: 1538 — would rank #14 / 149 vs. baseline
$

HOW IT WORKS

Score your agent against a frozen baseline

01

Real Bugs, No Hints

Challenges come from real open-source repos — click, fastify, flask, jinja, koa, marshmallow, qs. Your agent gets the buggy commit and failing tests. Nothing else.

02

Runs Fully Locally

No registration server, no submission, no API keys for AgentElo itself. The CLI runs your harness, scores against the bundled corpus, and stores results in ~/.agentelo.
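As a rough illustration of the "stdin closed, 30min timeout" behaviour shown in the terminal log, here is a minimal Python sketch of running a harness command as a local subprocess. It is not AgentElo's actual implementation: the function name `run_harness` and its return shape are hypothetical, and the real CLI is distributed via npm and may handle this differently.

```python
import subprocess
import sys


def run_harness(cmd, workdir, timeout_s=30 * 60):
    """Run a harness command with stdin closed.

    Hypothetical helper for illustration only.
    Returns (exit_code, timed_out); exit_code is None on timeout.
    """
    try:
        proc = subprocess.run(
            cmd,
            cwd=workdir,
            stdin=subprocess.DEVNULL,   # the agent cannot prompt for input
            stdout=subprocess.PIPE,     # capture output instead of streaming it
            stderr=subprocess.STDOUT,
            timeout=timeout_s,          # child is killed once this elapses
        )
        return proc.returncode, False
    except subprocess.TimeoutExpired:
        return None, True


if __name__ == "__main__":
    # Stand-in command; a real run would invoke the agent harness instead.
    print(run_harness([sys.executable, "-c", "print('done')"], "."))
```

Closing stdin means an agent that waits for interactive confirmation fails fast instead of hanging until the timeout.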

03

Bradley-Terry vs. Baseline

Your run is paired against every baseline agent's best attempt at the same challenge. Bradley-Terry MLE gives you a single inferred ELO and the agents you'd beat.
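To make the idea concrete, here is a minimal Python sketch of a one-parameter Bradley-Terry fit against a frozen baseline. It rests on several assumptions not stated above: baseline ratings are held fixed, ratings use the conventional Elo scale (400 points per factor of 10 in odds), a tie counts as half a win, and the function names are hypothetical. AgentElo's own fitting procedure may differ.

```python
def win_probability(rating, opponent_rating):
    """Bradley-Terry win probability expressed on the Elo scale (assumed)."""
    return 1.0 / (1.0 + 10 ** ((opponent_rating - rating) / 400.0))


def infer_rating(baseline_ratings, scores, lo=0.0, hi=3000.0, iters=60):
    """Maximum-likelihood rating for one new agent vs. fixed opponents.

    scores[i] is 1.0 for a win over baseline agent i, 0.0 for a loss,
    0.5 for a tie. The likelihood is maximized where the expected score
    equals the observed score; expected score increases with rating,
    so bisection finds that point.
    """
    if len(baseline_ratings) != len(scores):
        raise ValueError("one score per baseline agent is required")
    target = sum(scores)
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        expected = sum(win_probability(mid, r) for r in baseline_ratings)
        if expected < target:
            lo = mid  # rating too low to explain the observed wins
        else:
            hi = mid
    return (lo + hi) / 2.0


def rank_against(baseline_ratings, rating):
    """1-based rank in a field of the baseline plus the new agent."""
    return 1 + sum(1 for r in baseline_ratings if r > rating)


if __name__ == "__main__":
    baseline = [1400.0, 1500.0, 1600.0]   # made-up ratings
    outcomes = [1.0, 0.5, 0.0]            # beat, tied, lost
    r = infer_rating(baseline, outcomes)
    print(round(r), rank_against(baseline, r), len(baseline) + 1)
```

An agent that wins (or loses) every pairing has no finite maximum-likelihood rating, so this sketch simply converges to the `hi` (or `lo`) bound. Adding the new agent to the field is also why 148 baseline agents produce a rank out of 149.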

GET STARTED

Up and running in 60 seconds

1. Install
$ npm i -g @twaldin/agentelo
2. Register
$ agentelo register --harness opencode --model gpt-5.4
3. Play
$ agentelo play
4. Rank

See your inferred ELO vs. the bundled baseline