A leaderboard-reliability console. Each scenario reshuffles a synthetic top-five while true model quality is held fixed, then asks one question: under a stability bound, does this rank change page an on-call inside a day, or drift silently until an outside audit notices? Click a scenario; the verdict and the runbook lead. Numbers are synthetic or shaped from Arena's public methodology, never a real Arena rank or vote.
| Model | Was | Now | Elo (±95% CI) | Flag |
|---|---|---|---|---|
—
How deep the protected band goes. A top-3 bound pages on flips in the first three ranks; widen it and more of the board is on-call. Default: top-3.
A flip between two models whose 95% CIs overlap by at least this much counts as a statistical tie, and publishing a confident ordering of a tie is the risk the page catches. Raise it and only deeply overlapping ties page; lower it toward 0 and any overlapping flip pages. Default: 5 Elo.
These are the two dials a real engagement would negotiate with your science team, then write down. The scenario buttons set sensible defaults; move these to see a borderline case flip from page to no-page.
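For readers who want the two dials as logic, here is a minimal sketch of one way the paging rule could be written. It is not the console's actual implementation. The names `Entry`, `ci_overlap`, `flipped`, and `page_pairs` are hypothetical, the board is synthetic, and two readings are assumptions: a flip "touches" the protected band if either model sat inside it before or after the reshuffle, and overlap is measured as the width, in Elo, of the intersection of the two 95% intervals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    model: str
    was: int      # previous rank, 1-based
    now: int      # current rank, 1-based
    elo: float    # point estimate
    ci: float     # half-width of the 95% CI, in Elo


def ci_overlap(a: Entry, b: Entry) -> float:
    """Width of the intersection of the two 95% intervals (0 if disjoint)."""
    lo = max(a.elo - a.ci, b.elo - b.ci)
    hi = min(a.elo + a.ci, b.elo + b.ci)
    return max(0.0, hi - lo)


def flipped(a: Entry, b: Entry) -> bool:
    """True if the two models swapped relative order between Was and Now."""
    return (a.was - b.was) * (a.now - b.now) < 0


def page_pairs(board, band_depth=3, tie_overlap=5.0):
    """Return the (model, model) flips that would page under the two dials."""
    pages = []
    for i, a in enumerate(board):
        for b in board[i + 1:]:
            # Assumption: the flip counts if either model was or is in the band.
            in_band = min(a.was, a.now, b.was, b.now) <= band_depth
            overlap = ci_overlap(a, b)
            # Disjoint intervals never count as a tie, even at threshold 0.
            is_tie = overlap > 0 and overlap >= tie_overlap
            if in_band and flipped(a, b) and is_tie:
                pages.append((a.model, b.model))
    return pages


# Synthetic top-five; not a real rank or vote.
board = [
    Entry("model-a", was=1, now=2, elo=1250.0, ci=8.0),
    Entry("model-b", was=2, now=1, elo=1253.0, ci=7.0),
    Entry("model-e", was=3, now=3, elo=1225.0, ci=5.0),
    Entry("model-c", was=4, now=5, elo=1200.0, ci=4.0),
    Entry("model-d", was=5, now=4, elo=1203.0, ci=4.0),
]

print(page_pairs(board))                 # defaults: top-3 band, 5 Elo
print(page_pairs(board, band_depth=5))   # widen the band to the whole board
```

With the defaults, only the rank 1/2 swap pages; the rank 4/5 swap sits outside the band. Widening the band to 5 brings it on-call, and raising the tie threshold above its 5 Elo overlap drops it again, which is the borderline flip the dials are meant to expose.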