Synthetic + public-methodology-shaped · no real Arena data · noindex · metric-SEV

A published rank just flipped. Did it page anyone, or wait for the next paper?

A leaderboard-reliability console. Each scenario reshuffles a synthetic top five while true model quality is held fixed, then asks one question: under a stability bound, does this rank change page an on-call engineer within a day, or does it drift silently until an outside audit notices? Click a scenario; the verdict and the runbook appear first. Numbers are synthetic or shaped from Arena's public methodology, never a real Arena rank or vote.

Pick a reliability scenario: one click runs it

The rank change, on the stability bound

INSIDE BOUND. No page.

Legend: Elo score (95% CI bar) · stability bound (a flip past it pages)

Model	Rank was	Rank now	Elo (±95% CI)	Flag

What the on-call sees

● SEV · PAGES ON-CALL

Signal: —
Bound: —
Observed: —
True-quality change: —
Blast radius: —

Runbook

Generated post-mortem template

—

Fine-tune the bound (optional)

Stability bound: page if a top-N rank flips while CIs overlap

How deep the protected band goes. A top-3 bound pages on flips in the first three ranks; widen it and flips deeper in the board page too. Default: top-3.

CI-overlap gate: only page if the flipped pair's CIs overlap by at least this many Elo

A flip between two models whose 95% CIs overlap by at least this much counts as a statistical tie, and publishing a confident ordering of a tie is the risk the page catches. Raise the gate and only deeply overlapping ties page; lower it toward 0 and any overlapping flip pages. Default: 5 Elo.

These are the two dials a real engagement would negotiate with your science team, then write down. The scenario buttons set sensible defaults; move these to see a borderline case flip from page to no-page.
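The two dials can be read as a single paging rule. The sketch below is one possible reading, not the console's actual logic: the names `ci_overlap` and `flips_that_page` are hypothetical, the gate is taken to be a minimum CI overlap in Elo, a flip is taken to "touch" the protected band when the higher of the two swapped positions falls inside the top N, and the CI values are invented placeholders.

```python
def ci_overlap(a, b):
    """Width (in Elo) of the overlap between two (low, high) intervals; 0 if disjoint."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def flips_that_page(before, after, cis, top_n=3, gate_elo=5.0):
    """Return the flipped pairs that would page under the two dials.

    before / after: model names in rank order (index 0 is rank 1).
    cis: model name -> (low, high) 95% CI in Elo.
    Assumption: a pair pages when its order reversed, the flip touches the
    top-N band, and the pair's CIs overlap by at least gate_elo.
    """
    pos_before = {m: i for i, m in enumerate(before)}
    pos_after = {m: i for i, m in enumerate(after)}
    flagged = []
    for i, ahead in enumerate(before):
        for behind in before[i + 1:]:
            if pos_after[ahead] < pos_after[behind]:
                continue  # order unchanged, nothing to check
            # Highest rank position involved in the swap (0-based).
            touched = min(pos_before[ahead], pos_after[behind])
            overlap = ci_overlap(cis[ahead], cis[behind])
            if touched < top_n and overlap > 0 and overlap >= gate_elo:
                flagged.append((ahead, behind))
    return flagged


# Invented placeholder intervals, not Arena data.
cis = {
    "Model A": (1250.0, 1270.0),
    "Model B": (1245.0, 1265.0),
    "Model C": (1200.0, 1220.0),
    "Model D": (1195.0, 1215.0),
    "Model E": (1190.0, 1210.0),
}
before = ["Model A", "Model B", "Model C", "Model D", "Model E"]
top_flip = ["Model B", "Model A", "Model C", "Model D", "Model E"]
low_flip = ["Model A", "Model B", "Model C", "Model E", "Model D"]

print(flips_that_page(before, top_flip, cis))           # [('Model A', 'Model B')]
print(flips_that_page(before, low_flip, cis))           # []
print(flips_that_page(before, low_flip, cis, top_n=5))  # [('Model D', 'Model E')]
```

The last two calls show the borderline case: the same rank 4/5 swap is silent under a top-3 bound and pages once the band is widened to top-5.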

Sources & method

Public-methodology-shaped: Arena (LMArena) fits anonymous pairwise votes to a Bradley-Terry / Elo model and publishes a 95% CI per model via bootstrap; many ranked models are statistically tied within those CIs (LMArena ranking-method post). A separate result shows that dropping a handful of preferences can change top rankings (arXiv 2508.11847). The console borrows that structure; the model names and scores it shows are invented placeholders (Model A–E), not Arena's board.
Synthetic, seeded, deterministic: every scenario is a fixed synthetic event (a top-five with CIs that reshuffles after a simulated change while true quality is held fixed). Reload and re-run reproduce identical numbers. No real Arena rank, vote, provider, or internal monitoring is represented.
The provider-fairness scenario maps to the differential-treatment-across-slices measurement in Jeff's CAMH research (decoupled classifiers reduced an accuracy parity gap from 35% to 1% and sensitivity parity from 50% to 9% across 140 evaluated permutations); here the "slice" is a provider. Cited, not minted.
What it deliberately can't see: Arena's real vote stream, its real provider-submission policy enforcement, or its internal monitoring. It demonstrates the shape of the reliability risk and the discipline that would catch it, not Arena's actual exposure. Method note: jeffpinto.com/notes/metric-sev.
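The structure described in the first two notes (pairwise votes, a Bradley-Terry fit, bootstrap 95% CIs, seeded reproducibility) can be illustrated in a few lines. This is a minimal sketch under stated assumptions, not Arena's pipeline or the console's code: the function names are hypothetical, the fit uses a simple iterative (minorization-maximization) update, the 1000 base and 400 scale are conventional Elo choices assumed here, and the "true quality" values are invented.

```python
import math
import random

# Invented placeholders: fixed "true quality" in Elo, not Arena's board.
TRUE_ELO = {"Model A": 1050.0, "Model B": 1040.0, "Model C": 900.0}
MODELS = list(TRUE_ELO)


def synthetic_votes(seed, votes_per_pair=300):
    """Draw (winner, loser) votes from the fixed true qualities."""
    rng = random.Random(seed)  # local RNG, so a rerun reproduces the same votes
    votes = []
    for i, a in enumerate(MODELS):
        for b in MODELS[i + 1:]:
            p_a = 1.0 / (1.0 + 10 ** ((TRUE_ELO[b] - TRUE_ELO[a]) / 400.0))
            for _ in range(votes_per_pair):
                votes.append((a, b) if rng.random() < p_a else (b, a))
    return votes


def fit_bradley_terry(votes, iters=100):
    """Fit Bradley-Terry strengths by iterative updates; return Elo-scaled scores."""
    wins = {m: 0 for m in MODELS}
    pair_counts = {}
    for winner, loser in votes:
        wins[winner] += 1
        key = frozenset((winner, loser))
        pair_counts[key] = pair_counts.get(key, 0) + 1
    strength = {m: 1.0 for m in MODELS}
    for _ in range(iters):
        updated = {}
        for m in MODELS:
            denom = 0.0
            for key, n in pair_counts.items():
                if m in key:
                    (other,) = key - {m}
                    denom += n / (strength[m] + strength[other])
            # Floor on wins avoids log(0) if a resample gives a model no wins.
            updated[m] = max(wins[m], 1e-6) / denom if denom > 0 else strength[m]
        # Fix the scale: normalize the geometric mean of strengths to 1.
        log_mean = sum(math.log(v) for v in updated.values()) / len(updated)
        strength = {m: v / math.exp(log_mean) for m, v in updated.items()}
    return {m: 1000.0 + 400.0 * math.log10(s) for m, s in strength.items()}


def bootstrap_ci(votes, rounds=200, seed=0):
    """Percentile 95% CI per model from refits on resampled votes."""
    rng = random.Random(seed)
    samples = {m: [] for m in MODELS}
    for _ in range(rounds):
        resample = rng.choices(votes, k=len(votes))  # sample with replacement
        for m, score in fit_bradley_terry(resample).items():
            samples[m].append(score)
    intervals = {}
    for m, vals in samples.items():
        vals.sort()
        lo = vals[int(0.025 * (len(vals) - 1))]
        hi = vals[int(0.975 * (len(vals) - 1))]
        intervals[m] = (lo, hi)
    return intervals


votes = synthetic_votes(seed=42)
scores = fit_bradley_terry(votes)
intervals = bootstrap_ci(votes, rounds=200, seed=42)
for m in MODELS:
    lo, hi = intervals[m]
    print(f"{m}: {scores[m]:.0f} (95% CI {lo:.0f} to {hi:.0f})")
```

Because every random draw goes through a seeded local `random.Random`, rerunning the script reproduces the same votes, scores, and intervals. Models A and B are placed only 10 Elo apart on purpose, so their intervals would be expected to overlap in most runs, which is the statistical-tie situation the console pages on.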