Benchscope

Compare commercial LLM endpoints using real benchmark runs. Browse MMLU, GSM8K, IFEval, MATH, MuSR, and other eval results, then inspect scores, latency, prompts, raw outputs, cost, and methodology.

Runs

Each public run lists a provider-hosted model evaluation with its benchmark, status, score, latency, prompt mode, and sample size, plus links to inspect the prompts and raw outputs.
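As a rough sketch, a public run record could be modeled as below. The field names and types are illustrative assumptions, not Benchscope's actual schema; only the set of attributes comes from the description above.

```typescript
// Hypothetical shape of one public run record. Field names are
// assumptions; the attribute list mirrors what a run page displays.
interface PublicRun {
  benchmark: string;                                     // e.g. "MMLU"
  status: "queued" | "running" | "complete" | "failed";  // lifecycle state
  score: number | null;                                  // null until complete
  latencyMs: number;                                     // observed latency
  promptMode: "canonical" | "custom";
  sampleSize: number;                                    // evaluated examples
  promptsUrl: string;                                    // link to prompts
  rawOutputsUrl: string;                                 // link to raw outputs
}

// Example record using placeholder values.
const example: PublicRun = {
  benchmark: "GSM8K",
  status: "complete",
  score: 0.91,
  latencyMs: 820,
  promptMode: "canonical",
  sampleSize: 500,
  promptsUrl: "/runs/123/prompts",
  rawOutputsUrl: "/runs/123/outputs",
};
```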

Benchmarks

Benchmark pages group comparable results for task suites such as MMLU, GSM8K, IFEval, MATH, MuSR, and related evals.

Models

Model family pages compare public runs across providers, so the same underlying model can be assessed independently of the endpoint that hosts it.
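The cross-provider comparison above amounts to grouping runs by model family. A minimal sketch, with assumed field names rather than Benchscope's actual data model:

```typescript
// Illustrative: collect runs of the same underlying model family
// across different hosting providers. Field names are assumptions.
interface HostedRun {
  modelFamily: string;  // the underlying model
  provider: string;     // the endpoint hosting it
  benchmark: string;
  score: number;
}

function groupByFamily(runs: HostedRun[]): Map<string, HostedRun[]> {
  const groups = new Map<string, HostedRun[]>();
  for (const run of runs) {
    const bucket = groups.get(run.modelFamily) ?? [];
    bucket.push(run);
    groups.set(run.modelFamily, bucket);
  }
  return groups;
}

// Placeholder data: one family hosted by two providers.
const hostedRuns: HostedRun[] = [
  { modelFamily: "model-x", provider: "provider-a", benchmark: "MMLU", score: 0.78 },
  { modelFamily: "model-x", provider: "provider-b", benchmark: "MMLU", score: 0.74 },
  { modelFamily: "model-y", provider: "provider-a", benchmark: "MMLU", score: 0.69 },
];

const byFamily = groupByFamily(hostedRuns);
```

Grouping by family rather than by endpoint is what lets a score gap between two rows be read as a hosting difference rather than a model difference.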

Methodology

A run is one evaluation of a provider-hosted endpoint on a benchmark. Benchscope records score, latency, sample size, prompt, lifecycle state, and raw per-example outputs. Canonical prompts are intended for direct comparison; custom prompts are flagged because they can change model behavior.
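The comparability rule above can be sketched as a filter: only canonical-prompt runs on the same benchmark are directly comparable, while custom-prompt runs are flagged. Names here are illustrative assumptions, not Benchscope's code.

```typescript
// Sketch of the methodology rule: canonical prompts allow direct
// comparison; custom prompts are flagged because they can change
// model behavior. Field names are assumptions.
interface Run {
  benchmark: string;
  promptMode: "canonical" | "custom";
  score: number;
}

// Runs directly comparable on a given benchmark.
function comparable(runs: Run[], benchmark: string): Run[] {
  return runs.filter(
    (r) => r.benchmark === benchmark && r.promptMode === "canonical"
  );
}

// Runs to flag in any cross-run view.
function flagged(runs: Run[]): Run[] {
  return runs.filter((r) => r.promptMode === "custom");
}

// Placeholder data.
const sampleRuns: Run[] = [
  { benchmark: "MMLU", promptMode: "canonical", score: 0.78 },
  { benchmark: "MMLU", promptMode: "custom", score: 0.81 },
  { benchmark: "GSM8K", promptMode: "canonical", score: 0.9 },
];

const mmluComparable = comparable(sampleRuns, "MMLU");
const customFlagged = flagged(sampleRuns);
```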

Current Snapshot

  • 5 public benchmark suites currently listed
  • 2 model families with public runs
  • 36 public runs indexed

Benchscope is a JavaScript app. If the interactive interface does not load, enable JavaScript or use the links above to reach the main public sections.