Benchscope
Compare commercial LLM endpoints using real benchmark runs. Browse results for MMLU, GSM8K, IFEval, MATH, MuSR, and other evals, then inspect scores, latency, prompts, raw outputs, cost, and methodology.
Runs
Each public run records the evaluation of one provider-hosted model, listing the benchmark, status, score, latency, prompt mode, and sample size, with links to inspect the prompts and raw outputs.
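As a rough sketch of what one run row carries, the shape below uses hypothetical field names; it is illustrative, not Benchscope's actual schema or API.

```typescript
// Hypothetical shape of one public run row; field names are illustrative,
// not Benchscope's published schema.
type RunStatus = "queued" | "running" | "complete" | "failed";
type PromptMode = "canonical" | "custom";

interface PublicRun {
  id: string;
  provider: string;        // hosting endpoint (the commercial API vendor)
  model: string;           // underlying model family / version
  benchmark: string;       // e.g. "MMLU", "GSM8K"
  status: RunStatus;       // lifecycle state of the evaluation
  score: number | null;    // null until the run completes
  latencyMsP50: number;    // median per-request latency
  promptMode: PromptMode;  // canonical prompts are directly comparable
  sampleSize: number;      // number of benchmark examples evaluated
  promptUrl: string;       // link to inspect the exact prompt used
  rawOutputsUrl: string;   // link to per-example raw outputs
}
```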
Benchmarks
Benchmark pages group comparable results for task suites such as MMLU, GSM8K, IFEval, MATH, MuSR, and related evals.
Models
Model family pages compare public runs across providers, so the same underlying model can be judged independently of the endpoint that hosts it.
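A minimal sketch of the cross-provider comparison a model family page implies, assuming the hypothetical PublicRun shape above and an in-memory run index:

```typescript
// Sketch: given all indexed public runs, compare one model family across
// the providers that host it, per benchmark. Assumes the PublicRun shape above.
function compareAcrossProviders(runs: PublicRun[], modelFamily: string): void {
  const byBenchmark = new Map<string, PublicRun[]>();
  for (const run of runs) {
    if (run.model !== modelFamily || run.status !== "complete") continue;
    const group = byBenchmark.get(run.benchmark) ?? [];
    group.push(run);
    byBenchmark.set(run.benchmark, group);
  }
  // For each benchmark, report score and latency per hosting provider.
  for (const [benchmark, group] of byBenchmark) {
    for (const run of group) {
      console.log(
        `${benchmark} | ${run.provider}: score=${run.score}, p50=${run.latencyMsP50}ms`
      );
    }
  }
}
```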
Methodology
A run is one evaluation of a provider-hosted endpoint on a benchmark. Benchscope records score, latency, sample size, prompt, lifecycle state, and raw per-example outputs. Canonical prompts are intended for direct comparison; custom prompts are flagged because they can change model behavior.
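One way to read the canonical-versus-custom rule, sketched under the same assumed run shape: two runs are only directly comparable when they share a benchmark, both are complete, and both used the canonical prompt.

```typescript
// Sketch of the comparability rule implied by the methodology:
// custom-prompt runs are flagged and kept out of direct comparisons.
function directlyComparable(a: PublicRun, b: PublicRun): boolean {
  return (
    a.benchmark === b.benchmark &&
    a.status === "complete" &&
    b.status === "complete" &&
    a.promptMode === "canonical" &&
    b.promptMode === "canonical"
  );
}
```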
Current Snapshot
- 5 public benchmark suites currently listed
- 2 model families with public runs
- 36 public runs indexed
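Figures like these could be derived from the run index itself; a sketch, again assuming the hypothetical PublicRun shape:

```typescript
// Sketch: derive snapshot counts from the indexed public runs.
// Counts benchmarks and model families that appear in at least one run.
function snapshot(runs: PublicRun[]) {
  const benchmarks = new Set(runs.map((r) => r.benchmark));
  const modelFamilies = new Set(runs.map((r) => r.model));
  return {
    benchmarkSuites: benchmarks.size,  // e.g. 5 in the current listing
    modelFamilies: modelFamilies.size, // e.g. 2 with public runs
    publicRuns: runs.length,           // e.g. 36 indexed
  };
}
```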
Benchscope is a JavaScript app. If the interactive interface does not load, enable JavaScript or use the links above for the main public sections.