LLM Benchmarks

Review benchmark coverage, then open a task-suite page to compare public model results under a consistent evaluation setup.

Coverage

Benchscope tracks public results for task suites such as MMLU, GSM8K, IFEval, MATH, and MuSR.

Comparable Results

Benchmark pages keep runs anchored to a specific task, version, sample size, prompt mode, and scoring method so results can be interpreted in context.
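
One way to picture this anchoring is as a record that carries the five setup fields alongside each score. The sketch below is illustrative only and does not describe how Benchscope stores its data: the names RunSetup, Run, comparable, and group_by_setup are hypothetical, as are the version label, prompt mode, scoring method, model names, and scores used in the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunSetup:
    """The five fields a run is anchored to (hypothetical schema)."""
    task: str
    version: str
    sample_size: int
    prompt_mode: str
    scoring_method: str


@dataclass(frozen=True)
class Run:
    """One public result: a model, its evaluation setup, and a score."""
    model: str
    setup: RunSetup
    score: float


def comparable(a: Run, b: Run) -> bool:
    """Two runs are directly comparable only if every setup field matches."""
    return a.setup == b.setup


def group_by_setup(runs: list[Run]) -> dict[RunSetup, list[Run]]:
    """Bucket runs so that each bucket holds only directly comparable results."""
    groups: dict[RunSetup, list[Run]] = {}
    for run in runs:
        groups.setdefault(run.setup, []).append(run)
    return groups


# Illustrative values only: the models, scores, version, prompt modes, and
# scoring method below are made up for the example.
zero_shot = RunSetup("GSM8K", "v1", 1319, "0-shot", "exact_match")
five_shot = RunSetup("GSM8K", "v1", 1319, "5-shot", "exact_match")

runs = [
    Run("model-a", zero_shot, 0.71),
    Run("model-b", zero_shot, 0.64),
    Run("model-c", five_shot, 0.80),
]

groups = group_by_setup(runs)
```

Here model-a and model-b fall into the same group, while model-c is kept apart because its prompt mode differs, even though the task, version, sample size, and scoring method are identical.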

Current Benchmark Coverage

IFEval: Instruction Following Evaluation; 541 examples; 10 public runs.
MuSR: Multi-Step Soft Reasoning; 756 examples; 8 public runs.
GSM8K: Grade School Math 8K; 1,319 examples; 5 public runs.
MATH: Competition Mathematics; 5,000 examples; 5 public runs.
MMLU: Massive Multitask Language Understanding; 14,042 examples; 4 public runs.
