LLM Benchmarks
Review benchmark coverage and open task-suite pages to compare public model results within a consistent evaluation setup.
Coverage
Benchscope tracks public results for task suites such as MMLU, GSM8K, IFEval, MATH, MuSR, and related benchmarks.
Comparable Results
Benchmark pages keep runs anchored to a specific task, version, sample size, prompt mode, and scoring method so results can be interpreted in context.
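One way to picture this anchoring: each public run can be treated as a record that pins down every evaluation parameter, and two runs are comparable only when all of those parameters match. The sketch below is illustrative only; the `RunRecord` class, its field names, and the `comparable` helper are assumptions for this example, not Benchscope's actual schema or API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RunRecord:
    """Hypothetical record anchoring one public run to its evaluation setup."""
    task: str          # task suite, e.g. "GSM8K"
    version: str       # task-suite version the run used
    sample_size: int   # number of examples evaluated
    prompt_mode: str   # e.g. "zero-shot" or "5-shot"
    scoring: str       # scoring method, e.g. "exact-match"
    model: str         # model that produced the run
    score: float       # reported result

def comparable(a: RunRecord, b: RunRecord) -> bool:
    """Two runs compare directly only if every anchor field matches."""
    return (a.task, a.version, a.sample_size, a.prompt_mode, a.scoring) == \
           (b.task, b.version, b.sample_size, b.prompt_mode, b.scoring)
```

Under this model, two 5-shot GSM8K runs on the same task version and sample size are directly comparable, while a zero-shot run of the same task is not, even if it reports a similar score.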
Current Benchmark Coverage
- IFEval: Instruction Following Evaluation; 541 examples; 10 public runs.
- MuSR: Multistep Soft Reasoning; 756 examples; 8 public runs.
- GSM8K: Grade School Math 8K; 1,319 examples; 5 public runs.
- MATH: Competition Mathematics; 5,000 examples; 5 public runs.
- MMLU: Massive Multitask Language Understanding; 14,042 examples; 4 public runs.
Benchscope is a JavaScript application. If the interactive interface does not load, enable JavaScript in your browser or use the links above to reach the main public sections.