MuSR Multi-Step Reasoning Benchmark
MuSR evaluates multi-step reasoning over long narratives across murder mysteries, object placement tracking, and team allocation. Models must reason through chains of deductions to select the correct answer.
Benchmark Details
MuSR uses exact_match_choice scoring and currently lists 756 examples in version 2024-01.
Public Runs
10 recent public runs are included in this static snapshot.
What MuSR Measures
MuSR (Multistep Soft Reasoning) tests compositional reasoning across narrative contexts. Problems require a model to process a story or situation description and then reason across multiple steps to arrive at a correct answer. MuSR includes tasks like murder mystery resolution, object placement ordering, and team member allocation.
Scoring uses exact match on the final answer extracted from model output. The dataset is smaller than most benchmarks (756 examples in version 2024-01) and was generated algorithmically to reduce the risk of training data contamination.
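As an illustration of exact-match choice scoring, here is a minimal Python sketch. It is not Benchscope's scorer: the `Answer: X` output format, the function names `extract_choice` and `exact_match_accuracy`, and the rule that an unparseable output counts as wrong are all assumptions made for the example.

```python
import re
from typing import List, Optional


def extract_choice(output: str, num_choices: int) -> Optional[int]:
    """Return the 0-based index of the last 'Answer: X' letter, or None.

    Assumes (hypothetically) that the model was prompted to finish with
    a line such as 'Answer: B' or 'Answer: (b)'.
    """
    letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"[:num_choices]
    pattern = rf"Answer:\s*\(?([{letters}])\)?(?![A-Za-z])"
    matches = re.findall(pattern, output, flags=re.IGNORECASE)
    if not matches:
        return None
    # Use the last match so earlier mentions inside the reasoning are ignored.
    return letters.index(matches[-1].upper())


def exact_match_accuracy(outputs: List[str], gold: List[int], num_choices: int) -> float:
    """Fraction of outputs whose extracted choice equals the gold index."""
    if len(outputs) != len(gold):
        raise ValueError("outputs and gold must have the same length")
    if not outputs:
        raise ValueError("no examples to score")
    # An output with no parseable answer yields None, which never equals gold.
    correct = sum(extract_choice(o, num_choices) == g for o, g in zip(outputs, gold))
    return correct / len(outputs)


if __name__ == "__main__":
    demo_outputs = ["Answer: A", "The clues point to the butler. Answer: (b)", "no idea"]
    print(exact_match_accuracy(demo_outputs, gold=[0, 1, 2], num_choices=3))
```

Because the score depends on extraction, a model that reasons correctly but formats its final answer differently would be marked wrong under this sketch.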
Why MuSR Matters and Caveats
MuSR tests whether models can integrate information across a long context and reason compositionally rather than retrieve memorized facts. Because MuSR was generated rather than collected from existing datasets, scores are less affected by training data contamination than benchmarks like MMLU or GSM8K. This makes MuSR a useful complementary signal for tasks requiring contextual reasoning.
MuSR's small dataset means individual run scores are more sensitive to random variance than scores on larger benchmarks. Partial runs on MuSR should be interpreted with extra caution. The three reasoning domains test related but distinct skills, so aggregate scores can mask domain-specific strengths and weaknesses.
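To make the variance caveat concrete, the sketch below computes a Wilson score interval for an accuracy measured on `n` examples. This is a standard statistical approximation, not something Benchscope is known to report; the function name `wilson_interval` and the 95% level (`z = 1.96`) are choices made for the example, and it treats each example as an independent trial.

```python
import math
from typing import Tuple


def wilson_interval(accuracy: float, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion observed on n independent trials."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be a fraction in [0, 1]")
    z2 = z * z
    denom = 1.0 + z2 / n
    center = (accuracy + z2 / (2 * n)) / denom
    half = z * math.sqrt(accuracy * (1.0 - accuracy) / n + z2 / (4 * n * n)) / denom
    return center - half, center + half


if __name__ == "__main__":
    # Scores and sample counts taken from two completed runs listed on this page.
    for label, acc, n in [("756 samples", 0.714, 756), ("20 samples", 0.700, 20)]:
        low, high = wilson_interval(acc, n)
        print(f"{label}: {acc:.1%} with 95% interval {low:.1%} to {high:.1%}")
```

Under these assumptions the interval for a 20-sample run is several times wider than for a full 756-sample run, which is why two scores near 70% on different sample counts are not equally informative.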
Multi-Step Reasoning Benchmark Results
MuSR is useful when you care about multi-step compositional reasoning rather than factual recall. It tests whether a model can integrate information across a context, track constraints, and resolve the answer through several dependent steps.
"MuSR benchmark" and "multi-step reasoning benchmark" refer to the same thing here: MuSR is Benchscope's benchmark page for multi-step contextual reasoning results.
Recent MuSR Runs
- api.code-relay.com on MUSR: failed; score unavailable; latency unavailable; 756 samples.
- 4router.net on MUSR: completed; 71.4%; 2985 ms p50 latency; 756 samples.
- code.newcli.com on MUSR: partial; 60.1%; 1983 ms p50 latency; 756 samples.
- api.groq.com on MUSR: failed; score unavailable; latency unavailable; 1 sample.
- Vertex MaaS / MiniMax M2 on MUSR: completed; 70.0%; 12754 ms p50 latency; 20 samples.
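The run lines above mix failed, partial, and reduced-sample runs, so comparing them directly can mislead. As a rough sketch, they can be parsed into records and filtered to completed runs over the full 756 examples. The line format is inferred from this snapshot only, and `Run`, `parse_run`, and `comparable_runs` are hypothetical names, not part of any Benchscope API.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Pattern inferred from the run lines in this snapshot; not an official format.
RUN_LINE = re.compile(
    r"^- (?P<provider>.+?) on MUSR: (?P<status>\w+); "
    r"(?P<score>[\d.]+%|score unavailable); "
    r"(?P<latency>\d+ ms p50 latency|latency unavailable); "
    r"(?P<samples>\d+) samples?\.$"
)

RUN_LINES = [
    "- api.code-relay.com on MUSR: failed; score unavailable; latency unavailable; 756 samples.",
    "- 4router.net on MUSR: completed; 71.4%; 2985 ms p50 latency; 756 samples.",
    "- code.newcli.com on MUSR: partial; 60.1%; 1983 ms p50 latency; 756 samples.",
    "- api.groq.com on MUSR: failed; score unavailable; latency unavailable; 1 samples.",
    "- Vertex MaaS / MiniMax M2 on MUSR: completed; 70.0%; 12754 ms p50 latency; 20 samples.",
]


@dataclass
class Run:
    provider: str
    status: str
    score: Optional[float]         # percent, None when unavailable
    p50_latency_ms: Optional[int]  # None when unavailable
    samples: int


def parse_run(line: str) -> Run:
    """Parse one run line into a Run record; raise ValueError if it does not match."""
    m = RUN_LINE.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognized run line: {line!r}")
    score = None if m["score"] == "score unavailable" else float(m["score"].rstrip("%"))
    latency = None if m["latency"] == "latency unavailable" else int(m["latency"].split()[0])
    return Run(m["provider"], m["status"], score, latency, int(m["samples"]))


def comparable_runs(runs: List[Run], full_size: int = 756) -> List[Run]:
    """Keep only completed runs that cover the full dataset."""
    return [r for r in runs if r.status == "completed" and r.samples == full_size]


if __name__ == "__main__":
    for run in comparable_runs([parse_run(line) for line in RUN_LINES]):
        print(run)
```

The `full_size` default of 756 comes from the example count listed for version 2024-01; a different dataset version would need a different value.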