MuSR Benchmark Results
MuSR evaluates multi-step reasoning over long narratives across murder mysteries, object placement tracking, and team allocation. Models must reason through chains of deductions to select the correct answer.
Benchmark Details
MuSR uses exact_match_choice scoring and currently lists 756 examples in version 2024-01.
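As an illustration of what an exact-match choice metric typically computes, here is a minimal sketch. It assumes each prediction and reference is a single choice label and that comparison ignores case and surrounding whitespace. The function names and the normalization rule are hypothetical, not taken from the benchmark's actual implementation.

```python
def normalize(choice: str) -> str:
    """Trim whitespace and lowercase (an assumed normalization rule)."""
    return choice.strip().lower()


def exact_match_choice(predictions: list[str], references: list[str]) -> float:
    """Return the fraction of predictions equal to the reference choice."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("no examples to score")
    correct = sum(
        normalize(pred) == normalize(ref)
        for pred, ref in zip(predictions, references)
    )
    return correct / len(references)


# Hypothetical toy data: three of the four predictions match their reference.
score = exact_match_choice(["B", "a ", "C", "D"], ["B", "A", "A", "D"])
print(f"{score:.1%}")  # prints "75.0%"
```

Under this scheme a prediction earns credit only when it names the reference choice; there is no partial credit for a near miss.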
Public Runs
8 recent public runs are included in this static snapshot.
Recent MuSR Runs
- Groq / Llama 3.3 70B on MuSR: completed; 35.0%; 346 ms p50 latency; 20 samples.
- Together / Llama 3.3 70B on MuSR: completed; 30.0%; 597 ms p50 latency; 20 samples.
- Groq / Llama 3.3 70B on MuSR: completed; 50.0%; 311 ms p50 latency; 10 samples.
- Groq / Llama 3.3 70B on MuSR: completed; 100.0%; 617 ms p50 latency; 1 sample.
- Groq / Llama 3.3 70B on MuSR: failed; score unavailable; latency unavailable; 1 sample.
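The completed runs above use between 1 and 20 samples, so their scores carry wide uncertainty. The sketch below attaches a 95% Wilson score interval to each one as a rough check. The page itself reports no intervals. The correct-answer counts are derived here from the listed percentages and sample counts (for example, 35.0% of 20 is 7), and the function name and run labels are hypothetical.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives ~95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot.
    return max(0.0, center - half), min(1.0, center + half)


# (label, correct, samples); counts are derived from the listed scores.
runs = [
    ("Groq run, 20 samples", 7, 20),
    ("Together run, 20 samples", 6, 20),
    ("Groq run, 10 samples", 5, 10),
    ("Groq run, 1 sample", 1, 1),
]

for label, correct, total in runs:
    low, high = wilson_interval(correct, total)
    print(f"{label}: {correct}/{total} = {correct / total:.1%}, "
          f"95% interval {low:.1%} to {high:.1%}")
```

With 20 samples the interval spans several tens of percentage points, so the gap between the 35.0% and 30.0% runs is well within sampling noise, and the single-sample 100.0% run says very little on its own.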