Public LLM Eval Runs

Browse shared Benchscope runs and compare provider-hosted model endpoints by score, latency, throughput, sample size, prompt mode, and run status.

What Runs Include

Each run records the endpoint, model family, hosting provider, benchmark, prompt mode, run status, score, latency metrics, sample count, and per-example outputs.
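
As a rough sketch only, the shape of such a record might look like the TypeScript interface below; the field names and types are illustrative assumptions, not Benchscope's actual schema.

```typescript
// Illustrative shape of a public run record. Field names and types are
// assumptions for exposition, not Benchscope's actual schema.
interface EvalRun {
  endpoint: string;                   // provider-hosted model endpoint
  modelFamily: string;                // e.g. "Llama 3.3 70B"
  provider: string;                   // e.g. "Groq", "Together", "Cerebras"
  benchmark: string;                  // e.g. "MUSR", "MATH"
  promptMode: "canonical" | "custom"; // prompt wording used for the run
  status: "completed" | "partial";    // run status (lifecycle state)
  score: number;                      // fraction of samples answered correctly
  latencyP50Ms: number;               // median (p50) per-request latency
  sampleCount: number;                // number of benchmark examples evaluated
  examples: { prompt: string; output: string; correct: boolean }[]; // per-example outputs
}
```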

How To Compare Runs

Use canonical-prompt runs for the cleanest comparisons. Partial runs and custom-prompt runs remain useful, but Benchscope flags them because sample selection and prompt wording can both shift outcomes.
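
As a sketch of that guidance, assuming the illustrative EvalRun shape above, a comparison view could restrict itself to completed canonical-prompt runs on one benchmark before ranking by score:

```typescript
// Keep only the cleanest-to-compare runs (completed, canonical prompts,
// same benchmark), then rank highest score first. Assumes the
// illustrative EvalRun interface sketched earlier.
function comparableRuns(runs: EvalRun[], benchmark: string): EvalRun[] {
  return runs
    .filter(
      (run) =>
        run.benchmark === benchmark &&
        run.status === "completed" &&
        run.promptMode === "canonical"
    )
    .sort((a, b) => b.score - a.score);
}
```

Partial and custom-prompt runs can still be inspected individually; the filter simply keeps them out of head-to-head rankings.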

Recent Public Runs

  • Groq / Llama 3.3 70B on MUSR: completed; 35.0%; 346 ms p50 latency; 20 samples.
  • Together / Llama 3.3 70B on MUSR: completed; 30.0%; 597 ms p50 latency; 20 samples.
  • Together / Llama 3.3 70B on MATH: partial; 44.4%; 6133 ms p50 latency; 10 samples.
  • Cerebras / Qwen 3 235B A22B Instruct on MATH: completed; 60.0%; 2118 ms p50 latency; 20 samples.
  • Groq / Llama 3.3 70B on MATH: completed; 30.4%; 1967 ms p50 latency; 23 samples.
  • Groq / Llama 3.3 70B on MUSR: completed; 50.0%; 311 ms p50 latency; 10 samples.
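
The sample counts above are small, which is worth keeping in mind when reading score differences. As an illustration (a standard Wilson score interval, not a Benchscope feature), the uncertainty around a 20-sample score is wide:

```typescript
// 95% Wilson score interval for a proportion: shows how much uncertainty
// sits behind an accuracy score measured on only n samples.
function wilsonInterval(correct: number, n: number, z = 1.96): [number, number] {
  const p = correct / n;
  const denom = 1 + (z * z) / n;
  const center = (p + (z * z) / (2 * n)) / denom;
  const half =
    (z / denom) * Math.sqrt((p * (1 - p)) / n + (z * z) / (4 * n * n));
  return [center - half, center + half];
}

// 35.0% on 20 samples is 7/20; the interval spans roughly 18% to 57%,
// overlapping heavily with 30.0% on 20 samples (~15% to 52%).
console.log(wilsonInterval(7, 20));
```

At these sample sizes, the two Llama 3.3 70B MUSR scores above are statistically hard to distinguish.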
