Model Benchmark Results Across Providers

A model family is the underlying model, for example Llama 3.3 70B, GPT-4o, or Gemini 1.5 Pro. A hosted endpoint is a specific provider's deployment of that model. The same model family can produce different benchmark results depending on which provider hosts it.

Why Provider Matters

Provider-hosted versions of the same model family may differ in quantization level, serving infrastructure, rate limits, and configuration. These differences can affect accuracy, latency, and instruction-following behavior. Benchscope treats each provider-hosted endpoint as a separate comparison target so these differences are visible.
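As a rough illustration of what "a separate comparison target" can mean in practice, the sketch below keys every run by the pair (model family, provider) instead of by model family alone. The `Endpoint` and `Run` classes, the provider names, and the scores are all hypothetical placeholders chosen for this example; they are not Benchscope's actual schema or real results.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    """One provider's deployment of a model family (hypothetical schema)."""
    family: str
    provider: str


@dataclass(frozen=True)
class Run:
    """A single benchmark run against one endpoint (hypothetical schema)."""
    endpoint: Endpoint
    benchmark: str
    score: float  # placeholder value, not a real result


def group_by_endpoint(runs):
    """Group runs per endpoint so results from different providers stay separate."""
    grouped = defaultdict(list)
    for run in runs:
        grouped[run.endpoint].append(run)
    return dict(grouped)


# Invented sample data: same model family, two made-up providers.
runs = [
    Run(Endpoint("Llama 3.3 70B", "provider-a"), "MMLU", 0.80),
    Run(Endpoint("Llama 3.3 70B", "provider-b"), "MMLU", 0.78),
    Run(Endpoint("Llama 3.3 70B", "provider-a"), "MATH", 0.60),
]

grouped = group_by_endpoint(runs)
for endpoint, endpoint_runs in grouped.items():
    print(endpoint.family, endpoint.provider, len(endpoint_runs))
```

Because the grouping key includes the provider, two deployments of the same family never collapse into a single averaged row, which is the property the paragraph above describes.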

How to Use Model Pages

Open a model family page to see all hosted endpoints and their benchmark results side by side. Filter by benchmark or provider to narrow the comparison. MMLU and MATH runs have the widest cross-provider coverage, so they are the best starting point for comparing endpoints. The methodology page explains how endpoints and model families are defined.
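The side-by-side view with filters can be pictured as a small pivot over run records. The sketch below is one possible way to express it, assuming each run is a plain dictionary with `family`, `provider`, `benchmark`, and `score` keys. The function name `side_by_side`, the record layout, the provider names, and the scores are assumptions made up for illustration, not the interface's real data format.

```python
def side_by_side(runs, family, benchmark=None, provider=None):
    """Return {provider: {benchmark: score}} for one model family.

    `benchmark` and `provider` are optional filters; None means "no filter".
    """
    table = {}
    for run in runs:
        if run["family"] != family:
            continue
        if benchmark is not None and run["benchmark"] != benchmark:
            continue
        if provider is not None and run["provider"] != provider:
            continue
        table.setdefault(run["provider"], {})[run["benchmark"]] = run["score"]
    return table


# Invented sample records; the scores are placeholders, not real results.
sample_runs = [
    {"family": "Llama 3.3 70B", "provider": "provider-a", "benchmark": "MMLU", "score": 0.80},
    {"family": "Llama 3.3 70B", "provider": "provider-b", "benchmark": "MMLU", "score": 0.78},
    {"family": "Llama 3.3 70B", "provider": "provider-a", "benchmark": "MATH", "score": 0.60},
    {"family": "Qwen3 32B", "provider": "provider-a", "benchmark": "MMLU", "score": 0.75},
]

mmlu_only = side_by_side(sample_runs, "Llama 3.3 70B", benchmark="MMLU")
one_provider = side_by_side(sample_runs, "Llama 3.3 70B", provider="provider-a")
print(mmlu_only)
print(one_provider)
```

Filtering on a widely covered benchmark such as MMLU yields a row for every provider that has a run, which is why those benchmarks are the most useful for cross-provider comparison.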

Model Families With Public Runs

  • Llama 3.3 70B: 25 public runs across 3 endpoints.
  • Qwen 3 235B A22B Instruct: 11 public runs across 2 endpoints.
  • MiniMax M2: 6 public runs across 2 endpoints.
  • Z.ai GLM 4.7: 6 public runs across 1 endpoint.
  • Qwen3 32B: 4 public runs across 1 endpoint.
  • Llama 3.1 8B: 2 public runs across 3 endpoints.
  • GPT-OSS 120B: 1 public run across 2 endpoints.
  • GPT-OSS 20B: 1 public run across 1 endpoint.
  • Kimi K2.5: 1 public run across 1 endpoint.

Read the methodology for how Benchscope defines model families, hosting providers, and endpoints, and for what makes results comparable across providers.

Editorial Comparisons
