Model Benchmark Results Across Providers
A model family is the underlying model, such as Llama 3.3 70B, GPT-4o, or Gemini 1.5 Pro. A hosted endpoint is one provider's deployment of that model. The same model family can produce different benchmark results depending on which provider hosts it.
Why Provider Matters
Provider-hosted versions of the same model family may differ in quantization level, serving infrastructure, rate limits, and configuration. These differences can affect accuracy, latency, and instruction-following behavior. Benchscope treats each provider-hosted endpoint as a separate comparison target so these differences are visible.
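To make the "separate comparison target" idea concrete, here is a minimal Python sketch of how results might be keyed by endpoint rather than by model family. The `Endpoint` class, the `mean_score_by_endpoint` helper, the provider names, and the scores are all hypothetical illustrations, not Benchscope's actual schema or data.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Endpoint:
    """One provider's deployment of a model family (hypothetical schema)."""
    family: str
    provider: str


def mean_score_by_endpoint(runs):
    """Average scores per endpoint, never pooling runs across providers."""
    scores = defaultdict(list)
    for endpoint, score in runs:
        scores[endpoint].append(score)
    return {endpoint: mean(values) for endpoint, values in scores.items()}


# Made-up scores for illustration only; these are not Benchscope results.
runs = [
    (Endpoint("Llama 3.3 70B", "provider-a"), 0.80),
    (Endpoint("Llama 3.3 70B", "provider-a"), 0.90),
    (Endpoint("Llama 3.3 70B", "provider-b"), 0.70),
]
averages = mean_score_by_endpoint(runs)
for endpoint, avg in averages.items():
    print(f"{endpoint.family} @ {endpoint.provider}: {avg:.2f}")
```

Because the key includes the provider, two deployments of the same family stay in separate rows, so a gap caused by quantization or configuration is not averaged away.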
How to Use Model Pages
Open a model family page to see all of its hosted endpoints and their benchmark results side by side. Filter by benchmark or provider. MMLU and MATH runs have the broadest cross-provider coverage, so start with those when comparing providers. The methodology page explains how endpoints and model families are defined.
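The same filter-and-compare workflow can be sketched offline in Python, assuming you have run records in hand. The `Run` fields, the `filter_runs` and `side_by_side` helpers, and the sample values below are assumptions for illustration; they do not describe a Benchscope API or real results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """A single benchmark run (hypothetical record layout)."""
    family: str
    provider: str
    benchmark: str
    score: float


def filter_runs(runs, family, benchmark=None, provider=None):
    """Keep runs for one family, optionally narrowed by benchmark or provider."""
    return [
        r for r in runs
        if r.family == family
        and (benchmark is None or r.benchmark == benchmark)
        and (provider is None or r.provider == provider)
    ]


def side_by_side(runs):
    """Build benchmark -> {provider: score}; a later run overwrites an earlier one."""
    table = {}
    for r in runs:
        table.setdefault(r.benchmark, {})[r.provider] = r.score
    return table


# Made-up records for illustration only.
runs = [
    Run("Llama 3.3 70B", "provider-a", "MMLU", 0.81),
    Run("Llama 3.3 70B", "provider-b", "MMLU", 0.79),
    Run("Llama 3.3 70B", "provider-a", "MATH", 0.55),
    Run("Qwen3 32B", "provider-a", "MMLU", 0.77),
]
mmlu_only = filter_runs(runs, "Llama 3.3 70B", benchmark="MMLU")
table = side_by_side(filter_runs(runs, "Llama 3.3 70B"))
```

A benchmark row that lists only one provider in `table` signals thin cross-provider coverage, which is the reason to start with benchmarks that most endpoints have run.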
Model Families With Public Runs
- Llama 3.3 70B: 25 public runs across 3 endpoints.
- Qwen 3 235B A22B Instruct: 11 public runs across 2 endpoints.
- MiniMax M2: 6 public runs across 2 endpoints.
- Z.ai GLM 4.7: 6 public runs across 1 endpoint.
- Qwen3 32B: 4 public runs across 1 endpoint.
- Llama 3.1 8B: 2 public runs across 3 endpoints.
- GPT-OSS 120B: 1 public run across 2 endpoints.
- GPT-OSS 20B: 1 public run across 1 endpoint.
- Kimi K2.5: 1 public run across 1 endpoint.
Read the methodology for how Benchscope defines model families, hosting providers, and endpoints, and for what makes results comparable across providers.
Featured Model Pages
- Llama 3.3 70B benchmark results — Compare Llama 3.3 70B benchmark results across providers and endpoints.
- Qwen 3 235B A22B benchmark results — Compare Qwen 3 235B A22B benchmark results across providers and endpoints.
- DeepSeek R1 benchmark results — Review DeepSeek R1 benchmark scores, latency, and provider coverage.
- Llama 4 Scout 17B benchmark results — Review Llama 4 Scout 17B benchmark scores, latency, and provider coverage.
- GPT-OSS 120B benchmark results — Review GPT-OSS 120B benchmark scores, latency, and provider coverage.
Editorial Comparisons