Llama 3.3 70B Benchmark Results

Llama 3.3 70B is available through two provider-hosted endpoints: Together AI and Groq. Benchscope records public evaluation runs separately for each endpoint so that provider differences are visible.

Provider Endpoints

Llama 3.3 70B has 27 public runs across two providers. Provider-hosted versions of the same model can differ in quantization, infrastructure, and serving configuration, all of which affect benchmark results independently of model capability.

How to Compare Endpoints

To compare endpoints fairly, use canonical-prompt runs on the same benchmark. When two providers serve the same model, score differences reflect hosting differences rather than differences in model capability. Check the methodology for how runs are defined and what makes them comparable.
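The comparison rule above can be sketched in a few lines. This is an illustrative sketch, not Benchscope's actual code: the record fields (`provider`, `benchmark`, `prompt`, `score`) and the sample scores are assumptions for the example.

```python
# Hypothetical sketch: best canonical-prompt score per provider on one
# benchmark. Field names and sample data are illustrative assumptions.
from collections import defaultdict

def compare_endpoints(runs, benchmark):
    """Return the best canonical-prompt score per provider for one benchmark."""
    best = defaultdict(float)
    for run in runs:
        # Only canonical-prompt runs on the same benchmark are comparable.
        if run["benchmark"] == benchmark and run["prompt"] == "canonical":
            best[run["provider"]] = max(best[run["provider"]], run["score"])
    return dict(best)

runs = [
    {"provider": "Together AI", "benchmark": "MMLU", "prompt": "canonical", "score": 86.5},
    {"provider": "Groq", "benchmark": "MMLU", "prompt": "canonical", "score": 86.1},
    {"provider": "Groq", "benchmark": "MMLU", "prompt": "custom", "score": 88.0},  # excluded
]
print(compare_endpoints(runs, "MMLU"))
```

Filtering out non-canonical prompts before comparing keeps the remaining score gap attributable to hosting rather than prompt wording.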

Best Public Scores

  • IFEval: 100.0% (best of 7 public runs).
  • MuSR: 100.0% (best of 6 public runs).
  • GSM8K: 97.0% (1 public run).
  • GSM8K: 96.0% (best of 2 public runs).
  • IFEval: 90.3% (1 public run).
  • MMLU: 86.5% (best of 3 public runs).

Benchscope is a JavaScript app. If the interactive interface does not load, enable JavaScript or use the links above to reach the main public sections.