IFEval Benchmark Results

IFEval measures a model's ability to follow verifiable natural-language instructions, such as 'respond in bullet points' or 'use exactly 3 paragraphs'. Scoring is prompt-level: a prompt counts as correct only if the model's response satisfies all of the instructions in that prompt.
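The all-or-nothing rule can be illustrated with a minimal sketch. The two checker functions and the sample responses below are hypothetical stand-ins written for this illustration; they are not IFEval's actual checkers or data, and the real benchmark's verification logic is more involved.

```python
from typing import Callable, List, Tuple

Checker = Callable[[str], bool]


def uses_bullet_points(response: str) -> bool:
    """Hypothetical checker: every non-empty line starts with a bullet marker."""
    lines = [ln.strip() for ln in response.splitlines() if ln.strip()]
    return bool(lines) and all(ln.startswith(("- ", "* ")) for ln in lines)


def has_exactly_three_paragraphs(response: str) -> bool:
    """Hypothetical checker: exactly three blank-line-separated paragraphs."""
    paragraphs = [p for p in response.split("\n\n") if p.strip()]
    return len(paragraphs) == 3


def prompt_is_correct(response: str, checkers: List[Checker]) -> bool:
    """A prompt is correct only if ALL of its instructions are satisfied."""
    return all(check(response) for check in checkers)


def prompt_level_accuracy(results: List[bool]) -> float:
    """Fraction of prompts that were fully satisfied."""
    if not results:
        raise ValueError("no results to score")
    return sum(results) / len(results)


# Illustrative (made-up) responses, each paired with its prompt's instructions.
examples: List[Tuple[str, List[Checker]]] = [
    ("- a\n- b", [uses_bullet_points]),
    ("One.\n\nTwo.", [has_exactly_three_paragraphs]),
    ("- a\n\n- b\n\n- c", [uses_bullet_points, has_exactly_three_paragraphs]),
    # Bullets are fine here, but there is only one paragraph, so the prompt fails.
    ("- a\n- b", [uses_bullet_points, has_exactly_three_paragraphs]),
]

results = [prompt_is_correct(resp, checks) for resp, checks in examples]
accuracy = prompt_level_accuracy(results)
```

In this sketch the last response satisfies one of its two instructions and still counts as incorrect, which is why prompt-level scores are stricter than a per-instruction average would be.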

Benchmark Details

IFEval uses lighteval_native scoring and currently lists 541 examples in version 2024-01.

Public Runs

10 recent public runs are included in this static snapshot.

Recent IFEval Runs

Together / Llama 3.3 70B on IFEVAL: partial; 90.3%; 3824 ms p50 latency; 100 samples.
Groq / Llama 3.3 70B on IFEVAL: completed; 90.0%; 1108 ms p50 latency; 100 samples.
Cerebras / Qwen 3 235B A22B Instruct on IFEVAL: completed; 67.0%; 553 ms p50 latency; 100 samples.
Groq / Llama 3.3 70B on IFEVAL: partial; 70.7%; 907 ms p50 latency; 100 samples.
Groq / Llama 3.3 70B on IFEVAL: completed; 100.0%; 1121 ms p50 latency; 10 samples.
