IFEval Benchmark Leaderboard

IFEval measures a model's ability to follow verifiable natural language instructions such as 'respond in bullet points' or 'use exactly 3 paragraphs'. Scoring is prompt-level: the model must satisfy ALL instructions in a prompt to count as correct.

Benchmark Details

IFEval uses lighteval_native scoring and currently lists 541 examples in version 2024-01.

Public Runs

10 recent public runs are included in this static snapshot.

What IFEval Measures

IFEval (Instruction Following Evaluation) presents 541 prompts, each containing one or more explicit verifiable instructions such as "write your response in bullet points", "include the word X", "respond with exactly N paragraphs", or "do not use any commas". The benchmark tests whether models can reliably follow structural formatting constraints that are easy to verify programmatically.

IFEval is scored using rule-based verification. Each instruction has a corresponding check function that examines the model output for compliance. The score reported here is prompt-level: the fraction of evaluated prompts whose output passes every instruction check, not the fraction of individual instructions followed.
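The rule-based checks and the two common ways of aggregating them (prompt-level and instruction-level) can be sketched in Python as follows. This is a minimal illustration, not the lighteval implementation: the function names, the bullet and paragraph rules, and the sample outputs are all assumptions made for the example.

```python
import re
from typing import Callable, List

# Hypothetical check functions. The names and exact rules are illustrative
# assumptions, not the checks shipped with IFEval or lighteval.

def uses_bullet_points(text: str) -> bool:
    """True if every non-empty line starts with a '- ' or '* ' bullet marker."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    return bool(lines) and all(ln.startswith(("- ", "* ")) for ln in lines)

def has_exact_paragraphs(text: str, n: int) -> bool:
    """True if the text has exactly n paragraphs separated by blank lines."""
    paragraphs = [p for p in re.split(r"\n\s*\n", text.strip()) if p.strip()]
    return len(paragraphs) == n

def has_no_commas(text: str) -> bool:
    """True if the text contains no comma characters."""
    return "," not in text

def run_checks(output: str, checks: List[Callable[[str], bool]]) -> List[bool]:
    """Apply every check attached to one prompt to that prompt's output."""
    return [check(output) for check in checks]

def prompt_level_accuracy(results: List[List[bool]]) -> float:
    """Fraction of prompts for which ALL instruction checks passed."""
    return sum(all(r) for r in results) / len(results)

def instruction_level_accuracy(results: List[List[bool]]) -> float:
    """Fraction of individual instruction checks that passed."""
    flat = [ok for r in results for ok in r]
    return sum(flat) / len(flat)

# Two made-up model outputs, each paired with the checks for its prompt.
output_a = "- first point\n- second point"
output_b = "One, two.\n\nThree."

results = [
    run_checks(output_a, [uses_bullet_points, has_no_commas]),
    run_checks(output_b, [lambda t: has_exact_paragraphs(t, 2), has_no_commas]),
]

print(prompt_level_accuracy(results))       # 0.5
print(instruction_level_accuracy(results))  # 0.75
```

In this toy case the second output meets the paragraph count but contains a comma, so it fails at the prompt level while still contributing one passed instruction. This is why a prompt-level score is never higher than the instruction-level score on the same run.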

Why IFEval Matters and Caveats

Instruction following is a practical capability for deployed use cases. A model that scores highly on knowledge benchmarks but ignores explicit formatting instructions may be unsuitable for applications requiring structured outputs. IFEval results let you compare whether provider-hosted endpoints maintain instruction-following capability, which can degrade with quantization or system prompt interference.

IFEval is highly sensitive to prompt formatting. Custom prompts or system prompt additions can conflict with benchmark instructions and produce misleading results. The instruction set tests a specific and narrow type of compliance (verifiable formatting rules), which does not fully generalize to semantic instruction following or task completion quality.

Top IFEval Models and Instruction-Following Scores

Use the IFEval leaderboard to compare which LLM endpoints most reliably follow explicit formatting and structural instructions. This is especially useful for applications that require JSON-like structure, exact paragraph counts, banned words, required terms, bullet formatting, or other verifiable output constraints.

Model-specific searches such as Llama 3.3 70B IFEval score should be interpreted at the endpoint level. The same model family can score differently across providers because hosted deployments may differ in quantization, system prompts, and serving configuration.

When to Use IFEval

Use IFEval when instruction compliance matters more than broad knowledge recall. Pair it with MMLU for knowledge, GSM8K or MATH for math reasoning, and MuSR for multi-step contextual reasoning before choosing a production endpoint.
