Benchmark Reliability
When evaluation artifacts distort capability claims
Overview
Benchmark reliability is the meta-consistency problem of evaluation: can we trust our measurements of model capability? Leaderboard scores are often read as objective measures of progress, but in LLM evaluation they can be highly sensitive to prompt templates, data contamination, grading method, and benchmark saturation.
Three threats recur throughout this literature. First, data contamination can inflate apparent performance when benchmark-like content appears in training corpora. Second, format and prompt sensitivity can change absolute scores, and sometimes rankings, without any change to the model. Third, LLM-as-judge biases (position, verbosity, style preference) can inject systematic artifacts into automatic evaluation.
As a result, reliable assessment increasingly relies on portfolio evaluation: static benchmarks plus dynamic human-preference rankings plus execution-grounded tasks, backed by stronger metadata and reproducibility controls.
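The contamination threat above is often probed with simple n-gram overlap checks between training documents and test items. The sketch below is a minimal illustration in the spirit of the overlap analyses in GPT-3 (Brown et al., 2020); the function names, the word-level tokenization, and the default `n=8` are illustrative assumptions, not taken from any real harness.

```python
# Hypothetical n-gram overlap check for train/test contamination.
# All names here are illustrative; real pipelines use tokenizer-level
# n-grams and much larger corpora.

def ngrams(text: str, n: int) -> set:
    """Return the set of word-level n-grams in a lowercased text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(train_doc: str, test_item: str, n: int = 8) -> bool:
    """Flag a test item if any n-gram also appears in the training document."""
    return bool(ngrams(train_doc, n) & ngrams(test_item, n))
```

Real contamination studies vary `n`, normalize punctuation, and report overlap rates rather than binary flags; this sketch only shows the core set-intersection idea.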
Key Papers
Contamination and Memorization Risks
Language Models are Few-Shot Learners (Brown et al., 2020) · NeurIPS · link
Early large-scale model paper that explicitly discusses contamination checks and overlap concerns.
Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus (Dodge et al., 2021) · EMNLP · link
Highlights provenance/filtering limitations in web corpora that complicate clean train-test separation.
Deduplicating Training Data Makes Language Models Better (Lee et al., 2022) · ACL · link
Shows deduplication improves quality and reduces memorization-like artifacts.
Prompt and Protocol Instability
HELM (Liang et al., 2022) · arXiv / CRFM · link
Landmark multi-metric framework emphasizing standardized scenarios and reporting beyond single scores.
BIG-bench (Srivastava et al., 2022) · TMLR · link
Demonstrates broad capability coverage and substantial task-level variance.
Fantastically Ordered Prompts and Where to Find Them (Lu et al., 2022) · ACL · link
Shows that few-shot demonstration order alone can swing outcomes dramatically, underscoring reproducibility concerns.
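The order-sensitivity result above is easy to reproduce in outline: run the same few-shot prompt under every demonstration permutation and report the score spread. The toy scorer below is a stand-in for a real LM call (its behavior is invented purely to make the spread visible); only the permutation-and-spread bookkeeping is the point.

```python
from itertools import permutations

# Toy illustration of demonstration-order sensitivity (Lu et al., 2022).
# `toy_score` is an invented stand-in for querying a real LM.

def toy_score(demo_order: tuple) -> float:
    """Stand-in scorer: pretend accuracy depends on which demo comes first."""
    return 0.8 if demo_order[0] == "pos" else 0.6

demos = ("pos", "neg", "neg")
scores = [toy_score(order) for order in set(permutations(demos))]
spread = max(scores) - min(scores)  # order-induced score range
print(f"order-induced spread: {spread:.2f}")
```

With a real model, the same loop (sampled rather than exhaustive for larger demo sets) yields the per-task variance that reproducibility reports should disclose alongside the mean score.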
LLM-as-Judge Artifacts
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena (Zheng et al., 2023) · arXiv (LMSYS) · link
Foundational modern judge framework plus explicit bias caveats.
G-Eval (Liu et al., 2023) · arXiv · link
Strong correlation with human judgments in some settings, but sensitive to rubric and prompt design.
Large Language Models are not Fair Evaluators (Wang et al., 2023) · arXiv · link
Documents systematic non-content biases in judge outputs.
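The standard mitigation for the position bias these papers document is a swap-ordering control: query the judge twice with the answer slots exchanged and keep only consistent verdicts. The sketch below assumes only a `judge(a, b) -> "A" | "B" | "tie"` interface; the function name and the "inconsistent" sentinel are illustrative choices, not any library's API.

```python
# Sketch of a swap-ordering control for LLM-as-judge position bias
# (Zheng et al., 2023; Wang et al., 2023). `judge` is a stand-in for
# a judge-model call returning "A", "B", or "tie".

def debiased_verdict(judge, answer_a: str, answer_b: str) -> str:
    """Query the judge in both orders; keep the verdict only if consistent."""
    first = judge(answer_a, answer_b)    # judge sees A in slot 1
    second = judge(answer_b, answer_a)   # same pair, slots swapped
    # Map the swapped verdict back into the original A/B frame.
    second_unswapped = {"A": "B", "B": "A", "tie": "tie"}[second]
    return first if first == second_unswapped else "inconsistent"
```

A judge that always prefers slot 1 produces "inconsistent" for every pair under this control, which is exactly the signal a bias audit wants to surface.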
Preprints & Recent Work
MMLU-Pro (Wang et al., 2024) · arXiv · link
Harder benchmark designed to address saturation and reliability issues in MMLU-like testing.
GPQA (Rein et al., 2023) · arXiv · link
Graduate-level benchmark intended to restore discriminative power where older tests saturate.
SWE-bench (Jimenez et al., 2024) · ICLR · link
Execution-grounded software tasks reduce reliance on purely textual grading artifacts.
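Execution grounding replaces textual grading with a pass/fail signal from running code. The sketch below shows the core idea in the spirit of HumanEval-style checks: a completion passes iff its unit tests execute without raising. The two-string task format is an illustrative simplification (real harnesses sandbox execution and enforce timeouts).

```python
# Minimal sketch of execution-grounded grading: a candidate passes iff
# its unit tests run cleanly. Real harnesses sandbox this; bare exec()
# is shown here only to illustrate the grading signal.

def passes_tests(candidate_src: str, test_src: str) -> bool:
    """Exec the candidate, then its tests, in a shared namespace."""
    ns = {}
    try:
        exec(candidate_src, ns)   # define the candidate function
        exec(test_src, ns)        # assertions raise on failure
        return True
    except Exception:
        return False

good = "def add(a, b):\n    return a + b\n"
bad = "def add(a, b):\n    return a - b\n"
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"
```

Because the grade comes from execution rather than string matching or a judge model, it sidesteps both format sensitivity and judge bias, at the cost of only covering behaviors the tests exercise.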
Blog Posts & Practitioner Resources
The Leaderboard Illusion (Kapoor & Narayanan, 2024) — https://www.aisnakeoil.com/p/the-leaderboard-illusion
Widely cited critique of benchmark over-interpretation and ranking instability.
Open LLM Leaderboard docs (Hugging Face, ongoing) — https://huggingface.co/docs/leaderboards/open_llm_leaderboard/about
Important caveats on versioning, benchmark composition, and comparability.
LMSYS methodology posts (LMSYS, ongoing) — https://lmsys.org/blog/
Practical reflections on pairwise ranking strengths and limitations.
Benchmarks & Datasets
| Name | Year | Description | Link |
|---|---|---|---|
| HELM | 2022 | Multi-metric, scenario-based evaluation framework | https://crfm.stanford.edu/helm/latest/ |
| MMLU | 2020 | Broad MCQ benchmark; now partly saturated | https://arxiv.org/abs/2009.03300 |
| MMLU-Pro | 2024 | Harder MMLU-style benchmark | https://arxiv.org/abs/2406.01574 |
| TruthfulQA | 2022 | Truthfulness benchmark whose scores are known to be format-sensitive | https://arxiv.org/abs/2109.07958 |
| HumanEval | 2021 | Code generation via unit tests | https://arxiv.org/abs/2107.03374 |
| MBPP | 2021 | Python programming benchmark | https://arxiv.org/abs/2108.07732 |
| GPQA | 2023 | Google-proof graduate-level QA | https://arxiv.org/abs/2311.12022 |
| SWE-bench | 2024 | Real-world GitHub issue resolution benchmark | https://arxiv.org/abs/2310.06770 |
| Chatbot Arena | 2023– | Live human pairwise model ranking | https://lmarena.ai/ |
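For the code benchmarks in the table (HumanEval, MBPP), scores are conventionally reported with the unbiased pass@k estimator from the HumanEval paper (Chen et al., 2021): given n samples per problem of which c pass, it estimates the probability that at least one of k drawn samples passes.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # too few failures for a k-sample draw to miss every pass
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with n=2 samples of which c=1 passes, pass@1 is 0.5; naively averaging per-sample pass rates instead of using this estimator biases pass@k downward for k > 1.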
GitHub Repos
EleutherAI/lm-evaluation-harness — https://github.com/EleutherAI/lm-evaluation-harness
De facto standard open benchmarking harness.
stanford-crfm/helm — https://github.com/stanford-crfm/helm
Holistic benchmark protocols and implementation.
lm-sys/FastChat — https://github.com/lm-sys/FastChat
MT-Bench and Arena ecosystem tooling.
openai/evals — https://github.com/openai/evals
Custom eval authoring for regression and policy checks.
Key Takeaways
- Benchmark scores are highly protocol-dependent; reproducibility metadata is not optional.
- LLM-as-judge is useful at scale but must include swap-ordering, rubric controls, and bias audits.
- Reliable capability assessment increasingly requires dynamic and execution-grounded tasks.
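The "reproducibility metadata is not optional" point can be made concrete as a minimal score record that travels with every reported number. The field names below are assumptions for illustration, not any standard schema; the point is that benchmark version, prompt template, shot count, and grading method must be captured, not just the score.

```python
import json
from dataclasses import dataclass, asdict

# Illustrative reproducibility record for a single reported score.
# Every field name here is an assumption, not a standard schema.

@dataclass
class EvalRecord:
    benchmark: str          # e.g. "MMLU-Pro"
    benchmark_version: str  # dataset/release identifier
    prompt_template: str    # template ID, since scores are template-sensitive
    n_shot: int             # few-shot count (and ideally demo order/seed)
    judge_model: str        # judge identifier, or "programmatic" for exec/MCQ
    score: float

record = EvalRecord("MMLU-Pro", "2024-06", "mcq-v1", 5, "programmatic", 0.61)
print(json.dumps(asdict(record)))
```

Emitting such a record alongside each leaderboard entry makes the protocol dependencies discussed throughout this section auditable rather than implicit.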