Benchmark Reliability

When evaluation artifacts distort capability claims

Overview

Benchmark reliability is the meta-consistency problem: can we trust our measurements of model capability? Leaderboard scores are often treated as objective measures of progress, but in LLM evaluation they can be highly sensitive to prompt templates, data contamination, grading method, and benchmark saturation.

Three threats recur throughout this literature. First, data contamination can inflate apparent performance when benchmark-like content leaks into training corpora. Second, format and prompt sensitivity can shift absolute scores (and sometimes rankings) without any change to the model. Third, LLM-as-judge biases (position, verbosity, style preference) can inject systematic artifacts into automatic evaluation.
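
The contamination threat is usually operationalized as an n-gram overlap check between benchmark items and the training corpus. The sketch below is a toy version of that idea; the whitespace tokenization, the choice of n, and the example strings are illustrative assumptions, not any paper's exact procedure (GPT-3, for instance, used 13-gram overlap with real tokenization at corpus scale):

```python
def ngrams(text, n=8):
    """Lowercased whitespace-token n-grams; real pipelines use proper tokenizers."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def contamination_rate(benchmark_items, training_ngrams, n=8):
    """Fraction of benchmark items sharing at least one n-gram with training data."""
    flagged = sum(1 for item in benchmark_items if ngrams(item, n) & training_ngrams)
    return flagged / len(benchmark_items)

# Toy example: one of two benchmark items overlaps the "training corpus".
train = ngrams("the quick brown fox jumps over the lazy dog near the river bank", n=4)
items = [
    "the quick brown fox jumps over the fence today somewhere else entirely",
    "completely unrelated question about thermodynamics and entropy in closed systems",
]
print(contamination_rate(items, train, n=4))  # 0.5
```

Flagged items are typically either removed from the benchmark or reported separately as a "dirty" split.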

As a result, reliable assessment increasingly relies on portfolio evaluation: static benchmarks combined with dynamic human-preference rankings and execution-grounded tasks, backed by stronger metadata and reproducibility controls.

Key Papers

Contamination and Memorization Risks

Language Models are Few-Shot Learners (Brown et al., 2020) · NeurIPS · link

Early large-scale model paper that explicitly discusses contamination checks and overlap concerns.

Documenting Large Webtext Corpora: C4 Case Study (Dodge et al., 2021) · EMNLP · link

Highlights provenance/filtering limitations in web corpora that complicate clean train-test separation.

Deduplicating Training Data Makes Language Models Better (Lee et al., 2022) · ACL · link

Shows deduplication improves quality and reduces memorization-like artifacts.
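
As a minimal illustration of the dedup idea, the sketch below removes exact duplicates after cheap normalization. This is only the baseline case; Lee et al. additionally handle near-duplicates with suffix-array substring matching and approximate (MinHash-style) methods, which this does not implement:

```python
import hashlib

def normalize(text):
    """Cheap normalization so trivially reformatted copies hash identically."""
    return " ".join(text.lower().split())

def dedup_exact(docs):
    """Keep the first occurrence of each normalized document."""
    seen, kept = set(), []
    for doc in docs:
        h = hashlib.sha256(normalize(doc).encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            kept.append(doc)
    return kept

docs = ["Hello  world", "hello world", "something else"]
print(dedup_exact(docs))  # ['Hello  world', 'something else']
```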

Prompt and Protocol Instability

HELM (Liang et al., 2022) · arXiv / CRFM · link

Landmark multi-metric framework emphasizing standardized scenarios and reporting beyond single scores.

BIG-bench (Srivastava et al., 2022) · TMLR · link

Demonstrates broad capability coverage and substantial task-level variance.

Fantastically Ordered Prompts (Lu et al., 2022) · ACL · link

Demonstration order can dramatically alter outcomes, stressing reproducibility concerns.
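
Order sensitivity can be quantified by scoring every permutation of the few-shot demonstrations and reporting the spread. The sketch below shows the shape of such an audit; `eval_fn` and the hard-coded "model" are stand-ins for a real model-plus-benchmark run, and the specific scores are fabricated for illustration:

```python
from itertools import permutations

def order_sensitivity(demos, eval_fn):
    """Evaluate every demonstration ordering and return the score spread.
    eval_fn(prompt) stands in for a full model + benchmark evaluation."""
    scores = [eval_fn("\n".join(perm)) for perm in permutations(demos)]
    return min(scores), max(scores)

# Toy "model": accuracy depends (artificially) on which demo comes first,
# mimicking the order effects Lu et al. observe with real models.
demos = ["Q: 2+2? A: 4", "Q: 3+3? A: 6", "Q: 5+1? A: 6"]
fake_eval = lambda prompt: 0.9 if prompt.startswith("Q: 2+2?") else 0.6
lo, hi = order_sensitivity(demos, fake_eval)
print(lo, hi)  # 0.6 0.9
```

A wide min-max gap like this is exactly the reproducibility hazard the paper documents: the same model and demos can look much better or worse depending on ordering alone.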

LLM-as-Judge Artifacts

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena (Zheng et al., 2023) · arXiv (LMSYS) · link

Foundational modern judge framework plus explicit bias caveats.

G-Eval (Liu et al., 2023) · arXiv · link

Strong correlation to human judgments in some settings, but sensitive to rubric/prompt design.

Large Language Models are not Fair Evaluators (Wang et al., 2023) · arXiv · link

Documents systematic non-content biases in judge outputs.
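
A standard control for the position bias documented above is to judge each pair twice with positions swapped and discard verdicts that flip. The sketch below assumes `judge` is a placeholder callable returning 'A', 'B', or 'tie', not a real model API:

```python
def consistent_verdicts(judge, pairs):
    """Query the judge twice per pair with positions swapped; keep only verdicts
    that survive the swap, treating flips as position-bias artifacts."""
    results = []
    for a, b in pairs:
        first = judge(a, b)                                  # 'A', 'B', or 'tie'
        swapped = {"A": "B", "B": "A", "tie": "tie"}[judge(b, a)]
        results.append(first if first == swapped else "inconsistent")
    return results

# Toy judge that always prefers whichever answer is shown first -> caught:
position_biased = lambda x, y: "A"
print(consistent_verdicts(position_biased, [("good", "bad")]))  # ['inconsistent']

# Toy judge that prefers the longer answer regardless of position -> survives:
length_pref = lambda x, y: "A" if len(x) > len(y) else "B" if len(y) > len(x) else "tie"
print(consistent_verdicts(length_pref, [("longer answer", "short")]))  # ['A']
```

Note the second example: swap-ordering catches position bias but not verbosity bias, which is why rubric controls and length audits are needed on top of it.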

Preprints & Recent Work

MMLU-Pro (Wang et al., 2024) · arXiv · link

Harder benchmark designed to address saturation and reliability issues in MMLU-like testing.

GPQA (Rein et al., 2023) · arXiv · link

Graduate-level benchmark intended to restore discriminative power where older tests saturate.

SWE-bench (Jimenez et al., 2024) · ICLR · link

Execution-grounded software tasks reduce reliance on purely textual grading artifacts.

Benchmarks & Datasets

| Name | Year | Description | Link |
|------|------|-------------|------|
| HELM | 2022 | Multi-metric, scenario-based evaluation framework | https://crfm.stanford.edu/helm/latest/ |
| MMLU | 2020 | Broad multiple-choice benchmark; now partly saturated | https://arxiv.org/abs/2009.03300 |
| MMLU-Pro | 2024 | Harder MMLU-style benchmark | https://arxiv.org/abs/2406.01574 |
| TruthfulQA | 2022 | Truthfulness benchmark; notably sensitive to answer format | https://arxiv.org/abs/2109.07958 |
| HumanEval | 2021 | Code generation graded by unit tests | https://arxiv.org/abs/2107.03374 |
| MBPP | 2021 | Python programming benchmark | https://arxiv.org/abs/2108.07732 |
| GPQA | 2023 | Google-proof graduate-level QA | https://arxiv.org/abs/2311.12022 |
| SWE-bench | 2024 | Real-world GitHub issue resolution benchmark | https://arxiv.org/abs/2310.06770 |
| Chatbot Arena | 2023– | Live human pairwise model ranking | https://lmarena.ai/ |
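
Execution-graded code benchmarks such as HumanEval report pass@k. The unbiased estimator from the HumanEval paper (Chen et al., 2021), given n samples per problem of which c pass the unit tests, is short enough to show directly:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator (Chen et al., 2021):
    n samples per problem, c of which pass the unit tests."""
    if n - c < k:
        return 1.0  # too few failures to fill a sample of k: success guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# 10 samples, 2 correct: pass@1 is the raw success rate, pass@10 is certain.
print(round(pass_at_k(10, 2, 1), 4))   # 0.2
print(pass_at_k(10, 2, 10))            # 1.0
```

Computing it this way, rather than as the naive empirical rate over k draws, avoids the high variance of small-k sampling.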

Key Takeaways

Key Insights
  • Benchmark scores are highly protocol-dependent; reproducibility metadata is not optional.
  • LLM-as-judge is useful at scale but must include swap-ordering, rubric controls, and bias audits.
  • Reliable capability assessment increasingly requires dynamic and execution-grounded tasks.

See Also