Resources & Tools

Surveys, benchmarks, frameworks, and practitioner infrastructure

Overview

This page is a curated hub for navigating the LLM consistency literature and tooling ecosystem. The core challenge is fragmentation: relevant work is spread across hallucination papers, robust NLP methods, calibration studies, benchmark design, and applied eval engineering. This hub consolidates those strands into one actionable map.

A recurring lesson across the literature and the practitioner ecosystem (e.g., HELM, Dynabench, MT-Bench/Arena) is that no single metric or benchmark is sufficient. More reliable workflows combine static task suites, perturbation tests, preference data, and task-grounded checks, all under full pipeline versioning.

Use this page as a “build sheet” for your evaluation stack: start with broad references, pick benchmark families by risk profile, and adopt tooling that supports reproducible reruns and regression testing.
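The "stack" idea above can be made concrete with a minimal sketch: a static exact-match suite layered with a perturbation consistency check. Everything here is illustrative, assuming a generic model-calling function `ask`; the prompts and canned answers are stand-ins, not part of any benchmark named on this page.

```python
# Minimal sketch of a two-layer evaluation "stack": a static exact-match
# suite plus a perturbation consistency check. `ask` is a stand-in for any
# model-calling function; the prompts and canned answers are illustrative.

def exact_match(ask, cases):
    """Static suite: fraction of prompts whose answer matches the reference."""
    return sum(ask(p).strip() == ref for p, ref in cases) / len(cases)

def perturbation_consistency(ask, prompt, paraphrases):
    """Perturbation test: fraction of paraphrases that keep the original answer."""
    base = ask(prompt).strip()
    return sum(ask(p).strip() == base for p in paraphrases) / len(paraphrases)

# Stub model: canned answers stand in for a real LLM call.
_answers = {
    "What is 2+2?": "4",
    "Compute 2+2.": "4",
    "What does two plus two equal?": "4",
}
ask = lambda prompt: _answers.get(prompt, "unknown")

static_score = exact_match(ask, [("What is 2+2?", "4")])
consistency = perturbation_consistency(
    ask, "What is 2+2?",
    ["Compute 2+2.", "What does two plus two equal?"],
)
```

A real stack would add preference data and task-grounded execution (e.g., running generated code against tests) as further layers, each reported separately rather than averaged away.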

Key Papers

Foundational Survey & Taxonomy References

A Survey of Large Language Models (Zhao et al., 2023) · arXiv

Broad technical overview of LLM foundations and evaluation context.

A Survey of Hallucination in Natural Language Generation (Ji et al., 2023) · ACM CSUR

Canonical hallucination taxonomy and mitigation map.

On the Evaluation of Large Language Models (Chang et al., 2024) · arXiv

Comprehensive treatment of evaluation pitfalls and protocol design.

HELM: Holistic Evaluation of Language Models (Liang et al., 2022) · arXiv/CRFM

Landmark framework for multi-metric, scenario-specific model assessment.

Trustworthy Evaluation Frameworks

TrustLLM (Huang et al., 2024) · arXiv

Framework and benchmark for trust dimensions spanning truthfulness, robustness, safety, and fairness.

Dynabench (Kiela et al., 2021) · NAACL

Dynamic benchmark philosophy to reduce static-test overfitting.

Preprints & Recent Work

MMLU-Pro (Wang et al., 2024) · arXiv

Harder successor benchmark motivated by saturation concerns.

SWE-bench (Jimenez et al., 2024) · ICLR

Real-world software issue resolution benchmark, useful for grounded capability testing.

MT-Bench + Arena analysis (Zheng et al., 2023) · arXiv

Pairwise quality comparison methodology and analysis of LLM-judge behavior.

Blog Posts & Practitioner Resources

Benchmarks & Datasets

Name | Year | Description | Link
TruthfulQA | 2022 | Truthfulness under misconception pressure | https://arxiv.org/abs/2109.07958
MMLU | 2020 | Broad MCQ capability benchmark | https://arxiv.org/abs/2009.03300
MMLU-Pro | 2024 | Harder and less saturated MMLU variant | https://arxiv.org/abs/2406.01574
GSM8K | 2021 | Grade-school math reasoning | https://arxiv.org/abs/2110.14168
HumanEval | 2021 | Code generation with tests | https://arxiv.org/abs/2107.03374
SWE-bench | 2024 | Real-world software issue benchmark | https://arxiv.org/abs/2310.06770
FActScore | 2023 | Atomic factual precision metric/benchmark | https://arxiv.org/abs/2305.14251
HaluEval | 2023 | Hallucination benchmark for LLMs | https://arxiv.org/abs/2305.11747
MT-Bench | 2023 | Multi-turn quality benchmark | https://arxiv.org/abs/2306.05685
Chatbot Arena | 2023– | Live pairwise human preference ranking | https://lmarena.ai/
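Arena-style leaderboards aggregate live pairwise votes into a ranking. A simple aggregation is an Elo-style online update, sketched below; production leaderboards typically fit a Bradley-Terry model with confidence intervals instead. The model names and battle outcomes here are made up for illustration.

```python
# Sketch of an Elo-style online rating update for pairwise model battles,
# the kind of aggregation used by Arena-style leaderboards. Real systems
# usually fit a Bradley-Terry model over all battles; this is the simple
# sequential variant. Model names and outcomes are illustrative.

def expected(r_a, r_b):
    """Win probability of A under the logistic Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(ratings, a, b, score_a, k=32):
    """Apply one battle result; score_a is 1 (A wins), 0 (B wins), 0.5 (tie)."""
    e_a = expected(ratings[a], ratings[b])
    ratings[a] += k * (score_a - e_a)
    ratings[b] += k * ((1 - score_a) - (1 - e_a))

ratings = {"model_x": 1000.0, "model_y": 1000.0}
for score in (1, 1, 0.5, 1):  # model_x wins three battles, ties one
    update(ratings, "model_x", "model_y", score)
```

Note that each update is zero-sum (the rating points gained by one model are lost by the other), which is why sequential Elo is sensitive to battle order and why batch Bradley-Terry fits are preferred at scale.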

GitHub Repos

Key Takeaways

Key Insights
  • Build evaluation as a stack, not a single benchmark: static tests + perturbations + human preference + grounded execution.
  • Track versions of prompts, datasets, judges, and decoding settings; otherwise comparisons are weak.
  • Treat benchmark results as evidence, not truth—especially when scores are near saturation.
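The versioning point above can be enforced mechanically: hash every input that can change a score, so two runs are comparable only when their fingerprints match. This is a sketch with illustrative field names, not a standard schema.

```python
# Sketch of pipeline versioning via a content fingerprint: hash every input
# that can change a score (prompt template, dataset, judge, decoding
# settings). Field names are illustrative, not a standard schema.
import hashlib
import json

def run_fingerprint(config: dict) -> str:
    """Deterministic short hash of the full eval configuration."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()[:12]

cfg = {
    "prompt_template_version": "v3",
    "dataset": "truthfulqa-2022",
    "judge_model": "judge-v1",
    "decoding": {"temperature": 0.0, "max_tokens": 256},
}
fp = run_fingerprint(cfg)  # store alongside every reported score
```

Storing the fingerprint next to each reported score makes silent drift (a tweaked prompt, a swapped judge) show up as a mismatch instead of a misleading apples-to-oranges comparison.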

See Also