Factual Consistency & Hallucination

When fluent text diverges from truth and evidence

Overview

Factual consistency asks a basic but difficult question: when a model makes a claim, is that claim supported by reliable evidence? In modern LLM systems, failures are often not obvious nonsense. They are polished, plausible, and confidently delivered statements that may conflict with source documents, world facts, or both. This makes hallucination a reliability problem, not just a generation-quality problem.

A useful distinction in the literature is between intrinsic hallucination (contradicting provided context) and extrinsic hallucination (introducing unsupported claims), as discussed in survey work such as Ji et al. (2023) · link. This split matters operationally: intrinsic failures are often diagnosable with source-grounded checks, while extrinsic failures often need retrieval, citation, or explicit uncertainty handling.

The field has moved from coarse overlap metrics toward claim-level and evidence-aware evaluation (e.g., QA-based checks, atomic fact scoring, citation grounding). In practice, robust mitigation stacks often combine retrieval, verification, and selective abstention rather than relying on one silver bullet.

Key Papers

Faithfulness in Summarization and NLG

On Faithfulness and Factuality in Abstractive Summarization (Maynez et al., 2020) · ACL · link

An early influential paper showing that strong abstractive summarizers can be fluent yet factually inconsistent. It helped separate stylistic quality from factual faithfulness as distinct evaluation targets.

Evaluating the Factual Consistency of Abstractive Text Summarization (FactCC) (Kryściński et al., 2020) · EMNLP · link

Introduced one of the earliest scalable factual inconsistency detectors trained from synthetic perturbations. FactCC became a practical baseline for source-grounded consistency checks.
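The perturbation idea behind FactCC can be sketched in a few lines. The entity pool and the single swap rule below are illustrative stand-ins for the paper's fuller transformation set (entity swaps, negation, pronoun swaps, noise injection):

```python
import random

def entity_swap(claim: str, entities: list[str], rng: random.Random) -> str:
    """Turn a source-consistent claim into a synthetic inconsistent one by
    swapping a mentioned entity for a different one from the same pool."""
    mentioned = [e for e in entities if e in claim]
    if not mentioned:
        return claim  # nothing to perturb
    old = rng.choice(mentioned)
    new = rng.choice([e for e in entities if e != old])
    return claim.replace(old, new)

rng = random.Random(0)
positive = "The summit was held in Paris."  # label: CONSISTENT
negative = entity_swap(positive, ["Paris", "Berlin", "Madrid"], rng)  # label: INCONSISTENT
```

Pairs generated this way give a detector cheap supervision without human labels, which is what made the approach scalable.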

FEQA: A Question Answering Evaluation Framework for Faithfulness Assessment in Abstractive Summarization (Durmus, He, Diab, 2020) · ACL · link

Proposed QA-based factuality scoring, moving evaluation from lexical overlap to claim verification. This informed many later factuality metrics.
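A minimal sketch of the QA-based idea: take questions whose answers were derived from the summary, re-answer them against the source, and score agreement. Real systems use learned question-generation and QA models; the substring lookup below is only a stand-in for that answering step:

```python
def qa_faithfulness(qa_pairs: list[tuple[str, str]], source: str) -> float:
    """Fraction of (question, summary_answer) pairs whose answer is also
    recoverable from the source; a toy proxy for FEQA-style agreement."""
    if not qa_pairs:
        return 1.0
    supported = sum(1 for _q, ans in qa_pairs if ans.lower() in source.lower())
    return supported / len(qa_pairs)

source = "Alice won the race in Berlin on Sunday."
pairs = [("Who won the race?", "Alice"), ("Where was it held?", "Paris")]
score = qa_faithfulness(pairs, source)  # 0.5: one answer is unsupported
```

The key design choice is that disagreement is measured at the level of answerable claims rather than surface n-grams, so paraphrases score well and contradictions score poorly.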

Truthfulness and Hallucination in LLMs

TruthfulQA: Measuring How Models Mimic Human Falsehoods (Lin, Hilton, Evans, 2022) · ACL · link

Established a benchmark built from questions that invite common human misconceptions. A key finding is that scale alone does not guarantee truthfulness: larger models can be more prone to imitating plausible falsehoods from their training data.

SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models (Manakul, Liusie, Gales, 2023) · EMNLP · link

Uses disagreement across sampled outputs as a black-box hallucination signal. Particularly useful when token probabilities or internals are inaccessible.
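The core signal can be sketched with a cheap lexical proxy. The paper's variants use BERTScore, NLI, or QA models to measure agreement; unigram overlap below is an illustrative stand-in:

```python
def selfcheck_score(sentence: str, samples: list[str]) -> float:
    """Score a sentence by how poorly resampled answers support it:
    1.0 = no overlap with any sample (suspicious), 0.0 = fully supported."""
    tokens = set(sentence.lower().split())
    if not tokens or not samples:
        return 1.0
    def support(sample: str) -> float:
        return len(tokens & set(sample.lower().split())) / len(tokens)
    return 1.0 - sum(support(s) for s in samples) / len(samples)

samples = ["the capital of france is paris", "paris is the capital of france"]
grounded = selfcheck_score("the capital of france is paris", samples)  # 0.0
suspect = selfcheck_score("the capital of france is lyon", samples)    # ~0.17
```

The intuition: if a model actually "knows" a fact, independent samples will keep asserting it; hallucinated details vary across samples and so receive low support.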

FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation (Min et al., 2023) · EMNLP · link

Decomposes generations into atomic facts and scores evidence support. Strong for long-form outputs where aggregate metrics miss local hallucinations.
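The metric reduces to a precision over atomic facts. FActScore itself uses an LLM to decompose text and retrieval plus an LM judge to decide support; the sentence splitter and exact-match fact set below are toy stand-ins for those two steps:

```python
def decompose(text: str) -> list[str]:
    """Naive atomic-fact splitter; the real pipeline uses an LLM here."""
    return [s.strip() for s in text.split(".") if s.strip()]

def fact_precision(facts: list[str], supported: set[str]) -> float:
    """Fraction of atomic facts judged supported by the knowledge source."""
    if not facts:
        return 1.0
    return sum(f in supported for f in facts) / len(facts)

bio = ("Marie Curie won two Nobel Prizes. She was born in Warsaw. "
       "She invented the telephone.")
knowledge = {"Marie Curie won two Nobel Prizes", "She was born in Warsaw"}
score = fact_precision(decompose(bio), knowledge)  # 2 of 3 facts supported
```

Because each fact is scored separately, the metric pinpoints which sentence introduced the error, which aggregate scores cannot do.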

Grounding and Mitigation Architectures

RAG: Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (Lewis et al., 2020) · NeurIPS · link

A widely adopted generate-with-retrieval architecture that anchors responses to external evidence. It remains a backbone of many production anti-hallucination systems.
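The retrieve-then-generate contract can be sketched as follows. The overlap scorer is a toy stand-in for a dense or BM25 retriever, and the prompt template is illustrative, not a fixed API:

```python
def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Rank passages by token overlap with the query (stand-in for a real
    retriever) and return the top k."""
    q = set(query.lower().split())
    ranked = sorted(corpus, key=lambda p: len(q & set(p.lower().split())),
                    reverse=True)
    return ranked[:k]

def grounded_prompt(query: str, corpus: list[str]) -> str:
    """Build a prompt that instructs the generator to answer from evidence."""
    passages = retrieve(query, corpus)
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return f"Answer using only these passages:\n{context}\n\nQuestion: {query}"

corpus = [
    "The Eiffel Tower is in Paris.",
    "Photosynthesis occurs in chloroplasts.",
    "Paris is the capital of France.",
]
prompt = grounded_prompt("Where is the Eiffel Tower?", corpus)
```

Anchoring generation to retrieved passages converts many extrinsic hallucinations into checkable intrinsic ones: the claim either matches the passage or it does not.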

Chain-of-Verification Reduces Hallucination in Large Language Models (Dhuliawala et al., 2023) · arXiv · link

Shows that structured decomposition and verification passes reduce unsupported claims in long-form generation.
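The verification loop can be expressed model-agnostically: `llm` is any callable from prompt to text, and the prompt templates below are illustrative rather than the paper's exact wording:

```python
from typing import Callable

def chain_of_verification(question: str, llm: Callable[[str], str]) -> str:
    """Draft -> plan verification questions -> answer them independently ->
    revise. Answering checks independently of the draft is the step that
    catches claims the draft invented."""
    draft = llm(f"Answer the question: {question}")
    plan = llm(f"List verification questions, one per line, for: {draft}")
    checks = [llm(f"Answer independently: {q}")
              for q in plan.splitlines() if q.strip()]
    return llm("Revise the draft so it agrees with the verified answers.\n"
               f"Draft: {draft}\nVerified: {' | '.join(checks)}")

# Demo with a stub model so the control flow is visible without an API:
trace = []
def stub_llm(prompt: str) -> str:
    trace.append(prompt)
    return "q1\nq2" if prompt.startswith("List") else "ok"

final = chain_of_verification("Who wrote Hamlet?", stub_llm)
```

With two planned questions the stub makes five model calls in total: one draft, one plan, two independent checks, and one revision.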

Preprints & Recent Work

A Survey of Hallucination in Natural Language Generation (Ji et al., 2023) · ACM CSUR · link

A widely cited taxonomy of hallucination types, causes, and interventions across tasks.

A Survey on Hallucination in Large Language Models (Huang et al., 2023) · arXiv · link

LLM-era reframing of hallucination detection and mitigation with emphasis on instruction-tuned systems.

HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models (Li et al., 2023) · EMNLP / arXiv · link

Large-scale benchmark for hallucination behavior in QA/dialog settings.

ALCE: Enabling Large Language Models to Generate Text with Citations (Gao et al., 2023) · arXiv preprint · link

Pushes factual evaluation toward explicit evidence attribution with citation-aware protocols.

Blog Posts & Practitioner Resources

Benchmarks & Datasets

| Name | Year | Description | Link |
| --- | --- | --- | --- |
| TruthfulQA | 2022 | Truthfulness under misconception-rich prompts | https://aclanthology.org/2022.acl-long.229/ |
| FRANK | 2021 | Factual errors in summarization outputs | https://aclanthology.org/2021.naacl-main.383/ |
| TRUE | 2022 | Unified factual consistency benchmark | https://aclanthology.org/2022.naacl-main.287/ |
| FactCC data/protocol | 2020 | Synthetic inconsistency examples for summarization | https://aclanthology.org/2020.emnlp-main.750/ |
| FActScore | 2023 | Atomic fact precision for long-form text | https://arxiv.org/abs/2305.14251 |
| HaluEval | 2023 | Hallucination benchmark across tasks | https://arxiv.org/abs/2305.11747 |
| ALCE | 2023 | Citation-grounded answer evaluation | https://arxiv.org/abs/2305.14627 |

GitHub Repos

Key Takeaways

Key Insights
  • Hallucination is heterogeneous: intrinsic and extrinsic failures require different diagnostics.
  • Retrieval helps, but robust factuality usually needs retrieval plus verification plus uncertainty-aware abstention.
  • Claim-level evaluation (e.g., FActScore-style) provides better debugging signals than one-shot aggregate scores.

See Also