Factual Consistency & Hallucination
When fluent text diverges from truth and evidence
Overview
Factual consistency asks a basic but difficult question: when a model makes a claim, is that claim supported by reliable evidence? In modern LLM systems, failures are often not obvious nonsense. They are polished, plausible, and confidently delivered statements that may conflict with source documents, world facts, or both. This makes hallucination a reliability problem, not just a generation-quality problem.
A useful distinction in the literature is between intrinsic hallucination (contradicting provided context) and extrinsic hallucination (introducing unsupported claims), as discussed in survey work such as Ji et al. (2023) · link. This split matters operationally: intrinsic failures are often diagnosable with source-grounded checks, while extrinsic failures often need retrieval, citation, or explicit uncertainty handling.
The field has moved from coarse overlap metrics toward claim-level and evidence-aware evaluation (e.g., QA-based checks, atomic fact scoring, citation grounding). In practice, robust mitigation stacks often combine retrieval, verification, and selective abstention rather than relying on one silver bullet.
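The retrieval-plus-verification-plus-abstention pattern described above can be sketched in a few lines. Everything here is a deliberately toy stand-in: the word-overlap retriever and the "all claim words appear in a passage" support check substitute for a dense retriever and an NLI or LLM verifier, and the function names (`retrieve`, `supported`, `answer_or_abstain`) are illustrative, not from any particular paper.

```python
def retrieve(query, corpus, k=2):
    """Rank passages by word overlap with the query (stand-in for a real retriever)."""
    q = set(query.lower().split())
    return sorted(corpus, key=lambda p: len(q & set(p.lower().split())), reverse=True)[:k]

def supported(claim, passages):
    """Crude entailment stand-in: every claim word must appear in some passage."""
    words = set(claim.lower().split())
    return any(words <= set(p.lower().split()) for p in passages)

def answer_or_abstain(claims, query, corpus):
    """Answer only if every claim in the draft is supported by retrieved evidence."""
    passages = retrieve(query, corpus)
    flags = [supported(c, passages) for c in claims]
    rate = sum(flags) / len(flags)
    return ("answer" if rate == 1.0 else "abstain", rate)
```

The design point is the abstention gate at the end: rather than always emitting the draft, the system withholds it when claim-level support falls below a threshold (here, full support).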
Key Papers
Faithfulness in Summarization and NLG
On Faithfulness and Factuality in Abstractive Summarization (Maynez et al., 2020) · ACL · link
An early influential paper showing that strong abstractive summarizers can be fluent yet factually inconsistent. It helped separate stylistic quality from factual faithfulness as distinct evaluation targets.
Evaluating the Factual Consistency of Abstractive Text Summarization (FactCC) (Kryściński et al., 2020) · EMNLP · link
Introduced one of the earliest scalable factual inconsistency detectors trained from synthetic perturbations. FactCC became a practical baseline for source-grounded consistency checks.
FEQA: A Question Answering Evaluation Framework for Faithfulness Assessment in Abstractive Summarization (Durmus, He, Diab, 2020) · ACL · link
Proposed QA-based factuality scoring, moving evaluation from lexical overlap to claim verification. This informed many later factuality metrics.
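The core of QA-based scoring can be illustrated with a small sketch: questions derived from the summary are answered against the source, and answer agreement (token F1 here) approximates faithfulness. Question generation and the QA model are stubbed out as inputs; in FEQA proper these are trained models, and the names `token_f1` and `qa_faithfulness` are illustrative.

```python
from collections import Counter

def token_f1(pred, gold):
    """Token-level F1 between two answer strings (a standard QA comparison)."""
    p, g = pred.lower().split(), gold.lower().split()
    common = sum((Counter(p) & Counter(g)).values())
    if common == 0:
        return 0.0
    prec, rec = common / len(p), common / len(g)
    return 2 * prec * rec / (prec + rec)

def qa_faithfulness(qa_pairs, answer_from_source):
    """qa_pairs: (question, answer extracted from the summary).
    answer_from_source: callable that answers the question against the source."""
    scores = [token_f1(answer_from_source(q), a) for q, a in qa_pairs]
    return sum(scores) / len(scores)
```

A summary whose question-answer pairs disagree with source-grounded answers scores low even when its wording overlaps heavily with the source, which is exactly the failure mode lexical metrics miss.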
Truthfulness and Hallucination in LLMs
TruthfulQA: Measuring How Models Mimic Human Falsehoods (Lin, Hilton, Evans, 2022) · ACL · link
Established a benchmark built around questions that elicit common human misconceptions. A key finding is that scale alone does not guarantee truthfulness.
SelfCheckGPT (Manakul, Liusie, Gales, 2023) · EMNLP Findings · link
Uses disagreement across sampled outputs as a black-box hallucination signal. Particularly useful when token probabilities or internals are inaccessible.
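The underlying idea can be sketched in toy form: resample the model several times and flag sentences of the main answer that the samples fail to support. The unigram-overlap "support" check below is a crude stand-in for the BERTScore, NLI, and prompting variants used in the actual paper, and the function names are illustrative.

```python
def overlap_support(sentence, sample):
    """Fraction of the sentence's words that also appear in one sampled answer."""
    words = set(sentence.lower().split())
    return len(words & set(sample.lower().split())) / max(len(words), 1)

def selfcheck_scores(answer_sentences, samples):
    """Per-sentence inconsistency in [0, 1]: high = poorly supported across samples,
    so more likely hallucinated."""
    return [
        1.0 - sum(overlap_support(s, smp) for smp in samples) / len(samples)
        for s in answer_sentences
    ]
```

The key property is that no token probabilities or model internals are needed: only the ability to sample the same prompt repeatedly.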
FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation (Min et al., 2023) · EMNLP · link
Decomposes generations into atomic facts and scores evidence support. Strong for long-form outputs where aggregate metrics miss local hallucinations.
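In its simplest form, the metric is the supported fraction of atomic facts. The sketch below, with illustrative names, uses substring matching as a placeholder for the retrieval-plus-entailment verifier that FActScore actually uses.

```python
def fact_precision(atomic_facts, evidence):
    """Fraction of atomic facts found (here: as substrings) in the evidence."""
    hits = sum(any(f.lower() in e.lower() for e in evidence) for f in atomic_facts)
    return hits / len(atomic_facts)
```

Because scoring happens per fact, a failing biography or long answer can be debugged claim by claim instead of receiving a single opaque score.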
Grounding and Mitigation Architectures
RAG: Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (Lewis et al., 2020) · NeurIPS · link
A widely adopted generate-with-retrieval architecture that anchors responses to external evidence. It remains a backbone of many production anti-hallucination systems.
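The retrieve-then-generate pattern reduces to a short skeleton. The retriever here is a bag-of-words cosine ranking and `generate` is a stub passed in by the caller; in a real system these would be a dense retriever and an LLM call, and the prompt wording is an assumption, not RAG's actual training setup.

```python
import math
from collections import Counter

def cosine(a, b):
    """Bag-of-words cosine similarity (stand-in for dense embeddings)."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def rag_answer(question, corpus, generate, k=2):
    """Retrieve top-k passages, then condition generation on them."""
    ctx = sorted(corpus, key=lambda p: cosine(question, p), reverse=True)[:k]
    prompt = ("Answer ONLY from the context.\n"
              + "\n".join(f"- {p}" for p in ctx)
              + f"\nQ: {question}\nA:")
    return generate(prompt), ctx
```

Returning the retrieved context alongside the answer is a deliberate choice: it is what makes downstream faithfulness checks and citation possible.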
Chain-of-Verification Reduces Hallucination in Large Language Models (Dhuliawala et al., 2023) · arXiv · link
Shows that structured decomposition and verification passes reduce unsupported claims in long-form generation.
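The control flow is the interesting part and fits in a few lines. `ask` is a stub for the LLM; the prompt-key strings below are placeholders for real prompts, and the essential detail preserved from the paper is that verification questions are answered independently of the draft, so the model cannot simply repeat its own mistake.

```python
def chain_of_verification(question, ask):
    """Draft, plan verification questions, answer them independently, revise."""
    draft = ask(f"draft:{question}")
    checks = ask(f"plan:{draft}")                    # list of verification questions
    verdicts = [ask(f"verify:{c}") for c in checks]  # answered without seeing the draft
    return ask(f"revise:{draft}|{'|'.join(verdicts)}")
```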
Preprints & Recent Work
A Survey of Hallucination in Natural Language Generation (Ji et al., 2023) · ACM CSUR · link
A widely cited taxonomy of hallucination types, causes, and interventions across tasks.
A Survey on Hallucination in Large Language Models (Huang et al., 2023) · arXiv · link
LLM-era reframing of hallucination detection and mitigation with emphasis on instruction-tuned systems.
HaluEval (Li et al., 2023) · EMNLP Findings / arXiv · link
Large-scale benchmark for hallucination behavior in QA/dialog settings.
ALCE: Enabling Large Language Models to Generate Text with Citations (Gao et al., 2023) · arXiv preprint · link
Pushes factual evaluation toward explicit evidence attribution with citation-aware protocols.
Blog Posts & Practitioner Resources
Extrinsic Hallucinations in LLMs (Lilian Weng, 2024) — https://lilianweng.github.io/posts/2024-07-07-hallucination/
A clear technical synthesis of causes, detection, and mitigation patterns.
LLM Evaluation Guidebook (Evidently AI, 2023–2024) — https://www.evidentlyai.com/llm-guide/llm-evaluation
Practical guidance for groundedness and factuality monitoring in deployed systems.
RAGAS Documentation (Exploding Gradients, ongoing) — https://docs.ragas.io/
Production-friendly faithfulness/relevance metrics for RAG pipelines.
Benchmarks & Datasets
| Name | Year | Description | Link |
|---|---|---|---|
| TruthfulQA | 2022 | Truthfulness under misconception-rich prompts | https://aclanthology.org/2022.acl-long.229/ |
| FRANK | 2021 | Factual errors in summarization outputs | https://aclanthology.org/2021.naacl-main.383/ |
| TRUE | 2022 | Unified factual consistency benchmark | https://aclanthology.org/2022.naacl-main.287/ |
| FactCC data/protocol | 2020 | Synthetic inconsistency examples for summarization | https://aclanthology.org/2020.emnlp-main.750/ |
| FActScore | 2023 | Atomic fact precision for long-form text | https://arxiv.org/abs/2305.14251 |
| HaluEval | 2023 | Hallucination benchmark across tasks | https://arxiv.org/abs/2305.11747 |
| ALCE | 2023 | Citation-grounded answer evaluation | https://arxiv.org/abs/2305.14627 |
GitHub Repos
SelfCheckGPT — https://github.com/potsawee/selfcheckgpt
Reference implementation for black-box hallucination detection.
FActScore — https://github.com/shmsw25/FActScore
Atomic fact extraction and scoring toolkit.
RAGAS — https://github.com/explodinggradients/ragas
RAG evaluation package for faithfulness, relevance, and context quality.
TruthfulQA — https://github.com/sylinrl/TruthfulQA
Benchmark resources for model truthfulness testing.
Key Takeaways
- Hallucination is heterogeneous: intrinsic and extrinsic failures require different diagnostics.
- Retrieval helps, but robust factuality usually needs retrieval plus verification plus uncertainty-aware abstention.
- Claim-level evaluation (e.g., FActScore-style) provides better debugging signals than one-shot aggregate scores.