Self-Consistency & Contradiction

Whether models agree with themselves across prompts and samples

Overview

Self-consistency concerns whether an LLM returns compatible answers to semantically equivalent questions, paraphrases, or repeated runs. In deployed systems this appears as answer drift: same user intent, different outputs. The issue is partly stochastic (sampling), partly representational (unstable latent beliefs), and partly prompt-induced (format sensitivity).

An important distinction in this literature: self-consistency as a failure mode versus self-consistency as a decoding strategy. The Wang et al. self-consistency method (majority vote over chains) improves outcomes, but it does not imply that the base model is internally coherent. A more cautious interpretation is that performance gains come from sampling diverse reasoning paths and selecting the most common one.
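The decoding-strategy side of this distinction can be sketched in a few lines. Below is a minimal majority-vote scheme in the spirit of Wang et al.; `sample_fn` is a hypothetical stand-in for one stochastic model call that returns only the final answer, and the toy model is fabricated purely for illustration.

```python
import random
from collections import Counter

def self_consistency_answer(sample_fn, prompt, n_samples=10):
    """Majority-vote decoding: sample several reasoning paths and
    return the most common final answer, plus its agreement rate.
    Low agreement is itself a useful instability signal."""
    answers = [sample_fn(prompt) for _ in range(n_samples)]
    winner, count = Counter(answers).most_common(1)[0]
    return winner, count / n_samples

# Toy stand-in model: answers "42" most of the time, "41" otherwise.
random.seed(0)
def toy_model(prompt):
    return "42" if random.random() < 0.7 else "41"

ans, agreement = self_consistency_answer(toy_model, "What is 6 * 7?")
```

Note that the agreement rate, not just the winning answer, is worth logging: a 6-to-4 vote and a 10-to-0 vote are very different reliability situations.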

Self-consistency failures also connect directly to factual reliability and hallucination risk: output disagreement is often a useful uncertainty signal when internals are unavailable.

Key Papers

Consistency as Method vs. Consistency as Property

Self-Consistency Improves Chain of Thought Reasoning in Language Models (Wang et al., 2023) · ICLR · link

Introduced majority-vote decoding over multiple reasoning paths. Strong practical gains, but best interpreted as a mitigation for instability rather than proof of stable internal beliefs.

SelfCheckGPT (Manakul et al., 2023) · EMNLP Findings · link

Treats cross-sample disagreement as a hallucination signal. A practical bridge between self-consistency diagnostics and factuality monitoring.
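The core idea can be illustrated with a simplified disagreement score over repeated samples. This sketch uses exact string mismatch as the comparison; SelfCheckGPT itself uses softer comparisons (NLI entailment, BERTScore, or question answering), so treat this as a toy proxy, and the example answer strings are fabricated.

```python
from itertools import combinations

def disagreement_score(samples):
    """Fraction of sample pairs whose answers differ.
    High disagreement across resamples of the same prompt is a
    black-box signal of possible hallucination / low confidence."""
    pairs = list(combinations(samples, 2))
    if not pairs:
        return 0.0
    return sum(a != b for a, b in pairs) / len(pairs)

stable = ["Paris", "Paris", "Paris", "Paris"]      # consistent -> low risk
unstable = ["1912", "1913", "1912", "1921"]        # drifting -> flag for review
```

In a monitoring pipeline, answers whose disagreement score exceeds a threshold would be routed to abstention, retrieval, or human review rather than served directly.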

Contradiction and Invariance Stress Tests

CheckList (Ribeiro et al., 2020) · ACL · link

Established behavioral invariance tests (e.g., paraphrase perturbations), now foundational for consistency audits.
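A minimal invariance test in the CheckList style looks like the following. `predict_fn` is a hypothetical stand-in for the model under test, and the brittle toy classifier is fabricated to show how paraphrase variants expose surface-form sensitivity.

```python
def invariance_test(predict_fn, variants):
    """CheckList-style invariance test: all paraphrase variants of the
    same intent should yield the same prediction.
    Returns (passed, predictions)."""
    preds = [predict_fn(v) for v in variants]
    return len(set(preds)) == 1, preds

# Toy classifier that keys on surface wording rather than meaning.
def toy_sentiment(text):
    return "positive" if "great" in text.lower() else "negative"

passed, preds = invariance_test(
    toy_sentiment,
    ["The movie was great.", "The film was wonderful.", "Great movie!"],
)
# The test fails: the paraphrase "wonderful" flips the prediction.
```

Scaling this up means generating perturbation sets automatically (synonym swaps, entity swaps, typos) and reporting the pass rate across many intents rather than a single-template score.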

PAWS (Zhang et al., 2019) · NAACL · link

Demonstrated robustness failures under high-overlap paraphrase-like variants, exposing brittle semantic judgments.

TruthfulQA (Lin et al., 2022) · ACL · link

Primarily a truthfulness benchmark; in practice, researchers often pair it with prompt variants to probe answer stability under different elicitation styles.

Preprints & Recent Work

PromptBench (Zhu et al., 2023) · NeurIPS Datasets/Benchmarks track · link

Systematic perturbation suite for prompt robustness and consistency scoring.

Language Models (Mostly) Know What They Know (Kadavath et al., 2022) · arXiv · link

Connects confidence estimation to reliability; useful for interpreting when unstable answers should trigger abstention.

HaluEval (Li et al., 2023) · EMNLP Findings / arXiv · link

Includes contradiction-like failures against evidence and world facts across tasks.

Blog Posts & Practitioner Resources

Benchmarks & Datasets

| Name | Year | Description | Link |
|------|------|-------------|------|
| TruthfulQA | 2022 | Truthfulness under misconception pressure | https://arxiv.org/abs/2109.07958 |
| ANLI | 2020 | Adversarial entailment/contradiction NLI | https://arxiv.org/abs/1910.14599 |
| HANS | 2019 | NLI heuristic stress test | https://aclanthology.org/P19-1334/ |
| WinoGrande | 2020 | Robust coreference benchmark | https://arxiv.org/abs/1907.10641 |
| Winogender | 2018 | Gender-swapped coreference consistency diagnostics | https://aclanthology.org/N18-2003/ |
| PromptBench | 2023 | Prompt perturbation robustness benchmark | https://arxiv.org/abs/2306.04528 |

GitHub Repos

Key Takeaways

Key Insights
  • Cross-sample disagreement is one of the most useful black-box uncertainty signals in practice.
  • “Self-consistency decoding” improves outcomes but does not mean models are intrinsically self-consistent.
  • Robust evaluation must include paraphrase and prompt perturbation tests, not single-template scores.

See Also