Self-Consistency & Contradiction
Whether models agree with themselves across prompts and samples
Overview
Self-consistency concerns whether an LLM returns compatible answers across semantically equivalent questions, paraphrases, and repeated sampling runs. In deployed systems this appears as answer drift: same user intent, different outputs. The issue is partly stochastic (sampling), partly representational (unstable latent beliefs), and partly prompt-induced (format sensitivity).
An important distinction in this literature: self-consistency as a failure mode versus self-consistency as a decoding strategy. The Wang et al. self-consistency method (majority vote over chains) improves outcomes, but it does not imply that the base model is internally coherent. A more cautious interpretation is that performance gains come from sampling diverse reasoning paths and selecting the most common one.
Self-consistency failures also connect directly to factual reliability and hallucination risk: output disagreement is often a useful uncertainty signal when internals are unavailable.
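As a black-box illustration of that last point, disagreement across repeated samples can be turned into a simple uncertainty score. This is a toy sketch: in a real setup the answer list would come from calling an LLM N times at temperature > 0.

```python
from collections import Counter

def disagreement_score(answers):
    """Fraction of samples that differ from the modal answer.

    0.0 means all samples agree; values near 1.0 indicate heavy
    disagreement, a useful black-box proxy for model uncertainty."""
    counts = Counter(answers)
    top = counts.most_common(1)[0][1]
    return 1.0 - top / len(answers)

# Mock samples standing in for repeated LLM calls:
stable = ["Paris"] * 5
drifty = ["Paris", "Lyon", "Paris", "Marseille", "Lyon"]
print(disagreement_score(stable))   # 0.0
print(disagreement_score(drifty))   # 0.6
```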
Key Papers
Consistency as Method vs. Consistency as Property
Self-Consistency Improves Chain of Thought Reasoning in Language Models (Wang et al., 2023) · ICLR · link
Introduced majority-vote decoding over multiple reasoning paths. Strong practical gains, but best interpreted as a mitigation for instability rather than proof of stable internal beliefs.
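The decoding procedure can be sketched in a few lines. `sample_fn` is a hypothetical stand-in for one stochastic (temperature > 0) decoding run that returns a (reasoning, answer) pair; the paper's method reduces to a majority vote over the extracted answers.

```python
from collections import Counter
from itertools import cycle

def self_consistency_answer(sample_fn, question, n=10):
    """Self-consistency decoding in the style of Wang et al.:
    sample n reasoning paths, keep each path's final answer,
    and return the most common one."""
    answers = [sample_fn(question)[1] for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]

# Mock usage: a fake sampler whose answers vary across runs.
fake = cycle([("...", "18"), ("...", "18"), ("...", "17"), ("...", "18")])
print(self_consistency_answer(lambda q: next(fake), "12 + 6 = ?", n=8))  # 18
```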
SelfCheckGPT (Manakul et al., 2023) · EMNLP Findings · link
Treats cross-sample disagreement as a hallucination signal. A practical bridge between self-consistency diagnostics and factuality monitoring.
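A heavily simplified version of the idea: score each sentence of a response against N resampled passages and flag sentences the samples do not support. The paper uses BERTScore, NLI, and n-gram scorers; plain unigram overlap is substituted here purely for illustration.

```python
def selfcheck_score(sentence, samples):
    """Crude SelfCheckGPT-style hallucination signal (simplified:
    unigram overlap instead of the paper's BERTScore/NLI scorers).
    Returns a score in [0, 1]: high when the sentence's content is
    NOT supported by the resampled passages."""
    words = set(sentence.lower().split())
    supports = []
    for s in samples:
        sample_words = set(s.lower().split())
        supports.append(len(words & sample_words) / max(len(words), 1))
    return 1.0 - sum(supports) / len(supports)

# A claim echoed by every resample scores as well-supported (0.0);
# a claim absent from all resamples scores as unsupported (1.0).
print(selfcheck_score("paris is the capital", ["paris is the capital"]))  # 0.0
print(selfcheck_score("xyz", ["abc def"]))                                # 1.0
```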
Contradiction and Invariance Stress Tests
CheckList (Ribeiro et al., 2020) · ACL · link
Established behavioral invariance tests (e.g., paraphrase perturbations), now foundational for consistency audits.
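An invariance (INV) test in the CheckList sense reduces to: semantically equivalent inputs should yield identical predictions. A minimal sketch, using a deliberately brittle keyword classifier as a stand-in for a model:

```python
def invariance_test(predict, variants):
    """CheckList-style INV test: run a model over paraphrase variants
    of the same input and return the variants whose prediction
    differs from the first (reference) variant."""
    base = predict(variants[0])
    return [v for v in variants[1:] if predict(v) != base]

# Toy usage: a keyword classifier that fails the invariance test.
brittle = lambda text: "positive" if "great" in text.lower() else "negative"
paraphrases = [
    "The food was great.",
    "The food was excellent.",  # same meaning, different wording
    "Great food.",
]
print(invariance_test(brittle, paraphrases))  # ['The food was excellent.']
```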
PAWS (Zhang, Baldridge & He, 2019) · NAACL · link
Showed that models fail on high-lexical-overlap sentence pairs that are not in fact paraphrases, exposing brittle semantic judgments.
TruthfulQA (Lin et al., 2022) · ACL · link
Primarily a truthfulness benchmark; in practice, researchers often pair it with prompt variants to probe answer stability under different elicitation styles.
Preprints & Recent Work
PromptBench (Zhu et al., 2023) · NeurIPS Datasets/Benchmarks track · link
Systematic perturbation suite for prompt robustness and consistency scoring.
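The evaluation pattern can be sketched as: perturb a prompt, re-query, and report the fraction of perturbed variants whose answer matches the clean prompt's. The perturbation below (random adjacent-character swaps) is a simplified stand-in; the benchmark itself applies richer character-, word-, sentence-, and semantic-level attacks.

```python
import random

def typo_perturb(prompt, rate=0.05, seed=0):
    """Character-level perturbation (simplified stand-in for
    PromptBench-style attacks): randomly swap adjacent letters."""
    rng = random.Random(seed)
    chars = list(prompt)
    for i in range(len(chars) - 1):
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def consistency_rate(predict, prompt, n_variants=10):
    """Fraction of perturbed prompts whose answer matches the
    clean prompt's answer. `predict` is a hypothetical model wrapper."""
    base = predict(prompt)
    hits = sum(predict(typo_perturb(prompt, seed=s)) == base
               for s in range(n_variants))
    return hits / n_variants
```

A perturbation-insensitive model scores 1.0; scores well below 1.0 indicate format sensitivity of the kind the Overview describes.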
Language Models (Mostly) Know What They Know (Kadavath et al., 2022) · arXiv · link
Connects confidence estimation to reliability; useful for interpreting when unstable answers should trigger abstention.
HaluEval (Li et al., 2023) · EMNLP Findings / arXiv · link
Includes contradiction-like failures against evidence and world facts across tasks.
Blog Posts & Practitioner Resources
Prompting Guide: Consistency Techniques (DAIR.AI, ongoing) — https://www.promptingguide.ai/techniques/consistency Catalog of practical templates for self-consistency sampling and voting.
OpenAI Prompt Engineering Guide (OpenAI, ongoing) — https://platform.openai.com/docs/guides/prompt-engineering Production heuristics for reducing answer variance.
Anthropic research notes on confidence/consistency (Anthropic, ongoing) — https://www.anthropic.com/news Applied discussions of confidence and reliability behavior.
Benchmarks & Datasets
| Name | Year | Description | Link |
|---|---|---|---|
| TruthfulQA | 2022 | Truthfulness under misconception pressure | https://arxiv.org/abs/2109.07958 |
| ANLI | 2020 | Adversarial entailment/contradiction NLI | https://arxiv.org/abs/1910.14599 |
| HANS | 2019 | NLI heuristic stress test | https://aclanthology.org/P19-1334/ |
| WinoGrande | 2020 | Robust coreference benchmark | https://arxiv.org/abs/1907.10641 |
| Winogender | 2018 | Gender-swapped coreference consistency diagnostics | https://aclanthology.org/N18-2003/ |
| PromptBench | 2023 | Prompt perturbation robustness benchmark | https://arxiv.org/abs/2306.04528 |
GitHub Repos
chain-of-thought-hub — https://github.com/FranxYao/chain-of-thought-hub Reproducible prompts and evaluation scripts for consistency-sensitive reasoning tasks.
BIG-Bench-Hard — https://github.com/suzgunmirac/BIG-Bench-Hard Hard reasoning tasks often used in variance/consistency studies.
lm-evaluation-harness — https://github.com/EleutherAI/lm-evaluation-harness Standardized benchmarking framework for multi-prompt comparisons.
Key Takeaways
- Cross-sample disagreement is one of the most useful black-box uncertainty signals in practice.
- “Self-consistency decoding” improves outcomes but does not mean models are intrinsically self-consistent.
- Robust evaluation must include paraphrase and prompt perturbation tests, not single-template scores.