Self-Consistency & Contradiction
Whether models agree with themselves across prompts and samples
Overview
Self-consistency concerns whether an LLM returns compatible answers across semantically equivalent questions, paraphrases, and repeated sampling runs. In deployed systems this appears as answer drift: same user intent, different outputs. The issue is partly stochastic (sampling), partly representational (unstable latent beliefs), and partly prompt-induced (format sensitivity).
An important distinction in this literature: self-consistency as a failure mode versus self-consistency as a decoding strategy. The Wang et al. self-consistency method (majority vote over chains) improves outcomes, but it does not imply that the base model is internally coherent. A more cautious interpretation is that performance gains come from sampling diverse reasoning paths and selecting the most common one.
Self-consistency failures also connect directly to factual reliability and hallucination risk: output disagreement is often a useful uncertainty signal when internals are unavailable.
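As a black-box illustration of that last point, disagreement across repeated samples can be turned into a simple uncertainty score. This is a toy sketch: in a real setup the answer list would come from calling an LLM N times at temperature > 0.

```python
from collections import Counter

def disagreement_score(answers):
    """Fraction of samples that differ from the modal answer.

    0.0 means all samples agree; values near 1.0 indicate heavy
    disagreement, a useful black-box proxy for model uncertainty."""
    counts = Counter(answers)
    top = counts.most_common(1)[0][1]
    return 1.0 - top / len(answers)

# Mock samples standing in for repeated LLM calls:
stable = ["Paris"] * 5
drifty = ["Paris", "Lyon", "Paris", "Marseille", "Lyon"]
print(disagreement_score(stable))   # 0.0
print(disagreement_score(drifty))   # 0.6
```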
Key Papers
Consistency as Method vs. Consistency as Property
Self-Consistency Improves Chain of Thought Reasoning in Language Models (Wang et al., 2023) · ICLR · link
Introduced majority-vote decoding over multiple reasoning paths. Strong practical gains, but best interpreted as a mitigation for instability rather than proof of stable internal beliefs.
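The decoding procedure can be sketched in a few lines. `sample_fn` is a hypothetical stand-in for one stochastic (temperature > 0) decoding run that returns a (reasoning, answer) pair; the paper's method reduces to a majority vote over the extracted answers.

```python
from collections import Counter
from itertools import cycle

def self_consistency_answer(sample_fn, question, n=10):
    """Self-consistency decoding in the style of Wang et al.:
    sample n reasoning paths, keep each path's final answer,
    and return the most common one."""
    answers = [sample_fn(question)[1] for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]

# Mock usage: a fake sampler whose answers vary across runs.
fake = cycle([("...", "18"), ("...", "18"), ("...", "17"), ("...", "18")])
print(self_consistency_answer(lambda q: next(fake), "12 + 6 = ?", n=8))  # 18
```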
SelfCheckGPT (Manakul et al., 2023) · EMNLP Findings · link
Treats cross-sample disagreement as a hallucination signal. A practical bridge between self-consistency diagnostics and factuality monitoring.
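A heavily simplified version of the idea: score each sentence of a response against N resampled passages and flag sentences the samples do not support. The paper uses BERTScore, NLI, and n-gram scorers; plain unigram overlap is substituted here purely for illustration.

```python
def selfcheck_score(sentence, samples):
    """Crude SelfCheckGPT-style hallucination signal (simplified:
    unigram overlap instead of the paper's BERTScore/NLI scorers).
    Returns a score in [0, 1]: high when the sentence's content is
    NOT supported by the resampled passages."""
    words = set(sentence.lower().split())
    supports = []
    for s in samples:
        sample_words = set(s.lower().split())
        supports.append(len(words & sample_words) / max(len(words), 1))
    return 1.0 - sum(supports) / len(supports)

# A claim echoed by every resample scores as well-supported (0.0);
# a claim absent from all resamples scores as unsupported (1.0).
print(selfcheck_score("paris is the capital", ["paris is the capital"]))  # 0.0
print(selfcheck_score("xyz", ["abc def"]))                                # 1.0
```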
Contradiction and Invariance Stress Tests
CheckList (Ribeiro et al., 2020) · ACL · link
Established behavioral invariance tests (e.g., paraphrase perturbations), now foundational for consistency audits.
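An invariance (INV) test in the CheckList sense reduces to: semantically equivalent inputs should yield identical predictions. A minimal sketch, using a deliberately brittle keyword classifier as a stand-in for a model:

```python
def invariance_test(predict, variants):
    """CheckList-style INV test: run a model over paraphrase variants
    of the same input and return the variants whose prediction
    differs from the first (reference) variant."""
    base = predict(variants[0])
    return [v for v in variants[1:] if predict(v) != base]

# Toy usage: a keyword classifier that fails the invariance test.
brittle = lambda text: "positive" if "great" in text.lower() else "negative"
paraphrases = [
    "The food was great.",
    "The food was excellent.",  # same meaning, different wording
    "Great food.",
]
print(invariance_test(brittle, paraphrases))  # ['The food was excellent.']
```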
PAWS (Zhang, Baldridge & He, 2019) · NAACL · link
Showed that models fail on high-lexical-overlap sentence pairs that are not in fact paraphrases, exposing brittle semantic judgments.
TruthfulQA (Lin et al., 2022) · ACL · link
Primarily a truthfulness benchmark; in practice, researchers often pair it with prompt variants to probe answer stability under different elicitation styles.
Preprints & Recent Work
PromptBench (Zhu et al., 2023) · NeurIPS Datasets/Benchmarks track · link
Systematic perturbation suite for prompt robustness and consistency scoring.
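The evaluation pattern can be sketched as: perturb a prompt, re-query, and report the fraction of perturbed variants whose answer matches the clean prompt's. The perturbation below (random adjacent-character swaps) is a simplified stand-in; the benchmark itself applies richer character-, word-, sentence-, and semantic-level attacks.

```python
import random

def typo_perturb(prompt, rate=0.05, seed=0):
    """Character-level perturbation (simplified stand-in for
    PromptBench-style attacks): randomly swap adjacent letters."""
    rng = random.Random(seed)
    chars = list(prompt)
    for i in range(len(chars) - 1):
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def consistency_rate(predict, prompt, n_variants=10):
    """Fraction of perturbed prompts whose answer matches the
    clean prompt's answer. `predict` is a hypothetical model wrapper."""
    base = predict(prompt)
    hits = sum(predict(typo_perturb(prompt, seed=s)) == base
               for s in range(n_variants))
    return hits / n_variants
```

A perturbation-insensitive model scores 1.0; scores well below 1.0 indicate format sensitivity of the kind the Overview describes.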
Language Models (Mostly) Know What They Know (Kadavath et al., 2022) · arXiv · link
Connects confidence estimation to reliability; useful for interpreting when unstable answers should trigger abstention.
HaluEval (Li et al., 2023) · EMNLP Findings / arXiv · link
Includes contradiction-like failures against evidence and world facts across tasks.
Blog Posts & Practitioner Resources
Prompting Guide: Consistency Techniques (DAIR.AI, ongoing) — https://www.promptingguide.ai/techniques/consistency Catalog of practical templates for self-consistency sampling and voting.
OpenAI Prompt Engineering Guide (OpenAI, ongoing) — https://platform.openai.com/docs/guides/prompt-engineering Production heuristics for reducing answer variance.
Anthropic research notes on confidence/consistency (Anthropic, ongoing) — https://www.anthropic.com/news Applied discussions of confidence and reliability behavior.
Benchmarks & Datasets
| Name | Year | Description | Link |
|---|---|---|---|
| TruthfulQA | 2022 | Truthfulness under misconception pressure | https://arxiv.org/abs/2109.07958 |
| ANLI | 2020 | Adversarial entailment/contradiction NLI | https://arxiv.org/abs/1910.14599 |
| HANS | 2019 | NLI heuristic stress test | https://aclanthology.org/P19-1334/ |
| WinoGrande | 2020 | Robust coreference benchmark | https://arxiv.org/abs/1907.10641 |
| Winogender | 2018 | Gender-swapped coreference consistency diagnostics | https://aclanthology.org/N18-2003/ |
| PromptBench | 2023 | Prompt perturbation robustness benchmark | https://arxiv.org/abs/2306.04528 |
GitHub Repos
chain-of-thought-hub — https://github.com/FranxYao/chain-of-thought-hub Reproducible prompts and evaluation scripts for consistency-sensitive reasoning tasks.
BIG-Bench-Hard — https://github.com/suzgunmirac/BIG-Bench-Hard Hard reasoning tasks often used in variance/consistency studies.
lm-evaluation-harness — https://github.com/EleutherAI/lm-evaluation-harness Standardized benchmarking framework for multi-prompt comparisons.
Key Takeaways
- Cross-sample disagreement is one of the most useful black-box uncertainty signals in practice.
- “Self-consistency decoding” improves outcomes but does not mean models are intrinsically self-consistent.
- Robust evaluation must include paraphrase and prompt perturbation tests, not single-template scores.