# LLM Consistency: A Literature Survey
A thorough review of consistency, reliability, and robustness issues in large language models
## What Is This Survey About?
Large language models (LLMs) exhibit a troubling pattern: ask the same question twice and you may get different answers. Rephrase the question slightly and the answer may change (Lu et al., 2022; Sclar et al., 2023). Push back on a response and the model may shift its answer to match user cues (Perez et al., 2022). Provide misleading context and the model can abandon claims it made moments earlier. These are all manifestations of LLM inconsistency.
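A simple way to make this concrete is to resample the same prompt (or paraphrases of it) several times and measure how often the answers agree. Below is a minimal sketch in Python; `query_model` is a hypothetical stand-in for whatever sampled LLM call is available, not any particular API.

```python
from collections import Counter

def agreement_rate(answers: list[str]) -> float:
    """Fraction of sampled answers matching the modal (most frequent) answer.

    1.0 means the model answered identically every time; values near
    1/len(answers) mean essentially no agreement.
    """
    counts = Counter(a.strip().lower() for a in answers)
    _, top_count = counts.most_common(1)[0]
    return top_count / len(answers)

# Hypothetical usage -- `query_model` stands in for any sampled LLM call:
# answers = [query_model("In what year did the Eiffel Tower open?") for _ in range(10)]
answers = ["1889", "1889", "1887", "1889"]  # toy data for illustration
print(agreement_rate(answers))  # 0.75
```

The same metric doubles as a paraphrase-sensitivity probe: replace the repeated prompt with a set of paraphrases, and inconsistency shows up as a drop in agreement.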
This survey comprehensively maps the landscape of LLM consistency issues, covering academic papers, preprints, blog posts, and practical resources. The goal is both breadth and navigability: a guide for researchers, practitioners, and curious readers trying to make sense of a fragmented problem space.
This survey covers consistency, reliability, robustness, and self-contradiction issues in LLMs. Related but distinct areas (e.g., factual accuracy per se, alignment, or safety) are included only where they bear directly on consistency.
## Why This Matters
Inconsistency is not just an academic curiosity. It affects:
- Scientific reproducibility: Research using LLMs can be sensitive to prompt phrasing and setup details, making exact replication harder (Liang et al., 2022; Lu et al., 2022)
- Deployed applications: Chatbots and assistants can produce materially different outputs across runs or prompt variants (Sclar et al., 2023; Zhu et al., 2023)
- Trust calibration: Users may struggle to calibrate trust when confidence and correctness are weakly coupled (Kadavath et al., 2022; Xiong et al., 2023)
- Evaluation validity: Benchmark scores can partly reflect prompt/template sensitivity rather than underlying capability (Liang et al., 2022; Lu et al., 2022)
- Safety and alignment: Inconsistent behavior complicates auditing and reliability efforts in aligned assistants (Bai et al., 2022; Ouyang et al., 2022)
## Taxonomy Overview
We organize LLM consistency issues into seven major categories:
| Category | Core Question | Key Phenomena |
|---|---|---|
| Factual Consistency & Hallucination | Does the model produce accurate facts consistently? | Hallucination, knowledge conflicts, intrinsic/extrinsic inconsistency |
| Self-Consistency & Contradiction | Does the model agree with itself? | Cross-prompt contradiction, belief revision under pressure, logical self-contradiction |
| Reasoning Consistency | Is the model’s reasoning process reliable? | Chain-of-thought (CoT) failures, arithmetic errors, fallacious reasoning |
| Behavioral Consistency & Sycophancy | Does the model behave predictably across social contexts? | Sycophancy, authority bias, framing effects, ordering bias |
| Calibration & Uncertainty | Does the model know what it doesn’t know? | Overconfidence, miscalibration, verbalized vs. actual uncertainty (see the ECE sketch below) |
| Cross-Context Sensitivity | Is behavior consistent across languages, formats, and prompts? | Cross-lingual, template sensitivity, paraphrase sensitivity |
| Benchmark Reliability | Do our evaluations actually measure what they claim? | Data contamination, prompt sensitivity, evaluation inconsistency |
See the full Taxonomy page for a detailed breakdown and discussion of how these categories relate.
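To give one of these categories a concrete handle, consider Calibration & Uncertainty: miscalibration is commonly summarized by expected calibration error (ECE), the bin-weighted gap between a model's stated confidence and its empirical accuracy. Below is a minimal sketch, assuming per-answer confidences and correctness labels have already been extracted; the function name and binning scheme are illustrative, not taken from any specific paper's implementation.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    """ECE: bin answers by stated confidence, then average the
    |accuracy - mean confidence| gap over bins, weighted by bin size.
    0.0 = perfectly calibrated."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap
    return float(ece)

# Toy example: a model that claims 90% confidence but is right half the time
print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 0, 1, 0]))  # ~0.4
```

An overconfident model (the usual failure mode reported in the calibration literature) has confidence exceeding accuracy in most bins, which this per-bin gap captures directly.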
## How to Use This Survey
- Newcomers: Start with the Taxonomy for an orienting overview
- Researchers: Browse individual category pages for deep dives into specific literatures
- Practitioners: Check the Resources & Tools page for benchmarks, datasets, and repos
- Follow-the-thread readers: Each page ends with cross-references to adjacent categories
## Key Search Terms
The literature on LLM consistency is scattered across many different keywords and subfields. This survey covers work appearing under terms including:
LLM consistency, self-consistency, factual consistency, hallucination, sycophancy, prompt sensitivity, robustness, reliability, calibration, overconfidence, logical contradiction, paraphrase invariance, cross-lingual consistency, benchmark sensitivity, evaluation robustness, verbosity bias, position bias, ordering bias, epistemic consistency, belief revision, uncertainty quantification, confabulation, faithfulness, groundedness
## About This Survey
Compiled in March 2026. The literature is rapidly evolving; this survey focuses on establishing a durable taxonomy while citing representative and influential works.