LLM Consistency: A Literature Survey

A thorough review of consistency, reliability, and robustness issues in large language models

Author: John P. Lake

Published: March 18, 2026

What Is This Survey About?

Large language models (LLMs) exhibit a troubling pattern: ask the same question twice and you may get different answers. Rephrase the question slightly and the answer changes (Lu et al., 2022; Sclar et al., 2023). Push back on a response and the model may shift its answer to match user cues (Perez et al., 2022). Provide misleading context and the model can abandon prior claims. These are all manifestations of LLM inconsistency.
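As a minimal sketch of how the "same question, different answers" phenomenon is often quantified, the snippet below samples a model repeatedly across paraphrases and reports the fraction of answers that agree with the modal (most common) answer. The helper names (`agreement_rate`, `consistency_check`) and the `toy_model` stub are illustrative assumptions, not an API from any cited paper; in practice the stub would be replaced by a real LLM call.

```python
from collections import Counter

def agreement_rate(answers):
    """Fraction of answers matching the modal answer.

    1.0 means perfectly self-consistent; lower values mean the
    model's answer changes across samples or paraphrases.
    """
    if not answers:
        raise ValueError("need at least one answer")
    _, top_count = Counter(answers).most_common(1)[0]
    return top_count / len(answers)

def consistency_check(ask, prompts, samples=5):
    """Query `ask` (any callable: prompt -> answer string) several
    times per paraphrased prompt and return the agreement rate."""
    answers = [ask(p) for p in prompts for _ in range(samples)]
    return agreement_rate(answers)

if __name__ == "__main__":
    # Stand-in for a real LLM call: the answer flips on one
    # paraphrase, mimicking the paraphrase sensitivity described above.
    def toy_model(prompt):
        return "Paris" if "capital" in prompt else "Lyon"

    prompts = [
        "What is the capital of France?",
        "France's capital city is?",
        "Which city is the seat of the French government?",
    ]
    print(consistency_check(toy_model, prompts))  # 10/15 ≈ 0.667
```

With a deterministic model the rate only exposes paraphrase sensitivity; with sampling temperature above zero, repeated calls on a single prompt expose run-to-run inconsistency as well.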

This survey maps the landscape of LLM consistency issues comprehensively — covering academic papers, preprints, blog posts, and practical resources. The goal is to provide both an overview and a navigable map: a guide for researchers, practitioners, and curious minds trying to understand this fragmented problem space.

Note: Scope

This survey covers consistency, reliability, robustness, and self-contradiction issues in LLMs. Related but distinct areas (e.g., factual accuracy per se, alignment, or safety) are included where they directly bear on consistency.

Why This Matters

Inconsistency is not just an academic curiosity; it cuts across every category in the taxonomy below, from factual reliability to the trustworthiness of our evaluations.

Taxonomy Overview

We organize LLM consistency issues into seven major categories:

  • Factual Consistency & Hallucination: Does the model produce accurate facts consistently? Key phenomena: hallucination, knowledge conflicts, intrinsic/extrinsic inconsistency.
  • Self-Consistency & Contradiction: Does the model agree with itself? Key phenomena: cross-prompt contradiction, belief revision under pressure, logical self-contradiction.
  • Reasoning Consistency: Is the model’s reasoning process reliable? Key phenomena: CoT failures, arithmetic errors, fallacious reasoning.
  • Behavioral Consistency & Sycophancy: Does the model behave predictably across social contexts? Key phenomena: sycophancy, authority bias, framing effects, ordering bias.
  • Calibration & Uncertainty: Does the model know what it doesn’t know? Key phenomena: overconfidence, miscalibration, verbalized vs. actual uncertainty.
  • Cross-Context Sensitivity: Is behavior consistent across languages, formats, and prompts? Key phenomena: cross-lingual inconsistency, template sensitivity, paraphrase sensitivity.
  • Benchmark Reliability: Do our evaluations actually measure what they claim? Key phenomena: data contamination, prompt sensitivity, evaluation inconsistency.

See the full Taxonomy page for a detailed breakdown and discussion of how these categories relate.
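The Calibration & Uncertainty category asks whether a model's stated confidence tracks its actual accuracy; the standard metric for the gap is expected calibration error (ECE). Below is a minimal from-scratch sketch of equal-width-bin ECE, using illustrative toy numbers rather than real model outputs:

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Expected Calibration Error: bin predictions by stated
    confidence, then average |accuracy - mean confidence| per bin,
    weighted by bin size. 0.0 means perfectly calibrated."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must align")
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # Half-open bins (lo, hi]; bin 0 also catches confidence 0.0.
        in_bin = [i for i, c in enumerate(confidences)
                  if lo < c <= hi or (b == 0 and c == 0.0)]
        if not in_bin:
            continue
        mean_conf = sum(confidences[i] for i in in_bin) / len(in_bin)
        accuracy = sum(correct[i] for i in in_bin) / len(in_bin)
        ece += (len(in_bin) / n) * abs(accuracy - mean_conf)
    return ece

# A model that claims 90% confidence but is right half the time
# is overconfident; ECE makes that gap explicit.
confs = [0.9, 0.9, 0.9, 0.9]
right = [1, 0, 1, 0]
print(expected_calibration_error(confs, right))  # 0.4
```

Verbalized confidence ("I'm 90% sure") and token-level probabilities can both feed this metric, and the surveyed literature finds they often disagree with each other as well as with accuracy.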

How to Use This Survey

  • Newcomers: Start with the Taxonomy for an orienting overview
  • Researchers: Browse individual category pages for deep-dives into specific literatures
  • Practitioners: Check the Resources & Tools page for benchmarks, datasets, and repos
  • Follow-the-thread readers: Each page ends with cross-references to adjacent categories

Key Search Terms

The literature on LLM consistency is scattered across many keyword spaces. This survey covers work appearing under:

LLM consistency, self-consistency, factual consistency, hallucination, sycophancy, prompt sensitivity, robustness, reliability, calibration, overconfidence, logical contradiction, paraphrase invariance, cross-lingual consistency, benchmark sensitivity, evaluation robustness, verbosity bias, position bias, ordering bias, epistemic consistency, belief revision, uncertainty quantification, confabulation, faithfulness, groundedness

About This Survey

Compiled in March 2026. The literature is rapidly evolving; this survey focuses on establishing a durable taxonomy while citing representative and influential works.


References

Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., Chen, A., Goldie, A., Mirhoseini, A., McKinnon, C., et al. (2022). Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073. https://arxiv.org/abs/2212.08073
Kadavath, S., et al. (2022). Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221. https://arxiv.org/abs/2207.05221
Liang, P., Bommasani, R., Lee, T., Tsipras, D., Soylu, D., Yasunaga, M., Zhang, Y., et al. (2022). Holistic evaluation of language models. arXiv preprint arXiv:2211.09110. https://arxiv.org/abs/2211.09110
Lu, Y., Bartolo, M., Moore, A., Riedel, S., & Stenetorp, P. (2022). Fantastically ordered prompts and where to find them: Overcoming few-shot prompt order sensitivity. Proceedings of ACL. https://aclanthology.org/2022.acl-long.556/
Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al. (2022). Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems. https://arxiv.org/abs/2203.02155
Perez, E., Huang, S., Song, F., Cai, T., Ring, R., Aslanides, J., Glaese, A., McAleese, N., Irving, G., Foulkes, A., et al. (2022). Discovering language model behaviors with model-written evaluations. arXiv preprint arXiv:2212.09251. https://arxiv.org/abs/2212.09251
Sclar, M., Choi, Y., Tsvetkov, Y., & Smith, N. A. (2023). Quantifying language models’ sensitivity to spurious features in prompt design. arXiv preprint arXiv:2310.11324. https://arxiv.org/abs/2310.11324
Xiong, M., et al. (2023). Can LLMs express their uncertainty? An empirical evaluation of confidence elicitation in LLMs. arXiv preprint arXiv:2306.13063. https://arxiv.org/abs/2306.13063
Zhu, K., et al. (2023). PromptBench: Towards evaluating the robustness of large language models on adversarial prompts. arXiv preprint arXiv:2306.04528. https://arxiv.org/abs/2306.04528