LLM Consistency: A Literature Survey

A thorough review of consistency, reliability, and robustness issues in large language models

Author: John P. Lake

Published: March 18, 2026

What Is This Survey About?

Large language models (LLMs) exhibit a troubling pattern: ask the same question twice and you may get different answers. Rephrase the question slightly and the answer changes (Lu et al., 2022; Sclar et al., 2023). Push back on a response and the model may shift its answer to match user cues (Perez et al., 2022). Provide misleading context and the model can abandon prior claims. These are all manifestations of LLM inconsistency.
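As a minimal sketch of how the "same question, different answers" phenomenon is often quantified, the snippet below samples a model repeatedly across paraphrases and reports the fraction of answers that agree with the modal (most common) answer. The helper names (`agreement_rate`, `consistency_check`) and the `toy_model` stub are illustrative assumptions, not an API from any cited paper; in practice the stub would be replaced by a real LLM call.

```python
from collections import Counter

def agreement_rate(answers):
    """Fraction of answers matching the modal answer.

    1.0 means perfectly self-consistent; lower values mean the
    model's answer changes across samples or paraphrases.
    """
    if not answers:
        raise ValueError("need at least one answer")
    _, top_count = Counter(answers).most_common(1)[0]
    return top_count / len(answers)

def consistency_check(ask, prompts, samples=5):
    """Query `ask` (any callable: prompt -> answer string) several
    times per paraphrased prompt and return the agreement rate."""
    answers = [ask(p) for p in prompts for _ in range(samples)]
    return agreement_rate(answers)

if __name__ == "__main__":
    # Stand-in for a real LLM call: the answer flips on one
    # paraphrase, mimicking the paraphrase sensitivity described above.
    def toy_model(prompt):
        return "Paris" if "capital" in prompt else "Lyon"

    prompts = [
        "What is the capital of France?",
        "France's capital city is?",
        "Which city is the seat of the French government?",
    ]
    print(consistency_check(toy_model, prompts))  # 10/15 ≈ 0.667
```

With a deterministic model the rate only exposes paraphrase sensitivity; with sampling temperature above zero, repeated calls on a single prompt expose run-to-run inconsistency as well.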

This survey maps the landscape of LLM consistency issues comprehensively — covering academic papers, preprints, blog posts, and practical resources. The goal is to provide both an overview and a navigable map: a guide for researchers, practitioners, and curious minds trying to understand this fragmented problem space.

Note: Scope

This survey covers consistency, reliability, robustness, and self-contradiction issues in LLMs. Related but distinct areas (e.g., factual accuracy per se, alignment, or safety) are included where they directly bear on consistency.

Why This Matters

Inconsistency is not just an academic curiosity; it cuts across every category in the taxonomy below, from factual reliability to the trustworthiness of our evaluations.

Taxonomy Overview

We organize LLM consistency issues into seven major categories:

  • Factual Consistency & Hallucination: Does the model produce accurate facts consistently? Key phenomena: hallucination, knowledge conflicts, intrinsic/extrinsic inconsistency.
  • Self-Consistency & Contradiction: Does the model agree with itself? Key phenomena: cross-prompt contradiction, belief revision under pressure, logical self-contradiction.
  • Reasoning Consistency: Is the model’s reasoning process reliable? Key phenomena: CoT failures, arithmetic errors, fallacious reasoning.
  • Behavioral Consistency & Sycophancy: Does the model behave predictably across social contexts? Key phenomena: sycophancy, authority bias, framing effects, ordering bias.
  • Calibration & Uncertainty: Does the model know what it doesn’t know? Key phenomena: overconfidence, miscalibration, verbalized vs. actual uncertainty.
  • Cross-Context Sensitivity: Is behavior consistent across languages, formats, and prompts? Key phenomena: cross-lingual inconsistency, template sensitivity, paraphrase sensitivity.
  • Benchmark Reliability: Do our evaluations actually measure what they claim? Key phenomena: data contamination, prompt sensitivity, evaluation inconsistency.

See the full Taxonomy page for a detailed breakdown and discussion of how these categories relate.
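The Calibration & Uncertainty category asks whether a model's stated confidence tracks its actual accuracy; the standard metric for the gap is expected calibration error (ECE). Below is a minimal from-scratch sketch of equal-width-bin ECE, using illustrative toy numbers rather than real model outputs:

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Expected Calibration Error: bin predictions by stated
    confidence, then average |accuracy - mean confidence| per bin,
    weighted by bin size. 0.0 means perfectly calibrated."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must align")
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # Half-open bins (lo, hi]; bin 0 also catches confidence 0.0.
        in_bin = [i for i, c in enumerate(confidences)
                  if lo < c <= hi or (b == 0 and c == 0.0)]
        if not in_bin:
            continue
        mean_conf = sum(confidences[i] for i in in_bin) / len(in_bin)
        accuracy = sum(correct[i] for i in in_bin) / len(in_bin)
        ece += (len(in_bin) / n) * abs(accuracy - mean_conf)
    return ece

# A model that claims 90% confidence but is right half the time
# is overconfident; ECE makes that gap explicit.
confs = [0.9, 0.9, 0.9, 0.9]
right = [1, 0, 1, 0]
print(expected_calibration_error(confs, right))  # 0.4
```

Verbalized confidence ("I'm 90% sure") and token-level probabilities can both feed this metric, and the surveyed literature finds they often disagree with each other as well as with accuracy.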

How to Use This Survey

  • Newcomers: Start with the Taxonomy for an orienting overview
  • Researchers: Browse individual category pages for deep-dives into specific literatures
  • Practitioners: Check the Resources & Tools page for benchmarks, datasets, and repos
  • Follow-the-thread readers: Each page ends with cross-references to adjacent categories

Key Search Terms

The literature on LLM consistency is scattered across many keyword spaces. This survey covers work appearing under:

LLM consistency, self-consistency, factual consistency, hallucination, sycophancy, prompt sensitivity, robustness, reliability, calibration, overconfidence, logical contradiction, paraphrase invariance, cross-lingual consistency, benchmark sensitivity, evaluation robustness, verbosity bias, position bias, ordering bias, epistemic consistency, belief revision, uncertainty quantification, confabulation, faithfulness, groundedness

About This Survey

Compiled in March 2026. The literature is rapidly evolving; this survey focuses on establishing a durable taxonomy while citing representative and influential works.


References

Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., Chen, A., Goldie, A., Mirhoseini, A., McKinnon, C., et al. (2022). Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073. https://arxiv.org/abs/2212.08073
Kadavath, S., et al. (2022). Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221. https://arxiv.org/abs/2207.05221
Liang, P., Bommasani, R., Lee, T., Tsipras, D., Soylu, D., Yasunaga, M., Zhang, Y., et al. (2022). Holistic evaluation of language models. arXiv preprint arXiv:2211.09110. https://arxiv.org/abs/2211.09110
Lu, Y., Bartolo, M., Moore, A., Riedel, S., & Stenetorp, P. (2022). Fantastically ordered prompts and where to find them: Overcoming few-shot prompt order sensitivity. Proceedings of ACL. https://aclanthology.org/2022.acl-long.556/
Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al. (2022). Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems. https://arxiv.org/abs/2203.02155
Perez, E., Huang, S., Song, F., Cai, T., Ring, R., Aslanides, J., Glaese, A., McAleese, N., Irving, G., Foulkes, A., et al. (2022). Discovering language model behaviors with model-written evaluations. arXiv preprint arXiv:2212.09251. https://arxiv.org/abs/2212.09251
Sclar, M., Choi, Y., Tsvetkov, Y., & Smith, N. A. (2023). Quantifying language models’ sensitivity to spurious features in prompt design. arXiv preprint arXiv:2310.11324. https://arxiv.org/abs/2310.11324
Xiong, M., et al. (2023). Can LLMs express their uncertainty? An empirical evaluation of confidence elicitation in LLMs. arXiv preprint arXiv:2306.13063. https://arxiv.org/abs/2306.13063
Zhu, K., et al. (2023). PromptBench: Towards evaluating the robustness of large language models on adversarial prompts. arXiv preprint arXiv:2306.04528. https://arxiv.org/abs/2306.04528