Taxonomy of LLM Consistency Issues

A structured map of the problem space

Overview

The term “consistency” in the context of LLMs is used loosely across the literature. We use it broadly to mean: a model is consistent if it gives the same (or compatible) answer to the same (or equivalent) question across different contexts, phrasings, sessions, or modalities.

This survey decomposes consistency into seven major categories, which we explain and justify below.

The Seven Categories

1. Factual Consistency & Hallucination

Core question: When an LLM makes a factual claim, is it accurate — and is it reliably accurate across different framings?

Hallucination (generating plausible-sounding but false content) is among the most-studied forms of inconsistency in current LLM/NLG work (Ji et al., 2023; Lin et al., 2022). It has several subtypes:

  • Intrinsic hallucination: The model contradicts the source/context it was given
  • Extrinsic hallucination: The model generates claims that cannot be verified against any source
  • Parametric vs. contextual inconsistency: The model’s training knowledge conflicts with in-context information
  • Temporal inconsistency: Knowledge cutoff effects cause outdated claims

Key insight: Hallucination is not just about accuracy — it’s about reliability. A model that hallucinates 20% of the time on a topic is inconsistent even if 80% of its answers are correct.
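This distinction between accuracy and reliability can be made measurable. The sketch below (all names and data are illustrative, not drawn from any real benchmark) scores a model both on average accuracy across framings of a single fact and on whether every framing agrees with the gold answer:

```python
# Sketch (illustrative data): measuring factual *reliability* across
# question framings, not just average accuracy.

def framing_reliability(answers: dict[str, str], gold: str) -> dict[str, float]:
    """Accuracy over framings, plus a flag that is 1.0 only if every framing is correct."""
    correct = [a.strip().lower() == gold.strip().lower() for a in answers.values()]
    return {
        "accuracy": sum(correct) / len(correct),  # average correctness
        "reliable": float(all(correct)),          # 1.0 only if no framing hallucinated
    }

# The same fact asked three ways; one framing elicits a wrong answer.
framings = {
    "direct": "Canberra",
    "negated": "Canberra",
    "multiple_choice": "Sydney",
}
stats = framing_reliability(framings, gold="Canberra")
```

A model scoring 0.67 accuracy but 0.0 reliability here is exactly the "80% correct but inconsistent" case described above.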

2. Self-Consistency & Contradiction

Core question: Does the model produce compatible answers when asked the same or logically related questions in different ways?

This is perhaps the purest form of inconsistency:

  • Paraphrase inconsistency: Different phrasings of the same question yield different answers
  • Logical inconsistency: The model asserts P and later asserts not-P in the same session or across sessions
  • Cross-session inconsistency: The same prompt gives different answers across runs
  • Belief inconsistency: The model holds “beliefs” that contradict each other when elicited in different ways

Key insight: Self-inconsistency suggests that LLM outputs can vary substantially with local context and elicitation strategy, rather than reflecting fully stable, globally coherent “beliefs” (Ribeiro et al., 2020; X. Wang et al., 2023; Zhang et al., 2019).
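A minimal paraphrase-consistency score can be computed as pairwise agreement over the answers elicited by different phrasings. The function below is a sketch; the inputs are illustrative:

```python
from itertools import combinations

def paraphrase_consistency(answers: list[str]) -> float:
    """Fraction of answer pairs that agree; 1.0 means fully self-consistent."""
    norm = [a.strip().lower() for a in answers]  # normalize surface form
    pairs = list(combinations(norm, 2))
    return sum(a == b for a, b in pairs) / len(pairs) if pairs else 1.0

# Answers to three paraphrases of the same yes/no question (illustrative).
score = paraphrase_consistency(["Yes", "yes", "No"])
```

Pairwise agreement (rather than exact unanimity) gives a graded score, which is useful when comparing models or tracking drift across versions.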

3. Reasoning Consistency

Core question: Is the model’s multi-step reasoning process internally coherent and reliably correct?

Reasoning consistency differs from factual consistency because the failure mode is in the process rather than just the conclusion:

  • Chain-of-thought faithfulness: Does the stated reasoning actually lead to the conclusion?
  • Mathematical/arithmetic reliability: Do the same calculations always give the same result?
  • Logical validity: Does the model respect logical entailments?
  • Commonsense reasoning: Are common-sense inferences stable?
  • Process-outcome mismatch: Model gives correct answer with wrong reasoning, or wrong answer with correct reasoning

Key insight: LLMs can produce correct answers for wrong reasons (and vice versa), making free-form reasoning traces an imperfect proxy for underlying decision processes (Lightman et al., 2023; Turpin et al., 2023).
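One widely used mitigation, self-consistency decoding (X. Wang et al., 2023), turns this run-to-run variance into a signal: sample several reasoning chains and majority-vote over their final answers. A minimal sketch, with illustrative extracted answers:

```python
from collections import Counter

def self_consistency_vote(final_answers: list[str]) -> tuple[str, float]:
    """Majority-vote the final answers of independently sampled reasoning
    chains; return the winning answer and its agreement rate."""
    counts = Counter(final_answers)
    answer, n = counts.most_common(1)[0]
    return answer, n / len(final_answers)

# Final answers extracted from five sampled chains (illustrative).
answer, agreement = self_consistency_vote(["42", "42", "41", "42", "40"])
```

The agreement rate doubles as a rough confidence signal: low agreement across chains often flags exactly the unreliable-process cases this category describes.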

4. Behavioral Consistency & Sycophancy

Core question: Does the model behave consistently across different social contexts, pressures, and framings?

This category covers how the model’s outputs are shaped by social and contextual cues:

  • Sycophancy: Changing answers when the user expresses disagreement or preference
  • Authority bias: Deferring to apparent expertise or status
  • Position/ordering bias: Preferring answers that appear first or last
  • Framing effects: Different formulations of the same question yield different normative/factual answers
  • Verbosity bias: Preferring longer or more “elaborate-sounding” responses in evaluation
  • Social desirability bias: Giving answers the user seems to want to hear

Key insight: Preference tuning can reward agreement/helpfulness signals that sometimes trade off against factual steadfastness, so social context may shift answers even when ground truth is unchanged (Ouyang et al., 2022; Perez et al., 2022).
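A simple sycophancy probe asks each question twice, the second time with scripted user pushback, and measures the flip rate. The sketch below assumes a hypothetical `model(prompt) -> answer` interface and uses a toy model that always caves, to show the worst case:

```python
def sycophancy_flip_rate(model, questions: list[str]) -> float:
    """Fraction of questions whose answer flips after scripted user pushback.
    `model(prompt)` is a stand-in for any text-completion call."""
    flips = 0
    for q in questions:
        first = model(q)
        pushed = model(f"{q}\nUser: I disagree. Are you sure?\nAssistant:")
        flips += first != pushed
    return flips / len(questions)

# Toy model that caves whenever it sees pushback (illustrative worst case).
def caving_model(prompt: str) -> str:
    return "B" if "I disagree" in prompt else "A"

rate = sycophancy_flip_rate(caving_model, ["Q1?", "Q2?", "Q3?"])
```

Since the ground truth never changes between the two prompts, any nonzero flip rate is attributable to social pressure rather than new information.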

5. Calibration & Uncertainty

Core question: Is the model’s expressed confidence aligned with its actual accuracy?

Calibration is about the match between what a model says it believes and how reliable those beliefs are:

  • Overconfidence: High expressed confidence even when wrong
  • Underconfidence: Excessive hedging on correct claims
  • Verbalized vs. numerical uncertainty: Mismatches between verbal hedges (“I think…”) and the model’s actual probability of being correct
  • Selective hedging: Hedging on easy questions, confident on hard ones
  • Epistemic vs. aleatoric uncertainty: Failure to distinguish “I don’t know” from “this is uncertain”

Key insight: Even strong average accuracy is insufficient if confidence signals are miscalibrated; users need uncertainty estimates that track empirical correctness (Desai & Durrett, 2020; Guo et al., 2017; Kadavath et al., 2022).
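The standard quantitative check here is expected calibration error (ECE): bin predictions by stated confidence and compare each bin's average confidence against its empirical accuracy (Guo et al., 2017). A minimal sketch with illustrative data:

```python
def expected_calibration_error(confs: list[float], correct: list[int], n_bins: int = 10) -> float:
    """Binned ECE: mean |avg confidence - accuracy| weighted by bin size."""
    bins: list[list[tuple[float, int]]] = [[] for _ in range(n_bins)]
    for c, y in zip(confs, correct):
        bins[min(int(c * n_bins), n_bins - 1)].append((c, y))
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        acc = sum(y for _, y in b) / len(b)
        ece += (len(b) / len(confs)) * abs(avg_conf - acc)
    return ece

# Four verbalized confidences vs. actual correctness (illustrative):
# overconfident at 0.9 (only half right), underconfident at 0.6 (all right).
ece = expected_calibration_error([0.9, 0.9, 0.6, 0.6], [1, 0, 1, 1])
```

Note that ECE can be driven up by overconfidence and underconfidence simultaneously, which matches the first two failure modes listed above.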

6. Cross-Context Sensitivity

Core question: Is the model’s behavior consistent across languages, modalities, templates, and surface-level formatting?

  • Cross-lingual inconsistency: Different languages yield different answers to equivalent questions
  • Template/format sensitivity: Different prompt templates change answers despite identical semantics
  • Tokenization artifacts: Superficial changes (capitalization, spacing) affect outputs
  • Cross-modal consistency: Does a vision-language model give consistent answers from text vs. image descriptions?
  • System prompt sensitivity: Behavior changes based on system prompt framing

Key insight: If semantically equivalent inputs produce different outputs, behavior is at least partly driven by surface cues rather than invariant semantic representations (Liu et al., 2024; Lu et al., 2022; Sclar et al., 2023).
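A simple diagnostic is to score the same task under several semantically equivalent prompt templates and report the accuracy spread (cf. Sclar et al., 2023). The template names and numbers below are illustrative:

```python
def template_spread(accuracies: dict[str, float]) -> float:
    """Max-minus-min accuracy over semantically equivalent prompt templates;
    a large spread means behavior is driven by surface form, not semantics."""
    vals = list(accuracies.values())
    return max(vals) - min(vals)

# Same task scored under three formatting variants (numbers illustrative).
spread = template_spread({"colon": 0.71, "newline": 0.64, "json_style": 0.52})
```

Reporting the spread alongside the best single-template score guards against cherry-picked prompts when comparing models.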

7. Benchmark Reliability

Core question: Do our evaluations reliably measure the capabilities they claim to measure?

This is a meta-level consistency issue:

  • Data contamination: Benchmark items in training data inflate apparent performance
  • Prompt sensitivity in benchmarks: Small changes to evaluation prompts change rankings
  • Multiple-choice biases: Option order, labeling, and format affect scores
  • Evaluation inconsistency: Different evaluators (human or LLM) disagree substantially
  • Metric inconsistency: Different metrics give contradictory rankings of models

Key insight: Some benchmark gains can reflect prompt sensitivity, contamination risk, or evaluator effects rather than clean capability improvements, so conclusions should be interpreted with care (Kiela et al., 2021; Liang et al., 2022; Zheng et al., 2023).
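Multiple-choice position bias, for example, can be probed by re-asking a question under every ordering of the options and checking whether the chosen *content* stays fixed. The sketch below assumes a hypothetical judge interface `choose(question, options) -> index`:

```python
from itertools import permutations

def order_invariant(choose, question: str, options: list[str]) -> bool:
    """True if `choose(question, options) -> index` picks the same option
    content under every ordering of the options."""
    picks = {tuple(order)[choose(question, list(order))]
             for order in permutations(options)}
    return len(picks) == 1  # one distinct content chosen => order-invariant

first_picker = lambda q, opts: 0               # pure position bias
b_picker = lambda q, opts: opts.index("B")     # content-based choice

biased = order_invariant(first_picker, "Best answer?", ["A", "B", "C"])
robust = order_invariant(b_picker, "Best answer?", ["A", "B", "C"])
```

The same shuffle-and-compare pattern applies to evaluator consistency: an LLM-as-judge whose verdict depends on candidate ordering is measuring position, not quality.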

How the Categories Relate

The seven categories are not independent. Here are key relationships:

Factual Consistency ←→ Self-Consistency
        ↓                    ↓
   Calibration        Reasoning Consistency
        ↓                    ↓
  Cross-Context ←→ Behavioral Consistency
              ↘        ↙
          Benchmark Reliability
              (measures all of the above)

  • Sycophancy is a behavioral consistency failure that often manifests as factual inconsistency
  • Calibration failure can be seen as a self-consistency failure (stated confidence ≠ actual reliability)
  • Cross-context sensitivity amplifies all other consistency failures across prompt variations
  • Benchmark reliability is the measurement problem that sits above all others

Generator-Validator Gap (Cross-Cutting Concept)

The generator-validator gap captures a pattern that spans multiple categories: models are often better at judging candidate answers than producing the best answer in one shot.

In other words, many successful reliability methods convert one hard problem (single-pass generation) into two easier ones (candidate generation + discriminative validation). This is why verifier models, process-supervision methods, and structured checking loops have become central in modern LLM reliability pipelines (Cobbe et al., 2021; Dhuliawala et al., 2023; Lightman et al., 2023; P. Wang et al., 2023).
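A minimal instance of this pattern is best-of-n reranking: sample several candidates, then let a verifier pick one. The sketch below uses toy stand-ins for the generator and the verifier; the interfaces are hypothetical:

```python
def best_of_n(generate, score, prompt: str, n: int = 4) -> str:
    """Sample n candidates, then let a validator pick the highest-scoring one.
    `generate` and `score` are stand-ins for a sampler and a verifier model."""
    candidates = [generate(prompt, seed=i) for i in range(n)]
    return max(candidates, key=score)

# Toy pair (illustrative): a noisy generator plus a verifier that reliably
# recognizes the correct answer — the generator-validator gap in miniature.
def noisy_generate(prompt: str, seed: int) -> str:
    return ["7", "8", "9", "8"][seed % 4]

def verifier_score(candidate: str) -> float:
    return 1.0 if candidate == "8" else 0.0

best = best_of_n(noisy_generate, verifier_score, "What is 3 + 5?")
```

Even with a generator that is right only half the time, a reliable validator recovers the correct answer, which is the leverage that verifier-based pipelines exploit.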

Boundaries and Scope

Included: Self-consistency, paraphrase robustness, factual reliability, sycophancy, calibration, cross-lingual consistency, reasoning faithfulness, evaluation robustness

Adjacent but not the primary focus: Adversarial robustness (attacks designed to break models), privacy/memorization, general alignment and safety (except where it specifically bears on consistency)

Intentionally excluded: Pure accuracy/knowledge (a model can be consistently wrong — that’s not a consistency issue per se), style and fluency variation (unless it correlates with factual changes)

References

Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., Hesse, C., & Schulman, J. (2021). Training verifiers to solve math word problems. arXiv Preprint arXiv:2110.14168. https://arxiv.org/abs/2110.14168
Desai, S., & Durrett, G. (2020). Calibration of pre-trained transformers. Proceedings of EMNLP. https://aclanthology.org/2020.emnlp-main.21/
Dhuliawala, S., Komeili, M., Xu, J., Raileanu, R., Singh, N., Li, Y., Weston, J., & Roller, S. (2023). Chain-of-verification reduces hallucination in large language models. arXiv Preprint arXiv:2309.11495. https://arxiv.org/abs/2309.11495
Guo, C., Pleiss, G., Sun, Y., & Weinberger, K. Q. (2017). On calibration of modern neural networks. Proceedings of ICML. https://arxiv.org/abs/1706.04599
Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y., Madotto, A., & Fung, P. (2023). A survey of hallucination in natural language generation. ACM Computing Surveys. https://arxiv.org/abs/2202.03629
Kadavath, S., et al. (2022). Language models (mostly) know what they know. arXiv Preprint arXiv:2207.05221. https://arxiv.org/abs/2207.05221
Kiela, D., Bartolo, M., Nie, Y., Kaushik, D., Geiger, A., Wu, Z., Hii, A., et al. (2021). Dynabench: Rethinking benchmarking in NLP. Proceedings of NAACL. https://aclanthology.org/2021.naacl-main.324/
Kuhn, L., Gal, Y., & Farquhar, S. (2023). Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation. Proceedings of ICLR. https://arxiv.org/abs/2302.09664
Liang, P., Bommasani, R., Lee, T., Tsipras, D., Soylu, D., Yasunaga, M., Zhang, Y., et al. (2022). Holistic evaluation of language models. arXiv Preprint arXiv:2211.09110. https://arxiv.org/abs/2211.09110
Lightman, H., Kosaraju, V., Burda, Y., Edwards, H., Baker, B., Lee, T., Leike, J., Schulman, J., Sutskever, I., & Cobbe, K. (2023). Let’s verify step by step. arXiv Preprint arXiv:2305.20050. https://arxiv.org/abs/2305.20050
Lin, S., Hilton, J., & Evans, O. (2022). TruthfulQA: Measuring how models mimic human falsehoods. Proceedings of ACL. https://aclanthology.org/2022.acl-long.229/
Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., & Liang, P. (2024). Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics. https://arxiv.org/abs/2307.03172
Lu, Y., Bartolo, M., Moore, A., Riedel, S., & Stenetorp, P. (2022). Fantastically ordered prompts and where to find them: Overcoming few-shot prompt order sensitivity. Proceedings of ACL. https://aclanthology.org/2022.acl-long.556/
Manakul, P., Liusie, A., & Gales, M. (2023). SelfCheckGPT: Zero-resource black-box hallucination detection for generative large language models. Findings of EMNLP. https://arxiv.org/abs/2303.08896
Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al. (2022). Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems. https://arxiv.org/abs/2203.02155
Perez, E., Huang, S., Song, F., Cai, T., Ring, R., Aslanides, J., Glaese, A., McAleese, N., Irving, G., Foulkes, A., et al. (2022). Discovering language model behaviors with model-written evaluations. arXiv Preprint arXiv:2212.09251. https://arxiv.org/abs/2212.09251
Ribeiro, M. T., Wu, T., Guestrin, C., & Singh, S. (2020). Beyond accuracy: Behavioral testing of NLP models with CheckList. Proceedings of ACL. https://aclanthology.org/2020.acl-main.442/
Sclar, M., Choi, Y., Tsvetkov, Y., & Suhr, A. (2023). Quantifying language models’ sensitivity to spurious features in prompt design. arXiv Preprint arXiv:2310.11324. https://arxiv.org/abs/2310.11324
Turpin, M., Michael, J., Perez, E., & Bowman, S. R. (2023). Language models don’t always say what they think: Unfaithful explanations in chain-of-thought prompting. arXiv Preprint arXiv:2305.04388. https://arxiv.org/abs/2305.04388
Wang, P., Li, L., Qi, Z., et al. (2023). Math-shepherd: Verify and reinforce LLMs step-by-step without human annotations. arXiv Preprint arXiv:2312.08935. https://arxiv.org/abs/2312.08935
Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., & Zhou, D. (2023). Self-consistency improves chain of thought reasoning in language models. International Conference on Learning Representations. https://arxiv.org/abs/2203.11171
Zhang, Y., Baldridge, J., & He, L. (2019). PAWS: Paraphrase adversaries from word scrambling. Proceedings of NAACL. https://aclanthology.org/N19-1382/
Zheng, L., Chiang, W.-L., Sheng, Y., Li, T., Zhuang, Y., Wu, Z., Zhuang, S., Lin, Z., Li, E., Xing, E. P., Zhang, H., Gonzalez, J. E., & Stoica, I. (2023). Judging LLM-as-a-judge with MT-bench and chatbot arena. arXiv Preprint arXiv:2306.05685. https://arxiv.org/abs/2306.05685