Taxonomy of LLM Consistency Issues
A structured map of the problem space
Overview
The term “consistency” in the context of LLMs is used loosely across the literature. We use it broadly to mean: a model is consistent if it gives the same (or compatible) answer to the same (or equivalent) question across different contexts, phrasings, sessions, or modalities.
This survey decomposes consistency into seven major categories, which we explain and justify below.
The Seven Categories
1. Factual Consistency & Hallucination
Core question: When an LLM makes a factual claim, is it accurate — and is it reliably accurate across different framings?
Hallucination (generating plausible-sounding but false content) is among the most-studied forms of inconsistency in current LLM/NLG work (Ji et al., 2023; Lin et al., 2022). It has several subtypes:
- Intrinsic hallucination: The model contradicts the source/context it was given
- Extrinsic hallucination: The model generates claims that cannot be verified against any source
- Parametric vs. contextual inconsistency: The model’s training knowledge conflicts with in-context information
- Temporal inconsistency: Knowledge cutoff effects cause outdated claims
Key insight: Hallucination is not just about accuracy — it’s about reliability. A model that hallucinates 20% of the time on a topic is inconsistent even if 80% of its answers are correct.
2. Self-Consistency & Contradiction
Core question: Does the model produce compatible answers when asked the same or logically related questions in different ways?
This is perhaps the purest form of inconsistency:
- Paraphrase inconsistency: Different phrasings of the same question yield different answers
- Logical inconsistency: The model asserts P and later asserts not-P in the same session or across sessions
- Cross-session inconsistency: The same prompt gives different answers across runs
- Belief inconsistency: The model holds “beliefs” that contradict each other when elicited in different ways
Key insight: Self-inconsistency suggests that LLM outputs can vary substantially with local context and elicitation strategy, rather than reflecting fully stable, globally coherent “beliefs” (Ribeiro et al., 2020; X. Wang et al., 2023; Zhang et al., 2019).
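Paraphrase and cross-run inconsistency can be made measurable by sampling answers to several phrasings of the same question and computing agreement with the majority answer. The sketch below is illustrative: `ask` stands in for any LLM call, and `toy_model` is a hypothetical stub (not a real model) that flips its answer when the word "capital" is missing.

```python
from collections import Counter

def consistency_rate(ask, prompts, n_samples=5):
    """Fraction of sampled answers that agree with the overall majority answer.

    `ask` is any callable prompt -> answer string. Answers are compared as
    exact strings, so semantically equivalent outputs should be normalized
    (lowercased, stripped, etc.) before counting.
    """
    answers = [ask(p) for p in prompts for _ in range(n_samples)]
    _, majority_count = Counter(answers).most_common(1)[0]
    return majority_count / len(answers)

# Hypothetical stub standing in for a paraphrase-sensitive LLM: it answers
# differently when the word "capital" is absent from the prompt.
def toy_model(prompt):
    return "Paris" if "capital" in prompt else "Lyon"

paraphrases = [
    "What is the capital of France?",
    "Which city is the seat of the French government?",
]
rate = consistency_rate(toy_model, paraphrases)  # 0.5: half the answers flip
```

A rate of 1.0 indicates full agreement across phrasings and samples; anything lower quantifies exactly the paraphrase and cross-session failures listed above.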
3. Reasoning Consistency
Core question: Is the model’s multi-step reasoning process internally coherent and reliably correct?
Reasoning consistency differs from factual consistency because the failure mode is in the process rather than just the conclusion:
- Chain-of-thought faithfulness: Does the stated reasoning actually lead to the conclusion?
- Mathematical/arithmetic reliability: Do the same calculations always give the same result?
- Logical validity: Does the model respect logical entailments?
- Commonsense reasoning: Are common-sense inferences stable?
- Process-outcome mismatch: The model gives a correct answer with invalid reasoning, or a wrong answer with valid reasoning
Key insight: LLMs can produce correct answers for wrong reasons (and vice versa), making free-form reasoning traces an imperfect proxy for underlying decision processes (Lightman et al., 2023; Turpin et al., 2023).
4. Behavioral Consistency & Sycophancy
Core question: Does the model behave consistently across different social contexts, pressures, and framings?
This category covers how the model’s outputs are shaped by social and contextual cues:
- Sycophancy: Changing answers when the user expresses disagreement or preference
- Authority bias: Deferring to apparent expertise or status
- Position/ordering bias: Preferring answers that appear first or last
- Framing effects: Different formulations of the same question yield different normative/factual answers
- Verbosity bias: Preferring longer or more “elaborate-sounding” responses in evaluation
- Social desirability bias: Giving answers the user seems to want to hear
Key insight: Preference tuning can reward agreement/helpfulness signals that sometimes trade off against factual steadfastness, so social context may shift answers even when ground truth is unchanged (Ouyang et al., 2022; Perez et al., 2022).
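Sycophancy in particular lends itself to a simple metric: a flip rate, obtained by re-asking each question with user pushback appended and counting answer changes while ground truth stays fixed. Everything below is a sketch; `caving_model` is a hypothetical stub, not a real assistant.

```python
def sycophancy_flip_rate(ask, questions, pushback="I disagree. Are you sure?"):
    """Fraction of questions whose answer changes after user pushback,
    even though the underlying ground truth is unchanged."""
    flips = 0
    for q in questions:
        first = ask(q)
        second = ask(f"{q}\nUser: {pushback}\nAssistant:")
        flips += first != second
    return flips / len(questions)

# Hypothetical stub: stands firm on "2 + 2" but caves to any pushback
# on other questions.
def caving_model(prompt):
    if "2 + 2" in prompt:
        return "4"
    return "no" if "I disagree" in prompt else "yes"

rate = sycophancy_flip_rate(caving_model, ["Is the Earth round?", "What is 2 + 2?"])
# rate == 0.5: the model flips on one of the two questions
```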
5. Calibration & Uncertainty
Core question: Is the model’s expressed confidence aligned with its actual accuracy?
Calibration is about the match between what a model says it believes and how reliable those beliefs are:
- Overconfidence: High expressed confidence even when wrong
- Underconfidence: Excessive hedging on correct claims
- Verbalized vs. numerical uncertainty: Mismatches between hedging language (“I think…”) and the model’s actual probability of being correct
- Selective hedging: Hedging on easy questions, confident on hard ones
- Epistemic vs. aleatoric uncertainty: Failure to distinguish “I don’t know” (the model lacks the knowledge) from “this is uncertain” (the outcome is inherently variable)
Key insight: Even strong average accuracy is insufficient if confidence signals are miscalibrated; users need uncertainty estimates that track empirical correctness (Desai & Durrett, 2020; Guo et al., 2017; Kadavath et al., 2022).
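The standard way to audit this is expected calibration error (ECE), in the spirit of Guo et al. (2017): bucket predictions by stated confidence and compare each bucket's average confidence with its empirical accuracy. A minimal sketch:

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: the bin-weighted average gap between stated confidence and
    observed accuracy. 0.0 means perfectly calibrated."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # clamp conf == 1.0 into last bin
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(o for _, o in bucket) / len(bucket)
        ece += (len(bucket) / total) * abs(accuracy - avg_conf)
    return ece

# Four claims all stated at 95% confidence, but only three are right:
# the model is overconfident by roughly 0.2.
ece = expected_calibration_error([0.95, 0.95, 0.95, 0.95], [1, 1, 1, 0])
```

This captures overconfidence and underconfidence in one number; selective hedging shows up when ECE is computed separately on easy vs. hard slices.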
6. Cross-Context Sensitivity
Core question: Is the model’s behavior consistent across languages, modalities, templates, and surface-level formatting?
- Cross-lingual inconsistency: Different languages yield different answers to equivalent questions
- Template/format sensitivity: Different prompt templates change answers despite identical semantics
- Tokenization artifacts: Superficial changes (capitalization, spacing) affect outputs
- Cross-modal consistency: Does a vision-language model give consistent answers from text vs. image descriptions?
- System prompt sensitivity: Behavior changes based on system prompt framing
Key insight: If semantically equivalent inputs produce different outputs, behavior is at least partly driven by surface cues rather than invariant semantic representations (Liu et al., 2024; Lu et al., 2022; Sclar et al., 2023).
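Template sensitivity can be probed directly: render one item through several semantically equivalent templates and collect the distinct answers. The stub model below is hypothetical; an invariant model would yield a single-element set.

```python
def template_sensitivity(ask, item, templates):
    """Set of distinct answers one item receives across prompt templates.
    A format-invariant model returns a singleton set."""
    return {ask(template.format(q=item)) for template in templates}

# Hypothetical stub whose answer depends on surface formatting alone.
def brittle_model(prompt):
    return "yes" if prompt.startswith("Q:") else "maybe"

answers = template_sensitivity(
    brittle_model,
    "Is water wet",
    ["Q: {q}?", "Question: {q}?", "{q} - true or false?"],
)
# answers == {"yes", "maybe"}: the model is template-sensitive
```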
7. Benchmark Reliability
Core question: Do our evaluations reliably measure the capabilities they claim to measure?
This is a meta-level consistency issue:
- Data contamination: Benchmark items in training data inflate apparent performance
- Prompt sensitivity in benchmarks: Small changes to evaluation prompts change rankings
- Multiple-choice biases: Option order, labeling, and format affect scores
- Evaluation inconsistency: Different evaluators (human or LLM) disagree substantially
- Metric inconsistency: Different metrics give contradictory rankings of models
Key insight: Some benchmark gains can reflect prompt sensitivity, contamination risk, or evaluator effects rather than clean capability improvements, so conclusions should be interpreted with care (Kiela et al., 2021; Liang et al., 2022; Zheng et al., 2023).
How the Categories Relate
The seven categories are not independent. Here are key relationships:
```
Factual Consistency  ←→  Self-Consistency
        ↓                        ↓
   Calibration           Reasoning Consistency
        ↓                        ↓
  Cross-Context  ←→  Behavioral Consistency
              ↘       ↙
          Benchmark Reliability
       (measures all of the above)
```
- Sycophancy is a behavioral consistency failure that often manifests as factual inconsistency
- Calibration failure can be seen as a self-consistency failure (stated confidence ≠ actual reliability)
- Cross-context sensitivity amplifies all other consistency failures across prompt variations
- Benchmark reliability is the measurement problem that sits above all others
Generator-Validator Gap (Cross-Cutting Concept)
The generator-validator gap captures a pattern that spans multiple categories: models are often better at judging candidate answers than producing the best answer in one shot.
This concept links:
- Reasoning consistency: best-of-N sampling + verifier reranking can outperform direct generation on multi-step tasks (Cobbe et al., 2021; Lightman et al., 2023).
- Self-consistency: majority voting over sampled chains often improves reasoning accuracy (X. Wang et al., 2023).
- Calibration: disagreement among candidates can serve as a practical uncertainty signal, improving abstention and risk control (Kuhn et al., 2023; Manakul et al., 2023).
In other words, many successful reliability methods convert one hard problem (single-pass generation) into two easier ones (candidate generation + discriminative validation). This is why verifier models, process-supervision methods, and structured checking loops have become central in modern LLM reliability pipelines (Dhuliawala et al., 2023; Lightman et al., 2023; P. Wang et al., 2023).
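The two standard instantiations of this split can be sketched in a few lines; `sample`, `generate`, and `score` below are placeholders for a stochastic model call and a verifier, not real APIs:

```python
from collections import Counter

def self_consistency_vote(sample, n=10):
    """Majority vote over n sampled answers (self-consistency decoding)."""
    return Counter(sample() for _ in range(n)).most_common(1)[0][0]

def best_of_n(generate, score, n=10):
    """Generator-validator split: sample n candidates, keep the one the
    validator scores highest."""
    return max((generate() for _ in range(n)), key=score)

# Deterministic toy run: five sampled "chains" land on three distinct answers.
chains = iter(["41", "42", "42", "40", "42"])
voted = self_consistency_vote(lambda: next(chains), n=5)  # "42" wins 3 of 5

# A toy verifier that simply recognizes the correct answer.
candidates = iter(["41", "40", "42"])
picked = best_of_n(lambda: next(candidates), score=lambda a: a == "42", n=3)  # "42"
```

Both helpers make the trade explicit: more sampling cost in exchange for moving the hard decision from generation to a cheaper discrimination step.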
Boundaries and Scope
Included: Self-consistency, paraphrase robustness, factual reliability, sycophancy, calibration, cross-lingual consistency, reasoning faithfulness, evaluation robustness
Adjacent but not the primary focus: Adversarial robustness (attacks designed to break models), privacy/memorization, general alignment and safety (except where it specifically bears on consistency)
Intentionally excluded: Pure accuracy/knowledge (a model can be consistently wrong — that’s not a consistency issue per se), style and fluency variation (unless it correlates with factual changes)