Calibration & Uncertainty

Aligning expressed confidence with actual correctness

Overview

Calibration asks whether model confidence tracks empirical correctness. In a calibrated system, statements made with 80% confidence should be correct about 80% of the time. LLMs often fail this in practice—especially on long-tail questions, ambiguous prompts, and hallucination-prone tasks.
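The "80% of the time" criterion is what expected calibration error (ECE) quantifies. A minimal sketch of the standard binned estimator, assuming per-answer confidences and 0/1 correctness labels are already in hand:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin predictions by confidence, then average the |accuracy - confidence|
    gap per bin, weighted by the fraction of samples in that bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # weight by bin occupancy
    return ece
```

A reliability diagram plots the same per-bin accuracy/confidence pairs that this loop aggregates into one number.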

The challenge is amplified in generation: token-level probabilities do not map cleanly to semantic correctness. This has driven work on semantic uncertainty, verbalized confidence elicitation, and selective prediction methods (including abstention). From an operational standpoint, calibration quality determines when a system should answer, hedge, cite, or decline.

Calibration is also downstream-sensitive: instruction tuning and preference optimization can improve usefulness while changing confidence behavior (e.g., InstructGPT, Constitutional AI, DPO). Teams therefore need post-alignment calibration audits, not just pretraining-era metrics.

Key Papers

Foundational Calibration Methods

On Calibration of Modern Neural Networks (Guo et al., 2017) · ICML · link

Established modern ECE/temperature-scaling practice still used in NLP and LLM calibration workflows.
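Temperature scaling itself is a one-parameter post-hoc fix: divide held-out logits by a scalar T chosen to minimize negative log-likelihood, leaving the argmax prediction unchanged. A minimal numpy sketch, with a coarse grid search standing in for the LBFGS fit used in the paper:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_temperature(logits, labels, grid=np.linspace(0.25, 4.0, 376)):
    """Pick the scalar T minimizing held-out NLL of softmax(logits / T).
    T > 1 softens overconfident predictions; T < 1 sharpens them."""
    best_T, best_nll = 1.0, np.inf
    n = len(labels)
    for T in grid:
        p = softmax(logits / T)
        nll = -np.log(p[np.arange(n), labels] + 1e-12).mean()
        if nll < best_nll:
            best_T, best_nll = T, nll
    return best_T
```

Because T is fit on a held-out split after training, accuracy is untouched; only the confidence distribution moves.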

Calibration of Pre-trained Transformers (Desai, Durrett, 2020) · EMNLP · link

Showed transformer classifiers are frequently miscalibrated and benchmarked post-hoc correction methods.

LLM Confidence and Verbalized Uncertainty

Language Models (Mostly) Know What They Know (Kadavath et al., 2022) · arXiv · link

Influential evidence that confidence can be informative but remains imperfect and often over-optimistic.

Teaching Models to Express Their Uncertainty in Words (Lin et al., 2022) · arXiv · link

Explores training/prompting schemes for better-aligned verbal confidence estimates.

Can LLMs Express Their Uncertainty? (Xiong et al., 2023) · arXiv · link

Shows elicitation protocol strongly affects confidence quality, with persistent overconfidence.

Semantic Uncertainty and Risk Control

Semantic Uncertainty (Kuhn, Gal, Farquhar, 2023) · ICLR · link

Introduces semantic entropy over meaning-equivalent generations, which can outperform token entropy for factual risk estimation in several settings.
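The idea can be sketched without a model in the loop: cluster sampled generations into meaning-equivalence classes and take the entropy of the cluster distribution. Here a toy normalized-string match stands in for the paper's bidirectional NLI entailment check:

```python
import math

def semantic_entropy(generations, equivalent=None):
    """Entropy over meaning clusters of sampled answers. The paper clusters
    via bidirectional entailment with an NLI model; this toy stand-in treats
    answers as equivalent when they match after normalization."""
    if equivalent is None:
        equivalent = lambda a, b: a.strip().lower().rstrip(".") == b.strip().lower().rstrip(".")
    clusters = []  # each cluster: list of generations judged mutually equivalent
    for g in generations:
        for c in clusters:
            if equivalent(g, c[0]):
                c.append(g)
                break
        else:
            clusters.append([g])
    probs = [len(c) / len(generations) for c in clusters]
    return -sum(p * math.log(p) for p in probs)
```

High semantic entropy (answers scattered across many meaning clusters) flags generations to hedge on or abstain from, even when each individual sample is fluent.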

Conformal Language Modeling (Quach et al., 2023) · arXiv · link

Brings distribution-free risk guarantees into language settings, important for selective prediction and abstention.
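For intuition, a generic split-conformal sketch (not the paper's sampling-based procedure): calibrate a threshold on held-out nonconformity scores, then answer only when the resulting prediction set is a singleton and abstain otherwise:

```python
import numpy as np

def conformal_threshold(cal_scores, alpha=0.1):
    """Split-conformal quantile. cal_scores are nonconformity scores
    s_i = 1 - p_model(true label) on a held-out calibration set; the
    (ceil((n+1)(1-alpha))/n)-quantile gives 1-alpha marginal coverage."""
    n = len(cal_scores)
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    return np.quantile(cal_scores, level, method="higher")

def prediction_set(probs, qhat):
    """Keep every label whose nonconformity 1 - p is within the threshold;
    a set larger than one element is a natural abstention signal."""
    return [i for i, p in enumerate(probs) if 1 - p <= qhat]
```

The coverage guarantee is distribution-free but marginal: it holds on average over exchangeable test points, not conditionally per input.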

Preprints & Recent Work

Overconfidence is Key (Kumar et al., 2024) · arXiv · link

Evaluates verbalized uncertainty across LLM/VLM systems, showing systematic overconfidence.

Look at the Labels (Sahay et al., 2024) · arXiv · link

Prompt-sensitivity metric useful for uncertainty-aware robustness analysis.

Trusting Your Evidence (Shi et al., 2023) · arXiv · link

Context-aware decoding aligns generation decisions with evidence quality.

Blog Posts & Practitioner Resources

Benchmarks & Datasets

Name · Year · Description · Link
TruthfulQA · 2022 · Truthfulness benchmark used in confidence studies · https://arxiv.org/abs/2109.07958
MMLU · 2020 · Broad QA benchmark for calibration curves · https://arxiv.org/abs/2009.03300
BIG-Bench / BBH · 2022–2023 · Robustness-sensitive task suites · https://arxiv.org/abs/2206.04615
PromptBench · 2023 · Prompt perturbation uncertainty stress tests · https://arxiv.org/abs/2306.04528
SelfCheckGPT setup · 2023 · Uncertainty-via-disagreement protocol · https://arxiv.org/abs/2303.08896
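The SelfCheckGPT-style protocol in the last row is easy to sketch: resample several answers and score how often they disagree with the main answer, with a toy exact-match check standing in for the paper's NLI/BERTScore consistency scorers:

```python
def disagreement_score(main_answer, samples, agree=None):
    """Uncertainty via disagreement: a high fraction of resampled answers
    contradicting the main answer suggests the claim is not well supported.
    Toy exact-match agreement stands in for learned consistency scorers."""
    if agree is None:
        agree = lambda a, b: a.strip().lower() == b.strip().lower()
    disagreements = sum(0 if agree(main_answer, s) else 1 for s in samples)
    return disagreements / len(samples)
```

The appeal of this setup is that it needs only sampling access, no logits, which makes it usable with closed APIs.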

GitHub Repos

Key Takeaways

Key Insights
  • Overconfidence remains a central LLM failure mode, especially on hallucination-prone prompts.
  • Semantic-level uncertainty signals are often more useful than raw token probabilities.
  • Reliable products need calibrated abstention behavior, not just better top-1 answer quality.

See Also