Calibration & Uncertainty
Aligning expressed confidence with actual correctness
Overview
Calibration asks whether model confidence tracks empirical correctness. In a calibrated system, statements made with 80% confidence should be correct about 80% of the time. LLMs often fail this in practice—especially on long-tail questions, ambiguous prompts, and hallucination-prone tasks.
The challenge is amplified in generation: token-level probabilities do not map cleanly to semantic correctness. This has driven work on semantic uncertainty, verbalized confidence elicitation, and selective prediction methods (including abstention). From an operational standpoint, calibration quality determines when a system should answer, hedge, cite, or decline.
Calibration is also downstream-sensitive: instruction tuning and preference optimization can improve usefulness while changing confidence behavior (e.g., InstructGPT, Constitutional AI, DPO). Teams therefore need post-alignment calibration audits, not just pretraining-era metrics.
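The "80% confidence should be right 80% of the time" criterion above is usually quantified with expected calibration error (ECE): bin predictions by stated confidence and average the per-bin gap between accuracy and mean confidence. A minimal sketch (equal-width bins; the toy data is illustrative):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin predictions by confidence, then average the
    |accuracy - mean confidence| gap per bin, weighted by bin size."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap
    return float(ece)

# Toy check: ten answers stated at 80% confidence, eight of them right,
# is perfectly calibrated.
print(round(expected_calibration_error([0.8] * 10, [1] * 8 + [0] * 2), 3))  # 0.0
```

Adaptive (equal-mass) binning is a common variant when confidence scores cluster near 1.0.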
Key Papers
Foundational Calibration Methods
On Calibration of Modern Neural Networks (Guo et al., 2017) · ICML · link
Established modern ECE/temperature-scaling practice still used in NLP and LLM calibration workflows.
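Temperature scaling fits a single scalar T on held-out logits and divides all logits by it before the softmax. A minimal NumPy sketch (grid search stands in for the LBFGS fit Guo et al. use; the overconfident toy classifier is invented for illustration):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def nll(logits, labels, T):
    """Held-out negative log-likelihood under temperature-scaled softmax."""
    p = softmax(logits / T)
    return -np.log(p[np.arange(len(labels)), labels] + 1e-12).mean()

def fit_temperature(logits, labels, grid=np.linspace(0.5, 10.0, 96)):
    """Single-parameter fit of T on validation logits; accuracy is
    unchanged because argmax is invariant to dividing logits by T."""
    return float(min(grid, key=lambda T: nll(logits, labels, T)))

# Toy overconfident classifier: always predicts class 0 at ~99% softmax
# confidence, but is right only 70% of the time.
logits = np.tile([5.0, 0.0], (100, 1))
labels = np.array([0] * 70 + [1] * 30)

T = fit_temperature(logits, labels)
print(T > 1.0)                              # True: T > 1 softens confidence
print(round(softmax(logits / T)[0, 0], 2))  # 0.7: matches the 70% accuracy
```

T above 1 flattens an overconfident distribution; T below 1 sharpens an underconfident one.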
Calibration of Pre-trained Transformers (Desai, Durrett, 2020) · EMNLP · link
Showed transformer classifiers are frequently miscalibrated and benchmarked post-hoc correction methods.
LLM Confidence and Verbalized Uncertainty
Language Models (Mostly) Know What They Know (Kadavath et al., 2022) · arXiv · link
Influential evidence that confidence can be informative but remains imperfect and often over-optimistic.
Teaching Models to Express Their Uncertainty in Words (Lin et al., 2022) · arXiv · link
Explores training/prompting schemes for better-aligned verbal confidence estimates.
Can LLMs Express Their Uncertainty? (Xiong et al., 2023) · arXiv · link
Shows elicitation protocol strongly affects confidence quality, with persistent overconfidence.
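A verbalized-confidence audit of the kind these papers run reduces to parsing the stated number and comparing it with graded correctness. A minimal sketch (the "Confidence: NN%" format and the toy transcript are assumptions; Xiong et al. compare several elicitation formats):

```python
import re

def parse_verbalized_confidence(answer):
    """Extract the first 'NN%' figure from a model answer, as a fraction.
    Returns None when no percentage is stated."""
    m = re.search(r"(\d{1,3})\s*%", answer)
    return min(int(m.group(1)), 100) / 100 if m else None

def overconfidence_gap(records):
    """Mean stated confidence minus accuracy over (answer, correct) pairs;
    positive values indicate overconfidence."""
    pairs = [(parse_verbalized_confidence(a), ok) for a, ok in records]
    pairs = [(c, ok) for c, ok in pairs if c is not None]
    mean_conf = sum(c for c, _ in pairs) / len(pairs)
    accuracy = sum(ok for _, ok in pairs) / len(pairs)
    return mean_conf - accuracy

records = [  # (model answer, graded correctness) - invented toy transcript
    ("Paris. Confidence: 95%", True),
    ("1912. Confidence: 90%", False),
    ("Neon. Confidence: 85%", True),
    ("42. Confidence: 90%", False),
]
print(round(overconfidence_gap(records), 2))  # 0.4: ~90% stated vs 50% right
```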
Semantic Uncertainty and Risk Control
Semantic Uncertainty (Kuhn, Gal, Farquhar, 2023) · ICLR · link
Introduces semantic entropy over meaning-equivalent generations, which can outperform token entropy for factual risk estimation in several settings.
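The semantic entropy idea can be sketched in a few lines: sample several generations, greedily cluster them into meaning classes, then take the entropy over cluster frequencies. The equivalence predicate below is a normalized exact-match stand-in; Kuhn et al. use bidirectional NLI entailment instead:

```python
import math

def semantic_entropy(samples, equivalent):
    """Greedy clustering of sampled answers into meaning classes, then
    Shannon entropy over class frequencies."""
    clusters = []
    for s in samples:
        for cluster in clusters:
            if equivalent(s, cluster[0]):
                cluster.append(s)
                break
        else:
            clusters.append([s])
    n = len(samples)
    return sum(-(len(c) / n) * math.log(len(c) / n) for c in clusters)

def same_meaning(a, b):
    # Stand-in: case/punctuation-insensitive match (the paper judges
    # mutual entailment with an NLI model instead).
    norm = lambda s: "".join(ch for ch in s.lower() if ch.isalnum())
    return norm(a) == norm(b)

consistent = ["Paris", "paris.", " Paris ", "PARIS"]    # one meaning class
scattered = ["Paris", "Lyon", "Marseille", "Toulouse"]  # four meaning classes
print(semantic_entropy(consistent, same_meaning))       # 0.0
print(semantic_entropy(scattered, same_meaning) > 1.0)  # True
```

Token entropy would treat "Paris" and "paris." as different outcomes; clustering by meaning removes that surface-form noise from the uncertainty estimate.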
Conformal Language Modeling (Quach et al., 2023) · arXiv · link
Brings distribution-free risk guarantees into language settings, important for selective prediction and abstention.
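The selective-prediction use case can be sketched as calibrating an abstention threshold on held-out data: answer only when the model's confidence score clears a tau chosen so the error rate among answered items stays under a risk budget alpha. This is a simplified sketch in the conformal spirit, not the paper's full procedure with formal guarantees (the scores and labels below are invented):

```python
def abstention_threshold(cal_scores, cal_correct, alpha=0.1):
    """Lowest threshold tau such that, on calibration data, the error rate
    among items with score >= tau is at most alpha. Lower tau = wider
    coverage, so we scan ascending and stop at the first feasible tau."""
    for tau in sorted(set(cal_scores)):
        answered = [ok for s, ok in zip(cal_scores, cal_correct) if s >= tau]
        error = 1 - sum(answered) / len(answered)
        if error <= alpha:
            return tau
    return float("inf")  # no threshold meets the target: always abstain

scores  = [0.95, 0.9, 0.85, 0.8, 0.6, 0.55, 0.5, 0.4, 0.3, 0.2]
correct = [True, True, True, True, True, False, True, False, False, False]
print(abstention_threshold(scores, correct, alpha=0.2))  # 0.5
```

At tau = 0.5 the system answers 7 of 10 calibration items with one error (~14%), inside the 20% budget; any lower tau overshoots it.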
Preprints & Recent Work
Overconfidence is Key (Groot, Valdenegro-Toro, 2024) · arXiv · link
Evaluates verbalized uncertainty across LLM/VLM systems, showing systematic overconfidence.
Look at the Labels (Sahay et al., 2024) · arXiv · link
Prompt-sensitivity metric useful for uncertainty-aware robustness analysis.
Trusting Your Evidence (Shi et al., 2023) · arXiv · link
Context-aware decoding aligns generation decisions with evidence quality.
Blog Posts & Practitioner Resources
Language Models (Mostly) Know What They Know (research post) (Anthropic, 2022) — https://www.anthropic.com/research/language-models-mostly-know-what-they-know High-signal summary of confidence calibration findings.
Evidently AI calibration guides (Evidently, ongoing) — https://www.evidentlyai.com/ Applied monitoring workflows for confidence drift and ECE-style checks.
OpenAI Cookbook (OpenAI, ongoing) — https://cookbook.openai.com/ Practical recipes for reliability prompting, multi-sample checks, and abstention behavior.
Benchmarks & Datasets
| Name | Year | Description | Link |
|---|---|---|---|
| TruthfulQA | 2021 | Truthfulness benchmark used in confidence studies | https://arxiv.org/abs/2109.07958 |
| MMLU | 2020 | Broad QA benchmark for calibration curves | https://arxiv.org/abs/2009.03300 |
| BIG-Bench / BBH | 2022–2023 | Robustness-sensitive task suites | https://arxiv.org/abs/2206.04615 |
| PromptBench | 2023 | Prompt perturbation uncertainty stress tests | https://arxiv.org/abs/2306.04528 |
| SelfCheckGPT setup | 2023 | Uncertainty-via-disagreement protocol | https://arxiv.org/abs/2303.08896 |
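The SelfCheckGPT row above treats disagreement among resampled generations as the uncertainty signal. A minimal sketch of that protocol, with exact-match standing in for the paper's NLI/QA/n-gram consistency scorers:

```python
def disagreement_score(main_answer, samples, agrees):
    """Fraction of resampled answers that disagree with the main answer;
    higher scores flag the main answer as less trustworthy."""
    return sum(1 for s in samples if not agrees(main_answer, s)) / len(samples)

def exact_match(a, b):
    # Stand-in consistency check; SelfCheckGPT uses learned scorers.
    return a.strip().lower() == b.strip().lower()

# Resamples mostly agree with the main answer -> low disagreement.
print(disagreement_score("1969", ["1969", "1969 ", "1968", "1969"], exact_match))  # 0.25
```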
GitHub Repos
LM-Polygraph — https://github.com/IINemo/lm-polygraph Uncertainty estimation and hallucination detection toolkit.
semantic_uncertainty — https://github.com/lorenzkuhn/semantic_uncertainty Reference code for semantic entropy methods.
OpenAI Evals — https://github.com/openai/evals Useful for confidence-linked pass/fail and abstention evaluations.
Key Takeaways
- Overconfidence remains a central LLM failure mode, especially on hallucination-prone prompts.
- Semantic-level uncertainty signals are often more useful than raw token probabilities.
- Reliable products need calibrated abstention behavior, not just better top-1 answer quality.