Calibration & Uncertainty
Aligning expressed confidence with actual correctness
Overview
Calibration asks whether model confidence tracks empirical correctness. In a calibrated system, statements made with 80% confidence should be correct about 80% of the time. LLMs often fail this in practice—especially on long-tail questions, ambiguous prompts, and hallucination-prone tasks.
The challenge is amplified in generation: token-level probabilities do not map cleanly to semantic correctness. This has driven work on semantic uncertainty, verbalized confidence elicitation, and selective prediction methods (including abstention). From an operational standpoint, calibration quality determines when a system should answer, hedge, cite, or decline.
Calibration is also downstream-sensitive: instruction tuning and preference optimization can improve usefulness while changing confidence behavior (e.g., InstructGPT, Constitutional AI, DPO). Teams therefore need post-alignment calibration audits, not just pretraining-era metrics.
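The "80% confidence should be right 80% of the time" criterion above is usually quantified with expected calibration error (ECE): bin predictions by stated confidence and average the per-bin gap between accuracy and mean confidence. A minimal sketch (equal-width bins; the toy data is illustrative):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin predictions by confidence, then average the
    |accuracy - mean confidence| gap per bin, weighted by bin size."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap
    return float(ece)

# Toy check: ten answers stated at 80% confidence, eight of them right,
# is perfectly calibrated.
print(round(expected_calibration_error([0.8] * 10, [1] * 8 + [0] * 2), 3))  # 0.0
```

Adaptive (equal-mass) binning is a common variant when confidence scores cluster near 1.0.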
Key Papers
Foundational Calibration Methods
On Calibration of Modern Neural Networks (Guo et al., 2017) · ICML · link
Established modern ECE/temperature-scaling practice still used in NLP and LLM calibration workflows.
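Temperature scaling fits a single scalar T on held-out logits and divides all logits by it before the softmax. A minimal NumPy sketch (grid search stands in for the LBFGS fit Guo et al. use; the overconfident toy classifier is invented for illustration):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def nll(logits, labels, T):
    """Held-out negative log-likelihood under temperature-scaled softmax."""
    p = softmax(logits / T)
    return -np.log(p[np.arange(len(labels)), labels] + 1e-12).mean()

def fit_temperature(logits, labels, grid=np.linspace(0.5, 10.0, 96)):
    """Single-parameter fit of T on validation logits; accuracy is
    unchanged because argmax is invariant to dividing logits by T."""
    return float(min(grid, key=lambda T: nll(logits, labels, T)))

# Toy overconfident classifier: always predicts class 0 at ~99% softmax
# confidence, but is right only 70% of the time.
logits = np.tile([5.0, 0.0], (100, 1))
labels = np.array([0] * 70 + [1] * 30)

T = fit_temperature(logits, labels)
print(T > 1.0)                              # True: T > 1 softens confidence
print(round(softmax(logits / T)[0, 0], 2))  # 0.7: matches the 70% accuracy
```

T above 1 flattens an overconfident distribution; T below 1 sharpens an underconfident one.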
Calibration of Pre-trained Transformers (Desai, Durrett, 2020) · EMNLP · link
Showed transformer classifiers are frequently miscalibrated and benchmarked post-hoc correction methods.
LLM Confidence and Verbalized Uncertainty
Language Models (Mostly) Know What They Know (Kadavath et al., 2022) · arXiv · link
Influential evidence that confidence can be informative but remains imperfect and often over-optimistic.
Teaching Models to Express Their Uncertainty in Words (Lin et al., 2022) · arXiv · link
Explores training/prompting schemes for better-aligned verbal confidence estimates.
Can LLMs Express Their Uncertainty? (Xiong et al., 2023) · arXiv · link
Shows elicitation protocol strongly affects confidence quality, with persistent overconfidence.
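A verbalized-confidence audit of the kind these papers run reduces to parsing the stated number and comparing it with graded correctness. A minimal sketch (the "Confidence: NN%" format and the toy transcript are assumptions; Xiong et al. compare several elicitation formats):

```python
import re

def parse_verbalized_confidence(answer):
    """Extract the first 'NN%' figure from a model answer, as a fraction.
    Returns None when no percentage is stated."""
    m = re.search(r"(\d{1,3})\s*%", answer)
    return min(int(m.group(1)), 100) / 100 if m else None

def overconfidence_gap(records):
    """Mean stated confidence minus accuracy over (answer, correct) pairs;
    positive values indicate overconfidence."""
    pairs = [(parse_verbalized_confidence(a), ok) for a, ok in records]
    pairs = [(c, ok) for c, ok in pairs if c is not None]
    mean_conf = sum(c for c, _ in pairs) / len(pairs)
    accuracy = sum(ok for _, ok in pairs) / len(pairs)
    return mean_conf - accuracy

records = [  # (model answer, graded correctness) - invented toy transcript
    ("Paris. Confidence: 95%", True),
    ("1912. Confidence: 90%", False),
    ("Neon. Confidence: 85%", True),
    ("42. Confidence: 90%", False),
]
print(round(overconfidence_gap(records), 2))  # 0.4: ~90% stated vs 50% right
```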
Semantic Uncertainty and Risk Control
Semantic Uncertainty (Kuhn, Gal, Farquhar, 2023) · ICLR · link
Introduces semantic entropy over meaning-equivalent generations, which can outperform token entropy for factual risk estimation in several settings.
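The semantic entropy idea can be sketched in a few lines: sample several generations, greedily cluster them into meaning classes, then take the entropy over cluster frequencies. The equivalence predicate below is a normalized exact-match stand-in; Kuhn et al. use bidirectional NLI entailment instead:

```python
import math

def semantic_entropy(samples, equivalent):
    """Greedy clustering of sampled answers into meaning classes, then
    Shannon entropy over class frequencies."""
    clusters = []
    for s in samples:
        for cluster in clusters:
            if equivalent(s, cluster[0]):
                cluster.append(s)
                break
        else:
            clusters.append([s])
    n = len(samples)
    return sum(-(len(c) / n) * math.log(len(c) / n) for c in clusters)

def same_meaning(a, b):
    # Stand-in: case/punctuation-insensitive match (the paper judges
    # mutual entailment with an NLI model instead).
    norm = lambda s: "".join(ch for ch in s.lower() if ch.isalnum())
    return norm(a) == norm(b)

consistent = ["Paris", "paris.", " Paris ", "PARIS"]    # one meaning class
scattered = ["Paris", "Lyon", "Marseille", "Toulouse"]  # four meaning classes
print(semantic_entropy(consistent, same_meaning))       # 0.0
print(semantic_entropy(scattered, same_meaning) > 1.0)  # True
```

Token entropy would treat "Paris" and "paris." as different outcomes; clustering by meaning removes that surface-form noise from the uncertainty estimate.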
Conformal Language Modeling (Quach et al., 2023) · arXiv · link
Brings distribution-free risk guarantees into language settings, important for selective prediction and abstention.
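The selective-prediction use case can be sketched as calibrating an abstention threshold on held-out data: answer only when the model's confidence score clears a tau chosen so the error rate among answered items stays under a risk budget alpha. This is a simplified sketch in the conformal spirit, not the paper's full procedure with formal guarantees (the scores and labels below are invented):

```python
def abstention_threshold(cal_scores, cal_correct, alpha=0.1):
    """Lowest threshold tau such that, on calibration data, the error rate
    among items with score >= tau is at most alpha. Lower tau = wider
    coverage, so we scan ascending and stop at the first feasible tau."""
    for tau in sorted(set(cal_scores)):
        answered = [ok for s, ok in zip(cal_scores, cal_correct) if s >= tau]
        error = 1 - sum(answered) / len(answered)
        if error <= alpha:
            return tau
    return float("inf")  # no threshold meets the target: always abstain

scores  = [0.95, 0.9, 0.85, 0.8, 0.6, 0.55, 0.5, 0.4, 0.3, 0.2]
correct = [True, True, True, True, True, False, True, False, False, False]
print(abstention_threshold(scores, correct, alpha=0.2))  # 0.5
```

At tau = 0.5 the system answers 7 of 10 calibration items with one error (~14%), inside the 20% budget; any lower tau overshoots it.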
Preprints & Recent Work
Overconfidence is Key (Groot, Valdenegro-Toro, 2024) · arXiv · link
Evaluates verbalized uncertainty across LLM/VLM systems, showing systematic overconfidence.
Look at the Labels (Sahay et al., 2024) · arXiv · link
Prompt-sensitivity metric useful for uncertainty-aware robustness analysis.
Trusting Your Evidence (Shi et al., 2023) · arXiv · link
Context-aware decoding aligns generation decisions with evidence quality.
Blog Posts & Practitioner Resources
Language Models (Mostly) Know What They Know (research post) (Anthropic, 2022) — https://www.anthropic.com/research/language-models-mostly-know-what-they-know High-signal summary of confidence calibration findings.
Evidently AI calibration guides (Evidently, ongoing) — https://www.evidentlyai.com/ Applied monitoring workflows for confidence drift and ECE-style checks.
OpenAI Cookbook (OpenAI, ongoing) — https://cookbook.openai.com/ Practical recipes for reliability prompting, multi-sample checks, and abstention behavior.
Benchmarks & Datasets
| Name | Year | Description | Link |
|---|---|---|---|
| TruthfulQA | 2021 | Truthfulness benchmark used in confidence studies | https://arxiv.org/abs/2109.07958 |
| MMLU | 2020 | Broad QA benchmark for calibration curves | https://arxiv.org/abs/2009.03300 |
| BIG-Bench / BBH | 2022–2023 | Robustness-sensitive task suites | https://arxiv.org/abs/2206.04615 |
| PromptBench | 2023 | Prompt perturbation uncertainty stress tests | https://arxiv.org/abs/2306.04528 |
| SelfCheckGPT setup | 2023 | Uncertainty-via-disagreement protocol | https://arxiv.org/abs/2303.08896 |
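The SelfCheckGPT row above treats disagreement among resampled generations as the uncertainty signal. A minimal sketch of that protocol, with exact-match standing in for the paper's NLI/QA/n-gram consistency scorers:

```python
def disagreement_score(main_answer, samples, agrees):
    """Fraction of resampled answers that disagree with the main answer;
    higher scores flag the main answer as less trustworthy."""
    return sum(1 for s in samples if not agrees(main_answer, s)) / len(samples)

def exact_match(a, b):
    # Stand-in consistency check; SelfCheckGPT uses learned scorers.
    return a.strip().lower() == b.strip().lower()

# Resamples mostly agree with the main answer -> low disagreement.
print(disagreement_score("1969", ["1969", "1969 ", "1968", "1969"], exact_match))  # 0.25
```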
GitHub Repos
LM-Polygraph — https://github.com/IINemo/lm-polygraph Uncertainty estimation and hallucination detection toolkit.
semantic_uncertainty — https://github.com/lorenzkuhn/semantic_uncertainty Reference code for semantic entropy methods.
OpenAI Evals — https://github.com/openai/evals Useful for confidence-linked pass/fail and abstention evaluations.
Key Takeaways
- Overconfidence remains a central LLM failure mode, especially on hallucination-prone prompts.
- Semantic-level uncertainty signals are often more useful than raw token probabilities.
- Reliable products need calibrated abstention behavior, not just better top-1 answer quality.