Cross-Context & Format Sensitivity
Equivalent meaning, different surface form, different answer
Overview
Cross-context consistency measures whether LLM outputs remain stable when semantics are held constant but context form changes: language, template, ordering, punctuation, answer format, or modality. Failures here are especially costly because they make system behavior difficult to predict and replicate.
A broad body of work shows that prompt and template details can produce large swings in performance (e.g., Zhao et al., 2021; Lu et al., 2022; Sclar et al., 2023). Demonstration order, label token choices, and small textual artifacts can alter outcomes independent of underlying problem difficulty. Multilingual settings add another layer: equivalent factual queries can yield different answers and confidence levels across languages (Wang et al., 2023).
In short, many models remain sensitive to surface form. Robust deployment therefore benefits from perturbation testing as a default QA practice rather than a niche robustness exercise.
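A minimal perturbation sweep can make this concrete: hold the question fixed, vary only surface form, and measure output agreement. The sketch below is illustrative; `query_model` is a hypothetical stand-in for any LLM call (API or local), not a real library function.

```python
from collections import Counter

def surface_variants(question: str) -> list[str]:
    """Semantically equivalent surface forms of one question."""
    return [
        question,
        question.rstrip("?") + "?",            # punctuation normalization
        question.upper(),                      # casing change
        f"Q: {question}\nA:",                  # template change
        f"Please answer briefly: {question}",  # instruction wrapper
    ]

def consistency_rate(answers: list[str]) -> float:
    """Fraction of variants agreeing with the most common answer."""
    most_common_count = Counter(answers).most_common(1)[0][1]
    return most_common_count / len(answers)

def query_model(prompt: str) -> str:
    """Placeholder for a real model call."""
    return "Paris"

answers = [query_model(v) for v in surface_variants("What is the capital of France")]
print(consistency_rate(answers))  # 1.0 only when every variant agrees
```

A consistency rate well below 1.0 on such sweeps is exactly the surface-form sensitivity the papers below document.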
Key Papers
Prompt and Demonstration Sensitivity
Calibrate Before Use (Zhao et al., 2021) · ICML
Landmark paper on contextual calibration and few-shot prompt sensitivity.
Fantastically Ordered Prompts (Lu et al., 2022) · ACL
Demonstration order alone can strongly affect in-context learning outcomes.
Rethinking the Role of Demonstrations (Min et al., 2022) · EMNLP
Shows superficial format cues can matter more than semantic relevance in ICL.
Quantifying Sensitivity to Spurious Prompt Features (Sclar et al., 2023) · arXiv
Capitalization, punctuation, and label tokens can materially shift model behavior.
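The ordering effect from Lu et al. (2022) is easy to probe: enumerate every permutation of the same few-shot demonstrations and score each resulting prompt. The demos and labels below are toy examples for illustration.

```python
from itertools import permutations

# Toy sentiment demonstrations; only their order varies across variants.
demos = [
    ("Great movie!", "positive"),
    ("Terrible plot.", "negative"),
    ("It was fine.", "neutral"),
]

def build_prompt(ordered_demos, query: str) -> str:
    """Assemble a few-shot prompt from demos in a given order."""
    shots = "\n".join(f"Review: {t}\nLabel: {l}" for t, l in ordered_demos)
    return f"{shots}\nReview: {query}\nLabel:"

prompts = [build_prompt(p, "Not bad at all.") for p in permutations(demos)]
print(len(prompts))  # 3! = 6 semantically equivalent prompts
```

Scoring all n! orderings (feasible for small n) reveals the accuracy spread that ordering alone induces.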
Cross-Lingual and Multimodal Consistency
Cross-Lingual Consistency of Factual Knowledge in Multilingual LMs (Wang et al., 2023) · arXiv
Direct evidence that factual outputs can diverge by language for equivalent questions.
Language Models are Multilingual Chain-of-Thought Reasoners (Shi et al., 2023) · ICLR
CoT transfer exists but is uneven across language-resource conditions.
MMBench (Liu et al., 2023) · arXiv
Multimodal benchmark exposing prompt/format brittleness in VLM evaluation.
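A cross-lingual consistency check in the spirit of Wang et al. (2023) poses the same factual question in several languages and compares normalized answers. Everything here is an assumption for illustration: the parallel translations, the `ask` stand-in, and the normalization rule.

```python
# Hypothetical parallel queries; translations are assumed equivalent.
parallel = {
    "en": "What is the boiling point of water at sea level, in Celsius?",
    "de": "Wie hoch ist der Siedepunkt von Wasser auf Meereshöhe in Celsius?",
    "fr": "Quel est le point d'ébullition de l'eau au niveau de la mer, en Celsius ?",
}

def ask(prompt: str) -> str:
    """Stand-in for any model call."""
    return "100"

def normalize(answer: str) -> str:
    """Strip units and whitespace before comparing across languages."""
    return answer.strip().rstrip("°C").strip()

answers = {lang: normalize(ask(q)) for lang, q in parallel.items()}
consistent = len(set(answers.values())) == 1
print(consistent)
```

Reporting the per-language answer set, not just aggregate accuracy, makes divergences like those in the papers above visible.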
Preprints & Recent Work
PromptBench (Zhu et al., 2023) · NeurIPS Datasets/Benchmarks track
Standardized stress testing for adversarial and perturbed prompts.
Look at the Labels (Sahay et al., 2024) · arXiv
Classifier-free metric for prompt sensitivity and output instability.
MMMU (Yue et al., 2023) · arXiv
Broad multimodal reasoning benchmark useful for cross-format consistency studies.
Blog Posts & Practitioner Resources
Hugging Face evaluation/robustness guides (HF, ongoing)
Practical tutorials on prompt robustness and multilingual evaluation.
OpenAI Cookbook (OpenAI, ongoing)
Prompt/format testing patterns and reproducibility practices.
CRFM/HELM docs (Stanford CRFM, ongoing)
Multi-scenario framework that encourages robust cross-context reporting.
Benchmarks & Datasets
| Name | Year | Description | Link |
|---|---|---|---|
| PromptBench | 2023 | Adversarial and perturbed prompt robustness | https://arxiv.org/abs/2306.04528 |
| CheckList | 2020 | Invariance and directional expectation tests | https://aclanthology.org/2020.acl-main.442/ |
| MGSM | 2023 | Multilingual grade-school math consistency | https://arxiv.org/abs/2210.03057 |
| XCOPA | 2020 | Cross-lingual causal commonsense | https://aclanthology.org/2020.findings-emnlp.240/ |
| Belebele | 2023 | Massively multilingual reading comprehension | https://arxiv.org/abs/2308.16884 |
| MMBench | 2023 | Multimodal prompt/format robustness | https://arxiv.org/abs/2307.06281 |
| MMMU | 2023 | Multi-discipline multimodal reasoning | https://arxiv.org/abs/2311.16502 |
GitHub Repos
microsoft/promptbench
Core benchmark implementation for prompt perturbation testing.
EleutherAI/lm-evaluation-harness
Reproducible multi-task framework for prompt-format comparisons.
openai/evals
Custom format-variant eval authoring.
Key Takeaways
- Surface-form changes frequently cause large behavior shifts even when semantics are unchanged.
- Cross-lingual consistency is still a major open reliability problem.
- Robust evaluation should always include prompt/template perturbation sweeps and variance reporting.
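The variance reporting recommended above amounts to a few lines of bookkeeping once per-template scores exist. The accuracy figures below are made-up placeholders for a hypothetical five-template sweep over one test set.

```python
from statistics import mean, stdev

# Hypothetical per-template accuracies from a perturbation sweep.
template_accuracy = {
    "plain":            0.78,
    "q_a_format":       0.81,
    "uppercase":        0.66,
    "no_question_mark": 0.74,
    "instruction":      0.79,
}

accs = list(template_accuracy.values())
print(f"mean={mean(accs):.3f}  std={stdev(accs):.3f}  "
      f"spread={max(accs) - min(accs):.3f}")
```

Reporting mean, standard deviation, and min-max spread across templates (rather than a single best-template number) is the variance reporting these takeaways call for.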