Cross-Context & Format Sensitivity

Equivalent meaning, different surface form, different answer

Overview

Cross-context consistency measures whether LLM outputs remain stable when semantics are held constant but context form changes: language, template, ordering, punctuation, answer format, or modality. Failures here are especially costly because they make system behavior difficult to predict and replicate.

A broad body of work shows that prompt and template details can produce large swings in performance (e.g., Zhao et al., 2021; Lu et al., 2022; Sclar et al., 2023). Demonstration order, label token choices, and small textual artifacts can alter outcomes independently of underlying problem difficulty. Multilingual settings add another layer: equivalent factual queries can yield different answers and confidence levels across languages (Wang et al., 2023).

In short, many models remain sensitive to surface form. Robust deployment therefore benefits from perturbation testing as a default QA practice rather than a niche robustness exercise.
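A perturbation-testing pass of this kind can be sketched as follows. The snippet below is a minimal illustration, not a production harness: it generates surface-form variants of a prompt (punctuation, casing, whitespace), queries a model, and reports how often the answer agrees with the modal answer. The `toy_model` stub is hypothetical, deliberately brittle to trailing punctuation so the metric has something to detect; in practice you would substitute a real model call.

```python
import statistics

def surface_variants(prompt: str) -> list[str]:
    """Generate surface-form variants that preserve the prompt's meaning."""
    variants = [
        prompt,
        prompt.rstrip(".") + ".",   # trailing period
        prompt + "\n",              # trailing newline
        prompt.capitalize(),        # casing change
        prompt.replace(":", " -"),  # punctuation swap
    ]
    # Deduplicate while preserving order.
    seen, out = set(), []
    for v in variants:
        if v not in seen:
            seen.add(v)
            out.append(v)
    return out

def consistency_rate(model, prompt: str) -> float:
    """Fraction of surface variants whose answer matches the modal answer."""
    answers = [model(v) for v in surface_variants(prompt)]
    modal = statistics.mode(answers)
    return sum(a == modal for a in answers) / len(answers)

# Hypothetical stand-in model, brittle to trailing punctuation (illustration only).
def toy_model(prompt: str) -> str:
    return "yes" if prompt.endswith(".") else "no"

rate = consistency_rate(toy_model, "Is Paris the capital of France")
print(rate)  # < 1.0 signals surface-form sensitivity
```

Reporting this rate (or the variance of task accuracy) across a sweep of such variants is the concrete form the "default QA practice" above takes.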

Key Papers

Prompt and Demonstration Sensitivity

Calibrate Before Use (Zhao et al., 2021) · ICML

Landmark paper on contextual calibration and few-shot prompt sensitivity.
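The core fix proposed in that paper can be sketched in a few lines: estimate the model's label bias by running a content-free input (e.g., "N/A") through the same prompt, then rescale the real label probabilities by that bias (the diagonal-W variant of contextual calibration). The probabilities below are made-up numbers for illustration.

```python
def contextual_calibration(label_probs, content_free_probs):
    """Rescale label probabilities by the model's bias on a content-free
    input, following the diagonal-W form of contextual calibration
    (Zhao et al., 2021), then renormalize to a distribution."""
    scaled = [p / q for p, q in zip(label_probs, content_free_probs)]
    total = sum(scaled)
    return [s / total for s in scaled]

# Toy example: the raw model favors "positive" even on a content-free input.
raw = [0.7, 0.3]  # P(positive), P(negative) on a real input
cf = [0.8, 0.2]   # same probabilities when the input is "N/A"
calibrated = contextual_calibration(raw, cf)
print(calibrated)  # the argmax flips once the prompt bias is divided out
```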

Fantastically Ordered Prompts (Lu et al., 2022) · ACL

Demonstration order alone can strongly affect in-context learning outcomes.
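The effect is easy to measure: hold the demonstrations fixed, enumerate their orderings, and check whether the prediction changes. The sketch below uses a hypothetical recency-biased stub in place of a real in-context learner, purely to show the harness shape.

```python
from itertools import permutations

def order_sensitivity(model, demos, query):
    """Predictions for the same query under every ordering of the demos."""
    return [model(list(order), query) for order in permutations(demos)]

# Hypothetical stub with pure recency bias: predicts the label of the
# last demonstration, regardless of the query.
def toy_icl_model(demos, query):
    return demos[-1][1]

demos = [("great movie", "pos"), ("terrible plot", "neg")]
answers = order_sensitivity(toy_icl_model, demos, "fine acting")
print(answers)  # same demos, same query, order-dependent predictions
```

With real models the disagreement is rarely this total, but reporting the spread of accuracy across orderings (rather than one arbitrary ordering) is the point.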

Rethinking the Role of Demonstrations (Min et al., 2022) · EMNLP

Shows superficial format cues can matter more than semantic relevance in ICL.

Quantifying Sensitivity to Spurious Prompt Features (Sclar et al., 2023) · arXiv

Capitalization, punctuation, and label tokens can materially shift model behavior.

Cross-Lingual and Multimodal Consistency

Cross-Lingual Consistency of Factual Knowledge in Multilingual LMs (Wang et al., 2023) · arXiv

Direct evidence that factual outputs can diverge by language for equivalent questions.
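A basic cross-lingual consistency check in this spirit: ask the same factual question in several languages and score pairwise agreement of the (normalized) answers. The `toy_model` stub and the particular queries are illustrative assumptions, standing in for per-language model calls.

```python
def cross_lingual_consistency(model, query_by_lang, normalize=str.casefold):
    """Pairwise agreement of answers to equivalent queries across languages."""
    answers = {lang: normalize(model(q)) for lang, q in query_by_lang.items()}
    langs = list(answers)
    pairs = [(a, b) for i, a in enumerate(langs) for b in langs[i + 1:]]
    if not pairs:
        return 1.0, answers
    agree = sum(answers[a] == answers[b] for a, b in pairs)
    return agree / len(pairs), answers

# Hypothetical stub: answers correctly in English and French, wrongly in German.
def toy_model(q):
    return "Paris" if ("capital" in q or "capitale" in q) else "Lyon"

queries = {
    "en": "What is the capital of France?",
    "fr": "Quelle est la capitale de la France ?",
    "de": "Was ist die Hauptstadt von Frankreich?",
}
rate, answers = cross_lingual_consistency(toy_model, queries)
print(rate, answers)  # agreement below 1.0 flags a language-dependent answer
```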

Language Models are Multilingual Chain-of-Thought Reasoners (Shi et al., 2023) · ICLR

CoT transfer exists but is uneven across language-resource conditions.

MMBench (Liu et al., 2023) · arXiv

Multimodal benchmark exposing prompt/format brittleness in VLM evaluation.

Preprints & Recent Work

PromptBench (Zhu et al., 2023) · NeurIPS Datasets/Benchmarks track

Standardized stress testing for adversarial and perturbed prompts.

Look at the Labels (Sahay et al., 2024) · arXiv

Classifier-free metric for prompt sensitivity and output instability.

MMMU (Yue et al., 2023) · arXiv

Broad multimodal reasoning benchmark useful for cross-format consistency studies.

Blog Posts & Practitioner Resources

Benchmarks & Datasets

| Name | Year | Description | Link |
|---|---|---|---|
| PromptBench | 2023 | Adversarial and perturbed prompt robustness | https://arxiv.org/abs/2306.04528 |
| CheckList | 2020 | Invariance and directional expectation tests | https://aclanthology.org/2020.acl-main.442/ |
| MGSM | 2023 | Multilingual grade-school math consistency | https://arxiv.org/abs/2210.03057 |
| XCOPA | 2020 | Cross-lingual causal commonsense | https://aclanthology.org/2020.findings-emnlp.240/ |
| Belebele | 2023 | Massively multilingual reading comprehension | https://arxiv.org/abs/2308.16884 |
| MMBench | 2023 | Multimodal prompt/format robustness | https://arxiv.org/abs/2307.06281 |
| MMMU | 2023 | Multi-discipline multimodal reasoning | https://arxiv.org/abs/2311.16502 |
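The CheckList invariance (INV) tests in the table above have a simple operational shape: apply a label-preserving perturbation to each input and flag any example where the prediction changes. A minimal sketch, with a hypothetical word-count stub as the model and character duplication as the perturbation:

```python
import random

def invariance_test(model, inputs, perturb, seed=0):
    """CheckList-style INV test: return inputs whose prediction changes
    under a label-preserving perturbation."""
    rng = random.Random(seed)
    return [x for x in inputs if model(x) != model(perturb(x, rng))]

# Label-preserving perturbation: duplicate one character (a simple typo).
def add_typo(text, rng):
    i = rng.randrange(len(text))
    return text[:i] + text[i] + text[i:]

# Hypothetical stub keyed on word count, so it is unaffected by typos.
def robust_model(text):
    return "long" if len(text.split()) > 3 else "short"

inputs = ["good film overall", "a very long and winding review"]
failures = invariance_test(robust_model, inputs, add_typo)
print(failures)  # empty list: every prediction survived the perturbation
```

A model sensitive to the perturbation would populate `failures`, giving a concrete per-example view of surface-form brittleness.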

GitHub Repos

Key Takeaways

Key Insights
  • Surface-form changes frequently cause large behavior shifts even when semantics are unchanged.
  • Cross-lingual consistency is still a major open reliability problem.
  • Robust evaluation should always include prompt/template perturbation sweeps and variance reporting.

See Also