Cross-Context & Format Sensitivity
Equivalent meaning, different surface form, different answer
Overview
Cross-context consistency measures whether LLM outputs remain stable when semantics are held constant but context form changes: language, template, ordering, punctuation, answer format, or modality. Failures here are especially costly because they make system behavior difficult to predict and replicate.
A broad body of work shows that prompt and template details can produce large swings in performance (e.g., Zhao et al., 2021; Lu et al., 2022; Sclar et al., 2023). Demonstration order, label token choices, and small textual artifacts can alter outcomes independent of underlying problem difficulty. Multilingual settings add another layer: equivalent factual queries can yield different answers and confidence levels across languages (Wang et al., 2023).
In short, many models remain sensitive to surface form. Robust deployment therefore benefits from perturbation testing as a default QA practice rather than a niche robustness exercise.
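A minimal perturbation sweep can make this concrete: hold the question fixed, vary only surface form, and measure output agreement. The sketch below is illustrative; `query_model` is a hypothetical stand-in for any LLM call (API or local), not a real library function.

```python
from collections import Counter

def surface_variants(question: str) -> list[str]:
    """Semantically equivalent surface forms of one question."""
    return [
        question,
        question.rstrip("?") + "?",            # punctuation normalization
        question.upper(),                      # casing change
        f"Q: {question}\nA:",                  # template change
        f"Please answer briefly: {question}",  # instruction wrapper
    ]

def consistency_rate(answers: list[str]) -> float:
    """Fraction of variants agreeing with the most common answer."""
    most_common_count = Counter(answers).most_common(1)[0][1]
    return most_common_count / len(answers)

def query_model(prompt: str) -> str:
    """Placeholder for a real model call."""
    return "Paris"

answers = [query_model(v) for v in surface_variants("What is the capital of France")]
print(consistency_rate(answers))  # 1.0 only when every variant agrees
```

A consistency rate well below 1.0 on such sweeps is exactly the surface-form sensitivity the papers below document.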
Key Papers
Prompt and Demonstration Sensitivity
Calibrate Before Use (Zhao et al., 2021) · ICML
Landmark paper on contextual calibration and few-shot prompt sensitivity.
Fantastically Ordered Prompts (Lu et al., 2022) · ACL
Demonstration order alone can strongly affect in-context learning outcomes.
Rethinking the Role of Demonstrations (Min et al., 2022) · EMNLP
Shows superficial format cues can matter more than semantic relevance in ICL.
Quantifying Sensitivity to Spurious Prompt Features (Sclar et al., 2023) · arXiv
Capitalization, punctuation, and label tokens can materially shift model behavior.
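The ordering effect from Lu et al. (2022) is easy to probe: enumerate every permutation of the same few-shot demonstrations and score each resulting prompt. The demos and labels below are toy examples for illustration.

```python
from itertools import permutations

# Toy sentiment demonstrations; only their order varies across variants.
demos = [
    ("Great movie!", "positive"),
    ("Terrible plot.", "negative"),
    ("It was fine.", "neutral"),
]

def build_prompt(ordered_demos, query: str) -> str:
    """Assemble a few-shot prompt from demos in a given order."""
    shots = "\n".join(f"Review: {t}\nLabel: {l}" for t, l in ordered_demos)
    return f"{shots}\nReview: {query}\nLabel:"

prompts = [build_prompt(p, "Not bad at all.") for p in permutations(demos)]
print(len(prompts))  # 3! = 6 semantically equivalent prompts
```

Scoring all n! orderings (feasible for small n) reveals the accuracy spread that ordering alone induces.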
Cross-Lingual and Multimodal Consistency
Cross-Lingual Consistency of Factual Knowledge in Multilingual LMs (Wang et al., 2023) · arXiv
Direct evidence that factual outputs can diverge by language for equivalent questions.
Language Models are Multilingual Chain-of-Thought Reasoners (Shi et al., 2023) · ICLR
CoT transfer exists but is uneven across language-resource conditions.
MMBench (Liu et al., 2023) · arXiv
Multimodal benchmark exposing prompt/format brittleness in VLM evaluation.
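A cross-lingual consistency check in the spirit of Wang et al. (2023) poses the same factual question in several languages and compares normalized answers. Everything here is an assumption for illustration: the parallel translations, the `ask` stand-in, and the normalization rule.

```python
# Hypothetical parallel queries; translations are assumed equivalent.
parallel = {
    "en": "What is the boiling point of water at sea level, in Celsius?",
    "de": "Wie hoch ist der Siedepunkt von Wasser auf Meereshöhe in Celsius?",
    "fr": "Quel est le point d'ébullition de l'eau au niveau de la mer, en Celsius ?",
}

def ask(prompt: str) -> str:
    """Stand-in for any model call."""
    return "100"

def normalize(answer: str) -> str:
    """Strip units and whitespace before comparing across languages."""
    return answer.strip().rstrip("°C").strip()

answers = {lang: normalize(ask(q)) for lang, q in parallel.items()}
consistent = len(set(answers.values())) == 1
print(consistent)
```

Reporting the per-language answer set, not just aggregate accuracy, makes divergences like those in the papers above visible.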
Preprints & Recent Work
PromptBench (Zhu et al., 2023) · NeurIPS Datasets/Benchmarks track
Standardized stress testing for adversarial and perturbed prompts.
Look at the Labels (Sahay et al., 2024) · arXiv
Classifier-free metric for prompt sensitivity and output instability.
MMMU (Yue et al., 2023) · arXiv
Broad multimodal reasoning benchmark useful for cross-format consistency studies.
Blog Posts & Practitioner Resources
Hugging Face evaluation/robustness guides (HF, ongoing)
Practical tutorials on prompt robustness and multilingual evaluation.
OpenAI Cookbook (OpenAI, ongoing)
Prompt/format testing patterns and reproducibility practices.
CRFM/HELM docs (Stanford CRFM, ongoing)
Multi-scenario framework that encourages robust cross-context reporting.
Benchmarks & Datasets
| Name | Year | Description | Link |
|---|---|---|---|
| PromptBench | 2023 | Adversarial and perturbed prompt robustness | https://arxiv.org/abs/2306.04528 |
| CheckList | 2020 | Invariance and directional expectation tests | https://aclanthology.org/2020.acl-main.442/ |
| MGSM | 2023 | Multilingual grade-school math consistency | https://arxiv.org/abs/2210.03057 |
| XCOPA | 2020 | Cross-lingual causal commonsense | https://aclanthology.org/2020.findings-emnlp.240/ |
| Belebele | 2023 | Massively multilingual reading comprehension | https://arxiv.org/abs/2308.16884 |
| MMBench | 2023 | Multimodal prompt/format robustness | https://arxiv.org/abs/2307.06281 |
| MMMU | 2023 | Multi-discipline multimodal reasoning | https://arxiv.org/abs/2311.16502 |
GitHub Repos
microsoft/promptbench
Core benchmark implementation for prompt perturbation testing.
EleutherAI/lm-evaluation-harness
Reproducible multi-task framework for prompt-format comparisons.
openai/evals
Custom format-variant eval authoring.
Key Takeaways
- Surface-form changes frequently cause large behavior shifts even when semantics are unchanged.
- Cross-lingual consistency is still a major open reliability problem.
- Robust evaluation should always include prompt/template perturbation sweeps and variance reporting.
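The variance reporting recommended above amounts to a few lines of bookkeeping once per-template scores exist. The accuracy figures below are made-up placeholders for a hypothetical five-template sweep over one test set.

```python
from statistics import mean, stdev

# Hypothetical per-template accuracies from a perturbation sweep.
template_accuracy = {
    "plain":            0.78,
    "q_a_format":       0.81,
    "uppercase":        0.66,
    "no_question_mark": 0.74,
    "instruction":      0.79,
}

accs = list(template_accuracy.values())
print(f"mean={mean(accs):.3f}  std={stdev(accs):.3f}  "
      f"spread={max(accs) - min(accs):.3f}")
```

Reporting mean, standard deviation, and min-max spread across templates (rather than a single best-template number) is the variance reporting these takeaways call for.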