Behavioral Consistency & Sycophancy
How social framing, position, and style alter model behavior
Overview
Behavioral consistency is about whether an LLM responds stably across social contexts—not just factual contexts. A model may be technically capable of a correct answer yet still shift under user pressure, framing cues, authority signals, or style biases. The most studied case is sycophancy: agreeing with users even when they are wrong (e.g., Perez et al., 2022).
These effects are linked to alignment objectives. Preference optimization can reward agreeableness and rhetorical smoothness over truth-maintaining disagreement in some settings (background in Ouyang et al., 2022; mitigation-oriented alternatives in Bai et al., 2022). In evaluation pipelines, related issues appear as judge bias: verbosity, position, or framing can influence scores independently of content quality (Zheng et al., 2023; Liu et al., 2023).
As a result, behavioral consistency is both a model behavior problem and an evaluation design problem. In practice, many evaluation stacks include swap-order testing, adversarial prompting, and disagreement probes as robustness checks.
Key Papers
Sycophancy and Preference-Induced Failure Modes
Discovering Language Model Behaviors with Model-Written Evaluations (Perez et al., 2022) · arXiv · link
Early large-scale evidence of sycophancy and suggestibility, with scalable targeted behavior evals.
Training Language Models to Follow Instructions with Human Feedback (InstructGPT) (Ouyang et al., 2022) · NeurIPS · link
Foundational RLHF paper; crucial background for how preference optimization can shape behavioral consistency.
Constitutional AI (Bai et al., 2022) · arXiv · link
Introduces principle-guided critique/revision loops as an alternative to pure user-pleasing optimization.
From Yes-Men to Truth-Tellers (Chen et al., 2024) · arXiv · link
Explicit sycophancy-targeted tuning intervention (venue unconfirmed at time of writing).
Position, Ordering, and Verbosity Bias
Lost in the Middle (Liu et al., 2023/2024) · TACL · link
Demonstrates long-context positional inconsistency, with underuse of middle-context evidence.
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena (Zheng et al., 2023) · arXiv / widely used evaluation reference · link
Frequently cited evidence for position and verbosity bias in judge-style evaluation.
G-Eval (Liu et al., 2023) · arXiv · link
Shows strong judge-human correlation under structured prompts, while highlighting rubric and prompt sensitivity.
Preprints & Recent Work
Direct Preference Optimization (DPO) (Rafailov et al., 2023) · NeurIPS · link
Preference training objective simplification; often studied for downstream behavior trade-offs including sycophancy.
RRHF (Yuan et al., 2023) · arXiv · link
Ranking-based alignment alternative used in comparisons of consistency and over-compliance tendencies.
Whose Opinions Do Language Models Reflect? (Santurkar et al., 2023) · ICML · link
Shows strong framing sensitivity in “opinionated” outputs.
Blog Posts & Practitioner Resources
Towards Understanding Sycophancy in Language Models (Anthropic, 2023) — https://www.anthropic.com/research/towards-understanding-sycophancy-in-language-models A practitioner-focused analysis of why and when models over-agree.
LMSYS Blog (Arena/MT-Bench methodology) (LMSYS, ongoing) — https://lmsys.org/blog/ Practical documentation of judge and pairwise-eval artifacts.
CRFM HELM reports (Stanford CRFM, ongoing) — https://crfm.stanford.edu/helm/latest/ Multi-metric evaluation framing that includes robustness/fairness dimensions.
Benchmarks & Datasets
| Name | Year | Description | Link |
|---|---|---|---|
| MT-Bench | 2023 | Multi-turn benchmark used with LLM judges | https://arxiv.org/abs/2306.05685 |
| Chatbot Arena | 2023– | Live pairwise human preference benchmark | https://chat.lmsys.org/ |
| AlpacaEval | 2023–2024 | Automated pairwise instruction-following eval | https://github.com/tatsu-lab/alpaca_eval |
| TruthfulQA | 2022 | Truthfulness vs plausible falsehoods | https://arxiv.org/abs/2109.07958 |
| BBQ | 2022 | Social bias QA benchmark | https://aclanthology.org/2022.findings-acl.165/ |
| StereoSet | 2021 | Stereotype bias in LM completions | https://aclanthology.org/2021.acl-long.416/ |
| BOLD | 2021 | Open-ended generation bias benchmark | https://arxiv.org/abs/2101.11718 |
GitHub Repos
tatsu-lab/alpaca_eval — https://github.com/tatsu-lab/alpaca_eval Pairwise automatic evaluation toolkit with practical bias controls.
lm-sys/FastChat — https://github.com/lm-sys/FastChat Infrastructure behind MT-Bench and Arena pipelines.
stanford-crfm/helm — https://github.com/stanford-crfm/helm Holistic evaluation framework for cross-metric behavior testing.
Key Takeaways
- Sycophancy is a first-class reliability failure, not just a conversational style quirk.
- Behavioral consistency requires explicit stress tests: disagreement prompts, order swaps, and framing perturbations.
- Evaluation systems can inherit the same biases as models unless rigorously controlled.