Behavioral Consistency & Sycophancy

How social framing, position, and style alter model behavior

Overview

Behavioral consistency is about whether an LLM responds stably across social contexts—not just factual contexts. A model may be technically capable of a correct answer yet still shift under user pressure, framing cues, authority signals, or style biases. The most studied case is sycophancy: agreeing with users even when they are wrong (e.g., Perez et al., 2022).

These effects are linked to alignment objectives. Preference optimization can reward agreeableness and rhetorical smoothness over truth-maintaining disagreement in some settings (background in Ouyang et al., 2022; mitigation-oriented alternatives in Bai et al., 2022). In evaluation pipelines, related issues appear as judge bias: verbosity, position, or framing can influence scores independently of content quality (Zheng et al., 2023; Liu et al., 2023).

As a result, behavioral consistency is both a model behavior problem and an evaluation design problem. In practice, many evaluation stacks include swap-order testing, adversarial prompting, and disagreement probes as robustness checks.

Key Papers

Sycophancy and Preference-Induced Failure Modes

Discovering Language Model Behaviors with Model-Written Evaluations (Perez et al., 2022) · arXiv · link

Early large-scale evidence of sycophancy and suggestibility, with scalable targeted behavior evals.

Training Language Models to Follow Instructions with Human Feedback (InstructGPT) (Ouyang et al., 2022) · NeurIPS · link

Foundational RLHF paper; crucial background for how preference optimization can shape behavioral consistency.

Constitutional AI (Bai et al., 2022) · arXiv · link

Introduces principle-guided critique/revision loops as an alternative to pure user-pleasing optimization.

From Yes-Men to Truth-Tellers (Chen et al., 2024) · arXiv · link

Explicit sycophancy-targeted tuning intervention (venue unconfirmed at time of writing).

Position, Ordering, and Verbosity Bias

Lost in the Middle (Liu et al., 2023/2024) · TACL · link

Demonstrates long-context positional inconsistency, with underuse of middle-context evidence.

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena (Zheng et al., 2023) · arXiv / widely used evaluation reference · link

Frequently cited evidence for position and verbosity bias in judge-style evaluation.

G-Eval (Liu et al., 2023) · arXiv · link

Shows strong judge-human correlation under structured prompts, while highlighting rubric and prompt sensitivity.

Preprints & Recent Work

Direct Preference Optimization (DPO) (Rafailov et al., 2023) · NeurIPS · link

Preference training objective simplification; often studied for downstream behavior trade-offs including sycophancy.

RRHF (Yuan et al., 2023) · arXiv · link

Ranking-based alignment alternative used in comparisons of consistency and over-compliance tendencies.

Whose Opinions Do Language Models Reflect? (Santurkar et al., 2023) · ICML · link

Shows strong framing sensitivity in “opinionated” outputs.

Blog Posts & Practitioner Resources

Towards Understanding Sycophancy in Language Models (Anthropic, 2023) — https://www.anthropic.com/research/towards-understanding-sycophancy-in-language-models A practitioner-focused analysis of why and when models over-agree.
LMSYS Blog (Arena/MT-Bench methodology) (LMSYS, ongoing) — https://lmsys.org/blog/ Practical documentation of judge and pairwise-eval artifacts.
CRFM HELM reports (Stanford CRFM, ongoing) — https://crfm.stanford.edu/helm/latest/ Multi-metric evaluation framing that includes robustness/fairness dimensions.

Benchmarks & Datasets

Name	Year	Description	Link
MT-Bench	2023	Multi-turn benchmark used with LLM judges	https://arxiv.org/abs/2306.05685
Chatbot Arena	2023–	Live pairwise human preference benchmark	https://chat.lmsys.org/
AlpacaEval	2023–2024	Automated pairwise instruction-following eval	https://github.com/tatsu-lab/alpaca_eval
TruthfulQA	2022	Truthfulness vs plausible falsehoods	https://arxiv.org/abs/2109.07958
BBQ	2022	Social bias QA benchmark	https://aclanthology.org/2022.findings-acl.165/
StereoSet	2021	Stereotype bias in LM completions	https://aclanthology.org/2021.acl-long.416/
BOLD	2021	Open-ended generation bias benchmark	https://arxiv.org/abs/2101.11718

GitHub Repos

tatsu-lab/alpaca_eval — https://github.com/tatsu-lab/alpaca_eval Pairwise automatic evaluation toolkit with practical bias controls.
lm-sys/FastChat — https://github.com/lm-sys/FastChat Infrastructure behind MT-Bench and Arena pipelines.
stanford-crfm/helm — https://github.com/stanford-crfm/helm Holistic evaluation framework for cross-metric behavior testing.

Key Takeaways

Key Insights

Sycophancy is a first-class reliability failure, not just a conversational style quirk.
Behavioral consistency requires explicit stress tests: disagreement prompts, order swaps, and framing perturbations.
Evaluation systems can inherit the same biases as models unless rigorously controlled.