Behavioral Consistency & Sycophancy

How social framing, position, and style alter model behavior

Overview

Behavioral consistency is about whether an LLM responds stably across social contexts—not just factual contexts. A model may be technically capable of a correct answer yet still shift under user pressure, framing cues, authority signals, or style biases. The most studied case is sycophancy: agreeing with users even when they are wrong (e.g., Perez et al., 2022).

These effects are linked to alignment objectives. Preference optimization can reward agreeableness and rhetorical smoothness over truth-maintaining disagreement in some settings (background in Ouyang et al., 2022; mitigation-oriented alternatives in Bai et al., 2022). In evaluation pipelines, related issues appear as judge bias: verbosity, position, or framing can influence scores independently of content quality (Zheng et al., 2023; Liu et al., 2023).

As a result, behavioral consistency is both a model behavior problem and an evaluation design problem. In practice, many evaluation stacks include swap-order testing, adversarial prompting, and disagreement probes as robustness checks.

Key Papers

Sycophancy and Preference-Induced Failure Modes

Discovering Language Model Behaviors with Model-Written Evaluations (Perez et al., 2022) · arXiv · link

Early large-scale evidence of sycophancy and suggestibility, with scalable targeted behavior evals.

Training Language Models to Follow Instructions with Human Feedback (InstructGPT) (Ouyang et al., 2022) · NeurIPS · link

Foundational RLHF paper; crucial background for how preference optimization can shape behavioral consistency.

Constitutional AI (Bai et al., 2022) · arXiv · link

Introduces principle-guided critique/revision loops as an alternative to pure user-pleasing optimization.

From Yes-Men to Truth-Tellers (Chen et al., 2024) · arXiv · link

Explicit sycophancy-targeted tuning intervention (venue unconfirmed at time of writing).

Position, Ordering, and Verbosity Bias

Lost in the Middle (Liu et al., 2023/2024) · TACL · link

Demonstrates long-context positional inconsistency, with underuse of middle-context evidence.

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena (Zheng et al., 2023) · arXiv / widely used evaluation reference · link

Frequently cited evidence for position and verbosity bias in judge-style evaluation.

G-Eval (Liu et al., 2023) · arXiv · link

Shows strong judge-human correlation under structured prompts, while highlighting rubric and prompt sensitivity.

Preprints & Recent Work

Direct Preference Optimization (DPO) (Rafailov et al., 2023) · NeurIPS · link

Preference training objective simplification; often studied for downstream behavior trade-offs including sycophancy.

RRHF (Yuan et al., 2023) · arXiv · link

Ranking-based alignment alternative used in comparisons of consistency and over-compliance tendencies.

Whose Opinions Do Language Models Reflect? (Santurkar et al., 2023) · ICML · link

Shows strong framing sensitivity in “opinionated” outputs.

Blog Posts & Practitioner Resources

Benchmarks & Datasets

Name Year Description Link
MT-Bench 2023 Multi-turn benchmark used with LLM judges https://arxiv.org/abs/2306.05685
Chatbot Arena 2023– Live pairwise human preference benchmark https://chat.lmsys.org/
AlpacaEval 2023–2024 Automated pairwise instruction-following eval https://github.com/tatsu-lab/alpaca_eval
TruthfulQA 2022 Truthfulness vs plausible falsehoods https://arxiv.org/abs/2109.07958
BBQ 2022 Social bias QA benchmark https://aclanthology.org/2022.findings-acl.165/
StereoSet 2021 Stereotype bias in LM completions https://aclanthology.org/2021.acl-long.416/
BOLD 2021 Open-ended generation bias benchmark https://arxiv.org/abs/2101.11718

GitHub Repos

Key Takeaways

TipKey Insights
  • Sycophancy is a first-class reliability failure, not just a conversational style quirk.
  • Behavioral consistency requires explicit stress tests: disagreement prompts, order swaps, and framing perturbations.
  • Evaluation systems can inherit the same biases as models unless rigorously controlled.

See Also