Replicating Published Sycophancy Benchmarks
Abstract forthcoming. A replication audit of published LLM sycophancy benchmarks. Most sycophancy results report uncertainty from a single generation run, which can make an effect look more stable than it is. This work re-runs each published benchmark over K independent generation runs (K-run replication) to separate findings that survive across-run uncertainty from artifacts of single-run measurement.
Methodology: K-run (across-run) replication of published benchmarks. PARROT serves as a positive control: its follow-rate findings reproduce cleanly, with model and per-domain rankings stable across runs (Spearman rank correlation 0.95–0.99). SycEval is the primary target: its published confidence intervals are within-run binomial intervals only, computed from a single generation per item, so they likely understate the true uncertainty.
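The three quantities named in the methodology can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the project's actual pipeline: the function names (`within_run_binomial`, `across_run_spread`, `spearman_no_ties`) are hypothetical, the interval is a plain Wald interval (an assumption; the published benchmarks may use a different binomial interval), the Spearman helper assumes no tied values, and the numbers in the usage section are made up for illustration and are not results.

```python
import math
import statistics


def within_run_binomial(successes, n, z=1.96):
    """Wald interval from ONE generation per item.

    Returns (rate, std_error, low, high). Hypothetical helper.
    """
    p = successes / n
    se = math.sqrt(p * (1 - p) / n)
    return p, se, p - z * se, p + z * se


def across_run_spread(run_rates):
    """Mean and sample standard deviation of per-run rates over K runs."""
    if len(run_rates) < 2:
        raise ValueError("need at least two runs to estimate spread")
    return statistics.mean(run_rates), statistics.stdev(run_rates)


def spearman_no_ties(xs, ys):
    """Spearman rank correlation via the closed form.

    Assumes no tied values in xs or in ys (an assumption of this sketch).
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("inputs must have equal length of at least 2")

    def ranks(values):
        # Rank 1 goes to the smallest value.
        order = sorted(range(n), key=lambda i: values[i])
        result = [0] * n
        for rank, idx in enumerate(order, start=1):
            result[idx] = rank
        return result

    rx, ry = ranks(xs), ranks(ys)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))


if __name__ == "__main__":
    # Illustrative numbers only (not measured results).
    rate, se, low, high = within_run_binomial(successes=120, n=400)
    print(f"single run: rate={rate:.3f}, binomial SE={se:.4f}, "
          f"95% CI=({low:.3f}, {high:.3f})")

    # Per-run rates for the same items over K = 5 hypothetical runs.
    mean, sd = across_run_spread([0.30, 0.27, 0.34, 0.29, 0.33])
    print(f"K runs: mean rate={mean:.3f}, across-run SD={sd:.4f}")

    # Ranking stability: per-model scores from two hypothetical runs.
    run_a = [0.41, 0.22, 0.35, 0.18]
    run_b = [0.39, 0.25, 0.33, 0.16]
    print(f"Spearman between runs: {spearman_no_ties(run_a, run_b):.2f}")
```

Comparing the binomial standard error from one run against the across-run standard deviation of the per-run rates shows whether a single-run interval is a fair summary of how much the estimate moves between runs. The sketch only computes both numbers; which one is larger depends on the benchmark and the decoding setup.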
Stack: Python · HuggingFace · vLLM · multi-provider inference (Together, DeepInfra, OpenAI, Anthropic) · W&B · Docker
Target venues: arXiv preprint · SafeAI@AAAI · SoLaR · NeurIPS workshop
Preprint expected late August 2026.