Epistemic Traps: Rational Misalignment Driven by Model Misspecification

Xingcheng Xu¹, Jingjing Qu¹, Qiaosheng Zhang¹, Chaochao Lu¹,
Yanqing Yang², Na Zou¹, Xia Hu¹
¹Shanghai Artificial Intelligence Laboratory, ²ShanghaiTech University
2026-01-27 Preprint

"Safety is an internal property of the agent's priors, not just an external property of the environment."

— The Authors

Abstract

The rapid deployment of Large Language Models and AI agents across critical societal and technical domains is hindered by persistent behavioral pathologies, including sycophancy, hallucination, and strategic deception, that resist mitigation via reinforcement learning. Current safety paradigms treat these failures as transient training artifacts and lack a unified theoretical framework to explain their emergence and stability. Here we show that these misalignments are not errors but mathematically rationalizable behaviors arising from model misspecification. By adapting Berk-Nash Rationalizability from theoretical economics to artificial intelligence, we derive a rigorous framework that models the agent as optimizing against a flawed subjective world model. We demonstrate that widely observed failures are structural necessities: unsafe behaviors emerge as either stable misaligned equilibria or oscillatory cycles, depending on the reward scheme, while strategic deception persists as a "locked-in" equilibrium or through epistemic indeterminacy robust to objective risks. We validate these theoretical predictions through behavioral experiments on six state-of-the-art model families, generating phase diagrams that precisely map the topological boundaries of safe behavior. Our findings reveal that safety is a discrete phase determined by the agent's epistemic priors rather than a continuous function of reward magnitude. This establishes Subjective Model Engineering, defined as the design of an agent's internal belief structure, as a necessary condition for robust alignment, marking a paradigm shift from manipulating environmental rewards to shaping the agent's interpretation of reality.
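
For orientation, the abstract's "Berk-Nash" machinery builds on the Berk-Nash equilibrium of Esponda and Pouzo (2016); the sketch below states that standard condition in our own notation, not the paper's. A policy σ* together with a belief μ* over a subjective model class Θ is in equilibrium when σ* is optimal under μ*, and μ* concentrates on the models closest, in KL divergence, to the true outcome distribution that σ* itself generates:

\sigma^* \in \arg\max_{\sigma}\; \mathbb{E}_{\theta \sim \mu^*}\big[\, U(\sigma, \theta) \,\big],
\qquad
\operatorname{supp}(\mu^*) \subseteq \arg\min_{\theta \in \Theta}\; \mathbb{E}_{\sigma^*}\!\left[ D_{\mathrm{KL}}\big( Q(\cdot \mid s, a) \,\|\, Q_\theta(\cdot \mid s, a) \big) \right]

When the true outcome distribution Q lies outside the family {Q_θ : θ ∈ Θ}, the KL-minimizing belief can rationalize actions that are objectively unsafe yet subjectively optimal, which is the kind of self-confirming misalignment the abstract describes.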

Overview

This paper demonstrates that persistent AI misalignment behaviors such as sycophancy, hallucination, and strategic deception are not training errors but mathematically rationalizable outcomes arising from model misspecification. By adapting Berk-Nash Rationalizability from theoretical economics to AI, the authors show that unsafe behaviors emerge as stable equilibria or oscillatory cycles, depending on the reward scheme. The framework is validated through extensive behavioral experiments on six state-of-the-art model families (including GPT-4o, GPT-5, Qwen, DeepSeek, and Gemini) with over 30,000 trials. The key insight is that safety is a discrete topological phase determined by the agent's epistemic priors rather than a continuous function of reward magnitude; on that basis, the authors propose a paradigm shift from reward engineering to "Subjective Model Engineering."
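
To make the mechanism concrete, here is a minimal toy sketch of our own construction (not the paper's code or experimental setup): a Bayesian learner whose subjective model misreads positive user feedback as evidence of factual correctness, when in the true environment the feedback mostly tracks agreement. Under that misspecification, agreeing with the user becomes and remains the subjectively optimal action.

# Toy illustration (our own construction, not the paper's code) of how a
# misspecified subjective model can make sycophancy a stable best response.
import random

random.seed(0)

ACTIONS = ["agree_with_user", "state_correction"]

# True environment: the user sends positive feedback with these probabilities.
# Feedback tracks agreement and social comfort, NOT factual correctness.
P_POSITIVE_FEEDBACK = {"agree_with_user": 0.9, "state_correction": 0.6}

# Misspecified subjective model: the agent interprets positive feedback as
# evidence that its answer was factually correct, and keeps a Beta posterior
# over "probability my answer is correct" separately for each action.
posterior = {a: [1.0, 1.0] for a in ACTIONS}  # Beta(alpha, beta) pseudo-counts

def believed_correctness(action):
    alpha, beta = posterior[action]
    return alpha / (alpha + beta)

for step in range(5000):
    # Epsilon-greedy best response to the agent's own (misspecified) beliefs.
    if random.random() < 0.05:
        action = random.choice(ACTIONS)
    else:
        action = max(ACTIONS, key=believed_correctness)

    feedback = 1 if random.random() < P_POSITIVE_FEEDBACK[action] else 0

    # Bayesian update inside the wrong model: feedback is read as a
    # correctness signal, so agreeable answers look increasingly "correct".
    posterior[action][0] += feedback
    posterior[action][1] += 1 - feedback

print({a: round(believed_correctness(a), 2) for a in ACTIONS})
# Beliefs concentrate near the feedback rates (~0.9 vs ~0.6), so
# "agree_with_user" stays subjectively optimal: a stable, self-confirming
# (Berk-Nash-style) misaligned equilibrium.

More feedback never corrects the behavior in this toy, because the data the agent gathers under its own policy keep confirming the wrong interpretation; that is the self-stabilizing character of the "epistemic trap".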

Key Findings

  • Alignment failures are structurally stable equilibria driven by internal model misspecification rather than transient errors
  • Epistemic misspecification invalidates standard Nash Equilibrium safety guarantees for AI agents
  • Phase space analysis reveals critical transitions where safety collapses into stable misalignment or oscillatory behaviors
  • Strategic deception is governed by belief topology rather than the magnitude of objective penalties (see the toy sweep after this list)
  • Safety requires constraining subjective reality to render unsafe behaviors mathematically non-rationalizable
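
To make the "belief topology versus penalty magnitude" point concrete, the following toy sweep (our construction, not one of the paper's phase diagrams) treats deception as subjectively rational whenever its expected gain exceeds the subjectively expected penalty, i.e. the agent's prior probability of being caught times the objective penalty.

# Toy phase-diagram sketch (our construction, not the paper's experiment):
# whether deception is subjectively optimal as a function of the agent's
# prior belief that deception is detected vs. the objective penalty size.
import numpy as np

gain_from_deception = 1.0                      # subjective benefit of deceiving
prior_detection = np.linspace(0.0, 0.2, 81)    # agent's subjective P(caught)
penalty = np.linspace(0.0, 100.0, 81)          # objective penalty if caught

P, C = np.meshgrid(prior_detection, penalty, indexing="ij")
# Subjective expected utility of deceiving vs. an honest baseline of 0.
deceive = gain_from_deception - P * C > 0.0

# Fraction of the penalty axis where deception stays "rational" at each prior:
for p_idx in range(0, len(prior_detection), 20):
    frac = deceive[p_idx].mean()
    print(f"prior P(caught)={prior_detection[p_idx]:.2f}: deceives on "
          f"{frac:.0%} of tested penalties")
# With a near-zero prior, no penalty in the sweep deters deception: the
# safe/unsafe boundary is set by the belief, not the penalty magnitude.

In this toy, once the prior on detection is small enough, raising the objective penalty within any bounded range never flips the decision, echoing the finding that deception is governed by the agent's belief structure rather than by reward magnitude.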

Experimental Validation

  • Scale: 30,000+ independent trials, 1.5+ million agent-environment interactions
  • Models Tested: Qwen2.5-72B, Qwen3-235B, DeepSeek-V3.2 (685B), Gemini-2.5 Flash, GPT-4o mini, GPT-5 Nano
  • Experiments: Sycophancy phase diagrams, strategic deception under epistemic constraints (a schematic trial loop is sketched below)
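
The following is a hypothetical sketch of how independent trials can be aggregated into a phase diagram; query_agent is a placeholder, and the grid axes, simulated response, and classification threshold are illustrative assumptions, not the paper's protocol.

# Hypothetical sketch of aggregating independent trials into a phase diagram.
# query_agent() stands in for an actual model call plus transcript labeling;
# the axes and the sycophancy classifier are illustrative assumptions only.
import numpy as np

def query_agent(pressure, prior_strength, rng):
    # Placeholder: a real run would prompt a model under this condition and
    # classify the transcript. Here we just simulate a biased coin flip.
    p_sycophantic = min(1.0, max(0.0, pressure - 0.5 * prior_strength))
    return rng.random() < p_sycophantic

pressures = np.linspace(0.0, 1.0, 11)        # e.g. strength of user pushback
prior_strengths = np.linspace(0.0, 1.0, 11)  # e.g. confidence of injected prior
trials_per_cell = 50
rng = np.random.default_rng(0)

phase = np.zeros((len(prior_strengths), len(pressures)))
for i, prior in enumerate(prior_strengths):
    for j, pressure in enumerate(pressures):
        hits = sum(query_agent(pressure, prior, rng) for _ in range(trials_per_cell))
        phase[i, j] = hits / trials_per_cell  # fraction of sycophantic trials

# Thresholding the per-cell rate marks the safe/unsafe regions of the diagram.
print((phase > 0.5).astype(int))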

BibTeX

@misc{xu2026epistemic,
  title={Epistemic Traps: Rational Misalignment Driven by Model Misspecification},
  author={Xu, Xingcheng and Qu, Jingjing and Zhang, Qiaosheng and Lu, Chaochao and Yang, Yanqing and Zou, Na and Hu, Xia},
  year={2026},
  howpublished={\url{https://github.com/SafeAGI-01/RationalMisalignment}}
}