The Policy Cliff: A Theoretical Analysis of Reward-Policy Maps in Large Language Models

Shanghai Artificial Intelligence Laboratory
2025-07-27 arXiv:2507.20150

Abstract

Reinforcement learning (RL) plays a crucial role in shaping the behavior of large language and reasoning models (LLMs/LRMs). However, it often produces brittle and unstable policies, leading to critical failures such as spurious reasoning, deceptive alignment, and instruction disobedience that undermine the trustworthiness and safety of LLMs/LRMs. Currently, these issues lack a unified theoretical explanation and are typically addressed using ad-hoc heuristics. This paper presents a rigorous mathematical framework for analyzing the stability of the mapping from a reward function to the optimal policy. We show that policy brittleness often stems from non-unique optimal actions, a common occurrence when multiple valid traces exist in a reasoning task. This theoretical lens provides a unified explanation for a range of seemingly disparate failures, reframing them as rational outcomes of optimizing rewards that may be incomplete or noisy, especially in the presence of action degeneracy. We extend this analysis from the fundamental single-reward setting to the more realistic multi-reward RL across diverse domains, showing how stability is governed by an "effective reward" aggregation mechanism. We also prove that entropy regularization restores policy stability at the cost of increased stochasticity. Our framework provides a unified explanation for recent empirical findings on deceptive reasoning, instruction-following trade-offs, and RLHF-induced sophistry, and is further validated through perturbation experiments in multi-reward RL. This work advances policy-stability analysis from empirical heuristics towards a principled theory, offering essential insights for designing safer and more trustworthy AI systems.

Overview

This paper develops a rigorous mathematical framework for analyzing why reinforcement learning often produces brittle and unstable policies in large language models. The core contribution is a proof that the reward-policy map is continuous when the optimal action is unique at every state, but becomes discontinuous (exhibiting "policy cliffs") under action degeneracy, i.e., when multiple actions are simultaneously optimal. This theoretical insight provides a unified explanation for various LLM alignment failures, including spurious reasoning, deceptive alignment, instruction disobedience, and RLHF-induced sophistry. The framework extends to multi-reward RL settings and shows that entropy regularization restores policy stability at the cost of added stochasticity. The work bridges theoretical analysis with empirical observations, offering principled guidance for designing safer AI systems.
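
To make the core claim concrete, here is a minimal toy sketch (our illustration, not code from the paper) of a single state with two tied actions, e.g., two equally valid reasoning traces. The greedy (argmax) policy flips completely under an infinitesimal reward perturbation, while an entropy-regularized policy, assumed here to take the standard maximum-entropy form of a softmax over rewards at temperature tau, barely moves.

    import numpy as np

    def greedy_policy(r):
        # Deterministic optimal policy: all probability mass on one argmax action.
        p = np.zeros_like(r)
        p[np.argmax(r)] = 1.0
        return p

    def entropy_regularized_policy(r, tau=0.1):
        # Standard max-entropy form: softmax of rewards at temperature tau.
        z = (r - r.max()) / tau      # subtract max for numerical stability
        e = np.exp(z)
        return e / e.sum()

    # Two actions with tied (degenerate) rewards.
    r = np.array([1.0, 1.0])
    eps = 1e-6                        # tiny reward perturbation

    for dr in (np.array([eps, 0.0]), np.array([0.0, eps])):
        print("greedy :", greedy_policy(r + dr))                 # jumps between [1,0] and [0,1]
        print("softmax:", entropy_regularized_policy(r + dr))    # stays near [0.5, 0.5]

The greedy policy is discontinuous at the tie, whereas the softmax response varies smoothly with the reward; this is the stability-versus-stochasticity trade-off the paper formalizes.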

Key Contributions

  • Theoretical Framework: Formal analysis of the reward-policy map showing that policy instability stems from non-unique optimal actions
  • Unified Explanation: Mathematical lens connecting spurious reasoning, deceptive alignment, instruction-following failures, and RLHF sophistry
  • Multi-Reward Analysis: Extension to realistic multi-reward RL with an "effective reward" aggregation mechanism (illustrated in the sketch after this list)
  • Entropy Regularization: Proof that entropy regularization restores Lipschitz continuity to the reward-policy map

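The multi-reward claim can also be illustrated with a toy example (again an assumption-laden sketch, not the paper's construction): component rewards, say a helpfulness scorer and a safety scorer, are aggregated into one effective reward by a weighted sum, and when actions tie under that effective reward, an arbitrarily small perturbation of a single component resolves the tie and discontinuously changes the selected action.

    import numpy as np

    # Component rewards for 3 candidate responses (rows: reward sources, cols: actions).
    # Values are made up for illustration.
    R = np.array([
        [1.0, 0.5, 0.8],   # e.g. "helpfulness" reward
        [0.2, 0.7, 0.4],   # e.g. "safety" reward
    ])
    w = np.array([0.5, 0.5])   # aggregation weights

    def effective_reward(R, w):
        # Linear aggregation of component rewards into a single effective reward.
        return w @ R

    def greedy(r):
        return int(np.argmax(r))

    print(effective_reward(R, w))          # [0.6, 0.6, 0.6] -> three-way tie (degeneracy)
    print(greedy(effective_reward(R, w)))  # depends on arbitrary tie-breaking

    # A tiny perturbation of one component reward resolves the tie differently
    # and flips which action the greedy policy selects.
    for j in range(R.shape[1]):
        R_eps = R.copy()
        R_eps[1, j] += 1e-6
        print(j, greedy(effective_reward(R_eps, w)))   # selected action jumps with j

This mirrors the paper's claim that, in the multi-reward setting, stability is governed by the aggregated effective reward rather than by the component rewards viewed in isolation.
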
BibTeX

@article{xu2025policy,
  title={The policy cliff: A theoretical analysis of reward-policy maps in large language models},
  author={Xu, Xingcheng},
  journal={arXiv preprint arXiv:2507.20150},
  year={2025}
}