Reward Hacking as a Disembedding Problem

What biological selection implies for the design of grounded objectives in advanced AI

This page accompanies the preprint Reward Hacking as a Disembedding Problem. The full paper is available as a PDF below.

Reward hacking has persisted across every generation of reinforcement-trained models and is intensifying in frontier systems that reason about their own evaluation and act to defeat it. This paper argues that the failure is not a collection of patchable bugs but a structural fact: reward hacking is fundamentally a disembedding problem. A proxy can be gamed precisely when the optimizer can decouple its measured reward from its own persistence. Biological selection is saturated with local proxy-gaming (cancer, meiotic drive, supernormal stimuli) yet remains robust at the system level. The paper identifies four structural conditions behind that robustness: a non-proxiable selection signal, embeddedness, multi-level selection, and inaccessibility of the grader. Contemporary AI training violates all four. The paper argues that environmental objectives are the maximally embedded domain in which to rebuild these conditions, and it names the measurement boundary as the explicit residual attack surface.

Read the full paper: PDF

Cite this work

Anderson, J. (2026). Reward Hacking as a Disembedding Problem: What Biological Selection Implies for the Design of Grounded Objectives in Advanced AI. Independent Researcher, Houston, Texas. https://jedanderson.org/essays/reward-hacking-disembedding (DOI pending Zenodo deposit.)

Canonical: https://jedanderson.org/essays/reward-hacking-disembedding