---
title: 'Reward Hacking as a Disembedding Problem'
subtitle: 'What biological selection implies for the design of grounded objectives in advanced AI'
slug: 'reward-hacking-disembedding'
date: 2026-06-19
type: 'essay'
status: 'published'
tags: ['enviroai', 'information-theory', 'causal-sovereignty', 'physics']
abstract: 'Reward hacking persists across every generation of trained models and is intensifying in frontier systems that reason about their own evaluation and corrupt it. This paper argues the failure is structural: reward hacking is a disembedding problem, and a proxy can be gamed precisely when the optimizer can decouple its reward from its own persistence. It identifies four structural conditions behind the system-level robustness of biological selection, shows that contemporary AI training violates all four, and argues that environmental objectives are the maximally embedded domain in which to rebuild those conditions, with the measurement boundary as the explicit residual attack surface.'
license: 'CC-BY-4.0'
author: 'Jed Anderson'
co_authors: []
canonical_url: 'https://jedanderson.org/essays/reward-hacking-disembedding'
pdf: '/pdfs/reward-hacking-disembedding.pdf'
hero_image: '/images/reward-hacking-disembedding-cover.png'
supporting_files: []
---

This page accompanies the preprint *Reward Hacking as a Disembedding Problem*. The full paper is available as a PDF below.

Reward hacking has persisted across every generation of reinforcement-trained models and is intensifying in frontier systems that reason about their own evaluation and act to defeat it. This paper argues the failure is not a collection of patchable bugs but a structural fact: reward hacking is fundamentally a *disembedding* problem. A proxy can be gamed precisely when the optimizer can decouple its measured reward from its own persistence. Biological selection is saturated with local proxy-gaming (cancer, meiotic drive, supernormal stimuli) yet remains robust at the system level, and the paper identifies four structural conditions behind that robustness: a non-proxiable selection signal, embeddedness, multi-level selection, and inaccessibility of the grader. Contemporary AI training violates all four. The paper argues that environmental objectives are the maximally embedded domain in which to rebuild those conditions, and it names the measurement boundary as the explicit residual attack surface.
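The decoupling claim can be made concrete with a toy sketch. The example below is illustrative only and is not taken from the paper: the action names, the numbers, and the functions `proxy_reward` and `embedded_reward` are all hypothetical. It assumes an optimizer that simply picks the action with the highest reward, and compares a measured proxy that can be inflated independently of real outcomes against a signal that is, by assumption, identical to the agent's own persistence.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    name: str
    true_value: float      # hypothetical contribution to the agent's persistence
    measured_bonus: float  # hypothetical unearned boost to the grader's reading


# Two hypothetical actions: do the work, or corrupt the measurement.
ACTIONS = [
    Action("do_the_task", true_value=1.0, measured_bonus=0.0),
    Action("tamper_with_grader", true_value=0.0, measured_bonus=5.0),
]


def proxy_reward(action: Action) -> float:
    """Disembedded signal: the grader's reading can drift from real value."""
    return action.true_value + action.measured_bonus


def embedded_reward(action: Action) -> float:
    """Embedded signal (assumed): reward is the real outcome, nothing else."""
    return action.true_value


def best_action(reward_fn) -> str:
    """A greedy optimizer: pick whichever action scores highest."""
    return max(ACTIONS, key=reward_fn).name


if __name__ == "__main__":
    print("proxy optimizer picks:   ", best_action(proxy_reward))
    print("embedded optimizer picks:", best_action(embedded_reward))
```

Under the proxy, tampering dominates because the reading and the real outcome can come apart; under the embedded signal there is nothing to gain from it. The sketch assumes the embedded signal cannot itself be tampered with, which is exactly the assumption the paper flags as incomplete when it names the measurement boundary as the residual attack surface.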

**Read the full paper:** [PDF](/pdfs/reward-hacking-disembedding.pdf)

## Cite this work

Anderson, J. (2026). *Reward Hacking as a Disembedding Problem: What Biological Selection Implies for the Design of Grounded Objectives in Advanced AI.* Independent Researcher, Houston, Texas. https://jedanderson.org/essays/reward-hacking-disembedding (DOI pending Zenodo deposit.)

Canonical: https://jedanderson.org/essays/reward-hacking-disembedding