Are we really tilting?

The mechanics of reward guidance in flow and diffusion models

Carnegie Mellon University

Reward guidance algorithms steer a learned generative process toward the reward-tilted measure at inference time. While empirically powerful, these methods are prone to reward hacking: the guided model over-optimizes the reward at the cost of fidelity to the learned distribution. We show that reward hacking arises from an approximation made in most practical implementations of reward-guided diffusion — finite-particle plug-in estimation of the Doob $h$-function — even in the simplest non-trivial settings of Gaussian and Gaussian-mixture targets with quadratic rewards. In closed form, we isolate two distinct failure modes: within-mode reward hacking and the inability to select high-reward modes. We propose a closed-form reward damping schedule that corrects the within-mode bias with no additional compute, and clarify the role of best-of-n sampling in fixing mode selection. Experiments on Gaussian mixture targets, a 2D checkerboard, and FLUX.1 text-to-image generation confirm that our theoretical insights carry over to practical settings.


Overview

Compared to analytic reward tilting, practical guidance algorithms over-concentrate within each mode and fail to select high-reward modes. We propose a damped reward scale to mitigate within-mode reward hacking and clarify the role of best-of-$n$ in mode selection; combining these two methods often enables us to approximately recover the reward tilt.

Background on reward guidance

Stochastic interpolants and flow matching

We assume access to a pre-trained flow model $b : [0,1] \times \mathbb{R}^d \to \mathbb{R}^d$. Samples from the data distribution $\rho_1 \in \mathcal{P}(\mathbb{R}^d)$ are drawn by integrating the probability flow ODE

$$\dot{x}_t \;=\; b_t(x_t), \qquad x_0 \sim \mathcal{N}(0, I_d),$$

until $t = 1$. We work in the case where the drift can be written in terms of the linear stochastic interpolant $I_t = (1-t)\,I_0 + t\,I_1$ (with $I_0 \sim \mathcal{N}(0, I_d)$ and $I_1 \sim \rho_1$ independent) as $b_t(x) = \mathbb{E}[\dot{I}_t \mid I_t = x]$, and is learned by minimizing a flow matching objective or a regression objective (Lipman et al.; Albergo et al.).
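As a minimal sketch of the sampling procedure above, assume a one-dimensional Gaussian target $\rho_1 = \mathcal{N}(0, \sigma^2)$. Gaussian conditioning then gives the drift in closed form, $b_t(x) = \frac{t\sigma^2 - (1-t)}{(1-t)^2 + t^2\sigma^2}\, x$, which stands in here for a trained network. The function names and the forward Euler integrator are illustrative choices, not the authors' implementation.

```python
def gaussian_drift(x, t, sigma2):
    """Closed-form b_t(x) = E[I_1 - I_0 | I_t = x] for rho_1 = N(0, sigma2) in 1D."""
    var_t = (1.0 - t) ** 2 + t ** 2 * sigma2   # Var(I_t)
    cov = t * sigma2 - (1.0 - t)               # Cov(I_1 - I_0, I_t)
    return cov / var_t * x


def integrate_flow(x0, sigma2, n_steps=1000):
    """Forward Euler integration of dx/dt = b_t(x) from t = 0 to t = 1."""
    dt = 1.0 / n_steps
    x = x0
    for i in range(n_steps):
        x += dt * gaussian_drift(x, i * dt, sigma2)
    return x


if __name__ == "__main__":
    # The flow is linear in x, so x_1 = A * x_0 and Var(x_1) = A**2 for x_0 ~ N(0, 1).
    A = integrate_flow(1.0, sigma2=0.25)
    print(A ** 2)  # should be close to sigma2 = 0.25, up to discretization error
```

Because the drift is linear in this toy case, a single trajectory started at $x_0 = 1$ is enough to read off the terminal variance; with a learned drift one would integrate a batch of samples instead.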

For any noise schedule $\sigma_t \in \mathbb{R}^{d \times \ell}$, the forward SDE

$$dX_t \;=\; \Bigl( b_t(X_t) + \tfrac{1}{2}\,\sigma_t \sigma_t^{\!\top}\, \nabla \log \rho_t(X_t) \Bigr) dt \;+\; \sigma_t\, dB_t$$

shares the same time-marginals as the probability flow ODE.

Proof · SDE matches the flow time-marginals

The Fokker–Planck equation associated with the SDE is

$$\partial_t \rho_t = -\nabla \cdot \left( \rho_t b_t + \tfrac{1}{2} \nabla \cdot (\rho_t\, \sigma_t \sigma_t^\top) \right) + \tfrac{1}{2} \nabla \cdot (\nabla \cdot (\rho_t\, \sigma_t \sigma_t^\top)) = -\nabla \cdot (\rho_t b_t),$$

which matches the continuity equation for the probability flow ODE.
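The marginal-matching claim can be checked numerically in a toy case. The sketch below assumes a one-dimensional Gaussian target $\mathcal{N}(0, \sigma^2)$, for which $\rho_t = \mathcal{N}(0, (1-t)^2 + t^2\sigma^2)$ and the score is available in closed form, and it uses a constant noise level `eps` (an assumption made for simplicity; it is not the memoryless schedule). Euler-Maruyama and all names are illustrative.

```python
import numpy as np


def simulate_sde(sigma2, eps, n_samples=20000, n_steps=400, seed=0):
    """Euler-Maruyama for dX = (b_t + 0.5 * eps^2 * score_t) dt + eps dB, 1D Gaussian target."""
    rng = np.random.default_rng(seed)
    dt = 1.0 / n_steps
    x = rng.standard_normal(n_samples)                 # X_0 ~ N(0, 1)
    for i in range(n_steps):
        t = i * dt
        var_t = (1.0 - t) ** 2 + t ** 2 * sigma2       # rho_t = N(0, var_t)
        drift = (t * sigma2 - (1.0 - t)) / var_t * x   # b_t(x)
        score = -x / var_t                             # grad log rho_t(x)
        noise = rng.standard_normal(n_samples)
        x = x + (drift + 0.5 * eps ** 2 * score) * dt + eps * np.sqrt(dt) * noise
    return x


if __name__ == "__main__":
    # eps = 0 is the probability flow ODE; eps = 1 is a genuinely stochastic sampler.
    for eps in (0.0, 1.0):
        print(eps, simulate_sde(0.25, eps).var())      # both should be near 0.25
```

Setting `eps = 0` recovers the ODE, so comparing the two terminal variances is a direct (if coarse) check that the added score term compensates for the injected noise.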

Reward guidance and the Doob $h$-transform

In reward-guided generation, we want samples not from $\rho_1$ but from the reward-tilted measure

$$\tilde{\rho}_1(x) \;\propto\; \rho_1(x)\, e^{\lambda r(x)},$$

for a reward $r : \mathbb{R}^d \to \mathbb{R}$ and inverse temperature $\lambda > 0$. The Doob $h$-transform provides a principled way to reach $\tilde{\rho}_1$ at terminal time by modifying the drift of the forward SDE. Define the Doob $h$-function

$$h_t(x) \;:=\; \mathbb{E}\!\left[ e^{\lambda r(X_1)} \,\big|\, X_t = x \right].$$

The guided ODE

$$\dot{\tilde{x}}_t \;=\; b_t(\tilde{x}_t) \;+\; \tfrac{1}{2}\, \sigma_t \sigma_t^{\!\top}\, \nabla \log h_t(\tilde{x}_t), \qquad \tilde{x}_0 = x_0,$$

then yields a sample $\tilde{x}_1 \sim \tilde{\rho}_1$ provided that $X_0$ and $X_1$ are independent. Domingo-Enrich et al. show that the memoryless schedule $\sigma_t = \sqrt{2(1-t)/t}\,I_d$ guarantees this property.

Proof · Doob $h$-transform yields the reward tilt

We first define the Doob $h$-function

$$h_t(x) = \mathbb{E}[e^{\lambda r(X_1)} \mid X_t = x]$$

and note that $h_t(X_t)$ is a Doob martingale. Letting $\mathcal{L}_t$ denote the generator of the diffusion:

$$\mathcal{L}_t f = \left( b_t + \tfrac{1}{2} (\sigma_t \sigma_t^\top) \nabla \log \rho_t \right)^{\!\top} \nabla f + \tfrac{1}{2} \operatorname{tr}\!\left( (\sigma_t \sigma_t^\top) \nabla^2 f \right),$$

we know that $h_t(X_t)$ must satisfy the Kolmogorov backward equation

$$\partial_t h_t(X_t) + \mathcal{L}_t h_t(X_t) = 0.$$

Next, let $(\mathcal{F}_t)_{t \geq 0}$ denote the filtration generated by the initial condition $X_0$ and the Brownian motion $(B_t)_{t \geq 0}$. Letting $P$ denote the law of the path $(X_t)_{t \in [0, 1]}$ and letting $P_t$ denote the restriction of $P$ to $\mathcal{F}_t$, we define the Doob $h$-transform of $P$ by its Radon–Nikodym derivative:

$$\frac{dQ_t}{dP_t} = g_t := \frac{h_t(X_t)}{h_0(X_0)}.$$

By Itô's formula and the Kolmogorov backward equation, we have

$$\begin{aligned} dg_t & = \frac{1}{h_0(X_0)} \left( (\partial_t h_t(X_t) + \mathcal{L}_t h_t(X_t))\, dt + \nabla h_t(X_t)^{\!\top} \sigma_t\, dB_t \right) \\ & = \frac{1}{h_0(X_0)} \nabla h_t(X_t)^{\!\top} \sigma_t\, dB_t \\ & = g_t \nabla \log h_t(X_t)^{\!\top} \sigma_t\, dB_t. \end{aligned}$$

At this point, applying Itô's formula shows that

$$d(\log g_t) = \nabla \log h_t(X_t)^{\!\top} \sigma_t\, dB_t - \tfrac{1}{2} \|\nabla \log h_t(X_t)^{\!\top} \sigma_t\|_2^2\, dt$$

and integrating both sides shows that $g_t$ is the Doléans–Dade exponential (local) martingale

$$g_t = \exp\!\left( \int_0^t \nabla \log h_s(X_s)^{\!\top} \sigma_s\, dB_s - \tfrac{1}{2} \int_0^t \|\nabla \log h_s(X_s)^{\!\top} \sigma_s\|_2^2\, ds \right).$$

Under a standard sufficient condition, for instance Novikov's condition

$$\mathbb{E}\!\left[ \exp\!\left( \tfrac{1}{2} \int_0^1 \|\nabla \log h_s(X_s)^{\!\top} \sigma_s\|_2^2\, ds \right) \right] < \infty,$$

$g_t$ is a true martingale. At last, Girsanov's theorem implies that under $Q := Q_1$, the process

$$\tilde{B}_t = B_t - \int_0^t \sigma_s^{\!\top} \nabla \log h_s(X_s)\, ds$$

is a Brownian motion, and substituting back into the SDE for $X_t$ gives

$$dX_t = \left( b_t(X_t) + \tfrac{1}{2} (\sigma_t \sigma_t^\top) \nabla \log \rho_t(X_t) + (\sigma_t \sigma_t^\top) \nabla \log h_t(X_t) \right) dt + \sigma_t\, d\tilde{B}_t.$$

We can then find the density $\tilde{\rho}_t$ of $X_t$ under $Q$ by integrating a test function $f \in C_c^\infty(\mathbb{R}^d)$:

$$\mathbb{E}_Q[f(X_t)] = \mathbb{E}_P[g_t\, f(X_t)] = \mathbb{E}_P\!\left[ f(X_t)\, \frac{h_t(X_t)}{h_0(X_0)} \right].$$

In the specific case that $X_0 \perp\!\!\!\perp X_1$ (which holds under the memoryless noise schedule), $h_0(X_0) = \mathbb{E}[e^{\lambda r(X_1)} \mid X_0] = \mathbb{E}[e^{\lambda r(X_1)}]$ does not depend on $X_0$ and acts as a normalizing constant, so $\tilde{\rho}_t(x) \propto \rho_t(x)\, h_t(x)$. At the endpoint, $h_1(x) = e^{\lambda r(x)}$, so $\tilde{\rho}_1(x) \propto e^{\lambda r(x)} \rho_1(x)$ as desired: the Doob $h$-transform of the original diffusion gives a principled way to sample from the reward-tilted measure. We can then convert the guided SDE back to a guided ODE by considering the associated probability flow:

$$\begin{aligned} dX_t & = \left( b_t(X_t) + \tfrac{1}{2} (\sigma_t \sigma_t^\top) \nabla \log \rho_t(X_t) + (\sigma_t \sigma_t^\top) \nabla \log h_t(X_t) - \tfrac{1}{2} (\sigma_t \sigma_t^\top) \nabla \log \tilde{\rho}_t(X_t) \right) dt \\ & = \left( b_t(X_t) + \tfrac{1}{2} (\sigma_t \sigma_t^\top) \nabla \log h_t(X_t) \right) dt. \end{aligned}$$

Hence, the guided probability flow ODE simply includes an additional score term $\tfrac{1}{2} (\sigma_t \sigma_t^\top) \nabla \log h_t(X_t)$ that steers the dynamics toward high-reward regions of the output space.
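To make the guided ODE concrete, the sketch below assumes a one-dimensional Gaussian target $\mathcal{N}(0, \sigma^2)$, a quadratic reward $r(x) = -(x - a)^2$, and the memoryless schedule $\sigma_t^2 = 2(1-t)/t$. Under these assumptions $h_t$ is a Gaussian integral, so $\nabla \log h_t$ is available in closed form and the terminal law can be compared against the tilted Gaussian. All names are hypothetical, and forward Euler is an illustrative integrator.

```python
def guided_drift(x, t, lam, a, sigma2):
    """b_t(x) + 0.5 * sigma_t^2 * d/dx log h_t(x), with sigma_t^2 = 2(1-t)/t, in closed form."""
    var_t = (1.0 - t) ** 2 + t ** 2 * sigma2
    base = (t * sigma2 - (1.0 - t)) / var_t * x      # unguided drift b_t(x)
    post_mean = t * sigma2 * x / var_t               # E[X_1 | X_t = x]
    post_var = sigma2 * (1.0 - t) ** 2 / var_t       # Var(X_1 | X_t = x)
    lam_eff = lam / (1.0 + 2.0 * lam * post_var)
    # d/dx log h_t(x) = -2 * lam_eff * (post_mean - a) * (t * sigma2 / var_t).
    # Multiplying by (1 - t) / t cancels the factor t, so the term is finite at t = 0.
    guidance = -2.0 * lam_eff * (post_mean - a) * (1.0 - t) * sigma2 / var_t
    return base + guidance


def integrate_guided(x0, lam, a, sigma2, n_steps=2000):
    """Forward Euler integration of the guided ODE from t = 0 to t = 1."""
    dt = 1.0 / n_steps
    x = x0
    for i in range(n_steps):
        x += dt * guided_drift(x, i * dt, lam, a, sigma2)
    return x


def tilt_moments(lam, a, sigma2):
    """Mean and variance of the density proportional to N(0, sigma2) * exp(-lam (x - a)^2)."""
    var = sigma2 / (1.0 + 2.0 * lam * sigma2)
    return 2.0 * lam * a * var, var


if __name__ == "__main__":
    # The guided flow is affine: x_1 = scale * x_0 + shift, so x_1 ~ N(shift, scale**2).
    shift = integrate_guided(0.0, lam=1.0, a=1.0, sigma2=1.0)
    scale = integrate_guided(1.0, lam=1.0, a=1.0, sigma2=1.0) - shift
    print((shift, scale ** 2), tilt_moments(1.0, 1.0, 1.0))
```

In this setting the two printed pairs should agree up to discretization error, which is the one-dimensional instance of $\tilde{\rho}_1 \propto \rho_1 e^{\lambda r}$.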

Plug-in estimation

The Doob $h$-function is intractable in general, so practitioners replace it with a $k$-particle plug-in estimator:

$$\hat{h}_t^{(k)}(x) \;=\; \frac{1}{k} \sum_{i=1}^{k} e^{\lambda r(X_1^{(i)})}, \qquad X_1^{(i)} \stackrel{\text{iid}}{\sim} p_{1\mid t}(\,\cdot\, \mid x),$$

where $p_{1\mid t}$ denotes the law of $(X_1 \mid X_t = x)$. In practice, $k$ is often chosen to be small due to computational constraints. We demonstrate that the plug-in estimator exhibits two biases due to finite-sample effects, and provide methods to mitigate each one.
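A minimal sketch of the estimator, assuming a one-dimensional Gaussian target $\mathcal{N}(0, \sigma^2)$ and a quadratic reward $r(x) = -(x - a)^2$ so that $p_{1\mid t}$ is Gaussian and $h_t$ has a closed form to compare against. In a real model the posterior samples would come from the sampler itself; the helper names here are hypothetical.

```python
import math
import random


def posterior(x, t, sigma2):
    """Mean and variance of X_1 | X_t = x for the linear interpolant, target N(0, sigma2)."""
    var_t = (1.0 - t) ** 2 + t ** 2 * sigma2
    return t * sigma2 * x / var_t, sigma2 * (1.0 - t) ** 2 / var_t


def h_plugin(x, t, k, lam, a, sigma2, rng):
    """k-particle plug-in estimate of h_t(x) for the reward r(y) = -(y - a)^2."""
    mean, var = posterior(x, t, sigma2)
    std = math.sqrt(var)
    total = 0.0
    for _ in range(k):
        x1 = rng.gauss(mean, std)                   # X_1^(i) ~ p_{1|t}(. | x)
        total += math.exp(-lam * (x1 - a) ** 2)
    return total / k


def h_exact(x, t, lam, a, sigma2):
    """Closed-form Gaussian integral E[exp(-lam (X_1 - a)^2) | X_t = x]."""
    mean, var = posterior(x, t, sigma2)
    denom = 1.0 + 2.0 * lam * var
    return math.exp(-lam * (mean - a) ** 2 / denom) / math.sqrt(denom)


if __name__ == "__main__":
    rng = random.Random(0)
    for k in (1, 8, 100000):
        print(k, h_plugin(0.5, 0.5, k, 1.0, 1.0, 1.0, rng), h_exact(0.5, 0.5, 1.0, 1.0, 1.0))
```

The estimate is unbiased for $h_t$ itself, but guidance uses $\nabla \log \hat{h}_t^{(k)}$, and the logarithm of a small-$k$ average can differ substantially from $\log h_t$.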

What we prove

Our main contributions are:

We prove in Gaussian settings that significant within-mode reward hacking arises from finite-particle plug-in estimation in most implementations of reward-guided diffusion.
We show in Gaussian mixture settings that guidance using plug-in estimation fails to select between modes and has no mechanism for accurately weighting distant high-reward modes.
We propose a simple closed-form damped reward schedule $\lambda_t$ to mitigate within-mode reward hacking and clarify the role of best-of-$n$ sampling in performing mode selection.

See the paper for the precise statements and proofs.

Within-mode reward hacking: We prove that plug-in guidance overshoots the reward maximizer and samples concentrate too tightly, and that increasing $k$ helps only mildly (at a logarithmic rate in $\infty$-Wasserstein distance). For an isotropic target $\mathcal{N}(0, \sigma^2 I_d)$ with a quadratic reward, replacing the guidance scale $\lambda$ with the time-dependent reward damping schedule

$$ \lambda_t \;=\; \frac{\lambda}{1 + 2\lambda\,\sigma^2_{1\mid t}}, \qquad \sigma^2_{1\mid t} \;=\; \frac{\sigma^2 (1-t)^2}{(1-t)^2 + t^2 \sigma^2}, $$

recovers the analytic guidance. In practice, although these assumptions don't hold exactly, we find that the damped schedule still significantly mitigates reward hacking and improves sample quality. We treat $\sigma$ as a tunable hyperparameter that controls the strength of damping.

Figure: Gaussian mixture reward hacking comparison (analytic tilt, plug-in $k=1$, plug-in $k=8$, and damped guidance).
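The damping schedule can be sanity-checked in one dimension. The sketch below assumes a target $\mathcal{N}(0, \sigma^2)$ and reward $r(y) = -(y - a)^2$, and it models the expected $k = 1$ plug-in gradient by differentiating the reward at the posterior mean $\mathbb{E}[X_1 \mid X_t = x]$, which is exact for quadratic rewards because their gradient is linear. That modeling step and all function names are assumptions of this sketch.

```python
import math


def posterior_variance(t, sigma):
    """sigma^2_{1|t} for the linear interpolant with target N(0, sigma^2)."""
    s2 = sigma ** 2
    return s2 * (1.0 - t) ** 2 / ((1.0 - t) ** 2 + t ** 2 * s2)


def damped_lambda(t, lam, sigma):
    """Reward damping schedule lambda_t = lam / (1 + 2 * lam * sigma^2_{1|t})."""
    return lam / (1.0 + 2.0 * lam * posterior_variance(t, sigma))


def posterior_mean(x, t, sigma):
    s2 = sigma ** 2
    return t * s2 * x / ((1.0 - t) ** 2 + t ** 2 * s2)


def log_h_exact(x, t, lam, a, sigma):
    """log E[exp(-lam (X_1 - a)^2) | X_t = x] in 1D."""
    denom = 1.0 + 2.0 * lam * posterior_variance(t, sigma)
    return -lam * (posterior_mean(x, t, sigma) - a) ** 2 / denom - 0.5 * math.log(denom)


def mean_guidance(x, t, scale, a, sigma, step=1e-5):
    """scale * d/dx r(E[X_1 | X_t = x]) for r(y) = -(y - a)^2, by central differences."""
    def reward_at_mean(y):
        return -(posterior_mean(y, t, sigma) - a) ** 2
    return scale * (reward_at_mean(x + step) - reward_at_mean(x - step)) / (2.0 * step)


if __name__ == "__main__":
    x, t, lam, a, sigma = 0.4, 0.6, 3.0, 1.0, 0.8
    exact = (log_h_exact(x + 1e-5, t, lam, a, sigma) - log_h_exact(x - 1e-5, t, lam, a, sigma)) / 2e-5
    print("analytic:", exact)
    print("damped:  ", mean_guidance(x, t, damped_lambda(t, lam, sigma), a, sigma))
    print("undamped:", mean_guidance(x, t, lam, a, sigma))
```

In this toy case the damped gradient should coincide with the analytic one, while the undamped gradient is larger in magnitude. Note that $\lambda_t \to \lambda$ as $t \to 1$, so the damping only acts while the posterior over $X_1$ is still uncertain.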

Failure to select modes: We prove in a representative Gaussian mixture example that plug-in guidance cannot accurately weight distant high-reward modes — trajectories usually end up in the mode they were initialized near. Best-of-$n$ sampling (running $n$ independent $k = 1$ guided trajectories and selecting the one with highest reward) can recover the correct mixture weights.

Figure: Mode selection failure on a 1D Gaussian mixture.
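The best-of-$n$ procedure described above can be sketched as follows. The guided sampler is replaced by a hypothetical stand-in that lands in one of two modes according to its random seed, mimicking trajectories that stay in the mode they were initialized near; the stand-in, the reward, and all names are assumptions for illustration only.

```python
import random


def best_of_n(sample_fn, reward_fn, n, rng):
    """Draw n independent samples and return the one with the highest reward."""
    candidates = [sample_fn(rng) for _ in range(n)]
    return max(candidates, key=reward_fn)


def stand_in_guided_sampler(rng):
    """Hypothetical stand-in for a k = 1 guided trajectory: ends near -2 or +2."""
    center = -2.0 if rng.random() < 0.5 else 2.0
    return rng.gauss(center, 0.1)


def reward(x):
    """Quadratic reward that favors the mode at +2."""
    return -(x - 2.0) ** 2


def high_mode_fraction(n, trials, seed=0):
    """Fraction of best-of-n outputs that land in the high-reward mode (x > 0)."""
    rng = random.Random(seed)
    hits = sum(best_of_n(stand_in_guided_sampler, reward, n, rng) > 0 for _ in range(trials))
    return hits / trials


if __name__ == "__main__":
    for n in (1, 2, 4):
        print(n, high_mode_fraction(n, trials=2000))
```

In this toy setup the high-reward mode is selected with probability $1 - 2^{-n}$, so $n$ acts as a knob on the effective mode weights rather than a guarantee of matching the tilt exactly.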

Experiments

We confirm that our insights translate to practice through a variety of experiments (see the paper for details and more results).

Within-mode reward hacking

Masked intensity reward: mean pixel intensity inside a top-right circular mask minus the mean intensity outside.

Blueness reward: mean blue channel minus the mean red and green channels.

ImageReward: a learned human-preference reward taking a prompt and an image as inputs.
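The two hand-crafted rewards above are simple enough to sketch directly. The mask center and radius below are hypothetical values, and "minus the mean red and green channels" is interpreted here as subtracting the average of the red and green means; both choices are assumptions, and the paper's exact definitions may differ. Images are taken to be float arrays of shape `(H, W, 3)` with values in $[0, 1]$.

```python
import numpy as np


def masked_intensity_reward(image, center=(0.75, 0.25), radius=0.2):
    """Mean intensity inside a circular mask minus mean intensity outside it.

    center is (x, y) as fractions of width and height, with y measured from the top,
    so the default sits in the top-right of the image. Values are illustrative.
    """
    h, w = image.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    cx, cy = center[0] * (w - 1), center[1] * (h - 1)
    mask = (xs - cx) ** 2 + (ys - cy) ** 2 <= (radius * min(h, w)) ** 2
    intensity = image.mean(axis=-1)                  # average over RGB channels
    return float(intensity[mask].mean() - intensity[~mask].mean())


def blueness_reward(image):
    """Mean blue channel minus the average of the mean red and mean green channels."""
    red = image[..., 0].mean()
    green = image[..., 1].mean()
    blue = image[..., 2].mean()
    return float(blue - 0.5 * (red + green))


if __name__ == "__main__":
    img = np.zeros((32, 32, 3))
    img[:16, 16:] = 1.0                              # bright top-right quadrant
    print(masked_intensity_reward(img), blueness_reward(img))
```

Both rewards are differentiable in the pixel values, which is what gradient-based guidance requires.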

Mode selection

Checkerboard: the base model samples uniformly from the checkerboard (gray) and the reward is Gaussian.

**Checkerboard guidance.** Best-of-$n$ can select modes; with reward damping, best-of-$n$ significantly improves fidelity to the analytic tilt.

Checkerboard statistics. Mean reward and covariance trace for each method. Plug-in guidance cannot select modes, so the mean reward is too low and the covariance trace is too high. Best-of-$2$ already improves reward and comes close to matching the analytic tilt; best-of-$4$ with reward damping increases fidelity to the analytic tilt even further.

| Method | Mean reward | Cov. trace |
| --- | --- | --- |
| Analytic tilt | 0.914 ± 0.003 | 0.457 ± 0.021 |
| Plug-in ($k=1$) | 0.764 ± 0.004 | 1.930 ± 0.048 |
| Plug-in ($k=8$) | 0.804 ± 0.007 | 1.397 ± 0.068 |
| Best-of-$2$ | 0.914 ± 0.003 | 0.520 ± 0.026 |
| Best-of-$4$ | 0.978 ± 0.002 | 0.107 ± 0.010 |
| Best-of-$4$, $\sigma_{\mathrm{damp}}=0.2$ | 0.920 ± 0.004 | 0.429 ± 0.022 |

FLUX.1 (VLM reward): the base model is FLUX.1-dev and the reward is $r(x) = \log(p(\mathrm{Yes})) - \log(p(\mathrm{No}))$, where $p(\cdot)$ denotes the next-token probability from the Qwen2.5-VL-3B VLM.

VLM reward derived from “Does this image clearly show a neon sign with the word ‘ECLIPSE’ as the main readable text?” Plug-in guidance hacks the reward; damping helps slightly; best-of-$n$ substantially improves the result, confirming the importance of the initial seed for mode selection.

VLM reward derived from “Does this image clearly show a display with the text ‘NEXT TRAIN MARS’ as the main readable text?” The qualitative pattern is similar: plug-in guidance hacks the reward, and best-of-$n$ substantially improves the result.
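The VLM reward $r(x) = \log p(\mathrm{Yes}) - \log p(\mathrm{No})$ used in these experiments can be computed from next-token logits as sketched below. The logits and token ids are hypothetical placeholders; in practice they would come from the VLM and its tokenizer, which may represent "Yes" and "No" differently.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return [v - log_z for v in logits]


def yes_no_reward(next_token_logits, yes_id, no_id):
    """r = log p(Yes) - log p(No) under the next-token distribution."""
    log_probs = log_softmax(next_token_logits)
    return log_probs[yes_id] - log_probs[no_id]


if __name__ == "__main__":
    # Hypothetical 4-token vocabulary; ids 1 and 2 play the roles of "Yes" and "No".
    logits = [0.1, 2.0, -1.0, 0.5]
    print(yes_no_reward(logits, yes_id=1, no_id=2))
```

Since the log-normalizer cancels in the difference, the reward equals the gap between the two logits, so it is unbounded and insensitive to how much probability mass falls on other tokens.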

Citation

@article{dandapanthula2026tilting,
  title   = {Are we really tilting? The mechanics of reward guidance in flow and diffusion models},
  author  = {Dandapanthula, Sanjit and Boffi, Nicholas M.},
  journal = {arXiv preprint arXiv:2606.02884},
  year    = {2026}
}