The mechanics of reward guidance in flow and diffusion models
Reward guidance algorithms steer a learned generative process toward the reward-tilted measure at inference time. While empirically powerful, these methods are prone to reward hacking: the guided model over-optimizes the reward at the cost of fidelity to the learned distribution. We show that reward hacking arises from an approximation made in most practical implementations of reward-guided diffusion — finite-particle plug-in estimation of the Doob $h$-function — even in the simplest non-trivial settings of Gaussian and Gaussian-mixture targets with quadratic rewards. In closed form, we isolate two distinct failure modes: within-mode reward hacking and the inability to select high-reward modes. We propose a closed-form reward damping schedule that corrects the within-mode bias with no additional compute, and clarify the role of best-of-n sampling in fixing mode selection. Experiments on Gaussian mixture targets, a 2D checkerboard, and FLUX.1 text-to-image generation confirm that our theoretical insights carry over to practical settings.
Compared to analytic reward tilting, practical guidance algorithms over-concentrate within each mode and fail to select high-reward modes. We propose a damped reward scale to mitigate within-mode reward hacking and clarify the role of best-of-$n$ in mode selection; combining these two methods often enables us to approximately recover the reward tilt.
We assume access to a pre-trained flow model $b : [0,1] \times \mathbb{R}^d \to \mathbb{R}^d$. Samples from the data distribution $\rho_1 \in \mathcal{P}(\mathbb{R}^d)$ are drawn by integrating the probability flow ODE
$$\dot{x}_t = b_t(x_t), \qquad x_0 \sim \mathcal{N}(0, I_d),$$
until $t = 1$. We work in the setting where the drift can be written in terms of the linear stochastic interpolant $I_t = (1-t)\,I_0 + t\,I_1$ (with $I_0 \sim \mathcal{N}(0, I_d)$ and $I_1 \sim \rho_1$ independent) as $b_t(x) = \mathbb{E}[\dot{I}_t \mid I_t = x]$, and is learned by minimizing a flow matching objective or a regression objective (Lipman et al.; Albergo et al.).
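As a minimal sketch of sampling by integrating $\dot{x}_t = b_t(x_t)$, the snippet below uses a case where the drift is available in closed form: an isotropic Gaussian target $\mathcal{N}(0, s^2 I_d)$, for which $b_t(x) = \frac{t s^2 - (1-t)}{(1-t)^2 + t^2 s^2}\, x$ follows from Gaussian conditioning on the linear interpolant. The target, the value of `s`, the forward-Euler solver, and the function names are illustrative assumptions; a learned network would take the place of `drift` in practice.

```python
import numpy as np

def drift(t, x, s=2.0):
    """Closed-form b_t(x) = E[dI_t/dt | I_t = x] for the assumed target N(0, s^2 I_d)."""
    var_t = (1.0 - t) ** 2 + (t * s) ** 2   # Var(I_t) per coordinate
    cov_t = t * s**2 - (1.0 - t)            # Cov(I_1 - I_0, I_t) per coordinate
    return (cov_t / var_t) * x

def sample_ode(x0, n_steps=1000, s=2.0):
    """Forward-Euler integration of dx/dt = b_t(x) from t = 0 to t = 1."""
    x = np.array(x0, dtype=float)
    dt = 1.0 / n_steps
    for i in range(n_steps):
        x = x + dt * drift(i * dt, x, s)
    return x

rng = np.random.default_rng(0)
x1 = sample_ode(rng.standard_normal((5000, 2)))
print("empirical std of x_1:", x1.std(), "(target std s = 2.0)")
```

In this linear case the exact flow map is $x_1 = s\, x_0$, so the printed standard deviation should sit near $s$ up to discretization and sampling error.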
For any noise schedule $\sigma_t \in \mathbb{R}^{d \times \ell}$, the forward SDE
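$$dX_t = \left( b_t(X_t) + \tfrac{1}{2} (\sigma_t \sigma_t^\top) \nabla \log \rho_t(X_t) \right) dt + \sigma_t\, dB_t, \qquad X_0 \sim \mathcal{N}(0, I_d),$$
where $\rho_t$ denotes the density of $I_t$ and $(B_t)_{t \geq 0}$ is an $\ell$-dimensional Brownian motion, shares the same time-marginals as the probability flow ODE.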
The Fokker–Planck equation associated with the SDE is
$$\partial_t \rho_t = -\nabla \cdot \left( \rho_t b_t + \tfrac{1}{2} \nabla \cdot (\rho_t\, \sigma_t \sigma_t^\top) \right) + \tfrac{1}{2} \nabla \cdot (\nabla \cdot (\rho_t\, \sigma_t \sigma_t^\top)) = -\nabla \cdot (\rho_t b_t),$$which matches the continuity equation for the probability flow ODE.
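The marginal-matching claim can be checked numerically in a simple case. The sketch below simulates the SDE with drift $b_t + \tfrac{1}{2}\sigma_t\sigma_t^\top \nabla \log \rho_t$ by Euler-Maruyama for a 1D Gaussian target $\mathcal{N}(0, s^2)$, where both $b_t$ and the score $\nabla \log \rho_t(x) = -x / ((1-t)^2 + t^2 s^2)$ are known in closed form. The constant noise level `sigma`, the target, and the function name are assumptions made for illustration only.

```python
import numpy as np

def simulate_sde(n_samples=20000, n_steps=1000, s=2.0, sigma=1.0, seed=0):
    """Euler-Maruyama for dX = (b_t + 0.5 sigma^2 score_t) dt + sigma dB, 1D Gaussian target."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n_samples)          # X_0 ~ N(0, 1)
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        var_t = (1.0 - t) ** 2 + (t * s) ** 2   # Var(I_t)
        b = (t * s**2 - (1.0 - t)) / var_t * x  # probability flow drift b_t(x)
        score = -x / var_t                      # grad log rho_t(x)
        noise = rng.standard_normal(n_samples)
        x = x + (b + 0.5 * sigma**2 * score) * dt + sigma * np.sqrt(dt) * noise
    return x

x1 = simulate_sde()
print("terminal variance:", x1.var(), "(target s^2 = 4.0)")
```

Changing `sigma` should leave the terminal variance near $s^2$, which is the 1D Gaussian instance of the statement that the noise schedule does not affect the time-marginals.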
In reward-guided generation, we want samples not from $\rho_1$ but from the reward-tilted measure
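$$\tilde{\rho}_1(x) \propto e^{\lambda r(x)}\, \rho_1(x)$$
for a reward $r : \mathbb{R}^d \to \mathbb{R}$ and inverse temperature $\lambda > 0$. The Doob $h$-transform provides a principled way to reach $\tilde{\rho}_1$ at terminal time by modifying the drift of the forward SDE. Define the Doob $h$-function
$$h_t(x) = \mathbb{E}[e^{\lambda r(X_1)} \mid X_t = x].$$
The guided ODE
$$\dot{\tilde{x}}_t = b_t(\tilde{x}_t) + \tfrac{1}{2} (\sigma_t \sigma_t^\top) \nabla \log h_t(\tilde{x}_t), \qquad \tilde{x}_0 \sim \mathcal{N}(0, I_d),$$
then yields a sample $\tilde{x}_1 \sim \tilde{\rho}_1$ provided that $X_0$ and $X_1$ are independent under the forward SDE. Domingo-Enrich et al. show that the memoryless schedule $\sigma_t = \sqrt{2(1-t)/t}\,I_d$ guarantees this property.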
To derive the guided ODE, recall the Doob $h$-function
$$h_t(x) = \mathbb{E}[e^{\lambda r(X_1)} \mid X_t = x]$$and note that $h_t(X_t)$ is a Doob martingale. Letting $\mathcal{L}_t$ denote the generator of the diffusion:
$$\mathcal{L}_t f = \left( b_t + \tfrac{1}{2} (\sigma_t \sigma_t^\top) \nabla \log \rho_t \right)^{\!\top} \nabla f + \tfrac{1}{2} \operatorname{tr}\!\left( (\sigma_t \sigma_t^\top) \nabla^2 f \right),$$we know that $h_t(X_t)$ must satisfy the Kolmogorov backward equation
$$\partial_t h_t(X_t) + \mathcal{L}_t h_t(X_t) = 0.$$Next, let $(\mathcal{F}_t)_{t \geq 0}$ denote the natural filtration generated by the Brownian motion $(B_t)_{t \geq 0}$. Letting $P$ denote the law of the path $(X_t)_{t \in [0, 1]}$ and letting $P_t$ denote the restriction of $P$ to $\mathcal{F}_t$, we define the Doob $h$-transform of $P$ by its Radon–Nikodym derivative:
$$\frac{dQ_t}{dP_t} = g_t := \frac{h_t(X_t)}{h_0(X_0)}.$$By Itô's formula and the Kolmogorov backward equation, we have
$$\begin{aligned} dg_t & = \frac{1}{h_0(X_0)} \left( (\partial_t h_t(X_t) + \mathcal{L}_t h_t(X_t))\, dt + \nabla h_t(X_t)^{\!\top} \sigma_t\, dB_t \right) \\ & = \frac{1}{h_0(X_0)} \nabla h_t(X_t)^{\!\top} \sigma_t\, dB_t \\ & = g_t \nabla \log h_t(X_t)^{\!\top} \sigma_t\, dB_t. \end{aligned}$$At this point, applying Itô's formula shows that
$$d(\log g_t) = \nabla \log h_t(X_t)^{\!\top} \sigma_t\, dB_t - \tfrac{1}{2} \|\nabla \log h_t(X_t)^{\!\top} \sigma_t\|_2^2\, dt$$and integrating both sides shows that $g_t$ is the Doléans–Dade exponential (local) martingale
$$g_t = \exp\!\left( \int_0^t \nabla \log h_s(X_s)^{\!\top} \sigma_s\, dB_s - \tfrac{1}{2} \int_0^t \|\nabla \log h_s(X_s)^{\!\top} \sigma_s\|_2^2\, ds \right).$$
Under a standard integrability condition, for instance Novikov's condition
$$\mathbb{E}\!\left[ \exp\!\left( \tfrac{1}{2} \int_0^1 \|\nabla \log h_s(X_s)^{\!\top} \sigma_s\|_2^2\, ds \right) \right] < \infty,$$
$g_t$ is a true martingale. Finally, Girsanov's theorem implies that under $Q := Q_1$, the process
$$\tilde{B}_t = B_t - \int_0^t \sigma_s^{\!\top} \nabla \log h_s(X_s)\, ds$$is a Brownian motion, and substituting back into the SDE for $X_t$ gives
$$dX_t = \left( b_t(X_t) + \tfrac{1}{2} (\sigma_t \sigma_t^\top) \nabla \log \rho_t(X_t) + (\sigma_t \sigma_t^\top) \nabla \log h_t(X_t) \right) dt + \sigma_t\, d\tilde{B}_t.$$We can then find the density $\tilde{\rho}_t$ of $X_t$ under $Q$ by integrating a test function $f \in C_c^\infty(\mathbb{R}^d)$:
$$\mathbb{E}_Q[f(X_t)] = \mathbb{E}_P[g_t\, f(X_t)] = \mathbb{E}_P\!\left[ f(X_t)\, \frac{h_t(X_t)}{h_0(X_0)} \right].$$
In the specific case that $X_0 \perp\!\!\!\perp X_1$ (which holds under the memoryless noise schedule), $h_0(X_0) = \mathbb{E}[e^{\lambda r(X_1)}]$ does not depend on $X_0$ and is therefore a normalizing constant, so $\tilde{\rho}_t(x) \propto \rho_t(x)\, h_t(x)$. At the endpoint, $h_1(x) = e^{\lambda r(x)}$, so $\tilde{\rho}_1(x) \propto e^{\lambda r(x)} \rho_1(x)$ as desired, and the Doob $h$-transform of the original diffusion gives a principled way to sample from the reward-tilted measure. We can then convert the guided SDE back to a guided ODE by considering the associated probability flow:
$$\begin{aligned} dX_t & = \left( b_t(X_t) + \tfrac{1}{2} (\sigma_t \sigma_t^\top) \nabla \log \rho_t(X_t) + (\sigma_t \sigma_t^\top) \nabla \log h_t(X_t) - \tfrac{1}{2} (\sigma_t \sigma_t^\top) \nabla \log \tilde{\rho}_t(X_t) \right) dt \\ & = \left( b_t(X_t) + \tfrac{1}{2} (\sigma_t \sigma_t^\top) \nabla \log h_t(X_t) \right) dt. \end{aligned}$$Hence, the guided probability flow ODE simply includes an additional score term $\tfrac{1}{2} (\sigma_t \sigma_t^\top) \nabla \log h_t(X_t)$ that steers the dynamics toward high-reward regions of the output space.
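The conclusion of this derivation can be sanity-checked in a setting where everything is Gaussian. The sketch below assumes a 1D target $\mathcal{N}(0, s^2)$, a quadratic reward $r(x) = -\tfrac{1}{2}(x - \mu)^2$, and the memoryless schedule $\sigma_t^2 = 2(1-t)/t$. In this case $(X_1 \mid X_t = x)$ is Gaussian with mean $\kappa_t x$ and variance $c_t$, where $\kappa_t = t s^2 / v_t$, $c_t = s^2 (1-t)^2 / v_t$, and $v_t = (1-t)^2 + t^2 s^2$, so that $\nabla \log h_t(x) = -\lambda \kappa_t (\kappa_t x - \mu) / (1 + \lambda c_t)$. The reward parameterization, the values of `s`, `lam`, `mu`, and all function names are illustrative assumptions.

```python
import numpy as np

def guided_drift(t, x, s=1.5, lam=2.0, mu=1.0):
    """b_t(x) + 0.5 * sigma_t^2 * grad log h_t(x) with the exact Gaussian h-function."""
    var_t = (1.0 - t) ** 2 + (t * s) ** 2        # v_t = Var(X_t)
    b = (t * s**2 - (1.0 - t)) / var_t * x       # base drift b_t(x)
    kappa = t * s**2 / var_t                     # E[X_1 | X_t = x] = kappa * x
    c = (s * (1.0 - t)) ** 2 / var_t             # Var(X_1 | X_t = x)
    # 0.5 * sigma_t^2 * kappa = (1 - t) * s^2 / v_t: the 1/t in sigma_t^2 cancels the t in kappa.
    half_sigma2_kappa = (1.0 - t) * s**2 / var_t
    guidance = -half_sigma2_kappa * lam * (kappa * x - mu) / (1.0 + lam * c)
    return b + guidance

def integrate_guided(x0, n_steps=2000, **params):
    """Forward-Euler integration of the guided ODE from t = 0 to t = 1."""
    x = np.array(x0, dtype=float)
    dt = 1.0 / n_steps
    for i in range(n_steps):
        x = x + dt * guided_drift(i * dt, x, **params)
    return x

def analytic_tilt(s=1.5, lam=2.0, mu=1.0):
    """Mean and variance of the tilt exp(lam * r) * N(0, s^2) for the quadratic reward."""
    var = s**2 / (1.0 + lam * s**2)
    return lam * mu * var, var

rng = np.random.default_rng(0)
x1 = integrate_guided(rng.standard_normal(10000))
print("guided ODE   mean, var:", x1.mean(), x1.var())
print("analytic tilt mean, var:", analytic_tilt())
```

With the exact $h$-function the two lines should agree up to discretization and sampling error; the biases discussed next only appear once $h_t$ is replaced by an estimate.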
The Doob $h$-function is intractable in general, so practitioners replace it with a $k$-particle plug-in estimator:
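$$\hat{h}_t(x) = \frac{1}{k} \sum_{i=1}^{k} e^{\lambda r(X_1^{(i)})}, \qquad X_1^{(1)}, \dots, X_1^{(k)} \overset{\text{iid}}{\sim} p_{1 \vert t}(\cdot \mid x),$$
where $p_{1 \vert t}$ denotes the law of $(X_1 \mid X_t = x)$. In practice, $k$ is often chosen to be small due to computational constraints. We demonstrate that the plug-in estimator exhibits two biases due to finite-sample effects, and provide methods to mitigate each one.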
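To make the finite-sample effect concrete, here is a rough sketch that reads the plug-in estimator as a Monte Carlo average of $e^{\lambda r}$ over $k$ draws from $p_{1 \vert t}$, and compares its logarithm with the exact $\log h_t$. It assumes a 1D target $\mathcal{N}(0, s^2)$, the reward $r(x) = -\tfrac{1}{2}(x - \mu)^2$, and that $p_{1 \vert t}$ is the Gaussian posterior of the linear interpolant; these choices, the parameter values, and the function names are assumptions for illustration and may differ from the paper's setup.

```python
import numpy as np

def posterior_params(t, x, s=1.5):
    """Mean and variance of (X_1 | X_t = x) for the assumed target N(0, s^2)."""
    var_t = (1.0 - t) ** 2 + (t * s) ** 2
    return t * s**2 / var_t * x, (s * (1.0 - t)) ** 2 / var_t

def log_h_exact(t, x, lam=2.0, s=1.5, mu=1.0):
    """Closed-form log E[exp(lam * r(X_1)) | X_t = x] with r(x) = -0.5 (x - mu)^2."""
    m, c = posterior_params(t, x, s)
    return -0.5 * np.log1p(lam * c) - lam * (m - mu) ** 2 / (2.0 * (1.0 + lam * c))

def log_h_plugin(t, x, k, rng, lam=2.0, s=1.5, mu=1.0):
    """log of the k-particle average of exp(lam * r), computed stably via log-mean-exp."""
    m, c = posterior_params(t, x, s)
    x1 = m + np.sqrt(c) * rng.standard_normal(k)   # k draws from p_{1|t}(. | x)
    a = lam * (-0.5 * (x1 - mu) ** 2)
    return a.max() + np.log(np.mean(np.exp(a - a.max())))

rng = np.random.default_rng(0)
t, x = 0.3, 0.5
print("exact log h:", log_h_exact(t, x))
for k in (1, 8, 64):
    est = np.mean([log_h_plugin(t, x, k, rng) for _ in range(2000)])
    print(f"k = {k:3d}  average plug-in log h: {est}")
```

By Jensen's inequality the average of $\log \hat{h}_t$ lies below $\log h_t$ for any finite $k$, so the guidance term built from $\nabla \log \hat{h}_t$ is not simply a noisy version of the exact one.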
Our main contributions are:
See the paper for the precise statements and proofs.
Within-mode reward hacking: We prove that plug-in guidance overshoots the reward maximizer and that its samples concentrate too tightly; increasing $k$ helps only mildly (at a logarithmic rate in $\infty$-Wasserstein distance). For an isotropic target $\mathcal{N}(0, \sigma^2 I_d)$ with a quadratic reward, replacing the guidance scale $\lambda$ with a closed-form, time-dependent reward damping schedule (stated in the paper) corrects this within-mode bias with no additional compute.
We confirm that our insights translate to practice through a variety of experiments (see the paper for details and more results).
Masked intensity reward: mean pixel intensity inside a top-right circular mask minus the mean intensity outside.
Blueness reward: mean blue channel minus the mean red and green channels.
ImageReward: a learned human-preference reward taking a prompt and an image as inputs.
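The two pixel-level rewards above are simple enough to sketch directly. The snippet assumes images are float arrays of shape `(H, W, 3)` in RGB order, reads "mean red and green channels" as the average of the two channel means, and places the circular mask with hypothetical `center` and `radius` parameters (given as fractions of the image size); none of these details are specified in the text, so treat them as placeholders.

```python
import numpy as np

def blueness_reward(img):
    """Mean blue channel minus the average of the mean red and green channels."""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    return float(b.mean() - 0.5 * (r.mean() + g.mean()))

def masked_intensity_reward(img, center=(0.25, 0.75), radius=0.2):
    """Mean intensity inside a circular mask minus mean intensity outside it.

    center is (row, col) as fractions of (H, W); the default puts the mask in the
    top-right quadrant. Both center and radius are hypothetical defaults.
    """
    h, w = img.shape[:2]
    rows, cols = np.mgrid[0:h, 0:w]
    cy, cx = center[0] * h, center[1] * w
    mask = (rows - cy) ** 2 + (cols - cx) ** 2 <= (radius * min(h, w)) ** 2
    intensity = img.mean(axis=-1)               # per-pixel mean over RGB
    return float(intensity[mask].mean() - intensity[~mask].mean())

blue = np.zeros((64, 64, 3))
blue[..., 2] = 1.0
bright_top_right = np.zeros((64, 64, 3))
bright_top_right[:32, 32:] = 1.0
print("blueness of a pure blue image:", blueness_reward(blue))
print("masked intensity, bright top-right:", masked_intensity_reward(bright_top_right))
```

Both functions return plain floats, so they can be dropped into any reward-guided sampler that expects a scalar reward per image.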
Checkerboard: the base model samples uniformly from the checkerboard (gray) and the reward is Gaussian.
Checkerboard statistics. Mean reward and covariance trace for each method. Plug-in guidance cannot select modes, so the mean reward is too low and the covariance trace is too high. Best-of-$2$ already improves reward and comes close to matching the analytic tilt; best-of-$4$ with reward damping increases fidelity to the analytic tilt even further.
| Method | Mean reward | Cov. trace |
|---|---|---|
| Analytic tilt | 0.914 ± 0.003 | 0.457 ± 0.021 |
| Plug-in ($k=1$) | 0.764 ± 0.004 | 1.930 ± 0.048 |
| Plug-in ($k=8$) | 0.804 ± 0.007 | 1.397 ± 0.068 |
| Best-of-$2$ | 0.914 ± 0.003 | 0.520 ± 0.026 |
| Best-of-$4$ | 0.978 ± 0.002 | 0.107 ± 0.010 |
| Best-of-$4$, $\sigma_{\mathrm{damp}}=0.2$ | 0.920 ± 0.004 | 0.429 ± 0.022 |
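For readers who want to compute the same two statistics on their own samples, the sketch below evaluates mean reward and covariance trace for a toy checkerboard, an importance-weighted approximation of the tilt, and best-of-$n$ selection. Everything specific here is an assumption: the board geometry (4 x 4 cells on $[-2, 2]^2$), the reward center and width, the value of `lam`, and the fact that best-of-$n$ is applied to base samples rather than to guided trajectories. The numbers it prints are therefore not expected to reproduce the table.

```python
import numpy as np

def sample_checkerboard(n, rng, cells=4, half_width=2.0):
    """Rejection-sample uniformly from alternating cells of a board on [-hw, hw]^2."""
    out, need = [], n
    while need > 0:
        x = rng.uniform(-half_width, half_width, size=(2 * need, 2))
        idx = np.floor((x + half_width) / (2.0 * half_width / cells)).astype(int)
        keep = x[idx.sum(axis=1) % 2 == 0][:need]
        out.append(keep)
        need -= len(keep)
    return np.concatenate(out)

def gaussian_reward(x, center=(1.0, 1.0), width=1.0):
    """Unnormalized Gaussian bump with values in (0, 1]."""
    return np.exp(-np.sum((x - np.asarray(center)) ** 2, axis=-1) / (2.0 * width**2))

def best_of_n(n, n_out, rng):
    """Draw n candidates per output sample and keep the one with the highest reward."""
    cands = sample_checkerboard(n * n_out, rng).reshape(n_out, n, 2)
    best = gaussian_reward(cands).argmax(axis=1)
    return cands[np.arange(n_out), best]

def summarize(x, weights=None):
    """Mean reward and covariance trace, optionally with importance weights."""
    w = np.ones(len(x)) if weights is None else np.asarray(weights, dtype=float)
    w = w / w.sum()
    mean = w @ x
    return float(w @ gaussian_reward(x)), float(np.sum(w[:, None] * (x - mean) ** 2))

rng = np.random.default_rng(0)
lam = 4.0
base = sample_checkerboard(20000, rng)
print("base        ", summarize(base))
print("tilt (SNIS) ", summarize(base, weights=np.exp(lam * gaussian_reward(base))))
for n in (2, 4):
    print(f"best-of-{n}   ", summarize(best_of_n(n, 20000, rng)))
```

The weighted call uses self-normalized importance sampling with weights $e^{\lambda r}$ as a stand-in for the analytic tilt, which is convenient when the base distribution is easy to sample.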
FLUX.1 (VLM reward): the base model is FLUX.1-dev and the reward is $r(x) = \log(p(\mathrm{Yes})) - \log(p(\mathrm{No}))$, where $p(\cdot)$ denotes the next-token probability from the Qwen2.5-VL-3B VLM.
@article{dandapanthula2026tilting,
title = {Are we really tilting? The mechanics of reward guidance in flow and diffusion models},
author = {Dandapanthula, Sanjit and Boffi, Nicholas M.},
journal = {arXiv preprint arXiv:2606.02884},
year = {2026}
}