In this post, I’ll cover a paper that I’ll present at ICLR 2026. The paper is available on OpenReview.
The setting: linear inverse problems
There are many situations in which one wants to estimate an unknown signal $\mathbf{x}$, but only has access to indirect and noisy measurements $\mathbf{y}$. The forward problem is to model the relation between the measurements and the signal. Often, this relation is well-modeled as linear, in which case we can write
$$\mathbf{y} = \mathbf{A}\mathbf{x} + \mathbf{n},$$where $\mathbf{A}$ is some known matrix that models the noise-free measurement acquisition and $\mathbf{n}$ is an unknown noise component with known distribution. The inverse problem is to estimate the signal $\mathbf{x}$ from the measurements $\mathbf{y}$. When the relation between the measurements and the signal is linear, as above, we speak of a linear inverse problem; these are the problems we consider here.
A concrete example is the following. Suppose you want to recover a person’s speech, but all you have is a recording from a faulty microphone placed across the room that occasionally drops parts of the signal. By the time the sound reaches the microphone, it has already scattered and reflected off the walls, an effect that can be modeled as a convolution with the room’s impulse response. On top of that, the microphone misses parts of the recording, and the remaining signal is corrupted by measurement noise. Below is a concrete audio example: a clean speech recording1 (left) and the corresponding indirect, noisy measurements2 recorded by the faulty microphone (right). Click either waveform to listen.
Clean signal $\mathbf{x}$
Measurements $\mathbf{M}^\top\mathbf{y}$
In the case above, the noiseless measurement acquisition can be modeled with a matrix $\mathbf{A}$ that is defined such that for any signal $\mathbf{x}$, $\mathbf{A}\mathbf{x} = \mathbf{M}\bigl(\mathbf{k} \ast \mathbf{x}\bigr)$, where $\mathbf{k}$ is the room impulse response and $\mathbf{M}$ is a sampling matrix (obtained by removing appropriate rows from the identity matrix) that models the tendency of the faulty microphone to drop parts of the recording. In the visualization above, we're displaying $\mathbf{M}^\top \mathbf{y}$ (which fills the measurements with zeros in the appropriate places).
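To make this forward model concrete, here is a minimal NumPy sketch of the operator $\mathbf{A}\mathbf{x} = \mathbf{M}(\mathbf{k} \ast \mathbf{x})$; the impulse response, signal length, and drop pattern below are made-up stand-ins, not the ones used in the audio example.

```python
import numpy as np

rng = np.random.default_rng(0)

d = 64                                  # signal length (arbitrary)
x = rng.standard_normal(d).cumsum()     # toy "signal"

# Hypothetical short impulse response k (stand-in for a room response).
k = np.array([0.6, 0.3, 0.1])

# "same"-mode convolution keeps the output length equal to d.
conv = np.convolve(x, k, mode="same")

# Sampling matrix M: keep a random 75% of the samples, drop the rest.
keep = np.sort(rng.choice(d, size=3 * d // 4, replace=False))
M = np.eye(d)[keep]

sigma_n = 0.05
y = M @ conv + sigma_n * rng.standard_normal(len(keep))

# M.T @ y re-inserts zeros at the dropped positions, as in the waveform plot.
y_padded = M.T @ y
```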
This pattern shows up everywhere: cleaning noisy data, undoing blur, filling in missing values, estimating hidden states, reconstructing signals from compressed or spectral measurements (e.g., MRI, radar, radio astronomy), and even inferring parameters in physical and biological models are all variations of the same core task.
The Bayesian resolution
One way to resolve an inverse problem is through the Bayesian resolution, in which the unknown signal $\mathbf{x}$ is modeled as an instantiation of a random variable. We denote that random variable as $\mathbf{X}$, and its distribution as $p_{\mathbf{X}}$. Ultimately, the object of interest will be the posterior distribution: a probability distribution over signals that are both plausible a priori and consistent with the measurements $\mathbf{y}$. The key objects are:
- Prior $p_{\mathbf{X}}$: a distribution over plausible signals. It encodes everything we know about the signal before seeing any measurements.
- Likelihood $p_{\mathbf{Y} \mid \mathbf{X} = \,\cdot\,}(\mathbf{y})$: the compatibility of a candidate signal with the observed data $\mathbf{y}$. When the noise is Gaussian, this is a (potentially degenerate) Gaussian centered around the least-squares estimate $\mathbf{x}_{\mathrm{LS}}$ that is given by applying a pseudoinverse of $\mathbf{A}$, denoted $\mathbf{A}^\dagger$, to the measurements: $\mathbf{x}_{\mathrm{LS}} = \mathbf{A}^\dagger\mathbf{y}$.
- Posterior $p_{\mathbf{X} \mid \mathbf{Y} = \mathbf{y}}$: by Bayes’ theorem, $p_{\mathbf{X} \mid \mathbf{Y} = \mathbf{y}}(\mathbf{x}) \propto p_{\mathbf{Y} \mid \mathbf{X} = \mathbf{x}}(\mathbf{y})\, p_{\mathbf{X}}(\mathbf{x})$.
The Bayesian view is especially attractive because the posterior is more than a single answer: it gives you a whole distribution to make decisions from. Once you have it, you can choose a point estimate that matches what you actually care about; for example, minimizing the squared error, maximizing the peak signal-to-noise ratio, or optimizing a perceptual score like the Fréchet inception distance. In other words, once you commit to a prior (and a likelihood), a whole family of Bayes-optimal estimators comes “for free” (as long as the corresponding optimum exists).
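In the special case where the prior is itself Gaussian, all three ingredients are available in closed form, which makes for a compact illustration of how prior and likelihood combine. The numbers below are arbitrary; the point is the precision-combination form of Bayes' theorem for linear-Gaussian models.

```python
import numpy as np

# Toy 2-D example: prior x ~ N(mu0, S0); likelihood y = A x + n, n ~ N(0, sn2 I).
mu0 = np.array([0.0, 0.0])
S0 = np.array([[1.0, 0.8], [0.8, 1.0]])   # correlated prior (assumption)
A = np.array([[1.0, 0.0]])                # observe only the first coordinate
sn2 = 0.1
y = np.array([2.0])

# Bayes' theorem for Gaussians: add precisions, then combine the means.
prec = np.linalg.inv(S0) + A.T @ A / sn2
S_post = np.linalg.inv(prec)
mu_post = S_post @ (np.linalg.inv(S0) @ mu0 + A.T @ y / sn2)
```

Even though only the first coordinate is observed, the prior correlation pulls the posterior mean of the second coordinate toward the measurement as well.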
The three ingredients are illustrated below for a two-dimensional toy model. Click any panel to explore the interaction between prior, likelihood, and posterior.
A catch is that the prior $p_{\mathbf{X}}$ is usually unknown and far too complicated to write down explicitly. What we often do have, though, is data: a large collection of example signals. That data can be used to train a diffusion model, which can generate new samples that go beyond simple memorization and often look like realistic “new” examples. This naturally suggests using diffusion models as learned priors for Bayesian inverse problems. The goal of diffusion-posterior-sampling (DPS) algorithms is to repurpose a pretrained diffusion model so it produces samples not from the prior, but from the posterior $p_{\mathbf{X}\mid \mathbf{Y}=\mathbf{y}}$: signals that both look realistic and agree with the measurements.
Diffusion models, briefly
Diffusion models sample by reversing a stochastic differential equation (SDE) like
$$\mathrm{d}\mathbf{X}_t = \mathbf{f}(\mathbf{X}_t, t)\,\mathrm{d}t + g(t)\,\mathrm{d}\mathbf{W}_t$$with initial condition $\mathbf{X}_0 = \mathbf{X}$, where $\mathbf{W}_t$ is a standard Wiener process. Under suitable choices of the drift $\mathbf{f}$ and diffusion coefficient $g$, this forward SDE admits a limiting marginal $\mathbf{X}_{\infty}$ with a “simple” distribution. We’ll use the variance preserving $\mathbf{f}(\mathbf{x}, t) = -\frac{\beta(t)}{2}\mathbf{x}$ and $g(t) = \sqrt{\beta(t)}$ where $\beta$ is a function that controls the speed of the contraction to zero and how much noise is injected. This choice of the drift and the diffusion coefficient induces a standard normal limiting marginal. The generation from $p_{\mathbf{X}_0} = p_{\mathbf{X}}$ can proceed by simulating the SDE in reverse. By Anderson’s theorem, the SDE that reproduces the marginals of the forward SDE is
$$\mathrm{d}\mathbf{X}_t = \Bigl(\mathbf{f}(\mathbf{X}_t, t) - g^2(t)\,\nabla \log p_{\mathbf{X}_t}(\mathbf{X}_t)\Bigr)\,\mathrm{d}t + g(t)\,\mathrm{d}\mathbf{W}_t$$with initial condition $\mathbf{X}_{\infty}$.
The key object here is the score $\nabla \log p_{\mathbf{X}_t}$ of the intermediate distributions $p_{\mathbf{X}_t}$. It’s important to highlight that, if the score is exact, this is an exact sampling algorithm, just like the algorithm in which the Cholesky factorization is used to sample a Gaussian. We’ll see later that it is possible to estimate this score up to arbitrary precision for the prior distributions that we choose in this benchmark.
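As a sketch of how this sampler works in practice, here is an Euler-Maruyama simulation of the reverse SDE for a 1-D Gaussian prior with constant $\beta$, where the score of every intermediate marginal is known exactly; all parameter values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)

# VP-SDE with constant beta; prior X0 ~ N(m, s2), so all scores are closed-form.
beta, m, s2 = 1.0, 2.0, 0.25
T, n_steps, n_particles = 8.0, 1600, 8000
dt = T / n_steps

def alpha(t):
    return np.exp(-0.5 * beta * t)

def score(x, t):
    a = alpha(t)
    var_t = a**2 * s2 + (1.0 - a**2)   # marginal variance of X_t
    return -(x - a * m) / var_t

# Start from the limiting standard normal and integrate the reverse SDE
# (Euler-Maruyama) back to t = 0.
x = rng.standard_normal(n_particles)
for i in range(n_steps):
    t = T - i * dt
    drift = -0.5 * beta * x - beta * score(x, t)
    x = x - drift * dt + np.sqrt(beta * dt) * rng.standard_normal(n_particles)
```

With the exact score, the final particles are (up to discretization and Monte Carlo error) samples from the prior $N(m, s_2)$.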
The animation below shows the forward process (particles from the prior gradually turning to Gaussian noise) and the reverse (noise reverting back to structured samples). You can track the evolution of any particle in the diffusions by selecting it with the mouse.
From prior to posterior sampling
To adapt the reverse-diffusion sampler from prior sampling to posterior sampling, a first idea would be to replace the score $\nabla \log p_{\mathbf{X}_t}$ of the intermediate prior distributions $p_{\mathbf{X}_t}$ with the score $\nabla \log p_{\mathbf{X}_t \mid \mathbf{Y} = \mathbf{y}}$ of the intermediate posterior distributions $p_{\mathbf{X}_t | \mathbf{Y} = \mathbf{y}}$, which are given by Bayes’ theorem as
$$\nabla \log p_{\mathbf{X}_t \mid \mathbf{Y} = \mathbf{y}} = \nabla \log p_{\mathbf{X}_t} + \nabla \bigl( \mathbf{x} \mapsto \log p_{\mathbf{Y} \mid \mathbf{X}_t = \mathbf{x}}(\mathbf{y}) \bigr). $$The problem with that approach is that the score of the intermediate likelihoods is generally intractable. Different approximations of this term define different DPS algorithms, and determining which one is best is exactly the question our benchmark addresses.
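In a fully Gaussian 1-D toy model, every term in this decomposition is available in closed form, so the identity can be checked directly; the parameter values below are arbitrary.

```python
import numpy as np

# Fully Gaussian 1-D check of the posterior-score decomposition:
# X0 ~ N(0, 1), Y = X0 + noise with variance sn2,
# X_t = alpha * X0 + sigma * N (VP diffusion, so Var(X_t) = 1).
alpha, sn2, y = 0.6, 0.1, 1.5
x = np.linspace(-3.0, 3.0, 101)

prior_score = -x                       # score of X_t ~ N(0, 1)

# Y | X_t = x is Gaussian with mean alpha * x and variance 1 + sn2 - alpha**2.
lik_score = alpha * (y - alpha * x) / (1.0 + sn2 - alpha**2)

# Exact posterior X_t | Y = y is Gaussian with the parameters below.
mu_p = alpha * y / (1.0 + sn2)
v_p = 1.0 - alpha**2 / (1.0 + sn2)
post_score = -(x - mu_p) / v_p
```

The prior and likelihood scores sum exactly to the posterior score, as the decomposition requires.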
Why evaluation is hard
Existing evaluations fall into one of two camps, each with their own limitations.
Evaluations on empirical distributions
Novel algorithms are most often evaluated on empirical distributions of real data. That typically goes like this: First, you split the dataset into training, validation, and test points. Then, you use the training points to train a neural-network-based surrogate of the score of the intermediate distributions, and use the validation points to tune any hyperparameters of the DPS algorithm that you want to benchmark. Finally, for each test point $\mathbf{x}$, you synthesize measurements $\mathbf{y}$, run the DPS algorithm, and compare the samples to $\mathbf{x}$.
This mode of evaluation has the benefit that the distributions are realistic, but it is fundamentally limited by three pitfalls: the true posterior is unknown, so samples can only be compared to a single ground-truth signal through point-estimate metrics; the learned score is inexact, so algorithmic errors cannot be disentangled from modeling errors; and hyperparameters tuned for one score model may silently compensate for its inaccuracies and fail to transfer to a better one.
Evaluations on Gaussian mixtures
Gaussian mixtures have the nice property that all intermediate marginals and their scores are available in closed form, as is the posterior of the initial inverse problem. This enables distribution-level comparisons.
The downside is that Gaussian mixtures are bad models for the distributions of realistic signals. We’ll illustrate this by showing that the empirical distributions of differences between neighboring samples of realistic signals from several domains are far from Gaussian: small jumps are much more likely than a Gaussian predicts, and so are extremely large jumps.3
This heavy-tailed behavior is characteristic of many real-world signals. The plot below shows audio and stock-price increments: while the best-fit Gaussian (blue) systematically underestimates the tails, Student-t and Laplace distributions match much better.
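A quick way to quantify this tail behavior is excess kurtosis, which is zero for a Gaussian and strictly positive for Laplace and Student-$t$ distributions. A small NumPy check on synthetic draws (the degrees-of-freedom value is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)

def excess_kurtosis(x):
    # Sample excess kurtosis: fourth standardized moment minus 3.
    x = x - x.mean()
    return np.mean(x**4) / np.mean(x**2) ** 2 - 3.0

n = 200_000
ek_gauss = excess_kurtosis(rng.normal(size=n))        # ~ 0
ek_laplace = excess_kurtosis(rng.laplace(size=n))     # ~ 3
ek_student = excess_kurtosis(rng.standard_t(df=10, size=n))  # ~ 1 for df = 10
```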
Our benchmark: Lévy process priors
The prior distributions we use are precisely those of signals with such heavy-tailed increment distributions. The key idea is to characterize signals not directly, but through their jumps — the differences between neighboring samples. By choosing a distribution $p_\mathrm{inc}$ for those jumps, we get a rich and flexible family of priors. These are known as Lévy processes: signals with independent, identically distributed increments drawn from $p_\mathrm{inc}$, giving a lot of flexibility:
| Increment distribution | Notation | Parameters | Signal type |
|---|---|---|---|
| Gaussian | $\mathrm{Gauss}(\mu, \sigma^2)$ | $\mu$: mean, $\sigma^2$: variance | Smooth / Brownian |
| Laplace | $\mathrm{Lap}(\mu, \sigma^2)$ | $\mu$: mean, $\sigma^2$: variance | Occasional large jumps |
| Student-$t$ | $\mathrm{St}(\nu)$ | $\nu$: degrees of freedom | Heavy-tailed jumps |
| Bernoulli–Laplace | $\mathrm{BL}(p, \sigma^2)$ | $p$: non-zero jump prob., $\sigma^2$: Laplace variance | Piecewise-constant |
Formally, our priors are given by
$$p_{\mathbf{X}}(\mathbf{x}) = \prod_{k=1}^{d} p_{\mathrm{inc}}\bigl([\mathbf{D}\mathbf{x}]_k\bigr),$$where $\mathbf{D}$ is a first-order finite-difference operator (defined by $[\mathbf{D}\mathbf{x}]_1 = \mathbf{x}_1$ and $[\mathbf{D}\mathbf{x}]_k = \mathbf{x}_{k} - \mathbf{x}_{k-1}$ for $k>1$). These distributions encode signals with extreme jumps and even strictly piecewise-constant behavior, much closer to reality than Gaussians. In the plot below, you can interact with two example realizations per family:
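Sampling from such a prior is just a cumulative sum of i.i.d. increments. A NumPy sketch, with made-up parameter values for each family:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 256

# One draw per increment family; a Levy-process sample path is the cumulative
# sum of i.i.d. increments, so applying D recovers the increments.
increments = {
    "Gauss": rng.normal(0.0, 1.0, d),
    "Laplace": rng.laplace(0.0, 1.0, d),
    "Student-t": rng.standard_t(df=2, size=d),
    # Bernoulli-Laplace: most increments are exactly zero -> piecewise-constant.
    "Bernoulli-Laplace": rng.laplace(0.0, 1.0, d) * (rng.random(d) < 0.1),
}
paths = {name: np.cumsum(inc) for name, inc in increments.items()}

# First-order finite-difference operator D (inverts the cumulative sum).
D = np.eye(d) - np.eye(d, k=-1)
```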
Efficient Gibbs sampling
The key property that makes these priors attractive for our purposes is that the posteriors that arise from these priors admit efficient exact sampling via Gibbs sampling, and we now show how to obtain these samplers. The first step is to rewrite the posterior (using again a Gaussian likelihood as an example for the moment) as the huge product
$$p_{\mathbf{X}\mid\mathbf{Y}=\mathbf{y}}(\mathbf{x}) \propto \exp\!\left(-\frac{1}{2\sigma_n^2}\|\mathbf{A}\mathbf{x}-\mathbf{y}\|^2\right) \prod_{k=1}^{d} p_{\mathrm{inc}}\bigl([\mathbf{D}\mathbf{x}]_k\bigr) \propto \prod_{k=1}^{m+d} \phi_k\bigl([\mathbf{K}\mathbf{x}]_k\bigr),$$where a stacked operator $\mathbf{K} = [\mathbf{A}; \mathbf{D}]$ collects both the measurement rows of $\mathbf{A}$ and the difference rows of $\mathbf{D}$. The key trick is now to represent each factor $\phi_k$ as a latent-variable Gaussian mixture:
$$\phi_k(t) = \int_{\mathbb{R}} g_{\mu_k(z),\sigma_k^2(z)}(t) f_k(z) \, \mathrm{d}z,$$where $g_{\mu_k(z),\sigma_k^2(z)}$ is a Gaussian with mean $\mu_k(z)$ and variance $\sigma_k^2(z)$, and $f_k$ is the mixing distribution. This is possible for all our prior distributions and many likelihoods. The functions $\mu_k$, $\sigma_k^2$ and the mixing density $f_k$ depend on the factor $\phi_k$. For example, the $m$ Gaussian factors that come from the likelihood can trivially be represented using the functions $\mu_k(z) = [\mathbf{y}]_k$, $\sigma_k^2(z) = \sigma_n^2$, and the mixing distribution $f_k(z) = \delta(z)$ (the Dirac distribution centered at $0$). A Laplace distribution with variance $b$ can be represented with the functions $\mu_k(z) = 0$ and $\sigma_k^2(z) = z$, and the mixing density $f_k(z) = \lambda\exp(-\lambda z)$, an exponential distribution with parameter $\lambda = \frac{1}{b}$ (by the law of total variance, the mixture has variance $\mathbb{E}[z] = \frac{1}{\lambda}$). The construction of a Laplace distribution (variance $b=1$) is animated here:
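The scale-mixture construction is also easy to verify numerically: draw a latent variance from an exponential distribution, then a Gaussian sample with that variance, and the result is Laplace-distributed. A NumPy sketch (the rate $\lambda$ is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
n, lam = 500_000, 0.5

# Latent-variable construction: z ~ Exp(lam) is the Gaussian's variance,
# t | z ~ N(0, z).  By the law of total variance, Var(t) = E[z] = 1 / lam.
z = rng.exponential(scale=1.0 / lam, size=n)
t = rng.normal(0.0, np.sqrt(z))

# Direct Laplace draws with the same variance (scale s gives variance 2 s**2).
s = np.sqrt(1.0 / (2.0 * lam))
t_direct = rng.laplace(0.0, s, size=n)
```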
It turns out that if it is possible to represent a distribution as this huge product where each factor can be represented as a latent-variable Gaussian mixture, then there exists an efficient Gibbs sampling scheme for that distribution. That’s basically what we showed in The Gaussian Latent Machine, and you can find all the details there.
Access to the score
Having efficient posterior sampling for these priors has a particularly important implication for diffusion models: By Tweedie’s formula, the score $\nabla \log p_{\mathbf{X}_t}$ of an intermediate distribution at time $t$ is related to the conditional expectation of $\mathbf{X}_0$ given $\mathbf{X}_t$ via
$$\nabla \log p_{\mathbf{X}_t}(\mathbf{x}_t) = \frac{1}{\sigma^2(t)} \bigl(\alpha(t)\,\mathbb{E}[\mathbf{X}_0 \mid \mathbf{X}_t = \mathbf{x}_t] - \mathbf{x}_t\bigr)$$where $\alpha(t) = \exp\bigl( -\frac{1}{2}\int_0^t\beta(s)\,\mathrm{d}s \bigr)$ and $\sigma^2(t) = (1-\alpha^2(t))$. In addition, under our choice of the drift and the diffusion coefficients, the random variables $\mathbf{X}_0$ and $\mathbf{X}_t$ are related in distribution as
$$\mathbf{X}_t = \alpha(t)\mathbf{X}_0 + \sigma(t)\mathbf{N},$$where $\mathbf{N}$ is a random variable with standard normal distribution. This implies that the expectation $\mathbb{E}[\mathbf{X}_0 \mid \mathbf{X}_t = \mathbf{x}_t]$ is that of a denoising-posterior distribution $p_{\mathbf{X}_0 \mid \mathbf{X}_t = \mathbf{x}_t}$ of an inverse problem with forward operator $\alpha(t)\mathbf{I}$ and Gaussian noise with variance $\sigma^2(t)$. By construction, this falls into the types of posteriors that we can sample with our efficient Gibbs methods and we can estimate the expectation up to arbitrary precision through Monte Carlo estimation.
This gives us access to an arbitrary-precision Monte-Carlo estimate at any point in the reverse diffusion, which enables:
- Disentangling algorithmic approximation errors from modeling errors.
- Systematic evaluation of the behavior of DPS algorithms in terms of the quality of the score estimate.
The visualization below shows how the score is obtained at an intermediate time $t$. The gray contours are the prior; the teal contours show the denoising posterior $p_{\mathbf{X}_0 \mid \mathbf{X}_t = \mathbf{x}_t}$. The red diamond is the Monte-Carlo estimate $\widehat{\mathbb{E}[\mathbf{X}_0 \mid \mathbf{X}_t = \mathbf{x}_t]}$ that is obtained by averaging the samples from the Gibbs chain, and the blue dashed arrow is the corresponding estimation of the score through Tweedie’s formula.
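The same construction can be sanity-checked in a 1-D Gaussian toy model, where the denoising posterior is itself Gaussian and exactly samplable (standing in here for the Gibbs sampler) and the exact score is known in closed form; all parameter values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(4)

# Monte-Carlo score via Tweedie in a 1-D Gaussian toy model.
s2 = 0.25                         # prior variance: X0 ~ N(0, s2)
a = np.exp(-0.5 * 1.0)            # alpha(t) with \int_0^t beta(s) ds = 1
sig2 = 1.0 - a**2                 # sigma^2(t)
x_t = 1.3

var_t = a**2 * s2 + sig2          # marginal variance of X_t
post_mean = a * s2 * x_t / var_t  # denoising-posterior mean
post_var = s2 * sig2 / var_t      # denoising-posterior variance

# Average exact posterior samples, then apply Tweedie's formula.
samples = rng.normal(post_mean, np.sqrt(post_var), size=200_000)
score_mc = (a * samples.mean() - x_t) / sig2

# Exact score of X_t ~ N(0, var_t) for comparison.
score_exact = -x_t / var_t
```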
Results
We benchmark three algorithms — C-DPS4, DiffPIR5, and DPnP6 — across four inverse problems (denoising, deconvolution, inpainting, partial Fourier recovery) and all our increment distributions.
For each increment distribution $p_\mathrm{inc} \in \{\mathrm{Gauss}, \mathrm{Lap}, \mathrm{St}(1), \mathrm{St}(2), \mathrm{St}(3), \mathrm{BL}\}$, the benchmarking pipeline is:
- Synthesize a large test set of signals $\mathbf{x}_i$ from the prior $p_{\mathbf{X}} = \prod_{k=1}^d p_\mathrm{inc}([\mathbf{D}\,\cdot\,]_k)$.
- For each forward operator $\mathbf{A} \in \{\mathbf{I},\, (\mathbf{k}\ast\cdot),\, \mathbf{M},\, \mathbf{MF}\}$:
- Simulate measurements $\mathbf{y}_i = \mathbf{A}\mathbf{x}_i + \mathbf{n}_i$.
- Run the Gibbs sampler to draw gold-standard posterior samples $\{\mathbf{x}^{\mathrm{ref}}_{i,j}\}_j \sim p_{\mathbf{X} \mid \mathbf{Y} = \mathbf{y}_i}$.
- Tune each DPS algorithm’s hyperparameters on a held-out validation set with respect to the MMSE gap.
- Run each DPS algorithm to obtain samples $\{\mathbf{x}^{\mathrm{DPS}}_{i,j}\}_j$.
- Compare $\{\mathbf{x}^{\mathrm{ref}}_{i,j}\}_j$ against $\{\mathbf{x}^{\mathrm{DPS}}_{i,j}\}_j$ (and the ground-truth $\mathbf{x}_i$) using our metrics.
Metric 1: MMSE optimality gap
We compute the reference MMSE estimate from the Gibbs samples, average the DPS samples to get their estimate, and measure the distance between the two. The visualization illustrates the gap: gray contours show the prior, teal contours show the true posterior, teal dots are reference Gibbs samples (and their mean), orange dots are algorithm samples (and their mean). The gap between the two means is the MMSE gap.
Result: DiffPIR has the smallest MMSE gap, winning in 19 out of 24 settings.
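The gap computation itself is tiny; here is a sketch with a synthetic 2-D posterior and a hypothetical sampler that is biased by 0.2 in the first coordinate.

```python
import numpy as np

rng = np.random.default_rng(5)

# Both sample sets target the same posterior, but the hypothetical DPS
# sampler is biased; the MMSE gap is the distance between the two means.
mu_post = np.array([1.0, -0.5])
ref_samples = rng.multivariate_normal(mu_post, 0.1 * np.eye(2), size=5000)
dps_samples = rng.multivariate_normal(
    mu_post + np.array([0.2, 0.0]), 0.05 * np.eye(2), size=5000
)

mmse_gap = np.linalg.norm(ref_samples.mean(axis=0) - dps_samples.mean(axis=0))
```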
Metric 2: Posterior calibration
For a threshold $t$, we check what fraction of test signals are “covered” by the $t$-th percentile of the DPS algorithm’s empirical distribution — a calibrated algorithm covers a fraction $1 - t$ of signals. The interactive visualization below shows how coverage is measured. Step through with the buttons to see the procedure: draw posterior samples, build the empirical CDF, mark the 20th-percentile cutoff, check whether the true signal falls in the top 80%, and aggregate across the dataset.
Result: DPnP has the best calibration. More importantly, we find that all three algorithms are generally overconfident — they have too little variance and cluster around their modes. This failure mode is invisible to point-estimator-based evaluations.
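The coverage procedure above can be sketched in a few lines; in this 1-D toy version the "algorithm" samples from the true per-case posterior, so it is perfectly calibrated and coverage at $t = 0.2$ should land near $0.8$ (an overconfident sampler with too little variance would cover fewer signals).

```python
import numpy as np

rng = np.random.default_rng(6)

# For each test case: draw the truth and the algorithm's samples from the
# same posterior, mark the t-th percentile, and check whether the truth
# falls in the top 1 - t of the empirical distribution.
n_cases, n_samples, t = 2000, 500, 0.2
coverage_hits = 0
for _ in range(n_cases):
    mu = rng.normal()                    # hypothetical per-case posterior mean
    truth = rng.normal(mu, 1.0)
    samples = rng.normal(mu, 1.0, n_samples)
    cutoff = np.quantile(samples, t)     # t-th percentile of empirical dist.
    coverage_hits += truth >= cutoff
coverage = coverage_hits / n_cases
```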
Metric 3: Score accuracy matters — but not always the way you expect
By systematically increasing the quality of the score estimate (using more Gibbs samples for the Monte Carlo approximation), we can measure how each algorithm responds.
Result: C-DPS frequently degrades as the score improves. Specifically, when using hyperparameters tuned on a lower-quality score, increasing the score quality made C-DPS perform worse in 7 out of 11 statistically significant comparisons. DPnP and DiffPIR improved in most cases. This implies that C-DPS’s hyperparameters are implicitly compensating for score inaccuracy, and that results reported with a weaker score model may not transfer to a better one.
Why this matters
Our benchmark enables the kind of diagnostics that standard evaluations cannot provide:
- Distinguish failure modes: is the problem the prior/model, the sampling algorithm, or its hyperparameters?
- Distribution-level diagnostics: reveal overconfidence, underconfidence, or operator-specific brittleness that point-estimator metrics hide.
- Actionable decisions: know whether improving the score model will actually help, or whether the algorithm needs to be re-tuned.
The code is available on GitHub with a Dockerized runtime that includes all our fast samplers and line-by-line instructions to replicate the results. The paper is on arXiv.
References
1. Taken from the CSTR VCTK Corpus.
2. Here, the noise is simulated as Gaussian, but our benchmark is not restricted to that: it can handle any noise whose distribution is representable as an (infinite-component) Gaussian mixture.
3. You’ll just have to trust me that this implies that the distribution of the original signal is necessarily non-Gaussian too.
4. H. Chung, J. Kim, M. T. McCann, M. L. Klasky, J. C. Ye. Diffusion Posterior Sampling for General Noisy Inverse Problems. ICLR, 2023.
5. Y. Zhu, K. Zhang, J. Liang, J. Cao, B. Wen, R. Timofte, L. Van Gool. Denoising Diffusion Models for Plug-and-Play Image Restoration. CVPRW, 2023.
6. X. Xu, Y. Chi. Provably Robust Score-Based Diffusion Posterior Sampling for Plug-and-Play Image Reconstruction. NeurIPS, 2024.