In the recent paper Bigger Isn’t Always Better: Towards a General Prior for Medical Image Reconstruction (I know, quite the sensational title) we investigated why diffusion models can be used to recover images from incomplete data when there is a significant shift between the training distribution and the inference distribution (that is, the distribution from which the underlying signals are drawn in the recovery problem). This is quite a common application; in fact, a major contribution in one of the earliest papers that utilized diffusion models for solving inverse problems in medical imaging—Score-based diffusion models for accelerated MRI by Hyungjin Chung and Jong Chul Ye—was to show that you can use a model trained on reference root-sum-of-squares (magnitude) reconstructions to reconstruct the real and imaginary parts of individual coil images. Comparing individual coil images to the reference magnitude image reveals significant differences, indicating that they are not drawn from the “same distribution”:

The reference magnitude image: the 17th slice in file1000005.h5 in FastMRI. Intensity range [0.06, 7.19], mapped from black to white.

Real part of the image as seen by the 15th coil. Intensity range [-4.02, 0.27], mapped from black to white.
Globally, the reference RSS reconstruction and the real part of the 15th coil image differ significantly; the coil image is much more localized and has a completely different value range1. The robustness of these types of recovery algorithms was independently observed in Robust Compressed Sensing MRI with Deep Generative Priors by Ajil Jalal, Marius Arvinte, Giannis Daras, Eric Price, Alexandros G. Dimakis and Jonathan I. Tamir. They applied a diffusion model trained on brain MRIs to recovery problems involving knee and abdomen scans with great success, achieving state-of-the-art results and even beating end-to-end methods on the data those methods were trained on.
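For concreteness, here is a minimal sketch (using h5py and numpy; the file name, slice, and coil indices follow the figures above, and the HDF5 layout follows the FastMRI convention) of how the RSS reference and a single-coil image relate:

```python
import h5py
import numpy as np

# Minimal sketch, not the paper's code: load multi-coil k-space from a FastMRI
# file and form the root-sum-of-squares (RSS) magnitude reference. Normalization
# and cropping are omitted.
with h5py.File("file1000005.h5", "r") as f:
    kspace = f["kspace"][17]  # 17th slice, complex array of shape (num_coils, H, W)

# Per-coil complex images via a centered 2D inverse FFT.
coil_imgs = np.fft.fftshift(
    np.fft.ifft2(np.fft.ifftshift(kspace, axes=(-2, -1)), axes=(-2, -1)),
    axes=(-2, -1),
)

# RSS reference: one non-negative magnitude image combining all coils ...
rss = np.sqrt(np.sum(np.abs(coil_imgs) ** 2, axis=0))

# ... versus the real part of a single coil image, which is localized and lives
# on a very different value range.
coil15_real = coil_imgs[15].real
```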
I was never quite comfortable with these results. I understand that out-of-distribution application is crucial in practice. But why should a brain prior work well for knee scans? Intuitively, I would expect to recover brains (or at least, “brain features”) when using this prior.
Then, I remembered the classic papers of Jinggang Huang and David Mumford (yes, that David Mumford), like Statistics of natural images and models. There, they showed that natural images actually all look almost the same locally. Figure 1 in our paper shows that the distribution of image gradients (i.e., differences of neighboring pixels, a “local” statistic) is quite similar between datasets: the right two plots show the distributions of horizontal (top) and vertical (bottom) image gradients, which are similar between datasets. In contrast, the pixel-wise means of the different datasets (a “global” statistic), shown in the left three columns, differ significantly.
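As a minimal sketch of what such a local statistic looks like in code (illustrative only; the exact preprocessing behind Figure 1 is described in the paper, and the intensity scaling and binning here are assumptions):

```python
import numpy as np

def gradient_histograms(images, bins=np.linspace(-1.0, 1.0, 201)):
    """Histograms of horizontal and vertical neighboring-pixel differences.

    `images` is an array of shape (N, H, W) with intensities scaled to [0, 1].
    """
    dx = images[:, :, 1:] - images[:, :, :-1]  # horizontal gradients
    dy = images[:, 1:, :] - images[:, :-1, :]  # vertical gradients
    hx, _ = np.histogram(dx, bins=bins, density=True)
    hy, _ = np.histogram(dy, bins=bins, density=True)
    return hx, hy

# Computing these histograms for, say, CelebA faces and FastMRI knee slices
# gives very similar curves (a local statistic), while the pixel-wise mean
# images (a global statistic) look entirely different.
```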
This led us to hypothesize that “out-of-distribution” reconstructions of diffusion models are largely driven by local statistics, and that more “local” models should exhibit even more robustness to distribution shifts. To test this hypothesis, we proposed a simple experiment: We took the standard UNet architecture used in diffusion models, trained on the CelebA HQ dataset2, and monitored the reconstruction performance while successively stripping away layers. (Just to be clear: The models were independently trained; this will become important later.) The number of layers is encoded by the “depth” \( d \); smaller means more local.
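To make the notion of “depth” concrete, here is a minimal sketch (in PyTorch, which is an assumption; the actual models are the standard diffusion UNet with time embeddings, residual blocks, and, for \( d = 4* \), attention) of how the number of down-/up-sampling stages controls how local the model is:

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Stripped-down UNet: `depth` is the number of times the resolution is
    halved. Fewer stages mean a smaller receptive field, i.e. a more "local"
    model. Sketch only -- not the architecture used in the paper."""

    def __init__(self, depth: int = 2, channels: int = 32):
        super().__init__()
        self.inc = nn.Conv2d(1, channels, 3, padding=1)
        self.downs = nn.ModuleList(
            [nn.Conv2d(channels, channels, 3, stride=2, padding=1) for _ in range(depth)]
        )
        self.ups = nn.ModuleList(
            [nn.ConvTranspose2d(channels, channels, 4, stride=2, padding=1) for _ in range(depth)]
        )
        self.outc = nn.Conv2d(channels, 1, 3, padding=1)

    def forward(self, x):
        x = torch.relu(self.inc(x))
        skips = []
        for down in self.downs:          # encoder: halve the resolution `depth` times
            skips.append(x)
            x = torch.relu(down(x))
        for up in self.ups:              # decoder: restore the resolution with skips
            x = torch.relu(up(x)) + skips.pop()
        return self.outc(x)
```

Removing stages shrinks the receptive field, which is the locality measure used in the paper4.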
Here are the results (part of Figure 3 in the paper):3
I know, boring. The key takeaways are:
- Shallower, more local models4 generally perform as well as deeper, more global models on in-distribution reconstruction. This can be seen in the first column of the first and second row: moving right (towards more local models), the curves do not drop. If deeper models were better, there would be a significant downward trend.
- Shallower, more local models are more robust to distribution shifts. This can be seen from the general upward trend in the second and third columns of the first and second rows.
- Models trained on CelebA are effectively as good at reconstructing FastMRI knee images as models trained on FastMRI knee images. This can be seen by the “Delta in PSNR” and “Delta in SSIM” plots in the first column of the third and fourth row. If the models trained on FastMRI knee images were better, the lines would be significantly below 0.
- Models trained on CelebA are better at reconstructing FastMRI knee images with different contrast and FastMRI brain images than models trained on FastMRI knee images. This can be seen by the “Delta in PSNR” and “Delta in SSIM” plots in the second and third column of the third and fourth row.
Another important but unrelated observation is that models with an attention block ("\(d=4*\)", the standard model used in the literature) are almost never the best-performing ones.
Again, I was not really comfortable with these results. Although we are in the noise-free case, the reconstruction algorithm of Chung and Ye doesn’t project onto the data, but uses gradient updates on the data term, whose weight, \( \lambda \), cannot be larger than 1 for stability reasons. Between 0 and 1, there must be a regime in which the prior is “stronger” than the likelihood, causing the reconstruction to show celebrity faces. Therefore, we tested a range of values for the trade-off parameter between 0 and 1, leaving everything else (importantly, also the initial noise and the RNG’s seed) unchanged. You can play around with it using the slider below:
For comparison, here is the naive reconstruction without regularization:

In line with expectations, there is a regime of \( \lambda \) where we get celebrity faces in the reconstruction, e.g. with \( \lambda = 0.005 \). With \( \lambda = 1 \), the standard choice of Chung and Ye, all models produce surprisingly good reconstructions, with a few high-frequency artifacts here and there (does it want the fat and muscle tissue to look like hair? 🤔).
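To make the role of \( \lambda \) more concrete, here is a minimal sketch of a \( \lambda \)-weighted data-consistency step of the kind described above (numpy; illustrative only, not Chung and Ye’s exact implementation, which interleaves such updates with the reverse-diffusion prior steps):

```python
import numpy as np

def data_consistency_step(x, y, mask, lam):
    """One lambda-weighted gradient step on the data term for undersampled MRI.

    x    : current complex image estimate, shape (H, W)
    y    : measured k-space, zero outside the sampling mask
    mask : boolean sampling mask in k-space
    lam  : trade-off weight; lam = 0 ignores the measurements entirely
           (unconditional generation), lam = 1 replaces the sampled
           frequencies with the measured ones.
    """
    k = np.fft.fft2(x)
    k = k + lam * mask * (y - k)   # move the sampled frequencies towards y
    return np.fft.ifft2(k)
```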
However, what I find most interesting in this experiment is actually what happens in the unconditional case, when \( \lambda = 0 \): Although the models differ significantly in their architecture5, they generate “the same” face given the same boundary conditions. Even in the \( d = 2 \) case, where the model’s receptive field is not large enough to capture long-range dependencies and hence the generated image only looks like a face “locally”, the generated features (the eyes and hair) strongly resemble the features of the face generated by the other models.
I found this extremely surprising and examined the literature to see if others had observed this phenomenon and whether research existed on the topic. It turns out that Zahra Kadkhodaie, Florentin Guth, Eero P. Simoncelli, and Stéphane Mallat did a related experiment in Generalization in diffusion models arises from geometry-adaptive harmonic representations (Figure 2): Instead of varying the model’s architecture, they partitioned the training set into two disjoint sets and trained two models that shared the same architecture. They found that the two models generate “the same” face given the same boundary conditions when the size of the training dataset is sufficiently large. When the dataset is too small, the models tend to memorize instead of generalizing. The regime change between memorization and generalization has also been studied more generally (different perturbations, different architectures) by Huijie Zhang, Jinfan Zhou, Yifu Lu, Minzhe Guo, Peng Wang, Liyue Shen, and Qing Qu in The Emergence of Reproducibility and Consistency in Diffusion Models.
Interestingly, the work seems to have been rejected for ICLR 2024 but apparently accepted to ICML 2024. In the OpenReview discussion for ICLR 2024, there was a debate about whether this generalization behavior is “obvious”, as the models aim to learn the same thing. While I understand this argument for the memorization regime, to me it is anything but obvious for the generalization regime.
What remains unclear is why the models exhibit a useful form of generalization in the first place. If the models learned exactly what they are trained to learn (e.g., a KDE with Gaussian kernels of ever-growing variance in the case of the VE SDE), they would actually be useless, as the generation process would just yield the training data back (spelled out below). If the models learned some interpolation in image space, they would also be useless, as such interpolations are known to yield bad results. Somehow, the generalization that these models provide is “just right”. I believe the buzzword that relates to this is inductive bias, but I need to do more research on that. What I wonder is: Have we simply stumbled upon the correct architectures and training configurations that provide exactly the right inductive bias, or is there more to it?
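To spell out the KDE point: for the VE SDE, the noise-smoothed distribution of a finite training set \( \{x_i\}_{i=1}^N \) is exactly such a KDE, and the score the model is asked to learn has a closed form,

\[
p_\sigma(x) = \frac{1}{N}\sum_{i=1}^{N} \mathcal{N}\!\left(x;\, x_i,\, \sigma^2 I\right), \qquad
\nabla_x \log p_\sigma(x) = \sum_{i=1}^{N} w_i(x)\,\frac{x_i - x}{\sigma^2}, \qquad
w_i(x) = \frac{e^{-\|x - x_i\|^2/(2\sigma^2)}}{\sum_{j} e^{-\|x - x_j\|^2/(2\sigma^2)}}.
\]

As \( \sigma \to 0 \), the weights \( w_i \) concentrate on the nearest training example, so a reverse process driven by this exact score collapses onto the training set instead of generating anything new.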
1. Any reasonable model of the distribution of magnitude images should include a non-negativity constraint. ↩︎
2. Downsampled to \( 320 \times 320 \) to match the resolution of the MRI images. ↩︎
3. In each graph, the model size decreases from left to right; the models become more local. Differently colored lines indicate different sub-sampling patterns; the specifics are not important for the sake of this discussion. Results in the first two rows are obtained by models trained on the FastMRI knee dataset with CORPD contrast; results in the last two rows are obtained by models trained on the CelebA dataset. ↩︎
4. We measure the “locality” of the models by their receptive field; see Figure 2 in the paper. ↩︎
5. The details can be found in the paper. For example, the model with \( d=2 \) has 23.9M parameters, whereas \( d=4* \) has 61.4M parameters, parts of which are in an attention block that is not present in the others. ↩︎