In my thesis titled “Generative Models as Regularizers for Inverse Problems in Imaging” (which you can find on GitHub), chapter 4 deals with the historical development of regularizers in imaging. In this post, I revisit this chapter and summarize about forty years of research on the topic.
A running example: MRI reconstruction
To isolate the influence of the regularizer on the reconstruction problem, we will commit an inverse crime and construct synthetic data from a reference reconstruction.
In detail, we will use a reference root-sum-of-squares (RSS) reconstruction from the fastMRI dataset, and construct the data by sampling the two-dimensional discrete Fourier transform along vertical lines.
To simulate a sped-up acquisition, we only acquire every fourth line on average (the lines are not equispaced), with a few low-frequency lines densely sampled to facilitate reconstruction.
Formally, we denote with \( x \in \mathbb{R}^{320 \times 320} \) the reference RSS reconstruction, which was previously mapped to the interval \( [0, 1] \), and
\[
y = MFx + n,
\]
where \( F \) is the two-dimensional discrete “real” Fourier transform (think torch.fft.rfft2 in PyTorch), \( M \) is an operator that selects one fourth of the frequencies, and \( n \) is white Gaussian noise with variance \( 2\times 10^{-2} \).
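As a rough sketch of this data simulation in PyTorch (the variable names and the exact mask construction are my own; `x` is just a placeholder for the reference RSS slice scaled to \( [0, 1] \)):

```python
import torch

torch.manual_seed(0)

x = torch.rand(320, 320)  # placeholder for the reference RSS slice, scaled to [0, 1]

# Two-dimensional "real" Fourier transform and its inverse (orthonormal scaling).
F = lambda u: torch.fft.rfft2(u, norm="ortho")
F_inv = lambda k: torch.fft.irfft2(k, s=x.shape, norm="ortho")

# Sampling mask over the columns of the rfft2 grid: a few densely sampled
# low-frequency lines plus, on average, every fourth of the remaining ones.
n_cols = F(x).shape[-1]
mask = torch.zeros(n_cols)
mask[: n_cols // 25] = 1.0                 # densely sampled low frequencies
mask[torch.rand(n_cols) < 0.25] = 1.0      # roughly one fourth of all lines

# Synthetic data y = M F x + n; unacquired lines are simply stored as zeros,
# so applying M^* later is implicit.
noise = (2e-2) ** 0.5 * torch.randn(x.shape[0], n_cols, dtype=torch.cfloat)
y = mask * (F(x) + noise)
```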

The real-valued reference image \( x \); The 17th slice in file1000005.h5 in fastMRI.

Logarithm of the absolute value of the data \( y \). Black lines are not acquired.
The naive solution
If we ignore the role of regularizers (more on those soon), we can reconstruct the image by filling in the missing frequencies with zeros. This approach is known as the zero-filling solution: \[ x_{\text{zf}} = F^* M^* y \]
In plain terms, we’re just taking the partial data, padding it with zeros, and transforming it back to the image domain. While it’s computationally cheap, the results are far from perfect. The reconstructed image suffers from typical back-folding artifacts (repeating structures that shouldn’t be there) and noise. In essence, the reconstruction is useless for clinical diagnosis:

The real-valued reference image \( x \); The 17th slice in file1000005.h5 in fastMRI.

Naive reconstruction obtained from the Fourier inversion of the zero-filled data.
Importantly, the zero-filling solution \( x_{\text{zf}} \) is a minimizer of the least-squares objective \[ f(x) = \lVert MFx - y \rVert^2 / 2, \] which is evident by noting that Fermat’s condition at \( F^* M^* y \), \[ \nabla f(F^* M^* y) = F^* M^* (MFF^* M^* y - y) = 0, \] is fulfilled since \( MFF^* M^* \) is the identity operator (on the space on which \( M \) operates).
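In the sketch from above, the zero-filling reconstruction is a single inverse transform (applying \( M^* \) is implicit, since the unacquired lines of \( y \) are already stored as zeros):

```python
# Zero-filling reconstruction: inverse real FFT of the zero-padded data.
x_zf = F_inv(y)
print(f"relative error: {(x_zf - x).norm() / x.norm():.3f}")
```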
Classical variational penalties
A historically extremely popular method to improve the reconstruction is regularization. Here, regularizing is understood as imposing some form of regularity onto the reconstruction, such as: “The reconstruction should have small magnitude.” or “Neighboring pixels of the reconstruction shouldn’t be drastically different.” This regularization is realized by augmenting the least-squares problem with a regularizer \( R \), penalizing deviations from these desired properties.
Magnitude penalization
As an example, in the regularized least-squares objective \[ \lVert MFx - y \rVert^2 / 2 + R(x) \] the desired property “small magnitude” is easily realized by setting \( R = \lambda \lVert \cdot \rVert^2 / 2 \). This choice of regularization is extremely popular in many fields; as an example, it is the classical weight decay in machine learning.
It turns out that it is completely useless in our application. It is easily checked through Fermat’s condition that \[ (1 + \lambda)^{-1} F^* M^* y, \] the zero-filling solution \( x_{\text{zf}} = F^* M^* y \) shrunk by a factor \( (1 + \lambda)^{-1} \), minimizes the regularized least-squares objective. There is no point in showing this solution here; it is just a dimmed version of the naive reconstruction above.
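In code, this closed-form minimizer is a one-liner (reusing the names from above; the value of \( \lambda \) is purely illustrative):

```python
lam = 1.0                        # illustrative regularization weight
x_ridge = F_inv(y) / (1 + lam)   # the zero-filling solution, uniformly dimmed
```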
Penalizing differences between neighboring pixels
A much more fruitful approach lies in penalizing the differences between neighboring pixels. For an image \( x \in \mathbb{R}^{m \times n} \), regularizers of this type can be written as \[ R(x) = \lambda \bigl( \phi(x_{2, 1} - x_{1, 1}) + \cdots + \phi(x_{m, 1} - x_{m - 1, 1}) + \phi(x_{2, 2} - x_{1, 2}) + \cdots + \phi(x_{m, n} - x_{m-1, n})\bigr). \] There, the value of a potential function \( \phi : \mathbb{R} \to \mathbb{R} \) is summed up over the differences of vertically neighboring pixels (and analogously over the horizontal ones). More succinctly, we can define vertical and horizontal forward finite difference operators, \( D_v, D_h : \mathbb{R}^{m \times n} \to \mathbb{R}^{m \times n} \) by \[ \begin{aligned} (D_v x)_ {i, j} &= \begin{cases} x_{i + 1,j} - x_{i,j} & \text{if}\ 1 \leq i < m, \\ 0 & \text{else}, \end{cases}\\ (D_h x)_ {i, j} &= \begin{cases} x_{i,j + 1} - x_{i,j} & \text{if}\ 1 \leq j < n, \\ 0 & \text{else}, \end{cases} \end{aligned} \] summarize them in \[ D : \mathbb{R}^{m \times n} \to \mathbb{R}^{m \times n \times 2} : (Dx)_ {i, j, 1} = (D_v x)_ {i, j}\ \text{and}\ (Dx)_ {i, j, 2} = (D_h x)_ {i, j} \] and write \[ R(x) = \lambda \sum_{i, j = 1}^{m, n} \sum_{k=1}^{2} \phi\bigl((D x)_ {i, j, k}\bigr). \]
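A possible PyTorch implementation of \( D \) and its adjoint \( D^* \) (my own sketch, following the definitions above; we will reuse both below):

```python
def D(u):
    # Forward finite differences, mapping (m, n) -> (m, n, 2) as in the text.
    dv = torch.zeros_like(u)
    dh = torch.zeros_like(u)
    dv[:-1, :] = u[1:, :] - u[:-1, :]   # vertical differences
    dh[:, :-1] = u[:, 1:] - u[:, :-1]   # horizontal differences
    return torch.stack((dv, dh), dim=-1)

def D_adj(p):
    # Adjoint of D (a negative divergence with the matching boundary handling).
    dv, dh = p[..., 0], p[..., 1]
    out = torch.zeros_like(dv)
    out[:-1, :] -= dv[:-1, :]
    out[1:, :] += dv[:-1, :]
    out[:, :-1] -= dh[:, :-1]
    out[:, 1:] += dh[:, :-1]
    return out
```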
When we use the quadratic function \( x \mapsto x^2/2 \), loved for its smoothness, we can write even more succinctly \[ R(x) = \lambda \lVert Dx \rVert_{2, 2}^2 / 2. \] The solution to the regularized least-squares objective can be efficiently found by solving the linear system \[ (F^*M^*MF + \lambda D^*D)x = F^*M^*y \] arising from Fermat’s condition using conjugate gradients.
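A minimal conjugate-gradient sketch for this linear system, reusing the operators from above (the iteration count and \( \lambda \) are illustrative, and torch.fft.irfft2 stands in for \( F^* \)):

```python
def normal_op(u, lam):
    # Left-hand side of Fermat's condition: (F^* M^* M F + lam D^* D) u.
    return F_inv(mask * F(u)) + lam * D_adj(D(u))

def conjugate_gradients(A, b, n_iter=50):
    # Textbook CG for the symmetric positive semidefinite system A(u) = b.
    u = torch.zeros_like(b)
    r = b - A(u)
    p = r.clone()
    rs = (r * r).sum()
    for _ in range(n_iter):
        Ap = A(p)
        alpha = rs / (p * Ap).sum()
        u = u + alpha * p
        r = r - alpha * Ap
        rs_new = (r * r).sum()
        p = r + (rs_new / rs) * p
        rs = rs_new
    return u

lam = 1e-1                                                    # illustrative value
x_quad = conjugate_gradients(lambda u: normal_op(u, lam), F_inv(y))
```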
The results are somewhat underwhelming:

The real-valued reference image \( x \); The 17th slice in file1000005.h5 in fastMRI.

Reconstruction using quadratic penalization of pixel differences
Although the back-folding artifacts are less severe, the image lacks sharp edges and appears generally washed out. The problem lies in the choice of the potential \( \phi \): the quadratic function strongly penalizes large differences, while small differences around zero are penalized only weakly.
In a way, this is the opposite of what we want: neighboring pixels should have either very similar intensities, corresponding to homogeneous regions in the image, or large differences, corresponding to edges in the image. A better fit to this idea is the absolute value potential, \( \phi = |\,\cdot\,| \).1 With this, we can write down the celebrated (anisotropic) total variation regularizer as \[ R(x) = \lambda \lVert Dx \rVert_{1, 1}. \]
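With the difference operator from above, evaluating this regularizer is a single line (the weight is illustrative):

```python
def total_variation(u, lam=1e-2):
    # Anisotropic total variation: weighted sum of absolute finite differences.
    return lam * D(u).abs().sum()
```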
Optimizing the penalized least-squares objective under this choice of regularizer poses a significant challenge due to its nonsmoothness. The primal-dual hybrid gradient algorithm, which we might explore in a different post, is widely used to solve nonsmooth problems of this sort (a rough sketch follows below), and the reconstruction we obtain looks significantly better:

The real-valued reference image \( x \); The 17th slice in file1000005.h5 in fastMRI.

Reconstruction using absolute penalization of pixel differences
However, the reconstruction is far from perfect. It has a patchy appearance due to the staircasing phenomenon and is missing many small details such as the blood vessels. The back-folding artifacts are also not entirely removed, but increasing the regularization strength would result in a further loss of detail.
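For the curious, here is a rough primal-dual sketch for the TV-regularized problem. It follows the spirit of PDHG but, for simplicity, handles the smooth data term by its gradient rather than its proximal map (a Condat–Vũ-type variant); the step sizes and the weight are illustrative and not tuned:

```python
lam, tau, sigma = 1e-2, 0.125, 0.5        # illustrative, untuned parameters

x_tv = F_inv(y)                           # warm start from the zero-filling solution
x_bar = x_tv.clone()
p = torch.zeros(*x_tv.shape, 2)           # dual variable associated with Dx

for _ in range(200):
    # Dual ascent: prox of (lam * ||.||_1)^*, i.e. a pointwise clamp to [-lam, lam].
    p = torch.clamp(p + sigma * D(x_bar), -lam, lam)
    # Primal descent: gradient of the data term plus the transported dual variable.
    grad_data = F_inv(mask * F(x_tv) - y)
    x_new = x_tv - tau * (grad_data + D_adj(p))
    # Extrapolation of the primal variable.
    x_bar = 2 * x_new - x_tv
    x_tv = x_new
```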
Fields of experts
A generalization of this approach lies in the celebrated “filters, random fields, and maximum entropy” (FRAME)-type models due to Song Chun Zhu, Yingnian Wu, and the legendary Fields Medalist David Mumford. There, the contributions of potential functions applied to linear filter responses are summed: \[ R(x) = \lambda \sum_ {i,j=1}^{m, n} \sum_ {k=1}^o \phi_k \bigl( (K_k x)_ {i, j} \bigr) \] These models are also called fields-of-experts in the image reconstruction community or ridge regularizers in communities closer to classical signal processing. In contrast to the previous regularizers, the convolution operators \( K_1, K_2, \dotsc, K_o \) consider a larger pixel neighborhood and are each endowed with their own potential acting on the responses. Both the kernels of the convolution operators and the potential functions can be learned from data.
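To make this structure concrete, here is a minimal fields-of-experts-style regularizer in PyTorch. The filters and the potential are random stand-ins rather than learned ones, and, unlike in the general model, a single potential is shared across all filters for brevity:

```python
import torch.nn.functional as nnf  # aliased to avoid clashing with the transform F above

n_filters, kernel_size = 8, 7
kernels = 0.1 * torch.randn(n_filters, 1, kernel_size, kernel_size)  # stand-in for learned filters

def potential(t):
    # Stand-in for a learned potential: smooth and heavy-tailed.
    return torch.log1p(t ** 2)

def foe_regularizer(u, lam=1e-2):
    # Filter responses (K_k u) for all k at once, then sum the potentials.
    responses = nnf.conv2d(u[None, None], kernels, padding=kernel_size // 2)
    return lam * potential(responses).sum()
```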
Research on these types of models is extremely extensive. Here, we look at two models of this type from one of my recent papers, Product of Gaussian Mixture Diffusion Models. The first model is learned such that the potential functions match the negative-log empirical distribution of filter responses.2 We solve the smooth but nonconvex optimization problem with nonconvex FISTA. The reconstruction is slightly more pleasant to look at:

The real-valued reference image \( x \); The 17th slice in file1000005.h5 in fastMRI.

Reconstruction using potentials derived from the empirical distribution of filter responses
However, it again appears washed out, and small details such as the blood vessels are missing. A visually significant difference comes when we learn this regularizer generatively:3 here, the image is sharp and details such as the blood vessels are recovered.

The real-valued reference image \( x \); The 17th slice in file1000005.h5 in fastMRI.

Reconstruction using the generatively learned fields-of-experts regularizer
Deep neural regularizers
Finally, we can abandon the fields-of-experts structure and delve into the era of deep learning, making the regularizer a deep CNN with \( l = 6\): \[ R(x, \theta) = \bigl( |\,\cdot\,| \circ W_{l + 1} \circ L_l \circ L_{l-1} \circ \ldots \circ L_2 \circ L_1 \bigr)(x), \] where the layers have the structure \[ L_i(x) = \mathrm{ReLU}(\widetilde{W}_ i\,\mathrm{ReLU}(W_ i x)) \] with \( 3 \times 3 \) convolutions \( \widetilde{W}_ i \) and \( W_ i \), where \( \widetilde{W}_ i \) has stride two, and \( W_{l + 1} \) is a fully connected layer mapping to a scalar. I point out in my thesis that the gradient of this object is a UNet with particular constraints; the UNet is an extremely popular architecture for modeling the score in diffusion models these days.
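A sketch of this architecture in PyTorch (the channel width is my own choice, not the one from the thesis):

```python
import torch
import torch.nn as nn

class DeepRegularizer(nn.Module):
    # l blocks L_i(x) = ReLU(W~_i ReLU(W_i x)) with 3x3 convolutions, W~_i with
    # stride two, followed by a fully connected map to a scalar and an absolute value.
    def __init__(self, l=6, width=32):
        super().__init__()
        blocks, in_ch = [], 1
        for _ in range(l):
            blocks += [
                nn.Conv2d(in_ch, width, 3, padding=1),            # W_i
                nn.ReLU(),
                nn.Conv2d(width, width, 3, stride=2, padding=1),  # W~_i (stride two)
                nn.ReLU(),
            ]
            in_ch = width
        self.features = nn.Sequential(*blocks)
        self.head = nn.LazyLinear(1)                              # W_{l+1}

    def forward(self, u):
        return self.head(self.features(u).flatten(start_dim=1)).abs().squeeze(-1)

R_deep = DeepRegularizer()
print(R_deep(torch.rand(1, 1, 320, 320)))  # one nonnegative scalar per image
```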
As before, we train the regularizer generatively, enabling it to capture local and global statistics of the underlying dataset.
We can even visualize these statistics by utilizing Markov chain Monte Carlo methods to draw samples from its distribution:
Sampling the distribution of the regularizer reveals preferred image structures. In addition, we can visualize the landscape of the distribution by projecting it onto two dimensions. The golden path is the MCMC trajectory.
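As an illustration of what such sampling could look like in code, here is an unadjusted Langevin iteration targeting (approximately) the distribution proportional to \( \exp(-R) \); the step size and iteration count are illustrative, and the thesis uses more careful sampling schemes:

```python
def langevin_sample(R, n_steps=1000, step=1e-4, shape=(1, 1, 320, 320)):
    # Unadjusted Langevin algorithm:
    # u_{k+1} = u_k - step * grad R(u_k) + sqrt(2 * step) * standard normal noise.
    u = torch.rand(shape, requires_grad=True)
    for _ in range(n_steps):
        grad, = torch.autograd.grad(R(u).sum(), u)
        with torch.no_grad():
            u = u - step * grad + (2 * step) ** 0.5 * torch.randn_like(u)
        u.requires_grad_(True)
    return u.detach()

sample = langevin_sample(R_deep)
```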
With this rich prior information, the reconstruction becomes significantly better:

The real-valued reference image \( x \); The 17th slice in file1000005.h5 in fastMRI.

Reconstruction using the deep neural regularizer
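For completeness, such a reconstruction can in principle be obtained by plain gradient descent on the variational objective, sketched below with the (here untrained) regularizer module from above; in practice one would of course use a trained model, and likely a more sophisticated optimizer:

```python
def reconstruct(R, y, lam=1.0, step=0.1, n_iter=300):
    # Gradient descent on ||M F u - y||^2 / 2 + lam * R(u), reusing F, F_inv, mask.
    u = F_inv(y).clone().requires_grad_(True)
    for _ in range(n_iter):
        data_term = 0.5 * (mask * F(u) - y).abs().pow(2).sum()
        loss = data_term + lam * R(u[None, None]).sum()
        grad, = torch.autograd.grad(loss, u)
        with torch.no_grad():
            u = u - step * grad
        u.requires_grad_(True)
    return u.detach()

x_deep = reconstruct(R_deep, y)
```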
1. This is often motivated by the absolute value being the best convex fit to the marginal distribution of pixel differences. It is still completely unclear to me if such a construction always makes a good regularizer. ↩︎
2. The filters are also learned here, but they are not too important. ↩︎
3. Instead of letting the potential functions be the negative-log empirical marginals, the regularizer is learned such that the distribution of the filter responses of samples from its distribution reproduces the distribution of the filter responses of the data. I plan on making another post about this in the future. ↩︎