Deep learning initialized compressed sensing (Deli-CS) in volumetric spatio-temporal subspace reconstruction

Theory

Spatio-temporal subspace reconstruction is generally posed as a linear inverse problem in which the acquisition operator, A ∈ ℂMTC × NK, models the transformation from the input subspace coefficient images, x ∈ ℂNK × 1, to the acquired data, b ∈ ℂMTC × 1. Here, T is the total number of TRs, K the number of coefficient images (i.e., the rank of the low-rank subspace), N the number of voxels in image space, M the number of k-space samples per TR, and C the number of coils. The acquisition operator A is then:

$$A = F S \Phi$$

(1)

Here, F ∈ ℂMTC × NTC denotes the forward NUFFT operator [35], S ∈ ℂNTC × NT the projection onto coil sensitivity maps according to the SENSE [36, 37] model and ɸ ∈ ℂNT × NK the linear subspace.

The low-rank subspace ɸ is estimated by taking the Singular Value Decomposition (SVD) of the signal dictionary generated from Bloch simulations using realistic brain tissue parameters.
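The subspace estimation step can be sketched as follows. The exponential-decay dictionary below is only a toy stand-in for the Bloch-simulated signals, and `estimate_subspace` is a hypothetical helper, not the authors' code:

```python
import numpy as np

def estimate_subspace(dictionary, K=5):
    """Estimate a rank-K temporal subspace from a signal dictionary.

    dictionary : (T, n_atoms) array of simulated signal evolutions.
    Returns phi, a (T, K) orthonormal temporal basis (the K leading
    left singular vectors of the dictionary).
    """
    U, _, _ = np.linalg.svd(dictionary, full_matrices=False)
    return U[:, :K]

# Toy stand-in for a Bloch-simulated dictionary: exponential decays
T, n_atoms = 500, 200
t = np.arange(T)[:, None]
rates = np.linspace(1e-3, 5e-2, n_atoms)[None, :]
D = np.exp(-t * rates)

phi = estimate_subspace(D, K=5)
```

Projecting a measured TR-by-TR signal onto `phi` yields the K coefficient images; multiplying by `phi` recovers the full time series, as described above.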

The forward operation ɸx recovers the TR images of the TGAS-SPI-MRF acquisition.

A rank K = 5 subspace was deemed sufficient in capturing the signal variation as per Cao et al. [15].

The linear inverse problem used to solve the subspace reconstruction is as follows:

$$\min_{x} \left( \left\| Ax - b \right\|_{2}^{2} + \lambda \,\mathrm{LLR}\left( x \right) \right)$$

(2)

Here, λ is the regularization value and LLR denotes the locally low-rank constraint.

The LLR operator breaks the reconstructed volume up into blocks and, for each block, performs a singular value decomposition (SVD). In our implementation of the subspace reconstruction, the SVD is applied across the subspace coefficients. The singular values are thresholded according to the λ weighting, enforcing local low-rankness among voxels within each block. The value of λ for this work was chosen based on the value proposed by Cao et al. [15]. Since the optimal λ is determined by the image rather than by the initialization of the iterative reconstruction, the same λ was expected to perform equally well in our deep-learning-initialized reconstruction.
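The block-wise SVD thresholding described above can be sketched as follows; `llr_threshold` is a hypothetical, simplified proximal step using non-overlapping blocks, not the authors' implementation:

```python
import numpy as np

def llr_threshold(x, lam, block=8):
    """Locally low-rank regularization step (sketch): block-wise SVD with
    singular-value soft-thresholding across the subspace coefficients.

    x : (K, nx, ny, nz) stack of subspace coefficient images.
    """
    K = x.shape[0]
    nx, ny, nz = x.shape[1:]
    out = np.zeros_like(x)
    for i in range(0, nx, block):
        for j in range(0, ny, block):
            for l in range(0, nz, block):
                blk = x[:, i:i+block, j:j+block, l:l+block]
                mat = blk.reshape(K, -1)             # K x (voxels in block)
                U, s, Vh = np.linalg.svd(mat, full_matrices=False)
                s = np.maximum(s - lam, 0.0)         # soft-threshold singular values
                out[:, i:i+block, j:j+block, l:l+block] = ((U * s) @ Vh).reshape(blk.shape)
    return out
```

With λ = 0 the operator is the identity; as λ grows, each block is pushed toward a lower-rank representation across the subspace dimension.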

Data acquisition

The TGAS-SPI-MRF acquisition proposed by Cao et al. [15] involves an initial adiabatic inversion pulse followed by a 500 TR long readout train (TI/TE/TR = 20/0.7/12 ms) with varying flip angles (10 to 75 degrees) and a 3D center-out spiral trajectory (Fig. 1A). A water-exciting rectangular RF pulse with a duration of 2.38 ms was used to suppress the fat signal [38]. No gradient delay correction was applied. The gold standard acquisition used 48 readout trains to fully fill k-space. Only 16 readout trains (R = 3) were used for the accelerated 2-min acquisition.

Fig. 1

The Deli-CS pipeline includes two sequences: the MRF sequence, depicted in box A, which shows one MRF group consisting of an inversion pulse followed by a 500 TR acquisition train with flip angles varying from 10 to 75 degrees, acquired with a spiral sampling pattern; and a simple large-FOV GRE sequence, depicted in box B, showing how its FOV was set up in comparison to the MRF FOV. Box C depicts the reconstruction pipeline, starting with the GRE reconstruction and autoFOV to reduce signals outside the MRF FOV (depicted as a red square) and ROVir to further eliminate signals outside the FOV (orange circle). The MRF data are then processed using the coil compression matrix and shifts estimated from the GRE. Elements in blue are only used in the inference pipeline, whereas elements in magenta are only used when training the network

Additionally, a 20 s, low-resolution (6.9 mm³) gradient echo (GRE) pre-scan with a large field-of-view (FOV) of 440 mm was performed (Fig. 1B). This pre-scan was used for automatic FOV shifting and to pre-calculate a coil compression matrix, such that any signal originating from outside the TGAS-SPI-MRF FOV after shifting (e.g. shoulders) could be removed using region-optimized virtual coil compression (ROVir) [39].

Memory optimization

This work uses the Python-based SigPy [40] framework for its ease of use whilst retaining control over GPU memory management. A reduced memory burden benefits both the standard physics-driven iterative reconstruction and the Deli-CS pipeline. To reduce the memory requirements of A, the following optimizations are made:

First, the forward model A is split into smaller sub-models so that each sub-model can be evaluated on the GPU for each subspace coefficient, k, and each channel, c, independently. This is possible because the data from different channels do not interact with each other (they are simply stacked), and because of the second optimization, which avoids expanding into the temporal dimension, T, which is ~ 100 times larger than the subspace dimension, K. Second, the order of the subspace operator ɸ and the transform from image space to k-space is reversed. This is possible because neither S nor F affects the temporal dimension that ɸ acts on.

These two optimizations together result in the following forward model:

$$A = \left[ \begin{array}{cccc} A_{1,1} & A_{2,1} & \cdots & A_{K,1} \\ A_{1,2} & A_{2,2} & \cdots & A_{K,2} \\ \vdots & \vdots & \ddots & \vdots \\ A_{1,C} & A_{2,C} & \cdots & A_{K,C} \end{array} \right]$$

(3)

With

$$A_{k,c} = \Phi_{k} F_{1,2,\ldots,T} S_{c}$$

(4)

Here, ɸk ∈ ℂMT × MT denotes the transform from one subspace component in k-space to its counterpart time series in k-space, F1,2,…,T ∈ ℂMT × N is a NUFFT mapping image space to the k-space locations of all TRs combined, and Sc ∈ ℂN × N denotes the cth coil-sensitivity map. The individual smaller sub-model linear operators, Ak,c ∈ ℂMT × N, can be evaluated one-by-one. The inputs to each sub-model linear operator, xk ∈ ℂN × 1, are first transferred to GPU memory before applying the operator, Ak,cxk. The resulting outputs (k-space time courses weighted by ɸk) are then summed over k (columns in the matrix A) and transferred back into CPU memory before the stacking over coils, c (rows in the matrix A), to keep the memory burden on the GPU as low as possible.
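The sub-model evaluation scheme in Eqs. 3 and 4 can be sketched with small dense matrices standing in for the NUFFT, the coil maps, and the subspace weighting (toy sizes and random stand-ins, not the actual operators):

```python
import numpy as np

rng = np.random.default_rng(1)
N, MT, K, C = 32, 64, 5, 4   # toy sizes: voxels, k-space samples over all TRs, rank, coils

F = rng.standard_normal((MT, N))                            # stand-in for the NUFFT F_{1,...,T}
S = [np.diag(rng.standard_normal(N)) for _ in range(C)]     # stand-ins for the coil maps S_c
Phi = [np.diag(rng.standard_normal(MT)) for _ in range(K)]  # stand-ins for Phi_k
x = rng.standard_normal((K, N))                             # subspace coefficient images x_k

# Evaluate Ax one sub-model A_{k,c} = Phi_k F S_c at a time:
# sum over k (columns of A), then stack over c (rows of A).
b = np.concatenate([
    sum(Phi[k] @ (F @ (S[c] @ x[k])) for k in range(K))
    for c in range(C)
])
```

Each `Phi[k] @ (F @ (S[c] @ x[k]))` term is one sub-model application; only one such term needs to reside on the GPU at a time, which is where the memory saving comes from.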

The modified linear operator only utilizes approximately 13 GB of peak CPU memory and 4.7 GB of peak GPU memory, compared to 140 GB of peak CPU memory for the BART implementation. BART uses Toeplitz embedding, which reduces per-iteration computation time as the NUFFTs in the acquisition model can be replaced by FFTs, but increases memory usage and requires a pre-calculation, which had a peak memory usage of > 400 GB.

However, despite the improvement in memory utilisation, the modified linear operator proposed in (3) increases the per-iteration time of FISTA when solving (2) for a 2-min TGAS-SPI-MRF acquisition, from approximately 26 s for BART on the CPU to approximately 58 s for SigPy on a single GPU (a 45% reduction in efficiency). These times were observed on a Linux workstation with 80 threads on Intel Xeon Gold 5320 CPUs at 2.20 GHz and an NVIDIA RTX A6000 GPU. Note that this performance is expected to vary between hardware and that the reduction in per-iteration speed is counteracted by the lower memory burden.

Density compensation

With memory usage considerably reduced, the next optimization targets the iterative convergence. The acquisition operator A is ill-conditioned, yielding slow iterative convergence. Assuming 25 s per iteration and 300 iterations, as used by Cao et al. [15], approximately 2 h 10 min are required to reconstruct the data. To improve the conditioning of A, Pipe-Menon density compensation [41] was first integrated into the optimization (Eq. 2) as described in [19, 20], yielding the following optimization:

$$\min_{x} \left( \left\| D^{1/2} \left( Ax - b \right) \right\|_{2}^{2} + \lambda \,\mathrm{LLR}\left( x \right) \right)$$

(5)

Here, D is the density compensation array designed to target F1,2,…,T in Eq. 3 so that A*DA has better conditioning.
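The Pipe-Menon fixed-point iteration can be illustrated with a 1D toy example; the dense kernel matrix here is a stand-in for the grid/degrid convolution used in practice, and `pipe_menon_dcf` is a hypothetical helper, not the implementation used in this work:

```python
import numpy as np

def pipe_menon_dcf(coords, iters=50, width=2.0):
    """Pipe-Menon-style density compensation (1D toy sketch).

    coords : (M,) nonuniform sample positions. A small dense kernel
    matrix P stands in for the grid/degrid convolution; the fixed-point
    iteration w <- w / (P w) drives the kernel-smoothed sampling
    density P w towards a flat profile.
    """
    d = np.abs(coords[:, None] - coords[None, :])
    P = np.maximum(1.0 - d / width, 0.0)   # triangular gridding kernel
    w = np.ones(len(coords))
    for _ in range(iters):
        w = w / (P @ w)
    return w

# Variable-density samples: dense near the k-space center (0), sparse outward
coords = 50.0 * np.linspace(0.0, 1.0, 100) ** 2
w = pipe_menon_dcf(coords)
```

Densely sampled regions (e.g. the center of a center-out spiral) receive smaller weights, flattening the effective sampling density and improving the conditioning of A*DA.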

Field of view processing

Caution is required when using a tight FOV in TGAS-SPI-MRF to avoid artifacts caused by signals originating outside the FOV, such as those from the shoulders and neck. This can be overcome by utilizing automatic FOV shifting and FOV size adaptation, for example using the approach proposed by Baron et al. [20].

Since the MRF sequence acquires a fixed FOV around isocenter, it is sensitive to subject positioning. An autoFOV algorithm was implemented to ensure that all anatomy was inside the MRF FOV. The autoFOV feature also speeds up scanning, because the FOV does not need to be set up manually.

The automatic FOV centering was done by reconstructing the calibration GRE image and performing a sum-of-squares combination of the data from the multiple receive coils. The image was then flattened by taking a maximum-intensity projection through the sagittal plane. The resulting 2D image was smoothed and binarized, and a bounding box was calculated around the largest continuous area. A shift was then calculated to ensure the top and front of the bounding box were within the target field-of-view.
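The projection-smooth-binarize-bounding-box sequence above can be sketched as follows; `head_bounding_box` is a hypothetical helper, the projection axis and threshold are assumptions, and shift calculation from the box is omitted:

```python
import numpy as np
from scipy.ndimage import gaussian_filter, label

def head_bounding_box(gre_volume, smooth_sigma=2.0, thresh_frac=0.2):
    """Locate the anatomy in a coil-combined GRE volume (autoFOV sketch).

    gre_volume : (x, y, z) magnitude image (already sum-of-squares combined).
    Returns (row extent, col extent) of the largest connected region of
    the maximum-intensity projection through the first (assumed sagittal) axis.
    """
    mip = gre_volume.max(axis=0)                 # maximum-intensity projection
    mip = gaussian_filter(mip, smooth_sigma)     # smooth
    mask = mip > thresh_frac * mip.max()         # binarize
    labels, _ = label(mask)                      # connected components
    sizes = np.bincount(labels.ravel())
    sizes[0] = 0                                 # ignore background
    biggest = labels == sizes.argmax()           # largest continuous area
    rows = np.flatnonzero(biggest.any(axis=1))
    cols = np.flatnonzero(biggest.any(axis=0))
    return (rows[0], rows[-1]), (cols[0], cols[-1])
```

The shift would then be chosen so that the top and front edges of this box fall within the target FOV.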

The GRE data was pre-whitened before ROVir was used to calculate a weighting of the 48 input channels that minimizes signal outside the MRF FOV (referred to as the interference region [39]). Eight virtual ROVir coils with the most signal in the interference region were excluded. SVD compression to 10 virtual channels of the remaining 40 channels was then performed. The complete coil compression matrix containing coil whitening, ROVir, and SVD compression was calculated based on the GRE and used for compression of the TGAS-SPI-MRF data. Coil sensitivity maps used in the MRF reconstruction were estimated from the MRF data itself using JSENSE [42].
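The ROVir step can be sketched as a generalized eigenvalue problem between the signal and interference covariance matrices; `rovir_weights` is a hypothetical, simplified helper (real data, no pre-whitening or subsequent SVD compression), not the implementation used here:

```python
import numpy as np
from scipy.linalg import eigh

def rovir_weights(cov_signal, cov_interf, n_drop=8):
    """ROVir-style virtual coil selection (sketch).

    cov_signal / cov_interf : (C, C) channel covariance matrices computed
    from voxels inside the desired FOV and inside the interference region.
    Solving the generalized eigenproblem  cov_signal v = w cov_interf v
    orders virtual coils by signal-to-interference ratio; the n_drop
    worst (most interference-dominated) virtual coils are discarded.
    """
    w, V = eigh(cov_signal, cov_interf)   # ascending generalized eigenvalues
    return V[:, n_drop:]                  # keep the C - n_drop best virtual coils
```

In this work, the resulting weighting would be combined with pre-whitening and a final SVD compression to 10 virtual channels into one coil compression matrix.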

The reconstruction, automatic FOV shifting, and coil processing matrix calculation for the GRE data takes less than 30 s and can be run while the TGAS-SPI-MRF data is being acquired.

Deli-CS

The Deli-CS pipeline consists of the following steps:

1.

Traditional Reconstruction. As a first step, an approximate reconstruction is obtained by gridding the acquired data, A*b. This reconstruction, which serves as the input to the next step, suffers from severe temporal-aliasing artifacts and increased noise, particularly in the lower-energy subspace components.

2.

Block-based Data-Driven Deep Learning. A deep learning network (ResUNet [43]) was trained to denoise the input image. The model is both trained and deployed in a block-wise manner to reduce the memory and training data requirements and is consequently not integrated with data consistency terms in an unrolled manner. The block-based processing proposed in this step allows the deep learning model to be trained and deployed efficiently with 5 GB of GPU memory.

3.

Compressed Sensing Certification. Since the above network is block-based and data-driven, the inferred reconstruction is susceptible to hallucinations. The inferred result is therefore used to initialize Eq. 5 solved with an iterative reconstruction. By initializing with the inferred result, the number of iterations required to converge is significantly reduced. Additionally, the resulting image satisfies the same convergence criterion as the traditional reconstruction achieved when solving Eq. 5.

The full Deli-CS reconstruction pipeline is depicted in Fig. 1C, and its DL network structure is shown in Fig. 2.

Fig. 2

The structure of the ResUNet used in Deli-CS. The light gray boxes show the data size at various layers of the network, where the first three numbers are the spatial dimensions and the fourth is the number of feature channels. The convolutions with kernel size 1 in the skip connections match the input and output dimensions so that they can be added together after each convolutional layer

Experiments

Data from 14 healthy volunteers were acquired on a 3 T Premier MRI scanner (GE Healthcare, Waukesha, WI) using a 48-channel receive coil. GRE and TGAS-SPI-MRF data were acquired. The TGAS-SPI-MRF acquisition time was 6 min, the acquired resolution was 1 mm³, and the FOV was 220 mm³. The data were retrospectively sub-sampled to simulate a 2-min acquisition and partitioned into 10 training, 2 validation, and 2 testing subjects. The technique was additionally tested on three sets of patient data (all male, ages 28, 49, and 75) acquired as additional scans to the standard-of-care protocols using two different 3 T Signa Premier scanners. The patients were scanned with a prospectively accelerated 2-min TGAS-SPI-MRF protocol, and thus no 6-min reference comparison was possible for these cases.

All human data were acquired with informed consent using protocols approved by the local institutional review board.

Coil sensitivity maps were estimated with JSENSE [42] from the time-averaged acquisitions. The subspace basis was generated from a simulated signal dictionary, and template matching was used to estimate the T1, and T2 parametric maps [1]. The tissue parameters used to generate the signal dictionary were:

$$T_{1} \in \left\{ \ldots \right\} \cup \left\{ \ldots \right\}$$

$$T_{2} \in \left\{ \ldots \right\} \cup \left\{ \ldots \right\} \cup \left\{ \ldots \right\} \cup \left\{ \ldots \right\}$$

The reported λ values for all reconstructions are given after D1/2b in Eq. 5 was normalized to have unit l2-norm.

A gold standard LLR reconstruction was performed on the 6-min data with an LLR block size of 8 and a λ value of 3 × 10⁻⁵ (based on similar values in previous work [15]) with 40 FISTA iterations. The matrix size was 256 × 256 × 256 × 5. This gold standard reconstruction is labelled MRF target recon in Fig. 1 and was only used as a reference to compare the performance of the different reconstruction approaches in the retrospectively undersampled 2-min case.

The LLR reconstruction of the retrospectively undersampled 2-min data was performed using a λ value of 5 × 10⁻⁵ with 40 FISTA iterations. In both the 6-min and 2-min cases, this was sufficient for convergence. These data were used as target reconstructions when training the Deli-CS network.

For training the Deli-CS network, non-overlapping blocks of dimensions 64 × 64 × 64 × 5 were extracted from the initial adjoint reconstruction and the reference reconstruction of the 2 min acquisition. For data augmentation, random flips (mirroring of the block in the x, y, and z, direction), transposes (swapping the order of the x, y, and z dimensions), and spatial shifts of random integer voxel steps in each direction were performed on the whole volume, before augmentation of the absolute scaling (between 50 and 100% of the original scale) was performed on each block. The block size was chosen to fit in GPU memory, but no hyper-parameter search was performed to assess the block size’s impact on network performance.
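The per-block augmentations described above can be sketched as follows; `augment_block` is a hypothetical helper covering the flips, transposes, and magnitude scaling (the whole-volume integer shifts are omitted):

```python
import numpy as np

def augment_block(block, rng):
    """Random augmentation of a (K, x, y, z) training block (sketch)."""
    # Random mirroring along each spatial axis
    for ax in (1, 2, 3):
        if rng.random() < 0.5:
            block = np.flip(block, axis=ax)
    # Random permutation of the spatial axes (transpose)
    perm = rng.permutation([1, 2, 3])
    block = block.transpose(0, *perm)
    # Random magnitude scaling between 50% and 100% of the original scale
    return block * rng.uniform(0.5, 1.0)
```

For cubic blocks, all of these operations preserve the block shape while varying orientation and intensity.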

Before augmentation, the volume was normalized to have unit l2-norm. If the standard deviation of a block's 0th-order Fourier coefficient was below 0.3, that block was discarded. This filter ensures that the network avoids learning from regions with no signal variation (e.g. background-only blocks), which provide no useful information but significantly increase the training time of the network. The blocks were subsequently split into real and imaginary components and concatenated blockwise along the subspace dimension. A total of 623 training blocks from 10 subjects and 152 validation blocks from 2 subjects were used. These blocks were fed into a ResUNet [43] with 3D convolutions, where the channel dimension corresponds to the subspace dimension. The ResUNet utilized 3 residual encoding blocks and 3 residual decoding blocks with a filter size of 3 × 3 × 3 and ReLU activation. The total number of trainable parameters was 23.8 million. The model was implemented in PyTorch Lightning [44] and trained for 540 epochs (the stopping criterion being the minimal l1-loss on the validation set; see Supplemental Figure S1 for the convergence plot) using the Adam optimizer [45] with a learning rate of 1 × 10⁻⁵. The training utilized 5 GB of GPU memory.

To avoid edge effects during inference, blocks were extracted from the initial reconstruction with a 16-voxel overlap in all dimensions. The resulting blocks were combined by applying a linear cross-blend operation, where the overlapping regions were averaged using a weighted average with weights linearly increasing from the edge of the overlapping region to the centre of the block (Fig. 1C).
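The linear cross-blend can be illustrated in 1D (the same ramp weighting applies per spatial axis in 3D); `blend_blocks` is a hypothetical helper, not the authors' code:

```python
import numpy as np

def blend_blocks(blocks, starts, length, block_size, overlap):
    """Recombine overlapping 1D blocks with a linear cross-blend (sketch).

    Each block is weighted by a ramp that rises linearly across the
    overlapping region, and the weighted contributions at each position
    are normalized by the accumulated weights, giving a weighted average
    in the overlaps.
    """
    w = np.ones(block_size)
    ramp = (np.arange(overlap) + 1) / (overlap + 1)
    w[:overlap] = ramp            # weight rises from the block edge...
    w[-overlap:] = ramp[::-1]     # ...and falls towards the other edge
    out = np.zeros(length)
    norm = np.zeros(length)
    for blk, s in zip(blocks, starts):
        out[s:s + block_size] += w * blk
        norm[s:s + block_size] += w
    return out / norm
```

Because the weights are normalized, a constant input is reproduced exactly, and block-edge discontinuities are smoothly blended away.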

For the final step of Deli-CS, the model prediction is used to initialize Eq. 5 and compared with the standard initialization of A*b. The iterative reconstruction used for refinement utilized the same parameters as the reference 2 min reconstruction.

The T1 and T2 values for different numbers of reconstruction iterations were assessed in a gray matter and white matter mask generated by FSL FAST [46] after brain segmentation using FSL BET [47].
