Introduction

Thanks to advances in computer-aided image analysis, radiological image data are increasingly considered a valuable source of quantitative biomarkers [1,2,3,4,5,6]. Body tissue composition is a long-known biomarker with high diagnostic and prognostic value, not only in cardiovascular, oncological, and orthopedic diseases but also in rehabilitation medicine and drug dosing. As obvious and simple as a quantitative determination of tissue composition from modern radiological cross-sectional imaging may seem, actually extracting this information in clinical routine is not feasible, since manual assessment requires an extraordinary amount of human labor. A recent study showed that some anthropometric measures can be estimated from simple and reproducible 2D measurements in CT using linear regression models [7]. Another study showed that a fully automated 2D segmentation of CT sectional images at the level of the L3 vertebra into subcutaneous adipose tissue, muscle, viscera, and bone is possible using a 2D U-Net architecture [8]. Determining the tissue composition at the level of L3 is often used as a reference in clinical routine to limit the amount of work required for the assessment. However, even this is only a rough approximation, since inter-individual variability between patients is large and the section at the level of L3 is not necessarily representative of the entire human anatomy. Dedicated techniques for analyzing body composition using dual-energy X-ray absorptiometry or magnetic resonance imaging exist [9] but require additional, potentially time-consuming or expensive procedures.

The aim of our study was therefore to develop a fully automated, reproducible, and quantitative 3D volumetry of body tissue composition from standard CT examinations of the abdomen in order to be able to offer such valuable biomarkers as part of routine clinical imaging.

Materials and methods

Dataset

A retrospective dataset was collected, consisting of 40 abdominal CTs for training and 10 abdominal CTs for testing (Table 1). The included scans were randomly selected from abdominal CT studies performed between 2015 and 2019 at the University Hospital Essen. The indication of the studies was not considered; according to the distribution of clinical studies in our department, more than 50% should have been examined for oncological indications. Each CT volume has a slice thickness of 5 mm and was reconstructed using a soft tissue convolutional reconstruction kernel. The data were annotated with six different labels: background (= outside the human body), muscle, bones, subcutaneous tissue, abdominal cavity, and thoracic cavity. For annotation, the ITK-SNAP [10] software (version 3.8.0) was used, and region segmentation was performed manually with a polygon tool. In order to reduce the annotation effort, every fifth slice was fully annotated; the remaining slices were marked with an ignore label, as visualized in Fig. 1. The final dataset contains 751 fully annotated slices for training and 186 for testing.

Table 1 Patient characteristics and acquisition parameters of the collected cohort
Fig. 1

Exemplary annotation of an abdominal CT, with subcutaneous tissue (red), muscle (yellow), bones (blue), abdominal cavity (green), thoracic cavity (purple), and ignore regions (white)
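Since only every fifth slice carries full annotations (Fig. 1), downstream loss computation must skip the ignore-labeled slices. A minimal illustrative sketch of such a per-slice weight mask (an assumption for illustration; the authors' implementation is not shown):

```python
# Illustrative sketch (not the authors' code): build a per-slice weight mask
# that is 1 on fully annotated slices and 0 on ignore-labeled slices.
import numpy as np

def annotation_mask(num_slices: int, annotated_every: int = 5) -> np.ndarray:
    mask = np.zeros(num_slices, dtype=np.float32)
    mask[::annotated_every] = 1.0  # every fifth slice carries full annotations
    return mask  # broadcast over (z, y, x) when masking the loss
```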

Network architectures

Many different architectural designs exist for semantic segmentation; some utilize pre-trained classification networks trained on ImageNet, while others are designed to be trained from scratch. For this study, two different network architectures were chosen for training: the commonly used U-Net 3D [11] and a more recent variant, the multi-resolution U-Net 3D [12]. The latter is shown in Fig. 2; the U-Net 3D is very similar, with residual path blocks replaced by identity operations and multi-resolution blocks replaced by two successive convolutions. Volumetric data limits the batch size to a single example per batch due to its large memory footprint. Therefore, instance normalization [13] layers were utilized instead of batch normalization layers [14]. In the original architectures, transposed convolutions were employed to upsample feature maps back to the original image size. However, transposed convolutions tend to generate checkerboard artifacts [15]. Therefore, trilinear upsampling followed by a 3 × 3 × 3 convolution was used instead, which is computationally more expensive but more stable during optimization. Additionally, different choices for the initial number of feature maps nf were evaluated: 16, 32, and 64. After each pooling step, the number is doubled, resulting in 256, 512, and 1024 feature maps at the lowest resolution, respectively.

Fig. 2

Schematic overview of the multi-resolution U-Net 3D architecture (red box: multi-resolution block; orange box: residual path block; green box: upsampling block; blue arrow: max-pooling layer; black arrow: identity data flow)
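To make the decoder block concrete, a hedged Keras sketch of upsampling followed by a 3 × 3 × 3 convolution and instance normalization is given below (illustrative, not the authors' code; the ReLU activation is an assumption, and Keras' UpSampling3D supports only nearest-neighbor interpolation, so true trilinear upsampling would require a custom resize op):

```python
# Sketch of one decoder upsampling block (illustrative assumptions noted below).
import tensorflow as tf
import tensorflow_addons as tfa

def upsample_block(x, num_filters):
    x = tf.keras.layers.UpSampling3D(size=2)(x)                    # paper uses trilinear; Keras only offers nearest here
    x = tf.keras.layers.Conv3D(num_filters, 3, padding="same")(x)  # 3x3x3 conv avoids checkerboard artifacts
    x = tfa.layers.InstanceNormalization()(x)                      # batch size of 1 -> instance norm instead of batch norm
    return tf.keras.layers.Activation("relu")(x)                   # activation choice is an assumption
```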

Training details

The implementation of the network architectures and training was done in Python using TensorFlow 2.0 [16] and the Keras API. Nvidia Titan RTX GPUs with 24-GB VRAM were used, which enabled the training of more complex network architectures on large volumetric data.

Adam [17] with decoupled weight decay regularization [18] was utilized, configured with beta_1 = 0.9, beta_2 = 0.999, eps = 1e-7, and a weight decay of 1e-4. An exponentially decaying learning rate with an initial value of 1e-4, multiplied by 0.95 every 50 epochs, helped to stabilize the optimization process at the end of the training. For selecting the best model weights during training, fivefold cross-validation was used on the training set and the average Dice score was monitored on the respective validation splits. Since the training dataset consists of 40 abdominal CTs, each training run was performed using 32 CTs for training and 8 CTs for validation.
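The described optimizer setup can be sketched as follows (hedged: steps_per_epoch is an assumption, as the text does not state how an epoch was defined):

```python
# Hedged sketch of AdamW with a step-wise exponentially decaying learning rate.
import tensorflow as tf
import tensorflow_addons as tfa

steps_per_epoch = 32  # assumption: one optimization step per training CT

lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-4,
    decay_steps=50 * steps_per_epoch,  # multiply by 0.95 every 50 epochs
    decay_rate=0.95,
    staircase=True,
)
optimizer = tfa.optimizers.AdamW(
    weight_decay=1e-4,
    learning_rate=lr_schedule,
    beta_1=0.9,
    beta_2=0.999,
    epsilon=1e-7,
)
```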

During training, several data augmentations were applied in order to virtually increase the unique sample size for training a generalizable network. For example, in [11, 12, 19], it has been shown that aggressive data augmentation strategies can prevent overfitting on small sample sizes by capturing expected variations in the data. First, random scale augmentation was applied with a scaling factor sampled uniformly between 0.8 and 1.2. Since this factor was sampled independently for the x- and y-axes, it also acts as an aspect ratio augmentation. Second, random flipping was utilized to mirror volumes on the x-axis. Third, subvolumes of size 32 × 256 × 256 were randomly cropped from the full volume of size n × 512 × 512. During inference, the same number of slices was used, but with the x- and y-dimensions kept unchanged, and the whole volume was processed using a sliding window approach with 75% overlap. To improve segmentation accuracy, predictions for overlapping subvolumes were aggregated in a weighted fashion, giving the central slices more weight than the outermost ones.
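The weighted sliding-window aggregation can be sketched as below (hedged: the triangular weighting and the handling of the volume tail are assumptions; the paper states only that central slices receive more weight than the outermost ones, and the sketch assumes the volume depth is at least one window):

```python
# Hedged sketch of weighted sliding-window inference along the z-axis.
import numpy as np

def sliding_window_predict(model, volume, window=32, overlap=0.75):
    """volume: (z, y, x, channels). Returns per-voxel class probabilities."""
    step = max(1, int(window * (1.0 - overlap)))  # 8-slice stride at 75% overlap
    # Triangular weights peaking at the window center (assumption).
    w = np.minimum(np.arange(1, window + 1), np.arange(window, 0, -1)).astype(np.float32)
    w = w[:, None, None, None]
    starts = list(range(0, max(volume.shape[0] - window, 0) + 1, step))
    if starts and starts[-1] != volume.shape[0] - window:
        starts.append(volume.shape[0] - window)  # cover the tail of the volume
    probs = norm = None
    for z0 in starts:
        pred = model.predict(volume[None, z0:z0 + window])[0]  # (window, y, x, C)
        if probs is None:
            probs = np.zeros((volume.shape[0],) + pred.shape[1:], np.float32)
            norm = np.zeros((volume.shape[0], 1, 1, 1), np.float32)
        probs[z0:z0 + window] += pred * w
        norm[z0:z0 + window] += w
    return probs / np.maximum(norm, 1e-8)
```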

Besides random data augmentations, additional pre-processing steps were performed before feeding the image data into the neural networks. Volumes were downscaled by a factor of 2 to 128 × 128 on the x-/y-axes, retaining a slice thickness of 5 mm on the z-axis. CT images are captured in Hounsfield units (HU), which preserve fine details and allow for different interpretations depending on which transfer function is used to map HUs to a color (e.g., black/white). In principle, when using floating-point values, the typical scanner quantization of 12 bits can be stored losslessly, and a network should be able to process all the information without problems. In this work, multiple HU windows ([− 1024, 3071], [− 150, 250], and [− 95, 155]) were applied to the 16-bit integer data, mapping to [0, 1] with outliers clipped to the respective minimum and maximum values, and stacked as channels. Lastly, the network inputs were centered around zero, with a minimum value of − 1 and a maximum value of + 1.
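A minimal sketch of this windowing and normalization step (illustrative; function and variable names are our own):

```python
# Map raw HU (int16) volumes to three stacked, zero-centered channels.
import numpy as np

HU_WINDOWS = [(-1024, 3071), (-150, 250), (-95, 155)]  # full 12-bit range, abdomen, liver

def preprocess(hu_volume: np.ndarray) -> np.ndarray:
    channels = []
    for lo, hi in HU_WINDOWS:
        win = np.clip(hu_volume, lo, hi).astype(np.float32)  # clip outliers to window limits
        win = (win - lo) / (hi - lo)                         # normalize to [0, 1]
        channels.append(win * 2.0 - 1.0)                     # center to [-1, +1]
    return np.stack(channels, axis=-1)                       # shape: (z, y, x, 3)
```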

For supervision, a combination of softmax cross-entropy loss and generalized Sørensen Dice loss [20] was chosen, similar to [19]. Voxels marked with the ignore label do not contribute to the loss computation. Both losses are defined as follows:

$$ {\mathbbm{L}}_{\mathrm{XCE}} = -\frac{1}{N} \cdot \sum_{n=1}^{N} \sum_{c=1}^{C} y_{c,n} \cdot \log\left({\hat{y}}_{c,n}\right) $$
$$ {\mathbbm{L}}_{\mathrm{Dice}} = 1.0 - \frac{1}{C-1} \cdot \sum_{c=2}^{C} \frac{2 \cdot \sum_{n=1}^{N} {\hat{y}}_{c,n} \cdot y_{c,n} + \epsilon}{\sum_{n=1}^{N} \left({\hat{y}}_{c,n} + y_{c,n}\right) + \epsilon} $$

C stands for the total number of classes, which equals six for the problem at hand, and N for the number of annotated voxels. \( {\hat{y}}_{c,n} \) and \( y_{c,n} \) represent the prediction and groundtruth label, respectively, for class c at voxel location n. The background class (c = 1) is explicitly excluded from the Dice loss in order to give the foreground classes more weight in the optimization process. This choice is well known for class-imbalanced problems, where the foreground classes cover only small areas compared with the background class.

The final loss is an equally weighted combination of both losses:

$$ {\mathbbm{L}}_{\mathrm{SV}}=0.5\cdotp {\mathbbm{L}}_{\mathrm{XCE}}+0.5\cdotp {\mathbbm{L}}_{\mathrm{Dice}} $$
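A hedged TensorFlow sketch of this combined loss with ignore-label masking is given below (illustrative; the smoothing constant EPS and the exact masking details are assumptions not stated in the text):

```python
# Hedged sketch of the equally weighted cross-entropy + generalized Dice loss.
import tensorflow as tf

EPS = 1e-6  # assumed value for the smoothing constant (epsilon in the formulas)

def combined_loss(y_true, y_pred, annotated_mask):
    """y_true: one-hot labels (b, z, y, x, C); y_pred: softmax probabilities;
    annotated_mask: 1 on fully annotated voxels, 0 on ignore-labeled ones."""
    m = tf.cast(annotated_mask, y_pred.dtype)[..., tf.newaxis]
    yt, yp = y_true * m, y_pred * m
    # Softmax cross-entropy, averaged over annotated voxels only.
    xce = -tf.reduce_sum(yt * tf.math.log(y_pred + 1e-8)) / (tf.reduce_sum(m) + 1e-8)
    # Generalized Dice over foreground classes; background (channel 0) excluded.
    inter = tf.reduce_sum(yp[..., 1:] * yt[..., 1:], axis=[0, 1, 2, 3])
    union = tf.reduce_sum(yp[..., 1:] + yt[..., 1:], axis=[0, 1, 2, 3])
    dice = 1.0 - tf.reduce_mean((2.0 * inter + EPS) / (union + EPS))
    return 0.5 * xce + 0.5 * dice  # equally weighted combination
```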

Tissue quantification

Various materials can be extracted from a CT by thresholding the HU values to a specific intensity range. For quantifying tissues, the reporting system uses a mixture of classical thresholding and modern semantic segmentation neural networks to build semantic relationships. During training, fivefold cross-validation [21] was employed to measure the generalization performance of the selected model configuration, which in the end produced five trained model weights per configuration. For inference, those five models were used to build an ensemble system [21] by averaging the probabilities of all individual predictions, which is a common method for increasing the stability and accuracy of a machine learning model. The final output of the quantification system is a report about subcutaneous adipose tissue (SAT), visceral adipose tissue (VAT), and muscle volume. Muscular tissue is identified by thresholding the HU between − 29 and 150 [22], and adipose tissue by thresholding the HU between − 190 and − 30 [22]. If an adipose voxel lies within the abdominal cavity region, it is counted as VAT; if it lies within the subcutaneous tissue region, it is counted as SAT. Automatically subclassified tissue volumes were validated against the tissue volumes derived from the groundtruth annotations using the intra-class correlation method on a slice-by-slice basis.
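The subclassification step is essentially a logical conjunction of an HU threshold mask and a predicted region mask, as the following hedged sketch illustrates (HU thresholds from [22]; the semantic label ids and function names are illustrative assumptions, not the authors' values):

```python
# Sketch of HU thresholding combined with predicted semantic regions.
import numpy as np

MUSCLE_LABEL, SUBCUTANEOUS_LABEL, ABDOMINAL_CAVITY_LABEL = 1, 3, 4  # assumed ids

def quantify_tissues(hu, region, voxel_volume_ml):
    """hu: HU volume; region: predicted semantic label volume (same shape)."""
    fat = (hu >= -190) & (hu <= -30)    # adipose tissue HU range [22]
    muscle = (hu >= -29) & (hu <= 150)  # muscular tissue HU range [22]
    masks = {
        "VAT": fat & (region == ABDOMINAL_CAVITY_LABEL),  # fat inside abdominal cavity
        "SAT": fat & (region == SUBCUTANEOUS_LABEL),      # fat inside subcutaneous region
        "muscle": muscle & (region == MUSCLE_LABEL),
    }
    return {name: mask.sum() * voxel_volume_ml for name, mask in masks.items()}
```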

Results

Model evaluation

As described in the “Network architectures” and “Training details” sections, two different network architectures with varying initial numbers of feature maps were systematically evaluated using a fivefold cross-validation scheme on the training dataset. The results are stated in Table 2 (additional complementary evaluation metrics are available in Tables A.1, A.2, and A.3). First of all, all networks delivered promising results with average Dice scores over 0.93. Second, the multi-resolution U-Net variants achieved consistently higher scores than their respective U-Net counterparts. It is interesting to note that the improvements in scores were small compared with the increase in trainable parameters and thus in the time required to train and test the networks. A single optimization step took 294 ms, 500 ms, and 1043 ms on an NVIDIA Titan RTX for initial feature map counts of 16, 32, and 64, respectively.

Table 2 Evaluation for the fivefold cross-validation runs (stated as mean overall runs) and ensemble predictions on the test set. AC, abdominal cavity; B, bones; M, muscle; ST, subcutaneous tissue; TC, thoracic cavity

For visual inspection of the ensemble segmentations, a few exemplary slices are shown in Fig. 3. Most slices show almost perfect segmentation boundaries; however, the ribs in particular are problematic due to the partial volume effect. In 5-mm CTs, it is sometimes hard even for human readers to correctly assign one or the other region.

Fig. 3

Comparison of different slices, their respective groundtruth annotation, and predictions of the ensemble formed from five trained models on cross-validation splits

Ablation study

During model development, it was observed that the choice of HU window has an impact on optimization stability and final achieved scores. Therefore, a small ablation study was conducted in order to systematically evaluate the influence of different HU limits. Additional models were trained using the same training parameters, but only with changed input pre-processing. The results are stated in Table 3.

Table 3 Evaluation of multi-resolution U-Nets with nf = 32 trained on different mappings from Hounsfield units to the target intensity value range of [− 1, 1]. Multi-window stands for a combination of theoretical value range of 12-bit CT scans, abdomen window, and liver window. AC, abdominal cavity; B, bones; M, muscle; ST, subcutaneous tissue; TC, thoracic cavity

Increasing the HU intensity range consistently improved Dice scores. By combining multiple HU windows as separate input channels, the Dice scores improved even further, to over 0.95 on average on both the cross-validation and the test set. The lowest scores, 0.829 on average for cross-validation and 0.875 for the test set, were achieved by the abdominal HU window ranging from − 150 to 250.

Tissue quantification report

As described in the “Tissue quantification” section, the segmentation models are intended to be used for assigning thresholded tissues to different regions, which is technically a logical conjunction. The achieved intra-class correlation coefficients for the derived SAT, VAT, and muscle volumes, measured per slice on the test set, are 0.999, 0.998, and 0.991, respectively (p < 0.001); corresponding Bland-Altman plots are shown in Fig. 4. In order to visually inspect the quality of the tissue segmentation, a PDF report with sagittal and coronal slices is generated, in conjunction with a stacked bar plot showing the volumes of segmented muscle, SAT, and VAT per axial slice (see Fig. 5). This is only intended to give the human reader a first visual impression of the system output. For analysis, an additional table with all numeric values per slice is generated. The PDF file is encapsulated into DICOM and automatically sent back to the PACS, in order to make use of the existing DICOM infrastructure (a minimal sketch of this encapsulation step follows Fig. 5).

Fig. 4

Bland-Altman plot of SAT, VAT, and muscle volumetry with data points for every annotated slice in the test set

Fig. 5

Final visual report of the tissue quantification system output. SAT is shown in red, VAT in green, and muscle tissue in yellow
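The DICOM encapsulation of the PDF report can be sketched with pydicom as below (hedged: the authors' PACS integration code is not shown; patient/study attributes are omitted for brevity, and the sketch targets the pydicom 2.x API):

```python
# Hedged sketch: wrap a PDF report into a DICOM Encapsulated PDF object.
from pydicom.dataset import FileDataset, FileMetaDataset
from pydicom.uid import ExplicitVRLittleEndian, generate_uid

def pdf_to_dicom(pdf_path: str, out_path: str) -> None:
    meta = FileMetaDataset()
    meta.MediaStorageSOPClassUID = "1.2.840.10008.5.1.4.1.1.104.1"  # Encapsulated PDF Storage
    meta.MediaStorageSOPInstanceUID = generate_uid()
    meta.TransferSyntaxUID = ExplicitVRLittleEndian

    ds = FileDataset(out_path, {}, file_meta=meta, preamble=b"\x00" * 128)
    ds.is_little_endian, ds.is_implicit_VR = True, False
    ds.SOPClassUID = meta.MediaStorageSOPClassUID
    ds.SOPInstanceUID = meta.MediaStorageSOPInstanceUID
    ds.Modality = "DOC"
    ds.MIMETypeOfEncapsulatedDocument = "application/pdf"
    with open(pdf_path, "rb") as f:
        ds.EncapsulatedDocument = f.read()
    ds.save_as(out_path)  # the resulting object can then be sent to the PACS
```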

Discussion

Our study aimed to develop a fully automated, reproducible, and quantitative 3D volumetry of body tissue composition from standard abdominal CT examinations in order to provide valuable biomarkers as part of routine clinical imaging.

Our best approach, a multi-resolution U-Net 3D with an initial feature map count of 64, was able to fully automatically segment the abdominal cavity, bones, muscle, subcutaneous tissue, and thoracic cavity with a mean Sørensen Dice coefficient of 0.9553 and thus yielded excellent results. The derived tissue volumetry had intra-class correlation coefficients of over 0.99. Further experiments showed high performance with heavily reduced parameter counts, which enables considering speed/accuracy trade-offs depending on the type of application. The choice of transfer function used to map HUs to a normalized value range for feeding images into the neural networks was found to have a substantial impact on segmentation performance.

In a recent study, manual single-slice CT measurements were used to build linear regression models for predicting stable anthropometric measures [7]. As the authors suggest, these measures may be important as biomarkers for several diseases, such as sarcopenia, but could also be used when actual measurements are not available. However, manual single-slice CT measurements remain prone to intra-patient variability as well as inter- and intra-rater variability. By using a fully automated approach, anthropometric measures derived from more than a single CT slice should in theory be more stable.

Fully automated analysis of body composition has been attempted many times in the past. Older methods utilize classical image processing and binary morphological operations [23,24,25] to isolate SAT and VAT from total adipose tissue (TAT). Other studies use prior knowledge about contours and shapes and actively fit a contour or template to a given CT image [26,27,28,29,30]. Those methods are prone to variations in intensity values and assume certain body structures for the algorithmic separation between SAT and VAT. Apart from purely CT imaging–based studies, there have been efforts to apply similar techniques to magnetic resonance imaging (MRI) [31,32,33]. However, MRI procedures are more costly and time-consuming than CT imaging in clinical routine. Specific MRI procedures exist for body fat assessment but have to be performed explicitly. Our approach can be used on routine CT imaging and may serve as supplementary information for diagnosis or screening purposes.

Recently, deep learning–based methods have been proposed [8, 34]. In both studies, models were trained solely on single L3 CT slices. However, Weston et al [8] visually showed that their model was able to generalize well to other abdominal slices without being trained on such data. Nonetheless, they mentioned that extending the training and evaluation data to the whole abdomen would benefit not only stability but also analysis capabilities. Our study uses annotated data for training and evaluation across the whole abdomen and is thus a true volumetric approach to body composition analysis. In addition, they segmented SAT and VAT directly, whereas in our study, the semantic body regions were segmented and adipose tissue was subclassified using known HU thresholds.

One major disadvantage of the collected dataset is the slice thickness of 5 mm. Several tissues, materials, and potentially air can be contained within a distance of 5 mm, and the resulting HU at a specific location is an average of all components. This is known as the partial volume effect and can be counteracted by using a smaller slice thickness, ideally with isometric voxel sizes. However, a reconstructed slice thickness of 5 mm is common in clinical routine CT, and it is questionable whether the increased precision of calculating the tissue composition on 1-mm slices would be clinically relevant. Nevertheless, we plan to investigate the influence of thinner slices in further studies, as thin-slice reading is becoming routine in more and more institutions.

Another limitation is the differentiation between visceral fat and fat contained within organs. Currently, every voxel with an HU value in the fat intensity range that is contained within the abdominal cavity region is counted as VAT. However, by definition, fat cells within organs do not count as VAT and thus should be excluded from the final statistics. Public datasets such as [35, 36] already exist for multi-organ semantic segmentation and could be utilized to postprocess the segmentation results from this study by masking organs in the abdominal cavity.

It is quite common to find metal foreign objects like implants in abdominal CTs and thus to encounter beam hardening artifacts. Those artifacts, depending on their strength, may affect the segmentation quality, as shown in Fig. 6. Even if the segmentation model is able to predict the precise boundaries of the individual semantic regions, streaking and cupping artifacts make it impossible to threshold fatty or muscular tissue based on HU intensities, potentially invalidating the quantification reports. In a future version of our tool, we therefore plan to add functionality for automatic detection and handling of image artifacts.

Fig. 6

Beam hardening artifacts may not only harm segmentation quality (top) but also prevent accurate identification of tissues (bottom). Strong beam hardening artifacts with faults in the segmentation output (left). Beam hardening artifacts with mostly accurate segmentation, but streaking artifacts prevent accurate muscle and SAT identification (middle). No beam hardening artifacts at all, but metal foreign object detected (right)

In future work, we plan to extend the body composition analysis system to incorporate other regions of the body as well. For example, [24] already demonstrated an analysis of adipose tissue and muscle for the thighs. Ideally, the system should be capable of analyzing the whole body in order to derive stable biomarkers. Furthermore, an external validation is required to prove the stability and generalizability of the developed system. This includes data from different scanners as well as a large variety of body composition cases.

Conclusion

In this study, we presented a deep learning–based, fully automated volumetric tissue classification system for the extraction of robust biomarkers from clinical CT examinations of the abdomen. In the future, we plan to extend the system to thoracic examinations and to add important tissue classes such as pericardial adipose tissue and myocardium.