Data split
Two COVID-19 datasets were used in this study. The first is the Integrative Resource of Lung CT Images and Clinical Features (ICTCF) [40], which contains 6654 lung CT images from 1338 patients together with each patient's clinical severity. The second is the Med-Seg (medical segmentation) COVID-19 dataset [36], which contains 932 lung CT images with ground truth labels for their ground-glass opacity (GGO) and consolidation segments. The segments and data splits are shown in Fig. 2. We split the data into a training set and a test set. To prevent data leakage, we split based on patients rather than on individual CT images.
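As an illustration, the following minimal sketch shows one way to implement such a patient-level split; the table and column names (`images_df`, `patient_id`) are ours, not from the datasets.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy table: one row per CT image, grouped by patient (names are illustrative).
images_df = pd.DataFrame({
    "image_path": [f"img_{i}.png" for i in range(6)],
    "patient_id": [1, 1, 2, 2, 3, 3],
})

# Grouping by patient guarantees that all images of a patient fall into the
# same split, so no patient appears in both the training and the test set.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.34, random_state=0)
train_idx, test_idx = next(splitter.split(images_df, groups=images_df["patient_id"]))
train_set, test_set = images_df.iloc[train_idx], images_df.iloc[test_idx]
```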
Supervised InfNet
The supervised InfNet (SInfNet) is a recently developed CNN model for COVID-19 lung CT segmentation [22], which serves as both our backbone and one of the baseline models. We did not change the overall structure or default hyperparameters of the original SInfNet model (Additional file 1: Figure S1). A complete SInfNet consists of two parts: a single SInfNet (Additional file 1: Figure S2A) and a multi SInfNet (Additional file 1: Figure S3A). The single SInfNet only predicts the infected regions, without classifying them further. Its input is a raw lung CT image, and its output includes the edge contour of the overall lesion regions and four overall lesion segmentations of different sizes, as shown in Additional file 1: Figure S1. A CT image is first passed through the initial convolutional layers of the single SInfNet to extract image features. These features are then fed into the partial decoder module, the reverse attention module, and the edge detection module. The edge detection module helps the network detect segmentation boundaries, while the reverse attention and partial decoder modules generate the segmentation of the infected regions of the CT lung images.
The prediction from the single SInfNet represents the overall infected regions and acts as a prior that is concatenated with the original CT image and fed into the multi SInfNet. The multi SInfNet predicts multiple labeled segmentations: background, GGO, and consolidation.
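The following is a minimal sketch of this two-stage data flow, with small convolutional stand-ins for the two networks (the real architectures are far deeper):

```python
import torch
import torch.nn as nn

# Stand-in modules; the real single/multi SInfNet architectures are much deeper.
single_infnet = nn.Conv2d(3, 1, kernel_size=3, padding=1)  # CT -> infection prior
multi_infnet = nn.Conv2d(4, 3, kernel_size=3, padding=1)   # CT + prior -> 3 classes

ct = torch.randn(1, 3, 352, 352)             # one CT image (size is illustrative)
prior = single_infnet(ct)                    # overall infected-region prediction
multi_input = torch.cat([ct, prior], dim=1)  # concatenate prior with the CT image
multi_pred = multi_infnet(multi_input)       # background / GGO / consolidation maps
```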
Self-supervised InfNet
The self-supervised InfNet (SSInfNet) is our proposed COVID-19 segmentation CNN model, which, like the SInfNet, consists of two parts: a single SSInfNet (Additional file 1: Figure S2B) and a multi SSInfNet (Additional file 1: Figure S3B). It is obtained by integrating generative adversarial image inpainting, the focal loss, and the Lookahead optimizer into the SInfNet. The original SInfNet generates five predictions: an edge segmentation and four segmentations of the infected regions. To exploit self-supervision for the SInfNet's segmentation task, we generate masks and feed the masked CT images into the model. During this self-supervised stage, the last convolutional layer that outputs the prediction is replaced with a different convolutional layer that reconstructs the image and the edge; everything else is kept the same as the SInfNet architecture (Additional file 1: Figure S2A). This process allows the network to learn meaningful representations of the CT images, which we can then use to segment the infected regions of CT lung images. After the self-supervised features are learned, training continues as in the SInfNet algorithm, starting from the weights trained with the self-supervised inpainting method and with the replaced convolutional layer swapped back to the original last layer.
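A minimal sketch of this head swap is shown below; the module names and channel counts are ours, not taken from the InfNet code:

```python
import torch.nn as nn

feat_ch = 64                                # illustrative feature-channel count
encoder = nn.Sequential(                    # stand-in for the shared InfNet trunk
    nn.Conv2d(1, feat_ch, 3, padding=1), nn.ReLU(),
)
seg_head = nn.Conv2d(feat_ch, 1, 1)         # original prediction layer
recon_head = nn.Conv2d(feat_ch, 1, 1)       # reconstruction layer for inpainting

model = nn.Sequential(encoder, recon_head)  # stage 1: inpaint masked CT images
# ... self-supervised inpainting training ...
model = nn.Sequential(encoder, seg_head)    # stage 2: same trunk, original head,
                                            # fine-tuned on segmentation labels
```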
By learning features through image inpainting, the model learns features that are more closely related to image segmentation. Creating masks for the network to inpaint is delicate: a mask can be too complex for the network to start learning, or too simple to yield good representations. We therefore use a coach network that increases the complexity of the masks applied to the CT images over the course of training. The masks are initially simple; once the network inpaints the CT images with good performance, the coach increases the masking complexity to reduce the network's performance. The loss for the coach network is constructed from the image-inpainting loss of the SSInfNet. The coach network and the SSInfNet work together as a minimax game: the SSInfNet tries to minimize the loss to generate better image inpaintings, while the coach network tries to increase the inpainting loss by generating more complex masks. In the beginning, the masks generated by the coach network are quite simple; as the SSInfNet gets better at inpainting, the coach network generates progressively more complex masks. The loss function for the coach network is:
$$L_{coach} (x) = 1 - L_{rec} (x \odot M)$$
(1)
where:
- \(M\) is the mask created by the coach network
- \(x\) is the CT lung image
- \(L_{coach}\) is the loss for the coach network
- \(L_{rec}\) is the reconstruction loss
A constraint is applied to this loss function because, without it, the coach network would simply mask all regions: no context information would remain for the network to learn from, and the maximum loss would be achieved. The constraint is:
$$\hat{B}(x) = B(x) - \mathrm{SORT}(B(x))^{k|B(x)|}$$
(2)
$$M = C(x) = \sigma (\alpha \hat{B}(x))$$
(3)
The backbone, B, of the coach network has a network architecture similar to the InfNet models. SORT(B(x)) sorts the features over the activation map in descending order. The superscript k|B(x)| denotes the element at position k·|B(x)| in the sorted list, so k controls the fraction of the image to be erased: regions with scores smaller than this element are erased from the image. If k is 0.75, then 75% of the image is not erased. The score is scaled into the range [0, 1] using a sigmoid activation function, σ, with a scaling factor α. C(x) is the coach network fed with the CT lung images. An illustration of the coach network is shown in Fig. 1.
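As an illustration, the following sketch implements Eqs. (2)-(3) for one image; the function name and the default values of k and α are ours:

```python
import torch

def coach_mask(b_x: torch.Tensor, k: float = 0.75, alpha: float = 10.0) -> torch.Tensor:
    """Turn the coach backbone's activation map B(x) into a soft mask M.

    Eq. (2): subtract the element at position k*|B(x)| of the descending sort,
    so scores below it become negative. Eq. (3): squash with a sigmoid, so
    roughly a fraction k of the image stays unmasked (requires 0 < k < 1).
    """
    flat = b_x.flatten()
    kth = torch.sort(flat, descending=True).values[int(k * flat.numel())]
    b_hat = b_x - kth
    return torch.sigmoid(alpha * b_hat)  # M = sigma(alpha * B_hat(x)), in [0, 1]
```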
After the self-supervised training is finished, the single SSInfNet weights are reused for normal training on the segmentations of the CT lung images. Likewise, the multi SSInfNet reuses the weights trained during its self-supervised stage for normal training on the multi-class segmentations of the CT lung images.
The proposed single SSInfNet architecture is shown in Fig. 1 and Additional file 1: Figure S2B; Additional file 1: Figure S2A shows the original single SInfNet architecture. The difference is that the last layer for each output prediction is replaced with a linear activation layer that recreates the parts of the original image covered by the masks. The proposed multi SSInfNet architecture is shown in Fig. 1 and Additional file 1: Figure S3B. The change is analogous: the last convolutional layer is replaced with a linear activation layer that outputs the inpainting of the original image.
A loss is calculated for each output of the single SInfNet model. The first is the edge loss, \(L_{edge}\), which guides the model toward better segmentation boundaries. The second is the segmentation loss, \(L_{seg}\), which combines the Intersection over Union (IoU) loss and the binary cross-entropy loss (\(L_{BCE}\)). The segmentation loss for the single SInfNet is as follows:
$$L_{seg} = L_{IoU} + \lambda L_{BCE}$$
(4)
λ is a hyperparameter controlling how much weight is put on the binary cross-entropy loss. The segmentation loss is applied to every predicted output Si, where Si is produced from the feature map fi, for i = 3, 4, 5. Because low-level features consume more computational resources due to their larger spatial resolutions yet yield lower performance, we use only the higher-level features (i = 3, 4, 5). The total loss function for the single SInfNet model is then:
$$L_{total} = L_{seg} (G_{t} ,S_{g} ) + L_{edge} + \sum\limits_{i = 3}^{5} {L_{seg} (G_{t} ,S_{i} )}$$
(5)
The summed segmentation losses are calculated from the outputs of the parallel partial decoder and the three convolutional layers (i = 3, 4, 5). Gt refers to the ground truth labels, Sg is the output from the parallel partial decoder matched against the ground truth label, and Si are the different-sized segmentations of infected regions output by the network. These segmentations are resized to the same shape as the ground truth segmentation image.
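For concreteness, a minimal PyTorch sketch of Eqs. (4)-(5) is given below, using a standard soft-IoU and BCE; details such as per-pixel weighting in the reference implementation may differ:

```python
import torch
import torch.nn.functional as F

def seg_loss(pred, gt, lam=1.0):
    """L_seg = L_IoU + lambda * L_BCE (Eq. 4); `lam` is the hyperparameter lambda."""
    prob = torch.sigmoid(pred)
    inter = (prob * gt).sum(dim=(2, 3))
    union = (prob + gt - prob * gt).sum(dim=(2, 3))
    l_iou = 1 - (inter / union.clamp(min=1e-6)).mean()  # soft IoU loss
    l_bce = F.binary_cross_entropy_with_logits(pred, gt)
    return l_iou + lam * l_bce

def total_loss(s_g, side_outputs, l_edge, gt):
    """L_total (Eq. 5): decoder output S_g plus side outputs S_3..S_5,
    each resized to the ground-truth shape before computing L_seg."""
    loss = seg_loss(s_g, gt) + l_edge
    for s_i in side_outputs:  # i = 3, 4, 5
        s_i = F.interpolate(s_i, size=gt.shape[2:], mode="bilinear", align_corners=False)
        loss = loss + seg_loss(s_i, gt)
    return loss
```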
For the multi SSInfNet, we use the default model and hyperparameters of the multi SInfNet. However, we train the multi SSInfNet without any unlabeled images during self-supervision, because it requires the prior (infected region) as input. The CT lung image and its prior, generated by the single SInfNet, are concatenated before being fed into the multi SSInfNet, which labels the prior as background, ground-glass opacity, or consolidation. The loss function for the multi SSInfNet is as follows:
$$L_{BCE} = - \frac{1}{N}\sum\limits_{i = 1}^{N} {\left[ {y_{i} \cdot \log (\hat{y}_{i} ) + (1 - y_{i} ) \cdot \log (1 - \hat{y}_{i} )} \right]}$$
(6)
where:
- \(y_{i}\) is the ground truth value for the segmentation (background, ground-glass opacity, or consolidation)
- \(\hat{y}_{i}\) is the network's predicted value for the segmentation
- \(N\) is the number of samples in the current training batch
The loss function for the multi SInfNet is thus the binary cross-entropy between the predicted and ground truth segmentations. To improve the model's performance and aid its generalization, we use self-supervised learning to learn good representations of the CT lung images.
Additionally, for the multi SSInfNet we use the focal loss instead of the binary cross-entropy loss to put more weight on the under-represented label (consolidation). The focal loss function is:
$$FL(p_{t} ) = - \alpha_{t} (1 - p_{t} )^{\gamma } \log (p_{t} )$$
(7)
where:
- \(FL\) is the focal loss
- \(p_{t}\) is the multi SSInfNet's predicted probability
- \(\alpha_{t}\) is a hyperparameter that controls the weight of positive and negative samples
- \(\gamma\) controls the rate at which easy examples are down-weighted
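A minimal sketch of Eq. (7) follows; the default values for \(\alpha_{t}\) and \(\gamma\) are common choices, not taken from our experiments:

```python
import torch

def focal_loss(p_t: torch.Tensor, alpha_t: float = 0.25, gamma: float = 2.0) -> torch.Tensor:
    """FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t), averaged over pixels.

    p_t is the predicted probability of the true class; the (1 - p_t)^gamma
    factor down-weights well-classified (easy) examples.
    """
    return (-alpha_t * (1 - p_t).pow(gamma) * p_t.clamp(min=1e-6).log()).mean()
```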
We also wrap the Lookahead optimizer around the SGD optimizer with k = 5 and α = 0.5, where k is the number of inner-loop steps SGD takes before the Lookahead update. In our case, after SGD optimizes the network weights for 5 iterations, Lookahead updates the slow weights by α times the difference between the network weights after and before those 5 SGD iterations; α controls the magnitude of this update. The pseudocode for our single/multi SSInfNet can be found in Additional file 1: Algorithm 1.
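The update can be sketched as follows (slow weights φ, fast weights θ); this is a minimal illustration of the rule φ ← φ + α(θ − φ), not the authors' exact implementation:

```python
import torch

class Lookahead:
    """Minimal Lookahead wrapper: after every k fast (inner) steps, move the
    slow weights toward the fast weights by a factor alpha and sync back."""

    def __init__(self, inner_optimizer, k=5, alpha=0.5):
        self.inner, self.k, self.alpha = inner_optimizer, k, alpha
        self.steps = 0
        self.slow = [p.detach().clone()
                     for g in inner_optimizer.param_groups for p in g["params"]]

    def step(self):
        self.inner.step()  # one fast step (SGD in our case)
        self.steps += 1
        if self.steps % self.k == 0:
            fast = [p for g in self.inner.param_groups for p in g["params"]]
            for phi, theta in zip(self.slow, fast):
                phi += self.alpha * (theta.detach() - phi)  # phi <- phi + alpha*(theta - phi)
                theta.data.copy_(phi)                       # reset fast weights to phi

# Usage: opt = Lookahead(torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.7))
```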
Experimental settings
For the single SInfNet, we train the network for 500 epochs using the Adam optimizer with a learning rate of 0.0001. For the multi SInfNet, we train for 500 epochs using SGD with momentum 0.7 and learning rate 0.01. For the multi SSInfNet with the added focal loss and Lookahead optimizer, we train for 500 epochs and wrap the Lookahead optimizer (k = 5, α = 0.5) around the same SGD optimizer (momentum 0.7, learning rate 0.01).
For the self-supervised versions of both the single SInfNet and the multi SInfNet, the self-supervised image inpainting is trained first. The weights of the trained networks, except for the last layer, are then transferred and used to train on the segmentation of the CT lung images. During the self-supervised image-inpainting stage, we train the network for 2000 epochs: the inpainting network is trained for the first 200 epochs, then the coach network is trained for 200 epochs to increase the complexity of the generated masks, and afterwards we alternate between training the inpainting network for 100 epochs and the coach network for 100 epochs, for 1800 epochs in total. Within each alternating phase, we set the learning rate to 0.1 at the start, 0.01 at the 40th epoch, 0.001 at the 80th epoch, and 0.0001 at the 90th epoch to speed up convergence. We use SGD as the optimizer for the self-supervised image inpainting, with momentum 0.9 and weight decay 0.0005. For the coach network, we use the Adam optimizer with a learning rate of 0.00001.
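As an illustration, the per-phase step schedule can be written as a simple function (a sketch, assuming the schedule restarts in each 100-epoch alternating phase):

```python
def inpainting_lr(epoch_in_phase: int) -> float:
    """Learning-rate steps within one alternating inpainting training phase."""
    if epoch_in_phase < 40:
        return 0.1
    if epoch_in_phase < 80:
        return 0.01
    if epoch_in_phase < 90:
        return 0.001
    return 0.0001
```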
We compare our self-supervised method against supervised models trained on the COVID-19 datasets. We keep the same network structure but change from supervised to self-supervised learning, and compare the performance of the two approaches. Our goal is to determine whether self-supervised learning helps the SInfNet improve its performance in segmenting the ground-glass opacities and consolidations in the infected regions of CT lung images.
Performance evaluation metrics
Five metrics are used to measure the models' performance: F1, Intersection over Union (IoU), recall, precision, and the area under the receiver operating characteristic (ROC) curve (AUC).
The F1-Score is also called the Dice Coefficient. It is used to measure the overlap between the ground-truth infected region and the predicted infected region. The F1-Score equation is defined as:
$$F1 = \frac{{2*\left| {T \cap P} \right|}}{\left| T \right| + \left| P \right|}$$
(8)
where T is the ground truth infected region and P is the predicted infected region.
The Intersection over Union (IoU) is a different method to measure the overlap between the ground truth infected region and the predicted infected region. The IoU equation is defined as:
$$IoU = \frac{{\left| {T \cap P} \right|}}{{\left| {T \cup P} \right|}}$$
(9)
where T is the ground truth infected region and P is the predicted infected region.
Recall is used to measure how much of the ground truth infected region is present in the predicted infected region. The equation is as follows:
$$Recall = \frac{{\left| {T \cap P} \right|}}{\left| T \right|}$$
(10)
where T is the ground truth infected region and P is the predicted infected region.
Precision is used to measure how much of the predicted infected region is present in the ground truth infected region. The equation is as follows:
$$Precision = \frac{{\left| {T \cap P} \right|}}{\left| P \right|}$$
(11)
where T is the ground truth infected region and P is the predicted infected region.
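The four overlap metrics (Eqs. 8-11) can be computed directly from binary masks, as in the following sketch (which assumes both masks are non-empty):

```python
import numpy as np

def overlap_metrics(t: np.ndarray, p: np.ndarray) -> dict:
    """F1/Dice, IoU, recall, and precision for binary masks
    t (ground truth) and p (prediction); Eqs. (8)-(11)."""
    t, p = t.astype(bool), p.astype(bool)
    inter = np.logical_and(t, p).sum()
    return {
        "F1": 2 * inter / (t.sum() + p.sum()),
        "IoU": inter / np.logical_or(t, p).sum(),
        "Recall": inter / t.sum(),
        "Precision": inter / p.sum(),
    }
```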
For each of the above performance metrics, we first perform the calculation within each test sample separately, and then compute the mean and the error interval of the estimated mean over the entire test set. The mean is defined as:
$$mean = \frac{{\sum\nolimits_{i = 1}^{N} {Metric(\hat{y}_{i} ,y_{i} )} }}{N}$$
(12)
where Metric refers to F1, IoU, Recall, Precision or AUC. N refers to the number of test samples. The error is defined as:
$$error = SE \times 1.96$$
(13)
where SE is the standard error of the test-sample scores for the given metric. The error interval of the estimated mean is [mean − error, mean + error].
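Equivalently, in code (a sketch; `scores` holds one per-sample metric value per test sample):

```python
import numpy as np

def mean_with_error(scores):
    """Mean (Eq. 12) and error = 1.96 * SE (Eq. 13), with SE = std / sqrt(N)."""
    scores = np.asarray(scores, dtype=float)
    se = scores.std(ddof=1) / np.sqrt(len(scores))
    return scores.mean(), 1.96 * se

# Usage: m, err = mean_with_error(per_sample_f1); interval = (m - err, m + err)
```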
Generation of image phenotypes
The well-trained multi SSInfNet outputs three kinds of image-level segments: overall lesion, GGO, and consolidation segments. Each kind of segment acts as a mask in the Python radiomics package PyRadiomics [41] for extracting image phenotypes. Three runs of phenotype extraction are executed, with the overall lesion segments, the GGO segments, and the consolidation segments, respectively, each paired with the original images. We select first-order measurements as well as Gray Level Co-occurrence Matrix (GLCM), Gray Level Dependence Matrix (GLDM), and Neighboring Gray Tone Difference Matrix (NGTDM) measurements as our image phenotypes; their definitions and formulas can be found in Additional file 1: Table S1. After the segment-based image-level phenotypes are generated, we average them per patient to obtain patient-level image phenotypes.
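A minimal sketch of one such extraction run with PyRadiomics is shown below; the file paths are placeholders, and the feature classes enabled here mirror the phenotype families listed above:

```python
from radiomics import featureextractor

# Enable only the phenotype families described above.
extractor = featureextractor.RadiomicsFeatureExtractor()
extractor.disableAllFeatures()
for feature_class in ("firstorder", "glcm", "gldm", "ngtdm"):
    extractor.enableFeatureClassByName(feature_class)

# One run per segment type: the predicted segment (overall lesion, GGO, or
# consolidation) serves as the mask over the original CT image.
features = extractor.execute("ct_image.nii.gz", "ggo_segment_mask.nii.gz")
```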
Mediation analysis
Univariate mediation analyses are performed to identify potential causal mechanisms in which age, gender, or underlying disease is associated with COVID-19 severity through an intermediate image phenotype. Let y be the dependent variable, the binarized COVID-19 severity. In the original ICTCF dataset, severity is measured on 9 levels: Control (Healthy), Control, Control (Community-acquired pneumonia), Suspected, Suspected (COVID-19-confirmed later), Mild, Regular, Severe, and Critically ill. We binarize these 9 levels by grouping Control (Healthy), Control, Control (Community-acquired pneumonia), and Suspected into one group (coded as 0) and Suspected (COVID-19-confirmed later), Mild, Regular, Severe, and Critically ill into another (coded as 1). Let m be a mediator (a patient-level image phenotype) and x an independent variable (age, gender, or underlying disease). We then fit the following regression models [12]:
$$\begin{gathered} y = \beta_{10} + \beta_{11} x + \epsilon_{1} \hfill \\ m = \beta_{20} + \beta_{21} x + \epsilon_{2} \hfill \\ y = \beta_{30} + \beta_{31} x + \beta_{32} m + \epsilon_{3} \hfill \\ \end{gathered}$$
(14)
Here, the \(\beta\)s and \(\epsilon\)s are the model parameters to be estimated and tested: the \(\beta\)s are the coefficients of the variables and the \(\epsilon\)s are the residuals. If all three regressions are significant (adjusted p-value < 0.05) and \(\left|{\beta }_{11}\right|>|{\beta }_{31}|\), we say that x is associated with y, mediated through m, which provides a potential mechanistic explanation of how x influences y through m. The indirect effect of x on y through m is defined as \({\beta }_{21}\times {\beta }_{32}\).
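For illustration, the three regressions and the mediation criteria can be sketched with statsmodels on a toy patient-level table (the column names and simulated data are ours):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated patient-level data: x = age, m = an image phenotype, y = severity.
rng = np.random.default_rng(0)
df = pd.DataFrame({"age": rng.uniform(20, 80, 200)})
df["phenotype"] = 0.02 * df["age"] + rng.normal(0, 0.5, 200)
df["severity"] = (0.01 * df["age"] + 0.5 * df["phenotype"]
                  + rng.normal(0, 0.5, 200) > 1.0).astype(int)

fit1 = smf.ols("severity ~ age", data=df).fit()              # y ~ x      (beta_11)
fit2 = smf.ols("phenotype ~ age", data=df).fit()             # m ~ x      (beta_21)
fit3 = smf.ols("severity ~ age + phenotype", data=df).fit()  # y ~ x + m  (beta_31, beta_32)

indirect_effect = fit2.params["age"] * fit3.params["phenotype"]  # beta_21 * beta_32
# The study uses adjusted p-values; plain p-values are shown here for brevity.
mediated = (fit1.pvalues["age"] < 0.05 and fit2.pvalues["age"] < 0.05
            and fit3.pvalues["phenotype"] < 0.05
            and abs(fit1.params["age"]) > abs(fit3.params["age"]))
```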
For multiple mediation analysis, we first perform a pair-wise correlation analysis of the significant mediators from the univariate mediation analysis using the R package corrplot [42], to control for potential confounding influence. Mediator pairs with an absolute correlation coefficient greater than 0.8 are identified, and one phenotype within each correlated pair is removed, preferring to drop the one with the smaller indirect effect or the one less commonly used in medical research. The remaining mediators, together with the two independent variables (age and underlying disease), are input into a multiple mediation model to identify the indirect effects when controlling for each other, using the R package lavaan [43]. Lavaan is a tool for structural equation modeling (SEM), a very general and powerful multivariate technique. SEM uses a conceptual model, path diagram, and linked regression-style equations to model complex relationships among a network of variables; it allows multiple independent variables and mediators, and even multiple dependent variables, in one model [43–45]. We build our equation system as below for our case (two independent variables, one dependent variable, and several mediators linked to different independent variables):
$$\begin{gathered} y = \theta_{00} + \theta_{01} x_{1} + \theta_{02} x_{2} + \varepsilon_{0} \hfill \\ M_{c} = \theta_{c0} + \theta_{c1} x_{1} + \theta_{c2} x_{2} + \varepsilon_{c} \hfill \\ M_{a} = \theta_{a0} + \theta_{a1} x_{1} + \varepsilon_{a} \hfill \\ M_{u} = \theta_{u0} + \theta_{u2} x_{2} + \varepsilon_{u} \hfill \\ y = \theta_{(n + 1)0} + \theta_{(n + 1)1} x_{1} + \theta_{(n + 1)2} x_{2} + \theta_{(n + 1)3} M_{1} + \ldots + \theta_{(n + 1)(n + 2)} M_{n} + \varepsilon_{n + 1} \hfill \\ \end{gathered}$$
(15)
where x1 is age and x2 is the underlying disease. Mc denotes the mediators significant for both age and underlying disease, Ma the mediators significant for age only, and Mu the mediators significant for underlying disease only. The \(\theta\)s are the coefficients, estimated and tested when the model is fit, and the \(\varepsilon\)s are the residuals.
Sensitivity analyses
A series of sensitivity analyses are performed to further support our conclusions. These include: a three-fold cross-validation performed with both the single SSInfNet and the multi SSInfNet to ensure that performance is consistent; a comparison with the transfer learning-based FCN8 segmentation network [46]; further experiments on other independent datasets [47] to show the generalization ability of our models; ablation studies to explore which of the techniques used in the multi SSInfNet (generative adversarial image inpainting, focal loss, and Lookahead optimizer) contribute to the improved performance; and a computational cost analysis comparing the different models' efficiency. The details of these analyses can be found in Additional file 1: Sensitivity Analysis.