Multi-parametric MRI phenotype with trustworthy machine learning for differentiating CNS demyelinating diseases

Background Misdiagnosis of multiple sclerosis (MS) and neuromyelitis optica (NMO) may delay the treatment, resulting in poor prognosis. However, the precise identification of these two diseases is still challenging in clinical practice. We aimed to evaluate the value of quantitative radiomic features extracted from the brain white matter lesions for differential diagnosis of MS and NMO. Methods We recruited 116 CNS demyelinating patients including 78 MS, and 38 NMO. Three neuroradiologists performed visual differential diagnosis based on brain MRI for comparison purpose. A multi-level scheme was designed to harness the selection of discriminative and stable radiomics features extracted from brain while mater lesions in T1-MPRAGE, T2 sequences and clinical factors. Based on the imaging phenotype composed of the selected radiomic and clinical features, Multi-parametric Multivariate Random Forest (MM-RF) model was constructed and verified with both 10-fold cross-validation and independent testing. Result interpretation was provided to build trust in diagnostic decisions. Results Eighty-six patients were randomly selected to form the training set while the rest 30 patients for independent testing. On the training set, our MM-RF model achieved accuracy 0.849 and AUC 0.826 in 10-fold cross-validation, which were significantly higher than clinical visual analysis (0.709 and 0.683, p < 0.05). In the independent testing, the MM-RF model achieved AUC 0.902, accuracy 0.871, sensitivity 0.873, specificity 0.869, respectively. Furthermore, age, sex and EDSS were found mildly correlated with the radiomic features (p of all < 0.05). Conclusions Multi-parametric radiomic features have potential as practical quantitative imaging biomarkers for differentiating MS from NMO. Supplementary Information The online version contains supplementary material available at 10.1186/s12967-021-03015-w.

There are several factors contributing to the difficulty of differential diagnosis, e.g., they share overlapped features in clinical symptoms such as myelitis, optic neuritis [1,4], and laboratory examinations (30% of the NMO patients had the same negative results of NMO immunoglobulin G as MS patients [5]). Misdiagnosis can lead to unprecise treatment and sometimes even exacerbation of the disease, as the treatment for MS differs greatly from that of NMO [6]. Magnetic resonance imaging (MRI) is routinely used in the differential diagnosis of MS and NMO; however, its specificity is limited because partial lesions in brain white matter of two diseases share similar lesion appearance, location distribution, and signal characteristics on MRI [3,[7][8][9]. In addition to similar neuroimaging characteristics, another common cause of MS misdiagnosis is the subjective visual observation and analysis, such as misinterpretation and misapplication of abnormal MRI findings as suggested by Solomon et al. [3]. Therefore, it is in high demand for a quantitative, repeatable and objective measurement for the differential diagnosis.
Radiomics is an emerging field with a surge of interest due to its capabilities to extract quantitative biomedical imaging "markers" for automated objective diagnosis [10,11], and potentially to foster individualized diagnosis. Empowered with machine learning and deep learning, radiomics methodology mines the valuable information underlying imaging that could be beyond the perception capacity of human beings and has been successfully applied for differential diagnosis of other CNS diseases [12][13][14]. Though radiomic models are able to produce promising diagnostic results with higher accuracy, clinicians often find it difficult to interpret the results from machine learning models. To be clinically applicable, there is an urgent need to address the "lacking interpretability" problem [15].
In this study, we aim to investigate a quantitative and objective MRI-based radiomics platform, equipped with individualized result interpretation, to provide clinicians with trustworthy assistance for diagnostic differentiation.

Subjects and MRI acquisition
This study was approved by the institutional review board of Xuanwu Hospital, Capital Medical University, and written informed consent was obtained from all participants. Totally 116 participants were recruited including 78 relapsing-remitting MS and 38 NMO patients. The impact of unbalanced data was comprehensively assessed with metrics including AUC, diagnosis accuracy, sensitivity and specificity. The patients were composed of two cohorts with different MRI protocols as shown in Additional file 1: Table S1. The first cohort  included patients from April 2004 to December 2004 who underwent brain scanning on 1.5 T MRI (Sonata; Siemens Medical Systems, Erlangen, Germany) with an 8-channel head coil. The second cohort included patients from November 2009 to April 2014 who had brain scanning on 3 T MRI (Siemens Magnetom Trio Tim System, Munich, Germany). As a proportion of patient cohorts were recruited before the introduction of the new MS and NMO criteria, our diagnosis of MS and NMO was based on the 2010 McDonald criteria, and the revised NMO diagnostic criteria, respectively [16,17]. None of these patients had been treated with medications within three months before the MRI was obtained. The demographic and clinical characteristics including Expanded Disability Status Scale (EDSS) score [18] and Disease Duration of the patients were recorded.

Clinical MRI review
Three neuroradiologists with 5, 7, and 10 years of MRI reading experience were involved in the visual assessment of the brain lesion and differential classification of MS and NMO patients. The assessment was based on T1-MPRAGE and T2 MRI sequences, while the clinical data (age, sex, disease duration and EDSS score) was allowed to refer during the assessment. Each neuroradiologist produced a diagnostic result for each patient based on their own clinical experience. In case of any discrepancy, it shall be jointly reviewed to reach an agreement. The assessments such as AUC, diagnosis accuracy, sensitivity and specificity were calculated.

Radiomic analysis overview
Radiomic analysis framework was composed of five main modules, including image segmentation, feature extraction, feature selection (phenotype building), machine learning modeling, and quantitative interpretation of results. An overview was provided in Fig. 1.

Image processing and feature extraction
Marking of hyperintense brain lesions volume on T2 sequences was performed by a neuroradiologist with more than 9 years of experience (J.H.) using MRIcro software (https:// people. cas. sc. edu/ rorden/ mricro/), and validated by a senior neuroradiologist (Z.Q.), who had more than 20 years of experience. The volume of interests (VOIs) delineated on T2 sequence were mapped to T1-MPRAGE sequence through rigid image registration to automatically obtain the corresponding VOIs from T1-MPRAGE sequence.
Then, from the VOIs of both MRI sequences, we extracted 1118 quantitative radiomic features for each sequence that embraced 18 intensity, 68 texture, 344 Laplace of Gaussian (LoG) features, and 688 wavelet features [19]. LoG and wavelet filters were applied before the texture feature extraction to reduce the impact of noise. After feature extraction, all features were standardized to be comparable in scale. Other descriptions and implementation details of radiomic features are described in Additional file 1: Appendix S1 and Table S2 respectively.

Multi-level Feature Selection for imaging phenotype construction
Our Multi-level Feature Selection algorithm aimed to solve two key challenges in feature selection in the clinical context, including: (1) selecting relevant and discriminative features from high-dimensional small-sample multimodal data, and (2) mining robust features across MRI images with different imaging quality (e.g., MRI images with different magnetic field strength), which is often neglected by feature selection algorithms [20]. To address these two challenges, we design a Multi-level Feature Selection algorithm, composed of univariatelevel and multivariate-level module, to jointly explore feature relevancy, robustness and discriminability.
In univariate-level module, we design a statistical filter scheme on the basis of Wilcoxon Rank-sum test to simultaneously select: (1) robust features by testing the statistical consistency across MRI with different image quality, and (2) relevant features through calculating statistical relevancy towards the outcome. Specifically, robust features across different magnetic field strength of MRI scanners (1.5 T and 3 T) were first selected with Wilcoxon test [21]. Then, discriminative features were selected by assessing whether there was a significant distribution difference between MS patients and NMO patients via Wilcoxon test [22].
In multivariate-level module, we propose a pyramid searching structure to first exploit intra-modal feature relationships and then explore inter-modality relationships. This pyramid searching scheme boosted feature discriminability and mining efficiency. In contrast, conventional feature selection often uses a flattened search space by concatenating all features for feature selection. In specific, Random Forest-based sequential forward selection (RF-SFS) was firstly employed to select discriminative features and construct preliminary phenotypes from T2, T1-MPRAGE and clinical features separately [23]. Then, a multi-parametric phenotype was constructed by further applying RF-SFS to a fused feature set of three preliminary phenotypes. The mathematical details of Multi-level Feature Selection were summarized in Additional file 1: Appendix S2.

Classification analysis
To compare the diagnostic performance of preliminary phenotypes of T2, T1-MPRAGE and clinical and the multi-parametric phenotype, three preliminary Multivariate Random Forest models and Multi-parametric Multivariate Random Forest model (MM-RF) were constructed based on the corresponding phenotype respectively. To handle data imbalance, balanced bootstrap mechanism and balanced weight [24] were incorporated into the Random Forest model. Common evaluation metrics for imbalanced datasets were used for assessing the diagnostic performance of models, including the area under the receiver operating characteristic curve (AUC), accuracy, sensitivity, and specificity. The stability of diagnostic performance was assessed with the mean of relative standard deviation (RSD) of AUC. The lower the AUC_RSD value, the higher the stability.

Validation
To rigorously validate the diagnostic performance of the imaging phenotypes, both 10-fold cross-validation on the training set and independent validation on the testing set were computed. From a total of 116 patients, 86 patients were randomly selected to form the training set, while the rest 30 patients was used for independent testing. Bootstrapping with 1000 times resampling was used in the independent validation.

Quantitative interpretation of results
Lack of interpretability is a key challenge as the basis for trustworthy decision making [36]. To provide quantitative interpretation, we utilized SHAP method [37] to analyse the differential decision from our MM-RF model at both individual-level and model-level. The individuallevel interpretation explained the output of an individual prediction by visualizing the important features in the phenotype and unveiling their importance for discrimination decisions. The model-level interpretation computed the average feature importance across all patients and revealed the relationship between the feature value and its importance.

Demographics and clinical and MRI characteristics
Seventy-eight relapsing-remitting MS patients (mean age ± SD: 36.5 years ± 10.0), 38 NMO patients (mean age ± SD: 40.9 years ± 11.7) participated in this study. The percentages of males out of all patients were 34.6%, 18.4%, respectively. There were no significant differences in sex and age between MS and NMO patients. NMO group showed a trend towards higher EDSS score than the MS group (p = 0.005). Other demographic characteristics of the participants were provided in Table 1.

Evaluation of multi-parametric phenotype in terms of discriminative ability
The multi-parametric phenotype was evaluated with both 10-fold cross validation and independent testing. In cross-validation, the multi-parametric phenotype achieved AUC 0.826 (95% CI 0.732-0.912), which was significantly higher than that of visual analysis (p = 0.016). The diagnostic accuracy was 0.849 (95% CI 0.767-0.919), higher than that of visual analysis (p = 0.008). Its sensitivity and specificity were 0.769 and 0.883, respectively. In the independent testing, the multi-parametric phenotype based on Random Forest model achieved AUC 0.902 ± 0.027, which was higher than the performance of preliminary T2, T1-MPRAGE and clinical phenotypes (Fig. 2C). The diagnostic AUC of T2, T1-MPRAGE and clinical phenotype was 0.852 ± 0.053, 0.880 ± 0.033 and 0.573 ± 0.055, respectively. Figure 2C also illustrates that multi-parametric phenotype achieved better stability of diagnostic performance. Other assessments of the multiparametric phenotype such as diagnostic accuracy, sensitivity and specificity were 0.871 ± 0.044, 0.873 ± 0.083 and 0.869 ± 0.051, respectively, as reported in Table 2. To assess the impact of 3 T and 1.5 T MRI, a further experiment showed that high diagnostic accuracy was achieved by both 3 T (0.856 ± 0.046) and 1.5 T (0.976 ± 0.074) cohort (p < 0.05). Table 2 shows the performance comparison of our Multilevel Feature Selection methods with eight state-ofthe-art methods. From a combined feature pool of T2, T1-MPR and clinical features, these comparison methods selected at most eight features as in our method, and were evaluated with the same Random Forest model as ours. The experimental results in Table 2 showed that our Multi-level Feature Selection algorithm outperformed these SOTA methods in comparison. Our multiparametric phenotype achieved highest AUC 0.902 ± 0.027, followed by our MPR phenotype (AUC 0.880 ± 0.033) and Anova Filter (AUC 0.879 ± 0.037).

Case studies for individual-level interpretation
As illustrated in Fig. 3, the two selected cases included: (a) an MS case, and (b) an NMO case, whose lesions were difficult to differentiate due to similar lesion location and signal characteristics. With extracted phenotype from VOIs in the MR images, our MM-RF classified the cases correctly with 89% confidence for MS case and NMO case with 86% confidence, respectively. For the MS case, our case-level interpretation revealed that H-MPR-log3-firstorder-Median, H-MPR-waveletLHLglszm-GLNU, and EDSS were the three most significant contributors for accurate classification, with 29.86%, 27.61% and 21.07% contribution, respectively. As a contrast, for the NMO case, three T1-MPRAGE features (H-MPR-log3-firstorder-Median, H-MPR-log4-gldm-SDLGLE, H-MPR-waveletLHL-glszm-GLNU) were three major contributors, contributing 25.06%, 24.78%, 17.46% towards the correct decision.

Model-level result interpretation
Model-level interpretation investigated the feature importance and the relationship between feature value and its importance from the perspective of all patients. Figure 4A shows case-level interpretation results of all patients in one graph, which visualizes how feature contributions differ for different cases. Of all eight features, H-MPR-log4-gldm-SDLGLE, H-MPR-log3-firstorder-Median, and H-T2-waveletLLH-glcm-JE were the top three important features in the model decision making, as shown in Fig. 4B, C. In terms of relationship between the feature value and its importance, there was a negative linear relationship for H-T2-waveletLLH-glcm-JE (Fig. 4D) and H-MPR-log4-gldm-SDLGLE (Fig. 4E), and a positive linear relationship for H-MPR-log3-firstorder-Median (Fig. 4F).

Discussion
In this research, we extracted the imaging phenotype from multi-parametric MRI sequences with the machine learning framework for automated differentiating MS from NMO, which provided an additional reference for timely differential diagnostic decision making. The major findings of this study include: (1) our multi-parametric phenotype was able to achieve high differential diagnostic performance, generalizability and robustness, mined by our designed Multi-level Feature Selection algorithm; (2) our radiomics platform provided individualized differential diagnosis and interpretation which was illustrated with a case study; and (3) the correlation between radiomic and clinical features was revealed to enhance trust in radiomic features.
The first finding of our study is that the multi-parametric phenotype demonstrated high differential diagnostic performance, which statistically outperformed visual analysis in terms of AUC (0.826 vs. 0.683, p = 0.016), and the diagnosis accuracy (0.849 vs. 0.709, p = 0.008) in 10-fold cross-validation. The accuracy of clinical visual analysis in our study complied with the studies [38][39][40][41][42] with the reported accuracy ranging from 0.573 to 0.739. In this study, doctors misdiagnosed about 25% of patients with MS as NMO, similar to the previous study [2], which justified the machine learning model provide valuable assistance for clinical decisions. Remarkably, the multi-parametric phenotype demonstrated the highest discriminative ability (AUC 0.902 ± 0.027) in the independent testing, outperforming the discriminative performance of the T2 (AUC 0.852), T1-MPRAGE (AUC 0.880) and clinical phenotypes (AUC 0.573). It indicates that the multi-parametric phenotype successfully fused pathological characteristics such as the information about edema, demyelination in T2 images, axonal damage in T1-MPRAGE images [43] and clinical information. This finding is consistent with previous studies in the differential diagnosis of brain tumors where the model embracing MRI-based radiomic features and clinical features can achieve the highest classification accuracy [44]. The present study, for the first time, constructs a multiparametric phenotype including T2, T1-MPRAGE and clinical information for differentiating MS from NMO.
Our Multi-level Feature Selection algorithm outperformed SOTA feature selection methods because the proposed algorithm comprehensively considered: (1) feature robustness across MRI images with different magnetic fields strength, (2) feature relevancy towards the outcome, and (3) intra-modal and inter-modal feature discriminability. Comparatively, filter methods [28][29][30] select features using the defined feature relevancy such as mutual information; however, these methods might not take account of the interaction with the learning algorithm, and hence feature discriminability might not be optimized [45]. To address this issue, both Wrapper [31,32] and Embedded [33][34][35] methods involve learning algorithms to assess the predictive performance of feature combinations. However, these methods might be prone to overfitting due to the dependency of the learning algorithm [46]. In contrast, our univariate-level selection selected robust and relevant features based on Wilcoxon testing, which addressed feature generalizability issue across different MRI imaging qualities (as analysed in [20]) and facilitated alleviating the risk of overfitting. Further, our multivariate-level selection boosted feature discriminability by exploiting intramodal feature interaction and inter-modality interaction using a pyramid search structure. Due to the reduced risk of overfitting and boosted feature discriminability, our algorithm outperformed SOTA methods.
Secondly, individual-level interpretation was provided to articulate machine learning-based decision making for individual patients and thus to facilitate the trustworthy and individualized differential diagnosis. It was achieved by graphical visualization of important features and unveiling of quantitative contribution of features in the machine learning models which facilitates understanding on both radiomic features and model decisions, as illustrated in case studies in Fig. 3A, B. Specifically, for the case in Fig. 3A, this patient was correctly classified as MS with 89% confidence by our phenotype, in which the top three important features were two T1-MPRAGE features and EDSS. And for the case in Fig. 3B, three T1-MPRAGE radiomic features were major contributors. Interpretability enables doctors to gain insight why the diagnosis is made, thus assisting clinicians to provide precise differential diagnosis [47,48]. Furthermore, the trust in the radiomic features was enhanced by revealing its connection with the clinical information. Mild correlations were observed between the radiomic features and clinical features (age, sex, and EDSS). Interestingly, we found that one T2 feature was related to EDSS (Fig. 5). The reason underlying the correlation between EDSS and T2 features might be that EDSS was found correlated with lesion load and brain atrophy [49,50], while T2 and T1-MPRAGE images could also reflect the information of lesion and brain structure, respectively. As a result, we may potentially use radiomic features to objectively and conveniently assist EDSS in evaluating the treatment and disability management in the future. Although the above assumptions are preliminary, our study provides a perspective for understanding the clinical significance of radiomic features, in response to the urgent clinical need [51].
We suggest that our multi-parametric phenotype may serve as an objective, quantitative tool to assist clinical differential diagnosis of MS and NMO. Compared with the current diagnostic criteria of these two diseases using MR, our phenotype holds advantages in three aspects: (1) Instead of diagnosis by naked eye based on vague clinical experience, our multi-parametric phenotype provide a quantitative solution by feature extraction of medical images, thus help complement and clarify the current diagnostic criteria; (2) The phenotype reduces inter-observer variety and subjectivity because the whole system is highly automated; and (3) The current model can achieve high diagnostic accuracy with conventional MR sequences, which is simple and operable in clinical practice [52].

Conclusion
Radiomic features extracted from T1-MPRAGE and T2 sequences are potential practical imaging biomarkers for differentiating the lesions of demyelinating diseases. They have been shown to be discriminative and robust in classifying the brain white matter lesions between MS and NMO. Effective interpretation of radiomic features coupled with machine learning methods can be used as an adjunct to traditional radiology to support the diagnostic process in clinical practice.