 Research
 Open Access
 Published:
MDAKRLS: Predicting human microbedisease association based on Kronecker regularized least squares and similarities
Journal of Translational Medicine volume 19, Article number: 66 (2021)
Abstract
Background
Microbes are closely related to human health and diseases. Identification of diseaserelated microbes is of great significance for revealing the pathological mechanism of human diseases and understanding the interaction mechanisms between microbes and humans, which is also useful for the prevention, diagnosis and treatment of human diseases. Considering the known diseaserelated microbes are still insufficient, it is necessary to develop effective computational methods and reduce the time and cost of biological experiments.
Methods
In this work, we developed a novel computational method called MDAKRLS to discover potential microbedisease associations (MDAs) based on the Kronecker regularized least squares. Specifically, we introduced the Hamming interaction profile similarity to measure the similarities of microbes and diseases besides Gaussian interaction profile kernel similarity. In addition, we introduced the Kronecker product to construct two kinds of Kronecker similarities between microbedisease pairs. Then, we designed the Kronecker regularized least squares with different Kronecker similarities to obtain prediction scores, respectively, and calculated the final prediction scores by integrating the contributions of different similarities.
Results
The AUCs value of global leaveoneout crossvalidation and 5fold crossvalidation achieved by MDAKRLS were 0.9327 and 0.9023 ± 0.0015, which were significantly higher than five stateoftheart methods used for comparison. Comparison results demonstrate that MDAKRLS has faster computing speed under two kinds of frameworks. In addition, case studies of inflammatory bowel disease (IBD) and asthma further showed 19 (IBD), 19 (asthma) of the top 20 prediction diseaserelated microbes could be verified by previously published biological or medical literature.
Conclusions
All the evaluation results adequately demonstrated that MDAKRLS has an effective and reliable prediction performance. It may be a useful tool to seek diseaserelated new microbes and help biomedical researchers to carry out followup studies.
Background
With the fast development of advanced analytical techniques and highthroughput methods for exploring complex microbial communities, in human disease and health, the role of the microbiome has gained widespread attention over the past decade [1, 2]. The microbial community is complex and immensely diverse, research showed that about 100 trillion archaeal and bacterial cells in the human gut which belong to more than 1000 species, tenfold the number of human cells [3, 4]. Microbes are closely related to human health and disease. Generally, most of the gut microbes are either harmless or even beneficial to the human body, such as which can contribute to normal immune function, improve metabolic capability and protect against enteric pathogens [5, 6]. Therefore, microbes are also considered as “forgotten organs” in host [7]. But, if the normal balance between the host and microbiota is broken, which may possibly induce many diseases, including asthma [8], inflammatory bowel disease (IBD) [9], brain disorders or neurodevelopmental deficits [10] and even cancer [11], and so on.
Unquestionably, it has a great significance to identify microbes related to diseases for revealing the pathological mechanism of human diseases and understanding the mechanisms of microbehost interactions. Some largescale projects have been initiated, such as the Human Microbiome Project (HMP) [12] and Metagenomics of the Human Intestinal Tract of European Union Project (MetaHIT) [4, 13]. It can help us to initially understand the significance of medicine and biology, functional states and healthy composition of the human microbiome [5]. It is still a challenge to understand how the microbiome influences human diseases, since the microbial community is complex and diverse. Effective computational methods could significantly reduce the time and cost of traditional culturebased microbial experiments. Researchers could select potential MDAs for experimental verification. In 2016, Ma et al. [14] manually collected and developed a human microbe–disease association database (HMDAD), which provided the foundation for identifying the MDAs through computational methods.
In general, we could transform this biological problem of predicting diseaserelated microbes into a link prediction task. In fact, some computational methods have been widely developed to solve the association or interaction problem such as miRNAdisease [15], drugtarget [16], lncRNAprotein [17] and protein–protein interaction [18] prediction problems, and so on. However, to the best of our knowledge, until 2016, there are almost no related MDAs prediction researches from a computational point of view. Thereafter, in 2016, Chen et al. [19] designed the first computational method called KATZHMDA for the prediction of MDAs. It is a KATZ measurebased network prediction method to solve the problem of MDAs prediction by calculating the Gaussian interaction profile (GIP) kernel similarity. Beyond that, in recent years, some networkbased methods were also proposed only using the GIP kernel similarity for prediction, which are primarily based on the fusion of known associations and heterogeneous data to construct the network, including random walkingbased methods [20, 21], label propagationbased method [22], pathbased method [23]. In 2017, Huang et al. [24] presented the NGRHMDA method by integrating two single recommendation methods (graph‑based scoring and neighbor‑based collaborative filtering prediction model), and achieved a good prediction result. With the fast development of machine learning technology [25, 26], some machine learningbased methods were also presented for MDAs prediction. For example, in 2017, Wang et al. [27] proposed a semisupervised method called LRLSHMDA based on the Laplacian regularized least squares method. In addition, in 2018, He et al. [28] and Shi et al. [29] developed machine learningbased method named GRNMFHMDA and BMCMDA for MDAs prediction, respectively, based on the graph regularized nonnegative matrix factorization and binary matrix completion.
In recent years, the above computational methods mainly utilized a basic assumption that microbes with similar functions will share similar noninteraction or interaction patterns with phenotype diseases [30, 31]. With the fast development of machine learning technology, the regularized least squares algorithm is a useful tool and has been widely used in the recommended system [32,33,34]. Although some computational methods have been developed, most diseaserelated microbes remain unknown and effective methods are still scarce [5, 35]. We could address or reduce some limitations to improve the prediction performance of the computational method. For example, some existing methods only used the GIP kernel similarity for extracting the efficacious information, which may lead to the algorithm inevitably biased against wellresearched microbes and diseases, multivariate information fusion will be more helpful for prediction. Beyond that, some existing methods did not consider that the effective contribution of diseases and microbes is uneven due to the number of diseases and microbes is different in the database [36]. It is necessary to improve calculation speed since some methods integrate multiple calculation methods which may be complex and timeconsuming. Some methods used many model parameters which may reduce robustness and do not apply to new data.
In this paper, considering some of the above limitations, we developed a novel computational method called MDAKRLS based on the Kronecker regularized least squares method to identify potential MDAs. It is a machine learningbased method and uses fewer model parameters, which can save time and obtain robust performance. First, we calculated Kronecker Gaussian similarity and Kronecker Hamming similarity of microbedisease pairs based on the known microbedisease association network. Then, the Kronecker regularized least squares algorithm used two different Kronecker similarities to obtain prediction scores, respectively. Finally, we obtained the final prediction results by integrating the contributions of different similarities. The experimental results of 5fold crossvalidation (5CV) and global leaveoneout crossvalidation (LOOCV) indicated that MDAKRLS can achieve superior performance by comparing it with five stateoftheart methods. In addition, case studies further demonstrated that MDAKRLS is a useful tool that can effectively identify potential MDAs.
Materials and methods
In this work, we proposed a novel method called MDAKRLS for inferring latent MDAs. Figure 1 describes the overall flow chart of MDAKRLS for prediction. The framework of prediction method consists of three steps. First, we constructed Kronecker Gaussian similarity \({K}_{G}\) and Kronecker Hamming similarity \({K}_{H}\) of microbedisease pairs by fully exploiting Gaussian interaction profile (GIP) kernel similarity and Hamming interaction profile (HIP) similarity from known microbe–disease association matrix, respectively. Second, we introduced the Kronecker regularized least squares algorithm based on two Kronecker similarity to construct loss function for prediction. Third, we used an integration strategy to get the final predicted association matrix. Finally, the final possibility score of each microbedisease pair can be calculated.
Human microbe–disease association data set
In this study, we used a widelyused benchmark data set (HMDAD) to evaluate the reliability and effectiveness of MDAKRLS. It was manually collected by Ma et al. [14] and can be available at http://www.cuilab.cn/hmdad. The database contains a total of 483 verified associations, 292 human microbes and 39 diseases. The microbedisease association data set adopted by us was downloaded from HMDAD in June, 2020. We finally obtained 450 verified associations after we removed repetitive associations. In fact, we represented the advantages of MDAKRLS through the overall HMDAD data set. For a better description, we constructed an adjacency matrix \(A \epsilon {R}^{39\times 292}\) to express the associations network.
Similarity measures
For a better description, in this study, set \(D=\left\{{d}_{1},{d}_{2},\dots ,{d}_{i},\dots {d}_{nd}\right\}\) and \(M=\left\{{m}_{1},{m}_{2},\dots ,{m}_{j},\dots ,{m}_{nm}\right\}\) denote the sets of diseases and microbes, respectively. We introduced an adjacency matrix \(A \epsilon {R}^{nd\times nm}\) to express the associations network, where variable \(nd\) denotes the numbers of diseases; \(nm\) represents the numbers of microbes. Besides, the adjacency matrix \(A \epsilon {R}^{nd\times nm}\) is defined as follows:
Set \(A\left({d}_{i}\right)\epsilon {\{0, 1\}}^{1*nm}\) represents the \(i\) th row of \(A\), which is a binary vector and denotes the interaction profile of the disease \({d}_{i}\). Similarly, \(A\left({m}_{j}\right)\epsilon {\{0, 1\}}^{nd*1}\) denotes the \(j\) th column of \(A\), which represents the interaction profile of the microbe \({m}_{j}\). According to the basic assumption, microbes with similar functions will share similar noninteraction or interaction patterns with phenotype diseases, which is widely used in the related studies. To integrate more effective information and uncover potential associations, we calculated the GIP kernel similarity and HIP similarity of human microbes and diseases, respectively.
GIP kernel similarity for microbes and diseases
To mine conveniently the topological structure information of association matrix \(A\), we used the GIP kernel similarity [19, 37] for measuring similarity of human microbes. Specifically, for two given microbes \({m}_{i}\) and \({m}_{j}\), we first extracted their interaction profiles \(A\left({m}_{i}\right)\) and \(A\left({m}_{j}\right)\) from the training adjacency matrix \(A\), respectively. Subsequently, the GIP kernel similarity of microbes can be calculated as follows:
where \({S}_{G}^{m}\) is defined as the microbe GIP kernel similarity matrix; \({\sigma }_{m}^{\mathrm{^{\prime}}}\) is a tradeoff parameter and we set \({\sigma }_{m}^{\mathrm{^{\prime}}}=1\) in the experiments; parameter \({\sigma }_{m}\) is applied to tuneup bandwidth of GIP kernel, which can be updated by the Eq. (3).
Similarly, we also obtained the disease GIP kernel similarity as follows:
where \({S}_{G}^{d}\) is defined as the disease GIP kernel similarity matrix; \({\sigma }_{d}^{\mathrm{^{\prime}}}\) is a tradeoff parameter and we set \({\sigma }_{d}^{\mathrm{^{\prime}}}=1\) in the experiments; parameter \({\sigma }_{d}\) is applied to tuneup bandwidth of GIP kernel, which can be updated by the Eq. (5).
HIP similarity for microbes and diseases
In this work, inspired by the Jiang et al.' s work [38], we introduced HIP similarity to measure the interaction profile similarity between microbe pairs from the training adjacency matrix \(A\). For HIP similarity, two microbes will have a lower similarity if they have more different corresponding values in the interaction profiles. Further, the HIP similarity of microbes \({m}_{i}\) and \({m}_{j}\) is defined as follows:
where \({S}_{H}^{m}\) denotes the microbe HIP similarity matrix; \(\left\cdot \right\) denotes the number of elements in the interaction profile.
Similarly, based on the interaction profiles of diseases, the HIP similarity of diseases can be calculated as follows:
where \({S}_{H}^{d}\) denotes the disease HIP similarity matrix.
MDAKRLS for microbedisease association prediction
Regularized least squares (RLS) and its extended versions are popular machine learning methods. In this work, to boost the predictable performance, a novel predict model called MDAKRLS is proposed to calculate the relevance scores between microbes and diseases by integrating the Kronecker product and RLS method.
We first obtained the Kronecker Gaussian similarity between microbedisease pairs by the GIP similarity matrix of microbes and diseases. Specifically, we use the following equation to define the similarity between the microbedisease pairs \(\left(m\left(i\right),d\left(p\right)\right)\) and \(\left(m\left(j\right),d\left(q\right)\right)\):
where \({S}_{G}^{m}\) and \({S}_{G}^{d}\) represent the GIP similarity matrix of microbes and diseases defined above, respectively. Let \(N=nd\times nm\) which represents the number of microbedisease pairs. The above equation can be represented by the Kronecker product as follows:
where \({K}_{G }\epsilon {R}^{N\times N}\) is defined as the Kronecker Gaussian similarity of microbedisease pairs. In the same manner, the Kronecker Hamming similarity matrix \({K}_{H }\epsilon {R}^{N\times N}\) of microbedisease pairs can be measured:
For a better description, in this work, we set \(X=\left({x}_{1},{x}_{2},\dots ,{x}_{i},\dots ,{x}_{N}\right)\), where \({x}_{i}\) denotes the \(i\) th microbedisease pair. \(vec\left(Y\right)=\left({y}_{1},{y}_{2},\dots ,{y}_{i},\dots ,{y}_{N}\right)\), where \(Y {\epsilon R}^{nd\times nm}\) denotes the training microbedisease adjacency matrix in the process of forecasting; \(vec\left(\cdot \right)\) is a vector operator that stacks the elements of all columns into a vector; \({y}_{i}\epsilon \{0, 1\}\) denotes the corresponding label of microbedisease pair \({x}_{i}\). The biological problem of predicting diseaserelated microbes can be transformed to learn a mapping function \({f}_{G}\) and calculate a corresponding association score. \(vec\left({F}_{G}\right)=\left({f}_{G}\left({x}_{1}\right),{f}_{G}\left({x}_{2}\right),\dots ,{f}_{G}\left({x}_{i}\right),\dots ,{f}_{G}\left({x}_{N}\right)\right)\), where \({F}_{G}\) denotes the prediction score matrix based on the Kronecker Gaussian similarity; \({f}_{G}\left({x}_{i}\right)\) represents the prediction score of microbedisease pair \({x}_{i}\) obtained by prediction function \({f}_{G}\).
In further work, first, we constructed the Kronecker regularized least squares [39] based on the Kronecker Gaussian similarity to solve the microbedisease prediction problem. The objective function based on the Tikhonov minimization problem is formulated as follows:
where \({\sigma }_{G}>0\) is a regularization coefficient used to adjust the regularization term and loss function of the objective function; \({\parallel {f}_{G}\parallel }_{k}\) is the norm of mapping function \({f}_{G}\) in Reproducing Kernel Hilbert Space (RKHS) [40] associated to the kernel \(k\). Based on the classical Representer Theorem [41], the solution of the Tikhonov regularization problem exists in the RKHS and can be calculated as follows:
According to the previous studies [37, 42], the optimal solution of the objective function can be further calculated as follows:
where \(I\) denotes the identity matrix.
Eigen decompositions were implemented on the GIP similarity matrix \({S}_{G}^{m}\) of microbes and GIP similarity matrix \({S}_{G}^{d}\) of diseases. We can get \({S}_{G}^{m}\)=\({V}_{G}^{m}{\Lambda }_{G}^{m}{{V}_{G}^{m}}^{\rm T}\) and \({S}_{G}^{d}\)=\({V}_{G}^{d}{\Lambda }_{G}^{d}{{V}_{G}^{d}}^{\rm T}\), respectively. According to the property of the Kronecker product, we can obtain the \({K}_{G}={S}_{G}^{m}\otimes {S}_{G}^{d}={V}_{G}{\Lambda }_{G}{{V}_{G}}^{\rm T}\), where \({V}_{G}={V}_{G}^{m}\otimes {V}_{G}^{d}\) and \({\Lambda }_{G}={\Lambda }_{G}^{m}\otimes {\Lambda }_{G}^{d}\). Then, we can transform the Eq. (13) as follows:
According to another property of the Kronecker product [43], \(\left({N}^{\rm T} {\otimes} M\right)vec\left(C\right)=vec\left(MCN\right)\), Eq. (14) can be rewritten as follows:
Finally, we will obtain the score matrix based on the Kronecker Gaussian similarity by the following equation:
where \(vec\left({X}_{G}\right)=\left({\Lambda }_{G}^{m}\otimes {\Lambda }_{G}^{d}\right){\left({\Lambda }_{G}^{m}\otimes {\Lambda }_{G}^{d}+{\sigma }_{G} I\right)}^{1}vec\left({{V}_{G}^{d}}^{\rm T}Y{V}_{G}^{m}\right).\)
In addition, we also can get another objective function and optimal solution based on the Kronecker Hamming similarity in a similar manner:
We implemented eigen decompositions on the HIP similarity matrix \({S}_{H}^{m}\) of microbes and HIP similarity matrix \({S}_{H}^{d}\) of diseases, and obtained the second score matrix based on the Kronecker Hamming similarity:
where \(vec\left({X}_{H}\right)=\left({\Lambda }_{H}^{m}\otimes {\Lambda }_{H}^{d}\right){\left({\Lambda }_{H}^{m}\otimes {\Lambda }_{H}^{d}+{\sigma }_{H} I\right)}^{1}vec\left({{V}_{H}^{d}}^{\rm T}Y{V}_{H}^{m}\right).\)
After obtaining the prediction matrix \({F}_{G}^{*}\) and \({F}_{H}^{*}\) based on the two different Kronecker similarities, respectively, we obtain the final prediction matrix by integrating their contributions as follows:
where \(w\) is a tradeoff parameter. Eventually, we will obtain the score matrix \({F}^{*}\). In the future research, the association with the high score will have a priority to be verified by biological experiment.
Results and discussion
Evaluation metrics
To measure the reliability and effectiveness of the proposed method, in the same experimental conditions, we implemented our method and reran the other five stateoftheart computational methods for comparison, under 5CV and global LOOCV framework. Notably, the GIP kernel similarity and HIP similarity of microbes and diseases should be recalculated in every round of the global LOOCV and 5CV framework.
Specifically, in the global LOOCV framework, all of the microbedisease pairs without associations were used as candidate samples, each of the known MDA was treated as a testing sample and the rest of the known MDAs were treated as a training set to conduct experiments. We can obtain the rank of every testing sample by comparing it with candidate samples. To visualize the prediction performance, 1specificity (false positive rates) and sensitivity (true positive rates) were calculated to plot the receiver operating characteristic (ROC) curves by setting different thresholds. For convenient observation, we calculated the area under the ROC curve (AUC) values to measure the ability of prediction method.
In the validation framework of 5CV, all observed microbedisease associations are randomly split into 5 subsets. Each of the 5 subsets is specified as an independent testing set and the rest of the 4 subsets are regarded as training sets. To weaken potential experimental bias caused by random sample division, the process of the experiment of every method was performed 100 times. Furthermore, the corresponding 1specificity and sensitivity were obtained for plotting the ROC curves. The corresponding AUC values were also calculated for evaluation. The AUC value of 1 means best prediction, while the AUC value of 0.5 indicates random prediction.
Parameter sensitivity analysis
There are three parameters (\({\sigma }_{G}\), \({\sigma }_{H}\) and \(w\)) in our model. In general, the prediction performance of the model depends on some parameters, and different scale values of the parameter will produce different prediction results. Here, to explore the properties of the proposed method and the influences of parameter and find the optimal parameter, we calculated the AUCs and made some comparison experiments with different initial parameters under the 5CV and LOOCV frameworks.
\({\sigma }_{G}\) and \({\sigma }_{H}\) are selftuned parameters of MDAKRLS. To promote robust performance and simplify the complex problem, we set the same variable value for parameters \({\sigma }_{G}\) and \({\sigma }_{H}\). The experimental results of the parameters have been shown in Fig. 2a. From the figure, the average AUC of MDAKRLS is greatly enhanced when the parameter increases from 0 to 5, and the performance remains almost unchanged as the value of the parameter increases from 5 to 35 under two kinds of frameworks. Finally, the values of parameters \({\sigma }_{G}\) and \({\sigma }_{H}\) were set as 30 to obtain a stable and optimal prediction result. Then, we fixed \({\sigma }_{G}\) and \({\sigma }_{H}\), and adjusted the tradeoff parameter \(w\). The relationship between the AUC value and the parameter \(w\) is shown in Fig. 2b. It can be seen that MDAKRLS will obtain the highest AUC whn \(w=0.8\), indicating that the Kronecker Gaussian similarity can provide more effective information for prediction. Finally, we obtained the best parameters for the following analysis, which can achieve better performance. The average AUC value of 5CV achieved by our proposed method based on the optimal parameters was 0.9023 ± 0.0015, and the AUC value of global LOOCV was 0.9327. The standard deviation and evaluation results demonstrate that used parameter values are reliable and robust for the proposed model.
Comparison with other methods
To validate the effectiveness of MDAKRLS, we compared it with five stateoftheart computational methods under the same experimental conditions, including KATZ measure (KATZHMDA) [19], Laplacian Regularized Least Squares (LRLSHMDA) [27], BiRandom Walk (BiRWHMDA) [20], Network Topological Similarity (NTSHMDA) [21] and Graph Regularized Nonnegative Matrix Factorization (GRNMFHMDA) [28] for human microbe–disease association prediction. Previous studies showed that these methods achieved effective prediction results. Here, we implemented the above 5 prediction methods for comparison under the global LOOCV and 5CV frameworks on the same benchmark data set. The comparison results are shown in Figs. 3 and 4, respectively.
Specifically, Fig. 3 shows AUC values and ROC curves of different methods under the global LOOCV framework. It can be observed from the figure that the AUC values of five comparative methods are the following: KATZHMDA (0.8382), LRLSHMDA (0.8916), BiRWHMDA (0.8964), NTSHMDA (0.9040) and GRNMFHMDA (0.8719). Our method obtained the highest AUC value (0.9327), which is superior to the other five methods. Similarly, we compared all methods in the framework of 5CV. The corresponding average AUC values and ROC curves of different methods have been shown in Fig. 4. As a result, the average AUC value of the proposed method is 0.9023, which performs better than KATZHMDA (0.8324), LRLSHMDA (0.8809), BiRWHMDA (0.8839), NTSHMDA (0.8918) and GRNMFHMDA (0.8555). The experimental results demonstrate that the proposed method is an effective and reliable prediction tool in inferring possible associations.
Practically, the traditional experimentbased methods are timeconsuming, some computational methods were proposed to save time. It is necessary to improve calculation speed and prediction accuracy for developing new reliable computational methods. Thus, calculation speed is an important metric for performance evaluation of different computational methods. Therefore, it is a fresh perspective and we implemented a runtime analysis. Specifically, we compared the calculation speed between the proposed method and five stateoftheart prediction methods under the global LOOCV and 5CV frameworks. The comparisons of calculation speed analysis are shown in Table 1. Our proposed method obtained a faster average running time under the 5CV framework. Moreover, the proposed method can use the shortest time for prediction under the global LOOCV framework. In brief, these results indicate the proposed method is reliable, effective and timesaving. It may be a useful tool for seeking diseaserelated new microbes.
Case studies
To further access the practical effect of the proposed method in inferring associated microbes with a disease without any known associated microbes, case study analysis [28, 44] was implemented on the MDAKRLS. For a given disease, we removed all known microbe associations in the HMDAD. Then, the proposed method was trained on the rest of the known associations and tested on the candidate microbe samples to seek the diseaserelated microbes, it can guarantee the independence between training data sets and validation data sets. In other words, the prediction model only depends on the rest of the known association information and the similarity measures of the training data sets. Specifically, in the microbedisease adjacency matrix \(A\), we converted all 1 to 0 for a given disease and ranked all microbe samples based on the prediction scores. The top ranked microbes will be further verified by the relevant literature and the method will be effective if the top prediction results have more verified microbes. To reveal the pathological relationship of diseases and microbes, in the framework of the MDAKRLS, we implemented independent case studies on two kinds of important human diseases: asthma and inflammatory bowel disease (IBD). It should be noted that we assume that the genus of this microbe will be associated with the disease if the microbe is associated with the disease when we validate microbes [21, 45].
Asthma is a common chronic inflammatory disease, which has substantial morbidity. According to statistics, more than 300 million patients were affected by asthma worldwide [8]. In this study, the top 20 prediction results of asthmarelated microbes are tabulated in Table 2. In the prediction list, there are some predictions have been validated by the HMDAD, the rest could be validated by the previously published biological and medical literature for asthmarelated microbes. Finally, 19 of the top 20 prediction microbes could be manually verified that they are related to asthma patients. For example, Actinobacteria, Firmicutes and Bacteroides have lower proportions in all sputum samples of asthmatic patients, while Proteobacteria and Staphylococcus aureus were higher [46, 47]. Moreover, Clostridium difficile colonized at 1 month of age, which was closely related to asthma at 6 to 7 years of age [48]. The clustering results of bacterial composition showed Enterobacteriaceae family were more abundant in healthy people, while Lachnospiraceae and Bifidobacterium were more abundant in the asthma patients [49]. In addition, in a study about children and infants, the fecal colonization of Clostridium coccoides subcluster XIVa species and Bacteroides fragilis subgroup can be served as early indicators, which will be good for the prevention of asthma [50]. Lactobacillus has been shown to be beneficial for children with asthma [51].
IBD is a chronic disabling gastrointestinal disease with a continually increasing incidence, which is a worldwide healthcare problem [9]. Similar to asthma, the top 20 prediction results of inflammatory bowel disease (IBD)related microbes are tabulated in Table 3. In the prediction list, based on the HMDAD and recently published biological and medical literature for IBDrelated microbes, 19 of the top 20 prediction microbes could be manually verified that they are related to the IBD patients. For example, previous studies showed Bacteroidetes, Prevotella and Firmicutes were associated with the formation of IBD [52, 53]. Clostridium difficile can aggravate flares of IBD, resulting in mortality and morbidity [54]. There is a negative relevant relation between IBD and Helicobacter pylori [55]. Compared with healthy people, Clostridium coccoides was less represented in active IBD patients [56]. In the salivary microbiota of IBD patients, Haemophilus, Veillonella and Prevotella were found that can largely contribute to dysbiosis [57]. In addition, in the faeces of IBD patients, the proportion of Lactobacillus increased, while Bifidobacterium decreased [58].
In addition, we also implemented case studies for three metabolic diseases including Obesity, Type 1 diabetes and Type 2 diabetes (see Additional file 1). Case studies indicate if one of the 39 human diseases does not have any known related microbes in the HMDAD, MDAKRLS can calculate the possibility of association between the disease and 292 microbes. The proposed method may be an effective tool for seeking diseaserelated possible new microbes. Then we further used MDAKRLS to rank all candidate microbes for all the diseases involved in HMDAD (see Additional file 2). We hope that the prediction list can provide aid, and more and more potential microbedisease pairs could be verified by clinical or biological experiment observation.
Conclusion
Identifying of MDAs could help us better understand the pathogenesis of human diseases, which is also useful for the prevention, diagnosis and treatment of human diseases. In this study, we developed a novel computational method called MDAKRLS based on the Kronecker regularized least squares. Firstly, we not only calculated the Kronecker Gaussian similarity of microbedisease pairs through the GIP kernel similarity of microbes and diseases, but also obtained the Kronecker Hamming similarity by the HIP similarity. Then, we developed the Kronecker regularized least squares based on the Kronecker product and RLS method to calculate the correlation scores of MDAs. A comparison of calculation speed showed our method has the advantage of fast calculating speed. The evaluation results of the 5CV and the global LOOCV framework demonstrated that MDAKRLS improved calculation accuracy and had a reliable prediction performance. In addition, case studies of IBD and asthma further indicated that MDAKRLS can effectively discover potential associations.
Several critical factors that make MDAKRLS has a reliable prediction performance. Firstly, different from some methods only using the GIP kernel similarity for prediction, we also introduced the HIP similarity to measure the similarities of microbes and diseases. Secondly, we used the Kronecker product to construct two kinds of Kronecker similarities of microbedisease pairs, which is complementary and can effectively mine the topological structure information of the network. Thirdly, In the process of solving the Tikhonov minimization problem, we introduced eigen decompositions to reduce the computational complexity. Kronecker regularized least squares is a machine learningbased method and uses fewer model parameters, thus saving time and improving robust performance. Of course, MDAKRLS needs to be improved in future work, such as some prior information of microbes or diseases could be introduced to improve the prediction performance; the insufficient number of experimentally verified MDAs limits the performance and development of the computational model.
The development of a reasonable and effective calculation model is conducive to the study of the microbial community. MDAKRLS has a good transplantation character, which is easily implemented to solve similar biological problems. The insufficient number of experimentally verified MDAs limits the performance and development of the computational model. At present, most diseaserelated microbes remain unknown in HMDAD. Therefore, it will be feasible and be of great practical significance to develop prediction algorithms that can effectively overcome the data sparsity problem. In addition, it is necessary to add more experimentally verified MDAs to improve the database, which can provide a foundation for improving the performance of computational method. We hope that our method could help biomedical researchers to carry out followup studies, and more and more potential microbedisease associations could be verified by clinical or biological experimental observation.
Availability of data and materials
The data set analyzed during the current study can be available at: http://www.cuilab.cn/hmdad.
Abbreviations
 MDAs:

Microbedisease associations
 IBD:

Inflammatory bowel disease
 HMDAD:

Human microbe–disease association database
 GIP:

Gaussian interaction profile
 HIP:

Hamming interaction profile
 5CV:

5fold crossvalidation
 LOOCV:

Leaveoneout crossvalidation
 ROC:

Receiver operating characteristic
 AUC:

Area under the ROC curve
References
Cho I, Blaser MJ. The human microbiome: At the interface of health and disease. Nat Rev Genet. 2012;13:260–70.
Consortium THMP. A framework for human microbiome research. Nature. 2012;486:215–21.
Tremaroli V, Bäckhed F. Functional interactions between the gut microbiota and host metabolism. Nature. 2012;489:242–9.
Qin J, Li R, Raes J, Arumugam M, Burgdorf KS, Manichanh C, et al. A human gut microbial gene catalogue established by metagenomic sequencing. Nature. 2010;464:59–65.
Zhao Y, Wang CC, Chen X. Microbes and complex diseases: from experimental results to computational models. Brief Bioinform. 2020;1:21.
Niu YW, Qu CQ, Wang GH, Yan GY. RWHMDA: Random Walk on Hypergraph for MicrobeDisease Association Prediction. Front Microbiol. 2019;10:1–10.
Clemente JC, Ursell LK, Parfrey LW, Knight R. The Impact of the Gut Microbiota on Human Health: An Integrative View. Cell. 2012;148:1258–70.
Lambrecht BN, Hammad H. The immunology of asthma. Nat Immunol. 2015;16:45–56.
Zhang YZ, Li YY. Inflammatory bowel disease: Pathogenesis. World J Gastroenterol. 2014;20:91–9.
Borre YE, O’Keeffe GW, Clarke G, Stanton C, Dinan TG, Cryan JF. Microbiota and neurodevelopmental windows: implications for brain disorders. Trends Mol Med. 2014;20:509–18.
Schwabe RF, Jobin C. The microbiome and cancer. Nat Rev Cancer Nature Publishing Group. 2013;13:800–12.
Turnbaugh PJ, Ley RE, Hamady M, FraserLiggett CM, Knight R, Gordon JI. The Human Microbiome Project. Nature. 2007;449:804–10.
Ehrlich SD, Consortium TM. MetaHIT: The European Union Project on Metagenomics of the Human Intestinal Tract. Metagenomics Hum Body. 2011;2:307–16.
Ma W, Zhang L, Zeng P, Huang C, Li J, Geng B, et al. An analysis of human microbedisease associations. Brief Bioinform. 2017;18:85–97.
Chen X, Xie D, Zhao Q, You ZH. MicroRNAs and complex diseases: From experimental results to computational models. Brief Bioinform. 2019;20:515–39.
Chen X, Yan CC, Zhang X, Zhang X, Dai F, Yin J, et al. Drugtarget interaction prediction: Databases, web servers and computational models. Brief Bioinform. 2016;17:696–712.
Zhang H, Liang Y, Han S, Peng C, Li Y. Long noncoding RNA and protein interactions: From experimental results to computational models based on network methods. Int J Mol Sci. 2019;20:9.
Xu D, Xu H, Zhang Y, Chen W, Gao R. ProteinProtein Interactions Prediction Based on Graph Energy and Protein Sequence Information. Molecules. 2020;25:1–15.
Chen X, Huang YA, You ZH, Yan GY, Wang XS. A novel approach based on KATZ measure to predict associations of human microbiota with noninfectious diseases. Bioinformatics. 2017;33:733–9.
Zou S, Zhang J, Zhang Z. A novel approach for predicting microbedisease associations by birandom walk on the heterogeneous network. PLoS ONE. 2017;12:1–16.
Luo J, Long Y. NTSHMDA: Prediction of Human MicrobeDisease Association based on Random Walk by Integrating Network Topological Similarity. IEEE/ACM Trans Comput Biol Bioinforma. 2020;17:1341–51.
Qu J, Zhao Y, Yin J. Identification and Analysis of Human MicrobeDisease Associations by Matrix Decomposition and Label Propagation. Front Microbiol. 2019;10:1–10.
Huang ZA, Chen X, Zhu Z, Liu H, Yan GY, You ZH, et al. PBHMDA: PathBased Human MicrobeDisease Association Prediction. Front Microbiol. 2017;8:1–10.
Huang YA, You ZH, Chen X, Huang ZA, Zhang S, Yan GY. Prediction of microbedisease association from the integration of neighbor and graph with collaborative recommendation model. J Transl Med. 2017;15:1–11.
Camacho DM, Collins KM, Powers RK, Costello JC, Collins JJ. NextGeneration Machine Learning for Biological Networks. Cell. 2018;173:1581–92.
Xu D, Zhang J, Xu H, Zhang Y, Chen W, Gao R, et al. Multiscale supervised clusteringbased feature selection for tumor classification and identification of biomarkers and targets on genomic data. BMC Genomics. 2020;21:1–17.
Wang F, Huang ZA, Chen X, Zhu Z, Wen Z, Zhao J, et al. LRLSHMDA: Laplacian Regularized Least Squares for Human MicrobeDisease Association prediction. Sci Rep. 2017;7:1–11.
He BS, Peng LH, Li Z. Human MicrobeDisease Association Prediction With Graph Regularized NonNegative Matrix Factorization. Front Microbiol. 2018;9:1–11.
Shi JY, Huang H, Zhang YN, Cao JB, Yiu SM. BMCMDA: A novel model for predicting human microbedisease associations via binary matrix completion. BMC Bioinformatics. 2018;19:77.
Bao W, Jiang Z, Huang DS. Novel human microbedisease association prediction using network consistency projection. BMC Bioinformatics. 2017;18:173–81.
Wang L, Wang Y, Li H, Feng X, Yuan D, Yang J. A bidirectional label propagation based computational model for potential microbedisease association prediction. Front Microbiol. 2019;10:44.
Chen X, Ren B, Chen M, Wang Q, Zhang L, Yan G. NLLSS: predicting synergistic drug combinations based on semisupervised learning. PLoS Comput Biol. 2016;12:1–23.
Xie G, Meng T, Luo Y, Liu Z. SKFLDA: Similarity Kernel Fusion for Predicting lncRNADisease Association. Mol Ther  Nucleic Acids. 2019;18:45–55.
Zhao Y, Chen X, Yin J, Qu J. SNMFSMMA: using symmetric nonnegative matrix factorization and Kronecker regularized least squares to predict potential small moleculemicroRNA association. RNA Biol. 2020;17:281–91.
Fan C, Lei X, Guo L, Zhang A. Predicting the associations between microbes and diseases by integrating multiple data sources and pathbased HeteSim scores. Neurocomputing. 2019;323:76–85.
Zhang W, Yang W, Lu X, Huang F, Luo F. The BiDirection Similarity Integration Method for Predicting MicrobeDisease Associations. IEEE Access. 2018;6:38052–61.
van Laarhoven T, Nabuurs SB, Marchiori E. Gaussian interaction profile kernels for predicting drugtarget interaction. Bioinformatics. 2011;27:3036–43.
Jiang L, Ding Y, Tang J, Guo F. MDASKF: similarity kernel fusion for accurately discovering miRNAdisease association. Front Genet. 2018;9:7.
Pahikkala T, Airola A, Pietilä S, Shakyawar S, Szwajda A, Tang J, et al. Toward more realistic drugtarget interaction predictions. Brief Bioinform. 2015;16:325–37.
Belkin M, Niyogi P, Sindhwani V. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. J Mach Learn Res. 2006;7:2399–434.
Rifkin R, Yeo G, Poggio T. Regularized LeastSquares Classification. NatoScience Ser Sub Ser III Comput Syst Sci. 2003;190:131–54.
Luo J, Xiao Q, Liang C, Ding P. Predicting MicroRNADisease Associations Using Kronecker Regularized Least Squares Based on Heterogeneous Omics Data. IEEE Access. 2017;5:2503–13.
Pahikkala T, Airola A, Stock M, De Baets B, Waegeman W. Efficient regularized leastsquares algorithms for conditional ranking on relational data. Mach Learn. 2013;93:321–56.
Peng LH, Yin J, Zhou L, Liu MX, Zhao Y. Human microbedisease association prediction based on adaptive boosting. Front Microbiol. 2018;9:1–9.
Long Y, Luo J. WMGHMDA: A novel weighted metagraphbased model for predicting human microbedisease association on heterogeneous information network. BMC Bioinform. 2019;20:1–18.
Marri PR, Stern DA, Wright AL, Billheimer D, Martinez FD. Asthmaassociated differences in microbial composition of induced sputum. J Allergy Clin Immunol. 2013;131:346–52.
Vael C, Nelen V, Verhulst SL, Goossens H, Desager KN. Early intestinal Bacteroides fragilis colonisation and development of asthma. BMC Pulm Med. 2008;8:1–6.
Van Nimwegen FA, Penders J, Stobberingh EE, Postma DS, Koppelman GH, Kerkhof M, et al. Mode and place of delivery, gastrointestinal microbiota, and their influence on asthma and atopy. J Allergy Clin Immunol. 2011;128:948–55.
AbdelAziz MI, Vijverberg SJH, Neerincx AH, Kraneveld AD, Maitland AH. The crosstalk between microbiome and asthma: Exploring associations and challenges. Clin Exp Allergy. 2019;49:1067–86.
Vael C, Vanheirstraeten L, Desager KN, Goossens H. Denaturing gradient gel electrophoresis of neonatal intestinal microbiota in relation to the development of asthma. BMC Microbiol. 2011;11:e23.
Huang CF, Chie WC, Wang IJ. Efficacy of lactobacillus administration in schoolage children with asthma: a randomized. PlaceboControlled Trial Nutr. 2018;10:1–11.
Juste C, Kreil DP, Beauvallet C, Guillot A, Vaca S, Carapito C, et al. Bacterial protein signals are associated with Crohn’s disease. Gut. 2014;63:1566–77.
Walters WA, Xu Z, Knight R. Metaanalyses of human gut microbes associated with obesity and IBD. FEBS Lett. 2014;588:4223–33.
Hashash JG, Binion DG. Managing Clostridium difficile in Inflammatory Bowel Disease (IBD). Curr Gastroenterol Rep. 2014;16:14–9.
Sonnenberg A, Genta RM. Low prevalence of Helicobacter pylori infection among patients with inflammatory bowel disease. Aliment Pharmacol Ther. 2012;35:469–76.
Sokol H, Seksik P, Furet JP, Firmesse O, NionLarmurier I, Beaugerie L, et al. Low Counts of Faecalibacterium prausnitzii in Colitis Microbiota. Inflamm Bowel Dis. 2009;15:1183–9.
Said HS, Suda W, Nakagome S, Chinen H, Oshima K, Kim S, et al. Dysbiosis of salivary microbiota in inflammatory bowel disease and its association with oral immunological biomarkers. DNA Res. 2014;21:15–25.
Takaishi H, Matsuki T, Nakazawa A, Takada T, Kado S, Asahara T, et al. Imbalance in intestinal microflora constitution could be involved in the pathogenesis of inflammatory bowel disease. Int J Med Microbiol. 2008;298:463–72.
Acknowledgements
MYW thanked the Weihai Engineering Technology Research Center for financial support. The authors thank the editors and anonymous reviewers for their reading time and constructive comments.
Funding
YSZ was supported by the National Natural Science Foundation of China under Grant (Nos. 61877064, U1806202). MYW thanked the Weihai Engineering Technology Research Center for financial support. RG was supported by the National Natural Science Foundation of China under Grant (Nos. U1806202 and 61533011).
Author information
Affiliations
Contributions
DX conceived and designed the study, wrote the manuscript. HXX, YSZ obtained and processed the data. DX, HXX and MYW performed the experiments and analyzed the results. MYW, RG and WC provided suggestions and supervised the research. All authors read and approved the final manuscript.
Corresponding authors
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Additional file 1.
The prediction results of the top 20 associated microbes with Obesity, Type 1 diabetes and Type 2 diabetes.
Additional file 2.
We further used MDAKRLS to rank all candidate microbes for all the diseases involved in HMDAD. The prediction results may help biomedical researchers conduct experimental validation and followup research.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Xu, D., Xu, H., Zhang, Y. et al. MDAKRLS: Predicting human microbedisease association based on Kronecker regularized least squares and similarities. J Transl Med 19, 66 (2021). https://doi.org/10.1186/s12967021027326
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s12967021027326
Keywords
 Association prediction
 Microbe
 Disease
 Machine learning
 Kronecker regularized least squares