In this work, we proposed a novel method called MDAKRLS for inferring latent MDAs. Figure 1 describes the overall flow chart of MDAKRLS for prediction. The framework of prediction method consists of three steps. First, we constructed Kronecker Gaussian similarity \({K}_{G}\) and Kronecker Hamming similarity \({K}_{H}\) of microbe-disease pairs by fully exploiting Gaussian interaction profile (GIP) kernel similarity and Hamming interaction profile (HIP) similarity from known microbe–disease association matrix, respectively. Second, we introduced the Kronecker regularized least squares algorithm based on two Kronecker similarity to construct loss function for prediction. Third, we used an integration strategy to get the final predicted association matrix. Finally, the final possibility score of each microbe-disease pair can be calculated.
Human microbe–disease association data set
In this study, we used a widely-used benchmark data set (HMDAD) to evaluate the reliability and effectiveness of MDAKRLS. It was manually collected by Ma et al. [14] and can be available at http://www.cuilab.cn/hmdad. The database contains a total of 483 verified associations, 292 human microbes and 39 diseases. The microbe-disease association data set adopted by us was downloaded from HMDAD in June, 2020. We finally obtained 450 verified associations after we removed repetitive associations. In fact, we represented the advantages of MDAKRLS through the overall HMDAD data set. For a better description, we constructed an adjacency matrix \(A \epsilon {R}^{39\times 292}\) to express the associations network.
Similarity measures
For a better description, in this study, set \(D=\left\{{d}_{1},{d}_{2},\dots ,{d}_{i},\dots {d}_{nd}\right\}\) and \(M=\left\{{m}_{1},{m}_{2},\dots ,{m}_{j},\dots ,{m}_{nm}\right\}\) denote the sets of diseases and microbes, respectively. We introduced an adjacency matrix \(A \epsilon {R}^{nd\times nm}\) to express the associations network, where variable \(nd\) denotes the numbers of diseases; \(nm\) represents the numbers of microbes. Besides, the adjacency matrix \(A \epsilon {R}^{nd\times nm}\) is defined as follows:
$$A\left(i,j\right)=\left\{\begin{array}{c}1, \quad if \, disease \, {d}_{i} \, is \, related \, to \, microbe \, {m}_{j}\\ 0, \quad otherwise\end{array}\right.$$
(1)
Set \(A\left({d}_{i}\right)\epsilon {\{0, 1\}}^{1*nm}\) represents the \(i\) th row of \(A\), which is a binary vector and denotes the interaction profile of the disease \({d}_{i}\). Similarly, \(A\left({m}_{j}\right)\epsilon {\{0, 1\}}^{nd*1}\) denotes the \(j\) th column of \(A\), which represents the interaction profile of the microbe \({m}_{j}\). According to the basic assumption, microbes with similar functions will share similar non-interaction or interaction patterns with phenotype diseases, which is widely used in the related studies. To integrate more effective information and uncover potential associations, we calculated the GIP kernel similarity and HIP similarity of human microbes and diseases, respectively.
GIP kernel similarity for microbes and diseases
To mine conveniently the topological structure information of association matrix \(A\), we used the GIP kernel similarity [19, 37] for measuring similarity of human microbes. Specifically, for two given microbes \({m}_{i}\) and \({m}_{j}\), we first extracted their interaction profiles \(A\left({m}_{i}\right)\) and \(A\left({m}_{j}\right)\) from the training adjacency matrix \(A\), respectively. Subsequently, the GIP kernel similarity of microbes can be calculated as follows:
$${S}_{G}^{m}\left({m}_{i}, {m}_{j}\right)=exp\left(-{\sigma }_{m}{||A\left({m}_{i}\right)-A\left({m}_{j}\right)||}^{2}\right)$$
(2)
$${\sigma }_{m} ={\sigma }_{m}^{\mathrm{^{\prime}}}/\left(\frac{1}{nm}\sum_{k=1}^{nm}{||A\left({m}_{k}\right)||}^{2}\right)$$
(3)
where \({S}_{G}^{m}\) is defined as the microbe GIP kernel similarity matrix; \({\sigma }_{m}^{\mathrm{^{\prime}}}\) is a trade-off parameter and we set \({\sigma }_{m}^{\mathrm{^{\prime}}}=1\) in the experiments; parameter \({\sigma }_{m}\) is applied to tune-up bandwidth of GIP kernel, which can be updated by the Eq. (3).
Similarly, we also obtained the disease GIP kernel similarity as follows:
$${S}_{G}^{d}\left({d}_{p}, {d}_{q}\right)=exp\left(-{\sigma }_{d}{||A\left({d}_{p}\right)-A\left({d}_{q}\right)||}^{2}\right)$$
(4)
$${\sigma }_{d} ={\sigma }_{d}^{\mathrm{^{\prime}}}/\left(\frac{1}{nd}\sum_{k=1}^{nd}{||A\left({d}_{k}\right)||}^{2}\right)$$
(5)
where \({S}_{G}^{d}\) is defined as the disease GIP kernel similarity matrix; \({\sigma }_{d}^{\mathrm{^{\prime}}}\) is a trade-off parameter and we set \({\sigma }_{d}^{\mathrm{^{\prime}}}=1\) in the experiments; parameter \({\sigma }_{d}\) is applied to tune-up bandwidth of GIP kernel, which can be updated by the Eq. (5).
HIP similarity for microbes and diseases
In this work, inspired by the Jiang et al.' s work [38], we introduced HIP similarity to measure the interaction profile similarity between microbe pairs from the training adjacency matrix \(A\). For HIP similarity, two microbes will have a lower similarity if they have more different corresponding values in the interaction profiles. Further, the HIP similarity of microbes \({m}_{i}\) and \({m}_{j}\) is defined as follows:
$${S}_{H}^{m}\left({m}_{i},{m}_{j}\right)=1-\frac{\left|A\left({m}_{i}\right)!=A\left({m}_{j}\right)\right|}{\left|A\left({m}_{i}\right)\right|}$$
(6)
where \({S}_{H}^{m}\) denotes the microbe HIP similarity matrix; \(\left|\cdot \right|\) denotes the number of elements in the interaction profile.
Similarly, based on the interaction profiles of diseases, the HIP similarity of diseases can be calculated as follows:
$${ S}_{H}^{d}\left({d}_{p},{d}_{q}\right)=1-\frac{\left|A\left({d}_{p}\right)!=A\left({d}_{q}\right)\right|}{\left|A\left({d}_{p}\right)\right|}$$
(7)
where \({S}_{H}^{d}\) denotes the disease HIP similarity matrix.
MDAKRLS for microbe-disease association prediction
Regularized least squares (RLS) and its extended versions are popular machine learning methods. In this work, to boost the predictable performance, a novel predict model called MDAKRLS is proposed to calculate the relevance scores between microbes and diseases by integrating the Kronecker product and RLS method.
We first obtained the Kronecker Gaussian similarity between microbe-disease pairs by the GIP similarity matrix of microbes and diseases. Specifically, we use the following equation to define the similarity between the microbe-disease pairs \(\left(m\left(i\right),d\left(p\right)\right)\) and \(\left(m\left(j\right),d\left(q\right)\right)\):
$${K}_{G}\left(\left(m\left(i\right),d\left(p\right)\right),\left(m\left(j\right),d\left(q\right)\right)\right)={S}_{G}^{m}\left(m\left(i\right),m\left(j\right)\right)*{S}_{G}^{d}\left(d\left(p\right),d\left(q\right)\right)$$
(8)
where \({S}_{G}^{m}\) and \({S}_{G}^{d}\) represent the GIP similarity matrix of microbes and diseases defined above, respectively. Let \(N=nd\times nm\) which represents the number of microbe-disease pairs. The above equation can be represented by the Kronecker product as follows:
$${K}_{G}={S}_{G}^{m}\otimes {S}_{G}^{d}$$
(9)
where \({K}_{G }\epsilon {R}^{N\times N}\) is defined as the Kronecker Gaussian similarity of microbe-disease pairs. In the same manner, the Kronecker Hamming similarity matrix \({K}_{H }\epsilon {R}^{N\times N}\) of microbe-disease pairs can be measured:
$${K}_{H}={S}_{H}^{m}\otimes {S}_{H}^{d}$$
(10)
For a better description, in this work, we set \(X=\left({x}_{1},{x}_{2},\dots ,{x}_{i},\dots ,{x}_{N}\right)\), where \({x}_{i}\) denotes the \(i\) th microbe-disease pair. \(vec\left(Y\right)=\left({y}_{1},{y}_{2},\dots ,{y}_{i},\dots ,{y}_{N}\right)\), where \(Y {\epsilon R}^{nd\times nm}\) denotes the training microbe-disease adjacency matrix in the process of forecasting; \(vec\left(\cdot \right)\) is a vector operator that stacks the elements of all columns into a vector; \({y}_{i}\epsilon \{0, 1\}\) denotes the corresponding label of microbe-disease pair \({x}_{i}\). The biological problem of predicting disease-related microbes can be transformed to learn a mapping function \({f}_{G}\) and calculate a corresponding association score. \(vec\left({F}_{G}\right)=\left({f}_{G}\left({x}_{1}\right),{f}_{G}\left({x}_{2}\right),\dots ,{f}_{G}\left({x}_{i}\right),\dots ,{f}_{G}\left({x}_{N}\right)\right)\), where \({F}_{G}\) denotes the prediction score matrix based on the Kronecker Gaussian similarity; \({f}_{G}\left({x}_{i}\right)\) represents the prediction score of microbe-disease pair \({x}_{i}\) obtained by prediction function \({f}_{G}\).
In further work, first, we constructed the Kronecker regularized least squares [39] based on the Kronecker Gaussian similarity to solve the microbe-disease prediction problem. The objective function based on the Tikhonov minimization problem is formulated as follows:
$$J\left({f}_{G}\right)=\frac{1}{2}\sum_{i=1}^{N}{\left({y}_{i}-{f}_{G}\left({x}_{i}\right)\right)}^{2}+\frac{{\sigma }_{G}}{2}{\parallel {f}_{G}\parallel }_{k}^{2}$$
(11)
where \({\sigma }_{G}>0\) is a regularization coefficient used to adjust the regularization term and loss function of the objective function; \({\parallel {f}_{G}\parallel }_{k}\) is the norm of mapping function \({f}_{G}\) in Reproducing Kernel Hilbert Space (RKHS) [40] associated to the kernel \(k\). Based on the classical Representer Theorem [41], the solution of the Tikhonov regularization problem exists in the RKHS and can be calculated as follows:
$${f}_{G}\left({x}_{i}\right)=\sum_{j=1}^{N}{\alpha }_{j}{K}_{G}\left({x}_{i},{x}_{j}\right)$$
(12)
According to the previous studies [37, 42], the optimal solution of the objective function can be further calculated as follows:
$$vec\left({F}_{G}\right)={K}_{G}{({K}_{G}+{\sigma }_{G} I)}^{-1}vec\left(Y\right)$$
(13)
where \(I\) denotes the identity matrix.
Eigen decompositions were implemented on the GIP similarity matrix \({S}_{G}^{m}\) of microbes and GIP similarity matrix \({S}_{G}^{d}\) of diseases. We can get \({S}_{G}^{m}\)=\({V}_{G}^{m}{\Lambda }_{G}^{m}{{V}_{G}^{m}}^{\rm T}\) and \({S}_{G}^{d}\)=\({V}_{G}^{d}{\Lambda }_{G}^{d}{{V}_{G}^{d}}^{\rm T}\), respectively. According to the property of the Kronecker product, we can obtain the \({K}_{G}={S}_{G}^{m}\otimes {S}_{G}^{d}={V}_{G}{\Lambda }_{G}{{V}_{G}}^{\rm T}\), where \({V}_{G}={V}_{G}^{m}\otimes {V}_{G}^{d}\) and \({\Lambda }_{G}={\Lambda }_{G}^{m}\otimes {\Lambda }_{G}^{d}\). Then, we can transform the Eq. (13) as follows:
$$\begin{aligned} vec\left({F}_{G}\right)=&{V}_{G}{\Lambda }_{G}{{V}_{G}}^{\rm T}{\left({V}_{G}{\Lambda }_{G}{{V}_{G}}^{\rm T}+{\sigma }_{G} I\right)}^{-1}vec\left(Y\right)\\ =&{V}_{G}{\Lambda }_{G}{\left({\Lambda }_{G}+{\sigma }_{G} I\right)}^{-1}{{V}_{G}}^{\rm T}vec\left(Y\right)\end{aligned}$$
(14)
According to another property of the Kronecker product [43], \(\left({N}^{\rm T} {\otimes} M\right)vec\left(C\right)=vec\left(MCN\right)\), Eq. (14) can be rewritten as follows:
$$vec\left({F}_{G}\right)={(V}_{G}^{m}\otimes {V}_{G}^{d})\left({\Lambda }_{G}^{m}\otimes {\Lambda }_{G}^{d}\right){\left({\Lambda }_{G}^{m}\otimes {\Lambda }_{G}^{d}+{\sigma }_{G} I\right)}^{-1}({{V}_{G}^{m}}^{\rm T}\otimes {{V}_{G}^{d}}^{\rm T})vec\left(Y\right)$$
$$={(V}_{G}^{m}\otimes {V}_{G}^{d})\left({\Lambda }_{G}^{m}\otimes {\Lambda }_{G}^{d}\right){\left({\Lambda }_{G}^{m}\otimes {\Lambda }_{G}^{d}+{\sigma }_{G} I\right)}^{-1}vec({{V}_{G}^{d}}^{\rm T}Y{V}_{G}^{m})$$
$$={(V}_{G}^{m}\otimes {V}_{G}^{d})vec({X}_{G})$$
$$=vec\left({V}_{G}^{d}{X}_{G}{{V}_{G}^{m}}^{\rm T}\right)$$
(15)
Finally, we will obtain the score matrix based on the Kronecker Gaussian similarity by the following equation:
$${F}_{G}^{*}={V}_{G}^{d}{X}_{G}{{V}_{G}^{m}}^{\rm T}$$
(16)
where \(vec\left({X}_{G}\right)=\left({\Lambda }_{G}^{m}\otimes {\Lambda }_{G}^{d}\right){\left({\Lambda }_{G}^{m}\otimes {\Lambda }_{G}^{d}+{\sigma }_{G} I\right)}^{-1}vec\left({{V}_{G}^{d}}^{\rm T}Y{V}_{G}^{m}\right).\)
In addition, we also can get another objective function and optimal solution based on the Kronecker Hamming similarity in a similar manner:
$$J\left({f}_{H}\right)=\frac{1}{2}\sum_{i=1}^{N}{\left({y}_{i}-{f}_{H}\left({x}_{i}\right)\right)}^{2}+\frac{{\sigma }_{H}}{2}{\parallel {f}_{H}\parallel }_{k}^{2}$$
(17)
$$vec\left({F}_{H}\right)={K}_{H}{({K}_{H}+{\sigma }_{H} I)}^{-1}vec\left(Y\right)$$
(18)
We implemented eigen decompositions on the HIP similarity matrix \({S}_{H}^{m}\) of microbes and HIP similarity matrix \({S}_{H}^{d}\) of diseases, and obtained the second score matrix based on the Kronecker Hamming similarity:
$${F}_{H}^{*}={V}_{H}^{d}{X}_{H}{{V}_{H}^{m}}^{\rm T}$$
(19)
where \(vec\left({X}_{H}\right)=\left({\Lambda }_{H}^{m}\otimes {\Lambda }_{H}^{d}\right){\left({\Lambda }_{H}^{m}\otimes {\Lambda }_{H}^{d}+{\sigma }_{H} I\right)}^{-1}vec\left({{V}_{H}^{d}}^{\rm T}Y{V}_{H}^{m}\right).\)
After obtaining the prediction matrix \({F}_{G}^{*}\) and \({F}_{H}^{*}\) based on the two different Kronecker similarities, respectively, we obtain the final prediction matrix by integrating their contributions as follows:
$${F}^{*}=w\cdot {F}_{G}^{*}+\left(1-w\right)\cdot {F}_{H}^{*}$$
(20)
where \(w\) is a trade-off parameter. Eventually, we will obtain the score matrix \({F}^{*}\). In the future research, the association with the high score will have a priority to be verified by biological experiment.