A novel semi-supervised model for miRNA-disease association prediction based on ℓ1-norm graph

Identification of miRNA-disease associations has attracted much attention recently due to the functional roles of miRNAs in various biological and pathological processes. Great efforts have been made to discover potential associations between miRNAs and diseases both experimentally and computationally. Although reliable, experimental methods are in general time-consuming and labor-intensive. In comparison, computational methods are more efficient and applicable to large-scale datasets. In this paper, we propose a novel semi-supervised model to predict miRNA-disease associations via ℓ1-norm graph. Specifically, we first recalculate the miRNA functional similarities as well as the disease semantic similarities based on the latest version of MeSH descriptors and HMDD. We then update the similarity matrices and association matrix iteratively in both miRNA space and disease space. The optimized association matrices from each space are combined as the final output. Compared with four state-of-the-art prediction methods, our method achieved favorable performance with AUCs of 0.943 and 0.946 in global LOOCV and local LOOCV, respectively. In addition, we carried out three types of case studies on five common human diseases, and most of the top 50 predicted miRNAs were confirmed to be associated with the investigated diseases by four databases: dbDEMC, PhenomiR, miR2Disease and miRwayDB. Notably, our results provided potential evidence that miRNAs within the same family or cluster are likely to play functional roles together in given diseases. Taken together, the experimental results demonstrate the utility of the proposed method. We anticipate that our method could serve as a reliable and efficient tool for miRNA-disease association prediction.


Background
MicroRNAs (miRNAs) are small single-stranded RNAs that repress mRNA translation and trigger mRNA degradation at the post-transcriptional level. Since the discovery of the first two miRNAs in mammalian cells, there has been tremendous and growing interest among researchers in investigating the role of miRNAs in normal cellular as well as disease processes [1]. Compelling evidence has demonstrated the fundamental importance of miRNAs in normal development, differentiation, growth control and in human diseases such as cancer [2]. For instance, the overexpression of miR-193a-3p and miR-224 increases cell proliferation in renal cell carcinoma by directly targeting ST3GalIV via the PI3K/Akt pathway [3], and miR-197 induces epithelial-mesenchymal transition and invasion through the downregulation of HIPK2 in lung adenocarcinoma [4]. Emerging evidence also suggests that substitution of tumor suppressive miRNAs or inhibition of oncogenic miRNAs could be used to develop novel treatment strategies [5]. Therefore, the identification of disease-related miRNAs is of great significance for new drug design and therapeutic development for complex human diseases.
Great efforts have been made to discover potential associations between miRNAs and diseases using experimental approaches. Jones et al. found that miR-186-5p was involved in prostate cancer cell proliferation and invasion through qRT-PCR and western blot [6]. Similarly, Cui et al. found that decreased miR-337 expression was significantly associated with tumor stage and lymph node metastasis of hepatocellular carcinoma based on the analysis of transfection of miR-337 mimics [7]. Although reliable, experimental methods are generally time-consuming and cannot be applied to large-scale datasets. With the accumulation of multiple data sources, a number of computational methods have been developed to predict reliable miRNA-disease associations [8][9][10]. Under the assumption that functionally related miRNAs tend to be involved in phenotypically similar diseases and vice versa, Jiang et al. developed the first computational model to prioritize disease-related miRNAs by constructing a scoring system based on the hypergeometric distribution [11]. Following their seminal work, Chen et al. adopted global network similarities and developed random walk with restart to infer potential miRNA-disease associations [12]. Shi et al. also used random walk with restart to calculate an enrichment score by integrating miRNA target information as well as protein-protein interactions [13]. Xuan et al. first calculated the miRNA functional similarity by taking miRNA family and cluster information into account, and then prioritized disease-related miRNAs in terms of the weighted k most similar neighbors [14]. However, their method cannot be applied to diseases without any known associated miRNAs. To solve this issue, they later proposed another approach called MIDPE, based on a bilayer random walk model in which different categories of nodes were assigned different transition weights [15]. Mørk et al. inferred miRNA-disease associations by coupling known and predicted miRNA-protein associations with protein-disease associations text-mined from the literature. Besides linking miRNAs to diseases, their approach directly suggested the underlying proteins, which can be further validated experimentally [16]. By taking advantage of tissue-specific miRNA expression profiles and miRNA target information, Zhao et al. calculated the enrichment significance of the known pathway over gene clusters to identify cancer-related miRNAs [17]. Nonetheless, their method relies on tissue-specific miRNA expression profiles, which can be difficult to obtain. Chen et al. first calculated the within-score and between-score from the view of miRNAs and diseases respectively, and then combined them to obtain final scores for the prioritization of miRNA-disease associations [18]. Liu et al. implemented random walk on a heterogeneous network constructed by integrating multiple data sources, including gene functional similarities, miRNA-target gene associations, miRNA-lncRNA associations and lncRNA similarity, which improved the prediction accuracy of previous methods [19]. Recently, Chen et al. proposed Heterogeneous Graph Inference for MiRNA-Disease Association (HGIMDA), which iteratively updates the association matrix based on the miRNA functional similarity matrix and the disease semantic similarity matrix simultaneously [20]. Leave-one-out cross validation demonstrated that HGIMDA achieved comparable results.
Several machine learning-based models were also developed to predict potential miRNA-disease associations. Jiang et al. extracted a set of features for each positive and negative miRNA-disease association and trained a support vector machine (SVM) for the classification [21]. Chen et al. constructed a continuous classification function based on regularized least squares to reflect the probability of each miRNA related to a given disease [22]. Pasquier et al. represented distributional information on miRNAs and diseases in a high-dimensional vector space and the miRNA-disease association scores were calculated in terms of their vector similarity [23]. Shen et al. developed a computational method based on collaborative matrix factorization for miRNA-disease association prediction by integrating miRNA functional similarity, disease semantic similarity and known miRNA-disease associations [24]. Luo et al. developed a collective prediction model based on transductive learning to systematically prioritize disease-related miRNAs. They calculated a relevance score for each association and updated the network structure iteratively until convergence [25]. Chen et al. presented a novel computational model called MKRMDA in which Kronecker regularized least squares were calculated based on multiple kernels for miRNA-disease association prediction [26]. However, there were several parameters involved in their model and how to appropriately choose proper values is not a trivial task. They further proposed a model of Extreme Gradient Boosting Machine for MiRNA-Disease Association (EGBMMDA). For each miRNA-disease pair, they formed an informative feature vector by combining results obtained from statistical measures, graph theoretical measures and matrix factorization results. The feature vector was then used to train a regression tree under the gradient boosting framework [27]. 
Recently, Fu and Peng proposed a deep ensemble model called DeepMDA, which extracts high-level features from similarity information using stacked autoencoders [28]. The miRNA-disease associations were then predicted using a three-layer neural network. Xiao et al. presented a graph regularized non-negative matrix factorization method for identifying miRNA-disease associations. Experimental results indicated that their method could effectively prioritize disease-associated miRNAs with higher accuracy compared with other alternatives [29].
Another family of methods considers the network topology when predicting miRNA-disease associations. Sun et al. presented a computational method named NTSMDA that utilized the topological similarity of the known miRNA-disease network to uncover potential disease-related miRNAs [30]. You et al. proposed a Path-Based MiRNA-Disease Association (PBMDA) prediction model. They first constructed a heterogeneous graph consisting of three interlinked sub-graphs and then used a depth-first search algorithm to infer potential miRNA-disease associations [31]. However, the maximum path length cannot be larger than four due to the exponential computational complexity. Chen et al. developed a computational model named NDAMDA that not only considered the direct network distance between two miRNAs or diseases but also took their respective mean network distances to all other miRNAs or diseases into account [32]. They further proposed to use graphlet interaction to analyze the complex relationships between miRNA or disease pairs in a graph. Specifically, they counted the number of different graphlet interaction isomers to calculate relevance scores for miRNA-disease associations. Nevertheless, their method cannot scale to graphlets that contain more than four nodes [33].
Although existing methods have achieved remarkable performance, some limitations remain. Briefly, due to the intrinsic noise and incompleteness of the current datasets, it is difficult to obtain reliable similarity matrices for both miRNAs and diseases. Moreover, the fact that no true negative datasets have been validated might influence the prediction performance of the machine learning-based methods. Consequently, predicting miRNA-disease associations reliably and effectively remains a challenging task. To address these problems, in this paper we first recalculate the similarity matrices for both miRNAs and diseases with the latest version of the MeSH database (http://www.ncbi.nlm.nih.gov/) and HMDD [34]; we then propose a novel semi-supervised prediction method based on the ℓ1-norm graph model. Specifically, both the miRNA and disease similarity matrices are adaptively re-weighted during the iteration process and the label matrix is updated accordingly. To demonstrate the effectiveness of our method, we apply global leave-one-out cross validation (global LOOCV) and local leave-one-out cross validation (local LOOCV) to evaluate the prediction performance. The experimental results show that our method achieved AUCs of 0.943 and 0.946 for global LOOCV and local LOOCV, respectively. The case studies on five common human diseases further confirm the utility of our method. Together, we present a novel framework for miRNA-disease association prediction and envision it being a useful tool for future clinical analysis.

Disease semantic similarity
Following the previous study [35], we downloaded the latest MeSH descriptors from the National Library of Medicine (http://www.nlm.nih.gov/) and kept only the items from Category C for diseases, which resulted in 11,572 unique items. As described in [35], the relationship among different diseases can be represented as a Directed Acyclic Graph (DAG). For a given disease d, its DAG can be denoted as DAG_d = (d, T(d), E(d)), where T(d) represents all the ancestor nodes of d and d itself, and E(d) represents all direct edges connecting the parent nodes to child nodes. The contribution D_d(t) of a disease t in DAG_d to the semantics of disease d could be calculated by:

$$D_d(t) = \begin{cases} 1, & t = d \\ \max\left\{ \Delta \cdot D_d(t') \mid t' \in \text{children of } t \right\}, & t \neq d \end{cases} \qquad (1)$$

where Δ is the semantic contribution decay factor of [35]. Based on Eq. (1), the semantic value DV of a given disease d could be defined as follows:

$$DV(d) = \sum_{t \in T(d)} D_d(t) \qquad (2)$$

Diseases that share more items in their DAGs will have greater semantic similarities. Finally, the semantic similarity score between two diseases i and j is defined as follows:

$$S(i, j) = \frac{\sum_{t \in T(i) \cap T(j)} \left( D_i(t) + D_j(t) \right)}{DV(i) + DV(j)} \qquad (3)$$

Moreover, the similarity of a given disease d and a group of diseases D_t = {d_t1, d_t2, ..., d_tk} was defined by:

$$S(d, D_t) = \max_{1 \le i \le k} S(d, d_{ti}) \qquad (4)$$

By using Eq. (3), we could obtain the semantic similarities for each disease pair. For convenience, we denote the disease semantic similarity matrix as W_d, where the entry W_d(i, j) represents the semantic similarity between disease i and disease j. The computed disease similarity matrix is provided in Additional file 1.
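The DAG-based similarity described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy DAG (`PARENTS`), the term names and the decay factor `DELTA = 0.5` are assumptions chosen for illustration only, and real MeSH tree numbers would have to be parsed into the same child-to-parents mapping.

```python
# Toy disease DAG: each term maps to its direct parents (hypothetical terms, not MeSH).
PARENTS = {"root": [], "B": ["root"], "C": ["root"], "D": ["B", "C"], "E": ["B"]}
DELTA = 0.5  # assumed semantic contribution decay factor


def contributions(d, parents=PARENTS, delta=DELTA):
    """Return {t: D_d(t)} for d and every ancestor t of d."""
    contrib = {d: 1.0}
    stack = [d]
    while stack:
        node = stack.pop()
        for parent in parents[node]:
            value = delta * contrib[node]
            # Keep the largest contribution over all paths from d up to this parent.
            if value > contrib.get(parent, 0.0):
                contrib[parent] = value
                stack.append(parent)
    return contrib


def semantic_value(d):
    """Semantic value DV(d): sum of contributions over d and its ancestors."""
    return sum(contributions(d).values())


def similarity(i, j):
    """Semantic similarity of two diseases from their shared DAG terms."""
    ci, cj = contributions(i), contributions(j)
    shared = ci.keys() & cj.keys()
    return sum(ci[t] + cj[t] for t in shared) / (sum(ci.values()) + sum(cj.values()))


def group_similarity(d, group):
    """Similarity of disease d to a group: the best match within the group."""
    return max(similarity(d, g) for g in group)


print(similarity("D", "E"))  # 0.375
```

In this toy graph, D and E share the ancestors B and root, which contribute 1.5 in total against a combined semantic value of 4.0.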

Human miRNA-disease association data
The latest version of human miRNA-disease association data (v2.0) was downloaded from HMDD [34]. In addition, we downloaded the latest list of known miRNAs, released in March 2018, from miRBase [36], which records 4796 human miRNAs. To keep the data from different sources consistent and eliminate as many false positives as possible, associations involving miRNAs or diseases that were not recorded in miRBase and MeSH were excluded [37]. As a result, 6088 associations between 550 miRNAs and 328 diseases were used in the subsequent analysis (Additional file 2). An adjacency matrix A is adopted to represent the miRNA-disease associations. For a given miRNA i and disease j, A(i, j) = 1 if miRNA i is known to be associated with disease j, and A(i, j) = 0 otherwise.
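As a small illustration of the filtering and encoding step, the sketch below builds a binary adjacency matrix from a list of association pairs and silently drops pairs whose miRNA or disease is missing from the reference lists. The function name `build_adjacency` and all identifiers in the toy data are hypothetical; the actual file formats of HMDD, miRBase and MeSH are not modelled here.

```python
def build_adjacency(pairs, mirnas, diseases):
    """Return an n x m 0/1 matrix with rows indexed by miRNA and columns by disease."""
    row = {name: i for i, name in enumerate(mirnas)}
    col = {name: j for j, name in enumerate(diseases)}
    A = [[0] * len(diseases) for _ in mirnas]
    for mirna, disease in pairs:
        # Skip associations whose miRNA or disease is absent from the reference lists.
        if mirna in row and disease in col:
            A[row[mirna]][col[disease]] = 1
    return A


# Toy data (hypothetical names, not taken from HMDD).
mirnas = ["mir-a", "mir-b", "mir-c"]
diseases = ["disease-x", "disease-y"]
pairs = [("mir-a", "disease-x"), ("mir-c", "disease-y"), ("mir-zzz", "disease-x")]

A = build_adjacency(pairs, mirnas, diseases)
print(A)  # [[1, 0], [0, 0], [0, 1]]
```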

MiRNA functional similarity
To calculate the functional similarity between two miRNAs M1 and M2, we need to measure the contributions from similar diseases that are associated with both of them [35]. Let DT_1 and DT_2 represent the sets of diseases related to miRNA M1 and M2, respectively. The functional similarity of M1 and M2 is then calculated as follows:

$$W_m(M_1, M_2) = \frac{\sum_{dt \in DT_1} S(dt, DT_2) + \sum_{dt \in DT_2} S(dt, DT_1)}{|DT_1| + |DT_2|} \qquad (5)$$

where S(dt, DT) measures the similarity of a given disease dt and a set of diseases DT, and its definition is given in Eq. (4). We use W_m to denote the miRNA functional similarity matrix, where the entry W_m(i, j) represents the functional similarity between miRNA i and miRNA j.
The computed miRNA similarity matrix was provided in Additional file 3.
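The functional similarity computation can be sketched as follows, assuming the standard best-match-average form used in [35]: each disease of one miRNA is matched to its most similar disease of the other miRNA, and the matches are averaged over both disease sets. The disease similarity table `SIM` and the disease names are toy values, not data from the paper.

```python
# Toy symmetric disease similarity table (hypothetical values).
SIM = {
    "x": {"x": 1.0, "y": 0.5, "z": 0.2},
    "y": {"x": 0.5, "y": 1.0, "z": 0.4},
    "z": {"x": 0.2, "y": 0.4, "z": 1.0},
}


def group_similarity(d, group, sim=SIM):
    """Similarity of disease d to a disease set: its best match in the set."""
    return max(sim[d][g] for g in group)


def functional_similarity(dt1, dt2, sim=SIM):
    """Best-match average between the disease sets of two miRNAs."""
    forward = sum(group_similarity(d, dt2, sim) for d in dt1)
    backward = sum(group_similarity(d, dt1, sim) for d in dt2)
    return (forward + backward) / (len(dt1) + len(dt2))


# miRNA M1 is linked to diseases x and y, miRNA M2 to y and z.
print(functional_similarity(["x", "y"], ["y", "z"]))
```

Two miRNAs with identical disease sets obtain a similarity of 1, since every disease finds itself as its best match.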

The proposed method
To effectively predict the potential miRNA-disease associations, we here propose a novel semi-supervised method based on the ℓ1-norm graph model (Fig. 1). Let n and m denote the number of miRNAs and diseases in our dataset, respectively. The dimension of the known association matrix A is thus n × m. Let us first consider the miRNA space. Given the association matrix A as well as the miRNA functional similarity matrix W_m, our goal is to obtain an indicator matrix Q_m ∈ R^{n×m} that reflects the association probability between miRNAs and diseases. Since the solution to traditional graph-based semi-supervised learning is sensitive to noise and outliers [38,39], we define an ℓ1-norm-based objective (Eq. 6), where q_m^i and q_m^j represent the i-th and j-th row of Q_m, respectively, and U_m is a diagonal matrix whose i-th diagonal element controls the impact of the initial associations from A.
Let p_m denote an n²-dimensional vector with one element per miRNA pair (i, j); Eq. (6) can then be rewritten as Eq. (7), which gives us the ℓ1-norm representation of our objective function. It is widely known that the ℓ1-norm usually generates sparse solutions, and thus the solution to Eq. (7) will provide more confident predictions of potential miRNA-disease associations [40]. However, Eq. (7) is non-smooth and difficult to solve efficiently [41]. To overcome this issue, we further define a reweighted similarity matrix (Eq. 8), where the similarity matrix W_m can be updated during each iteration. By integrating Eq. (8) into Eq. (6) and taking the derivative of Eq. (6) with respect to Q_m, we obtain the update for Q_m (Eq. 9), where L_m = D_m − W_m is the Laplacian matrix and D_m is a diagonal matrix whose i-th diagonal element is Σ_j W_m^{ij}. Because L_m depends on W_m, we develop an iterative algorithm that solves for Q_m until convergence. Similarly, we define the ℓ1-norm-based objective for the disease space (Eq. 10), where Q_d ∈ R^{m×n} is the label matrix to be solved. Following the same procedure presented above, we obtain the corresponding update for Q_d (Eq. 11). Combining Eq. (9) with Eq. (11) gives the final prediction result Q_final (Eq. 12). The procedure of the proposed method is summarized in Algorithm 1. According to previous literature [38], Algorithm 1 is guaranteed to converge to the global optimum of the problem.
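To make the iterative scheme more concrete, here is a minimal pure-Python sketch of an iteratively reweighted solver in the spirit of [38]. It rests on several assumptions that are not confirmed by the text: the objective is taken to be Σ_ij W_ij ‖q_i − q_j‖₂ + Tr((Q − A)ᵀ U (Q − A)) with q_i the row of Q for node i, the reweighting is W_ij / ‖q_i − q_j‖₂ (constant factors may differ from the paper's Eq. 8 and Eq. 9), U is set to u·I, the iteration count is fixed instead of testing convergence, and the two spaces are combined by a plain average. The function names and toy matrices are hypothetical.

```python
import math


def solve(M, B):
    """Solve M X = B by Gauss-Jordan elimination with partial pivoting."""
    n = len(M)
    aug = [[float(x) for x in M[i]] + [float(x) for x in B[i]] for i in range(n)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[pivot] = aug[pivot], aug[c]
        for r in range(n):
            if r != c and aug[r][c] != 0.0:
                f = aug[r][c] / aug[c][c]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[c])]
    return [[aug[i][n + j] / aug[i][i] for j in range(len(B[0]))] for i in range(n)]


def l1_graph_scores(W, Y, u=1.0, n_iter=20, eps=1e-6):
    """Repeatedly solve (L + U) Q = U Y, with U = u * I (an assumption)."""
    n = len(W)
    weights = [row[:] for row in W]  # the first pass uses the given similarities
    rhs = [[u * y for y in row] for row in Y]
    Q = [[float(y) for y in row] for row in Y]
    for _ in range(n_iter):
        # Build L + U, where L = D - weights is the graph Laplacian.
        M = [[0.0] * n for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if i != j:
                    M[i][j] = -weights[i][j]
                    M[i][i] += weights[i][j]
            M[i][i] += u
        Q = solve(M, rhs)
        # Reweight: pairs whose score rows are already close get a larger weight.
        for i in range(n):
            for j in range(n):
                dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(Q[i], Q[j])))
                weights[i][j] = W[i][j] / max(dist, eps)  # eps avoids division by zero
    return Q


def predict(W_m, W_d, A):
    """Average the miRNA-space and disease-space scores (averaging is an assumption)."""
    A_t = [list(col) for col in zip(*A)]
    Q_m = l1_graph_scores(W_m, A)    # n x m
    Q_d = l1_graph_scores(W_d, A_t)  # m x n
    return [[0.5 * (Q_m[i][j] + Q_d[j][i]) for j in range(len(A_t))]
            for i in range(len(A))]


# Toy example: 3 miRNAs, 2 diseases (hypothetical values).
W_m = [[1.0, 0.9, 0.1], [0.9, 1.0, 0.2], [0.1, 0.2, 1.0]]
W_d = [[1.0, 0.3], [0.3, 1.0]]
A = [[1, 0], [0, 0], [0, 1]]
scores = predict(W_m, W_d, A)
```

In the toy example, the second miRNA has no known association but is highly similar to the first, so it receives a higher score for the first disease than for the second. For realistic matrix sizes a numerical library would replace the hand-written `solve`.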

Performance evaluation
To validate the prediction ability of our method, we implemented leave-one-out cross validation (LOOCV), where each known association was left out in turn as the test sample and the rest of the known associations were used for optimization. LOOCV can be conducted in two ways, i.e. global LOOCV and local LOOCV. In global LOOCV, the test sample was ranked against all the other unconfirmed miRNA-disease associations, whereas in local LOOCV the test sample was ranked against all the unconfirmed associations of a given disease. Test samples with predicted values higher than a given threshold were considered successful predictions. To evaluate the prediction performance, we adopted the receiver operating characteristic (ROC) curve and calculated the area under the ROC curve (AUC). The larger the AUC, the better the prediction performance. Moreover, we compared our method with five state-of-the-art approaches, i.e. HGIMDA [20], EGBMMDA [27], DeepMDA [28], NTSMDA [30] and PBMDA [31]. As mentioned before, HGIMDA is an efficient prediction framework based on heterogeneous graph inference. EGBMMDA is a classification method based on the extreme gradient boosting machine, while DeepMDA is a deep ensemble classification model. Both NTSMDA and PBMDA take advantage of different network topological characteristics to prioritize disease-related miRNAs. The experimental results are shown in Fig. 2 and Table 1; our method significantly improved the prediction performance with respect to the other five methods. We next examined the computational cost of all methods by evaluating the computational time and memory needed for each run. Experiments were performed on a computer cluster where each node is equipped with 2 AMD Dual-Core Opteron 8218 processors and 16 GB memory. As shown in Table 2, our method achieves superior performance with a reasonable amount of computational resources.
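The difference between global and local LOOCV can be sketched in a few lines of Python. This is a simplified stand-in for the evaluation above, not the authors' protocol: instead of sweeping a threshold to draw a ROC curve, it averages, over all held-out associations, the fraction of unconfirmed candidates ranked below the held-out sample (ties count one half), which gives an AUC-like value. The scorer `degree_scorer` is a hypothetical baseline used only to make the example runnable.

```python
def loocv_auc(A, score_fn, local=False):
    """Average rank-based AUC over all held-out known associations."""
    n, m = len(A), len(A[0])
    fractions = []
    for i in range(n):
        for j in range(m):
            if A[i][j] != 1:
                continue
            held = [row[:] for row in A]
            held[i][j] = 0  # hide the test association before scoring
            S = score_fn(held)
            if local:
                # Candidates: unconfirmed miRNAs of the same disease j.
                cands = [S[r][j] for r in range(n) if A[r][j] == 0]
            else:
                # Candidates: every unconfirmed miRNA-disease pair.
                cands = [S[r][c] for r in range(n) for c in range(m) if A[r][c] == 0]
            if cands:
                wins = sum((S[i][j] > s) + 0.5 * (S[i][j] == s) for s in cands)
                fractions.append(wins / len(cands))
    return sum(fractions) / len(fractions)


def degree_scorer(A):
    """Hypothetical baseline: score = miRNA degree + disease degree."""
    rows = [sum(r) for r in A]
    cols = [sum(c) for c in zip(*A)]
    return [[rows[i] + cols[j] for j in range(len(cols))] for i in range(len(rows))]


# Toy association matrix (hypothetical).
A = [[1, 1, 0], [1, 0, 0], [0, 1, 1], [0, 0, 0]]
print(loocv_auc(A, degree_scorer), loocv_auc(A, degree_scorer, local=True))
```

Any predictor that maps a masked association matrix to a score matrix can be passed as `score_fn`, so the same loop can compare several methods under identical splits.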

Case studies
To further demonstrate the prediction ability of the proposed method, we carried out three types of case studies on five common diseases. Four databases, dbDEMC [42], PhenomiR [43], miR2Disease [44] and miRwayDB [45], were used to validate the prediction results in all five case studies. Specifically, dbDEMC is an integrated database that records differentially expressed miRNAs in human cancers detected by high-throughput methods, while PhenomiR, miR2Disease and miRwayDB provide information about differentially regulated miRNA expression in diseases and other biological processes or pathways, generated entirely by manual curation by experienced annotators. Since the miRNAs recorded in dbDEMC, miR2Disease and miRwayDB are annotated in their mature sequence form, we matched the candidate miRNAs with those recorded in these three databases according to the miRNA nomenclature provided by miRBase. In addition, to validate our case study results across all four databases, we selected 16 diseases common to them for the subsequent analysis. Due to space limitations, we only provide the validation results of five diseases here; the results for the other diseases can be found in the additional files. For the first type of case study, we applied our method to predict the potential associations between miRNAs and three given diseases, i.e. Lung Neoplasms, Ovarian Neoplasms and Prostatic Neoplasms, based on the known associations in HMDD v2.0 (Additional file 4). Lung cancer is the leading cause of cancer death among men and women worldwide, with an incidence of over 200,000 new cases per year coupled with a very high mortality rate [46]. Great efforts have been made to investigate the functional roles of miRNAs in lung cancer cell progression and resistance to therapy.
For instance, recent studies have identified that miR-15a-3p could induce apoptosis in lung cancer cell lines and thus serve as a potential biomarker for apoptosis-modulating therapies in lung cancer treatment [47]. However, a promising finding of a lung cancer-associated miRNA in a single study is inadequate to support a solid conclusion, and more studies are needed to cross-validate the discovery. Here, we carried out our first case study on this lethal disease and prioritized the top 50 miRNAs ranked by our method. As shown in Table 3, 49 out of the 50 predicted miRNAs were confirmed by experimental findings recorded in at least one of the four databases dbDEMC, PhenomiR, miR2Disease and miRwayDB. Specifically, three of the top four predicted miRNAs (i.e. hsa-mir-16-1, hsa-mir-16-2 and hsa-mir-15) were validated by all the databases. The only unconfirmed miRNA was hsa-mir-520b. Intriguingly, we observed that other miRNAs within the same miRNA family as hsa-mir-520b (i.e. hsa-mir-520d, hsa-mir-520c and hsa-mir-520a) were all confirmed by dbDEMC. Therefore, hsa-mir-520b might also function as a potential regulator in the tumorigenesis and progression of lung cancer.
Ovarian neoplasms is the fifth most common cause of cancer deaths in women and has the highest mortality rate among all the gynecological malignancies. Its lethality is largely due to the difficulty of detecting it at an early stage and the lack of effective treatments for patients with an advanced or recurrent status [48,49]. Consequently, there is an urgent need to identify prognostic and predictive markers for early detection. Various miRNAs, such as the miR-200 family and let-7 paralogs, have been proposed as potential therapeutic targets for disseminated or chemoresistant ovarian tumors. We applied our method to prioritize the candidate miRNAs for ovarian neoplasms, and the top 50 predicted miRNAs are given in Table 4. Similarly, 49 out of the 50 predicted miRNAs were confirmed by at least one of the databases dbDEMC, PhenomiR, miR2Disease and miRwayDB. The only unconfirmed miRNA was hsa-mir-181a-2. As a matter of fact, in vivo experiments have indicated that miR-181a could modulate TGF-β signaling to induce and maintain epithelial-mesenchymal transition and further affect ovarian cancer cell survival [50]. In addition, three miRNAs (hsa-mir-181a-1, hsa-mir-181b-1 and hsa-mir-181b-2) from the same miRNA family as hsa-mir-181a-2 were all supported by dbDEMC as being associated with ovarian cancer. Together, our prediction provided new evidence for its association with ovarian cancer.
Prostatic neoplasms is the most prevalent non-skin cancer among men worldwide and is commonly found in men over 50 years of age. Although it has an indolent course, prostate cancer remains the third-leading cause of cancer death in men [51]. In recent years, miRNA profiling studies have demonstrated that miRNAs may act independently or in partnership with other transcription factors to regulate gene transcription, which ultimately leads to perturbed cellular processes in prostate cancer [52].
For instance, it has been suggested that hsa-miR-29b could act as an antimetastatic miRNA for prostate cancer cells at multiple steps in a metastatic cascade by regulating epithelial-mesenchymal transition signaling [53]. The top 50 prostate cancer-related miRNAs predicted by our method are listed in Table 5. As a result, 49 of the top 50 predicted miRNAs were confirmed to be associated with prostate cancer by at least one of the databases dbDEMC, PhenomiR, miR2Disease and miRwayDB. The only unconfirmed miRNA was hsa-mir-429. In fact, studies have demonstrated that the downregulation of miR-429 inhibits cell proliferation by targeting p27Kip1 in human prostate cancer cells. Our prediction results further support its association with prostate cancer.
To demonstrate the applicability of our method to diseases without any known miRNAs, we carried out the second type of case study for Breast neoplasms (Additional file 5). Breast neoplasms is a malignant tumor that forms from the uncontrolled growth of abnormal breast cells. Recent research on miRNAs has indicated that the loss of tumor suppressor miRNAs or overexpression of oncogenic miRNAs can lead to breast cancer tumorigenesis or metastasis [54]. In this case study, we first removed all 237 miRNAs that were confirmed to be associated with breast neoplasms by HMDD v2.0, and then prioritized all 550 candidate miRNAs by our method. As shown in Table 6, 47 out of the top 50 predicted miRNAs were verified by HMDD v2.0, and all of them were further confirmed by at least one of the databases dbDEMC, PhenomiR, miR2Disease and miRwayDB.
Lastly, we conducted the third type of case study for Hepatocellular Carcinoma, in which the older version of HMDD was used to prioritize miRNAs for the given disease and the latest version of HMDD (i.e. v2.0) was adopted to evaluate the prediction results (Additional file 6). Concretely, there were 1475 known associations involving 281 miRNAs and 129 diseases recorded in the older version of HMDD. The top 50 ranked miRNAs predicted by our method are listed in Table 7. As a result, 38 of them were confirmed by HMDD v2.0, and all of them were validated by at least one of the four databases dbDEMC, PhenomiR, miR2Disease and miRwayDB. Notably, we found that although hsa-mir-9-1, hsa-mir-132, hsa-mir-194-1 and hsa-mir-9-2 were not recorded in HMDD v2.0, they were all confirmed by the four databases, indicating their potential functional roles in the pathogenesis of Hepatocellular Carcinoma. In summary, all three types of case studies further validated the effectiveness and reliability of our method in uncovering potential associations between miRNAs and diseases.
Liang et al. J Transl Med (2018) 16:357
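The validation step shared by the case studies (rank the candidate miRNAs of one disease, take the top k, and look each one up in the external databases) can be sketched as below. The function `validate_top_k`, the scores and the database contents are all hypothetical toy values; the real databases would first require the miRBase-based name matching described above. Passing an empty `known` set mimics the second type of case study, where all known associations of the disease are hidden.

```python
def validate_top_k(scores, known, databases, k=50):
    """Rank candidate miRNAs for one disease and list the databases confirming each."""
    candidates = [m for m in scores if m not in known]
    ranked = sorted(candidates, key=lambda m: scores[m], reverse=True)[:k]
    report = [
        (m, sorted(db for db, entries in databases.items() if m in entries))
        for m in ranked
    ]
    confirmed = sum(1 for _, hits in report if hits)
    return report, confirmed


# Toy inputs (hypothetical miRNA names and database contents).
scores = {"mir-a": 0.91, "mir-b": 0.15, "mir-c": 0.78, "mir-d": 0.66}
known = {"mir-a"}  # already recorded for this disease in the training data
databases = {
    "dbDEMC": {"mir-c", "mir-d"},
    "PhenomiR": {"mir-c"},
    "miR2Disease": set(),
    "miRwayDB": set(),
}

report, confirmed = validate_top_k(scores, known, databases, k=2)
print(report)     # [('mir-c', ['PhenomiR', 'dbDEMC']), ('mir-d', ['dbDEMC'])]
print(confirmed)  # 2
```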

Discussion
The experimental results presented above clearly demonstrated the superior performance of our method. Moreover, the results of case studies on five common human diseases further confirmed the utility of the proposed method. Intriguingly, we noticed that for lung neoplasms and ovarian neoplasms, miRNAs within the same family as the unconfirmed miRNAs in the top 50 predictions were verified to be related to the investigated diseases by dbDEMC. As a matter of fact, evidence has demonstrated that miRNA families or clusters, such as the miR-200 family and the let-7 family, could function together in various pathological processes [55,56]. Therefore, our results provided new evidence that the miR-520 family and the miR-181 family might play vital roles in lung neoplasms and ovarian neoplasms, respectively.
The success of our model could be largely attributed to the following two reasons. Firstly, the ℓ1-norm imposed on our objective function generates sparse solutions, which makes our method robust to the incompleteness of current datasets. Secondly, both the reconstructed miRNA functional similarities and the disease semantic similarities are adaptively re-weighted according to the learned label matrix during each iteration. As a result, miRNAs or diseases with higher similarities will get more similar predicted labels and vice versa. However, there is still room for improvement in our model. In essence, since the miRNA functional similarity matrix and the disease semantic similarity matrix were updated separately in their own spaces, our model is expected to be more effective if the two optimization spaces could be combined in a more principled manner. In addition, more data sources, such as miRNA sequence similarities and miRNA family information, should be integrated to further improve the prediction ability of our model.

Conclusion
MiRNAs have been established as key metastasis regulators in diverse disease types. The ability of these small non-coding RNAs to regulate gene expression has generated much interest in exploiting them as potential therapeutic biomarkers in human diseases [57]. The
Table 7 Top 50 predicted miRNAs associated with hepatocellular carcinoma based on known associations in the older version of HMDD