Two precision medicine predictive tools for six malignant solid tumors: from gene-based research to clinical application

Background The current study aimed to construct competing endogenous RNA (ceRNA) regulation network and develop two precision medicine predictive tools for colorectal cancer (CRC). Methods Differentially expressed (DE) analyses were performed between CRC tissues and normal tissues. A ceRNA regulation network was constructed based on DElncRNAs, DEmiRNAs, and DEmRNAs. Results Fifteen mRNAs (ENDOU, MFN2, FASLG, SHOC2, VEGFA, ZFPM2, HOXC6, KLK10, DDIT4, LPGAT1, BEX4, DENND5B, PHF20L1, HSP90B1, and PSPC1) were identified as prognostic biomarkers for CRC by multivariate Cox regression. Then a Fifteen-mRNA signature was developed to predict overall survival for CRC patients. Concordance indexes were 0.817, 0.838, and 0.825 for 1-, 2- and 3-year overall survival. Patients with high risk scores have worse OS compared with patients with low risk scores. Conclusion The current study provided deeper understanding of prognosis-related ceRNA regulatory network for CRC. Two precision medicine predictive tools named Smart Cancer Survival Predictive System and Gene Survival Analysis Screen System were constructed for CRC. These two precision medicine predictive tools can provide valuable precious individual mortality risk prediction before surgery and improve the individualized treatment decision-making.


Introduction
Colorectal cancer (CRC) is one of the most prevalent malignant tumors worldwide. There were around 1.1 million newly diagnosed CRC patients and 0.55 million CRC patients died in 2018 [1]. The 5-year overall survival (OS) of CRC patients was less than 12% [2,3]. Due to the poor overall survival, early detection of CRC patients with high mortality risk has important significance for improving the individualized treatment decision-making. Several prognostic models have been developed for CRC patients [4,5]. However, the computational formulas of these prognostic models were too complex for clinical application by patients without calculation tools and medical knowledge. Additional, these prognostic models provided overall survival prediction for different groups, but not individual mortality risk prediction.
From the perspective of precise medicine, a good prognostic model should be able to provide individual mortality risk prediction for specific patient at the individual level. Considering the clinical need of precious individual mortality risk prediction for patients with different tumors, our research team has developed several precision medicine predictive tools for gastric cancer [6] and hepatocellular carcinoma [7]. For individual mortality risk prediction, our precision medicine predictive tools have the following advantages: full-time individual risk prediction, visual illustration, numerical presentation, customizable subgroups, and on-line computing.
So far the molecular biological regulatory mechanism of development and prognosis of tumor remains unclear. Several researches have explored underlying molecular biological regulatory mechanism for different tumors [8][9][10][11]. Salmena and collaborators presented an interesting molecular regulatory mechanism named competing endogenous RNAs (ceRNAs) [12]. The lncRNAs can indirectly regulate the expression of mRNAs through binding shared miRNA response elements [13]. Several researches have explored potential molecular regulatory mechanism for CRC [14][15][16]. Therefore the current research aimed to depict prognosis related ceRNA regulatory network and develop two precision medicine predictive tools for CRC patients.

Study cohort
RNA sequencing and miRNA sequencing data were obtained from TCGA database. RNA sequencing data contained 14,449 lncRNAs and 20,337 mRNAs whereas miRNA sequencing data contained 1881 miRNAs. Four hundred and twenty-eight CRC patients were included after removing patients without complete survival information. GSE17538 dataset were downloaded from GEO database. GSE17538 dataset involved 231 CRC patients and 23,328 RNAs (GPL570 platform). The original read count values in TCGA dataset were normalized by log2 transformation. The gene background file (Gencode.v29 supplied by The European Bioinformatics Institute of The European Molecular Biology Laboratory (EMBL-EBI) database) was used for gene symbol annotation.

Differentially expressed analyses and regulatory network
The original RNA data were processed by "edgeR" package, with a defined P of 0.05 and a ratio of 1.5 times between tumor and non-tumor tissues [17]. The original miRNA sequencing data were processed by "limma" package [18]. First, the interaction associations between lncRNAs and miRNAs were identified in miRcode database [19]. Second, miRTarBase [20], miRDB [21], and TargetScan [22] were searched for miRNA-targeted mRNAs. The ceRNA network was visualized by Cytoscape v3.6.1 [23].

Statistical analyses and artificial intelligence algorithms
Random survival forest, Multi-task logistic regression, and Cox survival regression algorithms were carried out according to the algorithms suggested in the original articles [24][25][26][27][28][29]. Statistical analyses were carried out through SPSS Statistics 19.0 (SPSS Inc.,USA). Other analyses were carried out by R version 3.5.2 with corresponding packages. P value < 0.05 was defined as statistically significant.

Baseline characteristics
TCGA dataset contained 428 CRC patients and GSE17538 dataset contained 231 CRC patients. The clinical information of included patients was shown in Table 1. The mortality of GSE17538 dataset was 40.3% (93/231), which was significantly higher than 22.9% (98/428) of TCGA dataset. There were significant differences in terms of survival time and pathological stage, whereas there were no significant differences in terms of age and gender.

Differentially expressed analyses
Differential expression analysis could identify genes with significant differences in expression levels between normal samples and tumor samples. Compared with normal tissues, 3005 lncRNAs (2224 up-regulated and 781 downregulated), 332 miRNAs (246 up-regulated and 86 downregulated), and 6713 mRNAs (4087 up-regulated and 2626 down-regulated) were identified in CRC tissues. The volcano plots for differentially expressed RNAs were shown in Additional file 1: Figure S1.

Screening of prognostic mRNAs
Univariate Cox regression was used to explore potential prognostic biomarkers for CRC. Out of previous differentially expressed mRNAs, there were 2504 mRNAs identified as prognostic biomarkers for CRC. Out of 2504 potential prognostic biomarkers, there were 1371 risk factors and 1133 protective factors.

Development of ceRNA network
The miRNA-targeted mRNAs that could be searched in three above databases were defined as the miRNA-targeted mRNAs. Third, these miRNA-targeted mRNAs were intersected with previous prognostic mRNAs for development of ceRNA network. Finally, the ceRNA network, consisting of 14 lncRNAs, 29 miRNAs, and 79 mRNAs, were constructed for CRC. The ceRNA network was visualized in Fig. 1 by Cytoscape v3.6.1. This ceRNA network depicted potential regulatory relations among lncRNAs, miRNAs, and mRNAs, and was helpful to understand the potential mechanisms of tumor prognosis.

Functional enrichment analyses
Functional enrichment analyses were performed based on previous prognostic mRNAs in ceRNA network and identified 26 enriched Gene Ontology (GO) terms. The top 15 enriched GO terms were shown in Fig. 2a. The prognostic mRNAs were mainly enriched in transcription factor activity, RNA polymerase II proximal promoter sequence-specific DNA binding, regulation of MAP kinase activity, regulation of protein serine/threonine kinase activity, inactivation of MAPK activity, MAP kinase tyrosine/serine, MAP kinase phosphatase activity, protein tyrosine/serine, E-box binding, protein tyrosine phosphatase activity, and negative regulation of MAP kinase activity. The top KEGG pathways ( Fig. 2b) were   Fig. 3.

Construction of predictive model
Multivariate Cox regression analyses were used to screen independent risk factors for tumor prognosis. Fifteen mRNAs were used to develop a predictive model for CRC ( Table 2). The formula for predictive model was as

Survival curve analyses
Survival curve analyses were performed to explore the survival influences of included mRNAs (Additional file 1: Figure S2). Kaplan-Meier curves indicated that patients with high expression had significantly poor overall survival than that with low expression for these 15 mRNAs (P < 0.05). Comparison of Kaplan-Meier curves supported that these 15 genes were associated with overall survival in CRC patients.

Predictive performance in model dataset
CRC patients were divided into high risk subgroup and low risk subgroup according to median risk score. Figure 5a demonstrated that there was significant difference between two subgroups for OS (P < 0.001).
Calibration curves for OS were presented in Fig. 6, indicating a good agreement between predicted mortality and actual mortality for 1-year, 2-year, and 3-year OS.

Predictive performance in validation dataset
Kaplan-Meier plot (Fig. 7a) demonstrated that OS in high risk subgroup was significantly worse than that in low risk subgroup (P < 0.05). Concordance indexes were 0.773, 0.824, and 0.801 for 1-year, 2-year, and 3-year OS (Fig. 7b). Calibration curves for OS were depicted in Fig. 8, demonstrating that the predicted mortality was in good agreement with the actual mortality. Independence assessment of predictive model In model cohort, prognostic signature and pathological stage were independent influence factors for OS (Table 3). In validation cohort, prognostic signature, pathological stage, and age were identified as independent influence factors for OS. Decision curves and clinical impact curve for OS were presented in Additional file 1: Figure S3, suggesting that the clinical efficacy of the current prognostic model was superior to pathological stage and age.

Smart Cancer Survival Predictive System
We developed a precision medicine predictive tool named Smart Cancer Survival Predictive System, providing a novel convenient on-line calculator for prediction of OS. Smart Cancer Survival Predictive System (Fig. 9) is available at: https ://zhang zhiqi ao6.shiny apps.io/Smart _ cance r_predi ctive _syste m_11_CRC_B1003 /. Smart Cancer Survival Predictive System provided three individual mortality risk curves predicted by Random survival forest algorithm, Multitask logistic regression algorithm, and Cox survival regression algorithm according to the calculation formula in ogininal articles.

Gene Survival Analysis Screen System
To further explore survival curves of previous prognostic genes in different gender and pathological stage subgroups, we developed a new online program named Gene Survival Analysis Screen System (Fig. 10). Gene Survival Analysis Screen System is available at: https ://zhang zhiqi ao5.shiny apps.io/Gene_Survi val_Subgr oup_Analy sis_B1003 /. Gene Survival Analysis Screen System provided seven tumor datasets for exploration research and allowed users to select tumor type, gender, and stage, which were the important influence factors to the prognosis.

Clinical application in other tumors
The current study download five different tumor datasets from TCGA database as external validation datasets to explore the clinical application of the current prognostic model. Figure 11 displayed the diagnostic accuracy of the current prognostic model in five tumors, including hepatocellular carcinoma (n = 348), breast cancer (n = 1030), gastric cancer (n = 265), lung cancer (n = 494), and ovarian cancer (n = 370).

Predictive performance in five malignant solid tumors
To further explore the predictive performance of the current prognostic model, we externally validated the current prognostic model in a super merge dataset (n = 2507), which including five malignant solid tumors (hepatocellular carcinoma, breast cancer, gastric cancer, lung cancer, and ovarian cancer). Kaplan-Meier plot (Fig. 12a) demonstrated that OS in high risk subgroup was significantly worse than that in low risk subgroup (P < 0.05). Concordance indexes were 0.663, 0.639, and 0.655 for 1-year, 2-year, and 3-year OS (Fig. 12b). Calibration curves for OS were depicted in Fig. 13, demonstrating that the predicted mortality was in good agreement with the actual mortality. The current study screened differentially expressed RNAs between CRC tissues and normal tissues and then constructed a prognosis-related ceRNA regulatory network for CRC. Based on mRNAs in ceRNA regulatory network, the current study further identified independent prognostic biomarkers for overall survival in CRC patients. Through ceRNA regulatory network, the     reported that Paraspeckle Component 1 (PSPC1) is significantly related with poor prognosis in hepatocellular carcinoma [38]. The results in current study further supported the credibility of these findings above in previous researches. High quality samples and standardized RNA extraction technology are helpful to improve the quality and reliability of research data. According to the Standard Operating Procedure suggested by TCGA database, for gene sequencing sample, conventional methods of fixation and heating of biopsies may lead to inactivation of antigens and genetic material in tissues, therefore frozen sections was recommended in TCGA database. Samples should be snap-frozen and stored in cryovials in liquid nitrogen vapor and should not allow tissues thaw until sectioned. The mirVana isolation technology uses organic extraction and then fixes RNA on glass fiber filter to purify total RNA, so it can prepare high-purity and high-quality small RNA molecules within 2 h. RNA analytes should not undergo multiple freeze-thaw cycles to maintain biological activity. RNA is easily degraded by ribonuclease, so special methods should be taken to prevent RNA degradation. All reagents must be made of ribonuclease free materials and all products and disposable materials used must be free of ribonuclease. In order to prevent contamination by ribonuclease carried by the skin, gloves must be worn before handling biological samples. RNA analytes should be placed on wet ice and RNA quantification should be performed before freezing. Advantages of the current study: Firstly, our study team developed a precision medicine tool named Smart Cancer Survival Predictive System, providing full-time individual mortality risk prediction with visual illustration and numerical presentation. Secondly, a novel precision medicine tool named Gene Survival Analysis Screen System was developed to explore the associations between prognostic genes (including clinical parameters) and overall survival. Users are free to choose the appropriate subgroup, gender, and stage. Meanwhile, users can upload their own dataset to explore and validate the research result.
Shortcomings of the current research: first, detection platforms were different between model dataset and validation dataset. The influence of different detection platforms on gene expression read counts should be taken into account while evaluating clinical application of the current prognostic model. Second, several important prognostic factors were not included in the current study, including surgical, radiotherapy and chemotherapy regimens. Third, the prognostic model was developed and validated by using datasets downloaded from public databases without research team's study cohort. External applicability of prognostic model needs to be validated by study cohorts from different population.

Conclusions
In conclusion, the current study provided deeper understanding of prognosis-related ceRNA regulatory network for CRC. Two precision medicine predictive tools named Smart Cancer Survival Predictive System and Gene Survival Analysis Screen System were constructed for CRC. These two precision medicine predictive tools can provide individual prognostic information before surgery and improve the individualized treatment decision-making.