Integration analysis for novel lncRNA markers predicting tumor recurrence in human colon adenocarcinoma

Background Numerous evidence has suggested that long non-coding RNA (lncRNA) acts an important role in tumor biology. This study focuses on the identification of novel prognostic lncRNA biomarkers predicting tumor recurrence in human colon adenocarcinoma. Methods We obtained the research data from The Cancer Genome Atlas (TCGA) database. The interaction among different expressed lncRNA, miRNA and mRNA markers between colon adenocarcinoma patients with and without tumor recurrence were verified with miRcode, starBase and miRTarBase databases. We established the lncRNA–miRNA–mRNA competing endogenous RNA (ceRNA) network based on the verified association between the selected markers. We performed the functional enrichment analysis to obtain better understanding of the selected lncRNAs. Then we use multivariate logistic regression to identify the prognostic lncRNA markers with covariates. We also generated a nomogram predicting tumor recurrence risk based on the identified lncRNA biomarkers and clinical covariates. Results We included 12,727 lncRNA, 1881 miRNA and 47,761 mRNA profiling and clinical features for 113 colon adenocarcinoma patients obtained from the TCGA database. After filtration, we used 37 specific lncRNAs, 60 miRNAs and 148 mRNAs in the ceRNA network analysis. We identified five lncRNAs as prognostic lncRNA markers predicting tumor recurrence in colon adenocarcinoma, in which four of them were identified for the first time. Finally, we generated a nomogram illustrating the association between the identified lncRNAs and the tumor recurrence risk in colon adenocarcinoma. Conclusions The four newly identified lncRNA biomarkers might be potential prognostic biomarkers predicting tumor recurrence in colon adenocarcinoma. We recommend that further clinical and fundamental researches be conducted on the identified lncRNA markers.


Background
Colorectal carcinoma (CRC) is among the most common cancers with high morbidity and mortality among all malignancies and the most common CRC is colon adenocarcinoma (CA) [1]. It was reported that over 70% of CA patients would develop tumor recurrence within 24 months after surgery [2] and tumor recurrence still acts as one of the most severe risk factors to overall survival of CA patients [3]. Thus, the issue of tumor recurrence following a primary CA becomes very important [4]. It is essential to identify prognostic markers in order to study the biological mechanism in CA and identify the candidate targets for therapy.
Long non-coding RNAs (lncRNAs), with lengths of at least 200 nucleotides, modulate gene expression at the post-transcriptional level [5]. With the innovations in RNA sequencing technologies and computational biology, recent findings suggest that lncRNAs are involved in the biological process of cancer development [6]. Numerous studies on the role of lncRNAs in various types of cancers have been performed, and several lncRNA biomarkers have been identified to be related to the development, diagnosis and overall survival of various cancers [7][8][9][10][11]. Recently, lncRNA HOTAIR has been identified to be related to the overall survival in CA [12]. LncRNA ATB is associated with poor prognosis of CRC [13]. LncRNA CCAT1 is reported to be of clinical value in the diagnosis of CA [14]. All these studies suggest a potential value of lncRNAs in the prognosis and diagnosis of CA.
The competing endogenous RNA (ceRNA) hypothesis was presented as a new model demonstrating the association between non-coding and coding RNAs and accepted as one of the most efficient tools in lncRNAs research [10]. It has been widely utilized in the identification of diagnostic and prognostic lncRNA markers in various cancers [15][16][17][18].
Few studies have focused on the ceRNA network related to tumor recurrence in CA [19][20][21][22]. Thus, in this study, we aim to establish the lncRNA-miRNA-mRNA ceRNA network for the tumor recurrence of CA to identify novel prognostic lncRNA biomarkers for the prediction of tumor recurrence in CA and to achieve better understanding of the role of lncRNAs in CA based on the RNA sequencing data obtained from The Cancer Genome Atlas (TCGA) database.

Data profile
We obtained the RNA sequencing (including miRNA, lncRNA and mRNA) measurements and clinical characteristics of CA patients from The Cancer Genome Atlas (TCGA) database (https ://cance rgeno me.nih.gov/), an open source of information to identify novel biomarkers in cancer research, using R package "TCGAbiolinks" in June 2018. As a result, 12,727 lncRNA, 1881 miRNA and 47,761 mRNA profiling were obtained.
We excluded patients with missing information in tumor recurrence status, tumor location, venous status, lymphatic invasion status, histology type, pathology stage, TNM stage or age at diagnosis. We also excluded records without lncRNA, miRNA or mRNA measurements. Finally, 113 records were included in this study. The maximum follow-up time was 10.37 years (3780 days) with medium follow-up time equal to 1.34 years (488 days). During the follow-up, 6 out of 113 individuals were recorded as deceased.

Data pre-processing
Since the obtained lncRNA and miRNA expression measurements were not normally distributed, we performed a log transformation to normalize and correct their positively skewed distributions. Suppose x ij is the expression for jth lncRNA or miRNA expression of ith individual, the transformed value of jth lncRNA or miRNA expression of ith individual would equal to ln(x ij ) (if x ij > 0) or 0 (if x ij = 0). The transformation can be expressed with: Then, before incorporating into multi-variate analysis, to obtain a better explanation to the coefficient obtained in regression analysis, we transform the log-transformed lncRNA expressions into binary variables according to whether the log-transformed expression was higher (upregulated) or lower (down-regulated) than its log-transformed mean.
The mRNA profiling obtained from the TCGA database were normally distributed, therefore, no pre-processing was performed on the mRNA expressions.

Statistical analysis
All analyses were performed through R (version 3.4.4, the R Foundation for Statistical Computing, Vienna, Austria). Clinical and demographic characteristics were tested (with α = 0.05) by Chi-square test (gender, pathology stage and tumor site), Mann-Whitney test (TNM stage) and t-test (age at diagnosis). We also used t-test to select lncRNAs, miRNAs (log-transformed) and mRNAs with different expression levels between CA patients with and without tumor recurrence (with α = 0.01), the p-value obtained were adjusted with the BH method [23]. Prognostic lncRNA markers were identified based on the adjusted ORs obtained from multi-variate logistic regression.

Nomogram construction
Finally, we generated a nomogram predicting tumor recurrence risk for asymptotic CA patient based on the lncRNA markers and clinical features identified in the current study using R package "rms". The prediction performance was evaluated by the ROC analysis and C-index.

Characteristics of included patients
Out of the 113 patients, 50

Selection of different expressed RNA sequencing measurements
We selected the lncRNA, miRNA and mRNA measurements with different expression levels between patients with and without tumor recurrence through independent t-tests with p-value adjusted through the BH method [19]. As a result, we selected 61 lncRNA, 167 miRNA and 354 mRNA measurements with adjusted p-value less than 0.01. The selected lncRNAs, miRNAs and mRNAs are contained in Additional file 1: S1 (sheets 1-3).

Establishment of lncRNA-miRNA-mRNA ceRNA network
We used the miRcode, starBase and miRTarBase databases to verify the interaction relationship between the different expressed lncRNA, miRNA and mRNA markers. Based on the verified interaction relationship, we conducted the lncRNA-miRNA-mRNA ceRNA network analysis to reveal the association between selected lncR-NAs and miRNAs as shown in Fig. 1. The detailed information is shown in Tables 2 and 3. As a result, the ceRNA network indicated that 60 particular miRNAs interacted with 33 specific lncRNAs and 148 mRNAs.

Functional enrichment analysis
In order to obtain a deep understanding on the selected genes, we performed the functional enrichment analysis to the intersection mRNAs as shown in Fig. 2. It revealed that the enriched GO terms for biological process (BP) were mainly related to the several immune processes including the regulation of the differentiation and proliferation for several immune cells (T cell, mononuclear cell and leukocyte), the regulation of cytotoxicity and immunity mediated by natural killer cell and leukocyte, the differentiation of dendritic cell, the regulation of interleukin-10 and -12 and the regulation of cytokine, as shown in Fig. 2a. The associated cell component (CC) included protein phosphatase type 2A complex and mast cell granule as shown in Fig. 2b. The enriched GO terms for molecular function (MF) were mainly about core promoter binding, interleukin-10 activity, MHC class 1 activity, as shown in Fig. 2c. The pathways enriched based on the KEGG and Reactome databases are shown in Fig. 2d. The revealed pathways included RHO GTPases Activate Rhotekin and Rhophilins pathway, Interleukin-21 signaling pathway, Interleukin-2 family signaling pathway and Nuclear Receptor transcription pathway.

Subnetwork analysis
We used the "clusterMaker2" [28] application within the Cytoscape software to identify the subnetworks from the main ceRNA network through the MCL clustering approach [28]. As a result, we identified 37 distinct subnetworks. Based on the number of nodes in each subnetworks, after filtration, we selected 4 subnetworks with at least 10 nodes, as shown in Fig. 3. We also performed the functional enrichment analysis for the genes in the selected subnetworks; the results are shown in Fig. 4. The enriched GO terms for BP in subnetwork 1 and 4 were mainly about several human   immune processes, which consistent with that obtained from the main network.

Identification of prognostic lncRNA markers
We

Nomogram for tumor recurrence prediction
Finally, we generated a simple-to-use nomogram based on the 5 prognostic lncRNA markers and clinical characteristics (pathology stage and age at diagnosis) of CA patients as shown in Fig. 5. It could provide useful information in prediction of tumor recurrence for asymptomatic CA patients based on multivariate logistic regression. The C-index for the model was 0.895 and the area under the ROC for the model is 0.885 (95% CI based on bootstrap method: 0.836-0.935). Both C-index and the ROC analysis suggested a good predict performance.

General comments
Differential expression of lncRNAs has been widely identified in various cancers. Published studies have revealed that lncRNAs have key roles in vital biological functions of cancers. However, only few studies have described the role of lncRNA profiles in tumor recurrence of CA [19][20][21][22]. In this study, we focused on the identification of novel prognostic markers for the tumor recurrence of CA based on the RNA sequencing data from the TCGA database. We have constructed the lncRNA-miRNA-mRNA ceRNA network to clarify the unknown ceRNA regulatory network in tumor recurrence of CA. As a result, 5 lncRNAs (CASC2, AL078459.1, AL390066.1, STK4-AS1 and HOXA-AS3) were identified through the ceRNA network and multivariate logistic regression. We also performed functional enrichment analysis to investigate the molecular role of the identified lncRNA biomarkers.
For the 5 identified lncRNA markers, CASC2 has been reported to be associated with colorectal cancer [33][34][35]. Studies have revealed that the deregulation of CASC2 by miRNA hsa-mir-21 and hsa-mir-18a increases the proliferation and migration of cancer cells in colorectal cancer [33][34][35]. The link between the CASC2 and the prognosis of CA suggested in published literatures were consistent with the results in this study. This also indicated that the results of our research were Table 3 (continued)   miRNA  Targeted mRNA   hsa-mir-30b  AQP6, CD180, CD3D, CLEC7A, FAM81B, FGF19, GNG4, HAVCR2, IGSF6, IL21R, ITGA4, LAX1, MSRB2, MTA1, SCNM1, TOR1AIP2   hsa-mir-30c  AQP6, CD180, CD3D, CLEC7A, FGF19, GNG4, HAVCR2, IGSF6, LAX1, MSRB2, SCNM1, TOR1AIP2   hsa-mir-30d  AQP6, AQP6, CD3D, CLEC7A, FAM81B, GNG4, HAVCR2, IGSF6, IL21R, LAX1 reasonable. For STK4-AS1, it was associated with protein coding gene STK4 while the down-regulation of STK4 was associated with the invasion and migration of colorectal cancer [36]. This also suggests that the association between STK4-AS1 and the prognosis of CA revealed in this study was reasonable. For HOXA-AS3, though it has not been identified in CA before, the upregulation of HOXA-AS3 was reported to be associated with tumor progression and poor prognosis in glioma [37]. In this study, we found STK4-AS1 and HOXA-AS3 were also related to tumor recurrence in CA. For lncR-NAs AL078459.1 and AL390066.1, though no functional roles have been reported in CA before this work, in our study, have been identified to be related to the tumor recurrence in CA.
To obtain a deep understanding of the selected lncRNA markers, the functional enrichment analyses were performed. The enriched in GO terms were related with several human immune process. IL-10 and IL-12, as representative immune factors, play an important role in inflammation and tumorigenesis [38], and published research also suggested a potential relationship between them and the progression of CRC [39]. Tumor infiltrating T-cells was related to the microsatellite instability and the prognosis of CRC [40]. Natural killer cell plays an important role in the anti-cancer defense, and has great potential in cancer immunotherapy in cancer immunotherapy [41]. Dendritic cell [42], tumor infiltrating mononuclear cell [43] and tumor infiltrating mast cell [44] were all associated with the progression and prognosis of CRC. MHC class I is a major component of tumor-associated antigen presenting system, which responded to a large number of chemotherapeutic agents in the treatment of CRC [45]. For the enriched pathways, RHO GTPases Activate Rhotekin and Rhophilins pathway was associated with the development and progression of several solid malignancies including CRC [46,47], Interleukin-21 signaling pathway was associated with the development of colitis-associated CRC [48] and Interleukin-2 family signaling pathway acts important role in current antitumor immunotherapy [49]. The immune infiltration is closely associated with prognosis of CRC [50,51], the results of enrichment analysis suggested that the genes in the ceRNA network were associated with several important human immune processes, thus may be associated with the clinical outcome of CRC.
For the subnetwork analysis, we identified 37 distinct subnetworks while only those with at least 10 nodes were selected (as shown in Fig. 3). Then, we performed the functional enrichment analysis for the genes involved in each subnetworks. For subnetwork 1, the enriched GO terms for BP were mainly about the co-stimulation of T cell and lymphocyte cell, the differentiation of endodermal cell and the proliferation of stem cell. For subnetwork 3, the enriched biological processes were mainly about the regulation of IL-17 and IL4, the differentiation of thymus and the regulation of the production of interferon gamma, all related to the progression and prognosis of CA [52][53][54][55]. Most of biological processes enriched for the subnetworks were about human immune process and this was consistent with the biological processes enriched for the main network. For the enriched terms for MF in subnetwork 1, though different from that enriched from the main network, phosphatidylcholine has been found to involve in the growth of CRC cell [56]. The enriched pathways based on the KEGG and Reactome databases for subnetwork 2 (nuclear transcript pathway) were consistent with those enriched in the main network. The consistency in functional enrichments between the subnetworks and the main network suggested good robustness of our analysis. The nomogram generated in this study was simpleto-use and would be useful in estimating the tumor recurrence risk for asymptotic patients with CA. It also visualized the associations between each prognostic lncRNA and clinical features (stage and age at diagnosis) and the tumor prognosis of CA patients.
Recently, many studies on the identification of prognostic genes in CA based on the TCGA database have been reported and their findings are then verified with functional experiments in succession [57][58][59][60]. Those studies all suggested a good reliability of the TCGA database in identifying new prognostic gene signatures for the prognosis in cancer studies and the potential value of the results obtained in the current work.

Limitations
Firstly, our study was a preliminary work. In this study, we obtained the RNA sequencing and clinical data from public database, but no clinical samples were involved. Thus, the importance of the selected markers still need to be validated in future cohort study. Then, though we tried to incorporated as many clinical factors as possible in our analysis, some important factors were still not available, for example, treatment information (such as chemotherapy, surgery and radiotherapy), and living habits (such as smoking or drinking habits). These might cause potential bias in analysis.

Conclusion
In this study, we identified five prognostic lncRNA markers for the prediction of tumor recurrence in CA based on ceRNA hypothesis and data obtained from the TCGA database, in which, four of the selected lncRNA markers (AL078459.1, AL390066.1, STK4-AS1 and HOXA-AS3) were identified for the first time. The hub genes in the network were annotated with functional gens sets associated with colorectal cancer and the mechanism of tumor progression and invasion. Our work also provides a simple-to-use nomogram