A robust six-gene prognostic signature for prediction of both disease-free and overall survival in non-small cell lung cancer

Background The high mortality of patients with non-small cell lung cancer (NSCLC) emphasizes the necessity of identifying a robust and reliable prognostic signature for NSCLC patients. This study aimed to identify and validate a prognostic signature for the prediction of both disease-free survival (DFS) and overall survival (OS) of NSCLC patients by integrating multiple datasets. Methods We firstly downloaded three independent datasets under the accessing number of GSE31210, GSE37745 and GSE50081, and then performed an univariate regression analysis to identify the candidate prognostic genes from each dataset, and identified the gene signature by overlapping the candidates. Then, we built a prognostic model to predict DFS and OS using a risk score method. Kaplan–Meier curve with log-rank test was used to determine the prognostic significance. Univariate and multivariate Cox proportional hazard regression models were implemented to evaluate the influences of various variables on DFS and OS. The robustness of the prognostic gene signature was evaluated by re-sampling tests based on the combined GEO dataset (GSE31210, GSE37745 and GSE50081). Furthermore, a The Cancer Genome Atlas (TCGA)-NSCLC cohort was utilized to validate the prediction power of the gene signature. Finally, the correlation of the risk score of the gene signature and the Gene set variation analysis (GSVA) score of cancer hallmark gene sets was investigated. Results We identified and validated a six-gene prognostic signature in this study. This prognostic signature stratified NSCLC patients into the low-risk and high-risk groups. Multivariate regression and stratification analyses demonstrated that the six-gene signature was an independent predictive factor for both DFS and OS when adjusting for other clinical factors. Re-sampling analysis implicated that this six-gene signature for predicting prognosis of NSCLC patients is robust. Moreover, the risk score of the gene signature is correlated with the GSVA score of 7 cancer hallmark gene sets. Conclusion This study provided a robust and reliable gene signature that had significant implications in the prediction of both DFS and OS of NSCLC patients, and may provide more effective treatment strategies and personalized therapies. Electronic supplementary material The online version of this article (10.1186/s12967-019-1899-y) contains supplementary material, which is available to authorized users.


Background
Lung cancer is the leading cause of cancer death worldwide, and non-small cell lung cancer (NSCLC) composes the majority (approximately 85%) of all lung cancers [1,2]. Despite advances in treatment strategies, the high mortality rate for lung cancer patients has not considerably declined, due to the late diagnosis of the disease [3]. The major clinical determinants of NSCLC prognosis include tumor extension, performance status and histological type [4,5]. However, various disease outcomes have been identified in patients with similar clinical and pathological features, suggesting that the current clinical prognostic factors used may be insufficient to consistently predict individual clinical outcomes [6]. This emphasizes the necessity of identifying robust and reliable prognostic markers with higher sensitivity and accuracy in NSCLC.
Transcriptome profiling has widely been used to characterize prognostic signatures in patients with lung cancer, and has generated a number of candidate biomarkers with potential clinical values [7][8][9]. However, the suggested signatures lack consistency among studies and provide limited prognostic information, partially due to the limited sample size and technical factors. Moreover, NSCLC is a highly heterogeneous disease, thus it is critical to identify a reliable signature that can define patients who are at a high-risk to develop disease recurrence. To this end, integrating the results from multiple studies holds promise for more robust prognostic signatures. In addition, most investigations used overall survival (OS) rather than tumor recurrence as an end point [9][10][11]. Disease-free survival (DFS) is defined as the interval from surgery to the first diagnosis of any type of relapse or death, and is used as a possible alternative for OS. Therefore, we attempted to identify and validate a robust and reliable prognostic signature for DFS and OS prediction by integrating multiple datasets of NSCLC patients. In the present study, we revealed a six-gene signature with a reliable prognostic value in NSCLC, which might complement conventional clinical prognostic factors, and further provide more effective therapeutic interventions and personalized therapies for NSCLC patients.

Patient data
Gene expression data and corresponding clinical information data of NSCLC patients were obtained from the publicly available database Gene Expression Omnibus (GEO, http://www.ncbi.nlm.nih.gov/ geo/). Three independent datasets were collected in this study, under the accessing number of GSE31210 [12,13], GSE37745 [14] and GSE50081 [15]. These gene expression data were generated using the same chip platform Affymetrix HG-U133 Plus 2.0 platform. GSE31210 consisted of a total of 226 lung adenocarcinoma cases. GSE37745 included 196 NSCLC cases, including 106 adenocarcinoma, 24 large cell carcinoma, and 66 squamouscarcinoma. GSE50081 included 181 NSCLC cases, including 127 adenocarcinoma, 7 large cell carcinoma, 43 squamous carcinoma, and 4 adenosquamous carcinoma. A total of 603 cases were enrolled for OS analysis, including 226 patients from GSE31210, 196 patients from GSE37745, and 181 patients from GSE50081. For DFS analysis, a total of 499 patients were finally included, including 226 patients from GSE31210, 96 patients from GSE37745, and 177 patients from GSE50081. The details of the patients' clinical information in each dataset are described in Tables 1 and 2. All microarray data were normalized using robust multi-array average (RMA) and microarray Suite 5 (MAS5) methods, and log2scale transformed in this study.
The genomic data and clinical information of NSCLC patients in The Cancer Genome Atlas (TCGA) were obtained from the University of California Santa Cruz Xenabrowser (UCSC Xena, http://xena.ucsc.edu/) [16]. This cohort has 761 NSCLC patients with the corresponding gene expression data (read count) and clinical information (including survival data).

Prognostic gene signature screening
In this study, we firstly screened candidate prognostic genes from each cohort, and selected the common ones for constructing the prognostic gene signature. Then, the prognostic value of the signature was validated using each cohort. The flow diagram of this study is illustrated in Fig. 1. An univariate Cox proportional hazard regression model was implemented to determine the association of gene expression with DFS and OS in each cohort. Genes under a cutoff value of P < 0.05 were defined as candidate genes related to OS and DFS, and the common genes among three datasets were selected to construct the prognostic signature. Hazard ratio (HR) from the univariate Cox regression analysis was used to determine the protective (HR < 1) and risky genes (HR > 1).
Then, a risk score was established for each patient by calculating the expression values of the selected genes weighted by regression coefficients in the univariate Cox regression analysis. The formula used was as follows: Where n is the number of selected genes, exp i is the expression level of gene i, and β i represents the regression coefficient of gene i. Subsequently, the risk score was dichotomized at the median value, and patient whose risk score was greater than the median value was divided into a high-risk group, otherwise into alow-risk group. Evaluation of the robustness of prognostic gene signature by re-sampling tests The data from three GEO datasets (GSE31210 [12,13], GSE37745 [14] and GSE50081 [15]) were combined.
Then a subset that contained 70% samples of the combined GEO dataset (re-sampling) was randomly selected and used to determine the prediction power

Association of prognostic gene signature and cancer hallmarks
A total of 50 hallmark gene sets which are currently recognized were downloaded from the molecular signature database (MSigDB, http://softw are.broad insti tute.org/ gsea/msigd b). Next, gene set variation analysis (GSVA) package and its ssGSEA method (http://www.bioco nduct or.org) were implemented for these 50 hallmark gene sets to further obtain the GSVA scores of each gene set for each sample in the combined GEO datasets [17]. The GSVA score devotes the degree of absolute enrichment of a gene set in each sample. After that, Pearson's correlation analysis was performed to investigate whether the GSVA score of the members of the given gene set was correlated with the risk score. The correlation coefficients (R), confidence interval (CI) and P values were calculated.

Statistical analysis
The DFS and OS were calculated using Kaplan-Meier curves, and the statistical difference was determined by log-rank test. Influences of various variables on DFS and OS were evaluated by univariate and multivariate Cox proportional hazard regression models. HR and the 95% CI were generated using Cox proportional hazards models. The receiver operating characteristic (ROC) curve analysis was carried out to compare the predictive accuracy of the gene signature. A P value < 0.05 was set as the statistically as the significant difference.

Identification of a six-gene prognostic signature
We firstly identified survival-related genes using univariate Cox regression analysis in each dataset. Under the cut-off threshold of Cox P < 0.05, 7217 genes in GSE31210, 1195 genes in GSE37745 and 2813 genes in GSE50081 were identified as candidate predictive genes that presented close association with DFS. Similarly, 2539 genes in GSE31210, 1720 genes in GSE37745 and 2453 genes in GSE50081 were identified to be involved in OS. By overlapping these candidate genes among three datasets, a set of 6 common genes was screened finally, including one risky gene (HR > 1) and 5 protective genes (HR < 1). The general information of these 6 genes is displayed in Table 3.

The six-gene signature predicts survival of NSCLC patients
According to the gene expression and regression coefficients of the 6 genes, a prognostic model was developed to predict prognosis using a risk score method. In the prognostic model, each patient was endowed a risk score.
Using the median risk score value as the cut-off point, patients in each dataset were classified into low-risk and high-risk groups. The DFS prediction power of the sixgene signature for patients in each dataset is displayed in Fig. 2. The distribution of gene risk scores, gene expression levels, and patients' relapse status in each dataset are shown in Fig. 2a.
The OS prediction value of the six-gene signature for patients in each dataset is shown in Fig. 3. Figure 3a illustrates the distribution of gene risk scores, gene expression levels and patients' survival status in each dataset. Consistent with our previous finding, patients in the highrisk groups had significantly shorter OS when compared with those in the low-risk groups (GSE31210: HR = 6.14, 95% CI 2.68-14.07, P < 0.05; GSE37745: HR = 1.56, 95% CI 1.12-2.16, P < 0.05; GSE50081: HR = 2.21, 95% CI 1.34-3.64, P < 0.05) (Fig. 3b). Patients with high risk scores tended to have poorer clinical outcomes compared with those with low risk scores. In addition, the time-dependent ROC curve was implemented to measure the sensitivity and specificity of the six-gene signature for OS prediction in each dataset. Markedly, the signature achieved AUC values of 0.749, 0.685 and 0.667 in GSE31210, GSE37745 and GSE50081, respectively (Fig. 3c), implying a high OS prediction performance. The six-gene prognostic signature is robust Previous study has demonstrated that tumor heterogeneity limits the generation of robust prognostic biomarker [18]. Thus, we conducted re-sampling tests for validation of the robustness of the prognostic gene signature. As shown in Additional file 1: Table S1, we found that in all the random 100 model validations of prediction power of the gene signature for OS by re-sampling, the P values were less than 0.0001 in each univariate Cox and Kaplan-Meier analysis. Notably, the six-gene signature achieved the AUC values of more than 0.650 for 1, 2, 3, 4, and 5-year OS in the combined GEO datasets, demonstrating a high OS prediction performance. Similarly, among all the random 100 model validations of prediction power of the gene signature for DFS, the P values were less than 0.0001 in each univariate Cox and Kaplan-Meier analysis (Additional file 2: Table S2). Moreover, the six-gene signature obtained the AUC values of more than 0.610 for 1, 2, 3, 4, and 5-year DFS in the combined GEO datasets (Additional file 2: Table S2), which implicates that this signature has an effective performance for DFS prediction. Overall, these results suggest this six-gene signature for predicting prognosis of NSCLC patients is robust.

The six-gene signature is an independent prognostic factor
Here, we performed univariate and multivariate Cox regression models in these three datasets. The six-gene risk score and other clinicopathological factors, including age, gender, stage, histological type, gene mutation, smoking and performance status were used as covariates.
The association between these factors and DFS is shown in Table 1. Univariate regression analysis indicated that age, EGFR mutation, triple negative status, disease stage and risk score were significantly associated with the DFS of NSCLC patients in GSE31210; WHO performance status and risk score were significantly associated with the DFS of patients in GSE37745; and stage and risk score were related to the DFS of patients in GSE50081.
In the entire cohort, stage and risk score were identified to have significant correlation with the DFS of NSCLC patients. Moreover, in order to determine whether the six-gene signature was independent of other clinical factors, we performed a multivariate regression analysis, and found a significant correlation of the six-gene signature with DFS in three datasets (GSE31210: HR = 3.10, 95% CI 1. The correlation of risk score and other clinicopathological factors with the OS of NSCLC patients is shown in Table 2. We performed an univariate regression analysis to determine the correlation between these factors and OS. Our results indicated that age, EGFR mutation, triple negative status, stage and risk score were OS prognostic factors for NSCLC patients in GSE31210; stage I/III, WHO performance status and risk score were significantly related to OS of patients in GSE37745; and age, stage and risk score were OS prognostic factors for patients in GSE50081. In the entire cohort, age, gender, stage and risk score were correlated with OS of NSCLC patients. Subsequent multivariate regression analysis indicated that the sixgene signature was an independent OS prognostic factor in three datasets (GSE31210: HR = 5.47, 95% CI 2.30-12.99, P = 1.21E−04; GSE37745: HR = 1.44, 95% CI 1.02-2.04, P = 3.80E−02; GSE50081: HR = 2.09, 95% CI 1.27-3.45, P = 3.91E−03) and entire cohort (HR = 1.65, 95% CI 1.26-2.18, P = 3.32E−04), after adjusting for other clinical factors. Taken together, our data show that the six-gene risk score was an independent adverse prognostic factor for both DFS and OS of NSCLC patients.
Furthermore, we performed a data stratification analysis on the entire cohort. These patients (499 patients for DFS and 603 for OS) were factitiously stratified based on their clinical parameters, such as age (≤ 65/> 65), gender (female/male), stage (I/II) and histological type (adenocarcinoma/squamous carcinoma). Because of the small sample size, patients in stage III and IV, and patients with large cell cancer were removed from the stratification analysis. The results showed that the six-gene risk score remained the ability of predicting the DFS and OS within each stratum. In Fig. 4a, the results of stratification analysis showed that high-risk patients in each stratum of age, gender and early stage had significantly shorter DFS than lowrisk patients (P < 0.05). For patients with adenocarcinoma, high-risk patients showed significantly shorter DFS than low-risk patients (P < 0.05), while there was no significant difference between high-risk and low-risk patients for patients with squamous carcinoma, might due to the small sample size of patients with squamous carcinoma. In Figure 4b, the results of stratification analysis indicated that high-risk patients in each stratum presented significantly poorer OS than low-risk patients (P < 0.05), except for patients in stage II and patients with squamous carcinoma. Taken together, our findings suggested that the six-gene signature was independent of other clinical features for DFS and OS prediction in NSCLC patients.

Further validation of the six-gene signature using another independent dataset
To investigate the reliability of the six-gene signature, another independent dataset from the TCGA was used for further validation. The risk score of each sample in this dataset was calculated, and the samples were then classified into low-and high-risk groups using the median risk score value as the cut-off point (Fig. 5a, b). Kaplan-Meier and univariate Cox regression analysis exhibited that the patients in the high-risk group had obviously shorter DFS than those in the low-risk groups (HR = 1.33, 95% CI 1.02-1.73, P < 0.05) (Fig. 5a and  Table 4). Similarly, patients in the high-risk group presented remarkably shorter OS compared to those in the low-risk groups (HR = 1.54, 95% CI 1.20-1.96, P < 0.05) ( Fig. 5b and Table 4). Univariate Cox regression analysis also indicated that stage II, stage III and squamous histologic type were poor prognostic factors for prediction of DFS, and stage II, stage III and stage IV were poor prognostic factors for prediction of OS (Table 4). However, subsequent multivariate regression analysis indicated that only stage II, stage III and risk score were independent prognostic factors for prediction of both DFS and OS (P < 0.05). These results suggest that the six-gene signature is valid and reliable across datasets and platforms.

The six-gene signature is association with several hallmarks
To identify the six-gene signature associated biological processes, the correlation of the risk score of the gene signature for predicting DFS/OS and the GSVA score of cancer hallmark gene sets was investigated. As shown in Fig. 6, a total of 7 hallmark gene sets (E2F_TARGETS, G2M_CHECKPOINT, GLYCOLYSIS, is higher than or equal to 0.4; P < 0.0001]. Interestingly, these biological processes and the risk score displayed the same trend, suggesting that activation of theses hallmarks might participate in the process of tumor progression and affect the survival of the patients with NSCLC.

Discussion
Numerous reports have indicated that disturbed gene expression may be implicated in various aspects of tumor, including tumorigenesis, progression and prognosis [19][20][21]. Some genes have been considered as prospective biomarkers to predict prognosis in NSCLC patients [14,22,23]. However, several concerns limit their prognostic and predicative power, such as inadequate samples, lack of DFS prediction, and lack of effective validation. In this study, we developed and validated a novel prognostic six-gene signature that was found to be significantly associated with both the DFS and OS of NSCLC patients.
Our results revealed that this classifier could successfully identify high-risk and low-risk NSCLC patients with significant differences in both DFS and OS. In addition, the prognostic value of the six-gene signature was verified in three GEO datasets and an independent TCGA dataset, suggesting the reproducibility and reliability of the six-gene signature for both DFS and OS prediction in NSCLC.
The clinical prognostic factors in NSCLC include stage, age, gender and performance status [6,24]. Our study showed that stage and age were significantly correlated with both the DFS and OS of patients in GSE31210; performance status was significantly associated with DFS, and stage and performance status were related to OS of patients in GSE37745; stage and gender were significantly involved in OS of patients in GSE50081. In the entire patient cohort, stage was identified as an independent prognostic factor for DFS, and age and stage were associated with OS of NSCLC patients. Furthermore, we performed a stratification analysis on the entire cohort, and found that the prognostic power of the six-gene signature was independent of age, gender and stage. Interestingly, Birim et al. [24] indicated that non-squamous cell histology was a risk factor for postoperative outcome in NSCLC. Our study showed that histological type had no significant association with either DFS or OS in NSCLC. While stratification analysis indicated that high-risk group had significantly shorter DFS and OS than low-risk group for patient with adenocarcinoma but not squamous carcinoma.
Currently, tumor stage has been broadly utilized as a strong indicator of survival in NSCLC [25]. However, the current staging system is far from accurate in the aspect of survival prediction at the individual level [26]. As expected in our study, univariate and multivariate analysis showed that stage II, stage III, and stage IV were significantly associated with OS and DFS in the entire GEO cohort. However, stage II, and stage III were found to be the independent prognostic factors for prediction of both DFS and OS in the TCGA database. As documented, age is a main indicator of patient survival, and younger patients are tended to survive longer than the older ones [27][28][29]. Nevertheless, age alone is not a survival indicator for cancer patients because older patients are less likely to receive adjuvant therapy [30]. Multivariate analysis showed that age was significantly associated with OS in the entire GEO cohort and the TCGA cohort, but there was no correlation between DFS and age in these two cohorts. Thus, compared to age and stage, the risk score was a more reliable prognostic factor for NSCLC patients. Several gene mutations have been revealed to be associated with the pathogenesis of NSCLC, such as EGFR and KRAS mutations [31]. Notably, EGFR or KRAS mutated lung cancer accounts for a significant subgroup of NSCLC, especially in adenocarcinoma [32,33]. Markedly, targeting EGFR mutations has changed the therapeutic paradigm in NSCLC patients harboring EGFR mutations [34]. Over the last decade, multiple EGFR tyrosine kinase inhibitors (TKIs) have been developed to target mutated EGFR, and have achieved a better survival in patients with EGFR mutations than in those with the wild type [35]. Whereas, KRAS mutations predict a worse prognosis among NSCLC patients treated by chemotherapy and EGFR-TKIs [36,37]. In the present study, univariate analysis indicated that EGFR mutations, but not KRAS mutations, were correlated with both DFS and OS, whereas multivariate analysis indicated that EGFR mutation status did not act as an independent prognostic factor.
In this study, we developed a prognostic six-gene signature for both DFS and OS prediction in NSCLC. Most of these genes have not been well characterized in tumor biology, except for CDCP1. CDCP1, also known as CUB domain-containing protein 1, is a transmembrane glycoprotein, whose phosphorylation is linked with the progression and metastasis of several cancers [38,39]. In addition, blocking of CDCP1 has been shown to be a potential mode for therapeutic intervention against metastatic disease [38,40]. Chiu et al. [41] showed that the ADAM9 metalloprotease enhanced CDCP1 expression via activating EGFR signaling pathways in advanced lung cancer disease. Ikeda et al. [42] revealed that CDCP1 expression was an independent prognostic factor for both OS and DFS, and could be used as an useful marker for survival prediction of patients with lung adenocarcinoma. Our study combined these 6 genes into a single panel, and established its prognostic value in both DFS and OS in NSCLC.
As reported, the complexity of cancer can be decreased and presented by a few cancer hallmarks that enable cancer cell proliferation and metastasis. These hallmarks can offer a framework to understand the cancer diversity. Hence, we focused on detecting the association of prognostic gene signature and cancer hallmarks [43]. Studies have revealed that DNA replication, cell cycle, DNA damage repair, apoptosis, chromosome and gene instability, energy supply play important roles in cancer progression [44][45][46][47]. Coincidentally, the GSVA results demonstrated that the six-gene signature was remarkably connected with these biological processes. Specifically, E2F targets have been demonstrated to participate in DNA replication, cell cycle, DNA damage repair, apoptosis [48]. G2M checkpoint, mitotic spindle, and MYC targets have been reported to contribute to the instability of chromosome and gene [49][50][51]. Utilization of glycolysisrelated metabolic pathway has been implicated to provide ATP as a main source of energy supply for cancers [52]. Moreover, mTORC1 signaling has been suggested to be activated in human fibrolamellar liver carcinoma [53]. Demonstrated here, the results of present study demonstrated that the six-gene signature correlated with several cancer-progression associated biological processes which supported the DFS/OS predictive ability of the signature. Significantly, the correlation analysis in our study showed that patients having these activated biological processes tended to have adverse outcomes. Thus, this further confirmed that our six-gene signature used for predicting prognosis was reasonable and reliable.
In this work, some limitations need to be acknowledged. First, a few clinical characters presented an unbalanced distribution, such as an overwhelming majority of patients in stage I/II and presenting a histological adenocarcinoma type. Thus, the robustness of the six-gene signature requires further validation in large-scale prospective investments. Second, most of the genes identified here are rarely reported in the academic literature, and there are no experimental data regarding the identified signature, thus more evidence is needed to elucidate the inherent correlation between the six-gene signature and the prognosis of NSCLC patients. Despite these drawbacks, our results demonstrate valuable information on the importance and significance of the six-gene signature in both DFS and OS prediction in NSCLC.

Conclusions
In this study, we developed an innovative six-gene prognostic signature for both DFS and OS prediction in NSCLC patients. The six-gene signature was an independent prognostic factor, and might complement clinicopathological factors and facilitate the personalized treatment of NSCLC patients. Large-scale prospective investments should be applied for further assessment of the robustness of this signature in future studies.

Additional files
Additional file 1: Table S1. Validating the prediction power of the gene signature for OS in the combined GEO dataset by re-sampling analysis.
Additional file 2: Table S2. Validating the prediction power of the gene signature for DSF in the combined GEO dataset by re-sampling analysis.