Identification of an immune signature predicting prognosis risk of patients in lung adenocarcinoma

Background Lung cancer has become the most common cancer type and caused the most cancer deaths. Lung adenocarcinoma (LUAD) is one of the major type of lung cancer. This study aimed to establish a signature based on immune related genes that can predict patients’ OS for LUAD. Methods The expression data of 976 LUAD patients from The Cancer Genome Atlas database (training set) and the Gene Expression Omnibus database (four testing sets) and 1534 immune related genes from the ImmPort database were used for generation and validation of the signature. The glmnet Cox proportional hazards model was used to find the best gene model and construct the signature. To assess the independently prognostic ability of the signature, the Kaplan–Meier survival analysis and Cox’s proportional hazards model were performed. Results A gene model consisting of 30 immune related genes with the highest frequency after 1000 iterations was used as our signature. The signature demonstrated robust prognostic ability in both training set and testing set and could serve as an independent predictor for LUAD patients in all datasets except GSE31210. Besides, the signature could predict the overall survival (OS) of LUAD patients in different subgroups. And this signature was strongly associated with important clinicopathological factors like recurrence and TNM stage. More importantly, patients with high risk score presented high tumor mutation burden. Conclusions This signature could predict prognosis and reflect the tumor immune microenvironment of LUAD patients, which can promote individualized treatment and provide potential novel targets for immunotherapy. Electronic supplementary material The online version of this article (10.1186/s12967-019-1824-4) contains supplementary material, which is available to authorized users.


Background
According to the latest cancer statistics released in 2018, lung cancer has become the most frequently diagnosed cancer type and the top-ranked reason for cancer death in the combined population of women and men world widely [1]. In the United States, there were approximately 234,030 new cases and 154,050 deaths in 2018 [2]. Lung cancer mainly has two subtypes, including non-small cell lung cancer (NSCLC) and small cell lung cancer. Adenocarcinoma (LUAD) and squamous cell carcinoma are two main types of NSCLC [3], of which LUAD is the most common type [4]. With the decreasing of smoking rates, lung cancer cases of never-smoker are increasing, most of which comprise LUAD. For those patients, molecularly targeted therapies considerably enhance their survival outcomes. Tyrosine kinase inhibitors (TKIs) targeting epidermal growth factor receptor (EGFR) have been observed as the first-line treatment method for advanced LUAD patients with sensitizing EGFR mutation [5]. ROS proto-oncogene 1 (ROS1) and anaplastic lymphoma kinase (ALK) gene rearrangements are other common oncogenes which are somatically activated for the targeted therapies of LUAD [6]. However, a large amount of advanced LUAD patients do not have targetable mutations. For these patients, antibodies against immune checkpoints like programmed death 1 (PD-1) and cytotoxic T lymphocyte-associated antigen-4 (CTLA-4) demonstrate established treatment activity and safety [7,8]. This highlights the importance of tumor immune microenvironment (TIM) on the clinical outcomes of LUAD patients.
The TIM constitutes of a variety of immune cells with either immune promoter or immune suppressor ability. TIM is able to limit the accumulation of T cells to where cancer cells locate [9]. Studies focusing on the impact of immune suppression elements like tumor-associated macrophages and myeloid-derived suppressor cells on LUAD patients' survival outcomes have achieved tremendous development [10][11][12][13]. However, there has been no signature that can systematically evaluate the TIM on the basis of immune-related genes and predict LUAD patients' overall survival or response to immunotherapies. Zheng et al. [14] recently demonstrated a signature based on B7-CD28 family that can predict LUAD patients' prognosis. Nevertheless, their investigations were limited to B7-CD28 family members, which may not represent the status of the entire TIM. Therefore, it's essential to develop an immune signature on the basis of a comprehensive list of immune-related genes that can stand for the immune status of TIM and be with prognostic ability in LUAD.
Our efforts concentrated on developing an immune signature with prognostic ability based on the comprehensive list of immune-related genes downloaded from The Immunology Database and Analysis Portal (ImmPort) database. The RNA sequencing (RNA-seq) data and microarray data from The Cancer Genome Atlas (TCGA) database and the Gene Expression Omnibus (GEO) database were used for analysis. Then, we evaluated whether this signature was associated with the survival outcome of subgroups of LUAD patients and clinicopathological factors. And finally, we tried to figure out the relationship between the signature and tumor immune-related indexes including mutation load and neoantigen in LUAD.

Publicly attainable expression datasets and immune related genes
The expression data were downloaded from the TCGA database and the GEO database. The RNA-seq data of 500 LUAD patients were collected from the TCGA database and used as the training set, which were downloaded from University of California Santa Cruz (UCSC) Genome Browser (https ://xena.ucsc.edu/publi c-hubs/). GSE 81089 was the other RNA-seq data of 108 LUAD patients downloaded from the GEO database (http:// www.ncbi.nlm.nih.gov/geo), which was used as one of the testing sets for constructing this signature. Fragments per kilobase of exon per million fragments mapped (FPKM) value was used to measure all of the RNA-seq data. The microarray data from GSE30219 (N = 85), GSE31210 (N = 226), GSE3141 (N = 57) were also collected from the GEO database and used as testing sets, respectively. A total of 976 patients were included for analysis. The clinical and survival information of the included datasets were summarized in Table 1. The comprehensive list of immune related genes containing a total of 1534 genes were downloaded from the ImmPort database (https :// immpo rt.niaid .nih.gov) [15].

Development and validation of the immune signature for LUAD
The cases from the TCGA database were used as the training set to develop the immune signature. Univariate analysis and logRank test were used to identify immune related genes with prognostic ability. For the genes with prognostic ability, Cox proportional hazards model (iteration = 1000) with an lasso penalty was used to find the best gene model utilizing a R package called "glmnet" [16]. The best gene model was used to establish the immune signature. Then, the concordance (c)-index proposed by Harrell et al. [17] was applied to validate the predictive ability of the signature in all of the five datasets, by using the "survcomp" R package [18]. The larger c-index indicated the more accurate predictive ability of the model.

Survival analysis
The Kaplan-Meier (K-M) survival curves were generated to graphically demonstrate the overall survival (OS) of the high-risk group and low-risk group which were stratified by the immune signature. The univariate and multivariate analyses of survival were conducted for both the immune signature and clinicopathologic factors. The R package called "survival" was utilized to perform the survival analysis.

Mutation load and neoantigen analysis
Mutation data that contained somatic variants were stored in Mutation Annotation Format (MAF) form and were downloaded from Genomic Data Commons (GDC) (https ://porta l.gdc.cance r.gov/). Nonsynonymous mutations were used for our investigations, considering the uncertainty of functional consequences of synonymous mutations. And nonsynonymous mutations were potential sources of neoantigen epitopes. Nonsynonymous mutations included missense mutation, nonsense mutation, splice site mutation, frameshift mutation, and inframe mutation. The total number of nonsynonymous mutations were utilized as the mutation burden of LUAD patients to investigate the relationship between the immune signature and patients' mutation load. The single nucleotide polymorphism (SNP) was also analyzed for its association with our signature. The number of neoantigens was cited from a published study [19] so as to figure out the correlation of the signature with the number of neoantigens in LUAD patients.

Statistical analysis
Student's t test was conducted to make statistical comparison. The "ggplot" R package was used to generate boxplots. "ComplexHeatmap" R package was applied to generate heatmaps [20]. Two-tailed p values less than  0.05 were thought to be statistically significant. All of our analyses were conducted using R software version 3.5.1 (https ://www.r-proje ct.org/).

Construction of immune signature
To make our investigations clearer, a workflow that illustrated the generation of the signature was demonstrated in Fig. 1. The univariate analysis was performed in all of the 1534 immune related genes for TCGA LUAD datasets. There were 144 genes with prognostic ability after the univariate analysis and logRank test (P < 0.05). The 144 immune related genes then underwent the Cox proportional hazards regression with tenfold cross-validation to generate the best gene model. We totally performed 1000 iterations and included 10 gene groups for further screening. The gene lists of the 10 gene groups were shown in Additional file 1: Table S1. As illustrated in Fig. 2a, a gene model with 30 immune related genes was with the highest frequencies of 211 compared to other nine gene models. Thus, this gene model became the most suitable role to generate the immune signature for LUAD. Therefore, we utilized the 30 immune related genes in this gene model to construct our immune signature, as listed in Additional file 1: Table S1. The coefficient value of the 30 genes were listed in Additional file 2: Table S2. The prognostic ability of the 30 immune related genes in LUAD patients was confirmed in the training set (The TCGA dataset, Additional file 3: Figure S1), which showed that all of the 30 genes were able to predict survival outcome of LUAD patients. However, the prognostic ability of the 30 genes was not consistent in the four testing sets (The GEO datasets, Additional files 4, 5, 6, 7: Figures S2-S5).

Validation of immune signature
To validate our signature, we firstly calculated the c-index for the prediction of OS. The c-index for TCGA dataset, GSE30219, GSE31210, GSE3141, and GSE81089 were 0.723, 0.657, 0.7061, 0.641, and 0.619 respectively (P < 0.05, Fig. 2b), which indicated the high predictive accuracy of the signature for survival. Then, the risk score for each patient was calculated according to the coefficient value of the 30 genes. Patients were divided into high-risk and low-risk groups with the median risk score utilized as the cutoff value, as demonstrated in Fig. 3ae. Patients of high-risk were with poor OS compared with those of low-risk in both TCGA and GEO datasets ( Fig. 4a-e, P < 0.05). We further validated the prognostic ability of the signature in subgroups of LUAD, and we found the immune signature could also predict the survival outcome of patients in clinically important subgroups. In TCGA datasets, patients in high risk group demonstrated poor prognosis in T1-3 stage, N0-3 stage, M0-1 stage, stage I-IV, recurrence, and no recurrence (P < 0.05, Additional file 8: Figure S6). In GSE30219 datasets, patients in high risk group showed poor prognosis in T1 stage, N0 stage (P < 0.05, Additional file 9: Figure S7). In GSE31210 dataset, high risk patients in stage I group demonstrated poor survival outcome (P < 0.05, Additional file 10: Figure S8). In GSE81089 datasets, patients in stage III exhibited a negative correlation between the risk score and patients' OS (P < 0.05, Additional file 11: Figure S9). The univariate Cox analysis of the immune signature also indicated the significant association of the signature with LUAD patients' OS in both TCGA and GEO datasets (P < 0.05, Fig. 5 Fig. 6. As for the prognostic ability of clinical factors, we found only tumor recurrence and stage IV could serve as independent predictors for patients' OS, which indicated the strong prognostic ability of our signature.

Association with clinicopathologic factors
To further validate the clinical value of the 30-genes immune signature, we evaluated the relationship between the signature and clinicopathologic factors. In TCGA cohort, patients of high-risk were tended to have advanced T stage, N stage, M stage, pathological stage and were under high risk of recurrence (P < 0.05, Additional file 12: Figure S10). In GSE30219 cohort, high-risk score was associated with higher T stage and N stage (P < 0.05, Additional file 13: Figure S11). In GSE31210 cohort, the risk score was only positively related to advanced pathological stage (P < 0.05, Additional file 14: Figure S12). And we did not find the association of the risk score and pathological stage in GSE81089 (Additional file 15: Figure S13).

Association with mutation load and neoantigen
Higher nonsynonymous mutation burden load and neoantigen number have shown associations with clinical efficacy of immune checkpoint inhibitor therapy [21,22]. Therefore, we would like to investigate whether our immune signature could affect mutation load and number of neoantigen of LUAD for the possibility of the risk score to be the predictor of response to immune checkpoint inhibitor. Patients with high risk score exhibited higher nonsynonymous mutation load than those with low risk score (P = 0.0112, Fig. 7a). To further explore which types of nonsynonymous mutation were the major contributors to this relationship, we evaluated the association of the signature with different types of nonsynonymous mutation. High-risk group patients had higher missense mutation (P = 0.0098, Fig. 7b), nonsense mutation (P = 0.0166, Fig. 7c), splice site mutation (P = 0.0217, Fig. 7d), and inframe deletion (P = 0.0085, Fig. 7g). We did not find this relationship in frameshift mutation (Fig. 7e, f ), inframe insertion (Fig. 7h), total deletion mutation (Fig. 7j), and total insertion mutation (Fig. 7k). Besides, we found there demonstrated a positive correlation between the signature and the number of SNP (P = 0.0098, Fig. 7i). However, we did not find correlation between the signature and the number of neoantigens in LUAD (Fig. 7l).

Discussion
The treatment of LUAD has experienced huge evolvement in the past 30 years, especially with the efficacy of immunotherapy. This shades light on the important role of TIM in the development and progression of LUAD.
In this investigation, we established a robust prognostic signature on the basis of TIM in TCGA dataset and proved its efficacy in four GEO datasets. Our signature may represent the status of TIM for LUAD patients and provide potential biomarkers for the response to immunotherapy and targets for immunotherapy. The study found that our signature was significantly correlated with LUAD patients' OS. And their correlation showed a high statistical significance in the training set, testing sets, and the subgroups of LUAD patients, which indicates the signature is able to provide a robust prognostic tool for the total cohort and subgroups of LUAD patients. Besides, the signature showed strong correlation with clinicopathologic factors, further highlighting the firmly prognostic ability of our signature. In addition, we found that some of the 30 immune genes had different prognostic ability in different datasets. This indicates the Fig. 6 Multivariate Cox analysis evaluating independently predictive ability of our signature for OS. The signature was able to independently predict patients' OS in TCGA cohort [Hazard ratio (HR) = 2.1868, 95% confidence intervals (95% CI) 1.7612 to 2.7152, P < 0.001], GSE30219 cohort (HR = 1.6354, 95% CI 1.1632-2.2993, P = 0.0047), and GSE81089 (HR = 1.5156, 95% CI 1.1425-2.0106, P = 0.0039). However, this relationship could not find in GSE31210 (P = 0.4298). TCGA The Cancer Genome Atlas, HR hazard ratio, 95% CI 95% confidence interval instability of a single gene in predicting the OS of LUAD patients, while the signature which integrates the efficacy of all the 30 immune related genes showed a consistent predictive ability of OS in all of the datasets. Therefore, this immune signature has a greater value than a single gene in predicting patients' survival outcome. More importantly, our signature was on the basis of immune related genes and demonstrated a positive association with nonsynonymous mutation load and different types of nonsynonymous mutation. Considering the importance of tumor mutation load in predicting the response to immunotherapy [23], we could confer there may be a connection between our signature and response to immunotherapy. The NF-κB is a key participant in both immune response and human cancer initiation and progression [24][25][26]. Therefore, NF-κB is a crucial part linking immunity and cancer. Interestingly, there was a study demonstrating that inhibition of NF-κB c-Rel could impair regulatory T cells mediated immunosuppression and potentiate anti-PD-1 therapy efficacy [27]. Among the 30 immune related genes in our signature, RELA is a subunit of NF-κB that is essential for NF-κB activation [28]. Hence, this further indicates that our signature may be related to response to immunotherapy. Considering Fig. 7 The relationship between the risk signature and nonsynonymous mutation burden, different nonsynonymous mutation types, and neoantigen. a High-risk group patients had higher nonsynonymous mutation burden (P = 0.0112). b Patients in high-risk group were associated with higher number of missense mutation (P = 0.0098). c Patients in high-risk group were associated with higher number of nonsense mutation (P = 0.0166). d Patients in high-risk group were associated with higher number of splice site (P = 0.0217). e There was no association of the risk score with the number of frame shift deletion. f There was no association of the risk score with the number of frame shift insertion. g Patients in high-risk group was associated with higher number of inframe deletion (P = 0.0085). h There was no association of the risk score with the number of inframe insertion. i Patients in high-risk group was associated with higher number of SNP (P = 0.01). j There was no association of the risk score with total number of deletion mutation. k There was no association of the risk score with total number of insertion mutation. l There was no association of the risk score with the number of neoantigen. Del deletion, Ins insertion, SNP single nucleotide polymorphism, NeoAgs neoantigens, TCGA The Cancer Genome Atlas