Development and validation of a survival model for lung adenocarcinoma based on autophagy-associated genes

Background Given that abnormal autophagy is involved in the pathogenesis of cancers, we sought to explore the potential value of autophagy-associated genes in lung adenocarcinoma (LUAD). Methods RNA sequencing and clinical data on tumour and normal samples were acquired from The Cancer Genome Atlas (TCGA) database and randomly assigned to training and testing groups. Differentially expressed autophagy-associated genes (AAGs) were screened. Within the training group, Cox regression and Lasso regression analyses were conducted to screen five prognostic AAGs, which were used to develop a model. Kaplan–Meier (KM) and receiver operating characteristic (ROC) curves were plotted to determine the performance of the model in both groups. Immunohistochemistry was used to demonstrate the differential expression of AAGs in tumour and normal tissues at the protein level. Gene Ontology (GO) functional annotation and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analyses were utilized to further elucidate the roles of AAGs in LUAD. Results The data from the TCGA database included 497 tumour and 54 normal samples, within which 30 differentially expressed AAGs were screened. Using Cox regression and Lasso regression analyses for the training group, 5 prognostic AAGs were identified and the prognostic model was constructed. Patients with low risk had better overall survival (OS) in the training group (3-year OS, 73.0% vs 48.0%; 5-year OS, 45.0% vs 33.8%; P = 1.305E−04) and in the testing group (3-year OS, 66.8% vs 41.2%; 5-year OS, 31.7% vs 25.8%; P = 1.027E−03). The areas under the ROC curves (AUC) were significant for both the training and testing groups (3-year AUC, 0.810 vs 0.894; 5-year AUC, 0.792 vs 0.749). Conclusions We developed a survival model for LUAD and validated the performance of the model, which may provide superior outcomes for the patients.


Background
Lung cancer has the highest morbidity and mortality rates worldwide and is therefore a constant threat to human life [1]. Lung adenocarcinoma (LUAD) is a prevalent pathological subtype of lung cancer, accounting for nearly 45% of lung cancer. Despite advances in the global medical industry and changes in health awareness, the outcomes of patients with lung cancer remain poor, in part because almost 80% of the patients are at an advanced stage when diagnosed; another reason may be that the current TNM (tumour size/lymph nodes/distant metastasis) staging system is not always accurate for postoperative tumour staging, and therefore necessary adjuvant treatments may not be applied [2,3]. Therefore, it is necessary to explore alternative methods for diagnosis and accurate postoperative tumour staging.
Autophagy is considered a vital catabolic process within eukaryotic cells, allowing lysosomes to degrade damaged, senescent, or nonfunctional proteins and organelles [4,5]. Early studies have reported that autophagy is involved in many pathophysiological processes such as immune responses, inflammation, neurodegenerative diseases, tumourigenesis and cancer progression [6,7]. Early in 1976, JS et al. first reported that cellular autophagocytosis progressed in cervical cancer cells in the absence of serum and amino acids [8]. Later studies showed that autophagy may play a part in degrading and recycling components of nonfunctional organelles to supply the demands of tumour progression [9,10]. Nassour et al. demonstrated that autophagy was vital for tumour suppression, and the absence of autophagy was necessary for the initiation of tumours [11]. Therefore, autophagy may not only be involved in the inhibition of cancer but may also be related to the development and advancement of tumours [12][13][14].
Over the last decade, scholars have performed many studies to explore the role of autophagy in LUAD. Some studies have concluded that downregulating autophagy indirectly enhances the efficacy of the LUAD suppressors [15][16][17]; conversely, high-level autophagy was proven to promote tumourigenesis of LUAD in other studies [18][19][20][21][22]. Some results have provided evidence that the upregulation of autophagy is correlated with cisplatin or docetaxel resistance in LUAD [23][24][25]. Wang et al. found that autophagy impacted the low-dose hyper-radiosensitivity of LUAD [26].
Given these contradictory results, we sought to explore the potential value of autophagy in LUAD by integrating the entire set of autophagy-associated genes (AAGs) and the corresponding gene expression with clinical data acquired from The Cancer Genome Atlas (TCGA) portal. First, 30 AAGs that were differentially expressed in tumour and non-tumour tissues were screened and randomly divided into training and testing groups. We then performed Cox regression and Lasso regression analyses within the training group to identify the AAGs associated with remarkable overall survival (OS) in LUAD patients, and the prognostic model was constructed. To validate the accuracy of the model, the Kaplan-Meier (KM) estimator and the receiver operating characteristic (ROC) curve were applied. In addition, we investigated the results of Gene Ontology (GO) functional annotation and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analyses to further elucidate the role of AAGs in LUAD.

Data source and pre-processing
The entire set of 232 AAGs was collected from the human autophagy portal (http://www.autop hagy.lu/ index .html), which is an online database that provides a complete set of human genes related to autophagy as described in the literature. RNA sequencing and clinical data consisting of 497 LUAD and 54 non-tumour tissues was downloaded from the TCGA data portal. The ensemble gene IDs were then converted to gene symbols using the online database GENCODE (https :// www.genco degen es.org/human /relea ses.html), a project for referencing human genome annotation. Finally, the expression data of the AAGs were extracted.

Screening of differentially expressed AAGs in LUAD
The expression data of 232 AAGs comprising 497 LUAD and 54 non-tumour samples were processed using the mean function, and the mean expression values were normalized by log2 transformation. The 30 AAGs that were differentially expressed between the tumour and normal samples were then identified using the Wilcoxon signed-rank test in R (version 3.6.1, https ://www.r-proje ct.org/) with a threshold of |log(fold change) > 1 and a false discovery rate (FDR) < 0.05. Next, we integrated the expression data of the 30 AAGs with the corresponding clinical information. Finally, the data were randomly divided into training and testing groups for subsequent validation.
The expression data of the 30 AAGs within the training group were then analysed using univariate Cox regression analysis to obtain the AAGs that were significantly related to survival (P < 0.05). The least absolute shrinkage and selection operator (Lasso) regression selectively enters variables into the model to obtain improved performance parameters and to control the complexity of the model through a series of parameters to avoid overfitting [27]. Therefore, we employed a Lasso regression analysis to remove highly correlated survival-related AAGs.

Construction of the prognostic model
We performed a multivariate Cox regression analysis using both forward and backward selection to identify the 5 prognostic AAGs and their coefficients, on which we constructed the prognostic model. Every LUAD patient in both training and testing groups received an individual risk score.
The calculation of the risk score based on the AAG model was conducted as follows: Risk score = n i=1 v i × c i (the v i is the expression value of gene i, c i represents the regression coefficient of gene i in the multivariate Cox regression analysis, and n represents the number of independent indicators).

Validating the performance of the prognostic model in training and testing groups
Based on the individual risk scores, all patients were separated into one of two groups (low/high score) by the median risk scores. The Kaplan-Meier (K-M) survival curve was plotted to evaluate the differences in overall survival between the two groups using the log-rank test to assign statistical significance. In addition, we generated receiver operating characteristic (ROC) curves to determine the accuracy of the prognostic model.

Exploration of the expression of AAGs at the protein level
The Human Protein Atlas is an interactive web-based database (https ://www.prote inatl as.org) that contains the RNA and protein expression profiles of more than ninety percent of the putative protein-encoding genes and includes more than 13 million high-resolution images [28]. The immunohistochemical results of the five prognostic AAGs were explored using this database to verify their differential expression in tumour and normal tissues.

Enrichment analysis of AAGs
To explore the potential tumour-related molecular mechanisms of AAGs, GO functional annotation and KEGG pathway enrichment analyses were performed in R using the packages DOSE, Cluster Profiler, ggplot2, GO plot, etc. with both the p-value and q-value set at 0.05. The outcomes were visually illustrated in multidimensional views.

Statistical analysis
All statistical analyses and graphics were performed using the R 3.6.1 (https ://www.r-proje ct.org/) and Perl language packages. Cox regression analyses were utilized to screen the AAGs related to survival. A Lasso regression analysis was used to eliminate highly correlated AAGs and prevent overfitting of the model. The Kaplan-Meier curve was plotted to display the differences in overall survival between the two groups and the log-rank test was performed to determine the significance of the differences. The ROC curve and the corresponding area under the curve (AUC) were used to evaluate the performance of the model. Statistical significance was defined as P < 0.05.

Survival-related AAGs and the prognostic model
We conducted a univariate Cox regression analysis and identified 6 AAGs (GAPDH, ERO1A, NLRC4, ITGB4, ATG9B, and CDKN2A) that were significantly related to LUAD survival in the training group. Of the 6 survival-related AAGs, 4 genes (GAPDH, ERO1A, ITGB4, and CDKN2A) were considered risk factors (all P < 0.05; HRs, 1.0007-1.0175) and that their overexpression may reduce survival; overexpression of the remaining two genes (NLRC4 and ATG9B) (all P < 0.05; HRs, 0.6913 and 0.7382, respectively) may improve the survival of patients. The Lasso regression analysis was then applied to exclude genes that may be highly correlated with other genes (Fig. 3). The 6 survival-related AAGs were then submitted to a multivariate Cox proportional hazards model, resulting in 5 candidate genes (ITGB4, NLRC4, ATG9B, CDKN2A, and ERO1A) that may serve as significant predictors of the prognosis (Table 1). Based on the 5 candidate AAGs, the formula for the risk score of every LUAD patient was constructed: risk score = (expression value of ITGB4 * 0.0063) + (-expression value of NLRC4 * 0.354) + (-expression value of ATG9B * 0.3956) + (expression value of CDKN2A * 0.0202) + (expression value of ERO1A * 0.0122).  (Fig. 4c, d).

Validation of the model performance
In addition, we ranked the all of the LUAD patients by their risk scores to analyse the survival distribution. From the scatterplot, we could identify the survival status of patients with different risk scores; the mortality rate of patients rose with the increase in risk score. The heat maps illustrate the expression of AAGs with the rising risk scores of patients (Fig. 5a-f ).

Differential expression of prognostic AAGs at the protein level
We used immunohistochemistry to compare the expression of the 5 prognostic AAGs (ITGB4, NLRC4, ATG9B, CDKN2A, and ERO1A) in LUAD with their expression in normal tissues (Fig. 6a-d). As expected, the levels of protein expression of the three high-risk genes (ITGB4, CDKN2A, and ERO1A) were demonstrably higher in tumour tissues with more intense antibody staining and a greater proportion of stained cells. In contrast, NLRC4, a protective gene, stained fewer cells with weaker intensity in LUAD. The results were compatible with our findings of AAGs in LUAD; there were no data for another protective gene, ATG9B, in the Human Protein Atlas database.

GO and KEGG analyses of AAGs
To evaluate the molecular mechanisms of AAGs in LUAD, GO functional annotation and KEGG pathway enrichment analyses were conducted (Table 2). GO analysis consists of three categories: biological processes (BP), cellular components (CC) and molecular function (MF). We found that the most significant GO enriched terms involved in autophagy were the intrinsic apoptotic signalling pathway, cellular response to unfolded/topologically incorrect proteins and neuron death (BP); autophagosome and endoplasmic reticulum-Golgi intermediate compartment (CC); and  protein phosphatase binding, chemorepellent activity, receptor activator activity, R-SMAD binding and NAD binding (MF) (Fig. 7a, b). In the KEGG enrichment analysis, the AAGs were primarily correlated with pathways related to autophagy-animal, the ErbB signalling pathway, bladder cancer, the HIF-1 signalling pathway, platinum drug resistance, proteins processed in the endoplasmic reticulum, EGFR tyrosine kinase inhibitor resistance, PD-L1 expression and the PD-1 checkpoint pathway in cancer (Fig. 8a, b).

Discussion
Adenocarcinoma is the most prevalent histological subtype of lung cancer and has been broadly explored for distinct genetic drivers and diverse prognostic factors. However, LUAD patients still experience high mortality due to undetected pathogenesis [29][30][31]. Many researchers believe that the existing guidelines and definitions of lung cancer may result in different clinical decisions for preoperative and postoperative patients. Zhang et al. found that dissection of the 4th lymph nodes was related to a better prognosis, although this was not recommended by the International Association for the Study of Lung Cancer (IASLC) [32]. There are also scholars who believe that the present staging guide is insufficient for predicting individual-level overall survival because many early-stage patients may experience a later relapse [33,34]. Moreover, Valeria et al. suggested abandoning the concept of non-small cell lung cancer because a large body of experimental evidence suggests that LUAD and lung squamous cell carcinoma appear to be distinct tumours at the molecular, pathological and clinical levels [35]. Therefore, academics have placed increasing emphasis on the use of precision medicine in lung cancer [36][37][38]. It is necessary to explore methods to consolidate the current staging system and to improve the prognosis for lung cancer patients. Over the last decade, breakthroughs in microarrays and genome sequencing have promoted the discovery of prognostic biomarkers, which have greatly increased the accurate classification of diseases and improved individual treatment. Many studies have demonstrated that genomic data, particularly multigene signatures, demonstrate superior performance in prognosis analysis compared with the current staging system [39][40][41][42].
To our knowledge, this is the first study to combine the entire set of AAGs with LUAD and explore as well as validate the potential value of AAGs in LUAD. First, we selected 30 differentially expressed AAGs from 497 tumour samples and 54 normal samples. We then randomly divided the data into training and testing groups. Using Lasso regression and Cox survival analyses, we constructed a risk model based on five prognostic AAGs (ITGB4, NLRC4, ATG9B, CDKN2A, and ERO1A). Using the model, every LUAD patient was assigned a risk score. The differences in survival between patients with low and high scores were significant in both the training group and the testing group. The ROC curves and AUCs indicated that models of the training and testing groups both performed well. In addition, we performed immunohistochemistry that further proved the significant roles of AAGs in LUAD. Furthermore, GO and KEGG enrichment analyses of the differentially expressed AAGs were conducted to explore the underlying molecular mechanisms. The results of GO functional annotation revealed that the AAGs were primarily enriched in the intrinsic apoptotic signalling pathway, cellular response to topologically incorrect proteins and the PERK-mediated unfolded protein response, which is consistent with the conclusion of previous studies that autophagy is a physiological process that eliminates misfolded proteins and damaged organelles in response to cellular stress [12,43]. In the KEGG pathway analysis, AAGs were primarily enriched in the ErbB, IL-17 and HIF-1 signalling pathways. EGFR (ErbB1) is not unknown to us; in 2004, there was a major discovery that treatment with the EGFR-TKI (epidermal growth factor receptor-tyrosine kinase inhibitor) gefitinib caused tumour regression in some patients with NSCLC [44], and the third-generation EGFR-TKI axitinib confers greater survival benefits to patients, particularly those with the T790M mutation [45]. In addition to EGFR (ErbB1), the proteins HER2 (ErbB2), HER3 (ErbB3) and HER4 (ErbB4) compose the ErbB family of transmembrane receptor tyrosine kinases (RTKs), which is one of the most broadly explored therapeutic targets in human malignancies [46]. Jutten et al. found that autophagy activity influenced the expression of EGFR and the resistance to EGFR-targeting therapies could be reduced by downregulating autophagy [47,48]; IL-17 (interleukin-17), as a signature proinflammatory cytokine of the CD4+ T helper 17 (Th17) cells [49], was shown to participate in the formation and advancement of various tumours [50] and was widely distributed in the tumour microenvironment, where it has twin roles in tumourigenesis and tumour suppression [51]. Previous studies have indicated that the formation of lung cancer is closely related to local dysbiosis and inflammation mediated by Th17 cells [52,53]. The 2019 Nobel Prize in physiology or medicine was awarded to Professor William G. Kaelin Jr., Sir Peter J. Ratcliffe and Gregg L. Semenza for their contributions to elucidating the mechanisms by which cells sense and adapt to the availability of oxygen. They found that HIF-1 (hypoxia-inducible factor-1) regulates more than 4000 targeted genes, some of which can increase oxygen transport for angiogenesis and erythropoiesis. Another study reported that under emergent oxygen fluctuations, autophagy can be harmful and can lead to cell death [54]. Moreover, Bellot et al. reported that various mechanisms such as autophagy activation enabled tumour cells to adjust to hypoxia [55]. Therefore, the regulation of HIF-1 may represent an important breakthrough in tumour therapy, as angiogenesis and erythropoiesis play crucial roles in the occurrence and development of cancer. It is apparent that autophagy plays many roles in tumourigenesis and development, which is consistent  18:149 with the association between autophagy genes and LUAD in our study. However, some limitations are worth noting. As a retrospective study, this research has an inherent bias; although we validated the model using training/testing sets and immunohistochemistry, additional validation of prognostic designs should be conducted in vitro, in vivo and in clinical trials; moreover, the biological processes and molecular mechanisms of the 5 AAGs should be further evaluated to accelerate their clinical application in LUAD.

Conclusions
In this study, we provided insights into the roles of autophagy genes in LUAD and constructed a promising model, which could provide a reference to determine whether postoperative/preoperative patients are at high Fig. 7 Results of Gene Ontology (GO) functional annotation analysis. a Bar chart of significant terms. The change in colour from blue to red represents the increase in the adjusted P-value, and the length of the bar indicates the number of gene enrichment terms. b Bubble plot of enriched GO terms. The Z-score is plotted on the x-axis and the -log(adjusted p-value) is plotted on the y-axis; green represents a biological process, red represents cellular components and blue represents molecular function. The size of the circles is proportional to the number of genes enriched in the term