Model based on five tumour immune microenvironment-related genes for predicting hepatocellular carcinoma immunotherapy outcomes

Background Although the tumour immune microenvironment is known to significantly influence immunotherapy outcomes, its association with changes in gene expression patterns in hepatocellular carcinoma (HCC) during immunotherapy and its effect on prognosis have not been clarified. Methods A total of 365 HCC samples from The Cancer Genome Atlas liver hepatocellular carcinoma (TCGA-LIHC) dataset were stratified into training datasets and verification datasets. In the training datasets, immune-related genes were analysed through univariate Cox regression analyses and least absolute shrinkage and selection operator (LASSO)-Cox analyses to build a prognostic model. The TCGA-LIHC, GSE14520, and Imvigor210 cohorts were subjected to time-dependent receiver operating characteristic (ROC) and Kaplan–Meier survival curve analyses to verify the reliability of the developed model. Finally, single-sample gene set enrichment analysis (ssGSEA) was used to study the underlying molecular mechanisms. Results Five immune-related genes (LDHA, PPAT, BFSP1, NR0B1, and PFKFB4) were identified and used to establish the prognostic model for patient response to HCC treatment. ROC curve analysis of the TCGA (training and validation sets) and GSE14520 cohorts confirmed the predictive ability of the five-gene-based model (AUC > 0.6). In addition, ROC and Kaplan–Meier analyses indicated that the model could stratify patients into a low-risk and a high-risk group, wherein the high-risk group exhibited worse prognosis and was less sensitive to immunotherapy than the low-risk group. Functional enrichment analysis predicted potential associations of the five genes with several metabolic processes and oncological signatures. Conclusions We established a novel five-gene-based prognostic model based on the tumour immune microenvironment that can predict immunotherapy efficacy in HCC patients.

hepatitis viral infections, alcohol-related liver disease, non-alcoholic fatty liver disease, and drug-induced hepatitis. As HCC is usually diagnosed at advanced stages [2], it can be difficult to treat. Thus, it is important to elucidate the molecular mechanisms underlying HCC progression and develop novel therapeutic targets to improve patient survival outcomes.
The immune microenvironment plays a critical role in tumorigenesis and is correlated with tumour progression and treatment efficacy [3,4]. Systemic immune therapeutics have shown efficacy against HCC, especially for patients without an opportunity to undergo resection or liver transplantation [2,5]. Common immunotherapy strategies include chimeric antigen receptor-engineered T cells (CAR-T cells), cancer vaccines, cytokine therapy, and immune checkpoint inhibitors (ICIs). Currently, ICIs are the most successful class of immune therapeutics, both as monotherapy and combination therapy [6]. For example, the efficacy of anti-programmed cell death 1 (anti-PD-1) and antiprogrammed cell death ligand 1 (anti-PD-L1) in HCC has been investigated. PD-L1 expressed by T cells regulates immune responses at the initiation phase in lymph nodes and at the effector phase in tumour cells [7]. The restoration of function in "exhausted" T cells and the depletion of immunosuppressive regulatory T lymphocytes using monoclonal antibodies targeting these receptors have opened up new avenues for the treatment of several malignancies [8]. However, only approximately 25% of HCC patients with high infiltration of PD-1-expressing T cells respond to ICIs [9], and identification of patients who will respond well to ICIs is challenging.
Conventional strategies using tumour-node-metastasis (TNM) classification to predict HCC prognosis can help guide decisions in HCC clinical therapy [10]. However, their predictive efficacies are less than satisfactory. Notably, the use of genome-sequencing technologies coupled with bioinformatics analyses has improved tumour diagnosis and prognosis capabilities. Gene-based prognostic models have been established to identify differential mRNA expression patterns between cancer and normal tissues. Datasets reporting the expression profiles of long noncoding RNAs (lncRNAs) [11], genes that regulate epigenetic modifications [12], and immune-related genes [13] from The Cancer Genome Atlas (TCGA) and NCBI Gene Expression Omnibus (GEO) databases have been increasingly explored to study disease prognosis. However, there is no mature model that can stably predict patient response to HCC immunotherapy and treatment outcome. Therefore, we established a novel five-genebased model pertaining to the immune microenvironment and conducted bioinformatics analyses to assess the ability of this model to predict HCC immunotherapy outcomes.
In the present study, we conducted univariate Cox regression analyses and least absolute shrinkage and selection operator (LASSO)-Cox regression analyses to build a risk model. Risk score, time-dependent receiver operating characteristic (ROC) AUC, and Kaplan-Meier survival analyses were used to assess the prognostic ability of the model. The results indicate that our model can effectively predict the efficacy of immunotherapy and that the five genes can serve as potential independent biomarkers in clinical applications. Functional enrichment analysis predicted the potential associations of the upregulated and downregulated genes identified in our model with relevant biological mechanisms. Conclusively, we established a five-gene-based model based on the influence of the tumour immune microenvironment that could potentially be applied to predict patient response to HCC immunotherapy.

Data collection
Data from a total of 365 HCC patient samples from the TCGA-liver hepatocellular carcinoma (LIHC) dataset were retrieved and used for the analysis of prognostic gene expression signatures and the development of a prognostic model. Genes were excluded if corresponding patient sample data were lacking. Random sampling with arrangement was performed, wherein the 365 samples (including the training set [n = 219] and validation set [n = 146]) were randomly sampled 100 times with replacement. There was no significant difference in TNM stage, grade, OS, sex, or age between the training set and validation set (p > 0.05). The clinical data and mRNA expression data of the GSE14520 dataset (n = 221) were retrieved from the NCBI GEO database (https ://www. ncbi.nlm.nih.gov/geo/), and an immunotherapy dataset (Imvigor210) was obtained from published study [14]. All patient data that were used in the present study had complete clinical information, including TNM stage, grade, survival time, sex, age, and immune-related gene expression.

Establishment of the five-gene model based on immune-related genes
First, we screened for immune-related genes associated with HCC prognosis. The immune-related genes involved in HCC pathogenesis were collected from previous studies [15][16][17]. In the TCGA-LIHC training set, immune-related gene and survival data were analysed by univariate Cox regression analysis using the R package "coxph". The criterion of p < 0.001 was selected as the filtering threshold. We screened the immune-related genes associated with HCC prognosis. Second, a LASSO-Cox regression analysis was used to further filter the prognostic genes [18]. This method enables the simultaneous selection of variables and parameter estimation and can better solve the multicollinearity problem in regression analyses [18].

Assessment of the five-gene-based model by tissue microarray (TMA) analysis
To assess the prognostic ability of the five-gene-based model, we constructed a TMA comprised of 90 carcinoma tissues from HCC patients (Shanghai Outdo Biotech Co. Ltd., Shanghai, China) according to a reference method that was described previously [19]. Subsequently, immunohistochemistry (IHC) and integrated optical density (IOD) analyses were performed as described previously [20]. The primary antibodies used are shown in Additional file 1: Table S1. The IHC scores were obtained from independent assessments by three senior pathologists without any prior knowledge of patient characteristics. The IHC score of each patient was calculated using the following formula: IHC score = A Kaplan-Meier log-rank analysis was used to evaluate the difference in survival rates between the groups with a high IHC score and low IHC score.

Validation of the performance and prognostic ability of the five-gene-based model
Time-dependent ROC analyses and Kaplan-Meier logrank tests were used to evaluate the performance and prognostic ability of the model using verification datasets from the TCGA-LIHC dataset, the entire TCGA-LIHC dataset, the GSE14520 cohort, and the immunotherapy dataset (Imvigor210).

Functional enrichment analyses
To explore the underlying molecular mechanisms of the five-gene-based model, single-sample gene set enrichment analysis (ssGSEA) of the gene expression profiles was used to identify the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways predicted to be correlated with the risk score [21,22]. Clustering analyses were performed by gene set variation analysis (GSVA). A p < 0.05 and false discovery rate (FDR) < 0.05 were considered statistically significant.

Statistical analysis
Statistical analyses were performed using SPSS v25 (IBM, Chicago, IL, USA), GraphPad Prism 7.0 (GraphPad Software, La Jolla, CA, USA), and R software (version 3.5.1). Student's t-test was used for statistical comparisons, the Kaplan-Meier method was used to estimate OS, and p < 0.05 was considered statistically significant.

Identification of HCC prognostic gene expression signatures to construct the HCC prognostic model
A flowchart of the analysis workflow is illustrated in Fig. 1a. Using the TCGA-LIHC training set, univariate Cox regression analysis of the screening results, including 4,227 immune-related genes, led to the identification of 245 immune-related genesas potential prognostic indicators of HCC OS.

Construction of the five-gene-based HCC prognostic model
After primary filtering, a LASSO-penalized Cox analysis was performed to further narrow down the mRNA expression profiles (Fig. 1b and c). Five genes were identified: lactate dehydrogenase A (LDHA), phosphoribosyl pyrophosphate amidotransferase (PPAT), beaded filament structural protein 1 (BFSP1), nuclear receptor subfamily 0 group B member 1 (NR0B1), and 6-phosphofructo-2-kinase/fructose-2,6-bisphosphatase 4 (PFKFB4). A stepwise multivariate Cox regression analysis was then performed to establish the prognostic model. The risk score was calculated by summing the weighted gene expression level of each of the five genes multiplied by their respective LASSO coefficients: risk score All five genes showed positive LASSO coefficients in the multivariate Cox regression analysis. Next, the fivegene-based model was further evaluated for stability and reliability.

Evaluation of the predictive efficacy of the five-gene-based model using the TCGA-LIHC training set
To determine the association between the gene expression signatures of these five genes and HCC patient survival outcome, risk scores (AUC values) were calculated with the five-gene-based model for each sample separately, and the optimal cut-off for the risk score was defined using the TCGA-LIHC training set (Fig. 2a).
Higher AUC values indicated better classification performance of the five-gene-based HCC prognostic model. AUC values of 0.80, 0.77 and 0.73 were obtained for the 1-year, 3-year and 5-year survival rates, respectively. Kaplan-Meier survival analysis revealed that patients in the high-risk group showed worse prognosis than patients in the low-risk group (p < 0.0001; Fig. 2a). Taken together, these results indicate good performance of the established five-gene-based model for predicting HCC survival outcomes.
Validation of the five-gene-based HCC prognostic model using the GSE14520 dataset, TCGA-LIHC testing set, and whole TCGA-LIHC dataset To validate the stability of the five-gene-based model, a similar workflow was employed for the training set, wherein three datasets (the GSE14520 dataset, TCGA-LIHC testing set, and whole TCGA-LIHC dataset) were analysed. Risk scores for the five-gene-based model were obtained by calculating the AUC values of the respective ROC curves. AUC values of 0.75, 0.73 and 0.67 were obtained for 1-year, 3-year and 5-year survival rates, respectively, for the GSE14520 validation dataset. The results indicated that patients in the high-risk group showed significantly worse survival rates than patients in the low-risk group (p < 0.001; Fig. 2b). Consistent with the results of the GSE14520 dataset, patients in the highrisk group showed poorer OS than patients in the lowrisk group (all p < 0.05) and the AUC values were above 0.6 for the whole TCGA-LIHC validation dataset and the TCGA-LIHC dataset (Additional file 2: Figure S1a and b). The results collectively showed that the five-gene-based model could predict patient survival duration based on gene expression levels.  an HCC TMA using IHC. Notably, we found that the protein levels of the five genes were significantly different in HCC tissues (Fig. 3a) The group with high IHC scores was associated with higher protein expression levels and poorer disease prognosis than the group with low IHC scores (Fig. 3b). Variables that showed statistical significance in the univariate analysis were included in the Cox multivariate survival regression analysis. The results showed that the fivegene-based model score, grade and TNM stage were statistically significant (p < 0.05; Table 1). Thus, the results suggest that high protein expression of the five genes predicts poor disease prognosis.

Assessment of the ability of the five-gene-based model to predict immunotherapy efficacy
Biomarkers that can effectively predict the efficacy of immunotherapy drugs are currently lacking. Thus, identifying new predictive markers is necessary to further improve precision immunotherapy. A transcriptome dataset (Imvigor210) of the treatment response data of patients who underwent anti-PD-L1 immunotherapy was retrieved to assess the ability of the five-gene-based model to predict immunotherapy efficacy. Kaplan-Meier analysis showed that a high-risk score was associated with a poorer survival rate than a low risk score (Fig. 4a). ROC curve analysis showed that the combined consideration of neoantigen (NEO) burden, tumour mutational burden (TMB) and risk score output a higher AUC value (AUC = 0.91) than NEO burden (AUC = 0.7), TMB (AUC = 0.65), or risk score (AUC = 0.54) alone (Fig. 4b).
The risk score was not correlated with the immunotherapy efficacy subgroup (p > 0.05; Fig. 4c) but was correlated with the immune cell and tumour cell subgroups (p < 0.05; Fig. 4d and e). These results suggest that considering the five-gene-based model score together with NEO burden and TMB could enhance the assessment of immunotherapy efficacy and identify patients who will respond to immunotherapy.

The five-gene-based model can predict different clinical characteristics
After confirming the efficacy of the five-gene-based model in predicting patient response to immunotherapy, whether the five-gene-based model could be applied to determine the survival outcomes in subgroups with different clinical characteristics was investigated. The results indicated that the five-gene-based model could be used to predict different clinical characteristics (p < 0.05; Fig. 5).

Risk model and prognostic analyses of different gene mutation subtypes
To and ARID1A showed the highest mutation frequencies (28%, 24%, 8% and 7%, respectively) and were therefore used for the Kaplan-Meier analysis (Additional file 3: Figure S2). There was a significant difference in OS between patients in the high-risk and low-risk groups (Additional file 3: Figure S2a with TP53 mutation; Figure S2b without TP53 mutation; Figure S2c with CTNNB1 mutation; and Figure S2d without CTNNB1 mutation [p < 0.05; Additional file 4: Figure S3]). Patients with AXIN1 or ARID1A mutations showed no significant difference in OS between the highrisk and low-risk groups, but a significant difference was observed for patients without AXIN1 or ARID1A mutations between the high-risk and low-risk groups.
The results suggest that the five-gene-based model can also be used to predict the survival outcomes of HCC patients with genetic mutations.   Fig. 6a). Multivariate Cox regression analysis showed that T stage and the prognostic model were independent risk factors associated with OS.

Comparison of the five-gene-based model and other models
Next, the ability of our established model to determine HCC prognosis was compared with those of three other prognostic models: the four-gene signature [23], the HCC prognostic evaluation model [24] and the six-gene signature [25]. We calculated the risk score of the corresponding genes in these three models for the TCGA-LIHC dataset using a method similar to that used for the establishment of our five-gene-based model (described above).
The four-gene-based model had lower AUC values for 1-, 3-and 5-year survival rates than our model; the HCC prognostic evaluation model and the six-gene-based model had slightly higher AUC values for the 1-year survival rate than our model but had lower AUC values for 3-and 5-year survival rates ( Fig. 6b-d). These results indicated that our model performed better at predicting the long-term survival (3-year and 5-year survival) than the short-term survival (1-year) of HCC patients. Similar to our model, these three models could also predict the OS of the high-and low-risk groups (log-rank p < 0.001; Fig. 6e-g).

Relationship between risk score and KEGG pathways
We performed ssGSEA to identify potential KEGG pathways (with a correlation coefficient > 0.45) associated with the risk score. A total of 21 KEGG pathways were identified (Fig. 7a). Analysis of the relationship between gene sets and the risk score revealed that the KEGG pathways positively correlated with the risk score included DNA replication, mismatch repair, cell cycle, homologous recombination, spliceosome, oocyte meiosis, progesterone-mediated oocyte maturation, and pathogenic Escherichia coli infection. There were also downregulated pathway   19:26 terms that negatively correlated with the risk score in HCC, including butanoate metabolism, peroxisome, fatty acid, linoleic acid, tryptophan and tyrosine metabolism (Fig. 7b). The enrichment analysis revealed potential critical pathways implicated in HCC pathogenesis.

Discussion
Immune subsets were found to be significant effectors of immune defence [26][27][28]. The use of gene expression signatures to determine the outcomes of treatments has been reported by several groups. We showed that the correlation between gene signatures and the tumour immune microenvironment can be used to predict HCC immunotherapy outcome. Similarly, He et al. [13] established a model based on the expression of immunerelated genes to predict treatment outcome, Shen et al. [29] identified a clinical-immune signature to estimate OS in ovarian cancer, and Tekpli et al. [30] discovered a novel independent prognostic factor based on the tumour immune microenvironment in breast cancer. Recently, integrative analysis of multi-omics data revealed epidermal growth factor receptor (EGFR) as a critical node in the gene regulatory network that is related to immune phenotype, and the inclusion of therapeutic EGFR inhibition enhanced head and neck squamous cell carcinoma patient response to ICIs [31]. This finding suggests that critical genes associated with the tumour immune microenvironment may serve as prognostic signatures and be useful for clinical immunotherapy. HCC is a highly heterogeneous tumour at the molecular level and is pathological [32]. The most frequent mutations were identified in telomerase reverse transcriptase (TERT), ARID1A, TP53, and CTNNB1 [33][34][35][36][37]. These gene mutations were also associated with cell differentiation, proliferation, and clinical features [32,38]. In our study, the five identified genes had a low frequency of mutations, and they were conducive to constructing a stable prognostic model. In addition, the five-gene-based model established in the present study could predict HCC treatment outcomes for patients with or without Fig. 4 The five-gene-based model can predict immunotherapy efficacy for the Imvigor210 dataset. a Kaplan-Meier analysis of the five-gene-based model. b ROC curves of TMB, NEO burden, risk score, and the combination (NEO burden, TMB and risk score). The risk score was grouped by c immunotherapy efficacy, d immune cell subgroups, and e. tumour cell subgroups. ROC receiver operating characteristic, NEO neoantigens, TMB tumour mutational burden, CR complete response, PR partial response, SD stable disease, PD progressive disease, IC immune cell, TC tumour cell tumour-specific mutations in the TP53, CTNNB1, AXIN1 or ARID1A genes. Thus, the five-gene-based prognostic model was a useful classification tool for patients with various genetic mutation backgrounds.
TMB is an emerging biomarker of sensitivity to immunotherapy [39,40]. Cai et al. [41] showed that high TMB in liver cancer patients with radical resection was significantly correlated with poor prognosis. Stenzinger et al. [42] reported that high TMB correlated with increased  patient response rates and survival benefits from immune checkpoint inhibitors. Tumour-specific NEO that form as a consequence of mutations are thought to be another important biomarker for the therapeutic efficacy of cancer immunotherapies [43,44]. In our study, ROC curve analysis showed that the combined consideration of NEO burden, TMB and risk score output a higher AUC value than NEO burden, TMB, or risk score alone. Combining the five-gene-based model with NEO burden and TMB could enhance the assessment of immunotherapy efficacy and identify patients who will respond to immunotherapy. In the present study, we identified five genes (LDHA, PPAT, BFSP1, NR0B1, and PFKFB4) associated with the tumour immune microenvironment in HCC patient  Relationships between the risk score and KEGG pathways. a. Twenty-one KEGG pathways were found to be correlated with the risk score by ssGSEA. b. The ssGSEA score of the KEGG pathway changes as the risk score increases. Samples (rows) are ranked from low to high risk scores. The colour scale indicates upregulation (red) or downregulation (blue) of gene expression. ssGSEA single-sample gene set enrichment analysis, KEGG Kyoto Encyclopedia of Genes and Genomes