Differential expression analysis of EMT-related genes and lncRNA
We downloaded a total of 588 CRC samples and 48 normal samples from the TCGA database. Combined with clinical data, after removing incomplete data, we finally got 429 CRC samples for subsequent analysis. We have obtained 56 EMT-RDGs including 40 genes with high expression and 16 genes with low expression and analyzed the association between them. Most of the genes are related to each other (Additional file 1: Figure S1).
Gene enrichment and function analysis
GO analysis showed that EMT-RDGs focus on epithelial morphogenesis, tissue morphogenesis, negative regulation of cell proliferation, and other processes in the biological process; (Fig. 1A) the cellular components focused on the base cortex, SMAD protein complex, beta-catenin-TCF-complex, and so on; (Fig. 1B) the molecular functions focus on I-SMAD binding, chemoattractant activity, 1-phosphatidylinositol-3-kinase activity, and other functions (Fig. 1C). The KEGG signaling pathway is indeed mainly enriched in TGF-β, Hippo, Wnt signaling pathways, and other mechanisms (Fig. 1D). The special function analysis of prognosis-related EMT-RDGs verified that these key EMT-RDGs were indeed EMT-related genes (Fig. 1E).
Cluster analysis of EMT-RDGs
We divided the CRC samples into two clusters according to the principle of unsupervised cluster classification. The samples with BMI ≥ 23.5 [BMI = weight (Kg)/height2(m2)] were more in cluster2 and the remaining samples with BMI < 23.5 are more in cluster1. Other clinical characteristics, overall survival, and 22 immune cells were not significant differences in the two clusters (Additional file 1: Figure S2).
Prognostic model construction based on EMT-RDGs
Nine prognostic-related EMT-RDGs were screened by univariate cox and then Lasso regression analysis in the TCGA data, and the score of each gene was calculated. The risk prognosis model of CRC was constructed by the expression level of each gene*risk score. The effect of the model was verified by substituting the corresponding value of GEO data into the following formula: (Riskscore = TCF15*0.006387445 + SIX2*0.000957825 + NOG*0.016976643 + FGF8*0.047052635 + TBX5*0.00178245 + SNAI1*0.000456714 + PHLDB2*1.08E-05 + TIAM1*6.55E-05 + TWIST1*6.70E-05). The model was verified by substituting the corresponding value of GEO data into the above formula. The overall survival (OS) was longer in low-risk group of TCGA training set (Fig. 2A and C), GSE40967 (HR = 0.54857, 95% CI 0.41328–0.72814) (Fig. 3B), GSE12954 set (HR = 0.576808, 95% CI 0.184833–1.800043) (Additional file 1: Figure S3A, B), GSE17536 set (HR = 0.587008, 95% CI 0.370944–0.928924) (Additional file 1: Figure S3E, F) and GSE17537 set (HR = 0.032210, 95% CI 0.013055–0.079467) (Additional file 1: Figure S3I, J). The disease-free survival (DFS) was also longer in the low-risk group of GSE12954 set (p > 0.05) (Additional file 1: Figure S3C), GSE17536 set (p < 0.05) (Additional file 1: Figure S3G) and GSE17537 set (p < 0.05) (Additional file 1: Figure S3K). The area under the ROC curve (AUC) was 0.66 in TCGA (Fig. 2B), 0.657 in GSE40967 (Fig. 3C), 0.639 in GSE17536 (Additional file 1: Figure S3H), and 0.854 in GSE17537 (Additional file 1: Figure S3L), respectively, indicating that this model had good accuracy in predicting the prognosis of CRC patients. However, the validation of the GSE12954 set (Additional file 1: Figure S3D) showed a meaningless result. The model predicted the 3-year survival rate more accurately (Figs. 2H, 3G), but the accuracy of the 5-year survival rate was average (Figs. 2I, 3H). In the TCGA data, pathological staging, TNM staging, follow-up treatment success, BMI, history of colon polyps, dMMR, permanent invasion present, primary therapy outcome success, synchronous colon cancer present, and venous invasion are all significantly different in high and low-risk groups (Fig. 2D). More than 65 years of age, history of colon polyps, KRAS gene mutation, new tumors after initial treatment were found to be risk factors for the prognosis of CRC, and dMMR is a protected factor by the univariate and multivariate cox regression. Advanced clinical and pathological stage, residual tumor, and high-risk score are only risk factors for the prognosis of CRC, and non-invasive lymph nodes and successful primary treatment are protective factors in univariate regression (Fig. 2E). Distal bowel segment, BMI ≥ 23.5, and postoperative_rx_tx found only in multivariate regression were risk factors (Fig. 2F).
In the GEO data, > 65-year-old, M stage, and RFS events are also risk factors that are regressed by univariate and multivariate cox. Male, pathological stage, clinical TN stage, KRAS gene mutation, and risk score were found to be risk factors in univariate regression (Fig. 3D). In multivariate cox regression, MMR is a protective factor (Fig. 3E).
In addition, the Nomogram model constructed based on multivariate cox regression is a tool used to predict the prognostic risk of CRC (Figs. 2G, 3F).
Construction of a prognostic model based on EMT-RlncRNAs
32 EMT-RlncRNAs were analyzed by WGCNA in the TCGA dataset (Additional file 1: Figure S4A). We also constructed another prognostic risk model with EMT-RlncRNAs (Riskscore = AC068418.2*0.000119527496121585 + AC006273.1*0.0076287254340738 + LINC02437*0.00163200945364475 + LINP1*0.0120094796161845 + GPC5.IT1*0.000613250610676045). The AUC of ROC was 0.585 (Additional file 1: Figure S4B). The survival trend was longer in the low-risk group (Additional file 1: Figure S4C). The perineural and lymphatic invasion was strongly correlated to the prognostic risk of CRC (Additional file 1: Figure S4D). More than 65-year-old, advanced TNM stage, history of colon polyps, KRAS mutation, lymphatic invasion, new tumor after initial treatment, and residual tumor were risk factors; and pMMR and synchronous CRC present were protective factors in univariate regression (Additional file 1: Figure S4E). In multivariate cox regression, over 65-year-old, history of colon polyps, and KRAS mutation were also risk factors; anatomic neoplasm subdivision, BMI < 23.5, pMMR, and new tumor after initial treatment were protective factors (Additional file 1: Figure S4F). The nomogram model was constructed by multivariate cox regression (Additional file 1: Figure S4G).
Nine prognosis-related EMT-RDGs and prognosis of CRC
In TCGA, only the highly expressed FGF8 corresponds to a longer survival time (Additional file 1: Figure S5A). In the GSE40967 set, the low expression of NOG, SIX2, and SNAI1 correspond to a longer survival time (Additional file 1: Figure S5B). In the GEPIA database, low expression of NOG, PHLDB2, SIX2, SNAI1, and TCF15 has a better prognosis in CRC (Additional file 1: Figure S5C). The verification in the GCSC database found that the high expression of TBX5, TCF15, NOG, SIX2, and SNAI1 in CRC patients has a higher survival risk (Additional file 1: Figure S5D).
The relationship between prognostic-related EMT-RDGs and immunity
Based on the CIBERSORT algorithm in TCGA data, monocytes had higher infiltration levels in the high-risk group, and Macrophage M0, Macrophage M1, and activated DCs had higher infiltration levels in the low-risk group (Fig. 3J). In GSE40967 data, T cell gamma delta, Monocyte, Macrophage M2, activated dendritic cell and mast cell resting were higher in the high-risk group; while B cell naive, B cell plasma, T cell CD4 + naive, and T cell CD4 + Memory resting, T cell CD4 + memory activated, NK cell resting, Macrophage M1 and Mast cell activated have higher levels in the low-risk group (Fig. 3I). The score of DCs, macrophages, Tfh cells, Th1 cells, Th2 cells, and Tregs was significantly higher in the high-risk group, and the score of mast cells and NK cells was higher in the low-risk group from the TCGA set (Fig. 4A). In the GSE40967 set, the score of B cells, DCs, macrophages, neutrophils, Th cells, and TIL was higher in the high-risk group. However, the score of mast cells was higher in the high-risk group (Fig. 4B). The score of immune functions including co-stimulation or co-suppression of antigen-presenting cells (APC), chemotaxis of CCR, immune checkpoints, HLA, parainflammation, MHC class I, and T cell activation co-stimulation or co-suppression was higher in the high-risk group. The score of Type I INF response was higher in the low-risk group in the TCGA set (Fig. 4C). In the GSE40967 set, the score of APC co-stimulation, CCR, checkpoints, HKA, T cell co-suppression, and Type II INF response was higher in the high-risk group (Fig. 4D).
In TIMER, we found that nine genes had different effects on immune cell infiltration, and thus had different effects on immunotherapy. Highly expressed FGF8 was positively correlated with the infiltration level of CD4 + T cells, CD8 + T cells, activated NK cells, macrophages M1, cancer-associated fibroblasts, and B cells; and was negatively correlated with the infiltration level of hematopoietic stem cells, neutrophil, resting NK cells and macrophages M0. The infiltration level of immune cells (p < 0.05) was usually more abundant in mutated-type FGF8 CRC than wild-type FGF8. The statistically significant results of immune cells were lower in arm-level deletion CNA of CRC compared to normal CNA (Additional file 1: Figure S6-aA–C). The effect of gene expression combined with the level of immune cell infiltration on the prognosis of CRC had been focused on. The survival time was longer in low FGF8 expression + low cancer associated fibroblast, high FGF8 expression + low common lymphoid progeniters, high FGF8 expression + low eosinophils, high FGF8 expression + high macrophage M0, high FGF8/NOG expression + low macrophage M2, low FGF8 expression + low cancer associated fibroblast, low FGF8 expression + high Tfh cells, low FGF8 expression + high γδT cells and high FGF8 expression + high T memory resting cells (Additional file 1: Figure S6-aD); in high NOG expression + low eosinophils, high NOG expression + low cancer associated fibroblast, low NOG expression + low B cell memory, and high NOG expression + high CD8 + T cells (Additional file 1: Figure S6-b); in high PHLDB2 expression + low B cell naive, low PHLDB2 expression + high endothelial cells, low PHLDB2 expression + low resting mast cells, high PHLDB2 expression + high neutrophils and high PHLDB2 expression + high CD4 + T cells (Additional file 1: Figure S6-c); in high SIX2 expression + high B cell naive, low SIX2 expression + high activated NK cells, low SIX2 expression + low T CD4 + naive cells, high SIX2 expression + high CD4 + T cells and low SIX2 expression + low Tfh cells (Additional file 1: Figure S6-d); in high SNAI1 expression + low B cells, high SIX2 expression + low eosinophil cells, low SNAI1 expression + high macrophage M1 cells, high SNAI1 expression + low Tfh cells and low SIX2 expression + high CD8 + T cells (Additional file 1: Figure S6-e); in high TBX5 expression + high B plasma cells, low TBX5 expression + high macrophage M0 cells, high TBX5 expression + low macrophage M2 cells, high TBX5 expression + low activated mast cells, high TBX5 expression + low resting DC cells, low TBX5 expression + low NK cells and high TBX5 expression + high T cell CD4 + memory activated (Additional file 1: Figure S6-f); in high TCF15 expression + high B plasma cells, low TCF15 expression + low activated DC cells, high TCF15 expression + low resting DC cells and high TCF15 expression + low neutrophils (Additional file 1: Figure S6-g); in high TIAM1 expression + low B cells, low TIAM1 expression + high macrophage M1 cells, low TIAM1 expression + low resting mast cells and low TIAM1 expression + high T cell CD4 + memory activated (Additional file 1: Figure S6-h); in high TWIST1 expression + low B cells, low TWIST1 expression + low B cell memory and low TWIST1 expression + low eosinophil (Additional file 1: Figure S6-i).
Single-cell analysis verification
After discovering that nine EMT-RDGs have a significant correlation with stromal cells and immune cells in the CRC microenvironment, we further explored the heterogeneity and function of these cells in CRC to verify whether they are related to EMT. In the single-cell sequencing data set (GSE81861), PHLDB2 was positively correlated to cancer-associated fibroblasts (CAFs) about the EMT state of CRC (Additional file 1: Figure S7).
The correlation between nine prognosis-related EMT-RDGs and methylation
Next, we tested the methylation status and mutation level of nine prognostic-related EMT-RDGs on the prognosis of CRC, as well as their correlation with drug response and resistance. The expression of TBX5, TIAM1, SIX2, TWIST1, SNAI1, and TCF15 was negatively correlated to methylation (Fig. 5A). The expression of FGF8 and SNAI1 was significantly positively correlated with methylation level in the GCSC database; while the expression of NOG, PHLDB2, TBX5, TIAM1, and TWIST1 was negatively correlated with methylation level from the cBioPortal database (Fig. 5D). In DNMIVD analysis, the expression of FGF8, PHLDB2, and TBX5 was verified that was significantly positively correlated with methylation. The remaining prognosis-related EMT-RDGs were negatively correlated with methylation (Fig. 5E).
In the GCSC database, the methylation levels of FGF8, NOG, TCF15, TWIST1, TBX5, SIX2, and TIAM1 are higher in colon cancer (Fig. 5B). The hypermethylation of TIAM1 has a higher survival risk for colon cancer but lacking data in rectal cancer. The remaining genes did not have effective information (Fig. 5C). In the DNMIVD database, the methylation level of FGF8, NOG, PHLDB2, TCF15, TWIST1, TBX5, and TIAM1 was higher in CRC compared to normal tissues. On the contrary, the methylation level of SIX2 and SNAI1 was higher in CRC (Fig. 5F).
The top five CpG methylation sites associated with nine prognosis-related EMT-RDGs that were identified were cg04917226, cg06243400, cg05769349, cg03843000, and cg09799658 (Fig. 5G). All the CpG methylation sites were shown in Additional file 1: Table S1. The Cox proportional hazards regression model (Fig. 5H) was constructed based on CpG methylation sites associated with Nine prognosis-related EMT-RDGs and indicated longer OS in the low-risk group (Fig. 5I).
The relationship between the expression and mutation of nine prognosis-related EMT-RDGs
We analyzed the mutational panorama of the CRC gene and nine prognostic-related EMT-RDGs. The top five genes with mutation rates in CRC were APC, TP53, TTN, KRAS, and SYNE1 (Fig. 6A). The mutation rate of nine prognosis-related EMT-RDGs is less than 10% in all samples from TCGA data (Fig. 6B). Nine prognostic EMT-RGDs were more likely to be mutated in adenocarcinoma of CRC from cBioPortal database. The frequency of copy number variation was relatively higher in FGF8, TWIST1, SNAI1, and TCF15 (Fig. 6C). The most frequent occurrence in single nucleotide variation (SNV) is C > T (Fig. 6D, E). The RTK-RAS and Wnt pathway were easily affected by gene mutation of CRC (Fig. 6F). In the GCSC database, The SNV frequency of TIAM1 altered in 60 samples was 50%; The SNV frequency of PHLDB2 was 27%; The SNV frequency of TBX5 was 23%; The SNV frequency of FGF8 was 10% (Additional file 1: Figure S8A). The mutation frequency of TIAM1 in colon cancer was 29%. The mutation frequency of PHLDB2 in CC was 18%. The mutation frequency of TBX5 in CC was 17%. The mutation frequency of other prognostic-related EMT-RDGs in CC and all prognostic-related EMT-RDGs in rectal cancer was less than 10% (Additional file 1: Figure S8B). In the cBioPortal database, the correlation between mutated count and fraction genome altered of nine prognostic EMT-RDGs and the comparison of expression Z-score of nine prognostic-related EMT-RDGs in the mutated and wild group were respectively shown in Additional file 1: Figure S9A, B. Among them, mutated types of were FGF8, NOG, SIX2, SNAI1, TIAM1, and TBX5 were enriched with missense. The mutated sites of these genes were shown in Additional file 1: Figure S9C. The comprehensive comparison of mutated counts and disease-free survival (DFS) of nine prognostic-related EMT-RDGs was analyzed in Additional file 1: Figure S9D.
The CNV of FGF8 in CRC was positively correlated to mRNA RSEM of FGF8. CNV of TIAM1 only in rectal cancer was significantly correlated to mRNA RSEM of TIAM1 (Fig. 7A). Heterozygous amplification of eight genes is present in CRC except for FGF8. Except for the heterozygous deletion of TWIST1 and SNAI1 in CRC, and the lack of SIX2 in colon cancer, all other genes have heterozygous deletions (Fig. 7B, C). The CNV status of nine prognostic-related EMT-RDGs was shown in Additional file 1: Figure S10.
The relationship of prognosis-related EMT-RDGs and CRC metastasis
Twenty nine CRC with liver metastasis samples, 53 CRC without liver metastasis samples, 28 normal CRC samples, and 13 normal liver samples were obtained from the GSE6988 dataset. SNAI1, TCF15, TIAM1, and TWIST1 were found in this dataset. TCF15, TIAM1, and TWIST1 were significantly different in the four types of tissues (Fig. 8A). Although the expression of SNAI1, TCF15, TIAM1, and TWIST1 did not have a significantly statistical difference in the CRC with or without liver metastasis, the level of four EMT-RDGs was a higher trend in CRC with liver metastasis.
In GSE28814 (GPL13425) set, we got 92 non-metastasis CRC samples and 33 metastasis CRC samples. The expression of FGF8, PHLDB2, SIX2, and SNAIL was higher and the expression of NOG and TWIST1 was lower in the non-metastasis CRC group (Fig. 8B).
The relationship of nine prognosis-related EMT-RDGs and CRC therapy
In the GSE36864 set, the expression of SIX2 was highest in CRC patients treated with capecitabine, followed in the capecitabine + irinotecan group, and finally in XELOX (capecitabine + oxaliplatin) + bevacizumab group. The remaining prognostic-related EMT-RDGs did not differ significantly among the three treatment groups. Moreover, the trend of FGF8, NOG, PHLDB2, and TIAM1 was consistent with the expression of SIX2 in three treatment groups (Additional file 1: Figures S10-S11).
Computational analysis of resistance with nine prognosis-related EMT-RDGs in CRC showed a correlation with drug resistance and reactivity in the CARE database. The expressions and mutations of nine prognosis-related EMT-RDGs are mainly related to the reactivity and resistance of PI3K signaling pathway inhibitors and RAS/RAF/MEK/MAP signaling pathway inhibitors. Among them, PHLDB2 mutation is related to ZSTK474 resistance. The TBX5 mutation is related to the sensitivity of two BRAF_V600E mutation inhibitors: PLX4720 and 878739-06-1 (Fig. 9A).
In the CTRP database, FGF8 was related to tozarsetib resistance; TCF15 was related to BRD-K75293299 resistance; SIX2 were related to response sensitivity of trametinib; TWIST1 was positively correlated with treatment response of COL-3, skepinone-L, SR8278, and valdecoxib treatment sensitivity are related; vorinostat, SCH79797, Panobinostat, KX2-391, GSK-J4, entinostat, dinaciclib, CHM-1, brefeldin-A, belinostat, apicidin, and alvocidib. PHLDB2 is related to the non-response of JW-55 and dasatinib and the sensitivity of multiple drugs (Fig. 9B). In the GDSC database, NOG was negatively correlated with therapeutic sensitivity of Sunitinib, Salubrinal, and XMD8-85; SNAI1 was negatively correlated with TGX221 sensitivity; TWIST1 is negatively correlated with Docetaxel, AG-014699, and was positively correlated with AT-7519. TBX5 was negatively correlated with the sensitivity of HG-5-88-01; SIX2 was positively correlated with the sensitivity of Z-LLNle-CHO and Dasatinib; TIAM1 was positively correlated with the sensitivity of PD-0325901, Dasatinib, Sunitinib, and 17-AAG. PHLDB2 was positively correlated with the sensitivity of 5-Fluorouracil and negatively with the sensitivity of Gefitinib, Afatinib, Cetuximab, piperlongumine, Bleomycin (50 uM), and Docetaxel (Fig. 9C). All the information about drugs was shown in Additional file 1: Table S2.
Verification of the expression level of EMT-RDGs in oncomine, GEPIA, and HPA database
In the TCGA database, the expression of TIAM1, PHLDB2, NOG, and TCF15 is low; the expression of SNAI1, FGF8, TWIST1, SIX2, and TBX5 is high in CRC. We verified the expression levels of nine prognostic-related EMT-RDGs in the GEPIA database. The high expression of FGF8, SIX2, SNAI1, and TWIST1 in CRC, and the low expression of NOG, PHLDB2, TCF15, and TIAM1 are consistent with our results. However, there is no significant difference in the expression of TBX5 (Fig. 10A). In the Oncomine database, the results of PHLDB2, SIX2, SNAI1, TCF15, TIAM1, TWIST1, and FGF8 are consistent with ours. However, the expression of NOG and TBX5 was contrary to our findings (Fig. 10B). In the HPA database, the results of the six genes exist but the information of FGF8, SIX2, and TWIST1 did not exist (Fig. 10C). In the DONIVD database, the results were consistent with the training set (Additional file 1: Figure S12).