Integrating TCGA breast cancer RNA-seq datasets with DNA methylation datasets according to the flowchart (Additional file 3: Figure S1) 306 genes were identified that form an overlapping cluster (up-regulated expressed genes overlap with hypomethylated genes and down-regulated expressed genes overlap with hypermethylated genes between tumour and normal tissues, respectively). Of these 306 genes, 95 genes had a significant correlation between the mRNA expression and DNA methylation values. LASSO Cox regression analysis build the prediction model with a 13-gene epigenetic signature as the best predictor for overall survival of breast cancer patients. ssGSEA was applied to identify the association between epigenetic signature and cancer-related hallmarks (e.g. MTORC1 signaling, G2M checkpoint). Using ssGSEA, WGCNA and downstream GO, KEGG analysis indicated that cell division, and cell cycle and related terms were closely linked to the signature. The nomogram which included the 13-gene epigenetic model and other clinicopathological factors exhibited high accuracy.
Identification of differently expressed genes and differently methylated genes between tumour and normal tissues
The volcano plot (Fig. 1a) shows 3757 genes with a Ld2-FR of > 1, identified by the comparison of 1104 tumour samples and 114 normal samples. Analysing changes in the DNA methylation status, 225 were found to be hypomethylated and 446 genes hypermethylated in tumour tissues compared to normal tissue (Fig. 1b). Of those 671 genes with altered methylation status, 306 were also present among the 3757 genes with altered expression status. Of those co-regulated genes, 95 had DNA hypermethylation associated with a reduced mRNA expression level. The expression profile of the 95 genes showing negative correlation between methylation status and mRNA expression is shown in Fig. 1c together with the genomic characteristics and associated clinicopathological features.
LASSO Cox regression identifying a 13-gene epigenetic signature
95 genes from above analysis constructed a gene-expression profile, and LASSO Cox model was applied to build the prognostic signature on the gene-expression profile. Cross-validation was carried out in 5 rounds to prevent overfitting (internal training sets and internal validation sets constructed randomly) (Fig. 1d).
The most powerful features (ITPRIPL1, SIAH2, KCNH8, KRT19, NDRG2, STAC2, TPD52, EZR, PCDHGA12, HIF3A, PCDHGA3, C2orf40, CCND2) were identified by the regularisation process of LASSO COX regression (Fig. 1e).
The ROC plots for identifying the tumour and normal tissues by expression level and methylation level of the 13 genes were shown in Additional file 4: Figure S2 and Additional file 5: Figure S3. The 13 genes showed high efficiency to differentiate between tumour and normal tissues in terms of both gene expression level and DNA methylation level.
Overall survival prediction based on the epigenetic signature
A 13-gene epigenetic signature was built by the expression level of the 13 genes and the weighted parameter (formula in the method section) to predict the survival of patients with breast cancer. A median cut-off value was applied to stratify breast cancer patients into a high-risk group (n = 543) and a low-risk group (n = 544) (Fig. 2a). The survival status and heatmap for the expression of the 13 genes were showed in Fig. 2b, c. The Kaplan–Meier curve indicated patients in the low-risk group have a significantly better overall survival (OS) (HR = 0.3) and relapse-free survival (RFS) (HR = 0.45) compared to those in the high-risk group (Fig. 3a, b). The time-dependent ROC analysis revealed the 13-gene epigenetic signature had the best capacity to predict OS compared with that of other clinicopathological properties (Fig. 3c). Moreover, the correlation between the risk scores from epigenetic signature and ssGSEA scores were analysed and results showed signs of cancer-related hallmarks, e.g. mTOR signalling, G2M checkpoints, MYC targets significantly correlated with the risk scores (FDR q < 0.001) (Fig. 3d).
WGCNA on the transcriptome of breast cancer patients
For a better understanding of the molecular underpinnings of the clinical characteristics of the patients we applied WGCNA on the RNA-seq data matrix. Genes from RNA-seq data matrix were applied to build a gene co-expression network (Fig. 4a). The heatmap in Fig. 4b plots the topological matrix among the transcriptome (Fig. 4b). The relationships between clinical traits (molecular subtypes, pathological stage, distant metastasis, lymph node metastasis) and the eigenvalue of each gene module are presented in Fig. 4c. The blue module, which had the highest correlation (Cor = 0.4, p = 2e−16) with the 13-gene signature, was selected for further analysis. The genes in blue module, which had absolute values of correlation coefficients with the 13-gene signature greater than 0.2, were identified as hub-genes. The scatterplot below illustrates the strength of the link between the 13-gene signature and the module membership for each gene in the blue module (Fig. 4c). The gene co-expression network in the blue module were analysed by cytoscape (Fig. 4d). A variety of cell cycle-related genes, such as E2F, KIF2C, CDK1 and RA7D51, were included in the network (Fig. 4e). Submitting these hub-genes to GO and KEGG analysis a strong relationship between cell division, cell cycle and 13-gene epigenetic signature is apparent (Fig. 4f, g).
DNA methylation pattern, gene expression level in tumour and normal tissues and association of OS and RFS for the 13 genes
ITPRIPL1, SIAH2, KCNH8, KRT19, NDRG2, STAC2, TPD52, EZR, PCDHGA12, HIF3A, PCDHGA3, C2orf40, CCND2 were the 13 features (genes) in our LASSO Cox model. The correlation between DNA methylation status and gene expression is shown below (Fig. 5). All 13 genes show a high correlation between gene expression and DNA methylation level. The expression level of these 13 genes in different molecular subtypes of breast cancer is shown in Fig S4. Results revealed that the expression profiles were different in the 4 molecular subtypes of breast cancer. Then, the association between the expression of single gene and the OS and RFS of breast cancer patients was analysed using the Kaplan–Meier curve and univariate cox analysis (Additional file 7: Figure S5 and Additional file 8: Figure S6).
Subgroup analysis on the 13-gene epigenetic signature
As shown in Additional file 9: Figure S7A–F, the prognostic epigenetic signature serves as a promising biomarker for predicting the survival of breast cancer in different subgroups, including Luminal A type (p = 0.03), Luminal B type (p = 0.026), HER2-enriched (p = 0.012) and triple negative (p = 0.004), stage I-II (p < 0.001), stage III-IV (p < 0.001) patients, respectively.
Validation of the 13-gene epigenetic signature by independent breast cancer datasets
Four independent external GEO cohorts (GSE20685, GSE86948, GSE17907 and GSE12093) (Table 1) were applied to confirm the predictive value of the 13-gene epigenetic signature. The risk score for each patient was calculated by the formula we obtained from the training set (TCGA cohort). GSE20685, GSE86948, and GSE17907 have OS as the endpoint, and GSE12093 has disease-free survival (DFS) as the endpoint. The Kaplan–Meier curve showed a significantly worse survival in the high-risk group than in the low-risk group in GSE20685 (p < 0.001) (Fig. 6a), GSE86948 (p = 0.004) (Fig. 6b), GSE17907 (p = 0.003) and GSE 12093 (p = 0.034) (Fig. 6c).
Construction of a nomogram
To provide the clinician with a quantitative method by which to predict a patient’s probability of OS, a nomogram that integrated the 13-gene epigenetic signature, stage and molecular subtypes was constructed (Fig. 7a). The prediction efficiency was confirmed by the calibration plots (Fig. 7b).