An integrated prognosis model of pharmacogenomic gene signature and clinical information for diffuse large B-cell lymphoma patients following CHOP-like chemotherapy

Background As the most common form of lymphoma, diffuse large B-cell lymphoma (DLBCL) is a clinical highly heterogeneous disease with variability in therapeutic outcomes and biological features. It is a challenge to identify of clinically meaningful tools for outcome prediction. In this study, we developed a prognosis model fused clinical characteristics with drug resistance pharmacogenomic signature to identify DLBCL prognostic subgroups for CHOP-based treatment. Methods The expression microarray data and clinical characteristics of 791 DLBCL patients from two Gene Expression Omnibus (GEO) databases were used to establish and validate this model. By using univariate Cox regression, eight clinical or genetic signatures were analyzed. The elastic net-regulated Cox regression analysis was used to select the best prognosis related factors into the predictive model. To estimate the prognostic capability of the model, Kaplan–Meier curve and the area under receiver operating characteristic (ROC) curve (AUC) were performed. Results A predictive model comprising 4 clinical factors and 2 pharmacogenomic gene signatures was established after 1000 times cross validation in the training dataset. The AUC of the comprehensive risk model was 0.78, whereas AUC value was lower for the clinical only model (0.68) or the gene only model (0.67). Compared with low-risk patients, the overall survival (OS) of DLBCL patients with high-risk scores was significantly decreased (HR = 4.55, 95% CI 3.14–6.59, log-rank p value = 1.06 × 10−15). The signature also enables to predict prognosis within different molecular subtypes of DLBCL. The reliability of the integrated model was confirmed by independent validation dataset (HR = 3.47, 95% CI 2.42–4.97, log rank p value = 1.53 × 10−11). Conclusions This integrated model has a better predictive capability to ascertain the prognosis of DLBCL patients prior to CHOP-like treatment, which may improve the clinical management of DLBCL patients and provide theoretical basis for individualized treatment.


Background
Diffuse large B-cell lymphoma (DLBCL), as the most frequent adult non-Hodgkin lymphoma, representing approximately 30-40% of patients with lymphoid neoplasms in different geographic regions, with an annual incidence rate of over 100,000 cases worldwide [1,2]. As an aggressive malignancy, DLBCL showing some heterogeneous disorders with variable clinical characteristics, morphology, gene profile, and response to therapy [3,4]. These changeable characters make it difficult to prognosis and decide therapeutic strategies for patients with DLBCL. Nowadays, the initial standard therapy with rituximab in combination with CHOP (R-CHOP) chemotherapy is consists of rituximab, cyclophosphamide, doxorubicin, vincristine and prednisone, and approximately 75-80% patients get completely remission [5,6]. However, a proportion of patients still intractable to this first-line therapy. Therefore, it is increasingly essential to identify early therapeutic prognosis and identify highrisk patients who are unlikely to respond sensitively or benefit from therapeutic medicines.
The resistance of chemotherapy is changing among patients. It often occurs at the beginning of drug treatment or after an initial response. Numerous clinical, pathological and molecular markers have been developed to characterize the disease. Different kinds of resistance have been defined and associated with adverse clinical prognosis. Patients response to cancer medicines is significantly related with the genetic features of cancer cells lines [7]. Due to the heterogeneous response of patients to therapies and the frequent development of drug resistance, it is vital to recognize the molecular biomarkers to manage tolerable treatments for patients [8][9][10]. Since the development of microarrays, resistance gene signatures (REGSs) have been widely researched for prediction of cancer chemoresistance. There are some useful websites which contain plenty of information about drug resistance in cancer cell lines [11][12][13]. Genomics of Drug Sensitivity in Cancer (GDSC), one of the public databases provides drug sensitivity information in cancer cell lines and link with genomic datasets to identify new biomarkers for therapy. Recently, many molecular studies involving gene expression profiling in cancer cell lines [14][15][16] have shown that variant resistance possibility of therapies could influence patients' prognosis. Researches on drug resistance based on cancer cell lines have typically been generated through dose response experiments. Based on various analysis methods, more useful REGSs have been developed with strong prediction for clinical outcomes. Boegsted et al. [15], Steffen et al. [14] and Liedtke et al. [16] have described predictive REGSs based on the gene expression of cells lines for multiple myeloma, DLBCL and breast cancer, respectively.
On the purpose of predicting response to cyclophosphamide, doxorubicin and vincristine, Steffen et al. [14] have formulated three drug-specific resistance gene signatures from DLBCL patients treated with CHOPlike therapy, which could be potentially used to guide effective therapy for the individual patient from the beginning. These discovery and application of prognostic molecular signatures help identify potentially highrisk patients with DLBCL. Currently, in addition to this molecular methods, bio-clinical approach is also used to formulate treatment modalities for DLBCL [17,18]. There are many clinicopathological features associated with the outcomes of DLBCL patients [19][20][21], such as the Eastern Cooperative Oncology Group (ECOG) performance status, age, Ann Arbor stage and the nodal and extra nodal sites number were used commonly as clinical parameters in large number of studies. International Prognostic Index (IPI) is a robust prognostic tool, based on clinical and biochemical pre-treatment parameters, usually used to predict the outcomes of patients with DLBCL [22]. However, it seems that REGSs or clinical factors alone are insufficient to predict patients' prognosis. And to data, there is no one has ever considered to combine drug resistance signatures with clinical information to predicting prognosis of DLBCL. Thus, in order to establish a more sensitive models for forecasting patients' outcomes prior to therapy, our model comprehensively considered both drug sensitive information and clinical factors which have closely relation to the prognosis.
In our study, to build an integrated model for prognosis prediction for DLBCLs, we initially calculated cyclophosphamide, doxorubicin and vincristine drug resistance probabilities for each DLBCL patients treated with CHOP-like (including cyclophosphamide, doxorubicin, vincristine) chemotherapy. Then we analyzed clinical data and three drug resistance probabilities of DLBCL patients treated with CHOP-like therapy from GEO to screen factors that predictive for survival time, such as OS or progression-free survival (PFS). By using Cox regression analysis, we found four clinical variables and the resistance probabilities to doxorubicin and vincristine were associated with OS of DLBCL patients, and developed an integrated gene-and-clinical model for OS and another integrated gene-and-clinical model for PFS including three clinical variables and the resistance probabilities to doxorubicin from the training dataset, respectively. Due to the PFS of patients in the validating dataset wasn't recorded, we only confirmed the prognostic prediction ability of model for OS in validation dataset. We also compared the prognosis power of this integration model with a pharmacogenomic signature-alone model and a clinical-alone model. Our results suggested that a

Patients' samples
Two datasets that containing corresponding clinical information and genome-wide gene expression microarray data by using pretreatment lymphoma tissues from DLBLC were download from the GEO databases (http:// www.ncbi.nlm.nih.gov/geo/, Additional file 1: Table S1). All patients receiving CHOP or CHOP based (R-CHOP) treatment (cyclophosphamide, doxorubicin, vincristine and prednisone plus rituximab). The gene expression profiles of patients were conducted using the Affymetrix Human Genome U133 Plus 2.0 array from datasets utilized in this study. Our study included overall 791 patients with DLBCL, consisting of 449 patients from Visco's study (GSE31312) [23] and 342 patients from Lenz's study (GSE10846) [24], who have complete clinical information. Dataset GSE31312 was used for train prognosis models, and GSE10846 was used to validate the prognosis model generated from the training dataset.
The clinical endpoint was death, or the date of last assessment without any such event (censored observation), and the secondary endpoint was disease progression. The raw files in CEL format were downloaded from GEO and preprocessed with the RMA algorithm using the 'affy' R package for the expression data. The probe-level expression values were converted into gene-based expressions via the collapse row function [25].

Clinical and Pharmacogenomic factors
Univariate cox regression analysis was separately performed to select the demographic and clinical features and drug resistance signatures associated with the OS or PFS of DLBCL patients from following eight factors: gender, age at diagnosis, stage, extra nodal sites number, ECOG performance score and resistance probabilities of cyclophosphamide, doxorubicin, and vincristine. REGSs for three specific drugs (cyclophosphamide, doxorubicin, and vincristine) were applied to calculated the possibility of drug-specific resistance for individual patient [14]. Next, by using elastic net-regulated Cox regression, clinical related factors or pharmacogenomic factors resulting in a p-value less than 0.05 were respectively selected into the clinical alone model or gene alone model in the training dataset with 1000 times cross validations using the glmnet R package.

Integrated clinical-and-gene model for OS
Furthermore, we built an integrated model for OS of DLBCL patients with six factors, comprising age at diagnosis, stage, extra nodal sites number, the ECOG performance score, drug resistance probability of doxorubicin and drug resistance probability of vincristine. The prediction score was calculated according the following formula: where n is six, factor i and Coe i stands for the value of factor i and the estimated regression coefficient of factor i in the elastic net-regulated Cox regression model after 1000 times cross validation, respectively. We used multivariate Cox regression analysis to examine whether drug specific resistance signatures were independent of clinical factors.

Integrated clinical-and-gene model for PFS
Meanwhile, we built an integrated model for PFS of DLBCL patients with 4 factors, comprising stage, extra nodal sites number, the ECOG performance score, drug resistance probability of doxorubicin. The prediction score was calculated according the following formula: where n is four, factor i and Coe i stands for the value of factor i and the estimated regression coefficient of factor i in the elastic net-regulated Cox regression model after 1000 times cross validation, respectively. We used multivariate Cox regression analysis to examine whether drug specific resistance signatures were independent of clinical factors.

Signature performance evaluation
According to the median cutoff of the signature, we classified DLBLC patients into two distinct risk groups. Then DLBCL patients' OS or PFS were compared between two groups via Kaplan-Meier curves and p values (log-rank test). ROC analysis was used to estimate the prognostic utility of this integrated signatures according to their ability and efficiency to predict the risk of death or disease progress, which was performed with survivalROC R packages. The ROC curve draws the true-positive vs. false-positive predictions. The higher AUC value indicates better prediction efficiency (AUC = 0.5 means random prediction). The predictive capability of the integrated model was compared with the pharmacogenomic gene model and the clinical model. Meanwhile, to estimate the prediction and survival classification value of integrated model, the overall concordance statistic (C-index), integrated discrimination improvement (IDI), net reclassification improvement (NRI) and restricted mean survival (RMS) ratio were calculated in training and validating dataset. Furthermore, we examined the predictive power of prognostic model in distinct DLBCL subtypes.

Statistical analysis
Statistical analysis in this study were all performed using the R software version 3.4.1 (https ://www.r-proje ct.org/) with related packages or our custom compiled functions. All reported p-values were two sided. The survival R package was utilized for survival analysis. We calculated the hazard ratio (HR) and 95% confidence interval (CI) with a Cox regression model, and survival curves were drawn from Kaplan-Meier estimates. Differences in survival between groups were compared using the twosided log-rank test. C-index, IDI and NRI were calculated in each datasets by R package survC1. The RMS analysis was performed using R package survRM2 [26].

Demographic and clinical characteristics of datasets utilized in our study
DLBCL datasets with whole genome mRNA expression information and clinical data were obtained from the GEO database with accession number GSE31312 and GSE10846. After removal of patients without clinical and survival data, 791 DLBCL patients were analyzed in our study (  Table S2). And stage, the number of extra nodal sites, the ECOG performance score were related to PFS in DLBCL patients (p < 0.05, Additional file 3: Table S3).

Pharmacogenomic gene only signature
In dataset GSE31312 and GSE10846, each individual was calculated three drug-specific resistance probability for cyclophosphamide, doxorubicin, and vincristine separately via resistance gene signature classifiers [14]. Among 3 drug-specific resistance probability, two of them with elastic net-regulated Cox regression coefficients that did not equal 0 were included in the pharmacogenomic gene signature for OS and one drug-specific resistance probability was chosen as pharmacogenomic gene signature for PFS. The riskscore formula based on the expression of resistance gene signature for death risk prediction as follows: pharmacogenomic gene score = coefficient* (expression level of gene level), the corresponding coefficient for each gene is listed in Additional file 4: Table S4 (OS) and Additional file 3: Table S3 (PFS). According to the formula of model for OS or PFS, patients in the training dataset were divided into two adverse risk groups via the median score as the cut-off value, respectively. The survival time between two groups were estimated by using Kaplan-Meier analysis. Then using the two-sided log rank test to compare different OS or PFS between these two risk groups in the training and validating dataset (Additional file 5: Figure S1; Additional file 6: Figure S2). In univariate factor analysis, the HR of the high-risk group versus the low-risk group for OS was 1.

Prognostic value of the drug resistance factors is independent of clinical information
To further test whether the drug resistance signature was an independent prognosis predictor of DLBCL patients, the multivariate analysis was performed. We first performed Multivariate Cox regression analysis and identified clinical information that be associated with OS (

Integrated clinical-and-gene model in the training dataset
In our study, we develop an integrated clinical-andgenomic model for OS or PFS in the training dataset using elastic net-regulated Cox regression model after 1000 times cross validation, respectively. All of the prognosis related factors with a non-zero elastic netregulated Cox regression coefficients were selected into the model for OS: drug resistance signature, age at diagnosis, stage, the number of extra nodal sites and the ECOG performance score (Table 3; Additional file 8: Figure S3). We created a risk-score formula for OS as follows: Prognosis score (OS) = 0.0213* (age at diagnosis) +0.1499* (extra nodal sites number) + 0.2360* stage + 0.2951* (ECOG performance score) + 0.8221* (drug resistance probability of doxorubicin) + 0.0998* (drug resistance probability of vincristine). According to prognosis score, patients were separated into lowrisk (n = 224) and high-risk groups (n = 225) in the training dataset via the median score from formula as cut-off. Distribution of the prognosis score, survival status and gene expression in patients in the training dataset (Fig. 1a). In univariate factors analysis, the HR of high score group versus the low score group for OS was 4.55 (95% CI 3.14-6.59, p = 1.06 × 10 −15 , by logrank test, Fig. 1b), suggesting that the patients with high scores was significantly associated with shorter OS. Meanwhile, A risk-score formula for PFS was developed as follows: Prognosis score (PFS) = 0.0726* (extra nodal sites number) + 0.3347* stage + 0.1615* (ECOG performance score) + 0.9754* (drug resistance  probability of doxorubicin) (Additional file 9: Table S6; Additional file 10: Figure S4). To compare the sensitivity and utility of prognosis prediction between integrated gene-clinical model and pharmacogenomic gene only or clinical only model in the training model, Kaplan-Meier curve analysis and ROC analysis was conducted and the under receiver operating characteristic (AUC) value were calculated. ROC curves suggested that all the above three signatures showed prediction power in predicting OS in the training dataset (AUC > 0.5). Furthermore, compared with pharmacogenomic gene-only and clinical-only model, the integrated model with pharmacogenomic gene score and clinical information showed better performance in predicting OS at 10-year time point, as demonstrated by AUC values (Fig. 1c, Additional file 11: Table S7; Additional file 5: Figure S1). Compared with clinical-only model and pharmacogenomic gene signature, integrated model for OS had a highest mean C-index 0.73 with a SE of 0.02 and a significant RMS time ratio (1.73) in training dataset (Additional file 12: Table S8; Additional file 13: Table S9).
IPI, as a powerful predictor, is often used to predict outcomes in patients with DLBCL. According to risk group defined by our risk score and IPI, then we performed the Kaplan-Meier curve analysis to compare the value of survival prediction between these two predictive models in training dataset (Additional file 14: Figure S5).The integrated model achieved higher HR value (HR = 5.07) than IPI index (HR = 2.94), indicating that the predictive ability of our integrated model was better than IPI index.

Validation of the prognostic integrated model for OS in independent dataset
To test the robustness and repeatability of our findings, we validated our integrated gene-and clinicalsignature in an independent dataset. Due to lack of validation, the model for PFS will not be discussed. We classified patients into high risk (n = 171) or low risk group (n = 171) using the median value of the risk scores generated with the same model formula above as cutoff.
The distribution of the prognosis score (OS), the survival status of the DLBCL patients treated with CHOP-like chemotherapy and the gene expression were shown in Fig. 2a. In accordance with the finding in the training dataset, patients with high risk had significantly shorter OS than individuals in low risk group (HR = 3.47; 95% CI 2.42-4.97; p = 1.53 × 10 −11 , by logrank test, Fig. 2b, Additional file 11: Table S7). The ROC curves suggested that the prognosis score showed prediction power in predicting OS in the validating dataset (AUC = 0.67, Fig. 2c). The mean C-index 0.71 with a SE of 0.02 and significant RMS time ratio (1.97) showed integrated model for OS also had a stable predictive power in validating dataset. (Additional file 12: Table S8; Additional file 13: Table S9). And the integrated model represented higher HR value (HR = 3.31) than IPI (HR = 2.72), indicating that our integrated model performed a stronger power to predict patients' survival (Additional file 14: Figure S5).

Validation of the integrated gene-clinical signature in molecular subsets
A subgroup analysis was performed in different subsets according to the microarray molecular classifier for prognosis. In the training dataset, the prognosis score (OS) classified 199 patients with ABC subtype into high risk and low risk groups with obviously distinct OS (HR = 3.96, 95% CI 2.42-6.48, p = 3.95 × 10 −6 , by logrank test, AUC = 0.67, Fig. 3a, c). As for 227 patients with GCB subtype, the forecast capability of the integrated signature between different risk subgroups in predicting OS (HR = 4.45, 95% CI 2.42-8.19, p = 1.6 × 10 −6 , by log-rank test, AUC = 0.73, Fig. 3b, c) was similar. In the validating dataset, patients with higher level of risk score were associated with shorter OS in ABC subtype (HR = 1.99, 95% CI 1.29-3.05, p = 1.69 × 10 −3 , by log-rank test, Fig. 3d,   Fig. 2 Performance evaluation of the integrated model for OS of DLBCL patients treated with CHOP-based chemotherapy in the validating dataset. Patients' overall survival status and risk score generated with integrated model in the validating dataset (GSE10846, n = 342). a The distribution plot, patients' overall survival status and time and heatmap of the integrated model profiles. Rows represent clinical information and drug resistance probability, and columns represent patients. The grey dotted line represents the median integrated model risk score cutoff dividing patients into low-and high-score groups. b The Kaplan-Meier curves for patients in the validating dataset. The two-sided Log-rank test was performed to test the difference for OS between the high-risk and low-risk groups determined based on the median risk score from the validating set patients. The number of patients at risk was listed below the survival curves. The tick marks on the Kaplan-Meier curves represents the censored subjects. c The ROC curve had an AUC of 0.67  Figure S6). The results shown that the integrated gene-clinical model for OS or PFS has a prognostic value in both different subsets categorized based on a microarray molecular classifier.

Discussion
DLBCL is a molecular heterogeneous disease leading to heterogeneous responses to therapy and different survival outcomes. Given the variations of drug resistance and clinical features among patients with DLBCL, the gene expression profiles (GEP) of tumor cells and clinical parameters have enabled the identification of genetic and clinical factors associated with prognostic value. In our study, we built a comprehensive DLBCL prognosis model for OS that combined 4 clinical factors and 2 genomic signatures of drug resistance. The prognostic probability of this integrated model was confirmed in another dataset. The results suggest that the combined genetic and clinical signature plays a crucial role in predicting prognosis of DLBCL and guiding therapeutic efficacy. Numerous clinical, pathological and molecular markers have been developed to characterize the disease [2,[27][28][29][30][31]. Based on these characteristics, new different combinations of predictive biomarkers are being investigated to optimize the treatment and improve the outcomes of DLBCL. So far, there have been proposed advanced methods for survival prediction of various cancer and analysis of gene expression profiling. Some integrated bioinformatics analyses have progressively attempted to stratify patients with related diseases into low and high-risk groups, identified by gene expression profiling. Similar studies have also been reported for DLBCL. Caroline Bret et al. [32]. Shipp et al. [33] and Rosenwald et al. [34] have proposed several gene signatures to predict the clinical outcomes of DLBCL patients. Song et al. [35] demonstrated that CD59 could predict the reaction to R-CHOP treatment of patients with DLBCL. Another study identified a single molecular biomarker associated with diagnosis and treatment of DLBCL by using bioinformatics analysis [36]. Though molecular researches have filtered a plenty of genes, come out very divergent results. There are a few overlap prognostic genes in those studies. As we known, based on GEP is one of the promising methods for establish prognostic models to predict patients' survival with high accuracy, this method is still having some limitations in clinical practice. While some GEP studies [24,[32][33][34]37] showed that using genetic signatures could predict the clinical outcomes of DLBCL patients, lacking of reliable, stable and efficient biomarkers restrained its clinic application. Some studies demonstrating combined predictive models for prognosis is superior to predictive models relying on a single predictor [22,38]. Combined gene biomarker of DLBCL and IPI score showed a better predictive capability for prognosis than either one of these markers alone [39]. Liu et al. [40] reported that combining gene signature (five-miRNA signature) and clinical factors (TNM stage) was a more sensitive predictor for nasopharyngeal carcinoma. In another study, the FGD3-SUSD3 metagene model was demonstrated to have a superior prognostic value for breast cancer [41].
To overcome the above problems, first we respectively selected genetic and clinical factors associated with OS of 449 patients by using univariate survival analysis in the training dataset. To improve the validity of the model, then we developed a promising prognostic signature model contained gene and clinical factors by using elastic net-regulated Cox regression. Interestingly, when combined gene signature with clinical signature, the integrated model led to a more potent prognostic prediction of DLBCL patients. And the integrated model seems to have a better predictive value for OS than IPI. Integrated model could probably ensure valid therapy for each individual patient before the treatment is initiated. In this study, we split DLBCL patients into two distinct risk groups according to the risk score of each individual, and we found that, as the time passing by, patients in low score group had better survival outcomes. In order to have a good prognosis for DLBCL patients, the patients with high risk score should tried other different treatment methods.
In recent years, activated B-cell-like (ABC) type and germinal center B-cell-like (GCB) type as two distinct molecular subtypes of DLBCL have been recognized by gene expression profiling, which based on Cell-of-Origin (COO) classifier [42]. These two molecular subclasses express different gene expression, clinical presentation and drug response [34,43,44]. Next, we performed the predictive power of our model in patients of ABC and GCB subtypes from training and validation datasets, the model's prognostic capability was also strong in each group, indicating the good reproducibility and reliability of this signature. Though this signature showed few differences in the predictive capability, it is still powerful in both subtypes.
It is the first time to propose the idea of associating clinical information with drug resistance signatures to predicting prognosis of DLBCL individual with CHOPlike treatment. Nevertheless, there were several limitations in our study. First, we used only two independent datasets to respectively establish and validate the model. Therefore, more research is needed to validate this integrated model among enough number of DLBCL patients. Besides, there was only one drug resistance signature was independent of clinical factors, which may be caused by the limited sample number in analysis. And there were only three drug-specific resistance probability calculated in our study. The other drug resistance signatures may also worthy of investigation which need further research.

Conclusions
In conclusion, the new comprehensive clinical-gene model for OS presented in our study has a better value than gene only model for predicting prognosis for patients with DLBCL. And these clinical information and drug-specific resistance probability of patients with DLBCL before treatment play a vital role in therapy management. Our findings suggest that combining clinical factors with genomic data could help us in-depth understand DLBCL survival and improve prognosis accuracy and promote the development of individualized therapy. Future investigations will focus on the validation of our defined integrated signature in planned clinical trials.