Skip to main content

Integrating knowledge graphs into machine learning models for survival prediction and biomarker discovery in patients with non–small-cell lung cancer

Abstract

Accurate survival prediction for Non-Small Cell Lung Cancer (NSCLC) patients remains a significant challenge for the scientific and clinical community despite decades of advanced analytics. Addressing this challenge not only helps inform the critical aspects of clinical study design and biomarker discovery but also ensures that the ‘right patient’ receives the ‘right treatment’. However, survival prediction is a highly complex task, given the large number of ‘omics; and clinical features, as well as the high degree of freedom that drive patient survival. Prior knowledge could play a critical role in uncovering the complexity of a disease and understanding the driving factors affecting a patient’s survival. We introduce a methodology for incorporating prior knowledge into machine learning–based models for prediction of patient survival through Knowledge Graphs, demonstrating the advantage of such an approach for NSCLC patients. Using data from patients treated with immuno-oncologic therapies in the POPLAR (NCT01903993) and OAK (NCT02008227) clinical trials, we found that the use of knowledge graphs yielded significantly improved hazard ratios, including in the POPLAR cohort, for models based on biomarker tumor mutation burden compared with those based on knowledge graphs. Use of a model-defined mutational 10-gene signature led to significant overall survival differentiation for both trials. We provide parameterized code for incorporating knowledge graphs into survival analyses for use by the wider scientific community.

Introduction

The identification of patients who are likely to benefit from a specific therapeutic intervention is crucial to the success of clinical trials. This is particularly the case in oncology, where disease progression can be rapid and the side effects of treatment can have a significant impact on patients’ quality of life. The task of optimal patient selection is often addressed by using survival modeling analysis, in which the likelihood that a patient’s disease will respond to treatment is predicted on the basis of a combination of clinical and genomic factors. Improving patient-level survival prediction involves combining multiple modalities, [1, 2] incorporating nonlinear models, [2, 3] or performing individual biomarker hypothesis testing [4]. Although some methods use prior knowledge, this is not done in a generalizable or replicable fashion.

Biomedical knowledge graphs (KGs) incorporate entities and their relationships in graph-based representations derived from the literature and data, including compounds, essays, genes, drug targets, protein variants, pathways, cellular components, and associated diseases. KGs have recently become prominent in biomedical and drug discovery research to extract and integrate prior knowledge [5,6,7]. Their value in collecting and organizing prior biological knowledge and uncovering novel or hidden connections and the biological effects of drug responses is well documented [6, 7].

Recent studies have introduced biomedical KGs that integrate diverse public data sets to provide a single platform for answering core biomedical questions, such as disease gene identification, biomarker discovery, and drug-drug interactions [5, 7, 8]. Several biologically inspired KGs are widely used, including PrimeKG, [5] HetioNet, [7] Orphanet, [8] and Wikidata, [9] among others. The Biological Insights Knowledge Graph (BIKG) [10] is AstraZeneca’s internal KG that integrates data from more than 50 data sources, capturing more than 130 million biomedical relationships. The BIKG has been successfully used to map a prior knowledge space to focused clinical trial data to improve disease understanding and signal identification [6, 10].

Although various studies have explored the biomedical relationships derived from KGs, [5, 10] the application of prior knowledge from KGs in patient-level survival prediction models has not been as widely investigated. Noteworthy developments include the use of deep graph-based networks for survival prediction with breast cancer samples [11] and a KG-based long short-term memory model to predict mortality in patients with acute kidney injury [12]. Methods for extracting knowledge relevant to survival modeling from KGs remain an active area of research and include graph-based feature extraction, [13] ontology-based integration, [14] and literature-based knowledge extraction [15].

This study aimed to incorporate prior knowledge captured in KGs to machine learning–based models for predicting patient survival and to demonstrate the improvement of these models’ predictive power, including in patients with non–small-cell lung cancer (NSCLC) treated with immuno-oncologic (IO) therapies. To this end, we developed a methodology for knowledge extraction from KGs that is suitable for patient-level survival analysis by using ML-based predictive survival models. We observed a measurable advantage of using prior knowledge in predicting the survival rate of NSCLC patients. Here we compare the performance of different KGs, including BIKG [10] and Hetionet (heterogeneous information network), [7] graph embedding, and modeling methods. We also provide a reusable pipeline for predicting patient survival based on combining the patient’s genomic data and the user’s choice KG. Finally, we present the biological relevance of the findings, suggesting that the proposed methodology of information extraction from KGs specific to IO treatment in patients with NSCLC can capture relevant signals out of millions of KG entities and relations.

Results

Framework for prediction of patient survival and response and biomarker discovery

Figure 1 depicts an overall framework incorporating KGs in the patient survival prediction model. Using the patient’s genomic signature, each gene information (e.g., mutation status) is projected to the KG, expanding knowledge to include gene-gene interactions through connected KG pathways. The identified connected graph is then projected to a lower-dimensional embedding by using the SocioDim algorithm [16]. The gene-specific embeddings are then aggregated to a patient-level one that serves as an input for the ML patient survival prediction model. During the test phase, patient data are transformed into patient-level embedding that serves as input to a pretrained patient survival prediction model. Figure 1b provides an enlarged view of the generation of gene-level low-dimensional embedding. For the results reported here, a 16-dimensional embedding was used.

Fig. 1
figure 1

Framework incorporating KGs into ML-based models for patient survival prediction. a, Framework for patient survival prediction incorporating KG as input to the ML OS prediction model. b, Patient-level genomic data served as an input to the KG, which expands gene-gene relationships based on internal KG structure. For each gene in the panel, we identified a subgraph of interactions within the KG, which is projected to gene-specific embedding in a lower-dimensional space using the SocioDim algorithm

We evaluated a number of other methods of generating low-dimensional embedding from KGs, including GLEE, [17] NetMF, [18] RandNE, [19] NodeSketch, [20] and BoostNE, [21] indicating the advantage of using the SocioDim algorithm for incorporating KGs into models for prediction of OS (Extended Data Fig. 1).

We developed a stand-alone software application knowledge interface (API) that enables the incorporation of patient genomic data with relevant prior knowledge from KGs as input to the ML model to generate predictions of patient survival. In this study, we first integrate patient genomic data with prior knowledge from our internally developed knowledge graph called BIKG [10]. Subsequently, we compare its performance to that of a well-established, publicly available biomedical KG, Hetionet [7] (Extended Data Table 1), demonstrating the adaptability of our framework to various KG platforms.

Table 1 Recovery of lost information from the incomplete gene panel by BIKG

In addition, we evaluated multiple ML-based predictive algorithms for patient survival. Using random survival forest proved to be the fastest, most stable, and most generalizable method across the suite of standard methods that can use censored data, though the advantage was insignificant.

Finally, we applied the proposed framework to analyze improvements in the use of KGs to predict survival in NSCLC patients from different studies and clinical trials.

Use of prior knowledge from KGs in OS prediction: NSCLC patient cohort

Genomic panels play a significant role in oncology to assess treatment response in patients. These panels comprise genes carefully selected through comprehensive studies and expert curation [22].

In this first study, we assessed the predictive power for OS outcome in NSCLC patients, using a gene panel alone and in combination with prior knowledge derived from the BIKG with the MSK (Memorial Sloan Kettering)–MET (Metastatic Events and Tropisms) 2021 NSCLC primary tissue sample data set (see Methods). Although adding the BIKG to a full gene panel did not improve OS predictions, adding it to an incomplete panel (with a significant drop in predictive performance) demonstrated an ability to recover missing information and bring OS predictions to the level achieved using the entire panel. In particular, we evaluated four inputs for the OS prediction model: (a) a complete MSK gene panel (481 genes), (b) a complete MSK gene panel combined with the BIKG, (c) a random subset of an MSK gene panel (100 genes), and (d) a random subset of an MSK gene panel combined with the BIKG.

Table 1 shows a comparison of the OS predictive performance of the four above-mentioned models, using concordance index (C-index) and variance derived from 10 separate runs. The results are reported on the independent test sets. There was no statistically significant improvement in OS prediction between scenarios (a) and (b) across 10 independent runs (Table 1). When the predictive performance of the model based on a random subset of genes was compared with that based on a full panel (scenarios a and c), there was a statistically significant decrease in predictive performance (Table 1). Adding the BIKG to a reduced gene subset (scenario d) brought the performance back to that of the full panel (Table 1).

Figure 2 shows a comparison of the models’ performance using Kaplan–Meier (KM) survival analysis [23]. The OS prediction was used to stratify patients into high- and low-risk subgroups with a 75th percentile cutoff. The model did not effectively stratify patients when a complete gene panel (Fig. 2a) was used. In contrast, using the full gene panel combined with the BIKG achieved statistically significant patient stratification (Fig. 2b). Similarly, use of an incomplete gene panel did not result in successful stratification (Fig. 2c), whereas adding the BIKG to the input yielded statistically significant stratification (Fig. 2d). No significant difference in OS prediction was seen when primary and metastatic tissue site data were compared (Extended Data Fig. 2).

Fig. 2
figure 2

Patient OS stratification using different inputs to OS prediction models. a, Complete MSK gene panel; b, BIKG using full MSK panel; c, random 100 genes from the MSK panel; d, BIKG using a random gene panel. KM plots show statistically significant survival stratification for b and d, indicating that BIKG improved patient stratification when based on incomplete panel data, thereby complementing the incomplete MSK gene panel

Addition of BIKG for OS prediction in NSCLC IO-treated patients in clinical trials: POPLAR and OAK trials

The accuracy of OS prediction was improved through the incorporation of prior knowledge from the BIKG, as opposed to exclusively relying on gene panel data when analyzing two distinct cohorts of patients with late-stage NSCLC treated with IO from two separate clinical trials, OAK and POPLAR.

BIKG-based prediction of OS in OAK clinical trial data

We used the OAK NSCLC phase 4 gene panel data independently and in conjunction with the BIKG as input features for ML models aimed at predicting OS. Figure 3a–c; Table 2 show independent test set results of patient stratification based on the models’ predictions using a 75% cutoff. The model trained with the BIKG in combination with the OAK gene panel data performed statistically better than the model based on the OAK gene panel only and better than that using known blood-based tumor mutation burden (bTMB) with a cutoff of 16 (Table 2) [24].

Fig. 3
figure 3

BIKG-based OS predictive performance for NSCLC IO OAK and POPLAR clinical trials. Trained using OAK, evaluation results are reported on an independent test set from the OAK clinical trial and all POPLAR clinical trial data. ac, Evaluation of model performance using KM stratification with fivefold cross-validation on OAK data set. a, Model trained with BIKG using OAK gene panel data; b, model trained with OAK gene panel; c, model trained using bTMB with a cutoff of 16. The model trained with BIKG (a) had a statistically significant P value, whereas models trained with the OAK gene panel only (b) or bTMB only (c) did not. df, Performance on an independent POPLAR data set by models trained with the OAK data set (same models as in ac). d, Model trained with BIKG in combination with the OAK gene panel; e, model trained with the OAK gene panel; f, model using bTMB with a cutoff of 16. The model using BIKG (d) had a statistically significant P value, whereas models trained with the OAK gene panel only (e) or bTMB only (f) did not show statistical significance

Table 2 Patient OS stratification using different model inputs: OAK independent test set

BIKG-based prediction of OS in POPLAR clinical trial data

The developed OAK-based models were further evaluated by using data from an independent phase 2 clinical trial, POPLAR [25] (Fig. 3d–f; Table 3). Similar to the evaluation using an independent test set from the OAK trial, a model trained with the BIKG in combination with the OAK gene panel performed statistically better than the model based on the OAK gene panel only. Tumor mutation burden (TMB)–based stratification with a cutoff of 16 was not as effective (Table 3).

Table 3 Patient OS stratification using different model inputs: POPLAR data set

Stability and performance gain of BIKG-based OS prediction models

Figure 4 illustrates the performance of the BIKG-based model in predicting OS with the MSK and POPLAR data while evaluating models’ stability and benefit of adding BIKG for high risk versus low risk groups. The average performance indicated a consistent advantage of the BIKG-based models in predicting OS in a group of high risk patients.

Fig. 4
figure 4

OS prediction in models using BIKG versus gene panel only. Patient survival prediction using a, the MSK independent test set and b, the POPLAR data set. Patients are sorted in ascending order by their OS. The BIKG-based model (red dashed line) indicated a much more elevated predicted risk for patients with low OS than the model based on the gene panel only (blue dashed line). The slope difference between the red and blue lines is (–0.469) – (–0.112) = (–0.347) in a and (–0.400) – (–0.065) = (–0.335) in b. The slope different between lines underscore the substantial difference in overall survival (OS) prediction performance between the BIKG-based model and the panel-based model. Particularly, the BIKG-based model demonstrates significantly higher average predicted risk for patients with an unfavorable risk profile, indicating its potential advantage in predicting outcomes for high-risk patients across different datasets

The difference in OS prediction performance between the models was much greater for patients with high predicted risk than for those with low predicted risk. Specifically, when focusing on the top 20% of patients with an unfavorable risk profile, the BIKG-based model exhibited an average predicted risk that was 15.6% higher than that of the panel-based model using the MSK data set, and 28.1% greater than that of the panel-based model using the POPLAR data set. When focusing on the top 20% of patients with a favorable predicted outcome, we noted that the BIKG-based model exhibits an average predicted risk of the BIKG-based model was only 0.5% higher than that of the model using the MSK data set, and 6% higher than the panel-based model using the POPLAR data set.

Biomarker-based signature for OS differentiation

To identify genomic signatures associated with high and low OS, we assessed the importance of individual genes identified by the predictive model, followed by an analysis of the mutation signature of a combination of genes.

Importance of individual genes

Tables 4 and 5 show the top most predictive genes identified by the developed OS prediction models across 10 independent runs.

Table 4 Top 10 genes identified as important in predicting OS in the OAK and POPLAR studies with BIKG-based models
Table 5 Top 10 genes identified as important in predicting OS in the MSK-MET 2021 study

Recent studies have highlighted the importance of TP53-ATM co-mutations as a biomarker for immune checkpoint inhibitor (ICI) response in NSCLC patients [26, 27]. Chen et al. showed that PD-L1 expression was not only high in a co-mutated cohort but was also associated with better OS than both single-gene mutations and wild-type subpopulations [26]. Zhang et al. reported the remarkable concurrency of TP53, STK11, PTPRD, RBM10, and ATM mutations in patients with KRAS-mutant NSCLC and highlighted the significance of this relationship in assessing ICI efficacy [28]. Bai et al. went one step further and recommended an eight-gene mutation signature to predict the response of ICI in patients with non-squamous NSCLC [29].

Gene mutation signature as an OS differentiator

Individual genes identified as important by the models did not show stratification power. However, a combination of key genes identified by the BIKG-based model (EGFR, TP53, ATM, PRKDC, STAT3, CTNNB1, KRAS, NFE2L2, EPHA7, and SOX9) was used to differentiate OS in OAK and POPLAR patient cohorts (Fig. 5a, b). We considered the biomarker signature to be positive (defined in Fig. 5 as “Mut”) if there were at least two mutations among the selected genes and negative (defined in Fig. 5 as “Wt”, wild type) otherwise. Using key genes identified by the BIKG-based model, we observed a statistically significant difference in OS for the OAK data set (Fig. 5a). A similar, but not statistically significant, pattern was observed for POPLAR (potentially due to a too-small patient cohort; n = 57) (Fig. 5b).

Fig. 5
figure 5

Differentiation by OS on OAK, POPLAR, and MSK data sets using mutation signature. Patient survival prediction using a, the OAK independent test set, b, the POPLAR data set, and c, the MSK independent test set. Mut, mutation biomarker; Wt, wild type

Figure 5c shows the results of using the gene signature of key genes (STK11, KRAS, KEAP1, TP53, SMARCA4, ATM, EPHA3, PTPRD, RBM10, and FAT1) to differentiate OS in the MSK cohort.

Incidence of mutations in predicted high- and low-risk groups

We compared the difference in mutation prevalence between predicted high- and low-risk groups when using the gene panel only versus the BIKG as input to the predictive survival model for the OAK and POPLAR data sets (Extended Data Table 4). Models using the BIKG identified more patients with high risk associated with TP53, EGFR, EPHA7, PRKDC, and NFE2L2 gene mutations than low-risk patients. The high-risk patients had lower TMB on average. Similarly, we compared the high- and low-risk groups using the MSK data set (Extended Data Table 5). Models using the BIKG identified more patients with high risk associated with STK11, KRAS, KEAP1, and TP53 gene mutation than low-risk patients.

Discussion

The results of this study can provide a blueprint for how integrative analytics using prior knowledge feature spaces can add value to critical clinical decisions, helping to identify the right drug for the right patient. The proposed methodology of incorporating prior knowledge into patient-level survival analysis demonstrates the advantages of automated information extraction from KGs, allowing it to speed up and boost the performance of predictive models.

Incorporation of KG information into patient-level predictive survival models

Our methodology aims to automatically incorporate prior knowledge extracted from KGs into patient-level predictive models, specifically, ML-based patient survival prediction models. Earlier studies on core biomedical questions have not addressed the extraction of information from KGs that describe patients, concentrating more on biomedical interactions and the effects of individual genes or pathways. This study demonstrated the value of using KGs in addressing patient-level problems.

The patient survival prediction pipeline, using patient genomic features as input and prediction of patient survival as output, is a stand-alone API bringing KGs as input to patient survival prediction models. The API can incorporate different KGs and modeling methods in addition to those introduced in this study. It provides training, testing, and validation schema with added bootstrapping capability to ensure the generalization of results to previously unseen data.

Several methods of translating graph representation into low-dimensional embedding suitable for ML models have been evaluated in this study. Of those, SocioDim has demonstrated a stable advantage in improving the predictive power of patient survival models.

Comparing the BIKG and Hetionet as examples of KGs in application to NSCLC, we observed similar predictive performance of the models, with a slight advantage of BIKG. The use of disease- and treatment-specific subgraphs rather than a full graph (with BIKG) improved survival prediction, suggesting that targeting more relevant information in KGs helps to reduce the noise and uncertainty of information extracted from KGs.

Subgraphs from BIKG capture the complex biomedical relationships within a network architecture and may have the potential to provide explainable insights into disease biology and drug response. Although the performance of predictive models based on full BIKG versus selected subgraphs was not notably different, using subgraphs may allow better identification of potential biomarkers.

Of the predictive algorithms, random survival forest proved to be the most stable and generalizable across the suite of standard methods that use censored data, though the advantage was insignificant.

Application of KG-based models for OS prediction to NSCLC patients

We observed a consistent advantage of using prior knowledge extracted from KGs for OS prediction with MSK-MET 2021, POPLAR, and OAK data in NSCLC patients. For MSK-MET 2021 data, adding BIKG to the full panel as input did not confer a significant advantage. When BIKG was added to the incomplete panel, however, we observed a remarkably automated retrieval of the missing information, allowing the predictive performance to return to the levels obtained with the complete panel. This finding suggests that BIKG can restore incomplete, missing data in a manner unbiased by human error in a fast and automated way. Identifying important genes constituting a comprehensive gene panel is a time-consuming manual process compared with the quick, automatic extraction of information from the BIKG. The low variance in performance between different bootstraps supports the superiority of the BIKG’s performance despite the graph’s high dimensionality and noisiness.

For POPLAR and OAK clinical trial data from IO-treated NSCLC patients, the performance of OS prediction was consistently improved when the BIKG was used as input than when models based on the gene panel alone or bTMB were used. The results were generalizable to the unseen data; the BIKG-based model developed with OAK clinical trial data was evaluated using POPLAR data and demonstrated consistently better performance as compared to gene panel or bTMB. As with the MSK-MET 2021 data, we observed low variance in performance between models using different bootstraps, supporting the BIKG-based model improved robustness claim.

KG-based models showed a higher gain in performance improvement for patients in the high-risk group, observed for both MSK, and POPLAR-OAK datasets, suggesting that the additional knowledge extracted from KGs contributes to the stabilization and improvement of high-risk predictions while the smaller incidence of important for the prediction model mutations in the low-risk group indeed makes KG contribution to the model less significant.

The genes that contributed most to improving OS prediction showed consistency across the models (Tables 4 and 5). The co-occurrence of mutations confirmed previously observed associations (Extended Data Fig. 3). The genes identified as important by the models have strong clinical relevance from previously published studies. A complete OncoKB [30] assessment of these genes is provided in Extended Data Tables 2 and 3. For both MSK and POPLAR-OAK datasets, BIKG-based models were able to improve the differentiation in key mutation incidence between high- and low-risk patient groups.

The identified mutational signature built from a combination of selected genes identified by predictive models proved to be an OS differentiator. For the -OAK IO-treated population, the mutational signature showed a statistically significant difference in survival between signature-positive and signature-negative patients (Fig. 5a). A similar but not statistically significant trend was observed for POPLAR and MSK (Fig. 5b-c) on the independent test set. The test sets’ lack of significance may be attributed to the small sample size. The finding supports the importance of combining genes as a potential biomarker to amplify a patient’s response to treatment.

Limitations and next steps

The analysis of NSCLC data was performed by using gene panel data only, limiting our ability to identify novel biomarkers. Clinical gene panels capture only a tiny fraction of relevant disease biology, presenting a significant hindrance to clinical translation of any predictive analysis. Next steps should include identifying suitable complete genome data and applying the proposed methodology to the identification of new biomarkers. While genomic data form the majority of currently established clinical biomarkers, solely relying on genomics could introduce bias into ML/AI methodologies, such as the one described in this study. Firstly, patient selection and recruitment bias is not captured within our model. Secondly, it is important to highlight the value of other ‘omic and Real World Evidence (RWE) data towards patient selection, companion diagnostics, and precision medicine therapeutic recommendations. Such data include, but are not restricted to, transcriptomics, proteomics, clinical biomarkers, and image-derived biomarkers, which could provide a deeper and more comprehensive disease and response signature from the patient population. Even though the availability of such data from clinical studies is quite rare owing to cost and logistical challenges, recent clinical trials have been actively generating such data to inform downstream precision medicine decisions. Extending our approach by overlaying multiple data domain layers as listed above could help mitigate the bias introduced by learning from just genomic data, and Knowledge Graphs are ideal for integrating cross-modality data seamlessly.

Although this study highlights the value of biomedical KGs in deciphering signals of clinical response, the biomedical and data science communities need to address critical gaps and learnings to deliver stable, reliable and scalable value. Patient survival and clinical response have multiple independent confounding factors (e.g., clinical and socioeconomic), not all of which are currently captured in a comprehensive fashion. Clinical trials capture an inherent bias in patient selection that is not reflected in existing predictive models. KGs should evolve to include nonscientific metadata to address this issue. KGs are also inherently noisy, and careful consideration is required to define context-specific graphs to address clinical questions. From a technical perspective, ML models are “data hungry,” and better models that can handle small patient cohorts need to be developed, especially given that the design of clinical trial arms is becoming increasingly specific in the era of precision medicine.

Methods

Data

MSKCC metastatic events and tropisms

The MSK Cancer Center (MSKCC) cohort is a pan-cancer cohort of 25,775 patients with metastatic diseases across 50 tumor types [31]. All samples were profiled by using the MSK-IMPACT (Integrated Mutation Profiling of Actionable Cancer Targets) targeted panel, which can detect mutations and other critical changes in the genes of both rare and common cancers. The data set comprises primary (61%) and metastatic (39%) samples. This paper focuses on data from NSCLC patients in this cohort, which contains 1,494 metastatic and 1,855 primary samples. The gene panel has 481 genomic features.

The POPLAR (ClinicalTrials.gov ID: NCT01903993) [25] and OAK (ClinicalTrials.gov ID: NCT02008227) [32] clinical trials are, respectively, phase 2 and phase 3 studies measuring the effectiveness of atezolizumab against docetaxel as a second line of treatment for patients with failed first-line platinum chemotherapy. The POPLAR study was conducted in Europe, Asia, and North America, and the OAK study was conducted in Europe, Asia, North America, South America, and New Zealand. Both studies enrolled adults (\(\:\ge\:\)18 years old) with measurable disease per Response Evaluation Criteria in Solid Tumors (RECIST) version 1.1 and an Eastern Cooperative Oncology Group performance status of 0 or 1. Patients in both studies had previously received one or two cytotoxic chemotherapy regimens for stage IIIB or IV NSCLC. In addition to this, patients who carry EGFR mutations or ALK fusion oncogenes received tyrosine kinase inhibitor treatment [33]. To identify the molecular mechanisms associated with immunotherapy, all analyses were conducted in the atezolizumab arm, which consists of 102 samples from POPLAR and 324 samples from OAK. The overlapped gene panels between OAK and POPLAR consists of 355 genomic features (Extended Data Fig. 4). The patient datasets utilized in our study are published clinical studies and ethics considerations in accordance to the Declaration of Helsinki have been made in their respective original publications.

Knowledge Graphs: BIKG and Hetionet

The BIKG is AstraZeneca’s internal KG [6, 10]. It leverages all available AstraZeneca internal and external biological data sets, such as genes, pathways, cell lines, assays, compounds, and diseases. It includes more than 500 million relationships from more than 166 million publications and more than 50 databases, licensed repositories, and internal data sets. Graph-based knowledge allows us to infer evidence from similar features when there are missing data and sparse knowledge of a biological system. The BIKG also captures multiple biological mechanisms, which aids in the exploration of nonlinear biology and allows causal inference.

Hetionet is an integrative biomedical knowledge network [7] that combines information from 29 public databases. The network contains 47,031 nodes of 11 types and 2,250,197 relationships of 24 types. Hetionet captures the conceptual distinctions between various components and mechanisms, such as genes and diseases or upregulation and binding.

KG subgraphs for IO-related cancer treatment

To overcome the sparse molecular and missing biological information from the patient input data set, we constructed several BIKG subgraphs by integrating molecular and pathway biological information (Table 6). First, the following gene sets were identified:

Table 6 KG subgraphs evaluated in application to NSCLC, including IO related

Immune-related gene set: The latest version of a public database of innate immune interactions (InnateDB) [34, 35] was used to define an immune-relevant gene set (1,069 genes) for subgraph construction and context specificity. The gene-pathway mapping for this subset of genes was extracted from the baseline BIKG.

CIViC disease specificity gene set: CIViC (Clinical Interpretation of Variants in Cancer) was queried to identify 440 genes annotated as clinically relevant in NSCLC. The gene-pathway mapping for this subset of genes was extracted from the baseline BIKG.

Biorelate NSCLC causal data: Biorelate is a UK-based start-up company specializing in natural language processing–derived causal relationships between biomedical entities. AstraZeneca licensed this data to extract approximately 108,000 gene-gene relationships specific to NSCLC (specifically, lung adenocarcinoma [LUAD] and lung squamous-cell carcinoma [LUSC]) based on literature evidence. All Biorelate data can be accessed via the company’s portal (https://www.biorelate.com).

Using these specific gene sets and their interactions reflected in BIKG, we then constructed subgraphs in which each type of subgraph provided context-specific (e.g., cancer type, patient subpopulation, mutation profile) querying and embedding methods (Table 6). The results in this manuscript are reported using a Biorelate NSCLC causal subgraph, which showed a consistent predictive performance and better stability of the results in the preliminary studies (Extended Data Table 1).

KG feature extraction via graph embeddings for survival modeling

Creating structured, low-dimensional, patient-level embedding from unstructured graph data was critical for the next step, in which ML models were used to predict OS (Fig. 1b). The workflow consisted of three steps: (a) identifying patient-specific features in the graph through the patient genomic profile, (b) creating a gene-specific embedding from the graph, and (c) generalizing it to patient-level embedding.

  1. 1.

    From the gene panel, patient-level information was projected to the KG, exciting nodes, and associated pathways for each mutated gene in the patients’ profiles. The information was propagated through the graph via connected pathways.

  2. 2.

    Several methods were used to create gene-specific, fixed, low-dimensionality embedding from the KG, including GLEE, [17] NetMF, [18] RandNE, [19] NodeSketch, [20] BoostNE, [21] and SocioDim [16]. The SocioDim algorithm is based on identifying the largest eigenvalues from the gene-specific modularity matrix created from a patient-specific graph network. The process of learning gene latent representation from a fused graph of patient-gene and gene-gene interaction includes several steps. Given a patient heterogeneous graph G, we first calculated the G’s modularity matrix B:

$$\:B=A-\frac{d{d}^{T}}{2m},$$

where A is the interaction matrix, d is the degree of each node, dT is the transpose matrix of d, and m is the number of edges in the graph G.

We then extracted the eigenvectors corresponding to the largest eigenvalues of the graph modularity matrix. The Lanzcos method [36] was applied to calculate the top eigenvectors used as the gene embedding representation.

Patient embeddings were generated by averaging gene embeddings over all genes from the panel. We calculated the patient embedding as follows:

$$\:{p}_{em}=\frac{1}{n}\sum\:_{i=1}^{n}{g}_{i}$$

,

where gi is gene embedding and n is the number of gene mutations a patient carries.

Readers interested in reproducing our results can access the detailed Jupyter notebook and source code implementation at https://github.com/rafaelfang/KGSurvPrediction.

Survival modeling

We used Cox regression [37] and random survival forest [38] to model OS. Preliminary analysis demonstrated a much improved OS predictive performance with random survival forest, a nonparametric, nonlinear approach, so the results are reported here using this algorithm unless stated otherwise.

We compared input feature representations consisting of the gene panel with that from KG-based embedding derived using the gene panel to evaluate the importance of the KG in improving prediction of patient OS. For the IO-treated cohort, the results were also compared with the performance using TMB, a U.S. Food and Drug Administration–approved biomarker with a 10 Mut/Mb cutoff [39]. In additional analyses, we evaluated the differences in the models’ predictive power by comparing primary with metastatic tissues.

Training and evaluation approach

Data was split into training, test, and evaluation subsets. Bootstrapping was used for stability and sensitivity to training data. We used C-index for performance evaluation and KM survival analysis [23] to stratify patients into high- and low-risk groups, inversely proportional to OS. We considered several cutoffs; results are reported here with the 75th percentile cutoff, which is widely used in the literature [40,41,42]. The cutoff was determined by using training data during model development and applied to the evaluation data set.

Biological interpretability

Key molecular features associated with prediction of patient survival

For each developed model, we identified the top 10 crucial gene features from the panel that contributed most to the OS prediction, using a decision tree regression model that associates embedding features with genomic features. We report the most important genomic features with a count of how many times they appear as important out of 10 independent models.

Differential analysis

The average value of mutation incidence was calculated for each gene. We compared patients’ high- and low-predicted risk groups to identify mutations associated with the high-risk group.

Data availability

The pipeline and analysis code are available at https://github.com/rafaelfang/KGSurvPrediction. Publicly available KGs that can be incorporated into the pipeline are PrimeKG [5], available at https://github.com/mims-harvard/PrimeKG; and Hetionet [7], available at https://het.io/. The patient cohort datasets used in this study are available at: MSK MetTropism: https://www.cbioportal.org/study/summary?id=msk_met_2021. OAK clinical trial: https://ega-archive.org/datasets/EGAD00001008549. POPLAR clinical trial: https://ega-archive.org/datasets/EGAD00001008548. OAK and POPLAR data can be found in the Supplementary Tables from “Blood-based tumor mutational burden as a predictor of clinical benefit in non-small-cell lung cancer patients treated with atezolizumab” [24]

References

  1. van Timmeren JE, et al. Survival prediction of non-small cell lung cancer patients using radiomics analyses of cone-beam CT images. Radiother Oncol. 2017;123:363–9.

    Article  PubMed  Google Scholar 

  2. Yao J, Zhu X, Zhu F, Huang J. Deep correlational learning for survival prediction from multi-modality data. Medical Image Computing and Computer-Assisted Intervention, Quebec City, Quebec, Canada, September 10–14, 2017.

  3. Vale-Silva LA, Rohr K. Long-term cancer survival prediction using multimodal deep learning. Sci Rep. 2021;11:13505.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  4. Andre F, et al. Biomarker studies: a call for a comprehensive biomarker study registry. Nat Rev Clin Oncol. 2011;8:171–6.

    Article  PubMed  Google Scholar 

  5. Chandak P, Huang K, Zitnik M. Building a knowledge graph to enable precision medicine. Sci Data. 2023;10:67.

    Article  PubMed  PubMed Central  Google Scholar 

  6. Gogleva A, et al. Knowledge graph-based recommendation framework identifies drivers of resistance in EGFR mutant non-small cell lung cancer. Nat Commun. 2022;13:1667.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  7. Himmelstein DS, et al. Systematic integration of biomedical knowledge prioritizes drugs for repurposing. eLife. 2017;6:e26726.

    Article  PubMed  PubMed Central  Google Scholar 

  8. Weinreich SS, Mangon R, Sikkens JJ, Teeuw ME, Cornel MC. Orphanet: a European database for rare diseases. Ned Tijdschr Geneeskd. 2008;152:518–9.

    CAS  PubMed  Google Scholar 

  9. Waagmeester A, et al. Wikidata as a knowledge graph for the life sciences. eLife. 2020;9:e52614.

    Article  PubMed  PubMed Central  Google Scholar 

  10. Geleta D et al. Biological Insights Knowledge Graph: an integrated knowledge graph to support drug development, (2021). https://www.biorxiv.org/content/https://doi.org/10.1101/2021.10.28.466262v1. Accessed February 6, 2024.

  11. Ramirez R, et al. Prediction and interpretation of cancer survival using graph convolution neural networks. Methods. 2021;192:120–30.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  12. Liu LJ, Ortiz-Soriano V, Neyra JA, Chen J. Kgdal: knowledge graph guided double attention lstm for rolling mortality prediction for aki-d patients. Proceedings of the 12th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics, Gainesville, FL, August 1–4, 2021.

  13. Fu X, Patrick E, Yang JY, Feng DD, Kim J. Deep multimodal graph-based network for survival prediction from highly multiplexed images and patient variables. Comput Biol Med. 2023;154:106576.

    Article  PubMed  Google Scholar 

  14. Zhang H, Data integration through ontology-based data access to support integrative data analysis: a case study of cancer survival. 2017 IEEE International Conference on Bioinformatics and, Biomedicine et al. (BIBM), Kansas City, MO, November 13–16, 2017.

  15. Zhao Y, et al. Pathologic lymph node ratio is a predictor of esophageal carcinoma patient survival: a literature-based pooled analysis. Oncotarget. 2017;8:62231.

    Article  PubMed  PubMed Central  Google Scholar 

  16. Tang L, Liu H. Relational learning via latent social dimensions. Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Paris, France, June 28–July 1, 2009.

  17. Torres L, Chan KS, Eliassi-Rad T. GLEE: geometric laplacian eigenmap embedding. J Complex Netw. 2020;8:cnaa007.

    Article  Google Scholar 

  18. Qiu J et al. Network embedding as matrix factorization: unifying deepwalk, line, pte, and node2vec. Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, Marina del Ray, CA, February 5–9, 2018.

  19. Zhang Z, Cui P, Li H, Wang X, Zhu W. Billion-scale network embedding with iterative random projection. 2018 IEEE International Conference on Data Mining (ICDM), Sentosa, Singapore, November 17–20, 2018.

  20. Yang D, Rosso P, Li B, Cudre-Mauroux P. Nodesketch: highly efficient graph embeddings via recursive sketching. Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Anchorage, AK, August 4–8, 2019.

  21. Li J, Wu L, Guo R, Liu C, Liu H. Multi-level network embedding with boosted low-rank matrix approximation. Proceedings of the 2019 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, Vancouver, BC, Canada, August 27–30, 2019.

  22. Easton DF, et al. Gene-panel sequencing and the prediction of breast-cancer risk. N Engl J Med. 2015;372:2243–57.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  23. Goel MK, Khanna P, Kishore J. Understanding survival analysis: Kaplan-Meier estimate. Int J Ayurveda Res. 2010;1:274.

    Article  PubMed  PubMed Central  Google Scholar 

  24. Gandara DR, et al. Blood-based tumor mutational burden as a predictor of clinical benefit in non-small-cell lung cancer patients treated with atezolizumab. Nat Med. 2018;24:1441–8.

    Article  CAS  PubMed  Google Scholar 

  25. Fehrenbacher L, et al. Atezolizumab versus Docetaxel for patients with previously treated non-small-cell lung cancer (POPLAR): a multicentre, open-label, phase 2 randomised controlled trial. Lancet. 2016;387:1837–46.

    Article  CAS  PubMed  Google Scholar 

  26. Chen Y, et al. TP53 and ATM co-mutation predicts response to immune checkpoint inhibitors in non-small cell lung cancer (Abstract 1240P). Ann Oncol. 2019;30:V506.

    Article  Google Scholar 

  27. Zhang F, et al. Co-occurring genomic alterations and immunotherapy efficacy in NSCLC. NPJ Precis Oncol. 2022;6:4.

    Article  PubMed  PubMed Central  Google Scholar 

  28. Zhang F, et al. Co-occurring genomic alterations and immunotherapy efficacy in NSCLC. NPJ Precision Oncol. 2022;6:4.

    Article  Google Scholar 

  29. Bai X et al. Development and validation of a genomic mutation signature to predict response to PD-1 inhibitors in non-squamous NSCLC: a multicohort study. (Correction in J Immunother Cancer. 2020;8(2):e000381corr1). J Immunother Cancer 8, e000381 (2020).

  30. Chakravarty D et al. OncoKB: a precision oncology knowledge base. JCO Precis Oncol 2017, PO.17.00011 (2017).

  31. Nguyen B, et al. Genomic characterization of metastatic patterns from prospective clinical sequencing of 25,000 patients. Cell. 2022;185:563–75.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  32. Rittmeyer A, et al. Atezolizumab versus Docetaxel in patients with previously treated non-small-cell lung cancer (OAK): a phase 3, open-label, multicentre randomised controlled trial. Lancet. 2017;389:255–65.

    Article  PubMed  Google Scholar 

  33. Mazieres J, et al. Atezolizumab versus Docetaxel in pretreated patients with NSCLC: final results from the randomized phase 2 POPLAR and phase 3 OAK clinical trials. J Thorac Oncol. 2021;16:140–50.

    Article  CAS  PubMed  Google Scholar 

  34. InnateDB. About InnateDB. https://www.innatedb.com. Accessed February 5, 2024.

  35. Breuer K, et al. InnateDB: systems biology of innate immunity and beyond—recent updates and continuing curation. Nucleic Acids Res. 2013;41:D1228–33.

    Article  CAS  PubMed  Google Scholar 

  36. Lanczos C. An iteration method for the solution of the eigenvalue problem of linear differential and integral operators. J Res Natl Bur Stand. 1950;45:255–82.

    Article  Google Scholar 

  37. Lunn M, McNeil D. Applying Cox regression to competing risks. Biometrics, 524–32 (1995).

  38. Ishwaran H, Kogalur UB, Blackstone EH, Lauer MS. Random Survival Forests. (2008).

  39. Sha D, et al. Tumor mutational burden as a predictive biomarker in solid tumors. Cancer Discov. 2020;10:1808–25.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  40. Lee H, et al. Prognostic implications of type and density of tumour-infiltrating lymphocytes in gastric cancer. Br J Cancer. 2008;99:1704–11.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  41. Martin L, et al. Examining the technique of angiogenesis assessment in invasive breast cancer. Br J Cancer. 1997;76:1046–54.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  42. Preisser F, et al. Extent of lymph node dissection improves survival in prostate cancer patients treated with radical prostatectomy without lymph node invasion. Prostate. 2018;78:469–75.

    Article  CAS  PubMed  Google Scholar 

Download references

Acknowledgements

We thank Ben Sidders and Paul Metcalfe for their helpful suggestions, which improved the manuscript, the R&D IT BIKG team for their help with the BIKG data and establishing the platform, and D.J. Shuman for editing help.

Funding

This study was funded by AstraZeneca.

Author information

Authors and Affiliations

Authors

Contributions

E.J., C.B., I.K., and N.M. have made contributions to the conception of this study. C.F., G.A. I.K., K.B., and N.M. contributed to the design of this study. C.F. and G.A. contributed to the analysis of the data. C.F., G.A., I.K., K.B., and N.M. contributed to the interpretation of the data. C.F, and K.H. contributed to the creation of new software used in this work. C.F., G.A., K.B., K.H., I.K., and N.M. have drafted and revised the manuscript.

Corresponding authors

Correspondence to Krishna C. Bulusu or Natasha Markuzon.

Ethics declarations

Competing interests

All authors were employees of AstraZeneca when this work was performed and may have stock ownership, options, or interests in the company.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary Material 1

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it.The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Fang, C., Arango Argoty, G.A., Kagiampakis, I. et al. Integrating knowledge graphs into machine learning models for survival prediction and biomarker discovery in patients with non–small-cell lung cancer. J Transl Med 22, 726 (2024). https://doi.org/10.1186/s12967-024-05509-9

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1186/s12967-024-05509-9