Data collection and preprocessing
The mRNA expression data in fragments per kilobase of transcript per million mapped reads (FPKM) format and the corresponding clinical information of LUAD patients were retrieved from The Cancer Genome Atlas (TCGA) database (https://portal.gdc.cancer.gov/). AS data of LUAD were downloaded from TCGA SpliceSeq (https://bioinformatics.mdanderson.org/TCGASpliceSeq) [14]. The percent-spliced-in (PSI) value, the ratio of reads supporting inclusion of an exon to the total reads supporting either inclusion or exclusion, was calculated to quantify each detected ASE. According to splicing patterns, ASEs were classified into seven types: exon skip (ES), retained intron (RI), alternate donor site (AD), alternate acceptor site (AA), alternate promoter (AP), alternate terminator (AT), and mutually exclusive exons (ME) (Fig. 1a). Only ASEs with PSI values available in more than 70% of samples were included in this study. Missing values were imputed with the R package impute.
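The availability filter and imputation step can be sketched as follows. This is an illustrative Python sketch with a toy PSI matrix; the study itself used the KNN-based R package impute, for which the row-mean fill below is only a simplified stand-in.

```python
import numpy as np

def filter_and_impute(psi, min_frac=0.7):
    """Keep ASEs (rows) whose PSI is available in more than min_frac of
    samples, then fill remaining NaNs with the row mean (a simplified
    stand-in for the KNN imputation of the R `impute` package)."""
    frac_present = np.mean(~np.isnan(psi), axis=1)
    kept = psi[frac_present > min_frac]
    row_means = np.nanmean(kept, axis=1, keepdims=True)
    return np.where(np.isnan(kept), row_means, kept)

# Toy PSI matrix: 3 ASEs x 4 samples (NaN = missing value)
psi = np.array([[0.8, 0.9, np.nan, 0.7],
                [np.nan, np.nan, 0.5, np.nan],   # only 25% available: dropped
                [0.2, 0.3, 0.25, 0.35]])
filtered = filter_and_impute(psi)
print(filtered.shape)  # (2, 4)
```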
The name of each ASE consists of three parts: the gene symbol, the ID number designated in the TCGA SpliceSeq database, and the splicing type. For example, the name “CHEK-19309-AP” indicates that the parent gene of the event is CHEK, its ID number in the TCGA SpliceSeq database is 19309, and its splicing type is AP.
Machine learning algorithms
Random forests
Random forests is an ensemble learning technique that makes predictions by constructing multiple unpruned decision trees, each grown on a bootstrap sample of the training data using a randomly selected subset of variables [15]. The tree structure of random forests can be denoted as:
$$\{h(X, \psi(t)); t = 1, \ldots, T\}.$$
In the formula, X is an input vector and ψ(t) represents the independent trees of the random forest; each tree casts a unit vote for the most popular class for X. The decisions of all trees are then aggregated, and the class of X is determined by majority voting. This supervised non-parametric machine learning method can extract key information from massive, complex data and resists both overfitting and underfitting [16].
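The majority-voting step can be illustrated in a few lines. This is a minimal Python sketch with hypothetical per-tree predictions; in the study, the voting is handled internally by the randomForest R package.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Aggregate unit votes from individual trees; the class of X is the
    one receiving the most votes."""
    return Counter(tree_predictions).most_common(1)[0][0]

# Hypothetical votes cast by T = 5 trees for one sample X
votes = ["LUAD", "normal", "LUAD", "LUAD", "normal"]
print(majority_vote(votes))  # LUAD
```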
Boruta
The Boruta algorithm is a random-forest-based feature selection method. It estimates the importance of features and captures the important ones in a dataset [17]. Through the following workflow, the algorithm finds all features that have either strong or weak relevance to the outcome variable: (1) Boruta duplicates every attribute of the given dataset and shuffles the values of each duplicated copy to remove its association with the outcome; these added attributes are called shadow features. (2) It trains a random forest classifier on the extended dataset and gathers the importance of each feature, measured by a Z-score. The Z-score is computed by dividing the mean decrease in accuracy by its standard deviation (SD); the higher the Z-score, the more important the feature. (3) The algorithm then checks whether a real feature has a higher Z-score than the maximum Z-score among the shadow attributes. If not, the real feature is deemed unimportant and removed, and another iteration begins. (4) These steps repeat until importance is assigned to all features or the algorithm reaches the preset limit of runs.
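Steps (1) and (3) above can be sketched as follows. This is an illustrative Python sketch, not the Boruta R package used in the study: the Z-scores here are hypothetical placeholders, whereas the real algorithm derives them from repeated random forest fits.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_shadow_features(X):
    """Step (1): duplicate every attribute and shuffle each copy
    column-wise, destroying its association with the outcome."""
    shadows = rng.permuted(X, axis=0)   # shuffle each column independently
    return np.hstack([X, shadows])

def boruta_decision(z_real, z_shadow):
    """Step (3): a real feature is retained only if its Z-score exceeds
    the maximum Z-score among the shadow features."""
    return z_real > z_shadow.max()

X = rng.normal(size=(5, 3))
X_ext = add_shadow_features(X)
print(X_ext.shape)  # (5, 6): 3 real + 3 shadow features

# Hypothetical Z-scores for the 3 real and 3 shadow features
keep = boruta_decision(np.array([4.2, 0.3, 2.9]), np.array([1.1, 0.8, 1.5]))
print(keep)  # [ True False  True]
```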
Random survival forests (RSF)
RSF is an extension of the original random forest technique to survival data [18]. Building on random forests, RSF splits each decision tree node on the predictor that maximizes the survival difference between the daughter nodes; in this study, the differences were measured by a log-rank splitting rule. When a tree reaches its terminal nodes, the cumulative hazard function (CHF) of each terminal node is estimated with the Nelson–Aalen estimator:
$$N_{b,h}(t) = \sum_{t_{l,h} \le t} \frac{d_{l,h}}{I_{l,h}}.$$
In this formula, h, b, and t refer to the terminal node, the survival tree, and time, respectively; \(d_{l,h}\) denotes the number of deaths and \(I_{l,h}\) the number of patients at risk at the distinct event time \(t_{l,h}\). The same CHF is assigned to all cases in h. An ensemble CHF is then computed over the B trees of the survival forest for a given d-dimensional case \(x_{i}\):
$$H_{e}^{s}(t,x_{i}) = \frac{1}{B} \sum_{b=1}^{B} \sum_{h \in T(b)} H_{b,h}^{s}(t,x_{i})$$
\(H_{b,h}^{s} (t,x_{i} )\) in the above formula is calculated as [19]:
$$H_{b,h}^{s}(t,x_{i}) = \begin{cases} N_{b,h}(t) & x_{i} \in h \\ 0 & \text{otherwise.} \end{cases}$$
Through these methods, RSF adapts the traditional random forest algorithm to survival problems. In the present study, these procedures were carried out with the randomForestSRC R package; using its Surv and var.select functions, we preliminarily screened out survival-related ASEs.
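The Nelson–Aalen estimator above reduces to a short cumulative sum. The following Python sketch evaluates it on toy event counts for one terminal node; the actual estimation in the study is performed internally by randomForestSRC.

```python
import numpy as np

def nelson_aalen(event_times, at_risk, deaths, t):
    """Nelson-Aalen estimate of the CHF at time t for one terminal node:
    N(t) = sum over distinct event times t_l <= t of d_l / I_l."""
    mask = np.asarray(event_times) <= t
    return float(np.sum(np.asarray(deaths)[mask] / np.asarray(at_risk)[mask]))

# Toy node: events at t = 1, 3, 5 with d_l deaths among I_l patients at risk
times, risk, d = [1, 3, 5], [10, 8, 5], [1, 2, 1]
print(nelson_aalen(times, risk, d, t=3))  # 1/10 + 2/8 = 0.35
```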
Selection models
Cox regression model
The Cox regression model analyzes the effects of several variables on survival simultaneously. Under the proportional hazards assumption, the hazard functions of different individuals are proportional and the effects of covariates are constant over time. The Cox regression model can be formulated as:
$$h(t,X) = h_{0} (t ) {\text{exp}}\left( {\mathop \sum \limits_{i = 1}^{m} \beta_{i} X_{i} } \right).$$
In this formula, h0(t) is the baseline hazard function, t is the time variable, and βi is the coefficient weighting the contribution of covariate Xi.
Forward selection model
We used a forward selection model to select prognosis-related ASEs from the survival-related ASEs. The selection was performed with the rbsurv R package as follows [20]: (1) The dataset was randomly divided into a training set (3/4 of all samples) and a validation set (1/4 of all samples). Each ASE in turn was fitted to the training set to obtain the parameter estimate \(\hat{\beta }_{i}^{0}\), and \(\hat{\beta }_{i}^{0}\) together with the validation set was used to evaluate the log-likelihood. (2) The above procedure was repeated 100 times, yielding 100 log-likelihoods for each ASE. The ASE with the largest mean log-likelihood was selected as the best, i.e., the most survival-associated, ASE. The next best ASE was then selected by repeating the previous steps, giving the optimal two-ASE model with the largest mean log-likelihood. (3) This forward selection continued until further fitting was impossible, yielding a series of models. The Akaike information criterion (AIC) was calculated for each model to avoid overfitting, and the model with the minimal AIC was selected as the final model.
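The greedy loop in steps (1)-(3) can be sketched as follows. This is an illustrative Python skeleton, not the rbsurv R implementation: `mean_loglik` is a hypothetical scoring function standing in for the 100 repeated train/validation Cox fits per candidate ASE.

```python
def forward_select(candidates, mean_loglik, max_size=3):
    """Greedily add the ASE giving the largest mean log-likelihood, then
    pick the model minimizing AIC = 2k - 2*logL (k = model size)."""
    selected, models, pool = [], [], list(candidates)
    while pool and len(selected) < max_size:
        best = max(pool, key=lambda a: mean_loglik(selected + [a]))
        selected = selected + [best]
        pool.remove(best)
        aic = 2 * len(selected) - 2 * mean_loglik(selected)
        models.append((list(selected), aic))
    return min(models, key=lambda m: m[1])

# Toy additive score: ASE "C" adds almost nothing, so AIC rejects it
scores = {"A": 5.0, "B": 3.0, "C": 0.1}

def toy_loglik(subset):
    return sum(scores[a] for a in subset)

best_model, best_aic = forward_select(["A", "B", "C"], toy_loglik)
print(best_model, round(best_aic, 2))  # ['A', 'B'] -12.0
```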
Workflow of the current study
Preliminary filtering
SD reflects the dispersion, and hence the information content, of a feature: the greater the SD, the more informative the feature. To filter out less informative ASEs and to reduce the computational burden of subsequent analyses, we computed the SD of every ASE in the dataset and excluded ASEs with SD < 0.1. ASEs with a mean PSI ≤ 0.05 were also excluded.
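These two cutoffs amount to a simple row filter on the PSI matrix. A minimal Python sketch on a toy matrix (the study applied the same thresholds in R):

```python
import numpy as np

def preliminary_filter(psi, min_sd=0.1, min_mean=0.05):
    """Keep ASEs (rows) with SD >= 0.1 and mean PSI > 0.05, i.e. exclude
    those with SD < 0.1 or mean PSI <= 0.05."""
    sd = np.nanstd(psi, axis=1)
    mean = np.nanmean(psi, axis=1)
    return psi[(sd >= min_sd) & (mean > min_mean)]

psi = np.array([[0.5, 0.9, 0.1],      # informative: high SD, kept
                [0.50, 0.51, 0.52],   # SD < 0.1: excluded
                [0.02, 0.03, 0.01]])  # mean PSI <= 0.05: excluded
print(preliminary_filter(psi).shape)  # (1, 3)
```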
Identification of LUAD-related ASEs implicated in splicing switch
First, to circumvent the problems caused by severely imbalanced data (10.3% normal, 89.7% LUAD) during learning, we balanced the proportions of normal and LUAD samples by oversampling normal samples with the ovun.sample function of the ROSE R package, generating augmented data with a balanced class distribution. Second, we applied the Boruta algorithm to select ASEs important for distinguishing normal from LUAD samples. Third, to further explore ASEs implicated in splicing switch, we separately analyzed the correlations among LUAD-related ASEs derived from the same gene.
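The oversampling step can be sketched as below. This is a simplified Python stand-in for ROSE's ovun.sample (method = "over"): minority samples are randomly duplicated until the classes are balanced.

```python
import random

random.seed(0)  # reproducible duplicate choices

def oversample_minority(samples, labels, minority="normal"):
    """Randomly duplicate minority-class samples until both classes have
    equal size (a simplified stand-in for ROSE's ovun.sample)."""
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    majority_n = len(labels) - len(minority_idx)
    extra = [random.choice(minority_idx)
             for _ in range(majority_n - len(minority_idx))]
    idx = list(range(len(labels))) + extra
    return [samples[i] for i in idx], [labels[i] for i in idx]

X = ["s1", "s2", "s3", "s4", "s5"]
y = ["LUAD", "LUAD", "LUAD", "LUAD", "normal"]  # toy 4:1 imbalance
Xb, yb = oversample_minority(X, y)
print(yb.count("LUAD"), yb.count("normal"))  # 4 4
```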
Construction of a classifier for recognizing lymph node metastasis (LNM)
Applying the Boruta algorithm, we selected ASEs correlated with the outcome variable (presence or absence of LNM). Using these ASEs, we performed nested five-fold cross-validation based on the random forest model: the number of ASEs (ranked by variable importance from the Boruta analysis) was sequentially reduced, and the process was repeated for five rounds. The mean cross-validation error was calculated, and the classifier with the minimum error rate was chosen. The classifier’s classification capacity was then evaluated by cross-validation.
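The sequential feature-reduction loop can be outlined as follows. This is a hypothetical Python skeleton: `cv_error` stands in for a five-fold random-forest evaluation, and the halving schedule is only one plausible reduction scheme, not necessarily the one used by the authors' R code.

```python
def select_feature_set(ranked_ases, cv_error):
    """Prune importance-ranked ASEs step by step and keep the feature
    set with the minimum mean cross-validation error."""
    candidates = []
    n = len(ranked_ases)
    while n >= 1:
        subset = ranked_ases[:n]   # keep the top-n most important ASEs
        candidates.append((subset, cv_error(subset)))
        n //= 2                    # halve the feature count each step
    return min(candidates, key=lambda c: c[1])[0]

errors = {4: 0.20, 2: 0.12, 1: 0.25}  # toy mean CV error per set size
best = select_feature_set(["a", "b", "c", "d"], lambda s: errors[len(s)])
print(best)  # ['a', 'b']
```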
Model construction and functional enrichment analysis
Cox regression is the traditional method for survival analysis with a readily interpretable output, whereas RSF provides more insight into the relative importance of model covariates [21]. Based on previous results, combining the two methods can produce results with higher confidence than either alone [21]. Therefore, RSF and Cox regression were performed for each ASE, and only ASEs that were survival-related under both methods were selected. Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis and Reactome pathway analysis were performed to explore the functional categories of the parent genes of survival-related ASEs using the Cytoscape (version 3.7.2) plug-in ClueGO (version 2.5.5) [22, 23]. The dataset was further divided into a training set (3/4 of all samples) and a test set (1/4 of all samples). We used the forward selection model to identify prognosis-related ASEs and multivariate Cox regression to construct the prognosis model. The final model was evaluated in the internal test set. Relationships between the prognostic model and clinicopathological variables were analyzed in the entire set using the Wilcoxon test. P < 0.05 was considered statistically significant.
Splicing network analysis
A total of 390 SFs were retrieved from the SpliceAid 2 database (http://193.206.120.249/splicing_tissue.html) [24]. Univariate Cox analysis was used to identify survival-related SFs, with P < 0.01 considered significant. Spearman correlation analysis was conducted to evaluate the correlations between survival-related SFs and ASEs; the criterion for selecting correlated variables was P < 0.01 and |coefficient| > 0.2. Finally, the correlations were visualized with Cytoscape.
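The Spearman coefficient used for the |coefficient| > 0.2 criterion is simply the Pearson correlation of the ranks. A minimal Python sketch on hypothetical SF expression and ASE PSI values (the study computed this in R; this toy version ignores ties):

```python
import numpy as np

def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the ranks
    (no tie handling in this toy version)."""
    rx = np.argsort(np.argsort(x))
    ry = np.argsort(np.argsort(y))
    return float(np.corrcoef(rx, ry)[0, 1])

# Hypothetical SF expression vs. ASE PSI across 5 patients
sf = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
psi = np.array([0.9, 0.8, 0.6, 0.5, 0.2])  # monotonically decreasing
rho = spearman_rho(sf, psi)
print(abs(rho) > 0.2)  # True: passes the |coefficient| > 0.2 criterion
```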
Software
All statistical analyses were conducted with R software (version 3.6.3). The R packages Boruta, randomForest, randomForestSRC, caret, survival, rbsurv, pROC, timeROC, pheatmap, and ggplot2 were used for data analysis and visualization. All code is available on GitHub (GitHub, Inc., San Francisco, California) at https://github.com/cqd1308/JTM-Rscript-ASsignatures. Cytoscape was used to conduct the functional and pathway enrichment analyses and to plot the network graph.