Ensemble methods of rank-based trees for single sample classification with gene expression profiles

Building Single Sample Predictors (SSPs) from gene expression profiles presents challenges, notably due to the lack of calibration across diverse gene expression measurement technologies. However, recent research indicates the viability of classifying phenotypes based on the order of expression of multiple genes. Existing SSP methods often rely on Top Scoring Pairs (TSP), which are platform-independent and easy to interpret through the concept of “relative expression reversals”. Nevertheless, TSP methods face limitations in classifying complex patterns involving comparisons of more than two gene expressions. To overcome these constraints, we introduce a novel approach that extends TSP rules by constructing rank-based trees capable of encompassing extensive gene-gene comparisons. This method is bolstered by incorporating two ensemble strategies, boosting and random forest, to mitigate the risk of overfitting. Our implementation of ensemble rank-based trees employs boosting with LogitBoost cost and random forests, addressing both binary and multi-class classification problems. In a comparative analysis across 12 cancer gene expression datasets, our proposed methods demonstrate superior performance over both the k-TSP classifier and nearest template prediction methods. We have further refined our approach to facilitate variable selection and the generation of clear, precise decision rules from rank-based trees, enhancing interpretability. The cumulative evidence from our research underscores the significant potential of ensemble rank-based trees in advancing disease classification via gene expression data, offering a robust, interpretable, and scalable solution. Our software is available at https://CRAN.R-project.org/package=ranktreeEnsemble.


Introduction
The heterogeneity of cancers necessitates the precise classification of patients into correct cancer subtypes for both prognosis and effective treatment.In the past two decades, the utilization of gene expression profiles has increasingly demonstrated success in identifying cancer subtypes [1][2][3][4][5].Numerous studies have highlighted the potential of using gene expression profiles for cancer tissue classification, leveraging both statistical and machine learning models.However, these models often encounter challenges in data transformation, normalization, and management of batch effects, which can significantly impact their performance [6][7][8][9].A notable issue is "test set bias", where predictions for an individual patient vary depending on the patient sample group used in the normalization process, rather than reflecting the patient's unique characteristics [10].
An emerging alternative for single sample classification is the Single Sample Predictor (SSP) approach [11][12][13][14][15].This method offers significant advantages, such as the ability to utilize samples from diverse gene expression platforms without the need for calibration.SSPs enable personalized predictions by focusing on the unique attributes and contexts of individual samples, rather than relying on aggregated or generalized trends from larger datasets [16,17].Consequently, SSP methods are promising for developing precise and robust classification rules that are effective across various studies and platforms.
Typically, SSP methods utilize either nearest centroids methods [11,12] or rank statistics of gene pairs [18,19], the latter often being referred to as Top Scoring Pairs (TSP) based methods [20,21].Centroid-based methods classify samples based on proximity to the nearest centroid in feature space, typically using distance metrics like Euclidean distance.Although intuitive and effective in cases with distinct class centroids, they may underperform with overlapping classes or complex class boundaries.Furthermore, these methods were not primarily designed for individual sample concordance, leading to potential inconsistencies in patient-to-molecular subtype assignments [22].In contrast, TSP methods and their extensions [19,[23][24][25][26] offer scalability, interpretability, and robust feature selection.They generate gene rules by comparing expression values within a single sample, thus avoiding normalization with another dataset.However, their classification accuracy has often been suboptimal, limiting their clinical applicability and underscoring the need for more accurate and robust decision models.
In this study, we propose an advancement of TSP methods through the construction of rank-based trees combined with ensemble learning techniques.A singlesplit tree is analogous to a TSP classifier, and developing deeper trees represents the integration of multiple TSPs for formulating a comprehensive decision rule.To mitigate overfitting, we create multiple trees and ensemble them using techniques such as random forests and gradient boosting, thereby expanding the TSP framework from basic one-to-one gene comparisons to a more complex many-to-one or many-to-many interaction model.Our approach not only enhances the TSP method but also leverages the strengths of ensemble learning.Building upon the work of [27], who demonstrated a basic random forest strategy comparable to the k-TSP method, our paper extends this by employing multi-class trees with class-balanced sampling.This strategy improves computational efficiency and prediction performance.Moreover, we extract interactive ranked gene pairs from our random forest model for added interpretive depth.To maximize predictive power, we meticulously tune and compare various parameters for tree construction and ensemble strategies.Additionally, recognizing the prevalence of noise and redundancy in gene expression data, we implement dimension-reduction techniques.These techniques are crucial for eliminating irrelevant features and isolating the most informative and discriminative patterns, thereby facilitating more efficient analysis and interpretation.

Rank-based trees
In this section, we introduce a general framework for rank-based trees using pairwise gene comparisons among a number of gene expressions.Let X = (X 1 , X 2 , . . ., X P ) denote the expression values of P genes on an expression matrix, which could be generated from different platforms (see Fig. 1 subfigures A and B for conceptual illustration).Our objective is to use X to distinguish among K phenotypes for the cells in the tissue, denoted as Y ∈ {1, . . ., K } .(Since the boosting algorithm only accommodates binary outcomes, we denote Y ∈ {−1, 1} for the boosting case.)A tree classifier is inferred from training data L = {(X (1) , Y (1) ), . . ., (X (N ) , Y (N ) )} , where (X i , Y i ) are independently distributed.For a given expres- sion vector x , a classifier h associates it with a label h(X) ∈ {−1, 1} .We denote the tree predictor of h(x) as h(x, �, L) , where a parameter vector � = (θ 1 , θ 2 , . . ., θ T ) associates the parameter θ t with the t-th terminal nodes and T denotes the total number of terminal nodes.
To grow a rank-based classification tree, the splitting rule can be described as follows.If p = (p 1 , . . ., p K ) are the class proportions of outcome Y for classes 1 through K, the Gini index of impurity is defined as As shown in Fig. 1C, by splitting features recursively into left and right daughter nodes, a tree is grown by minimizing tree impurity.The Gini index split statistic for a split on node s on a pair of features X i and X j at a given tree node is where the subscripts l = {X j ≤ X k } and r = {X j > X k } denote the left and right daughter nodes formed by the split at s and n l and n r are the sample sizes of the two daughter nodes; n = n l + n r is the parent sample size.
With some algebra, this is equivalent to maximizing the split statistic where n k,l is the number of cases of class k in the left daughter node and n k is the number of cases of class k; n = K k=1 n k is the total sample size.At tree node s, we randomly select a set of candidate features X (s) = {X 1 ′ , . . ., X Q ′ } , Q ′ ≤ P , and the pair of variables with indices (i s , j s ) will be split if We partition the expression values into a set of gene pairs for constructing splits in the tree nodes and trees are built in a binary fashion: each internal node has an associated splitting rule that uses two predictors, X i and X j , to assign a observation k to either its left or right child nodes, The terminal nodes thus identify a partition of the observation space according to the subdivision defined by a series of splitting rules.For each terminal node t, we can arrange the variable indices in pairs {(i 1 , j 1 ), . . ., (i t , j t )} , t = 1, . . ., T − 1 , such that θ t = {x : x i 1 < x j 1 , . . ., x i t < x j t } .For a binary outcome Y ∈ {−1, 1} , we calculate the estimated proba- bility for a given x as the proportion of class label 1 at the (1)

Random Rank Forest
The rank-based trees could be of low accuracy with high variance.To prevent overfitting, we first ensemble these trees in a fashion of random forest [28,29].As in [28], we define a collection of randomized tree predictors {h(•, � m , L), m = 1, . . ., M} .We denote the mth tree pre- dictor of h(x) as h(x, � m , L) , m = 1, . . . ,M , where { m } are independent identically distributed random quantities encoding the randomization needed for constructing a tree, which are selected prior to grow the tree.These pre-selected parameters are refered to as tuning parameters and discussed in the Discussion section.The tree predictors are combined to form the finite forest estimator of h(x) as and h(x) = arg max k∈{1,...,K } pk (x).
Although random forest offers the advantage of achieving high levels of accuracy, the decision rules become extremely complex after averaging the rank based trees, which motivates us to extract information from the blackbox to increase interpretability.Since each terminal node of a tree can be viewed as a classification rule from multiple TSPs, we propose the Algorithm 1 to identify some importance classification rules.Note that each tree in a random forest algorithm is fitted from a bootstrap sample of the original data, leaving approximately 1− 0.632 = 0.368 out-of-sample data for each tree which is called out-of-bag (OOB).This data can be utilized to estimate the prediction performance and obtain an OOB prediction error without the need for an additional crossvalidation step to evaluate the prediction error.Here we calculate the OOB prediction error for each terminal node for selecting rules in Algorithm 1.

Algorithm 1 Extracting rules via rank-based trees
Rules from Algorithm 1 are constructed with multiple TSPs so they are high-order classification rules.The classification rule is more interpretable than the classic permutation-based variable importance from random forests and might contribute to biological understanding.In our empirical studies, the top rules tend to be more complex than the simple decision rules from TSP methods, and less number of rules are needed to achieve comparable results as the k-TSP method.Although we can aggregate these rules in the fashion of TSP methods, we found that random rank forests always show better prediction performance.Therefore, classification rules only serve the purpose of interpretation, instead of prediction.

Boosting with the LogitBoost cost
As another ensemble technique, boosting [30,31] has been used as a powerful tool for classification, especially in highdimensional settings.As weak learners, random rank trees are ensembled according to a LogitBoost cost function [32] In each iteration m, a regression tree is fit using the negative gradient of C(y i , F (x i )) as working responses For a tree with S terminal nodes, the update uses a refined optimization with unique estimates for each terminal node: where Note that unlike Eq. ( 1) for a classification tree, the splitting rule for partition θ s,m is similar to a regression tree [33], which maximizes where the subscripts l = {X j ≤ X k } and r = {X j > X k } denote the left and right daughter nodes for s; zl and zr denote the average of z m (x i ) in the corresponding daugh- ter nodes.After M iterations from Eq. ( 4), the final predictor . (4) For an outcome with K class labels, we encode the data into K "one against all" datasets with the outcomes {Y = k} and {Y � = k} to compute pk (x).

Ensemble Algorithm with reduced dimension
The challenge of rank-based tree method is high dimensionality.When we have p genes, there are O(p2 ) calcula- tions involved in Eq. ( 2) for constructing tree nodes.As a solution, we propose a two-step ensemble algorithm, in which the first ensemble step is to reduce dimensionality and the second ensemble step is to predict the outcome.

Algorithm 2 Ensemble Algorithm with Reduced Dimension
123 For variable selection, we have to construct a variable importance (VIMP) measurement based on a loss function.For classification problems, measures of performance used are the misclassification error or the Brier score [34][35][36].For the latter, we have To measure VIMP, we grow each tree using a bootstrap sample of the original data and the previously mentioned OOB data is used to calculate the loss function under the original OOB data and the permuted OOB data.Let L OOB be the OOB data and let pk ( x(ij) ) be the estimator for permuted x where the relationship of X i and X j is swapped in all the rank-based trees, which can be achieved by permuting the ith and jth columns in L OOB .The VIMP for gene pairs X i and X j is defined as Utilizing VIMP, the two-step ensemble algorithm is described in Algorithm 2.

Gene expression data and evaluation methods
In the next section, we evaluate the effectiveness of our ensemble methods of rank-based trees, as depicted in Fig. 1D, on gene expression datasets of both binary and multi-class outcomes.In this regard, we gathered 12 publicly accessible gene expression datasets, with sample sizes ranging from 22 to 587 and numbers of genes ranging from 85 to 2526.Table 1 summarizes these datasets, which are all related to studies of human cancer, including liver, central nervous system, brain, prostate, lymphoma, breast, small round blue cell tumors, leukemia, lung and bladder.Further information can be obtained by referring to the relevant publications.The last dataset studies the classification of triple negative breast cancer (TNBC) with four subtypes [37], including two basallike (BL1 and BL2) subtypes, a mesenchymal (M) subtype, and a luminal androgen receptor (LAR) subtype.To evaluate the prediction performance of our methods in cross-platform scenarios, we also downloaded the TNBC datasets generated from RNA sequencing in [38] with a sample size of 26; in [39] with a sample size of 475; and in the Cancer Genome Atlas database [40] with a sample ( 7) size of 136.The dataset in [37] was generated from the Affymetrix (Affy) GeneChip microarray; therefore, our training dataset and test dataset are from different platforms.

Other SSP methods and algorithm implementation
Beside the k-TSP method, we also compared our methods with the nearest template prediction (NTP) method, which compares the gene expression profile of a single sample to a pre-defined set of gene expression profiles, known as templates.The subclass label can be determined using a distance metric (e.g.cosine distance, Euclidean distance, etc.) as the similarity to each template [14].In our comparison, k-TSP was implemented from the "switchbox" R package [52], in which the optimal number of gene pairs was selected from a range of values from 2 to 10 with fivefold cross-validation.For multiclass classification, a one-vs-one scheme was used and a classifier was trained for each pair of subclasses [53].
To avoid ties in majority voting, only odd numbers were considered during training.We implemented the NTP method with the "CMScaller" package [54], which was originally created for classifying colorectal cancer preclinical models [4,55].The prediction for each sample was determined using the sample's closest cosine distance to each template.We utilized the "gbm" R package [56] for implementing our boosting algorithm and the "randomForestSRC" R package for our random forest algorithm [57].For the random forest implementation, we adopted the multi-class tree with class-balanced sampling instead of fitting separate one-versus-rest models for each class [27] to improve computational efficiency and prediction performance.We noticed that there are other classical methods available, such as k-nearest neighbor (KNN) and support vector machines (SVM).We did not present the results in Section 4 because the comparison was already presented in Tan et al. [21] and showed that k-TSP works superior comparable to KNN and SVM (see Tables 3 and 4 in Tan et al. [21], and we have the same conclusion with them).We also tried random forest/boosted trees using single gene features, the results of which are similar to SVM, and we did not include the results due to limited space.

Performance measures
Given a dataset with sample size N and an outcome of K classes, let c ij be the number of samples belonging to class i that are predicted to the jth class and the sample size for class i is denoted as n i = K j=1 c ij (see Fig. 2).The perfor- mance measure is defined as accuracy (ACC):

Table 1 Binary and multi-class datasets of gene expression profiles for cancer discrimination
a CNS central nervous system, AODs anaplastic oligodendrogliomas, NHL Non-Hodgkin's lymphoma, SRBCTs small round blue cell tumors, ALL acute lymphoblastic leukemia, TNBC triple negative breast cancer b N stands for number of samples, P for number of genes and K for number of classes c We downloaded other three datasets [38][39][40] and trained our models on data from one platform (e.g.microarray) while tested its prediction performance on data from another platform (e.g.RNA-seq) Note that ACC is highly influenced by the imbalanced sample sizes among different classes.Therefore, we subsample or bootstrap the data such that n i /N ≈ 1/K .All datasets were randomly divided into class-balanced training (70%), validation (15%), and test data (15%).
To evaluate the robustness and assess the performance of the methods, we fitted the four models on the training data, used the validation data for tuning parameters, and compared the ACC values on the corresponding test data.We replicated this procedure 50 times to compare the ACC values.

Results
Figure 3 summarizes the ACC results of our proposed methods, random rank forest (RRF) and boosting algorithm with the LogitBoost cost (Boosting), on the benchmark datasets in Table 1  Both boosting and random forest have proven to be successful in our real-data applications.Their effectiveness stems from their ability to handle high-dimensional complex relationships, reduce overfitting, and provide robust predictions by leveraging ensemble methods.However, the choice between boosting and random forest depends on the specific dataset characteristics, and it is often a matter of empirical evaluation to determine which method performs better for a given task.We recommend random forest over boosting for multiclass problems and large-size datasets since the boosting model has to transform multiclass outcomes into binary outcomes to calculate loss function and trees in boosting models are sequentially grown instead of parallelly grown.The k-TSP outperforms the NTP method because it can be more robust to noise and outliers.By considering multiple top scoring pairs, the influence of individual noisy or outlier templates is reduced, leading to more reliable predictions.On the other hand, the NTP method is more susceptible to the influence of outliers or noise in the template set because it relies on a single nearest template.
Table 2 shows the dimension reduction results from RRF and boosting.There are two stages for variable selection: gene selection in the initial stage and gene-pair selection in the subsequent stage, whose results were displayed as the number of genes and number of gene pairs selected, respectively.For the TNBC dataset, 49 common genes are identified after data preprocessing across different platforms, which are all considered informative variables by the algorithm.Although the prediction performance of boosting and random forest appears comparable, it is an interesting observation that boosting tends to select fewer variables than random rank forests.However, the variance of the total selected variables by RRF appears to be smaller than that observed with boosting.We posit that rank-based trees excel in borrowing information across different variables, resulting in a robust prediction performance despite variations in variable selection results.
As mentioned in the previous section, one advantage of RRF is its capacity to extract precise and easily understandable rules that offer biological insights into the classification process.We used the terminal nodes of rank-based trees as the candidate "simple decision rules" and adopted a similar algorithm of k-TSP [21] to rank and select these candidate rules.The result for the Liver dataset is listed in Table 3.These rules are different from those in the k-TSP methods since k-TSP only ranks gene pairs one by one, while rules from trees are combinations of multiple gene pairs.We found that this multivariate fashion can improve prediction accuracy with much fewer rules than k-TSP.The accuracy values from boosting, RRF, k-TSP and NTP are 0.89, 0.94, 0.90 and 0.80 for this dataset, while adopting only four rules in Table 3 could provide comparable accuracy of 0.85.The results obtained validate the findings of [41], which demonstrated that our method is capable of generating accurate and interpretable decision rules for effectively classifying microarray data.

Discussion
The results shown in the previous section are subject to specific tuning parameters, which are discussed in this section.Although the following results are problemspecific, they show some robustness of our model and provide some insight for the readers to customize grid search on their own.The following parameters are influential for optimizing the model's behavior and adapting it to specific datasets.We suggest systematically exploring the parameter space, evaluating different configurations, and selecting the optimal set of parameter values based on performance metrics.

Learning rate for boosting
The learning rate in boosting algorithms shown in Eq. ( 4) determines the contribution of each weak learner (e.g., rank-based tree) to the final ensemble   4 demonstrates the effect of the learning rate on the classification of the Liver dataset.Overall, the model is robust to learning rates in a wide range.It's important to note that the optimal learning rate for boosting depends on the specific dataset and problem at hand.We used cross-validation to determine the learning rate that achieves the best balance between convergence speed, accuracy, and robustness for a given dataset.

Number of trees/iterations M
The number of trees is an important parameter in both boosting algorithms and random forests.Increasing the number of trees tends to improve the model's performance since as more trees are added, the boosting model can better capture complex patterns and reduce both bias and variance errors; with more trees, the random forest ensemble becomes more robust and stable as it aggregates predictions from a larger number of diverse decision trees.Figure 5 demonstrates the influence of iteration/tree number on model performance for the Liver dataset, where a number of 250 seems sufficient for both random forest and boosting.For all the datasets, adding too many trees is unlikely to increase the risk of overfitting; however, increasing the number of trees also increases the computational cost of training and inference.Therefore, there is a trade-off between model performance and computational resources.From our empirical experimentation, an iteration number of 500 is sufficient for most datasets in both random forests and boosting and increasing the number larger than 1000 is unlikely to make any difference.

Depths of trees and terminal node size
The depth of trees, also known as the tree's maximum depth or tree size, plays a crucial role in growing rankbased trees.It has a similar influence as terminal node size since the deeper the tree is, the smaller the terminal node size is.6 and 7 for the Liver dataset, it is crucial to strike a balance between the tree depth and the model's generalization ability in both boosting and random forests.The optimal tree depth depends on the dataset characteristics, and we used cross-validation to determine the appropriate tree depth without a specific constraint on the terminal node size.Note that the tree depth of 1 in the first column of Fig. 6 for random forest is roughly equivalent to the k-TSP method since a tree of one split is equivalent to a top scoring pair.Figure 6 demonstrates that extending the k-TSP method via growing deeper trees and ensemble methods can achieve higher accuracy in prediction.

Number of competing variables q at each split
The number of competing gene pairs at each split, also known as feature subspace size, is defined in Eq. ( 2) denoted as q.A larger q will increase the computational cost.However, it does not hold much significance in boosting algorithms nor random forests.Boosting algorithms typically do not involve explicit feature subsampling at each split.Instead, they focus on sequentially adjusting the weights of training examples to improve the model's performance.Therefore, the number of competing variables at each split does not directly impact boosting.In random forests, the number of competing variables at each split determines the randomness and diversity among decision trees in the ensemble.A smaller number of competing features at each split helps to decorrelate the trees in the random forest ensemble and prevents a few dominant features from overshadowing 0.8 0.9  8, random forest is also robust to the number of competing variables since the total number of variables is large in genetic datasets.In other words, when q << p , the influence of q is small.

Conclusions
In this study, we introduce an advanced rank-based tree model that builds upon TSP methods, incorporating ensemble techniques such as boosting and random forests to achieve enhanced predictive power.This approach allows us to derive interpretable rules from the terminal nodes of rank-based trees, akin to TSP methods.Our classifiers, grounded in the ranking of gene expression values within individual profiles, remain robust against preprocessing effects.When tested across twelve diverse human cancer gene expression datasets, both binary and multi-class, our methods demonstrated marked superiority over traditional k-TSP and NTP classifiers.A notable feature of our Random Forest-derived rules is their succinctness, comprising fewer gene pairs while maintaining or surpassing accuracy in predictions.The strength of our approach lies in the multivariate capability of decision trees, which adeptly adjust for multiple ranked gene pairings.This ability to encapsulate intricate gene-target outcome relationships enables the learning of complex non-linear patterns and gene interactions.In contrast, conventional TSP methods, often restricted to basic if-then logic, may falter in capturing these complexities.Our method addresses the common issue of overfitting in tree models by integrating ensemble techniques, which enhances both the accuracy and robustness of the predictions.This integration avoids the complexities of tree construction rules, focusing instead on leveraging the collective strength of multiple decision trees [58].
Furthermore, these rank-based trees serve as fundamental units in ensemble methods, such as random forests and boosting algorithms.The aggregating of multiple trees in these methods not only improves prediction accuracy but also offers resilience against model biases.By employing data resampling techniques, we utilize class-balanced sampling strategies, effectively addressing the prevalent challenge of class imbalance in many datasets [27,[59][60][61].This approach offers a notable advantage over the one-versus-rest models, which, despite their appearance of treating class categories equally, still grapple with class imbalance within individual category models.
While tree-based algorithms offer optimization avenues, such as missing data imputation or feature importance analysis [62], our study also acknowledges certain limitations that warrant further exploration.One such area is the handling of ties in ranking variables.Our methods demonstrated reduced effectiveness in datasets with abundant zero values, suggesting the need for strategies like introducing artificial noise to enhance model performance [63].Another aspect for future refinement is the computational intensity of our dimension reduction step, which currently relies on random forest or boosting models, as opposed to more straightforward filter methods [64].Addressing these limitations will be pivotal in our ongoing efforts to refine and enhance the efficacy of rank-based tree methods for gene expression data classification.

Abbreviations
corresponding terminal node θ t , p(x) = P(Y = 1|x ∈ θ t ) and estimate E[Y |x] as f (x) = 2p(x) − 1 .The estimator for a multi-class outcome of K labels can be calculated as the proportion of the corresponding class label, p k (x) = P(Y = k|x ∈ θ t ) and the tree takes a Bayes clas- sifier h(x) = arg max k∈{1,...,K } P(Y = k|x ∈ θ s ).

Fig. 1 A
Fig. 1 A Examples of different platforms to obtain gene expression values.The venn diagram illustrates that certain prediction models are either unavailable or lack predictive accuracy when applied to data from particular platforms.B An illustration to show that gene expression values from different platforms are not comparable due to variations in chemistry, quantification, and normalization techniques used by each platform.C Illustration of a rank-based tree.D The methodological framework utilizing rank-based trees

Fig. 3
Fig. 3 Model performance on benchmark datasets

Table 2
Dimension reduction results from random forest and boosting a Total number of selected genes from Line 3 in Algorithm 2 b Total number of selected gene pairs from Line 9 in Algorithm 2 c Mean and standard deviation (SD) were calculated from 50 replications

Table 3
Classification rules from Algorithm 1 for the liver dataset Robustness of competing ranked pairs to model performance