Discriminant analysis and machine learning approach for evaluating and improving the performance of immunohistochemical algorithms for COO classification of DLBCL

Background Diffuse large B-cell lymphoma (DLBCL) is classified into germinal center-like (GCB) and non-germinal center-like (non-GCB) cell-of-origin groups, entities driven by different oncogenic pathways with different clinical outcomes. DLBCL classification by immunohistochemistry (IHC)-based decision tree algorithms is a simpler reported technique than gene expression profiling (GEP). There is a significant discrepancy between IHC-decision tree algorithms when they are compared to GEP. Methods To address these inconsistencies, we applied the machine learning approach considering the same combinations of antibodies as in IHC-decision tree algorithms. Immunohistochemistry data from a public DLBCL database was used to perform comparisons among IHC-decision tree algorithms, and the machine learning structures based on Bayesian, Bayesian simple, Naïve Bayesian, artificial neural networks, and support vector machine to show the best diagnostic model. We implemented the linear discriminant analysis over the complete database, detecting a higher influence of BCL6 antibody for GCB classification and MUM1 for non-GCB classification. Results The classifier with the highest metrics was the four antibody-based Perfecto–Villela (PV) algorithm with 0.94 accuracy, 0.93 specificity, and 0.95 sensitivity, with a perfect agreement with GEP (κ = 0.88, P < 0.001). After training, a sample of 49 Mexican-mestizo DLBCL patient data was classified by COO for the first time in a testing trial. Conclusions Harnessing all the available immunohistochemical data without reliance on the order of examination or cut-off value, we conclude that our PV machine learning algorithm outperforms Hans and other IHC-decision tree algorithms currently in use and represents an affordable and time-saving alternative for DLBCL cell-of-origin identification. Electronic supplementary material The online version of this article (10.1186/s12967-019-1951-y) contains supplementary material, which is available to authorized users.

Background Diffuse large B-cell lymphoma (DLBCL), also known as aggressive lymphoma, is the most common type of Non-Hodgkin lymphoma, and it can be classified into germinal center (GCB), activated B-cell (ABC), and mediastinal large B-cell lymphoma cell-of-origin (COO), the latest frequently called as unclassified (UC). The importance of these entities lies in the different driven intracellular oncogenic signaling pathways that lead to a distinct clinical outcome [1,2]. Drugs currently under clinical trials targeting the DLBCL COO groups are being tested to provide more specific therapeutic strategies, particularly for non-germinal center-like patients who present the worse outcome [3].
The development of techniques to assess the gene expression profiling (GEP) has allowed DLBCL COO classification. GEP is now widely considered to be the "gold standard, " but its generalized clinical application remains limited due to technical, financial and regulatory obstacles [4]. The requirement of fresh tissue, large initial capital costs, an ongoing necessity for skilled labor and consumables are prohibitive to many health institutions and diagnostic laboratories [5]. Because of this current impracticality of performing GEP assays, immunohistochemistry (IHC)-decision tree algorithms for DLBCL classification have been proposed to do COO identification economical and less time-consuming. Furthermore, GCB and non-GCB COO groups have been included as representative DLBCL subtypes in the 2016 revision of the World Health Organization classification of lymphoid neoplasms, where the use of IHC is suggested [3]. Thus, there is now an impending need for accurate, robust and affordable methods to identify COO groups [6].
Hans et al. [7] established the first non-automatic IHCdecision tree algorithm for DLBCL COO classification. This algorithm was constructed by a dichotomous classifier, with activated B-cell and mediastinal large B-cell lymphoma simplified as non-germinal center-like (non-GCB) molecular subgroup. Hans algorithm uses CD10, BCL6, and MUM1 expression to classify DLBCL into GCB (CD10 and/or BCL6 positive, MUM1 negative) or non-GCB subtype with the reverse pattern of staining. Reproducibility of this algorithm leads to research groups to propose alternate IHC strategies, such as Colomo [8], Nyman [9], Choi [10], Hans modified (Hans*) [11], modified Choi (Choi*) [11] and Visco-Young [12] with three (VY3) and four (VY4) antibodies, all reported to be predictive of outcomes. All these algorithms are based on a classical decision tree structure (Additional file 1: Figure  S1), and use two to five antibodies, such as CD10, BCL6, FOXP1, GCET1, and MUM1. Additionally, IHC algorithms using different antibodies have been reported [11,13,14].
Immunohistochemistry has been compared to GEP with regards to classification and prediction of clinical outcomes. Some studies report that IHC-decision tree algorithms correlate well with GEP in predicting prognosis [7,10,11], but others find a lack of prognostic utility in IHC-based COO assignment [15]. The conflicting concordance between IHC and GEP results may partly be due to the heterogeneity of studied samples (extranodal vs nodal DLBCL) and populations (treatment, patient age). Moreover, technical differences related to IHC analysis such as antigen retrieval, non-standardized scoring criteria for IHC, different primary antibodies employed in diagnostic laboratories, and batch-to-batch antibody variations may also be significant [5,16]. Others have tried to improved IHC algorithms with biomarker studies, but this implies an elevated cost in diagnostics [17][18][19]. Thereby, we decided to carry out a statistical description by discriminant analysis and to explore machine learning approach to increase IHC-based classification and GEP concordance.

Linear discriminant analysis
Linear discriminant analysis (LDA) is frequently applied to obtain a description of separability in a dataset and overall to evaluate the influence of each feature (predictor) over such separation of groups. LDA is a common analysis presented in cancer's basic and clinical research to evaluate the data clustering [20,21] including lymphoma studies [22]. In this study, LDA was applied to statistically describe the influence of the combination of the antibodies over the classification of GCB and non-GCB cases.

Classification by machine learning
Machine learning algorithms have been applied for neoplastic molecular subtype classification considering as input information derived from microarrays [23], and q-PCR [24].
In this study, we applied the machine learning approach, considering as features the same combination of antibodies as in each published IHC-decision tree algorithms, following Bayesian classifier (B), Bayesian simple classifier (BS), Naïve Bayesian classifier (BN), artificial neural networks (ANN), and support vector machines (SVM) structures. Bayesian classifier considers the minimal probability of misclassification. Bayesian simple classifier involves a conditional probability model which uses independent variables, whereas Naïve Bayesian classifier considers that each feature contributes independently to the probability of the class from any other feature. ANN combines training, learning and nonlinear functions to minimize the classification error. SVM builds a model that separate each category by a gap  17:198 where new sets can be predicted to belong to a category depending on which side of the gap they fall on [25]. As will be discussed, the use of automatic classification methods based on machine learning is an option to analyze the concordance and increase the accuracy between IHC and GEP. Here, we performed the first quantitative comparison between the most common nonautomatic IHC-decision tree algorithms against machine learning algorithms to derive the best diagnostic model.

IHC-decision trees and linear discriminant analysis of the public DLBCL database
Visco et al. [12] generated a database (from now called Visco-Young database) containing GEP, immunohistochemistry staining data (corresponding to CD10, BCL6, FOXP1, GCET1, and MUM1 antibodies), and clinical information of 475 de novo DLBCL patients who were treated with rituximab-CHOP chemotherapy (available at https ://www.natur e.com/artic les/leu20 1283#suppl ement ary-infor matio n).
Immunohistochemistry staining data of Visco-Young database was used to classify the DLBCL cases according to the eight reported IHC-decision tree algorithms, (decision trees with antibodies combination and cut-off values are detailed on Additional file 1: Figure S1). Classes were identified as GCB or non-GCB cell-of-origin groups, the later comprised by both activated B-cell (ABC) and unclassified (UC) cases. From here on, predicted classes are the results obtained for any classifier, and true classes consisted of the genetic expression profiling. To differentiate between GCB and non-GCB class, we defined the true positive (TP) result when a subject belonging to the true GCB class is identified as GCB and true negative (TN) result when a subject belonging to the true non-GCB class is identified as non-GCB.
We implemented LDA to describe the influence of each antibody over the linear separability of the database into the GCB and non-GCB classes. Considering numerical tags for each antibody: 1 = CD10, 2 = BCL6, 3 = FOXP1, 4 = GCTE1, and 5 = MUM1. We analyzed 25 possible combinations of the set of five antibodies as predictor sets; it includes the cases of such used by each one of the eight IHC-decision tree algorithms (Additional file 1: Table S1). For example, Nyman's algorithm includes FOXP1 and MUM1, it corresponds to the predictor set combination (3,5). Colomo and Hans use the same antibody combination (1,2,5). IHC algorithms include different cut-offs but this condition is indistinct for LDA, and combinations such as (1,2) not used for any of the IHCdecision trees but assessed in the LDA analysis.
The corresponding linear discriminant function (LDF) for each combination and each class is obtained with the weight associated to each antibody. IHC-decision trees were executed in scripts for Matlab (R2018a, MathWorks, USA) and LDA analysis were performed in Minitab (v18.1, Minitab, Inc., State College, PA, USA).

IHC-decision trees and LDA classification, comparative performance analysis
The performance of IHC-decision trees and LDA classification algorithms were evaluated by computing metrics such as accuracy (Acc), specificity (Spec), sensitivity (Sens), positive predictive value (PPV), negative predictive value (NPV), likelihood ratio for positive test results (LR+), likelihood ratio for negative test result (LR−) [26].

Classification of DLBCL in COO molecular subgroups by automatic classifiers
For the classification stage by machine learning algorithms, the Visco-Young database was split into training, testing, and validation data subsets (75%, 20% and 5% respectively). To preserve the same proportion of ranked patients and avoid the overfitting, the so-called k-fold cross-validation technique was applied. Testing and validation data subsets (VY subset) were merged to assess classification performance.
Each machine learning structure, B, BS, BN, ANN, and SVM was implemented considering as feature inputs the same antibodies combinations used by the IHC-decision trees. We obtained 35 algorithms for automatic classification, considering that Colomo and Hans's algorithms use the same combination of antibodies. As was defined, classes were identified as GCB or non-GCB COO groups. To avoid overfitting, we used large training set (75% of cases as mentioned above); we implemented k-fold cross-validation to assure that classification results were independent of the training, testing, and validations data subsets; and during training, a level of 95% maximum accuracy was set. The stopping criterion during the training stage for all classification models was an error less than 1 × 10 −3 or 100 training epochs, whichever was satisfied first.
Bayesian classifiers were executed in Weka [27]. Artificial neural networks were implemented using Matlab Neural Network Pattern Recognition Toolbox. In the hidden layer, five neurons were used, and the cross-validation method selected was Entropy Reduction. Support vector machines were implemented using the Sci-Kit Learn Python Library [28], with a Kernel Transformation using the Radial Base Function, and using stratified k-fold grid search cross-validation. Classification of data can be done upon request, and a mobile application is in the making.

IHC-decision trees and machine learning classifiers, comparative performance analysis
The same metrics used for comparison between IHC and LDA were calculated to compare the IHC-decision trees and trained machine learning algorithms, for both approaches the VY subset is considered for classification comparison. Additionally, to statistically asses the concordance between IHC and machine learning algorithms with the GEP true class, all outputs were analyzed by Cohen's kappa (κ) for agreement analysis and Pearson's Chi-squared test in Matlab. Kappa result was interpreted as follows: values ≤ 0 as indicating no agreement and 0.01-0.20 as poor, 0.21-0.40 as fair, 0.41-0.60 as moderate, 0.61-0.80 as good, and 0.81-1.00 as very good agreement. Per example, a κ = 0.01 comparing GEP to a model indicated no agreement, that is, a GCB patient identified by GEP was misclassified as non-GCB by an algorithm. Conversely, a κ = 0.85 indicated very good agreement, that is, the GCB patient identified by GEP was correctly classified as GCB by an algorithm.

Validation in a clinical sample set
To test the efficacy of COO molecular group classification, we use an independent series of cases. Archived formalin-fixed, paraffin-embedded (FFPE) tissues from 60 DLBCL Mexican-mestizo patients were collected from January 2009 to March 2011, after approval by the Institutional Review Board of Escuela de Medicina y Ciencias de la Salud from Tecnologico de Monterrey. Retrieved blocks dated from 1999 to 2010. Clinical data were incomplete for 11 cases; therefore, they were withdrawn from the analysis.
FFPE sections from clinical sample set patients were stained with standard hematoxylin-eosin (H&E) staining method. Each H&E slide was reviewed, and morphologically representative, non-necrotic tumor areas were used for tissue array (TA) assembly.
Three pathologists reviewed the slides independently. Difficult cases were examined and resolved by joint review on a multiheaded microscope. Such an event occurred in less than 5% of the cases. Immunoreactivity was scored as the percentage of positive tumor cells over total tumor cells. The data analysis included only the cases with both complete IHC scores readings and clinical information.

Survival analysis
For VY subset and clinical sample set, survival of GCB and non-GCB COO groups identified by all algorithms were compared using Kaplan-Meier curves, and the significance was calculated using log-rank test. For all the analysis, a p-value < 0.05 was considered to be statistically significant.

Public database analysis
Visco-Young database comprised 231 GCB, 200 activated B-cell (ABC), and 44 unclassified (UC) cases identified by genetic profiling, resulting in 231 (48.6%) GCB and 244 (51.4%) non-GCB cases. In Visco-Young database, 71% of the cases occurred in patients from North America, 6% from Asia, and the remainder were from Western Europe (Italy, Spain, Switzerland, Denmark, and The Netherlands). The median age was 62 years (61% > 60 years, and male patients ~ 57%). Patients were in advanced Ann Arbor stage (53%), with not favorable performance status (20% with ECOG ≥ 2), high lactate dehydrogenase values in serum (65%), with extranodal involvement (23% with ≥ 2), and with B-symptoms manifestations (32%). All these characteristics accounted for 14% patients having a high intermediate to high IPI risk. Treatment consisted of R-CHOP (91%) or R-CHOP-like regimens (9%; epirubicin or mitoxantrone based combination chemotherapy). Patients that achieved complete remission on initial chemotherapy treatment were 76%.
For automatic classifiers, a total of 354 cases were used during training, and the remaining 121 (VY subset) were used for testing and validation steps. This VY subset included 51.2% GCB and 48.8% non-GCB by GEP, a similar proportion to the training set, and was used to compare classification metrics.

Clinical sample set characteristics
Our clinical sample set comprised of Mexican-mestizo patients (see Additional file 1: Table S3) is similar to Visco-Young database. Main treatment was anthracycline-based (36.7%), R-anthracycline-based (42.9%), and 20.4% as other chemotherapeutic regimens. It is important to mention that around 57% of the patients did not receive Rituximab since this treatment is not available in the entire public Mexican health system. In regards with response, almost half of the patients achieved complete remission (49.0%) on initial chemotherapy treatment, and 12.2% of them relapsed during  [29][30][31].

IHC-decision trees and classification by LDA, performance analysis
The metrics obtained for each IHC-decision tree considering the complete classification of the cases in the Visco-Young database are presented in Table 1. In the same table we show the equivalent classification after the LDA for the same antibody combination, according to the derived linear discriminant function (LDF). We have also included the most relevant combination that has not been evaluated by any IHC-decision tree algorithms. The complete classification metrics for the rest of possible combinations by LDA and their respective set of linear coefficients can be consulted in Additional file 1: Tables S4, S5, respectively.

IHC-decision tree algorithms performance
Three algorithms Choi, VY3, and VY4, reached the most considerable accuracy (Acc 0.88), representing the most balanced options of sensibility and specificity, with similar performance metrics (Table 1). These algorithms presented a better performance for the classification of GCB subjects (Sens 0.94 Choi, 0.92 VY3, 0.93 VY4) compared with non-GCB classification (Spec 0.84); however, regarding to the probability that a subject classified as GCB was correctly classified, the outcome is fair (PPV 0.84 Choi, 0.85 VY3, 0.85 VY4), but certainly better for non-GCB (NPV 0.93 Choi, 0.92 VY3, 0.92 VY4). Similar results were obtained for likelihood ratios, moderate for positive test results (LR+ 5.70 Choi, 5.92 VY3, 5.80 VY4) and excellent for negative test results (LR− 0.08 Choi, 0.09 VY3, 0.09 VY4). It is important to remark that VY3 considers only three antibodies (1,2,3), such combination is included in Choi and VY4 with a different cut-off in the decision tree branches. Nyman reached the best specificity (Spec 0.91); however, the probability that a subject classified as non-GCB was correctly classified is low (NPV 0.67), its LR− is also reduced (LR− 0.53). Hans algorithm, usually considered as a reference, had lower performance compared with VY3, VY4, and Choi. The   17:198 worst evaluated were Colomo and Choi*, and the excellent sensitivity of Hans* (Sens 0.94) it is not enough to match the performance of Choi, VY3, and VY4.

Linear discriminant analysis performance
Antibody combinations (1,2,3) and (1,2,3,4,5) reached the more considerable accuracy (Acc 0.89), these combinations are the same used by VY3 and Choi respectively. For these linear discriminants the GCB classification gives a fair sensibility (Sens 0.87), which is lower than sensitivity for the equivalent IHC-decision trees; but, in terms of probability their PPV is better (PPV 0.90), and the likelihood ratio for LR+ (LR+ 9.19 (1,2,3), 9.23 (1,2,3,4,5)) overcome the performance of VY3 and Choi. This can be appreciated on Additional file 1: Figure S2, where the separation of COO groups increases using LDA in comparison to IHC-decision trees.
In accordance to results in Table 2, for combination (1,2,3), the higher coefficients of the LDF for GCB cases relays on antibody BCL6 = 6.99 and is followed by CD10 = 4.86, with less influence of FOXP1 = 0.64. A similar situation is for combination (1,2,3,4,5), its coefficient for BCL6 = 6.31 and is followed by CD10 = 4.59; less impact is obtained from the rest of antibodies (CGET1 = 3.28, MUM1 = 1.02 and FOXP1 = 0.71). As can be appreciated in Table 2, the inclusion of BCL6 and CD10 in the LDF is strongly related to the classification of GCB cases, for any of the combinations in which they are included, their coefficients are the largest; keeping the coefficient relation: BCL6 larger than CD10 coefficient if they occur in the same combination. Antibody combinations (1,2,3) and (1,2,3,4,5) have a larger specificity (Spec 0.91); however, their NPV and LR− cannot overcome their counterparts IHC-decision trees. On average sense, performance metrics are improved by LDA classification for GCB classification, both diminish the non-GCB classification metrics. From Table 2, it is possible to confirm that the inclusion of MUM1 in any algorithm is related to non-GCB detection, getting the most considerable value in the LDF.
By LDA, we observed that BCL6 coefficient weight resembles with the molecular hallmark of GCB, since BCL6 expression is mainly restricted to germinal center cells [32,33]. Moreover, CD10 has been reported as been positive and negative for GCB patients [18,34,35], and appeared to have higher influence for GCB identification according to the CD10 coefficient weight we obtained. Although in normal GC B-cells MUM1 and BCL6 are mutually exclusive, in tumor cells both proteins are coexpressed [36], and we observed a more significant influence of MUM1 for non-GCB group, in agreement with it post-germinal marker feature [18].
We want to remark that there is not any combination with better metrics than those usually used by the

Table 2 Coefficients of linear discriminant functions (LDF) derived from LDA for all possible combination of antibodies
Columns (Antibody) show the coefficient associated with each antibody for GCB classification (Sensibility performance) and non-GCB classification (Specificity performance). The first column remarks the combinations of antibodies as were used by IHC-decision trees. BCL6 and CD10 in the LDF are strongly related to the classification of GCB cases, whereas the inclusion of MUM1 in any algorithm is related to non-GCB detection, getting the most considerable value in the LDF Numeric Tags 1 = CD10, 2 = BCL6, 3 = FOXP1, 4 = GCTE1, and 5 = MUM1  17:198 IHC-decision trees. For that reason, machine learning approach was focused on the equivalent antibody combinations proposed by previous authors.

Machine learning algorithms performance analysis
The metrics of the best five of 35 machine learning and 8 IHC-decision tree algorithms in VY subset are shown in Table 3. Complete results of 43 algorithms are shown in Additional file 1: Table S6.
Our results show Hans as a fair performance classifier and ranked 21 out of 43 algorithms tested (ranking by accuracy, Fig. 1a). Hans had a better performance for the classification of GCB subjects (Sens 0.95) compared with non-GCB classification (Spec 0.83); however, the probability that a GCB subject was correctly classified was fair (PPV 0.86), but better for non-GCB (NPV 0.94). An in terms of likelihood ratios, these were moderate for GCB (LR+ 5.52) and excellent for non-GCB (LR− 0.06). Nonetheless, Choi was the IHC-decision tree algorithm with better metrics (Acc 0.93), and ranked 8 out of 43 algorithms tested (Fig. 1a), followed by VY3 and VY4 (ranked 15 and 16, respectively). Choi presented an excellent performance for GCB classification (Sens 1.00), but with interesting metrics for non-GCB cases considering probability and likelihood ratios (NPV 1.00, LR− 0.00).
It should be noted that although in LDA analysis BCL6 antibody had emerged as a relevant component in COO classification, PV algorithm does not include it. BCL6 influence in the linear discriminate is not necessary for the BS (Bayesian simple), since this structure relies on a probability function instead of a linear component. It is consistently observed in the rest of the machine learning algorithms, which are more complex structures and do not need the weight of BCL6, CD10, and MUM1 as such in simpler structures like IHC-decision tree or LDA.
Based on the ROC spaces computed for each algorithm (Fig. 1b), the best performances were obtained for

Survival analysis
A survival analysis with VY subset and our Mexicanmestizo clinical sample set was performed, using the eight IHC-decision tree and the 35 machine learning algorithms here proposed. Kaplan-Meier curves of DLBCL COO molecular groups classified by Hans and the best five machine learning algorithms were carried out using VY subset data. As expected, GCB overall survival was significantly better than non-GCB cases (Fig. 2). Our machine learning algorithms provided a clear difference in GCB patients outcome (for PV, p = 0.011), along with Hans IHC-decision tree algorithm (p = 0.002).
In the case of our clinical Mexican-mestizo sample set, Hans IHC-decision tree algorithm classified 23 GCB patients and 26 non-GCB patients, whereas PV classification resulted in 32 GCB patients and 17 non-GCB patients (Fig. 2). The validation set of 49 cases is small but sufficient to show a statistical significant survival difference. In both scenarios, GCB patients overall survival was better (p = 0.466 by Hans and p = 0.033 by PV classification), as has been clinically observed in other populations. When we analyzed the most important clinical variables according to the COO (non-GCB vs GCB), we observed that only the high-risk group according to the NCCN-prognostic index was higher in non-GCB than GCB (23.5% vs 3.1%, respectively, p = 0.016). We also observed a trend of less extranodal involvement in 2 or more sitesin non-GCB than GCB (29.4% vs. 40.6%, respectively, p = 0.07). Regarding the response after treatment (complete response vs. partial response + refractory) and the relapse, no differences were observed either. To the best of our knowledge, our study is the first analyzing and classifying a Mexican-mestizo sample set.

Discussion
Gene expression profiling is the gold standard for DLBCL cell-of-origin classification. The presence of MYC/BCL2 translocations, overexpression of specific genes or the presence of mutations will be part of COO classification in the near future, since two large multiplatform genomic analysis have described four [37] and five [38] prominent genetic subtypes in DLBCL that will unravel the pathogenesis of the disease beyond the cell-of-origin classification. Unfortunately, GEP wide application is still  17:198 prohibitive to many health institutions and diagnostic laboratories. Hans remains the most popular IHC-decision tree algorithm used for COO DLBCL classification and has a reasonable correlation with the GEP [3]. However, some issues have arisen about the proportion of misclassified cases by IHC-decision tree algorithms compared with GEP, since a higher percentage of misclassified cases in the GCB than in the non-GCB group has been observed, with proportions of up 60.0% and 38.0% for GCB and non-GCB, respectively [15,39]. In our analysis, Hans IHC-decision tree algorithm misclassified 17.2% of GCB and 4.8% of non/GCB cases (PPV NPV). Additionally, Hans* had the worst proportions of cases that were not correctly allocated for GCB with 24.1%, whereas Nyman had the worst for non-GCB with 35.5%. Remarkably, PV had 6.9% and 4.8% of misclassified cases for GCB and non-GCB, respectively. As has been discussed by Meyer [11], this reflects that IHC-decision tree algorithms use a combination of antibodies for proteins expressed predominantly by either GCBs or non-GCB cases and examined in a specific order. Because of reliance on the order of examination, the result of an antibody early in the algorithm can make the results of antibodies used later in the algorithm irrelevant. Moreover, our LDA analysis showed higher influence over the linear discriminant function (LDF) of BCL6 and CD10 for GCB, and MUM1 for non-GCB, consistent with the hallmarks of both COOs. Furthermore, there was not an antibodies combination with better performance metrics tan such explored by IHC-decision trees; however, classification by LDF derived from LDA improved LR+ and LR− metrics. Thereby for machine learning analysis we only assessed the antibodies combination proposed by previous authors. Coutinho et al. [15], investigated the concordance of nine IHC-decision tree algorithms for the molecular classification of a 298 sample set of DLBCL diagnostic biopsies. They used Hans, Hans*, Nyman, Choi, Choi*, VY3, VY4, Natkunam, Tally, Muris IHC-decision tree algorithms, the last three not considered in our study. They found an extremely low concordance across those nine IHC-decision tree algorithms, with only 4.1% of the tumors classified as GCB and 21.0% as non-GCB by all IHC-decision tree algorithms. Moreover, poor and fair κ values were detected in 44.4% on pairwise concordance assessment; and in only 20.0% was κ good or very good. The highest level of agreement was found between Choi, and VY3 and VY4 IHC-decision tree algorithms (κ = 0.85). GEP data for their sample set was not available, and therefore no further conclusions can be done. Considering our analysis, we found a moderate concordance across the eight IHC-decision tree algorithms here tested, with 30.0% of both GCB and non-GCB cases classified as such. Moreover, a higher concordance was observed across our five best machine learning algorithms, with 50.0% of the cases classified as GCB and 45.0% as non-GCB by the five machine learning algorithms. On pairwise concordance assessment, we observed 46.4% of IHC-decision tree algorithms with moderate κ values, 39.3% were good, and 4.0% were very good within themselves. The highest level of agreement was found between VY3 and VY4 (κ = 1.0). This can be explained since Visco [12] evaluated using the same clones. Conversely, machine learning algorithms outperformed with 8.9% good, and 91.1% very good κ values. The highest level of agreement was found between the SVM (1,2,3,4,5) and SVM (1,2,3,4) algorithms (κ = 1.0). This machine learning algorithms performance is portrayed as the abundance of green color region in the heatmap depicted in Fig. 1c, compared with the red and black regions correspondent to IHC-decision tree algorithms concordance. Nevertheless, parameters such as accuracy, sensitivity, and specificity for GEP data is a better approach to describe a classifier performance.
In the case of DLBCL, the adequate identification of GCB and non-GCB COO groups represents a major concern in personalized treatment. Nowadays, clinical guidelines for DLBCL patients includes R-CHOP as first-line treatment regardless COO group [40,41]. It is important to point out that current clinical trials rely COO DLBCL classification on Hans IHC-decision tree algorithm, compromising final outcomes since this algorithm has less accuracy and overall performance metrics than Choi, VY3, and VY4, as we observed in our study. The lack of significant differences between GCB or non-GCB treatment outcomes using precision medicine regimens under investigation (i.e. lenalidomide, ibrutinib, and bortezomib) [42][43][44] is evidence of this. The machine learning approach can achieve improvement of accuracy and the rest of the performance metrics in immunohistochemistry-based COO DLBCL classification, and further study is needed. In our case, it was not possible to analyze the clinical sample set by GEP; however, the clinical protocol to test our PV algorithm is at an early stage. Translation strategy will comprise an embed process of PV algorithm in a software application, as an auxiliary tool in the diagnostic for the pathologist when GEP is not available, as well as a guide to the clinician for better treatment choices.

Conclusions
In conclusion, our results suggest that the use of Hans should be reconsidered in favor of new classification techniques when GEP analysis is not accessible. By harnessing all of the available immunohistochemical data without reliance on the order of examination or a cut-off