Subtyping of sarcomas based on pathway enrichment scores in bulk and single cell transcriptomes

Sarcomas are highly heterogeneous in molecular, pathologic, and clinical features. However, a classification of sarcomas by integrating different types of pathways remains mostly unexplored. We performed hierarchical clustering analysis of sarcomas based on the enrichment scores of 14 pathways involved in immune, stromal, DNA damage repair (DDR), and oncogenic signatures in three bulk tumor transcriptome datasets. Consistently in the three datasets, sarcomas were classified into three subtypes: Immune Class (Imm-C), Stromal Class (Str-C), and DDR Class (DDR-C). Imm-C had the strongest anti-tumor immune signatures and the lowest intratumor heterogeneity (ITH); Str-C showed the strongest stromal signatures, the highest genomic stability and global methylation levels, and the lowest proliferation potential; DDR-C had the highest DDR activity, expression of the cell cycle pathway, tumor purity, stemness scores, proliferation potential, and ITH, the most frequent TP53 mutations, and the worst survival. We further validated the stability and reliability of our classification method by analyzing a single cell RNA-Seq (scRNA-seq) dataset. Based on the expression levels of five genes in the pathways of T cell receptor signaling, cell cycle, mismatch repair, focal adhesion, and calcium signaling, we built a linear risk scoring model (ICMScore) for sarcomas. We demonstrated that ICMScore was an adverse prognostic factor for sarcomas and many other cancers. Our classification method provides novel insights into tumor biology and clinical implications for sarcomas.

of six types of soft tissue sarcomas, including dedifferentiated liposarcoma (DDLPS), leiomyosarcoma (LMS), undifferentiated pleomorphic sarcoma (UPS), myxofibrosarcoma (MFS), malignant peripheral nerve sheath tumor (MPNST), and synovial sarcoma (SS). This study demonstrated two key findings for these adult soft tissue sarcomas: (1) CNAs are predominant over somatic mutations; and (2) molecular subtypes and the tumor immune microenvironment are highly associated with clinical outcomes. Besides, many studies have performed molecular classification of sarcomas based on genomic profiling. For example, Kim et al. classified complex karyotype sarcomas into three subtypes based on their CDK4 and RB1-associated CNAs [3]. Lee et al. classified SS, LMS, and malignant fibrous histiocytoma (MFH) into four classes based on expression profiling of 833 genes [4]. Koelsche et al. developed an algorithm to classify sarcomas based on DNA methylation profiling [5]. Gibault et al. identified five subtypes of soft tissue sarcomas with complex genomics by clustering analysis of transcriptome data [6].
Despite these previous molecular classification studies for sarcomas [1,[3][4][5], a classification of sarcomas by integrating different types of pathways remains mostly unexplored. Abundant evidence has shown that sarcomas are highly heterogenous in the immune and stromal microenvironment [1,7], genomic instability [1,8], and oncogenic pathways [9]. Thus, a classification of sarcomas based on these features of pathways has the potential to provide new insights into the heterogeneity of sarcomas. To this end, we implemented unsupervised clustering of sarcomas based on the enrichment scores of 14 pathways. These pathways were associated with immune regulation (natural killer cell-mediated cytotoxicity, antigen processing and presentation, T cell receptor signaling, B cell receptor signaling, and JAK-STAT signaling), stromal signatures (ECM-receptor interaction, focal adhesion, adherens junction, and calcium signaling), DNA damage repair (DDR) (mismatch repair and homologous recombination), and oncogenic signatures (TGF-β signaling, Wnt signaling, and cell cycle). Because the pathway enrichment score integrates the expression levels of a set of genes into a single value, the pathway enrichment-based clustering is likely to exhibit higher stability and robustness than the gene expression profiles-based clustering. In addition, the pathway enrichment-based clustering may result in more straightforward and explainable results relevant to the subtyping of cancers than the gene expression profiles-based clustering. In fact, the pathway (or gene set) enrichmentbased clustering method has been employed in many recent studies and shown its advantages over the gene expression profiles-based clustering method [10][11][12][13]. By the pathway-based clustering analyses, we identified three subtypes of sarcomas, consistently in three different datasets. We further provided a comprehensive portrait of the molecular and clinical characteristics of these sarcoma subtypes. Finally, we validated our methods and results in a single cell RNA-Seq (scRNA-seq) dataset. Our classification method may furnish new insights into the cancer biology of sarcomas and clinical implications for the management of this disease.

Datasets
We obtained transcriptomic and clinical data for TCGA-SARC from the genomic data commons (GDC) data portal (https:// portal. gdc. cancer. gov/) and GSE30929 and GSE71121 from the NCBI gene expression omnibus (GEO) (https:// www. ncbi. nlm. nih. gov/ geo/). From the GDC data portal, we also obtained the profiles of somatic mutations ("maf " files), SCNAs ("SNP6" files), protein expression (level 3), and DNA methylation (HM450) for TCGA-SARC and transcriptomic (level 3 and RSEM normalized) and clinical data for other 29 cancer types. In addition, we downloaded a single-cell RNA sequencing (scRNA-seq) dataset (GSE131309) for sarcomas from the GEO. All gene expression values (RSEM normalized) were added 1 and then log2-transformed before further analyses. A summary of these datasets is shown in Additional file 1: Table S1.

Gene-set enrichment analysis
We calculated the enrichment scores of pathways, immune signatures, or biological processes in a tumor sample by single-sample gene-set enrichment analysis (ssGSEA) [14] based on the expression levels of its related genes (pathway genes or marker genes). The ssGSEA is an extension of the GSEA method, which outputs the enrichment scores of the input gene sets in different samples by inputting an expression matrix and a list of gene sets. We presented the pathways, immune signatures, and biological processes and their related genes in Additional file 2: Table S2. In addition, we identified KEGG pathways significantly associated with a gene set using the GSEA tool [15].

Clustering analysis
We used a clustering method to identify sarcoma subtypes based on the enrichment scores of 14 pathways. The clustering method we used is hierarchical clustering algorithm, which is an unsupervised machine learning algorithm that determines the similarity between data points in each category by calculating the distance between them and all data points; the smaller the distance, the higher the similarity, and combining the two data points or categories with the closest distance to generate a clustering tree. The 14 pathways were involved in immune regulation, stromal signatures, DDR, and oncogenic signatures. We performed the hierarchical clustering in TCGA-SARC, GSE30929, and GSE71121, respectively, by using the R package "hclust".
Calculation of immune score, stromal score, tumor purity, TMB, HRD, ITH, and SCNA We calculated the immune score, stromal score, and tumor purity for each tumor sample by the ESTIMATE algorithm [16]. The immune score quantifies the immune infiltration level in the tumor microenvironment (TME), while the stromal score quantifies the stromal content in the TME. Tumor purity represents the proportion Pathway score  Fig. 1 Identification of sarcoma subtypes based on pathway enrichment scores. Hierarchical clustering identifying three sarcoma subtypes: Immune Class (Imm-C), Stromal Class (Str-C), and DDR Class (DDR-C), based on the enrichment scores of 14 pathways in three different datasets (TCGA-SARC, GSE30929, and GSE71121). The pathway enrichment scores were calculated by ssGSEA [28]. The  of tumor cells in the bulk tumor. We defined TMB as the total number of non-synonymous somatic mutations in the tumor. The Homologous recombination deficiency (HRD) scores of TCGA-SARC tumors were obtained from the publication by Knijnenburg et al. [17]. We scored intratumor heterogeneity (ITH) scores by the DEPTH algorithm [18], which evaluates ITH at the mRNA level. The R package "DEPTH" was used to calculate ITH scores with the input of gene expression profiles in tumor and/or normal tissues. We used GISTIC2 [19] to calculate arm-and focal-level SCNAs and G-scores with the input of "SNP6" files.

Construction of the prognostic risk scoring model
To build the gene expression-based linear risk scoring model (ICMScore) for evaluating the prognostic risk of sarcomas, we first identified seven of the 14 pathways using the Cox proportional hazards model by Lasso based on their enrichment scores in TCGA-SARC. The seven pathways included T cell receptor signaling, focal adhesion, adherens junction, Wnt signaling, calcium signaling, cell cycle, and mismatch repair. In each of the seven pathways, we identified the genes having strong expression correlations with the ssGSEA scores of the pathway (Spearman correlation ρ > 0.5). A total of 26 genes were identified, from which we selected 18 genes using the Cox proportional hazards model by Lasso based on their expression levels. The 18 genes were included in five pathways: T cell receptor signaling (CD40LG), focal adhesion (FLT4 and ITGA1), calcium signaling (ATP2B4, ADCY2, and FGF7), cell cycle (BUB1B, MCM4, CDC25A, CDK2, MCM6, RBL1, and TFDP1), and mismatch repair (EXO1, RFC5, MSH2, RPA3, and POLD2). Finally, we selected five genes (CD40LG, CDC25A, MSH2, FLT4, and ADCY2) from the 18 genes, which were included in five different pathways and had the smallest P-values in the univariate Cox proportional hazards model based on gene expression levels among all genes in the same pathway. Using the five genes as independent variables, we built the linear risk scoring model (ICMScore) as follows: where G i represents one of the five genes and exp(G i ) the expression level of G i in the tumor; β i is the regression β coefficient for G i in its univariate Cox proportional hazards model. The function "cv.glmnet" in the R package "glmnet" was utilized for the variable selection by Lasso in the Cox proportional hazards model, and the function "coxph" in the R package "survival" was used for the univariate and multivariable Cox regression analyses.

Analysis of scRNA-seq data
The scRNA-seq (SMART-seq2 [20]) dataset (GSE131309) for sarcomas was gene expression profiles in 6951 single cells from 12 human SyS tumors, which contained 4371 malignant cells and 2580 non-malignant cells. We used the tSNE algorithm [21] to cluster malignant cells and non-malignant cells, respectively. t-SNE produces a single map to demonstrate structure at many different scales, particularly useful for high-dimensional data [21].

Statistical analysis
In comparisons of two classes of normally distributed data, including gene expression levels, protein expression levels, and the ratios of immune-stimulatory to immuneinhibitory signatures, we used two-tailed Student's t test.
In comparisons of two classes of data that were not normally distributed, we used the one-tailed Mann-Whitney U test. In comparisons of three classes of data, if they were normally distributed, we used the one-way ANOVA test, otherwise, we used the Kruskal-Wallis (K-W) test. When analyzing contingency tables, we utilized the Fisher's exact test. We used the Benjamini-Hochberg method [22] to calculate FDR for adjusting for multiple tests. We used Kaplan-Meier curves to compare the survival (OS, DFS, and MFS) time between different groups and the log-rank test to evaluate the significance of survival time differences.

Identification of sarcoma subtypes based on pathway scores
We quantified the activity of a pathway in a tumor sample by the single-sample gene set enrichment analysis (ssGSEA) [14]. Based on the ssGSEA scores of the 14 pathways involved in immune, stromal, DNA damage repair, or oncogenic signatures, we hierarchically clustered sarcomas in three different datasets, respectively. The three datasets included TCGA-SARC [1], GSE30929 [23], and GSE71121 [24], which encompassed 259, 140, and 312 tumor samples, respectively. Consistently in the three datasets, sarcomas were clearly classified into three subtypes: Immune Class (Imm-C), Stromal Class (Str-C), and DDR Class (DDR-C) (Fig. 1). Among the three subtypes, Imm-C displayed the strongest immune signatures, Str-C showed the strongest stromal signatures, and DDR-C had the highest DDR activity. Meanwhile, Str-C exhibited the highest activity of the TGF-β and Wnt signaling pathways (Fig. 1). This is consistent with the prominent roles of both pathways in the activation of stromal cells [25]. DDR-C most highly expressed the cell cycle pathway. It is justified since elevated cell cycle activity must promote the DDR activity in tumors [26].

Comparisons of immune, stromal, and DDR signatures among the sarcoma subtypes
We compared the enrichment scores of CD8 + T cells, NK cells, and immune cytolytic activity, which were ssGSEA scores of their marker gene sets, among the sarcoma subtypes. All these immune signatures exhibited consistent expression patterns among these subtypes: Imm-C > Str-C > DDR-C, in all three datasets (one-tailed Mann-Whitney U test, P < 0.01) ( Fig. 2A). In addition, we compared immune scores, which were evaluated by the ESTIMATE algorithm [16], among the sarcoma subtypes. Likewise, the immune scores followed the pattern: Imm-C > Str-C > DDR-C, in all three datasets (one-tailed Mann-Whitney U test, P < 0.01) ( Fig. 2A). Moreover, most human leukocyte antigen (HLA) genes showed a similar expression pattern: Imm-C > Str-C > DDR-C (one-way ANOVA test, P < 0.001) (Additional file 3: Fig.  S1). Furthermore, we compared the ratios of immunestimulatory to immune-inhibitory signatures (CD8 + / CD4 + regulatory T cells) among the sarcoma subtypes. The ratios of CD8 + /CD4 + regulatory T cells were the base-2 log-transformed values of the geometric mean expression levels of all marker genes of CD8 + T cells divided by those of CD4 + regulatory T cells. Notably, the ratios exhibited similar pattern: Imm-C > Str-C > DDR-C (two-tailed Student's t test, P < 0.05) (Fig. 2B). Taken together, these results confirmed that Imm-C had the strongest anti-tumor immune response among the three subtypes; they also demonstrated that DDR-C had the weakest anti-tumor immune response.
We also compared stromal signatures among the sarcoma subtypes. The stromal scores by ESTI-MATE [16] followed the pattern: Str-C > DDR-C and Imm-C > DDR-C (one-tailed Mann-Whitney U test, P < 0.001) (Fig. 2C). Epithelial-to-mesenchymal transition (EMT) is a representative signature for assessing tumor stromal content [27]. The TCGA-SARC data showed that EMT scores were significantly higher in Str-C than in Imm-C and DDR-C (P < 0.05) (Fig. 2D). These data confirmed that Str-C had the strongest stromal signature. Besides, we compared the expression levels of seven DDR genes among the sarcoma subtypes. These genes included MSH2, MSH3, MSH6, MLH1, PMS1, POLD1, and POLE. Notably, these genes showed significantly higher expression levels in DDR-C than in Imm-C and Str-C (P < 0.05) (Fig. 2E). It supported that DDR-C had the strongest DDR signature.
We further compared tumor purity among the sarcoma subtypes. We evaluated tumor purity by ESTIMATE [16], which uses a cosine function of the sum of immune and stromal scores to calculate tumor purity. Interestingly, tumor purity consistently followed the pattern: Imm-C < Str-C < DDR-C, in the three datasets (one-tailed Mann-Whitney U test, P < 0.001) (Fig. 2F). These results indicated that DDR-C was most enriched with tumor cells while Imm-C harbored the highest proportion of non-tumor cells.

Clinical and phenotypic characteristics of the sarcoma subtypes
We compared survival [overall survival (OS), disease-free survival (DFS), and metastasis-free survival (MFS)] prognosis among the sarcoma subtypes. Notably, DDR-C was likely to have the worst survival consistently in the three datasets (log-rank test, P < 0.05), while Imm-C and Str-C showed no significant difference in survival (Fig. 3A). Interestingly, Imm-C patients were significantly older than Str-C and DDR-C patients (one-tailed Mann-Whitney U test, P < 0.05) (Fig. 3B), while Str-C and DDR-C patients had no significant difference in ages. In addition, in male patients, Str-C had the highest proportion (40.68%) and DDR-C had the lowest proportion (26.27%). In contrast, in female patients, DDR-C had the highest     proportion (47.52%) and Imm-C had the lowest proportion (23.40%) (Fig. 3C). We compared several tumor progression phenotypes among the sarcoma subtypes, including proliferation potential, stemness, and intratumor heterogeneity (ITH). These phenotypes are associated with tumor progression, immune evasion, drug resistance, and unfavorable prognosis [28,29]. Notably, proliferation potential scores followed the pattern: Str-C < Imm-C < DDR-C, consistently in the three datasets (one-tailed Mann-Whitney U test, P < 0.001) (Fig. 3D). Stemness scores followed the pattern: Str-C < DDR-C and Imm-C < DDR-C (one-tailed Mann-Whitney U test, P < 0.001) (Fig. 3E). In addition, we used the DEPTH algorithm [18] to score the ITH of the sarcomas in TCGA-SARC. The ITH scores were the highest in DDR-C and the lowest in Imm-C (one-tailed Mann-Whitney U test, P < 0.05) (Fig. 3F). Altogether, these results confirmed that DDR-C had the worst prognosis among the three sarcoma subtypes.
Because age, immune score, tumor purity, tumor proliferation potential, stemness, and ITH are potentially associated with clinical outcomes in cancer and were significantly different among the sarcoma subtypes, the worse survival prognosis in DDR-C versus the other subtypes could have an association with these confounding variables. To explore the possibility, we performed multivariate (age, immune score, tumor purity, tumor proliferation potential, stemness, ITH, and DDR-C subtype) survival analysis by the multivariate Cox proportional hazards model. We found that the subtype DDR-C remained a significant risk factor for DFS (P = 0.047; hazard ratio (HR) = 1.72 and its 95% confidence interval (CI) [1.01, 2.94] (Fig. 3G) after correcting for these variables. It suggests that DDR-C is an independent risk factor for sarcomas.

Molecular characteristics of the sarcoma subtypes
We compared various molecular characteristics among the sarcoma subtypes, including genomics, transcriptomics, methylation profiles, and proteomics. In comparisons of genomic, methylation profiles, and proteomic characteristics, we merely used the TCGA-SARC dataset since related data were not available in the other two datasets.

Genomic characteristics
Genomic instability often leads to a high tumor mutation burden (TMB) and/or increased CNAs [30]. We found that TMB was significantly lower in Str-C than in Imm-C and DDR-C (one-tailed Mann-Whitney U test, P < 0.05), while it showed no significant difference between Imm-C and DDR-C (Fig. 4A). Homologous recombination deficiency (HRD) may promote tumor aneuploidy levels, namely CNAs [17]. We found that HRD scores were significantly lower in Str-C than in Imm-C and DDR-C (P < 0.001), while they were not significantly different between Imm-C and DDR-C (Fig. 4B). In addition, we used GISTIC2 [19] to calculate arm-and focal-level somatic copy number alterations (SCNAs) and G-scores. The G-score reflects the amplitude of the SCNA and the frequency of its occurrence across a group of samples [19]. Notably, the frequencies of arm-level copy number amplifications and deletions followed the tendency: Str-C < Imm-C < DDR-C (Fig. 4C). The G-scores of copy number amplifications were significantly higher in Imm-C than in DDR-C and Str-C, while the G-scores of copy number deletions were significantly higher in DDR-C than in Imm-C and Str-C (Fig. 4D). Overall, these results indicated that Str-C was more genomically stable than Imm-C and DDR-C.

Somatic mutations
We observed that TP53 had a significantly higher mutation frequency in DDR-C than in Imm-C and Str-C (Fisher's exact test, P = 0.01, odds ratio (OR) = 2.0) (Fig. 5A). It could partially explain why DDR-C had the worst prognosis among the three subtypes since TP53 mutations have been associated with unfavorable outcomes in various cancers [31]. Indeed, we found that TP53mutated tumors had a significant worse OS and DFS than TP53-wildtype tumors in DDR-C (P < 0.02). On the other hand, TP53 mutations were less frequent in Str-C than in DDR-C and Imm-C (P = 0.009, OR = 0.46). This result could explain why Str-C was more genomically stable than the other subtypes because p53 plays an important role in the maintenance of genomic stability [32]. In addition, DYNC2H1 and MDN1 were more frequently mutated in DDR-C than in Imm-C and Str-C (P = 0.04, OR = 4.14) (Fig. 5A). There were 14 genes showing significantly higher mutation frequencies in Imm-C than in DDR-C and Str-C (P < 0.05, OR > 4.0) (Fig. 5B). These genes included BEST3, CACNA1C, CEP170, DST, FRMPD3, KMT2D, PEG3, SFMBT2, SOGA2, TCHHL1, TENM2, THSD7A, TRPM6, and WNK2. Notably, the mutations in many of these genes were positively correlated with the enrichment scores of CD8 + T cells and/ or immune cytolytic activity, including CACNA1C, THSD7A, PEG3, TENM2, KMT2D, TRPM6, FRMPD3, DST, and TCHHL1 (P < 0.05) (Fig. 5C). It is consistent with the strongest immune signatures presented in Imm-C.

DNA methylation
We found 1716 genes showing significantly higher methylation levels in Imm-C compared to both Str-C and DDR-C [one-tailed Mann-Whitney U test, false discovery rate (FDR) < 0.05]. In contrast, 124 and 180 genes had significantly higher methylation levels in Str-C and DDR-C, respectively, compared to other subtypes. Notably, most of these genes showed significant inverse correlations of their expression levels with methylation levels (Pearson correlation, P < 0.05) (Additional file 4: Table S3). The top 30 genes with the most significant upregulation of methylation levels in each of the three subtypes are presented Additional file 5: Fig. S2. The KEGG pathways significantly associated with the 1716 hypermethylated genes in Imm-C mainly included focal adhesion, regulation of actin cytoskeleton, MAPK signaling, adherens junction, ECM-receptor interaction, TGF-β, ErbB signaling, calcium signaling, Notch signaling, gap junction, tight junction, cell cycle, and Hedgehog signaling (Fig. 6A). These results supported that Imm-C had lower activities of stromal signatures and oncogenic pathways. In contrast, the pathways associated with the 180 hypermethylated genes in DDR-C mainly included cytokine-cytokine receptor interaction, Jak-STAT signaling, antigen processing and presentation, the intestinal immune network for IgA production, natural killer cell-mediated cytotoxicity, cell adhesion molecules, and p53 signaling. These results confirmed that DDR-C had a lower anti-tumor immune response and p53 function  6B). It has been shown that low methylation levels correlate with increased TMB and CNAs in cancer [33]. We compared global methylation levels [33] among the sarcoma subtypes and observed the pattern: Str-C > Imm-C > DDR-C (P < 0.05) (Fig. 6C). This conforms with our previous results showing that Str-C had the lowest TMB and CNAs among the sarcoma subtypes.

A risk scoring model based on the expression levels of five genes in the pathways
Using the TCGA-SARC dataset, we developed a linear risk scoring model (ICMScore) to evaluate the prognostic risk of sarcomas based on the expression levels of five genes, including CD40LG, CDC25A, MSH2, FLT4, and ADCY2. The five genes were involved in five of the 14 pathways for clustering analysis, including T cell receptor signaling (CD40LG), cell cycle (CDC25A), mismatch repair (MSH2), focal adhesion (FLT4), and calcium signaling (ADCY2). ICMScore calculates risk score in a tumor as follows: , w her e e xp(X) denotes the expression level of gene X in the tumor sample. To prove that ICMScore is an authentic risk factor in sarcomas, we compared ICMScores among the three sarcoma subtypes and analyzed their correlation with survival prognosis in sarcomas. In the three sarcoma datasets, As expected, ICMScores was significantly higher in DDR-C than in Imm-C and Str-C (two-tailed Student's t test, P < 0.05) (Fig. 8A). Survival analyses showed that higher-ICM-Score (> median) tumors had significantly worse survival (OS, DFS, and MFS) prognosis than lower-ICMScore (< median) tumors in the sarcoma datasets (log-rank test, P < 0.05) (Fig. 8B). These results supported that ICMScore was a prognostic risk factor in sarcomas. Interestingly, we found that ICMScore was also a prognostic risk factor in many other cancer types, including adrenocortical carcinoma (ACC), kidney chromophobe (KICH), kidney renal clear cell carcinoma (KIRC), brain lower grade glioma (LGG), prostate adenocarcinoma (PRAD), and skin cutaneous melanoma (SKCM), as evidenced by that elevated ICMScores were associated with worse OS and/or DFS in these TCGA cancer types (Fig. 8C).

Validation by analyzing scRNA-seq data
Using the pathway-based clustering method, we analyzed a scRNA-seq dataset (GSE131309), which involved gene expression profiles in 6951 single cells We performed hierarchical clustering of the 6951 single cells based on the enrichment scores of the 14 pathways identified three subtypes of these cells (Fig. 9B). Notably, almost all immune cells (B cells, macrophages, mastocytes, NK cells, CD4 + T cells, CD8 + T cells, and T cells) were included in Imm-C, and all stromal cells (endothelial cells and CAFs) were included in Str-C (Fig. 9C). The 4371 tumor cells were classified into Str-C (n = 3147) and DDR-C (n = 1224). These results support the stability and reliability of our method.
In the scRNA-seq dataset, the enrichment scores of anti-tumor immune signatures (CD8 + T cells, NK cells, and immune cytolytic activity) were significantly higher in Imm-C than in Str-C and DDR-C (Fig. 9D). The ratios of CD8 + /CD4 + regulatory T cells were also significantly higher in Imm-C than in Str-C and DDR-C (Fig. 9D). Interestingly, PD-L1 expression levels were significantly higher in Imm-C than in Str-C and DDR-C (Fig. 9D). It indicates that PD-L1 is more abundant in immune cells than in stromal and tumor cells. The seven DDR genes (MSH2, MSH3, MSH6, MLH1, PMS1, POLD1, and POLE) exhibited the consistent expression pattern: Imm-C < Str-C < DDR-C (two-tailed Student's t test, P < 0.001) (Fig. 9E), supporting that DDR-C has the strongest DDR activity. Likewise, stemness scores exhibited the pattern: Imm-C < Str-C < DDR-C (one-tailed Mann-Whitney U test, P < 0.001) (Fig. 9F). It is consistent with our

Discussion
In this study, we proposed a novel classification method for sarcomas based on the enrichment scores of 14 pathways, which were involved in immune, stromal, DDR, and oncogenic signatures. In three datasets for bulk tumors and a scRNA-seq dataset, we reproducibly identified three sarcoma subtypes: Imm-C, Str-C, and DDR-C. Imm-C had the strongest anti-tumor immune signatures and the lowest ITH; Str-C showed the strongest stromal signatures, the highest genomic stability and global methylation levels, and the lowest proliferation potential; DDR-C had the highest DDR activity, expression of the cell cycle pathway, tumor purity, stemness scores, proliferation potential, and ITH, the most frequent TP53 mutations, and the worst survival.
It is interesting to observe that there was no significantly different TMB between DDR-C and Imm-C, while their immune infiltration levels were significantly different. Two possible reasons could explain this observation: (1) DDR-C had more frequent arm-level copy number amplifications and deletions than Imm-C; and (2) DDR-C had higher ITH than Imm-C, since both CNAs [41] and ITH [18] are negatively correlated with anti-tumor immune response. DDR-C displayed worse clinical outcomes than Imm-C and Str-C. It could be attributed to the high proliferation potential, stemness, ITH, and genomic instability in DDR-C. Meanwhile, the lowest ratios of immune-stimulatory/immuneinhibitory signatures in DDR-C indicate the least activation of anti-tumor immune response in this subtype. It could also be a factor that leads to the worst prognosis in DDR. The TCGA Research Network analyzed six types of adult soft tissue sarcomas: DDLPS, LMS, UPS, MFS, MPNST, and SS [1]. We found that 52%, 38%, and 10% of DDLPS tumors belonged to Str-C, Imm-C, and DDR-C, respectively (Additional file 7: Fig. S4A). It indicates that most DDLPS tumors are Str-C or Imm-C. In contrast, 80% of SS and 67% of ULMS tumors are DDR-C, indicating that SS and ULMS are dominated by DDR-C. In addition, 0%, 7%, and 8% of SS, ULMS, and STLMS tumors belonged to Imm-C, compared to 55% of UPS tumors being Imm-C; 5% of UPS tumors were Str-C, compared to 52% of DDLPS and 53% of STLMS tumors belonging to Str-C. Taken together, these data suggest that the different types of adult soft tissue sarcomas are dominated by different subtypes we identified. In another study [42], Thorsson et al. identified six immune subtypes of TCGA pan-cancer, including wound healing, IFN-γ dominant, inflammatory, lymphocyte depleted, immunologically quiet, and TGF-β dominant. We found that the wound healing sarcomas were mainly Str-C and DDR-C, and only 11% were Imm-C; the IFN-γ dominant sarcomas were mainly Imm-C and DDR-C, and only 16% were Str-C; the inflammatory sarcomas were dominated by Str-C (71%), compared to 10% of the lymphocyte depleted tumors were Str-C (Additional file 7: Fig.  S4B). This study revealed that the inflammatory subtype and lymphocyte depleted subtype had the best and worst prognosis, respectively. It is in accord with our results of the worst survival in DDR-C in that 10% of the inflammatory sarcomas and 61% of the lymphocyte depleted sarcomas belonged to DDR-C, respectively. In addition, Gibault et al. identified five subtypes (Clusters A-E) of soft tissue sarcomas based on gene expression profiles [6]. Cluster B was a subgroup of sarcomas with a favorable metastasis outcome in multivariate analysis [6], which constituted the least proportion (9%) of DDR-C among the five clusters (Additional file 7: Fig. S4C). It is consistent with the worst prognosis of DDR-C among the three subtypes we identified. Clusters C-E harbored poorly differentiated LMS, UPS, MFS, and pleiomorphic sarcomas of the limbs and displayed combinations of expression of genes involved in invasion, extracellular matrix, or inflammatory processes, which were predominant in DDR-C, Str-C, and Imm-C, respectively. Hence, our subtyping method demonstrated a clearer separation of sarcomas with respect to pathways and biological processes compared to previous subtyping methods.

Conclusions
Based on the enrichment scores of 14 pathways associated with immune, stromal, and DDR signatures, we classified sarcomas into three subtypes. The three sarcoma subtypes were characterized by different immune infiltration levels, stromal signatures, DDR activity, genome features, tumor progression phenotypes, and clinical outcomes. Our new classification method for sarcomas provides novel insights into tumor biology and clinical implications for this disease.