A robust qualitative transcriptional signature for the correct pathological diagnosis of gastric cancer

Background Currently, pathological examination of gastroscopy biopsy specimens is the gold standard for gastric cancer (GC) diagnosis. However, it has a false-negative rate of 10–20% due to inaccurate sampling locations and/or insufficient sample amounts. A signature is needed to aid the early diagnosis of GC using biopsy specimens even when they are sampled from inaccurate locations. Methods We extracted a robust qualitative transcriptional signature, based on the within-sample relative expression orderings (REOs) of gene pairs, to discriminate both GC tissues and adjacent-normal tissues from non-GC gastritis, intestinal metaplasia and normal gastric tissues. Results A signature consisting of two gene pairs for GC diagnosis was identified and validated in data from both biopsy specimens and surgical resection specimens, pooled from publicly available datasets measured by different laboratories with different platforms. For gastroscopy biopsy specimens, 96.20% of 79 non-GC tissues were correctly identified as non-GC, and 96.84% of 158 GC tissues and six of seven adjacent-normal tissues were correctly identified as GC. For surgical resection specimens, 98.37% of 2560 GC tissues and 97.28% of 221 adjacent-normal tissues were correctly identified as GC. In particular, 97.67% of the 257 stage I GC patients were correctly diagnosed as GC. We additionally measured 21 GC tissues from seven different GC patients, each with three specimens sampled from three tumor locations with different proportions of tumor epithelial cells. All these GC tissues were correctly identified as GC, even when the proportion of tumor epithelial cells was as low as 14%. Conclusions The qualitative transcriptional signature can distinguish both GC and adjacent-normal tissues from normal, gastritis and intestinal metaplasia tissues of non-GC patients, even in inaccurately sampled biopsy specimens, and can be applied robustly at the individual level to aid early GC diagnosis.
Electronic supplementary material The online version of this article (10.1186/s12967-019-1816-4) contains supplementary material, which is available to authorized users.


Background
Gastric cancer (GC) is one of the most common malignant tumors and has a high mortality rate [1][2][3]. GC patients at an early stage can benefit from surgical resection [4,5]. However, only about 10-20% of GC patients are diagnosed at an early stage [6,7]. Currently, pathological examination of gastroscopic biopsy tissue is still the most effective approach for confirming GC [8,9]. However, the result of this examination depends on the skill and experience of the endoscopists and pathologists [10][11][12]. The false-negative rate of GC diagnosis has been reported to be 10-20% [13][14][15][16][17][18]. Among the false-negative samples, 85.2% are at the early stage [19], and 71.4% are wrongly diagnosed as gastritis, ulcer or "suspicious lesion" [16]. Most of the false-negative samples (73%) are caused by inaccurate sampling locations, and the remainder (27%) can be attributed to pathologist errors [16].
Therefore, it is vitally important to develop an objective molecular signature to complement the existing subjective diagnostic technique of histology, which could help pathologists identify early GC even when the sampling location of the gastroscopic biopsy tissue is inaccurate. This is possible because GC adjacent-normal tissues might also acquire some molecular characteristics similar to those of GC [20,21]. However, most of the reported diagnostic signatures were identified using GC adjacent-normal tissues as the normal samples [22][23][24], which leads to false-negative diagnoses when the location of the gastroscopic biopsy tissue is inaccurate [13]. Another critical limitation of previously reported diagnostic signatures is that they are based on risk scores summarized from quantitative gene expression measurements of the signature genes [22,23,25], which are highly sensitive to measurement batch effects and laboratory differences and thus cannot be robustly applied to independent samples [26][27][28], even with data normalization [29]. Fortunately, it has been reported that the within-sample relative expression orderings (REOs) of genes are robust against experimental batch effects [30,31]. Besides, we have shown that the within-sample REOs are robust to varying proportions of tumor epithelial cells in tumor tissues sampled from different tumor locations [32], to partial RNA degradation during specimen preparation and storage [33], and to RNA amplification bias in minimal specimens. Notably, Zheng et al. identified the within-sample REO of one pair of microRNAs (hsa-miR-196a and hsa-miR-148a) as a qualitative GC diagnosis signature using GC and normal gastric mucosa samples [34]. However, the performance of this signature in identifying gastritis, intestinal metaplasia and cancer adjacent-normal samples was not evaluated [34].
In this study, we aimed to identify a signature that can discriminate GC tissues, including inaccurately sampled GC adjacent-normal tissues, from non-GC tissues including gastritis, intestinal metaplasia and normal gastric tissues. A signature consisting of two gene pairs was identified in the training data and validated in multiple datasets measured by different laboratories with different platforms. It remained accurate even when the proportion of tumor epithelial cells was as low as 14%.

Samples and data measurement
We measured 21 GC specimens from seven GC patients. For each patient, three specimens were sampled from three different tumor locations. The proportion of tumor epithelial cells (about 14%-93%) was measured by pathological section analysis (see Table 1). The baseline characteristics of the seven GC patients are shown in Additional file 1: Table S1. All cancer specimens were collected from the operating room immediately after surgical resection and were fresh frozen for subsequent RNA extraction. This study was approved by the institutional review boards of all participating institutions, and written consent forms were obtained from all participants.
Total RNA was isolated from fresh frozen GC tissues using Trizol reagent (Invitrogen) according to the manufacturer's protocol. The quality of RNA was assessed using an Agilent 2200 TapeStation (Agilent Technologies, US) to ensure high quality (RNA integrity number > 6). Then, 1-2 μg of total RNA was used for mRNA capture using the NEBNext Poly(A) mRNA Magnetic Isolation Module, and stranded RNA-seq libraries were constructed using a NEBNext Ultra Directional RNA Library Prep Kit. The 2 × 150 paired-end sequencing was performed on an Illumina HiSeq X Ten (Illumina, US). The resulting raw RNA-seq files (.fastq) were preprocessed using Trimmomatic [35], and reads were aligned to the reference genome (GRCh37) using hisat2 [36]. Finally, the reads per kilobase per million mapped reads (RPKM) values of genes were computed using StringTie [37] to represent the expression levels of genes. The data have been submitted to Gene Expression Omnibus (GEO, GSE116782).
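For orientation, the RPKM unit named above can be written out as a small function. This is a minimal sketch of the standard RPKM definition only, not of StringTie's internal computation; the function name and the example numbers are hypothetical.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of gene length per million mapped reads.

    RPKM = read_count * 1e9 / (gene_length_bp * total_mapped_reads)
    """
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and library size must be positive")
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# Hypothetical gene: 500 reads on a 2,000 bp gene,
# in a library of 20 million mapped reads.
example_rpkm = rpkm(500, 2000, 20_000_000)
```

Because the library size is the same for every gene within one sample, it rescales all genes of that sample equally and does not change their within-sample ordering.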

Public data and preprocessing
Gene expression profiles of gastric tissues measured by the Affymetrix, Illumina or RNA-seq platform were collected from the GEO and The Cancer Genome Atlas (TCGA) data portal (http://tcga-data.nci.nih.gov/tcga/), as described in Table 2.
For the gene expression profiles measured by the Affymetrix platform, the raw data (.CEL files) were downloaded and preprocessed using the Robust Multi-array Average algorithm for background adjustment without quantile normalization [38]. For the gene expression profiles measured by the Illumina platform, the processed data were directly downloaded and used for the following analysis. For the gene expression profiles from TCGA measured by RNA-seq, the level 3 data were directly downloaded for our analysis.
For the array-based data, every probe ID was mapped to Entrez gene ID using the corresponding platform file. If multiple probes were mapped to a gene, the expression level of this gene was summarized as the arithmetic mean of the values of these probes.
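The probe-to-gene summarization described above can be sketched as follows. This is an illustrative sketch rather than the authors' code (which is in R, Additional file 2). The function name, the probe IDs and the gene IDs are hypothetical, and skipping unmapped probes is an assumption not stated in the text.

```python
from collections import defaultdict


def summarize_probes(probe_values, probe_to_gene):
    """Collapse probe-level values of one sample to gene level.

    probe_values:  dict, probe ID -> expression value
    probe_to_gene: dict, probe ID -> Entrez gene ID
    Probes without a mapping are skipped (assumption).
    Genes hit by several probes get the arithmetic mean of those probes.
    """
    values_by_gene = defaultdict(list)
    for probe, value in probe_values.items():
        gene = probe_to_gene.get(probe)
        if gene is not None:
            values_by_gene[gene].append(value)
    return {gene: sum(v) / len(v) for gene, v in values_by_gene.items()}


# Hypothetical probe IDs, gene IDs and values, for illustration only.
probe_values = {"probe_1": 5.0, "probe_2": 7.0, "probe_3": 4.0, "probe_4": 9.0}
probe_to_gene = {"probe_1": "1001", "probe_2": "1001", "probe_3": "1002"}
gene_values = summarize_probes(probe_values, probe_to_gene)
```

Here the hypothetical gene "1001" receives the mean of its two probes, and "probe_4", which has no mapping, is dropped.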

Developing the diagnostic signature
The gene expression profiles of GC, normal and gastritis tissues in the training data were used to identify the REO-based diagnostic signature (Table 2). First, we defined the stable REOs of gene pairs in a type of gastric tissue. The REO of a gene pair (i, j) is denoted as Gi > Gj or Gi < Gj if gene i has a higher or lower expression level than gene j within a sample. The REO of a gene pair is defined as stable if the same REO is kept in at least 99% of the samples. Furthermore, a gene pair (i, j) is defined as a reversal gene pair if its REO is stable in both of two types of gastric tissues but with opposite patterns (Gi > Gj in one type of tissue and Gi < Gj in the other). Here, the gene pairs with the same stable REO pattern in both normal samples and gastritis samples were defined as stable gene pairs of non-GC tissues. We then selected the reversal gene pairs between GC and non-GC tissue samples. These reversal gene pairs were the candidate qualitative REO-based diagnostic signatures. The absolute rank difference for every reversal gene pair in each of the GC or non-GC samples is calculated as follows:
$$R_{ij} = |R_i - R_j|$$
where $R_i$ and $R_j$ represent the ranks of gene i and gene j in a sample, respectively. Let $\mathrm{mean}[R_{ij}(\text{non-gc})]$ and $\mathrm{mean}[R_{ij}(\text{gc})]$ denote the means of the absolute rank differences between gene i and gene j in non-GC tissue samples and GC tissue samples, respectively. Then, their geometric mean ($\mathrm{avgR}_{ij}$) is calculated to evaluate the reversal degree of the gene pair between GC and non-GC tissue samples:
$$\mathrm{avgR}_{ij} = \sqrt{\mathrm{mean}[R_{ij}(\text{non-gc})] \times \mathrm{mean}[R_{ij}(\text{gc})]}$$
The larger the geometric mean for a reversal gene pair, the larger the reversal degree of the REO of the gene pair between GC and non-GC tissue samples. All reversal gene pairs were sorted in descending order according to their geometric means. Finally, we took the top k reversal gene pairs as a signature according to the reversal degrees of the identified reversal gene pairs, and a given sample was identified as GC tissue when at least half of the gene pairs in the signature exhibited the REO pattern of GC; otherwise, it was identified as non-GC tissue. The signature that achieved the highest classification accuracy in the training data was defined as the GC diagnosis signature. All the analysis programs to develop the diagnostic signature were written in the R language (R 3.1.3). The program code is provided in Additional file 2.
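The pair-selection and ranking steps of this section can be sketched in a few lines. This is an illustrative Python sketch, not the authors' R code from Additional file 2. All function names and the toy expression values are hypothetical, each sample is assumed to be a dict from gene to expression value over the same genes, and ties in ranking are broken arbitrarily.

```python
from itertools import combinations
from math import sqrt


def stable_reos(samples, genes, threshold=0.99):
    """Map (i, j) -> '>' or '<' for pairs whose REO holds in >= threshold of samples."""
    stable, n = {}, len(samples)
    for i, j in combinations(genes, 2):
        higher = sum(s[i] > s[j] for s in samples)
        lower = sum(s[i] < s[j] for s in samples)
        if higher >= threshold * n:
            stable[(i, j)] = ">"
        elif lower >= threshold * n:
            stable[(i, j)] = "<"
    return stable


def shared_stable(stable_a, stable_b):
    """Pairs stable in both groups with the same ordering (e.g. normal and gastritis)."""
    return {p: d for p, d in stable_a.items() if stable_b.get(p) == d}


def reversal_pairs(stable_non_gc, stable_gc):
    """Pairs stable in both groups but with opposite orderings."""
    return [p for p, d in stable_non_gc.items() if p in stable_gc and stable_gc[p] != d]


def mean_abs_rank_diff(samples, pair):
    """Mean over samples of |R_i - R_j|, ranks taken within each sample (1 = lowest)."""
    i, j = pair
    total = 0
    for s in samples:
        rank = {g: r for r, g in enumerate(sorted(s, key=s.get), start=1)}
        total += abs(rank[i] - rank[j])
    return total / len(samples)


def reversal_degree(non_gc, gc, pair):
    """Geometric mean of the two group-wise mean absolute rank differences."""
    return sqrt(mean_abs_rank_diff(non_gc, pair) * mean_abs_rank_diff(gc, pair))


# Toy data with hypothetical genes A, B, C (not real measurements).
genes = ["A", "B", "C"]
normal = [{"A": 1, "B": 5, "C": 9}, {"A": 2, "B": 6, "C": 8}]
gastritis = [{"A": 1, "B": 4, "C": 7}]
gc = [{"A": 9, "B": 5, "C": 1}, {"A": 8, "B": 6, "C": 7}]

stable_non_gc = shared_stable(stable_reos(normal, genes), stable_reos(gastritis, genes))
candidates = reversal_pairs(stable_non_gc, stable_reos(gc, genes))
ranked = sorted(candidates,
                key=lambda p: reversal_degree(normal + gastritis, gc, p),
                reverse=True)
```

The exhaustive loop over all gene pairs is quadratic in the number of genes, so a genome-wide analysis would need a vectorized implementation. The sketch is only meant to make the definitions concrete.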

Performance evaluation
The sensitivity, specificity, accuracy and the area under the curve (AUC) of the receiver operating characteristic (ROC) curve were used to evaluate the performance of the signature. The sensitivity was defined as the proportion of correctly identified GC samples among all GC samples. The specificity was defined as the proportion of correctly identified non-GC samples among all non-GC samples, including normal tissues, gastritis adjacent-normal tissues and gastritis tissues. The accuracy was defined as the proportion of correctly identified samples among all GC and non-GC samples. Here, the nonparametric Hanley-McNeil algorithm was used to calculate the AUC value [39,40], and the 95% confidence interval (CI) for the AUC was computed using a normal approximation.
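The metrics defined above can be sketched as follows. The AUC is the nonparametric (Mann-Whitney) estimate, with the Hanley-McNeil standard error and a normal-approximation CI. Two points are assumptions not stated in the text: the per-sample score used for the ROC curve (here, the number of signature gene pairs voting for GC) and the clipping of the interval to [0, 1]. Function names and toy values are hypothetical.

```python
from math import sqrt


def confusion_metrics(y_true, y_pred):
    """y_true, y_pred: sequences of 1 (GC) and 0 (non-GC)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    n_pos = sum(t == 1 for t in y_true)
    n_neg = len(y_true) - n_pos
    return {
        "sensitivity": tp / n_pos,
        "specificity": tn / n_neg,
        "accuracy": (tp + tn) / len(y_true),
    }


def hanley_mcneil_auc(pos_scores, neg_scores, z=1.96):
    """AUC, Hanley-McNeil standard error, and normal-approximation CI."""
    n1, n2 = len(pos_scores), len(neg_scores)
    # Mann-Whitney estimate: P(pos > neg) + 0.5 * P(pos == neg)
    wins = sum((p > q) + 0.5 * (p == q) for p in pos_scores for q in neg_scores)
    auc = wins / (n1 * n2)
    q1 = auc / (2 - auc)
    q2 = 2 * auc ** 2 / (1 + auc)
    variance = (auc * (1 - auc)
                + (n1 - 1) * (q1 - auc ** 2)
                + (n2 - 1) * (q2 - auc ** 2)) / (n1 * n2)
    se = sqrt(variance)
    # Clipping to [0, 1] is an assumption of this sketch.
    ci = (max(0.0, auc - z * se), min(1.0, auc + z * se))
    return auc, se, ci


# Toy labels and scores (hypothetical, not study data).
metrics = confusion_metrics([1, 1, 1, 1, 0, 0, 0, 0], [1, 1, 1, 0, 0, 0, 0, 1])
auc, se, ci = hanley_mcneil_auc([2, 2, 1, 2], [0, 0, 1, 0])
```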

Identifying the diagnostic gene pair signature
The flowchart for the identification and validation of the qualitative diagnostic signature is described in Fig. 1.
First, we identified gene pairs with an identical REO in at least 99% of 26 gastric normal samples, 23 gastritis samples and 347 GC samples, respectively, using the training data integrated from 11 datasets measured by the Affymetrix or Illumina platform (see Table 2). We found 32,483,417 overlapping gene pairs with the same stable REOs in the gastric normal and gastritis samples, among which six gene pairs had stable but reversed REOs in the GC tissues (Additional files 3 and 4). These six gene pairs were potential GC diagnostic signatures.
We then evaluated the reversal degrees of the six gene pairs with reversal REOs between the GC and non-GC samples (normal and gastritis samples) in the training data (see Methods). According to the reversal degrees of the six gene pairs, we took the top k (1, 2,…, 6) gene pairs as a signature and calculated its classification accuracy (Fig. 2). Finally, the top two gene pairs, consisting of three genes, were defined as the diagnosis signature (Table 3). In the training data, all the 26 gastric normal and 23 gastritis tissues were correctly classified as non-GC samples, and all the 347 GC tissues were correctly classified as cancer samples. The AUC and the accuracy were 0.99 and 100%, respectively. The detailed classification accuracy of the signature in each of the training datasets is shown in Additional file 5: Table S3.
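The majority-vote rule and the top-k selection can be sketched as below. The GC-pattern direction (CYR61 higher than MMP28 and than ACOX1 in GC) follows the fold changes reported in the Discussion. The expression values, the order of the two pairs in the ranked list, and the tie-breaking toward the smallest k are assumptions of this sketch, and the function names are hypothetical.

```python
def classify(sample, signature):
    """Return 1 (GC) if at least half of the pairs show the GC pattern, else 0.

    signature: list of (gene_i, gene_j) where gene_i > gene_j is the GC pattern.
    """
    votes = sum(sample[i] > sample[j] for i, j in signature)
    return 1 if votes >= len(signature) / 2 else 0


def accuracy(samples, labels, signature):
    correct = sum(classify(s, signature) == y for s, y in zip(samples, labels))
    return correct / len(samples)


def best_top_k(samples, labels, ranked_pairs):
    """Evaluate the top k pairs for k = 1..len(ranked_pairs).

    Keeps the smallest k that reaches the highest accuracy (assumption).
    """
    best_k, best_acc = 0, -1.0
    for k in range(1, len(ranked_pairs) + 1):
        acc = accuracy(samples, labels, ranked_pairs[:k])
        if acc > best_acc:
            best_k, best_acc = k, acc
    return best_k, best_acc


# Hypothetical expression values for the three signature genes.
ranked_pairs = [("CYR61", "MMP28"), ("CYR61", "ACOX1")]
samples = [
    {"CYR61": 9.0, "MMP28": 5.0, "ACOX1": 6.0},  # GC: both pairs vote GC
    {"CYR61": 7.0, "MMP28": 8.0, "ACOX1": 6.5},  # GC: one of two pairs votes GC
    {"CYR61": 4.0, "MMP28": 6.0, "ACOX1": 7.0},  # non-GC
    {"CYR61": 5.0, "MMP28": 5.5, "ACOX1": 8.0},  # non-GC
]
labels = [1, 1, 0, 0]
chosen_k, chosen_acc = best_top_k(samples, labels, ranked_pairs)
```

With a two-pair signature, "at least half" means that a single pair showing the GC pattern is enough to call a sample GC, as the second toy sample illustrates.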

Validating the signature
The gene expression profiles of gastric tissues sampled by gastroscopic biopsy or surgical resection were used to validate the performance of the qualitative signature. Non-GC tissues, including normal, gastritis adjacent-normal, gastritis and intestinal metaplasia tissues, from non-GC patients were all sampled by gastroscopic biopsy. The results showed that 96.20% of the 79 non-GC tissues from GSE5081, GSE60662, GSE106656 and GSE34619 were correctly identified as non-GC (Table 4 and Additional file 6). For gastroscopic biopsy specimens, 96.84% of the 158 GC tissues from the GSE14210 and GSE52138 datasets and six of seven GC adjacent-normal tissues from the GSE52138 dataset were correctly identified as GC (Table 4 and Additional file 6). For surgical resection specimens, as described in Table 2, 98.37% of 2560 GC tissues and 97.28% of 221 GC adjacent-normal tissues were correctly identified as GC (Table 4). The surgical resection specimens were measured by multiple platforms including the Affymetrix, Illumina and RNA-seq platforms. For the Affymetrix and Illumina platforms used in the training data, 99.77% of the 2185 GC tissues and all the 189 GC adjacent-normal tissues were correctly classified as GC. Moreover, 95.73% of the 375 GC tissues and 81.25% of the 32 GC adjacent-normal tissues measured by RNA-seq were correctly classified as GC, even though no RNA-seq data were used in training the signature. In particular, 97.67% of the 257 stage I GC patients were correctly identified as GC. The accuracy and AUC in the validation data were 98.55% and 0.99 (95% CI = 0.95-1, Fig. 3), respectively.
To further validate the signature, we additionally measured, using the RNA-seq platform, gene expression profiles of 21 GC tissues from seven different GC patients, each with three specimens sampled from three tumor locations with different proportions of tumor epithelial cells (see Table 1). All the 21 GC tissues were correctly classified as GC by our signature, even when the proportion of tumor epithelial cells was as low as 14% (Table 4). Together, the above results show that the signature can accurately discriminate GC tissues, including GC adjacent-normal tissues, from tissues of non-GC patients, even when the sampling location is inaccurate.

Discussion
At present, the histological analysis of gastroscopic biopsy specimens is affected by the sampling location and tissue amount [8]. In this study, a robust qualitative transcriptional signature, comprising two gene pairs that consist of three genes, was developed to aid the early diagnosis of GC using either gastroscopic biopsy or surgical resection specimens. The signature can accurately distinguish GC tissues from non-GC tissues including normal, gastritis and intestinal metaplasia tissues. As shown in this study, the signature can accurately classify GC tissues as GC even when the proportion of tumor epithelial cells is as low as 14%. In particular, it can identify most GC adjacent-normal tissues as cancer, suggesting that the signature can identify GC even when the sampling location is inaccurate. Notably, most (96.20%) of the non-GC tissues sampled by gastroscopic biopsy were correctly identified as non-GC. However, the number of gastritis and intestinal metaplasia specimens sampled by gastroscopic biopsy is limited, and further studies using large collections of non-GC specimens are warranted.
The amount of total RNA from the gastroscopic biopsy specimens used in this study was about 1-8 µg [41][42][43], which is relatively large. In clinical practice, it is often difficult to obtain a sufficient amount of biopsy specimen for gene expression profiling or other molecular measurements [11,44]. Fortunately, we have shown that REO-based signatures can be robustly applied to specimens with RNA amplification from as little as 150-250 pg total RNA of cancer cells [31]. Therefore, it is highly likely that the two gene pairs could be applied to gastroscopic biopsy specimens with minimal sampling amounts.
We compared the expression levels of the two genes in each of the signature gene pairs. The fold changes (FC) between the two genes in each of the signature gene pairs differed considerably across datasets for the GC, GC adjacent-normal and non-GC groups (Additional files 7 and 8). For the gene pair of CYR61 and MMP28, the median values of FC between CYR61 and MMP28 ranged from 1.17 to 30.56 in the GC group across different datasets, while in the non-GC group the median values of FC ranged from 0.76 to 0.89 (Additional file 7: Table S4). Similar results were observed for the gene pair of CYR61 and ACOX1 (Additional files 7 and 8). Notably, two genes with high expression levels in a sample can hardly reach a large FC even if the absolute expression level difference between the two genes is rather large. Besides, two genes with low expression levels in a sample may reach a large FC simply due to large measurement variations [45]. To show the quantitative expression level difference between the two genes of each signature gene pair more clearly, we also calculated the expression level of CYR61 minus the expression level of MMP28 (ACOX1) in each sample (Additional files 9 and 10).
The median values of the subtraction of MMP28 from CYR61 ranged from 1.30 to 1868.50 in the GC group across different datasets, while in the non-GC group the median values ranged from −2.29 to −0.73 (Additional file 9: Table S5). The results were similar for the gene pair of CYR61 and ACOX1 (Additional files 9 and 10). The subtraction values differed considerably between platforms, and they varied even within the same platform. For example, the median values of the subtraction of MMP28 from CYR61 in the GC group ranged from 2.84 to 1868.5 for GPL6947 (Additional files 9 and 10). The above results show that the quantitative differences (such as FC and subtraction values) within each signature gene pair vary considerably across samples in both the GC and non-GC groups, because quantitative gene expression measurements are affected by measurement batch effects and many other factors such as sample quality [29,31,46]. In contrast, the REOs of the gene pairs within each group are very stable.
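The contrast between unstable quantitative differences and a stable REO can be illustrated with a toy calculation. In this sketch a batch effect is modeled, as a deliberate simplification, by a monotone increasing affine transform of the measured values. Real batch effects need not have this form. The expression values, scale factors and offsets are hypothetical.

```python
def reo(sample, gene_i, gene_j):
    """Within-sample relative expression ordering of a gene pair."""
    return ">" if sample[gene_i] > sample[gene_j] else "<"


def distort(sample, scale, offset):
    """Hypothetical batch effect: value -> scale * value + offset, with scale > 0."""
    return {gene: scale * value + offset for gene, value in sample.items()}


# One hypothetical GC sample measured under three hypothetical batch conditions.
sample = {"CYR61": 12.0, "MMP28": 4.0}
batches = [distort(sample, scale, offset)
           for scale, offset in [(1.0, 0.0), (3.0, 2.0), (0.5, 10.0)]]

differences = [b["CYR61"] - b["MMP28"] for b in batches]   # varies by batch
fold_changes = [b["CYR61"] / b["MMP28"] for b in batches]  # varies by batch
orderings = [reo(b, "CYR61", "MMP28") for b in batches]    # identical in all batches
```

Under this simplified model the subtraction value and the fold change both change from batch to batch, while the ordering CYR61 > MMP28 is preserved, because any monotone increasing transform keeps the within-sample ranking.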
We additionally evaluated the performance of the signature on other types of cancer, including liver, colorectal and pancreatic cancers (Additional file 11: Table S6). As shown in Additional file 12: Table S7, the signature was unsuitable for these types of cancer. Notably, the signature can classify cancer tissues of liver, colorectum and pancreas as cancer although it cannot correctly classify