Sample selection and RNA-seq data acquisition
We focused on early-passage PDXs due to the similarity of their genomic and transcriptional profiles to those of the original tumor [17]. RNA-seq data for 61 early-passage (passage 0, 1, and 2) tumor xenografts of human origin belonging to 20 distinct patient-derived xenograft (PDX) models were downloaded from the publicly-accessible NCI PDMR website (https://pdmr.cancer.gov/). In this paper, we used the term “replicate” to denote samples from the same tumor implanted into different mice (i.e., biological replicates). Of the 20 PDX models, 19 had three replicate samples from the same passage with available RNA-seq data, while the remaining model had four replicate samples from the same passage. The 20 PDX models covered 15 different cancer subtypes (Additional file 1: Table S1).
The detailed standing operating procedures for the RNA-seq library preparation and data processing can be found in the SOP section of the NCI PDMR website (https://pdmr.cancer.gov/sops/). Briefly, the samples were sequenced on the Illumina HiSeq Sequencing platform. FASTQ files were generated with bcl2fastq (version: 2.17.1.14, Illumina). Adaptors were trimmed within this process using the default cutoff of the adapter-stringency option. PDX mouse reads were bioinformatically removed from the raw FASTQ files using bbsplit (bbtools v37.36). The fastq files were mapped to the human transcriptome based on exon models from hg19 using Bowtie2 (version 2.2.6). The resulting SAM files were converted to BAM format using samtools, and the transcriptomic coordinates from the BAM file were converted to the corresponding genomic (hg19) coordinates using RSEM (version 1.2.31). Gene and transcript level quantification were also performed with RSEM (version 1.2.31). In our comparative study, we focused on the gene level output files, which contained the TPM, FPKM, expected counts, and effective length for 28,109 genes.
Quantification and normalization methods
The aim of the present study was to compare the performance of different RNA-seq gene expression quantification measures for downstream analysis. All gene expression measures included in our study are defined below.
RPKM and FPKM
The measure RPKM (reads per kilobase of exon per million reads mapped) was devised as a within-sample normalization method; as such, it is suitable to compare gene expression levels within a single sample, rescaled to correct for both library size and gene length [1].
FPKM stands for fragments per kilobase of exon per million mapped fragments. It is analogous to RPKM and is used specifically in paired-end RNA-seq experiments [17]. The calculation of RPKM or FPKM for gene i uses the following formula:
$$RPKM_{i} ~{\text{or}}~FPKM_{i} = \frac{{q_{i} }}{{\frac{{l_{i} }}{{10^{3} }}*\frac{{\mathop \sum \nolimits_{j} q_{j} }}{{10^{6} }}}} = \frac{{q_{i} }}{{l_{i} *\mathop \sum \nolimits_{j} q_{j} }}*10^{9}$$
where \(q_{i}\) are raw read or fragment counts, \(l_{i}\) is feature (i.e., gene or transcript) length, and \(\mathop \sum \limits_{j} q_{j}\) corresponds to the total number of mapped reads or fragments. The RSEM output files containing RNA-seq data for the selected samples downloaded from the NCI PDMR include both FPKM and TPM expression values.
TPM
TPM was introduced in an attempt to facilitate comparisons across samples. TPM stands for transcript per million, and the sum of all TPM values is the same in all samples, such that a TPM value represents a relative expression level that, in principle, should be comparable between samples [18].
$$TPM_{i} = \frac{{q_{i} /l_{i} }}{{\mathop \sum \nolimits_{j} \left( {q_{j} /l_{j} } \right)}}*10^{6}$$
where qi denotes reads mapped to transcript, li is the transcript length, and \(\mathop \sum \limits_{j} (q_{j} /l_{j} )\) corresponds to the sum of mapped reads to transcript normalized by transcript length.
The TPM measure can easily be converted to FPKM: \(TPM_{i} = \left( {\frac{{FPKM_{i} }}{{\mathop \sum \nolimits_{j} FPKM_{j} }}} \right)*10^{6} .\)
Count normalization methods
The R package tximport was used to prepare gene level count data from RSEM output files [19]. Subsequently, normalized count data were derived using the DESeq2 package [20]. The normalization approach used by DESeq2 is to form a “virtual reference sample” by taking the geometric mean of counts over all samples for each gene [20]. Then, DESeq2 normalizes each sample to this virtual reference to get one scaling factor per sample.
TMM stands for a weighted trimmed mean of M values, which are gene-wise log-fold change quantities originally defined by Robinson and Oshlack [21]. Normalization using the TMM method was performed on count data generated from tximport with the ‘tmm’ function in Bioconductor package NOISeq [22]. The TMM normalization method is also implemented in the edgeR package [21].
Z-score normalization on TPM-level data
Z-score normalization is considered a centering and variance stabilization method. Z-score on TPM-level data was calculated using the following formula:
$$Z_{{ij}} = \frac{{log_{2} \left( {TPM_{{ij}} + 1} \right) - median\left( {log_{2} \left( {TPM_{i} + 1} \right)} \right)}}{{SD\left( {log_{2} \left( {TPM_{i} + 1} \right)} \right)}}$$
where the indices i and j stand for gene and sample index, respectively; and SD stands for standard deviation.
Measures of variation
Hierarchical clustering
The R function ‘hclust’ was used for sample clustering based on gene expression matrices. The distance matrix is based on 1 − r, where r is the Pearson correlation coefficient between sample pairs. Ward’s minimum variance method (i.e., linkage method option ‘ward.D2’) was used as the agglomeration method [23, 24]. Euclidean distance metric was also computed to evaluate which measure could more closely align the replicates, in terms of absolute expression measures, for each PDX model.
Median CV
The coefficient of variation (CV) was defined as the ratio of the standard deviation to the mean expression of each gene across replicate samples within each of the 20 PDX models. The median CV, as well as the interquartile range, were documented for each PDX model.
Intraclass correlation coefficient (ICC)
For each PDX model, an intraclass correlation coefficient, denoted by ICCg, was computed to examine the impact of each quantification measure on the variability between genes relative to the total variation (across genes and replicate samples) [24,25,26].
This analysis was based on a components of variance model:
$$Y_{{ij}} = g_{i} + e_{{ij}}$$
where \(Y_{{ij}}\) denotes the log transformed unit of gene i in the replicate j for a particular model. The error variance component \(\sigma _{e}^{2}\) associated with \(e_{{ij}}\) (technical error) reflects the reproducibility of the measure. The variance component \(\sigma _{g}^{2}\) associated with \(g_{i}\) (true gene expression) represents the true gene-to-gene variability.
The intra-class correlation (ICCg) for each PDX model is defined as
$$ICC_{g} = \frac{{\sigma _{g}^{2} }}{{\sigma _{g}^{2} + \sigma _{e}^{2} }}$$
and estimated by the following equation defined by Shrout et al. [25]:
$$\frac{{MS_{g} - MS_{e} }}{{MS_{g} + \left( {k - 1} \right)MS_{e} }}$$
where \(MS_{g}\) is the between-genes mean squares, \(MS_{e}\) is the between-samples mean squares, k is the number of samples. The ICCg, which ranges between 0 and 1, estimates the proportion of the total variance due to the between-gene variance. Larger ICCg values indicate higher similarity (i.e., agreement) between replicate samples while preserving biological differences among genes within a PDX model. Computing an ICCg for each PDX model, as described above, resulted in a set of 20 ICCg values for each quantification method.
Next, in order to evaluate which measure can better preserve true biological differences within the same gene across different PDX models, another version of intraclass correlation, denoted by ICCm, was computed for each gene. This metric allowed for examination of the impact of each quantification measure on the variability between PDX models relative to the total variation (across models and replicate samples). This analysis was based on a components of variance model:
$$Y_{{ij}} = m_{i} + e_{{ij}}$$
where \(Y_{{ij}}\) denotes the log transformed unit of PDX model i in the replicate j for a particular gene. For simplicity of notation, gene index was not included in the formula. The error variance component \(\sigma _{e}^{2}\) associated with \(e_{{ij}}\) (technical error) reflects the reproducibility of the measure. The variance component \(\sigma _{m}^{2}\) associated with \(m_{i}\) (true gene expression) represents the true model-to-model variability.
The intra-class correlation (ICCm) for each gene is defined as
$$ICC_{m} = \frac{{\sigma _{m}^{2} }}{{\sigma _{m}^{2} + \sigma _{e}^{2} }}$$
and estimated by the following equation defined by Shrout et al. [25]:
$$\frac{{MS_{m} - MS_{e} }}{{MS_{m} + \left( {k - 1} \right)MS_{e} }}$$
where \(MS_{m}\) is the between-models mean squares,\(MS_{e}\) is the between-samples mean squares, k is the number of samples. The ICCm, which ranges between 0 and 1, estimates the proportion of the total variance due to the between-model variance. Larger ICCm values indicate higher similarity (i.e., agreement) between replicate samples. Computing an ICCm for each gene, as described above, resulted in a set of 28,109 ICCm values for each quantification method. A known feature of the ICC estimator used here is that sometimes it could produce negative values when the true ICC is close to zero and sample size is small. For practical purposes, these negative estimates of ICC are considered to be equivalent to ICC ≈ 0.
Model 947758-054-R is the only model that has four replicates, while the other 19 models all have three replicates. For simplicity, the first three replicates of model 947758-054-R were selected to form a uniform data matrix (20 × 3 for each gene) for the calculation of ICC for each gene. The resulting balance in number of replicates allowed for easier calculation of the ICCg and ICCm estimates using the irr R package (version 0.84.1) [25, 26].
Calculation of percentages of TPM for the top five most abundant genes
To help identify what may cause transcript distribution differences between replicates, we calculated the percentage of TPM for the top five most abundant genes. For each PDX model, the 28,109 genes were first sorted by the sum of TPMs across the replicate samples. The TPM percentages of the top five most abundant genes in each replicate was then calculated as the sum of TPMs corresponding to the top five most abundant genes identified for each model divided by 106.