Comprehensive analysis of transcriptome profiles in hepatocellular carcinoma

Background Hepatocellular carcinoma is the second most deadly cancer with late presentation and limited treatment options, highlighting an urgent need to better understand HCC to facilitate the identification of early-stage biomarkers and uncover therapeutic targets for the development of novel therapies for HCC. Methods Deep transcriptome sequencing of tumor and paired non-tumor liver tissues was performed to comprehensively evaluate the profiles of both the host and HBV transcripts in HCC patients. Differential gene expression patterns and the dys-regulated genes associated with clinical outcomes were analyzed. Somatic mutations were identified from the sequencing data and the deleterious mutations were predicted. Lastly, human-HBV chimeric transcripts were identified, and their distribution, potential function and expression association were analyzed. Results Expression profiling identified the significantly upregulated TP73 as a nodal molecule modulating expression of apoptotic genes. Approximately 2.5% of dysregulated genes significantly correlated with HCC clinical characteristics. Of the 110 identified genes, those involved in post-translational modification, cell division and/or transcriptional regulation were upregulated, while those involved in redox reactions were downregulated in tumors of patients with poor prognosis. Mutation signature analysis identified that somatic mutations in HCC tumors were mainly non-synonymous, frequently affecting genes in the micro-environment and cancer pathways. Recurrent mutations occur mainly in ribosomal genes. The most frequently mutated genes were generally associated with a poorer clinical prognosis. Lastly, transcriptome sequencing suggest that HBV replication in the tumors of HCC patients is rare. HBV-human fusion transcripts are a common observation, with favored HBV and host insertion sites being the HBx C-terminus and gene introns (in tumors) and introns/intergenic-regions (in non-tumors), respectively. HBV-fused genes in tumors were mainly involved in RNA binding while those in non-tumors tissues varied widely. These observations suggest that while HBV may integrate randomly during chronic infection, selective expression of functional chimeric transcripts may occur during tumorigenesis. Conclusions Transcriptome sequencing of HCC patients reveals key cancer molecules and clinically relevant pathways deregulated/mutated in HCC patients and suggests that while HBV may integrate randomly during chronic infection, selective expression of functional chimeric transcripts likely occur during the process of tumorigenesis.

4 sampling cycles generating more than 7 fusion sites on chromosome 10. Hence the empirical pvalue was 6.1×10 -4 (610/1,000,000). After multiple test correction, FDR was found to be 0.01464, which is less than 0.05, and thus is considered statistically significant. The R Project for Statistical Computing was used to perform the random sampling and the probability calculations. Similar method was used to assess whether there was significant enrichment of breakpoints in various genic regions.

Pathway analysis of differentially expressed genes and genes with somatic mutations
Differentially Expressed Genes were analyzed using Ingenuity Pathway Analysis to identify significantly enriched Canonical Pathways and Upstream Regulators. A right-tailed Fisher's Exact Test (FET) was used to test for significant overlap between the differentially expressed genes between T and NT, as well as the genes in the categories of Canonical Pathways and Upstream Regulators respectively, as defined by Ingenuity Knowledge Base TM . After multiple test correction, FDR of less than 0.01 was defined to be statistically significant for the FET.
An Activation Z-score was used to predict the activation status of the canonical pathways and upstream regulators based on the expression level of the differentially expressed genes associated with the canonical pathway and upstream regulators. The associated differentially expressed genes were compared with their expected expression level based on literature curated in Ingenuity Knowledge Base TM . An absolute Z-score above 2 was considered to be significant and this allows for the prediction of the activation or inhibition status of a pathway or regulator.
To identify the functions of genes carrying somatic mutation, the Database for Annotation, Visualization and Integrated Discovery (DAVID) was utilized for the annotation of functions. 5 Gene ontology (biological process -FAT, cellular compartment -FAT, molecular function-FAT) and KEGG pathway analysis were performed for the gene groups of interests. After Benjamini-Hochberg correction, FDR<0.05 was used as the cutoff for statistical significance.

Detection of somatic mutations
The output BAM files from Tophat alignments were used for generating the pileup files using Samtools and the consensus of the reads covering each base were obtained. For each position where the consensus base was different from the reference (indicated by the presence of "alternative base"), statistics of the numbers of high-quality (Phred-score quality score ≥ 50) reads carrying the alternative and reference bases were computed. Only the mutation events that contain at least five high-quality alternative reads in the tumor samples and five high-quality reads with reference allele in the non-tumorous samples were determined as tumor-specific somatic mutations in this study.

Clinical association and survival analysis
Most of the clinical parameters were binary variables e.g. vascular invasion and no vascular invasion. We grouped the patients with histological grade=1 or 2 as low-grade group, while those with grade=3 or 4 as high-grade group. Patients at stage I and II were grouped as early-stage, patients while the others were at late stage.
For analysis of association between clinical parameters and gene expression, we first calculated the ratio of gene expression in tumor to expression in paired non-tumor tissue i.e. T/NT ratio and performed log transformation of all the T/NT ratios. Then we compared the log ratios in two groups (E.g. patients with and without vascular invasion for each of the human genes) using Student's t-test followed by Benjamini-Hochberg correction. Genes that exhibit significant 6 differences in T/NT ratios between the two groups (FDR < 0.05) were identified to be associated with clinical parameters.
Based on the presence of somatic mutations in a particular gene, we grouped the patients into two categories and built a 2×2 contingency table corresponding to a particular clinical phenotype e.g. with and without vascular invasion. FET was used to identify the association between mutations and clinical phenotype, and associations with p-value<0.05 were reported.
Cox proportional hazards modeling was employed to analyze the association between gene expression and overall survival. For each of the genes, the ratio of the gene expression in tumor to that in the adjacent non-tumor tissue was calculated for every patient and used as the explanatory variable in the Cox proportional hazards regression model.
Kaplan-Meier survival analysis was performed for identification of mutated genes associated with overall survival outcome. For each of the human genes in which mutations were identified in the tumors of at least five HCC patients, we classified the patients into two groups based on the presence of somatic mutations within the genes. Kaplan-Meier curves were plotted for the two groups and Log-rank test was used for comparison of the two survival curves. Statistical significance was determined as Log-rank p-value<0.05.