Analysis of prognosis, genome, microbiome, and microbial metabolome in different sites of colorectal cancer

Background The colorectum includes ascending colon, transverse colon, descending colon, sigmoid colon, and rectum. Different sites of colorectal cancer (CRC) are different in many aspects, including clinical symptoms, biological behaviour, and prognosis. Purpose This study aimed to analyse prognosis, genes, bacteria, fungi, and microbial metabolome in different sites of CRC. Methods The Surveillance, Epidemiology, and End Results (SEER) database and STAT were used to statistically describe and analyse the prognosis in different sites of CRC. RNA sequences of CRC from Broad Institute’s GDAC Firehose were re-annotated and reanalysed based on different sites using weighted gene co-expression network analysis (WGCNA). The Kaplan–Meier method was used to analyse the prognosis and Cytoscape was used to construct a drug-target network based on DGIdb databases. Bacterial 16S V3–V4 and fungal ITS V3–V4 ribosomal RNA genes of stool samples were sequenced. Gas chromatography/mass spectrometry (GS/MS) was performed to detect the microbial metabolites in stool samples. Bioinformatics analysis was performed to compare distinct gut microorganisms and microbial metabolites between rectal and sigmoid cancers. Results The prognosis in CRC with different sites is significantly different. The closer to the anus predicted longer survival time. The difference between genes and co-expression pairs in CRC with different sites were constructed. The relative abundance of 112 mRNAs and 26 lncRNAs correlated with the sites of CRC were listed. Nine differentially expressed genes at different sites of CRC were correlated with prognosis. A drug-gene interaction network contained 227 drug-gene pairs were built. The relative abundance of gut bacteria and gut fungus, and the content of microbe-related metabolites were statistically different between rectal and sigmoid cancers. Conclusions There are many differences in prognosis, genome, drug targets, gut microbiome, and microbial metabolome in different colorectal cancer sites. These findings may improve our understanding of the role of the CRC sites in personalized and precision medicine.


Introduction
The length of the colorectum in healthy adults is about 1 m, mainly including the ascending colon, transverse colon, descending colon, sigmoid colon, and rectum depending on location [1]. Colorectal cancer (CRC) is one of the most common malignancies, and morbidity and mortality are rapidly increasing [2,3]. According to the anatomical location, the colorectal cancer is divided into the right-sided CRC including the ascending colon and the right half of the transverse colon cancer, and the left-sided CRC including the sigmoid colon, the descending colon and the left half of the transverse colon.
The clinical features of left-sided and right-sided CRC were different in outcomes, prognoses, and clinical responses to chemotherapy [4]. Most studies showed that the prognoses of patients with left-sided CRC were better than those of patients with right-sided CRC [5][6][7][8]. Intestinal obstruction is more likely to occur on the left-sided because the right-sided colon wall is thin and easy to expand [9][10][11]. The more abundant blood supply in the right-sided colon results in faster tumour growth, more prone to necrosis, bleeding, and secondary infection and leads to the more common clinical characteristics including anaemia, emaciation, fever, and dyscrasia [12,13]. From the perspective of pathological classification, the right-sided CRC has more mucinous type cancer, poorly differentiated cancer, and advanced tumour-node-metastasis stage [14]. The right-sided CRC is more frequently BRAF mutated, deficient in mismatch repair and microsatellite instability from the tumour molecule [6,15,16]. The division of left-sided and right-sided CRC also affects clinical decisions. For instance, the cetuximab (epidermal growth factor receptor inhibitor) has a better effect than bevacizumab (vascular endothelial growth factor receptor inhibitor) in left-sided CRC patients with wild-type K-RAS, N-RAS and BRAF genes [17].
It may be important to reiterate the pathological relevance of the CRC sites in the management and implication on future clinical and scientific research of personalized medicine. However, we forward our view that it is not detailed enough for the left-right division of CRC, which is about 1 m. It is even more unreasonable to make an early clinical prediction and select therapeutic regimen based on the left-right division. In the present study, the location of CRC was divided in more detail to analyse prognosis, genes, proteins, bacteria, fungi, and microbial metabolome.
The SEER Program, as a public database, provides numerous cancer-related data and statistics among the U.S. population [18,19]. We attempted to clarify that the CRC at different sites has different prognosis by retrieving and analysing the sites and outcome of CRCs in the SEER database. The occurrence of CRC is the result of the combined action of genetic factors and environmental factors [20]. With the progress of molecular biology technology in recent years, it has been found that many molecular signalling pathways, such as PI3K/AKT signalling pathway, and Wnt/β-catenin signalling pathway, are involved in the pathogenesis of CRC [21][22][23][24]. The colorectum, as the storage place of faeces, has the largest number and diversity of microorganisms in the human body. These microorganisms are greatly influenced by the acquired diet, living habits, living, and working environment [25]. Micro ecological environment composed of microorganisms and microbial related metabolites is related to the occurrence and development of CRC [26][27][28][29]. Many gut microbes, such as Fusobacteria, Streptococcus and Clostridium [25,[30][31][32], and microbial metabolites, such as hydrolytic, reductive enzymes [33], O(6)-methyl guanine [34], short chain fatty acids and secondary bile acids, are involved in the occurrence of CRCs [26,35,36]. Thus, these factors, including prognosis, RNAs, bacteria, fungi, and microbial metabolites, were selected as indicators to analyse the differences at different sites of CRC in the present study.

SEER database retrieval
SEER*Stat software (seer.cancer.gov/seerstat) was used to access the SEER data after signing a research data agreement. All patients diagnosed with colon and rectal adenocarcinoma were included. An analysis of colorectal cancer cases from 2006 to 2015, with follow-up through 2017 was conducted. The included samples were divided according to different sites, including the cecum, ascending colon, transverse colon, descending colon, sigmoid colon, and rectum. Multivariate overall survival analysis and multivariate CRC-specific survival analysis were carried out, and K-M survival analysis was used to analyse the relationship between the sites and prognosis. The alignment nomogram was constructed to describe the contribution of clinical features to prognosis. The SEER database retrieval strategy is shown in Additional file 1: Figure S1.

Differential gene screening
Differential gene analysis strategy is shown in Additional file 2: Figure S2. Clinical data and RNA-seq (exon quantification) data were obtained from Broad Institute's GDAC Firehose (http://gdac.broad insti tute.org). CRCs at different sites, including ascending colon, transverse colon, descending colon, sigmoid colon, and rectum, were included. The RNA-seq data were re-annotated with lncRNA and mRNA, and then screened for the differentially expressed RNA by using the EdgeR TMM  [37,38].
Weighted gene co-expression network analysis (WGCNA) was used to analyse the co-expressed gene module. WGCNA of R package (Version 1.61, https :// cran.r-proje ct.org/web/packa ges/WGCNA /) [39] was used to screen for gene sets associated with colon sites. Soft-threshold (power) analysis was used to perform the Pearson correlation analysis for the expression profile and construct a weighted network. The power value was defined as the square of log(k) in the network and log(p(k)) correlation coefficient fist reached 0.95. In module mining, the minimum number of genes for each module is limited to 30, and a threshold of 0.25 is used to combine similar modules. Module eigengenes (MEs) are the main components of the gene principle component analysis in a module, and they represent the overall expression mode of this module. The phenotypic correlation modules were found by calculating the correlation between MEs and traits. Downstream analysis was performed for the modules with the most significant correlation under each phenotype (The largest correlation coefficient and p < 0.05). The coexpression network of the genes (lncRNA/mRNA) was constructed by Cytoscape software [40]. The two genes with the correlation coefficient weight more than 0.1 were included in the network.
The Database for Annotation, Visualization and Integrated Discovery (DAVID) v6.8 was used to perform gene ontology (GO), biological process (BP), and Kyoto Encyclopaedia of Genes and Genomes (KEGG) pathway analysis for modules with significant correlation among CRC sites.
The Kaplan-Meier method was used to analyse the relationship between differentially expressed genes and prognosis. The relationship between the screened differential RNA and drug targets was predicted based on drug prediction databases DGIdb (version: 3.0, http://www. dgidb .org/searc h_inter actio ns) [41]. Cytoscape is used to construct a drug-target network.

Stool samples collection
The patients with CRC admitted to the department of gastrointestinal surgery in Huzhou Central Hospital from January 2018 to December 2018 were recruited. All patients were pathologically diagnosed with adenocarcinoma and volunteered for the study. CRC patients with distant organ metastasis, complicating other gut diseases, such as ulcerative colitis and Crohn's disease, complicating multiple primary tumours, and known primary organ failure were excluded. Approximately 30-50 g of stool samples were collected before treatment (chemotherapy, radiotherapy or surgery) and stored at − 80 °C within half an hour of sample collection. The stool samples were eliminated in the CRC patients with diarrhoea, constipation, or bloody stool, the use of oral microbial agents within 1 month, the use of a purgative or lubricant within 1 week prior to collection. The ethical approval and the informed consent form were approved by the Chinese Clinical Trial Registry (http://www.chict r.org.cn, No. ChiCTR1800018908) and Ethics Committee of Huzhou Central Hospital (No. 201601023).

Detection of gut microorganisms PCR amplification of bacterial 16S rDNA and fungal ITS rDNA
Prior to gut microorganism detection, the stored stool samples were tested for quality, including storage time, colour, weight, and stool characteristics. The E.Z.N.A. ® Stool DNA Kit (Omega Bio-Tek, Norcross, GA, U.S.) was used to extract total DNA. The nanodrop ND-1000 spectrophotometer (LabTech, Washington, DC, USA) with absorbances at 260 nm and 280 nm (A260/A280) was used to detect the purity of DNA. The integrity and quality of DNA were detected by electrophoresis on 2.0% (w/v) agarose gel. The PCR amplification reaction included 5 μl of DNA template, 2 μl of Nextera XT Index Primer 1 (10 μM), 2 μl of Nextera XT Index Primer 2 (10 μM), and 16 μl of ddH 2 O. The primer sequences were as follows: 16S V3-V4 rDNA: forward, CCT ACG GGNGGC WGC AG; reverse, GAC TAC HVGGG TAT CTA ATC C. ITS rDNA: forward, 5′-CTT GGT CAT TTA GAG GAA GTAA-3′ and reverse, 5′-GCT GCG TTC TTC ATC GAT GC-3′). The PCR amplification procedure was as follows, 95 °C for 3 min, 25 cycles at 95 °C for 30 s, 55 °C for 30 s, 72 °C for 45 s, and 72 °C for 5 min. Three biological replicates and three technical replicates were performed for each experiment. The 2% agarose gels, AxyPrep DNA Gel Extraction Kit (Axygen Biosciences, Union City, CA, USA) and QuantiFluor ™ -ST (Life Technologies, Invitrogen) were used to extract, purify and quantify the amplicons, respectively.

Sequencing of bacterial 16S rDNA and fungal ITS rDNA
MiSeq library was constructed as follows: the PCR products were ligated with the Y adapter. The magnetic nanoparticles were used to remove self-ligated Y adapters. The pooled DNA products constructed an Illumina Pair-End library and treated with NaOH solution. The amplicon library was sequenced using the Illumina MiSeq platform (Shanghai BIOZERON Co., Ltd.) according to standard protocols. The raw reads were deposited into the NCBI Sequence Read Archive (SRA) database.

Sequencing data analysis
The raw data were optimized as follows: sequences without primers were removed using Cutadapt version 1.11. The allowable error for primer matching process was 0.15. PE reads were assembled using Pandaseq, version 2.9. The allowable overlap was longer than 10 bp. The mosaic sequences longer than 300-480 bp or reads with an average quality score less than 20 were discarded. After filtering and trimming, the sequencing data was used for clustering OTUs (operational taxonomic units) and taxonomic analysis. The clean reads of 16S rDNA sequences were analysed and compared based on the Silva database (Release 119 http://www.arb-silva .de) [42]. The ITS rDNA sequences were analysed and compared based on Unite database (Release 6.0 http://unite .ut. ee/index .php) [43]. The Mothur software [44] was used to acquire the taxonomy information. The OTUs were annotated, clustered using RDP-classifier and Qiime software. The number of species in each sample was estimated by OTUs with 97% similarity. R package HCLUST (http://sekho n.berke ley.edu/stats /html/hclus t.html) was used to analyse diversities and community structures.

Microbial metabolites detection Stool samples preparation
The stored stool samples were tested for quality, including storage time, colour, weight, and stool characteristics. Approximately

Data analysis
The stool sample and data control procedure were followed according to the criterion (ISO9001, QAIC/ CN/170149, Metabo-Profile, Shanghai, China). The proprietary XploreMET software (v2.0, Metabo-Profile, Shanghai, China) was used to process the raw data from GC/TOFMS analysis. The XploreMET software can automatically process the following, including baseline correction, smoothing, peak picking, and peak signal, library searching, and area calculation. GC/MS workstation software was used to identify the differential metabolites through automatically comparing the fragment mass to charge ratio and abundance of characteristic ion fragmentation patterns. The NISI II standard muss spectral databases, the Finch databases linked to Chrom TOF software and the available reference standards in our lab was used as reference standards. The allowable similarity was more than 70%.

Statistical analysis
The data were indicated with the mean ± standard deviation. The Student's t test or SNK test was appropriately used to analyse the statistical difference in data between groups. The Chi square test was used to analyse the data count and ratio. The rank sum test was used for data that did not conform to the normal distribution. A two-tailed p < 0.05 was considered statistically significant.   Tables 2 and 3, respectively. The results suggested that the ascending colon cancer has a better prognosis than other colon cancers. Right-sided (RS) CRC includes cancers located in the cecum, ascending colon, and transverse colon,  Tables 5 and 6, respectively. The results suggested that the left-sided CRC has a better prognosis than the right-sided CRC. The CRC specific survival rate and overall survival rate varied among the different site's areas and are shown in Fig. 1a, b. The results suggested the 10-year survival time of patients with different sites of CRC was cecum, transverse colon, ascending colon, descending colon, sigmoid colon, and rectum, in order.

Prognosis in different sites of CRC
The CRC specific survival rate and overall survival rate varied between the left-sided and right-sided CRC (Fig. 1c, d). The results suggested that the 10-year survival time of patients with right-sided CRC was longer than that of patients with left-sided CRC. Moreover, as shown in Fig. 1e, based on prognostic factors including adequate node examination (more than 12 nodes), stages, gender, age and sites, a nomogram model was constructed and used for predicting mean survival months.
Differential gene analysis was performed on lncRNA and mRNA levels in the five groups, and a total of 421 differential lncRNAs and 1770 differential mRNAs were obtained. Figure S3 A-B shows the heatmap of the differentially expressed mRNA and lncRNA. The expression levels of mRNA and lncRNA were combined into one expression profile for WGCNA analysis. According   17:353 to the method described in "Materials and methods" section, the power value was defined as the square of log(k) in the network and log(p(k)) correlation coefficient fist reached 0.95 (β = 6, scale free R 2 = 0.98). The key parameters including frequency of k, check scale-free topology scale, scale independence and mean connectivity are shown in Additional file 3: Figure S3 c-e. As shown in Fig. 2a, different modules in the gene co-expression network were constructed based on WGCNA. Each colour represents a module. According to the module mining method and the set threshold, the brown, turquoise, yellow, blue, green, and red contains 405, 518, 173, 465, 162 and 123 genes, respectively. The grey contains genes that cannot be classified as any module.   The correlation analysis between the modules and the sites of CRC was performed to explore the significant modules related to the sites of CRC. According to the standard set in the method, the module-trait relationship is shown in Fig. 2b. The yellow module, red module, turquoise module, and the blue module were applied to Fig. 1 Prognosis in different sites of CRC. CRC can be divided into right-sided CRC and left-sided CRC according to the pathogenic site. Right-sided CRC includes cancers located in the cecum, ascending colon, and transverse colon, and left-sided CRC includes those located in the descending colon, sigmoid colon, and rectum. The clinical data from the SEER database was used to analyze the prognosis in different sites of CRC. a, b The CRC specific survival rate and overall survival rate varied among the different sites, respectively. c, d The CRC specific survival rate and overall survival rate between the right-sided CRC and left-sided CRC, respectively. e A nomogram model to be used for predicting the prognosis of CRC via some prognostic factors, including adequate node examination, stages, gender, age, and sites analyse the downstream genes for ascending colon cancer, rectum cancer, sigmoid colon cancer, and transverse colon cancer, respectively. No significant correlation with descending colon cancer was found. According to the threshold set in the method, the co-expression network of the four modules was constructed (Fig. 2c-f ).
The blue module associated with the transverse colon cancer contains 201 genes and 3230 co-expression pairs (Fig. 2c). The red module associated with rectum cancer contains 28 genes and 91 co-expression pairs (Fig. 2d).
The turquoise module associated with sigmoid cancer contains 132 genes and 147 co-expression pairs (Fig. 2e).
The yellow module associated with ascending colon cancer contains 49 genes and 74 co-expression pairs (Fig. 2f ). The Gene Ontology biological process (GO, BP) and KEGG pathway enrichment analysis were performed for all genes contained in the four modules. The results are shown in Additional file 4: Figure S4. Firstly, the Pearson correlation coefficients between all genes and the modules were calculated, and the significance test was carried out according to the set threshold value (correlation coefficient > 0.6 and p < 0.05). The results showed that the number of genes significantly correlated with the yellow module, turquoise module, blue module, and red module was 80, 132, 148, and 25, respectively. Secondly, the Pearson correlation coefficients among different sites of CRC were calculated, and the significance test was carried out according to the set threshold value (correlation coefficient > 0.2 and p < 0.05). The results showed that the number of genes significantly correlated with ascending colon cancer, sigmoid colon cancer, transverse colon cancer, and rectum cancer was 460, 733, 140 and 249, respectively. The intersection of the genes above two parts was taken, 56 yellow module genes, 106 turquoise module genes, 8 blue module genes and 7 red module genes which were found to be significantly correlated with ascending colon cancer, sigmoid colon cancer, transverse colon cancer, and rectum cancer, respectively. A total of 158 genes were identified, 19 of which were significantly associated with both the ascending and sigmoid colon. The samples were divided into high and low expression groups according to the median expression values of these 158 genes. The Chi square test was used to compare that in different sites of CRC. Finally, 138 genes including 112 mRNA and 26 lncRNA were significantly correlated with the sites of CRC. The heatmap in Fig. 3a, b represents the expression of the 26 lncRNA and 112 mRNA. The results showed that the gene expression patterns of ascending colon, descending colon and transverse colon were similar, and the gene expression patterns of sigmoid colon and rectum were also similar.
K-M survival analysis was used to analyse the relationship between the above mentioned 138 genes and the prognosis. The log-rank test showed nine genes, including C19orf12, SLC44A3, AP3M1, LYST, PMM2, MYL12A, SRGAP2, GPR143, and CTD-2201E18.3 had statistical significance (p < 0.05). The gene expression distribution and survival curves of these 9 genes in different sites of CRC are shown in Figs. 4 and 5, respectively.
The drug prediction database DGIdb was used to predict drugs that interact with these 138 genes. The network of drug-gene interaction is shown in Fig. 6, which indicates that there were 227 drug-gene pairs containing 212 drugs and 19 genes.

Gut microorganisms in rectal and sigmoid cancers
A total of 28 CRC patients including 15 cases of rectal cancer and 13 cases of sigmoid cancer in Huzhou Central Hospital from January 2018 to August 2018 were recruited for the microbiology and microbial metabolites study after screening and identification. The clinical characteristics of these case are shown in Table 8.
The alpha diversity analysis was carried out to compare the diversity of gut microorganism in rectal cancer and that in sigmoid cancer. The Chao1 index, Shannon index and Simpson index are indicators of microbial diversity. Additional file 5: Figure S5 A-C shows the Chao1 curves, Shannon curves, and Simpson curves, respectively. Higher Chao1 and Shannon value indicate higher community diversity. Lower Simpson value indicates a higher community diversity. The box chart on the right in Additional file 5: Figure S5 indicates no statistical difference of these alpha diversity indexes about gut bacteria between rectal cancer (marked "1" group) and sigmoid cancer (marked "2" group) (Kruskal test, p > 0.05). The analysis of bacterial community structure is shown in Fig. 7. The taxonomic tree heatmap in Fig. 7a shows the relative abundance ratio of gut bacteria in the top 100 at different taxonomic levels. The taxonomic tree heatmap indicates that the relative abundance of gut bacteria including Prevotella, Eubacterium, Dorea, Fusicatenibacter, Howardella, Butyricicoccus, Anaerococcus, Alloprevotella, Faecalibacterium, Roseburia, and Sutterella in rectal cancer were more than that in sigmoid cancer at genus level, and the relative abundance of gut bacteria including Granulicatella, Burkholderiales, Flavonifractor, Coprobacillus, Parabacteroides, Anaerotruncus, Akkermansia, Allisonella, and Alistipes in rectal cancer were less than that in sigmoid cancer (Wilcox test, p < 0.05). The violin plot in Fig. 7b displayed the relative abundance of gut bacteria in rectal cancer and sigmoid cancer in the top 15 at the genus levels. The results showed that there were statistically significant differences in these bacteria, including Prevotella, Faecalibacterium, Parabacteroides, Roseburia, Akkermansia, and Alloprevotella between the two groups (Wilcox test, p < 0.05).
The Chao1 curves, Shannon curves, and Simpson curves of gut fungus in rectal cancer and sigmoid cancer are shown in Additional file 6: Figure S6 A-C. The results showed that there was no statistical difference of these alpha diversity indexes of gut fungus between rectal cancer (marked "1" group) and sigmoid cancer (marked "2" group)(Kruskal test, p > 0.05). The taxonomic tree heatmap in Fig. 8a shows the relative abundance of gut fungus including Humicola, Dipodascaceae, Nectriaceae, o_Pleosporales, f_Nectriaceae, Tetracladium, Clonostachys, f_Dipodascaceae, o_Sordariales, Microascaceae, Ceratobasidiaceae, c_Sordariomycetes, Gibberella, o_Hypocreales and Fig. 6 The network of a drug-gene interaction. The drug prediction databases DGIdb was used to predict drugs that interact with the differentially expressed genes in different sites of CRC. The Figure shows the network of drug-gene interaction. The yellow circles and purple circles represent the genes significantly associated with the ascending colon and the sigmoid colon, respectively. The white squares indicate predicted drugs. The red lines, green lines, and black lines represent the antagonistic effects, promoting effects, and unknown effects, respectively Sordariomycetes in rectal cancer were less than that in sigmoid cancer at the genus level. The violin plot in Fig. 8b displays that there were statistically significant differences in these two fungi, including Humicola and Dipodascaceae in rectal cancer and sigmoid cancer in top 15 at genus levels. Figure 9 shows that the composition and proportion of microbial metabolites in rectal and sigmoid cancers at the class level. The concentration of microbial metabolites in the same site was combined into a group. The result suggests that the proportion of fatty acids from patients with rectal cancer was higher than that from patients with sigmoid cancers, while the proportion of amino acids and hydroxy acids from patients with rectal cancer was lower than that from patients with sigmoid cancer. Figure 9b shows the content of the top 30 microbial metabolites in rectal and sigmoid cancers. The results showed that the top 5 microbial metabolites were, in order, acetic acid, l-methionine, 3-hydroxybutyric acid, l-glutamic Different fungal community structure between rectal cancer and sigmoid cancer. Rectal cancer and sigmoid cancer was marked "1" group and "2" group, respectively. a The taxonomic tree heatmap, which displayed the relative abundance ratio of gut fungus in the top 100 at different taxonomic levels. The innermost layer shows the taxonomic tree. The circle from the inside to the outside represents different taxon levels from the phylum to the genus. The white circles represent no statistical difference, and the cyan circles represent the higher abundance of species in the sigmoid cancer group (Kruskal test, p-value < 0.05). The outermost layer is fungal annotation at the genus level. b The violin plot, which displayed the relative abundance of gut fungus in rectal cancer and sigmoid cancer in the top 15 at the genus levels. The red letters represent statistically different in the abundance of fungus acid, and l-lysine. Figure 9c and Table 9 show that the content of microbial metabolites with statistical differences between the two groups. The results showed that the microbial metabolites with statistical differences between rectal and sigmoid cancers including palmitoleic acid, 1 H-indole-3-acetamide, indole, citraconic acid, erucic acid, fumaric acid, hippuric acid, hydrocinnamic acid, nicotinic acid, oxoglutaric acid, acetic acid, and m-hydroxyhippuric acid.

Discussion
Analysis of the effect of different pathogenic sites on the prognosis of CRC suggested the RS CRC has a better prognosis than the LS CRC, consistent with several studies [7]. The present study further explored the correlation between different pathogenic sites, including the cecum, transverse colon, ascending colon, descending colon, sigmoid colon and rectum, and prognosis of CRC. The results suggested that the closer to the anus predicted longer survival time, which will be favourable to the accurate judgment of clinical prognosis of diseases. The differential expression of mRNA and lncRNA in different pathogenic sites of CRC were analysed based on RNA-seq data from Broad Institute's GDAC Firehose. The analysis of these differential gene expressions provides further research direction for the pathogenesis and prognosis of different CRC sites. The expression levels of some genes, for instance, K-RAS, N-RAS, BRAF, p53, Ki-67, and MSI (microsatellite instability) classification have been used to predict the prognosis of CRC [45][46][47][48]. The search for more gene targets will contribute to accurately determine the treatment regimen decision and prognosis evaluation of CRC. Nine genes with different prognosis were initially identified based on the classification of pathogenic sites of CRC.
Small molecule targeted drugs for the treatment of CRC recommended by the NCCN guidelines mainly includes anti-EGFR such as cetuximab and anti-VEGF such as bevacizumab and panitumumab [49]. The introduction of more targeted drugs into clinical applications and the patients with advanced disease states seeking more targeted drugs in clinical trial phase put higher demands on the prediction of drug targets and the selection of targeted drugs. The network of drug-gene interaction annotated in this study will provide a reference for gene target detection and drug selection based on the classification of pathogenic sites of CRC.
The colorectum has the largest microbial community in the body. The gut microecological environment may be the incentive of somatic genetic and epigenetic changes in CRC. Gut microorganisms cannot be ignored in the study of CRC. The results showed that the abundance of many gut bacteria and gut fungus and the content of many microbial related metabolites between rectal and sigmoid cancers are statistically different between rectal and sigmoid cancers. On the one hand, the results explained the differences between microorganisms and microbial related metabolites in different pathogenic sites of CRC. On the other hand, the intervention, such as faecal transplantation, oral administration of microorganisms or microbial metabolites and dietary choices may provide alternative and auxiliary treatments for CRC patients. The results may provide a new idea for the microbiological treatment decision.
The era of big data provides tremendous data support for medical research. The integration of various databases for the diagnosis and treatment of diseases will effectively promote the development of human health. The present study was based on the SEER database to analyze the influence of the incidence site on the prognosis of CRC. RNA-sequence data from Broad Institute's GDAC Firehose provide the data support for the genomic analysis of different CRC sites. Drug prediction database DGIdb provides accurate guidance for drug target screening of differential genes. The microbiological analysis also depends on some databases, including Silva database and Unite database. Based on the application of big data, the collision of epidemiology, genomics, microbiome, and metabonomics may lead to a new interdisciplinary study to establish a basis for individualized prevention and treatments for CRC.
There are some shortcomings in this study. The SEER database which is based on the North American population, cannot be the representative of the patients from other races and ethnicities. The results of data analysis based on RNA database still need further clinical validation. A database of correlation between microbiome and cancer has not yet been established in the low incidence of right colorectal cancer and most cases in the late stage. The sample size of the right CRC without organ metastasis was insufficient. We only included the cases of rectal and sigmoid cancers for microbiome and microbial metabolomics studies.
The following directions are put forward for future development. (i) At the Big data application level, greater data sharing will promote the development of diagnosis and treatment of colorectal cancer. Advances in applied mathematics will drive the establishment of more advanced mathematical models. It will further promote the development of precision medicine. (ii) At the basic research level, these aspects need to be further developed, such as the molecular mechanism of genes and proteins on the occurrence of CRC, the influence of microorganisms and metabolites on DNA and RNA, the relationship between microorganisms and metabolites and the identification of fungal species. (iii) In clinical application, stratification based on pathogenic sites will be taken into account in the subsequent diagnosis and treatment of CRC.

Conclusions
In the present study, we show first the clinical data analysis from the SEER database, which indicates that the prognosis in CRC with different sites is significantly different. The general trend is that the closer to the anus predicted longer survival time. Second, the difference between genes and co-expression pairs in CRC with different sites were constructed through the analysis of RNA-seq from Broad Institute's GDAC Firehose. The expression differences of 112 mRNA and 26 lncRNA correlated with the sites of CRC were listed. The nine differentially expressed genes correlated with prognosis were further analysed. Third, a network of drug-gene interaction was built based on the different genes at different sites using DGIdb databases. Fourth, the diversity and microbial community structure differences of gut bacteria and fungus were compared between the rectal and sigmoid cancers. Fifth, the plot and difference of gut microbial related metabolites between the rectal and sigmoid cancers were displayed. We emphasize that there are many differences with regard to prognosis, genome, drug targets, gut microbiome, and microbial metabolome among different colorectal cancer sites. Our findings, therefore, might be useful in understanding the role of the CRC sites in personalized and precision medicine.