Gene correlation network analysis to identify regulatory factors in sepsis

Background and objectives Sepsis is a leading cause of mortality and morbidity in the intensive care unit. Regulatory mechanisms underlying the disease progression and prognosis are largely unknown. The study aimed to identify master regulators of mortality-related modules, providing potential therapeutic target for further translational experiments. Methods The dataset GSE65682 from the Gene Expression Omnibus (GEO) database was utilized for bioinformatic analysis. Consensus weighted gene co-expression netwoek analysis (WGCNA) was performed to identify modules of sepsis. The module most significantly associated with mortality were further analyzed for the identification of master regulators of transcription factors and miRNA. Results A total number of 682 subjects with various causes of sepsis were included for consensus WGCNA analysis, which identified 27 modules. The network was well preserved among different causes of sepsis. Two modules designated as black and light yellow module were found to be associated with mortality outcome. Key regulators of the black and light yellow modules were the transcription factor CEBPB (normalized enrichment score = 5.53) and ETV6 (NES = 6), respectively. The top 5 miRNA regulated the most number of genes were hsa-miR-335-5p (n = 59), hsa-miR-26b-5p (n = 57), hsa-miR-16-5p (n = 44), hsa-miR-17-5p (n = 42), and hsa-miR-124-3p (n = 38). Clustering analysis in 2-dimension space derived from manifold learning identified two subclasses of sepsis, which showed significant association with survival in Cox proportional hazard model (p = 0.018). Conclusions The present study showed that the black and light-yellow modules were significantly associated with mortality outcome. Master regulators of the module included transcription factor CEBPB and ETV6. miRNA-target interactions identified significantly enriched miRNA.


Background
Sepsis is defined as organ dysfunction syndrome caused by uncontrolled inflammatory response to infection. Sepsis is a leading cause of mortality in hospitalized patients [1,2], and accounts for 30% of case fatality in hospitalized patients [3]. Despite the high mortality and morbidity, few agents are proven to be effective for the treatment of sepsis. Thus, more regulatory factors need to be identified to provide potential targets for the design of effective therapeutic agents.
Several studies have used transcriptome analysis to investigate potential biological pathways regulating the pathogenesis of sepsis [4][5][6][7][8]. These studies were performed by differential gene expression analysis, followed by enrichment analyses to established functional pathways. In these analyses, genes were tested individually. The sensitivity to identify biologically meaningful genes can be low due to multiple testing adjustment. Sepsis is a heterogeneous syndrome and its pathogenesis involves hundreds of genes. In this situation, the individual contribution of a single gene is too small to be detected with univariate test. In most diseases, genes function via networks of co-expressed genes with similar biological functions. Thus, identification of co-expression pattern could provide further insights into sepsis-associated biological pathways. Weighted gene co-expression network analysis (WGCNA) is a systems biology approach used for finding gene clusters with highly correlated expression levels and for relating them to phenotypic traits [9]. Rather than relating thousands of genes to the clinical trait, WGCNA focuses on the relationship between a few modules and the trait [10,11]. To the best of our knowledge, WGCNA has been used to explore coexpression pattern in mouse model of sepsis [12], HIV infection [13], in vitro inflammatory cells [14], and pediatric sepsis [15]. Regulatory factors including transcription factors and miRNA were not systematically explored in adult sepsis.
The present study aimed to identify gene co-expression modules in sepsis by using the consensus WGCNA (consensus from different causes of sepsis including pneumonia and abdominal sepsis). These consensus modules were related to clinical traits and enriched to functional biological pathways. Potential regulators of these modules were explored by using well curated databases.

The GEO dataset and data preprocessing
The study used the publicly avaiable dataset GSE65682 from the Gene Expression Omnibus (GEO) database. The dataset contained 802 samples including healthy controls, non-sepsis critically ill patients and sepsis patients. Futhermore, the sepsis patients could be further categorized into pneumonia sepsis (n = 192), abdominal sepsis (n = 51) and others (n = 443) based on infection site. PAXgene blood RNA was isolated at intensive-care unit (ICU) admission and whole-blood leukocyte transcriptome was performed at the platform of Affymetrix Human Genome U219 Array. Key benefits of the gene chip include increased productivity and efficiency through parallel processing, excellent gene expression accuracy and reproducibility and complete coverage of the annotated genome. Further detials of the dataset can be found at other publications [16,17]. Raw intensity expression data were preprocessed with the Robust Multi-array Average (RMA) method [18]. An advantage of this method is that normalization occurs at the probe level (rather than at the probeset level) across all of the selected hybridizations. The maximum expression intensity was used when multiple probe sets mapped an individual gene symbol. The quality of processed data were checked by using MA plot (Additional file 1: Figure S1).

Consensus weighted gene co-expression network analysis
The first step in constructing a consensus WGCNA was to choose the soft threshoulding power to which coexpression similarity was raised to calculte adjaency. We chose from a set of values from 4 to 20 based on the criterion of approximate scale-free topology [10]. Since the topological overlap matrices (TOM) of different sepsis may have different statistical property, we performed quantile normalization over the three types of sepsis (e.g. pneumonia, abdomianl and other sepsis). The consensus TOM was calculated by taking the component-wise ("parallel") minimum of the TOMs in individual dataset, which was then input to hierarchical clustering. Finally, modules were identified in the resulting dendrogram using the Dynamic Tree Cut algorithm [19]. This algorithm has several advantages such as capability of identifying nested clusters and flexibility. Modules with similar expression profiles were merged at the threshold of 0.25.
Gene significance was defined as the Student t-test statistic for testing differential expression between sepsis and healthy controls. The significance level was adjusted for multiple testing with Bonferroni correction. The dataset also contained critically ill patients such as those with major abdominal surgery without infection. There were common pathways between critical illness and severe infection, the comparison between sepsis versus non-infectious critical illness would omit some important genes. Thus, the differential expression was tested between sepsis versus healthy controls.

Relating consensus module to pneumonia-specific sepsis module
Modules specific to pnenomia sepsis were identified by the method as described above. Pneumonia sepsis modules were then related to consensus modules. We calculated the overlaps of each pair of pneumonia-consensus modules, and used the hypergeometric test to assign a p-value to each of the pairwise overlaps. This is also known as the cross-tabulation based comparison of modules. This method is justified by the idea that if a module is well preserved and reproducible in all types of sepsis (pneumonia, abdominal and other sepsis), this module could represent the common pathways involving the pathogenesis of sepsis [20].
Module preservation across all three datasets were explored by pairwise comparing eigengene networks in penumonia, abdominal and other sepsis. Mortality was added as an additional "eigengene". Network preservation is simply the difference between adjacencies in the two compared sets [20]. A small difference of the adjacency matrix between two sets indicate the modules are well preserved between the two comparing sets.

Relating consensus modules to clinical traits
Module eigengene was calculated for each module as the first principal component of gene expressions for that module. Correlation analysis was performed to relate module eigengene to external traits including age, gender, mortality and survival time (e.g. survivors were censored at 28 days). We combined three datasets into one and performed correlation analysis.
Gene significance for mortality was correlated to the module membership to investigate whether genes significantly associated with mortality outcome was also associated with module membership. Module membership (eigengene-based connectivity) for each gene was calculated by correlating its gene expression profile with the module eigengene of a given module. For a given module, a module membership value of 0 indicates that a gene is not part of the module; whereas a module memberhsip of − 1 or 1 is highly connected to the module.

Enrichment analysis for biological function and transcription factors
Modules associated with important clinical trait such as mortality were further analyzed for their enrichment in Gene Ontology (GO) pathways [21]. Specifically, the gene set from a given module were enriched to GO terms to find whether some of functional GO terms are over-represented using annotations for that gene set. Upset plot was employed to display overlapped genes among different GO terms. Dotplot shows the gene ratio and adjusted p values for each enriched GO terms. Enriched terms were organized into a network with edges connecting overlapping gene sets. In this way, mutually overlapping gene sets are tend to cluster together, making it easy to identify functional modules. The category netplot depicts the linkages of genes and GO terms as a network, which is helpful to see which genes are involved in enriched pathways and genes that may belong to multiple annotation categories.
Modules (gene lists) significantly correlated with the mortality trait were tested for its over-representation in transcription factor (TF) binding motifs by using RcisTarget [22]. Two types of databases (i.e. Gene-motif rankings and the annotation of motifs to transcription factors) were employed in the analysis: Gene-motif rankings which provides the rankings of all the genes for each motif and the annotation of motifs to transcription factors. Parameter settings for the score of each pair of genemotif were: species = Homo sapiens, Scoring/search space = 500 bp uptream the transcription start site (TSS), Number of orthologous species = 10. The annotation of motifs to transcription factors was performed using the motifAnnotations_hgnc ('mc9nr' , 24,453 motifs).

Identification of miRNA-target interactions
The multiMiR package was employed for the retrieval of miRNA-target interactions from 14 external databases in R. These databases are comprehensive collections of predicted and validated miRNA-target interactions and their associations with diseases and drugs [23]. The module of interest was those associated with mortlaity outcome. It was interesting to check whether some, or all, of these genes within a module were targeted by the same miRNA(s). We restrited our search to the "mirtarbase" table because this table included only experimentally validated miRNA-target interactions.

Survival analysis
The association of each module with the survival outcome was determined by using Cox proportional model. The eigengene value of each module was added into the Cox regression model for univariate analysis. Genes matched to the module with the strongest association with survival were used to cluster patients into two groups using reversed graph embedding (DDRTree), which projects data into a reduced dimensional space while constructs a principal tree which passes through the middle of the data simultaneously [24]. Samples were clustered into two groups using k-means clustering. Survival probablity of the two groups were comapred using the log-rank test.

Demographic data
A total of 802 subjects were initially identified from the GEO database. 116 subjects were excluded because they were either healthy or non-infectious controls, and 4 subjects were excluded because they were outliers (Additional file 1: Figure S2). As a result, a number of 682 subjects were included in our analysis. These subjects were classified by the infection site into pneumonia sepsis, abdominal sepsis and unselected sepsis (other sepsis). Demographic data were comparable between the three groups ( Table 1). The median age was 63 years (IQR: 53 to 72), and 58% (394/682) patients were male. The overall 28 day mortality rate was 17% (113/682).

Consensus Network construction and module detection
A soft-thresholding power of 6 was used to obtain approximate scale-free topology for the network (Additional file 1: Figure S3). Consensus WGCNA identified 27 modules (Fig. 1). Gene differential expression analysis between healthy controls and sepsis showed that genes such as ARG1, CD177, MMP8 and C19orf59 were upregulated. Up-or down-regulation of modules were determined by the median of gene significance T value and fold changes. For instance, the black model was significantly upregulated with median T value > 3 and log2 fold change > 0.5 (Fig. 2).
Consensus eigengene networks and their differential analysis are shown in Fig. 3. The results showed that the eigengene networks in the three datasets were well preserved. To explore whether modules identified in the pneumonia sepsis could also be identified in consensus modules, the correspondence of pneumonia set specific and consensus modules was explored (Fig. 4). The result indicates that most peumonia set-specific modules have a consensus counterpart (Fig. 4). The black color module in the consensus analysis corresponds to the red module in the pneumonia specific set analysis (275 genes overlap with significant p value).

Relating Consensus modules to external clinical traits
The correlation between module eigengene and clinical traits were explored by Pearson's correlation analysis (Fig. 5). The black module was significantly associated with mortality with higher module express correlated to lower mortality rate (coefficient = -0.16; p < 0.001). However, the light-yellow module was positively associated with mortality (coefficient = 0.14; p < 0.001). The black module was positively associated with survival time (coefficient = 0.16; p < 0.001). The correlations between gene significance for mortality and module membership were explored and the results showed that there was a highly significant correlation between module membership and gene significance for mortality in three modules inclusing the light-yellow, ligh cyan and pink modules (Additional file 1: Figure S4).

Module biological function
The most enriched pathways of the black module included myeloid leukocyte mediated immunity, neutrophil mediated immunity, leukocyte degranulation and myeloid cell activation involved in immune response (Fig. 6). The light-yellow module was enriched in biological pathways such as translation, nucleobase-containing compound catabolic process, heterocycle catabolic process and cellular nitrogen compound catabolic process (Fig. 7).

Survival analysis
To further validate that the black module was related to the mortality outcome, the high-dimension space was reduced to lower dimension space by using reversed graph embedding (DDRTree). The method provided  non-linear dimension reduction that retains the intrinsic structure of the sample data. The silhouette method showed that 2-cluster would be the optimal number (Fig. 8a). In the 2-dimension space (Fig. 8b), the samples were classified into two clusters. Heatmap of all module genes showed that the two clusters could be well separated (Fig. 8c), which supported that the intrinsic structure was well preserved with the dimension reduction. The two clusters were significantly associated with survival outcome (p = 0.018 in the Cox proportional hazard model, Fig. 8d). Cluster 2, as compared with Cluster 1, was characrized by activated functions including myeloid cell activation, cell activation, leukocyte activation and myeloid leukocyte activation (Fig. 9). The light-yellow module was also related to the mortality outcome. The silhouette method showed that 2-cluster would be the optimal number (Fig. 10a). The study population could be classified into two clusters in a 2-dimension space. Heatmap (Fig. 10b, c) showed that the study population could be well separated into two clusters. Cluster 2 showed significantly lower survival probability than cluster 1 (p = 0.01, Fig. 10d).

Discussion
The study employed consensus network analysis to identify gene co-expression modules. These modules were involved in distinct biological functions and were associated with clinical traits. Regulatory mechanisms of some important modules involving transcription regulators and miRNA were explored by validated databases. Consistent with previously published studies employing highthroughput dataset to examine sepsis, these modules were enriched for pathways related to immune response [28,29]. The black module was significantly associated with survival outcome and the transcription factor CEBPB was the master regulator of this module. Several miRNAs including hsa-miR-335-5p, hsa-miR-26b-5p, hsa-miR-16-5p, hsa-miR-17-5p and hsa-miR-124-3p were identified to be important regulators of the gene expressions in the module.
Our analysis has several implications for research and clinical practice. First, sepsis is shown to have a dysregulated immune response highlighted by the upregulation of the black module involving biological functions such as myeloid leukocyte mediated immunity, neutrophil mediated immunity, leukocyte degranulation and myeloid cell activation involved in immune response. Immune dysfunction has long been recognized as an important mediator of sepsis [4,30,31]. However, previous studies mostly analyzed high-throughput data at individual gene level involving differential expression analysis followed by functional pathway enrichment [4,15,31,32]. In neonate sepsis, Meng and colleagues identified 7 hub genes in key pathways. However, the study did not relate these gene expression profiling to clinical outcomes and the results cannot be extrapolated to adult sepsis [33]. The present study employed WGCNA to identify modules associated with mortality by treating co-expressed genes as a module. The idea is that genes are working together to take their functions in disease processes. Second, by focusing on modules most significantly associated with mortality, we identified several important regulators including transcription factors and miRNAs. The results support the hypothesis that co-expressed genes are very likely to be regulated by common factors. These transcription factors Biological pathways such as myeloid leukocyte mediated immunity, leukocyte degranulation and neutrophil mediated immunity shared large number of genes. b Dot plot showing the ordered enrichment biological pathways; the most enriched pathways included myeloid leukocyte mediated immunity and neutrophil mediated immunity. c Enriched terms were organized into a network with edges connecting overlapping gene sets. In this way, mutually overlapping gene sets are tend to cluster together, making it easy to identify functional modules; and d The category netplot depicts the linkages of genes and GO terms as a network, which is helpful to see which genes are involved in enriched pathways and genes that may belong to multiple annotation categories. The most important biological pathways in the module involves the immune response, including the immune response including myeloid leukocyte mediated immunity, neutrophil mediated immunity, leukocyte degranulation and myeloid cell activation (such as CEBPB and ETV6) and miRNAs are potential targets for the treatment of sepsis. Since the upregulation of the black module is significantly associated with mortality, drugs targeting these sites may improve the survival outcome. Third, although sepsis is a heterogeneous syndrome [34][35][36][37][38], different types of sepsis share some common important immune regulatory pathways. Our study employed consensus WGCNA and examined whether the module identified in one cause of sepsis can also be found in other causes of sepsis. The results show that most modules are well preserved despite the various causes of sepsis, indicating common pathways associated with the sepsis. Furthermore, the module network constructed by TOM is also well preserved across various causes of sepsis. Collectively, these findings support the notion that sepsis can be considered as a clinical symdrome because various causes of sepsis lead to common pathways. Drugs designed to target these common pathways can be helpful in improving clinical outcomes. For example, our study showed that cluster 2 identified by the black module was associated with lower survival probability, and this subtypes of sepsis was characterized showing the ordered enrichment biological pathways; c Enriched terms were organized into a network with edges connecting overlapping gene sets. In this way, mutually overlapping gene sets are tend to cluster together, making it easy to identify functional modules; and d The category netplot depicts the linkages of genes and GO terms as a network, which is helpful to see which genes are involved in enriched pathways and genes that may belong to multiple annotation categories. The most important biological pathways in the module involves the translation and catabolic processes, including nucleobase-containing compound catabolic process, heterocycle catabolic process and cellular nitrogen compound catabolic process by leukocyte activation. Such over-activation of inflammatory response may indicate that immunoregulatory agents such as arachidonic acid and eicosapentaenoic acid can help to improve the survival outcome [39]. Mortality is an important clinical trait and thus we tried to identify modules most significantly assocated with mortlaity. Two methods were employed to validate that the black module was associated with mortality. Firstly, we simply correlate the module eigengene to the mortality. Module eigengene is computed as the first component in principal component analysis (PCA), which however is a linear transformation of the high-dimensional space. It is well recognized that linear transformation cannnot fully recover the intrinsic structure of a high-dimensional space [40][41][42]. Thus, we also emplyed manifold learning to better capture the black module [24]. The result is consistent with the above observation that the black module can well separate survivors from non-survivors. Most clinical trials of sepsis are focusing on how to reduce mortality. The identified black module in our study can help to design drugs that potentially useful for reducing mortality rate. For example, our analysis shows that  GSM1692099 GSM1692065 GSM1692123 GSM1692425 GSM1692003 GSM1692090 GSM1692097 GSM1692372 GSM1692307 GSM1692316 GSM1692383 GSM1692070 GSM1692299 GSM1692412 GSM1692115 GSM1692174 GSM1691912 GSM1692349 GSM1602947 GSM1692026 GSM1692480 GSM1602922 GSM1692356 GSM1692470 GSM1602840 GSM1691901 GSM1692018 GSM1692056 GSM1692441 GSM1602847 GSM1691980 GSM1691898 GSM1691954 GSM1602968 GSM1692350 GSM1692036 GSM1692108 GSM1692179 GSM1692207 GSM1692058 GSM1692208 GSM1692032 GSM1692276 GSM1691862 GSM1692273 GSM1692055 GSM1692019 GSM1692153 GSM1692171 GSM1602946 GSM1692268 GSM1692140 GSM1602811 GSM1692129 GSM1691863 GSM1692409 GSM1602923 GSM1692059 GSM1692191 GSM1692364 GSM1692493 GSM1692006 GSM1692060 GSM1692045 GSM1692184 GSM1692484 GSM1602839 GSM1692126 GSM1692138 GSM1692024 GSM1692347 GSM1692119 GSM1692449 GSM1692177 GSM1692477 GSM1602908 GSM1692007 GSM1692022 GSM1692118 GSM1692322 GSM1692459 GSM1692335 GSM1602813 GSM1691916 GSM1691889 GSM1692446 GSM1602815 GSM1692040 GSM1692331 GSM1692421 GSM1692366 GSM1691909 GSM1692482 GSM1692488 GSM1692436 GSM1691904 GSM1692202 GSM1692002 GSM1692481 GSM1602818 GSM1692251 GSM1692201 GSM1692182 GSM1692294 GSM1692504 GSM1692145 GSM1602975 GSM1692304 GSM1602882 GSM1692445 GSM1691895 GSM1692440 GSM1691953 GSM1692404 GSM1692452 GSM1602837 GSM1692199 GSM1692223 GSM1692259 GSM1692156 GSM1692023 GSM1692107 GSM1692241 GSM1692415 GSM1691965 GSM1692187 GSM1692124 GSM1602925 GSM1691990 GSM1692033 GSM1692009 GSM1602823 GSM1692139 GSM1692122 GSM1692012 GSM1692255 GSM1692303 GSM1692343 GSM1602859 GSM1692309 GSM1691911 GSM1602929 GSM1692398 GSM1602802 GSM1602863 GSM1692052 GSM1692135 GSM1692148 GSM1692388 GSM1602842 GSM1602933 GSM1692359 GSM1602868 GSM1602958 GSM1692079 GSM1692305 GSM1692327 GSM1692375 GSM1692178 GSM1692073 GSM1692226 GSM1602942 GSM1691925 GSM1692257 GSM1692485 GSM1602831 GSM1602850 GSM1692318 GSM1602830 GSM1602928 GSM1602976 GSM1692157 GSM1602845 GSM1602879 GSM1692289 GSM1692306 GSM1691947 GSM1691915 GSM1692061 GSM1691882 GSM1692387 GSM1691935 GSM1692431 GSM1692438 GSM1692100 GSM1692114 GSM1692203 GSM1691877 GSM1692038 GSM1602885 GSM1602834 GSM1692084 GSM1691865 GSM1602829 GSM1691987 GSM1691959 GSM1692454 GSM1692132 GSM1692374 GSM1692434 GSM1691927 GSM1692071 GSM1692192 GSM1691918 GSM1692397 GSM1602936  hsa-miR-335-5p is the top ranked miRNA in the regulation of the black module. Since the black module consists mostly genes involved in inflammatory response, the upregulation of hsa-miR-335-5p is proposed to have inflammatory suppression effects [43][44][45]. More recently, hsa-miR-335-5p is shown to reduce inflammation via negative regulation of the TPX2-mediated AKT/GSK3β signaling pathway in a chronic rhinosinusitis mouse model [46]. Collectively, these observations strongly support our results that hsa-miR-335-5p can be a candidate therapeutic target. The CEBPB was annotated to the most significantly enriched motifs for genes in the black module. The eigengene of black module is the most significant associated with mortality, and thus the master regulator CEBPB is also an important risk factor for mortality via immunomodulation [47,48]. Our finding is also consistent with a recent study comparing sepsis with and without shock, in which CEBPB is significantly enriched in septic shock versus non-shock patients [49]. Since the presence of shock is a significant risk factor for mortality, the result supports the notion that CEBPB can be a potential target for improving survival outcome.
The strength of the study is the large sample size. To the best of our knowledge, the MARS consortium is the largest sepsis cohort with genome-wide EXOSC4  OSTalpha  LDHA  ENTPD7  MMP8  SMPDL3A  DDAH2  HK3  RGL4  RETN  HGF  HP  ST6GALNAC3  NDST2  ZDHHC19  FAR2  CKAP4  CEACAM1  CLEC5A DHRS9  TTC35  LCN2  CA4  PECR  ANKRD22  DHCR7  SFT2D2 BMX  OLFM4  TCN1  GADD45A  PCOLCE2  KCNE1  FAM89A  JMJD6  ARG1  RAB20  BPI  PLB1 TFF3  LTF  ANKRD34B   CCR3 LGALS2 The differences in biological functions between cluster 2 and cluster 1 as identified by the black module. a Volcano plot showing differentially expressed genes in the black module between cluster 2 and cluster 1. b enriched GO biological pathways. c Gene set enrichment analysis with GO term; as compared to cluster 1, cluster 2 was characterized by activated functions such as leukocyte activation, cell activation and myeloid leukocyte activation.  10 The survival analysis with the light-yellow module. a Optimal number of classes derived from DDRTree reduced dimension space with the silhouette width; the two-class model was the best fit model with the highest silhouette width. b Discriminative dimension reduction (DDR) graph of the light-yellow module gene list. K-means clustering was used to separate samples into two groups (blue and red). c Heatmap of gene expression sorted by DDR score. The two clusters identified by the DDR score were consistent with that identified by light-yellow module genes.
Genes showing on the right were those from the light-yellow module. d Kaplan-Meier curves for groups defined by DDR k-means clustering of light-yellow module genes. Cox's proportional hazard modelling of module groups showed that the cluster membership was significant association with survival (p = 0.01)