Prediction of hepatocellular carcinoma risk in patients with chronic liver disease from dynamic modular networks

Background Discovering potential predictive risks in the super precarcinomatous phase of hepatocellular carcinoma (HCC) without any clinical manifestations is impossible under normal paradigm but critical to control this complex disease. Methods In this study, we utilized a proposed sequential allosteric modules (AMs)-based approach and quantitatively calculated the topological structural variations of these AMs. Results We found the total of 13 oncogenic allosteric modules (OAMs) among chronic hepatitis B (CHB), cirrhosis and HCC network used SimiNEF. We obtained the 11 highly correlated gene pairs involving 15 genes (r > 0.8, P < 0.001) from the 12 OAMs (the out-of-bag (OOB) classification error rate < 0.5) partial consistent with those in independent clinical microarray data, then a three-gene set (cyp1a2-cyp2c19-il6) was optimized to distinguish HCC from non-tumor liver tissues using random forests with an average area under the curve (AUC) of 0.973. Furthermore, we found significant inhibitory effect on the tumor growth of Bel-7402, Hep 3B and Huh7 cell lines in zebrafish treated with the compounds affected those three genes. Conclusions These findings indicated that the sequential AMs-based approach could detect HCC risk in the patients with chronic liver disease and might be applied to any time-dependent risk of cancer. Supplementary Information The online version contains supplementary material available at 10.1186/s12967-021-02791-9.

presence of premalignant lesions and tumors [8]. Despite progress in diagnostics and treatment of HCC, its prognosis remains poor [9,10].
Evidence suggests that there is usually a critical transition point during disease progression, resulting in the critical transition from a normal state to a disease state. Therefore, it is very important to detect the early warning signals of the predisease state to prevent sudden deterioration [11]. Thus, can we identify predictive risk for HCC at an earlier stage?
From the perspective of Modular Pharmacology (MP), the treatment of complex diseases requires a modular design to affect multiple targets [12]. The exploration of modular structure has been a key factor in understanding the complexity of disease networks [13]. A disease module represents a cellular function whose disruption results in a particular disease phenotype [13]. In our previous study, we proposed the concept of allosteric modules (AMs), which refers to multipotent functional changes in modular architecture [14]. Allostery controls pathway divergence and unification and encodes specific effects on cellular pathways [15,16]. The fundamental importance of allostery is the exertion of functional effects on signaling pathways and the entire cellular network [16,17]. The AMs may provide valuable structural information about disease and pharmacological networks beyond pathway analysis.
In this study, by integrating the multi-source data (including AMs, clinical microarray data and The Cancer Genome Atlas [TCGA] dataset), we constructed risk prediction models and proposed the sequential AMs -based approach for predicting the risk of HCC in patients with chronic liver disease.

Constructing disease-associated networks for each pathological stage
A list of disease-associated genes was obtained from the Online Mendelian Inheritance in Man (OMIM) database (http:// www. ncbi. nlm. nih. gov/ omim), including 220 hepatitis B-related genes, 152 liver cirrhosis-related genes, and 213 HCC-related genes. We used disease-associated genes from OMIM to construct 3 global disease-associated networks using the Agilent literature search plugin in Cytoscape.

Identifying and optimizing functional modules in different groups
In each disease-associated network, functional modules were identified using the Molecular Complex Detection (MCODE) algorithm [18]. For MCODE, we tried all possible combinations of the following parameters: Include Loops: false; Degree Cutoff: 3; Node Score Cutoff: 0.0, 0.2, 0.3; Haircut: true or false; Fluff: true or false; K-Core: 2; and Max Depth from Seed: 100, 5, 4, 3. A total of 48 parameter combinations were calculated. After the functional modules were identified, they were optimized according to the minimum entropy criterion, and the analysis of calculating minimal network entropy was carried out as described previously [14].

Calculating the similarities of the AMs
The similarities of the nodes and edges of the modules were calculated with our proposed method of SimiNEF [14]. Briefly, we used similarity S ne to quantify the relative overlaps between AMs m i and m j , including the overlaps of nodes and edges together. The similarities of nodes S n (m i , m j ) and edges S e (m i , m j ) are defined in Eqs. 1 and 2, respectively.

Enrichment analysis of KEGG pathways
The enrichment analysis of KEGG pathways in the modules was performed using a hypergeometric test, as implemented on the KOBAS 2.0 web server (http:// kobas. cbi. pku. edu. cn/) [19].

Clinical microarray data Clinical samples and RNA extraction
Morning fasting venous blood samples from a total of 36 patients were obtained from Shuguang Hospital and Longhua Hospital in Shanghai, China, including 3 healthy people, 10 chronic hepatitis B (CHB) patients, 13 HBVrelated cirrhosis (cirrhosis) patients and 10 HCC patients. The research protocol was approved by the respective Institutional Review Boards. The study was approved by the Official Ethics Committee of the Shanghai University of Traditional Chinese Medicine, and written informed consent was obtained from all study participants. Chronic hepatitis B, HBV-related cirrhosis and HCC were diagnosed according to the "Chronic hepatitis B prevention and treatment guidelines" [20], "Standard of clinic diagnosis, syndrome differentiation and assessing curative effect on hepatocirrhosis" [21], and "clinical diagnosis and staging criteria for primary hepatocellular carcinoma" established by the Chinese Society of Liver Cancer in 2001 [22], respectively.
The microarray methods followed those described in previous studies [23][24][25]. The leukocytes were isolated from the blood samples by Ficoll optimized density gradient separation and stored at − 80 ℃ [26]. Total RNA was extracted using a "two-step" protocol as described previously. Total RNA from leukocytes from whole blood was extracted using TRIzol reagent according to the manufacturer's instructions (Invitrogen, Carlsbad, CA, USA) and stored at − 80 ℃. The quantity and quality of RNA were assessed using a NanoDrop ND-1000 spectrophotometer (NanoDrop Technology, Rockland, DE).

Microarray data analysis
Briefly, cDNA was synthesized by the Invitrogen First-Strand cDNA Synthesis Kit (Invitrogen, Carlsbad, CA, USA), and RNA polymerase was added to degrade RNA. The biotinylated cDNAs were labeled and hybridized to a NimbleGen Homo sapiens 12 × 135K gene expression array (Roche, Cat No. A6484-00-01). After hybridization and washing, the processed slides were scanned with the Axon GenePix 4000B microarray scanner (Molecular Devices, Sunnyvale, CA). Raw data were extracted as pair files by NimbleScan software (version 2.5), and the data were considered robustly expressed if the signal/noise ratio (SNR) > 2. NimbleScan software's implementation of the robust multiarray analysis (RMA) algorithm offers the quantile normalization and background correction of data. The gene summary files were imported into Agilent GeneSpring Software (version 11.0, Agilent, USA) for further analysis. Both the P-value significance of t-test and the fold-change directionality (up-or downregulation) were taken into consideration for identifying differentially expressed genes between the two groups. Genes with a P-value < 0.05 and a fold-change > 1.5 or < − 1.5 were considered differentially expressed.

Construction of random forests models and rule extraction for predicting HCC
First, by combining genes in the OAMs with microarray data, we used the random forests algorithm to model and predict chronic hepatitis B, cirrhosis and HCC. The random forests algorithm was run independently on each of the OAMs. Then, the out-of-bag (OOB) error rates of the random forests models were computed. The variables of the model leading to the smallest OOB error were selected. The random forests algorithm has been extensively used to rank variable importance, i.e., genes. In this study, the Gini index was used as a measurement of predictive performance and a gene with a large mean decrease in Gini index (MDG) value is more important than a gene with a small MDG. The importance of the genes in discriminating HCC from non-tumor samples was evaluated by the MDG values. Second, we further explored the predictive performance of the important genes for HCC by using The Cancer Genome Atlas (TCGA) database for the liver hepatocellular carcinoma (LIHC) project (https:// portal. gdc. cancer. gov/ proje cts/ TCGA-LIHC). Human HCC mRNA-seq data were downloaded, containing 374 HCC tumor tissues and 50 adjacent non-tumor liver tissues. Receiver operating characteristic (ROC) curves and the associated area under the curve (AUC) values of the important genes were generated to evaluate their capacity to distinguish non-tumor tissues from HCC samples. An AUC value close to 1 indicates that the test classifies the samples as tumor or non-tumor correctly, while an AUC of 0.5 indicates no predictive power. In addition, The G-mean was used to consider the classification performance of HCC and non-tumor samples at the same time; The F-value, Sensitivity and Precision were used to consider the classification power of HCC; The Specificity is used to consider the classification power of normal; Accuracy is used to indicate the performance of all categories correctly. In particular, the intergroup differences of classification evaluation indexes between two-gene and three-gene combinations were evaluated using the normal t-test or nonparametric Mann-Whitney U test.
The data analysis in this paper is implemented by R software. We used RandomForest function in the ran-domForest package and these functions (RF2List, extrac-tRules, unique, getrulemors, pruneRule, selectRuleRRF, buildLearner, applyLearner, presentRules) in the inTrees package. All parameters of functions were set by default.
Next, we used rule extraction to establish the conditions of the three genes to correctly predict HCC. We applied the inTrees (interpretable trees) framework to extract interpretable information from tree ensembles [27]. A total of 1780 rule conditions extracted from the first 100 trees with a maximum length of 6 were selected from random forests by the condition extraction method in the inTrees package. Leave-one-out pruning was applied to each variable-value pair sequentially. In the rule selection process, we applied the complexity-guided regularized random forest algorithm to the rule set (with each rule being pruned).

Experimental verification
We screened related compounds that affected the three genes (cyp1a2-cyp2c19-il6). Then, the drug combination containing the corresponding compounds was used to treat three different human HCC cell lines (Bel-7402, Hep 3B and Huh7). Bel-7402, Hep 3B and Huh7 cells were labeled with green fluorescent dye and transplanted into the yolk sac of wild-type AB strain zebrafish 2 days after fertilization (2 dpf ) by microinjection. About 200 cells were transplanted into each fish to establish a zebrafish human HCC transplantation model. Zebrafishes injected with human HCC cells were cultured at 35 ℃ to 3 dpf. At 3 dpf, zebrafishes with good consistency of transplanted tumor were selected under the microscope and randomly distributed into 6-well plates with 30 fishes per well. In experimental groups, Jiangan (JG) granules were given with water-soluble concentrations of 27.8, 83.3 and 250 µg/mL, respectively. The positive control group was treated by cisplatin (15 µg/mL). And the vehicle group was set. Ten zebrafishes for each group were randomly selected to collect the fluorescence intensity of transplanted tumor. The statistical analysis results of fluorescence intensity were used to evaluate the growth inhibition effect of JG granules on human HCC transplanted tumor in the zebrafish model.

Constructing disease-associated networks for each pathological stage
A schematic diagram of the entire analysis framework is shown in Fig. 1. CHB-, cirrhosis-, and HCC-associated networks were constructed, involving 1104, 487, and 1079 nodes, respectively (Additional file 1: Table S1). The cirrhosis-associated network had the minimum number of nodes, and there was only a small difference in network size between the other two networks (Additional file 1: Table S1). Therefore, an analysis of the entire networks might not be sufficient to reveal the pathophysiological changes from chronic hepatitis to HCC.

Identifying and optimizing functional modules
The results identified by MCODE are shown in Additional file 1: Table S2. Considering the influence of different parameters on the clustering results, we tested 48 parameter settings. After the optimization of minimum entropy, 53, 21, and 60 modules (nodes ≥ 4) were identified from CHB-, cirrhosis-, and HCC-associated networks, respectively (Additional file 1: Table S1). The average sizes of these modules ranged from 4.609 to 6.447, and the entropy values were similar between the CHB-and HCC-associated networks after module optimization (Additional file 1: Table S1).

Difference gradient among the AMs of the three pathological stages
We used similarity S ne > 0, > 20%, > 40%, > 60%, > 80%, and = 100% to define the overlap between AMs. Hence, we obtained different degrees of differences between the AMs (Fig. 2a). For example, it should be noted that S ne > 20% means S n > 20% and S e > 20% simultaneously. When S ne = 0 or S n > 0 but S e = 0, these modules are referred to as disease-exclusive modules (DEMs); that is, the module did not overlap with any other module from other groups (Figs. 1, 2c). There were 35, 6, and 44 DEMs in the CHB, cirrhosis, and HCC groups, respectively ( Fig. 2a). The results showed that from S ne ≥ 0 to S ne = 100% in 20% increments, the number of overlapping modules among the CHB, cirrhosis and HCC groups was 3, 1, 1, 1, 0, and 0; the number of overlapping modules between the CHB and cirrhosis groups was 7, 6, 5, 4, 4, and 4; the number of overlapping modules between the CHB and HCC groups was 8, 4, 1, 1, 1, and 1; and the number of overlapping modules between the cirrhosis and HCC groups was 5, 4, 3, 2, 2, and 2, respectively, showing a gradual decreasing trend. In other words, with the increments of S ne , the degree of difference among AMs increased gradually (Fig. 2a, b). When S ne > 80% and S ne = 100%, there were no overlapping modules among the three groups ( Fig. 2a, b).

Distribution of the different AMs of the three pathological stages
Based on the changes in nodes and edges, the comparison of these modules in different disease stages resulted in three types of AMs ( Fig. 1). (1) Conserved allosteric modules (CAMs, AM C ). If the modular overlap between the CHB and cirrhosis groups, the cirrhosis and HCC groups, the CHB and HCC groups, or among the three groups reached 100% (S ne = 100%), these modules were referred to as CAMs (Figs. 1, 2d). A total of 7 CAMs , and AM C C19-HCC49 . (2) Transitional allosteric modules (TAMs, AM T ). Some partially overlapping modules (0 < S ne < 100%) were identified only between the CHB and cirrhosis groups and could not be found in HCC; these modules were referred to as TAMs (Figs. 1, 2e). Four TAMs were identified, including AM T CHB10-C5 , AM T CHB6-C2 , AM T CHB53-C21 , and AM T CHB7-C2 . (3) Oncogenic allosteric modules (OAMs, AM O ). Many modules partially overlapped (0 < S ne < 100%) between the CHB and HCC groups, the cirrhosis and HCC groups, or among the three groups, and these modules were referred to as potential OAMs (

Topological variations in potential OAMs
Next, we focused on the topological variations of the 13 potential OAMs. As shown in Fig. 3, a partially overlapping structure existed in each OAM that served as a bridge between modules, generally including the  1 Flow diagram. CHB-, cirrhosis-, and HCC-associated networks were constructed using disease-associated genes downloaded from OMIM. Functional modules were identified using the MCODE algorithm. Then, the results of module identification were optimized based on the minimum entropy criterion. The enrichment analysis of KEGG pathways was performed with DAVID 6.7 software. The similarity between modules was calculated using SimiNEF. Four AMs (DEMs, CAMs, TAMs, and OAMs) were identified. The relationships between OAM genes and HCC were validated by published literature. AMs allosteric modules, DEMs disease-exclusive modules, CAMs conserved allosteric modules, TAMs transitional allosteric modules, and OAMs oncogenic allosteric modules. '√' or '×' represents its appearance 'yes' or 'no' in the group, respectively. For example, the module is identified as 'conserved' when it is found both in CHB and cirrhosis, cirrhosis and HCC, CHB and HCC, or among the three groups ('√'), and S ne = 100%  . 3a). One-edge overlap was the most common type, and it could be found in OAMs from the three paths above. Triangular overlap and multiedge overlap only existed in OAMs between the CHB and HCC groups (Fig. 3).
In addition, the topological changes in the nonoverlapping parts of each OAM also involved three situations as follows. (1) Node-node changes, wherein the modular changes included adding or removing nodes (the number of changing nodes < 3). Two OAMs were related to the change in nodes (Fig. 3a). (2) Node-module changes. These changes included changes from nodes (the number of changing nodes < 3) to a module (the number of changing nodes ≥ 3) or from a module to nodes. Three OAMs showed changes between nodes and modules (Fig. 3b). (3) Module-module changes. Eight OAMs were involved in the changes from module to module, indicating that the total number of nodes and edges in modules increased or decreased. Module-module changes appeared in all three carcinogenic paths (Fig. 3c).

KEGG pathway analysis of 13 OAMs
In the 13 OAMs, the number of overlapping pathways between any two pathological stages (CHB, cirrhosis and HCC) was 18, 24, and 7, respectively. A total of 7 overlapping pathways were identified among the three pathological stages (Fig. 4a, Additional file 1: Table S3). KEGG pathways were restricted to those involved in biological processes. Consequently, disease pathways were discarded (except KEGG pathways related to liver disease).
After removing other disease pathways and overlapping pathways, the remaining nonoverlapping pathways were referred to as altered pathways. A total of 24 altered pathways were found during CHB-HCC progression, which could be largely divided into 10 categories, including cell growth and death (4.2%), cell motility (4.2%), cellular community (8.3%), endocrine system (8.3%), human diseases cancers (4.2%), immune system (41.7%), membrane transport (4.2%), nervous system (4.2%), signal transduction (16.7%), and signaling molecules and interaction (4.2%) (Additional file 1: Table S4, Fig. 4b). The neurotrophin signaling pathway appeared in four OAMs and had the highest frequency (Additional file 1: Table S4). The remaining pathways were all HCC-related pathways, except for six altered pathways that have not been previously reported to be associated with HCC (Additional file 1: Table S4).

Reanalysis of the genes in the 13 OAMs with clinical microarray data The consistency between the groups with differentially expressed genes and the groups represented by OAMs
The microarray expression data (comprising 19,471 genes) of 36 clinical samples were used. The number of overlapping genes between the CHB-, cirrhosis-, and HCC-associated networks (see section 1 of the results) and the microarray data was 989, 423, and 939 genes (accounting for 89.6%, 86.9%, and 87% of the network genes), respectively. In the microarray data, the numbers of genes significantly altered in the CHB, cirrhosis and HCC groups were 6251, 937, and 2175, respectively, compared with the normal group. The number of overlapping genes between CHB-, cirrhosis-, and HCC-associated networks and significantly altered genes in the microarray data was 279, 23, and 124 genes, respectively.
The further analysis showed that a total of 121 genes were included in the 13 OAMs; according to the expression levels of the 121 genes in the microarray data, a total of 7 differentially expressed genes were identified between any two groups, including cyp2b6 (CHB vs. HCC groups), pi3 (cirrhosis vs. HCC groups), and mmp2, pi3, ptk2, timp2, tnfrsf10b (CHB vs. cirrhosis groups). Furthermore, we identified 4 differentially expressed genes (cyp2b6, pi3, mmp2, and timp2) (Fig. 5a-d) that satisfied the following condition: the groups in which the differentially expressed gene existed were consistent with the groups represented by the OAM from which the gene was derived.

Selecting the top 5 important genes in OAMs and identifying highly correlated gene pairs
We conducted correlation analysis of the 121 genes contained in the 13 OAMs. First, the Pearson correlation coefficients between these genes were calculated from 36 clinical samples. Through statistical tests, we screened 273 pairs of genes that were highly correlated (r > 0.8, P-value < 0.001), of which 39 highly correlated gene pairs appeared in 10 of the 13 OAMs (Additional file 1: Table S5).
Then, according to the 13 OAMs, we constructed 13 random forests models for three disease groups and estimated the out-of-bag (OOB) classification error rate respectively. A total of 12 out of the 13 OAMs with an OOB classification error rate < 0.5 in predicting certain disease states are summarized in column 2-5 of Additional file 1: Table S5. Among the 12 OAMs, 11 OAMs were used to predict cirrhosis with the OOB classification error rate ≤ 0.46, 4 OAMs were used to predict CHB with the OOB classification error rate ≤ 0.4, and the OOB classification error rate of these OAMs for predicting HCC is greater than or equal to 0.6, which seemed to have the lowest predictive power for HCC.
Finally, we extracted the 11 highly correlated gene pairs (involving 15 genes in total) from the 12 OAMs (OOB classification error rate < 0.5), which met the following two conditions: the top 5 important genes in the OAMs according to the MDG, and significantly  Table S5.

Associations of the 11 highly correlated gene pairs in the three disease states
Next, the Pearson correlation coefficients between the 11 pairs of genes in CHB, cirrhosis, and HCC were calculated. The 11 gene pairs were correlated in the three disease states (r > 0.63) (Fig. 5e-h). Furthermore, the changing trends in the correlation of 6 gene pairs in the three disease states were consistent with the disease states indicated by the OAMs that the gene pairs belong to (Fig. 5h). The changing trends were roughly divided into the following four categories. (1) Weak correlation  and TIMP2 among different groups. # denotes statistical significance (P < 0.05) between the CHB and HCC groups; & denotes statistical significance (P < 0.05) between the cirrhosis and HCC groups; and * denotes statistical significance (P < 0.05) between the CHB and cirrhosis groups. e-g The correlation coefficient between the 11 pairs of genes in CHB, cirrhosis, and HCC. All gene pairs were highly correlated in the three disease states (r > 0.63). In the matrix, the red circles indicate a positive correlation, while the blue circles indicate a negative correlation. The larger a circle is, the stronger the correlation. h The changing trend of the correlation coefficient between the 11 pairs of genes in the three pathologic stages (CHB, cirrhosis, and HCC). The underlined gene pairs indicate that the changing trends in the correlation of 6 gene pairs in the three disease states were consistent with the disease states indicated by the OAMs that the gene pairs belong to  (Fig. 5h). Furthermore, 10 of the 15 genes have been previously reported to be associated with the disease states represented by their OAMs, except that decr1, mgmt, diablo and ebp have not been reported to be associated with CHB and hdac2 has not been reported to be correlated with cirrhosis and HCC (Additional file 1: Table S6). Moreover, 9 of the 15 genes (60%) have been previously reported as biomarkers of HCC (Additional file 1: Table S7).

Assessing the predictive performance of the 15 genes for HCC using the TCGA LIHC dataset Predictive performance of the 15-gene set
The 15 genes were further evaluated to distinguish tumor tissues from non-tumor tissues by using the TCGA LIHC dataset. The training and test sets were randomly sampled at a 4:1 ratio, with 329 and 95 samples. The random forests algorithm was used to construct a predictive model for HCC in the training sets. The flow chart of Random Forest construction is shown in Fig. 6a. The results showed the classification evaluation indexes of the model. The total OOB error rate, AUC, G-mean, F-value, sensitivity, precision, specificity, and accuracy were 7.6%, 0.99, 0.8991, 0.9823, 0.9881, 0.9765, 0.8182, and 0.9684, respectively.

Predictive performance of three-gene sets, two-gene sets, and one gene
The importance of the 15 genes was evaluated by the MDG values. Starting from 15 genes, the random forests model was constructed for the remaining genes after removing the least important gene in the current model. The results showed that the model of the remaining 6 important genes was the best choice that yielded the lowest OOB error rates (total OOB error rate = 6.69%, OOB classification error rate for predicting HCC = 0.024). In order to further obtain the optimal gene combinations of low dimensions, we selected the combinations of one, two or three genes from the 6 important variables (cyp1a2, cyp2c19, cyp2c9, rac1, diablo, and il6) to establish prediction models for HCC, that is, we ran the random forests algorithm 41 times. The sensitivities of all gene combinations were above 0.9 (range from 0.9048 to 0.9881). Seventeen gene combinations (9 three-gene sets and 8 two-gene sets) achieved a specificity ≥ 0.6364, and only the three-gene set (cyp1a2-cyp2c19-il6) had a specificity greater than 0.9. Almost all gene combinations achieved an AUC > 0.75 except one gene of il6, rac1, cyp2c19, and a two-gene set (diablo-il6). Nineteen gene combinations (14 three-gene sets and 5 two-gene sets) achieved an AUC > 0.95 (Additional file 1: Table S8, Fig. 6b-d).
In summary, the overall predictive performance of all gene combinations was ranked as follows: threegene sets > two-gene sets > one gene (Additional file 1: Table S8, Fig. 6b-d). All classification evaluation indexes in three-gene combinations were better than those in two-gene combinations. Almost all of the observed differences were statistically significant (P < 0.05), except for G-mean, Precision and Specificity (Additional file 1: Table S8-1). We finally identified a three-gene set (cyp1a2-cyp2c19-il6, total OOB error rates = 5.78%, AUC = 0.9730, G-mean = 0.9305, F-value = 0.9697, sensitivity = 0.9524, precision = 0.9877, specificity = 0.9091, and accuracy = 0.9474) with the optimal predictive performance.

Rule extraction for predicting HCC
Additional file 1: Table S9 shows the 7 most accurate rules. The total error rate was 0.049. The results showed that the present extracted rules achieved a very good performance. Among the 7 conditions, the expression levels of cyp1a2 and cyp2c19 in the non-tumor tissues were greater than those in the HCC tissues (Additional file 1: Table S9). The condition "cyp1a2 > 12,201.5 and cyp2c19 < = 103.5 and il6 < = 48.5" (error rate = 0.000, frequency = 0.046) might have a greater probability of being correctly predicted as the HCC group. The expression levels of the three genes in different populations (a total of 460 patients, including 53 healthy people, 10 CHB patients, 13 HBV-related cirrhosis patients and 384 HCC patients) were shown in Fig. 7a-c.

Relationships between gene combinations with good predictive performance and OAMs
In combination with the previous OAMs results, 80% of the 15 genes were the nodes with the highest degree in the OAMs (Additional file 1: Table S10). We found that the two-gene set (cyp1a2-cyp2c19) appeared in the same OAM (AM O CHB 11-HCC6 ). Cyp1a2 and cyp2c19 were located in the overlapping part of AM O CHB 11-HCC6 , and an edge existed between them (Fig. 3b). Both were the nodes with the highest degree (degree = 7, 8) in the module (Additional file 1: Table S10). While il6 did not appear in the same OAM with cyp1a2 and cyp2c19, it appeared in AM O CHB 7-HCC20 and AM O C2-HCC20 separately. It did not have the highest degree (degree = 3) (Additional file 1: Table S10), and it was not located on the overlapping structure of the module (Fig. 3c).

Discussion
The pathogenesis of HCC is complex and heterogenous, and multiple mechanisms of tumorigenesis could be involved. The annual incidence of liver cirrhosis in CHB patients without anti-viral therapy was 2-10%, and the annual incidence of HCC in non-cirrhotic HBV-infected patients was 0.5-1.0%. The annual incidence of HCC in patients with cirrhosis was 3-6% [28]. Despite a large number of promising molecules, the heterogeneity of HCC makes early detection a major challenge [29,30], and individual markers generally lack sensitivity and/or specificity to be sufficiently effective. The future of HCC screening will most likely involve the use of a combination of biomarkers based on various macromolecules such as mRNAs, proteins, and miRNAs [31].
Sequential AMs contributed to revealing the dynamic evolution from CHB to cirrhosis and HCC The clinical pathway of most HBV-related HCC may follow the four states: healthy, hepatitis, cirrhosis, and HCC. In our study, the cohort included healthy individuals and patients with CHB, HBV-related cirrhosis and HCC.
Using the AMs-based approach, four types of modular allostery (DEMs, CAMs, TAMs and OAMs) were identified that might reveal the dynamic evolution of pathological processes from CHB to HCC. Module-module associations (finally forming the AMs) among CHB, cirrhosis and HCC were established through the partially overlapping structures, which were similar to the linkers connecting domains in protein allostery, implying topological variations in modular networks. Identification of 13 potential OAMs also reflected three disease processes in HBV-related HCC cases: from HBV to cirrhosis to HCC, from cirrhosis to HCC, and from HBV to HCC directly. It was also consistent with previous findings that not all patients with HCC have underlying liver cirrhosis, especially CHB patients [32]. The OAMs were the partially overlapping modules among different stages in the progression of chronic liver diseases. At different stages, the structures and functions of these modules have partial differences, and further changes may occur. In addition, the invariant modules CAMs might reflect the conservation and stability of the organism. As for DEMs, they were the differential modules only found in the three diseases, representing the feature modules unique to CHB, HBV-related cirrhosis or HCC. We identified 35, 6, and 44 DEMs in the CHB, cirrhosis, and HCC groups, respectively. DEMs might demonstrate the unique characteristics of each stage of hepatitis, cirrhosis and liver cancer. From the perspective of Modular Pharmacology, sequential AMs might contribute to illustrating the molecular mechanism of the pathological progression from CHB to HCC. CAMs, OAMs and DEMs might have pharmacological implications at the systems level and serve as universal or specific therapeutic targets in disease treatment [33,34]. Further, OAMs might play an important role in the pathological progression from CHB to cirrhosis to HCC, and therefore had considerable clinical value in predicting early-stage HCC risk.

Functional changes of OAMs: alterations in multiple cellular signaling pathways
As shown in Fig. 4c, d, the carcinogenic effects of the 13 OAMs involve different changes in multiple signaling pathways at different pathological stages. We infer that alterations in these signaling pathways as well as some molecular targets in the pathways might participate in critical steps in the development of HBV-associated HCC. The most frequent pathway, the neurotrophin signaling pathway, appeared in four OAMs, showing that the dysregulation of neurotrophin signaling might play a role in the progression of HCC [35]. Evidence indicates that growth factor-mediated angiogenic signaling (VEGF, EGFR, IGF and HGF/c-MET), the ERK/MAPK pathway, the PI3K-AKT-mTOR signaling pathway, the WNT/bcatenin pathway, cytokine/chemokine production/activation, leukocyte infiltration, c-erbB-3, adherens junction, focal adhesion, and antigen processing and presentation are implicated in HCC [36][37][38][39][40][41][42][43]. In the erbB family, upregulated ERBB-2 was associated with HBV infection [44]. HBV alters TLR signaling, resulting in liver damage [45]. NK cells are important in the defense against HBV infection and exert their antiviral functions and host anticancer defense by natural cytotoxicity [46,47]. In addition, AM O CHB11-HCC6 , which is only enriched in 6 metabolism pathways, might be a metabolism-related module. Aberrations in lipid metabolism are often seen in chronic HBV infection. Downregulated linoleic acid [48], increased arachidonic acid synthesis [48] and high serum levels of retinol [49] and cytochrome P450 enzyme [50] are involved in the development of HCC.

Establishing a panel of genes to predict HCC risk for patients with chronic liver disease
In this study, 11 pairs of highly correlated genes and a panel of genes (cyp1a2-cyp2c19-il6) were identified in the core OAMs throughout the progression of CHB to cirrhosis and HCC. Almost all gene combinations achieved an AUC > 0.75. Generally, a larger AUC value indicates a better predictive model and is a commonly accepted rule in the determination of a model's performance [51]. A classification model can be considered to have an outstanding performance if the AUC value of the model is above 0.9. The performance of any classification model with AUC values between 0.8 and 0.9 is excellent [52]. Therefore, this result indicated that the 6 important genes and their combinations were successfully validated in the independent TCGA LIHC dataset and were able to accurately distinguish HCC from non-tumor tissues. A gene with an AUC value of at least 0.95 and a sensitivity and specificity of 90% or greater at the established threshold is considered adequate for the confident identification of HCC samples [31]. In addition to these criteria, we considered multiple indexes (total OOB error rates, G-mean, F-value, sensitivity, precision and specificity) as a whole and finally identified a three-gene set (cyp1a2-cyp2c19-il6) with an AUC of 0.973, a sensitivity of 0.9524, and a specificity of 0.9091. Here, considering the imbalance of the data, we mainly refer to total OOB error rates, AUC, G-mean and F-value. We also extracted the 7 most accurate rules/conditions from random forests for the three genes (cyp1a2, cyp2c19 and il6). Furthermore, the three genes have been previously reported to be associated with HCC [31,53,54], which is consistent with the results of rule extraction. In addition, the results of experimental verification indicated that JG granules had a significant inhibitory effect on tumor growth of human HCC transplanted tumors. JG granules was the drug combination containing the three compounds selected by the three genes (cyp1a2, cyp2c19 and il6), which could indirectly validate the effect of the three genes on the development of HCC. Furthermore, the two-gene set (cyp1a2-cyp2c19, AUC = 0.963) appeared in the same OAM (AM O CHB11-HCC6 ), cyp1a2 and cyp2c19 had the highest within-module degree, and an edge existed between them. This finding also confirmed the reliability of the AMs-based approach.
Finally, the limitation of this study is the lack of independent validation. In order to improve the accuracy of prediction, next we will validate the sensitivity and specificity of the three-gene set identified in our study by using an independent, large and multicenter cohort, and furtherly evaluate the diagnostic performance of the three-gene set in different Barcelona Clinic Liver Cancer (BCLC) stages. In addition, the performance of the three-gene set in differentiating the HCC group from the healthy, CHB, and cirrhosis groups will be also evaluated.

Conclusions
Taken together, we showed that the three-gene set (cyp1a2-cyp2c19-il6) was optimized to distinguish HCC from non-tumor samples using random forests with an AUC of 0.973. These findings indicated that the proposed sequential AMs-based approach contributed to revealing the dynamic evolution from CHB to cirrhosis and HCC, identifying a panel of genes for the assessment of HCC risk in patients with chronic liver disease and might be applied to any time-dependent cancer risk prediction.