We first demonstrate the theoretical background, i.e., the dynamical behaviour of a complex system when approaching to the tipping point or bifurcation point. Then the detailed algorithm used in this study is provided.
Three Markov processes and related statistic conditions
The progressions of most diseases are generally divided into the following three states/stages: normal state, pre-disease state and disease state (Fig. 1a). Both the normal state and the disease state are stable or stationary with high resilience or little fluctuations, and are robust to parameter changes. For these two states, it is not easy to trigger a critical transition towards an alternative state through external perturbations [21]. Thus, these two states are modelled as two stationary Markov processes (Fig. 1b). The pre-disease state is a state defined as the limit of the normal state just before the critical phase transition. Different from the above two stable states, the pre-disease state is a time-varying state with low resilience or strong fluctuations, and sensitive to even a small parameter change [21]. However, it may reverse back to the normal state if appropriately treated, which is in contrast to the irreversible disease state. Statically, there are little state change or significant differences such as consistently high/low expressions in some single biomolecules between the normal and pre-disease states [26]. Taking the dynamically unstable nature of the pre-disease state into account, it is regarded as a time-varying Markov process with strong fluctuating dynamics. For a specific disease, the pre-disease state presents the critical stage between a relatively “healthy” stage (the normal state) and a seriously ill stage (the disease state) in which it is difficult for a system to recover to the normal state even intensive interferences involved. Thus, it is of great importance to signal the pre-disease state in order to prevent catastrophic deterioration by performing timely intervention.
The DNB strategy suggests that a dominant group of molecules or DNB module appears at the pre-disease state. In such a group, the molecules together provides an early-warning signal when the biological system is in the vicinity of the tipping point [1]. Specifically, we theoretically proved that when a biological system from a normal state approaches the tipping point of a critical transition, the DNB module appear and simultaneously satisfy the following three statistic conditions [3, 15].
-
(i)
The standard deviation of each molecule in the module drastically increases.
-
(ii)
The correlation between molecules both within the module (i.e., intra-class correlation) sharply increases.
-
(iii)
The correlation between molecules inside and outside the module (i.e., inter-class correlation) rapidly decreases.
The derivation of the three generic conditions is presented in Additional file 1: A1 and A2. It also should be noted that the three-state model of complex diseases is not always generalizable. Actually, when there is no abrupt deterioration phenotype during the disease progression, or a stage-wise division is not applicable, such model is improper to characterize the biological processes of diseases.
Three Markov processes and related statistic conditions
Most biological molecules perform their functions through interactions with other biomolecules, which are either in a functional module or across modules. This inter- and intra-module interconnectivity suggests that the impact of a specific genetic abnormality not only affects the activity of the gene product that carries it, but may extend along the links of a network composed by biomolecules and change the activity of other gene products [27]. Therefore, an understanding of a biomolecule’s interaction network context is essential in determining the phenotypic impact of defects that affect it.
To study the dynamical evolution of a network system during the switch of a stationary Markov process into another time-varying Markov process, we employed a differential network that integrates differential associations/edges, including differential expression variances and differential expression correlations from adjacent sampling time points (Fig. 1d), that is, to quantify the statistical significance of each differential edge (differential Pearson’s correlation coefficient (PCC) value) in a differential network [28].
The difference between this differential network and existing methods lies in that we study the differential associations/correlations of genes/proteins rather than differential expressions of genes/proteins. In particular, we assume that the dysfunction or network rewiring of those associations during disease progression is possibly related to the relevant disease processes. Using gene or protein expression to quantitatively describe the dynamic associations between biomolecules, we can identify those significantly changed associations (i.e., gene pairs rather than individual biomolecules) between two biomolecules. These are denoted as differential associations hereafter and are presumed to be highly related to the dysfunction of regulation network of interest. The temporal-ordering differential network was obtained by the following three steps (Fig. 1d).
Building the correlation network
By mapping the correlation to an existing functional network (i.e., an integrated STRING network), the correlation network N
T was constructed at each sampling time point t = T (Fig. 1). Each edge connecting two nodes represents the correlation/association between two biomolecules, while each edge connecting only one node represents the self-regulation or variation of the biomolecule. Subsequently, there is a parameter α such that only those edges with high Pearson’s correlation coefficients (PCC) \((\left| {PCC} \right|\; \ge \;\alpha )\) or high standard deviation (SD) (\(\frac{{SD - { \hbox{min} }\left( {SD} \right)}}{{\hbox{max} \left( {SD} \right) - mean\left( {SD} \right)}} \ge \alpha\)) were reserved, where α is a to-be-determined parameter based on specific real data, that is, α is set such that there are few differential edges (i.e., significantly changed associations) in the initial differential networks during the normal stage, thus highlighting the pre-disease stage when many edges arise. The estimation of α is discussed in Additional file 1: B.
Building the specific network
Comparing the correlation networks N
T and N
T − 1, and the edges which differed between N
T and N
T − 1 were then identified and employed to illustrate the dynamic changes in associations between the adjacent time points T and T − 1. Then we obtained the specific networks \(SN_{T} = \left\{ {N_{T} } \right\}\backslash \left\{ {N_{T - 1} } \right\}\) and \(SN_{T - 1} = \left\{ {N_{T - 1} } \right\}\backslash \left\{ {N_{T} } \right\}\) respectively by removing the common edges (see Fig. 1d), where the common edges are presented in both correlation networks N
T and N
T − 1 and thus were regarded as interactions without significant change in dynamics.
Building the differential network
Combining the two specific networks SN
T and SN
T − 1, i.e., {SN
T} ∪ {SN
T − 1}, we obtained the differential network DN
T at time point T, in which the edge connecting two nodes records temporal differential correlation, and the edge connecting only one node records temporal differential variance.
We thus transformed the observed sequence (consecutive time-series data) {o
1, o
2, o
3, …, o
T, …} into a temporal-ordering differential network sequence {DN
2, DN
3, …, DN
T − 1, DN
T…} (Fig. 1b). Without abusing notation, we also denote the differential network series up to time T as \(O_{T} = \left\{ {o_{1} ,o_{2} ,o_{3} , \ldots ,o_{T - 1} , o_{T} } \right\} = \left\{ {DN_{2} ,DN_{3} , \ldots ,DN_{T - 1} ,DN_{T} } \right\}\). Each edge in the differential network DN
T represents the differential association in dynamics, i.e., distinct edges between N
T and its preceding correlation network N
T − 1. The disease-related genes are those that are common in the differential network sequence \(DN_{2} , DN_{3} , \ldots ,DN_{T}\). It is very possible that the genes included in these differential networks play important roles in progression from the normal state to the pre-disease state. An intuitive view of the differential network is that there are few differential edges when the differential network is constructed during either a normal or a disease stage, since two adjacent specific networks are expected to possess similar structures in view of the high stability nature of the system during these stages. On the other hand, since two adjacent specific networks should be quite different due to the time-varying and fluctuating dynamics of the system, many differential edges are expected to arise abruptly in a differential network when the system steps into a pre-disease stage. A sequence of such differential networks can portray time-dependent alterations of the system, and thus monitor the change at different stages of disease progression. In order to explore and quantify such a dynamical feature, we propose the scheme as follows.
Measuring the probability of being the end point of a stationary Markov process
According to the discussion above, the progression of a network system through a normal state is described by a stationary Markov process. Therefore, detecting the outset of a pre-disease state is equivalent to identifying the end of this stationary Markov process, or the switching point from a stationary Markov process to a time-varying Markov process. It is supposed that each time point \(t = T (T > 2)\) is a candidate transition point, or equivalently, the switching point of a stationary Markov process. Then we need to determine whether the observable samples or differential network derived at this candidate point is not coincident with that in the preceding stationary Markov process. Therefore, an index I-score is proposed for measuring the probability of how likely a candidate time point is the ending point of a stationary Markov process. The approach contains the following two steps. First, utilizing an observable network sequence O
T − 1 = {DN
2, DN
3, …, DN
T − 1}, i.e., T − 2 temporal-ordering differential networks based on samples from the preceding time points t = 1, 2,…, T − 1, we trained and obtained a hidden Markov model (HMM) \(\varTheta_{T - 1} \left( {O_{T - 1} } \right) = \left( {A,B,\pi } \right)\), with the subscript T − 1 of \(\varTheta\) denoting that \(\varTheta\) is constructed from the training samples/networks up to t = T − 1; parameter A represents a state transition matrix; B stands for an observation matrix; and π denotes the initial probabilities. The training of \(\varTheta\) follows the Baum-Welch procedures, an unsupervised learning method, which is presented in Additional file 1: A3. Second, we test if a candidate point is belong to the switching period, by calculating the I-score of the form:
$$I\left( T \right) = P_{T} (s_{T} = S_{pre} | s_{T - 1} = S_{normal} , \ldots , s_{1} = S_{normal} ;\varTheta_{T - 1} ,O_{T} ), \left( {T \ge 2} \right)$$
where \(\varTheta_{T - 1} (O_{T - 1} )\; = \;(A,\,B,\,\pi )\) is the trained HMM. O
T = {DN
2, DN
3, …, DN
T} is a sequence of temporal-ordering differential networks. {s
1, s
2, …, s
T − 1, s
T} denotes a state sequence where s
i is the system state at the i-th time point. S
normal and S
pre are both hidden (unobserved) states. S
normal stands for the normal state, and S
pre represents for the pre-disease state. On one hand, if the differential network {DN
T} is derived in the normal stage, or equivalently the testing point t = T, at which DN
T is built, is still within the stationary Markov process characterized by HMM \(\varTheta_{T - 1}\), then the I-score I(T) is supposed to be identical or similar to I(T − 1) (Fig. 1c). Thus the I-score curve progresses steadily and remains low value if the system is in the normal state. On the other hand, once the observation {DN
T} is from the pre-disease state, or the system transits into a time-varying Markov process at testing time point t = T, I(T) rises rapidly, suggesting that the observation o
T = {DN
T} is highly inconsistent with the HMM \(\varTheta_{T - 1}\) constructed from the prior information, i.e., the differential network series {DN
2, DN
3, …, DN
T − 1}. Clearly the inconsistent observation or distinct differential networks only appear during the switching period between the normal stage described by a stationary Markov process and the pre-disease stage described by a time-varying Markov process. Therefore, the boost of I(T) signals the onset of the pre-disease stage, or the impending of catastrophic transition. The specific algorithm for calculating I-score is presented in Additional file 1: A3.
Data processing
We applied the I-score scheme to three time-course datasets (GSE2565, GSE1393 and GSE13009) downloaded from the NCBI GEO database (http://www.ncbi.nlm.nih.gov/geo). For all these genomic data, we discarded the probes without corresponding NCBI Entrez gene symbol. For each gene mapped by multiple probes, the average value was employed as the gene expression. The procedure of building a molecular interaction network was as follows. First, the biomolecular association networks for Homo sapiens and Mus musculus were downloaded from several public databases, e.g., protein–protein interactions from STRING (http://string-db.org), and transcriptional regulations from TRED (rulai.cshl.edu/cgi-bin/TRED/tred.cgi?process = home). We integrated these linkage information together without redundancy into a whole molecular interaction network including 65,625 functional linkages in 11,451 molecules for Homo sapiens, and 37,950 linkages in 6683 molecules for Mus musculus. Second, the genes from each microarray dataset were mapped to the integrated network to extract the related linkages. The molecular network was used for the initial construction of correlation networks and consequent analysis. Finally, our main results were visualized by Cytoscape (http://www.cytoscape.org) in the post-processing step. The functional analysis including gene ontology and pathway enrichment were based on GO database (http://www.geneontology.org/page/go-database) and KEGG mapper tool (http://www.genome.jp/kegg/tool/map\_pathway2.html).
For real datasets, to reduce computational complexity, the differential network was partitioned into many local networks. Each local network contained a centre node (a differentially-expressed gene) and all of its first-order neighbours based on the network structure, such as a mapped STRING network. The I-scores for the local networks were then calculated one by one, thus generating a weighted average score. Specifically, suppose there are k subnetworks partitioned from a differential network,
$$I = \frac{{n_{1} I_{1} + n_{2} I_{2} + \cdots + n_{k} I_{k} }}{{n_{1} + n_{2} + \cdots + n_{k} }}$$
where n
i denotes the number of nodes in the i-th local network and I
i stands for the local I-score of this subnetwork.
In Additional file 1: C, data description and processing procedures were presented and discussed in detail.