Integration of gene expression, clinical, and epidemiologic data to characterize Chronic Fatigue Syndrome

Background: Chronic fatigue syndrome (CFS) has no diagnostic clinical signs or diagnostic laboratory abnormalities, and it is unclear whether it represents a single illness. The CFS research case definition recommends stratifying subjects by co-morbid conditions, fatigue level and duration, or functional impairment. To date, however, this approach has not yielded any further insight into CFS pathogenesis. This study integrated peripheral blood gene expression results with epidemiologic and clinical data to determine whether CFS is a single or heterogeneous illness.
Results: CFS subjects were grouped by several clinical and epidemiological variables thought to be important in defining the illness. Statistical tests and cluster analysis were used to distinguish CFS subjects and identify differentially expressed genes. These genes were identified only when CFS subjects were grouped according to illness onset, and the majority of the genes were involved in pathways of purine and pyrimidine metabolism, glycolysis, oxidative phosphorylation, and glucose metabolism.
Conclusion: These results provide a physiologic basis that suggests CFS is a heterogeneous illness. The differentially expressed genes imply fundamental metabolic perturbations that will be further investigated, and they illustrate the power of microarray technology for furthering our understanding of CFS.


Background
Chronic fatigue syndrome (CFS) is defined solely by self-reported symptoms and associated disability. There are no characteristic physical signs or diagnostic laboratory abnormalities. Diagnosis of CFS requires clinical evaluation to rule out other medical or psychiatric conditions that could cause or contribute to the patient's complaints [1]. Indeed, it remains unclear whether CFS represents a unique disease or a common clinical end-point of diverse pathologic processes.
CFS has been hypothesized to involve an abnormal response to infection, immunologic dysfunction, dysregulation of the hypothalamic-pituitary-adrenal axis, and dysautonomia, yet no biologic and physiologic perturbations have been reproducibly detected. This may reflect poor specificity of the case definition, patient selection bias, or other study design issues. Clearly, discovery of laboratory markers that improve the specificity of case ascertainment or differentiate groups within the CFS classification would increase the possibility of identifying pathogenic mechanisms.
The international CFS research guideline recommends that cases be stratified before analysis by several variables, including co-morbid conditions, current level and total duration of fatigue, current level of functional impairment, and type of fatigue onset [1]. People with CFS often describe a sudden onset to their illness, having become sick over one or two days, while others recount a gradual onset in which the symptom complex develops over weeks or months. Studies indicate that stress history [2] and recovery [3] appear to vary with mode of onset. Another approach is to group subjects based on symptoms. A recent study identified two subgroups, one with higher energy levels and fewer accompanying symptoms and another with significantly lower energy levels [4]. Deciphering the physiologic basis for CFS would go far in assessing the heterogeneity of the illness and would advance diagnosis and treatment.
Unique gene expression profiles have been found in cancer [5], chronic inflammatory/allergic diseases [6,7], autoimmune disorders (e.g., rheumatoid arthritis) [8], and multiple sclerosis [9]. We have previously shown that peripheral blood mononuclear cell (PBMC) gene expression profiles can distinguish the majority of CFS cases from non-fatigued controls [10]. In this study, we measured levels of gene expression in 23 persons with CFS identified in the general Wichita population. Our objective was to determine if integration of gene expression results with clinical and epidemiologic data would identify CFS subgroups.

Study Design
This study adhered to human experimentation guidelines of the U.S. Department of Health and Human Services. All participants were volunteers who gave informed consent. The Centers for Disease Control and Prevention Human Subjects Committee approved study protocols.
Forty-three CFS subjects were identified in a survey of the adult population of Wichita, Kansas [11]. CFS subjects fulfilled all criteria of the CFS research case definition [1]. The clinical evaluation was used to identify any co-morbid conditions and to detect the presence of exclusionary diagnoses. These included Major Depressive Disorder with Melancholic/Psychotic features, psychosis, alcohol/drug addiction, bulimia/anorexia, and medical conditions including cancer, hepatitis, or pregnancy. We obtained information concerning current disability, duration of illness, type of fatigue onset, and number and nature of accompanying symptoms. We also obtained blood samples, as described below. Because only 6 CFS subjects were men, we limited the present study to women. Of the 37 female CFS subjects, 5 were excluded because of lack of sample, 7 were excluded because of poor quality RNA, and 2 were excluded because of poor quality of array hybridization, leaving 23 women for analysis. Table 1 lists the clinical and epidemiologic variables used in our analysis. Onset of illness was defined as sudden (self-reported development of fatigue in less than 1 week) or gradual (developing fatigue over more than 1 month). Only one woman reported that her fatigue developed between 1 week and 1 month (Table 1), so her microarray results were used only in cluster analysis. Age was categorized as ≤50 or >50 years old, and duration of illness was categorized as ≤10 or >10 years (grouping into different periods did not alter the results). Body Mass Index (BMI) was categorized as normal (≤24.9 kg/m²), overweight (25-29.9 kg/m²), or obese (30-39.9 kg/m²) [12].
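The stratification rules above can be written down compactly. The following is a minimal sketch, not the authors' code: the function names are hypothetical, onset is assumed to be recorded as a number of days, and one month is assumed to be 30 days.

```python
from typing import Optional


def classify_onset(days_to_develop: float) -> Optional[str]:
    """Return 'sudden', 'gradual', or None for an intermediate onset."""
    if days_to_develop < 7:      # fatigue developed in less than 1 week
        return "sudden"
    if days_to_develop > 30:     # assumption: 1 month taken as 30 days
        return "gradual"
    return None                  # between 1 week and 1 month: unclassified


def classify_age(age_years: float) -> str:
    """Age stratum used in the text: <=50 or >50 years."""
    return "<=50" if age_years <= 50 else ">50"


def classify_duration(years_ill: float) -> str:
    """Illness-duration stratum used in the text: <=10 or >10 years."""
    return "<=10" if years_ill <= 10 else ">10"


def classify_bmi(bmi: float) -> Optional[str]:
    """BMI bands as listed in the text; values outside them return None."""
    if bmi <= 24.9:
        return "normal"
    if 25 <= bmi <= 29.9:
        return "overweight"
    if 30 <= bmi <= 39.9:
        return "obese"
    return None
```

A subject whose onset falls between the two definitions is returned as `None`, mirroring the single woman who was used only in the cluster analysis.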

Gene Expression Profiling
Nucleic acid extraction. During the clinical evaluation, a 10 ml blood sample was obtained and PBMC were isolated using LSM® Lymphocyte Separation Media (ICN Biomedicals, Costa Mesa, CA). Cells were washed, counted and stored for viability in liquid nitrogen as described [13]. Total RNA was extracted using the RNAqueous™ kit (Ambion Inc., Austin, TX) and the quality and quantity were assessed as previously described [14].
Preparation and hybridization of labelled cDNA. Biotinylated cDNA synthesis from 1 µg of total RNA was performed as previously described [14]. The cDNA probe was hybridized to the Atlas™ Human 3.8I oligonucleotide glass microarrays (CLONTECH Laboratories, Inc., Palo Alto, CA) using the Ventana Discovery™ system and their ChipMap™ kit (Ventana Medical Systems, Tucson, AZ). Hybridization was for 12 hours at 42°C, followed by three 10-minute stringency washes in 0.1X SSC at 42°C. Anti-biotin antibodies conjugated to RLS™ particles (Genicon Sciences Corporation, San Diego, CA) were used for signal detection as previously described [14]. The slides were archived and images captured using the GSD-501™ scanner (Genicon Sciences Corporation, San Diego, CA), and analyzed with ArrayVision™ RLS image analysis software (Genicon Sciences Corporation).

Data analysis
The scanned TIFF images were processed using ArrayVision™ (Imaging Research Inc., Ontario, Canada). Features deemed unsuitable for accurate quantitation because of artefacts, poor morphology, or uneven hybridization were excluded from further analysis. A median background value was calculated around each feature and subtracted from the mean signal to give the net signal for the respective gene. Data were uploaded into the CDC MAdB web-based analysis package, where background-adjusted intensity values were scaled and normalized to the 75th percentile. Values were log2-transformed and mean-centered to fit the data to a Gaussian distribution.
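The preprocessing steps can be illustrated with a short sketch. It is not the MAdB implementation: the function name and input layout are hypothetical, mean-centering is assumed to be done per array, and features with a non-positive net signal are assumed to be dropped so that the log transform is defined.

```python
import math
import statistics
from typing import Dict, Tuple


def normalize_array(features: Dict[str, Tuple[float, float]]) -> Dict[str, float]:
    """features maps gene -> (mean signal, median local background)."""
    # Net signal = mean feature signal minus the local median background.
    net = {gene: signal - background
           for gene, (signal, background) in features.items()}
    # Assumption: non-positive net signals are dropped so log2 is defined.
    net = {gene: value for gene, value in net.items() if value > 0}
    # Scale so that the 75th percentile of the array equals 1.
    q75 = statistics.quantiles(list(net.values()), n=4, method="inclusive")[2]
    logged = {gene: math.log2(value / q75) for gene, value in net.items()}
    # Mean-center the log2 values (per array, by assumption).
    centre = statistics.fmean(list(logged.values()))
    return {gene: value - centre for gene, value in logged.items()}
```

Because mean-centering follows the scaling step, the two operations together fix both the location and the reference point of each array before arrays are compared.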
We initially examined gene expression intensities for all 23 CFS subjects using the one-class analysis component of the Significance Analysis of Microarrays (SAM) program [15] to determine if the mean gene expression for each of 3,800 genes differed from zero. In the one-class analysis we used false discovery rates (FDR) of up to 25%. SAM was also used for a two-class analysis to compare the mean differences between the gene intensity values categorised by the clinical and epidemiologic variables listed in Table 1. An FDR of 5% was used for two-class analysis.
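For orientation, the per-gene "relative difference" statistic on which SAM is based can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the fudge factor `s0` is passed in rather than estimated from the data as the SAM program does, and the permutation procedure that SAM uses to estimate the FDR is not reproduced.

```python
import math
import statistics
from typing import Sequence


def sam_two_class_d(group_a: Sequence[float], group_b: Sequence[float],
                    s0: float = 0.0) -> float:
    """Relative difference between two groups for one gene."""
    n_a, n_b = len(group_a), len(group_b)
    mean_a, mean_b = statistics.fmean(group_a), statistics.fmean(group_b)
    # Pooled within-group sum of squares.
    ss = (sum((x - mean_a) ** 2 for x in group_a)
          + sum((x - mean_b) ** 2 for x in group_b))
    # Gene-specific scatter (pooled standard error of the difference).
    s = math.sqrt((1 / n_a + 1 / n_b) * ss / (n_a + n_b - 2))
    # s0 > 0 damps genes whose scatter is close to zero.
    return (mean_a - mean_b) / (s + s0)


def sam_one_class_d(values: Sequence[float], s0: float = 0.0) -> float:
    """Relative difference of one gene's mean from zero."""
    s = statistics.stdev(values) / math.sqrt(len(values))
    return statistics.fmean(values) / (s + s0)
```

With `s0 = 0` the two-class statistic reduces to the ordinary pooled-variance t statistic; a gene with zero scatter and `s0 = 0` would raise a division error, which is one reason a positive `s0` is used in practice.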
To identify distinct gene clusters we performed a two-way hierarchical cluster analysis as described by Eisen et al. [16]. The dendrograms were viewed using TreeView [16] (http://rana.lbl.gov/EisenSoftware.htm). All genes identified by SAM were submitted to Onto-Express (version 2) [17] (http://vortex.cs.wayne.edu:8080/index.jsp) to identify current gene ontology classifications. Onto-Express was chosen because it evaluates the probability that a particular molecular function, biological process, or cellular component occurs by chance in the context of the genes represented on the microarray being used.
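The clustering step can be illustrated with a small, self-contained sketch of agglomerative clustering. It is not the Cluster/TreeView software: the function names are hypothetical, the distance is assumed to be one minus the Pearson correlation, and the linkage is the plain average of pairwise distances (UPGMA), which may differ in detail from the averaging used by the original program.

```python
import math
import statistics
from typing import Dict, List, Sequence, Tuple


def pearson_distance(x: Sequence[float], y: Sequence[float]) -> float:
    """1 - Pearson correlation; undefined for constant profiles."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 1.0 - sxy / math.sqrt(sxx * syy)


def average_linkage(profiles: Dict[str, Sequence[float]]
                    ) -> List[Tuple[Tuple[str, ...], Tuple[str, ...], float]]:
    """Return the merge history as (cluster_a, cluster_b, distance)."""
    names = list(profiles)
    dist = {frozenset((a, b)): pearson_distance(profiles[a], profiles[b])
            for i, a in enumerate(names) for b in names[i + 1:]}

    def between(c1: Tuple[str, ...], c2: Tuple[str, ...]) -> float:
        # Average of all pairwise distances between the two clusters.
        return statistics.fmean([dist[frozenset((a, b))]
                                 for a in c1 for b in c2])

    clusters: List[Tuple[str, ...]] = [(n,) for n in names]
    merges = []
    while len(clusters) > 1:
        # Find and merge the closest pair of clusters.
        d, i, j = min((between(c1, c2), i, j)
                      for i, c1 in enumerate(clusters)
                      for j, c2 in enumerate(clusters) if i < j)
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges
```

Running the function once on subject profiles and once on gene profiles (the transposed matrix) gives the two dendrograms of a two-way analysis.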
The standard statistical t-test (assuming unequal variances) and the nonparametric Wilcoxon rank sum test were used in conjunction with the SAM two-class analysis to examine the potential differences in gene expression with respect to the variables outlined in Table 1. For the t-test and the Wilcoxon test, statistical significance was set at p < 0.01.
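The two conventional tests can be sketched for a single gene as follows. The function names are hypothetical. The t-test sketch returns only the statistic and the Welch-Satterthwaite degrees of freedom, since a p-value requires the t distribution from a statistics package; the rank sum sketch uses the normal approximation without a tie correction, which is only a rough guide for groups as small as those in this study.

```python
import math
import statistics
from statistics import NormalDist
from typing import List, Sequence, Tuple


def welch_t(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Unequal-variance t statistic and its approximate degrees of freedom."""
    va = statistics.variance(a) / len(a)
    vb = statistics.variance(b) / len(b)
    t = (statistics.fmean(a) - statistics.fmean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def midranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def rank_sum_test(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Wilcoxon rank sum z score and two-sided p (normal approximation)."""
    n_a, n_b = len(a), len(b)
    ranks = midranks(list(a) + list(b))
    w = sum(ranks[:n_a])                      # rank sum of the first group
    mean_w = n_a * (n_a + n_b + 1) / 2
    sd_w = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12)
    z = (w - mean_w) / sd_w
    return z, 2 * (1 - NormalDist().cdf(abs(z)))
```

A gene would then be flagged when the resulting p-value falls below the 0.01 threshold stated above.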

Differential gene expression
One-class analysis of gene expression data. Application of this method to the 23 CFS subjects identified no genes with expression variance statistically greater than the average that would provide evidence for heterogeneity of the CFS sample.
Table 1 footnotes:
a Onset type defined as sudden (self-reported as developing fatigue in ≤ 1 week) or gradual (developing fatigue over a period ≥ 1 month). One subject described onset as between 1 week and 1 month and was not classified for this stratification.
b Further stratification and analysis using the Kruskal-Wallis nonparametric test did not show different results.
c Analysis performed on BMI <25 (normal) compared with >30 (obese). Subjects considered overweight were not included in this particular analysis. Further stratification and analysis using the Kruskal-Wallis test showed no significant differences between the groups (results not shown).
d Illness group defined by factor analysis of symptoms followed by cluster analysis [4].

Two-class analysis of data
The 23 CFS subjects were grouped with respect to the variables listed in Table 1, and the mean differences between their gene expression values were then compared. This approach identified 117 genes that were differentially expressed when the CFS subjects were grouped by onset type (Table 2). Two-class analysis did not detect any differentially expressed genes at a false discovery rate of 5% when comparing any other variable listed in Table 2. Both the t-test and the Wilcoxon test results were similar to the two-class analysis, and there was considerable overlap among the genes detected by these three tests for type of fatigue onset. In total, 95 of the 117 genes identified by two-class analysis were detected by either the t-test or the Wilcoxon test. Analysis by age, illness duration, number of CFS symptoms, illness group, and BMI identified a few differentially expressed genes (Table 2), but there were no common genes across statistical tests, and no overlap with any of the 117 genes that differentiated onset type. For this reason, only the 117 genes identified by two-class analysis were examined further. Figures 1 and 2 display the two-way hierarchical cluster analysis of the 117 genes. The majority of subjects clustered according to onset type, and the genes fell into two distinct clusters. Expression of 19 of the 117 genes was increased in the gradual compared to the sudden onset group, while the expression of the remaining 98 genes was decreased. Figure 3 summarizes the functional classification of all 117 differentially expressed genes with respect to cluster group. Twenty-four genes are associated with metabolism (p < 0.01, hypergeometric probability distribution test). Twenty of these genes were down-regulated in the gradual onset cluster, and they were mainly involved in regulation of glycolysis, glucose and disaccharide metabolism, oxidative phosphorylation, amino acid biosynthesis, and purine or pyrimidine metabolism.
Of the 19 up-regulated genes, some were involved in metabolism, but they were not statistically significant. The 7 genes involved in RNA processing were, however, statistically significant in this group (p < 0.01, hypergeometric probability distribution test).
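The hypergeometric probability used for the functional categories can be sketched as below. This illustrates the general test, not the Onto-Express implementation, and the function and parameter names are hypothetical. It gives the probability of drawing at least the observed number of category genes when the selected genes are sampled at random from those on the array.

```python
from math import comb


def hypergeom_enrichment_p(n_array: int, n_category: int,
                           n_selected: int, n_hits: int) -> float:
    """P(X >= n_hits) for a hypergeometric draw.

    n_array    -- genes represented on the array
    n_category -- genes on the array annotated to the category
    n_selected -- differentially expressed genes
    n_hits     -- selected genes that fall in the category
    """
    upper = min(n_category, n_selected)
    total = comb(n_array, n_selected)
    # Sum the upper tail of the hypergeometric distribution.
    tail = sum(comb(n_category, k) * comb(n_array - n_category, n_selected - k)
               for k in range(n_hits, upper + 1))
    return tail / total
```

For the metabolism category, the call would take the form `hypergeom_enrichment_p(3800, n_category, 117, 24)`, where `n_category` (the number of metabolism genes on the array) is not reported in the text and would have to be taken from the array annotation.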

Discussion
CFS is thought to be a heterogeneous illness because no single cause has been identified and because various physiologic stressors, such as infection, trauma, and toxins, are thought to trigger its development in susceptible individuals. A major difficulty in identifying etiologies for CFS is that the case definition requires a minimum illness duration of six months. In most studies, subjects have been ill for many years, so initial disease triggers are hard to identify: causal factors may be difficult to detect or may no longer be present. In addition, in many diseases, factors associated with disability are distinct from causative factors. Biomarkers have the potential to give clues to disease etiology as well as mode of action.
In an attempt to determine whether CFS is a single or heterogeneous illness, we used microarrays to profile the expression of 3,800 genes in 23 women with CFS. We analyzed the array data using three statistical tests: 1) a program specifically designed for the analysis of microarray data (SAM), 2) a parametric t-test, and 3) a nonparametric rank sum test. One-class analysis by SAM failed to detect differences in gene expression profiles of the CFS subjects because many of the genes introduced noise into the process, masking the differences that were evident in two-class analysis. In the two-class analysis, the only variable that differentiated the CFS subjects was type of fatigue onset, that is, whether the women described their fatigue as occurring suddenly over the course of a week, or gradually, over the course of months. Different gene expression profiles among those who describe a difference in illness onset imply distinct etiological or triggering events and show that these differences are maintained well into the disease process. None of the other variables thought to be important in characterizing and defining CFS had any differentially expressed genes associated with them when CFS subjects were grouped accordingly. Interestingly, this is not the first time that type of fatigue onset has distinguished people with CFS. DeLuca et al. [18] showed that CFS subjects with gradual onset tend to develop CFS-type physical symptoms as a variant of a psychiatric disorder, while CFS patients with sudden onset may be more closely associated with a non-psychiatric etiology (i.e., a viral or infectious etiology). Mawle et al. [19] reported that CFS patients with gradual onset had more major life events occurring in the year prior to onset than did patients with sudden onset. In this study, the 1994 CFS research case definition [1] was strictly applied in designating CFS caseness; therefore, most psychiatric conditions (other than Major Depressive Disorder, which is comorbid in many people with CFS or any chronic illness) were exclusionary. We believe that this considerably reduced the other possible symptoms or conditions that may be highly correlated with fatigue and could potentially confound our data.
Table 2 footnotes:
a Number of genes for which false discovery rate (FDR) = 5%.
b Number of genes for which comparison yielded p < 0.01.
c The 7 subjects with 5 symptoms (Table 1) were excluded from the analysis of data shown in Table 2. Analysis including these subjects in either group did not significantly affect the outcome.
Figure 1. Hierarchical clustering of the differential gene expression patterns for gradual compared with sudden onset in CFS subjects. Matrix of the two-dimensional hierarchical clustering of genes and CFS subjects stratified on syndrome onset. Each row represents the hybridization results for a single gene, and each column represents a CFS subject. Transcript levels that are statistically different between onset types are shown above (red) and below (green) the mean.
Our findings of differentially expressed metabolic and RNA processing genes make both biologic and physiologic sense relative to CFS. We identified differences in purine and pyrimidine metabolism, glycolysis, oxidative phosphorylation, and glucose metabolism. Oxidative phosphorylation and the ATP generated by this process are the major source of energy for the normal function of most cells in the body. Metabolic changes are known to take place, and in some instances drive the pathophysiology of a number of chronic diseases. Subjects with sudden onset CFS often describe an infectious, viral-like illness as the initiating process. It is well-known that many RNA processing proteins are central to the effective action of the antiviral interferon [20]. Alterations in effective antimicrobial responses may also explain the chronic fatigue state.
The nature of the specimen determines the view of the disease revealed by gene expression profiling. In CFS there are no anatomical lesions to sample. Peripheral blood is an accessible source of circulating cells that reflect systemic changes, so it is a good starting point to profile diseases that have no lesions, or lesions that are inaccessible. However, peripheral blood mononuclear cells are themselves very heterogeneous, including B and T lymphocytes, monocytes, and natural killer cells. Changes in gene expression could be due to changes in the cellular composition as well as to differences in cellular activities. Nevertheless, several groups, including our own [21,22], have surveyed the magnitude of variation in gene expression patterns of peripheral blood and found it to be fairly limited. This study, as well as an earlier study of PBMCs in CFS [13], indicates that the peripheral blood does detect relevant gene expression differences. Fractionation of the PBMC population may give different insights into the disease process, and will be important to further characterize the pathophysiology of CFS.
The study must be interpreted with caution, as the number of subjects is small and the genes profiled represent a fraction of those potentially of importance. However, these data do support the idea that CFS is a heterogeneous illness with a biochemical basis to explain the fatigue. Different gene expression profiles among those who describe a difference in illness onset imply distinct etiological or triggering events and show that these differences are maintained well into the disease process. The results of this study demonstrate the utility of gene expression profiling to characterize an illness at the biological and physiological level. This should advance the cause of defining CFS at a molecular level, resulting in diagnosis and possible identification of causative agents.
Figure 2. Hierarchical clustering of the differential gene expression patterns for gradual compared with sudden onset in CFS subjects. Dendrograms showing average-linkage hierarchical clustering of CFS subjects. A blue circle indicates a subject with a sudden onset of CFS symptoms; a yellow circle indicates a gradual onset. The black circle marks the subject whose onset fell between the definitions of sudden and gradual onset.