An interactive web application for the dissemination of human systems immunology data
© Speake et al. 2015
Received: 23 February 2015
Accepted: 18 May 2015
Published: 19 June 2015
Systems immunology approaches have proven invaluable in translational research settings. The current rate at which large-scale datasets are generated presents unique challenges and opportunities. Mining aggregates of these datasets could accelerate the pace of discovery, but new solutions are needed to integrate the heterogeneous data types with the contextual information that is necessary for interpretation. In addition, enabling tools and technologies facilitating investigators’ interaction with large-scale datasets must be developed in order to promote insight and foster knowledge discovery.
State-of-the-art application programming was employed to develop an interactive web application for browsing and visualizing large and complex datasets. A collection of human immune transcriptome datasets was loaded alongside contextual information about the samples.
We provide a resource enabling interactive query and navigation of transcriptome datasets relevant to human immunology research. Detailed information about studies and samples is displayed dynamically; if desired, the associated data can be downloaded. Custom interactive visualizations of the data can be shared via email or social media. This application can be used to browse context-rich systems-scale data within and across systems immunology studies. This resource is publicly available online at [Gene Expression Browser Landing Page (https://gxb.benaroyaresearch.org/dm3/landing.gsp)]. The source code is also available openly [Gene Expression Browser Source Code (https://github.com/BenaroyaResearch/gxbrowser)].
We have developed a data browsing and visualization application capable of navigating increasingly large and complex datasets generated in the context of immunological studies. This intuitive tool ensures that, whether taken individually or as a whole, such datasets generated at great effort and expense remain interpretable and a ready source of insight for years to come.
Systems studies rely on high throughput profiling technologies to measure the abundance or activity of all the constituents of a given biological system. This unbiased approach provides a global perspective on biological phenomena that can thus be studied as a whole, rather than as a sum of parts. It has also proven a particularly powerful approach for hypothesis generation. Whole transcriptome profiling technologies constitute a robust yet affordable means to generate data on a systems scale and have been extensively used. As a result, vast amounts of transcriptome data are now available in public repositories. For example, more than 37,000 microarray or RNAseq studies are available in the NCBI Gene Expression Omnibus (GEO) repository, corresponding to more than 800,000 individual transcriptome profiles. In the immunology field, transcriptome profiling has allowed in-depth phenotyping of cell populations, the identification of transcriptional programs regulating hematopoiesis, lymphocyte differentiation [3, 4], and host responses [5, 6]. The use of large scale profiling technologies has transformed our understanding of human immunology, unraveling the novel pathways that underlie disease pathogenesis [8–10] and vaccine responses [11–13]. The data associated with each study, which tend to be underutilized beyond publication of primary results, potentially constitute an invaluable resource when reinterpreted alongside other related datasets. They can provide context for the interpretation of newly generated data, and when analyzed collectively can yield insights that could not otherwise be obtained from the analysis of individual datasets.
We have developed an interactive data browsing application to promote the integration and dissemination of immunologically relevant transcriptome data. Rich contextual information is provided to support data interpretation and foster novel immunological insights. The Gene Expression Browser (GXB) platform leverages social media such as Google+ to provide users with the ability to gather, prioritize, and share findings that arise while browsing large scale profiling data. Such a framework was used previously to promote the dissemination of clinical and transcriptomic data generated in the course of a systems immunology study, thereby promoting exploration and discovery of novel knowledge by readers as well as increasing transparency of the data included in that publication. Here we make a compendium of curated public domain datasets accessible via a web portal as a resource for the community, at https://gxb.benaroyaresearch.org/dm3/landing.gsp. In addition, the source code for this software is released and made available for reuse by others at https://github.com/BenaroyaResearch/gxbrowser.
The web application supports the importation of GEO data from .soft and .family style files, as well as Illumina BeadStudio standard output format. Authenticated users may upload data and add annotations using standard spreadsheet tools via comma-separated value (.csv) files.
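As an illustration of the kind of import described above, the following is a minimal Python sketch (not the GXB importer, which is written in Grails/Groovy) that collects per-sample annotations from GEO SOFT family text. It assumes the usual SOFT conventions of `^SAMPLE = <accession>` entity lines and `!Sample_characteristics_ch1 = key: value` attribute lines; the function name, accessions, and example content are hypothetical.

```python
from collections import defaultdict


def parse_soft_samples(text):
    """Collect per-sample characteristics from GEO SOFT family text."""
    samples = defaultdict(dict)
    current = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("^SAMPLE"):
            # Entity lines look like "^SAMPLE = GSM0001".
            current = line.split("=", 1)[1].strip()
            samples[current]  # register the sample even without annotations
        elif line.startswith("^"):
            current = None  # a different entity block (e.g. ^SERIES, ^PLATFORM)
        elif current and line.startswith("!Sample_characteristics_ch1"):
            # Values are conventionally "key: value", e.g. "gender: female".
            value = line.split("=", 1)[1].strip()
            key, sep, val = value.partition(":")
            if sep:
                samples[current][key.strip()] = val.strip()
    return dict(samples)


# Hypothetical SOFT fragment used only to exercise the parser.
EXAMPLE_SOFT = """\
^SERIES = GSE0000
!Series_title = Hypothetical example series
^SAMPLE = GSM0001
!Sample_title = Control 1
!Sample_characteristics_ch1 = gender: female
!Sample_characteristics_ch1 = age: 34
^SAMPLE = GSM0002
!Sample_characteristics_ch1 = gender: male
"""

annotations = parse_soft_samples(EXAMPLE_SOFT)
```

The resulting dictionary maps each sample accession to its annotation fields, which is also a convenient shape for writing out a .csv file that can be edited in a spreadsheet.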
Microarray chip probe mapping definitions were downloaded from Affymetrix and Illumina user support websites, and imported into the web application. In all, twenty different microarray chip types are currently supported by the application.
Results and discussion
Assembly of a collection of curated datasets
We have assembled and curated a collection of 169 datasets that are relevant to human immunology, representing a total of 13,089 unique transcriptome profiles. These sets were selected from studies currently available in NCBI’s Gene Expression Omnibus (GEO). We queried GEO for all datasets related to any of the following search terms: “monocyte”, “neutrophil”, “CD4”, “CD8”, “B cell”, “NK”, “Natural Killer”, “Plasma cell”, “CD19”, and “CD20”. The query list was filtered to select microarray datasets generated from human samples, with a sample count greater than 10. The selection was further refined to include datasets generated using Affymetrix or Illumina chips, the two most commonly used commercial microarray platforms; a few datasets from other platforms were also incorporated. In addition, relevance to human immunology research was assessed for each sample set. Datasets were removed from the tool if they could not be appropriately displayed; most commonly, this happened when authors deposited normalized data into GEO but did not provide information on the type of normalization performed, which resulted in a very limited range of values detected in the dataset. Log2 normalization, which is applied to the majority of Affymetrix data, is automatically detected and addressed at loading. Separately, we queried GEO for whole-blood microarray datasets containing both disease case and control samples. The whole-blood data and cell-type specific data were all loaded into the web tool. Annotation data supplied with each dataset in GEO are stored in the tool. Annotations available from GEO sometimes provide substantial contextual information [9, 13, 27], but in most cases annotations are quite limited. When available, additional annotation data found in the dataset’s primary publication were manually added to the annotation information provided with the microarray data.
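The text does not describe how log2-scaled data are detected at loading. One simple heuristic, sketched below in Python under stated assumptions, is to treat a dataset as log2-scaled when its largest value is small, since raw microarray intensities typically range into the thousands. The threshold of 30, the function names, and the choice to convert back to the linear scale are all assumptions for illustration and are not taken from the GXB source.

```python
import math


def looks_log2(values, max_log_value=30.0):
    """Guess whether expression values are already on a log2 scale.

    Assumption: log2-scaled intensities rarely exceed ~30, whereas raw
    intensities usually reach the hundreds or thousands.
    """
    finite = [v for v in values if v is not None and math.isfinite(v)]
    if not finite:
        raise ValueError("no finite expression values")
    return max(finite) <= max_log_value


def to_linear(values):
    """Return values on a linear scale, undoing log2 scaling if detected."""
    if looks_log2(values):
        return [2.0 ** v for v in values]
    return list(values)
```

A range-based check like this will misclassify a raw dataset whose values all happen to be small, which is one reason datasets with an unexplained, very limited range of values are hard to display reliably.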
Data upload and processing
Dataset navigation interface
The database can also be queried for studies in which a gene of interest is differentially expressed between two groups. Entering an official gene symbol in the query box on the top left corner of the page will return a list of studies for which the queried gene meets a user-specified fold change cut-off. For instance, a query for IFITM3 with a twofold difference between groups returns 43 datasets. These include several studies of vaccine responses, as well as a TB dataset. It should be noted that many datasets include more than two study groups, so multiple comparisons are performed for any given gene. For simplicity, each dataset is listed only once, even though further examination may reveal that differences meeting the pre-specified cutoff can be found for more than one group comparison.
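The fold-change query described above can be sketched as follows. This is an illustrative Python outline, not the GXB implementation: it assumes linear-scale expression values for a single gene, compares group means over every pair of groups, and lists a dataset once if any comparison meets the cutoff. The dataset names and values are invented for the example.

```python
from itertools import combinations
from statistics import mean


def max_fold_change(groups):
    """Largest ratio of group means over all pairwise group comparisons."""
    best = 1.0
    for a, b in combinations(groups.values(), 2):
        mean_a, mean_b = mean(a), mean(b)
        if mean_a <= 0 or mean_b <= 0:
            continue  # ratio is undefined for non-positive means
        best = max(best, mean_a / mean_b, mean_b / mean_a)
    return best


def datasets_meeting_cutoff(expression_by_dataset, cutoff=2.0):
    """Return each dataset once if any group comparison meets the cutoff."""
    return [
        name
        for name, groups in expression_by_dataset.items()
        if max_fold_change(groups) >= cutoff
    ]


# Hypothetical expression values for one gene in two datasets.
example = {
    "vaccine_study": {
        "day0": [100.0, 120.0],
        "day1": [300.0, 340.0],
        "day7": [110.0, 105.0],
    },
    "tb_study": {"control": [200.0, 220.0], "active": [210.0, 230.0]},
}

hits = datasets_meeting_cutoff(example, cutoff=2.0)
```

In this example the vaccine dataset is returned a single time even though both the day1 versus day0 and day1 versus day7 comparisons exceed twofold, mirroring the behavior described in the text.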
Thus, a navigation interface with advanced query and filter capabilities has been implemented to provide users with the ability to quickly identify relevant datasets for further in-depth exploration and data browsing. While the interface is intended to make it easy for users to identify datasets of interest, we recognize that any new software tool can be complex. A video tutorial covering the majority of functions described in this manuscript is available as a companion to the website.
Data browsing and visualization interface
Clicking on one of the studies listed in the dataset navigation interface opens a viewer designed to provide interactive browsing and graphic representations of large-scale data in an interpretable format. This interface is designed to navigate ranked gene lists and display expression results graphically in a context-rich environment. Selecting a gene from the rank ordered list on the left will display its expression values graphically in the screen’s central panel.
Graphical representation (central panel): Expression values for the selected gene can be represented as a histogram, where each available sample is shown as a bar (Figure 4), or as a box plot where each sample is shown as a dot. Directly above the graphical display, drop down menus give users the ability:
(a) To change how the gene list is ranked. This allows the user to change the method used to rank the genes, or to include only genes that are selected for specific biological interest. Gene lists come from the KEGG database, or are constituted of immune-relevant genes (e.g. cytokine ligands and receptors, T cell signaling), or of genes associated with known disease signatures (GVHD and SLE, among others).
(b) To change sample groups (Group Set button). In some datasets, a user can switch between groups based on cell type to groups based on disease type, for example.
(c) To sort individual samples within a group based on associated categorical or continuous variables (e.g. gender or age).
(d) To toggle between the histogram view and a box plot view. Samples are split into the same groups whether displayed as a histogram or box plot.
(e) To view a color legend for the sample groups.
(f) To select categorical information that is to be overlaid at the bottom of the graph. For example, the user can display gender or smoking status in this manner.
(g) To view a color legend for the categorical information overlaid at the bottom of the graph.
(h) To download the graph as a jpeg image. After the graph has been customized it can be downloaded as seen on screen, and an advanced menu gives the user the opportunity to provide a title for the graph and change the legends for the X and Y axes.
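The grouping and within-group sorting options listed above amount to a two-level ordering of samples before plotting. The Python sketch below shows one way to express it; the record fields (`id`, `group`, `age`, `value`) and the sample data are hypothetical and are not taken from the GXB data model.

```python
from collections import defaultdict


def order_samples(samples, group_key, sort_key):
    """Order samples by group, then by a chosen variable within each group."""
    grouped = defaultdict(list)
    for sample in samples:
        grouped[sample[group_key]].append(sample)
    ordered = []
    for group in sorted(grouped):
        # Within a group, samples are sorted on the selected variable.
        ordered.extend(sorted(grouped[group], key=lambda s: s[sort_key]))
    return ordered


# Hypothetical sample records for a single gene.
samples = [
    {"id": "S1", "group": "TB", "age": 41, "value": 812.0},
    {"id": "S2", "group": "Control", "age": 35, "value": 240.0},
    {"id": "S3", "group": "TB", "age": 29, "value": 655.0},
    {"id": "S4", "group": "Control", "age": 52, "value": 198.0},
]

bar_order = [s["id"] for s in order_samples(samples, "group", "age")]
```

Switching the group set corresponds to passing a different `group_key`, and the same ordering can feed either the bar view or the box plot view.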
Information about the gene selected from the list on the left side of the display is available under the “Gene” tab. Description of the gene function is parsed from the RefSeq database. Links to the gene’s page on external resources at NCBI Gene, Wikipedia, and Wolfram Alpha are also provided. A list of titles for the most recent 25 PubMed articles mentioning the gene is available with a single click, and the titles link out to PubMed for quick access to relevant literature.
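The manuscript does not state how the PubMed list is retrieved. One plausible route, shown here only as a hedged Python sketch, is the public NCBI E-utilities `esearch` endpoint, which accepts `db`, `term`, `retmax`, `sort`, and `retmode` parameters. The helper name is hypothetical, and whether GXB uses this service is an assumption.

```python
from urllib.parse import urlencode

EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def recent_pubmed_query(gene_symbol, max_results=25):
    """Build an esearch URL for the most recent PubMed records on a gene."""
    params = {
        "db": "pubmed",
        "term": gene_symbol,
        "retmax": max_results,  # the text mentions the 25 most recent articles
        "sort": "pub_date",     # newest publications first
        "retmode": "json",
    }
    return EUTILS_ESEARCH + "?" + urlencode(params)


url = recent_pubmed_query("IFITM3")
```

The `esearch` call returns PubMed identifiers only; article titles would need a follow-up request, for example to the `esummary` endpoint.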
Information about the study is available under the “Study” tab. Data interpretation requires an understanding of how the study or experiment was done and why it was done. This section provides background information about the study design, a reference to the primary publication associated with the dataset, and a link to the PubMed abstract.
Information available about individual samples is provided under the “Sample” tab. Rolling the mouse cursor over a histogram bar while displaying the “Sample” tab lists any clinical, demographic, or laboratory information available for the selected sample. When a large amount of sample information is available it can be broken down over multiple tabs, in order to display all ancillary data associated with a sample. This level of detail is rarely available for studies published in GEO, but we made use of the software application to display such information in a recent publication and associated website.
Most of this sample information can be overlaid on the gene expression data histogram as colored rectangles at the bottom of each bar for categorical variables (Figure 4), or as data points plotted on a separate axis (see ). In our example from the tuberculosis study, this includes demographic data like age, gender, and ethnicity, as well as clinical and laboratory data such as radiographic extent of disease and drug resistance patterns of the infecting bacterial strains. As mentioned above, clinical and demographic data can be used to sort samples within a group. This feature makes it easy for users to quickly visualize relationships between clinical variables and changes in expression data. The customized expression data can be downloaded along with user-defined clinical overlays as publication/presentation-quality graphics with one click. Legends for both the histogram and the clinical data overlays are viewable within the tool and are downloadable as well.
Finally, the “Downloads” tab allows advanced users to retrieve the original dataset for analysis outside this tool. It also provides all available sample annotation data for use alongside the expression data in third party analysis software.
Thus, this tool not only allows the navigation and querying of vast amounts of data with minimal user learning time, but is also capable of integrating heterogeneous ancillary information that is paramount for the interpretation of transcriptome data and generation of novel knowledge.
The envelope icon on the top right corner of the display sets up a new e-mail message in the user's default e-mail application. The e-mail is pre-populated with a web link that can load the user’s current view of the dataset, including sample groupings, information overlays, and other plot options.
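A shareable view of this kind is commonly implemented by encoding the view state in the link's query string and wrapping the link in a `mailto:` URL. The Python sketch below illustrates the idea; the base URL, parameter names, and function names are hypothetical and do not reflect the actual GXB URL scheme.

```python
from urllib.parse import quote, urlencode


def build_view_link(base_url, dataset_id, gene, group_set, overlays, plot_type):
    """Encode the current view (hypothetical parameter names) in a URL."""
    params = {
        "dataset": dataset_id,
        "gene": gene,
        "groupSet": group_set,
        "overlays": ",".join(overlays),
        "plot": plot_type,
    }
    return base_url + "?" + urlencode(params)


def build_mailto(view_link, subject="Gene expression view"):
    """Wrap a view link in a mailto: URL that pre-populates a new message."""
    return "mailto:?subject=" + quote(subject) + "&body=" + quote(view_link, safe="")


link = build_view_link(
    "https://example.org/gxb/view",  # placeholder, not the real GXB endpoint
    dataset_id=42,
    gene="IFITM3",
    group_set="disease",
    overlays=["gender", "age"],
    plot_type="box",
)
mailto = build_mailto(link)
```

Because the whole view state travels in the link, the recipient sees the same groupings and overlays without needing an account or a saved session.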
The web application presented here employs state-of-the-art web programming to uniquely address Big Data’s canonical “3 Vs”, which are: Volume, Variety and Velocity. Increasingly large Volumes of data are generated through widespread use of systems or large-scale profiling approaches. The storage and management of such large volumes of data has become a challenge, but the GXB tool has been designed specifically to handle the large collections of datasets generated through systems-scale profiling approaches. The second V stands for Variety. Datasets have become increasingly heterogeneous, especially in the context of clinical studies where a wide array of information about study subjects is available and needs to be captured. Here we have shown that GXB can incorporate and present to the user large amounts of heterogeneous ancillary information necessary for data interpretation. There is also an obvious need to integrate data generated across studies. This need will continue to grow as more clinical research studies begin to employ large scale profiling technologies in concert (e.g. genome, transcriptome, microbiome), thus heralding the era of so called “multi-omics” approaches. The third V stands for Velocity. Acquisition, storage, and integration of vast amounts of data is an important goal, but in order to prove useful, especially as a source of novel insight and knowledge, these data must be readily and seamlessly accessible to the user. A primary strength of the GXB tool is its ability to enable rapid querying, access, and visualization of large and heterogeneous datasets. Tackling big data in the biomedical sciences will undeniably require the continued development and use of sophisticated data mining solutions, which enable bioinformaticians to map relationships in systems data and thereby reduce its dimensions.
It will also require the development of tools like the GXB that can engage the participation of the biomedical research community at large, can expose knowledge gaps, and can hopefully accelerate the pace of medical discoveries.
While other data viewers share some characteristics with the GXB [2, 38–45], none have fully integrated all attributes available into a tool that is designed to address the challenges posed by biomedical big data. Special emphasis has been put into user interface design; the GXB interface is as clean and simple as possible so that the vast amount of data appears clear and seamless to each scientific user. As demonstrated in one of our earlier publications, such interactive data visualization tools can be employed for the generation of interactive supplements to static figures in publications. This enables more democratic access to the data underlying each static figure, and creates the opportunity for others to derive additional insight from this vast dataset. Greater transparency is an added benefit to providing data via a web-based tool. This is especially important when publishing studies based on large-scale datasets, since those data are seldom easily visualized by reviewers or by the community. For all these reasons, the use of interactive figures and data browsing software as companion to publications must be promoted and should become widespread in scientific publishing. We hope that the data browsing software tool that we have developed will further this goal.
The GXB tool is continually being improved to better enable data sharing and analysis. In the immediate future, we will begin to support upload and analysis of RNAseq data, at both the gene level and the exon level, and of high-throughput quantitative polymerase chain reaction (qPCR) datasets. It should also be noted that the tool can be used to display other data types, such as protein or cellular measurements, regardless of whether they are high dimensional. For instance, a modified version of the tool can be used to plot results from flow cytometry studies while displaying FACS images (see [13, 46]).
CS tested the software, uploaded datasets, annotated datasets, and drafted the manuscript. SP participated in software design, programmed portions of the web application, tested the software, uploaded datasets, annotated datasets, and assisted in drafting the manuscript. KD participated in software design, programmed portions of the web application, and tested the software. BZ participated in software design, programmed portions of the web application, and tested the software. AB participated in software design, programmed portions of the web application, and tested the software. DA participated in software design, programmed portions of the web application, and tested the software. MM tested the software, uploaded datasets, and annotated datasets. EW programmed portions of the web application. OV annotated datasets. DP annotated datasets. DR curated the dataset collection and assisted in preparation of manuscript figures. NJC tested the software, uploaded datasets, and annotated datasets. LC tested the software, uploaded datasets, and annotated datasets. CQ participated in software design, programmed portions of the web application, tested the software, uploaded datasets, annotated datasets, and assisted in drafting the manuscript. DC participated in software design, tested the software, and drafted the manuscript. All authors read and approved the final manuscript.
The authors would like to acknowledge Kristen Dang PhD, who uploaded and annotated several studies. This work was supported by Benaroya Research Institute funding, as well as grants from the National Institutes of Health (U19 AI08998, U19 AI057234, U01 AI082110, N01-AI-15416, and P01 CA084512). Funding bodies had no role in preparing either the software product or the manuscript, nor in our decision to submit it for publication.
Compliance with ethical guidelines
Competing interests
The authors declare that they have no competing interests.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Barrett T, Troup DB, Wilhite SE, Ledoux P, Evangelista C, Kim IF et al (2011) NCBI GEO: archive for functional genomics data sets–10 years on. Nucleic Acids Res 39(Database issue):D1005–D1010
- Heng TSP, Painter MW (2008) The Immunological Genome Project: networks of gene expression in immune cells. Nat Immunol 9:1091–1094
- Novershtern N, Subramanian A, Lawton LN, Mak RH, Haining WN, McConkey ME et al (2011) Densely interconnected transcriptional circuits control cell states in human hematopoiesis. Cell 144:296–309
- Haining WN, Ebert BL, Subramanian A, Wherry EJ, Eichbaum Q, Evans JW et al (2008) Identification of an evolutionarily conserved transcriptional signature of CD8 memory differentiation that is shared by T and B cells. J Immunol 181:1859–1868
- Chevrier N, Mertins P, Artyomov MN, Shalek AK, Iannacone M, Ciaccio MF et al (2011) Systematic discovery of TLR signaling components delineates viral-sensing circuits. Cell 147:853–867
- Miao EA, Leaf IA, Treuting PM, Mao DP, Dors M, Sarkar A et al (2010) Caspase-1-induced pyroptosis is an innate immune effector mechanism against intracellular bacteria. Nat Immunol 11:1136–1142
- Germain RN, Meier-Schellersheim M, Nita-Lazar A, Fraser IDC (2010) Systems biology in immunology: a computational modeling perspective. Annu Rev Immunol 2011(29):527–585
- Pascual V, Allantaz F, Arce E, Punaro M, Banchereau J (2005) Role of interleukin-1 (IL-1) in the pathogenesis of systemic onset juvenile idiopathic arthritis and clinical response to IL-1 blockade. J Exp Med 201:1479–1486
- Berry MPR, Graham CM, McNab FW, Xu Z, Bloch SAA, Oni T et al (2010) An interferon-inducible neutrophil-driven blood transcriptional signature in human tuberculosis. Nature 466:973–977
- Pascual V, Chaussabel D, Banchereau J (2009) A genomic approach to human autoimmune diseases. Annu Rev Immunol 2010(28):535–571
- Querec TD, Akondy RS, Lee EK, Cao W, Nakaya HI, Teuwen D et al (2009) Systems biology approach predicts immunogenicity of the yellow fever vaccine in humans. Nat Immunol 10:116–125
- Nakaya HI, Wrammert J, Lee EK, Racioppi L, Marie-Kunze S, Haining WN et al (2011) Systems biology of vaccination for seasonal influenza in humans. Nat Immunol 12:786–795
- Obermoser G, Presnell S, Domico K, Xu H, Wang Y, Anguiano E et al (2013) Systems scale interactive exploration reveals quantitative and qualitative differences in response to influenza and pneumococcal vaccines. Immunity 38:831–844
- Gene Expression Browser Landing Page (https://gxb.benaroyaresearch.org/dm3/landing.gsp)
- Gene Expression Browser Source Code (https://github.com/BenaroyaResearch/gxbrowser)
- Grails Programming Language (http://www.grails.org)
- Groovy Programming Language (http://groovy.codehaus.org/)
- Apache Tomcat (http://tomcat.apache.org/)
- MySQL Database (http://www.mysql.com/)
- Mongo Database (http://www.mongodb.org/)
- R Programming Language (http://www.r-project.org/)
- Smyth GK (2004) Linear models and empirical bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol 3(1):1–25
- Gene Expression Browser R Scripts (https://github.com/BenaroyaResearch/gxrscripts)
- Gene Expression Browser Starter Databases (http://gxb.benaroyaresearch.org/downloads)
- NCBI Gene Expression Omnibus (GEO) (http://www.ncbi.nlm.nih.gov/geo/)
- Hutcheson J, Scatizzi JC, Siddiqui AM, Haines GK, Wu T, Li Q-Z et al (2008) Combined deficiency of proapoptotic regulators Bim and Fas results in the early onset of systemic autoimmunity. Immunity 28:206–217
- Vargova K, Curik N, Burda P, Basova P, Kulvait V, Pospisil V et al (2011) MYB transcriptionally regulates the miR-155 host gene in chronic lymphocytic leukemia. Blood 117:3816–3825
- Gene Expression Browser Video Tutorial (https://gxb.benaroyaresearch.org/dm3/tutorials.gsp)
- Kanehisa M, Goto S, Sato Y, Furumichi M, Tanabe M (2012) KEGG for integration and interpretation of large-scale molecular data sets. Nucleic Acids Res 40(Database issue):D109–D114
- Pruitt KD, Tatusova T, Brown GR, Maglott DR (2012) NCBI Reference Sequences (RefSeq): current status, new features and genome annotation policy. Nucleic Acids Res 40:130–135
- NCBI Gene (http://www.ncbi.nlm.nih.gov/gene)
- Wikipedia (https://www.wikipedia.org/)
- Wolfram Alpha (http://www.wolframalpha.com/)
- Interactive Gene Expression Figures Associated with Obermoser, Presnell et al. (2013) (http://www.interactivefigures.com/dm3/miniURL/view/K7)
- Gartner’s 3 Vs of Big Data (http://www.gartner.com/newsroom/id/1731916)
- Chaussabel D, Baldwin N (2014) Democratizing systems immunology with modular transcriptional repertoire analyses. Nat Rev Immunol 14:271–280
- Zoubarev A, Hamer KM, Keshav KD, McCarthy EL, Santos JRC, Van Rossum T et al (2012) Gemma: a resource for the re-use, sharing and meta-analysis of expression profiling data. Bioinformatics 28(17):2272–2273
- Kilpinen S, Autio R, Ojala K, Iljin K, Bucher E, Sara H et al (2008) Systematic bioinformatic analysis of expression levels of 17,330 human genes across 9,783 samples from 175 types of healthy and pathological tissues. Genome Biol 9:R139
- Hruz T, Laule O, Szabo G, Wessendorp F, Bleuler S, Oertle L et al (2008) Genevestigator v3: a reference expression database for the meta-analysis of transcriptomes. Adv Bioinformatics 2008:420747
- Schmid PR, Palmer NP, Kohane IS, Berger B (2012) Making sense out of massive data by going beyond differential expression. Proc Natl Acad Sci USA 109:5594–5599
- Adler P, Kolde R, Kull M, Tkachenko A, Peterson H, Reimand J et al (2009) Mining for coexpression across hundreds of datasets using novel rank aggregation and visualization methods. Genome Biol 10:R139
- James RA, Rao MM, Chen ES, Goodell MA, Shaw CA (2012) The Hematopoietic Expression Viewer: expanding mobile apps as a scientific tool. Bioinformatics 28:1941–1942
- Siebert JC, Munsil W, Rosenberg-Hasson Y, Davis MM, Maecker HT (2012) The Stanford Data Miner: a novel approach for integrating and exploring heterogeneous immunological data. J Transl Med 10:62
- Kupershmidt I, Su QJ, Grewal A, Sundaresh S, Halperin I, Flynn J et al (2010) Ontology-based meta-analysis of global collections of high-throughput public data. PLoS One 5(9):e13066. doi:10.1371/journal.pone.0013066
- Interactive FACS Figure Associated with Obermoser, Presnell et al (2013) (http://www.interactivefigures.com/sdb/dataVisualizer/view?miniUrl=iwtil0)