Effective knowledge management in translational medicine
© Szalma et al. 2010
Received: 6 April 2010
Accepted: 19 July 2010
Published: 19 July 2010
The growing consensus that the most valuable data source for biomedical discoveries is derived from human samples is clearly reflected in the growing number of translational medicine and translational science departments across pharma, as well as in academic and government-supported initiatives such as the Clinical and Translational Science Awards (CTSA) in the US and the Seventh Framework Programme (FP7) of the EU, with their emphasis on translating research for human health.
The pharmaceutical companies of Johnson & Johnson have established translational and biomarker departments and implemented an effective knowledge management framework, including building a data warehouse and the associated data mining applications. The implemented resource is built from open source systems such as i2b2 and GenePattern.
The system has been deployed across multiple therapeutic areas within the pharmaceutical companies of Johnson & Johnson and is being actively used to integrate and mine internal and public data to support drug discovery and development decisions, such as indication selection and trial design, in a translational medicine setting. Our results show that the established system allows scientists to quickly re-validate hypotheses or generate new ones with the use of an intuitive graphical interface.
The implemented resource can serve as the basis of precompetitive sharing and mining of studies involving samples from human subjects thus enhancing our understanding of human biology and pathophysiology and ultimately leading to more effective treatment of diseases which represent unmet medical needs.
The effective management of knowledge in a translational research setting [1, 2] is a major challenge and opportunity for pharmaceutical research and development companies. The wealth of data generated in experimental medicine studies and clinical trials can inform the quest for next-generation drugs, but only if all the data generated during those studies are appropriately collected, managed and shared. Some notable successes have already been achieved.
Merck has developed a system which enables sharing of human subject data in oncology trials with the Moffitt Cancer Center and Research Institute. This system is built from proprietary and commercial components such as the Microsoft BizTalk business process server, Tibco and the Biofortis LabMatrix application, and does not address any data sharing issues outside of the two institutions.
There is a growing set of data being deposited in NCBI GEO, EBI ArrayExpress, the Stanford Microarray Database and the caGRID infrastructure which is derived from gene expression experiments on tissue samples collected in clinical settings. Many of those are from either drug discovery or biomarker discovery projects. In particular, Johnson & Johnson, through its subsidiaries, has contributed such data sets to GEO and ArrayExpress.
These databases enforce standards for some of the elements of the experimental metadata. In general, however, the phenotype annotations in the metadata are not required to follow standard dictionaries or vocabularies. That can cause considerable problems, as was recently demonstrated and as described in the following example. These databases allow bioinformaticians to download the normalized data and carry out further analysis. The typical setting for such analyses is that the scientist poses some hypotheses with respect to the phenotype, and the informatician then needs to discern those phenotypes from the semi-structured data and correlate them with genotype in a sub-optimal process. In some cases the decoding and interpretation of the different phenotypes can lead to serious mistakes, such as the recently discovered case in which multiple publications interpreted normal samples as cancer samples, leading to erroneous conclusions.
The computational experiments can lead to validation of the primary findings or to novel discoveries, such as in the case of meta-analysis of multiple datasets. The burden of deconvoluting the phenotypes from source files downloaded from these primary sources and coding them in a standard format to enable large-scale meta-analyses makes these types of discoveries very costly and in fact quite rare [10–13].
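The fragility of decoding free-text phenotype annotations can be illustrated with a minimal sketch. The annotation strings and the synonym map below are invented for illustration and do not come from any real submission; real metadata is far more varied, which is exactly why such mappings break.

```python
# Hypothetical sketch: coding free-text phenotype annotations (as they
# might appear in GEO-style "characteristics" fields) against a minimal
# controlled vocabulary. All strings here are illustrative assumptions.

RAW_PHENOTYPES = [
    "tissue: Tumor",
    "disease state: carcinoma",
    "sample type: adjacent normal",
    "tissue: NAT (normal adjacent tissue)",
]

# A tiny synonym map standing in for a standard dictionary
SYNONYMS = {
    "tumor": "TUMOR",
    "carcinoma": "TUMOR",
    "normal": "NORMAL",
    "nat": "NORMAL",
}

def normalize(annotation: str) -> str:
    """Map a free-text annotation to a controlled term, or UNKNOWN."""
    text = annotation.split(":", 1)[-1].lower()
    for synonym, term in SYNONYMS.items():
        if synonym in text:
            return term
    return "UNKNOWN"

coded = [normalize(a) for a in RAW_PHENOTYPES]
# "adjacent normal" and "NAT" decode correctly only because the synonym
# list anticipates them -- any unanticipated spelling falls through to
# UNKNOWN, or worse, to the wrong class.
```

Requiring submitters to use standard vocabularies up front would make such fragile post-hoc decoding unnecessary.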
Data curation is a way to tackle some of these issues. Typically, derived databases of omics experiments are curated to create comparisons for specialized mining with specific questions in mind. For example, there are multiple resources being developed to integrate and analyze gene expression and other omic data and create contrasts (A vs. B comparisons) or signatures [14, 15]. The limitation of these resources is that they strive to answer specific questions and thus limit in-depth exploration of the data.
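An "A vs. B" contrast of the kind these curated resources precompute can be sketched in a few lines; the gene names and expression values below are fabricated purely for illustration.

```python
# Illustrative sketch of an A vs. B contrast (signature) computed from a
# toy log2-scale expression matrix. Genes and values are fabricated.

# rows: genes; columns: samples (first two = condition A, last two = B)
expression = {
    "GENE1": [8.0, 8.2, 4.0, 4.2],
    "GENE2": [5.0, 5.1, 5.0, 4.9],
}
group_a = slice(0, 2)
group_b = slice(2, 4)

def log2_fold_change(values):
    """Difference of group means; values are already on the log2 scale."""
    mean_a = sum(values[group_a]) / 2
    mean_b = sum(values[group_b]) / 2
    return mean_a - mean_b

contrast = {gene: log2_fold_change(v) for gene, v in expression.items()}
# GENE1 is about 4 log2 units higher in A; GENE2 is essentially unchanged
```

Precomputing such contrasts answers the question they were built for, but discards the subject-level data needed for any deeper, ad hoc exploration.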
The treasure trove of high-content data derived from human samples could be mined much more effectively if standard dictionaries were applied to all these studies and each subject's clinical data and the associated sample's genomics data were stored and analyzed through a system which enables efficient access and mining. An example of such a standardized infrastructure and its potential for pre-competitive sharing is presented below.
Johnson & Johnson has invested in translational research by establishing within its pharmaceutical R&D division a set of translational and biomarker groups and focusing also on the management and mining of the data emanating from integrative settings crossing the drug discovery and development stages. One of the deliverables of this enhanced governance structure was the development of a translational medicine informatics infrastructure. This infrastructure is a combination of dedicated people, robust processes and informatics solution - tranSMART.
We have established a strong cooperation across the R&D of the pharmaceutical companies of Johnson & Johnson and open innovation partnerships with the Cancer Institute of New Jersey and St. Jude Children's Research Hospital. The R&D Informatics and IT group works in close collaboration with discovery biologists, pharmacologists, translational and biomarker scientists, clinicians and compound development team leaders, with the goal of developing a system which enables democratic access to all the data generated during target validation, biomarker discovery, mechanism of action, preclinical and translational studies and clinical development.
An important aspect of successfully introducing a paradigm shift within a large pharmaceutical organization is change management. From the start we have recruited biologists, pharmacologists and physicians from various therapeutic areas not only to help champion the adoption of the newly developed translational infrastructure but also to guide us through the development of the application in an agile environment.
Data is stored in an Oracle 11 database which is fully auditable. In contrast to the strategy followed by Merck, we selected a set of open-source components to assemble the application. A user interface providing a biological concept search of analyzed data sets and an i2b2-based comparison engine for subject-level clinical data were constructed in Java using Grails. Advanced pipelines were established for launching analytical workflows of gene and protein expression and SNP data between cohorts to present comparisons in GenePattern and Haploview. The solution is hosted on Amazon Elastic Compute Cloud and a strict security policy is enforced. Authentication as well as a role-based authorization model are implemented throughout the application so that study-level permissions are enabled.
Clinical trial and experimental medicine study data sets were transformed by curators and ETL (Extract, Transform, Load) developers into an i2b2 EAV (Entity-Attribute-Value) structure, and a standardized ontology based on CDISC SDTM (Clinical Data Interchange Standards Consortium - Study Data Tabulation Model) was applied. Currently, the system contains a growing number of curated internal studies - 30 at the time of writing - across the immunology, oncology, cardiovascular and CNS therapeutic areas.
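The wide-to-EAV transformation performed by the ETL step can be sketched as follows. The column names and concept paths are illustrative assumptions loosely modeled on an SDTM-style hierarchy, not the actual tranSMART schema.

```python
# Minimal sketch of a wide-to-EAV (Entity-Attribute-Value) transform of
# the kind curators and ETL developers apply before loading into i2b2.
# Column names and concept paths are hypothetical.

wide_row = {"SUBJECT_ID": "S001", "AGE": 54, "SEX": "F", "DIAGNOSIS": "RA"}

# Each clinical column maps to a node in a standardized ontology tree
CONCEPT_PATHS = {
    "AGE": "\\Demographics\\Age\\",
    "SEX": "\\Demographics\\Sex\\",
    "DIAGNOSIS": "\\Diagnosis\\",
}

def to_eav(row):
    """Yield (entity, attribute, value) facts for each non-key column."""
    subject = row["SUBJECT_ID"]
    for column, value in row.items():
        if column == "SUBJECT_ID":
            continue
        yield (subject, CONCEPT_PATHS[column], value)

facts = list(to_eav(wide_row))
# One fact row per clinical observation, keyed by subject and concept path
```

Because every observation becomes a (subject, concept, value) fact against a shared ontology, new studies can be added without schema changes, and cohorts can be defined by querying concept paths across studies.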
The tranSMART system allows clinicians, translational scientists and discovery biologists to interrogate aligned phenotype/genotype data to enable better clinical trial design or to stratify disease into molecular subtypes with great efficiency. Initial successes of applying this system point towards the high value of translational data in proposing indications for drugs with new mechanisms of action [J. Smart, personal communication] and in selecting biomarkers for stratified medicine.
The system has been in wide use across multiple therapeutic areas within the pharmaceutical companies of Johnson & Johnson. Comparing biological processes and pathways between multiple data sets from related diseases, or even across multiple therapeutic areas, is an important benefit of such a system. Through the examples presented above we have shown that the tranSMART system allows scientists to quickly re-validate hypotheses or generate new ones with the use of an intuitive graphical user interface. The use cases supported by tranSMART have been developed in close collaboration with key users, and the solution was built from many open source systems, making adoption of the system straightforward.
We have implemented a fine-grained, role-based authorization model throughout the application so that study level permissions are enabled and can be controlled by the study owners. During curation the study owners are actively involved in reviewing and approving the loading and standardization of the data from their studies. This approach greatly enhanced the cooperation of the study owners and the ultimate success of the data warehouse.
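A study-level, role-based authorization check of the kind described above can be sketched as follows; the users, study identifiers and action names are invented for illustration and do not reflect tranSMART's actual permission model.

```python
# Hedged sketch of study-level, role-based authorization: each (user,
# study) pair carries an explicit set of granted actions, controlled by
# the study owner. All identifiers here are hypothetical.

GRANTS = {
    ("alice", "STUDY-001"): {"view", "export"},
    ("bob", "STUDY-001"): {"view"},
}

def is_authorized(user: str, study: str, action: str) -> bool:
    """Allow only actions explicitly granted to the user on that study."""
    return action in GRANTS.get((user, study), set())

# A user with no grant on a study is denied by default
```

Denying by default and scoping every grant to a single study keeps data access under the study owner's control, which is what made owners willing to contribute their studies to the warehouse.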
A well-constructed system can enable scientists not only to test existing hypotheses but also to generate new ones using well-curated, high-content translational medicine data, leading to a deeper understanding of various biological processes and eventually helping to develop better treatment options.
Active curation and enterprise data governance have proven to be critical aspects of success. Two factors turned out to be crucial: the capability of the system to query both internal and public data, and the full organizational alignment ensured during development and implementation.
Because a large part of tranSMART is built from open source systems, it is much more amenable to being shared with academic institutions in a pre-competitive setting, enabling collaborations aimed at developing a deeper understanding of disease biology.
The tranSMART system is under active development, including active curation of additional studies, implementation of new modalities and addition of novel workflows. Future development may include a connection to the internal biobank. With the established system and the robust processes, our research and development organization can now effectively manage not just the complex and multimodal data but can also unlock the potential of those data by transforming them into actionable knowledge.
We thank Daniel Housman, Jinlei Liu and Joseph Adler from Recombinant Data Corporation for their work in implementing the system. We are also thankful to the reviewers for their helpful suggestions.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.