As stated above, the development of genomic technologies in the post-genomic era has resulted in data generation on an immense scale. However, it is not only the large volumes of DNA sequence data generated by large-scale projects (e.g., 1000 Genomes and ENCODE) that pose challenges for computing infrastructures. Technological advances have opened new windows into genomics beyond the DNA sequence. Examples of the types of data that can be generated today from inside the cell include DNA methylation, SNPs, CNVs, protein-coding RNA, non-coding RNA, splice variants, histone modifications, nucleosome positions, transcription factors and their DNA binding sites, transcription start sites, promoters, protein-protein interactions, protein localization, protein modifications (of which there are many), DNA-binding proteins, and metabolites. In addition, clinical practice and healthcare have produced large amounts of data describing diseases, medications, environmental factors, and lifestyle-related information. Clinical data stored in electronic medical records are highly restricted and are managed differently from research data, which are commonly shared and made available through public repositories or scientific journals. The lack of appropriate and usable computing infrastructure limits the utilization of these data sources, whose research potential is much greater than is currently realized. In particular, there is an urgent need for computing resources that connect molecular and healthcare data. Current challenges to be addressed are secure and easy access to biomedical databases, patient data protection, data sharing, and database integration. The current lack of methods and systems to bridge the gap between research and clinical information constitutes a major roadblock for translational research and for the benefit of healthcare.
The current research addresses the above challenges and provides an informatics platform for modeling and integrating multiple data sources: the research data at the Rheumatology Research Laboratory at the Center for Molecular Medicine (CMM), Karolinska Institute (KI), on the one hand, and the clinical data at the Rheumatology Clinic at Karolinska University Hospital on the other. The data sources at CMM are the rheumatoid arthritis (RA) biobank (serum, EDTA-plasma, and DNA), the cell registry (PBMC, SFMC, etc.), genotype variants, and a serology database for a cohort of 379 patients diagnosed with RA (defined by the ACR 1987 or the later ACR/EULAR 2010 criteria).
The cohort presents three profiles:
Genotypes of 65 SNPs, all predisposing to RA either directly or in interaction with HLA.
Detection of anti-CCP antibodies, IgG antibodies against citrullinated alpha-enolase peptide-1 (CEP-1) and citrullinated type-II collagen (citC1III), and IgG antibodies against citrullinated vimentin.
At the Rheumatology Clinic, data about disease duration, treatment, disease activity, and specification of the disease are stored in the Swedish Rheumatology Quality Register (SRQ). A translational medicine platform that integrates all of these data sources is the key to making research more translational, and it will also empower current research to find predictive markers, such as immunological phenotypes.
RA is a common, chronic, debilitating inflammatory disease that primarily affects the synovial joints but may also affect other tissues and organs. For patients, quality of life and the possibility of maintaining employment are significantly affected. The lifetime risk of developing RA in Sweden is around 2%, and despite the use of new, improved therapies, the rate of sick leave in early RA is still close to 50%. Risk factors for developing RA have been mapped to both genetic and environmental factors, with the Human Leukocyte Antigen (HLA) region and cigarette smoking conferring the strongest risk. The HLA association is tightly linked to the emergence of a set of autoantibodies denoted ACPA (anti-citrullinated protein antibodies), which today are used to subcategorize the disease. Hence, immunological studies aimed at increasing our understanding of disease initiation and perpetuation need to take into account both the genetic and the serological profile of the included patient material.
i2b2 and STRIDE: community-driven software solutions
A number of technology platforms are available to manage biomedical data in translational research. Some of them, developed by research communities at universities and research institutes, are released as open source under the General Public License (GPL). One of the most commonly used platforms is Informatics for Integrating Biology and the Bedside (i2b2). The i2b2 platform is funded by the National Institutes of Health (NIH). i2b2 uses the International Classification of Diseases (ICD) as a taxonomic standard to classify diseases, and it enables the creation of formal ontologies to meet the specific requirements of different research studies.
The design of i2b2 provides a software platform and scalable solutions that facilitate the repurposing of clinical data for the research setting and secure the access to and management of patient information for research purposes. i2b2 is implemented as a set of software cells orchestrated in a hive architecture; the cells communicate via web services in a Service-Oriented Architecture (SOA) environment. This kind of architecture provides secure communication based on Simple Object Access Protocol (SOAP) messages. The design of i2b2 paid particular attention to query and data-retrieval performance. Two predefined test cases were supported by i2b2, as mentioned in [12
Explore patient data to find sets of patients that would be of interest for further research, and
Make use of the detailed data provided by the Electronic Medical Record (EMR) to discover different phenotypes within the set of patients identified in the first test case, in support of genomic, outcome, and environmental research.
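The two test cases can be illustrated with a small, self-contained sketch. It does not use the i2b2 API or the i2b2 data model; the record layout, the concept codes, and the helper functions `find_patient_set` and `phenotype_summary` are hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical, simplified EMR-style observations (NOT the i2b2 schema).
# Each observation links a patient to a coded concept and an optional value.
OBSERVATIONS = [
    {"patient_id": 1, "concept": "diagnosis:RA", "value": None},
    {"patient_id": 1, "concept": "medication:methotrexate", "value": None},
    {"patient_id": 1, "concept": "lab:anti-CCP", "value": "positive"},
    {"patient_id": 2, "concept": "diagnosis:RA", "value": None},
    {"patient_id": 2, "concept": "lab:anti-CCP", "value": "negative"},
    {"patient_id": 3, "concept": "diagnosis:osteoarthritis", "value": None},
    {"patient_id": 3, "concept": "medication:methotrexate", "value": None},
]


def find_patient_set(observations, required_concepts):
    """Test case 1: return the IDs of patients who have every required concept."""
    concepts_by_patient = {}
    for obs in observations:
        concepts_by_patient.setdefault(obs["patient_id"], set()).add(obs["concept"])
    required = set(required_concepts)
    return {
        pid for pid, concepts in concepts_by_patient.items() if required <= concepts
    }


def phenotype_summary(observations, patient_set, concept):
    """Test case 2: count the recorded values of one concept within a patient set."""
    return Counter(
        obs["value"]
        for obs in observations
        if obs["patient_id"] in patient_set and obs["concept"] == concept
    )


# Step 1: identify a set of patients of interest.
ra_patients = find_patient_set(OBSERVATIONS, ["diagnosis:RA"])
# Step 2: use the detailed records to describe a phenotype within that set.
anti_ccp = phenotype_summary(OBSERVATIONS, ra_patients, "lab:anti-CCP")

print(sorted(ra_patients))
print(dict(anti_ccp))
```

The second function only looks at patients returned by the first, which mirrors the dependency between the two test cases: the patient set is defined once and then reused for more detailed phenotyping.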
Based on the Health Level 7 (HL7) data model, the Stanford Translational Research Integrated Database Environment (STRIDE) is an integrated, standards-based translational research informatics platform. It provides a number of functionalities required in translational research. The basic building blocks of STRIDE are a clinical data warehouse based on the HL7 Reference Information Model (RIM), an application development framework for building research data management applications on the STRIDE platform, and a biospecimen data management system.
In addition to EMR data, STRIDE provides biobank data management. Similar to i2b2, STRIDE uses ICD and other standards, such as the Systematized Nomenclature of Medicine Clinical Terms (SNOMED), to build the semantic model that represents biomedical concepts and the different types of relationships between them. The STRIDE data warehouse is built on Oracle 11g and is organized into three logically clustered databases: the clinical data warehouse, research data management, and biobanks. The schemas are based on an Entity-Attribute-Value (EAV) model and object-oriented data structures derived from HL7. The software components of STRIDE communicate via a set of web services in a service-oriented architecture (SOA) platform. Through the semantic layer, STRIDE supports standards-based data entry, data integration, data retrieval, and data interoperability.
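To make the Entity-Attribute-Value idea concrete, the following is a minimal sketch of a generic EAV table using an in-memory SQLite database. It is not the STRIDE schema, and it does not use Oracle; the table name `observation`, the attribute names (`diagnosis`, `anti_ccp`, `das28`), and the helper `pivot_entity` are assumptions made for illustration only.

```python
import sqlite3

# In-memory database, used only for this illustration.
conn = sqlite3.connect(":memory:")

# One generic table holds all facts as (entity, attribute, value) rows,
# so new attributes can be recorded without changing the schema.
conn.execute(
    """
    CREATE TABLE observation (
        entity_id INTEGER NOT NULL,   -- the entity, e.g. a patient
        attribute TEXT    NOT NULL,   -- the attribute name, e.g. a coded concept
        value     TEXT                -- the value, stored as text in this sketch
    )
    """
)

rows = [
    (1, "diagnosis", "RA"),
    (1, "anti_ccp", "positive"),
    (1, "das28", "4.1"),
    (2, "diagnosis", "RA"),
    (2, "anti_ccp", "negative"),
]
conn.executemany("INSERT INTO observation VALUES (?, ?, ?)", rows)


def pivot_entity(connection, entity_id):
    """Collect all attribute/value rows of one entity into a dictionary."""
    cursor = connection.execute(
        "SELECT attribute, value FROM observation WHERE entity_id = ?",
        (entity_id,),
    )
    return dict(cursor.fetchall())


# Combining two attributes of the same entity requires a self-join,
# which is the typical cost of the EAV layout.
query = """
SELECT d.entity_id
FROM observation AS d
JOIN observation AS s ON s.entity_id = d.entity_id
WHERE d.attribute = 'diagnosis' AND d.value = 'RA'
  AND s.attribute = 'anti_ccp'  AND s.value = 'positive'
ORDER BY d.entity_id
"""
positive_ra = [row[0] for row in conn.execute(query)]

print(pivot_entity(conn, 1))
print(positive_ra)
```

The trade-off shown here is the usual one for EAV designs: heterogeneous and sparse attributes are easy to store, while queries that combine several attributes need one join (or a pivot step) per attribute.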
Translational informatics challenges and solutions
Due to the different storage strategies for patient data and the explosion in data volume, translational informatics faces a significant challenge in database integration. At the research level, the increasing pace of molecular data production through high-throughput technologies creates a great demand for data management (storage, transfer, retrieval, processing, and interpretation). At the healthcare level, patient data are becoming more complicated to manage because patient records are stored both in EMRs and in quality-of-care registries for different diseases. Re-use of clinical data in the research setting brings further data management challenges. Data management includes not only storage of the data but also access restrictions and control. Researchers need to perform queries across research data sources (patient biosamples, genetics, serology, etc.) and clinical data (diagnoses, medications, disease activity, lifestyle) from healthcare facilities. Our approach was to collect and define the requirements of end users (biomedical and bioinformatics researchers) for the study of the etiology, pathogenesis, disease course, co-morbidities, and therapies of RA. We matched these requirements against the current solutions and used engineering methods to implement the system at the CMM. By selecting and implementing the CDC from Oracle™ (see the Methods section), we achieved our objectives and satisfied the end-user requirements.
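The kind of cross-source query described above, combined with simple access control, can be sketched as follows. This is not the implemented system and does not reflect any Oracle product; the patient identifiers, field names, roles, and the functions `integrated_records` and `query` are all hypothetical, and a shared patient identifier across sources is assumed.

```python
# Hypothetical research-side and clinical-side records, keyed by an assumed
# shared patient identifier. All values are made up for illustration.
SOURCES = {
    "research": {
        "P001": {"acpa": "positive", "samples": ["serum", "DNA"]},
        "P002": {"acpa": "negative", "samples": ["EDTA-plasma"]},
        "P003": {"acpa": "positive", "samples": ["DNA"]},
    },
    "clinical": {
        "P001": {"medication": "methotrexate", "disease_duration_years": 2},
        "P002": {"medication": "methotrexate", "disease_duration_years": 7},
        "P004": {"medication": "sulfasalazine", "disease_duration_years": 1},
    },
}

# Hypothetical access rules: which role may read which source.
PERMISSIONS = {
    "researcher": {"research", "clinical"},
    "lab_technician": {"research"},
}


def integrated_records(role):
    """Merge per-patient records from the sources the given role may read."""
    allowed = PERMISSIONS.get(role, set())
    sources = {name: data for name, data in SOURCES.items() if name in allowed}
    if not sources:
        raise PermissionError(f"role {role!r} has no access to any source")
    # Keep only patients present in every accessible source (an inner join).
    shared_ids = set.intersection(*(set(data) for data in sources.values()))
    merged = {}
    for pid in sorted(shared_ids):
        record = {}
        for data in sources.values():
            record.update(data[pid])
        merged[pid] = record
    return merged


def query(role, predicate):
    """Return the IDs of patients whose merged record satisfies the predicate."""
    return [pid for pid, rec in integrated_records(role).items() if predicate(rec)]


# Example: ACPA-positive patients treated with methotrexate.
hits = query(
    "researcher",
    lambda rec: rec["acpa"] == "positive" and rec["medication"] == "methotrexate",
)
print(hits)
```

In this sketch the access check is applied before the sources are merged, so a role without clinical access never sees clinical fields; a real deployment would also need auditing and patient data protection measures that are outside the scope of this example.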