Empirical study using network of semantically related associations in bridging the knowledge gap
© Abedi et al.; licensee BioMed Central Ltd. 2014
Received: 22 August 2014
Accepted: 11 November 2014
Published: 27 November 2014
Data overload has created a new set of challenges in finding meaningful and relevant information with minimal cognitive effort. However, designing robust and scalable knowledge discovery systems remains a challenge. Recent innovations in (biological) literature mining tools have opened new avenues for understanding the confluence of diseases, genes, risk factors and biological processes, bridging the gap between the massive amounts of scientific data and the harvesting of useful knowledge.
In this paper, we highlight some of the findings obtained using a text analytics tool called ARIANA (Adaptive Robust and Integrative Analysis for finding Novel Associations).
An empirical study using ARIANA reveals knowledge discovery instances that illustrate the efficacy of such a tool. For example, ARIANA can capture the connection between the drug hexamethonium and the pulmonary inflammation and fibrosis that caused the tragic death of a healthy volunteer in a 2001 Johns Hopkins asthma study, even though the abstract of the study was not part of the semantic model.
An integrated system, such as ARIANA, could assist the human expert in exploratory literature search by bringing forward hidden associations, promoting data reuse and knowledge discovery as well as stimulating interdisciplinary projects by connecting information across the disciplines.
Keywords: Knowledge discovery; Hypothesis generation; Literature mining; Ontology mapping; PubMed; Medical subject headings (MeSH); Multi-gram dictionary; Latent semantic analysis (LSA); Network of association; Semantic associations
ARIANA is a software system designed to capture “crisp semantic associations” among biomedical concepts of interest and to provide scalable Web services (Figure 2). It integrates semantic-sensitive analysis of text data through ontology mapping with database search and advanced visualization of the network of semantically related associations. The network is easy to collapse and expand, allowing the user to take a global view of the results or to focus on a sub-network. As an integrative tool, ARIANA aims to find the network of semantic associations that bridges the gap between the production and utilization of data, to disambiguate domain-specific entities, to provide robust results for a broad range of queries, and to deliver a scalable Web service using state-of-the-art technology.
Results and discussion
An empirical study using the improved ARIANA was performed to identify networks of associations for single as well as multiple query words. Representative findings are summarized below to illustrate the utility of such a system in discovering unknown interactions and in generating robust hypotheses by connecting information from interdisciplinary fields. However, to extract hidden knowledge for a single vital query, as in the case of the asthma study at Johns Hopkins, it is imperative not only to focus on the graph representation but also to extract the raw association scores and investigate entities with weaker levels of association. In essence, when there is no direct evidence in the literature, weaker yet positive associations tend to provide key indications for further in-depth investigation.
Case study on (lethal) drug interactions in designing experiments: In 2001, an asthma research team at the Johns Hopkins University administered the drug hexamethonium to a young healthy volunteer, who died from pulmonary inflammation and fibrosis. The Office for Human Research Protections of the US Department of Health and Human Services faulted the investigators for ignoring published information regarding the lung toxicity of the drug. In its internal investigation report, the committee noted: “The principal investigator subsequently stated to the investigation committee that he had performed a standard PubMed search”. The committee referred to a number of studies, in addition to one case report published in 1955, that had reported an association between hexamethonium and pulmonary fibrosis. In that case report, a 28-year-old woman died after receiving hexamethonium over a period of six months. Even after these two tragedies, the association between hexamethonium and pulmonary fibrosis or fibroma is still not evident from a keyword search in PubMed. The second tragedy was never published as a case report; nonetheless, the autopsy report and news broadcasts are available on the internet. This tragedy gained the media’s attention because it could have been prevented. In our test, ARIANA provides evidence for such associations. This knowledge was extracted even though the constructed core database contains only publications from 1960 to 2012, and therefore excludes the 1955 case report. Out of 2,545 concepts selected from MeSH, “Scleroderma, Systemic”, “Neoplasms, Fibrous Tissue”, “Pneumonia”, “Fibroma”, and “Pulmonary Fibrosis” were ranked 13th, 16th, 38th, 174th and 257th, respectively. Had the investigator had access to such a knowledge discovery tool, capable of identifying novel associations, he would likely have performed additional in-depth research before using this drug on a healthy subject.
A network view of the query hexamethonium indicates that the top seven associations are relevant; however, due to the nature of the investigation, we expect the weaker associations to provide key information worth further in-depth verification by experts.
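The emphasis on weaker yet positive associations can be illustrated with a minimal Python sketch. It is not ARIANA's implementation: the function names, the `top_k` cut-off and the scores below are hypothetical, chosen only to show how a ranked list might be split into the strongest associations and a weaker positive tail for expert review.

```python
from typing import Dict, List, Tuple

Ranked = Tuple[int, str, float]  # (rank, concept, score)


def rank_associations(scores: Dict[str, float]) -> List[Ranked]:
    """Sort concepts by descending association score; ranks start at 1."""
    ordered = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [(i + 1, concept, score) for i, (concept, score) in enumerate(ordered)]


def split_strong_weak(ranked: List[Ranked], top_k: int) -> Tuple[List[Ranked], List[Ranked]]:
    """Return the top_k entries and the remaining entries with a positive score."""
    strong = ranked[:top_k]
    weak = [entry for entry in ranked[top_k:] if entry[2] > 0.0]
    return strong, weak


# Hypothetical scores for a single query (illustrative values, not ARIANA output).
example_scores = {
    "Concept A": 0.62,
    "Concept B": 0.41,
    "Concept C": 0.08,
    "Concept D": 0.03,
    "Concept E": -0.02,
}

ranked = rank_associations(example_scores)
strong, weak = split_strong_weak(ranked, top_k=2)
for rank, concept, score in weak:
    print(f"rank {rank}: {concept} (score {score:.2f})")
```

In this sketch the weaker tail keeps every concept with a positive score below the cut-off, which mirrors the idea that such entries may merit in-depth verification by experts.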
Identifying the network of semantically related entities with a single or double query can uncover hidden knowledge and facilitate data reuse, among other things. Alzheimer’s disease (AD) is a debilitating disease of the nervous system, mostly affecting the older population. ARIANA captured some of the obvious associations, such as Tauopathies; Proteostasis Deficiencies; Amyloidosis; Cerebral Arterial Diseases; Multiple System Atrophy; and Agnosia. It also identified some of the less obvious associations, such as Tissue Inhibitor of Metalloproteinases. Using Tuberculosis (TB) as a second query, a common entity was found to be linked to both AD and TB: “Proteostasis Deficiencies > Amyloidosis” is highly related to TB (cosine score of 0.5651) and moderately related to AD (cosine score of 0.0734). Further investigation by an expert revealed that AD and TB could be indirectly related through members of the MMP (matrix metalloproteinase) gene family. MMPs are zinc-binding endopeptidases that degrade various components of the extracellular matrix. MMPs are believed to be implicated in TB through the concept of a matrix-degrading phenotype. Various studies in human cells and animal models, as well as gene profiling studies, support the association between MMPs and TB and the involvement of TB-driven lung matrix deconstruction. MMPs are also implicated in AD, but in a more positive way: MMP proteins can break down the amyloid proteins that are present in the brains of AD patients. There is literature evidence for the link between MMP genes and AD, and similarly between MMP genes and TB; however, the connection between AD and TB through the MMP genes is extracted by a global analysis of the literature.
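The two-query analysis can be sketched in the same spirit. The following Python fragment is an illustration under stated assumptions, not the authors' code: `shared_concepts` and the `threshold` value are hypothetical, and apart from the two cosine scores quoted above for “Proteostasis Deficiencies > Amyloidosis”, all numbers are made up.

```python
from typing import Dict, List, Tuple


def shared_concepts(
    scores_a: Dict[str, float],
    scores_b: Dict[str, float],
    threshold: float = 0.05,  # assumed cut-off, not taken from the paper
) -> List[Tuple[str, float, float]]:
    """Concepts whose score exceeds the threshold for both queries.

    Results are ordered by the weaker of the two scores, so a concept
    must be reasonably related to both queries to rank highly.
    """
    common = [
        (concept, scores_a[concept], scores_b[concept])
        for concept in scores_a.keys() & scores_b.keys()
        if scores_a[concept] > threshold and scores_b[concept] > threshold
    ]
    return sorted(common, key=lambda item: min(item[1], item[2]), reverse=True)


ad_scores = {
    "Proteostasis Deficiencies > Amyloidosis": 0.0734,  # quoted in the text
    "Hypothetical Concept X": 0.30,                     # illustrative
    "Hypothetical Concept Y": 0.01,                     # illustrative
}
tb_scores = {
    "Proteostasis Deficiencies > Amyloidosis": 0.5651,  # quoted in the text
    "Hypothetical Concept X": 0.02,                     # illustrative
    "Hypothetical Concept Z": 0.40,                     # illustrative
}

shared = shared_concepts(ad_scores, tb_scores)
for concept, ad, tb in shared:
    print(f"{concept}: AD={ad:.4f}, TB={tb:.4f}")
```

Lowering the threshold admits weaker shared associations, which may be worthwhile when no direct link between the two queries is documented.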
Finally, ARIANA can be used by experts to perform a global literature search using 17,074 different queries, which include diseases, risk factors, biomedical entities and biological processes. Two additional search results from the system are summarized: 1) Query term: CD4. The five top associated headings are i) cyclin D, ii) retinal pigments > opsins, iii) human immunodeficiency virus, iv) beta-endorphin, and v) alloys > steel. 2) Query term: Helicobacter pylori. Among the top associated headings are i) apolipoproteins B, ii) adrenergic alpha-agonists, iii) isonicotinic acids, iv) oral fistula, v) identification (psychology) > gender identity and vi) diabetes mellitus, type 1. All of the MeSH terms associated with these two queries have supporting evidence in the literature, even if some might seem unrelated at first. Exploring such associations, and even those at slightly weaker levels, could provide valuable opportunities for knowledge discovery and hypothesis generation.
ARIANA is an LSA-based technique that integrates ontology mapping and advanced visualization to provide a global view of the knowledge that is buried in the ocean of literature. ARIANA has many advantages, such as scalability, context specificity, robustness and language independence; however, the system also has some limitations. First, it is widely agreed that an LSA-based technique is computationally intensive because of its Singular Value Decomposition step. However, with higher computing power and the possibility of parallel computing, this limitation may soon be overcome. A second limitation of this method is its use of the bag-of-words model, in which the ordering of words is lost; ARIANA uses a multi-gram dictionary, which alleviates this problem to some extent while still providing scalability. Finally, a major difference between LSA-based techniques and part-of-speech tagging is LSA’s inability to provide a direct link to the specific publication that was the source of an identified association. We are currently working to address this specific limitation, a solution to which could also be very valuable to the broader field of computational science.
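To make the multi-gram idea concrete, here is a minimal Python sketch under stated assumptions: it uses greedy longest-match tokenization against a small, hypothetical dictionary, which may differ from how ARIANA builds and applies its own dictionary. It shows how a phrase such as "pulmonary fibrosis" can be counted as a single term before the bag-of-words step, so that some word-order information survives.

```python
from collections import Counter
from typing import List, Set, Tuple


def multigram_tokens(text: str, dictionary: Set[Tuple[str, ...]], max_n: int = 3) -> List[str]:
    """Greedy longest-match tokenization against a multi-gram dictionary.

    Words that match no dictionary entry are dropped, so only dictionary
    terms contribute to the resulting bag of terms.
    """
    words = [w.strip(".,;:()").lower() for w in text.split()]
    words = [w for w in words if w]
    tokens: List[str] = []
    i = 0
    while i < len(words):
        for n in range(min(max_n, len(words) - i), 0, -1):
            gram = tuple(words[i:i + n])
            if gram in dictionary:
                tokens.append(" ".join(gram))
                i += n
                break
        else:
            i += 1  # no dictionary entry starts at this word
    return tokens


# Hypothetical dictionary entries, for illustration only.
dictionary = {("hexamethonium",), ("fibrosis",), ("pulmonary", "fibrosis")}
text = "Hexamethonium was linked to pulmonary fibrosis; fibrosis was confirmed."

tokens = multigram_tokens(text, dictionary)
term_counts = Counter(tokens)
print(term_counts)
```

Because the longest match wins, "pulmonary fibrosis" and the standalone "fibrosis" are counted as distinct terms, which is one way a multi-gram dictionary can add context specificity to a bag-of-words representation.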
An array of text-analytics tools is being developed to solve specific problems in dealing with biomedical literature, which is growing at an unprecedented rate. Three main features distinguish this work from closely related work such as Bio-LDA: 1) modularity in terms of concept selection (from MeSH), 2) multi-gram dictionary construction (providing context specificity and enhanced semantics) and 3) scalability (50 years of literature from PubMed is analyzed). However, the system has its own limitations, as stated in the discussion; our group, along with others in the computational field, is actively working towards addressing these limitations.
Finally, a network of semantically related associations is critical to understanding the confluence of diseases, drugs, genes and risk factors. To be effective, such a tool must be efficient, robust, scalable and usable in finding meaningful information beyond literature mining. Features such as disambiguation of domain-specific entities, flexibility in visualization, breadth of coverage, robustness in modeling and scalability in providing an array of Web services make ARIANA an important tool for bridging the gap between data and knowledge.
Software is available with a properly executed end-user license agreement (EULA) at http://www.ARIANAmed.org (endnote a).
a. Requests for an account should be made to VA (email@example.com) or to MY (firstname.lastname@example.org).
b. The list of 2,545 hierarchically-structured Headings used in the model is available upon request.
c. The multi-gram dictionary used in the study is available upon request.
ARIANA: Adaptive robust and integrative analysis for finding novel associations
LSA: Latent semantic analysis
MeSH: Medical subject headings
OMIM: Online Mendelian inheritance in man
POLSA: Parameter optimized latent semantic analysis
ROM: Reverse ontology mapping
This work was partially supported by NSF grant NSF-IIS-0746790, the Herff College of Engineering and the Bioinformatics Program at the University of Memphis. The authors also acknowledge Virginia Tech’s Open Access Subvention Fund and the University of Tennessee Health Science Center (UTHSC) for covering the publication cost.
- PubMed. [http://www.ncbi.nlm.nih.gov/pubmed]
- Rzhetsky A, Seringhaus M, Gerstein M: Seeking a new biology through text mining. Cell. 2008, 134: 9-13. 10.1016/j.cell.2008.06.029.
- Wei C-H, Harris BR, Li D, Berardini TZ, Huala E, Kao H-Y, Lu Z: Accelerating literature curation with text-mining tools: a case study of using PubTator to curate genes in PubMed abstracts. Database (Oxford). 2012, 2012: bas041. 10.1093/database/bas041.
- Wang H, Ding Y, Tang J, Dong X, He B, Qiu J, Wild DJ: Finding complex biological relationships in recent PubMed articles using Bio-LDA. PLoS One. 2011, 6: e17243. 10.1371/journal.pone.0017243.
- Abedi V, Zand R, Yeasin M, Faisal FE: An automated framework for hypotheses generation using literature. BioData Min. 2012, 5: 13. 10.1186/1756-0381-5-13.
- Lu Z: PubMed and beyond: a survey of web tools for searching biomedical literature. Database (Oxford). 2011, 2011: baq036. 10.1093/database/baq036.
- Chen H, Martin B, Daimon CM, Maudsley S: Effective use of latent semantic indexing and computational linguistics in biological and biomedical applications. Front Physiol. 2013, 4: 8.
- Landauer TK, Laham D, Derr M: From paragraph to graph: latent semantic analysis for information visualization. PNAS. 2004, 101: 5214-5219. 10.1073/pnas.0400341101.
- Abedi V, Yeasin M, Zand R: ARIANA: adaptive robust and integrative analysis for finding novel associations. 2014 Int Conf Adv Big Data Anal. 2014, CSREA Press, Las Vegas, NV.
- Medical Subject Headings. [http://www.ncbi.nlm.nih.gov/mesh]
- Online Mendelian Inheritance in Man. [http://omim.org/]
- Yeasin M, Malempati H, Homayouni R, Sorower M: A systematic study on latent semantic analysis model parameters for mining biomedical literature. BMC Bioinformatics. 2009, 10 (Suppl 7): A6. 10.1186/1471-2105-10-S7-A6.
- Internal Investigative Committee Membership: Report of Internal Investigation into the Death of a Volunteer Research Subject. 2001, [http://www.hopkinsmedicine.org/press/2001/july/report_of_internal_investigation.htm]
- Robillard R, Riopelle JL, Adamkiewicz L, Tremblay G, Genest J: Pulmonary complications during treatment with hexamethonium. Can Med Assoc J. 1955, 72: 448-451.
- Wollmer MA, Papassotiropoulos A, Streffer JR, Grimaldi LME, Kapaki E, Salani G, Paraskevas GP, Maddalena A, de Quervain D, Bieber C, Umbricht D, Lemke U, Bosshardt S, Degonda N, Henke K, Hegi T, Jung HH, Pasch T, Hock C, Nitsch RM: Genetic polymorphisms and cerebrospinal fluid levels of tissue inhibitor of metalloproteinases 1 in sporadic Alzheimer’s disease. Psychiatr Genet. 2002, 12: 155-160. 10.1097/00041444-200209000-00006.
- Ridnour LA, Dhanapal S, Hoos M, Wilson J, Lee J, Cheng RYS, Brueggemann EE, Hines HB, Wilcock DM, Vitek MP, Wink DA, Colton CA: Nitric oxide-mediated regulation of β-amyloid clearance via alterations of MMP-9/TIMP-1. J Neurochem. 2012, 123: 736-749. 10.1111/jnc.12028.
- Brinckerhoff CE, Matrisian LM: Matrix metalloproteinases: a tail of a frog that became a prince. Nat Rev Mol Cell Biol. 2002, 3: 207-214. 10.1038/nrm763.
- Davidson JM: Biochemistry and turnover of lung interstitium. Eur Respir J Off J Eur Soc Clin Respir Physiol. 1990, 3: 1048-1063.
- Elkington PT, Ugarte-Gil CA, Friedland JS: Matrix metalloproteinases in tuberculosis. Eur Respir J Off J Eur Soc Clin Respir Physiol. 2011, 38: 456-464.
- Thuong NTT, Dunstan SJ, Chau TTH, Thorsson V, Simmons CP, Quyen NTH, Thwaites GE, Lan NTN, Hibberd M, Teo YY, Seielstad M, Aderem A, Farrar JJ, Hawn TR: Identification of tuberculosis susceptibility genes with human macrophage gene expression profiles. PLoS Pathog. 2008, 4 (12): e1000229. 10.1371/journal.ppat.1000229.
- Mehra S, Pahar B, Dutta NK, Conerly CN, Philippi-Falkenstein K, Alvarez X, Kaushal D: Transcriptional reprogramming in nonhuman primate (Rhesus Macaque) tuberculosis granulomas. PLoS One. 2010, 5 (8): e12266. 10.1371/journal.pone.0012266.
- Russell DG, VanderVen BC, Lee W, Abramovitch RB, Kim M, Homolka S, Niemann S, Rohde KH: Mycobacterium tuberculosis wears what it eats. Cell Host Microbe. 2010, 8: 68-76. 10.1016/j.chom.2010.06.002.
- Berry MPR, Graham CM, McNab FW, Xu Z, Bloch SAA, Oni T, Wilkinson KA, Banchereau R, Skinner J, Wilkinson RJ, Quinn C, Blankenship D, Dhawan R, Cush JJ, Mejias A, Ramilo O, Kon OM, Pascual V, Banchereau J, Chaussabel D, O’Garra A: An interferon-inducible neutrophil-driven blood transcriptional signature in human tuberculosis. Nature. 2010, 466: 973-977. 10.1038/nature09247.
- Van der Sar AM, Spaink HP, Zakrzewska A, Bitter W, Meijer AH: Specificity of the zebrafish host transcriptome response to acute and chronic mycobacterial infection and the role of innate and adaptive immune components. Mol Immunol. 2009, 46: 2317-2332. 10.1016/j.molimm.2009.03.024.
- Yong VW, Krekoski CA, Forsyth PA, Bell R, Edwards DR: Matrix metalloproteinases and diseases of the CNS. Trends Neurosci. 1998, 21: 75-80. 10.1016/S0166-2236(97)01169-7.
- Yan P, Hu X, Song H, Yin K, Bateman RJ, Cirrito JR, Xiao Q, Hsu FF, Turk JW, Xu J, Hsu CY, Holtzman DM, Lee J-M: Matrix metalloproteinase-9 degrades amyloid-beta fibrils in vitro and compact plaques in situ. J Biol Chem. 2006, 281: 24566-24574. 10.1074/jbc.M602440200.
- Rusu C, Dumitrescu B: Stagewise K-SVD to design efficient dictionaries for sparse representations. IEEE Signal Process Lett. 2012, 19: 631-634. 10.1109/LSP.2012.2209871.
- Yaguang D, Guofeng Z, Chenyang C, Jian Z, Liang T: A parallel implementation of singular value decomposition based on map-reduce and PARPACK. Proc 2011 Int Conf Comput Sci Netw Technol. Volume 2. 2011, IEEE, Harbin, China, 739-741. 10.1109/ICCSNT.2011.6182070.
- Liang Z, Li W, Li Y: A parallel probabilistic latent semantic analysis method on MapReduce platform. 2013 IEEE Int Conf Inf Autom. 2013, IEEE, Yinchuan, China, 1017-1022.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.