intelligent services
Motivation: The World Wide Web has profoundly changed the way in which we access information. Searching the internet is easy and fast, but more importantly, the interconnection of related contents makes it intuitive and closer to the associative organization of human memory. However, the information retrieval tools currently available to researchers in biology and medicine lag far behind the possibilities that the layman has come to expect from the internet.
Results: By using genes and proteins as hyperlinks between sentences and abstracts, the information in PubMed can be converted into one navigable resource. iHOP (Information Hyperlinked over Proteins) is an online service that provides this gene-guided network as a natural way of accessing millions of PubMed abstracts and brings all the advantages of the internet to scientific literature research. Navigating across interrelated sentences within this network is closer to human intuition than the use of conventional keyword searches and allows for stepwise and controlled acquisition of information. Moreover, this literature network can be superimposed upon experimental interaction data to facilitate the simultaneous analysis of novel and existing knowledge. The network presented in iHOP currently contains 5 million sentences and 40 000 genes from Homo sapiens, Mus musculus, Drosophila melanogaster, Caenorhabditis elegans, Danio rerio, Arabidopsis thaliana, Saccharomyces cerevisiae and Escherichia coli.
Technical information
Underlying data
In a preprocessing step that runs before the web application is built, genes, proteins, and MeSH terms (a biomedical thesaurus) are identified in about 12 million biomedical abstracts from PubMed. Of a total of 200 000 genes, about 30 000 were identified in 2 million abstracts.
Starting from this index, one XML document was created for each abstract, one for each gene, and two different XML documents for each gene found in the literature. The total number of distinct XML documents is thus around 2.3 million.
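The identification step can be pictured as a dictionary lookup over sentences. The following is only a minimal sketch under stated assumptions, not the iHOP pipeline itself: the synonym dictionary, the gene and abstract identifiers, and the naive sentence splitter are all hypothetical, and real gene-name recognition must cope with ambiguity that this toy version ignores.

```python
import re
from collections import defaultdict

# Hypothetical synonym dictionary: lowercase synonym -> gene identifier.
SYNONYMS = {
    "foo1": "G1",
    "fooa": "G1",
    "bar2": "G2",
}

def split_sentences(abstract):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", abstract)
    return [p.strip() for p in parts if p.strip()]

def index_abstracts(abstracts):
    """Map each gene id to the (abstract_id, sentence_no) pairs mentioning it."""
    index = defaultdict(list)
    for abstract_id, text in abstracts.items():
        for sentence_no, sentence in enumerate(split_sentences(text)):
            tokens = re.findall(r"[A-Za-z0-9]+", sentence.lower())
            genes = {SYNONYMS[t] for t in tokens if t in SYNONYMS}
            for gene in sorted(genes):
                index[gene].append((abstract_id, sentence_no))
    return dict(index)

# Toy input; the abstract ids and texts are made up for illustration.
abstracts = {
    "A1": "Foo1 phosphorylates Bar2. The kinase is also called FooA.",
    "A2": "Glucose repression was studied in yeast.",
}
index = index_abstracts(abstracts)
```

Looking up a gene id in `index` yields every sentence that mentions it, which is the information needed to treat a gene as a hyperlink between sentences and abstracts.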
These documents essentially contain the original text divided into individual sentences with gene synonyms, MeSH terms, and verbs tagged. Gene documents also contain general information, such as database references, synonyms, and a list of homologous genes, which in the web application are used to provide links to external resources.
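To make the document structure concrete, here is a rough sketch of how one abstract could be serialized as sentences with tagged genes and verbs. The element names (`abstract`, `sentence`, `gene`, `verb`), the attributes, and the two small lookup tables are assumptions made for illustration; the actual iHOP schema is not given here, and MeSH tagging is left out for brevity.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical lookup tables, for illustration only.
GENE_SYNONYMS = {"foo1": "G1", "bar2": "G2"}
VERBS = {"phosphorylates", "binds"}

def tag_sentence(sentence_text, number):
    """Build a <sentence> element with inline <gene> and <verb> tags."""
    sentence = ET.Element("sentence", {"n": str(number)})
    last = None  # most recently added child element, if any
    # Alternate runs of word and non-word characters so no text is lost.
    for piece in re.findall(r"\w+|\W+", sentence_text):
        key = piece.lower()
        if key in GENE_SYNONYMS:
            last = ET.SubElement(sentence, "gene", {"id": GENE_SYNONYMS[key]})
            last.text = piece
        elif key in VERBS:
            last = ET.SubElement(sentence, "verb")
            last.text = piece
        elif last is None:
            # Plain text before the first tagged element.
            sentence.text = (sentence.text or "") + piece
        else:
            # Plain text following a tagged element goes into its tail.
            last.tail = (last.tail or "") + piece
    return sentence

def build_abstract_document(abstract_id, sentences):
    """Wrap the tagged sentences of one abstract in a single root element."""
    root = ET.Element("abstract", {"id": abstract_id})
    for n, text in enumerate(sentences):
        root.append(tag_sentence(text, n))
    return root

doc = build_abstract_document("A1", ["Foo1 phosphorylates Bar2."])
xml_text = ET.tostring(doc, encoding="unicode")
```

Because the original sentence text is preserved around the tags, a style sheet can render the same document either as plain text or with each gene turned into a link.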
Web application
The web application currently consists of the data in 2.3 million XML documents, 33 XSP scripts, and 37 transformation style sheets that produce the different views of the gene and abstract data.
Dynamic effects are implemented in the HTML and JavaScript layer on the client side, which minimizes server load and avoids complex front-end database queries. As a result, response times are very short and many users can work with the system concurrently.
Please contact Dr. Robert Hoffmann for more information.
| Category | Technologies |
| --- | --- |
| Object-oriented languages | Java, C++ |
| Data exchange | XML, XSL |
| Scripting languages | JavaScript, HTML, Perl |
| Database design | Postgres, Oracle, SQL, JDBC |
| Server-side | JSP, XSP, Tomcat, JBoss, Cocoon |
| Operating systems | Unix, Linux, Windows |
| Statistical analysis | R Project, SPSS, S-PLUS |