Columbia Genome Center
Department of Medical Informatics
Department of Computer Science



Introduction

Knowledge integration in molecular biology is crucial for a deeper understanding of molecular cell functions. Such integration can be accomplished by gathering research results published in the scientific literature. Unfortunately, researchers face a real challenge in integrating the sheer volume of scientific publications available. GeneWays is built to help integrate research results by providing automatic tools that gather knowledge from the scientific literature. The GeneWays project team comprises experts from biology, computer science and linguistics, who use natural language processing (NLP) to scan thousands of research articles and automatically extract relevant molecular knowledge.

The key idea is rather simple: while a single author may be the ultimate expert on a specific molecular substance, the collective knowledge of the whole research community is currently unavailable in an integrated form. Molecular interactions are frequently represented in research articles by statements such as "protein A activates protein B", which can be collected and integrated into knowledge about a complex molecular network consisting of thousands of genes, proteins, small molecules (along with other substances) and their interactions.

The GeneWays architecture combines various modules designed to automatically gather knowledge on signal transduction pathways from online scientific journals. The core of the system is a knowledge base of molecular actions [8]. The knowledge is supplied by several system modules, which select scientific journals of interest, mark and identify substance names in the journal text [5, 7], and extract interactions between these substances, as well as other actions, by means of NLP [4]. The integrated knowledge, stored in a database, can be queried, analyzed, critiqued and visualized [6] by interested researchers.
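
To make the integration idea concrete, the following is a minimal sketch in Python, not the GeneWays implementation. It assumes that a simple regular expression can stand in for the NLP modules, that substance names are single tokens, and that a small hand-picked verb list covers the actions of interest. The names ACTION_VERBS, extract_statements and integrate, as well as the article identifiers and sentences, are hypothetical and chosen only for illustration.

```python
import re
from collections import defaultdict

# Hypothetical verb list; a real system would rely on a full NLP grammar.
ACTION_VERBS = ("activates", "inhibits", "binds", "phosphorylates")

# Toy pattern "<substance> <verb> <substance>", a stand-in for NLP parsing.
# An optional leading word "protein" is skipped so that
# "protein A activates protein B" yields the names "A" and "B".
PATTERN = re.compile(
    r"\b(?:protein\s+)?(\w+)\s+("
    + "|".join(ACTION_VERBS)
    + r")\s+(?:protein\s+)?(\w+)",
    re.IGNORECASE,
)


def extract_statements(text):
    """Return (agent, action, target) triples found in a piece of text."""
    return [
        (m.group(1), m.group(2).lower(), m.group(3))
        for m in PATTERN.finditer(text)
    ]


def integrate(articles):
    """Merge the statements of many articles into one network.

    The result maps each (agent, action, target) edge to the set of
    article identifiers in which that statement was found.
    """
    network = defaultdict(set)
    for article_id, text in articles.items():
        for statement in extract_statements(text):
            network[statement].add(article_id)
    return dict(network)


# Hypothetical input: two short article excerpts.
articles = {
    "article-1": "Protein A activates protein B. Raf phosphorylates MEK.",
    "article-2": "We show that Raf phosphorylates MEK and MEK activates ERK.",
}

network = integrate(articles)
for (agent, action, target), sources in sorted(network.items()):
    print(f"{agent} --{action}--> {target}  (reported in {sorted(sources)})")
```

Keeping the set of source articles for each edge reflects the point made above: a single statement comes from one author, whereas the merged network records how the community as a whole reports an interaction.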