It is widely recognized that successful data integration is one of the keys to improved productivity in biopharmaceutical R&D. Through data integration, researchers and program directors can discover relationships that enable them to make better and faster decisions about disease targets and drug compounds. Given the time, effort, funding and resources required for drug discovery research, substantial gains can be achieved by more focused and efficient selection of targets and leads. Ultimately, effective data integration, particularly for genomic and proteomic information, should contribute to more effective selection of drug leads and shorter time to market for those drug compounds.
The need for data integration is widely acknowledged in the bioinformatics community. Bioinformatics data is currently spread across the Internet and throughout organizations in a wide variety of formats. Success in most bioinformatics-related activities, from functional characterization of genomic sequences to prioritization of drug targets, requires an integrated view of all relevant data in a drug discovery R&D program. The challenges of data integration may be addressed using a wide variety of approaches, and integration systems abound in both the academic and commercial sectors. While each approach has strengths and weaknesses, it can be difficult to evaluate which approach suits a particular need best without fully understanding the data integration landscape.
Data Sources. Bioinformatics data sources often have large, complex data structures, reflecting the richness of the scientific concepts they model. Many bioinformatics data sources cover similar domains, such as genes, proteins, sequence annotations or microarray results. To derive the greatest benefit for scientific investigation, it is important to provide an integrated view of all data sources that are relevant for a particular research project.
Data obtained from various sources is often structured differently in each source. Thus, to use data effectively from disparate sources, one must understand the database schemas used to store data in each source system, and translate among the schemas in order to exchange information between them.
Data sources often contain similar or overlapping data elements but use conflicting data definitions. Thus, there is often a need for user-friendly tools and interfaces to transform bioinformatics data from one database schema to another, and to discover correlated data among many databases, regardless of the structure of the databases or the names that are given to corresponding attributes in those databases.
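The translation step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two sources, their attribute names (`gene_symbol`, `GeneName`, `Chr`), the common schema and the sample records are all hypothetical, and a real project would face far larger schemas.

```python
# Hypothetical records from two sources that describe the same kind of entity
# but use different attribute names and value conventions.
SOURCE_A = [
    {"gene_symbol": "TP53", "chromosome": "17"},
    {"gene_symbol": "BRCA1", "chromosome": "17"},
]
SOURCE_B = [
    {"GeneName": "tp53", "Chr": "chr17"},
    {"GeneName": "egfr", "Chr": "chr7"},
]


def strip_chr_prefix(value):
    """Normalize 'chr17' to '17' so both sources use the same convention."""
    return value[3:] if value.lower().startswith("chr") else value


# Each mapping pairs a common-schema field with (source attribute, converter).
MAPPING_A = {
    "symbol": ("gene_symbol", str.upper),
    "chromosome": ("chromosome", str),
}
MAPPING_B = {
    "symbol": ("GeneName", str.upper),
    "chromosome": ("Chr", strip_chr_prefix),
}


def to_common(record, mapping):
    """Translate one source record into the common schema."""
    return {field: convert(record[attr]) for field, (attr, convert) in mapping.items()}


def correlate(records_a, records_b):
    """Pair up common-schema records that share the same gene symbol."""
    by_symbol = {r["symbol"]: r for r in records_a}
    return [(by_symbol[r["symbol"]], r) for r in records_b if r["symbol"] in by_symbol]


common_a = [to_common(r, MAPPING_A) for r in SOURCE_A]
common_b = [to_common(r, MAPPING_B) for r in SOURCE_B]
matches = correlate(common_a, common_b)
print(matches)
```

Keeping the mappings as data rather than code means a new source can be added by writing one more mapping table, without touching the translation or correlation logic.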
XML. The Extensible Markup Language (XML) is a general-purpose markup language that facilitates the sharing of data across heterogeneous computer systems. XML is a format of choice for storing information with an inherent hierarchical structure and has been widely accepted in the bioinformatics industry as a means of data exchange. Its major advantages are its ease of use, the wide support available from software, database and LIMS (Laboratory Information Management System) vendors, and the large number of tools that exist to facilitate its use. A steadily growing number of databases, software applications and tools support XML, including NCBI BLAST, PIR, SWISS-PROT, InterPro and GO.
XML facilitates data integration and application interoperability through the adoption of standards for representing certain types of data, e.g., genome annotations or microarray experiments. Once a standard is agreed upon, all databases and applications that store or process the data can share a common interface, namely the schema for the XML data representation. Adoption of XML in the bioinformatics community is steadily improving the consistency of published data sets, facilitating data exchange and improving the resolution of inconsistencies in data representations.
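As a rough illustration of how a shared XML representation acts as a common interface, the sketch below parses a small annotation document with Python's standard `xml.etree.ElementTree` module. The element and attribute names (`annotations`, `gene`, `feature`, `start`, `end`) and the coordinates are invented for this example and do not follow any published bioinformatics schema.

```python
import xml.etree.ElementTree as ET

# A made-up annotation document; the layout is an assumption for illustration.
ANNOTATION_XML = """\
<annotations>
  <gene id="G1" symbol="TP53">
    <feature type="exon" start="100" end="250"/>
    <feature type="exon" start="400" end="520"/>
  </gene>
  <gene id="G2" symbol="BRCA1">
    <feature type="exon" start="10" end="90"/>
  </gene>
</annotations>
"""


def parse_annotations(xml_text):
    """Turn the XML document into plain dictionaries any application can use."""
    root = ET.fromstring(xml_text)
    genes = []
    for gene in root.findall("gene"):
        features = [
            {
                "type": f.get("type"),
                "start": int(f.get("start")),
                "end": int(f.get("end")),
            }
            for f in gene.findall("feature")
        ]
        genes.append(
            {"id": gene.get("id"), "symbol": gene.get("symbol"), "features": features}
        )
    return genes


genes = parse_annotations(ANNOTATION_XML)
for g in genes:
    print(g["symbol"], len(g["features"]))
```

Any application that agrees on this layout can read the same file with the same few lines, which is the practical benefit of settling on a standard representation.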
Markup Languages. In bioinformatics, markup languages based on XML are often used as a form of data integration. Markup languages provide standardized descriptions and representations of bioinformatics data, which facilitate the exchange of data between heterogeneous computer systems and storage devices. There is a large and growing list of markup languages used in the bioinformatics industry today. Table 1 shows a small sample of these languages.
| Markup Language | Purpose |
|---|---|
| | Genomic annotation and visualization |
| | Experimental information for biopolymers |
| | Genomic sequences and biological function |
| | Management of molecular information |
| | Microarray gene expression data exchange |
| | Systems biology and biochemical networks |
Table 1. A sample of XML-based markup languages that are commonly used in bioinformatics databases and applications. Standardized representations of data formats and content facilitate data exchange and data integration.
Database Federation. One method of data integration is to embed middleware between heterogeneous databases, such that the middleware application acts as a mediator between the database platforms. Database federation leverages the native data management and search capabilities of individual source databases and creates a single, unified, logical view of the federated databases. Applications interacting with the middleware are presented with a unified data schema even though the source database schemas are distributed across many federated databases. In this system architecture data integration occurs in the middleware layer.
One advantage of database federation is that it does not require modification of the primary data stores. In the bioinformatics community there are often many heterogeneous databases to deal with. Furthermore, many databases (e.g., PDB) are in the public domain and thus not directly modifiable by researchers. Federation is a means of collating data from these disparate sources without direct intervention in those sources.
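The mediator idea can be sketched with two in-memory SQLite databases standing in for independent sources. Everything here is an assumption made for illustration: the `Mediator` class, the table layouts, and the placeholder identifiers such as `ACC001` and `STRUCT-A`. Production federation middleware also handles query planning, security and remote access, none of which is shown.

```python
import sqlite3


def make_source(ddl, insert_sql, rows):
    """Create an in-memory database that stands in for one source system."""
    conn = sqlite3.connect(":memory:")
    conn.execute(ddl)
    conn.executemany(insert_sql, rows)
    return conn


# Two hypothetical sources with different schemas and attribute names.
proteins_db = make_source(
    "CREATE TABLE protein (accession TEXT, gene TEXT)",
    "INSERT INTO protein VALUES (?, ?)",
    [("ACC001", "TP53"), ("ACC002", "BRCA1")],
)
structures_db = make_source(
    "CREATE TABLE structure (struct_id TEXT, gene_symbol TEXT, resolution REAL)",
    "INSERT INTO structure VALUES (?, ?, ?)",
    [("STRUCT-A", "TP53", 1.9), ("STRUCT-B", "EGFR", 2.4)],
)


class Mediator:
    """Presents one logical view; each source keeps its own schema and data."""

    def __init__(self):
        self._sources = []  # (connection, query projecting onto the unified view)

    def register(self, conn, query):
        self._sources.append((conn, query))

    def find_by_gene(self, gene):
        """Run the per-source query on every source and merge the rows."""
        results = []
        for conn, query in self._sources:
            for source, identifier in conn.execute(query, (gene,)):
                results.append(
                    {"gene": gene, "source": source, "identifier": identifier}
                )
        return results


mediator = Mediator()
mediator.register(
    proteins_db, "SELECT 'proteins', accession FROM protein WHERE gene = ?"
)
mediator.register(
    structures_db,
    "SELECT 'structures', struct_id FROM structure WHERE gene_symbol = ?",
)

hits = mediator.find_by_gene("TP53")
print(hits)
```

Note that neither source database is altered: the per-source queries do the schema translation, and the search itself runs inside each source.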
Operational Data Stores. Another strategy for solving data integration issues is to create a custom database that stores and manages integrated data. This is often accomplished in an Operational Data Store (ODS), which is a database that sits between transactional source systems (e.g., LIMS) on one side and data warehouses or data marts on the other. Data sent to the ODS is cleansed, standardized and formatted according to predetermined scientific and business rules. This is often the place where bioinformatics standards, such as controlled vocabularies and ontologies, are applied to data elements before the data is forwarded to a data warehouse or data mart.
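A minimal sketch of the cleansing step might look like the following. The field names, the three rules (normalize sample identifiers, convert concentrations to a single unit, convert dates to ISO format) and the sample records are assumptions chosen for illustration, not rules taken from any particular LIMS or ODS product.

```python
from datetime import datetime

# Assumed rule: store every concentration in ng/uL.
UNIT_TO_NG_PER_UL = {"ng/ul": 1.0, "ug/ml": 1.0, "mg/ml": 1000.0}
# Assumed rule: accept these two date layouts and store ISO 8601 dates.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")


def parse_date(text):
    """Return an ISO date string, or None if no accepted format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def cleanse(raw):
    """Apply the rules to one raw record; return None if the record is rejected."""
    sample_id = raw.get("sample_id", "").strip().upper()
    factor = UNIT_TO_NG_PER_UL.get(raw.get("unit", "").strip().lower())
    collected = parse_date(raw.get("collected", ""))
    if not sample_id or factor is None or collected is None:
        return None
    return {
        "sample_id": sample_id,
        "conc_ng_per_ul": float(raw["conc"]) * factor,
        "collected": collected,
    }


# Hypothetical raw records as they might arrive from a source system.
raw_records = [
    {"sample_id": " s-001 ", "conc": "2.5", "unit": "mg/mL", "collected": "03/14/2006"},
    {"sample_id": "S-002", "conc": "40", "unit": "ng/uL", "collected": "2006-03-15"},
    {"sample_id": "", "conc": "1", "unit": "ng/uL", "collected": "2006-03-15"},
]

staged = [record for record in map(cleanse, raw_records) if record is not None]
print(staged)
```

In this sketch rejected records are simply dropped; a real ODS would more likely route them to an exception queue for review.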
Controlled Vocabularies. A controlled vocabulary is a list of standardized terms or descriptors whose meanings are specifically defined or authorized by a standards organization. The terms in a controlled vocabulary ensure that every person and every system in an organization uses the same word to mean the same thing. In a bioinformatics context, controlled vocabularies offer a form of data integration by enforcing naming conventions for data elements that ultimately appear in bioinformatics databases.
One example of a controlled vocabulary in bioinformatics is the Gene Ontology (GO) from the Gene Ontology Consortium. GO is an important tool for the representation and processing of gene- and gene-product-related information across all species. Its controlled vocabulary supports data integration for biomedical researchers by enabling them to store results and generate reports using a common terminology in annotating genes and gene products.
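To make the idea concrete, the sketch below maps free-text labels onto a handful of GO terms. The three identifiers and names are real GO terms, but the tiny subset, the local synonym table and the helper functions `build_index` and `annotate` are assumptions for illustration; an actual system would load the full ontology from the files the Gene Ontology Consortium distributes.

```python
# A tiny subset of GO terms, hard-coded here only for illustration.
GO_TERMS = {
    "GO:0006915": "apoptotic process",
    "GO:0005634": "nucleus",
    "GO:0016301": "kinase activity",
}

# Hypothetical local synonym table mapping in-house labels to GO identifiers.
SYNONYMS = {
    "apoptosis": "GO:0006915",
    "cell nucleus": "GO:0005634",
}


def build_index(terms, synonyms):
    """Index preferred names and synonyms by their lowercase form."""
    index = {name.lower(): term_id for term_id, name in terms.items()}
    index.update({label.lower(): term_id for label, term_id in synonyms.items()})
    return index


def annotate(free_text, index, terms):
    """Return (GO id, preferred name) for a label, or None if it is not recognized."""
    key = " ".join(free_text.lower().split())  # normalize case and whitespace
    term_id = index.get(key)
    if term_id is None:
        return None
    return term_id, terms[term_id]


index = build_index(GO_TERMS, SYNONYMS)
for label in ["  Apoptosis ", "Cell   Nucleus", "kinase activity", "unknown term"]:
    print(repr(label), "->", annotate(label, index, GO_TERMS))
```

Whatever label a researcher types, the stored annotation is the same identifier and preferred name, so reports generated from different groups' data can be combined directly.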
This brief article just scratches the surface in its coverage of bioinformatics data integration. In future articles I’ll review some of the benefits that bioinformaticians derive from using data integration strategies.
Recent articles by Dr. Richard M. Casey
Richard is the Founder and Chief Scientific Officer of RMC Biosciences Inc., a firm that offers services in Bioinformatics and Computer Aided Drug Design. Dr. Casey received a Ph.D. in Biological Sciences from Colorado State University. He has 20-plus years of experience in Computational Sciences, Information Technology and High-Performance Computing. He has held corporate and academic positions at Hewlett-Packard, Boeing Computer Services, Arizona State University, Colorado State University, the Alabama Supercomputer Center, and the Institute for Computational Studies at CSU, and he was the founder of a software consulting firm, Alpine Computing Inc. He holds a Project Management Professional Certificate and a Bioinformatics Certificate from Stanford University. Richard can be reached at rcasey@rmcbiosciences.com.
Editor's note: More bioinformatics articles, resources, news and events are available in the Business Intelligence Network's Bioinformatics Channel. Be sure to visit today!