Federated Databases in Bioinformatics and Translational Medical Research
Federated Systems Enhance “Bench-to-Bedside” Research

Originally published February 21, 2006

In my last article, I described federated database architectures and their use in bioinformatics research. This month I would like to review some real-world examples of federated systems and describe how they are currently being used in the bioinformatics community and in medical research. Translational medical research, which seeks to integrate basic academic medical research with clinical trials research, will especially benefit from federated systems.

The SPINE2 (Structural Proteomics in the Northeast) project, which is sponsored by the Northeast Structural Genomics (NSG) Consortium, uses a federated system to integrate data and information sources for protein structure determination. The SPINE2 system can pull together data from individual laboratories, from remote sites within the NSG consortium, and from widely distributed public sources (such as the Protein Data Bank, or PDB) located around the country.

The SPINE2 federated system is unusual because it is essentially a distributed LIMS (Laboratory Information Management System) application linked to a federation database. The LIMS application gathers proteomics data within a specific laboratory environment and populates the federation database. As a result, local biological data becomes immediately available to all consortium members. In effect, the SPINE2 LIMS creates an integrated “virtual lab,” making all labs in the consortium appear as one. Figure 1 shows the SPINE2 system.

Figure 1: A high-level view of the SPINE2 federation system. Local resources are shown in white, consortium resources are shown in orange and remote national resources are shown in yellow. The SPINE2 LIMS application presents all resources to end users in a single, integrated Web-based interface.

As shown in the figure, SPINE2 brings together local, consortium and remote public information. Local information includes a protein structure gallery for displaying laboratory-derived proteomics data, publication pages for cross-referencing protein targets with literature references (such as MEDLINE) and Web-based collaborative tools for interacting with consortium members. Consortium information includes links to specialized biological databases housed in individual consortium labs (for example, federate databases with NMR and X-ray crystal data) and to LIMS installations based in various remote labs. Public information comes from national resources such as SwissProt, PIR and Wormbase. All of these resources are accessed in a unified way through the SPINE2 federation database.

The Cancer Biomedical Informatics Grid (caBIG) is an ambitious effort by the National Cancer Institute (NCI) to create a national grid of bioinformatics resources. Cancer researchers nationwide use caBIG’s grid infrastructure and its bioinformatics resources for drug discovery and the prevention and treatment of cancer.

The caBIG project is designed around federation architectures to make all resources readily available to anyone plugged into the caBIG informatics grid. For example, bioinformatics tools and databases are currently available for clinical trial management systems, integrated cancer research, tissue banks and pathology tools, controlled vocabularies, genomic and proteomic databases, biochemical pathway analysis, and image repositories. By joining the grid, researchers can decide which applications, tools and databases fit their needs, and they can access these resources from anywhere in the country. The underlying federation architecture enables a high level of integration among these resources.

caIntegrator is one of many bioinformatics tools that have been developed in the caBIG program. It provides an informatics bridge between academic research labs, clinical trials labs, hospitals and other research centers. Figure 2 shows how caIntegrator brings together remote, heterogeneous data sources through an intermediary federated database architecture.

Figure 2: An example of federated data integration using the caIntegrator application. Genomic, proteomic and clinical data is presented in a single unified view, although data sources are distributed in various labs and research institutes around the country.

Raw data from genomic and proteomic microarray analyses, clinical trials data from academic or hospital research labs, pathology and radiology data, and many other kinds of data from local and remote sources are collated in the caIntegrator framework. Once collated, this information is summarized where necessary, correlated and displayed in user-friendly Web interfaces. caIntegrator and its underlying federated design thus enhance translational medical research: academic cancer researchers, clinical trials researchers and physicians all have immediate access to the same data sets. Furthermore, analytical bioinformatics tools and applications can be shared across the federation, thereby improving the collaborative capabilities of these translational medical research groups.
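The collate, summarize and correlate steps described above can be illustrated with a small sketch. This is not caIntegrator's implementation. The record layouts, the field names (patient_id, expression, response) and all values are hypothetical, chosen only to show how records from two separate sources might be summarized and then joined on a shared identifier.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records from two separate sources (all values are made up).
microarray_rows = [
    {"patient_id": "P1", "gene": "GENE_A", "expression": 2.0},
    {"patient_id": "P1", "gene": "GENE_A", "expression": 4.0},
    {"patient_id": "P2", "gene": "GENE_A", "expression": 1.0},
]
clinical_rows = [
    {"patient_id": "P1", "response": "partial"},
    {"patient_id": "P2", "response": "none"},
    {"patient_id": "P3", "response": "complete"},
]


def summarize_expression(rows):
    """Summarize: average replicate measurements per (patient, gene)."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[(row["patient_id"], row["gene"])].append(row["expression"])
    return {key: mean(values) for key, values in grouped.items()}


def correlate(expression_summary, clinical):
    """Correlate: join summarized expression with clinical outcome by patient."""
    response_by_patient = {row["patient_id"]: row["response"] for row in clinical}
    unified = []
    for (patient_id, gene), avg in sorted(expression_summary.items()):
        # Patients without a clinical record are left out of the unified view.
        if patient_id in response_by_patient:
            unified.append({
                "patient_id": patient_id,
                "gene": gene,
                "mean_expression": avg,
                "response": response_by_patient[patient_id],
            })
    return unified


unified_view = correlate(summarize_expression(microarray_rows), clinical_rows)
for record in unified_view:
    print(record)
```

In this sketch the join key is a plain patient identifier. A real translational system would also have to reconcile identifiers and vocabularies across institutions, which the sketch does not attempt.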

The caBIG program, bioinformatics tools such as caIntegrator, and federated informatics bridges support the “bench-to-bedside” paradigm in translational medical research.

Rodeo is a federated bioinformatics database project hosted by the Computational Biology Initiative (CBI) at Harvard Medical School (HMS). Like many organizations that use a variety of heterogeneous bioinformatics data sources, the CBI staff recognized a need to integrate their data sources efficiently to promote medical research at HMS. One common approach to integration would be to develop a centralized data warehouse, perhaps with a number of specialized data marts. However, data warehouse architectures raise numerous issues that are difficult to resolve when the data sources are distributed. Instead, CBI/HMS chose a federated approach to integrate its biological databases.

The Rodeo federated architecture offers some advantages over warehouse-based systems because remote data sources (federates) remain completely independent of the federation servers. Individual federate data sources can be changed and updated without affecting other federates in the system. Federate database schemas do not have to be modified to conform to the central federation database schema. Moreover, a single unified view of the integrated data sources can be maintained as new federates are added to the system. This level of autonomy becomes increasingly important as the number of remote data sources grows. Note, however, that performance can become an issue with federated database systems, because queries must reach remote sources at run time, so it should be carefully considered in federation design.
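The autonomy property can be sketched in a few lines of code. The example below is a minimal illustration under stated assumptions, not Rodeo's design: the Federate and Federation classes, the two "labs," their field names and their values are all hypothetical. Each federate keeps its own local schema, and a per-federate field map translates rows into the unified schema only at query time.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List

Record = Dict[str, object]


@dataclass
class Federate:
    """Wraps one autonomous data source; its local schema is left untouched."""
    name: str
    fetch: Callable[[], Iterable[Record]]  # returns rows in the LOCAL schema
    field_map: Dict[str, str]              # local field name -> unified field name

    def rows(self) -> List[Record]:
        # Translate each local row into the unified schema at query time.
        out = []
        for row in self.fetch():
            unified = {self.field_map[k]: v for k, v in row.items()
                       if k in self.field_map}
            unified["source"] = self.name
            out.append(unified)
        return out


@dataclass
class Federation:
    """Hypothetical mediator that presents all federates as one view."""
    federates: List[Federate] = field(default_factory=list)

    def register(self, federate: Federate) -> None:
        # Adding a federate requires no change to the federates already present.
        self.federates.append(federate)

    def query(self, predicate: Callable[[Record], bool]) -> List[Record]:
        results = []
        for federate in self.federates:
            results.extend(r for r in federate.rows() if predicate(r))
        return results


# Two hypothetical labs that name the same concepts differently.
lab_a = Federate(
    "lab_a",
    lambda: [{"prot_id": "X1", "mw_kda": 12.5}],
    {"prot_id": "protein_id", "mw_kda": "mass_kda"},
)
lab_b = Federate(
    "lab_b",
    lambda: [{"accession": "X2", "mass": 48.0}, {"accession": "X3", "mass": 9.1}],
    {"accession": "protein_id", "mass": "mass_kda"},
)

federation = Federation()
federation.register(lab_a)
federation.register(lab_b)

heavy = federation.query(lambda r: r["mass_kda"] > 10)
for record in heavy:
    print(record)
```

Because every query in this sketch visits every federate and filters the rows afterward, it also hints at where the performance cost comes from. A production federation engine would typically push filters down to the sources, which is beyond the scope of this illustration.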

Rodeo uses IBM’s DiscoveryLink product, as well as other federation technologies, as its federation engine. One of the general challenges in life sciences research and bioinformatics is the complex, heterogeneous nature of the data sources. It is not uncommon to find a variety of relational databases (such as Oracle, DB2 and MySQL) mixed with non-relational sources. Such non-relational sources include XML documents, spreadsheets, ASCII text files, BLAST/FASTA/HMMER files and various kinds of structured and unstructured data sets. DiscoveryLink provides the Rodeo application with an efficient federation mechanism to integrate these disparate bioinformatics data sources.
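To make the heterogeneity problem concrete, the following sketch reads one relational source, one XML document and one delimited text file into a common record shape. It uses only the Python standard library and does not use or imitate DiscoveryLink's interfaces. The in-memory SQLite table stands in for a relational database such as Oracle, DB2 or MySQL, and the table name, element names, column names and values are all assumptions made for illustration.

```python
import csv
import io
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical sample data for the two non-relational sources.
XML_TEXT = (
    "<entries>"
    "<entry id='PROT2'><organism>S. cerevisiae</organism></entry>"
    "</entries>"
)
CSV_TEXT = "id,organism\nPROT3,C. elegans\n"


def from_relational():
    """Relational source: an in-memory SQLite table as a stand-in."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE proteins (id TEXT, organism TEXT)")
    conn.execute("INSERT INTO proteins VALUES ('PROT1', 'E. coli')")
    rows = conn.execute("SELECT id, organism FROM proteins").fetchall()
    conn.close()
    return [{"protein_id": pid, "organism": org, "source": "relational"}
            for pid, org in rows]


def from_xml():
    """XML source: one <entry> element per protein."""
    root = ET.fromstring(XML_TEXT)
    return [{"protein_id": entry.get("id"),
             "organism": entry.findtext("organism"),
             "source": "xml"}
            for entry in root.findall("entry")]


def from_text():
    """Delimited text source: a comma-separated file with a header row."""
    reader = csv.DictReader(io.StringIO(CSV_TEXT))
    return [{"protein_id": row["id"],
             "organism": row["organism"],
             "source": "text"}
            for row in reader]


def unified_view():
    """Concatenate the output of every wrapper into one list of records."""
    records = []
    for wrapper in (from_relational, from_xml, from_text):
        records.extend(wrapper())
    return records


all_records = unified_view()
for record in all_records:
    print(record)
```

Each wrapper hides the access method of its source and emits the same three fields, so code that consumes the unified view never needs to know whether a record came from a table, a document or a flat file.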

The Federation’s Future
With the ever-increasing complexity of biological data sets, geographically distributed data sources and cross-functional collaborative research teams, the federation approach to bioinformatics data management has a bright future. As these examples suggest, database federations seem likely to become much more prevalent, especially in the translational medical research community.

  • Dr. Richard Casey

    Richard is the Founder and Chief Scientific Officer of RMC Biosciences Inc., a firm that offers services in Bioinformatics and Computer Aided Drug Design. Dr. Casey received a Ph.D. in Biological Sciences from Colorado State University. He has 20-plus years of experience in Computational Sciences, Information Technology and High-Performance Computing. He has held corporate and academic positions at Hewlett-Packard, Boeing Computer Services, Arizona State University, Colorado State University, the Alabama Supercomputer Center, and the Institute for Computational Studies at CSU, and was the founder of a software consulting firm, Alpine Computing Inc. He holds a Project Management Professional Certificate and a Bioinformatics Certificate from Stanford University. Richard can be reached at rcasey@rmcbiosciences.com.
