Bioinformatics Tools for Protein Analysis

Originally published July 12, 2005

There are 100,000 proteins in the human body, or 1 million, depending on who is counting. The exact number of proteins in humans is currently unknown, although it's possible to make educated guesses based on the number of genes in the human genome (roughly 20,000), alternative splicing (a process that generates multiple proteins from a single gene) and post-translational modifications (chemical changes made to a protein after it is synthesized). In any event, the number of proteins that constitute the human proteome is quite large.
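
To see how such different counts can both be reasonable, here is a back-of-the-envelope sketch in Python. The multipliers (5 splice isoforms per gene, 10 modified forms per isoform) are illustrative assumptions chosen only to reproduce the range quoted above; they are not measured values.

```python
def estimate_protein_count(genes, isoforms_per_gene, modified_forms_per_isoform):
    """Rough estimate: every gene yields several splice isoforms,
    and every isoform can exist in several modified forms."""
    return genes * isoforms_per_gene * modified_forms_per_isoform


GENES = 20_000  # approximate gene count quoted in the article

# Assumed multipliers, for illustration only.
low_estimate = estimate_protein_count(GENES, isoforms_per_gene=5,
                                      modified_forms_per_isoform=1)
high_estimate = estimate_protein_count(GENES, isoforms_per_gene=5,
                                       modified_forms_per_isoform=10)

print(f"Counting splice isoforms only: {low_estimate:,}")
print(f"Counting modified forms too:   {high_estimate:,}")
```

Whether the answer is closer to 100,000 or 1 million depends mostly on whether each modified form is counted as a separate protein.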

Many efforts are underway to examine the human proteome, and the protein complements in other species as well. To assist in this effort there is a wide variety of bioinformatics tools and databases available for the analysis of proteins. We’ll take a brief look at some of them here.

  • Protein Microarrays. Protein microarrays consist of antibodies, proteins, protein fragments, peptides, aptamers or carbohydrate elements that are immobilized in a grid-like pattern on a glass surface. The arrayed molecules are then used to screen and assess protein interaction patterns with samples containing distinct proteins or classes of proteins.

[Figure: protein microarray]

These microarrays are used to identify protein-protein interactions, the substrates of proteins and the targets of biologically active small molecules. Like the market for its cousin, the gene microarray, the protein microarray market is growing rapidly. With this growth comes a need for bioinformatics tools to analyze the microarrays. Many such tools are offered by vendors of the microarrays, such as TeleChem, but we should see a steady increase in the number of publicly available, open source bioinformatics tools in the near future.

  • Protein Amino Acid Sequences. The analysis of amino acid sequences, or primary structure, of proteins provides the foundation for many other types of protein studies. The primary structure ultimately determines how proteins fold into functional 3D structures. Primary structure is used in multiple sequence alignment studies to determine the evolutionary relationships between proteins, and to determine relationships between structure and function in related proteins.

[Figure: protein sequence]

Substitutions of specific amino acids are used in mutagenesis studies to determine how modifications in structure affect protein function. Scientists therefore often begin formulating their research protocols around an analysis of the amino acid sequences that make up protein primary structure. A wide variety of tools are available for analyzing protein amino acid sequences; a few of the better-known ones are Modeller, the ExPASy Proteomics Server and HMMER.
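
Two of the ideas in this section, comparing related sequences and substituting a single amino acid, can be sketched in a few lines of Python. This is a minimal illustration, not how Modeller, ExPASy or HMMER work internally. It assumes the two sequences have already been aligned (gaps written as "-"), and the short sequences used here are made up.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percentage of identical residues over the aligned, non-gap columns."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # skip columns where either sequence has a gap
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def substitute(sequence, position, new_residue):
    """Return a copy of the sequence with one residue replaced.
    Positions are 1-based, as is usual when naming mutations."""
    if not 1 <= position <= len(sequence):
        raise ValueError("position is outside the sequence")
    return sequence[:position - 1] + new_residue + sequence[position:]


# Hypothetical toy sequences in one-letter amino acid code.
identity = percent_identity("MKT-AY", "MRTLAY")  # 4 of 5 compared columns match
mutant = substitute("MKTAY", 2, "R")             # replace K at position 2 with R

print(f"identity: {identity:.1f}%")
print(f"mutant:   {mutant}")
```

Real alignment tools also have to find the alignment in the first place and score similar (not just identical) residues, which is where most of the complexity lies.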

  • Protein-Ligand Docking. In drug discovery and development, the manner in which small-molecule compounds bind or dock with proteins is of the utmost importance. Proteins are often the main targets for new drugs, and many drug compounds are small molecules designed to bind preferentially to specific proteins. Because of this need to design small molecules for protein docking, many bioinformatics tools exist for the analysis of protein-ligand interactions. These tools often fall in the category of computational chemistry. At the atomic scales at which compounds dock with proteins, the interactions are biochemical and biophysical in nature, so computational chemistry tools are the method of choice for analyzing them. Some common tools for this include NWChem, Gaussian and GAMESS.
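
As a rough illustration of what "physicochemical interactions" means at this scale, the Python sketch below sums a Lennard-Jones term (short-range attraction and repulsion) and a Coulomb term (charge-charge attraction and repulsion) over every ligand-protein atom pair. This is a deliberately simplified classical model, not the method used by NWChem, Gaussian or GAMESS. The coordinates, charges and the single shared epsilon and sigma values are hypothetical.

```python
import math

# Coulomb constant in kcal*angstrom/(mol*e^2), a common molecular-mechanics unit.
COULOMB_KCAL = 332.06


def pair_energy(r, epsilon, sigma, q1, q2):
    """Lennard-Jones plus Coulomb energy for two atoms r angstroms apart."""
    lennard_jones = 4.0 * epsilon * ((sigma / r) ** 12 - (sigma / r) ** 6)
    coulomb = COULOMB_KCAL * q1 * q2 / r
    return lennard_jones + coulomb


def interaction_energy(ligand_atoms, protein_atoms, epsilon=0.1, sigma=3.4):
    """Sum pair energies over all ligand-protein atom pairs.
    Each atom is ((x, y, z), partial_charge); epsilon and sigma are
    assumed to be the same for every pair to keep the sketch short."""
    total = 0.0
    for ligand_pos, ligand_charge in ligand_atoms:
        for protein_pos, protein_charge in protein_atoms:
            r = math.dist(ligand_pos, protein_pos)
            total += pair_energy(r, epsilon, sigma, ligand_charge, protein_charge)
    return total


# Hypothetical one-atom "ligand" near a two-atom "binding pocket".
ligand = [((0.0, 0.0, 0.0), -0.5)]
pocket = [((4.0, 0.0, 0.0), 0.5), ((0.0, 4.0, 0.0), 0.0)]

print(f"interaction energy: {interaction_energy(ligand, pocket):.2f} kcal/mol")
```

A more negative energy suggests a more favorable pose. Docking programs evaluate many candidate poses with far more detailed scoring functions.
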

  • Protein Folds. Although there is no universal agreement on how to define protein folds, one simple characterization of folds is “an arrangement of secondary structures into a unique tertiary structure.” That is, protein amino acid sequences arrange themselves in recognizable, identifiable, 3D structures. Some of these structures are so common in many different proteins that they are given special names, e.g. Rossmann folds and TIM barrels.

On one hand, protein folds are defined strictly by their 3D structure and topological arrangement. On the other hand, protein folds are often associated with specific functions. For example, transferase and hydrolase enzymes often contain alpha/beta folds and perform similar functions. So it is important to know both the structure and function of folds in proteomics analysis.

[Figure: protein folds]

The colorful figure above shows various sets of protein folds. Along one axis (mostly red) are proteins containing predominantly alpha-helices; along another axis (mostly yellow) are proteins containing mostly beta-strands; and along yet a third axis (mostly green and blue) are proteins containing an equal mix of alpha-helices and beta-strands. In the brief sample of proteins shown in this diagram, one can see a great variety of folds and related secondary structures. 

Currently, there are about 550 recognized folds, although this number will continue to rise for some time. The total number of unique folds in nature may be around 1,000. There are several well-known protein fold databases, including SCOP, CATH and FSSP.

  • Protein Families. These are sets of proteins that share a common evolutionary origin and common function. Often, these proteins contain similar amino acid sequences, or similar primary, secondary or tertiary structures. Identifying protein families is useful in drug discovery programs because proteins within a family often share similar structures and functions. For example, the protein kinase family contains many enzymes with related 3D structures and correspondingly related functions. By studying enzymes within the kinase family, researchers can often predict the function of one kinase by comparison with others in the same family. Some common protein family databases include Pfam, PROSITE and iProClass.

  • Protein Interaction Maps. Many cellular processes and regulatory pathways are controlled by networks of interacting proteins. These networks determine how cells grow, divide, die, differentiate and communicate with other cells. Groups of cooperating, interacting proteins also carry out fundamental cellular processes such as DNA replication, DNA repair, transcription and translation (protein synthesis).

[Figure: protein map]

Therefore, to gain intimate knowledge of how cellular processes work, it’s important to know how proteins interact with one another. Protein interaction maps identify which proteins interact and how they are grouped together to form functional units. For example, there are about 14,000 proteins in the fruit fly, Drosophila melanogaster. Nearly complete protein interaction maps are available for the fruit fly, which should lead to new discoveries about cell signaling pathways in this organism. Having complete protein interaction maps provides detailed information about specific cell signaling pathways and how they function. Some sites that describe protein interaction maps include Genome Biology and the Institute of Molecular Biotechnology.
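
In computational terms, an interaction map is a graph: proteins are nodes and observed interactions are edges. The Python sketch below builds such a graph from a list of interacting pairs and then groups proteins that are linked, directly or indirectly. Treating each connected group as a "functional unit" is a simplifying assumption, and the protein names are hypothetical placeholders.

```python
from collections import defaultdict


def build_interaction_map(pairs):
    """Build an undirected graph: protein -> set of interaction partners."""
    graph = defaultdict(set)
    for a, b in pairs:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def connected_groups(graph):
    """Return groups of proteins that are linked directly or indirectly."""
    seen = set()
    groups = []
    for start in sorted(graph):
        if start in seen:
            continue
        stack = [start]
        group = set()
        while stack:  # depth-first walk from the starting protein
            protein = stack.pop()
            if protein in seen:
                continue
            seen.add(protein)
            group.add(protein)
            stack.extend(graph[protein] - seen)
        groups.append(sorted(group))
    return groups


# Hypothetical pairwise interactions, e.g. from a screening experiment.
observed_pairs = [("ProtA", "ProtB"), ("ProtB", "ProtC"), ("ProtD", "ProtE")]

interaction_map = build_interaction_map(observed_pairs)
groups = connected_groups(interaction_map)
print(groups)  # [['ProtA', 'ProtB', 'ProtC'], ['ProtD', 'ProtE']]
```

Real interaction maps are much denser, so finding meaningful units usually calls for clustering methods rather than simple connectivity.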

  • Protein Design. One of the fascinating frontiers in protein research is the ability to design and engineer proteins with novel structures and functions. By using the methods described above and by systematically altering protein structures, it’s possible to create (semi-)artificial proteins with new enzymatic functions and unusually strong binding capabilities. Researchers at the Howard Hughes Medical Institute are engaged in this type of research and have developed a new tool, ORBIT (Optimization of Rotamers by Iterative Techniques), to assist in the protein design process.

  • Dr. Richard Casey

    Richard is the Founder and Chief Scientific Officer of RMC Biosciences Inc., a firm that offers services in Bioinformatics and Computer Aided Drug Design. Dr. Casey received a Ph.D. in Biological Sciences from Colorado State University. He has 20-plus years of experience in Computational Sciences, Information Technology and High-Performance Computing. He has held corporate and academic positions at Hewlett-Packard, Boeing Computer Services, Arizona State University, Colorado State University, the Alabama Supercomputer Center, and the Institute for Computational Studies at CSU, and was the founder of a software consulting firm, Alpine Computing Inc. He holds a Project Management Professional Certificate and a Bioinformatics Certificate from Stanford University. Richard can be reached at rcasey@rmcbiosciences.com.
