Structural genomics
Structural genomics seeks to describe the
Because protein structure is closely linked with protein function, structural genomics has the potential to inform knowledge of protein function. In addition to elucidating protein functions, structural genomics can be used to identify novel protein folds and potential targets for drug discovery. Structural genomics draws on a large number of approaches to structure determination: experimental methods that use genomic sequences, modeling-based approaches that rely on sequence or structural homology to a protein of known structure, and, for a protein with no homology to any known structure, approaches based on chemical and physical principles.
As opposed to traditional structural biology, the determination of a protein structure through a structural genomics effort often (but not always) comes before anything is known about the protein's function. This raises new challenges in structural bioinformatics, namely determining a protein's function from its 3D structure.
Structural genomics emphasizes high-throughput determination of protein structures, which is performed in dedicated structural genomics centers.
While most structural biologists pursue structures of individual proteins or protein groups, specialists in structural genomics pursue structures of proteins on a genome-wide scale. This implies large-scale cloning, expression and purification. One main advantage of this approach is economy of scale. On the other hand, the scientific value of some resultant structures is at times questioned. A Science article from January 2006 analyzes the structural genomics field.[1]
One advantage of structural genomics efforts, such as the Protein Structure Initiative, is that the scientific community gets immediate access to new structures, as well as to reagents such as clones and protein. A disadvantage is that many of these structures are of proteins of unknown function and do not have corresponding publications. This requires new ways of communicating this structural information to the broader research community. The bioinformatics core of the Joint Center for Structural Genomics (JCSG) has developed a wiki-based approach, The Open Protein Structure Annotation Network (TOPSAN), for annotating protein structures emerging from high-throughput structural genomics centers.
Goals
One goal of structural genomics is to identify novel protein folds. Experimental methods of protein structure determination require proteins that express and/or crystallize well, which may inherently bias the kinds of protein folds that these experimental data elucidate. A genomic, modeling-based approach such as ab initio modeling may be better able to identify novel protein folds than the experimental approaches because it is not limited by experimental constraints.
Protein function depends on 3-D structure and these 3-D structures are more highly conserved than
Methods
Structural genomics takes advantage of completed genome sequences in several ways in order to determine protein structures. The gene sequence of the target protein can be compared to a known sequence, and structural information can then be inferred from the known protein's structure. Structural genomics can be used to predict novel protein folds based on other structural data. It can also take a modeling-based approach that relies on homology between the unknown protein and a solved protein structure.
de novo methods
Completed genome sequences allow every
Modelling-based methods
ab initio modeling
This approach uses protein sequence data and the chemical and physical interactions of the encoded amino acids to predict the 3-D structures of proteins with no homology to solved protein structures. One highly successful method for ab initio modeling is the Rosetta program, which divides the protein into short segments and arranges each short polypeptide segment into a low-energy local conformation. Rosetta is available for commercial use and for non-commercial use through its public program, Robetta.
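The segment-based idea described above can be illustrated with a toy Python sketch. This is not Rosetta's algorithm or scoring function: the window size of 9, the function names, and the `toy_energy` score (squared distance from idealized alpha-helix torsion angles) are all assumptions chosen only to show the shape of the procedure, which is to split the sequence into short overlapping segments and keep the lowest-scoring candidate conformation for a segment.

```python
from typing import Callable, List, Sequence, Tuple

# One (phi, psi) backbone torsion pair per residue, in degrees.
Torsions = List[Tuple[float, float]]


def split_into_fragments(sequence: str, size: int = 9) -> List[str]:
    """Return every overlapping window of `size` residues (size is an assumption)."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [sequence[i:i + size] for i in range(len(sequence) - size + 1)]


def toy_energy(conformation: Torsions) -> float:
    """Hypothetical score: squared distance from idealized alpha-helix torsions.

    A real energy function models many physical and chemical terms;
    this stand-in exists only so the selection step has something to minimize.
    """
    return sum((phi + 57.0) ** 2 + (psi + 47.0) ** 2 for phi, psi in conformation)


def lowest_energy(
    candidates: Sequence[Torsions],
    energy: Callable[[Torsions], float] = toy_energy,
) -> Torsions:
    """Pick the candidate local conformation with the lowest score."""
    return min(candidates, key=energy)


if __name__ == "__main__":
    fragments = split_into_fragments("MKTAYIAKQRQ", size=9)
    helix = [(-57.0, -47.0)] * 3      # idealized helical torsions
    strand = [(-120.0, 130.0)] * 3    # roughly extended torsions
    print(fragments)
    print(lowest_energy([strand, helix]) == helix)
```

In practice the candidate conformations and the scoring function would come from the modeling software itself; the sketch only makes the "short segment, low-energy local conformation" step concrete.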
Sequence-based modeling
This modeling technique compares the gene sequence of an unknown protein with sequences of proteins with known structures. Depending on the degree of similarity between the sequences, the structure of the known protein can be used as a model for solving the structure of the unknown protein. Highly accurate modeling is considered to require at least 50% amino acid sequence identity between the unknown protein and the solved structure. Sequence identity of 30 to 50% gives a model of intermediate accuracy, and sequence identity below 30% gives low-accuracy models. It has been predicted that at least 16,000 protein structures will need to be determined in order for all structural motifs to be represented at least once, which would allow the structure of any unknown protein to be solved accurately through modeling.[3] One disadvantage of this method, however, is that structure is more conserved than sequence, and thus sequence-based modeling may not be the most accurate way to predict protein structures.
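The identity thresholds above can be expressed as a small Python sketch. It assumes the two sequences are already aligned (gaps written as `-`) and computes identity over columns where neither sequence has a gap; other denominator conventions exist, so this choice and the function names are assumptions made for illustration.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over aligned columns in which neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # skip gap columns (assumed convention)
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


def expected_model_accuracy(identity: float) -> str:
    """Map percent identity to the accuracy tiers quoted in the text."""
    if identity >= 50.0:
        return "high"
    if identity >= 30.0:
        return "intermediate"
    return "low"


if __name__ == "__main__":
    identity = percent_identity("AAAA-A", "ABBB-A")
    print(identity, expected_model_accuracy(identity))
```

The tiers are rules of thumb rather than guarantees: because structure is more conserved than sequence, a template below 30% identity may still share the target's fold.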
Threading
Threading bases structural modeling on fold similarities rather than sequence identity. This method may help identify distantly related proteins and can be used to infer molecular functions.
Examples of structural genomics
There are currently a number of ongoing efforts to solve the structures of every protein in a given proteome.
Thermotoga maritima proteome
One current goal of the Joint Center for Structural Genomics (JCSG), a part of the Protein Structure Initiative (PSI), is to solve the structures of all the proteins in Thermotoga maritima, a thermophilic bacterium. T. maritima was selected as a structural genomics target based on its relatively small genome, consisting of 1,877 genes, and on the hypothesis that the proteins expressed by a thermophilic bacterium would be easier to crystallize.
Lesley et al. used Escherichia coli to express all the open reading frames (ORFs) of T. maritima. These proteins were then crystallized, and structures were determined for the successfully crystallized proteins using X-ray crystallography. Among other structures, this structural genomics approach allowed for the determination of the structure of the TM0449 protein, which was found to exhibit a novel fold, as it did not share structural homology with any known protein.[4]
Mycobacterium tuberculosis proteome
The goal of the
The fully sequenced genome of M. tuberculosis has allowed scientists to clone many of these protein targets into expression vectors for purification and structure determination by X-ray crystallography. Studies have identified a number of target proteins for structure determination, including extracellular proteins that may be involved in pathogenesis, iron-regulatory proteins, current drug targets, and proteins predicted to have novel folds. So far, structures have been determined for 708 of the proteins encoded by M. tuberculosis.
Protein structure databases and classifications
- Protein Data Bank (PDB): repository for protein sequence and structural information
- UniProt: provides sequence and functional information
- Structural Classification of Proteins (SCOP): hierarchical-based approach
- Class, Architecture, Topology and Homologous superfamily (CATH): hierarchical-based approach
See also
- Genomics
- Omics
- Structural proteomics
- Protein Structure Initiative
References
Further reading
- Hooft RW, Vriend G, Sander C, Abola EE (May 1996). "Errors in protein structures". Nature. 381 (6580): 272. S2CID 4368507.
- Marsden RL, Lewis TA, Orengo CA (2007). "Towards a comprehensive structural coverage of completed genomes: a structural genomics viewpoint". BMC Bioinformatics. 8: 86. PMID 17349043.
- Baker EN, Arcus VL, Lott JS (2003). "Protein structure prediction and analysis as a tool for functional genomics". Appl. Bioinform. 2 (3 Suppl): S3–10. PMID 15130810.
- Goulding CW, Perry LJ, Anderson D, et al. (September 2003). "Structural genomics of Mycobacterium tuberculosis: a preliminary report of progress at UCLA". Biophys. Chem. 105 (2–3): 361–70. PMID 14499904.
- Skolnick J, Fetrow JS, Kolinski A (March 2000). "Structural genomics and its importance for gene function analysis". Nat. Biotechnol. 18 (3): 283–7. S2CID 2723601.