Structural genomics

Structural genomics seeks to describe the 3-dimensional structure of every protein encoded by a given genome. This genome-based approach allows for a high-throughput method of structure determination by a combination of experimental and modeling approaches. The principal difference between structural genomics and traditional structural prediction is that structural genomics attempts to determine the structure of every protein encoded by the genome, rather than focusing on one particular protein. With full-genome sequences available, structure prediction can be done more quickly through a combination of experimental and modeling approaches, especially because the availability of a large number of sequenced genomes and previously solved protein structures allows scientists to model a protein's structure on the structures of previously solved homologs.

Because protein structure is closely linked with protein function, structural genomics has the potential to inform knowledge of protein function. In addition to elucidating protein functions, structural genomics can be used to identify novel protein folds and potential targets for drug discovery. Structural genomics draws on many approaches to structure determination: experimental methods that start from genomic sequences, modeling based on sequence or structural homology to a protein of known structure, and modeling based on chemical and physical principles for a protein with no homology to any known structure.

As opposed to traditional structural biology, the determination of a protein structure through a structural genomics effort often (but not always) comes before anything is known regarding the protein's function. This raises new challenges in structural bioinformatics, namely determining a protein's function from its 3D structure.

Structural genomics emphasizes high throughput determination of protein structures. This is performed in dedicated centers of structural genomics.

While most structural biologists pursue structures of individual proteins or protein groups, specialists in structural genomics pursue structures of proteins on a genome-wide scale. This implies large-scale cloning, expression and purification. One main advantage of this approach is economy of scale. On the other hand, the scientific value of some of the resulting structures is at times questioned. A Science article from January 2006 analyzes the structural genomics field.^[1]

One advantage of structural genomics efforts, such as the Protein Structure Initiative, is that the scientific community gets immediate access to new structures, as well as to reagents such as clones and protein. A disadvantage is that many of these structures are of proteins of unknown function and do not have corresponding publications. This requires new ways of communicating this structural information to the broader research community. The bioinformatics core of the Joint Center for Structural Genomics (JCSG) has developed a wiki-based approach, The Open Protein Structure Annotation Network (TOPSAN), for annotating protein structures emerging from high-throughput structural genomics centers.

Goals

One goal of structural genomics is to identify novel protein folds. Experimental methods of protein structure determination require proteins that express and/or crystallize well, which may inherently bias the kinds of protein folds that these experimental data elucidate. A genomic, modeling-based approach such as ab initio modeling may be better able to identify novel protein folds than the experimental approaches because it is not limited by experimental constraints.

Protein function depends on 3-D structure and these 3-D structures are more highly conserved than sequences. Thus, the high-throughput structure determination methods of structural genomics have the potential to inform our understanding of protein functions. This also has potential implications for drug discovery and protein engineering.^[2] Furthermore, every protein that is added to the structural database increases the likelihood that the database will include homologous sequences of other unknown proteins. The Protein Structure Initiative (PSI) is a multifaceted effort funded by the National Institutes of Health with various academic and industrial partners that aims to increase knowledge of protein structure using a structural genomics approach and to improve structure-determination methodology.

Methods

Structural genomics takes advantage of completed genome sequences in several ways in order to determine protein structures. The gene sequence of the target protein can be compared to a known sequence, and structural information can then be inferred from the known protein's structure. Structural genomics can be used to predict novel protein folds based on other structural data. It can also take a modeling-based approach that relies on homology between the unknown protein and a solved protein structure.

de novo methods

Completed genome sequences allow every open reading frame (ORF) to be cloned and expressed as protein. These proteins are then purified and their structures determined experimentally, for example by X-ray crystallography or nuclear magnetic resonance (NMR). The whole genome sequence allows for the design of every primer required in order to amplify all of the ORFs, clone them into bacteria, and then express them. By using a whole-genome approach to this traditional method of protein structure determination, all of the proteins encoded by the genome can be expressed at once. In principle, this approach allows for the structural determination of every protein that is encoded by the genome.
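The step from genome sequence to primers can be illustrated with a minimal Python sketch. It is not the pipeline of any particular structural genomics center: it scans only the forward strand, treats any ATG-to-stop stretch as an ORF, and takes the primers directly from the ORF ends without checking melting temperature or adding cloning sites. The function names and the `min_codons` and `length` parameters are hypothetical.

```python
from typing import List, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def find_orfs(dna: str, min_codons: int = 3) -> List[Tuple[int, int]]:
    """Find forward-strand ORFs in all three reading frames.

    Each ORF is reported as (start, end): start is the index of the ATG,
    end is exclusive and includes the stop codon. min_codons counts the
    codons before the stop (simplifying assumption, hypothetical parameter).
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)


def design_primers(dna: str, orf: Tuple[int, int], length: int = 18) -> Tuple[str, str]:
    """Return a naive (forward, reverse) primer pair flanking one ORF."""
    start, end = orf
    forward = dna[start:start + length]
    reverse = reverse_complement(dna[end - length:end])
    return forward, reverse


genome = "CCATGAAACCCGGGTTTTAAGG"  # made-up toy sequence
for orf in find_orfs(genome):
    print(orf, design_primers(genome, orf, length=6))
```

On the toy sequence the scan reports one ORF, spanning positions 2 to 20, and the primer pair is simply its first six bases and the reverse complement of its last six. A real design step would also scan the reverse strand and screen primers for specificity.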

Modeling-based methods

ab initio modeling

This approach uses protein sequence data and the chemical and physical interactions of the encoded amino acids to predict the 3-D structures of proteins with no homology to solved protein structures. One highly successful method for ab initio modeling is the Rosetta program, which divides the protein into short segments and arranges each short polypeptide chain into a low-energy local conformation. Rosetta is available for commercial use and for non-commercial use through its public program, Robetta.
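The idea of splitting a sequence into short segments and searching for low-energy arrangements can be sketched as a toy fragment-assembly loop in Python. This is not Rosetta's code or scoring function: the fragment library here is filled with random torsion angles, the energy is an arbitrary placeholder, and every name (`split_into_fragments`, `toy_energy`, `assemble`) is hypothetical. Only the overall shape, overlapping windows plus Metropolis acceptance of fragment swaps, reflects the general technique.

```python
import math
import random
from typing import Dict, List, Tuple

Angles = Tuple[float, float]  # (phi, psi) backbone torsions in degrees


def split_into_fragments(sequence: str, size: int = 3) -> List[Tuple[int, str]]:
    """Return every overlapping window of `size` residues with its start index."""
    return [(i, sequence[i:i + size]) for i in range(len(sequence) - size + 1)]


def toy_energy(conformation: List[Angles]) -> float:
    """Placeholder energy (assumption): squared distance from (-60, -45)."""
    return sum((phi + 60.0) ** 2 + (psi + 45.0) ** 2 for phi, psi in conformation)


def random_library(fragments, n_candidates: int, rng: random.Random) -> Dict[int, List[List[Angles]]]:
    """Hypothetical fragment library: random torsions for each window.

    A real library would hold conformations taken from solved structures.
    """
    return {
        start: [
            [(rng.uniform(-180, 180), rng.uniform(-180, 180)) for _ in frag]
            for _ in range(n_candidates)
        ]
        for start, frag in fragments
    }


def assemble(sequence: str, steps: int = 2000, temperature: float = 500.0, seed: int = 0):
    """Metropolis Monte Carlo over fragment insertions; returns the best state seen."""
    rng = random.Random(seed)
    fragments = split_into_fragments(sequence)
    if not fragments:
        raise ValueError("sequence is shorter than the fragment size")
    library = random_library(fragments, 10, rng)
    current = [(180.0, 180.0)] * len(sequence)  # start from an extended chain
    current_e = toy_energy(current)
    best, best_e = list(current), current_e
    for _ in range(steps):
        start, _frag = rng.choice(fragments)
        candidate = rng.choice(library[start])
        # Overwrite the torsions of one window with the candidate fragment.
        trial = current[:start] + candidate + current[start + len(candidate):]
        trial_e = toy_energy(trial)
        delta = trial_e - current_e
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_e = trial, trial_e
            if current_e < best_e:
                best, best_e = list(current), current_e
    return best, best_e


best, best_e = assemble("ACDEFGHIK", steps=500)
print(len(best), round(best_e, 1))
```

The sketch tracks the lowest-energy conformation seen rather than the final one, so the returned energy never exceeds that of the starting chain even though individual Metropolis moves may go uphill.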

Sequence-based modeling

This modeling technique compares the gene sequence of an unknown protein with sequences of proteins with known structures. Depending on the degree of similarity between the sequences, the structure of the known protein can be used as a model for solving the structure of the unknown protein. Highly accurate modeling is considered to require at least 50% amino acid sequence identity between the unknown protein and the solved structure. Sequence identity of 30-50% gives a model of intermediate accuracy, and sequence identity below 30% gives low-accuracy models. It has been predicted that at least 16,000 protein structures will need to be determined in order for all structural motifs to be represented at least once, thus allowing the structure of any unknown protein to be solved accurately through modeling.^[3] One disadvantage of this method, however, is that structure is more conserved than sequence, and thus sequence-based modeling may not be the most accurate way to predict protein structures.
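The identity thresholds above can be made concrete with a short Python sketch. It assumes the two sequences have already been aligned (gaps written as "-") and computes identity over the columns where neither sequence has a gap; other definitions of the denominator exist, so this choice is an assumption, as are the function names.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over aligned columns in which neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    matches = sum(a == b for a, b in pairs)
    return 100.0 * matches / len(pairs)


def expected_accuracy(identity: float) -> str:
    """Map percent identity to the accuracy tiers quoted in the text."""
    if identity >= 50.0:
        return "high"
    if identity >= 30.0:
        return "intermediate"
    return "low"


target = "AC-EF"    # made-up aligned target sequence
template = "ACDQF"  # made-up aligned template of known structure
identity = percent_identity(target, template)
print(identity, expected_accuracy(identity))
```

For the made-up pair, three of the four gap-free columns match, giving 75% identity and the "high" tier.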

Threading

Threading bases structural modeling on fold similarities rather than sequence identity. This method may help identify distantly related proteins and can be used to infer molecular functions.
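As a rough illustration of scoring a sequence against folds rather than against other sequences, the following Python sketch rates how well each residue suits the environment of the template position it is placed on. It is a deliberately simplified stand-in for threading: the templates are invented, positions are labeled only as buried ("B") or exposed ("E"), the hydrophobic residue set is a coarse approximation, and no gaps are allowed. Real threading programs use richer potentials and alignment with gaps.

```python
from typing import Dict, Tuple

# Coarse hydrophobic set (simplifying assumption, not a standard scale).
HYDROPHOBIC = set("AVILMFWC")


def thread_score(sequence: str, environments: str) -> int:
    """Score a gapless placement of `sequence` onto a template.

    `environments` has one character per template position: "B" for buried,
    "E" for exposed. Hydrophobic residues in buried positions and polar
    residues in exposed positions score +1; mismatches score -1.
    """
    if len(sequence) != len(environments):
        raise ValueError("sequence and template must have equal length")
    score = 0
    for residue, env in zip(sequence, environments):
        fits = (residue in HYDROPHOBIC) == (env == "B")
        score += 1 if fits else -1
    return score


def best_template(sequence: str, templates: Dict[str, str]) -> Tuple[str, int]:
    """Return the name and score of the best-fitting template of matching length."""
    scores = {
        name: thread_score(sequence, env)
        for name, env in templates.items()
        if len(env) == len(sequence)
    }
    if not scores:
        raise ValueError("no template matches the sequence length")
    name = max(scores, key=scores.get)
    return name, scores[name]


# Hypothetical fold templates, for illustration only.
templates = {"fold_a": "BEBEBE", "fold_b": "EEEBBB"}
print(best_template("LKVEIS", templates))
```

In this example the alternating hydrophobic and polar pattern of the query fits "fold_a" at every position, so that template is selected even though no sequence comparison was made.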

Examples of structural genomics

A number of efforts are under way to solve the structures of every protein in a given proteome.

Thermotoga maritima proteome

One current goal of the Joint Center for Structural Genomics (JCSG), a part of the Protein Structure Initiative (PSI), is to solve the structures of all the proteins in Thermotoga maritima, a thermophilic bacterium. T. maritima was selected as a structural genomics target based on its relatively small genome consisting of 1,877 genes and the hypothesis that the proteins expressed by a thermophilic bacterium would be easier to crystallize.

Lesley et al. used Escherichia coli to express all the open reading frames (ORFs) of T. maritima. These proteins were then crystallized, and structures were determined for the successfully crystallized proteins using X-ray crystallography. Among other structures, this structural genomics approach allowed for the determination of the structure of the TM0449 protein, which was found to exhibit a novel fold as it did not share structural homology with any known protein.^[4]

Mycobacterium tuberculosis proteome

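The goal of structural genomics work on Mycobacterium tuberculosis is to determine the structures of potential drug targets in the bacterium that causes tuberculosis, a need made more pressing by multi-drug-resistant tuberculosis.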

The fully sequenced genome of M. tuberculosis has allowed scientists to clone many of these protein targets into expression vectors for purification and structure determination by X-ray crystallography. Studies have identified a number of target proteins for structure determination, including extracellular proteins that may be involved in pathogenesis, iron-regulatory proteins, current drug targets, and proteins predicted to have novel folds. So far, structures have been determined for 708 of the proteins encoded by M. tuberculosis.

Protein structure databases and classifications

Protein Data Bank (PDB): repository for protein sequence and structural information
UniProt: provides sequence and functional information
Structural Classification of Proteins (SCOP): hierarchical-based approach
Class, Architecture, Topology and Homologous superfamily (CATH): hierarchical-based approach

References

[1] S2CID 800902.

[2] PMID 12413557.

[3] S2CID 7193705.

[4] PMID 12193646.

Further reading

Hooft RW, Vriend G, Sander C, Abola EE (May 1996). "Errors in protein structures". Nature. 381 (6580): 272. S2CID 4368507.

Marsden RL, Lewis TA, Orengo CA (2007). "Towards a comprehensive structural coverage of completed genomes: a structural genomics viewpoint". BMC Bioinformatics. 8: 86. PMID 17349043.

Baker EN, Arcus VL, Lott JS (2003). "Protein structure prediction and analysis as a tool for functional genomics". Appl. Bioinform. 2 (3 Suppl): S3–10. PMID 15130810.

Goulding CW, Perry LJ, Anderson D, et al. (September 2003). "Structural genomics of Mycobacterium tuberculosis: a preliminary report of progress at UCLA". Biophys. Chem. 105 (2–3): 361–70. PMID 14499904.

Skolnick J, Fetrow JS, Kolinski A (March 2000). "Structural genomics and its importance for gene function analysis". Nat. Biotechnol. 18 (3): 283–7. S2CID 2723601.

External links

Protein Structure Initiative (PSI)

PSI Structural Biology Knowledgebase: A Nature Gateway

