Protein function prediction
Protein function prediction methods are techniques that
Generally, function can be thought of as "anything that happens to or through a protein".
While techniques such as
Homology-based methods
However, closely related proteins do not always share the same function.
There is no hard sequence-similarity threshold for "safe" function prediction; many proteins of barely detectable sequence similarity have the same function, while others (such as Gal1 and Gal3) are highly similar but have evolved different functions. As a rule of thumb, sequences that are more than 30-40% identical are usually considered to have the same or a very similar function.
For enzymes, predicting specific functions is especially difficult, because catalysis depends on only a few key residues in the active site; very different sequences can therefore have very similar activities. Conversely, even at sequence identities of 70% or greater, 10% of enzyme pairs have different substrates, and differences in the actual enzymatic reaction are not uncommon near 50% sequence identity.[8][9]
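The rule of thumb above can be illustrated with a minimal sketch that computes percent identity from a pairwise alignment and attaches a cautious label. The function names, the gap convention ("-"), and the label wording are assumptions made for this illustration; the 30% and 40% cut-offs simply restate the rule of thumb and are not hard limits.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over aligned columns in which neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # gap columns are ignored (one of several possible conventions)
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


def transfer_label(identity: float, is_enzyme: bool = False) -> str:
    """Hypothetical labels restating the 30-40% rule of thumb; not a validated classifier."""
    if identity < 30.0:
        return "below rule-of-thumb range"
    if identity < 40.0:
        return "borderline"
    if is_enzyme:
        # Enzymes can differ in substrate or reaction even at high identity.
        return "similar function plausible; check substrate and reaction"
    return "similar function plausible"


if __name__ == "__main__":
    identity = percent_identity("MKT-AV", "MKSQAV")
    print(identity, transfer_label(identity, is_enzyme=True))
```

Because identity alone says nothing about active-site residues, a label like this is at best a first filter before more specific evidence is examined.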
Sequence motif-based methods
The development of protein domain databases such as
Structure-based methods
Because 3D protein structure is generally better conserved than protein sequence, structural similarity is a good indicator of similar function in two or more proteins.[6][12] Many programs have been developed to screen a known protein structure against the Protein Data Bank[16] and report similar structures, for example FATCAT (Flexible structure AlignmenT by Chaining AFPs (Aligned Fragment Pairs) with Twists),[17] CE (combinatorial extension),[18] and DeepAlign (protein structure alignment beyond spatial proximity).[19] Similarly, the main protein databases, such as UniProt, have built-in tools to search any given protein sequence against structure databases, and link to related proteins of known structure.
Protein structure prediction
To deal with the situation that many protein sequences have no solved structures, some function prediction servers such as
Computational solvent mapping
One of the challenges involved in protein function prediction is discovery of the active site. This is complicated by the fact that certain active sites are not formed, and so essentially do not exist, until the protein undergoes conformational changes brought on by the binding of small molecules. Most protein structures have been determined by X-ray crystallography, which requires a purified protein crystal. As a result, existing structural models are generally of a purified protein and as such lack the conformational changes that are created when the protein interacts with small molecules.[26]
Computational solvent mapping utilizes probes (small organic molecules) that are computationally 'moved' over the surface of the protein, searching for sites where they tend to cluster. Multiple different probes are generally applied, with the goal of obtaining a large number of different protein-probe conformations. The generated clusters are then ranked by their average free energy. After multiple probes have been mapped, the site of the protein where relatively large numbers of clusters form typically corresponds to an active site on the protein.[26]
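The cluster-and-rank procedure described above can be sketched as follows. This is a toy illustration, not the algorithm of any particular mapping server: the greedy distance-based clustering, the cut-off values, and all function names are assumptions, and the probe poses and energies are taken as given input rather than computed.

```python
from math import dist
from statistics import mean


def greedy_cluster(points, cutoff):
    """Group point indices: each point joins the first cluster whose seed lies within cutoff."""
    clusters = []
    for i, p in enumerate(points):
        for members in clusters:
            if dist(points[members[0]], p) <= cutoff:
                members.append(i)
                break
        else:
            clusters.append([i])  # no nearby seed found, start a new cluster
    return clusters


def centroid(points):
    """Arithmetic mean of a list of equal-length coordinate tuples."""
    n = len(points)
    return tuple(sum(axis) / n for axis in zip(*points))


def rank_probe_clusters(poses, cutoff=2.0):
    """poses: list of (xyz, free_energy) for one probe.
    Returns (centroid, average energy) per cluster, lowest average energy first."""
    coords = [xyz for xyz, _ in poses]
    ranked = []
    for members in greedy_cluster(coords, cutoff):
        center = centroid([coords[i] for i in members])
        ranked.append((center, mean(poses[i][1] for i in members)))
    return sorted(ranked, key=lambda c: c[1])


def consensus_site(poses_by_probe, cutoff=2.0, site_radius=4.0, top_n=3):
    """Return (centroid, cluster count) of the site where most top-ranked probe clusters coincide."""
    centers = []
    for poses in poses_by_probe.values():
        centers += [c for c, _ in rank_probe_clusters(poses, cutoff)[:top_n]]
    sites = greedy_cluster(centers, site_radius)
    best = max(sites, key=len)
    return centroid([centers[i] for i in best]), len(best)


if __name__ == "__main__":
    toy = {
        "ethanol": [((0.0, 0.0, 0.0), -5.0), ((0.5, 0.0, 0.0), -4.0), ((10.0, 0.0, 0.0), -1.0)],
        "acetone": [((0.2, 0.1, 0.0), -6.0), ((20.0, 0.0, 0.0), -2.0)],
    }
    print(consensus_site(toy))
```

In the toy data, the top clusters of both probes fall near the origin, so that location is reported as the consensus site.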
This technique is a computational adaptation of 'wet lab' work from 1996. It was discovered that ascertaining the structure of a protein while it is suspended in different solvents and then superimposing those structures on one another produces data where the organic solvent molecules (that the proteins were suspended in) typically cluster at the protein's active site. This work was carried out as a response to realizing that water molecules are visible in the electron density maps produced by
Genome context-based methods
Many of the newer methods for protein function prediction are not based on comparison of sequence or structure as above, but on some type of correlation between novel genes/proteins and those that already have annotations. Several methods have been developed to predict gene function based on the local genomic or phylogenomic context and structure of genes:
Phylogenetic profiling is based on the observation that two or more proteins with the same pattern of presence or absence in many different genomes most likely have a functional link.[12][28] Whereas homology-based methods can often be used to identify molecular functions of a protein, context-based approaches can be used to predict cellular function, or the biological process in which a protein acts.[3][28] For example, proteins involved in the same metabolic pathway are likely to be either present together in a genome or absent altogether, suggesting that these genes work together in a functional context.
Operons are clusters of genes that are transcribed together. Co-transcription data, together with the fact that the order of genes in operons is often conserved across many bacteria, indicate that these genes act together.[29]
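Phylogenetic profiling, as described above, can be illustrated with a minimal sketch that compares presence/absence vectors across genomes. The similarity measure (fraction of genomes in which two profiles agree), the threshold, and the protein names are all assumptions chosen for illustration; published methods use a variety of scores.

```python
def profile_similarity(a, b):
    """Fraction of genomes in which two presence/absence profiles agree."""
    if len(a) != len(b):
        raise ValueError("profiles must cover the same genomes")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def linked_proteins(query, profiles, min_similarity=0.9):
    """Return (name, similarity) for proteins whose profile resembles the query's, best first."""
    q = profiles[query]
    hits = [
        (name, profile_similarity(q, p))
        for name, p in profiles.items()
        if name != query
    ]
    return sorted(
        (h for h in hits if h[1] >= min_similarity),
        key=lambda h: -h[1],
    )


if __name__ == "__main__":
    # Hypothetical profiles over five genomes: 1 = gene present, 0 = absent.
    profiles = {
        "unknownX": (1, 0, 1, 1, 0),
        "pathA1": (1, 0, 1, 1, 0),
        "pathA2": (1, 0, 1, 0, 0),
        "ribosomal": (1, 1, 1, 1, 1),
    }
    print(linked_proteins("unknownX", profiles, min_similarity=0.8))
```

Proteins present in every genome, such as the hypothetical "ribosomal" entry, carry little signal in this scheme, because their profile does not covary with anything.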
Gene expression and location-based methods
In
Genes involved in similar functions are also often co-transcribed, so that an unannotated protein can often be predicted to have a function related to that of the proteins with which it is co-expressed.
With the accumulation of RNA-seq data, from which expression profiles of alternatively spliced isoforms can be estimated, machine learning algorithms have also been developed for predicting and differentiating functions at the isoform level.[36] This represents an emerging research area in function prediction, which integrates large-scale, heterogeneous genomic data to infer isoform-specific functions.[37]
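One simple way to make the co-expression idea concrete is to correlate expression profiles across conditions. The sketch below uses the Pearson correlation coefficient; the choice of measure, the threshold, and the gene names are assumptions for illustration only.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    if len(x) != len(y):
        raise ValueError("profiles must have equal length")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # a flat profile carries no co-expression signal
    return cov / (norm_x * norm_y)


def coexpressed_partners(query, expression, min_r=0.8):
    """Return (gene, r) for genes whose profile correlates with the query's, best first."""
    q = expression[query]
    hits = [
        (gene, pearson(q, profile))
        for gene, profile in expression.items()
        if gene != query
    ]
    return sorted((h for h in hits if h[1] >= min_r), key=lambda h: -h[1])


if __name__ == "__main__":
    # Hypothetical expression levels across four conditions.
    expression = {
        "unknown": [1.0, 2.0, 3.0, 4.0],
        "geneA": [2.0, 4.0, 6.0, 8.0],
        "geneB": [4.0, 3.0, 2.0, 1.0],
    }
    print(coexpressed_partners("unknown", expression))
```

The annotations of the returned partners would then serve as candidate functions for the unannotated gene.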
Network-based methods
Guilt-by-association algorithms may be used to produce a functional association network for a given target group of genes or proteins.[38] These networks serve as a representation of the evidence for shared or similar function within a group of genes: nodes represent genes or proteins, and edges between them represent evidence of shared function.[39]
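A minimal sketch of guilt by association on such a network is weighted neighbour voting: each annotated neighbour of the target contributes its edge weight to the functions it carries. The data layout, function name, and example genes below are assumptions for illustration; real systems use more elaborate propagation schemes.

```python
from collections import defaultdict


def predict_functions(target, edges, annotations):
    """Score candidate functions for target by summing edge weights to annotated neighbours.

    edges: {node: {neighbour: weight}}; annotations: {node: set of function terms}.
    Returns (term, score) pairs, highest score first (ties broken alphabetically).
    """
    scores = defaultdict(float)
    for neighbour, weight in edges.get(target, {}).items():
        for term in annotations.get(neighbour, ()):
            scores[term] += weight
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    # Hypothetical network around an unannotated gene "YFG1".
    edges = {"YFG1": {"A": 0.5, "B": 0.25, "C": 0.125}}
    annotations = {
        "A": {"DNA repair"},
        "B": {"DNA repair"},
        "C": {"transport"},
    }
    print(predict_functions("YFG1", edges, annotations))
```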
Integrated networks
Several networks based on different data sources can be combined into a composite network, which can then be used by a prediction algorithm to annotate candidate genes or proteins.[40] For example, the developers of the bioPIXIE system used a wide variety of Saccharomyces cerevisiae (yeast) genomic data to produce a composite functional network for that species.[41] This resource allows the visualization of known networks representing biological processes, as well as the prediction of novel components of those networks. Many algorithms have been developed to predict function based on the integration of several data sources (such as genomic, proteomic, and protein interaction data), and testing on previously annotated genes indicates a high level of accuracy.[39][42] Disadvantages of some function prediction algorithms have included a lack of accessibility and the time required for analysis. Faster, more accurate algorithms such as GeneMANIA (multiple association network integration algorithm) have, however, been developed in recent years[40] and are publicly available on the web, indicating the future direction of function prediction.
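The idea of a composite network can be sketched as a weighted sum of edge scores from several source networks. This is only an illustration under simple assumptions: the per-network weights are fixed by hand here, whereas systems such as GeneMANIA or bioPIXIE derive their integration from data, and none of the names below come from those tools.

```python
def combine_networks(networks, weights):
    """Merge several undirected networks into one composite network.

    networks: {source name: {(node_a, node_b): score}}
    weights:  {source name: weight}; sources without a weight are ignored.
    Returns {frozenset({node_a, node_b}): weighted sum of scores}.
    """
    composite = {}
    for source, edges in networks.items():
        w = weights.get(source, 0.0)
        if w == 0.0:
            continue
        for pair, score in edges.items():
            key = frozenset(pair)  # undirected: (A, B) and (B, A) are the same edge
            composite[key] = composite.get(key, 0.0) + w * score
    return composite


if __name__ == "__main__":
    # Hypothetical source networks and hand-picked weights.
    networks = {
        "coexpression": {("A", "B"): 1.0, ("B", "C"): 0.5},
        "interaction": {("B", "A"): 0.5},
    }
    weights = {"coexpression": 0.5, "interaction": 0.5}
    for pair, score in combine_networks(networks, weights).items():
        print(sorted(pair), score)
```

The composite edge scores could then feed a neighbour-based prediction step of the kind used for single networks.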
Tools and databases for protein function prediction
STRING: Web tool that integrates various data sources for function prediction.[43]
VisANT: Visual analysis of networks and integrative visual data-mining.[44]
Mantis: A consensus-driven function prediction tool that dynamically integrates multiple reference databases.[45]
See also
- Gene prediction
- Protein–protein interaction prediction
- Protein structure prediction
- Structural genomics
- Functional genomics
References
- S2CID 8800506.
- PMID 10802651.
- S2CID 18032660.
- PMID 21330331.
- S2CID 42949514.
- S2CID 27123114.
- PMID 10737789.
- PMID 12051862.
- PMID 14568541.
- PMID 19920124.
- PMID 23161684.
- S2CID 8932206.
- PMID 19858104.
- PMID 11099261.
- S2CID 16509924.
- PMID 10592235.
- PMID 15215455.
- PMID 9796821.
- PMID 23486213.
- PMID 21155016.
- PMID 23514271.
- PMID 26773655.
- S2CID 26066208.
- PMID 14681376.
- PMID 25343578.
- PMID 16878974.
- S2CID 20273975.
- S2CID 4398864.
- PMID 21051344.
- PMID 10427000.
- PMID 10077608.
- PMID 12695325.
- PMID 10613842.
- PMID 22824328.
- PMID 23936626.
- PMID 24244129.
- PMID 24951248.
- S2CID 3009359.
- PMID 17353930.
- PMID 18613948.
- PMID 16420673.
- PMID 18613946.
- PMID 27924014.
- PMID 27081850.
- PMID 34076241.
External links
- The dcGO database
- Protein Data Bank
- Catalytic Site Atlas
- RaptorX Server for model-assisted protein function prediction
- Blast2GO, high-throughput tool for protein function prediction and functional annotation (webpage).