UniProt
Primary citation | UniProt Consortium[1] |
---|---|
Access | |
Data format | Custom flat file, FASTA, GFF, RDF, XML. |
Website | www www |
Download URL | www |
Web service URL | Yes – Java API and REST |
Tools | |
Web | Advanced search, BLAST, ClustalO, bulk retrieval/download, ID mapping |
Miscellaneous | |
License | Creative Commons Attribution-NoDerivs |
Versioning | Yes |
Data release frequency | 8 weeks |
Curation policy | Yes – manual and automatic. Rules for automatic annotation generated by database curators and computational algorithms. |
Bookmarkable entities | Yes – both individual protein entries and searches |
UniProt is a freely accessible database of protein sequence and functional information.
The UniProt consortium
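The UniProt consortium comprises the European Bioinformatics Institute (EBI), the Swiss Institute of Bioinformatics (SIB), and the Protein Information Resource (PIR).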
The roots of the UniProt databases
Each consortium member is heavily involved in protein database maintenance and annotation. Before UniProt was formed, EBI and SIB together produced the Swiss-Prot and TrEMBL databases, while PIR produced the Protein Sequence Database (PIR-PSD).
Swiss-Prot was created in 1986 by
The consortium members pooled their overlapping resources and expertise, and launched UniProt in December 2003.[10]
Organization of the UniProt databases
UniProt provides four core databases: UniProtKB (with sub-parts Swiss-Prot and TrEMBL), UniParc, UniRef and Proteome.
UniProtKB
UniProt Knowledgebase (UniProtKB) is a protein database partially curated by experts, consisting of two sections: UniProtKB/Swiss-Prot (containing reviewed, manually annotated entries) and UniProtKB/TrEMBL (containing unreviewed, automatically annotated entries).[11] As of 22 February 2023, release "2023_01" of UniProtKB/Swiss-Prot contains 569,213 sequence entries (comprising 205,728,242 amino acids abstracted from 291,046 references) and release "2023_01" of UniProtKB/TrEMBL contains 245,871,724 sequence entries (comprising 85,739,380,194 amino acids).[12]
UniProtKB/Swiss-Prot
UniProtKB/Swiss-Prot is a manually annotated, non-redundant protein sequence database. It combines information extracted from scientific literature and
Sequences from the same gene and the same species are merged into the same database entry. Differences between sequences are identified, and their cause is documented (for example alternative splicing, natural variation, incorrect initiation sites, incorrect exon boundaries, frameshifts, unidentified conflicts). A range of sequence analysis tools is used in the annotation of UniProtKB/Swiss-Prot entries. Computer predictions are manually evaluated, and relevant results are selected for inclusion in the entry. These predictions include post-translational modifications, transmembrane domains and topology, signal peptides, domain identification, and protein family classification.[13][14]
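The merging step described above can be illustrated with a minimal sketch. It is not the curators' actual workflow: the record keys (`source`, `gene`, `species`, `sequence`, `cause`) are hypothetical, the sequences are toy data, and choosing the longest sequence as the displayed one is an assumption made only to keep the example short.

```python
from collections import defaultdict


def merge_records(records):
    """Merge records that share a gene and a species into one entry.

    Each record is a dict with the hypothetical keys "source", "gene",
    "species", "sequence" and, optionally, "cause" (a curator's note
    explaining why the sequence differs).
    """
    groups = defaultdict(list)
    for record in records:
        groups[(record["gene"], record["species"])].append(record)

    entries = []
    for (gene, species), group in groups.items():
        # Assumption: the longest sequence is shown as the reference sequence.
        reference = max(group, key=lambda r: len(r["sequence"]))
        # Every record whose sequence differs is listed with its documented cause.
        differences = [
            {"source": r["source"], "cause": r.get("cause", "unidentified conflict")}
            for r in group
            if r["sequence"] != reference["sequence"]
        ]
        entries.append({
            "gene": gene,
            "species": species,
            "sequence": reference["sequence"],
            "sources": [r["source"] for r in group],
            "differences": differences,
        })
    return entries


toy_records = [
    {"source": "src1", "gene": "GENE1", "species": "species A", "sequence": "MKTAYIAKQR"},
    {"source": "src2", "gene": "GENE1", "species": "species A", "sequence": "MKTAYIAKQR"},
    {"source": "src3", "gene": "GENE1", "species": "species A", "sequence": "MKTAY",
     "cause": "alternative splicing"},
    {"source": "src4", "gene": "GENE1", "species": "species B", "sequence": "MKSAYIAKQR"},
]

for entry in merge_records(toy_records):
    print(entry["gene"], entry["species"], entry["sources"], entry["differences"])
```

The same gene in two species yields two entries, while the three records for one species collapse into a single entry with one documented difference.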
Relevant publications are identified by searching databases such as PubMed. The full text of each paper is read, and information is extracted and added to the entry. Annotation arising from the scientific literature includes, but is not limited to:[10][13][14]
- Protein and gene names
- Function
- Enzyme-specific information such as catalytic activity, cofactors and catalytic residues
- Subcellular location
- Protein-protein interactions
- Pattern of expression
- Locations and roles of significant domains and sites
- Substrate- and cofactor-binding sites
- Protein variant forms produced by natural genetic variation, proteolytic processing, and post-translational modification
Annotated entries undergo quality assurance before inclusion into UniProtKB/Swiss-Prot. When new data becomes available, entries are updated.
UniProtKB/TrEMBL
UniProtKB/TrEMBL contains high-quality computationally analyzed records, which are enriched with automatic annotation. It was introduced in response to increased dataflow resulting from genome projects, as the time- and labour-consuming manual annotation process of UniProtKB/Swiss-Prot could not be broadened to include all available protein sequences.
UniParc
UniProt Archive (UniParc) is a comprehensive and non-redundant database, which contains all the protein sequences from the main, publicly available protein sequence databases.[18] Proteins may exist in several different source databases, and in multiple copies in the same database. To avoid redundancy, UniParc stores each unique sequence only once. Identical sequences are merged, regardless of whether they are from the same or different species. Each sequence is given a stable and unique identifier (UPI), making it possible to identify the same protein from different source databases. UniParc contains only protein sequences, with no annotation. Database cross-references in UniParc entries allow further information about the protein to be retrieved from the source databases. When sequences in the source databases change, these changes are tracked by UniParc and a history of all changes is archived.
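The storage model described above (one record per unique sequence, cross-references back to the source databases, and an archived history) can be sketched as follows. This is an illustration only, not UniParc's implementation: the class and method names are hypothetical, the database names and accessions are made up, and the identifier format is only meant to resemble a UPI.

```python
from dataclasses import dataclass, field


@dataclass
class ArchiveEntry:
    upi: str
    sequence: str
    # (database, accession) -> True while that source still carries this sequence
    cross_refs: dict = field(default_factory=dict)
    # Chronological log of (database, accession, event) tuples
    history: list = field(default_factory=list)


class SequenceArchive:
    """Toy non-redundant archive: each unique sequence is stored only once."""

    def __init__(self):
        self._by_sequence = {}  # sequence -> ArchiveEntry
        self._by_upi = {}       # identifier -> ArchiveEntry
        self._current = {}      # (database, accession) -> ArchiveEntry
        self._next_id = 1

    def entry(self, upi):
        return self._by_upi[upi]

    def load(self, database, accession, sequence):
        """Register a source record and return the identifier of its sequence."""
        sequence = sequence.upper()
        entry = self._by_sequence.get(sequence)
        if entry is None:
            # Assumption: identifiers are "UPI" plus a zero-padded hex counter.
            entry = ArchiveEntry(f"UPI{self._next_id:010X}", sequence)
            self._next_id += 1
            self._by_sequence[sequence] = entry
            self._by_upi[entry.upi] = entry

        key = (database, accession)
        previous = self._current.get(key)
        if previous is not entry:
            if previous is not None:
                # The source record changed: keep the old link, marked inactive.
                previous.cross_refs[key] = False
                previous.history.append((database, accession, "removed"))
            entry.cross_refs[key] = True
            entry.history.append((database, accession, "added"))
            self._current[key] = entry
        return entry.upi


archive = SequenceArchive()
first = archive.load("DB_A", "X1", "MKTAYIAK")
second = archive.load("DB_B", "Y9", "mktayiak")   # same sequence, other source
changed = archive.load("DB_A", "X1", "MKTAYIAR")  # source record was revised
print(first, second, changed)
print(archive.entry(first).history)
```

Keying the store on the sequence itself is what makes identical sequences from different sources, or different species, collapse into one entry.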
Source databases
Currently UniParc contains protein sequences from the following publicly available databases:
- DDBJ/GenBank nucleotide sequence databases
- Ensembl
- European Patent Office (EPO)
- FlyBase: the primary repository of genetic and molecular data for the insect family Drosophilidae (FlyBase)
- H-Invitational Database (H-Inv)
- International Protein Index (IPI)
- Japan Patent Office (JPO)
- Protein Information Resource (PIR-PSD)
- Protein Data Bank (PDB)
- Protein Research Foundation (PRF)[19]
- RefSeq
- Saccharomyces Genome Database (SGD)
- The Arabidopsis Information Resource (TAIR)
- TROME[20]
- US Patent Office (USPTO)
- UniProtKB/Swiss-Prot, UniProtKB/Swiss-Prot protein isoforms, UniProtKB/TrEMBL
- Vertebrate and Genome Annotation Database (VEGA)
- WormBase
UniRef
The UniProt Reference Clusters (UniRef) consist of three databases of clustered sets of protein sequences from UniProtKB and selected UniParc records.[21] The UniRef100 database combines identical sequences and sequence fragments (from any organism) into a single UniRef entry. The sequence of a representative protein, the accession numbers of all the merged entries and links to the corresponding UniProtKB and UniParc records are displayed. UniRef100 sequences are clustered using the CD-HIT algorithm to build UniRef90 and UniRef50.[21][22] Each cluster is composed of sequences that have at least 90% or 50% sequence identity, respectively, to the longest sequence. Clustering sequences significantly reduces database size, enabling faster sequence searches.
UniRef is available from the UniProt FTP site.
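The idea of clustering sequences around the longest member at a fixed identity threshold can be sketched with a simple greedy procedure. This is not CD-HIT itself and not how UniRef is built in production: the function names are hypothetical, the sequences are toy data, and the identity measure is a crude stand-in based on Python's `difflib` rather than a real sequence alignment.

```python
from difflib import SequenceMatcher


def identity(candidate, representative):
    """Rough identity: matched residues divided by the candidate's length.

    Stand-in for an alignment-based identity; real tools align the sequences.
    """
    if not candidate:
        return 0.0
    matcher = SequenceMatcher(None, candidate, representative, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(candidate)


def greedy_cluster(sequences, threshold):
    """Cluster {id: sequence} greedily, longest sequences first.

    Each sequence joins the first cluster whose representative it matches at
    or above the threshold; otherwise it starts a new cluster and becomes
    that cluster's representative.
    """
    order = sorted(sequences, key=lambda seq_id: (-len(sequences[seq_id]), seq_id))
    clusters = []  # list of (representative id, [member ids])
    for seq_id in order:
        for rep_id, members in clusters:
            if identity(sequences[seq_id], sequences[rep_id]) >= threshold:
                members.append(seq_id)
                break
        else:
            clusters.append((seq_id, [seq_id]))
    return clusters


toy = {
    "s1": "MKTAYIAKQRQISFVKSHFS",  # reference
    "s2": "MKTAYIAKQRQISFVKSHFA",  # 1 substitution out of 20
    "s3": "MKTWYIAWQRQWSFVWSHFW",  # 5 substitutions out of 20
    "s4": "GGGGGGGGGGGGGGGGGGGG",  # unrelated
}

print(greedy_cluster(toy, 0.9))
print(greedy_cluster(toy, 0.5))
```

Lowering the threshold from 0.9 to 0.5 lets the more divergent sequence join the first cluster, which is why a 50% clustering has fewer, larger clusters than a 90% clustering of the same input.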
Funding
UniProt is funded by grants from the National Human Genome Research Institute, the National Institutes of Health (NIH), the European Commission, the Swiss Federal Government through the Federal Office of Education and Science, NCI-caBIG, and the US Department of Defense.[11]
References
- PMID 25348405.
- Dayhoff, Margaret O. (1965). Atlas of protein sequence and structure. Silver Spring, Md: National Biomedical Research Foundation.
- "2002 Release: NHGRI Funds Global Protein Database". National Human Genome Research Institute (NHGRI). Archived from the original on 24 September 2015. Retrieved 14 April 2018.
- PMID 12230036.
- PMID 12520019.
- PMID 12520024.
- PMID 8594581.
- PMID 10812477.
- ISSN 1660-9824.
- PMID 15036160.
- PMID 19843607.
- "UniProtKB/Swiss-Prot Release 2023_01 statistics". web.expasy.org. Retrieved 31 March 2023.
- "How do we manually annotate a UniProtKB entry?". UniProt. September 21, 2011. Archived from the original on Dec 13, 2013. Retrieved 14 April 2018.
- PMID 14681372.
- "Where do the UniProtKB protein sequences come from?". UniProt. September 21, 2011. Archived from the original on Dec 15, 2013. Retrieved 14 April 2018.
- from the original on 30 Mar 2024 – via PMC.
- Hassabis, Demis (22 July 2022). "Putting the power of AlphaFold into the world's hands". DeepMind. Archived from the original on 24 July 2021. Retrieved 24 July 2021.
- (PDF) from the original on Mar 30, 2024.
- "Protein Research Foundation".
- ftp://ftp.isrec.isb-sib.ch/pub/databases/trome[permanent dead link]
- PMID 17379688.
- PMID 11294794.