Sequence clustering

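In bioinformatics, sequence clustering algorithms attempt to group biological sequences that are somehow related. The sequences can be of genomic, "transcriptomic" (ESTs) or protein origin.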

Some clustering algorithms use single-linkage clustering, constructing a transitive closure of sequences with a similarity over a particular threshold. UCLUST^[1] and CD-HIT^[2] use a greedy algorithm that identifies a representative sequence for each cluster and assigns a new sequence to that cluster if it is sufficiently similar to the representative; if a sequence matches no existing representative, it becomes the representative of a new cluster. The similarity score is often based on sequence alignment. Sequence clustering is often used to make a non-redundant set of representative sequences.
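The greedy representative-based scheme described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used by UCLUST or CD-HIT: the function names `similarity` and `greedy_cluster` are hypothetical, the similarity score is a stand-in from Python's `difflib` rather than an alignment-based identity, and processing the longest sequences first is a choice made for this sketch.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1] (assumption: real tools score an alignment)."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def greedy_cluster(sequences, threshold=0.9):
    """Greedy clustering: each sequence joins the first cluster whose
    representative it matches, otherwise it founds a new cluster."""
    clusters = []  # list of (representative, members) pairs
    # Longest first, so representatives tend to be the longest sequences
    # (a choice of this sketch; ties keep their input order).
    for seq in sorted(sequences, key=len, reverse=True):
        for representative, members in clusters:
            if similarity(seq, representative) >= threshold:
                members.append(seq)
                break
        else:
            # No representative was similar enough: start a new cluster.
            clusters.append((seq, [seq]))
    return clusters


if __name__ == "__main__":
    demo = ["ACGTACGTAC", "ACGTACGTAA", "TTTTGGGGCC"]
    for representative, members in greedy_cluster(demo, threshold=0.8):
        print(representative, members)
```

Because each sequence is compared only against cluster representatives, not against every other sequence, the list of representatives doubles as a non-redundant set at the chosen threshold.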

Sequence clusters are often synonymous with (but not identical to) protein families. Determining a representative tertiary structure for each sequence cluster is the aim of many structural genomics initiatives.

Sequence clustering algorithms and packages

CD-HIT^[2]
UCLUST in USEARCH^[1]
Starcode:^[3] a fast sequence clustering algorithm based on exact all-pairs search.^[4]
OrthoFinder:^[5] a fast, scalable and accurate method for clustering proteins into gene families (orthogroups)^[6]^[7]
Linclust:^[8] the first algorithm whose runtime scales linearly with input set size; part of the MMseqs2^[9] software suite for fast, sensitive searching and clustering of large sequence sets
TribeMCL: a method for clustering proteins into related groups^[10]
BAG: a graph theoretic sequence clustering algorithm^[11]
JESAM:^[12] Open source parallel scalable DNA alignment engine with optional clustering software component
UICluster:^[13] Parallel Clustering of EST (Gene) Sequences
BLASTClust: single-linkage clustering with BLAST^[14]
Clusterer:^[15] extendable java application for sequence grouping and cluster analyses
PATDB: a program for rapidly identifying perfect substrings
nrdb:^[16] a program for merging trivially redundant (identical) sequences
CluSTr:^[17] a single-linkage protein sequence clustering database built from Smith-Waterman sequence similarities; covers over 7 million sequences, including UniProt and IPI
ICAtools:^[18] an early DNA clustering package with many algorithms useful for artifact discovery or EST clustering
skipredundant:^[19] an EMBOSS tool to remove redundant sequences from a set
CLUSS:^[20] an algorithm to identify groups of structurally, functionally, or evolutionarily related hard-to-align protein sequences; available through the CLUSS webserver^[21]
CLUSS2:^[22] an algorithm for clustering families of hard-to-align protein sequences with multiple biological functions; available through the CLUSS2 webserver^[21]
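Several entries in this list (BLASTClust and CluSTr, for example) are described as single-linkage methods. As a rough illustration of that idea, the sketch below links every pair of sequences whose similarity reaches a threshold and reports the connected groups, using a small union-find structure. It does not reproduce any of the listed packages: the names `similarity` and `single_linkage` are hypothetical, the `difflib` score stands in for BLAST or Smith-Waterman similarities, and the all-pairs loop is quadratic, so it only suits small inputs.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1] (assumption: not an alignment score)."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def single_linkage(sequences, threshold=0.9):
    """Group sequences connected by any chain of pairs at or above threshold."""
    parent = list(range(len(sequences)))  # union-find forest over indices

    def find(i):
        # Follow parent links to the root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Link every pair that is similar enough (all-pairs comparison).
    for i, j in combinations(range(len(sequences)), 2):
        if similarity(sequences[i], sequences[j]) >= threshold:
            parent[find(i)] = find(j)

    # Collect members by the root of their tree.
    groups = {}
    for i, seq in enumerate(sequences):
        groups.setdefault(find(i), []).append(seq)
    return list(groups.values())


if __name__ == "__main__":
    demo = ["AAAAAAAAAA", "AAAAAAAATT", "AAAAAATTTT", "GGGGGGGGGG"]
    print(single_linkage(demo, threshold=0.75))
```

In the demo, the first and third sequences fall below the threshold when compared directly, yet they end up in the same group because both are linked to the second. This chaining is what distinguishes single-linkage clustering from the representative-based greedy approach.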

Non-redundant sequence databases

PISCES: A Protein Sequence Culling Server^[23]
RDB90^[24]
UniRef: A non-redundant UniProt sequence database^[25]
Uniclust: UniProtKB sequences clustered at 90%, 50% and 30% pairwise sequence identity^[26]
Virus Orthologous Clusters:^[27] A viral protein sequence clustering database; contains all predicted genes from eleven virus families organized into ortholog groups by BLASTP similarity

References

[1] "USEARCH". drive5.com.
[2] "CD-HIT: a ultra-fast method for clustering protein and nucleotide sequences, with many new applications in next generation sequencing (NGS) data". cd-hit.org.
[3] "Starcode repository". GitHub. 2018-10-11.
[4] PMID 25638815.
[5] "OrthoFinder". Steve Kelly Lab.
[6] PMID 26243257.
[7] PMID 31727128.
[8] PMID 29959318.
[9] PMID 29035372. S2CID 402352.
[10] PMID 11917018.
[11] "Archived copy". Archived from the original on 2003-12-06. Retrieved 2004-02-19.
[12] "Bioinformatics Paper: JESAM: CORBA software components for EST alignments and clusters". littlest.co.uk.
[13] "pedretti@eyeball -- Clustering Page". ratest.eng.uiowa.edu. Archived from the original on 2005-04-09.
[14] "NCBI News: Spring 2004-BLASTLab". nih.gov.
[15] "Clusterer: extendable java application for sequence grouping and cluster analyses". bugaco.com.
[16] "Index of /pub/nrdb". Archived from the original on 2008-01-01.
[17] "CluSTr". Archived from the original on 2006-09-24. Retrieved 2006-11-23.
[18] "Introduction to the ICAtools". littlest.co.uk.
[19] "EMBOSS: skipredundant". pasteur.fr.
[20] PMID 17683581.
[21] "CLUSS Home Page".
[22] PMID 20058485.
[23] "Dunbrack Lab". fccc.edu.
[24] PMID 9682055.
[25] "About UniProt". uniprot.org.
[26] PMID 27899574.
[27] "VOCS - Viral Bioinformatics Resource Center". uvic.ca.
