Sequence clustering

Source: Wikipedia, the free encyclopedia.

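In bioinformatics, sequence clustering algorithms attempt to group biological sequences that are somehow related. The sequences can be of genomic, "transcriptomic" (ESTs) or protein origin. For proteins, homologous sequences are typically grouped into families. For EST data, clustering is important to group sequences originating from the same gene before the ESTs are assembled to reconstruct the original mRNA.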

Some clustering algorithms use single-linkage clustering, constructing a transitive closure of sequences with a similarity over a particular threshold. UCLUST[1] and CD-HIT[2] use a greedy algorithm that identifies a representative sequence for each cluster and assigns a new sequence to that cluster if it is sufficiently similar to the representative; if a sequence is not matched, it becomes the representative sequence for a new cluster. The similarity score is often based on sequence alignment. Sequence clustering is often used to make a non-redundant set of representative sequences.
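
The greedy, representative-based scheme described above can be sketched in a few lines of Python. This is only an illustration of the general idea, not the actual implementation of UCLUST or CD-HIT: the function names `similarity` and `greedy_cluster` are hypothetical, the longest-first processing order is an assumption, and `difflib.SequenceMatcher` stands in for the alignment-based similarity scores that real tools compute.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1] (assumption); real tools use alignment-based scores."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def greedy_cluster(sequences, threshold=0.9):
    """Greedy clustering: join the first matching representative, else found a new cluster."""
    # Assumed convention: process longer sequences first so that each
    # representative is at least as long as the members assigned to it.
    order = sorted(sequences, key=len, reverse=True)
    clusters = {}  # representative sequence -> list of cluster members
    for seq in order:
        for rep in clusters:
            if similarity(seq, rep) >= threshold:
                clusters[rep].append(seq)  # sufficiently similar: join this cluster
                break
        else:
            clusters[seq] = [seq]  # no match: seq becomes a new representative
    return clusters


if __name__ == "__main__":
    reads = ["ACGTACGTAC", "ACGTACGTAA", "TTTTGGGGCC", "TTTTGGGGCA"]
    for rep, members in greedy_cluster(reads, threshold=0.8).items():
        print(rep, members)
```

The keys of the returned dictionary form a non-redundant set of representatives at the chosen threshold. Each sequence is compared only against existing representatives, never against every other sequence, which is what keeps this kind of approach cheap on large inputs.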

Sequence clusters are often synonymous with (but not identical to) protein families. Determining a representative tertiary structure for each sequence cluster is the aim of many structural genomics initiatives.

Sequence clustering algorithms and packages

  • CD-HIT[2]
  • UCLUST in USEARCH[1]
  • Starcode:[3] a fast sequence clustering algorithm based on exact all-pairs search.[4]
  • OrthoFinder:[5] a fast, scalable and accurate method for clustering proteins into gene families (orthogroups)[6][7]
  • Linclust:[8] a very fast algorithm, the first whose runtime scales linearly with input set size; part of the MMseqs2[9] software suite for fast, sensitive sequence searching and clustering of large sequence sets
  • TribeMCL: a method for clustering proteins into related groups[10]
  • BAG: a graph theoretic sequence clustering algorithm[11]
  • JESAM:[12] Open source parallel scalable DNA alignment engine with optional clustering software component
  • UICluster:[13] Parallel Clustering of EST (Gene) Sequences
  • BLASTClust: single-linkage clustering with BLAST[14]
  • Clusterer:[15] extendable Java application for sequence grouping and cluster analyses
  • PATDB: a program for rapidly identifying perfect substrings
  • nrdb:[16] a program for merging trivially redundant (identical) sequences
  • CluSTr:[17] a single-linkage protein sequence clustering database built from Smith-Waterman sequence similarities; covers over 7 million sequences including UniProt and IPI
  • ICAtools:[18] original (ancient) DNA clustering package with many algorithms useful for artifact discovery or EST clustering
  • Skipredundant:[19] EMBOSS tool to remove redundant sequences from a set
  • CLUSS:[20] algorithm to identify groups of structurally, functionally, or evolutionarily related hard-to-align protein sequences. CLUSS webserver[21]
  • CLUSS2:[22] algorithm for clustering families of hard-to-align protein sequences with multiple biological functions. CLUSS2 webserver[21]
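
Several entries in this list (for example BLASTClust and CluSTr) are described as single-linkage methods. As a rough illustration of what single-linkage clustering means, the sketch below links every pair of sequences whose similarity reaches a threshold and then takes connected components with a union-find structure. It is not the code of any listed package: `similarity` and `single_linkage` are hypothetical names, and `difflib.SequenceMatcher` is an assumed stand-in for BLAST or Smith-Waterman scores.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1] (assumption); real tools use alignment scores."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def single_linkage(sequences, threshold=0.9):
    """Group sequences into connected components of the 'similar enough' graph."""
    parent = list(range(len(sequences)))  # union-find forest over sequence indices

    def find(i):
        # Follow parent pointers to the root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All-pairs comparison: quadratic in the number of sequences.
    for i, j in combinations(range(len(sequences)), 2):
        if similarity(sequences[i], sequences[j]) >= threshold:
            parent[find(i)] = find(j)  # merge the two components

    groups = {}
    for i, seq in enumerate(sequences):
        groups.setdefault(find(i), []).append(seq)
    return list(groups.values())


if __name__ == "__main__":
    seqs = ["AAAAAAAAAA", "AAAAAAAACC", "AAAAAACCCC", "GGGGGGGGGG"]
    print(single_linkage(seqs, threshold=0.75))
```

In the usage example the first and third sequences fall below the threshold when compared directly, yet they end up in the same cluster because both are linked to the second. This chaining is the transitive-closure behaviour that distinguishes single-linkage from representative-based greedy clustering.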

Non-redundant sequence databases

  • PISCES: A Protein Sequence Culling Server[23]
  • RDB90[24]
  • UniRef: A non-redundant UniProt sequence database[25]
  • Uniclust: clustered UniProtKB sequences at the level of 90%, 50% and 30% pairwise sequence identity.[26]
  • Virus Orthologous Clusters:[27] A viral protein sequence clustering database; contains all predicted genes from eleven virus families organized into ortholog groups by BLASTP similarity

References

  1. ^ a b "USEARCH". drive5.com.
  2. ^ a b "CD-HIT: a ultra-fast method for clustering protein and nucleotide sequences, with many new applications in next generation sequencing (NGS) data". cd-hit.org.
  3. ^ "Starcode repository". GitHub. 2018-10-11.
  4. PMID 25638815.
  5. ^ "OrthoFinder". Steve Kelly Lab.
  6. PMID 26243257.
  7. .
  8. .
  9. .
  10. .
  11. ^ "Archived copy". Archived from the original on 2003-12-06. Retrieved 2004-02-19.
  12. ^ "Bioinformatics Paper: JESAM: CORBA software components for EST alignments and clusters". littlest.co.uk.
  13. ^ "pedretti@eyeball -- Clustering Page". ratest.eng.uiowa.edu. Archived from the original on 2005-04-09.
  14. ^ "NCBI News: Spring 2004-BLASTLab". nih.gov.
  15. ^ "Clusterer: extendable java application for sequence grouping and cluster analyses". bugaco.com.
  16. ^ "Index of /pub/nrdb". Archived from the original on 2008-01-01.
  17. ^ "CluSTr". Archived from the original on 2006-09-24. Retrieved 2006-11-23.
  18. ^ "Introduction to the ICAtools". littlest.co.uk.
  19. ^ "EMBOSS: skipredundant". pasteur.fr.
  20. PMID 17683581.
  21. ^ a b "CLUSS Home Page".
  22. PMID 20058485.
  23. ^ "Dunbrack Lab". fccc.edu.
  24. PMID 9682055.
  25. ^ "About UniProt". uniprot.org.
  26. PMID 27899574.
  27. ^ "VOCS - Viral Bioinformatics Resource Center". uvic.ca.