Sequence alignment
This article needs additional citations for verification. (March 2009) |
In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences.[1][2] Aligned sequences of nucleotide or amino acid residues are typically represented as rows within a matrix. Gaps are inserted between the residues so that identical or similar characters are aligned in successive columns. Sequence alignments are also used for non-biological sequences such as calculating the distance cost between strings in a natural language, or to display financial data.
Interpretation
If two sequences in an alignment share a common ancestor, mismatches can be interpreted as
Alignment methods
Very short or very similar sequences can be aligned by hand. However, most interesting problems require the alignment of lengthy, highly variable or extremely numerous sequences that cannot be aligned solely by human effort. Instead, human knowledge is applied in constructing algorithms to produce high-quality sequence alignments, and occasionally in adjusting the final results to reflect patterns that are difficult to represent algorithmically (especially in the case of nucleotide sequences). Computational approaches to sequence alignment generally fall into two categories: global alignments and local alignments. Calculating a global alignment is a form of
Representations
Alignments are commonly represented both graphically and in text format. In almost all sequence alignment representations, sequences are written in rows arranged so that aligned residues appear in successive columns. In text formats, aligned columns containing identical or similar characters are indicated with a system of conservation symbols. As in the image above, an asterisk or pipe symbol is used to show identity between two columns; other less common symbols include a colon for conservative substitutions and a period for semiconservative substitutions. Many sequence visualization programs also use color to display information about the properties of the individual sequence elements; in DNA and RNA sequences, this equates to assigning each nucleotide its own color. In protein alignments, such as the one in the image above, color is often used to indicate amino acid properties to aid in judging the
Sequence alignments can be stored in a wide variety of text-based file formats, many of which were originally developed in conjunction with a specific alignment program or implementation. Most web-based tools allow a limited number of input and output formats, such as
CIGAR Format
Ref. : GTCGTAGAATA
Read: CACGTAG—TA
CIGAR: 2S5M2D2M
where:
2S = 2 soft clipping (could be mismatches, or a read longer than the matched sequence)
5M = 5 matches or mismatches
2D = 2 deletions
2M = 2 matches or mismatches
The original CIGAR format from the exonerate alignment program did not distinguish between mismatches or matches with the M character.
The SAMv1 spec document defines newer CIGAR codes. In most cases it is preferred to use the '=' and 'X' characters to denote matches or mismatches rather than the older 'M' character, which is ambiguous.
CIGAR Code | BAM Integer | Description | Consumes query | Consumes reference |
M | 0 | alignment match (can be a sequence match or mismatch) | yes | yes |
I | 1 | insertion to the reference | yes | no |
D | 2 | deletion from the reference | no | yes |
N | 3 | skipped region from the reference | no | yes |
S | 4 | soft clipping (clipped sequences present in SEQ) | yes | no |
H | 5 | hard clipping (clipped sequences NOT present in SEQ) | no | no |
P | 6 | padding (silent deletion from padded reference) | no | no |
= | 7 | sequence match | yes | yes |
X | 8 | sequence mismatch | yes | yes |
- “Consumes query” and “consumes reference” indicate whether the CIGAR operation causes the alignment to step along the query sequence and the reference sequence respectively.
- H can only be present as the first and/or last operation.
- S may only have H operations between them and the ends of the CIGAR string.
- For mRNA-to-genome alignment, an N operation represents an intron. For other types of alignments, the interpretation of N is not defined.
- Sum of lengths of the M/I/S/=/X operations shall equal the length of SEQ
Global and local alignments
Global alignments, which attempt to align every residue in every sequence, are most useful when the sequences in the query set are similar and of roughly equal size. (This does not mean global alignments cannot start and/or end in gaps.) A general global alignment technique is the Needleman–Wunsch algorithm, which is based on dynamic programming. Local alignments are more useful for dissimilar sequences that are suspected to contain regions of similarity or similar sequence motifs within their larger sequence context. The Smith–Waterman algorithm is a general local alignment method based on the same dynamic programming scheme but with additional choices to start and end at any place.[5]
Hybrid methods, known as semi-global or "glocal" (short for global-local) methods, search for the best possible partial alignment of the two sequences (in other words, a combination of one or both starts and one or both ends is stated to be aligned). This can be especially useful when the downstream part of one sequence overlaps with the upstream part of the other sequence. In this case, neither global nor local alignment is entirely appropriate: a global alignment would attempt to force the alignment to extend beyond the region of overlap, while a local alignment might not fully cover the region of overlap.[8] Another case where semi-global alignment is useful is when one sequence is short (for example a gene sequence) and the other is very long (for example a chromosome sequence). In that case, the short sequence should be globally (fully) aligned but only a local (partial) alignment is desired for the long sequence.
Fast expansion of genetic data challenges speed of current DNA sequence alignment algorithms. Essential needs for an efficient and accurate method for DNA variant discovery demand innovative approaches for parallel processing in real time. Optical computing approaches have been suggested as promising alternatives to the current electrical implementations, yet their applicability remains to be tested [1].
Pairwise alignment
Pairwise sequence alignment methods are used to find the best-matching piecewise (local or global) alignments of two query sequences. Pairwise alignments can only be used between two sequences at a time, but they are efficient to calculate and are often used for methods that do not require extreme precision (such as searching a database for sequences with high similarity to a query). The three primary methods of producing pairwise alignments are dot-matrix methods, dynamic programming, and word methods;[1] however, multiple sequence alignment techniques can also align pairs of sequences. Although each method has its individual strengths and weaknesses, all three pairwise methods have difficulty with highly repetitive sequences of low information content - especially where the number of repetitions differ in the two sequences to be aligned.
Maximal unique match
One way of quantifying the utility of a given pairwise alignment is the '
More precisely:
"Given two genomes A and B, Maximal Unique Match (MUM) substring is a common substring of A and B of length longer than a specified minimum length d (by default d= 20) such that
- it is maximal, that is, it cannot be extended on either end without incurring a mismatch; and
- it is unique in both sequences"[10]
Dot-matrix methods
The dot-matrix approach, which implicitly produces a family of alignments for individual sequence regions, is qualitative and conceptually simple, though time-consuming to analyze on a large scale. In the absence of noise, it can be easy to visually identify certain sequence features—such as insertions, deletions, repeats, or inverted repeats—from a dot-matrix plot. To construct a dot-matrix plot, the two sequences are written along the top row and leftmost column of a two-dimensional matrix and a dot is placed at any point where the characters in the appropriate columns match—this is a typical recurrence plot. Some implementations vary the size or intensity of the dot depending on the degree of similarity of the two characters, to accommodate conservative substitutions. The dot plots of very closely related sequences will appear as a single line along the matrix's main diagonal.
Problems with dot plots as an information display technique include: noise, lack of clarity, non-intuitiveness, difficulty extracting match summary statistics and match positions on the two sequences. There is also much wasted space where the match data is inherently duplicated across the diagonal and most of the actual area of the plot is taken up by either empty space or noise, and, finally, dot-plots are limited to two sequences. None of these limitations apply to Miropeats alignment diagrams but they have their own particular flaws.
Dot plots can also be used to assess repetitiveness in a single sequence. A sequence can be plotted against itself and regions that share significant similarities will appear as lines off the main diagonal. This effect occurs when a protein consists of multiple similar
Dynamic programming
The technique of
Dynamic programming can be useful in aligning nucleotide to protein sequences, a task complicated by the need to take into account
The dynamic programming method is guaranteed to find an optimal alignment given a particular scoring function; however, identifying a good scoring function is often an empirical rather than a theoretical matter. Although dynamic programming is extensible to more than two sequences, it is prohibitively slow for large numbers of sequences or extremely long sequences.[citation needed]
Word methods
Word methods, also known as k-tuple methods, are heuristic methods that are not guaranteed to find an optimal alignment solution, but are significantly more efficient than dynamic programming. These methods are especially useful in large-scale database searches where it is understood that a large proportion of the candidate sequences will have essentially no significant match with the query sequence. Word methods are best known for their implementation in the database search tools FASTA and the BLAST family.[1] Word methods identify a series of short, nonoverlapping subsequences ("words") in the query sequence that are then matched to candidate database sequences. The relative positions of the word in the two sequences being compared are subtracted to obtain an offset; this will indicate a region of alignment if multiple distinct words produce the same offset. Only if this region is detected do these methods apply more sensitive alignment criteria; thus, many unnecessary comparisons with sequences of no appreciable similarity are eliminated.
In the FASTA method, the user defines a value k to use as the word length with which to search the database. The method is slower but more sensitive at lower values of k, which are also preferred for searches involving a very short query sequence. The BLAST family of search methods provides a number of algorithms optimized for particular types of queries, such as searching for distantly related sequence matches. BLAST was developed to provide a faster alternative to FASTA without sacrificing much accuracy; like FASTA, BLAST uses a word search of length k, but evaluates only the most significant word matches, rather than every word match as does FASTA. Most BLAST implementations use a fixed default word length that is optimized for the query and database type, and that is changed only under special circumstances, such as when searching with repetitive or very short query sequences. Implementations can be found via a number of web portals, such as EMBL FASTA and NCBI BLAST.
Multiple sequence alignment
Nevertheless, the utility of these alignments in bioinformatics has led to the development of a variety of methods suitable for aligning three or more sequences.Dynamic programming
The technique of dynamic programming is theoretically applicable to any number of sequences; however, because it is computationally expensive in both time and
Progressive methods
Progressive, hierarchical, or tree methods generate a multiple sequence alignment by first aligning the most similar sequences and then adding successively less related sequences or groups to the alignment until the entire query set has been incorporated into the solution. The initial tree describing the sequence relatedness is based on pairwise comparisons that may include heuristic pairwise alignment methods similar to FASTA. Progressive alignment results are dependent on the choice of "most related" sequences and thus can be sensitive to inaccuracies in the initial pairwise alignments. Most progressive multiple sequence alignment methods additionally weight the sequences in the query set according to their relatedness, which reduces the likelihood of making a poor choice of initial sequences and thus improves alignment accuracy.
Many variations of the Clustal progressive implementation[14][15][16] are used for multiple sequence alignment, phylogenetic tree construction, and as input for protein structure prediction. A slower but more accurate variant of the progressive method is known as T-Coffee.[17]
Iterative methods
Iterative methods attempt to improve on the heavy dependence on the accuracy of the initial pairwise alignments, which is the weak point of the progressive methods. Iterative methods optimize an
Motif finding
Motif finding, also known as profile analysis, constructs global multiple sequence alignments that attempt to align short conserved
Techniques inspired by computer science
A variety of general
The Burrows–Wheeler transform has been successfully applied to fast short read alignment in popular tools such as Bowtie and BWA. See FM-index.
Structural alignment
Structural alignments, which are usually specific to protein and sometimes RNA sequences, use information about the
Structural alignments are used as the "gold standard" in evaluating alignments for homology-based protein structure prediction[21] because they explicitly align regions of the protein sequence that are structurally similar rather than relying exclusively on sequence information. However, clearly structural alignments cannot be used in structure prediction because at least one sequence in the query set is the target to be modeled, for which the structure is not known. It has been shown that, given the structural alignment between a target and a template sequence, highly accurate models of the target protein sequence can be produced; a major stumbling block in homology-based structure prediction is the production of structurally accurate alignments given only sequence information.[21]
DALI
The DALI method, or
SSAP
SSAP (sequential structure alignment program) is a dynamic programming-based method of structural alignment that uses atom-to-atom vectors in structure space as comparison points. It has been extended since its original description to include multiple as well as pairwise alignments,
Combinatorial extension
The combinatorial extension method of structural alignment generates a pairwise structural alignment by using local geometry to align short fragments of the two proteins being analyzed and then assembles these fragments into a larger alignment.
Phylogenetic analysis
Phylogenetics and sequence alignment are closely related fields due to the shared necessity of evaluating sequence relatedness.
Progressive multiple alignment techniques produce a phylogenetic tree by necessity because they incorporate sequences into the growing alignment in order of relatedness. Other techniques that assemble multiple sequence alignments and phylogenetic trees score and sort trees first and calculate a multiple sequence alignment from the highest-scoring tree. Commonly used methods of phylogenetic tree construction are mainly
Assessment of significance
Sequence alignments are useful in bioinformatics for identifying sequence similarity, producing phylogenetic trees, and developing homology models of protein structures. However, the biological relevance of sequence alignments is not always clear. Alignments are often assumed to reflect a degree of evolutionary change between sequences descended from a common ancestor; however, it is formally possible that convergent evolution can occur to produce apparent similarity between proteins that are evolutionarily unrelated but perform similar functions and have similar structures.
In database searches such as BLAST, statistical methods can determine the likelihood of a particular alignment between sequences or sequence regions arising by chance given the size and composition of the database being searched. These values can vary significantly depending on the search space. In particular, the likelihood of finding a given alignment by chance increases if the database consists only of sequences from the same organism as the query sequence. Repetitive sequences in the database or query can also distort both the search results and the assessment of statistical significance; BLAST automatically filters such repetitive sequences in the query to avoid apparent hits that are statistical artifacts.
Methods of statistical significance estimation for gapped sequence alignments are available in the literature.[26][28][29][30][31][32][33][34]
Assessment of credibility
Statistical significance indicates the probability that an alignment of a given quality could arise by chance, but does not indicate how much superior a given alignment is to alternative alignments of the same sequences. Measures of alignment credibility indicate the extent to which the best scoring alignments for a given pair of sequences are substantially similar. Methods of alignment credibility estimation for gapped sequence alignments are available in the literature.[35]
Scoring functions
The choice of a scoring function that reflects biological or statistical observations about known sequences is important to producing good alignments. Protein sequences are frequently aligned using
It can be very useful and instructive to try the same alignment several times with different choices for scoring matrix and/or gap penalty values and compare the results. Regions where the solution is weak or non-unique can often be identified by observing which regions of the alignment are robust to variations in alignment parameters.
Other biological uses
Sequenced RNA, such as
Non-biological uses
The methods used for biological sequence alignment have also found applications in other fields, most notably in
Software
A more complete list of available software categorized by algorithm and alignment type is available at
Alignment algorithms and software can be directly compared to one another using a standardized set of benchmark reference multiple sequence alignments known as BAliBASE.[48] The data set consists of structural alignments, which can be considered a standard against which purely sequence-based methods are compared. The relative performance of many common alignment methods on frequently encountered alignment problems has been tabulated and selected results published online at BAliBASE.[49][50] A comprehensive list of BAliBASE scores for many (currently 12) different alignment tools can be computed within the protein workbench STRAP.[51]
See also
- Sequence homology
- Sequence mining
- BLAST
- String searching algorithm
- Alignment-free sequence analysis
- UGENE
- Needleman–Wunsch algorithm
- Smith-Waterman algorithm
- Sequence analysis in social sciences
References
- ^ ISBN 978-0-87969-608-5.
- OCLC 1240827446.)
{{cite book}}
: CS1 maint: date and year (link - ^ "Clustal FAQ #Symbols". Clustal. Archived from the original on 24 October 2016. Retrieved 8 December 2014.
- PMID 11337480.
- ^ S2CID 2658261.
- PMID 2172928.
- ^ "Sequence Alignment/Map Format Specification" (PDF).
- PMID 12855437.
- PMID 10325427.
- ISBN 978-1420070330.
- PMID 8790475.
- PMID 17037961.
- PMID 2734293.
- PMID 3243435.
- PMID 7984417.
- PMID 12824352.
- S2CID 10189971.
- PMID 7796270.
- PMID 9927713.
- PMID 3709526.
- ^ PMID 15653774.
- S2CID 7509134.
- PMID 7849601.
- PMID 9309224.
- PMID 9796821.
- ^ PMID 21258650.
- ISBN 978-0-87893-177-4.
- )
- S2CID 193085.
- PMID 18973434.
- S2CID 15640896.
- PMID 14990449.
- S2CID 6559731.
- PMID 20063463. Archived from the originalon 28 January 2013.
- PMID 19119992.
- PMID 18566765.
- S2CID 31148824.
- PMID 19477687.
- PMID 19386041.
- S2CID 121097811.
- )
- ^ Kondrak, Grzegorz (2002). "Algorithms for Language Reconstruction" (PDF). University of Toronto, Ontario. Archived from the original (PDF) on 17 December 2008. Retrieved 21 January 2007.
{{cite journal}}
: Cite journal requires|journal=
(help) - .
- ^ EMBL-EBI. "ClustalW2 < Multiple Sequence Alignment < EMBL-EBI". www.EBI.ac.uk. Retrieved 12 June 2017.
- ^ T-coffee
- ^ "BLAST: Basic Local Alignment Search Tool". blast.ncbi.nlm.NIH.gov. Retrieved 12 June 2017.
- ^ "UVA FASTA Server". fasta.bioch.Virginia.edu. Retrieved 12 June 2017.
- PMID 10068696.
- ^ BAliBASE
- PMID 10373585.
- ^ "Multiple sequence alignment: Strap". 3d-alignment.eu. Retrieved 12 June 2017.
External links
- Media related to Sequence alignment at Wikimedia Commons