Sequence analysis
In
Since the development of high-throughput methods for producing gene and protein sequences, the rate at which new sequences are added to databases has increased very rapidly. Such a collection of sequences does not, by itself, increase scientists' understanding of the biology of organisms. However, comparing new sequences to those with known functions is a key way of understanding the biology of the organism from which a new sequence comes. Thus, sequence analysis can be used to assign function to genes and proteins by studying the similarities between the compared sequences. Many tools and techniques now exist that perform sequence comparisons (sequence alignment) and analyze the resulting alignments to understand their biology.
Sequence analysis in molecular biology includes a very wide range of relevant topics:
- The comparison of sequences in order to find similarity, often to infer if they are related (homologous)
- Identification of intrinsic features of the sequence such as regulatory elements
- Identification of sequence differences and variations, such as single nucleotide polymorphisms (SNPs), which can serve as genetic markers
- Revealing the evolution and genetic diversity of sequences and organisms
- Identification of molecular structure from sequence alone
History
Since the very first sequences of the
Sequence alignment
There are millions of
- Pairwise alignment: BLAST, dot plots
- Multiple alignment: ClustalW, PROBCONS, MUSCLE, MAFFT, and T-Coffee
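As an illustration of the simplest pairwise technique listed above, the following Python sketch builds a dot plot: a matrix with a mark wherever a short window of one sequence exactly matches a window of the other. The function names `dot_plot` and `render` and the `window` parameter are illustrative assumptions, not taken from any particular tool, and real dot-plot programs typically add scoring thresholds and filtering.

```python
def dot_plot(seq_a, seq_b, window=1):
    """Return a boolean matrix; cell [i][j] is True when the window of
    seq_a starting at i exactly equals the window of seq_b starting at j."""
    rows = len(seq_a) - window + 1
    cols = len(seq_b) - window + 1
    return [
        [seq_a[i:i + window] == seq_b[j:j + window] for j in range(cols)]
        for i in range(rows)
    ]


def render(matrix):
    """Draw the matrix as text: '*' for a match, '.' otherwise."""
    return "\n".join(
        "".join("*" if cell else "." for cell in row) for row in matrix
    )


# Hypothetical example sequences; similar regions show up as diagonal runs.
print(render(dot_plot("GATTACA", "GATCACA", window=2)))
```

Runs of marks along a diagonal indicate regions where the two sequences agree; a larger `window` suppresses chance single-letter matches at the cost of sensitivity.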
A common use for pairwise sequence alignment is to take a sequence of interest and compare it to all known sequences in a database to identify
Profile comparison
In 1987, Michael Gribskov, Andrew McLachlan, and
These models have become known as profile-HMMs. In recent years, methods have been developed that allow profiles to be compared directly with each other. These are known as profile-profile comparison methods.[14]
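To make the idea of a profile concrete, the sketch below builds a plain position-specific frequency table from a set of aligned sequences and scores a candidate sequence against it with log-odds. This is a deliberately simplified stand-in, not a profile-HMM and not the original 1987 method: the names `build_profile` and `log_odds_score`, the pseudocount, the uniform background frequency, and the assumption of a gap-free DNA alignment are all illustrative choices.

```python
import math
from collections import Counter


def build_profile(alignment, alphabet="ACGT", pseudocount=1.0):
    """Per-column residue frequencies for equal-length, gap-free sequences.

    A pseudocount is added so that unseen residues never get probability 0.
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    total = len(alignment) + pseudocount * len(alphabet)
    profile = []
    for col in range(length):
        counts = Counter(seq[col] for seq in alignment)
        profile.append(
            {ch: (counts[ch] + pseudocount) / total for ch in alphabet}
        )
    return profile


def log_odds_score(profile, sequence, background=0.25):
    """Sum of log2(profile frequency / background) over all positions."""
    return sum(
        math.log2(profile[i][ch] / background)
        for i, ch in enumerate(sequence)
    )


# Hypothetical toy alignment of three sequences.
profile = build_profile(["ACGT", "ACGT", "ACGA"])
print(log_odds_score(profile, "ACGT"), log_odds_score(profile, "TTTT"))
```

A sequence that resembles the aligned family scores higher than an unrelated one; profile-HMMs extend this idea with explicit insertion and deletion states.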
Sequence assembly
Sequence assembly refers to the reconstruction of a DNA sequence by aligning and merging small DNA fragments. It is an integral part of modern DNA sequencing. Since presently available DNA sequencing technologies are ill-suited for reading long sequences, large pieces of DNA (such as genomes) are often sequenced by (1) cutting the DNA into small pieces, (2) reading the small fragments, and (3) reconstituting the original DNA by merging the information from the various fragments.
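The merging step (3) can be sketched with a greedy overlap strategy: repeatedly find the two fragments with the longest suffix-prefix overlap and join them. This is only a toy illustration under strong assumptions (error-free reads, a single strand, no repeats); the function names `overlap` and `greedy_assemble` and the `min_len` threshold are hypothetical, and production assemblers use considerably more elaborate graph-based methods.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 when no overlap of at least min_len characters exists.
    """
    max_len = min(len(a), len(b))
    for k in range(max_len, min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(fragments, min_len=3):
    """Merge fragments pairwise, always taking the largest overlap first."""
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates
    while len(frags) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # no remaining overlaps; leave separate contigs
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for idx, f in enumerate(frags) if idx not in (best_i, best_j)]
        frags.append(merged)
    return frags


# Hypothetical reads cut from the sequence ATGGCGTACCTGA.
print(greedy_assemble(["GCGTACC", "TACCTGA", "ATGGCGT"]))
```

When fragments share no sufficiently long overlap, the function returns several contigs rather than a single sequence, which mirrors what happens with gaps in real sequencing coverage.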
Sequencing multiple species at once has recently become a major research objective. Metagenomics is the study of microbial communities obtained directly from the environment. Unlike microorganisms cultured in the lab, a wild sample usually contains dozens, sometimes even thousands, of types of microorganisms from their original habitats.[15] Recovering the original genomes can prove very challenging.
Gene prediction
Gene prediction or gene finding refers to the process of identifying the regions of genomic DNA that encode
Protein structure prediction
The 3D structures of molecules are of major importance to their functions in nature. Since structural prediction of large molecules at an atomic level is a largely intractable problem, some biologists introduced ways to predict 3D structure at the primary sequence level. These include the biochemical or statistical analysis of amino acid residues in local regions and structural inference from homologs (or other potentially related proteins) with known 3D structures.
A large number of diverse approaches have been proposed to solve the structure prediction problem. To determine which methods are most effective, a structure prediction competition called CASP (Critical Assessment of Structure Prediction) was founded.[20]
Methodology
The tasks that lie in the space of sequence analysis are often non-trivial to resolve and require the use of relatively complex approaches. Of the many types of methods used in practice, the most popular include:
- Dynamic programming
- Artificial neural network
- Hidden Markov model
- Support vector machine
- Clustering
- Bayesian network
- Regression analysis
- Sequence mining
- Alignment-free sequence analysis
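Of the methods listed above, dynamic programming lends itself to a compact illustration. The following sketch implements global pairwise alignment in the style of the Needleman-Wunsch algorithm with a linear gap penalty. The scoring values (`match=1`, `mismatch=-1`, `gap=-2`) are arbitrary assumptions chosen for the example; practical tools use substitution matrices and affine gap penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment of a and b; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align two residues
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


# Hypothetical example sequences.
print(needleman_wunsch("ACGT", "AGT"))
```

The table has (n+1) by (m+1) cells, so time and memory grow with the product of the two sequence lengths, which is why database-scale searches rely on heuristics rather than full dynamic programming.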
See also
- Fourier transform
- Least-squares spectral analysis
- List of sequence alignment software
- List of alignment visualization software
- List of phylogenetics software
- List of phylogenetic tree visualization software
- List of protein structure prediction software
- List of RNA structure prediction software
- Sequence analysis in social sciences