Conserved sequence
In
A highly conserved sequence is one that has remained relatively unchanged far back up the
History
The discovery of the role of
Mechanisms
Over many generations, nucleic acid sequences in the genome of an evolutionary lineage can gradually change due to random mutations and deletions.[9][10] Sequences may also recombine or be deleted due to chromosomal rearrangements. Conserved sequences are sequences which persist in the genome despite such forces, and have slower rates of mutation than the background mutation rate.[11]
Conservation can occur in
Coding sequence
In coding sequences, the nucleic acid and amino acid sequence may be conserved to different extents, as the degeneracy of the genetic code means that synonymous mutations in a coding sequence do not affect the amino acid sequence of its protein product.[15]
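The effect of degeneracy can be illustrated with a minimal sketch. The codon table below is a deliberately partial subset of the standard genetic code, and the sequences and function name are hypothetical examples chosen for illustration only.

```python
# Partial codon table (standard genetic code); only the codons used below.
CODON_TABLE = {
    "GCU": "A", "GCC": "A", "GCA": "A", "GCG": "A",  # alanine
    "AAA": "K", "AAG": "K",                          # lysine
    "GAU": "D", "GAC": "D",                          # aspartate
    "GAA": "E", "GAG": "E",                          # glutamate
}


def translate(rna):
    """Translate an mRNA string codon by codon using the partial table."""
    return "".join(CODON_TABLE[rna[i:i + 3]] for i in range(0, len(rna), 3))


# Hypothetical sequences: two third-position changes that leave the
# protein unchanged, and one change that alters the last amino acid.
original = "GCUAAAGAU"
synonymous = "GCCAAGGAU"
nonsynonymous = "GCUAAAGAA"

print(translate(original))       # AKD
print(translate(synonymous))     # AKD (nucleotides differ, protein does not)
print(translate(nonsynonymous))  # AKE
```

Here the nucleic acid sequence has diverged at two positions while the amino acid sequence is fully conserved, which is why the two levels of conservation can differ.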
Amino acid sequences can be conserved to maintain the
The nucleic acid sequence of a protein coding gene may also be conserved by other selective pressures. The codon usage bias in some organisms may restrict the types of synonymous mutations in a sequence. Nucleic acid sequences that cause secondary structure in the mRNA of a coding gene may be selected against, as some structures may negatively affect translation, or conserved where the mRNA also acts as a functional non-coding RNA.[19][20]
Non-coding
Non-coding sequences important for
Identification
Conserved sequences are typically identified by bioinformatics approaches based on sequence alignment. Advances in high-throughput DNA sequencing and protein mass spectrometry have substantially increased the availability of protein sequences and whole genomes for comparison since the early 2000s.[23][24]
Homology search
Conserved sequences may be identified by homology search, using tools such as BLAST, HMMER, OrthologR,[25] and Infernal.[26] Homology search tools may take an individual nucleic acid or protein sequence as input, or use statistical models generated from multiple sequence alignments of known related sequences. Statistical models such as profile-HMMs, and RNA covariance models which also incorporate structural information,[27] can be helpful when searching for more distantly related sequences. Input sequences are then aligned against a database of sequences from related individuals or other species. The resulting alignments are then scored based on the number of matching amino acids or bases, and the number of gaps or deletions generated by the alignment. Acceptable conservative substitutions may be identified using substitution matrices such as PAM and BLOSUM. Highly scoring alignments are assumed to be from homologous sequences. The conservation of a sequence may then be inferred by detection of highly similar homologs over a broad phylogenetic range.[28]
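The scoring step described above can be sketched as follows. This is a simplified illustration, not the scoring scheme of BLAST or any other named tool: the match, mismatch, and gap values are arbitrary assumptions, whereas real tools typically use substitution matrices such as BLOSUM together with more elaborate gap penalties.

```python
def score_alignment(a, b, match=1, mismatch=-1, gap=-2):
    """Score two already-aligned sequences of equal length.

    '-' marks a gap. The default scores are illustrative assumptions,
    not the parameters of any particular homology search tool.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            score += gap        # position aligned against a gap
        elif x == y:
            score += match      # identical residues
        else:
            score += mismatch   # substitution
    return score


print(score_alignment("GATTACA", "GATTACA"))  # 7: seven matches
print(score_alignment("GATTACA", "GCTTACA"))  # 5: six matches, one mismatch
print(score_alignment("GATTACA", "GAT-ACA"))  # 4: six matches, one gap
```

Replacing the fixed mismatch score with a lookup in a substitution matrix would let conservative substitutions score higher than non-conservative ones.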
Multiple sequence alignment
Multiple sequence alignments can be used to visualise conserved sequences. The CLUSTAL format includes a plain-text key to annotate conserved columns of the alignment, denoting conserved sequence (*), conservative mutations (:), semi-conservative mutations (.), and non-conservative mutations ( ).[30] Sequence logos can also show conserved sequence by representing the proportions of characters at each point in the alignment by height.[29]
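Both ideas can be sketched on a toy alignment. The example below marks only fully identical columns with `*` (it does not attempt the `:` and `.` classes, which depend on residue similarity groups), and computes the per-column information content that commonly sets the stack height in a sequence logo. It assumes a gap-free DNA alignment and omits any small-sample correction; the alignment and function names are hypothetical.

```python
import math
from collections import Counter


def conservation_line(alignment):
    """Return '*' for columns where every sequence has the same character."""
    return "".join("*" if len(set(col)) == 1 else " " for col in zip(*alignment))


def column_information(column, alphabet_size=4):
    """Information content in bits: log2(alphabet size) minus Shannon entropy."""
    total = len(column)
    entropy = -sum(
        (n / total) * math.log2(n / total) for n in Counter(column).values()
    )
    return math.log2(alphabet_size) - entropy


# Hypothetical gap-free DNA alignment of four sequences.
alignment = ["ACGT", "ACGA", "ACTC", "AGAG"]

print(repr(conservation_line(alignment)))
for col in zip(*alignment):
    print("".join(col), round(column_information(col), 3))
```

A fully conserved DNA column carries 2 bits, while a column in which all four bases are equally frequent carries 0 bits.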
Genome alignment
Whole genome alignments (WGAs) may also be used to identify highly conserved regions across species. Currently the accuracy and scalability of WGA tools remain limited due to the computational complexity of dealing with rearrangements, repeat regions and the large size of many eukaryotic genomes.[32] However, WGAs of 30 or more closely related bacteria (prokaryotes) are now increasingly feasible.[33][34]
Scoring systems
Other approaches use measurements of conservation based on
The GERP (Genomic Evolutionary Rate Profiling) framework scores conservation of genetic sequences across species. This approach estimates the rate of neutral mutation in a set of species from a multiple sequence alignment, and then identifies regions of the sequence that exhibit fewer mutations than expected. These regions are then assigned scores based on the difference between the observed mutation rate and expected background mutation rate. A high GERP score then indicates a highly conserved sequence.[35][36]
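The scoring idea, expected minus observed substitutions, can be sketched as below. This is not the GERP implementation: the per-position expected and observed values are made-up numbers, whereas in practice they would be estimated from a multiple sequence alignment and a phylogenetic tree. The function names and the threshold are likewise assumptions.

```python
def conservation_scores(expected, observed):
    """Per-position score: expected minus observed substitutions.

    Positive values mean fewer substitutions than the neutral expectation.
    """
    return [e - o for e, o in zip(expected, observed)]


def constrained_positions(scores, threshold):
    """Indices whose score meets an (arbitrary, illustrative) threshold."""
    return [i for i, s in enumerate(scores) if s >= threshold]


# Hypothetical values for four alignment columns.
expected = [3.0, 3.0, 3.0, 3.0]   # neutral expectation per column
observed = [0.2, 2.9, 3.5, 0.5]   # substitutions actually inferred

scores = conservation_scores(expected, observed)
print([round(s, 2) for s in scores])        # [2.8, 0.1, -0.5, 2.5]
print(constrained_positions(scores, 2.0))   # [0, 3]
```

Columns 0 and 3 show a large deficit of substitutions and would be reported as conserved under this toy threshold.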
LIST[37][38] (Local Identity and Shared Taxa) is based on the assumption that variations observed in species closely related to humans are more significant when assessing conservation than those in distantly related species. LIST therefore uses the local alignment identity around each position to identify relevant sequences in the multiple sequence alignment (MSA), and then estimates conservation based on the taxonomic distances of these sequences to humans. Unlike other tools, LIST ignores the count/frequency of variations in the MSA.
Aminode[39] combines multiple alignments with phylogenetic analysis to analyze changes in homologous proteins and produce a plot that indicates the local rates of evolutionary changes. This approach identifies the Evolutionarily Constrained Regions in a protein, which are segments that are subject to purifying selection and are typically critical for normal protein function.
Other approaches such as PhyloP and PhyloHMM incorporate statistical phylogenetics methods to compare probability distributions of substitution rates, which allows the detection of both conservation and accelerated mutation. First, a background probability distribution is generated for the number of substitutions expected to occur for a column in a multiple sequence alignment, based on a phylogenetic tree. The estimated evolutionary relationships between the species of interest are used to calculate the significance of any substitutions (i.e. a substitution between two closely related species may be less likely to occur than one between distantly related species, and is therefore more significant). To detect conservation, a probability distribution is calculated for a subset of the multiple sequence alignment, and compared to the background distribution using a statistical test such as a likelihood-ratio test or score test. P-values generated from comparing the two distributions are then used to identify conserved regions. PhyloHMM uses hidden Markov models to generate probability distributions. The PhyloP software package compares probability distributions using a likelihood-ratio test or score test, as well as using a GERP-like scoring system.[40][41][42]
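As an illustration of the final step only, the sketch below converts two log-likelihoods into a likelihood-ratio statistic and a P-value. It assumes the alternative model has one extra free parameter, so that the statistic is compared against a chi-squared distribution with one degree of freedom; the log-likelihood values are invented, and this is not the PhyloP code or its exact procedure.

```python
import math


def likelihood_ratio_pvalue(loglik_null, loglik_alt):
    """P-value of a likelihood-ratio test with one extra free parameter.

    The statistic 2 * (lnL_alt - lnL_null) is compared against a
    chi-squared distribution with 1 degree of freedom, whose upper tail
    probability is erfc(sqrt(x / 2)).
    """
    statistic = max(0.0, 2.0 * (loglik_alt - loglik_null))
    return statistic, math.erfc(math.sqrt(statistic / 2.0))


# Hypothetical log-likelihoods for one alignment column or region:
# the background (neutral) model and a model with a rescaled rate.
stat, p = likelihood_ratio_pvalue(loglik_null=-12.0, loglik_alt=-10.0)
print(stat)          # 4.0
print(round(p, 4))   # 0.0455
```

A small P-value indicates that the rescaled model fits notably better than the background model; whether this reflects conservation or acceleration depends on the direction of the rate change, which this sketch does not model.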
Extreme conservation
Ultra-conserved elements
Universally conserved genes
The most highly conserved genes are those that can be found in all organisms. These consist mainly of the
Genes or gene families that have been found to be universally conserved include
Applications
Phylogenetics and taxonomy
Sets of conserved sequences are often used for generating
Medical research
As highly conserved sequences often have important biological functions, they can be a useful starting point for identifying the cause of
Functional annotation
Identifying conserved sequences can be used to discover and predict functional sequences such as genes.[67] Conserved sequences with a known function, such as protein domains, can also be used to predict the function of a sequence. Databases of conserved protein domains such as Pfam and the Conserved Domain Database can be used to annotate functional domains in predicted protein coding genes.[68]
See also
- Evolutionary developmental biology
- NAPP (database)
- Segregating site
- Sequence alignment
- Sequence alignment software
- UCbase
- Ultra-conserved element
References
- ^ "Clustal FAQ #Symbols". Clustal. Archived from the original on 24 October 2016. Retrieved 8 December 2014.
- S2CID 4067991.
- ^ PMID 14147455.
- ^ PMID 22308526.
- ^ Zuckerkandl, Emile; Pauling, Linus B. (1962). "Molecular disease, evolution, and genetic heterogeneity". Horizons in Biochemistry: 189–225.
- PMID 14077496.
- )
- S2CID 23208558.
- S2CID 4161261.
- PMID 5767777.
- PMID 4527913.
- PMID 18166073.
- PMID 18245453.
- PMID 28402878.
- PMID 24954581.
- S2CID 15248867.
- PMID 23244440.
- ISSN 0020-7608.
- PMID 16168082.
- PMID 18042713.
- PMID 24184936.
- PMID 17151342.
- PMID 14656959.
- PMID 15829234.
- PMID 25631928.
- PMID 24008419.
- PMID 8029015.
- PMID 32954566.
- ^ a b "Weblogo". UC Berkeley. Retrieved 30 December 2017.
- ^ "Clustal FAQ #Symbols". Clustal. Archived from the original on 24 October 2016. Retrieved 8 December 2014.
- ^ "ECR Browser". ECR Browser. Retrieved 9 January 2018.
- PMID 25273068.
- PMID 26442149.
- PMID 24676150.
- PMID 15965027.
- ^ "Sidow Lab - GERP".
- PMID 30952844.
- PMID 32352516.
- PMID 29358731.
- PMID 19858363.
- ^ "PHAST: Home".
- PMID 17919331.
- S2CID 2790337.
- PMID 16024819.
- PMID 24218634.
- PMID 22232343.
- PMID 25207863.
- PMID 22496592.
- S2CID 15707806.
- PMID 12618371.
- PMID 24524803.
- PMID 15593277.
- PMID 7524576.
- PMID 27572647.
- PMID 14595094.
- PMID 11010916.
- PMID 16751257.
- PMID 22454494.
- PMID 20348308.
- PMID 11571146.
- PMID 10386377.
- PMID 18369433.
- PMID 21415126.
- PMID 15239832.
- S2CID 1707096.
- PMID 19808789.
- S2CID 1530261.
- PMID 21109532.