Nucleic acid structure prediction
Nucleic acid structure prediction is a computational method to determine secondary and tertiary nucleic acid structure from its sequence. Secondary structure can be predicted from one or several nucleic acid sequences. Tertiary structure can be predicted from the sequence, or by comparative modeling (when the structure of a homologous sequence is known).
The problem of predicting nucleic acid secondary structure is dependent mainly on
While the methods are similar, there are slight differences in the approaches to RNA and DNA structure prediction. In vivo, DNA structures are more likely to be duplexes with full complementarity between two strands, while RNA structures are more likely to fold into complex secondary and tertiary structures such as in the ribosome, spliceosome, or transfer RNA. This is partly because the extra oxygen in RNA increases the propensity for hydrogen bonding in the nucleic acid backbone. The energy parameters are also different for the two nucleic acids. The structure prediction methods can follow a completely theoretical approach, or a hybrid one incorporating experimental data.[1][2]
Single sequence structure prediction
A common problem for researchers working with RNA is to determine the three-dimensional structure of the molecule given only a nucleic acid sequence. However, in the case of RNA much of the final structure is determined by the
The most stable structure
Secondary structure of small RNA molecules is largely determined by strong, local interactions such as
The simplest way to find the lowest free energy structure would be to generate all possible structures and calculate the free energy for it, but the number of possible structures for a sequence increases exponentially with the length of RNA: number of secondary structures = (1,8)N, N- number of nucleotides
.[6] For longer molecules, the number of possible secondary structures is huge: a sequence of 100 nucleotides has more than 1025 possible secondary structures.[3]
Dynamic programming algorithms
Most popular methods for predicting RNA and DNA's secondary structure involve dynamic programming.[7][8] One of the early attempts at predicting RNA secondary structure was made by Ruth Nussinov and co-workers who developed a dynamic programming-based algorithm that maximized the length and number of a series of "blocks" (polynucleotide chains).[7] Each "block" required at least two nucleotides, which reduced the algorithm's storage requirements over single base-matching approaches.[7] Nussinov et al. later published an adapted approach with improved performance that increased the RNA size limit to ~1,000 bases by folding increasingly sized subsections while storing the results of prior folds, now known as the Nussinov algorithm.[8] In 1981, Michael Zuker and Patrick Stiegler proposed a refined approach with performance comparable to Nussinov et al.'s solution but with the additional ability to find also find "suboptimal" secondary structures.[9]
Dynamic programming algorithms provide a means to implicitly check all variants of possible RNA secondary structures without explicitly generating the structures. First, the lowest conformational free energy is determined for each possible sequence fragment starting with the shortest fragments and then for longer fragments. For longer fragments, recursion on the optimal free energy changes determined for shorter sequences speeds the determination of the lowest folding free energy. Once the lowest free energy of the complete sequence is calculated, the exact structure of RNA molecule is determined.[3]
Dynamic programming algorithms are commonly used to detect
Suboptimal structures
The accuracy of RNA secondary structure prediction from one sequence by free energy minimization is limited by several factors:
- The free energy value's list in nearest neighbor model is incomplete
- Not all known RNA folds in such a way as to conform with the thermodynamic minimum.
- Some RNA sequences have more than one biologically active conformation (i.e., riboswitches)
For this reason, the ability to predict structures which have similar low free energy can provide significant information. Such structures are termed suboptimal structures. MFOLD is one program that generates suboptimal structures.[11]
Predicting pseudoknots
One of the issues when predicting RNA secondary structure is that the standard free energy minimization and statistical sampling methods can not find pseudoknots.[5] The major problem is that the usual dynamic programing algorithms, when predicting secondary structure, consider only the interactions between the closest nucleotides, while pseudoknotted structures are formed due to interactions between distant nucleotides. Rivas and Eddy published a dynamic programming algorithm for predicting pseudoknots.[10] However, this dynamic programming algorithm is very slow. The standard dynamic programming algorithm for free energy minimization scales O(N3) in time (N is the number of nucleotides in the sequence), while the Rivas and Eddy algorithm scales O(N6) in time. This has prompted several researchers to implement versions of the algorithm that restrict classes of pseudoknots, resulting in performance gains. For example, pknotsRG tool includes only the class of simple recursive pseudoknots and scales O(N4) in time.[12]
Other approaches for RNA secondary structure prediction
Another approach for RNA secondary structure determination is to sample structures from the
Comparative secondary structure prediction
Sequence covariation methods rely on the existence of a data set composed of multiple
In general, the problem of alignment and consensus structure prediction are closely related. Three different approaches to the prediction of consensus structures can be distinguished:[16]
- Folding of alignment
- Simultaneous sequence alignment and folding
- Alignment of predicted structures
Align then fold
A practical
Align and fold
Evolution frequently preserves functional RNA structure better than RNA sequence.[17] Hence, a common biological problem is to infer a common structure for two or more highly diverged but homologous RNA sequences. In practice, sequence alignments become unsuitable and do not help to improve the accuracy of structure prediction, when sequence similarity of two sequences is less than 50%.[20]
Structure-based alignment programs improves the performance of these alignments and most of them are variants of the Sankoff algorithm.[21] Basically, Sankoff algorithm is a merger of sequence alignment and Nussinov [7] (maximal-pairing) folding dynamic programming method.[22] Sankoff algorithm itself is a theoretical exercise because it requires extreme computational resources (O(n3m) in time, and O(n2m) in space, where n is the sequence length and m is the number of sequences). Some notable attempts at implementing restricted versions of Sankoff's algorithm are Foldalign,[23][24] Dynalign,[25][26] PMmulti/PMcomp,[22] Stemloc,[27] and Murlet.[28] In these implementations the maximal length of alignment or variants of possible consensus structures are restricted. For example, Foldalign focuses on local alignments and restricts the possible length of the sequences alignment.
Fold then align
A less widely used approach is to fold the sequences using single sequence structure prediction methods and align the resulting structures using tree-based metrics.[29] The fundamental weakness with this approach is that single sequence predictions are often inaccurate, thus all further analyses are affected.
Tertiary structure prediction
Once secondary structure of RNA is known, the next challenge is to predict
The three-dimensional structure prediction methods can use comparative modeling which starts from a related known structure known as the template.[34] The alternative strategy is de novo modeling of RNA secondary structure[35] which uses physics-based principles such as molecular dynamics[36] or random sampling of the conformational landscape[37] followed by screening with a statistical potential for scoring.[38] These methods either use an all-atom representation[39] of the nucleic acid structure or a coarse-grained representation.[40] The low-resolution structures generated by many of these modeling methods are then subjected to high-resolution refinement.[41]
See also
- RNA
- RNA structure
- Non-coding RNA
- List of RNA structure prediction software
- Comparison of nucleic acid simulation software
- Comparison of software for molecular mechanics modeling
References
- PMID 30670629.
- PMID 24785264.
- ^ PMID 16500677.
- PMID 15123812.
- ^ S2CID 19989405.
- S2CID 189885784.)
{{cite journal}}
: CS1 maint: DOI inactive as of February 2024 (link - ^ a b c d Nussinov R, Piecznik G, Grigg JR and Kleitman DJ (1978) Algorithms for loop matchings. SIAM Journal on Applied Mathematics.
- ^ PMID 6161375.
- PMID 6163133.
- ^ S2CID 2228845.
- PMID 12824337.
- PMID 15294028.
- S2CID 12629688.
- ^ PMID 14654704.
- PMID 11108471.
- PMID 15458580.
- ^ PMID 12079347.
- PMID 12824339.
- ^ Ruan, J., Stormo, G.D. & Zhang, W. (2004) ILM: a web server for predicting RNA secondary structures with pseudoknots. Nucleic Acids Research, 32(Web Server issue), W146-149.
- PMID 19833701.
- doi:10.1137/0145048.
- ^ PMID 15073017.
- PMID 15657094.
- ^ Torarinsson E, Havgaard JH, Gorodkin J. (2007) Multiple structural alignment and clustering of RNA sequences. Bioinformatics.
- PMID 11902836.
- ^ Harmanci AO, Sharma G, Mathews DH, (2007), Efficient Pairwise RNA Structure Prediction Using Probabilistic Alignment Constraints in Dynalign, BMC Bioinformatics, 8(130).
- ^ Holmes I. (2005) Accelerated probabilistic inference of RNA structure evolution. BMC Bioinformatics. 2005 Mar 24;6:73.
- PMID 17459961.
- ^ Shapiro BA and Zhang K (1990) Comparing Multiple RNA Secondary Structures Using Tree Comparisons Computer Applications in the Biosciences, vol. 6, no. 4, pp. 309–318.
- ^ Shapiro BA, Yingling YG, Kasprzak W, Bindewald E. (2007) Bridging the gap in RNA structure prediction. Curr Opin Struct Biol.
- PMID 1716375.
- PMID 8415714.
- PMID 19543381.
- PMID 21300639.
- OCLC 795570014.
- S2CID 35501632.
- PMID 18573079.
- PMID 21514143.
- PMID 28986779
- PMID 26687716.
- PMID 30898165.
Further reading
- Baker D, Sali A (2001). "Protein structure prediction and structural genomics". Science. 294 (5540): 93–6. S2CID 7193705.
- Chiu D.K.; Kolodziejczak T. (1991). "Inferring consensus structure from nucleic acid sequences". Comput. Appl. Biosci. 7 (3): 347–352. PMID 1913217.
- Do CB, Woods DA, Batzoglou S (2006). "CONTRAfold: RNA secondary structure prediction without physics-based models". Bioinformatics. 22 (14): e90–8. PMID 16873527.
- Gutell R.R.; et al. (1992). "Identifying constraints on the higher-order structure of RNA: continued development and application of comparative sequence analysis methods". Nucleic Acids Res. 20 (21): 5785–5795. PMID 1454539.
- Leontis NB, Lescoute A, Westhof E (2006). "The building blocks and motifs of RNA architecture". Curr Opin Struct Biol. 16 (3): 279–87. PMID 16713707.
- Lindgreen S, Gardner PP, Krogh A (2006). "Measuring covariation in RNA alignments: physical realism improves information measures". Bioinformatics. 22 (24): 2988–95. PMID 17038338.
- Lorenz, Ronny (2014). RNA secondary structure thermodynamics and kinetics. Vienna, Austria: University of Vienna, Dissertation.
- Macke T, Case D (1998). "Modeling Unusual Nucleic Acid Structures". Modeling unusual nucleic acid structures. In Molecular Modeling of Nucleic Acids. Edited by Leontes N, SantaLucia JJ. Washington, DC. ACS Symposium Series. Vol. 682. American Chemical Society. pp. 379–393. ISBN 978-0-8412-3541-0.
- Major F (2003). "Building three-dimensional ribonucleic acid structures". Computing in Science & Engineering. 2003 (5): 44–53. S2CID 17627934.
- Massire C, Westhof E. "MANIP: an interactive tool for modelling RNA". J Mol Graph Model. 1998 (16): 197–205, 255–257.
- Parisien M.; Major F. (2008). "The MC-Fold and MC-Sym pipeline infers RNA structure from sequence data". Nature. 452 (7183): 51–55. S2CID 4415777.
- Tuzet, H. & Perriquet, O., 2004. CARNAC: folding families of related RNAs. Nucleic Acids Research, 32(Web Server issue), W142-145.
- Touzet H (2007). "Comparative Analysis of RNA Genes". Comparative Genomics. S2CID 244726.
- Yingling YG, Shapiro BA (2006). "The prediction of the wild-type telomerase RNA pseudoknot structure and the pivotal role of the bulge in its formation". J Mol Graph Model. 25 (2): 261–274. PMID 16481205.
- Zwieb C, Muller F (1997). "Three-dimensional comparative modeling of RNA". Nucleic Acids Symp Ser. 36 (36): 69–71. PMID 9478210.
- ModeRNA: A program for comparative RNA modeling