Computational phylogenetics

Computational phylogenetics, phylogeny inference, or phylogenetic inference focuses on the computational and optimization algorithms, heuristics, and approaches involved in phylogenetic analyses. Maximum likelihood, parsimony, Bayesian, and minimum evolution are typical optimality criteria used to assess how well a phylogenetic tree topology describes the sequence data.^[1]^[2] Nearest Neighbour Interchange (NNI), Subtree Prune and Regraft (SPR), and Tree Bisection and Reconnection (TBR), known as tree rearrangements, are deterministic algorithms to search for the optimal phylogenetic tree. The space and the landscape of searching for the optimal phylogenetic tree is known as phylogeny search space.

The maximum likelihood (also likelihood) optimality criterion is the process of finding the tree topology, along with its branch lengths, that provides the highest probability of observing the sequence data, while the parsimony optimality criterion is the fewest number of state-evolutionary changes required for a phylogenetic tree to explain the sequence data.^[1]^[2]

Traditional phylogenetics relies on morphological data obtained by measuring and quantifying the phenotypic properties of representative organisms, while the more recent field of molecular phylogenetics uses nucleotide sequences encoding genes or amino acid sequences encoding proteins as the basis for classification.

Many forms of molecular phylogenetics are closely related to and make extensive use of

evolutionary tree that represents the historical relationships between the species being analyzed.^{[citation needed]} The historical species tree may also differ from the historical tree of an individual homologous gene shared by those species.

Types of phylogenetic trees and networks

leaf nodes and their distances from the root proportional to their genetic distance from the hypothesized MRCA. Identification of a root usually requires the inclusion in the input data of at least one "outgroup" known to be only distantly related to the sequences of interest.^{[citation needed]}

By contrast, unrooted trees plot the distances and relationships between input sequences without making assumptions regarding their descent. An unrooted tree can always be produced from a rooted tree, but a root cannot usually be placed on an unrooted tree without additional data on divergence rates, such as the assumption of the molecular clock hypothesis.^[3]

The set of all possible phylogenetic trees for a given group of input sequences can be conceptualized as a discretely defined multidimensional "tree space" through which search paths can be traced by optimization algorithms. Although counting the total number of trees for a nontrivial number of input sequences can be complicated by variations in the definition of a tree topology, it is always true that there are more rooted than unrooted trees for a given number of inputs and choice of parameters.^[2]
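The difference between the two counts can be made concrete under one common definition of a topology: strictly bifurcating trees with labelled leaves. For that case the standard closed forms are (2n-5)!! unrooted and (2n-3)!! rooted topologies on n taxa. The sketch below assumes that definition; the function names are illustrative.

```python
def double_factorial(k: int) -> int:
    """Product k * (k - 2) * (k - 4) * ... down to 1 (taken as 1 for k <= 1)."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def count_unrooted(n_taxa: int) -> int:
    """Unrooted bifurcating topologies on n_taxa labelled leaves (n_taxa >= 3)."""
    if n_taxa < 3:
        raise ValueError("need at least three taxa")
    return double_factorial(2 * n_taxa - 5)


def count_rooted(n_taxa: int) -> int:
    """Rooted bifurcating topologies on n_taxa labelled leaves (n_taxa >= 2)."""
    if n_taxa < 2:
        raise ValueError("need at least two taxa")
    return double_factorial(2 * n_taxa - 3)


for n in (4, 5, 10):
    print(n, count_unrooted(n), count_rooted(n))
```

An unrooted bifurcating tree on n taxa has 2n-3 branches, and a root can be placed on any of them, so under this definition the rooted count exceeds the unrooted count by exactly that factor.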

Both rooted and unrooted phylogenetic trees can be further generalized to rooted or unrooted phylogenetic networks, which allow for the modeling of evolutionary phenomena such as hybridization or horizontal gene transfer.^{[citation needed]}

Coding characters and defining homology

Morphological analysis

The basic problem in morphological phylogenetics is the assembly of a matrix representing a mapping from each of the taxa being compared to representative measurements for each of the phenotypic characteristics being used as a classifier. The types of phenotypic data used to construct this matrix depend on the taxa being compared; for individual species, they may involve measurements of average body size, lengths or sizes of particular bones or other physical features, or even behavioral manifestations. Of course, since not every possible phenotypic characteristic could be measured and encoded for analysis, the selection of which features to measure is a major inherent obstacle to the method. The decision of which traits to use as a basis for the matrix necessarily represents a hypothesis about which traits of a species or higher taxon are evolutionarily relevant.^[4] Morphological studies can be confounded by examples of convergent evolution of phenotypes.^[5] A major challenge in constructing useful classes is the high likelihood of inter-taxon overlap in the distribution of the phenotype's variation. The inclusion of extinct taxa in morphological analysis is often difficult due to absence of or incomplete fossil records, but has been shown to have a significant effect on the trees produced; in one study only the inclusion of extinct species of apes produced a morphologically derived tree that was consistent with that produced from molecular data.^[6]

Some phenotypic classifications, particularly those used when analyzing very diverse groups of taxa, are discrete and unambiguous; classifying organisms as possessing or lacking a tail, for example, is straightforward in the majority of cases, as is counting features such as eyes or vertebrae. However, the most appropriate representation of continuously varying phenotypic measurements is a controversial problem without a general solution. A common method is simply to sort the measurements of interest into two or more classes, rendering continuous observed variation as discretely classifiable (e.g., all examples with humerus bones longer than a given cutoff are scored as members of one state, and all members whose humerus bones are shorter than the cutoff are scored as members of a second state). This results in an easily manipulated data set but has been criticized for poor reporting of the basis for the class definitions and for sacrificing information compared to methods that use a continuous weighted distribution of measurements.^[7]

Because morphological data is extremely labor-intensive to collect, whether from literature sources or from field observations, reuse of previously compiled data matrices is not uncommon, although this may propagate flaws in the original matrix into multiple derivative analyses.^[8]

Molecular analysis

The problem of character coding is very different in molecular analyses, as the characters in biological sequence data are immediate and discretely defined - distinct nucleotides in DNA or RNA sequences and distinct amino acids in protein sequences. However, defining homology can be challenging due to the inherent difficulties of multiple sequence alignment. For a given gapped MSA, several rooted phylogenetic trees can be constructed that vary in their interpretations of which changes are "mutations" versus ancestral characters, and which events are insertion mutations or deletion mutations. For example, given only a pairwise alignment with a gap region, it is impossible to determine whether one sequence bears an insertion mutation or the other carries a deletion. The problem is magnified in MSAs with unaligned and nonoverlapping gaps. In practice, sizable regions of a calculated alignment may be discounted in phylogenetic tree construction to avoid integrating noisy data into the tree calculation.^{[citation needed]}

Multigene phylogeny

A tree built on a single gene as found in different organisms (orthologs) may not show sufficient phylogenetic signal for drawing strong conclusions. More genes can be added by concatenating their respective multiple sequence alignments into a "supermatrix", effectively creating a huge virtual gene with more evolutionary changes available for tree inference. This naive method only works well on genes with similar evolutionary histories; for more complex cases (organellar+nuclear datasets or joint amino acid+nucleotide alignments), some algorithms can be told where each gene starts and ends (data partitioning). Alternatively, one can infer several single-gene trees and combine them into a "supertree". With the advent of phylogenomics, hundreds of genes may be analyzed at once.^[9]

Distance-matrix methods

Distance-matrix methods of phylogenetic analysis explicitly rely on a measure of "genetic distance" between the sequences being classified, and therefore, they require an MSA as an input. Distance is often defined as the fraction of mismatches at aligned positions, with gaps either ignored or counted as mismatches.
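The mismatch-fraction distance described above (often called the p-distance) can be sketched as follows. This is a minimal illustration, assuming sequences are rows of the same alignment and that `-` marks a gap; the function names and the `gaps_as_mismatch` switch are hypothetical.

```python
from itertools import combinations


def p_distance(seq_a: str, seq_b: str, gaps_as_mismatch: bool = False) -> float:
    """Fraction of mismatching columns between two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must come from the same alignment")
    compared = mismatches = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" and b == "-":
            continue  # empty in both rows: carries no information
        if a == "-" or b == "-":
            if gaps_as_mismatch:
                compared += 1
                mismatches += 1
            continue  # otherwise the gapped column is ignored
        compared += 1
        mismatches += a != b
    if compared == 0:
        raise ValueError("no comparable columns")
    return mismatches / compared


def distance_matrix(alignment: dict[str, str], **kwargs) -> dict[tuple[str, str], float]:
    """All pairwise p-distances, keyed by sorted (name, name) pairs."""
    return {
        (x, y): p_distance(alignment[x], alignment[y], **kwargs)
        for x, y in combinations(sorted(alignment), 2)
    }


toy = {"A": "ACGTACGT", "B": "ACGTACGA", "C": "AC-TTCGA"}
print(distance_matrix(toy))
print(distance_matrix(toy, gaps_as_mismatch=True))
```

The two gap policies give different matrices for the same alignment, so the choice should be reported alongside any tree built from the distances.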

interior node and whose branch lengths closely reproduce the observed distances between sequences. Distance-matrix methods may produce either rooted or unrooted trees, depending on the algorithm used to calculate them. They are frequently used as the basis for progressive and iterative types of multiple sequence alignments. The main disadvantage of distance-matrix methods is their inability to efficiently use information about local high-variation regions that appear across multiple subtrees.^[2]

UPGMA and WPGMA

The

ultrametric tree in which the distances from the root to every branch tip are equal.^[10]

Neighbor-joining

Neighbor-joining methods apply general

neighbor-joining method produces unrooted trees, but it does not assume a constant rate of evolution (i.e., a molecular clock) across lineages.^[11]

Fitch–Margoliash method

The

Jukes-Cantor model of DNA evolution. The distance correction is only necessary in practice when the evolution rates differ among branches.^[2] Another modification of the algorithm can be helpful, especially when distances are concentrated (see the concentration of measure phenomenon and the curse of dimensionality); this modification^[13] has been shown to improve the efficiency and robustness of the algorithm.

The least-squares criterion applied to these distances is more accurate but less efficient than the neighbor-joining methods. An additional improvement that corrects for correlations between distances that arise from many closely related sequences in the data set can also be applied at increased computational cost. Finding the optimal least-squares tree with any correction factor is NP-complete,^[14] so heuristic search methods like those used in maximum-parsimony analysis are applied to the search through tree space.

Using outgroups

Independent information about the relationship between sequences or groups can be used to help reduce the tree search space and root unrooted trees. Standard usage of distance-matrix methods involves the inclusion of at least one

conserved across lineages. Horizontal gene transfer, especially between otherwise divergent bacteria, can also confound outgroup usage.^{[citation needed]}

Maximum parsimony

Maximum parsimony (MP) is a method of identifying the potential phylogenetic tree that requires the smallest total number of evolutionary events to explain the observed sequence data. Some ways of scoring trees also include a "cost" associated with particular types of evolutionary events and attempt to locate the tree with the smallest total cost. This is a useful approach in cases where not every possible type of event is equally likely - for example, when particular nucleotides or amino acids are known to be more mutable than others.

The most naive way of identifying the most parsimonious tree is simple enumeration - considering each possible tree in succession and searching for the tree with the smallest score. However, this is only possible for a relatively small number of sequences or species because the problem of identifying the most parsimonious tree is known to be NP-hard; consequently, heuristic search methods are used instead, most of which involve a steepest descent-style minimization mechanism operating on a tree rearrangement criterion.
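For a very small data set, enumeration can be carried out directly. The sketch below scores each of the three possible topologies for four taxa and keeps the lowest-scoring one. It assumes equal cost for every change and uses Fitch's algorithm to count changes on a fixed tree, which is one standard choice rather than something prescribed here; the nested-tuple tree encoding, the function names, and the toy alignment are illustrative.

```python
def fitch_site(tree, states):
    """Return (candidate_states, changes) for one character on a binary tree.

    A leaf is a taxon name (str); an internal node is a pair of subtrees.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch_site(tree[0], states)
    right_set, right_cost = fitch_site(tree[1], states)
    common = left_set & right_set
    if common:  # children can share a state: no extra change needed
        return common, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, alignment):
    """Minimum number of changes summed over all alignment columns."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        fitch_site(tree, {name: seq[i] for name, seq in alignment.items()})[1]
        for i in range(n_sites)
    )


def quartet_topologies(a, b, c, d):
    """The three topologies for four taxa, each rooted arbitrarily for scoring."""
    return [((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))]


alignment = {"A": "AAGC", "B": "AAGT", "C": "CTGC", "D": "CTAT"}
scores = {t: parsimony_score(t, alignment) for t in quartet_topologies("A", "B", "C", "D")}
best = min(scores, key=scores.get)
print(scores)
print("most parsimonious:", best)
```

With equal costs the score does not depend on where the tree is rooted, which is why an arbitrary rooting is enough here. The same loop becomes impractical quickly as taxa are added, which is the motivation for the heuristic searches.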

Branch and bound

The

NP-hard problems first applied to phylogenetics in the early 1980s.^[15] Branch and bound is particularly well suited to phylogenetic tree construction because it inherently requires dividing a problem into a tree structure as it subdivides the problem space into smaller regions. As its name implies, it requires as input both a branching rule (in the case of phylogenetics, the addition of the next species or sequence to the tree) and a bound (a rule that excludes certain regions of the search space from consideration, thereby assuming that the optimal solution cannot occupy that region). Identifying a good bound is the most challenging aspect of the algorithm's application to phylogenetics. A simple way of defining the bound is a maximum number of assumed evolutionary changes allowed per tree. A set of criteria known as Zharkikh's rules^[16] severely limit the search space by defining characteristics shared by all candidate "most parsimonious" trees. The two most basic rules require the elimination of all but one redundant sequence (for cases where multiple observations have produced identical data) and the elimination of character sites at which two or more states do not occur in at least two species. Under ideal conditions these rules and their associated algorithm would completely define a tree.

Sankoff-Morel-Cedergren algorithm

The Sankoff-Morel-Cedergren algorithm was among the first published methods to simultaneously produce an MSA and a phylogenetic tree for nucleotide sequences.

interior nodes of the tree are scored and summed over all the nodes in each possible tree. The lowest-scoring tree sum provides both an optimal tree and an optimal MSA given the scoring function. Because the method is highly computationally intensive, an approximate method can be used instead, in which initial guesses for the interior alignments are refined one node at a time. Both the full and the approximate version are in practice calculated by dynamic programming.^[2]

MALIGN and POY

More recent phylogenetic tree/MSA methods use heuristics to isolate high-scoring, but not necessarily optimal, trees. The MALIGN method uses a maximum-parsimony technique to compute a multiple alignment by maximizing a cladogram score, and its companion POY uses an iterative method that couples the optimization of the phylogenetic tree with improvements in the corresponding MSA.^[19] However, the use of these methods in constructing evolutionary hypotheses has been criticized as biased due to the deliberate construction of trees reflecting minimal evolutionary events.^[20] This, in turn, has been countered by the view that such methods should be seen as heuristic approaches to find the trees that maximize the amount of sequence similarity that can be interpreted as homology.^[18]^[21]

Maximum likelihood

The

statistically independent. Maximum likelihood is thus well suited to the analysis of distantly related sequences, but it is believed to be computationally intractable due to its NP-hardness.^[22]

The "pruning" algorithm, a variant of

Newton–Raphson method

are often used.
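The pruning algorithm named above computes the likelihood of a fixed tree by passing conditional likelihoods from the leaves up to the root. A minimal sketch follows, assuming the Jukes-Cantor (JC69) substitution model, independent sites, and branch lengths in expected substitutions per site; the tree encoding and all function names are hypothetical, and no branch-length optimization is attempted.

```python
import math

BASES = "ACGT"


def jc69_prob(i: str, j: str, t: float) -> float:
    """JC69 probability of state j after branch length t, starting from state i."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if i == j else 0.25 - 0.25 * decay


def partial_likelihoods(node, site_states):
    """Likelihood of the data below `node`, for each possible state at `node`.

    A leaf is a taxon name (str); an internal node is a tuple of
    (child, branch_length) pairs.
    """
    if isinstance(node, str):
        return {x: 1.0 if x == site_states[node] else 0.0 for x in BASES}
    result = {x: 1.0 for x in BASES}
    for child, length in node:
        below = partial_likelihoods(child, site_states)
        for x in BASES:
            result[x] *= sum(jc69_prob(x, y, length) * below[y] for y in BASES)
    return result


def site_likelihood(root, site_states) -> float:
    """Sum over root states, weighted by the uniform JC69 base frequencies."""
    partial = partial_likelihoods(root, site_states)
    return sum(0.25 * partial[x] for x in BASES)


def log_likelihood(root, alignment) -> float:
    """Sites are treated as independent, so per-site log-likelihoods add."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        math.log(site_likelihood(root, {k: v[i] for k, v in alignment.items()}))
        for i in range(n_sites)
    )


inner = (("A", 0.1), ("B", 0.1))
root = ((inner, 0.05), ("C", 0.2))
print(log_likelihood(root, {"A": "ACGT", "B": "ACGA", "C": "ACTA"}))
```

Each internal node is visited once per site, so the cost grows linearly with the number of taxa for a given tree; the expensive part of a maximum likelihood analysis is the search over topologies and branch lengths, not this evaluation.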

Some tools that use maximum likelihood to infer phylogenetic trees from variant allelic frequency data (VAFs) include AncesTree and CITUP.^[23]^[24]

Bayesian inference

Bayesian inference can be used to produce phylogenetic trees in a manner closely related to the maximum likelihood methods. Bayesian methods assume a prior probability distribution of the possible trees, which may simply be the probability of any one tree among all the possible trees that could be generated from the data, or may be a more sophisticated estimate derived from the assumption that divergence events such as speciation occur as stochastic processes. The choice of prior distribution is a point of contention among users of Bayesian-inference phylogenetics methods.^[2]

Implementations of Bayesian methods generally use

internal node between two related trees.^[26] The use of Bayesian methods in phylogenetics has been controversial, largely due to incomplete specification of the choice of move set, acceptance criterion, and prior distribution in published work.^[2] Bayesian methods are generally held to be superior to parsimony-based methods; they can be more prone to long-branch attraction than maximum likelihood techniques,^[27] although they are better able to accommodate missing data.^[28]

Whereas likelihood methods find the tree that maximizes the probability of the data, a Bayesian approach recovers a tree that represents the most likely clades, by drawing on the posterior distribution. However, estimates of the posterior probability of clades (measuring their 'support') can be quite wide of the mark, especially in clades that are not overwhelmingly likely. As such, other methods have been put forward to estimate posterior probability.^[29]

Some tools that use Bayesian inference to infer phylogenetic trees from variant allelic frequency data (VAFs) include Canopy, EXACT, and PhyloWGS.^[30]^[31]^[32]

Model selection

Molecular phylogenetics methods rely on a defined substitution model that encodes a hypothesis about the relative rates of mutation at various sites along the gene or amino acid sequences being studied. At their simplest, substitution models aim to correct for differences in the rates of transitions and transversions in nucleotide sequences. The use of substitution models is necessitated by the fact that the genetic distance between two sequences increases linearly only for a short time after the two sequences diverge from each other (alternatively, the distance is linear only shortly before coalescence). The longer the amount of time after divergence, the more likely it becomes that two mutations occur at the same nucleotide site. Simple genetic distance calculations will thus undercount the number of mutation events that have occurred in evolutionary history. The extent of this undercount increases with increasing time since divergence, which can lead to the phenomenon of long branch attraction, or the misassignment of two distantly related but convergently evolving sequences as closely related.^[33] The maximum parsimony method is particularly susceptible to this problem due to its explicit search for a tree representing a minimum number of distinct evolutionary events.^[2]
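The undercount can be illustrated with the simplest correction, the Jukes-Cantor formula d = -(3/4) ln(1 - 4p/3), which converts an observed fraction of differing sites p into an estimated number of substitutions per site d. The sketch below assumes that model; the function name is illustrative.

```python
import math


def jc69_distance(p: float) -> float:
    """Jukes-Cantor corrected distance (substitutions per site) from observed p."""
    if not 0.0 <= p < 0.75:
        raise ValueError("the JC69 correction is undefined for p >= 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


for p in (0.01, 0.10, 0.30, 0.60):
    d = jc69_distance(p)
    print(f"observed {p:.2f}  corrected {d:.3f}  hidden changes {d - p:.3f}")
```

For small p the corrected and observed values nearly coincide, which matches the statement that distance grows linearly only shortly after divergence. As p approaches 0.75, the level expected for unrelated sequences under this model, the correction grows without bound.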

Types of models

All substitution models assign a set of weights to each possible change of state represented in the sequence. The most common model types are implicitly reversible because they assign the same weight to, for example, a G>C nucleotide mutation as to a C>G mutation. The simplest possible model, the Jukes-Cantor model, assigns an equal probability to every possible change of state for a given nucleotide base. The rate of change between any two distinct nucleotides will be one-third of the overall substitution rate.^[2] More advanced models distinguish between transitions and transversions. The most general possible time-reversible model, called the GTR model, has six mutation rate parameters. An even more generalized model known as the general 12-parameter model breaks time-reversibility, at the cost of much additional complexity in calculating genetic distances that are consistent among multiple lineages.^[2] One possible variation on this theme adjusts the rates so that overall GC content - an important measure of DNA double helix stability - varies over time.^[34]

Models may also allow for the variation of rates with positions in the input sequence. The most obvious example of such variation follows from the arrangement of nucleotides in protein-coding genes into three-base codons. If the location of the open reading frame (ORF) is known, rates of mutation can be adjusted for position of a given site within a codon, since it is known that wobble base pairing can allow for higher mutation rates in the third nucleotide of a given codon without affecting the codon's meaning in the genetic code.^[33] A less hypothesis-driven example that does not rely on ORF identification simply assigns to each site a rate randomly drawn from a predetermined distribution, often the gamma distribution or log-normal distribution.^[2] Finally, a more conservative estimate of rate variations known as the covarion method allows autocorrelated variations in rates, so that the mutation rate of a given site is correlated across sites and lineages.^[35]
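Drawing per-site rates from a gamma distribution can be sketched with the standard library alone. The example assumes the common convention of fixing the mean rate at 1, so that only the shape parameter (here called `alpha`) controls how unevenly rates are spread across sites; the function name and the chosen values are illustrative.

```python
import random


def draw_site_rates(n_sites: int, alpha: float, seed: int = 0) -> list[float]:
    """Draw one relative rate per site from a gamma distribution with mean 1."""
    rng = random.Random(seed)
    # shape = alpha and scale = 1 / alpha, so the mean (shape * scale) is 1
    return [rng.gammavariate(alpha, 1.0 / alpha) for _ in range(n_sites)]


rates = draw_site_rates(10_000, alpha=0.5)
print("mean rate:", sum(rates) / len(rates))
print("share of sites with rate < 0.1:", sum(r < 0.1 for r in rates) / len(rates))
```

Small values of the shape parameter concentrate most sites near a rate of zero while a few sites evolve quickly; large values make the rates nearly equal.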

Choosing the best model

The selection of an appropriate model is critical for the production of good phylogenetic analyses, both because underparameterized or overly restrictive models may produce aberrant behavior when their underlying assumptions are violated, and because overly complex or overparameterized models are computationally expensive and the parameters may be overfit.

One method of model selection is the likelihood ratio test (LRT), which produces a likelihood estimate that can be interpreted as a measure of "goodness of fit" between the model and the input data.^[33] However, care must be taken in using these results, since a more complex model with more parameters will always have a higher likelihood than a simplified version of the same model, which can lead to the naive selection of models that are overly complex.^[2] For this reason model selection computer programs will choose the simplest model that is not significantly worse than more complex substitution models. A significant disadvantage of the LRT is the necessity of making a series of pairwise comparisons between models; it has been shown that the order in which the models are compared has a major effect on the one that is eventually selected.^[36]

An alternative model selection method is the Akaike information criterion (AIC), formally an estimate of the Kullback–Leibler divergence between the true model and the model being tested. It can be interpreted as a likelihood estimate with a correction factor to penalize overparameterized models.^[33] The AIC is calculated on an individual model rather than a pair, so it is independent of the order in which models are assessed. A related alternative, the Bayesian information criterion (BIC), has a similar basic interpretation but penalizes complex models more heavily.^[33] Determining the most suitable model for phylogeny reconstruction constitutes a fundamental step in numerous evolutionary studies. However, various criteria for model selection are leading to debate over which criterion is preferable. It has recently been shown that, when topologies and ancestral sequence reconstruction are the desired output, choosing one criterion over another is not crucial. Instead, using the most complex nucleotide substitution model, GTR+I+G, leads to similar results for the inference of tree topology and ancestral sequences.^[37]
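Both criteria are simple functions of the maximized log-likelihood ln L, the number of free parameters k, and (for the BIC) the sample size n: AIC = 2k - 2 ln L and BIC = k ln n - 2 ln L. The sketch below applies them to three candidate models. The log-likelihood values are invented for illustration, the alignment length is used as n, and only substitution-model parameters are counted on the assumption that the tree and branch lengths are shared by all candidates.

```python
import math


def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike information criterion; lower is better."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood: float, n_params: int, n_sites: int) -> float:
    """Bayesian information criterion; lower is better."""
    return n_params * math.log(n_sites) - 2 * log_likelihood


# Hypothetical fits: model -> (maximized log-likelihood, free parameters).
fits = {"JC69": (-5230.4, 0), "HKY85": (-5101.7, 4), "GTR": (-5095.0, 8)}
n_sites = 1200

best_aic = min(fits, key=lambda m: aic(*fits[m]))
best_bic = min(fits, key=lambda m: bic(*fits[m], n_sites))
print("AIC selects", best_aic)
print("BIC selects", best_bic)
```

With these made-up numbers the two criteria disagree: the AIC accepts the extra parameters of the richer model, while the heavier per-parameter penalty of the BIC favors the intermediate one. Each model is scored on its own, so the result does not depend on the order of comparison.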

A comprehensive step-by-step protocol on constructing phylogenetic trees, including DNA/Amino Acid contiguous sequence assembly, multiple sequence alignment, model-test (testing best-fitting substitution models) and phylogeny reconstruction using Maximum Likelihood and Bayesian Inference, is available at Protocol Exchange.^[38]

A non-traditional way of evaluating a phylogenetic tree is to compare it with a clustering result. One can use a multidimensional scaling technique, so-called Interpolative Joining, to perform dimensionality reduction and visualize the clustering result for the sequences in 3D, and then map the phylogenetic tree onto the clustering result. A better tree usually has a higher correlation with the clustering result.^[39]

Evaluating tree support

As with all statistical analysis, the estimation of phylogenies from character data requires an evaluation of confidence. A number of methods exist to test the amount of support for a phylogenetic tree, either by evaluating the support for each sub-tree in the phylogeny (nodal support) or evaluating whether the phylogeny is significantly different from other possible trees (alternative tree hypothesis tests).

Nodal support

The most common method for assessing tree support is to evaluate the statistical support for each node on the tree. Typically, a node with very low support is not considered valid in further analysis, and visually may be collapsed into a polytomy to indicate that relationships within a clade are unresolved.

Consensus tree

Many methods for assessing nodal support involve consideration of multiple phylogenies. The consensus tree summarizes the nodes that are shared among a set of trees.^[40] In a *strict consensus,* only nodes found in every tree are shown, and the rest are collapsed into an unresolved polytomy. Less conservative methods, such as the *majority-rule consensus* tree, consider nodes that are supported by a given percentage of trees under consideration (such as at least 50%).

For example, in maximum parsimony analysis, there may be many trees with the same parsimony score. A strict consensus tree would show which nodes are found in all equally parsimonious trees, and which nodes differ. Consensus trees are also used to evaluate support on phylogenies reconstructed with Bayesian inference (see below).
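Both kinds of consensus reduce to counting how often each clade occurs. In the sketch below a tree is represented simply as the set of its clades, each clade being a frozenset of taxon names; this encoding, the function names, and the three example trees are assumptions made for illustration. The majority-rule branch keeps clades found in strictly more than the threshold fraction of trees, a choice that guarantees the retained clades are mutually compatible at 50%.

```python
from collections import Counter


def clade_frequencies(trees):
    """Fraction of trees containing each clade."""
    counts = Counter(clade for tree in trees for clade in tree)
    return {clade: n / len(trees) for clade, n in counts.items()}


def consensus_clades(trees, threshold=1.0):
    """Clades retained: threshold 1.0 is strict consensus, 0.5 is majority rule."""
    freqs = clade_frequencies(trees)
    if threshold >= 1.0:
        return {c for c, f in freqs.items() if f == 1.0}
    return {c for c, f in freqs.items() if f > threshold}


# Three rooted trees on taxa A-E, written as sets of non-trivial clades.
trees = [
    {frozenset("AB"), frozenset("CD"), frozenset("CDE")},  # ((A,B),((C,D),E))
    {frozenset("AB"), frozenset("CE"), frozenset("CDE")},  # ((A,B),((C,E),D))
    {frozenset("AB"), frozenset("ABE"), frozenset("CD")},  # (((A,B),E),(C,D))
]
print("strict:", consensus_clades(trees))
print("majority rule:", consensus_clades(trees, threshold=0.5))
```

Here only the clade {A, B} survives strict consensus, while majority rule also retains {C, D} and {C, D, E}, each found in two of the three trees.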

Bootstrapping and jackknifing

In statistics, the bootstrap is a method for inferring the variability of data that has an unknown distribution using pseudoreplications of the original data. For example, given a set of 100 data points, a pseudoreplicate is a data set of the same size (100 points) randomly sampled from the original data, with replacement. That is, each original data point may be represented more than once in the pseudoreplicate, or not at all. Statistical support involves evaluation of whether the original data has similar properties to a large set of pseudoreplicates.

In phylogenetics, bootstrapping is conducted using the columns of the character matrix. Each pseudoreplicate contains the same number of species (rows) and characters (columns) randomly sampled from the original matrix, with replacement. A phylogeny is reconstructed from each pseudoreplicate, with the same methods used to reconstruct the phylogeny from the original data. For each node on the phylogeny, the nodal support is the percentage of pseudoreplicates containing that node.^[41]
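The resampling step and the support calculation can be sketched as follows. Tree inference itself is out of scope here, so `infer_clades` is a caller-supplied function (hypothetical) that maps an alignment to the set of clades in the inferred tree; `closest_pair` is only a toy stand-in so that the example runs.

```python
import random


def bootstrap_alignment(alignment: dict[str, str], rng: random.Random) -> dict[str, str]:
    """Resample alignment columns with replacement, keeping every row (taxon)."""
    n_sites = len(next(iter(alignment.values())))
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[i] for i in columns) for name, seq in alignment.items()}


def bootstrap_support(alignment, infer_clades, n_replicates=100, seed=0):
    """Percentage of pseudoreplicate trees containing each clade of the original tree."""
    rng = random.Random(seed)
    reference = infer_clades(alignment)
    hits = {clade: 0 for clade in reference}
    for _ in range(n_replicates):
        replicate_clades = infer_clades(bootstrap_alignment(alignment, rng))
        for clade in reference:
            hits[clade] += clade in replicate_clades
    return {clade: 100.0 * n / n_replicates for clade, n in hits.items()}


def closest_pair(alignment):
    """Toy stand-in for tree inference: one clade joining the two most similar rows."""
    names = sorted(alignment)
    pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]

    def mismatches(pair):
        return sum(x != y for x, y in zip(alignment[pair[0]], alignment[pair[1]]))

    return {frozenset(min(pairs, key=mismatches))}


toy = {"A": "ACGTACGTAC", "B": "ACGTACGTAC", "C": "TTGAACCTGC"}
print(bootstrap_support(toy, closest_pair))
```

The same inference method is applied to the original matrix and to every pseudoreplicate, as the procedure requires; only the columns change between replicates.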

The statistical rigor of the bootstrap test has been empirically evaluated using viral populations with known evolutionary histories,^[42] finding that 70% bootstrap support corresponds to a 95% probability that the clade exists. However, this was tested under ideal conditions (e.g. no change in evolutionary rates, symmetric phylogenies). In practice, nodes with values above 70% are generally regarded as supported, and evaluating confidence is left to the researcher or reader. Nodes with support lower than 70% are typically considered unresolved.

Jackknifing in phylogenetics is a similar procedure, except the columns of the matrix are sampled without replacement. Pseudoreplicates are generated by randomly subsampling the data—for example, a "10% jackknife" would involve randomly sampling 10% of the matrix many times to evaluate nodal support.

Posterior probability

Reconstruction of phylogenies using Bayesian inference generates a posterior distribution of highly probable trees given the data and evolutionary model, rather than a single "best" tree. The trees in the posterior distribution generally have many different topologies. When the input data is variant allelic frequency data (VAF), the tool EXACT can compute the probabilities of trees exactly, for small, biologically relevant tree sizes, by exhaustively searching the entire tree space.^[30]

Most Bayesian inference methods utilize a Markov-chain Monte Carlo iteration, and the initial steps of this chain are not considered reliable reconstructions of the phylogeny. Trees generated early in the chain are usually discarded as burn-in. The most common method of evaluating nodal support in a Bayesian phylogenetic analysis is to calculate the percentage of trees in the posterior distribution (post-burn-in) which contain the node.
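Given a list of trees sampled along the chain, the support calculation is a matter of discarding the burn-in and counting. The sketch assumes each sampled tree is stored as a set of clades (frozensets of taxon names) and uses a 25% burn-in purely as an illustrative default; both the encoding and the function name are hypothetical.

```python
def posterior_clade_support(sampled_trees, clade, burn_in_fraction=0.25):
    """Share of post-burn-in samples whose tree contains `clade`."""
    start = int(len(sampled_trees) * burn_in_fraction)
    kept = sampled_trees[start:]
    if not kept:
        raise ValueError("no samples left after burn-in")
    return sum(clade in tree for tree in kept) / len(kept)


AB, AC = frozenset("AB"), frozenset("AC")
# Eight samples in chain order; the first two fall inside the burn-in.
chain = [{AC}, {AC}, {AB}, {AB}, {AC}, {AB}, {AB}, {AB}]
print(posterior_clade_support(chain, AB))
```

In this toy chain the early samples favor a clade that later samples rarely contain, so including them would lower the estimated support for {A, B}.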

The statistical support for a node in Bayesian inference is expected to reflect the probability that a clade really exists given the data and evolutionary model.^[43] Therefore, the threshold for accepting a node as supported is generally higher than for bootstrapping.

Step counting methods

Bremer support counts the number of extra steps needed to contradict a clade.

Shortcomings

These measures each have their weaknesses. For example, smaller or larger clades tend to attract larger support values than mid-sized clades, simply as a result of the number of taxa in them.^[44]

Bootstrap support can provide high estimates of node support as a result of noise in the data rather than the true existence of a clade.^[45]

Limitations and workarounds

Ultimately, there is no way to measure whether a particular phylogenetic hypothesis is accurate or not, unless the true relationships among the taxa being examined are already known (which may happen with bacteria or viruses under laboratory conditions). The best result an empirical phylogeneticist can hope to attain is a tree with branches that are well supported by the available evidence. Several potential pitfalls have been identified:

Homoplasy

Certain characters are more likely to

maximum likelihood or Bayesian methods can be used to analyze them. For molecular sequences, this problem is exacerbated when the taxa under study have diverged substantially. As time since the divergence of two taxa increases, so does the probability of multiple substitutions on the same site, or back mutations, all of which result in homoplasies. For morphological data, unfortunately, the only objective way to determine convergence is by the construction of a tree – a somewhat circular method. Even so, weighting homoplasious characters^[how?] does indeed lead to better-supported trees.^[46] Further refinement can be brought by weighting changes in one direction higher than changes in another; for instance, the presence of thoracic wings almost guarantees placement among the pterygote insects because, although wings are often lost secondarily, there is no evidence that they have been gained more than once.^[47]

Horizontal gene transfer

In general, organisms can inherit genes in two ways: vertical gene transfer and horizontal gene transfer.

antibiotic resistance as a result of gene exchange between various bacteria leading to multi-drug-resistant bacterial species. There have also been well-documented cases of horizontal gene transfer between eukaryotes.

Horizontal gene transfer has complicated the determination of phylogenies of organisms, and inconsistencies in phylogeny have been reported among specific groups of organisms depending on the genes used to construct evolutionary trees. The only way to determine which genes have been acquired vertically and which horizontally is to parsimoniously assume that the largest set of genes that have been inherited together have been inherited vertically; this requires analyzing a large number of genes.

Hybrids, speciation, introgressions and incomplete lineage sorting

The basic assumption underlying the mathematical model of cladistics is a situation where species split neatly in bifurcating fashion. While such an assumption may hold on a larger scale (bar horizontal gene transfer, see above), speciation is often much less orderly. Research since the cladistic method was introduced has shown that hybrid speciation, once thought rare, is in fact quite common, particularly in plants.^[48]^[49] Also paraphyletic speciation is common, making the assumption of a bifurcating pattern unsuitable, leading to phylogenetic networks rather than trees.^[50]^[51] Introgression can also move genes between otherwise distinct species and sometimes even genera,^[52] complicating phylogenetic analysis based on genes.^[53] This phenomenon can contribute to "incomplete lineage sorting" and is thought to be a common phenomenon across a number of groups. In species level analysis this can be dealt with by larger sampling or better whole genome analysis.^[54] Often the problem is avoided by restricting the analysis to fewer, not closely related specimens.

Taxon sampling

Owing to the development of advanced sequencing techniques in

mitochondrial genomes (~16,000 nucleotides, in many animals). However, simulations have shown that it is more important to increase the number of taxa in the matrix than to increase the number of characters, because the more taxa there are, the more accurate and more robust is the resulting phylogenetic tree.^[55]^[56] This may be partly due to the breaking up of long branches.

Phylogenetic signal

Another important factor that affects the accuracy of tree reconstruction is whether the data analyzed actually contain a useful phylogenetic signal, a term that is used generally to denote whether a character evolves slowly enough to have the same state in closely related taxa as opposed to varying randomly. Tests for phylogenetic signal exist.^[57]
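One simple way to make this notion testable, offered here only as an illustrative sketch and not as the test cited above, is a permutation test on a single discrete character: count the minimum number of state changes on the tree with Fitch parsimony, then compare that count with the counts obtained after shuffling the states among the tips. The sketch assumes a rooted binary tree given as nested tuples; the taxon names, the number of permutations and the seed are arbitrary choices.

```python
import random


def fitch_steps(tree, states):
    """Minimum number of state changes for one character on a rooted binary tree."""
    steps = 0

    def visit(node):
        nonlocal steps
        if isinstance(node, str):
            return {states[node]}
        left, right = (visit(child) for child in node)
        common = left & right
        if common:
            return common
        steps += 1  # no shared state: one change is needed at this node
        return left | right

    visit(tree)
    return steps


def leaves(tree):
    """List the tip names of a nested-tuple tree."""
    if isinstance(tree, str):
        return [tree]
    return [tip for child in tree for tip in leaves(child)]


def signal_p_value(tree, states, n_perm=999, seed=0):
    """Fraction of tip shufflings needing no more changes than the observed data."""
    rng = random.Random(seed)
    tips = leaves(tree)
    observed = fitch_steps(tree, states)
    values = [states[tip] for tip in tips]
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(values)
        if fitch_steps(tree, dict(zip(tips, values))) <= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


# Hypothetical character that is perfectly clustered on the tree.
tree = ((("A", "B"), ("C", "D")), (("E", "F"), ("G", "H")))
states = {"A": 0, "B": 0, "C": 0, "D": 0, "E": 1, "F": 1, "G": 1, "H": 1}
observed, p_value = signal_p_value(tree, states)
print(observed, p_value)
```

A small p-value indicates that closely related taxa share states more often than expected by chance, which is the sense of "signal" described above.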

Continuous characters

Morphological characters that sample a continuum may contain phylogenetic signal, but are hard to code as discrete characters. Several methods have been used, one of which is gap coding, and there are variations on gap coding.^[58] In the original form of gap coding:^[58]

group means for a character are first ordered by size. The pooled within-group standard deviation is calculated ... and differences between adjacent means ... are compared relative to this standard deviation. Any pair of adjacent means is considered different and given different integer scores ... if the means are separated by a "gap" greater than the within-group standard deviation ... times some arbitrary constant.

If more taxa are added to the analysis, the gaps between taxa may become so small that all information is lost. Generalized gap coding works around that problem by comparing individual pairs of taxa rather than considering one set that contains all of the taxa.^[58]
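The original gap-coding procedure quoted above can be sketched as follows. This is a minimal reading of the description, assuming each taxon is represented by a list of measurements (with at least one taxon having two or more, so that the pooled standard deviation is defined) and that the arbitrary constant defaults to 1; the function name and data are hypothetical.

```python
import math


def gap_code(groups, constant=1.0):
    """Assign integer states to taxa from a continuous character by simple gap coding.

    groups maps each taxon to a list of measurements of the character.
    """
    means = {taxon: sum(values) / len(values) for taxon, values in groups.items()}

    # Pooled within-group standard deviation.
    sum_sq = sum(
        (x - means[taxon]) ** 2 for taxon, values in groups.items() for x in values
    )
    dof = sum(len(values) - 1 for values in groups.values())
    pooled_sd = math.sqrt(sum_sq / dof)

    # Order group means by size and start a new state at every sufficiently large gap.
    ordered = sorted(means, key=means.get)
    scores = {}
    state = 0
    for previous, current in zip([None] + ordered, ordered):
        if previous is not None and means[current] - means[previous] > constant * pooled_sd:
            state += 1
        scores[current] = state
    return scores


# Hypothetical measurements for four taxa.
groups = {
    "A": [1.0, 1.2, 0.8],
    "B": [1.1, 1.3, 0.9],
    "C": [3.0, 3.2, 2.8],
    "D": [5.1, 4.9, 5.0],
}
scores = gap_code(groups)
print(scores)
```

Here A and B receive the same state because their means differ by less than the pooled standard deviation, while C and D each open a new state. Inserting further taxa with intermediate means would shrink the gaps and can merge all taxa into one state, which is the loss of information that generalized gap coding is meant to avoid.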

Missing data

In general, the more data that are available when constructing a tree, the more accurate and reliable the resulting tree will be. Missing data are no more detrimental than simply having fewer data, although the impact is greatest when most of the missing data are in a small number of taxa. Concentrating the missing data across a small number of characters produces a more robust tree.^[59]
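Because the effect of missing data depends on how it is distributed, it can help to summarize the gaps per taxon and per character before an analysis. The sketch below assumes a character matrix stored as a dictionary of equal-length strings in which `?` marks a missing entry; the matrix itself is hypothetical.

```python
def missing_fractions(matrix, missing="?"):
    """Return the fraction of missing entries per taxon and per character."""
    taxa = list(matrix)
    n_chars = len(next(iter(matrix.values())))
    per_taxon = {taxon: matrix[taxon].count(missing) / n_chars for taxon in taxa}
    per_character = [
        sum(matrix[taxon][i] == missing for taxon in taxa) / len(taxa)
        for i in range(n_chars)
    ]
    return per_taxon, per_character


# Hypothetical matrix: taxon C is entirely unscored.
matrix = {
    "A": "0101",
    "B": "01?1",
    "C": "????",
    "D": "1100",
}
per_taxon, per_character = missing_fractions(matrix)
print(per_taxon)
print(per_character)
```

In this example most of the missing entries sit in a single taxon, which is the situation the paragraph above identifies as having the greatest impact.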

The role of fossils

Because many characters are embryological, soft-tissue or molecular characters that (at best) hardly ever fossilize, and because the interpretation of fossils is more ambiguous than that of living taxa, extinct taxa almost invariably have higher proportions of missing data than living ones. However, despite these limitations, the inclusion of fossils is invaluable, as they can provide information in sparse areas of trees, breaking up long branches and constraining intermediate character states; thus, fossil taxa contribute as much to tree resolution as modern taxa.^[60] Fossils can also constrain the age of lineages and thus demonstrate how consistent a tree is with the stratigraphic record;^[1] stratocladistics incorporates age information into data matrices for phylogenetic analyses.

References

[1] hdl:10222/60796.

[2] ISBN 978-0-87893-177-4.

[3] ISBN 978-0-87969-712-9.

[4] PMID 12066691.

[5] PMID 16282167.

[6] PMID 15566946.

[7] PMID 12116939.

[8] PMID 12116943.

[9] "Part 4: Multigene phylogenetics". web.natur.cuni.cz.

[10] Sokal R, Michener C (1958). "A statistical method for evaluating systematic relationships". University of Kansas Science Bulletin. 38: 1409–1438.

[11] PMID 3447015.

[12] PMID 5334057.

[13] PMID 21697992.

[14] S2CID 189885258.

[15] doi:10.1016/0025-5564(82)90027-X.

[16] ISBN 978-3-662-12530-4.

[17] PMID 4201431.

[18] ISBN 978-0-19-856493-5.

[19] doi:10.1093/oxfordjournals.jhered.a111492.

[20] PMID 15120385.

[21] S2CID 221582410.

[22] PMID 15961504.

[23] PMID 26072510.

[24] PMID 25568283.

[25] JSTOR 1390728.

[26] PMID 9214744.

[27] PMID 20011052.

[28] S2CID 53123024.

[29] PMID 23479066.

[30] Bibcode:2019arXiv190808623R.

[31] PMID 27573852.

[32] PMID 25786235.

[33] PMID 20671039.

[34] PMID 9656487.

[35] S2CID 26638948.

[36] PMID 15764562.

[37] PMID 30804347.

[38] doi:10.1038/protex.2013.065.

[39] S2CID 9581901.

[40] ISBN 978-1-936221-16-5.

[41] PMID 28561359.

[42] ISSN 1063-5157.

[43] PMID 15764559.

[44] hdl:11336/4144.

[45] PMID 15084674.

[46] S2CID 913161.

[47] S2CID 196595734.

[48] ISBN 978-0-19-509975-1.

[49] ISBN 978-0-19-535668-7.

[50] S2CID 33951905.

[51] "Genealogy of Life (GoLife)". National Science Foundation. Retrieved 5 May 2015. The GoLife program builds upon the AToL program by accommodating the complexity of diversification patterns across all of life's history. Our current knowledge of processes such as hybridization, endosymbiosis and lateral gene transfer makes clear that the evolutionary history of life on Earth cannot accurately be depicted - for every branch of the tree - as a single, typological, bifurcating tree.

[52] PMID 24903145.

[53] S2CID 22635918.

[54] PMID 17132051.

[55] PMID 12228001.

[56] PMID 15922672.

[57] S2CID 221735844.

[58] JSTOR 2413151.

[59] S2CID 86850694.

[60] PMID 17886145.

Further reading

Semple C, ISBN 978-0-19-850942-4.

Cipra BA (2007). "Algebraic Geometers See Ideal Approach to Biology" (PDF). SIAM News. 40 (6). Archived from the original (PDF) on 3 March 2016.

Press WH, Teukolsky SA, Vetterling WT, Flannery BP (2007). "Section 16.4. Hierarchical Clustering by Phylogenetic Trees". Numerical Recipes: The Art of Scientific Computing (3rd ed.). New York: Cambridge University Press. ISBN 978-0-521-88068-8. Archived from the original on 11 August 2011. Retrieved 17 August 2011.

Huson DH, Rupp R, Scornavacca C (2010). Phylogenetic Networks: Concepts, Algorithms and Applications. Cambridge University Press. ISBN 978-1-139-49287-4.

External links

Media related to Computational phylogenetics at Wikimedia Commons

