Distance matrices in phylogeny
Distance matrices are used in phylogeny as
Distance-matrix methods
Distance-matrix methods of phylogenetic analysis explicitly rely on a measure of "genetic distance" between the sequences being classified, and therefore they start with a multiple sequence alignment (MSA) as an input. From it, they construct an all-to-all matrix describing the distance between each sequence pair. Finally, they construct a phylogenetic tree that places closely related sequences under the same
Distance is often defined as the fraction of mismatches at aligned positions, with gaps either ignored or counted as mismatches.[1]
Distance-matrix methods are frequently used as the basis for progressive and iterative types of multiple sequence alignment.
The main disadvantage of distance-matrix methods is their inability to efficiently use information about local high-variation regions that appear across multiple subtrees.[2]
Neighbor-joining
Neighbor-joining methods apply general
UPGMA and WPGMA
The
Fitch–Margoliash method
The Fitch–Margoliash method uses a weighted
The least-squares criterion applied to these distances is more accurate but less efficient than the neighbor-joining methods. An additional improvement that corrects for correlations between distances that arise from many closely related sequences in the data set can also be applied at increased computational cost. Finding the optimal least-squares tree with any correction factor is
search methods like those used in maximum-parsimony analysis are applied to the search through tree space.Using outgroups
Independent information about the relationship between sequences or groups can be used to help reduce the tree search space and root unrooted trees. Standard usage of distance-matrix methods involves the inclusion of at least one
Weaknesses of different methods
In general, pairwise distance data are an underestimate of the path-distance between taxa on a
Several simple algorithms exist to construct a tree directly from pairwise distances, including UPGMA and neighbor joining (NJ), but these will not necessarily produce the best tree for the data. To counter potential complications noted above, and to find the best tree for the data, distance analysis can also incorporate a tree-search protocol that seeks to satisfy an explicit optimality criterion. Two optimality criteria are commonly applied to distance data, minimum evolution (ME) and least squares inference. Least squares is part of a broader class of regression-based methods lumped together here for simplicity. These regression formulae minimize the residual differences between path-distances along the tree and pairwise distances in the data matrix, effectively "fitting" the tree to the empirical distances. In contrast, ME accepts the tree with the shortest sum of branch lengths, and thus minimizes the total amount of evolution assumed. ME is closely akin to parsimony, and under certain conditions, ME analysis of distances based on a discrete character dataset will favor the same tree as conventional parsimony analysis of the same data.
Phylogeny estimation using distance methods has produced a number of controversies.
Many scientists eschew distance methods, for various reasons. A commonly cited reason is that distances are inherently
More practically, distance methods are avoided because the relationship between individual characters and the tree is lost in the process of reducing characters to distances. These methods do not use character data directly, and information locked in the distribution of character states can be lost in the pairwise comparisons. Also, some complex phylogenetic relationships may produce biased distances. On any phylogram, branch lengths will be underestimated because some changes cannot be discovered at all due to failure to sample some species due to either experimental design or extinction (a phenomenon called the node density effect). However, even if pairwise distances from genetic data are "corrected" using stochastic models of evolution as mentioned above, they may more easily sum to a different tree than one produced from analysis of the same data and model using
Despite these potential problems, distance methods are extremely fast, and they often produce a reasonable estimate of phylogeny. They also have certain benefits over the methods that use characters directly. Notably, distance methods allow use of data that may not be easily converted to character data, such as
Distance methods are popular among molecular systematists, a substantial number of whom use NJ without an optimization stage almost exclusively. With the increasing speed of character-based analyses, some of the advantages of distance methods will probably wane. However, the nearly instantaneous NJ implementations, the ability to incorporate an evolutionary model in a speedy analysis, LogDet distances, network estimation methods, and the occasional need to summarize relationships with a single number all mean that distance methods will probably stay in the mainstream for a long time to come.