Felsenstein's tree-pruning algorithm: Difference between revisions

Content deleted Content added

Inline

Revision as of 18:54, 7 January 2024

In statistical genetics, Felsenstein's tree-pruning algorithm (or Felsenstein's tree-peeling algorithm), attributed to Joseph Felsenstein, is an algorithm for efficiently computing the likelihood of an evolutionary tree from nucleic acid sequence data. ^[1]^[2]

The algorithm is often used as a subroutine in a search for a maximum likelihood estimate for an evolutionary tree. Further, it can be used in a hypothesis test for whether evolutionary rates are constant (by using likelihood ratio tests). It can also be used to provide error estimates for the parameters describing an evolutionary tree.

Details

The likelihood of a tree $T$ is, by definition, the probability of observing certain data $D$ ( $D$ being a nucleotide sequence alignement for example i.e. a succesion of $n$ DNA site $s$ ) given the tree. It is often written : $P(D|T)$ .

Here is an example of an evolutionary tree on arbitrary sequence data $D$ :

This is a key value and is often quite complicated to compute. To ease the computations, Felsenstein and his colleagues used several assumptions that are still widely used today. The main assumption is that mutations between DNA sites are independant of each other. This permits to compute the likelihood as a simple product of probabilities. Now you can divide the data $D$ between several $D_{s}$ for each nucleotide site $s$ inside of $D$ . The global likelihood of the tree will be the product of the likelihoods of each site:

$P(D|T)=\prod _{s=1}^{n}{P(D_{s}|T)}$

If I reuse the exemple above, $D_{1}$ would be:

The second assumption concerns the models of DNA DNA sequence evolution. The computation of the likelihood needs the nucleotide frequencies as well as the transition probabilities (when a mutation occurs, probability of going from a nucleotide to another). The simplest model is the Jukes-Cantor model, assuming equal nucleotide frequencies $\left(\pi _{A}=\pi _{G}=\pi _{C}=\pi _{T}={1 \over 4}\right)$ and equal transition probabilities from $X$ to $Y$ ( $p_{A\rightarrow T}=p_{A\rightarrow C}=p_{A\rightarrow G}={\frac {1}{4}}(1-e^{-\mu l})$ and $p_{A\rightarrow A}=e^{-\mu l}+{\frac {1}{4}}(1-e^{-\mu l})$ ) and idem for the other bases). Here $\mu$ is the global mutation rate of the model.

Felsenstein proposed to decomposed computations even more by using "partial likelihoods" in the computation of $P(D_{s}|T)$ . Here is how it works. Assume we are on a node $k$ on the tree $T$ . $k$ has two daughter nodes $i$ and $j$ and, for each DNA base $X=\{A,T,C,G\}$ present on $k$ , we can define a partial likelihood $w_{k}(X)$ such as:

$w_{k}(X)=(\sum _{Y}p_{X\rightarrow Y}\centerdot w_{i}(Y))\centerdot (\sum _{Z}p_{X\rightarrow Z}\centerdot w_{j}(Z))$

where $Y$ and $Z$ are also DNA bases. $p_{X\rightarrow Y}$ is the transition probability from nucleotide $X$ to nucleotide $Y$ (idem for $p_{X\rightarrow Z}$ ). $w_{i}(Y)$ is the partial likelihood of the daughter node $i$ , evaluated on nucleotide $Y$ (idem for $w_{j}(Z)$ ).

Using this formula, one has to start from the tips of the tree $T$ , then move towards the root and compute the partial likelihoods of each necessary node on the way (4 partial likelihoods per node). Having finished at the root of the tree, the likelihood of the tree (for this particular site) is then the sum of the partial likelihoods of the root times the appropriated nucleotide frequency.

$P(D_{s}|T)=\sum _{X}p_{X}\centerdot w_{r}(X)$

After doing so for every site $s$ , one can finally obtain the likelihood of the global evolutionary tree.

Algorithm

References

^ Felsenstein, J. (1973). "Maximum Likelihood and Minimum-Steps Methods for Estimating Evolutionary Trees from Data on Discrete Characters". Systematic Biology. 22 (3): 240–249. doi:10.1093/sysbio/22.3.240.
^ Felsenstein, J. (1981). "Evolutionary trees from DNA sequences: A maximum likelihood approach". Journal of Molecular Evolution. 17 (6): 368–376. Bibcode:1981JMolE..17..368F. doi:10.1007/BF01734359. PMID 7288891. S2CID 8024924.

This genetics article is a stub. You can help Wikipedia by expanding it.

This statistics-related article is a stub. You can help Wikipedia by expanding it.

[1] Felsenstein, J. (1973). "Maximum Likelihood and Minimum-Steps Methods for Estimating Evolutionary Trees from Data on Discrete Characters". Systematic Biology. 22 (3): 240–249. doi:10.1093/sysbio/22.3.240.

[2] Felsenstein, J. (1981). "Evolutionary trees from DNA sequences: A maximum likelihood approach". Journal of Molecular Evolution. 17 (6): 368–376. Bibcode:1981JMolE..17..368F. doi:10.1007/BF01734359. PMID 7288891. S2CID 8024924.

[1]

[2]

@@ Line 1: / Line 1: @@
 {{Short description|Technique in statistical genetics}}
-In [[statistical genetics]], '''Felsenstein's tree-pruning algorithm''' (or '''Felsenstein's tree-peeling algorithm'''), attributed to [[Joe_Felsenstein|Joseph Felsenstein]], is an [[algorithm]] for computing the [[likelihood]] of an [[evolutionary tree]] from [[nucleic acid]] sequence data. <ref>{{Cite journal | last1 = Felsenstein | first1 = J.| author-link1 =Joseph Felsenstein| title = Maximum Likelihood and Minimum-Steps Methods for Estimating Evolutionary Trees from Data on Discrete Characters | doi = 10.1093/sysbio/22.3.240 | journal = Systematic Biology | volume = 22 | issue = 3 | pages = 240–249 | year = 1973 }}</ref><ref>{{Cite journal | last1 = Felsenstein | first1 = J.| author-link1 = Joseph Felsenstein| title = Evolutionary trees from DNA sequences: A maximum likelihood approach | doi = 10.1007/BF01734359 | journal = Journal of Molecular Evolution | volume = 17 | issue = 6 | pages = 368–376 | year = 1981 | pmid =  7288891| bibcode = 1981JMolE..17..368F| s2cid = 8024924}}</ref>
+In [[statistical genetics]], '''Felsenstein's tree-pruning algorithm''' (or '''Felsenstein's tree-peeling algorithm'''), attributed to [[Joe_Felsenstein|Joseph Felsenstein]], is an [[algorithm]] for efficiently computing the [[likelihood]] of an [[evolutionary tree]] from [[nucleic acid]] sequence data. <ref>{{Cite journal | last1 = Felsenstein | first1 = J.| author-link1 =Joseph Felsenstein| title = Maximum Likelihood and Minimum-Steps Methods for Estimating Evolutionary Trees from Data on Discrete Characters | doi = 10.1093/sysbio/22.3.240 | journal = Systematic Biology | volume = 22 | issue = 3 | pages = 240–249 | year = 1973 }}</ref><ref>{{Cite journal | last1 = Felsenstein | first1 = J.| author-link1 = Joseph Felsenstein| title = Evolutionary trees from DNA sequences: A maximum likelihood approach | doi = 10.1007/BF01734359 | journal = Journal of Molecular Evolution | volume = 17 | issue = 6 | pages = 368–376 | year = 1981 | pmid =  7288891| bibcode = 1981JMolE..17..368F| s2cid = 8024924}}</ref>
-The algorithm is often used as a subroutine in a search for a [[maximum likelihood]] estimate for an evolutionary tree.  Further, it can be used in a hypothesis test for whether evolutionary rates are constant (by using [[likelihood ratio test]]s).  It can also be used to provide error estimates for the parameters describing an evolutionary tree.
+The algorithm is often used as a subroutine in a search for a [[maximum likelihood]] estimate for an evolutionary tree. Further, it can be used in a hypothesis test for whether evolutionary rates are constant (by using [[likelihood ratio test]]s).  It can also be used to provide error estimates for the parameters describing an evolutionary tree.
+=== Details ===
+The '''likelihood''' of a tree <math>T</math> is, by definition, the probability of observing certain data <math>D</math> (<math>D</math> being a nucleotide sequence alignement for example ''i.e.'' a succesion of <math>  n  </math> DNA site <math> s </math>) given the tree. It is often written : <math>P(D|T)</math>.
+Here is an example of an evolutionary tree on arbitrary sequence data <math>D</math>:
+This is a key value and is often quite complicated to compute. To ease the computations, Felsenstein and his colleagues used several assumptions that are still widely used today. The '''main assumption''' is that '''mutations between DNA sites are independant''' of each other. This permits to compute the likelihood as a simple product of probabilities. Now you can divide the data <math>D</math> between several <math>D_s</math> for each nucleotide site <math>s</math> inside of <math>D</math>. The global likelihood of the tree will be the product of the likelihoods of each site:
+<math>
+P(D|T) = \prod_{s=1}^{n} {P(D_s|T)}
+</math>
+If I reuse the exemple above, <math>D_1</math> would be:
+The '''second assumption''' concerns the [[Substitution model|models of DNA DNA sequence evolution]]. The computation of the likelihood needs the nucleotide frequencies as well as the transition probabilities (when a mutation occurs, probability of going from a nucleotide to another). The simplest model is the Jukes-Cantor model, assuming equal nucleotide frequencies <math>\left(\pi_A = \pi_G = \pi_C = \pi_T = {1\over4}\right)</math> and equal transition probabilities from <math>X</math> to <math>Y</math> (<math>p_{A \rightarrow T} = p_{A \rightarrow C} = p_{A \rightarrow G} = \frac{1}{4} (1 - e^{-\mu l}) </math> and <math>p_{A \rightarrow A} = e^{-\mu l} + \frac{1}{4} (1 - e^{-\mu l})  </math>) and idem for the other bases). Here <math>  \mu  </math> is the global [[mutation rate]] of the model.
+Felsenstein proposed to decomposed computations even more by using "partial likelihoods" in the computation of <math>
+ P( D_s | T)
+</math>. Here is how it works. Assume we are on a node <math>
+ k
+</math> on the tree <math>
+ T
+</math>. <math>
+ k
+</math> has two daughter nodes <math>
+ i
+</math> and <math>
+ j
+</math> and, for each DNA base <math>  X = \{ A, T, C, G \}  </math> present on <math>
+ k
+</math>, we can define a partial likelihood <math> w_k (X) </math> such as:
+<math>  w_k (X) = ( \sum_Y p_{X \rightarrow Y} \centerdot w_i (Y)) \centerdot ( \sum_Z p_{X \rightarrow Z} \centerdot w_j (Z)) </math>
+where <math> Y </math> and <math> Z </math> are also DNA bases. <math> p_{ X\rightarrow Y}  </math>is the transition probability from nucleotide <math>X</math> to nucleotide <math> Y </math> (idem for <math>  p_{X \rightarrow Z} </math>). <math> w_i(Y) </math> is the partial likelihood of the daughter node <math>
+ i
+</math>, evaluated on nucleotide <math> Y </math> (idem for <math>
+ w_j(Z)
+</math>).
+Using this formula, one has to start from the tips of the tree <math>T</math>, then move towards the root and compute the partial likelihoods of each necessary node on the way (4 partial likelihoods per node). Having finished at the root of the tree, the likelihood of the tree (for this particular site) is then the sum of the partial likelihoods of the root times the appropriated nucleotide frequency.
+<math> P(D_s|T) = \sum_X p_X \centerdot w_r (X) </math>
+After doing so for every site <math>s</math>, one can finally obtain the likelihood of the global evolutionary tree.
+=== Algorithm ===
 ==References==