# Difference between revisions of "PAUP* Exercise"

Ejmctavish (talk | contribs) (→Part 1: Basic data manipulation and parsimony analysis) |
|||

(20 intermediate revisions by 4 users not shown) | |||

Line 12: | Line 12: | ||

<pre>wget http://www.duke.edu/~dls36/paup/simulated-dna-data.nex</pre> | <pre>wget http://www.duke.edu/~dls36/paup/simulated-dna-data.nex</pre> | ||

− | + | * Start PAUP*, executing the “simulated-dna-data.nex” data file and opening a log file to save resuits: | |

− | |||

− | + | <pre>$ paup -f -L paupdemo.log simulated-dna-data.nex</pre> | |

− | |||

− | <pre>$ | ||

(or) | (or) | ||

− | <pre>$ | + | <pre>$ paup -f |

+ | log file=paupdemo.log; | ||

exec simulated-dna-data.nex;</pre> | exec simulated-dna-data.nex;</pre> | ||

+ | |||

+ | (Note: the "-f" flag used above is not ordinarily needed, but is being used to work around a problem with the installation on the MBL cluster.) | ||

* Conduct an exact (branch-and-bound) tree search to the most parsimonious tree under simple parsimony with equal weights. | * Conduct an exact (branch-and-bound) tree search to the most parsimonious tree under simple parsimony with equal weights. | ||

Line 49: | Line 49: | ||

<pre>outgroup 9;</pre> | <pre>outgroup 9;</pre> | ||

− | |||

− | |||

− | |||

Now show the trees again using the '''ShowTrees''' or '''DescribeTrees''' command. What happens to the shape of the tree? | Now show the trees again using the '''ShowTrees''' or '''DescribeTrees''' command. What happens to the shape of the tree? | ||

Line 71: | Line 68: | ||

<pre>hsearch addseq=random nrep=100;</pre> | <pre>hsearch addseq=random nrep=100;</pre> | ||

− | Does it look like the optimal trees are hard to find for this data set? {{title|This data set is easy for parsimony, and very simple searches find the optimal tree. This is not typical of larger, real data sets.|answer}} | + | Does it look like the optimal trees are hard to find for this data set? {{title|This data set is easy for parsimony, and very simple searches find the optimal tree. (Depending on the starting seed, all or almost all of the random-addition-sequence replicates find an optimal tree.) This is not typical of larger, real data sets.|answer}} |

* Describe the most-parsimonious tree as a phylogram (branch lengths proportional to number of changes assigned by parsimony). | * Describe the most-parsimonious tree as a phylogram (branch lengths proportional to number of changes assigned by parsimony). | ||

Line 99: | Line 96: | ||

<pre>constraints 1and6 = ((1,6),2,4,5,7,8,9);</pre> | <pre>constraints 1and6 = ((1,6),2,4,5,7,8,9);</pre> | ||

− | The "shorthand" is that taxa connected directly to the | + | The "shorthand" is that taxa connected directly to the root can be omitted from the Newick-format tree description. |

What is the shortest tree that makes taxa 1 and 6 monophyletic with respect to the remaining taxa? {{title|4273 steps|answer}} | What is the shortest tree that makes taxa 1 and 6 monophyletic with respect to the remaining taxa? {{title|4273 steps|answer}} | ||

Line 123: | Line 120: | ||

</pre> | </pre> | ||

− | How does the distance chosen affect the tree found by NJ for this data set? {{title|For all distances other than | + | How does the distance chosen affect the tree found by NJ for this data set? {{title|For all distances other than GTR, the topology of the tree is the same. The GTR tree resolves taxa 1, 4, and 6 differently. The chosen distance does affect the computed branch lengths in all cases, however.|answer}} |

− | How does the distance chosen affect the magnitude of the pairwise distance estimates? {{title|In general, the more complex the model used for distance calculation, the greater the inferred distance, because | + | How does the distance chosen affect the magnitude of the pairwise distance estimates? {{title|In general, the more complex the model used for distance calculation, the greater the inferred distance, because it better corrects for unobserved substitutions (multiple hits).|answer}} |

Are smaller distances less affected or more affected by the choice of a distance? {{title|Less affected. When taxa are more recently diverged, fewer multiple hits will have occurred and the distance correction has less effect.|answer}} | Are smaller distances less affected or more affected by the choice of a distance? {{title|Less affected. When taxa are more recently diverged, fewer multiple hits will have occurred and the distance correction has less effect.|answer}} | ||

Line 137: | Line 134: | ||

</pre> | </pre> | ||

− | What is the tree score under this criterion? {{title|ME score equals 2. | + | What is the tree score under this criterion? {{title|ME score equals 2.00162|answer}} |

Evaluate the same tree under the Fitch-Margoliash weighted least-squares criterion (inverse squared weighting)? | Evaluate the same tree under the Fitch-Margoliash weighted least-squares criterion (inverse squared weighting)? | ||

Line 145: | Line 142: | ||

</pre> | </pre> | ||

− | What is the WLS score of the tree? {{title|SS equals 0. | + | What is the WLS score of the tree? {{title|Wtd SS equals 0.03281 (Percent standard deviation equals 3.95257)|answer}} |

− | (" | + | ("Wtd SS" refers to the "average percent standard deviation" defined by Fitch and Margoliash, and is a normalization of the raw sum of squared deviations.) |

== Part 3: Likelihood analysis and model comparison == | == Part 3: Likelihood analysis and model comparison == | ||

Line 167: | Line 164: | ||

What is the likelihood score for this tree? {{title|ln L equals -19327.97|answer}} | What is the likelihood score for this tree? {{title|ln L equals -19327.97|answer}} | ||

− | * Set up a model with unequal base frequencies (to be estimated as parameters) and two substitution types with ti/tv ratio estimated from the data ('''LSet''' command or '''''Analysis:Likelihood Settings''''' menu command). Then use the | + | * Set up a model with unequal base frequencies (to be estimated as parameters) and two substitution types with ti/tv ratio estimated from the data ('''LSet''' command or '''''Analysis:Likelihood Settings''''' menu command). Then use the '''LScores''' ('''''Trees:Tree Scores:Likelihood''''' menu command) to estimate the ti/tv ratio on the tree found in the previous step. |

− | '''LScores''' ('''''Trees:Tree Scores:Likelihood''''' menu command) to estimate the ti/tv ratio on the tree found in the previous step. | ||

<pre> | <pre> | ||

Line 189: | Line 185: | ||

<pre> | <pre> | ||

− | lset | + | lset fixall; |

hsearch; | hsearch; | ||

describe; | describe; | ||

Line 236: | Line 232: | ||

<pre>lscores/rclass=(a b a a c a); </pre> | <pre>lscores/rclass=(a b a a c a); </pre> | ||

− | * Now we can assess whether either of these models is justified according to a likelihood ratio test. | + | * Now we can assess whether either of these models is justified according to a likelihood ratio test. (I typically prefer using the AIC for model selection, but this just shows how the LRT would be performed.) |

** GTR model: ln L = -18731.11 | ** GTR model: ln L = -18731.11 | ||

** TamNei model: ln L = -18732.66 | ** TamNei model: ln L = -18732.66 | ||

Line 251: | Line 247: | ||

<pre> | <pre> | ||

lscores/nst=6 rclass=(abaaca) rmat=estimate basefreq=est rates=gamma shape=estimate; | lscores/nst=6 rclass=(abaaca) rmat=estimate basefreq=est rates=gamma shape=estimate; | ||

− | lset | + | lset fixall; |

set criterion=likelihood; | set criterion=likelihood; | ||

hsearch; | hsearch; | ||

Line 263: | Line 259: | ||

<pre> | <pre> | ||

lscores/rmatrix=estimate basefreq=estimate shape=estimate; | lscores/rmatrix=estimate basefreq=estimate shape=estimate; | ||

− | lset | + | lset fixall; |

hsearch; | hsearch; | ||

</pre> | </pre> | ||

Line 269: | Line 265: | ||

The same tree is found, so the successive approximations approach has converged. We're done! | The same tree is found, so the successive approximations approach has converged. We're done! | ||

− | + | Finally, use the new automated model selection capability of PAUP* to estimate a good model. This is the way you'll probably end up doing it from now on, but now you know how it works! | |

+ | <pre> | ||

+ | automodel; | ||

+ | </pre> |

## Latest revision as of 18:53, 19 July 2016

## Contents

### An exercise for learning to use PAUP*, with emphasis on model selection

The data set we will use for this exercise is simulated-dna-data.nex. It is a simulated data set consisting of DNA sequences for 9 organisms and 2500 sites (taxon 9 is the outgroup). I won’t give away much more than that about the simulation conditions for now, but you can assume that the simulation assumed a homogenous Markov process (i.e., all sites evolving according to the same model, and the model does not change over the tree).

In this key, (terse) answers to the questions can be obtained by hovering your mouse momentarily over the word "answer". Lines beginning with “paup> “ are PAUP* input command lines.

## Part 1: Basic data manipulation and parsimony analysis

*Note: “$” represents the Linux/MacOS prompt. All other commands are entered at the prompt issued by PAUP* ("paup>").*

If you are using the cluster version of paup you can download the dataset into your home directory by using

wget http://www.duke.edu/~dls36/paup/simulated-dna-data.nex

- Start PAUP*, executing the “simulated-dna-data.nex” data file and opening a log file to save resuits:

$ paup -f -L paupdemo.log simulated-dna-data.nex

(or)

$ paup -f log file=paupdemo.log; exec simulated-dna-data.nex;

(Note: the "-f" flag used above is not ordinarily needed, but is being used to work around a problem with the installation on the MBL cluster.)

- Conduct an exact (branch-and-bound) tree search to the most parsimonious tree under simple parsimony with equal weights.

bandb;

Alternatively, you can do an "exhaustive" search that evaluates all possible trees:

alltrees;

What is the length of the optimal tree? answer

- Examine the topology of the tree.

showtrees all;

An alternative command for seeing the tree that provides more information is the **DescribeTrees** command.

describe all;

Mac and Windows users can also see and manipulate a graphical view of the tree using the * Trees:Print/Preview Trees*
menu command.

**Note: this will not work on the MBL cluster (which most of you will be using)**

- Change the outgroup status so that the only outgroup taxon is taxon 9 and re-examine the tree topology. You can

refer to taxa by either their numbers or their names.

outgroup 9;

Now show the trees again using the **ShowTrees** or **DescribeTrees** command. What happens to the shape of the tree?
answer

Use the **PScores** command (or **Trees:Tree Scores:Parsimony** menu command) to calculate the length (= parsimony score) of the tree.

pscores;

Did the tree length change? answer

- Do a heuristic search using PAUP*’s default settings using the
**HSearch**command or themenu command).**Analysis:Heuristic Search**

hsearch;

Perform another heuristic search using the random-addition-sequence method with 100 replicates.

hsearch addseq=random nrep=100;

Does it look like the optimal trees are hard to find for this data set? answer

- Describe the most-parsimonious tree as a phylogram (branch lengths proportional to number of changes assigned by parsimony).

describe/plot=p;

Note that the slash character ('/') is needed in the command to separate the (optional> tree list from the remaining command options. Many commands in PAUP* work this way.

At this point, it will be useful to make a quick drawing of the parsimony tree for later reference. (Ordinarily, I would have you print the tree, but printing is not available in the MBL course.)

- Use the bootstrap method for assessing confidence on the groups inferred by parsimony analysis. We will use 1,000 pseudo-replicated data sets and 10 random-addition-sequence replicates per bootstrap replicate.

bootstrap nreps=1000/addseq=random nreps=10;

Which grouping on the most-parsimonious tree is least well supported as assessed by the bootstrap? answer

- Perform a
*constrained*search that forces taxa 1 and 6 to be monophyletic. First we need to define a constrain tree:

constraints 1and6 = ((1,6));

Note that the name *1and6' is a name that you choose to identify the constaint. You can use the ShowConstraints command (or *

**Analysis:Show Constraints**

*menu command) to verify that the constrain tree has been entered correctly:*

showconstr;

Note that the above syntax for specifying the topology of the constraint tree is a shorthand for a longer equivalent command:

constraints 1and6 = ((1,6),2,4,5,7,8,9);

The "shorthand" is that taxa connected directly to the root can be omitted from the Newick-format tree description.

What is the shortest tree that makes taxa 1 and 6 monophyletic with respect to the remaining taxa? answer

hsearch enforce constraints=1and6

## Part 2: Distance analyses

*Note: this section is optional. If time is short, you should probably proceed to the likelihood section and come back to this section later.*

- Reset all program options to their “factory defaults” (PAUP* remembers most option settings until they are changed again, and sometimes it's easiest to just start "fresh" rather than worrying about each individual setting.)

factory;

- Perform neighbor-joining analyses using uncorrected (“p”), Jukes-Cantor, Kimura 2-parameter, HKY85, Tamura-Nei, and GTR distances. Examine the distance matrix in each case.

dset dist=jc; showdist; nj; dset dist=k2p; showdist; nj; dset dist=hky; showdist; nj; dset dist=tamnei; showdist; nj; dset dist=gtr; showdist; nj;

How does the distance chosen affect the tree found by NJ for this data set? answer

How does the distance chosen affect the magnitude of the pairwise distance estimates? answer

Are smaller distances less affected or more affected by the choice of a distance? answer

- Using the Tamura-Nei distance, search for a tree under the minimum evolution (distance) criterion:

set criterion=distance; dset objective=me dist=tamnei; hsearch;

What is the tree score under this criterion? answer

Evaluate the same tree under the Fitch-Margoliash weighted least-squares criterion (inverse squared weighting)?

dscores/objective=ls power=2;

What is the WLS score of the tree? answer

("Wtd SS" refers to the "average percent standard deviation" defined by Fitch and Margoliash, and is a normalization of the raw sum of squared deviations.)

## Part 3: Likelihood analysis and model comparison

- Perform a heuristic search for an optimal tree under the likelihood criterion using PAUP*’s default model settings.

The default model is an HKY model with a transition/transversion ratio of 2.0 and unequal base frequencies (estimated empirically).

set criterion=likelihood; hsearch;

“Describe” the tree and examine the topology.

describe;

How does it compare to the trees found previously using parsimony and distance methods? answer

What is the likelihood score for this tree? answer

- Set up a model with unequal base frequencies (to be estimated as parameters) and two substitution types with ti/tv ratio estimated from the data (
**LSet**command ormenu command). Then use the**Analysis:Likelihood Settings****LScores**(menu command) to estimate the ti/tv ratio on the tree found in the previous step.**Trees:Tree Scores:Likelihood**

lset tratio=estimate basefreq=estimate; lscores;

Note that you can combine the **LSet** and **LScores** commands into a single command, which is convenient when
trying alternative models:

lscores/tratio=estimate basefreq=estimate;

What is the estimate of the ti/tv ratio? answer

How does it compare to the previous value (i.e., the default setting)? answer

What effect did optimization of the ti/tv ratio have on the likelihood score for the tree? answer

- Fix the ti/tv ratio and base frequencies to the values you estimated in the step above. Perform another heuristic search using the modified settings.

lset fixall; hsearch; describe;

Did the tree topology change? answer

- Using the tree topology found in the above search, fit models that assume that rates across sites are (1) gamma distributed, (2) equal rates at variable sites but with some sites invariable, and (3) both gamma-distributed rates and invariable sites.

lset tratio=estimate basefreq=estimate; lscores/rates=gamma shape=estimate; lscores/rates=equal pinv=estimate; lscores/rates=gamma shape=estimate pinv=estimate;

Does it appear that all sites are evolving at the same rate? answer

What is the likelihood score and parameter estimates for the preferred (so far) model? answer

- Evaluate whether a simpler DNA substitution rate matrix (e.g., JC, F81, K2P) can explain the data adequately. First set up the model for among-site rate variation chosen above:

lset rates=gamma shape=est pinv=0;

Evalate the JC+G, F81, and K80 (=K2P) models, respectively:

lscores/nst=1 basefreq=equal; lscores/nst=1 basefreq=estimate; lscores/nst=2 basefreq=equal tratio=estimate;

Do you think that any of the above models are adequate? answer

- Evaluate more complex substition models than HKY85 (with ASRV). Specifically, we will try the Tamura-Nei and GTR models.

First the GTR:

lscores/basefreq=estimate rates=gamma shape=estimate nst=6 rmatrix=estimate;

Now the Tamura-Nei:

lscores/rclass=(a b a a c a);

- Now we can assess whether either of these models is justified according to a likelihood ratio test. (I typically prefer using the AIC for model selection, but this just shows how the LRT would be performed.)
- GTR model: ln L = -18731.11
- TamNei model: ln L = -18732.66
- HKY model: ln L = -18742.18 (from above)

delta(TamNei vs. HKY) = 2(18742.18 - 18732.66) = 19.04; df=1, P-value < 0.00001

delta(GTR vs. TamNei) = 2(18732.66 - 18731.11) = 3.10; df =3, P-value ≈ 0.0783

The HKY model is strongly rejected in favor of Tamura-Nei. GTR model is unnecessarily complex (fails to reject Tamura-Nei). The AIC would also prefer the Tamura-Nei model (3 fewer parameters, only about 1.5 log likelihood units worse)

- Search for an optimal tree using the model identified above. Note that we first fix all model parameters to their maximum-likelihood estimates before doing the search.

lscores/nst=6 rclass=(abaaca) rmat=estimate basefreq=est rates=gamma shape=estimate; lset fixall; set criterion=likelihood; hsearch;

This search finds the true tree that generated the data. In Newick representation, this tree is

(((((1,6),8),4),5),(2,(3,7)))

- To continue the successive approximations approach, we re-estimate parameters on this new tree, and repeat the search using the new parameters...

lscores/rmatrix=estimate basefreq=estimate shape=estimate; lset fixall; hsearch;

The same tree is found, so the successive approximations approach has converged. We're done!

Finally, use the new automated model selection capability of PAUP* to estimate a good model. This is the way you'll probably end up doing it from now on, but now you know how it works!

automodel;