An exercise for learning to use PAUP*, with emphasis on model selection
The data set we will use for this exercise is simulated-dna-data.nex. It is a simulated data set consisting of DNA sequences for 9 organisms and 2500 sites (taxon 9 is the outgroup). I won’t give away much more than that about the simulation conditions for now, but you can assume that the simulation assumed a homogenous Markov process (i.e., all sites evolving according to the same model, and the model does not change over the tree).
In this key, (terse) answers to the questions can be obtained by hovering your mouse momentarily over the word "answer". Lines beginning with “paup> “ are PAUP* input command lines.
Part 1: Basic data manipulation and parsimony analysis
Note: “$” represents the Linux/MacOS prompt. All other commands are entered at the prompt issued by PAUP* ("paup>").
If you are using the cluster version of paup you can download the dataset into your home directory by using
- Start PAUP*, executing the “simulated-dna-data.nex” data file and opening a log file to save resuits:
$ paup -f -L paupdemo.log simulated-dna-data.nex
$ paup -f log file=paupdemo.log; exec simulated-dna-data.nex;
(Note: the "-f" flag used above is not ordinarily needed, but is being used to work around a problem with the installation on the MBL cluster.)
- Conduct an exact (branch-and-bound) tree search to the most parsimonious tree under simple parsimony with equal weights.
Alternatively, you can do an "exhaustive" search that evaluates all possible trees:
What is the length of the optimal tree?
- Examine the topology of the tree.
An alternative command for seeing the tree that provides more information is the DescribeTrees command.
Mac and Windows users can also see and manipulate a graphical view of the tree using the Trees:Print/Preview Trees menu command. Note: this will not work on the MBL cluster (which most of you will be using)
- Change the outgroup status so that the only outgroup taxon is taxon 9 and re-examine the tree topology. You can
refer to taxa by either their numbers or their names.
Now show the trees again using the ShowTrees or DescribeTrees command. What happens to the shape of the tree?
Use the PScores command (or Trees:Tree Scores:Parsimony menu command) to calculate the length (= parsimony score) of the tree.
Did the tree length change?
- Do a heuristic search using PAUP*’s default settings using the HSearch command or the Analysis:Heuristic Search menu command).
Perform another heuristic search using the random-addition-sequence method with 100 replicates.
hsearch addseq=random nrep=100;
Does it look like the optimal trees are hard to find for this data set?
- Describe the most-parsimonious tree as a phylogram (branch lengths proportional to number of changes assigned by parsimony).
Note that the slash character ('/') is needed in the command to separate the (optional> tree list from the remaining command options. Many commands in PAUP* work this way.
At this point, it will be useful to make a quick drawing of the parsimony tree for later reference. (Ordinarily, I would have you print the tree, but printing is not available in the MBL course.)
- Use the bootstrap method for assessing confidence on the groups inferred by parsimony analysis. We will use 1,000 pseudo-replicated data sets and 10 random-addition-sequence replicates per bootstrap replicate.
bootstrap nreps=1000/addseq=random nreps=10;
Which grouping on the most-parsimonious tree is least well supported as assessed by the bootstrap?
- Perform a constrained search that forces taxa 1 and 6 to be monophyletic. First we need to define a constrain tree:
constraints 1and6 = ((1,6));
Note that the name 1and6' is a name that you choose to identify the constaint. You can use the ShowConstraints command (or Analysis:Show Constraints menu command) to verify that the constrain tree has been entered correctly:
Note that the above syntax for specifying the topology of the constraint tree is a shorthand for a longer equivalent command:
constraints 1and6 = ((1,6),2,4,5,7,8,9);
The "shorthand" is that taxa connected directly to the root can be omitted from the Newick-format tree description.
What is the shortest tree that makes taxa 1 and 6 monophyletic with respect to the remaining taxa?
hsearch enforce constraints=1and6
Part 2: Distance analyses
Note: this section is optional. If time is short, you should probably proceed to the likelihood section and come back to this section later.
- Reset all program options to their “factory defaults” (PAUP* remembers most option settings until they are changed again, and sometimes it's easiest to just start "fresh" rather than worrying about each individual setting.)
- Perform neighbor-joining analyses using uncorrected (“p”), Jukes-Cantor, Kimura 2-parameter, HKY85, Tamura-Nei, and GTR distances. Examine the distance matrix in each case.
dset dist=jc; showdist; nj; dset dist=k2p; showdist; nj; dset dist=hky; showdist; nj; dset dist=tamnei; showdist; nj; dset dist=gtr; showdist; nj;
How does the distance chosen affect the tree found by NJ for this data set?
How does the distance chosen affect the magnitude of the pairwise distance estimates?
Are smaller distances less affected or more affected by the choice of a distance?
- Using the Tamura-Nei distance, search for a tree under the minimum evolution (distance) criterion:
set criterion=distance; dset objective=me dist=tamnei; hsearch;
What is the tree score under this criterion?
Evaluate the same tree under the Fitch-Margoliash weighted least-squares criterion (inverse squared weighting)?
What is the WLS score of the tree?
("Wtd SS" refers to the "average percent standard deviation" defined by Fitch and Margoliash, and is a normalization of the raw sum of squared deviations.)
Part 3: Likelihood analysis and model comparison
- Perform a heuristic search for an optimal tree under the likelihood criterion using PAUP*’s default model settings.
The default model is an HKY model with a transition/transversion ratio of 2.0 and unequal base frequencies (estimated empirically).
set criterion=likelihood; hsearch;
“Describe” the tree and examine the topology.
How does it compare to the trees found previously using parsimony and distance methods?
What is the likelihood score for this tree?
- Set up a model with unequal base frequencies (to be estimated as parameters) and two substitution types with ti/tv ratio estimated from the data (LSet command or Analysis:Likelihood Settings menu command). Then use the LScores (Trees:Tree Scores:Likelihood menu command) to estimate the ti/tv ratio on the tree found in the previous step.
lset tratio=estimate basefreq=estimate; lscores;
Note that you can combine the LSet and LScores commands into a single command, which is convenient when trying alternative models:
What is the estimate of the ti/tv ratio?
How does it compare to the previous value (i.e., the default setting)?
What effect did optimization of the ti/tv ratio have on the likelihood score for the tree?
- Fix the ti/tv ratio and base frequencies to the values you estimated in the step above. Perform another heuristic search using the modified settings.
lset fixall; hsearch; describe;
Did the tree topology change?
- Using the tree topology found in the above search, fit models that assume that rates across sites are (1) gamma distributed, (2) equal rates at variable sites but with some sites invariable, and (3) both gamma-distributed rates and invariable sites.
lset tratio=estimate basefreq=estimate; lscores/rates=gamma shape=estimate; lscores/rates=equal pinv=estimate; lscores/rates=gamma shape=estimate pinv=estimate;
Does it appear that all sites are evolving at the same rate?
What is the likelihood score and parameter estimates for the preferred (so far) model?
- Evaluate whether a simpler DNA substitution rate matrix (e.g., JC, F81, K2P) can explain the data adequately. First set up the model for among-site rate variation chosen above:
lset rates=gamma shape=est pinv=0;
Evalate the JC+G, F81, and K80 (=K2P) models, respectively:
lscores/nst=1 basefreq=equal; lscores/nst=1 basefreq=estimate; lscores/nst=2 basefreq=equal tratio=estimate;
Do you think that any of the above models are adequate?
- Evaluate more complex substition models than HKY85 (with ASRV). Specifically, we will try the Tamura-Nei and GTR models.
First the GTR:
lscores/basefreq=estimate rates=gamma shape=estimate nst=6 rmatrix=estimate;
Now the Tamura-Nei:
lscores/rclass=(a b a a c a);
- Now we can assess whether either of these models is justified according to a likelihood ratio test. (I typically prefer using the AIC for model selection, but this just shows how the LRT would be performed.)
- GTR model: ln L = -18731.11
- TamNei model: ln L = -18732.66
- HKY model: ln L = -18742.18 (from above)
delta(TamNei vs. HKY) = 2(18742.18 - 18732.66) = 19.04; df=1, P-value < 0.00001
delta(GTR vs. TamNei) = 2(18732.66 - 18731.11) = 3.10; df =3, P-value ≈ 0.0783
The HKY model is strongly rejected in favor of Tamura-Nei. GTR model is unnecessarily complex (fails to reject Tamura-Nei). The AIC would also prefer the Tamura-Nei model (3 fewer parameters, only about 1.5 log likelihood units worse)
- Search for an optimal tree using the model identified above. Note that we first fix all model parameters to their maximum-likelihood estimates before doing the search.
lscores/nst=6 rclass=(abaaca) rmat=estimate basefreq=est rates=gamma shape=estimate; lset fixall; set criterion=likelihood; hsearch;
This search finds the true tree that generated the data. In Newick representation, this tree is
- To continue the successive approximations approach, we re-estimate parameters on this new tree, and repeat the search using the new parameters...
lscores/rmatrix=estimate basefreq=estimate shape=estimate; lset fixall; hsearch;
The same tree is found, so the successive approximations approach has converged. We're done!
Finally, use the new automated model selection capability of PAUP* to estimate a good model. This is the way you'll probably end up doing it from now on, but now you know how it works!