SnakeTopoParametricBootstrappingLab


See https://molevol.mbl.edu/index.php/ParametricBootstrappingLab for links to the other related labs

Overview

This is an example of using parametric bootstrapping to test an aspect of a phylogenetic tree topology. The main differences from the AlgaeTopoParametricBootstrappingLab are the size of the data set and the type of hypothesis that we are testing.

Here we see an odd grouping recovered in a tree search, and want to know whether there is strong statistical support for that grouping.


Background on the Dataset

This data set is a very small subset of the data matrix published by:

Pyron RA, Burbrink FT, Wiens JJ (2013) A phylogeny and revised classification of Squamata, including 4161 species of lizards and snakes. BMC Evolutionary Biology 13: 93. http://dx.doi.org/10.1186/1471-2148-13-93

The data was obtained from DataDryad:

Pyron RA, Burbrink FT, Wiens JJ (2013) Data from: A phylogeny and revised classification of Squamata, including 4161 species of lizards and snakes. Dryad Digital Repository. http://dx.doi.org/10.5061/dryad.82h0m Specifically, the data file used was: http://datadryad.org/bitstream/handle/10255/dryad.47660/Squamata_10_28.phy?sequence=1

The file included here is a matrix that consists of ND2 and ND4 (mitochondrial) gene sequences for some selected squamates (mainly snakes). Mark Holder pruned it down to make the lab run quickly. The species are not randomly sampled - some interesting groups were selected based on whether or not they are viviparous (give birth to young without an external egg).

This is the same data set that is used in the CharacterEvoParametricBootstrappingLab.

Parametric Bootstrapping Background

To save time, we will be using parsimony. Parsimony searches try to find the tree that requires the fewest mutations to explain the data. These data are from the (fast-evolving) mitochondrial genome, so parsimony is definitely not a good choice of analysis method. We are only using it here because it is fast, and our time in the lab is limited. The same principles of parametric bootstrapping apply to ML and distance-based methods.
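
As a toy example of what "fewest mutations" means: for a single site with states A, A, C, C in four species, a tree that groups the two A species together (and the two C species together) can explain the site with a single change, while a tree that separates them requires at least two changes. Parsimony sums these minimum change counts over all sites and prefers the tree with the smallest total.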

Parametric bootstrapping is useful for answering the question: "I have observed a particular difference between the score of the optimal tree and the score of the best tree that is compatible with the null hypothesis. Is this score difference larger than I would expect if the null were true?"

The score difference can be a difference in log-likelihoods, a difference in parsimony scores, etc.
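
To be concrete, in this lab the test statistic will be:

  test statistic = (parsimony score of the best tree compatible with the null hypothesis) - (parsimony score of the best unconstrained tree)

A value of 0 means the null tree is among the most parsimonious trees; large values mean that forcing the data onto a tree compatible with the null makes them much harder to explain.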

Parametric bootstrapping has a lot of steps. You may want to look at this flowchart (note the step numbers correspond to the Algae dataset; toward the end, they differ from this exercise).


Lab Activity

1. Download the data for the lab by running:

wget http://phylo.bio.ku.edu/slides/param_boot_lab.zip
unzip param_boot_lab.zip
cd param_boot_lab/snakes

If your machine does not have wget, you can install it or use curl -O http://phylo.bio.ku.edu/slides/param_boot_lab.zip to grab the file.


The data file is snakes.nex


Calculating the test statistic on the real data

2. Find the score of the most parsimonious tree using PAUP. Launch the command line version of PAUP by typing paup.

Once we have launched PAUP, we should start a log file to save the output, execute the data set, and run a heuristic search. Heuristic searches are not guaranteed to find the optimal solution, but they often do a pretty good job (we'll talk more about tree searching later in the course). The invocation shown below will conduct 100 searches, each from a different starting tree constructed using a quick algorithm called "stepwise addition."

Log Start File = snakemonophyly.log ;
Execute snakes.nex ;
Set Criterion = pars ;
HSearch NReps = 100 Start = stepwise AddSeq = rand MulTrees = no ;
  • How many most parsimonious trees did you find? You certainly aren't finding all of the most parsimonious trees with the quick search we have done, but do make sure that you can interpret PAUP's output.
  • What is the parsimony score of the most parsimonious tree(s)? answer (your answer may vary a bit because our searches are heuristic).
You can use
ShowTree 1
to see the first tree in memory (only the trees that are tied for having the best score will be in memory at this point).

One of the odd things about the trees that I get from a quick parsimony search on these data is the fact that the blind snake Leptotyphlops_humilis is found as sister to the non-blind snakes. It probably belongs in a group with the other blind snakes (Ramphotyphlops_braminus and Typhlops_reticulatus in this data set).

This is a tiny dataset (by modern standards); could the presence of the group just be caused by sampling error? Let's see if there is bootstrap support for that odd grouping.

3. The Boot command in PAUP will do the bootstrapping. Options that you provide before the / control the bootstrap resampling. After the / you can enter parameters for the heuristic search that is conducted on each bootstrap pseudoreplicate.

Use the command:

Boot nreps = 200 / nreps = 1 ;

to calculate the bootstrap proportions and display the majority-rule consensus tree from the bootstrapping analyses. This asks for 200 bootstrap pseudoreplicates, but tells PAUP to do just one heuristic search on each replicate. Somewhat confusingly, "nreps" is the keyword in both cases, but the / provides the context needed to interpret the keyword.


4. Does the bootstrap majority-rule tree display the odd grouping of Leptotyphlops_humilis with the non-blind snakes? answer

What is the BP for the strange grouping? answer

There are several reasons we could get fairly high bootstrap support for a questionable grouping: the fast rate of evolution of animal mitochondrial sequences makes them a poor choice for resolving deep relationships, and combining such data with parsimony makes matters particularly bad.

5. A natural question to ask is: "How much worse is the best tree that does not have this grouping?"

To answer this, we can tell PAUP to do a search with a "converse constraint" (also called a negative constraint) to only search through trees that lack a certain group.

First we have to save a tree with just the suspicious grouping in a tree file. I have done this in oddconstr.tre.
Take a look at that file, and make sure that you understand the format - the parentheses group taxa into clades (subtrees) in the tree. Tools like Mesquite can be useful for constructing these trees.

To use the tree, we have to load it into a special slot in memory for constraint trees.

To get PAUP to read the constraint into memory, use the
LoadConstr file=oddconstr.tre
command.

6. It is always a good idea to look at the constraint to make sure it is expressing the group that you are interested in.

Note that the constraint had a name in the file. We refer to the constraint by the name of the tree in this command (not the name of the file from which we read the tree). Use the ShowConstr odd ; command to see the constraint tree that you have defined and make sure that it is the constraint that you intended.

7. Now we can conduct a HSearch, enforcing the constraint in the converse form (look only at trees without this grouping)

HSearch Enforce = yes constraints = odd converse = yes NReps = 100 Start = stepwise AddSeq = rand MulTrees = no ;

The enforce keyword tells PAUP to only consider trees that meet a constraint, and the constraints=odd keyword tells PAUP which tree to use as a constraint. The converse = yes part of the command is what tells PAUP to look for trees that are not compatible with the constraint tree. What is the parsimony score of the best tree that does not show the odd grouping? answer

8. Use the ShowTrees 1 command to see the tree. Does it satisfy the constraint? (it should).

Hypothesis testing

As our test statistic, we will use the difference in parsimony score between the best (unconstrained) tree and the best tree that does not have the odd grouping.

9. What is the value of the test statistic for our data? answer (you may get a slightly different answer if your searches were more or less effective than mine).


To interpret the value of the test statistic we need to have some idea of the range of values that would be produced if the null hypothesis were true. This can be tricky. For one thing, there are lots of trees that are compatible with the null hypothesis. It seems logical to take the tree from the constrained search as the tree to represent the null hypothesis; it is the tree that best explains the data (according to parsimony) while still being compatible with the null. Technically speaking, this procedure of choosing a tree to represent the null does not guarantee that we are testing from the "least favorable conditions" as we should in hypothesis testing - but using this tree seems good enough, and it is practical.

Even if we are satisfied with the choice of a tree to represent the null, we still have to have a way to find out what the distribution of the test statistic would be if this null were true. We will use Monte Carlo simulations for this.

In particular we will use Seq-Gen to generate a bunch of datasets that are compatible with the kind of data that we would see if the null were true. Then we will calculate the test statistic on each of them. This will give us a null distribution of the test statistic. We can compare our real data to this distribution to determine significance.
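
To make the logic concrete, here is a minimal Python sketch of the comparison we will eventually make. The numbers and variable names are made up for illustration; the scripts provided later in the lab do the real work:

  # Illustrative values only -- the real list of null differences will come from the simulations.
  observed_diff = 14                      # test statistic computed on the real data
  null_diffs = [0, 2, 1, 0, 3, 1, 0, 5]   # test statistic from each simulated replicate

  # Approximate p-value: the fraction of simulated replicates whose score
  # difference is at least as large as the observed difference.
  p_value = sum(d >= observed_diff for d in null_diffs) / float(len(null_diffs))
  print("approximate p = %g" % p_value)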

Finding model parameters

To simulate data, Seq-Gen needs a fully-specified model and a tree with branch lengths. We can use the tree that we found in the constrained search and the GTR+I+G model to get the necessary input.

10. Assuming that you have not quit PAUP and that the tree found in the constrained search is still in memory, we can save it with branch lengths that maximize the likelihood under the GTR+I+G model. Full optimization is a bit slow; I fit one of these trees with PAUP earlier, and you can use 0.57 for the gamma shape parameter and PInvar = 0.23:

  Set crit = like;
  LSet nst=6 RMat=est BaseFreq=empirical Rates = gamma Shape = 0.57 PInv=0.23;
  LScore 1 / ScoreFile=modelparams.txt ;
  LSet rmat = prev ;
  SaveTrees file = snakemodel.tre from = 1 to = 1 brlens format=newick ;

Make sure that you understand these commands (ask an instructor if you have questions). From the output of PAUP you should have the model parameter values for the simulation.

11. Look at the tree file. You should see a newick string representing a tree with branch lengths. You can use a text editor to see the newick representation.

You can use FigTree to see a graphical representation of the tree, but that will require fetching the tree to your laptop from the cluster (using scp or CyberDuck). Make sure to exit PAUP by typing
quit

before you proceed to the next step.

Invoking seq-gen

Seq-Gen is installed on the workshop's cluster. If you are running the exercise on a machine that does not have seq-gen, you'll need to download Seq-Gen and install it.

To install, you'll need to drag the seq-gen executable to the directory that you are using for this lab (or add its location to your PATH environment variable, which tells your shell where to find executables; notes here).

12. To see the options for seq-gen use the command

seq-gen -h

To make sure everything is working do a simple test run using the HKY model:

seq-gen -mHKY snakemodel.tre

This should generate a dataset and print it to the screen. The simulation used its default parameter values for the HKY model. We'd like to use the parameters that we inferred from our real data (because the parameter values will affect the dataset-to-dataset variance, and hence the distribution of our test statistic). All options are given to seq-gen as command line flags. The ones that we will use are:

-mGTR to specify the GTR model
-a preceding the shape parameter value
-i preceding the proportion of invariant sites
-r preceding the 6 instantaneous rates of the GTR matrix (PAUP infers the first five and fixes the last one to 1.0)
-f preceding the base frequencies
-l2656 to simulate 2656 sites (the same length as our real dataset).
-n1000 to generate 1000 datasets
-on to request output in the NEXUS format
-xeachsnakereppaup.nex to tell it the name of a file with text input to be inserted between each dataset.

13. Take a look at the eachsnakereppaup.nex file that is included in the lab. It should contain the following lines.

begin paup;
    execute snakesim.run.nex;
end;

The -xeachsnakereppaup.nex option to seq-gen will insert the contents of eachsnakereppaup.nex between each data set. By putting the correct commands in a file called snakesim.run.nex (we will write this file in a few steps), we will get PAUP to run the same analysis on every simulated data set.

14. Now we will run seq-gen.

One additional trick: we want to redirect the output using the > redirection operator so that we can store the simulated data in a file (rather than printing it to the screen). Warning: this will overwrite whatever filename you put after the > character! We will redirect to a file called snakesimdata.nex with the syntax >snakesimdata.nex at the end of our invocation of seq-gen.

The invocation should be something like the command below (but with the parameter estimates for this dataset filled in the appropriate spots).

seq-gen -mGTR -a0.6 -i0.32 -r 0.6 2.1 0.3 0.2 5 1 -f 0.27 0.20 0.30 0.23 -l2656 -n1000 -on -xeachsnakereppaup.nex snakemodel.tre > snakesimdata.nex

Use the parameter values that you got from PAUP's LScore to construct a similar command and run it.
Note: The last time I ran this with the Windows executable version of the program, the command line syntax was somewhat different. The rate matrix was specified as -r0.6,2.1,0.3,0.2,5,1 and the base frequencies as -f0.27,0.20,0.30,0.23. You will also need to direct the tree file to seq-gen by inserting a < before the filename. In the case above, the end part of the command will read -xeachsnakereppaup.nex < snakemodel.tre > snakesimdata.nex

15. Open snakesimdata.nex in a text editor. Do you see the content of the eachsnakereppaup.nex file?

Running PAUP on the simulated data

Now we have 1000 datasets. How are we going to analyze them all? Fortunately we have the PAUP command execute snakesim.run.nex intercalated between each data set. If we put the commands that we want PAUP to execute in the snakesim.run.nex file then those commands will be executed for each dataset.

What do we want to do for each dataset? We want to see the difference in score between the true tree and the preferred (most-parsimonious) tree. This will give us a distribution of the amount of spurious support we could get for a tree because of sampling error (or even systematic error due to the model of evolution not matching the assumptions of parsimony).

16. Save the following commands to snakesim.run.nex (the statements in the brackets are just comments; you can skip them if you like, but they make it a bit easier to browse the output):

#NEXUS
HSearch nomultr;
[!****This is the best tree's score****]
PScore;
GetTrees file = snakemodel.tre;
[!####This is the true tree's score####]
PScore;

These commands find the "best" tree, score it, get the null model tree (the true tree for the simulations), and score it. The [!...] output comments make the two scores easier to find in the log.


This command file is really the key part: you can tweak it based on which test statistic you are interested in. The commands get executed on each simulated data set, so they just need to output all of the information needed to calculate the test statistic for that data set. Here that is just the parsimony score of the true tree and the parsimony score of the most parsimonious tree.

17. Finally to run all of the analyses, you can run PAUP in noninteractive mode:

paup -n snakesimdata.nex > snakesim.log 2>warnings.txt

After a few seconds, you should have the completed analysis of all 1000 datasets in snakesim.log, and all of the questions that PAUP asked you when you weren't listening will end up in warnings.txt.

NOTE: because we are redirecting the standard error stream to warnings.txt and the standard output to snakesim.log, you won't see anything while PAUP runs. You will get your bash prompt (the dollar sign) back when PAUP is done. It took me about 2 minutes to run 1000 reps on the cluster, but it might be slower when there are lots of people running things on the same machine.

Summarizing the output

You really don't want to browse through 1000 analyses and perform subtraction (and certainly not when you could be at the Kidd after a long day).

If you want to see how you can do a lot of the calculation using pipes from the command line (and if you are working on a non-Windows machine), check out "the geeky way." Here we will use the easy way.

I wrote a simple summarization script summarizePaupLengthDiffs.py that is in the bundle that you downloaded.

If you are running Windows (and not working on the cluster), you may need to install Python (any version that starts with 2 should work) if you don't have it.

19. As long as you do not mind overwriting a file in this directory named snakesimdiffs.tsv, you can run the command:

 python summarizePaupLengthDiffs.py snakesim.log > snakesimdiffs.tsv

This should report critical values for the test statistic at a few significance levels. You should be able to open the file snakesimdiffs.tsv in Excel if you want to see differences for any replicate.
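
If you are curious about what such a script has to do, here is a rough Python sketch of the idea. This is not the provided summarizePaupLengthDiffs.py; it assumes (check your own log if you want to adapt it) that after each of the comment markers we embedded in snakesim.run.nex, the parsimony score appears on a line containing the word "Length":

  import re

  best_scores, true_scores = [], []
  scores = None  # the list that the next score belongs to
  with open("snakesim.log") as log:
      for line in log:
          if "This is the best tree's score" in line:
              scores = best_scores
          elif "This is the true tree's score" in line:
              scores = true_scores
          elif scores is not None and "Length" in line:
              number = re.search(r"\d+", line)  # assumed: the score is the first number on that line
              if number:
                  scores.append(int(number.group()))
                  scores = None

  # The test statistic for each replicate is the true tree's score minus the best tree's score.
  diffs = [t - b for b, t in zip(best_scores, true_scores)]
  print("%d replicates; largest simulated difference = %d" % (len(diffs), max(diffs)))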


20. (optional) If you have the R programming language installed, then you should be able to run plot_snake_diffs.R with a command line argument to produce a pdf that summarizes the parametric bootstrapping. Pass in the observed value of the test statistic to the R script. So, if the observed length difference (on the real data) was 14, then you would use the command:

R --file=plot_snake_diffs.R --args 14

to produce a file called snake_null_distribution_pscore_diffs.pdf with a histogram and the approximate p value.

The end

Was the observed difference in the tail of the null distribution? Would you reject the null hypothesis?
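
For example (with made-up numbers): if only 2 of the 1000 simulated replicates produced a score difference as large as the one observed on the real data, the approximate p-value would be 2/1000 = 0.002, and you would reject the null hypothesis at the conventional significance levels.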