ParametricBootstrappingLab

From MolEvol
Revision as of 15:28, 24 July 2012 by Ejmctavish (talk | contribs) (Pointers to other topology testing techniques)

Pointers to other topology testing techniques

The lecture mentions several tests that we are not going to run through in detail in the lab.

You will find KHTest, SHTest, and AUTest options in PAUP's LScore command. Using LScore command is the easiest way to conduct these tests. Let us know if you have any questions about running these tests.

You can run these tests and compare the results from those tests your results from the parametric bootstrapping.

Other software relevant to the testing lecture:

  • PhyML aLRT and aBayes statistics, in particular.
  • RAxML Rapid bootstrapping
  • Consel AU Test, SH Test, Weighted SH Test, KH Test, Bootstrap
  • Ed Susko's software page (software from Susko, E. (2010) and Susko, E. (2006) papers is of particular relevance to the tree testing lecture).

Parametric Bootstrapping Lab

The goal of this lab exercise is to show you how to conduct use Monte Carlo simulation to construct a null distribution for a phylogenetic hypothesis. This type of parametric bootstrapping is not a trivial analysis, so there are several steps (and we've included a script in case you want a more automated way of completing the analysis.

The easiest way to run all of the steps in the lab (via a script) is to download param_boot_lab.zip, uncompress it, cd into the unzipped directory, and run the do_param_bootlab script. From a UNIX prompt (preferably on the workshop's cluster), you can do all of this by typing:

wget http://phylo.bio.ku.edu/slides/param_boot_lab.zip
unzip param_boot_lab.zip
cd param_boot_lab
bash do_param_bootlab.sh

If your machine does not have wget, you can install it or use curl -O http://phylo.bio.ku.edu/slides/param_boot_lab.zip to grab the file.

Just running the do_param_bootlab.sh script is probably not going to help you understand the steps. I recommend that you work through the lab until at least step 16 before you resort to running the script.

Background on the Dataset

In this lab, we will use a dataset algae.nex. It contains 16S rRNA sequences for a cyanobacterium (Anacystis), a chromophyte alga (Olithodiscus), a euglenoid protist (Euglena), and six green plants, including two green algae (Chlorella and Chlamydomonas), a liverwort (Marchantia), a monocotyledonous angiosperm (rice) and a dicotyledonous angiosperm (tobacco).

This data set was used in a 1994 paper by Lockhart et al. to show how common models used in reconstructing phylogenies fail when confronted by convergence in nucleotide composition. The problem is that the common models assume stationarity of the substitution process, which leads to the assumption that base frequencies do not change across the tree. Thus, things can go wrong when the base frequencies do change from lineage to lineage, and things can go really wrong when unrelated groups tend to have similar base compositions.

In this case, all of the species except Olithodiscus and Anacystis have chlorophyll a/b and are strongly suspected to be a monophyletic group. However, as you will see, Euglena has a strong tendency to group with the unrelated chromophyte Olithodiscus because of similarities in base composition.

The complete reference to the Lockhart paper is Lockhart, P. J., M. A. Steel, M. D. Hendy, and D. Penny. 1994. "Recovering evolutionary trees under a more realistic model of sequence evolution". Molecular Biology and Evolution 11: 605-612.


Null hypothesis

For the purpose of the lab we will use the null hypothesis that there is a branch separating Anacystis and Olithodiscus from the rest of the taxa.

Parametric Bootstrapping Background

Because we have a limited amount of time in the computer lab, we will be using parsimony as our optimality criterion in this lab. The general principles of using Monte Carlo simulation to construct a null distribution apply to any optimality criterion, so you can adapt the lab to hypothesis testing using ML or distance-based approaches.

Often, you will use parametric bootstrapping when the optimal tree disagrees with your null hypothesis. You would like to decide whether or not you should reject the null hypothesis.

Parametric bootstrapping is useful for answering the question: "I have observed a particular difference between the score of the optimal tree and the score of the best tree that is compatible with the null hypothesis. Is this score difference larger than I would expect if the null were true?"

The score difference can be a difference in log-likelihoods, a difference in parsimony scores, etc.

Lab Activity

1. Download algae.nex

2. Find the score of the most parsimonious tree using PAUP. Launch the command line version of PAUP. Because this is a small dataset, you can use the commands:

Execute algae.nex ; 
AllTrees fd=bar;

command to see all the scores of all the trees.

  • How many most parsimonious trees are there? answer
  • What is the parsimony score of the most parsimonious tree(s)? answer

3. Use the

boot

command to calculate the bootstrap proportions and display the majority-rule consensus tree from the bootstrapping analyses.


4. Does the bootstrap tree display the chlorophyll a/b clade (a clade of all of the species except Anacystis and Olithodiscus)? answer

What is the bootstrap support for the split that represents our null hypothesis? If that split is not supported by the data, then it will not be on the majority-rule tree but it should be listed in the split frequency table (Anacystis is taxon number 7 and Olithodiscus is taxon number 8, so look for a row in the table with a split pattern that groups taxa 7 and 8 in one group and all the other taxa in the other group: e.g., "......**"). answer

You can use

ShowTrees all

to see all of the trees in memory (note that the AllTrees command scores every tree, but does not retain all of the trees in memory. It will only retain the best trees according to the criterion).

5. Now we need to find the best scoring tree that is consistent with our null hypothesis. In other words, we want to find the best-scoring tree that has a clade of Euglena, Chlorella, Chlamydomonas, Marchantia, Rice, and Tobacco excluding Anacystis nidulans and Olithodiscus. We can do this in a number of ways.

We will tell PAUP to do a search for the best tree that satisfies a topological constraint.

  • The first step is to write a file with a NEXUS trees block and a tree the contains only one branch. In a text editor, create a new file with the following contents:
#NEXUS
begin trees ;
    Tree ab = [&U](Anacystis_nidulans,Olithodiscus,(Euglena, Chlorella, Chlamydomonas, Marchantia, Rice, Tobacco));
end;

The name of the constraint tree can be any NEXUS word that you like; in this example ab will be the name of the constraint tree. When you are dealing with large numbers of taxa and complex constraints it is often helpful to construct the constraint tree in Mesquite, save it to a file, and then read it into PAUP using the LoadConstr command.

  • Save the file as abconstraint.tre in the same directory as the algae.nex directory.
  • To get PAUP to read the constraint into memory, use the
    LoadConstr file=abconstraint.tre
    command.

6. Use the ShowConstr command to see the constraint tree that you have defined and make sure that it is the constraint that you intended.


7. Now we will conduct a branch-and-bound search that honors the constraint tree that we have just defined:

Log start file = 'steps7-11.realdatalog.txt' ; 
BAndB enforce constraints=ab

The enforce keyword tells paup to only consider trees that meet a constraint, and the constraints=ab keyword tell PAUP which tree to use as a constraint. Note that you can also use constraints with the HSearch command of PAUP (and you will need to do this for bigger datasets). What is the difference in the parsimony score between the unconstrained and constrained tree? answer

8. Use the ShowTrees command to see the tree. Does it satisfy the constraint? (it should).

9. Use the SaveTrees file=bestconstrained.tre to save this tree to a file in case you need it later.

Hypothesis testing

As our test statistic, we will use the difference in parsimony score between the best (unconstrained) tree and the best tree that satisfies our null hypothesis.

10. What is the value of the test statistic for our data? answer


To interpret the value of the test statistic we need to have some idea the range of values would be produced if the null hypothesis were true. This is can be tricky. For one thing, there are lots of trees that are compatible with the null hypothesis. It seems logical to take the tree from the constrained search as the tree to repersent the null hypothesis. After all, among all of the trees compatible with the constraint, it is the one that best explains the data (according to parsimony). Technically speaking this procedure of choosing a tree to represent the null does not guarantee that we are testing from the "least favorable conditions" as we should in hypothesis testing - but using this tree seems good enough, and it is practical.

Even if we are satisfied about the choice of a tree that represents the best the null can do, we still have to have a way to find out what the distribution of the test statistic would be if this null were true. We will use Monte Carlo simulations for this. In particular we will use Seq-Gen to generate a bunch of datasets that are compatible with the kind of data that we would see if the null were true. Then we will calculate the test statistic on each of them. This will give us a null distribution of the test statistic. We can compare our real data to that.

Finding model parameters

To simulate data, Seq-Gen needs a fully-specified model and a tree with branch lengths. We can use the tree that we found in the constrained search and the GTR+I+G model to get the necessary input.

11. Assuming that you have not quit PAUP and the tree found in the constrained search is still in memory, then we can save it with branch lengths that maximize the likelihood under the GTR+I+G model:

Set crit = like;
LSet nst=6 rmat=est basefreq=est rates = gamma shape = est pinv=est;
LScore;
SaveTrees file = model.tre brlens format=altnexus;

Make sure that you understand these commands (ask an instructor if you have questions). From the output of PAUP you should have the model parameter values for the simulation.

12. Look at the tree file. You should see a newick string representing a tree with branch lengths. You can use a text editor to see the newick representation. Or you can use FigTree to see a graphical representation of the tree.


13. Unfortunately, seq-gen does not understand NEXUS tree files. Cut just the newick tree (the parenthetical description of the tree) from the file to a new file called model.txt or you can download this version of model.txt. If you are a UNIX geek, you do this by quitting paup and issuing the command:

cat model.tre | grep PAUP_1 | awk '{print $5}' > model.txt

Non-geeks tend to prefer opening model.tre, copying the newick string, and saving it to a step13.model.txt file.

Invoking seq-gen

Seq-Gen is installed on the workshop's cluster. If you are running the exercise on a machine that does not have seq-gen, you'll need to download Seq-Gen and install it.

To install, you'll need to drag the seq-gen executable to the directory that you are using for this lab (or add it to your PATH environmental variable that tells your shell where to find executables notes here)

14. To see the options for seq-gen use the command

seq-gen -h

To make sure everything is working do a simple test run using the HKY model:

seq-gen -mHKY model.txt

This should generate a dataset and print it to the screen. The simulation used it default parameter values for the HKY model. We'd like to use the parameters that we inferred from our real data (because the parameter values will affect the dataset-to-dataset variance, and hence the distribution of our test statistic). All commands are given to seq-gen as command line flags. The ones that we will use are:

-mGTR to specify the GTR model
-a preceding the shape parameter value
-i preceding the proportion of invariant sites
-r preceding the 6 instantanteous rates of the GTR matrix (PAUP infers the first five and fixes the last one to 1.0)
-f preceding the base frequencies
-l920 to simulate 920 sites (the same length as our real dataset).
-n1000 to generate 1000 datasets
-on to request output in the NEXUS format
-xeachreppaup.nex to tell it the name of a file with text input to be inserted between each dataset.

Finally we'll want to redirect the output to file using the : > redirection operator (Remember that this will overwrite whatever filename you put after the > character!).

15. Take a look at the eachreppaup.nex that is included in the lab. It should contain the following lines.

begin paup;
	execute run.nex;
end;

the -xeachreppaup.nex option to seq-gen will insert the contents of eachreppaup.nex in between each data set.

By putting the correct commands in a file called step18.run.nex

16. Run seq-gen. The invocation should be something like the command below (but with the parameter estimates for this dataset filled in the appropriate spots):

seq-gen -mGTR -a0.6 -i0.32 -r 0.6 2.1 0.3 0.2 5 1 -f 0.27 0.20 0.30 0.23 -l920 -n1000 -on -xeachreppaup.nex model.txt > simdata.nex

Use the parameter values that you got from PAUP's LScore to construct a similar command and run it.
Note: In the Windows executable version of the program, the syntax of the command line is somewhat different. The rate matrix will be specified as -r0.6,2.1,0.3,0.2,5,1 and the base frequencies as -f0.27,0.20,0.30,0.23. You will also need to direct the treefile to seq-gen by inserting a < before the filename. In the case above, the end part of the command will read -xeachreppaup.nex < model.txt > simdata.nex

17. Open simdata.nex in a text editor. Do you see the content of the eachreppaup.nex file?

Running PAUP on the simulated data

Now we have 1000 datasets. How are we going to analyze them all? Fortunately we have the PAUP command execute step18.run.nex intercalated between each data set. If we put the commands that we want PAUP to execute in the step18.run.nex file then those commands will be executed for each dataset.

What do we want to do for each dataset? We want to see the difference in score that we get between the true tree and the preferred (most-parsimonious) tree. This will give us a distribution on the amount of spurious support we could get for a tree because of sampling error (or even systematic error.

18. Take a look at the step18.run.nex file. It should contain:

#NEXUS
BAndB;
[!****This is the best tree's score****]
PScore;
GetTrees file = model.tre;
[!####This is the true tree's score####]
PScore;

These commands find the "best" tree, score it, get the null model tree (the true tree for the simulations), and score it. We are using the output comments to make the output more visible.

Note that if we wanted to make the test more powerful we could do a constrained search for each simulated replicate instead of just scoring the model tree. (This would result in shorter trees that are consistent with our null hypothesis, which would tend to make the values of the difference in length smaller. Smaller values for the length difference in our null distribution would mean that the observed value of the test statistic would be further out in the tail of the null distribution; thus we would get a smaller p value). In the PAUP commands given above, we just score the model tree. In effect we are changing the null from:

"the tree has the chlorophyll a/b group"

to a more specific null: "the true tree is the tree stored in model.tre"

19. Finally to run all of the analyses it is helpful to have a simple "master" paup file. See the step19.master.nex file:

Log start replace file=step19.sim.log;
Set noQueryBeep noerrorBeep  noWarnReset noWarnTree noWarnTSave;
Execute simdata.nex;
Log stop;

Save this file as master.nex invoke PAUP using the -n flag so that it goes in non-interactive mode (and does not pester you with 1000 questions):

paup -n master.nex

or you can launch a graphical version of PAUP and tell it to execute the master.nex file. After a few seconds, you should have completed the analysis of all 1000 datasets.

Summarizing the output

You really don't want to browse through 1000 analyses and perform subtraction (and certainly not when you could be at the Kidd after a long day).

Summarize the output using the "easy way" below. If you want to see how you can do a lot of the calculation using pipes from the command line (and if you are working on a non-Windows machine), check out "the geeky way."

The easy way

I wrote a simple summarization script summarizePaupLengthDiffs.py. Make sure that it is in your current working directory.

If you are running Windows, you may need to install Python (any version that starts with 2 should work) if you don't have it.

20. As long as you do not mind overwriting a file in this directory named step20.diffs.txt you can run the command :

 python summarizePaupLengthDiffs.py step19.sim.log > step20.diffs.txt

This should report critical values for the test statistic at a few significance levels. You should be able to open the file step20.diffs.txt in Excel if you want to see differences for any replicate.

21. (optional) if you have the R programming language installed then you should be able to download plot_diffs.R and run it with a command line argument to produce a pdf that summarizes the parametric bootstrapping. Pass in the observed value of the test statistic to the R script. So, if the observed length difference (on the real data) was 2 then you would use the command:

R --file=plot_diffs.R --args 4

to produce a file called null_distribution_pscore_diffs.pdf with a histogram and the approximate p value.


The end

Was the observed difference in this tail of the null distribution? and would you reject the null hypothesis?


postscript: An alternative (geeky) way of running step 20

If you find it unsatisfying to run a pre-packaged script in step 20 to parse the output of the simulated data, you can try to write a quick parser.

This is a step by step instruction of how to construct a simple workflow using UNIX "pipes". We are going to connect the output of one process (process = running program) to the input of another process. This is done with a "pipe" between the processes. From our shell, this is done with the | symbol. The command:

cat step19.sim.log

spits out all 83,000 lines of the log file to the screen. Ugh! The command:

cat step19.sim.log | grep "Length "

filters all of those lines so that only those with the word Length followed by a space are printed. This selects just the output from the PScore command that we did. Want to get just the scores of the first tree each time (without the word Length)? Try:

cat step19.sim.log | grep "Length " | awk '{print $2}'

We are close. We now need to subtract the number in the second line from the first line; the number in the fourth line from the third line... This would give us the difference in score for each rep. I wrote a simple python script to do this. Save the script as consecutiveDiffs.py in the same directory that you have been working in. Now

cat step19.sim.log | grep "Length " | awk '{print $2}' | python consecutiveDiffs.py

Should display the length differences. At this point (or really a couple of steps ago) you could take the data into Excel and find the critical value for the p=0.05 level by looking for the 50th largest difference. We can finish the UNIX way by

cat step19.sim.log | grep "Length " | awk '{print $2}' | python consecutiveDiffs.py | sort -g

to sort the values numerically. And finally:

cat step19.sim.log | grep "Length " | awk '{print $2}' | python consecutiveDiffs.py | sort -g | tail -n50

to show the 50 most extreme length differences.

Topology tests in PAUP*

In PAUP you can easily run the Kishino-Hasegawa test, the Shimodaira-Hasegawa test, and the Approximately Unbiased test (of Shimodaira).

Remember *the KH test is only appropriate of the trees to be tested are specified a priori!* The param_boot_lab.zip file (which was used in the parametric bootstrapping lab) has a file called twotrees.tre which has two trees which differ in their placement of Euglena

Run:

Execute algae.nex ;
Log start file=topotesting.log.txt ;

Gettrees file = twotrees.tre ;
pscore ; 
LScore 1 2 / nst = 6 basefreq=est rates=gamma shape=est pinv=0 rmat=est ;

to load the trees in memory, score them in parsimony and in ML (under the GTR+G model, which is reasonable for these data).

Do the two criteria agree? answer

Do the data significantly prefer tree 1 according to ML? To answer this, we can use the KHTest using the normal approximation of the log likelihood ratio test statistic, we can use:

LScore 1 2 / khtest=normal ;

Compare the P-values when you use the RELL approximation. The command for RELL bootstrapping is:

LScore 1 2 / khtest=bootstrap RELL ;

Are the P-values very close to each other? If not, why not?