AlgaeTopoParametricBootstrappingLab

From MolEvol
Jump to: navigation, search

See https://molevol.mbl.edu/index.php/ParametricBootstrappingLab for links to the other related labs

Overview

This lab is written as a #Parametric Bootstrapping Lab. If you prefer, you can skip to #Topology tests in PAUP if you'd prefer to see the how to run KHTest, SHTest, and AUTest options in PAUP's LScore command.


Parametric Bootstrapping Lab

The goal of this lab exercise is to show you how to use Monte Carlo simulation to construct a null distribution for a phylogenetic hypothesis. This type of parametric bootstrapping is not a trivial analysis, so there are several steps (and we've included a script in case you want a more automated way of completing the analysis.

The easiest way to run all of the steps in the lab (via a script) is to download param_boot_lab.zip, uncompress it, cd into the unzipped directory, and run the do_param_bootlab script. From a UNIX prompt (preferably on the workshop's cluster), you can do all of this by typing:

wget http://phylo.bio.ku.edu/slides/old_param_boot_lab.zip
unzip param_boot_lab.zip
cd param_boot_lab
bash do_param_bootlab.sh

If your machine does not have wget, you can install it or use curl -O http://phylo.bio.ku.edu/slides/old_param_boot_lab.zip to grab the file.

Just running the do_param_bootlab.sh script is probably not going to help you understand the steps. I recommend that you work through the lab until at least step 16 before you resort to running the script.

Background on the Dataset

In this lab, we will use a dataset algae.nex. It contains 16S rRNA sequences for a cyanobacterium (Anacystis), a chromophyte alga (Olithodiscus), a euglenoid protist (Euglena), and six green plants, including two green algae (Chlorella and Chlamydomonas), a liverwort (Marchantia), a monocotyledonous angiosperm (rice) and a dicotyledonous angiosperm (tobacco).

This data set was used in a 1994 paper by Lockhart et al. to show how common models used in reconstructing phylogenies fail when confronted by convergence in nucleotide composition. The problem is that the common models assume stationarity of the substitution process, which leads to the assumption that base frequencies do not change across the tree. Thus, things can go wrong when the base frequencies do change from lineage to lineage, and things can go really wrong when unrelated groups tend to have similar base compositions.

In this case, all of the species except Olithodiscus and Anacystis have chlorophyll a/b and are strongly suspected to be a monophyletic group. However, as you will see, Euglena has a strong tendency to group with the unrelated chromophyte Olithodiscus because of similarities in base composition.

The complete reference to the Lockhart paper is Lockhart, P. J., M. A. Steel, M. D. Hendy, and D. Penny. 1994. "Recovering evolutionary trees under a more realistic model of sequence evolution". Molecular Biology and Evolution 11: 605-612.


Null hypothesis

For the purpose of the lab we will use the null hypothesis that there is a branch separating Anacystis and Olithodiscus from the rest of the taxa.

Parametric Bootstrapping Background

To save time, we will be use parsimony. The same principles of parametric bootstrapping apply to ML and distance-based methods.

Parametric bootstrapping is useful for answering the question: "I have observed a particular difference between the score of the optimal tree and the score of the best tree that is compatible with the null hypothesis. Is this score difference larger than I would expect if the null were true?"

The score difference can be a difference in log-likelihoods, a difference in parsimony scores, etc.

Parametric bootstrapping has a lot of steps. You may want to look at this flowchart.

Lab Activity

1. Download the data for the lab by running:

wget http://phylo.bio.ku.edu/slides/old_param_boot_lab.zip
unzip param_boot_lab.zip
cd param_boot_lab

You can see the contents of the data file by running

cat algae.nex


2. Find the score of the most parsimonious tree using PAUP. Launch the command line version of PAUP by typing paup. Because this is a small dataset, you can use the commands:

Execute algae.nex ; 
AllTrees fd=bar;

command to see all the scores of all the trees.

  • How many most parsimonious trees are there? answer
  • What is the parsimony score of the most parsimonious tree(s)? answer
You can use
ShowTrees all
to see all of the trees in memory (note that the AllTrees command scores every tree, but does not retain all of the trees in memory. It will only retain the best trees according to the criterion). 3. Use the
boot
command to calculate the bootstrap proportions and display the majority-rule consensus tree from the bootstrapping analyses.


4. Does the bootstrap tree display the chlorophyll a/b clade (a clade of all of the species except Anacystis and Olithodiscus)? answer

What is the bootstrap support for the split that represents our null hypothesis? If that split is not supported by the data, then it will not be on the majority-rule tree but it should be listed in the split frequency table (Anacystis is taxon number 7 and Olithodiscus is taxon number 8, so look for a row in the table with a split pattern that groups taxa 7 and 8 in one group and all the other taxa in the other group: e.g., "......**"). answer


5. Now we need to find the best scoring tree that is consistent with our null hypothesis. In other words, we want to find the best-scoring tree that has a clade of Euglena, Chlorella, Chlamydomonas, Marchantia, Rice, and Tobacco excluding Anacystis nidulans and Olithodiscus. We can do this in a number of ways.

We will tell PAUP to do a search for the best tree that satisfies a topological constraint.

  • The first step is to write a file with a NEXUS trees block and a tree the contains only one branch. In a text editor, create a new file with the following contents:
#NEXUS
begin trees ;
    Tree ab = [&U](Anacystis_nidulans,Olithodiscus,(Euglena, Chlorella, Chlamydomonas, Marchantia, Rice, Tobacco));
end;

The name of the constraint tree can be any NEXUS word that you like; in this example ab will be the name of the constraint tree. When you are dealing with large numbers of taxa and complex constraints it is often helpful to construct the constraint tree in Mesquite, save it to a file, and then read it into PAUP using the LoadConstr command.

  • Save the file as step5.abconstraint.tre in the same directory as the algae.nex directory (you can use nano step5.abconstraint.tre to open your text editor on the cluster).
  • To get PAUP to read the constraint into memory, use the
    LoadConstr file=step5.abconstraint.tre
    command.

6. Use the ShowConstr command to see the constraint tree that you have defined and make sure that it is the constraint that you intended.


7. Now we will conduct a branch-and-bound search that honors the constraint tree that we have just defined:

Log start file = 'steps7-11.realdatalog.txt' ; 
BAndB enforce constraints=ab

The enforce keyword tells paup to only consider trees that meet a constraint, and the constraints=ab keyword tell PAUP which tree to use as a constraint. Note that you can also use constraints with the HSearch command of PAUP (and you will need to do this for bigger datasets). What is the parsimony score of the best tree compatible with the constraint? answer

8. Use the ShowTrees command to see the tree. Does it satisfy the constraint? (it should).

9. Use the SaveTrees file=bestconstrained.tre to save this tree to a file in case you need it later.

Hypothesis testing

As our test statistic, we will use the difference in parsimony score between the best (unconstrained) tree and the best tree that satisfies our null hypothesis.

10. What is the value of the test statistic for our data? answer


To interpret the value of the test statistic we need to have some idea the range of values would be produced if the null hypothesis were true. This can be tricky. For one thing, there are lots of trees that are compatible with the null hypothesis. It seems logical to take the tree from the constrained search as the tree to repersent the null hypothesis; it is the tree that best explains the data (according to parsimony). Technically speaking this procedure of choosing a tree to represent the null does not guarantee that we are testing from the "least favorable conditions" as we should in hypothesis testing - but using this tree seems good enough, and it is practical.

Even if we are satisfied about the choice of a tree that represents null, we still have to have a way to find out what the distribution of the test statistic would be if this null were true. We will use Monte Carlo simulations for this.

In particular we will use Seq-Gen to generate a bunch of datasets that are compatible with the kind of data that we would see if the null were true. Then we will calculate the test statistic on each of them. This will give us a null distribution of the test statistic. We can compare our real data to this distribution to determine significance.

Finding model parameters

To simulate data, Seq-Gen needs a fully-specified model and a tree with branch lengths. We can use the tree that we found in the constrained search and the GTR+I+G model to get the necessary input.

11. Assuming that you have not quit PAUP and the tree found in the constrained search is still in memory, then we can save it with branch lengths that maximize the likelihood under the GTR+I+G model:

Set crit = like;
LSet nst=6 rmat=est basefreq=est rates = gamma shape = est pinv=est;
LScore;
SaveTrees file = model.tre brlens format=altnexus;

Make sure that you understand these commands (ask an instructor if you have questions). From the output of PAUP you should have the model parameter values for the simulation.

12. Look at the tree file. You should see a newick string representing a tree with branch lengths. You can use a text editor to see the newick representation. You can use FigTree to see a graphical representation of the tree, but that will require using scp to get the tree onto your laptop (from the cluster).


13. Unfortunately, seq-gen does not understand NEXUS tree files. Cut just the newick tree (the parenthetical description of the tree) from the file to a new file called model.txt or you can download this version of model.txt. If you are a UNIX geek, you can cut the tree out by quitting paup and issuing the command:

cat model.tre | grep PAUP_1 | awk '{print $5}' > model.txt

Non-geeks tend to prefer opening model.tre, copying the newick string, and saving it to a model.txt file.

Invoking seq-gen

Seq-Gen is installed on the workshop's cluster. If you are running the exercise on a machine that does not have seq-gen, you'll need to download Seq-Gen and install it.

To install, you'll need to drag the seq-gen executable to the directory that you are using for this lab (or add it to your PATH environmental variable that tells your shell where to find executables notes here)

14. To see the options for seq-gen use the command

seq-gen -h

To make sure everything is working do a simple test run using the HKY model:

seq-gen -mHKY model.txt

This should generate a dataset and print it to the screen. The simulation used it default parameter values for the HKY model. We'd like to use the parameters that we inferred from our real data (because the parameter values will affect the dataset-to-dataset variance, and hence the distribution of our test statistic). All commands are given to seq-gen as command line flags. The ones that we will use are:

-mGTR to specify the GTR model
-a preceding the shape parameter value
-i preceding the proportion of invariant sites
-r preceding the 6 instantanteous rates of the GTR matrix (PAUP infers the first five and fixes the last one to 1.0)
-f preceding the base frequencies
-l920 to simulate 920 sites (the same length as our real dataset).
-n1000 to generate 1000 datasets
-on to request output in the NEXUS format
-xeachreppaup.nex to tell it the name of a file with text input to be inserted between each dataset.

15. Take a look at the eachreppaup.nex that is included in the lab. It should contain the following lines.

begin paup;
	execute step18.run.nex;
end;

the -xeachreppaup.nex option to seq-gen will insert the contents of eachreppaup.nex in between each data set.

By putting the correct commands in a file called step18.run.nex

16. Now we will run seq-gen.

One additional trick: We want to redirect the output to file using the : > redirection operator so that we can store the simulated data to a file (rather than printing it out). Warning: this will overwrite whatever filename you put after the > character!. We will redirect to a file called simdata.nex with the syntax > simdata.nex at the end of our invocation of seq-gen.

The invocation should be something like the command below (but with the parameter estimates for this dataset filled in the appropriate spots).

seq-gen -mGTR -a0.6 -i0.32 -r 0.6 2.1 0.3 0.2 5 1 -f 0.27 0.20 0.30 0.23 -l920 -n1000 -on -xeachreppaup.nex model.txt > simdata.nex

Use the parameter values that you got from PAUP's LScore to construct a similar command and run it.
Note: In the Windows executable version of the program, the syntax of the command line is somewhat different. The rate matrix will be specified as -r0.6,2.1,0.3,0.2,5,1 and the base frequencies as -f0.27,0.20,0.30,0.23. You will also need to direct the treefile to seq-gen by inserting a < before the filename. In the case above, the end part of the command will read -xeachreppaup.nex < model.txt > simdata.nex

17. Open simdata.nex in a text editor. Do you see the content of the eachreppaup.nex file?

Running PAUP on the simulated data

Now we have 1000 datasets. How are we going to analyze them all? Fortunately we have the PAUP command execute step18.run.nex intercalated between each data set. If we put the commands that we want PAUP to execute in the step18.run.nex file then those commands will be executed for each dataset.

What do we want to do for each dataset? We want to see the difference in score that we get between the true tree and the preferred (most-parsimonious) tree. This will give us a distribution on the amount of spurious support we could get for a tree because of sampling error (or even systematic error).

18. Take a look at the step18.run.nex file. It should contain:

#NEXUS
BAndB;
[!****This is the best tree's score****]
PScore;
GetTrees file = model.tre;
[!####This is the true tree's score####]
PScore;

These commands find the "best" tree, score it, get the null model tree (the true tree for the simulations), and score it. We are using the output comments to make the output more visible.

Note that if we wanted to make the test more powerful we could do a constrained search for each simulated replicate instead of just scoring the model tree. (This would result in shorter trees that are consistent with our null hypothesis, which would tend to make the values of the difference in length smaller. Smaller values for the length difference in our null distribution would mean that the observed value of the test statistic would be further out in the tail of the null distribution; thus we would get a smaller p value). In the PAUP commands given above, we just score the model tree. In effect we are changing the null from:

"the tree has the chlorophyll a/b group"

to a more specific null: "the true tree is the tree stored in model.tre"

19. Finally to run all of the analyses it is helpful to have a simple "master" paup file. See the step19.master.nex file:

Log start replace file=step19.sim.log;
Set noQueryBeep noerrorBeep  noWarnReset noWarnTree noWarnTSave;
Execute simdata.nex;
Log stop;

Save this file as master.nex invoke PAUP using the -n flag so that it goes in non-interactive mode (and does not pester you with 1000 questions):

paup -n master.nex

or you can launch a graphical version of PAUP and tell it to execute the master.nex file. After a few seconds, you should have completed the analysis of all 1000 datasets.

Summarizing the output

You really don't want to browse through 1000 analyses and perform subtraction (and certainly not when you could be at the Kidd after a long day).

If you want to see how you can do a lot of the calculation using pipes from the command line (and if you are working on a non-Windows machine), check out "the geeky way." Here we will use the easy way

I wrote a simple summarization script summarizePaupLengthDiffs.py that is in the bundle that you downloaded.

If you are running Windows (and not working on the cluster), you may need to install Python (any version that starts with 2 should work) if you don't have it.

20. As long as you do not mind overwriting a file in this directory named step20.diffs.txt you can run the command :

 python summarizePaupLengthDiffs.py step19.sim.log > step20.diffs.txt

This should report critical values for the test statistic at a few significance levels. You should be able to open the file step20.diffs.txt in Excel if you want to see differences for any replicate.


21. (optional) If you have the R programming language installed then you should be able to run plot_diffs.R and run it with a command line argument to produce a pdf that summarizes the parametric bootstrapping. Pass in the observed value of the test statistic to the R script. So, if the observed length difference (on the real data) was 2 then you would use the command:

R --file=plot_diffs.R --args 2

to produce a file called null_distribution_pscore_diffs.pdf with a histogram and the approximate p value.

The end

Was the observed difference in this tail of the null distribution? and would you reject the null hypothesis?


postscript: An alternative (geeky) way of running step 20

If you find it unsatisfying to run a pre-packaged script in step 20 to parse the output of the simulated data, you can try to write a quick parser.

This is a step by step instruction of how to construct a simple workflow using UNIX "pipes". We are going to connect the output of one process (process = running program) to the input of another process. This is done with a "pipe" between the processes. From our shell, this is done with the | symbol. (You can learn more about pipes and grep here: UNIX_tutorial_advanced)
The command:

cat step19.sim.log

spits out all 83,000 lines of the log file to the screen. Ugh! The command:

cat step19.sim.log | grep "Length "

filters all of those lines so that only those with the word Length followed by a space are printed. This selects just the output from the PScore command that we did. Want to get just the scores of the first tree each time (without the word Length)? Try:

cat step19.sim.log | grep "Length " | awk '{print $2}'

We are close. We now need to subtract the number in the second line from the first line; the number in the fourth line from the third line... This would give us the difference in score for each rep. I wrote a simple python script to do this. Save the script as consecutiveDiffs.py in the same directory that you have been working in. Now

cat step19.sim.log | grep "Length " | awk '{print $2}' | python consecutiveDiffs.py

Should display the length differences. At this point (or really a couple of steps ago) you could take the data into Excel and find the critical value for the p=0.05 level by looking for the 50th largest difference. We can finish the UNIX way by

cat step19.sim.log | grep "Length " | awk '{print $2}' | python consecutiveDiffs.py | sort -g

to sort the values numerically. And finally:

cat step19.sim.log | grep "Length " | awk '{print $2}' | python consecutiveDiffs.py | sort -g | tail -n50

to show the 50 most extreme length differences.

Topology tests in PAUP

In PAUP you can easily run the Kishino-Hasegawa test, the Shimodaira-Hasegawa test, and the Approximately Unbiased test (of Shimodaira).

Remember the KH test is only appropriate if the trees to be tested are specified a priori The param_boot_lab.zip file (which was used in the parametric bootstrapping lab) has a file called twotrees.tre which has two trees which differ in their placement of Euglena

Run:

Execute algae.nex ;
Log start file=topotesting.log.txt ;

Gettrees file = twotrees.tre ;
pscore ; 
LScore 1 2 / nst = 6 basefreq=est rates=gamma shape=est pinv=0 rmat=est ;

to load the trees in memory, score them in parsimony and in ML (under the GTR+G model, which is reasonable for these data).

Do the two criteria agree? answer

Do the data significantly prefer tree 1 according to ML? To answer this, we can use the KHTest using the normal approximation of the log likelihood ratio test statistic, we can use:

LScore 1 2 / khtest=normal ;

Compare the P-values when you use the RELL approximation. The command for RELL bootstrapping is:

LScore 1 2 / khtest=bootstrap RELL ;

Are the P-values very close to each other? If not, why not? answer

Tests when you don't have two trees a priori

The following should work on paup installed on the class servers. If not, you have to use the version of paup from the paup_test site.

In truth, I took the 2 trees from Peter Lockhart et al.'s paper. This is also the source of these data that we are using. So the trees are not specified without reference to the data, so we should not be using the KH Test.

Suppose that we a came into the study confident of the topology of the land plants: in this data set Tobacco and Rice are sister. This group of Angiosperms is sister to Marchantia.

We would not be willing to consider an inference that conflicted with this set of relationships. If this is the full extent of our knowledge, then our candidate set can be described as all trees that show those relationships.

LSet nst = 6 basefreq=est rates=gamma shape=est pinv=0 rmat=est NoVec NThreads=1;
Set maxtrees = 200 ;
Constraint sp = (((Tobacco,Rice),Marchantia),Chlamydomonas,Chlorella,Euglena,Olithodiscus,Anacystis_nidulans) ;
Showcons sp ;

will load and show the constraint tree

Now run:

Generate all constraints = sp ;
Savetrees file=candidateset.tre ;

To tell PAUP to generate all of the trees in the candidate set and save them.

Tell PAUP to run the AUTest and then SHTest with:

LScore all / NoKHTest AUTest NoSHTest ;
LScore all / NoKHTest NoAUTest SHTest  ;

Do the P-values agree? Which test is more powerful? answer

If you have time, you can see the effect of candidate set size on the power of the SH Test. Suppose that we knew that Chlamydomonas was sister to the land plants. Try out these commands:

Constraint spc = ((((Tobacco,Rice),Marchantia),Chlamydomonas),Chlorella,Euglena,Olithodiscus,Anacystis_nidulans) ;
Generate all constraints = spc ;
Savetrees file=smallcandidateset.tre ;
LScore all / NoKHTest NoAUTest SHTest  ;

The tree numbers will have changed (because you have a smaller set of trees), but you can find the matching tree by comparing lnL values. You should see a lower P-value in the second analysis - having fewer trees in the candidate set increases the power of the SHTest (but has less affect on the AUTest).

Other pointers

Other software relevant to the testing lecture:

  • PhyML aLRT and aBayes statistics, in particular.
  • RAxML Rapid bootstrapping
  • Consel AU Test, SH Test, Weighted SH Test, KH Test, Bootstrap
  • Ed Susko's software page (software from Susko, E. (2010) and Susko, E. (2006) papers is of particular relevance to the tree testing lecture).