Difference between revisions of "CharacterEvoParametricBootstrappingLab"

From MolEvol
m (Overview)
(Calculating the test statistic on the real data)
Line 107: Line 107:
You can use <pre>ShowTree 1</pre>
You can use <pre>ShowTree 1</pre>
to see the first trees in memory (only the trees that are tied for having the best score will be in memory at this point.
to see the first trees in memory (only the trees that are tied for having the best score will be in memory at this point).
We are going to simulate some data. For that we will need parameter estimates and  
We are going to simulate some data. For that we will need parameter estimates and  

Revision as of 09:59, 22 July 2017

See https://molevol.mbl.edu/index.php/ParametricBootstrappingLab for links to the other related labs


This is an example of using parametric bootstrapping to test whether or not there is evidence for the reversibility of a character. It is based on the analysis of the evolution of oviparity from viviparity ( which would be a very surprising reversal). But that question of character evolution clearly depends on the phylogenetic context. There are more data available than this tiny mitochondrial data set, but we will use this data set to estimate a tree to emphasize how we can take phylogenetic error into account in our hypothesis testing. We'll use parsimony to infer a tree (preference for tree with the minimum number of character state changes) because it is fast.

This was inspired by the Wright et al paper (doi:10.1002/jez.b.22642 http://onlinelibrary.wiley.com/doi/10.1002/jez.b.22642/abstract;jsessionid=12F996E9F8866D699CC4E886F537213D.f02t04) mentioned in Hillis' lecture. Our analysis will be much less sophisticated, but we just want to demonstrate that the general simulation tools can be used for a wide variety of questions.

The first 2 steps of this lab are the same as in SnakeTopoParametricBootstrappingLab.

Background on the Dataset

This data set is a very small subset of the data matrix published by:

Pyron RA, Burbrink FT, Wiens JJ (2013) A phylogeny and revised classification of Squamata, including 4161 species of lizards and snakes. BMC Evolutionary Biology 13: 93. http://dx.doi.org/10.1186/1471-2148-13-93

The data was obtained from DataDryad:

Pyron RA, Burbrink FT, Wiens JJ (2013) Data from: A phylogeny and revised classification of Squamata, including 4161 species of lizards and snakes. Dryad Digital Repository. http://dx.doi.org/10.5061/dryad.82h0m Specifially, the data was at: http://datadryad.org/bitstream/handle/10255/dryad.47660/Squamata_10_28.phy?sequence=1

The file included here is a matrix that consists of ND2 and ND4 (mitochondrial) gene sequences for some selected squamates (mainly snakes). Mark Holder pruned it down to make the lab run quickly. The species are not randomly sampled - some interesting groups were selected based on whether or not they are viviparous (give birth to young without an external egg). This non-random selection could certainly affect the results; Wright et al discuss the difficulties associated with dealing with this sort of bias (and the sampling in the Pyron et al paper was much less biased than the subset that we are using here).

This is the same data set that is used in the SnakeTopoParametricBootstrappingLab.

Parametric Bootstrapping Background

To save time, we will be use parsimony. Parsimony searches try to find the tree that requires the fewest number mutations to explain the data. These data are from the (fast-evolving) mitochondrial genome, so parsimony is definitely not a good choice of analysis method. We are only using it here because it is fast, and our time in the lab is limited. The same principles of parametric bootstrapping apply to ML and distance-based methods.

Here we are going to use ancestral character state reconstructions. On the most parsimonious tree, there is some signal for a reversal to oviparity.

We can quantify that signal by comparing parsimony scores. For our real data set we ask: "If we disallow reversals to oviparity, how many more evolutionary transitions are required than if we allow reversals?"

A better way to do this would be to find stochastic models - one which allows reversals and another which disallows them. See Wright et al for a discussion such models (including models that allow the trait to affect the speciation/extinction rates of species). Once again, we are using parsimony for the sake of time in the computer lab.

Calculating the test statistic on the real data

1. Download the data for the lab by running:

wget http://phylo.bio.ku.edu/slides/param_boot_lab.zip
unzip param_boot_lab.zip
cd param_boot_lab/snakes/

If your machine does not have wget, you can install it or use curl -O http://phylo.bio.ku.edu/slides/param_boot_lab.zip to grab the file.

The data file is snakes.nex

2. Find the score of the most parsimonious tree using PAUP. Launch the command line version of PAUP by typing paup.

Once we have launched paup, we should start a log file to save the output, execute the data set and run a heuristic search. Heuristic searches are not guaranteed to find the optimal solution, but they often do a pretty good job (we'll talk more about tree searching later in the course). The invocation shown below will conduct 100 searches of trees from 100 different starting trees that are constructed using a quick algorithm called "stepwise addition."

Log Start File = snakecharevol.log ;
Execute snakes.nex ;
Set Criterion = pars ;
HSearch NReps = 100 Start = stepwise AddSeq = rand MulTrees = no ;

How many most parsimonious trees did you find? You certainly aren't finding all of the most parsimonious trees with the quick search we have done, but do make sure that you can interpret PAUP's output.

What is the parsimony score of the most parsimonious tree(s)? answer (your answer may vary a bit because our searches are heuristic).

You can use

ShowTree 1

to see the first trees in memory (only the trees that are tied for having the best score will be in memory at this point).

We are going to simulate some data. For that we will need parameter estimates and branch lengths. So, we will save the tree in the ML framework (here I provide you with good estimates of the gamma shape parameter and pinvar to make the likelihood scoring faster - normally you would estimate those parameters).

Set Criterion = like ; 
LSet NST = 6 RMat = estimate Rates = gamma Shape = 0.57 PInv=0.23;
LScore 1 ; 
LSet rmat = prev ;
SaveTrees File = mpwithmlbrlens.tre BrLens From = 1 To = 1 Format=AltNexus  ; 

3. In a different file viviparity.nex, we have data (0= oviparity and 1=viviparity) for whether or not each species lays eggs. We will use a test statistic that is:

   Parsimony Irreversible Score - Parsimony Score with reversals allowed

In the jargon of parsimony "unordered" is the name of a model in which any change in character state costs 1 step. So we will calculate the irreversible and unordered parsimony scores.

First we need to execute the new data set and get our estimated tree into memory. Give PAUP the following commands to read in the tree and trait data:

LSET Pinv = 0 ;
SET Criterion = pars ; 
EXECUTE viviparity.nex ; 
GETTREES File = mpwithmlbrlens.tre ;

We use the CTYPE command to control how PAUP scores characters. To associate the trait data with the appropriate tips and score the first (and only character) in unordered mode and show the ancestral character state reconstruction, tell PAUP the following:

CTYPE Unord : 1 ;

Then we can score the character in the irreversible parsimony scheme, we need to tell PAUP that we want to root the tree, and then change the character type:

CTYPE Irrev : 1 ;

What is the difference in parsimony score between the 2 types of parsimony? answer

Hypothesis testing

If you have the phylogeny correct, then disallowing reversals costs us 1 or 2 steps in the parsimony framework (depending on which tree you found). But is that a significant extra cost? Is it plausible that we would see that magnitude of signal for reversal even if the trait evolved in an irreversible manner?

4. Simulating the data.

Tell PAUP to quit to exit and return to your bash prompt.

I wrote a short python script to simulate a character under an irreversible model. The script needs a five arguments:

  • 1: tree with branch lengths,
  • 2: a parameter that controls the rate of character change,
  • 3: a number of data sets to simulate, and
  • 4 and 5: the names of 2 of the tips in the tree. Their most recent ancestor will be used to find the subtree in which evolution occurs.

The script will start the character at state 0, and not allow it to change in the outgroup. Then it will simulate irreversible evolution that in which the character starts to evolve at the base of the ingroup.

On the real data, we estimated 10 transitions if the character evolves irreversibly. I have tried out a few different rate values for my script, and recommend using a value of around 20 for the rate parameter.

We can use Typhlops reticulatus and Agkistrodon piscivorus to designate snakes as the ingroup. So to generate 1000 data sets in stored in a file called simrecon.nex you can run:

python sim_irreversible.py  mpwithmlbrlens.tre 20 1000 'Typhlops reticulatus' 'Agkistrodon piscivorus' > simrecon.nex

If you are interested in bioinformatics, you can take a look at the python script. It will probably make more sense after John Huelsenbeck's discussion of how to simulate character evolution. I purposely wrote the script to implement the same simulation algorithm that he explains in class.

5 Writing the commands to analyze the simulated data.

If you look at the simrecon.nex file, you will see that in between each of data blocks there is a PAUP block that contains the one command: Execute eachreconcommand.nex;

So if we put the commands needed to calculate the test statistic in eachreconcommand.nex we can have the test statistic calculated on each of the simulated data set.

Put the following commands in eachreconcommand.nex:

    GetTrees file = mpwithmlbrlens.tre ; 
    outgroup Gekko_gecko Heloderma_suspectum Varanus_salvator Iguana_iguana Anolis_carolinensis Scincella_lateralis ;
    CTYPE Unord : 1 ;
    PSCORE ;
    ROOT ; 
    CTYPE Irrev : 1 ;
    PSCORE ;

to get the tree used in the simulation, specify the outgroup (for the rooting) and then calculate the score under the irreversible and unordered parsimony variants.

Note that this is the same way we calculated the test statistic on the real data (the outgroup was specified in the viviparity.nex file, so we did not have to do that bit explicitly).

6 Analyzing the simulated data

Now we analyze all of those data sets by running PAUP in non-interactive mode

paup -n simrecon.nex > simrecon.log

Summarizing the output

Browsing the simrecon.log and you will see lots of parsimony scores on simulated data. You really don't want to browse through 1000 analyses and perform subtraction (and certainly not when you could be at the Kidd after a long day).

I have a script for searching through a PAUP log and subtracting every even parsimony score from the previous parsimony score. So if you have a test statistic that is the difference of parsimony scores (like our test statistic) you can use it to summarize the output.

19. Run:

python summarizePaupLengthDiffs.py simrecon.log > simrecon.tsv

This should report critical values for the test statistic at a few significance levels. You should be able to open the file simrecon.tsv in Excel if you want to see differences for any replicate.

20. (optional) If you have the R programming language installed then you should be able to run plot_recon_diffs.R and run it with a command line argument to produce a pdf that summarizes the parametric bootstrapping. Pass in the observed value of the test statistic to the R script. So, if the observed length difference (on the real data) was 2 then you would use the command:

R --file=plot_recon_diffs.R --args 2

to produce a file called snake_null_distribution_recon_diffs.pdf with a histogram and the approximate p value.

The end

Was the observed difference in this tail of the null distribution? and would you reject the null hypothesis?