MolEvol - User contributions: Jsukumaran [en] (MediaWiki 1.31.7)

Jeet Sukumaran (2012-08-02)
<hr />
<div>You can find out more about me [http://jeetworks.org on my website].<br />
<br />
My email address is jeet.sukumaran[at]duke.edu.<br />
<br />
If you like working with Python, you might find the [http://packages.python.org/DendroPy/ DendroPy phylogenetic computing library] useful, and I will be happy to help you get started with it.</div>

Migrate tutorial (2012-07-31)
<hr />
<div><br />
== Tutorial Overview ==<br />
=== How to get MIGRATE ===<br />
The most recent copy of the software on the class server is in the /class/shared/migrate_demo/local_distribution directory.<br />
You can either use the program that is already installed on the class server or use the files in that local_distribution directory to<br />
install MIGRATE on your own computer. If this fails, download MIGRATE from [http://popgen.sc.fsu.edu/Migrate/Download.html the Migrate download website].<br />
<br />
=== Comparison of gene flow models using Bayes Factors with MIGRATE ===<br />
[[File:migrate2logo.png|right|300px]] <br />
Most of us are familiar with the concept of likelihood ratio tests or Akaike's<br />
information criterion for model comparison. This tutorial describes how to compare models using Bayes factors.<br />
These allow comparing nested and non-nested models without assuming normality or large samples. Bayes factors<br />
are ratios of marginal likelihoods. In contrast to maximum likelihood, the marginal likelihood is the integral<br />
of the likelihood function over the complete parameter range. MIGRATE can calculate such marginal likelihoods<br />
for a particular migration model (Beerli and Palczewski 2010). This tutorial steps through all necessary<br />
program runs to calculate Bayes factors for comparing different gene flow models. We need to do the following: <br />
<br />
* Decide on the models that are interesting to compare. The method does not work well for a fishing expedition where one tries to evaluate all possible models; this is feasible only for a small number of populations. Enumerating all models is still possible for three populations, but for more it becomes very daunting.<br />
<br />
* Run each model through MIGRATE. Use the same prior settings for every model, because the prior distribution has some influence on the Bayes factors. Use the heating menu to allow for at least four chains. The menu supplies a shortcut to specify the temperatures: #. It generates temperatures spaced so that their inverses are regularly spaced on the interval 0 to 1. For example, four chains with temperatures 1.0, 1.5, 3.0, and 100,000.0 have inverse temperatures 1.0, 0.666, 0.333, and 0.0.<br />
<br />
* Compare the marginal likelihoods of the different runs, calculate the Bayes factors, and calculate the probability of each model.<br />
<br />
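The temperature spacing produced by the # shortcut can be sketched in a few lines of Python (an illustration only — MIGRATE generates these values internally; 100,000.0 stands in for an infinitely hot chain, as in the example above):

```python
# Static heating: the "#" shortcut spaces the *inverse* temperatures
# evenly on [0, 1]. The coldest chain (temperature 1.0) samples the
# posterior; 100000.0 approximates an infinitely hot chain.
n_chains = 4
inverse_temps = [1 - i / (n_chains - 1) for i in range(n_chains)]
# -> [1.0, 0.666..., 0.333..., 0.0]
temps = [1 / b if b > 0 else 100000.0 for b in inverse_temps]
# -> [1.0, 1.5, 3.0, 100000.0]
```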
The following pages detail all steps using a small example. We use a simulated dataset that was generated<br />
using parameters that force a direction of migration from the population Aadorf (A) to the population Bern<br />
(B). The Bern population is 10x larger than the Aadorf population and no individual from Bern ever goes to<br />
Aadorf, but Bern receives about 1 migrant per generation from Aadorf. The dataset name is '''twoswisstowns'''<br />
([http://popgen.sc.fsu.edu/tutorials/BF_migrate_tutorial/twoswisstowns download here]).<br />
We will evaluate 4 models: <br />
# a full model with two population sizes and two migration rates (from A to B and from B to A);<br />
# a model with two population sizes and one migration rate to Bern;<br />
# a model with two population sizes and one migration rate to Aadorf;<br />
# a model where Aadorf and Bern are part of the same panmictic population.<br />
Because we know the truth, we have some prejudice about the ranking of the models: '''model 2''' should be<br />
''best'', and '''model 1''' should be ranked ''second'' because it allows for the same migration direction as '''model 2'''.<br />
Whether '''model 3''' is better than '''model 4''' is unknown a priori and may depend on the<br />
strength of the data. First we need to figure out how to run the dataset efficiently in MIGRATE. For that we<br />
pick the most complicated '''model 1''' and experiment with run conditions until we are satisfied that the run<br />
converges and delivers posterior distributions that look acceptable.<br />
The sections below give detailed instructions on how to rank population genetics models for a particular dataset.<br />
<br />
<br />
== Familiarize yourself with MIGRATE [Tutorial Start] ==<br />
* Make a new directory and download or copy the datafile<br />
<pre><br />
mkdir migtut<br />
cd migtut<br />
wget http://popgen.sc.fsu.edu/tutorials/BF_migrate_tutorial/twoswisstowns<br />
</pre><br />
* Start the program MIGRATE-N (from now on I will simply call it MIGRATE). On the server use<br />
<pre style="font-size: 1.25em"><br />
migrate-n <br />
</pre><br />
On your laptop you may need to use "./migrate-n" if the program is in your current directory.<br />
The main menu will appear, looking like this: <br />
<pre style="font-size: 1.25em"><br />
[pbeerli@class-02 migtut]$ migrate-n<br />
=============================================<br />
MIGRATION RATE AND POPULATION SIZE ESTIMATION<br />
using Markov Chain Monte Carlo simulation<br />
=============================================<br />
PDF output enabled [Letter-size]<br />
Version 3.3.0 <br />
Program started at Sat Jul 28 10:23:26 2012<br />
<br />
<br />
Settings for this run:<br />
D Data type currently set to: DNA sequence model <br />
I Input/Output formats<br />
P Parameters [start, migration model]<br />
S Search strategy<br />
W Write a parmfile<br />
Q Quit the program<br />
<br />
<br />
To change the settings type the letter for the menu to change<br />
Start the program with typing Yes or Y<br />
===> <br />
<br />
</pre><br />
* Go to the '''Input/Output formats''' menu (press '''I'''). In the '''INPUT''' section, change the '''Datafile''' name to '''twoswisstowns''', then return to the main menu. <br />
<br />
* In the '''Search strategy''' menu, check whether the strategy is set to '''Bayesian inference'''; if it is set to maximum likelihood, change it. Change the '''Number of recorded steps in chain''' to '''1000'''. Do not worry about priors or other runtime options for the moment. Return to the main menu.<br />
<br />
* Save the changes by using the menu item '''Write a parmfile'''. This will write a file named '''parmfile'''.<br />
<br />
* Now run the program ('''pressing Y''' in the main menu will start the run). For this dataset the runtime will be very short: on a modern computer this will take under a minute. If it takes more than 1 minute, something is not set up correctly! On the server this takes about ''5.2 seconds''.<br />
<br />
* During the run the program writes considerable information about the run to the screen. Most interesting are the acceptance ratio for the genealogy and the autocorrelations of the parameters and the genealogy. The default acceptance method for parameters is slice sampling (you can find a [http://people.sc.fsu.edu/~pbeerli/mbl2011_migrate_parallel.pdf talk] discussing this); the acceptance ratio of slice sampling is always 1.0. If the autocorrelation is high and the effective sample size is low (<500), a longer run may be needed. If the prior boundaries are too tight, the reported values will be either very close to or exactly at the upper prior boundary; in these cases you need to extend the prior range. For this dataset we will have no such problems.<br />
* Look at '''outfile.pdf'''; you will need to transfer the PDF file to your computer and use Acrobat or another PDF viewer. In the output file, look at the figures labeled ''Bayesian analysis: Posterior distribution''; you will see histograms similar to the ones in Figure 1. We expect single peaks where the shading of the histogram shows one dark block in the center (50% credibility set), two light gray bars indicating the extent of the 95% credibility set, and two lighter gray bars indicating the 99% credibility set.<br />
[[File:figure1.png|right|500px]]<br />
* In your investigation of Figure 1 you will recognize that the histogram does not look very smooth because our run was too short. Restart MIGRATE and, in the strategy menu, change the number of recorded steps in chain from 1,000 to 10,000. This will lengthen the run by a factor of 10. Don't forget to write the parmfile to save the settings. Run and compare the results (Figure 2) with the histogram from before. You will recognize that the longer run has somewhat smoother histograms, and the double peaks vanish (hopefully). With your own data you may want to do another round of refinements; eventually, by comparing the medians and means of the parameters in the table and the shapes of the histograms, you should see good agreement on similar values. If the modes of the different runs are not within the 50% credibility intervals, you certainly need to run longer. <br />
[[File:figure2.png|right|500px]]<br />
<br />
== Report marginal likelihoods of Model 1 (4 Parameter) ==<br />
<br />
* Let's assume that our runs are all satisfactory. We turn now to the estimation of the marginal likelihood to compare models. Because we want to use the thermodynamic integration method, we need to turn on heating. Start MIGRATE, use the strategy menu, and turn on static heating. MIGRATE will tell you what to do next: enter 4 chains, sampling at every tenth (10) interval, using the temperature scheme suggested with the character #. Save the parmfile, and run. This will take about 4x longer than before. It should give a better posterior distribution histogram, and a full table of (natural) log marginal likelihoods is shown towards the end of outfile.pdf. On my computer this takes about 5 minutes.<br />
<br />
[[File:Marginaltable.png|right|300px]]<br />
* Come to the front and write the log marginal likelihoods into the spreadsheet. You will need the numbers from the row labeled ''All''; the table has three columns, and you should report the values for the Bezier approximation and the harmonic mean method. (This was our first model; we will compare the different models at the end of this exercise. My log marginal likelihood values for the Bezier-approximated score and the harmonic mean are -4862.85 and -4791.29, respectively.)<br />
<br />
* To save what we have done so far, copy the parmfile to parmfile.4param, copy outfile to outfile.4param, and copy outfile.pdf to outfile.4param.pdf. MIGRATE allows you to specify names in the menu, but copying in the terminal is fast and also allows us to use the '''parmfile''' as a template for the other models, so that we do not need to change the run-length and heating parameters again.<br />
<br />
<pre style="font-size: 1.25em"><br />
cp parmfile parmfile.4param<br />
cp outfile.pdf outfile.4param.pdf<br />
</pre><br />
<br />
== Report marginal likelihoods of Model 4 (1 Parameter) ==<br />
<br />
* We now start to work on the other models. We pick the easiest first: '''model 4'''.<br />
[[file:relabel_menu.png|right|300px]]<br />
* Start MIGRATE and choose the '''Parameter settings''' menu. Choose the entry about '''sampling locations'''. We want to use the data as if we had sampled a single population; therefore we need to declare that the two locations Aadorf and Bern belong to the same panmictic population. MIGRATE's default is to assume that every location is an individual population. The dialog will first ask how many locations are in the dataset (for our example we have '''2'''). After that you will need to assign the locations to populations. For this model we assign both locations to the same population: enter '''1 1''' (one, space, one). With multiple populations more complicated settings are possible. Run MIGRATE, check the histogram, and if it looks OK, come to the front and write the log marginal likelihoods (again the row labeled All, Bezier and harmonic score) into the spreadsheet under model 4. My run took 183 seconds and delivered these log marginal likelihoods: -4887.25 and -4803.22.<br />
<br />
* Copy the parmfile to parmfile.1param, copy outfile to outfile.1param, and copy outfile.pdf to outfile.1param.pdf<br />
<br />
<pre style="font-size: 1.25em"><br />
cp parmfile parmfile.1param<br />
cp outfile.pdf outfile.1param.pdf<br />
</pre><br />
<br />
== Report marginal likelihoods of Model 2 (3 Parameter) ==<br />
[[File:figure3.png|right|300px]]<br />
* Now we consider the models with unidirectional migration. Start from the saved 4-parameter parmfile as a template:<br />
<br />
<pre style="font-size: 1.25em"><br />
cp parmfile.4param parmfile<br />
</pre><br />
<br />
* Start MIGRATE and choose the parameter menu. Choose the entry labeled ''Model is set to''. MIGRATE will now show a dizzying list of options, <b><span style="color:red; background:teal">don't panic</span></b>, we will only use a few of them. MIGRATE will ask how many populations are used: enter 2. For a 2-population model we can have 4 parameters: two population sizes and two migration rates. Before you enter values, please read this whole paragraph. A * or x means that that particular parameter will be estimated; a zero means that it will not be estimated (is not used). Our goal is to set one of the migration parameters to zero. We start with model 2 (Figure 3). MIGRATE needs all connections between the populations to be specified, and we must also give instructions on how the program should treat the population sizes. Because we want to estimate both population sizes and one migration rate, we use * for those and a zero for the unused migration rate. The connection matrix is square, so we can label it as shown in the first table below. <br />
[[File:table1.png|center|500px]]<br />
* MIGRATE now asks that you input each row. This can be done either by specifying '''* 0''' (see second table), then return, then entering the next line '''* *''' and return (second row in the second table), or by entering the whole matrix as '''* 0 * *'''.<br />
[[File:table2.png|center|300px]]<br />
<br />
Exit the parameter menu, write the parmfile, run MIGRATE, check the histogram, and report the log marginal likelihoods. My run took 156 seconds and delivered these log marginal likelihoods: -4860.58, -4795.88. Cautionary note: if you use this tutorial for your own work, please recognize that a standard run in MIGRATE can only use migration models that allow drawing a complete genealogy; for example, a model * 0 0 *, which has no migration among the populations, does not work out of the box. As a rule of thumb, each population must be connected to at least one other population. <br />
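For reference, the menu stores the model as a single entry in the written '''parmfile'''; the summary of results later quotes model 2 as custom-migration={*0**}, i.e. the connection matrix flattened row by row. A sketch of the entry (spacing inside the braces may differ in the file your version writes; it is safest to let the menu generate this line rather than editing it by hand):

```
custom-migration={* 0 * *}
```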
* Copy the parmfile to parmfile.3aparam, copy outfile to outfile.3aparam, and copy outfile.pdf to outfile.3aparam.pdf <br />
<pre style="font-size: 1.25em"><br />
cp parmfile parmfile.3aparam<br />
cp outfile.pdf outfile.3aparam.pdf<br />
</pre><br />
<br />
== Report marginal likelihoods of Model 3 (3 Parameter) ==<br />
<br />
* Run model 3 using the same procedure as for model 2. The string for the migration connection matrix is '''* * 0 *'''. Write parmfile, run, report. My run took 151 seconds and the log marginal likelihoods were -4863.08, and -4794.53. <br />
* Copy the parmfile to parmfile.3bparam, copy outfile to outfile.3bparam, and copy outfile.pdf to outfile.3bparam.pdf <br />
* Once more than half of the class has reached this point we will talk about the marginal likelihoods found.<br />
<br />
== Compare models ==<br />
<br />
* How do we calculate Bayes factors? In Table 3 I summarize all log marginal likelihoods, ln(mL). Bayes factors are calculated in several different ways in the literature; here I report the natural log Bayes factor, where<br />
[[File:formula1.png|center|400px]]<br />
* Using the guidelines of Kass and Raftery (1995), values smaller than -2 suggest a preference for 'model 2', and values larger than 2 suggest a preference for 'model 1'. We can use the log marginal likelihoods or the Bayes factors to order the models (see the column Choice in Table 3). <br />
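As a quick check of this scale, the log Bayes factor for the Bezier scores reported above can be computed directly; a tiny Python illustration, taking our model 2 as the first model in the formula and the full model as the second:

```python
# Bezier-approximated log marginal likelihoods reported in this tutorial
lnmL_model1 = -4862.85   # full model (4 parameters)
lnmL_model2 = -4860.58   # migration into Bern only (3 parameters)

# natural-log Bayes factor of model 2 over model 1:
# ln BF = ln mL(model 2) - ln mL(model 1)
lbf = lnmL_model2 - lnmL_model1   # about 2.27, i.e. larger than 2
```

Since the result exceeds 2, the data prefer model 2 over the full model on the Kass and Raftery scale.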
<br />
* We can also calculate the model probability. It is calculated by dividing each marginal likelihood by the sum of the marginal likelihoods of all models considered:<br />
<br />
[[File:formula2.png|center|300px]] <br />
<br />
* Note that the above formula uses the marginal likelihoods, *not* the *log* marginal likelihoods (which are what the program reports). The calculation of model probabilities from the reported log likelihoods is easy with computer programs that have variable precision (for example Maple or Mathematica). Calculations on a desk calculator often fail; for example, the likelihood of model 1 is a remarkably small number, exp(-4862.85) = 1.233 x 10<sup>-2112</sup>, and my emulated HP 15C calculator delivers 0.0000. But you can calculate the above quantities using this recipe: (1) find the largest log likelihood (-4860.58); (2) subtract that number from each log likelihood in the list (result: -2.27, 0.0, -2.5, -26.67); (3) exponentiate each element in the new list (result: 0.1033, 1.0, 0.0821, 2.6144 x 10<sup>-12</sup>); (4) sum all elements of the list (0.1033 + 1.0 + 0.0821 + 2.6144 x 10<sup>-12</sup>), this is the denominator (1.1854); (5) now divide each element in the list by that sum, and the numbers will look like those in the last column of Table 3.<br />
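The five-step recipe above is the standard shift-by-the-maximum (log-sum-exp) trick; a short Python sketch using the Bezier scores from my runs (your numbers will differ slightly):

```python
import math

# Bezier-approximated log marginal likelihoods for models 1-4,
# as reported in the runs above
lnmL = [-4862.85, -4860.58, -4863.08, -4887.25]

# Shift by the maximum before exponentiating so the tiny
# likelihoods (around 1e-2112) do not underflow to zero.
biggest = max(lnmL)                               # step 1
scaled = [math.exp(x - biggest) for x in lnmL]    # steps 2-3
denom = sum(scaled)                               # step 4, about 1.1854
model_probs = [s / denom for s in scaled]         # step 5
```

The resulting list matches the last column of Table 3: model 2 carries most of the probability (about 0.84).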
<br />
[[File:table3.png|center|900px]]<br />
<br />
The harmonic mean estimator fails to recover the true model for my example run and orders the models differently. Often the variance of the harmonic mean estimator (when running the same model multiple times) is large. It is therefore not reliable, although this may depend on the dataset. As a consequence, I suggest using the thermodynamic integration method if the extra cost in runtime is acceptable. Looking at the model probabilities, we can see that the "true" model has considerably higher support than the full model or the model that suggests the wrong direction of gene flow.<br />
<br />
=== Summary of results ===<br />
The best model (the one with the highest marginal likelihood) is model 2 (custom-migration={*0**}) if we use the thermodynamic approximation of the marginal likelihood. It has been shown several times (Beerli and Palczewski 2010, Xie et al. 2011) that the harmonic mean estimator is not a good estimator; it may be misleading and prefer the more complex model (as in our example). The picture below is a sample from the class tutorial done on August 1st, 2011.<br />
[[File:Migrate_exercise.png|center|1000px]]<br />
<br />
=== References === <br />
* Beerli, P. and M. Palczewski. 2010. Unified framework to evaluate panmixia and migration direction among multiple sampling locations. Genetics 185: 313–326.<br />
* Kass, R. E. and A. E. Raftery. 1995. Bayes factors. Journal of the American Statistical Association 90(430): 773–795.<br />
* Xie, W., P. O. Lewis, Y. Fan, L. Kuo, and M.-H. Chen. 2011. Improving marginal likelihood estimation for Bayesian phylogenetic model selection. Systematic Biology 60: 150–160.</div>
<hr />
<div><br />
== Tutorial Overview ==<br />
=== How to get MIGRATE ===<br />
Locally, the most recent copy of the software on the server is in the /class/shared/migrate_demo/local_distribution directory, <br />
you can either use the program that is already installed on the class server or use these files in the local_distribution directory to<br />
install MIGRATE on your own computer. if this fails download MIGRATE from [http://popgen.sc.fsu.edu/Migrate/Download.html the Migrate downloadwebsite].<br />
<br />
=== Comparison of gene flow models using Bayes Factors with MIGRATE ===<br />
[[File:migrate2logo.png|right|300px]] <br />
Most are familiar wit the concept of likelihood ratio tests, or Akaike’s<br />
Information criterion for model comparison. This tutorial describes how to compare models using Bayes Factors.<br />
These allow comparing nested and un-nested models, without assuming Normality, or large samples. Bayes factors<br />
are ratios of marginal likelihoods. In contrast to maximum likelihood, the marginal likelihood is the integral<br />
of the likelihood function over the complete parameter range. MIGRATE can calculate such marginal likelihoods<br />
for a particular migration model (Beerli and Palczewski 2010). This tutorial steps through all necessary<br />
program runs to calculate Bayes factors for comparing different gene flow models. We need to do the following: <br />
<br />
*Decide on the models that are interesting for a comparison. The method does not work well for a fishing expedition where one would try to evaluate all models; this is possible only for a small model. It will be possible to enumerate all models for three populations but more will be very daunting.<br />
<br />
* Run each model through MIGRATE. Use the same prior settings for each of them because the prior distribution has some influence on the Bayes factors. Use the heating menu to allow for at least four chains. The menu supplies a shortcut to specify the temperatures, it is #. It generates temperatures that are spaced in a particular way: they are spaced so that the inverse of the temperature are regularly spaced on the interval 0 to 1. For example, the 4 different chains have temperatures 1.0, 1.5, 3.0, 100,000.0, this results in the spacing 1.0, 0.666, 0.333, and 0.0.<br />
<br />
* Compare the marginal likelihood of the different runs and calculate the Bayes factor and calculate the probability for each model.<br />
<br />
The following pages detail all steps using a small example. We use a simulated dataset that was generated<br />
using parameters that force a direction of migration from the population Aadorf (A) to the population Bern<br />
(B). The Bern population is 10x larger than the Aadorf population and no individual from Bern ever goes to<br />
Aadorf, but Bern receives about 1 migrant per generation from Aadorf. The dataset name is '''twoswisstowns'''<br />
([http://popgen.sc.fsu.edu/tutorials/BF_migrate_tutorial/twoswisstowns download here])<br />
We will evaluate 4 models: <br />
# a full model with two population sizes and two migration rates (from A to B and from B to A);<br />
# a model with two population sizes and one migration rate to Bern;<br />
# a model with two population sizes and one migration rate to Aadorf;<br />
# a model where Aadorf and Bern are part of the same panmictic population.<br />
We know the truth therefore we have some prejudice about the ranking of the models, '''model 2''' should be<br />
''best'', '''model 1''', because it allows for the same migration direction as '''model 2''' should be ranked<br />
''second''. Whether '''model 3''' is better than '''model 4''' is unknown a priori and may depend on the<br />
strength of the data. First we need to figure out how to run the dataset efficiently in MIGRATE. For that we<br />
pick the most complicated '''model 1''' and experiment with run conditions until we are satisfied that the run<br />
converges and delivers posterior distributions that look acceptable.<br />
Here are detailed instructions how to rank population genetics models for a particular dataset.<br />
<br />
<br />
== Familiarize with MIGRATE [Tutorial Start] ==<br />
* Make a new directory and download or copy the datafile<br />
<pre><br />
mkdir migtut<br />
cd migtut<br />
wget http://popgen.sc.fsu.edu/tutorials/BF_migrate_tutorial/twoswisstowns<br />
</pre><br />
* Start the program MIGRATE-N (I will call it from now on simply MIGRATE). On the server use<br />
<pre style="font-size: 1.25em"><br />
migrate-n <br />
</pre><br />
On your laptop you may need to use "./migrate-n", if the program is in the same directory.<br />
The main menu will appear, looking like this <br />
<pre style="font-size: 1.25em"><br />
[pbeerli@class-02 migtut]$ migrate-n<br />
=============================================<br />
MIGRATION RATE AND POPULATION SIZE ESTIMATION<br />
using Markov Chain Monte Carlo simulation<br />
=============================================<br />
PDF output enabled [Letter-size]<br />
Version 3.3.0 <br />
Program started at Sat Jul 28 10:23:26 2012<br />
<br />
<br />
Settings for this run:<br />
D Data type currently set to: DNA sequence model <br />
I Input/Output formats<br />
P Parameters [start, migration model]<br />
S Search strategy<br />
W Write a parmfile<br />
Q Quit the program<br />
<br />
<br />
To change the settings type the letter for the menu to change<br />
Start the program with typing Yes or Y<br />
===> <br />
<br />
</pre><br />
* Go to the '''Input/Output formats''' menu (press '''I'''), in the '''INPUT''' section change the Datafile name to '''twoswisstowns''', Return to the main menu. <br />
<br />
* In the '''Search strategy''' menu check whether the strategy is set to '''Bayesian inference''', if it is set to maximum likelihood change it. Change the '''Number of recorded steps in chain''' to '''1000'''. Do not worry about priors or other runtime options for the moment. Return to the main menu.<br />
<br />
* Save the changes by using the menu item '''Write a parmfile'''. This will write a file named '''parmfile'''.<br />
<br />
* Now run the program ('''pressing Y''' will start the run if you are in the main menu). For this dataset the runtime will be very short. On a modern computerthis will take under a minute. If this takes more than 1 minute, something is not set up correctly! On the server this takes about ''5.2 seconds''.<br />
<br />
* The program writes considerable information during the run to the screen, that gives some information about the run. Most interesting are the acceptance ratio for the genealogy and the autocorrelations of the parameter and the genealogy. The default acceptance method for parameters is Slice sampling (You can find a [http://people.sc.fsu.edu/~pbeerli/mbl2011_migrate_parallel.pdf talk] discussing this); the acceptance ratio of Slice sampling is always 1.0. If the autocorrelation is high and the effective sample size is low (<500) then a longer run may be needed. If the priors boundaries are to tight, then you will see that the values reported and are either very close or exactly at the upper prior boundary, in these cases you need to extend the prior range. See prior problems, but for this dataset we will have no such problems.<br />
* Look at the '''outfile.pdf''', you will need to transfer the pdf file to your computer and use acrobat or another PDF viewer. In the outputfile look at the figures labeled Bayesian analysis: Posterior distribution, you see histograms similar to the ones in Figure 1. We expect single peaks where the shading of the histogram shows one dark block in the center (50% credibility set), two light gray bars indicating the extent of the 95% credibility set, and two lighter gray bars indication the 99% credibility set.<br />
[[File:figure1.png|right|500px]]<br />
* In your investigation of Figure 1 you recognize that the histogram does not look very smooth because our run was too short, now restart MIGRATE and set in the strategy menu the setting for change the number of recorded steps in chain from 1,000 to 10,000. This will lengthen the run by a factor of 10. Don't forget to write the parmfile to save the settings. Run and compare the results (Figure 2) with the histogram from before. You will recognize that the longer run has somewhat smoother histograms, and the double peaks vanish (hopefully). With your own data you may want to do another round of refinements, but eventually, by comparing the medians and means of the parameters in the table and the shape of the histograms you should see a good agreement on similar values, if the modes of the different runs are not within the 50% credibility intervals you certainly need to run longer. <br />
[[File:figure2.png|right|500px]]<br />
<br />
== Report marginal likelihoods of Model 1 (4 Parameter) ==<br />
<br />
* Let's assume that our runs are all satisfactory. We turn now to the best estimation of the marginal likelihood to compare models. Because we want to use the thermodynamic integration method, we need to turn on heating. Start MIGRATE, use the strategy menu and turn on heating, use static heating. MIGRATE will tell what to do next, you will need to enter 4 chains sampling at every tenth (10) interval using the temperature scheme that is suggested with the character #. Save the parmfile, and run. This will take about 4x longer than before. It should give a better posterior distribution histogram and will add a full table of (natural) log marginal likelihoods is shown towards the end of the outfile.pdf. On my computer this takes about 5 minutes.<br />
<br />
[[File:Marginaltable.png|right|300px]]<br />
* Come to the front and write down the log marginal likelihood into the spreadsheet (Columns labeled ) You will need the numbers from the row labeled All, in the table there are three columns, report the values for the Bezier approximation and the harmonic mean method. (This was our first model, we will compare the different models at the end of this exercise: my log marginal likelihood values for the Bezier approximated score and the Harmonic mean are -4862.85 and -4791.29, respectively.<br />
<br />
* To save what we have done so far, copy the parmfile to parmfile.4param, copy outfile to outfile.4param, and copy outfile.pdf to outfile.4param.pdf. MIGRATE allows to specify names in the menu but copying in the terminal is fast and also allows us to use the '''parmfile''' as a template for the other models, so that we do not need to change the run-length and heating parameters again.<br />
<br />
<pre style="font-size: 1.25em"><br />
cp parmfile parmfile.4param<br />
cp outfile.pdf outfile.4param.pdf<br />
</pre><br />
<br />
== Report marginal likelihoods of Model 4!! (1 Parameter) ==<br />
<br />
* We start now to work on those other models. We pick the easiest first: '''model 4'''.<br />
[[file:relabel_menu.png|right|300px]]<br />
* Start MIGRATE and choose the menu '''Parameter settings'''. Choose the entry about '''sampling locations'''. We want to use the data as if we would have sampled a single population, therefore we need to claim that the two locations Aadorf and Bern belong to the same panmictic population. MIGRATE's default is to assume that every location is a individual population. The dialog will ask first how many locations are in the dataset (for our example we have '''2'''). After that you will need to assign the locations to a population. For this model we need to assign each location to the same population. You need to enter '''1 1''' (one space one). With multiple populations more complicated settings are possible. Run MIGRATE, check the histogram, if it looks OK, come to the front and write down the log marginal likelihoods (again the row labeled All, Bezier and Harmonic score) into the spreadsheet under model 4. My run took 183 seconds and delivered these log marginal likelihoods -4887.25 and -4803.22.<br />
<br />
* Copy the parmfile to parmfile.1param, copy outfile to outfile.1param, and copy outfile.pdf to outfile.1param.pdf<br />
<br />
<pre style="font-size: 1.25em"><br />
cp parmfile parmfile.1param<br />
cp outfile.pdf outfile.1param.pdf<br />
</pre><br />
<br />
== Report marginal likelihoods of Model 2 (3 Parameter) ==<br />
[[File:figure3.png|right|300px]]<br />
* Now you need to consider the models with unidirectional migration.<br />
<br />
<pre style="font-size: 1.25em"><br />
cp parmfile.4param parmfile<br />
</pre><br />
<br />
* Start MIGRATE and choose the parameter menu. Choose the entry labeled Model is set to. MIGRATE will now show a dizzying list of options, <b><span style="color:red; background:teal">don't panic</span></b>, we will only use a few of them. MIGRATE will ask you how many populations are used: enter 2. A 2-population model can have 4 parameters: two population sizes and two migration rates. Before you enter values, please read this whole paragraph. A * or x means that that particular parameter will be estimated; a zero means that that particular parameter will not be estimated (is not used). Our goal is to set one of the migration parameters to zero. We start with model 2 (Figure 3). MIGRATE needs to know how all connections between the populations are to be treated, and we also need to give instructions on how the program will treat the population sizes. Because we want to estimate both population sizes and one migration rate, we will use the * and a zero for the unused migration rate. The connection matrix is square, so we can label it as shown in the first table below. <br />
[[File:table1.png|center|500px]]<br />
* MIGRATE now asks you to input each row. This can be done either by specifying '''* 0''' (see second table), pressing return, and then entering the next line '''* *''' followed by return (second row in second table), or by entering the whole matrix at once as '''* 0 * *'''.<br />
[[File:table2.png|center|300px]]<br />
<br />
Exit the parameter menu, write the parmfile, run MIGRATE, check the histogram, and report the log marginal likelihoods. My run took 156 seconds and delivered these log marginal likelihoods: -4860.58, -4795.88. Cautionary note: if you use this tutorial for your own work, please recognize that a standard run in MIGRATE can only use migration models that allow drawing a complete genealogy; for example a model * 0 0 *, which has no migration among the populations, does not work out of the box. As a rule of thumb, each population must be connected to at least one other population. <br />
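The rule of thumb above can be checked mechanically. The sketch below is my own small helper (not part of MIGRATE): it parses a row-major connection string such as '''* 0 * *''' and verifies that every population has at least one migration link to another population.

```python
def every_population_connected(matrix_string):
    """Check the rule of thumb: each population needs at least one connection.

    matrix_string is a row-major connection matrix such as '* 0 * *'
    for two populations (diagonal entries are the population sizes).
    """
    cells = matrix_string.split()
    n = int(len(cells) ** 0.5)
    rows = [cells[i * n:(i + 1) * n] for i in range(n)]
    for p in range(n):
        # Any non-zero off-diagonal entry in row p or column p connects p.
        in_row = any(rows[p][j] != '0' for j in range(n) if j != p)
        in_col = any(rows[i][p] != '0' for i in range(n) if i != p)
        if not (in_row or in_col):
            return False
    return True

every_population_connected('* 0 * *')   # model 2: runnable
every_population_connected('* 0 0 *')   # isolated populations: not runnable
```

Running this on the two strings from the text flags '''* 0 0 *''' as invalid, matching the cautionary note.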
* Copy the parmfile to parmfile.3aparam, copy outfile to outfile.3aparam, and copy outfile.pdf to outfile.3aparam.pdf <br />
<pre style="font-size: 1.25em"><br />
cp parmfile parmfile.3aparam<br />
cp outfile.pdf outfile.3aparam.pdf<br />
</pre><br />
<br />
== Report marginal likelihoods of Model 3 (3 Parameter) ==<br />
<br />
* Run model 3 using the same procedure as for model 2. The string for the migration connection matrix is '''* * 0 *'''. Write parmfile, run, report. My run took 151 seconds and the log marginal likelihoods were -4863.08, and -4794.53. <br />
* Copy the parmfile to parmfile.3bparam, copy outfile to outfile.3bparam, and copy outfile.pdf to outfile.3bparam.pdf <br />
* Once more than half of the class has reached this point we will talk about the marginal likelihoods found.<br />
<br />
== Compare models ==<br />
<br />
* How do we calculate Bayes factors? In Table 3 I summarize all log marginal likelihoods, ln(mL). Bayes factors are calculated in different ways in the literature; here, I report the natural log Bayes factors where<br />
[[File:formula1.png|center|400px]]<br />
* Using the guidelines of Kass and Raftery (1995), values smaller than -2 suggest a preference for 'model 2', and values larger than 2 suggest a preference for 'model 1'. We can use the log marginal likelihoods or the BF to order the models (see column Choice in Table 3). <br />
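As a minimal sketch (the LBF here is simply the difference of the natural log marginal likelihoods, as in the formula above), comparing model 2 against model 1 with the Bezier scores reported earlier:

```python
def ln_bayes_factor(ln_ml_model_a, ln_ml_model_b):
    # Natural log Bayes factor: ln BF = ln mL(model a) - ln mL(model b)
    return ln_ml_model_a - ln_ml_model_b

# Bezier-approximated scores from the tutorial runs: model 2 vs model 1
lbf = ln_bayes_factor(-4860.58, -4862.85)
# lbf is about 2.27 > 2, suggesting a preference for model 2
```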
<br />
We can also calculate the model probability. It is obtained by dividing each marginal likelihood by the sum of the marginal likelihoods of all models considered:<br />
<br />
[[File:formula2.png|center|300px]] <br />
<br />
* Note that the above formula uses the marginal likelihoods, *not* the *log* marginal likelihoods (which is what the program reports). The calculation of model probabilities from the reported log likelihoods is easy with computer programs that have variable precision (for example Maple or Mathematica). Calculations on a desk calculator often fail; for example, the likelihood of model 1 is a remarkably small number because the likelihood is exp(-4862.85) = 1.233 x 10<sup>-2112</sup>; my emulated HP-15C calculator delivers 0.0000. But you can calculate the above quantities using this recipe: (1) find the largest log likelihood (-4860.58), (2) subtract that number from each log likelihood in the list (result: -2.27, 0.0, -2.5, -26.67), (3) exponentiate each element in the new list (result: 0.1033, 1.0, 0.0821, 2.6144 x 10<sup>-12</sup>), (4) sum up all elements in the list (0.1033+1.0+0.0821+2.6144 x 10<sup>-12</sup>); this is the denominator (1.1854). (5) Now divide each element in the list by that sum; the numbers will look like the ones in the last column of Table 3.<br />
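The five-step recipe above is the standard trick of subtracting the maximum before exponentiating, and it is easy to script. Here is a minimal sketch in Python (my own helper, not part of MIGRATE), applied to the Bezier-approximated scores from this tutorial:

```python
import math

def model_probabilities(log_marginal_likelihoods):
    # (1) find the largest log likelihood
    largest = max(log_marginal_likelihoods)
    # (2) subtract it from each value, then (3) exponentiate;
    # this avoids the underflow that defeats a desk calculator
    weights = [math.exp(x - largest) for x in log_marginal_likelihoods]
    # (4) the sum of the list is the denominator
    total = sum(weights)
    # (5) normalize
    return [w / total for w in weights]

# Bezier log marginal likelihoods for models 1-4 from the runs above
probs = model_probabilities([-4862.85, -4860.58, -4863.08, -4887.25])
# model 2 receives the bulk of the probability (about 0.84)
```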
<br />
[[File:table3.png|center|900px]]<br />
<br />
The harmonic mean fails to recover the true model for my example run and orders the models differently. Often the variance of the harmonic mean estimator (when running the same model multiple times) is large. It is therefore not reliable, although this may depend on the dataset. As a consequence, I suggest using the thermodynamic integration method if the extra cost in runtime is acceptable. Looking at the model probabilities we can see that the “true” model has considerably higher support than the full model or the model that suggests the wrong direction of gene flow.<br />
<br />
=== Summary of results ===<br />
The best model (the one with the highest marginal likelihood) is model 2 (custom-migration={*0**}) if we use the thermodynamic approximation of the marginal likelihood. It has been shown several times now (Beerli and Palczewski 2010, Xie et al. 2011) that the harmonic mean estimator is not a good estimator and may be misleading, preferring the more complex model (as in our example). The picture below is a sample from the class tutorial done on August 1st 2011.<br />
[[File:Migrate_exercise.png|center|1000px]]<br />
<br />
=== References === <br />
* Beerli, P. and M. Palczewski. 2010. Unified framework to evaluate panmixia and migration direction among multiple sampling locations. Genetics 185: 313–326.<br />
* Kass, R. E. and A. E. Raftery. 1995. Bayes factors. Journal of the American Statistical Association 90(430): 773–795.<br />
* Xie, W., P. O. Lewis, Y. Fan, L. Kuo, and M.-H. Chen. 2011 Improving marginal likelihood estimation for Bayesian phylogenetic model selection, Systematic Biology, 60: 150–160.</div>Jsukumaranhttps://molevol.mbl.edu/index.php?title=Migrate_tutorial&diff=2362Migrate tutorial2012-07-29T22:24:39Z<p>Jsukumaran: </p>
<hr />
<div><br />
== Tutorial Overview ==<br />
=== How to get MIGRATE ===<br />
Locally, the most recent copy of the software is in the /class/shared/migrate_demo/local_distribution directory on the server. <br />
You can either use the program that is already installed on the class server or use these files in the local_distribution directory to<br />
install MIGRATE on your own computer. If this fails, download MIGRATE from [http://popgen.sc.fsu.edu/Migrate/Download.html the Migrate download website].<br />
<br />
=== Comparison of gene flow models using Bayes Factors with MIGRATE ===<br />
[[File:migrate2logo.png|right|300px]] <br />
Most readers are familiar with the concept of likelihood ratio tests or Akaike's<br />
information criterion for model comparison. This tutorial describes how to compare models using Bayes factors.<br />
These allow comparing nested and non-nested models without assuming normality or large samples. Bayes factors<br />
are ratios of marginal likelihoods. In contrast to maximum likelihood, the marginal likelihood is the integral<br />
of the likelihood function over the complete parameter range. MIGRATE can calculate such marginal likelihoods<br />
for a particular migration model (Beerli and Palczewski 2010). This tutorial steps through all necessary<br />
program runs to calculate Bayes factors for comparing different gene flow models. We need to do the following: <br />
<br />
*Decide on the models that are interesting for a comparison. The method does not work well for a fishing expedition where one would try to evaluate all models; this is feasible only for a small number of populations. It will be possible to enumerate all models for three populations, but more will be very daunting.<br />
<br />
* Run each model through MIGRATE. Use the same prior settings for each of them because the prior distribution has some influence on the Bayes factors. Use the heating menu to allow for at least four chains. The menu supplies a shortcut, #, to specify the temperatures. It generates temperatures that are spaced in a particular way: the inverses of the temperatures are regularly spaced on the interval 0 to 1. For example, 4 chains with temperatures 1.0, 1.5, 3.0, and 100,000.0 give the inverse-temperature spacing 1.0, 0.666, 0.333, and 0.0.<br />
<br />
* Compare the marginal likelihoods of the different runs, calculate the Bayes factors, and calculate the probability of each model.<br />
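The temperature scheme described above can be reproduced with a short sketch (my own illustration, not MIGRATE's actual implementation; I assume here that the hottest chain's inverse temperature of 0 is mapped to a very large temperature such as 100,000):

```python
def static_heating_temperatures(n_chains, hottest=100000.0):
    # Inverse temperatures (1/T) regularly spaced on [0, 1], coldest first.
    inv_temps = [1.0 - i / (n_chains - 1) for i in range(n_chains)]
    # Invert; an inverse temperature of 0 corresponds to the hottest chain.
    return [1.0 / b if b > 0 else hottest for b in inv_temps]

temperatures = static_heating_temperatures(4)
# approximately [1.0, 1.5, 3.0, 100000.0], matching the example in the text
```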
<br />
The following pages detail all steps using a small example. We use a simulated dataset that was generated<br />
using parameters that force a direction of migration from the population Aadorf (A) to the population Bern<br />
(B). The Bern population is 10x larger than the Aadorf population and no individual from Bern ever goes to<br />
Aadorf, but Bern receives about 1 migrant per generation from Aadorf. The dataset name is '''twoswisstowns'''<br />
([http://popgen.sc.fsu.edu/tutorials/BF_migrate_tutorial/twoswisstowns download here])<br />
We will evaluate 4 models: <br />
# a full model with two population sizes and two migration rates (from A to B and from B to A);<br />
# a model with two population sizes and one migration rate to Bern;<br />
# a model with two population sizes and one migration rate to Aadorf;<br />
# a model where Aadorf and Bern are part of the same panmictic population.<br />
We know the truth, therefore we have some prejudice about the ranking of the models: '''model 2''' should be<br />
''best'', and '''model 1''', because it allows for the same migration direction as '''model 2''', should be ranked<br />
''second''. Whether '''model 3''' is better than '''model 4''' is unknown a priori and may depend on the<br />
strength of the data. First we need to figure out how to run the dataset efficiently in MIGRATE. For that we<br />
pick the most complicated '''model 1''' and experiment with run conditions until we are satisfied that the run<br />
converges and delivers posterior distributions that look acceptable.<br />
Here are detailed instructions on how to rank population genetics models for a particular dataset.<br />
<br />
<br />
== Familiarize with MIGRATE [Tutorial Start] ==<br />
* Make a new directory and download or copy the datafile<br />
<pre><br />
mkdir migtut<br />
cd migtut<br />
wget http://popgen.sc.fsu.edu/tutorials/BF_migrate_tutorial/twoswisstowns<br />
</pre><br />
* Start the program MIGRATE-N (I will call it from now on simply MIGRATE). On the server use<br />
<pre style="font-size: 1.25em"><br />
migrate-n <br />
</pre><br />
On your laptop you may need to use "./migrate-n", if the program is in the same directory.<br />
The main menu will appear, looking like this <br />
<pre style="font-size: 1.25em"><br />
[pbeerli@class-02 migtut]$ migrate-n<br />
=============================================<br />
MIGRATION RATE AND POPULATION SIZE ESTIMATION<br />
using Markov Chain Monte Carlo simulation<br />
=============================================<br />
PDF output enabled [Letter-size]<br />
Version 3.3.0 <br />
Program started at Sat Jul 28 10:23:26 2012<br />
<br />
<br />
Settings for this run:<br />
D Data type currently set to: DNA sequence model <br />
I Input/Output formats<br />
P Parameters [start, migration model]<br />
S Search strategy<br />
W Write a parmfile<br />
Q Quit the program<br />
<br />
<br />
To change the settings type the letter for the menu to change<br />
Start the program with typing Yes or Y<br />
===> <br />
<br />
</pre><br />
* Go to the '''Input/Output formats''' menu (press '''I'''), in the '''INPUT''' section change the Datafile name to '''twoswisstowns''', Return to the main menu. <br />
<br />
* In the '''Search strategy''' menu check whether the strategy is set to '''Bayesian inference'''; if it is set to maximum likelihood, change it. Change the '''Number of recorded steps in chain''' to '''1000'''. Do not worry about priors or other runtime options for the moment. Return to the main menu.<br />
<br />
* Save the changes by using the menu item '''Write a parmfile'''. This will write a file named '''parmfile'''.<br />
<br />
* Now run the program ('''pressing Y''' will start the run if you are in the main menu). For this dataset the runtime will be very short. On a modern computer this will take under a minute. If this takes more than 1 minute, something is not set up correctly! On the server this takes about ''5.2 seconds''.<br />
<br />
* The program writes considerable information to the screen during the run. Most interesting are the acceptance ratio for the genealogy and the autocorrelations of the parameters and the genealogy. The default acceptance method for parameters is slice sampling (you can find a [http://people.sc.fsu.edu/~pbeerli/mbl2011_migrate_parallel.pdf talk] discussing this); the acceptance ratio of slice sampling is always 1.0. If the autocorrelation is high and the effective sample size is low (<500), then a longer run may be needed. If the prior boundaries are too tight, you will see that the reported values are either very close to or exactly at the upper prior boundary; in these cases you need to extend the prior range. See prior problems, but for this dataset we will have no such problems.<br />
* Look at the '''outfile.pdf'''; you will need to transfer the PDF file to your computer and use Acrobat or another PDF viewer. In the output file, look at the figures labeled Bayesian analysis: Posterior distribution; you will see histograms similar to the ones in Figure 1. We expect single peaks where the shading of the histogram shows one dark block in the center (50% credibility set), two light gray bars indicating the extent of the 95% credibility set, and two lighter gray bars indicating the 99% credibility set.<br />
[[File:figure1.png|right|500px]]<br />
* In your investigation of Figure 1 you will recognize that the histogram does not look very smooth because our run was too short. Now restart MIGRATE and, in the strategy menu, change the number of recorded steps in chain from 1,000 to 10,000. This will lengthen the run by a factor of 10. Don't forget to write the parmfile to save the settings. Run and compare the results (Figure 2) with the histogram from before. You will recognize that the longer run has somewhat smoother histograms, and the double peaks vanish (hopefully). With your own data you may want to do another round of refinements; eventually, by comparing the medians and means of the parameters in the table and the shape of the histograms, you should see good agreement on similar values. If the modes of the different runs are not within the 50% credibility intervals you certainly need to run longer. <br />
[[File:figure2.png|right|500px]]<br />
<br />
== Report marginal likelihoods of Model 1 (4 Parameter) ==<br />
<br />
* Let's assume that our runs are all satisfactory. We turn now to the best estimation of the marginal likelihood to compare models. Because we want to use the thermodynamic integration method, we need to turn on heating. Start MIGRATE, use the strategy menu, and turn on heating; use static heating. MIGRATE will tell you what to do next: you will need to enter 4 chains sampling at every tenth (10) interval using the temperature scheme that is suggested with the character #. Save the parmfile and run. This will take about 4x longer than before. It should give a better posterior distribution histogram, and a full table of (natural) log marginal likelihoods is shown towards the end of the outfile.pdf. On my computer this takes about 5 minutes.<br />
<br />
[[File:Marginaltable.png|right|300px]]<br />
* Come to the front and write down the log marginal likelihood into the spreadsheet (Columns labeled ). You will need the numbers from the row labeled All; the table has three columns, and you should report the values for the Bezier approximation and the harmonic mean method. (This was our first model; we will compare the different models at the end of this exercise. My log marginal likelihood values for the Bezier approximated score and the harmonic mean are -4862.85 and -4791.29, respectively.)<br />
<br />
* To save what we have done so far, copy the parmfile to parmfile.4param, copy outfile to outfile.4param, and copy outfile.pdf to outfile.4param.pdf. MIGRATE allows you to specify file names in the menu, but copying in the terminal is fast and also allows us to use the '''parmfile''' as a template for the other models, so that we do not need to change the run-length and heating parameters again.<br />
<br />
<pre style="font-size: 1.25em"><br />
cp parmfile parmfile.4param<br />
cp outfile.pdf outfile.4param.pdf<br />
</pre><br />
<br />
== Report marginal likelihoods of Model 4!! (1 Parameter) ==<br />
<br />
* We start now to work on those other models. We pick the easiest first: '''model 4'''.<br />
[[file:relabel_menu.png|right|300px]]<br />
* Start MIGRATE and choose the menu '''Parameter settings'''. Choose the entry about '''sampling locations'''. We want to use the data as if we had sampled a single population; therefore we need to declare that the two locations Aadorf and Bern belong to the same panmictic population. MIGRATE's default is to assume that every location is an individual population. The dialog will first ask how many locations are in the dataset (for our example we have '''2'''). After that you will need to assign the locations to a population. For this model we need to assign each location to the same population, so you need to enter '''1 1''' (one, space, one). With multiple populations more complicated settings are possible. Run MIGRATE and check the histogram; if it looks OK, come to the front and write down the log marginal likelihoods (again the row labeled All, Bezier and Harmonic score) into the spreadsheet under model 4. My run took 183 seconds and delivered these log marginal likelihoods: -4887.25 and -4803.22.<br />
<br />
* Copy the parmfile to parmfile.1param, copy outfile to outfile.1param, and copy outfile.pdf to outfile.1param.pdf<br />
<br />
<pre style="font-size: 1.25em"><br />
cp parmfile parmfile.1param<br />
cp outfile.pdf outfile.1param.pdf<br />
</pre><br />
<br />
== Report marginal likelihoods of Model 2 (3 Parameter) ==<br />
[[File:figure3.png|right|300px]]<br />
* Now you need to consider the models with unidirectional migration.<br />
<br />
<pre style="font-size: 1.25em"><br />
cp parmfile.4param parmfile<br />
</pre><br />
<br />
* Start MIGRATE and choose the parameter menu. Choose the entry labeled Model is set to. MIGRATE will now show a dizzying list of options, <b><span style="color:red; background:teal">don't panic</span></b>, we will only use a few of them. MIGRATE will ask you how many populations are used: enter 2. A 2-population model can have 4 parameters: two population sizes and two migration rates. Before you enter values, please read this whole paragraph. A * or x means that that particular parameter will be estimated; a zero means that that particular parameter will not be estimated (is not used). Our goal is to set one of the migration parameters to zero. We start with model 2 (Figure 3). MIGRATE needs to know how all connections between the populations are to be treated, and we also need to give instructions on how the program will treat the population sizes. Because we want to estimate both population sizes and one migration rate, we will use the * and a zero for the unused migration rate. The connection matrix is square, so we can label it as shown in the first table below. <br />
[[File:table1.png|center|500px]]<br />
* MIGRATE now asks you to input each row. This can be done either by specifying '''* 0''' (see second table), pressing return, and then entering the next line '''* *''' followed by return (second row in second table), or by entering the whole matrix at once as '''* 0 * *'''.<br />
[[File:table2.png|center|300px]]<br />
<br />
Exit the parameter menu, write the parmfile, run MIGRATE, check the histogram, and report the log marginal likelihoods. My run took 156 seconds and delivered these log marginal likelihoods: -4860.58, -4795.88. Cautionary note: if you use this tutorial for your own work, please recognize that a standard run in MIGRATE can only use migration models that allow drawing a complete genealogy; for example a model * 0 0 *, which has no migration among the populations, does not work out of the box. As a rule of thumb, each population must be connected to at least one other population. <br />
* Copy the parmfile to parmfile.3aparam, copy outfile to outfile.3aparam, and copy outfile.pdf to outfile.3aparam.pdf <br />
<pre style="font-size: 1.25em"><br />
cp parmfile parmfile.3aparam<br />
cp outfile.pdf outfile.3aparam.pdf<br />
</pre><br />
<br />
== Report marginal likelihoods of Model 3 (3 Parameter) ==<br />
<br />
* Run model 3 using the same procedure as for model 2. The string for the migration connection matrix is '''* * 0 *'''. Write parmfile, run, report. My run took 151 seconds and the log marginal likelihoods were -4863.08, and -4794.53. <br />
* Copy the parmfile to parmfile.3bparam, copy outfile to outfile.3bparam, and copy outfile.pdf to outfile.3bparam.pdf <br />
* Once more than half of the class has reached this point we will talk about the marginal likelihoods found.<br />
<br />
== Compare models ==<br />
<br />
* How do we calculate Bayes factors? In Table 3 I summarize all log marginal likelihoods, ln(mL). Bayes factors are calculated in different ways in the literature; here, I report the natural log Bayes factors where<br />
[[File:formula1.png|center|400px]]<br />
* Using the guidelines of Kass and Raftery (1995), values smaller than -2 suggest a preference for 'model 2', and values larger than 2 suggest a preference for 'model 1'. We can use the log marginal likelihoods or the BF to order the models (see column Choice in Table 3). We can also calculate the model probability. The above outline uses only log marginal likelihoods, but for these calculations we need marginal likelihoods. The model probability is calculated by dividing each marginal likelihood by the sum of the marginal likelihoods of all models considered:<br />
<br />
[[File:formula2.png|center|300px]] <br />
<br />
* The calculation of model probabilities from the reported log likelihoods is easy with computer programs that have variable precision (for example Maple or Mathematica). Calculations on a desk calculator often fail; for example, the likelihood of model 1 is a remarkably small number because the likelihood is exp(-4862.85) = 1.233 x 10<sup>-2112</sup>; my emulated HP-15C calculator delivers 0.0000. But you can calculate the above quantities using this recipe: (1) find the largest log likelihood (-4860.58), (2) subtract that number from each log likelihood in the list (result: -2.27, 0.0, -2.5, -26.67), (3) exponentiate each element in the new list (result: 0.1033, 1.0, 0.0821, 2.6144 x 10<sup>-12</sup>), (4) sum up all elements in the list (0.1033+1.0+0.0821+2.6144 x 10<sup>-12</sup>); this is the denominator (1.1854). (5) Now divide each element in the list by that sum; the numbers will look like the ones in the last column of Table 3.<br />
<br />
[[File:table3.png|center|900px]]<br />
<br />
The harmonic mean fails to recover the true model for my example run and orders the models differently. Often the variance of the harmonic mean estimator (when running the same model multiple times) is large. It is therefore not reliable, although this may depend on the dataset. As a consequence, I suggest using the thermodynamic integration method if the extra cost in runtime is acceptable. Looking at the model probabilities we can see that the “true” model has considerably higher support than the full model or the model that suggests the wrong direction of gene flow.<br />
<br />
=== Summary of results ===<br />
The best model (the one with the highest marginal likelihood) is model 2 (custom-migration={*0**}) if we use the thermodynamic approximation of the marginal likelihood. It has been shown several times now (Beerli and Palczewski 2010, Xie et al. 2011) that the harmonic mean estimator is not a good estimator and may be misleading, preferring the more complex model (as in our example). The picture below is a sample from the class tutorial done on August 1st 2011.<br />
[[File:Migrate_exercise.png|center|1000px]]<br />
<br />
=== References === <br />
* Beerli, P. and M. Palczewski. 2010. Unified framework to evaluate panmixia and migration direction among multiple sampling locations. Genetics 185: 313–326.<br />
* Kass, R. E. and A. E. Raftery. 1995. Bayes factors. Journal of the American Statistical Association 90(430): 773–795.<br />
* Xie, W., P. O. Lewis, Y. Fan, L. Kuo, and M.-H. Chen. 2011 Improving marginal likelihood estimation for Bayesian phylogenetic model selection, Systematic Biology, 60: 150–160.</div>Jsukumaranhttps://molevol.mbl.edu/index.php?title=John_Huelsenbeck&diff=2238John Huelsenbeck2012-07-26T23:05:17Z<p>Jsukumaran: </p>
<hr />
<div>Slides/Presentations:<br />
<br />
- [[Media:WoodsHole2012_1.pdf|Bayesian Phylogenetic Analysis, I]]<br />
- [[Media:WoodsHole2012_2.pdf|Bayesian Phylogenetic Analysis, II]]<br />
- [[Media:WoodsHoleHandout.pdf|Bayesian Phylogenetic Analysis, Supp.]]<br />
<br />
Lab Files:<br />
- [https://molevol.mbl.edu/wiki/index.php/MrBayes wiki page]<br />
- [[Media:WoodsHole2012_tutorial.zip]]<br />
- [[Media:MrBayes_tutorial_instructions.zip]]<br />
<br />
<br />
John's lab page: http://ib.berkeley.edu/people/lab_detail.php?lab=54<br />
<br />
Files Generated during the lecture/lab:<br />
*: [http://phylo.bio.ku.edu/woodshole/simwithdice.nex simwithdice.nex] - a NEXUS file with the data patterns that the class generated using 10-sided dice.<br />
<br />
*: [[Media:woodsHoleMcmc.zip|C++ MCMC estimation of coin's bias]] <br />
<br />
*: [http://phylo.bio.ku.edu/woodshole/coin_flip_mcmc.py coin_flip_mcmc.py] - a Python program that implements the MCMC for estimating a coin's bias based on a number of heads and tails observed. This is not the code written during the lecture by John. It is an equivalent version of John's C++ implementation written (by Mark Holder) using the Python programming language. If you download it, then you can run it (from a Terminal) with the command <code>python coin_flip_mcmc.py</code></div>Jsukumaranhttps://molevol.mbl.edu/index.php?title=File:WoodsHoleMcmc.zip&diff=2237File:WoodsHoleMcmc.zip2012-07-26T23:03:16Z<p>Jsukumaran: </p>
<hr />
<div></div>Jsukumaranhttps://molevol.mbl.edu/index.php?title=File:Woodshole2012_paml_lab.zip&diff=2183File:Woodshole2012 paml lab.zip2012-07-26T19:48:02Z<p>Jsukumaran: </p>
<hr />
<div></div>Jsukumaranhttps://molevol.mbl.edu/index.php?title=Jeet_Sukumaran&diff=2105Jeet Sukumaran2012-07-26T00:55:33Z<p>Jsukumaran: </p>
<hr />
<div>You can find out more about me [http://jeetworks.org here].<br />
<br />
If you like working with Python, you might find the [http://packages.python.org/DendroPy/ DendroPy phylogenetic computing library] useful, and I will be happy to help you get started with this.</div>Jsukumaranhttps://molevol.mbl.edu/index.php?title=Jeet_Sukumaran&diff=2103Jeet Sukumaran2012-07-26T00:53:45Z<p>Jsukumaran: </p>
<hr />
<div>You can find out more about me [http://jeetworks.org here].</div>Jsukumaranhttps://molevol.mbl.edu/index.php?title=Jeet_Sukumaran&diff=2102Jeet Sukumaran2012-07-26T00:53:31Z<p>Jsukumaran: </p>
<hr />
<div>You can find out more about me [http://jeetworks.org|here].</div>Jsukumaranhttps://molevol.mbl.edu/index.php?title=Jeet_Sukumaran&diff=2101Jeet Sukumaran2012-07-26T00:53:20Z<p>Jsukumaran: </p>
<hr />
<div>You can find out more about me (http://jeetworks.org|here).</div>Jsukumaranhttps://molevol.mbl.edu/index.php?title=Jeet_Sukumaran&diff=2100Jeet Sukumaran2012-07-26T00:53:10Z<p>Jsukumaran: Created page with "You can find out more about me (http://jeetworks.org/|here)."</p>
<hr />
<div>You can find out more about me (http://jeetworks.org/|here).</div>Jsukumaranhttps://molevol.mbl.edu/index.php?title=MrBayes&diff=2099MrBayes2012-07-26T00:15:44Z<p>Jsukumaran: </p>
<hr />
<div>==Getting datasets and instructions==<br />
This tutorial will contain multiple examples of MrBayes uses.<br />
<br />
The full tutorial files are available [https://molevol.mbl.edu/wiki/images/1/10/WoodsHole2012_tutorial.zip here] and the instructions are [https://molevol.mbl.edu/wiki/images/b/b5/MrBayes_tutorial_instructions.zip here]<br><br />
The instructions explain how to install MrBayes on your own computer, but we will use the cluster version for the lab exercise.<br><br />
<br />
==Datasets and Basic usage==<br />
Log on to the cluster.<br />
<br />
Download the lab files by using wget<br />
<pre><br />
wget https://molevol.mbl.edu/wiki/images/1/10/WoodsHole2012_tutorial.zip<br />
</pre><br />
and open the zip file by using<br />
<pre><br />
unzip WoodsHole2012_tutorial.zip<br />
</pre><br />
or download from [https://molevol.mbl.edu/wiki/images/1/10/WoodsHole2012_tutorial.zip here] and unzip on your laptop then scp to the cluster<br />
<br />
You can get all the instructions needed by using <br />
<pre><br />
wget https://molevol.mbl.edu/wiki/images/b/b5/MrBayes_tutorial_instructions.zip <br />
</pre><br />
and unzip in the same way or download from [https://molevol.mbl.edu/wiki/images/b/b5/MrBayes_tutorial_instructions.zip here], unzip and scp to cluster as needed<br><br />
In the file MrBayes_tutorial.txt we will start the lab from the section ANALYSIS PIPELINE (line 98)<br><br />
Open the text file on your computer and follow along as you require.<br><br />
<br />
MrBayes can be started by typing <br />
<pre><br />
mb<br />
</pre><br />
<br />
You can quit MrBayes by typing q and hitting enter<br />
<br />
==Batch files==<br />
A batch file is one that contains a series of commands for a program. Instead of typing commands individually you can run the batch file and it will automatically run all the commands contained within.<br><br />
This lab consists of individual batch files with MrBayes commands inside. There are 5 sections, each within its own subfolder of the WoodsHole2012_tutorial folder.<br><br />
Inside each of these subfolders is a batch file (conifer_partn.nex, conifer_rjMCMC.nex, conifer_anc.nex, conifer_clock.nex, conifer_dte.nex respectively). This batch file is run in MrBayes by typing<br />
<pre><br />
exec batchFilename<br />
</pre><br />
within the MrBayes program<br />
*NOTE: you need to be in the folder that contains the batch file for this to work. Thus, you must cd to the subdirectory before you open MrBayes<br />
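To give you an idea of the general shape before you open the tutorial's copies, a minimal MrBayes batch file looks roughly like this (the data file name, model, and run lengths below are hypothetical placeholders, not the tutorial's actual settings):

```
#NEXUS
begin mrbayes;
    set autoclose=yes nowarn=yes;
    execute mydata.nex;               [hypothetical data file]
    lset nst=6 rates=gamma;           [GTR+Gamma substitution model]
    mcmc ngen=100000 samplefreq=100;  [run length and sampling interval]
    sump;                             [summarize parameter samples]
    sumt;                             [summarize sampled trees]
end;
```

Running '''exec''' on such a file executes each command in order, exactly as if you had typed them at the MrBayes prompt.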
<br />
==Pre-made output==<br />
A lot of the analysis we will be performing today is very in-depth and can take a while to run. Sample output has been supplied for you in the output folder within each exercise's directory. <br><br />
In pairs, you can organize yourselves so that each person runs a different exercise; that way you can compare run times and outputs. Look through the batch files and the instructions file to see which commands are run.<br><br />
<br />
==Reading commands in batch files on UNIX command line==<br />
You can download the lab onto your laptop and read each batch file in a plain text editor as you would a normal text file.<br><br />
If you wish to read a file on the cluster without editing it in nano, you can use the following command<br />
<pre><br />
more batchFilename<br />
</pre><br />
You can use the spacebar to page down through the file and press q to exit. Long lines wrap at the edge of the screen, so if a command looks strange it may be a continuation of the line before it.<br><br />
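As a concrete sketch of paging through a file (the file name and contents here are hypothetical stand-ins, not the tutorial's actual batch files): <tt>less</tt> is one common alternative pager, and its <tt>-S</tt> option chops long lines instead of wrapping them, which avoids the run-on confusion described above.

```shell
# Create a small stand-in batch file (hypothetical contents) to page through.
printf 'begin mrbayes;\n  set autoclose=yes;\nend;\n' > example_batch.nex

# 'more' pages with the spacebar and quits with q; when its output is not a
# terminal it simply prints the file, which is also handy for a quick look.
more example_batch.nex

# 'less -S' chops long lines instead of wrapping them (arrow keys scroll).
command -v less >/dev/null && less -S example_batch.nex || true
```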
If you find the output from the more command confusing, ask a TA about alternative viewers on the cluster.</div>Jsukumaranhttps://molevol.mbl.edu/index.php?title=John_Huelsenbeck&diff=2076John Huelsenbeck2012-07-25T15:56:14Z<p>Jsukumaran: </p>
<hr />
<div>Slides/Presentations:<br />
<br />
- [[Media:WoodsHole2012_1.pdf|Bayesian Phylogenetic Analysis, I]]<br />
- [[Media:WoodsHole2012_2.pdf|Bayesian Phylogenetic Analysis, II]]<br />
- [[Media:WoodsHoleHandout.pdf|Bayesian Phylogenetic Analysis, Supp.]]<br />
<br />
Lab Files:<br />
<br />
- [[Media:WoodsHole2012_tutorial.zip]]<br />
<br />
<br />
John's lab page: http://ib.berkeley.edu/people/lab_detail.php?lab=54</div>Jsukumaranhttps://molevol.mbl.edu/index.php?title=Schedule&diff=2075Schedule2012-07-25T15:54:22Z<p>Jsukumaran: </p>
<hr />
<div>== First Week ==<br />
===Sunday, 22 July 2012: Arrival and '''opening reception''' in the Meigs Room, Swope Building 7pm - 9pm===<br />
===All morning sessions are in Rowe, afternoons will be in room G70 of the Loeb building ===<br />
{| border="1" width="100%"<br />
|- style="background:teal"<br />
| align="center" width="14.3%" | <br />
| align="center" width="14.3%" style="color:white" | '''Monday'''<br/>(23 July)<br />
| align="center" width="14.3%" style="color:white" | '''Tuesday'''<br/>(24 July)<br />
| align="center" width="14.3%" style="color:white" | '''Wednesday'''<br/>(25 July)<br />
| align="center" width="14.3%" style="color:white" | '''Thursday'''<br/>(26 July)<br />
| align="center" width="14.3%" style="color:white" | '''Friday'''<br/>(27 July)<br />
| align="center" width="14.3%" style="color:white" | '''Saturday'''<br/>(28 July)<br />
|-<br />
| align="center" width="14.3%" height="30" | 09:00-10:30<br />
| align="center" width="14.3%" | '''[[David Hillis|Hillis]]'''<br/>(Intro)<br />
| align="center" width="14.3%" | '''[[Paul Lewis|Lewis]]'''<br/>(Substitution models)<br />
| align="center" width="14.3%" | '''[[Paul Lewis|Lewis]]'''<br/>(Bayesian intro)<br />
| align="center" width="14.3%" | '''[[Bielawski|Bielawski]]'''<br/>(Codon models) <br />
| align="center" width="14.3%" | '''[[Peter Beerli|Beerli]]'''<br/>(Coalescence)<br />
| align="center" width="14.3%" | '''[[User:Felsenst|Felsenstein]]'''<br/>(Phylogenetics of quantitative characters)<br />
|-<br />
| align="center" width="14.3%" height="30" | 10:30-12:00<br />
| align="center" width="14.3%" | '''TAs'''<br/>([[Computer lab introduction|Computer lab intro]])<br>'''[[MarkHolder|Holder]]'''<br/> ([http://phylo.bio.ku.edu/woodshole/HolderMSALectureWoodsHole2012.pdf Alignment]) <br/> [[MSA_lab|Alignment Lab]]<br />
| align="center" width="14.3%" | '''[[Paul Lewis|Lewis]]'''<br/>(Likelihoods of trees)<br />
| align="center" width="14.3%" | '''[[Paul Lewis|Lewis]]'''<br/>(Bayesian intro)<br />
| align="center" width="14.3%" | '''[[Bielawski|Bielawski]]'''<br/>(Codon models) <br />
| align="center" width="14.3%" | '''[[Peter Beerli|Beerli]]'''<br/>(Coalescence)<br />
| align="center" width="14.3%" | '''[[User:Felsenst|Felsenstein]]'''<br/>(Historical Perspective of Phylogenetics)<br />
|-<br />
| align="center" width="14.3%" height="30" | 12:00-14:00<br />
| align="center" width="14.3%" colspan="6" style="background:gold" | Lunch break<br />
|-<br />
| align="center" width="14.3%" height="30" | 14:00-15:30<br />
| align="center" width="14.3%" | '''[[William_Pearson|Pearson]]'''<br/>(Sequence databases)<br />
| align="center" width="14.3%" | '''[[David Swofford|Swofford]]'''<br/>([http://phylo.bio.ku.edu/woodshole/swofford_WH2011_modsel.pdf Model selection])<br />
| align="center" width="14.3%" | '''[[John Huelsenbeck|Huelsenbeck]]'''<br/>(Bayesian Phylogenetic Analysis: [[Media:WoodsHole2012_1.pdf|I]], [[Media:WoodsHole2012_2.pdf|II]], [[Media:WoodsHoleHandout.pdf|Supp.]])<br />
| align="center" width="14.3%" | '''[[Derrick_Zwickl|Zwickl]]'''<br/>(Large scale ML inference) <br />
| align="center" width="14.3%" | '''[[Laura Kubatko|Kubatko]]'''<br/>(Species Tree Estimation)<br />
| align="center" width="14.3%" | '''[[Conor Meehan|Meehan]]'''<br/>(Evolution of HIV)<br />
|-<br />
| align="center" width="14.3%" height="30" | 15:30-17:00<br />
| align="center" width="14.3%" | '''[[William_Pearson|Pearson]]'''<br/>(Sequence databases)<br />
| align="center" width="14.3%" | '''[[David Swofford|Swofford]]'''<br/>([[PAUP*_Exercise|PAUP*]])<br />
| align="center" width="14.3%" | '''[[John Huelsenbeck|Huelsenbeck]]'''<br/>(Bayesian Phylogenetic Analysis: [[Media:WoodsHole2012_1.pdf|I]], [[Media:WoodsHole2012_2.pdf|II]], [[Media:WoodsHoleHandout.pdf|Supp.]])<br />
| align="center" width="14.3%" | '''[[Derrick_Zwickl|Zwickl]]'''<br/>([http://phylo.bio.ku.edu/slides/GarliDemo/garliExercise.html GARLI lab]) <br />
| align="center" width="14.3%" | '''[[Laura Kubatko|Kubatko]]'''<br/>(Species Tree Estimation)<br />
| align="center" width="14.3%" | '''[[Conor Meehan|Meehan]]'''<br/>(Microbiome Evolution)<br />
|-<br />
| align="center" width="14.3%" height="30" | 17:00-19:00<br />
| align="center" width="14.3%" colspan="6" style="background:gold" | Dinner break<br />
|-<br />
| align="center" width="14.3%" height="30" | 19:00-20:30<br />
| align="center" width="14.3%" | '''[[William_Pearson|Pearson]]'''<br/>([http://fasta.bioch.virginia.edu/mol_evol/ FASTA/BLAST lab])<br />
| align="center" width="14.3%" | '''[[MarkHolder|Holder]]'''<br/>([http://phylo.bio.ku.edu/woodshole/HolderTestingLectureWoodsHole2012.pdf Topology testing])<br />
| align="center" width="14.3%" | '''[[John Huelsenbeck|Huelsenbeck]]'''<br/>(MrBayes)<br />
| align="center" width="14.3%" | '''[[Bielawski|Bielawski]]'''<br/>([http://myweb.dal.ca/js551958/PAML_lab/lab.html PAML])<br />
| align="center" width="14.3%" | '''[[Tracy Heath|Heath]]'''<br/>(Divergence time estimation)<br />
| align="center" width="14.3%" | Course dinner/party<br/>Lobster Broil<br />
|-<br />
| align="center" width="14.3%" height="30" | 20:30-22:00<br />
| align="center" width="14.3%" | '''[[William_Pearson|Pearson]]'''<br/>([http://fasta.bioch.virginia.edu/mol_evol/ FASTA/BLAST lab])<br />
| align="center" width="14.3%" | '''[[MarkHolder|Holder]]'''<br/>([[ParametricBootstrappingLab|Branch support and tests]])<br />
| align="center" width="14.3%" | '''[[John Huelsenbeck|Huelsenbeck]]'''<br/>(MrBayes)<br />
| align="center" width="14.3%" | '''[[Bielawski|Bielawski]]'''<br/>([http://myweb.dal.ca/js551958/PAML_lab/lab.html PAML])<br />
| align="center" width="14.3%" | '''[[Tracy Heath|Heath]]'''<br/>(BEAST)<br />
| align="center" width="14.3%" | Course dinner/party<br/>Lobster Broil<br />
|-<br />
|}<br />
<br/><br />
<!-- <br />
################ Second Week #################<br />
--><br />
<br />
== Second Week ==<br />
<br />
{| border="1" width="57%"<br />
|- style="background:teal"<br />
| align="center" width="14.3%" | <br />
| align="center" width="14.3%" style="color:white" | '''Monday'''<br/>(30 July)<br />
| align="center" width="14.3%" style="color:white" | '''Tuesday'''<br/>(31 July)<br />
| align="center" width="14.3%" style="color:white" | '''Wednesday'''<br/>(1 August)<br />
|-<br />
| align="center" width="14.3%" height="30" | 09:00-10:30<br />
| align="center" width="14.3%" | '''[[Sara Sawyer|Sawyer]]'''<br/>(Applications of Molecular Evolution to Understanding Disease)<br />
| align="center" width="14.3%" | '''[[Scott Edwards|Edwards]]'''<br/>(Multilocus phylogeography and phylogenetics)<br />
| align="center" width="14.3%" | '''[[Harold Zakon|Zakon]]'''<br/>(Molecular Evolution of Sodium Channels)<br />
|-<br />
| align="center" width="14.3%" height="30" | 10:30-12:00<br />
| align="center" width="14.3%" | '''[[Sara Sawyer|Sawyer]]'''<br/>(Applications of Molecular Evolution to Understanding Disease)<br />
| align="center" width="14.3%" | '''[[Scott Edwards|Edwards]]'''<br/>(Multilocus phylogeography and phylogenetics)<br />
| align="center" width="14.3%" | '''[[Harold Zakon|Zakon]]'''<br/>(Molecular Evolution of Sodium Channels)<br />
|-<br />
| align="center" width="14.3%" height="30" | 12:00-14:00<br />
| align="center" width="14.3%" colspan="6" style="background:gold" | Lunch break<br />
|-<br />
| align="center" width="14.3%" height="30" | 14:00-15:30<br />
| align="center" width="14.3%" | '''[[Casey Dunn|Dunn]]'''<br/>(Phylogenomics: applications)<br />
| align="center" width="14.3%" | '''[[Antonis Rokas|Rokas]]'''<br/>(Genome evolution)<br />
| align="center" width="14.3%" | <br />
|-<br />
| align="center" width="14.3%" height="30" | 15:30-17:00<br />
| align="center" width="14.3%" | '''[[Casey Dunn|Dunn]]'''<br/>(Phylogenomics: applications)<br />
| align="center" width="14.3%" | '''[[Antonis Rokas|Rokas]]'''<br/>(Genome evolution)<br />
| align="center" width="14.3%" | <br />
|-<br />
| align="center" width="14.3%" height="30" | 17:00-19:00<br />
| align="center" width="14.3%" colspan="6" style="background:gold" | Dinner break<br />
|-<br />
| align="center" width="14.3%" height="30" | 19:00-20:30<br />
| align="center" width="14.3%" | '''[[Peter Beerli|Beerli]]''' <br/>([[Migrate tutorial]] and [[Lamarc tutorial]])<br/><br />
| align="center" width="14.3%" | '''[[Laura Kubatko|Kubatko]] and [[Scott Edwards|Edwards]]''' <br/>(Species Tree software)<br/><br />
| align="center" width="14.3%" | <br />
|-<br />
| align="center" width="14.3%" height="30" | 20:30-22:00<br />
| align="center" width="14.3%" | '''[[Peter Beerli|Beerli]]''' <br/>([[Migrate tutorial]] and [[Lamarc tutorial]]) <br/><br />
| align="center" width="14.3%" | '''[[Laura Kubatko|Kubatko]] and [[Scott Edwards|Edwards]]''' <br/>(Species Tree software)<br/><br />
| align="center" width="14.3%" | <br />
|-<br />
|}</div>Jsukumaranhttps://molevol.mbl.edu/index.php?title=Schedule&diff=2074Schedule2012-07-25T15:53:55Z<p>Jsukumaran: </p>
<hr />
<div>== First Week ==<br />
===Sunday, 22 July 2012: Arrival and '''opening reception''' in the Meigs Room, Swope Building 7pm - 9pm===<br />
===All morning sessions are in Rowe, afternoons will be in room G70 of the Loeb building ===<br />
{| border="1" width="100%"<br />
|- style="background:teal"<br />
| align="center" width="14.3%" | <br />
| align="center" width="14.3%" style="color:white" | '''Monday'''<br/>(23 July)<br />
| align="center" width="14.3%" style="color:white" | '''Tuesday'''<br/>(24 July)<br />
| align="center" width="14.3%" style="color:white" | '''Wednesday'''<br/>(25 July)<br />
| align="center" width="14.3%" style="color:white" | '''Thursday'''<br/>(26 July)<br />
| align="center" width="14.3%" style="color:white" | '''Friday'''<br/>(27 July)<br />
| align="center" width="14.3%" style="color:white" | '''Saturday'''<br/>(28 July)<br />
|-<br />
| align="center" width="14.3%" height="30" | 09:00-10:30<br />
| align="center" width="14.3%" | '''[[David Hillis|Hillis]]'''<br/>(Intro)<br />
| align="center" width="14.3%" | '''[[Paul Lewis|Lewis]]'''<br/>(Substitution models)<br />
| align="center" width="14.3%" | '''[[Paul Lewis|Lewis]]'''<br/>(Bayesian intro)<br />
| align="center" width="14.3%" | '''[[Bielawski|Bielawski]]'''<br/>(Codon models) <br />
| align="center" width="14.3%" | '''[[Peter Beerli|Beerli]]'''<br/>(Coalescence)<br />
| align="center" width="14.3%" | '''[[User:Felsenst|Felsenstein]]'''<br/>(Phylogenetics of quantitative characters)<br />
|-<br />
| align="center" width="14.3%" height="30" | 10:30-12:00<br />
| align="center" width="14.3%" | '''TAs'''<br/>([[Computer lab introduction|Computer lab intro]])<br>'''[[MarkHolder|Holder]]'''<br/> ([http://phylo.bio.ku.edu/woodshole/HolderMSALectureWoodsHole2012.pdf Alignment]) <br/> [[MSA_lab|Alignment Lab]]<br />
| align="center" width="14.3%" | '''[[Paul Lewis|Lewis]]'''<br/>(Likelihoods of trees)<br />
| align="center" width="14.3%" | '''[[Paul Lewis|Lewis]]'''<br/>(Bayesian intro)<br />
| align="center" width="14.3%" | '''[[Bielawski|Bielawski]]'''<br/>(Codon models) <br />
| align="center" width="14.3%" | '''[[Peter Beerli|Beerli]]'''<br/>(Coalescence)<br />
| align="center" width="14.3%" | '''[[User:Felsenst|Felsenstein]]'''<br/>(Historical Perspective of Phylogenetics)<br />
|-<br />
| align="center" width="14.3%" height="30" | 12:00-14:00<br />
| align="center" width="14.3%" colspan="6" style="background:gold" | Lunch break<br />
|-<br />
| align="center" width="14.3%" height="30" | 14:00-15:30<br />
| align="center" width="14.3%" | '''[[William_Pearson|Pearson]]'''<br/>(Sequence databases)<br />
| align="center" width="14.3%" | '''[[David Swofford|Swofford]]'''<br/>([http://phylo.bio.ku.edu/woodshole/swofford_WH2011_modsel.pdf Model selection])<br />
| align="center" width="14.3%" | '''[[John Huelsenbeck|Huelsenbeck]]'''<br/>(Bayesian Phylogenetic Analysis: [[Media:WoodsHole2012_1.pdf|I]], [[Media:WoodsHole2012_2.pdf|II]], [[Media:WoodsHoleHandout.pdf|Supp.]])<br />
| align="center" width="14.3%" | '''[[Derrick_Zwickl|Zwickl]]'''<br/>(Large scale ML inference) <br />
| align="center" width="14.3%" | '''[[Laura Kubatko|Kubatko]]'''<br/>(Species Tree Estimation)<br />
| align="center" width="14.3%" | '''[[Conor Meehan|Meehan]]'''<br/>(Evolution of HIV)<br />
|-<br />
| align="center" width="14.3%" height="30" | 15:30-17:00<br />
| align="center" width="14.3%" | '''[[William_Pearson|Pearson]]'''<br/>(Sequence databases)<br />
| align="center" width="14.3%" | '''[[David Swofford|Swofford]]'''<br/>([[PAUP*_Exercise|PAUP*]])<br />
| align="center" width="14.3%" | '''[[John Huelsenbeck|Huelsenbeck]]'''<br/>(Bayesian Phylogenetic Analysis)<br />
| align="center" width="14.3%" | '''[[Derrick_Zwickl|Zwickl]]'''<br/>([http://phylo.bio.ku.edu/slides/GarliDemo/garliExercise.html GARLI lab]) <br />
| align="center" width="14.3%" | '''[[Laura Kubatko|Kubatko]]'''<br/>(Species Tree Estimation)<br />
| align="center" width="14.3%" | '''[[Conor Meehan|Meehan]]'''<br/>(Microbiome Evolution)<br />
|-<br />
| align="center" width="14.3%" height="30" | 17:00-19:00<br />
| align="center" width="14.3%" colspan="6" style="background:gold" | Dinner break<br />
|-<br />
| align="center" width="14.3%" height="30" | 19:00-20:30<br />
| align="center" width="14.3%" | '''[[William_Pearson|Pearson]]'''<br/>([http://fasta.bioch.virginia.edu/mol_evol/ FASTA/BLAST lab])<br />
| align="center" width="14.3%" | '''[[MarkHolder|Holder]]'''<br/>([http://phylo.bio.ku.edu/woodshole/HolderTestingLectureWoodsHole2012.pdf Topology testing])<br />
| align="center" width="14.3%" | '''[[John Huelsenbeck|Huelsenbeck]]'''<br/>(MrBayes)<br />
| align="center" width="14.3%" | '''[[Bielawski|Bielawski]]'''<br/>([http://myweb.dal.ca/js551958/PAML_lab/lab.html PAML])<br />
| align="center" width="14.3%" | '''[[Tracy Heath|Heath]]'''<br/>(Divergence time estimation)<br />
| align="center" width="14.3%" | Course dinner/party<br/>Lobster Broil<br />
|-<br />
| align="center" width="14.3%" height="30" | 20:30-22:00<br />
| align="center" width="14.3%" | '''[[William_Pearson|Pearson]]'''<br/>([http://fasta.bioch.virginia.edu/mol_evol/ FASTA/BLAST lab])<br />
| align="center" width="14.3%" | '''[[MarkHolder|Holder]]'''<br/>([[ParametricBootstrappingLab|Branch support and tests]])<br />
| align="center" width="14.3%" | '''[[John Huelsenbeck|Huelsenbeck]]'''<br/>(MrBayes)<br />
| align="center" width="14.3%" | '''[[Bielawski|Bielawski]]'''<br/>([http://myweb.dal.ca/js551958/PAML_lab/lab.html PAML])<br />
| align="center" width="14.3%" | '''[[Tracy Heath|Heath]]'''<br/>(BEAST)<br />
| align="center" width="14.3%" | Course dinner/party<br/>Lobster Broil<br />
|-<br />
|}<br />
<br/><br />
<!-- <br />
################ Second Week #################<br />
--><br />
<br />
== Second Week ==<br />
<br />
{| border="1" width="57%"<br />
|- style="background:teal"<br />
| align="center" width="14.3%" | <br />
| align="center" width="14.3%" style="color:white" | '''Monday'''<br/>(30 July)<br />
| align="center" width="14.3%" style="color:white" | '''Tuesday'''<br/>(31 July)<br />
| align="center" width="14.3%" style="color:white" | '''Wednesday'''<br/>(1 August)<br />
|-<br />
| align="center" width="14.3%" height="30" | 09:00-10:30<br />
| align="center" width="14.3%" | '''[[Sara Sawyer|Sawyer]]'''<br/>(Applications of Molecular Evolution to Understanding Disease)<br />
| align="center" width="14.3%" | '''[[Scott Edwards|Edwards]]'''<br/>(Multilocus phylogeography and phylogenetics)<br />
| align="center" width="14.3%" | '''[[Harold Zakon|Zakon]]'''<br/>(Molecular Evolution of Sodium Channels)<br />
|-<br />
| align="center" width="14.3%" height="30" | 10:30-12:00<br />
| align="center" width="14.3%" | '''[[Sara Sawyer|Sawyer]]'''<br/>(Applications of Molecular Evolution to Understanding Disease)<br />
| align="center" width="14.3%" | '''[[Scott Edwards|Edwards]]'''<br/>(Multilocus phylogeography and phylogenetics)<br />
| align="center" width="14.3%" | '''[[Harold Zakon|Zakon]]'''<br/>(Molecular Evolution of Sodium Channels)<br />
|-<br />
| align="center" width="14.3%" height="30" | 12:00-14:00<br />
| align="center" width="14.3%" colspan="6" style="background:gold" | Lunch break<br />
|-<br />
| align="center" width="14.3%" height="30" | 14:00-15:30<br />
| align="center" width="14.3%" | '''[[Casey Dunn|Dunn]]'''<br/>(Phylogenomics: applications)<br />
| align="center" width="14.3%" | '''[[Antonis Rokas|Rokas]]'''<br/>(Genome evolution)<br />
| align="center" width="14.3%" | <br />
|-<br />
| align="center" width="14.3%" height="30" | 15:30-17:00<br />
| align="center" width="14.3%" | '''[[Casey Dunn|Dunn]]'''<br/>(Phylogenomics: applications)<br />
| align="center" width="14.3%" | '''[[Antonis Rokas|Rokas]]'''<br/>(Genome evolution)<br />
| align="center" width="14.3%" | <br />
|-<br />
| align="center" width="14.3%" height="30" | 17:00-19:00<br />
| align="center" width="14.3%" colspan="6" style="background:gold" | Dinner break<br />
|-<br />
| align="center" width="14.3%" height="30" | 19:00-20:30<br />
| align="center" width="14.3%" | '''[[Peter Beerli|Beerli]]''' <br/>([[Migrate tutorial]] and [[Lamarc tutorial]])<br/><br />
| align="center" width="14.3%" | '''[[Laura Kubatko|Kubatko]] and [[Scott Edwards|Edwards]]''' <br/>(Species Tree software)<br/><br />
| align="center" width="14.3%" | <br />
|-<br />
| align="center" width="14.3%" height="30" | 20:30-22:00<br />
| align="center" width="14.3%" | '''[[Peter Beerli|Beerli]]''' <br/>([[Migrate tutorial]] and [[Lamarc tutorial]]) <br/><br />
| align="center" width="14.3%" | '''[[Laura Kubatko|Kubatko]] and [[Scott Edwards|Edwards]]''' <br/>(Species Tree software)<br/><br />
| align="center" width="14.3%" | <br />
|-<br />
|}</div>Jsukumaranhttps://molevol.mbl.edu/index.php?title=Schedule&diff=2073Schedule2012-07-25T15:50:55Z<p>Jsukumaran: </p>
<hr />
<div>== First Week ==<br />
===Sunday, 22 July 2012: Arrival and '''opening reception''' in the Meigs Room, Swope Building 7pm - 9pm===<br />
===All morning sessions are in Rowe, afternoons will be in room G70 of the Loeb building ===<br />
{| border="1" width="100%"<br />
|- style="background:teal"<br />
| align="center" width="14.3%" | <br />
| align="center" width="14.3%" style="color:white" | '''Monday'''<br/>(23 July)<br />
| align="center" width="14.3%" style="color:white" | '''Tuesday'''<br/>(24 July)<br />
| align="center" width="14.3%" style="color:white" | '''Wednesday'''<br/>(25 July)<br />
| align="center" width="14.3%" style="color:white" | '''Thursday'''<br/>(26 July)<br />
| align="center" width="14.3%" style="color:white" | '''Friday'''<br/>(27 July)<br />
| align="center" width="14.3%" style="color:white" | '''Saturday'''<br/>(28 July)<br />
|-<br />
| align="center" width="14.3%" height="30" | 09:00-10:30<br />
| align="center" width="14.3%" | '''[[David Hillis|Hillis]]'''<br/>(Intro)<br />
| align="center" width="14.3%" | '''[[Paul Lewis|Lewis]]'''<br/>(Substitution models)<br />
| align="center" width="14.3%" | '''[[Paul Lewis|Lewis]]'''<br/>(Bayesian intro)<br />
| align="center" width="14.3%" | '''[[Bielawski|Bielawski]]'''<br/>(Codon models) <br />
| align="center" width="14.3%" | '''[[Peter Beerli|Beerli]]'''<br/>(Coalescence)<br />
| align="center" width="14.3%" | '''[[User:Felsenst|Felsenstein]]'''<br/>(Phylogenetics of quantitative characters)<br />
|-<br />
| align="center" width="14.3%" height="30" | 10:30-12:00<br />
| align="center" width="14.3%" | '''TAs'''<br/>([[Computer lab introduction|Computer lab intro]])<br>'''[[MarkHolder|Holder]]'''<br/> ([http://phylo.bio.ku.edu/woodshole/HolderMSALectureWoodsHole2012.pdf Alignment]) <br/> [[MSA_lab|Alignment Lab]]<br />
| align="center" width="14.3%" | '''[[Paul Lewis|Lewis]]'''<br/>(Likelihoods of trees)<br />
| align="center" width="14.3%" | '''[[Paul Lewis|Lewis]]'''<br/>(Bayesian intro)<br />
| align="center" width="14.3%" | '''[[Bielawski|Bielawski]]'''<br/>(Codon models) <br />
| align="center" width="14.3%" | '''[[Peter Beerli|Beerli]]'''<br/>(Coalescence)<br />
| align="center" width="14.3%" | '''[[User:Felsenst|Felsenstein]]'''<br/>(Historical Perspective of Phylogenetics)<br />
|-<br />
| align="center" width="14.3%" height="30" | 12:00-14:00<br />
| align="center" width="14.3%" colspan="6" style="background:gold" | Lunch break<br />
|-<br />
| align="center" width="14.3%" height="30" | 14:00-15:30<br />
| align="center" width="14.3%" | '''[[William_Pearson|Pearson]]'''<br/>(Sequence databases)<br />
| align="center" width="14.3%" | '''[[David Swofford|Swofford]]'''<br/>([http://phylo.bio.ku.edu/woodshole/swofford_WH2011_modsel.pdf Model selection])<br />
| align="center" width="14.3%" | '''[[John Huelsenbeck|Huelsenbeck]]'''<br/>([[Media:WoodsHole2012_1.pdf|Bayesian Phylogenetic Analysis]])<br />
| align="center" width="14.3%" | '''[[Derrick_Zwickl|Zwickl]]'''<br/>(Large scale ML inference) <br />
| align="center" width="14.3%" | '''[[Laura Kubatko|Kubatko]]'''<br/>(Species Tree Estimation)<br />
| align="center" width="14.3%" | '''[[Conor Meehan|Meehan]]'''<br/>(Evolution of HIV)<br />
|-<br />
| align="center" width="14.3%" height="30" | 15:30-17:00<br />
| align="center" width="14.3%" | '''[[William_Pearson|Pearson]]'''<br/>(Sequence databases)<br />
| align="center" width="14.3%" | '''[[David Swofford|Swofford]]'''<br/>([[PAUP*_Exercise|PAUP*]])<br />
| align="center" width="14.3%" | '''[[John Huelsenbeck|Huelsenbeck]]'''<br/>(Bayesian Phylogenetic Analysis)<br />
| align="center" width="14.3%" | '''[[Derrick_Zwickl|Zwickl]]'''<br/>([http://phylo.bio.ku.edu/slides/GarliDemo/garliExercise.html GARLI lab]) <br />
| align="center" width="14.3%" | '''[[Laura Kubatko|Kubatko]]'''<br/>(Species Tree Estimation)<br />
| align="center" width="14.3%" | '''[[Conor Meehan|Meehan]]'''<br/>(Microbiome Evolution)<br />
|-<br />
| align="center" width="14.3%" height="30" | 17:00-19:00<br />
| align="center" width="14.3%" colspan="6" style="background:gold" | Dinner break<br />
|-<br />
| align="center" width="14.3%" height="30" | 19:00-20:30<br />
| align="center" width="14.3%" | '''[[William_Pearson|Pearson]]'''<br/>([http://fasta.bioch.virginia.edu/mol_evol/ FASTA/BLAST lab])<br />
| align="center" width="14.3%" | '''[[MarkHolder|Holder]]'''<br/>([http://phylo.bio.ku.edu/woodshole/HolderTestingLectureWoodsHole2012.pdf Topology testing])<br />
| align="center" width="14.3%" | '''[[John Huelsenbeck|Huelsenbeck]]'''<br/>(MrBayes)<br />
| align="center" width="14.3%" | '''[[Bielawski|Bielawski]]'''<br/>([http://myweb.dal.ca/js551958/PAML_lab/lab.html PAML])<br />
| align="center" width="14.3%" | '''[[Tracy Heath|Heath]]'''<br/>(Divergence time estimation)<br />
| align="center" width="14.3%" | Course dinner/party<br/>Lobster Broil<br />
|-<br />
| align="center" width="14.3%" height="30" | 20:30-22:00<br />
| align="center" width="14.3%" | '''[[William_Pearson|Pearson]]'''<br/>([http://fasta.bioch.virginia.edu/mol_evol/ FASTA/BLAST lab])<br />
| align="center" width="14.3%" | '''[[MarkHolder|Holder]]'''<br/>([[ParametricBootstrappingLab|Branch support and tests]])<br />
| align="center" width="14.3%" | '''[[John Huelsenbeck|Huelsenbeck]]'''<br/>(MrBayes)<br />
| align="center" width="14.3%" | '''[[Bielawski|Bielawski]]'''<br/>([http://myweb.dal.ca/js551958/PAML_lab/lab.html PAML])<br />
| align="center" width="14.3%" | '''[[Tracy Heath|Heath]]'''<br/>(BEAST)<br />
| align="center" width="14.3%" | Course dinner/party<br/>Lobster Broil<br />
|-<br />
|}<br />
<br/><br />
<!-- <br />
################ Second Week #################<br />
--><br />
<br />
== Second Week ==<br />
<br />
{| border="1" width="57%"<br />
|- style="background:teal"<br />
| align="center" width="14.3%" | <br />
| align="center" width="14.3%" style="color:white" | '''Monday'''<br/>(30 July)<br />
| align="center" width="14.3%" style="color:white" | '''Tuesday'''<br/>(31 July)<br />
| align="center" width="14.3%" style="color:white" | '''Wednesday'''<br/>(1 August)<br />
|-<br />
| align="center" width="14.3%" height="30" | 09:00-10:30<br />
| align="center" width="14.3%" | '''[[Sara Sawyer|Sawyer]]'''<br/>(Applications of Molecular Evolution to Understanding Disease)<br />
| align="center" width="14.3%" | '''[[Scott Edwards|Edwards]]'''<br/>(Multilocus phylogeography and phylogenetics)<br />
| align="center" width="14.3%" | '''[[Harold Zakon|Zakon]]'''<br/>(Molecular Evolution of Sodium Channels)<br />
|-<br />
| align="center" width="14.3%" height="30" | 10:30-12:00<br />
| align="center" width="14.3%" | '''[[Sara Sawyer|Sawyer]]'''<br/>(Applications of Molecular Evolution to Understanding Disease)<br />
| align="center" width="14.3%" | '''[[Scott Edwards|Edwards]]'''<br/>(Multilocus phylogeography and phylogenetics)<br />
| align="center" width="14.3%" | '''[[Harold Zakon|Zakon]]'''<br/>(Molecular Evolution of Sodium Channels)<br />
|-<br />
| align="center" width="14.3%" height="30" | 12:00-14:00<br />
| align="center" width="14.3%" colspan="6" style="background:gold" | Lunch break<br />
|-<br />
| align="center" width="14.3%" height="30" | 14:00-15:30<br />
| align="center" width="14.3%" | '''[[Casey Dunn|Dunn]]'''<br/>(Phylogenomics: applications)<br />
| align="center" width="14.3%" | '''[[Antonis Rokas|Rokas]]'''<br/>(Genome evolution)<br />
| align="center" width="14.3%" | <br />
|-<br />
| align="center" width="14.3%" height="30" | 15:30-17:00<br />
| align="center" width="14.3%" | '''[[Casey Dunn|Dunn]]'''<br/>(Phylogenomics: applications)<br />
| align="center" width="14.3%" | '''[[Antonis Rokas|Rokas]]'''<br/>(Genome evolution)<br />
| align="center" width="14.3%" | <br />
|-<br />
| align="center" width="14.3%" height="30" | 17:00-19:00<br />
| align="center" width="14.3%" colspan="6" style="background:gold" | Dinner break<br />
|-<br />
| align="center" width="14.3%" height="30" | 19:00-20:30<br />
| align="center" width="14.3%" | '''[[Peter Beerli|Beerli]]''' <br/>([[Migrate tutorial]] and [[Lamarc tutorial]])<br/><br />
| align="center" width="14.3%" | '''[[Laura Kubatko|Kubatko]] and [[Scott Edwards|Edwards]]''' <br/>(Species Tree software)<br/><br />
| align="center" width="14.3%" | <br />
|-<br />
| align="center" width="14.3%" height="30" | 20:30-22:00<br />
| align="center" width="14.3%" | '''[[Peter Beerli|Beerli]]''' <br/>([[Migrate tutorial]] and [[Lamarc tutorial]]) <br/><br />
| align="center" width="14.3%" | '''[[Laura Kubatko|Kubatko]] and [[Scott Edwards|Edwards]]''' <br/>(Species Tree software)<br/><br />
| align="center" width="14.3%" | <br />
|-<br />
|}</div>Jsukumaranhttps://molevol.mbl.edu/index.php?title=John_Huelsenbeck&diff=2072John Huelsenbeck2012-07-25T15:50:26Z<p>Jsukumaran: </p>
<hr />
<div>Slides/Presentations:<br />
<br />
- [[Media:WoodsHoleHandout.pdf|Bayesian Phylogenetic Analysis]]<br />
- [[Media:WoodsHole2012_1.pdf]]<br />
- [[Media:WoodsHole2012_2.pdf]]<br />
<br />
Lab Files:<br />
<br />
- [[Media:WoodsHole2012_tutorial.zip]]<br />
<br />
<br />
John's lab page: http://ib.berkeley.edu/people/lab_detail.php?lab=54</div>Jsukumaranhttps://molevol.mbl.edu/index.php?title=John_Huelsenbeck&diff=2071John Huelsenbeck2012-07-25T15:50:13Z<p>Jsukumaran: </p>
<hr />
<div>Slides/Presentations:<br />
<br />
- [[Media:WoodsHoleHandout.pdf Bayesian Phylogenetic Analysis]]<br />
- [[Media:WoodsHole2012_1.pdf]]<br />
- [[Media:WoodsHole2012_2.pdf]]<br />
<br />
Lab Files:<br />
<br />
- [[Media:WoodsHole2012_tutorial.zip]]<br />
<br />
<br />
John's lab page: http://ib.berkeley.edu/people/lab_detail.php?lab=54</div>Jsukumaranhttps://molevol.mbl.edu/index.php?title=Schedule&diff=2070Schedule2012-07-25T15:46:56Z<p>Jsukumaran: </p>
<hr />
<div>== First Week ==<br />
===Sunday, 22 July 2012: Arrival and '''opening reception''' in the Meigs Room, Swope Building 7pm - 9pm===<br />
===All morning sessions are in Rowe, afternoons will be in room G70 of the Loeb building ===<br />
{| border="1" width="100%"<br />
|- style="background:teal"<br />
| align="center" width="14.3%" | <br />
| align="center" width="14.3%" style="color:white" | '''Monday'''<br/>(23 July)<br />
| align="center" width="14.3%" style="color:white" | '''Tuesday'''<br/>(24 July)<br />
| align="center" width="14.3%" style="color:white" | '''Wednesday'''<br/>(25 July)<br />
| align="center" width="14.3%" style="color:white" | '''Thursday'''<br/>(26 July)<br />
| align="center" width="14.3%" style="color:white" | '''Friday'''<br/>(27 July)<br />
| align="center" width="14.3%" style="color:white" | '''Saturday'''<br/>(28 July)<br />
|-<br />
| align="center" width="14.3%" height="30" | 09:00-10:30<br />
| align="center" width="14.3%" | '''[[David Hillis|Hillis]]'''<br/>(Intro)<br />
| align="center" width="14.3%" | '''[[Paul Lewis|Lewis]]'''<br/>(Substitution models)<br />
| align="center" width="14.3%" | '''[[Paul Lewis|Lewis]]'''<br/>(Bayesian intro)<br />
| align="center" width="14.3%" | '''[[Bielawski|Bielawski]]'''<br/>(Codon models) <br />
| align="center" width="14.3%" | '''[[Peter Beerli|Beerli]]'''<br/>(Coalescence)<br />
| align="center" width="14.3%" | '''[[User:Felsenst|Felsenstein]]'''<br/>(Phylogenetics of quantitative characters)<br />
|-<br />
| align="center" width="14.3%" height="30" | 10:30-12:00<br />
| align="center" width="14.3%" | '''TAs'''<br/>([[Computer lab introduction|Computer lab intro]])<br>'''[[MarkHolder|Holder]]'''<br/> ([http://phylo.bio.ku.edu/woodshole/HolderMSALectureWoodsHole2012.pdf Alignment]) <br/> [[MSA_lab|Alignment Lab]]<br />
| align="center" width="14.3%" | '''[[Paul Lewis|Lewis]]'''<br/>(Likelihoods of trees)<br />
| align="center" width="14.3%" | '''[[Paul Lewis|Lewis]]'''<br/>(Bayesian intro)<br />
| align="center" width="14.3%" | '''[[Bielawski|Bielawski]]'''<br/>(Codon models) <br />
| align="center" width="14.3%" | '''[[Peter Beerli|Beerli]]'''<br/>(Coalescence)<br />
| align="center" width="14.3%" | '''[[User:Felsenst|Felsenstein]]'''<br/>(Historical Perspective of Phylogenetics)<br />
|-<br />
| align="center" width="14.3%" height="30" | 12:00-14:00<br />
| align="center" width="14.3%" colspan="6" style="background:gold" | Lunch break<br />
|-<br />
| align="center" width="14.3%" height="30" | 14:00-15:30<br />
| align="center" width="14.3%" | '''[[William_Pearson|Pearson]]'''<br/>(Sequence databases)<br />
| align="center" width="14.3%" | '''[[David Swofford|Swofford]]'''<br/>([http://phylo.bio.ku.edu/woodshole/swofford_WH2011_modsel.pdf Model selection])<br />
| align="center" width="14.3%" | '''[[John Huelsenbeck|Huelsenbeck]]'''<br/>([[Media:WoodsHole2012_1.pdf Bayesian Phylogenetic Analysis]])<br />
| align="center" width="14.3%" | '''[[Derrick_Zwickl|Zwickl]]'''<br/>(Large scale ML inference) <br />
| align="center" width="14.3%" | '''[[Laura Kubatko|Kubatko]]'''<br/>(Species Tree Estimation)<br />
| align="center" width="14.3%" | '''[[Conor Meehan|Meehan]]'''<br/>(Evolution of HIV)<br />
|-<br />
| align="center" width="14.3%" height="30" | 15:30-17:00<br />
| align="center" width="14.3%" | '''[[William_Pearson|Pearson]]'''<br/>(Sequence databases)<br />
| align="center" width="14.3%" | '''[[David Swofford|Swofford]]'''<br/>([[PAUP*_Exercise|PAUP*]])<br />
| align="center" width="14.3%" | '''[[John Huelsenbeck|Huelsenbeck]]'''<br/>(Bayesian Phylogenetic Analysis)<br />
| align="center" width="14.3%" | '''[[Derrick_Zwickl|Zwickl]]'''<br/>([http://phylo.bio.ku.edu/slides/GarliDemo/garliExercise.html GARLI lab]) <br />
| align="center" width="14.3%" | '''[[Laura Kubatko|Kubatko]]'''<br/>(Species Tree Estimation)<br />
| align="center" width="14.3%" | '''[[Conor Meehan|Meehan]]'''<br/>(Microbiome Evolution)<br />
|-<br />
| align="center" width="14.3%" height="30" | 17:00-19:00<br />
| align="center" width="14.3%" colspan="6" style="background:gold" | Dinner break<br />
|-<br />
| align="center" width="14.3%" height="30" | 19:00-20:30<br />
| align="center" width="14.3%" | '''[[William_Pearson|Pearson]]'''<br/>([http://fasta.bioch.virginia.edu/mol_evol/ FASTA/BLAST lab])<br />
| align="center" width="14.3%" | '''[[MarkHolder|Holder]]'''<br/>([http://phylo.bio.ku.edu/woodshole/HolderTestingLectureWoodsHole2012.pdf Topology testing])<br />
| align="center" width="14.3%" | '''[[John Huelsenbeck|Huelsenbeck]]'''<br/>(MrBayes)<br />
| align="center" width="14.3%" | '''[[Bielawski|Bielawski]]'''<br/>([http://myweb.dal.ca/js551958/PAML_lab/lab.html PAML])<br />
| align="center" width="14.3%" | '''[[Tracy Heath|Heath]]'''<br/>(Divergence time estimation)<br />
| align="center" width="14.3%" | Course dinner/party<br/>Lobster Broil<br />
|-<br />
| align="center" width="14.3%" height="30" | 20:30-22:00<br />
| align="center" width="14.3%" | '''[[William_Pearson|Pearson]]'''<br/>([http://fasta.bioch.virginia.edu/mol_evol/ FASTA/BLAST lab])<br />
| align="center" width="14.3%" | '''[[MarkHolder|Holder]]'''<br/>([[ParametricBootstrappingLab|Branch support and tests]])<br />
| align="center" width="14.3%" | '''[[John Huelsenbeck|Huelsenbeck]]'''<br/>(MrBayes)<br />
| align="center" width="14.3%" | '''[[Bielawski|Bielawski]]'''<br/>([http://myweb.dal.ca/js551958/PAML_lab/lab.html PAML])<br />
| align="center" width="14.3%" | '''[[Tracy Heath|Heath]]'''<br/>(BEAST)<br />
| align="center" width="14.3%" | Course dinner/party<br/>Lobster Broil<br />
|-<br />
|}<br />
<br/><br />
<!-- <br />
################ Second Week #################<br />
--><br />
<br />
== Second Week ==<br />
<br />
{| border="1" width="57%"<br />
|- style="background:teal"<br />
| align="center" width="14.3%" | <br />
| align="center" width="14.3%" style="color:white" | '''Monday'''<br/>(30 July)<br />
| align="center" width="14.3%" style="color:white" | '''Tuesday'''<br/>(31 July)<br />
| align="center" width="14.3%" style="color:white" | '''Wednesday'''<br/>(1 August)<br />
|-<br />
| align="center" width="14.3%" height="30" | 09:00-10:30<br />
| align="center" width="14.3%" | '''[[Sara Sawyer|Sawyer]]'''<br/>(Applications of Molecular Evolution to Understanding Disease)<br />
| align="center" width="14.3%" | '''[[Scott Edwards|Edwards]]'''<br/>(Multilocus phylogeography and phylogenetics)<br />
| align="center" width="14.3%" | '''[[Harold Zakon|Zakon]]'''<br/>(Molecular Evolution of Sodium Channels)<br />
|-<br />
| align="center" width="14.3%" height="30" | 10:30-12:00<br />
| align="center" width="14.3%" | '''[[Sara Sawyer|Sawyer]]'''<br/>(Applications of Molecular Evolution to Understanding Disease)<br />
| align="center" width="14.3%" | '''[[Scott Edwards|Edwards]]'''<br/>(Multilocus phylogeography and phylogenetics)<br />
| align="center" width="14.3%" | '''[[Harold Zakon|Zakon]]'''<br/>(Molecular Evolution of Sodium Channels)<br />
|-<br />
| align="center" width="14.3%" height="30" | 12:00-14:00<br />
| align="center" width="14.3%" colspan="6" style="background:gold" | Lunch break<br />
|-<br />
| align="center" width="14.3%" height="30" | 14:00-15:30<br />
| align="center" width="14.3%" | '''[[Casey Dunn|Dunn]]'''<br/>(Phylogenomics: applications)<br />
| align="center" width="14.3%" | '''[[Antonis Rokas|Rokas]]'''<br/>(Genome evolution)<br />
| align="center" width="14.3%" | <br />
|-<br />
| align="center" width="14.3%" height="30" | 15:30-17:00<br />
| align="center" width="14.3%" | '''[[Casey Dunn|Dunn]]'''<br/>(Phylogenomics: applications)<br />
| align="center" width="14.3%" | '''[[Antonis Rokas|Rokas]]'''<br/>(Genome evolution)<br />
| align="center" width="14.3%" | <br />
|-<br />
| align="center" width="14.3%" height="30" | 17:00-19:00<br />
| align="center" width="14.3%" colspan="6" style="background:gold" | Dinner break<br />
|-<br />
| align="center" width="14.3%" height="30" | 19:00-20:30<br />
| align="center" width="14.3%" | '''[[Peter Beerli|Beerli]]''' <br/>([[Migrate tutorial]] and [[Lamarc tutorial]])<br/><br />
| align="center" width="14.3%" | '''[[Laura Kubatko|Kubatko]] and [[Scott Edwards|Edwards]]''' <br/>(Species Tree software)<br/><br />
| align="center" width="14.3%" | <br />
|-<br />
| align="center" width="14.3%" height="30" | 20:30-22:00<br />
| align="center" width="14.3%" | '''[[Peter Beerli|Beerli]]''' <br/>([[Migrate tutorial]] and [[Lamarc tutorial]]) <br/><br />
| align="center" width="14.3%" | '''[[Laura Kubatko|Kubatko]] and [[Scott Edwards|Edwards]]''' <br/>(Species Tree software)<br/><br />
| align="center" width="14.3%" | <br />
|-<br />
|}</div>Jsukumaranhttps://molevol.mbl.edu/index.php?title=John_Huelsenbeck&diff=2069John Huelsenbeck2012-07-25T15:32:57Z<p>Jsukumaran: </p>
<hr />
<div>Slides/Presentations:<br />
<br />
- [[Media:WoodsHoleHandout.pdf]]<br />
- [[Media:WoodsHole2012_1.pdf]]<br />
- [[Media:WoodsHole2012_2.pdf]]<br />
<br />
Lab Files:<br />
<br />
- [[Media:WoodsHole2012_tutorial.zip]]<br />
<br />
<br />
John's lab page: http://ib.berkeley.edu/people/lab_detail.php?lab=54</div>Jsukumaranhttps://molevol.mbl.edu/index.php?title=File:WoodsHole2012_tutorial.zip&diff=2067File:WoodsHole2012 tutorial.zip2012-07-25T15:30:37Z<p>Jsukumaran: </p>
<hr />
<div></div>Jsukumaranhttps://molevol.mbl.edu/index.php?title=John_Huelsenbeck&diff=2066John Huelsenbeck2012-07-25T15:29:19Z<p>Jsukumaran: </p>
<hr />
<div>Slides/Presentations:<br />
<br />
- [[Media:WoodsHoleHandout.pdf]]<br />
- [[Media:WoodsHole2012_1.pdf]]<br />
- [[Media:WoodsHole2012_2.pdf]]<br />
<br />
John's lab page: http://ib.berkeley.edu/people/lab_detail.php?lab=54</div>Jsukumaranhttps://molevol.mbl.edu/index.php?title=John_Huelsenbeck&diff=2065John Huelsenbeck2012-07-25T15:28:57Z<p>Jsukumaran: </p>
<hr />
<div>Slides/Presentations:<br />
<br />
- [[WoodsHoleHandout|Media:WoodsHoleHandout.pdf]]<br />
- [[WoodsHole2012_1|Media:WoodsHole2012_1.pdf]]<br />
- [[WoodsHole2012_2|Media:WoodsHole2012_2.pdf]]<br />
<br />
John's lab page: http://ib.berkeley.edu/people/lab_detail.php?lab=54</div>Jsukumaranhttps://molevol.mbl.edu/index.php?title=File:WoodsHole2012_2.pdf&diff=2063File:WoodsHole2012 2.pdf2012-07-25T15:20:19Z<p>Jsukumaran: </p>
<hr />
<div></div>Jsukumaranhttps://molevol.mbl.edu/index.php?title=File:WoodsHole2012_1.pdf&diff=2062File:WoodsHole2012 1.pdf2012-07-25T15:19:13Z<p>Jsukumaran: </p>
<hr />
<div></div>Jsukumaranhttps://molevol.mbl.edu/index.php?title=File:WoodsHoleHandout.pdf&diff=2061File:WoodsHoleHandout.pdf2012-07-25T15:18:31Z<p>Jsukumaran: </p>
<hr />
<div></div>Jsukumaranhttps://molevol.mbl.edu/index.php?title=ParametricBootstrappingLab&diff=1997ParametricBootstrappingLab2012-07-24T14:33:14Z<p>Jsukumaran: </p>
<hr />
<div>= Pointers to other topology testing techniques =<br />
The lecture mentions several tests that we are not going to run through in detail in the lab.<br />
<br />
You will find <code>KHTest</code>, <code>SHTest</code>, and <code>AUTest</code> options in PAUP's <code>LScore</code> command.<br />
Using the <code>LScore</code> command is the easiest way to conduct these tests.<br />
Let us know if you have any questions about running these tests.<br />
<br />
You can run these tests and compare the results from those tests with your results from the parametric bootstrapping.<br />
<br />
Other software relevant to the testing lecture:<br />
* [http://www.atgc-montpellier.fr/phyml/ PhyML] aLRT and aBayes statistics, in particular.<br />
* [http://wwwkramer.in.tum.de/exelixis/software.html RAxML] Rapid bootstrapping<br />
* [http://www.is.titech.ac.jp/~shimo/prog/consel/ Consel] AU Test, SH Test, Weighted SH Test, KH Test, Bootstrap<br />
* [http://www.mathstat.dal.ca/tsusko/software.cgi Ed Susko's software page] (software from Susko, E. (2010) and Susko, E. (2006) papers is of particular relevance to the tree testing lecture).<br />
<br />
<br />
= Parametric Bootstrapping Lab =<br />
The goal of this lab exercise is to show you how to use Monte Carlo <br />
simulation to construct a null distribution for a phylogenetic hypothesis. <br />
This type of parametric bootstrapping is not a trivial analysis, so there are several steps (and we've included a script in case you want a more automated way of completing the analysis).<br />
<br />
The easiest way to run all of the steps in the lab (via a script) is to download [http://phylo.bio.ku.edu/slides/param_boot_lab.zip param_boot_lab.zip], uncompress it, <code>cd</code> into the unzipped directory, and run the <code>do_param_bootlab</code> script. From a UNIX prompt (preferably on the workshop's cluster), you can do all of this by typing: <br />
<pre>wget http://phylo.bio.ku.edu/slides/param_boot_lab.zip<br />
unzip param_boot_lab.zip<br />
cd param_boot_lab<br />
bash do_param_bootlab.sh<br />
</pre><br />
<br />
However, just running the script is probably not going to help you understand the steps. I recommend that you work through the lab until at least step 16 before you resort to running the script.<br />
<br />
== Background on the Dataset ==<br />
<br />
In this lab, we will use a dataset [http://phylo.bio.ku.edu/slides/data/algae.nex algae.nex].<br />
It contains 16S rRNA sequences for a cyanobacterium (''[http://eol.org/pages/916452/overview Anacystis]''), a chromophyte alga (''[http://eol.org/pages/901701/overview Olithodiscus]''), a euglenoid protist (''[http://eol.org/pages/11704/overview Euglena]''), and six green plants, including two green algae (''Chlorella'' and ''[http://eol.org/pages/11591/overview Chlamydomonas]''), a liverwort (''[http://eol.org/pages/73295/overview Marchantia]''), a monocotyledonous angiosperm ([http://eol.org/pages/108188/overview rice]) and a dicotyledonous angiosperm ([http://eol.org/pages/581050/overview tobacco]). <br />
<br />
This data set was used in a 1994 paper by Lockhart et al. to show how common models used in reconstructing phylogenies fail when confronted by convergence in nucleotide composition. The problem is that the common models assume stationarity of the substitution process, which leads to the assumption that base frequencies do not change across the tree. Thus, things can go wrong when the base frequencies do change from lineage to lineage, and things can go really wrong when unrelated groups tend to have similar base compositions.<br />
<br />
In this case, all of the species except ''Olithodiscus'' and ''Anacystis'' have chlorophyll ''a/b'' and are strongly suspected to be a monophyletic group. However, as you will see, ''Euglena'' has a strong tendency to group with the unrelated chromophyte ''Olithodiscus'' because of similarities in base composition. <br />
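This kind of compositional convergence is easy to check for directly by tabulating per-sequence base frequencies. A minimal Python sketch (the helper function and the example string are ours, not drawn from <code>algae.nex</code>):<br />
<br />
```python
from collections import Counter

def base_frequencies(seq):
    """Return the relative frequency of A, C, G, T in a sequence,
    ignoring gaps and ambiguity codes."""
    seq = seq.upper()
    counts = Counter(ch for ch in seq if ch in "ACGT")
    total = sum(counts.values())
    return {b: counts[b] / total for b in "ACGT"}

# Invented GC-rich fragment, just to show the calculation
freqs = base_frequencies("GGCCGCATGCGGCCTA")
print(freqs["G"] + freqs["C"])  # GC content -> 0.75
```
<br />
Running something like this over each sequence in the alignment would let you see whether ''Euglena'' and ''Olithodiscus'' really do share a similar base composition.<br />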
<br />
The complete reference to the Lockhart paper is<br />
[http://mbe.oxfordjournals.org/content/11/4/605.citation Lockhart, P. J., M. A. Steel, M. D. Hendy, and D. Penny. 1994. "Recovering evolutionary trees under a more realistic model of sequence evolution". ''Molecular Biology and Evolution'' '''11''': 605-612.]<br />
<br />
<br />
== Null hypothesis ==<br />
For the purpose of the lab we will use the null hypothesis that there is a branch separating ''Anacystis'' and ''Olithodiscus'' from the rest of the taxa.<br />
<br />
== Parametric Bootstrapping Background ==<br />
<br />
Because we have a limited amount of time in the computer lab, we will be using parsimony as our optimality criterion in this lab.<br />
The general principles of using Monte Carlo simulation to construct a null distribution apply to any optimality criterion, so you can adapt the lab to hypothesis testing using ML or distance-based approaches.<br />
<br />
Often, you will use parametric bootstrapping when the optimal tree disagrees <br />
with your null hypothesis. <br />
You would like to decide whether or not you should reject the null hypothesis.<br />
<br />
Parametric bootstrapping is useful for answering the question: <br />
"I have observed a particular difference between the score of the optimal tree <br />
and the score of the best tree that is compatible with the null hypothesis. <br />
'''Is this score difference larger than I would expect if the null were true?'''"<br />
<br />
The score difference can be a difference in log-likelihoods, a difference in <br />
parsimony scores, ''etc''.<br />
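Once a null distribution of such score differences is in hand (the simulation steps below produce one), the comparison itself takes only a few lines. A minimal sketch with invented numbers (the function name is ours):<br />
<br />
```python
def parametric_bootstrap_p(observed_diff, null_diffs):
    """Approximate p-value: the proportion of simulated score
    differences at least as extreme as the observed one."""
    extreme = sum(1 for d in null_diffs if d >= observed_diff)
    return extreme / len(null_diffs)

# Toy null distribution of parsimony-score differences (invented values)
null = [0, 1, 0, 2, 1, 0, 3, 1, 0, 2]
print(parametric_bootstrap_p(2, null))  # -> 0.3
```
<br />
With 1000 simulated replicates, this is exactly the fraction of replicates whose length difference meets or exceeds the one you observed on the real data.<br />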
<br />
== Lab Activity ==<br />
1. Download [http://phylo.bio.ku.edu/slides/data/algae.nex algae.nex]<br />
<br />
2. Find the score of the most parsimonious tree using PAUP. <br />
Because this is a small dataset, you can use the <pre>AllTrees fd=bar</pre> <br />
command to see all the scores of all the trees.<br />
:* How many most parsimonious trees are there? {{title|2|answer}}<br />
:* What is the parsimony score of the most parsimonious tree(s)? {{title|411 steps|answer}}<br />
<br />
3. Use the <pre>boot</pre> command to calculate the bootstrap proportions and display the majority-rule consensus tree from the bootstrapping analyses.<br />
<br />
<br />
<br />
4. Does the bootstrap tree display the chlorophyll ''a/b'' clade (a clade of all of the species except ''Anacystis'' and ''Olithodiscus'')? {{title|No|answer}}<br />
<br />
What is the bootstrap support for the split that represents our null hypothesis? If that split is not supported by the data, then it will not be on the majority-rule tree but it should be listed in the split frequency table (''Anacystis'' is taxon number 7 and ''Olithodiscus'' is taxon number 8, so look for a row in the table with a split pattern that groups taxa 7 and 8 in one group and all the other taxa in the other group: e.g., "......**"). {{title|Typically the bootstrap support is around 25 percent|answer}}<br />
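The dot/star pattern is just a mask over PAUP's taxon numbering: '*' for taxa on one side of the split, '.' for the rest. A small sketch of how such a pattern could be generated (our own helper, not a PAUP command):<br />
<br />
```python
def split_pattern(in_group, n_taxa):
    """Render a split as PAUP-style dots and stars:
    '*' for taxa in the group, '.' for everything else."""
    return "".join("*" if i in in_group else "." for i in range(1, n_taxa + 1))

# Anacystis (taxon 7) and Olithodiscus (taxon 8) together, 8 taxa total
print(split_pattern({7, 8}, 8))  # -> ......**
```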
<br />
You can use <pre>ShowTrees all</pre> to see all of the trees in memory (note that the <code>AllTrees</code> command scores every tree, but does not retain all of the trees in memory. It will only retain the best trees according to the criterion).<br />
<br />
5. Now we need to find the best scoring tree that is consistent with our null hypothesis. In other words, we want to find the best-scoring tree that has a clade of ''Euglena'', ''Chlorella'', ''Chlamydomonas'', ''Marchantia'', Rice, and Tobacco excluding ''Anacystis nidulans'' and ''Olithodiscus''. We can do this in a number of ways. <br />
<br />
We will tell PAUP to do a search for the best tree that satisfies a topological constraint. <br />
:* The first step is to write a file with a NEXUS trees block and a tree that contains only one branch. In a text editor, create a new file with the following contents:<br />
<pre><br />
#NEXUS<br />
begin trees ;<br />
Tree ab = [&U](Anacystis_nidulans,Olithodiscus,(Euglena, Chlorella, Chlamydomonas, Marchantia, Rice, Tobacco));<br />
end;<br />
</pre><br />
The name of the constraint tree can be any NEXUS word that you like; in this example <code>ab</code> will be the name of the constraint tree.<br />
When you are dealing with large numbers of taxa and complex constraints it is often helpful to construct the constraint tree in [http://mesquiteproject.org/mesquite/mesquite.html Mesquite], save it to a file, and then read it into PAUP using the <code>LoadConstr</code> command.<br />
<br />
:* Save the file as <code>abconstraint.tre</code> in the same directory as the <code>algae.nex</code> file.<br />
:* To get PAUP to read the constraint into memory, use the <pre>LoadConstr file=abconstraint.tre</pre> command.<br />
<br />
6. Use the <code>ShowConstr</code> command to see the constraint tree that you have defined and make sure that it is the constraint that you intended.<br />
<br />
<br />
7. Now we will conduct a branch-and-bound search that honors the constraint tree that we have just defined:<br />
<pre>Log start file = 'steps7-11.realdatalog.txt' ; <br />
BAndB enforce constraints=ab<br />
</pre><br />
The <code>enforce</code> keyword tells PAUP to only consider trees that meet a constraint, and the <code>constraints=ab</code> keyword tells PAUP which tree to use as a constraint. <br />
Note that you can also use constraints with the <code>HSearch</code> command of PAUP (and you will need to do this for bigger datasets). <br />
What is the parsimony score of the best tree compatible with the constraint? {{title|415 steps|answer}}<br />
<br />
8. Use the <code>ShowTrees</code> command to see the tree. Does it satisfy the constraint? (it should).<br />
<br />
9. Use the <code>SaveTrees file=bestconstrained.tre</code> to save this tree to a file in case you need it later.<br />
<br />
== Hypothesis testing ==<br />
<br />
As our test statistic, we will use the difference in parsimony score between the best (unconstrained) tree and the best tree that satisfies our null hypothesis. <br />
<br />
10. '''What is the value of the test statistic for our data?''' {{title|4|answer}}<br />
<br />
<br />
To interpret the value of the test statistic we need to have some idea of the range of values that would be produced ''if the null hypothesis were true''. <br />
This can be tricky.<br />
For one thing, there are lots of trees that are compatible with the null hypothesis.<br />
It seems logical to take the tree from the constrained search as the tree to represent the null hypothesis.<br />
After all, among all of the trees compatible with the constraint, it is the one that best explains the data (according to parsimony).<br />
Technically speaking this procedure of choosing a tree to represent the null does not guarantee that we are testing from the "least favorable conditions" as we should in hypothesis testing - but using this tree seems good enough, and it is practical.<br />
<br />
Even if we are satisfied with the choice of a tree that represents the best the null can do, we still need a way to find out what the distribution of the test statistic would be if this null were true.<br />
We will use Monte Carlo simulations for this.<br />
In particular we will use [http://tree.bio.ed.ac.uk/software/seqgen/ Seq-Gen] to generate a bunch of datasets that are compatible with the kind of data that we would see if the null were true. <br />
Then we will calculate the test statistic on each of them. This will give us a null distribution of the test statistic. <br />
We can compare our real data to that.<br />
<br />
=== Finding model parameters ===<br />
<br />
To simulate data, Seq-Gen needs a fully-specified model and a tree with branch lengths.<br />
We can use the tree that we found in the constrained search and the GTR+I+G model to get the necessary input.<br />
<br />
11. Assuming that you have not quit PAUP and the tree found in the constrained search is still in memory, then we can save it with branch lengths that maximize the likelihood under the GTR+I+G model:<br />
<pre><br />
Set crit = like;<br />
LSet nst=6 rmat=est basefreq=est rates = gamma shape = est pinv=est;<br />
LScore;<br />
SaveTrees file = model.tre brlens format=altnexus;<br />
</pre><br />
Make sure that you understand these commands (ask an instructor if you have questions).<br />
From the output of PAUP you should have the model parameter values for the simulation.<br />
<br />
12. Look at the tree file. You should see a newick string representing a tree with branch lengths. You can use a text editor to see the newick representation. Or you can use FigTree to see a graphical representation of the tree.<br />
<br />
<br />
<br />
13. Unfortunately, seq-gen does not understand NEXUS tree files. Cut just the newick tree (the parenthetical description of the tree) from the file to a new file called '''model.txt''' or you can download this version of [http://phylo.bio.ku.edu/slides/model.txt model.txt].<br />
If you are a UNIX geek, you do this by quitting paup and issuing the command:<br />
<pre>cat model.tre | grep PAUP_1 | awk '{print $5}' > model.txt</pre><br />
Non-geeks tend to prefer opening '''model.tre''', copying the newick string, and saving it to a '''step13.model.txt''' file.<br />
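If you would rather avoid the awk incantation, the same extraction can be sketched in Python (our own helper, assuming the tree line has the general shape of the <code>altnexus</code> output, i.e. <code>Tree name = [&U](...);</code>):<br />
<br />
```python
def extract_newick(nexus_text):
    """Pull the first newick string out of a NEXUS trees block:
    take the tree line's text from the first '(' through the ';'."""
    for line in nexus_text.splitlines():
        stripped = line.strip()
        if stripped.lower().startswith("tree ") and "(" in stripped:
            start = stripped.index("(")
            end = stripped.rindex(";")
            return stripped[start:end + 1]
    raise ValueError("no tree line found")

# Invented example in the style of PAUP's altnexus output
nexus = """#NEXUS
begin trees;
Tree PAUP_1 = [&U]((A:0.1,B:0.2):0.05,C:0.3);
end;"""
print(extract_newick(nexus))  # -> ((A:0.1,B:0.2):0.05,C:0.3);
```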
<br />
=== Invoking seq-gen ===<br />
<br />
Seq-Gen is installed on the workshop's cluster. If you are running the exercise on a machine that does not have seq-gen, you'll need to download [http://tree.bio.ed.ac.uk/software/seqgen/ Seq-Gen] and install it.<br />
<br />
To install, you'll need to drag the seq-gen executable to the directory that you are using for this lab (or add it to your <code>PATH</code> environmental variable that tells your shell where to find executables [[Computer_lab_introduction#Adding_a_directory_to_the_path| notes here]])<br />
<br />
14. To see the options for seq-gen use the command<br />
<pre>seq-gen -h</pre><br />
To make sure everything is working do a simple test run using the HKY model:<br />
<pre>seq-gen -mHKY model.txt</pre><br />
This should generate a dataset and print it to the screen.<br />
The simulation used its default parameter values for the HKY model. We'd like to use the parameters that we inferred from our real data (because the parameter values will affect the dataset-to-dataset variance, and hence the distribution of our test statistic). All commands are given to seq-gen as command line flags. The ones that we will use are:<br />
: <code>-mGTR</code> to specify the GTR model<br />
: <code>-a</code> preceding the shape parameter value<br />
: <code>-i</code> preceding the proportion of invariant sites<br />
: <code>-r</code> preceding the 6 instantaneous rates of the GTR matrix (PAUP infers the first five and fixes the last one to 1.0)<br />
: <code>-f</code> preceding the base frequencies<br />
: <code>-l920</code> to simulate 920 sites (the same length as our real dataset).<br />
: <code>-n1000</code> to generate 1000 datasets<br />
: <code>-on</code> to request output in the NEXUS format<br />
: <code>-xeachreppaup.nex</code> to tell it the name of a file with text input to be inserted between each dataset.<br />
Finally we'll want to redirect the output to a file using the '''>''' redirection operator (Remember that this will overwrite whatever filename you put after the '''>''' character!). <br />
<br />
15. Take a look at the '''eachreppaup.nex''' that is included in the lab. It should contain the following lines.<br />
<pre><br />
begin paup;<br />
execute step18.run.nex;<br />
end;<br />
</pre><br />
the <code>-xeachreppaup.nex</code> option to seq-gen will insert the contents of eachreppaup.nex in between each data set.<br />
<br />
By putting the correct commands in a file called '''step18.run.nex''' (see step 18 below), we control what PAUP does with each simulated dataset.<br />
<br />
16. Run seq-gen. The invocation should be something like the command below (but with the parameter estimates for this dataset filled in the appropriate spots):<br />
<pre>seq-gen -mGTR -a0.6 -i0.32 -r 0.6 2.1 0.3 0.2 5 1 -f 0.27 0.20 0.30 0.23 -l920 -n1000 -on -xeachreppaup.nex model.txt > simdata.nex<br />
</pre><br />
Use the parameter values that you got from PAUP's <code>LScore</code> to construct a similar command and run it.<br />
<br><br />
Note: In the Windows executable version of the program, the syntax of the command line is somewhat different. The rate matrix will be specified as <code>-r0.6,2.1,0.3,0.2,5,1 </code> and the base frequencies as <code>-f0.27,0.20,0.30,0.23</code>. You will also need to direct the treefile to seq-gen by inserting a '''<''' before the filename. In the case above, the end part of the command will read <code>-xeachreppaup.nex < model.txt > simdata.nex</code> <br><br />
<br />
17. Open '''simdata.nex''' in a text editor. Do you see the content of the '''eachreppaup.nex''' file?<br />
<br />
=== Running PAUP on the simulated data===<br />
<br />
Now we have 1000 datasets. How are we going to analyze them all? Fortunately we have the PAUP command <code>execute step18.run.nex</code> intercalated between each data set. <br />
If we put the commands that we want PAUP to execute in the '''step18.run.nex''' file then those commands will be executed for each dataset.<br />
<br />
What do we want to do for each dataset? We want to see the difference in score that we get between the true tree and the preferred (most-parsimonious) tree. This will give us a distribution of the amount of spurious support we could get for a tree because of sampling error (or even systematic error).<br />
<br />
18. Take a look at the '''step18.run.nex''' file. It should contain:<br />
<pre><br />
#NEXUS<br />
BAndB;<br />
[!****This is the best tree's score****]<br />
PScore;<br />
GetTrees file = model.tre;<br />
[!####This is the true tree's score####]<br />
PScore;<br />
</pre><br />
These commands find the "best" tree, score it, get the null model tree (the true tree for the simulations), and score it.<br />
We are using the output comments to make the output more visible.<br />
<br />
Note that if we wanted to make the test more powerful we could do a constrained search for each simulated replicate instead of just scoring the model tree.<br />
(This would result in shorter trees that are consistent with our null hypothesis, which would tend to make the values of the difference in length smaller.<br />
Smaller values for the length difference in our null distribution would mean that the observed value of the test statistic would be further out in the tail of the null distribution; thus we would get a smaller ''p'' value).<br />
In the PAUP commands given above, we just score the model tree. <br />
In effect we are changing the null from:<br />
: "the tree has the chlorophyll ''a/b'' group"<br />
to a more specific null: "the true tree is the tree stored in '''model.tre'''"<br />
<br />
19. Finally to run all of the analyses it is helpful to have a simple "master" paup file. See the step19.master.nex file:<br />
<pre><br />
Log start replace file=step19.sim.log;<br />
Set noQueryBeep noerrorBeep noWarnReset noWarnTree noWarnTSave;<br />
Execute simdata.nex;<br />
Log stop;<br />
</pre><br />
Save this file as '''master.nex''', then invoke PAUP using the <code>-n</code> flag so that it runs in non-interactive mode (and does not pester you with 1000 questions):<br />
<pre>paup -n master.nex</pre><br />
or you can launch a graphical version of PAUP and tell it to execute the '''master.nex''' file.<br />
After a few seconds, you should have completed the analysis of all 1000 datasets.<br />
<br />
= Summarizing the output =<br />
You really don't want to browse through 1000 analyses and perform subtraction (and certainly not when you could be at the Kidd after a long day).<br />
<br />
Summarize the output using the "easy way" below.<br />
If you want to see how you can do a lot of the calculation using pipes from the command line (and if you are working on a non-Windows machine), check out "the geeky way."<br />
<br />
==== The easy way ====<br />
I wrote a simple summarization script [http://phylo.bio.ku.edu/slides/lab6-Simulation/summarizePaupLengthDiffs.py summarizePaupLengthDiffs.py]. Make sure that it is in your current working directory.<br />
<br />
If you are running Windows, you may need to install [http://www.python.org/ Python] (any version that starts with '''2''' should work) if you don't have it.<br />
<br />
20. As long as you do not mind overwriting a file in this directory named '''step20.diffs.txt''', you can run the command:<br />
<pre> python summarizePaupLengthDiffs.py step19.sim.log > step20.diffs.txt</pre><br />
<br />
This should report critical values for the test statistic at a few significance levels.<br />
You should be able to open the file '''step20.diffs.txt''' in Excel if you want to see differences for any replicate.<br />
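The critical values the script reports can also be computed by hand: with ''m'' simulated replicates, the cutoff at significance level alpha is roughly the ceil(alpha * m)-th largest simulated difference. A minimal sketch under that assumption (our own helper, with an invented null distribution):<br />
<br />
```python
import math

def critical_value(null_diffs, alpha):
    """Smallest observed difference that would be significant at
    level alpha: the ceil(alpha * m)-th largest simulated value."""
    k = math.ceil(alpha * len(null_diffs))
    return sorted(null_diffs, reverse=True)[k - 1]

# With 1000 replicates and alpha = 0.05 this would be the 50th-largest
# difference; here we use an invented null of 100 values for brevity
null = list(range(100))
print(critical_value(null, 0.05))  # -> 95
```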
<br />
21. (optional) if you have the [http://www.r-project.org/ R] programming language installed then you should be able to download [http://phylo.bio.ku.edu/slides/lab6-Simulation/plot_diffs.R plot_diffs.R] and <br />
run it with a command line argument to produce a pdf that summarizes the parametric bootstrapping. Pass in the observed value of the test statistic to the R script. So, if the observed length difference (on the real data) was 4 then you would use the command:<br />
<pre>R --file=plot_diffs.R --args 4</pre><br />
to produce a file called '''null_distribution_pscore_diffs.pdf''' with a <br />
histogram and the approximate ''p'' value.<br />
<br />
<br />
= The end =<br />
Was the observed difference in the tail of the null distribution? Would you reject the null hypothesis?<br />
<br />
<br />
= postscript: An alternative (geeky) way of running step 20 =<br />
If you find it unsatisfying to run a pre-packaged script in step 20 to parse the output of the simulated data, you can try to write a quick parser.<br />
<br />
This is a step by step instruction of how to construct a simple workflow using UNIX "pipes". We are going to connect the output of one process (process = running program) to the input of another process. This is done with a "pipe" between the processes. From our shell, this is done with the <code>|</code> symbol.<br />
The command:<br />
<pre>cat step19.sim.log</pre><br />
spits out all 83,000 lines of the log file to the screen. Ugh!<br />
The command:<br />
<pre>cat step19.sim.log | grep "Length "</pre><br />
filters all of those lines so that only those with the word Length followed by a space are printed. This selects just the output from the PScore command that we did.<br />
Want to get just the scores of the first tree each time (without the word Length)? Try:<br />
<pre>cat step19.sim.log | grep "Length " | awk '{print $2}'</pre><br />
We are close. We now need to subtract the number in the second line from the first line; the number in the fourth line from the third line... This would give us the difference in score for each rep. I wrote a simple python script to do this. Save the script as [http://phylo.bio.ku.edu/slides/lab6-Simulation/consecutiveDiffs.py consecutiveDiffs.py] in the same directory that you have been working in.<br />
Now<br />
<pre>cat step19.sim.log | grep "Length " | awk '{print $2}' | python consecutiveDiffs.py</pre><br />
Should display the length differences.<br />
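We haven't shown the inside of '''consecutiveDiffs.py''', but the pairwise subtraction it performs can be sketched in a few lines (this is our own approximation of what such a script might do, not its actual source; the real script's sign convention may differ):<br />
<br />
```python
def consecutive_diffs(values):
    """Pair up the input values and subtract the second member of
    each pair from the first, as described above."""
    return [values[i] - values[i + 1] for i in range(0, len(values) - 1, 2)]

# Scores arrive in (best tree, true tree) pairs, one pair per replicate
scores = [411.0, 415.0, 409.0, 410.0]  # invented example values
print(consecutive_diffs(scores))  # -> [-4.0, -1.0]
```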
At this point (or really a couple of steps ago) you could take the data into Excel and find the critical value for the '''p=0.05''' level by looking for the 50th largest difference. We can finish the UNIX way by<br />
<pre>cat step19.sim.log | grep "Length " | awk '{print $2}' | python consecutiveDiffs.py | sort -g</pre><br />
to sort the values numerically.<br />
And finally:<br />
<pre>cat step19.sim.log | grep "Length " | awk '{print $2}' | python consecutiveDiffs.py | sort -g | tail -n50</pre><br />
to show the 50 most extreme length differences.</div>Jsukumaranhttps://molevol.mbl.edu/index.php?title=ParametricBootstrappingLab&diff=1996ParametricBootstrappingLab2012-07-24T14:31:45Z<p>Jsukumaran: </p>
<hr />
<div>= Pointers to other topology testing techniques =<br />
The lecture mentions several tests that we are not going to run through in detail in the lab.<br />
<br />
You will find <code>KHTest</code>, <code>SHTest</code>, and <code>AUTest</code> options in PAUP's <code>LScore</code> command.<br />
Using the <code>LScore</code> command is the easiest way to conduct these tests.<br />
Let us know if you have any questions about running these tests.<br />
<br />
You can run these tests and compare their results with your results from the parametric bootstrapping.<br />
<br />
Other software relevant to the testing lecture:<br />
* [http://www.atgc-montpellier.fr/phyml/ PhyML] aLRT and aBayes statistics, in particular.<br />
* [http://wwwkramer.in.tum.de/exelixis/software.html RAxML] Rapid bootstrapping<br />
* [http://www.is.titech.ac.jp/~shimo/prog/consel/ Consel] AU Test, SH Test, Weighted SH Test, KH Test, Bootstrap<br />
* [http://www.mathstat.dal.ca/tsusko/software.cgi Ed Susko's software page] (software from Susko, E. (2010) and Susko, E. (2006) papers is of particular relevance to the tree testing lecture).<br />
<br />
<br />
= Parametric Bootstrapping Lab =<br />
The goal of this lab exercise is to show you how to use Monte Carlo <br />
simulation to construct a null distribution for a phylogenetic hypothesis. <br />
This type of parametric bootstrapping is not a trivial analysis, so there are several steps (and we've included a script in case you want a more automated way of completing the analysis).<br />
<br />
The easiest way to run all of the steps in the lab (via a script) is to download [http://phylo.bio.ku.edu/slides/param_boot_lab.zip param_boot_lab.zip], uncompress it, <code>cd</code> into the unzipped directory, and run the <code>do_param_bootlab</code> script. From a UNIX prompt (preferably on the workshop's cluster), you can do all of this by typing: <br />
<pre>wget http://phylo.bio.ku.edu/slides/param_boot_lab.zip<br />
unzip param_boot_lab.zip<br />
cd param_boot_lab<br />
bash do_param_bootlab.sh<br />
</pre><br />
<br />
However, just running the script is probably not going to help you understand the steps. I recommend that you work through the lab until at least step 16 before you resort to running the script.<br />
<br />
== Background on the Dataset ==<br />
<br />
In this lab, we will use a dataset [http://phylo.bio.ku.edu/slides/data/algae.nex algae.nex].<br />
It contains 16S rRNA sequences for a cyanobacterium (''[http://eol.org/pages/916452/overview Anacystis]''), a chromophyte alga (''[http://eol.org/pages/901701/overview Olithodiscus]''), a euglenoid protist (''[http://eol.org/pages/11704/overview Euglena]''), and six green plants, including two green algae (''Chlorella'' and ''[http://eol.org/pages/11591/overview Chlamydomonas]''), a liverwort (''[http://eol.org/pages/73295/overview Marchantia]''), a monocotyledonous angiosperm ([http://eol.org/pages/108188/overview rice]) and a dicotyledonous angiosperm ([http://eol.org/pages/581050/overview tobacco]). <br />
<br />
This data set was used in a 1994 paper by Lockhart et al. to show how common models used in reconstructing phylogenies fail when confronted by convergence in nucleotide composition. The problem is that the common models assume stationarity of the substitution process, which leads to the assumption that base frequencies do not change across the tree. Thus, things can go wrong when the base frequencies do change from lineage to lineage, and things can go really wrong when unrelated groups tend to have similar base compositions.<br />
<br />
In this case, all of the species except ''Olithodiscus'' and ''Anacystis'' have chlorophyll ''a/b'' and are strongly suspected to be a monophyletic group. However, as you will see, ''Euglena'' has a strong tendency to group with the unrelated chromophyte ''Olithodiscus'' because of similarities in base composition. <br />
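As an aside, you can check for this kind of compositional heterogeneity yourself. The sketch below is not part of the lab materials; it assumes you have already read each aligned sequence into a Python string:<br />

```python
from collections import Counter

def base_freqs(seq):
    """Return the A/C/G/T frequencies of one sequence.

    Under the stationarity assumption, every taxon should have
    (roughly) the same base frequencies; large differences between
    taxa are a warning sign for the problem described above.
    """
    counts = Counter(base for base in seq.upper() if base in "ACGT")
    total = sum(counts.values())
    return {base: counts[base] / total for base in "ACGT"}
```

Running this on each sequence in '''algae.nex''' would let you see whether, for example, ''Euglena'' and ''Olithodiscus'' have converged on similar base compositions.<br />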
<br />
The complete reference to the Lockhart paper is<br />
[http://mbe.oxfordjournals.org/content/11/4/605.citation Lockhart, P. J., M. A. Steel, M. D. Hendy, and D. Penny. 1994. "Recovering evolutionary trees under a more realistic model of sequence evolution". ''Molecular Biology and Evolution'' '''11''': 605-612.]<br />
<br />
<br />
== Null hypothesis ==<br />
For the purpose of the lab we will use the null hypothesis that there is a branch separating ''Anacystis'' and ''Olithodiscus'' from the rest of the taxa.<br />
<br />
== Parametric Bootstrapping Background ==<br />
<br />
Because we have a limited amount of time in the computer lab, we will be using parsimony as our optimality criterion in this lab.<br />
The general principles of using Monte Carlo simulation to construct a null distribution apply to any optimality criterion, so you can adapt the lab to hypothesis testing using ML or distance-based approaches.<br />
<br />
Often, you will use parametric bootstrapping when the optimal tree disagrees <br />
with your null hypothesis. <br />
You would like to decide whether or not you should reject the null hypothesis.<br />
<br />
Parametric bootstrapping is useful for answering the question: <br />
"I have observed a particular difference between the score of the optimal tree <br />
and the score of the best tree that is compatible with the null hypothesis. <br />
'''Is this score difference larger than I would expect if the null were true?'''"<br />
<br />
The score difference can be a difference in log-likelihoods, a difference in <br />
parsimony scores, ''etc''.<br />
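In Python terms the test statistic is just a subtraction; a trivial sketch (the numbers in the test below are hypothetical, not the answers for this dataset):<br />

```python
def score_difference(unconstrained_score, constrained_score):
    """Test statistic: how many extra steps the best tree compatible
    with the null hypothesis requires, compared to the globally best
    tree.  The same subtraction works for (negative) log-likelihoods
    or distance-based scores."""
    return constrained_score - unconstrained_score
```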
<br />
== Lab Activity ==<br />
1. Download [http://phylo.bio.ku.edu/slides/data/algae.nex algae.nex]<br />
<br />
2. Find the score of the most parsimonious tree using PAUP. <br />
Because this is a small dataset, you can use the <pre>AllTrees fd=bar</pre> <br />
command to see the scores of all of the trees.<br />
:* How many most parsimonious trees are there? {{title|2|answer}}<br />
:* What is the parsimony score of the most parsimonious tree(s)? {{title|411 steps|answer}}<br />
<br />
3. Use the <pre>boot</pre> command to calculate the bootstrap proportions and display the majority-rule consensus tree from the bootstrapping analyses.<br />
<br />
<br />
<br />
4. Does the bootstrap tree display the chlorophyll ''a/b'' clade (a clade of all of the species except ''Anacystis'' and ''Olithodiscus'')? {{title|No|answer}}<br />
<br />
What is the bootstrap support for the split that represents our null hypothesis? If that split is not supported by the data, then it will not be on the majority-rule tree but it should be listed in the split frequency table (''Anacystis'' is taxon number 7 and ''Olithodiscus'' is taxon number 8, so look for a split pattern that groups taxa 7 and 8 in one group and all the other taxa in the other group: "''......**''"). {{title|Typically the bootstrap support is around 25 percent|answer}}<br />
<br />
You can use <pre>ShowTrees all</pre> to see all of the trees in memory (note that the <code>AllTrees</code> command scores every tree but does not retain all of the trees in memory; it will only retain the best trees according to the criterion).<br />
<br />
5. Now we need to find the best scoring tree that is consistent with our null hypothesis. In other words, we want to find the best-scoring tree that has a clade of ''Euglena'', ''Chlorella'', ''Chlamydomonas'', ''Marchantia'', Rice, and Tobacco excluding ''Anacystis nidulans'' and ''Olithodiscus''. We can do this in a number of ways. <br />
<br />
We will tell PAUP to do a search for the best tree that satisfies a topological constraint. <br />
:* The first step is to write a file with a NEXUS trees block and a tree that contains only one branch. In a text editor, create a new file with the following contents:<br />
<pre><br />
#NEXUS<br />
begin trees ;<br />
Tree ab = [&U](Anacystis_nidulans,Olithodiscus,(Euglena, Chlorella, Chlamydomonas, Marchantia, Rice, Tobacco));<br />
end;<br />
</pre><br />
The name of the constraint tree can be any NEXUS word that you like; in this example <code>ab</code> will be the name of the constraint tree.<br />
When you are dealing with large numbers of taxa and complex constraints it is often helpful to construct the constraint tree in [http://mesquiteproject.org/mesquite/mesquite.html Mesquite], save it to a file, and then read it into PAUP using the <code>LoadConstr</code> command.<br />
<br />
:* Save the file as <code>abconstraint.tre</code> in the same directory as <code>algae.nex</code>.<br />
:* To get PAUP to read the constraint into memory, use the <pre>LoadConstr file=abconstraint.tre</pre> command.<br />
<br />
6. Use the <code>ShowConstr</code> command to see the constraint tree that you have defined and make sure that it is the constraint that you intended.<br />
<br />
<br />
7. Now we will conduct a branch-and-bound search that honors the constraint tree that we have just defined:<br />
<pre>Log start file = 'steps7-11.realdatalog.txt' ; <br />
BAndB enforce constraints=ab<br />
</pre><br />
The <code>enforce</code> keyword tells PAUP to only consider trees that meet a constraint, and the <code>constraints=ab</code> keyword tells PAUP which tree to use as the constraint. <br />
Note that you can also use constraints with the <code>HSearch</code> command of PAUP (and you will need to do this for bigger datasets). <br />
What is the parsimony score of the best tree compatible with the constraint? {{title|415 steps (4 steps longer than the unconstrained best tree)|answer}}<br />
<br />
8. Use the <code>ShowTrees</code> command to see the tree. Does it satisfy the constraint? (it should).<br />
<br />
9. Use the <code>SaveTrees file=bestconstrained.tre</code> to save this tree to a file in case you need it later.<br />
<br />
== Hypothesis testing ==<br />
<br />
As our test statistic, we will use the difference in parsimony score between the best (unconstrained) tree and the best tree that satisfies our null hypothesis. <br />
<br />
10. '''What is the value of the test statistic for our data?''' {{title|4|answer}}<br />
<br />
<br />
To interpret the value of the test statistic we need some idea of the range of values that would be produced ''if the null hypothesis were true''. <br />
This can be tricky.<br />
For one thing, there are lots of trees that are compatible with the null hypothesis.<br />
It seems logical to take the tree from the constrained search as the tree to represent the null hypothesis.<br />
After all, among all of the trees compatible with the constraint, it is the one that best explains the data (according to parsimony).<br />
Technically speaking, this procedure of choosing a tree to represent the null does not guarantee that we are testing from the "least favorable conditions", as we should in hypothesis testing - but using this tree seems good enough, and it is practical.<br />
<br />
Even if we are satisfied about the choice of a tree that represents the best the null can do, we still have to have a way to find out what the distribution of the test statistic would be if this null were true.<br />
We will use Monte Carlo simulations for this.<br />
In particular we will use [http://tree.bio.ed.ac.uk/software/seqgen/ Seq-Gen] to generate a bunch of datasets that are compatible with the kind of data that we would see if the null were true. <br />
Then we will calculate the test statistic on each of them. This will give us a null distribution of the test statistic. <br />
We can compare our real data to that.<br />
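The comparison at the end of this procedure amounts to computing a one-tailed Monte Carlo ''p'' value. A sketch of that final step (the numbers in the test are made up):<br />

```python
def monte_carlo_p(observed, null_draws):
    """One-tailed p value: the proportion of simulated replicates whose
    test statistic is at least as large as the observed value.

    (Some authors add 1 to both numerator and denominator so the p
    value is never exactly zero; we keep the simplest form here.)"""
    extreme = sum(1 for draw in null_draws if draw >= observed)
    return extreme / len(null_draws)
```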
<br />
=== Finding model parameters ===<br />
<br />
To simulate data, Seq-Gen needs a fully-specified model and a tree with branch lengths.<br />
We can use the tree that we found in the constrained search and the GTR+I+G model to get the necessary input.<br />
<br />
11. Assuming that you have not quit PAUP and that the tree found in the constrained search is still in memory, we can save it with branch lengths that maximize the likelihood under the GTR+I+G model:<br />
<pre><br />
Set crit = like;<br />
LSet nst=6 rmat=est basefreq=est rates = gamma shape = est pinv=est;<br />
LScore;<br />
SaveTrees file = model.tre brlens format=altnexus;<br />
</pre><br />
Make sure that you understand these commands (ask an instructor if you have questions).<br />
From the output of PAUP you should have the model parameter values for the simulation.<br />
<br />
12. Look at the tree file. You should see a newick string representing a tree with branch lengths. You can use a text editor to see the newick representation. Or you can use FigTree to see a graphical representation of the tree.<br />
<br />
<br />
<br />
13. Unfortunately, seq-gen does not understand NEXUS tree files. Cut just the newick tree (the parenthetical description of the tree) from the file into a new file called '''model.txt''', or you can download this version of [http://phylo.bio.ku.edu/slides/model.txt model.txt].<br />
If you are a UNIX geek, you can do this by quitting PAUP and issuing the command:<br />
<pre>cat model.tre | grep PAUP_1 | awk '{print $5}' > model.txt</pre><br />
Non-geeks tend to prefer opening '''model.tre''', copying the newick string, and saving it to the '''model.txt''' file.<br />
<br />
=== Invoking seq-gen ===<br />
<br />
Seq-Gen is installed on the workshop's cluster. If you are running the exercise on a machine that does not have seq-gen, you'll need to download [http://tree.bio.ed.ac.uk/software/seqgen/ Seq-Gen] and install it.<br />
<br />
To install, you'll need to drag the seq-gen executable into the directory that you are using for this lab (or add its location to the <code>PATH</code> environment variable that tells your shell where to find executables; [[Computer_lab_introduction#Adding_a_directory_to_the_path| notes here]]).<br />
<br />
14. To see the options for seq-gen use the command<br />
<pre>seq-gen -h</pre><br />
To make sure everything is working do a simple test run using the HKY model:<br />
<pre>seq-gen -mHKY model.txt</pre><br />
This should generate a dataset and print it to the screen.<br />
The simulation used its default parameter values for the HKY model. We'd like to use the parameters that we inferred from our real data (because the parameter values will affect the dataset-to-dataset variance, and hence the distribution of our test statistic). All options are given to seq-gen as command-line flags. The ones that we will use are:<br />
: <code>-mGTR</code> to specify the GTR model<br />
: <code>-a</code> preceding the shape parameter value<br />
: <code>-i</code> preceding the proportion of invariant sites<br />
: <code>-r</code> preceding the 6 instantaneous rates of the GTR matrix (PAUP infers the first five and fixes the last one to 1.0)<br />
: <code>-f</code> preceding the base frequencies<br />
: <code>-l920</code> to simulate 920 sites (the same length as our real dataset).<br />
: <code>-n1000</code> to generate 1000 datasets<br />
: <code>-on</code> to request output in the NEXUS format<br />
: <code>-xeachreppaup.nex</code> to tell it the name of a file with text input to be inserted between each dataset.<br />
Finally, we'll want to redirect the output to a file using the '''>''' redirection operator (remember that this will overwrite whatever filename you put after the '''>''' character!). <br />
<br />
15. Take a look at the '''eachreppaup.nex''' that is included in the lab. It should contain the following lines.<br />
<pre><br />
begin paup;<br />
execute step18.run.nex;<br />
end;<br />
</pre><br />
The <code>-xeachreppaup.nex</code> option to seq-gen will insert the contents of '''eachreppaup.nex''' in between each data set.<br />
<br />
By putting the correct commands in the file that gets executed for each dataset ('''step18.run.nex''' in this lab; see step 18), we can make PAUP analyze every simulated replicate automatically.<br />
<br />
16. Run seq-gen. The invocation should be something like the command below (but with the parameter estimates for this dataset filled in the appropriate spots):<br />
<pre>seq-gen -mGTR -a0.6 -i0.32 -r 0.6 2.1 0.3 0.2 5 1 -f 0.27 0.20 0.30 0.23 -l920 -n1000 -on -xeachreppaup.nex model.txt > simdata.nex<br />
</pre><br />
Use the parameter values that you got from PAUP's <code>LScore</code> to construct a similar command and run it.<br />
<br><br />
Note: In the Windows executable version of the program, the syntax of the command line is somewhat different. The rate matrix will be specified as <code>-r0.6,2.1,0.3,0.2,5,1 </code> and the base frequencies as <code>-f0.27,0.20,0.30,0.23</code>. You will also need to direct the treefile to seq-gen by inserting a '''<''' before the filename. In the case above, the end part of the command will read <code>-xeachreppaup.nex < model.txt > simdata.nex</code> <br><br />
<br />
17. Open '''simdata.nex''' in a text editor. Do you see the content of the '''eachreppaup.nex''' file?<br />
<br />
=== Running PAUP on the simulated data===<br />
<br />
Now we have 1000 datasets. How are we going to analyze them all? Fortunately we have the PAUP command <code>execute step18.run.nex</code> intercalated between each data set. <br />
If we put the commands that we want PAUP to execute in the '''step18.run.nex''' file then those commands will be executed for each dataset.<br />
<br />
What do we want to do for each dataset? We want to see the difference in score between the true tree and the preferred (most-parsimonious) tree. This will give us a distribution of the amount of spurious support we could get for a tree because of sampling error (or even systematic error).<br />
<br />
18. Take a look at the '''step18.run.nex''' file. It should contain:<br />
<pre><br />
#NEXUS<br />
BAndB;<br />
[!****This is the best tree's score****]<br />
PScore;<br />
GetTrees file = model.tre;<br />
[!####This is the true tree's score####]<br />
PScore;<br />
</pre><br />
These commands find the "best" tree, score it, get the null model tree (the true tree for the simulations), and score it.<br />
We are using the output comments to make the output more visible.<br />
<br />
Note that if we wanted to make the test more powerful we could do a constrained search for each simulated replicate instead of just scoring the model tree.<br />
(This would result in shorter trees that are consistent with our null hypothesis, which would tend to make the values of the difference in length smaller.<br />
Smaller values for the length difference in our null distribution would mean that the observed value of the test statistic would be further out in the tail of the null distribution; thus we would get a smaller ''p'' value).<br />
In the PAUP commands given above, we just score the model tree. <br />
In effect we are changing the null from:<br />
: "the tree has the chlorophyll ''a/b'' group"<br />
to a more specific null: "the true tree is the tree stored in '''model.tre'''"<br />
<br />
19. Finally, to run all of the analyses it is helpful to have a simple "master" PAUP file. See the '''step19.master.nex''' file:<br />
<pre><br />
Log start replace file=step19.sim.log;<br />
Set noQueryBeep noerrorBeep noWarnReset noWarnTree noWarnTSave;<br />
Execute simdata.nex;<br />
Log stop;<br />
</pre><br />
Save this file as '''step19.master.nex''' and<br />
invoke PAUP using the <code>-n</code> flag so that it runs in non-interactive mode (and does not pester you with 1000 questions):<br />
<pre>paup -n step19.master.nex</pre><br />
or you can launch a graphical version of PAUP and tell it to execute the '''step19.master.nex''' file.<br />
After a few seconds, you should have completed the analysis of all 1000 datasets.<br />
<br />
= Summarizing the output =<br />
You really don't want to browse through 1000 analyses and perform subtraction (and certainly not when you could be at the Kidd after a long day).<br />
<br />
Summarize the output using the "easy way" below.<br />
If you want to see how you can do a lot of the calculation using pipes from the command line (and if you are working on a non-Windows machine), check out "the geeky way."<br />
<br />
==== The easy way ====<br />
I wrote a simple summarization script [http://phylo.bio.ku.edu/slides/lab6-Simulation/summarizePaupLengthDiffs.py summarizePaupLengthDiffs.py]. Make sure that it is in your current working directory.<br />
<br />
If you are running Windows, you may need to install [http://www.python.org/ Python] (any version that starts with '''2''' should work) if you don't have it.<br />
<br />
20. As long as you do not mind overwriting a file in this directory named '''step20.diffs.txt''', you can run the command:<br />
<pre> python summarizePaupLengthDiffs.py step19.sim.log > step20.diffs.txt</pre><br />
<br />
This should report critical values for the test statistic at a few significance levels.<br />
You should be able to open the file '''step20.diffs.txt''' in Excel if you want to see differences for any replicate.<br />
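If you would rather compute the critical value yourself from '''step20.diffs.txt''', the calculation is essentially the one sketched below (an illustration; the real summarizePaupLengthDiffs.py may differ in its details):<br />

```python
def critical_value(diffs, alpha=0.05):
    """Critical value at significance level alpha: with 1000
    replicates and alpha = 0.05 this is the 50th largest length
    difference in the null distribution."""
    k = max(1, int(alpha * len(diffs)))
    return sorted(diffs, reverse=True)[k - 1]
```

An observed test statistic larger than this critical value would lead you to reject the null hypothesis at that significance level.<br />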
<br />
21. (optional) If you have the [http://www.r-project.org/ R] programming language installed, then you should be able to download [http://phylo.bio.ku.edu/slides/lab6-Simulation/plot_diffs.R plot_diffs.R] and <br />
run it with a command-line argument to produce a PDF that summarizes the parametric bootstrapping. Pass in the observed value of the test statistic to the R script. So, if the observed length difference (on the real data) was 4, then you would use the command:<br />
<pre>R --file=plot_diffs.R --args 4</pre><br />
to produce a file called '''null_distribution_pscore_diffs.pdf''' with a <br />
histogram and the approximate ''p'' value.<br />
<br />
<br />
= The end =<br />
Was the observed difference in the tail of the null distribution? Would you reject the null hypothesis?<br />
<br />
<br />
= postscript: An alternative (geeky) way of running step 20 =<br />
If you find it unsatisfying to run a pre-packaged script in step 20 to parse the output of the simulated data, you can try to write a quick parser.<br />
<br />
This is a step by step instruction of how to construct a simple workflow using UNIX "pipes". We are going to connect the output of one process (process = running program) to the input of another process. This is done with a "pipe" between the processes. From our shell, this is done with the <code>|</code> symbol.<br />
The command:<br />
<pre>cat step19.sim.log</pre><br />
spits out all 83,000 lines of the log file to the screen. Ugh!<br />
The command:<br />
<pre>cat step19.sim.log | grep "Length "</pre><br />
filters all of those lines so that only those with the word Length followed by a space are printed. This selects just the output from the <code>PScore</code> commands that we ran.<br />
Want to get just the scores of the first tree each time (without the word Length)? Try:<br />
<pre>cat step19.sim.log | grep "Length " | awk '{print $2}'</pre><br />
We are close. We now need to subtract the number in the second line from the first line; the number in the fourth line from the third line... This would give us the difference in score for each rep. I wrote a simple python script to do this. Save the script as [http://phylo.bio.ku.edu/slides/lab6-Simulation/consecutiveDiffs.py consecutiveDiffs.py] in the same directory that you have been working in.<br />
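A minimal sketch of what such a script might look like (an illustration, not necessarily the actual contents of consecutiveDiffs.py; note that it takes each pair's second score minus the first, so the difference is non-negative when the true tree is longer than the best tree):<br />

```python
import sys

def consecutive_diffs(values):
    """Pair consecutive values (best-tree score, true-tree score) and
    return second - first for each pair, i.e. one length difference
    per simulation replicate."""
    pairs = zip(values[0::2], values[1::2])
    return [second - first for first, second in pairs]

if __name__ == "__main__":
    # Read one number per line from stdin, print one difference per pair.
    scores = [float(line) for line in sys.stdin if line.strip()]
    for diff in consecutive_diffs(scores):
        print(diff)
```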
Now<br />
<pre>cat step19.sim.log | grep "Length " | awk '{print $2}' | python consecutiveDiffs.py</pre><br />
should display the length differences.<br />
At this point (or really a couple of steps ago) you could take the data into Excel and find the critical value for the '''p=0.05''' level by looking for the 50th largest difference. We can finish the UNIX way by<br />
<pre>cat step19.sim.log | grep "Length " | awk '{print $2}' | python consecutiveDiffs.py | sort -g</pre><br />
to sort the values numerically.<br />
And finally:<br />
<pre>cat step19.sim.log | grep "Length " | awk '{print $2}' | python consecutiveDiffs.py | sort -g | tail -n50</pre><br />
to show the 50 most extreme length differences.</div>Jsukumaran