MrBayes/instruction

From MolEvol

#################################################################################### INSTALLING, COMPILING AND CONFIGURING MRBAYES #################################################################################### ************************* I. Obtaining the program 1. Binary (precompiled) verison The binary can be installed by downloading the Mac installer from this url: http://sourceforge.net/projects/mrbayes/files/latest/download?source=files Double click the installer and follow the prompts. 2. Building from the source code To compile the binary from source, you will need the gcc compiler. To confirm that you have it type: > which gcc This should result in the directory location of your current copy of gcc, if you have one installed. If not, install one from the Developer Tools CD that came with your computer. Most Unix systems will already have gcc installed. To obtain the source code, open the Terminal application (in the Application/Utilities directory) and type: > svn co https://mrbayes.svn.sourceforge.net/svnroot/mrbayes/trunk/src mrbayes You should now get a number of files downloaded to your directory, in a folder named ”mrbayes”. Alternatively, you can download the tarball here: http://sourceforge.net/projects/mrbayes/files/mrbayes/3.2.1/mrbayes-3.2.1.tar.gz/download Unzip the tarball and move the mrbayes directory to the desired location on your computer. Change to the mrbayes directory by typing: > cd mrbayes Create the Makefile, which contains the instructions for the compiler (the ”make” command), by typing: > autoconf Then type: > ./configure if you want the serial version, or type: > ./configure --enable-mpi=yes if you want the mpr version of the program that can take advantage of multiple processors/ cores. Note: The MPI-version must be compiled from source and will only run under Unix/Linux operating systems (inckuding Mac OS). If you wish to take advantage of the GPU processor for likelihood calculations, make sure you have (1) an NVIDIA graphics card, and (2) the CUDA driver available here: http://www.nvidia.com/object/cuda-mac-driver.html and the Beagle library avaialbel here: http://code.google.com/p/beagle-lib/ Note: if you will primarily be analyzing nucleotifde data under conventional 4by4 substitution models, the GPU verison will not confer a speed advantage. If you don't want to use the Beagle-enabled GPU version, type: > ./configure --enable-mpi=yes --with-beagle=no To make sure everything is tidy, thyep > make clean and then compile the program by typing: > make It will take a moment for the compiler to assemble the binary version of the program. ************************* II. Starting the program You can run the program by typing: > ./mb We're also going to put the mb binary in our path so that the Terminal knows where to look for the program. Open the Terminal application and type > vi .profile This opens a file called ".profile" in the vi editor. Once ".profile" is open, type > i to begin editing (you should see "—Insert—" at the bottom of the Terminal window). Now simply type > export PATH="$PATH:/Applications/[wherever your directory is located]/mrbayes_3.2.1/" to add mrbayes to your path, of course, changing '[wherever your directory is located]' to to the directory path where your mrbayes directory is located. When finished, press esc to stop editing and then close the ".profile" file by typing > :wq. Now you should be able to execute the program from any directory by simply typing: > mb #################################################################################### ANALYSIS PIPELINE #################################################################################### I. Overview This tutorial demonstrates several types of analyses that might be required for a typical empirical study. Specifically, we will demonstrate how to use MrBayes to perform the following analyses: -select among partition schemes (mixed models) -perform model averaging by means of rjMCMC -estimate ancestral state for discrete traits -test for violation of the molecular-clock assumption -estimate divergence times ************************************************** 1. Selecting among partition schemes (mixed model) To accommodate process heterogeneity within and/or between various gene(omic) regions, we will evaluate the support for various various partition schemes using Bayes factors to compare the marginal likelihoods of the candiate partition schemes. We will estimate the marginal likelihood for each partition scheme using both the harmonic-mean (unreliable) and stepping-stone (preferable) estimators. Open the batch file, conifer_partn.nex, in a text editor. This file contains all of the commands required to perform the necessary analyses to explore various partition schemes (unpartitioned, partitioned by gene region, and partitioned by gene region+codon position. The details of each command are described in adjacent comments, surrounded in brackets; e.g., [this is a comment]. The relevant results of these analyses are the estimated marginal likelihoods, summarized in the table below: Partition Marginal lnL estimates scheme Harmonic mean Stepping-stone -------------------------------------------------- uniform -9905.51 -9908.26 moderate -9660.15 -15739.58 extreme -9478.21 -19107.15 -------------------------------------------------- Note that Bayes factors based on comparison of HM-based marginal likelihoods would *strongly* favor the most extremely partitioned mixed model, whereas comparison of SS-based marginal likelihoods strongly prefers the simplest, unpartitioned mixed model. The conflict in the preferred partition scheme stems from the unreliability of the HM estimator. Never, ever, ever base Bayes factors on comparison of HM-derived marginal likelihoods. ************************************************** 2. Model averaging by means of rjMCMC Typically, we first chose a stochastic model of substitution by some means (hLRT, AIVC, etc.), and then condition our inferences on the chosen model. This practice ignores uncertainty in the choice of substitution model, which in turn may bias our inferences. Rather than condition inference on a single model, we may instead treat the substitution model as a random variable, and integrate inference of phylogeny over all possible 203 members of the GTR family of substitution models. We will demonstrate how to do this using reversible-jump MCMC, which simultaneously provides both a marginal estimate of the phylogeny (topology and branch lengths) and also the marginal probability of of the individual substitution models. Open the batch file, conifer_rjMCMC.nex, in a text editor. This file contains all of the commands required to perform an analysis in which the model is treated as a random variable. The details of each command are described in adjacent comments, surrounded in brackets; e.g., [this is a comment]. For the sake of comparison (or just for fun!), let's use a conventional model-selection method to choose among models. While the rjMCMC analysis is running, we'll use jModelTest to select a substitution model. jModelTest returned the following results based on the AIC method: Model -lnL K AIC delta weight cumWeight ------------------------------------------------------------------------ GTR+G 9887.0969 25 19824.1938 0.0000 0.7759 0.7759 TIM1+G 9890.3440 23 19826.6881 2.4943 0.2229 0.9988 TIM3+G 9895.7602 23 19837.5204 13.3266 0.0010 0.9998 TrN+G 9899.0314 22 19842.0628 17.8690 0.0001 0.9999 These results suggest that the most complex model evaluated, GTR+G, provides the best fit to the data, and that there is little model uncertainty: only one model is included in the 95% confidence interval. The relevant results of the MrBayes analysis are the marginal likelihoods for the various members of the GTR family of substitution models, summarized in the table below: Model probabilities above 0.050 Estimates saved to file "conifer_rjmcmc.mstat". Posterior Standard Min. Max. Model Probability Deviation Probability Probability ---------------------------------------------------------------------------------- gtrsubmodel[123343] 0.321 0.021 0.306 0.336 gtrsubmodel[123345] 0.175 0.033 0.152 0.198 gtrsubmodel[123454] 0.130 0.011 0.123 0.138 gtrsubmodel[123453] 0.078 0.001 0.077 0.079 gtrsubmodel[123341] 0.062 0.007 0.057 0.067 gtrsubmodel[123141] 0.061 0.001 0.060 0.061 gtrsubmodel[123456] 0.058 0.001 0.057 0.059 ---------------------------------------------------------------------------------- First, note that the substitution model with the highest marginal probability is an unnamed model that differs from that identified by jModelTest, which selected GTR+G. In fact, the model with the highest marginal probability is not even considered by standard model-selection methods! Second, note that there are at least 7 models comprising the 95% credible interval of models, some models--such as [123345] and [123454]--have substantial marginal probability. In such cases, it would not be defensible to condition the analyses on any one model; it is far preferable in such cases to integrate over all models in order to accommodate the uncertainty associated with the choice of substitution model. ************************************************** 3. Ancestral-state estimation Many evolutionary studies involve questions related to the evolution of discrete phenotypic (morphological, physiological, ecological) traits, particularly the state of a trait in one or more MRCAs. Analyses of trait evolution can be performed in MrBayes in a fully hierarchical manner. That is, we can simultaneously estimate the joint posterior probability density of the phylogeny and all other model parameters for a mixed data set containing both molecular and morphological data. We can then look marginally at the state of a morphological trait (or nucleotide for a site) at a node of interest. Note that it is best to estimate ancestral states one node at a time. This analysis requires that we specify normal variables--the data partitions, the CTMM models for each partition, etc.--but we must also specify (1) the node that we wish to monitor (representing the MRCA of interest), (2) we must force the monitored node to be monophyletic, and (3) we must specify which trait(s) we wish to monitor. We should therefore only evaluate ancestral states for well supported nodes. In this example, we will estimate the ancestral state for branching order--whether spiral or opposite--in the MRCA of conifers, the monophyly of which is well supported. Open the nexus file, conifer_anc.nex, in a text editor. This file contains all of the commands required to perform an analysis in which the model is treated as a random variable. The details of each command are described in adjacent comments, surrounded in brackets; e.g., [this is a comment]. 95% HPD Interval -------------------- Parameter Mean Variance Lower Upper Median min ESS* avg ESS PSRF+ --------------------------------------------------------------------------------------------------------- TL{all} 1.469551 0.060504 1.044280 1.925985 1.426342 66.85 71.85 1.046 . . . p(0){1@conifers} 0.474846 0.008713 0.274042 0.502302 0.500000 83.66 95.86 1.002 p(1){1@conifers} 0.525154 0.008713 0.496190 0.720240 0.500000 83.66 95.86 1.002 --------------------------------------------------------------------------------------------------------- These results suggest that there is considerable uncertainty regarding the ancestral state of branching order (as spiral or opposite) in the MRCA of conifers: the marginal probabilities for both states are quite similar (P~0.5) For these analyses, it may be worthwhile to explore alternative priors and/or values for parameters used to model the morphological traits. Additionally, it may be more appropriate to estimate marginal probabilities of ancestral states from a chronogram (an ultrametric tree where the branch lengths are proportional to time--the branch lengths in phylograms are rendered as proportional to the expected number of character changes per trait). Below we demonstrate how to perform the necessary analyses to estimate divergence times. ************************************************** 4. Testing for violation of the molecular-clock assumption Many evolutionary questions will require (or at least benefit from) estimates of divergence times. These inferences initially relied on the (empirically very shaky) assumption that rates of molecular evolution were approximately constant through time and across lineages--in other words, estimating divergence times assumed a strict molecular clock model. Empirical evidence suggests that the vast majority of data sets do not conform to a molecular clock--exhibiting substantial variation in rates of substitution across lineages. The prevalence of this empirical observation has motivated theoreticians to develop and implement many relaxed-molecular clock model (which explicitly model how rates vary across lineages), and discouraged empiricists from bothering to assess whether their data conform to a clock (I guess this would be a case of a very strong prior!). Nevertheless, it is a critical first step to test for a clock--simulation studies suggest that if your data are clock like, a strict clock is your best choice of model to estimate divergence times (and a relaxed clock is not needed as it just inflates error variance). Here we run two types of analyses: one in which the clock is enforced, and another where it is not. For each type of analysis, we estimate the marginal likelihood of the clock and non-clock models with both the harmonic-mean and stepping-stone estimators. Open the nexus file, conifer_clock.nex, in a text editor. This file contains all of the commands required to perform the necessary analyses to determine whether the conifer sequences conform to a molecular clock. The details of each command are described in adjacent comments, surrounded in brackets; e.g., [this is a comment]. The relevant results of the MrBayes analyses are the marginal likelihoods for the clock and non-clock models estimated by the HM and SS methods, summarized in the table below: Branch Marginal lnL estimates lengths Harmonic mean Stepping-stone -------------------------------------------------- clock -9675.05 -17824.30 non-clock -9743.27 -21068.95 -------------------------------------------------- In contrast to the analyses performed to select among partiton schemes, Bayes factors based both on comparison of HM-based marginal likelihoods and those based on comparison of SS-based marginal likelihoods would *strongly* favor the molecular-clock hypothesis (a rarity indeed...quick submit this result to Nature!). This result suggests that inference of divergence times would preferably be performed under a strict molecular-clock model. Nevertheless, for the sake of demonstration (and because the clock is rejected for the vast majority of empirical data sets) we will proceed below to demonstrate how to estimate divergence times using a relaxed molecular clock. ************************************************** 5. Estimating divergence times from Many inference problems will require estimates of absolute or relative divergence time: trait evolution, biogeography, lineage diversification are just some of the kinds of problems that benefit from temporal information. Numerous relaxed-molecular clock models have been proposed, and as is the case for any model-based inference, we need to carefully assess the relative fit of our data to the candidate models. Moreover, many of these models involve hyperparameters that govern the way that substitution rates vary across branches. Being good Bayesians, it is important to explore the form of the prior for a given model (e.g., variance in the rate of substitution may be fixed or modeled as an exponential, uniform or Dirichlet random variable). Moreover, for a given (hyper)prior, we should explore the parameter value (the mean of the exponential, etc.) Here we perform several analyses of divergence times under the strict- and three relaxed-molecular clock models. For each clock model, we will estimate the marginal likelihood using both the harmonic-mean (HM) and stepping-stone (SS) estimators. Open the nexus file, conifer_dte.nex, in a text editor. This file contains all of the commands required to perform the necessary analyses to estimate divergence times under various clock models. The details of each command are described in adjacent comments, surrounded in brackets; e.g., [this is a comment]. Clock Marginal lnL estimates model Harmonic mean Stepping-stone -------------------------------------------------- strict -9673.11 -14761.50 CPP -9765.29 -21831.11 IGR -9761.98 -19057.93 TK02 -9779.39 -16955.77 -------------------------------------------------- Again, Bayes factors based on comparison of marginal likelihoods (derived from both the HM and SS estimators) favor the strict clock model, with the autocorrelated TK02 relaxed-clock model placing second (which makes sense, as it is 'closer' to the strict clock in model space. If we were performing these analyses 'for real', however, we would want to explore the sensitivity of the results to various priors for each of the clock models; e.g, whether we chose a fixed, exponential, or uniform prior for variance parameter of the IGR and TK02 models (and for each chosen prior, the value of the hyperprior), the clock rate prior, the branch-length priors, etc., etc.