Garli FAQ

Search options

How many generations/seconds should I run for?

This is dataset specific, and there is no way to tell in advance. It is recommended to set the maximum generations and seconds to very large values (>1×10^6) and use the automated stopping criterion (see enforcetermconditions in the settings list). Note that the program can be stopped gracefully at any point by pressing Ctrl-C, although the results may not be fully optimal at that point.
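
As a concrete illustration, the relevant config entries might look like the following; the values are arbitrary placeholders chosen only to be large enough that the automated termination conditions, rather than the generation or time limits, end the run:

stopgen = 5000000
stoptime = 5000000
enforcetermconditions = 1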

How many runs/search replicates should I do?

That somewhat depends on how much time/computational resources you have. You should ALWAYS do multiple searches, either by using the searchreps setting or by simply running the program multiple times (but do so in different directories or change the ofprefix to avoid overwriting previous results). If you perform a few runs or replicates and get very similar trees/lnL scores (ideally within about one lnL of each other), that should give you some confidence that the program is doing a good job searching and is finding the best or nearly best topology, and it suggests that you don’t need to do many more searches. If there is a lot of variation between runs, try using different starting tree options (see further FAQ entries) and choose the best scoring result that you obtain. Note that the program is stochastic, and runs performed with exactly the same starting conditions and settings (but different random number seeds) may give different results. You may also try changing some of the search parameters to make each search replicate more intensive (see further FAQ entries). The discussion on the Advanced_topics page may be helpful in determining how many replicates you should perform, and in collating the results from multiple searches.
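
For example, the following config fragment (the prefix is just an illustration) would perform five independent search replicates within a single execution, writing results with a distinctive prefix:

ofprefix = search1
searchreps = 5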

Should I use random starting topologies, stepwise-addition starting topologies or provide starting topologies myself?

  • Unless the number of sequences in your dataset numbers in the hundreds, it is recommended to perform multiple searches with both random (streefname = random) and stepwise-addition (streefname = stepwise) starting trees (see the config sketch following this list).
  • For datasets consisting of up to several hundred sequences, searches using a random starting tree often perform well (although they have slightly longer runtimes). Because the search starts from very different parts of the search space, getting consistent results from random starting trees provides good evidence that the search is doing a good job and really is finding the best trees. For datasets of more than a few hundred sequences, random starting trees sometimes perform quite poorly.
  • The quality of stepwise-addition starting trees can be controlled with the attachmentspertaxon setting. This allows the creation of starting trees that fall somewhere between completely random and very optimal. See the description of the attachmentspertaxon setting.
  • Providing your own starting trees can sometimes be helpful, especially on datasets consisting of hundreds of sequences, where the creation of the stepwise-addition tree itself may take quite a long time.
  • User-specified starting trees may contain polytomies (i.e., they need not be fully bifurcating), so, for example, a parsimony strict consensus tree could be used to get the search into the right ballpark without biasing it very much. Before searching, a polytomous tree will be arbitrarily resolved, with the resolution being different for each search replicate or program execution.
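
The config sketch referred to above: the choice of starting tree comes down to the streefname line (the filename and the attachmentspertaxon value below are placeholders; see those settings' entries for real defaults). For random starting trees:

streefname = random

for stepwise-addition starting trees, optionally tuning their quality:

streefname = stepwise
attachmentspertaxon = 50

or for a user-supplied starting tree file:

streefname = mystarttree.tre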

What is the proper format for specifying a starting topology?

The tree should be contained in a separate file (with that filename specified on the streefname line of the configuration file) either in a Nexus trees block, or in standard Newick format (parenthetical notation). Note that the tree description may contain either the taxon numbers (corresponding to the order of the taxa in the dataset), or the taxon names. The tree can optionally contain branch lengths, as well as have polytomies. If multiple trees are contained in a single starting tree file, they will be used in order to start each successive search replicate (if searchreps > 1). See the streefname configuration entry for more details.
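
As an illustration with made-up taxon names, either of the following would work as the contents of a starting tree file. Plain Newick, here with optional branch lengths:

((taxonA:0.012,taxonB:0.009):0.004,(taxonC:0.020,taxonD:0.015),taxonE:0.031);

or the equivalent wrapped in a Nexus trees block, here using taxon numbers in place of names:

#NEXUS
begin trees;
tree start = ((1,2),(3,4),5);
end;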

Should I specify a starting topology with branch lengths?

It doesn’t appear to make much of a difference, so I would suggest not doing so. Note that it is probably NOT a good idea to provide starting branch lengths estimated under a different likelihood model or by Neighbor Joining. When in doubt, leave out branch lengths.

Model parameters

How do I specify starting/fixed model parameter values?

Model parameter values are specified using a fairly cryptic scheme of specifying a single letter representing a particular parameter(s), followed by the value(s) of that parameter(s). See this page for details: Specifying model parameter values.

Should I specify starting model parameters?

If you do not intend to fix the model parameters, specifying a starting model is generally of little help. One case in which you might want to specify starting parameter values would be when doing many search replicates or bootstrap replicates, in which case getting the starting values in the right ballpark can reduce total runtimes by an appreciable amount. If you do intend to fix the parameters at values obtained elsewhere or in a previous GARLI run, then you obviously must include the starting parameter values. See the streefname configuration entry for details on how to specify model parameter values.

Should I fix the model parameters?

The main reason one would fix parameters is to increase the speed of the search. Fixing model parameters results in a huge speed increase in some inference programs (such as PAUP*), but less in GARLI (generally approx. 10-50% with an unpartitioned model, although it can be much more with a partitioned model or if there are many parameters). Unless you have good model estimates (under exactly the same model), do not fix them. One situation in which you might want to fix parameter values would be in the case of bootstrapping. You might want to estimate parameter values on the real data, and then fix those parameter values for the searches on each of the pseudo-replicate datasets. See Getting the parameter values for an easy way to do this.

Model choices

What DNA/RNA substitution models can I use?

All possible submodels of the GTR (General Time Reversible) model, with or without gamma distributed rate heterogeneity and a proportion of invariable sites. This is the same set of models allowed by PAUP* and represents the full set of models considered by the model selection program MODELTEST (http://darwin.uvigo.es/software/modeltest.html). See the "Model specification settings" section on the GARLI configuration page.

Do I need to perform statistical model selection when using GARLI?

Yes! Just as when doing an ML search in PAUP* or a Bayesian analysis in MrBayes, you should pick a model that is statistically justified given your data. You may use a program like MODELTEST (http://darwin.uvigo.es/software/modeltest.html) to do the testing. However, most good sized datasets (which is mainly what GARLI is designed to analyze) do support the use of the most complex time-reversible model, GTR with a class of invariable sites and gamma distributed rate heterogeneity (“GTR+I+G”). As of GARLI version 0.96, all of the models examined by MODELTEST can now be estimated. See the "Model specification settings" section on the GARLI configuration page and the FAQ item "MODELTEST told me to use model X. How do I set that up in GARLI?" below.

Note that there is NOT really any reason to use the model parameter values provided by MODELTEST, only the model TYPE as indicated by the MODELTEST results (i.e., JC, HKY, GTR, etc.). GARLI will do a better job of estimating the parameters because it will do so on the ML tree, plus fixing or providing the parameter values to GARLI will not help that much to reduce runtimes. See this FAQ section above: Model parameters.

What amino acid models can I use?

Amino acid analyses are typically done using fixed rate matrices that have been estimated on large datasets and published. Typically the only model parameters that are estimated during tree inference relate to the rate heterogeneity distribution. Each of the named matrices also has corresponding fixed amino acid frequencies, and a given matrix can either be used with those frequencies or with the amino acid frequencies observed in your dataset. Amino acid models may be used with the same forms of rate heterogeneity available for nucleotide models (gamma-distributed rate heterogeneity and a proportion of invariable sites). These are the implemented amino acid rate matrices:

ratematrix/statefrequencies setting    reference
dayhoff                                Dayhoff, Schwartz and Orcutt, 1978
jones                                  Jones, Taylor and Thornton (JTT), 1992
WAG                                    Whelan and Goldman, 2001
mtREV                                  Adachi and Hasegawa, 1996
mtmam                                  Yang, Nielsen and Hasegawa, 1998

Versions 1.0 and later also allow estimation of the full amino acid rate matrix (189 rate parameters). Do not do this unless you have lots of data, as well as a good amount of time. Newer versions also allow input of your own amino acid rate matrix, allowing you to use any model if you have the rates. A file specifying the LG model (Le and Gascuel, 2008) is included in the example directory with the program to demonstrate this.

See the amino acid section of the model specification settings section for more details on amino acid models.
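
As a hedged sketch of an amino acid run using the JTT matrix with gamma rate heterogeneity and invariable sites (the datatype keyword and the keyword for using observed rather than matrix-specific frequencies are written here from memory and should be checked against the model specification settings page):

datatype = aminoacid
ratematrix = jones
statefrequencies = jones
ratehetmodel = gamma
numratecats = 4
invariantsites = estimate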

How do I choose which amino acid model to use?

As with choosing a nucleotide model, your choice of an amino acid model should be based on some measure of how well the available models fit your data. The program PROTTEST (http://darwin.uvigo.es/software/prottest.html) does for amino acid models what MODELTEST does for nucleotide models, testing a number of amino acid models and helping you choose one. Note that although GARLI can internally translate aligned nucleotide sequences into amino acids and analyze them at that level, to use PROTTEST you will need to convert your alignment into a Phylip formatted amino acid alignment first.

What codon models can I use?

The codon models that can be used are related to the Goldman and Yang (1994) model. See the codon section of the model specification settings for a discussion of the various options.

How do I choose which codon model to use?

I don't currently have a good answer for this. The codon models should probably be considered experimental at the moment. Experiments to investigate the use of codon models for tree inference on large datasets are underway, and I should eventually have some general guidelines on how best to apply them. Feel free to give them a try with your data.

MODELTEST told me to use model X. How do I set that up in GARLI?

The candidate models that MODELTEST chooses from have the following format: <Model Name><optionally, +G><optionally, +I>; for example, GTR+I+G, SYM+I or HKY. The model names are definitely cryptic if you aren't familiar with the evolutionary models used in phylogenetic analyses. Luckily, there is a direct correspondence between all of MODELTEST's models and particular GARLI settings. Note that GARLI allows the use of every model that MODELTEST might tell you to use. First, rate heterogeneity: For any model with "+G" in it:

ratehetmodel = gamma
numratecats = 4 (or some other number.  4 is the default in GARLI, PAUP* and MrBayes)

For any without "+G" in it:

ratehetmodel = none
numratecats = 1

For any model with "+I" in it:

invariantsites = estimate

For any without "+I" in it:

invariantsites = none

The model names each correspond to a particular combination of the statefrequencies and ratematrix configuration entries. Note that for the rate matrix settings that appear in parentheses like this: (0 1 2 3 4 5), the parentheses do need to appear in the config file. Here are all of the named models:

model name         ratematrix =       statefrequencies =
JC                 1rate              equal
F81                1rate              estimate
K80                2rate              equal
HKY                2rate              estimate
TrNef              (0 1 0 0 2 0)      equal
TrN                (0 1 0 0 2 0)      estimate
K3P (= K81)        (0 1 2 2 1 0)      equal
K3Puf (= K81uf)    (0 1 2 2 1 0)      estimate
TIMef              (0 1 2 2 3 0)      equal
TIM                (0 1 2 2 3 0)      estimate
TVMef              (0 1 2 3 1 4)      equal
TVM                (0 1 2 3 1 4)      estimate
SYM                6rate              equal
GTR                6rate              estimate
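
Putting the pieces together: if MODELTEST chose, say, HKY+I+G, the corresponding block of the config file would combine the table entry with the rate-heterogeneity rules above:

ratematrix = 2rate
statefrequencies = estimate
ratehetmodel = gamma
numratecats = 4
invariantsites = estimate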

NOTE: MODELTEST also returns parameter values with the chosen model type. You may fix those values in GARLI, but unlike in PAUP* there is little speed benefit to doing so. As long as you set the correct type of model with the above instructions, GARLI will infer a better model estimate than the values given by MODELTEST, since those are estimated on a tree that is poorer than the Maximum Likelihood tree. See the "Model parameters" section of the FAQ for more information on providing/fixing model parameters.

Can GARLI do analyses assuming a relaxed or strict molecular clock?

Sorry, no.

Can I infer rooted trees in GARLI?

No. All models used are time reversible, and the position of the root neither affects nor is inferred by the analyses. Rooting the inferred tree by using an outgroup or other method is up to you. Note that you can specify an outgroup for GARLI to use, but this only affects how the trees are oriented when written to file, and has no effect on the analysis itself.

Can GARLI perform partitioned analyses, e.g. allow different models for different genes?

Yes. An official version that can do this and better documentation of it are forthcoming. In the meantime an earlier functional version is available and documented here.

Constraints

How do I specify a topological constraint?

In short, this requires deciding which branches (or bipartitions) you would like to constrain, specifying those branches in a file and telling GARLI where to find that file. See the constraintfile option for details on constraint formats.
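
As a rough sketch from memory (verify the exact syntax against the constraintfile entry before relying on it): a positive constraint file is typically a single line beginning with "+" (or "-" for a negative constraint) followed by a Newick-style grouping naming the taxa to be constrained, and the config file then points at that file. For example, the file myconstraints.con might contain:

+((taxon1,taxon2),taxon3)

and the config file would include:

constraintfile = myconstraints.con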

Why might I want to specify a topological constraint?

There are two main reasons: to reduce the topology search space or to perform hypothesis testing (such as parametric bootstrapping). For large datasets in which you are certain of some groupings, it may help the search to constrain a few major groups. Note that if constraints are specified without a starting tree, GARLI will create a random or stepwise-addition tree that is compatible with those constraints. This may be an easy way of improving searching without the potential bias of using a given starting tree. A discussion of parametric bootstrapping (sometimes called the SOWH test) is outside the scope of this manual. It is a method of testing topological null hypotheses with a given dataset through simulation. See: Huelsenbeck et al. (1996). Other statistical tests of tree topologies (attempting to answer the question “Is topology A significantly better than topology B”) are nicely reviewed in Goldman et al. (2000).

How do I fully constrain (i.e., fix) the tree topology?

This is not done with a constraint! Use the optimizeinputonly option.
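
A minimal sketch, assuming optimizeinputonly is a 0/1 toggle like most GARLI boolean settings (the filename is a placeholder):

streefname = myfixedtree.tre
optimizeinputonly = 1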

Program versions

What are the differences between the Graphical (GUI) OS X version and other versions?

(NOTE that the GUI Version of GARLI is VERY out of date, version 0.951 vs 2.0. If you can use a newer non-GUI version, do so.) The main differences are in how the user interacts with the program. The GUI version changes the cryptic option names detailed below into normal English. If you hold your mouse pointer over an option in the GUI it will give you a description of what that option does (generally taken directly from this manual). There may be some options that are not available in the GUI. Searching and optimization may also not be great in this version, since there have been many improvements made to the core since its release.

Should I use a multi-threaded (openMP) version of GARLI if I’m using a computer with multiple processors/cores?

The multi-threaded versions will often increase the speed of runs by approximately 1.2 to 1.8 times, but will otherwise give results identical to those obtained with the normal version (i.e., the search algorithm is exactly the same). It will perform the best when there are many columns in the alignment, or when using amino acid or codon models. It also seems to be very hardware specific, so with DNA models on some machines it may not help at all. Test it yourself on your machine before assuming that it will help.

Note that even if it is faster, this doesn't mean that running this version is the best use of computing resources. In particular, if you intend to do multiple search replicates or bootstrap replicates, simply running two independent executions of the program will give a speedup of nearly 2 times, and will therefore get a given number of searches done more quickly than a multithreaded version. One case in which the multithreaded version may be of particular use is when analyzing extremely large datasets for which the amount of memory that would be required for two simultaneous executions of the program is near or greater than the amount of memory installed in the system. Note that the multi-threaded versions by default will use all of the cores/processors that are available on the system. To change this, you can set the OMP_NUM_THREADS environment variable (you can find information on how to do that online). Note that the performance of the multithreaded version when it is only using one processor or core is actually worse than the normal version, so when in doubt use the normal version.
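
For example, on a Unix-like system with a bash-style shell, you could restrict an OpenMP build to two threads before launching a run (the executable and config file names are only examples):

export OMP_NUM_THREADS=2
./Garli-1.0 -b mygarli.conf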

Should I use a 64-bit Version?

A 64-bit (sometimes called x64) version of the program will probably not help you unless you need to use large amounts of memory. In general, a 64-bit version will not be faster. If you need to use about 4 or more GB of memory, then you MUST use a 64-bit version. I do not currently have a 64-bit OS X distribution, but could make one if there is interest. Compiling your own might be a better option.

What is the parallel MPI version of GARLI? Should I use it?

This is a fairly complex question and answer. The short of it is that if you are running on a large computer cluster it may be worthwhile to use the parallel version. There is nothing wrong with using the serial version on a cluster if the cluster allows it, and there may not be much benefit to using the MPI version in this case. The MPI version can also be run on a standalone machine with multiple processors that has MPI installed, such as Linux or Mac OS X Leopard (10.5). See a detailed discussion of the MPI version here.

Miscellaneous

Can I use GARLI to do batches of runs, one after another?

Yes, any of the non-GUI versions can do this. First create a different config file for each run you need to do, and name them something like run1.conf, run2.conf, etc. Assuming that the GARLI executable is named Garli-1.0 and is in the current directory, you may then make a shell script that runs each config file through the program like this:

./Garli-1.0 -b run1.conf
./Garli-1.0 -b run2.conf
etc.

The "-b" tells the program to use batch mode and to not expect user input before terminating. The details of making a shell script are beyond the scope of this manual, but you can find help online or ask your nearest Unix guru.

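If the config files share a naming pattern, a short shell loop (a sketch for a bash-like shell) saves typing each command by hand:

# run every run*.conf file through GARLI in batch mode, one after another
for f in run*.conf; do
    ./Garli-1.0 -b "$f"
done
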
For nucleotide models: Is the score that GARLI reports at the end of a run equivalent to what PAUP* would calculate after fully optimizing model parameters and branch lengths on the final topology?

The model implementations in GARLI are intentionally identical to those in PAUP*, so in general the scores should be very close. In some very rare conditions the score given by GARLI is better than that given by PAUP* after optimization, which appears to be due to PAUP* getting trapped in local branch-length optima. This should not be cause for concern. If you want to be absolutely sure of the lnL score of a tree inferred by GARLI, optimize it in PAUP*. Note that comparability of scores should NOT generally be assumed between other programs such as RAxML or PHYML.

For nucleotide models: Is the lnL score that GARLI reports at the end of a run comparable to the lnL scores reported by other ML search programs?

In general, you should not assume that lnL scores output by other ML search programs (such as PHYML and RAxML) are directly comparable to those output by GARLI, even if they apparently use the same model. To truly know which program has found a better tree you will need to score and optimize the resulting trees using a single program, under the same model. Also see the previous question.

Which GARLI settings should I play around with?

Besides specifying your own dataset, most settings don’t need to be tinkered with, although you are free to do so if you understand what they do. Settings that SHOULD be set by the user are ofprefix, availablememory and genthreshfortopoterm. If you want to tinker further, you might try changing uniqueswapbias, nindiv, selectionintensity, limsprrange, startoptprec, minoptprec and numberofprecisionreductions. In general, using a different starting topology tends to have more of an effect on the results than any of these settings do. It is recommended that you do NOT change stopgen, stoptime, refinestart, enforcetermconditions and the mutation weight settings unless you have a specific reason to do so.

Can I specify alignment columns of my data matrix to be excluded?

Yes, if your datafile is Nexus. This is done through an "exset" command in a Nexus assumptions block, included in the same file as a Nexus data matrix. For example, to exclude characters 1-10 inclusive and character 20, the block would look like this:

begin assumptions;
exset * myExsetName = 1-10 20;
end;

The * means to automatically apply the exset (otherwise the command simply defines the exset), and the exset name doesn’t matter. Note that this assumes that the file has only one characters or data block, and that the characters block is not named. If you use Mesquite to edit your data or visualize your alignment, any characters that you exclude there will automatically be written to an assumptions block in the file and will be read by GARLI.

(Another option for removing alignment columns is to use PAUP*. Simply execute your dataset in PAUP*, exclude the characters that you don’t want, and then export the file to a new name. The new file will include only the columns you want.)

How do I perform a non-parametric bootstrap in GARLI?

Set up the config file as normal, and set the bootstrapreps setting to the number of replicates you want. The program will perform searches on that number of bootstrap-reweighted datasets, and store the best tree found for each replicate dataset in a single file called <ofprefix>.boot.tre. You can also specify searchreps > 1 while bootstrapping to perform multiple searches on each bootstrap-resampled dataset. The best tree across all search replicates for each bootstrapped dataset will be written to the bootstrap file. See the bootstrapreps configuration entry for more info.

Note that GARLI does NOT do the bootstrap consensus or calculate the bootstrap proportions itself. It simply infers the trees that are the input to that consensus. The trees in the .boot.tre file will need to be read into a program such as SumTrees (or PAUP* or CONSENSE) to do a majority-rule consensus and obtain your bootstrap support values. See the following part of the "Advanced topics" page for suggestions on making the consensus: Detailed Example: A bootstrap analysis.

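A minimal sketch of the bootstrap-related config lines (the numbers are placeholders rather than recommendations):

ofprefix = myboot
bootstrapreps = 500
searchreps = 1
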
How do I use checkpointing (i.e., stop and later restart a run)?

Set the writecheckpoints option to "1" before doing a run. If the run is stopped for some reason (intentionally or not), it can be restarted by changing the restart option to "1" in the config file and executing the program. DO NOT make any other changes to the config file before attempting a restart, or bad things may happen. See the writecheckpoints and restart settings.

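In other words, assuming restart defaults to 0 for a fresh run, the relevant lines would look like this for the initial run:

writecheckpoints = 1
restart = 0

and like this when resuming after an interruption:

writecheckpoints = 1
restart = 1
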
I ran GARLI multiple times, and now I have results spread across multiple files. How do I deal with this and compare/summarize the results?

See the Advanced_topics page (in particular the "Examining/collecting results" section) for a discussion of this.

Can GARLI output site-likelihoods for use in a program like CONSEL?

Yes. You can easily do this for the best tree found at the end of a search by using the outputsitelikelihoods setting. To optimize branch lengths and model parameters and then output the site likelihoods for user-specified trees, also use optimizeinputonly.

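A sketch of the config lines for scoring your own trees and writing their site likelihoods (assuming both options are 0/1 toggles; the filename is a placeholder):

streefname = candidatetrees.tre
optimizeinputonly = 1
outputsitelikelihoods = 1
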
Does GARLI return trees that are not fully resolved (with polytomies)? Does it collapse branches of length zero?

Yes. The minimum branch length that GARLI allows is actually 1×10^-8, which is effectively zero. Setting the collapsebranches option will cause such branches to be collapsed before writing the final trees to file.

Some tree inference software does NOT do this, which can be very important when analyzing datasets with low variability, i.e., when there is really no evidence for some branches. Zero-length branches (which should really be polytomies) will be randomly resolved in one of three ways when branches are not collapsed. Depending on how the trees are being used, this can introduce problematic extra unsupported branches.

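Assuming collapsebranches is a 0/1 toggle, requesting the collapsing behavior would just be:

collapsebranches = 1
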
Source code compilation

Why would I want to compile GARLI myself?

The only reasons that would require you to do this would be if you are trying to use it on an operating system other than OS X or Windows (i.e., Linux) or if you want access to the very latest fixes and updates. It should be easy to build on any Linux or OS X machine.

How do I compile GARLI myself from a source distribution?

This would mean that you are starting with a file called something like garli-1.0.tar.gz that you downloaded from the Google Code page. Your system will need the proper tools installed to do this compilation (many Linux distributions should have these by default; for OS X you'll need to install the Developer Tools). To compile:

1. Decompress the source distribution. From the command line:

tar xzvf garli-1.0.tar.gz

2. Change into the garli-1.0 directory that has been created:

cd garli-1.0

3. If you want to do the most basic build, type the following and wait for a few minutes:

sh build_garli.sh

4. If everything worked, you should now have a bin directory within the source distribution and an executable file within it called garli-1.0. The file can be copied anywhere on the system and will work properly.

  • (Optional) Alternatively, if you want to download and use the latest version of the Nexus Class Library (NCL) that GARLI uses to parse input files, you could type this in step 3:

sh build_garli.sh --ncl-svn

  • (Optional) If you want to pass other flags to the GARLI configure script, you can provide them as other arguments to build_garli.sh, e.g.:

sh build_garli.sh --open-mp

  • If you want to build the very latest source in the svn repository (possibly enhanced, possibly broken) see the instructions here: building from source.

How do I fix compiler errors like "'strlen' was not declared in this scope" (usually GCC 4.2 or later)?

These errors usually appear when trying to compile with the new gcc 4.3, which made some changes regarding file inclusion. Try changing

#include <string>

to

#include <cstring>

at the top of the files src/bipartition.h and src/translatetable.h.

In src/configoptions.h, add

#include <climits>

after the other include statement at the top of the file.

If you still have errors, you might also need to add

#include <cstdio>

at the top of bipartition.h.

How do I fix compiler "error: cannot call constructor ErrorException::ErrorException directly" (usually GCC 4.2 or later)?

This comes up in newer versions of gcc. In the src/utility.h file, change the two lines that start with

throw ErrorException::ErrorException( ...

to

throw ErrorException( ...

and compile again.