GARLI Configuration Settings

From MolEvol
Revision as of 12:08, 21 July 2015 by Zwickl (talk | contribs) (Descriptions of GARLI configuration settings)


Descriptions of GARLI configuration settings

The format for these configuration settings descriptions is generally: entryname (possible values, default value in bold) – description

General settings

datafname (file containing sequence dataset)

datafname = (filename) – Name of the file containing the aligned sequence data. Accepted formats are PHYLIP, NEXUS and FASTA. NEXUS-formatted datasets are read robustly using the Nexus Class Library, which accommodates things such as interleaved alignments and exclusion sets (exsets) in NEXUS assumptions blocks (see the FAQ for an example of exset usage). Use of NEXUS files is recommended, and is required for partitioned models.

constraintfile (file containing constraint definition)

constraintfile = (filename, none) – Name of the file containing any topology constraint specifications, or “none” if there are no constraints. The easiest way to explain the format of the constraint file is by example. Consider a dataset of 8 taxa, in which your constraint consists of grouping taxa 1, 3 and 5. You may specify either positive constraints (the inferred tree MUST contain the constrained group) or negative constraints (also called converse constraints; the inferred tree CANNOT contain the constrained group). These are specified with either a ‘+’ or a ‘-’ at the beginning of the constraint specification, for positive and negative constraints, respectively.

  • For a positive constraint on a grouping of taxa 1, 3 and 5:
  • For a negative constraint on a grouping of taxa 1, 3 and 5:
  • Note that there are many other equivalent parenthetical representations of these constraints.
  • Multiple groups may be positively constrained, but currently only a single negatively constrained group is allowed.
  • Multiple constrained groupings may be specified in a single string:
  or in two separate strings on successive lines: 
  • Constraint strings may also be specified in terms of taxon names (matching those used in the alignment) instead of numbers.
  • Positive and negative constraints cannot be mixed.
  • GARLI also accepts another constraint format that may be easier to use in some cases. This involves specifying a single branch to be constrained with a string of ‘*’ (asterisk) and ‘.’ (period) characters, with one character per taxon. Each taxon specified with a ‘*’ falls on one side of the constrained branch, and all those specified with a ‘.’ fall on the other. This should be familiar to anyone who has looked at PAUP* bootstrap output. With this format, a positive constraint on a grouping of taxa 1, 3 and 5 would look like this:
     or equivalently like this: 
      With this format each line only designates a single branch, so multiple constrained branches must be specified as multiple lines in the file. 
  • The program also allows “backbone” constraints, which are simply constraints that do not include all of the taxa. For example if one wanted to constrain (1,3,5) but didn’t care where taxon 6 appeared in the tree, this string could be used:
    Nothing special needs to be done to identify this as a backbone constraint; simply leave out some taxa.
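The example strings referred to in the list above are not reproduced in this copy of the page. As a rough sketch only, assuming the usual newick-style parenthetical notation terminated by a semicolon, constraint files for the 8-taxon example might look like the following. The second constrained group (6,7) is a hypothetical choice made purely for illustration.

A positive constraint on taxa 1, 3 and 5:

```
+((1,3,5),2,4,6,7,8);
```

The corresponding negative constraint:

```
-((1,3,5),2,4,6,7,8);
```

Two positively constrained groups, given as separate strings on successive lines:

```
+((1,3,5),2,4,6,7,8);
+(1,2,3,4,5,(6,7),8);
```

The same positive constraint on taxa 1, 3 and 5 in the ‘*’ and ‘.’ format, with one character per taxon, in its two equivalent forms:

```
+*.*.*...
```

```
+.*.*.***
```

A backbone constraint on (1,3,5) that simply omits taxon 6:

```
+((1,3,5),2,4,7,8);
```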

streefname (source of starting tree and/or model)

       streefname = (random, stepwise, <filename>) – Specifies where the starting tree topology and/or 
       model parameters will come from.  The tree topology may be a completely random topology 
       (constraints will be enforced), a tree provided by the user in a file, or a tree generated by the 
       program using a fast ML stepwise-addition algorithm (see attachmentspertaxon below). 
       Starting or fixed model parameter values may also be provided in the specified file, with or 
       without a tree topology.   Some notes on starting trees/models: 
  • Specified starting trees may have polytomies, which will be arbitrarily resolved before the run begins.
  • Starting tree formats:
    • Plain newick tree string (with taxon numbers or names, with or without branch lengths)
    • NEXUS trees block. The trees block can appear in the same file as a NEXUS data or characters block that contains the alignment, although the same filename should then be specified on both the datafname and streefname lines.
  • If multiple trees appear in the specified file and multiple search replicates are specified (see searchreps setting), then the first tree is used in the first replicate, the second in the second replicate, etc.
  • Providing model parameter values: see the page “Specifying model parameter values”.
  • See also the FAQ items on model parameters.

attachmentspertaxon (control creation of stepwise addition starting tree)

       attachmentspertaxon = (1 to infinity, 50) – The number of attachment branches evaluated for each taxon to be added to the tree during the creation of an ML stepwise-addition starting tree.  Briefly, stepwise addition is an algorithm used to make a tree, and involves adding taxa in a random order to a growing tree.  For each taxon to be added, a number of randomly chosen attachment branches are tried and scored, and the best-scoring one is chosen as the location of that taxon.  The attachmentspertaxon setting controls how many attachment points are evaluated for each taxon to be added.  A value of one is equivalent to a completely random tree (only one randomly chosen location is evaluated).  A value greater than 2 times the number of taxa in the dataset means that all attachment points will be evaluated for each taxon, and will result in very good starting trees (but may take a while on large datasets).  Even fairly small values (< 10) can result in starting trees that are much, much better than random, but still fairly different from one another. 
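As a minimal sketch, the streefname and attachmentspertaxon settings might be combined like this in the configuration file. The values are illustrative rather than recommendations. For the 8-taxon example used earlier, any attachmentspertaxon value above 16 (2 times the number of taxa) would evaluate every attachment point.

```
streefname = stepwise
attachmentspertaxon = 50
```

With streefname = stepwise, lowering attachmentspertaxon to 1 would instead give an effectively random starting tree, since only one randomly chosen location is evaluated per taxon.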

ofprefix (output filename prefix)

       ofprefix = (text) – Prefix of all output filenames, such as log, treelog, etc.  Change this for each run that you do or the program will overwrite previous results. 

randseed (random number seed)

       randseed = (-1 or positive integers, -1) – The random number seed used by the random number generator.  Specify “-1” to have a seed chosen for you.  Specifying the same seed number in multiple runs will give exactly identical results, if all other parameters and settings are also identical. 

availablememory (control maximum program memory usage)

       availablememory – Typically this is the amount of available physical memory on the system, in megabytes.  This lets GARLI determine how much system memory it may be able to use to store computations for reuse.  The program will be conservative and use at most about 90% of this value, although for typical datasets it will use much, much less.  If other programs must be open or used while GARLI is running, you may need to reduce this value. This setting can also have a significant effect on performance (speed), but more is not always better.  When a run is started, GARLI will output the availablememory value necessary for that dataset to achieve each of the “memory levels” (from best to worst, termed “great”, “good”, “low”, and “very low”).  More memory is generally better because more calculations are stored and can be reused, but when the amount of memory needed for “great” becomes more than 512 megabytes or so, performance can be slowed because the operating system has difficulty managing that much memory.  In general, choose an amount of memory that allows “great” when this is less than 512 MB, and if it is greater, reduce the amount of memory into “good” or “low” as necessary.  Avoid “very low” whenever possible.  You can find the value that is approximately optimal for your dataset by setting randseed to some positive value (so that the searches are identical) and doing runs with various availablememory values.  Typically it will make at most a 30% difference, so it isn't worth worrying about too much. 
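The timing comparison described above could be set up roughly as follows. This is only a sketch: the seed 12345 and the prefix memtest_512 are hypothetical values chosen for illustration, and 512 is simply the threshold mentioned in the text.

```
ofprefix = memtest_512
randseed = 12345
availablememory = 512
```

Repeating the run with only availablememory (and ofprefix, so that earlier output is not overwritten) changed, and comparing runtimes, should indicate which memory level works best for a given dataset.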

logevery (frequency to log best score to file)

       logevery = (1 to infinity, 10) – The frequency at which the best score is written to the log file. 

saveevery (frequency to save best tree to file or write checkpoints)

       saveevery = (1 to infinity, 100) – If writecheckpoints or outputcurrentbesttopology are specified, this is the frequency (in generations) at which checkpoints or the current best tree are written to file. 

refinestart (whether to optimize a bit before starting a search)

       refinestart = (0 or 1, 1) – Specifies whether some initial rough optimization is performed on the starting branch lengths and rate heterogeneity parameters.  This is always recommended. 

outputcurrentbesttopology (continuously write the best tree to file during a run)

       outputcurrentbesttopology = (0 or 1, 0) – If true, the current best tree of the current search replicate is written to <ofprefix>.best.current.tre every saveevery generations.  In versions before 0.96 the current best topology was always written to file, but that is no longer the case. Seeing the current best tree has no real use apart from satisfying your curiosity about how a run is going.

outputeachbettertopology (write each improved topology to file)

       outputeachbettertopology = (0 or 1, 0) – If true, each new topology encountered with a better score than the previous best is written to file.  In some cases this can result in very large files (hundreds of MB), especially for random starting topologies on large datasets.  Note that this file is useful for getting an idea of how the topology changed as the searches progressed, but the collection of trees should NOT be interpreted in any meaningful way.  This option is not available while bootstrapping. 

enforcetermconditions (use automatic termination)

       enforcetermconditions = (0 or 1, 1) – Specifies whether the automatic termination conditions will be used.  The conditions specified by both of the following two parameters (genthreshfortopoterm and scorethreshforterm) must be met; see those parameters for their definitions.  If this is false, the run will continue until it reaches the time (stoptime) or generation (stopgen) limit.  It is highly recommended that this option be used! 

genthreshfortopoterm (number of generations without topology improvement required for termination)

       genthreshfortopoterm = (1 to infinity, 20,000) – This specifies the first part of the termination condition.  When no new significantly better scoring topology (see significanttopochange below) has been encountered in more than this number of generations, this condition is met. Increasing this parameter may improve the lnL scores obtained (especially on large datasets), but will also increase runtimes. 

scorethreshforterm (max score improvement over recent generations required for termination)

       scorethreshforterm = (0 to infinity, 0.05) – The second part of the termination condition.  When the total improvement in score over the last intervallength x intervalstostore generations (default is 500 generations, see below) is less than this value, this condition is met.  This does not usually need to be changed. 

significanttopochange (required score improvement for topology to be considered better)

       significanttopochange = (0 to infinity, 0.01) – The lnL increase required for a new topology to be 
       considered significant as far as the termination condition is concerned. It probably doesn’t 
       need to be played with, but you might try increasing it slightly if your runs reach a stable 
       score and then take a very long time to terminate due to very minor changes in topology. 
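Taken together, the automatic termination settings described above would appear in the configuration file roughly as follows. This sketch simply writes out the defaults listed in each entry (with 20,000 written without the thousands separator, which is an assumption about the accepted number format).

```
enforcetermconditions = 1
genthreshfortopoterm = 20000
scorethreshforterm = 0.05
significanttopochange = 0.01
```

With these values, a search replicate would stop once more than 20000 generations have passed without a topology improving the lnL by at least 0.01, and the total score improvement over the most recent window of generations is below 0.05.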

outputphyliptree (write trees to file in Phylip as well as Nexus format)

       outputphyliptree = (0 or 1, 0) – Whether PHYLIP-formatted tree files will be output in addition to 
       the default NEXUS files for the best tree across all replicates (<ofprefix>.best.phy), the best 
       tree for each replicate (<ofprefix>.best.all.phy) or, in the case of bootstrapping, the best tree 
       for each bootstrap replicate (<ofprefix>.boot.phy). 

outputmostlyuselessfiles (output uninteresting files)

       outputmostlyuselessfiles = (0 or 1, 0) – Whether to output three files of little general interest: the 
       “fate”, “problog” and “swaplog” files.  The fate file shows the parentage, mutation types and 
       scores of every individual in the population during the entire search.  The problog shows how 
       the proportions of the different mutation types changed over the course of the run.  The 
       swaplog shows the number of unique swaps and the number of total swaps on the current 
       best tree over the course of the run. 

writecheckpoints (write checkpoint files during run)

       writecheckpoints = (0 or 1, 0) – Whether to write three files to disk containing all information 
       about the current state of the population every saveevery generations, with each successive 
       checkpoint overwriting the previous one.  These files can be used to restart a run at the last 
       written checkpoint by setting the restart configuration entry.

restart (restart run from checkpoint)

       restart = (0 or 1, 0) – Whether to restart at a previously saved checkpoint.  To use this option the writecheckpoints option must have been used during a previous run.  The program will look for checkpoint files that are named based on the ofprefix of the previous run.  If you intend to restart a run, NOTHING should be changed in the config file except setting restart to 1. 
       A run that is restarted from a checkpoint will give exactly the same results as it would have if the original run had gone to completion. 
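A checkpointed run and its later restart might be configured along these lines. This is a sketch in which the prefix run1 is a hypothetical name and saveevery is left at its default.

```
ofprefix = run1
writecheckpoints = 1
saveevery = 100
restart = 0
```

To resume after an interruption, the same file would be reused with a single line changed:

```
restart = 1
```

Leaving ofprefix untouched matters here, because the checkpoint files are located by the prefix of the previous run.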

outgroup (orient inferred trees consistently)

       outgroup = (outgroup taxon numbers, separated by spaces) – This option allows for orienting the tree topologies in a consistent way when they are written to file.  Note that this has NO effect whatsoever on the actual inference, and the specified outgroup is NOT constrained to be present in the inferred trees.  If multiple outgroup taxa are specified and they do not form a monophyletic group in the inferred tree, this setting will be ignored.  If you specify a single outgroup taxon it will always be present, and the tree will always be consistently oriented.  Ranges can be indicated with a hyphen, e.g., to specify an outgroup consisting of taxa 1, 2, 3 and 5 the format is this: 
        outgroup = 1-3 5

searchreps (number of independent search replicates)

        searchreps = (1 to infinity, 2) – The number of independent search replicates to perform during a program execution.  You should always either do multiple search replicates or multiple program executions with any dataset to get a feel for whether you are getting consistent results, which suggests that the program is doing a decent job of searching.  Note that if this is > 1 and you are performing a bootstrap analysis, this is the number of search replicates to be done per bootstrap replicate.  That can increase the chance of finding the best tree per bootstrap replicate, but will also increase bootstrap runtimes enormously.

bootstrapreps (number of bootstrap replicates)

        bootstrapreps = (0 to infinity, 0) – The number of bootstrap reps to perform.  If this is greater than 
        0, normal searching will not be performed.  The resulting bootstrap trees (one per rep) will be 
        output to a file named <ofprefix>.boot.tre.  Note that it is probably safe to reduce the 
        strictness of the termination conditions during bootstrapping (perhaps halve 
        genthreshfortopoterm), which will greatly speed up the bootstrapping process with 
        negligible effects on the results.
        Note that GARLI does NOT do the bootstrap consensus or calculate the bootstrap proportions itself.  It simply infers the trees that are the input to that consensus.  The trees in the .boot.tre file will need to be read into a program such as SumTrees (or PAUP* or CONSENSE) to do a majority-rule consensus and obtain your bootstrap support values. See the following part of the "Advanced topics" page for suggestions on making the consensus: Detailed Example: A bootstrap analysis.

resampleproportion (relative size of re-sampled data matrix)

        resampleproportion = (0.1 to 10, 1.0) – When bootstrapreps > 0, this setting allows for 
        bootstrap-like resampling, but with the pseudoreplicate datasets having a number of 
        alignment columns different from that of the real data.  Setting values < 1.0 is somewhat similar to jackknifing, but not identical. 
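A bootstrap run following the suggestions above might be configured as in this sketch. The replicate count of 100 is a hypothetical choice, and 10000 is simply half of the default genthreshfortopoterm of 20,000.

```
bootstrapreps = 100
searchreps = 1
genthreshfortopoterm = 10000
resampleproportion = 1.0
```

Keeping searchreps at 1 avoids multiplying the runtime, since searchreps is applied per bootstrap replicate. The trees written to <ofprefix>.boot.tre still need to be summarized in a separate program.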

inferinternalstateprobs (infer ancestral states)

        inferinternalstateprobs = (0 or 1, 0) – Specify 1 to have GARLI infer the marginal posterior 
        probability of each character at each internal node.  This is done at the very end of the run, 
        just before termination.  The results are output to a file named <ofprefix>.internalstates.log.

outputsitelikelihoods (write a file with the log-likelihood of each site)

        outputsitelikelihoods = (0 or 1, 0) - Causes a file named <ofprefix>.sitelikes.log to be created containing the sitewise log-likelihood values for each individual site for the final tree of each search replicate.  Note that the format of the file is exactly the same as that output by PAUP, so it can be directly used in CONSEL using the PAUP file-format option.  For partitioned models the sites are listed in sets by partition subset, but the site numbers will match those in the original datamatrix.  Thus, when using a partitioned model that divides sites by codon position, the ordering would be 1, 4, 7, ... 2, 5, 8, ... 3, 6, 9 ....  Note that this makes no difference to CONSEL, which just assumes that the site likelihoods are ordered the same for each tree and ignores the site numbers.  
        Also note that using the site likelihoods from a partitioned model is a violation of CONSEL's assumptions, since it will not know about the partitions and its internal bootstrap resampling will not obey them.  It isn't clear what the effects of this will be on the various tests.  
        Finally, if you want to calculate the site likelihoods for a set of trees that you provide, use the  optimizeinputonly setting.

optimizeinputonly (do not search, only optimize model and branch lengths on user trees)

        (new in version 2.0)
        optimizeinputonly = (0 or 1, 0) - Requires a NEXUS formatted input tree file (only NEXUS format will work!), specified through the  streefname setting.  All trees in that file will have model parameters and branch lengths optimized, giving their maximum likelihood scores.  A file with the site-likelihoods of each tree will also be output.  See the  outputsitelikelihoods setting for details.
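As a sketch of scoring user-supplied trees without searching, the relevant entries might look like this. The filenames mydata.nex and mytrees.nex are hypothetical, and the tree file is assumed to be in NEXUS format as required above.

```
datafname = mydata.nex
streefname = mytrees.nex
optimizeinputonly = 1
```

Each tree in the file would then have its branch lengths and model parameters optimized, with site likelihoods written as described under outputsitelikelihoods.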

collapsebranches (collapse zero length branches before writing final trees to file)

        collapsebranches = (0 or 1, 1) - Specifies whether effectively zero-length branches are collapsed into polytomies before the final trees are written to file.  Before version 1.0, all trees that were returned were fully resolved. This was true even if the maximum-likelihood estimate of some internal branch lengths was effectively zero (GARLI will never return a tree with an actual branch length of zero, but rather with its minimum value of 1e-8).  In such cases, collapsing the branch into a polytomy would be a better representation. The drawback of always returning fully resolved trees is that what is effectively a polytomy can be resolved in three ways, and different independent searches may randomly return one of those resolutions. Thus, if you compare the trees by topology only, they will look different. If you pay attention to the branch lengths and likelihood scores of the trees it will be apparent that they are effectively the same.
        I think that collapsing of branches is particularly important when bootstrapping, since no support should be given to a branch that doesn't really exist, i.e., that is a random resolution of a polytomy.  Collapsing is also good when calculating tree to tree distances such as the symmetric tree distance, for example when calculating phylogenetic error to a known target tree.  Zero-length branches would add to the distances (~error) although they really should not.

Model specification settings

        With version 1.0 and later there are now many more options dealing with model specification because of 
        the inclusion of amino acid and codon-based models.  The description of the settings will be 
        broken up by data type.  Note that in terms of the model settings in GARLI, “empirical” means 
        to fix parameter values at those observed in the dataset being analyzed, and “fixed” means to fix 
        them at user specified values.  See the streefname setting for details on how to provide 
        parameter values to be fixed during inference.
        PARTITIONED MODELS: To use partitioned models you'll still need to use the same basic model settings detailed below for each individual partition subset, and then some other details to set up the partition model itself.  Be sure that you are familiar with the rest of this section, then see  Using partitioned models.

datatype (sequence type and inference model)

        datatype = (nucleotide, aminoacid, codon-aminoacid, codon, standard, standardordered, standardvariable, standardvariableordered) – The type of data and model that is 
        to be used during tree inference.  Nucleotide and amino acid data are self-explanatory.
        The codon-aminoacid datatype means that the data will be 
        supplied as a nucleotide alignment, but will be internally translated and analyzed using an 
        amino acid model.  The codon and codon-aminoacid datatypes require nucleotide sequence 
        that is aligned in the correct reading frame. In other words, all gaps in the alignment should 
        be a multiple of 3 in length, and the alignment should start at the first position of a codon.  If 
        the alignment has extra columns at the start, middle or end, they should be removed or 
        excluded with a Nexus exset (see  this FAQ item for an example of exset usage).  The correct 
         geneticcode must also be set.
        (New in Version 2.0)
        The various "standard" datatypes are new in GARLI 2.0.  These represent morphology-like discrete characters, with any number of states.  These are also termed the "Mk" and "Mkv" models by Lewis (2001). See this page for more details on using standard data: Mkv morphology model.

Settings for datatype = nucleotide

ratematrix (relative rate parameters assumed by substitution model)

        ratematrix = (1rate, 2rate, 6rate, fixed, custom string) – The number of relative substitution rate 
        parameters (note that the number of free parameters is this value minus one).  Equivalent to 
        the “nst” setting in PAUP* and MrBayes.  1rate assumes that substitutions between all pairs 
        of nucleotides occur at the same rate (JC model), 2rate allows different rates for transitions and 
        transversions (K2P or HKY models), and 6rate allows a different rate between each nucleotide pair (GTR).  These rates are 
        estimated unless the fixed option is chosen.  Since version 0.96, parameters for any 
        submodel of the GTR model may be estimated.  The format for specifying this is very 
        similar to that used in the “rclass” setting of PAUP*.  Within parentheses, six letters are 
        specified, with spaces between them.  The six letters represent the rates of substitution 
        between the six pairs of nucleotides, with the order being A-C, A-G, A-T, C-G, C-T and G-T. 
        Letters within the parentheses that are the same mean that a single parameter is shared by 
        multiple nucleotide pairs.  For example, 
         ratematrix = (a b a a b a) 
         would specify the HKY 2-rate model (equivalent to ratematrix = 2rate).  This entry, 
          ratematrix = (a b c c b a) 
          would specify 3 estimated rates of substitution, with one rate shared by A-C and G-T 
          substitutions, another rate shared by A-G and C-T substitutions, and the final rate shared by 
          A-T and C-G substitutions.

statefrequencies (equilibrium base frequencies assumed by substitution model)

          statefrequencies = (equal, empirical, estimate, fixed) – Specifies how the equilibrium state 
          frequencies (A, C, G and T) are treated.  The empirical setting fixes the frequencies at their 
          observed proportions, and the other options should be self-explanatory. 

For datatype = nucleotide or aminoacid

invariantsites (treatment of proportion of invariable sites parameter)

          invariantsites = (none, estimate, fixed) – Specifies whether a parameter representing the 
          proportion of sites that are unable to change (i.e. have a substitution rate of zero) will be 
          included.  This is typically referred to as “invariant sites”, but would better be termed 
          “invariable sites”. 

ratehetmodel (type of rate heterogeneity to assume for variable sites)

          ratehetmodel = (none, gamma, gammafixed) – The model of rate heterogeneity assumed. 
          “gammafixed” requires that the alpha shape parameter is provided, and a setting of “gamma” 
          estimates it. 

numratecats (number of overall substitution rate categories)

          numratecats = (1 to 20, 4) – The number of categories of variable rates (not including the 
          invariant site class if it is being used).  Must be set to 1 if ratehetmodel is set to none.  Note 
          that runtimes and memory usage scale linearly with this setting. 
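Putting the nucleotide settings together, a model commonly written as GTR+I+G could be specified roughly as follows. This is a sketch using only option values listed in the entries above, and the choice of model is illustrative.

```
datatype = nucleotide
ratematrix = 6rate
statefrequencies = estimate
ratehetmodel = gamma
numratecats = 4
invariantsites = estimate
```

Replacing ratematrix = 6rate with 2rate would give the HKY-like equivalent, and setting ratehetmodel = none would require numratecats = 1.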

For datatype = aminoacid or codon-aminoacid

          Amino acid analyses are typically done using fixed rate matrices that have been estimated on 
          large datasets and published.  Typically the only model parameters that are estimated during tree 
          inference relate to the rate heterogeneity distribution.  Each of the named matrices also has 
          corresponding fixed amino acid frequencies, and a given rate matrix can either be used with those 
          frequencies or with the amino acid frequencies observed in your dataset.  This second option is 
          often denoted as “+F” in a model description, although in terms of the GARLI configuration 
          settings this is referred to as “empirical” frequencies.  In GARLI the Dayhoff model would be 
          specified by setting both the ratematrix and statefrequencies options to “dayhoff”.  The 
          Dayhoff+F model would be specified by setting the ratematrix to “dayhoff”, and 
          statefrequencies to “empirical”.
          The following named amino acid models are implemented: 
          ratematrix/statefrequencies setting    reference
          dayhoff                                Dayhoff, Schwartz and Orcutt, 1978
          jones                                  Jones, Taylor and Thornton (JTT), 1992
          WAG                                    Whelan and Goldman, 2001
          mtREV                                  Adachi and Hasegawa, 1996
          mtmam                                  Yang, Nielsen and Hasegawa, 1998
          Note that most programs allow either the use of a named rate matrix and its corresponding state 
          frequencies, or a named rate matrix and empirical frequencies.  GARLI technically allows the 
          mixing of different named matrices and equilibrium frequencies (for example, wag matrix with 
          jones equilibrium frequencies), but this is not recommended. 

ratematrix (amino acid substitution rates)

          ratematrix = (poisson, jones, dayhoff, wag, mtmam, mtrev) – The fixed amino acid rate matrix to 
          use.  You should use the matrix that gives the best likelihood, and could use a program like 
          PROTTEST (very much like MODELTEST, but for amino acid models) to determine which 
          fits best for your data.  Poisson assumes a single rate of substitution between all amino acid 
          pairs, and is a very poor model.

statefrequencies (equilibrium amino acid frequencies assumed by substitution model)

          statefrequencies = (equal, empirical, estimate, fixed, jones, dayhoff, wag, mtmam, mtrev) – 
          Specifies how the equilibrium state frequencies of the 20 amino acids are treated.  The 
          “empirical” option fixes the frequencies at their observed proportions (when describing a 
          model this is often termed “+F”).  
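The Dayhoff and Dayhoff+F models described above would be written along these lines. This is a sketch that shows only the entries that differ between the two.

```
datatype = aminoacid
ratematrix = dayhoff
statefrequencies = dayhoff
```

and, for Dayhoff+F:

```
datatype = aminoacid
ratematrix = dayhoff
statefrequencies = empirical
```

The rate heterogeneity entries (ratehetmodel, numratecats, invariantsites) would be set in the same way as for nucleotide data.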

For datatype = codon

          The codon models are built with three components: (1) parameters describing the process of 
          individual nucleotide substitutions, (2) equilibrium codon frequencies, and (3) parameters 
          describing the relative rate of nonsynonymous to synonymous substitutions.  The nucleotide 
          substitution parameters within the codon models are exactly the same as those possible with 
          standard nucleotide models in GARLI, and are specified with the ratematrix configuration 
          entry.  Thus, they can be of the 2rate variety (inferring different rates for transitions and 
          transversions, K2P or HKY-like), the 6rate variety (inferring different rates for all nucleotide 
          pairs, GTR-like) or any other sub-model of GTR.  The options for codon frequencies are 
          specified with the statefrequencies configuration entry.  The options are to use equal 
          frequencies (not a good option), the frequencies observed in your dataset (termed “empirical” in 
          GARLI), or the codon frequencies implied by the “F1x4” or “F3x4” methods (using PAML's 
          terminology).  These last two options calculate the codon frequencies as the product of the 
          frequencies of the three nucleotides that make up each codon.  In the “F1x4” case the nucleotide 
          frequencies are those observed in the dataset across all codon positions, while the “F3x4” option 
          uses the nucleotide frequencies observed in the data at each codon position separately.  The final 
          component of the codon models is the nonsynonymous to synonymous relative rate parameters 
          (aka dN/dS or omega parameters).  The default is to infer a single dN/dS value.  Alternatively, a 
          model can be specified that infers a given number of dN/dS categories, with the dN/dS values 
          and proportions falling in each category estimated (ratehetmodel = nonsynonymous).  This is 
          the “discrete” or “M3” model in PAML's terminology. 

ratematrix (relative nucleotide rate parameters assumed by codon model)

          ratematrix = (1rate, 2rate, 6rate, fixed, custom string) – This determines the relative rates of 
          nucleotide substitution assumed by the codon model.  The options are exactly the same as 
          those allowed under a normal nucleotide model.  A codon model with ratematrix = 2rate 
          specifies the standard Goldman and Yang (1994) model, with different substitution rates for 
          transitions and transversions. 

statefrequencies (equilibrium codon frequencies)

          statefrequencies = (equal, empirical, f1x4, f3x4) - The options are to use equal codon frequencies 
          (not a good option), the frequencies observed in your dataset (termed “empirical” in GARLI), 
          or the codon frequencies implied by the “F1x4” or “F3x4” methods (using PAML's 
          terminology).  These last two options calculate the codon frequencies as the product of the 
          frequencies of the three nucleotides that make up each codon.  In the “F1x4” case the 
          nucleotide frequencies are those observed in the dataset across all codon positions, while the 
          “F3x4” option uses the nucleotide frequencies observed in the data at each codon position 
          separately.

ratehetmodel (variation in dN/dS across sites)

          ratehetmodel = (none, nonsynonymous) – For codon models, the default is to infer a single dN/dS 
          parameter.  Alternatively, a model can be specified that infers a given number of dN/dS 
          categories, with the dN/dS values and proportions falling in each category estimated 
          (ratehetmodel = nonsynonymous).  This is the “discrete” or “M3” model of Yang et al. 

numratecats (number of discrete dN/dS categories)

          numratecats = (1 to 20, 1) – When ratehetmodel = nonsynonymous, this is the number of dN/dS parameter categories.


          invariantsites = (none) - NOTE: Due to an error on my part, the invariantsites entry must appear with codon models for them to run (despite the fact that it doesn't apply).  i.e., be sure that this appears:
           invariantsites = none

For datatype = codon or codon-aminoacid

geneticcode (code to use in codon translation)

           geneticcode = (standard, vertmito, invertmito) – The genetic code to be used in translating codons 
           into amino acids.

Population settings

nindivs (number of individuals in population)

           nindivs = (2 to 100, 4)- The number of individuals in the population.  This may be increased, but 
           doing so is generally not beneficial.  Note that typical genetic algorithms tend to have much, 
           much larger population sizes than GARLI's defaults. 

holdover (unmutated copies of best individual)

           holdover = (1 to nindivs-1, 1)- The number of times the best individual is copied to the next 
           generation with no chance of mutation.  It is best not to mess with this. 

selectionintensity (strength of selection)

           selectionintensity = (0.01 to 5.0, 0.5)- Controls the strength of selection, with larger numbers 
           denoting stronger selection.  The relative probability of reproduction of two individuals 
           depends on the difference in their log likelihoods (ΔlnL) and is formulated very similarly to 
           the procedure of calculating Akaike weights.  The relative probability of reproduction of the 
           less fit individual is equal to:
           e^(-selectionintensity × ΔlnL)
           In general, this setting does not seem to have much of an effect on the progress of a run.  In 
           theory higher values should cause scores to increase more quickly, but make the search more 
           likely to be entrapped in a local optimum.  Low values will increase runtimes, but may be 
           more likely to reach the true optimum. The following table gives the relative probabilities of 
           reproduction for different values of the selection intensity when the difference in log 
           likelihood is 1.0
           selectionintensity value    Ratio of probabilities of reproduction
           0.05                        0.95:1.0
           0.1                         0.90:1.0
           0.25                        0.78:1.0
           0.5                         0.61:1.0
           0.75                        0.47:1.0
           1.0                         0.37:1.0
           2.0                         0.14:1.0
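
The ratios in the table follow directly from the formula above and can be reproduced with a short Python sketch. The function name is hypothetical and this is not GARLI's own code, only the stated formula evaluated at ΔlnL = 1.0.

```python
import math

def relative_reproduction_probability(selection_intensity, delta_lnl):
    """Reproduction probability of the less fit individual, relative to 1.0
    for the fitter one, using exp(-selectionintensity * delta lnL)."""
    return math.exp(-selection_intensity * delta_lnl)

# Reproduce the table for a log-likelihood difference of 1.0
for s in (0.05, 0.1, 0.25, 0.5, 0.75, 1.0, 2.0):
    ratio = relative_reproduction_probability(s, 1.0)
    print(f"{s:<5} {ratio:.2f}:1.0")
```

Passing a larger ΔlnL to the same function shows how quickly the less fit individual's chance of reproduction falls off as the score gap widens.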

holdoverpenalty (fitness handicap for best individual)

           holdoverpenalty = (0 to 100, 0) – This can be used to bias the probability of reproduction of the 
           best individual downward.  Because the best individual is automatically copied into the next 
           generation, it has a bit of an unfair advantage and can cause all population variation to be lost 
           due to genetic drift, especially with small population sizes.  The value specified here is 
           subtracted from the best individual’s lnL score before calculating the probabilities of
           reproduction.  It seems plausible that this might help maintain variation, but I have not seen it 
           cause a measurable effect.

stopgen (maximum number of generations to run)

           stopgen – The maximum number of generations to run.  Note that this supersedes the automated 
           stopping criterion (see enforcetermconditions  above), and should therefore be set to a very 
           large value if automatic termination is desired. 

stoptime (maximum time to run)

           stoptime – The maximum number of seconds for the run to continue. Note that this supersedes 
           the automated stopping criterion (see enforcetermconditions  above), and should therefore 
           be set to a very large value if automatic termination is desired.

Branch-length optimization settings

           After a topological rearrangement, branch lengths in the vicinity of the rearrangement are 
           optimized by the Newton-Raphson method.  Optimization passes are performed on a particular 
           branch until the expected improvement in likelihood for the next pass is less than a threshold 
           value, termed the optimization precision.  Note that this name is somewhat misleading, as the 
           precision of the optimization algorithm is inversely related to this value (i.e., smaller values of 
           the optimization precision lead to more precise optimization).  If the improvement in likelihood 
           due to optimization for a particular branch is greater than the optimization precision, 
           optimization is also attempted on adjacent branches, spreading out across the tree.  When no new 
           topology with a better likelihood score is discovered for a while, the value is automatically 
           reduced.  The value can have a large effect on speed, with smaller values significantly slowing 
           down the algorithm.  The value of the optimization precision and how it changes over the course 
           of a run are determined by the following three parameters. 


           startoptprec (0.005 to 5.0, 0.5) – The beginning optimization precision. 


           minoptprec (0.001 to startoptprec, 0.01) – The minimum allowed value of the optimization precision.


           numberofprecreductions (0 to 100, 10) – Specify the number of steps that it will take for the 
           optimization precision to decrease (linearly) from startoptprec to minoptprec. 
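
As a rough illustration of how these three settings interact, the following Python sketch lists the precision values visited under a linear decrease. The function name is hypothetical, and the treatment of numberofprecreductions = 0 (precision stays at startoptprec) is an assumption, not something stated here.

```python
def optprec_schedule(startoptprec=0.5, minoptprec=0.01, numberofprecreductions=10):
    """Precision values from startoptprec down to minoptprec in equal steps."""
    if numberofprecreductions == 0:
        # Assumption: with no reductions the precision never changes
        return [startoptprec]
    step = (startoptprec - minoptprec) / numberofprecreductions
    values = [startoptprec - i * step for i in range(numberofprecreductions)]
    values.append(minoptprec)  # end exactly on the minimum, avoiding float drift
    return values

# With the defaults, each reduction lowers the precision by 0.049
for value in optprec_schedule():
    print(f"{value:.3f}")
```

With the default values the schedule has eleven entries: the starting precision plus one per reduction.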


           treerejectionthreshold (0 to 500, 50) – This setting controls which trees have more extensive 
           branch-length optimization applied to them.  All trees created by a branch swap receive 
           optimization on a few branches that directly took part in the rearrangement.  If the difference 
           in score between the partially optimized tree and the best known tree is greater than treerejectionthreshold, no further optimization is applied to the branches of that tree. 
           Reducing this value can significantly reduce runtimes, often with little or no effect on results. 
           However, it is possible that a better tree could be missed if this is set too low.  In cases in 
           which obtaining the very best tree per search is not critical (e.g., bootstrapping), setting this 
           lower (~20) is probably safe.

Settings controlling the proportions of the mutation types

           Each mutation type is assigned a prior weight. These values determine the expected 
           proportions of the various mutation types that are performed.  The primary mutation categories 
           are topology (t), model (m) and branch length (b).  Each is assigned a prior weight ( Pi ) in the 
           config file.  Each time that a new best likelihood score is attained, the amount of the increase in 
           score is credited to the mutation type responsible, with the sum of the increases ( Si ) maintained 
           over the last intervallength x intervalstostore generations.  The number of times that each 
           mutation is performed ( Ni ) is also tallied.  The total weight of a mutation type is Wi = Pi + ( Si / Ni ). 
           The proportion of mutations of type i out of all mutations is then 
           Pr(i) = Wi / (Wt + Wm + Wb) 
           The proportion of each mutation is thus related to its prior weight and the average increase in 
           score that it has caused over recent generations.  The prior weights can be used to control the 
           expected (and starting) proportions of the mutation types, as well as how sensitive the 
           proportions are to the course of events in a run.  It is generally a good idea to make the topology 
           prior much larger than the others so that when no mutations are improving the score many 
           topology mutations are still attempted.  If you set outputmostlyuselessfiles to 1, you can look at the “problog” file to determine what the 
           proportions of the mutations actually were over the course of a run. 
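
The weighting scheme described above can be sketched in Python as follows. The function and dictionary keys are hypothetical, the score gains and counts in the second call are made-up illustrative numbers, and treating a mutation type with Ni = 0 as contributing only its prior weight is an assumption.

```python
def mutation_proportions(prior, score_gain, count):
    """Return Pr(i) = Wi / sum(W), with Wi = Pi + Si / Ni.

    prior, score_gain and count are dicts keyed by mutation type.
    """
    weights = {}
    for kind in prior:
        # Assumption: a type not attempted in the stored window has Si / Ni = 0
        avg_gain = score_gain[kind] / count[kind] if count[kind] else 0.0
        weights[kind] = prior[kind] + avg_gain
    total = sum(weights.values())
    return {kind: w / total for kind, w in weights.items()}

# Default prior weights: topoweight, modweight, brlenweight
priors = {"topology": 1.0, "model": 0.05, "branch_length": 0.2}

# Start of a run: nothing tallied yet, so proportions come from the priors alone
zeros = {kind: 0 for kind in priors}
print(mutation_proportions(priors, zeros, zeros))

# Later in a run (illustrative values for Si and Ni over the stored intervals)
gains = {"topology": 12.0, "model": 0.5, "branch_length": 2.0}
counts = {"topology": 400, "model": 50, "branch_length": 100}
print(mutation_proportions(priors, gains, counts))
```

Because the priors are added to the average gains, large priors make the proportions less responsive to recent score increases, while small priors let them shift more freely.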

topoweight (weight on topology mutations)

           topoweight (0 to infinity, 1.0) – The prior weight assigned to the class of topology mutations 
           (NNI, SPR and limSPR). Note that setting this to 0.0 turns off topology mutations, meaning that the tree topology is fixed for the run.  This used to be a way to have the program estimate only model parameters and branch lengths, but the optimizeinputonly setting is now a better way to go.

modweight (weight on model parameter mutations)

           modweight (0 to infinity, 0.05) – The prior weight assigned to the class of model mutations.  Note 
           that setting this at 0.0 fixes the model during the run. 

brlenweight (weight on branch-length parameter mutations)

           brlenweight (0 to infinity, 0.2) – The prior weight assigned to branch-length mutations.

           The same procedure used above to determine the proportion of Topology:Model:Branch-Length 
           mutations is also used to determine the relative proportions of the three types of topological 
           mutations (NNI:SPR:limSPR), controlled by the following three weights. Note that the 
           proportion of mutations applied to each of the model parameters is not user controlled. 

randnniweight (weight on NNI topology changes)

           randnniweight (0 to infinity, 0.1) - The prior weight assigned to NNI mutations. 

randsprweight (weight on SPR topology changes)

           randsprweight (0 to infinity, 0.3) - The prior weight assigned to random SPR mutations.  For 
           very large datasets it is often best to set this to 0.0, as random SPR mutations essentially 
           never result in score increases. 

limsprweight (weight on localized SPR topology changes)

           limsprweight (0 to infinity, 0.6) - The prior weight assigned to SPR mutations with the 
           reconnection branch limited to being a maximum of limsprrange branches away from where 
           the branch was detached. 


           intervallength (10 to 1000, 100) – The number of generations in each interval during which the 
           number and benefit of each mutation type are stored. 


           intervalstostore = (1 to 10, 5) – The number of intervals to be stored.  Thus, records of 
           mutations are kept for the last (intervallength x intervalstostore) generations.  Every 
           intervallength generations the probabilities of the mutation types are updated by the scheme 
           described above.

Settings controlling mutation details

limsprrange (max range for localized SPR topology changes)

           limsprrange (0 to infinity, 6) – The maximum number of branches away from its original 
           location that a branch may be reattached during a limited SPR move.  Setting this too high (> 
           10) can seriously degrade performance, but if you do so in conjunction with a large increase in genthreshfortopoterm you might end up with better trees.

meanbrlenmuts (mean # of branch lengths to change per mutation)

           meanbrlenmuts (1 to # taxa, 5) - The mean of the binomial distribution from which the number 
           of branch lengths mutated is drawn during a branch length mutation. 

gammashapebrlen (magnitude of branch-length mutations)

           gammashapebrlen (50 to 2000, 1000) - The shape parameter of the gamma distribution (with a 
           mean of 1.0) from which the branch-length multipliers are drawn for branch-length 
           mutations.  Larger numbers cause smaller changes in branch lengths.  (Note that this has 
           nothing to do with gamma rate heterogeneity.) 

gammashapemodel (magnitude of model parameter mutations)

           gammashapemodel (50 to 2000, 1000) - The shape parameter of the gamma distribution (with a 
           mean of 1.0) from which the model mutation multipliers are drawn for model parameters 
           mutations. Larger numbers cause smaller changes in model parameters. (Note that this has 
           nothing to do with gamma rate heterogeneity.) 
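
To get a feel for what the shape values of gammashapebrlen and gammashapemodel mean in practice, the Python sketch below draws multipliers from a gamma distribution with mean 1.0. It uses the standard library's random.gammavariate with scale 1/shape; the function name is hypothetical and this is not GARLI's implementation.

```python
import random
import statistics

def draw_multiplier(shape, rng):
    """Draw one multiplier from a gamma distribution with mean 1.0.

    With shape alpha and scale 1/alpha, the mean is 1.0 and the
    standard deviation is 1/sqrt(alpha).
    """
    return rng.gammavariate(shape, 1.0 / shape)

rng = random.Random(1)
for shape in (50, 1000, 2000):
    draws = [draw_multiplier(shape, rng) for _ in range(10000)]
    print(shape,
          round(statistics.mean(draws), 3),
          round(statistics.stdev(draws), 3),
          "theoretical sd:", round(shape ** -0.5, 3))
```

At the default shape of 1000 the theoretical standard deviation is about 0.032, so a typical mutation changes a value by a few percent; at the minimum of 50 it is about 0.141.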

uniqueswapbias (relative weight assigned to already attempted branch swaps)

           uniqueswapbias (0.01 to 1.0, 0.1) –  With version 0.95 and later, GARLI keeps track of which branch 
           swaps it has attempted on the current best tree.  Because swaps are applied randomly, it is 
           possible that some swaps are tried twice before others are tried at all.  This option allows the 
           program to bias the swaps applied toward those that have not yet been attempted.  Each swap 
           is assigned a relative weight depending on the number of times that it has been attempted on 
           the current best tree.  This weight is equal to (uniqueswapbias) raised to the (# times swap 
           attempted) power.  In other words, a value of 0.5 means that swaps that have already been 
           tried once will be half as likely as those not yet attempted, swaps attempted twice will be ¼ 
           as likely, etc.  A value of 1.0 means no biasing.  If this value is not equal to 1.0 and the 
           outputmostlyuselessfiles option is on, a file called <ofprefix>.swap.log is output.  This file 
           shows the total number of rearrangements tried and the number of unique ones over the course 
           of a run.  Note that this bias is only applied to NNI and limSPR rearrangements.  Use of this 
           option may allow the use of somewhat larger values of limsprrange. 
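
The weighting rule for uniqueswapbias is simple enough to sketch in Python. The function names are hypothetical, and normalizing the relative weights into selection probabilities is an assumption about how they are used, not a description of GARLI's internals.

```python
def swap_weight(uniqueswapbias, times_attempted):
    """Relative weight of a swap: uniqueswapbias ** (# times attempted)."""
    return uniqueswapbias ** times_attempted

def swap_probabilities(uniqueswapbias, attempts):
    """Normalize relative weights over a list of per-swap attempt counts."""
    weights = [swap_weight(uniqueswapbias, n) for n in attempts]
    total = sum(weights)
    return [w / total for w in weights]

# Three candidate swaps: never tried, tried once, tried twice
print([swap_weight(0.5, n) for n in (0, 1, 2)])   # relative weights 1.0, 0.5, 0.25
print(swap_probabilities(0.5, [0, 1, 2]))

# A value of 1.0 gives every swap the same weight, i.e. no biasing
print(swap_probabilities(1.0, [0, 1, 2]))
```

Smaller values of uniqueswapbias push the search harder toward untried swaps, since the weight of a repeated swap shrinks geometrically with each attempt.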

distanceswapbias (relative weight assigned to branch swaps based on locality)

           distanceswapbias (0.1 to 10, 1.0) – This option is similar to uniqueswapbias, except that it 
           biases toward certain swaps based on the topological distance between the initial and 
           rearranged trees.  The distance is measured as in the limsprrange, and is half the 
           Robinson-Foulds distance between the trees.  As with uniqueswapbias, distanceswapbias 
           assigns a relative weight to each potential swap.  In this case the weight is 
           (distanceswapbias) raised to the (reconnection distance - 1) power.  Thus, given a value of 
           0.5, the weight of an NNI is 1.0, the weight of an SPR with distance 2 is 0.5, with distance 3 
           is 0.25, etc.  Note that values less than 1.0 bias toward more localized swaps, while values 
           greater than 1.0 bias toward more extreme swaps. Also note that this bias is only applied to 
           limSPR rearrangements.  Be careful in setting this, as extreme values can have a very large