Difference between revisions of "GARLI Configuration Settings"
Latest revision as of 12:32, 21 July 2015
- 1 Descriptions of GARLI configuration settings
- 2 General settings
- 2.1 datafname (file containing sequence dataset)
- 2.2 constraintfile (file containing constraint definition)
- 2.3 streefname (source of starting tree and/or model)
- 2.4 attachmentspertaxon (control creation of stepwise addition starting tree)
- 2.5 ofprefix (output filename prefix)
- 2.6 randseed (random number seed)
- 2.7 availablememory (control maximum program memory usage)
- 2.8 logevery (frequency to log best score to file)
- 2.9 saveevery (frequency to save best tree to file or write checkpoints)
- 2.10 refinestart (whether to optimize a bit before starting a search)
- 2.11 outputcurrentbesttopology (continuously write the best tree to file during a run)
- 2.12 outputeachbettertopology (write each improved topology to file)
- 2.13 enforcetermconditions (use automatic termination)
- 2.14 genthreshfortopoterm (number of generations without topology improvement required for termination)
- 2.15 scorethreshforterm (max score improvement over recent generations required for termination)
- 2.16 significanttopochange (required score improvement for topology to be considered better)
- 2.17 outputphyliptree (write trees to file in Phylip as well as Nexus format)
- 2.18 outputmostlyuselessfiles (output uninteresting files)
- 2.19 writecheckpoints (write checkpoint files during run)
- 2.20 restart (restart run from checkpoint)
- 2.21 outgroup (orient inferred trees consistently)
- 2.22 searchreps (number of independent search replicates)
- 2.23 bootstrapreps (number of bootstrap replicates)
- 2.24 resampleproportion (relative size of re-sampled data matrix)
- 2.25 inferinternalstateprobs (infer ancestral states)
- 2.26 outputsitelikelihoods (write a file with the log-likelihood of each site)
- 2.27 optimizeinputonly (do not search, only optimize model and branch lengths on user trees)
- 2.28 collapsebranches (collapse zero length branches before writing final trees to file)
- 3 Model specification settings
- 3.1 datatype (sequence type and inference model)
- 3.2 Settings for datatype = nucleotide
- 3.3 For datatype = nucleotide or aminoacid
- 3.4 For datatype = aminoacid or codon-aminoacid
- 3.5 For datatype = codon
- 3.6 For datatype = codon or codon-aminoacid
- 4 Population settings
- 5 Branch-length optimization settings
- 6 Settings controlling the proportions of the mutation types
- 6.1 topoweight (weight on topology mutations)
- 6.2 modweight (weight on model parameter mutations)
- 6.3 brlenweight (weight on branch-length parameter mutations)
- 6.4 randnniweight (weight on NNI topology changes)
- 6.5 randsprweight (weight on SPR topology changes)
- 6.6 limsprweight (weight on localized SPR topology changes)
- 6.7 intervallength
- 6.8 intervalstostore
- 7 Settings controlling mutation details
- 7.1 limsprrange (max range for localized SPR topology changes)
- 7.2 meanbrlenmuts (mean # of branch lengths to change per mutation)
- 7.3 gammashapebrlen (magnitude of branch-length mutations)
- 7.4 gammashapemodel (magnitude of model parameter mutations)
- 7.5 uniqueswapbias (relative weight assigned to already attempted branch swaps)
- 7.6 distanceswapbias (relative weight assigned to branch swaps based on locality)
Descriptions of GARLI configuration settings
The format for these configuration settings descriptions is generally: entryname (possible values, default value in bold) – description
datafname (file containing sequence dataset)
datafname = (filename) – Name of the file containing the aligned sequence data. Formats accepted are PHYLIP, NEXUS and FASTA. Robust reading of NEXUS-formatted datasets is done using the Nexus Class Library. This accommodates things such as interleaved alignments and exclusion sets (exsets) in NEXUS assumptions blocks (see the FAQ for an example of exset usage). Use of NEXUS files is recommended, and is required for partitioned models.
constraintfile (file containing constraint definition)
constraintfile = (filename, none) – Name of the file containing any topology constraint specifications, or “none” if there are no constraints. The easiest way to explain the format of the constraint file is by example. Consider a dataset of 8 taxa, in which your constraint consists of grouping taxa 1, 3 and 5. You may specify either positive constraints (inferred tree MUST contain constrained group) or negative constraints (also called converse constraints, inferred tree CANNOT contain constrained group). These are specified with either a ‘+’ or a ‘-’ at the beginning of the constraint specification, for positive and negative constraints, respectively.
- For a positive constraint on a grouping of taxa 1, 3 and 5:
- For a negative constraint on a grouping of taxa 1, 3 and 5:
- Note that there are many other equivalent parenthetical representations of these constraints.
- Multiple groups may be positively constrained, but currently only a single negatively constrained group is allowed.
- Multiple constrained groupings may be specified in a single string:
+((1,3,5),2,4,(6,7),8); or in two separate strings on successive lines:
- Constraint strings may also be specified in terms of taxon names (matching those used in the alignment) instead of numbers.
- Positive and negative constraints cannot be mixed.
- GARLI also accepts another constraint format that may be easier to use in some cases. This involves specifying a single branch to be constrained with a string of ‘*’ (asterisk) and ‘.’ (period) characters, with one character per taxon. Each taxon specified with a ‘*’ falls on one side of the constrained branch, and all those specified with a ‘.’ fall on the other. This should be familiar to anyone who has looked at PAUP* bootstrap output. With this format, a positive constraint on a grouping of taxa 1, 3 and 5 would look like this:
+*.*.*... or equivalently like this:
With this format each line only designates a single branch, so multiple constrained branches must be specified as multiple lines in the file.
- The program also allows “backbone” constraints, which are simply constraints that do not include all of the taxa. For example if one wanted to constrain (1,3,5) but didn’t care where taxon 6 appeared in the tree, this string could be used:
- Nothing special needs to be done to identify this as a backbone constraint, simply leave out some taxa.
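The two constraint formats described above can be generated mechanically from a list of taxon numbers. The following is a minimal Python sketch, not part of GARLI: the helper names `parenthetical_constraint` and `bipartition_constraint` are hypothetical, and the parenthetical string it produces is only one of the many equivalent representations mentioned above.

```python
def parenthetical_constraint(group, ntaxa, positive=True):
    """Build a parenthetical constraint string from the constrained taxon numbers.

    Assumption: taxa outside the group are simply listed at the root level.
    """
    sign = "+" if positive else "-"
    inside = ",".join(str(t) for t in sorted(group))
    outside = ",".join(str(t) for t in range(1, ntaxa + 1) if t not in group)
    return f"{sign}(({inside}),{outside});"


def bipartition_constraint(group, ntaxa, positive=True):
    """Build the one-character-per-taxon form: '*' inside the group, '.' outside."""
    sign = "+" if positive else "-"
    marks = "".join("*" if t in group else "." for t in range(1, ntaxa + 1))
    return sign + marks


group = {1, 3, 5}
print(parenthetical_constraint(group, 8))         # +((1,3,5),2,4,6,7,8);
print(bipartition_constraint(group, 8))           # +*.*.*...
print(parenthetical_constraint(group, 8, False))  # -((1,3,5),2,4,6,7,8);
```

Each returned string would go on its own line of the constraint file.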
streefname (source of starting tree and/or model)
streefname = (random, stepwise, <filename>) – Specifies where the starting tree topology and/or model parameters will come from. The tree topology may be a completely random topology (constraints will be enforced), a tree provided by the user in a file, or a tree generated by the program using a fast ML stepwise-addition algorithm (see attachmentspertaxon below). Starting or fixed model parameter values may also be provided in the specified file, with or without a tree topology. Some notes on starting trees/models:
- Specified starting trees may have polytomies, which will be arbitrarily resolved before the run begins.
- Starting tree formats:
- Plain newick tree string (with taxon numbers or names, with or without branch lengths)
- NEXUS trees block. The trees block can appear in the same file as a NEXUS data or characters block that contains the alignment, although the same filename should then be specified on both the datafname and streefname lines.
- If multiple trees appear in the specified file and multiple search replicates are specified (see searchreps setting), then the first tree is used in the first replicate, the second in the second replicate, etc.
- Providing model parameter values: see this page Specifying model parameter values
- See also the FAQ items on model parameters here.
attachmentspertaxon (control creation of stepwise addition starting tree)
attachmentspertaxon = (1 to infinity, 50) – The number of attachment branches evaluated for each taxon to be added to the tree during the creation of an ML stepwise-addition starting tree. Briefly, stepwise addition is an algorithm used to make a tree, and involves adding taxa in a random order to a growing tree. For each taxon to be added, a number of randomly chosen attachment branches are tried and scored, and then the best scoring one is chosen as the location of that taxon. The attachmentspertaxon setting controls how many attachment points are evaluated for each taxon to be added. A value of one is equivalent to a completely random tree (only one randomly chosen location is evaluated). A value of greater than 2 times the number of taxa in the dataset means that all attachment points will be evaluated for each taxon, and will result in very good starting trees (but may take a while on large datasets). Even fairly small values (< 10) can result in starting trees that are much, much better than random, but still fairly different from one another.
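The "2 times the number of taxa" rule of thumb follows from the fact that an unrooted, fully resolved tree with n taxa has 2n - 3 branches. The sketch below only counts how many attachment branches would be scored; it assumes stepwise addition starts from a three-taxon tree, which the text above does not state, and the function names are hypothetical.

```python
def attachment_points_tried(taxa_in_tree, attachmentspertaxon):
    """Branches scored when adding one taxon to a tree that already holds taxa_in_tree taxa."""
    total_branches = 2 * taxa_in_tree - 3  # branches in an unrooted binary tree
    return min(attachmentspertaxon, total_branches)


def total_evaluations(ntaxa, attachmentspertaxon):
    """Total attachment branches scored while building a full stepwise-addition tree."""
    # Assumption: start from 3 taxa, then add taxa 4..ntaxa one at a time.
    return sum(attachment_points_tried(k, attachmentspertaxon) for k in range(3, ntaxa))


print(total_evaluations(5, 1))   # one random attachment per added taxon
print(total_evaluations(5, 50))  # every branch tried: 3 + 5
```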
ofprefix (output filename prefix)
ofprefix = (text) – Prefix of all output filenames, such as log, treelog, etc. Change this for each run that you do or the program will overwrite previous results.
randseed (random number seed)
randseed = (-1 or positive integers, -1) – The random number seed used by the random number generator. Specify “-1” to have a seed chosen for you. Specifying the same seed number in multiple runs will give exactly identical results, if all other parameters and settings are also identical.
availablememory (control maximum program memory usage)
availablememory – Typically this is the amount of available physical memory on the system, in megabytes. This lets GARLI determine how much system memory it may be able to use to store computations for reuse. The program will be conservative and use at most about 90% of this value, although for typical datasets it will use much, much less. If other programs must be open or used when GARLI is running, you may need to reduce this value. This setting can also have a significant effect on performance (speed), but more is not always better. When a run is started, GARLI will output the availablememory value necessary for that dataset to achieve each of the “memory levels” (from best to worst, termed “great”, “good”, “low”, and “very low”). More memory is generally better because more calculations are stored and can be reused, but when the amount of memory needed for “great” becomes more than 512 megabytes or so, performance can be slowed because the operating system has difficulty managing that much memory. In general, choose an amount of memory that allows “great” when this is less than 512 MB, and if it is greater reduce the amount of memory into “good” or “low” as necessary. Avoid “very low” whenever possible. You can find the value that is approximately optimal for your dataset by setting the randseed to some positive value (so that the searches are identical) and doing runs with various availablememory values. Typically it will only make at most a 30% difference, so it isn't worth worrying about too much.
logevery (frequency to log best score to file)
logevery = (1 to infinity, 10) – The frequency at which the best score is written to the log file.
saveevery (frequency to save best tree to file or write checkpoints)
saveevery = (1 to infinity, 100) – If writecheckpoints or outputcurrentbesttopology are specified, this is the frequency (in generations) at which checkpoints or the current best tree are written to file.
refinestart (whether to optimize a bit before starting a search)
refinestart = (0 or 1, 1) – Specifies whether some initial rough optimization is performed on the starting branch lengths and rate heterogeneity parameters. This is always recommended.
outputcurrentbesttopology (continuously write the best tree to file during a run)
outputcurrentbesttopology = (0 or 1, 0) – If true, the current best tree of the current search replicate is written to <ofprefix>.best.current.tre every saveevery generations. In versions before 0.96 the current best topology was always written to file, but that is no longer the case. Seeing the current best tree has no real use apart from satisfying your curiosity about how a run is going.
outputeachbettertopology (write each improved topology to file)
outputeachbettertopology = (0 or 1, 0) – If true, each new topology encountered with a better score than the previous best is written to file. In some cases this can result in really big files (hundreds of MB) though, especially for random starting topologies on large datasets. Note that this file is interesting to get an idea of how the topology changed as the searches progressed, but the collection of trees should NOT be interpreted in any meaningful way. This option is not available while bootstrapping.
enforcetermconditions (use automatic termination)
enforcetermconditions = (0 or 1, 1) – Specifies whether the automatic termination conditions will be used. The conditions specified by both of the following two parameters (genthreshfortopoterm and scorethreshforterm) must be met; see those entries for their definitions. If this is false, the run will continue until it reaches the time (stoptime) or generation (stopgen) limit. It is highly recommended that this option be used!
genthreshfortopoterm (number of generations without topology improvement required for termination)
genthreshfortopoterm = (1 to infinity, 20,000) – This specifies the first part of the termination condition. When no new significantly better scoring topology (see significanttopochange below) has been encountered in greater than this number of generations, this condition is met. Increasing this parameter may improve the lnL scores obtained (especially on large datasets), but will also increase runtimes.
scorethreshforterm (max score improvement over recent generations required for termination)
scorethreshforterm = (0 to infinity, 0.05) – The second part of the termination condition. When the total improvement in score over the last intervallength x intervalstostore generations (default is 500 generations, see below) is less than this value, this condition is met. This does not usually need to be changed.
significanttopochange (required score improvement for topology to be considered better)
significanttopochange = (0 to infinity, 0.01) – The lnL increase required for a new topology to be considered significant as far as the termination condition is concerned. It probably doesn’t need to be played with, but you might try increasing it slightly if your runs reach a stable score and then take a very long time to terminate due to very minor changes in topology.
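The way the three termination settings interact can be summarized in a few lines. This is a minimal sketch of the logic as described above, not GARLI's actual code: the function names and arguments are hypothetical, and whether the comparisons are strict or inclusive at the exact threshold is an assumption.

```python
def is_significant_topology(new_lnl, best_lnl, significanttopochange=0.01):
    """A new topology only resets the generation counter if it improves lnL enough."""
    # Assumption: an improvement exactly equal to the threshold counts as significant.
    return (new_lnl - best_lnl) >= significanttopochange


def should_terminate(gens_since_significant_topo, recent_score_improvement,
                     genthreshfortopoterm=20000, scorethreshforterm=0.05):
    """Both conditions must hold when enforcetermconditions = 1.

    recent_score_improvement is the total lnL gain over the last
    intervallength x intervalstostore generations.
    """
    topo_condition = gens_since_significant_topo > genthreshfortopoterm
    score_condition = recent_score_improvement < scorethreshforterm
    return topo_condition and score_condition


print(should_terminate(20001, 0.01))  # both conditions met
print(should_terminate(20001, 0.10))  # score still improving too quickly
print(should_terminate(5000, 0.01))   # topology improved too recently
```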
outputphyliptree (write trees to file in Phylip as well as Nexus format)
outputphyliptree = (0 or 1, 0) – Whether Phylip-formatted tree files will be output in addition to the default Nexus files for the best tree across all replicates (<ofprefix>.best.phy), the best tree for each replicate (<ofprefix>.best.all.phy) or, in the case of bootstrapping, the best tree for each bootstrap replicate (<ofprefix>.boot.phy).
outputmostlyuselessfiles (output uninteresting files)
outputmostlyuselessfiles = (0 or 1, 0) – Whether to output three files of little general interest: the “fate”, “problog” and “swaplog” files. The fate file shows the parentage, mutation types and scores of every individual in the population during the entire search. The problog shows how the proportions of the different mutation types changed over the course of the run. The swaplog shows the number of unique swaps and the number of total swaps on the current best tree over the course of the run.
writecheckpoints (write checkpoint files during run)
writecheckpoints = (0 or 1, 0) – Whether to write three files to disk containing all information about the current state of the population every saveevery generations, with each successive checkpoint overwriting the previous one. These files can be used to restart a run at the last written checkpoint by setting the restart configuration entry.
restart (restart run from checkpoint)
restart = (0 or 1, 0) – Whether to restart at a previously saved checkpoint. To use this option the writecheckpoints option must have been used during a previous run. The program will look for checkpoint files that are named based on the ofprefix of the previous run. If you intend to restart a run, NOTHING should be changed in the config file except setting restart to 1. A run that is restarted from checkpoint will give exactly the same results it would have if the run had gone to completion.
outgroup (orient inferred trees consistently)
outgroup = (outgroup taxa numbers, separated by spaces) – This option allows for orienting the tree topologies in a consistent way when they are written to file. Note that this has NO effect whatsoever on the actual inference and the specified outgroup is NOT constrained to be present in the inferred trees. If multiple outgroup taxa are specified and they do not form a monophyletic group in the inferred tree, this setting will be ignored. If you specify a single outgroup taxon it will always be present, and the tree will always be consistently oriented. Ranges can be indicated with a hyphen, e.g., to specify an outgroup consisting of taxa 1, 2, 3 and 5 the format is this: outgroup = 1-3 5
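To make the range syntax concrete, the following sketch expands an outgroup value into the individual taxon numbers. It is an illustration only; `parse_outgroup` is a hypothetical helper, not a GARLI function.

```python
def parse_outgroup(value):
    """Expand a value such as '1-3 5' into a list of taxon numbers."""
    taxa = []
    for token in value.split():
        if "-" in token:
            # A hyphen denotes an inclusive range of taxon numbers.
            start, end = (int(part) for part in token.split("-"))
            taxa.extend(range(start, end + 1))
        else:
            taxa.append(int(token))
    return taxa


print(parse_outgroup("1-3 5"))  # [1, 2, 3, 5]
```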
searchreps (number of independent search replicates)
searchreps = (1 to infinity, 2) – The number of independent search replicates to perform during a program execution. You should always either do multiple search replicates or multiple program executions with any dataset to get a feel for whether you are getting consistent results, which suggests that the program is doing a decent job of searching. Note that if this is > 1 and you are performing a bootstrap analysis, this is the number of search replicates to be done per bootstrap replicate. That can increase the chance of finding the best tree per bootstrap replicate, but will also increase bootstrap runtimes enormously.
bootstrapreps (number of bootstrap replicates)
bootstrapreps = (0 to infinity, 0) - The number of bootstrap reps to perform. If this is greater than 0, normal searching will not be performed. The resulting bootstrap trees (one per rep) will be output to a file named <ofprefix>.boot.tre. To obtain the bootstrap proportions they will then need to be read into PAUP* or a similar program to obtain a majority rule consensus. Note that it is probably safe to reduce the strictness of the termination conditions during bootstrapping (perhaps halve genthreshfortopoterm), which will greatly speed up the bootstrapping process with negligible effects on the results.
Note that GARLI does NOT do the bootstrap consensus or calculate the bootstrap proportions itself. It simply infers the trees that are the input to that consensus. The trees in the .boot.tre file will need to be read into a program such as SumTrees (or PAUP* or CONSENSE) to do a majority-rule consensus and obtain your bootstrap support values. See the following part of the "Advanced topics" page for suggestions on making the consensus: Detailed Example: A bootstrap analysis.
resampleproportion (relative size of re-sampled data matrix)
resampleproportion = (0.1 to 10, 1.0) – When bootstrapreps > 0, this setting allows for bootstrap-like resampling, but with the pseudoreplicate datasets having a number of alignment columns different from the real data. Setting values < 1.0 is somewhat similar to jackknifing, but not identical.
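The effect of resampleproportion can be pictured as ordinary bootstrap resampling of alignment columns with a different target size. The sketch below is an assumption-laden illustration: the function name is hypothetical, and rounding the product to the nearest integer is a guess, since the text does not say how GARLI handles fractional column counts.

```python
import random


def bootstrap_columns(ncols, resampleproportion=1.0, rng=None):
    """Return column indices (0-based) sampled with replacement.

    The pseudoreplicate has about ncols * resampleproportion columns.
    """
    if rng is None:
        rng = random.Random()
    nsample = max(1, round(ncols * resampleproportion))  # rounding is an assumption
    return [rng.randrange(ncols) for _ in range(nsample)]


rng = random.Random(42)
print(len(bootstrap_columns(100, 1.0, rng)))  # standard bootstrap: 100 columns
print(len(bootstrap_columns(100, 0.5, rng)))  # half-size pseudoreplicate: 50 columns
```

Because columns are still drawn with replacement, a value below 1.0 resembles jackknifing without being identical to it, as noted above.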
inferinternalstateprobs (infer ancestral states)
inferinternalstateprobs = (0 or 1, 0) – Specify 1 to have GARLI infer the marginal posterior probability of each character at each internal node. This is done at the very end of the run, just before termination. The results are output to a file named <ofprefix>.internalstates.log.
outputsitelikelihoods (write a file with the log-likelihood of each site)
outputsitelikelihoods = (0 or 1, 0) - Causes a file named <ofprefix>.sitelikes.log to be created containing the sitewise log-likelihood values for each individual site for the final tree of each search replicate. Note that the format of the file is exactly the same as that output by PAUP*, so it can be directly used in CONSEL using the PAUP file-format option. For partitioned models the sites are listed in sets by partition subset, but the site numbers will match those in the original datamatrix. Thus, when using a partitioned model that divides sites by codon position, the ordering would be 1, 4, 7, ... 2, 5, 8, ... 3, 6, 9 .... Note that this makes no difference to CONSEL, which just assumes that the site likelihoods are ordered the same for each tree and ignores the site numbers.
Also note that using the site likelihoods from a partitioned model is a violation of CONSEL's assumptions, since it will not know about the partitions and its internal bootstrap resampling will not obey them. It isn't clear what the effects of this will be on the various tests.
Finally, if you want to calculate the site likelihoods for a set of trees that you provide, use the optimizeinputonly setting.
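The site ordering described above for a codon-position partition can be reproduced with a short sketch. It assumes that subset k simply contains every third site starting at site k; the function name is hypothetical.

```python
def sitelike_order(nsites, nsubsets=3):
    """Site numbers (1-based) grouped by partition subset, as in the sitelikes file."""
    return [site
            for offset in range(1, nsubsets + 1)
            for site in range(offset, nsites + 1, nsubsets)]


print(sitelike_order(9))  # [1, 4, 7, 2, 5, 8, 3, 6, 9]
```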
optimizeinputonly (do not search, only optimize model and branch lengths on user trees)
(new in version 2.0)
optimizeinputonly = (0 or 1, 0) - Requires a NEXUS formatted input tree file (only NEXUS format will work!), specified through the streefname setting. All trees in that file will have model parameters and branch lengths optimized, giving their maximum likelihood scores. A file with the site-likelihoods of each tree will also be output. See the outputsitelikelihoods setting for details.
collapsebranches (collapse zero length branches before writing final trees to file)
collapsebranches = (0 or 1, 1) - Before version 1.0, all trees that were returned were fully resolved. This was true even if the maximum-likelihood estimate of some internal branch lengths was effectively zero (or GARLI's minimum, which is 1e-8). In such cases, collapsing the branch into a polytomy would be a better representation. Note that GARLI will never return a tree with an actual branch length of zero, but rather with its minimum value of 1.0e-8. The drawback of always returning fully resolved trees is that what is effectively a polytomy can be resolved in three ways, and different independent searches may randomly return one of those resolutions. Thus, if you compare the trees by topology only, they will look different. If you pay attention to the branch lengths and likelihood scores of the trees it will be apparent that they are effectively the same.
I think that collapsing of branches is particularly important when bootstrapping, since no support should be given to a branch that doesn't really exist, i.e., that is a random resolution of a polytomy. Collapsing is also good when calculating tree to tree distances such as the symmetric tree distance, for example when calculating phylogenetic error to a known target tree. Zero-length branches would add to the distances (~error) although they really should not.
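The idea of collapsing minimum-length internal branches can be sketched on a toy tree structure. This is not how GARLI represents trees internally (the source does not describe that); the nested-list representation and the `collapse` function are assumptions made purely for illustration.

```python
MIN_BRLEN = 1e-8  # GARLI's minimum branch length, as stated above


def collapse(node, eps=MIN_BRLEN):
    """Collapse internal branches of length <= eps into polytomies.

    A node is either a taxon name (str) or a list of (child, branch_length) pairs.
    """
    if isinstance(node, str):
        return node
    children = []
    for child, brlen in node:
        child = collapse(child, eps)
        if isinstance(child, list) and brlen <= eps:
            # Effectively zero-length internal branch: attach its children here.
            children.extend(child)
        else:
            children.append((child, brlen))
    return children


resolved = [("A", 0.1), ("B", 0.2), ([("C", 0.3), ("D", 0.4)], 1e-8)]
print(collapse(resolved))  # C and D now attach directly to the root polytomy
```

Terminal branches are never collapsed in this sketch, only internal ones.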
Model specification settings
With version 1.0 and later there are now many more options dealing with model specification because of the inclusion of amino acid and codon-based models. The description of the settings will be broken up by data type. Note that in terms of the model settings in GARLI, “empirical” means to fix parameter values at those observed in the dataset being analyzed, and “fixed” means to fix them at user specified values. See the streefname setting for details on how to provide parameter values to be fixed during inference.
PARTITIONED MODELS: To use partitioned models you'll still need to use the same basic model settings detailed below for each individual partition subset, and then some other details to set up the partition model itself. Be sure that you are familiar with the rest of this section, then see Using partitioned models.
datatype (sequence type and inference model)
datatype = (nucleotide, aminoacid, codon-aminoacid, codon, standard, standardordered, standardvariable, standardvariableordered) – The type of data and model that is to be used during tree inference. Nucleotide and amino acid data are self explanatory.
The codon-aminoacid datatype means that the data will be supplied as a nucleotide alignment, but will be internally translated and analyzed using an amino acid model. The codon and codon-aminoacid datatypes require nucleotide sequence that is aligned in the correct reading frame. In other words, all gaps in the alignment should be a multiple of 3 in length, and the alignment should start at the first position of a codon. If the alignment has extra columns at the start, middle or end, they should be removed or excluded with a Nexus exset (see this FAQ item for an example of exset usage). The correct geneticcode must also be set.
(New in Version 2.0)
The various "standard" datatypes are new in GARLI 2.0. These represent morphology-like discrete characters, with any number of states. These are also termed the "Mk" and "Mkv" models by Lewis (2001). See this page for more details on using standard data: Garli_Mkv morphology model.
Settings for datatype = nucleotide
ratematrix (relative rate parameters assumed by substitution model)
ratematrix = (1rate, 2rate, 6rate, fixed, custom string) – The number of relative substitution rate parameters (note that the number of free parameters is this value minus one). Equivalent to the “nst” setting in PAUP* and MrBayes. 1rate assumes that substitutions between all pairs of nucleotides occur at the same rate (JC model), 2rate allows different rates for transitions and transversions (K2P or HKY models), and 6rate allows a different rate between each nucleotide pair (GTR). These rates are estimated unless the fixed option is chosen. Since version 0.96, parameters for any submodel of the GTR model may be estimated. The format for specifying this is very similar to that used in the “rclass” setting of PAUP*. Within parentheses, six letters are specified, with spaces between them. The six letters represent the rates of substitution between the six pairs of nucleotides, with the order being A-C, A-G, A-T, C-G, C-T and G-T. Letters within the parentheses that are the same mean that a single parameter is shared by multiple nucleotide pairs. For example,
ratematrix = (a b a a b a)
would specify the HKY 2-rate model (equivalent to ratematrix = 2rate). This entry,
ratematrix = (a b c c b a)
would specify 3 estimated rates of substitution, with one rate shared by A-C and G-T substitutions, another rate shared by A-G and C-T substitutions, and the final rate shared by A-T and C-G substitutions.
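The mapping from a custom ratematrix string to shared rate classes can be checked with a short sketch. The pair order comes from the description above; the function name is hypothetical and this is not GARLI's own parser.

```python
PAIRS = ["A-C", "A-G", "A-T", "C-G", "C-T", "G-T"]  # order given in the text


def parse_ratematrix(spec):
    """Group the six nucleotide pairs by the letter assigned to them."""
    letters = spec.strip().strip("()").split()
    if len(letters) != 6:
        raise ValueError("expected six rate labels, one per nucleotide pair")
    classes = {}
    for pair, letter in zip(PAIRS, letters):
        classes.setdefault(letter, []).append(pair)
    return classes


classes = parse_ratematrix("(a b c c b a)")
print(classes)           # {'a': ['A-C', 'G-T'], 'b': ['A-G', 'C-T'], 'c': ['A-T', 'C-G']}
print(len(classes) - 1)  # free rate parameters: number of classes minus one
```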
statefrequencies (equilibrium base frequencies assumed by substitution model)
statefrequencies = (equal, empirical, estimate, fixed) – Specifies how the equilibrium state frequencies (A, C, G and T) are treated. The empirical setting fixes the frequencies at their observed proportions, and the other options should be self-explanatory.
For datatype = nucleotide or aminoacid
invariantsites (treatment of proportion of invariable sites parameter)
invariantsites = (none, estimate, fixed) – Specifies whether a parameter representing the proportion of sites that are unable to change (i.e. have a substitution rate of zero) will be included. This is typically referred to as “invariant sites”, but would better be termed “invariable sites”.
ratehetmodel (type of rate heterogeneity to assume for variable sites)
ratehetmodel = (none, gamma, gammafixed) – The model of rate heterogeneity assumed. “gammafixed” requires that the alpha shape parameter is provided, and a setting of “gamma” estimates it.
numratecats (number of overall substitution rate categories)
numratecats = (1 to 20, 4) – The number of categories of variable rates (not including the invariant site class if it is being used). Must be set to 1 if ratehetmodel is set to none. Note that runtimes and memory usage scale linearly with this setting.
For datatype = aminoacid or codon-aminoacid
Amino acid analyses are typically done using fixed rate matrices that have been estimated on large datasets and published. Typically the only model parameters that are estimated during tree inference relate to the rate heterogeneity distribution. Each of the named matrices also has corresponding fixed amino acid frequencies, and a given rate matrix can either be used with those frequencies or with the amino acid frequencies observed in your dataset. This second option is often denoted as “+F” in a model description, although in terms of the GARLI configuration settings this is referred to as “empirical” frequencies. In GARLI the Dayhoff model would be specified by setting both the ratematrix and statefrequencies options to “dayhoff”. The Dayhoff+F model would be specified by setting the ratematrix to “dayhoff”, and statefrequencies to “empirical”.
The following named amino acid models are implemented:
|dayhoff||Dayhoff, Schwartz and Orcutt, 1978.|
|jones||Jones, Taylor and Thornton (JTT), 1992.|
|WAG||Whelan and Goldman, 2001.|
|mtREV||Adachi and Hasegawa, 1996.|
|mtmam||Yang, Nielsen and Hasegawa, 1998.|
Note that most programs allow either the use of a named rate matrix and its corresponding state frequencies, or a named rate matrix and empirical frequencies. GARLI technically allows the mixing of different named matrices and equilibrium frequencies (for example, wag matrix with jones equilibrium frequencies), but this is not recommended.
ratematrix (amino acid substitution rates)
ratematrix = (poisson, jones, dayhoff, wag, mtmam, mtrev) – The fixed amino acid rate matrix to use. You should use the matrix that gives the best likelihood, and could use a program like PROTTEST (very much like MODELTEST, but for amino acid models) to determine which fits best for your data. Poisson assumes a single rate of substitution between all amino acid pairs, and is a very poor model.
statefrequencies (equilibrium amino acid frequencies assumed by substitution model)
statefrequencies = (equal, empirical, estimate, fixed, jones, dayhoff, wag, mtmam, mtrev) – Specifies how the equilibrium state frequencies of the 20 amino acids are treated. The “empirical” option fixes the frequencies at their observed proportions (when describing a model this is often termed “+F”).
For datatype = codon
The codon models are built with three components: (1) parameters describing the process of individual nucleotide substitutions, (2) equilibrium codon frequencies, and (3) parameters describing the relative rate of nonsynonymous to synonymous substitutions. The nucleotide substitution parameters within the codon models are exactly the same as those possible with standard nucleotide models in GARLI, and are specified with the ratematrix configuration entry. Thus, they can be of the 2rate variety (inferring different rates for transitions and transversions, K2P or HKY-like), the 6rate variety (inferring different rates for all nucleotide pairs, GTR-like) or any other sub-model of GTR. The options for codon frequencies are specified with the statefrequencies configuration entry. The options are to use equal frequencies (not a good option), the frequencies observed in your dataset (termed “empirical” in GARLI), or the codon frequencies implied by the “F1x4” or “F3x4” methods (using PAML's terminology). These last two options calculate the codon frequencies as the product of the frequencies of the three nucleotides that make up each codon. In the “F1x4” case the nucleotide frequencies are those observed in the dataset across all codon positions, while the “F3x4” option uses the nucleotide frequencies observed in the data at each codon position separately. The final component of the codon models is the nonsynonymous to synonymous relative rate parameters (aka dN/dS or omega parameters). The default is to infer a single dN/dS value. Alternatively, a model can be specified that infers a given number of dN/dS categories, with the dN/dS values and proportions falling in each category estimated (ratehetmodel = nonsynonymous). This is the “discrete” or “M3” model in PAML's terminology.
ratematrix (relative nucleotide rate parameters assumed by codon model)
ratematrix = (1rate, 2rate, 6rate, fixed, custom string) – This determines the relative rates of nucleotide substitution assumed by the codon model. The options are exactly the same as those allowed under a normal nucleotide model. A codon model with ratematrix = 2rate specifies the standard Goldman and Yang (1994) model, with different substitution rates for transitions and transversions.
statefrequencies (equilibrium codon frequencies)
statefrequencies = (equal, empirical, f1x4, f3x4) – The options are to use equal codon frequencies (not a good option), the frequencies observed in your dataset (termed “empirical” in GARLI), or the codon frequencies implied by the “F1x4” or “F3x4” methods (using PAML's terminology). These last two options calculate the codon frequencies as the product of the frequencies of the three nucleotides that make up each codon. In the “F1x4” case the nucleotide frequencies are those observed in the dataset across all codon positions, while the “F3x4” option uses the nucleotide frequencies observed in the data at each codon position separately.
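The F1x4 and F3x4 calculations described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not GARLI's own code. The function name `codon_frequencies` is hypothetical, the sequences are assumed to be in frame and to contain at least one unambiguous base per codon position, and the renormalization over the 61 sense codons of the standard genetic code is an assumption borrowed from common practice rather than something stated on this page.

```python
from collections import Counter
from itertools import product

# Stop codons of the standard genetic code (assumption: geneticcode = standard).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def codon_frequencies(sequences, method="f3x4"):
    """Return {codon: frequency} over sense codons using F1x4 or F3x4."""
    # One nucleotide counter per codon position (0, 1, 2).
    counts = [Counter(), Counter(), Counter()]
    for seq in sequences:
        for i, base in enumerate(seq.upper()):
            if base in "ACGT":  # gaps and ambiguity codes are skipped
                counts[i % 3][base] += 1

    if method == "f1x4":
        # Pool the counts so all three positions share one set of frequencies.
        pooled = counts[0] + counts[1] + counts[2]
        counts = [pooled, pooled, pooled]
    elif method != "f3x4":
        raise ValueError("method must be 'f1x4' or 'f3x4'")

    # Convert counts to nucleotide frequencies for each codon position.
    nuc = []
    for c in counts:
        total = sum(c.values())
        nuc.append({b: c[b] / total for b in "ACGT"})

    # Codon frequency = product of its three nucleotide frequencies.
    raw = {}
    for codon in map("".join, product("ACGT", repeat=3)):
        if codon not in STOP_CODONS:
            raw[codon] = nuc[0][codon[0]] * nuc[1][codon[1]] * nuc[2][codon[2]]

    # Assumed step: rescale so the sense-codon frequencies sum to 1.
    norm = sum(raw.values())
    return {codon: f / norm for codon, f in raw.items()}


example = ["ATGGCC", "ATGGCA"]
f3 = codon_frequencies(example, "f3x4")
f1 = codon_frequencies(example, "f1x4")
```

Under F1x4 any two codons with the same base composition (for example ATG and GTA) receive the same frequency, whereas F3x4 can distinguish them because each position has its own nucleotide frequencies.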
ratehetmodel (variation in dN/dS across sites)
ratehetmodel = (none, nonsynonymous) – For codon models, the default is to infer a single dN/dS parameter. Alternatively, a model can be specified that infers a given number of dN/dS categories, with the dN/dS values and proportions falling in each category estimated (ratehetmodel = nonsynonymous). This is the “discrete” or “M3” model of Yang et al. (2000).
numratecats (number of discrete dN/dS categories)
numratecats = (1 to 20, 1) – When ratehetmodel = nonsynonymous, this is the number of dN/dS parameter categories.
invariantsites = (none) - NOTE: Due to an error on my part, the invariantsites entry must appear with codon models for them to run (despite the fact that it doesn't apply). i.e., be sure that this appears: invariantsites = none.
For datatype = codon or codon-aminoacid
geneticcode (code to use in codon translation)
geneticcode = (standard, vertmito, invertmito) – The genetic code to be used in translating codons into amino acids.
nindivs (number of individuals in population)
nindivs = (2 to 100, 4)- The number of individuals in the population. This may be increased, but doing so is generally not beneficial. Note that typical genetic algorithms tend to have much, much larger population sizes than GARLI's defaults.
holdover (unmutated copies of best individual)
holdover = (1 to nindivs-1, 1)- The number of times the best individual is copied to the next generation with no chance of mutation. It is best not to mess with this.
selectionintensity (strength of selection)
selectionintensity = (0.01 to 5.0, 0.5)- Controls the strength of selection, with larger numbers denoting stronger selection. The relative probability of reproduction of two individuals depends on the difference in their log likelihoods (ΔlnL) and is formulated very similarly to the procedure of calculating Akaike weights. The relative probability of reproduction of the less fit individual is equal to:
e^(-selectionintensity × ΔlnL)
In general, this setting does not seem to have much of an effect on the progress of a run. In theory, higher values should cause scores to increase more quickly, but make the search more likely to become trapped in a local optimum. Low values will increase runtimes, but may be more likely to reach the true optimum. The following table gives the relative probabilities of reproduction for different values of the selection intensity when the difference in log likelihood is 1.0:
|selectionintensity value||Ratio of probabilities of reproduction|
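Ratios of this kind follow directly from the formula above, so they can be computed for any selectionintensity value. The sketch below is illustrative only. The function name is hypothetical and the selectionintensity values in the loop are arbitrary picks from the allowed range (0.01 to 5.0), not necessarily the rows of the original table.

```python
import math


def relative_reproduction_probability(selection_intensity, delta_lnl):
    """Probability of reproduction of the less fit individual, relative to 1.0
    for the fitter one, given the difference in log likelihood."""
    return math.exp(-selection_intensity * delta_lnl)


# Illustrative values only; delta lnL is fixed at 1.0 as in the text.
for s in (0.01, 0.1, 0.5, 1.0, 5.0):
    ratio = relative_reproduction_probability(s, 1.0)
    print(f"selectionintensity = {s:<5} ratio = {ratio:.3f} : 1.0")
```

With the default selectionintensity of 0.5 and a log-likelihood difference of 1.0, the less fit individual is about 0.61 times as likely to reproduce as the fitter one.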
holdoverpenalty (fitness handicap for best individual)
holdoverpenalty = (0 to 100, 0) – This can be used to bias the probability of reproduction of the best individual downward. Because the best individual is automatically copied into the next generation, it has a bit of an unfair advantage and can cause all population variation to be lost due to genetic drift, especially with small population sizes. The value specified here is subtracted from the best individual’s lnL score before calculating the probabilities of reproduction. It seems plausible that this might help maintain variation, but I have not seen it cause a measurable effect.
stopgen (maximum number of generations to run)
stopgen – The maximum number of generations to run. Note that this supersedes the automated stopping criterion (see enforcetermconditions above), and should therefore be set to a very large value if automatic termination is desired.
stoptime (maximum time to run)
stoptime – The maximum number of seconds for the run to continue. Note that this supersedes the automated stopping criterion (see enforcetermconditions above), and should therefore be set to a very large value if automatic termination is desired.
Branch-length optimization settings
After a topological rearrangement, branch lengths in the vicinity of the rearrangement are optimized by the Newton-Raphson method. Optimization passes are performed on a particular branch until the expected improvement in likelihood for the next pass is less than a threshold value, termed the optimization precision. Note that this name is somewhat misleading, as the precision of the optimization algorithm is inversely related to this value (i.e., smaller values of the optimization precision lead to more precise optimization). If the improvement in likelihood due to optimization for a particular branch is greater than the optimization precision, optimization is also attempted on adjacent branches, spreading out across the tree. When no new topology with a better likelihood score is discovered for a while, the value is automatically reduced. The value can have a large effect on speed, with smaller values significantly slowing down the algorithm. The value of the optimization precision and how it changes over the course of a run are determined by the following three parameters.
startoptprec (0.005 to 5.0, 0.5) – The beginning optimization precision.
minoptprec (0.001 to startoptprec, 0.01) – The minimum allowed value of the optimization precision.
numberofprecreductions (0 to 100, 10) – Specify the number of steps that it will take for the optimization precision to decrease (linearly) from startoptprec to minoptprec.
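The linear decrease described by these three settings can be sketched as follows. This is an illustration of the arithmetic only, not GARLI's implementation. The function name is hypothetical, and the behaviour for numberofprecreductions = 0 (the precision simply stays at startoptprec) is an assumption.

```python
def optimization_precision_schedule(startoptprec=0.5, minoptprec=0.01,
                                    numberofprecreductions=10):
    """List the optimization precision values visited over a run."""
    if numberofprecreductions == 0:
        # Assumption: with no reductions the precision never changes.
        return [startoptprec]
    # Equal-sized steps from startoptprec down to minoptprec.
    step = (startoptprec - minoptprec) / numberofprecreductions
    return [startoptprec - i * step for i in range(numberofprecreductions + 1)]


for value in optimization_precision_schedule():
    print(f"{value:.3f}")
```

With the default values each reduction lowers the precision by 0.049, so the schedule starts at 0.5 and reaches 0.01 after ten reductions.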
treerejectionthreshold (0 to 500, 50) – This setting controls which trees have more extensive branch-length optimization applied to them. All trees created by a branch swap receive optimization on a few branches that directly took part in the rearrangement. If the difference in score between the partially optimized tree and the best known tree is greater than treerejectionthreshold, no further optimization is applied to the branches of that tree. Reducing this value can significantly reduce runtimes, often with little or no effect on results. However, it is possible that a better tree could be missed if this is set too low. In cases in which obtaining the very best tree per search is not critical (e.g., bootstrapping), setting this lower (~20) is probably safe.
Settings controlling the proportions of the mutation types
Each mutation type is assigned a prior weight. These values determine the expected proportions of the various mutation types that are performed. The primary mutation categories are topology (t), model (m) and branch length (b). Each is assigned a prior weight ( Pi ) in the config file. Each time that a new best likelihood score is attained, the amount of the increase in score is credited to the mutation type responsible, with the sum of the increases ( Si ) maintained over the last intervallength x intervalstostore generations. The number of times that each mutation is performed ( Ni ) is also tallied. The total weight of a mutation type is Wi = Pi + ( Si / Ni ). The proportion of mutations of type i out of all mutations is then
Pr(i) = Wi / (Wt + Wm + Wb)
The proportion of each mutation is thus related to its prior weight and the average increase in score that it has caused over recent generations. The prior weights can be used to control the expected (and starting) proportions of the mutation types, as well as how sensitive the proportions are to the course of events in a run. It is generally a good idea to make the topology prior much larger than the others so that when no mutations are improving the score many topology mutations are still attempted. If you set outputmostlyuselessfiles to 1, you can look at the “problog” file to determine what the proportions of the mutations actually were over the course of a run.
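The weighting scheme above can be made concrete with a short sketch. It is an illustration only. The function and dictionary key names are hypothetical, and treating Si / Ni as zero when a mutation type has not yet been performed is an assumption. The prior weights used are the defaults given for topoweight, modweight and brlenweight.

```python
def mutation_proportions(prior, score_gain, count):
    """Compute Pr(i) = Wi / sum(W), with Wi = Pi + Si / Ni."""
    weights = {}
    for kind in prior:
        # Assumption: no recorded attempts means no average gain yet.
        avg_gain = score_gain[kind] / count[kind] if count[kind] else 0.0
        weights[kind] = prior[kind] + avg_gain
    total = sum(weights.values())
    return {kind: w / total for kind, w in weights.items()}


priors = {"topology": 1.0, "model": 0.05, "brlen": 0.2}

# Start of a run: no score increases have been credited yet.
start = mutation_proportions(
    priors,
    score_gain={"topology": 0.0, "model": 0.0, "brlen": 0.0},
    count={"topology": 0, "model": 0, "brlen": 0},
)

# Later: model mutations have recently produced score gains (made-up numbers).
later = mutation_proportions(
    priors,
    score_gain={"topology": 0.0, "model": 5.0, "brlen": 0.0},
    count={"topology": 400, "model": 20, "brlen": 80},
)
```

With the default priors and no credited gains, the proportions are 0.8 topology, 0.04 model and 0.16 branch length. In the second call the model proportion rises because its average gain (5.0 / 20 = 0.25) is added to its prior weight.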
topoweight (weight on topology mutations)
topoweight (0 to infinity, 1.0) The prior weight assigned to the class of topology mutations (NNI, SPR and limSPR). Note that setting this to 0.0 turns off topology mutations, meaning that the tree topology is fixed for the run. This used to be a way to have the program estimate only model parameters and branch-lengths, but the optimizeinputonly setting is now a better way to go.
modweight (weight on model parameter mutations)
modweight (0 to infinity, 0.05) The prior weight assigned to the class of model mutations. Note that setting this at 0.0 fixes the model during the run.
brlenweight (weight on branch-length parameter mutations)
brlenweight (0 to infinity, 0.2) The prior weight assigned to branch-length mutations.
The same procedure used above to determine the proportion of Topology:Model:Branch-Length mutations is also used to determine the relative proportions of the three types of topological mutations (NNI:SPR:limSPR), controlled by the following three weights. Note that the proportion of mutations applied to each of the model parameters is not user controlled.
randnniweight (weight on NNI topology changes)
randnniweight (0 to infinity, 0.1) - The prior weight assigned to NNI mutations.
randsprweight (weight on SPR topology changes)
randsprweight (0 to infinity, 0.3) - The prior weight assigned to random SPR mutations. For very large datasets it is often best to set this to 0.0, as random SPR mutations essentially never result in score increases.
limsprweight (weight on localized SPR topology changes)
limsprweight (0 to infinity, 0.6) - The prior weight assigned to SPR mutations with the reconnection branch limited to being a maximum of limsprrange branches away from where the branch was detached.
intervallength (10 to 1000, 100) – The number of generations in each interval during which the number and benefit of each mutation type are stored.
intervalstostore = (1 to 10, 5) – The number of intervals to be stored. Thus, records of mutations are kept for the last (intervallength x intervalstostore) generations. Every intervallength generations the probabilities of the mutation types are updated by the scheme described above.
Settings controlling mutation details
limsprrange (max range for localized SPR topology changes)
limsprrange (0 to infinity, 6) – The maximum number of branches away from its original location that a branch may be reattached during a limited SPR move. Setting this too high (> 10) can seriously degrade performance, but if you do so in conjunction with a large increase in genthreshfortopoterm you might end up with better trees.
meanbrlenmuts (mean # of branch lengths to change per mutation)
meanbrlenmuts (1 to # taxa, 5) - The mean of the binomial distribution from which the number of branch lengths mutated is drawn during a branch length mutation.
gammashapebrlen (magnitude of branch-length mutations)
gammashapebrlen (50 to 2000, 1000) - The shape parameter of the gamma distribution (with a mean of 1.0) from which the branch-length multipliers are drawn for branch-length mutations. Larger numbers cause smaller changes in branch lengths. (Note that this has nothing to do with gamma rate heterogeneity.)
gammashapemodel (magnitude of model parameter mutations)
gammashapemodel (50 to 2000, 1000) - The shape parameter of the gamma distribution (with a mean of 1.0) from which the model mutation multipliers are drawn for model parameter mutations. Larger numbers cause smaller changes in model parameters. (Note that this has nothing to do with gamma rate heterogeneity.)
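To see why larger shape values give smaller changes, note that a gamma distribution with shape k and mean 1.0 has scale 1/k and standard deviation 1/sqrt(k). The sketch below draws such multipliers with Python's standard library. It illustrates the distribution only and is not GARLI's sampling code. The function name `draw_multiplier` is hypothetical.

```python
import math
import random


def draw_multiplier(shape, rng=random):
    """Draw a multiplier from a gamma distribution with mean 1.0.

    random.gammavariate takes (shape, scale); the mean is shape * scale,
    so a scale of 1 / shape gives a mean of 1.0.
    """
    return rng.gammavariate(shape, 1.0 / shape)


rng = random.Random(1)  # fixed seed so the sketch is repeatable
for shape in (50, 1000, 2000):
    draws = [draw_multiplier(shape, rng) for _ in range(10000)]
    mean = sum(draws) / len(draws)
    print(f"shape = {shape:<5} theoretical sd = {1 / math.sqrt(shape):.4f} "
          f"sample mean = {mean:.4f}")
```

At the default shape of 1000 the standard deviation is about 0.032, so a typical mutation changes a branch length or model parameter by a few percent. At the minimum shape of 50 it is about 0.14.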
uniqueswapbias (relative weight assigned to already attempted branch swaps)
uniqueswapbias (0.01 to 1.0, 0.1) – With version 0.95 and later, GARLI keeps track of which branch swaps it has attempted on the current best tree. Because swaps are applied randomly, it is possible that some swaps are tried twice before others are tried at all. This option allows the program to bias the swaps applied toward those that have not yet been attempted. Each swap is assigned a relative weight depending on the number of times that it has been attempted on the current best tree. This weight is equal to (uniqueswapbias) raised to the (# times swap attempted) power. In other words, a value of 0.5 means that swaps that have already been tried once will be half as likely as those not yet attempted, swaps attempted twice will be ¼ as likely, etc. A value of 1.0 means no biasing. If this value is not equal to 1.0 and the outputmostlyuselessfiles option is on, a file called <ofprefix>.swap.log is output. This file shows the total number of rearrangements tried and the number of unique ones over the course of a run. Note that this bias is only applied to NNI and limSPR rearrangements. Use of this option may allow the use of somewhat larger values of limsprrange.
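The weighting rule for previously attempted swaps is simple enough to state as code. This is a sketch of the formula as described above, with a hypothetical function name. How GARLI turns these relative weights into an actual choice of swap is not covered here.

```python
def unique_swap_weight(uniqueswapbias, times_attempted):
    """Relative weight of a swap already tried times_attempted times."""
    return uniqueswapbias ** times_attempted


# Compare the example value from the text (0.5) with the default (0.1).
for bias in (0.5, 0.1):
    weights = [unique_swap_weight(bias, n) for n in range(4)]
    print(bias, weights)
```

With the default of 0.1 a swap that has been tried once is only one tenth as likely to be chosen as an untried one, so the default biases much more strongly toward new swaps than the 0.5 used in the worked example.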
distanceswapbias (relative weight assigned to branch swaps based on locality)
distanceswapbias (0.1 to 10, 1.0) – This option is similar to uniqueswapbias, except that it biases toward certain swaps based on the topological distance between the initial and rearranged trees. The distance is measured as in limsprrange, and is half the Robinson-Foulds distance between the trees. As with uniqueswapbias, distanceswapbias assigns a relative weight to each potential swap. In this case the weight is (distanceswapbias) raised to the (reconnection distance - 1) power. Thus, given a value of 0.5, the weight of an NNI is 1.0, the weight of an SPR with distance 2 is 0.5, with distance 3 is 0.25, etc. Note that values less than 1.0 bias toward more localized swaps, while values greater than 1.0 bias toward more extreme swaps. Also note that this bias is only applied to limSPR rearrangements. Be careful in setting this, as extreme values can have a very large effect.
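The distance-based weight can be sketched the same way. The function name is hypothetical, and the loop over distances 1 to 6 simply uses the default limsprrange as an illustrative upper bound.

```python
def distance_swap_weight(distanceswapbias, reconnection_distance):
    """Relative weight of a swap at the given reconnection distance
    (distance 1 corresponds to an NNI)."""
    return distanceswapbias ** (reconnection_distance - 1)


# Per-swap weights at each distance for a localizing and an extreme bias.
for bias in (0.5, 2.0):
    weights = [distance_swap_weight(bias, d) for d in range(1, 7)]
    print(bias, weights)
```

At distance 6 a bias of 0.5 gives a weight of 0.03125 while a bias of 2.0 gives 32, a thousandfold difference in per-swap weight, which is why extreme values can have such a large effect.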