https://molevol.mbl.edu/api.php?action=feedcontributions&user=Zwickl&feedformat=atomMolEvol - User contributions [en]2019-09-21T21:43:48ZUser contributionsMediaWiki 1.31.1https://molevol.mbl.edu/index.php?title=Garli_wiki&diff=4397Garli wiki2015-07-21T17:10:34Z<p>Zwickl: </p>
<hr />
<div><br />
useful GARLI documentation in wiki format is available here:<br />
<br />
<br />
<br />
<br />
Welcome to the GARLI support wiki!<br />
What is GARLI?<br />
<br />
GARLI is a program that performs phylogenetic inference using the maximum-likelihood criterion. Several sequence types are supported, including nucleotide, amino acid and codon. Version 2.0 adds support for partitioned models and morphology-like datatypes. It is usable on all operating systems, and is written and maintained by Derrick Zwickl (zwickl{at}mail.arizona{dot}edu or garli.support{at}gmail{dot}com).<br />
<br />
Obtaining GARLI<br />
Current Version 2.01<br />
<br />
You can download GARLI for Windows or Mac OS X (or get the source code) from Google Code: http://garli.googlecode.com<br />
<br />
<br />
Documentation and support for GARLI<br />
<br />
There are a number of options:<br />
<br />
This [[Garli_wiki| wiki]] is the primary documentation for the software. It includes<br />
<br />
[[Garli_FAQ | FAQ]]<br />
<br />
[[GARLI_configuration_settings | Configuration settings]]<br />
<br />
[[Garli_using_partitioned_models | Using partitioned models]]<br />
<br />
And lots of other useful information.<br />
<br />
<br />
Questions may be posted to the GARLI user forum through the [http://groups.google.com/group/garli_users garli_users] Google group.<br />
You may also email questions and problems to garli{dot}support{at}gmail{dot}com <br />
<br />
Known Issues<br />
<br />
There are a few known issues. See them here: Known issues with 2.0<br />
Citing GARLI<br />
<br />
If you use GARLI please cite<br />
<br />
Zwickl, D. J., 2006. Genetic algorithm approaches for the phylogenetic analysis of large <br />
biological sequence datasets under the maximum likelihood criterion. Ph.D. dissertation,<br />
The University of Texas at Austin.<br />
<br />
and/or the download website:<br />
http://garli.googlecode.com</div>Zwicklhttps://molevol.mbl.edu/index.php?title=Garli_wiki&diff=4396Garli wiki2015-07-21T17:03:47Z<p>Zwickl: </p>
<hr />
<div><br />
useful GARLI documentation in wiki format is available here:<br />
<br />
[[Garli_FAQ | FAQ]]<br />
<br />
[[GARLI_configuration_settings | Configuration settings]]<br />
<br />
[[Garli_using_partitioned_models | Using partitioned models]]<br />
<br />
<br />
<br />
<br />
Welcome to the GARLI support wiki!<br />
What is GARLI?<br />
<br />
GARLI is a program that performs phylogenetic inference using the maximum-likelihood criterion. Several sequence types are supported, including nucleotide, amino acid and codon. Version 2.0 adds support for partitioned models and morphology-like datatypes. It is usable on all operating systems, and is written and maintained by Derrick Zwickl (zwickl{at}mail.arizona{dot}edu or garli.support{at}gmail{dot}com).<br />
<br />
Obtaining GARLI<br />
Current Version 2.01<br />
<br />
You can download GARLI for Windows or Mac OS X (or get the source code) from Google Code: http://garli.googlecode.com<br />
<br />
<br />
Documentation and support for GARLI<br />
<br />
There are a number of options:<br />
<br />
This [[Garli_wiki| wiki]] contains an online [[Garli_Manual]] which contains a full description of the GARLI Configuration Settings, an FAQ and a Brief_tutorial. Lots of other information appears on the wiki as well.<br />
<br />
Questions may be posted to the GARLI user forum through the [http://groups.google.com/group/garli_users garli_users] Google group.<br />
You may also email questions and problems to garli{dot}support{at}gmail{dot}com <br />
<br />
<br />
Known Issues<br />
<br />
There are a few known issues. See them here : Known issues with 2.0<br />
Citing GARLI<br />
<br />
If you use GARLI please cite<br />
<br />
Zwickl, D. J., 2006. Genetic algorithm approaches for the phylogenetic analysis of large <br />
biological sequence datasets under the maximum likelihood criterion. Ph.D. dissertation,<br />
The University of Texas at Austin.<br />
<br />
and/or the download website:<br />
http://garli.googlecode.com</div>Zwicklhttps://molevol.mbl.edu/index.php?title=Garli_wiki&diff=4395Garli wiki2015-07-21T17:03:08Z<p>Zwickl: </p>
<hr />
<div><br />
useful GARLI documentation in wiki format is available here:<br />
<br />
[[Garli_FAQ | FAQ]]<br />
<br />
[[GARLI_configuration_settings | Configuration settings]]<br />
<br />
[[Garli_using_partitioned_models | Using partitioned models]]<br />
<br />
<br />
<br />
<br />
Welcome to the GARLI support wiki!<br />
What is GARLI?<br />
<br />
GARLI is a program that performs phylogenetic inference using the maximum-likelihood criterion. Several sequence types are supported, including nucleotide, amino acid and codon. Version 2.0 adds support for partitioned models and morphology-like datatypes. It is usable on all operating systems, and is written and maintained by Derrick Zwickl (zwickl{at}mail.arizona{dot}edu or garli.support{at}gmail{dot}com).<br />
<br />
Obtaining GARLI<br />
Current Version 2.01<br />
<br />
You can download GARLI for Windows or Mac OS X (or get the source code) from Google Code: http://garli.googlecode.com<br />
<br />
<br />
Documentation and support for GARLI<br />
<br />
There are a number of options:<br />
<br />
This [[Garli_wiki| wiki]] contains an online [[Garli_Manual] which contains a full description of the GARLI Configuration Settings, an FAQ and a Brief_tutorial. Lots of other information appears on the wiki as well.<br />
<br />
Questions may be posted to the GARLI user forum through the [http://groups.google.com/group/garli_users garli_users] Google group.<br />
You may also email questions and problems to garli{dot}support{at}gmail{dot}com <br />
<br />
<br />
Known Issues<br />
<br />
There are a few known issues. See them here : Known issues with 2.0<br />
Citing GARLI<br />
<br />
If you use GARLI please cite<br />
<br />
Zwickl, D. J., 2006. Genetic algorithm approaches for the phylogenetic analysis of large <br />
biological sequence datasets under the maximum likelihood criterion. Ph.D. dissertation,<br />
The University of Texas at Austin.<br />
<br />
and/or the download website:<br />
http://garli.googlecode.com</div>Zwicklhttps://molevol.mbl.edu/index.php?title=Software&diff=4394Software2015-07-21T16:51:53Z<p>Zwickl: /* Phylogenetic tree building and analysis */</p>
<hr />
<div>==Alignment==<br />
*[http://faculty.biomath.ucla.edu/msuchard/bali-phy/index.php Bali-Phy]<br />
*[ftp://ftp.pasteur.fr/pub/GenSoft/projects/BMGE/ BMGE]<br />
*[http://fsa.sourceforge.net/ FSA]<br />
*[http://molevol.cmima.csic.es/castresana/Gblocks.html Gblocks]<br />
*[http://guidance.tau.ac.il/ Guidance]<br />
*[http://www.jalview.org/ Jalview]<br />
*[http://mafft.cbrc.jp/alignment/software/ MAFFT]<br />
*[http://www.megasoftware.net/ MEGA]<br />
*[http://www.drive5.com/muscle/ MUSCLE]<br />
*[http://opal.cs.arizona.edu/ Opal]<br />
*[http://www.ebi.ac.uk/goldman-srv/prank/ PRANK]<br />
*[http://biowiki.org/bin/view/Main/ProtPal ProtPal]<br />
*[http://pbil.univ-lyon1.fr/software/seaview.html SeaView]<br />
*[http://www.tcoffee.org/ T-Coffee]<br />
<br />
<br />
==Phylogenetic tree building and analysis==<br />
* [http://beast.bio.ed.ac.uk/Main_Page BEAST] - software package includes BEAST, BEAUti, LogCombiner, TreeAnnotator<br />
* [http://www.stat.osu.edu/~dkp/BEST/introduction/ BEST]<br />
* [http://pythonhosted.org/DendroPy/ DendroPy]<br />
* [http://phylo.bio.ku.edu/content/tracy-heath-dppdiv DPPDiv]<br />
*[http://www.microbesonline.org/fasttree/ FastTree]<br />
* [http://tree.bio.ed.ac.uk/software/figtree/ FigTree]<br />
* [http://garli.googlecode.com/ Garli]<br />
**[[Garli_wiki | Documentation wiki ]]<br />
**[http://groups.google.com/group/garli_users/ Google users group]<br />
**[http://phylo.bio.ku.edu/slides/GarliDemo/garliExercise.WH2014.html Workshop tutorial] '''(NOTE: The tutorial bundle linked on this page contains everything you need - you don't need to download the program separately!)'''<br />
* MP-EST<br />
** [http://code.google.com/p/mp-est/ Executables, code]<br />
** [http://bioinformatics.publichealth.uga.edu/SpeciesTreeAnalysis/index.php Species tree web server]<br />
* [http://www.mrbayes.net MrBayes]<br />
* [http://statgen.ncsu.edu/thorne/multidivtime.html multidivtime]<br />
* [http://abacus.gene.ucl.ac.uk/software/paml.html PAML]<br />
* [http://people.sc.fsu.edu/~dswofford/paup_test PAUP*]<br />
* [http://code.google.com/p/phybase/ PHYBASE]<br />
* [http://pbil.univ-lyon1.fr/software/phyldog/ PHYLDOG]<br />
* [http://www.phylobayes.org PhyloBayes]<br />
*[http://sco.h-its.org/exelixis/software.html RAxML (and many other programs)]<br />
* [http://sourceforge.net/projects/revbayes/ RevBayes]<br />
*[http://kiwi.cs.dal.ca/Software/RSPR rSPR]<br />
*[http://kiwi.cs.dal.ca/Software/SPRSupertrees SPRSupertrees]<br />
* [http://www.stat.osu.edu/~lkubatko/software/STEM/ STEM]<br />
* [http://tree.bio.ed.ac.uk/software/tracer/ Tracer]<br />
*[http://evolution.gs.washington.edu/phylip/software.html Comprehensive list of Phylogeny programs]<br />
<br />
==Other==<br />
* [http://people.sc.fsu.edu/~pbeerli/bugs_in_a_box.tar.gz Bugs in a Box]: A Macintosh program and its (python) source code to show the coalescence process (but still does not draw a tree).<br />
*[[Media:MCMCEG.zip | MCMC example software]] from [[John_Huelsenbeck | John Huelsenbeck]]<br />
<br />
==Pipelines==<br />
*[https://bitbucket.org/caseywdunn/agalma Agalma]<br />
<br />
==Population analysis==<br />
* LAMARC: (If you want to know more about Lamarc talk to [[Peter Beerli]])<br />
**[http://evolution.genetics.washington.edu/lamarc/index.html Lamarc] main website: Download and manual<br />
**[[Lamarc tutorial]]<br />
* MIGRATE: Demonstration and Tutorial on August 4 ([[Peter Beerli]])<br />
** [http://popgen.sc.fsu.edu Migrate main website]: Download, Manual, Blog/Tutorials, Information on speed, citation of MIGRATE in the literature.<br />
** [[Migrate tutorial]]: Tutorial for the course 2014 (an older version can be found here [http://popgen.sc.fsu.edu/Migrate/Tutorials/Entries/2010/7/12_Day_of_longboarding.html tutorial] on the [http://popgen.sc.fsu.edu/Migrate/Tutorials/Tutorials.html Migrate tutorial website]) <br />
** [http://groups.google.com/group/migrate-support?lnk=iggc Migrate support google Group]<br />
<br />
==Similarity searching==<br />
* [http://blast.ncbi.nlm.nih.gov/Blast.cgi Blast]<br />
* [http://fasta.bioch.virginia.edu/fasta_www2/fasta_list2.shtml Fasta]<br />
*[http://www.drive5.com/usearch/ USEARCH]</div>Zwicklhttps://molevol.mbl.edu/index.php?title=Software&diff=4393Software2015-07-21T16:51:28Z<p>Zwickl: /* Phylogenetic tree building and analysis */</p>
<hr />
<div>==Alignment==<br />
*[http://faculty.biomath.ucla.edu/msuchard/bali-phy/index.php Bali-Phy]<br />
*[ftp://ftp.pasteur.fr/pub/GenSoft/projects/BMGE/ BMGE]<br />
*[http://fsa.sourceforge.net/ FSA]<br />
*[http://molevol.cmima.csic.es/castresana/Gblocks.html Gblocks]<br />
*[http://guidance.tau.ac.il/ Guidance]<br />
*[http://www.jalview.org/ Jalview]<br />
*[http://mafft.cbrc.jp/alignment/software/ MAFFT]<br />
*[http://www.megasoftware.net/ MEGA]<br />
*[http://www.drive5.com/muscle/ MUSCLE]<br />
*[http://opal.cs.arizona.edu/ Opal]<br />
*[http://www.ebi.ac.uk/goldman-srv/prank/ PRANK]<br />
*[http://biowiki.org/bin/view/Main/ProtPal ProtPal]<br />
*[http://pbil.univ-lyon1.fr/software/seaview.html SeaView]<br />
*[http://www.tcoffee.org/ T-Coffee]<br />
<br />
<br />
==Phylogenetic tree building and analysis==<br />
* [http://beast.bio.ed.ac.uk/Main_Page BEAST] - software package includes BEAST, BEAUti, LogCombiner, TreeAnnotator<br />
* [http://www.stat.osu.edu/~dkp/BEST/introduction/ BEST]<br />
* [http://pythonhosted.org/DendroPy/ DendroPy]<br />
* [http://phylo.bio.ku.edu/content/tracy-heath-dppdiv DPPDiv]<br />
*[http://www.microbesonline.org/fasttree/ FastTree]<br />
* [http://tree.bio.ed.ac.uk/software/figtree/ FigTree]<br />
* [http://garli.googlecode.com/ Garli]<br />
**[Garli_wiki | Documentation wiki ]<br />
**[http://groups.google.com/group/garli_users/ Google users group]<br />
**[http://phylo.bio.ku.edu/slides/GarliDemo/garliExercise.WH2014.html Workshop tutorial] '''(NOTE: The tutorial bundle linked on this page contains everything you need - you don't need to download the program separately!)'''<br />
* MP-EST<br />
** [http://code.google.com/p/mp-est/ Executables, code]<br />
** [http://bioinformatics.publichealth.uga.edu/SpeciesTreeAnalysis/index.php Species tree web server]<br />
* [http://www.mrbayes.net MrBayes]<br />
* [http://statgen.ncsu.edu/thorne/multidivtime.html multidivtime]<br />
* [http://abacus.gene.ucl.ac.uk/software/paml.html PAML]<br />
* [http://people.sc.fsu.edu/~dswofford/paup_test PAUP*]<br />
* [http://code.google.com/p/phybase/ PHYBASE]<br />
* [http://pbil.univ-lyon1.fr/software/phyldog/ PHYLDOG]<br />
* [http://www.phylobayes.org PhyloBayes]<br />
*[http://sco.h-its.org/exelixis/software.html RAxML (and many other programs)]<br />
* [http://sourceforge.net/projects/revbayes/ RevBayes]<br />
*[http://kiwi.cs.dal.ca/Software/RSPR rSPR]<br />
*[http://kiwi.cs.dal.ca/Software/SPRSupertrees SPRSupertrees]<br />
* [http://www.stat.osu.edu/~lkubatko/software/STEM/ STEM]<br />
* [http://tree.bio.ed.ac.uk/software/tracer/ Tracer]<br />
*[http://evolution.gs.washington.edu/phylip/software.html Comprehensive list of Phylogeny programs]<br />
<br />
==Other==<br />
* [http://people.sc.fsu.edu/~pbeerli/bugs_in_a_box.tar.gz Bugs in a Box]: A Macintosh program and its (python) source code to show the coalescence process (but still does not draw a tree).<br />
*[[Media:MCMCEG.zip | MCMC example software]] from [[John_Huelsenbeck | John Huelsenbeck]]<br />
<br />
==Pipelines==<br />
*[https://bitbucket.org/caseywdunn/agalma Agalma]<br />
<br />
==Population analysis==<br />
* LAMARC: (If you want to know more about Lamarc talk to [[Peter Beerli]])<br />
**[http://evolution.genetics.washington.edu/lamarc/index.html Lamarc] main website: Download and manual<br />
**[[Lamarc tutorial]]<br />
* MIGRATE: Demonstration and Tutorial on August 4 ([[Peter Beerli]])<br />
** [http://popgen.sc.fsu.edu Migrate main website]: Download, Manual, Blog/Tutorials, Information on speed, citation of MIGRATE in the literature.<br />
** [[Migrate tutorial]]: Tutorial for the course 2014 (an older version can be found here [http://popgen.sc.fsu.edu/Migrate/Tutorials/Entries/2010/7/12_Day_of_longboarding.html tutorial] on the [http://popgen.sc.fsu.edu/Migrate/Tutorials/Tutorials.html Migrate tutorial website]) <br />
** [http://groups.google.com/group/migrate-support?lnk=iggc Migrate support google Group]<br />
<br />
==Similarity searching==<br />
* [http://blast.ncbi.nlm.nih.gov/Blast.cgi Blast]<br />
* [http://fasta.bioch.virginia.edu/fasta_www2/fasta_list2.shtml Fasta]<br />
*[http://www.drive5.com/usearch/ USEARCH]</div>Zwicklhttps://molevol.mbl.edu/index.php?title=Derrick_Zwickl&diff=4392Derrick Zwickl2015-07-21T16:50:02Z<p>Zwickl: /* Other GARLI information */</p>
<hr />
<div>University of Arizona<br />
<br />
zwickl@email.arizona.edu<br />
<br />
[[File:Zwickl&son.JPG|300px|thumb|right|Derrick and offspring]]<br />
<br />
Derrick won't be able to attend the course for health reasons. Mark Holder will take over his presentation and lab<br />
<br />
==GARLI at Woods Hole==<br />
Presentation and exercises:<br />
<br />
'''Lecture: [http://phylo.bio.ku.edu/woodshole/Zwickl-WoodsHole2015-lecture.pdf Zwickl-WoodsHole2015-lecture.pdf]'''<br />
<br />
'''NOTE: The tutorial bundle linked on the page below contains everything you need - you don't need to download the program separately!'''<br />
<br />
'''Everything you need for the GARLI computer exercises is here [http://phylo.bio.ku.edu/slides/GarliDemo/garliExercise.WH2015.html Computer exercise]'''<br />
<br />
==Other GARLI information==<br />
*[[garli_wiki | Garli documentation wiki]]<br />
*[http://garli.googlecode.com/ Program download] (migrating to [https://github.com/zwickl/garli GitHub])<br />
*[http://groups.google.com/group/garli_users/ Google users group]<br />
<br />
==Time at MBL==<br />
Absent due to health.</div>Zwicklhttps://molevol.mbl.edu/index.php?title=Derrick_Zwickl&diff=4391Derrick Zwickl2015-07-21T16:49:18Z<p>Zwickl: </p>
<hr />
<div>University of Arizona<br />
<br />
zwickl@email.arizona.edu<br />
<br />
[[File:Zwickl&son.JPG|300px|thumb|right|Derrick and offspring]]<br />
<br />
Derrick won't be able to attend the course for health reasons. Mark Holder will take over his presentation and lab<br />
<br />
==GARLI at Woods Hole==<br />
Presentation and exercises:<br />
<br />
'''Lecture: [http://phylo.bio.ku.edu/woodshole/Zwickl-WoodsHole2015-lecture.pdf Zwickl-WoodsHole2015-lecture.pdf]'''<br />
<br />
'''NOTE: The tutorial bundle linked on the page below contains everything you need - you don't need to download the program separately!'''<br />
<br />
'''Everything you need for the GARLI computer exercises is here [http://phylo.bio.ku.edu/slides/GarliDemo/garliExercise.WH2015.html Computer exercise]'''<br />
<br />
==Other GARLI information==<br />
*[[garli_wiki]]<br />
*[http://garli.googlecode.com/ Program download] (migrating to [https://github.com/zwickl/garli GitHub])<br />
*[http://groups.google.com/group/garli_users/ Google users group]<br />
<br />
==Time at MBL==<br />
Absent due to health.</div>Zwicklhttps://molevol.mbl.edu/index.php?title=Derrick_Zwickl&diff=4390Derrick Zwickl2015-07-21T16:48:44Z<p>Zwickl: </p>
<hr />
<div>University of Arizona<br />
<br />
zwickl@email.arizona.edu<br />
<br />
[[File:Zwickl&son.JPG|300px|thumb|right|Derrick and offspring]]<br />
<br />
Derrick won't be able to attend the course for health reasons. Mark Holder will take over his presentation and lab<br />
<br />
==GARLI at Woods Hole==<br />
Presentation and exercises:<br />
<br />
'''Lecture: [http://phylo.bio.ku.edu/woodshole/Zwickl-WoodsHole2015-lecture.pdf Zwickl-WoodsHole2015-lecture.pdf]'''<br />
<br />
'''NOTE: The tutorial bundle linked on the page below contains everything you need - you don't need to download the program separately!'''<br />
<br />
'''Everything you need for the GARLI computer exercises is here [http://phylo.bio.ku.edu/slides/GarliDemo/garliExercise.WH2015.html Computer exercise]'''<br />
<br />
==Other GARLI information==<br />
*[garli_wiki] (currently served via a web archiving service, soon to be on GitHub)<br />
*[http://garli.googlecode.com/ Program download] (migrating to [https://github.com/zwickl/garli GitHub])<br />
*[http://groups.google.com/group/garli_users/ Google users group]<br />
<br />
==Time at MBL==<br />
Absent due to health.</div>Zwicklhttps://molevol.mbl.edu/index.php?title=Garli_wiki&diff=4389Garli wiki2015-07-21T16:45:45Z<p>Zwickl: </p>
<hr />
<div><br />
useful GARLI documentation in wiki format is available here:<br />
<br />
[[Garli_FAQ | FAQ]]<br />
<br />
[[GARLI_configuration_settings | Configuration settings]]<br />
<br />
[[Garli_using_partitioned_models | Using partitioned models]]</div>Zwicklhttps://molevol.mbl.edu/index.php?title=Garli_using_partitioned_models&diff=4388Garli using partitioned models2015-07-21T16:41:50Z<p>Zwickl: </p>
<hr />
<div>==Configuring a partitioned analysis==<br />
Not surprisingly, setting up partitioned models is more complicated than normal GARLI usage, since you need to tell the program how to divide your data and what models to apply. Deciding how to choose the models is also a complex issue.<br />
<br />
'''NOTE:''' If you are the kind of person who would rather just try your hand at running partitioned models without reading all of this first, you'll find example runs and template configuration files in the example/partition/ directory of any GARLI distribution. I'd still suggest reading this at some point to be sure that you understand your options for configuration.<br />
<br />
===Dividing up the data===<br />
Note that I use the technically correct (but often misused) definition of a partition. A partition is a scheme for dividing something up. It does the dividing, like a partition in a room. The individual chunks of data that are created by the partition are referred to as subsets.<br />
<br />
This version requires NEXUS formatted datafiles, and the partitioning is specified via standard NEXUS commands appearing in a sets or assumptions block in the same file as the data matrix. The setup of the actual models will come later. For a dataset with 2178 characters in a single data or characters block, it would look like this:<br />
<pre><br />
begin sets;<br />
charset ND2 = 1-726;<br />
charset rbcl = 727-1452;<br />
charset 16S = 1453-2178;<br />
charpartition byGene = chunk1:ND2, chunk2:rbcl, chunk3:16S; <br />
<br />
[you could also put character exclusions here by removing the []'s from the line below]<br />
[note that the excluded sites should still appear in the charpartition, however]<br />
[exset * myexclusions = 600-800, 850, 900-1000;]<br />
end;<br />
</pre><br />
<br />
The above block would divide up the sites into three sets of 726 characters each. <br />
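<br />
If you have many genes, a block like the one above can be generated rather than typed by hand. The following is only a minimal sketch: the function name <code>sets_block</code> and the tuple layout are assumptions made for illustration and are not part of GARLI.<br />

```python
def sets_block(genes, partition_name="byGene"):
    """Build a NEXUS sets block from (name, first_site, last_site) tuples.

    Hypothetical helper for illustration; GARLI only reads the resulting text.
    """
    lines = ["begin sets;"]
    for name, first, last in genes:
        lines.append(f"charset {name} = {first}-{last};")
    # One partition subset per charset, named chunk1, chunk2, ...
    subsets = ", ".join(
        f"chunk{i}:{name}" for i, (name, _, _) in enumerate(genes, start=1)
    )
    lines.append(f"charpartition {partition_name} = {subsets};")
    lines.append("end;")
    return "\n".join(lines)


genes = [("ND2", 1, 726), ("rbcl", 727, 1452), ("16S", 1453, 2178)]
print(sets_block(genes))
```

The printed text can be pasted into the same NEXUS file as the data matrix.<br />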
<br />
To put charsets ND2 and rbcl in a single partition subset, the charpartition command would look like this<br />
<pre><br />
charpartition bySites = chunk1:ND2 rbcl, chunk2:16S; <br />
</pre><br />
Note the space rather than comma between ND2 and rbcl.<br />
<br />
<br />
The names are unimportant here. The general format is:<br />
<br />
charset <charset name> = <list or range of sites>;<br />
charset <charset name> = <list or range of sites>;<br />
charpartition <charpartition name> = (cont.)<br />
<partition subset 1 name>:<sites or charset making up 1st subset>, (cont.)<br />
 <partition subset 2 name>:<sites or charset making up 2nd subset>, <etc>;<br />
<br />
To easily specify charsets that divide characters up by codon position, do this:<br />
charset 1stpos = 1-2178\3;<br />
charset 2ndpos = 2-2178\3;<br />
charset 3rdpos = 3-2178\3;<br />
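<br />
To see which sites these three charsets select, the every-third-site notation can be expanded in a few lines of Python. This is just an illustrative sketch; <code>codon_position_sites</code> is a made-up name, and the 2178-site length is taken from the example above.<br />

```python
def codon_position_sites(start, end, step=3):
    """Expand a range such as 1-2178 taken every third site into site numbers."""
    return list(range(start, end + 1, step))


# Equivalent of the three charsets above (sites are 1-based, as in NEXUS)
first = codon_position_sites(1, 2178)
second = codon_position_sites(2, 2178)
third = codon_position_sites(3, 2178)

print(first[:3], second[:3], third[:3])
print(len(first), len(second), len(third))
```

Together the three lists cover every site exactly once, which is what a charpartition by codon position requires.<br />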
<br />
Note that if a charpartition appears, GARLI will AUTOMATICALLY apply a partitioned model. If you don't want that for some runs, remove or comment out (surround it with [ ]) the charpartition command.<br />
<br />
Note also that GARLI will automatically partition if it sees multiple characters blocks, so that is an alternative to using a charpartition.<br />
<br />
===Specifying the models===<br />
DO SOME SORT OF MODEL TESTING! The parameter estimates under partitioned models are currently somewhat erratic if the models are over-parameterized. Use ModelTest or some other means for finding the best model for each data subset. Note that the best model for each subset separately is not necessarily the best when they are combined in a partitioned model, but they will give a useful measure of which parameters are justified in each subset.<br />
<br />
As usual for GARLI, the models are specified in the configuration file. If you aren't familiar with the normal way that models are configured in GARLI, see the general info in the manual '''[[GARLI_Configuration_Settings#Model_specification_settings|here]]''', and FAQ entry '''[[FAQ#MODELTEST_told_me_to_use_model_X._How_do_I_set_that_up_in_GARLI.3F|here]]'''.<br />
<br />
There are two new configuration entries that relate to partitioned models:<br />
linkmodels = 0 or 1<br />
subsetspecificrates = 0 or 1<br />
<br />
'''linkmodels''' means to use a single set of model parameters for all subsets.<br />
<br />
'''subsetspecificrates''' means to infer overall rate multipliers for each data subset. This is equivalent to ''prset ratepr=variable'' in MrBayes.<br />
<br />
So, there are various combinations here:<br />
{| border="1"<br />
|-<br />
!'''linkmodels'''!!'''subsetspecificrates'''!!'''meaning '''<br />
|-<br />
|align="center" | 0 || align="center" | 0 || different models, branch lengths equal<br />
|-<br />
|align="center" | 0 || align="center" | 1 || different models, different subset rates <br />
|-<br />
|align="center" | 1 || align="center" | 0 || single model, one set of branch lengths (equivalent to non-partitioned analysis)<br />
|-<br />
|align="center" | 1 || align="center" | 1 || single model, different subset rates (like site-specific rates model in PAUP*)<br />
|}<br />
<br />
The normal model configuration entries are the following, with the defaults in '''bold''':<br />
<br />
datatype = '''nucleotide''', aminoacid, codon-aminoacid or codon<br />
 ratematrix = '''6rate''', 2rate, 1rate, or other matrix spec. like this: ( a, b, c, d, e, f )<br />
statefrequencies = '''estimate''', equal, empirical, (+others for aminoacids or codons. See manual)<br />
ratehetmodel = '''gamma''', none<br />
numratecats = '''4''', 1-20 (must be 1 if ratehetmodel = none, must be > 1 if ratehetmodel = gamma) <br />
invariantsites = '''estimate''', none<br />
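<br />
The constraint between ratehetmodel and numratecats is easy to get wrong when editing several model blocks by hand. The sketch below checks only the rule stated above, for the two ratehetmodel values listed here; <code>check_rate_het</code> is a hypothetical helper, not something GARLI provides, and GARLI does its own checking when it reads the configuration.<br />

```python
def check_rate_het(ratehetmodel, numratecats):
    """Return an error message for an inconsistent pair, or None if it is fine.

    Covers only the rule quoted in the text for ratehetmodel = gamma or none.
    """
    if not 1 <= numratecats <= 20:
        return "numratecats must be between 1 and 20"
    if ratehetmodel == "none" and numratecats != 1:
        return "numratecats must be 1 when ratehetmodel = none"
    if ratehetmodel == "gamma" and numratecats < 2:
        return "numratecats must be > 1 when ratehetmodel = gamma"
    return None


print(check_rate_het("gamma", 4))
print(check_rate_het("none", 4))
```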
<br />
If you leave these as is, set linkmodels = 0 and have a charpartition defined in the datafile, each subset will automatically be assigned a separate unlinked version of GTR+I+G. In that case there is nothing else to be done. You can start your run.<br />
<br />
If you want different models for each subset you need to add a set of model settings for each, with a specific heading name in []'s. The headings need to be [model1], [model2], etc., and are assigned to the subsets in order. The number of configuration sets must match the number of data subsets.<br />
<br />
For example, the following would assign the GTR+G and HKY models to the first and second data subsets.<br />
<pre><br />
[model1]<br />
datatype = nucleotide<br />
ratematrix = 6rate<br />
statefrequencies = estimate<br />
ratehetmodel = gamma<br />
numratecats = 4<br />
invariantsites = none<br />
<br />
[model2]<br />
datatype = nucleotide<br />
ratematrix = 2rate<br />
statefrequencies = estimate<br />
ratehetmodel = none<br />
numratecats = 1<br />
invariantsites = none<br />
</pre><br />
<br />
These should appear in place of the normal set of model configuration settings, and should be placed just before the [master] heading of the configuration file.<br />
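<br />
With many subsets, writing the numbered model blocks by hand invites numbering mistakes. The following sketch renders them from a list of settings, so the [model1], [model2], ... headings are always consecutive. The names <code>MODEL_KEYS</code> and <code>model_sections</code> are assumptions for illustration; the setting names and values are the ones shown above.<br />

```python
MODEL_KEYS = ["datatype", "ratematrix", "statefrequencies",
              "ratehetmodel", "numratecats", "invariantsites"]


def model_sections(models):
    """Render per-subset settings dicts as [model1], [model2], ... blocks."""
    blocks = []
    for i, settings in enumerate(models, start=1):
        lines = [f"[model{i}]"]  # headings must be numbered consecutively from 1
        lines += [f"{key} = {settings[key]}" for key in MODEL_KEYS]
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks)


# The GTR+G and HKY settings from the example above
gtr_g = {"datatype": "nucleotide", "ratematrix": "6rate",
         "statefrequencies": "estimate", "ratehetmodel": "gamma",
         "numratecats": 4, "invariantsites": "none"}
hky = {"datatype": "nucleotide", "ratematrix": "2rate",
       "statefrequencies": "estimate", "ratehetmodel": "none",
       "numratecats": 1, "invariantsites": "none"}

print(model_sections([gtr_g, hky]))
```

The list passed in must have one entry per data subset, in the same order as the subsets appear in the charpartition.<br />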
<br />
If you're not sure how to specify a given model, see the FAQ entry '''[[FAQ#MODELTEST_told_me_to_use_model_X._How_do_I_set_that_up_in_GARLI.3F|here]]'''.<br />
<br />
NOTE THAT THE BRACKETED LINES BEFORE EACH MODEL DESCRIPTION ARE '''NOT''' COMMENTS, AND MUST APPEAR AS ABOVE WITH CONSECUTIVE MODEL NUMBERING STARTING AT ONE!<br />
<br />
===Choosing partitioning schemes===<br />
====How many partition subsets can I use?====<br />
There is no hard limit that is enforced on the number of subsets that you '''''can''''' use. I've done more than 60 myself. See below for considerations about how many you '''''should''''' use.<br />
<br />
====What do I need to consider when choosing models and a partitioning scheme?====<br />
That is a difficult question, and one that is not well investigated in ML phylogenetics (as opposed to Bayesian phylogenetics). I definitely suggest that you do not partition overly finely, as the likelihood surface becomes difficult to optimize. Keep in mind that partitioning finely may create subsets with very few changes, and therefore little signal. This makes the parameter likelihood surfaces very flat and difficult to optimize. This is particularly true when bootstrapping, which can further reduce signal by creating some resampled datasets with even fewer variable sites.<br />
<br />
'''NOTE:''' Do not assume that the partitioning scheme and model choice method that you use in MrBayes or another Bayesian method are appropriate in a maximum likelihood context! This is primarily because the Bayesian way to deal with nuisance parameters is to marginalize over them, meaning to account for the shape of the entire likelihood surface. This effectively integrates out uncertainty in parameter estimates, even if the likelihood surface is very flat and those estimates are very uncertain. <br />
<br />
In contrast, ML analyses seek to maximize the likelihood with respect to nuisance parameters. If the likelihood surface is very flat and suggests that a parameter value could almost equally well lie anywhere between 1.0 and 3.0, an ML method will still return the value at the very peak of the distribution (let's say 2.329), even if it is only a fraction more likely than surrounding values. Another reason that Bayesian methods can be more appropriate for analyzing data with little signal is that prior distributions can be used to provide some outside information and keep parameter estimates reasonable. However, informative priors are not used that often in standard phylogenetic analyses.<br />
<br />
====Ok, then how should I partition and choose models in practice?====<br />
'''NOTE:''' The following assumes that subset specific rates ARE being estimated, i.e., subsetspecificrates = 1.<br />
<br />
Consider the following procedure that I've used, from a real dataset of 2 genes. It is fairly complicated, but I think as statistically rigorous as can be expected. We'll assume that the smallest subsets will be by codon position, thus there are 6 potential subsets. We'd like to find the most appropriate scheme to use for analyzing each gene separately, as well as both concatenated. But what model should be chosen for the subsets, and how should the subsets be constructed? <br />
<br />
1. First, construct all reasonable sets of sites for model testing. This amounts to:<br />
:* Each codon position of each gene individually (6)<br />
:* The full concatenated alignment (1)<br />
:* The full sequence of each gene (2)<br />
:* The first and second positions combined for each gene (2)<br />
:* The first, second and third positions pooled across the genes (3)<br />
Note that this list doesn't need to be exhaustive. I've omitted various combinations that I find unlikely ''a priori'', for example a combination of first and third positions.<br />
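<br />
The list in step 1 can also be written down programmatically, which helps when there are more than two genes. This is a rough sketch under the same assumptions as the text (two genes, three codon positions); the labels <code>gene1</code> and <code>gene2</code> are placeholders.<br />

```python
genes = ["gene1", "gene2"]
positions = [1, 2, 3]

# Each candidate is a list of (gene, codon position) pairs to be model-tested together
candidates = []
# Each codon position of each gene individually (6)
candidates += [[(g, p)] for g in genes for p in positions]
# The full concatenated alignment (1)
candidates.append([(g, p) for g in genes for p in positions])
# The full sequence of each gene (2)
candidates += [[(g, p) for p in positions] for g in genes]
# The first and second positions combined for each gene (2)
candidates += [[(g, 1), (g, 2)] for g in genes]
# The first, second and third positions pooled across the genes (3)
candidates += [[(g, p) for g in genes] for p in positions]

print(len(candidates))
for sites in candidates:
    print(sites)
```

Each entry corresponds to one alignment on which to run unpartitioned model testing in step 2.<br />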
<br />
<br />
2. Now, use unpartitioned models and the AIC criterion with ModelTest or a similar procedure to choose the "best" model for each subset above. These will be the models applied to each of the subsets as we move to partitioned models. The following were the results in my case. If you aren't very familiar with the typical models, the information here: ('''[[GARLI_Configuration_Settings#Model_specification_settings|Model configuration]]''') and some of the items here: ('''[[FAQ#Model_choices| FAQ:Model choices]]''') may be helpful.<br />
<br />
<br />
{|border="1"<br />
|-bgcolor=grey<br />
! width="120" | Alignment<br />
! width="120" | Full<br />
! width="120" | 1st Pos<br />
! width="120" | 2nd Pos<br />
! width="120" | 3rd Pos<br />
! width="120" | 1st Pos + 2nd Pos<br />
|-<br />
| align="center" | Gene 1 || align="center" | GTRIG || align="center" | GTRIG || align="center" | TVMIG || align="center" | TVMIG || align="center" | TVMIG<br />
|-<br />
| align="center" | Gene 2 || align="center" | GTRIG || align="center" | GTRIG || align="center" | GTRIG || align="center" | TVMIG || align="center" | TVMIG<br />
|-<br />
| align="center" | Concat || align="center" | GTRIG || align="center" | GTRIG || align="center" | GTRIG || align="center" | TVMIG || align="center" | TVMIG<br />
|}<br />
<br />
<br />
3. Now, create partitioning schemes and configure GARLI partitioned analyses with each subset using the model chosen for it in the previous step. When choosing the scheme for Gene 1 analyses, the combinations would be each position separately, which I'll denote (1)(2)(3), the three positions together (i.e., unpartitioned or (123) ), and the first and second positions grouped, (12)(3). For the concatenated alignment there are many more combinations, not all of which need to necessarily be considered. The models assigned within subsets are from the above table, for example, when partitioning Gene 1 by codon position, the chosen models were GTRIG, TVMIG and TVMIG for the subsets of first, second and third positions respectively.<br />
<br />
Now run full GARLI analyses with four search reps (or at least two) for each of these partitioning schemes, and note the best log-likelihood across the search replicates for each partitioning scheme. Note that you could also do analyses giving GARLI a fixed tree to optimize on, which is actually the more common way that model choice is done. See [[FAQ#How_do_I_fully_constrain_.28i.e..2C_fix.29_the_tree_topology.3F | here]] for information on fixing the tree topology.<br />
<br />
<br />
4. Now, note the number of model parameters that were estimated in each partitioning scheme. For example, when partitioning gene 1 by codon position, the chosen models were GTRIG, TVMIG and TVMIG, which have 10, 9 and 9 free parameters respectively. (GTRIG has 10 parameters because 5 free relative rates + 3 free base frequencies + 1 invariable sites parameter + 1 alpha shape parameter of the gamma distribution = 10). In addition, subset specific rate multipliers were estimated, which adds (#subsets - 1) additional parameters, bringing the total to 30. Use the log-likelihood score and parameter count to calculate the AIC score for each partitioning scheme as <br />
AIC = 2 x (# parameters - lnL).<br />
<br />
5. Finally, for each alignment compare the AIC scores of each partitioning scheme and choose the lowest value as the best. See the below table for my results. In this example (GTRIG)(TVMIG)(TVMIG) was chosen for the first gene and (GTRIG)(GTRIG)(TVMIG) was chosen for the second gene. The concatenated alignment result was less obvious ''a priori'', with 4 subsets. First positions shared a subset with GTRIG, second positions each had their own model with TVMIG for gene 1 and GTRIG for gene 2, and the third positions shared a subset with TVMIG. <br />
The AIC scores are summarized in the following table:<br />
<br />
{|border="1"<br />
|-bgcolor=grey<br />
! width="100" | Alignment<br />
! width="100" | Partition<br />
! width="100" | lnL<br />
! width="100" |Parameters<br />
! width="100" |AIC<br />
|-<br />
|align="center"| Gene 1 || align="center" | (1)(2)(3) || align="center" | -22698.82 || align="center" | 30 || align="center" | 45457.64<br />
|-<br />
|align="center"| || align="center" | (12)(3) || align="center" | -22817.84 || align="center" | 19 || align="center" | 45673.68<br />
|-<br />
|align="center"| || align="center" |(123) || align="center" | -23524.92 || align="center" | 10 || align="center" | 47069.84<br />
|-<br />
|align="center"| || align="center" | || align="center" | || align="center" | ||<br />
|-<br />
|align="center"| Gene 2 || align="center" | (1)(2)(3) || align="center" | -24537.93 || align="center" | 31 || align="center" | 49137.85<br />
|-<br />
|align="center"| || align="center" | (12)(3) || align="center" | -24639.11 || align="center" | 19 || align="center" | 49316.22<br />
|-<br />
|align="center"| || align="center" | (123) || align="center" | -25353.78 || align="center" | 10 || align="center" | 50727.55<br />
|-<br />
|align="center"| || align="center" | || align="center" | || align="center" | ||<br />
|-<br />
|align="center"|Concat || align="center" | (11)(2)(2)(33) || align="center" | -47384.54 || align="center" | 42 || align="center" | 94853.08<br />
|-<br />
|align="center"| || align="center" | (1)(2)(1)(2)(33) || align="center" | -47375.30 || align="center" | 53 || align="center" | 94856.60<br />
|-<br />
|align="center"| || align="center" | (1)(2)(3)(1)(2)(3) || align="center" | -47373.34 || align="center" | 62 || align="center" | 94870.69<br />
|-<br />
|align="center"| || align="center" | (11)(22)(33) || align="center" | -47408.21 || align="center" | 31 || align="center" | 94878.43<br />
|-<br />
|align="center"| || align="center" | (1212)(33) || align="center" | -47614.47 || align="center" | 19 || align="center" | 95266.94<br />
|-<br />
|align="center"|  || align="center" | (123123) || align="center" | -49038.37 || align="center" | 10 || align="center" | 98096.75<br />
|-<br />
|align="center"| || align="center" | (123)(123) || align="center" | -49031.42 || align="center" | 21 || align="center" | 98104.83<br />
|}<br />
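<br />
The arithmetic in steps 4 and 5 can be sketched in a few lines of Python. This is only an illustration and not part of GARLI: the function names and the <code>MODEL_PARAMS</code> and <code>CONCAT_SCHEMES</code> tables are my own assumptions, the parameter counts per model are the ones quoted in step 4, and the lnL values and parameter counts for the concatenated alignment are simply copied from the table above.<br />

```python
# Free parameters per model type, as quoted in step 4 of the text
# (GTRIG = 5 rates + 3 base frequencies + 1 pinv + 1 alpha = 10).
MODEL_PARAMS = {"GTRIG": 10, "TVMIG": 9}


def count_parameters(models):
    """Sum the per-subset model parameters and add the
    (#subsets - 1) subset-specific rate multipliers."""
    return sum(MODEL_PARAMS[m] for m in models) + (len(models) - 1)


def aic(lnl, n_params):
    """AIC = 2 x (# parameters - lnL)."""
    return 2.0 * (n_params - lnl)


# Concatenated-alignment rows copied from the table: scheme -> (lnL, parameters).
CONCAT_SCHEMES = {
    "(11)(2)(2)(33)": (-47384.54, 42),
    "(1)(2)(1)(2)(33)": (-47375.30, 53),
    "(1)(2)(3)(1)(2)(3)": (-47373.34, 62),
    "(11)(22)(33)": (-47408.21, 31),
    "(1212)(33)": (-47614.47, 19),
    "(123123)": (-49038.37, 10),
    "(123)(123)": (-49031.42, 21),
}


def best_scheme(schemes):
    """Return the name of the scheme with the lowest AIC score."""
    return min(schemes, key=lambda name: aic(*schemes[name]))


if __name__ == "__main__":
    # Gene 1 partitioned by codon position: GTRIG, TVMIG, TVMIG.
    k = count_parameters(["GTRIG", "TVMIG", "TVMIG"])
    print(k, round(aic(-22698.82, k), 2))
    print(best_scheme(CONCAT_SCHEMES))
```

Keeping the lnL and parameter count together for each scheme makes it easy to add further schemes later without recomputing anything by hand.<br />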
<br />
====Further thoughts====<br />
The above procedure is somewhat complicated (I may write some scripts to somewhat automate it at some point). However, you can now be confident that you've evaluated and possibly chosen a better configuration than you might have if you had partitioned ''a priori''. Note that in the above example the best model for the concatenated alignment, (11)(2)(2)(33), has '''''20 fewer parameters''''' than the one that most people would have chosen ''a priori'', (1)(2)(3)(1)(2)(3).<br />
Some caveats:<br />
*This methodology and example only includes two genes! Many datasets nowadays have many more. Exhaustively going through the subsets and partitions as above will quickly get out of hand for five or ten genes. However, you can still work to find good partitioning schemes ''within'' each gene, as was done for the single genes in the example above. Specifically, after finding the best models for each possible subset of each gene via ModelTest or the like, comparison of partitioning schemes (1)(2)(3), (12)(3) and (123) can be done for each. I've found that the (12)(3) scheme can work well in cases in which there are only a few second position changes, making parameter optimization difficult when it is analyzed as an independent subset. You might also do something like the full procedure above on sets of genes that you know ''a priori'' are similar, for example mitochondrial genes, rRNA genes, etc.<br />
<br />
*This discussion ignores the ability to "link" parameters across subsets (which I don't yet have implemented in GARLI). Surely more parameters could be eliminated by linking. For example, all of the subsets may have similar base frequencies, and could share a single set of frequency parameters. If so, three parameters could be eliminated for each subset, which is a significant number.<br />
<br />
*Some may argue that this is a lot of work for unclear benefit. This might be true, although reducing the number of free parameters is thought to be statistically beneficial in a general sense, if not specifically for partitioned phylogenetic models. Doing full searches to find the best model and then using it for more searches may seem wasteful, but remember that the Bayesian analog to this AIC procedure is the comparison of Bayes factors, which also generally require full analyses using each of the competing models.<br />
<br />
==Check the output==<br />
Once you've done a run, check the output in the .screen.log file to see that your data were divided and models assigned correctly. The output is currently very verbose. <br />
<br />
First the details of the data partitioning appear. Check that the total number of characters per subset looks correct. <br />
<pre><br />
GARLI partition subset 1<br />
CHARACTERS block #1 ("Untitled DATA Block 1")<br />
CHARPARTITION subset #1 ("1stpos")<br />
Data read as Nucleotide data, modeled as Nucleotide data<br />
<br />
Summary of data or data subset:<br />
11 sequences.<br />
441 constant characters.<br />
189 parsimony-informative characters.<br />
96 autapomorphic characters.<br />
726 total characters.<br />
238 unique patterns in compressed data matrix.<br />
<br />
GARLI partition subset 2<br />
CHARACTERS block #1 ("Untitled DATA Block 1")<br />
CHARPARTITION subset #2 ("2ndpos")<br />
Data read as Nucleotide data, modeled as Nucleotide data<br />
<br />
Summary of data or data subset:<br />
11 sequences.<br />
528 constant characters.<br />
103 parsimony-informative characters.<br />
95 autapomorphic characters.<br />
726 total characters.<br />
158 unique patterns in compressed data matrix.<br />
<br />
GARLI partition subset 3<br />
CHARACTERS block #1 ("Untitled DATA Block 1")<br />
CHARPARTITION subset #3 ("3rdpos")<br />
Data read as Nucleotide data, modeled as Nucleotide data<br />
<br />
Summary of data or data subset:<br />
11 sequences.<br />
103 constant characters.<br />
539 parsimony-informative characters.<br />
84 autapomorphic characters.<br />
726 total characters.<br />
549 unique patterns in compressed data matrix.<br />
</pre><br />
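<br />
If you have many subsets, checking the totals by eye gets tedious. The following is a rough sketch of how the per-subset totals could be pulled out of the log text. It assumes only the line formats shown in the excerpt above ("GARLI partition subset N" and "N total characters."); the function name is hypothetical and nothing here is shipped with GARLI.<br />

```python
import re

# Patterns based only on the .screen.log excerpt shown above.
SUBSET_RE = re.compile(r"GARLI partition subset (\d+)")
TOTAL_RE = re.compile(r"(\d+) total characters")


def total_characters_per_subset(log_text):
    """Map each partition subset number to its reported total character count."""
    totals = {}
    current = None
    for line in log_text.splitlines():
        subset_match = SUBSET_RE.search(line)
        if subset_match:
            current = int(subset_match.group(1))
            continue
        total_match = TOTAL_RE.search(line)
        if total_match and current is not None:
            totals[current] = int(total_match.group(1))
    return totals


SAMPLE_LOG = """\
GARLI partition subset 1
 CHARPARTITION subset #1 ("1stpos")
   726 total characters.
GARLI partition subset 2
 CHARPARTITION subset #2 ("2ndpos")
   726 total characters.
GARLI partition subset 3
 CHARPARTITION subset #3 ("3rdpos")
   726 total characters.
"""

if __name__ == "__main__":
    totals = total_characters_per_subset(SAMPLE_LOG)
    print(totals, sum(totals.values()))
```

The sum over subsets should equal the length of your alignment, which is a quick sanity check that no characters were dropped or assigned twice.<br />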
Then a description of the models and model parameters assigned to each subset appears. Parameters are at their initial values. This indicates three models with GTR+G for each:<br />
<pre><br />
MODEL REPORT - Parameters are at their INITIAL values (not yet optimized)<br />
Model 1<br />
Number of states = 4 (nucleotide data)<br />
Nucleotide Relative Rate Matrix: 6 rates <br />
AC = 1.000, AG = 4.000, AT = 1.000, CG = 1.000, CT = 4.000, GT = 1.000<br />
Equilibrium State Frequencies: estimated<br />
(ACGT) 0.3157 0.1746 0.3004 0.2093 <br />
Rate Heterogeneity Model:<br />
4 discrete gamma distributed rate categories, alpha param estimated<br />
0.5000<br />
Substitution rate categories under this model:<br />
rate proportion<br />
0.0334 0.2500<br />
0.2519 0.2500<br />
0.8203 0.2500<br />
2.8944 0.2500<br />
<br />
Model 2<br />
Number of states = 4 (nucleotide data)<br />
Nucleotide Relative Rate Matrix: 6 rates <br />
AC = 1.000, AG = 4.000, AT = 1.000, CG = 1.000, CT = 4.000, GT = 1.000<br />
Equilibrium State Frequencies: estimated<br />
(ACGT) 0.2703 0.1566 0.1628 0.4103 <br />
Rate Heterogeneity Model:<br />
4 discrete gamma distributed rate categories, alpha param estimated<br />
0.5000<br />
Substitution rate categories under this model:<br />
rate proportion<br />
0.0334 0.2500<br />
0.2519 0.2500<br />
0.8203 0.2500<br />
2.8944 0.2500<br />
<br />
Model 3<br />
Number of states = 4 (nucleotide data)<br />
Nucleotide Relative Rate Matrix: 6 rates <br />
AC = 1.000, AG = 4.000, AT = 1.000, CG = 1.000, CT = 4.000, GT = 1.000<br />
Equilibrium State Frequencies: estimated<br />
(ACGT) 0.1460 0.3609 0.2915 0.2015 <br />
Rate Heterogeneity Model:<br />
4 discrete gamma distributed rate categories, alpha param estimated<br />
0.5000<br />
Substitution rate categories under this model:<br />
rate proportion<br />
0.0334 0.2500<br />
0.2519 0.2500<br />
0.8203 0.2500<br />
2.8944 0.2500<br />
<br />
Subset rate multipliers:<br />
1.00 1.00 1.00<br />
</pre><br />
<br />
If you set up multiple search replicates in the configuration file (which you should!), a summary of the model parameters estimated during each replicate is displayed. All of the parameter values and (hopefully) the likelihoods should be fairly close for the different replicates.<br />
<br />
<pre><br />
Completed 5 replicate runs (of 5).<br />
Results:<br />
Replicate 1 : -13317.4777 (best)<br />
Replicate 2 : -13317.4813 (same topology as 1)<br />
Replicate 3 : -13317.4839 (same topology as 1)<br />
Replicate 4 : -13317.4863 (same topology as 1)<br />
Replicate 5 : -13317.4781 (same topology as 1)<br />
<br />
Parameter estimates:<br />
<br />
Partition subset 1:<br />
r(AC) r(AG) r(AT) r(CG) r(CT) r(GT) pi(A) pi(C) pi(G) pi(T) alpha <br />
rep 1: 1.97 2.58 1.42 1.41 3.72 1.00 0.310 0.177 0.297 0.216 0.411 <br />
rep 2: 1.96 2.58 1.41 1.41 3.71 1.00 0.310 0.177 0.296 0.216 0.409 <br />
rep 3: 1.97 2.58 1.42 1.40 3.71 1.00 0.309 0.177 0.298 0.216 0.411 <br />
rep 4: 1.96 2.57 1.42 1.40 3.72 1.00 0.310 0.177 0.297 0.216 0.411 <br />
rep 5: 1.96 2.57 1.41 1.40 3.71 1.00 0.310 0.177 0.297 0.216 0.409 <br />
<br />
Partition subset 2:<br />
r(AC) r(AG) r(AT) r(CG) r(CT) r(GT) pi(A) pi(C) pi(G) pi(T) alpha <br />
rep 1: 4.32 7.05 1.60 7.05 4.37 1.00 0.269 0.164 0.160 0.407 0.361 <br />
rep 2: 4.32 7.03 1.59 7.08 4.37 1.00 0.270 0.163 0.160 0.407 0.361 <br />
rep 3: 4.33 7.08 1.60 7.07 4.37 1.00 0.269 0.164 0.160 0.406 0.361 <br />
rep 4: 4.34 7.09 1.61 7.10 4.40 1.00 0.269 0.164 0.160 0.407 0.361 <br />
rep 5: 4.35 7.08 1.60 7.11 4.39 1.00 0.269 0.163 0.160 0.407 0.360 <br />
<br />
Partition subset 3:<br />
r(AC) r(AG) r(AT) r(CG) r(CT) r(GT) pi(A) pi(C) pi(G) pi(T) alpha <br />
rep 1: 1.06 5.26 3.55 0.45 4.98 1.00 0.154 0.356 0.287 0.203 2.988 <br />
rep 2: 1.06 5.26 3.56 0.45 4.98 1.00 0.154 0.356 0.287 0.203 2.995 <br />
rep 3: 1.06 5.26 3.57 0.46 5.00 1.00 0.154 0.355 0.287 0.204 3.008 <br />
rep 4: 1.07 5.31 3.57 0.46 5.00 1.00 0.154 0.356 0.286 0.204 2.991 <br />
rep 5: 1.06 5.25 3.56 0.45 5.00 1.00 0.154 0.356 0.287 0.203 2.978 <br />
<br />
Subset rate multipliers:<br />
rep 1: 0.538 0.298 2.164 <br />
rep 2: 0.538 0.299 2.164 <br />
rep 3: 0.539 0.300 2.162 <br />
rep 4: 0.539 0.298 2.163 <br />
rep 5: 0.537 0.299 2.164 <br />
<br />
Final result of the best scoring rep (#1) stored in GTRG.byCodonPos.best.tre<br />
Final results of all reps stored in GTRG.byCodonPos.best.all.tre<br />
</pre><br />
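<br />
A quick way to judge whether the replicates agree is to look at the spread of their log-likelihoods. The sketch below assumes the "Replicate N : score" line format shown above; the function names are made up for this example, and what counts as "fairly close" is left to you.<br />

```python
import re

# Matches lines such as "Replicate 1 : -13317.4777 (best)".
REPLICATE_RE = re.compile(r"Replicate\s+(\d+)\s*:\s*(-?\d+\.\d+)")


def replicate_scores(log_text):
    """Map replicate number to its final log-likelihood."""
    return {int(rep): float(score) for rep, score in REPLICATE_RE.findall(log_text)}


def score_spread(scores):
    """Difference between the best and worst replicate log-likelihoods."""
    return max(scores.values()) - min(scores.values())


SAMPLE_RESULTS = """\
Replicate 1 : -13317.4777 (best)
Replicate 2 : -13317.4813        (same topology as 1)
Replicate 3 : -13317.4839        (same topology as 1)
Replicate 4 : -13317.4863        (same topology as 1)
Replicate 5 : -13317.4781        (same topology as 1)
"""

if __name__ == "__main__":
    scores = replicate_scores(SAMPLE_RESULTS)
    best_rep = max(scores, key=scores.get)
    print(best_rep, scores[best_rep], score_spread(scores))
```

A spread that is large relative to what you see here would suggest running more replicates or checking the optimization settings.<br />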
<br />
==The sample runs==<br />
In the example/partition/exampleRuns directory included with GARLI distributions, I've included the configuration and output files from two runs with a very small 11 taxon x 2178 character dataset. It is partitioned by codon position. Not surprisingly, the tree is found almost immediately, and the rest of the time is spent optimizing the models and branch lengths.<br />
<br />
The first run (in 3parts.sameModelType) is an example of using a single model type for all subsets, in this case GTR+G.<br />
<br />
The second run (in 3parts.diffModelTypes) is an example of using a different model for each subset. In this case the 3 models aren't even any of the normal named ones (JC, K2P, HKY, GTR, etc). I combined some of the rate matrix parameters that looked very similar in the first run, and added a proportion of invariant sites to the third subset.<br />
<br />
Although this doesn't have anything to do with partitioning, I'll mention that the way of specifying any arbitrary restriction of the GTR model in GARLI is similar to that used in PAUP. Parameters that are shared have the same #. The parameters are in the order of A-C, A-G, A-T, C-G, C-T and G-T. For example, the HKY or K2P models give one rate to transitions (A-G and C-T) and another to transversions:<br />
<pre>ratematrix = ( 0 1 0 0 1 0 )</pre><br />
The GTR model looks like this:<br />
<pre>ratematrix = ( 0 1 2 3 4 5 )</pre><br />
Note that these two particular configurations are equivalent to '''ratematrix = 2rate''' and '''ratematrix = 6rate''', respectively.<br />
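<br />
To see which rates a given specification ties together, it can help to group the six entries by their code. This is a small illustrative sketch, not GARLI's own parser: the function names are hypothetical, and the count of free parameters assumes that one rate class is fixed as the reference, consistent with the 5 free relative rates quoted for GTR earlier on this page.<br />

```python
# Order of the six relative rates, as stated in the text above.
RATE_ORDER = ("AC", "AG", "AT", "CG", "CT", "GT")


def parse_ratematrix(spec):
    """Group the six relative rates by the code they share,
    e.g. "( 0 1 0 0 1 0 )" -> [["AC", "AT", "CG", "GT"], ["AG", "CT"]]."""
    codes = spec.strip().strip("()").split()
    if len(codes) != len(RATE_ORDER):
        raise ValueError("expected six rate codes, got %d" % len(codes))
    groups = {}
    for pair, code in zip(RATE_ORDER, codes):
        groups.setdefault(code, []).append(pair)
    return list(groups.values())


def free_rate_parameters(spec):
    """Number of free relative rates, assuming one class is the fixed reference."""
    return len(parse_ratematrix(spec)) - 1


if __name__ == "__main__":
    print(parse_ratematrix("( 0 1 0 0 1 0 )"), free_rate_parameters("( 0 1 0 0 1 0 )"))
    print(free_rate_parameters("( 0 1 2 3 4 5 )"))
```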
<br />
==Site-likelihood output==<br />
If you'd like to get site-wise likelihoods, for example to input into CONSEL, add<br />
outputsitelikelihoods = 1<br />
to the top part of your config file. This will create a <ofprefix>.sitelikes.log file that has the site-likelihoods for each replicate concatenated one after another. This file is in exactly the same format as PAUP site-likelihood output, so it can go directly into CONSEL. Note that CONSEL only allows a single period (".") in the site-likelihood file name (i.e., myrun.txt, not my.run.txt), so you may need to rename the files.<br />
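<br />
Renaming can be scripted if you have many runs. The helper below is just one possible approach and an assumption on my part: it keeps the final extension and replaces every other period with an underscore, so that only a single period remains. The function name and the choice of underscore are arbitrary.<br />

```python
from pathlib import Path


def consel_safe_name(filename):
    """Return a file name with exactly one period (before the extension),
    replacing any other periods with underscores."""
    path = Path(filename)
    # Path.stem drops only the last suffix, e.g. "a.b.log" -> "a.b".
    return path.stem.replace(".", "_") + path.suffix


if __name__ == "__main__":
    print(consel_safe_name("GTRG.byCodonPos.sitelikes.log"))
    print(consel_safe_name("myrun.txt"))
```

To actually rename a file you could pass the result to <code>Path.rename</code>; the function above only computes the new name.<br />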
<br />
If you want the sitelikes for a particular tree, you'd need to do this:<br />
*specify the tree(s) in a starting tree file.<br />
*specify your chosen model normally<br />
*add '''[[GARLI_Configuration_Settings#optimizeinputonly_.28do_not_search.2C_only_optimize_model_and_branch_lengths_on_user_trees.29 | optimizeinputonly]] = 1''' to the [general] section of your configuration file.<br />
*run the program normally (this will take a while because it will need to optimize the model parameters)<br />
<br />
==Running it yourself==<br />
Use the provided configuration files as templates. I've provided some appropriate config files for smaller datasets ( < about 50 taxa) and for larger ones. Note that there are some important changes from the default values that appear in previous versions of the program. These are important to ensure that the more complex partitioned models are properly optimized.<br />
<br />
You should set the '''modweight''' entry to at least 0.0005 x (#subsets + 1).<br />
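<br />
As a worked example of that rule of thumb, here is a tiny sketch (the function name is hypothetical):<br />

```python
def minimum_modweight(n_subsets):
    """Smallest recommended modweight: 0.0005 x (#subsets + 1)."""
    if n_subsets < 1:
        raise ValueError("need at least one subset")
    return 0.0005 * (n_subsets + 1)


if __name__ == "__main__":
    # Three codon-position subsets, as in the sample runs.
    print(minimum_modweight(3))
```

For the three-subset sample runs this works out to 0.002, and each additional subset raises the minimum by 0.0005.<br />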
<br />
Other than that and entering your dataset name on the '''datafname''' line, you should be able to run with the default values. If you know your way around the GARLI settings, feel free to tinker.<br />
<br />
As always, you can start the program from the command line with either <br />
<pre>./executableName configFileName</pre><br />
if it is in the same directory with your dataset and config file, or just<br />
<pre>executableName configFileName</pre><br />
if it is in your path.<br />
<br />
If your config file is named garli.conf, you don't need to pass it on the command line.</div>Zwicklhttps://molevol.mbl.edu/index.php?title=Garli_Mkv_morphology_model&diff=4387Garli Mkv morphology model2015-07-21T16:37:08Z<p>Zwickl: /* GARLI for "standard" data */</p>
<hr />
<div>==GARLI for "standard" data==<br />
GARLI 2.0+ implements the "Mk" and "Mkv" models of Lewis (2001), "A Likelihood Approach to Estimating Phylogeny from Discrete Morphological Data".<br />
<br />
'''Specific information on the use of morphology data appears on this page, but be sure to read the primary documentation for the [[Garli_using_partitioned_models]].'''<br />
<br />
==This implementation==<br />
*Allows use of character data with any number of states, arbitrarily coded as 1, 2, 3 etc. This is termed the "standard" datatype by the Nexus format.<br />
*Allows simultaneous use of this standard data and typical sequence data (dna, protein) in a partitioned model (although there are some limitations).<br />
*Allows use of the "Mk" model, which assumes that the data collected could contain constant characters<br />
*Allows use of the "Mkv" model, which assumes that the data collected contains only variable characters<br />
*Allows versions of the Mk and Mkv models that treat the states as ordered characters<br />
<br />
==Limitations of this version==<br />
*ONLY allows equal rates of substitution between states (rate of change from 1 -> 2 = 2 -> 1)<br />
*ONLY allows equal frequencies of the character states (state 1 = state 2)<br />
*CAN'T create stepwise addition starting trees under Mkv (for technical reasons)<br />
*CAN'T use rate heterogeneity with the Mk/Mkv models.<br />
*Another technical limitation:<br />
**'''IF''' you are mixing the morphology model with sequence data (DNA) <br />
:'''AND''' different characters have different numbers of states (e.g., character 1 has observed states 1, 2, and 3, while character 2 has states 1 and 2) <br />
:'''THEN''' you will not be able to infer separate subset specific rates for the DNA and morphological sets of data unless you also infer different rates for each set of characters with the same number of observed states<br />
<br />
==Application of this version to indel character data==<br />
One potential use of the "standard" data models implemented here is to encode indels (gaps) from your alignment as independent characters (in a separate data matrix) and analyze them simultaneously with your sequence data in a partitioned analysis. Note that the jury remains out on whether this is a good or helpful approach to take, and I don't necessarily endorse it. Certainly the gap and sequence matrices are not independent of one another, and will tend to reinforce each other's signals, thus raising support in a way that may or may not be appropriate.<br />
<br />
==Availability==<br />
Version 2.0 allows these models, and is available here: [http://garli.googlecode.com http://garli.googlecode.com]<br />
<br />
==Basic usage==<br />
Very little needs to be done to use the Mk/Mkv models. <br />
===Data===<br />
Have a Nexus datafile with your standard data in a characters or data block.<br />
===Configuration===<br />
The section of the configuration file containing the model settings should look like this<br />
datatype = '''standard''' (or '''standardXXX''', see below)<br />
ratematrix = 1rate<br />
statefrequencies = equal<br />
ratehetmodel = none<br />
numratecats = 1<br />
invariantsites = none<br />
<br />
The datatype is the only thing that can be changed here, and there are a few options:<br />
*standard - States are coded arbitrarily, and the number of observed states is assumed to be the maximum for each site. That is, if a column has a mix of states 1, 3 and 4, it is assumed that only these three states are possible. This is the "Mk" model.<br />
*standardvariable - As standard, except that it corrects for the fact that constant columns are not collected and therefore do not appear in the matrix. This is the "Mkv" model, and should generally be preferred over Mk for morphological data.<br />
*standardordered - As standard, except that the state numbers DO matter, and transitions can only change the state by one number at a time. For example, getting from state 2 to state 4 requires two changes. State numbers can be missing, so there could be an intermediate state that is unobserved.<br />
*standardvariableordered - A combination of the properties of standardvariable and standardordered.<br />
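<br />
For readers who want a feel for what the unordered Mk model with equal rates and equal state frequencies implies, the sketch below computes its transition probabilities using the textbook closed form (the k-state generalization of Jukes-Cantor). This is not taken from the GARLI source: the function name is hypothetical, and I am assuming the branch length is measured in expected changes per character.<br />

```python
import math


def mk_transition_probs(k, v):
    """Return (p_same, p_diff) for an unordered k-state Mk model.

    k: number of states; v: branch length in expected changes per character.
    p_same is the probability of ending in the starting state,
    p_diff the probability of ending in any one particular other state.
    """
    if k < 2:
        raise ValueError("Mk needs at least two states")
    decay = math.exp(-k * v / (k - 1))
    p_same = 1.0 / k + (k - 1.0) / k * decay
    p_diff = 1.0 / k - decay / k
    return p_same, p_diff


if __name__ == "__main__":
    p_same, p_diff = mk_transition_probs(3, 0.1)
    # The probabilities over all k end states sum to one.
    print(p_same, p_diff, p_same + 2 * p_diff)
```

Because the number of states k enters the formula, characters with different numbers of observed states need different probability calculations.<br />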
<br />
Now run the program as usual.<br />
<br />
==Partitioned usage==<br />
To use Mk/Mkv in a partitioned model (with other types of data), the procedure is this:<br />
<br />
===Data===<br />
Get your data ready. In your Nexus datafile, your sequence data and Mk type data will need to appear in separate characters blocks. Multiple characters blocks automatically create a partitioned model in GARLI. In general, the file should be formatted something like this:<br />
<br />
#NEXUS<br />
begin taxa;<br />
<contents of taxa block><br />
end;<br />
begin characters;<br />
<one of your types of data><br />
end;<br />
begin characters;<br />
<more data of the same or a different type><br />
end;<br />
<more characters blocks if necessary><br />
<br />
Note that this means that you will need to use a taxa block in addition to your characters blocks. If you are using multiple types of data you cannot use data blocks. As for how to get your data into this format, two options are to paste multiple characters blocks into a single file (with one taxa block), or to get your data into Mesquite as separate matrices and then save it.<br />
<br />
===Configuration===<br />
At this point the run can be configured as a typical partitioned analysis. Lots more information on that appears here: [[Partition_testing_version]]. <br />
<br />
In short, the multiple models are specified in the below format. Note that the "[model1]" and "[model2]" bits are important, and indicate which characters blocks (or partition subsets) the models are applied to, with the order being the same. Note that the numbering starts at 1, so the first characters block is model1, the second model2, etc. Assuming that the characters blocks in the file appeared with the nucleotide data first: <br />
<br />
<pre><br />
[model1]<br />
datatype = nucleotide<br />
ratematrix = 6rate<br />
statefrequencies = estimate<br />
ratehetmodel = gamma<br />
numratecats = 4<br />
invariantsites = none<br />
<br />
[model2]<br />
datatype = standardvariable<br />
ratematrix = 1rate<br />
statefrequencies = equal<br />
ratehetmodel = none<br />
numratecats = 1<br />
invariantsites = none<br />
</pre><br />
<br />
This would specify the GTR+Gamma model for the DNA data, and Mkv for the standard data.<br />
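<br />
If you build many config files, the numbered model sections can be generated rather than typed. The following is a minimal sketch that only produces the text of the model sections shown above; the function names and the two settings dictionaries are my own, and the keys and values are exactly those in the example.<br />

```python
def model_block(index, settings):
    """Format one [modelN] section from an ordered mapping of settings."""
    lines = ["[model%d]" % index]
    lines += ["%s = %s" % (key, value) for key, value in settings.items()]
    return "\n".join(lines)


def partitioned_model_section(models):
    """Number the models from 1, in the same order as the characters blocks."""
    return "\n\n".join(model_block(i, s) for i, s in enumerate(models, start=1))


# Settings copied from the example above.
NUCLEOTIDE_GTR_GAMMA = {
    "datatype": "nucleotide",
    "ratematrix": "6rate",
    "statefrequencies": "estimate",
    "ratehetmodel": "gamma",
    "numratecats": 4,
    "invariantsites": "none",
}
MKV = {
    "datatype": "standardvariable",
    "ratematrix": "1rate",
    "statefrequencies": "equal",
    "ratehetmodel": "none",
    "numratecats": 1,
    "invariantsites": "none",
}

if __name__ == "__main__":
    print(partitioned_model_section([NUCLEOTIDE_GTR_GAMMA, MKV]))
```

Passing the models as a list keeps their order explicit, which matters because the numbering must follow the order of the characters blocks in the data file.<br />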
<br />
Now run the program as usual.<br />
<br />
==Program output==<br />
Output files will be more or less as usual. If you look in the .screen.log file you will notice that the standard (morphology) data will be split into a number of models with each representing a given number of states. i.e., one model for characters showing 2 states, one model for characters with 3 states, etc. This is normal. You will also notice that with the current basic Mk/Mkv implementation there aren't any parameters to be estimated or reported.</div>Zwicklhttps://molevol.mbl.edu/index.php?title=Garli_Mkv_morphology_model&diff=4386Garli Mkv morphology model2015-07-21T16:36:12Z<p>Zwickl: /* GARLI for "standard" data */</p>
<hr />
<div>==GARLI for "standard" data==<br />
GARLI 2.0+ implements the "Mk" and "Mkv" models of Lewis (2001), "A Likelihood Approach to Estimating Phylogeny from Discrete Morphological Data".<br />
<br />
'''Specific information on the use of morphology data appears on this page, but be sure to read the primary documentation for the [[Partition_testing_version]].'''<br />
<br />
==This implementation==<br />
*Allows use of character data with any number of states, arbitrarily coded as 1, 2, 3 etc. This is termed the "standard" datatype by the Nexus format.<br />
*Allows simultaneous use of this standard data and typical sequence data (dna, protein) in a partitioned model (although there are some limitations).<br />
*Allows use of the "Mk" model, which assumes that the data collected could contain constant characters<br />
*Allows use of the "Mkv" model, which assumes that the data collected contains only variable characters<br />
*Allows versions of the Mk and Mkv models that treat the states as ordered characters<br />
<br />
==Limitations of this version==<br />
*ONLY allows equal rates of substitution between states (rate of change from 1 -> 2 = 2 -> 1)<br />
*ONLY allows equal frequencies of the character states (state 1 = state 2)<br />
*CAN'T create stepwise addition starting trees under Mkv (for technical reasons)<br />
*CAN'T use rate heterogeneity with the Mk/Mkv models.<br />
*Another technical limitation:<br />
**'''IF''' you are mixing the morphology model with sequence data (DNA) <br />
:'''AND''' different characters have different numbers of states (e.g., character 1 has observed states 1, 2, and 3, while character 2 has states 1 and 2) <br />
:'''THEN''' you will not be able to infer separate subset specific rates for the DNA and morphological sets of data unless you also infer different rates for each set of characters with the same number of observed states<br />
<br />
==Application of this version to indel character data==<br />
One potential use of the "standard" data models implemented here is to encode indels (gaps) from your alignment as independent characters (in a separate data matrix) and analyze them simultaneously with your sequence data in a partitioned analysis. Note that the jury remains out on whether this is a good or helpful approach to take, and I don't necessarily endorse it. Certainly the gap and sequence matrices are not independent of one another, and will tend to reinforce each other's signals, thus raising support in a way that may or may not be appropriate.<br />
<br />
==Availability==<br />
Version 2.0 allows these models, and is available here: [http://garli.googlecode.com http://garli.googlecode.com]<br />
<br />
==Basic usage==<br />
Very little needs to be done to use the Mk/Mkv models. <br />
===Data===<br />
Have a Nexus datafile with your standard data in a characters or data block.<br />
===Configuration===<br />
The section of the configuration file containing the model settings should look like this<br />
datatype = '''standard''' (or '''standardXXX''', see below)<br />
ratematrix = 1rate<br />
statefrequencies = equal<br />
ratehetmodel = none<br />
numratecats = 1<br />
invariantsites = none<br />
<br />
The datatype is the only thing that can be changed here, and there are a few options:<br />
*standard - States are coded arbitrarily, and the number of observed states is assumed to be the maximum for each site. That is, if a column has a mix of states 1, 3 and 4, it is assumed that only these three states are possible. This is the "Mk" model.<br />
*standardvariable - As standard, except that it corrects for the fact that constant columns are not collected and therefore do not appear in the matrix. This is the "Mkv" model, and should generally be preferred over Mk for morphological data.<br />
*standardordered - As standard, except that the state numbers DO matter, and transitions can only change the state by one number at a time. For example, getting from state 2 to state 4 requires two changes. State numbers can be missing, so there could be an intermediate state that is unobserved.<br />
*standardvariableordered - A combination of the properties of standardvariable and standardordered.<br />
<br />
Now run the program as usual.<br />
<br />
==Partitioned usage==<br />
To use Mk/Mkv in a partitioned model (with other types of data), the procedure is this:<br />
<br />
===Data===<br />
Get your data ready. In your Nexus datafile, your sequence data and Mk type data will need to appear in separate characters blocks. Multiple characters blocks automatically create a partitioned model in GARLI. In general, the file should be formatted something like this:<br />
<br />
#NEXUS<br />
begin taxa;<br />
<contents of taxa block><br />
end;<br />
begin characters;<br />
<one of your types of data><br />
end;<br />
begin characters;<br />
<more data of the same or a different type><br />
end;<br />
<more characters blocks if necessary><br />
<br />
Note that this means that you will need to use a taxa block in addition to your characters blocks. If you are using multiple types of data you cannot use data blocks. As for how to get your data into this format, two options are to paste multiple characters blocks into a single file (with one taxa block), or to get your data into Mesquite as separate matrices and then save it.<br />
<br />
===Configuration===<br />
At this point the run can be configured as a typical partitioned analysis. Lots more information on that appears here: [[Partition_testing_version]]. <br />
<br />
In short, the multiple models are specified in the below format. Note that the "[model1]" and "[model2]" bits are important, and indicate which characters blocks (or partition subsets) the models are applied to, with the order being the same. Note that the numbering starts at 1, so the first characters block is model1, the second model2, etc. Assuming that the characters blocks in the file appeared with the nucleotide data first: <br />
<br />
<pre><br />
[model1]<br />
datatype = nucleotide<br />
ratematrix = 6rate<br />
statefrequencies = estimate<br />
ratehetmodel = gamma<br />
numratecats = 4<br />
invariantsites = none<br />
<br />
[model2]<br />
datatype = standardvariable<br />
ratematrix = 1rate<br />
statefrequencies = equal<br />
ratehetmodel = none<br />
numratecats = 1<br />
invariantsites = none<br />
</pre><br />
<br />
This would specify the GTR+Gamma model for the DNA data, and Mkv for the standard data.<br />
<br />
Now run the program as usual.<br />
<br />
==Program output==<br />
Output files will be more or less as usual. If you look in the .screen.log file you will notice that the standard (morphology) data will be split into a number of models with each representing a given number of states. i.e., one model for characters showing 2 states, one model for characters with 3 states, etc. This is normal. You will also notice that with the current basic Mk/Mkv implementation there aren't any parameters to be estimated or reported.</div>Zwicklhttps://molevol.mbl.edu/index.php?title=Garli_Mkv_morphology_model&diff=4385Garli Mkv morphology model2015-07-21T16:34:56Z<p>Zwickl: Created page with "==GARLI for "standard" data== A partitioned testing version of GARLI is available that implements the "Mk" and "Mkv" models of Lewis (2001), "A Likelihood Approach to Estimati..."</p>
<hr />
<div>==GARLI for "standard" data==<br />
A partitioned testing version of GARLI is available that implements the "Mk" and "Mkv" models of Lewis (2001), "A Likelihood Approach to Estimating Phylogeny from Discrete Morphological Data".<br />
<br />
'''Specific information on the use of morphology data appears on this page, but be sure to read the primary documentation for the [[Partition_testing_version]].'''<br />
<br />
==This implementation==<br />
*Allows use of character data with any number of states, arbitrarily coded as 1, 2, 3 etc. This is termed the "standard" datatype by the Nexus format.<br />
*Allows simultaneous use of this standard data and typical sequence data (dna, protein) in a partitioned model (although there are some limitations).<br />
*Allows use of the "Mk" model, which assumes that the data collected could contain constant characters<br />
*Allows use of the "Mkv" model, which assumes that the data collected contains only variable characters<br />
*Allows versions of the Mk and Mkv models that treat the states as ordered characters<br />
<br />
==Limitations of this version==<br />
*ONLY allows equal rates of substitution between states (rate of change from 1 -> 2 = 2 -> 1)<br />
*ONLY allows equal frequencies of the character states (state 1 = state 2)<br />
*CAN'T create stepwise addition starting trees under Mkv (for technical reasons)<br />
*CAN'T use rate heterogeneity with the Mk/Mkv models.<br />
*Another technical limitation:<br />
**'''IF''' you are mixing the morphology model with sequence data (DNA) <br />
:'''AND''' different characters have different numbers of states (e.g., character 1 has observed states 1, 2, and 3, while character 2 has states 1 and 2) <br />
:'''THEN''' you will not be able to infer separate subset specific rates for the DNA and morphological sets of data unless you also infer different rates for each set of characters with the same number of observed states<br />
<br />
==Application of this version to indel character data==<br />
One potential use of the "standard" data models implemented here is to encode indels (gaps) from your alignment as independent characters (in a separate data matrix) and analyze them simultaneously with your sequence data in a partitioned analysis. Note that the jury remains out on whether this is a good or helpful approach to take, and I don't necessarily endorse it. Certainly the gap and sequence matrices are not independent of one another, and will tend to reinforce each other's signals, thus raising support in a way that may or may not be appropriate.<br />
<br />
==Availability==<br />
Version 2.0 allows these models, and is available here: [http://garli.googlecode.com http://garli.googlecode.com]<br />
<br />
==Basic usage==<br />
Very little needs to be done to use the Mk/Mkv models. <br />
===Data===<br />
Have a Nexus datafile with your standard data in a characters or data block.<br />
===Configuration===<br />
The section of the configuration file containing the model settings should look like this<br />
datatype = '''standard''' (or '''standardXXX''', see below)<br />
ratematrix = 1rate<br />
statefrequencies = equal<br />
ratehetmodel = none<br />
numratecats = 1<br />
invariantsites = none<br />
<br />
The datatype is the only thing that can be changed here, and there are a few options:<br />
*standard - States are coded arbitrarily, and the number of observed states is assumed to be the maximum for each site. That is, if a column has a mix of states 1, 3 and 4, it is assumed that only these three states are possible. This is the "Mk" model.<br />
*standardvariable - As standard, except that it corrects for the fact that constant columns are not collected, and therefore won't appear in the matrix. This is the "Mkv" model, and should generally be preferred for morphological data over Mk.<br />
*standardordered - As standard, except that the state numbers DO matter, and transitions can only change the state by one number at a time. i.e., to get from state 2 to state 4 requires two changes. State numbers can be missing, so there could be an intermediate state that is unobserved.<br />
*standardvariableordered - A combination of the properties of standardvariable and standardordered.<br />
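<br />
As an illustration of how the number of observed states is counted per column, here is a minimal Python sketch. It is not GARLI's code: the function name, the toy matrix, and the choice to treat "?" and "-" as missing data are assumptions made only for this example.<br />

```python
from collections import defaultdict

# Symbols treated as missing data (an assumption for this sketch).
MISSING = {"?", "-"}


def group_by_observed_states(matrix):
    """Group 1-based column numbers by how many distinct states each shows.

    matrix: dict mapping taxon name -> string of single-character states,
    all strings having the same length.
    """
    rows = list(matrix.values())
    groups = defaultdict(list)
    for col in range(len(rows[0])):
        states = {row[col] for row in rows} - MISSING
        groups[len(states)].append(col + 1)
    return dict(groups)


toy = {"t1": "112", "t2": "122", "t3": "213", "t4": "2?1"}
print(group_by_observed_states(toy))  # {2: [1, 2], 3: [3]}
```

Columns 1 and 2 each show two states and column 3 shows three, so under the standard datatypes they would be handled as two-state and three-state characters respectively.<br />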
<br />
Now run the program as usual.<br />
<br />
==Partitioned usage==<br />
To use Mk/Mkv in a partitioned model (with other types of data), the procedure is this:<br />
<br />
===Data===<br />
Get your data ready. In your Nexus datafile, your sequence data and Mk type data will need to appear in separate characters blocks. Multiple characters blocks automatically create a partitioned model in GARLI. In general, the file should be formatted something like this:<br />
<br />
#NEXUS<br />
begin taxa;<br />
<contents of taxa block><br />
end;<br />
begin characters;<br />
<one of your types of data><br />
end;<br />
begin characters;<br />
<more data of the same or a different type><br />
end;<br />
<more characters blocks if necessary><br />
end;<br />
<br />
Note that this means that you will need to use a taxa block in addition to your characters blocks. If you are using multiple types of data you cannot use data blocks. As for how to get your data into this format, two options are to paste multiple characters blocks into a single file (with one taxa block), or to get your data into Mesquite as separate matrices and then save it.<br />
<br />
===Configuration===<br />
At this point the run can be configured as a typical partitioned analysis. Lots more information on that appears here: [[Partition_testing_version]]. <br />
<br />
In short, the multiple models are specified in the below format. Note that the "[model1]" and "[model2]" bits are important, and indicate which characters blocks (or partition subsets) the models are applied to, with the order being the same. Note that the numbering starts at 1, so the first characters block is model1, the second model2, etc. Assuming that the characters blocks in the file appeared with the nucleotide data first: <br />
<br />
<pre><br />
[model1]<br />
datatype = nucleotide<br />
ratematrix = 6rate<br />
statefrequencies = estimate<br />
ratehetmodel = gamma<br />
numratecats = 4<br />
invariantsites = none<br />
<br />
[model2]<br />
datatype = standardvariable<br />
ratematrix = 1rate<br />
statefrequencies = equal<br />
ratehetmodel = none<br />
numratecats = 1<br />
invariantsites = none<br />
</pre><br />
<br />
This would specify the GTR+Gamma model for the DNA data, and Mkv for the standard data.<br />
<br />
Now run the program as usual.<br />
<br />
==Program output==<br />
Output files will be more or less as usual. If you look in the .screen.log file you will notice that the standard (morphology) data will be split into a number of models with each representing a given number of states. i.e., one model for characters showing 2 states, one model for characters with 3 states, etc. This is normal. You will also notice that with the current basic Mk/Mkv implementation there aren't any parameters to be estimated or reported.</div>Zwicklhttps://molevol.mbl.edu/index.php?title=GARLI_Configuration_Settings&diff=4384GARLI Configuration Settings2015-07-21T16:32:15Z<p>Zwickl: /* datatype (sequence type and inference model) */</p>
<hr />
<div>==Descriptions of GARLI configuration settings==<br />
The format for these configuration settings descriptions is generally:<br />
'''entryname''' (possible values, '''default value in bold''') – description<br />
<br />
==General settings==<br />
===datafname (file containing sequence dataset)=== <br />
'''datafname''' = (filename) –<br />
Name of the file containing the aligned sequence data. Formats accepted are <br />
PHYLIP, NEXUS and FASTA. Robust reading of Nexus formatted <br />
datasets is done using the Nexus Class Library. This accommodates things such as interleaved <br />
alignments and exclusion sets (exset’s) in NEXUS assumptions blocks (see the FAQ for an <br />
example of [[Garli_FAQ#Can_I_specify_alignment_columns_of_my_data_matrix_to_be_excluded.3F | exset usage]]). Use of NEXUS files is recommended, and is required for partitioned models.<br />
<br />
===constraintfile (file containing constraint definition)===<br />
'''constraintfile''' = (filename, '''none''') –<br />
Name of the file containing any topology constraint specifications, or “none” if <br />
there are no constraints. The easiest way to explain the format of the constraint file is by <br />
example. Consider a dataset of 8 taxa, in which your constraint consists of grouping taxa 1, 3 <br />
and 5. You may specify either positive constraints (inferred tree MUST contain constrained <br />
group) or negative constraints (also called converse constraints, inferred tree CANNOT <br />
contain constrained group). These are specified with either a ‘+’ or a ‘-’ at the beginning of <br />
the constraint specification, for positive and negative constraints, respectively. <br />
*For a positive constraint on a grouping of taxa 1, 3 and 5: <br />
+((1,3,5),2,4,6,7,8); <br />
*For a negative constraint on a grouping of taxa 1, 3 and 5: <br />
-((1,3,5),2,4,6,7,8); <br />
*Note that there are many other equivalent parenthetical representations of these constraints. <br />
*Multiple groups may be positively constrained, but currently only a single negatively constrained group is allowed. <br />
*Multiple constrained groupings may be specified in a single string: <br />
+((1,3,5),2,4,(6,7),8); <br />
or in two separate strings on successive lines: <br />
+((1,3,5),2,4,6,7,8); <br />
+(1,3,5,2,4,(6,7),8); <br />
*Constraint strings may also be specified in terms of taxon names (matching those used in the alignment) instead of numbers. <br />
*Positive and negative constraints cannot be mixed. <br />
*GARLI also accepts another constraint format that may be easier to use in some cases. This involves specifying a single branch to be constrained with a string of ‘*’ (asterisk) and ‘.’ (period) characters, with one character per taxon. Each taxon specified with a ‘*’ falls on one side of the constrained branch, and all those specified with a ‘.’ fall on the other. This should be familiar to anyone who has looked at PAUP* bootstrap output. With this format, a positive constraint on a grouping of taxa 1, 3 and 5 would look like this: <br />
 +*.*.*... <br />
or equivalently like this: <br />
+.*.*.*** <br />
With this format each line only designates a single branch, so multiple constrained branches must be specified as multiple lines in the file. <br />
*The program also allows “backbone” constraints, which are simply constraints that do not include all of the taxa. For example if one wanted to constrain (1,3,5) but didn’t care where taxon 6 appeared in the tree, this string could be used: <br />
+((1,3,5),2,4,7,8); <br />
:Nothing special needs to be done to identify this as a backbone constraint, simply leave out some taxa.<br />
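<br />
The asterisk/period format can be generated mechanically from a list of taxon numbers. The following Python sketch shows one way to do that for the 8-taxon example; the function names are hypothetical helpers, not part of GARLI.<br />

```python
def bipartition_string(group, num_taxa, positive=True):
    """Build a '*'/'.' constraint line describing a single branch.

    group: 1-based taxon numbers that fall on one side of the branch.
    """
    members = set(group)
    if not members or not all(1 <= t <= num_taxa for t in members):
        raise ValueError("taxon numbers must lie between 1 and num_taxa")
    sign = "+" if positive else "-"
    marks = "".join("*" if t in members else "." for t in range(1, num_taxa + 1))
    return sign + marks


def complement(line):
    """Swap '*' and '.' to describe the same branch from its other side."""
    swap = {"*": ".", ".": "*"}
    return line[0] + "".join(swap[c] for c in line[1:])


line = bipartition_string([1, 3, 5], 8)
print(line)              # +*.*.*...
print(complement(line))  # +.*.*.***
```

Both printed lines describe the same branch, which is why either form may be used in the constraint file.<br />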
<br />
===streefname (source of starting tree and/or model)===<br />
'''streefname''' = (random, <u>'''stepwise'''</u>, <filename>) – Specifies where the starting tree topology and/or <br />
model parameters will come from. The tree topology may be a completely random topology <br />
(constraints will be enforced), a tree provided by the user in a file, <u>or a tree generated by the <br />
program using a fast ML stepwise-addition algorithm (see [[GARLI_Configuration_Settings#attachmentspertaxon_.28control_creation_of_stepwise_addition_starting_tree.29|attachmentspertaxon]] below)</u>. <br />
Starting or fixed model parameter values may also be provided in the specified file, with or <br />
without a tree topology. Some notes on starting trees/models: <br />
*Specified starting trees may have polytomies, which will be arbitrarily resolved before the run begins.<br />
*Starting tree formats: <br />
**Plain newick tree string (with taxon numbers or names, with or without branch lengths) <br />
**NEXUS trees block. The trees block can appear in the same file as a NEXUS data or characters block that contains the alignment, although the same filename should then be specified on both the datafname and streefname lines.<br />
*If multiple trees appear in the specified file and multiple search replicates are specified (see '''[[GARLI_Configuration_Settings#searchreps_.28number_of_independent_search_replicates.29|searchreps]]''' setting), then the first tree is used in the first replicate, the second in the second replicate, etc. <br />
*Providing model parameter values: see this page '''[[Specifying_model_parameter_values | Specifying model parameter values]]'''<br />
*See also the FAQ items on model parameters '''[[Garli_FAQ#Model_parameters | here]]'''.<br />
<br />
===attachmentspertaxon (control creation of stepwise addition starting tree)===<br />
'''attachmentspertaxon''' = (1 to infinity, '''50''') – The number of attachment branches evaluated for each taxon to be added to the tree during the creation of an ML stepwise-addition starting tree. Briefly, stepwise addition is an algorithm used to make a tree, and involves adding taxa in a random order to a growing tree. For each taxon to be added, a number of randomly chosen attachment branches are tried and scored, and then the best scoring one is chosen as the location of that taxon. The attachmentspertaxon setting controls how many attachment points are evaluated for each taxon to be added. A value of one is equivalent to a completely random tree (only one randomly chosen location is evaluated). A value of greater than 2 times the number of taxa in the dataset means that all attachment points will be evaluated for each taxon, and will result in very good starting trees (but may take a while on large datasets). Even fairly small values (< 10) can result in starting trees that are much, much better than random, but still fairly different from one another. <br />
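<br />
The "2 times the number of taxa" rule follows from the fact that an unrooted binary tree with k taxa has 2k - 3 branches. The Python sketch below only counts attachment points; it assumes that stepwise addition starts from a three-taxon tree, and the function name is made up for this example.<br />

```python
def attachment_points_evaluated(num_taxa, attachmentspertaxon):
    """For each taxon added to the growing tree, count the branches tried.

    Assumes addition starts from a three-taxon tree (an assumption of
    this sketch, not a statement about GARLI's internals).
    """
    counts = []
    for taxa_in_tree in range(3, num_taxa):
        branches = 2 * taxa_in_tree - 3  # branches in an unrooted binary tree
        counts.append(min(attachmentspertaxon, branches))
    return counts


print(attachment_points_evaluated(6, 50))  # [3, 5, 7]
print(attachment_points_evaluated(6, 1))   # [1, 1, 1]
```

With the default of 50 and only 6 taxa, every branch is tried at every step; with a value of 1, a single random location is used each time.<br />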
===ofprefix (output filename prefix)===<br />
'''ofprefix''' = (text) – Prefix of all output filenames, such as log, treelog, etc. Change this for each run that you do or the program will overwrite previous results. <br />
===randseed (random number seed)===<br />
'''randseed''' = (-1 or positive integers, '''-1''') – The random number seed used by the random number generator. Specify “–1” to have a seed chosen for you. Specifying the same seed number in multiple runs will give exactly identical results, if all other parameters and settings are also identical. <br />
===availablememory (control maximum program memory usage)===<br />
'''availablememory''' – Typically this is the amount of available physical memory on the system, in megabytes. This lets GARLI determine how much system memory it may be able to use to store computations for reuse. The program will be conservative and use at most about 90% of this value, although for typical datasets it will use much, much less. If other programs must be open or used when GARLI is running, you may need to reduce this value. This setting can also have a significant effect on performance (speed), but more is not always better. When a run is started, GARLI will output the '''availablememory''' value necessary for that dataset to achieve each of the “memory levels” (from best to worst, termed “great”, “good”, “low”, and “very low”). More memory is generally better because more calculations are stored and can be reused, but when the amount of memory needed for “great” becomes more than 512 megabytes or so, performance can be slowed because the operating system has difficulty managing that much memory. In general, choose an amount of memory that allows “great” when this is less than 512 MB, and if it is greater, reduce the amount of memory into “good” or “low” as necessary. Avoid “very low” whenever possible. You can find the value that is approximately optimal for your dataset by setting the '''randseed''' to some <br />
positive value (so that the searches are identical) and doing runs with various '''availablememory''' values. Typically it will make at most a 30% difference, so it isn't worth worrying about too much. <br />
===logevery (frequency to log best score to file)===<br />
'''logevery''' = (1 to infinity, '''10''') – The frequency at which the best score is written to the log file. <br />
===saveevery (frequency to save best tree to file or write checkpoints)===<br />
'''saveevery''' = (1 to infinity, '''100''') – If '''writecheckpoints''' or '''outputcurrentbesttopology''' are specified, this is the frequency (in generations) at which checkpoints or the current best tree are written to file. <br />
===refinestart (whether to optimize a bit before starting a search)===<br />
'''refinestart''' = (0 or 1, '''1''') – Specifies whether some initial rough optimization is performed on the starting branch lengths and rate heterogeneity parameters. This is always recommended. <br />
===outputcurrentbesttopology (continuously write the best tree to file during a run)===<br />
'''outputcurrentbesttopology''' = (0 or 1, '''0''') – If true, the current best tree of the current search replicate is written to <ofprefix>.best.current.tre every '''saveevery''' generations. In versions before 0.96 the current best topology was always written to file, but that is no longer the case. Seeing the current best tree has no real use apart from satisfying your curiosity about how a <br />
run is going.<br />
===outputeachbettertopology (write each improved topology to file)===<br />
'''outputeachbettertopology''' (0 or 1, '''0''') – If true, each new topology encountered with a better score than the previous best is written to file. In some cases this can result in really big files (hundreds of MB) though, especially for random starting topologies on large datasets. Note that this file is interesting to get an idea of how the topology changed as the searches progressed, but the collection of trees should NOT be interpreted in any meaningful way. This option is not available while bootstrapping. <br />
===enforcetermconditions (use automatic termination)===<br />
'''enforcetermconditions''' = (0 or 1, '''1''') – Specifies whether the automatic termination conditions will be used. The conditions specified by both of the following two parameters must be met. See the following two parameters for their definitions. If this is false, the run will continue until it reaches the time ('''stoptime''') or generation ('''stopgen''') limit. It is highly recommended that this option be used! <br />
===genthreshfortopoterm (number of generations without topology improvement required for termination) ===<br />
'''genthreshfortopoterm''' = (1 to infinity, '''20,000''') – This specifies the first part of the termination condition. When no new significantly better scoring topology (see significanttopochange below) has been encountered in greater than this number of generations, this condition is met. Increasing this parameter may improve the lnL scores obtained (especially on large datasets), but will also increase runtimes. <br />
===scorethreshforterm (max score improvement over recent generations required for termination)===<br />
'''scorethreshforterm''' = (0 to infinity, '''0.05''') – The second part of the termination condition. When the total improvement in score over the last '''intervallength''' x '''intervalstostore''' generations (default is 500 generations, see below) is less than this value, this condition is met. This does not usually need to be changed. <br />
===significanttopochange (required score improvement for topology to be considered better)===<br />
'''significanttopochange''' = (0 to infinity, '''0.01''') – The lnL increase required for a new topology to be <br />
considered significant as far as the termination condition is concerned. It probably doesn’t <br />
need to be played with, but you might try increasing it slightly if your runs reach a stable <br />
score and then take a very long time to terminate due to very minor changes in topology. <br />
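<br />
The way the three termination settings interact can be summarized in a few lines of Python. This is only a sketch of the logic as described above, not GARLI's code; the function names are hypothetical, and using ">=" for the significance comparison is an assumption.<br />

```python
def is_significant_topology(new_lnl, best_lnl, significanttopochange=0.01):
    """True if a new topology improves lnL enough to reset the topology counter.

    Using >= here is an assumption; the text only says the increase is "required".
    """
    return new_lnl - best_lnl >= significanttopochange


def should_terminate(gens_since_topo_improvement, recent_score_improvement,
                     genthreshfortopoterm=20000, scorethreshforterm=0.05):
    """Both parts of the termination condition must hold at the same time.

    recent_score_improvement is the total lnL gain over the last
    intervallength x intervalstostore generations.
    """
    topo_condition = gens_since_topo_improvement > genthreshfortopoterm
    score_condition = recent_score_improvement < scorethreshforterm
    return topo_condition and score_condition


print(should_terminate(20001, 0.01))  # True
print(should_terminate(20001, 0.2))   # False
print(should_terminate(500, 0.0))     # False
```

Raising '''significanttopochange''' makes fewer topologies count as improvements, so the first condition is reached sooner.<br />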
===outputphyliptree (write trees to file in Phylip as well as Nexus format)===<br />
'''outputphyliptree''' = (0 or 1, '''0''') – Whether Phylip-formatted tree files will be output in addition to <br />
the default Nexus files for the best tree across all replicates (<ofprefix>.best.phy), the best <br />
tree for each replicate (<ofprefix>.best.all.phy) or, in the case of bootstrapping, the best tree <br />
for each bootstrap replicate (<ofprefix>.boot.phy). <br />
===outputmostlyuselessfiles (output uninteresting files)===<br />
'''outputmostlyuselessfiles''' = (0 or 1, '''0''') – Whether to output three files of little general interest: the <br />
“fate”, “problog” and “swaplog” files. The fate file shows the parentage, mutation types and <br />
scores of every individual in the population during the entire search. The problog shows how <br />
the proportions of the different mutation types changed over the course of the run. The <br />
swaplog shows the number of unique swaps and the number of total swaps on the current <br />
best tree over the course of the run. <br />
===writecheckpoints (write checkpoint files during run)===<br />
'''writecheckpoints''' (0 or 1, '''0''') – Whether to write three files to disk containing all information <br />
about the current state of the population every saveevery generations, with each successive <br />
checkpoint overwriting the previous one. These files can be used to restart a run at the last <br />
written checkpoint by setting the '''restart''' configuration entry.<br />
===restart (restart run from checkpoint)===<br />
'''restart''' = (0 or 1, '''0''') – Whether to restart at a previously saved checkpoint. To use this option the '''writecheckpoints''' option must have been used during a previous run. The program will look for checkpoint files that are named based on the ofprefix of the previous run. If you intend to restart a run, NOTHING should be changed in the config file except setting '''restart''' to 1. <br />
A run that is restarted from checkpoint will give ''exactly'' the same results it would have if the run had gone to completion. <br />
===outgroup (orient inferred trees consistently)===<br />
'''outgroup''' = (outgroup taxa numbers, separated by spaces) – This option allows for orienting the tree topologies in a consistent way when they are written to file. Note that this has NO effect whatsoever on the actual inference and the specified outgroup is NOT constrained to be present in the inferred trees. If multiple outgroup taxa are specified and they do not form a monophyletic group in the inferred tree, this setting will be ignored. If you specify a single outgroup taxon it will always be present, and the tree will always be consistently oriented. Ranges can be indicated with a hyphen. e.g., to specify an outgroup consisting of taxa 1, 2, 3 and 5 the format is this: <br />
outgroup = 1-3 5<br />
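<br />
To make the range syntax concrete, this Python sketch expands an outgroup value into the individual taxon numbers it stands for. The function is a hypothetical helper written for this page, not something GARLI provides.<br />

```python
def parse_outgroup(value):
    """Expand an outgroup setting such as '1-3 5' into a list of taxon numbers."""
    taxa = []
    for token in value.split():
        if "-" in token:
            start, end = (int(part) for part in token.split("-"))
            taxa.extend(range(start, end + 1))  # hyphenated ranges are inclusive
        else:
            taxa.append(int(token))
    return taxa


print(parse_outgroup("1-3 5"))  # [1, 2, 3, 5]
```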
<br />
===searchreps (number of independent search replicates)===<br />
'''searchreps''' = (1 to infinity, '''2''') – The number of independent search replicates to perform during a program execution. You should always either do multiple search replicates or multiple program executions with any dataset to get a feel for whether you are getting consistent results, which suggests that the program is doing a decent job of searching. Note that if this is > 1 and you are performing a bootstrap analysis, this is the number of search replicates to be done per bootstrap replicate. That can increase the chance of finding the best tree per bootstrap replicate, but will also increase bootstrap runtimes enormously.<br />
===bootstrapreps (number of bootstrap replicates)===<br />
'''bootstrapreps''' (0 to infinity, '''0''') - The number of bootstrap reps to perform. If this is greater than <br />
0, normal searching will not be performed. The resulting bootstrap trees (one per rep) will be <br />
output to a file named <ofprefix>.boot.tre. To obtain the bootstrap proportions they will then <br />
need to be read into PAUP* or a similar program to obtain a majority rule consensus. Note <br />
that it is probably safe to reduce the strictness of the termination conditions during <br />
bootstrapping (perhaps halve genthreshfortopoterm), which will greatly speed up the <br />
bootstrapping process with negligible effects on the results.<br />
<br />
Note that GARLI does NOT do the bootstrap consensus or calculate the bootstrap proportions itself. It simply infers the trees that are the input to that consensus. The trees in the .boot.tre file will need to be read into a program such as SumTrees (or PAUP* or CONSENSE) to do a majority-rule consensus and obtain your bootstrap support values. See the following part of the "Advanced topics" page for suggestions on making the consensus: '''[[Garli_Advanced_topics#Detailed_Example:_A_bootstrap_analysis|Detailed Example: A bootstrap analysis]]'''.<br />
<br />
===resampleproportion (relative size of re-sampled data matrix)===<br />
'''resampleproportion''' (0.1 to 10, '''1.0''' ) – When bootstrapreps > 0, this setting allows for <br />
bootstrap-like resampling, but with the pseudoreplicate datasets having a number of <br />
alignment columns different from the real data. Setting values < 1.0 is somewhat similar to jackknifing, but not identical. <br />
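<br />
A rough Python sketch of what such resampling looks like is given below. It draws column indices with replacement; the rounding rule and the function name are assumptions of this example and may differ from what GARLI does internally.<br />

```python
import random


def resample_columns(num_columns, resampleproportion=1.0, seed=None):
    """Draw 0-based alignment column indices with replacement.

    The pseudoreplicate has about num_columns * resampleproportion columns
    (rounded here, which is an assumption of this sketch).
    """
    rng = random.Random(seed)
    sample_size = max(1, round(num_columns * resampleproportion))
    return [rng.randrange(num_columns) for _ in range(sample_size)]


half = resample_columns(100, resampleproportion=0.5, seed=1)
double = resample_columns(100, resampleproportion=2.0, seed=1)
print(len(half), len(double))  # 50 200
```

Because columns are still drawn with replacement when the proportion is below 1.0, the result resembles jackknifing without being identical to it.<br />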
===inferinternalstateprobs (infer ancestral states)===<br />
'''inferinternalstateprobs''' = (0 or 1, '''0''') – Specify 1 to have GARLI infer the marginal posterior <br />
probability of each character at each internal node. This is done at the very end of the run, <br />
just before termination. The results are output to a file named <ofprefix>.internalstates.log.<br />
<br />
===outputsitelikelihoods (write a file with the log-likelihood of each site)===<br />
<br />
'''outputsitelikelihoods''' = (0 or 1, '''0''') - Causes a file named <ofprefix>.sitelikes.log to be created containing the sitewise log-likelihood values for each individual site for the final tree of each search replicate. Note that the format of the file is exactly the same as that output by PAUP, so it can be directly used in CONSEL using the PAUP file-format option. For partitioned models the sites are listed in sets by partition subset, but the site numbers will match those in the original datamatrix. Thus, when using a partitioned model that divides sites by codon position, the ordering would be 1, 4, 7, ... 2, 5, 8, ... 3, 6, 9 .... Note that this makes no difference to CONSEL, which just assumes that the site likelihoods are ordered the same for each tree and ignores the site numbers. <br />
<br />
Also note that using the site likelihoods from a partitioned model is a violation of CONSEL's assumptions, since it will not know about the partitions and its internal bootstrap resampling will not obey them. It isn't clear what the effects of this will be on the various tests. <br />
<br />
Finally, if you want to calculate the site likelihoods for a set of trees that you provide, use the '''[[GARLI_Configuration_Settings#optimizeinputonly_.28do_not_search.2C_only_optimize_model_and_branch_lengths_on_user_trees.29 | optimizeinputonly]]''' setting.<br />
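<br />
The site ordering described above for a model partitioned by codon position can be reproduced with a short Python sketch. The function name is hypothetical and the code only illustrates the ordering; it does not read or write the .sitelikes.log file.<br />

```python
def sitelike_order_by_codon_position(num_sites):
    """List 1-based site numbers grouped by codon position (1st, 2nd, 3rd)."""
    return [site for pos in (1, 2, 3) for site in range(pos, num_sites + 1, 3)]


print(sitelike_order_by_codon_position(9))  # [1, 4, 7, 2, 5, 8, 3, 6, 9]
```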
<br />
===optimizeinputonly (do not search, only optimize model and branch lengths on user trees)===<br />
(new in version 2.0)<br />
<br />
'''optimizeinputonly''' = (0 or 1, '''0''') - Requires a NEXUS formatted input tree file (only NEXUS format will work!), specified through the '''[[GARLI_Configuration_Settings#streefname_.28source_of_starting_tree_and.2For_model.29 | streefname]]''' setting. All trees in that file will have model parameters and branch lengths optimized, giving their maximum likelihood scores. A file with the site-likelihoods of each tree will also be output. See the '''[[GARLI_Configuration_Settings#outputsitelikelihoods_.28write_a_file_with_the_log-likelihood_of_each_site.29 | outputsitelikelihoods]]''' setting for details.<br />
<br />
===collapsebranches (collapse zero length branches before writing final trees to file)===<br />
<br />
'''collapsebranches''' = (0 or 1, '''1''') - Before version 1.0, all trees that were returned were fully resolved. This was true even if the maximum-likelihood estimates of some internal branch lengths were effectively zero (or GARLI's minimum, which is 1e-8). In such cases, collapsing the branch into a polytomy would be a better representation. Note that GARLI will never return a tree with an actual branch length of zero, but rather with its minimum value of 1.0e-8. The drawback of always returning fully resolved trees is that what is effectively a polytomy can be resolved in three ways, and different independent searches may randomly return one of those resolutions. Thus, if you compare the trees by topology only, they will look different. If you pay attention to the branch lengths and likelihood scores of the trees it will be apparent that they are effectively the same.<br />
<br />
I think that collapsing of branches is particularly important when bootstrapping, since no support should be given to a branch that doesn't really exist, i.e., that is a random resolution of a polytomy. Collapsing is also good when calculating tree to tree distances such as the symmetric tree distance, for example when calculating phylogenetic error to a known target tree. Zero-length branches would add to the distances (~error) although they really should not.<br />
<br />
==Model specification settings==<br />
<br />
With version 1.0 and later there are now many more options dealing with model specification because of <br />
the inclusion of amino acid and codon-based models. The description of the settings will be <br />
broken up by data type. Note that in terms of the model settings in GARLI, “empirical” means <br />
to fix parameter values at those observed in the dataset being analyzed, and “fixed” means to fix <br />
them at user specified values. See the '''[[GARLI_Configuration_Settings#streefname_.28source_of_starting_tree_and.2For_model.29|streefname]]''' setting for details on how to provide <br />
parameter values to be fixed during inference.<br />
<br />
'''PARTITIONED MODELS: To use partitioned models you'll still need to use the same basic model settings detailed below for each individual partition subset, and then some other details to set up the partition model itself. Be sure that you are familiar with the rest of this section, then see [[Using_partitioned_models | Using partitioned models]].'''<br />
<br />
===datatype (sequence type and inference model)=== <br />
'''datatype''' = ('''nucleotide''', aminoacid, codon-aminoacid, codon, standard, standardordered, standardvariable, standardvariableordered) – The type of data and model that is <br />
to be used during tree inference. Nucleotide and amino acid data are self explanatory.<br />
<br />
The codon-aminoacid datatype means that the data will be <br />
supplied as a nucleotide alignment, but will be internally translated and analyzed using an <br />
amino acid model. The codon and codon-aminoacid datatypes require nucleotide sequence <br />
that is aligned in the correct reading frame. In other words, all gaps in the alignment should <br />
be a multiple of 3 in length, and the alignment should start at the first position of a codon. If <br />
the alignment has extra columns at the start, middle or end, they should be removed or <br />
excluded with a Nexus exset (see '''[[Garli_FAQ#Can_I_specify_alignment_columns_of_my_data_matrix_to_be_excluded.3F | this FAQ item]]''' for an example of exset usage). The correct <br />
'''[[GARLI_Configuration_Settings#geneticcode_.28code_to_use_in_codon_translation.29 | geneticcode]]''' must also be set.<br />
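<br />
Before running a codon or codon-aminoacid analysis it can help to check the reading-frame requirements described above. The Python sketch below is a hypothetical pre-flight check, not part of GARLI: it only tests that each sequence length and each run of gap characters is a multiple of 3, and it assumes "-" is the gap symbol.<br />

```python
import re


def frame_problems(alignment):
    """Return a list of messages for sequences that break the frame rules.

    alignment: dict mapping taxon name -> aligned nucleotide string.
    """
    problems = []
    for name, seq in alignment.items():
        if len(seq) % 3 != 0:
            problems.append(f"{name}: length {len(seq)} is not a multiple of 3")
        for match in re.finditer(r"-+", seq):  # each maximal run of gaps
            gap_length = len(match.group())
            if gap_length % 3 != 0:
                problems.append(
                    f"{name}: gap of length {gap_length} at column {match.start() + 1}"
                )
    return problems


print(frame_problems({"a": "ATG---CCC", "b": "ATGAAACCC"}))  # []
print(frame_problems({"a": "ATG--CCCC"}))
```

The second call reports one problem, a two-column gap starting at column 4. A check like this cannot tell whether the alignment starts at the first position of a codon, so that still needs to be confirmed by eye.<br />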
<br />
(New in Version 2.0)<br />
<br />
The various "standard" datatypes are new in GARLI 2.0. These represent morphology-like discrete characters, with any number of states. These are also termed the "Mk" and "Mkv" models by Lewis (2001). See this page for more details on using standard data: '''[[Garli_Mkv morphology model]]'''.<br />
<br />
===Settings for datatype = nucleotide=== <br />
====ratematrix (relative rate parameters assumed by substitution model)====<br />
'''ratematrix''' = (1rate, 2rate, '''6rate''', fixed, custom string) – The number of relative substitution rate <br />
parameters (note that the number of free parameters is this value minus one). Equivalent to <br />
the “nst” setting in PAUP* and MrBayes. 1rate assumes that substitutions between all pairs <br />
of nucleotides occur at the same rate (JC model), 2rate allows different rates for transitions and <br />
transversions (K2P or HKY models), and 6rate allows a different rate between each nucleotide pair (GTR). These rates are <br />
estimated unless the fixed option is chosen. Since version 0.96, parameters for any <br />
submodel of the GTR model may be estimated. The format for specifying this is very <br />
similar to that used in the “rclass” setting of PAUP*. Within parentheses, six letters are <br />
specified, with spaces between them. The six letters represent the rates of substitution <br />
between the six pairs of nucleotides, with the order being A-C, A-G, A-T, C-G, C-T and G-T. <br />
Letters within the parentheses that are the same mean that a single parameter is shared by <br />
multiple nucleotide pairs. For example, <br />
ratematrix = (a b a a b a) <br />
would specify the HKY 2-rate model (equivalent to ratematrix = 2rate). This entry, <br />
ratematrix = (a b c c b a)<br />
would specify 3 estimated rates of substitution, with one rate shared by A-C and G-T <br />
substitutions, another rate shared by A-G and C-T substitutions, and the final rate shared by <br />
A-T and C-G substitutions.<br />
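To see which nucleotide pairs share a parameter under a custom string, the letters can be grouped in the pair order given above. This is an illustrative sketch only; the function name is hypothetical and GARLI's own parser may behave differently (for example, in how it validates the string).<br />

```python
def parse_ratematrix(spec):
    """Group the six nucleotide pairs by the letters of a custom ratematrix string."""
    # Pair order as described in the text: A-C, A-G, A-T, C-G, C-T, G-T.
    pairs = ["A-C", "A-G", "A-T", "C-G", "C-T", "G-T"]
    letters = spec.strip().strip("()").split()
    if len(letters) != 6:
        raise ValueError("expected six letters, got %d" % len(letters))
    classes = {}
    for pair, letter in zip(pairs, letters):
        classes.setdefault(letter, []).append(pair)
    return classes

classes = parse_ratematrix("(a b c c b a)")
n_rates = len(classes)   # number of relative rate parameters
n_free = n_rates - 1     # free parameters: one rate serves as the reference
```

For the string (a b c c b a) this groups A-C with G-T, A-G with C-T, and A-T with C-G, matching the description above.<br />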
<br />
====statefrequencies (equilibrium base frequencies assumed by substitution model)====<br />
'''statefrequencies''' = (equal, empirical, '''estimate''', fixed) – Specifies how the equilibrium state <br />
frequencies (A, C, G and T) are treated. The empirical setting fixes the frequencies at their <br />
observed proportions, and the other options should be self-explanatory. <br />
===For datatype = nucleotide or aminoacid===<br />
====invariantsites (treatment of proportion of invariable sites parameter)====<br />
'''invariantsites''' = (none, '''estimate''', fixed) – Specifies whether a parameter representing the <br />
proportion of sites that are unable to change (i.e. have a substitution rate of zero) will be <br />
included. This is typically referred to as “invariant sites”, but would better be termed <br />
“invariable sites”. <br />
====ratehetmodel (type of rate heterogeneity to assume for variable sites)====<br />
'''ratehetmodel''' = (none, '''gamma''', gammafixed) – The model of rate heterogeneity assumed. <br />
“gammafixed” requires that the alpha shape parameter is provided, and a setting of “gamma” <br />
estimates it. <br />
====numratecats (number of overall substitution rate categories)==== <br />
'''numratecats''' = (1 to 20, '''4''') – The number of categories of variable rates (not including the <br />
invariant site class if it is being used). Must be set to 1 if '''ratehetmodel''' is set to none. Note <br />
that runtimes and memory usage scale linearly with this setting. <br />
===For datatype = aminoacid or codon-aminoacid===<br />
Amino acid analyses are typically done using fixed rate matrices that have been estimated on <br />
large datasets and published. Typically the only model parameters that are estimated during tree <br />
inference relate to the rate heterogeneity distribution. Each of the named matrices also has <br />
corresponding fixed amino acid frequencies, and a given rate matrix can either be used with those <br />
frequencies or with the amino acid frequencies observed in your dataset. This second option is <br />
often denoted as “+F” in a model description, although in terms of the GARLI configuration <br />
settings this is referred to as “empirical” frequencies. In GARLI the Dayhoff model would be <br />
specified by setting both the ratematrix and statefrequencies options to “dayhoff”. The <br />
Dayhoff+F model would be specified by setting the ratematrix to “dayhoff”, and <br />
statefrequencies to “empirical”.<br />
<br />
The following named amino acid models are implemented: <br />
{| border="1"<br />
|-<br />
!'''ratematrix'''/'''statefrequencies''' setting !! reference <br />
|-<br />
|align="center" | dayhoff || Dayhoff, Schwartz and Orcutt. 1978.<br />
|-<br />
|align="center" | jones || Jones, Taylor and Thornton (JTT), 1992. <br />
|-<br />
|align="center" | WAG || Whelan and Goldman, 2001.<br />
|-<br />
|align="center" | mtREV || Adachi and Hasegawa, 1996.<br />
|-<br />
|align="center" | mtmam || Yang, Nielsen and Hasegawa, 1998.<br />
|}<br />
Note that most programs allow either the use of a named rate matrix and its corresponding state <br />
frequencies, or a named rate matrix and empirical frequencies. GARLI technically allows the <br />
mixing of different named matrices and equilibrium frequencies (for example, wag matrix with <br />
jones equilibrium frequencies), but this is not recommended. <br />
====ratematrix (amino acid substitution rates)====<br />
'''ratematrix''' = (poisson, jones, dayhoff, wag, mtmam, mtrev) – The fixed amino acid rate matrix to <br />
use. You should use the matrix that gives the best likelihood, and could use a program like <br />
PROTTEST (very much like MODELTEST, but for amino acid models) to determine which <br />
fits best for your data. Poisson assumes a single rate of substitution between all amino acid <br />
pairs, and is a very poor model.<br />
====statefrequencies (equilibrium amino acid frequencies assumed by substitution model)====<br />
'''statefrequencies''' = (equal, empirical, estimate, fixed, jones, dayhoff, wag, mtmam, mtrev) – <br />
Specifies how the equilibrium state frequencies of the 20 amino acids are treated. The <br />
“empirical” option fixes the frequencies at their observed proportions (when describing a <br />
model this is often termed “+F”). <br />
===For datatype = codon===<br />
The codon models are built with three components: (1) parameters describing the process of <br />
individual nucleotide substitutions, (2) equilibrium codon frequencies, and (3) parameters <br />
describing the relative rate of nonsynonymous to synonymous substitutions. The nucleotide <br />
substitution parameters within the codon models are exactly the same as those possible with <br />
standard nucleotide models in GARLI, and are specified with the ratematrix configuration <br />
entry. Thus, they can be of the 2rate variety (inferring different rates for transitions and <br />
transversions, K2P or HKY-like), the 6rate variety (inferring different rates for all nucleotide <br />
pairs, GTR-like) or any other sub-model of GTR. The options for codon frequencies are <br />
specified with the statefrequencies configuration entry. The options are to use equal <br />
frequencies (not a good option), the frequencies observed in your dataset (termed “empirical” in <br />
GARLI), or the codon frequencies implied by the “F1x4” or “F3x4” methods (using PAML's <br />
terminology). These last two options calculate the codon frequencies as the product of the <br />
frequencies of the three nucleotides that make up each codon. In the “F1x4” case the nucleotide <br />
frequencies are those observed in the dataset across all codon positions, while the “F3x4” option <br />
uses the nucleotide frequencies observed in the data at each codon position separately. The final <br />
component of the codon models is the nonsynonymous to synonymous relative rate parameters <br />
(aka dN/dS or omega parameters). The default is to infer a single dN/dS value. Alternatively, a <br />
model can be specified that infers a given number of dN/dS categories, with the dN/dS values <br />
and proportions falling in each category estimated (ratehetmodel = nonsynonymous). This is <br />
the “discrete” or “M3” model in PAML's terminology. <br />
====ratematrix (relative ''nucleotide'' rate parameters assumed by codon model)====<br />
'''ratematrix''' = (1rate, '''2rate''', 6rate, fixed, custom string) – This determines the relative rates of <br />
nucleotide substitution assumed by the codon model. The options are exactly the same as <br />
those allowed under a normal nucleotide model. A codon model with '''ratematrix''' = 2rate <br />
specifies the standard Goldman and Yang (1994) model, with different substitution rates for <br />
transitions and transversions. <br />
====statefrequencies (equilibrium codon frequencies)====<br />
'''statefrequencies''' = (equal, empirical, f1x4, '''f3x4''') - The options are to use equal codon frequencies <br />
(not a good option), the frequencies observed in your dataset (termed “empirical” in GARLI), <br />
or the codon frequencies implied by the “F1x4” or “F3x4” methods (using PAML's <br />
terminology). These last two options calculate the codon frequencies as the product of the <br />
frequencies of the three nucleotides that make up each codon. In the “F1x4” case the <br />
nucleotide frequencies are those observed in the dataset across all codon positions, while the <br />
“F3x4” option uses the nucleotide frequencies observed in the data at each codon position <br />
separately. <br />
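The product rule behind “F1x4” and “F3x4” can be sketched as follows. The nucleotide frequencies here are made-up values, and the function name is hypothetical. Excluding the stop codons of the standard code and renormalizing the remaining 61 frequencies is an assumption about the usual convention; this page does not state how GARLI handles stop codons.<br />

```python
from itertools import product

STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code (assumed here)

def codon_freqs(pos_freqs, stops=STOP_CODONS):
    """pos_freqs: three dicts (one per codon position) mapping A, C, G, T to frequencies."""
    raw = {}
    for codon in map("".join, product("ACGT", repeat=3)):
        if codon in stops:
            continue
        f = 1.0
        for pos, base in enumerate(codon):
            f *= pos_freqs[pos][base]  # product of the three nucleotide frequencies
        raw[codon] = f
    total = sum(raw.values())
    # Renormalize so the sense-codon frequencies sum to 1 (assumption, see lead-in).
    return {codon: f / total for codon, f in raw.items()}

# F1x4: one set of frequencies pooled over all codon positions (hypothetical values).
overall = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}
f1x4 = codon_freqs([overall, overall, overall])

# F3x4: a separate set of frequencies for each codon position (hypothetical values).
by_pos = [
    {"A": 0.3, "C": 0.2, "G": 0.3, "T": 0.2},
    {"A": 0.3, "C": 0.25, "G": 0.15, "T": 0.3},
    {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
]
f3x4 = codon_freqs(by_pos)
```

The only difference between the two options is whether the same nucleotide frequencies are used at all three codon positions.<br />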
====ratehetmodel (variation in dN/dS across sites)==== <br />
'''ratehetmodel''' = ('''none''', nonsynonymous) – For codon models, the default is to infer a single dN/dS <br />
parameter. Alternatively, a model can be specified that infers a given number of dN/dS <br />
categories, with the dN/dS values and proportions falling in each category estimated <br />
('''ratehetmodel''' = nonsynonymous). This is the “discrete” or “M3” model of Yang et al. <br />
(2000). <br />
====numratecats (number of discrete dN/dS categories)====<br />
'''numratecats''' = (1 to 20, '''1''') – When ratehetmodel = nonsynonymous, this is the number of dN/dS parameter <br />
categories.<br />
====invariantsites====<br />
'''invariantsites''' = (none) - '''NOTE:''' Due to an error on my part, the '''invariantsites''' entry must appear with codon models for them to run (despite the fact that it doesn't apply). i.e., be sure that this appears:<br />
invariantsites = none.<br />
<br />
===For datatype = codon or codon-aminoacid=== <br />
====geneticcode (code to use in codon translation)====<br />
'''geneticcode''' = ('''standard''', vertmito, invertmito) – The genetic code to be used in translating codons <br />
into amino acids.<br />
<br />
==Population settings==<br />
===nindivs (number of individuals in population)===<br />
'''nindivs''' = (2 to 100, '''4''')- The number of individuals in the population. This may be increased, but <br />
doing so is generally not beneficial. Note that typical genetic algorithms tend to have much, <br />
much larger population sizes than GARLI's defaults. <br />
===holdover (unmutated copies of best individual)===<br />
'''holdover''' = (1 to nindivs-1, '''1''')- The number of times the best individual is copied to the next <br />
generation with no chance of mutation. It is best not to mess with this. <br />
===selectionintensity (strength of selection)===<br />
'''selectionintensity''' = (0.01 to 5.0, '''0.5''')- Controls the strength of selection, with larger numbers <br />
denoting stronger selection. The relative probability of reproduction of two individuals <br />
depends on the difference in their log likelihoods (&Delta;lnL) and is formulated very similarly to <br />
the procedure of calculating Akaike weights. The relative probability of reproduction of the <br />
less fit individual is equal to:<br />
<br />
<big><big><br />
''e''<sup> (-selectionIntensity &times; &Delta; lnL)</sup><br />
</big></big><br />
<br />
In general, this setting does not seem to have much of an effect on the progress of a run. In <br />
theory higher values should cause scores to increase more quickly, but make the search more <br />
likely to be entrapped in a local optimum. Low values will increase runtimes, but may be <br />
more likely to reach the true optimum. The following table gives the relative probabilities of <br />
reproduction for different values of the selection intensity when the difference in log <br />
likelihood is 1.0:<br />
{| border="1"<br />
|-<br />
!'''selectionintensity''' value !! Ratio of probabilities of reproduction <br />
|-<br />
|align="center" |0.05 || align="center" |0.95:1.0 <br />
|-<br />
|align="center" |0.1 || align="center" |0.90:1.0 <br />
|-<br />
|align="center" |0.25 || align="center" |0.78:1.0 <br />
|-<br />
|align="center" |0.5 || align="center" |0.61:1.0 <br />
|-<br />
|align="center" |0.75 || align="center" |0.47:1.0 <br />
|-<br />
|align="center" |1.0 || align="center" |0.37:1.0 <br />
|-<br />
|align="center" |2.0 || align="center" |0.14:1.0<br />
|}<br />
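The ratios in the table follow directly from the formula given above. A short sketch that recomputes them (the function name is hypothetical, not a GARLI identifier):<br />

```python
import math

def relative_reproduction_prob(selection_intensity, delta_lnl):
    """Relative probability of reproduction of the less fit individual."""
    return math.exp(-selection_intensity * delta_lnl)

# Recompute the table rows for a log-likelihood difference of 1.0.
for s in (0.05, 0.1, 0.25, 0.5, 0.75, 1.0, 2.0):
    print("%.2f  %.2f:1.0" % (s, relative_reproduction_prob(s, 1.0)))
```

The same function can be used with other log-likelihood differences to judge how strongly a given selectionintensity penalizes a poorer individual.<br />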
===holdoverpenalty (fitness handicap for best individual)===<br />
'''holdoverpenalty''' = (0 to 100, '''0''') – This can be used to bias the probability of reproduction of the <br />
best individual downward. Because the best individual is automatically copied into the next <br />
generation, it has a bit of an unfair advantage and can cause all population variation to be lost <br />
due to genetic drift, especially with small population sizes. The value specified here is <br />
subtracted from the best individual’s lnL score before calculating the probabilities of<br />
reproduction. It seems plausible that this might help maintain variation, but I have not seen it <br />
cause a measurable effect.<br />
<br />
===stopgen (maximum number of generations to run)===<br />
'''stopgen''' – The maximum number of generations to run. Note that this supersedes the automated <br />
stopping criterion (see enforcetermconditions above), and should therefore be set to a very <br />
large value if automatic termination is desired. <br />
===stoptime (maximum time to run)===<br />
'''stoptime''' – The maximum number of seconds for the run to continue. Note that this supersedes <br />
the automated stopping criterion (see enforcetermconditions above), and should therefore <br />
be set to a very large value if automatic termination is desired.<br />
<br />
==Branch-length optimization settings== <br />
After a topological rearrangement, branch lengths in the vicinity of the rearrangement are <br />
optimized by the Newton-Raphson method. Optimization passes are performed on a particular <br />
branch until the expected improvement in likelihood for the next pass is less than a threshold <br />
value, termed the optimization precision. Note that this name is somewhat misleading, as the <br />
precision of the optimization algorithm is inversely related to this value (i.e., smaller values of <br />
the optimization precision lead to more precise optimization). If the improvement in likelihood <br />
due to optimization for a particular branch is greater than the optimization precision, <br />
optimization is also attempted on adjacent branches, spreading out across the tree. When no new <br />
topology with a better likelihood score is discovered for a while, the value is automatically <br />
reduced. The value can have a large effect on speed, with smaller values significantly slowing <br />
down the algorithm. The value of the optimization precision and how it changes over the course <br />
of a run are determined by the following three parameters. <br />
===startoptprec===<br />
startoptprec (0.005 to 5.0, '''0.5''') – The beginning optimization precision. <br />
===minoptprec===<br />
minoptprec (0.001 to startoptprec, '''0.01''') – The minimum allowed value of the optimization precision.<br />
<br />
===numberofprecreductions===<br />
numberofprecreductions (0 to 100, '''10''') – Specify the number of steps that it will take for the <br />
optimization precision to decrease (linearly) from startoptprec to minoptprec. <br />
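As an illustration of the linear decrease, the sketch below lists the precision values visited with the default settings. It assumes equal-sized steps, which is one reading of “linearly”; the exact internal schedule used by GARLI is not documented on this page, and the function name is hypothetical.<br />

```python
def precision_schedule(startoptprec=0.5, minoptprec=0.01, numberofprecreductions=10):
    """Optimization precision values from startoptprec down to minoptprec in equal steps."""
    if numberofprecreductions == 0:
        return [startoptprec]
    step = (startoptprec - minoptprec) / numberofprecreductions
    return [startoptprec - i * step for i in range(numberofprecreductions + 1)]

schedule = precision_schedule()
```

With the defaults this gives eleven values (the starting value plus ten reductions), each 0.049 smaller than the last.<br />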
===treerejectionthreshold===<br />
treerejectionthreshold (0 to 500, '''50''') – This setting controls which trees have more extensive <br />
branch-length optimization applied to them. All trees created by a branch swap receive <br />
optimization on a few branches that directly took part in the rearrangement. If the difference <br />
in score between the partially optimized tree and the best known tree is greater than treerejectionthreshold, no further optimization is applied to the branches of that tree. <br />
Reducing this value can significantly reduce runtimes, often with little or no effect on results. <br />
However, it is possible that a better tree could be missed if this is set too low. In cases in <br />
which obtaining the very best tree per search is not critical (e.g., bootstrapping), setting this <br />
lower (~20) is probably safe.<br />
<br />
==Settings controlling the proportions of the mutation types== <br />
Each mutation type is assigned a prior ''weight''. These values determine the expected <br />
proportions of the various mutation types that are performed. The primary mutation categories <br />
are ''topology'' (t), ''model'' (m) and ''branch length'' (b). Each is assigned a prior weight ( P<sub>i</sub> ) in the <br />
config file. Each time that a new best likelihood score is attained, the amount of the increase in <br />
score is credited to the mutation type responsible, with the sum of the increases ( S<sub>i</sub> ) maintained <br />
over the last intervallength x intervalstostore generations. The number of times that each <br />
mutation is performed ( N<sub>i</sub> ) is also tallied. The total weight of a mutation type is W<sub>i</sub> = P<sub>i</sub> + ( S<sub>i</sub> / N<sub>i</sub> ). <br />
The proportion of mutations of type i out of all mutations is then <br />
<br />
<big><big><br />
Pr(i) = W<sub>i</sub> / <br />
(W<sub>t</sub> + W<sub>m</sub> + W<sub>b</sub>) <br />
</big></big><br />
<br />
The proportion of each mutation is thus related to its prior weight and the average increase in <br />
score that it has caused over recent generations. The prior weights can be used to control the <br />
expected (and starting) proportions of the mutation types, as well as how sensitive the <br />
proportions are to the course of events in a run. It is generally a good idea to make the topology <br />
prior much larger than the others so that when no mutations are improving the score many <br />
topology mutations are still attempted. If you set '''outputmostlyuselessfiles''' to 1, you can look at the “problog” file to determine what the <br />
proportions of the mutations actually were over the course of a run. <br />
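The weighting scheme above can be written out as a short calculation. This is a sketch of the formulas as stated here, not GARLI's source code; the function and key names are hypothetical, and treating the average gain as zero when a mutation type has not yet been performed is an assumption.<br />

```python
def mutation_proportions(prior, gain, count):
    """Return Pr(i) = W_i / sum(W), with W_i = P_i + S_i / N_i."""
    weights = {}
    for kind in prior:
        # Average score increase per mutation; assumed 0 if none performed yet.
        avg_gain = gain[kind] / count[kind] if count[kind] else 0.0
        weights[kind] = prior[kind] + avg_gain
    total = sum(weights.values())
    return {kind: w / total for kind, w in weights.items()}

# Default prior weights listed on this page (topoweight, modweight, brlenweight).
prior = {"topology": 1.0, "model": 0.05, "brlen": 0.2}

# Starting proportions: no score increases recorded yet.
start = mutation_proportions(prior, dict.fromkeys(prior, 0.0), dict.fromkeys(prior, 0))
```

With the default priors and no recorded improvements, topology mutations make up 80% of all mutations, which shows why a large topology prior keeps topology moves frequent when nothing is improving the score.<br />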
===topoweight (weight on topology mutations)===<br />
topoweight (0 to infinity, '''1.0''') The prior weight assigned to the class of topology mutations <br />
(NNI, SPR and limSPR). Note that setting this to 0.0 turns off topology mutations, meaning that the tree topology is fixed for the run. This used to be a way to have the program estimate only model parameters and branch-lengths, but the '''[[GARLI_Configuration_Settings#optimizeinputonly_.28do_not_search.2C_only_optimize_model_and_branch_lengths_on_user_trees.29 | optimizeinputonly]]''' setting is now a better way to go.<br />
<br />
===modweight (weight on model parameter mutations)===<br />
modweight (0 to infinity, '''0.05''') The prior weight assigned to the class of model mutations. Note <br />
that setting this at 0.0 fixes the model during the run. <br />
===brlenweight (weight on branch-length parameter mutations)===<br />
brlenweight (0 to infinity, '''0.2''') The prior weight assigned to branch-length mutations.<br />
<br />
The same procedure used above to determine the proportion of Topology:Model:Branch-Length <br />
mutations is also used to determine the relative proportions of the three types of topological <br />
mutations (NNI:SPR:limSPR), controlled by the following three weights. Note that the <br />
proportion of mutations applied to each of the model parameters is not user controlled. <br />
===randnniweight (weight on NNI topology changes)===<br />
randnniweight (0 to infinity, '''0.1''') - The prior weight assigned to NNI mutations. <br />
===randsprweight (weight on SPR topology changes)===<br />
randsprweight (0 to infinity, '''0.3''') - The prior weight assigned to random SPR mutations. For <br />
very large datasets it is often best to set this to 0.0, as random SPR mutations essentially <br />
never result in score increases. <br />
===limsprweight (weight on localized SPR topology changes)===<br />
limsprweight (0 to infinity, '''0.6''') - The prior weight assigned to SPR mutations with the <br />
reconnection branch limited to being a maximum of limsprrange branches away from where <br />
the branch was detached. <br />
===intervallength===<br />
intervallength (10 to 1000, '''100''') – The number of generations in each interval during which the <br />
number and benefit of each mutation type are stored. <br />
===intervalstostore===<br />
intervalstostore = (1 to 10, '''5''') – The number of intervals to be stored. Thus, records of <br />
mutations are kept for the last (intervallength x intervalstostore) generations. Every <br />
intervallength generations the probabilities of the mutation types are updated by the scheme <br />
described above.<br />
<br />
==Settings controlling mutation details==<br />
===limsprrange (max range for localized SPR topology changes)===<br />
limsprrange (0 to infinity, '''6''') – The maximum number of branches away from its original <br />
location that a branch may be reattached during a limited SPR move. Setting this too high (> <br />
10) can seriously degrade performance, but if you do so in conjunction with a large increase in genthreshfortopoterm you might end up with better trees.<br />
===meanbrlenmuts (mean # of branch lengths to change per mutation)===<br />
meanbrlenmuts (1 to # taxa, '''5''') - The mean of the binomial distribution from which the number <br />
of branch lengths mutated is drawn during a branch length mutation. <br />
===gammashapebrlen (magnitude of branch-length mutations)===<br />
gammashapebrlen (50 to 2000, '''1000''') - The shape parameter of the gamma distribution (with a <br />
mean of 1.0) from which the branch-length multipliers are drawn for branch-length <br />
mutations. Larger numbers cause smaller changes in branch lengths. (Note that this has <br />
nothing to do with gamma rate heterogeneity.) <br />
===gammashapemodel (magnitude of model parameter mutations)===<br />
gammashapemodel (50 to 2000, '''1000''') - The shape parameter of the gamma distribution (with a <br />
mean of 1.0) from which the model mutation multipliers are drawn for model parameters <br />
mutations. Larger numbers cause smaller changes in model parameters. (Note that this has <br />
nothing to do with gamma rate heterogeneity.) <br />
===uniqueswapbias (relative weight assigned to already attempted branch swaps)===<br />
uniqueswapbias (0.01 to 1.0, '''0.1''') – With version 0.95 and later, GARLI keeps track of which branch <br />
swaps it has attempted on the current best tree. Because swaps are applied randomly, it is <br />
possible that some swaps are tried twice before others are tried at all. This option allows the <br />
program to bias the swaps applied toward those that have not yet been attempted. Each swap <br />
is assigned a relative weight depending on the number of times that it has been attempted on <br />
the current best tree. This weight is equal to (uniqueswapbias) raised to the (# times swap <br />
attempted) power. In other words, a value of 0.5 means that swaps that have already been <br />
tried once will be half as likely as those not yet attempted, swaps attempted twice will be ¼ <br />
as likely, etc. A value of 1.0 means no biasing. If this value is not equal to 1.0 and the <br />
outputmostlyuselessfiles option is on, a file called <ofprefix>.swap.log is output. This file <br />
shows the total number of rearrangements tried and the number of unique ones over the course <br />
of a run. Note that this bias is only applied to NNI and limSPR rearrangements. Use of this <br />
option may allow the use of somewhat larger values of limsprrange. <br />
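The weighting rule for uniqueswapbias can be sketched in a few lines. The function names are hypothetical, and normalizing the weights into selection probabilities is an assumption about how relative weights would be used; the page itself only defines the weights.<br />

```python
def swap_weight(uniqueswapbias, times_attempted):
    """Relative weight of a swap already attempted a given number of times."""
    return uniqueswapbias ** times_attempted

def swap_probabilities(uniqueswapbias, attempts):
    """Normalize the weights of several candidate swaps into probabilities."""
    weights = [swap_weight(uniqueswapbias, n) for n in attempts]
    total = sum(weights)
    return [w / total for w in weights]

# Three candidate swaps tried 0, 1 and 2 times, with a bias of 0.5.
probs = swap_probabilities(0.5, [0, 1, 2])
```

With a bias of 0.5 the untried swap is twice as likely to be chosen as the swap tried once and four times as likely as the swap tried twice; with a bias of 1.0 all three are equally likely.<br />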
===distanceswapbias (relative weight assigned to branch swaps based on locality)===<br />
distanceswapbias (0.1 to 10, '''1.0''') – This option is similar to uniqueswapbias, except that it <br />
biases toward certain swaps based on the topological distance between the initial and <br />
rearranged trees. The distance is measured as for limsprrange, and is half the <br />
Robinson-Foulds distance between the trees. As with uniqueswapbias, distanceswapbias <br />
assigns a relative weight to each potential swap. In this case the weight is <br />
(distanceswapbias) raised to the (reconnection distance - 1) power. Thus, given a value of <br />
0.5, the weight of an NNI is 1.0, the weight of an SPR with distance 2 is 0.5, with distance 3 <br />
is 0.25, etc. Note that values less than 1.0 bias toward more localized swaps, while values <br />
greater than 1.0 bias toward more extreme swaps. Also note that this bias is only applied to <br />
limSPR rearrangements. Be careful in setting this, as extreme values can have a very large <br />
effect.</div>Zwicklhttps://molevol.mbl.edu/index.php?title=GARLI_Configuration_Settings&diff=4383GARLI Configuration Settings2015-07-21T16:30:56Z<p>Zwickl: </p>
<hr />
<div>==Descriptions of GARLI configuration settings==<br />
The format for these configuration settings descriptions is generally:<br />
'''entryname''' (possible values, '''default value in bold''') – description<br />
<br />
==General settings==<br />
===datafname (file containing sequence dataset)=== <br />
'''datafname''' = (filename) –<br />
Name of the file containing the aligned sequence data. Formats accepted are <br />
PHYLIP, NEXUS and FASTA. Robust reading of Nexus formatted <br />
datasets is done using the Nexus Class Library. This accommodates things such as interleaved <br />
alignments and exclusion sets (exset’s) in NEXUS assumptions blocks (see the FAQ for an <br />
example of [[Garli_FAQ#Can_I_specify_alignment_columns_of_my_data_matrix_to_be_excluded.3F | exset usage]]). Use of NEXUS files is recommended, and is required for partitioned models.<br />
<br />
===constraintfile (file containing constraint definition)===<br />
'''constraintfile''' = (filename, '''none''') –<br />
Name of the file containing any topology constraint specifications, or “none” if <br />
there are no constraints. The easiest way to explain the format of the constraint file is by <br />
example. Consider a dataset of 8 taxa, in which your constraint consists of grouping taxa 1, 3 <br />
and 5. You may specify either positive constraints (inferred tree MUST contain constrained <br />
group) or negative constraints (also called converse constraints, inferred tree CANNOT <br />
contain constrained group). These are specified with either a ‘+’ or a ‘-’ at the beginning of <br />
the constraint specification, for positive and negative constraints, respectively. <br />
*For a positive constraint on a grouping of taxa 1, 3 and 5: <br />
+((1,3,5),2,4,6,7,8); <br />
*For a negative constraint on a grouping of taxa 1, 3 and 5: <br />
-((1,3,5),2,4,6,7,8); <br />
*Note that there are many other equivalent parenthetical representations of these constraints. <br />
*Multiple groups may be positively constrained, but currently only a single negatively constrained group is allowed. <br />
*Multiple constrained groupings may be specified in a single string: <br />
+((1,3,5),2,4,(6,7),8); <br />
or in two separate strings on successive lines: <br />
+((1,3,5),2,4,6,7,8); <br />
+(1,3,5,2,4,(6,7),8); <br />
*Constraint strings may also be specified in terms of taxon names (matching those used in the alignment) instead of numbers. <br />
*Positive and negative constraints cannot be mixed. <br />
*GARLI also accepts another constraint format that may be easier to use in some cases. This involves specifying a single branch to be constrained with a string of ‘*’ (asterisk) and ‘.’ (period) characters, with one character per taxon. Each taxon specified with a ‘*’ falls on one side of the constrained branch, and all those specified with a ‘.’ fall on the other. This should be familiar to anyone who has looked at PAUP* bootstrap output. With this format, a positive constraint on a grouping of taxa 1, 3 and 5 would look like this: <br />
+*.*.*... <br />
or equivalently like this: <br />
+.*.*.*** <br />
With this format each line only designates a single branch, so multiple constrained branches must be specified as multiple lines in the file. <br />
*The program also allows “backbone” constraints, which are simply constraints that do not include all of the taxa. For example if one wanted to constrain (1,3,5) but didn’t care where taxon 6 appeared in the tree, this string could be used: <br />
+((1,3,5),2,4,7,8); <br />
:Nothing special needs to be done to identify this as a backbone constraint, simply leave out some taxa.<br />
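For larger datasets the asterisk/period constraint lines are tedious to type by hand. The sketch below builds one from a set of taxon numbers; the function names are hypothetical helpers, not part of GARLI, and taxa are assumed to be numbered from 1 in alignment order.<br />

```python
def bipartition_constraint(num_taxa, group, positive=True):
    """Build a '*'/'.' constraint line with one character per taxon (numbered from 1)."""
    sign = "+" if positive else "-"
    marks = "".join("*" if taxon in group else "." for taxon in range(1, num_taxa + 1))
    return sign + marks

def complement(constraint):
    """Swap '*' and '.', which describes the same branch from the other side."""
    flip = {"*": ".", ".": "*"}
    return constraint[0] + "".join(flip[c] for c in constraint[1:])

# The 8-taxon example from the text: a positive constraint grouping taxa 1, 3 and 5.
line = bipartition_constraint(8, {1, 3, 5})
other_side = complement(line)
```

Each generated line describes a single branch, so several constrained branches require several calls and several lines in the constraint file.<br />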
<br />
===streefname (source of starting tree and/or model)===<br />
'''streefname''' = (random, <u>'''stepwise'''</u>, <filename>) – Specifies where the starting tree topology and/or <br />
model parameters will come from. The tree topology may be a completely random topology <br />
(constraints will be enforced), a tree provided by the user in a file, <u>or a tree generated by the <br />
program using a fast ML stepwise-addition algorithm (see [[GARLI_Configuration_Settings#attachmentspertaxon_.28control_creation_of_stepwise_addition_starting_tree.29|attachmentspertaxon]] below)</u>. <br />
Starting or fixed model parameter values may also be provided in the specified file, with or <br />
without a tree topology. Some notes on starting trees/models: <br />
*Specified starting trees may have polytomies, which will be arbitrarily resolved before the run begins.<br />
*Starting tree formats: <br />
**Plain newick tree string (with taxon numbers or names, with or without branch lengths) <br />
**NEXUS trees block. The trees block can appear in the same file as a NEXUS data or characters block that contains the alignment, although the same filename should then be specified on both the datafname and streefname lines.<br />
*If multiple trees appear in the specified file and multiple search replicates are specified (see '''[[GARLI_Configuration_Settings#searchreps_.28number_of_independent_search_replicates.29|searchreps]]''' setting), then the first tree is used in the first replicate, the second in the second replicate, etc. <br />
*Providing model parameter values: see this page '''[[Specifying_model_parameter_values | Specifying model parameter values]]'''<br />
*See also the FAQ items on model parameters '''[[Garli_FAQ#Model_parameters | here]]'''.<br />
<br />
===attachmentspertaxon (control creation of stepwise addition starting tree)===<br />
'''attachmentspertaxon''' = (1 to infinity, '''50''') – The number of attachment branches evaluated for each taxon to be added to the tree during the creation of an ML stepwise-addition starting tree. Briefly, stepwise addition is an algorithm used to make a tree, and involves adding taxa in a random order to a growing tree. For each taxon to be added, a number of randomly chosen attachment branches are tried and scored, and then the best scoring one is chosen as the location of that taxon. The attachmentspertaxon setting controls how many attachment points are evaluated for each taxon to be added. A value of one is equivalent to a completely random tree (only one randomly chosen location is evaluated). A value of greater than 2 times the number of taxa in the dataset means that all attachment points will be evaluated for each taxon, and will result in very good starting trees (but may take a while on large datasets). Even fairly small values (< 10) can result in starting trees that are much, much better than random, but still fairly different from one another. <br />
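The remark that a value above twice the number of taxa evaluates every attachment point follows from a standard fact: an unrooted binary tree with k taxa has 2k - 3 branches. The sketch below is illustrative only. The function names are hypothetical, and capping the number of evaluated branches at the number available is an assumption about how the setting behaves.<br />

```python
def branches_in_unrooted_tree(num_taxa):
    """Number of branches in an unrooted binary tree with num_taxa tips (num_taxa >= 2)."""
    return 2 * num_taxa - 3

def attachments_evaluated(attachmentspertaxon, taxa_in_tree):
    """Attachment branches tried when adding one taxon to a tree of taxa_in_tree tips."""
    return min(attachmentspertaxon, branches_in_unrooted_tree(taxa_in_tree))

# Adding a taxon to a growing tree that already holds 8 taxa, with the default of 50.
tried_small = attachments_evaluated(50, 8)
# Adding a taxon to a tree that already holds 100 taxa.
tried_large = attachments_evaluated(50, 100)
```

For small trees the default of 50 already covers every branch, while for large trees only a random subset of branches is scored for each added taxon.<br />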
===ofprefix (output filename prefix)===<br />
'''ofprefix''' = (text) – Prefix of all output filenames, such as log, treelog, etc. Change this for each run that you do or the program will overwrite previous results. <br />
===randseed (random number seed)===<br />
'''randseed''' = (-1 or positive integers, '''-1''') – The random number seed used by the random number generator. Specify “–1” to have a seed chosen for you. Specifying the same seed number in multiple runs will give exactly identical results, if all other parameters and settings are also identical. <br />
===availablememory (control maximum program memory usage)===<br />
'''availablememory''' – Typically this is the amount of available physical memory on the system, in megabytes. This lets GARLI determine how much system memory it may be able to use to store computations for reuse. The program will be conservative and use at most about 90% of this value, although for typical datasets it will use much, much less. If other programs must be open or used when GARLI is running, you may need to reduce this value. This setting can also have a significant effect on performance (speed), but more is not always better. When a run is started, GARLI will output the '''availablememory''' value necessary for that dataset to achieve each of the “memory levels” (from best to worst, termed “great”, “good”, “low”, and “very low”). More memory is generally better because more calculations are stored and can be reused, but when the amount of memory needed for “great” becomes more than 512 megabytes or so, performance can be slowed because the operating system has difficulty managing that much memory. In general, choose an amount of memory that allows “great” when this is less than 512 MB, and if it is greater reduce the amount of memory into “good” or “low” as necessary. Avoid “very low” whenever possible. You can find the value that is approximately optimal for your dataset by setting '''randseed''' to some positive value (so that the searches are identical) and doing runs with various '''availablememory''' values. Typically it will make at most a 30% difference, so it isn't worth worrying about too much. <br />
===logevery (frequency to log best score to file)===<br />
'''logevery''' = (1 to infinity, '''10''') – The frequency at which the best score is written to the log file. <br />
===saveevery (frequency to save best tree to file or write checkpoints)===<br />
'''saveevery''' = (1 to infinity, '''100''') – If '''writecheckpoints''' or '''outputcurrentbesttopology''' are specified, this is the frequency (in generations) at which checkpoints or the current best tree are written to file. <br />
===refinestart (whether to optimize a bit before starting a search)===<br />
'''refinestart''' = (0 or 1, '''1''') – Specifies whether some initial rough optimization is performed on the starting branch lengths and rate heterogeneity parameters. This is always recommended. <br />
===outputcurrentbesttopology (continuously write the best tree to file during a run)===<br />
'''outputcurrentbesttopology''' = (0 or 1, '''0''') – If true, the current best tree of the current search replicate is written to <ofprefix>.best.current.tre every '''saveevery''' generations. In versions before 0.96 the current best topology was always written to file, but that is no longer the case. Seeing the current best tree has no real use apart from satisfying your curiosity about how a <br />
run is going.<br />
===outputeachbettertopology (write each improved topology to file)===<br />
'''outputeachbettertopology''' (0 or 1, '''0''') – If true, each new topology encountered with a better score than the previous best is written to file. In some cases this can result in really big files (hundreds of MB) though, especially for random starting topologies on large datasets. Note that this file is interesting to get an idea of how the topology changed as the searches progressed, but the collection of trees should NOT be interpreted in any meaningful way. This option is not available while bootstrapping. <br />
===enforcetermconditions (use automatic termination)===<br />
'''enforcetermconditions''' = (0 or 1, '''1''') – Specifies whether the automatic termination conditions will be used. The conditions specified by both of the following two parameters must be met. See the following two parameters for their definitions. If this is false, the run will continue until it reaches the time ('''stoptime''') or generation ('''stopgen''') limit. It is highly recommended that this option be used! <br />
===genthreshfortopoterm (number of generations without topology improvement required for termination) ===<br />
'''genthreshfortopoterm''' = (1 to infinity, '''20,000''') – This specifies the first part of the termination condition. When no new significantly better scoring topology (see significanttopochange below) has been encountered in greater than this number of generations, this condition is met. Increasing this parameter may improve the lnL scores obtained (especially on large datasets), but will also increase runtimes. <br />
===scorethreshforterm (max score improvement over recent generations required for termination)===<br />
'''scorethreshforterm''' = (0 to infinity, '''0.05''') – The second part of the termination condition. When the total improvement in score over the last '''intervallength''' x '''intervalstostore''' generations (default is 500 generations, see below) is less than this value, this condition is met. This does not usually need to be changed. <br />
===significanttopochange (required score improvement for topology to be considered better)===<br />
'''significanttopochange''' = (0 to infinity, '''0.01''') – The lnL increase required for a new topology to be <br />
considered significant as far as the termination condition is concerned. It probably doesn’t <br />
need to be played with, but you might try increasing it slightly if your runs reach a stable <br />
score and then take a very long time to terminate due to very minor changes in topology. <br />
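The two-part termination condition described in the last three settings can be sketched as follows. This is a hedged illustration rather than GARLI's implementation: the function and argument names are hypothetical, and the caller is assumed to track the generations since the last significant topology improvement and the total score improvement over the recent window.<br />

```python
def should_terminate(gens_since_topo_improvement: int,
                     recent_score_improvement: float,
                     genthreshfortopoterm: int = 20000,
                     scorethreshforterm: float = 0.05) -> bool:
    """Return True only when both termination conditions are met.

    gens_since_topo_improvement: generations since a topology improved the
        lnL by at least significanttopochange (tracked by the caller).
    recent_score_improvement: total lnL gain over the last
        intervallength x intervalstostore generations.
    """
    topology_condition = gens_since_topo_improvement > genthreshfortopoterm
    score_condition = recent_score_improvement < scorethreshforterm
    return topology_condition and score_condition


# A long-stalled topology is not enough if the score is still climbing.
print(should_terminate(25000, 0.20))
print(should_terminate(25000, 0.01))
```

Raising '''significanttopochange''' makes small lnL gains stop resetting the first counter, which is why it can shorten runs that linger on very minor topology changes.<br />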
===outputphyliptree (write trees to file in Phylip as well as Nexus format)===<br />
'''outputphyliptree''' = (0 or 1, '''0''') – Whether Phylip-formatted tree files will be output in addition to <br />
the default Nexus files for the best tree across all replicates (<ofprefix>.best.phy), the best <br />
tree for each replicate (<ofprefix>.best.all.phy) or, in the case of bootstrapping, the best tree <br />
for each bootstrap replicate (<ofprefix>.boot.phy). <br />
===outputmostlyuselessfiles (output uninteresting files)===<br />
'''outputmostlyuselessfiles''' = (0 or 1, '''0''') – Whether to output three files of little general interest: the <br />
“fate”, “problog” and “swaplog” files. The fate file shows the parentage, mutation types and <br />
scores of every individual in the population during the entire search. The problog shows how <br />
the proportions of the different mutation types changed over the course of the run. The <br />
swaplog shows the number of unique swaps and the number of total swaps on the current <br />
best tree over the course of the run. <br />
===writecheckpoints (write checkpoint files during run)===<br />
'''writecheckpoints''' = (0 or 1, '''0''') – Whether to write three files to disk containing all information <br />
about the current state of the population every saveevery generations, with each successive <br />
checkpoint overwriting the previous one. These files can be used to restart a run at the last <br />
written checkpoint by setting the '''restart''' configuration entry.<br />
===restart (restart run from checkpoint)===<br />
'''restart''' = (0 or 1, '''0''') – Whether to restart at a previously saved checkpoint. To use this option the '''writecheckpoints''' option must have been used during a previous run. The program will look for checkpoint files that are named based on the ofprefix of the previous run. If you intend to restart a run, NOTHING should be changed in the config file except setting '''restart''' to 1. <br />
A run that is restarted from checkpoint will give ''exactly'' the same results it would have if the run had gone to completion. <br />
===outgroup (orient inferred trees consistently)===<br />
'''outgroup''' = (outgroup taxon numbers, separated by spaces) – This option allows for orienting the tree topologies in a consistent way when they are written to file. Note that this has NO effect whatsoever on the actual inference and the specified outgroup is NOT constrained to be present in the inferred trees. If multiple outgroup taxa are specified and they do not form a monophyletic group in the inferred tree, this setting will be ignored. If you specify a single outgroup taxon it will always be present, and the tree will always be consistently oriented. Ranges can be indicated with a hyphen, e.g., to specify an outgroup consisting of taxa 1, 2, 3 and 5 the format is this: <br />
outgroup = 1-3 5<br />
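How a range specification such as the one above expands into taxon numbers can be sketched in Python. The parser below is a hypothetical illustration of the format, not the code GARLI uses to read the setting.<br />

```python
from typing import List


def parse_outgroup(spec: str) -> List[int]:
    """Expand an outgroup string such as '1-3 5' into taxon numbers."""
    taxa: List[int] = []
    for token in spec.split():
        if "-" in token:
            # A hyphen marks an inclusive range of taxon numbers.
            start, end = token.split("-")
            taxa.extend(range(int(start), int(end) + 1))
        else:
            taxa.append(int(token))
    return taxa


print(parse_outgroup("1-3 5"))  # [1, 2, 3, 5]
```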
<br />
===searchreps (number of independent search replicates)===<br />
'''searchreps''' = (1 to infinity, '''2''') – The number of independent search replicates to perform during a program execution. You should always either do multiple search replicates or multiple program executions with any dataset to get a feel for whether you are getting consistent results, which suggests that the program is doing a decent job of searching. Note that if this is > 1 and you are performing a bootstrap analysis, this is the number of search replicates to be done per bootstrap replicate. That can increase the chance of finding the best tree per bootstrap replicate, but will also increase bootstrap runtimes enormously.<br />
===bootstrapreps (number of bootstrap replicates)===<br />
'''bootstrapreps''' = (0 to infinity, '''0''') - The number of bootstrap reps to perform. If this is greater than <br />
0, normal searching will not be performed. The resulting bootstrap trees (one per rep) will be <br />
output to a file named <ofprefix>.boot.tre. To obtain the bootstrap proportions they will then <br />
need to be read into PAUP* or a similar program to obtain a majority rule consensus. Note <br />
that it is probably safe to reduce the strictness of the termination conditions during <br />
bootstrapping (perhaps halve genthreshfortopoterm), which will greatly speed up the <br />
bootstrapping process with negligible effects on the results.<br />
<br />
Note that GARLI does NOT do the bootstrap consensus or calculate the bootstrap proportions itself. It simply infers the trees that are the input to that consensus. The trees in the .boot.tre file will need to be read into a program such as SumTrees (or PAUP* or CONSENSE) to do a majority-rule consensus and obtain your bootstrap support values. See the following part of the "Advanced topics" page for suggestions on making the consensus: '''[[Garli_Advanced_topics#Detailed_Example:_A_bootstrap_analysis|Detailed Example: A bootstrap analysis]]'''.<br />
<br />
===resampleproportion (relative size of re-sampled data matrix)===<br />
resampleproportion (0.1 to 10, '''1.0''' ) – When bootstrapreps > 0, this setting allows for <br />
bootstrap-like resampling, but with the pseudoreplicate datasets having a number of <br />
alignment columns different from the real data. Setting values < 1.0 is somewhat similar to jackknifing, but not identical. <br />
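The effect of '''resampleproportion''' on the size of each pseudoreplicate can be sketched as below. This is a hypothetical illustration only: the function name is made up, and rounding the scaled column count to the nearest integer is an assumption, since the exact rule GARLI applies is not stated here.<br />

```python
import random
from typing import List, Optional


def resample_columns(n_columns: int,
                     resampleproportion: float = 1.0,
                     seed: Optional[int] = None) -> List[int]:
    """Draw alignment column indices (0-based) with replacement.

    Assumption: the pseudoreplicate has round(n_columns * resampleproportion)
    columns; GARLI's own rounding rule may differ.
    """
    rng = random.Random(seed)
    n_draws = round(n_columns * resampleproportion)
    return [rng.randrange(n_columns) for _ in range(n_draws)]


# A proportion of 0.5 gives a pseudoreplicate half the size of the real data.
half = resample_columns(100, 0.5, seed=1)
print(len(half))
```

Columns are still drawn with replacement when the proportion is below 1.0, which is why the result is only similar to jackknifing and not identical to it.<br />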
===inferinternalstateprobs (infer ancestral states)===<br />
inferinternalstateprobs = (0 or 1, '''0''') – Specify 1 to have GARLI infer the marginal posterior <br />
probability of each character at each internal node. This is done at the very end of the run, <br />
just before termination. The results are output to a file named <ofprefix>.internalstates.log.<br />
<br />
===outputsitelikelihoods (write a file with the log-likelihood of each site)===<br />
<br />
'''outputsitelikelihoods''' = (0 or 1, '''0''') - Causes a file named <ofprefix>.sitelikes.log to be created containing the sitewise log-likelihood values for each individual site for the final tree of each search replicate. Note that the format of the file is exactly the same as that output by PAUP*, so it can be directly used in CONSEL using the PAUP file-format option. For partitioned models the sites are listed in sets by partition subset, but the site numbers will match those in the original datamatrix. Thus, when using a partitioned model that divides sites by codon position, the ordering would be 1, 4, 7, ... 2, 5, 8, ... 3, 6, 9 .... Note that this makes no difference to CONSEL, which just assumes that the site likelihoods are ordered the same for each tree and ignores the site numbers. <br />
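The site ordering described for a codon-position partition can be reproduced with a short Python sketch. The function name is hypothetical, and the sketch assumes exactly three subsets (first, second and third codon positions) listed in that order.<br />

```python
from typing import List


def site_order_by_codon_position(n_sites: int) -> List[int]:
    """Original (1-based) site numbers, grouped by codon position."""
    return [site
            for position in range(3)
            for site in range(position + 1, n_sites + 1, 3)]


print(site_order_by_codon_position(9))  # [1, 4, 7, 2, 5, 8, 3, 6, 9]
```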
<br />
Also note that using the site likelihoods from a partitioned model is a violation of CONSEL's assumptions, since it will not know about the partitions and its internal bootstrap resampling will not obey them. It isn't clear what the effects of this will be on the various tests. <br />
<br />
Finally, if you want to calculate the site likelihoods for a set of trees that you provide, use the '''[[GARLI_Configuration_Settings#optimizeinputonly_.28do_not_search.2C_only_optimize_model_and_branch_lengths_on_user_trees.29 | optimizeinputonly]]''' setting.<br />
<br />
===optimizeinputonly (do not search, only optimize model and branch lengths on user trees)===<br />
(new in version 2.0)<br />
<br />
'''optimizeinputonly''' = (0 or 1, '''0''') - Requires a NEXUS formatted input tree file (only NEXUS format will work!), specified through the '''[[GARLI_Configuration_Settings#streefname_.28source_of_starting_tree_and.2For_model.29 | streefname]]''' setting. All trees in that file will have model parameters and branch lengths optimized, giving their maximum likelihood scores. A file with the site-likelihoods of each tree will also be output. See the '''[[GARLI_Configuration_Settings#outputsitelikelihoods_.28write_a_file_with_the_log-likelihood_of_each_site.29 | outputsitelikelihoods]]''' setting for details.<br />
<br />
===collapsebranches (collapse zero length branches before writing final trees to file)===<br />
<br />
'''collapsebranches''' = (0 or 1, '''1''') - Before version 1.0, all trees that were returned were fully resolved. This was true even if the maximum-likelihood estimates of some internal branch lengths were effectively zero (or GARLI's minimum, which is 1e-8). In such cases, collapsing the branch into a polytomy would be a better representation. Note that GARLI will never return a tree with an actual branch length of zero, but rather with its minimum value of 1.0e-8. The drawback of always returning fully resolved trees is that what is effectively a polytomy can be resolved in three ways, and different independent searches may randomly return one of those resolutions. Thus, if you compare the trees by topology only, they will look different. If you pay attention to the branch lengths and likelihood scores of the trees it will be apparent that they are effectively the same.<br />
<br />
I think that collapsing of branches is particularly important when bootstrapping, since no support should be given to a branch that doesn't really exist, i.e., that is a random resolution of a polytomy. Collapsing is also good when calculating tree to tree distances such as the symmetric tree distance, for example when calculating phylogenetic error to a known target tree. Zero-length branches would add to the distances (~error) although they really should not.<br />
<br />
==Model specification settings==<br />
<br />
With version 1.0 and later there are now many more options dealing with model specification because of <br />
the inclusion of amino acid and codon-based models. The description of the settings will be <br />
broken up by data type. Note that in terms of the model settings in GARLI, “empirical” means <br />
to fix parameter values at those observed in the dataset being analyzed, and “fixed” means to fix <br />
them at user specified values. See the '''[[GARLI_Configuration_Settings#streefname_.28source_of_starting_tree_and.2For_model.29|streefname]]''' setting for details on how to provide <br />
parameter values to be fixed during inference.<br />
<br />
'''PARTITIONED MODELS: To use partitioned models you'll still need to use the same basic model settings detailed below for each individual partition subset, and then some other details to set up the partition model itself. Be sure that you are familiar with the rest of this section, then see [[Using_partitioned_models | Using partitioned models]].'''<br />
<br />
===datatype (sequence type and inference model)=== <br />
'''datatype''' = ('''nucleotide''', aminoacid, codon-aminoacid, codon, standard, standardordered, standardvariable, standardvariableordered) – The type of data and model that is <br />
to be used during tree inference. Nucleotide and amino acid data are self explanatory.<br />
<br />
The codon-aminoacid datatype means that the data will be <br />
supplied as a nucleotide alignment, but will be internally translated and analyzed using an <br />
amino acid model. The codon and codon-aminoacid datatypes require nucleotide sequence <br />
that is aligned in the correct reading frame. In other words, all gaps in the alignment should <br />
be a multiple of 3 in length, and the alignment should start at the first position of a codon. If <br />
the alignment has extra columns at the start, middle or end, they should be removed or <br />
excluded with a Nexus exset (see '''[[Garli_FAQ#Can_I_specify_alignment_columns_of_my_data_matrix_to_be_excluded.3F | this FAQ item]]''' for an example of exset usage). The correct <br />
'''[[GARLI_Configuration_Settings#geneticcode_.28code_to_use_in_codon_translation.29 | geneticcode]]''' must also be set.<br />
<br />
(New in Version 2.0)<br />
<br />
The various "standard" datatypes are new in GARLI 2.0. These represent morphology-like discrete characters, with any number of states. These are also termed the "Mk" and "Mkv" models by Lewis (2001). See this page for more details on using standard data: '''[[Mkv morphology model]]'''.<br />
<br />
===Settings for datatype = nucleotide=== <br />
====ratematrix (relative rate parameters assumed by substitution model)====<br />
'''ratematrix''' = (1rate, 2rate, '''6rate''', fixed, custom string) – The number of relative substitution rate <br />
parameters (note that the number of free parameters is this value minus one). Equivalent to <br />
the “nst” setting in PAUP* and MrBayes. 1rate assumes that substitutions between all pairs <br />
of nucleotides occur at the same rate (JC model), 2rate allows different rates for transitions and <br />
transversions (K2P or HKY models), and 6rate allows a different rate between each nucleotide pair (GTR). These rates are <br />
estimated unless the fixed option is chosen. Since version 0.96, parameters for any <br />
submodel of the GTR model may be estimated. The format for specifying this is very <br />
similar to that used in the “rclass” setting of PAUP*. Within parentheses, six letters are <br />
specified, with spaces between them. The six letters represent the rates of substitution <br />
between the six pairs of nucleotides, with the order being A-C, A-G, A-T, C-G, C-T and G-T. <br />
Letters within the parentheses that are the same mean that a single parameter is shared by <br />
multiple nucleotide pairs. For example, <br />
ratematrix = (a b a a b a) <br />
would specify the HKY 2-rate model (equivalent to ratematrix = 2rate). This entry, <br />
ratematrix = (a b c c b a)<br />
would specify 3 estimated rates of substitution, with one rate shared by A-C and G-T <br />
substitutions, another rate shared by A-G and C-T substitutions, and the final rate shared by <br />
A-T and C-G substitutions.<br />
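The way a custom '''ratematrix''' string maps letters onto nucleotide pairs can be sketched in Python. The helper names below are hypothetical and the parsing is only an illustration of the format described above, not GARLI's own parser.<br />

```python
from typing import Dict, List

# Order of the six nucleotide pairs in the custom string, as given above.
PAIRS = ("A-C", "A-G", "A-T", "C-G", "C-T", "G-T")


def describe_ratematrix(spec: str) -> Dict[str, List[str]]:
    """Group nucleotide pairs by the rate letter they share."""
    letters = spec.strip().strip("()").split()
    if len(letters) != len(PAIRS):
        raise ValueError("expected six letters, one per nucleotide pair")
    groups: Dict[str, List[str]] = {}
    for letter, pair in zip(letters, PAIRS):
        groups.setdefault(letter, []).append(pair)
    return groups


def free_rate_parameters(spec: str) -> int:
    """Relative rates are scaled to one reference, so one rate is not free."""
    return len(describe_ratematrix(spec)) - 1


print(describe_ratematrix("(a b c c b a)"))
print(free_rate_parameters("(a b a a b a)"))  # 1, as for the 2rate model
```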
<br />
====statefrequencies (equilibrium base frequencies assumed by substitution model)====<br />
'''statefrequencies''' = (equal, empirical, '''estimate''', fixed) – Specifies how the equilibrium state <br />
frequencies (A, C, G and T) are treated. The empirical setting fixes the frequencies at their <br />
observed proportions, and the other options should be self-explanatory. <br />
===For datatype = nucleotide or aminoacid===<br />
====invariantsites (treatment of proportion of invariable sites parameter)====<br />
'''invariantsites''' = (none, '''estimate''', fixed) – Specifies whether a parameter representing the <br />
proportion of sites that are unable to change (i.e. have a substitution rate of zero) will be <br />
included. This is typically referred to as “invariant sites”, but would better be termed <br />
“invariable sites”. <br />
====ratehetmodel (type of rate heterogeneity to assume for variable sites)====<br />
'''ratehetmodel''' = (none, '''gamma''', gammafixed) – The model of rate heterogeneity assumed. <br />
“gammafixed” requires that the alpha shape parameter is provided, and a setting of “gamma” <br />
estimates it. <br />
====numratecats (number of overall substitution rate categories)==== <br />
'''numratecats''' = (1 to 20, '''4''') – The number of categories of variable rates (not including the <br />
invariant site class if it is being used). Must be set to 1 if '''ratehetmodel''' is set to none. Note <br />
that runtimes and memory usage scale linearly with this setting. <br />
===For datatype = aminoacid or codon-aminoacid===<br />
Amino acid analyses are typically done using fixed rate matrices that have been estimated on <br />
large datasets and published. Typically the only model parameters that are estimated during tree <br />
inference relate to the rate heterogeneity distribution. Each of the named matrices also has <br />
corresponding fixed amino acid frequencies, and a given rate matrix can either be used with those <br />
frequencies or with the amino acid frequencies observed in your dataset. This second option is <br />
often denoted as “+F” in a model description, although in terms of the GARLI configuration <br />
settings this is referred to as “empirical” frequencies. In GARLI the Dayhoff model would be <br />
specified by setting both the ratematrix and statefrequencies options to “dayhoff”. The <br />
Dayhoff+F model would be specified by setting the ratematrix to “dayhoff”, and <br />
statefrequencies to “empirical”.<br />
<br />
The following named amino acid models are implemented: <br />
{| border="1"<br />
|-<br />
!'''ratematrix'''/'''statefrequencies''' setting !! reference <br />
|-<br />
|align="center" | dayhoff || Dayhoff, Schwartz and Orcutt. 1978.<br />
|-<br />
|align="center" | jones || Jones, Taylor and Thornton (JTT), 1992. <br />
|-<br />
|align="center" | WAG || Whelan and Goldman, 2001.<br />
|-<br />
|align="center" | mtREV || Adachi and Hasegawa, 1996.<br />
|-<br />
|align="center" | mtmam || Yang, Nielsen and Hasegawa, 1998.<br />
|}<br />
Note that most programs allow either the use of a named rate matrix and its corresponding state <br />
frequencies, or a named rate matrix and empirical frequencies. GARLI technically allows the <br />
mixing of different named matrices and equilibrium frequencies (for example, wag matrix with <br />
jones equilibrium frequencies), but this is not recommended. <br />
====ratematrix (amino acid substitution rates)====<br />
'''ratematrix''' = (poisson, jones, dayhoff, wag, mtmam, mtrev) – The fixed amino acid rate matrix to <br />
use. You should use the matrix that gives the best likelihood, and could use a program like <br />
PROTTEST (very much like MODELTEST, but for amino acid models) to determine which <br />
fits best for your data. Poisson assumes a single rate of substitution between all amino acid <br />
pairs, and is a very poor model.<br />
====statefrequencies (equilibrium base frequencies assumed by substitution model)====<br />
'''statefrequencies''' = (equal, empirical, estimate, fixed, jones, dayhoff, wag, mtmam, mtrev) – <br />
Specifies how the equilibrium state frequencies of the 20 amino acids are treated. The <br />
“empirical” option fixes the frequencies at their observed proportions (when describing a <br />
model this is often termed “+F”). <br />
===For datatype = codon===<br />
The codon models are built with three components: (1) parameters describing the process of <br />
individual nucleotide substitutions, (2) equilibrium codon frequencies, and (3) parameters <br />
describing the relative rate of nonsynonymous to synonymous substitutions. The nucleotide <br />
substitution parameters within the codon models are exactly the same as those possible with <br />
standard nucleotide models in GARLI, and are specified with the ratematrix configuration <br />
entry. Thus, they can be of the 2rate variety (inferring different rates for transitions and <br />
transversions, K2P or HKY-like), the 6rate variety (inferring different rates for all nucleotide <br />
pairs, GTR-like) or any other sub-model of GTR. The options for codon frequencies are <br />
specified with the statefrequencies configuration entry. The options are to use equal <br />
frequencies (not a good option), the frequencies observed in your dataset (termed “empirical” in <br />
GARLI), or the codon frequencies implied by the “F1x4” or “F3x4” methods (using PAML's <br />
terminology). These last two options calculate the codon frequencies as the product of the <br />
frequencies of the three nucleotides that make up each codon. In the “F1x4” case the nucleotide <br />
frequencies are those observed in the dataset across all codon positions, while the “F3x4” option <br />
uses the nucleotide frequencies observed in the data at each codon position separately. The final <br />
component of the codon models is the nonsynonymous to synonymous relative rate parameters <br />
(aka dN/dS or omega parameters). The default is to infer a single dN/dS value. Alternatively, a <br />
model can be specified that infers a given number of dN/dS categories, with the dN/dS values <br />
and proportions falling in each category estimated (ratehetmodel = nonsynonymous). This is <br />
the “discrete” or “M3” model in PAML's terminology. <br />
====ratematrix (relative ''nucleotide'' rate parameters assumed by codon model)====<br />
'''ratematrix''' = (1rate, '''2rate''', 6rate, fixed, custom string) – This determines the relative rates of <br />
nucleotide substitution assumed by the codon model. The options are exactly the same as <br />
those allowed under a normal nucleotide model. A codon model with '''ratematrix''' = 2rate <br />
specifies the standard Goldman and Yang (1994) model, with different substitution rates for <br />
transitions and transversions. <br />
====statefrequencies (equilibrium codon frequencies)====<br />
'''statefrequencies''' = (equal, empirical, f1x4, '''f3x4''') - The options are to use equal codon frequencies <br />
(not a good option), the frequencies observed in your dataset (termed “empirical” in GARLI), <br />
or the codon frequencies implied by the “F1x4” or “F3x4” methods (using PAML's <br />
terminology). These last two options calculate the codon frequencies as the product of the <br />
frequencies of the three nucleotides that make up each codon. In the “F1x4” case the <br />
nucleotide frequencies are those observed in the dataset across all codon positions, while the <br />
“F3x4” option uses the nucleotide frequencies observed in the data at each codon position <br />
separately. <br />
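The product rule behind “F1x4” and “F3x4” can be sketched in Python as follows. This is an illustration under stated assumptions: the function name is hypothetical, the standard genetic code's stop codons are excluded, and the remaining products are renormalized to sum to one. The text above does not spell out how stop codons are handled, so that step is an assumption.<br />

```python
from itertools import product
from typing import Dict, List

NUCLEOTIDES = "ACGT"
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code only


def codon_frequencies(position_freqs: List[Dict[str, float]]) -> Dict[str, float]:
    """Codon frequencies as products of per-position nucleotide frequencies.

    position_freqs holds three dicts (codon positions 1, 2, 3). For F3x4
    pass the frequencies observed at each position; for F1x4 pass the same
    overall frequencies three times.
    Assumption: stop codons are dropped and the rest renormalized.
    """
    raw = {}
    for codon in map("".join, product(NUCLEOTIDES, repeat=3)):
        if codon in STOP_CODONS:
            continue
        raw[codon] = (position_freqs[0][codon[0]]
                      * position_freqs[1][codon[1]]
                      * position_freqs[2][codon[2]])
    total = sum(raw.values())
    return {codon: value / total for codon, value in raw.items()}


# F1x4 with made-up overall frequencies, reused at all three positions.
overall = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}
freqs = codon_frequencies([overall, overall, overall])
print(len(freqs))
```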
====ratehetmodel (variation in dN/dS across sites)==== <br />
'''ratehetmodel''' = ('''none''', nonsynonymous) – For codon models, the default is to infer a single dN/dS <br />
parameter. Alternatively, a model can be specified that infers a given number of dN/dS <br />
categories, with the dN/dS values and proportions falling in each category estimated <br />
('''ratehetmodel''' = nonsynonymous). This is the “discrete” or “M3” model of Yang et al. <br />
(2000). <br />
====numratecats (number of discrete dN/dS categories)====<br />
'''numratecats''' = (1 to 20, '''1''') – When ratehetmodel = nonsynonymous, this is the number of dN/dS parameter <br />
categories.<br />
====invariantsites====<br />
'''invariantsites''' = (none) - '''NOTE:''' Due to an error on my part, the '''invariantsites''' entry must appear with codon models for them to run (despite the fact that it doesn't apply). i.e., be sure that this appears:<br />
invariantsites = none.<br />
<br />
===For datatype = codon or codon-aminoacid=== <br />
====geneticcode (code to use in codon translation)====<br />
'''geneticcode''' = ('''standard''', vertmito, invertmito) – The genetic code to be used in translating codons <br />
into amino acids.<br />
<br />
==Population settings==<br />
===nindivs (number of individuals in population)===<br />
'''nindivs''' = (2 to 100, '''4''')- The number of individuals in the population. This may be increased, but <br />
doing so is generally not beneficial. Note that typical genetic algorithms tend to have much, <br />
much larger population sizes than GARLI's defaults. <br />
===holdover (unmutated copies of best individual)===<br />
'''holdover''' = (1 to nindivs-1, '''1''')- The number of times the best individual is copied to the next <br />
generation with no chance of mutation. It is best not to mess with this. <br />
===selectionintensity (strength of selection)===<br />
'''selectionintensity''' = (0.01 to 5.0, '''0.5''')- Controls the strength of selection, with larger numbers <br />
denoting stronger selection. The relative probability of reproduction of two individuals <br />
depends on the difference in their log likelihoods (&Delta;lnL) and is formulated very similarly to <br />
the procedure of calculating Akaike weights. The relative probability of reproduction of the <br />
less fit individual is equal to:<br />
<br />
<big><big><br />
''e''<sup> (-selectionIntensity &times; &Delta; lnL)</sup><br />
</big></big><br />
<br />
In general, this setting does not seem to have much of an effect on the progress of a run. In <br />
theory higher values should cause scores to increase more quickly, but make the search more <br />
likely to be entrapped in a local optimum. Low values will increase runtimes, but may be <br />
more likely to reach the true optimum. The following table gives the relative probabilities of <br />
reproduction for different values of the selection intensity when the difference in log <br />
likelihood is 1.0<br />
{| border="1"<br />
|-<br />
!'''selectionintensity''' value !! Ratio of probabilities of reproduction <br />
|-<br />
|align="center" |0.05 || align="center" |0.95:1.0 <br />
|-<br />
|align="center" |0.1 || align="center" |0.90:1.0 <br />
|-<br />
|align="center" |0.25 || align="center" |0.78:1.0 <br />
|-<br />
|align="center" |0.5 || align="center" |0.61:1.0 <br />
|-<br />
|align="center" |0.75 || align="center" |0.47:1.0 <br />
|-<br />
|align="center" |1.0 || align="center" |0.37:1.0 <br />
|-<br />
|align="center" |2.0 || align="center" |0.14:1.0<br />
|}<br />
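The ratios in the table follow directly from the formula above, as this short Python sketch shows. The function name is hypothetical; only the exponential expression comes from the text.<br />

```python
import math


def relative_reproduction_probability(selectionintensity: float,
                                      delta_lnl: float) -> float:
    """Reproduction probability of the less fit individual relative to
    the fitter one: e^(-selectionintensity * delta lnL)."""
    return math.exp(-selectionintensity * delta_lnl)


# Recompute the table rows for a log-likelihood difference of 1.0.
for s in (0.05, 0.1, 0.25, 0.5, 0.75, 1.0, 2.0):
    ratio = relative_reproduction_probability(s, 1.0)
    print(f"{s:<5} {ratio:.2f}:1.0")
```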
===holdoverpenalty (fitness handicap for best individual)===<br />
'''holdoverpenalty''' = (0 to 100, '''0''') – This can be used to bias the probability of reproduction of the <br />
best individual downward. Because the best individual is automatically copied into the next <br />
generation, it has a bit of an unfair advantage and can cause all population variation to be lost <br />
due to genetic drift, especially with small population sizes. The value specified here is <br />
subtracted from the best individual’s lnL score before calculating the probabilities of<br />
reproduction. It seems plausible that this might help maintain variation, but I have not seen it <br />
cause a measurable effect.<br />
<br />
===stopgen (maximum number of generations to run)===<br />
'''stopgen''' – The maximum number of generations to run. Note that this supersedes the automated <br />
stopping criterion (see enforcetermconditions above), and should therefore be set to a very <br />
large value if automatic termination is desired. <br />
===stoptime (maximum time to run)===<br />
'''stoptime''' – The maximum number of seconds for the run to continue. Note that this supersedes <br />
the automated stopping criterion (see enforcetermconditions above), and should therefore <br />
be set to a very large value if automatic termination is desired.<br />
<br />
==Branch-length optimization settings== <br />
After a topological rearrangement, branch lengths in the vicinity of the rearrangement are <br />
optimized by the Newton-Raphson method. Optimization passes are performed on a particular <br />
branch until the expected improvement in likelihood for the next pass is less than a threshold <br />
value, termed the optimization precision. Note that this name is somewhat misleading, as the <br />
precision of the optimization algorithm is inversely related to this value (i.e., smaller values of <br />
the optimization precision lead to more precise optimization). If the improvement in likelihood <br />
due to optimization for a particular branch is greater than the optimization precision, <br />
optimization is also attempted on adjacent branches, spreading out across the tree. When no new <br />
topology with a better likelihood score is discovered for a while, the value is automatically <br />
reduced. The value can have a large effect on speed, with smaller values significantly slowing <br />
down the algorithm. The value of the optimization precision and how it changes over the course <br />
of a run are determined by the following three parameters. <br />
===startoptprec===<br />
startoptprec (0.005 to 5.0, '''0.5''') – The beginning optimization precision. <br />
===minoptprec===<br />
minoptprec (0.001 to startoptprec, '''0.01''') – The minimum allowed value of the optimization precision.<br />
<br />
===numberofprecreductions===<br />
numberofprecreductions (0 to 100, '''10''') – Specify the number of steps that it will take for the <br />
optimization precision to decrease (linearly) from startoptprec to minoptprec. <br />
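<br />
A minimal sketch of the linear schedule described above, using the default values of the three settings. The function name and the handling of numberofprecreductions = 0 are assumptions made for illustration, not GARLI's actual code.<br />

```python
def precision_schedule(startoptprec=0.5, minoptprec=0.01, numberofprecreductions=10):
    """Return the optimization precision values, from the starting value down to the minimum."""
    if numberofprecreductions <= 0:
        # Assumption: with no reductions the precision stays at its starting value.
        return [startoptprec]
    # Equal-sized steps give a linear decrease from startoptprec to minoptprec.
    step = (startoptprec - minoptprec) / numberofprecreductions
    return [startoptprec - i * step for i in range(numberofprecreductions + 1)]


schedule = precision_schedule()
print([round(p, 3) for p in schedule])
```

With the defaults, each reduction lowers the precision value by (0.5 - 0.01) / 10 = 0.049.<br />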
===treerejectionthreshold===<br />
treerejectionthreshold (0 to 500, '''50''') – This setting controls which trees have more extensive <br />
branch-length optimization applied to them. All trees created by a branch swap receive <br />
optimization on a few branches that directly took part in the rearrangement. If the difference <br />
in score between the partially optimized tree and the best known tree is greater than treerejectionthreshold, no further optimization is applied to the branches of that tree. <br />
Reducing this value can significantly reduce runtimes, often with little or no effect on results. <br />
However, it is possible that a better tree could be missed if this is set too low. In cases in <br />
which obtaining the very best tree per search is not critical (e.g., bootstrapping), setting this <br />
lower (~20) is probably safe.<br />
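<br />
The rejection rule above can be sketched as a single comparison of lnL scores. The function and argument names are hypothetical, and the treatment of a difference exactly equal to the threshold is an assumption.<br />

```python
def gets_further_optimization(partial_lnl, best_lnl, treerejectionthreshold=50.0):
    """Return True if a partially optimized tree is close enough to the best tree
    to receive more extensive branch-length optimization."""
    # Trees that are worse than the best known tree by more than the threshold are dropped.
    return (best_lnl - partial_lnl) <= treerejectionthreshold


print(gets_further_optimization(-10030.0, -10000.0))        # 30 lnL units worse, default threshold
print(gets_further_optimization(-10030.0, -10000.0, 20.0))  # same tree, lower threshold
```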
<br />
==Settings controlling the proportions of the mutation types== <br />
Each mutation type is assigned a prior ''weight''. These values determine the expected <br />
proportions of the various mutation types that are performed. The primary mutation categories <br />
are ''topology'' (t), ''model'' (m) and ''branch length'' (b). Each is assigned a prior weight ( P<sub>i</sub> ) in the <br />
config file. Each time that a new best likelihood score is attained, the amount of the increase in <br />
score is credited to the mutation type responsible, with the sum of the increases ( S<sub>i</sub> ) maintained <br />
over the last intervallength x intervalstostore generations. The number of times that each <br />
mutation is performed ( N<sub>i</sub> ) is also tallied. The total weight of a mutation type is W<sub>i</sub> = P<sub>i</sub> + ( S<sub>i</sub> / N<sub>i</sub> ). <br />
The proportion of mutations of type i out of all mutations is then <br />
<br />
<big><big><br />
Pr(i) = W<sub>i</sub> / <br />
(W<sub>t</sub> + W<sub>m</sub> + W<sub>b</sub>) <br />
</big></big><br />
<br />
The proportion of each mutation is thus related to its prior weight and the average increase in <br />
score that it has caused over recent generations. The prior weights can be used to control the <br />
expected (and starting) proportions of the mutation types, as well as how sensitive the <br />
proportions are to the course of events in a run. It is generally a good idea to make the topology <br />
prior much larger than the others so that when no mutations are improving the score many <br />
topology mutations are still attempted. If you set '''outputmostlyuselessfiles''' to 1, you can look at the “problog” file to determine what the <br />
proportions of the mutations actually were over the course of a run. <br />
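<br />
As a worked illustration of the formula above, the following sketch computes the proportions from the prior weights, the summed score increases and the mutation counts. The function name, the dictionary keys and the example numbers are made up for illustration, and treating a mutation type with no attempts as contributing only its prior weight is an assumption.<br />

```python
def mutation_proportions(prior, score_gain, count):
    """Compute Pr(i) = W_i / sum(W), where W_i = P_i + S_i / N_i."""
    weights = {}
    for kind, p in prior.items():
        n = count.get(kind, 0)
        # Assumption: a type that has not been attempted yet contributes only its prior weight.
        bonus = score_gain.get(kind, 0.0) / n if n > 0 else 0.0
        weights[kind] = p + bonus
    total = sum(weights.values())
    return {kind: w / total for kind, w in weights.items()}


# Default prior weights: topoweight, modweight and brlenweight.
priors = {"topology": 1.0, "model": 0.05, "brlen": 0.2}

# At the start of a run nothing has been credited yet, so only the priors matter.
start = mutation_proportions(priors, {}, {})

# Hypothetical tallies after some generations.
later = mutation_proportions(
    priors,
    {"topology": 30.0, "model": 2.0, "brlen": 1.0},
    {"topology": 400, "model": 20, "brlen": 80},
)
print(start)
print(later)
```

With the default priors the starting proportions are 1.0/1.25, 0.05/1.25 and 0.2/1.25, i.e. 80% topology, 4% model and 16% branch-length mutations.<br />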
===topoweight (weight on topology mutations)===<br />
topoweight (0 to infinity, '''1.0''') The prior weight assigned to the class of topology mutations <br />
(NNI, SPR and limSPR). Note that setting this to 0.0 turns off topology mutations, meaning that the tree topology is fixed for the run. This used to be a way to have the program estimate only model parameters and branch-lengths, but the '''[[GARLI_Configuration_Settings#optimizeinputonly_.28do_not_search.2C_only_optimize_model_and_branch_lengths_on_user_trees.29 | optimizeinputonly]]''' setting is now a better way to go.<br />
<br />
===modweight (weight on model parameter mutations)===<br />
modweight (0 to infinity, '''0.05''') The prior weight assigned to the class of model mutations. Note <br />
that setting this at 0.0 fixes the model during the run. <br />
===brlenweight (weight on branch-length parameter mutations)===<br />
brlenweight (0 to infinity, '''0.2''') The prior weight assigned to branch-length mutations.<br />
<br />
The same procedure used above to determine the proportion of Topology:Model:Branch-Length <br />
mutations is also used to determine the relative proportions of the three types of topological <br />
mutations (NNI:SPR:limSPR), controlled by the following three weights. Note that the <br />
proportion of mutations applied to each of the model parameters is not user controlled. <br />
===randnniweight (weight on NNI topology changes)===<br />
randnniweight (0 to infinity, '''0.1''') - The prior weight assigned to NNI mutations. <br />
===randsprweight (weight on SPR topology changes)===<br />
randsprweight (0 to infinity, '''0.3''') - The prior weight assigned to random SPR mutations. For <br />
very large datasets it is often best to set this to 0.0, as random SPR mutations essentially <br />
never result in score increases. <br />
===limsprweight (weight on localized SPR topology changes)===<br />
limsprweight (0 to infinity, '''0.6''') - The prior weight assigned to SPR mutations with the <br />
reconnection branch limited to being a maximum of limsprrange branches away from where <br />
the branch was detached. <br />
===intervallength===<br />
intervallength (10 to 1000, '''100''') – The number of generations in each interval during which the <br />
number and benefit of each mutation type are stored. <br />
===intervalstostore===<br />
intervalstostore = (1 to 10, '''5''') – The number of intervals to be stored. Thus, records of <br />
mutations are kept for the last (intervallength x intervalstostore) generations. Every <br />
intervallength generations the probabilities of the mutation types are updated by the scheme <br />
described above.<br />
<br />
==Settings controlling mutation details==<br />
===limsprrange (max range for localized SPR topology changes)===<br />
limsprrange (0 to infinity, '''6''') – The maximum number of branches away from its original <br />
location that a branch may be reattached during a limited SPR move. Setting this too high (> <br />
10) can seriously degrade performance, but if you do so in conjunction with a large increase in genthreshfortopoterm you might end up with better trees.<br />
===meanbrlenmuts (mean # of branch lengths to change per mutation)===<br />
meanbrlenmuts (1 to # taxa, '''5''') - The mean of the binomial distribution from which the number <br />
of branch lengths mutated is drawn during a branch length mutation. <br />
===gammashapebrlen (magnitude of branch-length mutations)===<br />
gammashapebrlen (50 to 2000, '''1000''') - The shape parameter of the gamma distribution (with a <br />
mean of 1.0) from which the branch-length multipliers are drawn for branch-length <br />
mutations. Larger numbers cause smaller changes in branch lengths. (Note that this has <br />
nothing to do with gamma rate heterogeneity.) <br />
===gammashapemodel (magnitude of model parameter mutations)===<br />
gammashapemodel (50 to 2000, '''1000''') - The shape parameter of the gamma distribution (with a <br />
mean of 1.0) from which the model mutation multipliers are drawn for model parameter <br />
mutations. Larger numbers cause smaller changes in model parameters. (Note that this has <br />
nothing to do with gamma rate heterogeneity.) <br />
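<br />
To get a feel for how the shape parameter controls the size of the changes, the sketch below draws multipliers from a gamma distribution with a mean of 1.0. It uses Python's standard random module, not GARLI's own random number generator, and the function names are hypothetical. A gamma distribution with shape a and mean 1.0 has a standard deviation of 1/sqrt(a), so larger shapes give multipliers closer to 1.0.<br />

```python
import math
import random


def draw_multiplier(shape, rng):
    """Draw one multiplier from a gamma distribution with the given shape and a mean of 1.0."""
    # random.gammavariate takes (shape, scale); a scale of 1/shape gives a mean of 1.0.
    return rng.gammavariate(shape, 1.0 / shape)


def multiplier_sd(shape):
    """Standard deviation of a mean-1.0 gamma distribution with this shape."""
    return 1.0 / math.sqrt(shape)


rng = random.Random(1)
for shape in (50, 1000, 2000):
    sample = [draw_multiplier(shape, rng) for _ in range(5)]
    print(shape, round(multiplier_sd(shape), 4), [round(x, 3) for x in sample])
```

With the default shape of 1000 the standard deviation is about 0.03, so a typical mutation changes a value by a few percent.<br />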
===uniqueswapbias (relative weight assigned to already attempted branch swaps)===<br />
uniqueswapbias (0.01 to 1.0, '''0.1''') – With version 0.95 and later, GARLI keeps track of which branch <br />
swaps it has attempted on the current best tree. Because swaps are applied randomly, it is <br />
possible that some swaps are tried twice before others are tried at all. This option allows the <br />
program to bias the swaps applied toward those that have not yet been attempted. Each swap <br />
is assigned a relative weight depending on the number of times that it has been attempted on <br />
the current best tree. This weight is equal to (uniqueswapbias) raised to the (# times swap <br />
attempted) power. In other words, a value of 0.5 means that swaps that have already been <br />
tried once will be half as likely as those not yet attempted, swaps attempted twice will be ¼ <br />
as likely, etc. A value of 1.0 means no biasing. If this value is not equal to 1.0 and the <br />
outputmostlyuselessfiles option is on, a file called <ofprefix>.swap.log is output. This file <br />
shows the total number of rearrangements tried and the number of unique ones over the course <br />
of a run. Note that this bias is only applied to NNI and limSPR rearrangements. Use of this <br />
option may allow the use of somewhat larger values of limsprrange. <br />
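<br />
The weighting rule above can be sketched as follows. The function names are hypothetical, and turning the relative weights into selection probabilities by simple normalization is an assumption about how the weights are used.<br />

```python
def unique_swap_weight(uniqueswapbias, times_attempted):
    """Relative weight of a swap: uniqueswapbias raised to the number of previous attempts."""
    return uniqueswapbias ** times_attempted


def swap_probabilities(uniqueswapbias, attempts_per_swap):
    """Normalize the relative weights of a set of candidate swaps into probabilities."""
    weights = [unique_swap_weight(uniqueswapbias, n) for n in attempts_per_swap]
    total = sum(weights)
    return [w / total for w in weights]


# Three candidate swaps that have been tried 0, 1 and 2 times on the current best tree.
print([unique_swap_weight(0.5, n) for n in (0, 1, 2)])
print(swap_probabilities(0.5, [0, 1, 2]))
```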
===distanceswapbias (relative weight assigned to branch swaps based on locality)===<br />
distanceswapbias (0.1 to 10, '''1.0''') – This option is similar to uniqueswapbias, except that it <br />
biases toward certain swaps based on the topological distance between the initial and <br />
rearranged trees. The distance is measured as in the limsprrange, and is half the <br />
Robinson-Foulds distance between the trees. As with uniqueswapbias, distanceswapbias <br />
assigns a relative weight to each potential swap. In this case the weight is <br />
(distanceswapbias) raised to the (reconnection distance - 1) power. Thus, given a value of <br />
0.5, the weight of an NNI is 1.0, the weight of an SPR with distance 2 is 0.5, with distance 3 <br />
is 0.25, etc. Note that values less than 1.0 bias toward more localized swaps, while values <br />
greater than 1.0 bias toward more extreme swaps. Also note that this bias is only applied to <br />
limSPR rearrangements. Be careful in setting this, as extreme values can have a very large <br />
effect.</div>Zwicklhttps://molevol.mbl.edu/index.php?title=GARLI_Configuration_Settings&diff=4382GARLI Configuration Settings2015-07-21T16:26:35Z<p>Zwickl: </p>
<hr />
<div>==Descriptions of GARLI configuration settings==<br />
The format for these configuration settings descriptions is generally:<br />
'''entryname''' (possible values, '''default value in bold''') – description<br />
<br />
==General settings==<br />
===datafname (file containing sequence dataset)=== <br />
'''datafname''' = (filename) –<br />
Name of the file containing the aligned sequence data. Formats accepted are <br />
PHYLIP, NEXUS and FASTA. Robust reading of Nexus formatted <br />
datasets is done using the Nexus Class Library. This accommodates things such as interleaved <br />
alignments and exclusion sets (exset’s) in NEXUS assumptions blocks (see the FAQ for an <br />
example of [[FAQ#Can_I_specify_alignment_columns_of_my_data_matrix_to_be_excluded.3F | exset usage]]). Use of NEXUS files is recommended, and is required for partitioned models.<br />
<br />
===constraintfile (file containing constraint definition)===<br />
'''constraintfile''' = (filename, '''none''') –<br />
Name of the file containing any topology constraint specifications, or “none” if <br />
there are no constraints. The easiest way to explain the format of the constraint file is by <br />
example. Consider a dataset of 8 taxa, in which your constraint consists of grouping taxa 1, 3 <br />
and 5. You may specify either positive constraints (inferred tree MUST contain constrained <br />
group) or negative constraints (also called converse constraints, inferred tree CANNOT <br />
contain constrained group). These are specified with either a ‘+’ or a ‘-’ at the beginning of <br />
the constraint specification, for positive and negative constraints, respectively. <br />
*For a positive constraint on a grouping of taxa 1, 3 and 5: <br />
+((1,3,5),2,4,6,7,8); <br />
*For a negative constraint on a grouping of taxa 1, 3 and 5: <br />
-((1,3,5),2,4,6,7,8); <br />
*Note that there are many other equivalent parenthetical representations of these constraints. <br />
*Multiple groups may be positively constrained, but currently only a single negatively constrained group is allowed. <br />
*Multiple constrained groupings may be specified in a single string: <br />
+((1,3,5),2,4,(6,7),8); <br />
or in two separate strings on successive lines: <br />
+((1,3,5),2,4,6,7,8); <br />
+(1,3,5,2,4,(6,7),8); <br />
*Constraint strings may also be specified in terms of taxon names (matching those used in the alignment) instead of numbers. <br />
*Positive and negative constraints cannot be mixed. <br />
*GARLI also accepts another constraint format that may be easier to use in some cases. This involves specifying a single branch to be constrained with a string of ‘*’ (asterisk) and ‘.’ (period) characters, with one character per taxon. Each taxon specified with a ‘*’ falls on one side of the constrained branch, and all those specified with a ‘.’ fall on the other. This should be familiar to anyone who has looked at PAUP* bootstrap output. With this format, a positive constraint on a grouping of taxa 1, 3 and 5 would look like this: <br />
+*.*.*... <br />
or equivalently like this: <br />
+.*.*.*** <br />
With this format each line only designates a single branch, so multiple constrained branches must be specified as multiple lines in the file. <br />
*The program also allows “backbone” constraints, which are simply constraints that do not include all of the taxa. For example if one wanted to constrain (1,3,5) but didn’t care where taxon 6 appeared in the tree, this string could be used: <br />
+((1,3,5),2,4,7,8); <br />
:Nothing special needs to be done to identify this as a backbone constraint, simply leave out some taxa.<br />
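<br />
For larger datasets it can be convenient to generate constraint strings with a short script. The sketch below builds both formats described above for a single constrained group specified by taxon number. The function names and the omit argument are hypothetical helpers, not part of GARLI, and the group is assumed to be a proper subset of the taxa.<br />

```python
def newick_constraint(group, ntax, positive=True, omit=()):
    """Build a parenthetical constraint string for one group of taxon numbers.

    Taxa listed in omit are left out entirely, which gives a backbone constraint.
    """
    sign = "+" if positive else "-"
    inside = ",".join(str(t) for t in sorted(group))
    outside = ",".join(
        str(t) for t in range(1, ntax + 1) if t not in group and t not in omit
    )
    return f"{sign}(({inside}),{outside});"


def bipartition_constraint(group, ntax, positive=True):
    """Build the '*' and '.' form, with one character per taxon."""
    sign = "+" if positive else "-"
    return sign + "".join("*" if t in group else "." for t in range(1, ntax + 1))


print(newick_constraint({1, 3, 5}, 8))
print(newick_constraint({1, 3, 5}, 8, positive=False))
print(newick_constraint({1, 3, 5}, 8, omit={6}))
print(bipartition_constraint({1, 3, 5}, 8))
```

Each printed line can be written as one line of the file named by constraintfile.<br />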
<br />
===streefname (source of starting tree and/or model)===<br />
'''streefname''' = (random, <u>'''stepwise'''</u>, <filename>) – Specifies where the starting tree topology and/or <br />
model parameters will come from. The tree topology may be a completely random topology <br />
(constraints will be enforced), a tree provided by the user in a file, <u>or a tree generated by the <br />
program using a fast ML stepwise-addition algorithm (see [[GARLI_Configuration_Settings#attachmentspertaxon_.28control_creation_of_stepwise_addition_starting_tree.29|attachmentspertaxon]] below)</u>. <br />
Starting or fixed model parameter values may also be provided in the specified file, with or <br />
without a tree topology. Some notes on starting trees/models: <br />
*Specified starting trees may have polytomies, which will be arbitrarily resolved before the run begins.<br />
*Starting tree formats: <br />
**Plain newick tree string (with taxon numbers or names, with or without branch lengths) <br />
**NEXUS trees block. The trees block can appear in the same file as a NEXUS data or characters block that contains the alignment, although the same filename should then be specified on both the datafname and streefname lines.<br />
*If multiple trees appear in the specified file and multiple search replicates are specified (see '''[[GARLI_Configuration_Settings#searchreps_.28number_of_independent_search_replicates.29|searchreps]]''' setting), then the first tree is used in the first replicate, the second in the second replicate, etc. <br />
*Providing model parameter values: see this page '''[[Specifying_model_parameter_values | Specifying model parameter values]]'''<br />
*See also the FAQ items on model parameters '''[[FAQ#Model_parameters | here]]'''.<br />
<br />
===attachmentspertaxon (control creation of stepwise addition starting tree)===<br />
'''attachmentspertaxon''' = (1 to infinity, '''50''') – The number of attachment branches evaluated for each taxon to be added to the tree during the creation of an ML stepwise-addition starting tree. Briefly, stepwise addition is an algorithm used to make a tree, and involves adding taxa in a random order to a growing tree. For each taxon to be added, a number of randomly chosen attachment branches are tried and scored, and then the best scoring one is chosen as the location of that taxon. The attachmentspertaxon setting controls how many attachment points are evaluated for each taxon to be added. A value of one is equivalent to a completely random tree (only one randomly chosen location is evaluated). A value of greater than 2 times the number of taxa in the dataset means that all attachment points will be evaluated for each taxon, and will result in very good starting trees (but may take a while on large datasets). Even fairly small values (< 10) can result in starting trees that are much, much better than random, but still fairly different from one another. <br />
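<br />
To see why this setting affects how long it takes to build the starting tree, the sketch below counts how many attachment points would be scored. It relies on the fact that an unrooted binary tree with k taxa has 2k - 3 branches, and it assumes that the tree is grown from an initial three-taxon tree; the function name is hypothetical.<br />

```python
def attachment_evaluations(ntax, attachmentspertaxon=50):
    """Count the attachment points scored while building a stepwise-addition tree."""
    total = 0
    # Assumption: start from a three-taxon tree and add the remaining taxa one at a time.
    for current_taxa in range(3, ntax):
        branches = 2 * current_taxa - 3  # branches available in the growing tree
        total += min(attachmentspertaxon, branches)
    return total


for setting in (1, 10, 50):
    print(setting, attachment_evaluations(100, setting))
```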
===ofprefix (output filename prefix)===<br />
'''ofprefix''' = (text) – Prefix of all output filenames, such as log, treelog, etc. Change this for each run that you do or the program will overwrite previous results. <br />
===randseed (random number seed)===<br />
'''randseed''' = (-1 or positive integers, '''-1''') – The random number seed used by the random number generator. Specify “–1” to have a seed chosen for you. Specifying the same seed number in multiple runs will give exactly identical results, if all other parameters and settings are also identical. <br />
===availablememory (control maximum program memory usage)===<br />
'''availablememory''' – Typically this is the amount of available physical memory on the system, in megabytes. This lets GARLI determine how much system memory it may be able to use to store computations for reuse. The program will be conservative and use at most about 90% of this value, although for typical datasets it will use much, much less. If other programs must be open or used when GARLI is running, you may need to reduce this value. This setting can also have a significant effect on performance (speed), but more is not always better. When a run is started, GARLI will output the '''availablememory''' value necessary for that dataset to achieve each of the “memory levels” (from best to worst, termed “great”, “good”, “low”, and “very low”). More memory is generally better because more calculations are stored and can be reused, but when the amount of memory needed for “great” becomes more than 512 megabytes or so, performance can be slowed because the operating system has difficulty managing that much memory. In general, choose an amount of memory that allows “great” when this is less than 512 MB, and if it is greater reduce the amount of memory into “good” or “low” as necessary. Avoid “very low” whenever possible. You can find the value that is approximately optimal for your dataset by setting the '''randseed''' to some <br />
positive value (so that the searches are identical) and doing runs with various '''availablememory''' values. Typically it will only make at most a 30% difference, so it isn't worth worrying about too much. <br />
===logevery (frequency to log best score to file)===<br />
'''logevery''' = (1 to infinity, '''10''') – The frequency at which the best score is written to the log file. <br />
===saveevery (frequency to save best tree to file or write checkpoints)===<br />
'''saveevery''' = (1 to infinity, '''100''') – If '''writecheckpoints''' or '''outputcurrentbesttopology''' are specified, this is the frequency (in generations) at which checkpoints or the current best tree are written to file. <br />
===refinestart (whether to optimize a bit before starting a search)===<br />
'''refinestart''' = (0 or 1, '''1''') – Specifies whether some initial rough optimization is performed on the starting branch lengths and rate heterogeneity parameters. This is always recommended. <br />
===outputcurrentbesttopology (continuously write the best tree to file during a run)===<br />
'''outputcurrentbesttopology''' = (0 or 1, '''0''') – If true, the current best tree of the current search replicate is written to <ofprefix>.best.current.tre every '''saveevery''' generations. In versions before 0.96 the current best topology was always written to file, but that is no longer the case. Seeing the current best tree has no real use apart from satisfying your curiosity about how a <br />
run is going.<br />
===outputeachbettertopology (write each improved topology to file)===<br />
'''outputeachbettertopology''' (0 or 1, '''0''') – If true, each new topology encountered with a better score than the previous best is written to file. In some cases this can result in really big files (hundreds of MB) though, especially for random starting topologies on large datasets. Note that this file is interesting to get an idea of how the topology changed as the searches progressed, but the collection of trees should NOT be interpreted in any meaningful way. This option is not available while bootstrapping. <br />
===enforcetermconditions (use automatic termination)===<br />
'''enforcetermconditions''' = (0 or 1, '''1''') – Specifies whether the automatic termination conditions will be used. The conditions specified by both of the following two parameters ('''genthreshfortopoterm''' and '''scorethreshforterm''') must be met; see their entries for definitions. If this is false, the run will continue until it reaches the time ('''stoptime''') or generation ('''stopgen''') limit. It is highly recommended that this option be used! <br />
===genthreshfortopoterm (number of generations without topology improvement required for termination) ===<br />
'''genthreshfortopoterm''' = (1 to infinity, '''20,000''') – This specifies the first part of the termination condition. When no new significantly better scoring topology (see significanttopochange below) has been encountered in more than this number of generations, this condition is met. Increasing this parameter may improve the lnL scores obtained (especially on large datasets), but will also increase runtimes. <br />
===scorethreshforterm (max score improvement over recent generations required for termination)===<br />
'''scorethreshforterm''' = (0 to infinity, '''0.05''') – The second part of the termination condition. When the total improvement in score over the last '''intervallength''' x '''intervalstostore''' generations (default is 500 generations, see below) is less than this value, this condition is met. This does not usually need to be changed. <br />
===significanttopochange (required score improvement for topology to be considered better)===<br />
'''significanttopochange''' = (0 to infinity, '''0.01''') – The lnL increase required for a new topology to be <br />
considered significant as far as the termination condition is concerned. It probably doesn’t <br />
need to be played with, but you might try increasing it slightly if your runs reach a stable <br />
score and then take a very long time to terminate due to very minor changes in topology. <br />
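<br />
How the termination settings interact can be sketched as one function. This is an illustration of the rules described in this section and in the stopgen and stoptime entries, not GARLI's actual code. All names other than the setting names are hypothetical, the stopgen and stoptime defaults are arbitrary large placeholders, and the exact treatment of boundary values is an assumption.<br />

```python
def should_stop(gens_since_topo_improvement, recent_score_improvement,
                generation, elapsed_seconds,
                enforcetermconditions=True,
                genthreshfortopoterm=20000, scorethreshforterm=0.05,
                stopgen=5000000, stoptime=5000000):
    """Return True if the run should end under the described stopping rules."""
    # The hard limits apply whether or not automatic termination is enabled.
    if generation >= stopgen or elapsed_seconds >= stoptime:
        return True
    if not enforcetermconditions:
        return False
    # Both parts of the automatic termination condition must be met.
    no_new_topology = gens_since_topo_improvement > genthreshfortopoterm
    score_has_plateaued = recent_score_improvement < scorethreshforterm
    return no_new_topology and score_has_plateaued


# No significantly better topology for 25,000 generations and only 0.01 lnL gained recently.
print(should_stop(25000, 0.01, generation=100000, elapsed_seconds=3600.0))
```

Here gens_since_topo_improvement counts generations since the last topology that improved the score by at least significanttopochange, and recent_score_improvement is the total improvement over the last intervallength x intervalstostore generations.<br />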
===outputphyliptree (write trees to file in Phylip as well as Nexus format)===<br />
'''outputphyliptree''' = (0 or 1, '''0''') – Whether phylip-formatted tree files will be output in addition to <br />
the default nexus files for the best tree across all replicates (<ofprefix>.best.phy), the best <br />
tree for each replicate (<ofprefix>.best.all.phy) or, in the case of bootstrapping, the best tree <br />
for each bootstrap replicate (<ofprefix>.boot.phy). <br />
===outputmostlyuselessfiles (output uninteresting files)===<br />
'''outputmostlyuselessfiles''' = (0 or 1, '''0''') – Whether to output three files of little general interest: the <br />
“fate”, “problog” and “swaplog” files. The fate file shows the parentage, mutation types and <br />
scores of every individual in the population during the entire search. The problog shows how <br />
the proportions of the different mutation types changed over the course of the run. The <br />
swaplog shows the number of unique swaps and the number of total swaps on the current <br />
best tree over the course of the run. <br />
===writecheckpoints (write checkpoint files during run)===<br />
'''writecheckpoints''' (0 or 1, '''0''') – Whether to write three files to disk containing all information <br />
about the current state of the population every saveevery generations, with each successive <br />
checkpoint overwriting the previous one. These files can be used to restart a run at the last <br />
written checkpoint by setting the '''restart''' configuration entry.<br />
===restart (restart run from checkpoint)===<br />
'''restart''' = (0 or 1, '''0''') – Whether to restart at a previously saved checkpoint. To use this option the '''writecheckpoints''' option must have been used during a previous run. The program will look for checkpoint files that are named based on the ofprefix of the previous run. If you intend to restart a run, NOTHING should be changed in the config file except setting '''restart''' to 1. <br />
A run that is restarted from checkpoint will give ''exactly'' the same results it would have if the run had gone to completion. <br />
===outgroup (orient inferred trees consistently)===<br />
'''outgroup''' = (outgroup taxa numbers, separated by spaces) – This option allows for orienting the tree topologies in a consistent way when they are written to file. Note that this has NO effect whatsoever on the actual inference and the specified outgroup is NOT constrained to be present in the inferred trees. If multiple outgroup taxa are specified and they do not form a monophyletic group in the inferred tree, this setting will be ignored. If you specify a single outgroup taxon it will always be present, and the tree will always be consistently oriented. Ranges can be indicated with a hyphen. e.g., to specify an outgroup consisting of taxa 1, 2, 3 and 5 the format is this: <br />
outgroup = 1-3 5<br />
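<br />
To make the range syntax concrete, the following sketch expands an outgroup specification into the list of taxon numbers it denotes. The parser is a hypothetical illustration of the format described above, not the code GARLI uses to read the setting.<br />

```python
def parse_outgroup(spec):
    """Expand an outgroup specification such as '1-3 5' into a list of taxon numbers."""
    taxa = []
    for token in spec.split():
        if "-" in token:
            # A hyphen indicates an inclusive range of taxon numbers.
            first, last = token.split("-")
            taxa.extend(range(int(first), int(last) + 1))
        else:
            taxa.append(int(token))
    return taxa


print(parse_outgroup("1-3 5"))
```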
<br />
===searchreps (number of independent search replicates)===<br />
'''searchreps''' = (1 to infinity, '''2''') – The number of independent search replicates to perform during a program execution. You should always either do multiple search replicates or multiple program executions with any dataset to get a feel for whether you are getting consistent results, which suggests that the program is doing a decent job of searching. Note that if this is > 1 and you are performing a bootstrap analysis, this is the number of search replicates to be done per bootstrap replicate. That can increase the chance of finding the best tree per bootstrap replicate, but will also increase bootstrap runtimes enormously.<br />
===bootstrapreps (number of bootstrap replicates)===<br />
bootstrapreps (0 to infinity, '''0''') - The number of bootstrap reps to perform. If this is greater than <br />
0, normal searching will not be performed. The resulting bootstrap trees (one per rep) will be <br />
output to a file named <ofprefix>.boot.tre. To obtain the bootstrap proportions they will then <br />
need to be read into PAUP* or a similar program to obtain a majority rule consensus. Note <br />
that it is probably safe to reduce the strictness of the termination conditions during <br />
bootstrapping (perhaps halve genthreshfortopoterm), which will greatly speed up the <br />
bootstrapping process with negligible effects on the results.<br />
<br />
Note that GARLI does NOT do the bootstrap consensus or calculate the bootstrap proportions itself. It simply infers the trees that are the input to that consensus. The trees in the .boot.tre file will need to be read into a program such as SumTrees (or PAUP* or CONSENSE) to do a majority-rule consensus and obtain your bootstrap support values. See the following part of the "Advanced topics" page for suggestions on making the consensus: '''[[Advanced_topics#Detailed_Example:_A_bootstrap_analysis|Detailed Example: A bootstrap analysis]]'''.<br />
<br />
===resampleproportion (relative size of re-sampled data matrix)===<br />
resampleproportion (0.1 to 10, '''1.0''' ) – When bootstrapreps > 0, this setting allows for <br />
bootstrap-like resampling, but with the pseudoreplicate datasets having the number of <br />
alignment columns different from the real data. Setting values < 1.0 is somewhat similar to jackknifing, but not identical. <br />
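<br />
A minimal sketch of what such resampling looks like: alignment columns are drawn with replacement, and the number of columns drawn is the original number multiplied by resampleproportion. The function name is hypothetical, it uses Python's random module instead of GARLI's generator, and rounding the scaled column count to the nearest integer is an assumption.<br />

```python
import random


def resample_columns(ncols, resampleproportion=1.0, rng=None):
    """Return the column indices (0-based) making up one pseudoreplicate dataset."""
    rng = rng or random.Random()
    # Assumption: the scaled number of columns is rounded to the nearest integer.
    nsample = max(1, round(ncols * resampleproportion))
    # Columns are drawn with replacement, so some appear several times and others not at all.
    return [rng.randrange(ncols) for _ in range(nsample)]


rng = random.Random(42)
print(len(resample_columns(1000, 1.0, rng)))  # standard bootstrap: as many columns as the data
print(len(resample_columns(1000, 0.5, rng)))  # half as many columns
```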
===inferinternalstateprobs (infer ancestral states)===<br />
inferinternalstateprobs = (0 or 1, '''0''') – Specify 1 to have GARLI infer the marginal posterior <br />
probability of each character at each internal node. This is done at the very end of the run, <br />
just before termination. The results are output to a file named <ofprefix>.internalstates.log.<br />
<br />
===outputsitelikelihoods (write a file with the log-likelihood of each site)===<br />
<br />
'''outputsitelikelihoods''' = (0 or 1, '''0''') - Causes a file named <ofprefix>.sitelikes.log to be created containing the sitewise log-likelihood values for each individual site for the final tree of each search replicate. Note that the format of the file is exactly the same as that output by PAUP, so it can be directly used in CONSEL using the PAUP file-format option. For partitioned models the sites are listed in sets by partition subset, but the site numbers will match those in the original datamatrix. Thus, when using a partitioned model that divides sites by codon position, the ordering would be 1, 4, 7, ... 2, 5, 8, ... 3, 6, 9 .... Note that this makes no difference to CONSEL, which just assumes that the site likelihoods are ordered the same for each tree and ignores the site numbers. <br />
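<br />
The site ordering described above for a partition by codon position can be reproduced with a short sketch. The function name is hypothetical, and it assumes three subsets made up of every third site starting at positions 1, 2 and 3.<br />

```python
def sitelike_order(nsites, npositions=3):
    """Return site numbers in the order they are listed when sites are grouped by codon position."""
    return [
        site
        for position in range(1, npositions + 1)
        for site in range(position, nsites + 1, npositions)
    ]


print(sitelike_order(9))
```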
<br />
Also note that using the site likelihoods from a partitioned model is a violation of CONSEL's assumptions, since it will not know about the partitions and its internal bootstrap resampling will not obey them. It isn't clear what the effects of this will be on the various tests. <br />
<br />
Finally, if you want to calculate the site likelihoods for a set of trees that you provide, use the '''[[GARLI_Configuration_Settings#optimizeinputonly_.28do_not_search.2C_only_optimize_model_and_branch_lengths_on_user_trees.29 | optimizeinputonly]]''' setting.<br />
<br />
===optimizeinputonly (do not search, only optimize model and branch lengths on user trees)===<br />
(new in version 2.0)<br />
<br />
'''optimizeinputonly''' = (0 or 1, '''0''') - Requires a NEXUS formatted input tree file (only NEXUS format will work!), specified through the '''[[GARLI_Configuration_Settings#streefname_.28source_of_starting_tree_and.2For_model.29 | streefname]]''' setting. All trees in that file will have model parameters and branch lengths optimized, giving their maximum likelihood scores. A file with the site-likelihoods of each tree will also be output. See the '''[[GARLI_Configuration_Settings#outputsitelikelihoods_.28write_a_file_with_the_log-likelihood_of_each_site.29 | outputsitelikelihoods]]''' setting for details.<br />
<br />
===collapsebranches (collapse zero length branches before writing final trees to file)===<br />
<br />
'''collapsebranches''' = (0 or 1, '''1''') - Before version 1.0, all trees that were returned were fully resolved. This was true even if the maximum-likelihood estimates of some internal branch lengths were effectively zero (or GARLI's minimum, which is 1e-8). In such cases, collapsing the branch into a polytomy would be a better representation. Note that GARLI will never return a tree with an actual branch length of zero, but rather with its minimum value of 1.0e-8. The drawback of always returning fully resolved trees is that what is effectively a polytomy can be resolved in three ways, and different independent searches may randomly return one of those resolutions. Thus, if you compare the trees by topology only, they will look different. If you pay attention to the branch lengths and likelihood scores of the trees it will be apparent that they are effectively the same.<br />
<br />
I think that collapsing of branches is particularly important when bootstrapping, since no support should be given to a branch that doesn't really exist, i.e., that is a random resolution of a polytomy. Collapsing is also good when calculating tree to tree distances such as the symmetric tree distance, for example when calculating phylogenetic error to a known target tree. Zero-length branches would add to the distances (~error) although they really should not.<br />
<br />
==Model specification settings==<br />
<br />
With version 1.0 and later there are now many more options dealing with model specification because of <br />
the inclusion of amino acid and codon-based models. The description of the settings will be <br />
broken up by data type. Note that in terms of the model settings in GARLI, “empirical” means <br />
to fix parameter values at those observed in the dataset being analyzed, and “fixed” means to fix <br />
them at user specified values. See the '''[[GARLI_Configuration_Settings#streefname_.28source_of_starting_tree_and.2For_model.29|streefname]]''' setting for details on how to provide <br />
parameter values to be fixed during inference.<br />
<br />
'''PARTITIONED MODELS: To use partitioned models you'll still need to use the same basic model settings detailed below for each individual partition subset, and then some other details to set up the partition model itself. Be sure that you are familiar with the rest of this section, then see [[Using_partitioned_models | Using partitioned models]].'''<br />
<br />
===datatype (sequence type and inference model)=== <br />
'''datatype''' = ('''nucleotide''', aminoacid, codon-aminoacid, codon, standard, standardordered, standardvariable, standardvariableordered) – The type of data and model that is <br />
to be used during tree inference. Nucleotide and amino acid data are self explanatory.<br />
<br />
The codon-aminoacid datatype means that the data will be <br />
supplied as a nucleotide alignment, but will be internally translated and analyzed using an <br />
amino acid model. The codon and codon-aminoacid datatypes require nucleotide sequence <br />
that is aligned in the correct reading frame. In other words, all gaps in the alignment should <br />
be a multiple of 3 in length, and the alignment should start at the first position of a codon. If <br />
the alignment has extra columns at the start, middle or end, they should be removed or <br />
excluded with a Nexus exset (see '''[[FAQ#Can_I_specify_alignment_columns_of_my_data_matrix_to_be_excluded.3F | this FAQ item]]''' for an example of exset usage). The correct <br />
'''[[GARLI_Configuration_Settings#geneticcode_.28code_to_use_in_codon_translation.29 | geneticcode]]''' must also be set.<br />
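<br />
The reading-frame requirements above can be checked before a run with a few lines of Python. This is only a sketch and not part of GARLI: the function name is made up for illustration, the check that the total length is a multiple of 3 is an assumption that follows from "no extra columns", and it looks at one aligned sequence at a time.<br />

```python
import re

def check_codon_alignment(seq):
    """Return a list of problems that would break codon translation of one aligned sequence."""
    problems = []
    # Assumption: an in-frame alignment has a length that is a multiple of 3.
    if len(seq) % 3 != 0:
        problems.append("length %d is not a multiple of 3" % len(seq))
    # Every run of gap characters should be a multiple of 3 in length.
    for match in re.finditer(r"-+", seq):
        gap_length = len(match.group())
        if gap_length % 3 != 0:
            problems.append("gap of length %d at column %d" % (gap_length, match.start() + 1))
    return problems

print(check_codon_alignment("ATG---AAA"))
print(check_codon_alignment("ATG--AAAA"))
```

An empty list means that no problem was found in that sequence. Columns that fail the check can be removed or excluded with an exset, as described above.<br />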
<br />
(New in Version 2.0)<br />
<br />
The various "standard" datatypes are new in GARLI 2.0. These represent morphology-like discrete characters, with any number of states. These are also termed the "Mk" and "Mkv" models by Lewis (2001). See this page for more details on using standard data: '''[[Mkv morphology model]]'''.<br />
<br />
===Settings for datatype = nucleotide=== <br />
====ratematrix (relative rate parameters assumed by substitution model)====<br />
'''ratematrix''' = (1rate, 2rate, '''6rate''', fixed, custom string) – The number of relative substitution rate <br />
parameters (note that the number of free parameters is this value minus one). Equivalent to <br />
the “nst” setting in PAUP* and MrBayes. 1rate assumes that substitutions between all pairs <br />
of nucleotides occur at the same rate (JC model), 2rate allows different rates for transitions and <br />
transversions (K2P or HKY models), and 6rate allows a different rate between each nucleotide pair (GTR). These rates are <br />
estimated unless the fixed option is chosen. Since version 0.96, parameters for any <br />
submodel of the GTR model may be estimated. The format for specifying this is very <br />
similar to that used in the “rclass” setting of PAUP*. Within parentheses, six letters are <br />
specified, with spaces between them. The six letters represent the rates of substitution <br />
between the six pairs of nucleotides, with the order being A-C, A-G, A-T, C-G, C-T and G-T. <br />
Letters within the parentheses that are the same mean that a single parameter is shared by <br />
multiple nucleotide pairs. For example, <br />
ratematrix = (a b a a b a) <br />
would specify the HKY 2-rate model (equivalent to ratematrix = 2rate). This entry, <br />
ratematrix = (a b c c b a)<br />
would specify 3 estimated rates of substitution, with one rate shared by A-C and G-T <br />
substitutions, another rate shared by A-G and C-T substitutions, and the final rate shared by <br />
A-T and C-G substitutions.<br />
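<br />
To see which nucleotide pairs share a rate under a given custom string, the string can be unpacked with a short Python sketch like the one below. It is not GARLI code; the function name is hypothetical, and it only encodes the pair order and the "number of rates minus one" rule described above.<br />

```python
# Order of the nucleotide pairs in a custom ratematrix string (from the text above).
PAIRS = ["A-C", "A-G", "A-T", "C-G", "C-T", "G-T"]

def parse_ratematrix(spec):
    """Group nucleotide pairs by the rate parameter (letter) they share."""
    letters = spec.strip().strip("()").split()
    if len(letters) != 6:
        raise ValueError("expected six rate labels, got %d" % len(letters))
    groups = {}
    for pair, letter in zip(PAIRS, letters):
        groups.setdefault(letter, []).append(pair)
    return groups

groups = parse_ratematrix("(a b c c b a)")
for letter, pairs in groups.items():
    print(letter, pairs)
# The number of free parameters is the number of distinct rates minus one.
print("rates:", len(groups), "free parameters:", len(groups) - 1)
```

For the (a b c c b a) example this reports three rates, and therefore two free parameters.<br />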
<br />
====statefrequencies (equilibrium base frequencies assumed by substitution model)====<br />
'''statefrequencies''' = (equal, empirical, '''estimate''', fixed) – Specifies how the equilibrium state <br />
frequencies (A, C, G and T) are treated. The empirical setting fixes the frequencies at their <br />
observed proportions, and the other options should be self-explanatory. <br />
===For datatype = nucleotide or aminoacid===<br />
====invariantsites (treatment of proportion of invariable sites parameter)====<br />
'''invariantsites''' = (none, '''estimate''', fixed) – Specifies whether a parameter representing the <br />
proportion of sites that are unable to change (i.e. have a substitution rate of zero) will be <br />
included. This is typically referred to as “invariant sites”, but would better be termed <br />
“invariable sites”. <br />
====ratehetmodel (type of rate heterogeneity to assume for variable sites)====<br />
'''ratehetmodel''' = (none, '''gamma''', gammafixed) – The model of rate heterogeneity assumed. <br />
“gammafixed” requires that the alpha shape parameter is provided, and a setting of “gamma” <br />
estimates it. <br />
====numratecats (number of overall substitution rate categories)==== <br />
'''numratecats''' = (1 to 20, '''4''') – The number of categories of variable rates (not including the <br />
invariant site class if it is being used). Must be set to 1 if '''ratehetmodel''' is set to none. Note <br />
that runtimes and memory usage scale linearly with this setting. <br />
===For datatype = aminoacid or codon-aminoacid===<br />
Amino acid analyses are typically done using fixed rate matrices that have been estimated on <br />
large datasets and published. Typically the only model parameters that are estimated during tree <br />
inference relate to the rate heterogeneity distribution. Each of the named matrices also has <br />
corresponding fixed amino acid frequencies, and a given rate matrix can either be used with those <br />
frequencies or with the amino acid frequencies observed in your dataset. This second option is <br />
often denoted as “+F” in a model description, although in terms of the GARLI configuration <br />
settings this is referred to as “empirical” frequencies. In GARLI the Dayhoff model would be <br />
specified by setting both the ratematrix and statefrequencies options to “dayhoff”. The <br />
Dayhoff+F model would be specified by setting the ratematrix to “dayhoff”, and <br />
statefrequencies to “empirical”.<br />
<br />
The following named amino acid models are implemented: <br />
{| border="1"<br />
|-<br />
!'''ratematrix'''/'''statefrequencies''' setting !! reference <br />
|-<br />
|align="center" | dayhoff || Dayhoff, Schwartz and Orcutt. 1978.<br />
|-<br />
|align="center" | jones || Jones, Taylor and Thornton (JTT), 1992. <br />
|-<br />
|align="center" | WAG || Whelan and Goldman, 2001.<br />
|-<br />
|align="center" | mtREV || Adachi and Hasegawa, 1996.<br />
|-<br />
|align="center" | mtmam || Yang, Nielsen and Hasegawa, 1998.<br />
|}<br />
Note that most programs allow either the use of a named rate matrix and its corresponding state <br />
frequencies, or a named rate matrix and empirical frequencies. GARLI technically allows the <br />
mixing of different named matrices and equilibrium frequencies (for example, wag matrix with <br />
jones equilibrium frequencies), but this is not recommended. <br />
====ratematrix (amino acid substitution rates)====<br />
'''ratematrix''' = (poisson, jones, dayhoff, wag, mtmam, mtrev) – The fixed amino acid rate matrix to <br />
use. You should use the matrix that gives the best likelihood, and could use a program like <br />
PROTTEST (very much like MODELTEST, but for amino acid models) to determine which <br />
fits best for your data. Poisson assumes a single rate of substitution between all amino acid <br />
pairs, and is a very poor model.<br />
====statefrequencies (equilibrium base frequencies assumed by substitution model)====<br />
'''statefrequencies''' = (equal, empirical, estimate, fixed, jones, dayhoff, wag, mtmam, mtrev) – <br />
Specifies how the equilibrium state frequencies of the 20 amino acids are treated. The <br />
“empirical” option fixes the frequencies at their observed proportions (when describing a <br />
model this is often termed “+F”). <br />
===For datatype = codon===<br />
The codon models are built with three components: (1) parameters describing the process of <br />
individual nucleotide substitutions, (2) equilibrium codon frequencies, and (3) parameters <br />
describing the relative rate of nonsynonymous to synonymous substitutions. The nucleotide <br />
substitution parameters within the codon models are exactly the same as those possible with <br />
standard nucleotide models in GARLI, and are specified with the ratematrix configuration <br />
entry. Thus, they can be of the 2rate variety (inferring different rates for transitions and <br />
transversions, K2P or HKY-like), the 6rate variety (inferring different rates for all nucleotide <br />
pairs, GTR-like) or any other sub-model of GTR. The options for codon frequencies are <br />
specified with the statefrequencies configuration entry. The options are to use equal <br />
frequencies (not a good option), the frequencies observed in your dataset (termed “empirical” in <br />
GARLI), or the codon frequencies implied by the “F1x4” or “F3x4” methods (using PAML's <br />
terminology). These last two options calculate the codon frequencies as the product of the <br />
frequencies of the three nucleotides that make up each codon. In the “F1x4” case the nucleotide <br />
frequencies are those observed in the dataset across all codon positions, while the “F3x4” option <br />
uses the nucleotide frequencies observed in the data at each codon position separately. The final <br />
component of the codon models is the nonsynonymous to synonymous relative rate parameters <br />
(aka dN/dS or omega parameters). The default is to infer a single dN/dS value. Alternatively, a <br />
model can be specified that infers a given number of dN/dS categories, with the dN/dS values <br />
and proportions falling in each category estimated (ratehetmodel = nonsynonymous). This is <br />
the “discrete” or “M3” model in PAML's terminology. <br />
====ratematrix (relative ''nucleotide'' rate parameters assumed by codon model)====<br />
'''ratematrix''' = (1rate, '''2rate''', 6rate, fixed, custom string) – This determines the relative rates of <br />
nucleotide substitution assumed by the codon model. The options are exactly the same as <br />
those allowed under a normal nucleotide model. A codon model with '''ratematrix''' = 2rate <br />
specifies the standard Goldman and Yang (1994) model, with different substitution rates for <br />
transitions and transversions. <br />
====statefrequencies (equilibrium codon frequencies)====<br />
'''statefrequencies''' = (equal, empirical, f1x4, '''f3x4''') - The options are to use equal codon frequencies <br />
(not a good option), the frequencies observed in your dataset (termed “empirical” in GARLI), <br />
or the codon frequencies implied by the “F1x4” or “F3x4” methods (using PAML's <br />
terminology). These last two options calculate the codon frequencies as the product of the <br />
frequencies of the three nucleotides that make up each codon. In the “F1x4” case the <br />
nucleotide frequencies are those observed in the dataset across all codon positions, while the <br />
“F3x4” option uses the nucleotide frequencies observed in the data at each codon position <br />
separately. <br />
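<br />
The F1x4 and F3x4 calculations can be sketched in Python as follows. This is an illustration only, not GARLI's implementation. It assumes the standard genetic code and assumes that stop codons are dropped and the remaining products are rescaled to sum to 1.0; all function names are hypothetical.<br />

```python
from itertools import product

STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code

def codon_frequencies(position_freqs):
    """position_freqs: three dicts (one per codon position) mapping A/C/G/T to a frequency."""
    raw = {}
    for a, b, c in product("ACGT", repeat=3):
        codon = a + b + c
        if codon in STOP_CODONS:
            continue
        # Product of the frequencies of the three nucleotides in the codon.
        raw[codon] = position_freqs[0][a] * position_freqs[1][b] * position_freqs[2][c]
    total = sum(raw.values())
    # Assumption: rescale so that the sense codons sum to 1.0.
    return {codon: value / total for codon, value in raw.items()}

def f1x4(freqs):
    """One set of nucleotide frequencies shared by all three codon positions."""
    return codon_frequencies([freqs, freqs, freqs])

def f3x4(pos1, pos2, pos3):
    """Separate nucleotide frequencies for each codon position."""
    return codon_frequencies([pos1, pos2, pos3])

equal = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
print(len(f1x4(equal)), f1x4(equal)["ATG"])
```

F1x4 is simply the special case of F3x4 in which the same nucleotide frequencies are used at all three positions.<br />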
====ratehetmodel (variation in dN/dS across sites)==== <br />
'''ratehetmodel''' = ('''none''', nonsynonymous) – For codon models, the default is to infer a single dN/dS <br />
parameter. Alternatively, a model can be specified that infers a given number of dN/dS <br />
categories, with the dN/dS values and proportions falling in each category estimated <br />
('''ratehetmodel''' = nonsynonymous). This is the “discrete” or “M3” model of Yang et al. <br />
(2000). <br />
====numratecats (number of discrete dN/dS categories)====<br />
'''numratecats''' = (1 to 20, '''1''') – When ratehetmodel = nonsynonymous, this is the number of dN/dS parameter <br />
categories.<br />
====invariantsites====<br />
'''invariantsites''' = (none) - '''NOTE:''' Due to an error on my part, the '''invariantsites''' entry must appear with codon models for them to run (despite the fact that it doesn't apply). i.e., be sure that this appears:<br />
invariantsites = none.<br />
<br />
===For datatype = codon or codon-aminoacid=== <br />
====geneticcode (code to use in codon translation)====<br />
'''geneticcode''' = ('''standard''', vertmito, invertmito) – The genetic code to be used in translating codons <br />
into amino acids.<br />
<br />
==Population settings==<br />
===nindivs (number of individuals in population)===<br />
'''nindivs''' = (2 to 100, '''4''')- The number of individuals in the population. This may be increased, but <br />
doing so is generally not beneficial. Note that typical genetic algorithms tend to have much, <br />
much larger population sizes than GARLI's defaults. <br />
===holdover (unmutated copies of best individual)===<br />
'''holdover''' = (1 to nindivs-1, '''1''')- The number of times the best individual is copied to the next <br />
generation with no chance of mutation. It is best not to mess with this. <br />
===selectionintensity (strength of selection)===<br />
'''selectionintensity''' = (0.01 to 5.0, '''0.5''')- Controls the strength of selection, with larger numbers <br />
denoting stronger selection. The relative probability of reproduction of two individuals <br />
depends on the difference in their log likelihoods (&Delta;lnL) and is formulated very similarly to <br />
the procedure of calculating Akaike weights. The relative probability of reproduction of the <br />
less fit individual is equal to:<br />
<br />
<big><big><br />
''e''<sup> (-selectionIntensity &times; &Delta; lnL)</sup><br />
</big></big><br />
<br />
In general, this setting does not seem to have much of an effect on the progress of a run. In <br />
theory higher values should cause scores to increase more quickly, but make the search more <br />
likely to be entrapped in a local optimum. Low values will increase runtimes, but may be <br />
more likely to reach the true optimum. The following table gives the relative probabilities of <br />
reproduction for different values of the selection intensity when the difference in log <br />
likelihood is 1.0<br />
{| border="1"<br />
|-<br />
!'''selectionintensity''' value !! Ratio of probabilities of reproduction <br />
|-<br />
|align="center" |0.05 || align="center" |0.95:1.0 <br />
|-<br />
|align="center" |0.1 || align="center" |0.90:1.0 <br />
|-<br />
|align="center" |0.25 || align="center" |0.78:1.0 <br />
|-<br />
|align="center" |0.5 || align="center" |0.61:1.0 <br />
|-<br />
|align="center" |0.75 || align="center" |0.47:1.0 <br />
|-<br />
|align="center" |1.0 || align="center" |0.37:1.0 <br />
|-<br />
|align="center" |2.0 || align="center" |0.14:1.0<br />
|}<br />
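<br />
The ratios in the table follow directly from the formula above. A small Python sketch (the function name is made up for illustration) reproduces them and can be used to try other values of '''selectionintensity''' or &Delta;lnL:<br />

```python
import math

def relative_reproduction_probability(selection_intensity, delta_lnl):
    """Probability of reproduction of the less fit individual, relative to the fitter one."""
    return math.exp(-selection_intensity * delta_lnl)

# Same selection intensities as in the table, with a log-likelihood difference of 1.0.
for s in (0.05, 0.1, 0.25, 0.5, 0.75, 1.0, 2.0):
    print("%.2f  %.2f:1.0" % (s, relative_reproduction_probability(s, 1.0)))
```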
===holdoverpenalty (fitness handicap for best individual)===<br />
'''holdoverpenalty''' = (0 to 100, '''0''') – This can be used to bias the probability of reproduction of the <br />
best individual downward. Because the best individual is automatically copied into the next <br />
generation, it has a bit of an unfair advantage and can cause all population variation to be lost <br />
due to genetic drift, especially with small population sizes. The value specified here is <br />
subtracted from the best individual’s lnL score before calculating the probabilities of<br />
reproduction. It seems plausible that this might help maintain variation, but I have not seen it <br />
cause a measurable effect.<br />
<br />
===stopgen (maximum number of generations to run)===<br />
'''stopgen''' – The maximum number of generations to run. Note that this supersedes the automated <br />
stopping criterion (see enforcetermconditions above), and should therefore be set to a very <br />
large value if automatic termination is desired. <br />
===stoptime (maximum time to run)===<br />
'''stoptime''' – The maximum number of seconds for the run to continue. Note that this supersedes <br />
the automated stopping criterion (see enforcetermconditions above), and should therefore <br />
be set to a very large value if automatic termination is desired.<br />
<br />
==Branch-length optimization settings== <br />
After a topological rearrangement, branch lengths in the vicinity of the rearrangement are <br />
optimized by the Newton-Raphson method. Optimization passes are performed on a particular <br />
branch until the expected improvement in likelihood for the next pass is less than a threshold <br />
value, termed the optimization precision. Note that this name is somewhat misleading, as the <br />
precision of the optimization algorithm is inversely related to this value (i.e., smaller values of <br />
the optimization precision lead to more precise optimization). If the improvement in likelihood <br />
due to optimization for a particular branch is greater than the optimization precision, <br />
optimization is also attempted on adjacent branches, spreading out across the tree. When no new <br />
topology with a better likelihood score is discovered for a while, the value is automatically <br />
reduced. The value can have a large effect on speed, with smaller values significantly slowing <br />
down the algorithm. The value of the optimization precision and how it changes over the course <br />
of a run are determined by the following three parameters. <br />
===startoptprec===<br />
startoptprec (0.005 to 5.0, '''0.5''') – The beginning optimization precision. <br />
===minoptprec===<br />
minoptprec (0.001 to startoptprec, '''0.01''') – The minimum allowed value of the optimization precision.<br />
<br />
===numberofprecreductions===<br />
numberofprecreductions (0 to 100, '''10''') – Specify the number of steps that it will take for the <br />
optimization precision to decrease (linearly) from startoptprec to minoptprec. <br />
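<br />
As a rough picture of how these three settings interact, the following Python sketch lists the precision values visited under a linear decrease. It is an assumption about the schedule, not GARLI's code: the steps are taken to be evenly spaced, and the behavior when numberofprecreductions is 0 (the value simply stays at startoptprec) is a guess.<br />

```python
def precision_schedule(startoptprec=0.5, minoptprec=0.01, numberofprecreductions=10):
    """Optimization precision values from startoptprec down to minoptprec in equal steps."""
    if numberofprecreductions == 0:
        # Assumption: with no reductions the precision never changes.
        return [startoptprec]
    step = (startoptprec - minoptprec) / numberofprecreductions
    return [startoptprec - k * step for k in range(numberofprecreductions + 1)]

for value in precision_schedule():
    print("%.3f" % value)
```

With the default values each reduction lowers the precision by 0.049.<br />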
===treerejectionthreshold===<br />
treerejectionthreshold (0 to 500, '''50''') – This setting controls which trees have more extensive <br />
branch-length optimization applied to them. All trees created by a branch swap receive <br />
optimization on a few branches that directly took part in the rearrangement. If the difference <br />
in score between the partially optimized tree and the best known tree is greater than treerejectionthreshold, no further optimization is applied to the branches of that tree. <br />
Reducing this value can significantly reduce runtimes, often with little or no effect on results. <br />
However, it is possible that a better tree could be missed if this is set too low. In cases in <br />
which obtaining the very best tree per search is not critical (e.g., bootstrapping), setting this <br />
lower (~20) is probably safe.<br />
<br />
==Settings controlling the proportions of the mutation types== <br />
Each mutation type is assigned a prior ''weight''. These values determine the expected <br />
proportions of the various mutation types that are performed. The primary mutation categories <br />
are ''topology'' (t), ''model'' (m) and ''branch length'' (b). Each is assigned a prior weight ( P<sub>i</sub> ) in the <br />
config file. Each time that a new best likelihood score is attained, the amount of the increase in <br />
score is credited to the mutation type responsible, with the sum of the increases ( S<sub>i</sub> ) maintained <br />
over the last intervallength x intervalstostore generations. The number of times that each <br />
mutation is performed ( N<sub>i</sub> ) is also tallied. The total weight of a mutation type is W<sub>i</sub> = P<sub>i</sub> + ( S<sub>i</sub> / N<sub>i</sub> ). <br />
The proportion of mutations of type i out of all mutations is then <br />
<br />
<big><big><br />
Pr(i) = W<sub>i</sub> / <br />
(W<sub>t</sub> + W<sub>m</sub> + W<sub>b</sub>) <br />
</big></big><br />
<br />
The proportion of each mutation is thus related to its prior weight and the average increase in <br />
score that it has caused over recent generations. The prior weights can be used to control the <br />
expected (and starting) proportions of the mutation types, as well as how sensitive the <br />
proportions are to the course of events in a run. It is generally a good idea to make the topology <br />
prior much larger than the others so that when no mutations are improving the score many <br />
topology mutations are still attempted. If you set '''outputmostlyuselessfiles''' to 1, you can look at the “problog” file to determine what the <br />
proportions of the mutations actually were over the course of a run. <br />
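<br />
The weighting scheme above can be worked through in Python. This is a sketch of the formulas as written here, not GARLI's source. The function and dictionary key names are hypothetical, and treating S<sub>i</sub> / N<sub>i</sub> as zero when a mutation type has not yet been attempted is an assumption.<br />

```python
def mutation_proportions(prior, score_gain, attempts):
    """Pr(i) = W_i / sum of all W, with W_i = P_i + S_i / N_i."""
    weights = {}
    for kind in prior:
        # Assumption: no recorded attempts means no credited gain.
        average_gain = score_gain[kind] / attempts[kind] if attempts[kind] else 0.0
        weights[kind] = prior[kind] + average_gain
    total = sum(weights.values())
    return {kind: weight / total for kind, weight in weights.items()}

# Default prior weights: topoweight, modweight and brlenweight.
prior = {"topology": 1.0, "model": 0.05, "brlen": 0.2}

# At the start of a run nothing has been credited yet, so the priors alone set the proportions.
start = mutation_proportions(prior,
                             {"topology": 0.0, "model": 0.0, "brlen": 0.0},
                             {"topology": 0, "model": 0, "brlen": 0})
print(start)

# Later: topology mutations gained 6.0 lnL units over 300 attempts in the stored intervals.
later = mutation_proportions(prior,
                             {"topology": 6.0, "model": 0.0, "brlen": 0.0},
                             {"topology": 300, "model": 20, "brlen": 80})
print(later)
```

With the default priors the starting proportions are 80% topology, 4% model and 16% branch-length mutations.<br />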
===topoweight (weight on topology mutations)===<br />
topoweight (0 to infinity, '''1.0''') The prior weight assigned to the class of topology mutations <br />
(NNI, SPR and limSPR). Note that setting this to 0.0 turns off topology mutations, meaning that the tree topology is fixed for the run. This used to be a way to have the program estimate only model parameters and branch-lengths, but the '''[[GARLI_Configuration_Settings#optimizeinputonly_.28do_not_search.2C_only_optimize_model_and_branch_lengths_on_user_trees.29 | optimizeinputonly]]''' setting is now a better way to go.<br />
<br />
===modweight (weight on model parameter mutations)===<br />
modweight (0 to infinity, '''0.05''') The prior weight assigned to the class of model mutations. Note <br />
that setting this at 0.0 fixes the model during the run. <br />
===brlenweight (weight on branch-length parameter mutations)===<br />
brlenweight (0 to infinity, '''0.2''') The prior weight assigned to branch-length mutations.<br />
<br />
The same procedure used above to determine the proportion of Topology:Model:Branch-Length <br />
mutations is also used to determine the relative proportions of the three types of topological <br />
mutations (NNI:SPR:limSPR), controlled by the following three weights. Note that the <br />
proportion of mutations applied to each of the model parameters is not user controlled. <br />
===randnniweight (weight on NNI topology changes)===<br />
randnniweight (0 to infinity, '''0.1''') - The prior weight assigned to NNI mutations. <br />
===randsprweight (weight on SPR topology changes)===<br />
randsprweight (0 to infinity, '''0.3''') - The prior weight assigned to random SPR mutations. For <br />
very large datasets it is often best to set this to 0.0, as random SPR mutations essentially <br />
never result in score increases. <br />
===limsprweight (weight on localized SPR topology changes)===<br />
limsprweight (0 to infinity, '''0.6''') - The prior weight assigned to SPR mutations with the <br />
reconnection branch limited to being a maximum of limsprrange branches away from where <br />
the branch was detached. <br />
===intervallength===<br />
intervallength (10 to 1000, '''100''') – The number of generations in each interval during which the <br />
number and benefit of each mutation type are stored. <br />
===intervalstostore===<br />
intervalstostore = (1 to 10, '''5''') – The number of intervals to be stored. Thus, records of <br />
mutations are kept for the last (intervallength x intervalstostore) generations. Every <br />
intervallength generations the probabilities of the mutation types are updated by the scheme <br />
described above.<br />
<br />
==Settings controlling mutation details==<br />
===limsprrange (max range for localized SPR topology changes)===<br />
limsprrange (0 to infinity, '''6''') – The maximum number of branches away from its original <br />
location that a branch may be reattached during a limited SPR move. Setting this too high (> <br />
10) can seriously degrade performance, but if you do so in conjunction with a large increase in genthreshfortopoterm you might end up with better trees.<br />
===meanbrlenmuts (mean # of branch lengths to change per mutation)===<br />
meanbrlenmuts (1 to # taxa, '''5''') - The mean of the binomial distribution from which the number <br />
of branch lengths mutated is drawn during a branch length mutation. <br />
===gammashapebrlen (magnitude of branch-length mutations)===<br />
gammashapebrlen (50 to 2000, '''1000''') - The shape parameter of the gamma distribution (with a <br />
mean of 1.0) from which the branch-length multipliers are drawn for branch-length <br />
mutations. Larger numbers cause smaller changes in branch lengths. (Note that this has <br />
nothing to do with gamma rate heterogeneity.) <br />
===gammashapemodel (magnitude of model parameter mutations)===<br />
gammashapemodel (50 to 2000, '''1000''') - The shape parameter of the gamma distribution (with a <br />
mean of 1.0) from which the model mutation multipliers are drawn for model parameters <br />
mutations. Larger numbers cause smaller changes in model parameters. (Note that this has <br />
nothing to do with gamma rate heterogeneity.) <br />
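<br />
To get a feel for the size of the multipliers that a given shape value produces, note that a gamma distribution with mean 1.0 and shape ''a'' has a scale of 1/''a'' and a standard deviation of 1/sqrt(''a''). The Python sketch below draws such multipliers with the standard library. It illustrates the distribution only and is not GARLI's code; the function names are made up.<br />

```python
import math
import random

def draw_multiplier(shape, rng=random):
    """Draw one multiplier from a gamma distribution with the given shape and a mean of 1.0."""
    return rng.gammavariate(shape, 1.0 / shape)  # scale = 1 / shape gives mean 1.0

def multiplier_sd(shape):
    """Standard deviation of the multiplier distribution."""
    return 1.0 / math.sqrt(shape)

for shape in (50, 1000, 2000):
    print(shape, "sd = %.4f" % multiplier_sd(shape), "example draw = %.4f" % draw_multiplier(shape))
```

With the default of 1000 the standard deviation is about 0.03, so most mutations change a value by only a few percent. At the lower limit of 50 it is about 0.14.<br />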
===uniqueswapbias (relative weight assigned to already attempted branch swaps)===<br />
uniqueswapbias (0.01 to 1.0, '''0.1''') – With version 0.95 and later, GARLI keeps track of which branch <br />
swaps it has attempted on the current best tree. Because swaps are applied randomly, it is <br />
possible that some swaps are tried twice before others are tried at all. This option allows the <br />
program to bias the swaps applied toward those that have not yet been attempted. Each swap <br />
is assigned a relative weight depending on the number of times that it has been attempted on <br />
the current best tree. This weight is equal to (uniqueswapbias) raised to the (# times swap <br />
attempted) power. In other words, a value of 0.5 means that swaps that have already been <br />
tried once will be half as likely as those not yet attempted, swaps attempted twice will be ¼ <br />
as likely, etc. A value of 1.0 means no biasing. If this value is not equal to 1.0 and the <br />
outputmostlyuselessfiles option is on, a file called <ofprefix>.swap.log is output. This file <br />
shows the total number of rearrangements tried and the number of unique ones over the course <br />
of a run. Note that this bias is only applied to NNI and limSPR rearrangements. Use of this <br />
option may allow the use of somewhat larger values of limsprrange. <br />
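<br />
The weighting rule can be written out in Python as below. This is only a sketch of the formula given above; the function names and the way candidate swaps are represented (a dictionary of attempt counts) are assumptions made for the example.<br />

```python
def swap_weight(uniqueswapbias, times_attempted):
    """Relative weight of a swap that has already been tried times_attempted times."""
    return uniqueswapbias ** times_attempted

def swap_probabilities(uniqueswapbias, attempts_by_swap):
    """Normalize the relative weights of a set of candidate swaps into probabilities."""
    weights = {swap: swap_weight(uniqueswapbias, n) for swap, n in attempts_by_swap.items()}
    total = sum(weights.values())
    return {swap: weight / total for swap, weight in weights.items()}

# Three hypothetical swaps, attempted 0, 1 and 2 times, with uniqueswapbias = 0.5.
print(swap_probabilities(0.5, {"swap A": 0, "swap B": 1, "swap C": 2}))
```

In this example the untried swap is twice as likely to be chosen as the one tried once, and four times as likely as the one tried twice.<br />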
===distanceswapbias (relative weight assigned to branch swaps based on locality)===<br />
distanceswapbias (0.1 to 10, '''1.0''') – This option is similar to uniqueswapbias, except that it <br />
biases toward certain swaps based on the topological distance between the initial and <br />
rearranged trees. The distance is measured as in the limsprrange, and is half the <br />
Robinson-Foulds distance between the trees. As with uniqueswapbias, distanceswapbias <br />
assigns a relative weight to each potential swap. In this case the weight is <br />
(distanceswapbias) raised to the (reconnection distance - 1) power. Thus, given a value of <br />
0.5, the weight of an NNI is 1.0, the weight of an SPR with distance 2 is 0.5, with distance 3 <br />
is 0.25, etc. Note that values less than 1.0 bias toward more localized swaps, while values <br />
greater than 1.0 bias toward more extreme swaps. Also note that this bias is only applied to <br />
limSPR rearrangements. Be careful in setting this, as extreme values can have a very large <br />
effect.</div>Zwicklhttps://molevol.mbl.edu/index.php?title=GARLI_Configuration_Settings&diff=4381GARLI Configuration Settings2015-07-21T16:24:53Z<p>Zwickl: </p>
<hr />
<div>==Descriptions of GARLI configuration settings==<br />
The format for these configuration settings descriptions is generally:<br />
'''entryname''' (possible values, '''default value in bold''') – description<br />
<br />
==General settings==<br />
===datafname (file containing sequence dataset)=== <br />
'''datafname''' = (filename) –<br />
Name of the file containing the aligned sequence data. Formats accepted are <br />
PHYLIP, NEXUS and FASTA. Robust reading of Nexus formatted <br />
datasets is done using the Nexus Class Library. This accommodates things such as interleaved <br />
alignments and exclusion sets (exset’s) in NEXUS assumptions blocks (see the FAQ for an <br />
example of [[FAQ#Can_I_specify_alignment_columns_of_my_data_matrix_to_be_excluded.3F | exset usage]]). Use of NEXUS files is recommended, and is required for partitioned models.<br />
<br />
===constraintfile (file containing constraint definition)===<br />
'''constraintfile''' = (filename, '''none''') –<br />
Name of the file containing any topology constraint specifications, or “none” if <br />
there are no constraints. The easiest way to explain the format of the constraint file is by <br />
example. Consider a dataset of 8 taxa, in which your constraint consists of grouping taxa 1, 3 <br />
and 5. You may specify either positive constraints (inferred tree MUST contain constrained <br />
group) or negative constraints (also called converse constraints, inferred tree CANNOT <br />
contain constrained group). These are specified with either a ‘+’ or a ‘-’ at the beginning of <br />
the constraint specification, for positive and negative constraints, respectively. <br />
*For a positive constraint on a grouping of taxa 1, 3 and 5: <br />
+((1,3,5),2,4,6,7,8); <br />
*For a negative constraint on a grouping of taxa 1, 3 and 5: <br />
-((1,3,5),2,4,6,7,8); <br />
*Note that there are many other equivalent parenthetical representations of these constraints. <br />
*Multiple groups may be positively constrained, but currently only a single negatively constrained group is allowed. <br />
*Multiple constrained groupings may be specified in a single string: <br />
+((1,3,5),2,4,(6,7),8); <br />
or in two separate strings on successive lines: <br />
+((1,3,5),2,4,6,7,8); <br />
+(1,3,5,2,4,(6,7),8); <br />
*Constraint strings may also be specified in terms of taxon names (matching those used in the alignment) instead of numbers. <br />
*Positive and negative constraints cannot be mixed. <br />
*GARLI also accepts another constraint format that may be easier to use in some cases. This involves specifying a single branch to be constrained with a string of ‘*’ (asterisk) and ‘.’ (period) characters, with one character per taxon. Each taxon specified with a ‘*’ falls on one side of the constrained branch, and all those specified with a ‘.’ fall on the other. This should be familiar to anyone who has looked at PAUP* bootstrap output. With this format, a positive constraint on a grouping of taxa 1, 3 and 5 would look like this: <br />
+*.*.*... <br />
or equivalently like this: <br />
+.*.*.*** <br />
With this format each line only designates a single branch, so multiple constrained branches must be specified as multiple lines in the file. <br />
*The program also allows “backbone” constraints, which are simply constraints that do not include all of the taxa. For example if one wanted to constrain (1,3,5) but didn’t care where taxon 6 appeared in the tree, this string could be used: <br />
+((1,3,5),2,4,7,8); <br />
:Nothing special needs to be done to identify this as a backbone constraint; simply leave out some taxa.
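
As an illustration of the asterisk/period constraint format described above, here is a small Python sketch that builds the string for a group of taxa and the equivalent string for the other side of the same branch. The function names are made up for this example and are not part of GARLI.

```python
def branch_constraint(group, ntaxa, positive=True):
    """Return a single-branch constraint in the asterisk/period format.

    group  -- taxon numbers (1-based) on one side of the constrained branch
    ntaxa  -- total number of taxa in the dataset
    """
    members = set(group)
    if not members or not all(1 <= t <= ntaxa for t in members):
        raise ValueError("taxon numbers must lie between 1 and ntaxa")
    sign = "+" if positive else "-"
    # One character per taxon: '*' for members of the group, '.' otherwise.
    return sign + "".join("*" if t in members else "." for t in range(1, ntaxa + 1))


def complement(constraint):
    """Swap '*' and '.' to describe the same branch from its other side."""
    swap = {"*": ".", ".": "*"}
    return constraint[0] + "".join(swap[c] for c in constraint[1:])


print(branch_constraint([1, 3, 5], 8))              # +*.*.*...
print(complement(branch_constraint([1, 3, 5], 8)))  # +.*.*.***
```

Each string describes only one branch, so a constraint file with several constrained branches would contain one such line per branch.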
<br />
===streefname (source of starting tree and/or model)===<br />
'''streefname''' = (random, <u>'''stepwise'''</u>, <filename>) – Specifies where the starting tree topology and/or <br />
model parameters will come from. The tree topology may be a completely random topology <br />
(constraints will be enforced), a tree provided by the user in a file, <u>or a tree generated by the <br />
program using a fast ML stepwise-addition algorithm (see [[GARLI_Configuration_Settings#attachmentspertaxon_.28control_creation_of_stepwise_addition_starting_tree.29|attachmentspertaxon]] below)</u>. <br />
Starting or fixed model parameter values may also be provided in the specified file, with or <br />
without a tree topology. Some notes on starting trees/models: <br />
*Specified starting trees may have polytomies, which will be arbitrarily resolved before the run begins.<br />
*Starting tree formats: <br />
**Plain newick tree string (with taxon numbers or names, with or without branch lengths) <br />
**NEXUS trees block. The trees block can appear in the same file as a NEXUS data or characters block that contains the alignment, although the same filename should then be specified on both the datafname and streefname lines.<br />
*If multiple trees appear in the specified file and multiple search replicates are specified (see '''[[GARLI_Configuration_Settings#searchreps_.28number_of_independent_search_replicates.29|searchreps]]''' setting), then the first tree is used in the first replicate, the second in the second replicate, etc. <br />
*Providing model parameter values: see this page '''[[Specifying_model_parameter_values | Specifying model parameter values]]'''<br />
*See also the FAQ items on model parameters '''[[FAQ#Model_parameters | here]]'''.<br />
<br />
===attachmentspertaxon (control creation of stepwise addition starting tree)===<br />
'''attachmentspertaxon''' = (1 to infinity, '''50''') – The number of attachment branches evaluated for each taxon to be added to the tree during the creation of an ML stepwise-addition starting tree. Briefly, stepwise addition is an algorithm used to make a tree, and involves adding taxa in a random order to a growing tree. For each taxon to be added, a number of randomly chosen attachment branches are tried and scored, and then the best scoring one is chosen as the location of that taxon. The attachmentspertaxon setting controls how many attachment points are evaluated for each taxon to be added. A value of one is equivalent to a completely random tree (only one randomly chosen location is evaluated). A value of greater than 2 times the number of taxa in the dataset means that all attachment points will be evaluated for each taxon, and will result in very good starting trees (but may take a while on large datasets). Even fairly small values (< 10) can result in starting trees that are much, much better than random, but still fairly different from one another. <br />
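
To see why a value above twice the number of taxa means that every attachment point is tried, note that an unrooted, fully resolved tree with k taxa has 2k - 3 branches. The Python sketch below uses that fact. It assumes that the number of attachments evaluated is simply capped at the number of branches available, which is a guess at the behavior rather than a description of GARLI's code, and the function names are hypothetical.

```python
def candidate_branches(taxa_in_tree):
    """Number of branches in an unrooted, fully resolved tree with this many taxa."""
    if taxa_in_tree < 2:
        raise ValueError("need at least two taxa to have a branch")
    return 2 * taxa_in_tree - 3


def attachments_evaluated(taxa_in_tree, attachmentspertaxon=50):
    """Attachment points tried when adding the next taxon (assumed to be capped)."""
    return min(attachmentspertaxon, candidate_branches(taxa_in_tree))


# Early in stepwise addition every branch is tried; later only a random subset.
for k in (5, 10, 27, 100):
    print(k, candidate_branches(k), attachments_evaluated(k))
```

With n taxa in the dataset the growing tree never has more than 2n - 5 branches when the last taxon is added, so any setting above 2n is enough to evaluate them all.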
===ofprefix (output filename prefix)===<br />
'''ofprefix''' = (text) – Prefix of all output filenames, such as log, treelog, etc. Change this for each run that you do or the program will overwrite previous results. <br />
===randseed (random number seed)===<br />
'''randseed''' = (-1 or positive integers, '''-1''') – The random number seed used by the random number generator. Specify “–1” to have a seed chosen for you. Specifying the same seed number in multiple runs will give exactly identical results, if all other parameters and settings are also identical. <br />
===availablememory (control maximum program memory usage)===
'''availablememory''' – Typically this is the amount of available physical memory on the system, in megabytes. This lets GARLI determine how much system memory it may be able to use to store computations for reuse. The program will be conservative and use at most about 90% of this value, although for typical datasets it will use much, much less. If other programs must be open or used when GARLI is running, you may need to reduce this value. This setting can also have a significant effect on performance (speed), but more is not always better. When a run is started, GARLI will output the '''availablememory''' value necessary for that dataset to achieve each of the “memory levels” (from best to worst, termed “great”, “good”, “low”, and “very low”). More memory is generally better because more calculations are stored and can be reused, but when the amount of memory needed for “great” becomes more than 512 megabytes or so, performance can be slowed because the operating system has difficulty managing that much memory. In general, choose an amount of memory that allows “great” when this is less than 512 MB, and if it is greater reduce the amount of memory into “good” or “low” as necessary. Avoid “very low” whenever possible. You can find the value that is approximately optimal for your dataset by setting the '''randseed''' to some positive value (so that the searches are identical) and doing runs with various '''availablememory''' values. Typically it will only make at most a 30% difference, so it isn't worth worrying about too much. 
===logevery (frequency to log best score to file)===<br />
'''logevery''' = (1 to infinity, '''10''') – The frequency at which the best score is written to the log file. <br />
===saveevery (frequency to save best tree to file or write checkpoints)===<br />
'''saveevery''' = (1 to infinity, '''100''') – If '''writecheckpoints''' or '''outputcurrentbesttopology''' are specified, this is the frequency (in generations) at which checkpoints or the current best tree are written to file. <br />
===refinestart (whether to optimize a bit before starting a search)===<br />
'''refinestart''' = (0 or 1, '''1''') – Specifies whether some initial rough optimization is performed on the starting branch lengths and rate heterogeneity parameters. This is always recommended. <br />
===outputcurrentbesttopology (continuously write the best tree to file during a run)===<br />
'''outputcurrentbesttopology''' = (0 or 1, '''0''') – If true, the current best tree of the current search replicate is written to <ofprefix>.best.current.tre every '''saveevery''' generations. In versions before 0.96 the current best topology was always written to file, but that is no longer the case. Seeing the current best tree has no real use apart from satisfying your curiosity about how a <br />
run is going.<br />
===outputeachbettertopology (write each improved topology to file)===<br />
'''outputeachbettertopology''' = (0 or 1, '''0''') – If true, each new topology encountered with a better score than the previous best is written to file. In some cases this can result in really big files (hundreds of MB) though, especially for random starting topologies on large datasets. Note that this file is interesting to get an idea of how the topology changed as the searches progressed, but the collection of trees should NOT be interpreted in any meaningful way. This option is not available while bootstrapping. 
===enforcetermconditions (use automatic termination)===<br />
'''enforcetermconditions''' = (0 or 1, '''1''') – Specifies whether the automatic termination conditions will be used. The conditions specified by both of the following two parameters must be met; see those parameters for their definitions. If this is false, the run will continue until it reaches the time ('''stoptime''') or generation ('''stopgen''') limit. It is highly recommended that this option be used! 
===genthreshfortopoterm (number of generations without topology improvement required for termination) ===<br />
'''genthreshfortopoterm''' = (1 to infinity, '''20,000''') – This specifies the first part of the termination condition. When no new significantly better scoring topology (see significanttopochange below) has been encountered in greater than this number of generations, this condition is met. Increasing this parameter may improve the lnL scores obtained (especially on large datasets), but will also increase runtimes. 
===scorethreshforterm (max score improvement over recent generations required for termination)===<br />
'''scorethreshforterm''' = (0 to infinity, '''0.05''') – The second part of the termination condition. When the total improvement in score over the last '''intervallength''' x '''intervalstostore''' generations (default is 500 generations, see below) is less than this value, this condition is met. This does not usually need to be changed. <br />
===significanttopochange (required score improvement for topology to be considered better)===<br />
'''significanttopochange''' = (0 to infinity, '''0.01''') – The lnL increase required for a new topology to be <br />
considered significant as far as the termination condition is concerned. It probably doesn’t <br />
need to be played with, but you might try increasing it slightly if your runs reach a stable <br />
score and then take a very long time to terminate due to very minor changes in topology. <br />
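
The way the three termination settings above fit together can be sketched in a few lines of Python. This is only a reading of the descriptions on this page, not GARLI's actual code: the function and argument names are invented, and whether the comparisons are strict is an assumption.

```python
def is_significant_topology(new_lnl, best_lnl, significanttopochange=0.01):
    """A new topology counts only if it improves lnL by more than the threshold."""
    return (new_lnl - best_lnl) > significanttopochange


def should_terminate(gens_since_significant_topology, recent_score_gain,
                     genthreshfortopoterm=20000, scorethreshforterm=0.05):
    """Both parts of the termination condition must hold at the same time.

    gens_since_significant_topology -- generations since the last significantly
                                       better topology was found
    recent_score_gain -- total lnL improvement over the last
                         intervallength x intervalstostore generations
    """
    topology_condition = gens_since_significant_topology > genthreshfortopoterm
    score_condition = recent_score_gain < scorethreshforterm
    return topology_condition and score_condition


print(is_significant_topology(-1000.000, -1000.005))  # False: gain of 0.005 is too small
print(should_terminate(25000, 0.01))                  # True: both conditions are met
print(should_terminate(25000, 0.20))                  # False: score is still improving
```

Raising '''significanttopochange''' makes fewer topology changes count as significant, so the generation counter is reset less often and runs that keep making tiny topology changes can stop sooner.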
===outputphyliptree (write trees to file in Phylip as well as Nexus format)===<br />
'''outputphyliptree''' = (0 or 1, '''0''') – Whether Phylip-formatted tree files will be output in addition to 
the default Nexus files for the best tree across all replicates (<ofprefix>.best.phy), the best 
tree for each replicate (<ofprefix>.best.all.phy) or, in the case of bootstrapping, the best tree 
for each bootstrap replicate (<ofprefix>.boot.phy). 
===outputmostlyuselessfiles (output uninteresting files)===<br />
'''outputmostlyuselessfiles''' = (0 or 1, '''0''') – Whether to output three files of little general interest: the <br />
“fate”, “problog” and “swaplog” files. The fate file shows the parentage, mutation types and <br />
scores of every individual in the population during the entire search. The problog shows how <br />
the proportions of the different mutation types changed over the course of the run. The <br />
swaplog shows the number of unique swaps and the number of total swaps on the current <br />
best tree over the course of the run. <br />
===writecheckpoints (write checkpoint files during run)===<br />
'''writecheckpoints''' = (0 or 1, '''0''') – Whether to write three files to disk containing all information 
about the current state of the population every saveevery generations, with each successive <br />
checkpoint overwriting the previous one. These files can be used to restart a run at the last <br />
written checkpoint by setting the '''restart''' configuration entry.<br />
===restart (restart run from checkpoint)===<br />
'''restart''' = (0 or 1, '''0''') – Whether to restart at a previously saved checkpoint. To use this option the '''writecheckpoints''' option must have been used during a previous run. The program will look for checkpoint files that are named based on the ofprefix of the previous run. If you intend to restart a run, NOTHING should be changed in the config file except setting '''restart''' to 1. <br />
A run that is restarted from checkpoint will give ''exactly'' the same results it would have if the run had gone to completion. <br />
===outgroup (orient inferred trees consistently)===<br />
'''outgroup''' = (outgroup taxa numbers, separated by spaces) – This option allows for orienting the tree topologies in a consistent way when they are written to file. Note that this has NO effect whatsoever on the actual inference and the specified outgroup is NOT constrained to be present in the inferred trees. If multiple outgroup taxa are specified and they do not form a monophyletic group in the inferred tree, this setting will be ignored. If you specify a single outgroup taxon it will always be present, and the tree will always be consistently oriented. Ranges can be indicated with a hyphen, e.g., to specify an outgroup consisting of taxa 1, 2, 3 and 5 the format is this: 
outgroup = 1-3 5<br />
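
For clarity, this is how an outgroup specification with hyphenated ranges expands into individual taxon numbers. The Python helper below is purely illustrative (its name is hypothetical) and only mirrors the format described above.

```python
def parse_outgroup(spec):
    """Expand an outgroup string such as '1-3 5' into a list of taxon numbers."""
    taxa = []
    for token in spec.split():
        if "-" in token:
            # A hyphen denotes an inclusive range of taxon numbers.
            start, end = (int(part) for part in token.split("-"))
            taxa.extend(range(start, end + 1))
        else:
            taxa.append(int(token))
    return taxa


print(parse_outgroup("1-3 5"))  # [1, 2, 3, 5]
```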
<br />
===searchreps (number of independent search replicates)===<br />
'''searchreps''' = (1 to infinity, '''2''') – The number of independent search replicates to perform during a program execution. You should always either do multiple search replicates or multiple program executions with any dataset to get a feel for whether you are getting consistent results, which suggests that the program is doing a decent job of searching. Note that if this is > 1 and you are performing a bootstrap analysis, this is the number of search replicates to be done per bootstrap replicate. That can increase the chance of finding the best tree per bootstrap replicate, but will also increase bootstrap runtimes enormously.<br />
===bootstrapreps (number of bootstrap replicates)===<br />
'''bootstrapreps''' = (0 to infinity, '''0''') - The number of bootstrap reps to perform. If this is greater than 
0, normal searching will not be performed. The resulting bootstrap trees (one per rep) will be <br />
output to a file named <ofprefix>.boot.tre. To obtain the bootstrap proportions they will then <br />
need to be read into PAUP* or a similar program to obtain a majority rule consensus. Note <br />
that it is probably safe to reduce the strictness of the termination conditions during <br />
bootstrapping (perhaps halve genthreshfortopoterm), which will greatly speed up the <br />
bootstrapping process with negligible effects on the results.<br />
<br />
Note that GARLI does NOT do the bootstrap consensus or calculate the bootstrap proportions itself. It simply infers the trees that are the input to that consensus. The trees in the .boot.tre file will need to be read into a program such as SumTrees (or PAUP* or CONSENSE) to do a majority-rule consensus and obtain your bootstrap support values. See the following part of the "Advanced topics" page for suggestions on making the consensus: '''[[Advanced_topics#Detailed_Example:_A_bootstrap_analysis|Detailed Example: A bootstrap analysis]]'''.
<br />
===resampleproportion (relative size of re-sampled data matrix)===<br />
'''resampleproportion''' = (0.1 to 10, '''1.0''') – When bootstrapreps > 0, this setting allows for 
bootstrap-like resampling, but with the pseudoreplicate datasets having a number of 
alignment columns different from that of the real data. Setting values < 1.0 is somewhat similar to jackknifing, but not identical. 
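
The idea of resampling to a different matrix size can be sketched as follows in Python. This is only a conceptual illustration: the function name is hypothetical, and how GARLI rounds the number of resampled columns is not stated here, so the rounding below is an assumption.

```python
import random


def resample_columns(ncols, resampleproportion=1.0, rng=None):
    """Draw alignment column indices with replacement for one pseudoreplicate.

    The pseudoreplicate has about resampleproportion * ncols columns
    (rounding is an assumption made for this sketch).
    """
    if rng is None:
        rng = random.Random()
    nsampled = max(1, round(resampleproportion * ncols))
    return [rng.randrange(ncols) for _ in range(nsampled)]


rng = random.Random(42)
print(len(resample_columns(1000, 1.0, rng)))  # 1000 columns: a standard bootstrap
print(len(resample_columns(1000, 0.5, rng)))  # 500 columns: a smaller pseudoreplicate
```

Because columns are still drawn with replacement, a proportion below 1.0 is not the same as a jackknife, which deletes columns without duplicating any.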
===inferinternalstateprobs (infer ancestral states)===<br />
'''inferinternalstateprobs''' = (0 or 1, '''0''') – Specify 1 to have GARLI infer the marginal posterior 
probability of each character at each internal node. This is done at the very end of the run, <br />
just before termination. The results are output to a file named <ofprefix>.internalstates.log.<br />
<br />
===outputsitelikelihoods (write a file with the log-likelihood of each site)===<br />
<br />
'''outputsitelikelihoods''' = (0 or 1, '''0''') - Causes a file named <ofprefix>.sitelikes.log to be created containing the sitewise log-likelihood values for each individual site for the final tree of each search replicate. Note that the format of the file is exactly the same as that output by PAUP, so it can be directly used in CONSEL using the PAUP file-format option. For partitioned models the sites are listed in sets by partition subset, but the site numbers will match those in the original datamatrix. Thus, when using a partitioned model that divides sites by codon position, the ordering would be 1, 4, 7, ... 2, 5, 8, ... 3, 6, 9 .... Note that this makes no difference to CONSEL, which just assumes that the site likelihoods are ordered the same for each tree and ignores the site numbers. 
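
As a small illustration of the ordering described above for a model partitioned by codon position, this Python snippet lists the original site numbers in the order they would appear. The function name is hypothetical.

```python
def sitelike_order(nsites):
    """Site numbers grouped by codon position: 1, 4, 7, ... 2, 5, 8, ... 3, 6, 9, ..."""
    return [site for position in (1, 2, 3) for site in range(position, nsites + 1, 3)]


print(sitelike_order(9))  # [1, 4, 7, 2, 5, 8, 3, 6, 9]
```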
<br />
Also note that using the site likelihoods from a partitioned model is a violation of CONSEL's assumptions, since it will not know about the partitions and its internal bootstrap resampling will not obey them. It isn't clear what the effects of this will be on the various tests. <br />
<br />
Finally, if you want to calculate the site likelihoods for a set of trees that you provide, use the '''[[GARLI_Configuration_Settings#optimizeinputonly_.28do_not_search.2C_only_optimize_model_and_branch_lengths_on_user_trees.29 | optimizeinputonly]]''' setting.<br />
<br />
===optimizeinputonly (do not search, only optimize model and branch lengths on user trees)===<br />
(new in version 2.0)<br />
<br />
'''optimizeinputonly''' = (0 or 1, '''0''') - Requires a NEXUS formatted input tree file (only NEXUS format will work!), specified through the '''[[GARLI_Configuration_Settings#streefname_.28source_of_starting_tree_and.2For_model.29 | streefname]]''' setting. All trees in that file will have model parameters and branch lengths optimized, giving their maximum likelihood scores. A file with the site-likelihoods of each tree will also be output. See the '''[[GARLI_Configuration_Settings#outputsitelikelihoods_.28write_a_file_with_the_log-likelihood_of_each_site.29 | outputsitelikelihoods]]''' setting for details.<br />
<br />
===collapsebranches (collapse zero length branches before writing final trees to file)===<br />
<br />
'''collapsebranches''' = (0 or 1, '''1''') - Before version 1.0, all trees that were returned were fully resolved. This was true even if the maximum-likelihood estimate of some internal branch lengths was effectively zero (or GARLI's minimum, which is 1e-8). In such cases, collapsing the branch into a polytomy would be a better representation. Note that GARLI will never return a tree with an actual branch length of zero, but rather with its minimum value of 1.0e-8. The drawback of always returning fully resolved trees is that what is effectively a polytomy can be resolved in three ways, and different independent searches may randomly return one of those resolutions. Thus, if you compare the trees by topology only, they will look different. If you pay attention to the branch lengths and likelihood scores of the trees it will be apparent that they are effectively the same.
<br />
I think that collapsing of branches is particularly important when bootstrapping, since no support should be given to a branch that doesn't really exist, i.e., that is a random resolution of a polytomy. Collapsing is also good when calculating tree to tree distances such as the symmetric tree distance, for example when calculating phylogenetic error to a known target tree. Zero-length branches would add to the distances (~error) although they really should not.<br />
<br />
==Model specification settings==<br />
<br />
With version 1.0 and later there are now many more options dealing with model specification because of <br />
the inclusion of amino acid and codon-based models. The description of the settings will be <br />
broken up by data type. Note that in terms of the model settings in GARLI, “empirical” means <br />
to fix parameter values at those observed in the dataset being analyzed, and “fixed” means to fix <br />
them at user specified values. See the '''[[GARLI_Configuration_Settings#streefname_.28source_of_starting_tree_and.2For_model.29|streefname]]''' setting for details on how to provide <br />
parameter values to be fixed during inference.<br />
<br />
'''PARTITIONED MODELS: To use partitioned models you'll still need to use the same basic model settings detailed below for each individual partition subset, and then some other details to set up the partition model itself. Be sure that you are familiar with the rest of this section, then see [[Using_partitioned_models | Using partitioned models]].'''<br />
<br />
===datatype (sequence type and inference model)=== <br />
'''datatype''' = ('''nucleotide''', aminoacid, codon-aminoacid, codon, standard, standardordered, standardvariable, standardvariableordered) – The type of data and model that is <br />
to be used during tree inference. Nucleotide and amino acid data are self explanatory.<br />
<br />
The codon-aminoacid datatype means that the data will be <br />
supplied as a nucleotide alignment, but will be internally translated and analyzed using an <br />
amino acid model. The codon and codon-aminoacid datatypes require nucleotide sequence <br />
that is aligned in the correct reading frame. In other words, all gaps in the alignment should <br />
be a multiple of 3 in length, and the alignment should start at the first position of a codon. If <br />
the alignment has extra columns at the start, middle or end, they should be removed or <br />
excluded with a Nexus exset (see '''[[FAQ#Can_I_specify_alignment_columns_of_my_data_matrix_to_be_excluded.3F | this FAQ item]]''' for an example of exset usage). The correct <br />
'''[[GARLI_Configuration_Settings#geneticcode_.28code_to_use_in_codon_translation.29 | geneticcode]]''' must also be set.<br />
<br />
(New in Version 2.0)<br />
<br />
The various "standard" datatypes are new in GARLI 2.0. These represent morphology-like discrete characters, with any number of states. These are also termed the "Mk" and "Mkv" models by Lewis (2001). See this page for more details on using standard data: '''[[Mkv morphology model]]'''.<br />
<br />
===Settings for datatype = nucleotide=== <br />
====ratematrix (relative rate parameters assumed by substitution model)====<br />
'''ratematrix''' = (1rate, 2rate, '''6rate''', fixed, custom string) – The number of relative substitution rate <br />
parameters (note that the number of free parameters is this value minus one). Equivalent to <br />
the “nst” setting in PAUP* and MrBayes. 1rate assumes that substitutions between all pairs <br />
of nucleotides occur at the same rate (JC model), 2rate allows different rates for transitions and <br />
transversions (K2P or HKY models), and 6rate allows a different rate between each nucleotide pair (GTR). These rates are <br />
estimated unless the fixed option is chosen. Since version 0.96, parameters for any <br />
submodel of the GTR model may be estimated. The format for specifying this is very <br />
similar to that used in the “rclass” setting of PAUP*. Within parentheses, six letters are 
specified, with spaces between them. The six letters represent the rates of substitution <br />
between the six pairs of nucleotides, with the order being A-C, A-G, A-T, C-G, C-T and G-T. <br />
Letters within the parentheses that are the same mean that a single parameter is shared by <br />
multiple nucleotide pairs. For example, <br />
ratematrix = (a b a a b a) <br />
would specify the HKY 2-rate model (equivalent to ratematrix = 2rate). This entry, <br />
ratematrix = (a b c c b a) <br />
would specify 3 estimated rates of substitution, with one rate shared by A-C and G-T 
substitutions, another rate shared by A-G and C-T substitutions, and the final rate shared by <br />
A-T and C-G substitutions.<br />
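
To make the custom string format concrete, the following Python sketch groups the six nucleotide pairs by the letter assigned to them and counts the free rate parameters (number of distinct rates minus one, as noted above). The function names are made up for illustration and are not part of GARLI.

```python
# Order of the six substitution types in a custom ratematrix string.
PAIRS = ("A-C", "A-G", "A-T", "C-G", "C-T", "G-T")


def parse_ratematrix(spec):
    """Map each letter in a string such as '(a b c c b a)' to the pairs sharing it."""
    letters = spec.strip().strip("()").split()
    if len(letters) != 6:
        raise ValueError("a custom ratematrix needs exactly six letters")
    classes = {}
    for pair, letter in zip(PAIRS, letters):
        classes.setdefault(letter, []).append(pair)
    return classes


def free_rate_parameters(spec):
    """Distinct relative rates minus one (one rate serves as the reference)."""
    return len(parse_ratematrix(spec)) - 1


print(parse_ratematrix("(a b c c b a)"))
print(free_rate_parameters("(a b a a b a)"))  # 1, the HKY-like 2-rate model
print(free_rate_parameters("(a b c d e f)"))  # 5, the full GTR model
```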
<br />
====statefrequencies (equilibrium base frequencies assumed by substitution model)====<br />
'''statefrequencies''' = (equal, empirical, '''estimate''', fixed) – Specifies how the equilibrium state <br />
frequencies (A, C, G and T) are treated. The empirical setting fixes the frequencies at their <br />
observed proportions, and the other options should be self-explanatory. <br />
===For datatype = nucleotide or aminoacid===<br />
====invariantsites (treatment of proportion of invariable sites parameter)====<br />
'''invariantsites''' = (none, '''estimate''', fixed) – Specifies whether a parameter representing the <br />
proportion of sites that are unable to change (i.e. have a substitution rate of zero) will be <br />
included. This is typically referred to as “invariant sites”, but would better be termed <br />
“invariable sites”. <br />
====ratehetmodel (type of rate heterogeneity to assume for variable sites)====<br />
'''ratehetmodel''' = (none, '''gamma''', gammafixed) – The model of rate heterogeneity assumed. <br />
“gammafixed” requires that the alpha shape parameter is provided, and a setting of “gamma” <br />
estimates it. <br />
====numratecats (number of overall substitution rate categories)==== <br />
'''numratecats''' = (1 to 20, '''4''') – The number of categories of variable rates (not including the <br />
invariant site class if it is being used). Must be set to 1 if '''ratehetmodel''' is set to none. Note <br />
that runtimes and memory usage scale linearly with this setting. <br />
===For datatype = aminoacid or codon-aminoacid===<br />
Amino acid analyses are typically done using fixed rate matrices that have been estimated on <br />
large datasets and published. Typically the only model parameters that are estimated during tree <br />
inference relate to the rate heterogeneity distribution. Each of the named matrices also has <br />
corresponding fixed amino acid frequencies, and a given rate matrix can either be used with those <br />
frequencies or with the amino acid frequencies observed in your dataset. This second option is <br />
often denoted as “+F” in a model description, although in terms of the GARLI configuration <br />
settings this is referred to as “empirical” frequencies. In GARLI the Dayhoff model would be <br />
specified by setting both the ratematrix and statefrequencies options to “dayhoff”. The <br />
Dayhoff+F model would be specified by setting the ratematrix to “dayhoff”, and <br />
statefrequencies to “empirical”.<br />
<br />
The following named amino acid models are implemented: <br />
{| border="1"<br />
|-<br />
!'''ratematrix'''/'''statefrequencies''' setting !! reference <br />
|-<br />
|align="center" | dayhoff || Dayhoff, Schwartz and Orcutt. 1978.<br />
|-<br />
|align="center" | jones || Jones, Taylor and Thornton (JTT), 1992. <br />
|-<br />
|align="center" | WAG || Whelan and Goldman, 2001.<br />
|-<br />
|align="center" | mtREV || Adachi and Hasegawa, 1996.<br />
|-<br />
|align="center" | mtmam || Yang, Nielsen and Hasegawa, 1998.<br />
|}<br />
Note that most programs allow either the use of a named rate matrix and its corresponding state <br />
frequencies, or a named rate matrix and empirical frequencies. GARLI technically allows the <br />
mixing of different named matrices and equilibrium frequencies (for example, wag matrix with <br />
jones equilibrium frequencies), but this is not recommended. <br />
====ratematrix (amino acid substitution rates)====<br />
'''ratematrix''' = (poisson, jones, dayhoff, wag, mtmam, mtrev) – The fixed amino acid rate matrix to <br />
use. You should use the matrix that gives the best likelihood, and could use a program like <br />
PROTTEST (very much like MODELTEST, but for amino acid models) to determine which <br />
fits best for your data. Poisson assumes a single rate of substitution between all amino acid <br />
pairs, and is a very poor model.<br />
====statefrequencies (equilibrium base frequencies assumed by substitution model)====<br />
'''statefrequencies''' = (equal, empirical, estimate, fixed, jones, dayhoff, wag, mtmam, mtrev) – <br />
Specifies how the equilibrium state frequencies of the 20 amino acids are treated. The <br />
“empirical” option fixes the frequencies at their observed proportions (when describing a <br />
model this is often termed “+F”). <br />
===For datatype = codon===<br />
The codon models are built with three components: (1) parameters describing the process of <br />
individual nucleotide substitutions, (2) equilibrium codon frequencies, and (3) parameters <br />
describing the relative rate of nonsynonymous to synonymous substitutions. The nucleotide <br />
substitution parameters within the codon models are exactly the same as those possible with <br />
standard nucleotide models in GARLI, and are specified with the ratematrix configuration <br />
entry. Thus, they can be of the 2rate variety (inferring different rates for transitions and <br />
transversions, K2P or HKY-like), the 6rate variety (inferring different rates for all nucleotide <br />
pairs, GTR-like) or any other sub-model of GTR. The options for codon frequencies are <br />
specified with the statefrequencies configuration entry. The options are to use equal <br />
frequencies (not a good option), the frequencies observed in your dataset (termed “empirical” in <br />
GARLI), or the codon frequencies implied by the “F1x4” or “F3x4” methods (using PAML's <br />
terminology). These last two options calculate the codon frequencies as the product of the <br />
frequencies of the three nucleotides that make up each codon. In the “F1x4” case the nucleotide <br />
frequencies are those observed in the dataset across all codon positions, while the “F3x4” option <br />
uses the nucleotide frequencies observed in the data at each codon position separately. The final <br />
component of the codon models is the nonsynonymous to synonymous relative rate parameters <br />
(aka dN/dS or omega parameters). The default is to infer a single dN/dS value. Alternatively, a <br />
model can be specified that infers a given number of dN/dS categories, with the dN/dS values <br />
and proportions falling in each category estimated (ratehetmodel = nonsynonymous). This is <br />
the “discrete” or “M3” model in PAML's terminology. <br />
====ratematrix (relative ''nucleotide'' rate parameters assumed by codon model)====<br />
'''ratematrix''' = (1rate, '''2rate''', 6rate, fixed, custom string) – This determines the relative rates of <br />
nucleotide substitution assumed by the codon model. The options are exactly the same as <br />
those allowed under a normal nucleotide model. A codon model with '''ratematrix''' = 2rate <br />
specifies the standard Goldman and Yang (1994) model, with different substitution rates for <br />
transitions and transversions. <br />
====statefrequencies (equilibrium codon frequencies)====<br />
'''statefrequencies''' = (equal, empirical, f1x4, '''f3x4''') - The options are to use equal codon frequencies 
(not a good option), the frequencies observed in your dataset (termed “empirical” in GARLI), <br />
or the codon frequencies implied by the “F1x4” or “F3x4” methods (using PAML's <br />
terminology). These last two options calculate the codon frequencies as the product of the <br />
frequencies of the three nucleotides that make up each codon. In the “F1x4” case the <br />
nucleotide frequencies are those observed in the dataset across all codon positions, while the <br />
“F3x4” option uses the nucleotide frequencies observed in the data at each codon position <br />
separately. <br />
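
The difference between the “F1x4” and “F3x4” calculations can be shown with a short Python sketch. It assumes the standard genetic code and assumes that stop codons are excluded and the remaining 61 frequencies are rescaled to sum to one. That is the usual convention for codon models, but it is not spelled out on this page, and the function names are hypothetical.

```python
from itertools import product

BASES = "ACGT"
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code (assumption)


def codon_frequencies(position_freqs):
    """Codon frequencies as products of per-position nucleotide frequencies.

    position_freqs -- three dicts (codon positions 1, 2, 3) mapping base -> frequency
    """
    raw = {}
    for codon in map("".join, product(BASES, repeat=3)):
        if codon in STOP_CODONS:
            continue  # stop codons are not states of the codon model
        raw[codon] = (position_freqs[0][codon[0]]
                      * position_freqs[1][codon[1]]
                      * position_freqs[2][codon[2]])
    total = sum(raw.values())
    return {codon: value / total for codon, value in raw.items()}


def f1x4(freqs):
    """One set of nucleotide frequencies shared by all three codon positions."""
    return codon_frequencies([freqs, freqs, freqs])


def f3x4(pos1, pos2, pos3):
    """A separate set of nucleotide frequencies for each codon position."""
    return codon_frequencies([pos1, pos2, pos3])


overall = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}
third = {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1}
print(round(f1x4(overall)["ATG"], 4))
print(round(f3x4(overall, overall, third)["ATG"], 4))
```

The nucleotide frequencies used here are invented numbers; in an actual analysis they would be the proportions observed in the data matrix.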
====ratehetmodel (variation in dN/dS across sites)==== 
'''ratehetmodel''' = ('''none''', nonsynonymous) – For codon models, the default is to infer a single dN/dS <br />
parameter. Alternatively, a model can be specified that infers a given number of dN/dS <br />
categories, with the dN/dS values and proportions falling in each category estimated <br />
('''ratehetmodel''' = nonsynonymous). This is the “discrete” or “M3” model of Yang et al. <br />
(2000). <br />
====numratecats (number of discrete dN/dS categories)====<br />
'''numratecats''' = (1 to 20, '''1''') – When ratehetmodel = nonsynonymous, this is the number of dN/dS parameter <br />
categories.<br />
====invariantsites====<br />
'''invariantsites''' = (none) - '''NOTE:''' Due to an error on my part, the '''invariantsites''' entry must appear with codon models for them to run (despite the fact that it doesn't apply), i.e., be sure that this appears:
invariantsites = none
<br />
===For datatype = codon or codon-aminoacid=== <br />
====geneticcode (code to use in codon translation)====<br />
'''geneticcode''' = ('''standard''', vertmito, invertmito) – The genetic code to be used in translating codons <br />
into amino acids.<br />
<br />
==Population settings==<br />
===nindivs (number of individuals in population)===<br />
'''nindivs''' = (2 to 100, '''4''')- The number of individuals in the population. This may be increased, but <br />
doing so is generally not beneficial. Note that typical genetic algorithms tend to have much, <br />
much larger population sizes than GARLI's defaults. <br />
===holdover (unmutated copies of best individual)===<br />
'''holdover''' = (1 to nindivs-1, '''1''')- The number of times the best individual is copied to the next <br />
generation with no chance of mutation. It is best not to mess with this. <br />
===selectionintensity (strength of selection)===<br />
'''selectionintensity''' = (0.01 to 5.0, '''0.5''')- Controls the strength of selection, with larger numbers <br />
denoting stronger selection. The relative probability of reproduction of two individuals <br />
depends on the difference in their log likelihoods (&Delta;lnL) and is formulated very similarly to <br />
the procedure of calculating Akaike weights. The relative probability of reproduction of the <br />
less fit individual is equal to:<br />
<br />
<big><big><br />
''e''<sup> (-selectionIntensity &times; &Delta; lnL)</sup><br />
</big></big><br />
<br />
In general, this setting does not seem to have much of an effect on the progress of a run. In <br />
theory higher values should cause scores to increase more quickly, but make the search more <br />
likely to be entrapped in a local optimum. Low values will increase runtimes, but may be <br />
more likely to reach the true optimum. The following table gives the relative probabilities of <br />
reproduction for different values of the selection intensity when the difference in log <br />
likelihood is 1.0<br />
{| border="1"<br />
|-<br />
!'''selectionintensity''' value !! Ratio of probabilities of reproduction <br />
|-<br />
|align="center" |0.05 || align="center" |0.95:1.0 <br />
|-<br />
|align="center" |0.1 || align="center" |0.90:1.0 <br />
|-<br />
|align="center" |0.25 || align="center" |0.78:1.0 <br />
|-<br />
|align="center" |0.5 || align="center" |0.61:1.0 <br />
|-<br />
|align="center" |0.75 || align="center" |0.47:1.0 <br />
|-<br />
|align="center" |1.0 || align="center" |0.37:1.0 <br />
|-<br />
|align="center" |2.0 || align="center" |0.14:1.0<br />
|}<br />
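The ratios in the table follow directly from the formula above. A minimal Python sketch (the function name is hypothetical) that evaluates it for a log-likelihood difference of 1.0:<br />

```python
import math

def relative_reproduction_prob(selection_intensity, delta_lnl):
    """Relative probability of reproduction of the less fit individual."""
    return math.exp(-selection_intensity * delta_lnl)

# Evaluate the formula for the selectionintensity values listed in the table.
for s in (0.05, 0.1, 0.25, 0.5, 0.75, 1.0, 2.0):
    print(f"{s:<5} {relative_reproduction_prob(s, 1.0):.2f}:1.0")
```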
===holdoverpenalty (fitness handicap for best individual)===<br />
'''holdoverpenalty''' = (0 to 100, '''0''') – This can be used to bias the probability of reproduction of the <br />
best individual downward. Because the best individual is automatically copied into the next <br />
generation, it has a bit of an unfair advantage and can cause all population variation to be lost <br />
due to genetic drift, especially with small population sizes. The value specified here is <br />
subtracted from the best individual’s lnL score before calculating the probabilities of<br />
reproduction. It seems plausible that this might help maintain variation, but I have not seen it <br />
cause a measurable effect.<br />
<br />
===stopgen (maximum number of generations to run)===<br />
'''stopgen''' – The maximum number of generations to run. Note that this supersedes the automated <br />
stopping criterion (see enforcetermconditions above), and should therefore be set to a very <br />
large value if automatic termination is desired. <br />
===stoptime (maximum time to run)===<br />
'''stoptime''' – The maximum number of seconds for the run to continue. Note that this supersedes <br />
the automated stopping criterion (see enforcetermconditions above), and should therefore <br />
be set to a very large value if automatic termination is desired.<br />
<br />
==Branch-length optimization settings== <br />
After a topological rearrangement, branch lengths in the vicinity of the rearrangement are <br />
optimized by the Newton-Raphson method. Optimization passes are performed on a particular <br />
branch until the expected improvement in likelihood for the next pass is less than a threshold <br />
value, termed the optimization precision. Note that this name is somewhat misleading, as the <br />
precision of the optimization algorithm is inversely related to this value (i.e., smaller values of <br />
the optimization precision lead to more precise optimization). If the improvement in likelihood <br />
due to optimization for a particular branch is greater than the optimization precision, <br />
optimization is also attempted on adjacent branches, spreading out across the tree. When no new <br />
topology with a better likelihood score is discovered for a while, the value is automatically <br />
reduced. The value can have a large effect on speed, with smaller values significantly slowing <br />
down the algorithm. The value of the optimization precision and how it changes over the course <br />
of a run are determined by the following three parameters. <br />
===startoptprec===<br />
startoptprec (0.005 to 5.0, '''0.5''') – The beginning optimization precision. <br />
===minoptprec===<br />
minoptprec (0.001 to startoptprec, '''0.01''') – The minimum allowed value of the optimization precision.<br />
<br />
===numberofprecreductions===<br />
numberofprecreductions (0 to 100, '''10''') – Specify the number of steps that it will take for the <br />
optimization precision to decrease (linearly) from startoptprec to minoptprec. <br />
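As a rough illustration of a linear decrease, the Python sketch below lists the precision values visited with the default settings. The function name is hypothetical, and the exact values GARLI uses at each step are an assumption here; only the linear interpolation between the two endpoints is taken from the description above.<br />

```python
def optprec_schedule(startoptprec=0.5, minoptprec=0.01, numberofprecreductions=10):
    """Optimization precision values, from startoptprec down to minoptprec."""
    if numberofprecreductions == 0:
        return [startoptprec]  # no reductions: the precision never changes
    step = (startoptprec - minoptprec) / numberofprecreductions
    return [startoptprec - i * step for i in range(numberofprecreductions + 1)]
```

With the defaults, each reduction lowers the precision by 0.049, so the schedule runs 0.5, 0.451, 0.402 and so on down to 0.01.<br />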
===treerejectionthreshold===<br />
treerejectionthreshold (0 to 500, '''50''') – This setting controls which trees have more extensive <br />
branch-length optimization applied to them. All trees created by a branch swap receive <br />
optimization on a few branches that directly took part in the rearrangement. If the difference <br />
in score between the partially optimized tree and the best known tree is greater than treerejectionthreshold, no further optimization is applied to the branches of that tree. <br />
Reducing this value can significantly reduce runtimes, often with little or no effect on results. <br />
However, it is possible that a better tree could be missed if this is set too low. In cases in <br />
which obtaining the very best tree per search is not critical (e.g., bootstrapping), setting this <br />
lower (~20) is probably safe.<br />
<br />
==Settings controlling the proportions of the mutation types== <br />
Each mutation type is assigned a prior ''weight''. These values determine the expected <br />
proportions of the various mutation types that are performed. The primary mutation categories <br />
are ''topology'' (t), ''model'' (m) and ''branch length'' (b). Each is assigned a prior weight ( P<sub>i</sub> ) in the <br />
config file. Each time that a new best likelihood score is attained, the amount of the increase in <br />
score is credited to the mutation type responsible, with the sum of the increases ( S<sub>i</sub> ) maintained <br />
over the last intervallength x intervalstostore generations. The number of times that each <br />
mutation is performed ( N<sub>i</sub> ) is also tallied. The total weight of a mutation type is W<sub>i</sub> = P<sub>i</sub> + ( S<sub>i</sub> / N<sub>i</sub> ). <br />
The proportion of mutations of type i out of all mutations is then <br />
<br />
<big><big><br />
Pr(i) = W<sub>i</sub> / <br />
(W<sub>t</sub> + W<sub>m</sub> + W<sub>b</sub>) <br />
</big></big><br />
<br />
The proportion of each mutation is thus related to its prior weight and the average increase in <br />
score that it has caused over recent generations. The prior weights can be used to control the <br />
expected (and starting) proportions of the mutation types, as well as how sensitive the <br />
proportions are to the course of events in a run. It is generally a good idea to make the topology <br />
prior much larger than the others so that when no mutations are improving the score many <br />
topology mutations are still attempted. If you set '''outputmostlyuselessfiles''' to 1, you can look at the “problog” file to determine what the <br />
proportions of the mutations actually were over the course of a run. <br />
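The weighting scheme described above can be sketched in Python as follows. The function and argument names are hypothetical, and treating S<sub>i</sub> / N<sub>i</sub> as zero when a mutation type has not yet been attempted is an assumption of this sketch, not something stated for GARLI.<br />

```python
def mutation_proportions(prior, score_gain, attempts):
    """Pr(i) = W_i / sum(W), with W_i = P_i + S_i / N_i."""
    weights = {}
    for kind, p in prior.items():
        n = attempts.get(kind, 0)
        # average score increase per attempt; taken as 0 if never attempted (assumption)
        avg_gain = score_gain.get(kind, 0.0) / n if n > 0 else 0.0
        weights[kind] = p + avg_gain
    total = sum(weights.values())
    return {kind: w / total for kind, w in weights.items()}

# Default prior weights, before any mutation has improved the score.
priors = {"topology": 1.0, "model": 0.05, "brlen": 0.2}
print(mutation_proportions(priors, {}, {}))
```

With the default priors and no recorded improvements, topology mutations make up 1.0 / 1.25 = 80% of all mutations, which is why a large topology prior keeps topology moves common even when nothing is improving the score.<br />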
===topoweight (weight on topology mutations)===<br />
topoweight (0 to infinity, '''1.0''') The prior weight assigned to the class of topology mutations <br />
(NNI, SPR and limSPR). Note that setting this to 0.0 turns off topology mutations, meaning that the tree topology is fixed for the run. This used to be a way to have the program estimate only model parameters and branch-lengths, but the '''[[GARLI_Configuration_Settings#optimizeinputonly_.28do_not_search.2C_only_optimize_model_and_branch_lengths_on_user_trees.29 | optimizeinputonly]]''' setting is now a better way to go.<br />
<br />
===modweight (weight on model parameter mutations)===<br />
modweight (0 to infinity, '''0.05''') The prior weight assigned to the class of model mutations. Note <br />
that setting this at 0.0 fixes the model during the run. <br />
===brlenweight (weight on branch-length parameter mutations)===<br />
brlenweight (0 to infinity, '''0.2''') The prior weight assigned to branch-length mutations.<br />
<br />
The same procedure used above to determine the proportion of Topology:Model:Branch-Length <br />
mutations is also used to determine the relative proportions of the three types of topological <br />
mutations (NNI:SPR:limSPR), controlled by the following three weights. Note that the <br />
proportion of mutations applied to each of the model parameters is not user controlled. <br />
===randnniweight (weight on NNI topology changes)===<br />
randnniweight (0 to infinity, '''0.1''') - The prior weight assigned to NNI mutations. <br />
===randsprweight (weight on SPR topology changes)===<br />
randsprweight (0 to infinity, '''0.3''') - The prior weight assigned to random SPR mutations. For <br />
very large datasets it is often best to set this to 0.0, as random SPR mutations essentially <br />
never result in score increases. <br />
===limsprweight (weight on localized SPR topology changes)===<br />
limsprweight (0 to infinity, '''0.6''') - The prior weight assigned to SPR mutations with the <br />
reconnection branch limited to being a maximum of limsprrange branches away from where <br />
the branch was detached. <br />
===intervallength===<br />
intervallength (10 to 1000, '''100''') – The number of generations in each interval during which the <br />
number and benefit of each mutation type are stored. <br />
===intervalstostore===<br />
intervalstostore = (1 to 10, '''5''') – The number of intervals to be stored. Thus, records of <br />
mutations are kept for the last (intervallength x intervalstostore) generations. Every <br />
intervallength generations the probabilities of the mutation types are updated by the scheme <br />
described above.<br />
<br />
==Settings controlling mutation details==<br />
===limsprrange (max range for localized SPR topology changes)===<br />
limsprrange (0 to infinity, '''6''') – The maximum number of branches away from its original <br />
location that a branch may be reattached during a limited SPR move. Setting this too high (> <br />
10) can seriously degrade performance, but if you do so in conjunction with a large increase in genthreshfortopoterm you might end up with better trees.<br />
===meanbrlenmuts (mean # of branch lengths to change per mutation)===<br />
meanbrlenmuts (1 to # taxa, '''5''') - The mean of the binomial distribution from which the number <br />
of branch lengths mutated is drawn during a branch length mutation. <br />
===gammashapebrlen (magnitude of branch-length mutations)===<br />
gammashapebrlen (50 to 2000, '''1000''') - The shape parameter of the gamma distribution (with a <br />
mean of 1.0) from which the branch-length multipliers are drawn for branch-length <br />
mutations. Larger numbers cause smaller changes in branch lengths. (Note that this has <br />
nothing to do with gamma rate heterogeneity.) <br />
===gammashapemodel (magnitude of model parameter mutations)===<br />
gammashapemodel (50 to 2000, '''1000''') - The shape parameter of the gamma distribution (with a <br />
mean of 1.0) from which the model mutation multipliers are drawn for model parameters <br />
mutations. Larger numbers cause smaller changes in model parameters. (Note that this has <br />
nothing to do with gamma rate heterogeneity.) <br />
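To see why larger shape values give smaller changes, note that a gamma distribution with shape k and mean 1.0 has variance 1/k. The Python sketch below draws such multipliers with the standard library; the function name is hypothetical and this is only an illustration of the distribution, not GARLI's implementation.<br />

```python
import random

def draw_multiplier(shape, rng=random):
    """Draw a multiplier from a gamma distribution with the given shape and mean 1.0."""
    # random.gammavariate takes (shape, scale); scale = 1/shape gives mean 1.0
    return rng.gammavariate(shape, 1.0 / shape)

rng = random.Random(42)
for shape in (50, 1000, 2000):
    draws = [draw_multiplier(shape, rng) for _ in range(5)]
    print(shape, [round(d, 3) for d in draws])
```

The standard deviation of the multiplier is 1/sqrt(shape), or about 0.14 for a shape of 50 and about 0.03 for the default of 1000.<br />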
===uniqueswapbias (relative weight assigned to already attempted branch swaps)===<br />
uniqueswapbias (0.01 to 1.0, '''0.1''') – With version 0.95 and later, GARLI keeps track of which branch <br />
swaps it has attempted on the current best tree. Because swaps are applied randomly, it is <br />
possible that some swaps are tried twice before others are tried at all. This option allows the <br />
program to bias the swaps applied toward those that have not yet been attempted. Each swap <br />
is assigned a relative weight depending on the number of times that it has been attempted on <br />
the current best tree. This weight is equal to (uniqueswapbias) raised to the (# times swap <br />
attempted) power. In other words, a value of 0.5 means that swaps that have already been <br />
tried once will be half as likely as those not yet attempted, swaps attempted twice will be ¼ <br />
as likely, etc. A value of 1.0 means no biasing. If this value is not equal to 1.0 and the <br />
outputmostlyuselessfiles option is on, a file called <ofprefix>.swap.log is output. This file <br />
shows the total number of rearrangements tried and the number of unique ones over the course <br />
of a run. Note that this bias is only applied to NNI and limSPR rearrangements. Use of this <br />
option may allow the use of somewhat larger values of limsprrange. <br />
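The relative weight described above is a simple power, as in this small Python sketch (the function name is hypothetical):<br />

```python
def unique_swap_weight(uniqueswapbias, times_attempted):
    """Relative weight of a swap already attempted times_attempted times."""
    return uniqueswapbias ** times_attempted

# With a bias of 0.5: untried swaps weigh 1.0, swaps tried once 0.5, twice 0.25.
for n in range(4):
    print(n, unique_swap_weight(0.5, n))
```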
===distanceswapbias (relative weight assigned to branch swaps based on locality)===<br />
distanceswapbias (0.1 to 10, '''1.0''') – This option is similar to uniqueswapbias, except that it <br />
biases toward certain swaps based on the topological distance between the initial and <br />
rearranged trees. The distance is measured as for limsprrange, and is half the <br />
Robinson-Foulds distance between the trees. As with uniqueswapbias, distanceswapbias <br />
assigns a relative weight to each potential swap. In this case the weight is <br />
(distanceswapbias) raised to the (reconnection distance - 1) power. Thus, given a value of <br />
0.5, the weight of an NNI is 1.0, the weight of an SPR with distance 2 is 0.5, with distance 3 <br />
is 0.25, etc. Note that values less than 1.0 bias toward more localized swaps, while values <br />
greater than 1.0 bias toward more extreme swaps. Also note that this bias is only applied to <br />
limSPR rearrangements. Be careful in setting this, as extreme values can have a very large <br />
effect.</div>Zwicklhttps://molevol.mbl.edu/index.php?title=GARLI_Configuration_Settings&diff=4380GARLI Configuration Settings2015-07-21T16:18:39Z<p>Zwickl: </p>
<hr />
<div>==Descriptions of GARLI configuration settings==<br />
The format for these configuration settings descriptions is generally:<br />
'''entryname''' (possible values, '''default value in bold''') – description<br />
<br />
==General settings==<br />
===datafname (file containing sequence dataset)=== <br />
'''datafname''' = (filename) –<br />
Name of the file containing the aligned sequence data. Formats accepted are <br />
PHYLIP, NEXUS and FASTA. Robust reading of Nexus formatted <br />
datasets is done using the Nexus Class Library. This accommodates things such as interleaved <br />
alignments and exclusion sets (exset’s) in NEXUS assumptions blocks (see the FAQ for an <br />
example of [[FAQ#Can_I_specify_alignment_columns_of_my_data_matrix_to_be_excluded.3F | exset usage]]). Use of NEXUS files is recommended, and is required for partitioned models.<br />
<br />
===constraintfile (file containing constraint definition)===<br />
'''constraintfile''' = (filename, '''none''') –<br />
Name of the file containing any topology constraint specifications, or “none” if <br />
there are no constraints. The easiest way to explain the format of the constraint file is by <br />
example. Consider a dataset of 8 taxa, in which your constraint consists of grouping taxa 1, 3 <br />
and 5. You may specify either positive constraints (inferred tree MUST contain constrained <br />
group) or negative constraints (also called converse constraints, inferred tree CANNOT <br />
contain constrained group). These are specified with either a ‘+’ or a ‘-’ at the beginning of <br />
the constraint specification, for positive and negative constraints, respectively. <br />
*For a positive constraint on a grouping of taxa 1, 3 and 5: <br />
+((1,3,5),2,4,6,7,8); <br />
*For a negative constraint on a grouping of taxa 1, 3 and 5: <br />
-((1,3,5),2,4,6,7,8); <br />
*Note that there are many other equivalent parenthetical representations of these constraints. <br />
*Multiple groups may be positively constrained, but currently only a single negatively constrained group is allowed. <br />
*Multiple constrained groupings may be specified in a single string: <br />
+((1,3,5),2,4,(6,7),8); <br />
or in two separate strings on successive lines: <br />
+((1,3,5),2,4,6,7,8); <br />
+(1,3,5,2,4,(6,7),8); <br />
*Constraint strings may also be specified in terms of taxon names (matching those used in the alignment) instead of numbers. <br />
*Positive and negative constraints cannot be mixed. <br />
*GARLI also accepts another constraint format that may be easier to use in some cases. This involves specifying a single branch to be constrained with a string of ‘*’ (asterisk) and ‘.’ (period) characters, with one character per taxon. Each taxon specified with a ‘*’ falls on one side of the constrained branch, and all those specified with a ‘.’ fall on the other. This should be familiar to anyone who has looked at PAUP* bootstrap output. With this format, a positive constraint on a grouping of taxa 1, 3 and 5 would look like this: <br />
+*.*.*... <br />
or equivalently like this: <br />
+.*.*.*** <br />
With this format each line only designates a single branch, so multiple constrained branches must be specified as multiple lines in the file. <br />
*The program also allows “backbone” constraints, which are simply constraints that do not include all of the taxa. For example if one wanted to constrain (1,3,5) but didn’t care where taxon 6 appeared in the tree, this string could be used: <br />
+((1,3,5),2,4,7,8); <br />
:Nothing special needs to be done to identify this as a backbone constraint, simply leave out some taxa.<br />
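For the asterisk and period format, a small helper can build the string from a list of taxon numbers. This Python sketch is only a convenience for writing constraint files by hand; the function name is hypothetical and it is not part of GARLI.<br />

```python
def branch_constraint(ntaxa, group, positive=True):
    """Build a one-branch constraint string with one character per taxon."""
    members = set(group)
    if not all(1 <= t <= ntaxa for t in members):
        raise ValueError("taxon numbers must be between 1 and ntaxa")
    sign = "+" if positive else "-"
    # '*' marks taxa on one side of the constrained branch, '.' the other side
    return sign + "".join("*" if t in members else "." for t in range(1, ntaxa + 1))

print(branch_constraint(8, [1, 3, 5]))        # same branch as in the example above
print(branch_constraint(8, [2, 4, 6, 7, 8]))  # the equivalent complementary form
```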
<br />
===streefname (source of starting tree and/or model)===<br />
'''streefname''' = (random, <u>'''stepwise'''</u>, <filename>) – Specifies where the starting tree topology and/or <br />
model parameters will come from. The tree topology may be a completely random topology <br />
(constraints will be enforced), a tree provided by the user in a file, <u>or a tree generated by the <br />
program using a fast ML stepwise-addition algorithm (see [[GARLI_Configuration_Settings#attachmentspertaxon_.28control_creation_of_stepwise_addition_starting_tree.29|attachmentspertaxon]] below)</u>. <br />
Starting or fixed model parameter values may also be provided in the specified file, with or <br />
without a tree topology. Some notes on starting trees/models: <br />
*Specified starting trees may have polytomies, which will be arbitrarily resolved before the run begins.<br />
*Starting tree formats: <br />
**Plain newick tree string (with taxon numbers or names, with or without branch lengths) <br />
**NEXUS trees block. The trees block can appear in the same file as a NEXUS data or characters block that contains the alignment, although the same filename should then be specified on both the datafname and streefname lines.<br />
*If multiple trees appear in the specified file and multiple search replicates are specified (see '''[[GARLI_Configuration_Settings#searchreps_.28number_of_independent_search_replicates.29|searchreps]]''' setting), then the first tree is used in the first replicate, the second in the second replicate, etc. <br />
*Providing model parameter values: see this page '''[[Specifying_model_parameter_values | Specifying model parameter values]]'''<br />
*See also the FAQ items on model parameters '''[[FAQ#Model_parameters | here]]'''.<br />
<br />
===attachmentspertaxon (control creation of stepwise addition starting tree)===<br />
'''attachmentspertaxon''' = (1 to infinity, '''50''') – The number of attachment branches evaluated for each taxon to be added to the tree during the creation of an ML stepwise-addition starting tree. Briefly, stepwise addition is an algorithm used to make a tree, and involves adding taxa in a random order to a growing tree. For each taxon to be added, a number of randomly chosen attachment branches are tried and scored, and then the best scoring one is chosen as the location of that taxon. The attachmentspertaxon setting controls how many attachment points are evaluated for each taxon to be added. A value of one is equivalent to a completely random tree (only one randomly chosen location is evaluated). A value of greater than 2 times the number of taxa in the dataset means that all attachment points will be evaluated for each taxon, and will result in very good starting trees (but may take a while on large datasets). Even fairly small values (< 10) can result in starting trees that are much, much better than random, but still fairly different from one another. <br />
===ofprefix (output filename prefix)===<br />
'''ofprefix''' = (text) – Prefix of all output filenames, such as log, treelog, etc. Change this for each run that you do or the program will overwrite previous results. <br />
===randseed (random number seed)===<br />
'''randseed''' = (-1 or positive integers, '''-1''') – The random number seed used by the random number generator. Specify “–1” to have a seed chosen for you. Specifying the same seed number in multiple runs will give exactly identical results, if all other parameters and settings are also identical. <br />
===availablememory (control maximum program memory usage)===<br />
'''availablememory''' – Typically this is the amount of available physical memory on the system, in megabytes. This lets GARLI determine how much system memory it may be able to use to store computations for reuse. The program will be conservative and use at most about 90% of this value, although for typical datasets it will use much, much less. If other programs must be open or used when GARLI is running, you may need to reduce this value. This setting can also have a significant effect on performance (speed), but more is not always better. When a run is started, GARLI will output the '''availablememory''' value necessary for that dataset to achieve each of the “memory levels” (from best to worst, termed “great”, “good”, “low”, and “very low”). More memory is generally better because more calculations are stored and can be reused, but when the amount of memory needed for “great” becomes more than 512 megabytes or so, performance can be slowed because the operating system has difficulty managing that much memory. In general, choose an amount of memory that allows “great” when this is less than 512 MB, and if it is greater reduce the amount of memory into “good” or “low” as necessary. Avoid “very low” whenever possible. You can find the value that is approximately optimal for your dataset by setting the '''randseed''' to some <br />
positive value (so that the searches are identical) and doing runs with various '''availablememory''' values. Typically it will make at most a 30% difference, so it isn't worth worrying about too much. <br />
===logevery (frequency to log best score to file)===<br />
'''logevery''' = (1 to infinity, '''10''') – The frequency at which the best score is written to the log file. <br />
===saveevery (frequency to save best tree to file or write checkpoints)===<br />
'''saveevery''' = (1 to infinity, '''100''') – If '''writecheckpoints''' or '''outputcurrentbesttopology''' are specified, this is the frequency (in generations) at which checkpoints or the current best tree are written to file. <br />
===refinestart (whether to optimize a bit before starting a search)===<br />
'''refinestart''' = (0 or 1, '''1''') – Specifies whether some initial rough optimization is performed on the starting branch lengths and rate heterogeneity parameters. This is always recommended. <br />
===outputcurrentbesttopology (continuously write the best tree to file during a run)===<br />
'''outputcurrentbesttopology''' = (0 or 1, '''0''') – If true, the current best tree of the current search replicate is written to <ofprefix>.best.current.tre every '''saveevery''' generations. In versions before 0.96 the current best topology was always written to file, but that is no longer the case. Seeing the current best tree has no real use apart from satisfying your curiosity about how a <br />
run is going.<br />
===outputeachbettertopology (write each improved topology to file)===<br />
'''outputeachbettertopology''' (0 or 1, '''0''') – If true, each new topology encountered with a better score than the previous best is written to file. In some cases this can result in really big files (hundreds of MB) though, especially for random starting topologies on large datasets. Note that this file is interesting to get an idea of how the topology changed as the searches progressed, but the collection of trees should NOT be interpreted in any meaningful way. This option is not available while bootstrapping. <br />
===enforcetermconditions (use automatic termination)===<br />
'''enforcetermconditions''' = (0 or 1, '''1''') – Specifies whether the automatic termination conditions will be used. The conditions specified by both of the following two parameters must be met; see those parameters for their definitions. If this is false, the run will continue until it reaches the time ('''stoptime''') or generation ('''stopgen''') limit. It is highly recommended that this option be used! <br />
===genthreshfortopoterm (number of generations without topology improvement required for termination) ===<br />
'''genthreshfortopoterm''' = (1 to infinity, '''20,000''') – This specifies the first part of the termination condition. When no new significantly better scoring topology (see significanttopochange below) has been encountered in greater than this number of generations, this condition is met. Increasing this parameter may improve the lnL scores obtained (especially on large datasets), but will also increase runtimes. <br />
===scorethreshforterm (max score improvement over recent generations required for termination)===<br />
'''scorethreshforterm''' = (0 to infinity, '''0.05''') – The second part of the termination condition. When the total improvement in score over the last '''intervallength''' x '''intervalstostore''' generations (default is 500 generations, see below) is less than this value, this condition is met. This does not usually need to be changed. <br />
===significanttopochange (required score improvement for topology to be considered better)===<br />
'''significanttopochange''' = (0 to infinity, '''0.01''') – The lnL increase required for a new topology to be <br />
considered significant as far as the termination condition is concerned. It probably doesn’t <br />
need to be played with, but you might try increasing it slightly if your runs reach a stable <br />
score and then take a very long time to terminate due to very minor changes in topology. <br />
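Taken together, the automatic termination test amounts to two conditions that must both hold. A minimal Python sketch, assuming the caller tracks the two running quantities (the function and its first two argument names are hypothetical):<br />

```python
def should_terminate(gens_since_topo_improvement, recent_score_gain,
                     genthreshfortopoterm=20000, scorethreshforterm=0.05):
    """True when both automatic termination conditions are met.

    gens_since_topo_improvement: generations since the last topology whose lnL
        improvement was at least significanttopochange.
    recent_score_gain: total lnL improvement over the last
        intervallength x intervalstostore generations.
    """
    topology_stalled = gens_since_topo_improvement > genthreshfortopoterm
    score_stalled = recent_score_gain < scorethreshforterm
    return topology_stalled and score_stalled
```

Raising significanttopochange makes it harder for a minor topology change to reset the first counter, which is why it can shorten runs that have reached a stable score.<br />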
===outputphyliptree (write trees to file in Phylip as well as Nexus format)===<br />
'''outputphyliptree''' = (0 or 1, '''0''') – Whether phylip formatted tree files will be output in addition to <br />
the default nexus files for the best tree across all replicates (<ofprefix>.best.phy), the best <br />
tree for each replicate (<ofprefix>.best.all.phy) or, in the case of bootstrapping, the best tree <br />
for each bootstrap replicate (<ofprefix>.boot.phy). <br />
===outputmostlyuselessfiles (output uninteresting files)===<br />
'''outputmostlyuselessfiles''' = (0 or 1, '''0''') – Whether to output three files of little general interest: the <br />
“fate”, “problog” and “swaplog” files. The fate file shows the parentage, mutation types and <br />
scores of every individual in the population during the entire search. The problog shows how <br />
the proportions of the different mutation types changed over the course of the run. The <br />
swaplog shows the number of unique swaps and the number of total swaps on the current <br />
best tree over the course of the run. <br />
===writecheckpoints (write checkpoint files during run)===<br />
'''writecheckpoints''' (0 or 1, '''0''') – Whether to write three files to disk containing all information <br />
about the current state of the population every saveevery generations, with each successive <br />
checkpoint overwriting the previous one. These files can be used to restart a run at the last <br />
written checkpoint by setting the '''restart''' configuration entry.<br />
===restart (restart run from checkpoint)===<br />
'''restart''' = (0 or 1, '''0''') – Whether to restart at a previously saved checkpoint. To use this option the '''writecheckpoints''' option must have been used during a previous run. The program will look for checkpoint files that are named based on the ofprefix of the previous run. If you intend to restart a run, NOTHING should be changed in the config file except setting '''restart''' to 1. <br />
A run that is restarted from checkpoint will give ''exactly'' the same results it would have if the run had gone to completion. <br />
===outgroup (orient inferred trees consistently)===<br />
'''outgroup''' = (outgroup taxa numbers, separated by spaces) – This option allows for orienting the tree topologies in a consistent way when they are written to file. Note that this has NO effect whatsoever on the actual inference and the specified outgroup is NOT constrained to be present in the inferred trees. If multiple outgroup taxa are specified and they do not form a monophyletic group in the inferred tree, this setting will be ignored. If you specify a single outgroup taxon it will always be present, and the tree will always be consistently oriented. Ranges can be indicated with a hyphen. e.g., to specify an outgroup consisting of taxa 1, 2, 3 and 5 the format is this: <br />
outgroup = 1-3 5<br />
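The range notation expands as one would expect. The Python sketch below shows how such a value could be interpreted; the function name is hypothetical and this is not GARLI's own parser.<br />

```python
def parse_outgroup(value):
    """Expand an outgroup value such as '1-3 5' into a list of taxon numbers."""
    taxa = []
    for token in value.split():
        if "-" in token:
            first, last = (int(x) for x in token.split("-"))
            taxa.extend(range(first, last + 1))  # hyphenated ranges are inclusive
        else:
            taxa.append(int(token))
    return taxa

print(parse_outgroup("1-3 5"))
```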
<br />
===searchreps (number of independent search replicates)===<br />
'''searchreps''' = (1 to infinity, '''2''') – The number of independent search replicates to perform during a program execution. You should always either do multiple search replicates or multiple program executions with any dataset to get a feel for whether you are getting consistent results, which suggests that the program is doing a decent job of searching. Note that if this is > 1 and you are performing a bootstrap analysis, this is the number of search replicates to be done per bootstrap replicate. That can increase the chance of finding the best tree per bootstrap replicate, but will also increase bootstrap runtimes enormously.<br />
===bootstrapreps (number of bootstrap replicates)===<br />
bootstrapreps (0 to infinity, '''0''') - The number of bootstrap reps to perform. If this is greater than <br />
0, normal searching will not be performed. The resulting bootstrap trees (one per rep) will be <br />
output to a file named <ofprefix>.boot.tre. To obtain the bootstrap proportions they will then <br />
need to be read into PAUP* or a similar program to obtain a majority rule consensus. Note <br />
that it is probably safe to reduce the strictness of the termination conditions during <br />
bootstrapping (perhaps halve genthreshfortopoterm), which will greatly speed up the <br />
bootstrapping process with negligible effects on the results.<br />
<br />
Note that GARLI does NOT do the bootstrap consensus or calculate the bootstrap proportions itself. It simply infers the trees that are the input to that consensus. The trees in the .boot.tre file will need to be read into a program such as SumTrees (or PAUP* or CONSENSE) to do a majority-rule consensus and obtain your bootstrap support values. See the following part of the "Advanced topics" page for suggestions on making the consensus: '''[[Advanced_topics#Detailed_Example:_A_bootstrap_analysis|Detailed Example: A bootstrap analysis]]'''.<br />
<br />
===resampleproportion (relative size of re-sampled data matrix)===<br />
resampleproportion (0.1 to 10, '''1.0''' ) – When bootstrapreps > 0, this setting allows for <br />
bootstrap-like resampling, but with the pseudoreplicate datasets having a number of <br />
alignment columns different from that of the real data. Setting values < 1.0 is somewhat similar to jackknifing, but not identical. <br />
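<br />
As a rough illustration of what this setting changes, the sketch below draws column indices with replacement for a pseudoreplicate whose size is scaled by the proportion. The function name and the rounding rule are assumptions made for the example, not GARLI's actual code.<br />

```python
import random

def resample_columns(num_columns, resample_proportion=1.0, rng=None):
    """Draw alignment column indices with replacement.

    The pseudoreplicate has round(num_columns * resample_proportion)
    columns; the rounding rule is an assumption, not taken from GARLI.
    """
    rng = rng or random.Random()
    size = max(1, round(num_columns * resample_proportion))
    return [rng.randrange(num_columns) for _ in range(size)]

# A 1000-column alignment resampled to half and to twice its length.
half = resample_columns(1000, 0.5, random.Random(1))
double = resample_columns(1000, 2.0, random.Random(1))
```

Because columns are still drawn with replacement, a proportion below 1.0 is not the same as jackknifing, which deletes columns without duplicating any.<br />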
===inferinternalstateprobs (infer ancestral states)===<br />
inferinternalstateprobs = (0 or 1, '''0''') – Specify 1 to have GARLI infer the marginal posterior <br />
probability of each character at each internal node. This is done at the very end of the run, <br />
just before termination. The results are output to a file named <ofprefix>.internalstates.log.<br />
<br />
===outputsitelikelihoods (write a file with the log-likelihood of each site)===<br />
<br />
'''outputsitelikelihoods''' = (0 or 1, '''0''') - Causes a file named <ofprefix>.sitelikes.log to be created containing the sitewise log-likelihood values for each individual site for the final tree of each search replicate. Note that the format of the file is exactly the same as that output by PAUP, so it can be directly used in CONSEL using the PAUP file-format option. For partitioned models the sites are listed in sets by partition subset, but the site numbers will match those in the original datamatrix. Thus, when using a partitioned model that divides sites by codon position, the ordering would be 1, 4, 7, ... 2, 5, 8, ... 3, 6, 9 .... Note that this makes no difference to CONSEL, which just assumes that the sitelikelihoods are ordered the same for each tree and ignores the site numbers. <br />
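<br />
The site ordering described above can be reproduced with a short sketch. It assumes a partition made exactly by codon position, with the subsets listed in the order first, second, third; the function name is hypothetical.<br />

```python
def sitelike_order(num_sites, num_subsets=3):
    """Original site numbers (1-based), grouped by codon-position subset."""
    return [site
            for start in range(1, num_subsets + 1)
            for site in range(start, num_sites + 1, num_subsets)]

# A 9-site alignment partitioned by codon position.
order = sitelike_order(9)
```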
<br />
Also note that using the site likelihoods from a partitioned model is a violation of CONSEL's assumptions, since it will not know about the partitions and its internal bootstrap resampling will not obey them. It isn't clear what the effects of this will be on the various tests. <br />
<br />
Finally, if you want to calculate the site likelihoods for a set of trees that you provide, use the '''[[GARLI_Configuration_Settings#optimizeinputonly_.28do_not_search.2C_only_optimize_model_and_branch_lengths_on_user_trees.29 | optimizeinputonly]]''' setting.<br />
<br />
===optimizeinputonly (do not search, only optimize model and branch lengths on user trees)===<br />
(new in version 2.0)<br />
<br />
'''optimizeinputonly''' = (0 or 1, '''0''') - Requires a NEXUS formatted input tree file (only NEXUS format will work!), specified through the '''[[GARLI_Configuration_Settings#streefname_.28source_of_starting_tree_and.2For_model.29 | streefname]]''' setting. All trees in that file will have model parameters and branch lengths optimized, giving their maximum likelihood scores. A file with the site-likelihoods of each tree will also be output. See the '''[[GARLI_Configuration_Settings#outputsitelikelihoods_.28write_a_file_with_the_log-likelihood_of_each_site.29 | outputsitelikelihoods]]''' setting for details.<br />
<br />
===collapsebranches (collapse zero length branches before writing final trees to file)===<br />
<br />
'''collapsebranches''' = (0 or 1, '''1''') - Before version 1.0, all trees that were returned were fully resolved. This was true even if the maximum-likelihood estimates of some internal branch lengths were effectively zero (or GARLI's minimum, which is 1e-8). In such cases, collapsing the branch into a polytomy would be a better representation. Note that GARLI will never return a tree with an actual branch length of zero, but rather with its minimum value of 1.0e-8. The drawback of always returning fully resolved trees is that what is effectively a polytomy can be resolved in three ways, and different independent searches may randomly return one of those resolutions. Thus, if you compare the trees by topology only, they will look different. If you pay attention to the branch lengths and likelihood scores of the trees it will be apparent that they are effectively the same.<br />
<br />
I think that collapsing of branches is particularly important when bootstrapping, since no support should be given to a branch that doesn't really exist, i.e., that is a random resolution of a polytomy. Collapsing is also good when calculating tree to tree distances such as the symmetric tree distance, for example when calculating phylogenetic error to a known target tree. Zero-length branches would add to the distances (~error) although they really should not.<br />
<br />
==Model specification settings==<br />
<br />
With version 1.0 and later there are now many more options dealing with model specification because of <br />
the inclusion of amino acid and codon-based models. The description of the settings will be <br />
broken up by data type. Note that in terms of the model settings in GARLI, “empirical” means <br />
to fix parameter values at those observed in the dataset being analyzed, and “fixed” means to fix <br />
them at user specified values. See the '''[[GARLI_Configuration_Settings#streefname_.28source_of_starting_tree_and.2For_model.29|streefname]]''' setting for details on how to provide <br />
parameter values to be fixed during inference.<br />
<br />
'''PARTITIONED MODELS: To use partitioned models you'll still need to use the same basic model settings detailed below for each individual partition subset, and then some other details to set up the partition model itself. Be sure that you are familiar with the rest of this section, then see [[Using_partitioned_models | Using partitioned models]].'''<br />
<br />
===datatype (sequence type and inference model)=== <br />
'''datatype''' = ('''nucleotide''', aminoacid, codon-aminoacid, codon, standard, standardordered, standardvariable, standardvariableordered) – The type of data and model that is <br />
to be used during tree inference. Nucleotide and amino acid data are self explanatory.<br />
<br />
The codon-aminoacid datatype means that the data will be <br />
supplied as a nucleotide alignment, but will be internally translated and analyzed using an <br />
amino acid model. The codon and codon-aminoacid datatypes require nucleotide sequence <br />
that is aligned in the correct reading frame. In other words, all gaps in the alignment should <br />
be a multiple of 3 in length, and the alignment should start at the first position of a codon. If <br />
the alignment has extra columns at the start, middle or end, they should be removed or <br />
excluded with a Nexus exset (see '''[[FAQ#Can_I_specify_alignment_columns_of_my_data_matrix_to_be_excluded.3F | this FAQ item]]''' for an example of exset usage). The correct <br />
'''[[GARLI_Configuration_Settings#geneticcode_.28code_to_use_in_codon_translation.29 | geneticcode]]''' must also be set.<br />
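<br />
A quick pre-check of the reading-frame requirements above might look like the following sketch. It tests only what the text states (total length and gap runs in multiples of 3) and assumes "-" is the gap character; it cannot tell whether the alignment really starts at the first position of a codon. The function name is hypothetical.<br />

```python
import re

def frame_problems(sequence):
    """List reasons one aligned nucleotide sequence may be out of frame."""
    problems = []
    if len(sequence) % 3 != 0:
        problems.append("alignment length %d is not a multiple of 3"
                        % len(sequence))
    # Every run of gap characters should span whole codons.
    for match in re.finditer(r"-+", sequence):
        if len(match.group()) % 3 != 0:
            problems.append("gap of length %d at column %d"
                            % (len(match.group()), match.start() + 1))
    return problems

clean = frame_problems("ATG---GCA")
broken = frame_problems("ATG--GCAA")
```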
<br />
(New in Version 2.0)<br />
<br />
The various "standard" datatypes are new in GARLI 2.0. These represent morphology-like discrete characters, with any number of states. These are also termed the "Mk" and "Mkv" models by Lewis (2001). See this page for more details on using standard data: '''[[Mkv morphology model]]'''.<br />
<br />
===Settings for datatype = nucleotide=== <br />
====ratematrix (relative rate parameters assumed by substitution model)====<br />
'''ratematrix''' = (1rate, 2rate, '''6rate''', fixed, custom string) – The number of relative substitution rate <br />
parameters (note that the number of free parameters is this value minus one). Equivalent to <br />
the “nst” setting in PAUP* and MrBayes. 1rate assumes that substitutions between all pairs <br />
of nucleotides occur at the same rate (JC model), 2rate allows different rates for transitions and <br />
transversions (K2P or HKY models), and 6rate allows a different rate between each nucleotide pair (GTR). These rates are <br />
estimated unless the fixed option is chosen. Since version 0.96, parameters for any <br />
submodel of the GTR model may be estimated. The format for specifying this is very <br />
similar to that used in the “rclass” setting of PAUP*. Within parentheses, six letters are <br />
specified, with spaces between them. The six letters represent the rates of substitution <br />
between the six pairs of nucleotides, with the order being A-C, A-G, A-T, C-G, C-T and G-T. <br />
Letters within the parentheses that are the same mean that a single parameter is shared by <br />
multiple nucleotide pairs. For example, <br />
ratematrix = (a b a a b a) <br />
would specify the HKY 2-rate model (equivalent to ratematrix = 2rate). This entry, <br />
ratematrix = (a b c c b a) <br />
would specify 3 estimated rates of substitution, with one rate shared by A-C and G-T <br />
substitutions, another rate shared by A-G and C-T substitutions, and the final rate shared by <br />
A-T and C-G substitutions.<br />
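<br />
To make the custom string format concrete, here is a small sketch that groups the six nucleotide pairs by the letter assigned to them. The parser and its names are hypothetical and only mirror the description above; GARLI's own parsing may differ in details such as error handling.<br />

```python
# Order of nucleotide pairs in a custom ratematrix string.
PAIRS = ("A-C", "A-G", "A-T", "C-G", "C-T", "G-T")

def parse_ratematrix(spec):
    """Map each rate letter to the nucleotide pairs that share it."""
    letters = spec.strip().strip("()").split()
    if len(letters) != 6:
        raise ValueError("expected six letters, got %d" % len(letters))
    groups = {}
    for pair, letter in zip(PAIRS, letters):
        groups.setdefault(letter, []).append(pair)
    return groups

hky_like = parse_ratematrix("(a b a a b a)")
three_rate = parse_ratematrix("(a b c c b a)")
# Number of free parameters is the number of distinct rates minus one.
free_parameters = len(three_rate) - 1
```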
<br />
====statefrequencies (equilibrium base frequencies assumed by substitution model)====<br />
'''statefrequencies''' = (equal, empirical, '''estimate''', fixed) – Specifies how the equilibrium state <br />
frequencies (A, C, G and T) are treated. The empirical setting fixes the frequencies at their <br />
observed proportions, and the other options should be self-explanatory. <br />
===For datatype = nucleotide or aminoacid===<br />
====invariantsites (treatment of proportion of invariable sites parameter)====<br />
'''invariantsites''' = (none, '''estimate''', fixed) – Specifies whether a parameter representing the <br />
proportion of sites that are unable to change (i.e. have a substitution rate of zero) will be <br />
included. This is typically referred to as “invariant sites”, but would better be termed <br />
“invariable sites”. <br />
====ratehetmodel (type of rate heterogeneity to assume for variable sites)====<br />
'''ratehetmodel''' = (none, '''gamma''', gammafixed) – The model of rate heterogeneity assumed. <br />
“gammafixed” requires that the alpha shape parameter is provided, and a setting of “gamma” <br />
estimates it. <br />
====numratecats (number of overall substitution rate categories)==== <br />
'''numratecats''' = (1 to 20, '''4''') – The number of categories of variable rates (not including the <br />
invariant site class if it is being used). Must be set to 1 if '''ratehetmodel''' is set to none. Note <br />
that runtimes and memory usage scale linearly with this setting. <br />
===For datatype = aminoacid or codon-aminoacid===<br />
Amino acid analyses are typically done using fixed rate matrices that have been estimated on <br />
large datasets and published. Typically the only model parameters that are estimated during tree <br />
inference relate to the rate heterogeneity distribution. Each of the named matrices also has <br />
corresponding fixed amino acid frequencies, and a given rate matrix can either be used with those <br />
frequencies or with the amino acid frequencies observed in your dataset. This second option is <br />
often denoted as “+F” in a model description, although in terms of the GARLI configuration <br />
settings this is referred to as “empirical” frequencies. In GARLI the Dayhoff model would be <br />
specified by setting both the ratematrix and statefrequencies options to “dayhoff”. The <br />
Dayhoff+F model would be specified by setting the ratematrix to “dayhoff”, and <br />
statefrequencies to “empirical”.<br />
<br />
The following named amino acid models are implemented: <br />
{| border="1"<br />
|-<br />
!'''ratematrix'''/'''statefrequencies''' setting !! reference <br />
|-<br />
|align="center" | dayhoff || Dayhoff, Schwartz and Orcutt. 1978.<br />
|-<br />
|align="center" | jones || Jones, Taylor and Thornton (JTT), 1992. <br />
|-<br />
|align="center" | WAG || Whelan and Goldman, 2001.<br />
|-<br />
|align="center" | mtREV || Adachi and Hasegawa, 1996.<br />
|-<br />
|align="center" | mtmam || Yang, Nielsen and Hasegawa, 1998.<br />
|}<br />
Note that most programs allow either the use of a named rate matrix and its corresponding state <br />
frequencies, or a named rate matrix and empirical frequencies. GARLI technically allows the <br />
mixing of different named matrices and equilibrium frequencies (for example, wag matrix with <br />
jones equilibrium frequencies), but this is not recommended. <br />
====ratematrix (amino acid substitution rates)====<br />
'''ratematrix''' = (poisson, jones, dayhoff, wag, mtmam, mtrev) – The fixed amino acid rate matrix to <br />
use. You should use the matrix that gives the best likelihood, and could use a program like <br />
PROTTEST (very much like MODELTEST, but for amino acid models) to determine which <br />
fits best for your data. Poisson assumes a single rate of substitution between all amino acid <br />
pairs, and is a very poor model.<br />
====statefrequencies (equilibrium amino acid frequencies assumed by substitution model)====<br />
'''statefrequencies''' = (equal, empirical, estimate, fixed, jones, dayhoff, wag, mtmam, mtrev) – <br />
Specifies how the equilibrium state frequencies of the 20 amino acids are treated. The <br />
“empirical” option fixes the frequencies at their observed proportions (when describing a <br />
model this is often termed “+F”). <br />
===For datatype = codon===<br />
The codon models are built with three components: (1) parameters describing the process of <br />
individual nucleotide substitutions, (2) equilibrium codon frequencies, and (3) parameters <br />
describing the relative rate of nonsynonymous to synonymous substitutions. The nucleotide <br />
substitution parameters within the codon models are exactly the same as those possible with <br />
standard nucleotide models in GARLI, and are specified with the ratematrix configuration <br />
entry. Thus, they can be of the 2rate variety (inferring different rates for transitions and <br />
transversions, K2P or HKY-like), the 6rate variety (inferring different rates for all nucleotide <br />
pairs, GTR-like) or any other sub-model of GTR. The options for codon frequencies are <br />
specified with the statefrequencies configuration entry. The options are to use equal <br />
frequencies (not a good option), the frequencies observed in your dataset (termed “empirical” in <br />
GARLI), or the codon frequencies implied by the “F1x4” or “F3x4” methods (using PAML's <br />
terminology). These last two options calculate the codon frequencies as the product of the <br />
frequencies of the three nucleotides that make up each codon. In the “F1x4” case the nucleotide <br />
frequencies are those observed in the dataset across all codon positions, while the “F3x4” option <br />
uses the nucleotide frequencies observed in the data at each codon position separately. The final <br />
component of the codon models is the nonsynonymous to synonymous relative rate parameters <br />
(aka dN/dS or omega parameters). The default is to infer a single dN/dS value. Alternatively, a <br />
model can be specified that infers a given number of dN/dS categories, with the dN/dS values <br />
and proportions falling in each category estimated (ratehetmodel = nonsynonymous). This is <br />
the “discrete” or “M3” model in PAML's terminology. <br />
====ratematrix (relative ''nucleotide'' rate parameters assumed by codon model)====<br />
'''ratematrix''' = (1rate, '''2rate''', 6rate, fixed, custom string) – This determines the relative rates of <br />
nucleotide substitution assumed by the codon model. The options are exactly the same as <br />
those allowed under a normal nucleotide model. A codon model with '''ratematrix''' = 2rate <br />
specifies the standard Goldman and Yang (1994) model, with different substitution rates for <br />
transitions and transversions. <br />
====statefrequencies (equilibrium codon frequencies)====<br />
'''statefrequencies''' = (equal, empirical, f1x4, '''f3x4''') - The options are to use equal codon frequencies <br />
(not a good option), the frequencies observed in your dataset (termed “empirical” in GARLI), <br />
or the codon frequencies implied by the “F1x4” or “F3x4” methods (using PAML's <br />
terminology). These last two options calculate the codon frequencies as the product of the <br />
frequencies of the three nucleotides that make up each codon. In the “F1x4” case the <br />
nucleotide frequencies are those observed in the dataset across all codon positions, while the <br />
“F3x4” option uses the nucleotide frequencies observed in the data at each codon position <br />
separately. <br />
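<br />
The F1x4 and F3x4 calculations can be sketched as follows. The product of nucleotide frequencies follows the description above; excluding the stop codons of the standard code and renormalizing over the remaining 61 codons is an assumption of this example, as is the function name.<br />

```python
from itertools import product

STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code

def codon_frequencies(position_freqs):
    """Codon frequencies as products of nucleotide frequencies.

    position_freqs holds three dicts (one per codon position) mapping
    A, C, G and T to a frequency. For F1x4, pass the same dict three
    times; for F3x4, pass one dict per codon position.
    """
    raw = {}
    for codon in map("".join, product("ACGT", repeat=3)):
        if codon in STOP_CODONS:
            continue
        freq = 1.0
        for position, nucleotide in enumerate(codon):
            freq *= position_freqs[position][nucleotide]
        raw[codon] = freq
    total = sum(raw.values())  # renormalize over sense codons (assumption)
    return {codon: freq / total for codon, freq in raw.items()}

# F1x4 with equal nucleotide frequencies gives equal codon frequencies.
equal = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
f1x4 = codon_frequencies([equal, equal, equal])
```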
====ratehetmodel (variation in dN/dS across sites)==== <br />
'''ratehetmodel''' = ('''none''', nonsynonymous) – For codon models, the default is to infer a single dN/dS <br />
parameter. Alternatively, a model can be specified that infers a given number of dN/dS <br />
categories, with the dN/dS values and proportions falling in each category estimated <br />
('''ratehetmodel''' = nonsynonymous). This is the “discrete” or “M3” model of Yang et al. <br />
(2000). <br />
====numratecats (number of discrete dN/dS categories)====<br />
'''numratecats''' = (1 to 20, '''1''') – When ratehetmodel = nonsynonymous, this is the number of dN/dS parameter <br />
categories.<br />
====invariantsites====<br />
'''invariantsites''' = (none) - '''NOTE:''' Due to an error on my part, the '''invariantsites''' entry must appear with codon models for them to run (despite the fact that it doesn't apply). That is, be sure that this appears:<br />
invariantsites = none.<br />
<br />
===For datatype = codon or codon-aminoacid=== <br />
====geneticcode (code to use in codon translation)====<br />
'''geneticcode''' = ('''standard''', vertmito, invertmito) – The genetic code to be used in translating codons <br />
into amino acids.<br />
<br />
==Population settings==<br />
===nindivs (number of individuals in population)===<br />
'''nindivs''' = (2 to 100, '''4''')- The number of individuals in the population. This may be increased, but <br />
doing so is generally not beneficial. Note that typical genetic algorithms tend to have much, <br />
much larger population sizes than GARLI's defaults. <br />
===holdover (unmutated copies of best individual)===<br />
'''holdover''' = (1 to nindivs-1, '''1''')- The number of times the best individual is copied to the next <br />
generation with no chance of mutation. It is best not to mess with this. <br />
===selectionintensity (strength of selection)===<br />
'''selectionintensity''' = (0.01 to 5.0, '''0.5''')- Controls the strength of selection, with larger numbers <br />
denoting stronger selection. The relative probability of reproduction of two individuals <br />
depends on the difference in their log likelihoods (&Delta;lnL) and is formulated very similarly to <br />
the procedure of calculating Akaike weights. The relative probability of reproduction of the <br />
less fit individual is equal to:<br />
<br />
<big><big><br />
''e''<sup> (-selectionIntensity &times; &Delta; lnL)</sup><br />
</big></big><br />
<br />
In general, this setting does not seem to have much of an effect on the progress of a run. In <br />
theory higher values should cause scores to increase more quickly, but make the search more <br />
likely to be entrapped in a local optimum. Low values will increase runtimes, but may be <br />
more likely to reach the true optimum. The following table gives the relative probabilities of <br />
reproduction for different values of the selection intensity when the difference in log <br />
likelihood is 1.0<br />
{| border="1"<br />
|-<br />
!'''selectionintensity''' value !! Ratio of probabilities of reproduction <br />
|-<br />
|align="center" |0.05 || align="center" |0.95:1.0 <br />
|-<br />
|align="center" |0.1 || align="center" |0.90:1.0 <br />
|-<br />
|align="center" |0.25 || align="center" |0.78:1.0 <br />
|-<br />
|align="center" |0.5 || align="center" |0.61:1.0 <br />
|-<br />
|align="center" |0.75 || align="center" |0.47:1.0 <br />
|-<br />
|align="center" |1.0 || align="center" |0.37:1.0 <br />
|-<br />
|align="center" |2.0 || align="center" |0.14:1.0<br />
|}<br />
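<br />
The ratios in the table follow directly from the formula above. This short sketch recomputes them for a log-likelihood difference of 1.0; the function name is chosen for illustration only.<br />

```python
import math

def relative_reproduction_probability(selection_intensity, delta_lnl=1.0):
    """Reproduction probability of the less fit individual relative to 1.0."""
    return math.exp(-selection_intensity * delta_lnl)

# Recompute the table, rounded to two decimals.
table = {s: round(relative_reproduction_probability(s), 2)
         for s in (0.05, 0.1, 0.25, 0.5, 0.75, 1.0, 2.0)}
```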
===holdoverpenalty (fitness handicap for best individual)===<br />
'''holdoverpenalty''' = (0 to 100, '''0''') – This can be used to bias the probability of reproduction of the <br />
best individual downward. Because the best individual is automatically copied into the next <br />
generation, it has a bit of an unfair advantage and can cause all population variation to be lost <br />
due to genetic drift, especially with small population sizes. The value specified here is <br />
subtracted from the best individual’s lnL score before calculating the probabilities of<br />
reproduction. It seems plausible that this might help maintain variation, but I have not seen it <br />
cause a measurable effect.<br />
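<br />
One way to picture how the penalty interacts with selection is the sketch below. It assumes that reproduction probabilities are the normalized exponential weights described under selectionintensity, and that the penalty is subtracted from the best score before those weights are computed; the function and the normalization are illustrative assumptions, not GARLI's actual code.<br />

```python
import math

def reproduction_probabilities(lnls, selection_intensity=0.5,
                               holdover_penalty=0.0):
    """Normalized reproduction probabilities for a small population."""
    best = max(range(len(lnls)), key=lambda i: lnls[i])
    scores = list(lnls)
    scores[best] -= holdover_penalty  # handicap the best individual
    top = max(scores)
    weights = [math.exp(-selection_intensity * (top - s)) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

# Two individuals whose log likelihoods differ by 1.0.
no_penalty = reproduction_probabilities([-1000.0, -1001.0])
with_penalty = reproduction_probabilities([-1000.0, -1001.0],
                                          holdover_penalty=1.0)
```

With a penalty equal to the score difference, the two individuals become equally likely to reproduce in this toy case.<br />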
<br />
===stopgen (maximum number of generations to run)===<br />
'''stopgen''' – The maximum number of generations to run. Note that this supersedes the automated <br />
stopping criterion (see enforcetermconditions above), and should therefore be set to a very <br />
large value if automatic termination is desired. <br />
===stoptime (maximum time to run)===<br />
'''stoptime''' – The maximum number of seconds for the run to continue. Note that this supersedes <br />
the automated stopping criterion (see enforcetermconditions above), and should therefore <br />
be set to a very large value if automatic termination is desired.<br />
<br />
==Branch-length optimization settings== <br />
After a topological rearrangement, branch lengths in the vicinity of the rearrangement are <br />
optimized by the Newton-Raphson method. Optimization passes are performed on a particular <br />
branch until the expected improvement in likelihood for the next pass is less than a threshold <br />
value, termed the optimization precision. Note that this name is somewhat misleading, as the <br />
precision of the optimization algorithm is inversely related to this value (i.e., smaller values of <br />
the optimization precision lead to more precise optimization). If the improvement in likelihood <br />
due to optimization for a particular branch is greater than the optimization precision, <br />
optimization is also attempted on adjacent branches, spreading out across the tree. When no new <br />
topology with a better likelihood score is discovered for a while, the value is automatically <br />
reduced. The value can have a large effect on speed, with smaller values significantly slowing <br />
down the algorithm. The value of the optimization precision and how it changes over the course <br />
of a run are determined by the following three parameters. <br />
===startoptprec===<br />
startoptprec (0.005 to 5.0, '''0.5''') – The beginning optimization precision. <br />
===minoptprec===<br />
minoptprec (0.001 to startoptprec, '''0.01''') – The minimum allowed value of the optimization precision.<br />
<br />
===numberofprecreductions===<br />
numberofprecreductions (0 to 100, '''10''') – Specify the number of steps that it will take for the <br />
optimization precision to decrease (linearly) from startoptprec to minoptprec. <br />
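<br />
The linear decrease can be written out as a short sketch using the default values. When each reduction is triggered during a run is not modeled here, and the handling of zero reductions is an assumption of the example.<br />

```python
def optimization_precision_schedule(startoptprec=0.5, minoptprec=0.01,
                                    numberofprecreductions=10):
    """Precision values from startoptprec down to minoptprec in equal steps."""
    if numberofprecreductions == 0:
        return [startoptprec]  # assumed: the value never changes
    step = (startoptprec - minoptprec) / numberofprecreductions
    return [startoptprec - i * step
            for i in range(numberofprecreductions + 1)]

# Defaults: 0.5 falls to 0.01 in ten steps of 0.049.
schedule = optimization_precision_schedule()
```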
===treerejectionthreshold===<br />
treerejectionthreshold (0 to 500, '''50''') – This setting controls which trees have more extensive <br />
branch-length optimization applied to them. All trees created by a branch swap receive <br />
optimization on a few branches that directly took part in the rearrangement. If the difference <br />
in score between the partially optimized tree and the best known tree is greater than treerejectionthreshold, no further optimization is applied to the branches of that tree. <br />
Reducing this value can significantly reduce runtimes, often with little or no effect on results. <br />
However, it is possible that a better tree could be missed if this is set too low. In cases in <br />
which obtaining the very best tree per search is not critical (e.g., bootstrapping), setting this <br />
lower (~20) is probably safe.<br />
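<br />
The decision rule described above amounts to a single comparison, sketched here with a hypothetical function name and made-up scores.<br />

```python
def gets_further_optimization(partial_lnl, best_lnl,
                              treerejectionthreshold=50.0):
    """True if the partially optimized tree is close enough to the best tree."""
    return (best_lnl - partial_lnl) <= treerejectionthreshold

# A tree 40 lnL units worse than the best known tree.
kept_at_default = gets_further_optimization(-10040.0, -10000.0)
kept_at_20 = gets_further_optimization(-10040.0, -10000.0, 20.0)
```

The same tree is optimized further under the default threshold of 50 but abandoned under a threshold of 20, which is where the runtime savings come from.<br />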
<br />
==Settings controlling the proportions of the mutation types== <br />
Each mutation type is assigned a prior ''weight''. These values determine the expected <br />
proportions of the various mutation types that are performed. The primary mutation categories <br />
are ''topology'' (t), ''model'' (m) and ''branch length'' (b). Each is assigned a prior weight ( P<sub>i</sub> ) in the <br />
config file. Each time that a new best likelihood score is attained, the amount of the increase in <br />
score is credited to the mutation type responsible, with the sum of the increases ( S<sub>i</sub> ) maintained <br />
over the last intervallength x intervalstostore generations. The number of times that each <br />
mutation is performed ( N<sub>i</sub> ) is also tallied. The total weight of a mutation type is W<sub>i</sub> = P<sub>i</sub> + ( S<sub>i</sub> / N<sub>i</sub> ). <br />
The proportion of mutations of type i out of all mutations is then <br />
<br />
<big><big><br />
Pr(i) = W<sub>i</sub> / <br />
(W<sub>t</sub> + W<sub>m</sub> + W<sub>b</sub>) <br />
</big></big><br />
<br />
The proportion of each mutation is thus related to its prior weight and the average increase in <br />
score that it has caused over recent generations. The prior weights can be used to control the <br />
expected (and starting) proportions of the mutation types, as well as how sensitive the <br />
proportions are to the course of events in a run. It is generally a good idea to make the topology <br />
prior much larger than the others so that when no mutations are improving the score many <br />
topology mutations are still attempted. If you set '''outputmostlyuselessfiles''' to 1, you can look at the “problog” file to determine what the <br />
proportions of the mutations actually were over the course of a run. <br />
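<br />
The weighting scheme above can be worked through with a small sketch. The names are hypothetical, and treating a mutation type that has not yet been performed ( N<sub>i</sub> = 0 ) as having zero average gain is an assumption of the example.<br />

```python
def mutation_proportions(prior, score_gain, times_performed):
    """Pr(i) = W_i / sum(W), with W_i = P_i + S_i / N_i."""
    weights = {}
    for kind, p in prior.items():
        n = times_performed.get(kind, 0)
        # Assumed: no recorded mutations means no credited gain.
        average_gain = score_gain.get(kind, 0.0) / n if n > 0 else 0.0
        weights[kind] = p + average_gain
    total = sum(weights.values())
    return {kind: w / total for kind, w in weights.items()}

# Default prior weights for topology, model and branch-length mutations.
priors = {"topology": 1.0, "model": 0.05, "brlen": 0.2}
start = mutation_proportions(priors, {}, {})
later = mutation_proportions(priors, {"topology": 5.0}, {"topology": 100})
```

With the default priors and no score gains yet, about 80% of mutations are topological, which shows how a large topology prior keeps topology mutations going when nothing is improving the score.<br />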
===topoweight (weight on topology mutations)===<br />
topoweight (0 to infinity, '''1.0''') The prior weight assigned to the class of topology mutations <br />
(NNI, SPR and limSPR). Note that setting this to 0.0 turns off topology mutations, meaning that the tree topology is fixed for the run. This used to be a way to have the program estimate only model parameters and branch-lengths, but the '''[[GARLI_Configuration_Settings#optimizeinputonly_.28do_not_search.2C_only_optimize_model_and_branch_lengths_on_user_trees.29 | optimizeinputonly]]''' setting is now a better way to go.<br />
<br />
===modweight (weight on model parameter mutations)===<br />
modweight (0 to infinity, '''0.05''') The prior weight assigned to the class of model mutations. Note <br />
that setting this at 0.0 fixes the model during the run. <br />
===brlenweight (weight on branch-length parameter mutations)===<br />
brlenweight (0 to infinity, '''0.2''') The prior weight assigned to branch-length mutations.<br />
<br />
The same procedure used above to determine the proportion of Topology:Model:Branch-Length <br />
mutations is also used to determine the relative proportions of the three types of topological <br />
mutations (NNI:SPR:limSPR), controlled by the following three weights. Note that the <br />
proportion of mutations applied to each of the model parameters is not user controlled. <br />
===randnniweight (weight on NNI topology changes)===<br />
randnniweight (0 to infinity, '''0.1''') - The prior weight assigned to NNI mutations. <br />
===randsprweight (weight on SPR topology changes)===<br />
randsprweight (0 to infinity, '''0.3''') - The prior weight assigned to random SPR mutations. For <br />
very large datasets it is often best to set this to 0.0, as random SPR mutations essentially <br />
never result in score increases. <br />
===limsprweight (weight on localized SPR topology changes)===<br />
limsprweight (0 to infinity, '''0.6''') - The prior weight assigned to SPR mutations with the <br />
reconnection branch limited to being a maximum of limsprrange branches away from where <br />
the branch was detached. <br />
===intervallength===<br />
intervallength (10 to 1000, '''100''') – The number of generations in each interval during which the <br />
number and benefit of each mutation type are stored. <br />
===intervalstostore===<br />
intervalstostore = (1 to 10, '''5''') – The number of intervals to be stored. Thus, records of <br />
mutations are kept for the last (intervallength x intervalstostore) generations. Every <br />
intervallength generations the probabilities of the mutation types are updated by the scheme <br />
described above.<br />
<br />
==Settings controlling mutation details==<br />
===limsprrange (max range for localized SPR topology changes)===<br />
limsprrange (0 to infinity, '''6''') – The maximum number of branches away from its original <br />
location that a branch may be reattached during a limited SPR move. Setting this too high (> <br />
10) can seriously degrade performance, but if you do so in conjunction with a large increase in genthreshfortopoterm you might end up with better trees.<br />
===meanbrlenmuts (mean # of branch lengths to change per mutation)===<br />
meanbrlenmuts (1 to # taxa, '''5''') - The mean of the binomial distribution from which the number <br />
of branch lengths mutated is drawn during a branch length mutation. <br />
===gammashapebrlen (magnitude of branch-length mutations)===<br />
gammashapebrlen (50 to 2000, '''1000''') - The shape parameter of the gamma distribution (with a <br />
mean of 1.0) from which the branch-length multipliers are drawn for branch-length <br />
mutations. Larger numbers cause smaller changes in branch lengths. (Note that this has <br />
nothing to do with gamma rate heterogeneity.) <br />
===gammashapemodel (magnitude of model parameter mutations)===<br />
gammashapemodel (50 to 2000, '''1000''') - The shape parameter of the gamma distribution (with a <br />
mean of 1.0) from which the model mutation multipliers are drawn for model parameters <br />
mutations. Larger numbers cause smaller changes in model parameters. (Note that this has <br />
nothing to do with gamma rate heterogeneity.) <br />
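<br />
The effect of the shape parameter on the size of the multipliers can be seen in the sketch below, which applies to both gammashapebrlen and gammashapemodel. It uses Python's standard gamma sampler with the scale set so that the mean is 1.0; the function names are illustrative, and how GARLI draws its values internally is not shown in this documentation.<br />

```python
import math
import random

def draw_multiplier(shape, rng):
    """Draw one multiplier from a gamma distribution with mean 1.0."""
    # random.gammavariate(alpha, beta) has mean alpha * beta, so a
    # scale of 1 / shape gives a mean of 1.0.
    return rng.gammavariate(shape, 1.0 / shape)

def multiplier_sd(shape):
    """Standard deviation of a mean-1.0 gamma with the given shape."""
    return 1.0 / math.sqrt(shape)

rng = random.Random(42)
draws = [draw_multiplier(1000, rng) for _ in range(10000)]
mean = sum(draws) / len(draws)
```

Under these assumptions the default shape of 1000 gives multipliers with a standard deviation of about 0.03, while a shape of 50 gives about 0.14, so smaller shapes propose larger changes.<br />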
===uniqueswapbias (relative weight assigned to already attempted branch swaps)===<br />
uniqueswapbias (0.01 to 1.0, '''0.1''') – With version 0.95 and later, GARLI keeps track of which branch <br />
swaps it has attempted on the current best tree. Because swaps are applied randomly, it is <br />
possible that some swaps are tried twice before others are tried at all. This option allows the <br />
program to bias the swaps applied toward those that have not yet been attempted. Each swap <br />
is assigned a relative weight depending on the number of times that it has been attempted on <br />
the current best tree. This weight is equal to (uniqueswapbias) raised to the (# times swap <br />
attempted) power. In other words, a value of 0.5 means that swaps that have already been <br />
tried once will be half as likely as those not yet attempted, swaps attempted twice will be ¼ <br />
as likely, etc. A value of 1.0 means no biasing. If this value is not equal to 1.0 and the <br />
outputmostlyuselessfiles option is on, a file called <ofprefix>.swap.log is output. This file <br />
shows the total number of rearrangements tried and the number of unique ones over the course <br />
of a run. Note that this bias is only applied to NNI and limSPR rearrangements. Use of this <br />
option may allow the use of somewhat larger values of limsprrange. <br />
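<br />
The weighting rule for previously attempted swaps is simple enough to state as code. The function name is hypothetical.<br />

```python
def swap_weight(uniqueswapbias, times_attempted):
    """Relative weight of a swap already tried times_attempted times."""
    return uniqueswapbias ** times_attempted

# With a bias of 0.5, each repeat halves the weight of a swap.
weights = [swap_weight(0.5, n) for n in range(3)]
```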
===distanceswapbias (relative weight assigned to branch swaps based on locality)===<br />
distanceswapbias (0.1 to 10, '''1.0''') – This option is similar to uniqueswapbias, except that it <br />
biases toward certain swaps based on the topological distance between the initial and <br />
rearranged trees. The distance is measured as in the limsprrange, and is half the <br />
Robinson-Foulds distance between the trees. As with uniqueswapbias, distanceswapbias <br />
assigns a relative weight to each potential swap. In this case the weight is <br />
(distanceswapbias) raised to the (reconnection distance - 1) power. Thus, given a value of <br />
0.5, the weight of an NNI is 1.0, the weight of an SPR with distance 2 is 0.5, with distance 3 <br />
is 0.25, etc. Note that values less than 1.0 bias toward more localized swaps, while values <br />
greater than 1.0 bias toward more extreme swaps. Also note that this bias is only applied to <br />
limSPR rearrangements. Be careful in setting this, as extreme values can have a very large <br />
effect.</div>Zwicklhttps://molevol.mbl.edu/index.php?title=GARLI_Configuration_Settings&diff=4379GARLI Configuration Settings2015-07-21T16:08:06Z<p>Zwickl: /* Descriptions of GARLI configuration settings */</p>
<hr />
<div>==Descriptions of GARLI configuration settings==<br />
The format for these configuration settings descriptions is generally:<br />
'''entryname''' (possible values, '''default value in bold''') – description<br />
<br />
==General settings==<br />
===datafname (file containing sequence dataset)=== <br />
'''datafname''' = (filename) –<br />
Name of the file containing the aligned sequence data. Formats accepted are <br />
PHYLIP, NEXUS and FASTA. Robust reading of Nexus formatted <br />
datasets is done using the Nexus Class Library. This accommodates things such as interleaved <br />
alignments and exclusion sets (exset’s) in NEXUS assumptions blocks (see the FAQ for an <br />
example of [[FAQ#Can_I_specify_alignment_columns_of_my_data_matrix_to_be_excluded.3F | exset usage]]). Use of NEXUS files is recommended, and is required for partitioned models.<br />
<br />
===constraintfile (file containing constraint definition)===<br />
'''constraintfile''' = (filename, '''none''') –<br />
Name of the file containing any topology constraint specifications, or “none” if <br />
there are no constraints. The easiest way to explain the format of the constraint file is by <br />
example. Consider a dataset of 8 taxa, in which your constraint consists of grouping taxa 1, 3 <br />
and 5. You may specify either positive constraints (inferred tree MUST contain constrained <br />
group) or negative constraints (also called converse constraints, inferred tree CANNOT <br />
contain constrained group). These are specified with either a ‘+’ or a ‘-’ at the beginning of <br />
the constraint specification, for positive and negative constraints, respectively. <br />
*For a positive constraint on a grouping of taxa 1, 3 and 5: <br />
+((1,3,5),2,4,6,7,8); <br />
*For a negative constraint on a grouping of taxa 1, 3 and 5: <br />
-((1,3,5),2,4,6,7,8); <br />
*Note that there are many other equivalent parenthetical representations of these constraints. <br />
*Multiple groups may be positively constrained, but currently only a single negatively constrained group is allowed. <br />
*Multiple constrained groupings may be specified in a single string: <br />
+((1,3,5),2,4,(6,7),8); <br />
or in two separate strings on successive lines: <br />
+((1,3,5),2,4,6,7,8); <br />
+(1,3,5,2,4,(6,7),8); <br />
*Constraint strings may also be specified in terms of taxon names (matching those used in the alignment) instead of numbers. <br />
*Positive and negative constraints cannot be mixed. <br />
*GARLI also accepts another constraint format that may be easier to use in some cases. This involves specifying a single branch to be constrained with a string of ‘*’ (asterisk) and ‘.’ (period) characters, with one character per taxon. Each taxon specified with a ‘*’ falls on one side of the constrained branch, and all those specified with a ‘.’ fall on the other. This should be familiar to anyone who has looked at PAUP* bootstrap output. With this format, a positive constraint on a grouping of taxa 1, 3 and 5 would look like this: <br />
 +*.*.*... <br />
or equivalently like this: <br />
+.*.*.*** <br />
With this format each line only designates a single branch, so multiple constrained branches must be specified as multiple lines in the file. <br />
*The program also allows “backbone” constraints, which are simply constraints that do not include all of the taxa. For example if one wanted to constrain (1,3,5) but didn’t care where taxon 6 appeared in the tree, this string could be used: <br />
+((1,3,5),2,4,7,8); <br />
:Nothing special needs to be done to identify this as a backbone constraint, simply leave out some taxa.<br />
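As a rough illustration of the asterisk/period format described above, the following Python sketch builds the constraint string for a group of taxon numbers. The helper names are hypothetical and are not part of GARLI; the sketch only covers constraints that include every taxon (not backbone constraints).

```python
def bipartition_string(group, ntaxa):
    """One character per taxon: '*' for taxa in the group, '.' for the rest."""
    members = set(group)
    if not members or not all(1 <= t <= ntaxa for t in members):
        raise ValueError("group must contain taxon numbers between 1 and ntaxa")
    return "".join("*" if t in members else "." for t in range(1, ntaxa + 1))


def complement(mask):
    """Swap '*' and '.', which describes the same branch from the other side."""
    return mask.translate(str.maketrans("*.", ".*"))


def positive_constraint(group, ntaxa):
    """Constraint line for a single positively constrained branch."""
    return "+" + bipartition_string(group, ntaxa)


# Taxa 1, 3 and 5 grouped in a dataset of 8 taxa.
line = positive_constraint([1, 3, 5], 8)
```

Each call produces a single line, so several constrained branches would be written as several lines in the constraint file.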
<br />
===streefname (source of starting tree and/or model)===<br />
'''streefname''' = (random, <u>'''stepwise'''</u>, <filename>) – Specifies where the starting tree topology and/or <br />
model parameters will come from. The tree topology may be a completely random topology <br />
(constraints will be enforced), a tree provided by the user in a file, <u>or a tree generated by the <br />
program using a fast ML stepwise-addition algorithm (see [[GARLI_Configuration_Settings#attachmentspertaxon_.28control_creation_of_stepwise_addition_starting_tree.29|attachmentspertaxon]] below)</u>. <br />
Starting or fixed model parameter values may also be provided in the specified file, with or <br />
without a tree topology. Some notes on starting trees/models: <br />
*Specified starting trees may have polytomies, which will be arbitrarily resolved before the run begins.<br />
*Starting tree formats: <br />
**Plain newick tree string (with taxon numbers or names, with or without branch lengths) <br />
**NEXUS trees block. The trees block can appear in the same file as a NEXUS data or characters block that contains the alignment, although the same filename should then be specified on both the datafname and streefname lines.<br />
*If multiple trees appear in the specified file and multiple search replicates are specified (see '''[[GARLI_Configuration_Settings#searchreps_.28number_of_independent_search_replicates.29|searchreps]]''' setting), then the first tree is used in the first replicate, the second in the second replicate, etc. <br />
*Providing model parameter values: see this page '''[[Specifying_model_parameter_values | Specifying model parameter values]]'''<br />
*See also the FAQ items on model parameters '''[[FAQ#Model_parameters | here]]'''.<br />
<br />
===attachmentspertaxon (control creation of stepwise addition starting tree)===<br />
'''attachmentspertaxon''' = (1 to infinity, '''50''') – The number of attachment branches evaluated for each taxon to be added to the tree during the creation of an ML stepwise-addition starting tree. Briefly, stepwise addition is an algorithm used to make a tree, and involves adding taxa in a random order to a growing tree. For each taxon to be added, a number of randomly chosen attachment branches are tried and scored, and then the best scoring one is chosen as the location of that taxon. The attachmentspertaxon setting controls how many attachment points are evaluated for each taxon to be added. A value of one is equivalent to a completely random tree (only one randomly chosen location is evaluated). A value of greater than 2 times the number of taxa in the dataset means that all attachment points will be evaluated for each taxon, and will result in very good starting trees (but may take a while on large datasets). Even fairly small values (< 10) can result in starting trees that are much, much better than random, but still fairly different from one another. <br />
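The reason a value above twice the number of taxa evaluates every attachment point follows from branch counting: a fully resolved unrooted tree with k tips has 2k - 3 branches. The Python sketch below shows that arithmetic only. The function names are hypothetical, and it is an assumption that GARLI caps the number of evaluated branches in exactly this way.

```python
def branches_in_unrooted_tree(ntips: int) -> int:
    """Number of branches in a fully resolved unrooted tree with `ntips` tips."""
    if ntips < 2:
        raise ValueError("need at least two tips")
    return 2 * ntips - 3


def attachments_evaluated(attachmentspertaxon: int, tips_in_tree: int) -> int:
    """Attachment branches that can be tried when adding the next taxon."""
    return min(attachmentspertaxon, branches_in_unrooted_tree(tips_in_tree))


# With the default of 50, a growing tree of 10 tips has only 17 branches,
# so all of them are candidates; a tree of 100 tips has 197 branches,
# so only 50 randomly chosen ones would be scored.
small = attachments_evaluated(50, 10)
large = attachments_evaluated(50, 100)
```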
===ofprefix (output filename prefix)===<br />
'''ofprefix''' = (text) – Prefix of all output filenames, such as log, treelog, etc. Change this for each run that you do or the program will overwrite previous results. <br />
===randseed (random number seed)===<br />
'''randseed''' = (-1 or positive integers, '''-1''') – The random number seed used by the random number generator. Specify “–1” to have a seed chosen for you. Specifying the same seed number in multiple runs will give exactly identical results, if all other parameters and settings are also identical. <br />
===availablememory (control maximum program memory usage)===<br />
'''availablememory''' – Typically this is the amount of available physical memory on the system, in megabytes. This lets GARLI determine how much system memory it may be able to use to store computations for reuse. The program will be conservative and use at most about 90% of this value, although for typical datasets it will use much, much less. If other programs must be open or used when GARLI is running, you may need to reduce this value. This setting can also have a significant effect on performance (speed), but more is not always better. When a run is started, GARLI will output the '''availablememory''' value necessary for that dataset to achieve each of the “memory levels” (from best to worst, termed “great”, “good”, “low”, and “very low”). More memory is generally better because more calculations are stored and can be reused, but when the amount of memory needed for “great” becomes more than 512 megabytes or so, performance can be slowed because the operating system has difficulty managing that much memory. In general, choose an amount of memory that allows “great” when this is less than 512 MB, and if it is greater, reduce the amount of memory into “good” or “low” as necessary. Avoid “very low” whenever possible. You can find the value that is approximately optimal for your dataset by setting the '''randseed''' to some positive value (so that the searches are identical) and doing runs with various '''availablememory''' values. Typically it will make at most a 30% difference, so it isn't worth worrying about too much. <br />
===logevery (frequency to log best score to file)===<br />
'''logevery''' = (1 to infinity, '''10''') – The frequency at which the best score is written to the log file. <br />
===saveevery (frequency to save best tree to file or write checkpoints)===<br />
'''saveevery''' = (1 to infinity, '''100''') – If '''writecheckpoints''' or '''outputcurrentbesttopology''' are specified, this is the frequency (in generations) at which checkpoints or the current best tree are written to file. <br />
===refinestart (whether to optimize a bit before starting a search)===<br />
'''refinestart''' = (0 or 1, '''1''') – Specifies whether some initial rough optimization is performed on the starting branch lengths and rate heterogeneity parameters. This is always recommended. <br />
===outputcurrentbesttopology (continuously write the best tree to file during a run)===<br />
'''outputcurrentbesttopology''' = (0 or 1, '''0''') – If true, the current best tree of the current search replicate is written to <ofprefix>.best.current.tre every '''saveevery''' generations. In versions before 0.96 the current best topology was always written to file, but that is no longer the case. Seeing the current best tree has no real use apart from satisfying your curiosity about how a <br />
run is going.<br />
===outputeachbettertopology (write each improved topology to file)===<br />
'''outputeachbettertopology''' (0 or 1, '''0''') – If true, each new topology encountered with a better score than the previous best is written to file. In some cases this can result in really big files (hundreds of MB) though, especially for random starting topologies on large datasets. Note that this file is interesting to get an idea of how the topology changed as the searches progressed, but the collection of trees should NOT be interpreted in any meaningful way. This option is not available while bootstrapping. <br />
===enforcetermconditions (use automatic termination)===<br />
'''enforcetermconditions''' = (0 or 1, '''1''') – Specifies whether the automatic termination conditions will be used. Both of the conditions defined by the following two parameters must be met for the run to terminate. If this is false, the run will continue until it reaches the time ('''stoptime''') or generation ('''stopgen''') limit. It is highly recommended that this option be used! <br />
===genthreshfortopoterm (number of generations without topology improvement required for termination) ===<br />
'''genthreshfortopoterm''' = (1 to infinity, '''20,000''') – This specifies the first part of the termination condition. When no new significantly better scoring topology (see significanttopochange below) has been encountered in greater than this number of generations, this condition is met. Increasing this parameter may improve the lnL scores obtained (especially on large datasets), but will also increase runtimes. <br />
===scorethreshforterm (max score improvement over recent generations required for termination)===<br />
'''scorethreshforterm''' = (0 to infinity, '''0.05''') – The second part of the termination condition. When the total improvement in score over the last '''intervallength''' x '''intervalstostore''' generations (default is 500 generations, see below) is less than this value, this condition is met. This does not usually need to be changed. <br />
===significanttopochange (required score improvement for topology to be considered better)===<br />
'''significanttopochange''' = (0 to infinity, '''0.01''') – The lnL increase required for a new topology to be <br />
considered significant as far as the termination condition is concerned. It probably doesn’t <br />
need to be played with, but you might try increasing it slightly if your runs reach a stable <br />
score and then take a very long time to terminate due to very minor changes in topology. <br />
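The way the three termination settings interact can be sketched as follows. This is a minimal Python illustration under stated assumptions, not GARLI's implementation: the function names and arguments are hypothetical, and whether the comparisons are strict or inclusive at the exact threshold is an assumption.

```python
def is_significant_topology_change(new_lnl, best_lnl, significanttopochange=0.01):
    """A new topology counts toward the termination condition only if it
    improves the lnL by at least significanttopochange (assumed inclusive)."""
    return new_lnl - best_lnl >= significanttopochange


def should_terminate(gens_since_topo_improvement, recent_score_improvement,
                     genthreshfortopoterm=20000, scorethreshforterm=0.05,
                     enforcetermconditions=True):
    """Both conditions must hold for automatic termination."""
    if not enforcetermconditions:
        # Only the stoptime / stopgen limits would end the run in this case.
        return False
    topology_stable = gens_since_topo_improvement > genthreshfortopoterm
    score_stable = recent_score_improvement < scorethreshforterm
    return topology_stable and score_stable


# 20,001 generations without a significantly better topology and only
# 0.01 lnL units gained over the recent interval: both conditions are met.
done = should_terminate(20001, 0.01)
```

Raising significanttopochange makes fewer topology changes count as significant, which is why it can shorten runs that linger on very minor improvements.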
===outputphyliptree (write trees to file in Phylip as well as Nexus format)===<br />
'''outputphyliptree''' = (0 or 1, '''0''') – Whether Phylip-formatted tree files will be output in addition to <br />
the default Nexus files for the best tree across all replicates (<ofprefix>.best.phy), the best <br />
tree for each replicate (<ofprefix>.best.all.phy) or, in the case of bootstrapping, the best tree <br />
for each bootstrap replicate (<ofprefix>.boot.phy). <br />
===outputmostlyuselessfiles (output uninteresting files)===<br />
'''outputmostlyuselessfiles''' = (0 or 1, '''0''') – Whether to output three files of little general interest: the <br />
“fate”, “problog” and “swaplog” files. The fate file shows the parentage, mutation types and <br />
scores of every individual in the population during the entire search. The problog shows how <br />
the proportions of the different mutation types changed over the course of the run. The <br />
swaplog shows the number of unique swaps and the number of total swaps on the current <br />
best tree over the course of the run. <br />
===writecheckpoints (write checkpoint files during run)===<br />
'''writecheckpoints''' (0 or 1, '''0''') – Whether to write three files to disk containing all information <br />
about the current state of the population every saveevery generations, with each successive <br />
checkpoint overwriting the previous one. These files can be used to restart a run at the last <br />
written checkpoint by setting the '''restart''' configuration entry.<br />
===restart (restart run from checkpoint)===<br />
'''restart''' = (0 or 1, '''0''') – Whether to restart at a previously saved checkpoint. To use this option the '''writecheckpoints''' option must have been used during a previous run. The program will look for checkpoint files that are named based on the ofprefix of the previous run. If you intend to restart a run, NOTHING should be changed in the config file except setting '''restart''' to 1. <br />
A run that is restarted from checkpoint will give ''exactly'' the same results it would have if the run had gone to completion. <br />
===outgroup (orient inferred trees consistently)===<br />
'''outgroup''' = (outgroup taxa numbers, separated by spaces) – This option allows for orienting the tree topologies in a consistent way when they are written to file. Note that this has NO effect whatsoever on the actual inference and the specified outgroup is NOT constrained to be present in the inferred trees. If multiple outgroup taxa are specified and they do not form a monophyletic group in the inferred tree, this setting will be ignored. If you specify a single outgroup taxon it will always be present, and the tree will always be consistently oriented. Ranges can be indicated with a hyphen, e.g., to specify an outgroup consisting of taxa 1, 2, 3 and 5 the format is this: <br />
outgroup = 1-3 5<br />
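To make the range notation concrete, here is a small Python sketch that expands an outgroup specification into the list of taxon numbers it denotes. The parser is hypothetical and only mirrors the format described above; it is not how GARLI reads its configuration file.

```python
def parse_outgroup(spec: str) -> list:
    """Expand a space-separated list of taxon numbers and hyphenated ranges."""
    taxa = []
    for token in spec.split():
        if "-" in token:
            start, end = (int(part) for part in token.split("-"))
            if start > end:
                raise ValueError("range must be ascending: " + token)
            taxa.extend(range(start, end + 1))
        else:
            taxa.append(int(token))
    return taxa


# The example from the text: taxa 1, 2, 3 and 5.
outgroup_taxa = parse_outgroup("1-3 5")
```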
<br />
===searchreps (number of independent search replicates)===<br />
'''searchreps''' = (1 to infinity, '''2''') – The number of independent search replicates to perform during a program execution. You should always either do multiple search replicates or multiple program executions with any dataset to get a feel for whether you are getting consistent results, which suggests that the program is doing a decent job of searching. Note that if this is > 1 and you are performing a bootstrap analysis, this is the number of search replicates to be done per bootstrap replicate. That can increase the chance of finding the best tree per bootstrap replicate, but will also increase bootstrap runtimes enormously.<br />
===bootstrapreps (number of bootstrap replicates)===<br />
bootstrapreps (0 to infinity, '''0''') - The number of bootstrap reps to perform. If this is greater than <br />
0, normal searching will not be performed. The resulting bootstrap trees (one per rep) will be <br />
output to a file named <ofprefix>.boot.tre. To obtain the bootstrap proportions they will then <br />
need to be read into PAUP* or a similar program to obtain a majority rule consensus. Note <br />
that it is probably safe to reduce the strictness of the termination conditions during <br />
bootstrapping (perhaps halve genthreshfortopoterm), which will greatly speed up the <br />
bootstrapping process with negligible effects on the results.<br />
<br />
Note that GARLI does NOT do the bootstrap consensus or calculate the bootstrap proportions itself. It simply infers the trees that are the input to that consensus. The trees in the .boot.tre file will need to be read into a program such as SumTrees (or PAUP* or CONSENSE) to do a majority-rule consensus and obtain your bootstrap support values. See the following part of the "Advanced topics" page for suggestions on making the consensus: '''[[Advanced_topics#Detailed_Example:_A_bootstrap_analysis|Detailed Example: A bootstrap analysis]]'''.<br />
<br />
===resampleproportion (relative size of re-sampled data matrix)===<br />
resampleproportion (0.1 to 10, '''1.0''' ) – When bootstrapreps > 0, this setting allows for <br />
bootstrap-like resampling, but with the pseudoreplicate datasets having the number of <br />
alignment columns different from the real data. Setting values < 1.0 is somewhat similar to jackknifing, but not identical. <br />
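The idea of resampling a different number of columns than the original alignment can be sketched in Python as below. This is an illustration only: the function name is hypothetical, and rounding the product of the proportion and the column count to the nearest integer is an assumption, since the exact rule is not stated here.

```python
import random


def resample_columns(ncols: int, resampleproportion: float = 1.0, seed=None) -> list:
    """Draw column indices with replacement to build one pseudoreplicate.

    The pseudoreplicate has about ncols * resampleproportion columns
    (rounding is an assumption made for this sketch).
    """
    rng = random.Random(seed)
    nsample = max(1, round(ncols * resampleproportion))
    return [rng.randrange(ncols) for _ in range(nsample)]


# A standard bootstrap pseudoreplicate and one half the size of the data.
full = resample_columns(100, 1.0, seed=1)
half = resample_columns(100, 0.5, seed=1)
```

Because columns are still drawn with replacement, a proportion below 1.0 resembles but does not reproduce jackknifing, which deletes columns without replacement.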
===inferinternalstateprobs (infer ancestral states)===<br />
inferinternalstateprobs = (0 or 1, '''0''') – Specify 1 to have GARLI infer the marginal posterior <br />
probability of each character at each internal node. This is done at the very end of the run, <br />
just before termination. The results are output to a file named <ofprefix>.internalstates.log.<br />
<br />
===outputsitelikelihoods (write a file with the log-likelihood of each site)===<br />
<br />
'''outputsitelikelihoods''' = (0 or 1, '''0''') - Causes a file named <ofprefix>.sitelikes.log to be created containing the sitewise log-likelihood values for each individual site for the final tree of each search replicate. Note that the format of the file is exactly the same as that output by PAUP, so it can be directly used in CONSEL using the PAUP file-format option. For partitioned models the sites are listed in sets by partition subset, but the site numbers will match those in the original datamatrix. Thus, when using a partitioned model that divides sites by codon position, the ordering would be 1, 4, 7, ... 2, 5, 8, ... 3, 6, 9 .... Note that this makes no difference to CONSEL, which just assumes that the site likelihoods are ordered the same for each tree and ignores the site numbers. <br />
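The ordering for a codon-position partition can be reproduced with the short Python sketch below, which also shows how values listed in that order could be put back into alignment order using the site numbers. Both helpers are hypothetical illustrations and do not read the actual .sitelikes.log format.

```python
def sitelike_order_by_codon_position(nsites: int) -> list:
    """Site numbers grouped by codon position: 1, 4, 7, ... 2, 5, 8, ... 3, 6, 9, ..."""
    return [site for pos in (1, 2, 3) for site in range(pos, nsites + 1, 3)]


def restore_alignment_order(site_numbers: list, values: list) -> list:
    """Sort per-site values back into original alignment order."""
    return [value for _, value in sorted(zip(site_numbers, values))]


order = sitelike_order_by_codon_position(9)
```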
<br />
Also note that using the site likelihoods from a partitioned model is a violation of CONSEL's assumptions, since it will not know about the partitions and its internal bootstrap resampling will not obey them. It isn't clear what the effects of this will be on the various tests. <br />
<br />
Finally, if you want to calculate the site likelihoods for a set of trees that you provide, use the '''[[GARLI_Configuration_Settings#optimizeinputonly_.28do_not_search.2C_only_optimize_model_and_branch_lengths_on_user_trees.29 | optimizeinputonly]]''' setting.<br />
<br />
===optimizeinputonly (do not search, only optimize model and branch lengths on user trees)===<br />
(new in version 2.0)<br />
<br />
'''optimizeinputonly''' = (0 or 1, '''0''') - Requires a NEXUS formatted input tree file (only NEXUS format will work!), specified through the '''[[GARLI_Configuration_Settings#streefname_.28source_of_starting_tree_and.2For_model.29 | streefname]]''' setting. All trees in that file will have model parameters and branch lengths optimized, giving their maximum likelihood scores. A file with the site-likelihoods of each tree will also be output. See the '''[[GARLI_Configuration_Settings#outputsitelikelihoods_.28write_a_file_with_the_log-likelihood_of_each_site.29 | outputsitelikelihoods]]''' setting for details.<br />
<br />
===collapsebranches (collapse zero length branches before writing final trees to file)===<br />
<br />
'''collapsebranches''' = (0 or 1, '''1''') - Before version 1.0, all trees that were returned were fully resolved. This was true even if the maximum-likelihood estimates of some internal branch lengths were effectively zero (or GARLI's minimum, which is 1e-8). In such cases, collapsing the branch into a polytomy would be a better representation. Note that GARLI will never return a tree with an actual branch length of zero, but rather with its minimum value of 1.0e-8. The drawback of always returning fully resolved trees is that what is effectively a polytomy can be resolved in three ways, and different independent searches may randomly return one of those resolutions. Thus, if you compare the trees by topology only, they will look different. If you pay attention to the branch lengths and likelihood scores of the trees it will be apparent that they are effectively the same.<br />
<br />
I think that collapsing of branches is particularly important when bootstrapping, since no support should be given to a branch that doesn't really exist, i.e., that is a random resolution of a polytomy. Collapsing is also good when calculating tree to tree distances such as the symmetric tree distance, for example when calculating phylogenetic error to a known target tree. Zero-length branches would add to the distances (~error) although they really should not.<br />
<br />
==Model specification settings==<br />
<br />
With version 1.0 and later there are now many more options dealing with model specification because of <br />
the inclusion of amino acid and codon-based models. The description of the settings will be <br />
broken up by data type. Note that in terms of the model settings in GARLI, “empirical” means <br />
to fix parameter values at those observed in the dataset being analyzed, and “fixed” means to fix <br />
them at user specified values. See the '''[[GARLI_Configuration_Settings#streefname_.28source_of_starting_tree_and.2For_model.29|streefname]]''' setting for details on how to provide <br />
parameter values to be fixed during inference.<br />
<br />
'''PARTITIONED MODELS: To use partitioned models you'll still need to use the same basic model settings detailed below for each individual partition subset, and then some other details to set up the partition model itself. Be sure that you are familiar with the rest of this section, then see [[Using_partitioned_models | Using partitioned models]].'''<br />
<br />
===datatype (sequence type and inference model)=== <br />
'''datatype''' = ('''nucleotide''', aminoacid, codon-aminoacid, codon, standard, standardordered, standardvariable, standardvariableordered) – The type of data and model that is <br />
to be used during tree inference. Nucleotide and amino acid data are self explanatory.<br />
<br />
The codon-aminoacid datatype means that the data will be <br />
supplied as a nucleotide alignment, but will be internally translated and analyzed using an <br />
amino acid model. The codon and codon-aminoacid datatypes require nucleotide sequence <br />
that is aligned in the correct reading frame. In other words, all gaps in the alignment should <br />
be a multiple of 3 in length, and the alignment should start at the first position of a codon. If <br />
the alignment has extra columns at the start, middle or end, they should be removed or <br />
excluded with a Nexus exset (see '''[[FAQ#Can_I_specify_alignment_columns_of_my_data_matrix_to_be_excluded.3F | this FAQ item]]''' for an example of exset usage). The correct <br />
'''[[GARLI_Configuration_Settings#geneticcode_.28code_to_use_in_codon_translation.29 | geneticcode]]''' must also be set.<br />
<br />
(New in Version 2.0)<br />
<br />
The various "standard" datatypes are new in GARLI 2.0. These represent morphology-like discrete characters, with any number of states. These are also termed the "Mk" and "Mkv" models by Lewis (2001). See this page for more details on using standard data: '''[[Mkv morphology model]]'''.<br />
<br />
===Settings for datatype = nucleotide=== <br />
====ratematrix (relative rate parameters assumed by substitution model)====<br />
'''ratematrix''' = (1rate, 2rate, '''6rate''', fixed, custom string) – The number of relative substitution rate <br />
parameters (note that the number of free parameters is this value minus one). Equivalent to <br />
the “nst” setting in PAUP* and MrBayes. 1rate assumes that substitutions between all pairs <br />
of nucleotides occur at the same rate (JC model), 2rate allows different rates for transitions and <br />
transversions (K2P or HKY models), and 6rate allows a different rate between each nucleotide pair (GTR). These rates are <br />
estimated unless the fixed option is chosen. Since version 0.96, parameters for any <br />
submodel of the GTR model may be estimated. The format for specifying this is very <br />
similar to that used in the “rclass” setting of PAUP*. Within parentheses, six letters are <br />
specified, with spaces between them. The six letters represent the rates of substitution <br />
between the six pairs of nucleotides, with the order being A-C, A-G, A-T, C-G, C-T and G-T. <br />
Letters within the parentheses that are the same mean that a single parameter is shared by <br />
multiple nucleotide pairs. For example, <br />
ratematrix = (a b a a b a) <br />
would specify the HKY 2-rate model (equivalent to ratematrix = 2rate). This entry, <br />
ratematrix = (a b c c b a) <br />
would specify 3 estimated rates of substitution, with one rate shared by A-C and G-T <br />
substitutions, another rate shared by A-G and C-T substitutions, and the final rate shared by <br />
A-T and C-G substitutions.<br />
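The custom string format can be checked with a short Python sketch that maps each letter to the nucleotide pairs sharing that rate and counts the free parameters (the number of distinct rates minus one). The helper names are hypothetical; only the pair order A-C, A-G, A-T, C-G, C-T, G-T is taken from the description above.

```python
PAIRS = ("A-C", "A-G", "A-T", "C-G", "C-T", "G-T")


def parse_ratematrix(spec: str) -> dict:
    """Group the six nucleotide pairs by the letter assigned to them."""
    letters = spec.strip().strip("()").split()
    if len(letters) != 6:
        raise ValueError("expected six letters separated by spaces")
    classes = {}
    for pair, letter in zip(PAIRS, letters):
        classes.setdefault(letter, []).append(pair)
    return classes


def free_rate_parameters(spec: str) -> int:
    """Relative rates have one fewer free parameter than rate classes."""
    return len(parse_ratematrix(spec)) - 1


hky_like = parse_ratematrix("(a b a a b a)")
three_rate = parse_ratematrix("(a b c c b a)")
```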
<br />
====statefrequencies (equilibrium base frequencies assumed by substitution model)====<br />
'''statefrequencies''' = (equal, empirical, '''estimate''', fixed) – Specifies how the equilibrium state <br />
frequencies (A, C, G and T) are treated. The empirical setting fixes the frequencies at their <br />
observed proportions, and the other options should be self-explanatory. <br />
===For datatype = nucleotide or aminoacid===<br />
====invariantsites (treatment of proportion of invariable sites parameter)====<br />
'''invariantsites''' = (none, '''estimate''', fixed) – Specifies whether a parameter representing the <br />
proportion of sites that are unable to change (i.e. have a substitution rate of zero) will be <br />
included. This is typically referred to as “invariant sites”, but would better be termed <br />
“invariable sites”. <br />
====ratehetmodel (type of rate heterogeneity to assume for variable sites)====<br />
'''ratehetmodel''' = (none, '''gamma''', gammafixed) – The model of rate heterogeneity assumed. <br />
“gammafixed” requires that the alpha shape parameter is provided, and a setting of “gamma” <br />
estimates it. <br />
====numratecats (number of overall substitution rate categories)==== <br />
'''numratecats''' = (1 to 20, '''4''') – The number of categories of variable rates (not including the <br />
invariant site class if it is being used). Must be set to 1 if '''ratehetmodel''' is set to none. Note <br />
that runtimes and memory usage scale linearly with this setting. <br />
===For datatype = aminoacid or codon-aminoacid===<br />
Amino acid analyses are typically done using fixed rate matrices that have been estimated on <br />
large datasets and published. Typically the only model parameters that are estimated during tree <br />
inference relate to the rate heterogeneity distribution. Each of the named matrices also has <br />
corresponding fixed amino acid frequencies, and a given rate matrix can either be used with those <br />
frequencies or with the amino acid frequencies observed in your dataset. This second option is <br />
often denoted as “+F” in a model description, although in terms of the GARLI configuration <br />
settings this is referred to as “empirical” frequencies. In GARLI the Dayhoff model would be <br />
specified by setting both the ratematrix and statefrequencies options to “dayhoff”. The <br />
Dayhoff+F model would be specified by setting the ratematrix to “dayhoff”, and <br />
statefrequencies to “empirical”.<br />
<br />
The following named amino acid models are implemented: <br />
{| border="1"<br />
|-<br />
!'''ratematrix'''/'''statefrequencies''' setting !! reference <br />
|-<br />
|align="center" | dayhoff || Dayhoff, Schwartz and Orcutt. 1978.<br />
|-<br />
|align="center" | jones || Jones, Taylor and Thornton (JTT), 1992. <br />
|-<br />
|align="center" | WAG || Whelan and Goldman, 2001.<br />
|-<br />
|align="center" | mtREV || Adachi and Hasegawa, 1996.<br />
|-<br />
|align="center" | mtmam || Yang, Nielsen and Hasegawa, 1998.<br />
|}<br />
Note that most programs allow either the use of a named rate matrix and its corresponding state <br />
frequencies, or a named rate matrix and empirical frequencies. GARLI technically allows the <br />
mixing of different named matrices and equilibrium frequencies (for example, wag matrix with <br />
jones equilibrium frequencies), but this is not recommended. <br />
====ratematrix (amino acid substitution rates)====<br />
'''ratematrix''' = (poisson, jones, dayhoff, wag, mtmam, mtrev) – The fixed amino acid rate matrix to <br />
use. You should use the matrix that gives the best likelihood, and could use a program like <br />
PROTTEST (very much like MODELTEST, but for amino acid models) to determine which <br />
fits best for your data. Poisson assumes a single rate of substitution between all amino acid <br />
pairs, and is a very poor model.<br />
====statefrequencies (equilibrium base frequencies assumed by substitution model)====<br />
'''statefrequencies''' = (equal, empirical, estimate, fixed, jones, dayhoff, wag, mtmam, mtrev) – <br />
Specifies how the equilibrium state frequencies of the 20 amino acids are treated. The <br />
“empirical” option fixes the frequencies at their observed proportions (when describing a <br />
model this is often termed “+F”). <br />
===For datatype = codon===<br />
The codon models are built with three components: (1) parameters describing the process of <br />
individual nucleotide substitutions, (2) equilibrium codon frequencies, and (3) parameters <br />
describing the relative rate of nonsynonymous to synonymous substitutions. The nucleotide <br />
substitution parameters within the codon models are exactly the same as those possible with <br />
standard nucleotide models in GARLI, and are specified with the ratematrix configuration <br />
entry. Thus, they can be of the 2rate variety (inferring different rates for transitions and <br />
transversions, K2P or HKY-like), the 6rate variety (inferring different rates for all nucleotide <br />
pairs, GTR-like) or any other sub-model of GTR. The options for codon frequencies are <br />
specified with the statefrequencies configuration entry. The options are to use equal <br />
frequencies (not a good option), the frequencies observed in your dataset (termed “empirical” in <br />
GARLI), or the codon frequencies implied by the “F1x4” or “F3x4” methods (using PAML's <br />
terminology). These last two options calculate the codon frequencies as the product of the <br />
frequencies of the three nucleotides that make up each codon. In the “F1x4” case the nucleotide <br />
frequencies are those observed in the dataset across all codon positions, while the “F3x4” option <br />
uses the nucleotide frequencies observed in the data at each codon position separately. The final <br />
component of the codon models is the nonsynonymous to synonymous relative rate parameters <br />
(aka dN/dS or omega parameters). The default is to infer a single dN/dS value. Alternatively, a <br />
model can be specified that infers a given number of dN/dS categories, with the dN/dS values <br />
and proportions falling in each category estimated (ratehetmodel = nonsynonymous). This is <br />
the “discrete” or “M3” model in PAML's terminology. <br />
====ratematrix (relative ''nucleotide'' rate parameters assumed by codon model)====<br />
'''ratematrix''' = (1rate, '''2rate''', 6rate, fixed, custom string) – This determines the relative rates of <br />
nucleotide substitution assumed by the codon model. The options are exactly the same as <br />
those allowed under a normal nucleotide model. A codon model with '''ratematrix''' = 2rate <br />
specifies the standard Goldman and Yang (1994) model, with different substitution rates for <br />
transitions and transversions. <br />
====statefrequencies (equilibrium codon frequencies)====<br />
'''statefrequencies''' = (equal, empirical, f1x4, '''f3x4''') - The options are to use equal codon frequencies <br />
(not a good option), the frequencies observed in your dataset (termed “empirical” in GARLI), <br />
or the codon frequencies implied by the “F1x4” or “F3x4” methods (using PAML's <br />
terminology). These last two options calculate the codon frequencies as the product of the <br />
frequencies of the three nucleotides that make up each codon. In the “F1x4” case the <br />
nucleotide frequencies are those observed in the dataset across all codon positions, while the <br />
“F3x4” option uses the nucleotide frequencies observed in the data at each codon position <br />
separately. <br />
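<br />
The F1x4 and F3x4 calculations described above can be sketched as follows. This is only an illustration, not GARLI's own code: the function name, the example nucleotide frequencies, and the choice to renormalize over the 61 sense codons of the standard code are all assumptions made for the example.<br />

```python
from itertools import product

NUCS = "TCAG"
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code only (assumption)


def codon_frequencies(position_freqs):
    """Return {codon: frequency} for the sense codons.

    position_freqs is a list of three dicts, one per codon position, each
    mapping a nucleotide to its frequency. Passing the same dict three
    times gives F1x4; passing three different dicts gives F3x4.
    """
    raw = {}
    for codon in map("".join, product(NUCS, repeat=3)):
        if codon in STOP_CODONS:
            continue
        p = 1.0
        for pos, nuc in enumerate(codon):
            p *= position_freqs[pos][nuc]
        raw[codon] = p
    # Assumption: rescale so the sense-codon frequencies sum to 1.
    total = sum(raw.values())
    return {codon: p / total for codon, p in raw.items()}


# Hypothetical nucleotide frequencies pooled over all codon positions (F1x4).
overall = {"T": 0.2, "C": 0.3, "A": 0.3, "G": 0.2}
f1x4 = codon_frequencies([overall, overall, overall])

# Hypothetical nucleotide frequencies for each codon position (F3x4).
per_position = [
    {"T": 0.1, "C": 0.4, "A": 0.3, "G": 0.2},
    {"T": 0.3, "C": 0.2, "A": 0.3, "G": 0.2},
    {"T": 0.25, "C": 0.25, "A": 0.25, "G": 0.25},
]
f3x4 = codon_frequencies(per_position)

print(len(f1x4), len(f3x4))
```

In a real analysis the nucleotide frequencies would be counted from the alignment rather than typed in by hand.<br />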
====ratehetmodel (variation in dN/dS across sites)==== <br />
'''ratehetmodel''' = ('''none''', nonsynonymous) – For codon models, the default is to infer a single dN/dS <br />
parameter. Alternatively, a model can be specified that infers a given number of dN/dS <br />
categories, with the dN/dS values and proportions falling in each category estimated <br />
('''ratehetmodel''' = nonsynonymous). This is the “discrete” or “M3” model of Yang et al. <br />
(2000). <br />
====numratecats (number of discrete dN/dS categories)====<br />
'''numratecats''' = (1 to 20, '''1''') – When ratehetmodel = nonsynonymous, this is the number of dN/dS parameter <br />
categories.<br />
====invariantsites====<br />
'''invariantsites''' = (none) - '''NOTE:''' Due to an error on my part, the '''invariantsites''' entry must appear with codon models for them to run (despite the fact that it doesn't apply). i.e., be sure that this appears:<br />
invariantsites = none.<br />
<br />
===For datatype = codon or codon-aminoacid=== <br />
====geneticcode (code to use in codon translation)====<br />
'''geneticcode''' = ('''standard''', vertmito, invertmito) – The genetic code to be used in translating codons <br />
into amino acids.<br />
<br />
==Population settings==<br />
===nindivs (number of individuals in population)===<br />
'''nindivs''' = (2 to 100, '''4''')- The number of individuals in the population. This may be increased, but <br />
doing so is generally not beneficial. Note that typical genetic algorithms tend to have much, <br />
much larger population sizes than GARLI's defaults. <br />
===holdover (unmutated copies of best individual)===<br />
'''holdover''' = (1 to nindivs-1, '''1''')- The number of times the best individual is copied to the next <br />
generation with no chance of mutation. It is best not to mess with this. <br />
===selectionintensity (strength of selection)===<br />
'''selectionintensity''' = (0.01 to 5.0, '''0.5''')- Controls the strength of selection, with larger numbers <br />
denoting stronger selection. The relative probability of reproduction of two individuals <br />
depends on the difference in their log likelihoods (&Delta;lnL) and is formulated very similarly to <br />
the procedure of calculating Akaike weights. The relative probability of reproduction of the <br />
less fit individual is equal to:<br />
<br />
<big><big><br />
''e''<sup> (-selectionIntensity &times; &Delta; lnL)</sup><br />
</big></big><br />
<br />
In general, this setting does not seem to have much of an effect on the progress of a run. In <br />
theory higher values should cause scores to increase more quickly, but make the search more <br />
likely to be entrapped in a local optimum. Low values will increase runtimes, but may be <br />
more likely to reach the true optimum. The following table gives the relative probabilities of <br />
reproduction for different values of the selection intensity when the difference in log <br />
likelihood is 1.0<br />
{| border="1"<br />
|-<br />
!'''selectionintensity''' value !! Ratio of probabilities of reproduction <br />
|-<br />
|align="center" |0.05 || align="center" |0.95:1.0 <br />
|-<br />
|align="center" |0.1 || align="center" |0.90:1.0 <br />
|-<br />
|align="center" |0.25 || align="center" |0.78:1.0 <br />
|-<br />
|align="center" |0.5 || align="center" |0.61:1.0 <br />
|-<br />
|align="center" |0.75 || align="center" |0.47:1.0 <br />
|-<br />
|align="center" |1.0 || align="center" |0.37:1.0 <br />
|-<br />
|align="center" |2.0 || align="center" |0.14:1.0<br />
|}<br />
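<br />
The ratios in the table follow directly from the formula above. A minimal sketch that recomputes them (the function name is hypothetical and not part of GARLI):<br />

```python
import math


def relative_reproduction_probability(selection_intensity, delta_lnl):
    """Probability of reproduction of the less fit individual, relative
    to the fitter one, given the difference in their log likelihoods."""
    return math.exp(-selection_intensity * delta_lnl)


# Recompute the table for a log-likelihood difference of 1.0.
for s in (0.05, 0.1, 0.25, 0.5, 0.75, 1.0, 2.0):
    ratio = relative_reproduction_probability(s, 1.0)
    print(f"{s:<5} {ratio:.2f}:1.0")
```

The same function can be called with other values of the log-likelihood difference to see how quickly poorer individuals are excluded from reproduction.<br />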
===holdoverpenalty (fitness handicap for best individual)===<br />
'''holdoverpenalty''' = (0 to 100, '''0''') – This can be used to bias the probability of reproduction of the <br />
best individual downward. Because the best individual is automatically copied into the next <br />
generation, it has a bit of an unfair advantage and can cause all population variation to be lost <br />
due to genetic drift, especially with small population sizes. The value specified here is <br />
subtracted from the best individual’s lnL score before calculating the probabilities of<br />
reproduction. It seems plausible that this might help maintain variation, but I have not seen it <br />
cause a measurable effect.<br />
<br />
===stopgen (maximum number of generations to run)===<br />
'''stopgen''' – The maximum number of generations to run. Note that this supersedes the automated <br />
stopping criterion (see enforcetermconditions above), and should therefore be set to a very <br />
large value if automatic termination is desired. <br />
===stoptime (maximum time to run)===<br />
'''stoptime''' – The maximum number of seconds for the run to continue. Note that this supersedes <br />
the automated stopping criterion (see enforcetermconditions above), and should therefore <br />
be set to a very large value if automatic termination is desired.<br />
<br />
==Branch-length optimization settings== <br />
After a topological rearrangement, branch lengths in the vicinity of the rearrangement are <br />
optimized by the Newton-Raphson method. Optimization passes are performed on a particular <br />
branch until the expected improvement in likelihood for the next pass is less than a threshold <br />
value, termed the optimization precision. Note that this name is somewhat misleading, as the <br />
precision of the optimization algorithm is inversely related to this value (i.e., smaller values of <br />
the optimization precision lead to more precise optimization). If the improvement in likelihood <br />
due to optimization for a particular branch is greater than the optimization precision, <br />
optimization is also attempted on adjacent branches, spreading out across the tree. When no new <br />
topology with a better likelihood score is discovered for a while, the value is automatically <br />
reduced. The value can have a large effect on speed, with smaller values significantly slowing <br />
down the algorithm. The value of the optimization precision and how it changes over the course <br />
of a run are determined by the following three parameters. <br />
===startoptprec===<br />
startoptprec (0.005 to 5.0, '''0.5''') – The beginning optimization precision. <br />
===minoptprec===<br />
minoptprec (0.001 to startoptprec, '''0.01''') – The minimum allowed value of the optimization precision.<br />
<br />
===numberofprecreductions===<br />
numberofprecreductions (0 to 100, '''10''') – Specify the number of steps that it will take for the <br />
optimization precision to decrease (linearly) from startoptprec to minoptprec. <br />
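<br />
As a rough illustration of the linear decrease, the sequence of optimization precision values implied by the three settings can be listed like this. The function is hypothetical, and both the handling of zero reductions and the exact values GARLI steps through are assumptions.<br />

```python
def precision_schedule(startoptprec=0.5, minoptprec=0.01,
                       numberofprecreductions=10):
    """List the optimization precision values from startoptprec down to
    minoptprec, assuming equally sized steps."""
    if numberofprecreductions <= 0:
        # Assumption: with no reductions the precision stays at its start.
        return [startoptprec]
    step = (startoptprec - minoptprec) / numberofprecreductions
    return [startoptprec - i * step
            for i in range(numberofprecreductions + 1)]


schedule = precision_schedule()
print([round(value, 3) for value in schedule])
```

With the default settings each reduction lowers the precision value by 0.049.<br />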
===treerejectionthreshold===<br />
treerejectionthreshold (0 to 500, '''50''') – This setting controls which trees have more extensive <br />
branch-length optimization applied to them. All trees created by a branch swap receive <br />
optimization on a few branches that directly took part in the rearrangement. If the difference <br />
in score between the partially optimized tree and the best known tree is greater than treerejectionthreshold, no further optimization is applied to the branches of that tree. <br />
Reducing this value can significantly reduce runtimes, often with little or no effect on results. <br />
However, it is possible that a better tree could be missed if this is set too low. In cases in <br />
which obtaining the very best tree per search is not critical (e.g., bootstrapping), setting this <br />
lower (~20) is probably safe.<br />
<br />
==Settings controlling the proportions of the mutation types== <br />
Each mutation type is assigned a prior ''weight''. These values determine the expected <br />
proportions of the various mutation types that are performed. The primary mutation categories <br />
are ''topology'' (t), ''model'' (m) and ''branch length'' (b). Each is assigned a prior weight ( P<sub>i</sub> ) in the <br />
config file. Each time that a new best likelihood score is attained, the amount of the increase in <br />
score is credited to the mutation type responsible, with the sum of the increases ( S<sub>i</sub> ) maintained <br />
over the last intervallength x intervalstostore generations. The number of times that each <br />
mutation is performed ( N<sub>i</sub> ) is also tallied. The total weight of a mutation type is W<sub>i</sub> = P<sub>i</sub> + ( S<sub>i</sub> / N<sub>i</sub> ). <br />
The proportion of mutations of type i out of all mutations is then <br />
<br />
<big><big><br />
Pr(i) = W<sub>i</sub> / <br />
(W<sub>t</sub> + W<sub>m</sub> + W<sub>b</sub>) <br />
</big></big><br />
<br />
The proportion of each mutation is thus related to its prior weight and the average increase in <br />
score that it has caused over recent generations. The prior weights can be used to control the <br />
expected (and starting) proportions of the mutation types, as well as how sensitive the <br />
proportions are to the course of events in a run. It is generally a good idea to make the topology <br />
prior much larger than the others so that when no mutations are improving the score many <br />
topology mutations are still attempted. If you set '''outputmostlyuselessfiles''' to 1, you can look at the “problog” file to determine what the <br />
proportions of the mutations actually were over the course of a run. <br />
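<br />
The weighting scheme above can be sketched as follows. This is an illustration under stated assumptions, not GARLI's implementation: the function and dictionary keys are hypothetical, the example tallies are invented, and a mutation type that has not yet been attempted is assumed to contribute only its prior weight.<br />

```python
def mutation_proportions(prior, score_gain, attempts):
    """Compute Pr(i) = W_i / sum(W), with W_i = P_i + S_i / N_i.

    prior      -- P_i, the prior weights from the config file
    score_gain -- S_i, summed score increases credited to each type
    attempts   -- N_i, number of times each type was performed
    """
    weights = {}
    for kind, p in prior.items():
        n = attempts.get(kind, 0)
        # Assumption: no attempts yet means no credited improvement.
        avg_gain = score_gain.get(kind, 0.0) / n if n > 0 else 0.0
        weights[kind] = p + avg_gain
    total = sum(weights.values())
    return {kind: w / total for kind, w in weights.items()}


# Default prior weights: topoweight, modweight, brlenweight.
priors = {"topology": 1.0, "model": 0.05, "brlen": 0.2}

# Starting proportions, before any score increases are recorded.
start = mutation_proportions(priors, {}, {})

# Invented tallies over the stored intervals, for illustration only.
later = mutation_proportions(
    priors,
    {"topology": 2.0, "model": 0.5, "brlen": 0.1},
    {"topology": 400, "model": 20, "brlen": 80},
)
print(start)
print(later)
```

With the default priors the starting proportions are 0.8 topology, 0.04 model and 0.16 branch length, which shows why a large topology prior keeps topology mutations dominant when nothing is improving the score.<br />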
===topoweight (weight on topology mutations)===<br />
topoweight (0 to infinity, '''1.0''') The prior weight assigned to the class of topology mutations <br />
(NNI, SPR and limSPR). Note that setting this to 0.0 turns off topology mutations, meaning that the tree topology is fixed for the run. This used to be a way to have the program estimate only model parameters and branch-lengths, but the '''[[GARLI_Configuration_Settings#optimizeinputonly_.28do_not_search.2C_only_optimize_model_and_branch_lengths_on_user_trees.29 | optimizeinputonly]]''' setting is now a better way to go.<br />
<br />
===modweight (weight on model parameter mutations)===<br />
modweight (0 to infinity, '''0.05''') The prior weight assigned to the class of model mutations. Note <br />
that setting this at 0.0 fixes the model during the run. <br />
===brlenweight (weight on branch-length parameter mutations)===<br />
brlenweight (0 to infinity, '''0.2''') The prior weight assigned to branch-length mutations.<br />
<br />
The same procedure used above to determine the proportion of Topology:Model:Branch-Length <br />
mutations is also used to determine the relative proportions of the three types of topological <br />
mutations (NNI:SPR:limSPR), controlled by the following three weights. Note that the <br />
proportion of mutations applied to each of the model parameters is not user controlled. <br />
===randnniweight (weight on NNI topology changes)===<br />
randnniweight (0 to infinity, '''0.1''') - The prior weight assigned to NNI mutations. <br />
===randsprweight (weight on SPR topology changes)===<br />
randsprweight (0 to infinity, '''0.3''') - The prior weight assigned to random SPR mutations. For <br />
very large datasets it is often best to set this to 0.0, as random SPR mutations essentially <br />
never result in score increases. <br />
===limsprweight (weight on localized SPR topology changes)===<br />
limsprweight (0 to infinity, '''0.6''') - The prior weight assigned to SPR mutations with the <br />
reconnection branch limited to being a maximum of limsprrange branches away from where <br />
the branch was detached. <br />
===intervallength===<br />
intervallength (10 to 1000, '''100''') – The number of generations in each interval during which the <br />
number and benefit of each mutation type are stored. <br />
===intervalstostore===<br />
intervalstostore = (1 to 10, '''5''') – The number of intervals to be stored. Thus, records of <br />
mutations are kept for the last (intervallength x intervalstostore) generations. Every <br />
intervallength generations the probabilities of the mutation types are updated by the scheme <br />
described above.<br />
<br />
==Settings controlling mutation details==<br />
===limsprrange (max range for localized SPR topology changes)===<br />
limsprrange (0 to infinity, '''6''') – The maximum number of branches away from its original <br />
location that a branch may be reattached during a limited SPR move. Setting this too high (> <br />
10) can seriously degrade performance, but if you do so in conjunction with a large increase in genthreshfortopoterm you might end up with better trees.<br />
===meanbrlenmuts (mean # of branch lengths to change per mutation)===<br />
meanbrlenmuts (1 to # taxa, '''5''') - The mean of the binomial distribution from which the number <br />
of branch lengths mutated is drawn during a branch length mutation. <br />
===gammashapebrlen (magnitude of branch-length mutations)===<br />
gammashapebrlen (50 to 2000, '''1000''') - The shape parameter of the gamma distribution (with a <br />
mean of 1.0) from which the branch-length multipliers are drawn for branch-length <br />
mutations. Larger numbers cause smaller changes in branch lengths. (Note that this has <br />
nothing to do with gamma rate heterogeneity.) <br />
===gammashapemodel (magnitude of model parameter mutations)===<br />
gammashapemodel (50 to 2000, '''1000''') - The shape parameter of the gamma distribution (with a <br />
mean of 1.0) from which the model mutation multipliers are drawn for model parameters <br />
mutations. Larger numbers cause smaller changes in model parameters. (Note that this has <br />
nothing to do with gamma rate heterogeneity.) <br />
===uniqueswapbias (relative weight assigned to already attempted branch swaps)===<br />
uniqueswapbias (0.01 to 1.0, '''0.1''') – With version 0.95 and later, GARLI keeps track of which branch <br />
swaps it has attempted on the current best tree. Because swaps are applied randomly, it is <br />
possible that some swaps are tried twice before others are tried at all. This option allows the <br />
program to bias the swaps applied toward those that have not yet been attempted. Each swap <br />
is assigned a relative weight depending on the number of times that it has been attempted on <br />
the current best tree. This weight is equal to (uniqueswapbias) raised to the (# times swap <br />
attempted) power. In other words, a value of 0.5 means that swaps that have already been <br />
tried once will be half as likely as those not yet attempted, swaps attempted twice will be ¼ <br />
as likely, etc. A value of 1.0 means no biasing. If this value is not equal to 1.0 and the <br />
outputmostlyuselessfiles option is on, a file called <ofprefix>.swap.log is output. This file <br />
shows the total number of rearrangements tried and the number of unique ones over the course <br />
of a run. Note that this bias is only applied to NNI and limSPR rearrangements. Use of this <br />
option may allow the use of somewhat larger values of limsprrange. <br />
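<br />
The weighting rule for uniqueswapbias can be sketched as below. The function names are hypothetical, and turning the relative weights into probabilities by simple normalization is an assumption made for the example.<br />

```python
def swap_weight(uniqueswapbias, times_attempted):
    """Relative weight of a swap already tried times_attempted times."""
    return uniqueswapbias ** times_attempted


def swap_probabilities(uniqueswapbias, attempt_counts):
    """Normalize the relative weights of a set of candidate swaps."""
    weights = [swap_weight(uniqueswapbias, n) for n in attempt_counts]
    total = sum(weights)
    return [w / total for w in weights]


# Three candidate swaps tried 0, 1 and 2 times on the current best tree.
probs = swap_probabilities(0.5, [0, 1, 2])
print(probs)
```

With a value of 0.5 the untried swap is twice as likely to be chosen as the swap tried once and four times as likely as the swap tried twice. A value of 1.0 gives every swap the same weight.<br />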
===distanceswapbias (relative weight assigned to branch swaps based on locality)===<br />
distanceswapbias (0.1 to 10, '''1.0''') – This option is similar to uniqueswapbias, except that it <br />
biases toward certain swaps based on the topological distance between the initial and <br />
rearranged trees. The distance is measured as in the limsprrange, and is half the <br />
Robinson-Foulds distance between the trees. As with uniqueswapbias, distanceswapbias <br />
assigns a relative weight to each potential swap. In this case the weight is <br />
(distanceswapbias) raised to the (reconnection distance - 1) power. Thus, given a value of <br />
0.5, the weight of an NNI is 1.0, the weight of an SPR with distance 2 is 0.5, with distance 3 <br />
is 0.25, etc. Note that values less than 1.0 bias toward more localized swaps, while values <br />
greater than 1.0 bias toward more extreme swaps. Also note that this bias is only applied to <br />
limSPR rearrangements. Be careful in setting this, as extreme values can have a very large <br />
effect.</div>Zwicklhttps://molevol.mbl.edu/index.php?title=GARLI_Configuration_Settings&diff=4378GARLI Configuration Settings2015-07-21T16:01:17Z<p>Zwickl: </p>
<hr />
<div>==Descriptions of GARLI configuration settings==<br />
The format for these configuration settings descriptions is generally:<br />
'''entryname''' (possible values, '''default value in bold''') – description<br />
<br />
==General settings==<br />
===datafname (file containing sequence dataset)=== <br />
'''datafname''' = (filename) –<br />
Name of the file containing the aligned sequence data. Formats accepted are <br />
PHYLIP, NEXUS and FASTA. Robust reading of Nexus formatted <br />
datasets is done using the Nexus Class Library. This accommodates things such as interleaved <br />
alignments and exclusion sets (exset’s) in NEXUS assumptions blocks (see the FAQ for an <br />
example of [[FAQ#Can_I_specify_alignment_columns_of_my_data_matrix_to_be_excluded.3F | exset usage]]). Use of NEXUS files is recommended, and is required for partitioned models.<br />
<br />
===constraintfile (file containing constraint definition)===<br />
'''constraintfile''' = (filename, '''none''') –<br />
Name of the file containing any topology constraint specifications, or “none” if <br />
there are no constraints. The easiest way to explain the format of the constraint file is by <br />
example. Consider a dataset of 8 taxa, in which your constraint consists of grouping taxa 1, 3 <br />
and 5. You may specify either positive constraints (inferred tree MUST contain constrained <br />
group) or negative constraints (also called converse constraints, inferred tree CANNOT <br />
contain constrained group). These are specified with either a ‘+’ or a ‘-’ at the beginning of <br />
the constraint specification, for positive and negative constraints, respectively. <br />
*For a positive constraint on a grouping of taxa 1, 3 and 5: <br />
+((1,3,5),2,4,6,7,8); <br />
*For a negative constraint on a grouping of taxa 1, 3 and 5: <br />
-((1,3,5),2,4,6,7,8); <br />
*Note that there are many other equivalent parenthetical representations of these constraints. <br />
*Multiple groups may be positively constrained, but currently only a single negatively constrained group is allowed. <br />
*Multiple constrained groupings may be specified in a single string: <br />
+((1,3,5),2,4,(6,7),8); <br />
or in two separate strings on successive lines: <br />
+((1,3,5),2,4,6,7,8); <br />
+(1,3,5,2,4,(6,7),8); <br />
*Constraint strings may also be specified in terms of taxon names (matching those used in the alignment) instead of numbers. <br />
*Positive and negative constraints cannot be mixed. <br />
*GARLI also accepts another constraint format that may be easier to use in some cases. This involves specifying a single branch to be constrained with a string of ‘*’ (asterisk) and ‘.’ (period) characters, with one character per taxon. Each taxon specified with a ‘*’ falls on one side of the constrained branch, and all those specified with a ‘.’ fall on the other. This should be familiar to anyone who has looked at PAUP* bootstrap output. With this format, a positive constraint on a grouping of taxa 1, 3 and 5 would look like this: <br />
+*.*.*... <br />
or equivalently like this: <br />
+.*.*.*** <br />
With this format each line only designates a single branch, so multiple constrained branches must be specified as multiple lines in the file. <br />
*The program also allows “backbone” constraints, which are simply constraints that do not include all of the taxa. For example if one wanted to constrain (1,3,5) but didn’t care where taxon 6 appeared in the tree, this string could be used: <br />
+((1,3,5),2,4,7,8); <br />
:Nothing special needs to be done to identify this as a backbone constraint, simply leave out some taxa.<br />
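<br />
For larger datasets it can be convenient to generate the asterisk and period form of a positive constraint rather than typing it. The helper functions below are hypothetical and are not part of GARLI. They assume taxa are numbered from 1 in alignment order.<br />

```python
def branch_mask(group, ntaxa):
    """Build a positive single-branch constraint string with one
    character per taxon: '*' for taxa in the group, '.' otherwise."""
    members = set(group)
    if not members or not all(1 <= t <= ntaxa for t in members):
        raise ValueError("group must contain taxon numbers from 1 to ntaxa")
    return "+" + "".join(
        "*" if taxon in members else "." for taxon in range(1, ntaxa + 1)
    )


def complement(mask):
    """Describe the same branch from the other side by swapping symbols."""
    swap = {"*": ".", ".": "*"}
    return mask[0] + "".join(swap[c] for c in mask[1:])


mask = branch_mask([1, 3, 5], 8)
print(mask)
print(complement(mask))
```

Each generated line describes a single branch, so several constrained branches would be written as several lines of the constraint file.<br />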
<br />
===streefname (source of starting tree and/or model)===<br />
'''streefname''' = (random, <u>'''stepwise'''</u>, <filename>) – Specifies where the starting tree topology and/or <br />
model parameters will come from. The tree topology may be a completely random topology <br />
(constraints will be enforced), a tree provided by the user in a file, <u>or a tree generated by the <br />
program using a fast ML stepwise-addition algorithm (see [[GARLI_Configuration_Settings#attachmentspertaxon_.28control_creation_of_stepwise_addition_starting_tree.29|attachmentspertaxon]] below)</u>. <br />
Starting or fixed model parameter values may also be provided in the specified file, with or <br />
without a tree topology. Some notes on starting trees/models: <br />
*Specified starting trees may have polytomies, which will be arbitrarily resolved before the run begins.<br />
*Starting tree formats: <br />
**Plain newick tree string (with taxon numbers or names, with or without branch lengths) <br />
**NEXUS trees block. The trees block can appear in the same file as a NEXUS data or characters block that contains the alignment, although the same filename should then be specified on both the datafname and streefname lines.<br />
*If multiple trees appear in the specified file and multiple search replicates are specified (see '''[[GARLI_Configuration_Settings#searchreps_.28number_of_independent_search_replicates.29|searchreps]]''' setting), then the first tree is used in the first replicate, the second in the second replicate, etc. <br />
*Providing model parameter values: see this page '''[[Specifying_model_parameter_values | Specifying model parameter values]]'''<br />
*See also the FAQ items on model parameters '''[[FAQ#Model_parameters | here]]'''.<br />
<br />
===attachmentspertaxon (control creation of stepwise addition starting tree)===<br />
'''attachmentspertaxon''' = (1 to infinity, '''50''') – The number of attachment branches evaluated for each taxon to be added to the tree during the creation of an ML stepwise-addition starting tree. Briefly, stepwise addition is an algorithm used to make a tree, and involves adding taxa in a random order to a growing tree. For each taxon to be added, a number of randomly chosen attachment branches are tried and scored, and then the best scoring one is chosen as the location of that taxon. The attachmentspertaxon setting controls how many attachment points are evaluated for each taxon to be added. A value of one is equivalent to a completely random tree (only one randomly chosen location is evaluated). A value of greater than 2 times the number of taxa in the dataset means that all attachment points will be evaluated for each taxon, and will result in very good starting trees (but may take a while on large datasets). Even fairly small values (< 10) can result in starting trees that are much, much better than random, but still fairly different from one another. <br />
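<br />
To get a feel for how much work a given attachmentspertaxon value implies, the number of attachment branches scored for each added taxon can be estimated as below. This is a sketch under assumptions: the function is hypothetical, the tree is taken to be unrooted and fully bifurcating, building is assumed to start from a three-taxon tree, and the number of branches tried is assumed to be capped at the number of branches that exist.<br />

```python
def attachments_evaluated(ntaxa, attachmentspertaxon=50):
    """Return the number of attachment branches scored when adding the
    4th, 5th, ... taxon to a growing unrooted bifurcating tree."""
    counts = []
    for k in range(4, ntaxa + 1):
        # An unrooted bifurcating tree with k - 1 taxa has 2(k - 1) - 3
        # branches, each a possible attachment point for taxon k.
        branches = 2 * (k - 1) - 3
        counts.append(min(attachmentspertaxon, branches))
    return counts


thorough = attachments_evaluated(8, attachmentspertaxon=50)
random_like = attachments_evaluated(8, attachmentspertaxon=1)
print(thorough, sum(thorough))
print(random_like, sum(random_like))
```

Because the largest tree encountered has fewer than twice as many branches as there are taxa, any value above 2 times the number of taxa evaluates every attachment point, as noted above.<br />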
===ofprefix (output filename prefix)===<br />
'''ofprefix''' = (text) – Prefix of all output filenames, such as log, treelog, etc. Change this for each run that you do or the program will overwrite previous results. <br />
===randseed (random number seed)===<br />
'''randseed''' = (-1 or positive integers, '''-1''') – The random number seed used by the random number generator. Specify “–1” to have a seed chosen for you. Specifying the same seed number in multiple runs will give exactly identical results, if all other parameters and settings are also identical. <br />
===availablememory (control maximum program memory usage)===<br />
'''availablememory''' – Typically this is the amount of available physical memory on the system, in megabytes. This lets GARLI determine how much system memory it may be able to use to store computations for reuse. The program will be conservative and use at most about 90% of this value, although for typical datasets it will use much, much less. If other programs must be open or used when GARLI is running, you may need to reduce this value. This setting can also have a significant effect on performance (speed), but more is not always better. When a run is started, GARLI will output the '''availablememory''' value necessary for that dataset to achieve each of the “memory levels” (from best to worst, termed “great”, “good”, “low”, and “very low”). More memory is generally better because more calculations are stored and can be reused, but when the amount of memory needed for “great” becomes more than 512 megabytes or so, performance can be slowed because the operating system has difficulty managing that much memory. In general, choose an amount of memory that allows “great” when this is less than 512 MB, and if it is greater reduce the amount of memory into “good” or “low” as necessary. Avoid “very low” whenever possible. You can find the value that is approximately optimal for your dataset by setting the '''randseed''' to some <br />
positive value (so that the searches are identical) and doing runs with various '''availablememory''' values. Typically it will only make at most a 30% difference, so it isn't worth worrying about too much. <br />
===logevery (frequency to log best score to file)===<br />
'''logevery''' = (1 to infinity, '''10''') – The frequency at which the best score is written to the log file. <br />
===saveevery (frequency to save best tree to file or write checkpoints)===<br />
'''saveevery''' = (1 to infinity, '''100''') – If '''writecheckpoints''' or '''outputcurrentbesttopology''' are specified, this is the frequency (in generations) at which checkpoints or the current best tree are written to file. <br />
===refinestart (whether to optimize a bit before starting a search)===<br />
'''refinestart''' = (0 or 1, '''1''') – Specifies whether some initial rough optimization is performed on the starting branch lengths and rate heterogeneity parameters. This is always recommended. <br />
===outputcurrentbesttopology (continuously write the best tree to file during a run)===<br />
'''outputcurrentbesttopology''' = (0 or 1, '''0''') – If true, the current best tree of the current search replicate is written to <ofprefix>.best.current.tre every '''saveevery''' generations. In versions before 0.96 the current best topology was always written to file, but that is no longer the case. Seeing the current best tree has no real use apart from satisfying your curiosity about how a <br />
run is going.<br />
===outputeachbettertopology (write each improved topology to file)===<br />
'''outputeachbettertopology''' (0 or 1, '''0''') – If true, each new topology encountered with a better score than the previous best is written to file. In some cases this can result in really big files (hundreds of MB) though, especially for random starting topologies on large datasets. Note that this file is interesting to get an idea of how the topology changed as the searches progressed, but the collection of trees should NOT be interpreted in any meaningful way. This option is not available while bootstrapping. <br />
===enforcetermconditions (use automatic termination)===<br />
'''enforcetermconditions''' = (0 or 1, '''1''') – Specifies whether the automatic termination conditions will be used. The conditions specified by both of the following two parameters must be met. See the following two parameters for their definitions. If this is false, the run will continue until it reaches the time ('''stoptime''') or generation ('''stopgen''') limit. It is highly recommended that this option be used! <br />
===genthreshfortopoterm (number of generations without topology improvement required for termination) ===<br />
'''genthreshfortopoterm''' = (1 to infinity, '''20,000''') – This specifies the first part of the termination condition. When no new significantly better scoring topology (see significanttopochange below) has been encountered in greater than this number of generations, this condition is met. Increasing this parameter may improve the lnL scores obtained (especially on large datasets), but will also increase runtimes. <br />
===scorethreshforterm (max score improvement over recent generations required for termination)===<br />
'''scorethreshforterm''' = (0 to infinity, '''0.05''') – The second part of the termination condition. When the total improvement in score over the last '''intervallength''' x '''intervalstostore''' generations (default is 500 generations, see below) is less than this value, this condition is met. This does not usually need to be changed. <br />
===significanttopochange (required score improvement for topology to be considered better)===<br />
'''significanttopochange''' = (0 to infinity, '''0.01''') – The lnL increase required for a new topology to be <br />
considered significant as far as the termination condition is concerned. It probably doesn’t <br />
need to be played with, but you might try increasing it slightly if your runs reach a stable <br />
score and then take a very long time to terminate due to very minor changes in topology. <br />
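<br />
The way the two termination conditions combine can be sketched as below. This is a simplified illustration and not GARLI's code: the function and its bookkeeping arguments are hypothetical, and only the comparisons stated in the descriptions above are modeled.<br />

```python
def should_terminate(gens_since_significant_topology,
                     recent_score_improvement,
                     genthreshfortopoterm=20000,
                     scorethreshforterm=0.05,
                     enforcetermconditions=True):
    """Return True when both automatic termination conditions are met.

    gens_since_significant_topology -- generations since a topology last
        improved the lnL by at least significanttopochange
    recent_score_improvement -- total lnL improvement over the last
        intervallength x intervalstostore generations
    """
    if not enforcetermconditions:
        # Without automatic termination only stopgen or stoptime end a run.
        return False
    topology_stalled = gens_since_significant_topology > genthreshfortopoterm
    score_stalled = recent_score_improvement < scorethreshforterm
    return topology_stalled and score_stalled


print(should_terminate(25000, 0.01))  # both conditions met
print(should_terminate(25000, 0.20))  # score still improving
print(should_terminate(5000, 0.01))   # a better topology was found recently
```

Raising genthreshfortopoterm in this sketch simply delays the point at which the first condition can become true, which mirrors the longer runtimes described above.<br />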
===outputphyliptree (write trees to file in Phylip as well as Nexus format)===<br />
'''outputphyliptree''' = (0 or 1, '''0''') – Whether phylip formatted tree files will be output in addition to <br />
the default nexus files for the best tree across all replicates (<ofprefix>.best.phy), the best <br />
tree for each replicate (<ofprefix>.best.all.phy) or in the case of bootstrapping, the best tree <br />
for each bootstrap replicate (<ofprefix>.boot.phy). <br />
===outputmostlyuselessfiles (output uninteresting files)===<br />
'''outputmostlyuselessfiles''' = (0 or 1, '''0''') – Whether to output three files of little general interest: the <br />
“fate”, “problog” and “swaplog” files. The fate file shows the parentage, mutation types and <br />
scores of every individual in the population during the entire search. The problog shows how <br />
the proportions of the different mutation types changed over the course of the run. The <br />
swaplog shows the number of unique swaps and the number of total swaps on the current <br />
best tree over the course of the run. <br />
===writecheckpoints (write checkpoint files during run)===<br />
'''writecheckpoints''' = (0 or 1, '''0''') – Whether to write three files to disk containing all information <br />
about the current state of the population every saveevery generations, with each successive <br />
checkpoint overwriting the previous one. These files can be used to restart a run at the last <br />
written checkpoint by setting the '''restart''' configuration entry.<br />
===restart (restart run from checkpoint)===<br />
'''restart''' = (0 or 1, '''0''') – Whether to restart at a previously saved checkpoint. To use this option the '''writecheckpoints''' option must have been used during a previous run. The program will look for checkpoint files that are named based on the ofprefix of the previous run. If you intend to restart a run, NOTHING should be changed in the config file except setting '''restart''' to 1. <br />
A run that is restarted from checkpoint will give ''exactly'' the same results it would have if the run had gone to completion. <br />
===outgroup (orient inferred trees consistently)===<br />
'''outgroup''' = (outgroup taxa numbers, separated by spaces) – This option allows for orienting the tree topologies in a consistent way when they are written to file. Note that this has NO effect whatsoever on the actual inference and the specified outgroup is NOT constrained to be present in the inferred trees. If multiple outgroup taxa are specified and they do not form a monophyletic group in the inferred tree, this setting will be ignored. If you specify a single outgroup taxon it will always be present, and the tree will always be consistently oriented. Ranges can be indicated with a hyphen, e.g., to specify an outgroup consisting of taxa 1, 2, 3 and 5 the format is this: <br />
outgroup = 1-3 5<br />
<br />
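To make the range syntax concrete, here is a small Python sketch that expands an outgroup string into taxon numbers. It is not GARLI's parser, just an illustration of how the hyphenated ranges are meant to be read; the function name is hypothetical.<br />

```python
def parse_outgroup(spec):
    """Expand an outgroup string such as '1-3 5' into a list of taxon numbers."""
    taxa = []
    for token in spec.split():
        if "-" in token:
            # A hyphen denotes an inclusive range of taxon numbers.
            start, end = token.split("-")
            taxa.extend(range(int(start), int(end) + 1))
        else:
            taxa.append(int(token))
    return taxa


print(parse_outgroup("1-3 5"))  # [1, 2, 3, 5]
```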
===searchreps (number of independent search replicates)===<br />
'''searchreps''' = (1 to infinity, '''2''') – The number of independent search replicates to perform during a program execution. You should always either do multiple search replicates or multiple program executions with any dataset to get a feel for whether you are getting consistent results, which suggests that the program is doing a decent job of searching. Note that if this is > 1 and you are performing a bootstrap analysis, this is the number of search replicates to be done per bootstrap replicate. That can increase the chance of finding the best tree per bootstrap replicate, but will also increase bootstrap runtimes enormously.<br />
===bootstrapreps (number of bootstrap replicates)===<br />
bootstrapreps (0 to infinity, '''0''') - The number of bootstrap reps to perform. If this is greater than <br />
0, normal searching will not be performed. The resulting bootstrap trees (one per rep) will be <br />
output to a file named <ofprefix>.boot.tre. To obtain the bootstrap proportions they will then <br />
need to be read into PAUP* or a similar program to obtain a majority rule consensus. Note <br />
that it is probably safe to reduce the strictness of the termination conditions during <br />
bootstrapping (perhaps halve genthreshfortopoterm), which will greatly speed up the <br />
bootstrapping process with negligible effects on the results.<br />
<br />
Note that GARLI does NOT do the bootstrap consensus or calculate the bootstrap proportions itself. It simply infers the trees that are the input to that consensus. The trees in the .boot.tre file will need to be read into a program such as SumTrees (or PAUP* or CONSENSE) to do a majority-rule consensus and obtain your bootstrap support values. See the following part of the "Advanced topics" page for suggestions on making the consensus: '''[[Advanced_topics#Detailed_Example:_A_bootstrap_analysis|Detailed Example: A bootstrap analysis]]'''.<br />
<br />
===resampleproportion (relative size of re-sampled data matrix)===<br />
resampleproportion (0.1 to 10, '''1.0''' ) – When bootstrapreps > 0, this setting allows for <br />
bootstrap-like resampling, but with the pseudoreplicate datasets having the number of <br />
alignment columns different from the real data. Setting values < 1.0 is somewhat similar to jackknifing, but not identical. <br />
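As an illustration of what resampling with a proportion other than 1.0 means, the following Python sketch draws alignment column indices with replacement. It is a generic sketch, not GARLI's implementation; in particular, rounding the product to get the pseudoreplicate length is an assumption made for the example.<br />

```python
import random


def resample_columns(num_columns, resampleproportion=1.0, seed=None):
    """Draw column indices (0-based) with replacement for one pseudoreplicate."""
    rng = random.Random(seed)
    # Assumption: the pseudoreplicate length is the rounded product of the
    # original alignment length and resampleproportion.
    num_resampled = int(round(num_columns * resampleproportion))
    return [rng.randrange(num_columns) for _ in range(num_resampled)]


# With 1000 columns and resampleproportion = 0.5, each pseudoreplicate
# holds 500 columns sampled with replacement from the original 1000.
columns = resample_columns(1000, 0.5, seed=1)
print(len(columns))  # 500
```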
===inferinternalstateprobs (infer ancestral states)===<br />
inferinternalstateprobs = (0 or 1, '''0''') – Specify 1 to have GARLI infer the marginal posterior <br />
probability of each character at each internal node. This is done at the very end of the run, <br />
just before termination. The results are output to a file named <ofprefix>.internalstates.log.<br />
<br />
===outputsitelikelihoods (write a file with the log-likelihood of each site)===<br />
<br />
'''outputsitelikelihoods''' = (0 or 1, '''0''') - Causes a file named <ofprefix>.sitelikes.log to be created containing the sitewise log-likelihood values for each individual site for the final tree of each search replicate. Note that the format of the file is exactly the same as that output by PAUP, so it can be directly used in CONSEL using the PAUP file-format option. For partitioned models the sites are listed in sets by partition subset, but the site numbers will match those in the original datamatrix. Thus, when using a partitioned model that divides sites by codon position, the ordering would be 1, 4, 7, ... 2, 5, 8, ... 3, 6, 9 .... Note that this makes no difference to CONSEL, which just assumes that the site likelihoods are ordered the same for each tree and ignores the site numbers. <br />
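The site ordering for a codon-position partition can be reproduced with a short Python sketch. The function name is hypothetical and the code only illustrates the ordering described above; it does not read or write GARLI files.<br />

```python
def codon_position_site_order(num_sites):
    """Site numbers (1-based) listed by codon position: 1st, then 2nd, then 3rd."""
    return [site
            for position in range(3)
            for site in range(position + 1, num_sites + 1, 3)]


print(codon_position_site_order(9))  # [1, 4, 7, 2, 5, 8, 3, 6, 9]
```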
<br />
Also note that using the site likelihoods from a partitioned model is a violation of CONSEL's assumptions, since it will not know about the partitions and its internal bootstrap resampling will not obey them. It isn't clear what the effects of this will be on the various tests. <br />
<br />
Finally, if you want to calculate the site likelihoods for a set of trees that you provide, use the '''[[GARLI_Configuration_Settings#optimizeinputonly_.28do_not_search.2C_only_optimize_model_and_branch_lengths_on_user_trees.29 | optimizeinputonly]]''' setting.<br />
<br />
===optimizeinputonly (do not search, only optimize model and branch lengths on user trees)===<br />
(new in version 2.0)<br />
<br />
'''optimizeinputonly''' = (0 or 1, '''0''') - Requires a NEXUS formatted input tree file (only NEXUS format will work!), specified through the '''[[GARLI_Configuration_Settings#streefname_.28source_of_starting_tree_and.2For_model.29 | streefname]]''' setting. All trees in that file will have model parameters and branch lengths optimized, giving their maximum likelihood scores. A file with the site-likelihoods of each tree will also be output. See the '''[[GARLI_Configuration_Settings#outputsitelikelihoods_.28write_a_file_with_the_log-likelihood_of_each_site.29 | outputsitelikelihoods]]''' setting for details.<br />
<br />
===collapsebranches (collapse zero length branches before writing final trees to file)===<br />
<br />
'''collapsebranches''' = (0 or 1, '''1''') - Before version 1.0, all trees that were returned were fully resolved. This was true even if the maximum-likelihood estimate of some internal branch lengths was effectively zero (or GARLI's minimum, which is 1e-8). In such cases, collapsing the branch into a polytomy would be a better representation. Note that GARLI will never return a tree with an actual branch length of zero, but rather with its minimum value of 1.0e-8. The drawback of always returning fully resolved trees is that what is effectively a polytomy can be resolved in three ways, and different independent searches may randomly return one of those resolutions. Thus, if you compare the trees by topology only, they will look different. If you pay attention to the branch lengths and likelihood scores of the trees it will be apparent that they are effectively the same.<br />
<br />
I think that collapsing of branches is particularly important when bootstrapping, since no support should be given to a branch that doesn't really exist, i.e., that is a random resolution of a polytomy. Collapsing is also good when calculating tree to tree distances such as the symmetric tree distance, for example when calculating phylogenetic error to a known target tree. Zero-length branches would add to the distances (~error) although they really should not.<br />
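The idea of collapsing minimum-length internal branches can be sketched in Python on a toy tree. This is only a conceptual illustration, not GARLI's code: the nested-dictionary tree representation and the function name are assumptions made for the example, and only the 1e-8 minimum comes from the description above.<br />

```python
MIN_BRANCH_LENGTH = 1e-8  # GARLI's documented minimum branch length


def collapse_short_branches(node, min_length=MIN_BRANCH_LENGTH):
    """Return a copy of the tree with minimum-length internal branches collapsed.

    Each node is a dict with keys 'name', 'length' and 'children' (a list).
    """
    new_children = []
    for child in node.get("children", []):
        collapsed = collapse_short_branches(child, min_length)
        if collapsed["children"] and collapsed["length"] <= min_length:
            # Internal branch that is effectively zero: attach its children
            # directly to the current node, creating a polytomy.
            new_children.extend(collapsed["children"])
        else:
            # Terminal branches and real internal branches are kept.
            new_children.append(collapsed)
    return {"name": node.get("name"),
            "length": node.get("length", 0.0),
            "children": new_children}


tree = {"name": "root", "length": 0.0, "children": [
    {"name": "A", "length": 0.1, "children": []},
    {"name": None, "length": 1e-8, "children": [
        {"name": "B", "length": 0.2, "children": []},
        {"name": "C", "length": 0.3, "children": []},
    ]},
]}
collapsed = collapse_short_branches(tree)
print([child["name"] for child in collapsed["children"]])  # ['A', 'B', 'C']
```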
<br />
==Model specification settings==<br />
<br />
With version 1.0 and later there are now many more options dealing with model specification because of <br />
the inclusion of amino acid and codon-based models. The description of the settings will be <br />
broken up by data type. Note that in terms of the model settings in GARLI, “empirical” means <br />
to fix parameter values at those observed in the dataset being analyzed, and “fixed” means to fix <br />
them at user specified values. See the '''[[GARLI_Configuration_Settings#streefname_.28source_of_starting_tree_and.2For_model.29|streefname]]''' setting for details on how to provide <br />
parameter values to be fixed during inference.<br />
<br />
'''PARTITIONED MODELS: To use partitioned models you'll still need to use the same basic model settings detailed below for each individual partition subset, and then some other details to set up the partition model itself. Be sure that you are familiar with the rest of this section, then see [[Using_partitioned_models | Using partitioned models]].'''<br />
<br />
===datatype (sequence type and inference model)=== <br />
'''datatype''' = ('''nucleotide''', aminoacid, codon-aminoacid, codon, standard, standardordered, standardvariable, standardvariableordered) – The type of data and model that is <br />
to be used during tree inference. Nucleotide and amino acid data are self explanatory.<br />
<br />
The codon-aminoacid datatype means that the data will be <br />
supplied as a nucleotide alignment, but will be internally translated and analyzed using an <br />
amino acid model. The codon and codon-aminoacid datatypes require nucleotide sequence <br />
that is aligned in the correct reading frame. In other words, all gaps in the alignment should <br />
be a multiple of 3 in length, and the alignment should start at the first position of a codon. If <br />
the alignment has extra columns at the start, middle or end, they should be removed or <br />
excluded with a Nexus exset (see '''[[FAQ#Can_I_specify_alignment_columns_of_my_data_matrix_to_be_excluded.3F | this FAQ item]]''' for an example of exset usage). The correct <br />
'''[[GARLI_Configuration_Settings#geneticcode_.28code_to_use_in_codon_translation.29 | geneticcode]]''' must also be set.<br />
<br />
(New in Version 2.0)<br />
<br />
The various "standard" datatypes are new in GARLI 2.0. These represent morphology-like discrete characters, with any number of states. These are also termed the "Mk" and "Mkv" models by Lewis (2001). See this page for more details on using standard data: '''[[Mkv morphology model]]'''.<br />
<br />
===Settings for datatype = nucleotide=== <br />
====ratematrix (relative rate parameters assumed by substitution model)====<br />
'''ratematrix''' = (1rate, 2rate, '''6rate''', fixed, custom string) – The number of relative substitution rate <br />
parameters (note that the number of free parameters is this value minus one). Equivalent to <br />
the “nst” setting in PAUP* and MrBayes. 1rate assumes that substitutions between all pairs <br />
of nucleotides occur at the same rate (JC model), 2rate allows different rates for transitions and <br />
transversions (K2P or HKY models), and 6rate allows a different rate between each nucleotide pair (GTR). These rates are <br />
estimated unless the fixed option is chosen. Since version 0.96, parameters for any <br />
submodel of the GTR model may be estimated. The format for specifying this is very <br />
similar to that used in the “rclass” setting of PAUP*. Within parentheses, six letters are <br />
specified, with spaces between them. The six letters represent the rates of substitution <br />
between the six pairs of nucleotides, with the order being A-C, A-G, A-T, C-G, C-T and G-T. <br />
Letters within the parentheses that are the same mean that a single parameter is shared by <br />
multiple nucleotide pairs. For example, <br />
ratematrix = (a b a a b a) <br />
would specify the HKY 2-rate model (equivalent to ratematrix = 2rate). This entry, <br />
ratematrix = (a b c c b a) <br />
would specify 3 estimated rates of substitution, with one rate shared by A-C and G-T <br />
substitutions, another rate shared by A-G and C-T substitutions, and the final rate shared by <br />
A-T and C-G substitutions.<br />
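The mapping from a custom ratematrix string to shared rate classes can be checked with a small Python sketch. It is not part of GARLI; the function name is hypothetical, and the pair order is the one given above (A-C, A-G, A-T, C-G, C-T, G-T).<br />

```python
PAIRS = ["A-C", "A-G", "A-T", "C-G", "C-T", "G-T"]


def parse_ratematrix(spec):
    """Group the six nucleotide pairs by the letter assigned to each."""
    letters = spec.strip().strip("()").split()
    if len(letters) != 6:
        raise ValueError("expected six rate classes, got %d" % len(letters))
    classes = {}
    for pair, letter in zip(PAIRS, letters):
        classes.setdefault(letter, []).append(pair)
    return classes


classes = parse_ratematrix("(a b c c b a)")
print(classes)
# {'a': ['A-C', 'G-T'], 'b': ['A-G', 'C-T'], 'c': ['A-T', 'C-G']}
print(len(classes) - 1)  # 2 free relative rate parameters
```

The same function applied to (a b a a b a) groups the four transversions under one letter and the two transitions under the other, which is the 2rate model.<br />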
<br />
====statefrequencies (equilibrium base frequencies assumed by substitution model)====<br />
'''statefrequencies''' = (equal, empirical, '''estimate''', fixed) – Specifies how the equilibrium state <br />
frequencies (A, C, G and T) are treated. The empirical setting fixes the frequencies at their <br />
observed proportions, and the other options should be self-explanatory. <br />
===For datatype = nucleotide or aminoacid===<br />
====invariantsites (treatment of proportion of invariable sites parameter)====<br />
'''invariantsites''' = (none, '''estimate''', fixed) – Specifies whether a parameter representing the <br />
proportion of sites that are unable to change (i.e. have a substitution rate of zero) will be <br />
included. This is typically referred to as “invariant sites”, but would better be termed <br />
“invariable sites”. <br />
====ratehetmodel (type of rate heterogeneity to assume for variable sites)====<br />
'''ratehetmodel''' = (none, '''gamma''', gammafixed) – The model of rate heterogeneity assumed. <br />
“gammafixed” requires that the alpha shape parameter is provided, and a setting of “gamma” <br />
estimates it. <br />
====numratecats (number of overall substitution rate categories)==== <br />
'''numratecats''' = (1 to 20, '''4''') – The number of categories of variable rates (not including the <br />
invariant site class if it is being used). Must be set to 1 if '''ratehetmodel''' is set to none. Note <br />
that runtimes and memory usage scale linearly with this setting. <br />
===For datatype = aminoacid or codon-aminoacid===<br />
Amino acid analyses are typically done using fixed rate matrices that have been estimated on <br />
large datasets and published. Typically the only model parameters that are estimated during tree <br />
inference relate to the rate heterogeneity distribution. Each of the named matrices also has <br />
corresponding fixed amino acid frequencies, and a given rate matrix can either be used with those <br />
frequencies or with the amino acid frequencies observed in your dataset. This second option is <br />
often denoted as “+F” in a model description, although in terms of the GARLI configuration <br />
settings this is referred to as “empirical” frequencies. In GARLI the Dayhoff model would be <br />
specified by setting both the ratematrix and statefrequencies options to “dayhoff”. The <br />
Dayhoff+F model would be specified by setting the ratematrix to “dayhoff”, and <br />
statefrequencies to “empirical”.<br />
<br />
The following named amino acid models are implemented: <br />
{| border="1"<br />
|-<br />
!'''ratematrix'''/'''statefrequencies''' setting !! reference <br />
|-<br />
|align="center" | dayhoff || Dayhoff, Schwartz and Orcutt. 1978.<br />
|-<br />
|align="center" | jones || Jones, Taylor and Thornton (JTT), 1992. <br />
|-<br />
|align="center" | WAG || Whelan and Goldman, 2001.<br />
|-<br />
|align="center" | mtREV || Adachi and Hasegawa, 1996.<br />
|-<br />
|align="center" | mtmam || Yang, Nielsen and Hasegawa, 1998.<br />
|}<br />
Note that most programs allow either the use of a named rate matrix and its corresponding state <br />
frequencies, or a named rate matrix and empirical frequencies. GARLI technically allows the <br />
mixing of different named matrices and equilibrium frequencies (for example, wag matrix with <br />
jones equilibrium frequencies), but this is not recommended. <br />
====ratematrix (amino acid substitution rates)====<br />
'''ratematrix''' = (poisson, jones, dayhoff, wag, mtmam, mtrev) – The fixed amino acid rate matrix to <br />
use. You should use the matrix that gives the best likelihood, and could use a program like <br />
PROTTEST (very much like MODELTEST, but for amino acid models) to determine which <br />
fits best for your data. Poisson assumes a single rate of substitution between all amino acid <br />
pairs, and is a very poor model.<br />
====statefrequencies (equilibrium base frequencies assumed by substitution model)====<br />
'''statefrequencies''' = (equal, empirical, estimate, fixed, jones, dayhoff, wag, mtmam, mtrev) – <br />
Specifies how the equilibrium state frequencies of the 20 amino acids are treated. The <br />
“empirical” option fixes the frequencies at their observed proportions (when describing a <br />
model this is often termed “+F”). <br />
===For datatype = codon===<br />
The codon models are built with three components: (1) parameters describing the process of <br />
individual nucleotide substitutions, (2) equilibrium codon frequencies, and (3) parameters <br />
describing the relative rate of nonsynonymous to synonymous substitutions. The nucleotide <br />
substitution parameters within the codon models are exactly the same as those possible with <br />
standard nucleotide models in GARLI, and are specified with the ratematrix configuration <br />
entry. Thus, they can be of the 2rate variety (inferring different rates for transitions and <br />
transversions, K2P or HKY-like), the 6rate variety (inferring different rates for all nucleotide <br />
pairs, GTR-like) or any other sub-model of GTR. The options for codon frequencies are <br />
specified with the statefrequencies configuration entry. The options are to use equal <br />
frequencies (not a good option), the frequencies observed in your dataset (termed “empirical” in <br />
GARLI), or the codon frequencies implied by the “F1x4” or “F3x4” methods (using PAML's <br />
terminology). These last two options calculate the codon frequencies as the product of the <br />
frequencies of the three nucleotides that make up each codon. In the “F1x4” case the nucleotide <br />
frequencies are those observed in the dataset across all codon positions, while the “F3x4” option <br />
uses the nucleotide frequencies observed in the data at each codon position separately. The final <br />
component of the codon models is the nonsynonymous to synonymous relative rate parameters <br />
(aka dN/dS or omega parameters). The default is to infer a single dN/dS value. Alternatively, a <br />
model can be specified that infers a given number of dN/dS categories, with the dN/dS values <br />
and proportions falling in each category estimated (ratehetmodel = nonsynonymous). This is <br />
the “discrete” or “M3” model in PAML's terminology. <br />
====ratematrix (relative ''nucleotide'' rate parameters assumed by codon model)====<br />
'''ratematrix''' = (1rate, '''2rate''', 6rate, fixed, custom string) – This determines the relative rates of <br />
nucleotide substitution assumed by the codon model. The options are exactly the same as <br />
those allowed under a normal nucleotide model. A codon model with '''ratematrix''' = 2rate <br />
specifies the standard Goldman and Yang (1994) model, with different substitution rates for <br />
transitions and transversions. <br />
====statefrequencies (equilibrium codon frequencies)====<br />
'''statefrequencies''' = (equal, empirical, f1x4, '''f3x4''') - The options are to use equal codon frequencies <br />
(not a good option), the frequencies observed in your dataset (termed “empirical” in GARLI), <br />
or the codon frequencies implied by the “F1x4” or “F3x4” methods (using PAML's <br />
terminology). These last two options calculate the codon frequencies as the product of the <br />
frequencies of the three nucleotides that make up each codon. In the “F1x4” case the <br />
nucleotide frequencies are those observed in the dataset across all codon positions, while the <br />
“F3x4” option uses the nucleotide frequencies observed in the data at each codon position <br />
separately. <br />
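The F1x4 and F3x4 calculations can be sketched in Python as follows. This is an illustration rather than GARLI's code: the function names are made up, the stop codons are those of the standard genetic code, and dropping the stop codons and renormalizing the remaining products so that they sum to one is an assumption made here (it follows the usual convention for these models).<br />

```python
from itertools import product

BASES = "ACGT"
STANDARD_STOPS = {"TAA", "TAG", "TGA"}  # stop codons of the standard code


def codon_frequencies(position_freqs, stop_codons=STANDARD_STOPS):
    """Codon frequencies as products of per-position nucleotide frequencies.

    position_freqs: three dicts (one per codon position) mapping base -> frequency.
    """
    raw = {}
    for codon in map("".join, product(BASES, repeat=3)):
        if codon in stop_codons:
            continue  # assumption: stop codons are excluded from the model
        raw[codon] = (position_freqs[0][codon[0]]
                      * position_freqs[1][codon[1]]
                      * position_freqs[2][codon[2]])
    total = sum(raw.values())
    # Assumption: renormalize so the sense-codon frequencies sum to 1.
    return {codon: value / total for codon, value in raw.items()}


def f1x4(base_freqs):
    """One set of nucleotide frequencies shared by all three codon positions."""
    return codon_frequencies([base_freqs, base_freqs, base_freqs])


def f3x4(pos1_freqs, pos2_freqs, pos3_freqs):
    """Separate nucleotide frequencies for each codon position."""
    return codon_frequencies([pos1_freqs, pos2_freqs, pos3_freqs])
```

With equal nucleotide frequencies both functions reduce to equal frequencies for the 61 sense codons.<br />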
====ratehetmodel (variation in dN/dS across sites)==== <br />
'''ratehetmodel''' = ('''none''', nonsynonymous) – For codon models, the default is to infer a single dN/dS <br />
parameter. Alternatively, a model can be specified that infers a given number of dN/dS <br />
categories, with the dN/dS values and proportions falling in each category estimated <br />
('''ratehetmodel''' = nonsynonymous). This is the “discrete” or “M3” model of Yang et al. <br />
(2000). <br />
====numratecats (number of discrete dN/dS categories)====<br />
'''numratecats''' = (1 to 20, '''1''') – When ratehetmodel = nonsynonymous, this is the number of dN/dS parameter <br />
categories.<br />
====invariantsites====<br />
'''invariantsites''' = (none) - '''NOTE:''' Due to an error on my part, the '''invariantsites''' entry must appear with codon models for them to run (despite the fact that it doesn't apply). i.e., be sure that this appears:<br />
invariantsites = none.<br />
<br />
===For datatype = codon or codon-aminoacid=== <br />
====geneticcode (code to use in codon translation)====<br />
'''geneticcode''' = ('''standard''', vertmito, invertmito) – The genetic code to be used in translating codons <br />
into amino acids.<br />
<br />
==Population settings==<br />
===nindivs (number of individuals in population)===<br />
'''nindivs''' = (2 to 100, '''4''')- The number of individuals in the population. This may be increased, but <br />
doing so is generally not beneficial. Note that typical genetic algorithms tend to have much, <br />
much larger population sizes than GARLI's defaults. <br />
===holdover (unmutated copies of best individual)===<br />
'''holdover''' = (1 to nindivs-1, '''1''')- The number of times the best individual is copied to the next <br />
generation with no chance of mutation. It is best not to mess with this. <br />
===selectionintensity (strength of selection)===<br />
'''selectionintensity''' = (0.01 to 5.0, '''0.5''')- Controls the strength of selection, with larger numbers <br />
denoting stronger selection. The relative probability of reproduction of two individuals <br />
depends on the difference in their log likelihoods (&Delta;lnL) and is formulated very similarly to <br />
the procedure of calculating Akaike weights. The relative probability of reproduction of the <br />
less fit individual is equal to:<br />
<br />
<big><big><br />
''e''<sup> (-selectionIntensity &times; &Delta; lnL)</sup><br />
</big></big><br />
<br />
In general, this setting does not seem to have much of an effect on the progress of a run. In <br />
theory higher values should cause scores to increase more quickly, but make the search more <br />
likely to be entrapped in a local optimum. Low values will increase runtimes, but may be <br />
more likely to reach the true optimum. The following table gives the relative probabilities of <br />
reproduction for different values of the selection intensity when the difference in log <br />
likelihood is 1.0<br />
{| border="1"<br />
|-<br />
!'''selectionintensity''' value !! Ratio of probabilities of reproduction <br />
|-<br />
|align="center" |0.05 || align="center" |0.95:1.0 <br />
|-<br />
|align="center" |0.1 || align="center" |0.90:1.0 <br />
|-<br />
|align="center" |0.25 || align="center" |0.78:1.0 <br />
|-<br />
|align="center" |0.5 || align="center" |0.61:1.0 <br />
|-<br />
|align="center" |0.75 || align="center" |0.47:1.0 <br />
|-<br />
|align="center" |1.0 || align="center" |0.37:1.0 <br />
|-<br />
|align="center" |2.0 || align="center" |0.14:1.0<br />
|}<br />
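The values in this table follow directly from the formula above and can be reproduced with a few lines of Python. The function name is only illustrative.<br />

```python
import math


def relative_reproduction_probability(selectionintensity, delta_lnl):
    """Relative probability of reproduction of the less fit individual."""
    return math.exp(-selectionintensity * delta_lnl)


# Reproduce the table for a log-likelihood difference of 1.0.
for s in (0.05, 0.1, 0.25, 0.5, 0.75, 1.0, 2.0):
    ratio = relative_reproduction_probability(s, 1.0)
    print("%-5s %.2f:1.0" % (s, ratio))
```

Larger lnL differences shrink the ratio quickly: with the default of 0.5, an individual that is 5 lnL units worse has a relative probability of about 0.08.<br />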
===holdoverpenalty (fitness handicap for best individual)===<br />
'''holdoverpenalty''' = (0 to 100, '''0''') – This can be used to bias the probability of reproduction of the <br />
best individual downward. Because the best individual is automatically copied into the next <br />
generation, it has a bit of an unfair advantage and can cause all population variation to be lost <br />
due to genetic drift, especially with small population sizes. The value specified here is <br />
subtracted from the best individual’s lnL score before calculating the probabilities of<br />
reproduction. It seems plausible that this might help maintain variation, but I have not seen it <br />
cause a measurable effect.<br />
<br />
===stopgen (maximum number of generations to run)===<br />
'''stopgen''' – The maximum number of generations to run. Note that this supersedes the automated <br />
stopping criterion (see enforcetermconditions above), and should therefore be set to a very <br />
large value if automatic termination is desired. <br />
===stoptime (maximum time to run)===<br />
'''stoptime''' – The maximum number of seconds for the run to continue. Note that this supersedes <br />
the automated stopping criterion (see enforcetermconditions above), and should therefore <br />
be set to a very large value if automatic termination is desired.<br />
<br />
==Branch-length optimization settings== <br />
After a topological rearrangement, branch lengths in the vicinity of the rearrangement are <br />
optimized by the Newton-Raphson method. Optimization passes are performed on a particular <br />
branch until the expected improvement in likelihood for the next pass is less than a threshold <br />
value, termed the optimization precision. Note that this name is somewhat misleading, as the <br />
precision of the optimization algorithm is inversely related to this value (i.e., smaller values of <br />
the optimization precision lead to more precise optimization). If the improvement in likelihood <br />
due to optimization for a particular branch is greater than the optimization precision, <br />
optimization is also attempted on adjacent branches, spreading out across the tree. When no new <br />
topology with a better likelihood score is discovered for a while, the value is automatically <br />
reduced. The value can have a large effect on speed, with smaller values significantly slowing <br />
down the algorithm. The value of the optimization precision and how it changes over the course <br />
of a run are determined by the following three parameters. <br />
===startoptprec===<br />
startoptprec (0.005 to 5.0, '''0.5''') – The beginning optimization precision. <br />
===minoptprec===<br />
minoptprec (0.001 to startoptprec, '''0.01''') – The minimum allowed value of the optimization precision.<br />
<br />
===numberofprecreductions===<br />
numberofprecreductions (0 to 100, '''10''') – Specify the number of steps that it will take for the <br />
optimization precision to decrease (linearly) from startoptprec to minoptprec. <br />
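The linear schedule implied by these three settings can be written out with a short Python sketch. This is an illustration under stated assumptions, not GARLI's code: the function name is made up, and keeping the precision at startoptprec when numberofprecreductions is 0 is an assumption.<br />

```python
def optimization_precision_schedule(startoptprec=0.5, minoptprec=0.01,
                                    numberofprecreductions=10):
    """Successive optimization precision values, decreasing linearly."""
    if numberofprecreductions == 0:
        # Assumption: with no reductions the precision never changes.
        return [startoptprec]
    step = (startoptprec - minoptprec) / numberofprecreductions
    return [startoptprec - i * step
            for i in range(numberofprecreductions + 1)]


schedule = optimization_precision_schedule()
print(["%.3f" % value for value in schedule])
# ['0.500', '0.451', '0.402', '0.353', '0.304', '0.255',
#  '0.206', '0.157', '0.108', '0.059', '0.010']
```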
===treerejectionthreshold===<br />
treerejectionthreshold (0 to 500, '''50''') – This setting controls which trees have more extensive <br />
branch-length optimization applied to them. All trees created by a branch swap receive <br />
optimization on a few branches that directly took part in the rearrangement. If the difference <br />
in score between the partially optimized tree and the best known tree is greater than treerejectionthreshold, no further optimization is applied to the branches of that tree. <br />
Reducing this value can significantly reduce runtimes, often with little or no effect on results. <br />
However, it is possible that a better tree could be missed if this is set too low. In cases in <br />
which obtaining the very best tree per search is not critical (e.g., bootstrapping), setting this <br />
lower (~20) is probably safe.<br />
<br />
==Settings controlling the proportions of the mutation types== <br />
Each mutation type is assigned a prior ''weight''. These values determine the expected <br />
proportions of the various mutation types that are performed. The primary mutation categories <br />
are ''topology'' (t), ''model'' (m) and ''branch length'' (b). Each is assigned a prior weight ( P<sub>i</sub> ) in the <br />
config file. Each time that a new best likelihood score is attained, the amount of the increase in <br />
score is credited to the mutation type responsible, with the sum of the increases ( S<sub>i</sub> ) maintained <br />
over the last intervallength x intervalstostore generations. The number of times that each <br />
mutation is performed ( N<sub>i</sub> ) is also tallied. The total weight of a mutation type is W<sub>i</sub> = P<sub>i</sub> + ( S<sub>i</sub> / N<sub>i</sub> ). <br />
The proportion of mutations of type i out of all mutations is then <br />
<br />
<big><big><br />
Pr(i) = W<sub>i</sub> / <br />
(W<sub>t</sub> + W<sub>m</sub> + W<sub>b</sub>) <br />
</big></big><br />
<br />
The proportion of each mutation is thus related to its prior weight and the average increase in <br />
score that it has caused over recent generations. The prior weights can be used to control the <br />
expected (and starting) proportions of the mutation types, as well as how sensitive the <br />
proportions are to the course of events in a run. It is generally a good idea to make the topology <br />
prior much larger than the others so that when no mutations are improving the score many <br />
topology mutations are still attempted. If you set '''outputmostlyuselessfiles''' to 1, you can look at the “problog” file to determine what the <br />
proportions of the mutations actually were over the course of a run. <br />
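The weighting scheme can be worked through numerically with a small Python sketch. It follows the formulas above, but it is not GARLI's code: the function and dictionary key names are hypothetical, and treating a mutation type that has not yet been performed (N<sub>i</sub> = 0) as contributing only its prior weight is an assumption. The prior weights used are the documented defaults for topoweight, modweight and brlenweight.<br />

```python
def mutation_proportions(prior_weights, score_gains, counts):
    """Pr(i) = W_i / sum(W), with W_i = P_i + S_i / N_i."""
    weights = {}
    for kind, prior in prior_weights.items():
        n = counts.get(kind, 0)
        # Assumption: a type that has not been performed yet adds nothing
        # beyond its prior weight.
        average_gain = score_gains.get(kind, 0.0) / n if n > 0 else 0.0
        weights[kind] = prior + average_gain
    total = sum(weights.values())
    return {kind: weight / total for kind, weight in weights.items()}


priors = {"topology": 1.0, "model": 0.05, "branch_length": 0.2}

# At the start of a run only the priors matter: 0.80, 0.04 and 0.16.
start = mutation_proportions(priors, {}, {})

# Later, credited score increases shift the proportions.
later = mutation_proportions(
    priors,
    score_gains={"topology": 2.0, "model": 0.5, "branch_length": 0.0},
    counts={"topology": 100, "model": 10, "branch_length": 50},
)
```

In the second call the model mutations have the largest average gain per attempt (0.05), so their share rises from 0.04 to about 0.08 while the topology share drops slightly.<br />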
===topoweight (weight on topology mutations)===<br />
topoweight (0 to infinity, '''1.0''') The prior weight assigned to the class of topology mutations <br />
(NNI, SPR and limSPR). Note that setting this to 0.0 turns off topology mutations, meaning that the tree topology is fixed for the run. This used to be a way to have the program estimate only model parameters and branch-lengths, but the '''[[GARLI_Configuration_Settings#optimizeinputonly_.28do_not_search.2C_only_optimize_model_and_branch_lengths_on_user_trees.29 | optimizeinputonly]]''' setting is now a better way to go.<br />
<br />
===modweight (weight on model parameter mutations)===<br />
modweight (0 to infinity, '''0.05''') The prior weight assigned to the class of model mutations. Note <br />
that setting this at 0.0 fixes the model during the run. <br />
===brlenweight (weight on branch-length parameter mutations)===<br />
brlenweight (0 to infinity, '''0.2''') The prior weight assigned to branch-length mutations.<br />
<br />
The same procedure used above to determine the proportion of Topology:Model:Branch-Length <br />
mutations is also used to determine the relative proportions of the three types of topological <br />
mutations (NNI:SPR:limSPR), controlled by the following three weights. Note that the <br />
proportion of mutations applied to each of the model parameters is not user controlled. <br />
===randnniweight (weight on NNI topology changes)===<br />
randnniweight (0 to infinity, '''0.1''') - The prior weight assigned to NNI mutations. <br />
===randsprweight (weight on SPR topology changes)===<br />
randsprweight (0 to infinity, '''0.3''') - The prior weight assigned to random SPR mutations. For <br />
very large datasets it is often best to set this to 0.0, as random SPR mutations essentially <br />
never result in score increases. <br />
===limsprweight (weight on localized SPR topology changes)===<br />
limsprweight (0 to infinity, '''0.6''') - The prior weight assigned to SPR mutations with the <br />
reconnection branch limited to being a maximum of limsprrange branches away from where <br />
the branch was detached. <br />
===intervallength===<br />
intervallength (10 to 1000, '''100''') – The number of generations in each interval during which the <br />
number and benefit of each mutation type are stored. <br />
===intervalstostore===<br />
intervalstostore = (1 to 10, '''5''') – The number of intervals to be stored. Thus, records of <br />
mutations are kept for the last (intervallength x intervalstostore) generations. Every <br />
intervallength generations the probabilities of the mutation types are updated by the scheme <br />
described above.<br />
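
As a rough illustration of how the prior weights in this section relate to one another, the following Python sketch normalizes the default weights into starting proportions and computes the length of the record-keeping window. It assumes that the starting proportions are simply each weight divided by the sum of the weights; the adaptive updating that happens during a run is not modelled, and the function name is hypothetical.

```python
def starting_proportions(weights):
    """Divide each prior weight by the total (an assumed simplification)."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    return {name: value / total for name, value in weights.items()}


# Default prior weights for the three mutation classes.
mutation_classes = starting_proportions(
    {"topoweight": 1.0, "modweight": 0.05, "brlenweight": 0.2}
)

# Default prior weights for the three kinds of topology mutation.
topology_moves = starting_proportions(
    {"randnniweight": 0.1, "randsprweight": 0.3, "limsprweight": 0.6}
)

# Mutation records cover the last intervallength x intervalstostore generations.
intervallength = 100
intervalstostore = 5
generations_remembered = intervallength * intervalstostore

print(mutation_classes)
print(topology_moves)
print(generations_remembered)
```

With the defaults, topology mutations start out as the large majority of all mutations, and records are kept for the last 500 generations.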
<br />
==Settings controlling mutation details==<br />
===limsprrange (max range for localized SPR topology changes)===<br />
limsprrange (0 to infinity, '''6''') – The maximum number of branches away from its original <br />
location that a branch may be reattached during a limited SPR move. Setting this too high (> <br />
10) can seriously degrade performance, but if you do so in conjunction with a large increase in genthreshfortopoterm you might end up with better trees.<br />
===meanbrlenmuts (mean # of branch lengths to change per mutation)===<br />
meanbrlenmuts (1 to # taxa, '''5''') - The mean of the binomial distribution from which the number <br />
of branch lengths mutated is drawn during a branch length mutation. <br />
===gammashapebrlen (magnitude of branch-length mutations)===<br />
gammashapebrlen (50 to 2000, '''1000''') - The shape parameter of the gamma distribution (with a <br />
mean of 1.0) from which the branch-length multipliers are drawn for branch-length <br />
mutations. Larger numbers cause smaller changes in branch lengths. (Note that this has <br />
nothing to do with gamma rate heterogeneity.) <br />
===gammashapemodel (magnitude of model parameter mutations)===<br />
gammashapemodel (50 to 2000, '''1000''') - The shape parameter of the gamma distribution (with a <br />
mean of 1.0) from which the model mutation multipliers are drawn for model parameters <br />
mutations. Larger numbers cause smaller changes in model parameters. (Note that this has <br />
nothing to do with gamma rate heterogeneity.) <br />
===uniqueswapbias (relative weight assigned to already attempted branch swaps)===<br />
uniqueswapbias (0.01 to 1.0, '''0.1''') – With version 0.95 and later, GARLI keeps track of which branch <br />
swaps it has attempted on the current best tree. Because swaps are applied randomly, it is <br />
possible that some swaps are tried twice before others are tried at all. This option allows the <br />
program to bias the swaps applied toward those that have not yet been attempted. Each swap <br />
is assigned a relative weight depending on the number of times that it has been attempted on <br />
the current best tree. This weight is equal to (uniqueswapbias) raised to the (# times swap <br />
attempted) power. In other words, a value of 0.5 means that swaps that have already been <br />
tried once will be half as likely as those not yet attempted, swaps attempted twice will be ¼ <br />
as likely, etc. A value of 1.0 means no biasing. If this value is not equal to 1.0 and the <br />
'''outputmostlyuselessfiles''' option is on, a file called <ofprefix>.swap.log is output. This file <br />
shows the total number of rearrangements tried and the number of unique ones over the course <br />
of a run. Note that this bias is only applied to NNI and limSPR rearrangements. Use of this <br />
option may allow the use of somewhat larger values of limsprrange. <br />
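
The weighting rule described above can be sketched in a few lines of Python. This only illustrates the stated formula (uniqueswapbias raised to the number of previous attempts); the function name is hypothetical and is not part of GARLI.

```python
def unique_swap_weight(uniqueswapbias, times_attempted):
    """Relative weight of a swap already tried `times_attempted` times."""
    if not 0.0 < uniqueswapbias <= 1.0:
        raise ValueError("uniqueswapbias is expected to lie in (0, 1]")
    return uniqueswapbias ** times_attempted


# Relative weights for swaps tried 0, 1, 2 and 3 times with a bias of 0.5.
for attempts in range(4):
    print(attempts, unique_swap_weight(0.5, attempts))
```

A bias of 1.0 gives every swap a weight of 1.0 regardless of how often it has been tried, which is why that value disables the biasing.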
===distanceswapbias (relative weight assigned to branch swaps based on locality)===<br />
distanceswapbias (0.1 to 10, '''1.0''') – This option is similar to uniqueswapbias, except that it <br />
biases toward certain swaps based on the topological distance between the initial and <br />
rearranged trees. The distance is measured as in the limsprrange, and is half the <br />
Robinson-Foulds distance between the trees. As with uniqueswapbias, distanceswapbias <br />
assigns a relative weight to each potential swap. In this case the weight is <br />
(distanceswapbias) raised to the (reconnection distance - 1) power. Thus, given a value of <br />
0.5, the weight of an NNI is 1.0, the weight of an SPR with distance 2 is 0.5, with distance 3 <br />
is 0.25, etc. Note that values less than 1.0 bias toward more localized swaps, while values <br />
greater than 1.0 bias toward more extreme swaps. Also note that this bias is only applied to <br />
limSPR rearrangements. Be careful in setting this, as extreme values can have a very large <br />
effect.</div>Zwicklhttps://molevol.mbl.edu/index.php?title=GARLI_Configuration_Settings&diff=4377GARLI Configuration Settings2015-07-21T15:51:31Z<p>Zwickl: </p>
<hr />
<div>==Descriptions of GARLI configuration settings==<br />
The format for these configuration settings descriptions is generally:<br />
'''entryname''' (possible values, '''default value in bold''') – description<br />
<br />
==General settings==<br />
===datafname (file containing sequence dataset)=== <br />
'''datafname''' = (filename) –<br />
Name of the file containing the aligned sequence data. Formats accepted are <br />
PHYLIP, NEXUS and FASTA. Robust reading of Nexus formatted <br />
datasets is done using the Nexus Class Library. This accommodates things such as interleaved <br />
alignments and exclusion sets (exset’s) in NEXUS assumptions blocks (see the FAQ for an <br />
example of [[FAQ#Can_I_specify_alignment_columns_of_my_data_matrix_to_be_excluded.3F | exset usage]]). Use of NEXUS files is recommended, and is required for partitioned models.<br />
<br />
===constraintfile (file containing constraint definition)===<br />
'''constraintfile''' = (filename, '''none''') –<br />
Name of the file containing any topology constraint specifications, or “none” if <br />
there are no constraints. The easiest way to explain the format of the constraint file is by <br />
example. Consider a dataset of 8 taxa, in which your constraint consists of grouping taxa 1, 3 <br />
and 5. You may specify either positive constraints (inferred tree MUST contain constrained <br />
group) or negative constraints (also called converse constraints, inferred tree CANNOT <br />
contain constrained group). These are specified with either a ‘+’ or a ‘-’ at the beginning of <br />
the constraint specification, for positive and negative constraints, respectively. <br />
*For a positive constraint on a grouping of taxa 1, 3 and 5: <br />
+((1,3,5),2,4,6,7,8); <br />
*For a negative constraint on a grouping of taxa 1, 3 and 5: <br />
-((1,3,5),2,4,6,7,8); <br />
*Note that there are many other equivalent parenthetical representations of these constraints. <br />
*Multiple groups may be positively constrained, but currently only a single negatively constrained group is allowed. <br />
*Multiple constrained groupings may be specified in a single string: <br />
+((1,3,5),2,4,(6,7),8); <br />
or in two separate strings on successive lines: <br />
+((1,3,5),2,4,6,7,8); <br />
+(1,3,5,2,4,(6,7),8); <br />
*Constraint strings may also be specified in terms of taxon names (matching those used in the alignment) instead of numbers. <br />
*Positive and negative constraints cannot be mixed. <br />
*GARLI also accepts another constraint format that may be easier to use in some cases. This involves specifying a single branch to be constrained with a string of ‘*’ (asterisk) and ‘.’ (period) characters, with one character per taxon. Each taxon specified with a ‘*’ falls on one side of the constrained branch, and all those specified with a ‘.’ fall on the other. This should be familiar to anyone who has looked at PAUP* bootstrap output. With this format, a positive constraint on a grouping of taxa 1, 3 and 5 would look like this: <br />
 +*.*.*... <br />
or equivalently like this: <br />
+.*.*.*** <br />
With this format each line only designates a single branch, so multiple constrained branches must be specified as multiple lines in the file. <br />
*The program also allows “backbone” constraints, which are simply constraints that do not include all of the taxa. For example if one wanted to constrain (1,3,5) but didn’t care where taxon 6 appeared in the tree, this string could be used: <br />
+((1,3,5),2,4,7,8); <br />
:Nothing special needs to be done to identify this as a backbone constraint, simply leave out some taxa.<br />
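
For datasets with many taxa, writing the ‘*’ and ‘.’ strings by hand is error prone. The following Python sketch builds a positive constraint line in that format from a list of taxon numbers, and also produces the equivalent complementary form. The function names are hypothetical helpers, not part of GARLI.

```python
def star_dot_constraint(num_taxa, group):
    """Return a positive constraint with '*' for taxa in `group`, '.' otherwise."""
    members = set(group)
    if not members or not all(1 <= taxon <= num_taxa for taxon in members):
        raise ValueError("group must contain taxon numbers between 1 and num_taxa")
    marks = ("*" if taxon in members else "." for taxon in range(1, num_taxa + 1))
    return "+" + "".join(marks)


def complement(constraint):
    """Swap '*' and '.', which describes the same branch from the other side."""
    swap = {"*": ".", ".": "*"}
    return constraint[0] + "".join(swap[mark] for mark in constraint[1:])


line = star_dot_constraint(8, [1, 3, 5])
print(line)
print(complement(line))
```

Each generated line describes a single branch, so several constrained branches would be written as several lines in the constraint file.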
<br />
===streefname (source of starting tree and/or model)===<br />
'''streefname''' = (random, <u>'''stepwise'''</u>, <filename>) – Specifies where the starting tree topology and/or <br />
model parameters will come from. The tree topology may be a completely random topology <br />
(constraints will be enforced), a tree provided by the user in a file, <u>or a tree generated by the <br />
program using a fast ML stepwise-addition algorithm (see [[GARLI_Configuration_Settings#attachmentspertaxon_.28control_creation_of_stepwise_addition_starting_tree.29|attachmentspertaxon]] below)</u>. <br />
Starting or fixed model parameter values may also be provided in the specified file, with or <br />
without a tree topology. Some notes on starting trees/models: <br />
*Specified starting trees may have polytomies, which will be arbitrarily resolved before the run begins.<br />
*Starting tree formats: <br />
**Plain newick tree string (with taxon numbers or names, with or without branch lengths) <br />
**NEXUS trees block. The trees block can appear in the same file as a NEXUS data or characters block that contains the alignment, although the same filename should then be specified on both the datafname and streefname lines.<br />
*If multiple trees appear in the specified file and multiple search replicates are specified (see '''[[GARLI_Configuration_Settings#searchreps_.28number_of_independent_search_replicates.29|searchreps]]''' setting), then the first tree is used in the first replicate, the second in the second replicate, etc. <br />
*Providing model parameter values: see this page '''[[Specifying_model_parameter_values | Specifying model parameter values]]'''<br />
*See also the FAQ items on model parameters '''[[FAQ#Model_parameters | here]]'''.<br />
<br />
===attachmentspertaxon (control creation of stepwise addition starting tree)===<br />
'''attachmentspertaxon''' = (1 to infinity, '''50''') – The number of attachment branches evaluated for each taxon to be added to the tree during the creation of an ML stepwise-addition starting tree. Briefly, stepwise addition is an algorithm used to make a tree, and involves adding taxa in a random order to a growing tree. For each taxon to be added, a number of randomly chosen attachment branches are tried and scored, and then the best scoring one is chosen as the location of that taxon. The attachmentspertaxon setting controls how many attachment points are evaluated for each taxon to be added. A value of one is equivalent to a completely random tree (only one randomly chosen location is evaluated). A value of greater than 2 times the number of taxa in the dataset means that all attachment points will be evaluated for each taxon, and will result in very good starting trees (but may take a while on large datasets). Even fairly small values (< 10) can result in starting trees that are much, much better than random, but still fairly different from one another. <br />
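
To see why a value greater than twice the number of taxa always evaluates every attachment point, recall that an unrooted, fully resolved tree with k taxa has 2k - 3 branches. The Python sketch below uses that fact to estimate how many attachment branches would be tried at each step. It is an illustration under that assumption only, and the function names are hypothetical.

```python
def attachments_evaluated(attachmentspertaxon, taxa_in_tree):
    """Branches tried when adding one taxon to a tree holding `taxa_in_tree` taxa."""
    available_branches = 2 * taxa_in_tree - 3  # unrooted binary tree
    return min(attachmentspertaxon, available_branches)


def evaluates_all_branches(attachmentspertaxon, num_taxa):
    """True if even the last taxon added has every branch evaluated."""
    branches_at_last_step = 2 * (num_taxa - 1) - 3
    return attachmentspertaxon >= branches_at_last_step


print(attachments_evaluated(50, 10))    # small tree: fewer than 50 branches exist
print(attachments_evaluated(50, 100))   # large tree: capped at the setting
print(evaluates_all_branches(50, 64))
print(evaluates_all_branches(2 * 64 + 1, 64))
```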
===ofprefix (output filename prefix)===<br />
'''ofprefix''' = (text) – Prefix of all output filenames, such as log, treelog, etc. Change this for each run that you do or the program will overwrite previous results. <br />
===randseed (random number seed)===<br />
'''randseed''' = (-1 or positive integers, '''-1''') – The random number seed used by the random number generator. Specify “-1” to have a seed chosen for you. Specifying the same seed number in multiple runs will give exactly identical results, if all other parameters and settings are also identical. <br />
===availablememory (control maximum program memory usage)===<br />
'''availablememory''' – Typically this is the amount of available physical memory on the system, in megabytes. This lets GARLI determine how much system memory it may be able to use to store computations for reuse. The program will be conservative and use at most about 90% of this value, although for typical datasets it will use much, much less. If other programs must be open or used when GARLI is running, you may need to reduce this value. This setting can also have a significant effect on performance (speed), but more is not always better. When a run is started, GARLI will output the '''availablememory''' value necessary for that dataset to achieve each of the “memory levels” (from best to worst, termed “great”, “good”, “low”, and “very low”). More memory is generally better because more calculations are stored and can be reused, but when the amount of memory needed for “great” becomes more than 512 megabytes or so, performance can be slowed because the operating system has difficulty managing that much memory. In general, choose an amount of memory that allows “great” when this is less than 512 MB, and if it is greater reduce the amount of memory into “good” or “low” as necessary. Avoid “very low” whenever possible. You can find the value that is approximately optimal for your dataset by setting the '''randseed''' to some <br />
positive value (so that the searches are identical) and doing runs with various '''availablememory''' values. Typically it will only make at most a 30% difference, so it isn't worth worrying about too much. <br />
===logevery (frequency to log best score to file)===<br />
'''logevery''' = (1 to infinity, '''10''') – The frequency at which the best score is written to the log file. <br />
===saveevery (frequency to save best tree to file or write checkpoints)===<br />
'''saveevery''' = (1 to infinity, '''100''') – If '''writecheckpoints''' or '''outputcurrentbesttopology''' are specified, this is the frequency (in generations) at which checkpoints or the current best tree are written to file. <br />
===refinestart (whether to optimize a bit before starting a search)===<br />
'''refinestart''' = (0 or 1, '''1''') – Specifies whether some initial rough optimization is performed on the starting branch lengths and rate heterogeneity parameters. This is always recommended. <br />
===outputcurrentbesttopology (continuously write the best tree to file during a run)===<br />
'''outputcurrentbesttopology''' = (0 or 1, '''0''') – If true, the current best tree of the current search replicate is written to <ofprefix>.best.current.tre every '''saveevery''' generations. In versions before 0.96 the current best topology was always written to file, but that is no longer the case. Seeing the current best tree has no real use apart from satisfying your curiosity about how a <br />
run is going.<br />
===outputeachbettertopology (write each improved topology to file)===<br />
'''outputeachbettertopology''' (0 or 1, '''0''') – If true, each new topology encountered with a better score than the previous best is written to file. In some cases this can result in really big files (hundreds of MB) though, especially for random starting topologies on large datasets. Note that this file is interesting to get an idea of how the topology changed as the searches progressed, but the collection of trees should NOT be interpreted in any meaningful way. This option is not available while bootstrapping. <br />
===enforcetermconditions (use automatic termination)===<br />
'''enforcetermconditions''' = (0 or 1, '''1''') – Specifies whether the automatic termination conditions will be used. The conditions specified by both of the following two parameters must be met. See the following two parameters for their definitions. If this is false, the run will continue until it reaches the time ('''stoptime''') or generation ('''stopgen''') limit. It is highly recommended that this option be used! <br />
===genthreshfortopoterm (number of generations without topology improvement required for termination) ===<br />
'''genthreshfortopoterm''' = (1 to infinity, '''20,000''') – This specifies the first part of the termination condition. When no new significantly better scoring topology (see significanttopochange below) has been encountered in greater than this number of generations, this condition is met. Increasing this parameter may improve the lnL scores obtained (especially on large datasets), but will also increase runtimes. <br />
===scorethreshforterm (max score improvement over recent generations required for termination)===<br />
'''scorethreshforterm''' = (0 to infinity, '''0.05''') – The second part of the termination condition. When the total improvement in score over the last '''intervallength''' x '''intervalstostore''' generations (default is 500 generations, see below) is less than this value, this condition is met. This does not usually need to be changed. <br />
===significanttopochange (required score improvement for topology to be considered better)===<br />
'''significanttopochange''' = (0 to infinity, '''0.01''') – The lnL increase required for a new topology to be <br />
considered significant as far as the termination condition is concerned. It probably doesn’t <br />
need to be played with, but you might try increasing it slightly if your runs reach a stable <br />
score and then take a very long time to terminate due to very minor changes in topology. <br />
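
The way the two termination conditions combine can be summarized with a short Python sketch. It only restates the rules described above (both conditions must hold, and neither is checked when '''enforcetermconditions''' is off); the function and argument names are hypothetical and this is not GARLI's actual code.

```python
def should_terminate(
    gens_since_significant_topo_change,
    score_improvement_over_window,
    genthreshfortopoterm=20000,
    scorethreshforterm=0.05,
    enforcetermconditions=True,
):
    """Return True when both automatic termination conditions are met."""
    if not enforcetermconditions:
        # The run would then stop only at the stoptime or stopgen limit.
        return False
    topology_stable = gens_since_significant_topo_change > genthreshfortopoterm
    score_stable = score_improvement_over_window < scorethreshforterm
    return topology_stable and score_stable


print(should_terminate(25000, 0.01))  # both conditions met
print(should_terminate(25000, 0.20))  # score still improving
print(should_terminate(5000, 0.01))   # topology improved too recently
```

Here a topology change only resets the generation counter if it improved the lnL by at least '''significanttopochange''', and the score improvement is the total over the last '''intervallength''' x '''intervalstostore''' generations.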
===outputphyliptree (write trees to file in Phylip as well as Nexus format)===<br />
'''outputphyliptree''' = (0 or 1, '''0''') – Whether phylip formatted tree files will be output in addition to <br />
the default nexus files for the best tree across all replicates (<ofprefix>.best.phy), the best <br />
tree for each replicate (<ofprefix>.best.all.phy) or in the case of bootstrapping, the best tree <br />
for each bootstrap replicate (<ofprefix>.boot.phy). <br />
===outputmostlyuselessfiles (output uninteresting files)===<br />
'''outputmostlyuselessfiles''' = (0 or 1, '''0''') – Whether to output three files of little general interest: the <br />
“fate”, “problog” and “swaplog” files. The fate file shows the parentage, mutation types and <br />
scores of every individual in the population during the entire search. The problog shows how <br />
the proportions of the different mutation types changed over the course of the run. The <br />
swaplog shows the number of unique swaps and the number of total swaps on the current <br />
best tree over the course of the run. <br />
===writecheckpoints (write checkpoint files during run)===<br />
'''writecheckpoints''' (0 or 1, '''0''') – Whether to write three files to disk containing all information <br />
about the current state of the population every saveevery generations, with each successive <br />
checkpoint overwriting the previous one. These files can be used to restart a run at the last <br />
written checkpoint by setting the '''restart''' configuration entry.<br />
===restart (restart run from checkpoint)===<br />
'''restart''' = (0 or 1, '''0''') – Whether to restart at a previously saved checkpoint. To use this option the '''writecheckpoints''' option must have been used during a previous run. The program will look for checkpoint files that are named based on the ofprefix of the previous run. If you intend to restart a run, NOTHING should be changed in the config file except setting '''restart''' to 1. <br />
A run that is restarted from checkpoint will give ''exactly'' the same results it would have if the run had gone to completion. <br />
===outgroup (orient inferred trees consistently)===<br />
'''outgroup''' = (outgroup taxon numbers, separated by spaces) – This option allows for orienting the tree topologies in a consistent way when they are written to file. Note that this has NO effect whatsoever on the actual inference and the specified outgroup is NOT constrained to be present in the inferred trees. If multiple outgroup taxa are specified and they do not form a monophyletic group in the inferred tree, this setting will be ignored. If you specify a single outgroup taxon it will always be present, and the tree will always be consistently oriented. Ranges can be indicated with a hyphen. e.g., to specify an outgroup consisting of taxa 1, 2, 3 and 5 the format is this: <br />
outgroup = 1-3 5<br />
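
To make the range syntax concrete, the following Python sketch expands an outgroup value into the list of taxon numbers it denotes. It is a hypothetical helper for checking a config entry, not the parser that GARLI itself uses.

```python
def expand_outgroup(value):
    """Expand a string such as '1-3 5' into a list of taxon numbers."""
    taxa = []
    for token in value.split():
        if "-" in token:
            first, last = (int(part) for part in token.split("-"))
            if first > last:
                raise ValueError(f"descending range: {token}")
            taxa.extend(range(first, last + 1))
        else:
            taxa.append(int(token))
    return taxa


print(expand_outgroup("1-3 5"))
```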
<br />
===searchreps (number of independent search replicates)===<br />
'''searchreps''' = (1 to infinity, '''2''') – The number of independent search replicates to perform during a program execution. You should always either do multiple search replicates or multiple program executions with any dataset to get a feel for whether you are getting consistent results, which suggests that the program is doing a decent job of searching. Note that if this is > 1 and you are performing a bootstrap analysis, this is the number of search replicates to be done per bootstrap replicate. That can increase the chance of finding the best tree per bootstrap replicate, but will also increase bootstrap runtimes enormously.<br />
===bootstrapreps (number of bootstrap replicates)===<br />
bootstrapreps (0 to infinity, '''0''') - The number of bootstrap reps to perform. If this is greater than <br />
0, normal searching will not be performed. The resulting bootstrap trees (one per rep) will be <br />
output to a file named <ofprefix>.boot.tre. To obtain the bootstrap proportions they will then <br />
need to be read into PAUP* or a similar program to obtain a majority rule consensus. Note <br />
that it is probably safe to reduce the strictness of the termination conditions during <br />
bootstrapping (perhaps halve genthreshfortopoterm), which will greatly speed up the <br />
bootstrapping process with negligible effects on the results.<br />
<br />
Note that GARLI does NOT do the bootstrap consensus or calculate the bootstrap proportions itself. It simply infers the trees that are the input to that consensus. The trees in the .boot.tre file will need to be read into a program such as SumTrees (or PAUP* or CONSENSE) to do a majority-rule consensus and obtain your bootstrap support values. See the following part of the "Advanced topics" page for suggestions on making the consensus: '''[[Advanced_topics#Detailed_Example:_A_bootstrap_analysis|Detailed Example: A bootstrap analysis]]'''.<br />
<br />
===resampleproportion (relative size of re-sampled data matrix)===<br />
resampleproportion (0.1 to 10, '''1.0''' ) – When bootstrapreps > 0, this setting allows for <br />
bootstrap-like resampling, but with the pseudoreplicate datasets having the number of <br />
alignment columns different from the real data. Setting values < 1.0 is somewhat similar to jackknifing, but not identical. <br />
===inferinternalstateprobs (infer ancestral states)===<br />
inferinternalstateprobs = (0 or 1, '''0''') – Specify 1 to have GARLI infer the marginal posterior <br />
probability of each character at each internal node. This is done at the very end of the run, <br />
just before termination. The results are output to a file named <ofprefix>.internalstates.log.<br />
<br />
===outputsitelikelihoods (write a file with the log-likelihood of each site)===<br />
<br />
'''outputsitelikelihoods''' = (0 or 1, '''0''') - Causes a file named <ofprefix>.sitelikes.log to be created containing the sitewise log-likelihood values for each individual site for the final tree of each search replicate. Note that the format of the file is exactly the same as that output by PAUP, so it can be directly used in CONSEL using the PAUP file-format option. For partitioned models the sites are listed in sets by partition subset, but the site numbers will match those in the original datamatrix. Thus, when using a partitioned model that divides sites by codon position, the ordering would be 1, 4, 7, ... 2, 5, 8, ... 3, 6, 9 .... Note that this makes no difference to CONSEL, which just assumes that the site likelihoods are ordered the same for each tree and ignores the site numbers. <br />
<br />
Also note that using the site likelihoods from a partitioned model is a violation of CONSEL's assumptions, since it will not know about the partitions and its internal bootstrap resampling will not obey them. It isn't clear what the effects of this will be on the various tests. <br />
<br />
Finally, if you want to calculate the site likelihoods for a set of trees that you provide, use the '''[[GARLI_Configuration_Settings#optimizeinputonly_.28do_not_search.2C_only_optimize_model_and_branch_lengths_on_user_trees.29 | optimizeinputonly]]''' setting.<br />
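
The site ordering described above for a model partitioned by codon position can be reproduced with a short Python sketch, which may help when matching lines of the sitelikes file back to alignment columns. It assumes exactly three subsets (first, second and third codon positions, in that order); the function name is hypothetical.

```python
def codon_position_site_order(num_sites):
    """Site numbers listed subset by subset: positions 1, then 2, then 3."""
    return [
        site
        for position in range(3)
        for site in range(position + 1, num_sites + 1, 3)
    ]


print(codon_position_site_order(9))
```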
<br />
===optimizeinputonly (do not search, only optimize model and branch lengths on user trees)===<br />
(new in version 2.0)<br />
<br />
'''optimizeinputonly''' = (0 or 1, '''0''') - Requires a NEXUS formatted input tree file (only NEXUS format will work!), specified through the '''[[GARLI_Configuration_Settings#streefname_.28source_of_starting_tree_and.2For_model.29 | streefname]]''' setting. All trees in that file will have model parameters and branch lengths optimized, giving their maximum likelihood scores. A file with the site-likelihoods of each tree will also be output. See the '''[[GARLI_Configuration_Settings#outputsitelikelihoods_.28write_a_file_with_the_log-likelihood_of_each_site.29 | outputsitelikelihoods]]''' setting for details.<br />
<br />
===collapsebranches (collapse zero length branches before writing final trees to file)===<br />
<br />
'''collapsebranches''' = (0 or 1, '''1''') - Before version 1.0, all trees that were returned were fully resolved. This was true even if the maximum-likelihood estimates of some internal branch lengths were effectively zero (or GARLI's minimum, which is 1e-8). In such cases, collapsing the branch into a polytomy would be a better representation. Note that GARLI will never return a tree with an actual branch length of zero, but rather with its minimum value of 1.0e-8. The drawback of always returning fully resolved trees is that what is effectively a polytomy can be resolved in three ways, and different independent searches may randomly return one of those resolutions. Thus, if you compare the trees by topology only, they will look different. If you pay attention to the branch lengths and likelihood scores of the trees it will be apparent that they are effectively the same.<br />
<br />
I think that collapsing of branches is particularly important when bootstrapping, since no support should be given to a branch that doesn't really exist, i.e., that is a random resolution of a polytomy. Collapsing is also good when calculating tree to tree distances such as the symmetric tree distance, for example when calculating phylogenetic error to a known target tree. Zero-length branches would add to the distances (~error) although they really should not.<br />
<br />
==Model specification settings==<br />
<br />
With version 1.0 and later there are now many more options dealing with model specification because of <br />
the inclusion of amino acid and codon-based models. The description of the settings will be <br />
broken up by data type. Note that in terms of the model settings in GARLI, “empirical” means <br />
to fix parameter values at those observed in the dataset being analyzed, and “fixed” means to fix <br />
them at user specified values. See the '''[[GARLI_Configuration_Settings#streefname_.28source_of_starting_tree_and.2For_model.29|streefname]]''' setting for details on how to provide <br />
parameter values to be fixed during inference.<br />
<br />
'''PARTITIONED MODELS: To use partitioned models you'll still need to use the same basic model settings detailed below for each individual partition subset, and then some other details to set up the partition model itself. Be sure that you are familiar with the rest of this section, then see [[Using_partitioned_models | Using partitioned models]].'''<br />
<br />
===datatype (sequence type and inference model)=== <br />
'''datatype''' = ('''nucleotide''', aminoacid, codon-aminoacid, codon, standard, standardordered, standardvariable, standardvariableordered) – The type of data and model that is <br />
to be used during tree inference. Nucleotide and amino acid data are self explanatory.<br />
<br />
The codon-aminoacid datatype means that the data will be <br />
supplied as a nucleotide alignment, but will be internally translated and analyzed using an <br />
amino acid model. The codon and codon-aminoacid datatypes require nucleotide sequence <br />
that is aligned in the correct reading frame. In other words, all gaps in the alignment should <br />
be a multiple of 3 in length, and the alignment should start at the first position of a codon. If <br />
the alignment has extra columns at the start, middle or end, they should be removed or <br />
excluded with a Nexus exset (see '''[[FAQ#Can_I_specify_alignment_columns_of_my_data_matrix_to_be_excluded.3F | this FAQ item]]''' for an example of exset usage). The correct <br />
'''[[GARLI_Configuration_Settings#geneticcode_.28code_to_use_in_codon_translation.29 | geneticcode]]''' must also be set.<br />
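
Before running a codon or codon-aminoacid analysis it can be useful to check the reading-frame requirements mentioned above. The Python sketch below performs a simplified version of that check on a single aligned sequence: it tests that the alignment length and every run of gaps are multiples of 3. It assumes that gaps are written as ‘-’, it does not verify that gaps begin on codon boundaries, and the function name is hypothetical.

```python
import re


def reading_frame_problems(aligned_sequence):
    """Return a list of human-readable problems; empty if none were found."""
    problems = []
    if len(aligned_sequence) % 3 != 0:
        problems.append(
            f"alignment length {len(aligned_sequence)} is not a multiple of 3"
        )
    for match in re.finditer(r"-+", aligned_sequence):
        gap_length = len(match.group())
        if gap_length % 3 != 0:
            problems.append(
                f"gap of length {gap_length} starting at column {match.start() + 1}"
            )
    return problems


print(reading_frame_problems("ATG---GCATTT"))
print(reading_frame_problems("ATG--GCAA"))
```

Columns flagged this way would need to be removed or excluded (for example with a Nexus exset) before the data are analyzed.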
<br />
(New in Version 2.0)<br />
<br />
The various "standard" datatypes are new in GARLI 2.0. These represent morphology-like discrete characters, with any number of states. These are also termed the "Mk" and "Mkv" models by Lewis (2001). See this page for more details on using standard data: '''[[Mkv morphology model]]'''.<br />
<br />
===Settings for datatype = nucleotide=== <br />
====ratematrix (relative rate parameters assumed by substitution model)====<br />
'''ratematrix''' = (1rate, 2rate, '''6rate''', fixed, custom string) – The number of relative substitution rate <br />
parameters (note that the number of free parameters is this value minus one). Equivalent to <br />
the “nst” setting in PAUP* and MrBayes. 1rate assumes that substitutions between all pairs <br />
of nucleotides occur at the same rate (JC model), 2rate allows different rates for transitions and <br />
transversions (K2P or HKY models), and 6rate allows a different rate between each nucleotide pair (GTR). These rates are <br />
estimated unless the fixed option is chosen. Since version 0.96, parameters for any <br />
submodel of the GTR model may be estimated. The format for specifying this is very <br />
similar to that used in the “rclass” setting of PAUP*. Within parentheses, six letters are <br />
specified, with spaces between them. The six letters represent the rates of substitution <br />
between the six pairs of nucleotides, with the order being A-C, A-G, A-T, C-G, C-T and G-T. <br />
Letters within the parentheses that are the same mean that a single parameter is shared by <br />
multiple nucleotide pairs. For example, <br />
ratematrix = (a b a a b a) <br />
would specify the HKY 2-rate model (equivalent to ratematrix = 2rate). This entry, <br />
ratematrix = (a b c c b a) <br />
would specify 3 estimated rates of substitution, with one rate shared by A-C and G-T <br />
substitutions, another rate shared by A-G and C-T substitutions, and the final rate shared by <br />
A-T and C-G substitutions.<br />
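
The custom ratematrix strings can be read mechanically. The Python sketch below maps each of the six letters onto its nucleotide pair, using the order given above, and counts how many distinct rate parameters the string defines. It is a hypothetical helper for checking a specification, not GARLI's own parser.

```python
PAIR_ORDER = ("A-C", "A-G", "A-T", "C-G", "C-T", "G-T")


def parse_ratematrix(spec):
    """Map each nucleotide pair to the letter naming its rate parameter."""
    letters = spec.strip().strip("()").split()
    if len(letters) != len(PAIR_ORDER):
        raise ValueError("a custom ratematrix needs exactly six letters")
    return dict(zip(PAIR_ORDER, letters))


def num_rate_parameters(spec):
    """Number of distinct relative rates (free parameters are one fewer)."""
    return len(set(parse_ratematrix(spec).values()))


print(parse_ratematrix("(a b c c b a)"))
print(num_rate_parameters("(a b a a b a)"))
print(num_rate_parameters("(a b c c b a)"))
```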
<br />
====statefrequencies (equilibrium base frequencies assumed by substitution model)====<br />
'''statefrequencies''' = (equal, empirical, '''estimate''', fixed) – Specifies how the equilibrium state <br />
frequencies (A, C, G and T) are treated. The empirical setting fixes the frequencies at their <br />
observed proportions, and the other options should be self-explanatory. <br />
===For datatype = nucleotide or aminoacid===<br />
====invariantsites (treatment of proportion of invariable sites parameter)====<br />
'''invariantsites''' = (none, '''estimate''', fixed) – Specifies whether a parameter representing the <br />
proportion of sites that are unable to change (i.e. have a substitution rate of zero) will be <br />
included. This is typically referred to as “invariant sites”, but would better be termed <br />
“invariable sites”. <br />
====ratehetmodel (type of rate heterogeneity to assume for variable sites)====<br />
'''ratehetmodel''' = (none, '''gamma''', gammafixed) – The model of rate heterogeneity assumed. <br />
“gammafixed” requires that the alpha shape parameter is provided, and a setting of “gamma” <br />
estimates it. <br />
====numratecats (number of overall substitution rate categories)==== <br />
'''numratecats''' = (1 to 20, '''4''') – The number of categories of variable rates (not including the <br />
invariant site class if it is being used). Must be set to 1 if '''ratehetmodel''' is set to none. Note <br />
that runtimes and memory usage scale linearly with this setting. <br />
===For datatype = aminoacid or codon-aminoacid===<br />
Amino acid analyses are typically done using fixed rate matrices that have been estimated on <br />
large datasets and published. Typically the only model parameters that are estimated during tree <br />
inference relate to the rate heterogeneity distribution. Each of the named matrices also has <br />
corresponding fixed amino acid frequencies, and a given rate matrix can either be used with those <br />
frequencies or with the amino acid frequencies observed in your dataset. This second option is <br />
often denoted as “+F” in a model description, although in terms of the GARLI configuration <br />
settings this is referred to as “empirical” frequencies. In GARLI the Dayhoff model would be <br />
specified by setting both the ratematrix and statefrequencies options to “dayhoff”. The <br />
Dayhoff+F model would be specified by setting the ratematrix to “dayhoff”, and <br />
statefrequencies to “empirical”.<br />
<br />
The following named amino acid models are implemented: <br />
{| border="1"<br />
|-<br />
!'''ratematrix'''/'''statefrequencies''' setting !! reference <br />
|-<br />
|align="center" | dayhoff || Dayhoff, Schwartz and Orcutt. 1978.<br />
|-<br />
|align="center" | jones || Jones, Taylor and Thornton (JTT), 1992. <br />
|-<br />
|align="center" | WAG || Whelan and Goldman, 2001.<br />
|-<br />
|align="center" | mtREV || Adachi and Hasegawa, 1996.<br />
|-<br />
|align="center" | mtmam || Yang, Nielsen and Hasegawa, 1998.<br />
|}<br />
Note that most programs allow either the use of a named rate matrix and its corresponding state <br />
frequencies, or a named rate matrix and empirical frequencies. GARLI technically allows the <br />
mixing of different named matrices and equilibrium frequencies (for example, wag matrix with <br />
jones equilibrium frequencies), but this is not recommended. <br />
====ratematrix (amino acid substitution rates)====<br />
'''ratematrix''' = (poisson, jones, dayhoff, wag, mtmam, mtrev) – The fixed amino acid rate matrix to <br />
use. You should use the matrix that gives the best likelihood, and could use a program like <br />
PROTTEST (very much like MODELTEST, but for amino acid models) to determine which <br />
fits best for your data. Poisson assumes a single rate of substitution between all amino acid <br />
pairs, and is a very poor model.<br />
====statefrequencies (equilibrium base frequencies assumed by substitution model)====<br />
'''statefrequencies''' = (equal, empirical, estimate, fixed, jones, dayhoff, wag, mtmam, mtrev) – <br />
Specifies how the equilibrium state frequencies of the 20 amino acids are treated. The <br />
“empirical” option fixes the frequencies at their observed proportions (when describing a <br />
model this is often termed “+F”). <br />
===For datatype = codon===<br />
The codon models are built with three components: (1) parameters describing the process of <br />
individual nucleotide substitutions, (2) equilibrium codon frequencies, and (3) parameters <br />
describing the relative rate of nonsynonymous to synonymous substitutions. The nucleotide <br />
substitution parameters within the codon models are exactly the same as those possible with <br />
standard nucleotide models in GARLI, and are specified with the ratematrix configuration <br />
entry. Thus, they can be of the 2rate variety (inferring different rates for transitions and <br />
transversions, K2P or HKY-like), the 6rate variety (inferring different rates for all nucleotide <br />
pairs, GTR-like) or any other sub-model of GTR. The options for codon frequencies are <br />
specified with the statefrequencies configuration entry. The options are to use equal <br />
frequencies (not a good option), the frequencies observed in your dataset (termed “empirical” in <br />
GARLI), or the codon frequencies implied by the “F1x4” or “F3x4” methods (using PAML's <br />
terminology). These last two options calculate the codon frequencies as the product of the <br />
frequencies of the three nucleotides that make up each codon. In the “F1x4” case the nucleotide <br />
frequencies are those observed in the dataset across all codon positions, while the “F3x4” option <br />
uses the nucleotide frequencies observed in the data at each codon position separately. The final <br />
component of the codon models is the nonsynonymous to synonymous relative rate parameters <br />
(aka dN/dS or omega parameters). The default is to infer a single dN/dS value. Alternatively, a <br />
model can be specified that infers a given number of dN/dS categories, with the dN/dS values <br />
and proportions falling in each category estimated (ratehetmodel = nonsynonymous). This is <br />
the “discrete” or “M3” model in PAML's terminology. <br />
====ratematrix (relative ''nucleotide'' rate parameters assumed by codon model)====<br />
'''ratematrix''' = (1rate, '''2rate''', 6rate, fixed, custom string) – This determines the relative rates of <br />
nucleotide substitution assumed by the codon model. The options are exactly the same as <br />
those allowed under a normal nucleotide model. A codon model with '''ratematrix''' = 2rate <br />
specifies the standard Goldman and Yang (1994) model, with different substitution rates for <br />
transitions and transversions. <br />
====statefrequencies (equilibrium codon frequencies)====<br />
'''statefrequencies''' = (equal, empirical, f1x4, '''f3x4''') - The options are to use equal codon frequencies <br />
(not a good option), the frequencies observed in your dataset (termed “empirical” in GARLI), <br />
or the codon frequencies implied by the “F1x4” or “F3x4” methods (using PAML's <br />
terminology). These last two options calculate the codon frequencies as the product of the <br />
frequencies of the three nucleotides that make up each codon. In the “F1x4” case the <br />
nucleotide frequencies are those observed in the dataset across all codon positions, while the <br />
“F3x4” option uses the nucleotide frequencies observed in the data at each codon position <br />
separately. <br />
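<br />
To make the difference between “F1x4” and “F3x4” concrete, the following sketch computes both sets of codon frequencies from a toy in-frame alignment. It is an illustration under stated assumptions, not GARLI's implementation: the function names are hypothetical, the standard genetic code is assumed, and renormalizing over the 61 sense codons (so that the frequencies sum to one after stop codons are dropped) is an assumption made here.<br />

```python
from collections import Counter
from itertools import product

STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code (assumed)


def normalise(counter):
    """Turn nucleotide counts into frequencies over A, C, G and T."""
    total = sum(counter[b] for b in "ACGT")
    return {b: counter[b] / total for b in "ACGT"}


def codon_freqs(seqs, method="f3x4"):
    """Codon frequencies as products of observed nucleotide frequencies."""
    # Tally nucleotides separately for codon positions 1, 2 and 3.
    counts = [Counter(), Counter(), Counter()]
    for seq in seqs:
        for i, base in enumerate(seq):
            if base in "ACGT":
                counts[i % 3][base] += 1
    if method == "f1x4":
        # One set of frequencies pooled across all three codon positions.
        pooled = normalise(counts[0] + counts[1] + counts[2])
        pos = [pooled, pooled, pooled]
    else:
        # A separate set of frequencies for each codon position.
        pos = [normalise(c) for c in counts]
    raw = {}
    for codon in map("".join, product("ACGT", repeat=3)):
        if codon not in STOP_CODONS:
            raw[codon] = pos[0][codon[0]] * pos[1][codon[1]] * pos[2][codon[2]]
    total = sum(raw.values())  # renormalise over sense codons (assumption)
    return {codon: f / total for codon, f in raw.items()}


toy_alignment = ["ATGGCCAAA", "ATGGCTAAG"]
f1 = codon_freqs(toy_alignment, "f1x4")
f3 = codon_freqs(toy_alignment, "f3x4")
print("ATG under F1x4:", round(f1["ATG"], 4))
print("ATG under F3x4:", round(f3["ATG"], 4))
```

In the toy data no codon begins with C, so F3x4 assigns zero frequency to every codon starting with C, while F1x4 does not because C occurs at other positions.<br />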
====ratehetmodel (variation in dN/dS across sites)==== <br />
'''ratehetmodel''' = ('''none''', nonsynonymous) – For codon models, the default is to infer a single dN/dS <br />
parameter. Alternatively, a model can be specified that infers a given number of dN/dS <br />
categories, with the dN/dS values and proportions falling in each category estimated <br />
('''ratehetmodel''' = nonsynonymous). This is the “discrete” or “M3” model of Yang et al. <br />
(2000). <br />
====numratecats (number of discrete dN/dS categories)====<br />
'''numratecats''' = (1 to 20, '''1''') – When ratehetmodel = nonsynonymous, this is the number of dN/dS parameter <br />
categories.<br />
====invariantsites====<br />
'''invariantsites''' = (none) - '''NOTE:''' Due to an error on my part, the '''invariantsites''' entry must appear with codon models for them to run (despite the fact that it doesn't apply). i.e., be sure that this appears:<br />
invariantsites = none.<br />
<br />
===For datatype = codon or codon-aminoacid=== <br />
====geneticcode (code to use in codon translation)====<br />
'''geneticcode''' = ('''standard''', vertmito, invertmito) – The genetic code to be used in translating codons <br />
into amino acids.<br />
<br />
==Population settings==<br />
===nindivs (number of individuals in population)===<br />
'''nindivs''' = (2 to 100, '''4''')- The number of individuals in the population. This may be increased, but <br />
doing so is generally not beneficial. Note that typical genetic algorithms tend to have much, <br />
much larger population sizes than GARLI's defaults. <br />
===holdover (unmutated copies of best individual)===<br />
'''holdover''' = (1 to nindivs-1, '''1''')- The number of times the best individual is copied to the next <br />
generation with no chance of mutation. It is best not to mess with this. <br />
===selectionintensity (strength of selection)===<br />
'''selectionintensity''' = (0.01 to 5.0, '''0.5''')- Controls the strength of selection, with larger numbers <br />
denoting stronger selection. The relative probability of reproduction of two individuals <br />
depends on the difference in their log likelihoods (&Delta;lnL) and is formulated very similarly to <br />
the procedure of calculating Akaike weights. The relative probability of reproduction of the <br />
less fit individual is equal to:<br />
<br />
<big><big><br />
''e''<sup> (-selectionIntensity &times; &Delta; lnL)</sup><br />
</big></big><br />
<br />
In general, this setting does not seem to have much of an effect on the progress of a run. In <br />
theory higher values should cause scores to increase more quickly, but make the search more <br />
likely to be entrapped in a local optimum. Low values will increase runtimes, but may be <br />
more likely to reach the true optimum. The following table gives the relative probabilities of <br />
reproduction for different values of the selection intensity when the difference in log <br />
likelihood is 1.0<br />
{| border="1"<br />
|-<br />
!'''selectionintensity''' value !! Ratio of probabilities of reproduction <br />
|-<br />
|align="center" |0.05 || align="center" |0.95:1.0 <br />
|-<br />
|align="center" |0.1 || align="center" |0.90:1.0 <br />
|-<br />
|align="center" |0.25 || align="center" |0.78:1.0 <br />
|-<br />
|align="center" |0.5 || align="center" |0.61:1.0 <br />
|-<br />
|align="center" |0.75 || align="center" |0.47:1.0 <br />
|-<br />
|align="center" |1.0 || align="center" |0.37:1.0 <br />
|-<br />
|align="center" |2.0 || align="center" |0.14:1.0<br />
|}<br />
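<br />
The ratios in the table follow directly from the formula above. As a quick check, this sketch (the function name is hypothetical) evaluates ''e''<sup>(-selectionintensity &times; &Delta;lnL)</sup> for each tabulated value with a log-likelihood difference of 1.0.<br />

```python
import math


def relative_reproduction_prob(selection_intensity, delta_lnl):
    """Relative probability of reproduction of the less fit individual."""
    return math.exp(-selection_intensity * delta_lnl)


# Reproduce the table: one row per selectionintensity value, delta lnL = 1.0.
for s in (0.05, 0.1, 0.25, 0.5, 0.75, 1.0, 2.0):
    print(f"{s:<5} {relative_reproduction_prob(s, 1.0):.2f}:1.0")
```

The same function can be called with other &Delta;lnL values to see how quickly the less fit individual is penalized as the score difference grows.<br />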
===holdoverpenalty (fitness handicap for best individual)===<br />
'''holdoverpenalty''' = (0 to 100, '''0''') – This can be used to bias the probability of reproduction of the <br />
best individual downward. Because the best individual is automatically copied into the next <br />
generation, it has a bit of an unfair advantage and can cause all population variation to be lost <br />
due to genetic drift, especially with small population sizes. The value specified here is <br />
subtracted from the best individual’s lnL score before calculating the probabilities of<br />
reproduction. It seems plausible that this might help maintain variation, but I have not seen it <br />
cause a measurable effect.<br />
<br />
===stopgen (maximum number of generations to run)===<br />
'''stopgen''' – The maximum number of generations to run. Note that this supersedes the automated <br />
stopping criterion (see enforcetermconditions above), and should therefore be set to a very <br />
large value if automatic termination is desired. <br />
===stoptime (maximum time to run)===<br />
'''stoptime''' – The maximum number of seconds for the run to continue. Note that this supersedes <br />
the automated stopping criterion (see enforcetermconditions above), and should therefore <br />
be set to a very large value if automatic termination is desired.<br />
<br />
==Branch-length optimization settings== <br />
After a topological rearrangement, branch lengths in the vicinity of the rearrangement are <br />
optimized by the Newton-Raphson method. Optimization passes are performed on a particular <br />
branch until the expected improvement in likelihood for the next pass is less than a threshold <br />
value, termed the optimization precision. Note that this name is somewhat misleading, as the <br />
precision of the optimization algorithm is inversely related to this value (i.e., smaller values of <br />
the optimization precision lead to more precise optimization). If the improvement in likelihood <br />
due to optimization for a particular branch is greater than the optimization precision, <br />
optimization is also attempted on adjacent branches, spreading out across the tree. When no new <br />
topology with a better likelihood score is discovered for a while, the value is automatically <br />
reduced. The value can have a large effect on speed, with smaller values significantly slowing <br />
down the algorithm. The value of the optimization precision and how it changes over the course <br />
of a run are determined by the following three parameters. <br />
===startoptprec===<br />
startoptprec (0.005 to 5.0, '''0.5''') – The beginning optimization precision. <br />
===minoptprec===<br />
minoptprec (0.001 to startoptprec, '''0.01''') – The minimum allowed value of the optimization precision.<br />
<br />
===numberofprecreductions===<br />
numberofprecreductions (0 to 100, '''10''') – Specify the number of steps that it will take for the <br />
optimization precision to decrease (linearly) from startoptprec to minoptprec. <br />
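<br />
One way to picture how these three settings interact is to list the precision values visited during a run. The sketch below assumes an evenly spaced linear schedule from startoptprec down to minoptprec; the exact values GARLI uses internally are not given here, so treat the function (a hypothetical name) as an illustration only.<br />

```python
def optprec_schedule(startoptprec=0.5, minoptprec=0.01, numberofprecreductions=10):
    """Evenly spaced precision values from startoptprec down to minoptprec.

    Assumes a simple linear interpolation; this is not GARLI's own code.
    """
    if numberofprecreductions == 0:
        return [startoptprec]
    step = (startoptprec - minoptprec) / numberofprecreductions
    return [startoptprec - i * step for i in range(numberofprecreductions + 1)]


schedule = optprec_schedule()
print([round(p, 3) for p in schedule])
```

With the default settings this gives eleven values: the starting precision followed by ten reductions, the last of which reaches minoptprec.<br />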
===treerejectionthreshold===<br />
treerejectionthreshold (0 to 500, '''50''') – This setting controls which trees have more extensive <br />
branch-length optimization applied to them. All trees created by a branch swap receive <br />
optimization on a few branches that directly took part in the rearrangement. If the difference <br />
in score between the partially optimized tree and the best known tree is greater than treerejectionthreshold, no further optimization is applied to the branches of that tree. <br />
Reducing this value can significantly reduce runtimes, often with little or no effect on results. <br />
However, it is possible that a better tree could be missed if this is set too low. In cases in <br />
which obtaining the very best tree per search is not critical (e.g., bootstrapping), setting this <br />
lower (~20) is probably safe.<br />
<br />
==Settings controlling the proportions of the mutation types== <br />
Each mutation type is assigned a prior ''weight''. These values determine the expected <br />
proportions of the various mutation types that are performed. The primary mutation categories <br />
are ''topology'' (t), ''model'' (m) and ''branch length'' (b). Each is assigned a prior weight ( P<sub>i</sub> ) in the <br />
config file. Each time that a new best likelihood score is attained, the amount of the increase in <br />
score is credited to the mutation type responsible, with the sum of the increases ( S<sub>i</sub> ) maintained <br />
over the last intervallength x intervalstostore generations. The number of times that each <br />
mutation is performed ( N<sub>i</sub> ) is also tallied. The total weight of a mutation type is W<sub>i</sub> = P<sub>i</sub> + ( S<sub>i</sub> / N<sub>i</sub> ). <br />
The proportion of mutations of type i out of all mutations is then <br />
<br />
<big><big><br />
Pr(i) = W<sub>i</sub> / <br />
(W<sub>t</sub> + W<sub>m</sub> + W<sub>b</sub>) <br />
</big></big><br />
<br />
The proportion of each mutation is thus related to its prior weight and the average increase in <br />
score that it has caused over recent generations. The prior weights can be used to control the <br />
expected (and starting) proportions of the mutation types, as well as how sensitive the <br />
proportions are to the course of events in a run. It is generally a good idea to make the topology <br />
prior much larger than the others so that when no mutations are improving the score many <br />
topology mutations are still attempted. If you set '''outputmostlyuselessfiles''' to 1, you can look at the “problog” file to determine what the <br />
proportions of the mutations actually were over the course of a run. <br />
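<br />
The weighting scheme above can be worked through numerically. The following sketch computes W<sub>i</sub> = P<sub>i</sub> + ( S<sub>i</sub> / N<sub>i</sub> ) and the resulting proportions. The function name and the example tallies are hypothetical, and treating a mutation type that has not yet been attempted (N<sub>i</sub> = 0) as having zero average gain is an assumption made here.<br />

```python
def mutation_proportions(prior, score_gain, attempts):
    """Pr(i) = W_i / sum(W), with W_i = P_i + S_i / N_i."""
    weights = {}
    for kind in prior:
        n = attempts.get(kind, 0)
        # Average score increase per attempt; zero if never attempted (assumption).
        avg_gain = score_gain.get(kind, 0.0) / n if n else 0.0
        weights[kind] = prior[kind] + avg_gain
    total = sum(weights.values())
    return {kind: w / total for kind, w in weights.items()}


# Default prior weights: topoweight, modweight and brlenweight.
priors = {"topology": 1.0, "model": 0.05, "branch length": 0.2}

# At the start of a run nothing has been tallied, so only the priors matter.
start = mutation_proportions(priors, {}, {})
print({k: round(v, 3) for k, v in start.items()})

# Made-up tallies in which model mutations have recently paid off.
later = mutation_proportions(
    priors,
    {"topology": 2.0, "model": 0.5, "branch length": 0.1},
    {"topology": 400, "model": 20, "branch length": 80},
)
print({k: round(v, 3) for k, v in later.items()})
```

With the default priors the starting proportions are 0.8 topology, 0.04 model and 0.16 branch length, which shows why a large topology prior keeps topology mutations dominant when nothing is improving the score.<br />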
===topoweight (weight on topology mutations)===<br />
topoweight (0 to infinity, '''1.0''') The prior weight assigned to the class of topology mutations <br />
(NNI, SPR and limSPR). Note that setting this to 0.0 turns off topology mutations, meaning that the tree topology is fixed for the run. This used to be a way to have the program estimate only model parameters and branch-lengths, but the '''[[GARLI_Configuration_Settings#optimizeinputonly_.28do_not_search.2C_only_optimize_model_and_branch_lengths_on_user_trees.29 | optimizeinputonly]]''' setting is now a better way to go.<br />
<br />
===modweight (weight on model parameter mutations)===<br />
modweight (0 to infinity, '''0.05''') The prior weight assigned to the class of model mutations. Note <br />
that setting this at 0.0 fixes the model during the run. <br />
===brlenweight (weight on branch-length parameter mutations)===<br />
brlenweight (0 to infinity, '''0.2''') The prior weight assigned to branch-length mutations.<br />
<br />
The same procedure used above to determine the proportion of Topology:Model:Branch-Length <br />
mutations is also used to determine the relative proportions of the three types of topological <br />
mutations (NNI:SPR:limSPR), controlled by the following three weights. Note that the <br />
proportion of mutations applied to each of the model parameters is not user controlled. <br />
===randnniweight (weight on NNI topology changes)===<br />
randnniweight (0 to infinity, '''0.1''') - The prior weight assigned to NNI mutations. <br />
===randsprweight (weight on SPR topology changes)===<br />
randsprweight (0 to infinity, '''0.3''') - The prior weight assigned to random SPR mutations. For <br />
very large datasets it is often best to set this to 0.0, as random SPR mutations essentially <br />
never result in score increases. <br />
===limsprweight (weight on localized SPR topology changes)===<br />
limsprweight (0 to infinity, '''0.6''') - The prior weight assigned to SPR mutations with the <br />
reconnection branch limited to being a maximum of limsprrange branches away from where <br />
the branch was detached. <br />
===intervallength===<br />
intervallength (10 to 1000, '''100''') – The number of generations in each interval during which the <br />
number and benefit of each mutation type are stored. <br />
===intervalstostore===<br />
intervalstostore = (1 to 10, '''5''') – The number of intervals to be stored. Thus, records of <br />
mutations are kept for the last (intervallength x intervalstostore) generations. Every <br />
intervallength generations the probabilities of the mutation types are updated by the scheme <br />
described above.<br />
<br />
==Settings controlling mutation details==<br />
===limsprrange (max range for localized SPR topology changes)===<br />
limsprrange (0 to infinity, '''6''') – The maximum number of branches away from its original <br />
location that a branch may be reattached during a limited SPR move. Setting this too high (> <br />
10) can seriously degrade performance, but if you do so in conjunction with a large increase in genthreshfortopoterm you might end up with better trees.<br />
===meanbrlenmuts (mean # of branch lengths to change per mutation)===<br />
meanbrlenmuts (1 to # taxa, '''5''') - The mean of the binomial distribution from which the number <br />
of branch lengths mutated is drawn during a branch length mutation. <br />
===gammashapebrlen (magnitude of branch-length mutations)===<br />
gammashapebrlen (50 to 2000, '''1000''') - The shape parameter of the gamma distribution (with a <br />
mean of 1.0) from which the branch-length multipliers are drawn for branch-length <br />
mutations. Larger numbers cause smaller changes in branch lengths. (Note that this has <br />
nothing to do with gamma rate heterogeneity.) <br />
===gammashapemodel (magnitude of model parameter mutations)===<br />
gammashapemodel (50 to 2000, '''1000''') - The shape parameter of the gamma distribution (with a <br />
mean of 1.0) from which the model mutation multipliers are drawn for model parameters <br />
mutations. Larger numbers cause smaller changes in model parameters. (Note that this has <br />
nothing to do with gamma rate heterogeneity.) <br />
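<br />
The effect of the gamma shape settings is easier to see numerically. A gamma distribution with shape ''k'' and mean 1.0 has standard deviation 1/&radic;''k'', so larger shapes give multipliers closer to 1.0. The sketch below uses Python's standard <code>random.gammavariate</code>; the function names are hypothetical and this is not GARLI's random number generator.<br />

```python
import math
import random


def draw_multiplier(shape, rng):
    """Draw a multiplier from a gamma distribution with mean 1.0.

    gammavariate(alpha, beta) has mean alpha * beta, so beta = 1 / shape.
    """
    return rng.gammavariate(shape, 1.0 / shape)


def multiplier_sd(shape):
    """Standard deviation of a mean-1.0 gamma distribution with this shape."""
    return 1.0 / math.sqrt(shape)


rng = random.Random(1)  # fixed seed so the draws are repeatable
for shape in (50, 1000, 2000):
    draws = [draw_multiplier(shape, rng) for _ in range(5)]
    print(shape, round(multiplier_sd(shape), 4), [round(d, 3) for d in draws])
```

At the default of 1000 a typical mutation changes a branch length or model parameter by only about 3%, whereas at the minimum of 50 a typical change is around 14%.<br />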
===uniqueswapbias (relative weight assigned to already attempted branch swaps)===<br />
uniqueswapbias (0.01 to 1.0, '''0.1''') – With version 0.95 and later, GARLI keeps track of which branch <br />
swaps it has attempted on the current best tree. Because swaps are applied randomly, it is <br />
possible that some swaps are tried twice before others are tried at all. This option allows the <br />
program to bias the swaps applied toward those that have not yet been attempted. Each swap <br />
is assigned a relative weight depending on the number of times that it has been attempted on <br />
the current best tree. This weight is equal to (uniqueswapbias) raised to the (# times swap <br />
attempted) power. In other words, a value of 0.5 means that swaps that have already been <br />
tried once will be half as likely as those not yet attempted, swaps attempted twice will be ¼ <br />
as likely, etc. A value of 1.0 means no biasing. If this value is not equal to 1.0 and the <br />
outputmostlyuselessfiles option is on, a file called <ofprefix>.swap.log is output. This file <br />
shows the total number of rearrangements tried and the number of unique ones over the course <br />
of a run. Note that this bias is only applied to NNI and limSPR rearrangements. Use of this <br />
option may allow the use of somewhat larger values of limsprrange. <br />
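<br />
The weighting rule for previously attempted swaps is a simple power. This sketch (hypothetical function names; how GARLI samples from the weights internally is not described here) shows the relative weights and the selection probabilities they would imply if swaps were drawn in proportion to their weights.<br />

```python
def swap_weight(uniqueswapbias, times_attempted):
    """Relative weight of a swap already tried times_attempted times."""
    return uniqueswapbias ** times_attempted


def swap_probabilities(uniqueswapbias, attempt_counts):
    """Normalise the weights, assuming swaps are drawn in proportion to them."""
    weights = [swap_weight(uniqueswapbias, n) for n in attempt_counts]
    total = sum(weights)
    return [w / total for w in weights]


# Three candidate swaps tried 0, 1 and 2 times, with a bias of 0.5.
print([swap_weight(0.5, n) for n in (0, 1, 2)])
print([round(p, 3) for p in swap_probabilities(0.5, (0, 1, 2))])
```

With a value of 1.0 every weight is 1.0 regardless of the attempt count, which is the “no biasing” case described above.<br />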
===distanceswapbias (relative weight assigned to branch swaps based on locality)===<br />
distanceswapbias (0.1 to 10, '''1.0''') – This option is similar to uniqueswapbias, except that it <br />
biases toward certain swaps based on the topological distance between the initial and <br />
rearranged trees. The distance is measured as in the limsprrange, and is half the <br />
Robinson-Foulds distance between the trees. As with uniqueswapbias, distanceswapbias <br />
assigns a relative weight to each potential swap. In this case the weight is <br />
(distanceswapbias) raised to the (reconnection distance - 1) power. Thus, given a value of <br />
0.5, the weight of an NNI is 1.0, the weight of an SPR with distance 2 is 0.5, with distance 3 <br />
is 0.25, etc. Note that values less than 1.0 bias toward more localized swaps, while values <br />
greater than 1.0 bias toward more extreme swaps. Also note that this bias is only applied to <br />
limSPR rearrangements. Be careful in setting this, as extreme values can have a very large <br />
effect.</div>Zwicklhttps://molevol.mbl.edu/index.php?title=GARLI_Configuration_Settings&diff=4375GARLI Configuration Settings2015-07-21T15:43:36Z<p>Zwickl: </p>
<hr />
<div>==Descriptions of GARLI configuration settings==<br />
<br />
The format for these configuration settings descriptions is generally:<br />
'''entryname''' (possible values, '''default value in bold''') – description<br />
<br />
==General settings==<br />
===datafname (file containing sequence dataset)=== <br />
'''datafname''' = (filename) –<br />
Name of the file containing the aligned sequence data. Formats accepted are <br />
PHYLIP, NEXUS and FASTA. Robust reading of Nexus formatted <br />
datasets is done using the Nexus Class Library. This accommodates things such as interleaved <br />
alignments and exclusion sets (exset’s) in NEXUS assumptions blocks (see the FAQ for an <br />
example of [[FAQ#Can_I_specify_alignment_columns_of_my_data_matrix_to_be_excluded.3F | exset usage]]). Use of NEXUS files is recommended, and is required for partitioned models.<br />
<br />
===constraintfile (file containing constraint definition)===<br />
'''constraintfile''' = (filename, '''none''') –<br />
Name of the file containing any topology constraint specifications, or “none” if <br />
there are no constraints. The easiest way to explain the format of the constraint file is by <br />
example. Consider a dataset of 8 taxa, in which your constraint consists of grouping taxa 1, 3 <br />
and 5. You may specify either positive constraints (inferred tree MUST contain constrained <br />
group) or negative constraints (also called converse constraints, inferred tree CANNOT <br />
contain constrained group). These are specified with either a ‘+’ or a ‘-’ at the beginning of <br />
the constraint specification, for positive and negative constraints, respectively. <br />
*For a positive constraint on a grouping of taxa 1, 3 and 5: <br />
+((1,3,5),2,4,6,7,8); <br />
*For a negative constraint on a grouping of taxa 1, 3 and 5: <br />
-((1,3,5),2,4,6,7,8); <br />
*Note that there are many other equivalent parenthetical representations of these constraints. <br />
*Multiple groups may be positively constrained, but currently only a single negatively constrained group is allowed. <br />
*Multiple constrained groupings may be specified in a single string: <br />
+((1,3,5),2,4,(6,7),8); <br />
or in two separate strings on successive lines: <br />
+((1,3,5),2,4,6,7,8); <br />
+(1,3,5,2,4,(6,7),8); <br />
*Constraint strings may also be specified in terms of taxon names (matching those used in the alignment) instead of numbers. <br />
*Positive and negative constraints cannot be mixed. <br />
*GARLI also accepts another constraint format that may be easier to use in some cases. This involves specifying a single branch to be constrained with a string of ‘*’ (asterisk) and ‘.’ (period) characters, with one character per taxon. Each taxon specified with a ‘*’ falls on one side of the constrained branch, and all those specified with a ‘.’ fall on the other. This should be familiar to anyone who has looked at PAUP* bootstrap output. With this format, a positive constraint on a grouping of taxa 1, 3 and 5 would look like this: <br />
+*.*.*... <br />
or equivalently like this: <br />
+.*.*.*** <br />
With this format each line only designates a single branch, so multiple constrained branches must be specified as multiple lines in the file. <br />
*The program also allows “backbone” constraints, which are simply constraints that do not include all of the taxa. For example if one wanted to constrain (1,3,5) but didn’t care where taxon 6 appeared in the tree, this string could be used: <br />
+((1,3,5),2,4,7,8); <br />
:Nothing special needs to be done to identify this as a backbone constraint, simply leave out some taxa.<br />
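<br />
For datasets with many taxa, writing the asterisk/period form of a constraint by hand is error-prone. The sketch below builds such a line from a list of taxon numbers. It is a hypothetical helper, not part of GARLI, and it only covers constraints that include every taxon (one character per taxon), not backbone constraints.<br />

```python
def branch_constraint(n_taxa, group, positive=True):
    """Build a one-branch constraint line with one character per taxon.

    Taxa in `group` (numbered from 1) get '*', all others get '.'.
    """
    members = set(group)
    if not members or not all(1 <= t <= n_taxa for t in members):
        raise ValueError("group must contain taxon numbers between 1 and n_taxa")
    sign = "+" if positive else "-"
    marks = "".join("*" if t in members else "." for t in range(1, n_taxa + 1))
    return sign + marks


# Positive constraint grouping taxa 1, 3 and 5 in an 8-taxon dataset.
print(branch_constraint(8, [1, 3, 5]))
# The same branch described from the other side.
print(branch_constraint(8, [2, 4, 6, 7, 8]))
# A negative (converse) constraint on the same group.
print(branch_constraint(8, [1, 3, 5], positive=False))
```

Each returned string describes a single branch, so several constrained branches would be written as several lines in the constraint file.<br />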
<br />
===streefname (source of starting tree and/or model)===<br />
'''streefname''' = (random, <u>'''stepwise'''</u>, <filename>) – Specifies where the starting tree topology and/or <br />
model parameters will come from. The tree topology may be a completely random topology <br />
(constraints will be enforced), a tree provided by the user in a file, <u>or a tree generated by the <br />
program using a fast ML stepwise-addition algorithm (see [[GARLI_Configuration_Settings#attachmentspertaxon_.28control_creation_of_stepwise_addition_starting_tree.29|attachmentspertaxon]] below)</u>. <br />
Starting or fixed model parameter values may also be provided in the specified file, with or <br />
without a tree topology. Some notes on starting trees/models: <br />
*Specified starting trees may have polytomies, which will be arbitrarily resolved before the run begins.<br />
*Starting tree formats: <br />
**Plain newick tree string (with taxon numbers or names, with or without branch lengths) <br />
**NEXUS trees block. The trees block can appear in the same file as a NEXUS data or characters block that contains the alignment, although the same filename should then be specified on both the datafname and streefname lines.<br />
*If multiple trees appear in the specified file and multiple search replicates are specified (see '''[[GARLI_Configuration_Settings#searchreps_.28number_of_independent_search_replicates.29|searchreps]]''' setting), then the first tree is used in the first replicate, the second in the second replicate, etc. <br />
*Providing model parameter values: see this page '''[[Specifying_model_parameter_values | Specifying model parameter values]]'''<br />
*See also the FAQ items on model parameters '''[[FAQ#Model_parameters | here]]'''.<br />
<br />
===attachmentspertaxon (control creation of stepwise addition starting tree)===<br />
'''attachmentspertaxon''' = (1 to infinity, '''50''') – The number of attachment branches evaluated for each taxon to be added to the tree during the creation of an ML stepwise-addition starting tree. Briefly, stepwise addition is an algorithm used to make a tree, and involves adding taxa in a random order to a growing tree. For each taxon to be added, a number of randomly chosen attachment branches are tried and scored, and then the best scoring one is chosen as the location of that taxon. The attachmentspertaxon setting controls how many attachment points are evaluated for each taxon to be added. A value of one is equivalent to a completely random tree (only one randomly chosen location is evaluated). A value of greater than 2 times the number of taxa in the dataset means that all attachment points will be evaluated for each taxon, and will result in very good starting trees (but may take a while on large datasets). Even fairly small values (< 10) can result in starting trees that are much, much better than random, but still fairly different from one another. <br />
===ofprefix (output filename prefix)===<br />
'''ofprefix''' = (text) – Prefix of all output filenames, such as log, treelog, etc. Change this for each run that you do or the program will overwrite previous results. <br />
===randseed (random number seed)===<br />
'''randseed''' = (-1 or positive integers, '''-1''') – The random number seed used by the random number generator. Specify “–1” to have a seed chosen for you. Specifying the same seed number in multiple runs will give exactly identical results, if all other parameters and settings are also identical. <br />
===availablememory (control maximum program memory usage)===<br />
'''availablememory''' – Typically this is the amount of available physical memory on the system, in megabytes. This lets GARLI determine how much system memory it may be able to use to store computations for reuse. The program will be conservative and use at most about 90% of this value, although for typical datasets it will use much, much less. If other programs must be open or used when GARLI is running, you may need to reduce this value. This setting can also have a significant effect on performance (speed), but more is not always better. When a run is started, GARLI will output the '''availablememory''' value necessary for that dataset to achieve each of the “memory levels” (from best to worst, termed “great”, “good”, “low”, and “very low”). More memory is generally better because more calculations are stored and can be reused, but when the amount of memory needed for “great” becomes more than 512 megabytes or so, performance can be slowed because the operating system has difficulty managing that much memory. In general, choose an amount of memory that allows “great” when this is less than 512 MB, and if it is greater reduce the amount of memory into “good” or “low” as necessary. Avoid “very low” whenever possible. You can find the value that is approximately optimal for your dataset by setting the '''randseed''' to some <br />
positive value (so that the searches are identical) and doing runs with various '''availablememory''' values. Typically it will only make at most a 30% difference, so it isn't worth worrying about too much. <br />
===logevery (frequency to log best score to file)===<br />
'''logevery''' = (1 to infinity, '''10''') – The frequency at which the best score is written to the log file. <br />
===saveevery (frequency to save best tree to file or write checkpoints)===<br />
'''saveevery''' = (1 to infinity, '''100''') – If '''writecheckpoints''' or '''outputcurrentbesttopology''' are specified, this is the frequency (in generations) at which checkpoints or the current best tree are written to file. <br />
===refinestart (whether to optimize a bit before starting a search)===<br />
'''refinestart''' = (0 or 1, '''1''') – Specifies whether some initial rough optimization is performed on the starting branch lengths and rate heterogeneity parameters. This is always recommended. <br />
===outputcurrentbesttopology (continuously write the best tree to file during a run)===<br />
'''outputcurrentbesttopology''' = (0 or 1, '''0''') – If true, the current best tree of the current search replicate is written to <ofprefix>.best.current.tre every '''saveevery''' generations. In versions before 0.96 the current best topology was always written to file, but that is no longer the case. Seeing the current best tree has no real use apart from satisfying your curiosity about how a <br />
run is going.<br />
===outputeachbettertopology (write each improved topology to file)===<br />
'''outputeachbettertopology''' = (0 or 1, '''0''') – If true, each new topology encountered with a better score than the previous best is written to file. In some cases this can result in really big files (hundreds of MB) though, especially for random starting topologies on large datasets. Note that this file is interesting to get an idea of how the topology changed as the searches progressed, but the collection of trees should NOT be interpreted in any meaningful way. This option is not available while bootstrapping. <br />
===enforcetermconditions (use automatic termination)===<br />
'''enforcetermconditions''' = (0 or 1, '''1''') – Specifies whether the automatic termination conditions will be used. Both of the conditions defined by the following two parameters ('''genthreshfortopoterm''' and '''scorethreshforterm''') must be met for the run to terminate. If this is false, the run will continue until it reaches the time ('''stoptime''') or generation ('''stopgen''') limit. It is highly recommended that this option be used! <br />
===genthreshfortopoterm (number of generations without topology improvement required for termination) ===<br />
'''genthreshfortopoterm''' = (1 to infinity, '''20,000''') – This specifies the first part of the termination condition. When no new significantly better scoring topology (see '''significanttopochange''' below) has been encountered in greater than this number of generations, this condition is met. Increasing this parameter may improve the lnL scores obtained (especially on large datasets), but will also increase runtimes. <br />
===scorethreshforterm (max score improvement over recent generations required for termination)===<br />
'''scorethreshforterm''' = (0 to infinity, '''0.05''') – The second part of the termination condition. When the total improvement in score over the last '''intervallength''' x '''intervalstostore''' generations (default is 500 generations, see below) is less than this value, this condition is met. This does not usually need to be changed. <br />
===significanttopochange (required score improvement for topology to be considered better)===<br />
'''significanttopochange''' = (0 to infinity, '''0.01''') – The lnL increase required for a new topology to be <br />
considered significant as far as the termination condition is concerned. It probably doesn’t <br />
need to be played with, but you might try increasing it slightly if your runs reach a stable <br />
score and then take a very long time to terminate due to very minor changes in topology. <br />
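The two-part termination rule described in the preceding sections can be summarized in a few lines of Python. This is only an illustrative sketch of the logic as documented here, not GARLI's actual source code. The function name and its arguments are hypothetical, and the exact comparison operators are an assumption. <br />

```python
def should_terminate(gens_since_topo_improvement, recent_score_improvement,
                     genthreshfortopoterm=20000, scorethreshforterm=0.05):
    """Return True when both documented termination conditions are met.

    gens_since_topo_improvement: generations since a topology improved the
        lnL by at least significanttopochange (hypothetical bookkeeping).
    recent_score_improvement: total lnL improvement over the last
        intervallength x intervalstostore generations.
    """
    # Condition 1: no significantly better topology for more than the threshold.
    topo_condition = gens_since_topo_improvement > genthreshfortopoterm
    # Condition 2: the recent total score improvement is below the threshold.
    score_condition = recent_score_improvement < scorethreshforterm
    return topo_condition and score_condition


print(should_terminate(25000, 0.01))  # both conditions met
print(should_terminate(25000, 0.20))  # score is still improving
print(should_terminate(5000, 0.01))   # topology improved too recently
```

Under this reading, raising '''genthreshfortopoterm''' or lowering '''scorethreshforterm''' can only delay termination, never hasten it. <br />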
===outputphyliptree (write trees to file in Phylip as well as Nexus format)===<br />
'''outputphyliptree''' = (0 or 1, '''0''') – Whether Phylip-formatted tree files will be output in addition to <br />
the default Nexus files for the best tree across all replicates (<ofprefix>.best.phy), the best <br />
tree for each replicate (<ofprefix>.best.all.phy) or, in the case of bootstrapping, the best tree <br />
for each bootstrap replicate (<ofprefix>.boot.phy). <br />
===outputmostlyuselessfiles (output uninteresting files)===<br />
'''outputmostlyuselessfiles''' = (0 or 1, '''0''') – Whether to output three files of little general interest: the <br />
“fate”, “problog” and “swaplog” files. The fate file shows the parentage, mutation types and <br />
scores of every individual in the population during the entire search. The problog shows how <br />
the proportions of the different mutation types changed over the course of the run. The <br />
swaplog shows the number of unique swaps and the number of total swaps on the current <br />
best tree over the course of the run. <br />
===writecheckpoints (write checkpoint files during run)===<br />
'''writecheckpoints''' = (0 or 1, '''0''') – Whether to write three files to disk containing all information <br />
about the current state of the population every '''saveevery''' generations, with each successive <br />
checkpoint overwriting the previous one. These files can be used to restart a run at the last <br />
written checkpoint by setting the '''restart''' configuration entry.<br />
===restart (restart run from checkpoint)===<br />
'''restart''' = (0 or 1, '''0''') – Whether to restart at a previously saved checkpoint. To use this option the '''writecheckpoints''' option must have been used during a previous run. The program will look for checkpoint files that are named based on the ofprefix of the previous run. If you intend to restart a run, NOTHING should be changed in the config file except setting '''restart''' to 1. <br />
A run that is restarted from checkpoint will give ''exactly'' the same results it would have if the run had gone to completion. <br />
===outgroup (orient inferred trees consistently)===<br />
'''outgroup''' = (outgroup taxa numbers, separated by spaces) – This option allows for orienting the tree topologies in a consistent way when they are written to file. Note that this has NO effect whatsoever on the actual inference and the specified outgroup is NOT constrained to be present in the inferred trees. If multiple outgroup taxa are specified and they do not form a monophyletic group in the inferred tree, this setting will be ignored. If you specify a single outgroup taxon it will always be present, and the tree will always be consistently oriented. Ranges can be indicated with a hyphen, e.g., to specify an outgroup consisting of taxa 1, 2, 3 and 5 the format is this: <br />
outgroup = 1-3 5<br />
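To make the range syntax concrete, the following Python sketch expands an outgroup string into the individual taxon numbers. It only illustrates the notation described above and is not how GARLI itself parses the entry; the function name is hypothetical. <br />

```python
def parse_outgroup(spec):
    """Expand an outgroup string such as '1-3 5' into sorted taxon numbers."""
    taxa = set()
    for token in spec.split():
        if "-" in token:
            # A hyphen denotes an inclusive range of taxon numbers.
            start, end = token.split("-")
            taxa.update(range(int(start), int(end) + 1))
        else:
            taxa.add(int(token))
    return sorted(taxa)


print(parse_outgroup("1-3 5"))  # [1, 2, 3, 5]
```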
<br />
===searchreps (number of independent search replicates)===<br />
'''searchreps''' = (1 to infinity, '''2''') – The number of independent search replicates to perform during a program execution. You should always either do multiple search replicates or multiple program executions with any dataset to get a feel for whether you are getting consistent results, which suggests that the program is doing a decent job of searching. Note that if this is > 1 and you are performing a bootstrap analysis, this is the number of search replicates to be done per bootstrap replicate. That can increase the chance of finding the best tree per bootstrap replicate, but will also increase bootstrap runtimes enormously.<br />
===bootstrapreps (number of bootstrap replicates)===<br />
'''bootstrapreps''' = (0 to infinity, '''0''') – The number of bootstrap reps to perform. If this is greater than <br />
0, normal searching will not be performed. The resulting bootstrap trees (one per rep) will be <br />
output to a file named <ofprefix>.boot.tre. To obtain the bootstrap proportions they will then <br />
need to be read into PAUP* or a similar program to obtain a majority rule consensus. Note <br />
that it is probably safe to reduce the strictness of the termination conditions during <br />
bootstrapping (perhaps halve genthreshfortopoterm), which will greatly speed up the <br />
bootstrapping process with negligible effects on the results.<br />
<br />
Note that GARLI does NOT do the bootstrap consensus or calculate the bootstrap proportions itself. It simply infers the trees that are the input to that consensus. The trees in the .boot.tre file will need to be read into a program such as SumTrees (or PAUP* or CONSENSE) to do a majority-rule consensus and obtain your bootstrap support values. See the following part of the "Advanced topics" page for suggestions on making the consensus: '''[[Advanced_topics#Detailed_Example:_A_bootstrap_analysis|Detailed Example: A bootstrap analysis]]'''.<br />
<br />
===resampleproportion (relative size of re-sampled data matrix)===<br />
'''resampleproportion''' = (0.1 to 10, '''1.0''') – When '''bootstrapreps''' > 0, this setting allows for <br />
bootstrap-like resampling, but with the pseudoreplicate datasets having a number of <br />
alignment columns different from that of the real data. Setting values < 1.0 is somewhat similar to jackknifing, but not identical. <br />
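The idea of resampling a different number of columns can be sketched in Python as follows. This is a generic illustration of bootstrap-like resampling with replacement, not GARLI's implementation. The function name is hypothetical, and how the target number of columns is rounded is an assumption. <br />

```python
import random


def resample_columns(num_columns, resampleproportion=1.0, seed=None):
    """Draw alignment column indices with replacement.

    The pseudoreplicate has about num_columns * resampleproportion columns
    (rounding to the nearest integer is an assumption of this sketch).
    """
    rng = random.Random(seed)
    num_to_draw = int(round(num_columns * resampleproportion))
    return [rng.randrange(num_columns) for _ in range(num_to_draw)]


# A 100-column alignment resampled at half size and at double size.
half = resample_columns(100, resampleproportion=0.5, seed=1)
double = resample_columns(100, resampleproportion=2.0, seed=1)
print(len(half), len(double))  # 50 200
```

Because columns are still drawn with replacement, a value below 1.0 can select the same column more than once, which is one way in which this differs from a true jackknife. <br />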
===inferinternalstateprobs (infer ancestral states)===<br />
'''inferinternalstateprobs''' = (0 or 1, '''0''') – Specify 1 to have GARLI infer the marginal posterior <br />
probability of each character at each internal node. This is done at the very end of the run, <br />
just before termination. The results are output to a file named <ofprefix>.internalstates.log.<br />
<br />
===outputsitelikelihoods (write a file with the log-likelihood of each site)===<br />
<br />
'''outputsitelikelihoods''' = (0 or 1, '''0''') - Causes a file named <ofprefix>.sitelikes.log to be created containing the sitewise log-likelihood values for each individual site for the final tree of each search replicate. Note that the format of the file is exactly the same as that output by PAUP, so it can be directly used in CONSEL using the PAUP file-format option. For partitioned models the sites are listed in sets by partition subset, but the site numbers will match those in the original datamatrix. Thus, when using a partitioned model that divides sites by codon position, the ordering would be 1, 4, 7, ... 2, 5, 8, ... 3, 6, 9 .... Note that this makes no difference to CONSEL, which just assumes that the site likelihoods are ordered the same for each tree and ignores the site numbers. <br />
<br />
Also note that using the site likelihoods from a partitioned model is a violation of CONSEL's assumptions, since it will not know about the partitions and its internal bootstrap resampling will not obey them. It isn't clear what the effects of this will be on the various tests. <br />
<br />
Finally, if you want to calculate the site likelihoods for a set of trees that you provide, use the '''[[GARLI_Configuration_Settings#optimizeinputonly_.28do_not_search.2C_only_optimize_model_and_branch_lengths_on_user_trees.29 | optimizeinputonly]]''' setting.<br />
<br />
===optimizeinputonly (do not search, only optimize model and branch lengths on user trees)===<br />
(new in version 2.0)<br />
<br />
'''optimizeinputonly''' = (0 or 1, '''0''') - Requires a NEXUS formatted input tree file (only NEXUS format will work!), specified through the '''[[GARLI_Configuration_Settings#streefname_.28source_of_starting_tree_and.2For_model.29 | streefname]]''' setting. All trees in that file will have model parameters and branch lengths optimized, giving their maximum likelihood scores. A file with the site-likelihoods of each tree will also be output. See the '''[[GARLI_Configuration_Settings#outputsitelikelihoods_.28write_a_file_with_the_log-likelihood_of_each_site.29 | outputsitelikelihoods]]''' setting for details.<br />
<br />
===collapsebranches (collapse zero length branches before writing final trees to file)===<br />
<br />
'''collapsebranches''' = (0 or 1, '''1''') - Before version 1.0, all trees that were returned were fully resolved. This was true even if the maximum-likelihood estimates of some internal branch lengths were effectively zero (or GARLI's minimum, which is 1e-8). In such cases, collapsing the branch into a polytomy would be a better representation. Note that GARLI will never return a tree with an actual branch length of zero, but rather with its minimum value of 1.0e-8. The drawback of always returning fully resolved trees is that what is effectively a polytomy can be resolved in three ways, and different independent searches may randomly return one of those resolutions. Thus, if you compare the trees by topology only, they will look different. If you pay attention to the branch lengths and likelihood scores of the trees it will be apparent that they are effectively the same.<br />
<br />
I think that collapsing of branches is particularly important when bootstrapping, since no support should be given to a branch that doesn't really exist, i.e., that is a random resolution of a polytomy. Collapsing is also good when calculating tree to tree distances such as the symmetric tree distance, for example when calculating phylogenetic error to a known target tree. Zero-length branches would add to the distances (~error) although they really should not.<br />
<br />
==Model specification settings==<br />
<br />
With version 1.0 and later there are now many more options dealing with model specification because of <br />
the inclusion of amino acid and codon-based models. The description of the settings will be <br />
broken up by data type. Note that in terms of the model settings in GARLI, “empirical” means <br />
to fix parameter values at those observed in the dataset being analyzed, and “fixed” means to fix <br />
them at user specified values. See the '''[[GARLI_Configuration_Settings#streefname_.28source_of_starting_tree_and.2For_model.29|streefname]]''' setting for details on how to provide <br />
parameter values to be fixed during inference.<br />
<br />
'''PARTITIONED MODELS: To use partitioned models you'll still need to use the same basic model settings detailed below for each individual partition subset, and then some other details to set up the partition model itself. Be sure that you are familiar with the rest of this section, then see [[Using_partitioned_models | Using partitioned models]].'''<br />
<br />
===datatype (sequence type and inference model)=== <br />
'''datatype''' = ('''nucleotide''', aminoacid, codon-aminoacid, codon, standard, standardordered, standardvariable, standardvariableordered) – The type of data and model that is <br />
to be used during tree inference. Nucleotide and amino acid data are self-explanatory.<br />
<br />
The codon-aminoacid datatype means that the data will be <br />
supplied as a nucleotide alignment, but will be internally translated and analyzed using an <br />
amino acid model. The codon and codon-aminoacid datatypes require nucleotide sequence <br />
that is aligned in the correct reading frame. In other words, all gaps in the alignment should <br />
be a multiple of 3 in length, and the alignment should start at the first position of a codon. If <br />
the alignment has extra columns at the start, middle or end, they should be removed or <br />
excluded with a Nexus exset (see '''[[FAQ#Can_I_specify_alignment_columns_of_my_data_matrix_to_be_excluded.3F | this FAQ item]]''' for an example of exset usage). The correct <br />
'''[[GARLI_Configuration_Settings#geneticcode_.28code_to_use_in_codon_translation.29 | geneticcode]]''' must also be set.<br />
<br />
(New in Version 2.0)<br />
<br />
The various "standard" datatypes are new in GARLI 2.0. These represent morphology-like discrete characters, with any number of states. These are also termed the "Mk" and "Mkv" models by Lewis (2001). See this page for more details on using standard data: '''[[Mkv morphology model]]'''.<br />
<br />
===Settings for datatype = nucleotide=== <br />
====ratematrix (relative rate parameters assumed by substitution model)====<br />
'''ratematrix''' = (1rate, 2rate, '''6rate''', fixed, custom string) – The number of relative substitution rate <br />
parameters (note that the number of free parameters is this value minus one). Equivalent to <br />
the “nst” setting in PAUP* and MrBayes. 1rate assumes that substitutions between all pairs <br />
of nucleotides occur at the same rate (JC model), 2rate allows different rates for transitions and <br />
transversions (K2P or HKY models), and 6rate allows a different rate between each nucleotide pair (GTR). These rates are <br />
estimated unless the fixed option is chosen. Since version 0.96, parameters for any <br />
submodel of the GTR model may be estimated. The format for specifying this is very <br />
similar to that used in the “rclass” setting of PAUP*. Within parentheses, six letters are <br />
specified, with spaces between them. The six letters represent the rates of substitution <br />
between the six pairs of nucleotides, with the order being A-C, A-G, A-T, C-G, C-T and G-T. <br />
Letters within the parentheses that are the same mean that a single parameter is shared by <br />
multiple nucleotide pairs. For example, <br />
ratematrix = (a b a a b a) <br />
would specify the HKY 2-rate model (equivalent to ratematrix = 2rate). This entry, <br />
ratematrix = (a b c c b a) <br />
would specify 3 estimated rates of substitution, with one rate shared by A-C and G-T <br />
substitutions, another rate shared by A-G and C-T substitutions, and the final rate shared by <br />
A-T and C-G substitutions.<br />
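The mapping from a custom '''ratematrix''' string to shared rate parameters can be checked with a small Python sketch. This is only an illustration of the notation described above (six letters in the order A-C, A-G, A-T, C-G, C-T, G-T), not GARLI's own parser; the function name is hypothetical. <br />

```python
# Order of nucleotide pairs as documented for the custom ratematrix string.
PAIRS = ["A-C", "A-G", "A-T", "C-G", "C-T", "G-T"]


def parse_ratematrix(spec):
    """Group the six nucleotide pairs by the letter assigned to each."""
    letters = spec.strip().strip("()").split()
    if len(letters) != 6:
        raise ValueError("expected six rate letters, got %d" % len(letters))
    groups = {}
    for pair, letter in zip(PAIRS, letters):
        groups.setdefault(letter, []).append(pair)
    return groups


for spec in ("(a b a a b a)", "(a b c c b a)"):
    groups = parse_ratematrix(spec)
    # One of the relative rates serves as the reference, so the number of
    # free parameters is the number of distinct rates minus one.
    print(spec, groups, "free parameters:", len(groups) - 1)
```

For the second string, the sketch groups A-C with G-T, A-G with C-T, and A-T with C-G, matching the description in the text. <br />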
<br />
====statefrequencies (equilibrium base frequencies assumed by substitution model)====<br />
'''statefrequencies''' = (equal, empirical, '''estimate''', fixed) – Specifies how the equilibrium state <br />
frequencies (A, C, G and T) are treated. The empirical setting fixes the frequencies at their <br />
observed proportions, and the other options should be self-explanatory. <br />
===For datatype = nucleotide or aminoacid===<br />
====invariantsites (treatment of proportion of invariable sites parameter)====<br />
'''invariantsites''' = (none, '''estimate''', fixed) – Specifies whether a parameter representing the <br />
proportion of sites that are unable to change (i.e. have a substitution rate of zero) will be <br />
included. This is typically referred to as “invariant sites”, but would better be termed <br />
“invariable sites”. <br />
====ratehetmodel (type of rate heterogeneity to assume for variable sites)====<br />
'''ratehetmodel''' = (none, '''gamma''', gammafixed) – The model of rate heterogeneity assumed. <br />
“gammafixed” requires that the alpha shape parameter is provided, and a setting of “gamma” <br />
estimates it. <br />
====numratecats (number of overall substitution rate categories)==== <br />
'''numratecats''' = (1 to 20, '''4''') – The number of categories of variable rates (not including the <br />
invariant site class if it is being used). Must be set to 1 if '''ratehetmodel''' is set to none. Note <br />
that runtimes and memory usage scale linearly with this setting. <br />
===For datatype = aminoacid or codon-aminoacid===<br />
Amino acid analyses are typically done using fixed rate matrices that have been estimated on <br />
large datasets and published. Typically the only model parameters that are estimated during tree <br />
inference relate to the rate heterogeneity distribution. Each of the named matrices also has <br />
corresponding fixed amino acid frequencies, and a given rate matrix can either be used with those <br />
frequencies or with the amino acid frequencies observed in your dataset. This second option is <br />
often denoted as “+F” in a model description, although in terms of the GARLI configuration <br />
settings this is referred to as “empirical” frequencies. In GARLI the Dayhoff model would be <br />
specified by setting both the ratematrix and statefrequencies options to “dayhoff”. The <br />
Dayhoff+F model would be specified by setting the ratematrix to “dayhoff”, and <br />
statefrequencies to “empirical”.<br />
<br />
The following named amino acid models are implemented: <br />
{| border="1"<br />
|-<br />
!'''ratematrix'''/'''statefrequencies''' setting !! reference <br />
|-<br />
|align="center" | dayhoff || Dayhoff, Schwartz and Orcutt. 1978.<br />
|-<br />
|align="center" | jones || Jones, Taylor and Thornton (JTT), 1992. <br />
|-<br />
|align="center" | WAG || Whelan and Goldman, 2001.<br />
|-<br />
|align="center" | mtREV || Adachi and Hasegawa, 1996.<br />
|-<br />
|align="center" | mtmam || Yang, Nielsen and Hasegawa, 1998.<br />
|}<br />
Note that most programs allow either the use of a named rate matrix and its corresponding state <br />
frequencies, or a named rate matrix and empirical frequencies. GARLI technically allows the <br />
mixing of different named matrices and equilibrium frequencies (for example, wag matrix with <br />
jones equilibrium frequencies), but this is not recommended. <br />
====ratematrix (amino acid substitution rates)====<br />
'''ratematrix''' = (poisson, jones, dayhoff, wag, mtmam, mtrev) – The fixed amino acid rate matrix to <br />
use. You should use the matrix that gives the best likelihood, and could use a program like <br />
PROTTEST (very much like MODELTEST, but for amino acid models) to determine which <br />
fits best for your data. Poisson assumes a single rate of substitution between all amino acid <br />
pairs, and is a very poor model.<br />
====statefrequencies (equilibrium amino acid frequencies assumed by substitution model)====<br />
'''statefrequencies''' = (equal, empirical, estimate, fixed, jones, dayhoff, wag, mtmam, mtrev) – <br />
Specifies how the equilibrium state frequencies of the 20 amino acids are treated. The <br />
“empirical” option fixes the frequencies at their observed proportions (when describing a <br />
model this is often termed “+F”). <br />
===For datatype = codon===<br />
The codon models are built with three components: (1) parameters describing the process of <br />
individual nucleotide substitutions, (2) equilibrium codon frequencies, and (3) parameters <br />
describing the relative rate of nonsynonymous to synonymous substitutions. The nucleotide <br />
substitution parameters within the codon models are exactly the same as those possible with <br />
standard nucleotide models in GARLI, and are specified with the ratematrix configuration <br />
entry. Thus, they can be of the 2rate variety (inferring different rates for transitions and <br />
transversions, K2P or HKY-like), the 6rate variety (inferring different rates for all nucleotide <br />
pairs, GTR-like) or any other sub-model of GTR. The options for codon frequencies are <br />
specified with the statefrequencies configuration entry. The options are to use equal <br />
frequencies (not a good option), the frequencies observed in your dataset (termed “empirical” in <br />
GARLI), or the codon frequencies implied by the “F1x4” or “F3x4” methods (using PAML's <br />
terminology). These last two options calculate the codon frequencies as the product of the <br />
frequencies of the three nucleotides that make up each codon. In the “F1x4” case the nucleotide <br />
frequencies are those observed in the dataset across all codon positions, while the “F3x4” option <br />
uses the nucleotide frequencies observed in the data at each codon position separately. The final <br />
component of the codon models is the nonsynonymous to synonymous relative rate parameters <br />
(aka dN/dS or omega parameters). The default is to infer a single dN/dS value. Alternatively, a <br />
model can be specified that infers a given number of dN/dS categories, with the dN/dS values <br />
and proportions falling in each category estimated (ratehetmodel = nonsynonymous). This is <br />
the “discrete” or “M3” model in PAML's terminology. <br />
====ratematrix (relative ''nucleotide'' rate parameters assumed by codon model)====<br />
'''ratematrix''' = (1rate, '''2rate''', 6rate, fixed, custom string) – This determines the relative rates of <br />
nucleotide substitution assumed by the codon model. The options are exactly the same as <br />
those allowed under a normal nucleotide model. A codon model with '''ratematrix''' = 2rate <br />
specifies the standard Goldman and Yang (1994) model, with different substitution rates for <br />
transitions and transversions. <br />
====statefrequencies (equilibrium codon frequencies)====<br />
'''statefrequencies''' = (equal, empirical, f1x4, '''f3x4''') - The options are to use equal codon frequencies <br />
(not a good option), the frequencies observed in your dataset (termed “empirical” in GARLI), <br />
or the codon frequencies implied by the “F1x4” or “F3x4” methods (using PAML's <br />
terminology). These last two options calculate the codon frequencies as the product of the <br />
frequencies of the three nucleotides that make up each codon. In the “F1x4” case the <br />
nucleotide frequencies are those observed in the dataset across all codon positions, while the <br />
“F3x4” option uses the nucleotide frequencies observed in the data at each codon position <br />
separately. <br />
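The difference between the “F1x4” and “F3x4” options can be illustrated with a short Python sketch. It computes each codon frequency as the product of three nucleotide frequencies, as described above. The function name and example frequencies are hypothetical. Excluding the stop codons of the standard code and renormalizing the remaining 61 frequencies to sum to one is an assumption of this sketch, not something stated in this documentation. <br />

```python
from itertools import product

BASES = "ACGT"
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code


def codon_frequencies(position_freqs):
    """Codon frequencies from per-position nucleotide frequencies.

    position_freqs: list of three dicts (codon positions 1, 2 and 3), each
    mapping a base to its frequency. Pass the same dict three times for an
    F1x4-style calculation, or three different dicts for F3x4.
    """
    raw = {}
    for codon in ("".join(c) for c in product(BASES, repeat=3)):
        if codon in STOP_CODONS:
            continue  # assumption: stop codons are excluded
        raw[codon] = (position_freqs[0][codon[0]]
                      * position_freqs[1][codon[1]]
                      * position_freqs[2][codon[2]])
    total = sum(raw.values())  # assumption: renormalize over sense codons
    return {codon: value / total for codon, value in raw.items()}


# Hypothetical observed frequencies, for illustration only.
overall = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}
by_position = [
    {"A": 0.30, "C": 0.20, "G": 0.30, "T": 0.20},
    {"A": 0.30, "C": 0.25, "G": 0.15, "T": 0.30},
    {"A": 0.20, "C": 0.30, "G": 0.30, "T": 0.20},
]
f1x4 = codon_frequencies([overall, overall, overall])
f3x4 = codon_frequencies(by_position)
print(len(f1x4), f1x4["ATG"], f3x4["ATG"])
```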
====ratehetmodel (variation in dN/dS across sites)==== <br />
'''ratehetmodel''' = ('''none''', nonsynonymous) – For codon models, the default is to infer a single dN/dS <br />
parameter. Alternatively, a model can be specified that infers a given number of dN/dS <br />
categories, with the dN/dS values and proportions falling in each category estimated <br />
('''ratehetmodel''' = nonsynonymous). This is the “discrete” or “M3” model of Yang et al. <br />
(2000). <br />
====numratecats (number of discrete dN/dS categories)====<br />
'''numratecats''' = (1 to 20, '''1''') – When ratehetmodel = nonsynonymous, this is the number of dN/dS parameter <br />
categories.<br />
====invariantsites====<br />
'''invariantsites''' = (none) - '''NOTE:''' Due to an error on my part, the '''invariantsites''' entry must appear with codon models for them to run (despite the fact that it doesn't apply). i.e., be sure that this appears:<br />
invariantsites = none<br />
<br />
===For datatype = codon or codon-aminoacid=== <br />
====geneticcode (code to use in codon translation)====<br />
'''geneticcode''' = ('''standard''', vertmito, invertmito) – The genetic code to be used in translating codons <br />
into amino acids.<br />
<br />
==Population settings==<br />
===nindivs (number of individuals in population)===<br />
'''nindivs''' = (2 to 100, '''4''') – The number of individuals in the population. This may be increased, but <br />
doing so is generally not beneficial. Note that typical genetic algorithms tend to have much, <br />
much larger population sizes than GARLI's defaults. <br />
===holdover (unmutated copies of best individual)===<br />
'''holdover''' = (1 to nindivs-1, '''1''') – The number of times the best individual is copied to the next <br />
generation with no chance of mutation. It is best not to mess with this. <br />
===selectionintensity (strength of selection)===<br />
'''selectionintensity''' = (0.01 to 5.0, '''0.5''') – Controls the strength of selection, with larger numbers <br />
denoting stronger selection. The relative probability of reproduction of two individuals <br />
depends on the difference in their log likelihoods (&Delta;lnL) and is formulated very similarly to <br />
the procedure of calculating Akaike weights. The relative probability of reproduction of the <br />
less fit individual is equal to:<br />
<br />
<big><big><br />
''e''<sup> (-selectionIntensity &times; &Delta; lnL)</sup><br />
</big></big><br />
<br />
In general, this setting does not seem to have much of an effect on the progress of a run. In <br />
theory higher values should cause scores to increase more quickly, but make the search more <br />
likely to be entrapped in a local optimum. Low values will increase runtimes, but may be <br />
more likely to reach the true optimum. The following table gives the relative probabilities of <br />
reproduction for different values of the selection intensity when the difference in log <br />
likelihood is 1.0:<br />
{| border="1"<br />
|-<br />
!'''selectionintensity''' value !! Ratio of probabilities of reproduction <br />
|-<br />
|align="center" |0.05 || align="center" |0.95:1.0 <br />
|-<br />
|align="center" |0.1 || align="center" |0.90:1.0 <br />
|-<br />
|align="center" |0.25 || align="center" |0.78:1.0 <br />
|-<br />
|align="center" |0.5 || align="center" |0.61:1.0 <br />
|-<br />
|align="center" |0.75 || align="center" |0.47:1.0 <br />
|-<br />
|align="center" |1.0 || align="center" |0.37:1.0 <br />
|-<br />
|align="center" |2.0 || align="center" |0.14:1.0<br />
|}<br />
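The ratios in the table follow directly from the formula above, and the same calculation can be used for other values of the selection intensity or of the lnL difference. The Python below is a minimal sketch of that arithmetic; the function name is hypothetical. <br />

```python
import math


def relative_reproduction_probability(selectionintensity, delta_lnl):
    """Probability of reproduction of the less fit individual, relative to
    the fitter one: exp(-selectionintensity * delta_lnL)."""
    return math.exp(-selectionintensity * delta_lnl)


# Reproduce the table above for a log-likelihood difference of 1.0.
for s in (0.05, 0.1, 0.25, 0.5, 0.75, 1.0, 2.0):
    ratio = relative_reproduction_probability(s, 1.0)
    print(f"{s:<5} {ratio:.2f}:1.0")
```

With the default '''selectionintensity''' of 0.5, an individual that is 4 lnL units worse than another would, by the same formula, have a relative probability of reproduction of ''e''<sup>-2</sup>, or about 0.14. <br />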
===holdoverpenalty (fitness handicap for best individual)===<br />
'''holdoverpenalty''' = (0 to 100, '''0''') – This can be used to bias the probability of reproduction of the <br />
best individual downward. Because the best individual is automatically copied into the next <br />
generation, it has a bit of an unfair advantage and can cause all population variation to be lost <br />
due to genetic drift, especially with small population sizes. The value specified here is <br />
subtracted from the best individual’s lnL score before calculating the probabilities of<br />
reproduction. It seems plausible that this might help maintain variation, but I have not seen it <br />
cause a measurable effect.<br />
<br />
===stopgen (maximum number of generations to run)===<br />
'''stopgen''' – The maximum number of generations to run. Note that this supersedes the automated <br />
stopping criterion (see enforcetermconditions above), and should therefore be set to a very <br />
large value if automatic termination is desired. <br />
===stoptime (maximum time to run)===<br />
'''stoptime''' – The maximum number of seconds for the run to continue. Note that this supersedes <br />
the automated stopping criterion (see enforcetermconditions above), and should therefore <br />
be set to a very large value if automatic termination is desired.<br />
<br />
==Branch-length optimization settings== <br />
After a topological rearrangement, branch lengths in the vicinity of the rearrangement are <br />
optimized by the Newton-Raphson method. Optimization passes are performed on a particular <br />
branch until the expected improvement in likelihood for the next pass is less than a threshold <br />
value, termed the optimization precision. Note that this name is somewhat misleading, as the <br />
precision of the optimization algorithm is inversely related to this value (i.e., smaller values of <br />
the optimization precision lead to more precise optimization). If the improvement in likelihood <br />
due to optimization for a particular branch is greater than the optimization precision, <br />
optimization is also attempted on adjacent branches, spreading out across the tree. When no new <br />
topology with a better likelihood score is discovered for a while, the value is automatically <br />
reduced. The value can have a large effect on speed, with smaller values significantly slowing <br />
down the algorithm. The value of the optimization precision and how it changes over the course <br />
of a run are determined by the following three parameters. <br />
===startoptprec===<br />
'''startoptprec''' = (0.005 to 5.0, '''0.5''') – The beginning optimization precision. <br />
===minoptprec===<br />
'''minoptprec''' = (0.001 to startoptprec, '''0.01''') – The minimum allowed value of the optimization precision.<br />
<br />
===numberofprecreductions===<br />
'''numberofprecreductions''' = (0 to 100, '''10''') – Specifies the number of steps that it will take for the <br />
optimization precision to decrease (linearly) from startoptprec to minoptprec. <br />
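As a rough illustration of how these three settings interact, the following Python sketch lists the optimization precision values implied by a linear decrease in equal steps. It is an assumption that the steps are exactly equal and that a value of 0 for '''numberofprecreductions''' simply keeps the starting precision; the function name is hypothetical, and the sketch says nothing about when during a run each reduction is triggered. <br />

```python
def optprec_schedule(startoptprec=0.5, minoptprec=0.01, numberofprecreductions=10):
    """List the precision values from startoptprec down to minoptprec."""
    if numberofprecreductions == 0:
        # Assumption: with no reductions the precision stays at its start value.
        return [startoptprec]
    step = (startoptprec - minoptprec) / numberofprecreductions
    return [startoptprec - i * step for i in range(numberofprecreductions + 1)]


schedule = optprec_schedule()
print(len(schedule) - 1, "reductions")
print(", ".join(f"{value:.3f}" for value in schedule))
```

With the default settings each reduction lowers the precision value by (0.5 - 0.01) / 10 = 0.049. <br />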
===treerejectionthreshold===<br />
'''treerejectionthreshold''' = (0 to 500, '''50''') – This setting controls which trees have more extensive <br />
branch-length optimization applied to them. All trees created by a branch swap receive <br />
optimization on a few branches that directly took part in the rearrangement. If the difference <br />
in score between the partially optimized tree and the best known tree is greater than treerejectionthreshold, no further optimization is applied to the branches of that tree. <br />
Reducing this value can significantly reduce runtimes, often with little or no effect on results. <br />
However, it is possible that a better tree could be missed if this is set too low. In cases in <br />
which obtaining the very best tree per search is not critical (e.g., bootstrapping), setting this <br />
lower (~20) is probably safe.<br />
<br />
==Settings controlling the proportions of the mutation types== <br />
Each mutation type is assigned a prior ''weight''. These values determine the expected <br />
proportions of the various mutation types that are performed. The primary mutation categories <br />
are ''topology'' (t), ''model'' (m) and ''branch length'' (b). Each is assigned a prior weight ( P<sub>i</sub> ) in the <br />
config file. Each time that a new best likelihood score is attained, the amount of the increase in <br />
score is credited to the mutation type responsible, with the sum of the increases ( S<sub>i</sub> ) maintained <br />
over the last intervallength x intervalstostore generations. The number of times that each <br />
mutation is performed ( N<sub>i</sub> ) is also tallied. The total weight of a mutation type is W<sub>i</sub> = P<sub>i</sub> + ( S<sub>i</sub> / N<sub>i</sub> ). <br />
The proportion of mutations of type i out of all mutations is then <br />
<br />
<big><big><br />
Pr(i) = W<sub>i</sub> / <br />
(W<sub>t</sub> + W<sub>m</sub> + W<sub>b</sub>) <br />
</big></big><br />
<br />
The proportion of each mutation is thus related to its prior weight and the average increase in <br />
score that it has caused over recent generations. The prior weights can be used to control the <br />
expected (and starting) proportions of the mutation types, as well as how sensitive the <br />
proportions are to the course of events in a run. It is generally a good idea to make the topology <br />
prior much larger than the others so that when no mutations are improving the score many <br />
topology mutations are still attempted. If you set '''outputmostlyuselessfiles''' to 1, you can look at the “problog” file to determine what the <br />
proportions of the mutations actually were over the course of a run. <br />
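The weighting scheme above can be sketched in a few lines of Python. The function and dictionary names are hypothetical, and the treatment of a mutation type that has not yet been attempted (N<sub>i</sub> = 0) is an assumption, since the text does not say how that case is handled.<br />

```python
def mutation_proportions(prior, score_gain, count):
    """Pr(i) = W_i / sum of all W, with W_i = P_i + S_i / N_i."""
    weights = {}
    for kind, p in prior.items():
        n = count.get(kind, 0)
        # Assumption: a type not yet attempted contributes only its prior weight.
        average_gain = score_gain.get(kind, 0.0) / n if n > 0 else 0.0
        weights[kind] = p + average_gain
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one mutation type needs a positive weight")
    return {kind: w / total for kind, w in weights.items()}


# Default prior weights: topoweight, modweight, brlenweight.
priors = {"topology": 1.0, "model": 0.05, "brlen": 0.2}

# Before any score increases are recorded, only the priors matter.
print(mutation_proportions(priors, {}, {}))

# Hypothetical tallies over the stored intervals.
gains = {"topology": 2.0, "model": 0.5, "brlen": 0.0}
counts = {"topology": 100, "model": 10, "brlen": 40}
print(mutation_proportions(priors, gains, counts))
```

With the default priors and no recorded gains, topology mutations make up 1.0 / 1.25 = 80% of all mutations, which shows why a large topology prior keeps topology mutations frequent even when nothing is improving the score.<br />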
===topoweight (weight on topology mutations)===<br />
topoweight (0 to infinity, '''1.0''') The prior weight assigned to the class of topology mutations <br />
(NNI, SPR and limSPR). Note that setting this to 0.0 turns off topology mutations, meaning that the tree topology is fixed for the run. This used to be a way to have the program estimate only model parameters and branch-lengths, but the '''[[GARLI_Configuration_Settings#optimizeinputonly_.28do_not_search.2C_only_optimize_model_and_branch_lengths_on_user_trees.29 | optimizeinputonly]]''' setting is now a better way to go.<br />
<br />
===modweight (weight on model parameter mutations)===<br />
modweight (0 to infinity, '''0.05''') The prior weight assigned to the class of model mutations. Note <br />
that setting this at 0.0 fixes the model during the run. <br />
===brlenweight (weight on branch-length parameter mutations)===<br />
brlenweight (0 to infinity, '''0.2''') The prior weight assigned to branch-length mutations.<br />
<br />
The same procedure used above to determine the proportion of Topology:Model:Branch-Length <br />
mutations is also used to determine the relative proportions of the three types of topological <br />
mutations (NNI:SPR:limSPR), controlled by the following three weights. Note that the <br />
proportion of mutations applied to each of the model parameters is not user controlled. <br />
===randnniweight (weight on NNI topology changes)===<br />
randnniweight (0 to infinity, '''0.1''') - The prior weight assigned to NNI mutations. <br />
===randsprweight (weight on SPR topology changes)===<br />
randsprweight (0 to infinity, '''0.3''') - The prior weight assigned to random SPR mutations. For <br />
very large datasets it is often best to set this to 0.0, as random SPR mutations essentially <br />
never result in score increases. <br />
===limsprweight (weight on localized SPR topology changes)===<br />
limsprweight (0 to infinity, '''0.6''') - The prior weight assigned to SPR mutations with the <br />
reconnection branch limited to being a maximum of limsprrange branches away from where <br />
the branch was detached. <br />
===intervallength===<br />
intervallength (10 to 1000, '''100''') – The number of generations in each interval during which the <br />
number and benefit of each mutation type are stored. <br />
===intervalstostore===<br />
intervalstostore = (1 to 10, '''5''') – The number of intervals to be stored. Thus, records of <br />
mutations are kept for the last (intervallength x intervalstostore) generations. Every <br />
intervallength generations the probabilities of the mutation types are updated by the scheme <br />
described above.<br />
<br />
==Settings controlling mutation details==<br />
===limsprrange (max range for localized SPR topology changes)===<br />
limsprrange (0 to infinity, '''6''') – The maximum number of branches away from its original <br />
location that a branch may be reattached during a limited SPR move. Setting this too high (> <br />
10) can seriously degrade performance, but if you do so in conjunction with a large increase in genthreshfortopoterm you might end up with better trees.<br />
===meanbrlenmuts (mean # of branch lengths to change per mutation)===<br />
meanbrlenmuts (1 to # taxa, '''5''') - The mean of the binomial distribution from which the number <br />
of branch lengths mutated is drawn during a branch length mutation. <br />
===gammashapebrlen (magnitude of branch-length mutations)===<br />
gammashapebrlen (50 to 2000, '''1000''') - The shape parameter of the gamma distribution (with a <br />
mean of 1.0) from which the branch-length multipliers are drawn for branch-length <br />
mutations. Larger numbers cause smaller changes in branch lengths. (Note that this has <br />
nothing to do with gamma rate heterogeneity.) <br />
===gammashapemodel (magnitude of model parameter mutations)===<br />
gammashapemodel (50 to 2000, '''1000''') - The shape parameter of the gamma distribution (with a <br />
mean of 1.0) from which the model mutation multipliers are drawn for model parameters <br />
mutations. Larger numbers cause smaller changes in model parameters. (Note that this has <br />
nothing to do with gamma rate heterogeneity.) <br />
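To get a feel for how the shape parameter controls the size of the changes, the sketch below draws multipliers from a gamma distribution with mean 1.0 using Python's standard library. The function name is hypothetical and this is not how GARLI itself generates the values. For a gamma distribution with mean 1.0, the standard deviation is 1/sqrt(shape).<br />

```python
import random


def draw_multiplier(shape, rng=random):
    """Draw a multiplier from a gamma distribution with the given shape and mean 1.0."""
    # random.gammavariate takes (shape, scale); scale = 1/shape gives a mean of 1.0.
    return rng.gammavariate(shape, 1.0 / shape)


rng = random.Random(1)
for shape in (50, 1000, 2000):
    draws = [draw_multiplier(shape, rng) for _ in range(10000)]
    mean = sum(draws) / len(draws)
    sd = (sum((d - mean) ** 2 for d in draws) / len(draws)) ** 0.5
    print(f"shape={shape}: mean={mean:.3f}, sd={sd:.3f}, theoretical sd={shape ** -0.5:.3f}")
```

At the default shape of 1000 the theoretical standard deviation is about 0.03, so most multipliers change a value by only a few percent.<br />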
===uniqueswapbias (relative weight assigned to already attempted branch swaps)===<br />
uniqueswapbias (0.01 to 1.0, '''0.1''') – With version 0.95 and later, GARLI keeps track of which branch <br />
swaps it has attempted on the current best tree. Because swaps are applied randomly, it is <br />
possible that some swaps are tried twice before others are tried at all. This option allows the <br />
program to bias the swaps applied toward those that have not yet been attempted. Each swap <br />
is assigned a relative weight depending on the number of times that it has been attempted on <br />
the current best tree. This weight is equal to (uniqueswapbias) raised to the (# times swap <br />
attempted) power. In other words, a value of 0.5 means that swaps that have already been <br />
tried once will be half as likely as those not yet attempted, swaps attempted twice will be ¼ <br />
as likely, etc. A value of 1.0 means no biasing. If this value is not equal to 1.0 and the <br />
outputmostlyuselessfiles option is on, a file called <ofprefix>.swap.log is output. This file <br />
shows the total number of rearrangements tried and the number of unique ones over the course <br />
of a run. Note that this bias is only applied to NNI and limSPR rearrangements. Use of this <br />
option may allow the use of somewhat larger values of limsprrange. <br />
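The weighting rule for uniqueswapbias can be written out directly. In this Python sketch the swap labels and attempt counts are made up for illustration, and the use of weighted random choice is an assumption about how such weights might be applied.<br />

```python
import random


def swap_weight(uniqueswapbias, times_attempted):
    """Relative weight of a swap already attempted the given number of times."""
    return uniqueswapbias ** times_attempted


# Hypothetical record of how often each candidate swap was tried on the current best tree.
attempts = {"swap_A": 0, "swap_B": 1, "swap_C": 2}
weights = [swap_weight(0.5, n) for n in attempts.values()]
print(dict(zip(attempts, weights)))

# Pick the next swap with probability proportional to its weight.
rng = random.Random(0)
print(rng.choices(list(attempts), weights=weights, k=1)[0])
```

With a bias of 0.5 the three swaps get relative weights of 1, 0.5 and 0.25, matching the description above.<br />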
===distanceswapbias (relative weight assigned to branch swaps based on locality)===<br />
distanceswapbias (0.1 to 10, '''1.0''') – This option is similar to uniqueswapbias, except that it <br />
biases toward certain swaps based on the topological distance between the initial and <br />
rearranged trees. The distance is measured as in the limsprrange, and is half the <br />
Robinson-Foulds distance between the trees. As with uniqueswapbias, distanceswapbias <br />
assigns a relative weight to each potential swap. In this case the weight is <br />
(distanceswapbias) raised to the (reconnection distance - 1) power. Thus, given a value of <br />
0.5, the weight of an NNI is 1.0, the weight of an SPR with distance 2 is 0.5, with distance 3 <br />
is 0.25, etc. Note that values less than 1.0 bias toward more localized swaps, while values <br />
greater than 1.0 bias toward more extreme swaps. Also note that this bias is only applied to <br />
limSPR rearrangements. Be careful in setting this, as extreme values can have a very large <br />
effect.</div>Zwicklhttps://molevol.mbl.edu/index.php?title=GARLI_Configuration_Settings&diff=4374GARLI Configuration Settings2015-07-21T15:42:35Z<p>Zwickl: /* criptions of GARLI configuration settings */</p>
<hr />
<div>==Descriptions of GARLI configuration settings==<br />
<br />
<br />
The format for these configuration settings descriptions is generally:<br />
'''entryname''' (possible values, '''default value in bold''') – description<br />
<br />
==General settings==<br />
===datafname (file containing sequence dataset)=== <br />
'''datafname''' = (filename) –<br />
Name of the file containing the aligned sequence data. Formats accepted are <br />
PHYLIP, NEXUS and FASTA. Robust reading of Nexus formatted <br />
datasets is done using the Nexus Class Library. This accommodates things such as interleaved <br />
alignments and exclusion sets (exset’s) in NEXUS assumptions blocks (see the FAQ for an <br />
example of [[FAQ#Can_I_specify_alignment_columns_of_my_data_matrix_to_be_excluded.3F | exset usage]]). Use of NEXUS files is recommended, and is required for partitioned models.<br />
<br />
===constraintfile (file containing constraint definition)===<br />
'''constraintfile''' = (filename, '''none''') –<br />
Name of the file containing any topology constraint specifications, or “none” if <br />
there are no constraints. The easiest way to explain the format of the constraint file is by <br />
example. Consider a dataset of 8 taxa, in which your constraint consists of grouping taxa 1, 3 <br />
and 5. You may specify either positive constraints (inferred tree MUST contain constrained <br />
group) or negative constraints (also called converse constraints, inferred tree CANNOT <br />
contain constrained group). These are specified with either a ‘+’ or a ‘-’ at the beginning of <br />
the constraint specification, for positive and negative constraints, respectively. <br />
*For a positive constraint on a grouping of taxa 1, 3 and 5: <br />
+((1,3,5),2,4,6,7,8); <br />
*For a negative constraint on a grouping of taxa 1, 3 and 5: <br />
-((1,3,5),2,4,6,7,8); <br />
*Note that there are many other equivalent parenthetical representations of these constraints. <br />
*Multiple groups may be positively constrained, but currently only a single negatively constrained group is allowed. <br />
*Multiple constrained groupings may be specified in a single string: <br />
+((1,3,5),2,4,(6,7),8); <br />
or in two separate strings on successive lines: <br />
+((1,3,5),2,4,6,7,8); <br />
+(1,3,5,2,4,(6,7),8); <br />
*Constraint strings may also be specified in terms of taxon names (matching those used in the alignment) instead of numbers. <br />
*Positive and negative constraints cannot be mixed. <br />
*GARLI also accepts another constraint format that may be easier to use in some cases. This involves specifying a single branch to be constrained with a string of ‘*’ (asterisk) and ‘.’ (period) characters, with one character per taxon. Each taxon specified with a ‘*’ falls on one side of the constrained branch, and all those specified with a ‘.’ fall on the other. This should be familiar to anyone who has looked at PAUP* bootstrap output. With this format, a positive constraint on a grouping of taxa 1, 3 and 5 would look like this: <br />
 +*.*.*... <br />
or equivalently like this: <br />
+.*.*.*** <br />
With this format each line only designates a single branch, so multiple constrained branches must be specified as multiple lines in the file. <br />
*The program also allows “backbone” constraints, which are simply constraints that do not include all of the taxa. For example if one wanted to constrain (1,3,5) but didn’t care where taxon 6 appeared in the tree, this string could be used: <br />
+((1,3,5),2,4,7,8); <br />
:Nothing special needs to be done to identify this as a backbone constraint, simply leave out some taxa.<br />
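The two constraint formats described above are simple enough to generate with a short script. The following Python sketch uses hypothetical helper names (they are not part of GARLI) and builds a constraint line for a single group of taxon numbers in either format.<br />

```python
def newick_constraint(group, n_taxa, positive=True):
    """Build a parenthetical constraint line for one group of taxon numbers."""
    inside = ",".join(str(t) for t in sorted(group))
    outside = ",".join(str(t) for t in range(1, n_taxa + 1) if t not in group)
    sign = "+" if positive else "-"
    return f"{sign}(({inside}),{outside});"


def bipartition_constraint(group, n_taxa, positive=True):
    """Build the asterisk/period form, with one character per taxon."""
    sign = "+" if positive else "-"
    return sign + "".join("*" if t in group else "." for t in range(1, n_taxa + 1))


print(newick_constraint({1, 3, 5}, 8))       # +((1,3,5),2,4,6,7,8);
print(bipartition_constraint({1, 3, 5}, 8))  # +*.*.*...
print(newick_constraint({1, 3, 5}, 8, positive=False))
```

The sketch assumes that at least one taxon falls outside the constrained group, and it does not produce backbone constraints.<br />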
<br />
===streefname (source of starting tree and/or model)===<br />
'''streefname''' = (random, <u>'''stepwise'''</u>, <filename>) – Specifies where the starting tree topology and/or <br />
model parameters will come from. The tree topology may be a completely random topology <br />
(constraints will be enforced), a tree provided by the user in a file, <u>or a tree generated by the <br />
program using a fast ML stepwise-addition algorithm (see [[GARLI_Configuration_Settings#attachmentspertaxon_.28control_creation_of_stepwise_addition_starting_tree.29|attachmentspertaxon]] below)</u>. <br />
Starting or fixed model parameter values may also be provided in the specified file, with or <br />
without a tree topology. Some notes on starting trees/models: <br />
*Specified starting trees may have polytomies, which will be arbitrarily resolved before the run begins.<br />
*Starting tree formats: <br />
**Plain newick tree string (with taxon numbers or names, with or without branch lengths) <br />
**NEXUS trees block. The trees block can appear in the same file as a NEXUS data or characters block that contains the alignment, although the same filename should then be specified on both the datafname and streefname lines.<br />
*If multiple trees appear in the specified file and multiple search replicates are specified (see '''[[GARLI_Configuration_Settings#searchreps_.28number_of_independent_search_replicates.29|searchreps]]''' setting), then the first tree is used in the first replicate, the second in the second replicate, etc. <br />
*Providing model parameter values: see this page '''[[Specifying_model_parameter_values | Specifying model parameter values]]'''<br />
*See also the FAQ items on model parameters '''[[FAQ#Model_parameters | here]]'''.<br />
<br />
===attachmentspertaxon (control creation of stepwise addition starting tree)===<br />
'''attachmentspertaxon''' = (1 to infinity, '''50''') – The number of attachment branches evaluated for each taxon to be added to the tree during the creation of an ML stepwise-addition starting tree. Briefly, stepwise addition is an algorithm used to make a tree, and involves adding taxa in a random order to a growing tree. For each taxon to be added, a number of randomly chosen attachment branches are tried and scored, and then the best scoring one is chosen as the location of that taxon. The attachmentspertaxon setting controls how many attachment points are evaluated for each taxon to be added. A value of one is equivalent to a completely random tree (only one randomly chosen location is evaluated). A value of greater than 2 times the number of taxa in the dataset means that all attachment points will be evaluated for each taxon, and will result in very good starting trees (but may take a while on large datasets). Even fairly small values (< 10) can result in starting trees that are much, much better than random, but still fairly different from one another. <br />
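To see how this setting affects the amount of work done, the Python sketch below counts how many attachment branches would be scored while building a starting tree. It relies on the fact that an unrooted binary tree with k tips has 2k - 3 branches. That the procedure starts from a three-taxon tree and caps the count in exactly this way are assumptions made for illustration.<br />

```python
def attachments_evaluated(n_taxa, attachmentspertaxon):
    """Count attachment branches scored while building a stepwise-addition tree."""
    total = 0
    # Assumption: begin with the single 3-taxon tree, then add one taxon at a time.
    for k in range(3, n_taxa):
        branches = 2 * k - 3  # branches available in an unrooted binary tree with k tips
        total += min(attachmentspertaxon, branches)
    return total


for setting in (1, 10, 50, 1000):
    print(setting, attachments_evaluated(100, setting))
```

Because the largest intermediate tree has fewer than 2 x (number of taxa) branches, any setting above that value gives the same count as evaluating every attachment point.<br />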
===ofprefix (output filename prefix)===<br />
'''ofprefix''' = (text) – Prefix of all output filenames, such as log, treelog, etc. Change this for each run that you do or the program will overwrite previous results. <br />
===randseed (random number seed)===<br />
'''randseed''' = (-1 or positive integers, '''-1''') – The random number seed used by the random number generator. Specify “–1” to have a seed chosen for you. Specifying the same seed number in multiple runs will give exactly identical results, if all other parameters and settings are also identical. <br />
===availablememory (control maximum program memory usage)===<br />
'''availablememory''' – Typically this is the amount of available physical memory on the system, in megabytes. This lets GARLI determine how much system memory it may be able to use to store computations for reuse. The program will be conservative and use at most about 90% of this value, although for typical datasets it will use much, much less. If other programs must be open or used when GARLI is running, you may need to reduce this value. This setting can also have a significant effect on performance (speed), but more is not always better. When a run is started, GARLI will output the '''availablememory''' value necessary for that dataset to achieve each of the “memory levels” (from best to worst, termed “great”, “good”, “low”, and “very low”). More memory is generally better because more calculations are stored and can be reused, but when the amount of memory needed for “great” becomes more than 512 megabytes or so, performance can be slowed because the operating system has difficulty managing that much memory. In general, choose an amount of memory that allows “great” when this is less than 512 MB, and if it is greater, reduce the amount of memory into “good” or “low” as necessary. Avoid “very low” whenever possible. You can find the value that is approximately optimal for your dataset by setting the '''randseed''' to some <br />
positive value (so that the searches are identical) and doing runs with various '''availablememory''' values. Typically it will make at most a 30% difference, so it isn't worth worrying about too much. <br />
===logevery (frequency to log best score to file)===<br />
'''logevery''' = (1 to infinity, '''10''') – The frequency at which the best score is written to the log file. <br />
===saveevery (frequency to save best tree to file or write checkpoints)===<br />
'''saveevery''' = (1 to infinity, '''100''') – If '''writecheckpoints''' or '''outputcurrentbesttopology''' are specified, this is the frequency (in generations) at which checkpoints or the current best tree are written to file. <br />
===refinestart (whether to optimize a bit before starting a search)===<br />
'''refinestart''' = (0 or 1, '''1''') – Specifies whether some initial rough optimization is performed on the starting branch lengths and rate heterogeneity parameters. This is always recommended. <br />
===outputcurrentbesttopology (continuously write the best tree to file during a run)===<br />
'''outputcurrentbesttopology''' = (0 or 1, '''0''') – If true, the current best tree of the current search replicate is written to <ofprefix>.best.current.tre every '''saveevery''' generations. In versions before 0.96 the current best topology was always written to file, but that is no longer the case. Seeing the current best tree has no real use apart from satisfying your curiosity about how a <br />
run is going.<br />
===outputeachbettertopology (write each improved topology to file)===<br />
'''outputeachbettertopology''' (0 or 1, '''0''') – If true, each new topology encountered with a better score than the previous best is written to file. In some cases this can result in really big files (hundreds of MB) though, especially for random starting topologies on large datasets. Note that this file is interesting to get an idea of how the topology changed as the searches progressed, but the collection of trees should NOT be interpreted in any meaningful way. This option is not available while bootstrapping. <br />
===enforcetermconditions (use automatic termination)===<br />
'''enforcetermconditions''' = (0 or 1, '''1''') – Specifies whether the automatic termination conditions will be used. The conditions specified by both of the following two parameters ('''genthreshfortopoterm''' and '''scorethreshforterm''') must be met; see their entries for definitions. If this is false, the run will continue until it reaches the time ('''stoptime''') or generation ('''stopgen''') limit. It is highly recommended that this option be used! <br />
===genthreshfortopoterm (number of generations without topology improvement required for termination) ===<br />
'''genthreshfortopoterm''' = (1 to infinity, '''20,000''') – This specifies the first part of the termination condition. When no new significantly better scoring topology (see significanttopochange below) has been encountered in more than this number of generations, this condition is met. Increasing this parameter may improve the lnL scores obtained (especially on large datasets), but will also increase runtimes. <br />
===scorethreshforterm (max score improvement over recent generations required for termination)===<br />
'''scorethreshforterm''' = (0 to infinity, '''0.05''') – The second part of the termination condition. When the total improvement in score over the last '''intervallength''' x '''intervalstostore''' generations (default is 500 generations, see below) is less than this value, this condition is met. This does not usually need to be changed. <br />
===significanttopochange (required score improvement for topology to be considered better)===<br />
'''significanttopochange''' = (0 to infinity, '''0.01''') – The lnL increase required for a new topology to be <br />
considered significant as far as the termination condition is concerned. It probably doesn’t <br />
need to be played with, but you might try increasing it slightly if your runs reach a stable <br />
score and then take a very long time to terminate due to very minor changes in topology. <br />
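The way the three termination settings interact can be summarized in a small Python sketch. The function names and the bookkeeping variables passed in are hypothetical, and the exact boundary comparisons are assumptions; the sketch only mirrors the logic described in the entries above.<br />

```python
def is_significant_improvement(lnl_gain, significanttopochange=0.01):
    """A new topology counts toward termination only if its lnL gain is large enough."""
    # Assumption: a gain exactly equal to the threshold counts as significant.
    return lnl_gain >= significanttopochange


def should_terminate(gens_since_significant_topology, recent_score_gain,
                     genthreshfortopoterm=20000, scorethreshforterm=0.05,
                     enforcetermconditions=True):
    """Both conditions must hold before the search stops automatically."""
    if not enforcetermconditions:
        return False  # only the stoptime / stopgen limits apply
    topology_condition = gens_since_significant_topology > genthreshfortopoterm
    score_condition = recent_score_gain < scorethreshforterm
    return topology_condition and score_condition


print(should_terminate(25000, 0.01))  # both conditions met
print(should_terminate(25000, 0.20))  # score still improving
print(should_terminate(5000, 0.01))   # a better topology was found too recently
```

Here recent_score_gain stands for the total improvement in score over the last intervallength x intervalstostore generations.<br />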
===outputphyliptree (write trees to file in Phylip as well as Nexus format)===<br />
'''outputphyliptree''' = (0 or 1, '''0''') – Whether phylip formatted tree files will be output in addition to <br />
the default nexus files for the best tree across all replicates (<ofprefix>.best.phy), the best <br />
tree for each replicate (<ofprefix>.best.all.phy) or, in the case of bootstrapping, the best tree <br />
for each bootstrap replicate (<ofprefix>.boot.phy). <br />
===outputmostlyuselessfiles (output uninteresting files)===<br />
'''outputmostlyuselessfiles''' = (0 or 1, '''0''') – Whether to output three files of little general interest: the <br />
“fate”, “problog” and “swaplog” files. The fate file shows the parentage, mutation types and <br />
scores of every individual in the population during the entire search. The problog shows how <br />
the proportions of the different mutation types changed over the course of the run. The <br />
swaplog shows the number of unique swaps and the number of total swaps on the current <br />
best tree over the course of the run. <br />
===writecheckpoints (write checkpoint files during run)===<br />
'''writecheckpoints''' (0 or 1, '''0''') – Whether to write three files to disk containing all information <br />
about the current state of the population every saveevery generations, with each successive <br />
checkpoint overwriting the previous one. These files can be used to restart a run at the last <br />
written checkpoint by setting the '''restart''' configuration entry.<br />
===restart (restart run from checkpoint)===<br />
'''restart''' = (0 or 1, '''0''') – Whether to restart at a previously saved checkpoint. To use this option the '''writecheckpoints''' option must have been used during a previous run. The program will look for checkpoint files that are named based on the ofprefix of the previous run. If you intend to restart a run, NOTHING should be changed in the config file except setting '''restart''' to 1. <br />
A run that is restarted from checkpoint will give ''exactly'' the same results it would have if the run had gone to completion. <br />
===outgroup (orient inferred trees consistently)===<br />
'''outgroup''' = (outgroup taxon numbers, separated by spaces) – This option allows for orienting the tree topologies in a consistent way when they are written to file. Note that this has NO effect whatsoever on the actual inference and the specified outgroup is NOT constrained to be present in the inferred trees. If multiple outgroup taxa are specified and they do not form a monophyletic group in the inferred tree, this setting will be ignored. If you specify a single outgroup taxon it will always be present, and the tree will always be consistently oriented. Ranges can be indicated with a hyphen, e.g., to specify an outgroup consisting of taxa 1, 2, 3 and 5 the format is this: <br />
outgroup = 1-3 5<br />
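As a quick check of how a range specification expands, this Python sketch parses an outgroup value into a list of taxon numbers. The parser is a hypothetical illustration of the format, not GARLI's own code.<br />

```python
def parse_outgroup(value):
    """Expand an outgroup string such as '1-3 5' into a list of taxon numbers."""
    taxa = []
    for token in value.split():
        if "-" in token:
            first, last = (int(part) for part in token.split("-"))
            taxa.extend(range(first, last + 1))
        else:
            taxa.append(int(token))
    return taxa


print(parse_outgroup("1-3 5"))  # [1, 2, 3, 5]
```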
<br />
===searchreps (number of independent search replicates)===<br />
'''searchreps''' = (1 to infinity, '''2''') – The number of independent search replicates to perform during a program execution. You should always either do multiple search replicates or multiple program executions with any dataset to get a feel for whether you are getting consistent results, which suggests that the program is doing a decent job of searching. Note that if this is > 1 and you are performing a bootstrap analysis, this is the number of search replicates to be done per bootstrap replicate. That can increase the chance of finding the best tree per bootstrap replicate, but will also increase bootstrap runtimes enormously.<br />
===bootstrapreps (number of bootstrap replicates)===<br />
bootstrapreps (0 to infinity, '''0''') - The number of bootstrap reps to perform. If this is greater than <br />
0, normal searching will not be performed. The resulting bootstrap trees (one per rep) will be <br />
output to a file named <ofprefix>.boot.tre. To obtain the bootstrap proportions they will then <br />
need to be read into PAUP* or a similar program to obtain a majority rule consensus. Note <br />
that it is probably safe to reduce the strictness of the termination conditions during <br />
bootstrapping (perhaps halve genthreshfortopoterm), which will greatly speed up the <br />
bootstrapping process with negligible effects on the results.<br />
<br />
Note that GARLI does NOT do the bootstrap consensus or calculate the bootstrap proportions itself. It simply infers the trees that are the input to that consensus. The trees in the .boot.tre file will need to be read into a program such as SumTrees (or PAUP* or CONSENSE) to do a majority-rule consensus and obtain your bootstrap support values. See the following part of the "Advanced topics" page for suggestions on making the consensus: '''[[Advanced_topics#Detailed_Example:_A_bootstrap_analysis|Detailed Example: A bootstrap analysis]]'''.<br />
<br />
===resampleproportion (relative size of re-sampled data matrix)===<br />
resampleproportion (0.1 to 10, '''1.0''' ) – When bootstrapreps > 0, this setting allows for <br />
bootstrap-like resampling, but with the pseudoreplicate datasets having the number of <br />
alignment columns different from the real data. Setting values < 1.0 is somewhat similar to jackknifing, but not identical. <br />
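The idea of resampling to a different number of columns can be sketched as follows in Python. The function name is hypothetical, and rounding the pseudoreplicate length to the nearest whole column is an assumption; the text does not state how GARLI rounds.<br />

```python
import random


def resample_columns(n_columns, resampleproportion=1.0, rng=random):
    """Draw column indices with replacement for one pseudoreplicate dataset."""
    # Assumption: the pseudoreplicate length is the rounded product, and at least 1.
    n_draws = max(1, round(n_columns * resampleproportion))
    return [rng.randrange(n_columns) for _ in range(n_draws)]


rng = random.Random(42)
for proportion in (0.5, 1.0, 2.0):
    columns = resample_columns(1000, proportion, rng)
    print(proportion, len(columns), len(set(columns)))
```

Because columns are still drawn with replacement, a proportion below 1.0 yields a shorter dataset that can contain repeated columns, which is one way it differs from a true jackknife.<br />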
===inferinternalstateprobs (infer ancestral states)===<br />
inferinternalstateprobs = (0 or 1, '''0''') – Specify 1 to have GARLI infer the marginal posterior <br />
probability of each character at each internal node. This is done at the very end of the run, <br />
just before termination. The results are output to a file named <ofprefix>.internalstates.log.<br />
<br />
===outputsitelikelihoods (write a file with the log-likelihood of each site)===<br />
<br />
'''outputsitelikelihoods''' = (0 or 1, '''0''') - Causes a file named <ofprefix>.sitelikes.log to be created containing the sitewise log-likelihood values for each individual site for the final tree of each search replicate. Note that the format of the file is exactly the same as that output by PAUP, so it can be directly used in CONSEL using the PAUP file-format option. For partitioned models the sites are listed in sets by partition subset, but the site numbers will match those in the original datamatrix. Thus, when using a partitioned model that divides sites by codon position, the ordering would be 1, 4, 7, ... 2, 5, 8, ... 3, 6, 9 .... Note that this makes no difference to CONSEL, which just assumes that the site likelihoods are ordered the same for each tree and ignores the site numbers. <br />
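For a model partitioned by codon position, the order in which sites appear in the file can be reproduced with a one-line Python helper. The function name is hypothetical, and the sketch assumes the three subsets are simply first, second and third codon positions.<br />

```python
def codon_position_order(n_sites):
    """Site numbers grouped by codon position: 1, 4, 7, ... then 2, 5, 8, ... then 3, 6, 9, ..."""
    return [site for start in (1, 2, 3) for site in range(start, n_sites + 1, 3)]


print(codon_position_order(9))  # [1, 4, 7, 2, 5, 8, 3, 6, 9]
```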
<br />
Also note that using the site likelihoods from a partitioned model is a violation of CONSEL's assumptions, since it will not know about the partitions and its internal bootstrap resampling will not obey them. It isn't clear what the effects of this will be on the various tests. <br />
<br />
Finally, if you want to calculate the site likelihoods for a set of trees that you provide, use the '''[[GARLI_Configuration_Settings#optimizeinputonly_.28do_not_search.2C_only_optimize_model_and_branch_lengths_on_user_trees.29 | optimizeinputonly]]''' setting.<br />
<br />
===optimizeinputonly (do not search, only optimize model and branch lengths on user trees)===<br />
(new in version 2.0)<br />
<br />
'''optimizeinputonly''' = (0 or 1, '''0''') - Requires a NEXUS formatted input tree file (only NEXUS format will work!), specified through the '''[[GARLI_Configuration_Settings#streefname_.28source_of_starting_tree_and.2For_model.29 | streefname]]''' setting. All trees in that file will have model parameters and branch lengths optimized, giving their maximum likelihood scores. A file with the site-likelihoods of each tree will also be output. See the '''[[GARLI_Configuration_Settings#outputsitelikelihoods_.28write_a_file_with_the_log-likelihood_of_each_site.29 | outputsitelikelihoods]]''' setting for details.<br />
<br />
===collapsebranches (collapse zero length branches before writing final trees to file)===<br />
<br />
'''collapsebranches''' = (0 or 1, '''1''') - Before version 1.0, all trees that were returned were fully resolved. This was true even if the maximum-likelihood estimates of some internal branch lengths were effectively zero (or GARLI's minimum, which is 1e-8). In such cases, collapsing the branch into a polytomy would be a better representation. Note that GARLI will never return a tree with an actual branch length of zero, but rather with its minimum value of 1.0e-8. The drawback of always returning fully resolved trees is that what is effectively a polytomy can be resolved in three ways, and different independent searches may randomly return one of those resolutions. Thus, if you compare the trees by topology only, they will look different. If you pay attention to the branch lengths and likelihood scores of the trees it will be apparent that they are effectively the same.<br />
<br />
I think that collapsing of branches is particularly important when bootstrapping, since no support should be given to a branch that doesn't really exist, i.e., that is a random resolution of a polytomy. Collapsing is also good when calculating tree to tree distances such as the symmetric tree distance, for example when calculating phylogenetic error to a known target tree. Zero-length branches would add to the distances (~error) although they really should not.<br />
<br />
==Model specification settings==<br />
<br />
With version 1.0 and later there are now many more options dealing with model specification because of <br />
the inclusion of amino acid and codon-based models. The description of the settings will be <br />
broken up by data type. Note that in terms of the model settings in GARLI, “empirical” means <br />
to fix parameter values at those observed in the dataset being analyzed, and “fixed” means to fix <br />
them at user specified values. See the '''[[GARLI_Configuration_Settings#streefname_.28source_of_starting_tree_and.2For_model.29|streefname]]''' setting for details on how to provide <br />
parameter values to be fixed during inference.<br />
<br />
'''PARTITIONED MODELS: To use partitioned models you'll still need to use the same basic model settings detailed below for each individual partition subset, and then some other details to set up the partition model itself. Be sure that you are familiar with the rest of this section, then see [[Using_partitioned_models | Using partitioned models]].'''<br />
<br />
===datatype (sequence type and inference model)=== <br />
'''datatype''' = ('''nucleotide''', aminoacid, codon-aminoacid, codon, standard, standardordered, standardvariable, standardvariableordered) – The type of data and model that is <br />
to be used during tree inference. Nucleotide and amino acid data are self-explanatory.<br />
<br />
The codon-aminoacid datatype means that the data will be <br />
supplied as a nucleotide alignment, but will be internally translated and analyzed using an <br />
amino acid model. The codon and codon-aminoacid datatypes require nucleotide sequence <br />
that is aligned in the correct reading frame. In other words, all gaps in the alignment should <br />
be a multiple of 3 in length, and the alignment should start at the first position of a codon. If <br />
the alignment has extra columns at the start, middle or end, they should be removed or <br />
excluded with a Nexus exset (see '''[[FAQ#Can_I_specify_alignment_columns_of_my_data_matrix_to_be_excluded.3F | this FAQ item]]''' for an example of exset usage). The correct <br />
'''[[GARLI_Configuration_Settings#geneticcode_.28code_to_use_in_codon_translation.29 | geneticcode]]''' must also be set.<br />
<br />
(New in Version 2.0)<br />
<br />
The various "standard" datatypes are new in GARLI 2.0. These represent morphology-like discrete characters, with any number of states. These are also termed the "Mk" and "Mkv" models by Lewis (2001). See this page for more details on using standard data: '''[[Mkv morphology model]]'''.<br />
<br />
===Settings for datatype = nucleotide=== <br />
====ratematrix (relative rate parameters assumed by substitution model)====<br />
'''ratematrix''' = (1rate, 2rate, '''6rate''', fixed, custom string) – The number of relative substitution rate <br />
parameters (note that the number of free parameters is this value minus one). Equivalent to <br />
the “nst” setting in PAUP* and MrBayes. 1rate assumes that substitutions between all pairs <br />
of nucleotides occur at the same rate (JC model), 2rate allows different rates for transitions and <br />
transversions (K2P or HKY models), and 6rate allows a different rate between each nucleotide pair (GTR). These rates are <br />
estimated unless the fixed option is chosen. Since version 0.96, parameters for any <br />
submodel of the GTR model may be estimated. The format for specifying this is very <br />
similar to that used in the “rclass” setting of PAUP*. Within parentheses, six letters are <br />
specified, with spaces between them. The six letters represent the rates of substitution <br />
between the six pairs of nucleotides, with the order being A-C, A-G, A-T, C-G, C-T and G-T. <br />
Letters within the parentheses that are the same mean that a single parameter is shared by <br />
multiple nucleotide pairs. For example, <br />
ratematrix = (a b a a b a) <br />
would specify the HKY 2-rate model (equivalent to ratematrix = 2rate). This entry, <br />
ratematrix = (a b c c b a) <br />
would specify 3 estimated rates of substitution, with one rate shared by A-C and G-T <br />
substitutions, another rate shared by A-G and C-T substitutions, and the final rate shared by <br />
A-T and C-G substitutions.<br />
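As a minimal sketch of how a custom ratematrix string maps onto nucleotide pairs, the following Python snippet groups the six pairs by their letters. The function name parse_ratematrix is hypothetical and is not part of GARLI; only the pair order (A-C, A-G, A-T, C-G, C-T, G-T) is taken from the description above.<br />

```python
# Pair order used by the ratematrix string, as described in the text above.
PAIRS = ["A-C", "A-G", "A-T", "C-G", "C-T", "G-T"]

def parse_ratematrix(spec):
    """Group the six nucleotide pairs by the letter assigned to each.

    Hypothetical helper for illustration only; not GARLI code.
    """
    letters = spec.strip().strip("()").split()
    if len(letters) != 6:
        raise ValueError("expected six letters, got %d" % len(letters))
    groups = {}
    for pair, letter in zip(PAIRS, letters):
        groups.setdefault(letter, []).append(pair)
    return groups

groups = parse_ratematrix("(a b c c b a)")
for letter, pairs in groups.items():
    print(letter, pairs)
# a ['A-C', 'G-T']
# b ['A-G', 'C-T']
# c ['A-T', 'C-G']
print("rate classes:", len(groups))  # rate classes: 3
```

The number of distinct letters is the number of rate classes, so the number of free relative rate parameters is that count minus one.<br />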
<br />
====statefrequencies (equilibrium base frequencies assumed by substitution model)====<br />
'''statefrequencies''' = (equal, empirical, '''estimate''', fixed) – Specifies how the equilibrium state <br />
frequencies (A, C, G and T) are treated. The empirical setting fixes the frequencies at their <br />
observed proportions, and the other options should be self-explanatory. <br />
===For datatype = nucleotide or aminoacid===<br />
====invariantsites (treatment of proportion of invariable sites parameter)====<br />
'''invariantsites''' = (none, '''estimate''', fixed) – Specifies whether a parameter representing the <br />
proportion of sites that are unable to change (i.e. have a substitution rate of zero) will be <br />
included. This is typically referred to as “invariant sites”, but would better be termed <br />
“invariable sites”. <br />
====ratehetmodel (type of rate heterogeneity to assume for variable sites)====<br />
'''ratehetmodel''' = (none, '''gamma''', gammafixed) – The model of rate heterogeneity assumed. <br />
“gammafixed” requires that the alpha shape parameter is provided, and a setting of “gamma” <br />
estimates it. <br />
====numratecats (number of overall substitution rate categories)==== <br />
'''numratecats''' = (1 to 20, '''4''') – The number of categories of variable rates (not including the <br />
invariant site class if it is being used). Must be set to 1 if '''ratehetmodel''' is set to none. Note <br />
that runtimes and memory usage scale linearly with this setting. <br />
===For datatype = aminoacid or codon-aminoacid===<br />
Amino acid analyses are typically done using fixed rate matrices that have been estimated on <br />
large datasets and published. Typically the only model parameters that are estimated during tree <br />
inference relate to the rate heterogeneity distribution. Each of the named matrices also has <br />
corresponding fixed amino acid frequencies, and a given rate matrix can either be used with those <br />
frequencies or with the amino acid frequencies observed in your dataset. This second option is <br />
often denoted as “+F” in a model description, although in terms of the GARLI configuration <br />
settings this is referred to as “empirical” frequencies. In GARLI the Dayhoff model would be <br />
specified by setting both the ratematrix and statefrequencies options to “dayhoff”. The <br />
Dayhoff+F model would be specified by setting the ratematrix to “dayhoff”, and <br />
statefrequencies to “empirical”.<br />
<br />
The following named amino acid models are implemented: <br />
{| border="1"<br />
|-<br />
!'''ratematrix'''/'''statefrequencies''' setting !! reference <br />
|-<br />
|align="center" | dayhoff || Dayhoff, Schwartz and Orcutt, 1978.<br />
|-<br />
|align="center" | jones || Jones, Taylor and Thornton (JTT), 1992. <br />
|-<br />
|align="center" | WAG || Whelan and Goldman, 2001.<br />
|-<br />
|align="center" | mtREV || Adachi and Hasegawa, 1996.<br />
|-<br />
|align="center" | mtmam || Yang, Nielsen and Hasegawa, 1998.<br />
|}<br />
Note that most programs allow either the use of a named rate matrix and its corresponding state <br />
frequencies, or a named rate matrix and empirical frequencies. GARLI technically allows the <br />
mixing of different named matrices and equilibrium frequencies (for example, wag matrix with <br />
jones equilibrium frequencies), but this is not recommended. <br />
====ratematrix (amino acid substitution rates)====<br />
'''ratematrix''' = (poisson, jones, dayhoff, wag, mtmam, mtrev) – The fixed amino acid rate matrix to <br />
use. You should use the matrix that gives the best likelihood, and could use a program like <br />
PROTTEST (very much like MODELTEST, but for amino acid models) to determine which <br />
fits best for your data. Poisson assumes a single rate of substitution between all amino acid <br />
pairs, and is a very poor model.<br />
====statefrequencies (equilibrium base frequencies assumed by substitution model)====<br />
'''statefrequencies''' = (equal, empirical, estimate, fixed, jones, dayhoff, wag, mtmam, mtrev) – <br />
Specifies how the equilibrium state frequencies of the 20 amino acids are treated. The <br />
“empirical” option fixes the frequencies at their observed proportions (when describing a <br />
model this is often termed “+F”). <br />
===For datatype = codon===<br />
The codon models are built with three components: (1) parameters describing the process of <br />
individual nucleotide substitutions, (2) equilibrium codon frequencies, and (3) parameters <br />
describing the relative rate of nonsynonymous to synonymous substitutions. The nucleotide <br />
substitution parameters within the codon models are exactly the same as those possible with <br />
standard nucleotide models in GARLI, and are specified with the ratematrix configuration <br />
entry. Thus, they can be of the 2rate variety (inferring different rates for transitions and <br />
transversions, K2P or HKY-like), the 6rate variety (inferring different rates for all nucleotide <br />
pairs, GTR-like) or any other sub-model of GTR. The options for codon frequencies are <br />
specified with the statefrequencies configuration entry. The options are to use equal <br />
frequencies (not a good option), the frequencies observed in your dataset (termed “empirical” in <br />
GARLI), or the codon frequencies implied by the “F1x4” or “F3x4” methods (using PAML's <br />
terminology). These last two options calculate the codon frequencies as the product of the <br />
frequencies of the three nucleotides that make up each codon. In the “F1x4” case the nucleotide <br />
frequencies are those observed in the dataset across all codon positions, while the “F3x4” option <br />
uses the nucleotide frequencies observed in the data at each codon position separately. The final <br />
component of the codon models is the nonsynonymous to synonymous relative rate parameters <br />
(aka dN/dS or omega parameters). The default is to infer a single dN/dS value. Alternatively, a <br />
model can be specified that infers a given number of dN/dS categories, with the dN/dS values <br />
and proportions falling in each category estimated (ratehetmodel = nonsynonymous). This is <br />
the “discrete” or “M3” model in PAML's terminology. <br />
====ratematrix (relative ''nucleotide'' rate parameters assumed by codon model)====<br />
'''ratematrix''' = (1rate, '''2rate''', 6rate, fixed, custom string) – This determines the relative rates of <br />
nucleotide substitution assumed by the codon model. The options are exactly the same as <br />
those allowed under a normal nucleotide model. A codon model with '''ratematrix''' = 2rate <br />
specifies the standard Goldman and Yang (1994) model, with different substitution rates for <br />
transitions and transversions. <br />
====statefrequencies (equilibrium codon frequencies)====<br />
'''statefrequencies''' = (equal, empirical, f1x4, '''f3x4''') - The options are to use equal codon frequencies <br />
(not a good option), the frequencies observed in your dataset (termed “empirical” in GARLI), <br />
or the codon frequencies implied by the “F1x4” or “F3x4” methods (using PAML's <br />
terminology). These last two options calculate the codon frequencies as the product of the <br />
frequencies of the three nucleotides that make up each codon. In the “F1x4” case the <br />
nucleotide frequencies are those observed in the dataset across all codon positions, while the <br />
“F3x4” option uses the nucleotide frequencies observed in the data at each codon position <br />
separately. <br />
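The difference between the two product-based options can be illustrated with a short Python sketch. This is only an illustration of the arithmetic described above, not GARLI's implementation: the function names are hypothetical, the input is assumed to be in-frame sequences containing only A, C, G and T, and stop codons are not excluded or renormalized here, which real codon models would need to handle.<br />

```python
from collections import Counter
from itertools import product

BASES = "ACGT"

def base_freqs(bases):
    """Observed proportions of A, C, G and T in a string of bases."""
    counts = Counter(b for b in bases if b in BASES)
    total = sum(counts.values())
    return {b: counts[b] / total for b in BASES}

def codon_freqs(seqs, method="f3x4"):
    """Codon frequencies as products of nucleotide frequencies.

    f1x4: one set of nucleotide frequencies pooled over all positions.
    f3x4: a separate set for each of the three codon positions.
    Illustrative only; all 64 codons (including stops) are kept.
    """
    if method == "f1x4":
        pooled = base_freqs("".join(seqs))
        pos = [pooled, pooled, pooled]
    else:
        pos = [base_freqs("".join(s[i::3] for s in seqs)) for i in range(3)]
    return {"".join(c): pos[0][c[0]] * pos[1][c[1]] * pos[2][c[2]]
            for c in product(BASES, repeat=3)}

seqs = ["ATGGCC", "ATGGCA"]  # toy in-frame alignment
print(codon_freqs(seqs, "f1x4")["ATG"])
print(codon_freqs(seqs, "f3x4")["ATG"])  # 0.125
```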
====ratehetmodel (variation in dN/dS across sites)==== <br />
'''ratehetmodel''' = ('''none''', nonsynonymous) – For codon models, the default is to infer a single dN/dS <br />
parameter. Alternatively, a model can be specified that infers a given number of dN/dS <br />
categories, with the dN/dS values and proportions falling in each category estimated <br />
('''ratehetmodel''' = nonsynonymous). This is the “discrete” or “M3” model of Yang et al. <br />
(2000). <br />
====numratecats (number of discrete dN/dS categories)====<br />
'''numratecats''' = (1 to 20, '''1''') – When ratehetmodel = nonsynonymous, this is the number of dN/dS parameter <br />
categories.<br />
====invariantsites====<br />
'''invariantsites''' = (none) - '''NOTE:''' Due to an error on my part, the '''invariantsites''' entry must appear with codon models for them to run (despite the fact that it doesn't apply). i.e., be sure that this appears:<br />
invariantsites = none.<br />
<br />
===For datatype = codon or codon-aminoacid=== <br />
====geneticcode (code to use in codon translation)====<br />
'''geneticcode''' = ('''standard''', vertmito, invertmito) – The genetic code to be used in translating codons <br />
into amino acids.<br />
<br />
==Population settings==<br />
===nindivs (number of individuals in population)===<br />
'''nindivs''' = (2 to 100, '''4''')- The number of individuals in the population. This may be increased, but <br />
doing so is generally not beneficial. Note that typical genetic algorithms tend to have much, <br />
much larger population sizes than GARLI's defaults. <br />
===holdover (unmutated copies of best individual)===<br />
'''holdover''' = (1 to nindivs-1, '''1''')- The number of times the best individual is copied to the next <br />
generation with no chance of mutation. It is best not to mess with this. <br />
===selectionintensity (strength of selection)===<br />
'''selectionintensity''' = (0.01 to 5.0, '''0.5''')- Controls the strength of selection, with larger numbers <br />
denoting stronger selection. The relative probability of reproduction of two individuals <br />
depends on the difference in their log likelihoods (&Delta;lnL) and is formulated very similarly to <br />
the procedure of calculating Akaike weights. The relative probability of reproduction of the <br />
less fit individual is equal to:<br />
<br />
<big><big><br />
''e''<sup> (-selectionIntensity &times; &Delta; lnL)</sup><br />
</big></big><br />
<br />
In general, this setting does not seem to have much of an effect on the progress of a run. In <br />
theory higher values should cause scores to increase more quickly, but make the search more <br />
likely to be entrapped in a local optimum. Low values will increase runtimes, but may be <br />
more likely to reach the true optimum. The following table gives the relative probabilities of <br />
reproduction for different values of the selection intensity when the difference in log <br />
likelihood is 1.0:<br />
{| border="1"<br />
|-<br />
!'''selectionintensity''' value !! Ratio of probabilities of reproduction <br />
|-<br />
|align="center" |0.05 || align="center" |0.95:1.0 <br />
|-<br />
|align="center" |0.1 || align="center" |0.90:1.0 <br />
|-<br />
|align="center" |0.25 || align="center" |0.78:1.0 <br />
|-<br />
|align="center" |0.5 || align="center" |0.61:1.0 <br />
|-<br />
|align="center" |0.75 || align="center" |0.47:1.0 <br />
|-<br />
|align="center" |1.0 || align="center" |0.37:1.0 <br />
|-<br />
|align="center" |2.0 || align="center" |0.14:1.0<br />
|}<br />
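The values in the table follow directly from the formula above. A minimal Python sketch that reproduces them (the function name is hypothetical and this is not GARLI code):<br />

```python
import math

def relative_reproduction_prob(selection_intensity, delta_lnl):
    """Relative probability of reproduction of the less fit individual."""
    return math.exp(-selection_intensity * delta_lnl)

# Reproduce the table for a log-likelihood difference of 1.0.
for s in (0.05, 0.1, 0.25, 0.5, 0.75, 1.0, 2.0):
    print("%.2f  %.2f:1.0" % (s, relative_reproduction_prob(s, 1.0)))
```

Because the log-likelihood difference enters the exponent multiplicatively, doubling the difference has the same effect as doubling selectionintensity.<br />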
===holdoverpenalty (fitness handicap for best individual)===<br />
'''holdoverpenalty''' = (0 to 100, '''0''') – This can be used to bias the probability of reproduction of the <br />
best individual downward. Because the best individual is automatically copied into the next <br />
generation, it has a bit of an unfair advantage and can cause all population variation to be lost <br />
due to genetic drift, especially with small population sizes. The value specified here is <br />
subtracted from the best individual’s lnL score before calculating the probabilities of<br />
reproduction. It seems plausible that this might help maintain variation, but I have not seen it <br />
cause a measurable effect.<br />
<br />
===stopgen (maximum number of generations to run)===<br />
'''stopgen''' – The maximum number of generations to run. Note that this supersedes the automated <br />
stopping criterion (see enforcetermconditions above), and should therefore be set to a very <br />
large value if automatic termination is desired. <br />
===stoptime (maximum time to run)===<br />
'''stoptime''' – The maximum number of seconds for the run to continue. Note that this supersedes <br />
the automated stopping criterion (see enforcetermconditions above), and should therefore <br />
be set to a very large value if automatic termination is desired.<br />
<br />
==Branch-length optimization settings== <br />
After a topological rearrangement, branch lengths in the vicinity of the rearrangement are <br />
optimized by the Newton-Raphson method. Optimization passes are performed on a particular <br />
branch until the expected improvement in likelihood for the next pass is less than a threshold <br />
value, termed the optimization precision. Note that this name is somewhat misleading, as the <br />
precision of the optimization algorithm is inversely related to this value (i.e., smaller values of <br />
the optimization precision lead to more precise optimization). If the improvement in likelihood <br />
due to optimization for a particular branch is greater than the optimization precision, <br />
optimization is also attempted on adjacent branches, spreading out across the tree. When no new <br />
topology with a better likelihood score is discovered for a while, the value is automatically <br />
reduced. The value can have a large effect on speed, with smaller values significantly slowing <br />
down the algorithm. The value of the optimization precision and how it changes over the course <br />
of a run are determined by the following three parameters. <br />
===startoptprec===<br />
startoptprec (0.005 to 5.0, '''0.5''') – The beginning optimization precision. <br />
===minoptprec===<br />
minoptprec (0.001 to startoptprec, '''0.01''') – The minimum allowed value of the optimization precision.<br />
<br />
===numberofprecreductions===<br />
numberofprecreductions (0 to 100, '''10''') – Specify the number of steps that it will take for the <br />
optimization precision to decrease (linearly) from startoptprec to minoptprec. <br />
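A rough sketch of the linear schedule implied by these three settings is given below. It assumes equal-sized steps from startoptprec down to minoptprec, and that a value of 0 for numberofprecreductions leaves the precision at its starting value; both are assumptions made for illustration, and the function name is hypothetical.<br />

```python
def optprec_schedule(startoptprec=0.5, minoptprec=0.01, numberofprecreductions=10):
    """Optimization precision values visited under a linear decrease.

    Assumes equal-sized steps; not taken from the GARLI source code.
    """
    if numberofprecreductions == 0:
        return [startoptprec]
    step = (startoptprec - minoptprec) / numberofprecreductions
    return [startoptprec - i * step for i in range(numberofprecreductions + 1)]

for value in optprec_schedule():
    print("%.3f" % value)
```

With the default values each reduction lowers the precision by 0.049, from 0.5 down to 0.01.<br />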
===treerejectionthreshold===<br />
treerejectionthreshold (0 to 500, '''50''') – This setting controls which trees have more extensive <br />
branch-length optimization applied to them. All trees created by a branch swap receive <br />
optimization on a few branches that directly took part in the rearrangement. If the difference <br />
in score between the partially optimized tree and the best known tree is greater than treerejectionthreshold, no further optimization is applied to the branches of that tree. <br />
Reducing this value can significantly reduce runtimes, often with little or no effect on results. <br />
However, it is possible that a better tree could be missed if this is set too low. In cases in <br />
which obtaining the very best tree per search is not critical (e.g., bootstrapping), setting this <br />
lower (~20) is probably safe.<br />
<br />
==Settings controlling the proportions of the mutation types== <br />
Each mutation type is assigned a prior ''weight''. These values determine the expected <br />
proportions of the various mutation types that are performed. The primary mutation categories <br />
are ''topology'' (t), ''model'' (m) and ''branch length'' (b). Each is assigned a prior weight ( P<sub>i</sub> ) in the <br />
config file. Each time that a new best likelihood score is attained, the amount of the increase in <br />
score is credited to the mutation type responsible, with the sum of the increases ( S<sub>i</sub> ) maintained <br />
over the last intervallength x intervalstostore generations. The number of times that each <br />
mutation is performed ( N<sub>i</sub> ) is also tallied. The total weight of a mutation type is W<sub>i</sub> = P<sub>i</sub> + ( S<sub>i</sub> / N<sub>i</sub> ). <br />
The proportion of mutations of type i out of all mutations is then <br />
<br />
<big><big><br />
Pr(i) = W<sub>i</sub> / <br />
(W<sub>t</sub> + W<sub>m</sub> + W<sub>b</sub>) <br />
</big></big><br />
<br />
The proportion of each mutation is thus related to its prior weight and the average increase in <br />
score that it has caused over recent generations. The prior weights can be used to control the <br />
expected (and starting) proportions of the mutation types, as well as how sensitive the <br />
proportions are to the course of events in a run. It is generally a good idea to make the topology <br />
prior much larger than the others so that when no mutations are improving the score many <br />
topology mutations are still attempted. If you set '''outputmostlyuselessfiles''' to 1, you can look at the “problog” file to determine what the <br />
proportions of the mutations actually were over the course of a run. <br />
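As a worked example of the weighting scheme described above, the following Python sketch computes the mutation proportions from the prior weights, the summed score increases and the mutation counts. The function and argument names are hypothetical, and treating a mutation type that has not yet been performed (N = 0) as having an average gain of zero is an assumption.<br />

```python
def mutation_proportions(prior, score_gain, count):
    """Pr(i) = W_i / sum(W), with W_i = P_i + S_i / N_i.

    prior, score_gain and count are dicts keyed by mutation type.
    Types with count 0 are assumed to have an average gain of 0.
    """
    weights = {}
    for kind in prior:
        avg_gain = score_gain[kind] / count[kind] if count[kind] else 0.0
        weights[kind] = prior[kind] + avg_gain
    total = sum(weights.values())
    return {kind: w / total for kind, w in weights.items()}

# Default prior weights, before any mutation has improved the score.
prior = {"topology": 1.0, "model": 0.05, "brlen": 0.2}
zero = {"topology": 0, "model": 0, "brlen": 0}
print(mutation_proportions(prior, zero, zero))

# After some branch-length mutations have paid off.
gains = {"topology": 0.0, "model": 0.0, "brlen": 25.0}
counts = {"topology": 400, "model": 20, "brlen": 80}
print(mutation_proportions(prior, gains, counts))
```

With the default priors and no recorded gains, about 80% of mutations are topological, which illustrates why a large topology prior keeps topology mutations frequent when nothing is improving the score.<br />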
===topoweight (weight on topology mutations)===<br />
topoweight (0 to infinity, '''1.0''') – The prior weight assigned to the class of topology mutations <br />
(NNI, SPR and limSPR). Note that setting this to 0.0 turns off topology mutations, meaning that the tree topology is fixed for the run. This used to be a way to have the program estimate only model parameters and branch-lengths, but the '''[[GARLI_Configuration_Settings#optimizeinputonly_.28do_not_search.2C_only_optimize_model_and_branch_lengths_on_user_trees.29 | optimizeinputonly]]''' setting is now a better way to go.<br />
<br />
===modweight (weight on model parameter mutations)===<br />
modweight (0 to infinity, '''0.05''') – The prior weight assigned to the class of model mutations. Note <br />
that setting this at 0.0 fixes the model during the run. <br />
===brlenweight (weight on branch-length parameter mutations)===<br />
brlenweight (0 to infinity, '''0.2''') – The prior weight assigned to branch-length mutations.<br />
<br />
The same procedure used above to determine the proportion of Topology:Model:Branch-Length <br />
mutations is also used to determine the relative proportions of the three types of topological <br />
mutations (NNI:SPR:limSPR), controlled by the following three weights. Note that the <br />
proportion of mutations applied to each of the model parameters is not user controlled. <br />
===randnniweight (weight on NNI topology changes)===<br />
randnniweight (0 to infinity, '''0.1''') - The prior weight assigned to NNI mutations. <br />
===randsprweight (weight on SPR topology changes)===<br />
randsprweight (0 to infinity, '''0.3''') - The prior weight assigned to random SPR mutations. For <br />
very large datasets it is often best to set this to 0.0, as random SPR mutations essentially <br />
never result in score increases. <br />
===limsprweight (weight on localized SPR topology changes)===<br />
limsprweight (0 to infinity, '''0.6''') - The prior weight assigned to SPR mutations with the <br />
reconnection branch limited to being a maximum of limsprrange branches away from where <br />
the branch was detached. <br />
===intervallength===<br />
intervallength (10 to 1000, '''100''') – The number of generations in each interval during which the <br />
number and benefit of each mutation type are stored. <br />
===intervalstostore===<br />
intervalstostore = (1 to 10, '''5''') – The number of intervals to be stored. Thus, records of <br />
mutations are kept for the last (intervallength x intervalstostore) generations. Every <br />
intervallength generations the probabilities of the mutation types are updated by the scheme <br />
described above.<br />
<br />
==Settings controlling mutation details==<br />
===limsprrange (max range for localized SPR topology changes)===<br />
limsprrange (0 to infinity, '''6''') – The maximum number of branches away from its original <br />
location that a branch may be reattached during a limited SPR move. Setting this too high (> <br />
10) can seriously degrade performance, but if you do so in conjunction with a large increase in genthreshfortopoterm you might end up with better trees.<br />
===meanbrlenmuts (mean # of branch lengths to change per mutation)===<br />
meanbrlenmuts (1 to # taxa, '''5''') - The mean of the binomial distribution from which the number <br />
of branch lengths mutated is drawn during a branch length mutation. <br />
===gammashapebrlen (magnitude of branch-length mutations)===<br />
gammashapebrlen (50 to 2000, '''1000''') - The shape parameter of the gamma distribution (with a <br />
mean of 1.0) from which the branch-length multipliers are drawn for branch-length <br />
mutations. Larger numbers cause smaller changes in branch lengths. (Note that this has <br />
nothing to do with gamma rate heterogeneity.) <br />
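To see why larger shape values give smaller changes, note that a gamma distribution with shape k and mean 1.0 has a standard deviation of 1/sqrt(k). The Python sketch below draws such multipliers with the standard library; the function names are hypothetical and this is not GARLI's own random number code.<br />

```python
import math
import random

def draw_multiplier(shape, rng=random):
    """Draw a multiplier from a gamma distribution with mean 1.0.

    random.gammavariate(alpha, beta) has mean alpha * beta,
    so a scale of 1 / shape gives a mean of 1.0.
    """
    return rng.gammavariate(shape, 1.0 / shape)

def multiplier_sd(shape):
    """Standard deviation of a mean-1.0 gamma with the given shape."""
    return 1.0 / math.sqrt(shape)

for shape in (50, 1000, 2000):
    print(shape, "sd = %.4f" % multiplier_sd(shape), "draw = %.4f" % draw_multiplier(shape))
```

At the default of 1000 a typical multiplier changes a value by roughly 3%, while at the minimum of 50 the typical change is about 14%.<br />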
===gammashapemodel (magnitude of model parameter mutations)===<br />
gammashapemodel (50 to 2000, '''1000''') - The shape parameter of the gamma distribution (with a <br />
mean of 1.0) from which the model mutation multipliers are drawn for model parameter <br />
mutations. Larger numbers cause smaller changes in model parameters. (Note that this has <br />
nothing to do with gamma rate heterogeneity.) <br />
===uniqueswapbias (relative weight assigned to already attempted branch swaps)===<br />
uniqueswapbias (0.01 to 1.0, '''0.1''') – With version 0.95 and later, GARLI keeps track of which branch <br />
swaps it has attempted on the current best tree. Because swaps are applied randomly, it is <br />
possible that some swaps are tried twice before others are tried at all. This option allows the <br />
program to bias the swaps applied toward those that have not yet been attempted. Each swap <br />
is assigned a relative weight depending on the number of times that it has been attempted on <br />
the current best tree. This weight is equal to (uniqueswapbias) raised to the (# times swap <br />
attempted) power. In other words, a value of 0.5 means that swaps that have already been <br />
tried once will be half as likely as those not yet attempted, swaps attempted twice will be ¼ <br />
as likely, etc. A value of 1.0 means no biasing. If this value is not equal to 1.0 and the <br />
outputmostlyuselessfiles option is on, a file called <ofprefix>.swap.log is output. This file <br />
shows the total number of rearrangements tried and the number of unique ones over the course <br />
of a run. Note that this bias is only applied to NNI and limSPR rearrangements. Use of this <br />
option may allow the use of somewhat larger values of limsprrange. <br />
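The weighting rule for uniqueswapbias is simple enough to check by hand. A minimal Python sketch follows; the function name is hypothetical.<br />

```python
def swap_weight(uniqueswapbias, times_attempted):
    """Relative weight of a swap already tried times_attempted times."""
    return uniqueswapbias ** times_attempted

# A value of 0.5 halves the weight with each previous attempt.
for n in range(4):
    print(n, swap_weight(0.5, n))
# 0 1.0
# 1 0.5
# 2 0.25
# 3 0.125

# The default of 0.1 penalizes repeated swaps much more strongly.
print(swap_weight(0.1, 1))  # 0.1
```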
===distanceswapbias (relative weight assigned to branch swaps based on locality)===<br />
distanceswapbias (0.1 to 10, '''1.0''') – This option is similar to uniqueswapbias, except that it <br />
biases toward certain swaps based on the topological distance between the initial and <br />
rearranged trees. The distance is measured as in limsprrange, and is half the <br />
Robinson-Foulds distance between the trees. As with uniqueswapbias, distanceswapbias <br />
assigns a relative weight to each potential swap. In this case the weight is <br />
(distanceswapbias) raised to the (reconnection distance - 1) power. Thus, given a value of <br />
0.5, the weight of an NNI is 1.0, the weight of an SPR with distance 2 is 0.5, with distance 3 <br />
is 0.25, etc. Note that values less than 1.0 bias toward more localized swaps, while values <br />
greater than 1.0 bias toward more extreme swaps. Also note that this bias is only applied to <br />
limSPR rearrangements. Be careful in setting this, as extreme values can have a very large <br />
effect.</div>Zwicklhttps://molevol.mbl.edu/index.php?title=GARLI_Configuration_Settings&diff=4373GARLI Configuration Settings2015-07-21T15:42:10Z<p>Zwickl: </p>
<hr />
<div>==Descriptions of GARLI configuration settings==<br />
<br />
<br />
The format for these configuration settings descriptions is generally:<br />
'''entryname''' (possible values, '''default value in bold''') – description<br />
<br />
==General settings==<br />
===datafname (file containing sequence dataset)=== <br />
'''datafname''' = (filename) –<br />
Name of the file containing the aligned sequence data. Formats accepted are <br />
PHYLIP, NEXUS and FASTA. Robust reading of Nexus formatted <br />
datasets is done using the Nexus Class Library. This accommodates things such as interleaved <br />
alignments and exclusion sets (exsets) in NEXUS assumptions blocks (see the FAQ for an <br />
example of [[FAQ#Can_I_specify_alignment_columns_of_my_data_matrix_to_be_excluded.3F | exset usage]]). Use of NEXUS files is recommended, and is required for partitioned models.<br />
<br />
===constraintfile (file containing constraint definition)===<br />
'''constraintfile''' = (filename, '''none''') –<br />
Name of the file containing any topology constraint specifications, or “none” if <br />
there are no constraints. The easiest way to explain the format of the constraint file is by <br />
example. Consider a dataset of 8 taxa, in which your constraint consists of grouping taxa 1, 3 <br />
and 5. You may specify either positive constraints (inferred tree MUST contain constrained <br />
group) or negative constraints (also called converse constraints, inferred tree CANNOT <br />
contain constrained group). These are specified with either a ‘+’ or a ‘-’ at the beginning of <br />
the constraint specification, for positive and negative constraints, respectively. <br />
*For a positive constraint on a grouping of taxa 1, 3 and 5: <br />
+((1,3,5),2,4,6,7,8); <br />
*For a negative constraint on a grouping of taxa 1, 3 and 5: <br />
-((1,3,5),2,4,6,7,8); <br />
*Note that there are many other equivalent parenthetical representations of these constraints. <br />
*Multiple groups may be positively constrained, but currently only a single negatively constrained group is allowed. <br />
*Multiple constrained groupings may be specified in a single string: <br />
+((1,3,5),2,4,(6,7),8); <br />
or in two separate strings on successive lines: <br />
+((1,3,5),2,4,6,7,8); <br />
+(1,3,5,2,4,(6,7),8); <br />
*Constraint strings may also be specified in terms of taxon names (matching those used in the alignment) instead of numbers. <br />
*Positive and negative constraints cannot be mixed. <br />
*GARLI also accepts another constraint format that may be easier to use in some cases. This involves specifying a single branch to be constrained with a string of ‘*’ (asterisk) and ‘.’ (period) characters, with one character per taxon. Each taxon specified with a ‘*’ falls on one side of the constrained branch, and all those specified with a ‘.’ fall on the other. This should be familiar to anyone who has looked at PAUP* bootstrap output. With this format, a positive constraint on a grouping of taxa 1, 3 and 5 would look like this: <br />
+*.*.*... <br />
or equivalently like this: <br />
+.*.*.*** <br />
With this format each line only designates a single branch, so multiple constrained branches must be specified as multiple lines in the file. <br />
*The program also allows “backbone” constraints, which are simply constraints that do not include all of the taxa. For example if one wanted to constrain (1,3,5) but didn’t care where taxon 6 appeared in the tree, this string could be used: <br />
+((1,3,5),2,4,7,8); <br />
:Nothing special needs to be done to identify this as a backbone constraint, simply leave out some taxa.<br />
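For datasets with many taxa, the asterisk and period format can be tedious to type by hand. The following Python sketch generates such a line from a list of taxon numbers; the function names are hypothetical helpers, not part of GARLI, and taxa are assumed to be numbered from 1 in alignment order.<br />

```python
def branch_constraint(group, ntaxa, positive=True):
    """Build a one-branch constraint line in the '*' / '.' format.

    group: taxon numbers (1-based) on one side of the constrained branch.
    """
    members = set(group)
    if not members or not all(1 <= t <= ntaxa for t in members):
        raise ValueError("taxon numbers must lie between 1 and ntaxa")
    mask = "".join("*" if t in members else "." for t in range(1, ntaxa + 1))
    return ("+" if positive else "-") + mask

def complement(constraint):
    """Equivalent line that marks the other side of the same branch."""
    sign, mask = constraint[0], constraint[1:]
    return sign + "".join("." if c == "*" else "*" for c in mask)

line = branch_constraint([1, 3, 5], ntaxa=8)
print(line)              # +*.*.*...
print(complement(line))  # +.*.*.***
```

Each generated line describes a single branch, so several constrained groups would be written as several lines in the constraint file.<br />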
<br />
===streefname (source of starting tree and/or model)===<br />
'''streefname''' = (random, <u>'''stepwise'''</u>, <filename>) – Specifies where the starting tree topology and/or <br />
model parameters will come from. The tree topology may be a completely random topology <br />
(constraints will be enforced), a tree provided by the user in a file, <u>or a tree generated by the <br />
program using a fast ML stepwise-addition algorithm (see [[GARLI_Configuration_Settings#attachmentspertaxon_.28control_creation_of_stepwise_addition_starting_tree.29|attachmentspertaxon]] below)</u>. <br />
Starting or fixed model parameter values may also be provided in the specified file, with or <br />
without a tree topology. Some notes on starting trees/models: <br />
*Specified starting trees may have polytomies, which will be arbitrarily resolved before the run begins.<br />
*Starting tree formats: <br />
**Plain newick tree string (with taxon numbers or names, with or without branch lengths) <br />
**NEXUS trees block. The trees block can appear in the same file as a NEXUS data or characters block that contains the alignment, although the same filename should then be specified on both the datafname and streefname lines.<br />
*If multiple trees appear in the specified file and multiple search replicates are specified (see '''[[GARLI_Configuration_Settings#searchreps_.28number_of_independent_search_replicates.29|searchreps]]''' setting), then the first tree is used in the first replicate, the second in the second replicate, etc. <br />
*Providing model parameter values: see this page '''[[Specifying_model_parameter_values | Specifying model parameter values]]'''<br />
*See also the FAQ items on model parameters '''[[FAQ#Model_parameters | here]]'''.<br />
<br />
===attachmentspertaxon (control creation of stepwise addition starting tree)===<br />
'''attachmentspertaxon''' = (1 to infinity, '''50''') – The number of attachment branches evaluated for each taxon to be added to the tree during the creation of an ML stepwise-addition starting tree. Briefly, stepwise addition is an algorithm used to make a tree, and involves adding taxa in a random order to a growing tree. For each taxon to be added, a number of randomly chosen attachment branches are tried and scored, and then the best scoring one is chosen as the location of that taxon. The attachmentspertaxon setting controls how many attachment points are evaluated for each taxon to be added. A value of one is equivalent to a completely random tree (only one randomly chosen location is evaluated). A value of greater than 2 times the number of taxa in the dataset means that all attachment points will be evaluated for each taxon, and will result in very good starting trees (but may take a while on large datasets). Even fairly small values (< 10) can result in starting trees that are much, much better than random, but still fairly different from one another. <br />
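
To see why a value above twice the number of taxa evaluates every attachment point, it helps to count branches. The sketch below is an illustration only, not GARLI's implementation: it assumes an unrooted, fully resolved tree (which has 2k - 3 branches for k taxa) that grows from a three-taxon tree, and the function names are hypothetical.

```python
def branches_in_unrooted_tree(ntips):
    """An unrooted, fully resolved tree with ntips >= 3 tips has 2*ntips - 3 branches."""
    return 2 * ntips - 3


def attachments_evaluated(ntax, attachmentspertaxon=50):
    """Number of attachment branches scored for each taxon added.

    Assumes the tree starts with three taxa and one taxon is added at a time.
    """
    counts = []
    for tips_in_tree in range(3, ntax):
        available = branches_in_unrooted_tree(tips_in_tree)
        # No more branches can be tried than exist in the growing tree.
        counts.append(min(attachmentspertaxon, available))
    return counts


counts = attachments_evaluated(ntax=30, attachmentspertaxon=50)
print("total attachments scored:", sum(counts))
print("most scored for a single taxon:", max(counts))
```

Under these assumptions the largest tree a taxon is ever added to has ntax - 1 tips and therefore 2 x ntax - 5 branches, so any setting larger than 2 x ntax can never be the limiting factor.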
===ofprefix (output filename prefix)===<br />
'''ofprefix''' = (text) – Prefix of all output filenames, such as log, treelog, etc. Change this for each run that you do or the program will overwrite previous results. <br />
===randseed (random number seed)===<br />
'''randseed''' = (-1 or positive integers, '''-1''') – The random number seed used by the random number generator. Specify “–1” to have a seed chosen for you. Specifying the same seed number in multiple runs will give exactly identical results, if all other parameters and settings are also identical. <br />
===availablememory (control maximum program memory usage)===<br />
'''availablememory''' – Typically this is the amount of available physical memory on the system, in megabytes. This lets GARLI determine how much system memory it may be able to use to store computations for reuse. The program will be conservative and use at most about 90% of this value, although for typical datasets it will use much, much less. If other programs must be open or used when GARLI is running, you may need to reduce this value. This setting can also have a significant effect on performance (speed), but more is not always better. When a run is started, GARLI will output the '''availablememory''' value necessary for that dataset to achieve each of the “memory levels” (from best to worst, termed “great”, “good”, “low”, and “very low”). More memory is generally better because more calculations are stored and can be reused, but when the amount of memory needed for “great” becomes more than 512 megabytes or so, performance can be slowed because the operating system has difficulty managing that much memory. In general, choose an amount of memory that allows “great” when this is less than 512 MB, and if it is greater, reduce the amount of memory into “good” or “low” as necessary. Avoid “very low” whenever possible. You can find the value that is approximately optimal for your dataset by setting the '''randseed''' to some <br />
positive value (so that the searches are identical) and doing runs with various '''availablememory''' values. Typically it will only make at most a 30% difference, so it isn't worth worrying about too much. <br />
===logevery (frequency to log best score to file)===<br />
'''logevery''' = (1 to infinity, '''10''') – The frequency at which the best score is written to the log file. <br />
===saveevery (frequency to save best tree to file or write checkpoints)===<br />
'''saveevery''' = (1 to infinity, '''100''') – If '''writecheckpoints''' or '''outputcurrentbesttopology''' are specified, this is the frequency (in generations) at which checkpoints or the current best tree are written to file. <br />
===refinestart (whether to optimize a bit before starting a search)===<br />
'''refinestart''' = (0 or 1, '''1''') – Specifies whether some initial rough optimization is performed on the starting branch lengths and rate heterogeneity parameters. This is always recommended. <br />
===outputcurrentbesttopology (continuously write the best tree to file during a run)===<br />
'''outputcurrentbesttopology''' = (0 or 1, '''0''') – If true, the current best tree of the current search replicate is written to <ofprefix>.best.current.tre every '''saveevery''' generations. In versions before 0.96 the current best topology was always written to file, but that is no longer the case. Seeing the current best tree has no real use apart from satisfying your curiosity about how a <br />
run is going.<br />
===outputeachbettertopology (write each improved topology to file)===<br />
'''outputeachbettertopology''' = (0 or 1, '''0''') – If true, each new topology encountered with a better score than the previous best is written to file. In some cases this can result in really big files (hundreds of MB) though, especially for random starting topologies on large datasets. Note that this file is interesting to get an idea of how the topology changed as the searches progressed, but the collection of trees should NOT be interpreted in any meaningful way. This option is not available while bootstrapping. <br />
===enforcetermconditions (use automatic termination)===<br />
'''enforcetermconditions''' = (0 or 1, '''1''') – Specifies whether the automatic termination conditions will be used. Both of the conditions defined by the following two parameters ('''genthreshfortopoterm''' and '''scorethreshforterm''') must be met for the run to terminate. If this is false, the run will continue until it reaches the time ('''stoptime''') or generation ('''stopgen''') limit. It is highly recommended that this option be used! <br />
===genthreshfortopoterm (number of generations without topology improvement required for termination) ===<br />
'''genthreshfortopoterm''' = (1 to infinity, '''20,000''') – This specifies the first part of the termination condition. When no new significantly better scoring topology (see '''significanttopochange''' below) has been encountered in greater than this number of generations, this condition is met. Increasing this parameter may improve the lnL scores obtained (especially on large datasets), but will also increase runtimes. <br />
===scorethreshforterm (max score improvement over recent generations required for termination)===<br />
'''scorethreshforterm''' = (0 to infinity, '''0.05''') – The second part of the termination condition. When the total improvement in score over the last '''intervallength''' x '''intervalstostore''' generations (default is 500 generations, see below) is less than this value, this condition is met. This does not usually need to be changed. <br />
===significanttopochange (required score improvement for topology to be considered better)===<br />
'''significanttopochange''' = (0 to infinity, '''0.01''') – The lnL increase required for a new topology to be <br />
considered significant as far as the termination condition is concerned. It probably doesn’t <br />
need to be played with, but you might try increasing it slightly if your runs reach a stable <br />
score and then take a very long time to terminate due to very minor changes in topology. <br />
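
The way the three termination settings interact can be summarized in a few lines. This is a minimal sketch of the logic as described in the text, not GARLI's source code; the function names are hypothetical, and treating an lnL gain exactly equal to '''significanttopochange''' as significant is an assumption.

```python
def topology_is_significant(new_lnl, best_lnl, significanttopochange=0.01):
    """A new topology counts only if it improves lnL by the required amount.

    Whether the comparison is >= or > is an assumption of this sketch.
    """
    return new_lnl - best_lnl >= significanttopochange


def termination_conditions_met(gens_since_significant_topology,
                               recent_score_improvement,
                               genthreshfortopoterm=20000,
                               scorethreshforterm=0.05):
    """Both parts of the termination condition must hold at the same time."""
    # Part 1: no significantly better topology for more than the threshold.
    topology_condition = gens_since_significant_topology > genthreshfortopoterm
    # Part 2: total lnL gain over the recent window is below the threshold.
    score_condition = recent_score_improvement < scorethreshforterm
    return topology_condition and score_condition


print(termination_conditions_met(25000, 0.01))   # both conditions hold
print(termination_conditions_met(25000, 0.20))   # score is still improving
print(topology_is_significant(-5000.005, -5000.0))
```

Raising '''significanttopochange''' makes fewer topology changes reset the generation counter in part 1, which is why it can shorten runs that linger over very minor improvements.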
===outputphyliptree (write trees to file in Phylip as well as Nexus format)===<br />
'''outputphyliptree''' = (0 or 1, '''0''') – Whether phylip formatted tree files will be output in addition to <br />
the default nexus files for the best tree across all replicates (<ofprefix>.best.phy), the best <br />
tree for each replicate (<ofprefix>.best.all.phy) or, in the case of bootstrapping, the best tree <br />
for each bootstrap replicate (<ofprefix>.boot.phy). <br />
===outputmostlyuselessfiles (output uninteresting files)===<br />
'''outputmostlyuselessfiles''' = (0 or 1, '''0''') – Whether to output three files of little general interest: the <br />
“fate”, “problog” and “swaplog” files. The fate file shows the parentage, mutation types and <br />
scores of every individual in the population during the entire search. The problog shows how <br />
the proportions of the different mutation types changed over the course of the run. The <br />
swaplog shows the number of unique swaps and the number of total swaps on the current <br />
best tree over the course of the run. <br />
===writecheckpoints (write checkpoint files during run)===<br />
'''writecheckpoints''' = (0 or 1, '''0''') – Whether to write three files to disk containing all information <br />
about the current state of the population every saveevery generations, with each successive <br />
checkpoint overwriting the previous one. These files can be used to restart a run at the last <br />
written checkpoint by setting the '''restart''' configuration entry.<br />
===restart (restart run from checkpoint)===<br />
'''restart''' = (0 or 1, '''0''') – Whether to restart at a previously saved checkpoint. To use this option the '''writecheckpoints''' option must have been used during a previous run. The program will look for checkpoint files that are named based on the ofprefix of the previous run. If you intend to restart a run, NOTHING should be changed in the config file except setting '''restart''' to 1. <br />
A run that is restarted from checkpoint will give ''exactly'' the same results it would have if the run had gone to completion. <br />
===outgroup (orient inferred trees consistently)===<br />
'''outgroup''' = (outgroup taxa numbers, separated by spaces) – This option allows for orienting the tree topologies in a consistent way when they are written to file. Note that this has NO effect whatsoever on the actual inference and the specified outgroup is NOT constrained to be present in the inferred trees. If multiple outgroup taxa are specified and they do not form a monophyletic group in the inferred tree, this setting will be ignored. If you specify a single outgroup taxon it will always be present, and the tree will always be consistently oriented. Ranges can be indicated with a hyphen. For example, to specify an outgroup consisting of taxa 1, 2, 3 and 5 the format is this: <br />
outgroup = 1-3 5<br />
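
For scripts that build or check config files, the range notation can be expanded as follows. This is a small sketch with a hypothetical helper name; it only mirrors the space-separated, hyphen-range format described above.

```python
def parse_outgroup(value):
    """Expand an outgroup string such as '1-3 5' into a list of taxon numbers."""
    taxa = []
    for token in value.split():
        if "-" in token:
            # A hyphen marks an inclusive range of taxon numbers.
            first, last = (int(part) for part in token.split("-"))
            taxa.extend(range(first, last + 1))
        else:
            taxa.append(int(token))
    return taxa


print(parse_outgroup("1-3 5"))  # [1, 2, 3, 5]
```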
<br />
===searchreps (number of independent search replicates)===<br />
'''searchreps''' = (1 to infinity, '''2''') – The number of independent search replicates to perform during a program execution. You should always either do multiple search replicates or multiple program executions with any dataset to get a feel for whether you are getting consistent results, which suggests that the program is doing a decent job of searching. Note that if this is > 1 and you are performing a bootstrap analysis, this is the number of search replicates to be done per bootstrap replicate. That can increase the chance of finding the best tree per bootstrap replicate, but will also increase bootstrap runtimes enormously.<br />
===bootstrapreps (number of bootstrap replicates)===<br />
'''bootstrapreps''' = (0 to infinity, '''0''') - The number of bootstrap reps to perform. If this is greater than <br />
0, normal searching will not be performed. The resulting bootstrap trees (one per rep) will be <br />
output to a file named <ofprefix>.boot.tre. To obtain the bootstrap proportions they will then <br />
need to be read into PAUP* or a similar program to obtain a majority rule consensus. Note <br />
that it is probably safe to reduce the strictness of the termination conditions during <br />
bootstrapping (perhaps halve genthreshfortopoterm), which will greatly speed up the <br />
bootstrapping process with negligible effects on the results.<br />
<br />
Note that GARLI does NOT do the bootstrap consensus or calculate the bootstrap proportions itself. It simply infers the trees that are the input to that consensus. The trees in the .boot.tre file will need to be read into a program such as SumTrees (or PAUP* or CONSENSE) to do a majority-rule consensus and obtain your bootstrap support values. See the following part of the "Advanced topics" page for suggestions on making the consensus: '''[[Advanced_topics#Detailed_Example:_A_bootstrap_analysis|Detailed Example: A bootstrap analysis]]'''.<br />
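
To make clear what the consensus programs compute, the sketch below counts how often each grouping appears across a set of bootstrap trees. It is a toy illustration, not a replacement for SumTrees, PAUP* or CONSENSE: it does not read tree files, and the representation of each tree as a set of frozensets of taxon names is a hypothetical simplification that assumes every split is written the same way in every tree.

```python
from collections import Counter


def bootstrap_proportions(replicate_splits):
    """Fraction of bootstrap trees that contain each split."""
    nreps = len(replicate_splits)
    counts = Counter(split for splits in replicate_splits for split in set(splits))
    return {split: count / nreps for split, count in counts.items()}


def majority_rule(proportions):
    """Keep only the splits found in more than half of the trees."""
    return {split: p for split, p in proportions.items() if p > 0.5}


# Four made-up bootstrap trees, each reduced to the groupings it contains.
trees = [
    {frozenset({"A", "B"}), frozenset({"A", "B", "C"})},
    {frozenset({"A", "B"}), frozenset({"B", "C"})},
    {frozenset({"A", "C"}), frozenset({"A", "B", "C"})},
    {frozenset({"A", "B"}), frozenset({"A", "B", "C"})},
]
support = bootstrap_proportions(trees)
for split, p in majority_rule(support).items():
    print(sorted(split), p)
```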
<br />
===resampleproportion (relative size of re-sampled data matrix)===<br />
'''resampleproportion''' = (0.1 to 10, '''1.0''') – When '''bootstrapreps''' > 0, this setting allows for <br />
bootstrap-like resampling, but with the pseudoreplicate datasets having a number of <br />
alignment columns different from the real data. Setting values < 1.0 is somewhat similar to jackknifing, but not identical. <br />
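
The resampling idea can be sketched as drawing column indices with replacement, with the number of draws scaled by the proportion. This is an illustration only: the function name is hypothetical and the rounding of the column count is an assumption, since the text does not say how GARLI rounds.

```python
import random


def resample_columns(ncols, resampleproportion=1.0, seed=None):
    """Draw column indices with replacement for one pseudoreplicate dataset."""
    rng = random.Random(seed)
    # Assumption: the pseudoreplicate size is the rounded product.
    nsampled = round(ncols * resampleproportion)
    return [rng.randrange(ncols) for _ in range(nsampled)]


standard = resample_columns(1000, 1.0, seed=1)   # ordinary bootstrap size
smaller = resample_columns(1000, 0.5, seed=1)    # half as many columns
print(len(standard), len(smaller))
print(len(set(smaller)) <= len(smaller))         # repeats are possible
```

Because columns are still drawn with replacement when the proportion is below 1.0, the result differs from a true jackknife, which deletes columns without duplicating any.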
===inferinternalstateprobs (infer ancestral states)===<br />
'''inferinternalstateprobs''' = (0 or 1, '''0''') – Specify 1 to have GARLI infer the marginal posterior <br />
probability of each character at each internal node. This is done at the very end of the run, <br />
just before termination. The results are output to a file named <ofprefix>.internalstates.log.<br />
<br />
===outputsitelikelihoods (write a file with the log-likelihood of each site)===<br />
<br />
'''outputsitelikelihoods''' = (0 or 1, '''0''') - Causes a file named <ofprefix>.sitelikes.log to be created containing the sitewise log-likelihood values for each individual site for the final tree of each search replicate. Note that the format of the file is exactly the same as that output by PAUP, so it can be directly used in CONSEL using the PAUP file-format option. For partitioned models the sites are listed in sets by partition subset, but the site numbers will match those in the original datamatrix. Thus, when using a partitioned model that divides sites by codon position, the ordering would be 1, 4, 7, ... 2, 5, 8, ... 3, 6, 9 .... Note that this makes no difference to CONSEL, which just assumes that the site likelihoods are ordered the same for each tree and ignores the site numbers. <br />
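
The ordering for a codon-position partition can be reproduced with a short helper, which may be useful when matching lines of the site-likelihood file back to alignment columns. The function name is hypothetical, and the sketch assumes exactly three subsets defined by codon position.

```python
def sitelike_order_by_codon_position(nsites):
    """Site numbers listed subset by subset: positions 1, then 2, then 3."""
    return [site for pos in (1, 2, 3) for site in range(pos, nsites + 1, 3)]


print(sitelike_order_by_codon_position(9))  # [1, 4, 7, 2, 5, 8, 3, 6, 9]
```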
<br />
Also note that using the site likelihoods from a partitioned model is a violation of CONSEL's assumptions, since it will not know about the partitions and its internal bootstrap resampling will not obey them. It isn't clear what the effects of this will be on the various tests. <br />
<br />
Finally, if you want to calculate the site likelihoods for a set of trees that you provide, use the '''[[GARLI_Configuration_Settings#optimizeinputonly_.28do_not_search.2C_only_optimize_model_and_branch_lengths_on_user_trees.29 | optimizeinputonly]]''' setting.<br />
<br />
===optimizeinputonly (do not search, only optimize model and branch lengths on user trees)===<br />
(new in version 2.0)<br />
<br />
'''optimizeinputonly''' = (0 or 1, '''0''') - Requires a NEXUS formatted input tree file (only NEXUS format will work!), specified through the '''[[GARLI_Configuration_Settings#streefname_.28source_of_starting_tree_and.2For_model.29 | streefname]]''' setting. All trees in that file will have model parameters and branch lengths optimized, giving their maximum likelihood scores. A file with the site-likelihoods of each tree will also be output. See the '''[[GARLI_Configuration_Settings#outputsitelikelihoods_.28write_a_file_with_the_log-likelihood_of_each_site.29 | outputsitelikelihoods]]''' setting for details.<br />
<br />
===collapsebranches (collapse zero length branches before writing final trees to file)===<br />
<br />
'''collapsebranches''' = (0 or 1, '''1''') - Before version 1.0, all trees that were returned were fully resolved. This was true even if the maximum-likelihood estimates of some internal branch lengths were effectively zero (or GARLI's minimum, which is 1e-8). In such cases, collapsing the branch into a polytomy would be a better representation. Note that GARLI will never return a tree with an actual branch length of zero, but rather with its minimum value of 1.0e-8. The drawback of always returning fully resolved trees is that what is effectively a polytomy can be resolved in three ways, and different independent searches may randomly return one of those resolutions. Thus, if you compare the trees by topology only, they will look different. If you pay attention to the branch lengths and likelihood scores of the trees it will be apparent that they are effectively the same.<br />
<br />
I think that collapsing of branches is particularly important when bootstrapping, since no support should be given to a branch that doesn't really exist, i.e., that is a random resolution of a polytomy. Collapsing is also good when calculating tree to tree distances such as the symmetric tree distance, for example when calculating phylogenetic error to a known target tree. Zero-length branches would add to the distances (~error) although they really should not.<br />
<br />
==Model specification settings==<br />
<br />
With version 1.0 and later there are now many more options dealing with model specification because of <br />
the inclusion of amino acid and codon-based models. The description of the settings will be <br />
broken up by data type. Note that in terms of the model settings in GARLI, “empirical” means <br />
to fix parameter values at those observed in the dataset being analyzed, and “fixed” means to fix <br />
them at user specified values. See the '''[[GARLI_Configuration_Settings#streefname_.28source_of_starting_tree_and.2For_model.29|streefname]]''' setting for details on how to provide <br />
parameter values to be fixed during inference.<br />
<br />
'''PARTITIONED MODELS: To use partitioned models you'll still need to use the same basic model settings detailed below for each individual partition subset, and then some other details to set up the partition model itself. Be sure that you are familiar with the rest of this section, then see [[Using_partitioned_models | Using partitioned models]].'''<br />
<br />
===datatype (sequence type and inference model)=== <br />
'''datatype''' = ('''nucleotide''', aminoacid, codon-aminoacid, codon, standard, standardordered, standardvariable, standardvariableordered) – The type of data and model that is <br />
to be used during tree inference. Nucleotide and amino acid data are self explanatory.<br />
<br />
The codon-aminoacid datatype means that the data will be <br />
supplied as a nucleotide alignment, but will be internally translated and analyzed using an <br />
amino acid model. The codon and codon-aminoacid datatypes require nucleotide sequence <br />
that is aligned in the correct reading frame. In other words, all gaps in the alignment should <br />
be a multiple of 3 in length, and the alignment should start at the first position of a codon. If <br />
the alignment has extra columns at the start, middle or end, they should be removed or <br />
excluded with a Nexus exset (see '''[[FAQ#Can_I_specify_alignment_columns_of_my_data_matrix_to_be_excluded.3F | this FAQ item]]''' for an example of exset usage). The correct <br />
'''[[GARLI_Configuration_Settings#geneticcode_.28code_to_use_in_codon_translation.29 | geneticcode]]''' must also be set.<br />
<br />
(New in Version 2.0)<br />
<br />
The various "standard" datatypes are new in GARLI 2.0. These represent morphology-like discrete characters, with any number of states. These are also termed the "Mk" and "Mkv" models by Lewis (2001). See this page for more details on using standard data: '''[[Mkv morphology model]]'''.<br />
<br />
===Settings for datatype = nucleotide=== <br />
====ratematrix (relative rate parameters assumed by substitution model)====<br />
'''ratematrix''' = (1rate, 2rate, '''6rate''', fixed, custom string) – The number of relative substitution rate <br />
parameters (note that the number of free parameters is this value minus one). Equivalent to <br />
the “nst” setting in PAUP* and MrBayes. 1rate assumes that substitutions between all pairs <br />
of nucleotides occur at the same rate (JC model), 2rate allows different rates for transitions and <br />
transversions (K2P or HKY models), and 6rate allows a different rate between each nucleotide pair (GTR). These rates are <br />
estimated unless the fixed option is chosen. Since version 0.96, parameters for any <br />
submodel of the GTR model may be estimated. The format for specifying this is very <br />
similar to that used in the “rclass” setting of PAUP*. Within parentheses, six letters are <br />
specified, with spaces between them. The six letters represent the rates of substitution <br />
between the six pairs of nucleotides, with the order being A-C, A-G, A-T, C-G, C-T and G-T. <br />
Letters within the parentheses that are the same mean that a single parameter is shared by <br />
multiple nucleotide pairs. For example, <br />
ratematrix = (a b a a b a) <br />
would specify the HKY 2-rate model (equivalent to ratematrix = 2rate). This entry, <br />
ratematrix = (a b c c b a) <br />
would specify 3 estimated rates of substitution, with one rate shared by A-C and G-T <br />
substitutions, another rate shared by A-G and C-T substitutions, and the final rate shared by <br />
A-T and C-G substitutions.<br />
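
A custom rate matrix string can be checked by mapping each letter back to the nucleotide pairs that share it. The sketch below is a hypothetical helper, not part of GARLI; it uses only the pair order given above (A-C, A-G, A-T, C-G, C-T, G-T) and the rule that the number of free parameters is the number of rate classes minus one.

```python
PAIRS = ("A-C", "A-G", "A-T", "C-G", "C-T", "G-T")


def parse_ratematrix(spec):
    """Group the six nucleotide pairs by the letter assigned to each."""
    letters = spec.strip().strip("()").split()
    if len(letters) != 6:
        raise ValueError("expected six letters, one per nucleotide pair")
    classes = {}
    for pair, letter in zip(PAIRS, letters):
        classes.setdefault(letter, []).append(pair)
    return classes


classes = parse_ratematrix("(a b c c b a)")
for letter, pairs in classes.items():
    print(letter, pairs)
print("free rate parameters:", len(classes) - 1)
```

For the string (a b c c b a) this groups A-C with G-T, A-G with C-T, and A-T with C-G, which is the three-rate model described above.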
<br />
====statefrequencies (equilibrium base frequencies assumed by substitution model)====<br />
'''statefrequencies''' = (equal, empirical, '''estimate''', fixed) – Specifies how the equilibrium state <br />
frequencies (A, C, G and T) are treated. The empirical setting fixes the frequencies at their <br />
observed proportions, and the other options should be self-explanatory. <br />
===For datatype = nucleotide or aminoacid===<br />
====invariantsites (treatment of proportion of invariable sites parameter)====<br />
'''invariantsites''' = (none, '''estimate''', fixed) – Specifies whether a parameter representing the <br />
proportion of sites that are unable to change (i.e. have a substitution rate of zero) will be <br />
included. This is typically referred to as “invariant sites”, but would better be termed <br />
“invariable sites”. <br />
====ratehetmodel (type of rate heterogeneity to assume for variable sites)====<br />
'''ratehetmodel''' = (none, '''gamma''', gammafixed) – The model of rate heterogeneity assumed. <br />
“gammafixed” requires that the alpha shape parameter is provided, and a setting of “gamma” <br />
estimates it. <br />
====numratecats (number of overall substitution rate categories)==== <br />
'''numratecats''' = (1 to 20, '''4''') – The number of categories of variable rates (not including the <br />
invariant site class if it is being used). Must be set to 1 if '''ratehetmodel''' is set to none. Note <br />
that runtimes and memory usage scale linearly with this setting. <br />
===For datatype = aminoacid or codon-aminoacid===<br />
Amino acid analyses are typically done using fixed rate matrices that have been estimated on <br />
large datasets and published. Typically the only model parameters that are estimated during tree <br />
inference relate to the rate heterogeneity distribution. Each of the named matrices also has <br />
corresponding fixed amino acid frequencies, and a given rate matrix can either be used with those <br />
frequencies or with the amino acid frequencies observed in your dataset. This second option is <br />
often denoted as “+F” in a model description, although in terms of the GARLI configuration <br />
settings this is referred to as “empirical” frequencies. In GARLI the Dayhoff model would be <br />
specified by setting both the ratematrix and statefrequencies options to “dayhoff”. The <br />
Dayhoff+F model would be specified by setting the ratematrix to “dayhoff”, and <br />
statefrequencies to “empirical”.<br />
<br />
The following named amino acid models are implemented: <br />
{| border="1"<br />
|-<br />
!'''ratematrix'''/'''statefrequencies''' setting !! reference <br />
|-<br />
|align="center" | dayhoff || Dayhoff, Schwartz and Orcutt. 1978.<br />
|-<br />
|align="center" | jones || Jones, Taylor and Thornton (JTT), 1992. <br />
|-<br />
|align="center" | WAG || Whelan and Goldman, 2001.<br />
|-<br />
|align="center" | mtREV || Adachi and Hasegawa, 1996.<br />
|-<br />
|align="center" | mtmam || Yang, Nielsen and Hasegawa, 1998.<br />
|}<br />
Note that most programs allow either the use of a named rate matrix and its corresponding state <br />
frequencies, or a named rate matrix and empirical frequencies. GARLI technically allows the <br />
mixing of different named matrices and equilibrium frequencies (for example, wag matrix with <br />
jones equilibrium frequencies), but this is not recommended. <br />
====ratematrix (amino acid substitution rates)====<br />
'''ratematrix''' = (poisson, jones, dayhoff, wag, mtmam, mtrev) – The fixed amino acid rate matrix to <br />
use. You should use the matrix that gives the best likelihood, and could use a program like <br />
PROTTEST (very much like MODELTEST, but for amino acid models) to determine which <br />
fits best for your data. Poisson assumes a single rate of substitution between all amino acid <br />
pairs, and is a very poor model.<br />
====statefrequencies (equilibrium base frequencies assumed by substitution model)====<br />
'''statefrequencies''' = (equal, empirical, estimate, fixed, jones, dayhoff, wag, mtmam, mtrev) – <br />
Specifies how the equilibrium state frequencies of the 20 amino acids are treated. The <br />
“empirical” option fixes the frequencies at their observed proportions (when describing a <br />
model this is often termed “+F”). <br />
===For datatype = codon===<br />
The codon models are built with three components: (1) parameters describing the process of <br />
individual nucleotide substitutions, (2) equilibrium codon frequencies, and (3) parameters <br />
describing the relative rate of nonsynonymous to synonymous substitutions. The nucleotide <br />
substitution parameters within the codon models are exactly the same as those possible with <br />
standard nucleotide models in GARLI, and are specified with the ratematrix configuration <br />
entry. Thus, they can be of the 2rate variety (inferring different rates for transitions and <br />
transversions, K2P or HKY-like), the 6rate variety (inferring different rates for all nucleotide <br />
pairs, GTR-like) or any other sub-model of GTR. The options for codon frequencies are <br />
specified with the statefrequencies configuration entry. The options are to use equal <br />
frequencies (not a good option), the frequencies observed in your dataset (termed “empirical” in <br />
GARLI), or the codon frequencies implied by the “F1x4” or “F3x4” methods (using PAML's <br />
terminology). These last two options calculate the codon frequencies as the product of the <br />
frequencies of the three nucleotides that make up each codon. In the “F1x4” case the nucleotide <br />
frequencies are those observed in the dataset across all codon positions, while the “F3x4” option <br />
uses the nucleotide frequencies observed in the data at each codon position separately. The final <br />
component of the codon models is the nonsynonymous to synonymous relative rate parameters <br />
(aka dN/dS or omega parameters). The default is to infer a single dN/dS value. Alternatively, a <br />
model can be specified that infers a given number of dN/dS categories, with the dN/dS values <br />
and proportions falling in each category estimated (ratehetmodel = nonsynonymous). This is <br />
the “discrete” or “M3” model in PAML's terminology. <br />
====ratematrix (relative ''nucleotide'' rate parameters assumed by codon model)====<br />
'''ratematrix''' = (1rate, '''2rate''', 6rate, fixed, custom string) – This determines the relative rates of <br />
nucleotide substitution assumed by the codon model. The options are exactly the same as <br />
those allowed under a normal nucleotide model. A codon model with '''ratematrix''' = 2rate <br />
specifies the standard Goldman and Yang (1994) model, with different substitution rates for <br />
transitions and transversions. <br />
====statefrequencies (equilibrium codon frequencies)====<br />
'''statefrequencies''' = (equal, empirical, f1x4, '''f3x4''') - The options are to use equal codon frequencies <br />
(not a good option), the frequencies observed in your dataset (termed “empirical” in GARLI), <br />
or the codon frequencies implied by the “F1x4” or “F3x4” methods (using PAML's <br />
terminology). These last two options calculate the codon frequencies as the product of the <br />
frequencies of the three nucleotides that make up each codon. In the “F1x4” case the <br />
nucleotide frequencies are those observed in the dataset across all codon positions, while the <br />
“F3x4” option uses the nucleotide frequencies observed in the data at each codon position <br />
separately. <br />
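
The product rule behind “F1x4” and “F3x4” can be written out directly. The following is a sketch under stated assumptions rather than GARLI's code: it uses the standard genetic code's three stop codons, and it renormalizes the 61 sense-codon frequencies so that they sum to one, which is an assumption about how the excluded stop codons are handled. The function name and example frequencies are made up.

```python
from itertools import product

STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code


def codon_frequencies(position_freqs):
    """Codon frequencies as products of per-position nucleotide frequencies.

    position_freqs: three dicts of nucleotide frequencies, one per codon
    position. Pass the same dict three times for F1x4, or three different
    dicts for F3x4.
    """
    raw = {}
    for codon in ("".join(c) for c in product("ACGT", repeat=3)):
        if codon in STOP_CODONS:
            continue
        raw[codon] = (position_freqs[0][codon[0]]
                      * position_freqs[1][codon[1]]
                      * position_freqs[2][codon[2]])
    total = sum(raw.values())
    # Assumption: rescale so the sense codons sum to one.
    return {codon: value / total for codon, value in raw.items()}


overall = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}
f1x4 = codon_frequencies([overall, overall, overall])

by_position = [
    {"A": 0.25, "C": 0.25, "G": 0.35, "T": 0.15},
    {"A": 0.30, "C": 0.25, "G": 0.15, "T": 0.30},
    {"A": 0.20, "C": 0.30, "G": 0.30, "T": 0.20},
]
f3x4 = codon_frequencies(by_position)
print(len(f1x4), round(sum(f1x4.values()), 6), round(sum(f3x4.values()), 6))
```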
====ratehetmodel (variation in dN/dS across sites)==== <br />
'''ratehetmodel''' = ('''none''', nonsynonymous) – For codon models, the default is to infer a single dN/dS <br />
parameter. Alternatively, a model can be specified that infers a given number of dN/dS <br />
categories, with the dN/dS values and proportions falling in each category estimated <br />
('''ratehetmodel''' = nonsynonymous). This is the “discrete” or “M3” model of Yang et al. <br />
(2000). <br />
====numratecats (number of discrete dN/dS categories)====<br />
'''numratecats''' = (1 to 20, '''1''') – When ratehetmodel = nonsynonymous, this is the number of dN/dS parameter <br />
categories.<br />
====invariantsites====<br />
'''invariantsites''' = (none) - '''NOTE:''' Due to an error on my part, the '''invariantsites''' entry must appear with codon models for them to run (despite the fact that it doesn't apply). i.e., be sure that this appears:<br />
 invariantsites = none<br />
<br />
===For datatype = codon or codon-aminoacid=== <br />
====geneticcode (code to use in codon translation)====<br />
'''geneticcode''' = ('''standard''', vertmito, invertmito) – The genetic code to be used in translating codons <br />
into amino acids.<br />
<br />
==Population settings==<br />
===nindivs (number of individuals in population)===<br />
'''nindivs''' = (2 to 100, '''4''')- The number of individuals in the population. This may be increased, but <br />
doing so is generally not beneficial. Note that typical genetic algorithms tend to have much, <br />
much larger population sizes than GARLI's defaults. <br />
===holdover (unmutated copies of best individual)===<br />
'''holdover''' = (1 to nindivs-1, '''1''')- The number of times the best individual is copied to the next <br />
generation with no chance of mutation. It is best not to mess with this. <br />
===selectionintensity (strength of selection)===<br />
'''selectionintensity''' = (0.01 to 5.0, '''0.5''')- Controls the strength of selection, with larger numbers <br />
denoting stronger selection. The relative probability of reproduction of two individuals <br />
depends on the difference in their log likelihoods (&Delta;lnL) and is formulated very similarly to <br />
the procedure of calculating Akaike weights. The relative probability of reproduction of the <br />
less fit individual is equal to:<br />
<br />
<big><big><br />
''e''<sup> (-selectionIntensity &times; &Delta; lnL)</sup><br />
</big></big><br />
<br />
In general, this setting does not seem to have much of an effect on the progress of a run. In <br />
theory higher values should cause scores to increase more quickly, but make the search more <br />
likely to be entrapped in a local optimum. Low values will increase runtimes, but may be <br />
more likely to reach the true optimum. The following table gives the relative probabilities of <br />
reproduction for different values of the selection intensity when the difference in log <br />
likelihood is 1.0:<br />
{| border="1"<br />
|-<br />
!'''selectionintensity''' value !! Ratio of probabilities of reproduction <br />
|-<br />
|align="center" |0.05 || align="center" |0.95:1.0 <br />
|-<br />
|align="center" |0.1 || align="center" |0.90:1.0 <br />
|-<br />
|align="center" |0.25 || align="center" |0.78:1.0 <br />
|-<br />
|align="center" |0.5 || align="center" |0.61:1.0 <br />
|-<br />
|align="center" |0.75 || align="center" |0.47:1.0 <br />
|-<br />
|align="center" |1.0 || align="center" |0.37:1.0 <br />
|-<br />
|align="center" |2.0 || align="center" |0.14:1.0<br />
|}<br />
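<br />
The ratios in the table follow directly from the formula above. A minimal Python sketch that reproduces them (the function name is made up for illustration and is not part of GARLI):<br />

```python
import math

def relative_reproduction_probability(selection_intensity, delta_lnl):
    """Relative probability of reproduction of the less fit individual,
    e^(-selectionintensity * delta lnL), with the fitter individual fixed at 1.0."""
    return math.exp(-selection_intensity * delta_lnl)

# Same selectionintensity values as the table, with a lnL difference of 1.0
for s in (0.05, 0.1, 0.25, 0.5, 0.75, 1.0, 2.0):
    ratio = relative_reproduction_probability(s, 1.0)
    print(f"{s:<5} {ratio:.2f}:1.0")
```

Passing a different delta_lnl shows how quickly the less fit individual's chances shrink as the score gap grows.<br />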
===holdoverpenalty (fitness handicap for best individual)===<br />
'''holdoverpenalty''' = (0 to 100, '''0''') – This can be used to bias the probability of reproduction of the <br />
best individual downward. Because the best individual is automatically copied into the next <br />
generation, it has a bit of an unfair advantage and can cause all population variation to be lost <br />
due to genetic drift, especially with small population sizes. The value specified here is <br />
subtracted from the best individual’s lnL score before calculating the probabilities of<br />
reproduction. It seems plausible that this might help maintain variation, but I have not seen it <br />
cause a measurable effect.<br />
<br />
===stopgen (maximum number of generations to run)===<br />
'''stopgen''' – The maximum number of generations to run. Note that this supersedes the automated <br />
stopping criterion (see enforcetermconditions above), and should therefore be set to a very <br />
large value if automatic termination is desired. <br />
===stoptime (maximum time to run)===<br />
'''stoptime''' – The maximum number of seconds for the run to continue. Note that this supersedes <br />
the automated stopping criterion (see enforcetermconditions above), and should therefore <br />
be set to a very large value if automatic termination is desired.<br />
<br />
==Branch-length optimization settings== <br />
After a topological rearrangement, branch lengths in the vicinity of the rearrangement are <br />
optimized by the Newton-Raphson method. Optimization passes are performed on a particular <br />
branch until the expected improvement in likelihood for the next pass is less than a threshold <br />
value, termed the optimization precision. Note that this name is somewhat misleading, as the <br />
precision of the optimization algorithm is inversely related to this value (i.e., smaller values of <br />
the optimization precision lead to more precise optimization). If the improvement in likelihood <br />
due to optimization for a particular branch is greater than the optimization precision, <br />
optimization is also attempted on adjacent branches, spreading out across the tree. When no new <br />
topology with a better likelihood score is discovered for a while, the value is automatically <br />
reduced. The value can have a large effect on speed, with smaller values significantly slowing <br />
down the algorithm. The value of the optimization precision and how it changes over the course <br />
of a run are determined by the following three parameters. <br />
===startoptprec===<br />
startoptprec (0.005 to 5.0, '''0.5''') – The beginning optimization precision. <br />
===minoptprec===<br />
minoptprec (0.001 to startoptprec, '''0.01''') – The minimum allowed value of the optimization precision.<br />
<br />
===numberofprecreductions===<br />
numberofprecreductions (0 to 100, '''10''') – Specify the number of steps that it will take for the <br />
optimization precision to decrease (linearly) from startoptprec to minoptprec. <br />
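<br />
As a rough illustration of the linear decrease, the following Python sketch lists the precision values a run would step through. It assumes evenly spaced steps and does not model when each reduction is triggered; the function name and the handling of numberofprecreductions = 0 are assumptions, not GARLI's actual code.<br />

```python
def optimization_precision_schedule(startoptprec=0.5, minoptprec=0.01,
                                    numberofprecreductions=10):
    """Precision values from startoptprec down to minoptprec in equal steps."""
    if numberofprecreductions <= 0:
        # Assumption: with no reductions the precision stays at its start value.
        return [startoptprec]
    step = (startoptprec - minoptprec) / numberofprecreductions
    return [startoptprec - i * step for i in range(numberofprecreductions + 1)]

# Default settings: 10 reductions of 0.049 each
for value in optimization_precision_schedule():
    print(f"{value:.3f}")
```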
===treerejectionthreshold===<br />
treerejectionthreshold (0 to 500, '''50''') – This setting controls which trees have more extensive <br />
branch-length optimization applied to them. All trees created by a branch swap receive <br />
optimization on a few branches that directly took part in the rearrangement. If the difference <br />
in score between the partially optimized tree and the best known tree is greater than treerejectionthreshold, no further optimization is applied to the branches of that tree. <br />
Reducing this value can significantly reduce runtimes, often with little or no effect on results. <br />
However, it is possible that a better tree could be missed if this is set too low. In cases in <br />
which obtaining the very best tree per search is not critical (e.g., bootstrapping), setting this <br />
lower (~20) is probably safe.<br />
<br />
==Settings controlling the proportions of the mutation types== <br />
Each mutation type is assigned a prior ''weight''. These values determine the expected <br />
proportions of the various mutation types that are performed. The primary mutation categories <br />
are ''topology'' (t), ''model'' (m) and ''branch length'' (b). Each is assigned a prior weight ( P<sub>i</sub> ) in the <br />
config file. Each time that a new best likelihood score is attained, the amount of the increase in <br />
score is credited to the mutation type responsible, with the sum of the increases ( S<sub>i</sub> ) maintained <br />
over the last intervallength x intervalstostore generations. The number of times that each <br />
mutation is performed ( N<sub>i</sub> ) is also tallied. The total weight of a mutation type is W<sub>i</sub> = P<sub>i</sub> + ( S<sub>i</sub> / N<sub>i</sub> ). <br />
The proportion of mutations of type i out of all mutations is then <br />
<br />
<big><big><br />
Pr(i) = W<sub>i</sub> / <br />
(W<sub>t</sub> + W<sub>m</sub> + W<sub>b</sub>) <br />
</big></big><br />
<br />
The proportion of each mutation is thus related to its prior weight and the average increase in <br />
score that it has caused over recent generations. The prior weights can be used to control the <br />
expected (and starting) proportions of the mutation types, as well as how sensitive the <br />
proportions are to the course of events in a run. It is generally a good idea to make the topology <br />
prior much larger than the others so that when no mutations are improving the score many <br />
topology mutations are still attempted. If you set '''outputmostlyuselessfiles''' to 1, you can look at the “problog” file to determine what the <br />
proportions of the mutations actually were over the course of a run. <br />
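<br />
The weighting scheme above can be sketched in a few lines of Python. This is only an illustration of the formula, not GARLI's implementation; the function name, the dictionary layout and the treatment of a mutation type that has not yet been performed (N<sub>i</sub> = 0) are assumptions.<br />

```python
def mutation_proportions(prior, score_gain, times_performed):
    """Return Pr(i) = W_i / sum of all W, with W_i = P_i + S_i / N_i."""
    weights = {}
    for kind, p in prior.items():
        n = times_performed.get(kind, 0)
        s = score_gain.get(kind, 0.0)
        # Assumption: a type that has not been tried yet contributes only its prior.
        weights[kind] = p + (s / n if n > 0 else 0.0)
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one mutation type needs a positive weight")
    return {kind: w / total for kind, w in weights.items()}

# Default prior weights: topoweight, modweight, brlenweight
priors = {"topology": 1.0, "model": 0.05, "branch length": 0.2}

# Starting proportions, before any score increases have been credited
print(mutation_proportions(priors, {}, {}))

# Invented tallies: model mutations have recently been paying off
gains = {"topology": 2.0, "model": 6.0, "branch length": 0.5}
counts = {"topology": 400, "model": 20, "branch length": 80}
print(mutation_proportions(priors, gains, counts))
```

With the default priors the starting proportions are 0.80, 0.04 and 0.16, which shows why a large topology prior keeps topology mutations frequent even when nothing is improving the score.<br />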
===topoweight (weight on topology mutations)===<br />
topoweight (0 to infinity, '''1.0''') The prior weight assigned to the class of topology mutations <br />
(NNI, SPR and limSPR). Note that setting this to 0.0 turns off topology mutations, meaning that the tree topology is fixed for the run. This used to be a way to have the program estimate only model parameters and branch-lengths, but the '''[[GARLI_Configuration_Settings#optimizeinputonly_.28do_not_search.2C_only_optimize_model_and_branch_lengths_on_user_trees.29 | optimizeinputonly]]''' setting is now a better way to go.<br />
<br />
===modweight (weight on model parameter mutations)===<br />
modweight (0 to infinity, '''0.05''') The prior weight assigned to the class of model mutations. Note <br />
that setting this at 0.0 fixes the model during the run. <br />
===brlenweight (weight on branch-length parameter mutations)===<br />
brlenweight (0 to infinity, '''0.2''') The prior weight assigned to branch-length mutations.<br />
<br />
The same procedure used above to determine the proportion of Topology:Model:Branch-Length <br />
mutations is also used to determine the relative proportions of the three types of topological <br />
mutations (NNI:SPR:limSPR), controlled by the following three weights. Note that the <br />
proportion of mutations applied to each of the model parameters is not user controlled. <br />
===randnniweight (weight on NNI topology changes)===<br />
randnniweight (0 to infinity, '''0.1''') - The prior weight assigned to NNI mutations. <br />
===randsprweight (weight on SPR topology changes)===<br />
randsprweight (0 to infinity, '''0.3''') - The prior weight assigned to random SPR mutations. For <br />
very large datasets it is often best to set this to 0.0, as random SPR mutations essentially <br />
never result in score increases. <br />
===limsprweight (weight on localized SPR topology changes)===<br />
limsprweight (0 to infinity, '''0.6''') - The prior weight assigned to SPR mutations with the <br />
reconnection branch limited to being a maximum of limsprrange branches away from where <br />
the branch was detached. <br />
===intervallength===<br />
intervallength (10 to 1000, '''100''') – The number of generations in each interval during which the <br />
number and benefit of each mutation type are stored. <br />
===intervalstostore===<br />
intervalstostore = (1 to 10, '''5''') – The number of intervals to be stored. Thus, records of <br />
mutations are kept for the last (intervallength x intervalstostore) generations. Every <br />
intervallength generations the probabilities of the mutation types are updated by the scheme <br />
described above.<br />
<br />
==Settings controlling mutation details==<br />
===limsprrange (max range for localized SPR topology changes)===<br />
limsprrange (0 to infinity, '''6''') – The maximum number of branches away from its original <br />
location that a branch may be reattached during a limited SPR move. Setting this too high (> <br />
10) can seriously degrade performance, but if you do so in conjunction with a large increase in genthreshfortopoterm you might end up with better trees.<br />
===meanbrlenmuts (mean # of branch lengths to change per mutation)===<br />
meanbrlenmuts (1 to # taxa, '''5''') - The mean of the binomial distribution from which the number <br />
of branch lengths mutated is drawn during a branch length mutation. <br />
===gammashapebrlen (magnitude of branch-length mutations)===<br />
gammashapebrlen (50 to 2000, '''1000''') - The shape parameter of the gamma distribution (with a <br />
mean of 1.0) from which the branch-length multipliers are drawn for branch-length <br />
mutations. Larger numbers cause smaller changes in branch lengths. (Note that this has <br />
nothing to do with gamma rate heterogeneity.) <br />
===gammashapemodel (magnitude of model parameter mutations)===<br />
gammashapemodel (50 to 2000, '''1000''') - The shape parameter of the gamma distribution (with a <br />
mean of 1.0) from which the model mutation multipliers are drawn for model parameters <br />
mutations. Larger numbers cause smaller changes in model parameters. (Note that this has <br />
nothing to do with gamma rate heterogeneity.) <br />
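<br />
To get a feel for what the shape values mean: a gamma distribution with mean 1.0 and shape k has a standard deviation of 1/sqrt(k), so the default of 1000 gives multipliers that typically differ from 1.0 by about 3%. The Python sketch below draws such multipliers with the standard library; it illustrates the distribution only and is not how GARLI generates them.<br />

```python
import random
import statistics

def draw_multipliers(shape, n, seed=1):
    """Draw n multipliers from a gamma distribution with the given shape and mean 1.0."""
    rng = random.Random(seed)
    # gammavariate(alpha, beta) has mean alpha * beta, so beta = 1 / shape gives mean 1.0
    return [rng.gammavariate(shape, 1.0 / shape) for _ in range(n)]

for shape in (50, 1000, 2000):
    draws = draw_multipliers(shape, 10000)
    print(f"shape={shape:<5} mean={statistics.mean(draws):.3f} "
          f"sd={statistics.stdev(draws):.3f} expected sd={shape ** -0.5:.3f}")
```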
===uniqueswapbias (relative weight assigned to already attempted branch swaps)===<br />
uniqueswapbias (0.01 to 1.0, '''0.1''') – With version 0.95 and later, GARLI keeps track of which branch <br />
swaps it has attempted on the current best tree. Because swaps are applied randomly, it is <br />
possible that some swaps are tried twice before others are tried at all. This option allows the <br />
program to bias the swaps applied toward those that have not yet been attempted. Each swap <br />
is assigned a relative weight depending on the number of times that it has been attempted on <br />
the current best tree. This weight is equal to (uniqueswapbias) raised to the (# times swap <br />
attempted) power. In other words, a value of 0.5 means that swaps that have already been <br />
tried once will be half as likely as those not yet attempted, swaps attempted twice will be ¼ <br />
as likely, etc. A value of 1.0 means no biasing. If this value is not equal to 1.0 and the <br />
outputmostlyuselessfiles option is on, a file called <ofprefix>.swap.log is output. This file <br />
shows the total number of rearrangements tried and the number of unique ones over the course <br />
of a run. Note that this bias is only applied to NNI and limSPR rearrangements. Use of this <br />
option may allow the use of somewhat larger values of limsprrange. <br />
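<br />
The weighting rule is simple enough to write out. In this Python sketch the function names are hypothetical, and turning the relative weights into sampling probabilities by normalizing them is an assumption about how the weights are used:<br />

```python
def unique_swap_weight(uniqueswapbias, times_attempted):
    """Relative weight of a swap already attempted times_attempted times."""
    return uniqueswapbias ** times_attempted

def swap_probabilities(uniqueswapbias, attempts_per_swap):
    """Normalize the relative weights of the candidate swaps so they sum to 1."""
    weights = [unique_swap_weight(uniqueswapbias, n) for n in attempts_per_swap]
    total = sum(weights)
    return [w / total for w in weights]

# Three candidate swaps, attempted 0, 1 and 2 times, with uniqueswapbias = 0.5
print([unique_swap_weight(0.5, n) for n in (0, 1, 2)])
print(swap_probabilities(0.5, [0, 1, 2]))
```

The first line of output gives the relative weights 1.0, 0.5 and 0.25 described above.<br />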
===distanceswapbias (relative weight assigned to branch swaps based on locality)===<br />
distanceswapbias (0.1 to 10, '''1.0''') – This option is similar to uniqueswapbias, except that it <br />
biases toward certain swaps based on the topological distance between the initial and <br />
rearranged trees. The distance is measured as in limsprrange, and is half the <br />
Robinson-Foulds distance between the trees. As with uniqueswapbias, distanceswapbias <br />
assigns a relative weight to each potential swap. In this case the weight is <br />
(distanceswapbias) raised to the (reconnection distance - 1) power. Thus, given a value of <br />
0.5, the weight of an NNI is 1.0, the weight of an SPR with distance 2 is 0.5, with distance 3 <br />
is 0.25, etc. Note that values less than 1.0 bias toward more localized swaps, while values <br />
greater than 1.0 bias toward more extreme swaps. Also note that this bias is only applied to <br />
limSPR rearrangements. Be careful in setting this, as extreme values can have a very large <br />
effect.</div>Zwicklhttps://molevol.mbl.edu/index.php?title=Garli_wiki&diff=4372Garli wiki2015-07-21T15:35:09Z<p>Zwickl: </p>
<hr />
<div><br />
Base of the garli wiki<br />
<br />
[[Garli_FAQ]]<br />
[[GARLI_configuration_settings]]</div>Zwicklhttps://molevol.mbl.edu/index.php?title=Garli_using_partitioned_models&diff=4371Garli using partitioned models2015-07-21T15:30:58Z<p>Zwickl: Created page with "==Configuring a partitioned analysis== Not surprisingly, setting up partitioned models is more complicated than normal GARLI usage, since you need to tell the program how to d..."</p>
<hr />
<div>==Configuring a partitioned analysis==<br />
Not surprisingly, setting up partitioned models is more complicated than normal GARLI usage, since you need to tell the program how to divide your data and what models to apply. Deciding how to choose the models is also a complex issue.<br />
<br />
'''NOTE:''' If you are the kind of person who would rather just try your hand at running partitioned models without reading all of this first, you'll find example runs and template configuration files in the example/partition/ directory of any GARLI distribution. I'd still suggest reading this at some point to be sure that you understand your options for configuration.<br />
<br />
===Dividing up the data===<br />
Note that I use the technically correct (but often misused) definition of a partition. A partition is a scheme for dividing something up. It does the dividing, like a partition in a room. The individual chunks of data that are created by the partition are referred to as subsets.<br />
<br />
This version requires NEXUS formatted datafiles, and the partitioning is specified via standard NEXUS commands appearing in a sets or assumptions block in the same file as the data matrix. The setup of the actual models will come later. For a dataset with 2178 characters in a single data or characters block, it would look like this:<br />
<pre><br />
begin sets;<br />
charset ND2 = 1-726;<br />
charset rbcl = 727-1452;<br />
charset 16S = 1453-2178;<br />
charpartition byGene = chunk1:ND2, chunk2:rbcl, chunk3:16S; <br />
<br />
[you could also put character exclusions here by removing the []'s from the line below]<br />
[note that the excluded sites should still appear in the charpartition, however]<br />
[exset * myexclusions = 600-800, 850, 900-1000;]<br />
end;<br />
</pre><br />
<br />
The above block would divide up the sites into three sets of 726 characters each. <br />
<br />
To put charsets ND2 and rbcl in a single partition subset, the charpartition command would look like this<br />
<pre><br />
charpartition bySites = chunk1:ND2 rbcl, chunk2:16S; <br />
</pre><br />
Note the space rather than comma between ND2 and rbcl.<br />
<br />
<br />
The names are unimportant here. The general format is:<br />
<br />
charset <charset name> = <list or range of sites>;<br />
charset <charset name> = <list or range of sites>;<br />
charpartition <charpartition name> = (cont.)<br />
<partition subset 1 name>:<sites or charset making up 1st subset>, (cont.)<br />
<partition subset 2 name>:<sites or charset making up 2nd subset>, <etc>;<br />
<br />
To easily specify charsets that divide characters up by codon position, do this:<br />
charset 1stpos = 1-2178\3;<br />
charset 2ndpos = 2-2178\3;<br />
charset 3rdpos = 3-2178\3;<br />
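<br />
If the backslash notation is unfamiliar, this small Python sketch expands ranges written as start-end\step into explicit site numbers and checks that the three codon-position charsets cover the 2178-character example without overlapping. The helper is hypothetical and handles only this one range form, not the full NEXUS character list syntax.<br />

```python
def expand_charset(spec):
    r"""Expand 'start-end' or 'start-end\step' into a list of 1-based site numbers."""
    step = 1
    if "\\" in spec:
        spec, step_text = spec.split("\\")
        step = int(step_text)
    start_text, end_text = spec.split("-")
    return list(range(int(start_text), int(end_text) + 1, step))

positions = {
    "1stpos": expand_charset(r"1-2178\3"),
    "2ndpos": expand_charset(r"2-2178\3"),
    "3rdpos": expand_charset(r"3-2178\3"),
}

for name, sites in positions.items():
    print(name, len(sites), sites[:3], "...", sites[-1])

# Every site from 1 to 2178 should appear in exactly one charset
all_sites = sorted(site for sites in positions.values() for site in sites)
print(all_sites == list(range(1, 2179)))
```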
<br />
Note that if a charpartition appears, GARLI will AUTOMATICALLY apply a partitioned model. If you don't want that for some runs, remove or comment out (surround it with [ ]) the charpartition command.<br />
<br />
Also note that GARLI will automatically partition if it sees multiple characters blocks, so that is an alternative to using a charpartition.<br />
<br />
===Specifying the models===<br />
DO SOME SORT OF MODEL TESTING! The parameter estimates under partitioned models are currently somewhat erratic if the models are over-parameterized. Use ModelTest or some other means for finding the best model for each data subset. Note that the best model for each subset separately is not necessarily the best when they are combined in a partitioned model, but they will give a useful measure of which parameters are justified in each subset.<br />
<br />
As usual for GARLI, the models are specified in the configuration file. If you aren't familiar with the normal way that models are configured in GARLI, see the general info in the manual '''[[GARLI_Configuration_Settings#Model_specification_settings|here]]''', and FAQ entry '''[[FAQ#MODELTEST_told_me_to_use_model_X._How_do_I_set_that_up_in_GARLI.3F|here]]'''.<br />
<br />
There are two new configuration entries that relate to partitioned models:<br />
linkmodels = 0 or 1<br />
subsetspecificrates = 0 or 1<br />
<br />
'''linkmodels''' means to use a single set of model parameters for all subsets.<br />
<br />
'''subsetspecificrates''' means to infer overall rate multipliers for each data subset. This is equivalent to *prset ratepr=variable* in MrBayes<br />
<br />
So, there are various combinations here:<br />
{| border="1"<br />
|-<br />
!'''linkmodels'''!!'''subsetspecificrates'''!!'''meaning '''<br />
|-<br />
|align="center" | 0 || align="center" | 0 || different models, branch lengths equal<br />
|-<br />
|align="center" | 0 || align="center" | 1 || different models, different subset rates <br />
|-<br />
|align="center" | 1 || align="center" | 0 || single model, one set of branch lengths (equivalent to non-partitioned analysis)<br />
|-<br />
|align="center" | 1 || align="center" | 1 || single model, different subset rates (like site-specific rates model in PAUP*)<br />
|}<br />
<br />
The normal model configuration entries are the following, with the defaults in *bold*:<br />
<br />
datatype = '''nucleotide''', aminoacid, codon-aminoacid or codon<br />
ratematrix = '''6rate''', 2rate, 1rate, or other matrix spec. like this :( a, b, c, d, e, f )<br />
statefrequencies = '''estimate''', equal, empirical, (+others for aminoacids or codons. See manual)<br />
ratehetmodel = '''gamma''', none<br />
numratecats = '''4''', 1-20 (must be 1 if ratehetmodel = none, must be > 1 if ratehetmodel = gamma) <br />
invariantsites = '''estimate''', none<br />
<br />
If you leave these as is, set linkmodels = 0 and have a charpartition defined in the datafile, each subset will automatically be assigned a separate unlinked version of GTR+I+G. In that case there is nothing else to be done. You can start your run.<br />
<br />
If you want different models for each subset you need to add a set of model settings for each, with a specific heading name in []'s. The headings need to be [model1], [model2], etc., and are assigned to the subsets in order. The number of configuration sets must match the number of data subsets.<br />
<br />
For example, the following would assign the GTR+G and HKY models to the first and second data subsets.<br />
<pre><br />
[model1]<br />
datatype = nucleotide<br />
ratematrix = 6rate<br />
statefrequencies = estimate<br />
ratehetmodel = gamma<br />
numratecats = 4<br />
invariantsites = none<br />
<br />
[model2]<br />
datatype = nucleotide<br />
ratematrix = 2rate<br />
statefrequencies = estimate<br />
ratehetmodel = none<br />
numratecats = 1<br />
invariantsites = none<br />
</pre><br />
<br />
These should appear in place of the normal set of model configuration settings, and should be placed just before the [master] heading of the configuration file.<br />
<br />
If you're not sure how to specify a given model, see the FAQ entry '''[[FAQ#MODELTEST_told_me_to_use_model_X._How_do_I_set_that_up_in_GARLI.3F|here]]'''.<br />
<br />
NOTE THAT THE BRACKETED LINES BEFORE EACH MODEL DESCRIPTION ARE '''NOT''' COMMENTS, AND MUST APPEAR AS ABOVE WITH CONSECUTIVE MODEL NUMBERING STARTING AT ONE!<br />
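<br />
When there are many subsets, typing the blocks by hand makes it easy to skip or repeat a number. One way around that is to generate them; the Python sketch below is a hypothetical helper (not something shipped with GARLI) that writes consecutively numbered [modelN] headings, starting at one, from a list of settings. The setting names and values are the ones from the example above.<br />

```python
def model_blocks(models):
    """Format one [modelN] block per subset, numbered consecutively from 1."""
    blocks = []
    for number, settings in enumerate(models, start=1):
        lines = [f"[model{number}]"]
        lines += [f"{key} = {value}" for key, value in settings.items()]
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks)

gtr_g = {"datatype": "nucleotide", "ratematrix": "6rate",
         "statefrequencies": "estimate", "ratehetmodel": "gamma",
         "numratecats": 4, "invariantsites": "none"}
hky = {"datatype": "nucleotide", "ratematrix": "2rate",
       "statefrequencies": "estimate", "ratehetmodel": "none",
       "numratecats": 1, "invariantsites": "none"}

# One entry per data subset, in the same order as the subsets in the charpartition
config_text = model_blocks([gtr_g, hky])
print(config_text)
```

The printed text can be pasted into the configuration file in place of the normal model settings.<br />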
<br />
===Choosing partitioning schemes===<br />
====How many partition subsets can I use?====<br />
There is no hard limit that is enforced on the number of subsets that you '''''can''''' use. I've done more than 60 myself. See below for considerations about how many you '''''should''''' use.<br />
<br />
====What do I need to consider when choosing models and a partitioning scheme?====<br />
That is a difficult question, and one that is not well investigated in ML phylogenetics (as opposed to Bayesian phylogenetics). I definitely suggest that you do not partition overly finely, as the likelihood surface becomes difficult to optimize. Keep in mind that partitioning finely may create subsets with very few changes, and therefore little signal. This makes the parameter likelihood surfaces very flat and difficult to optimize. This is particularly true when bootstrapping, which can further reduce signal by creating some resampled datasets with even fewer variable sites.<br />
<br />
'''NOTE''' Do not assume that the partitioning scheme and model choice method that you use in MrBayes or another Bayesian method are appropriate in a maximum likelihood context! This is primarily because the Bayesian way to deal with nuisance parameters is to marginalize over them, meaning to account for the shape of the entire likelihood surface. This effectively integrates out uncertainty in parameter estimates, even if the likelihood surface is very flat and those estimates are very uncertain. <br />
<br />
In contrast, ML analyses seek to maximize the likelihood with respect to nuisance parameters. If the likelihood surface is very flat and suggests that a parameter value could nearly equally lie between a value of 1.0 and 3.0, an ML method will still return the value at the very peak of the distribution (let's say 2.329), even if it is only a fraction more likely than surrounding values. Another reason that Bayesian methods can be more appropriate for analyzing data with little signal is that prior distributions can be used to provide some outside information and keep parameter estimates reasonable. However, informative prior estimates are not used that often in standard phylogenetic analyses.<br />
<br />
====Ok, then how should I partition and choose models in practice?====<br />
'''NOTE:''' The following assumes that subset specific rates ARE being estimated e.g., subsetspecificrates = 1.<br />
<br />
Consider the following procedure that I've used, from a real dataset of 2 genes. It is fairly complicated, but I think it is as statistically rigorous as can be expected. We'll assume that the smallest subsets will be by codon position, thus there are 6 potential subsets. We'd like to find the most appropriate scheme to use for analyzing each gene separately, as well as both concatenated. But what model should be chosen for the subsets, and how should the subsets be constructed? <br />
<br />
1. First, construct all reasonable sets of sites for model testing. This amounts to:<br />
:* Each codon position of each gene individually (6)<br />
:* The full concatenated alignment (1)<br />
:* The full sequence of each gene (2)<br />
:* The first and second positions combined for each gene (2)<br />
:* The first, second and third positions pooled across the genes (3)<br />
Note that this list doesn't need to be exhaustive. I've omitted various combinations that I find unlikely ''a priori'', for example a combination of first and third positions.<br />
<br />
<br />
2. Now, use unpartitioned models and the AIC criterion with ModelTest or a similar procedure to choose the "best" model for each subset above. These will be the models applied to each of the subsets as we move to partitioned models. The following were the results in my case. If you aren't very familiar with the typical models, the information here: ('''[[GARLI_Configuration_Settings#Model_specification_settings|Model configuration]]''') and some of the items here: ('''[[FAQ#Model_choices| FAQ:Model choices]]''') may be helpful.<br />
<br />
<br />
{|border="1"<br />
|-bgcolor=grey<br />
! width="120" | Alignment<br />
! width="120" | Full<br />
! width="120" | 1st Pos<br />
! width="120" | 2nd Pos<br />
! width="120" | 3rd Pos<br />
! width="120" | 1st Pos + 2nd Pos<br />
|-<br />
| align="center" | Gene 1 || align="center" | GTRIG || align="center" | GTRIG || align="center" | TVMIG || align="center" | TVMIG || align="center" | TVMIG<br />
|-<br />
| align="center" | Gene 2 || align="center" | GTRIG || align="center" | GTRIG || align="center" | GTRIG || align="center" | TVMIG || align="center" | TVMIG<br />
|-<br />
| align="center" | Concat || align="center" | GTRIG || align="center" | GTRIG || align="center" | GTRIG || align="center" | TVMIG || align="center" | TVMIG<br />
|}<br />
<br />
<br />
3. Now, create partitioning schemes and configure GARLI partitioned analyses with each subset using the model chosen for it in the previous step. When choosing the scheme for Gene 1 analyses, the combinations would be each position separately, which I'll denote (1)(2)(3), the three positions together (i.e., unpartitioned or (123) ), and the first and second positions grouped, (12)(3). For the concatenated alignment there are many more combinations, not all of which need to necessarily be considered. The models assigned within subsets are from the above table, for example, when partitioning Gene 1 by codon position, the chosen models were GTRIG, TVMIG and TVMIG for the subsets of first, second and third positions respectively.<br />
<br />
Now run full GARLI analyses with four search reps (or at least two) for each of these partitioning schemes, and note the best log-likelihood across the search replicates for each partitioning scheme. Note that you could also do analyses giving GARLI a fixed tree to optimize on, which is actually the more common way that model choice is done. See [[FAQ#How_do_I_fully_constrain_.28i.e..2C_fix.29_the_tree_topology.3F | here]] for information on fixing the tree topology.<br />
<br />
<br />
4. Now, note the number of model parameters that were estimated in each partitioning scheme. For example, when partitioning gene 1 by codon position, the chosen models were GTRIG, TVMIG and TVMIG, which have 10, 9 and 9 free parameters respectively. (GTRIG has 10 parameters because 5 free relative rates + 3 free base frequencies + 1 invariable sites parameter + 1 alpha shape parameter of the gamma distribution = 10). In addition, subset specific rate multipliers were estimated, which adds (#subsets - 1) additional parameters, bringing the total to 30. Use the log-likelihood score and parameter count to calculate the AIC score for each partitioning scheme as <br />
AIC = 2 x (# parameters - lnL).<br />
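<br />
The parameter counting and AIC arithmetic in this step can be checked with a short Python sketch. The function name and argument layout are made up for illustration; the parameter counts and log-likelihoods are the Gene 1 values given in the text and the table below.<br />

```python
def aic_score(ln_likelihood, params_per_subset, subset_specific_rates=True):
    """AIC = 2 x (# parameters - lnL) for a partitioned model.

    params_per_subset lists the free model parameters of each subset,
    e.g. 10 for GTRIG and 9 for TVMIG.
    """
    n_params = sum(params_per_subset)
    if subset_specific_rates:
        # Rate multipliers add (#subsets - 1) free parameters
        n_params += len(params_per_subset) - 1
    return 2 * (n_params - ln_likelihood)

# Gene 1 partitioned by codon position: GTRIG, TVMIG, TVMIG -> 30 parameters
print(round(aic_score(-22698.82, [10, 9, 9]), 2))

# Gene 1 unpartitioned: a single GTRIG model -> 10 parameters
print(round(aic_score(-23524.92, [10]), 2))
```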
<br />
5. Finally, for each alignment compare the AIC scores of each partitioning scheme and choose the lowest value as the best. See the below table for my results. In this example (GTRIG)(TVMIG)(TVMIG) was chosen for the first gene and (GTRIG)(GTRIG)(TVMIG) was chosen for the second gene. The concatenated alignment result was less obvious ''a priori'', with 4 subsets. First positions shared a subset with GTRIG, second positions each had their own model with TVMIG for gene 1 and GTRIG for gene 2, and the third positions shared a subset with TVMIG. <br />
The AIC scores are summarized in the following table:<br />
<br />
{|border="1"<br />
|-bgcolor=grey<br />
! width="100" | Alignment<br />
! width="100" | Partition<br />
! width="100" | lnL<br />
! width="100" |Parameters<br />
! width="100" |AIC<br />
|-<br />
|align="center"| Gene 1 || align="center" | (1)(2)(3) || align="center" | -22698.82 || align="center" | 30 || align="center" | 45457.64<br />
|-<br />
|align="center"| || align="center" | (12)(3) || align="center" | -22817.84 || align="center" | 19 || align="center" | 45673.68<br />
|-<br />
|align="center"| || align="center" |(123) || align="center" | -23524.92 || align="center" | 10 || align="center" | 47069.84<br />
|-<br />
|align="center"| || align="center" | || align="center" | || align="center" | ||<br />
|-<br />
|align="center"| Gene 2 || align="center" | (1)(2)(3) || align="center" | -24537.93 || align="center" | 31 || align="center" | 49137.85<br />
|-<br />
|align="center"| || align="center" | (12)(3) || align="center" | -24639.11 || align="center" | 19 || align="center" | 49316.22<br />
|-<br />
|align="center"| || align="center" | (123) || align="center" | -25353.78 || align="center" | 10 || align="center" | 50727.55<br />
|-<br />
|align="center"| || align="center" | || align="center" | || align="center" | ||<br />
|-<br />
|align="center"|Concat || align="center" | (11)(2)(2)(33) || align="center" | -47384.54 || align="center" | 42 || align="center" | 94853.08<br />
|-<br />
|align="center"| || align="center" | (1)(2)(1)(2)(33) || align="center" | -47375.30 || align="center" | 53 || align="center" | 94856.60<br />
|-<br />
|align="center"| || align="center" | (1)(2)(3)(1)(2)(3) || align="center" | -47373.34 || align="center" | 62 || align="center" | 94870.69<br />
|-<br />
|align="center"| || align="center" | (11)(22)(33) || align="center" | -47408.21 || align="center" | 31 || align="center" | 94878.43<br />
|-<br />
|align="center"| || align="center" | (1212)(33) || align="center" | -47614.47 || align="center" | 19 || align="center" | 95266.94<br />
|-<br />
|align="center"| || align="center" | (123123) || align="center" | -49038.37 || align="center" | 10 || align="center" | 98096.75<br />
|-<br />
|align="center"| || align="center" | (123)(123) || align="center" | -49031.42 || align="center" | 21 || align="center" | 98104.83<br />
|}<br />
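<br />
Picking the best scheme is just a matter of sorting by AIC. As a small illustration, this Python sketch ranks the concatenated-alignment schemes using the AIC values from the table and reports how far each is from the best; the function name is hypothetical.<br />

```python
# AIC values for the concatenated alignment, copied from the table above
concat_aic = {
    "(11)(2)(2)(33)": 94853.08,
    "(1)(2)(1)(2)(33)": 94856.60,
    "(1)(2)(3)(1)(2)(3)": 94870.69,
    "(11)(22)(33)": 94878.43,
    "(1212)(33)": 95266.94,
    "(123123)": 98096.75,
    "(123)(123)": 98104.83,
}

def rank_by_aic(scores):
    """Return (scheme, AIC, difference from best AIC), lowest AIC first."""
    best = min(scores.values())
    ranked = sorted(scores.items(), key=lambda item: item[1])
    return [(scheme, aic, round(aic - best, 2)) for scheme, aic in ranked]

for scheme, aic, delta in rank_by_aic(concat_aic):
    print(f"{scheme:<20} {aic:>10.2f} {delta:>9.2f}")
```

Note how small the difference is between the top two schemes compared with the gap to the unpartitioned analysis.<br />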
<br />
====Further thoughts====<br />
The above procedure is somewhat complicated (I may write some scripts to somewhat automate it at some point). However, you can now be confident that you've evaluated and possibly chosen a better configuration than you might have if you had partitioned ''a priori''. Note that in the above example the best model for the concatenated alignment, (11)(2)(2)(33), has '''''20 fewer parameters''''' than the one that most people would have chosen ''a priori'', (1)(2)(3)(1)(2)(3).<br />
Some caveats:<br />
*This methodology and example only includes two genes! Many datasets nowadays have many more. Exhaustively going through the subsets and partitions as above will quickly get out of hand for five or ten genes. However, you can still work to find good partitioning schemes ''within'' each gene, as was done for the single genes in the example above. Specifically, after finding the best models for each possible subset of each gene via ModelTest or the like, comparison of partitioning schemes (1)(2)(3), (12)(3) and (123) can be done for each. I've found that the (12)(3) scheme can work well in cases in which there are only a few second position changes, making parameter optimization difficult when it is analyzed as an independent subset. You might also do something like the full procedure above on sets of genes that you know ''a priori'' are similar, for example mitochondrial genes, rRNA genes, etc.<br />
<br />
*This discussion ignores the ability to "link" parameters across subsets (which I don't yet have implemented in GARLI). Surely more parameters could be eliminated by linking. For example, all of the subsets may have similar base frequencies, and could share a single set of frequency parameters. If so, three parameters could be eliminated for each subset, which is a significant number.<br />
<br />
*Some may argue that this is a lot of work for unclear benefit. This might be true, although reducing the number of free parameters is thought to be statistically beneficial in a general sense, if not specifically for partitioned phylogenetic models. Doing full searches to find the best model and then using it for more searches may seem wasteful, but remember that the Bayesian analog to this AIC procedure is the comparison of Bayes factors, which also generally require full analyses using each of the competing models.<br />
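<br />
The AIC values in the table above can be reproduced from the lnL score and the parameter count of each scheme. The following is a minimal sketch, assuming the standard definition AIC = 2k - 2 lnL; the function name <code>aic</code> and the <code>schemes</code> list are only illustrative, and the results may differ from the table in the last digit because the lnL scores shown there are rounded.<br />
<br />
```python
def aic(lnl: float, n_params: int) -> float:
    """Akaike information criterion, 2k - 2 lnL (lower is better)."""
    return 2 * n_params - 2 * lnl


# (partitioning scheme, lnL, number of free parameters), copied from the table
schemes = [
    ("(11)(22)(33)", -47408.21, 31),
    ("(1212)(33)", -47614.47, 19),
    ("123123", -49038.37, 10),
    ("(123)(123)", -49031.42, 21),
]

# Rank the schemes from best (lowest AIC) to worst.
ranked = sorted(schemes, key=lambda s: aic(s[1], s[2]))
for name, lnl, k in ranked:
    print(f"{name:14s} lnL={lnl:.2f}  k={k:2d}  AIC={aic(lnl, k):.2f}")
```
<br />
Note that the unpartitioned scheme 123123 has far fewer parameters than (123)(123) and a slightly lower AIC despite its slightly worse lnL, which is exactly the trade-off that the AIC is meant to capture.<br />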
<br />
==Check the output==<br />
Once you've done a run, check the output in the .screen.log file to see that your data were divided and models assigned correctly. The output is currently very verbose. <br />
<br />
First the details of the data partitioning appear. Check that the total number of characters per subset looks correct. <br />
<pre><br />
GARLI partition subset 1<br />
CHARACTERS block #1 ("Untitled DATA Block 1")<br />
CHARPARTITION subset #1 ("1stpos")<br />
Data read as Nucleotide data, modeled as Nucleotide data<br />
<br />
Summary of data or data subset:<br />
11 sequences.<br />
441 constant characters.<br />
189 parsimony-informative characters.<br />
96 autapomorphic characters.<br />
726 total characters.<br />
238 unique patterns in compressed data matrix.<br />
<br />
GARLI partition subset 2<br />
CHARACTERS block #1 ("Untitled DATA Block 1")<br />
CHARPARTITION subset #2 ("2ndpos")<br />
Data read as Nucleotide data, modeled as Nucleotide data<br />
<br />
Summary of data or data subset:<br />
11 sequences.<br />
528 constant characters.<br />
103 parsimony-informative characters.<br />
95 autapomorphic characters.<br />
726 total characters.<br />
158 unique patterns in compressed data matrix.<br />
<br />
GARLI partition subset 3<br />
CHARACTERS block #1 ("Untitled DATA Block 1")<br />
CHARPARTITION subset #3 ("3rdpos")<br />
Data read as Nucleotide data, modeled as Nucleotide data<br />
<br />
Summary of data or data subset:<br />
11 sequences.<br />
103 constant characters.<br />
539 parsimony-informative characters.<br />
84 autapomorphic characters.<br />
726 total characters.<br />
549 unique patterns in compressed data matrix.<br />
</pre><br />
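One quick sanity check on these summaries is that the constant, parsimony-informative and autapomorphic counts add up to the reported total for every subset. The sketch below does this with a regular expression; it assumes the log wording is exactly as shown above, and the names <code>SUMMARY_PATTERN</code> and <code>check_subset_totals</code> are hypothetical helpers, not part of GARLI.<br />
<br />
```python
import re

# Matches one "Summary of data or data subset" block as printed in the log.
SUMMARY_PATTERN = re.compile(
    r"(\d+) constant characters\.\s*"
    r"(\d+) parsimony-informative characters\.\s*"
    r"(\d+) autapomorphic characters\.\s*"
    r"(\d+) total characters\."
)


def check_subset_totals(log_text):
    """Return one (sum_of_classes, reported_total) pair per subset summary."""
    pairs = []
    for match in SUMMARY_PATTERN.finditer(log_text):
        constant, informative, autapomorphic, total = map(int, match.groups())
        pairs.append((constant + informative + autapomorphic, total))
    return pairs


# Counts copied from the three subset summaries shown above.
sample_log = """
  441 constant characters.
  189 parsimony-informative characters.
  96 autapomorphic characters.
  726 total characters.

  528 constant characters.
  103 parsimony-informative characters.
  95 autapomorphic characters.
  726 total characters.

  103 constant characters.
  539 parsimony-informative characters.
  84 autapomorphic characters.
  726 total characters.
"""

for summed, total in check_subset_totals(sample_log):
    print(summed, total, "OK" if summed == total else "MISMATCH")
```
<br />
To check a real run, pass the contents of your own .screen.log file to <code>check_subset_totals</code> instead of <code>sample_log</code>.<br />
<br />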
Then a description of the models and model parameters assigned to each subset appears. Parameters are at their initial values. This indicates three models with GTR+G for each:<br />
<pre><br />
MODEL REPORT - Parameters are at their INITIAL values (not yet optimized)<br />
Model 1<br />
Number of states = 4 (nucleotide data)<br />
Nucleotide Relative Rate Matrix: 6 rates <br />
AC = 1.000, AG = 4.000, AT = 1.000, CG = 1.000, CT = 4.000, GT = 1.000<br />
Equilibrium State Frequencies: estimated<br />
(ACGT) 0.3157 0.1746 0.3004 0.2093 <br />
Rate Heterogeneity Model:<br />
4 discrete gamma distributed rate categories, alpha param estimated<br />
0.5000<br />
Substitution rate categories under this model:<br />
rate proportion<br />
0.0334 0.2500<br />
0.2519 0.2500<br />
0.8203 0.2500<br />
2.8944 0.2500<br />
<br />
Model 2<br />
Number of states = 4 (nucleotide data)<br />
Nucleotide Relative Rate Matrix: 6 rates <br />
AC = 1.000, AG = 4.000, AT = 1.000, CG = 1.000, CT = 4.000, GT = 1.000<br />
Equilibrium State Frequencies: estimated<br />
(ACGT) 0.2703 0.1566 0.1628 0.4103 <br />
Rate Heterogeneity Model:<br />
4 discrete gamma distributed rate categories, alpha param estimated<br />
0.5000<br />
Substitution rate categories under this model:<br />
rate proportion<br />
0.0334 0.2500<br />
0.2519 0.2500<br />
0.8203 0.2500<br />
2.8944 0.2500<br />
<br />
Model 3<br />
Number of states = 4 (nucleotide data)<br />
Nucleotide Relative Rate Matrix: 6 rates <br />
AC = 1.000, AG = 4.000, AT = 1.000, CG = 1.000, CT = 4.000, GT = 1.000<br />
Equilibrium State Frequencies: estimated<br />
(ACGT) 0.1460 0.3609 0.2915 0.2015 <br />
Rate Heterogeneity Model:<br />
4 discrete gamma distributed rate categories, alpha param estimated<br />
0.5000<br />
Substitution rate categories under this model:<br />
rate proportion<br />
0.0334 0.2500<br />
0.2519 0.2500<br />
0.8203 0.2500<br />
2.8944 0.2500<br />
<br />
Subset rate multipliers:<br />
1.00 1.00 1.00<br />
</pre><br />
<br />
If you set up multiple search replicates in the configuration file (which you should!), a summary of the model parameters estimated during each replicate is displayed. All of the parameter values and (hopefully) the likelihoods should be fairly close for the different replicates.<br />
<br />
<pre><br />
Completed 5 replicate runs (of 5).<br />
Results:<br />
Replicate 1 : -13317.4777 (best)<br />
Replicate 2 : -13317.4813 (same topology as 1)<br />
Replicate 3 : -13317.4839 (same topology as 1)<br />
Replicate 4 : -13317.4863 (same topology as 1)<br />
Replicate 5 : -13317.4781 (same topology as 1)<br />
<br />
Parameter estimates:<br />
<br />
Partition subset 1:<br />
r(AC) r(AG) r(AT) r(CG) r(CT) r(GT) pi(A) pi(C) pi(G) pi(T) alpha <br />
rep 1: 1.97 2.58 1.42 1.41 3.72 1.00 0.310 0.177 0.297 0.216 0.411 <br />
rep 2: 1.96 2.58 1.41 1.41 3.71 1.00 0.310 0.177 0.296 0.216 0.409 <br />
rep 3: 1.97 2.58 1.42 1.40 3.71 1.00 0.309 0.177 0.298 0.216 0.411 <br />
rep 4: 1.96 2.57 1.42 1.40 3.72 1.00 0.310 0.177 0.297 0.216 0.411 <br />
rep 5: 1.96 2.57 1.41 1.40 3.71 1.00 0.310 0.177 0.297 0.216 0.409 <br />
<br />
Partition subset 2:<br />
r(AC) r(AG) r(AT) r(CG) r(CT) r(GT) pi(A) pi(C) pi(G) pi(T) alpha <br />
rep 1: 4.32 7.05 1.60 7.05 4.37 1.00 0.269 0.164 0.160 0.407 0.361 <br />
rep 2: 4.32 7.03 1.59 7.08 4.37 1.00 0.270 0.163 0.160 0.407 0.361 <br />
rep 3: 4.33 7.08 1.60 7.07 4.37 1.00 0.269 0.164 0.160 0.406 0.361 <br />
rep 4: 4.34 7.09 1.61 7.10 4.40 1.00 0.269 0.164 0.160 0.407 0.361 <br />
rep 5: 4.35 7.08 1.60 7.11 4.39 1.00 0.269 0.163 0.160 0.407 0.360 <br />
<br />
Partition subset 3:<br />
r(AC) r(AG) r(AT) r(CG) r(CT) r(GT) pi(A) pi(C) pi(G) pi(T) alpha <br />
rep 1: 1.06 5.26 3.55 0.45 4.98 1.00 0.154 0.356 0.287 0.203 2.988 <br />
rep 2: 1.06 5.26 3.56 0.45 4.98 1.00 0.154 0.356 0.287 0.203 2.995 <br />
rep 3: 1.06 5.26 3.57 0.46 5.00 1.00 0.154 0.355 0.287 0.204 3.008 <br />
rep 4: 1.07 5.31 3.57 0.46 5.00 1.00 0.154 0.356 0.286 0.204 2.991 <br />
rep 5: 1.06 5.25 3.56 0.45 5.00 1.00 0.154 0.356 0.287 0.203 2.978 <br />
<br />
Subset rate multipliers:<br />
rep 1: 0.538 0.298 2.164 <br />
rep 2: 0.538 0.299 2.164 <br />
rep 3: 0.539 0.300 2.162 <br />
rep 4: 0.539 0.298 2.163 <br />
rep 5: 0.537 0.299 2.164 <br />
<br />
Final result of the best scoring rep (#1) stored in GTRG.byCodonPos.best.tre<br />
Final results of all reps stored in GTRG.byCodonPos.best.all.tre<br />
</pre><br />
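<br />
To judge how close the replicates are, it can help to pull the scores out of the log and look at their spread. This is a small sketch that assumes the "Replicate N : score" line format shown above; <code>replicate_scores</code> is a hypothetical helper name.<br />
<br />
```python
import re


def replicate_scores(log_text):
    """Extract the lnL score of each search replicate from the results summary."""
    pattern = re.compile(r"Replicate\s+\d+\s*:\s*(-?\d+\.\d+)")
    return [float(m.group(1)) for m in pattern.finditer(log_text)]


# Lines copied from the replicate summary shown above.
sample_results = """
Replicate 1 : -13317.4777 (best)
Replicate 2 : -13317.4813 (same topology as 1)
Replicate 3 : -13317.4839 (same topology as 1)
Replicate 4 : -13317.4863 (same topology as 1)
Replicate 5 : -13317.4781 (same topology as 1)
"""

scores = replicate_scores(sample_results)
spread = max(scores) - min(scores)
print(f"{len(scores)} replicates, best lnL {max(scores):.4f}, spread {spread:.4f}")
```
<br />
Here the five replicates differ by less than 0.01 lnL units and share a topology, which suggests that the search is behaving consistently on this dataset.<br />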
<br />
==The sample runs==<br />
In the example/partition/exampleRuns directory included with GARLI distributions, I've included the configuration and output files from two runs with a very small 11 taxon x 2178 character dataset. It is partitioned by codon position. Not surprisingly, the tree is found almost immediately, and the rest of the time is spent optimizing the models and branch lengths.<br />
<br />
The first run (in 3parts.sameModelType) is an example of using a single model type for all subsets, in this case GTR+G.<br />
<br />
The second run (in 3parts.diffModelTypes) is an example of using a different model for each subset. In this case the 3 models aren't even any of the normal named ones (JC, K2P, HKY, GTR, etc). I combined some of the rate matrix parameters that looked very similar in the first run, and added a proportion of invariant sites to the third subset.<br />
<br />
Although this doesn't have anything to do with partitioning, I'll mention that the way of specifying any arbitrary restriction of the GTR model in GARLI is similar to that used in PAUP. Parameters that are shared have the same #. The parameters are in the order of A-C, A-G, A-T, C-G, C-T and G-T. For example, the HKY or K2P models give one rate to transitions (A-G and C-T) and another to transversions:<br />
<pre>ratematrix = ( 0 1 0 0 1 0 )</pre><br />
The GTR model looks like this:<br />
<pre>ratematrix = ( 0 1 2 3 4 5 )</pre><br />
Note that these two particular configurations are equivalent to '''ratematrix = 2rate''' and '''ratematrix = 6rate''', respectively.<br />
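<br />
To see which exchange rates a given specification ties together, the index string can be expanded against the A-C, A-G, A-T, C-G, C-T, G-T order described above. The following is only an illustrative sketch (the names <code>RATE_ORDER</code> and <code>parse_ratematrix</code> are made up here and this is not how GARLI itself parses the entry):<br />
<br />
```python
RATE_ORDER = ("AC", "AG", "AT", "CG", "CT", "GT")


def parse_ratematrix(spec):
    """Group the six exchange rates by the index they share in the spec string."""
    indices = spec.strip().strip("()").split()
    if len(indices) != len(RATE_ORDER):
        raise ValueError(f"expected 6 indices, got {len(indices)}")
    groups = {}
    for rate, index in zip(RATE_ORDER, indices):
        groups.setdefault(index, []).append(rate)
    return groups


hky_like = parse_ratematrix("( 0 1 0 0 1 0 )")
gtr = parse_ratematrix("( 0 1 2 3 4 5 )")

for label, groups in (("2-rate", hky_like), ("6-rate", gtr)):
    print(label, "->", len(groups), "rate classes:", groups)
```
<br />
For the first specification this groups the four transversions under index 0 and the two transitions under index 1; for the second, every rate gets its own class.<br />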
<br />
==Site-likelihood output==<br />
If you'd like to get site-wise likelihoods, for example to input into CONSEL, add<br />
outputsitelikelihoods = 1<br />
to the top part of your config file. This will create a <ofprefix>.sitelikes.log file that has the site-likelihoods for each replicate concatenated one after another. This file is in exactly the same format as PAUP site-likelihood output, so can go directly into CONSEL. Note that CONSEL only allows a single period (".") in the site-likelihood file name (i.e., myrun.txt, not my.run.txt), so you may need to rename the files.<br />
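<br />
Because of the single-period restriction, a file such as GTRG.byCodonPos.sitelikes.log needs a new name before it goes into CONSEL. One possible renaming rule, shown here as a sketch with a hypothetical helper <code>consel_safe_name</code>, is to replace every period except the last with an underscore:<br />
<br />
```python
def consel_safe_name(filename):
    """Replace every period except the last one with an underscore."""
    stem, dot, extension = filename.rpartition(".")
    if not dot:
        # No period at all, so nothing needs to change.
        return filename
    return stem.replace(".", "_") + "." + extension


for name in ("GTRG.byCodonPos.sitelikes.log", "myrun.txt"):
    print(name, "->", consel_safe_name(name))
```
<br />
The actual renaming (or copying) of the file is left to you, for example with <code>os.rename</code>.<br />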
<br />
If you want the sitelikes for a particular tree, you'd need to do this:<br />
*specify the tree(s) in a starting tree file.<br />
*specify your chosen model normally<br />
*add '''[[GARLI_Configuration_Settings#optimizeinputonly_.28do_not_search.2C_only_optimize_model_and_branch_lengths_on_user_trees.29 | optimizeinputonly]] = 1''' to the [general] section of your configuration file.<br />
*run the program normally (this will take a while because it will need to optimize the model parameters)<br />
<br />
==Running it yourself==<br />
Use the provided configuration files as templates. I've provided some appropriate config files for smaller datasets ( < about 50 taxa) and for larger ones. Note that there are some important changes from the default values that appear in previous versions of the program. These are important to ensure that the more complex partitioned models are properly optimized.<br />
<br />
You should set the '''modweight''' entry to at least 0.0005 x (#subsets + 1).<br />
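<br />
As a worked example of that rule of thumb, the snippet below computes the suggested minimum for a few subset counts. The function name <code>min_modweight</code> is only for illustration.<br />
<br />
```python
def min_modweight(n_subsets):
    """Suggested lower bound for modweight: 0.0005 x (#subsets + 1)."""
    return 0.0005 * (n_subsets + 1)


# Three subsets corresponds to the codon-position example used on this page.
for n in (1, 3, 6):
    print(f"{n} subsets -> modweight >= {min_modweight(n):.4f}")
```
<br />
So for the three codon-position subsets used in the sample runs, '''modweight''' should be at least 0.002.<br />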
<br />
Other than that and entering your dataset name on the '''datafname''' line, you should be able to run with the default values. If you know your way around the Garli settings, feel free to tinker.<br />
<br />
As always, you can start the program from the command line with either <br />
<pre>./executableName configFileName</pre><br />
if it is in the same directory with your dataset and config file, or just<br />
<pre>executableName configFileName</pre><br />
if it is in your path.<br />
<br />
If your config file is named garli.conf, you don't need to pass it on the command line.</div>Zwicklhttps://molevol.mbl.edu/index.php?title=GARLI_Configuration_Settings&diff=4370GARLI Configuration Settings2015-07-21T15:23:43Z<p>Zwickl: Created page with "criptions of GARLI configuration settings== The format for these configuration settings descriptions is generally: '''entryname''' (possible values, '''default value in bold'..."</p>
<hr />
<div>==Descriptions of GARLI configuration settings==<br />
The format for these configuration settings descriptions is generally:<br />
'''entryname''' (possible values, '''default value in bold''') – description<br />
<br />
==General settings==<br />
===datafname (file containing sequence dataset)=== <br />
'''datafname''' = (filename) –<br />
Name of the file containing the aligned sequence data. Formats accepted are <br />
PHYLIP, NEXUS and FASTA. Robust reading of Nexus formatted <br />
datasets is done using the Nexus Class Library. This accommodates things such as interleaved <br />
alignments and exclusion sets (exset’s) in NEXUS assumptions blocks (see the FAQ for an <br />
example of [[FAQ#Can_I_specify_alignment_columns_of_my_data_matrix_to_be_excluded.3F | exset usage]]). Use of NEXUS files is recommended, and is required for partitioned models.<br />
<br />
===constraintfile (file containing constraint definition)===<br />
'''constraintfile''' = (filename, '''none''') –<br />
Name of the file containing any topology constraint specifications, or “none” if <br />
there are no constraints. The easiest way to explain the format of the constraint file is by <br />
example. Consider a dataset of 8 taxa, in which your constraint consists of grouping taxa 1, 3 <br />
and 5. You may specify either positive constraints (inferred tree MUST contain constrained <br />
group) or negative constraints (also called converse constraints, inferred tree CANNOT <br />
contain constrained group). These are specified with either a ‘+’ or a ‘-’ at the beginning of <br />
the constraint specification, for positive and negative constraints, respectively. <br />
*For a positive constraint on a grouping of taxa 1, 3 and 5: <br />
+((1,3,5),2,4,6,7,8); <br />
*For a negative constraint on a grouping of taxa 1, 3 and 5: <br />
-((1,3,5),2,4,6,7,8); <br />
*Note that there are many other equivalent parenthetical representations of these constraints. <br />
*Multiple groups may be positively constrained, but currently only a single negatively constrained group is allowed. <br />
*Multiple constrained groupings may be specified in a single string: <br />
+((1,3,5),2,4,(6,7),8); <br />
or in two separate strings on successive lines: <br />
+((1,3,5),2,4,6,7,8); <br />
+(1,3,5,2,4,(6,7),8); <br />
*Constraint strings may also be specified in terms of taxon names (matching those used in the alignment) instead of numbers. <br />
*Positive and negative constraints cannot be mixed. <br />
*GARLI also accepts another constraint format that may be easier to use in some cases. This involves specifying a single branch to be constrained with a string of ‘*’ (asterisk) and ‘.’ (period) characters, with one character per taxon. Each taxon specified with a ‘*’ falls on one side of the constrained branch, and all those specified with a ‘.’ fall on the other. This should be familiar to anyone who has looked at PAUP* bootstrap output. With this format, a positive constraint on a grouping of taxa 1, 3 and 5 would look like this: <br />
 +*.*.*... <br />
or equivalently like this: <br />
+.*.*.*** <br />
With this format each line only designates a single branch, so multiple constrained branches must be specified as multiple lines in the file. <br />
*The program also allows “backbone” constraints, which are simply constraints that do not include all of the taxa. For example if one wanted to constrain (1,3,5) but didn’t care where taxon 6 appeared in the tree, this string could be used: <br />
+((1,3,5),2,4,7,8); <br />
:Nothing special needs to be done to identify this as a backbone constraint, simply leave out some taxa.<br />
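<br />
The asterisk/period format described above is easy to generate from a list of taxon numbers. Below is a minimal sketch for the 8-taxon example; <code>bipartition_string</code> and <code>complement_string</code> are hypothetical helper names, and taxa are assumed to be numbered from 1 in alignment order.<br />
<br />
```python
def bipartition_string(group, n_taxa, positive=True):
    """Build a one-branch constraint line: '*' for taxa in the group, '.' otherwise."""
    sign = "+" if positive else "-"
    marks = "".join("*" if taxon in group else "." for taxon in range(1, n_taxa + 1))
    return sign + marks


def complement_string(constraint):
    """Swap '*' and '.', which describes the same branch from the other side."""
    swap = {"*": ".", ".": "*"}
    return constraint[0] + "".join(swap[c] for c in constraint[1:])


line = bipartition_string({1, 3, 5}, n_taxa=8)
print(line)
print(complement_string(line))
```
<br />
The two printed lines are the two equivalent forms of the positive constraint on taxa 1, 3 and 5 given above.<br />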
<br />
===streefname (source of starting tree and/or model)===<br />
'''streefname''' = (random, <u>'''stepwise'''</u>, <filename>) – Specifies where the starting tree topology and/or <br />
model parameters will come from. The tree topology may be a completely random topology <br />
(constraints will be enforced), a tree provided by the user in a file, <u>or a tree generated by the <br />
program using a fast ML stepwise-addition algorithm (see [[GARLI_Configuration_Settings#attachmentspertaxon_.28control_creation_of_stepwise_addition_starting_tree.29|attachmentspertaxon]] below)</u>. <br />
Starting or fixed model parameter values may also be provided in the specified file, with or <br />
without a tree topology. Some notes on starting trees/models: <br />
*Specified starting trees may have polytomies, which will be arbitrarily resolved before the run begins.<br />
*Starting tree formats: <br />
**Plain newick tree string (with taxon numbers or names, with or without branch lengths) <br />
**NEXUS trees block. The trees block can appear in the same file as a NEXUS data or characters block that contains the alignment, although the same filename should then be specified on both the datafname and streefname lines.<br />
*If multiple trees appear in the specified file and multiple search replicates are specified (see '''[[GARLI_Configuration_Settings#searchreps_.28number_of_independent_search_replicates.29|searchreps]]''' setting), then the first tree is used in the first replicate, the second in the second replicate, etc. <br />
*Providing model parameter values: see this page '''[[Specifying_model_parameter_values | Specifying model parameter values]]'''<br />
*See also the FAQ items on model parameters '''[[FAQ#Model_parameters | here]]'''.<br />
<br />
===attachmentspertaxon (control creation of stepwise addition starting tree)===<br />
'''attachmentspertaxon''' = (1 to infinity, '''50''') – The number of attachment branches evaluated for each taxon to be added to the tree during the creation of an ML stepwise-addition starting tree. Briefly, stepwise addition is an algorithm used to make a tree, and involves adding taxa in a random order to a growing tree. For each taxon to be added, a number of randomly chosen attachment branches are tried and scored, and then the best scoring one is chosen as the location of that taxon. The attachmentspertaxon setting controls how many attachment points are evaluated for each taxon to be added. A value of one is equivalent to a completely random tree (only one randomly chosen location is evaluated). A value of greater than 2 times the number of taxa in the dataset means that all attachment points will be evaluated for each taxon, and will result in very good starting trees (but may take a while on large datasets). Even fairly small values (< 10) can result in starting trees that are much, much better than random, but still fairly different from one another. <br />
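<br />
The "greater than 2 times the number of taxa" rule follows from counting branches: a fully resolved unrooted tree with m taxa has 2m - 3 branches, so the number of possible attachment points is always below twice the number of taxa. The sketch below just does that arithmetic; the function names are illustrative and it says nothing about how GARLI chooses among the branches.<br />
<br />
```python
def branches_in_unrooted_tree(n_taxa):
    """Number of branches in a fully resolved unrooted tree with n_taxa tips."""
    if n_taxa < 2:
        raise ValueError("need at least two taxa")
    return 2 * n_taxa - 3


def max_attachment_points(n_taxa_total):
    """Branches available when the last taxon is added to a tree of n - 1 taxa."""
    return branches_in_unrooted_tree(n_taxa_total - 1)


for n in (11, 50, 200):
    print(f"{n} taxa: at most {max_attachment_points(n)} attachment points (2n = {2 * n})")
```
<br />
For the 11-taxon sample dataset there are never more than 17 candidate branches, so any '''attachmentspertaxon''' value above 22 evaluates all of them.<br />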
===ofprefix (output filename prefix)===<br />
'''ofprefix''' = (text) – Prefix of all output filenames, such as log, treelog, etc. Change this for each run that you do or the program will overwrite previous results. <br />
===randseed (random number seed)===<br />
'''randseed''' = (-1 or positive integers, '''-1''') – The random number seed used by the random number generator. Specify “–1” to have a seed chosen for you. Specifying the same seed number in multiple runs will give exactly identical results, if all other parameters and settings are also identical. <br />
===availablememory (control maximum program memory usage)===<br />
'''availablememory''' – Typically this is the amount of available physical memory on the system, in megabytes. This lets GARLI determine how much system memory it may be able to use to store computations for reuse. The program will be conservative and use at most about 90% of this value, although for typical datasets it will use much, much less. If other programs must be open or used when GARLI is running, you may need to reduce this value. This setting can also have a significant effect on performance (speed), but more is not always better. When a run is started, GARLI will output the '''availablememory''' value necessary for that dataset to achieve each of the “memory levels” (from best to worst, termed “great”, “good”, “low”, and “very low”). More memory is generally better because more calculations are stored and can be reused, but when the amount of memory needed for “great” becomes more than 512 megabytes or so, performance can be slowed because the operating system has difficulty managing that much memory. In general, choose an amount of memory that allows “great” when this is less than 512 MB, and if it is greater reduce the amount of memory into “good” or “low” as necessary. Avoid “very low” whenever possible. You can find the value that is approximately optimal for your dataset by setting '''randseed''' to some positive value (so that the searches are identical) and doing runs with various '''availablememory''' values. Typically it will only make at most a 30% difference, so it isn't worth worrying about too much. <br />
===logevery (frequency to log best score to file)===<br />
'''logevery''' = (1 to infinity, '''10''') – The frequency at which the best score is written to the log file. <br />
===saveevery (frequency to save best tree to file or write checkpoints)===<br />
'''saveevery''' = (1 to infinity, '''100''') – If '''writecheckpoints''' or '''outputcurrentbesttopology''' are specified, this is the frequency (in generations) at which checkpoints or the current best tree are written to file. <br />
===refinestart (whether to optimize a bit before starting a search)===<br />
'''refinestart''' = (0 or 1, '''1''') – Specifies whether some initial rough optimization is performed on the starting branch lengths and rate heterogeneity parameters. This is always recommended. <br />
===outputcurrentbesttopology (continuously write the best tree to file during a run)===<br />
'''outputcurrentbesttopology''' = (0 or 1, '''0''') – If true, the current best tree of the current search replicate is written to <ofprefix>.best.current.tre every '''saveevery''' generations. In versions before 0.96 the current best topology was always written to file, but that is no longer the case. Seeing the current best tree has no real use apart from satisfying your curiosity about how a <br />
run is going.<br />
===outputeachbettertopology (write each improved topology to file)===<br />
'''outputeachbettertopology''' = (0 or 1, '''0''') – If true, each new topology encountered with a better score than the previous best is written to file. In some cases this can result in really big files (hundreds of MB) though, especially for random starting topologies on large datasets. Note that this file is interesting to get an idea of how the topology changed as the searches progressed, but the collection of trees should NOT be interpreted in any meaningful way. This option is not available while bootstrapping. <br />
===enforcetermconditions (use automatic termination)===<br />
'''enforcetermconditions''' = (0 or 1, '''1''') – Specifies whether the automatic termination conditions will be used. The conditions specified by both of the following two parameters must be met. See the following two parameters for their definitions. If this is false, the run will continue until it reaches the time ('''stoptime''') or generation ('''stopgen''') limit. It is highly recommended that this option be used! <br />
===genthreshfortopoterm (number of generations without topology improvement required for termination) ===<br />
'''genthreshfortopoterm''' = (1 to infinity, '''20,000''') – This specifies the first part of the termination condition. When no new significantly better scoring topology (see significanttopochange below) has been encountered in greater than this number of generations, this condition is met. Increasing this parameter may improve the lnL scores obtained (especially on large datasets), but will also increase runtimes. <br />
===scorethreshforterm (max score improvement over recent generations required for termination)===<br />
'''scorethreshforterm''' = (0 to infinity, '''0.05''') – The second part of the termination condition. When the total improvement in score over the last '''intervallength''' x '''intervalstostore''' generations (default is 500 generations, see below) is less than this value, this condition is met. This does not usually need to be changed. <br />
===significanttopochange (required score improvement for topology to be considered better)===<br />
'''significanttopochange''' = (0 to infinity, '''0.01''') – The lnL increase required for a new topology to be <br />
considered significant as far as the termination condition is concerned. It probably doesn’t <br />
need to be played with, but you might try increasing it slightly if your runs reach a stable <br />
score and then take a very long time to terminate due to very minor changes in topology. <br />
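<br />
Taken together, '''genthreshfortopoterm''' and '''scorethreshforterm''' define a two-part stopping rule in which both parts must hold. The sketch below restates that logic as described above using the default values; it is not GARLI's source code, and <code>TerminationSettings</code> and <code>should_terminate</code> are hypothetical names.<br />
<br />
```python
from dataclasses import dataclass


@dataclass
class TerminationSettings:
    genthreshfortopoterm: int = 20000   # generations without a significant topology change
    scorethreshforterm: float = 0.05    # max lnL gain allowed over the recent window


def should_terminate(gens_since_topo_improvement, recent_score_improvement, settings):
    """Both termination conditions must be met for the run to stop."""
    topology_stable = gens_since_topo_improvement > settings.genthreshfortopoterm
    score_stable = recent_score_improvement < settings.scorethreshforterm
    return topology_stable and score_stable


defaults = TerminationSettings()
print(should_terminate(25000, 0.01, defaults))  # both conditions met
print(should_terminate(25000, 0.20, defaults))  # score still improving
print(should_terminate(5000, 0.01, defaults))   # topology changed too recently
```
<br />
Here <code>recent_score_improvement</code> stands for the total lnL improvement over the last '''intervallength''' x '''intervalstostore''' generations, and a topology change only resets the generation count if it improves the score by at least '''significanttopochange'''.<br />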
===outputphyliptree (write trees to file in Phylip as well as Nexus format)===<br />
'''outputphyliptree''' = (0 or 1, '''0''') – Whether Phylip-formatted tree files will be output in addition to <br />
the default Nexus files for the best tree across all replicates (<ofprefix>.best.phy), the best <br />
tree for each replicate (<ofprefix>.best.all.phy) or, in the case of bootstrapping, the best tree <br />
for each bootstrap replicate (<ofprefix>.boot.phy). <br />
===outputmostlyuselessfiles (output uninteresting files)===<br />
'''outputmostlyuselessfiles''' = (0 or 1, '''0''') – Whether to output three files of little general interest: the <br />
“fate”, “problog” and “swaplog” files. The fate file shows the parentage, mutation types and <br />
scores of every individual in the population during the entire search. The problog shows how <br />
the proportions of the different mutation types changed over the course of the run. The <br />
swaplog shows the number of unique swaps and the number of total swaps on the current <br />
best tree over the course of the run. <br />
===writecheckpoints (write checkpoint files during run)===<br />
'''writecheckpoints''' = (0 or 1, '''0''') – Whether to write three files to disk containing all information <br />
about the current state of the population every saveevery generations, with each successive <br />
checkpoint overwriting the previous one. These files can be used to restart a run at the last <br />
written checkpoint by setting the '''restart''' configuration entry.<br />
===restart (restart run from checkpoint)===<br />
'''restart''' = (0 or 1, '''0''') – Whether to restart at a previously saved checkpoint. To use this option the '''writecheckpoints''' option must have been used during a previous run. The program will look for checkpoint files that are named based on the ofprefix of the previous run. If you intend to restart a run, NOTHING should be changed in the config file except setting '''restart''' to 1. <br />
A run that is restarted from checkpoint will give ''exactly'' the same results it would have if the run had gone to completion. <br />
===outgroup (orient inferred trees consistently)===<br />
'''outgroup''' = (outgroup taxon numbers, separated by spaces) – This option allows for orienting the tree topologies in a consistent way when they are written to file. Note that this has NO effect whatsoever on the actual inference and the specified outgroup is NOT constrained to be present in the inferred trees. If multiple outgroup taxa are specified and they do not form a monophyletic group in the inferred tree, this setting will be ignored. If you specify a single outgroup taxon it will always be present, and the tree will always be consistently oriented. Ranges can be indicated with a hyphen, e.g., to specify an outgroup consisting of taxa 1, 2, 3 and 5 the format is this: <br />
outgroup = 1-3 5<br />
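<br />
If you want to double-check which taxa a range specification covers, it can be expanded as in the sketch below. This is only an illustration of the notation (the helper name <code>parse_outgroup</code> is made up) and is not GARLI's own parser.<br />
<br />
```python
def parse_outgroup(spec):
    """Expand an outgroup string such as '1-3 5' into a list of taxon numbers."""
    taxa = []
    for token in spec.split():
        if "-" in token:
            start, end = token.split("-")
            taxa.extend(range(int(start), int(end) + 1))  # ranges are inclusive
        else:
            taxa.append(int(token))
    return taxa


print(parse_outgroup("1-3 5"))
```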
<br />
===searchreps (number of independent search replicates)===<br />
'''searchreps''' = (1 to infinity, '''2''') – The number of independent search replicates to perform during a program execution. You should always either do multiple search replicates or multiple program executions with any dataset to get a feel for whether you are getting consistent results, which suggests that the program is doing a decent job of searching. Note that if this is > 1 and you are performing a bootstrap analysis, this is the number of search replicates to be done per bootstrap replicate. That can increase the chance of finding the best tree per bootstrap replicate, but will also increase bootstrap runtimes enormously.<br />
===bootstrapreps (number of bootstrap replicates)===<br />
'''bootstrapreps''' = (0 to infinity, '''0''') - The number of bootstrap reps to perform. If this is greater than <br />
0, normal searching will not be performed. The resulting bootstrap trees (one per rep) will be <br />
output to a file named <ofprefix>.boot.tre. To obtain the bootstrap proportions they will then <br />
need to be read into PAUP* or a similar program to obtain a majority rule consensus. Note <br />
that it is probably safe to reduce the strictness of the termination conditions during <br />
bootstrapping (perhaps halve genthreshfortopoterm), which will greatly speed up the <br />
bootstrapping process with negligible effects on the results.<br />
<br />
Note that GARLI does NOT do the bootstrap consensus or calculate the bootstrap proportions itself. It simply infers the trees that are the input to that consensus. The trees in the .boot.tre file will need to be read into a program such as SumTrees (or PAUP* or CONSENSE) to do a majority-rule consensus and obtain your bootstrap support values. See the following part of the "Advanced topics" page for suggestions on making the consensus: '''[[Advanced_topics#Detailed_Example:_A_bootstrap_analysis|Detailed Example: A bootstrap analysis]]'''.<br />
<br />
===resampleproportion (relative size of re-sampled data matrix)===<br />
'''resampleproportion''' = (0.1 to 10, '''1.0''') – When bootstrapreps > 0, this setting allows for <br />
bootstrap-like resampling, but with the pseudoreplicate datasets having the number of <br />
alignment columns different from the real data. Setting values < 1.0 is somewhat similar to jackknifing, but not identical. <br />
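<br />
Conceptually, this kind of resampling draws alignment columns with replacement, with the number of draws scaled by '''resampleproportion'''. The sketch below illustrates the idea only: how GARLI rounds the number of columns and generates its random numbers is not specified here, so the rounding and the function name <code>resample_columns</code> are assumptions.<br />
<br />
```python
import random


def resample_columns(n_columns, resampleproportion=1.0, seed=None):
    """Draw column indices with replacement; the draw count is scaled by the proportion."""
    rng = random.Random(seed)
    n_draws = round(n_columns * resampleproportion)  # rounding rule is an assumption
    return [rng.randrange(n_columns) for _ in range(n_draws)]


# 2178 is the number of characters in the sample dataset described on this wiki.
full = resample_columns(2178, 1.0, seed=1)
half = resample_columns(2178, 0.5, seed=1)
print(len(full), len(half))
```
<br />
Because columns are still drawn with replacement, a value below 1.0 gives smaller pseudoreplicates that may contain repeated columns, which is why it is only somewhat similar to jackknifing.<br />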
===inferinternalstateprobs (infer ancestral states)===<br />
'''inferinternalstateprobs''' = (0 or 1, '''0''') – Specify 1 to have GARLI infer the marginal posterior <br />
probability of each character at each internal node. This is done at the very end of the run, <br />
just before termination. The results are output to a file named <ofprefix>.internalstates.log.<br />
<br />
===outputsitelikelihoods (write a file with the log-likelihood of each site)===<br />
<br />
'''outputsitelikelihoods''' = (0 or 1, '''0''') - Causes a file named <ofprefix>.sitelikes.log to be created containing the sitewise log-likelihood values for each individual site for the final tree of each search replicate. Note that the format of the file is exactly the same as that output by PAUP, so it can be directly used in CONSEL using the PAUP file-format option. For partitioned models the sites are listed in sets by partition subset, but the site numbers will match those in the original datamatrix. Thus, when using a partitioned model that divides sites by codon position, the ordering would be 1, 4, 7, ... 2, 5, 8, ... 3, 6, 9 .... Note that this makes no difference to CONSEL, which just assumes that the site likelihoods are ordered the same for each tree and ignores the site numbers. <br />
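<br />
The site ordering for a codon-position partition can be reproduced as follows. This is a small sketch that assumes the reading frame starts at site 1 and that subsets are listed in the order first, second, third position; <code>partitioned_site_order</code> is a hypothetical name.<br />
<br />
```python
def partitioned_site_order(n_sites):
    """Original site numbers listed subset by subset for a codon-position partition."""
    return [site for position in (1, 2, 3) for site in range(position, n_sites + 1, 3)]


print(partitioned_site_order(9))
```
<br />
If you need the values back in alignment order for some other program, sorting the rows of the file by the site number restores it.<br />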
<br />
Also note that using the site likelihoods from a partitioned model is a violation of CONSEL's assumptions, since it will not know about the partitions and its internal bootstrap resampling will not obey them. It isn't clear what the effects of this will be on the various tests. <br />
<br />
Finally, if you want to calculate the site likelihoods for a set of trees that you provide, use the '''[[GARLI_Configuration_Settings#optimizeinputonly_.28do_not_search.2C_only_optimize_model_and_branch_lengths_on_user_trees.29 | optimizeinputonly]]''' setting.<br />
<br />
===optimizeinputonly (do not search, only optimize model and branch lengths on user trees)===<br />
(new in version 2.0)<br />
<br />
'''optimizeinputonly''' = (0 or 1, '''0''') - Requires a NEXUS formatted input tree file (only NEXUS format will work!), specified through the '''[[GARLI_Configuration_Settings#streefname_.28source_of_starting_tree_and.2For_model.29 | streefname]]''' setting. All trees in that file will have model parameters and branch lengths optimized, giving their maximum likelihood scores. A file with the site-likelihoods of each tree will also be output. See the '''[[GARLI_Configuration_Settings#outputsitelikelihoods_.28write_a_file_with_the_log-likelihood_of_each_site.29 | outputsitelikelihoods]]''' setting for details.<br />
<br />
===collapsebranches (collapse zero length branches before writing final trees to file)===<br />
<br />
'''collapsebranches''' = (0 or 1, '''1''') - Before version 1.0, all trees that were returned were fully resolved. This was true even if the maximum-likelihood estimates of some internal branch lengths were effectively zero (or GARLI's minimum, which is 1e-8). In such cases, collapsing the branch into a polytomy would be a better representation. Note that GARLI will never return a tree with an actual branch length of zero, but rather with its minimum value of 1.0e-8. The drawback of always returning fully resolved trees is that what is effectively a polytomy can be resolved in three ways, and different independent searches may randomly return one of those resolutions. Thus, if you compare the trees by topology only, they will look different. If you pay attention to the branch lengths and likelihood scores of the trees it will be apparent that they are effectively the same.<br />
<br />
I think that collapsing of branches is particularly important when bootstrapping, since no support should be given to a branch that doesn't really exist, i.e., that is a random resolution of a polytomy. Collapsing is also good when calculating tree to tree distances such as the symmetric tree distance, for example when calculating phylogenetic error to a known target tree. Zero-length branches would add to the distances (~error) although they really should not.<br />
<br />
==Model specification settings==<br />
<br />
With version 1.0 and later there are now many more options dealing with model specification because of <br />
the inclusion of amino acid and codon-based models. The description of the settings will be <br />
broken up by data type. Note that in terms of the model settings in GARLI, “empirical” means <br />
to fix parameter values at those observed in the dataset being analyzed, and “fixed” means to fix <br />
them at user specified values. See the '''[[GARLI_Configuration_Settings#streefname_.28source_of_starting_tree_and.2For_model.29|streefname]]''' setting for details on how to provide <br />
parameter values to be fixed during inference.<br />
<br />
'''PARTITIONED MODELS: To use partitioned models you'll still need to use the same basic model settings detailed below for each individual partition subset, and then some other details to set up the partition model itself. Be sure that you are familiar with the rest of this section, then see [[Using_partitioned_models | Using partitioned models]].'''<br />
<br />
===datatype (sequence type and inference model)=== <br />
'''datatype''' = ('''nucleotide''', aminoacid, codon-aminoacid, codon, standard, standardordered, standardvariable, standardvariableordered) – The type of data and model that is <br />
to be used during tree inference. Nucleotide and amino acid data are self-explanatory.<br />
<br />
The codon-aminoacid datatype means that the data will be <br />
supplied as a nucleotide alignment, but will be internally translated and analyzed using an <br />
amino acid model. The codon and codon-aminoacid datatypes require nucleotide sequence <br />
that is aligned in the correct reading frame. In other words, all gaps in the alignment should <br />
be a multiple of 3 in length, and the alignment should start at the first position of a codon. If <br />
the alignment has extra columns at the start, middle or end, they should be removed or <br />
excluded with a Nexus exset (see '''[[FAQ#Can_I_specify_alignment_columns_of_my_data_matrix_to_be_excluded.3F | this FAQ item]]''' for an example of exset usage). The correct <br />
'''[[GARLI_Configuration_Settings#geneticcode_.28code_to_use_in_codon_translation.29 | geneticcode]]''' must also be set.<br />
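<br />
A quick sanity check of the reading-frame requirements above can be scripted before running a codon or codon-aminoacid analysis. The following Python sketch is not part of GARLI; the function names are made up for illustration, and it assumes gaps are written as "-". It only flags gap runs whose length is not a multiple of 3 and sequences whose total length is not a multiple of 3, so passing it does not prove that the alignment really starts at a first codon position.<br />

```python
import re


def bad_gap_runs(seq):
    """Return (start, length) for every run of '-' whose length is not a multiple of 3."""
    return [(m.start(), len(m.group()))
            for m in re.finditer(r"-+", seq)
            if len(m.group()) % 3 != 0]


def looks_in_frame(seq):
    """Heuristic only: total length and every gap run must be multiples of 3."""
    return len(seq) % 3 == 0 and not bad_gap_runs(seq)


if __name__ == "__main__":
    # Hypothetical aligned sequences, not taken from any real dataset.
    for name, seq in [("taxonA", "ATG---GCA"), ("taxonB", "ATG--GCAA")]:
        print(name, looks_in_frame(seq), bad_gap_runs(seq))
```

Any sequence that fails the check is a candidate for trimming or for exclusion of the offending columns with a Nexus exset.<br />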
<br />
(New in Version 2.0)<br />
<br />
The various "standard" datatypes are new in GARLI 2.0. These represent morphology-like discrete characters, with any number of states. These are also termed the "Mk" and "Mkv" models by Lewis (2001). See this page for more details on using standard data: '''[[Mkv morphology model]]'''.<br />
<br />
===Settings for datatype = nucleotide=== <br />
====ratematrix (relative rate parameters assumed by substitution model)====<br />
'''ratematrix''' = (1rate, 2rate, '''6rate''', fixed, custom string) – The number of relative substitution rate <br />
parameters (note that the number of free parameters is this value minus one). Equivalent to <br />
the “nst” setting in PAUP* and MrBayes. 1rate assumes that substitutions between all pairs <br />
of nucleotides occur at the same rate (JC model), 2rate allows different rates for transitions and <br />
transversions (K2P or HKY models), and 6rate allows a different rate between each nucleotide pair (GTR). These rates are <br />
estimated unless the fixed option is chosen. Since version 0.96, parameters for any <br />
submodel of the GTR model may be estimated. The format for specifying this is very <br />
similar to that used in the “rclass” setting of PAUP*. Within parentheses, six letters are <br />
specified, with spaces between them. The six letters represent the rates of substitution <br />
between the six pairs of nucleotides, with the order being A-C, A-G, A-T, C-G, C-T and G-T. <br />
Letters within the parentheses that are the same mean that a single parameter is shared by <br />
multiple nucleotide pairs. For example, <br />
ratematrix = (a b a a b a) <br />
would specify the HKY 2-rate model (equivalent to ratematrix = 2rate). This entry, <br />
ratematrix = (a b c c b a) <br />
would specify 3 estimated rates of substitution, with one rate shared by A-C and G-T <br />
substitutions, another rate shared by A-G and C-T substitutions, and the final rate shared by <br />
A-T and C-G substitutions.<br />
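<br />
To see which nucleotide pairs share a parameter under a given custom string, the string can be unpacked with a few lines of Python. This is only an illustrative sketch (the function name is hypothetical, and this is not how GARLI parses the entry internally); it follows the pair order A-C, A-G, A-T, C-G, C-T, G-T given above.<br />

```python
# Order of the six nucleotide pairs in a custom ratematrix string.
PAIRS = ["A-C", "A-G", "A-T", "C-G", "C-T", "G-T"]


def rate_classes(spec):
    """Map each letter in a string such as '(a b c c b a)' to the pairs sharing that rate."""
    letters = spec.strip().strip("()").split()
    if len(letters) != 6:
        raise ValueError("expected six space-separated letters, got %d" % len(letters))
    classes = {}
    for pair, letter in zip(PAIRS, letters):
        classes.setdefault(letter, []).append(pair)
    return classes


if __name__ == "__main__":
    for spec in ["(a b a a b a)", "(a b c c b a)"]:
        classes = rate_classes(spec)
        # Number of rates, and the pairs that share each one.
        print(spec, len(classes), classes)
```

For (a b c c b a) this gives three rate classes, and so two free parameters, since the number of free parameters is the number of rates minus one.<br />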
<br />
====statefrequencies (equilibrium base frequencies assumed by substitution model)====<br />
'''statefrequencies''' = (equal, empirical, '''estimate''', fixed) – Specifies how the equilibrium state <br />
frequencies (A, C, G and T) are treated. The empirical setting fixes the frequencies at their <br />
observed proportions, and the other options should be self-explanatory. <br />
===For datatype = nucleotide or aminoacid===<br />
====invariantsites (treatment of proportion of invariable sites parameter)====<br />
'''invariantsites''' = (none, '''estimate''', fixed) – Specifies whether a parameter representing the <br />
proportion of sites that are unable to change (i.e. have a substitution rate of zero) will be <br />
included. This is typically referred to as “invariant sites”, but would better be termed <br />
“invariable sites”. <br />
====ratehetmodel (type of rate heterogeneity to assume for variable sites)====<br />
'''ratehetmodel''' = (none, '''gamma''', gammafixed) – The model of rate heterogeneity assumed. <br />
“gammafixed” requires that the alpha shape parameter is provided, and a setting of “gamma” <br />
estimates it. <br />
====numratecats (number of overall substitution rate categories)==== <br />
'''numratecats''' = (1 to 20, '''4''') – The number of categories of variable rates (not including the <br />
invariant site class if it is being used). Must be set to 1 if '''ratehetmodel''' is set to none. Note <br />
that runtimes and memory usage scale linearly with this setting. <br />
===For datatype = aminoacid or codon-aminoacid===<br />
Amino acid analyses are typically done using fixed rate matrices that have been estimated on <br />
large datasets and published. Typically the only model parameters that are estimated during tree <br />
inference relate to the rate heterogeneity distribution. Each of the named matrices also has <br />
corresponding fixed amino acid frequencies, and a given rate matrix can either be used with those <br />
frequencies or with the amino acid frequencies observed in your dataset. This second option is <br />
often denoted as “+F” in a model description, although in terms of the GARLI configuration <br />
settings this is referred to as “empirical” frequencies. In GARLI the Dayhoff model would be <br />
specified by setting both the ratematrix and statefrequencies options to “dayhoff”. The <br />
Dayhoff+F model would be specified by setting the ratematrix to “dayhoff”, and <br />
statefrequencies to “empirical”.<br />
<br />
The following named amino acid models are implemented: <br />
{| border="1"<br />
|-<br />
!'''ratematrix'''/'''statefrequencies''' setting !! reference <br />
|-<br />
|align="center" | dayhoff || Dayhoff, Schwartz and Orcutt. 1978.<br />
|-<br />
|align="center" | jones || Jones, Taylor and Thornton (JTT), 1992. <br />
|-<br />
|align="center" | WAG || Whelan and Goldman, 2001.<br />
|-<br />
|align="center" | mtREV || Adachi and Hasegawa, 1996.<br />
|-<br />
|align="center" | mtmam || Yang, Nielsen and Hasegawa, 1998.<br />
|}<br />
Note that most programs allow either the use of a named rate matrix and its corresponding state <br />
frequencies, or a named rate matrix and empirical frequencies. GARLI technically allows the <br />
mixing of different named matrices and equilibrium frequencies (for example, wag matrix with <br />
jones equilibrium frequencies), but this is not recommended. <br />
====ratematrix (amino acid substitution rates)====<br />
'''ratematrix''' = (poisson, jones, dayhoff, wag, mtmam, mtrev) – The fixed amino acid rate matrix to <br />
use. You should use the matrix that gives the best likelihood, and could use a program like <br />
PROTTEST (very much like MODELTEST, but for amino acid models) to determine which <br />
fits best for your data. Poisson assumes a single rate of substitution between all amino acid <br />
pairs, and is a very poor model.<br />
====statefrequencies (equilibrium base frequencies assumed by substitution model)====<br />
'''statefrequencies''' = (equal, empirical, estimate, fixed, jones, dayhoff, wag, mtmam, mtrev) – <br />
Specifies how the equilibrium state frequencies of the 20 amino acids are treated. The <br />
“empirical” option fixes the frequencies at their observed proportions (when describing a <br />
model this is often termed “+F”). <br />
===For datatype = codon===<br />
The codon models are built with three components: (1) parameters describing the process of <br />
individual nucleotide substitutions, (2) equilibrium codon frequencies, and (3) parameters <br />
describing the relative rate of nonsynonymous to synonymous substitutions. The nucleotide <br />
substitution parameters within the codon models are exactly the same as those possible with <br />
standard nucleotide models in GARLI, and are specified with the ratematrix configuration <br />
entry. Thus, they can be of the 2rate variety (inferring different rates for transitions and <br />
transversions, K2P or HKY-like), the 6rate variety (inferring different rates for all nucleotide <br />
pairs, GTR-like) or any other sub-model of GTR. The options for codon frequencies are <br />
specified with the statefrequencies configuration entry. The options are to use equal <br />
frequencies (not a good option), the frequencies observed in your dataset (termed “empirical” in <br />
GARLI), or the codon frequencies implied by the “F1x4” or “F3x4” methods (using PAML's <br />
terminology). These last two options calculate the codon frequencies as the product of the <br />
frequencies of the three nucleotides that make up each codon. In the “F1x4” case the nucleotide <br />
frequencies are those observed in the dataset across all codon positions, while the “F3x4” option <br />
uses the nucleotide frequencies observed in the data at each codon position separately. The final <br />
component of the codon models is the nonsynonymous to synonymous relative rate parameters <br />
(aka dN/dS or omega parameters). The default is to infer a single dN/dS value. Alternatively, a <br />
model can be specified that infers a given number of dN/dS categories, with the dN/dS values <br />
and proportions falling in each category estimated (ratehetmodel = nonsynonymous). This is <br />
the “discrete” or “M3” model in PAML's terminology. <br />
====ratematrix (relative ''nucleotide'' rate parameters assumed by codon model)====<br />
'''ratematrix''' = (1rate, '''2rate''', 6rate, fixed, custom string) – This determines the relative rates of <br />
nucleotide substitution assumed by the codon model. The options are exactly the same as <br />
those allowed under a normal nucleotide model. A codon model with '''ratematrix''' = 2rate <br />
specifies the standard Goldman and Yang (1994) model, with different substitution rates for <br />
transitions and transversions. <br />
====statefrequencies (equilibrium codon frequencies)====<br />
'''statefrequencies''' = (equal, empirical, f1x4, '''f3x4''') - The options are to use equal codon frequencies <br />
(not a good option), the frequencies observed in your dataset (termed “empirical” in GARLI), <br />
or the codon frequencies implied by the “F1x4” or “F3x4” methods (using PAML's <br />
terminology). These last two options calculate the codon frequencies as the product of the <br />
frequencies of the three nucleotides that make up each codon. In the “F1x4” case the <br />
nucleotide frequencies are those observed in the dataset across all codon positions, while the <br />
“F3x4” option uses the nucleotide frequencies observed in the data at each codon position <br />
separately. <br />
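<br />
The product rule behind “F1x4” and “F3x4” can be written out in a few lines of Python. This is a sketch under stated assumptions, not GARLI's code: the function names and the example frequencies are invented, all 64 triplets are kept, and nothing is done about stop codons (how those are handled is not described here).<br />

```python
from itertools import product

BASES = "ACGT"


def codon_freqs(pos1, pos2, pos3):
    """Codon frequency = product of the nucleotide frequencies at the three positions."""
    return {a + b + c: pos1[a] * pos2[b] * pos3[c]
            for a, b, c in product(BASES, repeat=3)}


def f1x4(nuc):
    """One set of nucleotide frequencies (all codon positions pooled) used three times."""
    return codon_freqs(nuc, nuc, nuc)


def f3x4(pos1, pos2, pos3):
    """A separate set of nucleotide frequencies for each codon position."""
    return codon_freqs(pos1, pos2, pos3)


if __name__ == "__main__":
    # Hypothetical observed frequencies, for illustration only.
    p1 = {"A": 0.4, "C": 0.1, "G": 0.3, "T": 0.2}
    p2 = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}
    p3 = {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1}
    print(f3x4(p1, p2, p3)["ATG"])  # p1[A] * p2[T] * p3[G]
```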
====ratehetmodel (variation in dN/dS across sites)==== <br />
'''ratehetmodel''' = ('''none''', nonsynonymous) – For codon models, the default is to infer a single dN/dS <br />
parameter. Alternatively, a model can be specified that infers a given number of dN/dS <br />
categories, with the dN/dS values and proportions falling in each category estimated <br />
('''ratehetmodel''' = nonsynonymous). This is the “discrete” or “M3” model of Yang et al. <br />
(2000). <br />
====numratecats (number of discrete dN/dS categories)====<br />
'''numratecats''' = (1 to 20, '''1''') – When ratehetmodel = nonsynonymous, this is the number of dN/dS parameter <br />
categories.<br />
====invariantsites====<br />
'''invariantsites''' = (none) - '''NOTE:''' Due to an error on my part, the '''invariantsites''' entry must appear with codon models for them to run (despite the fact that it doesn't apply). i.e., be sure that this appears:<br />
 invariantsites = none<br />
<br />
===For datatype = codon or codon-aminoacid=== <br />
====geneticcode (code to use in codon translation)====<br />
'''geneticcode''' = ('''standard''', vertmito, invertmito) – The genetic code to be used in translating codons <br />
into amino acids.<br />
<br />
==Population settings==<br />
===nindivs (number of individuals in population)===<br />
'''nindivs''' = (2 to 100, '''4''')- The number of individuals in the population. This may be increased, but <br />
doing so is generally not beneficial. Note that typical genetic algorithms tend to have much, <br />
much larger population sizes than GARLI's defaults. <br />
===holdover (unmutated copies of best individual)===<br />
'''holdover''' = (1 to nindivs-1, '''1''')- The number of times the best individual is copied to the next <br />
generation with no chance of mutation. It is best not to mess with this. <br />
===selectionintensity (strength of selection)===<br />
'''selectionintensity''' = (0.01 to 5.0, '''0.5''')- Controls the strength of selection, with larger numbers <br />
denoting stronger selection. The relative probability of reproduction of two individuals <br />
depends on the difference in their log likelihoods (&Delta;lnL) and is formulated very similarly to <br />
the procedure of calculating Akaike weights. The relative probability of reproduction of the <br />
less fit individual is equal to:<br />
<br />
<big><big><br />
''e''<sup> (-selectionIntensity &times; &Delta; lnL)</sup><br />
</big></big><br />
<br />
In general, this setting does not seem to have much of an effect on the progress of a run. In <br />
theory higher values should cause scores to increase more quickly, but make the search more <br />
likely to be entrapped in a local optimum. Low values will increase runtimes, but may be <br />
more likely to reach the true optimum. The following table gives the relative probabilities of <br />
reproduction for different values of the selection intensity when the difference in log <br />
likelihood is 1.0:<br />
{| border="1"<br />
|-<br />
!'''selectionintensity''' value !! Ratio of probabilities of reproduction <br />
|-<br />
|align="center" |0.05 || align="center" |0.95:1.0 <br />
|-<br />
|align="center" |0.1 || align="center" |0.90:1.0 <br />
|-<br />
|align="center" |0.25 || align="center" |0.78:1.0 <br />
|-<br />
|align="center" |0.5 || align="center" |0.61:1.0 <br />
|-<br />
|align="center" |0.75 || align="center" |0.47:1.0 <br />
|-<br />
|align="center" |1.0 || align="center" |0.37:1.0 <br />
|-<br />
|align="center" |2.0 || align="center" |0.14:1.0<br />
|}<br />
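<br />
The ratios in the table follow directly from the formula above, and values for other settings or other score differences can be computed the same way. A minimal Python sketch (the function name is made up for illustration):<br />

```python
import math


def relative_reproduction_probability(selection_intensity, delta_lnl):
    """Probability of reproduction of the less fit individual relative to the fitter one."""
    return math.exp(-selection_intensity * delta_lnl)


if __name__ == "__main__":
    # Same selectionintensity values as in the table, with a lnL difference of 1.0.
    for s in (0.05, 0.1, 0.25, 0.5, 0.75, 1.0, 2.0):
        print("%.2f  %.2f:1.0" % (s, relative_reproduction_probability(s, 1.0)))
```

Note that the ratio depends only on the product of the setting and the difference in log likelihood, so at the default of 0.5 a tree that is 2.0 lnL units worse has the same relative probability (0.37) as a tree 1.0 unit worse at a setting of 1.0.<br />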
===holdoverpenalty (fitness handicap for best individual)===<br />
'''holdoverpenalty''' = (0 to 100, '''0''') – This can be used to bias the probability of reproduction of the <br />
best individual downward. Because the best individual is automatically copied into the next <br />
generation, it has a bit of an unfair advantage and can cause all population variation to be lost <br />
due to genetic drift, especially with small population sizes. The value specified here is <br />
subtracted from the best individual’s lnL score before calculating the probabilities of<br />
reproduction. It seems plausible that this might help maintain variation, but I have not seen it <br />
cause a measurable effect.<br />
<br />
===stopgen (maximum number of generations to run)===<br />
'''stopgen''' – The maximum number of generations to run. Note that this supersedes the automated <br />
stopping criterion (see enforcetermconditions above), and should therefore be set to a very <br />
large value if automatic termination is desired. <br />
===stoptime (maximum time to run)===<br />
'''stoptime''' – The maximum number of seconds for the run to continue. Note that this supersedes <br />
the automated stopping criterion (see enforcetermconditions above), and should therefore <br />
be set to a very large value if automatic termination is desired.<br />
<br />
==Branch-length optimization settings== <br />
After a topological rearrangement, branch lengths in the vicinity of the rearrangement are <br />
optimized by the Newton-Raphson method. Optimization passes are performed on a particular <br />
branch until the expected improvement in likelihood for the next pass is less than a threshold <br />
value, termed the optimization precision. Note that this name is somewhat misleading, as the <br />
precision of the optimization algorithm is inversely related to this value (i.e., smaller values of <br />
the optimization precision lead to more precise optimization). If the improvement in likelihood <br />
due to optimization for a particular branch is greater than the optimization precision, <br />
optimization is also attempted on adjacent branches, spreading out across the tree. When no new <br />
topology with a better likelihood score is discovered for a while, the value is automatically <br />
reduced. The value can have a large effect on speed, with smaller values significantly slowing <br />
down the algorithm. The value of the optimization precision and how it changes over the course <br />
of a run are determined by the following three parameters. <br />
===startoptprec===<br />
startoptprec (0.005 to 5.0, '''0.5''') – The beginning optimization precision. <br />
===minoptprec===<br />
minoptprec (0.001 to startoptprec, '''0.01''') – The minimum allowed value of the optimization precision.<br />
<br />
===numberofprecreductions===<br />
numberofprecreductions (0 to 100, '''10''') – Specify the number of steps that it will take for the <br />
optimization precision to decrease (linearly) from startoptprec to minoptprec. <br />
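<br />
As an illustration of what a linear decrease implies, the following Python sketch lists the precision values for a given combination of the three settings. It assumes equally sized steps and that a value of 0 reductions simply leaves the precision at startoptprec; the text above only states that the decrease is linear, so the exact values GARLI uses internally may differ. The function name is hypothetical.<br />

```python
def precision_schedule(startoptprec=0.5, minoptprec=0.01, numberofprecreductions=10):
    """Optimization precision before any reduction and after each equally sized reduction."""
    if numberofprecreductions == 0:
        return [startoptprec]  # assumption: nothing to reduce
    step = (startoptprec - minoptprec) / numberofprecreductions
    return [startoptprec - i * step for i in range(numberofprecreductions + 1)]


if __name__ == "__main__":
    for value in precision_schedule():
        print("%.3f" % value)
```

With the default settings each reduction lowers the precision by 0.049.<br />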
===treerejectionthreshold===<br />
treerejectionthreshold (0 to 500, '''50''') – This setting controls which trees have more extensive <br />
branch-length optimization applied to them. All trees created by a branch swap receive <br />
optimization on a few branches that directly took part in the rearrangement. If the difference <br />
in score between the partially optimized tree and the best known tree is greater than treerejectionthreshold, no further optimization is applied to the branches of that tree. <br />
Reducing this value can significantly reduce runtimes, often with little or no effect on results. <br />
However, it is possible that a better tree could be missed if this is set too low. In cases in <br />
which obtaining the very best tree per search is not critical (e.g., bootstrapping), setting this <br />
lower (~20) is probably safe.<br />
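<br />
The rule described above amounts to a single comparison. The Python sketch below restates it with made-up scores; the function name is hypothetical and the code is an illustration of the rule, not an excerpt from GARLI.<br />

```python
def gets_further_optimization(partial_lnl, best_lnl, treerejectionthreshold=50.0):
    """True if the partially optimized tree is close enough to the best tree to keep optimizing."""
    return (best_lnl - partial_lnl) <= treerejectionthreshold


if __name__ == "__main__":
    best = -69649.0
    # A tree 31 lnL units worse than the best: kept at the default of 50, rejected at 20.
    print(gets_further_optimization(-69680.0, best))
    print(gets_further_optimization(-69680.0, best, treerejectionthreshold=20.0))
```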
<br />
==Settings controlling the proportions of the mutation types== <br />
Each mutation type is assigned a prior ''weight''. These values determine the expected <br />
proportions of the various mutation types that are performed. The primary mutation categories <br />
are ''topology'' (t), ''model'' (m) and ''branch length'' (b). Each is assigned a prior weight ( P<sub>i</sub> ) in the <br />
config file. Each time that a new best likelihood score is attained, the amount of the increase in <br />
score is credited to the mutation type responsible, with the sum of the increases ( S<sub>i</sub> ) maintained <br />
over the last intervallength x intervalstostore generations. The number of times that each <br />
mutation is performed ( N<sub>i</sub> ) is also tallied. The total weight of a mutation type is W<sub>i</sub> = P<sub>i</sub> + ( S<sub>i</sub> / N<sub>i</sub> ). <br />
The proportion of mutations of type i out of all mutations is then <br />
<br />
<big><big><br />
Pr(i) = W<sub>i</sub> / <br />
(W<sub>t</sub> + W<sub>m</sub> + W<sub>b</sub>) <br />
</big></big><br />
<br />
The proportion of each mutation is thus related to its prior weight and the average increase in <br />
score that it has caused over recent generations. The prior weights can be used to control the <br />
expected (and starting) proportions of the mutation types, as well as how sensitive the <br />
proportions are to the course of events in a run. It is generally a good idea to make the topology <br />
prior much larger than the others so that when no mutations are improving the score many <br />
topology mutations are still attempted. If you set '''outputmostlyuselessfiles''' to 1, you can look at the “problog” file to determine what the <br />
proportions of the mutations actually were over the course of a run. <br />
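<br />
The weighting scheme can be tried out numerically with a short Python sketch. The function name and the example tallies are invented for illustration, and one detail is an assumption: when a mutation type has not been performed at all in the stored window (N<sub>i</sub> = 0), the S<sub>i</sub> / N<sub>i</sub> term is treated as zero.<br />

```python
def mutation_proportions(prior, gain, count):
    """Pr(i) = W_i / sum(W), where W_i = P_i + S_i / N_i."""
    weights = {}
    for kind, p in prior.items():
        n = count.get(kind, 0)
        # Assumption: no recorded mutations of this type means no credited improvement.
        average_gain = gain.get(kind, 0.0) / n if n > 0 else 0.0
        weights[kind] = p + average_gain
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one mutation type needs a positive weight")
    return {kind: w / total for kind, w in weights.items()}


if __name__ == "__main__":
    # Prior weights set to the defaults of topoweight, modweight and brlenweight.
    prior = {"topology": 1.0, "model": 0.05, "branch length": 0.2}
    # Start of a run: nothing tallied yet, so the proportions follow the priors.
    print(mutation_proportions(prior, {}, {}))
    # Hypothetical window: 100 topology mutations credited with 10 lnL units in total.
    print(mutation_proportions(prior, {"topology": 10.0}, {"topology": 100}))
```

With only the default priors in play, topology mutations make up 80% of all mutations, which is why a large topology prior keeps topology mutations frequent when nothing is improving the score.<br />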
===topoweight (weight on topology mutations)===<br />
topoweight (0 to infinity, '''1.0''') The prior weight assigned to the class of topology mutations <br />
(NNI, SPR and limSPR). Note that setting this to 0.0 turns off topology mutations, meaning that the tree topology is fixed for the run. This used to be a way to have the program estimate only model parameters and branch-lengths, but the '''[[GARLI_Configuration_Settings#optimizeinputonly_.28do_not_search.2C_only_optimize_model_and_branch_lengths_on_user_trees.29 | optimizeinputonly]]''' setting is now a better way to go.<br />
<br />
===modweight (weight on model parameter mutations)===<br />
modweight (0 to infinity, '''0.05''') The prior weight assigned to the class of model mutations. Note <br />
that setting this at 0.0 fixes the model during the run. <br />
===brlenweight (weight on branch-length parameter mutations)===<br />
brlenweight (0 to infinity, '''0.2''') The prior weight assigned to branch-length mutations.<br />
<br />
The same procedure used above to determine the proportion of Topology:Model:Branch-Length <br />
mutations is also used to determine the relative proportions of the three types of topological <br />
mutations (NNI:SPR:limSPR), controlled by the following three weights. Note that the <br />
proportion of mutations applied to each of the model parameters is not user controlled. <br />
===randnniweight (weight on NNI topology changes)===<br />
randnniweight (0 to infinity, '''0.1''') - The prior weight assigned to NNI mutations. <br />
===randsprweight (weight on SPR topology changes)===<br />
randsprweight (0 to infinity, '''0.3''') - The prior weight assigned to random SPR mutations. For <br />
very large datasets it is often best to set this to 0.0, as random SPR mutations essentially <br />
never result in score increases. <br />
===limsprweight (weight on localized SPR topology changes)===<br />
limsprweight (0 to infinity, '''0.6''') - The prior weight assigned to SPR mutations with the <br />
reconnection branch limited to being a maximum of limsprrange branches away from where <br />
the branch was detached. <br />
===intervallength===<br />
intervallength (10 to 1000, '''100''') – The number of generations in each interval during which the <br />
number and benefit of each mutation type are stored. <br />
===intervalstostore===<br />
intervalstostore = (1 to 10, '''5''') – The number of intervals to be stored. Thus, records of <br />
mutations are kept for the last (intervallength x intervalstostore) generations. Every <br />
intervallength generations the probabilities of the mutation types are updated by the scheme <br />
described above.<br />
<br />
==Settings controlling mutation details==<br />
===limsprrange (max range for localized SPR topology changes)===<br />
limsprrange (0 to infinity, '''6''') – The maximum number of branches away from its original <br />
location that a branch may be reattached during a limited SPR move. Setting this too high (> <br />
10) can seriously degrade performance, but if you do so in conjunction with a large increase in genthreshfortopoterm you might end up with better trees.<br />
===meanbrlenmuts (mean # of branch lengths to change per mutation)===<br />
meanbrlenmuts (1 to # taxa, '''5''') - The mean of the binomial distribution from which the number <br />
of branch lengths mutated is drawn during a branch length mutation. <br />
===gammashapebrlen (magnitude of branch-length mutations)===<br />
gammashapebrlen (50 to 2000, '''1000''') - The shape parameter of the gamma distribution (with a <br />
mean of 1.0) from which the branch-length multipliers are drawn for branch-length <br />
mutations. Larger numbers cause smaller changes in branch lengths. (Note that this has <br />
nothing to do with gamma rate heterogeneity.) <br />
===gammashapemodel (magnitude of model parameter mutations)===<br />
gammashapemodel (50 to 2000, '''1000''') - The shape parameter of the gamma distribution (with a <br />
mean of 1.0) from which the model mutation multipliers are drawn for model parameter <br />
mutations. Larger numbers cause smaller changes in model parameters. (Note that this has <br />
nothing to do with gamma rate heterogeneity.) <br />
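<br />
To get a feel for how large the multipliers are at a given shape value, note that a gamma distribution with shape k and mean 1.0 has a standard deviation of 1/sqrt(k). The Python sketch below uses the standard library's random.gammavariate (shape, scale) to draw such multipliers; the function names are hypothetical and GARLI's own random number generation is not shown here.<br />

```python
import math
import random


def draw_multiplier(shape, rng=random):
    """Draw from a gamma distribution with the given shape and a mean of 1.0 (scale = 1/shape)."""
    return rng.gammavariate(shape, 1.0 / shape)


def multiplier_sd(shape):
    """Standard deviation of a gamma distribution with the given shape and a mean of 1.0."""
    return 1.0 / math.sqrt(shape)


if __name__ == "__main__":
    rng = random.Random(1)  # fixed seed so the illustration is repeatable
    for shape in (50, 1000, 2000):
        print(shape, multiplier_sd(shape), [draw_multiplier(shape, rng) for _ in range(3)])
```

At the default of 1000 a typical multiplier is within roughly 3% of 1.0, while at the minimum of 50 the typical change is around 14%.<br />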
===uniqueswapbias (relative weight assigned to already attempted branch swaps)===<br />
uniqueswapbias (0.01 to 1.0, '''0.1''') – With version 0.95 and later, GARLI keeps track of which branch <br />
swaps it has attempted on the current best tree. Because swaps are applied randomly, it is <br />
possible that some swaps are tried twice before others are tried at all. This option allows the <br />
program to bias the swaps applied toward those that have not yet been attempted. Each swap <br />
is assigned a relative weight depending on the number of times that it has been attempted on <br />
the current best tree. This weight is equal to (uniqueswapbias) raised to the (# times swap <br />
attempted) power. In other words, a value of 0.5 means that swaps that have already been <br />
tried once will be half as likely as those not yet attempted, swaps attempted twice will be ¼ <br />
as likely, etc. A value of 1.0 means no biasing. If this value is not equal to 1.0 and the <br />
outputmostlyuselessfiles option is on, a file called <ofprefix>.swap.log is output. This file <br />
shows the total number of rearrangements tried and the number of unique ones over the course <br />
of a run. Note that this bias is only applied to NNI and limSPR rearrangements. Use of this <br />
option may allow the use of somewhat larger values of limsprrange. <br />
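<br />
The weighting is simple enough to check by hand, but a short Python sketch makes the effect of different values easy to explore. The function names are made up, and turning the relative weights into probabilities by dividing by their sum is an assumption made here for illustration.<br />

```python
def swap_weight(uniqueswapbias, times_attempted):
    """Relative weight of a swap already attempted the given number of times."""
    return uniqueswapbias ** times_attempted


def swap_probabilities(uniqueswapbias, attempts):
    """Normalize the relative weights of a set of candidate swaps so that they sum to 1."""
    weights = [swap_weight(uniqueswapbias, n) for n in attempts]
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    # Three candidate swaps tried 0, 1 and 2 times, with a bias of 0.5.
    print([swap_weight(0.5, n) for n in (0, 1, 2)])  # [1.0, 0.5, 0.25]
    print(swap_probabilities(0.5, [0, 1, 2]))
```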
===distanceswapbias (relative weight assigned to branch swaps based on locality)===<br />
distanceswapbias (0.1 to 10, '''1.0''') – This option is similar to uniqueswapbias, except that it <br />
biases toward certain swaps based on the topological distance between the initial and <br />
rearranged trees. The distance is measured as in the limsprrange, and is half the <br />
Robinson-Foulds distance between the trees. As with uniqueswapbias, distanceswapbias <br />
assigns a relative weight to each potential swap. In this case the weight is <br />
(distanceswapbias) raised to the (reconnection distance - 1) power. Thus, given a value of <br />
0.5, the weight of an NNI is 1.0, the weight of an SPR with distance 2 is 0.5, with distance 3 <br />
is 0.25, etc. Note that values less than 1.0 bias toward more localized swaps, while values <br />
greater than 1.0 bias toward more extreme swaps. Also note that this bias is only applied to <br />
limSPR rearrangements. Be careful in setting this, as extreme values can have a very large <br />
effect.</div>Zwicklhttps://molevol.mbl.edu/index.php?title=Garli_wiki&diff=4369Garli wiki2015-07-21T15:20:54Z<p>Zwickl: Created page with " Base of the garli wiki Garli_FAQ"</p>
<hr />
<div><br />
Base of the garli wiki<br />
<br />
[[Garli_FAQ]]</div>Zwicklhttps://molevol.mbl.edu/index.php?title=Garli_Advanced_topics&diff=4368Garli Advanced topics2015-07-21T15:19:47Z<p>Zwickl: Created page with "led Example: Searching for the ML tree (i.e., "normal" searches)= ==Doing the searches (and knowing when to stop)== The usual goal of ML phylogenetic searches is to find the ..."</p>
<hr />
<div>led Example: Searching for the ML tree (i.e., "normal" searches)=<br />
<br />
==Doing the searches (and knowing when to stop)==<br />
The usual goal of ML phylogenetic searches is to find the tree topology with the best likelihood. But, the only way to have any confidence that you've found it is by doing multiple search replicates and finding the same best tree multiple times (or at least multiple trees with very similar scores). If you want to know which is the best scoring tree across multiple runs, the easiest thing to do is to look at the log-likelihood scores that appear at the end of each .screen.log file and compare them across searches. <br />
<br />
Running the program multiple times will often be necessary (and running multiple search replicates is '''ALWAYS necessary''', either within a single program execution or across multiple). Running the program multiple times simultaneously can be an efficient way of using a computer with multiple cores or processors (see discussion [[FAQ#Should_I_use_a_multi-threaded_.28openMP.29_version_of_GARLI_if_I.E2.80.99m_using_a_computer_with_multiple_processors.2Fcores.3F | here]]). The following describes the entire procedure of doing a set of searches and examining the results, and it assumes that the program was run multiple times. If the program was only run once with a number of search replicates, then much of the following is simplified.<br />
<br />
Let's follow a series of example analyses:<br />
<br />
'''Dataset''' : 150 taxon, 3188 column rRNA alignment<br />
<br />
I started with a two-replicate search using the default program settings. Once it completed (this took about 70 minutes total on my laptop), the information that I want to look at is at the very end of the .screen.log file:<br />
<br />
Completed 2 replicate runs (of 2).<br />
Results:<br />
Replicate 1 : -69675.8748<br />
Replicate 2 : -69649.0250 (best)<br />
Final result of the best scoring rep (#2) stored in arb150.default1.best.tre<br />
Final results of all reps stored in arb150.default1.best.all.tre<br />
<br />
The rather different scores and the fact that the output doesn't say "same topology" indicate that a different final tree was found by each of the two replicates. This is not particularly surprising for a dataset of this size, and doesn't necessarily indicate that anything is wrong with the program or its settings. This most likely means that the two replicates ended up at different tree "islands" or local optima, and that once a search has reached either of them it cannot leave by simple branch swapping. This is a fact of life with any heuristic search algorithm, and can just as easily happen with other optimality criteria such as parsimony.<br />
<br />
Even if the scores of both of the search replicates were the same, we would really need to do more searches to be confident that we've found the best trees. Clearly in this case we need to do more searches since the results are not the same, so I ran the program again with another two search replicates. The end of the .screen.log for run #2:<br />
<br />
Completed 2 replicate runs (of 2).<br />
Results:<br />
Replicate 1 : -69649.0259<br />
Replicate 2 : -69649.0245 (best) (same topology as 1)<br />
Final result of the best scoring rep (#2) stored in arb150.default2.best.tre<br />
Final results of all reps stored in arb150.default2.best.all.tre<br />
<br />
This notes that the same final topology was found in both searches. The log-likelihood scores are very similar, which should always be the case when the same tree is found in multiple replicates. <br />
<br />
A few things to note: <br />
*'''identical''' trees will always give '''very similar''' scores (actually, exactly identical scores when branch lengths and parameters are fully optimized)<br />
*but, very similar scores do not necessarily indicate that two trees are identical. <br />
<br />
In this case the second replicate shouldn't be considered better in any important sense, despite the very slightly better score. The trees are the same, which is all that matters. Also note that the scores of both replicates in the second run are very similar to that of the second replicate of the first run. This doesn't necessarily mean that these three searches gave the same tree, but it is fairly probable. We'll see how to verify whether they really are the same tree later. We've gotten the same score in 3 out of 4 searches, which does give us some confidence that that is the score of the best tree or trees.<br />
<br />
Still, we'll do another two replicates. Run #3 output:<br />
<br />
Completed 2 replicate runs (of 2).<br />
Results:<br />
Replicate 1 : -69652.3843<br />
Replicate 2 : -69649.0261 (best)<br />
Final result of the best scoring rep (#2) stored in arb150.default3.best.tre<br />
Final results of all reps stored in arb150.default3.best.all.tre<br />
<br />
Again two different scores, but one of them is very similar to our previous best scores, so that is good. Another two searches:<br />
<br />
Completed 2 replicate runs (of 2).<br />
Results:<br />
Replicate 1 : -69652.3825<br />
Replicate 2 : -69649.0275 (best)<br />
Final result of the best scoring rep (#2) stored in arb150.default4.best.tre<br />
Final results of all reps stored in arb150.default4.best.all.tre<br />
<br />
These look very similar to what we got in run #3. We got one more count of the best score that we know of.<br />
<br />
A summary of the results of our eight searches across four program executions:<br />
*5 searches gave a score of about '''-69649.03'''. This is the best score that we know of, and based on the output of run #2 we know that at least two of the five are identical in topology.<br />
*2 searches gave a score of about '''-69652.38'''. This is probably a local optimum that searches can be trapped in.<br />
*1 search gave a score of '''-69675.8748'''. Another local optimum.<br />
<br />
I would feel confident that we've done a good job of searching for the best tree for this dataset at this point. This is entirely subjective. Returning the same best score four or five times is good, regardless of how many replicates it takes. If computational time is not a limiting resource, then more search replicates will never hurt. See below for a discussion of what to do with problematic datasets.<br />
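<br />
The tally above can be reproduced with a few lines of Python. This is only a sketch: the scores are copied from the four .screen.log excerpts, while the function name <code>group_scores</code> and the 0.05 lnL tolerance are assumptions made for illustration and are not part of GARLI.<br />

```python
# Replicate scores copied from the four .screen.log excerpts above.
scores = [
    -69675.8748, -69649.0250,   # run 1
    -69649.0259, -69649.0245,   # run 2
    -69652.3843, -69649.0261,   # run 3
    -69652.3825, -69649.0275,   # run 4
]


def group_scores(values, tol=0.05):
    """Group scores lying within `tol` lnL of the best score of each group.

    The tolerance is an illustrative assumption, not a GARLI setting.
    """
    groups = []
    for s in sorted(values, reverse=True):          # best (least negative) first
        if groups and groups[-1][0] - s <= tol:
            groups[-1].append(s)                    # close to this group's best score
        else:
            groups.append([s])                      # start a new group
    return groups


groups = group_scores(scores)
for g in groups:
    print(f"{len(g)} replicate(s) near {g[0]:.2f}")
```
Remember that matching scores only suggest matching topologies. The trees themselves still need to be compared, as described in the following sections.<br />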
<br />
==Tougher cases==<br />
The above example dataset is fairly large, and we saw evidence of local optima in treespace. Even so, in a fairly small number of searches we were able to find trees with very similar scores several times (and as we will see below, they also have identical topologies). Some datasets, however, may not perform so well. Repeatability of results across search replicates is the primary indicator of whether a heuristic method is doing well, and it determines how many search replicates are "enough".<br />
<br />
A few rules of thumb and things to keep in mind:<br />
*Some (usually very large) datasets are very poorly behaved, and eight search replicates might result in eight different likelihood scores. In this case there are a few options, listed from best to worst. Which you choose partly depends on how much time and/or computational resources you have.<br />
**Keep doing search replicates until the best score that you've seen across replicates is found at least twice. Ideally the topologies should be identical as well.<br />
**Keep doing search replicates until you haven't found a new better scoring tree in some number of replicates. For example, if you've done 20 replicates and the best score you've seen is X, run another 5 replicates. If you don't find a score better than X, stop. If you do, update X to the new best value and run another 5 replicates. Etc.<br />
*Search repeatability is correlated with the number of sequences in a dataset, but there are many examples in which datasets of 60 sequences turn out to be much harder than others four or five times their size. The number of searches that was adequate on one dataset may be insufficient for another. Decisions should only be made on the basis of actual search results. <br />
*Features of datasets that seem to worsen search repeatability are:<br />
**The presence of many very similar (or identical) sequences<br />
**Low phylogenetic signal / low sequence divergence<br />
:There isn't much that can be done about these, although it is best to condense identical sequences to a single exemplar.<br />
*In some cases the results of different search replicates will be essentially identical in score (+/- 0.05 lnL) but different in topology. This is usually due to branches of zero length (GARLI's minimum is actually 1.0e-8), which itself is often due to very similar or identical sequences. The trees will look identical by eye because the zero length branches are indistinguishable from a polytomy. For the purposes of knowing when you've done enough searches, these differences across replicates can be ignored.<br />
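<br />
The second stopping rule in the list above (run a batch of extra replicates and stop once a whole batch fails to improve on the best score) can be written down as a short sketch. Here <code>run_replicate</code> is a hypothetical callable standing in for "perform one search replicate and return its lnL", and the scores in the demonstration are made up. Nothing here calls GARLI itself.<br />

```python
def search_until_stable(run_replicate, initial=20, batch=5):
    """Run `initial` replicates, then batches of `batch` replicates until
    a whole batch fails to beat the best score seen so far.

    `run_replicate` is a hypothetical callable returning one final lnL score.
    Returns (best score, total number of replicates run).
    """
    scores = [run_replicate() for _ in range(initial)]
    best = max(scores)                      # lnL closer to zero is better
    while True:
        new = [run_replicate() for _ in range(batch)]
        scores.extend(new)
        if max(new) <= best:                # no improvement in this batch: stop
            return best, len(scores)
        best = max(new)                     # update X and run another batch


# Demonstration with made-up scores instead of real searches.
canned = iter([-105.0, -103.0,
               -101.0, -101.5, -100.0, -102.0,
               -100.4, -100.2, -100.9, -100.1])
best, n_reps = search_until_stable(lambda: next(canned), initial=2, batch=4)
print(best, n_reps)
```
In practice you would probably also treat scores that differ by a tiny amount (for example less than 0.05 lnL) as equal, for the reasons given in the last item of the list.<br />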
<br />
==Examining/collecting results==<br />
Following the series of eight search replicates across four executions of the program discussed above, we now have results strewn across a number of files. We would like to collect them for further investigation. Depending on what you want to do, there are many ways to go about this. Here I discuss several things that one might want to do with the results and the ways that I would go about them. Note that much of this would be simpler had we run the program once and specified eight search replicates, but we didn't really know in advance how many replicates we would need. <br />
<br />
Note that you don't HAVE to do any of the following. You can just figure out which is the best scoring replicate across runs/searches, and if you have gotten it multiple times you can take it as your Maximum Likelihood tree.<br />
<br />
First we probably want to compare the trees returned by each of the separate runs and get them into a single tree file. The final trees were stored in two files per run:<br />
*The best tree across all replicates of a run is saved to the .best.tre file (i.e., only one tree appears regardless of the number of search replicates)<br />
*The best tree found by every individual search replicate is saved to the .best.all.tre file<br />
<br />
===Post-processing with PAUP*===<br />
I usually use PAUP* to collect the trees and compare them. Note that I use the command line version and don't have a graphical version to even try this on, so you'll have to figure out how these commands translate to the menus if you use a graphical version. Even if you are using a graphical version, you can always type these commands into it. I would do the following, all in PAUP*<br />
<br />
====Loading the trees====<br />
*Start PAUP and execute your NEXUS datafile<br />
*Execute one of the .best.all.tre files:<br />
execute arb150.default1.best.all.tre<br />
:Note that there is a PAUP block in the .best.tre files that does a number of handy things when it is executed, including setting and fixing GARLI's parameter estimates and telling PAUP to store and fix the branch lengths estimated by GARLI. This also loads the trees that are contained in that file. Note that the parameter values that are loaded are only from one of the GARLI search replicates, but the values should be very similar for all of them.<br />
*Now use the gettrees command to load trees from each of the other .best.all.tre files. i.e.:<br />
gettrees file= arb150.default2.best.all.tre mode=7;<br />
:Note that you need to specify the "mode=7" to tell PAUP to keep the trees that are already in memory when loading more. If you mess up, you can use the "clear" command in PAUP to remove all trees from memory and start again.<br />
<br />
====Comparing the trees====<br />
*(OPTIONAL) Type "lscore". This tells PAUP to score the trees in memory, and it has already been told above to use GARLI's estimates of all of the parameters. PAUP will very quickly display log-likelihood scores that very nearly match those output by GARLI for each of the search replicates.<br />
*(OPTIONAL) Type :<br />
lscore /userbr=no<br />
:This tells PAUP to score the trees again, this time estimating branch lengths itself. The scores you see will usually be slightly better than those output by GARLI, but typically only by 0.01 lnL or less.<br />
*Now to compare the trees. I would use the Symmetric Difference (Robinson-Foulds) tree distance metric, which you get in PAUP with<br />
treedist<br />
:This displays the pairwise distances between trees. I won't go into exactly what this distance metric means, but a distance of zero indicates that trees are identical and larger values mean that the trees are less similar. The maximum distance for a dataset of N sequences is 2 * (N - 3). For our above example the important part of the output is this:<br />
<br />
Symmetric-difference distances between trees<br />
1 2 3 4 5 6 7 8<br />
1 -<br />
2 46 -<br />
3 46 0 -<br />
4 46 0 0 -<br />
5 24 28 28 28 -<br />
6 46 0 0 0 28 -<br />
7 24 28 28 28 2 28 -<br />
8 46 0 0 0 28 0 28 -<br />
<br />
:After staring at that for a bit, we can see that trees 2, 3, 4, 6 and 8 are identical. Those correspond to exactly the same five replicates that gave the best score of about -69649.03, so that is good news.<br />
*If we just wanted to compare the best tree from each run, instead of from each replicate, we could go through the above procedure except with the .best.tre files instead of the .best.all.tre files. In that case the tree distance output would be:<br />
1 2 3 4<br />
1 -<br />
2 0 -<br />
3 0 0 -<br />
4 0 0 0 -<br />
:indicating that the best of the two replicates of each of the four runs gave exactly the same tree.<br />
*If the program was only run once, then the output will already indicate which search replicates resulted in identical trees. However, it does not give a quantitative measure of how similar the trees are.<br />
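<br />
With many trees, reading the distance matrix by eye gets tedious. The following sketch groups trees whose symmetric-difference distance is zero, using the eight-tree matrix shown above typed in by hand. The function name <code>identical_groups</code> is an assumption for illustration. It also evaluates the maximum possible distance, 2 * (N - 3), for our 150 sequences.<br />

```python
# Lower triangle of the symmetric-difference matrix shown above (trees 1-8).
lower = [
    [],
    [46],
    [46, 0],
    [46, 0, 0],
    [24, 28, 28, 28],
    [46, 0, 0, 0, 28],
    [24, 28, 28, 28, 2, 28],
    [46, 0, 0, 0, 28, 0, 28],
]


def identical_groups(lower_triangle):
    """Group 1-based tree numbers whose pairwise distance is zero."""
    groups = []
    for i, row in enumerate(lower_triangle):
        for g in groups:
            if row[g[0] - 1] == 0:          # distance to the group's first member
                g.append(i + 1)
                break
        else:
            groups.append([i + 1])          # no identical tree seen so far
    return groups


n_taxa = 150
max_distance = 2 * (n_taxa - 3)             # largest possible distance for N taxa

print(identical_groups(lower))
print(max_distance)
```
Against that maximum, the distance of 2 between trees 5 and 7 shows that the two searches that scored about -69652.38 ended at nearly, but not exactly, the same topology.<br />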
<br />
====Saving the trees====<br />
If you've done the above to load the trees into PAUP, now we might want to save the best tree(s) per run, per replicate or across all runs/replicates to a single file. Note that if you only did one run with multiple search replicates, you don't need to do this since the best overall tree is already in the .best.tre file, and the best trees from each replicate are in the .best.all.tre file.<br />
<br />
*First we might want to specify an outgroup so that PAUP orients the trees appropriately when saving them (even if they are oriented correctly in the file that PAUP loads them from, they will be oriented with the default outgroup of taxon 1 when saved). Type something like:<br />
outgroup 1-3 6 <br />
:The outgroup could also be specified in a PAUP block in the datafile.<br />
*Now we save the trees. <br />
savetree file= arb.alltrees.tre brlens=user<br />
:CAREFUL HERE! By default PAUP will save the tree with parsimony branch lengths instead of maintaining GARLI's estimates. Setting brlens=user tells it to maintain GARLI's lengths when saving. Alternatively if you want PAUP to estimate the branch lengths when saving the trees you'll need to set the criterion to likelihood first (set crit=like) and ensure that the model parameters are set appropriately.<br />
*If you want to only save one of the trees (i.e., the very best tree), you can do that too. If tree 3 was the one you wanted:<br />
savetree file= arb.veryBest.tre brlens=user from=3 to=3<br />
<br />
===Other techniques for collecting results===<br />
There are many other ways that you might want to explore or collect your results.<br />
====Manually combining treefiles====<br />
Instead of using PAUP to load the results of multiple runs and to then save them to a single file, you can simply manually combine them. If you open one of the .best.tre files in a text editor, you will see something like the following:<br />
<br />
#nexus<br />
begin trees;<br />
translate<br />
1 EscCo143,<br />
2 EscCo145,<br />
3 SleTyp15,<br />
4 SlePara4,<br />
.<br />
.<br />
.<br />
150 Pd2Occul;<br />
tree bestREP2 = [&U][!GarliScore -69649.024503][!GarliModel r 0.953187 2.396777 1.229445 <br />
0.997765 3.566751 e 0.236189 0.233233 0.279949 0.250629 a 0.718326 p 0.098152 ](1:0.00000001,<br />
(((48:0.06527213,49:0.03476198):0.44628288, ... <etc>;<br />
end;<br />
begin paup;<br />
clear;<br />
gett file=arb150.default.best.tre storebr;<br />
lset userbr nst=6 rmat=(0.953187 2.396777 1.229445 0.997765 3.566751) base=( 0.236189 0.233233<br />
0.279949) rates=gamma shape=0.718326 ncat=4 pinv=0.098152;<br />
end;<br />
This looks ugly, but it is not that complicated. The '''translate''' command shows the correspondence between taxon names and the numbers in the tree descriptions. The section that starts with '''begin paup;''' sets the parameter values in PAUP if this file is executed. The line that begins with '''tree''' and ends with a ; (semicolon) is the actual tree description itself. If you open each of the tree files that you want to combine, you can simply copy and paste the lines that begin with '''tree ...''' from each file one after another into a trees block in a single file. That file with whatever set of trees you chose can then be read by programs that read Nexus tree files, such as FigTree, Mesquite, PAUP, etc. The fact that multiple trees may have the same name (the '''bestREP2''' in the example above) will not cause problems in any software that I am aware of.<br />
<br />
====Command-line tricks====<br />
If you have done many runs or ran the MPI version of the program you may have trees scattered across many files. Manually copying and pasting the trees into one file may not be a very inviting option. <br />
<br />
If you have access to a Unix style command line (including OS X), you can very easily grab all of the trees from multiple files and copy them into a single file with one command:<br />
grep -h ^tree *.best.tre > destFilename.tre<br />
(or replace .best.tre with .best.all.tre if you want the results of every search replicate). Now each of the tree descriptions appear one after another in the file destFilename.tre. Just add the first part of any of the other tree files (through the end of the translate block) to the start of that file and an "end;" at the end of the file and you'll have a fully functional tree file.<br />
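<br />
If you would rather have the header and the closing "end;" added automatically, the same idea can be sketched in Python. Like the grep command, this assumes that each tree description sits on a single line beginning with "tree". The function name <code>combine_tree_files</code> is made up for this example.<br />

```python
def combine_tree_files(paths, dest):
    """Write one Nexus tree file holding the tree lines of every file in `paths`.

    The header (everything before the first tree line, i.e. through the end
    of the translate block) is taken from the first file only. Assumes each
    tree description is a single line starting with "tree ".
    Returns the number of trees written.
    """
    header, trees = [], []
    for n, path in enumerate(paths):
        with open(path) as handle:
            lines = handle.read().splitlines()
        in_header = (n == 0)
        for line in lines:
            if line.lower().startswith("tree "):
                in_header = False
                trees.append(line)
            elif in_header:
                header.append(line)
    with open(dest, "w") as out:
        out.write("\n".join(header + trees + ["end;"]) + "\n")
    return len(trees)


# Possible usage, collecting every .best.tre file in the current directory:
#   import glob
#   combine_tree_files(sorted(glob.glob("*.best.tre")), "destFilename.tre")
```
The PAUP block at the end of each input file is deliberately dropped, since its parameter values belong to one particular replicate.<br />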
<br />
Alternatively you can easily make a PAUP block that loads all of the trees. At the command line, type<br />
for i in *.best.tre <return><br />
do <return><br />
echo "gett file = $i mode=7;" <return><br />
done > myPaupBlock.nex <return><br />
<br />
This will make a file named myPaupBlock.nex with commands that load all of the tree files in the current directory with a name that ends with .best.tre into PAUP. Simply execute your dataset in PAUP and then execute this file. Although it isn't strictly necessary, to be technically correct you should edit this file and add<br />
#nexus<br />
begin paup;<br />
at the start of the file, and<br />
end;<br />
at the end.<br />
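<br />
The complete file, including the opening and closing lines, can also be generated in one go. The following sketch builds the text of such a PAUP block. The function name <code>paup_block</code> is an assumption, and the four file names are simply those of the example runs discussed earlier.<br />

```python
def paup_block(tree_files):
    """Return the text of a Nexus file whose PAUP block loads every tree file."""
    lines = ["#nexus", "begin paup;"]
    # mode=7 keeps the trees already in memory when loading more
    lines += [f"gett file = {name} mode=7;" for name in tree_files]
    lines.append("end;")
    return "\n".join(lines) + "\n"


# Example with the four runs discussed above.
block = paup_block([f"arb150.default{i}.best.tre" for i in range(1, 5)])
print(block)
```
To cover every matching file in a directory, pass <code>sorted(glob.glob("*.best.tre"))</code> from Python's standard glob module instead of the explicit list, and write the returned text to myPaupBlock.nex.<br />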
<br />
=Detailed Example: A bootstrap analysis=<br />
==Doing the searches==<br />
Setting up a bootstrap analysis is not very different from doing a normal GARLI analysis. Simply set '''bootstrapreps''' to whatever number of replicates you want. Bootstrap replicates are slightly faster than normal searches, but doing 100 bootstrap replicates will still take almost 100 times longer than a single search. Therefore, it makes sense to "optimize" your search settings somewhat so that each search is intensive enough to find a good tree but takes as little time as possible.<br />
<br />
(I'm working on a discussion of search tuning currently)<br />
<br />
==Making the bootstrap consensus==<br />
Once a number of bootstrap replicates have been completed, you'll want to calculate the bootstrap consensus tree. Unfortunately, GARLI does not do this itself. It must be done by another program such as SumTrees, PAUP*, CONSENSE of the PHYLIP package or Phyutility. Other options exist that I don't have direct experience with.<br />
===Using SumTrees===<br />
SumTrees is a nice command-line program by Jeet Sukumaran that is extremely useful for making bootstrap consensus trees from GARLI output. It requires an installation of Python 2.4 or newer (the newest Python version is 2.6, and OS X 10.5 Leopard comes with Python 2.5). Note that SumTrees replaces Jeet's previous program BootScore, which included more or less the same features with regards to bootstrap consensuses. Instructions for obtaining and installing it are available here: http://www.jeetworks.org/programs/sumtrees<br />
<br />
The best feature of SumTrees is that it allows visualization of bootstrap values ''on a given tree''. In other words, it allows you to place the values on a fully resolved tree obtained elsewhere (best ML tree, parsimony tree, Bayesian consensus tree, etc.), rather than visualizing them on a consensus tree itself. This is excellent, since placing the bootstrap values on a tree that you already have is often exactly what you want to do! The other nice thing that SumTrees allows is calculating a majority rule consensus that also includes branch length information.<br />
*For a normal majority rule bootstrap consensus with SumTrees you simply provide the name of the GARLI bootstrap tree file and specify an output file:<br />
sumtrees.py mybootrun.boot.tre --output=myconsensus.tre<br />
<br />
*To place the bootstrap values on a given tree, for example the best tree found by GARLI during non-bootstrap searches:<br />
sumtrees.py mybootrun.boot.tre --target=mysearch.best.tre --output=supportOnBest.tre<br />
<br />
If you have bootstrap trees in multiple files (because you did multiple runs or used the MPI version), SumTrees can also easily handle this. Simply list the multiple files like this:<br />
sumtrees.py mybootrun1.boot.tre mybootrun2.boot.tre <etc> --output=myconsensus.tre<br />
<br />
Once your support values have been calculated and the output tree file has been written, you can open it in a program such as FigTree or TreeView to see the values. SumTrees has many other useful options and ways of formatting the output (support as percentages vs proportions, number of decimal places in support values, etc). See the sumtrees website or enter<br />
sumtrees.py --help<br />
for more information.<br />
<br />
===Using PAUP*===<br />
If you are familiar with PAUP* doing the consensus is easy. Simply execute your dataset, and then load the trees with the gettrees command:<br />
gettrees file= myfilename.boot.tre;<br />
gettrees file= myfilename2.boot.tre mode=7;<br />
etc.<br />
If you've only done a single bootstrap run, then only the first line is necessary. The "mode=7" tells PAUP to keep the trees that are already in memory when loading more from file. Otherwise it replaces them with the ones from file. Then, to do a majority rule consensus:<br />
contree /strict=no majrule=yes treefile=mybootcon.tre<br />
This will show the tree on the screen with all bootstrap values over 50%, but without branch length information. It will also save the consensus tree to the file "mybootcon.tre". Other things that you can add to the end of the contree command are "LE50", which tells it to include as many groupings as possible in the consensus, including those with support < 50%. Going in the opposite direction, you can increase the support cutoff for groups to appear by adding "percent=80", which would only show groups with > 80% bootstrap support.<br />
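<br />
To make the effect of the cutoff concrete, here is a toy sketch of what a majority rule consensus counts. It is not how PAUP* is implemented. Each bootstrap tree is represented simply as the set of taxon groups it contains, the data are made up, and the function names are assumptions. For real unrooted trees the groups would be bipartitions of the taxa.<br />

```python
from collections import Counter


def group_support(trees):
    """Return {group: percent of trees containing it}.

    Each tree is given as a set of frozenset groups of taxon names.
    """
    counts = Counter(g for tree in trees for g in tree)
    return {g: 100.0 * c / len(trees) for g, c in counts.items()}


def consensus_groups(trees, percent=50.0):
    """Groups found in more than `percent` of the trees."""
    return {g: s for g, s in group_support(trees).items() if s > percent}


# Toy bootstrap trees on taxa A-E, each listed as its set of groups (made-up data).
boot = [
    {frozenset("AB"), frozenset("ABC")},
    {frozenset("AB"), frozenset("ABC")},
    {frozenset("AB"), frozenset("DE")},
    {frozenset("AC"), frozenset("DE")},
]

majority = consensus_groups(boot)               # default cutoff of 50%
strict_80 = consensus_groups(boot, percent=80)  # analogous to percent=80
print(majority, strict_80)
```
In this toy example only the group (A,B) passes the 50% cutoff, and nothing passes a cutoff of 80%. Groups sitting at exactly 50% are excluded, which is why a majority rule consensus can contain polytomies.<br />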
<br />
IMPORTANT NOTE: PAUP* ''cannot'' save the tree with the support values included (unless it did the whole bootstrap itself). It also cannot save or display the consensus with branch length information included. These are the downsides of using PAUP* to calculate the consensus. If you want to do either of these things you'll need to use Sumtrees or another program.</div>Zwicklhttps://molevol.mbl.edu/index.php?title=Garli_FAQ&diff=4367Garli FAQ2015-07-21T15:17:16Z<p>Zwickl: Created page with "==Search options== ===How many generations/seconds should I run for?=== This is dataset specific, and there is no way to tell in advance. It is recommended to set the maximu..."</p>
<hr />
<div>==Search options==<br />
===How many generations/seconds should I run for?===<br />
This is dataset specific, and there is no way to tell in advance. It is recommended to set the maximum generations and seconds to very large values (>1x10<sup>6</sup>) and use the automated stopping criterion (see [[GARLI_Configuration_Settings#enforcetermconditions_.28use_automatic_termination.29|enforcetermconditions]] in the settings list). Note that the program can be stopped gracefully at any point by pressing Ctrl-C, although the results may not be fully optimal at that point.<br />
<br />
===How many runs/search replicates should I do?===<br />
That somewhat depends on how much time/computational resources you have. You should ALWAYS do multiple searches, either by using the [[GARLI_Configuration_Settings#searchreps_.28number_of_independent_search_replicates.29|searchreps]] setting or by simply running the program multiple times (but do so in different directories or change the [[GARLI_Configuration_Settings#ofprefix_.28output_filename_prefix.29|ofprefix]] to avoid overwriting previous results). If you perform a few runs or replicates and get very similar trees/lnL scores (ideally within about one lnL of each other), that should give you some confidence that the program is doing a good job searching and is finding the best or nearly best topology, and it suggests that you don’t need to do many more searches. If there is a lot of variation between runs, try using different starting tree options (see further FAQ entries) and choose the best scoring result that you obtain. Note that the program is stochastic, and runs performed with exactly the same starting conditions and settings (but different random number seeds) may give different results. You may also try changing some of the search parameters to make each search replicate more intensive (see further FAQ entries). The discussion on the '''[[Advanced_topics]]''' page may be helpful in determining how many replicates you should perform, and in collating the results from multiple searches.<br />
<br />
===Should I use random starting topologies, stepwise-addition starting topologies or provide starting topologies myself?===<br />
*Unless the number of sequences in your dataset numbers in the hundreds, it is recommended to perform multiple searches with both random ([[GARLI_Configuration_Settings#streefname_.28source_of_starting_tree_and.2For_model.29|streefname]] = random) and stepwise-addition ([[GARLI_Configuration_Settings#streefname_.28source_of_starting_tree_and.2For_model.29|streefname]] = stepwise) starting trees. <br />
*For datasets consisting of up to several hundred sequences, searches using a random starting tree often perform well (although they have slightly longer runtimes). Because the search starts from very different parts of the search space, getting consistent results from random starting trees provides good evidence that the search is doing a good job and really is finding the best trees. For datasets of more than a few hundred sequences, random starting trees sometimes perform quite poorly.<br />
*The quality of stepwise-addition starting trees can be controlled with the [[GARLI_Configuration_Settings#attachmentspertaxon_.28control_creation_of_stepwise_addition_starting_tree.29|attachmentspertaxon]] setting. This allows the creation of starting trees that fall somewhere between completely random and very optimal. See the description of the [[GARLI_Configuration_Settings#attachmentspertaxon_.28control_creation_of_stepwise_addition_starting_tree.29|attachmentspertaxon]] setting. <br />
*Providing your own starting trees can sometimes be helpful, especially on datasets consisting of hundreds of sequences, where the creation of the stepwise-addition tree itself may take quite a long time. <br />
*User-specified starting trees may contain polytomies (i.e., they need not be fully bifurcating), so for example a parsimony strict consensus tree could be used to get the search in the right ballpark to start with without biasing it very much. Before searching, a polytomous tree will be arbitrarily resolved, with the resolution being different for each search replicate or program execution.<br />
<br />
===What is the proper format for specifying a starting topology?===<br />
The tree should be contained in a separate file (with that filename specified on the '''[[GARLI_Configuration_Settings#streefname_.28source_of_starting_tree_and.2For_model.29|streefname]]''' line of the configuration file) either in a Nexus trees block, or in standard Newick format (parenthetical notation). Note that the tree description may contain either the taxon numbers (corresponding to the order of the taxa in the dataset), or the taxon names. The tree can optionally contain branch lengths, as well as have polytomies. If multiple trees are contained in a single starting tree file, they will be used in order to start each successive search replicate (if [[GARLI_Configuration_Settings#searchreps_.28number_of_independent_search_replicates.29|searchreps]] > 1). See the [[GARLI_Configuration_Settings#streefname_.28source_of_starting_tree_and.2For_model.29|streefname]] configuration entry for more details.<br />
<br />
===Should I specify a starting topology with branch lengths?===<br />
It doesn’t appear to make much of a difference, so I would suggest not doing so. Note that it is probably NOT a good idea to provide starting branch lengths estimated under a different likelihood model or by Neighbor Joining. When in doubt, leave out branch lengths.<br />
<br />
==Model parameters==<br />
===How do I specify starting/fixed model parameter values?=== <br />
Model parameter values are specified using a fairly cryptic scheme of specifying a single letter representing a particular parameter(s), followed by the value(s) of that parameter(s). See this page for details: '''[[Specifying_model_parameter_values | Specifying model parameter values]]'''.<br />
<br />
===Should I specify starting model parameters?===<br />
If you do not intend to fix the model parameters, specifying a starting model is generally of little help. One case in which you might want to specify starting parameter values would be when doing many search replicates or bootstrap replicates, in which case getting the starting values in the right ballpark can reduce total runtimes by an appreciable amount. If you do intend to fix the parameters at values obtained elsewhere or in a previous GARLI run, then you obviously must include the starting parameter values. See the streefname configuration entry for details on how to specify model parameter values. <br />
===Should I fix the model parameters?===<br />
The main reason one would fix parameters is to increase the speed of the search. Fixing model parameters results in a huge speed increase in some inference programs (such as PAUP*), but less in GARLI (generally approx. 10-50% with an unpartitioned model, although it can be much more with a partitioned model or if there are many parameters). Unless you have good model estimates (under exactly the same model), do not fix them. One situation in which you might want to fix parameter values would be in the case of bootstrapping. You might want to estimate parameter values on the real data, and then fix those parameter values for the searches on each of the pseudo-replicate datasets. See '''[[Specifying_model_parameter_values#Getting_the_parameter_values | Getting the parameter values]]''' for an easy way to do this.<br />
<br />
==Model choices==<br />
===What DNA/RNA substitution models can I use?=== <br />
All possible submodels of the GTR (General Time Reversible) model, with or without gamma distributed rate heterogeneity and a proportion of invariable sites. This is the same set of models allowed by PAUP* and represents the full set of models considered by the model selection program MODELTEST (http://darwin.uvigo.es/software/modeltest.html). See the “[[GARLI_Configuration_Settings#Model_specification_settings | Model specification settings]]” section on the GARLI configuration page.<br />
<br />
===Do I need to perform statistical model selection when using GARLI?===<br />
Yes! Just as when doing an ML search in PAUP* or a Bayesian analysis in MrBayes, you should pick a model that is statistically justified given your data. You may use a program like MODELTEST (http://darwin.uvigo.es/software/modeltest.html) to do the testing. However, most good sized datasets (which is mainly what GARLI is designed to analyze) do support the use of the most complex time-reversible model, GTR with a class of invariable sites and gamma distributed rate heterogeneity (“GTR+I+G”). As of GARLI version 0.96, all of the models examined by MODELTEST can now be estimated. See the “[[GARLI_Configuration_Settings#Model_specification_settings | Model specification settings]]” section on the GARLI configuration page and the FAQ item '''"MODELTEST told me to use model X. How do I set that up in GARLI?"''' below.<br />
<br />
Note that there is NOT really any reason to use the model parameter values provided by MODELTEST, only the model TYPE as indicated by the MODELTEST results (i.e., JC, HKY, GTR, etc.). GARLI will do a better job of estimating the parameters because it will do so on the ML tree, plus fixing or providing the parameter values to GARLI will not help that much to reduce runtimes. See this FAQ section above: [[FAQ#Model_parameters | Model parameters]].<br />
<br />
===What amino acid models can I use?===<br />
Amino acid analyses are typically done using fixed rate matrices that have been estimated on large datasets and published. Typically the only model parameters that are estimated during tree inference relate to the rate heterogeneity distribution. Each of the named matrices also has corresponding fixed amino acid frequencies, and a given matrix can either be used with those frequencies or with the amino acid frequencies observed in your dataset. Amino acid models may be used with the same forms of rate heterogeneity available for nucleotide models (gamma-distributed rate heterogeneity and a proportion of invariable sites). These are the implemented amino acid rate matrices: <br />
{| border="1"<br />
|-<br />
!'''ratematrix'''/'''statefrequencies''' setting !! reference <br />
|-<br />
|align="center" | dayhoff || Dayhoff, Schwartz and Orcutt. 1978.<br />
|-<br />
|align="center" | jones || Jones, Taylor and Thornton (JTT), 1992. <br />
|-<br />
|align="center" | WAG || Whelan and Goldman, 2001.<br />
|-<br />
|align="center" | mtREV || Adachi and Hasegawa, 1996.<br />
|-<br />
|align="center" | mtmam || Yang, Nielsen and Hasegawa, 1998.<br />
|}<br />
Versions 1.0 and later also allow estimation of the full amino acid rate matrix (189 rate parameters). Do not do this unless you have lots of data, as well as a good amount of time. Newer versions also allow input of your own amino acid rate matrix, allowing you to use any model if you have the rates. A file specifying the LG model (Le and Gascuel, 2008) is included in the example directory with the program to demonstrate this.<br />
<br />
See the amino acid section of the '''[[GARLI_Configuration_Settings#Model_specification_settings|model specification settings]]''' section for more details on amino acid models.<br />
<br />
===How do I choose which amino acid model to use?===<br />
As with choosing a nucleotide model, your choice of an amino acid model should be based on some measure of how well the available models fit your data. The program PROTTEST (http://darwin.uvigo.es/software/prottest.html) does for amino acid models what MODELTEST does for nucleotide models, testing a number of amino acid models and helping you choose one. Note that although GARLI can internally translate aligned nucleotide sequences into amino acids and analyze them at that level, to use PROTTEST you will need to convert your alignment into a Phylip formatted amino acid alignment first. <br />
===What codon models can I use?===<br />
The codon models that can be used are related to the Goldman and Yang (1994) model. See the codon section of the '''[[GARLI_Configuration_Settings#Model_specification_settings|model specification settings]]''' for a discussion of the various options. <br />
===How do I choose which codon model to use?===<br />
I don't currently have a good answer for this. The codon models should probably be considered experimental at the moment. Experiments to investigate the use of codon models for tree inference on large datasets are underway, and I should eventually have some general guidelines on how best to apply them. Feel free to give them a try with your data.<br />
===MODELTEST told me to use model X. How do I set that up in GARLI?===<br />
The candidate models that MODELTEST chooses from have the following format:<br />
<Model Name><optionally, +G><optionally, +I><br />
for example, GTR+I+G, SYM+I or HKY.<br />
The model names are definitely cryptic if you aren't familiar with the evolutionary models used in phylogenetic analyses. Luckily, there is a direct correspondence between all of MODELTEST's models and particular GARLI settings. Note that GARLI allows the use of every model that MODELTEST might tell you to use.<br />
First, rate heterogeneity:<br />
For any model with '''"+G"''' in it:<br />
<pre><br />
ratehetmodel = gamma<br />
numratecats = 4 (or some other number. 4 is the default in GARLI, PAUP* and MrBayes)<br />
</pre><br />
For any model without '''"+G"''' in it:<br />
<pre><br />
ratehetmodel = none<br />
numratecats = 1<br />
</pre><br />
For any model with '''"+I"''' in it:<br />
<pre><br />
invariantsites = estimate<br />
</pre><br />
For any model without '''"+I"''' in it:<br />
<pre><br />
invariantsites = none<br />
</pre><br />
The model names each correspond to a particular combination of the '''[[GARLI_Configuration_Settings#statefrequencies_.28equilibrium_base_frequencies_assumed_by_substitution_model.29|statefrequencies]]''' and '''[[GARLI_Configuration_Settings#ratematrix_.28relative_nucleotide_rate_parameters_assumed_by_codon_model.29|ratematrix]]''' configuration entries. Note that for the rate matrix settings that appear in parentheses like this: (0 1 2 3 4 5), the parentheses do need to appear in the config file. Here are all of the named models:<br />
{| class="wikitable" border="1" style="text-align:center"<br />
|-<br />
!'''model name''' !! '''ratematrix =''' !! '''statefrequencies ='''<br />
|-<br />
|align="center" | JC || 1rate || equal<br />
|-<br />
|align="center" | F81 || 1rate || estimate<br />
|-<br />
|align="center" | K80 || 2rate || equal<br />
|-<br />
|align="center" | HKY || 2rate || estimate<br />
|-<br />
|align="center" | TrNef || (0 1 0 0 2 0) || equal<br />
|-<br />
|align="center" | TrN || (0 1 0 0 2 0) || estimate<br />
|-<br />
|align="center" | K3P (= K81) || (0 1 2 2 1 0) || equal<br />
|-<br />
|align="center" | K3Puf (= K81uf) || (0 1 2 2 1 0) || estimate<br />
|-<br />
|align="center" | TIMef || (0 1 2 2 3 0) || equal<br />
|-<br />
|align="center" | TIM || (0 1 2 2 3 0) || estimate<br />
|-<br />
|align="center" | TVMef || (0 1 2 3 1 4) || equal<br />
|-<br />
|align="center" | TVM || (0 1 2 3 1 4) || estimate<br />
|-<br />
|align="center" | SYM || 6rate || equal<br />
|-<br />
|align="center" | GTR || 6rate || estimate<br />
<br />
|}<br />
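As a worked example assembled from the settings and table above, GTR+I+G would be set up roughly like this:<br />
<pre><br />
ratematrix = 6rate<br />
statefrequencies = estimate<br />
ratehetmodel = gamma<br />
numratecats = 4<br />
invariantsites = estimate<br />
</pre><br />
and TrN+I (no gamma, with the rate matrix given in parentheses) like this:<br />
<pre><br />
ratematrix = (0 1 0 0 2 0)<br />
statefrequencies = estimate<br />
ratehetmodel = none<br />
numratecats = 1<br />
invariantsites = estimate<br />
</pre><br />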
<br />
'''NOTE''': MODELTEST also returns parameter values with the chosen model type. You may fix those values in GARLI, but unlike in PAUP* there is little speed benefit to doing so. As long as you set the correct type of model with the above instructions GARLI will infer a better model estimate than the values given by MODELTEST, since those are estimated on a tree that is poorer than the Maximum Likelihood tree. See the "[[FAQ#Model_parameters | '''Model parameters''']]" section of the FAQ for more information on providing/fixing model parameters.<br />
<br />
===Can GARLI do analyses assuming a relaxed or strict molecular clock?===<br />
Sorry, no.<br />
<br />
===Can I infer rooted trees in GARLI?===<br />
No. All models used are time reversible, and the position of the root neither affects nor is inferred by the analyses. Rooting the inferred tree by using an outgroup or other method is up to you. Note that you can specify an outgroup for GARLI to use, but this only affects how the trees are oriented when written to file, and has no effect on the analysis itself.<br />
<br />
===Can GARLI perform partitioned analyses, e.g. allow different models for different genes?===<br />
Yes. An official version that can do this and better documentation of it are forthcoming. In the meantime an earlier functional version is available and documented [[Partition_testing_version | here]].<br />
<br />
==Constraints==<br />
===How do I specify a topological constraint?=== <br />
In short, this requires deciding which branches (or bipartitions) you would like to constrain, specifying those branches in a file and telling GARLI where to find that file. See the [[GARLI_Configuration_Settings#constraintfile_.28file_containing_constraint_definition.29|constraintfile]] option for details on constraint formats.<br />
<br />
===Why might I want to specify a topological constraint?===<br />
There are two main reasons: to reduce the topology search space or to perform hypothesis testing (such as parametric bootstrapping). For large datasets in which you are certain of some groupings, it may help the search to constrain a few major groups. Note that if constraints are specified without a starting tree, GARLI will create a random or stepwise-addition tree that is compatible with those constraints. This may be an easy way of improving searching without the potential bias of using a given starting tree. A discussion of parametric bootstrapping (sometimes called the SOWH test) is out of the scope of this manual. It is a method of testing topological null hypotheses with a given dataset through simulation. See Huelsenbeck et al. (1996). Other statistical tests of tree topologies (attempting to answer the question “Is topology A significantly better than topology B?”) are nicely reviewed in Goldman et al. (2000).<br />
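A minimal sketch of the config entries for a constrained search with no user-supplied starting tree is given here. The constraint file name is hypothetical, and using the '''streefname''' entry to select the kind of starting tree is an assumption to check against the configuration settings page. The constraint definitions themselves must follow the format described under the constraintfile option:<br />
<pre><br />
constraintfile = myconstraints.txt (hypothetical file name)<br />
streefname = stepwise (or random)<br />
</pre><br />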
===How do I fully constrain (i.e., fix) the tree topology?===<br />
This is not done with a constraint! Use the '''[[GARLI_Configuration_Settings#optimizeinputonly_.28do_not_search.2C_only_optimize_model_and_branch_lengths_on_user_trees.29 | optimizeinputonly]]''' option.<br />
<br />
==Program versions==<br />
===What are the differences between the Graphical (GUI) OS X version and other versions?===<br />
(NOTE that the GUI version of GARLI is VERY out of date: version 0.951 vs. 2.0. If you can use a newer non-GUI version, do so.) The main differences are in how the user interacts with the program. The GUI version changes the cryptic option names detailed below into normal English. If you hold your mouse pointer over an option in the GUI it will give you a description of what that option does (generally taken directly from this manual). There may be some options that are not available in the GUI. Searching and optimization may also not be great in this version, since there have been many improvements made to the core since its release.<br />
<br />
===Should I use a multi-threaded (openMP) version of GARLI if I’m using a computer with multiple processors/cores?===<br />
The multi-threaded versions will often increase the speed of runs by approximately 1.2 to 1.8 times, but will otherwise give results identical to those obtained with the normal version (i.e., the search algorithm is exactly the same). They perform best when there are many columns in the alignment, or when using amino acid or codon models. The speedup also seems to be very hardware specific, so with DNA models on some machines it may not help at all. Test it yourself on your machine before assuming that it will help.<br />
<br />
Note that even if it is faster, this doesn't mean that running this version is the best use of computing resources. In particular, if you intend to do multiple search replicates or bootstrap replicates, simply running two independent executions of the program will give a speedup of nearly 2 times, and will therefore get a given number of searches done more quickly than a multithreaded version. One case in which the multithreaded version may be of particular use is when analyzing extremely large datasets for which the amount of memory that would be required for two simultaneous executions of the program is near or greater than the amount of memory installed in the system. Note that the multi-threaded versions by default will use all of the cores/processors that are available on the system. To change this, you can set the OMP_NUM_THREADS environment variable (you can find information on how to do that online). Note that the performance of the multithreaded version when it is only using one processor or core is actually worse than the normal version, so when in doubt use the normal version.<br />
<br />
===Should I use a 64-bit Version?===<br />
A 64-bit (sometimes called x64) version of the program will probably not help you unless you need to use large amounts of memory. In general, a 64-bit version will not be faster. If you need to use about 4 or more GB of memory, then you MUST use a 64-bit version. I do not currently have a 64-bit OS X distribution, but could make one if there is interest. Compiling your own might be a better option.<br />
<br />
===What is the parallel MPI version of GARLI? Should I use it?===<br />
This is a fairly complex question and answer. The short of it is that if you are running on a large computer cluster it ''may'' be worthwhile to use the parallel version. There is nothing wrong with using the serial version on a cluster if the cluster allows it, and there may not be much benefit to using the MPI version in this case. The MPI version can also be run on a standalone machine with multiple processors that has MPI installed, such as Linux or Mac OS X Leopard (10.5).<br />
See a detailed discussion of the MPI version [[MPI_version | here]].<br />
<br />
==Miscellaneous==<br />
===Can I use GARLI to do batches of runs, one after another?===<br />
Yes, any of the non-GUI versions can do this. First create a different config file for each run you need to do, and name them something like run1.conf, run2.conf, etc. Assuming that the GARLI executable is named Garli-1.0 and is in the current directory, you may then make a shell script that runs each config file through the program like this: <br />
./Garli-1.0 -b run1.conf <br />
./Garli-1.0 -b run2.conf <br />
etc. <br />
The “-b” (a plain hyphen followed by b) tells the program to use batch mode and to not expect user input before terminating. The details of making a shell script are beyond the scope of this manual, but you can find help online or ask your nearest Unix guru.<br />
<br />
===For nucleotide models: Is the score that GARLI reports at the end of a run equivalent to what PAUP* would calculate after fully optimizing model parameters and branch lengths on the final topology?===<br />
The model implementations in GARLI are intentionally identical to those in PAUP*, so in general the scores should be very close. In some very rare conditions the score given by GARLI is better than that given by PAUP* after optimization, which appears to be due to PAUP* getting trapped in local branch-length optima. This should not be cause for concern. If you want to be absolutely sure of the lnL score of a tree inferred by GARLI, optimize it in PAUP*. Note that comparability of scores should NOT generally be assumed with other programs such as RAxML or PHYML.<br />
<br />
===For nucleotide models: Is the lnL score that GARLI reports at the end of a run comparable to the lnL scores reported by other ML search programs?=== <br />
In general, you should not assume that lnL scores output by other ML search programs (such as PHYML and RAxML) are directly comparable to those output by GARLI, even if they apparently use the same model. To truly know which program has found a better tree you will need to score and optimize the resulting trees using a single program, under the same model. Also see the previous question.<br />
<br />
===Which GARLI settings should I play around with?===<br />
Besides specifying your own dataset, most settings don’t need to be tinkered with, although you are free to do so if you<br />
understand what they do. Settings that SHOULD be set by the user are '''[[GARLI_Configuration_Settings#ofprefix_.28output_filename_prefix.29|ofprefix]]''', <br />
'''[[GARLI_Configuration_Settings#availablemememory_.28control_maximum_program_memory_usage.29|availablememory]]''' and '''[[GARLI_Configuration_Settings#genthreshfortopoterm_.28number_of_generations_without_topology_improvement_required_for_termination.29|genthreshfortopoterm]]'''. If you want to tinker <br />
further, you might try changing '''[[GARLI_Configuration_Settings#uniqueswapbias_.28relative_weight_assigned_to_already_attempted_branch_swaps.29|uniqueswapbias]]''', '''[[GARLI_Configuration_Settings#nindivs_.28number_of_individuals_in_population.29|nindivs]]''', '''[[GARLI_Configuration_Settings#selectionintensity_.28strength_of_selection.29|selectionintensity]]''', '''[[GARLI_Configuration_Settings#limsprrange_.28max_range_for_localized_SPR_topology_changes.29|limsprrange]]''', '''[[GARLI_Configuration_Settings#startoptprec|startoptprec]]''', '''[[GARLI_Configuration_Settings#minoptprec|minoptprec]]''' and '''[[GARLI_Configuration_Settings#numberofprecreductions|numberofprecreductions]]'''. In general, using a different starting topology tends to have more of an effect on the results than any of these settings do. It is recommended that you do NOT change '''[[GARLI_Configuration_Settings#stopgen_.28maximum_number_of_generations_to_run.29|stopgen]]''', '''[[GARLI_Configuration_Settings#stoptime_.28maximum_time_to_run.29|stoptime]]''',<br />
'''[[GARLI_Configuration_Settings#refinestart_.28whether_to_optimize_a_bit_before_starting_a_search.29|refinestart]]''', '''[[GARLI_Configuration_Settings#enforcetermconditions_.28use_automatic_termination.29|enforcetermconditions]]''' and the mutation weight settings unless you have a specific reason to do so.<br />
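As a rough illustration, the three entries that should be set by the user might appear in the config file as follows. The values are placeholders rather than recommendations, and the note that availablememory is given in megabytes is an assumption to confirm in that setting's documentation:<br />
<pre><br />
ofprefix = myrun1 (hypothetical; any name that identifies this run)<br />
availablememory = 512 (assumed to be in megabytes)<br />
genthreshfortopoterm = 20000 (placeholder value)<br />
</pre><br />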
<br />
===Can I specify alignment columns of my data matrix to be excluded?=== <br />
Yes, if your datafile is Nexus. This is done through an “exset” command in a Nexus assumptions block, included in the same file as a Nexus data matrix. For example, to exclude characters 1-10 inclusive and character 20, the block would look like this: <br />
Begin assumptions; <br />
exset * myExsetName = 1-10 20; <br />
end; <br />
The * means to automatically apply the exset (otherwise the command simply defines the <br />
exset), and the exset name doesn’t matter. Note that this assumes that the file has only one characters or data block, and that the characters block is not named. If you use Mesquite to edit your data or visualize your alignment, any characters that you exclude there will automatically be written to an assumptions block in the file and will be read by GARLI. <br />
(Another option for removing alignment columns is to use PAUP*. Simply execute your dataset in PAUP*, exclude the characters that you don’t want, and then export the file to a new name. The new file will include only the columns you want.)<br />
<br />
===How do I perform a non-parametric bootstrap in GARLI?===<br />
Set up the config file as normal, and set the bootstrapreps setting to the number of replicates you want. The program will perform searches on that number of bootstrap reweighted datasets, and store the best tree found for each replicate dataset in a single file called <ofprefix>.boot.tre. You can also specify searchreps > 1 while bootstrapping to perform multiple searches on each bootstrap resampled dataset. The best tree across all search replicates for each bootstrapped dataset will be written to the bootstrap file. See the '''[[GARLI_Configuration_Settings#bootstrapreps_.28number_of_bootstrap_replicates.29|bootstrapreps]]''' configuration entry for more info.<br />
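A minimal sketch of the bootstrap-related entries, assuming 100 replicates with a single search per replicate (the replicate count and output prefix are illustrative choices only):<br />
<pre><br />
ofprefix = myboot (hypothetical output prefix)<br />
bootstrapreps = 100<br />
searchreps = 1<br />
</pre><br />
With these entries the bootstrap trees would be collected in myboot.boot.tre.<br />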
<br />
Note that GARLI does NOT do the bootstrap consensus or calculate the bootstrap proportions itself. It simply infers the trees that are the input to that consensus. The trees in the .boot.tre file will need to be read into a program such as SumTrees (or PAUP* or CONSENSE) to do a majority-rule consensus and obtain your bootstrap support values. See the following part of the "Advanced topics" page for suggestions on making the consensus: '''[[Advanced_topics#Detailed_Example:_A_bootstrap_analysis|Detailed Example: A bootstrap analysis]]'''.<br />
<br />
===How do I use checkpointing (i.e., stop and later restart a run)?===<br />
Set the writecheckpoints option to “1” before doing a run. If the run is stopped for some reason (intentionally or not), it can be restarted by changing the restart option to “1” in the config file and executing the program. DO NOT make any other changes to the config file before attempting a restart, or bad things may happen. See the <br />
'''[[GARLI_Configuration_Settings#writecheckpoints_.28write_checkpoint_files_during_run.29|writecheckpoints]]''' and <br />
'''[[GARLI_Configuration_Settings#restart_.28restart_run_from_checkpoint.29|restart]]''' settings.<br />
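In config-file terms, the sequence described above might look like the following sketch. For the initial run (a restart value of 0 is assumed to mean "start a fresh run"):<br />
<pre><br />
writecheckpoints = 1<br />
restart = 0<br />
</pre><br />
To resume after the run has stopped, change only the restart entry and execute the program again:<br />
<pre><br />
writecheckpoints = 1<br />
restart = 1<br />
</pre><br />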
<br />
===I ran GARLI multiple times, and now I have results spread across multiple files. How do I deal with this and compare/summarize the results?===<br />
<br />
See the '''[[Advanced_topics]]''' page (in particular the "Examining/collecting results" section) for a discussion of this.<br />
<br />
===Can GARLI output site-likelihoods for use in a program like CONSEL?===<br />
Yes. You can easily do this for the best tree found at the end of a search by using the '''[[GARLI_Configuration_Settings#outputsitelikelihoods_.28write_a_file_with_the_log-likelihood_of_each_site.29 | outputsitelikelihoods]]''' setting.<br />
<br />
To optimize branch-length and model parameters and then output the sitelikelihoods using user specified trees, use '''[[GARLI_Configuration_Settings#optimizeinputonly_.28do_not_search.2C_only_optimize_model_and_branch_lengths_on_user_trees.29 | optimizeinputonly]]'''.<br />
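A sketch of the entries for scoring user-specified trees and writing their site likelihoods is shown here. It assumes that the trees are supplied through the starting tree entry ('''streefname''') and that both options are switched on with a value of 1. The tree file name is hypothetical:<br />
<pre><br />
streefname = mytrees.tre (hypothetical file containing the trees to score)<br />
optimizeinputonly = 1<br />
outputsitelikelihoods = 1<br />
</pre><br />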
<br />
===Does GARLI return trees that are not fully resolved (with polytomies)? Does it collapse branches of length zero?===<br />
Yes. The minimum branch length that GARLI allows is actually 10^-8, which is effectively zero. Setting the '''[[GARLI_Configuration_Settings#collapsebranches_.28collapse_zero_length_branches_before_writing_final_trees_to_file.29 | collapsebranches]]''' option will cause such branches to be collapsed before writing the final trees to file. <br />
<br />
Some tree inference software does NOT do this, which can be very important when analyzing datasets with low variability, i.e., when there is really no evidence for some branches. Zero-length branches (which should really be polytomies) will be randomly resolved in one of three ways when branches are not collapsed. Depending on how the trees are being used, this can introduce problematic extra unsupported branches.<br />
<br />
==Source code compilation==<br />
===Why would I want to compile GARLI myself?===<br />
The only reasons that would require you to do this would be if you are trying to use it on an operating system other than OS X or Windows (e.g., Linux) or if you want access to the very latest fixes and updates. It should be easy to build on any Linux or OS X machine.<br />
<br />
===How do I compile GARLI myself from a source distribution?===<br />
This would mean that you are starting with a file called something like garli-1.0.tar.gz that you downloaded from the [http://garli.googlecode.com googlecode page]. Your system will need the proper tools installed to do this compilation (many Linux distributions should have these by default; for OS X you'll need to install the Developer Tools). To compile:<br />
<br />
1. Decompress the source distribution. From the command line, <br />
tar xzvf garli-1.0.tar.gz<br />
<br />
2. Change into the garli-1.0 directory that has been created<br />
cd garli-1.0<br />
<br />
3. If you want to do the most basic build, type the following and wait for a few minutes<br />
sh build_garli.sh<br />
<br />
4. If everything worked, you should now have a bin directory within the source distribution and an executable file within called garli-1.0. The file can be copied anywhere on the system and will work properly.<br />
<br />
*(Optional) Alternatively, if you wanted to download and use the latest version of the Nexus Class Library (NCL) that GARLI uses to parse input files, you could type this in step 3<br />
sh build_garli.sh --ncl-svn<br />
*(Optional) If you want to pass other flags to the GARLI configure script, you can provide them as other arguments to build_garli.sh. e.g.<br />
sh build_garli.sh --open-mp<br />
*If you want to build the very latest source in the svn repository (possibly enhanced, possibly broken) see the instructions here: [http://groups.google.com/group/garli_users/web/building-from-source building from source].<br />
<br />
===How do I fix compiler errors like "‘strlen’ was not declared in this scope" (usually GCC 4.2 or later)?===<br />
These errors usually appear when trying to compile with the new gcc 4.3, which made some changes regarding file inclusion. Try changing <br />
#include <string><br />
to <br />
#include <cstring><br />
at the top of the files '''src/bipartition.h''' and '''src/translatetable.h'''<br />
<br />
In '''src/configoptions.h''', add <br />
#include <climits><br />
after the other include statement at the top of the file.<br />
<br />
If you still have errors, you might also need to add<br />
#include <cstdio><br />
at the top of '''bipartition.h'''.<br />
<br />
===How do I fix compiler "error: cannot call constructor ErrorException::ErrorException directly" (usually GCC 4.2 or later)?===<br />
<br />
This comes up in newer versions of gcc.<br />
<br />
In the src/utility.h file, change the two lines that start with <br />
<br />
throw ErrorException::ErrorException( ...<br />
<br />
to<br />
<br />
throw ErrorException( ...<br />
<br />
and compile again.</div>Zwicklhttps://molevol.mbl.edu/index.php?title=Derrick_Zwickl&diff=3659Derrick Zwickl2014-07-30T18:09:28Z<p>Zwickl: </p>
<hr />
<div>University of Arizona<br />
<br />
zwickl@email.arizona.edu<br />
<br />
[[File:Zwickl&son.JPG|300px|thumb|right|Derrick and offspring]]<br />
<br />
<br />
==GARLI at Woods Hole==<br />
Presentation and exercises:<br />
<br />
'''Lecture: [http://phylo.bio.ku.edu/slides/GarliDemo/Zwickl.WoodsHole2014.final.pdf Zwickl.WoodsHole2013.final.pdf]'''<br />
<br />
'''NOTE: The tutorial bundle linked on the page below contains everything you need - you don't need to download the program separately!'''<br />
<br />
'''Everything you need for the GARLI computer exercises is here [http://phylo.bio.ku.edu/slides/GarliDemo/garliExercise.WH2013.html Computer exercise]'''<br />
<br />
==Other GARLI information==<br />
*[http://www.nescent.org/wg_garli/ Extensive documentation wiki]<br />
*[http://garli.googlecode.com/ Program download]<br />
*[http://groups.google.com/group/garli_users/ Google users group]<br />
<br />
==Time at MBL==<br />
Arriving Monday July 28, departing Friday Aug. 1.</div>Zwicklhttps://molevol.mbl.edu/index.php?title=Derrick_Zwickl&diff=3636Derrick Zwickl2014-07-29T22:57:17Z<p>Zwickl: </p>
<hr />
<div>University of Arizona<br />
<br />
zwickl@email.arizona.edu<br />
<br />
2014: Arriving Monday July 28, departing Friday Aug. 1.<br />
<br />
[[File:Zwickl&son.JPG|300px|thumb|right|Derrick and offspring]]<br />
<br />
<br />
==GARLI at Woods Hole==<br />
Presentation and exercises:<br />
<br />
'''Lecture: [http://phylo.bio.ku.edu/slides/GarliDemo/Zwickl.WoodsHole2014.final.pdf Zwickl.WoodsHole2013.final.pdf]'''<br />
<br />
'''NOTE: The tutorial bundle linked on the page below contains everything you need - you don't need to download the program separately!'''<br />
<br />
'''Everything you need for the GARLI computer exercises is here [http://phylo.bio.ku.edu/slides/GarliDemo/garliExercise.WH2013.html Computer exercise]'''<br />
<br />
==Other GARLI information==<br />
*[http://www.nescent.org/wg_garli/ Extensive documentation wiki]<br />
*[http://garli.googlecode.com/ Program download]<br />
*[http://groups.google.com/group/garli_users/ Google users group]<br />
<br />
==Time at MBL==<br />
TBA</div>Zwicklhttps://molevol.mbl.edu/index.php?title=Derrick_Zwickl&diff=3583Derrick Zwickl2014-07-28T22:52:55Z<p>Zwickl: </p>
<hr />
<div>University of Arizona<br />
<br />
zwickl@email.arizona.edu<br />
<br />
2014: Arriving Monday July 28, departing Friday Aug. 1.<br />
<br />
[[File:Zwickl&son.JPG|300px|thumb|right|Derrick and offspring]]<br />
<br />
<br />
==GARLI at Woods Hole==<br />
Presentation and exercises:<br />
<br />
'''Lecture: [http://phylo.bio.ku.edu/slides/GarliDemo/Zwickl.WoodsHole2013.final.pdf Zwickl.WoodsHole2013.final.pdf]'''<br />
<br />
'''NOTE: The tutorial bundle linked on the page below contains everything you need - you don't need to download the program separately!'''<br />
<br />
'''Everything you need for the GARLI computer exercises is here [http://phylo.bio.ku.edu/slides/GarliDemo/garliExercise.WH2013.html Computer exercise]'''<br />
<br />
==Other GARLI information==<br />
*[http://www.nescent.org/wg_garli/ Extensive documentation wiki]<br />
*[http://garli.googlecode.com/ Program download]<br />
*[http://groups.google.com/group/garli_users/ Google users group]<br />
<br />
==Time at MBL==<br />
TBA</div>Zwicklhttps://molevol.mbl.edu/index.php?title=File:Zwicklkids.jpg&diff=2850File:Zwicklkids.jpg2013-07-23T22:36:29Z<p>Zwickl: </p>
<hr />
<div></div>Zwicklhttps://molevol.mbl.edu/index.php?title=Derrick_Zwickl&diff=2849Derrick Zwickl2013-07-23T22:32:29Z<p>Zwickl: </p>
<hr />
<div>University of Arizona<br />
<br />
zwickl@email.arizona.edu<br />
<br />
[[File:Zwickl&son.JPG|300px|thumb|right|Derrick and offspring]]<br />
<br />
<br />
==GARLI at Woods Hole==<br />
Presentation and exercises:<br />
<br />
'''Lecture: [http://phylo.bio.ku.edu/slides/GarliDemo/Zwickl.WoodsHole2013.final.pdf Zwickl.WoodsHole2013.final.pdf]'''<br />
<br />
'''NOTE: The tutorial bundle linked on the page below contains everything you need - you don't need to download the program separately!'''<br />
<br />
'''Everything you need for the GARLI computer exercises is here [http://phylo.bio.ku.edu/slides/GarliDemo/garliExercise.WH2013.html Computer exercise]'''<br />
<br />
==Other GARLI information==<br />
*[http://www.nescent.org/wg_garli/ Extensive documentation wiki]<br />
*[http://garli.googlecode.com/ Program download]<br />
*[http://groups.google.com/group/garli_users/ Google users group]<br />
<br />
==Time at MBL==<br />
Tuesday July 21 - Monday July 28</div>Zwicklhttps://molevol.mbl.edu/index.php?title=Derrick_Zwickl&diff=2822Derrick Zwickl2013-07-23T15:47:31Z<p>Zwickl: </p>
<hr />
<div>University of Arizona<br />
<br />
zwickl@email.arizona.edu<br />
<br />
[[File:Zwickl&son.JPG|300px|thumb|right|Derrick and offspring]]<br />
<br />
<br />
==GARLI at Woods Hole==<br />
Presentation and exercises:<br />
<br />
'''Lecture: [[Media:Zwickl.WoodsHole2012Final.pdf|Zwickl.WoodsHole2012.final.pdf]]'''<br />
<br />
'''NOTE: The tutorial bundle linked on the page below contains everything you need - you don't need to download the program separately!'''<br />
<br />
'''Everything you need for the GARLI computer exercises is here [http://phylo.bio.ku.edu/slides/GarliDemo/garliExercise.WH2013.html Computer exercise]'''<br />
<br />
==Other GARLI information==<br />
*[http://www.nescent.org/wg_garli/ Extensive documentation wiki]<br />
*[http://garli.googlecode.com/ Program download]<br />
*[http://groups.google.com/group/garli_users/ Google users group]<br />
<br />
==Time at MBL==<br />
Tuesday July 21 - Monday July 28</div>Zwicklhttps://molevol.mbl.edu/index.php?title=Derrick_Zwickl&diff=2815Derrick Zwickl2013-07-22T23:43:34Z<p>Zwickl: </p>
<hr />
<div>University of Arizona<br />
<br />
zwickl@email.arizona.edu<br />
<br />
[[File:Zwickl&son.JPG|300px|thumb|right|Derrick and offspring]]<br />
<br />
<br />
==GARLI at Woods Hole==<br />
Presentation and exercises:<br />
<br />
'''Lecture: [[Media:Zwickl.WoodsHole2012Final.pdf|Zwickl.WoodsHole2012.final.pdf]]'''<br />
<br />
'''NOTE: The tutorial bundle linked on the page below contains everything you need - you don't need to download the program separately!'''<br />
<br />
'''GARLI tutorial (same as 2011 version): [http://phylo.bio.ku.edu/slides/GarliDemo/garliExercise.html Computer exercise]'''<br />
<br />
NOTE: For interested people, there is a GARLI version implementing the DIMM and Mkv models to allow the use of gaps for tree inference. See the '''"GARLI-2.0 Demo"''' section on this page to download the executable and for a brief tutorial on using this version: '''[http://phylo.bio.ku.edu/software/sate/tutorial.html GARLI gap model tutorial]'''<br />
<br />
==Other GARLI information==<br />
*[http://www.nescent.org/wg_garli/ Extensive documentation wiki]<br />
*[http://garli.googlecode.com/ Program download]<br />
*[http://groups.google.com/group/garli_users/ Google users group]<br />
<br />
==Time at MBL==<br />
Tuesday July 21 - Monday July 28</div>Zwicklhttps://molevol.mbl.edu/index.php?title=Derrick_Zwickl&diff=2208Derrick Zwickl2012-07-26T21:12:10Z<p>Zwickl: /* GARLI at Woods Hole */</p>
<hr />
<div>University of Arizona<br />
<br />
zwickl@email.arizona.edu<br />
<br />
[[File:Zwickl&son.JPG|300px|thumb|right|Derrick and offspring]]<br />
<br />
<br />
==GARLI at Woods Hole==<br />
Presentation and exercises:<br />
<br />
'''Lecture: [[Media:Zwickl.WoodsHole2012Final.pdf|Zwickl.WoodsHole2012.final.pdf]]'''<br />
<br />
'''NOTE: The tutorial bundle linked on the page below contains everything you need - you don't need to download the program separately!'''<br />
<br />
'''GARLI tutorial (same as 2011 version): [http://phylo.bio.ku.edu/slides/GarliDemo/garliExercise.html Computer exercise]'''<br />
<br />
NOTE: For interested people, there is a GARLI version implementing the DIMM and Mkv models to allow the use of gaps for tree inference. See the '''"GARLI-2.0 Demo"''' section on this page to download the executable and for a brief tutorial on using this version: '''[http://phylo.bio.ku.edu/software/sate/tutorial.html GARLI gap model tutorial]'''<br />
<br />
==Other GARLI information==<br />
*[http://www.nescent.org/wg_garli/ Extensive documentation wiki]<br />
*[http://garli.googlecode.com/ Program download]<br />
*[http://groups.google.com/group/garli_users/ Google users group]<br />
<br />
==Time at MBL==<br />
Tuesday July 24 - Monday July 30</div>Zwicklhttps://molevol.mbl.edu/index.php?title=Derrick_Zwickl&diff=2143Derrick Zwickl2012-07-26T17:39:57Z<p>Zwickl: /* GARLI at Woods Hole */</p>
<hr />
<div>University of Arizona<br />
<br />
zwickl@email.arizona.edu<br />
<br />
[[File:Zwickl&son.JPG|300px|thumb|right|Derrick and offspring]]<br />
<br />
<br />
==GARLI at Woods Hole==<br />
Presentation and exercises:<br />
<br />
'''Lecture: [[Media:Zwickl.WoodsHole2012Final.pdf|Zwickl.WoodsHole2012.final.pdf]]'''<br />
<br />
'''NOTE: The tutorial bundle linked on the page below contains everything you need - you don't need to download the program separately!'''<br />
<br />
'''GARLI tutorial (same as 2011 version): [http://phylo.bio.ku.edu/slides/GarliDemo/garliExercise.html Computer exercise]'''<br />
<br />
==Other GARLI information==<br />
*[http://www.nescent.org/wg_garli/ Extensive documentation wiki]<br />
*[http://garli.googlecode.com/ Program download]<br />
*[http://groups.google.com/group/garli_users/ Google users group]<br />
<br />
==Time at MBL==<br />
Tuesday July 24 - Monday July 30</div>Zwicklhttps://molevol.mbl.edu/index.php?title=Derrick_Zwickl&diff=2142Derrick Zwickl2012-07-26T17:35:29Z<p>Zwickl: /* GARLI at Woods Hole */</p>
<hr />
<div>University of Arizona<br />
<br />
zwickl@email.arizona.edu<br />
<br />
[[File:Zwickl&son.JPG|300px|thumb|right|Derrick and offspring]]<br />
<br />
<br />
==GARLI at Woods Hole==<br />
Presentation and exercises:<br />
<br />
'''Lecture: [http://people.ku.edu/~zwickl/Zwickl.WoodsHole2012.final.pdf Zwickl.WoodsHole2012.final.pdf]'''<br />
<br />
'''NOTE: The tutorial bundle linked on the page below contains everything you need - you don't need to download the program separately!'''<br />
<br />
'''GARLI tutorial (same as 2011 version): [http://phylo.bio.ku.edu/slides/GarliDemo/garliExercise.html Computer exercise]'''<br />
<br />
==Other GARLI information==<br />
*[http://www.nescent.org/wg_garli/ Extensive documentation wiki]<br />
*[http://garli.googlecode.com/ Program download]<br />
*[http://groups.google.com/group/garli_users/ Google users group]<br />
<br />
==Time at MBL==<br />
Tuesday July 24 - Monday July 30</div>Zwicklhttps://molevol.mbl.edu/index.php?title=File:Zwickl.WoodsHole2012Final.pdf&diff=2141File:Zwickl.WoodsHole2012Final.pdf2012-07-26T17:34:05Z<p>Zwickl: </p>
<hr />
<div></div>Zwicklhttps://molevol.mbl.edu/index.php?title=Derrick_Zwickl&diff=2135Derrick Zwickl2012-07-26T16:02:55Z<p>Zwickl: </p>
<hr />
<div>University of Arizona<br />
<br />
zwickl@email.arizona.edu<br />
<br />
[[File:Zwickl&son.JPG|300px|thumb|right|Derrick and offspring]]<br />
<br />
<br />
==GARLI at Woods Hole==<br />
Presentation and exercises:<br />
<br />
'''Lecture: [http://people.ku.edu/~zwickl/Zwickl.WoodsHole2011.final.pdf Zwickl.WoodsHole2011.final.pdf]'''<br />
<br />
'''NOTE: The tutorial bundle linked on the page below contains everything you need - you don't need to download the program separately!'''<br />
<br />
'''GARLI tutorial (same as 2011 version): [http://phylo.bio.ku.edu/slides/GarliDemo/garliExercise.html Computer exercise]'''<br />
<br />
==Other GARLI information==<br />
*[http://www.nescent.org/wg_garli/ Extensive documentation wiki]<br />
*[http://garli.googlecode.com/ Program download]<br />
*[http://groups.google.com/group/garli_users/ Google users group]<br />
<br />
==Time at MBL==<br />
Tuesday July 24 - Monday July 30</div>Zwicklhttps://molevol.mbl.edu/index.php?title=Derrick_Zwickl&diff=2134Derrick Zwickl2012-07-26T16:02:16Z<p>Zwickl: /* GARLI at Woods Hole */</p>
<hr />
<div>University of Arizona<br />
<br />
zwickl@email.arizona.edu<br />
<br />
[[File:Zwickl&son.JPG|400px|thumb|right|Derrick and offspring]]<br />
<br />
<br />
==GARLI at Woods Hole==<br />
Presentation and exercises:<br />
<br />
'''Lecture: [http://people.ku.edu/~zwickl/Zwickl.WoodsHole2011.final.pdf Zwickl.WoodsHole2011.final.pdf]'''<br />
<br />
'''NOTE: The tutorial bundle linked on the page below contains everything you need - you don't need to download the program separately!'''<br />
<br />
'''GARLI tutorial (same as 2011 version): [http://phylo.bio.ku.edu/slides/GarliDemo/garliExercise.html Computer exercise]'''<br />
<br />
==Other GARLI information==<br />
*[http://www.nescent.org/wg_garli/ Extensive documentation wiki]<br />
*[http://garli.googlecode.com/ Program download]<br />
*[http://groups.google.com/group/garli_users/ Google users group]<br />
<br />
==Time at MBL==<br />
Tuesday July 24 - Monday July 30</div>Zwicklhttps://molevol.mbl.edu/index.php?title=Derrick_Zwickl&diff=2133Derrick Zwickl2012-07-26T16:01:28Z<p>Zwickl: </p>
<hr />
<div>University of Arizona<br />
<br />
zwickl@email.arizona.edu<br />
<br />
[[File:Zwickl&son.JPG|400px|thumb|right|Derrick and offspring]]<br />
<br />
<br />
==GARLI at Woods Hole==<br />
Presentation and exercises:<br />
<br />
'''Lecture: [http://people.ku.edu/~zwickl/Zwickl.WoodsHole2011.final.pdf Zwickl.WoodsHole2011.final.pdf]'''<br />
<br />
'''NOTE: The tutorial bundle linked on the page below contains everything you need - you don't need to download the program separately!'''<br />
<br />
'''GARLI tutorial: [http://phylo.bio.ku.edu/slides/GarliDemo/garliExercise.html Computer exercise]'''<br />
<br />
<br />
==Other GARLI information==<br />
*[http://www.nescent.org/wg_garli/ Extensive documentation wiki]<br />
*[http://garli.googlecode.com/ Program download]<br />
*[http://groups.google.com/group/garli_users/ Google users group]<br />
<br />
==Time at MBL==<br />
Tuesday July 24 - Monday July 30</div>Zwicklhttps://molevol.mbl.edu/index.php?title=Derrick_Zwickl&diff=2128Derrick Zwickl2012-07-26T15:58:58Z<p>Zwickl: </p>
<hr />
<div>University of Arizona<br />
<br />
zwickl@email.arizona.edu<br />
<br />
[[File:Zwickl&son.JPG|400px|thumb|right|Derrick and offspring]]<br />
<br />
<br />
Presentation and exercises from 2011:<br />
<br />
'''Lecture: [http://people.ku.edu/~zwickl/Zwickl.WoodsHole2011.final.pdf Zwickl.WoodsHole2011.final.pdf]'''<br />
<br />
'''NOTE: The tutorial bundle linked on the page below contains everything you need - you don't need to download the program separately!'''<br />
<br />
'''GARLI tutorial: [http://phylo.bio.ku.edu/slides/GarliDemo/garliExercise.html Computer exercise]'''<br />
<br />
<br />
==Other GARLI information==<br />
*[http://www.nescent.org/wg_garli/ Extensive documentation wiki]<br />
*[http://garli.googlecode.com/ Program download]<br />
*[http://groups.google.com/group/garli_users/ Google users group]<br />
<br />
==Time at MBL==<br />
Tuesday July 24 - Monday July 30</div>Zwicklhttps://molevol.mbl.edu/index.php?title=Derrick_Zwickl&diff=2127Derrick Zwickl2012-07-26T15:58:43Z<p>Zwickl: </p>
<hr />
<div>University of Arizona<br />
<br />
zwickl@email.arizona.edu<br />
<br />
[[File:Zwickl&son.JPG|300px|thumb|right|Derrick and offspring]]<br />
<br />
<br />
Presentation and exercises from 2011:<br />
<br />
'''Lecture: [http://people.ku.edu/~zwickl/Zwickl.WoodsHole2011.final.pdf Zwickl.WoodsHole2011.final.pdf]'''<br />
<br />
'''NOTE: The tutorial bundle linked on the page below contains everything you need - you don't need to download the program separately!'''<br />
<br />
'''GARLI tutorial: [http://phylo.bio.ku.edu/slides/GarliDemo/garliExercise.html Computer exercise]'''<br />
<br />
<br />
==Other GARLI information==<br />
*[http://www.nescent.org/wg_garli/ Extensive documentation wiki]<br />
*[http://garli.googlecode.com/ Program download]<br />
*[http://groups.google.com/group/garli_users/ Google users group]<br />
<br />
==Time at MBL==<br />
Tuesday July 24 - Monday July 30</div>Zwicklhttps://molevol.mbl.edu/index.php?title=Software&diff=2126Software2012-07-26T15:57:44Z<p>Zwickl: </p>
<hr />
<div>* [http://beast.bio.ed.ac.uk/Main_Page BEAST] - software package includes BEAST, BEAUti, LogCombiner, TreeAnnotator<br />
* [http://blast.ncbi.nlm.nih.gov/Blast.cgi Blast]<br />
* [http://people.sc.fsu.edu/~pbeerli/bugs_in_a_box.tar.gz Bugs in a Box]: A Macintosh program, with its Python source code, that shows the coalescence process (it does not yet draw a tree).<br />
* [http://fasta.bioch.virginia.edu/fasta_www2/fasta_list2.shtml Fasta]<br />
* [http://tree.bio.ed.ac.uk/software/figtree/ FigTree]<br />
* GARLI<br />
**[http://www.nescent.org/wg_garli/ Support page]<br />
**[http://garli.googlecode.com/ Program download]<br />
**[http://groups.google.com/group/garli_users/ Google users group]<br />
**[http://phylo.bio.ku.edu/slides/GarliDemo/garliExercise.html Workshop tutorial] '''(NOTE: The tutorial bundle linked on this page contains everything you need - you don't need to download the program separately!)'''<br />
* LAMARC: Demonstration and tutorial on August 2nd ([[Peter Beerli]])<br />
**[http://evolution.genetics.washington.edu/lamarc/index.html Lamarc] main website: Download and manual<br />
**[[Lamarc tutorial]]<br />
* [http://mafft.cbrc.jp/alignment/software/ MAFFT]<br />
* MIGRATE<br />
** [http://popgen.sc.fsu.edu Migrate main website]: Download, Manual, Blog/Tutorials, Information on speed, citation of MIGRATE in the literature.<br />
** [[Migrate tutorial]]: Tutorial for the 2012 course (the same [http://popgen.sc.fsu.edu/Migrate/Tutorials/Entries/2010/7/12_Day_of_longboarding.html tutorial] is on the [http://popgen.sc.fsu.edu/Migrate/Tutorials/Tutorials.html Migrate tutorial website])<br />
** [http://groups.google.com/group/migrate-support?lnk=iggc Migrate support Google group]<br />
* [http://www.mrbayes.net MrBayes]<br />
* [http://abacus.gene.ucl.ac.uk/software/paml.html PAML]<br />
* [http://people.sc.fsu.edu/~dswofford/paup_test PAUP*]<br />
* [http://www.stat.osu.edu/~lkubatko/software/STEM/ STEM]<br />
* [http://tree.bio.ed.ac.uk/software/tracer/ Tracer]</div>Zwicklhttps://molevol.mbl.edu/index.php?title=Software&diff=2120Software2012-07-26T15:51:37Z<p>Zwickl: </p>
<hr />
<div>* [http://beast.bio.ed.ac.uk/Main_Page BEAST] - software package includes BEAST, BEAUti, LogCombiner, TreeAnnotator<br />
* [http://blast.ncbi.nlm.nih.gov/Blast.cgi Blast]<br />
* [http://people.sc.fsu.edu/~pbeerli/bugs_in_a_box.tar.gz Bugs in a Box]: A Macintosh program, with its Python source code, that shows the coalescence process (it does not yet draw a tree).<br />
* [http://fasta.bioch.virginia.edu/fasta_www2/fasta_list2.shtml Fasta]<br />
* [http://tree.bio.ed.ac.uk/software/figtree/ FigTree]<br />
* GARLI<br />
**[http://www.nescent.org/wg_garli/ Support page]<br />
**[http://garli.googlecode.com/ Program download]<br />
**[http://phylo.bio.ku.edu/slides/GarliDemo/garliExercise.html Workshop tutorial] '''(NOTE: The tutorial bundle linked on this page contains everything you need - you don't need to download the program separately!)'''<br />
* LAMARC: Demonstration and tutorial on August 2nd ([[Peter Beerli]])<br />
**[http://evolution.genetics.washington.edu/lamarc/index.html Lamarc] main website: Download and manual<br />
**[[Lamarc tutorial]]<br />
* [http://mafft.cbrc.jp/alignment/software/ MAFFT]<br />
* MIGRATE<br />
** [http://popgen.sc.fsu.edu Migrate main website]: Download, Manual, Blog/Tutorials, Information on speed, citation of MIGRATE in the literature.<br />
** [[Migrate tutorial]]: Tutorial for the 2012 course (the same [http://popgen.sc.fsu.edu/Migrate/Tutorials/Entries/2010/7/12_Day_of_longboarding.html tutorial] is on the [http://popgen.sc.fsu.edu/Migrate/Tutorials/Tutorials.html Migrate tutorial website])<br />
** [http://groups.google.com/group/migrate-support?lnk=iggc Migrate support Google group]<br />
* [http://www.mrbayes.net MrBayes]<br />
* [http://abacus.gene.ucl.ac.uk/software/paml.html PAML]<br />
* [http://people.sc.fsu.edu/~dswofford/paup_test PAUP*]<br />
* [http://www.stat.osu.edu/~lkubatko/software/STEM/ STEM]<br />
* [http://tree.bio.ed.ac.uk/software/tracer/ Tracer]</div>Zwicklhttps://molevol.mbl.edu/index.php?title=Derrick_Zwickl&diff=2119Derrick Zwickl2012-07-26T15:49:21Z<p>Zwickl: </p>
<hr />
<div>University of Arizona<br />
<br />
zwickl@email.arizona.edu<br />
<br />
[[File:Zwickl&son.JPG|300px|thumb|right|Derrick and offspring]]<br />
<br />
<br />
Presentation and exercises from 2011:<br />
<br />
'''Lecture: [http://people.ku.edu/~zwickl/Zwickl.WoodsHole2011.final.pdf Zwickl.WoodsHole2011.final.pdf]'''<br />
<br />
'''NOTE: The tutorial bundle linked on the page below contains everything you need - you don't need to download the program separately!'''<br />
<br />
'''GARLI tutorial: [http://phylo.bio.ku.edu/slides/GarliDemo/garliExercise.html Computer exercise]'''<br />
<br />
<br />
==Other GARLI information==<br />
*[http://www.nescent.org/wg_garli/ Extensive documentation wiki]<br />
*[http://garli.googlecode.com/ Program download]<br />
<br />
==Time at MBL==<br />
Tuesday July 24 - Monday July 30</div>Zwicklhttps://molevol.mbl.edu/index.php?title=Derrick_Zwickl&diff=2098Derrick Zwickl2012-07-25T21:00:29Z<p>Zwickl: </p>
<hr />
<div>University of Arizona<br />
<br />
zwickl@email.arizona.edu<br />
<br />
[[File:Zwickl&son.JPG|300px|thumb|right|Derrick and offspring]]<br />
<br />
<br />
Presentation and exercises from 2011:<br />
<br />
'''Lecture: [http://people.ku.edu/~zwickl/Zwickl.WoodsHole2011.final.pdf Zwickl.WoodsHole2011.final.pdf]'''<br />
<br />
'''GARLI tutorial: [http://phylo.bio.ku.edu/slides/GarliDemo/garliExercise.html Computer exercise]'''<br />
<br />
==GARLI information==<br />
*[http://www.nescent.org/wg_garli/ Extensive documentation wiki]<br />
*[http://garli.googlecode.com/ Program download]<br />
<br />
==Time at MBL==<br />
Tuesday July 24 - Monday July 30</div>Zwickl