Difference between revisions of "Garli using partitioned models"

From MolEvol
(Created page with "==Configuring a partitioned analysis== Not surprisingly, setting up partitioned models is more complicated than normal GARLI usage, since you need to tell the program how to d...")
 
 
Line 9: Line 9:
 
This version requires NEXUS formatted datafiles, and the partitioning is specified via standard NEXUS commands appearing in a sets or assumptions block in the same file as the data matrix.  The setup of the actual models will come later.  For a dataset with 2178 characters in a single data or characters block, it would look like this:
 
This version requires NEXUS formatted datafiles, and the partitioning is specified via standard NEXUS commands appearing in a sets or assumptions block in the same file as the data matrix.  The setup of the actual models will come later.  For a dataset with 2178 characters in a single data or characters block, it would look like this:
 
<pre>
 
<pre>
begin sets;
+
begin sets;
  charset ND2 = 1-726;
+
charset ND2 = 1-726;
  charset rbcl = 727-1452;
+
charset rbcl = 727-1452;
    charset 16S = 1453-2178;
+
charset 16S = 1453-2178;
    charpartition byGene = chunk1:ND2, chunk2:rbcl, chunk3:16S;  
+
charpartition byGene = chunk1:ND2, chunk2:rbcl, chunk3:16S;  
  
      [you could also put characters exclusions here by removing the []'s from the line below]
+
[you could also put characters exclusions here by removing the []'s from the line below]
      [note that the excluded sites should still appear in the charpartition, however]
+
[note that the excluded sites should still appear in the charpartition, however]
        [exset * myexclusions = 600-800, 850, 900-100;]
+
[exset * myexclusions = 600-800, 850, 900-100;]
        end;
+
end;
        </pre>
+
</pre>
  
        The above block would divide up the sites into three sets of 726 characters each.   
+
The above block would divide up the sites into three sets of 726 characters each.   
  
        To put charsets ND2 and rbcl in a single partition subset, the charpartition command would look like this
+
To put charsets ND2 and rbcl in a single partition subset, the charpartition command would look like this
        <pre>
+
<pre>
          charpartition bySites = chunk1:ND2 rbcl, chunk2:16S;  
+
charpartition bySites = chunk1:ND2 rbcl, chunk2:16S;  
          </pre>
+
</pre>
          Note the space rather than comma between ND2 and rbcl.
+
Note the space rather than comma between ND2 and rbcl.
  
  
          The names are unimportant here.  The general format is:
+
The names are unimportant here.  The general format is:
  
          charset <charset name> = <list or range of sites>;
+
charset <charset name> = <list or range of sites>;
            charset <charset name> = <list or range of sites>;
+
charset <charset name> = <list or range of sites>;
            charpartition <charpartition name> = (cont.)
+
charpartition <charpartition name> = (cont.)
                <partition subset 1 name>:<sites or charset making up 1st subset>, (cont.)
+
<partition subset 1 name>:<sites or charset making up 1st subset>, (cont.)
                    <partition subset 2 name>:<sites or charset making up 1st subset>, <etc>;
+
<partition subset 2 name>:<sites or charset making up 1st subset>, <etc>;
  
                    To easily specify charsets that divide characters up by codon position, do this:
+
To easily specify charsets that divide characters up by codon position, do this:
                      charset 1stpos = 1-2178\3;
+
charset 1stpos = 1-2178\3;
                      charset 2ndpos = 2-2178\3;
+
charset 2ndpos = 2-2178\3;
                        charset 3rdpos = 3-2178\3;
+
charset 3rdpos = 3-2178\3;
  
                        Note that if a charpartition appears, GARLI will AUTOMATICALLY apply a partitioned model.  If you don't want that for some runs, remove or comment out (surround it with [ ]) the charpartition command.
+
Note that if a charpartition appears, GARLI will AUTOMATICALLY apply a partitioned model.  If you don't want that for some runs, remove or comment out (surround it with [ ]) the charpartition command.
  
                        Also note that GARLI will also automatically partition if it sees multiple characters blocks, so that is an alternate way to do this (instead of the charpartition).
+
Also note that GARLI will also automatically partition if it sees multiple characters blocks, so that is an alternate way to do this (instead of the charpartition).
  
                        ===Specifying the models===
+
===Specifying the models===
                        DO SOME SORT OF MODEL TESTING!  The parameter estimates under partitioned models are currently somewhat erratic if the models are over-parameterized.  Use ModelTest or some other means for finding the best model for each data subset.  Note that the best model for each subset separately is not necessarily the best when they are combined in a partitioned model, but they will give a useful measure of which parameters are justified in each subset.
+
DO SOME SORT OF MODEL TESTING!  The parameter estimates under partitioned models are currently somewhat erratic if the models are over-parameterized.  Use ModelTest or some other means for finding the best model for each data subset.  Note that the best model for each subset separately is not necessarily the best when they are combined in a partitioned model, but they will give a useful measure of which parameters are justified in each subset.
  
                        As usual for GARLI, the models are specified in the configuration file.  If you aren't familiar with the normal way that models are configured in GARLI, see the general info in the manual '''[[GARLI_Configuration_Settings#Model_specification_settings|here]]''', and FAQ entry '''[[FAQ#MODELTEST_told_me_to_use_model_X._How_do_I_set_that_up_in_GARLI.3F|here]]'''.
+
As usual for GARLI, the models are specified in the configuration file.  If you aren't familiar with the normal way that models are configured in GARLI, see the general info in the manual '''[[GARLI_Configuration_Settings#Model_specification_settings|here]]''', and FAQ entry '''[[FAQ#MODELTEST_told_me_to_use_model_X._How_do_I_set_that_up_in_GARLI.3F|here]]'''.
  
                        There are two new configuration entries that relate to partitioned models:
+
There are two new configuration entries that relate to partitioned models:
                        linkmodels = 0 or 1
+
linkmodels = 0 or 1
                          subsetspecificrates = 0 or 1
+
subsetspecificrates = 0 or 1
  
                          '''linkmodels''' means to use a single set of model parameters for all subsets.
+
'''linkmodels''' means to use a single set of model parameters for all subsets.
  
                          '''subsetspecificrates''' means to infer overall rate multipliers for each data subset.  This is equivalent to *prset ratepr=variable* in MrBayes
+
'''subsetspecificrates''' means to infer overall rate multipliers for each data subset.  This is equivalent to *prset ratepr=variable* in MrBayes
  
                          So, there are various combinations here:
+
So, there are various combinations here:
                          {| border="1"
+
{| border="1"
                          |-
+
|-
                          !'''linkmodels'''!!'''subsetspecificrates'''!!'''meaning '''
+
!'''linkmodels'''!!'''subsetspecificrates'''!!'''meaning '''
                          |-
+
|-
                          |align="center" | 0 || align="center" | 0 || different models, branch lengths equal
+
|align="center" | 0 || align="center" | 0 || different models, branch lengths equal
                          |-
+
|-
                          |align="center" | 0 || align="center" | 1 || different models, different subset rates  
+
|align="center" | 0 || align="center" | 1 || different models, different subset rates  
                          |-
+
|-
                          |align="center" | 1 || align="center" | 0 || single model, one set of branch lengths (equivalent to non-partitioned analysis)
+
|align="center" | 1 || align="center" | 0 || single model, one set of branch lengths (equivalent to non-partitioned analysis)
                          |-
+
|-
                          |align="center" | 1 || align="center" | 1 || single model, different subset rates (like site-specific rates model in PAUP*)
+
|align="center" | 1 || align="center" | 1 || single model, different subset rates (like site-specific rates model in PAUP*)
                          |}
+
|}
  
                          The normal model configuration entries are the following, with the defaults in *bold*:
+
The normal model configuration entries are the following, with the defaults in *bold*:
  
                          datatype = '''nucleotide''', aminoacid, codon-aminoacid or codon
+
datatype = '''nucleotide''', aminoacid, codon-aminoacid or codon
                            ratematrix = '''6rate''', 2rate, 1rate, or other matrix spec. like this :( a, b, c, d, e, f )
+
ratematrix = '''6rate''', 2rate, 1rate, or other matrix spec. like this :( a, b, c, d, e, f )
                            statefrequencies = '''estimate''', equal, empirical, (+others for aminoacids or codons.  See manual)
+
statefrequencies = '''estimate''', equal, empirical, (+others for aminoacids or codons.  See manual)
                              ratehetmodel = '''gamma''', none
+
ratehetmodel = '''gamma''', none
                              numratecats = '''4''', 1-20 (must be 1 if ratehetmodel = none, must be > 1 if ratehetmodel = gamma)  
+
numratecats = '''4''', 1-20 (must be 1 if ratehetmodel = none, must be > 1 if ratehetmodel = gamma)  
                                invariantsites = '''estimate''', none
+
invariantsites = '''estimate''', none
  
                                If you leave these as is, set linkmodels = 0 and have a charpartition defined in the datafile, each subset will automatically be assigned a separate unlinked version of GTR+I+G.  In that case there is nothing else to be done.  You can start your run.
+
If you leave these as is, set linkmodels = 0 and have a charpartition defined in the datafile, each subset will automatically be assigned a separate unlinked version of GTR+I+G.  In that case there is nothing else to be done.  You can start your run.
  
                                If you want different models for each subset you need to add a set of model settings for each, with a specific heading name in []'s.  The headings need to be [model1], [model2], etc., and are assigned to the subsets in order.  The number of configuration sets must match the number of data subsets.
+
If you want different models for each subset you need to add a set of model settings for each, with a specific heading name in []'s.  The headings need to be [model1], [model2], etc., and are assigned to the subsets in order.  The number of configuration sets must match the number of data subsets.
  
                                For example, the following would assign the GTR+G and HKY models to the first and second data subsets.
+
For example, the following would assign the GTR+G and HKY models to the first and second data subsets.
                                <pre>
+
<pre>
                                [model1]
+
[model1]
                                  datatype = nucleotide
+
datatype = nucleotide
                                  ratematrix = 6rate
+
ratematrix = 6rate
                                    statefrequencies = estimate
+
statefrequencies = estimate
                                    ratehetmodel = gamma
+
ratehetmodel = gamma
                                      numratecats = 4
+
numratecats = 4
                                      invariantsites = none
+
invariantsites = none
  
                                        [model2]
+
[model2]
                                        datatype = nucleotide
+
datatype = nucleotide
                                          ratematrix = 2rate
+
ratematrix = 2rate
                                          statefrequencies = estimate
+
statefrequencies = estimate
                                            ratehetmodel = none
+
ratehetmodel = none
                                            numratecats = 1
+
numratecats = 1
                                              invariantsites = none
+
invariantsites = none
                                              </pre>
+
</pre>
  
                                              These should appear in place of the normal set of model configuration settings, and should be placed just before the [master] heading of the configuration file.
+
These should appear in place of the normal set of model configuration settings, and should be placed just before the [master] heading of the configuration file.
  
                                              If you're not sure how to specify a given model, see the FAQ entry '''[[FAQ#MODELTEST_told_me_to_use_model_X._How_do_I_set_that_up_in_GARLI.3F|here]]'''.
+
If you're not sure how to specify a given model, see the FAQ entry '''[[FAQ#MODELTEST_told_me_to_use_model_X._How_do_I_set_that_up_in_GARLI.3F|here]]'''.
  
                                              NOTE THAT THE BRACKETED LINES BEFORE EACH MODEL DESCRIPTION ARE '''NOT''' COMMENTS, AND MUST APPEAR AS ABOVE WITH CONSECUTIVE MODEL NUMBERING STARTING AT ONE!
+
NOTE THAT THE BRACKETED LINES BEFORE EACH MODEL DESCRIPTION ARE '''NOT''' COMMENTS, AND MUST APPEAR AS ABOVE WITH CONSECUTIVE MODEL NUMBERING STARTING AT ONE!
  
                                              ===Choosing partitioning schemes===
+
===Choosing partitioning schemes===
                                              ====How many partition subsets can I use?====
+
====How many partition subsets can I use?====
                                              There is no hard limit that is enforced on the number of subsets that you '''''can''''' use.  I've done more than 60 myself.  See the below for considerations about how many you '''''should''''' use.
+
There is no hard limit that is enforced on the number of subsets that you '''''can''''' use.  I've done more than 60 myself.  See the below for considerations about how many you '''''should''''' use.
  
                                              ====What do I need to consider when choosing models and a partitioning scheme?====
+
====What do I need to consider when choosing models and a partitioning scheme?====
                                              That is a difficult question, and one that is not well investigated in ML phylogenetics (as opposed to Bayesian phylogenetics).  I definitely suggest that you do not partition overly finely, as the likelihood surface becomes difficult to optimize.  Keep in mind that partitioning finely may create subsets with very few changes, and therefore little signal.  This makes the parameter likelihood surfaces very flat and difficult to optimize.  This is particularly true when bootstrapping, which can further reduce signal by creating some resampled datasets with even fewer variable sites.
+
That is a difficult question, and one that is not well investigated in ML phylogenetics (as opposed to Bayesian phylogenetics).  I definitely suggest that you do not partition overly finely, as the likelihood surface becomes difficult to optimize.  Keep in mind that partitioning finely may create subsets with very few changes, and therefore little signal.  This makes the parameter likelihood surfaces very flat and difficult to optimize.  This is particularly true when bootstrapping, which can further reduce signal by creating some resampled datasets with even fewer variable sites.
  
                                              '''NOTE''' Do not assume that the partitioning scheme and model choice method that you use in MrBayes or another Bayesian method are appropriate in an maximum likelihood context!  This is primarily because the Bayesian way to deal with nuisance parameters is to marginalize over them, meaning to account for the shape of the entire likelihood surface.  This effectively integrates out uncertainty in parameter estimates, even if the likelihood surface is very flat and those estimates are very uncertain.   
+
'''NOTE''' Do not assume that the partitioning scheme and model choice method that you use in MrBayes or another Bayesian method are appropriate in an maximum likelihood context!  This is primarily because the Bayesian way to deal with nuisance parameters is to marginalize over them, meaning to account for the shape of the entire likelihood surface.  This effectively integrates out uncertainty in parameter estimates, even if the likelihood surface is very flat and those estimates are very uncertain.   
  
                                              In contrast, ML analyses seek to maximize the likelihood with respect to nuisance parameters.  If the likelihood surface is very flat and suggests that a parameter value could nearly equally lie between a value of 1.0 and 3.0, an ML method will still return the value at the very peak of the distribution (lets say 2.329), even if it is only a fraction more likely than surrounding values.  Another reason that Bayesian methods can be more appropriate for analyzing data with little signal is that prior distributions can be used to provide some outside information and keep parameter estimates reasonable.  However, informative prior estimates are not used that often in standard phylogenetic analyses.
+
In contrast, ML analyses seek to maximize the likelihood with respect to nuisance parameters.  If the likelihood surface is very flat and suggests that a parameter value could nearly equally lie between a value of 1.0 and 3.0, an ML method will still return the value at the very peak of the distribution (lets say 2.329), even if it is only a fraction more likely than surrounding values.  Another reason that Bayesian methods can be more appropriate for analyzing data with little signal is that prior distributions can be used to provide some outside information and keep parameter estimates reasonable.  However, informative prior estimates are not used that often in standard phylogenetic analyses.
  
                                              ====Ok, then how should I partition and choose models in practice?====
+
====Ok, then how should I partition and choose models in practice?====
                                              '''NOTE:''' The following assumes that subset specific rates ARE being estimated e.g., subsetspecificrates = 1.
+
'''NOTE:''' The following assumes that subset specific rates ARE being estimated e.g., subsetspecificrates = 1.
  
                                              Consider the following procedure that I've used, from a real dataset of 2 genes.  It is fairly complicated, but I think as statistically rigorous as can be expected.  We'll assume that the smallest subsets will be by codon position, thus there are 6 potential subsets.  We'd like to find the most appropriate scheme to use for analyzing each gene separately, as well as both concatenated.  But what model should be chosen for the subsets, and how should the subsets be constructed?   
+
Consider the following procedure that I've used, from a real dataset of 2 genes.  It is fairly complicated, but I think as statistically rigorous as can be expected.  We'll assume that the smallest subsets will be by codon position, thus there are 6 potential subsets.  We'd like to find the most appropriate scheme to use for analyzing each gene separately, as well as both concatenated.  But what model should be chosen for the subsets, and how should the subsets be constructed?   
  
                                              1. First, construct all reasonable sets of sites for model testing.  This amounts to:
+
1. First, construct all reasonable sets of sites for model testing.  This amounts to:
                                              :* Each codon position of each gene individually (6)
+
:* Each codon position of each gene individually (6)
                                              :* The full concatenated alignment (1)
+
:* The full concatenated alignment (1)
                                              :* The full sequence of each gene (2)
+
:* The full sequence of each gene (2)
                                              :* The first and second positions combined for each gene (2)
+
:* The first and second positions combined for each gene (2)
                                              :* The first, second and third positions pooled across the genes (3)
+
:* The first, second and third positions pooled across the genes (3)
                                              Note that this list doesn't need to be exhaustive.  I've omitted various combinations that I find unlikely ''a priori'', for example a combination of first and third positions.
+
Note that this list doesn't need to be exhaustive.  I've omitted various combinations that I find unlikely ''a priori'', for example a combination of first and third positions.
  
  
                                              2. Now, use unpartitioned models and the AIC criterion with ModelTest or a similar procedure to chose the "best" model for each subset above.  These will be the models applied to each of the subsets as we move to partitioned models.  The following were the results in my case.  If you aren't very familiar with the typical models, the information here: ('''[[GARLI_Configuration_Settings#Model_specification_settings|Model configuration]]''') and some of the items here: ('''[[FAQ#Model_choices| FAQ:Model choices]]''') may be helpful.
+
2. Now, use unpartitioned models and the AIC criterion with ModelTest or a similar procedure to chose the "best" model for each subset above.  These will be the models applied to each of the subsets as we move to partitioned models.  The following were the results in my case.  If you aren't very familiar with the typical models, the information here: ('''[[GARLI_Configuration_Settings#Model_specification_settings|Model configuration]]''') and some of the items here: ('''[[FAQ#Model_choices| FAQ:Model choices]]''') may be helpful.
  
  
                                              {|border="1"
+
{|border="1"
                                              |-bgcolor=grey
+
|-bgcolor=grey
                                              !  width="120" | Alignment
+
!  width="120" | Alignment
                                              !  width="120" | Full
+
!  width="120" | Full
                                              !  width="120" | 1st Pos
+
!  width="120" | 1st Pos
                                              !  width="120" | 2nd Pos
+
!  width="120" | 2nd Pos
                                              !  width="120" | 3rd Pos
+
!  width="120" | 3rd Pos
                                              !  width="120" | 1st Pos + 2nd Pos
+
!  width="120" | 1st Pos + 2nd Pos
                                              |-
+
|-
                                              | align="center" | Gene 1 || align="center" | GTRIG || align="center" | GTRIG || align="center" | TVMIG || align="center" | TVMIG || align="center" | TVMIG
+
| align="center" | Gene 1 || align="center" | GTRIG || align="center" | GTRIG || align="center" | TVMIG || align="center" | TVMIG || align="center" | TVMIG
                                              |-
+
|-
                                              | align="center" | Gene 2 || align="center" | GTRIG || align="center" | GTRIG || align="center" | GTRIG || align="center" | TVMIG || align="center" | TVMIG
+
| align="center" | Gene 2 || align="center" | GTRIG || align="center" | GTRIG || align="center" | GTRIG || align="center" | TVMIG || align="center" | TVMIG
                                              |-
+
|-
                                              | align="center" | Concat || align="center" | GTRIG || align="center" | GTRIG || align="center" | GTRIG || align="center" | TVMIG || align="center" | TVMIG
+
| align="center" | Concat || align="center" | GTRIG || align="center" | GTRIG || align="center" | GTRIG || align="center" | TVMIG || align="center" | TVMIG
                                              |}
+
|}
  
  
                                              3. Now, create partitioning schemes and configure GARLI partitioned analyses with each subset using the model chosen for it in the previous step.  When choosing the scheme for Gene 1 analyses, the combinations would be each position separately, which I'll denote (1)(2)(3), the three positions together (i.e., unpartitioned or (123) ), and the first and second positions grouped, (12)(3).  For the concatenated alignment there are many more combinations, not all of which need to necessarily be considered.  The models assigned within subsets are from the above table, for example, when partitioning Gene 1 by codon position, the chosen models were GTRIG, TVMIG and TVMIG for the subsets of first, second and third positions respectively.
+
3. Now, create partitioning schemes and configure GARLI partitioned analyses with each subset using the model chosen for it in the previous step.  When choosing the scheme for Gene 1 analyses, the combinations would be each position separately, which I'll denote (1)(2)(3), the three positions together (i.e., unpartitioned or (123) ), and the first and second positions grouped, (12)(3).  For the concatenated alignment there are many more combinations, not all of which need to necessarily be considered.  The models assigned within subsets are from the above table, for example, when partitioning Gene 1 by codon position, the chosen models were GTRIG, TVMIG and TVMIG for the subsets of first, second and third positions respectively.
  
                                              Now run full GARLI analyses with four search reps (or at least two) for each of these partitioning schemes, and note the best log-likelihood across the search replicates for each partitioning scheme.  Note that you could also do analyses giving GARLI a fixed tree to optimize on, which is actually the more common way that model choice is done.  See [[FAQ#How_do_I_fully_constrain_.28i.e..2C_fix.29_the_tree_topology.3F | here]] for information on fixing the tree topology.
+
Now run full GARLI analyses with four search reps (or at least two) for each of these partitioning schemes, and note the best log-likelihood across the search replicates for each partitioning scheme.  Note that you could also do analyses giving GARLI a fixed tree to optimize on, which is actually the more common way that model choice is done.  See [[FAQ#How_do_I_fully_constrain_.28i.e..2C_fix.29_the_tree_topology.3F | here]] for information on fixing the tree topology.
  
  
                                              4. Now, note the number of model parameters that were estimated in each partitioning scheme.  For example, when partitioning gene 1 by codon position, the chosen models were GTRIG, TVMIG and TVMIG, which have 10, 9 and 9 free parameters respectively.  (GTRIG has 10 parameters because 5 free relative rates + 3 free base frequencies + 1 invariable sites parameter + 1 alpha shape parameter of the gamma distribution = 10). In addition, subset specific rate multipliers were estimated, which adds (#subsets - 1) additional parameters, bringing the total to 30.  Use the log-likelihood score and parameter count to calculate the AIC score for each partitioning scheme as  
+
4. Now, note the number of model parameters that were estimated in each partitioning scheme.  For example, when partitioning gene 1 by codon position, the chosen models were GTRIG, TVMIG and TVMIG, which have 10, 9 and 9 free parameters respectively.  (GTRIG has 10 parameters because 5 free relative rates + 3 free base frequencies + 1 invariable sites parameter + 1 alpha shape parameter of the gamma distribution = 10). In addition, subset specific rate multipliers were estimated, which adds (#subsets - 1) additional parameters, bringing the total to 30.  Use the log-likelihood score and parameter count to calculate the AIC score for each partitioning scheme as  
                                              AIC = 2 x (# parameters - lnL).
+
AIC = 2 x (# parameters - lnL).
  
                                              5. Finally, for each alignment compare the AIC scores of each partitioning scheme and choose the lowest value as the best.  See the below table for my results.  In this example (GTRIG)(TVMIG)(TVMIG) was chosen for first gene and (GTRIG)(GTRIG)(TVMIG) was chosen for the second gene.  The concatenated alignment result was less obvious ''a priori'', with 4 subsets. First positions shared a subset with GTRIG, second positions each had their own model with TVMIG for gene 1 and GTRIG for gene 2, and the third positions shared a subset with TVMIG.  
+
5. Finally, for each alignment compare the AIC scores of each partitioning scheme and choose the lowest value as the best.  See the below table for my results.  In this example (GTRIG)(TVMIG)(TVMIG) was chosen for first gene and (GTRIG)(GTRIG)(TVMIG) was chosen for the second gene.  The concatenated alignment result was less obvious ''a priori'', with 4 subsets. First positions shared a subset with GTRIG, second positions each had their own model with TVMIG for gene 1 and GTRIG for gene 2, and the third positions shared a subset with TVMIG.  
                                              The AIC scores are summarized in the following table:
+
The AIC scores are summarized in the following table:
  
                                              {|border="1"
+
{|border="1"
                                              |-bgcolor=grey
+
|-bgcolor=grey
                                              !  width="100" | Alignment
+
!  width="100" | Alignment
                                              !  width="100" | Partition
+
!  width="100" | Partition
                                              !  width="100" | lnL
+
!  width="100" | lnL
                                              !  width="100" |Parameters
+
!  width="100" |Parameters
                                              !  width="100" |AIC
+
!  width="100" |AIC
                                              |-
+
|-
                                              |align="center"| Gene 1 || align="center" | (1)(2)(3) || align="center" | -22698.82 || align="center" | 30 || align="center" | 45457.64
+
|align="center"| Gene 1 || align="center" | (1)(2)(3) || align="center" | -22698.82 || align="center" | 30 || align="center" | 45457.64
                                              |-
+
|-
                                              |align="center"| || align="center" | (12)(3) || align="center" | -22817.84 || align="center" | 19 || align="center" | 45673.68
+
|align="center"| || align="center" | (12)(3) || align="center" | -22817.84 || align="center" | 19 || align="center" | 45673.68
                                              |-
+
|-
                                              |align="center"| || align="center" |(123) || align="center" | -23524.92 || align="center" | 10 || align="center" | 47069.84
+
|align="center"| || align="center" |(123) || align="center" | -23524.92 || align="center" | 10 || align="center" | 47069.84
                                              |-
+
|-
                                              |align="center"| || align="center" |  || align="center" |  || align="center" | ||
+
|align="center"| || align="center" |  || align="center" |  || align="center" | ||
                                              |-
+
|-
                                              |align="center"| Gene 2 || align="center" | (1)(2)(3) || align="center" | -24537.93 || align="center" | 31 || align="center" | 49137.85
+
|align="center"| Gene 2 || align="center" | (1)(2)(3) || align="center" | -24537.93 || align="center" | 31 || align="center" | 49137.85
                                              |-
+
|-
                                              |align="center"| || align="center" | (12)(3) || align="center" | -24639.11 || align="center" | 19 || align="center" | 49316.22
+
|align="center"| || align="center" | (12)(3) || align="center" | -24639.11 || align="center" | 19 || align="center" | 49316.22
                                              |-
+
|-
                                              |align="center"| || align="center" | (123) || align="center" | -25353.78 || align="center" | 10 || align="center" | 50727.55
+
|align="center"| || align="center" | (123) || align="center" | -25353.78 || align="center" | 10 || align="center" | 50727.55
                                              |-
+
|-
                                              |align="center"| || align="center" |  || align="center" |  || align="center" | ||
+
|align="center"| || align="center" |  || align="center" |  || align="center" | ||
                                              |-
+
|-
                                              |align="center"|Concat || align="center" | (11)(2)(2)(33) || align="center" | -47384.54 || align="center" | 42 || align="center" | 94853.08
+
|align="center"|Concat || align="center" | (11)(2)(2)(33) || align="center" | -47384.54 || align="center" | 42 || align="center" | 94853.08
                                              |-
+
|-
                                              |align="center"| || align="center" | (1)(2)(1)(2)(33) || align="center" | -47375.30 || align="center" | 53 || align="center" | 94856.60
+
|align="center"| || align="center" | (1)(2)(1)(2)(33) || align="center" | -47375.30 || align="center" | 53 || align="center" | 94856.60
                                              |-
+
|-
                                              |align="center"| || align="center" | (1)(2)(3)(1)(2)(3) || align="center" | -47373.34 || align="center" | 62 || align="center" | 94870.69
+
|align="center"| || align="center" | (1)(2)(3)(1)(2)(3) || align="center" | -47373.34 || align="center" | 62 || align="center" | 94870.69
                                              |-
+
|-
                                              |align="center"| || align="center" | (11)(22)(33) || align="center" | -47408.21 || align="center" | 31 || align="center" | 94878.43
+
|align="center"| || align="center" | (11)(22)(33) || align="center" | -47408.21 || align="center" | 31 || align="center" | 94878.43
                                              |-
+
|-
                                              |align="center"| || align="center" | (1212)(33) || align="center" | -47614.47 || align="center" | 19 || align="center" | 95266.94
+
|align="center"| || align="center" | (1212)(33) || align="center" | -47614.47 || align="center" | 19 || align="center" | 95266.94
                                              |-
+
|-
                                              |align="center"| || align="center" | 123123 || align="center" | -49038.37 || align="center" | 10 || align="center" | 98096.75
+
|align="center"| || align="center" | 123123 || align="center" | -49038.37 || align="center" | 10 || align="center" | 98096.75
                                              |-
+
|-
                                              |align="center"| || align="center" | (123)(123) || align="center" | -49031.42 || align="center" | 21 || align="center" | 98104.83
+
|align="center"| || align="center" | (123)(123) || align="center" | -49031.42 || align="center" | 21 || align="center" | 98104.83
                                              |}
+
|}
  
                                              ====Further thoughts====
+
====Further thoughts====
                                              The above procedure is somewhat complicated (I may write some scripts to somewhat automate it at some point). However, you can now be confident that you've evaluated and and possibly chosen a better configuration than you might have if you had partitioned ''a priori''.  Note that in the above example the best model for the concatenated alignment, (11)(2)(2)(33), has '''''20 fewer parameters''''' than the one that most people would have chosen ''a priori'', (1)(2)(3)(1)(2)(3).
+
The above procedure is somewhat complicated (I may write some scripts to somewhat automate it at some point). However, you can now be confident that you've evaluated and and possibly chosen a better configuration than you might have if you had partitioned ''a priori''.  Note that in the above example the best model for the concatenated alignment, (11)(2)(2)(33), has '''''20 fewer parameters''''' than the one that most people would have chosen ''a priori'', (1)(2)(3)(1)(2)(3).
                                              Some caveats:
+
Some caveats:
                                              *This methodology and example only includes two genes!  Many datasets nowadays have many more.  Exhaustively going through the subsets and partitions as above will quickly get out of hand for five or ten genes.  However, you can still work to find good partitioning schemes ''within'' each gene, as was done for the single genes in the example above.  Specifically, after finding the best models for each possible subset of each gene via ModelTest or the like, comparison of partitioning schemes (123), (12)(3) and (123) can be done for each.  I've found that the (12)(3) scheme can work well in cases in which there are only a few second position changes, making parameter optimization difficult when it is analyzed as an independent subset.  You might also do something like the full procedure above on sets of genes that you know '''a priori'' are similar, for example mitochondrial genes, rRNA genes, etc.
+
*This methodology and example only includes two genes!  Many datasets nowadays have many more.  Exhaustively going through the subsets and partitions as above will quickly get out of hand for five or ten genes.  However, you can still work to find good partitioning schemes ''within'' each gene, as was done for the single genes in the example above.  Specifically, after finding the best models for each possible subset of each gene via ModelTest or the like, comparison of partitioning schemes (123), (12)(3) and (123) can be done for each.  I've found that the (12)(3) scheme can work well in cases in which there are only a few second position changes, making parameter optimization difficult when it is analyzed as an independent subset.  You might also do something like the full procedure above on sets of genes that you know '''a priori'' are similar, for example mitochondrial genes, rRNA genes, etc.
  
                                              *This discussion ignores the ability to "link" parameters across subsets (which I don't yet have implemented in GALRI).  Surely more parameters could be eliminated by linking.  For example, all of the subsets may have similar base frequencies, and could share a single set of frequency parameters.  If so, three parameters could be eliminated for each subset, which is a significant number.
+
*This discussion ignores the ability to "link" parameters across subsets (which I don't yet have implemented in GALRI).  Surely more parameters could be eliminated by linking.  For example, all of the subsets may have similar base frequencies, and could share a single set of frequency parameters.  If so, three parameters could be eliminated for each subset, which is a significant number.
  
                                              *Some may argue that this is a lot of work for unclear benefit.  This might be true, although reducing the number of free parameters is thought to be statistically beneficial in a general sense, if not specifically for partitioned phylogenetic models.  Doing full searches to find the best model and then using it for more searches may seem wasteful, but remember that the Bayesian analog to this AIC procedure is the comparison of Bayes factors, which also generally require full analyses using each of the competing models.
+
*Some may argue that this is a lot of work for unclear benefit.  This might be true, although reducing the number of free parameters is thought to be statistically beneficial in a general sense, if not specifically for partitioned phylogenetic models.  Doing full searches to find the best model and then using it for more searches may seem wasteful, but remember that the Bayesian analog to this AIC procedure is the comparison of Bayes factors, which also generally require full analyses using each of the competing models.
  
                                              ==Check the output==
+
==Check the output==
                                              Once you've done a run, check the output in the .screen.log file to see that your data were divided and models assigned correctly.  The output is currently very verbose.   
+
Once you've done a run, check the output in the .screen.log file to see that your data were divided and models assigned correctly.  The output is currently very verbose.   
  
                                              First the details of the data partitioning appear.  Check that the total number of characters per subset looks correct.  
+
First the details of the data partitioning appear.  Check that the total number of characters per subset looks correct.  
                                              <pre>
+
<pre>
                                                GARLI partition subset 1
+
GARLI partition subset 1
                                                    CHARACTERS block #1 ("Untitled DATA Block 1")
+
CHARACTERS block #1 ("Untitled DATA Block 1")
                                                          CHARPARTITION subset #1 ("1stpos")
+
CHARPARTITION subset #1 ("1stpos")
                                                              Data read as Nucleotide data, modeled as Nucleotide data
+
Data read as Nucleotide data, modeled as Nucleotide data
  
                                                                    Summary of data or data subset:
+
Summary of data or data subset:
                                                                            11 sequences.
+
11 sequences.
                                                                                  441 constant characters.
+
441 constant characters.
                                                                                          189 parsimony-informative characters.
+
189 parsimony-informative characters.
                                                                                                  96 autapomorphic characters.
+
96 autapomorphic characters.
                                                                                                        726 total characters.
+
726 total characters.
                                                                                                                238 unique patterns in compressed data matrix.
+
238 unique patterns in compressed data matrix.
  
                                                                                                                GARLI partition subset 2
+
GARLI partition subset 2
                                                                                                                      CHARACTERS block #1 ("Untitled DATA Block 1")
+
CHARACTERS block #1 ("Untitled DATA Block 1")
                                                                                                                          CHARPARTITION subset #2 ("2ndpos")
+
CHARPARTITION subset #2 ("2ndpos")
                                                                                                                                Data read as Nucleotide data, modeled as Nucleotide data
+
Data read as Nucleotide data, modeled as Nucleotide data
  
                                                                                                                                    Summary of data or data subset:
+
Summary of data or data subset:
                                                                                                                                            11 sequences.
+
11 sequences.
                                                                                                                                                    528 constant characters.
+
528 constant characters.
                                                                                                                                                          103 parsimony-informative characters.
+
103 parsimony-informative characters.
                                                                                                                                                                  95 autapomorphic characters.
+
95 autapomorphic characters.
                                                                                                                                                                          726 total characters.
+
726 total characters.
                                                                                                                                                                                158 unique patterns in compressed data matrix.
+
158 unique patterns in compressed data matrix.
  
                                                                                                                                                                                  GARLI partition subset 3
+
GARLI partition subset 3
                                                                                                                                                                                      CHARACTERS block #1 ("Untitled DATA Block 1")
+
CHARACTERS block #1 ("Untitled DATA Block 1")
                                                                                                                                                                                            CHARPARTITION subset #3 ("3rdpos")
+
CHARPARTITION subset #3 ("3rdpos")
                                                                                                                                                                                                Data read as Nucleotide data, modeled as Nucleotide data
+
Data read as Nucleotide data, modeled as Nucleotide data
  
                                                                                                                                                                                                      Summary of data or data subset:
+
Summary of data or data subset:
                                                                                                                                                                                                              11 sequences.
+
11 sequences.
                                                                                                                                                                                                                    103 constant characters.
+
103 constant characters.
                                                                                                                                                                                                                            539 parsimony-informative characters.
+
539 parsimony-informative characters.
                                                                                                                                                                                                                                    84 autapomorphic characters.
+
84 autapomorphic characters.
                                                                                                                                                                                                                                          726 total characters.
+
726 total characters.
                                                                                                                                                                                                                                                  549 unique patterns in compressed data matrix.
+
549 unique patterns in compressed data matrix.
                                                                                                                                                                                                                                                  </pre>
+
</pre>
                                                                                                                                                                                                                                                  Then a description of the models and model parameters assigned to each subset appears.  Parameters are at their initial values.  This indicates three models with GTR+G for each:
+
Then a description of the models and model parameters assigned to each subset appears.  Parameters are at their initial values.  This indicates three models with GTR+G for each:
                                                                                                                                                                                                                                                  <pre>
+
<pre>
                                                                                                                                                                                                                                                  MODEL REPORT - Parameters are at their INITIAL values (not yet optimized)
+
MODEL REPORT - Parameters are at their INITIAL values (not yet optimized)
                                                                                                                                                                                                                                                    Model 1
+
Model 1
                                                                                                                                                                                                                                                      Number of states = 4 (nucleotide data)
+
Number of states = 4 (nucleotide data)
                                                                                                                                                                                                                                                        Nucleotide Relative Rate Matrix:    6 rates  
+
Nucleotide Relative Rate Matrix:    6 rates  
                                                                                                                                                                                                                                                            AC = 1.000, AG = 4.000, AT = 1.000, CG = 1.000, CT = 4.000, GT = 1.000
+
AC = 1.000, AG = 4.000, AT = 1.000, CG = 1.000, CT = 4.000, GT = 1.000
                                                                                                                                                                                                                                                              Equilibrium State Frequencies: estimated
+
Equilibrium State Frequencies: estimated
                                                                                                                                                                                                                                                                  (ACGT) 0.3157 0.1746 0.3004 0.2093  
+
(ACGT) 0.3157 0.1746 0.3004 0.2093  
                                                                                                                                                                                                                                                                    Rate Heterogeneity Model:
+
Rate Heterogeneity Model:
                                                                                                                                                                                                                                                                        4 discrete gamma distributed rate categories, alpha param estimated
+
4 discrete gamma distributed rate categories, alpha param estimated
                                                                                                                                                                                                                                                                              0.5000
+
0.5000
                                                                                                                                                                                                                                                                                  Substitution rate categories under this model:
+
Substitution rate categories under this model:
                                                                                                                                                                                                                                                                                        rate    proportion
+
rate    proportion
                                                                                                                                                                                                                                                                                              0.0334    0.2500
+
0.0334    0.2500
                                                                                                                                                                                                                                                                                                    0.2519    0.2500
+
0.2519    0.2500
                                                                                                                                                                                                                                                                                                          0.8203    0.2500
+
0.8203    0.2500
                                                                                                                                                                                                                                                                                                                2.8944    0.2500
+
2.8944    0.2500
  
                                                                                                                                                                                                                                                                                                                Model 2
+
Model 2
                                                                                                                                                                                                                                                                                                                  Number of states = 4 (nucleotide data)
+
Number of states = 4 (nucleotide data)
                                                                                                                                                                                                                                                                                                                    Nucleotide Relative Rate Matrix:    6 rates  
+
Nucleotide Relative Rate Matrix:    6 rates  
                                                                                                                                                                                                                                                                                                                        AC = 1.000, AG = 4.000, AT = 1.000, CG = 1.000, CT = 4.000, GT = 1.000
+
AC = 1.000, AG = 4.000, AT = 1.000, CG = 1.000, CT = 4.000, GT = 1.000
                                                                                                                                                                                                                                                                                                                          Equilibrium State Frequencies: estimated
+
Equilibrium State Frequencies: estimated
                                                                                                                                                                                                                                                                                                                              (ACGT) 0.2703 0.1566 0.1628 0.4103  
+
(ACGT) 0.2703 0.1566 0.1628 0.4103  
                                                                                                                                                                                                                                                                                                                                Rate Heterogeneity Model:
+
Rate Heterogeneity Model:
                                                                                                                                                                                                                                                                                                                                    4 discrete gamma distributed rate categories, alpha param estimated
+
4 discrete gamma distributed rate categories, alpha param estimated
                                                                                                                                                                                                                                                                                                                                          0.5000
+
0.5000
                                                                                                                                                                                                                                                                                                                                              Substitution rate categories under this model:
+
Substitution rate categories under this model:
                                                                                                                                                                                                                                                                                                                                                    rate    proportion
+
rate    proportion
                                                                                                                                                                                                                                                                                                                                                          0.0334    0.2500
+
0.0334    0.2500
                                                                                                                                                                                                                                                                                                                                                                0.2519    0.2500
+
0.2519    0.2500
                                                                                                                                                                                                                                                                                                                                                                      0.8203    0.2500
+
0.8203    0.2500
                                                                                                                                                                                                                                                                                                                                                                            2.8944    0.2500
+
2.8944    0.2500
  
                                                                                                                                                                                                                                                                                                                                                                              Model 3
+
Model 3
                                                                                                                                                                                                                                                                                                                                                                                Number of states = 4 (nucleotide data)
+
Number of states = 4 (nucleotide data)
                                                                                                                                                                                                                                                                                                                                                                                  Nucleotide Relative Rate Matrix:    6 rates  
+
Nucleotide Relative Rate Matrix:    6 rates  
                                                                                                                                                                                                                                                                                                                                                                                      AC = 1.000, AG = 4.000, AT = 1.000, CG = 1.000, CT = 4.000, GT = 1.000
+
AC = 1.000, AG = 4.000, AT = 1.000, CG = 1.000, CT = 4.000, GT = 1.000
                                                                                                                                                                                                                                                                                                                                                                                        Equilibrium State Frequencies: estimated
+
Equilibrium State Frequencies: estimated
                                                                                                                                                                                                                                                                                                                                                                                            (ACGT) 0.1460 0.3609 0.2915 0.2015  
+
(ACGT) 0.1460 0.3609 0.2915 0.2015  
                                                                                                                                                                                                                                                                                                                                                                                              Rate Heterogeneity Model:
+
Rate Heterogeneity Model:
                                                                                                                                                                                                                                                                                                                                                                                                  4 discrete gamma distributed rate categories, alpha param estimated
+
4 discrete gamma distributed rate categories, alpha param estimated
                                                                                                                                                                                                                                                                                                                                                                                                        0.5000
+
0.5000
                                                                                                                                                                                                                                                                                                                                                                                                            Substitution rate categories under this model:
+
Substitution rate categories under this model:
                                                                                                                                                                                                                                                                                                                                                                                                                  rate    proportion
+
rate    proportion
                                                                                                                                                                                                                                                                                                                                                                                                                        0.0334    0.2500
+
0.0334    0.2500
                                                                                                                                                                                                                                                                                                                                                                                                                              0.2519    0.2500
+
0.2519    0.2500
                                                                                                                                                                                                                                                                                                                                                                                                                                    0.8203    0.2500
+
0.8203    0.2500
                                                                                                                                                                                                                                                                                                                                                                                                                                          2.8944    0.2500
+
2.8944    0.2500
  
                                                                                                                                                                                                                                                                                                                                                                                                                                          Subset rate multipliers:
+
Subset rate multipliers:
                                                                                                                                                                                                                                                                                                                                                                                                                                              1.00  1.00  1.00
+
1.00  1.00  1.00
                                                                                                                                                                                                                                                                                                                                                                                                                                              </pre>
+
</pre>
  
                                                                                                                                                                                                                                                                                                                                                                                                                                              If you setup multiple search replicates in the configuration file (which you should!), a summary of the model parameters estimated during each replicate is displayed.  All of the parameter values and (hopeful) the likelihoods should be fairly close for the different replicates.
+
If you setup multiple search replicates in the configuration file (which you should!), a summary of the model parameters estimated during each replicate is displayed.  All of the parameter values and (hopeful) the likelihoods should be fairly close for the different replicates.
  
                                                                                                                                                                                                                                                                                                                                                                                                                                              <pre>
+
<pre>
                                                                                                                                                                                                                                                                                                                                                                                                                                              Completed 5 replicate runs (of 5).
+
Completed 5 replicate runs (of 5).
                                                                                                                                                                                                                                                                                                                                                                                                                                              Results:
+
Results:
                                                                                                                                                                                                                                                                                                                                                                                                                                              Replicate 1 : -13317.4777 (best)
+
Replicate 1 : -13317.4777 (best)
                                                                                                                                                                                                                                                                                                                                                                                                                                              Replicate 2 : -13317.4813        (same topology as 1)
+
Replicate 2 : -13317.4813        (same topology as 1)
                                                                                                                                                                                                                                                                                                                                                                                                                                              Replicate 3 : -13317.4839        (same topology as 1)
+
Replicate 3 : -13317.4839        (same topology as 1)
                                                                                                                                                                                                                                                                                                                                                                                                                                              Replicate 4 : -13317.4863        (same topology as 1)
+
Replicate 4 : -13317.4863        (same topology as 1)
                                                                                                                                                                                                                                                                                                                                                                                                                                              Replicate 5 : -13317.4781        (same topology as 1)
+
Replicate 5 : -13317.4781        (same topology as 1)
  
                                                                                                                                                                                                                                                                                                                                                                                                                                              Parameter estimates:
+
Parameter estimates:
  
                                                                                                                                                                                                                                                                                                                                                                                                                                              Partition subset 1:
+
Partition subset 1:
                                                                                                                                                                                                                                                                                                                                                                                                                                                      r(AC) r(AG) r(AT) r(CG) r(CT) r(GT) pi(A) pi(C) pi(G) pi(T) alpha  
+
r(AC) r(AG) r(AT) r(CG) r(CT) r(GT) pi(A) pi(C) pi(G) pi(T) alpha  
                                                                                                                                                                                                                                                                                                                                                                                                                                                      rep 1:  1.97  2.58  1.42  1.41  3.72  1.00 0.310 0.177 0.297 0.216 0.411  
+
rep 1:  1.97  2.58  1.42  1.41  3.72  1.00 0.310 0.177 0.297 0.216 0.411  
                                                                                                                                                                                                                                                                                                                                                                                                                                                      rep 2:  1.96  2.58  1.41  1.41  3.71  1.00 0.310 0.177 0.296 0.216 0.409  
+
rep 2:  1.96  2.58  1.41  1.41  3.71  1.00 0.310 0.177 0.296 0.216 0.409  
                                                                                                                                                                                                                                                                                                                                                                                                                                                      rep 3:  1.97  2.58  1.42  1.40  3.71  1.00 0.309 0.177 0.298 0.216 0.411  
+
rep 3:  1.97  2.58  1.42  1.40  3.71  1.00 0.309 0.177 0.298 0.216 0.411  
                                                                                                                                                                                                                                                                                                                                                                                                                                                      rep 4:  1.96  2.57  1.42  1.40  3.72  1.00 0.310 0.177 0.297 0.216 0.411  
+
rep 4:  1.96  2.57  1.42  1.40  3.72  1.00 0.310 0.177 0.297 0.216 0.411  
                                                                                                                                                                                                                                                                                                                                                                                                                                                      rep 5:  1.96  2.57  1.41  1.40  3.71  1.00 0.310 0.177 0.297 0.216 0.409  
+
rep 5:  1.96  2.57  1.41  1.40  3.71  1.00 0.310 0.177 0.297 0.216 0.409  
  
                                                                                                                                                                                                                                                                                                                                                                                                                                                      Partition subset 2:
+
Partition subset 2:
                                                                                                                                                                                                                                                                                                                                                                                                                                                            r(AC) r(AG) r(AT) r(CG) r(CT) r(GT) pi(A) pi(C) pi(G) pi(T) alpha  
+
r(AC) r(AG) r(AT) r(CG) r(CT) r(GT) pi(A) pi(C) pi(G) pi(T) alpha  
                                                                                                                                                                                                                                                                                                                                                                                                                                                            rep 1:  4.32  7.05  1.60  7.05  4.37  1.00 0.269 0.164 0.160 0.407 0.361  
+
rep 1:  4.32  7.05  1.60  7.05  4.37  1.00 0.269 0.164 0.160 0.407 0.361  
                                                                                                                                                                                                                                                                                                                                                                                                                                                            rep 2:  4.32  7.03  1.59  7.08  4.37  1.00 0.270 0.163 0.160 0.407 0.361  
+
rep 2:  4.32  7.03  1.59  7.08  4.37  1.00 0.270 0.163 0.160 0.407 0.361  
                                                                                                                                                                                                                                                                                                                                                                                                                                                            rep 3:  4.33  7.08  1.60  7.07  4.37  1.00 0.269 0.164 0.160 0.406 0.361  
+
rep 3:  4.33  7.08  1.60  7.07  4.37  1.00 0.269 0.164 0.160 0.406 0.361  
                                                                                                                                                                                                                                                                                                                                                                                                                                                            rep 4:  4.34  7.09  1.61  7.10  4.40  1.00 0.269 0.164 0.160 0.407 0.361  
+
rep 4:  4.34  7.09  1.61  7.10  4.40  1.00 0.269 0.164 0.160 0.407 0.361  
                                                                                                                                                                                                                                                                                                                                                                                                                                                            rep 5:  4.35  7.08  1.60  7.11  4.39  1.00 0.269 0.163 0.160 0.407 0.360  
+
rep 5:  4.35  7.08  1.60  7.11  4.39  1.00 0.269 0.163 0.160 0.407 0.360  
  
                                                                                                                                                                                                                                                                                                                                                                                                                                                            Partition subset 3:
+
Partition subset 3:
                                                                                                                                                                                                                                                                                                                                                                                                                                                                    r(AC) r(AG) r(AT) r(CG) r(CT) r(GT) pi(A) pi(C) pi(G) pi(T) alpha  
+
r(AC) r(AG) r(AT) r(CG) r(CT) r(GT) pi(A) pi(C) pi(G) pi(T) alpha  
                                                                                                                                                                                                                                                                                                                                                                                                                                                                    rep 1:  1.06  5.26  3.55  0.45  4.98  1.00 0.154 0.356 0.287 0.203 2.988  
+
rep 1:  1.06  5.26  3.55  0.45  4.98  1.00 0.154 0.356 0.287 0.203 2.988  
                                                                                                                                                                                                                                                                                                                                                                                                                                                                    rep 2:  1.06  5.26  3.56  0.45  4.98  1.00 0.154 0.356 0.287 0.203 2.995  
+
rep 2:  1.06  5.26  3.56  0.45  4.98  1.00 0.154 0.356 0.287 0.203 2.995  
                                                                                                                                                                                                                                                                                                                                                                                                                                                                    rep 3:  1.06  5.26  3.57  0.46  5.00  1.00 0.154 0.355 0.287 0.204 3.008  
+
rep 3:  1.06  5.26  3.57  0.46  5.00  1.00 0.154 0.355 0.287 0.204 3.008  
                                                                                                                                                                                                                                                                                                                                                                                                                                                                    rep 4:  1.07  5.31  3.57  0.46  5.00  1.00 0.154 0.356 0.286 0.204 2.991  
+
rep 4:  1.07  5.31  3.57  0.46  5.00  1.00 0.154 0.356 0.286 0.204 2.991  
                                                                                                                                                                                                                                                                                                                                                                                                                                                                    rep 5:  1.06  5.25  3.56  0.45  5.00  1.00 0.154 0.356 0.287 0.203 2.978  
+
rep 5:  1.06  5.25  3.56  0.45  5.00  1.00 0.154 0.356 0.287 0.203 2.978  
  
                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Subset rate multipliers:
+
Subset rate multipliers:
                                                                                                                                                                                                                                                                                                                                                                                                                                                                    rep 1: 0.538  0.298  2.164  
+
rep 1: 0.538  0.298  2.164  
                                                                                                                                                                                                                                                                                                                                                                                                                                                                    rep 2: 0.538  0.299  2.164  
+
rep 2: 0.538  0.299  2.164  
                                                                                                                                                                                                                                                                                                                                                                                                                                                                    rep 3: 0.539  0.300  2.162  
+
rep 3: 0.539  0.300  2.162  
                                                                                                                                                                                                                                                                                                                                                                                                                                                                    rep 4: 0.539  0.298  2.163  
+
rep 4: 0.539  0.298  2.163  
                                                                                                                                                                                                                                                                                                                                                                                                                                                                    rep 5: 0.537  0.299  2.164  
+
rep 5: 0.537  0.299  2.164  
  
                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Final result of the best scoring rep (#1) stored in GTRG.byCodonPos.best.tre
+
Final result of the best scoring rep (#1) stored in GTRG.byCodonPos.best.tre
                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Final results of all reps stored in GTRG.byCodonPos.best.all.tre
+
Final results of all reps stored in GTRG.byCodonPos.best.all.tre
                                                                                                                                                                                                                                                                                                                                                                                                                                                                    </pre>
+
</pre>
  
                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ==The sample runs==
+
==The sample runs==
                                                                                                                                                                                                                                                                                                                                                                                                                                                                    In the example/partition/exampleRuns directory included with GARLI distributions, I've included the configuration and output files from two runs with a very small 11 taxon x 2178 character dataset.  It is partitioned by codon position. Not surprisingly, the tree is found almost immediately, and the rest of the time is spent optimizing the models and branch lengths.
+
In the example/partition/exampleRuns directory included with GARLI distributions, I've included the configuration and output files from two runs with a very small 11 taxon x 2178 character dataset.  It is partitioned by codon position. Not surprisingly, the tree is found almost immediately, and the rest of the time is spent optimizing the models and branch lengths.
  
                                                                                                                                                                                                                                                                                                                                                                                                                                                                    The first run (in 3parts.sameModelType) is an example of using a single model type for all subsets, in this case GTR+G.
+
The first run (in 3parts.sameModelType) is an example of using a single model type for all subsets, in this case GTR+G.
  
                                                                                                                                                                                                                                                                                                                                                                                                                                                                    The second run (in 3parts.diffModelTypes) is an example of using a different model for each subset.  In this case the 3 models aren't even any of the normal named ones (JC, K2P, HKY, GTR, etc).  I combined some of the rate matrix parameters that looked very similar in the first run, and added a proportion of invariant sites to the third subset.
+
The second run (in 3parts.diffModelTypes) is an example of using a different model for each subset.  In this case the 3 models aren't even any of the normal named ones (JC, K2P, HKY, GTR, etc).  I combined some of the rate matrix parameters that looked very similar in the first run, and added a proportion of invariant sites to the third subset.
  
                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Although this doesn't have anything to do with partitioning, I'll mention that the way of specifying any arbitrary restriction of the GTR model in GARLI is similar to that used in PAUP.  Parameters that are shared have the same #.  The parameters are in the order of A-C, A-G, A-T, C-G, C-T and G-T.  For example, the HKY or K2P models give one rate to transitions (A-G and C-T) and another to transversions:
+
Although this doesn't have anything to do with partitioning, I'll mention that the way of specifying any arbitrary restriction of the GTR model in GARLI is similar to that used in PAUP.  Parameters that are shared have the same #.  The parameters are in the order of A-C, A-G, A-T, C-G, C-T and G-T.  For example, the HKY or K2P models give one rate to transitions (A-G and C-T) and another to transversions:
                                                                                                                                                                                                                                                                                                                                                                                                                                                                    <pre>ratematrix = ( 0 1 0 0 1 0 )</pre>
+
<pre>ratematrix = ( 0 1 0 0 1 0 )</pre>
                                                                                                                                                                                                                                                                                                                                                                                                                                                                    The GTR model looks like this:
+
The GTR model looks like this:
                                                                                                                                                                                                                                                                                                                                                                                                                                                                    <pre>ratematrix = ( 0 1 2 3 4 5 )</pre>
+
<pre>ratematrix = ( 0 1 2 3 4 5 )</pre>
                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Note that these two particular configuations are equivalent to '''ratematrix = 2rate''' and '''ratematrix = 6''' rate, respectively.
+
Note that these two particular configuations are equivalent to '''ratematrix = 2rate''' and '''ratematrix = 6''' rate, respectively.
  
                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ==Site-likelihood output==
+
==Site-likelihood output==
                                                                                                                                                                                                                                                                                                                                                                                                                                                                    If you'd like to get site-wise likelihoods, for example to input into CONSEL, add
+
If you'd like to get site-wise likelihoods, for example to input into CONSEL, add
                                                                                                                                                                                                                                                                                                                                                                                                                                                                    outputsitelikelihoods = 1
+
outputsitelikelihoods = 1
                                                                                                                                                                                                                                                                                                                                                                                                                                                                    to the top part of your config file.  This will create a <ofprefix>.sitelikes.log file that has the site-likelihoods for each replicate concatenated one after another.  This file is in exactly the same format as PAUP site-likelihood output, so can go directly into CONSEL.  Note that CONSEL only allows a single period (".") in the site-likelihood file name (i.e., myrun.txt, not my.run.txt), so you may need to rename the files.
+
to the top part of your config file.  This will create a <ofprefix>.sitelikes.log file that has the site-likelihoods for each replicate concatenated one after another.  This file is in exactly the same format as PAUP site-likelihood output, so can go directly into CONSEL.  Note that CONSEL only allows a single period (".") in the site-likelihood file name (i.e., myrun.txt, not my.run.txt), so you may need to rename the files.
  
                                                                                                                                                                                                                                                                                                                                                                                                                                                                    If you want the sitelikes for a particular tree, you'd need to do this:
+
If you want the sitelikes for a particular tree, you'd need to do this:
                                                                                                                                                                                                                                                                                                                                                                                                                                                                    *specify the tree(s) in a starting tree file.
+
*specify the tree(s) in a starting tree file.
                                                                                                                                                                                                                                                                                                                                                                                                                                                                    *specify your chosen model normally
+
*specify your chosen model normally
                                                                                                                                                                                                                                                                                                                                                                                                                                                                    *add '''[[GARLI_Configuration_Settings#optimizeinputonly_.28do_not_search.2C_only_optimize_model_and_branch_lengths_on_user_trees.29 | optimizeinputonly]] = 1''' to the [general] section of your configuration file.
+
*add '''[[GARLI_Configuration_Settings#optimizeinputonly_.28do_not_search.2C_only_optimize_model_and_branch_lengths_on_user_trees.29 | optimizeinputonly]] = 1''' to the [general] section of your configuration file.
                                                                                                                                                                                                                                                                                                                                                                                                                                                                    *run the program normally (this will take a while because it will need to optimize the model parameters)
+
*run the program normally (this will take a while because it will need to optimize the model parameters)
  
                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ==Running it yourself==
+
==Running it yourself==
                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Use the provided configuration files as templates.  I've provided some appropriate config files for smaller datasets ( < about 50 taxa) and for larger ones.  Note that there are some important changes from the defaults values that appear in previous versions of the program.  These are important to ensure that the more complex partitioned models are properly optimized.
+
Use the provided configuration files as templates.  I've provided some appropriate config files for smaller datasets ( < about 50 taxa) and for larger ones.  Note that there are some important changes from the defaults values that appear in previous versions of the program.  These are important to ensure that the more complex partitioned models are properly optimized.
  
                                                                                                                                                                                                                                                                                                                                                                                                                                                                    You should set '''modweight''' entry to at least 0.0005 x (#subsets + 1).
+
You should set '''modweight''' entry to at least 0.0005 x (#subsets + 1).
  
                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Other than that and entering your dataset name on the '''datafname''' line, you should be able to run with the default values.  If you know your way around the Garli settings, feel free to tinker.
+
Other than that and entering your dataset name on the '''datafname''' line, you should be able to run with the default values.  If you know your way around the Garli settings, feel free to tinker.
  
                                                                                                                                                                                                                                                                                                                                                                                                                                                                    As always, you can start the program from the command line with either  
+
As always, you can start the program from the command line with either  
                                                                                                                                                                                                                                                                                                                                                                                                                                                                    <pre>./executableName configFileName</pre>
+
<pre>./executableName configFileName</pre>
                                                                                                                                                                                                                                                                                                                                                                                                                                                                    if it is in the same directory with your dataset and config file, or just
+
if it is in the same directory with your dataset and config file, or just
                                                                                                                                                                                                                                                                                                                                                                                                                                                                    <pre>executableName configFileName</pre>
+
<pre>executableName configFileName</pre>
                                                                                                                                                                                                                                                                                                                                                                                                                                                                    if it is in your path.
+
if it is in your path.
  
                                                                                                                                                                                                                                                                                                                                                                                                                                                                    If your config file is named garli.conf, you don't need to pass it on the command line.
+
If your config file is named garli.conf, you don't need to pass it on the command line.

Latest revision as of 12:41, 21 July 2015

Configuring a partitioned analysis

Not surprisingly, setting up partitioned models is more complicated than normal GARLI usage, since you need to tell the program how to divide your data and what models to apply. Deciding how to choose the models is also a complex issue.

NOTE: If you are the kind of person who would rather just try your hand at running partitioned models without reading all of this first, you'll find example runs and template configuration files in the example/partition/ directory of any GARLI distribution. I'd still suggest reading this at some point to be sure that you understand your options for configuration.

Dividing up the data

Note that I use the technically correct (but often misused) definition of a partition. A partition is a scheme for dividing something up. It does the dividing, like a partition in a room. The individual chunks of data that are created by the partition are referred to as subsets.

This version requires NEXUS formatted datafiles, and the partitioning is specified via standard NEXUS commands appearing in a sets or assumptions block in the same file as the data matrix. The setup of the actual models will come later. For a dataset with 2178 characters in a single data or characters block, it would look like this:

begin sets;
charset ND2 = 1-726;
charset rbcl = 727-1452;
charset 16S = 1453-2178;
charpartition byGene = chunk1:ND2, chunk2:rbcl, chunk3:16S; 

[you could also put characters exclusions here by removing the []'s from the line below]
[note that the excluded sites should still appear in the charpartition, however]
[exset * myexclusions = 600-800, 850, 900-100;]
end;

The above block would divide up the sites into three sets of 726 characters each.

To put charsets ND2 and rbcl in a single partition subset, the charpartition command would look like this

charpartition bySites = chunk1:ND2 rbcl, chunk2:16S; 

Note the space rather than comma between ND2 and rbcl.


The names are unimportant here. The general format is:

charset <charset name> = <list or range of sites>; charset <charset name> = <list or range of sites>; charpartition <charpartition name> = (cont.) <partition subset 1 name>:<sites or charset making up 1st subset>, (cont.) <partition subset 2 name>:<sites or charset making up 1st subset>, <etc>;

To easily specify charsets that divide characters up by codon position, do this: charset 1stpos = 1-2178\3; charset 2ndpos = 2-2178\3; charset 3rdpos = 3-2178\3;

Note that if a charpartition appears, GARLI will AUTOMATICALLY apply a partitioned model. If you don't want that for some runs, remove or comment out (surround it with [ ]) the charpartition command.

Also note that GARLI will also automatically partition if it sees multiple characters blocks, so that is an alternate way to do this (instead of the charpartition).

Specifying the models

DO SOME SORT OF MODEL TESTING! The parameter estimates under partitioned models are currently somewhat erratic if the models are over-parameterized. Use ModelTest or some other means for finding the best model for each data subset. Note that the best model for each subset separately is not necessarily the best when they are combined in a partitioned model, but they will give a useful measure of which parameters are justified in each subset.

As usual for GARLI, the models are specified in the configuration file. If you aren't familiar with the normal way that models are configured in GARLI, see the general info in the manual here, and FAQ entry here.

There are two new configuration entries that relate to partitioned models: linkmodels = 0 or 1 subsetspecificrates = 0 or 1

linkmodels means to use a single set of model parameters for all subsets.

subsetspecificrates means to infer overall rate multipliers for each data subset. This is equivalent to *prset ratepr=variable* in MrBayes

So, there are various combinations here:

linkmodels subsetspecificrates meaning
0 0 different models, branch lengths equal
0 1 different models, different subset rates
1 0 single model, one set of branch lengths (equivalent to non-partitioned analysis)
1 1 single model, different subset rates (like site-specific rates model in PAUP*)

The normal model configuration entries are the following, with the defaults in *bold*:

datatype = nucleotide, aminoacid, codon-aminoacid or codon ratematrix = 6rate, 2rate, 1rate, or other matrix spec. like this :( a, b, c, d, e, f ) statefrequencies = estimate, equal, empirical, (+others for aminoacids or codons. See manual) ratehetmodel = gamma, none numratecats = 4, 1-20 (must be 1 if ratehetmodel = none, must be > 1 if ratehetmodel = gamma) invariantsites = estimate, none

If you leave these as is, set linkmodels = 0 and have a charpartition defined in the datafile, each subset will automatically be assigned a separate unlinked version of GTR+I+G. In that case there is nothing else to be done. You can start your run.

If you want different models for each subset you need to add a set of model settings for each, with a specific heading name in []'s. The headings need to be [model1], [model2], etc., and are assigned to the subsets in order. The number of configuration sets must match the number of data subsets.

For example, the following would assign the GTR+G and HKY models to the first and second data subsets.

[model1]
datatype = nucleotide
ratematrix = 6rate
statefrequencies = estimate
ratehetmodel = gamma
numratecats = 4
invariantsites = none

[model2]
datatype = nucleotide
ratematrix = 2rate
statefrequencies = estimate
ratehetmodel = none
numratecats = 1
invariantsites = none

These should appear in place of the normal set of model configuration settings, and should be placed just before the [master] heading of the configuration file.

If you're not sure how to specify a given model, see the FAQ entry here.

NOTE THAT THE BRACKETED LINES BEFORE EACH MODEL DESCRIPTION ARE NOT COMMENTS, AND MUST APPEAR AS ABOVE WITH CONSECUTIVE MODEL NUMBERING STARTING AT ONE!

Choosing partitioning schemes

How many partition subsets can I use?

There is no hard limit that is enforced on the number of subsets that you can use. I've done more than 60 myself. See the below for considerations about how many you should use.

What do I need to consider when choosing models and a partitioning scheme?

That is a difficult question, and one that is not well investigated in ML phylogenetics (as opposed to Bayesian phylogenetics). I definitely suggest that you do not partition overly finely, as the likelihood surface becomes difficult to optimize. Keep in mind that partitioning finely may create subsets with very few changes, and therefore little signal. This makes the parameter likelihood surfaces very flat and difficult to optimize. This is particularly true when bootstrapping, which can further reduce signal by creating some resampled datasets with even fewer variable sites.

NOTE Do not assume that the partitioning scheme and model choice method that you use in MrBayes or another Bayesian method are appropriate in an maximum likelihood context! This is primarily because the Bayesian way to deal with nuisance parameters is to marginalize over them, meaning to account for the shape of the entire likelihood surface. This effectively integrates out uncertainty in parameter estimates, even if the likelihood surface is very flat and those estimates are very uncertain.

In contrast, ML analyses seek to maximize the likelihood with respect to nuisance parameters. If the likelihood surface is very flat and suggests that a parameter value could nearly equally lie between a value of 1.0 and 3.0, an ML method will still return the value at the very peak of the distribution (lets say 2.329), even if it is only a fraction more likely than surrounding values. Another reason that Bayesian methods can be more appropriate for analyzing data with little signal is that prior distributions can be used to provide some outside information and keep parameter estimates reasonable. However, informative prior estimates are not used that often in standard phylogenetic analyses.

Ok, then how should I partition and choose models in practice?

NOTE: The following assumes that subset specific rates ARE being estimated e.g., subsetspecificrates = 1.

Consider the following procedure that I've used, from a real dataset of 2 genes. It is fairly complicated, but I think as statistically rigorous as can be expected. We'll assume that the smallest subsets will be by codon position, thus there are 6 potential subsets. We'd like to find the most appropriate scheme to use for analyzing each gene separately, as well as both concatenated. But what model should be chosen for the subsets, and how should the subsets be constructed?

1. First, construct all reasonable sets of sites for model testing. This amounts to:

  • Each codon position of each gene individually (6)
  • The full concatenated alignment (1)
  • The full sequence of each gene (2)
  • The first and second positions combined for each gene (2)
  • The first, second and third positions pooled across the genes (3)

Note that this list doesn't need to be exhaustive. I've omitted various combinations that I find unlikely a priori, for example a combination of first and third positions.


2. Now, use unpartitioned models and the AIC criterion with ModelTest or a similar procedure to chose the "best" model for each subset above. These will be the models applied to each of the subsets as we move to partitioned models. The following were the results in my case. If you aren't very familiar with the typical models, the information here: (Model configuration) and some of the items here: ( FAQ:Model choices) may be helpful.


Alignment Full 1st Pos 2nd Pos 3rd Pos 1st Pos + 2nd Pos
Gene 1 GTRIG GTRIG TVMIG TVMIG TVMIG
Gene 2 GTRIG GTRIG GTRIG TVMIG TVMIG
Concat GTRIG GTRIG GTRIG TVMIG TVMIG


3. Now, create partitioning schemes and configure GARLI partitioned analyses with each subset using the model chosen for it in the previous step. When choosing the scheme for Gene 1 analyses, the combinations would be each position separately, which I'll denote (1)(2)(3), the three positions together (i.e., unpartitioned or (123) ), and the first and second positions grouped, (12)(3). For the concatenated alignment there are many more combinations, not all of which need to necessarily be considered. The models assigned within subsets are from the above table, for example, when partitioning Gene 1 by codon position, the chosen models were GTRIG, TVMIG and TVMIG for the subsets of first, second and third positions respectively.

Now run full GARLI analyses with four search reps (or at least two) for each of these partitioning schemes, and note the best log-likelihood across the search replicates for each partitioning scheme. Note that you could also do analyses giving GARLI a fixed tree to optimize on, which is actually the more common way that model choice is done. See here for information on fixing the tree topology.


4. Now, note the number of model parameters that were estimated in each partitioning scheme. For example, when partitioning gene 1 by codon position, the chosen models were GTRIG, TVMIG and TVMIG, which have 10, 9 and 9 free parameters respectively. (GTRIG has 10 parameters because 5 free relative rates + 3 free base frequencies + 1 invariable sites parameter + 1 alpha shape parameter of the gamma distribution = 10). In addition, subset specific rate multipliers were estimated, which adds (#subsets - 1) additional parameters, bringing the total to 30. Use the log-likelihood score and parameter count to calculate the AIC score for each partitioning scheme as AIC = 2 x (# parameters - lnL).

5. Finally, for each alignment compare the AIC scores of each partitioning scheme and choose the lowest value as the best. See the below table for my results. In this example (GTRIG)(TVMIG)(TVMIG) was chosen for first gene and (GTRIG)(GTRIG)(TVMIG) was chosen for the second gene. The concatenated alignment result was less obvious a priori, with 4 subsets. First positions shared a subset with GTRIG, second positions each had their own model with TVMIG for gene 1 and GTRIG for gene 2, and the third positions shared a subset with TVMIG. The AIC scores are summarized in the following table:

Alignment Partition lnL Parameters AIC
Gene 1 (1)(2)(3) -22698.82 30 45457.64
(12)(3) -22817.84 19 45673.68
(123) -23524.92 10 47069.84
Gene 2 (1)(2)(3) -24537.93 31 49137.85
(12)(3) -24639.11 19 49316.22
(123) -25353.78 10 50727.55
Concat (11)(2)(2)(33) -47384.54 42 94853.08
(1)(2)(1)(2)(33) -47375.30 53 94856.60
(1)(2)(3)(1)(2)(3) -47373.34 62 94870.69
(11)(22)(33) -47408.21 31 94878.43
(1212)(33) -47614.47 19 95266.94
123123 -49038.37 10 98096.75
(123)(123) -49031.42 21 98104.83

Further thoughts

The above procedure is somewhat complicated (I may write some scripts to somewhat automate it at some point). However, you can now be confident that you've evaluated and and possibly chosen a better configuration than you might have if you had partitioned a priori. Note that in the above example the best model for the concatenated alignment, (11)(2)(2)(33), has 20 fewer parameters than the one that most people would have chosen a priori, (1)(2)(3)(1)(2)(3). Some caveats:

  • This methodology and example only includes two genes! Many datasets nowadays have many more. Exhaustively going through the subsets and partitions as above will quickly get out of hand for five or ten genes. However, you can still work to find good partitioning schemes within each gene, as was done for the single genes in the example above. Specifically, after finding the best models for each possible subset of each gene via ModelTest or the like, comparison of partitioning schemes (123), (12)(3) and (123) can be done for each. I've found that the (12)(3) scheme can work well in cases in which there are only a few second position changes, making parameter optimization difficult when it is analyzed as an independent subset. You might also do something like the full procedure above on sets of genes that you know 'a priori are similar, for example mitochondrial genes, rRNA genes, etc.
  • This discussion ignores the ability to "link" parameters across subsets (which I don't yet have implemented in GALRI). Surely more parameters could be eliminated by linking. For example, all of the subsets may have similar base frequencies, and could share a single set of frequency parameters. If so, three parameters could be eliminated for each subset, which is a significant number.
  • Some may argue that this is a lot of work for unclear benefit. This might be true, although reducing the number of free parameters is thought to be statistically beneficial in a general sense, if not specifically for partitioned phylogenetic models. Doing full searches to find the best model and then using it for more searches may seem wasteful, but remember that the Bayesian analog to this AIC procedure is the comparison of Bayes factors, which also generally require full analyses using each of the competing models.

Check the output

Once you've done a run, check the output in the .screen.log file to see that your data were divided and models assigned correctly. The output is currently very verbose.

First the details of the data partitioning appear. Check that the total number of characters per subset looks correct.

GARLI partition subset 1
CHARACTERS block #1 ("Untitled DATA Block 1")
CHARPARTITION subset #1 ("1stpos")
Data read as Nucleotide data, modeled as Nucleotide data

Summary of data or data subset:
11 sequences.
441 constant characters.
189 parsimony-informative characters.
96 autapomorphic characters.
726 total characters.
238 unique patterns in compressed data matrix.

GARLI partition subset 2
CHARACTERS block #1 ("Untitled DATA Block 1")
CHARPARTITION subset #2 ("2ndpos")
Data read as Nucleotide data, modeled as Nucleotide data

Summary of data or data subset:
11 sequences.
528 constant characters.
103 parsimony-informative characters.
95 autapomorphic characters.
726 total characters.
158 unique patterns in compressed data matrix.

GARLI partition subset 3
CHARACTERS block #1 ("Untitled DATA Block 1")
CHARPARTITION subset #3 ("3rdpos")
Data read as Nucleotide data, modeled as Nucleotide data

Summary of data or data subset:
11 sequences.
103 constant characters.
539 parsimony-informative characters.
84 autapomorphic characters.
726 total characters.
549 unique patterns in compressed data matrix.

Then a description of the models and model parameters assigned to each subset appears. Parameters are at their initial values. This indicates three models with GTR+G for each:

MODEL REPORT - Parameters are at their INITIAL values (not yet optimized)
Model 1
Number of states = 4 (nucleotide data)
Nucleotide Relative Rate Matrix:     6 rates 
AC = 1.000, AG = 4.000, AT = 1.000, CG = 1.000, CT = 4.000, GT = 1.000
Equilibrium State Frequencies: estimated
(ACGT) 0.3157 0.1746 0.3004 0.2093 
Rate Heterogeneity Model:
4 discrete gamma distributed rate categories, alpha param estimated
0.5000
Substitution rate categories under this model:
rate    proportion
0.0334    0.2500
0.2519    0.2500
0.8203    0.2500
2.8944    0.2500

Model 2
Number of states = 4 (nucleotide data)
Nucleotide Relative Rate Matrix:     6 rates 
AC = 1.000, AG = 4.000, AT = 1.000, CG = 1.000, CT = 4.000, GT = 1.000
Equilibrium State Frequencies: estimated
(ACGT) 0.2703 0.1566 0.1628 0.4103 
Rate Heterogeneity Model:
4 discrete gamma distributed rate categories, alpha param estimated
0.5000
Substitution rate categories under this model:
rate    proportion
0.0334    0.2500
0.2519    0.2500
0.8203    0.2500
2.8944    0.2500

Model 3
Number of states = 4 (nucleotide data)
Nucleotide Relative Rate Matrix:     6 rates 
AC = 1.000, AG = 4.000, AT = 1.000, CG = 1.000, CT = 4.000, GT = 1.000
Equilibrium State Frequencies: estimated
(ACGT) 0.1460 0.3609 0.2915 0.2015 
Rate Heterogeneity Model:
4 discrete gamma distributed rate categories, alpha param estimated
0.5000
Substitution rate categories under this model:
rate    proportion
0.0334    0.2500
0.2519    0.2500
0.8203    0.2500
2.8944    0.2500

Subset rate multipliers:
1.00  1.00  1.00

If you setup multiple search replicates in the configuration file (which you should!), a summary of the model parameters estimated during each replicate is displayed. All of the parameter values and (hopeful) the likelihoods should be fairly close for the different replicates.

Completed 5 replicate runs (of 5).
Results:
Replicate 1 : -13317.4777 (best)
Replicate 2 : -13317.4813        (same topology as 1)
Replicate 3 : -13317.4839        (same topology as 1)
Replicate 4 : -13317.4863        (same topology as 1)
Replicate 5 : -13317.4781        (same topology as 1)

Parameter estimates:

Partition subset 1:
r(AC) r(AG) r(AT) r(CG) r(CT) r(GT) pi(A) pi(C) pi(G) pi(T) alpha 
rep 1:  1.97  2.58  1.42  1.41  3.72  1.00 0.310 0.177 0.297 0.216 0.411 
rep 2:  1.96  2.58  1.41  1.41  3.71  1.00 0.310 0.177 0.296 0.216 0.409 
rep 3:  1.97  2.58  1.42  1.40  3.71  1.00 0.309 0.177 0.298 0.216 0.411 
rep 4:  1.96  2.57  1.42  1.40  3.72  1.00 0.310 0.177 0.297 0.216 0.411 
rep 5:  1.96  2.57  1.41  1.40  3.71  1.00 0.310 0.177 0.297 0.216 0.409 

Partition subset 2:
r(AC) r(AG) r(AT) r(CG) r(CT) r(GT) pi(A) pi(C) pi(G) pi(T) alpha 
rep 1:  4.32  7.05  1.60  7.05  4.37  1.00 0.269 0.164 0.160 0.407 0.361 
rep 2:  4.32  7.03  1.59  7.08  4.37  1.00 0.270 0.163 0.160 0.407 0.361 
rep 3:  4.33  7.08  1.60  7.07  4.37  1.00 0.269 0.164 0.160 0.406 0.361 
rep 4:  4.34  7.09  1.61  7.10  4.40  1.00 0.269 0.164 0.160 0.407 0.361 
rep 5:  4.35  7.08  1.60  7.11  4.39  1.00 0.269 0.163 0.160 0.407 0.360 

Partition subset 3:
r(AC) r(AG) r(AT) r(CG) r(CT) r(GT) pi(A) pi(C) pi(G) pi(T) alpha 
rep 1:  1.06  5.26  3.55  0.45  4.98  1.00 0.154 0.356 0.287 0.203 2.988 
rep 2:  1.06  5.26  3.56  0.45  4.98  1.00 0.154 0.356 0.287 0.203 2.995 
rep 3:  1.06  5.26  3.57  0.46  5.00  1.00 0.154 0.355 0.287 0.204 3.008 
rep 4:  1.07  5.31  3.57  0.46  5.00  1.00 0.154 0.356 0.286 0.204 2.991 
rep 5:  1.06  5.25  3.56  0.45  5.00  1.00 0.154 0.356 0.287 0.203 2.978 

Subset rate multipliers:
rep 1: 0.538  0.298  2.164 
rep 2: 0.538  0.299  2.164 
rep 3: 0.539  0.300  2.162 
rep 4: 0.539  0.298  2.163 
rep 5: 0.537  0.299  2.164 

Final result of the best scoring rep (#1) stored in GTRG.byCodonPos.best.tre
Final results of all reps stored in GTRG.byCodonPos.best.all.tre

The sample runs

In the example/partition/exampleRuns directory included with GARLI distributions, I've included the configuration and output files from two runs with a very small 11 taxon x 2178 character dataset. It is partitioned by codon position. Not surprisingly, the tree is found almost immediately, and the rest of the time is spent optimizing the models and branch lengths.

The first run (in 3parts.sameModelType) is an example of using a single model type for all subsets, in this case GTR+G.

The second run (in 3parts.diffModelTypes) is an example of using a different model for each subset. In this case the 3 models aren't even any of the normal named ones (JC, K2P, HKY, GTR, etc). I combined some of the rate matrix parameters that looked very similar in the first run, and added a proportion of invariant sites to the third subset.

Although this doesn't have anything to do with partitioning, I'll mention that the way of specifying any arbitrary restriction of the GTR model in GARLI is similar to that used in PAUP. Parameters that are shared have the same #. The parameters are in the order of A-C, A-G, A-T, C-G, C-T and G-T. For example, the HKY or K2P models give one rate to transitions (A-G and C-T) and another to transversions:

ratematrix = ( 0 1 0 0 1 0 )

The GTR model looks like this:

ratematrix = ( 0 1 2 3 4 5 )

Note that these two particular configuations are equivalent to ratematrix = 2rate and ratematrix = 6 rate, respectively.

Site-likelihood output

If you'd like to get site-wise likelihoods, for example to input into CONSEL, add outputsitelikelihoods = 1 to the top part of your config file. This will create a <ofprefix>.sitelikes.log file that has the site-likelihoods for each replicate concatenated one after another. This file is in exactly the same format as PAUP site-likelihood output, so can go directly into CONSEL. Note that CONSEL only allows a single period (".") in the site-likelihood file name (i.e., myrun.txt, not my.run.txt), so you may need to rename the files.

If you want the sitelikes for a particular tree, you'd need to do this:

  • specify the tree(s) in a starting tree file.
  • specify your chosen model normally
  • add optimizeinputonly = 1 to the [general] section of your configuration file.
  • run the program normally (this will take a while because it will need to optimize the model parameters)

Running it yourself

Use the provided configuration files as templates. I've provided some appropriate config files for smaller datasets ( < about 50 taxa) and for larger ones. Note that there are some important changes from the defaults values that appear in previous versions of the program. These are important to ensure that the more complex partitioned models are properly optimized.

You should set modweight entry to at least 0.0005 x (#subsets + 1).

Other than that and entering your dataset name on the datafname line, you should be able to run with the default values. If you know your way around the Garli settings, feel free to tinker.

As always, you can start the program from the command line with either

./executableName configFileName

if it is in the same directory with your dataset and config file, or just

executableName configFileName

if it is in your path.

If your config file is named garli.conf, you don't need to pass it on the command line.