Garli using partitioned models

From MolEvol
Revision as of 12:41, 21 July 2015 by Zwickl (talk | contribs)

Configuring a partitioned analysis

Not surprisingly, setting up partitioned models is more complicated than normal GARLI usage, since you need to tell the program how to divide your data and what models to apply. Deciding how to choose the models is also a complex issue.

NOTE: If you are the kind of person who would rather just try your hand at running partitioned models without reading all of this first, you'll find example runs and template configuration files in the example/partition/ directory of any GARLI distribution. I'd still suggest reading this at some point to be sure that you understand your options for configuration.

Dividing up the data

Note that I use the technically correct (but often misused) definition of a partition. A partition is a scheme for dividing something up. It does the dividing, like a partition in a room. The individual chunks of data that are created by the partition are referred to as subsets.

This version requires NEXUS formatted datafiles, and the partitioning is specified via standard NEXUS commands appearing in a sets or assumptions block in the same file as the data matrix. The setup of the actual models will come later. For a dataset with 2178 characters in a single data or characters block, it would look like this:

begin sets;
charset ND2 = 1-726;
charset rbcl = 727-1452;
charset 16S = 1453-2178;
charpartition byGene = chunk1:ND2, chunk2:rbcl, chunk3:16S; 

[you could also put character exclusions here by removing the []'s from the line below]
[note that the excluded sites should still appear in the charpartition, however]
[exset * myexclusions = 600-800, 850, 900-100;]
end;

The above block would divide up the sites into three sets of 726 characters each.

To put charsets ND2 and rbcl into a single partition subset, the charpartition command would look like this:

charpartition bySites = chunk1:ND2 rbcl, chunk2:16S; 

Note the space rather than comma between ND2 and rbcl.

The names are unimportant here. The general format is:

charset <charset name> = <list or range of sites>;
charset <charset name> = <list or range of sites>;
charpartition <charpartition name> =
    <subset 1 name>:<sites or charsets making up 1st subset>,
    <subset 2 name>:<sites or charsets making up 2nd subset>,
    <etc>;

To easily specify charsets that divide characters up by codon position, do this:

charset 1stpos = 1-2178\3;
charset 2ndpos = 2-2178\3;
charset 3rdpos = 3-2178\3;
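Combined with a charpartition, a complete sets block partitioning the 2178-character example by codon position might look like the following sketch (the charpartition and subset names here are arbitrary):

```
begin sets;
charset 1stpos = 1-2178\3;
charset 2ndpos = 2-2178\3;
charset 3rdpos = 3-2178\3;
charpartition byCodon = pos1:1stpos, pos2:2ndpos, pos3:3rdpos;
end;
```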

Note that if a charpartition appears, GARLI will AUTOMATICALLY apply a partitioned model. If you don't want that for some runs, remove or comment out (surround it with [ ]) the charpartition command.

Also note that GARLI will automatically partition if it sees multiple characters blocks, so defining one characters block per subset is an alternative to using a charpartition.
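As a minimal sketch of that alternative (the taxon names and sequences here are invented purely for illustration), a datafile with two characters blocks would automatically be treated as two subsets:

```
#NEXUS
begin taxa;
  dimensions ntax=3;
  taxlabels taxA taxB taxC;
end;
begin characters;
  dimensions nchar=8;
  format datatype=dna;
  matrix
  taxA ACGTACGT
  taxB ACGTACGA
  taxC ACGAACGT
  ;
end;
begin characters;
  dimensions nchar=8;
  format datatype=dna;
  matrix
  taxA TTGCAATC
  taxB TTGCAATT
  taxC TTGAAATC
  ;
end;
```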

Specifying the models

DO SOME SORT OF MODEL TESTING! The parameter estimates under partitioned models are currently somewhat erratic if the models are over-parameterized. Use ModelTest or some other means for finding the best model for each data subset. Note that the best model for each subset separately is not necessarily the best when they are combined in a partitioned model, but they will give a useful measure of which parameters are justified in each subset.

As usual for GARLI, the models are specified in the configuration file. If you aren't familiar with the normal way that models are configured in GARLI, see the general info in the manual here, and FAQ entry here.

There are two new configuration entries that relate to partitioned models:

linkmodels = 0 or 1
subsetspecificrates = 0 or 1

Setting linkmodels = 1 means that a single set of model parameters is used for all subsets.

Setting subsetspecificrates = 1 means that an overall rate multiplier is inferred for each data subset. This is equivalent to *prset ratepr=variable* in MrBayes.

So, there are various combinations here:

linkmodels  subsetspecificrates  meaning
0           0                    different models, all subsets at the same overall rate
0           1                    different models, different subset rates
1           0                    single model, one set of branch lengths (equivalent to a non-partitioned analysis)
1           1                    single model, different subset rates (like the site-specific rates model in PAUP*)
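For the most common setup, fully unlinked models with subset-specific rates (the second row of the table), the relevant lines in the configuration file would be:

```
linkmodels = 0
subsetspecificrates = 1
```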

The normal model configuration entries are the following, with the defaults in *bold*:

datatype = nucleotide, aminoacid, codon-aminoacid or codon
ratematrix = 6rate, 2rate, 1rate, or other matrix spec. like this: ( a, b, c, d, e, f )
statefrequencies = estimate, equal, empirical (+ others for aminoacids or codons; see manual)
ratehetmodel = gamma, none
numratecats = 4, 1-20 (must be 1 if ratehetmodel = none, must be > 1 if ratehetmodel = gamma)
invariantsites = estimate, none

If you leave these as is, set linkmodels = 0 and have a charpartition defined in the datafile, each subset will automatically be assigned a separate unlinked version of GTR+I+G. In that case there is nothing else to be done. You can start your run.

If you want different models for each subset you need to add a set of model settings for each, with a specific heading name in []'s. The headings need to be [model1], [model2], etc., and are assigned to the subsets in order. The number of configuration sets must match the number of data subsets.

For example, the following would assign the GTR+G and HKY models to the first and second data subsets.

[model1]
datatype = nucleotide
ratematrix = 6rate
statefrequencies = estimate
ratehetmodel = gamma
numratecats = 4
invariantsites = none

[model2]
datatype = nucleotide
ratematrix = 2rate
statefrequencies = estimate
ratehetmodel = none
numratecats = 1
invariantsites = none

These should appear in place of the normal set of model configuration settings, and should be placed just before the [master] heading of the configuration file.
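Putting the pieces together, the overall layout of a partitioned configuration file is roughly the following sketch (the filenames are placeholders, and most of the usual [general] and [master] settings are omitted):

```
[general]
datafname = mydata.nex
ofprefix = part.run
linkmodels = 0
subsetspecificrates = 1

[model1]
datatype = nucleotide
ratematrix = 6rate
statefrequencies = estimate
ratehetmodel = gamma
numratecats = 4
invariantsites = none

[model2]
datatype = nucleotide
ratematrix = 2rate
statefrequencies = estimate
ratehetmodel = none
numratecats = 1
invariantsites = none

[master]
```

with the usual [master] settings following.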

If you're not sure how to specify a given model, see the FAQ entry here.


Choosing partitioning schemes

How many partition subsets can I use?

There is no hard limit enforced on the number of subsets that you can use; I've used more than 60 myself. See below for considerations about how many you should use.

What do I need to consider when choosing models and a partitioning scheme?

That is a difficult question, and one that is not well investigated in ML phylogenetics (as opposed to Bayesian phylogenetics). I definitely suggest that you do not partition overly finely, as the likelihood surface becomes difficult to optimize. Keep in mind that partitioning finely may create subsets with very few changes, and therefore little signal. This makes the parameter likelihood surfaces very flat and difficult to optimize. This is particularly true when bootstrapping, which can further reduce signal by creating some resampled datasets with even fewer variable sites.

NOTE: Do not assume that the partitioning scheme and model choice method that you use in MrBayes or another Bayesian program are appropriate in a maximum likelihood context! This is primarily because the Bayesian way to deal with nuisance parameters is to marginalize over them, meaning to account for the shape of the entire likelihood surface. This effectively integrates out uncertainty in parameter estimates, even if the likelihood surface is very flat and those estimates are very uncertain.

In contrast, ML analyses seek to maximize the likelihood with respect to nuisance parameters. If the likelihood surface is very flat and suggests that a parameter value could nearly equally lie anywhere between 1.0 and 3.0, an ML method will still return the value at the very peak of the distribution (let's say 2.329), even if it is only a fraction more likely than surrounding values. Another reason that Bayesian methods can be more appropriate for analyzing data with little signal is that prior distributions can be used to provide some outside information and keep parameter estimates reasonable. However, informative priors are not often used in standard phylogenetic analyses.

Ok, then how should I partition and choose models in practice?

NOTE: The following assumes that subset-specific rates ARE being estimated, i.e., subsetspecificrates = 1.

Consider the following procedure that I've used, from a real dataset of 2 genes. It is fairly complicated, but I think as statistically rigorous as can be expected. We'll assume that the smallest subsets will be by codon position, thus there are 6 potential subsets. We'd like to find the most appropriate scheme to use for analyzing each gene separately, as well as both concatenated. But what model should be chosen for the subsets, and how should the subsets be constructed?

1. First, construct all reasonable sets of sites for model testing. This amounts to:

  • Each codon position of each gene individually (6)
  • The full concatenated alignment (1)
  • The full sequence of each gene (2)
  • The first and second positions combined for each gene (2)
  • The first, second and third positions pooled across the genes (3)

Note that this list doesn't need to be exhaustive. I've omitted various combinations that I find unlikely a priori, for example a combination of first and third positions.

2. Now, use unpartitioned models and the AIC criterion with ModelTest or a similar procedure to choose the "best" model for each set of sites above. These will be the models applied to each of the subsets as we move to partitioned models. The following were the results in my case. If you aren't very familiar with the typical models, the information here (Model configuration) and some of the items here (FAQ:Model choices) may be helpful.

(A table of the best-fit model chosen for each alignment and each set of sites (Full, 1st Pos, 2nd Pos, 3rd Pos, and 1st Pos + 2nd Pos) appeared here; for example, the models chosen for the codon-position subsets were GTR+I+G, TVM+I+G and TVM+I+G for Gene 1, and GTR+I+G, GTR+I+G and TVM+I+G for Gene 2.)

3. Now, create partitioning schemes and configure GARLI partitioned analyses with each subset using the model chosen for it in the previous step. When choosing the scheme for Gene 1 analyses, the combinations would be each position separately, which I'll denote (1)(2)(3), the three positions together (i.e., unpartitioned or (123) ), and the first and second positions grouped, (12)(3). For the concatenated alignment there are many more combinations, not all of which need to necessarily be considered. The models assigned within subsets are from the above table, for example, when partitioning Gene 1 by codon position, the chosen models were GTRIG, TVMIG and TVMIG for the subsets of first, second and third positions respectively.

Now run full GARLI analyses with four search reps (or at least two) for each of these partitioning schemes, and note the best log-likelihood across the search replicates for each partitioning scheme. Note that you could also do analyses giving GARLI a fixed tree to optimize on, which is actually the more common way that model choice is done. See here for information on fixing the tree topology.

4. Now, note the number of model parameters that were estimated in each partitioning scheme. For example, when partitioning gene 1 by codon position, the chosen models were GTRIG, TVMIG and TVMIG, which have 10, 9 and 9 free parameters respectively. (GTRIG has 10 parameters because 5 free relative rates + 3 free base frequencies + 1 invariable sites parameter + 1 alpha shape parameter of the gamma distribution = 10). In addition, subset specific rate multipliers were estimated, which adds (#subsets - 1) additional parameters, bringing the total to 30. Use the log-likelihood score and parameter count to calculate the AIC score for each partitioning scheme as AIC = 2 x (# parameters - lnL).
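As a sanity check, the AIC arithmetic for this step can be reproduced in a few lines of Python (the parameter counts and log-likelihood are the Gene 1 values from this example):

```python
# AIC = 2 * (k - lnL), where k is the number of free parameters
# and lnL is the best log-likelihood found for the partitioning scheme.

def aic(n_params, lnL):
    """Akaike Information Criterion: 2 * (#parameters - lnL)."""
    return 2.0 * (n_params - lnL)

# Gene 1 partitioned by codon position: GTR+I+G, TVM+I+G, TVM+I+G
per_model = [10, 9, 9]                     # free parameters in each subset's model
k = sum(per_model) + (len(per_model) - 1)  # plus the subset rate multipliers
assert k == 30

print(aic(k, -22698.82))  # the AIC reported for scheme (1)(2)(3) in the table
```

The same function reproduces the other rows of the results table, e.g. aic(19, -22817.84) gives the 45673.68 reported for the (12)(3) scheme.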

5. Finally, for each alignment compare the AIC scores of each partitioning scheme and choose the lowest value as the best. See the below table for my results. In this example (GTRIG)(TVMIG)(TVMIG) was chosen for first gene and (GTRIG)(GTRIG)(TVMIG) was chosen for the second gene. The concatenated alignment result was less obvious a priori, with 4 subsets. First positions shared a subset with GTRIG, second positions each had their own model with TVMIG for gene 1 and GTRIG for gene 2, and the third positions shared a subset with TVMIG. The AIC scores are summarized in the following table:

Alignment Partition lnL Parameters AIC
Gene 1 (1)(2)(3) -22698.82 30 45457.64
(12)(3) -22817.84 19 45673.68
(123) -23524.92 10 47069.84
Gene 2 (1)(2)(3) -24537.93 31 49137.85
(12)(3) -24639.11 19 49316.22
(123) -25353.78 10 50727.55
Concat (11)(2)(2)(33) -47384.54 42 94853.08
(1)(2)(1)(2)(33) -47375.30 53 94856.60
(1)(2)(3)(1)(2)(3) -47373.34 62 94870.69
(11)(22)(33) -47408.21 31 94878.43
(1212)(33) -47614.47 19 95266.94
(123123) -49038.37 10 98096.75
(123)(123) -49031.42 21 98104.83

Further thoughts

The above procedure is somewhat complicated (I may write some scripts to somewhat automate it at some point). However, you can now be confident that you've evaluated and possibly chosen a better configuration than you might have if you had partitioned a priori. Note that in the above example the best model for the concatenated alignment, (11)(2)(2)(33), has 20 fewer parameters than the one that most people would have chosen a priori, (1)(2)(3)(1)(2)(3). Some caveats:

  • This methodology and example only include two genes! Many datasets nowadays have many more, and exhaustively going through the subsets and partitions as above quickly gets out of hand for five or ten genes. However, you can still work to find good partitioning schemes within each gene, as was done for the single genes in the example above. Specifically, after finding the best models for each possible subset of each gene via ModelTest or the like, the partitioning schemes (1)(2)(3), (12)(3) and (123) can be compared for each. I've found that the (12)(3) scheme can work well in cases in which there are only a few second-position changes, making parameter optimization difficult when that subset is analyzed independently. You might also apply something like the full procedure above to sets of genes that you know a priori are similar, for example mitochondrial genes, rRNA genes, etc.
  • This discussion ignores the ability to "link" parameters across subsets (which I don't yet have implemented in GARLI). Surely more parameters could be eliminated by linking. For example, all of the subsets may have similar base frequencies and could share a single set of frequency parameters; if so, three parameters could be eliminated for each subset, which is a significant number.
  • Some may argue that this is a lot of work for unclear benefit. This might be true, although reducing the number of free parameters is thought to be statistically beneficial in a general sense, if not specifically for partitioned phylogenetic models. Doing full searches to find the best model and then using it for more searches may seem wasteful, but remember that the Bayesian analog to this AIC procedure is the comparison of Bayes factors, which also generally require full analyses using each of the competing models.

Check the output

Once you've done a run, check the output in the .screen.log file to see that your data were divided and models assigned correctly. The output is currently very verbose.

First the details of the data partitioning appear. Check that the total number of characters per subset looks correct.

GARLI partition subset 1
CHARACTERS block #1 ("Untitled DATA Block 1")
CHARPARTITION subset #1 ("1stpos")
Data read as Nucleotide data, modeled as Nucleotide data

Summary of data or data subset:
11 sequences.
441 constant characters.
189 parsimony-informative characters.
96 autapomorphic characters.
726 total characters.
238 unique patterns in compressed data matrix.

GARLI partition subset 2
CHARACTERS block #1 ("Untitled DATA Block 1")
CHARPARTITION subset #2 ("2ndpos")
Data read as Nucleotide data, modeled as Nucleotide data

Summary of data or data subset:
11 sequences.
528 constant characters.
103 parsimony-informative characters.
95 autapomorphic characters.
726 total characters.
158 unique patterns in compressed data matrix.

GARLI partition subset 3
CHARACTERS block #1 ("Untitled DATA Block 1")
CHARPARTITION subset #3 ("3rdpos")
Data read as Nucleotide data, modeled as Nucleotide data

Summary of data or data subset:
11 sequences.
103 constant characters.
539 parsimony-informative characters.
84 autapomorphic characters.
726 total characters.
549 unique patterns in compressed data matrix.

Then a description of the models and model parameters assigned to each subset appears. Parameters are at their initial values. This indicates three models with GTR+G for each:

MODEL REPORT - Parameters are at their INITIAL values (not yet optimized)
Model 1
Number of states = 4 (nucleotide data)
Nucleotide Relative Rate Matrix:     6 rates 
AC = 1.000, AG = 4.000, AT = 1.000, CG = 1.000, CT = 4.000, GT = 1.000
Equilibrium State Frequencies: estimated
(ACGT) 0.3157 0.1746 0.3004 0.2093 
Rate Heterogeneity Model:
4 discrete gamma distributed rate categories, alpha param estimated
Substitution rate categories under this model:
rate    proportion
0.0334    0.2500
0.2519    0.2500
0.8203    0.2500
2.8944    0.2500

Model 2
Number of states = 4 (nucleotide data)
Nucleotide Relative Rate Matrix:     6 rates 
AC = 1.000, AG = 4.000, AT = 1.000, CG = 1.000, CT = 4.000, GT = 1.000
Equilibrium State Frequencies: estimated
(ACGT) 0.2703 0.1566 0.1628 0.4103 
Rate Heterogeneity Model:
4 discrete gamma distributed rate categories, alpha param estimated
Substitution rate categories under this model:
rate    proportion
0.0334    0.2500
0.2519    0.2500
0.8203    0.2500
2.8944    0.2500

Model 3
Number of states = 4 (nucleotide data)
Nucleotide Relative Rate Matrix:     6 rates 
AC = 1.000, AG = 4.000, AT = 1.000, CG = 1.000, CT = 4.000, GT = 1.000
Equilibrium State Frequencies: estimated
(ACGT) 0.1460 0.3609 0.2915 0.2015 
Rate Heterogeneity Model:
4 discrete gamma distributed rate categories, alpha param estimated
Substitution rate categories under this model:
rate    proportion
0.0334    0.2500
0.2519    0.2500
0.8203    0.2500
2.8944    0.2500

Subset rate multipliers:
1.00  1.00  1.00

If you set up multiple search replicates in the configuration file (which you should!), a summary of the model parameters estimated during each replicate is displayed. All of the parameter values and (hopefully) the likelihoods should be fairly close across the different replicates.

Completed 5 replicate runs (of 5).
Replicate 1 : -13317.4777 (best)
Replicate 2 : -13317.4813        (same topology as 1)
Replicate 3 : -13317.4839        (same topology as 1)
Replicate 4 : -13317.4863        (same topology as 1)
Replicate 5 : -13317.4781        (same topology as 1)

Parameter estimates:

Partition subset 1:
r(AC) r(AG) r(AT) r(CG) r(CT) r(GT) pi(A) pi(C) pi(G) pi(T) alpha 
rep 1:  1.97  2.58  1.42  1.41  3.72  1.00 0.310 0.177 0.297 0.216 0.411 
rep 2:  1.96  2.58  1.41  1.41  3.71  1.00 0.310 0.177 0.296 0.216 0.409 
rep 3:  1.97  2.58  1.42  1.40  3.71  1.00 0.309 0.177 0.298 0.216 0.411 
rep 4:  1.96  2.57  1.42  1.40  3.72  1.00 0.310 0.177 0.297 0.216 0.411 
rep 5:  1.96  2.57  1.41  1.40  3.71  1.00 0.310 0.177 0.297 0.216 0.409 

Partition subset 2:
r(AC) r(AG) r(AT) r(CG) r(CT) r(GT) pi(A) pi(C) pi(G) pi(T) alpha 
rep 1:  4.32  7.05  1.60  7.05  4.37  1.00 0.269 0.164 0.160 0.407 0.361 
rep 2:  4.32  7.03  1.59  7.08  4.37  1.00 0.270 0.163 0.160 0.407 0.361 
rep 3:  4.33  7.08  1.60  7.07  4.37  1.00 0.269 0.164 0.160 0.406 0.361 
rep 4:  4.34  7.09  1.61  7.10  4.40  1.00 0.269 0.164 0.160 0.407 0.361 
rep 5:  4.35  7.08  1.60  7.11  4.39  1.00 0.269 0.163 0.160 0.407 0.360 

Partition subset 3:
r(AC) r(AG) r(AT) r(CG) r(CT) r(GT) pi(A) pi(C) pi(G) pi(T) alpha 
rep 1:  1.06  5.26  3.55  0.45  4.98  1.00 0.154 0.356 0.287 0.203 2.988 
rep 2:  1.06  5.26  3.56  0.45  4.98  1.00 0.154 0.356 0.287 0.203 2.995 
rep 3:  1.06  5.26  3.57  0.46  5.00  1.00 0.154 0.355 0.287 0.204 3.008 
rep 4:  1.07  5.31  3.57  0.46  5.00  1.00 0.154 0.356 0.286 0.204 2.991 
rep 5:  1.06  5.25  3.56  0.45  5.00  1.00 0.154 0.356 0.287 0.203 2.978 

Subset rate multipliers:
rep 1: 0.538  0.298  2.164 
rep 2: 0.538  0.299  2.164 
rep 3: 0.539  0.300  2.162 
rep 4: 0.539  0.298  2.163 
rep 5: 0.537  0.299  2.164 

Final result of the best scoring rep (#1) stored in
Final results of all reps stored in

The sample runs

In the example/partition/exampleRuns directory included with GARLI distributions, I've included the configuration and output files from two runs with a very small 11 taxon x 2178 character dataset. It is partitioned by codon position. Not surprisingly, the tree is found almost immediately, and the rest of the time is spent optimizing the models and branch lengths.

The first run (in 3parts.sameModelType) is an example of using a single model type for all subsets, in this case GTR+G.

The second run (in 3parts.diffModelTypes) is an example of using a different model for each subset. In this case the 3 models aren't even any of the normal named ones (JC, K2P, HKY, GTR, etc). I combined some of the rate matrix parameters that looked very similar in the first run, and added a proportion of invariant sites to the third subset.

Although this doesn't have anything to do with partitioning, I'll mention that the way of specifying any arbitrary restriction of the GTR model in GARLI is similar to that used in PAUP. Parameters that are shared have the same #. The parameters are in the order of A-C, A-G, A-T, C-G, C-T and G-T. For example, the HKY or K2P models give one rate to transitions (A-G and C-T) and another to transversions:

ratematrix = ( 0 1 0 0 1 0 )

The GTR model looks like this:

ratematrix = ( 0 1 2 3 4 5 )

Note that these two particular configurations are equivalent to ratematrix = 2rate and ratematrix = 6rate, respectively.
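Following the same convention, other restrictions of GTR can be written directly. For example, the Tamura-Nei (TrN) model, which gives the two transition types (A-G and C-T) separate rates while pooling all of the transversions, would be:

```
ratematrix = ( 0 1 0 0 2 0 )
```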

Site-likelihood output

If you'd like to get site-wise likelihoods, for example to input into CONSEL, add outputsitelikelihoods = 1 to the top part of your config file. This will create a <ofprefix>.sitelikes.log file that has the site-likelihoods for each replicate concatenated one after another. This file is in exactly the same format as PAUP site-likelihood output, so it can go directly into CONSEL. Note that CONSEL only allows a single period (".") in the site-likelihood file name, so you may need to rename the files.
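For example, with these lines in the [general] part of the configuration file (the ofprefix value is just illustrative):

```
ofprefix = myrun
outputsitelikelihoods = 1
```

the site-likelihoods would be written to myrun.sitelikes.log.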

If you want the sitelikes for a particular tree, you'd need to do this:

  • specify the tree(s) in a starting tree file.
  • specify your chosen model normally
  • add optimizeinputonly = 1 to the [general] section of your configuration file.
  • run the program normally (this will take a while because it will need to optimize the model parameters)
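The steps above amount to a configuration fragment like this sketch (mydata.nex and mytrees.tre are placeholder names for your own files):

```
[general]
datafname = mydata.nex
streefname = mytrees.tre
optimizeinputonly = 1
outputsitelikelihoods = 1
```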

Running it yourself

Use the provided configuration files as templates. I've provided some appropriate config files for smaller datasets (< about 50 taxa) and for larger ones. Note that there are some important changes from the default values that appeared in previous versions of the program; these are important to ensure that the more complex partitioned models are properly optimized.

You should set the modweight entry to at least 0.0005 x (#subsets + 1).
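For example, with three data subsets the minimum recommended value would be 0.0005 x 4:

```
modweight = 0.002
```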

Other than that and entering your dataset name on the datafname line, you should be able to run with the default values. If you know your way around the GARLI settings, feel free to tinker.

As always, you can start the program from the command line with either

./executableName configFileName

if it is in the same directory with your dataset and config file, or just

executableName configFileName

if it is in your path.

If your config file is named garli.conf, you don't need to pass it on the command line.