Garli using partitioned models
- 1 Configuring a partitioned analysis
- 1.1 Dividing up the data
- 1.2 Specifying the models
- 1.3 Choosing partitioning schemes
- 2 Check the output
- 3 The sample runs
- 4 Site-likelihood output
- 5 Running it yourself
Configuring a partitioned analysis
Not surprisingly, setting up partitioned models is more complicated than normal GARLI usage, since you need to tell the program how to divide your data and what models to apply. Deciding how to choose the models is also a complex issue.
NOTE: If you are the kind of person who would rather just try your hand at running partitioned models without reading all of this first, you'll find example runs and template configuration files in the example/partition/ directory of any GARLI distribution. I'd still suggest reading this at some point to be sure that you understand your options for configuration.
Dividing up the data
Note that I use the technically correct (but often misused) definition of a partition. A partition is a scheme for dividing something up. It does the dividing, like a partition in a room. The individual chunks of data that are created by the partition are referred to as subsets.
This version requires NEXUS formatted datafiles, and the partitioning is specified via standard NEXUS commands appearing in a sets or assumptions block in the same file as the data matrix. The setup of the actual models will come later. For a dataset with 2178 characters in a single data or characters block, it would look like this:
begin sets; charset ND2 = 1-726; charset rbcl = 727-1452; charset 16S = 1453-2178; charpartition byGene = chunk1:ND2, chunk2:rbcl, chunk3:16S; [you could also put characters exclusions here by removing the 's from the line below] [note that the excluded sites should still appear in the charpartition, however] [exset * myexclusions = 600-800, 850, 900-100;] end;
The above block would divide up the sites into three sets of 726 characters each.
To put charsets ND2 and rbcl in a single partition subset, the charpartition command would look like this
charpartition bySites = chunk1:ND2 rbcl, chunk2:16S;
Note the space rather than comma between ND2 and rbcl.
The names are unimportant here. The general format is:
charset <charset name> = <list or range of sites>; charset <charset name> = <list or range of sites>; charpartition <charpartition name> = (cont.) <partition subset 1 name>:<sites or charset making up 1st subset>, (cont.) <partition subset 2 name>:<sites or charset making up 1st subset>, <etc>;
To easily specify charsets that divide characters up by codon position, do this: charset 1stpos = 1-2178\3; charset 2ndpos = 2-2178\3; charset 3rdpos = 3-2178\3;
Note that if a charpartition appears, GARLI will AUTOMATICALLY apply a partitioned model. If you don't want that for some runs, remove or comment out (surround it with [ ]) the charpartition command.
Also note that GARLI will also automatically partition if it sees multiple characters blocks, so that is an alternate way to do this (instead of the charpartition).
Specifying the models
DO SOME SORT OF MODEL TESTING! The parameter estimates under partitioned models are currently somewhat erratic if the models are over-parameterized. Use ModelTest or some other means for finding the best model for each data subset. Note that the best model for each subset separately is not necessarily the best when they are combined in a partitioned model, but they will give a useful measure of which parameters are justified in each subset.
As usual for GARLI, the models are specified in the configuration file. If you aren't familiar with the normal way that models are configured in GARLI, see the general info in the manual here, and FAQ entry here.
There are two new configuration entries that relate to partitioned models: linkmodels = 0 or 1 subsetspecificrates = 0 or 1
linkmodels means to use a single set of model parameters for all subsets.
subsetspecificrates means to infer overall rate multipliers for each data subset. This is equivalent to *prset ratepr=variable* in MrBayes
So, there are various combinations here:
|0||0||different models, branch lengths equal|
|0||1||different models, different subset rates|
|1||0||single model, one set of branch lengths (equivalent to non-partitioned analysis)|
|1||1||single model, different subset rates (like site-specific rates model in PAUP*)|
The normal model configuration entries are the following, with the defaults in *bold*:
datatype = nucleotide, aminoacid, codon-aminoacid or codon ratematrix = 6rate, 2rate, 1rate, or other matrix spec. like this :( a, b, c, d, e, f ) statefrequencies = estimate, equal, empirical, (+others for aminoacids or codons. See manual) ratehetmodel = gamma, none numratecats = 4, 1-20 (must be 1 if ratehetmodel = none, must be > 1 if ratehetmodel = gamma) invariantsites = estimate, none
If you leave these as is, set linkmodels = 0 and have a charpartition defined in the datafile, each subset will automatically be assigned a separate unlinked version of GTR+I+G. In that case there is nothing else to be done. You can start your run.
If you want different models for each subset you need to add a set of model settings for each, with a specific heading name in 's. The headings need to be [model1], [model2], etc., and are assigned to the subsets in order. The number of configuration sets must match the number of data subsets.
For example, the following would assign the GTR+G and HKY models to the first and second data subsets.
[model1] datatype = nucleotide ratematrix = 6rate statefrequencies = estimate ratehetmodel = gamma numratecats = 4 invariantsites = none [model2] datatype = nucleotide ratematrix = 2rate statefrequencies = estimate ratehetmodel = none numratecats = 1 invariantsites = none
These should appear in place of the normal set of model configuration settings, and should be placed just before the [master] heading of the configuration file.
If you're not sure how to specify a given model, see the FAQ entry here.
NOTE THAT THE BRACKETED LINES BEFORE EACH MODEL DESCRIPTION ARE NOT COMMENTS, AND MUST APPEAR AS ABOVE WITH CONSECUTIVE MODEL NUMBERING STARTING AT ONE!
Choosing partitioning schemes
How many partition subsets can I use?
There is no hard limit that is enforced on the number of subsets that you can use. I've done more than 60 myself. See the below for considerations about how many you should use.
What do I need to consider when choosing models and a partitioning scheme?
That is a difficult question, and one that is not well investigated in ML phylogenetics (as opposed to Bayesian phylogenetics). I definitely suggest that you do not partition overly finely, as the likelihood surface becomes difficult to optimize. Keep in mind that partitioning finely may create subsets with very few changes, and therefore little signal. This makes the parameter likelihood surfaces very flat and difficult to optimize. This is particularly true when bootstrapping, which can further reduce signal by creating some resampled datasets with even fewer variable sites.
NOTE Do not assume that the partitioning scheme and model choice method that you use in MrBayes or another Bayesian method are appropriate in an maximum likelihood context! This is primarily because the Bayesian way to deal with nuisance parameters is to marginalize over them, meaning to account for the shape of the entire likelihood surface. This effectively integrates out uncertainty in parameter estimates, even if the likelihood surface is very flat and those estimates are very uncertain.
In contrast, ML analyses seek to maximize the likelihood with respect to nuisance parameters. If the likelihood surface is very flat and suggests that a parameter value could nearly equally lie between a value of 1.0 and 3.0, an ML method will still return the value at the very peak of the distribution (lets say 2.329), even if it is only a fraction more likely than surrounding values. Another reason that Bayesian methods can be more appropriate for analyzing data with little signal is that prior distributions can be used to provide some outside information and keep parameter estimates reasonable. However, informative prior estimates are not used that often in standard phylogenetic analyses.
Ok, then how should I partition and choose models in practice?
NOTE: The following assumes that subset specific rates ARE being estimated e.g., subsetspecificrates = 1.
Consider the following procedure that I've used, from a real dataset of 2 genes. It is fairly complicated, but I think as statistically rigorous as can be expected. We'll assume that the smallest subsets will be by codon position, thus there are 6 potential subsets. We'd like to find the most appropriate scheme to use for analyzing each gene separately, as well as both concatenated. But what model should be chosen for the subsets, and how should the subsets be constructed?
1. First, construct all reasonable sets of sites for model testing. This amounts to:
- Each codon position of each gene individually (6)
- The full concatenated alignment (1)
- The full sequence of each gene (2)
- The first and second positions combined for each gene (2)
- The first, second and third positions pooled across the genes (3)
Note that this list doesn't need to be exhaustive. I've omitted various combinations that I find unlikely a priori, for example a combination of first and third positions.
2. Now, use unpartitioned models and the AIC criterion with ModelTest or a similar procedure to chose the "best" model for each subset above. These will be the models applied to each of the subsets as we move to partitioned models. The following were the results in my case. If you aren't very familiar with the typical models, the information here: (Model configuration) and some of the items here: ( FAQ:Model choices) may be helpful.
|Alignment||Full||1st Pos||2nd Pos||3rd Pos||1st Pos + 2nd Pos|
3. Now, create partitioning schemes and configure GARLI partitioned analyses with each subset using the model chosen for it in the previous step. When choosing the scheme for Gene 1 analyses, the combinations would be each position separately, which I'll denote (1)(2)(3), the three positions together (i.e., unpartitioned or (123) ), and the first and second positions grouped, (12)(3). For the concatenated alignment there are many more combinations, not all of which need to necessarily be considered. The models assigned within subsets are from the above table, for example, when partitioning Gene 1 by codon position, the chosen models were GTRIG, TVMIG and TVMIG for the subsets of first, second and third positions respectively.
Now run full GARLI analyses with four search reps (or at least two) for each of these partitioning schemes, and note the best log-likelihood across the search replicates for each partitioning scheme. Note that you could also do analyses giving GARLI a fixed tree to optimize on, which is actually the more common way that model choice is done. See here for information on fixing the tree topology.
4. Now, note the number of model parameters that were estimated in each partitioning scheme. For example, when partitioning gene 1 by codon position, the chosen models were GTRIG, TVMIG and TVMIG, which have 10, 9 and 9 free parameters respectively. (GTRIG has 10 parameters because 5 free relative rates + 3 free base frequencies + 1 invariable sites parameter + 1 alpha shape parameter of the gamma distribution = 10). In addition, subset specific rate multipliers were estimated, which adds (#subsets - 1) additional parameters, bringing the total to 30. Use the log-likelihood score and parameter count to calculate the AIC score for each partitioning scheme as AIC = 2 x (# parameters - lnL).
5. Finally, for each alignment compare the AIC scores of each partitioning scheme and choose the lowest value as the best. See the below table for my results. In this example (GTRIG)(TVMIG)(TVMIG) was chosen for first gene and (GTRIG)(GTRIG)(TVMIG) was chosen for the second gene. The concatenated alignment result was less obvious a priori, with 4 subsets. First positions shared a subset with GTRIG, second positions each had their own model with TVMIG for gene 1 and GTRIG for gene 2, and the third positions shared a subset with TVMIG. The AIC scores are summarized in the following table:
The above procedure is somewhat complicated (I may write some scripts to somewhat automate it at some point). However, you can now be confident that you've evaluated and and possibly chosen a better configuration than you might have if you had partitioned a priori. Note that in the above example the best model for the concatenated alignment, (11)(2)(2)(33), has 20 fewer parameters than the one that most people would have chosen a priori, (1)(2)(3)(1)(2)(3). Some caveats:
- This methodology and example only includes two genes! Many datasets nowadays have many more. Exhaustively going through the subsets and partitions as above will quickly get out of hand for five or ten genes. However, you can still work to find good partitioning schemes within each gene, as was done for the single genes in the example above. Specifically, after finding the best models for each possible subset of each gene via ModelTest or the like, comparison of partitioning schemes (123), (12)(3) and (123) can be done for each. I've found that the (12)(3) scheme can work well in cases in which there are only a few second position changes, making parameter optimization difficult when it is analyzed as an independent subset. You might also do something like the full procedure above on sets of genes that you know 'a priori are similar, for example mitochondrial genes, rRNA genes, etc.
- This discussion ignores the ability to "link" parameters across subsets (which I don't yet have implemented in GALRI). Surely more parameters could be eliminated by linking. For example, all of the subsets may have similar base frequencies, and could share a single set of frequency parameters. If so, three parameters could be eliminated for each subset, which is a significant number.
- Some may argue that this is a lot of work for unclear benefit. This might be true, although reducing the number of free parameters is thought to be statistically beneficial in a general sense, if not specifically for partitioned phylogenetic models. Doing full searches to find the best model and then using it for more searches may seem wasteful, but remember that the Bayesian analog to this AIC procedure is the comparison of Bayes factors, which also generally require full analyses using each of the competing models.
Check the output
Once you've done a run, check the output in the .screen.log file to see that your data were divided and models assigned correctly. The output is currently very verbose.
First the details of the data partitioning appear. Check that the total number of characters per subset looks correct.
GARLI partition subset 1 CHARACTERS block #1 ("Untitled DATA Block 1") CHARPARTITION subset #1 ("1stpos") Data read as Nucleotide data, modeled as Nucleotide data Summary of data or data subset: 11 sequences. 441 constant characters. 189 parsimony-informative characters. 96 autapomorphic characters. 726 total characters. 238 unique patterns in compressed data matrix. GARLI partition subset 2 CHARACTERS block #1 ("Untitled DATA Block 1") CHARPARTITION subset #2 ("2ndpos") Data read as Nucleotide data, modeled as Nucleotide data Summary of data or data subset: 11 sequences. 528 constant characters. 103 parsimony-informative characters. 95 autapomorphic characters. 726 total characters. 158 unique patterns in compressed data matrix. GARLI partition subset 3 CHARACTERS block #1 ("Untitled DATA Block 1") CHARPARTITION subset #3 ("3rdpos") Data read as Nucleotide data, modeled as Nucleotide data Summary of data or data subset: 11 sequences. 103 constant characters. 539 parsimony-informative characters. 84 autapomorphic characters. 726 total characters. 549 unique patterns in compressed data matrix.
Then a description of the models and model parameters assigned to each subset appears. Parameters are at their initial values. This indicates three models with GTR+G for each:
MODEL REPORT - Parameters are at their INITIAL values (not yet optimized) Model 1 Number of states = 4 (nucleotide data) Nucleotide Relative Rate Matrix: 6 rates AC = 1.000, AG = 4.000, AT = 1.000, CG = 1.000, CT = 4.000, GT = 1.000 Equilibrium State Frequencies: estimated (ACGT) 0.3157 0.1746 0.3004 0.2093 Rate Heterogeneity Model: 4 discrete gamma distributed rate categories, alpha param estimated 0.5000 Substitution rate categories under this model: rate proportion 0.0334 0.2500 0.2519 0.2500 0.8203 0.2500 2.8944 0.2500 Model 2 Number of states = 4 (nucleotide data) Nucleotide Relative Rate Matrix: 6 rates AC = 1.000, AG = 4.000, AT = 1.000, CG = 1.000, CT = 4.000, GT = 1.000 Equilibrium State Frequencies: estimated (ACGT) 0.2703 0.1566 0.1628 0.4103 Rate Heterogeneity Model: 4 discrete gamma distributed rate categories, alpha param estimated 0.5000 Substitution rate categories under this model: rate proportion 0.0334 0.2500 0.2519 0.2500 0.8203 0.2500 2.8944 0.2500 Model 3 Number of states = 4 (nucleotide data) Nucleotide Relative Rate Matrix: 6 rates AC = 1.000, AG = 4.000, AT = 1.000, CG = 1.000, CT = 4.000, GT = 1.000 Equilibrium State Frequencies: estimated (ACGT) 0.1460 0.3609 0.2915 0.2015 Rate Heterogeneity Model: 4 discrete gamma distributed rate categories, alpha param estimated 0.5000 Substitution rate categories under this model: rate proportion 0.0334 0.2500 0.2519 0.2500 0.8203 0.2500 2.8944 0.2500 Subset rate multipliers: 1.00 1.00 1.00
If you setup multiple search replicates in the configuration file (which you should!), a summary of the model parameters estimated during each replicate is displayed. All of the parameter values and (hopeful) the likelihoods should be fairly close for the different replicates.
Completed 5 replicate runs (of 5). Results: Replicate 1 : -13317.4777 (best) Replicate 2 : -13317.4813 (same topology as 1) Replicate 3 : -13317.4839 (same topology as 1) Replicate 4 : -13317.4863 (same topology as 1) Replicate 5 : -13317.4781 (same topology as 1) Parameter estimates: Partition subset 1: r(AC) r(AG) r(AT) r(CG) r(CT) r(GT) pi(A) pi(C) pi(G) pi(T) alpha rep 1: 1.97 2.58 1.42 1.41 3.72 1.00 0.310 0.177 0.297 0.216 0.411 rep 2: 1.96 2.58 1.41 1.41 3.71 1.00 0.310 0.177 0.296 0.216 0.409 rep 3: 1.97 2.58 1.42 1.40 3.71 1.00 0.309 0.177 0.298 0.216 0.411 rep 4: 1.96 2.57 1.42 1.40 3.72 1.00 0.310 0.177 0.297 0.216 0.411 rep 5: 1.96 2.57 1.41 1.40 3.71 1.00 0.310 0.177 0.297 0.216 0.409 Partition subset 2: r(AC) r(AG) r(AT) r(CG) r(CT) r(GT) pi(A) pi(C) pi(G) pi(T) alpha rep 1: 4.32 7.05 1.60 7.05 4.37 1.00 0.269 0.164 0.160 0.407 0.361 rep 2: 4.32 7.03 1.59 7.08 4.37 1.00 0.270 0.163 0.160 0.407 0.361 rep 3: 4.33 7.08 1.60 7.07 4.37 1.00 0.269 0.164 0.160 0.406 0.361 rep 4: 4.34 7.09 1.61 7.10 4.40 1.00 0.269 0.164 0.160 0.407 0.361 rep 5: 4.35 7.08 1.60 7.11 4.39 1.00 0.269 0.163 0.160 0.407 0.360 Partition subset 3: r(AC) r(AG) r(AT) r(CG) r(CT) r(GT) pi(A) pi(C) pi(G) pi(T) alpha rep 1: 1.06 5.26 3.55 0.45 4.98 1.00 0.154 0.356 0.287 0.203 2.988 rep 2: 1.06 5.26 3.56 0.45 4.98 1.00 0.154 0.356 0.287 0.203 2.995 rep 3: 1.06 5.26 3.57 0.46 5.00 1.00 0.154 0.355 0.287 0.204 3.008 rep 4: 1.07 5.31 3.57 0.46 5.00 1.00 0.154 0.356 0.286 0.204 2.991 rep 5: 1.06 5.25 3.56 0.45 5.00 1.00 0.154 0.356 0.287 0.203 2.978 Subset rate multipliers: rep 1: 0.538 0.298 2.164 rep 2: 0.538 0.299 2.164 rep 3: 0.539 0.300 2.162 rep 4: 0.539 0.298 2.163 rep 5: 0.537 0.299 2.164 Final result of the best scoring rep (#1) stored in GTRG.byCodonPos.best.tre Final results of all reps stored in GTRG.byCodonPos.best.all.tre
The sample runs
In the example/partition/exampleRuns directory included with GARLI distributions, I've included the configuration and output files from two runs with a very small 11 taxon x 2178 character dataset. It is partitioned by codon position. Not surprisingly, the tree is found almost immediately, and the rest of the time is spent optimizing the models and branch lengths.
The first run (in 3parts.sameModelType) is an example of using a single model type for all subsets, in this case GTR+G.
The second run (in 3parts.diffModelTypes) is an example of using a different model for each subset. In this case the 3 models aren't even any of the normal named ones (JC, K2P, HKY, GTR, etc). I combined some of the rate matrix parameters that looked very similar in the first run, and added a proportion of invariant sites to the third subset.
Although this doesn't have anything to do with partitioning, I'll mention that the way of specifying any arbitrary restriction of the GTR model in GARLI is similar to that used in PAUP. Parameters that are shared have the same #. The parameters are in the order of A-C, A-G, A-T, C-G, C-T and G-T. For example, the HKY or K2P models give one rate to transitions (A-G and C-T) and another to transversions:
ratematrix = ( 0 1 0 0 1 0 )
The GTR model looks like this:
ratematrix = ( 0 1 2 3 4 5 )
Note that these two particular configuations are equivalent to ratematrix = 2rate and ratematrix = 6 rate, respectively.
If you'd like to get site-wise likelihoods, for example to input into CONSEL, add outputsitelikelihoods = 1 to the top part of your config file. This will create a <ofprefix>.sitelikes.log file that has the site-likelihoods for each replicate concatenated one after another. This file is in exactly the same format as PAUP site-likelihood output, so can go directly into CONSEL. Note that CONSEL only allows a single period (".") in the site-likelihood file name (i.e., myrun.txt, not my.run.txt), so you may need to rename the files.
If you want the sitelikes for a particular tree, you'd need to do this:
- specify the tree(s) in a starting tree file.
- specify your chosen model normally
- add optimizeinputonly = 1 to the [general] section of your configuration file.
- run the program normally (this will take a while because it will need to optimize the model parameters)
Running it yourself
Use the provided configuration files as templates. I've provided some appropriate config files for smaller datasets ( < about 50 taxa) and for larger ones. Note that there are some important changes from the defaults values that appear in previous versions of the program. These are important to ensure that the more complex partitioned models are properly optimized.
You should set modweight entry to at least 0.0005 x (#subsets + 1).
Other than that and entering your dataset name on the datafname line, you should be able to run with the default values. If you know your way around the Garli settings, feel free to tinker.
As always, you can start the program from the command line with either
if it is in the same directory with your dataset and config file, or just
if it is in your path.
If your config file is named garli.conf, you don't need to pass it on the command line.