Garli Mkv morphology model

GARLI for "standard" data

GARLI 2.0+ implements the "Mk" and "Mkv" models of Lewis (2001), "A Likelihood Approach to Estimating Phylogeny from Discrete Morphological Data".

Specific information on the use of morphology data appears on this page, but be sure to also read the primary documentation on partitioned models (Garli_using_partitioned_models).

This implementation

  • Allows use of character data with any number of states, arbitrarily coded as 1, 2, 3 etc. This is termed the "standard" datatype by the Nexus format.
  • Allows simultaneous use of this standard data and typical sequence data (dna, protein) in a partitioned model (although there are some limitations).
  • Allows use of the "Mk" model, which assumes that the data collected could contain constant characters
  • Allows use of the "Mkv" model, which assumes that the data collected contains only variable characters
  • Allows versions of the Mk and Mkv models that treat the states as ordered characters
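
The difference between Mk and Mkv can be illustrated with a short Python sketch. Following the conditioning described by Lewis (2001), Mkv divides each character's likelihood by the probability that a character is variable. The function name and its inputs are hypothetical, this is not GARLI's internal code, and the sketch assumes that all characters passed in share the same number of states.

```python
import math

def mkv_log_likelihood(char_log_likes, constant_pattern_log_likes):
    """Condition an Mk log-likelihood on every character being variable.

    char_log_likes: Mk log-likelihood of each observed character.
    constant_pattern_log_likes: Mk log-likelihood of each possible constant
        pattern (all taxa in state 1, all taxa in state 2, ...) on the same tree.
    Both inputs are assumed to come from some other Mk likelihood routine.
    """
    p_constant = sum(math.exp(x) for x in constant_pattern_log_likes)
    if not 0.0 <= p_constant < 1.0:
        raise ValueError("probability of a constant character must be in [0, 1)")
    # Each character's likelihood is divided by P(variable) = 1 - P(constant).
    log_p_variable = math.log1p(-p_constant)
    return sum(char_log_likes) - len(char_log_likes) * log_p_variable
```

Because the probability of a variable character is at most 1, the corrected log-likelihood is never lower than the uncorrected Mk value for the same tree and data.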

Limitations of this version

  • ONLY allows equal rates of substitution between states (rate of change from 1 -> 2 = 2 -> 1)
  • ONLY allows equal frequencies of the character states (state 1 = state 2)
  • CAN'T create stepwise addition starting trees under Mkv (for technical reasons)
  • CAN'T use rate heterogeneity with the Mk/Mkv models.
  • Another technical limitation:
    • IF you are mixing the morphology model with sequence data (DNA)
    • AND different characters have different numbers of states (e.g., character 1 has observed states 1, 2, and 3, while character 2 has states 1 and 2)
    • THEN you will not be able to infer separate subset-specific rates for the DNA and morphological sets of data unless you also infer different rates for each set of characters with the same number of observed states
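
With equal rates and equal state frequencies, the Mk transition probabilities have a simple closed form (Lewis 2001). The Python sketch below shows that form for a character with k states. The function name is hypothetical, and measuring the branch length in expected changes per character is an assumption of this sketch rather than a statement about GARLI's internals.

```python
import math

def mk_transition_prob(k, branch_length, same_state):
    """Mk probability of ending in the same state, or in one particular
    different state, after a branch of the given length.

    branch_length is assumed to be in expected changes per character.
    """
    if k < 2:
        raise ValueError("a character needs at least two states")
    if branch_length < 0:
        raise ValueError("branch length cannot be negative")
    decay = math.exp(-k * branch_length / (k - 1))
    if same_state:
        return 1.0 / k + (k - 1) / k * decay
    return 1.0 / k - decay / k

# A three-state character on a branch of length 0.1:
p_same = mk_transition_prob(3, 0.1, same_state=True)
p_diff = mk_transition_prob(3, 0.1, same_state=False)
# Staying put plus moving to either of the two other states covers every outcome.
total = p_same + 2 * p_diff
```

On a very long branch both probabilities approach 1/k, which reflects the equal state frequencies that the model assumes.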

Application of this version to indel character data

One potential use of the "standard" data models implemented here is to encode indels (gaps) from your alignment as independent characters (in a separate data matrix) and analyze them simultaneously with your sequence data in a partitioned analysis. Note that the jury remains out on whether this is a good or helpful approach to take, and I don't necessarily endorse it. Certainly the gap and sequence matrices are not independent of one another and will tend to reinforce each other's signals, thus raising support in a way that may or may not be appropriate.

Availability

Version 2.0 allows these models, and is available here: http://garli.googlecode.com

Basic usage

Very little needs to be done to use the Mk/Mkv models.

Data

Have a Nexus datafile with your standard data in a characters or data block.

Configuration

The section of the configuration file containing the model settings should look like this:

datatype = standard (or standardXXX, see below)
ratematrix = 1rate
statefrequencies = equal
ratehetmodel = none
numratecats = 1
invariantsites = none

The datatype is the only thing that can be changed here, and there are a few options:

  • standard - States are coded arbitrarily, and the number of observed states is assumed to be the maximum for each site. That is, if a column has a mix of states 1, 3 and 4, it is assumed that only these three states are possible. This is the "Mk" model.
  • standardvariable - As standard, except that it corrects for the fact that constant columns will not have been collected, and therefore won't appear in the matrix. This is the "Mkv" model, and should generally be preferred over Mk for morphological data.
  • standardordered - As standard, except that the state numbers DO matter, and transitions can only change the state by one number at a time. For example, getting from state 2 to state 4 requires two changes. State numbers can be missing, so there could be an intermediate state that is unobserved.
  • standardvariableordered - A combination of the properties of standardvariable and standardordered.
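
Before choosing between these datatypes it can help to inspect the matrix. The Python sketch below is a hypothetical helper, not part of GARLI. It lists columns with fewer than two observed states (which the Mkv datatypes assume are absent) and counts the steps between two states under the ordered datatypes. It assumes single-character numeric state codes, with "?" and "-" marking missing data and gaps.

```python
MISSING = {"?", "-"}  # assumed missing-data and gap symbols

def column_states(matrix, index):
    """Set of states observed in one column, ignoring missing data."""
    return {row[index] for row in matrix.values()} - MISSING

def constant_columns(matrix):
    """1-based numbers of the columns with fewer than two observed states."""
    nchar = len(next(iter(matrix.values())))
    return [i + 1 for i in range(nchar) if len(column_states(matrix, i)) < 2]

def ordered_steps(state_a, state_b):
    """Number of single-step changes separating two ordered states."""
    return abs(int(state_a) - int(state_b))

matrix = {"taxonA": "1124", "taxonB": "1314", "taxonC": "1?14"}
print(constant_columns(matrix))        # [1, 4]
print(len(column_states(matrix, 1)))   # 2 (only states 1 and 3 are observed)
print(ordered_steps("2", "4"))         # 2
```

In this toy matrix, columns 1 and 4 are constant, so the matrix would not fit the assumption behind standardvariable unless those columns were excluded.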

Now run the program as usual.

Partitioned usage

To use Mk/Mkv in a partitioned model (with other types of data), the procedure is this:

Data

Get your data ready. In your Nexus datafile, your sequence data and Mk type data will need to appear in separate characters blocks. Multiple characters blocks automatically create a partitioned model in GARLI. In general, the file should be formatted something like this:

#NEXUS
begin taxa;
<contents of taxa block>
end;
begin characters;
<one of your types of data>
end;
begin characters;
<more data of the same or a different type>
end;
<more characters blocks if necessary>

Note that this means that you will need to use a taxa block in addition to your characters blocks. If you are using multiple types of data you cannot use data blocks. As for how to get your data into this format, two options are to paste multiple characters blocks into a single file (with one taxa block), or to get your data into Mesquite as separate matrices and then save it.
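
As a third option, a file in this layout could be generated with a short script. The Python sketch below is only an illustration. The function names are hypothetical, taxon names are assumed to contain no spaces or punctuation, and the format line (datatype, missing, gap, symbols) uses common Nexus settings that should be checked against your own data before running GARLI.

```python
def taxa_block(names):
    """Build a taxa block listing every taxon once."""
    return "\n".join([
        "begin taxa;",
        f"  dimensions ntax={len(names)};",
        "  taxlabels " + " ".join(names) + ";",
        "end;",
    ])

def characters_block(names, matrix, datatype, symbols=None):
    """Build one characters block; matrix maps taxon name to its row."""
    if set(matrix) != set(names):
        raise ValueError("each matrix must contain exactly the taxa in the taxa block")
    nchar = len(matrix[names[0]])
    if any(len(matrix[name]) != nchar for name in names):
        raise ValueError("all rows in a matrix must have the same length")
    fmt = f"  format datatype={datatype} missing=? gap=-"
    if symbols:
        fmt += f' symbols="{symbols}"'
    width = max(len(name) for name in names)
    rows = [f"    {name.ljust(width)}  {matrix[name]}" for name in names]
    return "\n".join(["begin characters;", f"  dimensions nchar={nchar};",
                      fmt + ";", "  matrix", *rows, "  ;", "end;"])

def partitioned_nexus(names, blocks):
    """Join one taxa block and several characters blocks into one file."""
    parts = ["#NEXUS", taxa_block(names)]
    parts += [characters_block(names, m, dt, sym) for m, dt, sym in blocks]
    return "\n".join(parts) + "\n"

names = ["taxonA", "taxonB", "taxonC"]
dna = {"taxonA": "ACGT", "taxonB": "ACGA", "taxonC": "AC-T"}
morph = {"taxonA": "010", "taxonB": "110", "taxonC": "0?1"}
text = partitioned_nexus(names, [(dna, "dna", None), (morph, "standard", "01")])
print(text)
```

The order of the characters blocks in the output matters, because it determines which model section of the configuration file applies to each block.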

Configuration

At this point the run can be configured as a typical partitioned analysis. Much more information on that appears on the Partition_testing_version page.

In short, the multiple models are specified in the format below. The "[model1]" and "[model2]" headings are important: they indicate which characters blocks (or partition subsets) the models are applied to, in the same order. The numbering starts at 1, so the first characters block is model1, the second is model2, etc. Assuming that the characters blocks in the file appear with the nucleotide data first:

 [model1]
 datatype = nucleotide
 ratematrix = 6rate
 statefrequencies = estimate
 ratehetmodel = gamma
 numratecats = 4
 invariantsites = none

 [model2]
 datatype = standardvariable
 ratematrix = 1rate
 statefrequencies = equal
 ratehetmodel = none
 numratecats = 1
 invariantsites = none

This would specify the GTR+Gamma model for the DNA data, and Mkv for the standard data.

Now run the program as usual.

Program output

Output files will be more or less as usual. If you look in the .screen.log file you will notice that the standard (morphology) data are split into a number of models, each representing a given number of states: one model for characters showing 2 states, one model for characters showing 3 states, etc. This is normal. You will also notice that with the current basic Mk/Mkv implementation there aren't any parameters to be estimated or reported.
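
To anticipate how many such models a matrix will be split into, the characters can be grouped by their number of observed states. The Python sketch below is a hypothetical helper that mimics the grouping described above. It is not GARLI's code, and it assumes single-character state codes with "?" and "-" treated as missing.

```python
from collections import defaultdict

def group_by_state_count(matrix, missing="?-"):
    """Map number of observed states -> 1-based character numbers."""
    nchar = len(next(iter(matrix.values())))
    groups = defaultdict(list)
    for i in range(nchar):
        states = {row[i] for row in matrix.values()} - set(missing)
        groups[len(states)].append(i + 1)
    return dict(sorted(groups.items()))

matrix = {"taxonA": "0101", "taxonB": "1013", "taxonC": "0?04"}
print(group_by_state_count(matrix))  # {2: [1, 2, 3], 3: [4]}
```

Here the first three characters show two states each and the fourth shows three, so two separate models would be expected in the log.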