Lamarc tutorial


IF YOU READ THIS BEFORE THE LAB, GO TO THE "GETTING STARTED" SECTION AND DOWNLOAD THE SOFTWARE FROM THE SERVER.

A reminder: if you want to keep results (by default they are written to files called outfile and outfile.pdf), save them with cp outfile yourfavoredfilename or change the name of the output file in the menu.
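For example, after a run you might do (the ex2_... names are just placeholders; use whatever helps you keep the runs apart):

[server]$ cp outfile ex2_outfile.txt
[server]$ cp outfile.pdf ex2_outfile.pdf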


Getting started

The LAMARC converter has a graphical user interface, so you will need to run the converter on your own computer. I have prepared a local copy of the programs, which should reduce the download bottleneck. The conversion is relatively simple, so if time is short I suggest that you run LAMARC on the server and skip the conversion exercise.

Work on your own computer

If you want to work on your own computer, have a look in the directory /class/shared/lamarc_demo/distribution. Here you will find the distributions for Macintosh, Windows, and Linux. Use scp to fetch the distribution file; for Macs, for example, it is

scp userid@class-0x.mbl.edu:/class/shared/lamarc_demo/distribution/lamarc-2.1.8-osx.dmg .
  1. Unpack the file.
  2. This produces a directory called lamarc-2.1.8 or similar.
  3. cd into that folder.
  4. Download lamarcFiles.zip or copy the file /class/shared/lamarc_demo/lamarcFiles.zip into your lamarc folder. Unzip it; I also suggest moving the files it contains into your lamarc folder (see the example after the listing below). Your directory should now contain the programs and the data files for the tutorial: frog.phy, frog.xml, lpl.xml, lpl.mig, lpl.conv. Ideally, your directory on your computer will look like this:
[your own machine]$ ls
COPYING    HISTORY    README    doc
lam_conv.app    lamarc.app    lamarc_demo    lamarcFiles.zip
frog.phy    frog.xml    lpl.conv    lpl.mig    lpl.xml
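For example, the fetch-and-unzip might look like this (a sketch; it assumes unzip creates a lamarcFiles subdirectory, so adjust if it unpacks directly into the current folder):

[your own machine]$ scp userid@class-0x.mbl.edu:/class/shared/lamarc_demo/lamarcFiles.zip .
[your own machine]$ unzip lamarcFiles.zip
[your own machine]$ mv lamarcFiles/* .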

Work on the server

  1. Download lamarcFiles.zip or copy the file /class/shared/lamarc_demo/lamarcFiles.zip into your home directory. Unzip the file (see the commands below the listing). Your directory should now contain the data files for the tutorial: frog.phy, frog.xml, lpl.xml, lpl.mig, lpl.conv. Ideally, your directory will look like this:
[server]$ ls
frog.phy  frog.xml  lpl.conv  lpl.mig  lpl.xml
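On the server the setup is just a copy and an unzip, for example:

[server]$ cp /class/shared/lamarc_demo/lamarcFiles.zip ~
[server]$ cd ~
[server]$ unzip lamarcFiles.zip

(If unzip creates a subdirectory, move its contents up into your home directory with mv, as in the example above.)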


Now you are ready to start the tutorial.

Tutorial

Exercise 1: file conversion [skip this if time is short; it works only on your own computer]

Use the file conversion utility, lam_conv, to convert the sample file frog.phy to a LAMARC infile. After reading in the data file, you will want to check (and probably change) the converter’s guess for the data set’s data type. The easiest way to do this is by double-clicking in the "contiguous segment" area of the window, then checking the appropriate box under Data Type (in this case DNA, not SNP). Be sure to click the Apply button to actually save any changes made.

To save your converted data file, choose Write Lamarc File under the File drop-down menu. This will by default write a file named infile.xml.

If you become frustrated with the file converter, a pre-converted input file is available as frog.xml.

Exercise 2: estimating Θ alone

Run LAMARC and give it the full name of your LAMARC input file. Be careful to include the extension at the end of the filename (like .xml or .txt). (Note that LAMARC may be unhappy if you try to be fancy and use the Mac's drag-and-drop to fill in a path, or include a ~ in your path, etc. Just make sure LAMARC and your datafile are in the same directory, and enter the datafile name.)
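A session might start like this (a sketch; the exact prompt text varies by version, and on your own machine you would run ./lamarc from the lamarc folder):

[server]$ lamarc
  (at the file prompt, enter the data file name, e.g. frog.xml)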

We will do a likelihood-based run, in which the overall strategy is to generate many genealogies based on a single set of driving values. Each genealogy proposed is one "step". Some proportion of the steps are sampled (recorded) for use in parameter estimation. Periodically the program will estimate the parameters and start a new search with the improved parameters; each iteration of this is a "chain." The typical likelihood run will do several "initial chains" to improve the driving values, and one or two much longer "final chains" to produce the final estimate.

The default run length is generally too short for publication, but too long for a demonstration. Under the Search Strategy submenu (s), Sampling Strategy submenu (s), cut the Number of samples to discard down to 100 for both initial (4) and final (8) chains, and the final-chain Number of recorded genealogies (6) down to 1000.

Since we know this is mtDNA with a high frequency of transitions over transversions, go back to the main menu (return twice) and under submenu Data options (d), submenu Edit data model(s) for Region 1 (1) edit the data model. The TT ratio (transition/transversion ratio) is set by default to 2, which is unreasonable for mtDNA. Set it to the more reasonable value of 20. (PAUP* with Modeltest is an ideal tool for finding an optimal data model for your data.)

Run the program (. return). Progress reports should appear, including a guess as to how long the analysis will take. At the end of this section are some tips for deciding whether your run is satisfactory. You may want to consider these as you watch the progress reports; you will certainly want to consult them when the run has finished.

When the program finishes, your output will be in outfile unless you told the program to rename this file.

What is the point estimate of Θ? Write the Θ into the spreadsheet. What does this number mean? Show Answer
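(For reference, the standard coalescent definition, in case you want to check your reasoning before revealing the answer: for diploid nuclear data

    Θ = 4 Ne μ

where Ne is the effective population size and μ is the mutation rate per site per generation; for mtDNA, which is haploid and maternally inherited, the corresponding quantity is Θ = 2 Nf μ, with Nf the effective number of females.)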

What are the 95% confidence interval boundaries for this estimate?

For subsequent runs, you may want to use the menusettings_infile.xml file produced by this run as your starting point, as it will already have the desired TT ratio and search strategy.
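If your LAMARC build accepts a file name on the command line (an assumption; if not, just enter the name at LAMARC's prompt), a later run could be started as:

[server]$ lamarc menusettings_infile.xml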

Validation

Tracer[1] is mainly useful for Bayesian runs; for this likelihood run LAMARC's own diagnostics will be more helpful. Examine the output file from your run, looking for any of the following symptoms. You won't find them all; this is a general checklist useful for any likelihood-based run.

Do the parameter estimates increase (or decrease) throughout the run, rather than settling around a value? Show diagnosis


Does the "data lnL" in the run reports, which shows the score of the best genealogy found during that chain, increase throughout the run, rather than settling around a value? Show diagnosis

Is the "posterior lnL" in the run reports, which shows improvement in driving value from one chain to the next, greater than 2x the number of parameters being estimated in the final chain? Show diagnosis

If you run the program twice with the same data and settings, but different random number seeds, do the results disagree? (Compare the between-run difference with the confidence intervals.) Show diagnosis

Do the estimates of the parameters vary wildly from one chain to the next? Show diagnosis

Is the acceptance rate above 50% or below 5%? Show diagnosis

Exercise 3: estimating Θ and growth rate

Repeat the analysis, but turn on estimation of growth from the Analysis menu (remember to cut down the burn-in and samples taken, and change the TT ratio, as above; LAMARC will not remember your earlier changes). Make sure to save your previous results, either by copying/moving/renaming the previous outfile or by telling LAMARC to write its output to a file with a different name (change this in the Input and Output menu).

What is the point estimate of Θ? What is the interpretation of this number? Show answer

What are the 95% confidence interval boundaries for this estimate? What is the point estimate of the growth rate g? Write the Θ and g into the spreadsheet. How do you interpret your estimate? Show answer
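As a reminder of the exponential-growth parameterization used here, Θ at time t before the present is

    Θ(t) = Θ(now) exp(-g t)

with t measured in mutational units, so a positive g means the population has been growing.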

What are the 95% confidence interval boundaries for this estimate? Is there a lot of correlation between Θ and g? Why? Show answer

How much does this estimate of Θ differ from one made without taking growth into account? Why might we expect it to differ? Show answer

Validation

Check the same points as for Exercise 2. Additionally, for a run with growth the following problem is relevant:

Are the estimates of Θ and g enormous, with upper confidence limits that are nearly infinite? Show diagnosis

Exercise 4: estimating Θ and recombination rate

For estimation of recombination we provide a file lpl.xml with SNP data from three human populations: European Finns from North Karelia; assorted Europeans from Rochester, Minnesota; and African-Americans from Jackson, Mississippi (Nickerson et al. 1998). The file lpl.xml is a subset of the original LPL dataset.

This is a pre-generated LAMARC infile which will perform a Bayesian analysis with one final chain, sampling 7000 parameter values split randomly among all of the available parameters. If you would like to experiment with some of the more complex features of the data converter, we provide the raw migrate-format file containing the full dataset in lpl.mig, plus an XML command file for use with the converter in lpl.conv.

You will want to examine the priors (Search Strategy menu); they probably start off too wide for human data. A more reasonable choice might be an upper bound of 1.0 and a lower bound close to zero, like 0.00001. You can experiment with a linear prior here if you like; then the lower bound can actually be zero.
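The reason the lower bound matters: assuming the default prior here is logarithmic (uniform in log Θ), its density is proportional to 1/Θ, which diverges at zero, so the lower bound must be strictly positive; a linear (uniform) prior has constant density and can safely include zero.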

What is the point estimate of Θ? What are the 95% support interval boundaries for this estimate? What is the point estimate of the recombination rate r? How is this number interpreted? Show answer

What are the 95% support interval boundaries for this estimate? (If you have time) How does changing the prior for r change these results?

Validation

All of the questions for Exercise 2 are relevant here. Since this was a Bayesian run, we can additionally use the Tracer tool. Files with tracefile in the name are ready to be read by Tracer; in this case, read tracefile_LPL_1.txt.

View the trace and histogram for each parameter. The trace shows how the value of that parameter changed over the course of the run; the histogram shows the final distribution of values. You can also look at the trace of P(D|G), called the "Data ln(Likelihood)". Note that less negative numbers are better scores (a difference from PAUP*). Check the ESS (Effective Sample Size) for each parameter. Look for the following symptoms. Bear in mind that a bad Tracer result is a near-guarantee of a failed run, but a good Tracer result is not proof of success. If the search has failed to find an island of good trees, no diagnostic can detect this.
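For reference, the ESS is an estimate of the number of effectively independent samples in the trace, roughly

    ESS = N / (1 + 2 Σk ρk)

where N is the number of recorded samples and ρk is the autocorrelation of the trace at lag k. A strongly autocorrelated run therefore has an ESS far below its raw sample count.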

ESS (Effective Sample Size) for any parameter less than 200. Show diagnosis

Directional trends in the trace of a parameter. Show diagnosis

Sudden leaps in the trace of a parameter, usually the recombination rate. Show diagnosis

Multiple peaks in the histogram of a parameter, usually a migration rate. Show diagnosis

A "knee" on the lower side of the histogram for a parameter. Show diagnosis

Exercise 5: improving a poorly performing run

These optional exercises are time-consuming; they are good ways to explore the program further during your free time.

On the LPL data, the given run conditions lead to very poor performance. Here are some strategies that may improve these runs. Try a few and see what happens.

- Increase the number of steps in the chain (Search Strategy menu).
- Add heating, with one heated chain (Search Strategy menu). A temperature of 1.2 for the hot chain is recommended. Aim for a temperature at which swaps between hot and cold chains are moderate, between 10% and 50%.
- Refine the migration model (Analysis menu) so that migration rates are symmetrical or some migration pathways are not allowed. For example, it is probably reasonable not to estimate gene flow from Jackson (African-American) into Karelia (Finnish).
- Set some Θ values to pre-determined figures so that they will not have to be estimated (Analysis menu). A previous study suggested Θ(Jackson)=0.0072, Θ(Karelia)=0.0027, Θ(Rochester)=0.0031.
- Turn off estimation of recombination (Analysis menu).
- Narrow the Bayesian priors (Search Strategy menu) so that a smaller range of values will be proposed. This may improve acceptance rates.
- Change to a likelihood run. (This is mainly helpful if data are very sparse or parameter values are very close to their lower limits.) If you do, you will have two more options to consider: run more chains (Search Strategy menu), and adjust your starting values (Analysis menu).
- Cut down this three-population case to a two-population case. (This can be done by cutting down the MIGRATE-N input file and then feeding it into the file converter lam_conv.)
- Randomly remove some sequences from the LAMARC input file, being careful to keep the rest of the XML intact. Be sure to save the file as flat text, not .doc or other rich formats.

Final thoughts

LAMARC is a powerful tool but must be used thoughtfully. It will often perform poorly with "out of the box" settings. Plan to do some trial runs to explore the behavior of the program with your data set. If you encounter any difficulties, please email the LAMARC development team. It is helpful if you include the data file that is causing the problem, especially for crashes and other bugs.

References