- 1 Types of Software Needed for Phylogenetic Analyses:
- 2 General knowledge
- 3 Practical Applications (Flowcharts for using programs):
- 3.1 Making a sequence alignment
- 3.2 Calculating sequence statistics of a sequence alignment (κ, average base frequencies, etc.)
- 3.3 Wrestling with data formats
- 3.4 Selecting Models
- 3.5 Making a phylogenetic tree (Parsimony)
- 3.6 Making a phylogenetic tree (Distance)
- 3.7 Making a phylogenetic tree (Maximum Likelihood)
- 3.8 Making a phylogenetic tree (Bayesian)
- 3.9 Calculating a molecular clock
- 3.10 Calculating dN/dS along phylogenies
- 3.11 Calculating an Ancestral sequence
- 3.12 Calculating variance of dN/dS in different sites.
Types of Software Needed for Phylogenetic Analyses:
NOTE: I don't think we should add the name in each contribution. The history section provides this kind of info.
Models of Evolution.
These are described brilliantly in Chapter 11 of Molecular Systematics, 2nd ed. edited by Hillis, Moritz and Mable, 1996.
The chapter is written by Dave Swofford, David Hillis and others. Read it a couple of times at least and you should start to see the relationships among the models and their parameters. It should drive home the fact that all the models we use (JC, K80, F84, HKY, and their rate heterogeneity variants) are special cases of the GTR model. It also has a very nice discussion of long branch attraction (the Felsenstein zone) and of across-site rate heterogeneity.
Much of the book is quaintly out of date but this chapter and Chapter 12 (by David Hillis) are worth the price of the book or a trip to the library. Chapter 12 has an excellent (scary) discussion of confidence levels on molecular clocks.
Practical Applications (Flowcharts for using programs):
Making a sequence alignment
A nice, free GUI program for editing sequences, aligning them and more is SeaView, which embeds PhyML for ML analysis and also offers distance and parsimony analysis. The included alignment programs are Muscle, which I really like, and Clustal.
Another terrific alignment program is T-Coffee (and R-Coffee for rRNAs).
Calculating sequence statistics of a sequence alignment (κ, average base frequencies, etc.)
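For a rough first look, the average base frequencies and the observed numbers of transitions and transversions can be tallied directly from an alignment. The sketch below is a minimal illustration, assuming the alignment is already loaded as a dictionary of equal-length DNA strings; the function names and the toy data are made up for this example. Keep in mind that κ itself is a model parameter that is normally estimated by maximum likelihood, so the raw counts here are only a crude indicator.

```python
from collections import Counter
from itertools import combinations

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def base_frequencies(alignment):
    """Average frequency of A, C, G, T over all sequences (gaps/ambiguities ignored)."""
    counts = Counter()
    for seq in alignment.values():
        counts.update(b for b in seq.upper() if b in "ACGT")
    total = sum(counts.values())
    return {b: counts[b] / total for b in "ACGT"}


def ts_tv_counts(alignment):
    """Count observed transitions and transversions over all sequence pairs."""
    ts = tv = 0
    for s1, s2 in combinations(alignment.values(), 2):
        for a, b in zip(s1.upper(), s2.upper()):
            if a == b or a not in "ACGT" or b not in "ACGT":
                continue  # identical site, gap, or ambiguity code
            if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
                ts += 1  # purine<->purine or pyrimidine<->pyrimidine
            else:
                tv += 1  # purine<->pyrimidine
    return ts, tv


# Toy alignment, invented for illustration only.
toy = {"taxonA": "ACGTACGT", "taxonB": "ACGTACGC", "taxonC": "GCGTACGA"}
freqs = base_frequencies(toy)
ts, tv = ts_tv_counts(toy)
print(freqs, ts, tv)
```

These pairwise counts are not corrected for multiple hits, so they will underestimate the true transition bias on divergent data.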
Wrestling with data formats
Throughout our research, we often need to deal with different data formats (and variants of data formats). To convert these files, I suggest ALTER. It's very useful and easy to use. Just upload your data file, select the output format and target software, and the magic begins...
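For simple cases it can also help to see what a conversion actually does. Here is a minimal sketch, assuming an aligned FASTA input and a "relaxed" sequential PHYLIP output in which each name is followed by at least two spaces. The function names are invented for this example, and strict PHYLIP programs may instead expect names padded or truncated to exactly 10 characters.

```python
def parse_fasta(text):
    """Parse FASTA text into {name: sequence}, using the first word of each header."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def to_phylip(records):
    """Write aligned sequences as relaxed sequential PHYLIP text."""
    lengths = {len(seq) for seq in records.values()}
    if len(lengths) != 1:
        raise ValueError("sequences must be aligned to the same length")
    width = max(len(name) for name in records) + 2  # name plus two spaces
    lines = [f" {len(records)} {lengths.pop()}"]  # header: taxa count, site count
    lines += [f"{name.ljust(width)}{seq}" for name, seq in records.items()]
    return "\n".join(lines) + "\n"


fasta = ">a\nACGT\n>bb some description\nAC\nGT\n"
print(to_phylip(parse_fasta(fasta)))
```

A dedicated converter is still the safer choice for interleaved files, long names, or unusual format variants.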
Selecting Models
I propose jModelTest (DNA data) and ProtTest3 (amino acid data) to select the substitution model that best fits your data. Both programs are quite easy to use, cross-platform (Java) and have nice visual interfaces. In addition, ProtTest3 has been parallelized, so you can run it in parallel on your multicore computer or on an HPC cluster (MPI, PTHREADS and a hybrid method). Both have very useful manuals, but basically we must select among all the substitution models that are feasible in the next steps of our pipeline, according to one or several criteria (maybe AIC, or AIC and BIC). To do that, we can use a BIONJ tree (fast approach) or an ML tree (slow and accurate approach).
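To make the criteria concrete: AIC = 2k - 2 lnL and BIC = k ln(n) - 2 lnL, where k is the number of free parameters and lnL the maximized log-likelihood. The sketch below ranks a few candidate models under both criteria. The log-likelihood values are invented purely for illustration, only substitution-model parameters are counted in k, and taking n to be the alignment length is an assumption (it is one common convention, not the only one).

```python
import math


def aic(lnL, k):
    """Akaike information criterion; lower is better."""
    return 2 * k - 2 * lnL


def bic(lnL, k, n):
    """Bayesian information criterion with sample size n; lower is better."""
    return k * math.log(n) - 2 * lnL


# Hypothetical results: model -> (lnL, number of free substitution parameters).
candidates = {
    "JC": (-2500.0, 0),
    "HKY": (-2450.0, 4),  # kappa + 3 free base frequencies
    "GTR": (-2448.0, 8),  # 5 free rates + 3 free base frequencies
}
n_sites = 1000  # assumed alignment length

aic_scores = {m: aic(lnL, k) for m, (lnL, k) in candidates.items()}
bic_scores = {m: bic(lnL, k, n_sites) for m, (lnL, k) in candidates.items()}
best_aic = min(aic_scores, key=aic_scores.get)
best_bic = min(bic_scores, key=bic_scores.get)
print(best_aic, best_bic)
```

With these made-up numbers the small likelihood gain of GTR over HKY does not pay for its four extra parameters under either criterion.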
For a different approach, I strongly recommend a paper by Jack Sullivan and Paul Joyce called "Model Selection in Phylogenetics". Annu. Rev. Ecol. Evol. Syst. 2005. 36:445–66. This discusses AIC, LRTs and BICs and proposes an interactive method much like that of Dave Swofford's in his lecture on models.
Making a phylogenetic tree (Parsimony)
This can be done in SeaView and PHYLIP, but PAUP is the best. Also fool around with MacClade (now free, but not available for Lion) or Mesquite to really internalize what parsimony is; both are written by the Maddison brothers.
TNT is the fastest software for approximating parsimony trees on very large datasets.
Making a phylogenetic tree (Distance)
Making a phylogenetic tree (Maximum Likelihood)
- Methods using few taxa (e.g., <15): PAUP*, PHYLIP
- Method using many taxa (e.g., hundreds): Garli
- Method using many taxa (e.g., thousands): Garli or RAxML
- Methods for whole genomes: MCL -> RAxML -> customized trimming approach
Great recommendations, although I have used PAUP and PHYLIP quite successfully for 94 and 63 taxa with 12 complete mtDNA protein-coding genes. Also, PhyML is really very good and fast. Don't be afraid of long runs.
Another great genetic algorithm program is MetaPiga by Michel C. Milinkovitch and Raphaël Helaers. Great visuals. Still slightly buggy but always being revised.
And Daniel Huson and David Bryant have a wonderful, easy-to-use program, SplitsTree, for diagnosing trees (spectral analysis) and for making various networks, including Split Decomposition and Neighbor Net. It also includes parsimony and distance methods.
Making a phylogenetic tree (Bayesian)
Calculating a molecular clock
BEAST practical writeup. Things to think about when generating a .xml file
Another program to try is HYPHY (also free) which does Global and Local (Yang and Yoder) molecular clocks and much, much more. Just Google HYPHY and phylogenetics to find the download site.
Calculating dN/dS along phylogenies
- Will need to run PAML (set up CODEML).
- Use this codeml.ctl template.
- Use model = 0 for an average across all branches.
- Use model = 1 for branch independent. (Not recommended! Too many parameters!)
- Use model = 2 for user defined branches. If there are distinct groups that you expect different omega values for, a sample treefile can be seen here. The "#1" and "#2" indicate user defined groups.
- For site independent tests, the ctl file is set up differently.
- Generate a sequence alignment with no stop codons
- Convert this sequence alignment to PHYLIP format (example of format here).
- Generate a tree of your phylogeny.
- Generate codeml.ctl file as shown in example 3. Use sequence file as “seqfile”, and tree file (example here) as “treefile”
- Run codeml in the directory with your codeml.ctl file (with the tree and seq files properly referenced).
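- Compare the different models (H0, H1, and H2) with likelihood ratio tests (LRTs) to determine which fits best. The model with the highest lnL (lowest -lnL) always fits at least as well as a simpler nested model, so test whether the improvement is significant: compare 2ΔlnL between two nested models against a χ2 distribution whose degrees of freedom equal the difference in their number of free parameters.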
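The last step can be sketched in a few lines of Python. This is a minimal illustration that uses only the standard library, so the χ2 tail probability is computed by hand for whole-number degrees of freedom; the lnL values are invented for the example and would come from your own codeml output.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square with integer df >= 1."""
    if df < 1 or int(df) != df:
        raise ValueError("df must be a positive integer")
    if x <= 0:
        return 1.0
    h = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-h)  # exact tail for df = 2
    else:
        a, q = 0.5, math.erfc(math.sqrt(h))  # exact tail for df = 1
    # Step df up by 2 with the incomplete-gamma recurrence.
    while a < df / 2 - 1e-9:
        q += h ** a * math.exp(-h) / math.gamma(a + 1)
        a += 1.0
    return q


def lrt(lnL_null, lnL_alt, df):
    """Likelihood ratio test for nested models; returns (statistic, p-value)."""
    stat = 2.0 * (lnL_alt - lnL_null)
    return stat, chi2_sf(max(stat, 0.0), df)


# Hypothetical log-likelihoods for a null and an alternative model.
stat, p = lrt(lnL_null=-1000.0, lnL_alt=-995.0, df=2)
print(f"2*delta lnL = {stat:.2f}, p = {p:.4f}")
```

A small p-value means the extra parameters of the alternative model are justified by the data. The test only applies when the null model is a special case of the alternative.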
Calculating an Ancestral sequence
Calculating variance of dN/dS in different sites.
The objective of this exercise is to use a series of LRTs to test for sites evolving under positive selection in the nef gene. If you find significant evidence for positive selection, then identify the involved sites by using empirical Bayes methods.
- Use this codeml.ctl file, and try different tree models
- If you plan to run two or more models at the same time, then create a separate directory for each run and place a sequence file, control file and tree file in each one.
- As in all the previous exercises, you will need to change the control file and re-run CODEML several times. In this case you will be fitting six different codon models (M0, M1a, M2a, M3, M7 & M8) to the example dataset.
- If you are running your analyses sequentially in the same directory, then you should change the name of the main result file (via outfile= in the control file) or you will overwrite your previous results.
- Set the tree file with treefile=. I have supplied tree files pre-loaded with the ML branch lengths for each model (hence you need to set a different tree for each model). This will greatly speed up your analyses, giving you more “beer time”. See the example control file for more details about treefile names.
- Set the codon model with NSsites=.
- Fix the value of kappa at the ML estimate with kappa=. Again, this will help speed up the analysis. See the control file for the value of kappa for each model.
- For some models you will also need to set the number of categories (ncatG) in the omega distribution:
- For M3 set ncatG=3
- For M7 set ncatG=10
- For M8 set ncatG=10
- Once the analysis is complete, rename the rst file because subsequent runs will overwrite it!
- Repeat steps a. through f. for each of the six codon models listed above.
- Keep track of your results (ex4_HelpFile.pdf) by using a table like “Table E4” shown in the slides (TableE4.pdf).
- In addition, carry out the following likelihood ratio tests:
- M0 vs. M3 (4 degrees of freedom)
- M1a vs. M2a (2 degrees of freedom)
- M7 vs. M8 (2 degrees of freedom)
- Lastly, open the rst file generated when you ran model M3 (ex4_rst_HelpFile.pdf). Locate the columns of posterior probabilities for each site under the three site-categories of this model. Use these data to reproduce the plot shown in the slides.
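Editing the control file by hand six times is error-prone, so the per-model settings described above can also be generated. The following is a minimal sketch, not the workshop's own control file: the file names, the outfile naming scheme, the choice of CodonFreq = 2 and the kappa values passed in are all assumptions for illustration, and the real values should be taken from the supplied control file. It writes one codeml.ctl into a separate directory per model, so the result and rst files from different runs cannot overwrite each other.

```python
from pathlib import Path

# Site model -> (NSsites code, ncatG or None when no category count is needed).
SITE_MODELS = {
    "M0": (0, None),
    "M1a": (1, None),
    "M2a": (2, None),
    "M3": (3, 3),
    "M7": (7, 10),
    "M8": (8, 10),
}


def make_ctl(model, seqfile, treefile, kappa):
    """Return the text of a codeml control file for one site model."""
    nssites, ncatg = SITE_MODELS[model]
    settings = [
        ("seqfile", seqfile),
        ("treefile", treefile),
        ("outfile", f"{model}_results.txt"),  # hypothetical naming scheme
        ("runmode", 0),      # user-supplied tree
        ("seqtype", 1),      # codon sequences
        ("CodonFreq", 2),    # assumed here; check the supplied control file
        ("model", 0),        # one omega across branches
        ("NSsites", nssites),
        ("icode", 0),        # universal genetic code
        ("fix_kappa", 1),    # kappa fixed at the value given below
        ("kappa", kappa),
        ("fix_omega", 0),
        ("omega", 1),
    ]
    if ncatg is not None:
        settings.append(("ncatG", ncatg))
    return "".join(f"{key:>10} = {value}\n" for key, value in settings)


def write_run_dirs(base_dir, seqfile, trees, kappas):
    """Create one directory per model containing its codeml.ctl."""
    base = Path(base_dir)
    for model in SITE_MODELS:
        run_dir = base / model
        run_dir.mkdir(parents=True, exist_ok=True)
        ctl = make_ctl(model, seqfile, trees[model], kappas[model])
        (run_dir / "codeml.ctl").write_text(ctl)
    return sorted(p.name for p in base.iterdir())
```

The sequence file and the matching tree file still have to be copied into each directory before running codeml there.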