Species tree lab


30 July 2013 Woods Hole MBL Workshop in Molecular Evolution, Scott Edwards

Methods for estimating species trees

Files can be found on the wiki, on the Edwards page.

Outline:

1) BEST – Bayesian estimation of species trees (http://www.stat.osu.edu/~dkp/BEST/)

  • i) BEST format
  • ii) Priors
  • iii) Execution

2) MP-EST – maximum (pseudo)likelihood estimation of species trees (http://code.google.com/p/mp-est/)

3) Depending on time...

  • a) Using the Phybase R package (http://code.google.com/p/phybase/) to make species trees, simulate gene trees, and conduct a multilocus bootstrap
    • i) R environment of Phybase
    • ii) Defining variables (sequences, trees, OTU names, species names)
    • iii) Executing multilocus bootstrap
    • iv) Making a STAR tree
    • v) Representing a species tree as a matrix
    • vi) Simulating gene trees


To get the data files to run these analyses:

 wget https://molevol.mbl.edu/wiki/images/c/c4/Edwards_lab_files.zip
 unzip Edwards_lab_files.zip 
 cd Edwards_lab_files
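
Before starting, it can help to confirm that the input files used later in this lab are present. The following is a minimal sketch, assuming the archive unpacks Finch_BEST.nex, Maluridae_control_file.txt and Malurusphy.trees (the file names used in the sections below). The helper missing_files is hypothetical, and the demonstration runs on a scratch directory so that it is self-contained:

```shell
# Hypothetical helper: print each expected file that is absent from a directory.
# Usage: missing_files DIR FILE...
missing_files() {
    dir=$1
    shift
    for f in "$@"; do
        [ -e "$dir/$f" ] || echo "$f"
    done
    return 0
}

# Self-contained demonstration on a scratch directory with one file left out.
demo_dir=$(mktemp -d)
touch "$demo_dir/Finch_BEST.nex" "$demo_dir/Malurusphy.trees"
missing=$(missing_files "$demo_dir" Finch_BEST.nex Maluridae_control_file.txt Malurusphy.trees)
echo "missing: $missing"
rm -rf "$demo_dir"
```

Inside the lab folder itself, calling missing_files with "." as the directory should print nothing if all three files were unpacked.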



1. BEST

a) Input file format – modified MrBayes file

BEST block:

 partition Genes = 30:
 locus097,locus098,locus118,locus119,locus120,locus122,locus130,locus104,locus129,locus143,locus146,
 locus111,locus135,locus148,locus182,locus200,locus209,locusB098,locus184,locus185,locus186,locus187,
 locus188,locus192,locus193,locus195,locus198,locus199,locusB200,locus103;
 set partition=Genes;
 taxset P_acuticauda=P_acuticauda;
 taxset P_hecki=P_hecki;
 taxset P_cincta=P_cincta;
 taxset T_guttata=T_guttata;
 prset thetapr=invgamma(3,0.003) GeneMuPr=uniform(0.5,1.5) best=1;
 unlink topology=(all) brlens=(all) statefreq=(all) genemu=(all);
 mcmc ngen=5000000 samplefreq=100 nrun=2 nchain=2;
 quit;
 end;
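
The thetapr setting in the prset line places an inverse gamma prior on the population size parameter θ. As a worked example, assuming the usual shape/scale parametrization invgamma(α, β), whose mean is β/(α − 1) for α > 1, the prior used here has mean:

```latex
\[
E[\theta] = \frac{\beta}{\alpha - 1} = \frac{0.003}{3 - 1} = 0.0015
\]
```

The GeneMuPr setting, as its name suggests, appears to put a uniform prior on the interval (0.5, 1.5) on the gene-specific mutation rates; check the BEST manual for the exact interpretation.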


To run the analysis, start BEST from the shell and execute the input file at the Best> prompt:

 best
 Best> execute Finch_BEST.nex

While this analysis is running, open up another ssh window to the cluster and start on the next section.

When the analysis is done, you can type:

 Best> execute Finch_BEST.nex.sumt

You can use Tracer (on your machine) to examine the output files, which are similar to those of MrBayes: Finch_BEST.nex.run1.p, Finch_BEST.nex.run2.p, Finch_BEST.nex.sptree.con, and the *.t, *.parts, and *.tprobs files.
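
For a quick look at a .p file without leaving the cluster, a column can also be summarized on the command line. This is only a sketch. It assumes the BEST .p files follow the MrBayes layout (an [ID: ...] comment on line 1, column names on line 2, one sample per line after that). The helper p_mean and the toy file are hypothetical:

```shell
# Mean of one column of a MrBayes-style .p file after discarding burn-in.
# Assumed layout: line 1 = [ID: ...] comment, line 2 = column names,
# remaining lines = one whitespace-separated sample each.
# Usage: p_mean FILE COLUMN_NAME BURNIN_FRACTION
p_mean() {
    awk -v col="$2" -v burn="$3" '
        NR == 2 {
            for (i = 1; i <= NF; i++) if ($i == col) c = i
            if (!c) { print "column not found: " col > "/dev/stderr"; exit 1 }
            next
        }
        NR > 2 && NF { v[++n] = $c }
        END {
            start = int(n * burn) + 1      # first sample kept after burn-in
            for (i = start; i <= n; i++) { sum += v[i]; k++ }
            if (k) printf "%.2f\n", sum / k
        }' "$1"
}

# Toy file standing in for Finch_BEST.nex.run1.p (values are made up).
toy_p=$(mktemp)
printf '%s\n' '[ID: 123]' 'Gen LnL TL' \
    '0 -120.0 1.0' '100 -110.0 1.1' '200 -100.0 1.2' '300 -102.0 1.2' > "$toy_p"
mean_lnl=$(p_mean "$toy_p" LnL 0.5)
echo "mean LnL after 50% burn-in: $mean_lnl"
rm -f "$toy_p"
```

Comparing this value between run1 and run2 gives a rough first impression of whether the two runs agree; Tracer remains the better tool for judging convergence.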


2. MP-EST – maximum (pseudo)likelihood estimation of species trees

You need:

1. rooted gene trees (some missing taxa ok; right now, just one sequence per species; all gene trees must have outgroup; branch lengths not necessary)

2. control file: (“Maluridae_control_file.txt”) contains information on where the gene trees are, how the gene tree OTUs map onto species, etc.


 Malurusphy.trees
 0
 3
 18 26
 Kalkadoon_Grasswren 1 Kalkadoon_Grasswren
 Grey_Grasswren 1 Grey_Grasswren
 Carpentarian_Grasswren 1 Carpentarian_Grasswren
 Eyrean_Grasswren 1 Eyrean_Grasswren
 Black_Grasswren 1 Black_Grasswren
 Short_tailed_Grasswren 1 Short_tailed_Grasswren
 Dusky_Grasswren 1 Dusky_Grasswren
 Thick_billed_Grasswren 1 Thick_billed_Grasswren
 Lovely_Fairy_wren 1 Lovely_Fairy_wren
 Superb_Fairy_wren 1 Superb_Fairy_wren
 Red_winged_Fairy_wren 1 Red_winged_Fairy_wren
 Blue_breasted_Fairy_wren 1 Blue_breasted_Fairy_wren
 Southern_Emu_wren 1 Southern_Emu_wren
 Mallee_emu_wren 1 Mallee_emu_wren
 Broad_billed_Fairy_Wren 1 Broad_billed_Fairy_Wren
 Emperor_Fairy_Wren 1 Emperor_Fairy_Wren
 Orange_crowned_Fairy_wren 1 Orange_crowned_Fairy_wren
 Purple_crowned_Fairy_wren 1 Purple_crowned_Fairy_wren
 Red_backed_Fairy_wren 1 Red_backed_Fairy_wren
 Red_crowned_Emu_Wren 1 Red_crowned_Emu_Wren
 Splendid_Fairy_wren 1 Splendid_Fairy_wren
 Striated_Grasswren 1 Striated_Grasswren
 Variegated_Fairy_Wren 1 Variegated_Fairy_Wren
 White_shouldered_fairy_wren 1 White_shouldered_fairy_wren
 White_winged_Fairy_Wren 1 White_winged_Fairy_Wren
 White_throated_Gerygone 1 White_throated_Gerygone
 0

 (((((((Kalkadoon_Grasswren,Dusky_Grasswren),Black_Grasswren),Eyrean_Grasswren),Thick_billed_Grasswren),(Grey_Grasswren,
 (Carpentarian_Grasswren,Striated_Grasswren)),Short_tailed_Grasswren),((((Lovely_Fairy_wren,Red_winged_Fairy_wren,
 Blue_breasted_Fairy_wren,Variegated_Fairy_Wren),((((Superb_Fairy_wren,Splendid_Fairy_wren),
 (Red_backed_Fairy_wren,White_shouldered_fairy_wren)),Purple_crowned_Fairy_wren,White_winged_Fairy_Wren),Emperor_Fairy_Wren)),
 (Southern_Emu_wren,(Mallee_emu_wren,Red_crowned_Emu_Wren))),(Broad_billed_Fairy_Wren,Orange_crowned_Fairy_wren))),White_throated_Gerygone);
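
With one sequence per species, each species line in the control file above repeats the same name on both sides of a 1. Reading the three fields as species name, number of alleles, and allele name is an assumption based on that listing. Under that reading, the lines can be generated from a plain list of names rather than typed by hand. A minimal sketch, with make_mapping as a hypothetical helper:

```shell
# Hypothetical helper: turn a list of names (one per line on stdin) into
# "species 1 allele" lines, assuming one sequence per species and an allele
# named after its species.
make_mapping() {
    awk 'NF { print $1, 1, $1 }'
}

mapping=$(printf '%s\n' Kalkadoon_Grasswren Grey_Grasswren White_throated_Gerygone | make_mapping)
echo "$mapping"
```

For data sets with several sequences per species the lines would have to list every allele name, so this shortcut no longer applies.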


Make sure you are in the folder "Edwards_lab_files", then run:

 mpest Maluridae_control_file.txt
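
If the run fails, one thing worth checking is whether the number of trees in the gene tree file agrees with the control file. The sketch below assumes that the first number on the fourth non-blank line of the control file ("18 26" above) is the number of gene trees and the second the number of species (26 matches the number of species lines). The helper names and the toy files are hypothetical:

```shell
# Count Newick trees in a file: each tree ends with exactly one ';'.
count_trees() {
    tr -cd ';' < "$1" | wc -c | tr -d ' '
}

# Declared number of gene trees in a control file.
# Assumption: it is the first field of the 4th non-blank line.
declared_trees() {
    awk 'NF { n++ } n == 4 { print $1; exit }' "$1"
}

# Toy files standing in for Malurusphy.trees and Maluridae_control_file.txt.
workdir=$(mktemp -d)
printf '%s\n' '((A,B),C);' '((A,C),B);' > "$workdir/toy.trees"
printf '%s\n' 'toy.trees' '0' '3' '2 3' 'A 1 A' 'B 1 B' 'C 1 C' '0' > "$workdir/toy_control.txt"

found=$(count_trees "$workdir/toy.trees")
declared=$(declared_trees "$workdir/toy_control.txt")
if [ "$found" -eq "$declared" ]; then
    status="match"
else
    status="mismatch"
fi
echo "declared=$declared found=$found ($status)"
rm -rf "$workdir"
```

On the lab data the same two functions can be pointed at Malurusphy.trees and Maluridae_control_file.txt.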


3. Phybase, an R package for estimating, analyzing and simulating species trees

a) The multilocus bootstrap

i. call up R (type ‘R’)
ii. type:

 R> library(phybase)

Input file for DNA sequence data: same as for BEST (Nexus/MrBayes file with BEST block)

1. read in a sequence file

 R> finch.data<-read.dna.seq(file= "Finch_BEST.nex")

2. assign the DNA sequences in that file to a variable "finch_sequences"

 R> finch_sequences<-finch.data$seq

3. assign gene partitions to variable "finch_genes"

 R> finch_genes<-finch.data$gene

4. get taxa names – these are the OTUs in the gene trees

 R> finch_otus<-finch.data$name

5. bootstrap the data set

 R> bootstrap.mulgene(sequence=finch_sequences,gene=finch_genes,name=finch_otus,boot=100, outfile="finchboot.txt")

6. exit R

 R> quit()

Let’s now look at the file “finchboot.txt” by typing:

 nano finchboot.txt

Multilocus bootstrap replicates can be used with many species tree methods, such as STAR, MDC, MP-EST, and STEM.

b) Making a STAR tree

Input file: rooted or unrooted gene trees in phylip or nexus format

1. Open R and read in the trees file. The variable genetrees has 3 values: a vector of trees; species names; and TRUE or FALSE for rooted or not.

 R
 R> library(phybase)
 R> genetrees<-read.tree.string(file="Malurusphy.trees",format="phylip")

2. Extract the trees from genetrees and assign them to the variable “genetreevector”.

 R> genetreevector<-genetrees$tree

3. Get the taxon names from the first gene tree; make sure this gene tree has all taxa in it.

 R> wren_taxa_names<-species.name(genetreevector[1])

4. Assign the same names to the species tree as in the first gene tree.

 R> wren_species_names<-wren_taxa_names

5. Now link the names in the gene trees with the names in the species tree via a matrix called “species.structure”. Start with a matrix for 26 species, filled with 0s.

 R> species.structure<-matrix(0,26,26)

6. Put 1s on the diagonal; these indicate a 1-to-1 correspondence of gene and species names.

 R> diag(species.structure)<-1

7. Now make a STAR tree:

 R> star.sptree(genetreevector, speciesname=wren_species_names, taxaname=wren_taxa_names,species.structure=species.structure, outgroup="White_throated_Gerygone", method="nj")


Representing species trees as matrices and simulating gene trees will wait for another time. The Phybase manual has useful instructions for these two topics.

