Difference between revisions of "Alignment tutorial"

From MolEvol
(Exercise 5. Alignment extension)
(Exercise 4: exploring the difference in gap penalties using MUSCLE)
 
(59 intermediate revisions by 7 users not shown)
Line 1: Line 1:
 
==Expected learning outcomes==
 
==Expected learning outcomes==
The objective of this activity is to become familiar with the features of several multiple alignment and alignment visualization programs, including data input and output, basic visualization and editing functions, alignment options, and differences between nucleotide and amino acid alignments.
+
The objective of this activity is to become familiar with the features of multiple alignment and alignment visualization programs. This includes data input and output, basic visualization and editing functions, alignment options, and differences between nucleotide and amino acid alignments.
Additionally, you should be able to run analyses on the cluster and move files to and from the cluster and your local machine.
+
Most importantly, you should be able to run analyses on the cluster and move files to and from the cluster and your local machine. Transferring data between your computer and a remote machine is necessary in all following labs.
 
 
  
 
== About this lab ==  
 
== About this lab ==  
Most of the software used in the workshop requires that the input data is already aligned. So, if you want to use your own data during the workshop, you will need to go through the process of multiple sequence alignment (MSA).
+
Software used in this workshop assumes that input data is aligned. If you want to use your own sequencing data during the workshop, you will need to go through the process of multiple sequence alignment (MSA).
We focus here on gene sequences, i.e. from Sanger data or assembled NGS data.
+
We focus here on gene sequences, which can be from targeted Sanger data or assembled genomic data. Many contemporary studies use reference-based short read alignment, but at least some of the underlying theory is the same. We do not cover short read alignment, but reference-based alignment resources are provided at the end of the tutorial.
Much of the theory underlying short read assembly and alignment is the same, but analysing these large data sets is too time consuming for this tutorial.
 
  
 
== General Background ==
 
== General Background ==
Multiple sequence alignment is a crucial first step for most methods of genealogical inference.
+
Multiple sequence alignment (MSA) is a crucial first step for most methods of phylogenetic estimation or model-based inference of evolutionary processes. The goal of MSA is to introduce gaps into sequences so that columns of an aligned matrix contain character states that are homologous.
The goal of MSA (in the context of evolutionary inference) is to introduce gaps into sequences so that the columns of the aligned matrix contain character states that are homologous.
 
This statement of homology means that if two sequences both contain a residue at a position in the alignment, then their common ancestor had a corresponding residue.
 
This site was inherited by the descendant lineages (possibly with substitutions occurring) down to the present, observed sequences.
 
Note that if two sequences both have gap at a position this does '''not''' necessarily mean that their ancestor had no residue at that site - multiple deletion events are possible.
 
  
Homology is a historical property that can be inferred, but not directly observed.   
+
This statement of homology means that if two sequences both have a residue at a position in the alignment, then a residue existed in their most recent common ancestor. This site was inherited by the descendant lineages (possibly with substitutions occurring) down to the present, observed sequences. Note that if two sequences both have gap at a position, this does <b>not</b> mean that their ancestor had no residue at that site - multiple deletion events are possible.
So MSA software attempts to produce homologous alignments by scoring alternative alignments and choosing the alignment with the best score.
+
 
 +
Homology is a historical property that cannot be directly observed.  However, homology can be inferred by MSA methods. Generally, a MSA algorithm attempts to produce homologous alignments by scoring many plausible alignments and choosing one with the best score.
 
Aligning two positions that display the same nucleotide (or amino acid) improves the score.
 
Aligning two positions that display the same nucleotide (or amino acid) improves the score.
Aligning two positions that differ incurs a cost that is related to what specific type of substitution is implied.
+
Aligning two positions that are not the same decrease the score. How much of a decrease (the cost) depends on what specific substitution is implied.
For example, Leucine to Isoleucine substitutions are common and would not be penalized much.
+
For example, Leucine to Isoleucine substitutions are common and would not cost much.
 
But Cysteine to Glycine changes are rare.
 
But Cysteine to Glycine changes are rare.
So aligning a Cysteine to a Glycine results in a higher penalty to the score of the alignment.
+
Therefore, aligning a Cysteine to a Glycine results in a higher penalty to the score of the alignment.
Placing a gap in one of the sequences is penalized.
+
Placing gaps in a sequence is penalized too.
Usually introducing a new gap is penalized more strongly than extending an existing gap (the gap-extension penalty is usually less than the gap-opening penalty).
+
Introducing a new gap usually has a higher cost than extending an existing gap. For example, a gap open cost will be -5 while an extension cost -2, to reflect a scientist’s general belief that one longer indel event is more expected than two shorter indel events.
 +
 
 +
Most default scoring options for alignment programs are designed to balance expectations of getting similar residues in the same column and allowing for gaps. Try scoring the following alignments yourself, do you get the same answers?
 +
::[[File:AlignmentCost.4.png|450px|none|left|Simple Alignment Scoring Example]]
 +
 
  
 
As mentioned in lecture, pairwise alignment is analytically tractable (though slow for very long sequences).
 
As mentioned in lecture, pairwise alignment is analytically tractable (though slow for very long sequences).
Multiple sequence alignment presents new difficulties: we do not know the evolutionary tree relating the sequences.
+
Multiple sequence alignment presents new challenges:  
Even if we did know the tree, we do not get to see the sequences at both ends of each branch in the tree because we do not have data from the ancestral sequences.
+
1) we do not know the evolutionary tree relating the sequences
As a result, it is not clear how the penalty terms should be applied when we align sets of more than two species.
+
2) we do not know the states at internal nodes of the tree in the absence of data from the ancestral sequences.
 +
Challenge 1 is frequently addressed by the use of a guide tree, but challenge 2 is for most impossible to overcome and it is not clear how the penalty terms should be applied when we align sets of more than two species. Many MSA approaches focus on solving challenge 2 in a computationally efficient way.  
  
One common solution to this problem (and the solution used by the software we'll talk about in this lab) is progressive alignment.  
+
One common solution (and the solution used by the software we'll talk about in this lab) is progressive alignment.  
The general strategy is typically:
+
The general strategy is:
 
# estimate pairwise distances using pairwise alignment,
 
# estimate pairwise distances using pairwise alignment,
 
# estimate a guide tree from these distances,
 
# estimate a guide tree from these distances,
# use the guide tree to specify the order in which multiple sequence alignments are constructed. The algorithm proceeds by traversing the tree from leaves to the root. The two descendants of each internal node in the tree are aligned to each other. Thus, the guide tree serves as means of ordering a set of pairwise alignment operations.  
+
# use the guide tree to specify the order in which multiple sequence alignments are constructed. The algorithm proceeds by traversing the tree from leaves to the root. The two descendants of each internal node in the tree are aligned to each other. Thus, the guide tree serves as means of ordering a set of pairwise alignment operations.
 +
# iterate until the alignment cannot be improved based on some criterion
  
 
The initial step in the progressive stage will be a pairwise sequence alignment to produce a group of two aligned sequences. But, as we move deeper in the tree, the approach will require aligning other single sequences to the previously aligned groups or aligning two distinct groups to each other (group-to-group alignment).
 
The initial step in the progressive stage will be a pairwise sequence alignment to produce a group of two aligned sequences. But, as we move deeper in the tree, the approach will require aligning other single sequences to the previously aligned groups or aligning two distinct groups to each other (group-to-group alignment).
Line 41: Line 42:
 
The score of each alignment decision during progressive alignment is usually computed using a weighted sum-of-pairs score which produces an average cost for the position in the alignment.  
 
The score of each alignment decision during progressive alignment is usually computed using a weighted sum-of-pairs score which produces an average cost for the position in the alignment.  
 
This score reflects the costs associated with aligning each member of one descendant group to each member of the other descendant group.
 
This score reflects the costs associated with aligning each member of one descendant group to each member of the other descendant group.
Another way to score an alignment of two groups is to use a consistency-based objective function that reflects how well the positions in the resulting MSA reflect the homology statements inferred from pairwise alignments (See [http://bioinformatics.oxfordjournals.org/content/14/5/407 Notredame et al. 1998]).
+
Another way to score an alignment of two groups is to use a consistency-based objective function that reflects how well the positions in the resulting MSA reflect the homology statements inferred from pairwise alignments (Notredame et al 1998). One of the programs we will use today, MAFFT, uses progressive alignment, but then refines the MSA with consistency.
  
 
== MAFFT Background ==
 
== MAFFT Background ==
 
MAFFT gets its name from its use of Fast Fourier Transform to quickly identify some of the more obvious regions of homology.   
 
MAFFT gets its name from its use of Fast Fourier Transform to quickly identify some of the more obvious regions of homology.   
After identifying these regions slower dynamic programming approaches are used to join these segments into a full alignment.
+
After identifying these regions, slower dynamic programming approaches are used to join these segments into a full alignment.
Thus, the main advantage of the initial version of MAFFT was speed.
+
Thus, the main advantage of the initial version of MAFFT was speed. The speed of MAFFT has made it a favorite method for many phylogenomic studies and it has some alignment algorithms that are very useful for some common data types (e.g. UCEs, hybrid enrichment/capture-seq, RADseq).
 
MAFFT also uses a transformation of scoring matrices (fully described in the "Scoring system" section on pages 3061 and 3062 of [http://nar.oxfordjournals.org/content/30/14/3059.abstract Katoh et al 2002]) so that all of the pairwise scores are positive and costs of multi-position gaps can be computed quickly.
 
MAFFT also uses a transformation of scoring matrices (fully described in the "Scoring system" section on pages 3061 and 3062 of [http://nar.oxfordjournals.org/content/30/14/3059.abstract Katoh et al 2002]) so that all of the pairwise scores are positive and costs of multi-position gaps can be computed quickly.
  
Line 67: Line 68:
 
::[[File:KatohEtAl2008Fig1Acropped.png|1000px|none|left|Fig 1a Katoh and Toh 2008]]
 
::[[File:KatohEtAl2008Fig1Acropped.png|1000px|none|left|Fig 1a Katoh and Toh 2008]]
  
The iterative improvement steps mentioned in step 7 (which are used if you use the FFT-NS-2-i version of the algorithm)are depicted as:
+
The iterative improvement steps mentioned in step 7 (which are used if you use the FFT-NS-2-i version of the algorithm) are depicted as:
 
::[[File:KatohEtAl2008Fig1Bcropped.png|500px|none|left|Fig 1a Katoh and Toh 2008]]
 
::[[File:KatohEtAl2008Fig1Bcropped.png|500px|none|left|Fig 1a Katoh and Toh 2008]]
  
[http://nar.oxfordjournals.org/content/33/2/511.abstract Katoh et al. 2005] added consistency-based scoring in the iterative refinement stage.  '''G-INS-i''' is uses global alignments to construct the library of pairwise alignments used to assess consistency. '''L-INS-i''' uses local pairwise alignments with affine gaps to form the library, and '''E-INS-i''' uses local alignments with a generalized affine gap cost (which allows for an unalignable cost in addition to gap opening and gap extension costs).
+
[http://nar.oxfordjournals.org/content/33/2/511.abstract Katoh et al. 2005] added consistency-based scoring in the iterative refinement stage.  '''G-INS-i''' uses global alignments to construct the library of pairwise alignments used to assess consistency. '''L-INS-i''' uses local pairwise alignments with affine gaps to form the library, and '''E-INS-i''' uses local alignments with a generalized affine gap cost (which allows for an unalignable cost in addition to gap opening and gap extension costs).
 
The contexts in which you would want to use each of these algorithms is shown below (from Figure 3 of Katoh and Toh (2008)).  X's in the figure represent aligned residues and o's represent unalignable residues:
 
The contexts in which you would want to use each of these algorithms is shown below (from Figure 3 of Katoh and Toh (2008)).  X's in the figure represent aligned residues and o's represent unalignable residues:
 
::[[File:KatohEtAl2008Fig3.png|700px|none|left|Fig 1a Katoh and Toh 2008]]
 
::[[File:KatohEtAl2008Fig3.png|700px|none|left|Fig 1a Katoh and Toh 2008]]
 +
 +
We do encounter such cases in real biological datasets. For example L-INS-I is useful for hybrid enrichment (Breinholt et al. 2018) or UCES (Faircloth 2016).
  
 
==Muscle background==
 
==Muscle background==
  
Muscle is described in [http://nar.oxfordjournals.org/content/32/5/1792.long Edgar et al. 2004].
+
Muscle is described in [http://nar.oxfordjournals.org/content/32/5/1792.long Edgar et al. 2004].  
  
Muscle is a modified progressive alignment algorithm which has comparable accuracy to MAFFT, but faster performance.
+
Muscle is a modified progressive alignment algorithm which has comparable accuracy to MAFFT, but faster performance. Muscle has remained popular for phylogenomic studies, and there has been some evidence that Muscle is more accurate than MAFFT on amino acid alignments (Pais et al. 2014).
  
 
Broadly it consists of three stages:
 
Broadly it consists of three stages:
Line 90: Line 93:
 
[[File:F2.medium.gif|700px|none|left| EdgarF2]]
 
[[File:F2.medium.gif|700px|none|left| EdgarF2]]
  
==Alignment Extension==
+
==Getting started==
In some cases, you already have an alignment and an estimate of phylogeny for some set of sequences, but you would like to add more.
+
While a large number of alignment programs have been developed, we are going to focus on MAFFT and ‎MUSCLE. Alternatives may be more accurate on small data sets, but these programs perform well even on fairly large data sets and are thus part of many phylogenomic pipelines (e.g. Yang and Smith 2014). The MAFFT algorithms and options can be viewed [http://mafft.cbrc.jp/alignment/software/algorithms/algorithms.html here]. The MUSCLE user guide is found at [http://www.drive5.com/muscle/muscle_userguide3.8.html here]. <br />
In this case, you can use your tree estimate to guide your alignment extension instead of the iterative distance methods described above.
+
MAFFT and MUSCLE are available on the cluster. We will transfer sequences to the cluster using scp or cyberduck, run the programs as needed remotely, and then transfer back in the same way for visualization. Clusters are great for analyses, but if this is your first time using one, you will not be able to used programs that have a graphical user interface (GUI) on them.
 +
 
 +
For visualization we will use the program SeaView. Seaview is an alignment viewer, but it also allows you to estimate simple distance-based trees and invoke alignment programs. You can access help files at any time within the program by clicking on ‘Help’ in the top right corner.<br />
 +
A good alternative to SeaView is [http://www.ormbunkar.se/aliview/ AliView]. AliView is great for looking at properties of very large alignments. Tree building options in alignment viewers are not publication quality, but they can be useful if checking for contamination or homology errors.
 +
 
 +
If you have not downloaded SeaView, do so now using the following links: [https://web.archive.org/web/20141011000249/http://doua.prabi.fr/software/seaview_data/seaview4.zip Mac OS X] [https://web.archive.org/web/20141011000249/http://doua.prabi.fr/software/seaview_data/seaview4.exe Windows] [https://web.archive.org/web/20141011000249/http://doua.prabi.fr/software/seaview_data/seaview4-64.tgz Linux] .
 +
 
 +
This activity is structured to be done either by yourself or with a partner. Working with a partner is a great idea! <br />
  
We will use  two programs, [http://sco.h-its.org/exelixis/web/software/papara/index.html PaPaRa], and [https://code.google.com/p/pagan-msa/wiki/PAGAN Pagan]
+
A link to the data set for this activity can be found in [[Media:MSAlab.zip|MSAlab.zip]].  
They are both phylogeny-aware alignment procedures to add a sequence to a very large alignment, guided by a maximum likelihood tree estimate.
 
  
==Getting started==
+
We will run the alignment software on the class cluster.  
While a large number of alignment programs have been developed, we are going to focus on two of them: MAFFT and ‎MUSCLE. Although alternatives may be more acccurate on small data stes, these programs perform well even on fairly large data sets. The MAFFT algorithms and options can be viewed [http://mafft.cbrc.jp/alignment/software/algorithms/algorithms.html here]. The MUSCLE user guide is found at [http://www.drive5.com/muscle/muscle_userguide3.8.html here]. <br />
+
Login in to the cluster on the terminal.
MAFFT and MUSCLE are available on the cluster. We will transfer sequences to the cluster using scp or cyberduck, run the programs as needed remotely, and then transfer back in the same way for visualisation.
+
Get the datafiles using
  
For visualisation we will use the program SeaView, from which you can view alignments, make rudimentary trees and invoke alignment programs. You can access help files at any time within the program by clicking on ‘Help’ in the top right corner.<br />
+
    wget https://molevol.mbl.edu/images/3/3c/MSAlab.zip
A good alternative to SeaView is JalView. This program allows for observation of alignment qualities but is more restricted in tree building and certain editing functions.
 
  
You should download SeaView from [http://pbil.univ-lyon1.fr/software/seaview.html].
+
<b>wget will work on the cluster.</b> You can also use wget to download files directly to your local windows machine with gitbash. If you want to download directly to your MAC, you will need curl.
  
This activity is structured to be done either by yourself or with a partner. But working with a partner is a great idea! <br />
+
    curl -O https://molevol.mbl.edu/images/3/3c/MSAlab.zip
  
A link to the data set for this activity can be found in [[Media:MSAlab.zip|MSAlab.zip]].
+
Unzip the datafile using
Download it and make a note of where you save it on your computer.
 
Unzip it by double clicking on it.
 
  
You will also want to move a copy to your folder on the cluster using the 'scp' command or cyberduck:
+
    unzip MSAlab.zip
<br />1ped.fasta: nucleotide sequences of alcohol dehydrogenase from a variety of organisms; modified from BAliBASE. <br />
 
  
 
==Exercise 1: basic functions in SeaView==
 
==Exercise 1: basic functions in SeaView==
 +
Copy the 1ped.fasta file from the cluster to your laptop.<br />
 +
If you cannot recall the scp command, take a look at the computer lab introduction from earlier.
 +
The 1ped fasta file contains alcohol dehydrogenase nucleotide sequences from a variety of organisms; modified from BAliBASE.
 
Start SeaView.<br />
 
Start SeaView.<br />
 
Load the data set 1ped.fasta by going to ‘File’ > ‘Open’. Have a look at the data. Is it aligned?<br />
 
Load the data set 1ped.fasta by going to ‘File’ > ‘Open’. Have a look at the data. Is it aligned?<br />
 
Try some of the basic commands:<br />
 
Try some of the basic commands:<br />
To select a taxon, click on any taxon name on the left side. You can copy the sequence using the 'copy selected sequences' command in the edit menu. These can then be pasted into a text editor if needed.<br />
+
To select a taxon, click on any taxon name on the left side. You can copy the sequence using the 'copy selected sequences' command in the edit menu. These can then be pasted into a text editor (e.g. Notepad++, BBEdit) if needed.<br />
 
To select all sequences at once, on Macs you can type Command-A. On Windows or Linux, you can use Control-A.<br />
 
To select all sequences at once, on Macs you can type Command-A. On Windows or Linux, you can use Control-A.<br />
 
Explore the edit menu and observe how sequences can be reversed, complemented etc.<br />
 
Explore the edit menu and observe how sequences can be reversed, complemented etc.<br />
Close the file without saving.
+
Do not close the window, but move it aside for now.
 
  
 
==Exercise 2: comparison of two different alignment programs (MAFFT and MUSCLE) using nucleotide sequences==
 
==Exercise 2: comparison of two different alignment programs (MAFFT and MUSCLE) using nucleotide sequences==
  
Load the software we need for this exercise
+
'''Change directories into the MSAlab folder.'''
    module load bioware
+
Refer to the lab intro page if you have forgotten how to do this.  
 
We will run the alignment software on the class cluster.
 
Transfer the datafiles MSA.zip file to the cluster using scp or Cyberduck,
 
 
 
Login in to the cluster on the terminal. Unzip the datafile using
 
 
 
  unzip MSAlab.zip
 
 
 
Change directories into the MSAlab folder.
 
Refer to the lab intro page if you have forgotten how to do this. Don't forget to load the modules with 'module load bioware' before you run MAFFT/MUSCLE.
 
  
Programs such as MAFFT and MUSCLE and many others use flags to designate input options. These are usually a dash before the command, or in the case of MAFFT a double dash. Examples are given throughout this lab.
+
Programs such as MAFFT and MUSCLE and many others use <i> flags </i> to designate input options. These are usually a dash (-) before the command, or in the case of MAFFT a double dash (--). Sometimes programs will use a single dash with and abbreviation or a single letter to invoke and option as well as a set of double dashes for more verbose forms of those options. Examples are given throughout this lab and will become intuitive throughout the workshop.
  
  
Line 149: Line 146:
 
The breakdown of this command is:<br />
 
The breakdown of this command is:<br />
 
mafft calls the program MAFFT <br />
 
mafft calls the program MAFFT <br />
--retree 2 tells MAFFT to run a progressive alignment. MAFFT uses double dashes (--) for its options (called flags). You can see why it is --retree 2 on the MAFFT webpage linked above. <br />
+
--retree 2 tells MAFFT to run a progressive alignment. MAFFT uses double dashes (--) for its options. You can see why it is --retree 2 on the MAFFT webpage linked above. <br />
1ped.fasta Before the > and after all flags have been given we place the input file name.<br />
+
1ped.fasta is the input file name, place it before the > and after all flags.<br />
> mafft_dna.fasta tells MAFFT to place the output alignment in the file mafft_dna.fasta. For many programs a > designates where to place the output.<br />
+
> mafft_dna.fasta tells MAFFT to place the output alignment in the file mafft_dna.fasta. For many programs a > designates where to place the output.<br /> If the > symbol is unfamiliar to you, take a look back at the [[UNIX tutorial advanced|advanced UNIX tutorial]].<br />
 +
 
 
Once the alignment process is completed, transfer the file to your own computer (through scp or Cyberduck) and open it in SeaView as in exercise 1. A new window with the aligned data will appear.
 
Once the alignment process is completed, transfer the file to your own computer (through scp or Cyberduck) and open it in SeaView as in exercise 1. A new window with the aligned data will appear.
  
Line 164: Line 162:
 
-in 1ped.fasta instructs MUSCLE where the input file is.<br />
 
-in 1ped.fasta instructs MUSCLE where the input file is.<br />
 
-out muscle_dna.fasta instructs MUSCLE to place the alignment in the file muscle_dna.fasta.
 
-out muscle_dna.fasta instructs MUSCLE to place the alignment in the file muscle_dna.fasta.
Note that -verbose and -log are not always needed but it allows you to see what are the default options in MUSCLE.
+
Note that -verbose and -log are not always needed but it allows you to see the default options in MUSCLE.
  
Once the alignment process is completed, transfer the file to your own computer (through scp or Cyberduck) and open it in SeaView as in exercise 1. A new window with the aligned data will appear.
+
Once the MUSCLE alignment is done, transfer the aligned fasta file to your own computer (through scp or Cyberduck) and open it in SeaView. A new window with the aligned data will appear.
 
   
 
   
  
Compare the alignments resulting from A steps and B steps. Are they different?  
+
Compare the alignments resulting from MAFFT and MUSCLE. Are they different?  
 
How many columns are in each the MAFFT or the MUSCLE alignment?  
 
How many columns are in each the MAFFT or the MUSCLE alignment?  
 
What may be wrong with both? (Hint: these are protein coding genes).<br />
 
What may be wrong with both? (Hint: these are protein coding genes).<br />
 
Build 2 trees, one from each of your nucleotide alignments: <br />
 
Build 2 trees, one from each of your nucleotide alignments: <br />
Go to your aligned nucleotide sequences window (for both MAFFT and MUSCLE alignments) and click on ‘Trees’ > ‘Distance Methods’ > ‘NJ' and use a J-C distance metric. De-click the 'ignore all gap sites' and click to calculate 100 bootstraps.<br />
+
Go to your aligned nucleotide sequences window (for both MAFFT and MUSCLE alignments) and click on ‘Trees’ > ‘Distance Methods’ > ‘NJ' and use a J-C distance metric. De-select the 'ignore all gap sites', then click to calculate 100 bootstraps.<br />
 
(Note: these trees are easy for helping to evaluate your alignments, but this program should never be your tree building method).<br />
 
(Note: these trees are easy for helping to evaluate your alignments, but this program should never be your tree building method).<br />
 
Compare the trees from both the MAFFT and MUSCLE alignments. Do the topologies and/or branch lengths differ? (Hint: look up some species names to get an idea of the expected topology!)
 
Compare the trees from both the MAFFT and MUSCLE alignments. Do the topologies and/or branch lengths differ? (Hint: look up some species names to get an idea of the expected topology!)
Line 180: Line 178:
 
In this exercise we will convert the nucleotide sequences to their equivalent protein sequences and align these instead. Note that because we are running the alignment programs externally to SeaView you cannot convert back to the original DNA sequences post alignment. If run through SeaView itself this can be done.  
 
In this exercise we will convert the nucleotide sequences to their equivalent protein sequences and align these instead. Note that because we are running the alignment programs externally to SeaView you cannot convert back to the original DNA sequences post alignment. If run through SeaView itself this can be done.  
  
Find the original 1ped.fasta window.<br />
+
Return to the SeaView window with the unaligned 1ped.fasta sequences.<br />
 
Click ‘Props’ > ‘View as proteins’.<br />
 
Click ‘Props’ > ‘View as proteins’.<br />
 
Click ‘File’>’Save prot alignment” and save the file as ’1ped_aa.fasta’ with Fasta as the file format.<br />
 
Click ‘File’>’Save prot alignment” and save the file as ’1ped_aa.fasta’ with Fasta as the file format.<br />
Line 194: Line 192:
 
Build a tree using your protein alignment by selecting ‘Trees’ > ‘Distance Methods’ > ‘NJ' and use a Poisson distance metric. De-click ignore gaps and do a bootstrap test as above.
 
Build a tree using your protein alignment by selecting ‘Trees’ > ‘Distance Methods’ > ‘NJ' and use a Poisson distance metric. De-click ignore gaps and do a bootstrap test as above.
  
 
+
Using the same 1ped_aa.fasta file, employ the MAFFT automatic selection of alignment strategy by using the command:
Find the original 1ped.fasta window.<br />
 
Click ‘Props’ > ‘View as proteins’.<br />
 
Click ‘File’>’Save prot alignment” and save the file as ’1ped_aa.fasta’ with Fasta as the file format.<br />
 
Transfer this file to the class cluster into the MSAlab folder.<br />
 
Employ the MAFFT automatic selection of alignment strategy by using the command:
 
 
<pre>
 
<pre>
 
mafft --auto 1ped_aa.fasta > mafft_aa_auto.fasta
 
mafft --auto 1ped_aa.fasta > mafft_aa_auto.fasta
Line 207: Line 200:
 
Load this file into SeaView.
 
Load this file into SeaView.
  
Build a tree using your protein alignment by selecting ‘Trees’ > ‘Distance Methods’ > ‘NJ' and use a Poisson distance metric. De-click ignore gaps and do a bootstrap test as above.
+
Build a tree using your protein alignment by selecting ‘Trees’ > ‘Distance Methods’ > ‘NJ' and use a Poisson distance metric. De-select ignore gaps and use bootstrapping.
  
  
Line 228: Line 221:
 
muscle -verbose -log muscle_gap-1.log -in 1ped_aa.fasta -out muscle_aa_gap-1.fasta -gapopen -1
 
muscle -verbose -log muscle_gap-1.log -in 1ped_aa.fasta -out muscle_aa_gap-1.fasta -gapopen -1
 
</pre>
 
</pre>
A new flag has been added here (-gapopen -1) to instruct MUSCLE to set the gap opening penatly to -1.
+
The flag (-gapopen -1) instructs MUSCLE to set the gap opening penatly to -1.
 
Load the alignment into SeaView and build a tree as in exercise 3.
 
Load the alignment into SeaView and build a tree as in exercise 3.
  
Line 237: Line 230:
 
Load the alignment into SeaView and build a tree as in exercise 3.
 
Load the alignment into SeaView and build a tree as in exercise 3.
  
Compare the modified gap penalty alignments to the default one. How do they differ? Which has the most gaps?  
+
How do the modified gap penalty alignments compare to the default one? Which alignment has the most gaps? The log files from MUSCLE will tell you the gap penalty used for each alignment?
Can you guess the order of magnitude of the default gap penalty?  Can you find in the log files of MUSCLE the gap penalty used?
 
  
== Exercise 5. Alignment extension ==
+
<b>Protip</b> Try moving all three MUSCLE amino acid alignments to your laptop with scp!
 +
<pre>
 +
scp username@classServername.jbpc-np.mbl.edu:/class/username/MSAlab/muscle_aa*.fasta .
 +
</pre>
  
Schirrmeister (2011) published a phylogeny of 1000 cyanobacteria, based on a 16s RNA sequence. http://www.biomedcentral.com/1471-2148/11/45
+
== Codon Alignments ==
  
You can see the whole tree on the Open Tree of Life database: https://tree.opentreeoflife.org/curator/study/view/pg_2739
+
This is not part of the exercises, it's just for your future information.
  
We have made a tree using a subset of these
+
As you now know, it is not appropriate to align nucleotides of protein coding regions. In the exercises above, you translated the nucleotides to amino acids which you could use to infer trees. But sometimes you want to analyze nucleotides that have been aligned by codon. (Joe [[Bielawski]] will talk about some analyses that necessitate this kind of alignment.) So how do you go from an amino acid alignment back to codon-aligned nucleotides? You can use the [http://www.bork.embl.de/pal2nal/ Pal2Nal server]. You will upload your protein alignment and nucleotide sequences, and it will spit out the codon alignment. Please be aware that your nucleotides must be multiples of three (i.e. a full open reading frame).
The tree and alignment are in the MSA folder you downloaded. (example.tre and aln.phy)
 
  
You can open the tree in figtree, but you will see that it is quite large, and hard to read all the labels.
+
Another great option is to use PRANK (Löytynoja and Goldman 2005), which can use codon-aware data structures for alignment. Many people now write their own scripts for back-translation of aligned amino acids - here is a very ugly [https://github.com/gtiley/Alignment_Tools/tree/master/Codon_Alignment Perl script] I wrote when starting grad school.  
  
Lets add a novel sequence from an uncultured bacterium to this tree.
+
== Filtering ==
  
Download the sequence from NCBI in fasta format, to your laptop, then transfer it to your MSAlab folder on the cluster.
+
There are various approaches to filter poorly aligned sites out of your alignment (Gblocks, BMGE), but in some cases filtering does not improve and can even worsen
http://www.ncbi.nlm.nih.gov/nucleotide/262528049
+
tree estimation (Tan et al. 2015). Another strategy is to integrate over alignment uncertainty in a Bayesian framework (BAli-Phy; Redelings and Suchard 2005). All strategies can be justified on some basis but will come with limitations, and it is ultimately up to you to decide the most appropriate course of action for your data.
(Click on the FASTA tab and download)
 
  
In order to place this sequence into the existing phylogeny, you need to align it to the reference alignment.
 
  
Also transfer the tree file example.tre and the alignment aln.phy (this is an alignment in Phylip format).
+
== Reference-Based Alignment ==
We will use PaPaRa, which can align either short reads or full length sequences using an estimated phylogeny for the existing alignment.
 
  
Align the sequence you downloaded to the reference alignment and tree using
+
In some cases it is now easier to generate low-coverage genomic data, RADseq, or other subsamples of the genome and align to some well-assembled reference. Especially for population genetics, but also used in species-level phylogenetic studies too (e.g. Grewe et al. 2017; Figueiró et al. 2017; Rochette et al. 2019). This process is a little different than the progressive MSA discussed here, but a common short read aligner is bwa (Li and Durbin 2009). This type of reference-based alignment can then be followed by genotyping with GATK (McKenna et al. 2010), which uses the quality information of Illumina sequencers to call variants and heterozygotes. Another genotype caller with a philosophically different approach to variant calling is ANGSD (Korneliussen et al. 2014), which has many population genetic applications but not many phylogenetic methods can leverage genotype likelihoods. These topics fall outside of the scope of our workshop, but the development teams for these software have fantastic tutorials that you should be able to follow with skills developed here.
  
    papara -t example.tre -s aln.phy -q sequence.fasta -n run1
+
== References ==
  
(-t specifies is your input tree, -s is the sequence alignment, -q is your new query sequence, and -n names your output files).
+
Breinholt JW, Earl C, Lemmon AR, Lemmon EM, Xiao L, Kawahara AY. 2017. Resolving relationships among the megadiverse butterflies and moths with a novel pipeline for anchored phylogenomics. Syst Biol. 67:78-93.
Try typing only 'papara' to see the full explanation of parameters
 
  
NOTE: if this run fails and you try to run it again, you will need to delete the output files from the failed attempt.
+
DePristo M, Banks E, Poplin R, Garimella K, Maguire J, Hartl C, Philippakis A, del Angel G, Rivas MA, Hanna M, McKenna A, Fennell T, Kernytsky A, Sivachenko A, Cibulskis K, Gabriel S, Altshuler D, Daly M. 2011. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nature Genet. 43:491-498.
e.g. rm papara_log.run1
 
  
Look at the sequence alignment, papara_alignment.run1, using seaview.
+
Edgar RC. 2004. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32:1792-1797.
How does the alignment look? How is the new sequence different?
 
  
Now try downloading another bacterial sequence:
+
Faircloth BC. 2016. PHYLUCE is a software package for the analysis of conserved genomic loci. Bioinformatics. 32:786-788.
http://www.ncbi.nlm.nih.gov/nuccore/GQ340924.1
 
  
save it as sequence2.fasta
+
Figueiró HV, et al. 2017. Genome-wide signatures of complex introgression and adaptive evolution in the big cats. Sci Adv. 7:1700299.
  
Align it to your reference using:
+
Grewe F, Huang J-P, Leavitt SP, Lumbsch HT. 2017. Reference-based RADseq resolves robust relationships among closely related species of lichen-forming fungi using metagenomic DNA. Sci Rep. 7:9884
  
    papara -t example.tre -s aln.phy -q sequence2.fasta -n run2
+
Katoh K, Misawa K, Kuma K, Miyata T. 2002. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res. 30:3059-3066.
  
Look at the sequence alignment, papara_alignment.run2, using seaview.
+
Katoh K, Kuma K, Toh H, Miyata T. 2005. MAFFT version 5: improvement in accuracy of multiple sequence alignment. Nucleic Acids Res. 33:511-518.
  
How does the alignment look? Any concerns?
+
Katoh K, Toh H. 2008. Recent developments in the MAFFT multiple sequence alignment program. Brief Bioinform. 9:286-298.
  
Another option for alignment extension is [https://code.google.com/p/pagan-msa/wiki/PAGAN pagan]
+
Korneliussen TS, Albrechtsen A, Nielsen R. 2014. ANGSD: Analysis of Next Generation Sequencing Data. BMC Bioinformatics. 15:356.
  
Pagan requires input alignments in fasta rather than phylip format, but we have provided that in aln.fasta
+
Li H, Durbin R. 2009. Fast and accurate short read alignment with Burrows–Wheeler transform. 25:1754-1760.
Try typing only 'pagan' to see the full explanation of parameters
 
Compare the pagan alignments for the two sequences you downloaded
 
  
    pagan --ref-seqfile aln.fasta -t example.tre -q sequence.fasta -o pagan1
+
Löytynoja A, Goldman N. 2005. An algorithm for progressive multiple alignment of sequences with insertions. Proc Natl Acad Sci USA. 102:10557-10562.
  
    pagan --ref-seqfile aln.fasta -t example.tre -q sequence1.fasta -o pagan2
+
McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, Altshuler D, Gabriel S, Daly M, DePristo MA. 2010. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20:1297-303.
  
The alignment output files should be be named pagan1.fas and pagan2.fas.
+
Pais FS-M, de Cassia Ruy P, Oliveira G, Coimbra RS. 2014. Assessing the efficiency of multiple sequence alignment programs. Algorithms Mol Biol. 9:4.
  
Both papara and pagan can align short reads as well as full length sequences.
+
Redelings BD, Suchard MA. 2005. Joint Bayesian estimation of alignment and phylogeny. Syst Biol. 54:401-418.
  
== Codon Alignments ==
+
Rochette NC, Rivera-Colón AG, Catchen JM. 2019. Stacks 2: Analytical methods for paired-end sequencing improve RADseq-based population genomics. bioRxiv. doi: 10.1101/615385.
  
This is not part of the exercises, it's just for your future information.
+
Tan G, Muffato M, Ledergerber C, Herrero J, Goldman N, Gil M, Dessimoz C. 2015. Current methods for automated filtering of multiple sequence alignments frequently worsen single-gene phylogenetic inference. Syst Biol. 64:778-791.
 
 
As you now know, it is not appropriate to align nucleotides of protein coding regions. In the exercises above, you translated the nucleotides to amino acids which you could use to infer trees. But sometimes you want to analyze nucleotides that have been aligned by codon. (Joe [[Bielawski]] will talk about some analyses that necessitate this kind of alignment.) So how do you go from an amino acid alignment back to codon-aligned nucleotides? We suggest the [http://www.bork.embl.de/pal2nal/ Pal2Nal server]. You will upload your protein alignment and nucleotide sequences, and it will spit out the codon alignment. Please be aware that your nucleotides must be multiples of three (i.e. a full open reading frame).
 
 
 
== Filtering ==
 
  
There are various approaches to filter poorly aligned sites out of your alignment (Gblocks, BMGE), but recent work suggests that they do not improve and can even worsen
+
Yang Y, Smith SA. 2014. Orthology inference in nonmodel organisms using transcripts and low-coverage genomes: improving accuracy and matrix occupancy for phylogenetics. Mol Biol Evol. 31:3081-3092.
tree estimation.
 
Tan G, Muffato M, Ledergerber C, Herrero J, Goldman N, Gil M, Dessimoz C: Current methods for automated filtering of multiple sequence alignments frequently worsen single-gene phylogenetic inference. Syst Biol 2015:syv033.
 
http://sysbio.oxfordjournals.org/content/early/2015/05/31/sysbio.syv033.abstract
 

Latest revision as of 15:13, 2 August 2019

Expected learning outcomes

The objective of this activity is to become familiar with the features of multiple alignment and alignment visualization programs. This includes data input and output, basic visualization and editing functions, alignment options, and differences between nucleotide and amino acid alignments. Most importantly, you should be able to run analyses on the cluster and move files to and from the cluster and your local machine. Transferring data between your computer and a remote machine is necessary in all following labs.

About this lab

Software used in this workshop assumes that input data is aligned. If you want to use your own sequencing data during the workshop, you will need to go through the process of multiple sequence alignment (MSA). We focus here on gene sequences, which can be from targeted Sanger data or assembled genomic data. Many contemporary studies use reference-based short read alignment, but at least some of the underlying theory is the same. We do not cover short read alignment, but reference-based alignment resources are provided at the end of the tutorial.

General Background

Multiple sequence alignment (MSA) is a crucial first step for most methods of phylogenetic estimation or model-based inference of evolutionary processes. The goal of MSA is to introduce gaps into sequences so that columns of an aligned matrix contain character states that are homologous.

This statement of homology means that if two sequences both have a residue at a position in the alignment, then a residue existed in their most recent common ancestor. This site was inherited by the descendant lineages (possibly with substitutions occurring) down to the present, observed sequences. Note that if two sequences both have gap at a position, this does not mean that their ancestor had no residue at that site - multiple deletion events are possible.

Homology is a historical property that cannot be directly observed. However, homology can be inferred by MSA methods. Generally, a MSA algorithm attempts to produce homologous alignments by scoring many plausible alignments and choosing one with the best score. Aligning two positions that display the same nucleotide (or amino acid) improves the score. Aligning two positions that are not the same decrease the score. How much of a decrease (the cost) depends on what specific substitution is implied. For example, Leucine to Isoleucine substitutions are common and would not cost much. But Cysteine to Glycine changes are rare. Therefore, aligning a Cysteine to a Glycine results in a higher penalty to the score of the alignment. Placing gaps in a sequence is penalized too. Introducing a new gap usually has a higher cost than extending an existing gap. For example, a gap open cost will be -5 while an extension cost -2, to reflect a scientist’s general belief that one longer indel event is more expected than two shorter indel events.

Most default scoring options for alignment programs are designed to balance expectations of getting similar residues in the same column and allowing for gaps. Try scoring the following alignments yourself, do you get the same answers?

Simple Alignment Scoring Example


As mentioned in lecture, pairwise alignment is analytically tractable (though slow for very long sequences). Multiple sequence alignment presents new challenges: 1) we do not know the evolutionary tree relating the sequences 2) we do not know the states at internal nodes of the tree in the absence of data from the ancestral sequences. Challenge 1 is frequently addressed by the use of a guide tree, but challenge 2 is for most impossible to overcome and it is not clear how the penalty terms should be applied when we align sets of more than two species. Many MSA approaches focus on solving challenge 2 in a computationally efficient way.

One common solution (and the solution used by the software we'll talk about in this lab) is progressive alignment. The general strategy is:

  1. estimate pairwise distances using pairwise alignment,
  2. estimate a guide tree from these distances,
  3. use the guide tree to specify the order in which multiple sequence alignments are constructed. The algorithm proceeds by traversing the tree from leaves to the root. The two descendants of each internal node in the tree are aligned to each other. Thus, the guide tree serves as means of ordering a set of pairwise alignment operations.
  4. iterate until the alignment cannot be improved based on some criterion

The initial step in the progressive stage will be a pairwise sequence alignment to produce a group of two aligned sequences. But, as we move deeper in the tree, the approach will require aligning other single sequences to the previously aligned groups or aligning two distinct groups to each other (group-to-group alignment).

The score of each alignment decision during progressive alignment is usually computed using a weighted sum-of-pairs score which produces an average cost for the position in the alignment. This score reflects the costs associated with aligning each member of one descendant group to each member of the other descendant group. Another way to score an alignment of two groups is to use a consistency-based objective function that reflects how well the positions in the resulting MSA reflect the homology statements inferred from pairwise alignments (Notredame et al 1998). One of the programs we will use today, MAFFT, uses progressive alignment, but then refines the MSA with consistency.

MAFFT Background

MAFFT gets its name from its use of Fast Fourier Transform to quickly identify some of the more obvious regions of homology. After identifying these regions, slower dynamic programming approaches are used to join these segments into a full alignment. Thus, the main advantage of the initial version of MAFFT was speed. The speed of MAFFT has made it a favorite method for many phylogenomic studies and it has some alignment algorithms that are very useful for some common data types (e.g. UCEs, hybrid enrichment/capture-seq, RADseq). MAFFT also uses a transformation of scoring matrices (fully described in the "Scoring system" section on pages 3061 and 3062 of Katoh et al 2002) so that all of the pairwise scores are positive and costs of multi-position gaps can be computed quickly.

The steps in a MAFFT run are:

  1. calculation of a crude pairwise distance matrix based shared 6-tuples
  2. construction of a UPGMA (Unweighted Pair Group Method with Arithmetic Mean) guide tree
  3. FFT and dynamic programming used in progressive alignment with this initial guide tree
  4. improved pairwise distance matrix inferred from the alignment of step 3
  5. improved guide tree constructed from the new distances that were computed in step 4
  6. repeat of the progressive alignment algorithm (like step 3, but with the new guide tree).
  7. Then MAFFT repeats the following:
    1. break the alignment in 2 groups based on the tree
    2. realign these groups
    3. accept this alignment if it improves the score.

The first 3 steps are called FFT-NS-1. Continuing through step 6 is called FFT-NS-2. And including the iterative improvements is referred to as FFT-NS-2-i.

The following figure (from the nice review by Katoh and Toh (2008)) diagrams FFT-NS-2:

Fig 1a Katoh and Toh 2008

The iterative improvement steps mentioned in step 7 (which are used if you use the FFT-NS-2-i version of the algorithm) are depicted as:

Fig 1a Katoh and Toh 2008

Katoh et al. 2005 added consistency-based scoring in the iterative refinement stage. G-INS-i uses global alignments to construct the library of pairwise alignments used to assess consistency. L-INS-i uses local pairwise alignments with affine gaps to form the library, and E-INS-i uses local alignments with a generalized affine gap cost (which allows for an unalignable cost in addition to gap opening and gap extension costs). The contexts in which you would want to use each of these algorithms is shown below (from Figure 3 of Katoh and Toh (2008)). X's in the figure represent aligned residues and o's represent unalignable residues:

Fig 1a Katoh and Toh 2008

We do encounter such cases in real biological datasets. For example L-INS-I is useful for hybrid enrichment (Breinholt et al. 2018) or UCES (Faircloth 2016).

Muscle background

Muscle is described in Edgar et al. 2004.

Muscle is a modified progressive alignment algorithm which has comparable accuracy to MAFFT, but faster performance. Muscle has remained popular for phylogenomic studies, and there has been some evidence that Muscle is more accurate than MAFFT on amino acid alignments (Pais et al. 2014).

Broadly it consists of three stages:

Stage 1. Draft progressive alignment. The goal of the first stage is to produce a multiple alignment, emphasizing speed over accuracy, using an approximate kmer distance measure. Kmer distance between two sequences is defined by first collecting the set of k-mers (subsequences of length k) occuring in the two sequences, then measuring how different the two sets are.

Stage 2. Improved progressive. The main source of error in the draft progressive stage is the approximate kmer distance measure, which results in a suboptimal tree. MUSCLE therefore re‐estimates the tree using the Kimura distance, which is more accurate but requires an alignment.

Stage 3. Refinement. The tree is broken into subtrees, and the sub-alignments refined.

EdgarF2

Getting started

While a large number of alignment programs have been developed, we are going to focus on MAFFT and ‎MUSCLE. Alternatives may be more accurate on small data sets, but these programs perform well even on fairly large data sets and are thus part of many phylogenomic pipelines (e.g. Yang and Smith 2014). The MAFFT algorithms and options can be viewed here. The MUSCLE user guide is found at here.
MAFFT and MUSCLE are available on the cluster. We will transfer sequences to the cluster using scp or cyberduck, run the programs as needed remotely, and then transfer back in the same way for visualization. Clusters are great for analyses, but if this is your first time using one, you will not be able to used programs that have a graphical user interface (GUI) on them.

For visualization we will use the program SeaView. Seaview is an alignment viewer, but it also allows you to estimate simple distance-based trees and invoke alignment programs. You can access help files at any time within the program by clicking on ‘Help’ in the top right corner.
A good alternative to SeaView is AliView. AliView is great for looking at properties of very large alignments. Tree building options in alignment viewers are not publication quality, but they can be useful if checking for contamination or homology errors.

If you have not downloaded SeaView, do so now using the following links: Mac OS X Windows Linux .

This activity is structured to be done either by yourself or with a partner. Working with a partner is a great idea!

A link to the data set for this activity can be found in MSAlab.zip.

We will run the alignment software on the class cluster. Login in to the cluster on the terminal. Get the datafiles using

   wget https://molevol.mbl.edu/images/3/3c/MSAlab.zip

wget will work on the cluster. You can also use wget to download files directly to your local windows machine with gitbash. If you want to download directly to your MAC, you will need curl.

   curl -O https://molevol.mbl.edu/images/3/3c/MSAlab.zip

Unzip the datafile using

   unzip MSAlab.zip

Exercise 1: basic functions in SeaView

Copy the 1ped.fasta file from the cluster to your laptop.
If you cannot recall the scp command, take a look at the computer lab introduction from earlier. The 1ped fasta file contains alcohol dehydrogenase nucleotide sequences from a variety of organisms; modified from BAliBASE. Start SeaView.
Load the data set 1ped.fasta by going to ‘File’ > ‘Open’. Have a look at the data. Is it aligned?
Try some of the basic commands:
To select a taxon, click on any taxon name on the left side. You can copy the sequence using the 'copy selected sequences' command in the edit menu. These can then be pasted into a text editor (e.g. Notepad++, BBEdit) if needed.
To select all sequences at once, on Macs you can type Command-A. On Windows or Linux, you can use Control-A.
Explore the edit menu and observe how sequences can be reversed, complemented etc.
Do not close the window, but move it aside for now.

Exercise 2: comparison of two different alignment programs (MAFFT and MUSCLE) using nucleotide sequences

Change directories into the MSAlab folder. Refer to the lab intro page if you have forgotten how to do this.

Programs such as MAFFT and MUSCLE and many others use flags to designate input options. These are usually a dash (-) before the command, or in the case of MAFFT a double dash (--). Sometimes programs will use a single dash with and abbreviation or a single letter to invoke and option as well as a set of double dashes for more verbose forms of those options. Examples are given throughout this lab and will become intuitive throughout the workshop.


Run a progressive alignment in MAFFT on the cluster by using the command:

mafft --retree 2 1ped.fasta > mafft_dna.fasta

The breakdown of this command is:
mafft calls the program MAFFT
--retree 2 tells MAFFT to run a progressive alignment. MAFFT uses double dashes (--) for its options. You can see why it is --retree 2 on the MAFFT webpage linked above.
1ped.fasta is the input file name, place it before the > and after all flags.
> mafft_dna.fasta tells MAFFT to place the output alignment in the file mafft_dna.fasta. For many programs a > designates where to place the output.
If the > symbol is unfamiliar to you, take a look back at the advanced UNIX tutorial.

Once the alignment process is completed, transfer the file to your own computer (through scp or Cyberduck) and open it in SeaView as in exercise 1. A new window with the aligned data will appear.

Run a standard alignment in MUSCLE on the cluster by using the command:

muscle -verbose -log muscle_dna.log -in 1ped.fasta -out muscle_dna.fasta

The breakdown of this command is:
muscle calls the program MUSCLE
-verbose is a flag that instructs MUSCLE to output a very thorough log. This allows to see exactly what MUSCLE is doing. Note that the flags here are single dashes, not double dashes as with mafft.
-log muscle_dna.log instructs MUSCLE to place all the output except the alignment itself to the log file called muscle_dna.log. This file will then include things like the gap penalty used etc.
-in 1ped.fasta instructs MUSCLE where the input file is.
-out muscle_dna.fasta instructs MUSCLE to place the alignment in the file muscle_dna.fasta. Note that -verbose and -log are not always needed but it allows you to see the default options in MUSCLE.

Once the MUSCLE alignment is done, transfer the aligned fasta file to your own computer (through scp or Cyberduck) and open it in SeaView. A new window with the aligned data will appear.


Compare the alignments resulting from MAFFT and MUSCLE. Are they different? How many columns are in each the MAFFT or the MUSCLE alignment? What may be wrong with both? (Hint: these are protein coding genes).
Build 2 trees, one from each of your nucleotide alignments:
Go to your aligned nucleotide sequences window (for both MAFFT and MUSCLE alignments) and click on ‘Trees’ > ‘Distance Methods’ > ‘NJ' and use a J-C distance metric. De-select the 'ignore all gap sites', then click to calculate 100 bootstraps.
(Note: these trees are easy for helping to evaluate your alignments, but this program should never be your tree building method).
Compare the trees from both the MAFFT and MUSCLE alignments. Do the topologies and/or branch lengths differ? (Hint: look up some species names to get an idea of the expected topology!)

Exercise 3: comparison of two different alignment approaches in MAFFT using protein sequences

In this exercise we will convert the nucleotide sequences to their equivalent protein sequences and align these instead. Note that because we are running the alignment programs externally to SeaView you cannot convert back to the original DNA sequences post alignment. If run through SeaView itself this can be done.

Return to the SeaView window with the unaligned 1ped.fasta sequences.
Click ‘Props’ > ‘View as proteins’.
Click ‘File’>’Save prot alignment” and save the file as ’1ped_aa.fasta’ with Fasta as the file format.
Transfer this file to the class cluster into the MSAlab folder.
Run an iterative alignment in MAFFT by using the command:

mafft --maxiterate 1000 1ped_aa.fasta > mafft_aa_iter.fasta

Comparing the command to the MAFFT command in exercise 2, you notice a new option to tell MAFFT to do an iterative alignment.
--maxiterate 1000 This instructs MAFFT to run an iterative alignment with maximum 1000 cycles. Load this file into SeaView.

Build a tree using your protein alignment by selecting ‘Trees’ > ‘Distance Methods’ > ‘NJ' and use a Poisson distance metric. De-click ignore gaps and do a bootstrap test as above.

Using the same 1ped_aa.fasta file, employ the MAFFT automatic selection of alignment strategy by using the command:

mafft --auto 1ped_aa.fasta > mafft_aa_auto.fasta

You can see that now we use the flag --auto to allow MAFFT to choose the best alignment strategy.
Can you determine what strategy was employed? (hint MAFFT outputs data to the screen and the strategy should be listed near the end). Load this file into SeaView.

Build a tree using your protein alignment by selecting ‘Trees’ > ‘Distance Methods’ > ‘NJ' and use a Poisson distance metric. De-select ignore gaps and use bootstrapping.


Compare amino acid alignments and trees. Which one do you prefer? Does it make sense to align protein-coding sequences using the protein translation, or should you instead build alignments from nucleotide sequences?

Exercise 4: exploring the difference in gap penalties using MUSCLE

We will now run MUSCLE with different gap penalties to observe how this changes the alignment.

Run MUSCLE with a gap penalty of -20 using the command:

muscle -verbose -log muscle_gap-20.log -in 1ped_aa.fasta -out muscle_aa_gap-20.fasta -gapopen -20

A new flag has been added here (-gapopen -20) to instruct MUSCLE to set the gap opening penatly to -20. Load the alignment into SeaView and build a tree as in exercise 3.

Run MUSCLE with a gap penalty of -1 using the command:

muscle -verbose -log muscle_gap-1.log -in 1ped_aa.fasta -out muscle_aa_gap-1.fasta -gapopen -1

The flag (-gapopen -1) instructs MUSCLE to set the gap opening penatly to -1. Load the alignment into SeaView and build a tree as in exercise 3.

Run MUSCLE with the default gap penalty using the command:

muscle -verbose -log muscle_defGap.log -in 1ped_aa.fasta -out muscle_aa.fasta

Load the alignment into SeaView and build a tree as in exercise 3.

How do the modified gap penalty alignments compare to the default one? Which alignment has the most gaps? The log files from MUSCLE will tell you the gap penalty used for each alignment?

Protip Try moving all three MUSCLE amino acid alignments to your laptop with scp!

scp username@classServername.jbpc-np.mbl.edu:/class/username/MSAlab/muscle_aa*.fasta .

Codon Alignments

This is not part of the exercises, it's just for your future information.

As you now know, it is not appropriate to align nucleotides of protein coding regions. In the exercises above, you translated the nucleotides to amino acids which you could use to infer trees. But sometimes you want to analyze nucleotides that have been aligned by codon. (Joe Bielawski will talk about some analyses that necessitate this kind of alignment.) So how do you go from an amino acid alignment back to codon-aligned nucleotides? You can use the Pal2Nal server. You will upload your protein alignment and nucleotide sequences, and it will spit out the codon alignment. Please be aware that your nucleotides must be multiples of three (i.e. a full open reading frame).

Another great option is to use PRANK (Löytynoja and Goldman 2005), which can use codon-aware data structures for alignment. Many people now write their own scripts for back-translation of aligned amino acids - here is a very ugly Perl script I wrote when starting grad school.

Filtering

There are various approaches to filter poorly aligned sites out of your alignment (Gblocks, BMGE), but in some cases filtering does not improve and can even worsen tree estimation (Tan et al. 2015). Another strategy is to integrate over alignment uncertainty in a Bayesian framework (BAli-Phy; Redelings and Suchard 2005). All strategies can be justified on some basis but will come with limitations, and it is ultimately up to you to decide the most appropriate course of action for your data.


Reference-Based Alignment

In some cases it is now easier to generate low-coverage genomic data, RADseq, or other subsamples of the genome and align to some well-assembled reference. Especially for population genetics, but also used in species-level phylogenetic studies too (e.g. Grewe et al. 2017; Figueiró et al. 2017; Rochette et al. 2019). This process is a little different than the progressive MSA discussed here, but a common short read aligner is bwa (Li and Durbin 2009). This type of reference-based alignment can then be followed by genotyping with GATK (McKenna et al. 2010), which uses the quality information of Illumina sequencers to call variants and heterozygotes. Another genotype caller with a philosophically different approach to variant calling is ANGSD (Korneliussen et al. 2014), which has many population genetic applications but not many phylogenetic methods can leverage genotype likelihoods. These topics fall outside of the scope of our workshop, but the development teams for these software have fantastic tutorials that you should be able to follow with skills developed here.

References

Breinholt JW, Earl C, Lemmon AR, Lemmon EM, Xiao L, Kawahara AY. 2017. Resolving relationships among the megadiverse butterflies and moths with a novel pipeline for anchored phylogenomics. Syst Biol. 67:78-93.

DePristo M, Banks E, Poplin R, Garimella K, Maguire J, Hartl C, Philippakis A, del Angel G, Rivas MA, Hanna M, McKenna A, Fennell T, Kernytsky A, Sivachenko A, Cibulskis K, Gabriel S, Altshuler D, Daly M. 2011. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nature Genet. 43:491-498.

Edgar RC. 2004. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32:1792-1797.

Faircloth BC. 2016. PHYLUCE is a software package for the analysis of conserved genomic loci. Bioinformatics. 32:786-788.

Figueiró HV, et al. 2017. Genome-wide signatures of complex introgression and adaptive evolution in the big cats. Sci Adv. 7:1700299.

Grewe F, Huang J-P, Leavitt SP, Lumbsch HT. 2017. Reference-based RADseq resolves robust relationships among closely related species of lichen-forming fungi using metagenomic DNA. Sci Rep. 7:9884

Katoh K, Misawa K, Kuma K, Miyata T. 2002. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res. 30:3059-3066.

Katoh K, Kuma K, Toh H, Miyata T. 2005. MAFFT version 5: improvement in accuracy of multiple sequence alignment. Nucleic Acids Res. 33:511-518.

Katoh K, Toh H. 2008. Recent developments in the MAFFT multiple sequence alignment program. Brief Bioinform. 9:286-298.

Korneliussen TS, Albrechtsen A, Nielsen R. 2014. ANGSD: Analysis of Next Generation Sequencing Data. BMC Bioinformatics. 15:356.

Li H, Durbin R. 2009. Fast and accurate short read alignment with Burrows–Wheeler transform. 25:1754-1760.

Löytynoja A, Goldman N. 2005. An algorithm for progressive multiple alignment of sequences with insertions. Proc Natl Acad Sci USA. 102:10557-10562.

McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, Altshuler D, Gabriel S, Daly M, DePristo MA. 2010. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20:1297-303.

Pais FS-M, de Cassia Ruy P, Oliveira G, Coimbra RS. 2014. Assessing the efficiency of multiple sequence alignment programs. Algorithms Mol Biol. 9:4.

Redelings BD, Suchard MA. 2005. Joint Bayesian estimation of alignment and phylogeny. Syst Biol. 54:401-418.

Rochette NC, Rivera-Colón AG, Catchen JM. 2019. Stacks 2: Analytical methods for paired-end sequencing improve RADseq-based population genomics. bioRxiv. doi: 10.1101/615385.

Tan G, Muffato M, Ledergerber C, Herrero J, Goldman N, Gil M, Dessimoz C. 2015. Current methods for automated filtering of multiple sequence alignments frequently worsen single-gene phylogenetic inference. Syst Biol. 64:778-791.

Yang Y, Smith SA. 2014. Orthology inference in nonmodel organisms using transcripts and low-coverage genomes: improving accuracy and matrix occupancy for phylogenetics. Mol Biol Evol. 31:3081-3092.