MSA lab

From MolEvol
Revision as of 09:55, 23 July 2012 by Ejmctavish (talk | contribs)

About this (optional) lab

Most of the software used in the workshop requires that the input data is already aligned. So, if you want to use your own data during the workshop, you will need to go through the process of multiple sequence alignment (MSA).

Because of time constraints, the Workshop on Molecular Evolution does not have a computer lab dedicated to multiple sequence alignment this year. This lab is optional and intended for students to work through on their own time. You may want to work through this lab if you have not used MAFFT or alignment viewers before. Despite lack of a scheduled MSA lab, Mark Holder and the TAs would be happy to answer questions about this lab or discuss issues related to alignment.

General Background

Multiple sequence alignment is a crucial first step for most methods of genealogical inference. The goal of MSA (in the context of evolutionary inference) is to introduce gaps into sequences so that the columns of the aligned matrix contain character states that are homologous. This statement of homology means that if two sequences both contain a residue at a position in the alignment, then their common ancestor had a corresponding residue. This site was inherited by the descendant lineages (possibly with substitutions occurring) down to the present, observed sequences. Note that if two sequences both have gap at a position this does not necessarily mean that their ancestor had no residue at that site - multiple deletion events are possible.

Homology is a historical property that can be inferred, but not directly observed. So MSA software attempts to produce homologous alignments by scoring alternative alignments and choosing the alignment with the best score. Aligning two positions that display the same nucleotide (or amino acid) improves the score. Aligning two positions that differ incurs a cost that is related to what specific type of substitution is implied. For example, Leucine to Isoleucine substitutions are common and would not be penalized much. But Cysteine to Glycine changes are rare. So aligning a Cysteine to a Glycine results in a higher penalty to the score of the alignment. Placing a gap in one of the sequences is penalized. Usually introducing a new gap is penalize more strongly than extending an existing gap (the gap-extension penalty is usually less than the gap-opening penalty).

As mentioned in lecture, pairwise alignment is analytically tractable (though slow for very long sequences). Multiple sequence alignment presents new difficulties: we do not know the evolutionary tree relating the sequences. Even if we did know the tree, we do not get to see the sequences at both ends of each branch in the tree because we do not have data from the ancestral sequences. As a result, it is not clear how the penalty terms should be applied when we align sets of more than two species.

One common solution to this problem (and the solution used by the software we'll talk about in this lab) is progressive alignment. The general strategy is typically:

  1. estimate pairwise distances using pairwise alignment,
  2. estimate a guide tree from these distances,
  3. use the guide tree to specify the order in which multiple sequence alignments are constructed. The algorithm proceeds by traversing the tree from leaves to the root. The two descendants of each internal node in the tree are aligned to each other. Thus, the guide tree serves as means of ordering a set of pairwise alignment operations.

The initial step in the progressive stage will be a pairwise sequence alignment to produce a group of two aligned sequences. But, as we move deeper in the tree, the approach will require aligning other single sequences to the previously aligned groups or aligning two distinct groups to each other (group-to-group alignment).

The score of each alignment decision during progressive alignment is usually computed using a weighted sum-of-pairs score which produces an average cost for the position in the alignment. This score reflects the costs associated with aligning each member of one descendant group to each member of the other descendant group. Another way to score an alignment of two groups is to use a consistency-based objective function that reflects how well the positions in the resulting MSA reflect the homology statements inferred from pairwise alignments (See Notredame et al. 1998).

MAFFT Background

MAFFT gets its name from its use of Fast Fourier Transform to quickly identify some of the more obvious regions of homology. After identifying these region slower dynamic programming approaches are used to join these segments into a full alignment. Thus, the main advantage of the initial version of MAFFT was speed. MAFFT also uses a transformation of scoring matrices (fully described in the "Scoring system" section on pages 3061 and 3062 of Katoh et al 2002) so that all of the pairwise scores are positive and costs of multi-position gaps can be computed quickly.

The steps in a MAFFT run are:

  1. calculation of a crude pairwise distance matrix based shared 6-tuples
  2. construction of a UPGMA guide tree
  3. FFT and dynamic programming used in progressive alignment with this initial guide tree
  4. improved pairwise distance matrix inferred from the alignment of step 3
  5. improved guide tree constructed from the new distances that were computed in step 4
  6. repeat of the progressive alignment algorithm (like step 3, but with the new guide tree).
  7. Then MAFFT repeats the following:
    1. break the alignment in 2 groups based on the tree
    2. realign these groups
    3. accept this alignment if it improves the score.

The first 3 steps are called FFT-NS-1. Continuing through step 6 is called FFT-NS-2. And including the iterative improvements is referred to as FFT-NS-2-i.

The following figure (from the nice review by Katoh and Toh (2008)) diagrams FFT-NS-2:

Fig 1a Katoh and Toh 2008

The iterative improvement steps mentioned in step 7 (which are used if you use the FFT-NS-2-i version of the algorithm)are depicted as:

Fig 1a Katoh and Toh 2008

Katoh et al. 2005 added consistency-based scoring in the iterative refinement stage. G-INS-i is uses global alignments to construct the library of pairwise alignments used to assess consistency. L-INS-i uses local pairwise alignments with affine gaps to form the library, and E-INS-i uses local alignments with a generalized affine gap cost (which allows for an unalignable cost in addition to gap opening and gap extension costs). The contexts in which you would want to use each of these algorithms is shown below (from Figure 3 of Katoh and Toh (2008)). X's in the figure represent aligned residues and o's represent unalignable residues:

Fig 1a Katoh and Toh 2008

Lab Activity

Getting unaligned sequences from NCBI

  1. Go to the NCBI website and select Nucleotide from the Search drop-down menu. This will let us conduct a search against the NCBI's nucleotide sequence database.
  2. Type Athroismeae [organism] AND internal transcribed spacer in the search box. This narrows the search to the plant tribe Athroismeae (Asteraceae), and those taxa for which there are ITS sequences in the database.
  3. Once the summaries of the sequences appear on the screen, use the Send To link in the upper right hand part of the page to save the sequences to a file. When you click on the File you should see a Format drop-down menu appear. Choose FASTA and the Create File
  4. A file called sequences.fasta should be downloaded by your browser. Move the file to the directory in which you are going to conduct alignments.
  5. Open the terminal window and use the cd command to change your working directory to be the directory containing sequences.fasta (Consult Computer_lab_introduction if you have questions about working from the terminal).
  6. If you are interested in the running times for different MAFFT variants, then you will probably want a larger data file to use. You can download a dataset of mitochondrial 12S rRNA sequences for 80 frogs from

MAFFT basics

  1. If you do not have a recent version of MAFFT, download it from here (the most recent version is 6.857 as of July, 2011).
  2. Make sure that MAFFT is on your PATH by running
     mafft -h
    MAFFT should complain about not being able to open a file called -h but then it should show you its version and some basic usage info. Consult Computer_lab_introduction or ask an instructor if this command is failing for you.
  3. Not all of the Athroismeae sequences are full length, so a local alignment may give the most appropriate results. You can find a nice description of the arguments that you have to give MAFFT in order to perform each algorithmic variant here. If we want to run the slowest (and usually most accurate) settings for local alignment then we would use
    mafft --maxiterate 1000 --localpair sequence.fasta > AthroLINSi.fasta
    Note that sequence.fasta is the input file. This is an argument to MAFFT. The --localpair option is telling MAFFT to construct a library of local pairwise alignments for consistency scoring during the iterative refinement stage. The --maxiterate 1000 is a pair of arguments that are read together by MAFFT; they are telling MAFFT to perform 1000 iteration of improvement on the alignment. Using the > character is referred to as output redirection. This is actually not an argument to MAFFT. It is syntax that your shell understands to mean "take all of the output from the program that I am running and put it into a file" the word after the > character specifies the filename. Be very careful with output redirection. If you specify that the output should be directed to a file that already exists, then it will overwrite the contents of that file without any warning.
  4. Try a faster run of MAFFT by not using any iterative improvement:
mafft --retree 1 --maxiterate 0 limnonectes.fasta > AthroFFT-NS-1.fasta

This is the FFT-NS-1 algorithm in which the guide tree is not rebuilt (--retree 2 would have told it to build the guide tree twice, so you would specify those arguments if you want to run FFT-NS-2) . You may not notice a difference in speed on the small Athroismeae dataset, but the difference is quite noticeable for the frog dataset.

  1. To see the difference in speed cause by using FFT to find homologous segments, you can add the --nofft argument to the previous invocation (make sure to change the name of the file where you are redirecting the output!). Using the full dynamic programming approach is noticeably slower on the limnonectes.fasta dataset.

Looking at the alignments

There are many tools available for navigating and editing alignments (see #Alignment Editors section below). SeaView is a nice, cross-platform option.

  1. Download the appropriate version of SeaView for your platform
  2. Launch it and use the menus to select File>>Open and then click on the output from one of your MAFFT runs.
  3. Try out some basic alignment editing to get a feel for how seaview works.

Selection of sequences

  • To select a taxon, click on that taxon name in the left panel.
  • To select multiple taxa, drag through a range of them.
  • To de-select all sequences at once, shift-click on the left panel.
  • Use the < and > keys to move your view frame 50 characters left or right.
  • To change the order of the sequences:
    • Select a group of sequences (by clicking and/or dragging in the left column)
    • Hold down the control key and click on the left column in the spot where you would like these sequences to appear.

Editing of alignments

  • Click on the right panel in a spot that you would like to edit. Press the spacebar to insert a gap. If you clicked an row corresponding to a selected sequence, then a gap will be inserted in all of the currently selected sequences. If you clicked in an unselected sequence, then the gaps will only be introduced in the sequence with the cursor.
  • The backspace/delete key will remove gaps to the left of the cursor.
  • To insert a residue, you will need to turn on the option Props > allow seq. editing. After doing this you can add sequence to a single row.
  • The + key will add gaps to every sequence other than the selected sequences.
  • The _ (underscore) key will shift all of the non selected sequences to the left. This deletes sequence in the previous column (be careful if you have "allow seq. editing" turned on then you can delete residues as well as gaps!).
  • To edit a range of columns, type a numerical value. You should see the number appear in the lower left after "mult=" (e.g., "mult=100" if you type 100). Clicking the spacebar will then cause SeaView to respond as if you hit space that number of times (e.g. insert 100 gaps in the case of 100space).

Adding another alignment tool to SeaView

If you look in the Align>>Alignment options you will see that SeaView is packaged with Clustal and Muscle as alignment tools. But you can also add other aligners. In this exercise, we'll add MAFFT with the --auto option to SeaView.

  1. Choose Align>>Alignment options>>Add external method. This will bring up a dialog box titled "alignment method creation."
  2. Click on the "Select external program" button and then use the file browser to select your executable version of MAFFT. On Mac it is possible that MAFFT will be installed in a system path (such as /usr/local/bin/mafft) and this path will not be accessible from SeaView's file browser. If this happens you may need to make a symbolic link from MAFFT to a location that is in under your home directory or copy the full path to mafft into the name edit-box in SeaView. You can find the full path to mafft with the following command in a UNIX terminal:
    which mafft
    If MAFFT is at /usr/local/bin/mafft then you can execute the following command from a Terminal to create such a link called mafft in your shell's current working directory:
    ln -s /usr/local/bin/mafft ./mafft
  3. After selecting the MAFFT executable, you need to tell SeaView how to invoke MAFFT by filling in the Arguments edit box with the arguments that MAFFT needs. The catch is that SeaView will run MAFFT in a hidden directory under a temporary name. So we need to fill in placeholders in the arguments. Specifally we will use %f to stand for "whatever filename prefix SeaView chooses for the unaligned sequences". So you will need to specify:
    --auto %f.pir > %f.out 
    to invoke MAFFT with the --auto argument (which leaves the running parameters up to MAFFT's best judgement).
  4. Click the OK button to finish registering MAFFT as an aligner in SeaView.
  5. Now you should be able to change the default aligner to MAFFT using the menu item: Align>>Alignment options>>mafft
  6. So that you can verify that SeaView updates the alignment when told to do so, add several extraneous gaps to one of the sequences.
  7. To realign the whole set of sequences use: Align>>Align all This should cause MAFFT to be run in a new menu. When MAFFT is finished, the alignments displayed in SeaView's main window should be updated (the extra gaps that you added should now be gone.

SeaView also allows you to select specific regions (using the Sites>> Create Sets) and then realign only portions of the alignment.

Exporting files from SeaView

If you select File>>Save As... menu item, then you will get a dialog box that will let you save the sequences to your filesystem. You can save as NEXUS, thus SeaView can be used as a file conversion tool.

Running prankster

PRANK Löytynoja and Goldman 2005 uses hidden Markov models to perform progressive alignment and corrects the tendency of many progressive alignment tools to "overalign" (or "compress") alignments by putting independent insertion events in the same column. PRANK can now be run through a graphical user interface.

  1. Download PRANKster from the PRANK website
  2. Try running the toy.fasta dataset discussed in lecture. The menu items you'll need to invoke are:
    1. File>>Open file
    2. Alignment>>Make guide tree
    3. Alignment>>Make alignment
  3. Now try modifying the model used by PRANK to disallow multiple insertions in the same column: Choose Settings>>Model and the click the +F (force) option to "keep gaps open"
    1. Realign the sequences under this new model.

Do you see the difference between the alignments? Ask an instructor if you don't understand why the weighted sum-of-pairs scoring function (used by most progressive alignment software) would prefer the first alignment, or why PRANK prefers the second alignment.

Using Guidance to find regions of alignment uncertainty

The GUIDANCE website will produce alignments using Prank or MAFFT. The GUIDANCE algorithm (Penn et al 2010) will also produce many alignments using bootstrapped guide trees. Then it will compare the alternative alignments and report column scores (the average of the sum-of-pairs score of all of the perturbed alignments). GUIDANCE will show you the alignment with the scores, and allow you to download versions of the alignment with low-scoring columns removed.

The Heads-or-Tails (HoT) assessment of alignment uncertainty (Landon and Graur, 2007) compares the alignment of the sequences to the alignment that would be obtained if the sequences were reversed. GUIDANCE web-server will also perform the HoT algorithm (see the Select Algorithm drop-down menu).

Try running the Athroismeae data through both algorithms, and see if the same sites are identified as being questionably aligned.

Excluding sites with considerable alignment uncertainty before genealogical inference will clearly throw out some information. However in many cases performing a robust analysis with fewer sources is preferable to maximizing power by including even questionable data. Note that excluding sites that are difficult to align will almost certainly bias some evolutionary inferences - for example the rate of sequence evolution from filtered data will be lower because regions that evolve quickly are more likely to be removed.

Inferring Large Trees from Unaligned Sequences

SATé uses a tree-based decomposition of sequences into smaller groups. Then it produces alignments for the sequences within these groups (using MAFFT by default). The alignments for the clusters of sequences are then merged (using Opal by default) until an alignment for the entire dataset is produced. RAxML is used to infer a new tree for the sequences, and the process is repeated.

You can find information on how to run SATé and a simple Graphical User Interface implementation of SATé here (Mark Holder is a co-author of that implementation, so if you have any questions or feedack please let him know).

Links to Multiple Sequence Alignment software

Inference of Trees + Alignments

These tools are typically much slower than tools that simply return an alignment.

  • BAli-Phy
  • AliFritz
  • SATé Toggles between alignment using divide-and-conquer, and tree searching using RAxML.

Web Services

  • EBI's MSA tools
  • GUIDANCE - help identify alignment uncertainty arising from uncertainty in the guide tree and near-optimal alignments on the same tree.

Alignment Editors

  • Mesquite Free. Cross-platform. Visualization and analysis tool for comparative biology (in particular phylogenetics)
  • Mega Free. Cross-platform (via emulation on Mac and Linux). " integrated tool for conducting automatic and manual sequence alignment, inferring phylogenetic trees, mining web-based databases, estimating rates of molecular evolution, inferring ancestral sequences, and testing evolutionary hypotheses."
  • SeaView Free. Cross-platform. "Graphical user interface for multiple sequence alignment and molecular phylogeny"
  • JalView Free. Cross-platform
  • SuiteMSA Free. Cross-platform
  • Se-Al Free. Mac only. Not under active development.
  • BioEdit Free. Windows only. Not under active development.
  • geneious Commercial. Cross-platform.

Other tools related to alignment issues

  • ERATE by Rivas and Eddy uses indel information during tree inference
  • GARLI version 2 also support some simple indel models (ask Derrick Zwickl for a link to this version).

Here is a link back to the course Schedule page.