Alignment tutorial

From MolEvol
Revision as of 15:17, 28 July 2014 by Ejmctavish (talk | contribs) (Exercise 5: trimming an alignment using BMGE)

Expected learning outcomes

The objective of this activity is to become familiar with the features of several multiple alignment and alignment visualization programs, including data input and output, basic visualization and editing functions, alignment options, and differences between nucleotide and amino acid alignments. Additionally, you should be able to run analyses on the cluster and move files to and from the cluster and your local machine.

About this lab

Most of the software used in the workshop requires that the input data is already aligned. So, if you want to use your own data during the workshop, you will need to go through the process of multiple sequence alignment (MSA).

General Background

Multiple sequence alignment is a crucial first step for most methods of genealogical inference. The goal of MSA (in the context of evolutionary inference) is to introduce gaps into sequences so that the columns of the aligned matrix contain character states that are homologous. This statement of homology means that if two sequences both contain a residue at a position in the alignment, then their common ancestor had a corresponding residue. This site was inherited by the descendant lineages (possibly with substitutions occurring) down to the present, observed sequences. Note that if two sequences both have gap at a position this does not necessarily mean that their ancestor had no residue at that site - multiple deletion events are possible.

Homology is a historical property that can be inferred, but not directly observed. So MSA software attempts to produce homologous alignments by scoring alternative alignments and choosing the alignment with the best score. Aligning two positions that display the same nucleotide (or amino acid) improves the score. Aligning two positions that differ incurs a cost that is related to what specific type of substitution is implied. For example, Leucine to Isoleucine substitutions are common and would not be penalized much. But Cysteine to Glycine changes are rare. So aligning a Cysteine to a Glycine results in a higher penalty to the score of the alignment. Placing a gap in one of the sequences is penalized. Usually introducing a new gap is penalize more strongly than extending an existing gap (the gap-extension penalty is usually less than the gap-opening penalty).

As mentioned in lecture, pairwise alignment is analytically tractable (though slow for very long sequences). Multiple sequence alignment presents new difficulties: we do not know the evolutionary tree relating the sequences. Even if we did know the tree, we do not get to see the sequences at both ends of each branch in the tree because we do not have data from the ancestral sequences. As a result, it is not clear how the penalty terms should be applied when we align sets of more than two species.

One common solution to this problem (and the solution used by the software we'll talk about in this lab) is progressive alignment. The general strategy is typically:

  1. estimate pairwise distances using pairwise alignment,
  2. estimate a guide tree from these distances,
  3. use the guide tree to specify the order in which multiple sequence alignments are constructed. The algorithm proceeds by traversing the tree from leaves to the root. The two descendants of each internal node in the tree are aligned to each other. Thus, the guide tree serves as means of ordering a set of pairwise alignment operations.

The initial step in the progressive stage will be a pairwise sequence alignment to produce a group of two aligned sequences. But, as we move deeper in the tree, the approach will require aligning other single sequences to the previously aligned groups or aligning two distinct groups to each other (group-to-group alignment).

The score of each alignment decision during progressive alignment is usually computed using a weighted sum-of-pairs score which produces an average cost for the position in the alignment. This score reflects the costs associated with aligning each member of one descendant group to each member of the other descendant group. Another way to score an alignment of two groups is to use a consistency-based objective function that reflects how well the positions in the resulting MSA reflect the homology statements inferred from pairwise alignments (See Notredame et al. 1998).

MAFFT Background

MAFFT gets its name from its use of Fast Fourier Transform to quickly identify some of the more obvious regions of homology. After identifying these region slower dynamic programming approaches are used to join these segments into a full alignment. Thus, the main advantage of the initial version of MAFFT was speed. MAFFT also uses a transformation of scoring matrices (fully described in the "Scoring system" section on pages 3061 and 3062 of Katoh et al 2002) so that all of the pairwise scores are positive and costs of multi-position gaps can be computed quickly.

The steps in a MAFFT run are:

  1. calculation of a crude pairwise distance matrix based shared 6-tuples
  2. construction of a UPGMA guide tree
  3. FFT and dynamic programming used in progressive alignment with this initial guide tree
  4. improved pairwise distance matrix inferred from the alignment of step 3
  5. improved guide tree constructed from the new distances that were computed in step 4
  6. repeat of the progressive alignment algorithm (like step 3, but with the new guide tree).
  7. Then MAFFT repeats the following:
    1. break the alignment in 2 groups based on the tree
    2. realign these groups
    3. accept this alignment if it improves the score.

The first 3 steps are called FFT-NS-1. Continuing through step 6 is called FFT-NS-2. And including the iterative improvements is referred to as FFT-NS-2-i.

The following figure (from the nice review by Katoh and Toh (2008)) diagrams FFT-NS-2:

Fig 1a Katoh and Toh 2008

The iterative improvement steps mentioned in step 7 (which are used if you use the FFT-NS-2-i version of the algorithm)are depicted as:

Fig 1a Katoh and Toh 2008

Katoh et al. 2005 added consistency-based scoring in the iterative refinement stage. G-INS-i is uses global alignments to construct the library of pairwise alignments used to assess consistency. L-INS-i uses local pairwise alignments with affine gaps to form the library, and E-INS-i uses local alignments with a generalized affine gap cost (which allows for an unalignable cost in addition to gap opening and gap extension costs). The contexts in which you would want to use each of these algorithms is shown below (from Figure 3 of Katoh and Toh (2008)). X's in the figure represent aligned residues and o's represent unalignable residues:

Fig 1a Katoh and Toh 2008

Muscle background

Muscle is described Edgar et al. 2004.

Muscle is a modified progressive alignemnt algorithm which has comparable accuracy to MAFFT, but faster performance.

Broadly it consists of three stages:

Stage 1. Draft progressive alignment. The goal of the first stage is to produce a multiple alignment, emphasizing speed over accuracy, using an approximate kmer distance measure.

Stage 2. Improved progressive. The main source of error in the draft progressive stage is the approximate kmer distance measure, which results in a suboptimal tree. MUSCLE therefore re‐estimates the tree using the Kimura distance, which is more accurate but requires an alignment.

Stage 3. Refinement. The tree is broken into subtrees, and the sub-alignments refined.


Getting started

While a large number of alignment programs have been developed, we are going to focus on two of them: MAFFT and ‎MUSCLE. Although alternatives may be more acccurate on small data stes, these programs perform well even on fairly large data sets. The MAFFT algorithms and options can be viewed here. The MUSCLE user guide is found at here.
MAFFT and MUSCLE are available on the cluster. We will transfer sequences to the cluster using scp or cyberduck, run the programs as needed remotely, and then transfer back in the same way for visualisation.

For visualisation we will use the program SeaView, from which you can view alignments, make rudimentary trees and invoke alignment programs. You can access help files at any time within the program by clicking on ‘Help’ in the top right corner.
A good alternative to SeaView is JalView. This program allows for observation of alignment qualities but is more restricted in tree building and certain editing functions.

You should download SeaView from [1].

This activity is structured to be done either by yourself or with a partner. But working with a partner is a great idea!

A link to the data set for this activity can be found in Download it and make a note of where you save it on your computer. Unzip it by double clicking on it.

You will also want to move a copy to your folder on the cluster using the 'scp' command or cyberduck:
1ped.fasta: nucleotide sequences of alcohol dehydrogenase from a variety of organisms; modified from BAliBASE.
Included in this zip file is JalView in case you wish to play around with it in your own time. This is loaded with the Java applet program on your computer.

Exercise 1: basic functions in SeaView

Start SeaView.
Load the data set 1ped.fasta by going to ‘File’ > ‘Open’. Have a look at the data. Is it aligned?
Try some of the basic commands:
To select a taxon, click on any taxon name on the left side. You can copy the sequence using the 'copy selected sequences' command in the edit menu. These can then be pasted into a text editor if needed.
To select all sequences at once, on Macs you can type Command-A. On Windows or Linux, you can use Control-A.
Explore the edit menu and observe how sequences can be reversed, complemented etc.
Close the file without saving.

Exercise 2: comparison of two different alignment programs (MAFFT and MUSCLE) using nucleotide sequences

Load the software we need for this exercise

   module load bioware

We will run the alignment software on the class cluster. Transfer the 1ped.fasta file to the cluster using scp or Cyberduck. Login in to the cluster on the terminal. Create a folder called 'alignmentLab' and move 1ped.fasta to this folder. Refer to the lab intro page if you have forgotten how to do this. Don't forget to load the modules with 'module load bioware' before you run MAFFT/MUSCLE.

Programs such as MAFFT and MUSCLE and many others use flags to designate input options. These are usually a dash before the command, or in the case of MAFFT a double dash. Examples are given throughout this lab.

If you would like to work with a partner, designate one partner A, and the other partner B.

A only:
Run a progressive alignment in MAFFT on the cluster by using the command:

mafft --retree 2 1ped.fasta > mafft_dna.fasta

The breakdown of this command is:
mafft calls the program MAFFT
--retree 2 tells MAFFT to run a progressive alignment. MAFFT uses double dashes (--) for its options (called flags). You can see why it is --retree 2 on the MAFFT webpage linked above.
1ped.fasta Before the > and after all flags have been given we place the input file name.
> mafft_dna.fasta tells MAFFT to place the output alignment in the file mafft_dna.fasta. For many programs a > designates where to place the output.
Once the alignment process is completed, transfer the file to your own computer (through scp or Cyberduck) and open it in SeaView as in exercise 1. A new window with the aligned data will appear.

B only:
Run a standard alignment in MUSCLE on the cluster by using the command:

muscle -verbose -log muscle_dna.log -in 1ped.fasta -out muscle_dna.fasta

The breakdown of this command is:
muscle calls the program MUSCLE
-verbose is a flag that instructs MUSCLE to output a very thorough log. This allows to see exactly what MUSCLE is doing. Note that the flags here are single dashes, not double dashes as with mafft.
-log muscle_dna.log instructs MUSCLE to place all the output except the alignment itself to the log file called muscle_dna.log. This file will then include things like the gap penalty used etc.
-in 1ped.fasta instructs MUSCLE where the input file is.
-out muscle_dna.fasta instructs MUSCLE to place the alignment in the file muscle_dna.fasta. Note that -verbose and -log are not always needed but it allows you to see what are the default options in MUSCLE.

Once the alignment process is completed, transfer the file to your own computer (through scp or Cyberduck) and open it in SeaView as in exercise 1. A new window with the aligned data will appear.

Both A and B:
Compare the alignments resulting from A steps and B steps. Are they different? ow many columns are in each the MAFFT or the MUSCLE alignment? What may be wrong with both? (Hint: these are protein coding genes).
Build 2 trees, one from each of your nucleotide alignments:
Go to your aligned nucleotide sequences window (for both MAFFT and MUSCLE alignments) and click on ‘Trees’ > ‘Distance Methods’ > ‘NJ' and use a J-C distance metric. De-click the 'ignore all gap sites' and click to calculate 100 bootstraps.
(Note: these trees are great for helping to evaluate your alignments, but this program should never be your sole tree building method).
Compare the trees from both the MAFFT and MUSCLE alignments. Do the topologies and/or branch lengths differ? (Hint: look up some species names to get an idea of the expected topology!)

Exercise 3: comparison of two different alignment approaches in MAFFT using protein sequences

In this exercise we will convert the nucleotide sequences to their equivalent protein sequences and align these instead. Note that because we are running the alignment programs externally to SeaView you cannot convert back to the original DNA sequences post alignment. If run through SeaView itself this can be done.

A only:
Find the original 1ped.fasta window.
Click ‘Props’ > ‘View as proteins’.
Click ‘File’>’Save prot alignment” and save the file as ’1ped_aa.fasta’ with Fasta as the file format.
Transfer this file to the class cluster into the alignmentLab folder.
Run an iterative alignment in MAFFT by using the command:

mafft --maxiterate 1000 1ped_aa.fasta > mafft_aa_iter.fasta

Comparing the command to the MAFFT command in exercise 2, you notice a new option to tell MAFFT to do an iterative alignment.
--maxiterate 1000 This instructs MAFFT to run an iterative alignment with maximum 1000 cycles. on the command line. Load this file into SeaView.

Build a tree using your protein alignment by selecting ‘Trees’ > ‘Distance Methods’ > ‘NJ' and use a Poisson distance metric. De-click ignore gaps and do a bootstrap test as above.

B only:

Find the original 1ped.fasta window.
Click ‘Props’ > ‘View as proteins’.
Click ‘File’>’Save prot alignment” and save the file as ’1ped_aa.fasta’ with Fasta as the file format.
Transfer this file to the class cluster into the alignmentLab folder.
Employ the MAFFT automatic selection of alignment strategy by using the command:

mafft --auto 1ped_aa.fasta > mafft_aa_auto.fasta

You can see that now we use the flag --auto to allow MAFFT to choose the best alignment strategy.
Can you determine what strategy was employed? (hint MAFFT outputs data to the screen and the strategy should be listed near the end). Load this file into SeaView.

Build a tree using your protein alignment by selecting ‘Trees’ > ‘Distance Methods’ > ‘NJ' and use a Poisson distance metric. De-click ignore gaps and do a bootstrap test as above.

Both A and B:

Compare amino acid alignments and trees. Which one do you prefer? Does it make sense to align protein-coding sequences using the protein translation, or should you instead build alignments from nucleotide sequences?

Exercise 4: exploring the difference in gap penalties using MUSCLE

We will now run MUSCLE with different gap penalties to observe how this changes the alignment.

A only:
Run MUSCLE with a gap penalty of -20 using the command:

muscle -verbose -log muscle_gap-20.log -in 1ped_aa.fasta -out muscle_aa_gap-20.fasta -gapopen -20

A new flag has been added here (-gapopen -20) to instruct MUSCLE to set the gap opening penatly to -20. Load the alignment into SeaView and build a tree as in exercise 3.

B only:
Run MUSCLE with a gap penalty of -1 using the command:

muscle -verbose -log muscle_gap-1.log -in 1ped_aa.fasta -out muscle_aa_gap-1.fasta -gapopen -1

A new flag has been added here (-gapopen -1) to instruct MUSCLE to set the gap opening penatly to -1. Load the alignment into SeaView and build a tree as in exercise 3.

A or B:
Run MUSCLE with the default gap penalty using the command:

muscle -verbose -log muscle_defGap.log -in 1ped_aa.fasta -out muscle_aa.fasta

Load the alignment into SeaView and build a tree as in exercise 3.

Both A and B:
Compare the modified gap penalty alignments to the default one. Which one of the three alignments do you prefer, and why? Which has the most gaps? Can you guess the default gap penalty? Can you find in the log files of MUSCLE the gap penalty used?

Exercise 5: trimming an alignment using BMGE

Trimming of an alignment removes the ambiguous columns in order to improve later analysis. We shall use BMGE for this task. This program calculates column entropy to determine columns that are within biologically expected variation. Any columns above a given entropy threshold are removed. Full options can be seen in the BMGE user guide.

We shall trim the MUSCLE amino acid alignment (muscle_aa.fasta) using BMGE with a conserved entropy score of 0.7 and use the BLOSUM62 matrix to determine substitution weights.
BMGE is supplied as a jar file which is run using the programming language java. This is done using the command

java -jar /bioware/BMGE-1.1/BMGE.jar -i muscle_aa.fasta -ofaa muscle_trimmed.fasta -h .7 -m BLOSUM62 -t AA

There are many options in BMGE which can be viewed on their page [2].
A breakdown of the command is:
java calls the programming language interpreter for java.
-jar /bioware/BMGE-1.1/BMGE.jar tells java where the jar file containing the program BMGE is located.
-i muscle_aa.fasta instructs BMGE what the input file is.
-ofaa muscle_trimmed.fasta instructs BMGE to output a file in fasta format to muscle_trimmed.fasta and it is an amino acid alignment.
-h .7 This is where you place the entropy score you wish to cut-off at. The higher the score, the more stringent the cut-off (thus removing more variable columns. .7 appears to be a reliable cut-off but feel free to play around with this.
-m BLOSUM62 instructs BMGE on what scoring matrix to use for deciding similarity between residues. The scoring matrix will affect how stringent the cut-off is due to the how much weight is placed upon substitutions. This is explained in the BMGE user manual or ask your TA.
-t AA instructs BMGE that it is an amino acid alignment that is to be trimmed.

How many columns were removed by BMGE?

Load the resulting alignment into SeaView and build a tree as in exercise 3. Compare to the untrimmed MUSCLE alignment. Has the tree changed? Have the branch lengths or bootstrap support changed?