BLAST UNIX Tutorial

From MolEvol
Revision as of 15:21, 30 July 2012 by Cmeehan (talk | contribs) (Extracting sequences of hits from the database)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)

This is a quick outline of how to make a local database and run BLAST on sequence files against these databases or online ones like nr.

BLAST+

The command line (UNIX or Windows) version of BLAST is named BLAST+. You can download it from [1]. The user manual for BLAST+ is [2] and contains instructions on installation of the command line tools. This covers all of the functionality of blast+. Here, only a sunset of features and basic usage will be covered.

Basic use

Blast+ will do all the regular searches: blastn, blastp, blastx, tblastx, tblastn, and less used but still relevant searches psiblast, rpsblast, and rpstblastn.
Each tool is run by typing the search name into your command line followed by certain options (called flags).
For example if I wanted to run blastx on a file called seqs.fasta, output the result to a file output.blast.txt and keep all the default settings I would navigate to the folder where seqs.fasta is stored and type

blastx -query seqs.fasta -out output.blast.txt

This is considering that blast+ was placed into my path (which should have occurred during installation. If not you would type the full path of the location of the blast bin folder before blastx).
The order of these flags (such as -query and -out) do not matter.

  • NOTE: the input file can be any number of sequences. A BLAST search will be performed on each sequence individually with all their outputs concatenated one after the other in the output file. This allows you to run all the sequences of interest in one command as opposed to separating them out.

Changing parameters

Many sections of the blast search can be modified such as the e-value cut-off, output format, number of hits to retain and much more. A full list of these options is covered in the manual and an overview can be gotten by typing

blastx -help

replacing blastx with whichever tool you wish to use.

E-value

The e-value threshold will change the number of hits returned by limiting them only to those lower than a given value. The default for this is 10 which will basically give you all hits. To change this value you use the -evalue flag. For example to change the value to 1e-3 (thus only returning hits that have a value lower than this) I would use the command

blastx -query seqs.fasta -out output.blast.txt -evalue 1e-3

Output formats

The default output from the BLAST website is in html format (the coding language that makes most webpages). However this is difficult to parse and you may want other things like just the sequences returned or a tabular format of the e-values and identity scores that can be copied into your favourite spreadsheet program.
The flag for this is -outfmt followed by a number which denotes the format requested. The numbers correspond like so:

  • 0 = pairwise,
  • 1 = query-anchored showing identities,
  • 2 = query-anchored no identities,
  • 3 = flat query-anchored, show identities,
  • 4 = flat query-anchored, no identities,
  • 5 = XML Blast output,
  • 6 = tabular,
  • 7 = tabular with comment lines,
  • 8 = Text ASN.1,
  • 9 = Binary ASN.1,
  • 10 = Comma-separated values,
  • 11 = BLAST archive format (ASN.1)

The most common output formats are 0 and 6. 0 is the default and will output will give a table of each query-hit pair with a bit score and e-value followed by the pairwise alignment of each pair. A portion of example output is:

                                                                      Score     E
Sequences producing significant alignments:                          (Bits)  Value

ref|NP_068858.1|  3-hydroxyacyl-CoA dehydrogenase [Archaeoglobus ...   817    0.0  
ref|YP_003435472.1|  3-hydroxyacyl-CoA dehydrogenase NAD-binding ...   646    0.0  


>ref|NP_068858.1| 3-hydroxyacyl-CoA dehydrogenase [Archaeoglobus fulgidus DSM 4304]
 gb|AAB91209.1| 3-hydroxyacyl-CoA dehydrogenase (hbd-1) [Archaeoglobus fulgidus 
DSM 4304]
Length=661

 Score =  817 bits (2111),  Expect = 0.0, Method: Compositional matrix adjust.
 Identities = 402/402 (100%), Positives = 402/402 (100%), Gaps = 0/402 (0%)

Query  1    MMVLEIRNVAVIGAGSMGHAIAEVVAIHGFNVKLMDVSEDQLKRAMEKIEEGLRKSYERG  60
            MMVLEIRNVAVIGAGSMGHAIAEVVAIHGFNVKLMDVSEDQLKRAMEKIEEGLRKSYERG
Sbjct  1    MMVLEIRNVAVIGAGSMGHAIAEVVAIHGFNVKLMDVSEDQLKRAMEKIEEGLRKSYERG  60

Query  61   YISEDPEKVLKRIEATADLIEVAKDADLVIEAIPEIFDLKKKVFSEIEQYCPDHTIFATN  120
            YISEDPEKVLKRIEATADLIEVAKDADLVIEAIPEIFDLKKKVFSEIEQYCPDHTIFATN
Sbjct  61   YISEDPEKVLKRIEATADLIEVAKDADLVIEAIPEIFDLKKKVFSEIEQYCPDHTIFATN  120

Query  121  TSSLSITKLAEATKRPEKFIGMHFFNPPKILKLLEIVWGEKTSEETIRIVEDFARKIDRI  180
            TSSLSITKLAEATKRPEKFIGMHFFNPPKILKLLEIVWGEKTSEETIRIVEDFARKIDRI
Sbjct  121  TSSLSITKLAEATKRPEKFIGMHFFNPPKILKLLEIVWGEKTSEETIRIVEDFARKIDRI  180

Query  181  IIHVRKDVPGFIVNRIFVTMSNEASWAVEMGEGTIEEIDSAVKYRLGLPMGLFELHDVLG  240
            IIHVRKDVPGFIVNRIFVTMSNEASWAVEMGEGTIEEIDSAVKYRLGLPMGLFELHDVLG
Sbjct  181  IIHVRKDVPGFIVNRIFVTMSNEASWAVEMGEGTIEEIDSAVKYRLGLPMGLFELHDVLG  240

Query  241  GGSVDVSYHVLEYYRQTLGESYRPSPLFERLFKAGHYGKKTGKGFYDWSEGKTNEVPLRA  300
            GGSVDVSYHVLEYYRQTLGESYRPSPLFERLFKAGHYGKKTGKGFYDWSEGKTNEVPLRA
Sbjct  241  GGSVDVSYHVLEYYRQTLGESYRPSPLFERLFKAGHYGKKTGKGFYDWSEGKTNEVPLRA  300

Query  301  GANFDLLRLVAPAVNEAAWLIEKGVASAEEIDLAVLHGLNYPRGLLRMADDFGIDSIVKK  360
            GANFDLLRLVAPAVNEAAWLIEKGVASAEEIDLAVLHGLNYPRGLLRMADDFGIDSIVKK
Sbjct  301  GANFDLLRLVAPAVNEAAWLIEKGVASAEEIDLAVLHGLNYPRGLLRMADDFGIDSIVKK  360

Query  361  LNELYEKYNGEERYKVNPVLQKMVEEGKLGRTTGEGFYKYGD  402
            LNELYEKYNGEERYKVNPVLQKMVEEGKLGRTTGEGFYKYGD
Sbjct  361  LNELYEKYNGEERYKVNPVLQKMVEEGKLGRTTGEGFYKYGD  402

Other information such as lamda and k of the run along with the input options used are given

The second option commonly used is format 6 which is the tabular format. This will give you a detailed table of the hits and the different scores associated with each query-hit pair. The output of this format by default gives you <blah> number of columns. The contents of each are:

Query id
Subject id
% identity
alignment length
mismatches
gap
openings
q. start
q. end
s. start
s. end
e-value
bit score

For example the same output as above with -outfmt 0 would be with -outfmt 6:

AF0017_1        gi|11497638|ref|NP_068858.1|    100.00  402     0       0       1       402     1       402     0.0      817
AF0017_1        gi|288931412|ref|YP_003435472.1|        75.13   398     98      1       4       401     1       397     0.0      646

These can be modified with other outputs as outlined in the blastx -help output or whichever program you are using.

Changing the database to be searched

For blast+ runs you can use local databases or remote (on ncbi website) databases.
For this you use the -db command. If you have created a database locally (covered below) and it is in the current working directory you would use the command

blastx -query seqs.fasta -out output.blast.txt -db database

where database is the name given to the database.
If this database is stored elsewhere you would give the absolute or relative path to the database. For example if I have stored it in my home directory I would use

blastx -query seqs.fasta -out output.blast.txt -db ~/database

or

blastx -query seqs.fasta -out output.blast.txt -db /Users/username/database

To use remote databases (such as nr) you add the -remote command along with the -db command. For example:

blastx -query seqs.fasta -out output.blast.txt -db nr -remote

Making a local BLAST database

Often you will not want to search all of nr or remote databases due to time issues or to restrict your results to a certain set of organisms. This can be done by creating a local database to be searched. To do this you need a fasta formatted sequence file which has the general format such as:

>header
sequence

E.g.:

>sequence1
MMVLEIRNVAVIGAGSMGHAIAEVVAIHGFNVKLMDVSEDQLKRAMEKIEEGLRKSYERGYISEDPEKVLKRIEATADLI
EVAKDADLVIEAIPEIFDLKKKVFSEIEQYCPDHTIFATNTSSLSITKLAEATKRPEKFIGMHFFNPPKILKLLEIVWGE
KTSEETIRIVEDFARKIDRIIIHVRKDVPGFIVNRIFVTMSNEASWAVEMGEGTIEEIDSAVKYRLGLPMGLFELHDVLG
GGSVDVSYHVLEYYRQTLGESYRPSPLFERLFKAGHYGKKTGKGFYDWSEGKTNEVPLRAGANFDLLRLVAPAVNEAAWL
IEKGVASAEEIDLAVLHGLNYPRGLLRMADDFGIDSIVKKLNELYEKYNGEERYKVNPVLQKMVEEGKLGRTTGEGFYKY
GDGNYEFVKVEKEGKVGVLKLNRPRRANALNPTFLKEVEDALDLLERDEEVRAIVIAGEGKNFCAGADIAMFASGRPEMV
TEFSQLGHKVFRKIEMLSKPVIAAIHGAAVGGGFELAMACDLRVMSERAFLGLPELNLGIIPGWGGTQRLAYYVGVSKLK
EVIMLKRNIKPEEAKNLGLVAEVFPQERFWDEVMKLAREVAELPPLAVKYLKKVIALGTMPALETGNLAESEAGAVIALT
DDVAEGIQAFNYRRKPNFRGR
>sequence2
MMVLEIRNVAVIGAGSMGHAIAEVVAIHGFNVKLMDVSEDQLKRAMEKIEEGLRKSYERGYISEDPEKVLKRIEATADLI
EVAKDADLVIEAIPEIFDLKKKVFSEIEQYCPDHTIFATNTSSLSITKLAEATKRPEKFIGMHFFNPPKILKLLEIVWGE
KTSEETIRIVEDFARKIDRIIIHVRKDVPGFIVNRIFVTMSNEASWAVEMGEGTIEEIDSAVKYRLGLPMGLFELHDVLG
GGSVDVSYHVLEYYRQTLGESYRPSPLFERLFKAGHYGKKTGKGFYDWSEGKTNEVPLRAGANFDLLRLVAPAVNEAAWL
IEKGVASAEEIDLAVLHGLNYPRGLLRMADDFGIDSIVKKLNELYEKYNGEERYKVNPVLQKMVEEGKLGRTTGEGFYKY
GDGNYEFVKVEKEGKVGVLKLNRPRRANALNPTFLKEVEDALDLLERDEEVRAIVIAGEGKNFCAGADIAMFASGRPEMV
TEFSQLGHKVFRKIEMLSKPVIAAIHGAAVGGGFELAMACDLRVMSERAFLGLPELNLGIIPGWGGTQRLAYYVGVSKLK
EVIMLKRNIKPEEAKNLGLVAEVFPQERFWDEVMKLAREVAELPPLAVKYLKKVIALGTMPALETGNLAESEAGAVIALT
DDVAEGIQAFNYRRKPNFRGR

The sequence name can be whatever you want.

The basic way to make a local BLAST database is using the makeblastdb command

makeblastdb -in database.fasta -out databaseBLAST -dbtype prot -parse_seqids

where database.fasta is the file containing all the sequences to be made into a database, the sequence type is protein and the output name for the database is databaseBLAST. The -parse_seqids is needed for referencing back to this database at a later stage (such as extracting sequences of hits).This will create multiple files, all of which are required for the BLAST database to function.
A suggestion is to run this on a fasta file, place all the resulting files into a new folder and then point to the database filename within that folder in the -db portion of a BLAST run.
Further masking options (which remove regions of low complexity etc) can be applied and are covered in the BLAST+ manual linked at the top.

Extracting sequences of hits from the database

Once you have your hits you may want to extract all the sequences corresponding to those hits to a file for further analysis. This is done using the blastdbcmd command.
For this to work a few things are needed. The -parse_seqids flag must have been set during the creation of the database searched. The names of each hit is also needed. This can be extracted from any output format from a BLAST run (e.g. is in the 2nd column of a run using default outputs of -outfmt 6 option). A file of just these names can be created by using a -outfmt "6 sseqid" command in the BLAST run. NOTE This will give you every sequence name of hits but no other information. It also will not separate out the hits from different query sequences in the input so all hits for all sequences will be output with no separation.
To get the sequences you can run a command such as:

blastdbcmd -db databaseBLAST -dbtype prot -entry_batch hits.txt -outfmt %f -out hits.fasta

where databaseBLAST is the database to extract sequences from, prot is the type of data in the database, hits.txt contains the names of the sequences to extract, %f states we want fasta format output and hits.fasta is the output file that will contain all the sequences requested.