BLAST UNIX Tutorial
This is a quick outline of how to make a local database and run BLAST on sequence files against these databases or online ones like nr.
The command line (UNIX or Windows) version of BLAST is named BLAST+. You can download it from . The user manual for BLAST+ is  and contains instructions on installation of the command line tools. This covers all of the functionality of blast+. Here, only a sunset of features and basic usage will be covered.
Blast+ will do all the regular searches: blastn, blastp, blastx, tblastx, tblastn, and less used but still relevant searches psiblast, rpsblast, and rpstblastn.
Each tool is run by typing the search name into your command line followed by certain options (called flags).
For example if I wanted to run blastx on a file called seqs.fasta, output the result to a file output.blast.txt and keep all the default settings I would navigate to the folder where seqs.fasta is stored and type
blastx -query seqs.fasta -out output.blast.txt
This is considering that blast+ was placed into my path (which should have occurred during installation. If not you would type the full path of the location of the blast bin folder before blastx).
The order of these flags (such as -query and -out) do not matter.
- NOTE: the input file can be any number of sequences. A BLAST search will be performed on each sequence individually with all their outputs concatenated one after the other in the output file. This allows you to run all the sequences of interest in one command as opposed to separating them out.
Many sections of the blast search can be modified such as the e-value cut-off, output format, number of hits to retain and much more. A full list of these options is covered in the manual and an overview can be gotten by typing
replacing blastx with whichever tool you wish to use.
The e-value threshold will change the number of hits returned by limiting them only to those lower than a given value. The default for this is 10 which will basically give you all hits. To change this value you use the -evalue flag. For example to change the value to 1e-3 (thus only returning hits that have a value lower than this) I would use the command
blastx -query seqs.fasta -out output.blast.txt -evalue 1e-3
The default output from the BLAST website is in html format (the coding language that makes most webpages). However this is difficult to parse and you may want other things like just the sequences returned or a tabular format of the e-values and identity scores that can be copied into your favourite spreadsheet program.
The flag for this is -outfmt followed by a number which denotes the format requested. The numbers correspond like so:
- 0 = pairwise,
- 1 = query-anchored showing identities,
- 2 = query-anchored no identities,
- 3 = flat query-anchored, show identities,
- 4 = flat query-anchored, no identities,
- 5 = XML Blast output,
- 6 = tabular,
- 7 = tabular with comment lines,
- 8 = Text ASN.1,
- 9 = Binary ASN.1,
- 10 = Comma-separated values,
- 11 = BLAST archive format (ASN.1)
The most common output formats are 0 and 6. 0 is the default and will output will give a table of each query-hit pair with a bit score and e-value followed by the pairwise alignment of each pair. A portion of example output is:
Score E Sequences producing significant alignments: (Bits) Value ref|NP_068858.1| 3-hydroxyacyl-CoA dehydrogenase [Archaeoglobus ... 817 0.0 ref|YP_003435472.1| 3-hydroxyacyl-CoA dehydrogenase NAD-binding ... 646 0.0 >ref|NP_068858.1| 3-hydroxyacyl-CoA dehydrogenase [Archaeoglobus fulgidus DSM 4304] gb|AAB91209.1| 3-hydroxyacyl-CoA dehydrogenase (hbd-1) [Archaeoglobus fulgidus DSM 4304] Length=661 Score = 817 bits (2111), Expect = 0.0, Method: Compositional matrix adjust. Identities = 402/402 (100%), Positives = 402/402 (100%), Gaps = 0/402 (0%) Query 1 MMVLEIRNVAVIGAGSMGHAIAEVVAIHGFNVKLMDVSEDQLKRAMEKIEEGLRKSYERG 60 MMVLEIRNVAVIGAGSMGHAIAEVVAIHGFNVKLMDVSEDQLKRAMEKIEEGLRKSYERG Sbjct 1 MMVLEIRNVAVIGAGSMGHAIAEVVAIHGFNVKLMDVSEDQLKRAMEKIEEGLRKSYERG 60 Query 61 YISEDPEKVLKRIEATADLIEVAKDADLVIEAIPEIFDLKKKVFSEIEQYCPDHTIFATN 120 YISEDPEKVLKRIEATADLIEVAKDADLVIEAIPEIFDLKKKVFSEIEQYCPDHTIFATN Sbjct 61 YISEDPEKVLKRIEATADLIEVAKDADLVIEAIPEIFDLKKKVFSEIEQYCPDHTIFATN 120 Query 121 TSSLSITKLAEATKRPEKFIGMHFFNPPKILKLLEIVWGEKTSEETIRIVEDFARKIDRI 180 TSSLSITKLAEATKRPEKFIGMHFFNPPKILKLLEIVWGEKTSEETIRIVEDFARKIDRI Sbjct 121 TSSLSITKLAEATKRPEKFIGMHFFNPPKILKLLEIVWGEKTSEETIRIVEDFARKIDRI 180 Query 181 IIHVRKDVPGFIVNRIFVTMSNEASWAVEMGEGTIEEIDSAVKYRLGLPMGLFELHDVLG 240 IIHVRKDVPGFIVNRIFVTMSNEASWAVEMGEGTIEEIDSAVKYRLGLPMGLFELHDVLG Sbjct 181 IIHVRKDVPGFIVNRIFVTMSNEASWAVEMGEGTIEEIDSAVKYRLGLPMGLFELHDVLG 240 Query 241 GGSVDVSYHVLEYYRQTLGESYRPSPLFERLFKAGHYGKKTGKGFYDWSEGKTNEVPLRA 300 GGSVDVSYHVLEYYRQTLGESYRPSPLFERLFKAGHYGKKTGKGFYDWSEGKTNEVPLRA Sbjct 241 GGSVDVSYHVLEYYRQTLGESYRPSPLFERLFKAGHYGKKTGKGFYDWSEGKTNEVPLRA 300 Query 301 GANFDLLRLVAPAVNEAAWLIEKGVASAEEIDLAVLHGLNYPRGLLRMADDFGIDSIVKK 360 GANFDLLRLVAPAVNEAAWLIEKGVASAEEIDLAVLHGLNYPRGLLRMADDFGIDSIVKK Sbjct 301 GANFDLLRLVAPAVNEAAWLIEKGVASAEEIDLAVLHGLNYPRGLLRMADDFGIDSIVKK 360 Query 361 LNELYEKYNGEERYKVNPVLQKMVEEGKLGRTTGEGFYKYGD 402 LNELYEKYNGEERYKVNPVLQKMVEEGKLGRTTGEGFYKYGD Sbjct 361 LNELYEKYNGEERYKVNPVLQKMVEEGKLGRTTGEGFYKYGD 402
Other information such as lamda and k of the run along with the input options used are given
The second option commonly used is format 6 which is the tabular format. This will give you a detailed table of the hits and the different scores associated with each query-hit pair. The output of this format by default gives you <blah> number of columns. The contents of each are:
Query id Subject id % identity alignment length mismatches gap openings q. start q. end s. start s. end e-value bit score
For example the same output as above with -outfmt 0 would be with -outfmt 6:
AF0017_1 gi|11497638|ref|NP_068858.1| 100.00 402 0 0 1 402 1 402 0.0 817 AF0017_1 gi|288931412|ref|YP_003435472.1| 75.13 398 98 1 4 401 1 397 0.0 646
These can be modified with other outputs as outlined in the blastx -help output or whichever program you are using.
Changing the database to be searched
For blast+ runs you can use local databases or remote (on ncbi website) databases.
For this you use the -db command. If you have created a database locally (covered below) and it is in the current working directory you would use the command
blastx -query seqs.fasta -out output.blast.txt -db database
where database is the name given to the database.
If this database is stored elsewhere you would give the absolute or relative path to the database. For example if I have stored it in my home directory I would use
blastx -query seqs.fasta -out output.blast.txt -db ~/database
blastx -query seqs.fasta -out output.blast.txt -db /Users/username/database
To use remote databases (such as nr) you add the -remote command along with the -db command. For example:
blastx -query seqs.fasta -out output.blast.txt -db nr -remote
Making a local BLAST database
Often you will not want to search all of nr or remote databases due to time issues or to restrict your results to a certain set of organisms. This can be done by creating a local database to be searched. To do this you need a fasta formatted sequence file which has the general format such as:
>sequence1 MMVLEIRNVAVIGAGSMGHAIAEVVAIHGFNVKLMDVSEDQLKRAMEKIEEGLRKSYERGYISEDPEKVLKRIEATADLI EVAKDADLVIEAIPEIFDLKKKVFSEIEQYCPDHTIFATNTSSLSITKLAEATKRPEKFIGMHFFNPPKILKLLEIVWGE KTSEETIRIVEDFARKIDRIIIHVRKDVPGFIVNRIFVTMSNEASWAVEMGEGTIEEIDSAVKYRLGLPMGLFELHDVLG GGSVDVSYHVLEYYRQTLGESYRPSPLFERLFKAGHYGKKTGKGFYDWSEGKTNEVPLRAGANFDLLRLVAPAVNEAAWL IEKGVASAEEIDLAVLHGLNYPRGLLRMADDFGIDSIVKKLNELYEKYNGEERYKVNPVLQKMVEEGKLGRTTGEGFYKY GDGNYEFVKVEKEGKVGVLKLNRPRRANALNPTFLKEVEDALDLLERDEEVRAIVIAGEGKNFCAGADIAMFASGRPEMV TEFSQLGHKVFRKIEMLSKPVIAAIHGAAVGGGFELAMACDLRVMSERAFLGLPELNLGIIPGWGGTQRLAYYVGVSKLK EVIMLKRNIKPEEAKNLGLVAEVFPQERFWDEVMKLAREVAELPPLAVKYLKKVIALGTMPALETGNLAESEAGAVIALT DDVAEGIQAFNYRRKPNFRGR >sequence2 MMVLEIRNVAVIGAGSMGHAIAEVVAIHGFNVKLMDVSEDQLKRAMEKIEEGLRKSYERGYISEDPEKVLKRIEATADLI EVAKDADLVIEAIPEIFDLKKKVFSEIEQYCPDHTIFATNTSSLSITKLAEATKRPEKFIGMHFFNPPKILKLLEIVWGE KTSEETIRIVEDFARKIDRIIIHVRKDVPGFIVNRIFVTMSNEASWAVEMGEGTIEEIDSAVKYRLGLPMGLFELHDVLG GGSVDVSYHVLEYYRQTLGESYRPSPLFERLFKAGHYGKKTGKGFYDWSEGKTNEVPLRAGANFDLLRLVAPAVNEAAWL IEKGVASAEEIDLAVLHGLNYPRGLLRMADDFGIDSIVKKLNELYEKYNGEERYKVNPVLQKMVEEGKLGRTTGEGFYKY GDGNYEFVKVEKEGKVGVLKLNRPRRANALNPTFLKEVEDALDLLERDEEVRAIVIAGEGKNFCAGADIAMFASGRPEMV TEFSQLGHKVFRKIEMLSKPVIAAIHGAAVGGGFELAMACDLRVMSERAFLGLPELNLGIIPGWGGTQRLAYYVGVSKLK EVIMLKRNIKPEEAKNLGLVAEVFPQERFWDEVMKLAREVAELPPLAVKYLKKVIALGTMPALETGNLAESEAGAVIALT DDVAEGIQAFNYRRKPNFRGR
The sequence name can be whatever you want.
The basic way to make a local BLAST database is using the makeblastdb command
makeblastdb -in database.fasta -out databaseBLAST -dbtype prot -parse_seqids
where database.fasta is the file containing all the sequences to be made into a database, the sequence type is protein and the output name for the database is databaseBLAST. The -parse_seqids is needed for referencing back to this database at a later stage (such as extracting sequences of hits).This will create multiple files, all of which are required for the BLAST database to function.
A suggestion is to run this on a fasta file, place all the resulting files into a new folder and then point to the database filename within that folder in the -db portion of a BLAST run.
Further masking options (which remove regions of low complexity etc) can be applied and are covered in the BLAST+ manual linked at the top.
Extracting sequences of hits from the database
Once you have your hits you may want to extract all the sequences corresponding to those hits to a file for further analysis. This is done using the blastdbcmd command.
For this to work a few things are needed. The -parse_seqids flag must have been set during the creation of the database searched. The names of each hit is also needed. This can be extracted from any output format from a BLAST run (e.g. is in the 2nd column of a run using default outputs of -outfmt 6 option). A file of just these names can be created by using a -outfmt "6 sseqid" command in the BLAST run. NOTE This will give you every sequence name of hits but no other information. It also will not separate out the hits from different query sequences in the input so all hits for all sequences will be output with no separation.
To get the sequences you can run a command such as:
blastdbcmd -db databaseBLAST -dbtype prot -entry_batch hits.txt -outfmt %f -out hits.fasta
where databaseBLAST is the database to extract sequences from, prot is the type of data in the database, hits.txt contains the names of the sequences to extract, %f states we want fasta format output and hits.fasta is the output file that will contain all the sequences requested.