Biowulf at the NIH
High-throughput Blast on Biowulf

BLAST on Biowulf is intended for running a large number of sequence files, such as hundreds or thousands of query sequences, against the Blast databases. If you have just a few query sequences, you should use Blast on the NCBI website or on Helix. All NCBI Blast programs - blastn, blastp, blastx, tblastn, tblastx, blastpgp, megablast,rpsblast -- are available on Biowulf. Please contact the Helix Systems staff staff@helix.nih.gov, or 4-6248) if you have questions about your Blast jobs.

The 'easyblast' program on Biowulf simplifies submission of large Blast jobs. You need to put all your query sequences into a directory, and then type 'easyblast' at the Biowulf prompt. You will be prompted for all required parameters. The script will then decide what kind of nodes you need (based on the database you are BLASTing against), and submit your job to as many nodes as are available (max 32). Sample session (user input is in bold):
biowulf% easyblast

EasyBlast: Blast for large numbers of sequences
Enter the directory which contains your input sequences: /data/username/blast/seqs

Enter the directory where you want your Blast output to go: /data/username/blast/out
** WARNING: There are already files in /data/username/blast/out which will be 
overwritten by this job.
** Continue? (y/n) :y


BLAST programs:
    blastn - nucleotide query sequence against nucleotide database
    blastp - protein query sequence against protein database
    blastx - nucleotide query translated in all 6 reading frames
                      against a protein database
    tblastn - protein query sequence against a nucleotide database
                      translated in all 6 reading frames
    tblastx - 6-frame translations of a nucleotide query sequence
                      against the 6-frame translations of a nucleotide database
    blastpgp - PSI-BLAST protein query against protein database
    megablast - EST-type query sequences against nucleotide database
Which program do you want to run: blastn


The following nucleotide databases are available:(
(or enter your own database with full pathname)
    nt - NCBI nonredundant Genbank+EMBL+DDBJ+PDB (no EST, STS, GSS or HTG)
    est_human - nonredundant Genbank+EMBL+DDBJ EST human sequences
    est_mouse - nonredundant Genbank+EMBL+DDBJ EST mouse sequences
    est_others - nonredundant Genbank+EMBL+DDBJ EST all other organisms
    pdbnt - from the 3-dimensional structures 
    htgs - high throughput genome sequences
    ecoli.nt - ecoli genomic sequences
    mito.nt - mitochondrial sequences
    yeast.nt - yeast (Saccharomyces cerevisiae) genomic sequences
    drosoph.nt - drosophila sequences
    hs_genome - human genome assembly (Build 36, Apr 2006)
    hs_genome.rna - human genome RNA (Build 36, Apr 2006)
    mouse_genome - mouse genome assembly (Build 36, Mar 2006)
    mouse_genome.rna - mouse genome RNA (Build 36, Mar 2006)
    mouse_masked - mouse genome, masked (Build 36, Mar 2006)
    other_genomic - non-human genomic sequences
    human.rna - RefSeq human RNA
    mouse.rna - RefSeq mouse RNA

Database to run against: yeast

Want a summary file in the output directory? (y/n, default y) :  

http://biowulf.nih.gov/apps/blast/#blast_params has a full list of 
available parameters. Any additional Blast parameters (e.g. -v 10): Submitting to 8 nodes with :m2048 memory. Job number is 85633.biobos Monitor your job at http://biowulf.nih.gov/cgi-bin/usermonS?username

easyblast figures out the node memory required, sets up all temporary files and directories, and submits the job for you.

If a summary has been requested, a file called 'summary' will appear in your output directory along with the actual Blast outputs. For your convenience, this will contain the hits from each Blast result so you can scroll through it easily. Sample summary file.

To run against your own database, enter the db name with full path at the Database: prompt. For example:

       Database to run against: /data/username/blast_db/my_db
You can put multiple sequences into each of your input sequence files. However, there needs to be at least as many query sequence files as nodes! Very occasionally Blast may barf on a particular sequence, in which case it will not continue on to other sequences in that file. If your query sequences are all in one file, and you need to split them into multiple sequence files, there are a couple of utilities available: As with all jobs on Biowulf, you must monitor your Blast jobs by one or more of the following methods: Some recent runs on our system to give you an idea of what sort of performance to expect.
Query Database Blast program
v 2.2.13
Nodes Time
1000 nucleotide
EST sequences
nt
updated 4/Dec/2008
7,808,957 sequences
6.2 GB
blastn 16 nodes
2.6 GHz dual-core Opterons
8GB RAM
Gb ethernet
21 mins
nr
updated 4/Dec/2008
7,463,447 sequences
2.4 GB
blastx 16 nodes
2.6 GHz dual-core Opterons
8GB RAM
Gb ethernet
21 mins
est_human
updated 4/Dec/2008
8,163,883 sequences
1.1 GB
blastn 16 nodes
2.6 GHz dual-core Opterons
8 GB RAM
Gb ethernet
5.5 mins
human genome
updated Apr 2006
25 sequences
770 Mb
blastn 16 nodes
2.6 GHz dual-core Opterons
8 GB RAM
Gb ethernet
40 mins

1000 protein sequences nr
updated 15/Dec/2008
7,463,447 sequences
2.4 GB
blastp 16 nodes
2.6 GHz dual-core Opterons
8 GB RAM
Gb ethernet
28 mins
nt
updated 4/Dec/2008
7,808,957 sequences
6.2 GB
tblastn 16 nodes
2.6 GHz dual-core Opterons
8 GB RAM
Gb ethernet
3:47 hrs

Note: EasyBlast will typically allot 32 nodes (i.e. 64 processors) for a job. Large databases (e.g. nt nucleotide) may be allotted fewer nodes.

BLAST Databases
Local copies of the sequence databases used by blast can be found in the directory /fdb/blastdb. These data are a (weekly-updated) mirror of the ftp://ncbi.nlm.nih.gov/blast/db/ directory maintained by NCBI. Some dbs were built from Fasta-format files, with the command:

formatdb -o T

Blast Database Update Status -- status of all Blast databases installed on the system.

This section is for users who want more control over their jobs, and want to use the multiblast scripts directly without using EasyBlast. Easyblast will automatically select the type of node and number of nodes for you.
  1. Set up a directory structure for input/output and tmp files.
    For management purposes it is best (but not necessary) to keep all the input, output, and temporary directories within a given subdirectory. Note that the /home/username directory is small, and intended for std.error/output files. Use the /data/username directory for your Blast sequences, tmp files, and output. The tmp and output directories will be created if they do not already exist.
  2. Edit/create a file which tells Blast where the input/output files are, and what Blast parameters to use.
    This environment file can be in any directory, but it can help, organizationally, to keep it in your main directory for this run, e.g. /data/username/sample1/env

    Sample environment file:

    -----------------------------------------
    setenv DB /fdb/blastdb/nr
    setenv PROG blastn
    setenv INDIR  /data/username/sample1/input
    setenv OUTDIR /data/username/sample1/output
    setenv TMPDIR  /data/username/sample1/tmp
    setenv PARAMS "-a 2 -b 10 -v 10"
    -----------------------------------------
    
    where
    DB -- Database to act on (e.g. /fdb/blastdb/nr). See Blast db status for a list of available databases. You can also use your own formatdb-formatted Blast database here, with full pathname (e.g. /data/user/myseqs/snail)
    $PROG -- Blast program to run (blastn, blastp, blastx, tblastn, tblastx, blastpgp, megablast ) (see NCBI's Blast tutorial for details.)
    $INDIR -- directory containing all the input sequence files. Each sequence file may contain single or multiple sequences. Don't put ALL your sequences into a single file; this would force all the Blast jobs to be run sequentially on a single node and you would not get any advantages from using the cluster.
    $OUTDIR -- directory which will receive all the individual output files
    $TMPDIR -- temporary directory used by the program.
    $PARAMS-- Parameters for the Blast program, in quotes (e.g. "-a 4 -b 10 -v 10" )
    The Biowulf nodes have 2 or 4 processors each. If you are submitting to the dual-core (4p) nodes, use '-a 4', otherwise use '-a 2'. (List of all available Blast parameters)

    Note: Existing files in $TMPDIR and $OUTDIR may be overwritten by this job. Use a different TMPDIR and OUTDIR if you want to preserve them.

  3. Submit the job to the batch system
    Submit the job to the batch system using the 'qsub' command. Example:
    qsub -v np=8,read=/DIRNAME/env2 -l nodes=8:m4096 /usr/local/blast/bin/runblast
    
    This job is using the environment file /DIRNAME/env2, will use 2 processors per node (because of the -a 2 flag given as a Blast parameter above) and has asked for 'm4096' nodes (see the section on node memory). The 'shnodes' command will list all the Biowulf nodes and their properties.

    Example 2:

    qsub -v np=8,read=/DIRNAME/env1 -l nodes=8 /usr/local/blast/bin/runblast
    
    This job is using the blast environment file /DIRNAME/env1, and has asked for 8 nodes. No specific nodes have been asked for (i.e. no 'm4096' or 'm2048' specification on the nodes), so the batch system will allocate the first nodes that are available. All Biowulf nodes have at least 1 GB memory, so the target database should be no larger than 1GB for this run to be efficient. (see the section on node memory). Example 3: DNA sequences against the human genomic database
    biowulf$ mkdir /data1/username/run1             make main directory for this run
    biowulf$ cd /data1/username/run1                go this directory
    biowulf$ mkdir input output tmp                 make subdirectories for this run
    biowulf$ cd input                               go to the 'input' subdirectory
    biowulf$ cp /home/username/*.seq .              copy sequence files into this subdir
    
    Create a Blast environment file in /data1/username/run1/env. The file contains
    setenv DB /fdb/blastdb/human_genomic
    setenv PROG blastn
    setenv INDIR  /data/username/run1/seqs
    setenv OUTDIR /data/username/run1/out
    setenv TMPDIR  /data/username/run1/tmp
    setenv PARAMS "-a 2 -b 10 -v 10"
    
    Check the size of the database:
    biowulf % ls -l /fdb/blastdb/human_genomic*.nsq
    -rw-rw-r--    1 helixapp staff    990423059 Apr 24 18:29 human_genomic.00.nsq
    -rw-rw-r--    1 helixapp staff    529527444 Apr 24 18:30 human_genomic.01.nsq
    
    The database is ~1.5 GB. Submit the job with the command:
    qsub -v np=8,read=/data/username/run1/env -l nodes=8:m2048 /usr/local/blast/bin/runblast
    

    Example 4: Same data against the Drosophila protein database

    Create a new environment file /data/username/run1/env2 which contains:
    setenv DB /fdb/blastdb/drosoph.aa
    setenv PROG blastx
    setenv PARAMS "-a 2 -b 10 -v 10"
    setenv INDIR  /data/username/run1/seqs
    setenv OUTDIR /data/username/run1/out2
    setenv TMPDIR  /data/username/run1/tmp2
    
    Submit the job using:
    qsub -v np=16,read=/data/username/run1/env2 -l nodes=16 /usr/local/blast/bin/runblast
    
    The Drosophila protein database is not very large and will easily fit into the default 1GB node memory. Thus you don't have to ask for any special node properties.

Programs/Scripts/Files involved:

These scripts can be copied from /usr/local/blast/bin and modified if desired, although this should not be necessary.

When analyzing a large number of sequences with blast it is imperative that the blast database fit entirely within the memory of a given node... this makes a vast difference in the performance of blast. The Biowulf cluster consists of nodes with several different memory configurations. Please see the hardware section of the User Guide for the current configuration of the cluster.

You need to submit your job to nodes that have enough memory for your particular job. If you are running against your own database, check the size of the .nsq or .psq file and submit to the smallest possible node memory. For example,

biowulf% ls -l my_db.nsq
-rw-rw-r--   1 username  username     1521238769 Aug 31  2001 my_db.nsq
The database file is 1521238769 bytes, i.e. 1.5 GB. The nodes should have at least 2GB of memory, so you can submit to the m2048 nodes. If you are running against one of the Biowulf-installed databases in /fdb/blastdb, check the database size with the 'ls -l' command as above, and submit to the appropriate nodes. For multi-part databases such as 'human_genomic', you need to add the size of al the sections. e.g.
[user@biowulf ~]# ls -l /fdb/blastdb/human_genomic*.nsq
-rw-rw-r-- 1 helixapp staff 990423059 Oct  1 22:11 /fdb/blastdb/human_genomic.00.nsq
-rw-rw-r-- 1 helixapp staff 957571439 Oct  1 22:12 /fdb/blastdb/human_genomic.01.nsq
-rw-rw-r-- 1 helixapp staff 274887767 Oct  1 22:12 /fdb/blastdb/human_genomic.02.nsq
The total database size for the human_genomic database is thus ~1GB + ~1GB + .27 GB = 2.3 GB. Blast searches against this database will require nodes with 4 GB memory.

If you are using Easyblast, the Easyblast script will determine the database size and submit the job to the appropriate nodes.

Note: Use the "shnodes" command to see a list of nodes with their properties and status (free, job-exclusive, offline, down)

Scanning through a large set of Blast results can be time-consuming. The blast_summary script may help. Go to your blast output directory and type:
/usr/local/blast/bin/blast_summary

and it will create a file in that directory called 'summary' which contains just the Blast hits for each query sequence. Easyblast does this automatically for you.

These are the parameters available for the blastall program. See also (the Blast documentation at the NCBI website)
blastall 2.2.17 arguments:


  -p  Program Name [String]
  -d  Database [String]
    default = nr
  -i  Query File [File In]
    default = stdin
  -e  Expectation value (E) [Real]
    default = 10.0
  -m  alignment view options:
0 = pairwise,
1 = query-anchored showing identities,
2 = query-anchored no identities,
3 = flat query-anchored, show identities,
4 = flat query-anchored, no identities,
5 = query-anchored no identities and blunt ends,
6 = flat query-anchored, no identities and blunt ends,
7 = XML Blast output,
8 = tabular, 
9 tabular with comment lines
10 ASN, text
11 ASN, binary [Integer]
    default = 0
    range from 0 to 11
  -o  BLAST report Output File [File Out]  Optional
      default = stdout
  -F  Filter query sequence (DUST with blastn, SEG with others) [String]
      default = T
  -G  Cost to open a gap (-1 invokes default behavior) [Integer]
      default = -1
  -E  Cost to extend a gap (-1 invokes default behavior) [Integer]
      default = -1
  -X  X dropoff value for gapped alignment (in bits) (zero invokes default
      behavior) blastn 30, megablast 20, tblastx 0, all others 15 [Integer]
      default = 0
  -I  Show GI's in deflines [T/F]
      default = F
  -q  Penalty for a nucleotide mismatch (blastn only) [Integer]
      default = -3
  -r  Reward for a nucleotide match (blastn only) [Integer]
      default = 1
  -v  Number of database sequences to show one-line descriptions for (V)
      [Integer]
      default = 500
  -b  Number of database sequence to show alignments for (B) [Integer]
      default = 250
  -f  Threshold for extending hits, default if zero
      blastp 11, blastn 0, blastx 12, tblastn 13
      tblastx 13, megablast 0 [Real]
      default = 0
  -g  Perform gapped alignment (not available with tblastx) [T/F]
      default = T
  -Q  Query Genetic code to use [Integer]
      default = 1
  -D  DB Genetic code (for tblast[nx] only) [Integer]
      default = 1
  -a  Number of processors to use [Integer]
      default = 1
  -O  SeqAlign file [File Out]  Optional
  -J  Believe the query defline [T/F]
      default = F
  -M  Matrix [String]
      default = BLOSUM62
  -W  Word size, default if zero (blastn 11, megablast 28, all others 3)
      [Integer]
      default = 0
  -z  Effective length of the database (use zero for the real size) [Real]
      default = 0
  -K  Number of best hits from a region to keep (off by default, if used a value
      of 100 is recommended) [Integer]
      default = 0
  -P  0 for multiple hit, 1 for single hit (does not apply to blastn) [Integer]
      default = 0
  -Y  Effective length of the search space (use zero for the real size) [Real]
      default = 0
  -S  Query strands to search against database (for blast[nx], and tblastx)
      3 is both, 1 is top, 2 is bottom [Integer]
      default = 3
  -T  Produce HTML output [T/F]
      default = F
  -l  Restrict search of database to list of GI's [String]  Optional
  -U  Use lower case filtering of FASTA sequence [T/F]  Optional
  -y  X dropoff value for ungapped extensions in bits (0.0 invokes default
      behavior) blastn 20, megablast 10, all others 7 [Real]
      default = 0.0
  -Z  X dropoff value for final gapped alignment in bits (0.0 invokes default
      behavior) blastn/megablast 50, tblastx 0, all others 25 [Integer]
      default = 0
  -R  PSI-TBLASTN checkpoint file [File In]  Optional
  -n  MegaBlast search [T/F]
      default = F
  -L  Location on query sequence [String]  Optional
  -A  Multiple Hits window size, default if zero (blastn/megablast 0, all others
      40 [Integer]
      default = 0
  -w  Frame shift penalty (OOF algorithm for blastx) [Integer]
      default = 0
  -t  Length of the largest intron allowed in a translated nucleotide sequence
      when linking multiple distinct alignments. (0 invokes default behavior; a
      negative value disables linking.) [Integer]
      default = 0
  -B  Number of concatenated queries, for blastn and tblastn [Integer]  Optional
      default = 0
  -V  Force use of the legacy BLAST engine [T/F]  Optional
      default = F
  -C  Use composition-based statistics for blastp or tblastn:
      As first character:
      D or d: default (equivalent to T)
      0 or F or f: no composition-based statistics
      1 or T or t: Composition-based statistics as in NAR 29:2994-3005, 2001
      2: Composition-based score adjustment as in Bioinformatics 21:902-911,
         2005, conditioned on sequence properties
      3: Composition-based score adjustment as in Bioinformatics 21:902-911,
         2005, unconditionally
      For programs other than tblastn, must either be absent or be D, F or 0.
      As second character, if first character is equivalent to 1, 2, or 3:
      U or u: unified p-value combining alignment p-value and compositional
      p-value in round 1 only [String]
      default = D
  -s  Compute locally optimal Smith-Waterman alignments (This option is only
      available for gapped tblastn.) [T/F]
      default = F