Biowulf at the NIH
High-throughput WU-Blast on Biowulf

WU-Blast on Biowulf is intended for running a large number of sequence files, such as hundreds or thousands of query sequences, against the WU-Blast databases. If you have just a few query sequences, you should use WU-Blast on a public server such as the EBI server or on Helix. Please contact the Helix Systems staff (staff@helix.nih.gov, or 4-6248) if you have questions about your WU-Blast jobs.

WU-BLAST was developed by Warren Gish at Washington University in St. Louis (WU-Blast website). The wrapper scripts that customize it for the NIH Biowulf cluster were written by Susan Chacko (Helix Staff, CIT) and Peter FitzGerald (Genome Analysis Unit, NCI).

The 'easywublast' program on Biowulf simplifies submission of large WU-Blast jobs. Your query sequences can be in a single large file or in separate sequence files in a directory. Type 'easywublast' at the Biowulf prompt, and you will be prompted for all required parameters. The script does some basic sanity checking, sets up your run, and submits it to the batch queue. Sample session (user input follows each prompt):
[user@biowulf ~]$ easywublast

EasyWUBlast: WuBlast for large numbers of sequences
Enter the file or directory which contains your input sequences: /data/user/blast/1000_est
Enter the directory where you want your Blast output to go: /data/user/blast/out
** WARNING: There are already files in /data/user/blast/out which may be overwritten by this job.
** Continue? (y/n) :y

WU-BLAST programs:
    blastn - nucleotide query sequence against nucleotide database
    blastp - protein query sequence against protein database
    blastx - nucleotide query translated in all 6 reading frames 
          against a protein database
    tblastn - protein query sequence against a nucleotide database
          translated in all 6 reading frames
    tblastx - 6-frame translations of a nucleotide query sequence 
          against the 6-frame translations of a nucleotide database
Which program do you want to run: blastn

The following nucleotide databases are available:
(or enter your own database with full pathname)
    nt - all nonredundant Genbank+EMBL+DDBJ+PDB (no EST, STS, GSS or HTG)
    est_human - nonredundant Genbank+EMBL+DDBJ EST human sequences
    est_mouse - nonredundant Genbank+EMBL+DDBJ EST mouse sequences
    est_others - nonredundant Genbank+EMBL+DDBJ EST all other organisms
    pdb.nt - from the 3-dimensional structures 
    ecoli.nt - ecoli genomic sequences
    mito.nt - mitochondrial sequences
    yeast.nt - yeast (Saccharomyces cerevisiae) genomic sequences
    drosoph.nt - drosophila sequences
    hs_genome - human genome assembly (Build 36, Apr 2006)
    hs_genome.rna - human genome RNA (Build 36, Apr 2006)
    mouse_genome - mouse genome assembly (Build 36, Mar 2006)
    mouse_genome.rna - mouse genome RNA (Build 36, Mar 2006)
    human.rna - Refseq Human RNA
    mouse.rna - RefSeq Mouse RNA
Database to run against: drosoph.nt

Use NCBI-Blast parameters? [n] : 

Any additional WUBlast parameters (e.g. -hspmax 500 ): -E=1.0 -V=10 -B=10

Save intermediate output files? (for debugging only) [n]: 
Creating parameter file /data/susanc/wublast_par.24368

Submitting to 16 nodes. Job number is 789290.biobos

Monitor your job at http://biowulf.nih.gov/cgi-bin/queuemon?789290.biobos

[user@biowulf ~]$

easywublast sets up the required temporary files and directories and submits the job for you. Most required parameters are self-evident. Answering yes to "Use NCBI-Blast parameters?" selects parameter sets that approximate NCBI Gapped Blast 2.0 (More info). If you choose to save intermediate files, the unmerged outputs, the parameter file, and the temporary files will all be saved.

In the example above, the query sequences are in individual files in one directory. They can also be set up as multiple sequences per file (e.g. 100 sequences per file, 50 files in the directory), or in one large file. If they are all in a single file, enter the name of that file for the query sequences. e.g.

Enter the file or directory which contains your input sequences: /data/user/my_drosoph.seq

To run against your own database, enter the database name with its full path at the "Database to run against:" prompt. For example:

Database to run against: /data/username/blast_db/my_db
This database should have been built with the WU-Blast xdformat program, which is available in /usr/local/wublast and described further in the WU-Blast documentation.
Analyzing the output
Your output from all the query sequences will appear in one large file, for convenience. If you need to pull out individual outputs from that file, use the wublast_extract program. e.g.
biowulf% wublast_extract -f output.wublast -q AAU46754 > AAU46754.out
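
When several individual results are needed, the same command can be repeated in a loop. The following is a minimal sketch that only prints the commands so they can be reviewed first. It uses just the -f and -q options shown above; the helper name extract_cmds and the second query ID are hypothetical.

```shell
# Print (not run) one wublast_extract command per query ID.
# Only the -f and -q options from the example above are used.
extract_cmds() {
    merged=$1
    shift
    for id in "$@"; do
        printf 'wublast_extract -f %s -q %s > %s.out\n' "$merged" "$id" "$id"
    done
}

# Dry run: AAU46755 is a made-up ID used only for illustration.
extract_cmds output.wublast AAU46754 AAU46755
# To execute the printed commands, pipe them to a shell:
# extract_cmds output.wublast AAU46754 AAU46755 | sh
```

Printing before running makes it easy to confirm that the output file names are the ones you expect.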
WU-Blast Databases
Local copies of the sequence databases used by WU-Blast can be found in the directory /fdb/wublastdb. These databases are updated weekly. They are built from Fasta-format files downloaded from the ftp://ncbi.nlm.nih.gov/blast/db/FASTA directory maintained by NCBI, with the command below (-p for protein databases, -n for nucleotide databases):

xdformat -I "description" -p[-n] file.fasta

WU-Blast Database Update Status

As with all jobs on Biowulf, you must monitor your WU-Blast jobs (see Monitoring your jobs in the Biowulf User Guide).

The following recent runs on our system give an idea of the timescales to expect.
Query                         Database                                                          Blast program  Time (mins)  Time with NCBI Blast parameters (mins)  Nodes
1000 nucleotide EST sequence  nt nucleotide (updated 26 May 2007; 5,294,948 sequences; 5.4 Gb)  blastn         35:20        21:25                                   16 nodes (2.6 GHz Opterons, 8 GB RAM, Gb ethernet)
                              nr protein (updated 7 Jun 2007; 4,988,250 sequences; 1.7 GB)      blastx         53:02        10:21
                              est_human (updated 3 June 2007; 8,119,086 sequences; 1.1 GB)      blastn         10:38        5:29
                              human genome (May 2006 build; 25 sequences; 767 Mb)               blastn         9:26         7:46
1000 protein sequences        nr protein (updated 7 Jun 2007; 4,988,250 sequences; 1.7 GB)      blastp         66:02        15:31
                              nt nucleotide (updated 26 May 2007; 5,294,948 sequences; 5.4 Gb)  tblastn        10:37:18     3:40:11
NCBI Blast parameters: WU-Blast is designed to be sensitive, so with default parameters it will typically take longer than an NCBI Blast run with default parameters. The wu-blastall command converts NCBI Blast parameters into roughly equivalent WU-Blast parameters. Note that sensitivity is inversely related to speed. (More about WU-Blast and NCBI Blast parameters)

WU-Blast on Biowulf works by dividing the database among the nodes, and running WU-Blast with all the query sequences against each piece of the database. At the end of the run, a merge program puts the pieces together. (In contrast, NCBI Blast on Biowulf is parallelized by job, where individual query sequences are sent to different nodes. This means that the entire Blast database has to be read by each node, which can cause slowdowns for large databases and/or large numbers of nodes.)
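
To picture the database-division approach, here is a rough sketch that prints one search command per database piece. It relies on the -dbslice and -o options listed in the blastn usage summary at the end of this page. Whether the Biowulf wrapper scripts actually use -dbslice is an assumption; the helper name slice_cmds and the slice_N.out file names are hypothetical, and the merge step is not shown.

```shell
# Print (not run) one search command per database slice.
# ASSUMPTION: this only illustrates splitting a database n ways with
# -dbslice; it is not the actual runwublast implementation.
slice_cmds() {
    prog=$1
    db=$2
    query=$3
    n=$4
    i=1
    while [ "$i" -le "$n" ]; do
        # Command form follows the usage line: PROGRAM database queryfile [options]
        printf '%s %s %s -dbslice %d/%d -o slice_%d.out\n' \
            "$prog" "$db" "$query" "$i" "$n" "$i"
        i=$((i + 1))
    done
}

# Eight slices of nt, each searched with the full set of queries.
slice_cmds blastn /fdb/wublastdb/nt query.fasta 8
```

Each printed command searches every query against one eighth of the database, which is why a node only needs to read its own piece.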

EasyWUBlast will set up the temporary directories and files, and allocate nodes for your job. If you want more control, you can bypass EasyWUBlast and use the underlying scripts directly, via the instructions below.

  1. Query sequences: If your query sequences are in individual files, put them all into one directory or concatenate them into a single file. Note that the /home/username directory is small, and intended for std.error/output files. Use the /data/username directory for your WU-Blast sequences, tmp files, and output. Every Biowulf user has a directory in /data, and this directory is accessible from either Biowulf or any of the Helix SGIs. The tmp and output directories will be created if they do not already exist. Note that any previous files in the output directory will be overwritten.
  2. Edit/create a file which tells Blast where the input/output files are, and what WU-Blast parameters to use.
    This environment file can be in any directory, but it can help, organizationally, to keep it in your main directory for this run, e.g. /data/username/sample1/env

    Sample environment file named /data/username/blast/env

    setenv DB /fdb/wublastdb/nt
    setenv PROG blastn
    setenv INDIR  /data/user/sample/query.fasta
    setenv OUTDIR /data/user/sample/out/
    setenv TMPDIR  /data/user/sample/tmp
    setenv PARAMS "-B 10 -V 10 -hspmax 500"
    
    where
    $DB -- Database to act on (e.g. /fdb/wublastdb/nt). See WU-Blast db status for a list of available databases.
    $PROG -- WU-Blast program to run (blastn, blastp, blastx, tblastn, tblastx)
    $INDIR -- directory or file containing all the query sequences. Each sequence file may contain single or multiple sequences.
    $OUTDIR -- directory which will receive all the output files
    $TMPDIR -- temporary directory used by the program.
    $PARAMS -- Parameters for the WU-Blast program, in quotes (e.g. "-B 10 -V 10 -hspmax 500" )
    Other optional environment variables can also be added in this file, as described in the next section.
    (List of all available Blast parameters)

    Note: Existing files in $TMPDIR and $OUTDIR may be overwritten by this job. Use a different TMPDIR and OUTDIR if you want to preserve them.

  3. Submit the job to the batch system
    Submit the job to the batch system using the 'qsub' command. Example:
    qsub -v np=8,read=/DIRNAME/env -l nodes=8 /usr/local/wublast/nih/runwublast
    
    This job uses the environment file /DIRNAME/env and will use 2 processors per node (the WU-Blast default). WU-Blast will divide the database into 8 pieces, so the node memory requirement will typically be irrelevant. For example, the nt database is 3.04 Gb, so the program will require about 3.04/8 = 0.38 Gb (380 Mb) on each node. Biowulf nodes have a minimum of 1 Gb memory.

    Another example:

    qsub -v np=10,read=/DIRNAME/env1 -l nodes=10 /usr/local/wublast/nih/runwublast
    
    This job is using the blast environment file /DIRNAME/env1, and has asked for 10 nodes. The batch system will allocate the fastest nodes that are available.

    You can monitor any jobs you submit; see Monitoring your jobs in the Biowulf User Guide.

  4. Dos and Don'ts:
    • Unlike with NCBI Blast, memory probably won't be a limiting factor. So don't specify a minimum node memory when submitting jobs.
    • Don't submit a WU-Blast job to a large number of nodes unless you are sure you know what you are doing. The last step in the procedure merges the output from all the nodes, and takes up more cpu and memory as the number of nodes increases. Thus, if you submit to more nodes, your job may end up taking longer. In our experience, about 10 nodes is ideal.
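
The per-node memory estimate in step 3 is a simple division of the database size by the number of nodes. The sketch below works it through and also prints a qsub line in the form shown in step 3. The helper names per_node_mb and qsub_cmd are hypothetical, sizes are treated as whole Mb with 1 Gb taken as 1000 Mb, and np is assumed to match the node count as it does in the examples.

```shell
# Approximate share of the database held by each node, in Mb.
# ASSUMPTION: sizes are whole Mb and 1 Gb is treated as 1000 Mb.
per_node_mb() {
    db_mb=$1
    nodes=$2
    echo $((db_mb / nodes))
}

# Print (not run) a qsub command in the form used in step 3.
# ASSUMPTION: np is set equal to the node count, as in the examples.
qsub_cmd() {
    nodes=$1
    envfile=$2
    printf 'qsub -v np=%d,read=%s -l nodes=%d /usr/local/wublast/nih/runwublast\n' \
        "$nodes" "$envfile" "$nodes"
}

per_node_mb 3040 8          # nt at 3.04 Gb over 8 nodes: prints 380
qsub_cmd 8 /DIRNAME/env     # dry run; paste the printed line to submit
```

At roughly 380 Mb per node, the share fits comfortably within the 1 Gb minimum node memory, which is why no memory request is needed.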

WU-Blast environment variables

The WU-Blast environment variables are set as follows:
WUBLASTMAT - /usr/local/wublast/matrix
WUBLASTFILTER - /usr/local/wublast/filter
WUBLASTDB - /fdb/wublastdb
If you would like to change these values, add them to your parameter file, e.g.
setenv DB /fdb/wublastdb/nt
setenv PROG blastn
setenv INDIR  /data/user/sample/query.fasta
setenv OUTDIR /data/user/sample/out/
setenv TMPDIR  /data/user/sample/tmp
setenv PARAMS "-B 10 -V 10 -hspmax 500"
setenv WUBLASTDB /data/user/wublast/mydb
setenv WUBLASTFILTER /data/user/wublast/filter

Debugging

By default, the temporary files (/data/user/wublast_par, /data/user/wublast_tmp/*, and the outputs from each node) will be deleted after the run. The only remaining file will be 'output.wublast' which contains the results from all nodes. Your query sequences, of course, will also remain.

If you wish to retain all the temporary files, add the environment variable SAVE_ALL to your parameter file. e.g.

setenv DB /fdb/wublastdb/nt
setenv PROG blastn
setenv INDIR  /data/user/sample/query.fasta
setenv OUTDIR /data/user/sample/out/
setenv TMPDIR  /data/user/sample/tmp
setenv PARAMS "-B 10 -V 10 -hspmax 500"
setenv SAVE_ALL 1
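
If you create environment files often, a small helper can write them for you. This is a minimal sketch, not part of the WU-Blast installation: the function name write_env and its argument order are hypothetical, and it only emits the setenv lines shown above, adding SAVE_ALL when an eighth argument is given.

```shell
# Write a WU-Blast environment file (csh 'setenv' syntax).
# Arguments: envfile db prog indir outdir tmpdir params [save_all]
write_env() {
    envfile=$1
    db=$2
    prog=$3
    indir=$4
    outdir=$5
    tmpdir=$6
    params=$7
    cat > "$envfile" <<EOF
setenv DB $db
setenv PROG $prog
setenv INDIR $indir
setenv OUTDIR $outdir
setenv TMPDIR $tmpdir
setenv PARAMS "$params"
EOF
    # Any non-empty eighth argument keeps the temporary files.
    if [ -n "$8" ]; then
        echo "setenv SAVE_ALL 1" >> "$envfile"
    fi
}

# Demo: write to a scratch file and show the result.
demo_env=$(mktemp)
write_env "$demo_env" /fdb/wublastdb/nt blastn /data/user/sample/query.fasta \
    /data/user/sample/out/ /data/user/sample/tmp "-B 10 -V 10 -hspmax 500" yes
cat "$demo_env"
rm -f "$demo_env"
```

In real use, pass the path where you keep the environment file for the run instead of a scratch file.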

Programs/Scripts/Files involved:

These scripts can be copied from /usr/local/wublast/nih and modified if desired, although this should not be necessary.

More Examples

  1. DNA sequences against the nucleotide non-redundant database (nt)
    Note that 'username' in the examples below should be replaced by your own username!
    biowulf% mkdir /data/username/run1              make main directory for this run
    biowulf% cd /data/username/run1                 go to this directory
    biowulf% mkdir input output tmp                 make subdirectories for this run
    biowulf% cd input                               go to the 'input' subdirectory
    biowulf% cp /home/username/*.seq .              copy sequence files into this subdir
    
    Create a WU-Blast environment file in /data/username/run1/env. The file contains
    setenv DB /fdb/wublastdb/nt
    setenv PROG blastn
    setenv INDIR  /data/username/run1/input
    setenv OUTDIR /data/username/run1/output
    setenv TMPDIR  /data/username/run1/tmp
    setenv PARAMS "-hspmax 500 -warnings"
    
    Submit the job with the command:
    qsub -v np=8,read=/data/username/run1/env -l nodes=8 \
    /usr/local/wublast/nih/runwublast
    
  2. A file of 1000 query sequences against the protein non-redundant database (nr)
    biowulf% cat /data/username/*.seq > /data/username/wb_run5/myseqs
    
    Create a new environment file /data/username/run1/env2 which contains:
    setenv DB /fdb/wublastdb/nr
    setenv PROG blastx
    setenv INDIR  /data/username/wb_run5/myseqs
    setenv OUTDIR /data/username/run1/out2
    setenv TMPDIR  /data/username/run1/tmp2
    setenv PARAMS "-hspmax 500 -B 10 -V 10"
    
    Submit the job using:
    qsub -v np=10,read=/data/username/run1/env2 -l nodes=10 \
    /usr/local/wublast/nih/runwublast
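
A mistyped path in an environment file is easy to miss, so it can help to check the file before running qsub. The following is a rough sketch of such a check. The helper names env_value and check_env are hypothetical, and the parsing assumes simple "setenv NAME VALUE" lines whose values contain no spaces (so it is not suitable for PARAMS).

```shell
# Print the value that an environment file assigns to a variable.
# ASSUMPTION: lines have the form 'setenv NAME VALUE' with no spaces in VALUE.
env_value() {
    awk -v name="$2" '$1 == "setenv" && $2 == name { print $3 }' "$1"
}

# Report whether the INDIR named in an environment file exists.
check_env() {
    indir=$(env_value "$1" INDIR)
    if [ -e "$indir" ]; then
        echo "ok: INDIR $indir exists"
    else
        echo "missing: INDIR $indir"
        return 1
    fi
}

# Demo with a scratch file that points at a path assumed not to exist.
demo=$(mktemp)
printf 'setenv PROG blastn\nsetenv INDIR  /nonexistent/seqs\n' > "$demo"
check_env "$demo" || echo "fix the environment file before running qsub"
rm -f "$demo"
```

Only INDIR is checked because the tmp and output directories are created by the job if they do not already exist.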
    

Wash U. WU-Blast documentation

Typing the program name will give you a summary of most available parameters and will report the version of WU-Blast being used. See also the Command-line options discussed at the WU-Blast site.

biowulf% /usr/local/wublast/i686/blastn
BLASTN 2.0MP-WashU [26-Oct-2004] [linux24-i686-ILP32F64 2004-10-26T20:25:25]

Copyright (C) 1996-2004 Washington University, Saint Louis, Missouri USA.
All Rights Reserved.

Reference:  Gish, W. (1996-2004) http://blast.wustl.edu

Notice:  this program and its default parameter settings are optimized to find
nearly identical sequences rapidly.  To identify weak protein similarities
encoded in nucleic acid, use BLASTX, TBLASTN or TBLASTX.

Usage:

    BLASTN database queryfile [options]

Valid BLASTN options:  E, S, E2, S2, W, T, X, M, N, Y, Z, L, K, H, V, B
(described at Wu-Blast command-line parameters)


  -matrix <matrix-name>   use the specified scoring matrix (default matrix is
                    computed from M=+5 N=-4); be sure to consider changing the
                    default gap penalties when using a non-default scoring
                    system
  -Q <s>    penalty score for a gap of length 1
  -R <s>    penalty score for extending a gap by each letter after the first
  -kap        use Karlin-Altschul statistics on individual alignment scores
  -sump       use Karlin-Altschul "Sum" statistics*
  -poissonp   use Poisson statistics to evaluate multiple HSPs
  -top     search only the top strand of the query
  -bottom  search only the bottom strand of the query
  -filter <method>  hard mask the query using the specified method (e.g.,
                    "seg", "xnu", "ccp", "dust" or "none")
  -wordmask <method>   soft mask the query using the specified method (see
                    -filter)
  -maskextra <n>   extend soft masking additional distance <n> into flanking
                    regions
  -lcfilter    hard mask lower case letters in the query sequence
  -lcmask      soft mask lower case letters in the query sequence
  -echofilter  display the query, after any/all masks have been applied
  -hitdist <n> max. distance between word hits for 2-hit BLAST (default 0)
  -wink <n>    generate neighborhood words every <n>-th position (default 1)
  -stats       collect word-hit statistics (consumes marginally more cpu time)
  -ctxfactor <f>  base statistics on this number of independent contexts or
                    reading frames
  -nogap       turn off gapped alignment method, reporting only ungapped HSPs
  -wstrict     impose strict requirement for word hits in ungapped alignments
  -gapall      perform gapped alignment procedure on all ungapped HSPs*
  -gapE <e>    expectation threshold of sets of ungapped HSPs for subsequent
                    use in seeding gapped alignments (default gapall)
  -gapE2 <e>   expectation threshold for saving individual gapped alignments
  -gapW <n>    full band width for gapped alignment procedure
  -gapX <s>    drop-off score for gapped alignment procedure
  -pingpong    perform extra processing to help ensure a locally optimal
                    alignment (rarely useful)
  -nosegs      do not segment the query sequence on hyphen (-) characters
  -olf <f>     max. fractional length of overlap for HSP consistency
  -golf <f>    max. fractional length overlap for GSP consistency
  -olmax <n>   max. absolute length of overlap for HSP consistency (default
                    unlimited)
  -golmax <n>  max. absolute length of overlap for GSP consistency (default
                    unlimited)
  -gapdecayrate <f>  characteristic parameter of geometric weights (default
                    0.5)
  -span2    discard HSPs spanned on both query and subject by a better HSP*
  -span1    discard HSPs spanned on query, subject or both by a better HSP
  -span     do not discard HSPs spanned by other, better HSPs
  -prune    do not prune insignificant HSPs from the output lists
  -consistency  turn off HSP consistency rules for statistics
  -links        display consistent link information for each alignment
  -topcomboN <n>  report this number of consistent (colinear) groups of HSPs
  -topcomboE <e>  only show HSP combos within this factor of the best combo
  -sumstatsmethod <n>    specify an alternate use of Sum statistics
  -hspsepqmax <n>  max. separation allowed between HSPs along query
  -hspsepsmax <n>  max. separation allowed between HSPs along subject
  -altscore "qc,sc,score"    qc and sc may be letters or "all"; score may be
                    numeric, "min", "max", or "na" (not allowed)
  -altscore "none"    clears any previous altscore specifications
  -hspmax <n>    max. number of ungapped HSPs saved per subject sequence
                    (default 1000; 0 => unlimited)
  -gspmax <n>    max. number of gapped HSPs (GSPs) saved per subject sequence
                    (default 0; 0 => unlimited)
  -spoutmax <n>  max. number of segment pairs reported in the output per
                    subject sequence (default 0; 0 => unlimited)
  -qoffset <i>   adjust query sequence coordinate numbers by this amount
  -soffset <i>   adjust subject sequence coordinate numbers by this amount
  -nwstart <n>   start generating neighborhood words here in query (default 1)
  -nwlen <n>     generate neighborhood words over this distance from nwstart
                    in query
  -qrecmin <n>   starting multi-query file record number to search
  -qrecmax <n>   ending multi-query file record number to search
  -dbrecmin <n>  starting database record number to search
  -dbrecmax <n>  ending database record number to search
  -ucdb          search nucleotide sequence database in uncompressed form
  -vdbdescmax <n>  limit depth of recursion to <n> in describing virtual
                    databases (default 1)
  -dbchunks <n>    no. of logical chunks of the database to assign to threads
  -dbslice <m>/<n>  search slice <m> out of a database sliced <n> ways
  -dbslice <a>-<b>/<n>  search slices <a> through <b> (inclusive) out of a
                    database sliced <n> ways
  -gi          display gi identifiers, when available
  -noseqs      do not display sequence alignments -- abbreviated output
  -qtype       exit non-zero if query seems to be of wrong type
  -qres        exit non-zero if query contains an invalid residue code
  
  Multiple sort options can be specified and are applied in the user-specified
                    order.
  -sort_by_pvalue           list subjects in decreasing P-value order*
  -sort_by_count            list subjects by the number of HSPs
  -sort_by_highscore        list subjects by highest HSP score
  -sort_by_totalscore       list subjects by the sum total of HSP scores
  -sort_by_subjectlength    list subjects with longer sequences first
  
  -cpus <n>     no. of processors to utilize on multi-processor systems
  -mmio         do not use memory-mapped I/O (usually slower)
  -nonnegok     make all non-negative expected scores a non-FATAL error
  -novalidctxok make no valid contexts a non-FATAL error
  -shortqueryok make queries shorter than the word length a non-FATAL error
  -notes       suppress informatory messages
  -warnings    suppress warning messages
  -errors      suppress non-fatal error messages (strongly discouraged)
  -putenv "NAME=VALUE"   set environment variable NAME to the specified VALUE
  -endputenv    ignore any subsequent putenv options on the command line
  -getenv NAME display the value of the environment variable NAME
  -endgetenv    ignore any subsequent getenv options on the command line
  -compat1.4  revert to BLAST version 1.4 behavior (with bug fixes)
  -compat1.3  revert to BLAST version 1.3 behavior (with bug fixes)
  -haltonfatal    halt multi-query execution on occurrence of first FATAL
  -globalexit     append EXIT CODE 12 to output if any multi-query was fatal
  -abortonerror   abort (and possibly dump core) on a non-fatal error
  -abortonfatal   abort (and possibly dump core) on a fatal error
  -progress <n>   report progress of search at least this often (in seconds)
  -o fname     write output to file named "fname", instead of stdout

    *Default program behavior