WU-Blast on Biowulf is intended for running large numbers of query sequences (hundreds or thousands) against the WU-Blast databases. If you have just a few query sequences, you should use WU-Blast on a public server such as the EBI server, or on Helix. Please contact the Helix Systems staff (staff@helix.nih.gov, or 4-6248) if you have questions about your WU-Blast jobs.

WU-BLAST was developed by Warren Gish at Washington University in St. Louis (WU-Blast website). Specific customization via wrapper scripts for the NIH Biowulf cluster was done by Susan Chacko (Helix Staff, CIT) and Peter FitzGerald (Genome Analysis Unit, NCI).
[user@biowulf ~]$ easywublast

EasyWUBlast: WuBlast for large numbers of sequences

Enter the file or directory which contains your input sequences: /data/user/blast/1000_est
Enter the directory where you want your Blast output to go: /data/user/blast/out
** WARNING: There are already files in /data/user/blast/out which may be overwritten by this job. **
Continue? (y/n) :y

WU-BLAST programs:
  blastn  - nucleotide query sequence against nucleotide database
  blastp  - protein query sequence against protein database
  blastx  - nucleotide query translated in all 6 reading frames against a protein database
  tblastn - protein query sequence against a nucleotide database translated in all 6 reading frames
  tblastx - 6-frame translations of a nucleotide query sequence against the 6-frame translations of a nucleotide database
Which program do you want to run: blastn

The following nucleotide databases are available:
(or enter your own database with full pathname)
  nt               - all nonredundant Genbank+EMBL+DDBJ+PDB (no EST, STS, GSS or HTG)
  est_human        - nonredundant Genbank+EMBL+DDBJ EST human sequences
  est_mouse        - nonredundant Genbank+EMBL+DDBJ EST mouse sequences
  est_others       - nonredundant Genbank+EMBL+DDBJ EST all other organisms
  pdb.nt           - from the 3-dimensional structures
  ecoli.nt         - ecoli genomic sequences
  mito.nt          - mitochondrial sequences
  yeast.nt         - yeast (Saccharomyces cerevisiae) genomic sequences
  drosoph.nt       - drosophila sequences
  hs_genome        - human genome assembly (Build 36, Apr 2006)
  hs_genome.rna    - human genome RNA (Build 36, Apr 2006)
  mouse_genome     - mouse genome assembly (Build 36, Mar 2006)
  mouse_genome.rna - mouse genome RNA (Build 36, Mar 2006)
  human.rna        - Refseq Human RNA
  mouse.rna        - RefSeq Mouse RNA
Database to run against: drosoph.nt
Use NCBI-Blast parameters? [n] :
Any additional WUBlast parameters (e.g. -hspmax 500 ): -E=1.0 -V=10 -B=10
Save intermediate output files? (for debugging only) [n]:
Creating parameter file /data/susanc/wublast_par.24368
Submitting to 16 nodes.
Job number is 789290.biobos
Monitor your job at http://biowulf.nih.gov/cgi-bin/queuemon?789290.biobos
[user@biowulf ~]
easywublast sets up the temporary files and directories that are required, and submits the job for you. Most required parameters are self-evident. The "NCBI-Blast parameters" will use parameter sets that approximate the NCBI Gapped Blast 2.0 (More info). If you choose to 'save intermediate files', the unmerged outputs, the parameter file, and temporary files will all be saved.
In the example above, the query sequences are in individual files in one directory. They can also be set up as multiple sequences per file (e.g. 100 sequences per file, 50 files in the directory), or in one large file. If they are all in a single file, enter the name of that file for the query sequences. e.g.
Enter the file or directory which contains your input sequences: /data/user/my_drosoph.seq
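If your sequences are in one large file and you would rather supply a directory of smaller multi-sequence files, the split can be scripted. The following is a minimal sketch, assuming FASTA-formatted input in which every record starts with a ">" header line; the function names, the `chunk_NNNN.seq` naming scheme and the default of 100 sequences per file are illustrative choices, not part of EasyWUBlast.

```python
from pathlib import Path
from typing import Iterable, Iterator, List


def read_fasta_records(lines: Iterable[str]) -> Iterator[str]:
    """Yield each FASTA record (header line plus its sequence lines) as one string."""
    record: List[str] = []
    for line in lines:
        # A new header closes the record collected so far.
        if line.startswith(">") and record:
            yield "".join(record)
            record = []
        if line.strip():  # skip blank lines
            record.append(line if line.endswith("\n") else line + "\n")
    if record:
        yield "".join(record)


def _write_chunk(batch: List[str], out_dir: Path, index: int) -> Path:
    """Write one batch of records to out_dir/chunk_NNNN.seq (hypothetical naming)."""
    path = out_dir / f"chunk_{index:04d}.seq"
    path.write_text("".join(batch))
    return path


def split_fasta(source: Path, out_dir: Path, per_file: int = 100) -> List[Path]:
    """Split `source` into files holding at most `per_file` records each."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written: List[Path] = []
    batch: List[str] = []
    with source.open() as handle:
        for record in read_fasta_records(handle):
            batch.append(record)
            if len(batch) == per_file:
                written.append(_write_chunk(batch, out_dir, len(written)))
                batch = []
    if batch:  # leftover records that did not fill a whole chunk
        written.append(_write_chunk(batch, out_dir, len(written)))
    return written
```

The resulting directory can then be given at the input-sequences prompt in place of the single file.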
To run against your own database, enter the database name with its full path at the 'Database to run against:' prompt. For example:
Database to run against: /data/username/blast_db/my_db
This database should have been built with the WU-Blast xdformat program, which is available in /usr/local/wublast and described further in the WU-Blast documentation.
biowulf% wublast_extract -f output.wublast -q AAU46754 > AAU46754.out
xdformat -I "description" -p[-n] file.fasta
WU-Blast Database Update Status
As with all jobs on Biowulf, you must monitor your WU-Blast jobs by one or more of the following methods:
- Command-line tools (qstat, jobload). Described in detail in the Monitoring section of the Biowulf user guide.
- The graphic user monitor. Described in detail in the Monitoring section of the Biowulf user guide.
- Look at the standard output file and standard error file. For EasyWUBlast, these will be called '/home/user/Easywublast.o######' and '/home/user/Easywublast.e######'.
- You will get email when the EasyWU-Blast job starts and ends. Check that the 'Exit Status' is 0 (zero).
Query | Database | Blast program | Time (mm:ss or h:mm:ss) | Time with NCBI Blast parameters (mm:ss or h:mm:ss) | Nodes |
1000 nucleotide EST sequences | nt nucleotide, updated 26 May 2007, 5,294,948 sequences, 5.4 Gb | blastn | 35:20 | 21:25 | 16 nodes, 2.6 GHz Opterons, 8 GB RAM, Gb ethernet |
1000 nucleotide EST sequences | nr protein, updated 7 Jun 2007, 4,988,250 sequences, 1.7 GB | blastx | 53:02 | 10:21 | |
1000 nucleotide EST sequences | est_human, updated 3 June 2007, 8,119,086 sequences, 1.1 GB | blastn | 10:38 | 5:29 | |
1000 nucleotide EST sequences | human genome, May 2006 build, 25 sequences, 767 Mb | blastn | 9:26 | 7:46 | |
1000 protein sequences | nr protein, updated 7 Jun 2007, 4,988,250 sequences, 1.7 GB | blastp | 66:02 | 15:31 | |
1000 protein sequences | nt nucleotide, updated 26 May 2007, 5,294,948 sequences, 5.4 Gb | tblastn | 10:37:18 | 3:40:11 | |
NCBI Blast parameters: WU-Blast is designed to be sensitive, so with default parameters it will typically take longer than an NCBI Blast run with default parameters. The wu-blastall command converts NCBI Blast parameters into their roughly equivalent WU-Blast parameters. Note that sensitivity is inversely related to speed. (More about WU-Blast and NCBI Blast parameters)
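To put the timings table in terms of speed-up, the two time columns can be converted to seconds and divided. This is only a worked-arithmetic sketch: it assumes the table entries are in mm:ss (or h:mm:ss for the tblastn row), and the helper names are illustrative.

```python
def to_seconds(clock: str) -> int:
    """Convert 'mm:ss' or 'h:mm:ss' into a number of seconds."""
    seconds = 0
    for part in clock.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def speedup(default_time: str, ncbi_time: str) -> float:
    """How many times faster the NCBI-parameter run was than the default run."""
    return to_seconds(default_time) / to_seconds(ncbi_time)


# (program, database, default-parameter time, NCBI-parameter time), copied from the table.
BENCHMARKS = [
    ("blastn", "nt", "35:20", "21:25"),
    ("blastx", "nr", "53:02", "10:21"),
    ("blastn", "est_human", "10:38", "5:29"),
    ("blastn", "human genome", "9:26", "7:46"),
    ("blastp", "nr", "66:02", "15:31"),
    ("tblastn", "nt", "10:37:18", "3:40:11"),
]

if __name__ == "__main__":
    for program, database, default_time, ncbi_time in BENCHMARKS:
        ratio = speedup(default_time, ncbi_time)
        print(f"{program:8s} {database:13s} {ratio:.1f}x faster with NCBI parameters")
```

The ratios only describe these particular benchmark runs; they say nothing about the difference in sensitivity between the two parameter sets.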
WU-Blast on Biowulf works by dividing the database among the nodes, and running WU-Blast with all the query sequences against each piece of the database. At the end of the run, a merge program puts the pieces together. (In contrast, NCBI Blast on Biowulf is parallelized by job, where individual query sequences are sent to different nodes. This means that the entire Blast database has to be read by each node, which can cause slowdowns for large databases and/or large numbers of nodes.)

EasyWUBlast will set up the temporary directories and files, and allocate nodes for your job. If you want more control, you can bypass EasyWUBlast and use the underlying scripts directly, via the instructions below.
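The divide-search-merge scheme can be illustrated with a toy model. The sketch below is not the Biowulf implementation and does not run WU-Blast: the word-count "score", the round-robin split and all function names are assumptions made purely to show that searching every query against each database piece and then merging gives the same hit lists as one search of the whole database.

```python
from typing import Dict, List, Tuple

Hit = Tuple[str, str, int]  # (query name, subject name, toy score)


def split_database(subjects: Dict[str, str], pieces: int) -> List[Dict[str, str]]:
    """Deal the database entries round-robin into `pieces` smaller databases."""
    parts: List[Dict[str, str]] = [{} for _ in range(pieces)]
    for index, (name, sequence) in enumerate(subjects.items()):
        parts[index % pieces][name] = sequence
    return parts


def toy_score(query: str, subject: str, word: int = 4) -> int:
    """Count the distinct length-`word` substrings of the query found in the subject."""
    words = {query[i:i + word] for i in range(len(query) - word + 1)}
    return sum(1 for w in words if w in subject)


def search_piece(queries: Dict[str, str], piece: Dict[str, str]) -> List[Hit]:
    """One 'node': every query is compared against one piece of the database."""
    hits: List[Hit] = []
    for q_name, q_seq in queries.items():
        for s_name, s_seq in piece.items():
            score = toy_score(q_seq, s_seq)
            if score > 0:
                hits.append((q_name, s_name, score))
    return hits


def merge(per_node_hits: List[List[Hit]]) -> Dict[str, List[Tuple[str, int]]]:
    """Combine the per-node hit lists into one best-first list per query."""
    merged: Dict[str, List[Tuple[str, int]]] = {}
    for hits in per_node_hits:
        for q_name, s_name, score in hits:
            merged.setdefault(q_name, []).append((s_name, score))
    for q_name in merged:
        merged[q_name].sort(key=lambda hit: (-hit[1], hit[0]))
    return merged


if __name__ == "__main__":
    queries = {"q1": "ACGTACGT", "q2": "GGGG"}
    database = {"s1": "TTACGTACGTTT", "s2": "GGGGGGGG", "s3": "ACGTTTTT"}
    pieces = split_database(database, 2)
    print(merge([search_piece(queries, piece) for piece in pieces]))
```

In this model each node only ever holds its own piece of the database, while the merge step has to handle the hits from every node, which grows with the node count.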
- Query sequences: If your query sequences are in individual files, put them all into one directory or concatenate them into a single file. Note that the /home/username directory is small, and intended for std.error/output files. Use the /data/username directory for your WU-Blast sequences, tmp files, and output. Every Biowulf user has a directory in /data, and this directory is accessible from either Biowulf or any of the Helix SGIs. The tmp and output directories will be created if they do not already exist. Note that any previous files in the output directory will be overwritten.
- Edit/create a file which tells Blast where the input/output files are,
and what WU-Blast parameters to use.
This environment file can be in any directory, but it can help, organizationally, to keep it in your main directory for this run, e.g. /data/username/sample1/env

Sample environment file named /data/username/blast/env
setenv DB /fdb/wublastdb/nt
setenv PROG blastn
setenv INDIR /data/user/sample/query.fasta
setenv OUTDIR /data/user/sample/out/
setenv TMPDIR /data/user/sample/tmp
setenv PARAMS "-B 10 -V 10 -hspmax 500"
where
$DB -- Database to act on (e.g. /fdb/wublastdb/nt). See WU-Blast db status for a list of available databases.
$PROG -- WU-Blast program to run (blastn, blastp, blastx, tblastn, tblastx)
$INDIR -- directory or file containing all the query sequences. Each sequence file may contain single or multiple sequences.
$OUTDIR -- directory which will receive all the output files
$TMPDIR -- temporary directory used by the program.
$PARAMS -- Parameters for the WU-Blast program, in quotes (e.g. "-B 10 -V 10 -hspmax 500")
Other optional environment variables can also be added in this file, as described in the next section.
(List of all available Blast parameters)

Note: Existing files in $TMPDIR and $OUTDIR may be overwritten by this job. Use a different TMPDIR and OUTDIR if you want to preserve them.
- Submit the job to the batch system
Submit the job to the batch system using the 'qsub' command. Example:
qsub -v np=8,read=/DIRNAME/env -l nodes=8 /usr/local/wublast/nih/runwublast
This job uses the environment file /DIRNAME/env, asks for 8 nodes, and will use 2 processors per node (WU-Blast default). WU-Blast will divide the database into 8 pieces, so the node memory requirement will typically be irrelevant. For example, the nt database is 3.04 Gb, so the program will require 3.04/8 = 0.38 Gb (about 380 Mb) on each node. Biowulf nodes have a minimum of 1 Gb memory.

Another example:
qsub -v np=10,read=/DIRNAME/env1 -l nodes=10 /usr/local/wublast/nih/runwublast
This job is using the blast environment file /DIRNAME/env1, and has asked for 10 nodes. The batch system will allocate the fastest nodes that are available.

You can monitor any jobs you submit; see Monitoring your jobs in the Biowulf User Guide.
- Dos and Don'ts:
- Unlike with NCBI Blast, memory probably won't be a limiting factor. So don't specify a minimum node memory when submitting jobs.
- Don't submit a WU-Blast job to a large number of nodes unless you are sure you know what you are doing. The last step in the procedure merges the output from all the nodes, and takes up more cpu and memory as the number of nodes increases. Thus, if you submit to more nodes, your job may end up taking longer. In our experience, about 10 nodes is ideal.
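The submission step can be sketched in a few lines: one helper divides the database size by the node count, as in the memory estimate for the nt database, and another assembles a qsub command of the same form as the examples in the submission step. The function names are illustrative; the runwublast path and the `np`/`read` variables are taken from the text, and the 3.04 Gb figure is the one quoted there.

```python
from typing import List

RUNWUBLAST = "/usr/local/wublast/nih/runwublast"  # path as given in the text


def per_node_db_share_gb(db_size_gb: float, nodes: int) -> float:
    """Database size divided evenly across the nodes, in Gb."""
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    return db_size_gb / nodes


def qsub_command(env_file: str, nodes: int) -> List[str]:
    """Build the qsub argument list, with np equal to the number of nodes requested."""
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    return [
        "qsub",
        "-v", f"np={nodes},read={env_file}",
        "-l", f"nodes={nodes}",
        RUNWUBLAST,
    ]


if __name__ == "__main__":
    nodes = 8
    share = per_node_db_share_gb(3.04, nodes)  # nt database size quoted in the text
    print(f"{share:.2f} Gb of database per node")
    print(" ".join(qsub_command("/DIRNAME/env", nodes)))
```

Printing the command rather than running it keeps the sketch safe to try outside the cluster; on Biowulf the list could be passed to `subprocess.run`.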
WU-Blast environment variables
The WU-Blast environment variables are set as follows:
WUBLASTMAT - /usr/local/wublast/matrix
WUBLASTFILTER - /usr/local/wublast/filter
WUBLASTDB - /fdb/wublastdb

If you would like to change these values, add them to your parameter file, e.g.
setenv DB /fdb/wublastdb/nt
setenv PROG blastn
setenv INDIR /data/user/sample/query.fasta
setenv OUTDIR /data/user/sample/out/
setenv TMPDIR /data/user/sample/tmp
setenv PARAMS "-B 10 -V 10 -hspmax 500"
setenv WUBLASTDB /data/user/wublast/mydb
setenv WUBLASTFILTER /data/user/wublast/filter
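If you generate many runs, the parameter file can be written by a small script instead of by hand. The following is a minimal sketch, assuming the file consists only of `setenv NAME value` lines as in the samples; `render_env` and its arguments are hypothetical names, and the list of required variables is taken from the variable descriptions in this document.

```python
from typing import Dict, Optional

REQUIRED = ("DB", "PROG", "INDIR", "OUTDIR", "TMPDIR", "PARAMS")


def render_env(settings: Dict[str, str],
               extras: Optional[Dict[str, str]] = None) -> str:
    """Return the text of a parameter file made of `setenv NAME value` lines.

    `settings` must hold the six variables described in this document;
    `extras` may hold optional ones such as WUBLASTDB or WUBLASTFILTER.
    """
    missing = [name for name in REQUIRED if name not in settings]
    if missing:
        raise ValueError("missing variables: " + ", ".join(missing))
    lines = []
    for name, value in {**settings, **(extras or {})}.items():
        # Quote values that contain whitespace, as PARAMS is quoted in the samples.
        text = f'"{value}"' if any(ch.isspace() for ch in value) else value
        lines.append(f"setenv {name} {text}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    sample = {
        "DB": "/fdb/wublastdb/nt",
        "PROG": "blastn",
        "INDIR": "/data/user/sample/query.fasta",
        "OUTDIR": "/data/user/sample/out/",
        "TMPDIR": "/data/user/sample/tmp",
        "PARAMS": "-B 10 -V 10 -hspmax 500",
    }
    print(render_env(sample, {"WUBLASTDB": "/data/user/wublast/mydb"}), end="")
```

The returned text can be saved with `pathlib.Path(...).write_text(...)` and its path passed as `read=` when submitting.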
Debugging
By default, the temporary files (/data/user/wublast_par, /data/user/wublast_tmp/*, and the outputs from each node) will be deleted after the run. The only remaining file will be 'output.wublast', which contains the results from all nodes. Your query sequences, of course, will also remain.

If you wish to retain all the temporary files, add the environment variable SAVE_ALL to your parameter file, e.g.
setenv DB /fdb/wublastdb/nt
setenv PROG blastn
setenv INDIR /data/user/sample/query.fasta
setenv OUTDIR /data/user/sample/out/
setenv TMPDIR /data/user/sample/tmp
setenv PARAMS "-B 10 -V 10 -hspmax 500"
setenv SAVE_ALL 1
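Before submitting, it can be useful to check a parameter file for missing variables and to see whether it asks for temporary files to be kept. The sketch below is one way to do that, under the assumptions that the file contains only simple `setenv NAME value` lines and that the mere presence of SAVE_ALL is what matters; the function names are hypothetical and the check is not part of the Biowulf scripts.

```python
import shlex
from typing import Dict, List

REQUIRED = ("DB", "PROG", "INDIR", "OUTDIR", "TMPDIR", "PARAMS")
PROGRAMS = ("blastn", "blastp", "blastx", "tblastn", "tblastx")


def parse_env(text: str) -> Dict[str, str]:
    """Collect NAME -> value from lines of the form: setenv NAME value."""
    variables: Dict[str, str] = {}
    for line in text.splitlines():
        # shlex handles the quoted PARAMS value and drops '#' comments.
        tokens = shlex.split(line, comments=True)
        if len(tokens) >= 2 and tokens[0] == "setenv":
            variables[tokens[1]] = tokens[2] if len(tokens) > 2 else ""
    return variables


def check_env(variables: Dict[str, str]) -> List[str]:
    """Return a list of problems; an empty list means nothing obvious is wrong."""
    problems = [f"missing {name}" for name in REQUIRED if name not in variables]
    prog = variables.get("PROG")
    if prog is not None and prog not in PROGRAMS:
        problems.append(f"unknown PROG {prog!r}")
    return problems


def keeps_temporary_files(variables: Dict[str, str]) -> bool:
    """True if SAVE_ALL is set (assumption: its presence is what counts)."""
    return "SAVE_ALL" in variables
```

A typical use would be `check_env(parse_env(open("/data/username/run1/env").read()))`, submitting only if the returned list is empty.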
Programs/Scripts/Files involved:
These scripts can be copied from /usr/local/wublast/nih and modified if desired, although this should not be necessary.
- easywublast -- the easy interface described first in this document
- runwublast -- Batch(PBS) submission file which sets up the mpi wrapper (multirun) for 2 perl scripts
- blastdist -- sets up tmp files for query sequences
- mpiwublast -- main execution of wublast...processes all query sequences assigned to a given node
- env -- file which contains appropriate variables and parameters for blast execution
- multirun -- MPI wrapper program which allows for different behavior on different nodes
- cleanup -- does the merging
More Examples
- DNA sequences against the nucleotide non-redundant database
(nt)
Note that 'username' in the examples below should be replaced by your own username!
biowulf$ mkdir /data/username/run1      # make main directory for this run
biowulf$ cd /data/username/run1         # go to this directory
biowulf$ mkdir input output tmp         # make subdirectories for this run
biowulf$ cd input                       # go to the 'input' subdirectory
biowulf$ cp /home/username/*.seq .      # copy sequence files into this subdir
Create a WU-Blast environment file in /data/username/run1/env. The file contains:
setenv DB /fdb/wublastdb/nt
setenv PROG blastn
setenv INDIR /data/username/run1/input
setenv OUTDIR /data/username/run1/output
setenv TMPDIR /data/username/run1/tmp
setenv PARAMS "-hspmax 500 -warnings"
Submit the job with the command:
qsub -v np=8,read=/data/username/run1/env -l nodes=8 \
    /usr/local/wublast/nih/runwublast
- A file of 1000 query sequences against the protein non-redundant
database (nr)
biowulf% cat /data/username/*.seq > /data/username/wb_run5/myseqs
Create a new environment file /data/username/run1/env2 which contains:
setenv DB /fdb/wublastdb/nr
setenv PROG blastx
setenv INDIR /data/username/wb_run5/myseqs
setenv OUTDIR /data/username/run1/out2
setenv TMPDIR /data/username/run1/tmp2
setenv PARAMS "-hspmax 500 -B 10 -V 10"
Submit the job using:
qsub -v np=10,read=/data/username/run1/env2 -l nodes=10 \
    /usr/local/wublast/nih/runwublast
Wash U. WU-Blast documentation
Typing the program name will give you a summary of most available parameters and will report the version of WU-Blast being used. See also the Command-line options discussed at the WU-Blast site.
biowulf% /usr/local/wublast/i686/blastn

BLASTN 2.0MP-WashU [26-Oct-2004] [linux24-i686-ILP32F64 2004-10-26T20:25:25]

Copyright (C) 1996-2004 Washington University, Saint Louis, Missouri USA.
All Rights Reserved.

Reference: Gish, W. (1996-2004) http://blast.wustl.edu

Notice: this program and its default parameter settings are optimized to find nearly identical sequences rapidly. To identify weak protein similarities encoded in nucleic acid, use BLASTX, TBLASTN or TBLASTX.

Usage: BLASTN database queryfile [options]

Valid BLASTN options: E, S, E2, S2, W, T, X, M, N, Y, Z, L, K, H, V, B
(described at Wu-Blast command-line parameters)

-matrix <matrix-name>  use the specified scoring matrix (default matrix is computed from M=+5 N=-4); be sure to consider changing the default gap penalties when using a non-default scoring system
-Q <s>  penalty score for a gap of length 1
-R <s>  penalty score for extending a gap by each letter after the first
-kap  use Karlin-Altschul statistics on individual alignment scores
-sump  use Karlin-Altschul "Sum" statistics*
-poissonp  use Poisson statistics to evaluate multiple HSPs
-top  search only the top strand of the query
-bottom  search only the bottom strand of the query
-filter <method>  hard mask the query using the specified method (e.g., "seg", "xnu", "ccp", "dust" or "none")
-wordmask <method>  soft mask the query using the specified method (see -filter)
-maskextra <n>  extend soft masking additional distance <n> into flanking regions
-lcfilter  hard mask lower case letters in the query sequence
-lcmask  soft mask lower case letters in the query sequence
-echofilter  display the query, after any/all masks have been applied
-hitdist <n>  max. distance between word hits for 2-hit BLAST (default 0)
-wink <n>  generate neighborhood words every <n>-th position (default 1)
-stats  collect word-hit statistics (consumes marginally more cpu time)
-ctxfactor <f>  base statistics on this number of independent contexts or reading frames
-nogap  turn off gapped alignment method, reporting only ungapped HSPs
-wstrict  impose strict requirement for word hits in ungapped alignments
-gapall  perform gapped alignment procedure on all ungapped HSPs*
-gapE <e>  expectation threshold of sets of ungapped HSPs for subsequent use in seeding gapped alignments (default gapall)
-gapE2 <e>  expectation threshold for saving individual gapped alignments
-gapW <n>  full band width for gapped alignment procedure
-gapX <s>  drop-off score for gapped alignment procedure
-pingpong  perform extra processing to help ensure a locally optimal alignment (rarely useful)
-nosegs  do not segment the query sequence on hyphen (-) characters
-olf <f>  max. fractional length of overlap for HSP consistency
-golf <f>  max. fractional length overlap for GSP consistency
-olmax <n>  max. absolute length of overlap for HSP consistency (default unlimited)
-golmax <n>  max. absolute length of overlap for GSP consistency (default unlimited)
-gapdecayrate <f>  characteristic parameter of geometric weights (default 0.5)
-span2  discard HSPs spanned on both query and subject by a better HSP*
-span1  discard HSPs spanned on query, subject or both by a better HSP
-span  do not discard HSPs spanned by other, better HSPs
-prune  do not prune insignificant HSPs from the output lists
-consistency  turn off HSP consistency rules for statistics
-links  display consistent link information for each alignment
-topcomboN <n>  report this number of consistent (colinear) groups of HSPs
-topcomboE <e>  only show HSP combos within this factor of the best combo
-sumstatsmethod <n>  specify an alternate use of Sum statistics
-hspsepqmax <n>  max. separation allowed between HSPs along query
-hspsepsmax <n>  max. separation allowed between HSPs along subject
-altscore "qc,sc,score"  qc and sc may be letters or "all"; score may be numeric, "min", "max", or "na" (not allowed)
-altscore "none"  clears any previous altscore specifications
-hspmax <n>  max. number of ungapped HSPs saved per subject sequence (default 1000; 0 => unlimited)
-gspmax <n>  max. number of gapped HSPs (GSPs) saved per subject sequence (default 0; 0 => unlimited)
-spoutmax <n>  max. number of segment pairs reported in the output per subject sequence (default 0; 0 => unlimited)
-qoffset <i>  adjust query sequence coordinate numbers by this amount
-soffset <i>  adjust subject sequence coordinate numbers by this amount
-nwstart <n>  start generating neighborhood words here in query (default 1)
-nwlen <n>  generate neighborhood words over this distance from nwstart in query
-qrecmin <n>  starting multi-query file record number to search
-qrecmax <n>  ending multi-query file record number to search
-dbrecmin <n>  starting database record number to search
-dbrecmax <n>  ending database record number to search
-ucdb  search nucleotide sequence database in uncompressed form
-vdbdescmax <n>  limit depth of recursion to <n> in describing virtual databases (default 1)
-dbchunks <n>  no. of logical chunks of the database to assign to threads
-dbslice <m>/<n>  search slice <m> out of a database sliced <n> ways
-dbslice <a>-<b>/<n>  search slices <a> through <b> (inclusive) out of a database sliced <n> ways
-gi  display gi identifiers, when available
-noseqs  do not display sequence alignments -- abbreviated output
-qtype  exit non-zero if query seems to be of wrong type
-qres  exit non-zero if query contains an invalid residue code

Multiple sort options can be specified and are applied in the user-specified order.
-sort_by_pvalue  list subjects in decreasing P-value order*
-sort_by_count  list subjects by the number of HSPs
-sort_by_highscore  list subjects by highest HSP score
-sort_by_totalscore  list subjects by the sum total of HSP scores
-sort_by_subjectlength  list subjects with longer sequences first

-cpus <n>  no. of processors to utilize on multi-processor systems
-mmio  do not use memory-mapped I/O (usually slower)
-nonnegok  make all non-negative expected scores a non-FATAL error
-novalidctxok  make no valid contexts a non-FATAL error
-shortqueryok  make queries shorter than the word length a non-FATAL error
-notes  suppress informatory messages
-warnings  suppress warning messages
-errors  suppress non-fatal error messages (strongly discouraged)
-putenv "NAME=VALUE"  set environment variable NAME to the specified VALUE
-endputenv  ignore any subsequent putenv options on the command line
-getenv NAME  display the value of the environment variable NAME
-endgetenv  ignore any subsequent getenv options on the command line
-compat1.4  revert to BLAST version 1.4 behavior (with bug fixes)
-compat1.3  revert to BLAST version 1.3 behavior (with bug fixes)
-haltonfatal  halt multi-query execution on occurrence of first FATAL
-globalexit  append EXIT CODE 12 to output if any multi-query was fatal
-abortonerror  abort (and possibly dump core) on a non-fatal error
-abortonfatal  abort (and possibly dump core) on a fatal error
-progress <n>  report progress of search at least this often (in seconds)
-o fname  write output to file named "fname", instead of stdout

*Default program behavior