Blat on Biowulf

Blat (not Blast!) on Biowulf

BLAT is a DNA/Protein Sequence Analysis program written by Jim Kent at UCSC. It is designed to quickly find sequences of 95% and greater similarity of length 40 bases or more. It may miss more divergent or shorter sequence alignments. It will find perfect sequence matches of 33 bases, and sometimes find them down to 22 bases. BLAT on proteins finds sequences of 80% and greater similarity of length 20 amino acids or more. In practice DNA BLAT works well on primates, and protein blat on land vertebrates. For more information see the BLAT web page or Jim Kent's web page.

EasyBlat

The 'easyblat' script simplifies running large BLAT jobs. You need to put all your query sequences into a directory, and then type 'easyblat' at the Biowulf prompt. You will be prompted for all required parameters. The script will then decide what kind of node you need (based on the database you choose) and submit your job to as many nodes as are available (max 24).

Sample session (user input follows each prompt):

biowulf% easyblat

EasyBLAT: BLAT (not Blast!) for large numbers of sequences

Enter the directory which contains your input sequences: /data/username/blast/ssqs
** ERROR: Input sequence directory /data/username/blast/ssqs does not exist
Enter the directory which contains your input sequences: /data/username/blat/seqs

Enter the directory where you want your BLAT output to go: /data/username/blat/out
** WARNING: There are already files in /data/username/blat/out which will be overwritten 
   by this job.
** Continue? (y/n) : y

The following databases are available:
  H - Human Genome (Apr 2006) assembly
  M - Mouse Genome (Jul 2007) assembly
  O - Other databases
  Enter H, M or O for a detailed list: h
      
Human Genome (Build 35, May 2004) assembly:
chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11
chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20,
chr21, chr22, chrX, chrY,
chr1-9, chr10-Y
Enter human section to run against: chr10-Y

http://biowulf.nih.gov/blat.html has a full list of available parameters.
Any additional BLAT parameters (e.g. -maxGap=3):

Submitting to 24 nodes with m1028 memory. Job number is 87930.biobos

Monitor your job at http://biowulf.nih.gov/cgi-bin/queuemon?87930.biobos

As you see above, easyblat does some simple error checking, such as checking whether your query sequences exist. It will figure out the node memory required, set up all temporary files and directories, and submit the job for you.
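
If you bypass easyblat, you may want similar checks before submitting a job yourself. The sketch below is not the easyblat source; it is a minimal illustration of the checks described above, and the function name `check_blat_dirs` is hypothetical.

```sh
#!/bin/sh
# Hypothetical pre-flight check, loosely modelled on the checks that
# easyblat is described as doing. This is NOT the easyblat source.
# Usage: check_blat_dirs INDIR OUTDIR
check_blat_dirs() {
    indir=$1
    outdir=$2

    # The input directory must exist ...
    if [ ! -d "$indir" ]; then
        echo "ERROR: input sequence directory $indir does not exist" >&2
        return 1
    fi

    # ... and must contain at least one file.
    if [ -z "$(ls -A "$indir")" ]; then
        echo "ERROR: no query files found in $indir" >&2
        return 1
    fi

    # Existing output files may be overwritten, so warn about them.
    if [ -d "$outdir" ] && [ -n "$(ls -A "$outdir")" ]; then
        echo "WARNING: files in $outdir may be overwritten" >&2
    fi
    return 0
}
```

For example, `check_blat_dirs /data/username/blat/seqs /data/username/blat/out` returns a non-zero status if the query directory is missing or empty.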

You can run against your own database (any fasta format file) by selecting 'other databases', and then entering the full pathname of the database you want to search. For example:

The following databases are available:
  H - Human Genome (Apr 2006) assembly 
  M - Mouse Genome (Jul 2007) assembly 
  O - Other databases
Enter H, M or O for a detailed list: O
Other databases, updated weekly:
    pdb - from the PDB 3-dimensional structures
    drosoph - Drosophila sequences
    ecoli - E. Coli sequences
    mito - mitochondrial sequences
    yeast - Yeast sequences

If using your own database, enter the full pathname.
Enter db to run against: /data/user/my_db.fas

Detailed procedure

If you want more control over the number of nodes, you can bypass EasyBlat and use the underlying scripts directly, following the instructions below:

Set up a directory structure for input/output and tmp files.
For management purposes it is best (but not necessary) to keep all the input, output and temporary directories within a given subdirectory. Note that the /home/username directory is small, and intended for std.error/output files. Use the /data/username directory for your query sequences, tmp files, and output. Every Biowulf user has a directory in /data, and this directory is accessible from either Biowulf or any of the Helix SGIs. The tmp and output directories will be created if they do not already exist.
Edit/create a file which tells Blat where the input/output files are, and what Blat parameters to use.
This environment file can be in any directory, but it can help, organizationally, to keep it in your main directory for this run, e.g. /data/username/sample1/env
Sample environment file:
```
setenv DB /fdb/blastdb/genome/mouse-feb2003/chr11-X.fa
setenv INDIR /data/user/blast/seqs
setenv OUTDIR /data/user/blat_out/
setenv TMPDIR /data/user/blat_tmp/
setenv PARAMS ""
```
where:
$DB -- target database. You can enter your own fasta-format target file here.
$INDIR -- directory containing all the input sequence files (may contain single or multiple sequences)
$OUTDIR -- directory which will receive all the individual output files. Note that existing files in this directory may be overwritten by the new run.
$TMPDIR -- temporary directory used by the program
$PARAMS -- optional parameters, in quotes, to pass to BLAT (e.g. "-maxGap=3"). All options are listed in the BLAT documentation.
Submit the job to the batch system.
Submit the job to the batch system using the 'qsub' command. Example:
```
 
qsub -v np=16,read=/data/username/sample1/env -l nodes=8:m1024 \
/usr/local/blat/nih/runblat
```
This job has asked for 16 processors (np=16), is using the environment file /data/username/sample1/env, is using 2 processors per node (nodes=8, processors=16), and has asked for m1024 nodes (1024 MB memory).
Another example:
```
qsub -v np=8,read=/data/username/sample2/env -l nodes=8 /usr/local/blat/nih/runblat
```
This job has asked for 8 processors (np=8), is using the environment file /data/username/sample2/env, and has asked for 8 nodes, so it runs 1 process per node. No specific node type has been requested (i.e. no 'm2048' or 'm1024' property on the nodes), so the batch system will allocate the first nodes that are available. The command 'shnodes' lists all the nodes on the system and their various properties.
It is an excellent idea to monitor any jobs you submit, using the Biowulf job monitor or user monitor. More information about the monitors is in the User Guide.
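
The procedure above can be scripted. The following is a minimal sketch, assuming the environment file format and the runblat path shown above; the function names `write_blat_env` and `blat_qsub_cmd` are hypothetical helpers, not part of the Biowulf installation. The second function only prints the qsub command, so you can inspect it before running it.

```sh
#!/bin/sh
# Hypothetical helpers; names and argument order are illustrative only.

# write_blat_env ENVFILE DB INDIR OUTDIR TMPDIR [PARAMS]
# Writes a csh-style environment file in the format shown above.
write_blat_env() {
    envfile=$1; db=$2; indir=$3; outdir=$4; tmpdir=$5; params=${6:-}
    # Creating these directories up front is optional, since they are
    # created for you if they do not already exist.
    mkdir -p "$outdir" "$tmpdir" || return 1
    cat > "$envfile" <<EOF
setenv DB $db
setenv INDIR $indir
setenv OUTDIR $outdir
setenv TMPDIR $tmpdir
setenv PARAMS "$params"
EOF
}

# blat_qsub_cmd ENVFILE NODES PROCS_PER_NODE [NODE_PROPERTY]
# Prints (does not run) a qsub command with np = nodes x processes per node.
blat_qsub_cmd() {
    envfile=$1; nodes=$2; ppn=$3; prop=${4:-}
    np=$((nodes * ppn))
    spec="nodes=$nodes"
    if [ -n "$prop" ]; then
        spec="$spec:$prop"
    fi
    echo "qsub -v np=$np,read=$envfile -l $spec /usr/local/blat/nih/runblat"
}
```

Calling `blat_qsub_cmd /data/username/sample1/env 8 2 m1024` prints the first qsub example above (on one line), and `blat_qsub_cmd /data/username/sample2/env 8 1` prints the second.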

More Examples:

DNA sequences against chr 8 from the Aug 2003 human genome build. Chromosome 8 is 149 MB. (You can see the size of any chromosome file by typing 'ls -l' in the directory.) Two copies of this file will therefore easily fit in memory on any node in the system, so no memory requirements need be specified.

Command :

qsub -v np=16,read=/home/user/env -l nodes=8 /usr/local/blat/nih/runblat

File "env" contains -----------------------
setenv DB /fdb/blastdb/genome/ucsc-aug01/chr8.fa
setenv INDIR  /data1/user/blat/query100
setenv OUTDIR /data1/user/blat/out100
setenv TMPDIR  /data1/user/blat/tmp
setenv PARAMS " "
--------------------------------------------

Same input data against entire human genome

ls -l /fdb/blastdb/genome/ucsc-apr01/chr_all.fa

-rw-rw-r--    1 susanc   Seqdb    3131547699 Oct 10  2003 chr_all.fa

The database is 3.1 GB. No node in the system has enough memory to hold the entire database, so a run against it would be I/O bound and very inefficient. It is better to do two separate runs, one against chr1-9 and one against chr10-Y:

-rw-r--r--    1 susanc   Seqdb    1707475762 Mar 26 08:44 chr1-9.fa
-rw-r--r--    1 susanc   Seqdb    1424055028 Mar 26 08:46 chr10-Y.fa

These files are 1.7 GB and 1.4 GB respectively. One copy of either file will fit in the memory of a 2 GB node (m2048), so the job runs 1 process per node. Command :

qsub -v np=10,read=/home/user/env1 -l nodes=10:m2048 /usr/local/blat/nih/runblat

where file "env1" contains

setenv DB /fdb/blastdb/genome/chr1.fa
setenv PARAMS "-minIdentity=95 -nohead"
setenv INDIR  /data1/susanc/blat/query100
setenv OUTDIR /data1/susanc/blat/out2
setenv TMPDIR  /data1/susanc/blat/tmp2

Important Notes

The success of this method depends on the entire database fitting into the node memory. If you are running 2 processes per node, 2 copies of the database should fit into the node memory, with an additional 50 MB or so required for overhead. Watch the job the first time it runs (using the Memory Monitor or logging on to the node and running 'top') to make sure you don't have memory issues. Each Biowulf node has at least 1 GB of memory.
The individual chromosome files are on the order of 70-300 MB. It is most efficient to use 2 processors/node. If Biowulf is heavily loaded, your job may wait in the queue until enough of the appropriate nodes become available. In that case, you may be better off running with 1 processor/node on a smaller-memory node.
```
qsub -v np=16,read=/home/user/env -l nodes=16 /usr/local/blat/nih/runblat
```
(No memory was specified in this command because all nodes have at least 1 GB of memory). You can check the size of each chromosome by doing an 'ls -l' on the directory.
Some query sequences may take significantly longer than others. Thus it is not unusual to see one process continue to run while the other nodes have finished their jobs.
Note: Use the "shnodes" command to see a list of nodes with their properties and status (free, job-exclusive, offline, down).
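
The memory rule in the notes above can be checked with a little arithmetic before you submit. This is a rough sketch: the function name `fits_in_node` is hypothetical, and it assumes the 50 MB overhead is counted once per node, which the notes do not state precisely.

```sh
#!/bin/sh
# fits_in_node DB_BYTES NODE_MB PROCS_PER_NODE
# Succeeds (exit status 0) if PROCS_PER_NODE copies of the database plus
# roughly 50 MB of overhead fit in NODE_MB of memory.
# Assumption: the overhead is counted once per node, not once per process.
fits_in_node() {
    db_bytes=$1; node_mb=$2; ppn=$3
    overhead_mb=50
    # Round the database size up to whole MB (1 MB = 1048576 bytes).
    db_mb=$(( (db_bytes + 1048575) / 1048576 ))
    need_mb=$(( db_mb * ppn + overhead_mb ))
    [ "$need_mb" -le "$node_mb" ]
}
```

With the file sizes listed above, `fits_in_node 1707475762 2048 1` succeeds (chr1-9.fa, one process on an m2048 node), while `fits_in_node 1707475762 2048 2` fails, which is why that example runs one process per node. To use a real file, pass its size in bytes, e.g. `fits_in_node "$(wc -c < "$DB")" 1024 2`, where `$DB` stands for the path of your database file.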

BLAT Databases

Any fasta-format file can serve as a BLAT database. The full list of updated Fasta-format databases on the Helix Systems is available on the Helix Database page.
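
Before pointing BLAT at your own database, a quick sanity check of the file can save a wasted job. The sketch below is only a rough test, not a full FASTA validator, and the function names `looks_like_fasta` and `count_fasta_records` are hypothetical.

```sh
#!/bin/sh
# looks_like_fasta FILE
# Succeeds if FILE is readable and its first non-blank line starts with '>'.
looks_like_fasta() {
    [ -r "$1" ] || return 1
    # awk stops at the first non-blank line, so large files are not read fully.
    first=$(awk 'NF { print; exit }' "$1")
    case $first in
        '>'*) return 0 ;;
        *)    return 1 ;;
    esac
}

# count_fasta_records FILE
# Prints the number of header lines (lines starting with '>') in FILE.
count_fasta_records() {
    grep -c '^>' "$1"
}
```

For example, `looks_like_fasta /data/user/my_db.fas && count_fasta_records /data/user/my_db.fas` prints the number of sequences only if the file appears to be in fasta format.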

BLAT documentation

BLAT - The BLAST-Like Alignment Tool. W. James Kent, Genome Research 12(4): 656-664, April 2002.
BLAT Suite Program Specifications and User Guide, at the UCSC Genome website. All BLAT options are listed on this page.