BLAT is a DNA/Protein Sequence Analysis program written by Jim Kent at UCSC. It is designed to quickly find sequences of 95% and greater similarity of length 40 bases or more. It may miss more divergent or shorter sequence alignments. It will find perfect sequence matches of 33 bases, and sometimes find them down to 22 bases. BLAT on proteins finds sequences of 80% and greater similarity of length 20 amino acids or more. In practice DNA BLAT works well on primates, and protein blat on land vertebrates. For more information see the BLAT web page or Jim Kent's web page. |
Sample session: (user input is in bold):
biowulf% easyblat EasyBLAT: BLAT (not Blast!) for large numbers of sequences Enter the directory which contains your input sequences: /data/username/blast/ssqs ** ERROR: Input sequence directory /data/username/blast/ssqs does not exist Enter the directory which contains your input sequences: /data/username/blat/seqs Enter the directory where you want your BLAT output to go: /data/username/blat/out ** WARNING: There are already files in /data/username/blat/out which will be overwritten by this job. ** Continue? (y/n) : y The following databases are available: H - Human Genome (Apr 2006) assembly M - Mouse Genome (Jul 2007) assembly O - Other databases Enter H, M or O for a detailed list: h Human Genome (Build 35, May 2004) assembly: chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11 chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY, chr1-9, chr10-Y Enter human section to run against: chr10-Y http://biowulf.nih.gov/blat.html has a full list of available parameters. Any additional BLAT parameters (e.g. -maxGap=3): Submitting to 24 nodes with m1028 memory. Job number is 87930.biobos Monitor your job at http://biowulf.nih.gov/cgi-bin/queuemon?87930.biobos
As you see above, easyblat does some simple error checking, such as checking whether your query sequences exist. It will figure out the node memory required, set up all temporary files and directories, and submit the job for you.
You can run against your own database (any fasta format file) by selecting 'other databases', and then entering the full pathname of the database you want to search. For example:
The following databases are available: H - Human Genome (Apr 2006) assembly M - Mouse Genome (Jul 2007) assembly O - Other databases Enter H, M or O for a detailed list: O Other databases, updated weekly: pdb - from the PDB 3-dimensional structures drosoph - Drosophila sequences ecoli - E. Coli sequences mito - mitochondrial sequences yeast - Yeast sequences If using your own database, enter the full pathname. Enter db to run against: /data/user/my_db.fas
If you want more control over the number of nodes, you can bypass EasyBlat and use the underlying scripts directly, following the instructions below:
- Set up a directory structure for input/output and tmp files.
For management purposes it is best (but not necessary) to keep all the input, output and temporary directories within a given subdirectory. For management purposes it is best (but not necessary) to keep all the input, output and temporary directories within a given subdirectory. Note that the /home/username directory is small, and intended for std.error/output files. Use the /data/username directory for your Blast sequences, tmp files, and output. Every Biowulf user has a directory in /data, and this directory is accessible from either Biowulf or any of the Helix SGIs. The tmp and output directories will be created if they do not already exist. - Edit/create a file which tells Blat where the input/output files are,
and what Blat parameters to use.
This environment file can be in any directory, but it can help, organizationally, to keep it in your main directory for this run, e.g. /data/username/sample1/envSample environment file:
setenv DB /fdb/blastdb/genome/mouse-feb2003/chr11-X.fa setenv INDIR /data/user/blast/seqs setenv OUTDIR /data/user/blat_out/ setenv TMPDIR /data/user/blat_tmp/ setenv PARAMS ""
where:
$DB -- target database. You can enter your own fasta-format target file here.
$INDIR -- directory containing all the input sequence files (may contain single or multiple sequences)
$OUTDIR -- directory which will receive all the individual output files. Note that existing files in this directory may be overwritten by the new run.
$TMPDIR -- temporary directory used by the program
$PARAMS -- Optional parameters, in quotes, to pass to BLAT (e.g. "-maxGap=3" ) List of all options - Submit the job to the batch system.
Submit the job to the batch system using the 'qsub' command. Example:qsub -v np=16,read=/data/username/sample1/env -l nodes=8:m1024 \ /usr/local/blat/nih/runblat
This job has asked for 16 processors (np=16), is using the environment file /data/username/sample1/env, is using 2 processors per node (nodes=8, processors=16, has asked for m1024 nodes (1024 Mb memory)).Another example:
qsub -v np=8,read=/data/username/sample2/env -l nodes=8 /usr/local/blat/nih/runblat
This job has asked for 8 processors (np=8), is using the blast environment file /data/username/sample2/env, and has asked for 8 nodes. No specific nodes have been asked for (i.e. no 'm2048' or 'm1024' specification on the nodes), so the batch system will allocate the first nodes that are available. (The command 'shnodes' will list all the nodes on the system and their various properties).It is an excellent idea to monitor any jobs you submit, with the Biowulf job or user monitor as shown on the right. More information about the monitors is in the User Guide
More Examples:
- DNA sequences against chr 8 from the Aug 2003 human genome
build. Chromosome 8 is 149 Mb. (You can see the size of any chromosome
file by typing 'ls -l' in the directory'. Therefore, two copies of this file
will easily fit in node memory on any node in the system, so no memory
requirements need be specified.
Command : qsub -v np=16,read=/home/user/env -l nodes=8 /usr/local/blat/nih/runblat File "env" contains ----------------------- setenv DB /fdb/blastdb/genome/ucsc-aug01/chr8.fa setenv INDIR /data1/user/blat/query100 setenv OUTDIR /data1/user/blat/out100 setenv TMPDIR /data1/user/blat/tmp setenv PARAMS " " --------------------------------------------
- Same input data against entire human genome
ls -l /fdb/blastdb/genome/ucsc-apr01/chr1.fa -rw-rw-r-- 1 susanc Seqdb 3131547699 Oct 10 2003 chr_all.fa
The database is 3.1 Gb. There are no nodes in the system big enough to hold the entire database in memory. A run against this database will be I/O bound and very inefficient. Therefore, it is better to take two separate runs against chr1-9, and chr10-Y-rw-r--r-- 1 susanc Seqdb 1707475762 Mar 26 08:44 chr1-9.fa -rw-r--r-- 1 susanc Seqdb 1424055028 Mar 26 08:46 chr10-Y.fa
These files are 1.7 Gb and 1.4 Gb respectively. They will fit in the memory of a 2Gb node (m2048). Command :qsub -v np=10,read=/home/user/env1 -l nodes=10:m2048 /usr/local/blat/nih/runblat
where file "env1" containssetenv DB /fdb/blastdb/genome/chr1.fa setenv PARAMS "-minIdentity=95 -nohead" setenv INDIR /data1/susanc/blat/query100 setenv OUTDIR /data1/susanc/blat/out2 setenv TMPDIR /data1/susanc/blat/tmp2
- The success of this method depends on the entire database fitting into the node memory. If you are running 2 processes per node, 2 copies of the database should fit into the node memory, with an additional 50Mb or so required for overhead. Watch the job the first time it runs (using the Memory Monitor or logging on to the node and running 'top') to make sure you don't have memory issues. Each Biowulf node has at least 1 GB of memory.
- The individual chromosome files are on the order of 70 - 300 Mb. It is most
efficient to use 2 processors/node. If Biowulf is heavily loaded, your job may
be waiting in the queue until enough of the appropriate nodes become available.
In that case, you may be better off running with 1 processor/node on a
smaller-memory node.
qsub -v np=16,read=/home/user/env -l nodes=16 /usr/local/blat/nih/runblat
(No memory was specified in this command because all nodes have at least 1 GB of memory). You can check the size of each chromosome by doing an 'ls -l' on the directory. - Some query sequences may take significantly longer than others. Thus it is
not unusual to see one process continue to run while the other nodes have
finished their jobs.
Note: Use the "shnodes" command to see a list of nodes with their properties and status (free, job-exclusive, offline, down)
Any fasta-format file can serve as a BLAT database. The full list of updated Fasta-format databases on the Helix Systems is available on the Helix Database page.
BLAT - The Blast-Like Alignment Tool. W. James Kent, Genome Research 12(4): 656-664, April 2002 BLAT Suite Program Specifications and User Guide. at the UCSC Genome website. All BLAT options are listed on this page.