Biowulf at the NIH
High-throughput Fasta on Biowulf

FASTA on Biowulf is intended for running a large number of query sequences, such as hundreds or thousands, against the databases. If you have just a few query sequences, you should use FASTA on Helix. If you have questions about your FASTA jobs, please contact the Helix Systems staff at staff@helix.nih.gov or 4-6248.

The 'easyfasta' program on Biowulf simplifies submission of large Fasta jobs. Put all your query sequences into a directory, then type 'easyfasta' at the Biowulf prompt. You will be prompted for all required parameters. The script then decides what kind of nodes you need (based on the database you are searching against) and submits your job to as many nodes as are available (maximum 32).

Sample session (user input follows each prompt):
[user@biowulf user]$ easyfasta

---------------------------------------------------
EasyFasta: Fasta for large numbers of sequences
---------------------------------------------------

Enter the directory which contains your input sequences: /data/user/myseqs/1000_est

Enter the directory where you want your Fasta output to go: /data/user/myseqs/out
** WARNING: There are already files in /data/user/myseqs/out which will be deleted by this job.
** Continue? (y/n) :y

Are your query sequences nucleotides (y/n)? y

Fasta programs:
  fasta - nt seq against nt database or aa seq against aa database
  fastx - Translated DNA (with frameshifts, e.g. ESTs) vs Proteins
  fasty - similar to fastx but slower, more sensitive
  fasts - Un-ordered Nucleotides vs Nucleotide
  fastm - Ordered Nucleotides vs Nucleotide

Which program do you want to run: fastx

The following protein databases are available:
(or enter your own database with full pathname)

   nr                   Non-redundant GenBank CDS translations+PDB+SwissProt+PIR
   swissprot            SwissProt sequences
   pdb.aa               Sequences derived from Protein Data Bank
   yeast.aa             Saccharomyces cerevisiae protein sequences
   drosoph.aa           Genbank drosophila sequences
   ecoli.aa             Ecoli genomic CDS translations
   mito.aa              Mitochondrial protein sequences
   ref.human.protein    Refseq human protein sequences
   ref.mouse.protein    Refseq mouse protein sequences
   hs_genome.protein    Build 36, hg18 (April 2006) from the International Human Genome Consortium
   mouse_genome.protein Build 34, mm6, May 2005 from the Mouse Genome Consortium


Database to run against: pdb.aa
What is the ktup value (default 2)? 

http://helix.nih.gov/Applications/fasta3x.txt has a full list of available options.
Any additional Fasta options (e.g. -v 10): -v 5

Creating parameter file /data/user/fasta_tmp.27016/fasta_par.27016
Checking node situation....

272 are available for 1000 sequences 

qsub -v np=32,read=/data/user/fasta_tmp.27016/fasta_par.27016 -l nodes=32:m2048:x86-64 
   /usr/local/fasta/bin/easyrunfasta

Submitting to 32 nodes. Job number is 1606770.biobos

Monitor your job at http://biowulf.nih.gov/cgi-bin/usermonS?user

              
Easyfasta figures out the node memory required, sets up all temporary files and directories, and submits the job for you. To run against your own database, enter the database name with its full path at the 'Database to run against:' prompt, for example:
Database to run against: /data/username/fasta_db/my_db
If your database is not in Fasta format, see the Fasta Databases section below.

Splitting your query sequences

If your query sequences are all in one file, and you need to split them into multiple sequence files, there are a couple of utilities available:
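Independently of those utilities, the split itself is a simple operation. The following is a minimal Python sketch, not one of the Helix utilities: the function name `split_fasta` and the `seq_00001.seq` naming scheme are hypothetical, and it assumes every record in the input file starts with a `>` header line.

```python
import os


def split_fasta(multi_fasta_path, out_dir, suffix=".seq"):
    """Write each record of a multi-sequence FASTA file to its own file.

    Returns the list of files written, in input order. File names follow
    a hypothetical seq_00001.seq, seq_00002.seq, ... scheme.
    """
    os.makedirs(out_dir, exist_ok=True)
    written = []
    out = None
    with open(multi_fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                # A header line starts a new record: close the previous file.
                if out is not None:
                    out.close()
                name = "seq_%05d%s" % (len(written) + 1, suffix)
                path = os.path.join(out_dir, name)
                out = open(path, "w")
                written.append(path)
            if out is not None:
                # Any lines before the first header are skipped.
                out.write(line)
    if out is not None:
        out.close()
    return written
```

The output directory can then be given to easyfasta as the directory containing the input sequences.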

The following recent runs on our system give an idea of the performance to expect.

Query                           Database          Updated      Sequences   Size    Program  Nodes  Node type            Time
1000 nucleotide EST sequences   nt nucleotide     13 Jul 2004  2,283,112   2.8 Gb  fasta    10     4 Gb memory          18 hrs 51 min
1000 nucleotide EST sequences   month nucleotide  17 Jul 2004  52,474      90 Mb   fasta    10     p2800, 1 Gb memory   40 min
1000 nucleotide EST sequences   est               14 Jul 2004  21,925,146  3.9 Gb  fasta    10     4 Gb memory          50 hrs 47 min
1000 nucleotide EST sequences   human genome      12 Dec 2003  25          707 Mb  fasta    10     p2800, 1 Gb memory   4 hrs 44 min
1000 protein sequences          nt nucleotide     13 Jul 2004  2,283,112   2.8 Gb  tfastx   10     4 Gb memory          30 hrs 53 min
1000 protein sequences          nr protein        3 Feb 2004   1,934,002   700 Mb  fasta    10     p2800, 1 Gb memory   1 hr 59 min

All runs were 2 processors/node.

Fasta Databases

Fasta programs can accept databases in many different formats.

Users can search their own database or the ready-to-use databases on Helix, which are found in the directory /fdb/. Use one of the two methods below to define the $DB variable in the environment file:

1. To use a FASTA-FORMATTED database in ONE single file, the environment file should look like this:

setenv DB /fdb/fastadb/pdb.nt.fas
setenv KTUP 6
setenv PROG fasta_t
setenv INDIR /data/username/sample1/indir
setenv OUTDIR /data/username/sample1/outdir
setenv TMPDIR /data/username/sample1/temp1
setenv PARAMS "-b 10 -d 10"
                

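2. To use a database composed of several files or several databases, or a database in NON-FASTA format, first create a list file (here /data/user/dbfile) that gives each database file with its full path, followed by the number that identifies its library format (in this example 0 for a Fasta-format file and 12 for a Blast-format database):

/fdb/blastdb/nt.00 12
/fdb/blastdb/nt.01 12
/fdb/blastdb/nt.02 12
/data/username/mydb.fas 0
/fdb/embossdb/estnew 3

Then set DB in the environment file to the name of that list file, prefixed with '@':

setenv DB @/data/user/dbfile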
setenv KTUP 6
setenv PROG fasta_t
setenv INDIR /data/username/sample1/indir
setenv OUTDIR /data/username/sample1/outdir
setenv TMPDIR /data/username/sample1/temp1
setenv PARAMS "-b 10 -d 10"
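A list file of this kind can also be generated by a script. The sketch below is only an illustration in Python: the helper name `write_db_list` is hypothetical, and the format numbers are passed through unchecked, so they must be chosen by the caller as in the example above.

```python
def write_db_list(list_path, entries):
    """Write a database list file with one 'path format-number' pair per line.

    entries is an iterable of (database_path, format_number) tuples.
    Returns the value to use for DB in the environment file, i.e. the
    list file name prefixed with '@'.
    """
    with open(list_path, "w") as handle:
        for db_path, format_number in entries:
            handle.write("%s %d\n" % (db_path, format_number))
    return "@" + list_path
```

For example, `write_db_list("/data/user/dbfile", [("/fdb/blastdb/nt.00", 12), ("/data/username/mydb.fas", 0)])` writes two lines and returns `@/data/user/dbfile`.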

The Helix databases are updated frequently. Easyfasta is designed to use the Blast-format databases on the system. See Blast Database Update Status for the latest status and updates to these databases.

Easyfasta will select the number and type of nodes for you. If you want more control, you can bypass Easyfasta and use the underlying scripts directly, via the instructions below.

1. Set up a directory structure for input/output and tmp files.

For management purposes it is best (but not necessary) to keep all the input, output and temporary directories within a single subdirectory. Note that the /home/username directory is small and is intended for standard error/output files. Use the /data/username directory for your Fasta sequences, tmp files, and output. Every Biowulf user has a directory in /data, and this directory is accessible from Biowulf and from any of the Helix SGI machines. The tmp and output directories will be created if they do not already exist.

2. Edit/create an Environment file which tells Fasta where the input/output files are, and what Fasta parameters to use.

This environment file can be in any directory, but it can help, organizationally, to keep it in your main directory for this run, e.g. /data/username/sample1/env

Sample environment file:

setenv DB @/fdb/fastadb/fastanam/nt.nam
setenv KTUP 6
setenv PROG fasta_t
setenv INDIR /data/username/sample1/indir
setenv OUTDIR /data/username/sample1/outdir
setenv TMPDIR /data/username/sample1/temp1
setenv PARAMS "-b 10 -d 10"

where

$DB -- Database to act on (e.g. /fdb/fastadb/nr or your own database). See the Fasta Databases section above.
$KTUP -- ktup (word size) value for the search.
$PROG -- Fasta program to run (fasta_t, fastx_t, fastf_t, fasts_t, fasty_t, prss_t, ssearch_t, tfasta_t, tfastx_t, tfastf_t, tfasts_t, tfasty_t)
$INDIR -- directory containing all the input sequence files.
$OUTDIR -- directory which will receive all the individual output files.
$TMPDIR -- temporary directory used by the program.
$PARAMS -- Parameters for the Fasta program, in quotes (e.g. "-b 10 -v 10")
(The full list of available Fasta options is at http://helix.nih.gov/Applications/fasta3x.txt)
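When many runs are set up, the environment file can be generated rather than edited by hand. The following Python sketch is one possible way to do so; the function name `render_env` is hypothetical, and it assumes that only the variables described above are needed and that values containing spaces should be wrapped in double quotes, as PARAMS is in the sample file.

```python
# Variables written to the file, in the order used by the sample file.
ENV_KEYS = ("DB", "KTUP", "PROG", "INDIR", "OUTDIR", "TMPDIR", "PARAMS")


def render_env(settings):
    """Return the text of a csh-style environment file.

    settings maps variable names (see ENV_KEYS) to values. Variables
    missing from settings are left out; values containing spaces are
    double-quoted.
    """
    lines = []
    for key in ENV_KEYS:
        if key not in settings:
            continue
        value = str(settings[key])
        if " " in value:
            value = '"%s"' % value
        lines.append("setenv %s %s" % (key, value))
    return "\n".join(lines) + "\n"
```

The returned text can be written to a file such as /data/username/sample1/env and passed to the job through the read= variable described in the next step.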

3. Submit the job to the batch system

Submit the job to the batch system using the 'qsub' command.

Example 1:

qsub -v np=4,read=/data/username/easyfasta/env1 \
-l nodes=4:p2800:m2048 /usr/local/fasta/bin/easyrunfasta

This job is using the environment file /data/username/easyfasta/env1, will use 2 processors per node (because of the -T 2 flag given as a default parameter in Fasta programs) and has asked for 'm2048' nodes (memory = 2048 Mb). The 'shnodes' command will list all the Biowulf nodes and their properties. The value of 'np' should always be equal to the number of nodes requested.

Example 2:

qsub -v np=4,read=/data/username/easyfasta/env1 \
-l nodes=4 /usr/local/fasta/bin/easyrunfasta

This job is using the fasta environment file /data/username/easyfasta/env1, and has asked for 4 nodes. No specific nodes have been asked for (i.e. no 'm####' specification on the nodes), so the batch system will allocate the first nodes that are available. (The command 'shnodes' will list all the nodes on the system and their various properties). The value of 'np' should always be equal to the number of nodes requested.
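Because 'np' must always match the number of nodes requested, it can help to build the qsub command from a single node count. The Python sketch below does this using only the options shown in the two examples; the function name `qsub_command` is hypothetical, and the node properties (such as p2800 or m2048) are passed through without being checked against 'shnodes'.

```python
EASYRUNFASTA = "/usr/local/fasta/bin/easyrunfasta"


def qsub_command(env_file, nodes, node_properties=()):
    """Build the qsub argument list for an easyrunfasta job.

    The same node count is used for both np= and nodes=, so the two
    values cannot get out of step.
    """
    if nodes < 1:
        raise ValueError("at least one node is required")
    node_spec = ":".join(["nodes=%d" % nodes] + list(node_properties))
    return [
        "qsub",
        "-v", "np=%d,read=%s" % (nodes, env_file),
        "-l", node_spec,
        EASYRUNFASTA,
    ]
```

Joining the returned list with spaces gives the command line to type at the Biowulf prompt; for instance, `qsub_command("/data/username/easyfasta/env1", 4, ["p2800", "m2048"])` reproduces Example 1.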

Node memory:

When analyzing a large number of sequences with fasta, it is imperative that the database fit entirely within the memory of a given node. This makes a vast difference in the performance of fasta.

For example, in the environment file above, the database line reads:

setenv DB @/fdb/fastadb/fastanam/nt.nam

The nt.nam file contains the full paths of 3 database files:

biowulf 75% more /fdb/fastadb/fastanam/nt.nam
/fdb/blastdb/nt.00 12
/fdb/blastdb/nt.01 12
/fdb/blastdb/nt.02 12

To decide how much node memory is required, add up the sizes of the database files LISTED IN /fdb/fastadb/fastanam/nt.nam, NOT the size of the nt.nam file itself. List them with:

Helix 100% ls -l /fdb/blastdb/nt.*

and add up the sizes of the files named nt.**.nsq.

If the result is, for example, 2,786,861,701 bytes, which is greater than 2 GB, then 4 GB nodes are required (qsub -l nodes=16:m4096). If the result is less than 900 MB, the default 1 GB nodes are fine, so no node memory needs to be specified. If the database size is greater than 4 GB, request the dual-core nodes, which have 8 GB of memory (-l nodes=16:dc). If you request dual-core nodes, add -T 4 to the PARAMS variable in the environment file to make use of all 4 processors in a dual-core node.
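The size check can be sketched in Python as follows. This is an illustration under stated assumptions, not part of easyfasta: the function names are hypothetical, MB and GB are taken as powers of 1024, and the text does not say what to request for databases between 900 MB and 2 GB, so the sketch assumes 'm2048' nodes for that range.

```python
import glob
import os

MB = 1024 ** 2
GB = 1024 ** 3


def total_nsq_bytes(db_prefix):
    """Sum the sizes of the <db_prefix>.*.nsq files, e.g. /fdb/blastdb/nt."""
    pattern = glob.escape(db_prefix) + ".*.nsq"
    return sum(os.path.getsize(path) for path in glob.glob(pattern))


def node_property(total_bytes):
    """Return the node property to request, or None for the default nodes."""
    if total_bytes > 4 * GB:
        return "dc"        # dual-core nodes with 8 GB; also add -T 4 to PARAMS
    if total_bytes > 2 * GB:
        return "m4096"     # 4 GB nodes
    if total_bytes < 900 * MB:
        return None        # default 1 GB nodes, nothing to specify
    return "m2048"         # ASSUMPTION: range not covered by the text above
```

With the figure quoted above, `node_property(2786861701)` returns `"m4096"`.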

At the present time the Biowulf cluster consists of nodes with different memory configurations. Please see the hardware section of the User Guide for the current configuration of the cluster.

More Examples:

1. DNA sequences against the nucleotide non-redundant database (nt)

biowulf$ mkdir /data/username/run1             make main directory for this run
biowulf$ cd /data/username/run1                go to this directory
biowulf$ mkdir seqs out tmp                    make subdirectories for this run
biowulf$ cd seqs                               go to the 'seqs' subdirectory
biowulf$ cp /home/username/*.seq .             copy all the sequence files into this subdir

Environment file /data/username/run1/env:

setenv DB @/fdb/fastadb/fastanam/nt.nam
setenv PROG fasta_t
setenv INDIR  /data/username/run1/seqs
setenv OUTDIR /data/username/run1/out
setenv TMPDIR  /data/username/run1/tmp
setenv PARAMS "-H -b 10 -v 10"

Submit the job:

qsub -v np=8,read=/data/username/run1/env \
-l nodes=8:m1024 /usr/local/fasta/bin/easyrunfasta

2. Same data against the protein non-redundant database (nr)

Environment file /data/username/run1/env2:

setenv DB @/fdb/fastadb/fastanam/nr.nam
setenv PROG fasta_t
setenv PARAMS "-H -b 10 -v 10"
setenv INDIR  /data/username/run1/seqs
setenv OUTDIR /data/username/run1/out2
setenv TMPDIR  /data/username/run1/tmp2

Submit the job:

qsub -v np=16,read=/data/username/run1/env2 \
-l nodes=16 /usr/local/fasta/bin/easyrunfasta

Programs/Scripts/Files involved

Fasta programs:

Program -- Description

FASTA -- Compares a protein sequence to another protein sequence or to a protein database, or a DNA sequence to another DNA sequence or a DNA library.

SSEARCH -- Performs a rigorous Smith-Waterman alignment between a protein sequence and another protein sequence or a protein database, or between a DNA sequence and another DNA sequence or a DNA library (very slow).

FASTX/FASTY -- Compares a DNA sequence to a protein sequence database, translating the DNA sequence in three forward (or reverse) frames and allowing frameshifts.

TFASTX/TFASTY -- Compares a protein sequence to a DNA sequence or DNA sequence library. The DNA sequence is translated in three forward and three reverse frames, and the protein query sequence is compared to each of the six derived protein sequences. The DNA sequence is translated from one end to the other; no attempt is made to edit out intervening sequences. Termination codons are translated into unknown ('X') amino acids.

FASTF/TFASTF -- Compares an ordered peptide mixture, as would be obtained by Edman degradation of a CNBr cleavage of a protein, against a protein (fastf) or DNA (tfastf) database. A different format is required to specify the ordered peptide mixture:
>mgstm1
MGCEN,
MIDYP,
MLLAY,
MLLGY
indicates m in the first position of all four peptides (as from CNBr), g, i, l (twice) in the second position (first cycle), c, d, l (twice) in the third position, etc. The commas (,) are required to indicate the number of fragments in the mixture, but there should be no comma after the last residue.

FASTS/TFASTS -- Compares a set of short peptide fragments, as would be obtained from mass-spec analysis of a protein, against a protein (fasts) or DNA (tfasts) database. A different format is required to specify the peptide fragments:
>mgstm1
MILG,
MLLEYTD,
MGDAP
indicates three peptide fragments were found: MILG, MLLEYTD, and MGDAP. The commas (,) are required to indicate the number of fragments in the mixture, but there should be no comma after the last residue.
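The comma rule for these peptide query files (a comma after every fragment except the last) is easy to get wrong by hand. A minimal Python sketch that produces the format shown above follows; the function name `peptide_query` is hypothetical, and no check is made that the fragments are valid amino-acid strings.

```python
def peptide_query(name, peptides):
    """Return the text of a fastf/fasts-style peptide query.

    Each fragment goes on its own line, followed by a comma, except
    the last fragment, which has no trailing comma.
    """
    if not peptides:
        raise ValueError("at least one peptide fragment is required")
    body = ",\n".join(peptides)
    return ">%s\n%s\n" % (name, body)
```

Calling `peptide_query("mgstm1", ["MILG", "MLLEYTD", "MGDAP"])` reproduces the FASTS example above.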

 

Fasta scripts:

These scripts can be copied from /usr/local/fasta/bin and modified if desired, although this should not be necessary.

Local documentation for the Fasta program

Fasta documentation at the CSC, Finland -- a bit easier to read because of the formatting.

Fasta vs Blast - a discussion of the pros and cons of each algorithm, at MSKCC.