FASTA on Biowulf is intended for running a large number of sequence files, such as hundreds or thousands of query sequences, against the databases. If you have just a few query sequences, you should use FASTA on Helix. Please contact the Helix Systems staff (staff@helix.nih.gov, or 4-6248) if you have questions about your FASTA jobs.
[user@biowulf user]$ easyfasta
---------------------------------------------------
EasyFasta: Fasta for large numbers of sequences
---------------------------------------------------
Enter the directory which contains your input sequences: /data/user/myseqs/1000_est
Enter the directory where you want your Fasta output to go: /data/user/myseqs/out
** WARNING: There are already files in /data/user/myseqs/out which will be deleted by this job. **
Continue? (y/n): y
Are your query sequences nucleotides (y/n)? y
Fasta programs:
  fasta - nt seq against nt database or aa seq against aa database
  fastx - Translated DNA (with frameshifts, e.g. ESTs) vs Proteins
  fasty - similar to fastx but slower, more sensitive
  fasts - Un-ordered Nucleotides vs Nucleotide
  fastm - Ordered Nucleotides vs Nucleotide
Which program do you want to run: fastx
The following protein databases are available:
(or enter your own database with full pathname)
  nr                    Non-redundant GenBank CDS translations+PDB+SwissProt+PIR
  swissprot             SwissProt sequences
  pdb.aa                Sequences derived from Protein Data Bank
  yeast.aa              Saccharomyces cerevisiae protein sequences
  drosoph.aa            Genbank drosophila sequences
  ecoli.aa              Ecoli genomic CDS translations
  mito.aa               Mitochondrial protein sequences
  ref.human.protein     Refseq human protein sequences
  ref.mouse.protein     Refseq mouse protein sequences
  hs_genome.protein     Build 36, hg18 (April 2006) from the International Human Genome Consortium
  mouse_genome.protein  Build 34, mm6, May 2005 from the Mouse Genome Consortium
Database to run against: pdb.aa
What is the ktup value (default 2)?
http://helix.nih.gov/Applications/fasta3x.txt has a full list of available options.
Any additional Fasta options (e.g. -v 10): -v 5
Creating parameter file /data/user/fasta_tmp.27016/fasta_par.27016
Checking node situation.... 272 are available for 1000 sequences
qsub -v np=32,read=/data/user/fasta_tmp.27016/fasta_par.27016 -l nodes=32:m2048:x86-64 /usr/local/fasta/bin/easyrunfasta
Submitting to 32 nodes.
Job number is 1606770.biobos
Monitor your job at http://biowulf.nih.gov/cgi-bin/usermonS?user
To run against your own database, enter its full pathname at the database prompt, e.g.:
Database to run against: /data/username/fasta_db/my_db
If your query sequences are all in one file, and you need to split them into multiple sequence files, there are a couple of utilities available:
- seqsplit: will split a multisequence fasta-format file into
individual sequences. Usage:
seqsplit -f sequence_file
If the file sequence_file contains 2000 sequences, you will get 2000 individual files. Each file will be named according to the sequence name in the fasta entry.
- split_fasta: will split a multisequence fasta-format file into a
desired number of files. Usage:
split_fasta [optional parameters] [dir]file.fas
-n # number of split files (default=2)
-o file root name of output file (default split#)
-c # chunks to write out (default 100 entries)
-d outdir output directory (default = input directory)
-z if input file is .Z or .gz compressed
Thus, if a file has 100 sequences and you want to split it into 5 multisequence files,
split_fasta -n 5 sequence_file
will produce 5 files, each containing 100/5=20 sequences. The files will be called split0, split1, ... split4.
split_fasta -n 5 -o oligo sequence_file
will produce 5 files, each containing 20 sequences. The files will be called oligo0, oligo1, ... oligo4.
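If these utilities are unavailable, a multisequence fasta-format file can also be split into one file per sequence with a short awk command. This is a sketch, not part of the Fasta distribution; it creates a throwaway two-sequence input file for illustration and names each output file after the first word of its header line, as seqsplit does:

```shell
# Throwaway two-sequence input file, purely for illustration:
printf '>seq1 test\nACGT\n>seq2\nTTTT\nGGGG\n' > multi.fa

# Split multi.fa into one file per sequence; each '>' header starts a
# new output file named after the first word of the header.
awk '/^>/ { close(out); out = substr($1, 2) ".fa" } { print > out }' multi.fa
# produces seq1.fa and seq2.fa
```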
Here are some recent runs on our system, to give you an idea of what sort of performance to expect.

Query | Database | Fasta Program | Nodes | Time
---|---|---|---|---
1000 nucleotide EST sequences | nt (nucleotide, updated 13 Jul 2004; 2,283,112 sequences, 2.8 Gb) | fasta | 10 (4 Gb memory) | 18 hrs 51 min
1000 nucleotide EST sequences | month (nucleotide, updated 17 Jul 2004; 52,474 sequences, 90 Mb) | fasta | 10 | 40 min
1000 nucleotide EST sequences | est (updated 14 Jul 2004; 21,925,146 sequences, 3.9 Gb) | fasta | 10 (4 Gb memory) | 50 hrs 47 min
1000 nucleotide EST sequences | human genome (updated 12 Dec 2003; 25 sequences, 707 Mb) | fasta | 10 | 4 hrs 44 min
1000 protein sequences | nt (nucleotide, updated 13 Jul 2004; 2,283,112 sequences, 2.8 Gb) | tfastx | 10 (4 Gb memory) | 30 hrs 53 min
1000 protein sequences | nr (protein, updated 3 Feb 2004; 1,934,002 sequences, 700 Mb) | fasta | 10 | 1 hr 59 min

All runs used 2 processors per node.
- EasyFasta will typically allot 32 nodes (i.e. 64 processors) for a job. Large databases (e.g. nt nucleotide) may be allotted fewer nodes, since there are fewer nodes with > 2 Gb memory and hence they are less likely to be available.
Fasta programs can accept databases in many different formats.
Users can use their own databases or the ready-to-use databases on Helix, which can be found in the directory /fdb/. Use one of the two methods below to define the $DB variable in the environment file:
1. To use a FASTA-FORMATTED database in ONE single file, the environmental file should look like this:
setenv DB /fdb/fastadb/pdb.nt.fas
setenv KTUP 6
setenv PROG fasta_t
setenv INDIR /data/username/sample1/indir
setenv OUTDIR /data/username/sample1/outdir
setenv TMPDIR /data/username/sample1/temp1
setenv PARAMS "-b 10 -d 10"
2. To use a database composed of several files, several databases, or a database in NON-FASTA format:
- create a file (e.g. /data/user/dbfile) listing the full path of each database and its format number (see below), like this:
/fdb/blastdb/nt.00 12
/fdb/blastdb/nt.01 12
/fdb/blastdb/nt.02 12
/data/username/mydb.fas 0
/fdb/embossdb/estnew 3
- The environmental file should look like this:
setenv DB @/data/user/dbfile
setenv KTUP 6
setenv PROG fasta_t
setenv INDIR /data/username/sample1/indir
setenv OUTDIR /data/username/sample1/outdir
setenv TMPDIR /data/username/sample1/temp1
setenv PARAMS "-b 10 -d 10"
- Make sure the '@' sign is in front of the file which contains the full path
of the databases and their format number.
The format number can be determined using the following list:
0 Pearson/FASTA (>SEQID - comment/sequence)
1 Uncompressed Genbank (LOCUS/DEFINITION/ORIGIN)
2 NBRF CODATA (ENTRY/SEQUENCE)
3 EMBL/SWISS-PROT (ID/DE/SQ)
4 Intelligenetics (;comment/SEQID/sequence)
5 NBRF/PIR VMS (>P1;SEQID/comment/sequence)
6 GCG (version 8.0) Unix Protein and DNA (compressed)
11 NCBI Blast1.3.2 format (unix only)
12 NCBI Blast2.0 format (unix only, fasta32t08 or later)
The Helix databases are updated frequently. Easyfasta is designed to use the Blast-format databases on the system. See Blast Database Update Status for the latest status and updates to these databases.
Easyfasta will select the number and type of nodes for you. If you want more control, you can bypass Easyfasta and use the underlying scripts directly, via the instructions below.
1. Set up a directory structure for input/output and tmp files.
For management purposes it is best (but not necessary) to keep all the input, output and temporary directories within a given subdirectory. Note that the /home/username directory is small, and is intended for standard error/output files. Use the /data/username directory for your Fasta sequences, tmp files, and output. Every Biowulf user has a directory in /data, and this directory is accessible from either Biowulf or any of the Helix SGI machines. The tmp and output directories will be created if they do not already exist.
2. Edit/create an Environment file which tells Fasta where the input/output files are, and what Fasta parameters to use.
This environment file can be in any directory, but it can help, organizationally, to keep it in your main directory for this run, e.g. /data/username/sample1/env
Sample environment file:
setenv DB @/fdb/fastadb/fastanam/nt.nam
setenv KTUP 6
setenv PROG fasta_t
setenv INDIR /data/username/sample1/indir
setenv OUTDIR /data/username/sample1/outdir
setenv TMPDIR /data/username/sample1/temp1
setenv PARAMS "-b 10 -d 10"
where
$DB -- Database to act on (e.g. /fdb/fastadb/nr or your own database). More info.
$KTUP -- the ktup (word size) value for the Fasta search; smaller values are more sensitive but slower.
$PROG -- Fasta program to run (fasta_t, fastx_t, fastf_t, fasts_t, fasty_t, prss_t, ssearch_t, tfasta_t, tfastx_t, tfastf_t, tfasts_t, tfasty_t)
$INDIR -- directory containing all the input sequence files.
$OUTDIR -- directory which will receive all the individual output files
$TMPDIR -- temporary directory used by the program.
$PARAMS -- Parameters for the Fasta program, in quotes (e.g. "-b 10 -v 10")
(List of all available Fasta parameters)
- Each sequence file may contain single or multiple sequences. Each file will go to a single node, so don't put ALL your sequences into a single file.
- Existing files in $TMPDIR and $OUTDIR may be overwritten by this job. Use a different TMPDIR and OUTDIR if you want to preserve them.
- The programs ending with _t are multithreaded versions of the corresponding programs. They default to 2 threads. Since each node in Biowulf has 2 processors, the _t versions should be used.
3. Submit the job to the batch system
Submit the job to the batch system using the 'qsub' command.
Example 1:
qsub -v np=4,read=/data/username/easyfasta/env1 -l nodes=4:p2800:m2048 /usr/local/fasta/bin/easyrunfasta

This job uses the environment file /data/username/easyfasta/env1, will use 2 processors per node (because of the -T 2 flag given as a default parameter in the Fasta programs), and has asked for 'm2048' nodes (memory = 2048 Mb). The 'shnodes' command will list all the Biowulf nodes and their properties. The value of 'np' should always be equal to the number of nodes requested.
Example 2:
qsub -v np=4,read=/data/username/easyfasta/env1 -l nodes=4 /usr/local/fasta/bin/easyrunfasta

This job uses the fasta environment file /data/username/easyfasta/env1, and has asked for 4 nodes. No specific nodes have been asked for (i.e. no 'm####' specification on the nodes), so the batch system will allocate the first nodes that are available. (The command 'shnodes' will list all the nodes on the system and their various properties.) The value of 'np' should always be equal to the number of nodes requested.
- You must monitor any jobs you submit; see Monitoring your jobs in the Biowulf User Guide.
- You need to submit your job to nodes that have enough memory for your
particular job. See the section 'Fasta and Biowulf Memory Size' further down
this page. If you are running against your own database, check the size of the
database file(s) and request node memory as appropriate.
For example, for a Fasta run against a Blast-formatted database:

biowulf% ls -l my_db.nsq
-rw-rw-r--   1 username username 321238769 Aug 31  2001 my_db.nsq
The database file is 321238769 bytes, i.e. 321 Mb. The nodes should have at least 400 Mb. All nodes on the Biowulf cluster have at least 1 GB of memory, so you need not specify any memory to the qsub command.
- It is possible for each query sequence file to contain multiple sequences. However, there need to be at least as many query sequence files as nodes! Occasionally Fasta may fail on a particular sequence, in which case it will not continue on to the other sequences in that file. See the splitting utilities described above for splitting multi-sequence files.
When analyzing a large number of sequences with fasta, it is imperative that the database fit entirely within the memory of a given node. This makes a vast difference in the performance of fasta.
For example, in the environmental file above, the database line reads:
setenv DB @/fdb/fastadb/fastanam/nt.nam
This nt.nam file contains the full paths of 3 database files:
biowulf 75% more /fdb/fastadb/fastanam/nt.nam
/fdb/blastdb/nt.00 12
/fdb/blastdb/nt.01 12
/fdb/blastdb/nt.02 12
The sizes of the files listed IN /fdb/fastadb/fastanam/nt.nam need to be summed up to determine the node memory required, NOT the size of the file /fdb/fastadb/fastanam/nt.nam itself. Do this:
Helix 100% ls -l /fdb/blastdb/nt.*
Add up the size of the files whose name is nt.**.nsq
If the result is 2,786,861,701 bytes, which is > 2 GB, then 4 Gb nodes are required (qsub -l nodes=16:m4096). If the result is less than 900 Mb, then the default 1 GB nodes are fine, and no node memory needs to be specified. If the database size is greater than 4 GB, then the dual-core nodes, which have 8 GB of memory each, should be requested (-l nodes=16:dc). Please note: if dual-core nodes are requested, add -T 4 to the PARAMS variable in the environment file to make use of all 4 processors in a dual-core node.
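The summing-up step can be scripted. A minimal sketch, assuming standard Linux ls and awk (the dbsize helper name is made up for illustration; pass it your database's .nsq files, e.g. /fdb/blastdb/nt.*.nsq):

```shell
# Report the total size, in bytes, of a set of database files.
# Usage: dbsize /fdb/blastdb/nt.*.nsq
dbsize() {
  # column 5 of 'ls -l' is the file size in bytes
  ls -l "$@" | awk '{ total += $5 } END { print total+0 " bytes" }'
}
```

Request node memory comfortably larger than the reported total.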
At the present time the Biowulf cluster consists of nodes with different memory configurations. Please see the hardware section of the User Guide for the current configuration of the cluster.
- Use the "shnodes" command to see a list of nodes with their properties and status (free, job-exclusive, offline, down)
- Fasta jobs should be submitted with the appropriate node designation, depending on the size of the target database. If no size is specified, the batch system will allocate nodes based on availability and system-specified order. Your fasta jobs will run very slowly if the node memory is not sufficient.
1. DNA sequences against the nucleotide non-redundant database (nt)
- Note that 'username' in the examples below should be replaced by your own username!!
biowulf$ mkdir /data/username/run1      # make main directory for this run
biowulf$ cd /data/username/run1         # go to this directory
biowulf$ mkdir input output tmp         # make subdirectories for this run
biowulf$ cd input                       # go to the 'input' subdirectory
biowulf$ cp /home/username/*.seq .      # copy all the sequence files into this subdir
- Create a Fasta environment file in /data/username/run1/env. The file contains
setenv DB @/fdb/fastadb/fastanam/nt.nam
setenv PROG fasta_t
setenv INDIR /data/username/run1/input
setenv OUTDIR /data/username/run1/output
setenv TMPDIR /data/username/run1/tmp
setenv PARAMS "-H -b 10 -v 10"
- Submit the job with the command:
qsub -v np=8,read=/data/username/run1/env -l nodes=8:m1024 /usr/local/fasta/bin/easyrunfasta
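After a run completes, it is worth confirming that every input sequence file produced an output file. A minimal sketch (the check_outputs helper is hypothetical, not one of the Fasta scripts):

```shell
# Print how many input sequence files have no corresponding output file.
# Usage: check_outputs INDIR OUTDIR
check_outputs() {
  # compare the number of entries in the input and output directories
  local missing=$(( $(ls "$1" | wc -l) - $(ls "$2" | wc -l) ))
  echo "$missing input file(s) without output"
}
```

This only compares file counts; comparing sorted name lists (e.g. with comm) would identify exactly which inputs failed.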
2. Same data against the protein non-redundant database (nr)
- Create a new environment file /data/username/run1/env2 which contains:
setenv DB @/fdb/fastadb/fastanam/nr.nam
setenv PROG fastx_t
setenv PARAMS "-H -b 10 -v 10"
setenv INDIR /data/username/run1/input
setenv OUTDIR /data/username/run1/out2
setenv TMPDIR /data/username/run1/tmp2
- Submit the job using:
qsub -v np=16,read=/data/username/run1/env2 -l nodes=16 /usr/local/fasta/bin/easyrunfasta
- The protein nr database is not very large and so requires only 512M of memory. Thus you don't have to ask for any special node properties.
Programs/Scripts/Files involved
Fasta programs:
Program | Description
---|---
FASTA | Compares a protein sequence to another protein sequence or to a protein database, or a DNA sequence to another DNA sequence or a DNA library. |
SSEARCH | Performs a rigorous Smith-Waterman alignment between a protein sequence and another protein sequence or a protein database, or with DNA sequence to another DNA sequence or a DNA library (very slow). |
FASTX/FASTY | Compares a DNA sequence to a protein sequence database, translating the DNA sequence in three forward (or reverse) frames and allowing frameshifts. |
TFASTX/TFASTY | Compares a protein sequence to a DNA sequence or DNA sequence library. The DNA sequence is translated in three forward and three reverse frames, and the protein query sequence is compared to each of the six derived protein sequences. The DNA sequence is translated from one end to the other; no attempt is made to edit out intervening sequences. Termination codons are translated into unknown ('X') amino acids. |
FASTF/TFASTF | Compares an ordered peptide mixture, as would be obtained by Edman degradation of a CNBr cleavage of a protein, against a protein (fastf) or DNA (tfastf) database. A different format is required to specify the ordered peptide mixture: >mgstm1 MGCEN, MIDYP, MLLAY, MLLGY indicates m in the first position of all the peptides (as from CNBr); g, i, l (twice) in the second position (first cycle); c, d, l (twice) in the third position, etc. The commas (,) are required to indicate the number of fragments in the mixture, but there should be no comma after the last residue. |
FASTS/TFASTS | Compares a set of short peptide fragments, as would be obtained from mass-spec analysis of a protein, against a protein (fasts) or DNA (tfasts) database. A different format is required to specify the peptide fragments: >mgstm1 MILG, MLLEYTD, MGDAP indicates that three peptide fragments were found: MILG, MLLEYTD, and MGDAP. The commas (,) are required to indicate the number of fragments in the mixture, but there should be no comma after the last residue. |
Fasta scripts:
These scripts can be copied from /usr/local/fasta/bin and modified if desired, although this should not be necessary.
- easyrunfasta -- (aka runfasta) Batch (PBS) submission file which sets up the mpi wrapper (multirun) for 2 perl scripts
- fastadist -- sorts input query sequences by size to balance the load on the nodes. Makes lists of assigned sequences for each node in $TMPDIR
- mpifasta -- main execution of fasta...processes all query sequences assigned to a given node
- env -- file which contains appropriate variables and parameters for fasta execution
- fasta -- fasta and all its associated executables and parameters are located in /usr/local/fasta.
- multirun -- MPI wrapper program which allows for different behavior on different nodes
- cleanup - cleans up tmp files for easyfasta.
Local documentation for the Fasta program
Fasta documentation at the CSC, Finland -- a bit easier to read because of the formatting.
Fasta vs Blast - a discussion of the pros and cons of each algorithm, at MSKCC.