Biowulf at the NIH
EMBOSS on Biowulf

EMBOSS stands for "The European Molecular Biology Open Software Suite". It contains hundreds of programs (applications) covering many areas of sequence analysis.

EMBOSS on Biowulf is intended for processing large numbers of sequence files, such as hundreds or thousands of query sequences, with the EMBOSS programs. If you have just a few query sequences, you should use the EMBOSS web interface or the command line on Helix. Please contact the Helix Systems staff (staff@helix.nih.gov, or 301-594-6248) if you have questions about your EMBOSS jobs.

Submitting EMBOSS jobs on Biowulf

The EMBOSS programs are typically used to perform one or more tasks on a large number of sequences. The swarm program on Biowulf is ideally suited for large numbers of independent simultaneous jobs like this.

1. Set up the EMBOSS environment. csh or tcsh users should add the following lines to the end of their /home/username/.cshrc file.

setenv PLPLOT_LIB /usr/local/emboss/plplot/lib
set path=( /usr/local/emboss/bin ${path} )
setenv emboss_acdroot /usr/local/emboss/share/EMBOSS/acd

Or, for bash/ksh/sh users, insert the following at the end of your .bashrc file:

PLPLOT_LIB=/usr/local/emboss/lib
PATH=/usr/local/emboss/bin:$PATH
emboss_acdroot=/usr/local/emboss/share/EMBOSS/acd
export PLPLOT_LIB PATH emboss_acdroot
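
After editing your startup file and logging in again, you may want to confirm that the EMBOSS programs are actually found on your PATH. The following is a minimal sketch for sh-compatible shells; the function name missing_programs is hypothetical and not part of EMBOSS or Biowulf.

```shell
# Print the name of every program given as an argument that cannot be
# found on the current PATH. Prints nothing if all of them are found.
missing_programs() {
    for prog in "$@"; do
        command -v "$prog" >/dev/null 2>&1 || echo "$prog"
    done
}

# Example (commented out): check two of the EMBOSS programs used on this page.
# missing_programs seqret patmatmotifs
```

If the call prints a program name, the PATH setting above has probably not taken effect in the current shell.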

2. Set up the swarm command file, with one line for each command that you wish to run. For example, to pull 2500 sequences out of the databases, you would run the EMBOSS 'seqret' command 2500 times. Create a file called 'cmd.file' that contains 2500 lines, one for each command, e.g.:

seqret -sequence 'genbank:ab1681*' -outseq 'outseq1'
seqret -sequence 'swissprot:P16310' -outseq 'outseq2'
seqret -sequence 'genpept:M31661' -outseq 'outseq3'
...
seqret -sequence 'refseqnt:nc_011*' -outseq 'outseq4'

Each line in cmd.file should appear exactly as it would be entered on the command line.

3. Submit this swarm job to the cluster, with the command

swarm -f cmd.file
If you have over 1000 commands, especially if each one runs for a short time, you should 'bundle' your jobs with the -b flag. This will greatly increase the speed of your jobs and avoid overloading the cluster. You should bundle your jobs so that there are no more than 100 jobs. The most appropriate 'bundle number' is the total number of commands in your cmd.file divided by 2*100 (that is, 200). For example, if you have 5000 sequences to process, create the swarm command file as above, calculate the best 'bundle number' (5000/200 = 25), and then submit with
        swarm -f cmd.file -b 25
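
The bundle calculation above can be scripted rather than done by hand. This is a rough sketch for sh-compatible shells. The function name bundle_size is hypothetical, and rounding up when the count is not an exact multiple of 200 is an assumption, since the text only gives an evenly divisible example.

```shell
# Compute a bundle number from the number of commands, using the
# rule of thumb above: commands / (2*100).
# Assumption: round up, so that a partial bundle still counts.
bundle_size() {
    ncmds=$1
    echo $(( (ncmds + 199) / 200 ))
}

# Example (commented out): count the lines in cmd.file and submit.
# ncmds=$(wc -l < cmd.file)
# swarm -f cmd.file -b "$(bundle_size "$ncmds")"
```

For 5000 commands this gives the same value as the worked example, 25.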

Useful tip: Creating a large swarm command file by hand is time-consuming and error-prone. You will probably want to write a simple csh or perl script to build it. If you are unfamiliar with csh, 'Basic scripting with csh' may be useful. The following is an example that uses csh to build a command file:

        helix% cd my_sequence_directory
        helix% touch cmdfile
        helix% foreach file (*)
        foreach> echo "patmatmotifs $file $file.out" >> cmdfile
        foreach> end
        helix%
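
For bash/ksh/sh users, the same loop might look like the sketch below. The function name build_cmdfile and its two arguments are hypothetical. Unlike the csh session above, it skips the command file itself if it lives in the sequence directory, so that no command is generated for it. The -ef file test used for this is supported by bash and most other common shells, but it is not required by POSIX.

```shell
# Write one patmatmotifs command per file in a sequence directory
# into a swarm command file.
#   $1 = directory holding the sequence files
#   $2 = swarm command file to create (overwritten if it exists)
build_cmdfile() {
    seqdir=$1
    cmdfile=$2
    : > "$cmdfile"                           # start with an empty command file
    for f in "$seqdir"/*; do
        [ -f "$f" ] || continue              # skip directories and unmatched globs
        [ "$f" -ef "$cmdfile" ] && continue  # skip the command file itself
        printf 'patmatmotifs %s %s.out\n' "$f" "$f" >> "$cmdfile"
    done
}

# Example (commented out):
# build_cmdfile my_sequence_directory cmd.file
```

File names containing spaces would still need quoting inside the generated commands; the sketch assumes simple names.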
Documentation

EMBOSS Documentation

swarm documentation

EMBOSS database status is displayed on the front page of the EMBOSS web interface.