Swarm on Biowulf

Swarm is a program designed to simplify submitting a group of commands to the cluster. Some programs do not scale well and are therefore not suited to true parallelization; others simply need to be run as many independent jobs. Such programs are well suited to running as 'swarms of single-threaded jobs', and the swarm program simplifies this process.

Swarm reads a list of commands from a swarm command file, then automatically submits those commands to the PBS batch system for execution. Swarm runs one command per processor on a node, making optimum use of each node (thus a node with 2 processors will execute two commands simultaneously). When there are hundreds or thousands of commands, use the -b option to bundle groups of commands to be run sequentially on each processor.

Commands in the command file should appear just as they would be entered on a command line. STDOUT (or STDERR) output that isn't explicitly directed elsewhere will be sent to a file named swarm#nPID.o (or .e) in your current working directory. A line where the first non-whitespace character is "#" is considered a comment and is ignored.
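For example, a short command file might look like the following (myprog and its arguments are hypothetical placeholders):

# this line is a comment and is ignored
myprog -a input1
myprog -a input2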

Swarm creates a .swarm directory in your current working directory, and creates an executable script for every 2 commands in your command file. These scripts are automatically deleted as the final step of their execution. (If the -d (debug) option is specified, the scripts are not deleted at the end of the job.)

Usage

swarm -f cmdfile [ -n # ] [-b #] [ -d ] [ -h ] [ qsub-options ]

The -f cmdfile option is mandatory; all others are optional.

-f cmdfile specify the file containing a list of commands, one command per line. You may use ";" to separate several commands on a line, and these will be executed sequentially.
-d debug mode. The command file is read, command scripts are generated and saved in the .swarm directory, and debugging information is printed. The scripts are submitted to the batch system, but the command scripts are not deleted at the end of the job.
-b # bundle mode. swarm runs one command per processor by default. Use the bundle option to run "#" commands per processor, one after the other. The advantages of bundling include fewer swarm jobs and output/error files, lower overhead due to scheduling and job startup, and disk file cache benefits under certain circumstances.
-n # number of processes to run per node; swarm sets this number to 2 by default (the NIH Biowulf comprises 2-processor nodes).
-h print a help message
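For example, the options may be combined; the following submission bundles 10 commands per processor and keeps the generated scripts in the .swarm directory for inspection:

$ swarm -f cmdfile -b 10 -d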
Output

STDOUT and STDERR output from processes executed under swarm will be directed to a file named swarm#nPID.o (or .e), for instance swarm2587n1.o (or swarm2587n1.e). Since this can be confusing (multiple processes write to the same file), it is a good idea to explicitly redirect output on the command line using ">".
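For example, each line of the command file can capture its output and errors in separate files (myprog is a hypothetical placeholder; "2>" is the Bourne-shell redirection for STDERR):

myprog < input1 > output1 2> error1
myprog < input2 > output2 2> error2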

Be aware of programs that write directly to a file using a fixed filename. If you run multiple instances of such programs then for each instance you will need to a) change the name of the file or b) alter the path to the file. See the EXAMPLES section for some ideas.

Examples

To see how swarm works, first create a file containing a few simple commands, then use swarm to submit them to the batch queue:

$ cat > cmdfile
date
hostname
ls -l
^D

$ swarm -f cmdfile

Use qstat -u your-user-id to monitor the status of your request; an "R" in the status ("S") column indicates that your job is running (see qstat(1) for more details). This particular example will probably run to completion before you can issue the qstat command. To see the output from the commands, look at the files named "swarm#nPID.o".


Example 1: A program that reads from STDIN and writes to STDOUT

For each invocation of the program the names for the input and output files vary:

$ cat > runbix
./bix < testin1 > testout1
./bix < testin2 > testout2
./bix < testin3 > testout3
./bix < testin4 > testout4
^D
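The command file is then submitted as before:

$ swarm -f runbix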

Example 2: A program that writes to a fixed filename

If a program writes to a fixed filename, you may need to run each instance in a different directory. First create the necessary directories (for instance run1, run2), and in the swarm command file cd to the unique output directory before running the program, using either an absolute path beginning with "/" or a path relative to your home directory. Lines with a leading "#" are considered comments and ignored.
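For example, the run directories used in the command file below could be created in advance with mkdir:

$ mkdir -p pedsystem/run1 pedsystem/run2 pedsystem/run3 pedsystem/run4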

$ cat > batchcmds
# Run ped program using different directory
# for each run
cd pedsystem/run1; ../ped
cd pedsystem/run2; ../ped
cd pedsystem/run3; ../ped
cd pedsystem/run4; ../ped
...

$ swarm -f batchcmds

Example 3: Bundling large numbers of commands

If you have over 1000 commands, especially if each one runs for a short time, you should 'bundle' your jobs with the -b flag. If the command file contains 2500 commands, the following swarm command will group them into bundles of 40 commands each, producing 64 bundles. Swarm will then submit two bundles (one per processor) as a single swarm job, so there will be 32 (64/2) swarm jobs.

swarm -f cmdfile -b 40

Note that commands in a bundle will run sequentially on the assigned node. Ideally, the bundling number should be chosen so that there are as many jobs as the system will allow for a single user. For example, if the current jobs/user limit is 32, design your bundle size so that you get at least 32 swarm jobs.
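As a rough sketch, a suitable bundle size can be computed from the command count, the processors per node, and the jobs/user limit; the shell arithmetic below assumes this example's figures (2500 commands, 2 processors per node, a 32 jobs/user limit) and rounds up:

$ echo $(( (2500 + 2*32 - 1) / (2*32) ))
40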


Example 4: Using qsub flags

Swarm submits clusters of processes using PBS (Portable Batch System) via the qsub command; any valid qsub command-line option is also valid for swarm. In this example the "-l" option is given to specify the type of node.

swarm -f testfile -l nodes=1:p866:m1000

Note that swarm is designed to run 2 processes on a single node. If you use '-l nodes=4' in the command above, swarm will override this and set '-l nodes=1' for each swarm job. You cannot submit multi-node jobs via swarm.

Deleting Swarm job sets
swarmdel Job_ID|Job_Name [ -niqf ] [ -t # ] [ -EHQRSTW ]

The utility script swarmdel will delete a set of swarm jobs automatically:

swarmdel 123456.biobos
swarmdel swarmb23n12345

By default, swarm jobs are given a name that matches the pattern swarm[b](submission number)n(PID number).

The submission number corresponds to the order in which the job was submitted to the queue, and the PID number is the process id of the swarm invocation that submitted it. Bundled swarm jobs are given an additional 'b'. Within a single swarm job set the PID is fixed, so only the submission number varies among the job names; for example, swarmb23n12345 is a bundled job with submission number 23, submitted by process 12345.

swarmdel will delete only those jobs that match the default swarm name pattern and which are owned by the user giving the swarmdel command.

To find all jobs that exactly match a Job_Name, use qselect -N [Job_Name].
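For example, the matching job ids could be piped straight to qdel, a standard PBS idiom that works independently of swarmdel:

$ qselect -N swarmb23n12345 | xargs qdel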

swarmdel has the following options:

-n
test run: don't actually delete anything
-i
interactive mode: the user is prompted to allow the deletion
-q
quiet mode: don't give any output
-f
forceful: keep deleting until every job is gone from the queue
-t #
number of seconds to wait for additional jobs to appear (forceful mode only)
-EHQRSTW
job state selection. Only delete jobs in the selected state: -E (ending), -H (held), -Q (queued), -R (running), -S (suspended), -T (being moved), -W (waiting). State options can be combined (e.g., -W -H deletes only jobs in either W or H state).
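For example, to preview which queued jobs in a set would be deleted, without actually removing anything:

$ swarmdel swarmb23n12345 -n -Q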