
Submitting Batch Jobs

Overview

A batch job is the most common way users run production applications on NERSC machines.  Carver's batch system is based on the PBS model, implemented with the Moab scheduler and Torque resource manager.

Typically, the user submits a batch script to the batch system. This script specifies, at the very least, how many nodes and cores the job will use, how long the job will run, and the name of the application to run. The job will advance in the queue until it has reached the top.   At this point, Torque will allocate the requested number of nodes to the batch job.  The batch script itself will execute on the "head node" (sometimes known as the "MOM" node).  See Queues and Policies for details of batch queues, limits, and policies.

Debug Jobs

Short jobs requesting less than 30 minutes of walltime and 32 nodes (256 cores) or fewer can run in the debug queue.  From 5am to 6pm Pacific Time, 8 nodes are reserved for debugging and interactive use.
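
For example, a job that fits within these limits might use directives like the following (a minimal sketch; adjust the node count, walltime, and executable name to your own case):

#PBS -q debug
#PBS -l nodes=4:ppn=8
#PBS -l walltime=00:20:00

cd $PBS_O_WORKDIR
mpirun -np 32 ./my_executable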

Sample Batch Scripts

Although there are default values for all batch parameters, it is always a good idea to specify the name of the queue, the number of nodes, and the walltime for all batch jobs.  To minimize the time spent waiting in the queue, specify the smallest walltime that will safely allow the job to complete.

A common convention is to append the suffix ".pbs" to batch scripts.

Basic Batch Script

This example requests 16 nodes with 8 tasks per node (128 MPI tasks in total), for 10 minutes, in the debug queue.

#PBS -q debug
#PBS -l nodes=16:ppn=8
#PBS -l walltime=00:10:00
#PBS -N my_job
#PBS -e my_job.$PBS_JOBID.err
#PBS -o my_job.$PBS_JOBID.out
#PBS -V

cd $PBS_O_WORKDIR
mpirun -np 128 ./my_executable

Submit Example

Submit your batch script with the qsub command:

carver% qsub my_job.pbs
123456.cvrsvc09-ib

The qsub command displays the job_id (123456.cvrsvc09-ib in the above example).  It is important to keep track of your job_id; it is needed for job monitoring and problem resolution.
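
For example, you can pass the job_id to the qstat command to check on the status of your job:

carver% qstat 123456.cvrsvc09-ib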

Torque Keywords

The following keywords may be specified as qsub command line options, or as directives (preceded by #PBS) embedded in a batch script.

Required Torque Options/Directives

Option                 | Default             | Description
-l nodes=num_nodes     | 1                   | Number of nodes assigned to the job
-l walltime=HH:MM:SS   | 00:05:00            | Maximum wallclock time for the job
-q queue_name          | debug               | Name of the submit queue
-N job_name            | Name of job script  | Name of the job; up to 15 printable, non-whitespace characters

Useful Torque Options/Directives

Option                 | Default               | Description
-A repo                | Default repo          | Charge the job to repo
-e filename            | <job_name>.e<job_id>  | Write stderr to filename
-o filename            | <job_name>.o<job_id>  | Write stdout to filename
-j [oe | eo]           | Do not merge          | Merge (join) stdout and stderr.  If oe, merge as the output file; if eo, merge as the error file
-m [a | b | e | n]     | a                     | Email notification: a=send mail if the job is aborted by the system; b=send mail when the job begins; e=send mail when the job ends; n=never send mail.  Options a, b, and e may be combined
-S shell               | Login shell           | Shell used to interpret the batch script
-r [y | n]             | n (not rerunnable)    | By default the job is not rerun after a system-wide outage.  Users can override this behavior with "#PBS -r y"
-l gres=project:1      | Not set               | Use the generic resource mechanism to indicate that the job uses /project.  When set, the job will not start during scheduled /project file system maintenance
-V                     | Do not import         | Export the current environment variables into the batch job environment
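
For example, the directives from the basic batch script above could instead be supplied on the qsub command line (a sketch; here the stderr/stdout filenames are left at their defaults):

carver% qsub -q debug -l nodes=16:ppn=8 -l walltime=00:10:00 -N my_job -V my_job.pbs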

Torque Environment Variables

The batch system defines many environment variables, which are available for use in batch scripts.  The following tables list some of the more useful variables.  Users must not redefine the value of any of these variables!

Variable Name    | Meaning
PBS_O_LOGNAME    | Login name of the user who executed qsub.
PBS_O_HOME       | Home directory of the submitting user.
PBS_O_WORKDIR    | Directory in which the qsub command was executed.  Note that batch jobs begin execution in $PBS_O_HOME; many batch scripts execute "cd $PBS_O_WORKDIR" as the first executable statement.
PBS_O_HOST       | Hostname of the system on which qsub was executed.  This is typically a Carver login node.
PBS_JOBID        | Unique identifier for this job; important for tracking job status.
PBS_ENVIRONMENT  | Set to "PBS_BATCH" for scripts submitted as batch jobs; "PBS_INTERACTIVE" for interactive batch jobs ("qsub -I ...").
PBS_O_QUEUE      | Name of the submit queue.
PBS_QUEUE        | Name of the execution queue.
PBS_JOBNAME      | Name of this job.
PBS_NODEFILE     | Name of the file containing the list of nodes assigned to this job.
PBS_NUM_NODES    | Number of nodes assigned to this job.
PBS_NUM_PPN      | Value of "ppn" (processes per node) for this job.
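
For example, a batch script might use these variables to record where and how it is running (a minimal sketch):

cd $PBS_O_WORKDIR
echo "Job $PBS_JOBID was submitted from $PBS_O_HOST"
echo "Running on $PBS_NUM_NODES nodes with $PBS_NUM_PPN tasks per node"
cat $PBS_NODEFILE > nodes_used.txt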

Standard Output and Error

While your job is running, standard output (stdout) and standard error (stderr) are written to temporary "spool" files (for example: 123456-cvrsvc09-ib.OU and 123456-cvrsvc09-ib.ER) in the submit directory.  If you merge  stderr/stdout via the "#PBS -j eo" or "#PBS -j oe" option, then only one such spool file will appear. 

These files will be updated in real-time while the job is running, allowing you to use them for job monitoring.  It is important that you do not modify, remove or rename these spool files while the job is still running!
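
For example, you can follow a job's output as it runs by tailing the spool file (using the job_id from the earlier example):

carver% tail -f 123456-cvrsvc09-ib.OU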

After your batch job completes, the above files will be renamed to the corresponding stdout/stderr files (for example: my_job.o123456 and my_job.e123456).

Running Serial Jobs

A serial job is one that only requires a single computational core.  The serial queue is specifically configured to run serial jobs.  It runs on 80 12-core Westmere nodes, each having 48GB of memory. 

Serial jobs share nodes, rather than having exclusive access. Multiple jobs will be scheduled on an available node until either all cores are in use, or until there is not enough memory available for additional processes on that node.  This last characteristic makes it important to request sufficient memory for your serial job.   See Carver Memory Considerations for more information.

Serial jobs are not charged for an entire node; they are only charged for a single core.  See Usage Charging for more information.

The following script requests a single Westmere core and 10GB of memory for 12 hours:

#PBS -q serial
#PBS -l walltime=12:00:00
#PBS -l pvmem=10GB

./a.out

Note that it is not necessary to specify "-l nodes=1:ppn=1" for serial jobs.  Also note that if values other than 1 are specified for nodes or ppn, the serial job will never start.

Running Multiple Parallel Jobs Sequentially

Multiple parallel applications can be run one after another within a single batch job.  Make sure that the total number of nodes requested is sufficient for the application with the largest node count, and that the requested walltime is sufficient for all applications combined.  Note that the repository charge for the batch job is based on the total number of nodes requested, regardless of whether any particular application uses all of those nodes.

#PBS -q regular
#PBS -l nodes=16:ppn=8
#PBS -l walltime=4:00:00
#PBS -N my_job
#PBS -e my_job.$PBS_JOBID.err
#PBS -o my_job.$PBS_JOBID.out
#PBS -V

cd $PBS_O_WORKDIR
mpirun -np 128 ./my_executable1
mpirun -np 32 ./my_executable2
mpirun -np 64 ./my_executable3
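
If the later runs depend on the earlier ones completing successfully, you can check each application's exit status and stop the batch job early when something fails; a minimal sketch, using the same hypothetical executables:

cd $PBS_O_WORKDIR
mpirun -np 128 ./my_executable1 || exit 1
mpirun -np 32 ./my_executable2 || exit 1
mpirun -np 64 ./my_executable3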

Running Multiple Parallel Jobs Simultaneously

If you need to run many jobs that require approximately the same run time, you can bundle them up and run them from one job script. For example, to run 8 jobs simultaneously with 4 cores per job (4 nodes in total), here is a sample job script:

#!/bin/bash -l
#PBS -N test_multi
#PBS -q debug
#PBS -l nodes=4:ppn=8,walltime=30:00
#PBS -j oe
#PBS -V

cd $PBS_O_WORKDIR

#Run 8 jobs simultaneously, each with 4 cores
let num_jobs=8
let tasks_per_job=4

#Assume jobs run in separate directories, job1, job2, ...
for i in $(seq $num_jobs)
do
    cd job$i

    #write hostfile for i-th job to use
    let "lstart=($i-1)*${tasks_per_job}+1"
    let "lend=${lstart}+${tasks_per_job}-1"
    sed -n ${lstart},${lend}'p' < $PBS_NODEFILE >nodefile

    mpirun -np $tasks_per_job -hostfile nodefile ./a.out  >& job$i.out &

    cd ..
done

wait

Please note that it is important to pass the hostfile to mpirun for each job; otherwise all jobs will run on the same first 4 cores, which will overload and likely crash that node. The final "wait" command is also necessary; without it, the batch job would exit as soon as the script finishes and all the jobs sent to the background would be terminated.

If you want to run multiple hybrid MPI+OpenMP jobs at the same time, here is a sample job script. It runs 8 hybrid jobs simultaneously, each with 2 MPI tasks and 4 threads per task (so each job runs on 1 node, and you need 8 nodes in total).

#!/bin/bash -l
#PBS -N test_hybrid
#PBS -q debug
#PBS -l nodes=8:ppn=2,walltime=30:00
#PBS -l pvmem=10GB
#PBS -j oe
#PBS -V

cd $PBS_O_WORKDIR

#Run 8 jobs simultaneously, each running 2 MPI tasks with 4 threads per task
let num_jobs=8
let tasks_per_job=2
let threads_per_task=4

export OMP_NUM_THREADS=$threads_per_task

#Assume jobs run in separate directories, job1, job2, ...
for i in $(seq $num_jobs)
do
    cd job$i

    #write hostfile for i-th job to use
    let "lstart=($i-1)*${tasks_per_job}+1"
    let "lend=${lstart}+${tasks_per_job}-1"
    sed -n ${lstart},${lend}'p' < $PBS_NODEFILE >nodefile

    mpirun -np $tasks_per_job -bysocket -bind-to-socket -hostfile nodefile ./a.out >& job$i.out &

    cd ..
done

wait

Again, it is important to pass the hostfile to the mpirun command line to spread your tasks over the nodes allocated to your job. Please also note that the Torque directive #PBS -l pvmem=10GB is necessary to allow four times the default per-core memory (2.5GB) for each MPI task; otherwise the 4 threads in each MPI task would share the default 2.5GB instead of the available 10GB. For more information about using memory properly, please refer to our Memory Considerations page. This job script is set up for 4 threads per task; to run hybrid jobs with other task-to-thread ratios, please refer to our MPI/OpenMP page for general instructions.

As a last note, be aware that the contents of the batch nodefile, $PBS_NODEFILE, depend on the ppn value of the Torque directive #PBS -l nodes. We recommend working out a more robust and convenient way to generate the nodefile for each of your multiple jobs; one possible sketch is shown below.
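
The following is only a sketch, not a supported recipe. It assumes, as in the hybrid example above, that each sub-job fits on a single node and that num_jobs and tasks_per_job are set as in that script. It builds each job's nodefile from the list of unique node names in $PBS_NODEFILE instead of from raw line offsets, so the result no longer depends on the ppn value:

#Build a list of the unique node names assigned to this batch job
uniq $PBS_NODEFILE > all_nodes

for i in $(seq $num_jobs)
do
    cd job$i

    #The i-th sub-job runs entirely on the i-th node; write one line per MPI task
    node=$(sed -n ${i}p ../all_nodes)
    for t in $(seq $tasks_per_job); do echo $node; done > nodefile

    mpirun -np $tasks_per_job -hostfile nodefile ./a.out >& job$i.out &

    cd ..
done

wait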

Job Steps and Dependencies

There is a batch option, depend=dependency_list, for job dependencies. The most commonly used dependency_list is afterok:jobid[:jobid...], which means the job just submitted will be executed only after the dependent job(s) have terminated without an error. Another option is afterany:jobid[:jobid...], which means the job just submitted will be executed after the dependent job(s) have terminated, with or without an error. The afterany option can be useful for restart runs, where the first job intentionally runs until it hits its wallclock limit and therefore terminates with an error.

For example, to run batch job2 only after batch job1 succeeds,

carver% qsub job1.pbs
123456.cvrsvc09-ib
carver% qsub -W depend=afterok:123456.cvrsvc09-ib job2.pbs
123457.cvrsvc09-ib

As with most batch options, the dependency can be included in a batch script rather than on the command line:

carver% qsub job1.pbs
123456.cvrsvc09-ib
carver% qsub job2.pbs
123457.cvrsvc09-ib

where the batch script job2.pbs contains the following line:

#PBS -W depend=afterok:123456.cvrsvc09-ib

The second job will be in batch "Held" status until job1 has run successfully. Note that job2 has to be submitted while job1 is still in the batch system, either running or in the queue. If job1 has exited before job2 is submitted, job2 will not be released from the "Held" status.

It is also possible for the first job to submit the second (dependent) job by employing the Torque environment variable "PBS_JOBID":

#PBS -q regular
#PBS -l nodes=4:ppn=8
#PBS -l walltime=0:30:00
#PBS -j oe

cd $PBS_O_WORKDIR
qsub -W depend=afterok:$PBS_JOBID job2.pbs
mpirun -np 32 ./a.out

Please refer to the qsub man page for other -W depend=dependency_list options, including afterany:jobid[:jobid...], afternotok:jobid[:jobid...], before:jobid[:jobid...], etc.
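
For example, a restart job that should run regardless of whether the first job exited cleanly (such as when the first job ran into its wallclock limit) could be submitted with afterany, assuming a hypothetical restart script restart.pbs:

carver% qsub -W depend=afterany:123456.cvrsvc09-ib restart.pbs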

Sample Scripts for Submitting Chained Dependency Jobs

Below is a simple batch script, 'runit.pbs', for submitting three chained jobs in total (job_number_max=3). It sets the job sequence number (job_number) to 1 if this variable is undefined (that is, in the first job). When the value is less than job_number_max, the current job submits the next job. The value of job_number is incremented by 1, and the new value is provided to the subsequent job.

#!/bin/bash
#PBS -q regular
#PBS -l nodes=1
#PBS -l walltime=0:05:00
#PBS -j oe

: ${job_number:="1"} # set job_number to 1 if it is undefined
job_number_max=3
JOBID="${PBS_JOBID}"

cd $PBS_O_WORKDIR

echo "hi from ${PBS_JOBID}"

if [[ ${job_number} -lt ${job_number_max} ]]
then
  (( job_number++ ))
  next_jobid=$(qsub -v job_number=${job_number} -W depend=afterok:${JOBID} runit.pbs)
  echo "submitted ${next_jobid}"
fi

sleep 15
echo "${PBS_JOBID} done"

Using the above script, three batch jobs are submitted:

carver% qsub runit.pbs
123456.cvrsvc09-ib
carver% ls -l runit.pbs.o*
-rw------- 1 xxxxx xxxxx 899 2011-03-09 09:27 runit.pbs.o123456
-rw------- 1 xxxxx xxxxx 949 2011-03-09 09:28 runit.pbs.o123457
-rw------- 1 xxxxx xxxxx 949 2011-03-09 09:28 runit.pbs.o123458
carver% cat runit.pbs.o123456
...
hi from 123456.cvrsvc09-ib
submitted 123457.cvrsvc09-ib
123456.cvrsvc09-ib done
...
carver% cat runit.pbs.o123457
...
hi from 123457.cvrsvc09-ib
submitted 123458.cvrsvc09-ib
123457.cvrsvc09-ib done
...
carver% cat runit.pbs.o123458
...
hi from 123458.cvrsvc09-ib
123458.cvrsvc09-ib done
...