Submitting Batch Jobs
Overview
A batch job is the most common way users run production applications on NERSC machines. Carver's batch system is based on the PBS model, implemented with the Moab scheduler and Torque resource manager.
Typically, the user submits a batch script to the batch system. This script specifies, at the very least, how many nodes and cores the job will use, how long the job will run, and the name of the application to run. The job will advance in the queue until it has reached the top. At this point, Torque will allocate the requested number of nodes to the batch job. The batch script itself will execute on the "head node" (sometimes known as the "MOM" node). See Queues and Policies for details of batch queues, limits, and policies.
Debug Jobs
Short jobs requesting less than 30 minutes and requiring 32 nodes (256 cores) or fewer can run in the debug queue. From 5am-6pm Pacific Time, 8 nodes are reserved for debugging and interactive use.
Sample Batch Scripts
Although there are default values for all batch parameters, it is a good idea to always specify the name of the queue, the number of nodes, and the walltime for all batch jobs. To minimize the time spent waiting in the queue, specify the smallest walltime that will safely allow the job to complete.
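As a rough illustration of choosing a walltime, the sketch below pads an estimated run time by a safety margin and prints it in the HH:MM:SS form used by "-l walltime". The function name walltime_hms and the 20% default margin are assumptions made for this example, not NERSC recommendations.

```shell
# walltime_hms SECONDS [MARGIN_PERCENT]
# Pad an estimated run time (in seconds) by a safety margin and print it
# as HH:MM:SS. The 20% default margin is an arbitrary, illustrative choice.
walltime_hms() {
    local est="$1"
    local margin="${2:-20}"
    # Round up so the padded value is never shorter than intended.
    local total=$(( (est * (100 + margin) + 99) / 100 ))
    printf '%02d:%02d:%02d\n' \
        $(( total / 3600 )) $(( total % 3600 / 60 )) $(( total % 60 ))
}

# A 500-second estimate plus the default margin
walltime_hms 500
```

The printed value can be pasted into a "#PBS -l walltime=" directive.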
A common convention is to append the suffix ".pbs" to batch scripts.
Basic Batch Script
This example requests 16 nodes, and 8 tasks per node, for 10 minutes, in the debug queue.
#PBS -q debug
#PBS -l nodes=16:ppn=8
#PBS -l walltime=00:10:00
#PBS -N my_job
#PBS -e my_job.$PBS_JOBID.err
#PBS -o my_job.$PBS_JOBID.out
#PBS -V
cd $PBS_O_WORKDIR
mpirun -np 128 ./my_executable
Submit Example
Submit your batch script with the qsub command:
carver% qsub my_job.pbs
123456.cvrsvc09-ib
The qsub command displays the job_id (123456.cvrsvc09-ib in the above example). It is important to keep track of your job_id, to help with job tracking and problem resolution.
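One way to keep track of job_ids is to capture the output of qsub and append it to a log file. The following is a minimal sketch; the function name submit_and_log, the QSUB override variable, and the default log file name submitted_jobs.log are all hypothetical and exist only for this example.

```shell
# submit_and_log SCRIPT [LOGFILE]
# Submit SCRIPT, append "job_id script" to LOGFILE, and print the job_id.
# QSUB is a hypothetical override for the submit command (defaults to qsub).
submit_and_log() {
    local script="$1"
    local logfile="${2:-submitted_jobs.log}"
    local job_id
    job_id=$("${QSUB:-qsub}" "$script") || return 1
    printf '%s %s\n' "$job_id" "$script" >> "$logfile"
    printf '%s\n' "$job_id"
}

# Example (on a login node):
#   submit_and_log my_job.pbs
```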
Torque Keywords
The following keywords may be specified as qsub command line options, or as directives (preceded by #PBS) embedded in a batch script.
Required Torque Options/Directives | ||
---|---|---|
Option | Default | Description |
-l nodes=num_nodes | 1 | Number of nodes assigned to job |
-l walltime=HH:MM:SS | 00:05:00 | Maximum wallclock time for job |
-q queue_name | debug | Name of submit queue |
-N job_name | Name of job script | Name of job; up to 15 printable, non-whitespace characters |
Useful Torque Options/Directives | ||
Option | Default | Description |
-A repo | Default repo | Charge job to repo |
-e filename | <job_name>.e<job_id> | Write stderr to filename |
-o filename | <job_name>.o<job_id> | Write stdout to filename |
-j [oe | eo] | Do not merge | Merge (join) stdout and stderr. If oe, merge as output file; if eo, merge as error file |
-m [a | b | e | n] | a | Email notification: a=send mail if job aborted by system; b=send mail when job begins; e=send mail when job ends; n=never send email. Options a, b, e may be combined |
-S shell | Login shell | Specify shell to interpret batch script |
-r [y|n] | n, which means Rerunable=False | System default is not to rerun the job after a system-wide outage. Users can override this behavior with "#PBS -r y". |
-l gres=project:1 | Uses generic resource | Specify if a batch job uses /project. When set, a job will not start during scheduled /project file system maintenance. |
-V | Do not import | Export current environment variables into the batch job environment |
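The -N entry above limits job names to 15 printable, non-whitespace characters. A quick check along those lines might look like the sketch below; valid_job_name is a hypothetical helper, not part of Torque, and it only mirrors the rule as stated in the table.

```shell
# valid_job_name NAME
# Return 0 if NAME is 1 to 15 printable, non-whitespace characters.
valid_job_name() {
    local name="$1"
    [ -n "$name" ] || return 1
    [ "${#name}" -le 15 ] || return 1
    case "$name" in
        *[![:graph:]]*) return 1 ;;   # whitespace or non-printable character
    esac
    return 0
}

valid_job_name my_job && echo "my_job is acceptable for -N"
```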
Torque Environment Variables
The batch system defines many environment variables, which are available for use in batch scripts. The following tables list some of the more useful variables. Users must not redefine the value of any of these variables!
Variable Name | Meaning |
PBS_O_LOGNAME | Login name of user who executed qsub. |
PBS_O_HOME | Home directory of submitting user. |
PBS_O_WORKDIR | Directory in which qsub command was executed. Note that batch jobs begin execution in $PBS_O_HOME; many batch scripts execute "cd $PBS_O_WORKDIR" as first executable statement. |
PBS_O_HOST | Hostname of system on which qsub was executed. This is typically a Carver login node. |
PBS_JOBID | Unique identifier for this job; important for tracking job status. |
PBS_ENVIRONMENT | Set to "PBS_BATCH" for scripts submitted as batch jobs; "PBS_INTERACTIVE" for interactive batch jobs ("qsub -I ..."). |
PBS_O_QUEUE | Name of submit queue. |
PBS_QUEUE | Name of execution queue. |
PBS_JOBNAME | Name of this job. |
PBS_NODEFILE | Name of file containing list of nodes assigned to this job. |
PBS_NUM_NODES | Number of nodes assigned to this job. |
PBS_NUM_PPN | Value of "ppn" (processes per node) for this job. |
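These variables can replace hard-coded values in a batch script. As a small sketch, the helper below derives the MPI task count from PBS_NUM_NODES and PBS_NUM_PPN; the function name total_tasks and the fallback value of 1 (used only when the script runs outside a batch job) are assumptions for illustration.

```shell
# total_tasks
# Print nodes * ppn using the Torque variables listed above.
# Falls back to 1 for each value when run outside a batch job.
total_tasks() {
    echo $(( ${PBS_NUM_NODES:-1} * ${PBS_NUM_PPN:-1} ))
}

# Inside a batch script this could stand in for a hard-coded count:
#   mpirun -np "$(total_tasks)" ./my_executable
```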
Standard Output and Error
While your job is running, standard output (stdout) and standard error (stderr) are written to temporary "spool" files (for example: 123456-cvrsvc09-ib.OU and 123456-cvrsvc09-ib.ER) in the submit directory. If you merge stderr/stdout via the "#PBS -j eo" or "#PBS -j oe" option, then only one such spool file will appear.
These files will be updated in real-time while the job is running, allowing you to use them for job monitoring. It is important that you do not modify, remove or rename these spool files while the job is still running!
After your batch job completes, the above files will be renamed to the corresponding stdout/stderr files (for example: my_job.o123456 and my_job.e123456).
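To locate these files from a script, the final names can be derived from the job name and the job_id. The sketch below assumes the default naming shown above, in which only the numeric part of the job_id appears in the file name; output_files is a hypothetical helper.

```shell
# output_files JOB_NAME JOB_ID
# Print the default stdout and stderr file names, one per line.
output_files() {
    local job_name="$1"
    local job_id="$2"
    local seq="${job_id%%.*}"   # drop everything after the first dot
    printf '%s.o%s\n%s.e%s\n' "$job_name" "$seq" "$job_name" "$seq"
}

output_files my_job 123456.cvrsvc09-ib
```

If -o, -e, or -j is given in the batch script, the names will differ from what this helper prints.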
Running Serial Jobs
A serial job is one that only requires a single computational core. The serial queue is specifically configured to run serial jobs. It runs on 80 12-core Westmere nodes, each having 48GB of memory.
Serial jobs share nodes, rather than having exclusive access. Multiple jobs will be scheduled on an available node until either all cores are in use, or until there is not enough memory available for additional processes on that node. This last characteristic makes it important to request sufficient memory for your serial job. See Carver Memory Considerations for more information.
Serial jobs are not charged for an entire node; they are only charged for a single core. See Usage Charging for more information.
The following script requests a single Westmere core and 10GB of memory for 12 hours:
#PBS -q serial
#PBS -l walltime=12:00:00
#PBS -l pvmem=10GB
./a.out
Note that it is not necessary to specify "-l nodes=1:ppn=1" for serial jobs. Also note that if values other than 1 are specified for these attributes, the serial job will never start.
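To see how the memory request affects node sharing, the following back-of-the-envelope sketch estimates how many serial jobs with a given pvmem could share one node. It assumes the full 48GB of a 12-core serial node is available to jobs, which is optimistic because the operating system needs some memory as well; serial_jobs_per_node is a hypothetical helper.

```shell
# serial_jobs_per_node PVMEM_GB [NODE_MEM_GB] [CORES]
# Estimate how many serial jobs fit on one node: limited either by cores
# or by memory. Defaults (48GB, 12 cores) come from the text above; treating
# all 48GB as usable is a simplifying assumption.
serial_jobs_per_node() {
    local pvmem_gb="$1"
    local node_mem_gb="${2:-48}"
    local cores="${3:-12}"
    local by_mem=$(( node_mem_gb / pvmem_gb ))
    if [ "$by_mem" -lt "$cores" ]; then
        echo "$by_mem"
    else
        echo "$cores"
    fi
}

# Jobs requesting 10GB each are limited by memory, not by cores
serial_jobs_per_node 10
```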
Running Multiple Parallel Jobs Sequentially
Make sure that the total number of nodes requested is sufficient for the largest (node-count) application, and that the requested walltime is sufficient for all applications combined. Note that the repository charge for the batch job is based on the total number of nodes requested, regardless of whether any particular application uses all those nodes.
#PBS -q regular
#PBS -l nodes=16:ppn=8
#PBS -l walltime=4:00:00
#PBS -N my_job
#PBS -e my_job.$PBS_JOBID.err
#PBS -o my_job.$PBS_JOBID.out
#PBS -V
cd $PBS_O_WORKDIR
mpirun -np 128 ./my_executable1
mpirun -np 32 ./my_executable2
mpirun -np 64 ./my_executable3
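The node count for such a script is set by its largest application. A minimal sketch of that calculation, assuming every application uses the same ppn value (nodes_needed is a hypothetical helper):

```shell
# nodes_needed PPN NP [NP ...]
# Print the number of nodes required by the largest task count,
# rounding up when the count is not a multiple of PPN.
nodes_needed() {
    local ppn="$1"
    shift
    local max=0
    local np
    for np in "$@"; do
        if [ "$np" -gt "$max" ]; then
            max="$np"
        fi
    done
    echo $(( (max + ppn - 1) / ppn ))
}

# Task counts from the script above, with ppn=8
nodes_needed 8 128 32 64
```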
Running Multiple Parallel Jobs Simultaneously
If you need to run many jobs that require approximately the same run time, you can bundle them and run them in one job script. For example, to run 8 jobs simultaneously with 4 cores per job (you need 4 nodes in total), here is a sample job script:
#!/bin/bash -l
#PBS -N test_multi
#PBS -q debug
#PBS -l nodes=4:ppn=8,walltime=30:00
#PBS -j oe
#PBS -V
cd $PBS_O_WORKDIR
# Run 8 jobs simultaneously, each with 4 cores
num_jobs=8
tasks_per_job=4
# Assume jobs run in separate directories, job1, job2, ...
for i in $(seq $num_jobs)
do
    cd job$i
    # Write hostfile for the i-th job to use
    lstart=$(( (i - 1) * tasks_per_job + 1 ))
    lend=$(( lstart + tasks_per_job - 1 ))
    sed -n "${lstart},${lend}p" < "$PBS_NODEFILE" > nodefile
    mpirun -np $tasks_per_job -hostfile nodefile ./a.out >& job$i.out &
    cd ..
done
wait
Please note that it is important to pass the hostfile to mpirun for each job; otherwise all jobs will run on the same first 4 cores, which will overload the node and will likely crash it. The final "wait" command is also necessary; otherwise the batch script will exit right after launching the background jobs, and all of them will be terminated.
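Because each job takes its own slice of $PBS_NODEFILE, it can help to confirm before the loop that the file has enough entries for all jobs. The following is a sketch of such a check; check_nodefile is a hypothetical helper, not something provided by Torque.

```shell
# check_nodefile NODEFILE NUM_JOBS TASKS_PER_JOB
# Return 0 if NODEFILE has at least NUM_JOBS * TASKS_PER_JOB entries;
# otherwise print a message to stderr and return 1.
check_nodefile() {
    local nodefile="$1"
    local num_jobs="$2"
    local tasks_per_job="$3"
    local have need
    have=$(grep -c '' "$nodefile")          # number of lines in the file
    need=$(( num_jobs * tasks_per_job ))
    if [ "$have" -lt "$need" ]; then
        echo "nodefile has $have entries, need $need" >&2
        return 1
    fi
}

# In the batch script, before the loop:
#   check_nodefile "$PBS_NODEFILE" "$num_jobs" "$tasks_per_job" || exit 1
```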
If you want to run multiple hybrid MPI+OpenMP jobs at the same time, here is a sample job script. It runs 8 hybrid jobs simultaneously, each with 2 MPI tasks and 4 threads per task (so each job runs on 1 node, and you need 8 nodes in total).
#!/bin/bash -l
#PBS -N test_hybrid
#PBS -q debug
#PBS -l nodes=8:ppn=2,walltime=30:00
#PBS -l pvmem=10GB
#PBS -j oe
#PBS -V
cd $PBS_O_WORKDIR
# Run 8 jobs simultaneously, each running 2 MPI tasks with 4 threads per task
num_jobs=8
tasks_per_job=2
threads_per_task=4
export OMP_NUM_THREADS=$threads_per_task
# Assume jobs run in separate directories, job1, job2, ...
for i in $(seq $num_jobs)
do
    cd job$i
    # Write hostfile for the i-th job to use
    lstart=$(( (i - 1) * tasks_per_job + 1 ))
    lend=$(( lstart + tasks_per_job - 1 ))
    sed -n "${lstart},${lend}p" < "$PBS_NODEFILE" > nodefile
    mpirun -np $tasks_per_job -bysocket -bind-to-socket -hostfile nodefile ./a.out >& job$i.out &
    cd ..
done
wait
Again, it is important to pass the hostfile on the mpirun command line to spread your tasks over the nodes allocated to your job. Please also note that the Torque directive #PBS -l pvmem=10GB is necessary to give each MPI task 4 times the default per-core memory (2.5GB); otherwise the 4 threads in each MPI task will share the default 2.5GB instead of the available 10GB. For more information about using memory properly, please refer to our Memory Considerations page. This job script is suited to running 4 threads per task. To run hybrid jobs with other task-to-thread ratios, please refer to our MPI/OpenMP page for general instructions.
As a last note, you should be aware that the contents of the batch nodefile, $PBS_NODEFILE, depend on the ppn value of the Torque directive #PBS -l nodes. We recommend working out a more robust and convenient way to generate the nodefile for each of your multiple jobs.
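One possible approach, sketched below, is to collapse the nodefile to one entry per node and then build each job's hostfile from whole nodes, so the result no longer depends on ppn. The function name job_hostfile and its argument order are assumptions for this example, and it relies on mpirun treating repeated host names in a hostfile as additional slots.

```shell
# job_hostfile NODEFILE JOB_INDEX NODES_PER_JOB TASKS_PER_NODE
# Print a hostfile for the JOB_INDEX-th job (1-based): NODES_PER_JOB distinct
# nodes, each listed TASKS_PER_NODE times, whatever ppn value was requested.
job_hostfile() {
    local nodefile="$1"
    local i="$2"
    local nodes_per_job="$3"
    local tasks_per_node="$4"
    local first=$(( (i - 1) * nodes_per_job + 1 ))
    local last=$(( first + nodes_per_job - 1 ))
    # Keep the first occurrence of each node (order preserved), select this
    # job's nodes, then repeat each node once per task.
    awk '!seen[$0]++' "$nodefile" |
        sed -n "${first},${last}p" |
        awk -v n="$tasks_per_node" '{ for (k = 0; k < n; k++) print }'
}

# In the hybrid script above, the sed line could become:
#   job_hostfile "$PBS_NODEFILE" "$i" 1 "$tasks_per_job" > nodefile
```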
Job Steps and Dependencies
There is a batch option, depend=dependency_list, for job dependencies. The most commonly used dependency_list is afterok:jobid[:jobid...], which means the job just submitted will be executed only after the listed job(s) have terminated without an error. Another option is afterany:jobid[:jobid...], which means the job just submitted will be executed only after the listed job(s) have terminated, with or without an error. The second option is useful for restart runs, where the first job is often expected to run until it exceeds its wall clock limit.
For example, to run batch job2 only after batch job1 succeeds,
carver% qsub job1.pbs
123456.cvrsvc09-ib
carver% qsub -W depend=afterok:123456.cvrsvc09-ib job2.pbs
123457.cvrsvc09-ib
As with most batch options, the dependency can be included in a batch script rather than on the command line:
carver% qsub job1.pbs
123456.cvrsvc09-ib
carver% qsub job2.pbs
123457.cvrsvc09-ib
where the batch script job2.pbs contains the following line:
#PBS -W depend=afterok:123456.cvrsvc09-ib
The second job will be in batch "Held" status until job1 has run successfully. Note that job2 has to be submitted while job1 is still in the batch system, either running or in the queue. If job1 has exited before job2 is submitted, job2 will not be released from the "Held" status.
It is also possible for the first job to submit the second (dependent) job by employing the Torque environment variable "PBS_JOBID":
#PBS -q regular
#PBS -l nodes=4:ppn=8
#PBS -l walltime=0:30:00
#PBS -j oe
cd $PBS_O_WORKDIR
qsub -W depend=afterok:$PBS_JOBID job2.pbs
mpirun -np 32 ./a.out
Please refer to qsub man page for other -W depend=dependency_list options including afterany:jobid[:jobid...], afternotok:jobid[:jobid...], before:jobid[:jobid...], etc.
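When a job depends on several earlier jobs, the dependency string can be assembled from a list of job_ids. The following is a small sketch; depend_arg is a hypothetical helper that only joins its arguments in the type:jobid[:jobid...] form described above.

```shell
# depend_arg TYPE JOBID [JOBID ...]
# Print a "depend=TYPE:jobid[:jobid...]" string for use with qsub -W.
depend_arg() {
    local type="$1"
    shift
    local list="$type"
    local id
    for id in "$@"; do
        list="$list:$id"
    done
    echo "depend=$list"
}

# Example (job3.pbs is a placeholder script name):
#   qsub -W "$(depend_arg afterok 123456.cvrsvc09-ib 123457.cvrsvc09-ib)" job3.pbs
```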
Sample Scripts for Submitting Chained Dependency Jobs
Below is a simple batch script, 'runit.pbs', for submitting three chained jobs in total (job_number_max=3). It sets the job sequence number (job_number) to 1 if this variable is undefined (that is, in the first job). When the value is less than job_number_max, the current job submits the next job. The value of job_number is incremented by 1, and the new value is provided to the subsequent job.
#!/bin/bash
#PBS -q regular
#PBS -l nodes=1
#PBS -l walltime=0:05:00
#PBS -j oe
: ${job_number:="1"} # set job_number to 1 if it is undefined
job_number_max=3
JOBID="${PBS_JOBID}"
cd $PBS_O_WORKDIR
echo "hi from ${PBS_JOBID}"
if [[ ${job_number} -lt ${job_number_max} ]]
then
    (( job_number++ ))
    next_jobid=$(qsub -v job_number=${job_number} -W depend=afterok:${JOBID} runit.pbs)
    echo "submitted ${next_jobid}"
fi
sleep 15
echo "${PBS_JOBID} done"
Using the above script, three batch jobs are submitted:
carver% qsub runit.pbs
123456.cvrsvc09-ib
carver% ls runit.pbs.o*
-rw------- 1 xxxxx xxxxx 899 2011-03-09 09:27 runit.pbs.o123456
-rw------- 1 xxxxx xxxxx 949 2011-03-09 09:28 runit.pbs.o123457
-rw------- 1 xxxxx xxxxx 949 2011-03-09 09:28 runit.pbs.o123458
carver% cat runit.pbs.o123456
...
hi from 123456.cvrsvc09-ib
submitted 123457.cvrsvc09-ib
123456.cvrsvc09-ib done
...
carver% cat runit.pbs.o123457
...
hi from 123457.cvrsvc09-ib
submitted 123458.cvrsvc09-ib
123457.cvrsvc09-ib done
...
carver% cat runit.pbs.o123458
...
hi from 123458.cvrsvc09-ib
123458.cvrsvc09-ib done
...
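The counter logic in runit.pbs can be tried without the batch system. The dry run below replays the same default-then-increment pattern with echo in place of qsub; simulate_chain is a hypothetical name, and nothing is submitted.

```shell
# simulate_chain MAX
# Replay the job_number logic of runit.pbs, printing what each job would do.
simulate_chain() {
    local job_number_max="$1"
    local job_number=
    while :; do
        : "${job_number:=1}"    # first job: default to 1 when unset or empty
        if [ "$job_number" -lt "$job_number_max" ]; then
            echo "job $job_number submits job $(( job_number + 1 ))"
            job_number=$(( job_number + 1 ))
        else
            echo "job $job_number ends the chain"
            break
        fi
    done
}

simulate_chain 3
```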