IBM Watson Machine Learning CE -> Open CE¶
Getting Started¶
IBM Watson Machine Learning Community Edition is provided on Summit
through the module ibm-wml-ce
, and after version 1.7.0
, the
module has been renamed to open-ce
, which is built based on the
Open Cognitive Environment.
To access the latest analytics packages use the module load
command:
module load open-ce
Loading a specific version of the module is recommended to future-proof scripts against software updates. The following commands can be used to find and load specific module versions:
[user@login2.summit ~]$ module avail ibm-wml-ce
------------------------- /sw/summit/modulefiles/core --------------------------
ibm-wml-ce/1.6.1-1 ibm-wml-ce/1.6.2-1 ibm-wml-ce/1.6.2-5 ibm-wml-ce/1.7.1.a0-0
ibm-wml-ce/1.6.1-2 ibm-wml-ce/1.6.2-2 ibm-wml-ce/1.7.0-1
ibm-wml-ce/1.6.1-3 ibm-wml-ce/1.6.2-3 ibm-wml-ce/1.7.0-2
ibm-wml-ce/1.6.2-0 ibm-wml-ce/1.6.2-4 ibm-wml-ce/1.7.0-3 (D)
...
[user@login2.summit ~]$ module load ibm-wml-ce/1.7.0-3
[user@login2.summit ~]$ module avail open-ce
------------------------- /sw/summit/modulefiles/core --------------------------
open-ce/0.1-0
[user@login2.summit ~]$ module load open-ce/0.1-0
For more information on loading modules, including loading specific verions, see: Environment Management with Lmod
This will activate a conda environment which is pre-loaded with the following packages, and their dependencies:
Environment | ibm-wml-ce/1.6.1 | ibm-wml-ce/1.7.0 | open-ce/0.1 |
Package | IBM DDL 1.5.0 | IBM DDL 1.5.1 | |
Tensorflow 1.15 | Tensorflow 2.1 | Tensorflow 2.3 | |
Pytorch 1.2.0 | Pytorch 1.3.1 | Pytorch 1.6.0 | |
Caffe(IBM-enhanced) 1.0.0 | Caffe (IBM-enhanced) 1.0.0 | ||
Horovod v0.18.2 (IBM-DDL Backend) | Horovod v0.19 (NCCL Backend) | Horovod v0.19.5 (NCCL Backend) | |
Complete List | 1.6.2 Software Packages | 1.7.0 Software Packages | Open-CE Software Packages |
Comparing to IBM WML CE, Open-CE no longer has IBM DDL
, Caffe(IBM-enhanced)
, IBM SnapML
, Nvidia Rapids, Apex
packages, and TensorFlow and PyTorch are not compiled with IBM Large Model Support (LMS). For standalone Keras users, please pip install keras
after module load open-ce
.
Running Distributed Deep Learning Jobs¶
The IBM ddlrun
tool has been deprecated. The recommended tool for
launching distributed deep learning jobs on Summit is jsrun
. When
launching distributed deep learning jobs the primary concern for most
distribution methods is that each process needs to have access to
all GPUs on the node it’s running on. The following command should
correctly launch most DDL scripts:
jsrun -r1 -g6 -a6 -c42 -bpacked:7 <SCRIPT>
Flags | Description |
---|---|
-r1 |
1 resource set per host |
-g6 |
6 GPUs per resource set |
-a6 |
6 MPI tasks per resource set |
-c42 |
42 CPU cores per resource set |
-bpacked:7 |
Binds each task to 7 contiguous CPU cores |
Basic Distributed Deep Learning BSUB Script¶
The following bsub script will run a distributed Tensorflow resnet50 training job across 2 nodes.
#BSUB -P <PROJECT>
#BSUB -W 0:10
#BSUB -nnodes 2
#BSUB -q batch
#BSUB -J mldl_test_job
#BSUB -o /ccs/home/<user>/job%J.out
#BSUB -e /ccs/home/<user>/job%J.out
module load open-ce
jsrun -bpacked:7 -g6 -a6 -c42 -r1 python $CONDA_PREFIX/horovod/examples/tensorflow2_synthetic_benchmark.py
bsub
is used to launch the script as follows:
bsub script.bash
For more information on bsub
and job submission
please see: Running Jobs.
For more information on jsrun
please see:
Job Launcher (jsrun).
Setting up Custom Environments¶
The IBM-WML-CE and Open-CE conda environments are read-only. Therefore, users cannot install any additional packages that may be needed. If users need any additional conda or pip packages, they can clone the IBM-WML-CE or Open-CE conda environment into their home directory and then add any packages they need.
Note
The conda environment includes a module revision number, the ‘X’ in
ibm-wml-ce-1.7.0-X
. The name of the active environment can be found in
the prompt string, or conda env list
can be used to see what conda
environments are available.
$ module load ibm-wml-ce
(ibm-wml-ce-1.7.0-X) $ conda create --name cloned_env --clone ibm-wml-ce-1.7.0-X
(ibm-wml-ce-1.7.0-X) $ conda activate cloned_env
(cloned_env) $
By default this should create the cloned environment in
/ccs/home/${USER}/.conda/envs/cloned_env
.
To activate the new environment you should still load the module first. This will ensure that all of the conda settings remain the same.
$ module load ibm-wml-ce
(ibm-wml-ce-1.7.0-X) $ conda activate cloned_env
(cloned_env) $
Best Distributed Deep Learning Performance¶
Performance Profiling¶
There are several tools that can be used to profile the performance of a deep learning job. Below are links to several tools that are available as part of the ibm-wml-ce and open-ce modules.
NVIDIA Profiling Tools¶
The ibm-wml-ce and open-ce modules contain the nvprof profiling tool. It can be used to profile work that is running on GPUs. It will give information about when different CUDA kernels are being launched and how long they take to complete. For more information on using the NVIDA profiling tools on Summit, please see these slides.
Horovod Timeline¶
Horovod comes with a tool called Timeline which can help analyze the performance of Horovod. This is particularly useful when trying to scale a deep learning job to many nodes. The Timeline tool can help pick various options that can improve the performance of distributed deep learning jobs that are using Horovod. For more information, please see Horovod’s documentation.
PyTorch’s Autograd Profiler¶
PyTorch provides a builtin profiler that can be used to find bottlenecks within a training job. It is most useful for profiling the performance of a job running on a single GPU. For more information on using PyTorch’s profiler, see PyTorch’s documentation.
Reserving Whole Racks¶
Most users will get good performance using LSF basic job submission, and
specifying the node count with -nnodes N
. However, users trying
to squeeze out the final few percent of performance can use the following
technique.
When making node reservations for DDL jobs, it can sometimes improve performance to reserve nodes in a rack-contiguous manner.
In order to instruct BSUB to reserve nodes in the same rack, expert mode must
be used (-csm y
), and the user needs to explicitly specify the reservation
string. For more information on Expert mode see: Easy Mode vs. Expert Mode
The following BSUB arguments and reservation string instruct bsub
to
reserve 2 compute nodes within the same rack:
#BSUB -csm y
#BSUB -n 85
#BSUB -R 1*{select[((LN)&&(type==any))]order[r15s:pg]span[hosts=1]cu[type=rack:pref=config]}+84*{select[((CN)&&(type==any))]order[r15s:pg]span[ptile=42]cu[type=rack:maxcus=1]}
-csm y
enables ‘expert mode’.
-n 85
the total number of slots must be requested, as -nnodes
is not
compatible with expert mode.
We can break the reservation string down to understand each piece.
The first term is needed to include a launch node in the reservation.
1*{select[((LN)&&(type==any))]order[r15s:pg]span[hosts=1]cu[type=rack:pref=config]}
The second term specifies how many compute slots and how many racks.
+84*{select[((CN)&&(type==any))]order[r15s:pg]span[ptile=42]cu[type=rack:maxcus=1]}
- Here the
84
slots represents 2 compute nodes. Each compute node has 42 compute slots. - The
maxcus=1
specifies that the nodes can come from at most 1 rack.
- Here the
Example¶
The following graph shows the scaling performance of the
tf_cnn_benchmarks
implementation of the Resnet50 model
running on Summit during initial benchmark testing.
The following LSF script can be used to reproduce the results for 144 nodes:
#BSUB -P <PROJECT>
#BSUB -W 1:00
#BSUB -csm y
#BSUB -n 6049
#BSUB -R "1*{select[((LN) && (type == any))] order[r15s:pg] span[hosts=1] cu[type=rack:pref=config]}+6048*{select[((CN) && (type == any))] order[r15s:pg] span[ptile=42] cu[type=rack:maxcus=8]}"
#BSUB -q batch
#BSUB -J <PROJECT>
#BSUB -o /ccs/home/user/job%J.out
#BSUB -e /ccs/home/user/job%J.out
module load ibm-wml-ce/1.6.2-2
ddlrun --nodes 18 --racks 4 --aisles 2 python $CONDA_PREFIX/tf_cnn_benchmarks/tf_cnn_benchmarks.py \
--variable_update=horovod\
--model=resnet50 \
--num_gpus=1 \
--batch_size=256 \
--num_batches=100 \
--num_warmup_batches=10 \
--data_name=imagenet \
--allow_growth=True \
--use_fp16
Troubleshooting Tips¶
Full command¶
The output from ddlrun
includes the exact command used to launch the
distributed job. This is useful if a user wants to see exactly what ddlrun
is doing. The following is the first line of the output from the above script:
$ module load ibm-wml-ce
(ibm-wml-ce-1.6.1-1) $ ddlrun python $CONDA_PREFIX/tf_cnn_benchmarks/tf_cnn_benchmarks.py --variable_update=ddl --model=resnet50
+ /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/xl-16.1.1-3/spectrum-mpi-10.3.0.1-20190611-aqjt3jo53mogrrhcrd2iufr435azcaha/bin/mpirun \
-x LSB_JOBID -x PATH -x PYTHONPATH -x LD_LIBRARY_PATH -x LSB_MCPU_HOSTS -x NCCL_LL_THRESHOLD=0 -x NCCL_TREE_THRESHOLD=0 \
-disable_gdr -gpu --rankfile /tmp/DDLRUN/DDLRUN.xoObgjtixZfp/RANKFILE -x "DDL_OPTIONS=-mode p:6x2x1x1 " -n 12 \
-mca plm_rsh_num_concurrent 12 -x DDL_HOST_PORT=2200 -x "DDL_HOST_LIST=g28n14:0,2,4,6,8,10;g28n15:1,3,5,7,9,11" bash \
-c 'source /sw/summit/ibm-wml-ce/anaconda-base/etc/profile.d/conda.sh && conda activate /sw/summit/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.6.1-1 \
> /dev/null 2>&1 && python /sw/summit/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.6.1-1/ddl-tensorflow/examples/mnist/mnist-env.py'
...
Problems Distributing Pytorch with Multiple Data Loader Workers¶
Problem¶
It is common to encounter segmenation faults or deadlocks when running distributed PyTorch scripts that make use of a DataLoader with multiple workers. A typical segfault may look something like the following:
ERROR: Unexpected segmentation fault encountered in worker.
Traceback (most recent call last):
File "/gpfs/anaconda3/envs/powerai/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 724, in _try_get_data
data = self._data_queue.get(timeout=timeout)
File "/gpfs/anaconda3/envs/powerai/lib/python3.7/queue.py", line 179, in get
self.not_empty.wait(remaining)
File "/gpfs/anaconda3/envs/powerai/lib/python3.7/threading.py", line 300, in wait
gotit = waiter.acquire(True, timeout)
File "/gpfs/anaconda3/envs/powerai/lib/python3.7/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
_error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 150462) is killed by signal: Segmentation fault.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "pytorch_imagenet_resnet50.py", line 277, in <module>
train(epoch)
File "pytorch_imagenet_resnet50.py", line 169, in train
for batch_idx, (data, target) in enumerate(train_loader):
File "/gpfs/anaconda3/envs/powerai/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 804, in __next__
idx, data = self._get_data()
File "/gpfs/anaconda3/envs/powerai/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 761, in _get_data
success, data = self._try_get_data()
File "/gpfs/anaconda3/envs/powerai/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 737, in _try_get_data
raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str))
RuntimeError: DataLoader worker (pid(s) 150462) exited unexpectedly
Solution¶
The solution is to change the multiprocessing start method to forkserver
(Python 3 only) or
spawn
. The forkserver
method tends to give better performance. This Horovod PR
has examples of changing scripts to use the forkserver
method.
See the PyTorch documentation for more information.