OMB MPI Tests

Description

This is a collection of independent MPI message-passing performance microbenchmarks developed and written at The Ohio State University. It includes traditional benchmarks and performance measures such as latency, bandwidth, and host overhead, and can be used on both traditional and GPU-enhanced nodes. For the purposes of the Trinity / NERSC-8 acquisition, only the following tests are included (OSU test name: performance characteristic):

  • osu_latency: Ping-pong Latency
  • osu_multi_lat: Multi-pair Ping-pong Latency
  • osu_bw: Ping-pong Bandwidth
  • osu_bibw: Bi-directional Bandwidth
  • osu_mbw_mr: Multiple Bandwidth and Message Rate (throughput)
  • osu_barrier, osu_allreduce, osu_allgather: Collectives

More details are in the README.OSU file.

Download

osu-micro-benchmarks-3.7.tar
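
After downloading, unpack the tarball and change into the resulting source directory (the directory name below is an assumption; it normally matches the tarball name):

tar xf osu-micro-benchmarks-3.7.tar
cd osu-micro-benchmarks-3.7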

How to Build

If mpicc is in your PATH, you can simply type:

./configure
make
make install 

If the compiler wrapper is not in your PATH, or you would like to use a particular version, you can tell configure by setting CC:

./configure CC=/path/to/mpicc
make
make install 

The build process will create several executables not used in these tests, such as the UPC and OpenSHMEM benchmarks; it does not matter if the builds for these executables fail, because only the MPI tests are of interest here.
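
If a failure in one of those non-MPI builds stops make, one option (assuming GNU make) is to ask make to keep going past errors so that the MPI executables are still produced:

make -k        # GNU make: continue building remaining targets after an error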

After 'make install', all executables will be placed in several directories under libexec.  The use of the --prefix option to configure is recommended, as in the sketch below.
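
For example, a minimal sketch of a prefixed install (the prefix path is illustrative, and the exact subdirectory layout under libexec can vary with the OMB version):

./configure CC=mpicc --prefix=$HOME/osu-benchmarks
make
make install
# The MPI tests typically land in subdirectories such as
# $HOME/osu-benchmarks/libexec/osu-micro-benchmarks/mpi/
ls $HOME/osu-benchmarks/libexec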

CUDA Extensions to OMB

CUDA Extensions to OMB can be enabled by configuring the benchmark suite with the --enable-cuda option, as shown below. The MPI library used must support MPI_Send / MPI_Recv from buffers in GPU device memory.

./configure CC=/path/to/mpicc --enable-cuda --with-cuda-include=/path/to/cuda/include --with-cuda-lib=/path/to/cuda/lib
make

The following benchmarks have been extended to evaluate the performance of MPI_Send/MPI_Recv to and from buffers on NVIDIA GPU devices:

osu_bibw - Bidirectional Bandwidth Test
osu_bw - Bandwidth Test
osu_latency - Latency Test

These extensions are enabled when the benchmark suite is configured with the --enable-cuda option, as shown above. Whether a process allocates its communication buffers on the GPU device or on the host can be controlled at run time.

How to Run

Some instructions for the tests are in the "Required Runs" section, below.  When running on GPU nodes, each of these benchmarks takes two input parameters. The first parameter indicates the location of the buffers at rank 0 and the second indicates the location of the buffers at rank 1. The value of each parameter can be either 'H' or 'D', indicating whether the buffers are to be allocated on the host or on the device, respectively. When no parameters are specified, the buffers are allocated on the host.

Examples:

- mpirun_rsh -np 2 -hostfile hostfile MV2_USE_CUDA=1 osu_latency D D

In this run, the latency test allocates buffers at both rank 0 and rank 1 on the GPU devices.  

- mpirun_rsh -np 2 -hostfile hostfile MV2_USE_CUDA=1 osu_bw D H

In this run, the bandwidth test allocates buffers at rank 0 on the GPU device and buffers at rank 1 on the host.

Setting GPU affinity

GPU affinity for processes should be set before MPI_Init is called. The process rank on a node is normally used to do this, and different MPI launchers expose this information through different environment variables. The benchmarks use an environment variable called LOCAL_RANK to get this information.

A script like the one below can be used to export this environment variable when using mpirun_rsh. It can be adapted to work with other MPI launchers and libraries.

#!/bin/bash
# mpirun_rsh (MVAPICH2) sets MV2_COMM_WORLD_LOCAL_RANK to the node-local
# rank of each process; expose it to the benchmark as LOCAL_RANK.
export LOCAL_RANK=$MV2_COMM_WORLD_LOCAL_RANK
# Replace this wrapper with the actual benchmark command and its arguments.
exec $*

A copy of this script is installed as get_local_rank alongside the benchmarks. It can be used as follows:

mpirun_rsh -np 2 -hostfile hostfile MV2_USE_CUDA=1 get_local_rank ./osu_latency D D
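
As an illustration of adapting it to another launcher (this wrapper is hypothetical and not part of the OMB distribution), a SLURM-based launch could take the node-local rank from SLURM_LOCALID, which srun sets for each task:

#!/bin/bash
# Hypothetical SLURM wrapper: srun sets SLURM_LOCALID to each task's
# node-local rank; pass it to the benchmark as LOCAL_RANK.
export LOCAL_RANK=$SLURM_LOCALID
exec "$@"

It could then be used as, e.g., srun -n 2 ./get_local_rank_slurm ./osu_latency D D (the script name is illustrative).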

Required Runs

  1. Interconnect Delivered Latency: two nodes, one core per node
    Using the osu_latency code provide the complete communications latency output for message sizes from 0 B to 128 B with one MPI process per node for the following configurations:
    - Worst-case node pairing over two nodes.
    - Best-case node pairing over two nodes.
    Make sure to provide a description justifying the choice of the specific node pairs for the best and worst-case results.
  2. Interconnect Delivered Latency: two nodes, multiple cores per node
    Using the osu_multi_lat code provide the complete communications latency output for message sizes 0 B to 128 B for 1, 2, 4, 8, and NCORE MPI processes per node for the following configurations:
    - Worst-case node pairing over two nodes
    - Best-case node pairing over two nodes
    where NCORE equals the total number of assignable cores on the node. 
  3. Interconnect Delivered Unidirectional Bandwidth
    Using the osu_bw code provide the complete communications bandwidth output for message sizes from 0 B to 4 MB for one MPI process per node for the following configurations:
    - Worst-case node pairing over two nodes.
    - Best-case node pairing over two nodes.
  4. Interconnect Delivered Bi-directional Bandwidth
    Using the osu_bibw code provide the complete communications bandwidth output for message sizes from 0 B to 4 MB for one MPI process per node for the following configurations:
    - Worst-case node pairing over two nodes.
    - Best-case node pairing over two nodes.
  5. All Gather Delivered Latency
    Using the osu_allgather code, provide complete output over all nodes in the system with one MPI task per core, using all cores.  Include the code's "-f" option, e.g., mpirun -n 48 osu_allgather -f
  6. All Reduce Delivered Latency
    Using the osu_allreduce code provide complete output over all nodes in the system with one MPI task per core.  The code will do this for the SUM operation.  Include the code's "-f" option, e.g., mpirun -n 48 osu_allreduce -f.
  7. Global Barriers Network Delivered Latency
    Using the osu_barrier code provide complete output over all nodes in the system with one MPI task per core.
  8. Interconnect Messaging Rate
    Make sure you review the README.OSU file for important instructions on running this test, in particular the need to use a block layout of MPI ranks.  Use the osu_mbw_mr code and run this test on two nodes.  Run it with a varying number of tasks per node; the set must include 1 and NCORE, and it should include enough additional configurations to capture the point at which the maximum message rate for 1-byte messages is reached, along with one or two configurations before that point and one or two after.  Example launch lines for all of the required runs are sketched below.
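
The following is an illustrative sketch only; the launcher, hostfile, and task counts are placeholders (a generic mpirun is assumed), so substitute your site's launcher syntax, node pairs, and core counts:

mpirun -n 2  -hostfile pairfile ./osu_latency      # run 1: latency over a chosen node pair
mpirun -n 16 -hostfile pairfile ./osu_multi_lat    # run 2: e.g. 8 tasks per node
mpirun -n 2  -hostfile pairfile ./osu_bw           # run 3: unidirectional bandwidth
mpirun -n 2  -hostfile pairfile ./osu_bibw         # run 4: bi-directional bandwidth
mpirun -n 48 ./osu_allgather -f                    # run 5: all nodes, one task per core
mpirun -n 48 ./osu_allreduce -f                    # run 6
mpirun -n 48 ./osu_barrier                         # run 7
mpirun -n 32 -hostfile pairfile ./osu_mbw_mr       # run 8: two nodes, vary tasks per node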

Authorship

The OMB (OSU Micro Benchmarks) software package is developed by team members of The Ohio State University's Network-Based Computing Laboratory (NBCL), headed by Professor Dhabaleswar K. (DK) Panda.