NERSC: Powering Scientific Discovery Since 1974

Reordering MPI Ranks

Introduction

When a parallel program runs, MPI tasks are assigned to compute cores.  Because compute nodes (each of which contains 24 cores) sit at different positions on the 3D torus network, communication time between tasks varies depending not only on node placement but also on the placement of each task within the allocated nodes.  This study explores how application performance changes when the placement of MPI tasks is changed across the nodes allocated to an application.

 

Methodology

One way to change MPI task placement on cores is to change the rank ordering, the order in which MPI tasks (or ranks) are assigned to cores.  When a parallel program is run on Hopper using the aprun command, the environment variable MPICH_RANK_REORDER_METHOD determines the order in which tasks are assigned to cores.

MPICH_RANK_REORDER_METHOD can be set to an integer from 0 to 3:

Rank Reorder Methods

  • 0: Round-robin
  • 1: SMP-style
  • 2: Folded-rank
  • 3: Custom ordering, read from a rank order file

Rank reorder method 1 (SMP-Style) is the default, i.e., all programs run with SMP-style rank ordering if MPICH_RANK_REORDER_METHOD is not set.
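To make the difference between the predefined orderings concrete, the following Python sketch models them as a mapping from rank to node. It is only an illustration, not part of MPI or aprun: the function names node_of_rank and placement are hypothetical, and the code assumes the usual descriptions of the methods (0 deals ranks across nodes round-robin, 1 fills each node before moving on, 2 sweeps across the nodes forward and then backward).

```python
def node_of_rank(rank, num_nodes, ranks_per_node, method):
    """Return the node index a rank lands on under a predefined method.

    Illustrative model only; the real assignment is made by the MPI library.
    """
    if method == 0:   # round-robin: deal ranks across nodes one at a time
        return rank % num_nodes
    if method == 1:   # SMP-style: fill each node before moving to the next
        return rank // ranks_per_node
    if method == 2:   # folded-rank: sweep forward, then backward, and so on
        sweep, position = divmod(rank, num_nodes)
        return position if sweep % 2 == 0 else num_nodes - 1 - position
    raise ValueError("method must be 0, 1, or 2")


def placement(num_nodes, ranks_per_node, method):
    """List the ranks assigned to each node, in node order."""
    nodes = [[] for _ in range(num_nodes)]
    for rank in range(num_nodes * ranks_per_node):
        node = node_of_rank(rank, num_nodes, ranks_per_node, method)
        nodes[node].append(rank)
    return nodes


# Small example: 8 ranks on 4 nodes with 2 cores each.
for method in (0, 1, 2):
    print(method, placement(4, 2, method))
```

In this small example, node 0 holds ranks 0 and 4 under round-robin, ranks 0 and 1 under SMP-style, and ranks 0 and 7 under folded-rank. Which layout communicates best depends on which ranks exchange the most data.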

Setting MPICH_RANK_REORDER_METHOD=3 tells aprun to read a custom rank order from the file named MPICH_RANK_ORDER in the current directory. CrayPAT's pat_report tool can generate recommended rank order files when run with the -Ompi_sm_rank_order flag. It generates two files, MPICH_RANK_ORDER.d and MPICH_RANK_ORDER.u; to use one of them, copy it to MPICH_RANK_ORDER.
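A custom order can also be written by hand or by a script. The sketch below assumes that the rank order file is a comma-separated list of rank numbers in placement order; that format, and the helper names rank_order_text and write_rank_order, are assumptions made for illustration and should be checked against the mpi man page before use.

```python
def rank_order_text(nodes):
    """Format a per-node placement as one comma-separated line of ranks.

    `nodes` is a list of lists: the ranks intended for each node, in node
    order. The file format used here is an assumption, not a specification.
    """
    ordered = [rank for node in nodes for rank in node]
    # Every rank from 0 to N-1 must appear exactly once.
    if sorted(ordered) != list(range(len(ordered))):
        raise ValueError("each rank from 0 to N-1 must appear exactly once")
    return ",".join(str(rank) for rank in ordered) + "\n"


def write_rank_order(nodes, path="MPICH_RANK_ORDER"):
    """Write the placement to the file that method 3 reads."""
    with open(path, "w") as handle:
        handle.write(rank_order_text(nodes))


# Example: put ranks 0 and 2 on the first node, ranks 1 and 3 on the second.
print(rank_order_text([[0, 2], [1, 3]]), end="")
```

Validating that every rank appears exactly once before writing the file catches the most common mistake in hand-edited orders, a duplicated or missing rank.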

More information can be found on the mpi man page. (Search for MPICH_RANK_REORDER_METHOD.)

Experiment

A series of benchmark programs were run with the different rank orders using the following procedure:

  • Build the application with the perftools module loaded.
  • Use pat_build to make a CrayPAT-instrumented version.
  • Run the instrumented version using the default rank order method.
  • Use pat_report to generate CrayPAT's two recommended rank orders, d and u.
  • Run the noninstrumented version with each of the five rank orders (three predefined and two generated).
  • Collect and analyze the run times, which each application records using its own method.
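The final run step of this procedure could be scripted along the following lines. This is a minimal sketch rather than the script used in the study: plan_runs and run_plan are hypothetical helpers, and the launch command (for example an aprun invocation with the site's usual options) is supplied by the caller.

```python
import os
import shutil
import subprocess


def plan_runs():
    """Describe the five runs: methods 0, 1, 2 and the two generated orders."""
    plans = [{"label": str(m), "method": m, "order_file": None} for m in (0, 1, 2)]
    for suffix in ("d", "u"):
        plans.append({
            "label": "3" + suffix,
            "method": 3,
            "order_file": "MPICH_RANK_ORDER." + suffix,
        })
    return plans


def run_plan(plan, command, workdir="."):
    """Run `command` in `workdir` with the plan's rank order settings."""
    if plan["order_file"] is not None:
        # Method 3 reads the file named MPICH_RANK_ORDER, so copy the
        # generated order into place before launching.
        shutil.copyfile(os.path.join(workdir, plan["order_file"]),
                        os.path.join(workdir, "MPICH_RANK_ORDER"))
    env = dict(os.environ, MPICH_RANK_REORDER_METHOD=str(plan["method"]))
    return subprocess.run(command, cwd=workdir, env=env, check=True)


for plan in plan_runs():
    print(plan["label"], plan["method"], plan["order_file"])
```

Calling run_plan(plan, command) for each plan, with command set to the caller's launch command as a list of arguments, runs the application once per rank order. The run times still have to be collected from each application's own output.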

 

Results and Analysis

Run time* (in seconds) by rank reorder method

Application        0        1        2       3d       3u
CAM              349      361      354      351      352
GTC            1,333    1,336    1,334    1,336    1,332
IMPACT-T         637      596      643      630      666
MAESTRO        1,933    1,981    1,939      N/A      N/A
MILC           1,809      996    1,583    1,293    1,315
PARATEC          460      408      442      498      485

* minimum of two independent runs

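The data show that the default rank ordering (method 1) is not the most efficient for every program. It gave the shortest run time for IMPACT-T, MILC, and PARATEC, but CAM and MAESTRO ran fastest with method 0, and GTC's run times differed by at most 4 seconds across all five orderings.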
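The comparison can be reproduced from the reported run times with a few lines of Python. The sketch below copies the run times from the results table and treats N/A as a missing value; the helper names fastest and relative_to_default are illustrative.

```python
# Run times in seconds from the results table; None marks N/A.
RUN_TIMES = {
    "CAM":      {"0": 349,  "1": 361,  "2": 354,  "3d": 351,  "3u": 352},
    "GTC":      {"0": 1333, "1": 1336, "2": 1334, "3d": 1336, "3u": 1332},
    "IMPACT-T": {"0": 637,  "1": 596,  "2": 643,  "3d": 630,  "3u": 666},
    "MAESTRO":  {"0": 1933, "1": 1981, "2": 1939, "3d": None, "3u": None},
    "MILC":     {"0": 1809, "1": 996,  "2": 1583, "3d": 1293, "3u": 1315},
    "PARATEC":  {"0": 460,  "1": 408,  "2": 442,  "3d": 498,  "3u": 485},
}
DEFAULT = "1"  # SMP-style, the default method


def fastest(times):
    """Return (method, seconds) for the shortest measured run time."""
    measured = {m: t for m, t in times.items() if t is not None}
    method = min(measured, key=measured.get)
    return method, measured[method]


def relative_to_default(times):
    """Return each measured run time as a multiple of the default's."""
    return {m: t / times[DEFAULT] for m, t in times.items() if t is not None}


for app, times in RUN_TIMES.items():
    method, seconds = fastest(times)
    ratio = seconds / times[DEFAULT]
    print(f"{app}: fastest with method {method} ({seconds} s, "
          f"{ratio:.3f}x the default run time)")
```

Expressing each time as a multiple of the default makes large effects stand out. For MILC, for example, round-robin ordering takes roughly 1.8 times as long as the default.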

Conclusions

  • The default rank order method is generally the best.
  • Hopper users may be able to increase the efficiency of their programs by trying different rank order methods.
  • Experiments with six applications showed no meaningful benefit from CrayPAT's custom rank orders, so there is generally no need to use them.
  • Users wishing to experiment with different rank orders on their own programs may follow the procedure used in this study.