


MPI collective functions such as MPI_Allreduce have been found to produce incorrect numerical results when called with MPI_IN_PLACE in a Fortran code, but only when the code is run with IPM. If you use IPM, take care to verify that your results are correct. Developers are currently working on a fix for the problem.

Description and Overview

IPM is a portable profiling infrastructure that provides a high-level report on the execution of a parallel job. IPM reports hardware counter data, MPI function timings, and memory usage. It provides a low-overhead means of generating scaling studies or performance data for ERCAP submissions. When you run a job with the IPM module loaded, a performance summary (see below) is written to stdout, and a web-accessible summary of all your IPM jobs is also available. The two main objectives of IPM are ease of use and scalability in performance analysis.


% module load ipm

On HPC architectures that support shared libraries, that's all you need to do. Once the module is loaded, you can run as you normally would and get a performance profile once the job has successfully completed. You do not need to relink your code. For static executables, and on architectures that do not support shared libraries, a relink is required: load the ipm module, add $(IPM) to your link line, and run as you normally would.

Using IPM on Hopper

You must link your code against IPM on Hopper. Here is a simple example compile and link for the default PGI programming environment.

% module load ipm
% ftn -c mycode.f90
% ftn -o mycode.x mycode.o $IPM

The $IPM reference needs to be the last argument on the link line. For use with Pathscale and GCC, use $IPM_GNU instead of $IPM.

Using IPM on Carver

On Carver, just load the ipm module and then run as normal.  For example:

% qsub -I -lnodes=1
% ftn -o flip flip.f90
% module load ipm
% mpirun -n 4 ./flip
Flipping coin 1000000 times on each CPU.
Processor 2 got 499945 heads.
Processor 3 got 499945 heads.
Processor 1 got 499945 heads.
Processor 0 got 499945 heads.
Heads came up 49.99450 percent of the time.
# command : ./flip
# start : Thu Mar 17 13:28:48 2011 host : c0347
# stop : Thu Mar 17 13:28:50 2011 wallclock : 1.04
# mpi_tasks : 4 on 1 nodes %comm : 0.02
# mem [GB] : 0.50 gflop/sec : 0.01
# : [total] <avg> min max
# wallclock : 4.15 1.04 1.04 1.04
# MPI : 0.00 0.00 0.00 0.00
# %wall :
# MPI : 0.02 0.00 0.03
# #calls :
# MPI : 24 6 6 6
# mem [GB] : 0.50 0.12 0.12 0.12
# ###################################################################
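
For scaling studies it can be convenient to pull individual numbers out of a saved report like the one above. The following is a minimal sketch, not part of IPM: `ipm_field` is a hypothetical helper, and it assumes the banner keeps the `name : value` layout shown above and that the job's standard output was saved to a file.

```shell
# Hypothetical helper (not provided by IPM): print the value that follows
# "NAME :" on the first banner line where that pair appears.
# Usage: ipm_field NAME FILE
ipm_field() {
  awk -v key="$1" '
    /^#/ {
      # Scan every token; a banner entry looks like: NAME : VALUE
      for (i = 1; i <= NF - 2; i++) {
        if ($i == key && $(i + 1) == ":") {
          print $(i + 2)
          exit
        }
      }
    }
  ' "$2"
}

# Example, assuming the job output was saved to a file named flip.out:
#   ipm_field wallclock flip.out
#   ipm_field %comm flip.out
```

Only the first match is printed, so the per-job wallclock in the header is returned rather than the aggregate row further down.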

Output and Results

Once the module has been loaded, each parallel code will, upon completion, print a concise report to standard output. In addition, detailed results are available from the Completed Jobs page the day after the job completes.

More detailed reports are also possible. For example:

# code : ./bin/cg.B.32 (completed)
# host : s05601/006035314C00_AIX mpi_tasks : 32 on 2 nodes
# start : 11/30/04/14:35:34 wallclock : 29.975184 sec
# stop : 11/30/04/14:36:00 %comm : 27.72
# gbytes : 6.65863e-01 total gflop/sec : 2.33478e+00 total
# [total] <avg> min max
# wallclock 953.272 29.7897 29.6092 29.9752
# user 837.25 26.1641 25.71 26.92
# system 60.6 1.89375 1.52 2.59
# mpi 264.267 8.25834 7.73025 8.70985
# %comm 27.7234 25.8873 29.3705
# gflop/sec 2.33478 0.0729619 0.072204 0.0745817
# gbytes 0.665863 0.0208082 0.0195503 0.0237541
# PM_FPU0_CMPL 2.28827e+10 7.15084e+08 7.07373e+08 7.30171e+08
# PM_FPU1_CMPL 1.70657e+10 5.33304e+08 5.28487e+08 5.42882e+08
# PM_FPU_FMA 3.00371e+10 9.3866e+08 9.27762e+08 9.62547e+08
# PM_INST_CMPL 2.78819e+11 8.71309e+09 8.20981e+09 9.21761e+09
# PM_LD_CMPL 1.25478e+11 3.92118e+09 3.74541e+09 4.11658e+09
# PM_ST_CMPL 7.45961e+10 2.33113e+09 2.21164e+09 2.46327e+09
# PM_TLB_MISS 2.45894e+08 7.68418e+06 6.98733e+06 2.05724e+07
# PM_CYC 3.0575e+11 9.55467e+09 9.36585e+09 9.62227e+09
# [time] [calls] <%mpi> <%wall>
# MPI_Send 188.386 639616 71.29 19.76
# MPI_Wait 69.5032 639616 26.30 7.29
# MPI_Irecv 6.34936 639616 2.40 0.67
# MPI_Barrier 0.0177442 32 0.01 0.00
# MPI_Reduce 0.00540609 32 0.00 0.00
# MPI_Comm_rank 0.00465156 32 0.00 0.00
# MPI_Comm_size 0.000145341 32 0.00 0.00

The amount of detail reported can be controlled using the options described in the next section.


The interface to IPM is through environment variables and MPI_Pcontrol. The environment variable interface is selected at execute/submit time, while the latter allows for dynamic control of IPM. The supported environment variables are described below. A description of the MPI_Pcontrol interface is included in the main IPM documentation.

IPM_REPORT
  terse (default): Aggregate wallclock time, memory usage, and flops are reported, along with the percentage of wallclock time spent in MPI calls.
  full: Each HPM counter is reported, as are all of wallclock, user, system, and MPI time. The contribution of each MPI call to the communication time is given.
  none: No report.
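
As a sketch of how the report level might be selected in a job script, the snippet below sets IPM_REPORT before the job is launched. The `set_ipm_report` function is a hypothetical convenience wrapper, not part of IPM; it only guards against values other than the three listed above.

```shell
# Hypothetical helper (not part of IPM): accept only the three
# documented IPM_REPORT values and export the chosen one.
set_ipm_report() {
  case "$1" in
    terse|full|none)
      export IPM_REPORT="$1"
      ;;
    *)
      echo "unsupported IPM_REPORT value: $1" >&2
      return 1
      ;;
  esac
}

# Request the detailed report, then launch the job as usual.
set_ipm_report full
```

Because the variable is read at execute/submit time, it needs to be set before the MPI launcher is invoked.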


IPM can also generate a detailed report in the form of an XML file. By default, this file is placed in a system directory. The data is imported to the web and becomes available in the MyNERSC section of the website 24 hours after your job completes. You can override the default location by setting the IPM_LOGDIR environment variable (e.g., "export IPM_LOGDIR=." in bash). Additionally, setting the IPM_LOG environment variable to "full" records additional information, as the "full" option does for IPM_REPORT above.
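
A minimal bash sketch of these two settings follows. The directory name `ipm_logs` is an illustrative assumption, not an IPM default; an absolute path is used here so the location does not depend on where the job later changes directory.

```shell
# "ipm_logs" is an illustrative directory name, not an IPM default.
mkdir -p "$PWD/ipm_logs"

# Write the XML log here instead of the system directory.
export IPM_LOGDIR="$PWD/ipm_logs"

# Ask for the more detailed log.
export IPM_LOG=full
```

Both variables should be set before the job is launched.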


Package  Platform  Category               Version  Module          Install Date  Date Made Default
IPM      carver    libraries/performance  0.983    ipm/0.983       2012-02-14
IPM      carver    libraries/performance  2.00     ipm-intel/2.00  2012-02-14    2012-02-14
IPM      carver    libraries/performance  2.00     ipm/2.00        2012-02-14
IPM      hopper    libraries/performance  2.00     ipm/2.00        2011-02-24    2011-04-04
IPM      hopper    libraries/performance  2.00     ipmio/2.00      2011-03-28    2011-04-05
IPM      hopper    libraries/performance  2.00     ipm-ccm/2.00    2012-02-14    2012-02-14