FAQ

General Frequently Asked Questions

Table of Contents

Compiling/Linking

Running Jobs

Lustre File System

Runtime Messages/Errors

Miscellaneous

Compiling/Linking

Why does my compile fail with “/usr/bin/ld: cannot find -lsma”?

This error message occurs when using the mpi* compiler wrappers (mpicc, mpif90, etc.). These are intermediate wrappers that should not be called directly by users. Instead, users should compile with either ftn, cc, or CC. The ftn, cc, and CC scripts will do the necessary setup and then automatically call the appropriate intermediate scripts and ultimately the compilers.
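For example, instead of invoking mpicc or mpif90 directly, a typical build on this system would look like the following (file names are illustrative):

> cc -o my_app.x my_app.c
> CC -o my_app.x my_app.cpp
> ftn -o my_app.x my_app.f90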

Why does my compile fail with the message “relocation truncated to fit: R_X86_64_PC32”?

The default memory model for the PGI compilers is the “small” model. This requires that the object be smaller than 2 GB in size. The PGI compilers support the “medium” memory model, which allows objects to be larger than 2 GB. Unfortunately, for a code to use the medium memory model, all objects and static libraries must be compiled under the medium memory model. Several system libraries are not, so in general, executables on Jaguar must use the small memory model.

The “relocation truncated” error message occurs when an object file or executable is too large for the memory model. To work around this error, you should reduce the static memory usage for your code. Common ways to do this include the following:

  • Remove (either by deleting or via compiler directives) subroutines that are not used on the XT platform.
  • Remove static variables (especially large arrays) that are not used on the XT platform.
  • Use allocatable arrays instead of static arrays. Because the memory model applies only to the static size, allocatable arrays can be larger than 2 GB under the small memory model.

This limitation is typically not a problem for programs that will run in dual-core mode because each core has only 2 GB of memory. However, if you plan to run in single-core mode and use the entire
4 GB of available memory, you will need to ensure the static size of your executable is less than 2 GB.
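As a rough illustration (array name and size are hypothetical), a large static array counts toward the static size and can trigger the relocation error under the small memory model, while the same amount of memory obtained from the heap does not:

#include <stdlib.h>

/* A ~3 GB static array such as
 *   static double big_static[400000000];
 * would count toward the 2 GB small-model limit and can cause
 * "relocation truncated to fit" at link time. */

int main(void)
{
  /* The same storage allocated from the heap does not count toward the
   * static size, so it works under the small memory model. */
  double *big = malloc(400000000UL * sizeof(double));
  if (big == NULL) return 1;
  /* ... use big ... */
  free(big);
  return 0;
}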

How do I link a C program that calls Fortran routines?

Use the pgf90 compiler to link and provide the -Mnomain option.

What does “multiple definition of main” and/or “undefined reference to MAIN_” mean?

This most likely means you have a C program that calls Fortran, and you are linking with the Portland Group Fortran compiler. The Fortran compiler has its own default “main,” and now there is a second main from the C source. You just need to add the -Mnomain flag during link time to fix this.
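A minimal sketch (file and routine names are hypothetical, and the trailing underscore assumes PGI's default Fortran name mangling): a C main that calls a Fortran subroutine fsub from fsub.f90 might look like

/* cmain.c */
extern void fsub_(void);   /* Fortran subroutine "fsub"; PGI appends an underscore by default */

int main(void)
{
  fsub_();
  return 0;
}

and would be built with the -Mnomain flag passed through at link time:

> cc -c cmain.c
> ftn -c fsub.f90
> ftn -Mnomain -o mixed.x cmain.o fsub.o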

What do I do with “configure: error: linking to Fortran libraries from C fails”?

That message sometimes comes as a result of using configure on the XT3 with the FC=ftn and CC=cc compilers. The error usually shows up in the configure log with the following output:

checking how to get verbose linking output from ftn... -v
checking for Fortran libraries of ftn...  -L/opt/acml/2.7/pgi64/lib/cray/cnos64 -llapacktimers -L/opt/xt-mpt/1.3.15/mpich2-64/P2/lib -L/opt/acml/2.7/pgi64/lib -L/opt/xt-libsci/1.3.15/pgi/cnos64/lib -L/opt/xt-mpt/1.3.15/sma/lib -L/opt/xt-tools/papi/3.2.1/lib/cnos64 -lpapi -lperfctr -L/opt/xt-lustre-ss/1.3.15/lib64 -L/opt/xt-catamount/1.3.15/lib/cnos64 -L/opt/xt-pe/1.3.15/lib/cnos64 -L/opt/xt-libc/1.3.15/amd64/lib -L/opt/xt-os/1.3.15/lib/cnos64 -L/opt/xt-service/1.3.15/lib/cnos64 -L/opt/pgi/6.1.1/linux86-64/6.1/lib -L/opt/gcc/3.2.3/lib/gcc-lib/x86_64-suse-linux/3.2.3/ -lacml -lmpichf90 -lsci -lmpich -llustre -lpgf90 -lpgf90_rpm1 -lpgf902 -lpgf90rtl -lpgftnrtl -lpgc -lm -lcatamount -lsysio -lportals -lC -lcrtend' -lcrtend
checking for dummy main to link with Fortran libraries... unknown
configure: error: linking to Fortran libraries from C fails
See 'config.log' for more details.

If you look at the end of the Fortran libraries line, you will see “-lcrtend' -lcrtend”; there is a stray “'” character. To work around this, specify this long list of Fortran libraries in an environment variable such as FLIBS or FCLIBS, with the stray “'” and the extra “-lcrtend” removed.

My code compiles without any trouble, but fails in the link step.

Internally, the compilers use several variables/macros even if they’re not specified on the command line. These include F90FLAGS, FFLAGS, CFLAGS, and others. If your makefile defines these variables with flags not intended for the link step, the link may fail. For example, if they contain the -c flag, which tells the compiler to skip the link step, the link will fail.
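A hedged makefile sketch of the idea (names are hypothetical; recipe lines must begin with a tab): keep compile-only flags such as -c in the compile rule only, and pass separate flags to the link step.

FC      = ftn
FFLAGS  = -O2        # used only in the compile rule below; -c is added there, never at link time
LDFLAGS =

OBJS = main.o solver.o

prog.x: $(OBJS)
	$(FC) $(LDFLAGS) -o $@ $(OBJS)

%.o: %.f90
	$(FC) $(FFLAGS) -c $<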

Can I use the 1.5 programming environments on the CNL system?

The 1.5 programming environments are available on the CNL system. However, they will build for Catamount and should not be used on the CNL system. Programming environment versions 2.0 and greater should be used on the CNL system.

How do I link a C++ object with ftn? It worked on the Catamount system without modification.

Under the 1.5 programming environments used under Catamount, ftn linked in libC.a. Under the 2.0 programming environments used under CNL, ftn does not link in libC.a. Fortran codes that link in libraries that contain C++ objects will need to add -lC to the link line, as shown below.

libc.a is added to the link under 2.0 as it was under 1.5. Adding -lc to the link will result in multiple definition warnings.
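For example (library and object names are hypothetical), a Fortran link that pulls in a library built from C++ objects would add -lC explicitly:

> ftn -o app.x main.o -lmycxxlib -lC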

Why do I see the message: SEEK_SET is #defined but must not be for the C++ binding of MPI?

The following error message:

#error "SEEK_SET is #defined but must not be for the C++ binding of MPI"

is the result of a name conflict between stdio.h and the MPI C++ binding. Users should place the MPI include before the stdio.h and iostream includes, as shown in the sketch after the error messages below.

Users may also see the following error messages as a result of including stdio or iostream before mpi:

#error "SEEK_CUR is #defined but must not be for the C++ binding of MPI"

#error "SEEK_END is #defined but must not be for the C++ binding of MPI"

Why does my C/C++ compile/link fail with missing pgf77 or pgf90 symbols?

One of the libraries that you are using may have been compiled with Fortran in such a way that Fortran-specific symbols remain in the .a file. If this happens, then a compile/link with cc or CC will report unresolved symbols such as __pgf90_compiled. To include the necessary libraries to resolve the missing symbols, add -pgf90libs to your cc/CC compile/link line. There is a similar option, -pgf77libs, that should be used if the missing symbols are Fortran 77 and not Fortran 90.
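For example (file and library names are hypothetical):

> cc -o app.x main.c libfsolver.a -pgf90libs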

Running Jobs

How do I find out what nodes I am using?

There are a couple of easy ways to find out what nodes are assigned to your batch job. The easiest is to issue checkjob <jobid>. Part of the output will return a list of nodes like the following:

Allocated Nodes:      

[84:1][85:1][86:1][87:1][88:1][89:1][90:1][91:1]

Another way to find out what nodes your batch job has is to run the nodeinfo tool that we have installed. This can only be run inside a batch job. Just add the following line to your batch script before the execution step:

/sw/xt/bin/nodeinfo

This tool will return a list of nodes (one per line) as well as statistics about each node, as follows:

PE Node Processor CPU Speed Rev Cores Mem Size Mem Speed Seastar Speed
  0 84 Opteron 285 2.6 GHz E 2 4096 MB DDR-400 5980 MB/s SS1 1109 MB/s
  1 85 Opteron 285 2.6 GHz E 2 4096 MB DDR-400 5982 MB/s SS1 1109 MB/s
  2 86 Opteron 285 2.6 GHz E 2 4096 MB DDR-400 5982 MB/s SS1 1109 MB/s
  3 87 Opteron 285 2.6 GHz E 2 4096 MB DDR-400 5983 MB/s SS1 1109 MB/s
  4 88 Opteron 285 2.6 GHz E 2 4096 MB DDR-400 5982 MB/s SS1 1109 MB/s
  5 89 Opteron 285 2.6 GHz E 2 4096 MB DDR-400 5979 MB/s SS1 1109 MB/s
  6 90 Opteron 285 2.6 GHz E 2 4096 MB DDR-400 5982 MB/s SS1 1109 MB/s
  7 91 Opteron 285 2.6 GHz E 2 4096 MB DDR-400 5982 MB/s SS1 1109 MB/s

The above two methods return the same logical numbering of nodes. A physical numbering of the nodes as well as the pid layout can be obtained by setting the PMI_DEBUG variable to 1.

> setenv PMI_DEBUG 1
> aprun -n4 ./a.out
Detected aprun CNOS interface
MPI rank order: Using default aprun rank ordering
rank 0 is on nid00015 pid 76; originally was on nid00015 pid 76
rank 1 is on nid00015 pid 77; originally was on nid00015 pid 77
rank 2 is on nid00016 pid 69; originally was on nid00016 pid 69
rank 3 is on nid00016 pid 70; originally was on nid00016 pid 70

From within your code, you can reference PMI_CNOS_Get_nid to get the physical number for each process.

#include <stdio.h>
#include "mpi.h"

int main (int argc, char *argv[])
{
  int rank, nproc, nid;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nproc);
  PMI_CNOS_Get_nid(rank, &nid);   /* provided by the Cray PMI/CNOS interface */
  printf("  Rank: %10d  NID: %10d  Total: %10d\n", rank, nid, nproc);
  MPI_Finalize();
  return 0;
}

The output with four cores would be as follows:

aprun -n4 ./hello-mpi.x
  Rank:          1  NID:         15  Total:          4
  Rank:          0  NID:         15  Total:          4
  Rank:          2  NID:         16  Total:          4
  Rank:          3  NID:         16  Total:          4
Application 13390 resources: utime 0, stime 0

The aprun -q option can be used to run commands outside of a code as shown below.

> aprun -q -n4 /bin/hostname
nid00015
nid00015
nid00016
nid00016
>

Or

> aprun -q -n4 /bin/cat /proc/cray_xt/nid
15
15
16
16
>

Why do I get the error “qsub: Job exceeds queue resource limits MSG=cannot satisfy server max mem requirement” when submitting a job?

The queuing system on the XT4 does not allow memory requests with the #PBS -lmem= flag. Jobs requesting memory will be rejected with the error message shown above.

Memory on the XT4 is not shared between nodes. When running in virtual node mode (dual-core mode), each task has access to 2 GB of memory. In single node (single-core) mode, each task can access 4 GB of memory. Thus, memory is directly related to the number of processors requested. Because the memory is not shared, it does not make sense to request memory directly via PBS. (It is implicitly requested based on the #PBS -lsize=... request.)
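A minimal batch-script sketch (account, size, and paths are illustrative); memory comes implicitly with the cores requested via -lsize, so no -lmem request appears:

#PBS -A ABC123
#PBS -l walltime=1:00:00,size=8
cd /tmp/work/$USER
aprun -n 8 ./a.out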

Can I run size=0 jobs?

Yes, size=0 jobs are supported. These jobs are a good way to automate data transfers to HPSS. The hsi command runs on a service node. So, if you use hsi at the conclusion of a production run, all of the compute nodes your job was allocated remain idle. As an alternative, you can submit a production job, and then submit a second ‘data transfer’ job. This second job should be submitted with a dependency on the first job so that it will not start until the first job finishes. Additionally, it should be submitted with a size argument of 0. Since hsi runs on service nodes, it does not require any compute node (thus, size=0).

NOTE: Jobs requesting size=0 should not use the PBS feature option. This creates a dependency condition that the system can’t satisfy (the system can’t allocate 0 compute nodes and compute nodes with a certain feature string). Thus, a job submitted with #PBS -l size=0,feature=800 will remain in a queued state indefinitely.
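A hedged sketch of this two-job pattern (job IDs, account, and HPSS paths are hypothetical; afterok is standard PBS dependency syntax). The data-transfer script requests size=0, changes to the Lustre work space, and calls hsi:

> qsub run.pbs
12345.nid00003
> qsub -W depend=afterok:12345.nid00003 archive.pbs

where archive.pbs contains

#PBS -A ABC123
#PBS -l walltime=2:00:00,size=0
cd /tmp/work/$USER/run1
hsi "put results.tar : /hpss/path/results.tar"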

Lustre File System

How is striping set up in Lustre?

The lfs command can be used to determine the Lustre file system setup. Note that each file and directory can have its own striping pattern, which means that a user can set striping patterns for their own files and/or directories. The default stripe width after the July 2006 hardware upgrade is 4.

This command will display the striping information for a directory or file:

lfs find -v <directory/file>

If the command returns no stripe information, the directory/file is set not to stripe; in other words, the stripe width is 1.

How do I change the striping in Lustre?

A user can change the striping settings for a file or directory in Lustre by using the lfs command. More specifically, one would use lfs setstripe <directory> <options>. Note that if you change the settings for existing files, the file will get the new settings only if it is recreated. If you change the settings for an existing directory, you will need to copy the files elsewhere and then copy them back to inherit the new settings.

We believe that the best setting for a program in which each process writes out its own file(s) is

> lfs setstripe <directory> 0 -1 1

That is, do not use striping. Then we see that

> lfs find -v testdirectory
OBDS:
0: ost1_UUID ACTIVE
1: ost2_UUID ACTIVE
2: ost3_UUID ACTIVE
3: ost4_UUID ACTIVE
4: ost5_UUID ACTIVE
5: ost6_UUID ACTIVE
6: ost7_UUID ACTIVE
7: ost8_UUID ACTIVE
8: ost9_UUID ACTIVE
9: ost10_UUID ACTIVE
10: ost11_UUID ACTIVE
11: ost12_UUID ACTIVE
12: ost13_UUID ACTIVE
13: ost14_UUID ACTIVE
14: ost15_UUID ACTIVE
15: ost16_UUID ACTIVE
testdirectory/
default stripe_count: 1 stripe_size: 0 stripe_offset: -1

This shows we have a stripe count of 1 (no striping), the stripe size is set to 0 (which means use the default), and the stripe offset is set to -1 (which means to round-robin the files across the OSTs). You should always use -1 for stripe_offset.

The stripe count and stripe size are something you can tweak for performance.
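For example (directory name and values are illustrative), using the same argument order shown above (stripe size, stripe offset, stripe count), a directory intended for large shared files could be set to a 1 MB stripe size round-robined across 8 OSTs:

> lfs setstripe testdirectory 1048576 -1 8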

Runtime Messages/Errors

What does “MPIDI_PORTALSU_REQUEST_FDU_OR_AEP: DROPPED EVENT ON UNEXPECTED RECEIVE QUEUE” mean?

Setting

MPICH_PTL_SEND_CREDITS=-1

enables a flow-control mechanism. See the mpi_intro man page for details.

For best performance, the number of event queue entries for the MPI unexpected receive queue should be set as high as possible.

MPICH_PTL_UNEX_EVENTS=80000

Note that this fix does not address unexpected message buffer exhaustion. Thus, the user may still need to adjust MPICH_MAX_SHORT_MSG_SIZE or MPICH_UNEX_BUFFER_SIZE if this buffering overflows.
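For example, these variables would typically be set in the batch script before the aprun line (csh syntax; the process count is illustrative):

setenv MPICH_PTL_SEND_CREDITS -1
setenv MPICH_PTL_UNEX_EVENTS 80000
aprun -n 1024 ./a.out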

I get the runtime error MPI has run out of PER_PROC Message Packets.

Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
*** MPI has run out of PER_PROC message packets.
*** The current allocation levels are:
***     MPI_MSGS_PER_PROC = 16384
_pmii_daemon(SIGCHLD): PE 1 exit signal Aborted
[NID 2987]Apid 566145: initiated application termination

Even though the message refers to MPI_MSGS_PER_PROC, you need to increase the MPICH_MSGS_PER_PROC environment variable to a number greater than the number of cores requested by the job. MPICH_MSGS_PER_PROC specifies the maximum number of internal message headers that can be allocated by MPI; the default value is 16,384.
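For example, in the batch script (csh syntax; the values are illustrative for a job using more than 16,384 cores):

setenv MPICH_MSGS_PER_PROC 32768
aprun -n 20000 ./a.out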

I get a chdir No such file or directory error.

The NFS-mounted home, project, and software directories are not accessible to the compute nodes.

  • Executables must be executed from within the Lustre work space.
  • Batch jobs can be submitted from the home or work space. If submitted from a user’s home area, the user should cd into the Lustre work space directory prior to running the executable through aprun (see the sketch after this list). An error similar to the following may be returned if this is not done:
            aprun: [NID 94]Exec /tmp/work/userid/a.out failed: chdir /autofs/na1_home/userid
            No such file or directory
  • Input must reside in the Lustre work space.
  • Output must also be sent to the Lustre file system.
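A minimal batch-script sketch of this pattern (account and paths are illustrative):

#PBS -A ABC123
#PBS -l walltime=1:00:00,size=4
cd /tmp/work/$USER/rundir   # Lustre work space; executable and input reside here
aprun -n 4 ./a.out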

Why do I see a no space left on device error?

A no space left on device error will be returned during file I/O if one of the file’s associated OSTs becomes 100% utilized. An OST may become 100% utilized even if there is space available on the filesystem.

You can see a file or directory’s associated OST(s) with “lfs getstripe <file>”. “lfs df” can be used to see the usage on each OST.

Miscellaneous

What endianness are the XT3 and XT4? Is there any way to affect it?

The Cray XT3 and XT4 are little-endian. There is a compiler switch, -Mbyteswapio, that makes the default Fortran unformatted I/O big-endian (read and write).

Note that this little-endian-to-big-endian conversion feature is intended for Fortran unformatted I/O operations. It enables the development and processing of files with big-endian data organization. The feature also enables processing of the files developed on processors that generate big-endian data (such as IBM, Cray X1, Sun).
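For example (file names are illustrative):

> ftn -Mbyteswapio -o read_bigendian.x read_bigendian.f90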

How can I check memory usage for my application on the XT3?

If you don’t use allocatable memory, running size <executable_name> is a reliable way to check the memory usage of your application.
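For example (the numbers are illustrative), the static image size is roughly the sum of the text, data, and bss segments reported by size:

> size ./a.out
   text    data     bss     dec     hex filename
 812345  403216 1048576 2264137  228c49 ./a.out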

Heap usage can be checked with the UNICOS/lc system call heap_info. An example of usage in C would be as follows:

       #include <stdio.h>
       #include <catamount/catmalloc.h>

       void mem_check ()
       {
         size_t fragments;
         unsigned long total_free, largest_free, total_used;

         if (heap_info(&fragments, &total_free, &largest_free, &total_used) == 0) {
           printf("heap_info fragments=%lu total_free=%lu largest_free=%lu total_used=%lu\n",
                  fragments, total_free, largest_free, total_used);
         } else {
           printf("non zero return code from heap_info\n");
         }
         return;
       }

An example of usage in Fortran would be as follows:

     program heap
        integer i
        integer*4 fragments
        integer*8 total_free, largest_free, total_used
        integer heap_info

        i = heap_info(fragments, total_free, largest_free, total_used)
        write(0,*) 'heap_info fragments =',fragments,' total_free = ',
       1total_free,' largest_free = ',largest_free,' total_used = ',
       2total_used,' i = ',i
        stop
     end

(Both these examples can be found on the man page for heap_info).

Interrogating stack usage is a bit more involved.

  #include <qk/types.h>
  #include <qk/process_pcb_type.h>

  PROCESS_PCB_TYPE *_my_pcb;

  inline ADDR_LEN get_stack_pointer() {
    ADDR_LEN sp;
    asm("mov %%rsp,%0" : "=m" (sp));
    return sp;
  }

  /* Returns the free space on the stack after allocating n more bytes.
   * If this overflows, aborts instead of returning. */
  unsigned check_stack( int n ) {
#define NN (int)((get_stack_pointer() - _my_pcb->stack_base) - (n+16))
    if ( NN >= 0 ) return NN;
    abort();
  }

What profiling tools are available?

At least three profiling tools are available on Jaguar.

  1. CrayPat is provided by Cray; see the CrayPat documentation for more information.
  2. fpmpi is an unsupported product that can provide a very concise profile of MPI routines in an application. To use it, simply load the fpmpi (or fpmpi_papi) module and relink. Then rerun your application. There are a few environment variables to control profiling output:
    • MPI_PROFILE_DISABLE : Disables statistic collection until fpmpi_enable is called (#include fpmpi.h).
    • MPI_PROFILE_SUMMARY : Setting this disables creation of individual MPI process statistics files. Set this when running with thousands of processes.
    • MPI_PROFILE_FILE : Name of process statistic file; default is profile.txt.
    • MPI_HWPC_COUNTERS : List of events or event set number as in libhwpc.
  3. A third tool that is unsupported is TAU. TAU (Tuning and Analysis Utilities) is a portable profiling and tracing toolkit for performance analysis of parallel programs written in Fortran, C, C++, Java, and Python. Basic profiling with TAU can be done in the following steps:
    1. In your makefile or configuration script, set TAUROOTDIR=/apps/TAU/prod/jaguar.
    2. Then add include $(TAUROOTDIR)/lib/Makefile.tau-pdt-pgi.
    3. Build your code using the modified makefile or configuration script. Contact help@nccs.gov if there are any error messages at this stage. TAU will revert to the normal build process if the automatic instrumentation is unsuccessful (i.e., you won’t get an instrumented file).
    4. You will get a regular executable. Submit your job as usual.
    5. After execution, there should be a profile.xxx text file.

TAU can also do MPI profiling and collect hardware performance counter data.

How do I get performance counter data for my program?

Use the following process:

  1. Use module load craypat.
  2. Compile code.
    1. If the code is Fortran 90 with modules, compile with -Mprof=func.
  3. Run pat_build -u -g mpi a.out.
  4. Run a.out+pat as you would a.out, BUT make sure PAT_RT_HWPC is set to 1 in your batch script.
    1. If you want just a regular profile, don’t set PAT_RT_HWPC.
  5. Run pat_report <dir>/*.xf, where <dir> is automatically generated by instrumented code.

The resulting output will have performance counter results for the entire run AND for each subroutine.
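Putting the steps together, a hedged end-to-end sketch might look like the following (file names and process count are illustrative; <dir> is the experiment directory created automatically by the instrumented run):

> module load craypat
> ftn -c my_code.f90
> ftn -o a.out my_code.o
> pat_build -u -g mpi a.out        # produces a.out+pat

# in the batch script (csh syntax):
setenv PAT_RT_HWPC 1
aprun -n 64 ./a.out+pat

# after the run:
> pat_report <dir>/*.xf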

Where can I find documentation on MPI environment variables?

You can find current information on MPI environment variables from the mpi_intro man page.

What are the differences between MPT 2 and MPT 3?

MPT 3 Features

  • Support for multiple interconnect devices.

    MPI 3.0 can support multiple interconnect devices for a single MPI
    job. This allows each process (rank) of an MPI job to create the
    optimal messaging path to every other process in the job, based on the
    topology of the given ranks. The two device drivers that are supported
    are the shared memory (SMP) driver and the portals device driver. The
    SMP device driver is based on shared memory, and is used for
    communication between ranks that share a node. The portals device is
    used for communication between ranks that span nodes. The portals device
    was completely rewritten to work with other devices on a single application
    and will allow future optimizations to be more easily added.

  • MPT 3.0 uses a completely new launching sequence via the Process Manager
    Interface (PMI) library. This includes a PMI daemon process on each
    compute node. This daemon process is started at program launch, and
    exits when the program exits. Applications are still launched via aprun
    in the same manner as with previous MPI versions.
  • The MPI 3.0 source is based on MPICH2 1.0.4p. The MPI 2.0 source was based
    on MPICH2 1.0.2. The new version contains numerous fixes in the machine
    independent areas of MPI.
  • MPI 3.0 has some new defaults and new environment variables.

    The new MPI 3.0 environment variables are:

    MPICH_ENV_DISPLAY : displays env variables and their values
    MPICH_VERSION_DISPLAY : displays the MPICH2 Cray version number and build info
    MPICH_PTL_EAGER_LONG : formerly known as MPICH_PTLS_EAGER_LONG
    MPICH_MSGS_PER_PROC : overrides the default internal message header maximum
    MPICH_SMPDEV_BUFS_PER_PROC : overrides the default SMP device buffer size
    MPICH_COLL_OPT_OFF : disables the collective optimizations
    MPICH_ABORT_ON_ERROR : replaces the old MPICH_DBMASK env variable
    MPICH_RANK_REORDER_DISPLAY : displays the rank-to-node mapping; the rank order can be manipulated via the MPICH_RANK_REORDER_METHOD env variable (replaces the old PMI_DEBUG env variable)
    MPICH_SMP_SINGLE_COPY_SIZE : specifies the minimum message size to qualify for on-node single-copy transfers
    MPICH_SMP_SINGLE_COPY_OFF : disables the on-node single-copy optimization
    MPICH_RMA_MAX_OUTSTANDING_REQS : controls the maximum number of outstanding RMA operations on a given window
    MPICH_SMP_OFF : disables the SMP device used for on-node messages
    MPICH_ALLREDUCE_LARGE_MSG : adjusts the cutoff for the SMP-aware MPI_Allreduce algorithm
    PMI_EXIT_QUIET : specifies to the Process Manager Interface (PMI) to inhibit reporting all exits

    There are also some new defaults for existing MPI environment variables:

    Variable Name             New Default  Old Default
    MPICH_ALLTOALL_SHORT_MSG  1024         512
    MPICH_ALLTOALLVW_FCSIZE   32           120
    MPICH_ALLTOALLVW_SENDWIN  20           80
    MPICH_ALLTOALLVW_RECVWIN  20           100

    In MPT 3.0, the architecture-specific collective optimizations are enabled
    by default, so the MPI_COLL_OPT_ON variable has been deprecated. It is
    replaced by MPICH_COLL_OPT_OFF to allow the user to disable these
    optimizations if desired.

  • A new environment variable MPICH_MPIIO_HINTS has been added to allow users to
    set MPI-IO hints without code modifications. The supported hints are:

    romio_cb_read, romio_cb_write, cb_buffer_size, cb_nodes, cb_config_list,
    romio_no_indep_rw, ind_rd_buffer_size, ind_wr_buffer_size, romio_ds_read,
    romio_ds_write, direct_io

    The intro_mpi(3) man page contains additional information on how to use them as
    well as their default values.

  • For more information on all of the supported MPI and SHMEM environment
    variables, please refer to the intro_mpi and intro_shmem man pages.
  • MPT 3.0.1 supports the Berkeley Lab Checkpoint/Restart (BLCR)
    feature on UNICOS/lc CNL systems. The initial implementation of
    checkpoint/restart (CPR) in UNICOS/lc is in release 2.1. However,
    full support in UNICOS/lc is deferred until a future release.
    CrayPat 4.3.0 also supports CPR. However, to run a CrayPat
    instrumented code with MPT 3.0.1 under a CPR environment, the user
    needs to set an environment variable so that each process writes
    data to its own file. When MPT supports shared parallel I/O
    (multiple processes writing to the same file) under CPR, (planned
    for MPT 3.0.2) this environment variable can be omitted.

Differences and Incompatibilities

  • The cancelling of sends is not supported.
  • Several MPI environment variables names have changed with MPT 3.0.
    They are:
    MPICH_PTL_EAGER_LONG replaces MPICH_PTLS_EAGER_LONG
    MPICH_ABORT_ON_ERROR replaces MPICH_DBMASK
  • MPT 2.0 allowed a number of cnos_ functions (like cnos_barrier or
    cnos_get_rank). These functions are no longer supported in MPT 3.0.
    Applications with MPT 3.0 that use them may still have them satisfied
    from libpct.a; however, this could lead to hangs. In MPT 3.0, MPI
    and SHMEM use the PMI for similar functionality that the cnos_
    functions provided. Users can examine the pmi/include/pmi.h header for
    replacement functions and can use #include <pmi.h> in their programs.
    The MPT 3.1 release will include a document that will describe how to
    use the PMI functions in pmi.h.
  • With MPT 3.0, the PMI library is used to assist aprun in launching
    SHMEM programs as well as MPI programs. The intro_shmem man page has
    always specified that a call to shmem_finalize is necessary to allow
    proper cleanup. With MPT 3.0, if shmem_finalize was not called, the
    PMI library may conclude that some processes ended prematurely and tell
    aprun to kill those processes and the job will get “killed” messages.
    Users should insert a shmem_finalize call into their program to allow
    proper cleanup and avoid the confusion.
  • A new environment variable called MPICH_CPU_YIELD was added to MPI
    3.0 to allow behavior similar to MPI 2.0 in certain cases. If set, MPI
    calls sched_yield() in the global progress loop. The sched_yield()
    function forces the current process/thread to relinquish the
    processor. In 3.0 the existing sched_yield calls in the SMP progress
    engine were removed because they significantly increase on-node ping-pong
    latency. MPICH_CPU_YIELD is most useful when the user over-subscribes the
    number of CPUs, or when more than one process or thread is pinned to the
    same CPU. Note that this scenario can happen when using the Pathscale
    compiler with a hybrid OpenMP/MPI application if one is not careful.
  • MPI programs that use MPI_ANY_SOURCE with pre-posted receives may
    see performance degradation with respect to MPT 2.0. The MPT 2.0
    performance can be obtained by setting MPICH_SMP_OFF in MPT 3.0 thus
    disabling the SMP device. Note that using MPI_ANY_SOURCE with
    MPI_Probe/MPI_Iprobe followed by a receive (hence non-pre-posted)
    should not see significant changes in performance.
  • MPI 2.0 was based on ANL release 1.0.2. MPI 3.0 is based on ANL
    release 1.0.4 which made significant changes to the C++ bindings
    including the header file. For this reason users of the C++ bindings
    should recompile when switching to MPI 3.0.

Where can I find more information?

If you haven’t already, please check out the other Jaguar resource pages on compiling, file systems, batch jobs, open issues, parallel I/O tips, the CrayPAT overview, and other reports and presentations related to Jaguar.

Another good resource (without Jaguar-specific information) is the documentation that Cray provides at CrayDocs.