FAQ
General Frequently Asked Questions
Table of Contents
Compiling/Linking
- Why does my compile fail with “/usr/bin/ld: cannot find -lsma”?
- Why does my compile fail with the message “relocation truncated to fit: R_X86_64_PC32”?
- How do I link a C program that calls Fortran routines?
- What does “multiple definition of main” and/or “undefined reference to MAIN_” mean?
- What do I do with “configure: error: linking to Fortran libraries from C fails”?
- My code compiles without any trouble, but fails in the link step.
- Can I use the 1.5 programming environments on the CNL system?
- How do I link a C++ object with ftn? It worked on the Catamount system without modification.
- Why do I see the message: SEEK_SET is #defined but must not be for the C++ binding of MPI?
- Why does my C/C++ compile/link fail with missing pgf77 or pgf90 symbols?
Running Jobs
- How do I find out what nodes I am using?
- Why do I get the error “qsub: Job exceeds queue resource limits MSG=cannot satisfy server max mem requirement” when submitting a job?
- Can I run size=0 jobs?
Lustre File System
- How is striping set up in Lustre?
- How do I change the striping in Lustre?
- Why do I see a no space left on device error?
Runtime Messages/Errors
- What does “MPIDI_PORTALSU_REQUEST_FDU_OR_AEP: DROPPED EVENT ON UNEXPECTED RECEIVE QUEUE” mean?
- I get the runtime error MPI has run out of PER_PROC Message Packets.
- I get a chdir No such file or directory error.
Miscellaneous
- What “endian”ness is the XT3 and XT4? Is there any way to affect it?
- How can I check memory usage for my application on the XT3?
- What profiling tools are available?
- How do I get performance counter data for my program?
- Where can I find documentation on MPI environment variables?
- What are the differences between MPT 2 and MPT 3?
- Where can I find more information?
Compiling/Linking
Why does my compile fail with “/usr/bin/ld: cannot find -lsma”?
This error message occurs when using the mpi* compiler wrappers (mpicc, mpif90, etc.). These are intermediate wrappers that should not be called directly by users. Instead, users should compile with either ftn, cc, or CC. The ftn, cc, and CC scripts will do the necessary setup and then automatically call the appropriate intermediate scripts and ultimately the compilers.
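For example (the file name is only illustrative), compile an MPI C source with the cc wrapper rather than calling mpicc directly:

> cc my_mpi_code.c -o my_mpi_code.x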
Why does my compile fail with the message “relocation truncated to fit: R_X86_64_PC32”?
The default memory model for the PGI compilers is the “small” model. This requires that the object be smaller than 2 GB in size. The PGI compilers support the “medium” memory model, which allows objects to be larger than 2 GB. Unfortunately, for a code to use the medium memory model, all objects and static libraries must be compiled under the medium memory model. Several system libraries are not, so in general, executables on Jaguar must use the small memory model.
The “relocation truncated” error message occurs when an object file or executable is too large for the memory model. To work around this error, you should reduce the static memory usage for your code. Common ways to do this include the following:
- Remove (either by deleting or via compiler directives) subroutines that are not used on the XT platform.
- Remove static variables (especially large arrays) that are not used on the XT platform.
- Use allocatable arrays instead of static arrays. Because the memory model applies to only static size, allocatable arrays can be larger than 2 GB with the small memory model.
This limitation is typically not a problem for programs that will run in dual-core mode because each core has only 2 GB of memory. However, if you plan to run in single-core mode and use the entire
4 GB of available memory, you will need to ensure the static size of your executable is less than 2 GB.
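As a rough illustration (the array size and names are hypothetical), a large static array counts against the 2 GB small-model limit, while the same storage obtained from the heap at run time does not:

/* small_model.c - sketch of static vs. heap storage under the small memory model */
#include <stdio.h>
#include <stdlib.h>

/* A static array of ~2.4 GB would trip "relocation truncated to fit": */
/* static double big[300000000];                                        */

int main(void)
{
    size_t n = 300000000;                     /* ~2.4 GB of doubles               */
    double *big = malloc(n * sizeof(double)); /* heap storage keeps the object    */
                                              /* file's static size small         */
    if (big == NULL) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }
    big[0] = 1.0;
    free(big);
    return 0;
}

In Fortran, the analogous change is to declare the array ALLOCATABLE and allocate it at run time.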
How do I link a C program that calls Fortran routines?
Use the pgf90 compiler to link and provide the -Mnomain option.
What does “multiple definition of main” and/or “undefined reference to MAIN_” mean?
This most likely means you have a C program that calls Fortran, and you are linking with the Portland Group Fortran compiler. The Fortran compiler has its own default “main,” and now there is a second main from the C source. You just need to add the -Mnomain flag during link time to fix this.
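A minimal sketch (file and routine names are hypothetical) of a C main calling a Fortran subroutine; note the trailing underscore that the PGI Fortran compiler appends to external names, and the -Mnomain flag on the link line:

/* main.c - calls a Fortran subroutine assumed to be defined in fsub.f90 */
extern void fsub_(int *n);   /* PGI appends an underscore to Fortran names */

int main(void)
{
    int n = 4;
    fsub_(&n);               /* Fortran dummy arguments are passed by reference */
    return 0;
}

> cc -c main.c
> ftn -c fsub.f90
> ftn -Mnomain main.o fsub.o -o myapp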
What do I do with “configure: error: linking to Fortran libraries from C fails”?
That message sometimes comes as a result of using configure on the XT3 with the FC=ftn and CC=cc compilers. The error usually shows up in the configure log with the following output:
checking how to get verbose linking output from ftn... -v
checking for Fortran libraries of ftn... -L/opt/acml/2.7/pgi64/lib/cray/cnos64 -llapacktimers -L/opt/xt-mpt/1.3.15/mpich2-64/P2/lib -L/opt/acml/2.7/pgi64/lib -L/opt/xt-libsci/1.3.15/pgi/cnos64/lib -L/opt/xt-mpt/1.3.15/sma/lib -L/opt/xt-tools/papi/3.2.1/lib/cnos64 -lpapi -lperfctr -L/opt/xt-lustre-ss/1.3.15/lib64 -L/opt/xt-catamount/1.3.15/lib/cnos64 -L/opt/xt-pe/1.3.15/lib/cnos64 -L/opt/xt-libc/1.3.15/amd64/lib -L/opt/xt-os/1.3.15/lib/cnos64 -L/opt/xt-service/1.3.15/lib/cnos64 -L/opt/pgi/6.1.1/linux86-64/6.1/lib -L/opt/gcc/3.2.3/lib/gcc-lib/x86_64-suse-linux/3.2.3/ -lacml -lmpichf90 -lsci -lmpich -llustre -lpgf90 -lpgf90_rpm1 -lpgf902 -lpgf90rtl -lpgftnrtl -lpgc -lm -lcatamount -lsysio -lportals -lC -lcrtend' -lcrtend
checking for dummy main to link with Fortran libraries... unknown
configure: error: linking to Fortran libraries from C fails
See 'config.log' for more details.
If you look at the end of the Fortran libraries line, you will see “-lcrtend' -lcrtend”. There is an extra “'”. To get around this, usually you specify this long line of Fortran libraries in an environment variable like FLIBS or FCLIBS with the extra “'” and the extra “-lcrtend” removed.
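For example (csh syntax; the library list is abbreviated here), you might set FCLIBS to the corrected list and rerun configure:

> setenv FCLIBS "-L/opt/acml/2.7/pgi64/lib/cray/cnos64 ... -lportals -lC -lcrtend"
> ./configure FC=ftn CC=cc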
My code compiles without any trouble, but fails in the link step.
Internally, the compilers use several variables/macros even if they’re not specified on the command line. These include F90FLAGS, FFLAGS, CFLAGS, and others. If your makefile defines these variables with flags not intended for the link step, the link may fail. For example, if they contain the -c flag, which tells the compiler to skip the link step, the link will fail.
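A minimal makefile sketch of the problem (flags and file names are hypothetical); keeping -c out of FFLAGS and adding it only in the compile rule avoids the broken link:

# FFLAGS = -c -O2       # wrong: -c would also suppress linking in the rule below
FFLAGS = -O2            # right: only flags shared by the compile and link steps

myapp: main.o sub.o
	ftn $(FFLAGS) main.o sub.o -o myapp   # link step (recipe lines must be tab-indented)

main.o: main.f90
	ftn -c $(FFLAGS) main.f90             # compile step adds -c itself

sub.o: sub.f90
	ftn -c $(FFLAGS) sub.f90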
Can I use the 1.5 programming environments on the CNL system?
The 1.5 programming environments are available on the CNL system. However, they will build for Catamount and should not be used on the CNL system. The 2.0 and greater programming environment versions should be used on the CNL system.
How do I link a C++ object with ftn? It worked on the Catamount system without modification.
Under the 1.5 programming environments used under Catamount, ftn linked in libC.a. Under the 2.0 programming environments used under CNL, ftn does not link in libC.a. Fortran codes that link in libraries that contain C++ objects will need to add -lC to the link line.
libc.a is added to the link under 2.0 as it was under 1.5. Adding -lc to the link will result in multiple definition warnings.
Why do I see the message: SEEK_SET is #defined but must not be for the C++ binding of MPI?
The following error message:
#error "SEEK_SET is #defined but must not be for the C++ binding of MPI"
is the result of a name conflict between stdio.h and the MPI C++ binding. Users should place the mpi.h include before the stdio.h and iostream includes.
Users may also see the following error messages as a result of including stdio or iostream before mpi:
#error "SEEK_CUR is #defined but must not be for the C++ binding of MPI"
#error "SEEK_END is #defined but must not be for the C++ binding of MPI"
Why does my C/C++ compile/link fail with missing pgf77 or pgf90 symbols?
One of the libraries that you are using may have been compiled with Fortran in such a way that Fortran-specific symbols remain in the .a file. If this happens, then a compile/link with cc or CC will report unresolved symbols such as __pgf90_compiled. To include the necessary libraries to resolve the missing symbols, add -pgf90libs to your cc/CC compile/link line. There is a similar option, -pgf77libs, that should be used if the missing symbols are Fortran 77 and not Fortran 90.
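For example (the library and file names are hypothetical):

> cc my_driver.c -L$HOME/lib -lmyfortransolver -pgf90libs -o my_driver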
Running Jobs
How do I find out what nodes I am using?
There are a couple of easy ways to find out what nodes are assigned to your batch job. The easiest is to issue checkjob <jobid>. Part of the output will return a list of nodes like the following:
Allocated Nodes: [84:1][85:1][86:1][87:1][88:1][89:1][90:1][91:1]
Another way to find out what nodes your batch job has is to run the nodeinfo tool that we have installed. This can only be run inside a batch job. Just add the following line to your batch script before the execution step:
/sw/xt/bin/nodeinfo
This tool will return a list of nodes (one per line) as well as statistics about each node, as follows:
PE  Node  Processor    CPU Speed  Rev  Cores  Mem Size  Mem Speed           Seastar Speed
0   84    Opteron 285  2.6 GHz    E    2      4096 MB   DDR-400 5980 MB/s   SS1 1109 MB/s
1   85    Opteron 285  2.6 GHz    E    2      4096 MB   DDR-400 5982 MB/s   SS1 1109 MB/s
2   86    Opteron 285  2.6 GHz    E    2      4096 MB   DDR-400 5982 MB/s   SS1 1109 MB/s
3   87    Opteron 285  2.6 GHz    E    2      4096 MB   DDR-400 5983 MB/s   SS1 1109 MB/s
4   88    Opteron 285  2.6 GHz    E    2      4096 MB   DDR-400 5982 MB/s   SS1 1109 MB/s
5   89    Opteron 285  2.6 GHz    E    2      4096 MB   DDR-400 5979 MB/s   SS1 1109 MB/s
6   90    Opteron 285  2.6 GHz    E    2      4096 MB   DDR-400 5982 MB/s   SS1 1109 MB/s
7   91    Opteron 285  2.6 GHz    E    2      4096 MB   DDR-400 5982 MB/s   SS1 1109 MB/s
The above two methods return the same logical numbering of nodes. A physical numbering of the nodes as well as the pid layout can be obtained by setting the PMI_DEBUG variable to 1.
> setenv PMI_DEBUG 1
> aprun -n4 ./a.out
Detected aprun CNOS interface
MPI rank order: Using default aprun rank ordering
rank 0 is on nid00015 pid 76; originally was on nid00015 pid 76
rank 1 is on nid00015 pid 77; originally was on nid00015 pid 77
rank 2 is on nid00016 pid 69; originally was on nid00016 pid 69
rank 3 is on nid00016 pid 70; originally was on nid00016 pid 70
From within your code, you can reference PMI_CNOS_Get_nid to get the physical number for each process.
#include <stdio.h>
#include "mpi.h"

int main (int argc, char *argv[])
{
    int rank, nproc, nid;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);
    PMI_CNOS_Get_nid(rank, &nid);
    printf(" Rank: %10d NID: %10d Total: %10d \n", rank, nid, nproc);
    MPI_Finalize();
    return 0;
}
The output with four cores would be as follows:
aprun -n4 ./hello-mpi.x
Rank: 1 NID: 15 Total: 4
Rank: 0 NID: 15 Total: 4
Rank: 2 NID: 16 Total: 4
Rank: 3 NID: 16 Total: 4
Application 13390 resources: utime 0, stime 0
The aprun -q option can be used to run commands outside of a code, as shown below.
> aprun -q -n4 /bin/hostname
nid00015
nid00015
nid00016
nid00016
>
Or
> aprun -q -n4 /bin/cat /proc/cray_xt/nid
15
15
16
16
>
Why do I get the error “qsub: Job exceeds queue resource limits MSG=cannot satisfy server max mem requirement” when submitting a job?
The queuing system on the XT4 does not allow memory requests with the #PBS -lmem= flag. Jobs requesting memory will be rejected with the error message shown above.
Memory on the XT4 is not shared between nodes. When running in virtual node mode (dual-core mode), each task has access to 2 GB of memory. In single node (single-core) mode, each task can access 4 GB of memory. Thus, memory is directly related to the number of processors requested. Because the memory is not shared, it does not make sense to request memory directly via PBS. (It is implicitly requested based on the #PBS -lsize=... request.)
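For example, a request like the following (the size and walltime values are only illustrative) is accepted, whereas adding a #PBS -l mem=... line would be rejected:

#PBS -l size=8
#PBS -l walltime=1:00:00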
Can I run size=0 jobs?
Yes, size=0 jobs are supported. These jobs are a good way to automate data transfers to HPSS. The hsi command runs on a service node. So, if you use hsi at the conclusion of a production run, all of the compute nodes your job was allocated remain idle. As an alternative, you can submit a production job, and then submit a second “data transfer” job. This second job should be submitted with a dependency on the first job so that it will not start until the first job finishes. Additionally, it should be submitted with a size argument of 0. Since hsi runs on service nodes, it does not require any compute node (thus, size=0).
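A sketch of the pattern (the job ID, script names, and file name are made up):

> qsub production.pbs
123456.nid00004
> qsub -W depend=afterok:123456.nid00004 transfer.pbs

where transfer.pbs contains something like:

#PBS -l size=0
#PBS -l walltime=2:00:00
cd /tmp/work/$USER
hsi put run_output.tar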
NOTE: Jobs requesting size=0 should not use the PBS feature option. This creates a dependency condition that the system can’t satisfy (the system can’t allocate 0 compute nodes and compute nodes with a certain feature string). Thus, a job submitted with #PBS -l size=0,feature=800 will remain in a queued state indefinitely.
Lustre File System
How is striping set up in Lustre?
The lfs command can be used to determine the Lustre file system setup. Note that each file and directory can have its own striping pattern. This means that a user can set striping patterns for his own files and/or directories. The default stripe width after the July 2006 hardware upgrade is 4.
This command reports the striping information for a directory or file:
lfs find -v <directory/file>
If the command returns has no stripe info, then the directory/file is set to not stripe; in other words, the stripe width is 1.
How do I change the striping in Lustre?
A user can change the striping settings for a file or directory in Lustre by using the lfs command. More specifically, one would use lfs setstripe <directory> <options>. Note that if you change the settings for an existing file, the file will get the new settings only if it is recreated. If you change the settings for an existing directory, you will need to copy its files elsewhere and then copy them back to inherit the new settings.
We believe that the best setting for a program in which each process writes out its own file(s) is
> lfs setstripe <directory> 0 -1 1
That is, do not use striping. Then we see that
> lfs find -v testdirectory
OBDS:
0: ost1_UUID ACTIVE
1: ost2_UUID ACTIVE
2: ost3_UUID ACTIVE
3: ost4_UUID ACTIVE
4: ost5_UUID ACTIVE
5: ost6_UUID ACTIVE
6: ost7_UUID ACTIVE
7: ost8_UUID ACTIVE
8: ost9_UUID ACTIVE
9: ost10_UUID ACTIVE
10: ost11_UUID ACTIVE
11: ost12_UUID ACTIVE
12: ost13_UUID ACTIVE
13: ost14_UUID ACTIVE
14: ost15_UUID ACTIVE
15: ost16_UUID ACTIVE
testdirectory/
default stripe_count: 1 stripe_size: 0 stripe_offset: -1
This shows we have a stripe count of 1 (no striping), the stripe size is set to 0 (which means use the default), and the stripe offset is set to -1 (which means to round-robin the files across the OSTs). You should always use -1 for stripe_offset.
The stripe count and stripe size are something you can tweak for performance.
Runtime Messages/Errors
What does “MPIDI_PORTALSU_REQUEST_FDU_OR_AEP: DROPPED EVENT ON UNEXPECTED RECEIVE QUEUE” mean?
This message indicates that an event on the MPI unexpected receive queue was dropped because the queue overflowed. Setting
MPICH_PTL_SEND_CREDITS=-1
enables a flow control mechanism that prevents this. See the mpi_intro man page for details.
For best performance, the number of event queue entries for the MPI unexpected receive queue should also be set as high as possible, for example:
MPICH_PTL_UNEX_EVENTS=80000
Note that this fix does not address unexpected message buffer exhaustion. Thus, the user may still need to adjust MPICH_MAX_SHORT_MSG_SIZE or MPICH_UNEX_BUFFER_SIZE if this buffering overflows.
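For example, in a csh-style batch script (the process count is only illustrative):

setenv MPICH_PTL_SEND_CREDITS -1
setenv MPICH_PTL_UNEX_EVENTS 80000
aprun -n 4096 ./a.out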
I get the runtime error MPI has run out of PER_PROC Message Packets.
Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
*** MPI has run out of PER_PROC message packets.
*** The current allocation levels are:
***   MPI_MSGS_PER_PROC = 16384
_pmii_daemon(SIGCHLD): PE 1 exit signal Aborted
[NID 2987]Apid 566145: initiated application termination
Even though the message refers to MPI_MSGS_PER_PROC, you will need to increase the variable MPICH_MSGS_PER_PROC to a number greater than the number of cores requested by the job. MPICH_MSGS_PER_PROC specifies the maximum number of internal message headers that can be allocated by MPI; the default value is 16,384.
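For example, in a csh-style batch script (the values are only illustrative):

setenv MPICH_MSGS_PER_PROC 65536
aprun -n 20000 ./a.out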
I get a chdir No such file or directory error.
The NFS-mounted home, project, and software directories are not accessible to the compute nodes.
- Executables must be executed from within the Lustre work space.
- Batch jobs can be submitted from the home or work space. If submitted from a user’s home area, the user should cd into the Lustre work space directory prior to running the executable through aprun. An error similar to the following may be returned if this is not done:
aprun: [NID 94]Exec /tmp/work/userid/a.out failed: chdir /autofs/na1_home/userid No such file or directory
- Input must reside in the Lustre work space.
- Output must also be sent to the Lustre file system.
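A sketch of a batch script that avoids the error (the directory and executable names are illustrative):

#PBS -l size=8
#PBS -l walltime=1:00:00
cd /tmp/work/$USER/my_run
aprun -n 8 ./a.out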
Why do I see a no space left on device error?
A no space left on device error will be returned during file I/O if one of the file’s associated OSTs becomes 100% utilized. An OST may become 100% utilized even if there is space available on the file system.
You can see a file’s or directory’s associated OST(s) with lfs getstripe <file/directory>.
Miscellaneous
What “endian”ness is the XT3 and XT4? Is there any way to affect it?
The Cray XT3 and XT4 are little-endian. There is a compiler switch, -Mbyteswapio, that makes the default Fortran unformatted I/O big-endian (both read and write).
Note that this little-endian-to-big-endian conversion feature is intended for Fortran unformatted I/O operations. It enables the development and processing of files with big-endian data organization. The feature also enables processing of the files developed on processors that generate big-endian data (such as IBM, Cray X1, Sun).
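For example (the file name is hypothetical), the flag is added when compiling and linking with the PGI Fortran compiler:

> ftn -Mbyteswapio read_bigendian.f90 -o read_bigendian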
How can I check memory usage for my application on the XT3?
If you don’t use allocatable memory, size executable_name is a reliable way to check the memory usage of your application.
Heap usage can be checked with the UNICOS/lc system call heap_info. An example of usage in C would be as follows:
#include <stdio.h>
#include <catamount/catmalloc.h>

void mem_check ()
{
    size_t fragments;
    unsigned long total_free, largest_free, total_used;

    if (heap_info(&fragments, &total_free, &largest_free, &total_used) == 0) {
        printf("heap_info fragments=%lu total_free=%lu largest_free=%lu total_used=%lu\n",
               fragments, total_free, largest_free, total_used);
    } else {
        printf("non zero return code from heap_info\n");
    }
    return;
}
An example of usage in Fortran would be as follows:
      program heap
      integer i
      integer*4 fragments
      integer*8 total_free, largest_free, total_used
      integer heap_info
      i = heap_info(fragments, total_free, largest_free, total_used)
      write(0,*) 'heap_info fragments =',fragments,' total_free = ',
     1total_free,' largest_free = ',largest_free,' total_used = ',
     2total_used,' i = ',i
      stop
      end
(Both of these examples can be found on the man page for heap_info.)
Interrogating stack usage is a bit more involved.
#include <qk/types.h>
#include <qk/process_pcb_type.h>

PROCESS_PCB_TYPE *_my_pcb;

inline ADDR_LEN get_stack_pointer()
{
    ADDR_LEN sp;
    asm("mov %%rsp,%0" : "=m" (sp));
    return sp;
}

/* Returns the free space on the stack after allocating n more bytes.
 * If this overflows, aborts instead of returning. */
unsigned check_stack( int n )
{
#define NN (int)((get_stack_pointer() - _my_pcb->stack_base) - (n+16))
    if ( NN >= 0 ) return NN;
    abort();
}
What profiling tools are available?
At least three profiling tools are available on Jaguar.
- CrayPat is provided by Cray. See the CrayPat documentation for more information.
- fpmpi is an unsupported product that can provide a very concise profile of MPI routines in an application. To use it, simply load the fpmpi (or fpmpi_papi) module and relink. Then rerun your application. There are a few environment variables to control profiling output:
  - MPI_PROFILE_DISABLE: disables statistic collection until fpmpi_enable is called (#include fpmpi.h).
  - MPI_PROFILE_SUMMARY: setting this disables creation of individual MPI process statistics files; set it when running with thousands of processes.
  - MPI_PROFILE_FILE: name of the process statistics file; the default is profile.txt.
  - MPI_HWPC_COUNTERS: list of events or event set number, as in libhwpc.
- A third unsupported tool is TAU (Tuning and Analysis Utilities), a portable profiling and tracing toolkit for performance analysis of parallel programs written in Fortran, C, C++, Java, and Python. Basic profiling with TAU can be done in the following steps:
  - In your makefile or configuration script, set TAUROOTDIR=/apps/TAU/prod/jaguar and then add the line include $(TAUROOTDIR)/lib/Makefile.tau-pdt-pgi.
  - Build your code using the modified makefile or configuration script. Contact help@nccs.gov if there are any error messages at this stage. TAU will revert to the normal build process if the automatic instrumentation is unsuccessful (i.e., you won’t get an instrumented file).
  - You will get a regular executable. Submit your job as usual.
  - After execution, there should be a profile.xxx text file.
TAU can also do MPI profiling and collect hardware performance counter data.
How do I get performance counter data for my program?
Use the following process:
- Load the CrayPat module: module load craypat.
- Compile your code. If it is Fortran 90 with modules, compile with -Mprof=func.
- Run pat_build -u -g mpi a.out.
- Run a.out+pat as you would a.out, BUT make sure PAT_RT_HWPC is set to 1 in the batch script. (If you want just a regular profile, don’t set PAT_RT_HWPC.)
- Run pat_report <dir>/*.xf, where <dir> is automatically generated by the instrumented code.
The resulting output will have performance counter results for the entire run AND for each subroutine.
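A sketch of the whole sequence (the source file name and process count are hypothetical); the directory passed to pat_report is generated by the instrumented run, so it is left as a placeholder:

> module load craypat
> ftn -c myprog.f90
> ftn myprog.o -o a.out
> pat_build -u -g mpi a.out

Then, in the batch script:

setenv PAT_RT_HWPC 1
aprun -n 64 ./a.out+pat

After the job completes:

> pat_report <dir>/*.xf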
Where can I find documentation on MPI environment variables?
You can find current information on MPI environment variables from the mpi_intro man page.
What are the differences between MPT 2 and MPT 3?
MPT 3 Features
- Support for multiple interconnect devices. MPI 3.0 can support multiple interconnect devices for a single MPI job. This allows each process (rank) of an MPI job to create the most optimal messaging path to every other process in the job, based on the topology of the given ranks. The two device drivers that are supported are the shared memory (SMP) driver and the portals device driver. The SMP device driver is based on shared memory, and is used for communication between ranks that share a node. The portals device is used for communication between ranks that span nodes. The portals device was completely rewritten to work with other devices on a single application and will allow future optimizations to be more easily added.
- MPT 3.0 uses a completely new launching sequence via the Process Manager Interface (PMI) library. This includes a PMI daemon process on each compute node. This daemon process is started at program launch and exits when the program exits. Applications are still launched via aprun in the same manner as with previous MPI versions.
- The MPI 3.0 source is based on MPICH2 1.0.4p. The MPI 2.0 source was based on MPICH2 1.0.2. The new version contains numerous fixes in the machine-independent areas of MPI.
- MPI 3.0 has some new defaults and new environment variables.
The new MPI 3.0 environment variables are:
- MPICH_ENV_DISPLAY: displays environment variables and their values
- MPICH_VERSION_DISPLAY: displays the MPICH2 Cray version number and build information
- MPICH_PTL_EAGER_LONG: formerly known as MPICH_PTLS_EAGER_LONG
- MPICH_MSGS_PER_PROC: overrides the default internal message header maximum
- MPICH_SMPDEV_BUFS_PER_PROC: overrides the default SMP device buffer size
- MPICH_COLL_OPT_OFF: disables the collective optimizations
- MPICH_ABORT_ON_ERROR: replaces the old MPICH_DBMASK env variable
- MPICH_RANK_REORDER_DISPLAY: displays the rank-to-node mapping. The rank order can be manipulated via the MPICH_RANK_REORDER_METHOD env variable. (Replaces the old PMI_DEBUG env variable.)
- MPICH_SMP_SINGLE_COPY_SIZE: specifies the minimum message size to qualify for on-node single-copy transfers
- MPICH_SMP_SINGLE_COPY_OFF: disables the on-node single-copy optimization
- MPICH_RMA_MAX_OUTSTANDING_REQS: controls the maximum number of outstanding RMA operations on a given window
- MPICH_SMP_OFF: disables the SMP device used for on-node messages
- MPICH_ALLREDUCE_LARGE_MSG: adjusts the cutoff for the SMP-aware MPI_Allreduce algorithm
- PMI_EXIT_QUIET: tells the Process Manager Interface (PMI) to inhibit reporting all exits
There are also some new defaults for existing MPI environment variables:
Variable Name               New Default   Old Default
MPICH_ALLTOALL_SHORT_MSG    1024          512
MPICH_ALLTOALLVW_FCSIZE     32            120
MPICH_ALLTOALLVW_SENDWIN    20            80
MPICH_ALLTOALLVW_RECVWIN    20            100

In MPT 3.0, the architecture-specific collective optimizations are enabled by default, so the MPI_COLL_OPT_ON variable has been deprecated. It is replaced by MPICH_COLL_OPT_OFF to allow the user to disable these optimizations if desired.
- A new environment variable, MPICH_MPIIO_HINTS, has been added to allow users to set MPI-IO hints without code modifications. The supported hints are: romio_cb_read, romio_cb_write, cb_buffer_size, cb_nodes, cb_config_list, romio_no_indep_rw, ind_rd_buffer_size, ind_wr_buffer_size, romio_ds_read, romio_ds_write, and direct_io. The intro_mpi(3) man page contains additional information on how to use them as well as their default values.
- For more information on all of the supported MPI and SHMEM environment variables, please refer to the intro_mpi and intro_shmem man pages.
- MPT 3.0.1 supports the Berkeley Lab Checkpoint/Restart (BLCR) feature on UNICOS/lc CNL systems. The initial implementation of checkpoint/restart (CPR) in UNICOS/lc is in release 2.1. However, full support in UNICOS/lc is deferred until a future release. CrayPat 4.3.0 also supports CPR. However, to run a CrayPat-instrumented code with MPT 3.0.1 under a CPR environment, the user needs to set an environment variable so that each process writes data to its own file. When MPT supports shared parallel I/O (multiple processes writing to the same file) under CPR (planned for MPT 3.0.2), this environment variable can be omitted.
Differences and Incompatibilities
- The cancelling of sends is not supported.
- Several MPI environment variable names have changed with MPT 3.0. They are:
  MPICH_PTL_EAGER_LONG replaces MPICH_PTLS_EAGER_LONG
  MPICH_ABORT_ON_ERROR replaces MPICH_DBMASK
- MPT 2.0 allowed a number of cnos_ functions (like cnos_barrier or cnos_get_rank). These functions are no longer supported in MPT 3.0. Applications built with MPT 3.0 that use them may still have them satisfied from libpct.a; however, this could lead to hangs and other problems. In MPT 3.0, MPI and SHMEM use PMI for functionality similar to what the cnos_ functions provided. Users can examine the pmi/include/pmi.h header for replacement functions and #include it in their programs. The MPT 3.1 release will include a document describing how to use the PMI functions in pmi.h.
- With MPT 3.0, the PMI library is used to assist aprun in launching SHMEM programs as well as MPI programs. The intro_shmem man page has always specified that a call to shmem_finalize is necessary to allow proper cleanup. With MPT 3.0, if shmem_finalize is not called, the PMI library may conclude that some processes ended prematurely and tell aprun to kill those processes, and the job will get “killed” messages. Users should insert a shmem_finalize call into their program to allow proper cleanup and avoid the confusion.
- A new environment variable called MPICH_CPU_YIELD was added to MPI 3.0 to allow behavior similar to MPI 2.0 in certain cases. If set, MPI calls sched_yield() in the global progress loop. The sched_yield() function forces the current process/thread to relinquish the processor. In 3.0, the existing sched_yield calls in the SMP progress engine were removed because they significantly increase on-node pingpong latency. This setting is most useful when the user over-subscribes the number of CPUs, or when more than one process or thread is pinned to the same CPU. Note that this scenario can happen when using the Pathscale compiler with a hybrid OpenMP/MPI application if one is not careful.
- MPI programs that use MPI_ANY_SOURCE with pre-posted receives may see performance degradation with respect to MPT 2.0. The MPT 2.0 performance can be obtained by setting MPICH_SMP_OFF in MPT 3.0, thus disabling the SMP device. Note that using MPI_ANY_SOURCE with MPI_Probe/MPI_Iprobe followed by a receive (hence non-pre-posted) should not see significant changes in performance.
- MPI 2.0 was based on ANL release 1.0.2. MPI 3.0 is based on ANL release 1.0.4, which made significant changes to the C++ bindings, including the header file. For this reason, users of the C++ bindings should recompile when switching to MPI 3.0.
Where can I find more information?
If you haven’t already, please check out the other Jaguar resource pages, which cover compiling, file systems, batch jobs, open issues, parallel I/O tips, a CrayPAT overview, and other reports and presentations related to Jaguar.
Another good resource (without Jaguar-specific information) is the documentation that Cray provides at CrayDocs.