Open Issues

The ongoing Jaguar software and hardware upgrades have introduced various issues that will most likely affect all Jaguar users at some point. Issues that hinder productivity on Jaguar are taken very seriously and are being addressed urgently.

This page lists known issues that are currently being worked on. Entries will be updated as issues are resolved, and new issues will be added as they are discovered.

We appreciate your patience while we work through these issues and ask that you report any problems not related to those listed below to the NCCS User Assistance Center.

Large jobs fail even after increasing the value of MPI_MSGS_PER_PROC

The error message names the wrong environment variable. The variable to set is MPICH_MSGS_PER_PROC, not MPI_MSGS_PER_PROC.
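As a minimal sketch, the correctly named variable can be set in the batch script before the aprun line. The value 32768 below is only a placeholder chosen for illustration, not a recommended setting; pick a value suited to the job.

```shell
# Placeholder value for illustration only; not a recommended setting.
export MPICH_MSGS_PER_PROC=32768

# Confirm what the job will see before launching with aprun.
echo "MPICH_MSGS_PER_PROC=$MPICH_MSGS_PER_PROC"
```

Under csh or tcsh, the equivalent is setenv MPICH_MSGS_PER_PROC 32768.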

Errors when running codes compiled with PathScale

Codes compiled with the PathScale compilers may produce the following error when run:
lib-4965 : WARNING
Unable to find error message (check NLSPATH, file lib.cat)

This was first noticed on a code compiled with array bounds checking enabled.

The lib.cat file contains a list of error messages for PathScale. However, it is not located in a Lustre directory, so codes running under CNL cannot access it when they need to. This problem has been reported. A temporary fix is to copy the lib.cat file from its location (/opt/pathscale/lib/3.0/lib.cat) to your work directory and then set the NLSPATH variable to the fully qualified path to that file:
export NLSPATH="/tmp/work/$USER/lib.cat"
or
setenv NLSPATH /tmp/work/$USER/lib.cat
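The copy and the variable setting can be combined in a few lines of sh/bash. This is a sketch only: it assumes the work directory is /tmp/work/$USER, as in the paths above, and the variable names LIBCAT_SRC and WORK_DIR are introduced here for illustration.

```shell
# Source path is taken from the text above; WORK_DIR assumes the /tmp/work/$USER layout.
LIBCAT_SRC=/opt/pathscale/lib/3.0/lib.cat
WORK_DIR="/tmp/work/${USER:-$(id -un)}"

if [ -r "$LIBCAT_SRC" ] && [ -d "$WORK_DIR" ]; then
    # Copy the message catalog to a location the compute nodes can read.
    cp "$LIBCAT_SRC" "$WORK_DIR/lib.cat"
    export NLSPATH="$WORK_DIR/lib.cat"
    echo "NLSPATH set to $NLSPATH"
else
    # Leave the environment untouched if either path is missing.
    echo "lib.cat or work directory not found; NLSPATH left unchanged" >&2
fi
```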

Executables Run via aprun Do Not Produce stderr

Several users have reported that codes run through aprun do not produce stderr. This issue is under investigation.

CrayPAT Overhead in Instrumented Codes is Large

On CNL, codes instrumented with CrayPAT (v. 3.2.3) incur a large overhead for subroutine calls. For example, at least one code ran about three times longer when instrumented. Although CrayPAT tries to compensate for the overhead it incurs, it is likely to report hot spots that aren’t really hot spots.

One possible way to deal with this is to inline the routines that would otherwise be reported as false hot spots, but it is impossible to know a priori which routines those are. Furthermore, inlining with the Portland Group compilers is a two-step process in which the compiler may use temporary object files (.o) that are removed after the executable is built. The pat_build command requires these object files and therefore fails. As a result, running pat_build on code with inlining turned on does not work in all scenarios; for example, it may fail when the code is composed of many object files sitting in several directories.

In summary, there is a large overhead with CrayPAT instrumented codes on CNL, and some routines are falsely identified as hot spots. Users should interpret the reported results with care.

CNL Compute Nodes Can See Only the Lustre Work Space

The Network File System (NFS)-mounted home, project, and application directories are not accessible to the CNL compute nodes.

  • Executables must be launched from within the Lustre work space. The executable file itself can reside in the home, project, or application directory as long as aprun is invoked from within the work space; aprun will then copy the executable into the Lustre file system.
  • Batch jobs can be submitted from home or work space. If submitted from a user’s home area, the user should cd into the Lustre work space directory prior to running the executable through aprun. An error similar to the following may be returned if this is not done:
aprun: [NID 94]Exec /tmp/work/userid/a.out failed: chdir /autofs/na1_home/userid
No such file or directory
  • Input must reside in the Lustre work space.
  • Output must also be sent to the Lustre file system.
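A batch script can guard against the chdir error above by checking its current directory before calling aprun. The following is a sketch under stated assumptions: the helper in_work_space and the variable WORK_ROOT are hypothetical names, and the work space root is assumed to be /tmp/work/$USER, as in the paths shown on this page.

```shell
# Hypothetical helper: succeeds when directory $1 is $2 itself or lies beneath it.
in_work_space() {
    case "$1" in
        "$2"|"$2"/*) return 0 ;;
        *) return 1 ;;
    esac
}

# Assumed work space root, following the /tmp/work/$USER paths used on this page.
WORK_ROOT="/tmp/work/${USER:-$(id -un)}"

if in_work_space "$PWD" "$WORK_ROOT"; then
    echo "OK: $PWD is inside the Lustre work space"
else
    echo "Warning: $PWD is outside $WORK_ROOT; cd there before calling aprun" >&2
fi
```

In a batch script, the check would sit just before the aprun line, after a cd into the work space directory.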

Unable to Submit Batch Job from Within Batch Job

Batch jobs submitted from within a batch script will be queued but will not run. Users are encouraged not to chain jobs by submitting batch jobs from within batch jobs.

aprun Depth Option Does Not Currently Work

The aprun -d option does not work as stated in the aprun man page. Depth values greater than 1 fail.

> aprun -n 2 -d 2 hello-omp.x
confirmed depth (1) is less than claimed depth (2)
>

OpenMP codes can be run without specifying the depth option.

> setenv OMP_NUM_THREADS 2
> aprun -n 2 -N 1 hello-omp.x
Hello from rank 0 (thread 0) on nid16346 <-- MASTER
Hello from rank 0 (thread 1) on nid16346 <-- slave
Hello from rank 1 (thread 0) on nid16347 <-- MASTER
Hello from rank 1 (thread 1) on nid16347 <-- slave
Application 91314 resources: utime 0, stime 0
>
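The transcript above uses csh syntax. For sh or bash users, a rough equivalent is sketched below; the check for aprun is an addition for illustration, so that the snippet does nothing harmful when run outside a batch job.

```shell
# sh/bash equivalent of "setenv OMP_NUM_THREADS 2"
export OMP_NUM_THREADS=2

# Launch only where aprun is available (inside a batch job on the XT system).
if command -v aprun >/dev/null 2>&1; then
    aprun -n 2 -N 1 hello-omp.x
else
    echo "aprun not found; would run: aprun -n 2 -N 1 hello-omp.x"
fi
```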

PGI OpenMP Numa Library

The non-uniform memory access (NUMA) library is not found at link time in the default PGI OpenMP compile.

> cc omptest.c -mp
/opt/xt-pe/2.0.10/bin/snos64/cc: INFO: linux target is being used
omptest.c:
/usr/bin/ld: cannot find -lnuma
>

One work-around to this problem is to build without numa.

> cc omptest.c -mp=nonuma
/opt/xt-pe/2.0.10/bin/snos64/cc: INFO: linux target is being used
omptest.c:
>

The PathScale compilers do not exhibit this behavior.

Under PGI 7 Compilers, -fast Is the Same as -fastsse

Under PGI 7 compilers, the optimization options -fast and -fastsse are the same. This is not the case under the PGI 6 versions.

Under the PGI 6 versions,

-fast sets the following:

-O2 -Munroll=c:1 -Mnoframe -Mlre

-fastsse sets the following:

-fast -Mvect=sse -Mscalarsse -Mcache_align -Mflushz

Under the PGI 7 versions,

-fast sets the following:

-O2 -Munroll=c:1 -Mnoframe -Mlre -Mvect=sse -Mscalarsse -Mcache_align -Mflushz

-fastsse sets the following:

-fast -Mvect=sse -Mscalarsse -Mcache_align -Mflushz

-Mvect=nosse can be used in combination with -fast to disable -Mvect=sse.

Resolved Issues

Running Multiple aprun Commands Simultaneously

Batch jobs previously could not run multiple instances of aprun simultaneously. Thus, the following did not work:

aprun -n 256 ./a.out &
aprun -n 256 ./b.out &
aprun -n 512 ./c.out
wait

This issue was corrected in the recent OS upgrade.
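The background-and-wait pattern used in that script can be tried without aprun. In the sketch below, run_step is a hypothetical stand-in for an aprun launch, and LOG_DIR is a temporary directory introduced only for the demonstration.

```shell
# Stand-in for an aprun launch, e.g. "aprun -n 256 ./a.out" on Jaguar.
run_step() {
    sleep 1
    echo "$1 finished"
}

# Temporary directory for the demonstration logs.
LOG_DIR=$(mktemp -d)

run_step a.out > "$LOG_DIR/a.log" &   # first launch, in the background
run_step b.out > "$LOG_DIR/b.log" &   # second launch, in the background
run_step c.out > "$LOG_DIR/c.log"     # third launch, in the foreground
wait                                  # block until both background launches exit

cat "$LOG_DIR/a.log" "$LOG_DIR/b.log" "$LOG_DIR/c.log"
```

The wait at the end matters: without it, the batch script could exit while the background launches are still running.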

aprun Node Target Option Does Not Currently Work

The aprun -L option did not work as stated in the aprun man page. It appeared to be ignored.

> aprun -q -n4 /bin/cat /proc/cray_xt/nid
 16350
 16350
 16351
 16351
> aprun -q -L 16351 -n1 /bin/cat /proc/cray_xt/nid
 16350
>

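This issue was corrected in a recent ALPS upgrade (September 17, 2007). The -L option now selects the requested node and rejects a node that is not among the job's confirmed nodes: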

> aprun -q -n4 /bin/cat /proc/cray_xt/nid
 1931
 1931
 1932
 1932
> aprun -q -L 1930 -n1 /bin/cat /proc/cray_xt/nid
 user-specified NIDs in command 0 do not match the confirmed NIDs
> aprun -q -L 1931 -n1 /bin/cat /proc/cray_xt/nid
 1931
> aprun -q -L 1932 -n1 /bin/cat /proc/cray_xt/nid
 1932
>