Open Issues
The ongoing Jaguar software and hardware upgrades result in various issues that will most likely affect all Jaguar users at some point. Issues that hinder productivity on Jaguar are taken very seriously and are given urgent attention.
This page lists several known issues that are being worked on. Entries will be updated as issues are resolved, and new issues will be added as they are discovered.
We appreciate your patience while we work through these issues and ask that you report any problems not related to those listed below to the NCCS User Assistance Center.
Large jobs fail even after increasing the value of MPI_MSGS_PER_PROC
The error message specifies the wrong variable. It should be MPICH_MSGS_PER_PROC.
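For users who already set the misnamed variable in a job script, the correction can be sketched as below. This is only an illustration: fix_msgs_per_proc is a hypothetical helper name, and the sketch assumes the value chosen for MPI_MSGS_PER_PROC is the one intended for MPICH_MSGS_PER_PROC.

```shell
#!/bin/bash
# Sketch: carry a value set under the misnamed variable over to the real one.
# fix_msgs_per_proc is a hypothetical helper, not part of any Cray tool.
fix_msgs_per_proc() {
    # Act only when the misnamed variable is set and the correct one is not.
    if [ -n "${MPI_MSGS_PER_PROC:-}" ] && [ -z "${MPICH_MSGS_PER_PROC:-}" ]; then
        export MPICH_MSGS_PER_PROC="$MPI_MSGS_PER_PROC"
        echo "MPICH_MSGS_PER_PROC set to $MPICH_MSGS_PER_PROC" >&2
    fi
}

# Call once in the job script before launching the application.
fix_msgs_per_proc
```

An existing MPICH_MSGS_PER_PROC setting is left untouched, so the helper is safe to keep in a script after the variable name has been corrected.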
Errors when running codes compiled with Pathscale
Codes compiled with the Pathscale compilers may produce the following error when run:
lib-4965 : WARNING
Unable to find error message (check NLSPATH, file lib.cat)
This was first noticed on a code compiled with array bounds checking enabled.
The lib.cat file contains a list of error messages for Pathscale. However, it is not located in a Lustre directory, so codes running under CNL cannot access it when they need to. This problem has been reported. A temporary fix is to copy the lib.cat file from its location (/opt/pathscale/lib/3.0/lib.cat) to your work directory and then set the NLSPATH variable to the fully qualified path to that file:
export NLSPATH="/tmp/work/$USER/lib.cat"
or
setenv NLSPATH /tmp/work/$USER/lib.cat
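The copy and the variable setting can be combined in one step. The following is a minimal bash sketch, assuming the work directory is /tmp/work/$USER as in the commands above; stage_lib_cat is a hypothetical helper name, and both paths can be overridden through its arguments.

```shell
#!/bin/bash
# stage_lib_cat is a hypothetical helper; its defaults are the paths quoted above.
stage_lib_cat() {
    local src="${1:-/opt/pathscale/lib/3.0/lib.cat}"
    local dest_dir="${2:-/tmp/work/$USER}"
    if [ ! -r "$src" ]; then
        echo "lib.cat not readable at $src" >&2
        return 1
    fi
    cp "$src" "$dest_dir/lib.cat" || return 1
    # NLSPATH must hold the fully qualified path to the copied file.
    export NLSPATH="$dest_dir/lib.cat"
}

# Example (run before launching the application):
#   stage_lib_cat && echo "NLSPATH=$NLSPATH"
```

NLSPATH is exported only after the copy succeeds, so a failed copy does not leave the variable pointing at a missing file.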
Executables Run via aprun Do Not Produce stderr
Several users have reported that codes run through aprun do not produce stderr. This issue is under investigation.
CrayPAT Overhead in Instrumented Codes is Large
On CNL, CrayPAT (v. 3.2.3) instrumented codes incur a larger overhead for subroutine calls. For example, at least one code ran about three times longer when instrumented. And although CrayPAT tries to compensate for the overhead it incurs, it is likely to report hot spots that aren’t really hot spots.
One possible way to deal with this is to inline the routines that show up as false hot spots, but which routines those are cannot be known a priori. Furthermore, inlining with the Portland Group compilers is a two-step process in which the compiler may use temporary object files (.o), which are removed after the executable is built. The pat_build command requires these object files and thus fails. So running pat_build on code with inlining turned on does not work in all scenarios; for example, it may fail when the code is composed of many object files in several directories.
In summary, CrayPAT-instrumented codes incur a large overhead on CNL, and some routines are falsely identified as hot spots. Users should interpret the reported results with care.
CNL Compute Nodes Can See Only the Lustre Work Space
The Network File Service (NFS)-mounted home, project, and application directories are not accessible to the CNL compute nodes.
- The executable must be run from within the Lustre work space. It can reside in the home, project, or application directory as long as it is launched with aprun from within the work space; aprun will then copy the executable into the Lustre file system.
- Batch jobs can be submitted from the home or work space. If a job is submitted from a user’s home area, the user should cd into the Lustre work space directory prior to running the executable through aprun. An error similar to the following may be returned if this is not done:
aprun: [NID 94]Exec /tmp/work/userid/a.out failed: chdir /autofs/na1_home/userid No such file or directory
- Input must reside in the Lustre work space.
- Output must also be sent to the Lustre file system.
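A batch script can guard against the chdir error above by checking its location before calling aprun. The sketch below assumes the work space is rooted at /tmp/work/$USER, as the error message suggests; in_work_space and WORK_ROOT are hypothetical names introduced for illustration.

```shell
#!/bin/bash
# in_work_space is a hypothetical helper. WORK_ROOT assumes the /tmp/work/$USER
# layout that appears in the error message above.
WORK_ROOT="${WORK_ROOT:-/tmp/work/$USER}"

in_work_space() {
    # Succeeds when the given directory (default: current) is WORK_ROOT or
    # lies below it. The check is purely textual; symlinks are not resolved.
    local dir="${1:-$PWD}"
    case "$dir/" in
        "$WORK_ROOT"/*) return 0 ;;
        *)              return 1 ;;
    esac
}

# Typical use inside a batch script:
#   cd "$WORK_ROOT" || exit 1
#   in_work_space || { echo "not in the Lustre work space" >&2; exit 1; }
#   aprun -n 2 ./a.out
```

The trailing slash added before matching keeps a sibling directory such as /tmp/work/alice2 from being mistaken for /tmp/work/alice.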
Unable to Submit Batch Job from Within Batch Job
Batch jobs submitted from within a batch script will be queued but will not run. Users are encouraged not to chain jobs by submitting batch jobs from within batch jobs.
aprun Depth Option Does Not Currently Work
The aprun -d option does not work as stated in the aprun man page. Depth values greater than 1 fail.
> aprun -n 2 -d 2 hello-omp.x
confirmed depth (1) is less than claimed depth (2)
>
OpenMP codes can be run without specifying the depth option.
> setenv OMP_NUM_THREADS 2
> aprun -n 2 -N 1 hello-omp.x
Hello from rank 0 (thread 0) on nid16346 <-- MASTER
Hello from rank 0 (thread 1) on nid16346 <-- slave
Hello from rank 1 (thread 0) on nid16347 <-- MASTER
Hello from rank 1 (thread 1) on nid16347 <-- slave
Application 91314 resources: utime 0, stime 0
>
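For bash job scripts, the same launch pattern can be wrapped in a small function. This is a sketch only: run_omp and the DRY_RUN switch are hypothetical, and the fixed -N 1 simply mirrors the session above (one rank per node, with threads sharing that node) rather than a general recommendation.

```shell
#!/bin/bash
# run_omp is a hypothetical wrapper. It mirrors the session above: set
# OMP_NUM_THREADS, place one rank per node with -N 1, and omit -d.
# Set DRY_RUN=1 to print the command instead of launching it.
run_omp() {
    local ranks="$1" threads="$2"
    shift 2
    export OMP_NUM_THREADS="$threads"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "OMP_NUM_THREADS=$threads aprun -n $ranks -N 1 $*"
    else
        aprun -n "$ranks" -N 1 "$@"
    fi
}

# Example: DRY_RUN=1 run_omp 2 2 hello-omp.x
```

The dry-run mode makes it possible to check the generated command line on a login node before submitting the job.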
PGI OpenMP Numa Library
The non-uniform memory access (numa) library is not found in the default PGI OpenMP compile.
> cc omptest.c -mp
/opt/xt-pe/2.0.10/bin/snos64/cc: INFO: linux target is being used
omptest.c:
/usr/bin/ld: cannot find -lnuma
>
One work-around to this problem is to build without numa.
> cc omptest.c -mp=nonuma
/opt/xt-pe/2.0.10/bin/snos64/cc: INFO: linux target is being used
omptest.c:
>
The PathScale compilers do not exhibit this behavior.
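A build script that must work both before and after the library problem is fixed could try the default OpenMP build first and fall back to -mp=nonuma only on the specific link error shown above. The following is a rough sketch under that assumption; build_omp and CC_CMD are hypothetical names, and the match on "cannot find -lnuma" relies on the linker message quoted above.

```shell
#!/bin/bash
# build_omp is a hypothetical helper. CC_CMD defaults to the cc wrapper used above.
build_omp() {
    local src="$1" log
    # First attempt: default OpenMP build. Compiler output is captured so the
    # link error can be inspected; on success it is discarded.
    log=$("${CC_CMD:-cc}" "$src" -mp 2>&1) && return 0
    # Retry without the numa library only for the specific link error.
    case "$log" in
        *"cannot find -lnuma"*) "${CC_CMD:-cc}" "$src" -mp=nonuma ;;
        *) printf '%s\n' "$log" >&2; return 1 ;;
    esac
}

# Example: build_omp omptest.c
```

Any other compile or link failure is reported unchanged, so the fallback does not hide unrelated errors.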
Under PGI 7 Compilers, -fast Is the Same as -fastsse
Under PGI 7 compilers, the optimization options -fast and -fastsse are the same. This is not the case under the PGI 6 versions.
Under the PGI 6 versions, -fast sets the following:
-O2 -Munroll=c:1 -Mnoframe -Mlre
-fastsse sets the following:
-fast -Mvect=sse -Mscalarsse -Mcache_align -Mflushz
Under the PGI 7 versions, -fast sets the following:
-O2 -Munroll=c:1 -Mnoframe -Mlre -Mvect=sse -Mscalarsse -Mcache_align -Mflushz
-fastsse sets the following:
-fast -Mvect=sse -Mscalarsse -Mcache_align -Mflushz
-Mvect=nosse can be used in combination with -fast to disable -Mvect=sse.
Resolved Issues
Running Multiple aprun Commands Simultaneously
Batch jobs cannot currently run multiple instances of aprun simultaneously. Thus, the following will not work:
aprun -n 256 ./a.out &
aprun -n 256 ./b.out &
aprun -n 512 ./c.out
wait
This issue was corrected in the recent OS upgrade.
aprun Node Target Option Does Not Currently Work
The aprun -L option does not work as stated in the aprun man page. It appears to be ignored.
> aprun -q -n4 /bin/cat /proc/cray_xt/nid
16350
16350
16351
16351
> aprun -q -L 16351 -n1 /bin/cat /proc/cray_xt/nid
16350
>
This issue was corrected in a recent Alps upgrade.
September 17, 2007
> aprun -q -n4 /bin/cat /proc/cray_xt/nid
1931
1931
1932
1932
> aprun -q -L 1930 -n1 /bin/cat /proc/cray_xt/nid
user-specified NIDs in command 0 do not match the confirmed NIDs
> aprun -q -L 1931 -n1 /bin/cat /proc/cray_xt/nid
1931
> aprun -q -L 1932 -n1 /bin/cat /proc/cray_xt/nid
1932
>