
Franklin Known Problems

This page documents known problems on Franklin that have a direct, noticeable effect on NERSC users. Problems that must be resolved by Cray are reported to Cray and receive a Cray "SPR" tracking number.

Open Issues

Linking libsci with PGI C Compiler Requires "-pgf90libs" Option

Date opened: 02-19-08 (SPR 741542)

Description: The problem first appeared after the OS upgrade on 02/13/08 from OS 2.0.24 to 2.0.39. It may be related to libsci.
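
For example, a link line that previously worked now needs the extra option (the program and file names here are illustrative):

    # libsci is largely Fortran; -pgf90libs adds the PGI Fortran
    # run-time libraries so that the C link can resolve them
    cc -o myapp myapp.c -pgf90libs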

Status: Cray has an internal fix, in testing as of March 18. The problem is fixed under CNL in xt-asyncpe 1.0e, which will be available on Franklin soon.

CrayPat Reports a Truncated User Time for a Job Running over ~ 2 Hours

Date opened: 02-05-08 (SPR 741409)

Description: For a long-running job instrumented under xt-craypat/4.1, the reported user time is truncated to a value much smaller than it should be. As a result, the derived hardware performance metrics are incorrect; some are reported as larger than the machine's peak values.

This abnormality appears for jobs running longer than about 2 hours (on a 2.6 GHz machine). The overflow results from using integer arithmetic to convert clock ticks to cycles.

Status: Cray has an internal fix (changed to use floating-point arithmetic), in testing as of March 17. Fixed in CrayPat 4.2.0, which will be available on Franklin soon.
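
For reference, the affected numbers show up in the usual CrayPat workflow, roughly sketched below (module, executable, and data file names are illustrative):

    module load xt-craypat         # xt-craypat/4.1 exhibits the problem
    cc -o myapp myapp.c
    pat_build myapp                # creates the instrumented binary myapp+pat
    aprun -n 64 ./myapp+pat        # runs longer than ~2 hours trigger the truncation
    pat_report myapp+pat+*.xf      # reported user time and derived metrics come out wrong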

Zero-length Files Before/During System-Wide Outages

Date opened: 12-18-07 (SPR 741373)

Description: A few users have reported zero-length files before or during the recent system-wide outages caused by Lustre service nodes going down. If a user is editing a file at the time, exiting the editor may truncate the file and rewrite it from the editor's memory or a temporary file; if that I/O hangs or fails, the resulting file is left with zero length.

Status: Cray is working on improving system stability to avoid such occurrences, and users are strongly encouraged to back up their important files to HPSS frequently. Cray also opened SPR 741373, "LUSTRE DOES NOT IMMEDIATELY MAKE FILE SYSTEM READ-ONLY AFTER INTERNAL ERRORS", on 02/01/2008 to improve the situation.
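
For example, files can be copied to HPSS with the hsi and htar clients (the file and directory names are illustrative):

    hsi "put important_results.dat"    # copy a single file to HPSS
    htar -cvf run42.tar ./run42        # archive a whole directory into HPSS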

Occasionally Jobs Not Progressing or aprun Not Starting, So Jobs Exceed Wall Clock Limits

Date opened: 09-24-07 (SPR 740028)

Description: Occasionally a job appears to make no progress, or aprun does not start, so the job eventually exits with its wall clock limit exceeded. There is some suspicion that these errors are related to nodes running low on memory after out-of-memory problems with prior applications, which could prevent subsequent applications from starting.

Status: NERSC periodically runs a script manually to detect such "bad" nodes and mark them "admindown". Cray is working on starting the node health checking script automatically when out-of-memory jobs exit, to prevent "bad" nodes from being allocated to future jobs.

Unexpected interaction between bash shell and shell I/O redirection

Date opened: 07-16-07 (SPR 739249)

Description: Specifying "#PBS -S /bin/bash" in a batch script causes I/O redirection to fail, and the job exits with "exit code 127". The workaround is to use csh/tcsh and the corresponding I/O redirection syntax in the batch script, as sketched below.
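
A minimal sketch of the workaround (the script contents are illustrative):

    #PBS -S /bin/csh
    #PBS -l mppwidth=4
    #PBS -l walltime=00:30:00

    cd $PBS_O_WORKDIR
    aprun -n 4 ./a.out >& job.out    # csh-style redirection of stdout and stderr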

Status: Cray is working on the problem.

Runtime error message: "Apid xxxx killed. Received node failed or halted event for nid xxxx"

Date opened: 06-08-07

Description: Uncorrected memory errors (UMEs) cause compute nodes to become unresponsive. If a node allocated to an application dies while the job is running, the application is aborted with an error message such as "Apid xxxx killed. Received node failed or halted event for nid xxxx".

Status: Cray resolved the high frequency of UME errors by adjusting voltage settings on 07/16/07. We still see some of these node failures, but at the normal hardware failure rate for a system with 9,000+ nodes.

Resolved Issues

Lustre Incorrectly Thinks User Is Over Inode Quota

Date opened: 11-19-07 (SPR 740644)

Description: A few users reported this problem. A user has only a few hundred files, but Lustre considers the user over the 25,000-inode quota on /home or the 50,000-inode quota on /scratch, and the user cannot create new files. The inode quota bucket size on /scratch is set to 1,000; the bug can be hit at multiples of the bucket size boundary.
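
Affected users can compare Lustre's view with an actual file count (a sketch; the directory path is illustrative):

    lfs quota -u $USER /scratch            # Lustre's count of blocks and inodes used
    find /scratch/$USER -type f | wc -l    # rough count of the files actually present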

Date resolved: 02-13-08

Status: NERSC had a workaround for affected users in the early days, and turned off inode quotas temporarily for users on 01/04/08. Fixed with the OS upgrade to 2.0.39 on 02/13/08.

Running multiple independent parallel jobs simultaneously in a single batch script with aprun does not work

Date opened: 09-05-07 (SPR 739866)

Description: The Programming User's Guide has an example of running multiple independent parallel jobs simultaneously under CNL in a single batch script: multiple aprun commands, each backgrounded with "&", with the total number of CPUs needed by all the jobs in the "#PBS -l mppwidth=" field (see the sketch below). This was not working.
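
A minimal sketch of that pattern (executable names and CPU counts are illustrative):

    #PBS -l mppwidth=8        # total CPUs needed by all simultaneous jobs

    cd $PBS_O_WORKDIR
    aprun -n 4 ./job1 &       # each aprun backgrounded with "&"
    aprun -n 4 ./job2 &
    wait                      # wait for all backgrounded jobs to finish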

Date resolved: 02-13-08

Status: Fixed with OS upgrade to 2.0.39 on 02/13/08.

MPI_Allreduce Call Lacks Implicit Barrier

Date opened: 12-07-07 (SPR 740827)

Description: Various user codes failed with different error messages, ranging from "exit codes: 13" to suggestions to increase MPICH_UNEX_BUFFER_SIZE. Traceback information showed that the implicit barrier of the MPI_Allreduce function was not working properly. Adding an explicit MPI_Barrier call after each MPI_Allreduce allowed the applications to run successfully.

Date resolved: 01-22-08 (SPR 740827)

Status: Cray installed a patch on top of the current xt-mpt (message passing toolkit) module. Affected user applications tested it successfully; explicit barriers are no longer needed.

GAMESS (Shmem version) and NWCHEM applications cause system crash

Date opened: 09-14-07

Description: This problem may be related to Global Arrays usage. The symptom is that when either of these applications launches, a service I/O node goes down first, and then the whole system crashes.

Date resolved: 10-23-07

Status: Cray installed a patch on OS 2.0.14 on 09/23 to trap the NWCHEM and GAMESS (Shmem) problem that causes the system crash. Users can now try to run GAMESS (Shmem) or NWCHEM applications, but programs using Shmem atomic functions (which call Portals atomics) will exit with an error message similar to: "LIBSMA ERROR: PtlGetAddRegion failed (rc = 27) for PE 33 pid 0x39/16". The affected Shmem functions are:

  • shmem_*swap
  • shmem_*cswap
  • shmem_*finc
  • shmem_*fadd
  • shmem_set_lock
  • shmem_test_lock
Cray removed the Shmem atomic trap code and installed another Portals patch on 10/05/07; the majority of GAMESS (Shmem) and NWCHEM codes are now expected to run successfully, and we are waiting for more user exposure to confirm. Cray installed a new Shmem library under the xt-mpt/2.0.24d module and set it as the default on 10/23/07. The system tested as more robust for multiple large-concurrency NWCHEM jobs. Users are asked to recompile NWCHEM, GAMESS (Shmem), and Global Arrays codes, as sketched below.
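
For example, a recompile might look like this (a sketch; the module swap is only needed where 2.0.24d is not already the default):

    module swap xt-mpt xt-mpt/2.0.24d    # pick up the new Shmem library
    ftn -o mycode mycode.f90             # recompile/relink so the new library is used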

aprun "-L node_list" option does not work

Date opened: 08-27-07 (SPR 739665)

Description: aprun with "-L node_list" option does not honor the node list specified.
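
For reference, the option is used as follows (node IDs are illustrative):

    aprun -n 4 -L 20,21,22,23 ./a.out    # the specified candidate node list was being ignored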

Date resolved: 09-10-07

Solution: Patched in OS 2.0.14.

aprun MPMD mode does not work in batch mode

Date opened: 08-15-07 (SPR 739592)

Description: aprun with MPMD (Multiple Program Multiple Data) execution mode (aprun -n xx exe1 : -n xx exe2 : -n xx exe3 : ...) fails on CNL under batch mode (PBS).
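
A concrete instance of the failing pattern inside a batch script (executables and CPU counts are illustrative):

    #PBS -l mppwidth=24

    cd $PBS_O_WORKDIR
    aprun -n 8 ./atm : -n 8 ./ocn : -n 8 ./ice    # one MPMD launch running three programs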

Date resolved: 09-07-07

Solution: Patched in OS 2.0.14.

Runtime error message: "LIBSMA ERROR: PtlMDBind for GET symheap failed (rc = 26)"

Date opened: 06-14-07 (SPR 738915)

Description: Problem related to running Shmem applications. First reported with the GAMESS Shmem version, and also seen with a standalone Global Arrays (GA) application code.

Date resolved: 08-31-07

Solution: Patched in OS 2.0.14.

Runtime error message: "PtlMDAttach failed : PTL_VAL_FAILED"

Date opened: 08-22-07 (SPR 739661)

Description: Problem related to difficulty allocating large contiguous memory at the Portals level.

Date resolved: 08-28-07 (SPR 739661)

Solution: Patched in OS 2.0.14. Permanent fix will be in OS 2.0.20.

Job killed with walltime exceeded limit

Date opened: 07-19-07

Description: Jobs intermittently hit time limits set 5x larger than their usual run times; the same jobs run successfully at other times.

Date resolved: 08-02-07

Solution: Fixed with the OS upgrade on 07/23 (2.0.10 to 2.0.14) plus additional software patches. Confirmed on 08/02 with user codes that previously failed at high rates.

aprun: Apid xxxxx close of the compute node connection after app startup barrier

Date opened: 07-01-07

Description: Some jobs will fail with the above error message.

Date resolved: 07-23-07

Solution: Fixed with a pre-release aprun binary Cray installed on 7/23.

Runtime error message: "aprun: Application xxxx on node xxxx received aborted signal" with exit codes 13

Date opened: 06-27-07

Description: Some jobs aborted with the above error message and "exit codes: 13".

Date resolved: 07-18-07

Solution: More information was obtained by enabling a detailed stderr message: "exhausted unexpected receive queue buffering, increase via env. var. MPICH_UNEX_BUFFER_SIZE". Setting MPICH_UNEX_BUFFER_SIZE to a larger value (such as 120M) instead of the default 60M allows these codes to run successfully, as sketched below.
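
In a batch script the workaround looks like this (csh syntax shown; the bash equivalent is in the comment):

    setenv MPICH_UNEX_BUFFER_SIZE 120M    # default is 60M
    # bash: export MPICH_UNEX_BUFFER_SIZE=120M
    aprun -n 512 ./a.out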

Runtime error message: "aprun: /proc readdir timeout alarm occurred"

Date opened: 06-29-07

Description: Some jobs will fail with the above error message.

Date resolved: 07-09-07

Solution: Patched in OS 2.0.10. Permanent fix should be available in OS 2.0.11.

Wrong time stamps for output files

Date opened: 06-10-07

Description: Some job output files have incorrect time stamps of Sept 15, 2004 or Sept 16, 2004.

Date resolved: 07-09-07

Solution: Patched in OS 2.0.10.

