National Energy Research Scientific Computing Center
  A DOE Office of Science User Facility
  at Lawrence Berkeley National Laboratory
 

Bassi Problem Tracking

The following are known problems being tracked by NERSC on Bassi. Problems that must be resolved by IBM are reported and receive an IBM "PMR" tracking number. The problems listed here are those that have a direct, noticeable effect on NERSC users; therefore, this is not an exhaustive list of all system PMRs.

Known Unresolved Problems

Resolved Problems


Parallel Job Launch Failures
IBM PMR #: 00925,49R,000
Created
June 1, 2006
Status
Fixed with November 15, 2006 upgrade to AIX 5.3 TL5 SP3.
Description
Intermittent job launch failures are occurring with symptoms similar to:
ERROR: 0031-024 b0206.nersc.gov: no response; rc = -1                  
LoadL_starter: The program, /etc/pmdv4, terminated with a signal 11.
LoadL_starter: 2512-902 Unable to set process limits for user.
LoadL_starter: 2539-752 The getpcred system call failed for user ragerber. errno=2 
[A file or directory in the path name does not exist.]
The Schedd daemon forced this job into reject state due to an internal 
communication error, attempting to communicate with startd on 
b1101.nersc.gov.
Cause
Password, group, and security files are corrupt when they are initially indexed, causing a cascade of job-launch failures. The files are being indexed because of poor application performance under AIX 5.3 when large, unindexed files are in place (IBM PMR 76065,49R,000). The files should not even be necessary under AIX 5.3, but PAM/LDAP is unable to get group information from LDAP (IBM PMR 89551,49R,000), so the files are manually created by a script pulling from LDAP every two hours. Once PMR 89551 is resolved and a fix is in place, it is believed this problem will be fixed as well.
Resolution
The pulls from LDAP were suspended on the compute nodes as a temporary work-around. Fixed with November 15, 2006 upgrade to AIX 5.3 TL5 SP3.

Degraded parallel performance when reading STDIN
IBM PMR #: 86826,49R,000
Created
April 21, 2006
Status
A temporary fix (efix) was applied on August 2, 2006. A permanent fix is expected to be incorporated into Parallel Environment 4.2.2.5.
Description
Parallel programs that read from Standard Input (STDIN) are experiencing very poor performance.
Cause

A change was made in IBM's Parallel Environment (PE) version 4.2 to prevent a possible hang when an application required redirected input for one or more tasks, the data could not all be read, and a checkpoint was requested. Previously, such data was left undelivered on the STDIN pipes, and POE would wait in a read for data that would never be delivered to the task(s), causing a hang. The fix changed the logic in POE so that it checks for data on all tasks' STDIN pipes using a select() system call and, once data is read, delivers it to the appropriate task.

This additional logic causes POE to check continually for undelivered input, and the resulting series of interrupts from the select() system calls degrades application performance. As a workaround, setting MP_STDINMODE=n, where "n" is the number of the specific task that reads data via redirected STDIN, makes POE listen on just that task's STDIN pipe, greatly reducing the number of select() calls and associated interrupts. However, MP_STDINMODE helps only when a single task expects input data from STDIN; when multiple tasks must read from STDIN it does not help (and can prevent another task from reading data).
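The workaround described above could be applied in a job script roughly as follows. This is a minimal sketch, assuming that task 0 is the only task reading redirected STDIN; the task number, the executable name, and the input file name are hypothetical, and the launch line is left as a comment because it depends on the job.

```shell
#!/bin/sh
# Assumption: task 0 is the single task that reads redirected STDIN.
STDIN_TASK=0

# Tell POE to watch only that task's STDIN pipe.
MP_STDINMODE=$STDIN_TASK
export MP_STDINMODE

# Hypothetical launch line (executable and input file names are placeholders):
#   poe ./a.out < input.dat
echo "MP_STDINMODE=$MP_STDINMODE"
```

If more than one task must read from STDIN, this setting should not be used, for the reason given above.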

Resolution
A temporary fix is in place.

MPI-2 one-sided functions fail in 64-bit (Resolved)
IBM PMR #: 76523,49R,000
Status
This was determined to be a user coding error.
Description
A user reports that MPI-2 one-sided functions give wrong answers in 64-bit compiles. In 32-bit compiles the answers are correct.
Cause
Unknown.
Resolution
A pointer address was declared to be of type INTEGER(KIND=MPI_INTEGER), which fails in 64-bit. The correct specification is INTEGER(KIND=MPI_ADDRESS_KIND).
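The reason an address needs a wide enough integer can be illustrated with plain shell arithmetic. This is only a sketch of the general truncation effect, not the user's code: the address value is made up, and it is assumed that the incorrectly declared variable held only 4 bytes.

```shell
#!/bin/sh
# Made-up 64-bit address, chosen only because it does not fit in 32 bits.
addr=$((0x0000000110000A40))

# Assumption: the mis-declared variable keeps only the low 32 bits.
truncated=$((addr & 0xFFFFFFFF))

printf 'full address:       0x%016x\n' "$addr"
printf 'after 4-byte store: 0x%016x\n' "$truncated"
```

In a 32-bit compile every address fits in 4 bytes, so the same declaration happens to work, which matches the behavior reported above.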

LAPI RDMA example code fails
IBM PMR #: 76619,49R,000
Created
March 7, 2006
Status
Fixed with system upgrade to PE 4.2.2.4 and LoadLeveler 3.3.2.4 on August 2, 2006.
Description
The LAPI example from /opt/rsct/lapi/samples/xfer/Hw_xfer.c fails with:
                                                                        
(LAPI_Util(handle, (lapi_util_t *) &util_pvo)) returns error: 498       
(LAPI_Util(handle, (lapi_util_t *) &util_pvo)) returns error: 498       
Cause
The RDMA LAPI functionality contained in the sample code is not supported under LoadLeveler 3.3.0.0, which is currently on Bassi.
Resolution
The update to LoadLeveler 3.3.2.4 was scheduled for August 2, 2006. After installing it on a test system, we found that this capability must be enabled by the user: set the environment variable MP_RDMA_COUNT to a value greater than zero and use #@ network.LAPI = sn_all,not_shared,us,,,rcxtblocks=2 in LoadLeveler scripts.
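A LoadLeveler script combining the two settings above might look like the following. This is a sketch under stated assumptions: the node and task counts, the MP_RDMA_COUNT value of 1, and the executable name are hypothetical; only the network.LAPI line and the requirement that MP_RDMA_COUNT be greater than zero come from the text above.

```shell
#!/bin/sh
# Hypothetical job geometry; adjust for the actual job.
#@ job_type = parallel
#@ node = 1
#@ tasks_per_node = 2
# Network keyword taken from the resolution text above.
#@ network.LAPI = sn_all,not_shared,us,,,rcxtblocks=2
#@ queue

# Any value greater than zero enables the capability; 1 is an assumed example.
MP_RDMA_COUNT=1
export MP_RDMA_COUNT

# Hypothetical launch line for a binary built from the Hw_xfer.c sample:
#   poe ./Hw_xfer
echo "MP_RDMA_COUNT=$MP_RDMA_COUNT"
```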

Page last modified: Tue, 12 Dec 2006 00:24:53 GMT
Page URL: http://www.nersc.gov/nusers/systems/bassi/known_problems.php
Web contact: webmaster@nersc.gov
Computing questions: consult@nersc.gov
