National Energy Research Scientific Computing Center
  A DOE Office of Science User Facility
  at Lawrence Berkeley National Laboratory
 

Bassi Timeline

A brief timeline of significant events on Bassi.

September 9, 2008: GPFS Upgrade
GPFS software was upgraded to version 3.2.1.4.
August 14, 2008: AIX Upgrade
AIX 5.3 was updated to version 5.3 TL8.
May 29, 2008: LoadLeveler upgraded
LoadLeveler was upgraded to version 3.4.2.0.
May 5, 2008: XLF updated to version 11.1.0.1
The default version of the IBM XLF Compiler was updated to version 11.1.0.1.
January 23, 2008: Default compilers upgraded
The default compilers were upgraded from XL Fortran 9.1.0.8 and VAC 7.0.0.0 to XL Fortran 10.1.0.5 and VAC 8.0.0.16.
August 22 to November 1, 2007: Degraded performance
An unneeded system daemon, which woke up once a minute, caused run times about 5 percent slower on the Bassi SSP suite of benchmark codes. A bug in an AIX script was identified and fixed, which restored performance.
September 12, 2007: Parallel Environment and LoadLeveler updates
The Parallel Environment is updated from 4.3.0.3 to 4.3.1.3 and LoadLeveler is updated from 3.4.0.0 to 3.4.1.2.
August 22, 2007: AIX and GPFS Updates
AIX was updated to version AIX 5.3 TL6 SP3. GPFS was updated to version 3.1.0.13.
July 11, 2007: Large page memory reduced to 20 GB per node
The large page memory pool on Bassi compute nodes was reduced to 20 GB per node, from 24 GB per node. This frees 4 GB per node to be used by the application stack, which must reside in small pages.
April 12, 2007: Checkpoint/restart enabled by default
Checkpoint/restart is enabled by default (see checkpoint and restart considerations).
April 10, 2007: Parallel Environment and LoadLeveler updates
The Parallel Environment is updated from 4.2.2.4 to 4.3.0.3 and LoadLeveler is updated from 3.3.2.4 to 3.4.0.0.
March 5, 2007: Regular wallclock run limits increased
The wallclock limit for the reg_1 class was increased from 24 to 36 hours, and from 12 to 18 hours for the reg_16 and reg_32 classes.
February 28, 2007: Firmware upgrade
A major firmware upgrade of "power code" and "global firmware" was installed. Due to the upgrade, the memory available to user code was reduced by 128 MB per node. The value of the ConsumableMemory resource was reduced to 27032 MB (26.40 GB) per node, or 3379 MB (3.30 GB) per task on a node running 8 MPI tasks.
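The arithmetic behind these figures can be checked with a short sketch. It assumes 1 GB = 1024 MB and that the previous per-node value was the 27160 MB limit quoted in the September 11, 2006 entry; the function and variable names are illustrative only.

```python
# Illustrative check of the memory figures quoted above.
# Assumption: 1 GB = 1024 MB.

MB_PER_GB = 1024


def per_task_memory_mb(node_memory_mb: int, tasks_per_node: int) -> int:
    """Split the per-node memory evenly among MPI tasks (whole MB)."""
    return node_memory_mb // tasks_per_node


previous_limit_mb = 27160  # per-node limit from the September 11, 2006 entry
firmware_cost_mb = 128     # reduction caused by the firmware upgrade
node_memory_mb = previous_limit_mb - firmware_cost_mb

task_memory_mb = per_task_memory_mb(node_memory_mb, 8)

print(f"{node_memory_mb} MB = {node_memory_mb / MB_PER_GB:.2f} GB per node")
print(f"{task_memory_mb} MB = {task_memory_mb / MB_PER_GB:.2f} GB per task")
```

With 8 tasks per node the split is exact, since 8 x 3379 MB = 27032 MB.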
January 18, 2007: Reserved interactive and debug nodes
Four nodes are reserved for debug and interactive jobs from 05:00 to 18:00 Pacific Time beginning Jan. 18, 2007.
January 8, 2007: GPFS 3.1 upgrade
December 7, 2006: GPFS maintenance software installed
November 15, 2006: AIX upgrade
The operating system was upgraded to AIX 5.3 TL5 SP3. This upgrade fixed the degraded performance issue that was observed after the September 11 changes.
September 11, 2006: Job launch failures and paging space kills
Two significant changes were made to address recent problems:
  1. Job launch failures
  2. Jobs exceeding memory paging space

Job launch failures

Because so many jobs were failing at startup, we have turned off password file indexing on Bassi. A bug in AIX causes index file corruption, which in turn causes jobs to fail. With indexing turned off we do not expect any more job launch failures, but we have observed relatively large variation in runtime performance.

Jobs exceeding memory paging space

We are making two changes to avoid node failures due to memory oversubscription. Most MPI jobs will be unaffected, but if you are running threaded applications, or jobs that run with fewer than 8 tasks per node, you may need to modify your batch scripts.
  • Memory resource limits are being enforced in LoadLeveler. Jobs that request more than 27160 MB (26.52 GB) of memory per node will not run. Default values are in place so that job scripts that request 8 tasks per node do not need any modification. Jobs will be killed if they try to access more than 27160 MB of memory per node. Details are available on the Running Jobs pages.
  • The default NERSC shell initialization files (dot-files) have been modified to set an environment variable that starts all parallel jobs in large-page memory. The setting is export LDR_CNTRL=LARGE_PAGE_DATA=Y for sh, ksh, and bash, and setenv LDR_CNTRL "LARGE_PAGE_DATA=Y" for csh and tcsh.
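As a rough illustration of the memory limit described in the first change, the sketch below checks whether a job's per-node memory request fits under 27160 MB. It assumes the per-node request is simply the number of tasks per node times the memory per task; the function name and the example values are hypothetical and are not part of LoadLeveler.

```python
# Hypothetical helper, not a LoadLeveler API: checks a job's per-node
# memory request against the limit quoted above.

NODE_LIMIT_MB = 27160  # per-node limit enforced in LoadLeveler (from the text)


def fits_on_node(tasks_per_node: int, memory_per_task_mb: int) -> bool:
    """Return True if the total per-node request does not exceed the limit.

    Assumption: the per-node request is tasks_per_node * memory_per_task_mb.
    """
    return tasks_per_node * memory_per_task_mb <= NODE_LIMIT_MB


# Largest whole-MB per-task share for a fully packed node of 8 tasks.
max_per_task_8 = NODE_LIMIT_MB // 8

print(max_per_task_8)
print(fits_on_node(8, max_per_task_8))
print(fits_on_node(4, 7000))  # 4 * 7000 = 28000 MB, over the limit
```

Under this assumption, a job running fewer than 8 tasks per node can request more memory per task, as long as the per-node total stays within the limit.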
September 7, 2006: LDR_CNTRL environment variable set to "LARGE_PAGE_DATA=Y"
The change was made in the default user dot files for jobs run under LoadLeveler. This has the same effect as enabling a binary with the -blpdata loader flag. This was set to "force" jobs into using large page memory and thus avoid page space kills due to oversubscription of the limited (~3 GB) small-page memory available on the compute nodes.
August 16, 2006: MP_SINGLE_THREAD change in default environment
The environment variable MP_SINGLE_THREAD is no longer defined in the default NERSC environment. Based on our experience and consultation with IBM, it was determined that the performance gain for single-threaded MPI programs is typically small, especially compared to the potential problems associated with setting MP_SINGLE_THREAD=yes when running a correctly thread-safe user code that happens to have two or more threads making MPI calls. As a result of this change, OpenMP jobs no longer need to explicitly unset MP_SINGLE_THREAD. If your job is very sensitive to MPI latency and you know it is single-threaded, setting MP_SINGLE_THREAD=yes will give better performance.
August 2, 2006: PE and LoadLeveler updates
The Parallel Environment is updated from version 4.2.2.2 to 4.2.2.4. LoadLeveler is updated from 3.3.0.5 to 3.3.2.4. With this upgrade, POWER 5 affinity options for batch jobs must be specified using keywords in the LoadLeveler script. The POE environment variables MP_TASK_AFFINITY and MEMORY_AFFINITY are ignored in batch jobs. See Runtime Configuration and Options.
May 24, 2006
Bassi is running AIX 5.3 and performance across the entire system is believed to be comparable to that under AIX 5.2. Following the May 10 AIX 5.3 upgrade, some nodes had to be remigrated, others had incorrect large-page memory configurations, GPFS was misconfigured, and a bad node was identified and removed.
May 10, 2006
Bassi's operating system was migrated from AIX 5.2 to AIX 5.3.
April 26, 2006
Dedicated system time was taken to evaluate AIX 5.3 on 12 nodes. During the outage a number of security patches were applied to the production system. Acceptable benchmark performance was attained on the 12 AIX 5.3 nodes, but a problem with indexing authentication database files was found.
March 31, 2006
The wallclock limit for the low class was increased from 6 to 12 hours.
March 29, 2006
Downtime was taken to load and test AIX 5.3 and Parallel Environment 4.2.2.2 on 10 Bassi nodes. The performance of NERSC benchmarks on the 10 AIX 5.3 nodes was unacceptable, and the machine was returned to service with all nodes running AIX 5.2 and POE 4.2.0.3.
March 23, 2006
Wallclock limit for reg_1 class (1-15 nodes) was increased from 12 to 24 hours upon NUG recommendation.
March 1, 2006
The entire system is rebooted after network performance degradation is confirmed as a result of a firmware upgrade that was installed on Feb. 10. The reboot restores network performance. In an attempt to boot 12 nodes to AIX 5.3/PE 4.2.2.2 for testing, GPFS becomes unavailable and AIX 5.3 testing is aborted.
February 23, 2006
NERSC again attempts to update AIX and PE. One frame of 12 nodes is migrated to AIX 5.3 and PE 4.2.2.2. Benchmark performance on the migrated nodes is unacceptably poor and the attempt is aborted.
February 17, 2006
NERSC's two-node test/development p575 system, which had already been successfully migrated to AIX 5.3, is upgraded to PE 4.2.2.2. The upgrade is successful and benchmark performance is acceptable.
February 10, 2006
During a site-wide outage, NERSC attempts to migrate the system from AIX 5.2 to AIX 5.3 and to upgrade LoadLeveler and the IBM Parallel Environment. The migration scripts fail and the system remains at AIX 5.2.
January 20, 2006
The wallclock limit for the regular classes was increased from 8 hours to 12 hours per advice from NUG.
January 9, 2006
Bassi goes into full production. The charge factor, relative to a Seaborg "SP Hour," is set at 6, based on the performance of HPC benchmarks and reports from early users.
December 15, 2005
Bassi is accepted by NERSC from IBM. The system passes NERSC's requirements by being available more than 99 percent of the time, with 86 percent utilization, during the availability test period.
December 6, 2005
The NERSC INCITE project, "Direct Numerical Simulation of Turbulent Non-premixed Combustion - Fundamental Insights towards Predictive Modeling," finishes a month of running almost continuously using 512 Bassi processors. The runs resulted in the first three-dimensional Direct Numerical Simulation (DNS) of a turbulent nonpremixed H2/CO/N2-air flame with detailed chemistry.
November 30 to December 7, 2005
System software is updated to LoadLeveler 3.3.1.1 and Parallel Environment (PE) 4.2.2.1 on Nov. 30. Performance for some codes decreases by up to a factor of 4. The software levels are reinstated at 3.3.0.4 and 4.2.0.3 on Dec. 7.
November 9, 2005
Bassi begins its system availability period. The system has to meet strict performance and stability requirements under an HPC workload. The first users from the NERSC community are given access.
October 14, 2005
Acceptance period begins. The system is tested and tuned and is required to pass strict performance and functionality tests.
October 7, 2005
After NERSC discovers that alternating nodes in a frame have memory bandwidth that differs by about 10%, IBM installs a software efix that boosts performance on the poorly performing nodes.
July 11, 2005
System delivery starts. The system was integrated on-site at NERSC's Oakland Scientific Facility.

Page last modified: Thu, 11 Sep 2008 19:56:32 GMT
Page URL: http://www.nersc.gov/nusers/systems/bassi/timeline.php
Web contact: webmaster@nersc.gov
Computing questions: consult@nersc.gov
