|
Bassi Timeline
A brief timeline of significant events on Bassi.
- September 9, 2008: GPFS Upgrade
-
GPFS software was upgraded to version 3.2.1.4.
- August 14, 2008: AIX Upgrade
-
AIX 5.3 was updated to version 5.3 TL8.
- May 29, 2008: LoadLeveler upgraded
-
LoadLeveler was upgraded to version 3.4.2.0.
- May 5, 2008: XLF updated to version 11.1.0.1
-
The default version of the IBM XLF Compiler was updated to version 11.1.0.1.
- January 23, 2008: Default compilers upgraded
-
The default compilers were upgraded to XL Fortran 10.1.0.5 and VAC 8.0.0.16,
from 9.1.0.8 and 7.0.0.0 respectively.
- August 22 to November 1, 2007: Degraded performance
-
An unneeded system daemon, which woke up once a minute, slowed
run times on the Bassi SSP suite of benchmark codes by about
5 percent. A bug in an AIX script was identified and fixed,
which restored performance.
- September 12, 2007: Parallel Environment and LoadLeveler updates
-
The Parallel Environment is updated from 4.3.0.3 to 4.3.1.3 and
LoadLeveler is updated from 3.4.0.0 to 3.4.1.2.
- August 22, 2007: AIX and GPFS Updates
- AIX was updated to version AIX 5.3 TL6 SP3. GPFS was updated to version 3.1.0.13.
- July 11, 2007: Large page memory reduced to 20 GB per node
- The large page memory pool on Bassi compute nodes was reduced to
20 GB per node, from 24 GB per node. This frees 4 GB per node to
be used by the application stack, which must reside in small pages.
- April 12, 2007: Checkpoint/restart enabled by default
-
Checkpoint/restart is enabled by default (see checkpoint and restart considerations).
- April 10, 2007: Parallel Environment and LoadLeveler updates
-
The Parallel Environment is updated from 4.2.2.4 to 4.3.0.3 and
LoadLeveler is updated from 3.3.2.4 to 3.4.0.0.
- March 5, 2007: Regular wallclock run limits increased
- The wallclock limit for the reg_1 class was increased from
24 to 36 hours, and from 12 to 18 hours for the reg_16 and reg_32
classes.
- February 28, 2007: Firmware upgrade
- A major firmware upgrade of "power code" and "global firmware" was installed.
Due to the upgrade, the memory available to user code was
reduced by 128 MB per node. The value of the ConsumableMemory
resource was reduced to 27032 MB (26.40 GB) per node, or 3379 MB
(3.30 GB) per task on a node running 8 MPI tasks.
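The figures in this entry can be checked with a short calculation. The sketch below is illustrative only: the variable names and the `mb_to_gb` helper are hypothetical, and it assumes 1 GB = 1024 MB, which is the convention that reproduces the rounded values quoted above.

```shell
# Values taken from the February 28, 2007 entry.
node_mb=27032       # ConsumableMemory per node after the firmware upgrade
tasks_per_node=8    # MPI tasks per node

# Integer division is exact here: 27032 / 8 = 3379.
per_task_mb=$((node_mb / tasks_per_node))

# Hypothetical helper: convert MB to GB (1 GB = 1024 MB), two decimals.
mb_to_gb() {
    awk -v mb="$1" 'BEGIN { printf "%.2f", mb / 1024 }'
}

echo "per node: ${node_mb} MB ($(mb_to_gb "$node_mb") GB)"
echo "per task: ${per_task_mb} MB ($(mb_to_gb "$per_task_mb") GB)"
```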
- January 18, 2007: Reserved interactive and debug nodes
- Four nodes are reserved for debug and interactive jobs from
05:00 to 18:00 Pacific Time beginning Jan. 18, 2007.
- January 8, 2007: GPFS 3.1 upgrade
- December 7, 2006: GPFS maintenance software installed
- November 15, 2006: AIX upgrade
- The operating system was upgraded to AIX 5.3 TL5 SP3. This upgrade fixed the degraded performance issue that was observed after the September 11 changes.
- September 11, 2006: Job launch failures and paging space kills
-
Two significant changes were made to address recent problems:
- Job launch failures
- Jobs exceeding memory paging space
Job launch failures
Because so many jobs were failing at startup, we have turned off password file indexing on Bassi.
A bug in AIX causes index file corruption, which causes jobs to fail. With indexing turned off,
we do not expect any more job launch failures, but we have observed relatively large runtime performance variation.
Jobs exceeding memory paging space
We are making two changes to avoid node failures due to memory oversubscription. Most MPI
jobs will be unaffected, but if you are running threaded applications, or jobs that run with fewer than
8 tasks per node, you may need to modify your batch scripts.
-
Memory resource limits are being enforced in LoadLeveler. Jobs that request more than 27160 MB (26.52 GB) of memory
per node will not run. Default values are in place so that job scripts that request 8 tasks per node do not need any
modification. Jobs will be killed if they try to access more than 27160 MB of memory per node.
Details are available on the Running Jobs pages.
-
The default NERSC shell initialization files (dot-files) have been modified to set an
environment variable that starts all parallel jobs in large-page memory. The setting is
export LDR_CNTRL=LARGE_PAGE_DATA=Y for sh,ksh,bash
and
setenv LDR_CNTRL "LARGE_PAGE_DATA=Y" for csh,tcsh
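For jobs that run fewer than 8 tasks per node, it may help to check a planned per-task memory request against the 27160 MB per-node limit before submitting. The following is a minimal sketch, not a NERSC tool: `fits_on_node` is a hypothetical helper, and it assumes that the per-node total is simply the number of tasks per node multiplied by the per-task request.

```shell
# Per-node limit enforced by LoadLeveler, from the September 11, 2006 entry.
NODE_LIMIT_MB=27160

# Hypothetical helper: succeeds (exit status 0) when
# tasks_per_node * mb_per_task does not exceed the per-node limit.
# Usage: fits_on_node <tasks_per_node> <mb_per_task>
fits_on_node() {
    [ $(($1 * $2)) -le "$NODE_LIMIT_MB" ]
}

# 8 tasks at 3395 MB each is exactly 27160 MB, so this request fits.
if fits_on_node 8 3395; then echo "8 x 3395 MB: fits"; fi

# 4 tasks at 7000 MB each is 28000 MB, which exceeds the limit.
if ! fits_on_node 4 7000; then echo "4 x 7000 MB: too large"; fi
```

The actual LoadLeveler keywords for requesting memory are described on the Running Jobs pages and are deliberately not reproduced here.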
- September 7, 2006: LDR_CNTRL environment variable set to "LARGE_PAGE_DATA=Y"
-
The change was made in the default user dot files for jobs run under LoadLeveler. This has the same
effect as enabling a binary with the -blpdata loader flag. This was set
to "force" jobs into using large page memory and thus avoid page space
kills due to oversubscription of the limited (~3 GB) small-page memory
available on the compute nodes.
- August 16, 2006: MP_SINGLE_THREAD change in default environment
-
The environment variable MP_SINGLE_THREAD is no longer defined in the
default NERSC environment. Based on our experience and consultation with
IBM, it was determined that the performance gain for single-threaded MPI
programs is typically small, especially compared to the potential problems
associated with setting MP_SINGLE_THREAD=yes when running a correctly
thread-safe user code that happens to have two or more threads making MPI
calls. As a result of this change, OpenMP jobs no longer need to explicitly
unset MP_SINGLE_THREAD.
If your job is very sensitive to MPI latency and you know it is single-threaded,
setting MP_SINGLE_THREAD=yes will give better performance.
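As an illustration of opting back in, a job script for sh, ksh, or bash could set the variable only when the job is known to be single-threaded. This is a sketch under stated assumptions: `configure_single_thread` is a hypothetical helper, and using the standard OpenMP variable OMP_NUM_THREADS as the thread count is only a heuristic, since a code can create threads by other means.

```shell
# Hypothetical helper: enable MP_SINGLE_THREAD only for a thread count of 1;
# otherwise make sure it is not defined.
# Usage: configure_single_thread <threads_per_task>
configure_single_thread() {
    if [ "${1:-1}" -eq 1 ]; then
        export MP_SINGLE_THREAD=yes
    else
        unset MP_SINGLE_THREAD
    fi
}

# Assumption: OMP_NUM_THREADS, when set, reflects the threads per MPI task.
configure_single_thread "${OMP_NUM_THREADS:-1}"
```

Only the person running the code can confirm that it never makes MPI calls from more than one thread, so the guard above is a convenience rather than a guarantee.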
- August 2, 2006: PE and LoadLeveler updates
-
The Parallel Environment is updated from version 4.2.2.2 to 4.2.2.4.
LoadLeveler is updated from 3.3.0.5 to 3.3.2.4.
With this upgrade, POWER 5 affinity options for batch jobs
must be specified using keywords in the LoadLeveler script.
POE environment variables MP_TASK_AFFINITY and MEMORY_AFFINITY
are ignored in batch jobs. See Runtime Configuration and Options.
- May 24, 2006
-
Bassi is running AIX 5.3 and performance across the entire
system is believed to be comparable to that under AIX 5.2.
Following the May 10 AIX 5.3 upgrade some nodes had to
be remigrated, others had incorrect large-page memory configurations,
GPFS was misconfigured, and a bad node was identified and removed.
- May 10, 2006
-
Bassi's operating system was migrated from AIX 5.2 to AIX 5.3.
- April 26, 2006
-
Dedicated system time was taken to evaluate AIX 5.3 on 12
nodes. During the outage a number of security patches were
applied to the production system. Acceptable benchmark
performance was attained on the 12 AIX 5.3 nodes, but
a problem with indexing authentication database files
was found.
- March 31, 2006
-
The wallclock limit for the low class was increased
from 6 to 12 hours.
- March 29, 2006
-
Downtime was taken to load and test AIX 5.3 and Parallel Environment
4.2.2.2 on 10 Bassi nodes.
The performance of NERSC benchmarks on the 10 5.3 nodes was
unacceptable and the machine was returned to service with all
nodes running AIX 5.2 and POE 4.2.0.3.
- March 23, 2006
-
The wallclock limit for the reg_1 class (1-15 nodes) was increased
from 12 to 24 hours upon NUG recommendation.
- March 1, 2006
-
The entire system is rebooted after network performance
degradation is confirmed as a result of a firmware upgrade
that was installed on Feb. 10. The reboot restores
network performance. In an attempt to boot 12 nodes to
AIX 5.3/PE 4.2.2.2 for testing, GPFS becomes unavailable
and AIX 5.3 testing is aborted.
- February 23, 2006
-
NERSC again attempts to update AIX and PE. One frame
of 12 nodes is migrated to AIX 5.3 and PE 4.2.2.2.
Benchmark performance on the migrated nodes is
unacceptably poor and the attempt is aborted.
- February 17, 2006
-
NERSC's two-node test/development p575 system, which had
already been successfully migrated to AIX 5.3,
is upgraded to PE 4.2.2.2. The upgrade is successful and
benchmark performance is acceptable.
- February 10, 2006
-
During a site-wide outage, NERSC attempts to migrate
the system from AIX 5.2 to AIX 5.3 and to upgrade
LoadLeveler and the IBM Parallel Environment. The migration
scripts fail and the system remains at AIX 5.2.
- January 20, 2006
-
The wallclock limit for the regular classes was increased from
8 hours to 12 hours per advice from NUG.
- January 9, 2006
-
Bassi goes into full production. The charge factor,
relative to a Seaborg "SP Hour," is set at 6, based
on the performance of HPC benchmarks and reports from
early users.
- December 15, 2005
-
Bassi is accepted by NERSC from IBM. The system passes
NERSC's requirements by being available for more than
99 percent of the time with 86 percent utilization during
the availability test period.
- December 6, 2005
-
The NERSC INCITE project,
"Direct Numerical Simulation of Turbulent Non-premixed
Combustion - Fundamental Insights towards Predictive Modeling,"
finishes a month of running almost continuously using
512 Bassi processors. The runs resulted in the first
three-dimensional Direct Numerical Simulation (DNS) of a
turbulent nonpremixed H2/CO/N2-air flame
with detailed chemistry.
- November 30 to December 7, 2005
-
System software is updated to LoadLeveler 3.3.1.1 and
Parallel Environment (PE) 4.2.2.1 on Nov. 30. Performance for
some codes decreases by up to a factor of 4. The
software levels are reinstated at 3.3.0.4 and 4.2.0.3 on
Dec. 7.
- November 9, 2005
-
Bassi begins system availability period. The system has to
meet strict performance and stability requirements under an
HPC workload. First users from the NERSC community are
given access.
- October 14, 2005
-
Acceptance period begins. The system is tested and tuned
and is required to pass strict performance and functionality tests.
- October 7, 2005
-
After NERSC discovers that alternating nodes in a frame
have memory bandwidth that differs by about 10%, IBM
installs a software efix that boosts performance on the
poorly performing nodes.
- July 11, 2005
-
System delivery starts. The system was integrated on-site
at NERSC's Oakland Scientific Facility.
|