Latest Changes on Discover
Important Note on MPI Usage for Expanded Discover Cluster (09/04/2008)
The additional 2048 CPUs recently added to Discover will be available to the general user community no later than Thursday 4 Sep at noon. The new cluster nodes have two quad-core "Harpertown" chips (8 CPUs per node) and 16 GB of memory (2 GB per core), in contrast to the original nodes with two dual-core "Dempsey" or "Woodcrest" chips (4 CPUs per node) and 4 GB of memory per node (1 GB per core).
The compilation, job submission, and execution environments will generally be the same as they are currently with a few important exceptions:
- Scali MPI will be available only on the original non-IBM "Dempsey" and "Woodcrest" nodes. It will NOT be available on the new "Harpertown" nodes.
- If your jobs require Scali MPI (as most current jobs do), you MUST add the specification "scali=true" in the PBS select statement to ensure the job is scheduled onto Scali-compatible nodes. For example, -l select=4:ncpus=4:scali=true selects 4 nodes with 4 CPUs each (16 total) and Scali compatibility.
- Currently, PBS will schedule jobs onto the original Woodcrest and Dempsey nodes before the newer Harpertown nodes so that codes compiled with Scali MPI will be less likely to be scheduled to Harpertown nodes and fail due to lack of access to Scali MPI on those nodes. This scheduling order will change in the future so that jobs will be more likely to take advantage of the new hardware. This delay will give users time to either add the "scali=true" parameter to their jobs or to convert to another MPI that is available on the entire cluster.
- Please begin to migrate your jobs away from Scali MPI in favor of OpenMPI or Intel MPI, which are both available across the entire Discover cluster. Please see https://modelingguru.nasa.gov/clearspace/message/6194#6194 and https://modelingguru.nasa.gov/clearspace/docs/DOC-1571 for details on using Intel MPI. Please be sure to modify your "module load" statements accordingly (in your dot files as well as in your job scripts).
- The NCCS currently recommends that users recompile their codes using Intel MPI or OpenMPI and use "select=X:ncpus=Y" to request X number of Y core nodes in a more generic fashion. Users may continue to specify their PBS select statements as they do now, but please remember that if you don't add "scali=true" for any jobs that require Scali MPI, it is possible that PBS will schedule the job onto nodes without Scali MPI and the job will fail.
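As an illustration of the recommendations above, a PBS job script that keeps a Scali MPI code on Scali-compatible nodes might look like the sketch below. Only the select syntax and the "scali=true" flag come from this announcement; the walltime, executable name, and mpirun invocation are illustrative assumptions.

```shell
#!/bin/bash
# Hypothetical PBS job script for a Scali MPI code on Discover.
# The walltime, executable name, and mpirun line are illustrative
# assumptions, not prescribed by this announcement.

# Request 4 Scali-compatible nodes with 4 CPUs each (16 CPUs total);
# "scali=true" keeps the job off the Harpertown nodes, which lack Scali MPI.
#PBS -l select=4:ncpus=4:scali=true
#PBS -l walltime=01:00:00

module load mpi/scali-5       # load Scali MPI (check "module avail" for the name)

cd $PBS_O_WORKDIR
mpirun -np 16 ./my_mpi_app    # hypothetical executable
```

Codes recompiled with Intel MPI or OpenMPI can drop "scali=true" and use a generic select=X:ncpus=Y request, which lets PBS place the job anywhere on the cluster, including the new Harpertown nodes.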
On Thursday, July 10 2008, the NCCS will be upgrading the operating
system on the Discover cluster from SLES-9 to SLES-10.
While the two OS's are binary compatible and recompiling is not
absolutely required, it is strongly recommended that users
recompile their applications to ensure that they have been
built against the latest versions of various system libraries.
Below are a few things that users should be aware of regarding
the upgrade to SLES-10:
- Scali MPI has been upgraded from version 5.3 to version 5.6.
The module for Scali MPI has been renamed from "mpi/scali-5.3"
to "mpi/scali-5". Users will need to change their "module load"
commands to load "mpi/scali-5" instead of "mpi/scali-5.3".
- "ssh totalview" and "ssh idl" will no longer be supported.
Any users who have used this method to establish X-forwarding
for PBS jobs should use "xsub" instead. "xsub" accepts all
the same arguments as qsub, and establishes the necessary
X-forwarding for you.
- Module changes (Compilers, Math Kernel Libraries, etc.)
- The following modules that are currently available under SLES-9
will not be available under SLES-10. (Please contact user support
if you feel you have a continuing need for any of these items.)
comp/gcc-3.3.6
comp/intel-8.1.034
comp/intel-8.1.038
comp/intel-9.1.038
comp/intel-9.1.039
comp/intel-9.1.042
comp/intel-9.1.046
comp/intel-9.1.049
comp/intel-10.0.023
comp/intel-10.0.025
comp/intel-10.1.013
comp/intel-10.1.015
comp/nag-5.1
comp/pgi-6.1.6
comp/pgi-6.2.4
lib/mkl-10.0.2.018
lib/mkl-8.1
lib/mkl-9.0.017
lib/mkl-9.1.018
lib/mkl-9.1.021
tool/tview-8.0.0.0
tool/tview-8.1.0.1
mpi/scali-5.3
- The following modules will be available under SLES-10
comp/gcc-4.1.2 (natively available without a "module load")
comp/intel-9.1.052
comp/intel-10.1.017
comp/nag-5.1-463
comp/pgi-7.1.6
comp/pgi-7.2.1
lib/mkl-9.1.023
lib/mkl-10.0.3.020
tool/tview-8.2.0.1
mpi/scali-5
- Additional modules for other software will be listed under
"other/". These modules will be made available as the
packages are rebuilt.
- Process limits
- Default process limits (data size and stack size) will be set to
a maximum safe limit based on physical memory on the various nodes.
Under most circumstances, users should not need to change these
settings. Please contact user support if you have questions or
concerns about this.
- Some of the additional software that currently resides in
/usr/local is now included as part of SLES-10 and will no
longer be maintained in /usr/local.
- The software packages in /usr/local are being rebuilt under
SLES-10 and some may not be available initially. Please report
any problems you find or anything that appears to be missing.
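Putting the SLES-10 changes above together, a typical post-upgrade session might look like the sketch below. The module name and the xsub behavior are taken from this announcement; the resource request and the limit checks are illustrative assumptions.

```shell
# Load the renamed Scali module (it was "mpi/scali-5.3" under SLES-9).
module load mpi/scali-5

# "ssh totalview" / "ssh idl" are no longer supported; use xsub instead,
# which accepts the same arguments as qsub and sets up the necessary
# X-forwarding for you. (The resource request here is a hypothetical example.)
xsub -I -l select=1:ncpus=4 -l walltime=01:00:00

# Inspect the new default process limits (data size and stack size);
# values are reported in kilobytes, or "unlimited".
ulimit -d
ulimit -s
```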
On Wednesday Feb 20th, the NCCS will be making the following changes
to the discover system (downtime notice to follow):
- Multiple login nodes will be used for interactive access.
- Users will still connect and request services the same way they
do now. The difference is that they will be placed on the login
nodes discover05, 06, 07, or 08 (in a round-robin fashion)
instead of discover01. More login nodes may be added in the future.
- Any user scripts that connect from a discover login node to another
remote system may fail if the remote system does not allow all the
discover login nodes access. Please have your system administrators
contact the NCCS User Services Group for node address information
if required.
- Cron jobs will be run and managed from a single dedicated cron node so
they don't impact interactive processes.
- Once on discover, users may access and manage their cron jobs by
connecting to discover-cron. The new login
nodes will deny user cron activity. All existing cron entries will
be relocated to discover-cron.
- System-wide process virtual memory limits will be put in place.
- In order to limit the impact from processes that exceed a node's
memory resources, we will be setting virtual memory limits globally
on discover.
This means that any single process on discover that reaches 6GB of
virtual memory will be terminated. We have found that processes
that reach 6GB of virtual memory will continue growing until they
exceed the node's memory resources. This causes the node(s) to hang,
and the filesystem daemon is frequently killed. Users may see a runtime
library error if their processes exceed the 6GB virtual memory limit.
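Under the new arrangement, cron entries would be managed from the dedicated node, and the per-process limit can be inspected with ulimit; the session below is a hypothetical sketch based on the changes described above.

```shell
# From a discover login node, connect to the dedicated cron node;
# the regular login nodes will deny user cron activity.
ssh discover-cron
crontab -l    # list your relocated cron entries
crontab -e    # edit them as usual

# Check the virtual memory limit your shell (and its children) run under;
# on discover this would reflect the new 6GB cap. Reported in kilobytes.
ulimit -v
```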