NASA Center for Computational Sciences
Latest Changes on Discover


Important Note on MPI Usage for Expanded Discover Cluster (09/04/2008)

The additional 2048 CPUs recently added to Discover will be available to the general user community no later than Thursday, 4 Sep, at noon. The new cluster nodes have two quad-core "Harpertown" chips (8 CPUs per node) and 16 GB of memory (2 GB per core), in contrast to the original nodes with two dual-core "Dempsey" or "Woodcrest" chips (4 CPUs per node) and 4 GB of memory per node (1 GB per core).

The compilation, job submission, and execution environments will generally remain the same as they are now, with a few important exceptions:

  1. Scali MPI will be available only on the original non-IBM "Dempsey" and "Woodcrest" nodes. It will NOT be available on the new "Harpertown" nodes.

  2. If your jobs require Scali MPI (as most current jobs do), you MUST add the specification "scali=true" in the PBS select statement to ensure the job is scheduled onto Scali-compatible nodes. For example, -l select=4:ncpus=4:scali=true selects 4 nodes with 4 CPUs each (16 total) and Scali compatibility.

  3. For now, PBS will schedule jobs onto the original Woodcrest and Dempsey nodes before the newer Harpertown nodes, so codes compiled with Scali MPI are less likely to land on Harpertown nodes and fail for lack of Scali MPI there. This scheduling order will change in the future so that jobs are more likely to take advantage of the new hardware. The delay gives users time either to add the "scali=true" parameter to their jobs or to convert to another MPI that is available on the entire cluster.

  4. Please begin migrating your jobs away from Scali MPI in favor of OpenMPI or Intel MPI, both of which are available across the entire Discover cluster. Please see https://modelingguru.nasa.gov/clearspace/message/6194#6194 and https://modelingguru.nasa.gov/clearspace/docs/DOC-1571 for details on using Intel MPI. Be sure to modify your "module load" statements accordingly (in your '.' dot files as well as in your job scripts).

  5. The NCCS currently recommends that users recompile their codes with Intel MPI or OpenMPI and use "select=X:ncpus=Y" to request X nodes with Y cores each in a more generic fashion. Users may continue to specify their PBS select statements as they do now, but remember that if you do not add "scali=true" to a job that requires Scali MPI, PBS may schedule the job onto nodes without Scali MPI and the job will fail. A short job-script sketch illustrating both forms follows this list.
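
The fragment below is a minimal job-script sketch of the two styles described in items 2 and 5. It is illustrative only: the executable name, walltime, and node counts are placeholders, the launcher is shown generically as mpirun, and the "mpi/impi" module name is hypothetical (see the Modeling Guru links above for the MPI module actually installed on Discover).

    # Job that still requires Scali MPI: request Scali-compatible nodes.
    #PBS -l select=4:ncpus=4:scali=true
    #PBS -l walltime=1:00:00
    module load comp/intel-10.1.017 mpi/scali-5
    mpirun -np 16 ./my_app        # "my_app" is a placeholder executable

    # More generic form once the code is rebuilt with Intel MPI or OpenMPI;
    # "mpi/impi" stands in for whichever MPI module you actually load.
    #PBS -l select=2:ncpus=8
    #PBS -l walltime=1:00:00
    module load comp/intel-10.1.017 mpi/impi
    mpirun -np 16 ./my_app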


On Thursday, July 10, 2008, the NCCS will be upgrading the operating system on the Discover cluster from SLES-9 to SLES-10.

While the two operating systems are binary compatible and recompiling is not strictly required, users are strongly encouraged to recompile their applications to ensure they are built against the latest versions of the system libraries.

Below are a few things users should be aware of regarding the upgrade to SLES-10:

  • Scali MPI has been upgraded from version 5.3 to version 5.6. The module for Scali MPI has been renamed from "scali-5.3" to "scali-5", so users will need to change their "module load" commands to load "mpi/scali-5" instead of "mpi/scali-5.3" (see the example after this list).

  • "ssh totalview" and "ssh idl" will no longer be supported. Any users that have used this method to establish X-forwarding for PBS jobs should use "xsub" instead. "xsub" accepts all the same arguments as qsub, and establishes the necessary X-forwarding for you.

  • Module changes (Compilers, Math Kernel Libraries, etc.)

  • The following modules that are currently available under SLES-9 will not be available under SLES-10. (Please contact user support if you feel you have a continuing need for any of these items.)

    comp/gcc-3.3.6 comp/intel-8.1.034 comp/intel-8.1.038
    comp/intel-9.1.038 comp/intel-9.1.039 comp/intel-9.1.042
    comp/intel-9.1.046 comp/intel-9.1.049 comp/intel-10.0.023
    comp/intel-10.0.025 comp/intel-10.1.013 comp/intel-10.1.015
    comp/nag-5.1 comp/pgi-6.1.6 comp/pgi-6.2.4
    lib/mkl-10.0.2.018 lib/mkl-8.1 lib/mkl-9.0.017
    lib/mkl-9.1.018 lib/mkl-9.1.021 tool/tview-8.0.0.0
    tool/tview-8.1.0.1 mpi/scali-5.3

  • The following modules will be available under SLES-10

    comp/gcc-4.1.2 (natively available without a "module load")
    comp/intel-9.1.052
    comp/intel-10.1.017 comp/nag-5.1-463
    comp/pgi-7.1.6 comp/pgi-7.2.1
    lib/mkl-9.1.023 lib/mkl-10.0.3.020
    tool/tview-8.2.0.1 mpi/scali-5

  • Additional modules for other software will be listed under "other/" and will appear as those packages are rebuilt.

  • Process limits

    • Default process limits (data size and stack size) will be set to a maximum safe limit based on the physical memory of the various nodes. Under most circumstances, users should not need to change these settings. Please contact user support if you have questions or concerns.

  • Some of the additional software that currently resides in /usr/local is now included as part of SLES-10 and will no longer be maintained in /usr/local.

  • The software packages in /usr/local are being rebuilt under SLES-10, and some may not be available initially. Please report any problems you find or anything that appears to be missing.
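
As a rough illustration of the module rename and the "xsub" change noted above, the lines below show a before/after pair of "module load" commands (compiler versions taken from the lists above) and a sample interactive submission. The select and walltime values are placeholders, and the interactive (-I) form is just one way to use xsub; since xsub takes the same arguments as qsub, batch submissions look the same as they do today.

    # SLES-9 job script or dot file:
    module load comp/intel-9.1.049 mpi/scali-5.3

    # Equivalent line after the SLES-10 upgrade:
    module load comp/intel-9.1.052 mpi/scali-5

    # X-forwarding for a PBS job: use xsub in place of qsub
    # (xsub accepts the same arguments as qsub):
    xsub -I -V -l select=1:ncpus=4 -l walltime=1:00:00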

On Wednesday, February 20th, the NCCS will be making the following changes to the Discover system (a downtime notice will follow):

  • Multiple login nodes will be used for interactive access.

    • Users will still connect and request service the same way they do now. The difference is that they will be placed on login nodes discover05, 06, 07, or 08 (in a round-robin fashion) instead of discover01. More login nodes may be added in the future.

    • Any user scripts that connect from a discover login node to another remote system may fail if the remote system does not allow all the discover login nodes access. Please have your system administrators contact the NCCS User Services Group for node address information if required.
  • CRON jobs will be run and managed from a single dedicated cron node so they don't impact interactive processes.

    • Once on discover, users may access and manage their cron jobs by connecting to discover-cron. The new login nodes will deny user cron activity. All existing cron entries will be relocated to discover-cron (see the example commands after this list).
  • System-wide process virtual memory limits will be put in place.

    • To limit the impact of processes that exceed a node's memory resources, we will be setting virtual memory limits globally on discover.

This means that any single process on discover that reaches 6 GB of virtual memory will be terminated. We have found that processes that reach 6 GB of virtual memory tend to keep growing until they exceed the node's memory resources, which causes the node(s) to hang and frequently kills the filesystem daemon. Users may see a runtime library error if their processes exceed the 6 GB virtual memory limit.
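
The commands below sketch how these two changes look from the command line. The discover-cron hostname comes from the announcement above; crontab and ulimit are standard utilities, and the exact value ulimit reports will depend on how the limit is set on a given node.

    # Manage cron entries from the dedicated cron node:
    ssh discover-cron      # from a discover login node
    crontab -l             # review the entries relocated to this node
    crontab -e             # edit them as usual

    # Check the per-process virtual memory limit on a node
    # (reported in kilobytes; 6 GB is roughly 6291456 KB):
    ulimit -v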

