Latest Changes on Discover
Important Note on MPI Usage for Expanded Discover Cluster (09/04/2008)
The additional 2048 CPUs recently added to Discover will be available to the general user community no later than Thursday 4 Sep at noon. The new cluster nodes have two quad-core "Harpertown" chips (8 CPUs per node) and 16 GB of memory (2 GB per core), in contrast to the original nodes with two dual-core "Dempsey" or "Woodcrest" chips (4 CPUs per node) and 4 GB of memory per node (1 GB per core).
The compilation, job submission, and execution environments will generally be the same as they are currently with a few important exceptions:
- Scali MPI will be available only on the original non-IBM "Dempsey" and "Woodcrest" nodes. It will NOT be available on the new "Harpertown" nodes.
- If your jobs require Scali MPI (as most current jobs do), you MUST add the specification "scali=true" in the PBS select statement to ensure the job is scheduled onto Scali-compatible nodes. For example, -l select=4:ncpus=4:scali=true selects 4 nodes with 4 CPUs each (16 total) and Scali compatibility.
- Currently, PBS will schedule jobs onto the original Woodcrest and Dempsey nodes before the newer Harpertown nodes so that codes compiled with Scali MPI will be less likely to be scheduled to Harpertown nodes and fail due to lack of access to Scali MPI on those nodes. This scheduling order will change in the future so that jobs will be more likely to take advantage of the new hardware. This delay will give users time to either add the "scali=true" parameter to their jobs or to convert to another MPI that is available on the entire cluster.
- Please begin to migrate your jobs away from Scali MPI in favor of OpenMPI or Intel MPI, which are both available across the entire Discover cluster. Please see https://modelingguru.nasa.gov/clearspace/message/6194#6194 and https://modelingguru.nasa.gov/clearspace/docs/DOC-1571 for details on using Intel MPI. Please be sure to modify your "module load" statements accordingly (in your dot files as well as in your job scripts).
- The NCCS currently recommends that users recompile their codes using Intel MPI or OpenMPI and use "select=X:ncpus=Y" to request X number of Y core nodes in a more generic fashion. Users may continue to specify their PBS select statements as they do now, but please remember that if you don't add "scali=true" for any jobs that require Scali MPI, it is possible that PBS will schedule the job onto nodes without Scali MPI and the job will fail.
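As an illustration of the recommendations above, a PBS job script that keeps a Scali MPI code on Scali-compatible nodes might look like the sketch below. Only the select syntax and the "scali=true" flag come from this announcement; the walltime, executable name, and mpirun invocation are illustrative assumptions.

```shell
#!/bin/bash
# Hypothetical PBS job script for a Scali MPI code on Discover.
# The walltime, executable name, and mpirun line are illustrative
# assumptions, not prescribed by this announcement.

# Request 4 Scali-compatible nodes with 4 CPUs each (16 CPUs total);
# "scali=true" keeps the job off the Harpertown nodes, which lack Scali MPI.
#PBS -l select=4:ncpus=4:scali=true
#PBS -l walltime=01:00:00

module load mpi/scali-5       # load Scali MPI (check "module avail" for the name)

cd $PBS_O_WORKDIR
mpirun -np 16 ./my_mpi_app    # hypothetical executable
```

Codes recompiled with Intel MPI or OpenMPI can drop "scali=true" and use a generic select=X:ncpus=Y request, which lets PBS place the job anywhere on the cluster, including the new Harpertown nodes.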
On Thursday, July 10 2008, the NCCS will be upgrading the operating
system on the Discover cluster from SLES-9 to SLES-10.
While the two OS's are binary compatible and recompiling is not
absolutely required, it is strongly recommended that users
recompile their applications to ensure that they have been
built against the latest versions of various system libraries.
Below are a few things that users should be aware of regarding
the upgrade to SLES-10:
- Scali MPI has been upgraded from version 5.3 to version 5.6.
The module for Scali MPI has been renamed from "mpi/scali-5.3"
to "mpi/scali-5". Users will need to change their "module load"
commands to load "mpi/scali-5" instead of "mpi/scali-5.3".
- "ssh totalview" and "ssh idl" will no longer be supported.
Any users who have used this method to establish X-forwarding
for PBS jobs should use "xsub" instead. "xsub" accepts all
the same arguments as qsub, and establishes the necessary
X-forwarding for you.
- Module changes (Compilers, Math Kernel Libraries, etc.)
- The following modules that are currently available under SLES-9
will not be available under SLES-10. (Please contact user support
if you feel you have a continuing need for any of these items.)
comp/gcc-3.3.6
comp/intel-8.1.034
comp/intel-8.1.038
comp/intel-9.1.038
comp/intel-9.1.039
comp/intel-9.1.042
comp/intel-9.1.046
comp/intel-9.1.049
comp/intel-10.0.023
comp/intel-10.0.025
comp/intel-10.1.013
comp/intel-10.1.015
comp/nag-5.1
comp/pgi-6.1.6
comp/pgi-6.2.4
lib/mkl-10.0.2.018
lib/mkl-8.1
lib/mkl-9.0.017
lib/mkl-9.1.018
lib/mkl-9.1.021
tool/tview-8.0.0.0
tool/tview-8.1.0.1
mpi/scali-5.3
- The following modules will be available under SLES-10
comp/gcc-4.1.2 (natively available without a "module load")
comp/intel-9.1.052
comp/intel-10.1.017
comp/nag-5.1-463
comp/pgi-7.1.6
comp/pgi-7.2.1
lib/mkl-9.1.023
lib/mkl-10.0.3.020
tool/tview-8.2.0.1
mpi/scali-5
- Additional modules for other software will be listed under
"other/". These modules will be made available as the
packages are rebuilt.
- Process limits
- Default process limits (data size and stack size) will be set to
a maximum safe limit based on physical memory on the various nodes.
Under most circumstances, users should not need to change these
settings. Please contact user support if you have questions or
concerns about this.
- Some of the additional software that currently resides in
/usr/local is now included as part of SLES-10 and will no
longer be maintained in /usr/local.
- The software packages in /usr/local are being rebuilt under
SLES-10 and some may not be available initially. Please report
any problems you find or anything that appears to be missing.
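Putting the SLES-10 changes above together, a typical post-upgrade session might look like the sketch below. The module name and the xsub behavior are taken from this announcement; the resource request and the limit checks are illustrative assumptions.

```shell
# Load the renamed Scali module (it was "mpi/scali-5.3" under SLES-9).
module load mpi/scali-5

# "ssh totalview" / "ssh idl" are no longer supported; use xsub instead,
# which accepts the same arguments as qsub and sets up the necessary
# X-forwarding for you. (The resource request here is a hypothetical example.)
xsub -I -l select=1:ncpus=4 -l walltime=01:00:00

# Inspect the new default process limits (data size and stack size);
# values are reported in kilobytes, or "unlimited".
ulimit -d
ulimit -s
```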
On Wednesday Feb 20th, the NCCS will be making the following changes
to the discover system (downtime notice to follow):
- Multiple login nodes will be used for interactive access.
- Users will still connect and request services the same way they
do now. The difference is that they will be placed on the login
nodes discover05, 06, 07, or 08 (in a round-robin fashion)
instead of discover01. More login nodes may be added in the future.
- Any user scripts that connect from a discover login node to another
remote system may fail if the remote system does not allow all the
discover login nodes access. Please have your system administrators
contact the NCCS User Services Group for node address information
if required.
- Cron jobs will be run and managed from a single dedicated cron node so
they don't impact interactive processes.
- Once on discover, users may access and manage their cron jobs by
connecting to discover-cron. The new login
nodes will deny user cron activity. All existing cron entries will
be relocated to discover-cron.
- System-wide process virtual memory limits will be put in place.
- In order to limit the impact from processes that exceed a node's
memory resources, we will be setting virtual memory limits globally
on discover.
This means that any single process on discover that reaches 6GB of
virtual memory will be terminated. We have found that processes
that reach 6GB of virtual memory will continue growing until they
exceed the node's memory resources. This causes the node(s) to hang,
and the filesystem daemon is frequently killed. Users may see a runtime
library error if their processes exceed the 6GB virtual memory limit.
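Under the new arrangement, cron entries would be managed from the dedicated node, and the per-process limit can be inspected with ulimit; the session below is a hypothetical sketch based on the changes described above.

```shell
# From a discover login node, connect to the dedicated cron node;
# the regular login nodes will deny user cron activity.
ssh discover-cron
crontab -l    # list your relocated cron entries
crontab -e    # edit them as usual

# Check the virtual memory limit your shell (and its children) run under;
# on discover this would reflect the new 6GB cap. Reported in kilobytes.
ulimit -v
```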