NASA Center for Computational Sciences

halem Overview of System Resources

halem is the NCCS HP/Compaq AlphaServer SC45 (Sierra Cluster) system.

+ Overview of halem
+ Hardware
+ Network connectivity
+ Operating system
+ Filesystem overview
+ Shells on halem
+ Data storage
+ Password policies

Overview of halem

The HP/Compaq AlphaServer SC45 is a very-large-scale, distributed-memory parallel supercomputer engineered for TeraFLOPS performance; memory is shared only among the processors within a node. It uses standard off-the-shelf symmetric multiprocessor (SMP) nodes and connects them through high-speed intelligent interconnects to deliver wire-level performance for the Message Passing Interface (MPI) programming model. The SMP nodes of the SC45 are AlphaServer ES45s.

halem has the following features:

  • The halem system has 348 available nodes (as of 9/25/02).
  • Each ES45 node is built on the 21264 system architecture and incorporates four Alpha EV68 processors.
  • In domains A through E, each EV68 processor, capable of 2 GFLOPS, runs at 1.0 GHz and has 64 KB of L1 cache and 8 MB of high-speed (250 MHz) L2 cache.
  • In domains F through L, the processors run at 1.25 GHz.
  • All nodes have 2 GB of memory, accessed through dual 256-bit-wide, 125-MHz memory buses that yield 8 GB/sec of memory bandwidth (4 GB/sec peak per port).
  • Four banks of 512-MB memory DIMMs are used to achieve 4-way memory interleaving with room for expansion.
  • halem's SMP nodes are connected to a single-rail Quadrics QsNet network switch through an ELAN PCI adaptor (1 process per node) on a 64-bit/66-MHz (528 MB/s) PCI bus, yielding a peak internode bandwidth of about 280 MB/s.
  • halem also has 5.5 TB of disk drives controlled by dual HSG80 RAID controllers with 512 MB of cache.
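
The bandwidth and memory figures in the list above follow from simple bus arithmetic. The sketch below reproduces them, assuming that MB and GB are decimal units and that each bus transfers its full width once per clock cycle; the function name is illustrative only.

```python
# Back-of-the-envelope check of the per-node figures quoted above.
# Assumptions: MB = 10**6 bytes, GB = 10**9 bytes, one transfer per clock.

def bus_bandwidth_bytes_per_s(width_bits, clock_hz):
    """Peak bandwidth of a bus that moves width_bits once per clock cycle."""
    return width_bits / 8 * clock_hz

# Memory: two 256-bit-wide buses at 125 MHz.
memory_per_port = bus_bandwidth_bytes_per_s(256, 125e6)
memory_total = 2 * memory_per_port

# PCI: one 64-bit bus at 66 MHz.
pci_bus = bus_bandwidth_bytes_per_s(64, 66e6)

# Memory size: four banks of 512-MB DIMMs.
memory_size_mb = 4 * 512

print(f"memory per port: {memory_per_port / 1e9:.0f} GB/s")
print(f"memory total:    {memory_total / 1e9:.0f} GB/s")
print(f"PCI bus:         {pci_bus / 1e6:.0f} MB/s")
print(f"memory size:     {memory_size_mb} MB")
```

The quoted internode bandwidth of about 280 MB/s is well below the 528 MB/s PCI peak, so it should be read as a measured or rated figure for the interconnect path rather than something derivable from the bus width.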


Hardware

HP AlphaServer SC system hardware is described in detail at HP AlphaServer SC40/SC45 QuickSpecs. As of September 25, 2002, halem consists of 348 nodes (1392 processors) divided into 12 CFS cluster domains (A through L), with varying numbers of nodes (cluster members) per domain. From a network point of view, each domain is a host:

Domains      A        B        C        D        E
Host         halemA   halemB   halemC   halemD   halemE
Nodes        0-13     14-31    32-63    64-95    96-127
# of Nodes   14       18       32       32       32

Domains      F        G        H        I        J        K        L
Host         halemF   halemG   halemH   halemI   halemJ   halemK   halemL
Nodes        128-159  160-191  192-223  224-255  256-287  288-319  320-347
# of Nodes   32       32       32       32       32       32       28
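
As a cross-check of the domain tables, the following sketch sums the node ranges and estimates an aggregate theoretical peak. It assumes that the 1.25 GHz processors sustain the same two floating-point operations per cycle implied by "2 GFLOPS at 1.0 GHz"; the source does not state a GFLOPS figure for those processors, so the peak value is an estimate only.

```python
# Inclusive node ranges per CFS domain, copied from the tables above.
DOMAIN_NODES = {
    "A": (0, 13), "B": (14, 31), "C": (32, 63), "D": (64, 95),
    "E": (96, 127), "F": (128, 159), "G": (160, 191), "H": (192, 223),
    "I": (224, 255), "J": (256, 287), "K": (288, 319), "L": (320, 347),
}
PROCESSORS_PER_NODE = 4
FLOPS_PER_CYCLE = 2  # implied by 2 GFLOPS at 1.0 GHz; assumed for 1.25 GHz


def node_count(domain):
    """Number of nodes in a domain, from its inclusive node range."""
    first, last = DOMAIN_NODES[domain]
    return last - first + 1


def clock_ghz(domain):
    """Domains A to E run at 1.0 GHz, domains F to L at 1.25 GHz."""
    return 1.0 if domain in "ABCDE" else 1.25


total_nodes = sum(node_count(d) for d in DOMAIN_NODES)
total_processors = total_nodes * PROCESSORS_PER_NODE
peak_gflops = sum(
    node_count(d) * PROCESSORS_PER_NODE * FLOPS_PER_CYCLE * clock_ghz(d)
    for d in DOMAIN_NODES
)

print(total_nodes, total_processors, peak_gflops)
```

The node and processor totals agree with the 348 nodes and 1392 processors stated above.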

RMS partitions are used by LSF to configure hosts and batch jobs. Currently the LSF hosts, which are used for queue designations, are the same as the RMS partitions. For the current LSF host settings, see /usr/share/lsf/conf/hosts. Cluster domains are grouped into RMS partitions as follows (type 'rinfo' to see the current grouping):

RMS Partition   Cluster Domains       Hosts            LSF Hosts
small           A                     halem[0-13]      halema
par_a           B, C, D, E            halem[14-127]    hlm100
par_b           F, G, H, I, J, K, L   halem[128-347]   hlm125
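
The partition table can be read as a lookup from node number to RMS partition and LSF host. A minimal sketch of that lookup follows; the function and variable names are illustrative and are not part of RMS or LSF, and the live configuration (rinfo, /usr/share/lsf/conf/hosts) remains authoritative.

```python
# (RMS partition, first node, last node, LSF host), from the table above.
PARTITIONS = [
    ("small", 0, 13, "halema"),
    ("par_a", 14, 127, "hlm100"),
    ("par_b", 128, 347, "hlm125"),
]


def partition_for_node(node):
    """Return (RMS partition, LSF host) for a halem node number."""
    for name, first, last, lsf_host in PARTITIONS:
        if first <= node <= last:
            return name, lsf_host
    raise ValueError(f"node {node} is outside the range 0-347")


print(partition_for_node(100))
```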


Network connectivity

Tru64 UNIX supports Asynchronous Transfer Mode (ATM), Ethernet, Fast Ethernet, Gigabit Ethernet, Fiber Distributed Data Interface (FDDI), and other network types. Connection to the NCCS mass storage system is through Gigabit Ethernet. Use "dirac" as the hostname for the mass storage system to establish a connection through an available network. sftp and scp are allowed between halem and qualified hostnames only through the Tcpwrapper software.


Operating system

The NCCS HP AlphaServer SC system runs Tru64 UNIX version 5.1B. Earlier versions of Tru64 UNIX were known as DIGITAL UNIX and, before that, as DEC OSF/1. Tru64 UNIX is the HP/Compaq implementation of the Open Software Foundation Version 1.0 and Version 1.2 technology and of the Motif Version 1.2.5 graphical user interface and programming environment. In addition, Tru64 UNIX supports the full features of the X Window System, Version 11, Release 6.3 (X11R6.3). The Tru64 UNIX operating system is a multiuser, multitasking, 64-bit system with an advanced kernel architecture based on Carnegie Mellon University's Mach Version 2.5 kernel design, with components from Berkeley Software Distribution (BSD) Versions 4.3 and 4.4, UNIX System Laboratories System V Release 4.0, other software sources, the public domain, and HP and Compaq Computer Corporation.

The operation of the HP AlphaServer SC system also depends on several layers of resource management software, such as Load Sharing Facility (LSF) by Platform and Resource Management System (RMS) by Quadrics. LSF is batch queue workload management software, and RMS is a set of commands for running parallel programs. The current version of the HP AlphaServer SC system software is SC2.6 Uk1 5.1B+pk4 (as of 05/23/2005). Quadrics Supercomputers World Ltd., which produces the interconnect network switch and RMS for the HP AlphaServer SC, has written a general User Guide for the AlphaServer SC system. Platform also has its own LSF documentation. Note, however, that the description of the job environment in the Quadrics guide does not apply to halem, because all parallel jobs on halem should go through the combined commands of LSF (bsub) and RMS (prun).
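
To illustrate what "combined commands of LSF (bsub) and RMS (prun)" might look like, the sketch below assembles a command line in which bsub submits a job whose payload is a prun invocation. It assumes the common -n (processor or process count) and -q (queue) options; the helper name, the program name ./a.out, and any queue name are hypothetical. The halem batch queue documentation should be treated as the authority on the actual form.

```python
import shlex


def halem_job_command(nprocs, program, queue=None):
    """Build a bsub + prun argument list (a sketch, not an NCCS-approved form)."""
    cmd = ["bsub", "-n", str(nprocs)]
    if queue is not None:
        cmd += ["-q", queue]  # queue names are site-specific
    # prun starts the parallel program on the processors LSF allocated.
    cmd += ["prun", "-n", str(nprocs), program]
    return cmd


# Hypothetical 8-process job running a program called ./a.out.
print(" ".join(shlex.quote(arg) for arg in halem_job_command(8, "./a.out")))
```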


Filesystem overview

halem has several levels of file systems:

UNIX file system (UFS) and network file system (NFS). All file systems provided by Tru64 UNIX are accessed through a virtual file system (VFS) layer integrated with the virtual memory unified buffer cache (UBC). VFS keeps file-system-specific information (a vnode) about each file in a mounted file system. System calls (read/write) are directed through these vnodes to the routine appropriate for that file system. VFS therefore acts as a uniform interface between users/applications and files, regardless of the file system on which the files reside.

Advanced file system (AdvFS). The default file system under Tru64 UNIX is the advanced file system (AdvFS). In AdvFS, the physical storage layer is managed independently of the directory layer. AdvFS allows system administrators to create file systems and to back up and restore filesets without taking the file system offline. AdvFS supports multivolume file systems.

Cluster file system (CFS). The cluster file system (CFS) is a virtual file system that sits above physical file systems such as AdvFS, NFS, and UFS to provide clusterwide access to mounted file systems. Each file system is served by a single member of a cluster domain, and the other domain members (nodes) access that file system as CFS clients. CFS makes all files, including the root (/), /usr, and /var file systems, visible to and accessible by all cluster domain members. CFS is controlled by TruCluster Server, which manages a UNIX cluster so that it acts as a single virtual system even though it is made up of multiple systems.

SC file system (SCFS). Each CFS cluster domain is configured with a cluster file system (CFS), which forms part of a global HP AlphaServer SC file system. The HP AlphaServer SC software V2 release includes the new SC file system (SCFS) capabilities. SCFS improves the ability to place compute nodes in one or more logical clusters of nodes and to isolate I/O nodes in other logical clusters. Logical clusters allow compute activities to run on nodes dedicated to parallel computation while I/O and non-parallel computing (e.g., application development and post-processing activities) run on other nodes. SCFS is a transport mechanism that moves file I/O across the HP AlphaServer SC interconnect among the logical clusters that compose an HP AlphaServer SC system.

More explanations can be found in Tru64 UNIX Version 5.1B and Cluster Technical Overview.


Shells on halem

The default shell on halem is the POSIX shell (sh), but you may change your shell to one of the other supported or unsupported shells available.

Tru64 UNIX on Alpha supports the POSIX shell (sh), the Korn shell (ksh), and the C shell (csh). The POSIX and Korn shells are almost identical, except for the pattern-matching feature available in the Korn shell.

  • POSIX (sh, default on halem) and Korn shell (ksh)
    • Easy interactive use
    • Compatible with Bourne shell
    • Supports history and aliasing
    • Permits command line editing
    • Supports job suspension
  • C shell (csh)
    • Easy interactive use
    • C-style control structure syntax
    • Supports history and aliasing
    • Supports job suspension

The Tenex C shell (/usr/dlocal/bin/tcsh) and bash shell (/usr/dlocal/bin/bash) are also available but are not supported.

  • Tenex C shell (/usr/dlocal/bin/tcsh)
    • Supports all features of the C shell
    • Allows command line editing
    • Command name completion
    • Not supported by the NCCS
  • BASH (/usr/dlocal/bin/bash), not supported by the NCCS

To temporarily invoke a different shell, issue one of the following commands:

sh (for POSIX shell; default prompt is $)
csh (for C shell; default prompt is %)
ksh (for Korn shell; default prompt is $)
tcsh (for Tenex C shell; default prompt is >)
bash (for BASH shell; default prompt is bash-2.05b$)

Issue the exit command to return to your previous shell.

To change your default shell, issue the following command:

chsh userid shell

where userid is your NCCS userid, and shell is one of the following:

/bin/sh (for POSIX shell)
/bin/ksh (for Korn shell)
/bin/csh (for C shell)
/usr/dlocal/bin/tcsh (for Tenex C shell)
/usr/dlocal/bin/bash (for BASH shell)
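
The shell choices above can be summarized as a small lookup table. The sketch below builds the chsh command in the form shown above and reports whether the chosen shell is supported by the NCCS; the helper name and the placeholder userid are illustrative only.

```python
# Shell name -> (path, supported by the NCCS), from the lists above.
SHELLS = {
    "sh": ("/bin/sh", True),
    "ksh": ("/bin/ksh", True),
    "csh": ("/bin/csh", True),
    "tcsh": ("/usr/dlocal/bin/tcsh", False),
    "bash": ("/usr/dlocal/bin/bash", False),
}


def chsh_command(userid, shell_name):
    """Return (command string, supported flag) for changing the default shell."""
    path, supported = SHELLS[shell_name]
    return f"chsh {userid} {path}", supported


command, supported = chsh_command("userid", "tcsh")
print(command, "(supported)" if supported else "(not supported by the NCCS)")
```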


Data storage

Temporary file storage (/scr)

All directories except /uig/unsipp, /usr/local, and /ford1 (except scratch) are considered temporary and are not subject to backups.

The directories that belong to NSIPP on halem are

/atmos
/forecast
/nsippscr
/ocean

After the October 2002 system upgrade, these directories were converted from PFS to SCFS, and no file quotas were set up for them. Also, no scrubbing utilities exist for /nsippscr.

Other directories, such as /u1, /scr, and /unsipp, are part of SCFS. The /unsipp directory holds the home directories of NSIPP staff, and /u1 holds those of all other users.

Policy for scratch file system (/scr)

The /scr filesystem is provided as a globally visible scratch space for temporary data storage. When a userid is created on halem, a directory is also created on /scr of the form /scr/ with mode 700 and the user's default group (you may choose to open up these permissions to others if needed, but we recommend that permissions be kept as restrictive as possible). You should use this directory to store temporary data on /scr.

The /scr file system is available for temporary data storage only and is not intended for long-term data storage; there are no backups of the data on /scr. It is your responsibility to maintain backup copies of any critical data from /scr on other mass storage systems, such as dirac/jimpf.

When disk space starts to become full on /scr, the following actions will be taken to reclaim disk space:

  • The skulker disk space management utility will be run periodically from cron to check the status of the filesystem.
  • If the filesystem is over 75% full, the skulker utility will begin removing files in the following manner in an attempt to reduce the filesystem usage to 70%:
    • Files will be removed in proportion to your total usage. For example, if you own 20% of the total data that resides on /scr and 50 GB of data needs to be removed, 20% of that 50 GB (or 10 GB) of your data will be removed.
    • Only files larger than 16 KB will be removed.
    • The oldest files will be removed first. Skulker orders your files by age, begins by deleting the oldest files, and continues until it has recovered as much space as it is trying to recover from you. Files less than 7 days old will not be deleted.
    • Users will be e-mailed a list of the files that were removed.
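
The removal rules above can be sketched as a small selection routine. This is not the skulker utility itself, whose implementation is not described here; it is only an illustration of the stated policy, assuming that 16 KB means 16 × 1024 bytes and that a file is removed whole even when this frees slightly more than the owner's share. All names and the sample files are hypothetical.

```python
from dataclasses import dataclass

MIN_SIZE = 16 * 1024   # only files larger than 16 KB are candidates
MIN_AGE_DAYS = 7       # files less than 7 days old are never removed


@dataclass
class ScratchFile:
    path: str
    owner: str
    size: int          # bytes
    age_days: float


def bytes_to_free(capacity, used):
    """Space to reclaim: nothing unless over 75% full, then down to 70%."""
    if used * 100 <= capacity * 75:
        return 0
    return used - capacity * 70 // 100


def select_for_removal(files, to_free):
    """Pick files per owner, oldest first, in proportion to each owner's usage."""
    total = sum(f.size for f in files)
    if total == 0 or to_free <= 0:
        return []
    usage = {}
    for f in files:
        usage[f.owner] = usage.get(f.owner, 0) + f.size
    selected = []
    for owner, used in usage.items():
        target = to_free * used / total   # this owner's share of the cleanup
        candidates = sorted(
            (f for f in files
             if f.owner == owner
             and f.size > MIN_SIZE
             and f.age_days >= MIN_AGE_DAYS),
            key=lambda f: f.age_days,
            reverse=True,                 # oldest first
        )
        freed = 0
        for f in candidates:
            if freed >= target:
                break
            selected.append(f)
            freed += f.size
    return selected


KB = 1024
sample = [
    ScratchFile("alice-old.dat", "alice", 200 * KB, 30),
    ScratchFile("alice-mid.dat", "alice", 200 * KB, 10),
    ScratchFile("alice-new.dat", "alice", 200 * KB, 2),
    ScratchFile("bob-big.dat", "bob", 400 * KB, 20),
    ScratchFile("bob-tiny.dat", "bob", 8 * KB, 90),
]
removed = select_for_removal(sample, 252 * KB)
print([f.path for f in removed])
```

In this sample, alice's newest file is protected by the 7-day rule and bob's small file by the 16 KB rule, so only one file per owner is selected.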

Permanent file storage

See the documentation on batch transfers to and from halem.


Password policies

Passwords on NCCS systems are set to expire every 90 days. This means users will be required to change their passwords at least once every 90 days (the period starts each time you change your password). Changing a password again within 21 days is prohibited. To check when you last changed your password on halem, issue the command:

passwd -q

The sample computer output below shows that the password was last changed on 07/20/05:

userid P5 072005 21 90
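
The sample output can be turned into dates with a few lines of code. The sketch below assumes that the third field is the last-change date in MMDDYY form (consistent with the 07/20/05 reading above) and that the fourth and fifth fields are the minimum and maximum password ages in days, which matches the 21-day and 90-day policy but is not stated explicitly in the output. The meaning of the second field is not interpreted.

```python
from datetime import datetime, timedelta


def parse_passwd_q(line):
    """Parse one line of 'passwd -q' output under the assumptions above."""
    userid, status, changed, min_days, max_days = line.split()
    last_changed = datetime.strptime(changed, "%m%d%y").date()
    return {
        "userid": userid,
        "status": status,  # left uninterpreted
        "last_changed": last_changed,
        "earliest_next_change": last_changed + timedelta(days=int(min_days)),
        "expires": last_changed + timedelta(days=int(max_days)),
    }


info = parse_passwd_q("userid P5 072005 21 90")
print(info["last_changed"], info["earliest_next_change"], info["expires"])
```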

New passwords on halem must differ from old passwords by at least three characters, be at least 11 characters long, and contain at least one number or special character.
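
The password rules can be pre-checked before running passwd. The sketch below is only an approximation: it assumes that "differ by at least three characters" means at least three positions differ (counting any difference in length), and that a "special character" is any character that is not a letter. The actual check performed by the system may be defined differently.

```python
def count_differences(old, new):
    """Positions at which the two passwords differ, plus any length difference."""
    positional = sum(a != b for a, b in zip(old, new))
    return positional + abs(len(old) - len(new))


def check_new_password(old, new):
    """Return a list of rule violations; an empty list means all checks pass."""
    problems = []
    if len(new) < 11:
        problems.append("must be at least 11 characters long")
    if not any(not c.isalpha() for c in new):
        problems.append("must contain at least one number or special character")
    if count_differences(old, new) < 3:
        problems.append("must differ from the old password by at least three characters")
    return problems


print(check_new_password("oldpassword1", "oldpassword2"))
```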


NASA Curator: Mason Chang,
NCCS User Services Group (301-286-9120)
NASA Official: Phil Webster, High-Performance
Computing Lead, GSFC Code 606.2