halem Overview of System Resources
halem is the NCCS HP/Compaq AlphaServer Sierra Cluster
(SC) 45 system.
Overview of halem
The HP/Compaq AlphaServer SC45 is a very-large-scale
distributed-memory parallel supercomputer
engineered for TeraFLOPS performance;
memory is shared only among the processors
within a node.
It uses standard off-the-shelf
symmetric multiprocessor (SMP)
nodes and connects them through high-speed
intelligent interconnects to deliver
wire-level performance for the
Message Passing Interface (MPI)
programming model. The HP/Compaq
AlphaServer SC45 uses AlphaServer
ES45s as its SMP nodes.
halem
has the following features:
- The halem system has 348 available nodes
(as of 9/25/02).
- Each ES45 node is built on the 21264
system architecture and incorporates
four Alpha EV68 processors.
- In domains A to E, each EV68 processor
runs at 1.0 GHz, is capable of 2 GFLOPS,
and has 64 KB of L1 cache and 8 MB of
high-speed (250 MHz) L2 cache.
- In domains F to L, the processors
run at 1.25 GHz.
- All nodes have 2 GB of memory, accessed
through dual 256-bit-wide, 125-MHz memory
buses that yield 8 GB/sec of memory
bandwidth (4 GB/sec peak per port).
- Four banks of 512-MB memory DIMMs are
used to achieve 4-way memory
interleaving, with room for expansion.
- halem's SMP nodes are connected to
a single-rail Quadrics QsNet
network switch through
an ELAN PCI adaptor (1
process per node) on a 64-bit/66-MHz
(528 MB/s) PCI bus, yielding a
peak internode bandwidth of
about 280 MB/s.
- halem also has 5.5 TB of disk drives
controlled by dual
HSG80 RAID controllers
with 512 MB of cache.
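The bandwidth and processor figures in the list above follow from simple width-times-clock arithmetic. The sketch below reproduces them. The rate of 2 floating-point operations per cycle is inferred from the stated 2 GFLOPS at 1.0 GHz, and applying the same rate to the 1.25 GHz processors is an assumption, not a figure quoted on this page.

```python
# Reproduce the headline hardware figures from bus widths and clock rates.
# Decimal units are assumed throughout (1 MB = 10**6 bytes).

def bus_bandwidth_mb_s(width_bits, clock_mhz):
    """Peak bandwidth in MB/s for a bus that moves width_bits per clock cycle."""
    return width_bits / 8 * clock_mhz

# Memory: two 256-bit ports, each clocked at 125 MHz.
memory_port_mb_s = bus_bandwidth_mb_s(256, 125)   # 4000 MB/s = 4 GB/s per port
memory_total_mb_s = 2 * memory_port_mb_s          # 8000 MB/s = 8 GB/s per node

# PCI: 64-bit bus clocked at 66 MHz.
pci_mb_s = bus_bandwidth_mb_s(64, 66)             # 528 MB/s

# Processors: 348 nodes with four processors each.
NODES = 348
CPUS_PER_NODE = 4
total_cpus = NODES * CPUS_PER_NODE                # 1392

# Assumption: 2 floating-point operations per cycle, inferred from
# "2 GFLOPS at 1.0 GHz". The 1.25 GHz value is extrapolated, not quoted.
FLOPS_PER_CYCLE = 2
gflops_at_1000_mhz = FLOPS_PER_CYCLE * 1.0
gflops_at_1250_mhz = FLOPS_PER_CYCLE * 1.25

print(memory_port_mb_s, memory_total_mb_s, pci_mb_s, total_cpus)
```

The quoted internode bandwidth of about 280 MB/s is well below the 528 MB/s PCI peak, so it is best read as a measured or adapter-limited figure rather than something derivable from the bus width alone.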
Hardware
HP AlphaServer SC system hardware is described
in detail in the HP
AlphaServer SC40/SC45 QuickSpecs. As of September
25, 2002, halem comprises
348 nodes (1392 processors) that
are divided into 12 CFS cluster
domains (A through L), each with its own
number of nodes (cluster members).
From the network point of view, each
domain is a host:
Domains | A | B | C | D | E |
Host | halemA | halemB | halemC | halemD | halemE |
Nodes | 0-13 | 14-31 | 32-63 | 64-95 | 96-127 |
# of Nodes | 14 | 18 | 32 | 32 | 32 |

Domains | F | G | H | I | J | K | L |
Host | halemF | halemG | halemH | halemI | halemJ | halemK | halemL |
Nodes | 128-159 | 160-191 | 192-223 | 224-255 | 256-287 | 288-319 | 320-347 |
# of Nodes | 32 | 32 | 32 | 32 | 32 | 32 | 28 |
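As a small illustration of how the domain tables above can be used, the sketch below maps a node number to its CFS domain and host name. The node ranges are copied from the tables; the function names `domain_of` and `host_of` are hypothetical helpers, not tools installed on halem.

```python
# Inclusive node ranges per CFS cluster domain, copied from the tables above.
DOMAIN_RANGES = {
    "A": (0, 13), "B": (14, 31), "C": (32, 63), "D": (64, 95),
    "E": (96, 127), "F": (128, 159), "G": (160, 191), "H": (192, 223),
    "I": (224, 255), "J": (256, 287), "K": (288, 319), "L": (320, 347),
}

def domain_of(node):
    """Return the domain letter that contains the given node number."""
    for domain, (first, last) in DOMAIN_RANGES.items():
        if first <= node <= last:
            return domain
    raise ValueError(f"node {node} is outside the range 0-347")

def host_of(node):
    """Return the domain host name (for example, halemE) for a node."""
    return "halem" + domain_of(node)

# Domain sizes derived from the ranges; they should add up to 348 nodes.
domain_sizes = {d: last - first + 1 for d, (first, last) in DOMAIN_RANGES.items()}

print(host_of(100), domain_sizes["B"], sum(domain_sizes.values()))
```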
RMS partitions are used by LSF to configure
hosts and batch jobs. Currently
the LSF hosts, which are used for
queue designations, are the same
as the RMS partitions. For the
current LSF host settings, see /usr/share/lsf/conf/hosts.
Cluster domains are grouped into RMS partitions as
follows (type 'rinfo' to display them):
RMS Partition | Cluster Domains | Hosts | LSF Hosts |
small | A | halem[0-13] | halema |
par_a | B, C, D, E | halem[14-127] | hlm100 |
par_b | F, G, H, I, J, K, L | halem[128-347] | hlm125 |
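The partition table above can be read programmatically in the same way. The following sketch parses the bracketed host ranges and reports which partition and LSF host a node belongs to, along with the processor count per partition (four processors per node, as stated earlier). The helper names are hypothetical, and the data is transcribed from the table rather than queried from rinfo.

```python
import re

# Rows transcribed from the RMS partition table above:
# partition name -> (host range, LSF host).
PARTITIONS = {
    "small": ("halem[0-13]", "halema"),
    "par_a": ("halem[14-127]", "hlm100"),
    "par_b": ("halem[128-347]", "hlm125"),
}
CPUS_PER_NODE = 4

def parse_host_range(spec):
    """Turn a string such as 'halem[14-127]' into the tuple (14, 127)."""
    match = re.fullmatch(r"halem\[(\d+)-(\d+)\]", spec)
    if match is None:
        raise ValueError(f"unrecognized host range: {spec!r}")
    return int(match.group(1)), int(match.group(2))

def partition_of(node):
    """Return (partition, LSF host) for a node number."""
    for name, (spec, lsf_host) in PARTITIONS.items():
        first, last = parse_host_range(spec)
        if first <= node <= last:
            return name, lsf_host
    raise ValueError(f"node {node} is not in any partition")

# Processors available in each partition.
cpus_per_partition = {}
for name, (spec, _) in PARTITIONS.items():
    first, last = parse_host_range(spec)
    cpus_per_partition[name] = (last - first + 1) * CPUS_PER_NODE

print(partition_of(200), cpus_per_partition)
```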
Network connectivity
Tru64 UNIX supports Asynchronous Transfer
Mode (ATM), Ethernet, Fast Ethernet,
Gigabit Ethernet, Fiber Distributed
Data Interface (FDDI), and others.
Connection to the NCCS mass storage
system is through Gigabit Ethernet.
Use "dirac" as
the hostname for the mass storage
system to establish a connection
through an
available network. sftp and
scp are allowed only between halem
and qualified hostnames, as enforced by
the Tcpwrapper software.
Operating system
The NCCS HP AlphaServer SC system runs
Tru64
UNIX version 5.1B. Earlier
versions of Tru64 UNIX were known
as DIGITAL UNIX and, before that,
as DEC OSF/1.
Tru64 UNIX is the HP/Compaq implementation
of the Open Software Foundation
Version 1.0 and Version 1.2 technology
and the Motif Version 1.2.5 graphical
user interface and programming environment.
In addition, Tru64 UNIX supports the full features
of the X Window System, Version 11, Release
6.3 (X11R6.3). The Tru64 UNIX operating system
is a multiuser/multitasking, 64-bit,
advanced kernel architecture based on Carnegie
Mellon University's Mach Version 2.5 kernel
design, with components from Berkeley Software
Distribution (BSD) Versions 4.3 and 4.4, UNIX
System Laboratories System V Release 4.0, other
software sources, the public domain, and HP
and Compaq Computer Corporation.
The operation of the HP AlphaServer SC system
itself depends on several layers
of resource management software,
such as Load Sharing Facility (LSF)
by Platform and Resource Management
System (RMS) by Quadrics.
LSF is batch-queue workload
management software, and RMS is
a set of commands for running parallel
programs. The current version of
the HP AlphaServer SC system software
is SC2.6 Uk1 5.1B+pk4 (as of 05/23/2005). Quadrics
Supercomputers World Ltd.,
which produces the interconnect
network switch and RMS for the
HP AlphaServer SC, has written
a general User
Guide for the AlphaServer SC system. Platform also has its
own LSF
documentation. Note, however,
that the
description of the job environment
in the User Guide does not apply
to halem, because all parallel jobs
on halem should be submitted through the combined
LSF (bsub) and RMS (prun)
commands.
Filesystem overview
halem has several levels of file systems:
UNIX file system (UFS)
and network file system (NFS). All
file systems provided by Tru64
UNIX are accessed through a virtual
file system (VFS) layer integrated
with the virtual memory unified
buffer cache (UBC). VFS keeps
file system-specific inodes about
each file in a mounted file system.
System calls (read/write) are
directed to the routine appropriate
for that file system through
these nodes. VFS therefore
acts as a uniform interface
between users/applications and
files, regardless of the file
system on which the files reside.
Advanced
file system (AdvFS). The default
file system under Tru64 UNIX is
the advanced file system (AdvFS).
In AdvFS, the physical storage layer is
managed independently of the directory layer.
AdvFS allows system administrators to create
file systems and to back up and restore filesets
without taking the file system offline. AdvFS
also supports multivolume file systems.
Cluster
file system (CFS). The cluster
file system (CFS) is a virtual
file system that sits above
physical file systems
such as AdvFS, NFS, and UFS,
to provide clusterwide access
to mounted file systems.
Each file system is served
by a single-cluster domain,
and other domain members
(nodes) access that file
system as CFS clients. CFS
makes all files, including
the root(/), /usr, and /var
file systems, visible to
and accessible by all cluster
domain members. CFS is controlled
by TruCluster
Server, which manages
a UNIX cluster so that it
acts as a single virtual
system, even though it is
made up of multiple systems.
SC
file system (SCFS). Each
CFS cluster domain is configured
with a cluster file system (CFS),
which forms part of a global HP
AlphaServer SC file system.
The HP AlphaServer SC software
V2 release includes the
new SC file system (SCFS)
capabilities. SCFS improves
the ability to place compute
nodes in one or more logical
clusters of nodes and to
isolate I/O nodes in other
logical clusters. Logical
clusters allow compute
activities to run on nodes
dedicated to parallel computation
while I/O and non-parallel computing
(e.g., application development and
post-processing activities) run on
other nodes. SCFS is a transport mechanism
that moves file I/O across the HP AlphaServer
SC interconnect among the
logical clusters that compose an HP
AlphaServer SC system.
For more detail, see the Tru64
UNIX Version
5.1B and Cluster
Technical Overview documents.
Shells on halem
The default shell on halem is the POSIX
shell (sh), but you may
change your shell to one of the
other supported or unsupported
shells available.
Alpha Tru64 UNIX supports the
POSIX shell (sh), the Korn
shell (ksh), and the C shell (csh).
The POSIX and Korn shells are almost
identical, except for the pattern
matching feature available in the
Korn shell.
- POSIX (sh, default on halem) and Korn shell (ksh)
  - Easy interactive use
  - Compatible with Bourne shell
  - Supports history and aliasing
  - Permits command line editing
  - Supports job suspension
- C shell (csh)
  - Easy interactive use
  - C-style control structure syntax
  - Supports history and aliasing
  - Supports job suspension
The Tenex C shell (/usr/dlocal/bin/tcsh)
and bash shell (/usr/dlocal/bin/bash)
are also available but are not
supported.
- Tenex C shell (/usr/dlocal/bin/tcsh)
  - Supports all features of the C shell
  - Allows command line editing
  - Command name completion
  - Not supported by the NCCS
- BASH (/usr/dlocal/bin/bash), not supported by the NCCS
To temporarily invoke a different shell,
issue one of the following commands:
sh (for POSIX shell; default prompt is $)
csh (for C shell; default prompt is %)
ksh (for Korn shell; default prompt is $)
tcsh (for Tenex C shell; default prompt is >)
bash (for BASH shell; default prompt is bash.2.05b$)
Issue the exit command
to return to your previous shell.
To change your default
shell, issue the following
command:
chsh userid shell
where userid is
your NCCS userid, and shell is one of the following:
/bin/sh (for POSIX shell)
/bin/ksh (for Korn shell)
/bin/csh (for C shell)
/usr/dlocal/bin/tcsh (for Tenex C shell)
/usr/dlocal/bin/bash (for BASH shell)
Data storage
Temporary file storage (/scr)
All directories except /uig/unsipp, /usr/local,
and /ford1 (except scratch) are considered
temporary and are not subject to backups.
The directories that belong to NSIPP on halem
are
/atmos
/forecast
/nsippscr
/ocean
After the October 2002 system upgrade, these
were converted from PFS to SCFS, and no file quotas
were set up for these directories. Also, no scrubbing
utilities exist for /nsippscr.
Other directories, such as /u1, /scr, and /unsipp,
are part of SCFS. The /unsipp
directory holds the home directories
for NSIPP staff, and /u1 holds those for all other users.
Policy for scratch file system (/scr)
The /scr filesystem is provided as a globally
visible scratch space for temporary data storage.
When a userid is created on halem, a directory
is also created on /scr of the form /scr/ with
mode 700 and the user's default group (you
may choose to open up these permissions to
others if needed, but we recommend that permissions
be kept as restrictive as possible). You should
use this directory to store temporary data
on /scr.
The /scr file system is available for temporary
data storage only and is not intended for long-term
data storage; there are no backups
of the data on /scr. It is your responsibility
to maintain backup copies of any critical data
from /scr on other mass storage systems, such
as dirac/jimpf.
When disk space starts to become full on /scr,
the following actions will be taken
to reclaim disk space:
- The skulker disk space
management utility will be
run periodically from cron to
check the status of the filesystem.
- If the
filesystem is over 75% full,
the skulker utility will begin
removing files in the following
manner in an attempt to reduce
the filesystem usage to 70%:
  - Files will be removed in proportion
  to your total usage.
  For example, if you are the
  owner of 20% of the total
  data that resides on /scr
  and 50 GB of data needs to
  be removed, 20% of that 50 GB (or 10 GB)
  of your data will be removed.
  - Only files
  larger than 16 KB will be
  removed.
  - The oldest files
  will be removed first.
  Skulker orders your files
  by age and deletes them,
  oldest first,
  until it has recovered as
  much space as it is trying
  to recover from you. Files
  less than 7 days old will
  not be deleted.
- Users will
be e-mailed a list of the
files that were
removed.
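The removal rules above can be sketched as a small simulation. This is not the skulker utility itself, whose implementation is not shown here; it only models the stated policy (proportional shares, a 16 KB size floor, a 7-day age floor, oldest first). The `ScratchFile` record, the function names, and the use of binary kilobytes for the size floor are all assumptions made for illustration.

```python
from dataclasses import dataclass

MIN_SIZE_BYTES = 16 * 1024   # "larger than 16 KB"; binary KB is an assumption
MIN_AGE_DAYS = 7             # files younger than this are never removed

@dataclass
class ScratchFile:
    owner: str
    path: str
    size: int         # bytes
    age_days: float   # days since the file was last touched (assumed measure)

def reclaim_targets(usage_by_owner, amount_to_free):
    """Split the amount to free among owners in proportion to their usage."""
    total = sum(usage_by_owner.values())
    return {owner: amount_to_free * used / total
            for owner, used in usage_by_owner.items()}

def select_for_removal(files, owner, target_bytes):
    """Pick one owner's eligible files, oldest first, until the target is met."""
    eligible = [f for f in files
                if f.owner == owner
                and f.size > MIN_SIZE_BYTES
                and f.age_days >= MIN_AGE_DAYS]
    eligible.sort(key=lambda f: f.age_days, reverse=True)
    chosen, freed = [], 0
    for f in eligible:
        if freed >= target_bytes:
            break
        chosen.append(f)
        freed += f.size
    return chosen

# The example from the text: an owner of 20% of the data, with 50 GB to free.
print(reclaim_targets({"you": 20, "everyone_else": 80}, 50))
```

With the usage figures from the text, the share for the 20% owner comes out to 10 (GB), matching the worked example in the policy.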
Permanent file storage
See the documentation on batch
transfers to and from halem.
Password policies
Passwords on NCCS systems
are set to expire every 90 days.
This means users will be required
to change their passwords at least
once every 90 days (the period
starts each time you change your
password). Changing a password
again within 21 days is prohibited.
To check when you
last changed your password on halem,
issue the command:
passwd -q
The sample computer output below shows
that the password was last changed
on 07/20/05:
userid P5 072005 21 90
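As a rough sketch, the sample line can be parsed to work out when the password may next be changed and when it expires. Only the date field is explained on this page; treating the last two fields (21 and 90) as the minimum and maximum password age in days is an assumption based on the policy described above, and the second field is left uninterpreted.

```python
from datetime import datetime, timedelta

def parse_passwd_status(line):
    """Parse a line such as 'userid P5 072005 21 90'.

    Assumed layout: userid, status code, last change as MMDDYY,
    minimum age in days, maximum age in days.
    """
    user, status, changed, min_days, max_days = line.split()
    last_changed = datetime.strptime(changed, "%m%d%y").date()
    return {
        "user": user,
        "status": status,   # meaning not documented on this page
        "last_changed": last_changed,
        "earliest_change": last_changed + timedelta(days=int(min_days)),
        "expires": last_changed + timedelta(days=int(max_days)),
    }

info = parse_passwd_status("userid P5 072005 21 90")
print(info["last_changed"], info["earliest_change"], info["expires"])
```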
New
passwords on halem must differ
from old passwords by at least three
characters, be at least 11 characters
in length, and contain at
least one number or special character.
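The three rules can be illustrated with a short checker. This is a sketch of the stated policy, not the check that halem performs. In particular, the policy does not define how the "three characters" difference is measured; counting mismatched positions plus any difference in length is an assumption, and the function names are hypothetical.

```python
MIN_LENGTH = 11
MIN_DIFFERENCE = 3

def differing_characters(old, new):
    """Count mismatched positions plus any difference in length (assumed metric)."""
    mismatches = sum(a != b for a, b in zip(old, new))
    return mismatches + abs(len(old) - len(new))

def check_new_password(old, new):
    """Return a list of policy problems; an empty list means the password passes."""
    problems = []
    if len(new) < MIN_LENGTH:
        problems.append("shorter than 11 characters")
    # Any non-letter counts as a "number or special character" here.
    if all(c.isalpha() for c in new):
        problems.append("no number or special character")
    if differing_characters(old, new) < MIN_DIFFERENCE:
        problems.append("differs from the old password by fewer than three characters")
    return problems

print(check_new_password("Winter2004pass", "Summer#2005word"))
```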