If resources are available on Eagle, you can run parallel jobs
interactively using the Parallel
Operating Environment (POE). To run a program in parallel, you
specify the number of processors and/or nodes, the communication
library, and a particular "pool" of nodes. POE then uses LoadLeveler
to acquire a set of nodes in the specified pool. If nodes are not
available, the command fails.
LoadLeveler uses the class
"interactive" for all interactive jobs. Use the following
command for information on this class, including wall-clock run-time limits.
$ llclass -l interactive
If an executable is compiled for parallel execution, it will run under
POE as a single program with multiple processes. If the executable is
sequential, POE will start multiple copies of it across the acquired
nodes.
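As a rough illustration of the sequential case, the sketch below launches "hostname" under POE. The choice of "hostname" and the counts of 4 processes and 2 nodes are assumptions made for the example. The options are the ones described in this guide. The guard only keeps the sketch harmless on machines where "poe" is not installed.

```shell
# "hostname" stands in for any sequential executable (an assumption for
# this example); the process and node counts are arbitrary.
cmd="poe hostname -rmpool 1 -procs 4 -nodes 2"

if command -v poe >/dev/null 2>&1; then
    # Each of the 4 tasks runs its own copy of hostname.
    $cmd
else
    echo "poe not found; would run: $cmd"
fi
```

If nodes are available, each copy should report the name of the node it ran on.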
You can specify values for number of processors, communication
library, pool, etc. using environment variables or command-line
arguments to the "poe" command. Command-line arguments
override environment variables. The following table
summarizes the important options.
"poe" option | Environment variable | Description |
-procs n | MP_PROCS=n | The number ("n") of parallel processes. Use with either "-tasks_per_node" or "-nodes". |
-nodes n | MP_NODES=n | The number ("n") of nodes. Use with either "-procs" or "-tasks_per_node". |
-tasks_per_node n | MP_TASKS_PER_NODE=n | The number ("n") of parallel processes per node. Use with either "-procs" or "-nodes". |
-rmpool 1 | MP_RMPOOL=1 | The resource-manager pool that LoadLeveler will use to allocate nodes. The compute nodes of the ORNL SP are in pool "1". |
-euilib xx | MP_EUILIB=xx | Communication library. Valid values for "xx" are "ip" for Internet Protocol and "us" for User Space. The recommended value is "us", though "ip" is the default. |
none | MP_SHARED_MEMORY=yes | Use shared memory for MPI communication within a node. Requires compilation with the thread-safe MPI library (i.e., using "mpxlf_r", "mpcc_r", etc.). |
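The table says to give any two of the process count, node count, and tasks per node. The sketch below shows how the third value follows from the other two, assuming processes are spread evenly across nodes (procs = nodes x tasks_per_node). The helper name "tasks_per_node" is hypothetical and is not part of POE.

```shell
# Hypothetical helper: derive tasks per node from a process count and a
# node count. Assumes an even spread: procs = nodes * tasks_per_node.
tasks_per_node() {
    procs=$1
    nodes=$2
    if [ "$nodes" -le 0 ] || [ $((procs % nodes)) -ne 0 ]; then
        echo "procs ($procs) must be a positive multiple of nodes ($nodes)" >&2
        return 1
    fi
    echo $((procs / nodes))
}

tasks_per_node 16 4   # prints 4
```

Under that assumption, "-procs 16 -nodes 4" and "-nodes 4 -tasks_per_node 4" describe the same layout.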
The following example runs "a.out" on 8 processors across 2 compute
nodes using US over the SP switch.
$ poe a.out -rmpool 1 -procs 8 -nodes 2 -euilib us
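The same run can also be set up through the environment variables listed in the table. The following is a minimal sketch of that form. The guard around "poe" is an addition for this example so that it does nothing harmful on a machine without POE.

```shell
# Environment-variable form of the options used in the example above.
export MP_RMPOOL=1    # pool "1": the compute nodes
export MP_PROCS=8     # 8 parallel processes
export MP_NODES=2     # spread across 2 nodes
export MP_EUILIB=us   # User Space communication

# Guarded so the sketch is harmless where poe is not installed.
if command -v poe >/dev/null 2>&1; then
    poe a.out
else
    echo "poe not found; would run: poe a.out"
fi
```

Because command-line arguments override environment variables, running "poe a.out -procs 4" with these settings in place would request 4 processes rather than 8.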
For more information on "poe" options, see "man
poe". Online documentation for IBM's Parallel Environment (PE),
including POE, is available at the following URL. As of this writing,
Eagle has version 3 release 1 of PE.
http://www-1.ibm.com/servers/eserver/pseries/library/sp_books/pe.html
Etnus TotalView is a debugger for sequential, parallel, and threaded
programs, and it has a powerful graphical interface. On Eagle, it works
with MPI, OpenMP, and hybrid MPI-OpenMP applications. It also has a
command-line interface.
Starting TotalView on a parallel job is nontrivial because of
current limitations of "rsh" under DCE on Eagle. To simplify
the procedure, use the "tv" script. It takes the same options
as "poe", but runs "poe" within TotalView. Simply
replace "poe" with "tv" on the command line.
The following example starts TotalView on 8
processors across 2 compute nodes using US over the SP switch.
$ tv a.out -rmpool 1 -procs 8 -nodes 2 -euilib us
(To debug sequential jobs or core files, use TotalView directly
instead of "tv".)
TotalView starts daemons on remote nodes where your parallel job runs,
and those daemons need to have DCE credentials. The "tv"
script forwards your credentials to the remote nodes.
To be forwarded, however, your credentials must be forwardable, and
they are not by default. Therefore, before running "tv", you must issue
"kinit -f" and enter your password.
$ kinit -f
Enter Password:
Your credentials will be forwardable for the remainder of the
session.
Since TotalView is an X Window System application, you must have the
"DISPLAY" environment variable set to point to your local display. You
may need to issue the following command on your local machine to
allow the Eagle login node to display there:
local$ xhost +eagle163j1.ccs.ornl.gov
(If the usual login node, Eagle163, is down, you may be on the backup
node, Eagle164. Just replace "163" with "164" in the
above command.)
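The prerequisites above can be checked before launching the debugger. The sketch below is a hypothetical helper, "tv_preflight", which is not part of the "tv" script or of TotalView. It only verifies that "DISPLAY" is set and prints a reminder about "kinit -f". It does not inspect your credentials.

```shell
# Hypothetical helper: check the prerequisites described above before
# running "tv". It does not verify that credentials are forwardable.
tv_preflight() {
    if [ -z "${DISPLAY:-}" ]; then
        echo "DISPLAY is not set; point it at your local X display first." >&2
        return 1
    fi
    echo "DISPLAY is set to $DISPLAY"
    echo "Reminder: run 'kinit -f' once per session so credentials are forwardable."
    return 0
}
```

One possible use is "tv_preflight && tv a.out -rmpool 1 -procs 8 -nodes 2 -euilib us", so that "tv" starts only when the check passes.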
After you run the "tv" command, two windows should appear. In the
larger window, you will see the assembly code for "poe". Type
"G" (capital G) in this window to cause all processes to
"Go". TotalView will run for a few seconds and then ask whether you
would like to stop your processes before entering "MAIN". Answer "yes"
to stop your program at the beginning, so that you can add breakpoints,
etc., before running.
For more information on "tv", see "man tv".
For more information on using TotalView, see "man totalview" or
type "?" within a TotalView window. The TotalView
User's Guide is available on Eagle in the
following location.
/usr/local/com/toolworks/totalview/doc/pdf/user_guide.pdf
For information on the TotalView command-line interface, see
the TotalView Command Line Interface Guide, available on Eagle
in the following location.
/usr/local/com/toolworks/totalview/doc/pdf/cli_guide.pdf
Documentation for the TotalView graphical interface and command-line interface
is also available directly from Etnus at the
following URL.
http://www.etnus.com/Support/docs/index.html