Interactive Parallel Jobs on Eagle



Using the Parallel Operating Environment

If resources are available on Eagle, you can run parallel jobs interactively using the Parallel Operating Environment (POE). To run a program in parallel, you specify the number of processors and/or nodes, the communication library, and a particular "pool" of nodes. POE then uses LoadLeveler to acquire a set of nodes in the specified pool. If nodes are not available, the command fails.

LoadLeveler uses the class "interactive" for all interactive jobs. Use the following command for information on this class, including wall-clock run-time limits.

$ llclass -l interactive

If an executable is compiled for parallel execution, it will run under POE as a single program with multiple processes. If the executable is sequential, POE will start multiple copies of it across the acquired nodes.

You can specify values for number of processors, communication library, pool, etc. using environment variables or command-line arguments to the "poe" command. Command-line arguments override environment variables. The following table summarizes the important options.

"poe" option | Environment variable | Description
-------------|----------------------|------------
-procs n | MP_PROCS=n | The number ("n") of parallel processes. Use with either "-tasks_per_node" or "-nodes".
-nodes n | MP_NODES=n | The number ("n") of nodes. Use with either "-procs" or "-tasks_per_node".
-tasks_per_node n | MP_TASKS_PER_NODE=n | The number ("n") of parallel processes per node. Use with either "-procs" or "-nodes".
-rmpool 1 | MP_RMPOOL=1 | The resource-manager pool that LoadLeveler will use to allocate nodes. The compute nodes of the ORNL SP are in pool "1".
-euilib xx | MP_EUILIB=xx | Communication library. Valid values for "xx" are "ip" for Internet Protocol and "us" for User Space. The recommended value is "us", though "ip" is the default.
(none) | MP_SHARED_MEMORY=yes | Use shared memory for MPI communication within a node. Requires compilation with the thread-safe MPI library (i.e. using "mpxlf_r", "mpcc_r", etc.).
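
As a rough illustration of how the three counts relate, the sketch below assumes that processes are spread evenly across nodes, so the total is the node count times the per-node count. The shell variable names are illustrative only, and the two "poe" command lines are printed rather than run.

```shell
# Assumption: an even layout, so procs = nodes * tasks_per_node.
nodes=2
tasks_per_node=4
procs=$((nodes * tasks_per_node))

# Two ways of describing the same layout to poe (printed, not executed).
echo "poe a.out -rmpool 1 -nodes $nodes -tasks_per_node $tasks_per_node"
echo "poe a.out -rmpool 1 -procs $procs -nodes $nodes"
```

Either pairing gives POE two of the three counts, which is what the "Use with either" notes in the table ask for.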

The following example runs "a.out" on 8 processors across 2 compute nodes using US over the SP switch.

$ poe a.out -rmpool 1 -procs 8 -nodes 2 -euilib us
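
The same run can presumably be set up through the environment variables listed in the table instead of command-line arguments. The following is a minimal sketch under that assumption. The guard around "poe" is only there so the snippet does nothing on a machine without POE or without an "a.out" in the current directory.

```shell
# Environment-variable form of the command-line example.
# Variable names come from the options table; values mirror the example.
export MP_RMPOOL=1      # resource-manager pool for the compute nodes
export MP_PROCS=8       # total number of parallel processes
export MP_NODES=2       # number of nodes to spread them across
export MP_EUILIB=us     # User Space communication library

# Run only where poe and the executable are both present.
if command -v poe >/dev/null 2>&1 && [ -x ./a.out ]; then
    poe a.out
else
    echo "poe or ./a.out not found; settings exported but nothing was run"
fi
```

Because command-line arguments override environment variables, a later "poe a.out -procs 4" in the same shell would use 4 processes rather than the exported value of 8.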

For more information on "poe" options, see "man poe". Online documentation for IBM's Parallel Environment (PE), including POE, is available at the following URL. As of this writing, Eagle has version 3 release 1 of PE.

http://www-1.ibm.com/servers/eserver/pseries/library/sp_books/pe.html


Using the TotalView Parallel Debugger

Etnus TotalView is a debugger for sequential, parallel, and threaded programs, and it has a powerful graphical interface. On Eagle, it works with MPI, OpenMP, and hybrid MPI-OpenMP applications. It also has a command-line interface.

Starting TotalView on a parallel job is nontrivial because of current limitations of "rsh" under DCE on Eagle. To simplify the procedure, use the "tv" script. It takes the same options as "poe", but runs "poe" within TotalView. Simply replace "poe" with "tv" on the command line. The following example starts TotalView on 8 processors across 2 compute nodes using US over the SP switch.

$ tv a.out -rmpool 1 -procs 8 -nodes 2 -euilib us

(To debug sequential jobs or core files, use TotalView directly instead of "tv".)

TotalView starts daemons on the remote nodes where your parallel job runs, and those daemons need DCE credentials. The "tv" script forwards your credentials to the remote nodes, but it can do so only if the credentials are forwardable, and they are not by default. Therefore, before running "tv", you must issue "kinit -f" and enter your password.

$ kinit -f
Enter Password: 

Your credentials will be forwardable for the remainder of the session.

Since TotalView is an X-Window application, you must have the "DISPLAY" environment variable set to point to your local display. You may need to issue the following command on your machine to allow the Eagle login node to display there:

local$ xhost +eagle163j1.ccs.ornl.gov

(If the usual login node, Eagle163, is down, you may be on the backup node, Eagle164. Just replace "163" with "164" in the above command.)
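
Before starting "tv", it may be worth confirming that "DISPLAY" is actually set in your session on Eagle. The following sketch does only that. The name "display_is_set" is a hypothetical helper, not part of TotalView or the "tv" script.

```shell
# Hypothetical helper: succeeds only if DISPLAY is set to a non-empty value.
display_is_set() {
    [ -n "${DISPLAY:-}" ]
}

if display_is_set; then
    echo "DISPLAY is $DISPLAY"
else
    echo "DISPLAY is not set; TotalView windows cannot open" >&2
fi
```

This checks only that the variable is non-empty. It does not verify that your local X server accepts connections from the login node.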

After you run the "tv" command, two windows should appear. In the larger window, you will see the assembly code for "poe". Type "G" (capital G) in this window to cause all processes to "Go". TotalView will run for a few seconds and then ask if you'd like to stop your processes before entering "MAIN". Answer "yes" to stop your program at the beginning, so you can add breakpoints, etc. before running.

For more information on "tv", see "man tv".

For more information on using TotalView, see "man totalview" or type "?" within a TotalView window. The TotalView User's Guide is available on Eagle in the following location.

/usr/local/com/toolworks/totalview/doc/pdf/user_guide.pdf

For information on the TotalView command-line interface, see the TotalView Command Line Interface Guide, available on Eagle in the following location.

/usr/local/com/toolworks/totalview/doc/pdf/cli_guide.pdf

Documentation for the TotalView graphical interface and command-line interface is also available directly from Etnus at the following URL.

http://www.etnus.com/Support/docs/index.html

URL http://www.ccs.ornl.gov/eagle/interactive.html
Updated: Thursday, 21-Mar-2002 16:14:44 EST
consult@ccs.ornl.gov