Titan User Guide
Contents
- 1. Jaguar to Titan Transition
- 2. Titan System Overview
- 3. Requesting Access to OLCF Resources
- 3.1. Project Allocation Requests
- 3.2. User Account Requests
- 4. OLCF Help and Policies
- 4.1. User Assistance Center
- 4.2. Communications to Users
- 4.3. My OLCF Site
- 4.4. Special Requests and Policy Exemptions
- 4.5. OLCF Acknowledgement
- 5. Accessing OLCF Systems
- 5.1. OLCF System Hostnames
- 5.2. General-Purpose Systems
- 5.3. X11 Forwarding
- 5.4. RSA Key Fingerprints
- 5.5. Authenticating to OLCF Systems
- 6. Data Management
- 6.1. User-Centric Data Storage
- 6.1.1. User Home Directories (NFS)
- 6.1.2. User Work Directories (Lustre)
- 6.1.3. User Archive Directories (HPSS)
- 6.2. Project-Centric Data Storage
- 6.2.1. Project Home Directories (NFS)
- 6.2.2. Project Work Directories (Lustre)
- 6.2.3. Project Archive Directories (HPSS)
- 6.3. Transferring Data
- 6.4. Storage Policy Summary
- 7. Software and Shell Environments
- 7.1. Default Shell
- 7.2. Using Modules
- 7.3. Installed Software
- 8. Compiling On Titan
- 8.1. Cray Compiler Wrappers
- 8.2. Compiling and Node Types
- 8.3. Controlling the Programming Environment
- 8.4. Compiling Threaded Codes
- 9. Running Jobs on Titan
- 9.1. Login vs. Service vs. Compute Nodes
- 9.2. Filesystems Available to Compute Nodes
- 9.3. Writing Batch Scripts
- 9.4. Submitting Batch Scripts
- 9.5. Interactive Batch Jobs
- 9.6. Common Batch Options to PBS
- 9.7. Batch Environment Variables
- 9.8. Modifying Batch Jobs
- 9.9. Monitoring Batch Jobs
- 9.10. Titan Batch Queues
- 9.11. Job Execution on Titan
- 9.11.1. Using the aprun command
- 9.11.2. XK7 CPU Description
- 9.11.3. Controlling MPI Task Layout Within a Physical Node
- 9.11.4. Controlling MPI Task Layout Across Many Physical Nodes
- 9.11.5. Controlling Thread Layout Within a Physical Node
- 9.12. Job Resource Accounting
- 9.13. Titan Scheduling Policy
- 10. Development Tools
- 10.1. GPU Accelerated Libraries
- 10.2. Accelerator Compiler Directives
- 10.3. Low-Level GPU Languages
- 11. Debugging and Optimizing Code on Titan
- 11.1. GPU Performance Tools
1. Jaguar to Titan Transition
Titan Has Replaced Jaguar
As the final step in the Titan upgrade, Jaguar has been permanently decommissioned, and users have no access to the old system. The new Cray XK7 (Titan) is now available for login. Please read on for important information about running on Titan.
Items to Note Before Running on Titan
Titan Availability Schedule
The Titan upgrade is progressing. The following table summarizes general availability of Titan over the next few months as we work to resolve two specific issues that are keeping us from releasing Titan into its final production form.
Start Date | Available Nodes | GPUs Enabled? |
---|---|---|
Feb 2nd | 9,716 | No |
Mid March | 8,972 | No |
Early April | 0 | No |
May | 18,688 (all) | Yes |
We thank you for your continued patience.
Connecting to the XK7
Jaguar -> Titan
The Jaguar and JaguarPF system names have been decommissioned. To access the XK7, log into titan.ccs.ornl.gov.
GPU Access
Checkout and testing of the accelerators (GPUs) is continuing, so the GPUs are not currently accessible to users.
Batch Job Submission
size -> nodes
Previously, the ‘size’ option was used to request cores, and batch jobs were forced to request cores in multiples of 16 to allocate entire nodes. The ‘size’ option should no longer be used to request resources. On Titan, the ‘nodes’ option should be used to request nodes and their resources.
For example, to request 4 nodes:
Jaguar (old) | #PBS -l size=64 |
---|---|
Titan (new) | #PBS -l nodes=4 |
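For reference, a complete batch script using the new option might look like the following minimal sketch (the project ID ABC123 and executable name are hypothetical):
#!/bin/bash
#PBS -A ABC123              # project to charge (hypothetical ID)
#PBS -l nodes=4             # request 4 nodes (4 x 16 = 64 cores)
#PBS -l walltime=1:00:00
cd /tmp/work/$USER          # compute nodes can only access Lustre work areas
aprun -n 64 ./a.out         # launch 64 MPI tasks across the 4 nodes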
Batch Environment Variables
$PBS_NNODES -> $PBS_NUM_NODES
Because of the size-to-nodes batch submission change, the PBS_NNODES environment variable is no longer set by the batch system. Instead, the variable PBS_NUM_NODES can be used to determine the number of requested nodes.
Note: PBS_NNODES is no longer available.
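For example, a batch script can compute the total number of MPI tasks from the requested node count (a sketch assuming 16 cores per node):
CORES_PER_NODE=16
NPROCS=$(( PBS_NUM_NODES * CORES_PER_NODE ))
aprun -n $NPROCS ./a.out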
Batch Job Charging
core-hours -> Titan-hours
Previously batch jobs were charged upon completion using core-hours
:
allocated cores * (end time – start time)
With Titan, the charging algorithm has changed to include the GPUs. The following is the algorithm used to calculate a job’s usage in Titan-hours:
(allocated nodes * 30) * (end time – start time)
Note: until the GPUs are made available to users, jobs are charged without the GPU contribution:
(allocated nodes * 16) * (end time – start time)
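As a worked example, a job that runs on 1,024 nodes for 2 hours once the GPUs are enabled is charged (1,024 * 30) * 2 = 61,440 Titan-hours.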
Scratch Purges Resume on January 23, 2013
The scratch filesystem purge, which has been disabled while Titan was not available to users, will resume at 7:30 AM on Wednesday, January 23. At that time, all files in User Work (scratch) areas in the widow0, widow1, widow2 and widow3 filesystems that have not been accessed/updated in the past (14) days will be eligible for deletion.
For more information, please see the OLCF Storage Policy Summary page.
Recompile Recommended
To include possible library changes made since Jaguar’s decommissioning, it is recommended that users recompile, or at least re-link, prior to running on Titan.
Changes to /tmp/work and /tmp/proj
In order to improve filesystem interaction and load distribution, the widow1 filesystem was split into two filesystems (widow0 and widow1). This change adds the 5 PB of disk space previously used for test and development work into the production Lustre pool. Splitting the filesystem in two also allows us to add an additional metadata server, which will help improve interaction with the Lustre filesystems. The widow0 and widow1 filesystems are available to all users through /lustre/widow[0-1]/scratch/$USER.
To further take advantage of the widow0 and widow1 disk and the additional metadata servers, approximately half of the /tmp/proj areas and /tmp/work links were redistributed among the new widow0 and widow1 filesystems on November 14. You can execute the locatescratch and locateproj utilities from any OLCF system to see if you are impacted by /tmp/work or /tmp/proj changes. If your /tmp/proj area or /tmp/work link changed, you may need to move data or update batch scripts.
More information on these Lustre filesystem changes can be found on the Spider Changes (2012) page.
2. Titan System Overview
With a theoretical peak performance of more than 20 petaflops, Titan, a Cray XK7 supercomputer located at the Oak Ridge Leadership Computing Facility (OLCF), gives computational scientists unprecedented resolution for studying a wide range of natural phenomena, from climate change and energy assurance to nanotechnology and nuclear energy.
Compute Partition
Titan contains 18,688 physical compute nodes, each with a processor, physical memory, and a connection to the Cray custom high-speed interconnect. Each compute node contains (1) 16-core 2.2GHz AMD Opteron™ 6274 (Interlagos) processor and (32) GB of RAM. Two nodes share (1) Gemini™ high-speed interconnect router. The resulting partition contains 299,008 traditional processor cores, and (598) TB of memory.
Specialized NVIDIA Accelerators
In addition to the Opteron CPU, all of Titan’s 18,688 physical compute nodes contain an NVIDIA Kepler™ accelerator (GPU).
External Login Nodes
Upon login, users are placed onto login nodes by default. Each Titan login node houses an 8-core AMD Opteron™ 6140-series CPU and (256) GB of RAM.
Network Topology
Nodes within the compute partition are connected in a three-dimensional torus. This provides a very scalable network with low latency and high bandwidth.
File Systems
The OLCF’s center-wide Lustre file system, named Spider, is available on Titan for computational work. With over 52,000 clients and (10) PB of disk space, it is the largest-scale Lustre file system in the world. A separate, NFS-based file system provides $HOME storage areas, and an HPSS-based file system provides Titan users with archival spaces.
Operating System
Titan employs the Cray Linux Environment as its OS. This consists of a full-featured version of Linux on the login nodes and a Compute Node Linux microkernel on the compute nodes. The microkernel is designed to minimize partition overhead, allowing scalable, low-latency global communications.
3. Requesting Access to OLCF Resources
Access to the computational resources of the Oak Ridge Leadership Computing Facility (OLCF) is limited to approved users via project allocations. There are different kinds of projects, and the type of project request determines the application and review procedure. Approved projects are granted an allocation of hours for a period of time on one or more systems.
Every user account at the OLCF must be associated with at least one allocation. Once an allocation has been approved and established, users can request to be added to the project allocation so they may run jobs against it.
3.1. Project Allocation Requests
The OLCF grants (3) different types of project allocations. The type of allocation you should request depends on a few different factors. The table below outlines the types of project allocations available at the OLCF and some general policies that apply to each:
 | INCITE | Director’s Discretion | ALCC |
---|---|---|---|
Allocations | Large | Small | Large |
Call for Proposals | Once per year | At any time | Once per year |
Closeout Report | Required | Required | Required |
Duration | 1 year | 1 year | 1 year |
Job Priority | High | Medium | High |
Quarterly Reports | Required | Required | Required |
 | Apply for INCITE | Apply for DD | Apply for ALCC |
Project Type Details
INCITE – The Novel Computational Impact on Theory and Experiment (INCITE) program invites proposals for large-scale, computationally intensive research projects to run at the OLCF. The INCITE program awards sizeable allocations (typically, millions of processor-hours per project) on some of the world’s most powerful supercomputers to address grand challenges in science and engineering. There is an annual call for INCITE proposals and awards are made on an annual basis. For more information or to apply for an INCITE project, please visit the DOE INCITE page.
ALCC – The ASCR Leadership Computing Challenge (ALCC) is open to scientists from the research community in national laboratories, academia, and industry. The ALCC program allocates computational resources at the OLCF for special situations of interest to the Department, with an emphasis on high-risk, high-payoff simulations directly related to the Department’s energy mission, such as advancing the clean energy agenda and understanding the Earth’s climate, for national emergencies, or for broadening the community of researchers capable of using leadership computing resources. For more information or to submit a proposal, please visit the DOE ALCC page.
DD – Director’s Discretion (DD) projects are dedicated to leadership computing preparation, INCITE and ALCC scaling, and application performance to maximize scientific application efficiency and productivity on leadership computing platforms. The OLCF Resource Utilization Council, as well as independent referees, review and approve all DD requests. Applications are accepted year round via the OLCF Director’s Discretion Project Application page.
After Project Approval
Once a project is approved, an OLCF Accounts Manager will notify the PI, outlining the steps (listed below) necessary to create the project. If you have any questions, please feel free to contact the OLCF Accounts Team at accounts@ccs.ornl.gov.
Steps for Activating a Project Once the Allocation is Approved
- A signed Principal Investigator (PI) Agreement must be submitted with the project application.
- Export Control: The project request will be reviewed by ORNL Export Control to determine whether sensitive or proprietary data will be generated or used. The results of this review will be forwarded to the PI. If the project request is deemed sensitive and/or proprietary, the OLCF Security Team will schedule a conference call with the PI to discuss the data protection needs.
- ORNL Personnel Access System (PAS): All PIs are required to be entered into the ORNL PAS system. An OLCF Accounts Manager will send the PI a PAS invitation to submit all the pertinent information. Please note that processing a PAS request may take 15 or more days.
- User Agreement/Appendix A or Subcontract: A User Agreement/Appendix A or Subcontract must be executed between UT-Battelle and the PI’s institution. If our records indicate this requirement has not been met, all necessary documents will be provided to the applicant by an OLCF Accounts Manager.
Upon completion of the above steps, the PI will be notified that the project has been created and provided with the Project ID and system allocation. At this time, project participants may apply for an account via the OLCF User Account Application page.
3.2. User Account Requests
Users can apply for an account on existing projects. There are several steps in applying for an account; OLCF User Assistance can help you through the process. If you have any questions, please feel free to contact the Accounts Team at accounts@ccs.ornl.gov.
Steps to Obtain a User Account
- Apply for an account using the Account Request Form.
- The principal investigator (PI) of the project must approve your account and system access. The Accounts Team will contact the PI for this approval.
- If you have or will receive an RSA SecurID from our facility, additional paperwork will be sent to you via email to complete for identity proofing.
- Foreign national participants will be sent an Oak Ridge National Lab (ORNL) Personnel Access System (PAS) request specific for the facility and cyber-only access. After receiving your response, it takes between (2) to (5) weeks for approval.
- Fully-executed User Agreements with each institution having participants are required. If our records indicate your institution needs to sign either a User Agreement and/or Appendix A, the form(s) along with instructions will be sent via email.
- If you are processing sensitive or proprietary data, additional paperwork is required and will be sent to you.
Your account will be created and you will be notified via email when all of the steps above are complete. To begin the process, visit the OLCF User Account Application page.
4. OLCF Help and Policies
The OLCF provides many tools to assist users, including direct hands-on assistance by trained consultants. Means of assistance at the OLCF include:
- The OLCF User Assistance Center (UAC), where consultants answer your questions directly via email or phone.
- Various OLCF communications, which provide status updates of relevance to end-users.
- The My OLCF site, which provides a mechanism for viewing project allocation reports.
- The OLCF Policy Guide, which details accepted use of our computational resources.
- Upcoming and historical OLCF Training Events, both in-person and web-based, that cover topics of interest to end-users.
4.1. User Assistance Center
The OLCF User Assistance Center (UAC) provides direct support to users of our computational resources.
Hours
The center’s normal support hours are 9am EST to 5pm EST Monday through Friday, exclusive of holidays.
Contact Us
Email: | help@olcf.ornl.gov |
Phone: | 865-241-6536 |
Fax: | 865-241-4011 |
Address: | 1 Bethel Valley Road, Oak Ridge, TN 37831 |
The OLCF UAC is located at the Oak Ridge National Laboratory (ORNL) in Building 5600, Room C103.
After Hours
Outside of normal business hours, calls are directed to the ORNL Computer Operations staff. If you require immediate assistance, you may contact them at the phone number listed above. If your request is not urgent, you may send an email to help@olcf.ornl.gov, where it will be answered by an OLCF User Assistance member the next business day.
Ticket Submission Webform
In lieu of sending email, you can also use the Ticket Submission Web Form to submit a request directly to OLCF User Assistance.
4.2. Communications to Users
The OLCF provides users with several ways of staying informed.
OLCF Announcements Mailing Lists
These mailing lists provide users with email messages of general interest (system upgrades, long-term outages, etc.). Since the mailing frequency is low and the information sent is important to all users, users are automatically subscribed to these lists as applicable when an account is set up.
OLCF “Notice” Mailing Lists
The OLCF also provides opt-in email lists that deliver automated notices about the status of OLCF systems as well as other notable system events. Since the mailing frequency of these lists is high, they are offered on an opt-in basis. More information on the lists can be found at the OLCF Notifications List page.
Weekly Update
Each week, typically on Friday afternoon, an email announcing the next week’s scheduled outages is sent to all users. This message also includes meeting announcements and other items of interest to all OLCF users. If you are an OLCF user but are not receiving this weekly message, please contact the OLCF User Assistance Center.
System Status Pages
The OLCF Main Support page shows the current up/down status of selected OLCF systems at the top.
Mobile Apps
The OLCF StatusApp for iOS and the OLCF StatusApp for Android are available in the Apple App Store and Google Play Store, respectively. These apps are free to download and report real-time system statuses and other general information.
Twitter
The OLCF posts messages of interest on the OLCF Twitter Feed. We also post tweets specific to system outages on the OLCF Status Twitter Feed.
Message of the Day
In addition to other methods of notification, the system “Message of the Day” (MOTD) that is echoed upon login shows recent system outages. Important announcements are also posted to the MOTD. Users are encouraged to take a look at the MOTD upon login to see if there are any important notices.
4.3. My OLCF Site
To assist users in managing project allocations, we provide end-users with My OLCF, a web application with valuable information about OLCF projects and allocations on a per-user basis. Users must log in to the site with their OLCF username and SecurID fob.
Detailed metrics for users and projects can be found in each project’s usage section:
- YTD usage by system, subproject, and project member
- Monthly usage by system, subproject, and project member
- YTD usage by job size groupings for each system, subproject, and project member
- Weekly usage by job size groupings for each system, and subproject
- Batch system priorities by project and subproject
- Project members
4.4. Special Requests and Policy Exemptions
Users can request policy exemptions by submitting the appropriate web form available on the OLCF Documents and Forms page. Special request forms allow a user to:
- Request software installations
- Request relaxed queue limits for a job
- Request a system reservation
- Request a disk quota increase
- Request a User Work area purge exemption
Special requests are reviewed weekly and approved or denied by management via the OLCF Resource Utilization Council.
4.5. OLCF Acknowledgement
Users should acknowledge the OLCF in all publications and presentations that speak to work performed on OLCF resources.
5. Accessing OLCF Systems
This section covers the basic procedures for accessing OLCF computational resources.
To avoid risks associated with using plain-text communication, the only supported remote client on OLCF systems is a secure shell (SSH) client, which encrypts the entire session between OLCF systems and the client system.
For UNIX-based SSH clients, the following line should be in either the default ssh_config file or your $HOME/.ssh/config file:
PreferredAuthentications keyboard-interactive,password
The line may also contain other authentication methods, but keyboard-interactive must be included.
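For example, a $HOME/.ssh/config entry scoped to OLCF hosts might look like the following sketch (the host pattern is illustrative):
Host *.ccs.ornl.gov
    PreferredAuthentications keyboard-interactive,password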
SSH clients are also available for Windows-based systems, such as SecureCRT published by Van Dyke Software. For recent SecureCRT versions, the preferred authentications change above can be made through the “connection properties” menu.
5.1. OLCF System Hostnames
Each OLCF system has a single, designated hostname for general user-initiated connections. Sometimes this is a load-balancing mechanism that will send users to other hosts as needed. In any case, the designated OLCF hostnames for general user connections are as follows:
System Name | Hostname | RSA fingerprint |
---|---|---|
Titan | titan.ccs.ornl.gov | -- |
Lens | lens.ccs.ornl.gov | cc:6e:ef:84:7e:7c:dc:72:71:7b:76:7f:f3:46:57:2b |
Everest | everest.ccs.ornl.gov | cc:6e:ef:84:7e:7c:dc:72:71:7b:76:7f:f3:46:57:2b |
Smoky | smoky.ccs.ornl.gov | e3:88:b9:ba:fe:3a:fd:99:00:24:fc:e6:9d:5c:69:2b |
Sith | sith.ccs.ornl.gov | 28:63:5e:41:32:39:c2:ec:9b:63:e0:86:16:2f:e4:bd |
Data Transfer Nodes | dtn.ccs.ornl.gov | 50:dc:59:7b:e1:7c:ad:b2:30:55:9c:fa:fb:e8:6e:55 |
Home (machine) | home.ccs.ornl.gov | 12:9b:10:f7:b9:c7:1b:a2:b0:52:5e:13:e2:b9:b2:8c |
For example, to connect to Titan from a UNIX-based system, use the following:
$ ssh userid@titan.ccs.ornl.gov
5.2. General-Purpose Systems
After a user account has been approved and created, the requesting user will be sent an email listing the system(s) to which the user has requested and been given access. In addition to the system(s) listed in the email, all users also have access to the following general-purpose systems:
home.ccs.ornl.gov
Home is a general-purpose system that can be used to log into other OLCF systems that are not directly accessible from outside the OLCF network. For example, running the screen or tmux utility is one common use of Home. Compiling, data transfer, and long-running or memory-intensive tasks should never be performed on Home. More information can be found on the Home Login Host page.
dtn.ccs.ornl.gov
The Data Transfer Nodes are hosts specifically designed to provide optimized data transfer between OLCF systems and systems outside of the OLCF network. More information can be found on the Employing Data Transfer Nodes page.
HPSS
The High Performance Storage System (HPSS) provides tape storage for large amounts of data created on OLCF systems. The HPSS can be accessed from any OLCF system through the hsi utility. More information can be found on the HPSS page.
5.3. X11 Forwarding
Automatic forwarding of the X11 display to a remote computer is possible with the use of SSH and a local X server. To set up automatic X11 forwarding within SSH, you can do (1) of the following:
- Invoke ssh on the command line with $ ssh -X hostname. Note that use of the -x option (lowercase) will disable X11 forwarding.
- Edit (or create) your $HOME/.ssh/config file to include the following line: ForwardX11 yes
All X11 data will go through an encrypted channel. The $DISPLAY environment variable set by SSH will point to the remote machine with a port number greater than zero. This is normal, and happens because SSH creates a proxy X server on the remote machine for forwarding the connections over an encrypted channel. The connection to the real X server will be made from the local machine.
Note: do not manually set the $DISPLAY environment variable for X11 forwarding; a non-encrypted channel may be used in this case.
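A quick way to verify that forwarding is working, assuming an X client such as xclock is available on the remote system:
$ ssh -X userid@titan.ccs.ornl.gov
$ echo $DISPLAY      # e.g. localhost:10.0; the exact value will vary
$ xclock             # should open a window on your local X server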
5.4. RSA Key Fingerprints
Occasionally, you may receive an error message upon logging in to a system such as the following:
@@ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that the RSA host key has just been changed.
This can be a result of normal system maintenance that results in a changed RSA public key, or could be an actual security incident. If the RSA fingerprint displayed by your SSH client does not match the OLCF-authorized RSA fingerprint for the machine you are accessing, do not continue authentication; instead, contact help@olcf.ornl.gov.
5.5. Authenticating to OLCF Systems
All OLCF systems currently employ two-factor authentication only. To login to OLCF systems, an RSA SecurID® key fob is required.
Activating a new SecurID® fob
- Initiate an SSH connection to home.ccs.ornl.gov.
- When prompted for a PASSCODE, enter the 6-digit code shown on the fob.
- You will be asked if you are ready to set your PIN. Answer with “Y”.
- You will be prompted to enter a PIN. Enter a (4) to (6) digit number you can remember. You will then be prompted to re-enter your PIN.
- You will then be prompted to wait until the next code appears on your fob and to enter your PASSCODE. When the (6) digits on your fob change, enter your PIN digits followed immediately by the new (6) digits displayed on your fob. Note that any set of (6) digits on the fob can only be “used” once.
- Your PIN is now set, and your fob is activated and ready for use.
Using a SecurID® fob
When prompted for your PASSCODE, enter your PIN digits followed immediately by the (6) digits shown on your SecurID® fob. For example, if your PIN is 1234 and the (6) digits on the fob are 000987, enter 1234000987 when you are prompted for a PASSCODE.
6. Data Management
OLCF users have many options for data storage. Each user has a series of user-affiliated storage spaces, and each project has a series of project-affiliated storage spaces where data can be shared for collaboration between users. The storage areas are mounted across all OLCF systems, making your data available to you from multiple locations.
The storage area to use at any given time depends upon the activities being carried out. Both users and projects are provided with three distinct types of storage areas: Home areas, Work areas, and Archive areas.
User Home areas (directories) are provided on a Network File System (NFS), User Work directories on a Lustre file system, and User Archive directories on the High Performance Storage System (HPSS). User storage areas are intended to house user-specific files.
Similarly, projects have a Project Home area on NFS, a Project Work area on Lustre, and a Project Archive space on HPSS. Project storage areas are intended to house project-centric files that need to be accessed by multiple users.
(Figure: overview of the user-centric and project-centric storage areas.)
6.1. User-Centric Data Storage
Users are provided with several storage areas, each of which serve different purposes. These areas are intended for storage of data for a particular user and not for storage of project data.
The following table summarizes user-centric storage areas available on OLCF resources and lists relevant polices.
Area | Nickname | Path | Type | Quota | Backups | Purge | Retention |
---|---|---|---|---|---|---|---|
User Home | – | /ccs/home/$USER | NFS | 5 GB | Yes | Not purged | 1 month after account deactivation |
User Work | “Spider” | /tmp/work/$USER | Lustre | None | No | Files > 14 days old subject to deletion | Not retained |
User Archive | “HPSS” | /home/$USER | HPSS | 2 TB (or 2k files) | No | Not purged | 3 months after account deactivation |
6.1.1. User Home Directories (NFS)
Each user is provided a home directory to store frequently used items such as source code, binaries, and scripts.
User Home Path
Home directories are located in a Network File Service (NFS) that is accessible from all OLCF resources as /ccs/home/$USER.
The environment variable $HOME will always point to your current home directory. It is recommended, where possible, that you use this variable to reference your home directory. In cases in which using $HOME is not feasible, it is recommended that you use /ccs/home/$USER.
Users should note that since this is an NFS-mounted filesystem, its performance will not be as high as other filesystems.
User Home Quotas
Quotas are enforced on user home directories. To request an increased quota, contact the OLCF User Assistance Center. To view your current quota and usage, use the quota command:
$ quota -Qs
Disk quotas for user usrid (uid 12345):
Filesystem                          blocks  quota  limit  grace  files  quota  limit  grace
nccsfiler1a.ccs.ornl.gov:/vol/home   4858M  5000M  5000M         29379  4295m  4295m
User Home Backups
If you accidentally delete files from your home directory, you may be able to retrieve them. Online backups are performed hourly and nightly, with the most recent (6) hourly and (2) nightly backups available. These are available in /ccs/home/.snapshot/hourly.* and /ccs/home/.snapshot/nightly.*.
It is possible the files that were deleted will be available in one of those directories. Note that in the directory name, lower numbers represent more recent backups. Thus, /ccs/home/.snapshot/hourly.0 is a more recent backup than /ccs/home/.snapshot/hourly.1.
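For example, to restore a file deleted earlier today from the most recent hourly snapshot (assuming the snapshot tree mirrors /ccs/home; the filename is hypothetical):
$ cp /ccs/home/.snapshot/hourly.0/$USER/myfile.c $HOME/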
A ~/.yesterday link exists in many users' home directories, pointing to the most recent hourly backup directory.
User Home Permissions
The default permissions for user home directories are 0750 (full access for the user; read and execute for the group). Users have the ability to change permissions on their home directories, although it is recommended that permissions be kept as restrictive as possible without interfering with your work.
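For example, to make your home directory accessible only to you (a more restrictive setting than the default):
$ chmod 0700 /ccs/home/$USER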
Special User Website Directory
User Home spaces may contain a directory named /www. If this directory exists, and if appropriate permissions exist, files in that directory will be accessible via the World Wide Web at http://users.nccs.gov/~user (where user is your userid).
6.1.2. User Work Directories (Lustre)
"User Work" storage areas are available across each OLCF system for fast access to job-related temporary files and for staging large files to and from archival storage.
The OLCF center-wide file system is referred to as Spider. Spider is made up of multiple Lustre filesystems.
User Work Path
User Work directories can be accessed from each system via /tmp/work/$USER. This path is available on all systems and references a space on the center-wide shared Lustre filesystem, Spider.
User Work Backup
User Work directories are scratch areas intended for temporary storage of data (either while a batch job is running or while a file is being staged from an off-site location to archival storage). As such, files in user work directories are not backed up. Users are responsible for backing up these files, either to archival storage (HPSS) or to an off-site location.
User Work Purge
To ensure adequate work space is available for user jobs, a script that finds and deletes old files runs on the system nightly. This script deletes files that have not been accessed or modified in more than 14 days. If the file system does not have sufficient free space after the script runs, then the script will run again with a threshold of fewer than 14 days. Thus, it is critical to archive files from the scratch area as soon as possible.
User Work Permissions
By default, each user's /tmp/work/$USER directory permissions are set to 0700. The owning user may change the default permissions of the temporary work directory, but they will be reset hourly for security purposes.
To request a permanent permission change, please contact the OLCF User Assistance Center. Only the top-level directory permissions (i.e. those of /tmp/work/$USER) are overwritten. Users may change permissions of sub-directories and of files within their work directory, and these will not be overwritten.
The /tmp directory
The default path to user work directories, /tmp/work/$USER, is actually a symbolic link. This is important to note because the actual location of the directory is not in the /tmp filesystem but rather in another filesystem, such as /lustre/widow1/scratch/$USER. Thus, the /tmp filesystem itself is not intended as a storage location for temporary files. This filesystem is relatively small and is used by the system for tasks such as compiling and editing. If the directory fills up, it can cause system problems.
Note: do not create temporary files in the /tmp filesystem itself. Rather, cd to /tmp/work/$USER to access your User Work directory and create temporary files there.
6.1.3. User Archive Directories (HPSS)
Users are also provided with user-centric archival space on the High Performance Storage System (HPSS).
User archive areas on HPSS are intended for storage of data not immediately needed in either User Home directories (NFS) or User Work directories (Lustre). User Archive areas also serve as a location for users to store backup copies of user files. User Archive directories should not be used to store project-related data. Rather, Project Archive directories should be used for project data.
User Archive Path
User archive directories are located at /home/$USER.
User Archive Access
User archive directories may be accessed only via specialized tools called HSI and HTAR. For more information on using HSI or HTAR, see the HSI and HTAR page.
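A minimal sketch of storing and retrieving a file with hsi (the filename is hypothetical; see the HSI and HTAR page for full usage):
$ hsi put bigfile.dat      # store a copy in your User Archive area
$ hsi get bigfile.dat      # retrieve it later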
User Archive Accounting
Each file and directory on HPSS is associated with an HPSS storage allocation. For information on storage allocation, please visit the Understanding HPSS Storage Allocations page.
6.2. Project-Centric Data Storage
Projects are provided with several storage areas for the data they need. Project directories provide members of a project with a common place to store code, data files, documentation, and other files related to their project. While this information could be stored in one or more user directories, storing in a project directory provides a common location to gather all files.
The following table summarizes project-centric storage areas available on OLCF resources and lists relevant policies.
Area | Nickname | Path | Type | Quota | Backups | Purge | Retention |
---|---|---|---|---|---|---|---|
Project Home | -- | /ccs/proj/[projectid] | NFS | 50 GB | Yes | Not purged | 1 month after project deactivation |
Project Work | "Spider" | /tmp/proj/[projectid] | Lustre | 2 TB | No | Not purged | Not retained |
Project Archive | "HPSS" | /proj/[projectid] | HPSS | 100 TB (or 100k files) | No | Not purged | 3 months after project deactivation |
6.2.1. Project Home Directories (NFS)
Projects are provided with a Project Home storage area in the Network File Service (NFS) mounted filesystem. This area is intended for storage of data, code, and other files that are of interest to all members of a project. Since Project Home is an NFS-mounted filesystem, its performance will not be as high as other filesystems.
Project Home Path
The Project Home area is accessible at /ccs/proj/abc123 (where abc123 is your project ID).
Project Home Quotas
Quotas are enforced on project home directories. The current limit is shown on the Storage Policy page. To request an increased quota, contact the User Assistance Center.
Project Home Backups
If you accidentally delete one or more files from your project home directory, you may be able to retrieve them. Online backups are performed hourly and nightly, with the most recent (6) hourly and (2) nightly backups available. These are available in /ccs/proj/.snapshot/hourly.* and /ccs/proj/.snapshot/nightly.*. It is possible the files that were deleted will be available in one of those directories.
Note that in the directory name, lower numbers represent more recent backups. Thus, /ccs/proj/.snapshot/hourly.0 is a more recent backup than /ccs/proj/.snapshot/hourly.1.
Project Home Permissions
The default permissions for project home directories are 0770 (full access for the user and group). The directory is owned by root, and the group is the project's group. All members of a project should also be members of that project's group. For example, all members of project "ABC123" should be members of the "abc123" UNIX group.
6.2.2. Project Work Directories (Lustre)
To provide projects with a high-performance storage area that is accessible to batch jobs, projects are given Project Work areas in a Lustre filesystem.
The OLCF center-wide file system is referred to as Spider. Spider comprises multiple Lustre filesystems.
Project Work Path
Project Work directories can be accessed from each system via /tmp/proj/pjt000 (where pjt000 is your project ID). This path is available on all systems and references a space on the center-wide shared Lustre filesystem, Spider.
Project Work Backup
As with user work directories, Project Work directories are not backed up. Project members are responsible for backing up these files, either to Project Archive areas (HPSS) or to an off-site location.
Project Work Permissions
Project Work directory permissions are set to 0770, with root as the owner and the project group as the group. Changes to the default permissions of the Project Work directory are allowed, but will be reset hourly for security purposes.
To request a permanent permission change, please contact the OLCF User Assistance Center. Only the top-level directory permissions (i.e. those of /tmp/proj/pjt000) are overwritten. Users may change permissions of sub-directories and of files within the Project Work directory, and these will not be overwritten.
6.2.3. Project Archive Directories (HPSS)
Projects are also allocated project-specific archival space on the High Performance Storage System (HPSS). The default quota is shown on the Storage Policy page. If a higher quota is needed, contact the User Assistance Center.
The Project Archive space on HPSS is intended for storage of data not immediately needed in either Project Home (NFS) or Project Work (Lustre) areas, and serves as a location for backup copies of project-related files.
Project Archive Path
The project archive directories are located at /proj/pjt000 (where pjt000 is your Project ID).
Project Archive Access
Project Archive directories may only be accessed via utilities called HSI and HTAR. For more information on using HSI or HTAR, see the HSI and HTAR page.
Project Archive Accounting
Each file and directory on HPSS is associated with an HPSS storage allocation. For information on HPSS storage allocations, please visit the Understanding HPSS Storage Allocations page.
6.3. Transferring Data
OLCF users are provided with several options for transferring data among systems at the OLCF as well as between the OLCF and other sites.
Data Transfer Nodes
Dedicated data transfer nodes are provided to OLCF users and are accessible via the load-balancing hostname dtn.ccs.ornl.gov. The nodes have been tuned specifically for wide-area data transfers, and also perform well on the local area. They are recommended for data transfers, as they will, in most cases, improve transfer speed and help decrease load on computational systems' login nodes. More information on these nodes can be found on the Data Transfer Nodes page.
Local Transfers
The OLCF provides a shared-storage environment, so transferring data between our machines is largely unnecessary. However, we provide tools both to move large amounts of data between scratch and archival storage and from one scratch area to another.
SPDCP
spdcp is a parallel, Lustre-aware copy tool. The tool can be used to copy large datasets between Lustre filesystems. More information can be found on the SPDCP page.
HSI and HTAR
Access to the HPSS is accomplished through the Hierarchical Storage Interface (hsi) and HPSS Tar (htar) utilities. More information can be found on the HSI and HTAR page.
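htar uses tar-like syntax. A sketch of bundling a directory into a single archive stored on HPSS (names are hypothetical):
$ htar -cvf mydir.tar mydir/     # create the archive directly in HPSS
$ htar -tvf mydir.tar            # list the archive's contents
$ htar -xvf mydir.tar            # extract it back to disk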
Remote Transfers
The OLCF provides several tools for moving data between computing centers or between OLCF machines and local user workstations. The following tools are primarily designed for transfers over the internet, and are not recommended for transferring data between OLCF machines.
The following table summarizes options for remote data transfers:
 | GridFTP + GridCert | GridFTP + SSH | SFTP/SCP | BBCP |
---|---|---|---|---|
Data Security | insecure (default) / secure (w/ configuration) | insecure (default) / secure (w/ configuration) | secure | insecure (unsuited for sensitive projects) |
Authentication | GridCert | Passcode | Passcode | Passcode |
Transfer speed | Fast | Fast | Slow | Fast |
Required Infrastructure | GridFTP server at remote site + user DOE GridCert | GridFTP server at remote site | Comes standard with SSH install | BBCP installed at remote site |
GridFTP
GridFTP is a high-performance data transfer protocol based on FTP and optimized for high-bandwidth wide-area networks. It is typically used to move large amounts of data between the OLCF and other major centers. More information can be found on the GridFTP page.
SFTP and SCP
The SSH-based SFTP and SCP utilities can be used to transfer files to and from OLCF systems. Because these utilities can be slow, we recommend using them only to transfer limited numbers of small files. More information on these utilities can be found on the SFTP and SCP page.
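For example, to pull a small file from an OLCF work area to a local workstation through the Data Transfer Nodes (the paths are illustrative):
$ scp userid@dtn.ccs.ornl.gov:/tmp/work/userid/results.txt .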
BBCP
For larger files, the multi-streaming transfer utility BBCP is recommended. The BBCP utility is capable of breaking up your transfer into multiple simultaneously transferring streams, thereby transferring data much faster than single-streaming utilities such as SCP and SFTP. Note: BBCP is not secure, but is much faster than SFTP. More information can be found on the BBCP page.
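A sketch of a multi-stream bbcp transfer (the -s option sets the number of streams; the paths and stream count are illustrative):
$ bbcp -s 16 bigfile.dat userid@dtn.ccs.ornl.gov:/tmp/work/userid/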
6.4. Storage Policy Summary
Users must agree to the full storage policy as part of their account application. The current policy is summarized below. This policy applies to all projects with allocations on OLCF systems, regardless of the machine on which they have an allocation.
Storage Policy Summary Table
Area | The general name of storage area/directory discussed in the storage policy. |
---|---|
Nickname | The branded name given to some storage areas or file systems. |
Path | The path (symlink) to the storage area's directory. |
Type | The underlying software technology supporting the storage area. |
Quota | The limits placed on total number of bytes and/or files in the storage area. |
Backups | States if the data is automatically duplicated for disaster recovery purposes. |
Purged | States when data will be marked as eligible for permanent deletion. |
Retention | States when data will be marked as eligible for permanent deletion after an account/project is deactivated. |
 | Area | Nickname | Path | Type | Quota | Backups | Purged | Retention |
---|---|---|---|---|---|---|---|---|
User | Home | -- | /ccs/home/$USER | NFS | 5 GB | Yes | Not purged | 1 month |
User | Work | "Spider" | /tmp/work/$USER | Lustre | None | No | 14 days | Not retained |
User | Archive | "HPSS" | /home/$USER | HPSS | 2 TB or 2k files | No | Not purged | 3 months |
Project | Home | -- | /ccs/proj/[projid] | NFS | 50 GB | Yes | Not purged | 1 month |
Project | Work | "Spider" | /tmp/proj/[projid] | Lustre | 2 TB | No | Not purged | 1 month |
Project | Archive | "HPSS" | /proj/[projid] | HPSS | 100 TB or 100k files | No | Not purged | 3 months |
Storage Policy Implementation
User Home / Project Home Quotas (NFS)
User home directories and project home directories have hard quotas enabled on them to prevent users from exceeding their specified limits.
User Archive / Project Archive Quotas (HPSS)
A noticeable change from past policy deals with archival storage. Soft quotas are used on user archive and project archive directories. When we notice usage exceeding the specified limits, you will receive an email asking you to clean up the appropriate storage area to bring it below your limits.
We are strongly encouraging users to associate their files with projects as opposed to their individual user account. This helps us understand project needs and usage. Information on modifying an account associated with a file or directory can be found on the following page.
We recognize that not everyone will fall under the default limits. We will work with users and projects to negotiate additional space in some cases. If you feel your project needs additional space, please direct all requests and questions to the User Assistance Center.
Check Current Archive Usage
You can check your current overall archive (HPSS) usage by running the showusage command on any OLCF system:
$ showusage -s hpss
HPSS Storage in GB:
                                       Project Totals     user123
Project                                Storage            Storage
__________________________|__________________________|______________
user123                   |                   243.30 |        243.30
legacy                    |                          |          7.01
abc123                    |                     5.86 |          3.01
Every user gets a default overhead account: a "project" with the same name as your username. This is how your user archive storage usage is calculated.
The project total is shown in the first Storage column, and your usage within that project is shown in the second Storage column.
Files stored on the system prior to March 15, 2008 are associated with a legacy project. Any file associated with the legacy project is not used in the quota calculation. Users are not able to associate new files to the legacy account. We also strongly encourage users to associate legacy files with projects. More information on HPSS usage and accounting can be found on the Understanding HPSS Storage Allocations page.
Please direct all requests and questions to the OLCF User Assistance Center.
7. Software and Shell Environments
The OLCF provides hundreds of pre-installed software packages and scientific libraries for your use, in addition to taking software requests. Due to the large number of software packages and versions on OLCF resources, environment management tools are needed to handle changes to your shell environment. This chapter discusses how to manage your shell and software environment on OLCF systems.
7.1. Default Shell
Users request their preferred shell on their initial user account request form. The default shell is enforced across all OLCF resources. The OLCF currently supports the following shells:
- bash
- tcsh
- csh
- ksh
Please contact the OLCF User Assistance Center to request a different default shell.
7.2. Using Modules
The modules software package allows you to dynamically modify your user environment by using pre-written modulefiles.
Modules Overview
Each modulefile contains the information needed to configure the shell for an application. After the modules software package is initialized, the environment can be modified on a per-module basis using the module command, which interprets a modulefile.
Typically, a modulefile instructs the module command to alter or set shell environment variables such as PATH or MANPATH. Modulefiles can be shared by many users on a system, and users can have their own personal collection to supplement and/or replace the shared modulefiles.
As a user, you can add and remove modulefiles from your current shell environment. The environment changes performed by a modulefile can also be viewed by using the module command.
More information on modules can be found by running man module on OLCF systems.
Summary of Module Commands
Command | Description |
---|---|
module list | Lists modules currently loaded in a user’s environment |
module avail | Lists all available modules on a system in condensed format |
module avail -l | Lists all available modules on a system in long format |
module display | Shows environment changes that will be made by loading a given module |
module load | Loads a module |
module unload | Unloads a module |
module help | Shows help for a module |
module swap | Swaps a currently loaded module for an unloaded module |
Re-initializing the Module Command
Modules software functionality is highly dependent upon the shell environment being used. Sometimes when switching between shells, modules must be re-initialized. For example, you might see an error such as the following:
$ module list
-bash: module: command not found
To fix this, just re-initialize your modules environment:
$ source $MODULESHOME/init/myshell
Where myshell is the name of the shell you are using and need to re-initialize.
Examples of Module Use
To show all available modules on a system:
$ module avail
------------ /opt/cray/modulefiles ------------
atp/1.3.0                           netcdf/4.1.3                        tpsl/1.0.01
atp/1.4.0(default)                  netcdf-hdf5parallel/4.1.2(default)  tpsl/1.1.01(default)
atp/1.4.1                           netcdf-hdf5parallel/4.1.3           trilinos/10.6.4.0(default)
...
To search for availability of a module by name:
$ module avail -l netcdf
- Package -----------------------------+- Versions -+- Last mod. ------
/opt/modulefiles:
netcdf/3.6.2                                         2009/09/29 16:38:25
/sw/xk6/modulefiles:
netcdf/3.6.2                                         2011/12/09 18:07:31
netcdf/4.1.3                             default     2011/12/12 20:43:37
...
To show the modulefiles currently in use (loaded) by the user:
$ module list
Currently Loaded Modulefiles:
  1) modules/3.2.6.6                           12) pmi/3.0.0-1.0000.8661.28.2807.gem
  2) xe-sysroot/4.0.30.securitypatch.20110928  13) ugni/2.3-1.0400.3912.4.29.gem
  3) xtpe-network-gemini                       14) udreg/2.3.1-1.0400.3911.5.6.gem
To show detailed help info on a modulefile:
$ module help netcdf/4.1.3
------------ Module Specific Help for 'netcdf/4.1.3' ------------
Purpose:
  New version of hdf5 1.8.7 and netcdf 4.1.3
Product and OS Dependencies:
  hdf5_netcdf 2.1 requires SLES 11 systems and was tested on Cray XE and
...
To show what a modulefile will do to the shell environment if loaded:
$ module display netcdf/4.1.3
------------ /opt/cray/modulefiles/netcdf/4.1.3:
setenv        CRAY_NETCDF_VERSION 4.1.3
prepend-path  PATH /opt/cray/netcdf/4.1.3/gnu/45/bin
...
To load or unload a modulefile:
$ module load netcdf/4.1.3
$ module unload netcdf/4.1.3
To unload a modulefile and load a different one:
$ module swap netcdf/4.1.3 netcdf/4.1.2
7.3. Installed Software
The OLCF provides hundreds of pre-installed software packages and scientific libraries for your use, in addition to taking software installation requests.
See the software section for complete details on existing installs.
To request a new software install, use the software installation request form.
8. Compiling On Titan
Compiling code on Titan (and other Cray machines) differs from compiling code for commodity or Beowulf-style HPC Linux clusters. Among the most prominent differences:
- Cray provides a sophisticated set of compiler wrappers to ensure that the compile environment is set up correctly. Their use is highly encouraged.
- In general, linking/using shared object libraries on compute partitions is not supported.
- Cray systems include many different types of nodes, so some compiles are, in fact, cross-compiles.
Available Compilers
The following compilers are available on Titan:
- PGI, the Portland Group Compiler Suite (default)
- GCC, the GNU Compiler Collection
- CCE, the Cray Compiling Environment
- Intel, Intel Composer XE
8.1. Cray Compiler Wrappers
Cray provides a number of compiler wrappers that substitute for the traditional compiler invocation commands.
The wrappers call the appropriate compiler, add the appropriate header files, and link against the appropriate libraries based on the currently loaded programming environment module. To build codes for the compute nodes, you should invoke the Cray wrappers via:
- cc: to use the C compiler
- CC: to use the C++ compiler
- ftn: to use the FORTRAN 90 compiler
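For example, to build an MPI code for the compute nodes, invoke the wrapper directly; it automatically adds the appropriate MPI headers and libraries (the source file names are hypothetical):
$ cc mpi_hello.c -o mpi_hello.x       # C
$ ftn mpi_hello.f90 -o mpi_hello.x    # Fortran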
8.2. Compiling and Node Types
Titan comprises different types of nodes:
- Login nodes running traditional Linux
- Service nodes running traditional Linux
- Compute nodes running the Cray Node Linux (CNL) microkernel
The type of work you are performing will dictate the type of node for which you build your code.
Compiling for Compute Nodes
Titan compute nodes are the nodes that carry out the vast majority of computation on the system. Compute nodes run the CNL microkernel, which is markedly different from the OS running on the login and service nodes. Most code that runs on Titan will be built targeting the compute nodes.
All parallel codes should run on the compute nodes. Compute nodes are accessible only by invoking aprun within a batch job. To build codes for the compute nodes, you should use the Cray compiler wrappers.
Note: it is highly recommended that the cc, CC, and ftn compiler wrappers be used when compiling and linking source code for use on Titan compute nodes.
On Titan, and Cray machines in general, compiled executables to be run on compute nodes must always be linked statically.
Compiling for Login or Service Nodes
When you log into Titan you are placed on a login node. When you submit a job for execution, your job script is initially launched on one of a small number of shared service nodes. All tasks not launched through aprun will run on the service node. Users should note that there are a small number of these login and service nodes, and they are shared by all users. Because of this, long-running or memory-intensive work should not be performed on login nodes or service nodes.
When using cc, CC, or ftn, your code will be built for the compute nodes by default. If you wish to build code for the Titan login nodes or service nodes, you must do one of the following:
- Run module swap xtpe-interlagos xtpe-target-native (this is the preferred option, especially when trying to use a configure script).
- Add the -target=native flag to your cc, CC, or ftn command.
- Call the underlying compilers directly (e.g. pgf90, ifort, gcc).
XK7 Service/Compute Node Incompatibilities
On the Cray XK7 architecture, service nodes differ greatly from the compute nodes. The difference between XK7 compute and service nodes may cause cross compiling issues that did not exist on Cray XT5 systems and prior.
For XK7, login and service nodes use AMD's Istanbul-based processor, while compute nodes use the newer Interlagos-based processor. Interlagos-based processors include instructions not found on Istanbul-based processors, so executables compiled for the compute nodes will not run on the login nodes nor service nodes; typically crashing with an illegal instruction error. Additionally, codes compiled specifically for the login or service nodes will not run optimally on the compute nodes.
Optimization Target Warning
Because of the difference between the login/service nodes (on which code is built) and the compute nodes (on which code is run), a software package's build process may inject optimization flags incorrectly targeting the login/service nodes. Users are strongly urged to check makefiles for CPU-specific optimization flags (ex: -tp, -hcpu, -march). Users should not need to set such flags; the Cray compiler wrappers will automatically add CPU-specific flags to the build. Choosing the incorrect processor optimization target can negatively impact code performance.
8.3. Controlling the Programming Environment
Upon login, the default versions of the PGI compiler and associated Message Passing Interface (MPI) libraries are added to each user's environment through a programming environment module. Users do not need to make any environment changes to use the default version of PGI and MPI.
Changing Compilers
If a different compiler is required, it is important to use the correct environment for each compiler. To aid users in pairing the correct compiler and environment, programming environment modules are provided. The programming environment modules will load the correct pairing of compiler version, message passing libraries, and other items required to build and run. We highly recommend that the programming environment modules be used when changing compiler vendors.
The following programming environment modules are available on Titan:
- PrgEnv-pgi
- PrgEnv-gnu
- PrgEnv-cray
- PrgEnv-intel
To change the default loaded PGI environment to the default GCC environment use:
$ module unload PrgEnv-pgi
$ module load PrgEnv-gnu
Or alternatively:
$ module swap PrgEnv-pgi PrgEnv-gnu
Changing Versions of the Same Compiler
To use a specific compiler version, you must first ensure the compiler's PrgEnv module is loaded, and then swap to the correct compiler version. For example, the following will configure the environment to use the GCC compilers, then load a non-default GCC compiler version:
$ module swap PrgEnv-pgi PrgEnv-gnu
$ module swap gcc gcc/4.6.1
General Programming Environment Guidelines
We recommend the following general guidelines for using the programming environment modules:
- Do not purge all modules; rather, use the default module environment provided at the time of login, and modify it.
- Do not swap or unload any of the Cray-provided modules (those with names like xt-*, xe-*, xk-*, or cray-*).
- Do not swap moab, torque, or MySQL modules after loading a programming environment modulefile.
8.4. Compiling Threaded Codes
When building threaded codes on Titan, you may need to take additional steps to ensure a proper build.
OpenMP
For PGI, add "-mp" to the build line.
$ cc -mp test.c -o test.x
$ setenv OMP_NUM_THREADS 2
$ aprun -n2 -d2 ./test.x
For GNU, add "-fopenmp" to the build line.
$ cc -fopenmp test.c -o test.x
$ setenv OMP_NUM_THREADS 2
$ aprun -n2 -d2 ./test.x
For Cray, no additional flags are required.
$ module swap PrgEnv-pgi PrgEnv-cray
$ cc test.c -o test.x
$ setenv OMP_NUM_THREADS 2
$ aprun -n2 -d2 ./test.x
For Intel, add "-openmp" to the build line.
$ module swap PrgEnv-pgi PrgEnv-intel
$ cc -openmp test.c -o test.x
$ setenv OMP_NUM_THREADS 2
$ aprun -n2 -d2 ./test.x
SHMEM
For SHMEM codes, users must load the xt-shmem module before compiling:
$ module load xt-shmem
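With the module loaded, a minimal compile-and-run sequence might look like the following sketch; the source file name and task count are hypothetical:
$ cc shmem_hello.c -o shmem_hello.x
$ aprun -n 16 ./shmem_hello.x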
9. Running Jobs on Titan
In High Performance Computing (HPC), computational work is performed by jobs. Individual jobs produce data that lend relevant insight into grand challenges in science and engineering. As such, the timely, efficient execution of jobs is the primary concern in the operation of any HPC system.
A job on Titan typically comprises a few different components:
- A batch submission script.
- A statically-linked binary executable.
- A set of input files for the executable.
- A set of output files created by the executable.
And the process for running a job, in general, is to:
- Prepare executables and input files.
- Write a batch script.
- Submit the batch script to the batch scheduler.
- Optionally monitor the job before and during execution.
The following sections describe in detail how to create, submit, and manage jobs for execution on Titan.
9.1. Login vs. Service vs. Compute Nodes
Cray Supercomputers are complex collections of different types of physical nodes/machines. For simplicity, we can think of Titan nodes as existing in one of three categories: login nodes, service nodes, or compute nodes.
Login Nodes
Login nodes are designed to facilitate ssh access into the overall system, and to handle simple tasks. When you first log in, you are placed on a login node. Login nodes are shared by all users of a system, and should only be used for basic tasks such as file editing, code compilation, data backup, and job submission. Login nodes should not be used for memory-intensive or processing-intensive tasks. Users should also limit the number of simultaneous tasks performed on login nodes. For example, a user should not run ten simultaneous tar processes.
Service Nodes
Memory-intensive tasks, processor-intensive tasks, and any production-type work should be submitted to the machine's batch system (e.g., to Torque/Moab via qsub). When a job is submitted to the batch system, the job submission script is first executed on a service node.
Any job submitted to the batch system is handled in this way, including interactive batch jobs (e.g., via qsub -I). Often users are under the (false) impression that they are executing commands on compute nodes while typing commands in an interactive batch job. On Cray machines, this is not the case.
Compute Nodes
On Cray machines, when the aprun command is issued within a job script (or on the command line within an interactive batch job), the binary passed to aprun is copied to and executed in parallel on a set of compute nodes. Compute nodes run a Linux microkernel for reduced overhead and improved performance.
Note: Compute nodes can only be accessed via the aprun command.
9.2. Filesystems Available to Compute Nodes
Only User Work (Lustre) and Project Work (Lustre) storage areas are available to compute nodes on OLCF Cray systems. Other storage spaces (User Home, User Archive, Project Home, and Project Archive) are not mounted on compute nodes.
As a result, job executable binaries and job input files must reside within a Lustre Work space, e.g. /tmp/work/$USER. Job output must also be sent to a Lustre Work space.
Batch jobs can be submitted from User Home or Project Home, but additional steps are required to ensure the job runs successfully. Jobs submitted from Home areas should cd into a Lustre Work directory prior to invoking aprun. An error like the following may be returned if this is not done:
aprun: [NID 94]Exec /tmp/work/userid/a.out failed: chdir /autofs/na1_home/userid No such file or directory
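For illustration, a minimal batch script following this pattern might look like the sketch below; the project ID, node count, task count, and executable name are placeholders:
#!/bin/bash
#PBS -A PROJ000
#PBS -l walltime=1:00:00,nodes=2
# Binaries, input, and output must live in a Lustre Work space;
# change there before invoking aprun.
cd /tmp/work/$USER
aprun -n 32 ./a.out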
9.3. Writing Batch Scripts
Batch scripts, or job submission scripts, are the mechanism by which a user submits and configures a job for eventual execution. A batch script is simply a shell script which contains:
- Commands that can be interpreted by batch scheduling software (e.g. PBS)
- Commands that can be interpreted by a shell
The batch script is submitted to the batch scheduler where it is parsed. Based on the parsed data, the batch scheduler places the script in the scheduler queue as a batch job. Once the batch job makes its way through the queue, the script will be executed on a service node within the set of allocated computational resources.
Sections of a Batch Script
Batch scripts are parsed into the following three sections:
- The Interpreter Line
The first line of a script can be used to specify the script's interpreter. This line is optional. If not used, the submitter's default shell will be used. The line uses the "hash-bang-shell" syntax:
#!/path/to/shell
- The Scheduler Options Section
The batch scheduler options are preceded by #PBS, making them appear as comments to a shell. PBS will look for #PBS options in a batch script from the script's first line through the first non-comment line. A comment line begins with #. #PBS options entered after the first non-comment line will not be read by PBS.
Note: All batch scheduler options must appear at the beginning of the batch script.
- The Executable Commands Section
The shell commands follow the last #PBS option and represent the main content of the batch job. If any #PBS lines follow executable statements, they will be ignored as comments.
The execution section of a script will be interpreted by a shell and can contain multiple lines of executable invocations, shell commands, and comments. When the job's queue wait time is finished, commands within this section will be executed on a service node (sometimes called a "head node") from the set of the job's allocated resources. Under normal circumstances, the batch job will exit the queue after the last line of the script is executed.
An Example Batch Script
1: #!/bin/bash
2: # Begin PBS directives
3: #PBS -A pjt000
4: #PBS -N test
5: #PBS -j oe
6: #PBS -l walltime=1:00:00,nodes=1500
7: #PBS -l gres=widow2%widow3
8: # End PBS directives and begin shell commands
9: cd /tmp/work/$USER
10: date
11: aprun -n 24000 ./a.out
The lines of this batch script do the following:
Line | Option | Description |
---|---|---|
1 | Optional | Specifies that the script should be interpreted by the bash shell. |
2 | Optional | Comments do nothing. |
3 | Required | The job will be charged to the "pjt000" project. |
4 | Optional | The job will be named "test". |
5 | Optional | The job’s standard output and error will be combined. |
6 | Required | The job will request 1,500 compute nodes for 1 hour. |
7 | Optional | The job will be associated with the widow2 and widow3 Lustre filesystems only. |
8 | Optional | Comments do nothing. |
9 | -- | This shell command will change to the user's /tmp/work directory. |
10 | -- | This shell command will run the date command. |
11 | -- | This invocation will run 24,000 MPI instances of the executable a.out on the compute nodes allocated by the batch system. |
Batch Scheduler nodes Requests
A node’s cores cannot be allocated to multiple jobs. Because the OLCF charges based upon the computational resources a job makes unavailable to others, a job is charged for an entire node even if the job uses only one processor core. To simplify the process, users are required to request an entire node through PBS.
9.4. Submitting Batch Scripts
Once written, a batch script is submitted to the batch scheduler via the qsub command.
$ cd /path/to/batch/script
$ qsub ./script.pbs
If successfully submitted, a PBS job ID will be returned. This ID is needed to monitor the job's status with various job monitoring utilities. It is also necessary information when troubleshooting a failed job, or when asking the OLCF User Assistance Center for help.
Options to the qsub command allow the specification of attributes which affect the behavior of the job. In general, options to qsub on the command line can also be placed in the batch scheduler options section of the batch script via #PBS.
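For example, the following two submissions are equivalent: the first passes the options on the qsub command line, while the second assumes the script contains matching #PBS lines (the option values here are illustrative):
$ qsub -A pjt000 -N test -j oe -l walltime=1:00:00,nodes=2 ./script.pbs
$ qsub ./script.pbs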
For more information on submitting batch jobs, see the Batch Script Examples page.
9.5. Interactive Batch Jobs
Batch scripts are useful for submitting a group of commands, allowing them to run through the queue, then viewing the results at a later time. However, it is sometimes necessary to run tasks within a job interactively.
Users are not permitted to access compute nodes or run aprun directly from login nodes. Instead, users must use an interactive batch job to allocate and gain access to compute resources interactively. This is done by using the -I option to qsub.
Interactive Batch Example
For interactive batch jobs, PBS options are passed through qsub on the command line.
$ qsub -I -A pjt000 -q debug -X -V -l nodes=3,walltime=30:00 -l gres=widow2%widow3
This request will:

Option | Description
---|---
-I | Start an interactive session
-A | Charge to the "pjt000" project
-X | Enable X11 forwarding. The DISPLAY environment variable must be set.
-q debug | Run in the debug queue
-V | Import the submitting user's environment
-l nodes=3,walltime=30:00 | Request 3 compute nodes for 30 minutes (you get all cores per node)
-l gres=widow2%widow3 | Associate the batch job with the widow2 and widow3 filesystems only
After running this command, you will have to wait until enough compute nodes are available, just as in any other batch job. However, once the job starts, you will be given an interactive prompt on the head node of your allocated resource. From here commands may be executed directly instead of through a batch script.
Debugging via Interactive Jobs
A common use of interactive batch jobs is to aid in debugging efforts. Interactive access to compute resources allows a user to run a process to the point of failure; however, unlike a batch job, the process can be restarted after brief changes are made without losing the compute resource allocation. This may help speed the debugging effort because a user does not have to wait in the queue between run attempts.
Note: If a GUI (e.g., a graphical debugger) is needed during the interactive session, the -X PBS option should be used to enable X11 forwarding.
Choosing an Interactive Job's nodes Value
Because interactive jobs must sit in the queue until enough resources become available, it is useful to base the nodes selection on the number of currently unallocated nodes; this shortens the queue wait time. The showbf command (i.e., "show backfill") can be used to see resource limits that would allow your job to be immediately back-filled (and thus started) by the scheduler. For example, the snapshot below shows that 802 nodes are currently free.
$ showbf
Partition   Tasks  Nodes  StartOffset   Duration      StartDate
---------   -----  -----  ------------  ------------  --------------
ALL         4744   802    INFINITY      00:00:00      HH:MM:SS_MM/DD
See showbf --help for additional options.
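For example, based on the snapshot above, an interactive job sized at or below the 802 free nodes should back-fill and start without a lengthy wait; the project ID and walltime here are illustrative:
$ qsub -I -A pjt000 -l nodes=800,walltime=30:00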
9.6. Common Batch Options to PBS
The following table summarizes frequently-used options to PBS:
Option | Use | Description
---|---|---
-A | #PBS -A <account> | Causes the job time to be charged to <account>. The account string, e.g. pjt000, is typically composed of three letters followed by three digits and optionally followed by a subproject identifier. The utility showproj can be used to list your valid assigned project ID(s). This option is required by all jobs.
-l | #PBS -l nodes=<value> | Maximum number of compute nodes. Jobs cannot request partial nodes.
 | #PBS -l walltime=<time> | Maximum wall-clock time. <time> is in the format HH:MM:SS.
 | #PBS -l gres=<filesystem> | Associates the batch job with one or more Lustre filesystems. Valid options are widow1, widow2, and widow3. Specify multiple filesystems like widow2%widow3. Useful to omit associations in the event of filesystem outages.
-o | #PBS -o <filename> | Writes standard output to <filename> instead of <job script>.o$PBS_JOBID. $PBS_JOBID is an environment variable created by PBS that contains the PBS job identifier.
-e | #PBS -e <filename> | Writes standard error to <filename> instead of <job script>.e$PBS_JOBID.
-j | #PBS -j {oe,eo} | Combines standard output and standard error into the standard error file (eo) or the standard output file (oe).
-m | #PBS -m a | Sends email to the submitter when the job aborts.
 | #PBS -m b | Sends email to the submitter when the job begins.
 | #PBS -m e | Sends email to the submitter when the job ends.
-M | #PBS -M <address> | Specifies the email address to use for -m options.
-N | #PBS -N <name> | Sets the job name to <name> instead of the name of the job script.
-S | #PBS -S <shell> | Sets the shell to interpret the job script.
-q | #PBS -q <queue> | Directs the job to the specified queue. This option is not required to run in the default queue on any given system.
-V | #PBS -V | Exports all environment variables from the submitting shell into the batch job shell.
Further details and other PBS options may be found through the qsub man page.
9.7. Batch Environment Variables
PBS sets multiple environment variables at submission time. The following PBS variables are useful within batch scripts:
Variable | Description
---|---
$PBS_O_WORKDIR | The directory from which the batch job was submitted. By default, a new job starts in your home directory. You can get back to the directory of job submission with cd $PBS_O_WORKDIR. Note that this is not necessarily the same directory in which the batch script resides.
$PBS_JOBID | The job's full identifier. A common use for PBS_JOBID is to append the job's ID to the standard output and error files.
$PBS_NUM_NODES | The number of nodes requested.
$PBS_NUM_PPN | The number of cores requested per node (the ppn value).
$PBS_JOBNAME | The job name supplied by the user.
$PBS_NODEFILE | The name of the file containing the list of nodes assigned to the job. Used sometimes on non-Cray clusters.
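As a brief sketch of how these variables might be used together in a batch script (the executable name is a placeholder, and the script assumes the job was submitted from a Lustre Work directory):
#!/bin/bash
#PBS -A pjt000
#PBS -l walltime=1:00:00,nodes=1
# Return to the directory from which the job was submitted.
cd $PBS_O_WORKDIR
# Tag the application's log file with the job identifier.
aprun -n 16 ./a.out > run.$PBS_JOBID.log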
9.8. Modifying Batch Jobs
The batch scheduler provides a number of utility commands for managing submitted jobs. See each utility's man page for more information.
Removing and Holding Jobs
qdel
Jobs in the queue in any state can be stopped and removed from the queue using the qdel command.
$ qdel 1234
qhold
Jobs in the queue in a non-running state may be placed on hold using the qhold command. Jobs placed on hold will not be removed from the queue, but they will not be eligible for execution.
$ qhold 1234
qrls
Once on hold, the job will not be eligible to run until it is released to return to a queued state. The qrls command can be used to remove a job from the held state.
$ qrls 1234
Modifying Job Attributes
qalter
Non-running jobs in the queue can be modified with the PBS qalter command. The qalter utility can be used to do the following (among others):
Modify the job’s name:
$ qalter -N newname 130494
Modify the number of requested nodes:
$ qalter -l nodes=12 130494
Modify the job’s walltime:
$ qalter -l walltime=01:00:00 130494
9.9. Monitoring Batch Jobs
PBS and Moab provide multiple tools to view queue, system, and job status. Below are the most common and useful of these tools.
Job Monitoring Commands
showq
The Moab utility showq can be used to view a more detailed description of the queue. The utility will display the queue in the following states:
State | Description
---|---
Active | These jobs are currently running.
Eligible | These jobs are currently queued awaiting resources. A user is allowed two jobs in the eligible state. Eligible jobs are shown in the order in which the scheduler will consider them for allocation.
Blocked | These jobs are currently queued but are not eligible to run. Common reasons for jobs in this state are jobs on hold and the owning user currently having (2) jobs in the eligible state.
To see all jobs currently in the queue:
$ showq
To see all jobs owned by userA currently in the queue:
$ showq -u userA
checkjob
The Moab utility checkjob can be used to view details of a job in the queue. For example, if job 736 is a job currently in the queue in a blocked state, the following can be used to view why the job is in a blocked state:
$ checkjob 736
The return may contain a line similar to the following:
BlockMsg: job 736 violates idle HARD MAXJOB limit of 2 for user (Req: 1 InUse: 2)
This line indicates the job is in the blocked state because the owning user has reached the limit of (2) jobs currently in the eligible state.
qstat
The PBS utility qstat will poll PBS (Torque) for job information. However, qstat does not know of Moab's blocked and eligible states. Because of this, the showq Moab utility (see above) will provide a more accurate batch queue state.
To show all queued jobs:
$ qstat -a
To show details about job 1234:
$ qstat -f 1234
To show all currently queued jobs owned by userA:
$ qstat -u userA
9.10. Titan Batch Queues
Queues are used by the batch scheduler to aid in the organization of jobs. Users typically have access to multiple queues, and each queue may allow different job limits and have different priorities. Unless otherwise notified, users have access to the following queues on Titan:
Name | Use | Limits | Description
---|---|---|---
batch | The default; no explicit request required | See the OLCF Scheduling Policy for details. | For production work; the default queue
killable | #PBS -q killable | See the OLCF Scheduling Policy for details. | Allows batch jobs to run even if the job will not complete before the start of a scheduled reservation. More information below.
debug | #PBS -q debug | Max walltime: 1 hour; max running jobs per user: 1; max core count: unlimited | For short software development, testing, and debugging jobs. More information below.
Debug
Proper use of the debug queue would be, for example, submitting (1) job at a time as part of a software development, testing, or debugging cycle. Interactive parallel work is an ideal use for the debug queue.
Warning: Users who misuse the debug queue may have further access to the queue denied.
Killable
Reservations are used to ensure that batch jobs are not running at the start of a scheduled system outage. Normally the batch system will not allocate a batch job if it will not complete before the reservation start time.
The killable queue allows the batch system to schedule batch jobs even if they will not complete (based on the job's specified walltime) before a scheduled reservation.
The batch system will stop scheduling batch jobs in the killable queue one hour before the scheduled outage. Because of this, the minimum runtime of a job submitted to the killable queue will be one hour. The maximum runtime will be based on the batch job's start time and the start of the reservation (greater than one hour and less than the requested walltime).
If your job can perform usable work in the one-hour minimum runtime, the killable queue may allow you to take advantage of the idle resources available prior to each scheduled outage. Batch jobs submitted to the killable queue that are killed by the system when the reservation starts will be re-queued and available for scheduling once the system is returned to service following the outage.
9.11. Job Execution on Titan
Once resources have been allocated through the batch system, users can:
- Run commands in serial on the resource pool's primary service node
- Run executables in parallel across compute nodes in the resource pool
Serial Execution
The executable portion of a batch script is interpreted by the shell specified on the first line of the script. If a shell is not specified, the submitting user’s default shell will be used. This portion of the script may contain comments, shell commands, executable scripts, and compiled executables. These can be used in combination to, for example, navigate file systems, set up job execution, run executables, and even submit other batch jobs.
Parallel Execution
By default, commands in the job submission script will be executed on the job's primary service node. The aprun command is used to execute a binary on one or more compute nodes within a job's allocated resource pool.
The following sections describe the use of the aprun command within a batch job.
9.11.1. Using the aprun command
The aprun command is used to run a compiled application program across one or more compute nodes. You use the aprun command to specify application resource requirements, request application placement, and initiate application launch.
The machine's physical node layout plays an important role in how aprun works. Each Titan compute node contains (2) 8-core NUMA nodes on a single socket (a total of 16 cores).
The aprun command is the only mechanism for running an executable in parallel on compute nodes. To run jobs as efficiently as possible, a thorough understanding of how to use aprun and its various options is paramount.
Shell Resource Limits
By default, aprun will not forward shell limits set by ulimit for sh/ksh/bash or by limit for csh/tcsh.
To pass these settings to your batch job, you should set the environment variable APRUN_XFER_LIMITS to 1, via export APRUN_XFER_LIMITS=1 for sh/ksh/bash or setenv APRUN_XFER_LIMITS 1 for csh/tcsh.
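For example, a bash batch script that raises the core-file size limit on the service node and forwards it to the compute nodes might contain the following sketch (the limit and task count are illustrative):
# Allow core dumps, then forward the service node's shell limits
# to the compute nodes before launching.
ulimit -c unlimited
export APRUN_XFER_LIMITS=1
aprun -n 16 ./a.out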
Common aprun Options
The following table lists commonly-used options to aprun. For a more detailed description of aprun options, see the aprun man page.
Option | Description
---|---
-D | Debug; shows the layout aprun will use.
-n | Number of total MPI tasks (aka "processing elements") for the executable. If you do not specify the number of tasks to aprun, the system will default to 1.
-N | Number of MPI tasks (aka "processing elements") per physical node. Warning: Because each node contains multiple processors/NUMA nodes, the -S option is likely a better option than -N to control layout within a node.
-m | Memory required per MPI task. There is a maximum of 2GB per core, i.e. requesting 2.1GB will allocate a minimum of two cores per MPI task.
-d | Number of threads per MPI task. Warning: The default value for -d is 1. If you specify OMP_NUM_THREADS but do not give a -d option, aprun will allocate your threads to a single core. Use OMP_NUM_THREADS to specify to your code the number of threads per MPI task; use -d to tell aprun how to place those threads.
-S | Number of MPI tasks (aka "processing elements") per NUMA node. Can be 1, 2, 3, 4, 5, 6, 7, or 8.
-ss | Strict memory containment per NUMA node. The default is to allow remote NUMA node memory access. This option prevents memory access of the remote NUMA node.
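Combining several of these options, a hypothetical launch of a hybrid MPI/OpenMP binary might look like the following sketch (task and thread counts are illustrative): 16 total MPI tasks, 4 tasks per NUMA node (8 per physical node, so 2 nodes), 2 threads per task, with strict memory containment:
$ export OMP_NUM_THREADS=2
$ aprun -n 16 -S 4 -d 2 -ss ./a.out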
9.11.2. XK7 CPU Description
Each Titan compute node contains (1) AMD Opteron™ 6274 (Interlagos) CPU. Each CPU contains (2) dies. Each die contains (4) "bulldozer" compute units and a shared L3 cache. Each compute unit contains (2) integer cores (and their L1 cache), a shared floating point scheduler, and shared L2 cache.
To aid in task placement, each die is organized into a NUMA node. Each compute node contains (2) NUMA nodes. Each NUMA node contains a die's L3 cache and its (4) compute units (8 cores).
This configuration is shown graphically below.
9.11.3. Controlling MPI Task Layout Within a Physical Node
Users have (2) ways to control MPI task layout:
- Within a physical node
- Across physical nodes
This article focuses on how to control MPI task layout within a physical node.
Understanding NUMA Nodes
Each physical node is organized into (2) 8-core NUMA nodes. NUMA is an acronym for "Non-Uniform Memory Access". You can think of a NUMA node as a division of a physical node that contains a subset of processor cores and their high-affinity memory.
Applications may use resources from one or both NUMA nodes. The default MPI task layout is SMP-style. This means MPI will sequentially allocate all cores on one NUMA node before allocating tasks to another NUMA node.
Spreading MPI Tasks Across NUMA Nodes
Each physical node contains (2) NUMA nodes. Users can control MPI task layout using the aprun NUMA node flags. For jobs that do not utilize all cores on a node, it may be beneficial to spread a physical node's MPI task load over the (2) available NUMA nodes via the -S option to aprun.
Example 1: Default NUMA Placement
Job requests (2) processor cores without a NUMA flag. Both tasks are placed on the first NUMA node.
$ aprun -n2 a.out
Rank 0, Node 0, NUMA 0, Core 0
Rank 1, Node 0, NUMA 0, Core 1
Example 2: Specific NUMA Placement
Job requests (2) processor cores with aprun -S. A task is placed on each of the (2) NUMA nodes:
$ aprun -n2 -S1 a.out
Rank 0, Node 0, NUMA 0, Core 0
Rank 1, Node 0, NUMA 1, Core 0
The following table summarizes common NUMA node options to aprun:

Option | Description
---|---
-S | Processing elements (essentially a processor core) per NUMA node. Specifies the number of PEs to allocate per NUMA node. Can be 1, 2, 3, 4, 5, 6, 7, or 8.
-ss | Strict memory containment per NUMA node. The default is to allow remote NUMA node memory access. This option prevents memory access of the remote NUMA node.
Advanced NUMA Node Placement
Example 1: Grouping MPI Tasks on a Single NUMA Node
Run a.out on (8) cores. Place (8) MPI tasks on (1) NUMA node. In this case the aprun -S option is optional:
$ aprun -n8 -S8 a.out
Compute Node
NUMA Node | Core 0 | Core 1 | Core 2 | Core 3 | Core 4 | Core 5 | Core 6 | Core 7
---|---|---|---|---|---|---|---|---
NUMA 0 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7
NUMA 1 |  |  |  |  |  |  |  | 
Example 2: Spreading MPI tasks across NUMA nodes
Run a.out on (8) cores. Place (4) MPI tasks on each of (2) NUMA nodes via aprun -S.
$ aprun -n8 -S4 a.out
Compute Node
NUMA Node | Core 0 | Core 1 | Core 2 | Core 3 | Core 4 | Core 5 | Core 6 | Core 7
---|---|---|---|---|---|---|---|---
NUMA 0 | 0 | 1 | 2 | 3 |  |  |  | 
NUMA 1 | 4 | 5 | 6 | 7 |  |  |  | 
Example 3: Spread Out MPI Tasks Across Paired-Core Compute Units
The -d option can be used for non-threaded codes to allow one task per paired-core compute unit.
Note: Because threaded codes use the -d option to spread out threads/tasks, using -d for paired-core control purposes is not an option for threaded codes.
Run a.out on (8) cores; (4) cores per NUMA node; but only (1) core on each paired-core compute unit:
$ aprun -n8 -S4 -d2 a.out
Compute Node
NUMA Node | Core 0 | Core 1 | Core 2 | Core 3 | Core 4 | Core 5 | Core 6 | Core 7
---|---|---|---|---|---|---|---|---
NUMA 0 | 0 |  | 1 |  | 2 |  | 3 | 
NUMA 1 | 4 |  | 5 |  | 6 |  | 7 | 
For threaded codes, one core on each compute unit can be skipped using the -cc option:
$ setenv OMP_NUM_THREADS 4
$ aprun -n2 -S1 -d4 -cc 0,2,4,6,8,10,12,14 a.out
9.11.4. Controlling MPI Task Layout Across Many Physical Nodes
Users have (2) ways to control MPI task layout:
- Within a physical node
- Across physical nodes
This article focuses on how to control MPI task layout across physical nodes.
The default MPI task layout is SMP-style. This means MPI will sequentially allocate all cores on one physical node before allocating tasks to another physical node.
Viewing Multi-Node Layout Order
Task layout can be seen by setting the MPICH_RANK_REORDER_DISPLAY environment variable to 1.
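For example (csh/tcsh syntax, matching the surrounding examples; the task count is illustrative), the rank-to-node mapping is then printed at launch:
$ setenv MPICH_RANK_REORDER_DISPLAY 1
$ aprun -n 32 a.out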
Changing Multi-Node Layout Order
For multi-node jobs, layout order can be changed using the environment variable MPICH_RANK_REORDER_METHOD. See man intro_mpi for more information.
Multi-Node Layout Order Examples
Example 1: Default Layout
The following will run a.out across (32) cores. This requires (2) physical compute nodes.
$ aprun -n 32 a.out
Compute Node 0
NUMA Node | Core 0 | Core 1 | Core 2 | Core 3 | Core 4 | Core 5 | Core 6 | Core 7
---|---|---|---|---|---|---|---|---
NUMA 0 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7
NUMA 1 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15

Compute Node 1
NUMA Node | Core 0 | Core 1 | Core 2 | Core 3 | Core 4 | Core 5 | Core 6 | Core 7
---|---|---|---|---|---|---|---|---
NUMA 0 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23
NUMA 1 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31
Example 2: Round-Robin Layout
The following will place tasks in a round robin fashion. This requires (2) physical compute nodes.
$ setenv MPICH_RANK_REORDER_METHOD 0
$ aprun -n 32 a.out
Compute Node 0
NUMA Node | Core 0 | Core 1 | Core 2 | Core 3 | Core 4 | Core 5 | Core 6 | Core 7
---|---|---|---|---|---|---|---|---
NUMA 0 | 0 | 2 | 4 | 6 | 8 | 10 | 12 | 14
NUMA 1 | 16 | 18 | 20 | 22 | 24 | 26 | 28 | 30

Compute Node 1
NUMA Node | Core 0 | Core 1 | Core 2 | Core 3 | Core 4 | Core 5 | Core 6 | Core 7
---|---|---|---|---|---|---|---|---
NUMA 0 | 1 | 3 | 5 | 7 | 9 | 11 | 13 | 15
NUMA 1 | 17 | 19 | 21 | 23 | 25 | 27 | 29 | 31
Example 3: Combining Inter-Node and Intra-Node Options
The following combines MPICH_RANK_REORDER_METHOD and -S to place tasks on three cores per processor within a node and in a round-robin fashion across nodes.
$ setenv MPICH_RANK_REORDER_METHOD 0
$ aprun -n12 -S3 a.out
Compute Node 0
NUMA Node | Core 0 | Core 1 | Core 2 | Core 3 | Core 4 | Core 5 | Core 6 | Core 7
---|---|---|---|---|---|---|---|---
NUMA 0 | 0 | 2 | 4 |  |  |  |  | 
NUMA 1 | 6 | 8 | 10 |  |  |  |  | 

Compute Node 1
NUMA Node | Core 0 | Core 1 | Core 2 | Core 3 | Core 4 | Core 5 | Core 6 | Core 7
---|---|---|---|---|---|---|---|---
NUMA 0 | 1 | 3 | 5 |  |  |  |  | 
NUMA 1 | 7 | 9 | 11 |  |  |  |  | 
9.11.5. Controlling Thread Layout Within a Physical Node
Titan supports threaded programming within a compute node. Threads may span across both processors within a single compute node, but cannot span compute nodes. Users have a great deal of flexibility in thread placement. Several examples are shown below.
Note: Thread placement is controlled by the -d (depth) option to aprun.
The -d option to aprun specifies the number of threads per MPI task. Under previous CNL versions this option was not required. Under the current CNL version, the number of cores used is calculated by multiplying the value of -d by the value of -n.
Warning: If you do not use the -d option, all threads will be started on the same processor core. This can lead to performance degradation for threaded codes.
Thread Layout Examples
The following examples are written for the bash shell. If using csh/tcsh, you should change export OMP_NUM_THREADS=x to setenv OMP_NUM_THREADS x wherever it appears.
Example 1: (2) MPI tasks, (16) Threads Each
This example will launch (2) MPI tasks, each with (16) threads. This requires (2) compute nodes and a nodes request of (2):
$ export OMP_NUM_THREADS=16
$ aprun -n2 -d16 a.out
Rank 0, Thread 0, Node 0, NUMA 0, Core 0 <-- MASTER
Rank 0, Thread 1, Node 0, NUMA 0, Core 1 <-- slave
Rank 0, Thread 2, Node 0, NUMA 0, Core 2 <-- slave
Rank 0, Thread 3, Node 0, NUMA 0, Core 3 <-- slave
Rank 0, Thread 4, Node 0, NUMA 0, Core 4 <-- slave
Rank 0, Thread 5, Node 0, NUMA 0, Core 5 <-- slave
Rank 0, Thread 6, Node 0, NUMA 0, Core 6 <-- slave
Rank 0, Thread 7, Node 0, NUMA 0, Core 7 <-- slave
Rank 0, Thread 8, Node 0, NUMA 1, Core 0 <-- slave
Rank 0, Thread 9, Node 0, NUMA 1, Core 1 <-- slave
Rank 0, Thread 10, Node 0, NUMA 1, Core 2 <-- slave
Rank 0, Thread 11, Node 0, NUMA 1, Core 3 <-- slave
Rank 0, Thread 12, Node 0, NUMA 1, Core 4 <-- slave
Rank 0, Thread 13, Node 0, NUMA 1, Core 5 <-- slave
Rank 0, Thread 14, Node 0, NUMA 1, Core 6 <-- slave
Rank 0, Thread 15, Node 0, NUMA 1, Core 7 <-- slave
Rank 1, Thread 0, Node 1, NUMA 0, Core 0 <-- MASTER
Rank 1, Thread 1, Node 1, NUMA 0, Core 1 <-- slave
Rank 1, Thread 2, Node 1, NUMA 0, Core 2 <-- slave
Rank 1, Thread 3, Node 1, NUMA 0, Core 3 <-- slave
Rank 1, Thread 4, Node 1, NUMA 0, Core 4 <-- slave
Rank 1, Thread 5, Node 1, NUMA 0, Core 5 <-- slave
Rank 1, Thread 6, Node 1, NUMA 0, Core 6 <-- slave
Rank 1, Thread 7, Node 1, NUMA 0, Core 7 <-- slave
Rank 1, Thread 8, Node 1, NUMA 1, Core 0 <-- slave
Rank 1, Thread 9, Node 1, NUMA 1, Core 1 <-- slave
Rank 1, Thread 10, Node 1, NUMA 1, Core 2 <-- slave
Rank 1, Thread 11, Node 1, NUMA 1, Core 3 <-- slave
Rank 1, Thread 12, Node 1, NUMA 1, Core 4 <-- slave
Rank 1, Thread 13, Node 1, NUMA 1, Core 5 <-- slave
Rank 1, Thread 14, Node 1, NUMA 1, Core 6 <-- slave
Rank 1, Thread 15, Node 1, NUMA 1, Core 7 <-- slave
Example 2: (2) MPI tasks, (6) Threads Each
This example will launch (2) MPI tasks, each with (6) threads. Place (1) MPI task per NUMA node. This requires (1) physical compute node and a nodes request of (1):
$ export OMP_NUM_THREADS=6
$ aprun -n2 -d6 -S1 a.out
Compute Node
NUMA Node | Core 0 | Core 1 | Core 2 | Core 3 | Core 4 | Core 5 | Core 6 | Core 7
---|---|---|---|---|---|---|---|---
NUMA 0 | Rank0 Thread0 | Rank0 Thread1 | Rank0 Thread2 | Rank0 Thread3 | Rank0 Thread4 | Rank0 Thread5 |  | 
NUMA 1 | Rank1 Thread0 | Rank1 Thread1 | Rank1 Thread2 | Rank1 Thread3 | Rank1 Thread4 | Rank1 Thread5 |  | 
Example 3: (4) MPI tasks, (2) Threads Each
This example will launch (4) MPI tasks, each with (2) threads. Place only (1) MPI task [and its (2) threads] on each NUMA node. This requires (2) physical compute nodes and a nodes request of (2), even though only (8) cores are actually being used:
$ export OMP_NUM_THREADS=2
$ aprun -n4 -d2 -S1 a.out
Rank 0, Thread 0, Node 0, NUMA 0, Core 0 <-- MASTER
Rank 0, Thread 1, Node 0, NUMA 0, Core 1 <-- slave
Rank 1, Thread 0, Node 0, NUMA 1, Core 0 <-- MASTER
Rank 1, Thread 1, Node 0, NUMA 1, Core 1 <-- slave
Rank 2, Thread 0, Node 1, NUMA 0, Core 0 <-- MASTER
Rank 2, Thread 1, Node 1, NUMA 0, Core 1 <-- slave
Rank 3, Thread 0, Node 1, NUMA 1, Core 0 <-- MASTER
Rank 3, Thread 1, Node 1, NUMA 1, Core 1 <-- slave
Example 4: (2) MPI tasks, (4) Threads Each, Using only (1) core per compute unit
The -cc option can be used to allow use of only one core per paired-core compute unit.
This example will launch (2) MPI tasks, each with (4) threads. Place only (1) MPI task [and its (4) threads] on each NUMA node. One core per paired-core compute unit sits idle. This requires (1) physical compute node and a nodes request of (1), even though only (8) cores are actually being used:
$ export OMP_NUM_THREADS=4
$ aprun -n2 -d4 -S1 -cc 0,2,4,6,8,10,12,14 a.out
Compute Node
NUMA Node | Core 0 | Core 1 | Core 2 | Core 3 | Core 4 | Core 5 | Core 6 | Core 7
---|---|---|---|---|---|---|---|---
NUMA 0 | Rank0 Thread0 |  | Rank0 Thread1 |  | Rank0 Thread2 |  | Rank0 Thread3 | 
NUMA 1 | Rank1 Thread0 |  | Rank1 Thread1 |  | Rank1 Thread2 |  | Rank1 Thread3 | 
Example 5: (2) MPI tasks, (8) Threads Each, Using only (1) core per compute unit
The -cc option can be used to allow use of only one core per paired-core compute unit.
This example will launch (2) MPI tasks, each with (8) threads. One core per paired-core compute unit will sit idle. This requires (2) physical compute nodes and a nodes request of (2), even though only (16) cores are actually being used:
$ export OMP_NUM_THREADS=8
$ aprun -n2 -d8 -N1 -cc 0,2,4,6,8,10,12,14 a.out
Compute Node 0
NUMA Node | Core 0 | Core 1 | Core 2 | Core 3 | Core 4 | Core 5 | Core 6 | Core 7
---|---|---|---|---|---|---|---|---
NUMA 0 | Rank0 Thread0 |  | Rank0 Thread1 |  | Rank0 Thread2 |  | Rank0 Thread3 | 
NUMA 1 | Rank0 Thread4 |  | Rank0 Thread5 |  | Rank0 Thread6 |  | Rank0 Thread7 | 

Compute Node 1
NUMA Node | Core 0 | Core 1 | Core 2 | Core 3 | Core 4 | Core 5 | Core 6 | Core 7
---|---|---|---|---|---|---|---|---
NUMA 0 | Rank1 Thread0 |  | Rank1 Thread1 |  | Rank1 Thread2 |  | Rank1 Thread3 | 
NUMA 1 | Rank1 Thread4 |  | Rank1 Thread5 |  | Rank1 Thread6 |  | Rank1 Thread7 | 
9.12. Job Resource Accounting
The hybrid nature of Titan's accelerated XK7 nodes mandated a new approach to its node allocation and job charge units. For the sake of resource accounting, each Titan XK7 node will be defined as possessing (30) total cores (i.e., (16) CPU cores + (14) GPU core equivalents). Jobs consume charge units in "Titan core-hours", and each Titan node consumes (30) such units per hour.
As in years past, jobs on the Titan system will be scheduled in full node increments; a node’s cores cannot be allocated to multiple jobs. Because the OLCF charges based on what a job makes unavailable to other users, a job is charged for an entire node even if it uses only one core on a node. To simplify the process, users are required to request an entire node through PBS.
Notably, codes that do not take advantage of GPUs will have only (16) CPU cores available per node; however, allocation requests, and units charged, will be based on (30) cores per node.
Viewing Allocation Utilization
Projects are allocated time on Titan in units of "Titan core-hours". Other OLCF systems are allocated in units of "core-hours". This page describes how such units are calculated, and how users can access more detailed information on their relevant allocations.
Titan Core-Hour Calculation
The Titan core-hour charge for each batch job will be calculated as follows:
Titan core-hours = nodes requested * 30 * ( batch job endtime - batch job starttime )
Where batch job starttime is the time the job moves into a running state, and batch job endtime is the time the job exits a running state.
A batch job's usage is calculated solely on requested nodes and the batch job's start and end time. The number of cores actually used within any particular node within the batch job is not used in the calculation. For example, if a job requests 64 nodes through the batch script, runs for an hour, uses only 2 CPU cores per node, and uses no GPU cores, the job will still be charged for 64 * 30 * 1 = 1,920 Titan core-hours.
Viewing Usage
Utilization is calculated daily using batch jobs which complete between 00:00 and 23:59 of the previous day. For example, if a job moves into a run state on Tuesday and completes Wednesday, the job's utilization will be recorded Thursday. Only batch jobs which write an end record are used to calculate utilization. Batch jobs which do not write end records due to system failure or other reasons are not used when calculating utilization.
Each user may view usage for projects on which they are members from the command line tool showusage
and the My OLCF site.
On the Command Line via showusage
The showusage
utility can be used to view your usage from January 01 through midnight of the previous day. For example:
$ showusage
Usage on titan:
                                  Project Totals                 <userid>
 Project                 Allocation |      Usage     Remaining |     Usage
_________________________|___________________________|_____________
 <YourProj>      2000000 |  123456.78    1876543.22 |   1560.80
The -h option will list more usage details.
On the Web via My OLCF
More detailed metrics may be found on each project's usage section of the My OLCF site.
The following information is available for each project:
- YTD usage by system, subproject, and project member
- Monthly usage by system, subproject, and project member
- YTD usage by job size groupings for each system, subproject, and project member
- Weekly usage by job size groupings for each system, and subproject
- Batch system priorities by project and subproject
- Project members
The My OLCF site is provided to aid in the utilization and management of OLCF allocations. If you have any questions or have a request for additional data, please contact the OLCF User Assistance Center.
9.13. Titan Scheduling Policy (Version 12.10)
In a simple batch queue system, jobs run in a first-in, first-out (FIFO) order. This often does not make effective use of the system. A large job may be next in line to run. If the system is using a strict FIFO queue, many processors sit idle while the large job waits to run.
Backfilling would allow smaller, shorter jobs to use those otherwise idle resources, and with the proper algorithm, the start time of the large job would not be delayed. While this does make more effective use of the system, it indirectly encourages the submission of smaller jobs.
The DOE Leadership-Class Job Mandate
As a DOE Leadership Computing Facility, the OLCF has a mandate that a large portion of Titan's usage come from large, leadership-class (aka capability) jobs. To ensure the OLCF complies with DOE directives, we strongly encourage users to run jobs on Titan that are as large as their code will warrant. To that end, the OLCF implements queue policies that enable large jobs to run in a timely fashion.
The basic priority-setting mechanism for jobs waiting in the queue is the time a job has been waiting relative to other jobs in the queue. However, several factors are applied by the batch system to modify the apparent time a job has been waiting. These factors include:
- The number of nodes requested by the job.
- The queue to which the job is submitted.
- The 8-week history of usage for the project associated with the job.
- The 8-week history of usage for the user associated with the job.
If your jobs require resources outside these queue policies, please complete the relevant request form on the Special Requests page. If you have any questions or comments on the queue policies below, please direct them to the User Assistance Center.
Job Priority by Processor Count
Jobs are aged according to the job's requested processor count (older age equals higher queue priority). Each job's requested processor count places it into a specific bin. Each bin has a different aging parameter, which all jobs in the bin receive.
Bin | Min Nodes | Max Nodes | Max Walltime (Hours) | Aging Boost (Days) |
---|---|---|---|---|
1 | 11,250 | -- | 24.0 | 15 |
2 | 3,750 | 11,249 | 24.0 | 5 |
3 | 313 | 3,749 | 12.0 | 0 |
4 | 125 | 312 | 6.0 | 0 |
5 | 1 | 124 | 2.0 | 0 |
FairShare
FairShare, as its name suggests, tries to push each user and project towards their fair share of the system's utilization: in this case, 5% of the system's utilization per user and 10% of the system's utilization per project.
To do this, the job scheduler adds (30) minutes priority aging per user and (1) hour of priority aging per project for every (1) percent the user or project is under its fair share value for the prior (8) weeks. Similarly, the job scheduler subtracts priority in the same way for users or projects that are over their fair share.
For instance, a user who has personally used 0.0% of the system's utilization over the past (8) weeks who is on a project that has also used 0.0% of the system's utilization will get a (12.5) hour bonus (5 * 30 min for the user + 10 * 1 hour for the project).
In contrast, a user who has personally used 0.0% of the system's utilization on a project that has used 12.5% of the system's utilization would get no bonus (5 * 30 min for the user - 2.5 * 1 hour for the project).
batch Queue Policy
The batch queue system places a limit of (2) production jobs in the queued (i.e., eligible-to-run) state per user. If a user submits more than (2) jobs, the additional jobs will enter a held state. If (1) of the user's queued jobs were to begin execution, (1) of the held jobs would be moved into the queued state.
Additionally, a user may have only (2) jobs in bin 5 running at any time.
debug Queue Policy
Jobs submitted to the debug queue will receive a (2)-day priority aging boost.
Production jobs are not allowed in the debug queue. Proper use of the debug queue would be, for example, submitting (1) job at a time as part of a software development, testing, or debugging cycle. Interactive parallel work is an ideal use for the debug queue.
Warning: Users who misuse the debug queue may have further access to the queue denied.
Allocation Overuse Policy
Projects that overrun their allocation are still allowed to run on OLCF systems, although at a reduced priority. Like the adjustment for the number of processors requested above, this is an adjustment to the apparent submit time of the job. However, this adjustment has the effect of making jobs appear much younger than jobs submitted under projects that have not exceeded their allocation. In addition to the priority change, these jobs are also limited in the amount of wall time that can be used.
For example, consider that job1
is submitted at the same time as job2
. The project associated with job1
is over its allocation, while the project for job2
is not. The batch system will consider job2
to have been waiting for a longer time than job1
.
The adjustment to the apparent submit time depends upon the percentage that the project is over its allocation, as shown in the table below:
% Of Allocation Used | Priority Reduction |
---|---|
< 100% | 0 days |
100% to 125% | 30 days |
> 125% | 365 days |
Effective July 19th 2012 and until further notice, leadership-class jobs (i.e. jobs requesting 60,000 or more cores) submitted against an INCITE project will not be subject to the usual over-allocation priority reduction penalties on Titan. The over-allocated project will be limited to (1) job running at any given time, but their submitted jobs will not be penalized with a priority reduction as in the table above.
System Reservation Policy
Projects may request to reserve a set of processors for a period of time through the reservation request form, which can be found on the Special Requests page.
If the reservation is granted, the reserved processors will be blocked from general use for a given period of time. Only users that have been authorized to use the reservation can utilize those resources. Since no other users can access the reserved resources, it is crucial that groups given reservations take care to ensure the utilization on those resources remains high.
To prevent reserved resources from remaining idle for an extended period of time, reservations are monitored for inactivity. If activity falls below 50% of the reserved resources for more than (30) minutes, the reservation will be canceled and the system will be returned to normal scheduling. A new reservation must be requested if this occurs.
Since a reservation makes resources unavailable to the general user population, projects that are granted reservations will be charged (regardless of their actual utilization) a CPU-time equivalent to (# of cores reserved) * (length of reservation in hours).
10. Development Tools
Titan is configured with a broad set of tools to facilitate GPU acceleration of both new and existing application codes. These tools can be broken down into three categories based on their implementation methodology:
- GPU-accelerated libraries
- Accelerator compiler directives
- Low-level GPU languages
Each of these three methods has pros and cons that must be weighed for each program and are not mutually exclusive. In addition to these tools, Titan supports a wide variety of performance and debugging tools to ensure that however you choose to implement GPU acceleration into your program code, it is being done so efficiently.
10.1. GPU Accelerated Libraries
Due to the performance benefits that come with GPU computing, many scientific libraries are now offering accelerated versions. If your program contains BLAS or LAPACK function calls, GPU-accelerated versions may be available. Magma, CULA, cuBLAS, and cuSPARSE libraries provide optimized GPU linear algebra routines that require only minor changes to existing code. These libraries require little understanding of the underlying GPU hardware, and performance enhancements are transparent to the end developer.
For more general libraries, such as Trilinos and PETSc, you will want to visit the appropriate software development site to examine the current status of GPU integration. For Trilinos, please see the latest documentation at the Sandia Trilinos page. Similarly, Argonne's PETSc Documentation has a page containing the latest GPU integration information.
MAGMA
The MAGMA project aims to develop a dense linear algebra library similar to LAPACK but for heterogeneous/hybrid architectures. For C and Fortran code currently using LAPACK this should be a relatively simple port and does not require CUDA knowledge.
Use
This module is currently only compatible with the GNU programming environment:
$ module switch PrgEnv-pgi PrgEnv-gnu
$ module load cudatoolkit magma
To link in the MAGMA library:
cc -lcuda -lmagma -lmagmablas source.c
Resources
For a comprehensive user manual, please see the MAGMA Documentation. A knowledgeable MAGMA User Forum is also available for personalized help. To see MAGMA in action, see the following two PGI articles that include full example code of MAGMA usage with PGI Fortran: Using MAGMA With PGI Fortran and Using GPU-enabled Math Libraries with PGI Fortran.
CULA
CULA is a GPU-accelerated linear algebra library, mimicking LAPACK, that utilizes NVIDIA CUDA. For C and Fortran code currently using LAPACK, this should be a relatively simple port and does not require CUDA knowledge.
Use
CULA is accessed through the cula module; for linking, it is convenient to load the cuda module as well:
$ module load cula-dense cudatoolkit
To link in the CULA library:
cc -lcula_core -lcula_lapack source.c
Resources
A comprehensive CULA Programmers Guide is available for CULA that covers everything you need to know to use it. Once the module is loaded you can find up to date documentation in the $CULA_ROOT/doc directory and examples in $CULA_ROOT/examples. An example of using CULA with PGI Fortran is available in Using GPU-enabled Math Libraries with PGI Fortran
Running the examples:
Obtain an interactive job and load the appropriate modules
$ qsub -I -A{projID} -lwalltime=00:30:00,nodes=1,feature=gpu
$ module load cuda cula-dense
Copy the example files
$ cd /tmp/work/$USER
$ cp -r $CULA_ROOT/examples .
$ cd examples
Now each example can be built and executed
$ cd basicUsage
$ make build64
$ aprun basicUsage
cuBLAS/cuSPARSE
cuBLAS and cuSPARSE are NVIDIA-provided BLAS GPU routines optimized for dense and sparse use, respectively. If your program currently uses BLAS routines, integration should be straightforward and minimal CUDA knowledge is needed. Although primarily designed for use in C/C++ code, Fortran bindings are available.
Use
cuBLAS and cuSPARSE are accessed through the cublas header and need to be linked against the cublas library:
module load cudatoolkit
cc -lcublas source.c
Resources
The CUBLAS and CUSPARSE Library User Guides are available for download at the NVIDIA Developer Documentation Center, which provides complete function listings as well as example code. The NVIDIA SDK provides sample code and can be accessed using the instructions below. An example of using cuBLAS with PGI Fortran is available in Using GPU-enabled Math Libraries with PGI Fortran.
Running the examples:
Obtain an interactive job and load the appropriate modules
$ qsub -I -A{projID} -lwalltime=00:30:00,nodes=1,feature=gpu
$ module switch PrgEnv-pgi PrgEnv-gnu
$ module load cudatoolkit nvidia-sdk
Copy the example files:
$ cd /tmp/work/$USER
$ cp -r $NVIDIA_SDK_PATH/CUDALibraries .
$ cd CUDALibraries
Now each example can be executed:
$ cd bin/linux/release
$ aprun simpleCUBLAS
10.2. Accelerator Compiler Directives
Accelerator compiler directives allow the compiler, guided by the programmer, to take care of low-level accelerator work. One of the main benefits of a directives-based approach is an easier and faster transition of existing code compared to low-level GPU languages. Additional benefits include performance enhancements that are transparent to the end developer and greater portability between current and future many-core architectures.
OpenACC
OpenACC aims to provide an open accelerator interface consisting primarily of compiler directives. Currently PGI, Cray, and CAPS HMPP provide OpenACC implementations for C/C++ and Fortran. OpenACC aims to provide a portable cross platform solution for accelerator programming.
Use C/C++
PGI :
$ module load cudatoolkit
$ cc -acc vecAdd.c -o vecAdd.out
Cray :
$ module switch PrgEnv-pgi PrgEnv-cray
$ module load craype-accel-nvidia35
$ cc -h pragma=acc vecAdd.c -o vecAdd.out
HMPP :
$ module load cudatoolkit hmpp
$ hmpp cc vecAdd.c -o vecAdd.out
Use Fortran
PGI :
$ module load cudatoolkit
$ ftn -acc vecAdd.f90 -o vecAdd.out
Cray :
$ module switch PrgEnv-pgi PrgEnv-cray
$ module load craype-accel-nvidia35
$ ftn -h acc vecAdd.f90 -o vecAdd.out
HMPP :
$ module switch PrgEnv-pgi PrgEnv-gnu
$ module load cudatoolkit hmpp
$ hmpp ftn vecAdd.f90 -o vecAdd.out
Resources
The OpenACC specification provides the basis for all OpenACC implementations; see the published OpenACC specification for details. In addition, the implementation-specific documentation may be of use. PGI has a site dedicated to collecting OpenACC resources. Chapter 5 of the Cray C and C++ Reference Manual provides details on Cray's implementation. HMPP has provided an OpenACC Reference Manual.
Tutorials
The OLCF provides a vector addition example code demonstrating the OpenACC accelerator directives.
PGI Accelerator
The Portland Group provides accelerator directive support with their latest C and Fortran compilers. Performance and feature additions are still taking place at a rapid pace, but the framework is currently stable and full-featured enough to use in production code.
Use
To make use of the PGI accelerator directives, the cuda module and pgi programming environment must be loaded:
$ module load cudatoolkit
$ module load PrgEnv-pgi
To specify the platform that the compiler directives should be applied to, the Target Accelerator flag is used:
$ cc -ta=nvidia source.c
$ ftn -ta=nvidia source.f90
Resources
PGI provides a useful web portal for Accelerator resources. The portal links to the PGI Fortran & C Accelerator Programming Model which provides a comprehensive overview of the framework and is an excellent starting point. In addition the PGI Accelerator Programming Model on NVIDIA GPUs article series by Michael Wolfe walks you through basic and advanced programming using the framework providing very helpful tips along the way. If you run into trouble PGI has a user forum where PGI staff regularly answer questions.
Examples
The OLCF provides both a vector addition and Game of Life example code demonstrating the PGI accelerator directives.
CAPS HMPP Workbench
The core of CAPS Enterprise's GPU directive framework is HMPP Workbench. HMPP Workbench is a compiler and runtime environment that interprets OpenHMPP and OpenACC directives and, in conjunction with your traditional compiler (PGI, GNU, Cray, or Intel C or Fortran compiler), creates GPU-accelerated executables. In addition to HMPP Workbench, CAPS also provides the HMPPWizard performance analysis tool to help optimize and guide directive use (only compatible with HMPP 2.5.4 for the moment).
Use
To use the CAPS accelerator framework, you will need the cuda and hmpp modules loaded. Additionally, a PGI, GNU, or Intel programming environment must be enabled.
$ module load cudatoolkit
$ module load hmpp
$ module load PrgEnv-gnu
HMPP acts as a preprocessor generating accelerator code and then linking it in.
$ hmpp cc source.c
$ hmpp ftn source.f90
Resources
CAPS provides several documents and code snippets to get you started with HMPP Workbench. It is recommended to start with the HMPP directives reference manual and the HMPPCG reference manual.
Examples
The OLCF provides both a vector addition and Game of Life example code demonstrating the HMPP accelerator directives.
10.3. Low-Level GPU Languages
For complete control over the GPU, Titan supports C for CUDA, PGI's CUDA Fortran, and OpenCL. These languages and language extensions, while allowing explicit control, are generally more cumbersome than directive-based approaches and must be maintained to stay up to date with the latest performance guidelines. Substantial code structure changes may be needed, and an in-depth knowledge of the underlying hardware is often necessary for best performance.
NVIDIA C for CUDA
NVIDIA's C for CUDA is largely responsible for launching GPU computing to the forefront of HPC. With a few minimal additions to the C programming language NVIDIA has allowed low level control of the GPU without having to deal directly with a driver level API.
Use
To setup the CUDA environment the cuda module must be loaded:
$ module load cudatoolkit
This module will provide access to NVIDIA-supplied utilities such as the nvcc compiler, the CUDA visual profiler (computeprof), cuda-gdb, and cuda-memcheck. The environment variable CUDAROOT will also be set to provide easy access to NVIDIA GPU libraries such as cuBLAS and cuFFT.
To compile we use the NVIDIA CUDA compiler, nvcc.
$ nvcc source.cu
For a full usage walkthrough please see the supplied tutorials.
Resources
NVIDIA provides a comprehensive web portal for CUDA developer resources here. The developer documentation center contains the CUDA C programming guide, which very thoroughly covers the CUDA architecture. The programming guide covers everything from the underlying hardware to performance tuning and is a must-read for those interested in CUDA programming. Also available on the same downloads page are whitepapers covering topics such as Fermi tuning and CUDA C best practices. The CUDA SDK is available for download as well and provides many samples to help illustrate C for CUDA programming technique. For personalized assistance, NVIDIA has a very knowledgeable and active user forum.
Examples
The OLCF provides both a vector addition and Game of Life example code demonstrating C for CUDA usage.
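For orientation, below is a minimal, self-contained C for CUDA vector addition (an illustrative sketch, not the OLCF example code). The kernel assigns one thread per array element; the host allocates device memory, copies the inputs over, launches the kernel, and copies the result back:

#include <stdio.h>
#include <stdlib.h>

/* Each thread computes one element of the result */
__global__ void vecadd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

int main(void)
{
    int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *a = (float *)malloc(bytes);
    float *b = (float *)malloc(bytes);
    float *c = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) { a[i] = i; b[i] = 2.0f * i; }

    /* Allocate device memory and copy the inputs over */
    float *d_a, *d_b, *d_c;
    cudaMalloc((void **)&d_a, bytes);
    cudaMalloc((void **)&d_b, bytes);
    cudaMalloc((void **)&d_c, bytes);
    cudaMemcpy(d_a, a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, bytes, cudaMemcpyHostToDevice);

    /* Launch enough 256-thread blocks to cover all n elements */
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vecadd<<<blocks, threads>>>(d_a, d_b, d_c, n);

    /* Copy the result back and spot-check it */
    cudaMemcpy(c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[42] = %f\n", c[42]);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(a); free(b); free(c);
    return 0;
}

Compiled with nvcc source.cu, the resulting executable is then run on a compute node with aprun.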
PGI CUDA Fortran
PGI's CUDA Fortran provides a well-integrated Fortran interface for low-level GPU programming, doing for Fortran what NVIDIA did for C with C for CUDA. PGI worked closely with NVIDIA to ensure that the Fortran interface provides nearly all of the low-level capabilities of the C framework.
Usage
CUDA Fortran will be properly configured by loading the pgi programming environment:
$ module load PrgEnv-pgi
To compile a file with the .cuf extension, we use the PGI Fortran compiler as normal:
$ ftn source.cuf
For a full usage walkthrough please see the supplied tutorials.
Resources
PGI provides a comprehensive web portal for CUDA Fortran resources here. The portal links to the PGI Fortran & C Accelerator Programming Model, which provides a comprehensive overview of the framework and is an excellent starting point. The web portal also features a set of articles covering introductory material, device kernels, and memory management. If you run into trouble, PGI has a user forum where PGI staff regularly answer questions.
Examples
The OLCF provides both a vector addition and Game of Life example code demonstrating CUDA Fortran usage.
OpenCL
The Khronos Group, a not-for-profit industry consortium, currently maintains the OpenCL (Open Computing Language) standard. The OpenCL standard provides a common low-level interface for heterogeneous computing. At its core, OpenCL comprises a kernel language (C with extensions, similar to C for CUDA) along with a C API to control data management and code execution.
Usage
The cudatoolkit module must be loaded for the OpenCL header files to be found, and a PGI or GNU programming environment must be enabled:
$ module load PrgEnv-pgi
$ module load cudatoolkit
To use OpenCL, you must link against the OpenCL library:
$ gcc source.c -lOpenCL
Resources
Khronos provides a web portal for OpenCL. From here you can view the specification, browse the reference pages, and get individual level help from the OpenCL forums. A developers page is also of great use and includes tutorials and example code to get you started.
In addition to the general Khronos-provided material, users will want to check the vendor-specific information available for capability and optimization details. Of main interest to OLCF users will be the AMD and NVIDIA OpenCL developer zones.
Examples
The OLCF provides both a vector addition and Game of Life example code demonstrating OpenCL usage.
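To give a feel for the host API, below is a compact, illustrative OpenCL vector addition (a hypothetical sketch, not the OLCF example code). Error checking is omitted for brevity; real code should check every returned status:

#include <stdio.h>
#include <CL/cl.h>

/* The kernel language is C with extensions; the kernel ships as a
   string and is compiled at runtime. */
static const char *src =
    "__kernel void vecadd(__global const float *a,\n"
    "                     __global const float *b,\n"
    "                     __global float *c) {\n"
    "    int i = get_global_id(0);\n"
    "    c[i] = a[i] + b[i];\n"
    "}\n";

int main(void)
{
    enum { N = 1024 };
    float a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0f * i; }

    /* Boilerplate: first platform, first GPU device, a context, a queue */
    cl_platform_id plat;
    cl_device_id dev;
    clGetPlatformIDs(1, &plat, NULL);
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, NULL);
    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
    cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, NULL);

    /* Build the kernel from source at runtime */
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
    cl_kernel k = clCreateKernel(prog, "vecadd", NULL);

    /* Device buffers; inputs are copied from host memory at creation */
    cl_mem da = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof a, a, NULL);
    cl_mem db = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof b, b, NULL);
    cl_mem dc = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, sizeof c, NULL, NULL);

    clSetKernelArg(k, 0, sizeof(cl_mem), &da);
    clSetKernelArg(k, 1, sizeof(cl_mem), &db);
    clSetKernelArg(k, 2, sizeof(cl_mem), &dc);

    /* One work-item per element; the blocking read waits for completion */
    size_t global = N;
    clEnqueueNDRangeKernel(q, k, 1, NULL, &global, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(q, dc, CL_TRUE, 0, sizeof c, c, 0, NULL, NULL);

    printf("c[42] = %f\n", c[42]);
    return 0;
}

Built with gcc source.c -lOpenCL as shown above, the same host code runs on any platform that supplies an OpenCL implementation.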
11. Debugging and Optimizing Code on Titan
There are a number of code debugging and profiling tools available to users on Titan. This section provides some general guidelines that should apply broadly to all applications; the focus, however, is on applications running on the Titan XK7 system.
DDT
DDT is the primary debugging tool available to users on Titan. DDT is a commercial debugger sold by Allinea Software, a leading provider of parallel software development tools for High Performance Computing.
For more information on DDT, see the DDT software page, or Allinea's DDT support page.
Optimization Guide for AMD64 Processors
AMD offers guidelines specifically for serial code optimization on the AMD Opteron processors. Please see AMD's Developer Documentation site for whitepapers and information on the latest generation of AMD processors.
CrayPAT
CrayPAT is a profiling tool that provides information on application performance. CrayPAT is used for basic profiling of serial, multiprocessor, and multithreaded programs. More information can be found on the CrayPAT software page.
File I/O Tips
Spider, the OLCF's center-wide Lustre file system, is configured for efficient, fast I/O across OLCF computational resources. You can find information about how to optimize your application's I/O on the Spider page.
11.1. GPU Performance Tools
To ensure efficient use of Titan, a full suite of performance and analysis tools will be offered. These tools offer a wide variety of support, from static code analysis to full runtime heterogeneous tracing.
NVIDIA Compute Profiler
NVIDIA's command-line profiler provides runtime profiling of CUDA and OpenCL code. No special steps are needed when compiling your code, and any code that utilizes CUDA or OpenCL, including code generated by compiler directives and accelerator libraries, can be profiled. The profile data can be collected in text format, or in CSV format for viewing with the NVIDIA visual profiler.
Running
To set up the Compute Visual Profiler, the cudatoolkit module must be loaded and X11 forwarding must be enabled:
$ module load cudatoolkit
$ cd /tmp/work/$USER
$ export COMPUTE_PROFILE=1 COMPUTE_PROFILE_CSV=1
$ aprun a.out
Running the profiler:
$ nvvp
Resources
The Compute Visual Profiler Guide and Compute Command Line Profiler Guide are available on NVIDIA's developer documentation site and provide comprehensive coverage of the profilers' usage and features.
CAPS HMPP Wizard
The goal of the HMPP Wizard is to identify common compute kernels and provide domain-specific advice on how to best optimize them. This is done statically through a GUI interface and works best when used in conjunction with other HMPP performance tools. Currently the HMPP Wizard is compatible with GNU and Intel compilers.
Running
To run the wizard, either the GNU or Intel programming environment must be loaded along with the hmpp-wizard-perfanalyzer module. Additional setup is performed by a provided shell script:
$ module switch PrgEnv-pgi PrgEnv-gnu
$ module load cudatoolkit
$ module load hmpp-wizard-perfanalyzer
$ source hmppwizard-env.sh
Viewing a report is a multi-step process: the program is compiled, a report is generated, and finally a web browser is launched to view the results.
$ hmppReport -o ./report gcc source.c
$ hmppReport --build -o ./report
$ firefox report/index.html
Resources
The HMPP Wizard User Manual provides a comprehensive overview of the use and features of the wizard.
VampirTrace
VampirTrace provides MPI-enabled temporal and spatial runtime tracing of CPU and GPU activity. The produced OTF data can then be analyzed using the Vampir or TAU GUI visualizers.
Running
To run, the vampirtrace module must be loaded; for GPU traces, the cudatoolkit module must be loaded as well:
$ module load cudatoolkit
$ module load vampirtrace
Once loaded, the program must be recompiled with one of the provided wrappers:
$ vtcc -vt:cc mpicc source.c
$ vtnvcc source.cu
To see the invoked compiler and its arguments:
$ vtcc -vt:verbose
For further help on wrapper compiler options:
$ vtcc -vt:help
Running the executable should then produce the trace files.
Resources
The VampirTrace User Manual provides comprehensive coverage of VampirTrace usage.
Vampir
Vampir is a graphical user interface for analyzing OTF trace data. For small traces, all analysis may be done on the user's local machine running a local Vampir copy. For larger traces, the GUI can be run from the user's local machine while the analysis is done by VampirServer, running on the parallel machine.
Use
The easiest way to get started is to launch the Vampir GUI from an OLCF compute resource; however, a slow network connection may limit usability.
$ module load vampir
$ vampir
Depending on your network connection and trace size, it may be beneficial to install the Vampir GUI locally and run VampirServer for the analysis. To do so, first download and install the Vampir GUI on your local machine.
TAU
TAU provides profiling and tracing tools for C, C++, Fortran, and GPU hybrid programs. Generated traces can be viewed in the included Paraprof GUI or displayed in Vampir.
Use
A simple GPU profiling example could be performed as follows:
$ module switch PrgEnv-pgi PrgEnv-gnu
$ module load tau cudatoolkit
$ nvcc source.cu -o gpu.out
Once the CUDA code has been compiled, tau_exec -cuda can be used to profile the code at runtime:
$ aprun tau_exec -cuda ./gpu.out
The resulting trace file can then be viewed using paraprof:
$ paraprof
Resources
The TAU documentation website contains a complete User Guide, Reference Guide, and even video tutorials.