OLCF User Assistance Center

Can't find the information you need below? Need advice from a real person? We're here to help.

OLCF support consultants are available to respond to your emails and phone calls from 9:00 a.m. to 5:00 p.m. EST, Monday through Friday, exclusive of holidays. Emails received outside of regular support hours will be addressed the next business day.

Getting Started

Access to OLCF resources is limited to approved users and allocations. This page lists the steps to gain access to the OLCF systems and some basic system usage for new users.

1 Project Allocation Request

The resources of the Oak Ridge Leadership Computing Facility (OLCF) are allocated via projects. The type of project request (listed below) will determine the application and review procedure. Approved projects will be granted a project allocation of core-hours for a period of time on one or more OLCF systems.

Each user account must be associated with at least one project allocation. Once a project allocation has been approved, users can apply for user accounts on the project to run jobs. If you would like to join an already existing project allocation, you can skip the rest of this section and go directly to the section on Account Requests.

If you have any questions about project types or the procedures for applying for a project allocation that are not answered here, feel free to contact the OLCF Accounts Team at accounts@ccs.ornl.gov.

                     INCITE             Director’s Discretion   ALCC
Allocations          Large              Small                   Large
Call for Proposals   Once per year      At any time             Once per year
Closeout Report
Duration             1 year             1 year                  1 year
Priority             High               Medium                  High
Quarterly Reports
                     Apply for INCITE   Apply for DD            Apply for ALCC

What are the detailed differences between project types at the OLCF?

INCITE – The Innovative and Novel Computational Impact on Theory and Experiment (INCITE) program invites proposals for large-scale, computationally intensive research projects to run at the OLCF. The INCITE program awards sizeable allocations (typically, millions of processor-hours per project) on some of the world’s most powerful supercomputers to address grand challenges in science and engineering. There is an annual call for INCITE proposals, and awards are made on an annual basis. For more information or to apply for an INCITE project, please visit http://www.er.doe.gov/ascr/INCITE/index.html

ALCC – The ASCR Leadership Computing Challenge (ALCC) is open to scientists from the research community in national laboratories, academia, and industry. The ALCC program allocates computational resources at the OLCF for special situations of interest to the Department, with an emphasis on high-risk, high-payoff simulations in areas directly related to the Department’s energy mission (such as advancing the clean energy agenda and understanding the Earth’s climate), for national emergencies, or for broadening the community of researchers capable of using leadership computing resources. For more information or to submit a proposal, please visit http://www.er.doe.gov/ascr/Facilities/ALCC.html.

DD – Director’s Discretion (DD) projects are dedicated to leadership computing preparation, INCITE and ALCC scaling, and application performance to maximize scientific application efficiency and productivity on leadership computing platforms. The OLCF Resource Utilization Council, as well as independent referees, review and approve all DD requests. Applications are accepted year round via http://www.olcf.ornl.gov/support/getting-started/olcf-director-discretion-project-application/.

Frost – Frost is an internal Oak Ridge development resource available to members of ORNL Computing and Computational Sciences Directorate. Applicants should check with their Group Leader to determine if an active project is available. If not, the Group Leader should submit a project request via http://www.nccs.gov/user-support/access/frost-project-request/.

Vendor – OLCF resources are also available to ORNL vendors. Applications may be submitted year round via http://www.nccs.gov/user-support/access/vendor-project-request/.

What Happens After my Project Request is Approved?

Once a project is approved, an OLCF Accounts Manager will notify the PI, outlining the steps (listed below) necessary to create the project. If you have any questions, please feel free to visit our Knowledge Base or contact the OLCF Accounts Team at accounts@ccs.ornl.gov.

Steps for Activating a Project Once the Allocation is Approved

  1. A signed Principal Investigator (PI) Agreement must be submitted with the project application.
  2. Export Control: The project request will be reviewed by ORNL Export Control to determine whether sensitive or proprietary data will be generated or used. The results of this review will be forwarded to the PI. If the project request is deemed sensitive and/or proprietary, the OLCF Security Team will schedule a conference call with the PI to discuss the data protection needs.
  3. ORNL Personnel Access System (PAS): All PIs are required to be entered into the ORNL PAS. An OLCF Accounts Manager will send the PI a PAS invitation to submit all the pertinent information. Please note that processing a PAS request may take 15 or more days.
  4. User Agreement/Appendix A or Subcontract: A User Agreement/Appendix A or Subcontract must be executed between UT-Battelle and the PI’s institution. If our records indicate this requirement has not been met, all necessary documents will be provided to the applicant by an OLCF Accounts Manager.

Upon completion of the above steps, the PI will be notified that the project has been created and provided with the Project ID and system allocation. At this time, project participants may apply for an account via http://www.olcf.ornl.gov/support/getting-started/olcf-user-account-application/.

2 User Account Request

Once a project allocation has been approved and processed, users can apply for an account on the project using the Account Request Form. There are several steps involved in applying for an account, and we’re here to help you through the process. If you have any questions that are not answered here, please feel free to contact the OLCF Accounts Team at accounts@ccs.ornl.gov.

The first step in the process is to fill out and submit the:

OLCF User Account Application Form

What are all the steps involved in obtaining an account on a project?

    1. Apply for an account using the Account Request Form.
    2. The principal investigator (PI) of the project must approve your account and system access. The Accounts Team will contact the PI for this approval.
    3. If you have or will receive an RSA SecurID from our facility, additional paperwork will be sent to you via email to complete for identity proofing.
    4. Foreign national participants will be sent an Oak Ridge National Lab (ORNL) Personnel Access System (PAS) request specific to the facility and cyber-only access. After we receive your response, approval takes between 15 and 35 days.
    5. Fully-executed User Agreements with each institution having participants are required. If our records indicate your institution needs to sign either a User Agreement and/or Appendix A, the form(s) along with instructions will be sent via email.
    6. If you are processing sensitive or proprietary data, additional paperwork is required and will be sent to you.

Your account will be created and you will be notified via email when all of the following steps are complete:

* Account application form completed;
* Account approved by the project’s PI;
* Identity proofing for participants with an RSA SecurID issued by our facility;
* PAS request approved for foreign nationals;
* Fully-executed User Agreement and/or Appendix A; and
* If a proprietary/sensitive project, additional paperwork completed

3 I Have An Account… What Now?

Once your user account is set up, it’s time to get to work! The following links will walk you through the basics of connecting to and using computational resources at the OLCF. Please note that the following information may be generic to all OLCF systems; system-specific details can be found within the individual system User Guides.

On which systems do I have an account?

Once a requested account has been approved and created, the requesting user will be sent an email containing the system(s) to which the user has access. Details on each listed system can be found in the OLCF Knowledge Base.

In addition to the system(s) listed in the email, all users also have access to the following systems:

home
General-purpose system that can be used to log in to systems not accessible from outside the OLCF network. It is suited to light tasks; running the screen utility is one example. Compiling, data transfer, and long-running or memory-intensive tasks should not be performed on home.
dtn01, dtn02
Data transfer systems. Designed to improve data transfer between OLCF systems and systems outside the OLCF network.
HPSS
The High Performance Storage System (HPSS) provides tape storage for large amounts of data created on OLCF systems. The HPSS can be accessed from any OLCF system through the hsi utility.

How do I connect/login to OLCF systems?

Connection Utilities


To avoid risks associated with using plain-text communication, the only supported remote client on NCCS systems is a secure shell (SSH) client, which encrypts the entire session between the NCCS systems and the client system. Currently, the only authentication method supported is one-time passwords (OTPs); static passwords and private-key authentication are no longer supported.

For example, to connect to Jaguarpf from a UNIX-based system, you’d use the following:

ssh userid@jaguarpf.ccs.ornl.gov

SSH clients are also available for Windows-based systems.

Note that your SSH client must support protocol version 2 (supported by all modern SSH clients). Several security vulnerabilities exist in version 1, and access using a version 1 client is no longer allowed.

Your SSH client must allow keyboard-interactive authentication to access NCCS systems.

For UNIX-based SSH clients, the following line should be in either the default ssh_config file or your $HOME/.ssh/config file:

PreferredAuthentications keyboard-interactive,password

The line may also contain other authentication methods, but keyboard-interactive must be included.
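As a rough illustration of the requirement above, the following Python sketch reads the text of an SSH client configuration and reports whether its first PreferredAuthentications line includes keyboard-interactive. The function names are hypothetical, and the sketch deliberately ignores Host blocks and included files, so treat it as a quick sanity check rather than a full ssh_config parser.

```python
def preferred_auth_methods(config_text):
    """Return the methods on the first PreferredAuthentications line, or None."""
    for raw_line in config_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        # ssh_config accepts "Keyword value" and "Keyword=value";
        # keywords are case-insensitive.
        parts = line.replace("=", " ", 1).split(None, 1)
        if len(parts) == 2 and parts[0].lower() == "preferredauthentications":
            return [method.strip() for method in parts[1].split(",")]
    return None


def allows_keyboard_interactive(config_text):
    """True only if a PreferredAuthentications line lists keyboard-interactive."""
    methods = preferred_auth_methods(config_text)
    return methods is not None and "keyboard-interactive" in methods


if __name__ == "__main__":
    sample = "PreferredAuthentications keyboard-interactive,password\n"
    print(allows_keyboard_interactive(sample))
```

To check a real file, pass in the contents of $HOME/.ssh/config, for example via open(path).read().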

For recent SecureCRT versions, the change can be made through the connection properties menu.

One-Time Password Authentication


All NCCS systems currently use OTPs as their authentication method. To log in to NCCS systems, an RSA SecurID key fob is required.

To activate your new SecurID key fob, do the following:

* Initiate an SSH connection to home.ccs.ornl.gov.
* When prompted for a PASSCODE, enter the token code shown on the fob.
* You will be asked if you are ready to set your PIN. Answer with “Y.”
* You will be prompted to enter a PIN. Enter a 4- to 6-digit number you can remember. You will then be prompted to reenter your PIN.
* You will then be prompted to wait until the next token code appears on your fob and to enter your PASSCODE, which is now your PIN + 6-digit token code displayed on your fob.
* Your PIN is now set, and your fob is activated and ready for use.

To use your fob, do the following:

When prompted for your PASSCODE, enter your PIN + 6-digit token code shown on the fob. For example, if your PIN is 1234 and the token code is 987654, enter 1234987654 when you are prompted for a PASSCODE.
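The PASSCODE rule can be written out as a short Python sketch. The function name build_passcode is hypothetical and exists only to make the concatenation and the length rules (a 4- to 6-digit PIN followed by a 6-digit token code) explicit; a real PIN should be typed at the prompt and never stored in a script.

```python
def build_passcode(pin, token_code):
    """Concatenate a 4- to 6-digit PIN with the 6-digit token code from the fob."""
    if not (pin.isdigit() and 4 <= len(pin) <= 6):
        raise ValueError("PIN must be 4 to 6 digits")
    if not (token_code.isdigit() and len(token_code) == 6):
        raise ValueError("token code must be exactly 6 digits")
    return pin + token_code


if __name__ == "__main__":
    # Values taken from the example in the text above.
    print(build_passcode("1234", "987654"))
```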

How can I keep up with system outages and other events?

The OLCF provides users with several ways of staying informed about system outages.

System Announcement Lists
These are low-volume lists as compared to the System Status lists. Messages of interest to all users (system upgrades, long-term outages, etc.) are sent to these lists. Since they are low-volume lists and the information sent is important to all users, users are automatically subscribed to these lists when their accounts are set up.
Weekly Update
Each week, typically on Friday afternoon, an email announcing the next week’s scheduled outages is sent to all users. This message also includes meeting announcements and other items of interest to all NCCS users. If you are an NCCS user but are not receiving this weekly message, please contact the NCCS User Assistance Center.
System Status Pages
The OLCF System Status Page shows the current status of selected OLCF systems. The status arrow for each system is a hyperlink to a page with additional detail about that system, including recent and upcoming downtimes as well as other notable events.
System Status Lists
The OLCF also provides email lists that send automated notices about the status of OLCF systems as well as other notable system events. As these are high-volume lists, they are offered on an opt-in basis. More information on the lists is available on the OLCF website.
Message of the Day
In addition to other methods of notification, the system motd or “Message of the Day” that is echoed upon login shows recent downtimes. Important announcements are also posted to the motd. Users are encouraged to take a look at the motd upon login to see if there are any important notices.

What filesystems are available on the OLCF systems?

Available Storage Areas


NCCS users are provided with multiple user and project storage areas, each of which can be classified in one of three ways:

home areas
Home areas/directories are provided on a Network File System (NFS). The user home areas are by default accessible only to the owning user. Similarly, the project home areas are accessible only to members of the project. The areas are backed up, but available space is limited. Because space is limited, each area has a quota. Users should store small source code, scripts, and other similar items in the area. Users should not store large job output or input in the area. Job I/O should be performed in the system’s temporary work area.
work areas
Temporary work directories are provided to each user and project on Lustre file systems. Similar to the home areas, by default, the user and project areas are accessible to the user and project members. The areas provide a large amount of storage, but are not backed up. In general, job I/O performance will be faster in the Lustre areas than in the NFS-mounted home areas. The temporary work areas are regularly purged of data that has not been recently accessed. Because of this, all needed data should be backed up to the HPSS.
archive areas
Archive directories are provided on the High Performance Storage System (HPSS). Similar to the home and work areas, by default, the HPSS user and project areas are accessible to the user and project members. The HPSS provides tape storage for large amounts of data created on OLCF systems. The HPSS can be accessed from any OLCF system through the hsi utility.

User Areas vs. Project Areas


User
User storage areas are intended to house user-specific files that will most likely not be shared with other project members.
Project
Project storage areas are intended to house project-centric files that need to be accessed by other members of the project.

Additional Information


A discussion of each of these storage areas can be found on the filesystems page.

How do I transfer data between OLCF systems and systems outside the OLCF?

Available Tools


The NCCS provides several tools for moving data between computing centers or between our machines and your workstation.

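GridFTP using GridCert
Data security: insecure (default) / secure (with configuration). Authentication: GridCert. Transfer speed: fast. Remote infrastructure: GridFTP server at remote site; user must apply for a DOE GridCert.
GridFTP using SSH
Data security: insecure (default) / secure (with configuration). Authentication: passcode. Transfer speed: fast. Remote infrastructure: GridFTP server at remote site.
SFTP/SCP
Data security: secure. Authentication: passcode. Transfer speed: slow. Remote infrastructure: comes with a standard SSH install.
BBCP
Data security: insecure (unsuited for sensitive projects). Authentication: passcode. Transfer speed: fast. Remote infrastructure: BBCP must be installed on the remote computer.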

Data Transfer Nodes


NCCS provides two nodes dedicated to data transfer: dtn01.ccs.ornl.gov and dtn02.ccs.ornl.gov. These nodes have been tuned specifically for wide-area data transfers, and also perform well on local-area transfers. NCCS recommends that users use these nodes for data transfers as they will, in most cases, improve transfer speed and help decrease the load on computational systems’ login and service nodes.

Which third party software packages are available and how do I use them?

Modules


Software packages on OLCF systems are managed through the modules utility. The modules utility lets you dynamically modify your environment by adding and removing software packages. More information on modules can be found on the modules page.

Available Packages


A number of third party software packages are available for each OLCF system. Software available for a system can be found through the following methods:

module avail
The module avail command can be executed on any OLCF system. The command will list the software modules available on the system.
software web pages
Each installed software package has an associated web page. The software web pages for a system can be found on the software page.

Requesting Additional Software


If you do not see the package or version you need, it can be requested through the software request form.

How do I compile on the OLCF systems?

Compiler Wrappers


When building parallel codes, many systems provide compiler wrappers that will include the needed libraries, headers, and compiler options behind the scenes. For example, when building a parallel code on the Cray systems, it is highly recommended to use the cc, CC, and ftn compiler wrappers.

The system specific compiling articles provide more details on compiling for an individual system.

Compilers Available


A number of compilers, and several versions of each, may be available on each system. With the exception of the IBM systems, compiler versions are controlled through the modules utility. module avail can be used to determine which compilers and versions are installed on a system.

Changing Compilers


On non-IBM systems, changing compilers and compiler versions is performed through the modules utility.

It is important to note that since a module may key off of the compiler module loaded, when changing compilers you may also need to unload and reload any manually loaded modules.

The Cray systems provide module containers to help ensure a correct environment follows compiler changes. When changing compiler modules, the PrgEnv modules should be used for most cases. For example:

> module swap PrgEnv-pgi/a.a.a PrgEnv-pgi/b.b.b

Small Example


The following example hello-mpi.c code can be used to test:

#include <stdio.h>
#include "mpi.h"

int main (int argc, char *argv[])
{
  int rank, nproc, nid;

  /* Start MPI, then query this process's rank and the total process count. */
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nproc);

  /* Cray-specific call: look up the ID of the node this rank runs on. */
  PMI_CNOS_Get_nid(rank, &nid);

  printf("  Rank: %10d  NID: %10d  Total: %10d \n",rank,nid,nproc);

  MPI_Finalize();

  return 0;
}

The following will swap from the PGI programming environment to the PathScale one and build the example code on a Cray system using the default version of the PathScale compiler:

> module swap PrgEnv-pgi PrgEnv-pathscale
> cc hello-mpi.c -o hello-mpi.x

Are tools available to help debug?

To aid in debugging efforts the following tools are provided on most OLCF systems:

DDT
Allinea’s Distributed Debugging Tool (DDT) is a parallel debugger. For more information, see the DDT page in the general support software section.
Totalview
The TotalView debugger is a tool that lets you debug, analyze, and tune the performance of complex serial, multiprocessor, and multithreaded programs. For more information, see the TotalView page in the general support software section.

How do I run on the compute resources?

Compute vs. Login Resources


When you connect to a system, you are placed on one of the system’s login nodes. Login node resources are shared by all users of a system. Because of this, users should be mindful when performing tasks on the node’s shared resources. Login nodes should be used for basic tasks such as file editing, code compilation, data backup, and job submission. Login nodes should not be used for memory or processing intensive tasks. Users should also limit the number of simultaneous tasks performed on the login resources. For example, a user should not run ten simultaneous tar processes.

Memory and processor intensive tasks as well as production work should be performed on the system’s compute resources. Access to the compute resources is managed by the batch system (Torque and MOAB).

Batch Systems


On non-IBM systems, the MOAB/Torque (PBS) batch systems are used to allocate/access the compute resources. On IBM systems LoadLeveler is used.

A batch system allows users to request cores/nodes for a specified amount of time. The batch system will organize the resource requests by priority and allocate the requested resources as they become available.

Batch Commands


The following are common batch arguments:

-A
Required. Specifies the ProjectID to run the job against. The job’s used cpu-hours will be deducted from the given ProjectID’s allocation.
-l walltime=HH:MM:SS
Required. Specifies the amount of time to request the resources. A batch job will be killed if it is still running when the requested time expires. If a batch job completes in less time, the resources will be released.
-l size=24000
Required on the Cray XK systems. Requests 24,000 XK cores.
-l nodes=2:ppn=16
Required on the cluster systems. Requests 32 cores, 16 on each of two nodes.
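The arithmetic behind a nodes/ppn request can be sketched in a few lines of Python. The helper name cores_from_nodes_spec is hypothetical, and the sketch assumes the simple nodes=N:ppn=M form shown above, not the full range of resource strings the batch system may accept.

```python
def cores_from_nodes_spec(spec):
    """Total cores for a request of the form 'nodes=N:ppn=M' (nodes times cores per node)."""
    fields = dict(item.split("=", 1) for item in spec.split(":"))
    return int(fields["nodes"]) * int(fields["ppn"])


if __name__ == "__main__":
    # The cluster example from the list above: 2 nodes with 16 cores each.
    print(cores_from_nodes_spec("nodes=2:ppn=16"))
```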

Job Execution


Once compute resources have been allocated, the parallel binary will need to run on each of the allocated cores. On Cray systems, the aprun utility will perform this task. On cluster systems, mpirun or mpiexec_mpt should be used. For example:

XK:

aprun -n 16 a.out

Cluster:

mpiexec_mpt -np 16 a.out
mpirun -np 16 a.out

Batch Submission


The qsub command can be used to submit the batch script to the batch system. Batch options can be given at the top of the batch script preceded by #PBS or on the command line.

Once the job has been submitted, a batch identifier will be returned.

Viewing Batch Queue


The showq utility can be used to see the batch queue.

checkjob can be used to see details of a currently queued batch job.

MPI Example


The following batch script will request 124,000 cores for 24 hours and then run a.out on the allocated resources:

> cat script.pbs
#PBS -A ABC123
#PBS -l walltime=24:00:00
#PBS -l size=124000

cd /tmp/work/$USER
date
aprun -n 124000 a.out

> qsub script.pbs
99955051.nid00004
>

The job was assigned jobid 99955051. The id can be used to find the job in the queue, alter the job (qalter), or delete the job (qdel).
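For users who generate many similar batch scripts, the pieces described above can be assembled programmatically. The Python sketch below is one possible approach: render_pbs_script and job_number are hypothetical helper names, the script layout simply mirrors the Cray example (project, walltime, and size options followed by an aprun line), and the job-number parsing assumes the identifier.hostname form returned by qsub in that example.

```python
def render_pbs_script(project_id, walltime, cores, executable="a.out"):
    """Build the text of a Cray-style batch script like the example above."""
    lines = [
        f"#PBS -A {project_id}",          # project to charge
        f"#PBS -l walltime={walltime}",   # requested time, HH:MM:SS
        f"#PBS -l size={cores}",          # requested cores
        "",
        "cd /tmp/work/$USER",
        "date",
        f"aprun -n {cores} {executable}",
    ]
    return "\n".join(lines) + "\n"


def job_number(qsub_output):
    """Extract the numeric job id from qsub output such as '99955051.nid00004'."""
    return qsub_output.strip().split(".", 1)[0]


if __name__ == "__main__":
    print(render_pbs_script("ABC123", "24:00:00", 124000), end="")
    print(job_number("99955051.nid00004"))
```

The returned job number is the value that qalter and qdel expect when altering or deleting the job.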

More Information


More information can be found on each system’s batch job articles.

How do I back-up data created on the OLCF systems?

Data created by batch jobs should initially be stored in the user or project temporary Lustre area. Because this area is regularly purged and not backed up, each user is responsible for backing up any needed data.

Archive directories are provided to each user and project on the High Performance Storage System (HPSS). The HPSS provides tape storage for large amounts of data created on OLCF systems. The HPSS can be accessed from any OLCF system through the hsi utility. HPSS provides an on-site data archive option.

How do I track allocation utilization?

CPU-Hour Calculation


When batch jobs complete, their system utilization is calculated in CPU-Hour units using the following equation:

cores requested by batch job * ( batch job's endtime - batch job's starttime)

Where:

– batch job’s starttime is the time the job moves into a running state.

– batch job’s endtime is the time the job exits a run state.

A batch job’s usage is calculated solely on requested cores and the batch job’s start and end time. The number of cores actually used within the batch job is not used in the calculation. For example, if a job requests 1,024 cores through the batch script, but only uses 2 cores, the job will be charged for 1,024 cores.

The job’s calculated cpu-hours are then deducted from the project’s given allocation.
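The charging rule above can be checked with a small Python sketch. The function name cpu_hours and the sample timestamps are illustrative assumptions; the calculation itself follows the equation given earlier, multiplying requested cores by the elapsed time between the job entering and leaving the running state.

```python
from datetime import datetime


def cpu_hours(cores_requested, start, end):
    """CPU-hours charged: requested cores times elapsed run time in hours."""
    elapsed_hours = (end - start).total_seconds() / 3600.0
    return cores_requested * elapsed_hours


if __name__ == "__main__":
    # A job that requested 1,024 cores and ran for 2.5 hours is charged for
    # all 1,024 cores, regardless of how many it actually used.
    start = datetime(2013, 2, 13, 8, 0)
    end = datetime(2013, 2, 13, 10, 30)
    print(cpu_hours(1024, start, end))
```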

Viewing Usage


A user’s and a project’s cpu-hour usage can be seen from the command line of any OLCF system using the showusage utility. A web-based view is also available at https://users.nccs.gov.

More Information


More information on allocation usage can be found in the allocation utilization article.