
Spider – the Center-Wide Lustre File System

Spider is the name of the OLCF’s center-wide Lustre file system that handles most of the operational work for OLCF systems. It is a large-scale Lustre file system with over 19,500 clients, (10) PB of disk space, and a demonstrated bandwidth of 120 GB/s.

Spider is Center-Wide

Spider is currently accessible from nearly all of the OLCF’s computational resources, including Titan and its 300,000+ compute cores.

Note: Because the file system is shared by most OLCF computational resources, times of heavy load may impact file system responsiveness.
Spider is for Temporary Storage

Spider provides a location to temporarily store large amounts of data needed and produced by batch jobs. Due to the size of the file system, the area is not backed up. In most cases, a regularly running purge removes data not recently accessed to help ensure available space for all users. Needed data should be copied to more permanent locations.

Warning: Spider provides temporary storage of data produced by or used by batch jobs. The space is not backed up. Users should copy needed data to more permanent locations.
Spider Comprises Multiple File Systems

Spider comprises (3) file systems:

  File System   Path to User Work Directory
  widow1        /lustre/widow1/scratch/$USER
  widow2        /lustre/widow2/scratch/$USER
  widow3        /lustre/widow3/scratch/$USER
Why three file systems?

There are a few reasons why having multiple file systems within Spider is advantageous.

More Metadata Servers – Currently, each Lustre file system can utilize only one Metadata Server (MDS). Interaction with the MDS is expensive, and heavy MDS access will impact interactive performance. Providing (3) file systems allows the load to be spread over (3) MDSs.

Higher Availability – Running multiple file systems increases our ability to keep at least one file system available at all times.
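The practical upshot for users: metadata-heavy operations (anything that must stat() many files) are what load an MDS. As a rough illustration, a plain directory listing reads only the directory itself, while a long listing stats every entry, which on Lustre translates into one MDS round trip per file. The sketch below uses a hypothetical local directory, not an OLCF path:

```shell
# Sketch of metadata-light vs. metadata-heavy listings. The demo
# directory is hypothetical; on Spider the same contrast applies to
# /lustre/widow[1-3]/scratch/$USER.
demo=/tmp/mds_demo
mkdir -p "$demo"
touch "$demo"/file_{1..5}

# 'ls' needs only a directory read -- cheap on the MDS.
ls "$demo" | wc -l

# 'ls -l' must stat() every entry -- one MDS RPC per file on Lustre.
ls -l "$demo" | grep -c '^-'
```

Both commands report the same five files, but the second generates five times the MDS traffic of a bare directory read, which is why `ls -l` (or wildcard expansion) over a huge scratch directory can feel slow during periods of heavy metadata load.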

Associating a Batch Job with a File System

Through the PBS gres option, users can specify the scratch area used by their batch jobs so that the job will not start if that file system becomes degraded or unavailable.

Creating a Dependency on a Single File System

Line (5) in the following example will associate a batch job with the widow2 file system. If widow2 becomes unavailable prior to execution of the batch job, the job will be placed on hold until widow2 returns to service.

  1: #!/bin/csh
  2: #PBS -A ABC123
  3: #PBS -l size=160000
  4: #PBS -l walltime=08:00:00
  5: #PBS -l gres=widow2
  6:
  7: cd /lustre/widow2/scratch/$USER
  8: aprun -n 120000 a.out
Creating a Dependency on Multiple File Systems

The following example will associate a batch job with the widow2 and widow3 file systems. If either widow2 or widow3 becomes unavailable prior to execution of the batch job, the job will be placed on hold until both widow2 and widow3 are in service.

  -l gres=widow2%widow3
Default is Dependency on All Spider File Systems

If a batch job is not associated with a file system, i.e., if the gres option is not used, the batch job will be associated with all three widow file systems by adding -l gres=widow1%widow2%widow3 to the batch submission. A warning message to this effect is printed to stderr.

Note: To help prevent batch jobs from running during periods where a Spider file system is not available, batch jobs that do not explicitly specify the PBS gres option will be given a dependency on all Spider file systems.
Why Explicitly Associate a Batch Job with a File System?
  • Associating a batch job with a file system will prevent the job from running if the file system becomes degraded or unavailable.
  • If a batch job only uses (1) or (2) of the Spider file systems, specifying those file systems explicitly, instead of taking the default of all (3), will prevent the job from being held if a file system not used by the job becomes degraded or unavailable.
Verifying/Viewing a Batch Job’s File System Association

The checkjob utility can be used to view a batch job’s file system associations. For example:

  $ qsub -lgres=widow2 batchscript.pbs
  851694.nid00004
  $ checkjob 851694 | grep "Dedicated Resources Per Task:"
  Dedicated Resources Per Task: widow2: 1
Available Directories on Spider
User Work

A temporary User Work scratch directory is available for each user (in each Spider file system) at /lustre/widow[1-3]/scratch/$USER

By default, User Work directories are owned by the user, the group is set to the owning user’s userid-named group, and permissions are set to 700. Changes to the default permissions by the owning user will be reset hourly for security purposes. Long-term changes to the directory permissions can be requested by contacting the OLCF User Assistance Center.

Note: Changes to the default permissions of User Work directories are allowed by the owning user, but permissions will be reset hourly for security purposes.

Files in the user scratch directories are subject to the standard purge.

A default file system has been chosen for each user. The /tmp/work/$USER link can be used to access the default directory. A user’s default file system was chosen based on the user’s initial project membership; for example, all users whose initial project membership is climate-centric are placed on the same file system, as are all climate-centric Project Work areas. Using the default file system helps spread load across the file systems and eases data sharing and access between project members.

Note: The /tmp/work/$USER link points to each user’s default scratch directory. Using the default file system is recommended as it helps spread load over all file systems.
Project Work

A temporary Project Work directory is available for each project on (1) of the (3) Spider file systems. The directory can be accessed through the /tmp/proj/ link.

By default, Project Work directories are owned by root, the group is set to the project’s group, and permissions are set to 770. Changes to the directory permissions can be requested by contacting the OLCF User Assistance Center.

Files in the Project Work directories are not currently subject to the standard purge. However, this is subject to change, and users should always consider Spider to be temporary storage.

How do I Determine the Default File System for my User Work/Project Work directory?

Use ls – The following ls commands can be used to determine where a link points. The target location’s path identifies the file system on which the directory exists:

  ls -ld /tmp/work/$USER
  ls -ld /tmp/proj/
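An alternative that prints the fully resolved path directly is readlink -f. The sketch below builds a stand-in link structure under /tmp/demo; the real links live at /tmp/work/$USER and /tmp/proj/, and the username and paths here are purely illustrative:

```shell
# Sketch: resolving a symlink the way /tmp/work/$USER resolves to a
# widow scratch directory. All paths and the user 'joe' are
# illustrative stand-ins, not real OLCF locations.
mkdir -p /tmp/demo/lustre/widow2/scratch/joe
ln -sfn /tmp/demo/lustre/widow2/scratch/joe /tmp/demo/worklink

# Prints the fully resolved target; the 'widow2' component tells you
# which file system the directory lives on.
readlink -f /tmp/demo/worklink
```

On an OLCF login node, `readlink -f /tmp/work/$USER` would similarly print a path containing widow1, widow2, or widow3.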

Use spiderinfo – The spiderinfo utility will list each file system’s status as well as the calling user’s /tmp/work and /tmp/proj file systems:

  $ spiderinfo

    Current lustre status (Tue Jan 25 14:32:27 2011):
    widow1 (up), widow2 (up), widow3 (up)

    Lustre directory information for user 'joe'
    /tmp/work/joe: widow2 (up)
    /tmp/proj/abc123: widow2 (up)
Current Configuration of Spider
                        widow1   widow2   widow3
  Total disk space      2.5 PB   2.5 PB   2.5 PB
  Number of OSTs        336      336      336
  Default stripe count  4        4        4
  Default stripe size   1 MB     1 MB     1 MB
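With the default layout above, a file's data is divided into 1 MB stripes placed round-robin across (4) OSTs (a given file's actual layout can be inspected with Lustre's lfs getstripe command and changed with lfs setstripe). The arithmetic below is a sketch of how a byte offset maps to a stripe object under that default; the chosen offset is arbitrary:

```shell
# Sketch: mapping a byte offset to a stripe object under the default
# layout (stripe count 4, stripe size 1 MB). Purely illustrative
# arithmetic -- real layouts are reported by 'lfs getstripe'.
stripe_count=4
stripe_size=$((1024 * 1024))        # 1 MB default stripe size
offset=$((5 * 1024 * 1024 + 17))    # an arbitrary offset: 5 MB + 17 bytes

# Which 1 MB chunk holds this byte, and which object does
# round-robin placement assign that chunk to?
chunk=$((offset / stripe_size))
stripe_index=$((chunk % stripe_count))
echo "byte $offset -> chunk $chunk -> stripe object $stripe_index"
```

Chunks 0–3 land on objects 0–3, chunk 4 wraps back to object 0, and so on; with the default stripe count of 4, large sequential I/O is therefore spread over four OSTs, which is why stripe settings matter for bandwidth.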
Additional Information

More information on Spider and Lustre can be found on the Spider Best Practices page and the Lustre Basics page.