I/O Tips

The purpose of this page is to convey tips for getting better I/O performance on the Jaguar Lustre file systems running CNL (Compute Node Linux). Note that this file system (consisting of 15 object storage targets [OSTs]) is temporary and will be much larger after the entire machine is upgraded to run CNL.

Striping

Two commonly used lfs subcommands are getstripe and setstripe. lfs getstripe reports the striping information for files and directories, while lfs setstripe sets the striping parameters (including the Lustre stripe size).

The setstripe usage is as follows:

lfs setstripe <filename|dirname> <stripe size> <stripe start> <stripe count>

where

Stripe size = number of bytes written to each OST before moving on to the next (0 means use the default of 1 MB).
Stripe start = OST index of the first stripe (-1 means use the default). Please use the default only.
Stripe count = number of OSTs to stripe over (0 means use the default of 4; -1 means stripe over all OSTs).

For example, to set the stripe count (width) to 1 on a directory, use

lfs setstripe <dir> 0 -1 1

Setting the stripe count to 1 is suggested when you write one (or more) files per process. The exception would be if you run at small processor counts (<100) and your files are big (>1 GB).

If you use MPI I/O to write (or read) your data from many processes to one or a few files, then we suggest striping your files across most, if not all, of the OSTs. You could use a command such as the following:

lfs setstripe <filename> 0 -1 -1

Note that there is one scratch directory, /lustre/scratch, which has 15 OSTs over which you can stripe.
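
To check the striping that is currently set on a file or directory, use lfs getstripe:

lfs getstripe <filename|dirname>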

We have seen empirically on previous Lustre file systems that setting your stripe count to a multiple of 32 leads to a loss of performance (as compared to, for example, stripe counts that are one less than a multiple of 32). With our current file system, the maximum stripe count is 15, so this does not apply to the current setup.

Parallel I/O

With the /lustre/scratch file system, high parallel I/O bandwidths can be achieved using MPI I/O to either a shared file or a file per process. Similar performance can also be achieved with ordinary Fortran writes within an MPI program (each process writing to its own file).
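
As a minimal sketch of the file-per-process approach (the file-name pattern, unit number, and buffer size below are illustrative choices, not part of the original page), each process simply builds a unique file name from its MPI rank:

! Sketch only: each MPI process writes its own unformatted file.
! Assumes an MPI program (use mpi / include 'mpif.h' and MPI_INIT already done).
integer :: myrank, ierr
character(len=32) :: fname
real(kind=8) :: iobuf(262144)                 ! illustrative 2 MB buffer

call MPI_COMM_RANK(MPI_COMM_WORLD, myrank, ierr)
write(fname, '(a,i6.6)') 'outfile.', myrank   ! e.g. outfile.000007
open(unit=11, file=trim(fname), form='unformatted', status='replace')
write(11) iobuf                               ! one record per process
close(11)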

We know from the larger Lustre file system on Jaguar (not running CNL) that high bandwidth rates are observed when doing parallel I/O with anywhere from 500 to 2,000 clients. (Note that these numbers are approximations and are most likely a function of the number of OSTs in the file system.) When more than 2,000 clients are used, the OSTs appear to get overwhelmed and bandwidth rates drop off significantly; if you use too few clients, the OSTs are not fully used or kept busy enough. So for the Lustre file system on Jaguarcnl with 15 OSTs, using roughly 60 to 150 clients will likely be optimal.

What does it mean for the user that there is a range of process counts that gives the best I/O performance? It means the following:

  • If you are in that range or below it, your current parallel I/O method is probably fine: whether you write to one shared file or create one file per process, your I/O performance should be acceptable. You will, however, need to make sure you set your striping appropriately.
  • If you are running beyond the high end of the range (>150 processes), you might want to consider using a subset of your MPI processes to do the I/O. This requires code changes, but the changes are not difficult, and the performance gain can be nearly an order of magnitude. It is probably worthwhile only if your I/O takes more than 5% of your runtime, or if you would like to do more I/O but don’t because of the cost. Please contact the NCCS User Assistance Center if you would like help with your parallel I/O. The example below creates an MPI communicator that includes only the I/O nodes:
! listofionodes is an array of the ranks of the writers/readers
call MPI_COMM_GROUP(MPI_COMM_WORLD, WORLD_GROUP, ierr)
call MPI_GROUP_INCL(WORLD_GROUP, nionodes, listofionodes, IO_GROUP, ierr)
call MPI_COMM_CREATE(MPI_COMM_WORLD, IO_GROUP, MPI_COMM_IO, ierr)
! MPI_COMM_IO is MPI_COMM_NULL on ranks that are not in listofionodes,
! so only the I/O ranks should execute the file operations below.
! open
call MPI_FILE_OPEN(MPI_COMM_IO, trim(filename), filemode, finfo, mpifh, ierr)
! read/write
call MPI_FILE_WRITE_AT(mpifh, offset, iobuf, bufsize, MPI_REAL8, status, ierr)
!   OR
!  call MPI_FILE_SET_VIEW(mpifh, disp, MPI_REAL8, MPI_REAL8, "native", finfo, ierr)
!  call MPI_FILE_WRITE_ALL(mpifh, iobuf, bufsize, MPI_REAL8, status, ierr)
! close
call MPI_FILE_CLOSE(mpifh, ierr)

If you cannot implement a subsetting approach, it is still to your advantage to limit the number of simultaneous file opens, say to 100 at a time, even if you cannot do so for the writes and reads themselves. This keeps too many requests from hitting the metadata server (of which there is only one) at the same time.
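
One way to do this (a sketch only; the wave size, unit number, and file name below are illustrative assumptions) is to let the processes open their files in waves of at most 100:

! Sketch: stagger the file opens so at most "maxopens" processes
! hit the metadata server at once. (filename is assumed to be set per process.)
integer, parameter :: maxopens = 100
integer :: myrank, nprocs, wave, ierr
call MPI_COMM_RANK(MPI_COMM_WORLD, myrank, ierr)
call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
do wave = 0, (nprocs - 1) / maxopens
   if (myrank / maxopens == wave) then
      open(unit=11, file=trim(filename), form='unformatted', status='replace')
   end if
   call MPI_BARRIER(MPI_COMM_WORLD, ierr)   ! wait for this wave to finish opening
end do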

Update August 2007

We have seen good performance with HDF5 writes and okay performance with HDF5 reads (on Jaguar without CNL). That is, HDF5 writes can get about the same performance as MPI-IO writes. However, the HDF5 read performance is quite a bit below that of MPI-IO, although you can still get low-single-digit GB/s.

Performance is highly dependent on the buffer size and on whether you are doing independent or collective I/O. With respect to buffering, using a small 64 KB buffer versus a large 16 MB buffer can be the difference between 1 GB/s and 16 GB/s when doing parallel I/O with 1,024 processors, for example. If you are doing collective I/O with HDF5, you should set the MPICH_NO_RECORD_LOCKING environment variable to 1. With MPI-IO, if you are going to use file views, you should set the romio_ds_[read,write] hints to disable. If you don’t use these “tricks,” your performance will be in the low single digits of MB/s rather than GB/s (a three-orders-of-magnitude difference).
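
MPICH_NO_RECORD_LOCKING is simply set to 1 in the environment of your batch job. The ROMIO hints can be passed through an MPI_Info object when the file is opened; the following is a minimal sketch (the file name, file handle, and access mode are assumed):

! Sketch: disable ROMIO data sieving via the hints mentioned above.
! (Declarations of filename, mpifh, and ierr are assumed.)
integer :: finfo
call MPI_INFO_CREATE(finfo, ierr)
call MPI_INFO_SET(finfo, 'romio_ds_read',  'disable', ierr)
call MPI_INFO_SET(finfo, 'romio_ds_write', 'disable', ierr)
call MPI_FILE_OPEN(MPI_COMM_WORLD, trim(filename), &
                   MPI_MODE_WRONLY + MPI_MODE_CREATE, finfo, mpifh, ierr)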

We have not done any studies of netCDF, but we are under the impression that it performs much worse than either MPI-IO or HDF5.

IOTA Library

The NCCS Technology Integration Group has also developed an I/O Tracing and Allocation (IOTA) library. IOTA is an interposition library that serves two purposes:

  1. to provide tracing of I/O requests for later analysis and
  2. to allow the user to specify Lustre file allocation (striping) parameters for newly created files.

Both are controlled through environment variables.

For more information, run man iota after loading the IOTA module.

On service nodes running Linux (for example, the login nodes), IOTA is enabled by setting the environment variable LD_PRELOAD to $IOTA_LD_PRELOAD after you have loaded the IOTA module.
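
For example (assuming the module is named iota and a bash-family shell; use setenv with csh/tcsh):

module load iota
export LD_PRELOAD=$IOTA_LD_PRELOAD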

To use IOTA on compute nodes, your executable must be relinked with $IOTA_LD_OPTS appended to the objects to be linked (again, after loading the IOTA module).
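
For example (a sketch only; the ftn compiler wrapper and the object file name are placeholders for your own link line):

module load iota
ftn -o my_app my_app.o $IOTA_LD_OPTS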