[2/11/98]

    Hints for Optimizing I/O Using netCDF 2.4.x on Cray Systems
    ===========================================================

++++++++++++++++++++
Notes:
-----
 1. One of the main problems with netCDF on the Cray T90 is "pre-fill",
    which is horrendously slow under netCDF 2.4.x. If you are using
    GFDL's NCIR routines, compile them with "-DNO_NCPREFILL".

 2. NetCDF version 3.3 promises to be significantly faster than version
    2.4.x, and is free of the pre-fill problem. It will be made
    available as soon as possible.
++++++++++++++++++++

Discussion:
==========

For most applications, netCDF file interactions are expected to fall
into one of two categories: basically sequential dumping of data into
the file (eg, history writes), or completely random reads of the file
for plotting/analysis purposes. In the case of writing moderate amounts
of data, and in the case of reading small amounts of data (say, for a
plotting package), the default behavior of the netCDF library on the
Cray should be adequate.

However, there are certain cases where the pattern of disk accesses is
"sparse and regular"; ie, the chunks that are read or written are
non-contiguous, but the stride between them is fixed. More about this
later. Likewise, there may be instances where large volumes of data
need to be written or ingested, in which case efficiency becomes an
issue.

The netCDF library offers a mechanism by which the user can tune I/O
efficiency on Crays: the FFIO layer. You can get a brief introduction
to FFIO from "man intro_ffio"; basically, it offers the capability of
buffering or caching parts of the file, thus saving repeated and costly
disk accesses. FFIO offers several types of I/O layering. You tell the
netCDF library what type of layer to use via the environment variable
NETCDF_FFIOSPEC.

--------------------
Writing netCDF data:
--------------------

Our earlier investigations indicated that the default "bufa" layering
employed by the netCDF library was generally adequate for the simpler,
more sequential writes. More recently, it has been found that a
"cachea" layer can perform even better, and can reduce system CP time
by a factor of 3-4 for writes (and by as much as an order of magnitude
for reads - more later). Furthermore, achieving the best I/O
performance requires consideration of the disk hardware (the size of
its sectors, tracks, and "cylinders").

To define a memory-resident cache before running your code, use the
following as a first-guess, default "cachea" FFIO layer:

   setenv NETCDF_FFIOSPEC cachea:224:2

You can also call PUTENV from within your code:

      INTEGER PUTENV
      .
      .
      I = PUTENV('NETCDF_FFIOSPEC=cachea:224:2')

("PUTENV" is useful if your program reads *and* writes netCDF.)

Either way, this will use an additional 0.23 MW of memory as cache for
each open file. You can increase the size of the cache pages to
multiples of 224 (see below) to get somewhat better wall-clock time (at
the cost of additional memory), or decrease it to some multiple of 8
(the size of one disk sector) to reduce memory usage (at the cost of
increased wall-clock time). System load will also have a *significant*
impact on wall-clock time. User and system CP time increase slightly
with smaller pages, but the change is pretty small.

Note that the value of NETCDF_FFIOSPEC at the time a netCDF file is
opened determines the type of FFIO to be used. You may want to arrange
different FFIO strategies for different files, or different file access
patterns, in which case you should call PUTENV to specify the FFIO
method just before opening each file, as in the sketch below.
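For illustration, here is a minimal Fortran sketch of per-file FFIO
specs. The file names and the particular specs chosen are hypothetical;
the netCDF 2.x Fortran calls (NCCRE, NCOPN, NCCLOS) and the PUTENV
usage follow the snippets elsewhere in this note:

      PROGRAM WDEMO
      INCLUDE 'netcdf.inc'
      INTEGER PUTENV, I, RCODE, IDHIST, IDPLOT
C     The FFIO spec is consulted only at open time, so set it
C     immediately before each open.
C     Write-oriented spec for the mostly-sequential history file:
      I = PUTENV('NETCDF_FFIOSPEC=cachea:224:2')
      IDHIST = NCCRE('history.nc', NCCLOB, RCODE)
C     Read-oriented spec before opening a file for input:
      I = PUTENV('NETCDF_FFIOSPEC=cachea:224:4')
      IDPLOT = NCOPN('plotdata.nc', NCNOWRIT, RCODE)
C     ... define, write, and read data here ...
      CALL NCCLOS(IDHIST, RCODE)
      CALL NCCLOS(IDPLOT, RCODE)
      END

The point of the ordering is simply that each open picks up whatever
NETCDF_FFIOSPEC holds at that moment; changing it later has no effect
on files that are already open.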
You *can* take finer control over cache page usage to gain some added
efficiency, but you must also take responsibility for implementing it
properly! This can be quite "challenging", to say the least. However,
it is ESSENTIAL in cases where small, non-contiguous chunks are
accessed in a pattern with a set "stride". "cachea" layering will work
quite well if you set up a cache page for each "region" being written
to. Note that this is only true for chunks that are of uniform size -
if the chunk size changes, it is almost impossible to determine a cache
strategy that will work.

Writing out vertical slabs
--------------------------

A good example of this type of access is when i-k-j-t data in memory
(ie, vertical slabs) is written to an i-j-k-t file - each level of the
slab must be stored separately in the correct location with the other
rows in that horizontal slab. (The netCDF files you produce should be
i-j-k-t if you want to take advantage of any of the available graphics
packages.) The netCDF library will take care of putting the data in all
the right locations, but this represents a challenge because of the
necessary skipping about in the disk file during the write. This is not
CP intensive, but wall-clock time can absolutely balloon. Proper
caching will limit the number of physical disk accesses needed.

For these applications, define a cache layer before running your code
using the following:

   setenv NETCDF_FFIOSPEC cachea:AA:BB

where "AA" and "BB" are defined below. You can also call PUTENV:

      INTEGER PUTENV
      I = PUTENV('NETCDF_FFIOSPEC=cachea:AA:BB')

To reset the FFIO method to the default recommended method, call PUTENV
again:

      I = PUTENV('NETCDF_FFIOSPEC=cachea:224:2')

(In particular, don't forget to reset it if you subsequently need to
*read* data from a netCDF file - see below.)

"BB" is the number of cache pages, and this should be set to
approximately

   BB = NZ + 10 + NV

where NV is the total number of variables and NZ is the total number of
*horizontal* slabs of data at one time level. For example, if you had 3
variables with 10 vertical levels and 4 variables with only 1 vertical
level, NV would be 7 and NZ would be 34. "BB" can be quite sensitive -
a difference of a few pages can make a sizable difference in
performance. The above is probably larger than you need - try it first,
then whittle it down until you notice performance dropping off. If you
don't have enough pages, things will slow down very quickly.

"AA" is the cache page size, in blocks (1 block = 512 8-byte words).
You will want to be careful about the choice of "AA", since the entire
cache will reside in main memory (although bigger is still better). The
amount of extra memory you'll use is

   BB * AA * 512 words

For AA, try to stick to multiples of 8, since one disk "sector" on
GFDL's current DD-302 disks (FTMPDIR) is 8 blocks.

There is a logical limit to how big AA should be: the total size of the
cache should not exceed the size of one complete time slice of all the
data variables. So, if NPTS is the number of points in one horizontal
slab, the size of one time slice of data is

   NPTS * NZ * 4 bytes

If the size of your cache, given by

   BB * AA * 512 * 8 bytes

is larger than the size of the data slice, reduce AA (but don't go
below 4). Alternatively, since you are setting aside enough memory to
hold an entire time slice, you could set up FFIO with fewer, larger
cache pages - this is, in general, more efficient to process. (Hint:
always add an extra page or two - this appears to facilitate better
double-buffering.) A short sketch of this sizing arithmetic follows.
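As an illustration, here is a Fortran sketch of the AA/BB sizing rules
above. The grid numbers (NV, NZ, NPTS) are hypothetical stand-ins for
your own model's values; only the formulas come from this note:

C     Sizing sketch: hypothetical grid, formulas as given above.
      INTEGER NV, NZ, NPTS, AA, BB, CACHSZ, SLICSZ
C     3 vars x 10 levels + 4 vars x 1 level:
      NV   = 7
C     Total horizontal slabs at one time level:
      NZ   = 34
C     Points in one horizontal slab:
      NPTS = 144*90
C     First-guess number of cache pages:
      BB   = NZ + 10 + NV
C     Page size, in 512-word blocks:
      AA   = 224
C     Total cache size vs. one time slice of data, both in bytes:
      CACHSZ = BB*AA*512*8
      SLICSZ = NPTS*NZ*4
      IF (CACHSZ .GT. SLICSZ) PRINT *,
     &   'cache > one time slice: reduce AA (multiples of 8, >= 4)'

With these particular numbers, one time slice is about 1.8 MB while a
224-block-page cache would be about 47 MB, so AA would come down to 8
(ie, "cachea:8:51").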
A note about SDS caching
------------------------

Do NOT try to use SDS caching of the form

   cachea.sds:224:2

This would be nice if it worked, since you wouldn't have to use main
memory for buffering. Unfortunately, a Cray bug causes incomplete
flushing of the data to the disk file, and the file will be useless.
Cray is aware of the problem.

Two-Layer, Memory- and SDS-Resident Cache:
------------------------------------------

An even faster option is to use a "two-layer cache", where the second
layer resides in SDS. This is *particularly* fast as long as the entire
file fits within your SDS limit. The specification takes the form:

   setenv NETCDF_FFIOSPEC cache:AA:BB,sds:4096::4096

Don't forget that you'll need enough SDS to hold *all* your open files.

***********
* WARNING *
***********

If you specify this two-layer cache in your shell via a "setenv" call,
"ncdump" (and possibly other utilities) will not work correctly. You
will have to reset the variable to make them work:

   unsetenv NETCDF_FFIOSPEC

or set it to something like:

   setenv NETCDF_FFIOSPEC cachea:64:2

--------------------
Reading netCDF data:
--------------------

The single best piece of advice we can give about reading netCDF files
is to avoid going back for any "metadata" (eg, coordinates, variable
names and sizes, units, etc.) once you have started to read the data
itself. Doing so will likely require dumping a cache page in order to
use it to hold the "header" information. This has a cascading effect,
because the "oldest" cache pages are re-used first, which can royally
disrupt the page sequencing during intensive reading activity.

Another tip is to avoid closing and re-opening files. Not only is the
system overhead expensive, but you will typically have to re-read all
the metadata for the "newly" opened file.

In terms of the cache layer specification, less work has been done on
optimizing FFIO for reading netCDF files. But reading is, presumably,
less complicated, because almost all reads (at least those requiring
speedy I/O) are likely to be horizontal-slab oriented. In that case,
the best configuration we've come up with is the same as that mentioned
earlier, which seems to work fine.

In your shell, before running a file-reading program:

   setenv NETCDF_FFIOSPEC cachea:224:4

(If memory is a problem, reducing the page size to some smaller
multiple of 8 is OK.) Or, in your code, before you open the file for
reading:

      INTEGER PUTENV
      I = PUTENV('NETCDF_FFIOSPEC=cachea:224:4')

(Again, "PUTENV" is useful if your program reads *and* writes netCDF,
or if different FFIO schemes are needed for different files.)

More experimentation with optimizing read efficiency is needed.
(Cooperative investigations with Cray are ongoing.) Also, the next
release of GFDL's NCIR library will concentrate on increasing reading
efficiency, principally by minimizing access to the "header"
information.

One known problem with reading netCDF files involves retrieving the
coordinate values along the UNLIMITED dimension. Filling the large
cache pages just to get the coordinates is very inefficient. If the
record axis has many points (the current trigger is 13 or more), the
GFDL NCIR routines address this by temporarily opening the same file
again with NETCDF_FFIOSPEC set to "cachea:1:64", reading the
coordinates of the record dimension, and closing the file. In the
unlikely event that the user wants a different FFIO scheme for this
step, it can be specified via the environment variable
"NETCDF_FFIOSPEC_A", which overrides the internal "cachea:1:64"
setting. A sketch of the trick appears below.
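For illustration, here is a minimal Fortran sketch of that temporary
re-open trick, done by hand rather than through NCIR. The file name
('history.nc'), the record dimension/variable name ('time'), and the
TIMES array size are hypothetical; NCOPN, NCDID, NCDINQ, NCVID, NCVGT,
and NCCLOS are the netCDF 2.x Fortran calls:

C     Read the record (time) coordinates through a small-page cache,
C     then close and restore a big-page spec for the data reads.
      INCLUDE 'netcdf.inc'
      INTEGER PUTENV, I, RCODE, NCID, TDIM, TVID, NREC
      INTEGER START(1), COUNT(1)
      CHARACTER*128 DNAME
C     Assumed big enough for the record axis:
      REAL TIMES(10000)
C     Small pages: cheap to fill with just coordinate values
      I = PUTENV('NETCDF_FFIOSPEC=cachea:1:64')
      NCID = NCOPN('history.nc', NCNOWRIT, RCODE)
      TDIM = NCDID(NCID, 'time', RCODE)
      CALL NCDINQ(NCID, TDIM, DNAME, NREC, RCODE)
      TVID = NCVID(NCID, 'time', RCODE)
      START(1) = 1
      COUNT(1) = NREC
      CALL NCVGT(NCID, TVID, START, COUNT, TIMES, RCODE)
      CALL NCCLOS(NCID, RCODE)
C     Restore the big-page spec before re-opening for data reads
      I = PUTENV('NETCDF_FFIOSPEC=cachea:224:4')
      NCID = NCOPN('history.nc', NCNOWRIT, RCODE)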
--------------------------------------------------------------

Please share your experiences, particularly if you can come up with a
more useful "formula" for determining the best I/O layering. If you
have any problems or questions, just stop by or send mail.

John Sheldon
jps@gfdl.gov
(609) 987-5053
http://www.gfdl.gov