
PARATEC

Code Description 

PARATEC: Parallel Total Energy Code


General Description

The benchmark code PARATEC (PARAllel Total Energy Code) performs ab-initio quantum-mechanical total energy calculations using pseudopotentials and a plane wave basis set. Total energy minimization of electrons is done with a self-consistent field (SCF) method. Force calculations are also done to relax the atoms into equilibrium positions. PARATEC uses an all-band (unconstrained) conjugate gradient (CG) approach to solve the Kohn-Sham equations of Density Functional Theory (DFT) and obtain the ground-state electron wavefunctions. In solving the Kohn-Sham equations using a plane wave basis, part of the calculation is carried out in Fourier space, where the wavefunction for each electron is represented by a sphere of points, and the remainder is in real space. Specialized parallel three-dimensional FFTs are used to transform the wavefunctions between real and Fourier space and a key optimization in PARATEC is to transform only the non-zero grid elements. [PAR1]
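For reference, the equations being solved have the standard Kohn-Sham form (written here in conventional DFT notation, not in any PARATEC-specific convention): each wavefunction satisfies

    \left[-\tfrac{1}{2}\nabla^2 + V_{ion}(\mathbf{r}) + V_H[\rho](\mathbf{r}) + V_{xc}[\rho](\mathbf{r})\right]\psi_i(\mathbf{r}) = \varepsilon_i\,\psi_i(\mathbf{r}),
    \qquad \rho(\mathbf{r}) = \sum_{i\,\in\,occ} |\psi_i(\mathbf{r})|^2 ,

and in a plane wave basis each \psi_i is expanded as \sum_{\mathbf{G}} c_{i,\mathbf{k}+\mathbf{G}}\, e^{i(\mathbf{k}+\mathbf{G})\cdot\mathbf{r}}; the coefficients c_{i,\mathbf{k}+\mathbf{G}} form the "sphere of points" referred to above.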

The code can use optimized libraries for both basic linear algebra and Fast Fourier Transforms, but because of its global communication requirements, architectures with a poor balance between bisection bandwidth and computational rate suffer performance degradation at higher concurrencies with PARATEC. [Oliker] Nevertheless, owing to its favorable scaling, high computational intensity, and other optimizations, the code generally achieves a high percentage of peak performance on both superscalar and vector systems. [PAR2]

In the benchmark problems supplied here 11 conjugate gradient iterations are performed; however, a real run would typically do between 20 and 60.

Coding

PARATEC consists of about 50,000 lines of Fortran90 code. Preprocessing via m4 is used to include machine-specific routines such as the FFT calls. The version supplied uses MPI although it can also be built for a single-processor run and for SHMEM, if available.

The code typically spends about 30% of its time in BLAS3 routines and 30% in the one-dimensional FFTs on which the three-dimensional FFTs are built. The remainder of the time is spent in various other Fortran90 routines. Vendor libraries such as IBM's ESSL can be used for both the linear algebra and Fourier transform routines. However, PARATEC also includes code that performs the 3-D FFTs via three sets of hand-written one-dimensional FFTs. Many FFTs are done at the same time to avoid latency issues, and only non-zero elements are communicated and computed; as a result, these routines can be faster than vendor-supplied ones. Additional libraries used are ScaLAPACK and BLACS.
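As an illustration of the kind of BLAS3 operation that dominates this part of the runtime, the sketch below builds an overlap matrix between a block of plane-wave coefficient vectors with a single ZGEMM call. It is not taken from PARATEC; the array names and toy dimensions are hypothetical.

    ! Hedged illustration only (not PARATEC source): the characteristic
    ! BLAS3 pattern in a plane-wave code is a ZGEMM forming an overlap
    ! matrix S = C^H * C between blocks of wavefunction coefficients.
    program overlap_sketch
      implicit none
      integer, parameter :: ng = 1000    ! plane-wave coefficients per band (toy size)
      integer, parameter :: nb = 8       ! number of bands in the block (toy size)
      complex*16 :: c(ng, nb), s(nb, nb)
      c = (1.0d0, 0.0d0)                 ! placeholder data
      ! S = conjg(transpose(C)) * C, computed as a single BLAS3 call
      call zgemm('C', 'N', nb, nb, ng, (1.0d0, 0.0d0), c, ng, &
                 c, ng, (0.0d0, 0.0d0), s, nb)
      print *, 'S(1,1) =', s(1, 1)
    end program overlap_sketch

Because such calls operate on large matrix blocks, they sustain a high fraction of a processor's peak floating-point rate, which contributes to the overall efficiency noted above.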

A list of the files that are preprocessed and that may require modification (although this is unlikely) accompanies the distribution documentation.

NOTE Concerning Use of FFTW on 64-bit address platforms

When using the FFTW library on machines that have 64-bit addresses (e.g., AMD Opteron), you must change the Fortran90 declaration of two variables in the file fft_macros.m4h in the subdirectory src/macros/fft. The statement

INTEGER FFTW_PLAN_BWD, FFTW_PLAN_FWD

must be changed to

INTEGER*8 FFTW_PLAN_BWD, FFTW_PLAN_FWD

Otherwise, the program will compile but fail with a segmentation violation in the FFTW call.
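The underlying reason is that an FFTW plan is an opaque C pointer handed back to Fortran; on a 64-bit platform it does not fit in a default 4-byte INTEGER. The following is a minimal sketch of correct usage, shown here with the FFTW3 legacy Fortran interface rather than PARATEC's m4-generated calls:

    ! Hedged illustration only: an FFTW plan handle stores a C pointer, so
    ! it must be declared INTEGER*8 on 64-bit platforms.  This uses the
    ! FFTW3 legacy Fortran interface; PARATEC's calls come from m4 macros.
    program fftw_plan_sketch
      implicit none
      include 'fftw3.f'              ! provides FFTW_FORWARD, FFTW_ESTIMATE, ...
      integer, parameter :: n = 64
      integer*8 :: plan              ! 8-byte handle; a 4-byte INTEGER truncates the pointer
      complex*16 :: in(n), out(n)
      in = (1.0d0, 0.0d0)
      call dfftw_plan_dft_1d(plan, n, in, out, FFTW_FORWARD, FFTW_ESTIMATE)
      call dfftw_execute_dft(plan, in, out)
      call dfftw_destroy_plan(plan)
      print *, 'out(1) =', out(1)
    end program fftw_plan_sketch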

Authorship

See http://www1.nersc.gov/projects/paratec/DOC/

Relationship to NERSC Workload

A recent survey of NERSC ERCAP requests for materials science applications showed that Density Functional Theory (DFT) codes similar to PARATEC accounted for nearly 80% of all HPC cycles delivered to the materials science community. Supported by DOE BES, PARATEC is therefore an excellent proxy for the application requirements of that community. PARATEC simulations can also be used to predict nuclear magnetic resonance shifts. The overall goal of this line of work is to simulate the synthesis and predict the properties of multi-component nanosystems.

Parallelization

Each electron in a plane wave simulation is represented by a grid of points from which the wavefunction is constructed. Parallel decomposition of such a problem can be over n(g), the number of grid points per electron (typically O(100,000) per electron); n(i), the number of electrons (typically O(800) per system simulated); or n(k), the number of k-point sampling points (typically O(1-10)).
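As a rough, illustrative estimate (not a number taken from the benchmark outputs): silicon contributes four valence electrons per atom, so for the medium benchmark of 250 atoms

    n(i) \sim 250 \times 4 = 1000 \ \text{valence electrons} \quad (\approx 500 \ \text{doubly occupied states}),

while n(g) remains of order 10^5 plane-wave coefficients per state.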

PARATEC uses MPI and parallelizes over grid points, thereby achieving a fine-grained level of parallelism. In Fourier space each electron's wavefunction grid forms a sphere, which is decomposed across processors into columns: each processor holds several columns, which are lines of grid points along the z-axis of the FFT grid. Load balancing is important because much of the compute-intensive part of the calculation is carried out in Fourier space. To get good load balancing, the columns are sorted by length in descending order and each column is then assigned to the processor that currently holds the fewest points.
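This assignment is essentially a greedy bin-packing heuristic. The sketch below, with hypothetical routine and array names (it is not PARATEC source), illustrates the idea: sort the columns by length, longest first, then hand each column to the processor that currently holds the fewest grid points.

    ! Hedged sketch of the greedy column-assignment heuristic described above.
    ! Routine and array names are hypothetical; this is not PARATEC source.
    subroutine assign_columns(ncols, nprocs, col_len, owner)
      implicit none
      integer, intent(in)  :: ncols, nprocs
      integer, intent(in)  :: col_len(ncols)   ! grid points in each z-axis column
      integer, intent(out) :: owner(ncols)     ! processor assigned to each column
      integer :: load(nprocs)                  ! points held so far by each processor
      integer :: order(ncols)
      integer :: i, j, k, p, tmp

      ! Sort column indices by length, longest first (simple selection sort).
      do i = 1, ncols
         order(i) = i
      end do
      do i = 1, ncols - 1
         k = i
         do j = i + 1, ncols
            if (col_len(order(j)) > col_len(order(k))) k = j
         end do
         tmp = order(i); order(i) = order(k); order(k) = tmp
      end do

      ! Give each column to the processor currently holding the fewest points.
      load = 0
      do i = 1, ncols
         p = minloc(load, dim=1)
         owner(order(i)) = p
         load(p) = load(p) + col_len(order(i))
      end do
    end subroutine assign_columns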

The real-space data layout of the wavefunctions is a standard Cartesian grid, with each processor holding a contiguous part of the space arranged in columns. Custom three-dimensional FFTs transform between these two data layouts. The three-dimensional FFT is performed as one-dimensional FFTs along the Z, Y, and X directions, with a parallel data transpose between each set of one-dimensional FFTs. These transposes require global interprocessor communication and represent the most important impediment to high scalability.
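The sketch below shows, in serial form and with a naive O(n^2) DFT standing in for the optimized 1-D FFTs, why three passes of one-dimensional transforms along Z, Y, and X produce a full three-dimensional transform; in the parallel code an all-to-all data transpose is inserted between the passes. It is illustrative only and not taken from PARATEC.

    ! Hedged, serial illustration (not PARATEC source): a 3-D transform is
    ! three sets of 1-D transforms taken along the Z, Y, and X directions in
    ! turn.  A naive O(n^2) DFT stands in for the optimized 1-D FFTs; in the
    ! parallel code an all-to-all transpose separates each set.
    program fft_decomposition_sketch
      implicit none
      integer, parameter :: n = 8
      complex*16 :: a(n, n, n)
      integer :: i, j, k
      do k = 1, n                       ! fill with arbitrary data
         do j = 1, n
            do i = 1, n
               a(i, j, k) = cmplx(i + 2*j, 3*k, kind=8)
            end do
         end do
      end do
      do j = 1, n                       ! pass 1: 1-D transforms along Z
         do i = 1, n
            call dft_1d(a(i, j, :))
         end do
      end do
      do k = 1, n                       ! pass 2: 1-D transforms along Y
         do i = 1, n
            call dft_1d(a(i, :, k))
         end do
      end do
      do k = 1, n                       ! pass 3: 1-D transforms along X
         do j = 1, n
            call dft_1d(a(:, j, k))
         end do
      end do
      print *, 'a(1,1,1) =', a(1, 1, 1) ! equals the sum of all input elements
    contains
      subroutine dft_1d(x)              ! naive 1-D DFT, in place
        complex*16, intent(inout) :: x(:)
        complex*16 :: y(size(x))
        integer :: p, q, m
        real(8), parameter :: twopi = 6.283185307179586d0
        m = size(x)
        do p = 1, m
           y(p) = (0.0d0, 0.0d0)
           do q = 1, m
              y(p) = y(p) + x(q) * exp(cmplx(0.0d0, -twopi*(p-1)*(q-1)/m, kind=8))
           end do
        end do
        x = y
      end subroutine dft_1d
    end program fft_decomposition_sketch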

The FFT portion of the code scales approximately as n^2 log(n) and the dense linear algebra portion scales approximately as n^3; therefore, the overall computation-to-communication ratio scales approximately as n, where n is the number of atoms in the simulation.
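In symbols, and with the common assumption (not stated explicitly above) that the transpose communication volume grows as the square of the system size:

    t_{FFT} \propto n^2 \log n, \qquad t_{BLAS3} \propto n^3, \qquad t_{comm} \propto n^2
    \;\Longrightarrow\; \frac{t_{comp}}{t_{comm}} \sim \frac{n^3}{n^2} = n .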


Obtaining Version 6 of the Code

To obtain the latest version of PARATEC and find build instructions, please see the PARATEC software page. Note that the instructions below deal predominantly with the older version 5.1 of the code.

The NERSC-6 PARATEC benchmark input data files can be downloaded from NERSC as a tar file.


Running the Code

The concurrency simply equals the number of MPI tasks. Computational nodes employed in the benchmark must be fully packed; that is, the number of processes or threads executing must equal the number of physical processors on the node.

Invoke the application by typing, for example,

 mpirun -np #tasks paratec.mpi

or

 poe paratec.mpi

PARATEC expects two files, "input" and "Si_POT.DAT", in the directory in which it executes. Copy the file "input.<size>" to "input" for the <size> required.

The important output file is "OUT." The last line contains the time for the run.


Timing Issues

The code is heavily instrumented for timing; the timer is called "gimmetime" and is defined in one of the system-specific source files src/shared/ze_<machine_name>.f90. Note that the timing harness of interest is the one that produces the output string "NERSC_TIME"; it times the main loop in the file pwmain.f90p. The intention is to measure elapsed (wall-clock) time.
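As an illustrative sketch only (the real implementations in the ze_<machine_name>.f90 files are machine-specific), a wall-clock timer of this kind can be as simple as a wrapper around MPI_Wtime:

    ! Hedged illustration: one possible wall-clock implementation of a
    ! gimmetime-style timer on an MPI platform.  The actual routine is
    ! defined per machine in the system-specific source files.
    real(8) function gimmetime()
      implicit none
      include 'mpif.h'
      gimmetime = MPI_Wtime()        ! elapsed (wall-clock) seconds
    end function gimmetime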


Storage Issues

Memory Required By The Sample Problems:

 

small:  0.256 GB (from LoadLeveler)
medium: 1.25 GB (from LoadLeveler)
large:  2.0 GB (from LoadLeveler)


Required Runs

The directory "benchmark" contains input for 3 problem sizes, "input.", where is "small", "medium" and "large". There are also corresponding sample output files, "OUT.". The small case is only used for porting and debugging. Each problem size must be executed with a fixed concurrency as specified below. The intent of these decks is not to gauge scalability but to obtain timing data for the three distinct concurrencies.

All runs simulate silicon in the diamond structure.

 

             small   medium   large
#atoms          16      250     686
Concurrency      4       64     256

A typical calculation might require between 20 and 60 CG iterations to converge the charge density.

There is a subdirectory "benchmark" in which input data files, reference output files, and sample batch submission scripts are located. Note that PARATEC must be executed with "fully packed" nodes, i.e., the number of processes or threads employed on each node should equal the number of physical processors available on the node.


Verifying Results

As many as seven different output files may be produced by the run, only one of which, "OUT", is important for benchmarking purposes. A verification script, "checkout", is provided with the distribution to determine the correctness of the run by comparing "OUT" with the reference "OUT.<size>". The "OUT" files for the medium and large cases should be provided to NERSC to verify the results.

The configuration file ("sysvars.machine_name") used and a complete log of the build process should also be returned to NERSC for verification.


Modification Record

This is PARATEC Release 5.1.13b1


Record of Formal Questions and Answers

No entries as yet.


Bibliography

[PAR1] PARATEC web page: http://www1.nersc.gov/projects/paratec/

[Oliker] "Leading Computational Methods on Scalar and Vector HEC Platforms," Proceedings of SC|05, November 12-18, 2005, Seattle, Washington, USA.

[PAR2] "Scaling First Principles Materials Science Codes to Thousands of Processors," CanningNanoscienceSC04.pdf