
MILC

Code Description

MILC: MIMD Lattice Computation


General Description

This is the description of the NERSC-6 procurement version of the MILC benchmark. The version of the code used is MILC version 7. The inputs and runs required differ from those used for earlier procurements, and the NERSC-6 inputs will not work with earlier versions of the MILC code.

The benchmark code MILC represents part of a set of codes written by the MIMD Lattice Computation (MILC) collaboration and used to study quantum chromodynamics (QCD), the theory of the strong interactions of subatomic physics. It performs simulations of four-dimensional SU(3) lattice gauge theory on MIMD parallel machines. "Strong interactions" are responsible for binding quarks into protons and neutrons and for holding them all together in the atomic nucleus. [MILC]

The MILC collaboration has produced application codes to study several different QCD research areas; only one of them, ks_dynamical (simulations with conventional dynamical Kogut-Susskind quarks), is used here.

QCD discretizes space and evaluates field variables on the sites and links of a regular hypercube lattice in four-dimensional space-time. Each link between nearest neighbors in this lattice is associated with a 3-dimensional SU(3) complex matrix for a given field. [HOLMGREN] The version of MILC used here uses lattices ranging in size from 8^4 to 128^4.
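The data layout this implies can be pictured with a short C sketch (the type and field names below are illustrative only; the actual definitions live in the MILC headers complex.h and su3.h): every site of the 4-D lattice carries one 3 x 3 complex matrix for each forward direction.

/* Illustrative sketch of the gauge-field storage described above.
   The names my_complex, su3_link, and lattice_site are hypothetical. */
typedef struct { float real; float imag; } my_complex;

typedef struct { my_complex e[3][3]; } su3_link;   /* one 3x3 SU(3) matrix */

typedef struct {
    int x, y, z, t;       /* coordinates of this site in the 4-D lattice */
    su3_link link[4];     /* one link matrix per forward direction x, y, z, t */
} lattice_site;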

Coding

The MILC code has been optimized to achieve high efficiency on cache-based superscalar processors. Both ANSI standard C and assembler-based codes for several architectures are provided.

Lines of C code:

Directory       # Files   Total Lines
libraries           112          5715
generic              49         33844
generic_ks           43         19164
ks_imp_dyn2          11          1412

QCD involves integrating an equation of motion for hundreds or thousands of time steps, and each step of the integration requires inverting a large, sparse matrix. The sparse matrix problem is solved using a conjugate gradient (CG) method, but because the linear system is nearly singular, many CG iterations are required for convergence. Within a processor, the four-dimensional nature of the problem requires gathers from widely separated locations in memory. The matrix in the linear system being solved contains sets of complex 3-dimensional "link" matrices, one per 4-D lattice link, and only links between odd sites and even sites are non-zero. The inversion by CG requires repeated three-dimensional complex matrix-vector multiplications, each of which reduces to dot products of pairs of three-dimensional complex vectors. The code separates the real and imaginary parts, producing six dot products of six-dimensional real vectors. Each such dot product consists of five multiply-add operations and one multiply. [GOTTLIEB]
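The arithmetic described above can be illustrated with a short sketch (not MILC's actual routine) that multiplies a 3 x 3 complex link matrix by a 3-element complex vector with the real and imaginary parts held in separate arrays; each output component is then a dot product of two six-dimensional real vectors, i.e., one multiply followed by five multiply-adds.

/* Sketch only: one link matrix times one 3-element complex vector, with
   real parts (ar, br, cr) and imaginary parts (ai, bi, ci) stored separately. */
void su3_mat_vec_sketch(const float ar[3][3], const float ai[3][3],
                        const float br[3], const float bi[3],
                        float cr[3], float ci[3])
{
    for (int row = 0; row < 3; row++) {
        /* Real part: a dot product of two 6-dimensional real vectors. */
        cr[row] = ar[row][0]*br[0] - ai[row][0]*bi[0]
                + ar[row][1]*br[1] - ai[row][1]*bi[1]
                + ar[row][2]*br[2] - ai[row][2]*bi[2];
        /* Imaginary part: another 6-dimensional real dot product. */
        ci[row] = ar[row][0]*bi[0] + ai[row][0]*br[0]
                + ar[row][1]*bi[1] + ai[row][1]*br[1]
                + ar[row][2]*bi[2] + ai[row][2]*br[2];
    }
}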

Authorship

See http://physics.indiana.edu/~sg/milc.html

Relationship to NERSC Workload

MILC has widespread use in the physics community and a large allocation of resources on NERSC systems. It supports research that addresses fundamental questions in high energy and nuclear physics.

Parallelization

The primary parallel programming model for MILC is a 4-D domain decomposition in which each MPI process is assigned an equal number of sublattices of contiguous sites. In a four-dimensional problem each site has eight nearest neighbors.
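As a rough illustration of the bookkeeping this decomposition requires (the function below and its lexicographic rank ordering are assumptions for the sketch, not MILC's actual layout code), each MPI rank occupies a position in a 4-D grid of sublattices and exchanges boundary data with one forward and one backward neighbor in each of the four directions:

/* Hypothetical sketch: rank of the nearest neighbor in one direction,
   assuming a periodic 4-D grid of sublattices and a lexicographic rank
   ordering with x fastest and t slowest. */
int neighbor_rank(const int coord[4],   /* this rank's grid coordinates */
                  const int dims[4],    /* sublattices in each direction */
                  int dir,              /* 0..3 for x, y, z, t */
                  int forward)          /* +1 forward, -1 backward */
{
    int c[4] = { coord[0], coord[1], coord[2], coord[3] };
    c[dir] = (c[dir] + forward + dims[dir]) % dims[dir];   /* periodic wrap */
    return ((c[3]*dims[2] + c[2])*dims[1] + c[1])*dims[0] + c[0];
}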

MILC is normally run in a weak-scaling mode, and the four input files supplied with this distribution implement this.


Obtaining the Code

The entire code as used in the NERSC-6 procurement, with all data files and instructions, is available here (tar file).


Building the Code

If your compiler is not ANSI compliant, try the GNU C compiler gcc instead. Note that if the library code is compiled with gcc, the application directory code must also be compiled with gcc, and vice versa; gcc understands prototypes and some other C compilers do not, so they pass float arguments differently. We recommend gcc. The code can also be compiled with a C++ compiler, as it uses no constructs exclusive to C++.

Note regarding Makefiles. At least two makefiles are involved, although separate compiles are not required. In the libraries subdirectory, the makefile called "Make_vanilla" is currently used, although other possibilities include Make_RS6K, Make_alpha, Make_t3e, Make_SSE_nasm, and Make_opteron. The makefile used in this subdirectory includes compiler options that affect the libraries only. The C compiler used to create objects in this subdirectory can be a serial (i.e., non-MPI) compiler.

In the ks_imp_dyn subdirectory the makefile currently used is called "Makefile." You can edit this file to change compiler options (variable "OPT" around line 53). The PRECISION variable should remain 'single'. The makefile used in this subdirectory includes compiler options that affect code in the ks_imp_dyn, generic, and generic_ks subdirectories only. The C compiler used to create objects in this subdirectory must be an MPI-aware compiler (typically something like mpicc, etc.).

In several subdirectories there is a file called "Make_template" that should not be changed.

Building the code involves first building two libraries. The library complex.a contains routines for operations on complex numbers; see complex.h for a summary. The library su3.a contains routines for operations on SU(3) matrices, 3-element complex vectors, and Wilson vectors (12-element complex vectors); see su3.h for a summary. None of the library routines involves communication, so a sequential compiler, i.e., one not involving MPI "wrapper" scripts, can be used.
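As a rough sketch of the kind of routine complex.a provides (the type and function names here are hypothetical; consult complex.h for the real definitions), a complex multiply amounts to:

/* Hypothetical sketch of a complex type and multiply in the style of the
   complex library; the real definitions are in complex.h. */
typedef struct { float real; float imag; } complex_sketch;

complex_sketch cmul_sketch(complex_sketch a, complex_sketch b)
{
    complex_sketch c;
    c.real = a.real*b.real - a.imag*b.imag;
    c.imag = a.real*b.imag + a.imag*b.real;
    return c;
}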

The following simple steps are required to build the code.

  1. Typing "make clean" in the ks_imp_dyn subdirectory eliminates the object files and the executable in that directory only (i.e., the libraries remain unchanged).
  2. cd to the ks_imp_dyn subdirectory and type "gmake su3_rmd". This command builds the two libraries (complex.1.a and su3.1.a) in the ../libraries subdirectory and the target program in ks_imp_dyn, transferring object files from the ../generic and ../generic_ks subdirectories to ks_imp_dyn as needed. The file "com_mpi.c" in the generic subdirectory contains all the MPI interfaces; "main" is in ks_imp_dyn2/control.c.

There is no automatic detection of operating system done in the build.


Running the Code

A symbolic link in the benchmark_n6 subdirectory points to the ks_imp_dyn/su3_rmd executable.

Input decks and sample batch submission scripts for four problem sizes are provided in the "benchmark_n6" directory: "small", "256", "1024", and "8192". The numbers refer to the target concurrency (MPI tasks) for the "Base Case." The small case should be used only for compilation and run testing and can be run on 2-4 processors. Benchmark timings (and/or projections; see the NERSC-6 Benchmark Instructions document) are required for the other cases.

For the "Base Case," the concurrency equals the number of MPI tasks unless the problem will not fit within the available memory, in which case more processors may be used. MILC, however is not a memory intensive code and is expected to easily fit within available memory.

For the "Base Case" computational nodes employed in the benchmark must be fully-packed (meaning that the number of MPI processes executing on a node may not be less than the number of physical cores on the node).

For the Optional Optimized Case, the concurrency and the density of MPI processes per node may vary as long as the three input decks provided here are still used; see the NERSC-6 Benchmark Instructions document.

Invoke the application with syntax similar to the following:

mpirun -np 4 su3_rmd < small.in

or

mpirun -np 256 su3_rmd < 256.in

or

mpirun -np 1024 su3_rmd < 1024.in

or

mpirun -np 8192 su3_rmd < 8192.in

In other words, the input file must be redirected to the standard input. The exact execution line depends on the system.


Timing Issues

Timing of the NERSC benchmark code MILC is done via the dclock and dclock_cpu function calls made in the ks_imp_dyn2/control.c source file and defined in the generic/com_mpi.c source file. These routines are based on gettimeofday and clock, respectively. For the medium and large cases, extract the elapsed run time from the line labelled "NERSC_TIME."
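For illustration, a gettimeofday-based wall-clock timer in the style of dclock might look like the following sketch (this is not the actual code from com_mpi.c):

#include <sys/time.h>

/* Illustrative wall-clock timer: returns seconds as a double so that
   end - start gives the elapsed wall time, as dclock is used in control.c. */
double dclock_sketch(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + 1.0e-6 * (double)tv.tv_usec;
}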

All three problem sizes are set up to do two runs each. This more accurately represents the work that the CG solver must do in actual QCD simulations. If one starts from a completely ordered system, the CG solver converges in an unrepresentatively small number of iterations, so we first do a short run with a few steps, a larger step size, and a loose convergence criterion. This lets the lattice evolve away from the totally ordered start. In the NERSC-6 version of the code this portion of the run is timed as "INIT_TIME"; it takes about 2 minutes on the NERSC Cray XT4. Then, starting from this "primed" lattice, we tighten the accuracy of the CG solve, and the iteration count per solve rises to a more representative value. This is the portion of the code that is timed and labelled "NERSC_TIME".


Storage Issues

Approximate Memory Required (Per MPI Process) By The Sample Problems:

Small     To be determined
Medium    To be determined
Large     To be determined
XL        To be determined

The minimum memory configuration required to run the problems in each configuration must be reported (OS + buffers + code + data + ...).


Required Runs

There is a subdirectory "benchmark_n6" in which the input data files and sample batch submission scripts are located. Four sample input files for four different size runs are provided.

The three NERSC-6 problems use "weak" scaling, so all three problems have a local lattice size of 8 x 8 x 8 x 9. There are no differences in steps per trajectory or number of trajectories. The table below shows the global lattice size as a function of the target concurrency.

NERSC-6 Input         Lattice Size (Global)
256 (medium)          32 x 32 x 32 x 36
1024 (large)          64 x 64 x 32 x 72
8192 (extra large)    64 x 64 x 64 x 144
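As a check of the weak scaling, consider the 256-task medium case: assuming the layout code arranges the tasks in a 4 x 4 x 4 x 4 grid, each task holding an 8 x 8 x 8 x 9 local lattice reproduces the 32 x 32 x 32 x 36 global lattice (4 x 8 = 32 in each spatial direction and 4 x 9 = 36 in the time direction), so the work per task stays fixed as the concurrency grows.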


Verifying Results

Use "checkout_n6" to check correctness for all four cases. This script prints either "OK" or "Failed." Usage: $ checkout_n6 For reference, sample outputs from runs on the NERSC Cray XT4 system are also provided in the subdirectory sample_outputs. Note: there will be a slight difference in the number of CG iterations on different systems. See the sample outputs for examples.

The script compares the fermion action on the last "PBP:" line of the output file against "correct" values hard-coded in it for the 256, 1024, and 8192 cases. The calculated value must differ from the "correct" value by less than 1.e-5 in order to pass. Run the script as checkout_n6 <output_file>.
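The test the script applies amounts to an absolute-tolerance comparison; a minimal C sketch of the same check is shown below (the reference value here is a placeholder, not one of the hard-coded values):

#include <math.h>
#include <stdio.h>

/* Sketch of the pass/fail test applied to the fermion action taken from the
   last "PBP:" line; 0.123456 is a placeholder, not a real reference value. */
int check_pbp_sketch(double measured)
{
    const double reference = 0.123456;   /* hypothetical reference value */
    const double tolerance = 1.0e-5;
    int ok = fabs(measured - reference) < tolerance;
    printf("%s\n", ok ? "OK" : "Failed");
    return ok;
}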


Modification Record

This is MIMD Version 7


Record of Formal Questions and Answers

No entries as yet.


Bibliography

[MILC] MIMD Lattice Computation (MILC) Collaboration, http://www.physics.indiana.edu/~sg/milc.html

[HOLMGREN] "Performance of MILC Lattice QCD Code on Commodity Clusters," http://lqcd.fnal.gov/badHonnef.pdf

[GOTTLIEB] "Lattice QCD on the Scalable POWERParallel Systems SP2," Proceedings of SC95, December 4-8, 1995, San Diego, California, USA.