GTC
Code Description
GTC: 3D Gyrokinetic Toroidal Code
General Description
GTC is a 3-dimensional code used to study microturbulence in magnetically confined toroidal fusion plasmas via the Particle-In-Cell (PIC) method. It solves the gyro-averaged Vlasov equation in real space using global gyrokinetic techniques and an electrostatic approximation. The Vlasov equation describes the evolution of a system of particles under the effects of self-consistent electromagnetic fields. The unknown is the distribution function f(t,x,v) of particles (electrons, ions, ...) in phase space, a function of time t, position x, and velocity v. This model assumes a collisionless plasma: the particles interact with one another only through the self-consistent field, and thus the algorithm scales as N rather than N², where N is the number of particles. [GTC]
A "classical" PIC algorithm involves using the notion of discrete "particles" to sample a charge distribution function. The particles interact via a grid, on which the potential is calculated from charges "deposited" at the grid points. Four steps are involved iteratively:
- "SCATTER," or deposit, charges on the grid. A particle typically contributes to many grid points and many particles contribute to any one grid point.
- Solve the Poisson Equation;
- "GATHER" forces on each particle from the resultant potential function;
- and "PUSH" (move) the particles to new locations on the grid.
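The four steps above can be sketched as a minimal 1-D electrostatic PIC cycle. This is an illustrative Python sketch only, not GTC's Fortran: the periodic 1-D grid, linear (cloud-in-cell) weighting, spectral Poisson solve, and normalized units are all assumptions made for brevity.

```python
import numpy as np

def pic_step(x, v, q_over_m, grid_n, dx, dt):
    """One cycle of the classical PIC loop on a 1-D periodic grid
    (illustrative sketch; GTC itself works on a 3-D toroidal mesh)."""
    # SCATTER: deposit each particle's charge on its two nearest grid
    # points, weighted linearly by proximity.
    rho = np.zeros(grid_n)
    j = np.floor(x / dx).astype(int) % grid_n
    w = x / dx - np.floor(x / dx)              # fractional offset from left point
    np.add.at(rho, j, 1.0 - w)
    np.add.at(rho, (j + 1) % grid_n, w)

    # SOLVE the Poisson equation in Fourier space: k^2 * phi_k = rho_k
    rho_k = np.fft.rfft(rho - rho.mean())      # remove the neutralizing background
    k = 2.0 * np.pi * np.fft.rfftfreq(grid_n, d=dx)
    phi_k = np.zeros_like(rho_k)
    phi_k[1:] = rho_k[1:] / k[1:] ** 2
    phi = np.fft.irfft(phi_k, grid_n)

    # GATHER: E = -dphi/dx on the grid, interpolated back to the particles
    e_grid = -np.gradient(phi, dx)
    e_part = (1.0 - w) * e_grid[j] + w * e_grid[(j + 1) % grid_n]

    # PUSH: advance velocities and positions (periodic boundary)
    v = v + q_over_m * e_part * dt
    x = (x + v * dt) % (grid_n * dx)
    return x, v
```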
In GTC, point-charge particles are replaced by charged rings using a four-point gyro-averaging scheme.
An important approximation in GTC comes from the fact that fast particle motion along the magnetic field lines leads to a quasi-two-dimensional structure in the electrostatic potential. Thus, the Poisson Equation needs only to be solved on a 2-D poloidal plane. GTC uses a logically non-rectangular field-aligned grid, in part, to keep the number of particles per cell nearly constant. The mesh effectively twists along with the magnetic field in the toroidal direction.
A PIC simulation is characterized by the number of particles used and the size of the mesh. GTC has been used in runs with over 1 billion particles and over 100 million grid points.
Coding
The GTC code has been optimized to achieve high efficiency on both cache-based superscalar and vector nodes. Both vector and superscalar versions of the Poisson Equation solver are provided.
In GTC, the main bottleneck is usually the charge deposition, or scatter, operation, as is true for most particle codes. The classic scatter algorithm loops over the particles, finding the nearest grid points surrounding each particle's position. A fraction of the particle's charge is assigned to each of these grid points according to its distance from the particle, and the charge fractions are accumulated in a grid array. The scatter algorithm in GTC is more complex because the rapidly gyrating particles are described as charged rings tracked by their guiding centers. This results in a larger number of operations than in "classical" PIC, since several points are picked on each ring and each of them has its own set of neighboring grid points.
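The ring-based deposit can be sketched as follows. This is a hedged 2-D Cartesian illustration in Python; GTC's actual routines work on the field-aligned toroidal mesh, and the doubly periodic grid, bilinear weights, and equal quarter-charges per ring point are assumptions of this sketch.

```python
import numpy as np

def gyro_scatter(xg, yg, rho_larmor, charge, nx, ny, dx, dy):
    """Four-point gyro-averaged charge deposit on a doubly periodic 2-D
    Cartesian grid (illustration only, not GTC's field-aligned mesh).
    Each particle is a charged ring of radius rho_larmor around its
    guiding center (xg, yg); the ring is sampled at four points, and each
    sample deposits a quarter of the charge onto its own four neighboring
    grid points via bilinear weights."""
    grid = np.zeros((nx, ny))
    for a in (0.0, 0.5 * np.pi, np.pi, 1.5 * np.pi):   # four ring points
        px = xg + rho_larmor * np.cos(a)
        py = yg + rho_larmor * np.sin(a)
        i = np.floor(px / dx).astype(int) % nx
        j = np.floor(py / dy).astype(int) % ny
        wx = px / dx - np.floor(px / dx)
        wy = py / dy - np.floor(py / dy)
        q = 0.25 * charge                              # quarter charge per ring point
        ip, jp = (i + 1) % nx, (j + 1) % ny
        np.add.at(grid, (i, j),   q * (1 - wx) * (1 - wy))
        np.add.at(grid, (ip, j),  q * wx * (1 - wy))
        np.add.at(grid, (i, jp),  q * (1 - wx) * wy)
        np.add.at(grid, (ip, jp), q * wx * wy)
    return grid
```

Note that each particle now touches up to sixteen grid points (four per ring sample) instead of the four a point-charge deposit would touch, which is the source of the extra operations described above.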
This version of GTC is written entirely in free-format Fortran90/95, about 4300 lines of code (17 source files), including routines for which multiple versions are provided.
Authorship
Written by Zhihong Lin, Stephane Ethier, and Jerome Lewandowski of the Princeton Plasma Physics Laboratory.
Relationship to NERSC Workload
GTC is used for fusion energy research via the DOE SciDAC program [FUS] and for the international fusion collaboration [ITER]. Support for it comes from the DOE Office of Fusion Energy Sciences. It is used for studying neoclassical and turbulent transport in tokamaks and stellarators, as well as for investigating hot-particle physics, toroidal Alfven modes, and neoclassical tearing modes. [LEE]
Parallelization
The primary parallel programming model for GTC is a 1-D domain decomposition with each MPI process assigned a toroidal section and MPI_SENDRECV used to shift particles moving between domains. A variety of both grid-based and particle-based loops can also be multitasked via OpenMP directives included in the code. MPI library calls and OpenMP directives are not isolated in any one routine or source file. Builds can be MPI-only or combined OpenMP/MPI.
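The sorting logic behind the particle shift can be sketched as below. This is an illustrative Python sketch with the MPI exchange elided: the uniform toroidal domain boundaries are an assumption, and in the real code the outgoing buffers are exchanged with neighboring ranks via MPI_SENDRECV, in repeated passes if a particle must cross more than one domain.

```python
import numpy as np

def shift_particles(zeta, my_rank, n_domains):
    """Sort local particles into those that stay in this rank's toroidal
    domain and those to be shipped to the left or right neighbor.
    zeta is the toroidal angle of each local particle; the 1-D domain
    decomposition divides [0, 2*pi) uniformly among the ranks.
    Sketch only: particles needing to move farther than one domain would
    be handled by repeated send/receive passes in the real code."""
    width = 2.0 * np.pi / n_domains
    dom = np.floor((zeta % (2.0 * np.pi)) / width).astype(int)
    stay = zeta[dom == my_rank]
    to_left = zeta[dom == (my_rank - 1) % n_domains]
    to_right = zeta[dom == (my_rank + 1) % n_domains]
    return stay, to_left, to_right
```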
In GTC the entire calculation is done in real space, including the Poisson solve, thereby improving parallel scalability. In a typical simulation only about 10-15% of the time is spent in interprocessor communication. However, due to the gather/scatter nature of the charge deposition in PIC codes, GTC is very sensitive to memory latency. [Ethier]
The GTC distribution includes highly vectorizable routines for the charge scatter, Poisson solve, particle push, and particle shift between domains. The vectorizable scatter routines can make use of gather/scatter hardware and have been vectorized using an algorithm that duplicates charge "bins" at the expense of additional memory in order to eliminate loop-carried dependences. "Atomic" memory operations could also mitigate the effect of the dependences. The duplication is done to the extent of the hardware vector length. The same loop-carried dependences and method for eliminating them apply to OpenMP parallelization of these loops.
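The bin-duplication technique can be sketched as follows. This is a Python illustration of the idea, not GTC's Fortran: the strided assignment of particles to lanes and the default copy count are assumptions, with `n_copies` standing in for the hardware vector length (or OpenMP thread count).

```python
import numpy as np

def scatter_binned(cells, weights, n_grid, n_copies=8):
    """Charge deposit with duplicated charge 'bins': each vector lane (or
    OpenMP thread) accumulates into its own private copy of the grid
    array, so no two lanes ever write to the same memory location, and
    the copies are summed in a final reduction.  This trades extra memory
    (n_copies grid arrays) for the elimination of the loop-carried
    dependence in the scatter."""
    grid_copies = np.zeros((n_copies, n_grid))
    for lane in range(n_copies):
        c = cells[lane::n_copies]              # this lane's subset of particles
        w = weights[lane::n_copies]
        np.add.at(grid_copies[lane], c, w)     # collisions stay within one copy
    return grid_copies.sum(axis=0)             # reduce the copies to one grid
```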
Obtaining the Code
The NERSC-6 procurement web site is no longer available.
Due to licensing restrictions GTC cannot be downloaded from NERSC. To obtain the code visit the PPPL site.
You can obtain the NERSC-6 GTC benchmark input data files here (tar file).
Building the Code
GNUmake (gmake) is required. Several files are preprocessed by the compiler "inline." All files are either .F90 or .f90; there are no header files.
All source files are in a single directory and there is a single Makefile with definitions for several different systems included (IRIX64, AIX, Linux, SUPER-UX, and UNICOS). The makefile runs the uname -s command to detect the operating system automatically.
Note well that some compilers may reject the way function types are declared in the file function.f90. An alternate version of this file is provided; rename function_alt.f90 to function.f90 to solve the problem.
Two basic versions of the code can be built:
| Version             | Executable Name |
|---------------------|-----------------|
| MPI-only            | gtcmpi          |
| Combined MPI/OpenMP | gtc             |
The basic command to build the code is
gmake <opt1=y/n> <opt2=y/n>
Options included in the Makefile are:
OPENMP=y|n
With or without OpenMP support; default is y.
DOUBLE_PRECISION=y
8-byte floating point precision
DEBUG=y|n
With or without debug option (-g). The default is n.
ESSL=y|n
With or without the AIX Engineering Scientific Subroutine Library, used for the FFT routines. The default is n.
64BITS=y|n
64- or 32-bit compilation mode on AIX. The default is y (64-bit).
PGI=y
Use the PGI compiler (pgf90) on Linux. The default is to use the Lahey-Fujitsu compiler lf95.
ALTIX=y
Compiles with Intel compilers on the Altix using ifort ... -lmpi
The vendor is expected to provide the appropriate system calls in timers.f to return local processor CPU time and global elapsed (wall clock) time.
The MPI ranks are identified by the variable "mype" and use mype=0 as the master.
Build-Related Files in This Distribution
| File Name | Purpose | Preprocessing |
|-----------|---------|---------------|
| README.html | README in HTML format | |
| README | README in ASCII text | |
| main.F90 | main program with MPI_Init | |
| module.F90 | precision and parameter declarations | double precision, NEC SX |
| setup.F90 | MPI_comm_size, MPI_comm_rank | AIX, OpenMP, Debug |
| setup_vec.F90 | same as setup, with additional code to minimize bank conflicts on vector systems | AIX, OpenMP, SX, Debug |
| shifte.F90 | move electrons across domains | OpenMP |
| shifti.F90 | move ions across domains | OpenMP |
| shifti_vec.F90 | vectorizable version with SX-6-specific performance monitoring calls | SX |
| chargee.F90 | interpolate electron charge density to the grid, first locally and then globally using MPI_SENDRECV and MPI_ALLREDUCE | OpenMP |
| chargei.F90 | interpolate ion charge density to the grid, first locally and then globally using MPI_SENDRECV and MPI_ALLREDUCE | OpenMP |
| chargei_vec.F90 | vectorizable version | SX and OpenMP; SX compiler directives |
| tracking.F90 | tag each particle with a unique number carried with it as it moves from one processor/domain to another; write track data to HDF files when taking a snapshot | HDF |
| smooth.F90 | charge smoothing; MPI_SENDRECV of flux; MPI_GATHER for transpose of 2-D matrix | ESSL |
| diagnosis.F90 | global sums, write output files | NERSC |
| pushi_vec.F90 | vectorizable version of ion particle push | SX performance trace routines |
| pushi.f90 | ion particle push, MPI_ALLREDUCE for various diagnostics | |
| pushe.f90 | electron particle push, MPI_ALLREDUCE for various diagnostics | |
| snapshot.f90 | write particle data for re-run; MPI_REDUCE for ? | |
| field.f90 | finite differencing for electric field, MPI_SENDRECV | |
| function.f90 | various Fortran 90 function definitions | |
| function_alt.f90 | version for compilers that do not accept the function declarations in function.f90 | |
| load.f90 | initialize particle position and velocity (not timed?) | |
| poisson.f90 | iterative Poisson solve, first locally and then globally via MPI_ALLGATHER | |
| restart.f90 | restart dump write/read | |
| poisson_vec.f90 | vectorizable version | |
| set_random_values.f90 | random number generators with Fortran 90 and Fortran 77 interfaces | |
| fft_cray.f90, fft_essl.f90, fft_gl.f90, fft_nag.f90, fft_sx6.f90 | interfaces to various system-specific FFT libraries | |
| Makefile | make file | |
Running the Code
Benchmark code GTC must be run using 64-bit floating-point precision for all non-integer data variables.
There is a subdirectory "run" in which input data files and sample batch submission scripts are located. The code does not require/accept any command line arguments. However, an input data file, called "gtc.input," is required.
Timing Issues
All timings made by this code are via the timer() subroutine in main.F90. This routine returns both the CPU and wall (elapsed) time. The source should be modified by the vendor to provide the former for the particular hardware with acceptable accuracy and resolution; the latter is returned via MPI_WTIME() and should remain unchanged.
The time to be reported is the line "TOTAL WALL CLOCK TIME(SEC)."
The code is heavily instrumented, with eight different portions of the code timed separately. The initialization phase is timed, but for runs with large numbers of timesteps it comprises only a small portion of the total time. All MPI processes call the timers, but only the master process prints the times, and there are no barriers before the timing calls, so the reported times do not represent a synchronized collective measurement across all processors.
Here is a timing profile for the three sample problems, obtained from the machine Jacquard at NERSC.
|             | Small  | Medium | Large  |
|-------------|--------|--------|--------|
| PEs         | 4      | 64     | 256    |
| particles   | 323590 | 323590 | 323590 |
| grid points | 32449  | 32449  | 32449  |
| # domains   | 4      | 64     | 256    |
| timesteps   | 100    | 2000   | 2000   |

Times in seconds:

|           | Small | Medium | Large  |
|-----------|-------|--------|--------|
| pusher    | 148.4 | 3,191. | 3,251. |
| shift     | 5.35  | 481.5  | 1,337. |
| charge    | 156.5 | 3,483. | 3,713. |
| poisson   | 19.1  | 765.0  | 760.8  |
| smooth    | 10.6  | 218.6  | 403.3  |
| field     | 3.6   | 86.1   | 87.2   |
| load      | 2.6   | 3.6    | 4.0    |
| total     | 347.3 | 8,229. | 9,559. |
| %pusher   | 43%   | 39%    | 34%    |
| %shifter  | 2%    | 6%     | 14%    |
| %poisson  | 5%    | 9%     | 8%     |
| %smoother | 3%    | 3%     | 4%     |
| %charge   | 45%   | 42%    | 39%    |
Storage Issues
Almost all of the storage requirement comes from the allocatable arrays for particle, field, and diagnostic data declared in module.F90 and allocated in setup.F90.
Memory Required (per MPI process) By The Sample Problems:
| Problem | Memory per MPI process |
|---------|------------------------|
| Small   | 0.26 GB                |
| Medium  | 0.26 GB                |
| Large   |                        |
Required Runs
Two problems are to be run and their outputs reported; see the table below. These two problems use weak scaling. All input in these files is namelist-based and is explained at the end of the files; however, nothing in the three input files provided should be changed for the runs.
| Problem Size | Concurrency (# of MPI processes) | # of Particles | # of Grid Points |
|--------------|----------------------------------|----------------|------------------|
| small        | 64                               | 207,097,600    | 2,076,736        |
| medium       | 512                              | 1,656,780,800  | 16,613,888       |
| large        | 2048                             | 6,627,123,200  | 66,455,552       |
The NERSC-6 "base case" runs are for 512- and 2048-MPI-process configurations. Use the gtc.input.512p and gtc.input.2048p input files for these runs, respectively. An input file for a smaller configuration (gtc.input.64p) is also provided for correctness checking and/or projections. Computational nodes employed in the benchmark must be fully packed; that is, the number of processes or threads executing on each node must equal the number of physical processors on that node.
The input file must be renamed "gtc.input" and be in the current working directory. A problem description is printed which reflects the input parameters. It also tells if OpenMP threads were used in the run.
Verifying Results
The code checks itself after the last time step, comparing a computed value with the corresponding value calculated on the Cray XT4 system at NERSC using the PathScale compiler. The reported difference should be less than 0.01. The internal error checking works only with 64, 512, or 2048 MPI tasks; any other count results in an error condition.
Modification Record
This is Version 2 (03/28/2005).
Record of Formal Questions and Answers
No entries as yet.
Bibliography
[ZIN] Z. Lin, G. Rewoldt, S. Ethier, et al., "Particle-in-cell simulations of electron transport from plasma turbulence: recent progress in gyrokinetic particle simulations of turbulent plasmas," Journal of Physics: Conference Series 16 (2005) 16-24.
[GTC] Gyrokinetic Particle Simulations, http://w3.pppl.gov/theory/proj_gksim.html
[FUS] SciDAC Fusion Research, http://www.scidac.gov/FES/FES_GPSC.html
[ITER] ITER Fusion Research, http://www.iter.org
[LEE] Full-torus simulation of turbulent transport, http://www.nersc.gov/news/annual_reports/annrep01/sh_FES_05.htm
[Ethier] S. Ethier, W. Tang, and Z. Lin, "Gyrokinetic particle-in-cell simulations of plasma microturbulence," Journal of Physics: Conference Series 16 (2005) 1-15.