
GTC

Code Description

GTC: 3D Gyrokinetic Toroidal Code


General Description

GTC is a 3-dimensional code used to study microturbulence in magnetically confined toroidal fusion plasmas via the Particle-In-Cell (PIC) method. It solves the gyro-averaged Vlasov equation in real space using global gyrokinetic techniques and an electrostatic approximation. The Vlasov equation describes the evolution of a system of particles under the effects of self-consistent electromagnetic fields. The unknown is the distribution function f(t,x,v), a function of time t, position x, and velocity v, which gives the density of particles (electrons, ions, ...) in phase space. This model assumes a collisionless plasma; i.e., the particles interact with one another only through the self-consistent field, and thus the algorithm scales as N instead of N^2, where N is the number of particles. [GTC]

A "classical" PIC algorithm uses discrete "particles" to sample a charge distribution function. The particles interact via a grid, on which the potential is calculated from charges "deposited" at the grid points. Four steps are carried out in each iteration (a minimal sketch of one cycle follows the list):

  1. "SCATTER," or deposit, charges on the grid. A particle typically contributes to many grid points, and many particles contribute to any one grid point.
  2. Solve the Poisson Equation for the potential.
  3. "GATHER" the forces on each particle from the resultant potential.
  4. "PUSH" (move) the particles to new locations on the grid.
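
As an illustration only (not code from GTC), the following minimal, self-contained Fortran sketch carries out these four steps for a simple one-dimensional electrostatic model; the array names, the linear weighting, and the crude Jacobi field solve are all assumptions made for brevity.

  program pic_cycle_sketch
    implicit none
    integer, parameter :: np = 10000          ! number of particles
    integer, parameter :: ng = 64             ! number of grid cells
    real, parameter    :: lx = 1.0, dt = 1.0e-3
    real, parameter    :: dx = lx / ng
    real :: x(np), v(np), q(np)               ! particle position, velocity, charge
    real :: rho(0:ng), phi(0:ng), efield(0:ng)
    real :: w, ep
    integer :: ip, ig, it, iter

    call random_number(x);  x = x * lx        ! uniform initial positions
    v = 0.0;  q = 1.0 / np                    ! equal charge fractions

    do it = 1, 100
       ! 1. SCATTER: deposit each charge on the two nearest grid points (linear weighting)
       rho = 0.0
       do ip = 1, np
          ig = int(x(ip) / dx)                ! left grid index, 0..ng-1
          w  = x(ip) / dx - ig                ! fractional distance from the left point
          rho(ig)   = rho(ig)   + q(ip) * (1.0 - w)
          rho(ig+1) = rho(ig+1) + q(ip) * w
       end do

       ! 2. SOLVE the Poisson equation (a few Jacobi sweeps, purely illustrative)
       phi = 0.0
       do iter = 1, 200
          phi(1:ng-1) = 0.5 * (phi(0:ng-2) + phi(2:ng) + dx*dx * rho(1:ng-1))
       end do
       efield(1:ng-1) = -(phi(2:ng) - phi(0:ng-2)) / (2.0*dx)
       efield(0) = 0.0;  efield(ng) = 0.0

       ! 3. GATHER the field at each particle position, then 4. PUSH the particle
       do ip = 1, np
          ig = int(x(ip) / dx)
          w  = x(ip) / dx - ig
          ep = (1.0 - w) * efield(ig) + w * efield(ig+1)
          v(ip) = v(ip) + dt * ep             ! unit charge-to-mass ratio assumed
          x(ip) = modulo(x(ip) + dt * v(ip), lx)
       end do
    end do
  end program pic_cycle_sketch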


In GTC, point-charge particles are replaced by charged rings, using a four-point gyro-averaging scheme.



An important approximation in GTC comes from the fact that fast particle motion along the magnetic field lines leads to a quasi-two-dimensional structure in the electrostatic potential. Thus, the Poisson Equation needs only to be solved on a 2-D poloidal plane. GTC uses a logically non-rectangular field-aligned grid, in part, to keep the number of particles per cell nearly constant. The mesh effectively twists along with the magnetic field in the toroidal direction.

A PIC simulation is characterized by the number of particles used and the size of the mesh. GTC has been used in runs with over 1 billion particles and over 100 million grid points.

Coding

The GTC code has been optimized to achieve high efficiency on both cache-based superscalar and vector architectures. Both vector and superscalar versions of the Poisson Equation solver are provided.

In GTC, as in most particle codes, the main bottleneck is usually the charge deposition, or scatter, operation. The classic scatter algorithm loops over the particles, finding the nearest grid points surrounding each particle position. A fraction of the particle's charge is assigned to each of those grid points according to its distance from the particle's position, and the charge fractions are accumulated in a grid array. The scatter algorithm in GTC is more complex because of the rapidly gyrating particles, whose motion is represented by charged rings tracked through their guiding centers. This results in a larger number of operations than in "classical" PIC, since several points are picked on each ring and each of them has its own set of neighboring grid points.
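
The following Fortran sketch illustrates the idea for a single particle on a uniform 2-D grid; it is not GTC's routine. Four points are placed on the particle's gyro-ring around its guiding center, and each point's quarter-share of the charge is deposited on its four surrounding grid nodes with bilinear weights. GTC's field-aligned, non-rectangular mesh makes the actual index arithmetic more involved.

  subroutine deposit_ring(xgc, ygc, rho_gyro, charge, rho, nx, ny, dx, dy)
    ! Scatter the charge of one particle, represented as a four-point
    ! gyro-ring of radius rho_gyro around its guiding center (xgc, ygc),
    ! onto a uniform grid.  Illustrative only; boundary handling is omitted
    ! and the ring is assumed to stay inside the grid.
    implicit none
    integer, intent(in)    :: nx, ny
    real,    intent(in)    :: xgc, ygc, rho_gyro, charge, dx, dy
    real,    intent(inout) :: rho(0:nx, 0:ny)
    real, parameter :: pi = 3.141592653589793
    real    :: xp, yp, wx, wy, qpt
    integer :: k, i, j

    qpt = 0.25 * charge                       ! each ring point carries 1/4 of the charge
    do k = 1, 4                               ! four points on the gyro-ring
       xp = xgc + rho_gyro * cos(0.5*pi*(k-1))
       yp = ygc + rho_gyro * sin(0.5*pi*(k-1))
       i  = int(xp / dx)                      ! lower-left grid node
       j  = int(yp / dy)
       wx = xp / dx - i                       ! bilinear weights
       wy = yp / dy - j
       rho(i  , j  ) = rho(i  , j  ) + qpt * (1.0-wx) * (1.0-wy)
       rho(i+1, j  ) = rho(i+1, j  ) + qpt * wx       * (1.0-wy)
       rho(i  , j+1) = rho(i  , j+1) + qpt * (1.0-wx) * wy
       rho(i+1, j+1) = rho(i+1, j+1) + qpt * wx       * wy
    end do
  end subroutine deposit_ring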

This version of GTC is written entirely in free-format Fortran 90/95 and comprises about 4,300 lines of code in 17 source files, including routines for which multiple versions are provided.

Authorship

Written by Zhihong Lin, Stephane Ethier, and Jerome Lewandowski of the Princeton Plasma Physics Laboratory.

Relationship to NERSC Workload

GTC is used for fusion energy research under the DOE SciDAC program [FUS] and for the international fusion collaboration ITER [ITER]. Support for it comes from the DOE Office of Fusion Energy Sciences. It is used for studying neoclassical and turbulent transport in tokamaks and stellarators, as well as for investigating hot-particle physics, toroidal Alfven modes, and neoclassical tearing modes [LEE].

Parallelization

The primary parallel programming model in GTC is a 1-D domain decomposition: each MPI process is assigned a toroidal section, and MPI_SENDRECV is used to shift particles that move between domains. A variety of grid-based and particle-based loops can also be parallelized via OpenMP directives included in the code. MPI library calls and OpenMP directives are not isolated in any one routine or source file. Builds can be MPI-only or combined OpenMP/MPI.
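
The sketch below shows the general pattern rather than GTC's actual shifte/shifti routines: each rank owns one toroidal section, and after the push, particles that crossed the right-hand boundary are handed to the neighboring rank with MPI_SENDRECV. Only a one-direction exchange of a fixed number of attributes per particle is shown; the buffer packing, the left-going exchange, and the OpenMP threading are omitted.

  subroutine shift_right(nsend, sendbuf, nrecv, recvbuf, ndata, comm)
    ! Shift particles that crossed the right-hand toroidal boundary to the
    ! next rank (periodic in the toroidal direction).  Illustrative pattern
    ! only; not GTC's shift routines.
    use mpi
    implicit none
    integer, intent(in)  :: nsend, ndata, comm
    real,    intent(in)  :: sendbuf(ndata, nsend)
    integer, intent(out) :: nrecv
    real,    intent(out) :: recvbuf(ndata, *)
    integer :: mype, npes, right, left, ierr
    integer :: status(MPI_STATUS_SIZE)

    call MPI_COMM_RANK(comm, mype, ierr)
    call MPI_COMM_SIZE(comm, npes, ierr)
    right = mod(mype + 1, npes)               ! periodic neighbors in the toroidal direction
    left  = mod(mype - 1 + npes, npes)

    ! First agree on how many particles are arriving from the left ...
    call MPI_SENDRECV(nsend, 1, MPI_INTEGER, right, 0, &
                      nrecv, 1, MPI_INTEGER, left,  0, comm, status, ierr)
    ! ... then exchange the particle data itself.
    call MPI_SENDRECV(sendbuf, ndata*nsend, MPI_REAL, right, 1, &
                      recvbuf, ndata*nrecv, MPI_REAL, left,  1, comm, status, ierr)
  end subroutine shift_right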

In GTC the entire calculation is done in real space, including the Poisson solve, thereby improving parallel scalability. In a typical simulation only about 10-15% of the time is spent in interprocessor communication. However, due to the gather/scatter nature of the charge deposition in PIC codes, GTC is very sensitive to memory latency [Ethier].

The GTC distribution includes highly vectorizable routines for the charge scatter, Poisson solve, particle push, and particle shift between domains. The vectorizable scatter routines can make use of gather/scatter hardware; they have been vectorized using an algorithm that duplicates the charge "bins," at the cost of additional memory, in order to eliminate loop-carried dependences. ("Atomic" memory operations could also mitigate the effect of the dependences.) The duplication is done to the extent of the hardware vector length. The same loop-carried dependences, and the same method of eliminating them, apply to OpenMP parallelization of these loops.
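
A minimal sketch of the idea for the OpenMP case, with made-up names: each thread accumulates into its own private copy ("bin") of the grid array, so the scatter loop has no loop-carried dependence, and the copies are summed at the end. The vector version follows the same pattern with one copy per vector lane instead of one per thread.

  subroutine scatter_replicated(np, ix, wght, ngrid, rho)
    ! Eliminate the loop-carried dependence in the charge scatter by giving
    ! each OpenMP thread its own copy of the grid array and reducing the
    ! copies afterwards.  Illustrative only; names are not GTC's.
    use omp_lib
    implicit none
    integer, intent(in)  :: np, ngrid
    integer, intent(in)  :: ix(np)            ! grid index (1..ngrid) of each particle
    real,    intent(in)  :: wght(np)          ! charge fraction of each particle
    real,    intent(out) :: rho(ngrid)
    real, allocatable :: rho_loc(:,:)         ! one grid copy per thread
    integer :: ip, it, nthreads

    nthreads = omp_get_max_threads()          ! size the replicated bins for the team
    allocate(rho_loc(ngrid, nthreads))
    rho_loc = 0.0

  !$omp parallel private(it, ip)
    it = omp_get_thread_num() + 1
  !$omp do
    do ip = 1, np
       rho_loc(ix(ip), it) = rho_loc(ix(ip), it) + wght(ip)   ! no conflicts between threads
    end do
  !$omp end do
  !$omp end parallel

    rho = sum(rho_loc, dim=2)                 ! combine the per-thread copies
    deallocate(rho_loc)
  end subroutine scatter_replicated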


Obtaining the Code

The NERSC-6 procurement web site is no longer available.

Due to licensing restrictions, GTC cannot be downloaded from NERSC. To obtain the code, visit the PPPL site.

The NERSC-6 GTC benchmark input data files are available from NERSC as a tar file.

Building the Code

GNU make (gmake) is required. Several files are preprocessed by the compiler as part of compilation; there is no separate preprocessing step. All source files end in .F90 or .f90; there are no header files.

All source files are in a single directory and there is a single Makefile with definitions for several different systems included (IRIX64, AIX, Linux, SUPER-UX, and UNICOS). The makefile runs the uname -s command to detect the operating system automatically.

Note that some compilers may not accept the way function types are declared in the file function.f90. An alternate version of this file is provided; rename function_alt.f90 to function.f90 to work around the problem.

Two basic versions of the code can be built:

Version               Executable Name
MPI-only              gtcmpi
Combined MPI/OpenMP   gtc

The basic command to build the code is

gmake <opt1=y/n> <opt2=y/n>

Options included in the Makefile are:

OPENMP=y|n

With or without OpenMP support; default is y.

DOUBLE_PRECISION=y

Use 8-byte (64-bit) floating-point precision.

DEBUG=y|n

With or without debug option (-g). The default is n.

ESSL=y|n

With or without the AIX Engineering and Scientific Subroutine Library (ESSL), used for the FFT routines. The default is n.

64BITS=y|n

64- or 32-bit compilation mode on AIX. The default is 64-bit.

PGI=y

Use the PGI compiler (pgf90) on Linux. The default is to use the Lahey-Fujitsu compiler lf95.

ALTIX=y

Compile with the Intel compiler on the SGI Altix, using ifort ... -lmpi.
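
For example, a double-precision hybrid MPI/OpenMP build (the precision required for the benchmark runs) might be requested with

gmake OPENMP=y DOUBLE_PRECISION=y

while OPENMP=n presumably produces the MPI-only executable gtcmpi. The exact combination of options depends on the target system; consult the Makefile.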

The vendor is expected to provide the appropriate system calls in timers.f to return local processor CPU time and global elapsed (wall clock) time.

MPI ranks are identified by the variable "mype", with mype=0 acting as the master.


Build-Related Files in This Distribution

Each file is listed below with its purpose; preprocessing conditionals that apply to a file are given in brackets.

README.html           - README in HTML format
README                - README in ASCII text
main.F90              - main program with MPI_Init
module.F90            - precision, parameter, and global array declarations  [double precision, NEC SX]
setup.F90             - MPI_comm_size, MPI_comm_rank  [AIX, OpenMP, Debug]
setup_vec.F90         - same as setup, with additional code to minimize bank conflicts on vector systems  [AIX, OpenMP, SX, Debug]
shifte.F90            - move electrons across domains  [OpenMP]
shifti.F90            - move ions across domains  [OpenMP]
shifti_vec.F90        - vectorizable version of shifti, with SX-6-specific performance monitoring calls  [SX]
chargee.F90           - interpolate electron charge density to the grid, first locally and then globally via MPI_SENDRECV and MPI_ALLREDUCE  [OpenMP]
chargei.F90           - interpolate ion charge density to the grid, first locally and then globally via MPI_SENDRECV and MPI_ALLREDUCE  [OpenMP]
chargei_vec.F90       - vectorizable version of chargei  [SX and OpenMP; SX compiler directives]
tracking.F90          - tag each particle with a unique number that is carried with it as it moves from one processor/domain to another; write track data to HDF files when taking a snapshot  [HDF]
smooth.F90            - charge smoothing; MPI_SENDRECV of flux; MPI_GATHER for the transpose of a 2-D matrix  [ESSL]
diagnosis.F90         - global sums; write output files  [NERSC]
pushi_vec.F90         - vectorizable version of the ion particle push  [SX performance trace routines]
pushi.f90             - ion particle push; MPI_ALLREDUCE for various diagnostics
pushe.f90             - electron particle push; MPI_ALLREDUCE for various diagnostics
snapshot.f90          - write particle data for re-runs; MPI_REDUCE
field.f90             - finite differencing for the electric field; MPI_SENDRECV
function.f90          - various Fortran 90 function definitions
function_alt.f90      - alternate version for compilers that reject the function declarations in function.f90
load.f90              - initialize particle positions and velocities
poisson.f90           - iterative Poisson solve, first locally and then globally via MPI_ALLGATHER
restart.f90           - write/read restart dumps
poisson_vec.f90       - vectorizable version of the Poisson solve
set_random_values.f90 - random number generators with Fortran 90 and Fortran 77 interfaces
fft_cray.f90, fft_essl.f90, fft_gl.f90, fft_nag.f90, fft_sx6.f90
                      - interfaces to various system-specific FFT libraries
Makefile              - make file

 


Running the Code

The GTC benchmark code must be run using 64-bit floating-point precision for all non-integer data variables.

There is a subdirectory "run" in which input data files and sample batch submission scripts are located. The code does not require or accept any command-line arguments; however, an input data file named "gtc.input" is required.
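
Since the input is namelist based, it is read with standard Fortran namelist I/O. The sketch below shows the general pattern only; the namelist group and variable names (sketch_params, nstep, nparticle) are placeholders, not GTC's actual parameters, which are documented at the end of the provided input files.

  program read_input_sketch
    ! Sketch of namelist-style input handling; names here are hypothetical.
    implicit none
    integer, parameter :: iunit = 11
    integer :: nstep = 100, nparticle = 100000   ! defaults, overridden by the file
    integer :: ios
    namelist /sketch_params/ nstep, nparticle

    open(unit=iunit, file='gtc.input', status='old', action='read', iostat=ios)
    if (ios /= 0) stop 'gtc.input not found in the current working directory'
    read(iunit, nml=sketch_params)
    close(iunit)
  end program read_input_sketch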


Timing Issues

All timings made by this code are via the timer() subroutine in main.F90. This routine returns both the CPU time and the wall-clock (elapsed) time. The source should be modified by the vendor to provide the former for the particular hardware with acceptable accuracy and resolution; the latter is returned via MPI_WTIME() and should remain unchanged.
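
A minimal version of such a timer routine might look like the sketch below; the routine name matches the description above, but the use of the portable Fortran intrinsic cpu_time for the CPU-time part is only an illustration of what the vendor-supplied call replaces.

  subroutine timer(t_cpu, t_wall)
    ! Return the local CPU time and the elapsed (wall-clock) time.
    ! Sketch only: the benchmark prescribes MPI_WTIME for the wall clock,
    ! while the CPU-time call is to be supplied by the vendor (the Fortran
    ! intrinsic cpu_time is used here purely for illustration).
    use mpi
    implicit none
    real(kind(1.0d0)), intent(out) :: t_cpu, t_wall
    real :: t

    call cpu_time(t)
    t_cpu  = t
    t_wall = MPI_WTIME()
  end subroutine timer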

The time to be reported is the line "TOTAL WALL CLOCK TIME(SEC)."

The code is heavily instrumented, with eight different portions of the code timed separately. The initialization phase is timed, but for runs with large numbers of timesteps it comprises only a small portion of the total time. All MPI processes call the timers, but only the master process prints the times, and there are no barriers before the timing calls that would measure the collective time of all processors.

Here is a timing profile for the three sample problems, obtained from the machine Jacquard at NERSC.

 

              Small      Medium     Large
PEs               4          64       256
Particles    323590      323590    323590
Grid points   32449       32449     32449
# domains         4          64       256
Timesteps       100        2000      2000

Times in seconds:

Pusher        148.4      3,191.    3,251.
Shift          5.35       481.5    1,337.
Charge        156.5      3,483.    3,713.
Poisson        19.1       765.0     760.8
Smooth         10.6       218.6     403.3
Field           3.6        86.1      87.2
Load            2.6         3.6       4.0
Total         347.3      8,229.    9,559.

%Pusher         43%         39%       34%
%Shifter         2%          6%       14%
%Poisson         5%          9%        8%
%Smoother        3%          3%        4%
%Charge         45%         42%       39%


Storage Issues

Almost all of the storage requirements come from the allocatable arrays for particle, field, and diagnostic data declared in module.F90 and allocated in setup.F90.
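
The pattern is the usual Fortran one of module-level allocatable arrays sized at run time, as in the stripped-down sketch below; the array and dimension names are illustrative, not GTC's.

  module gtc_storage_sketch
    ! Illustrative only: global allocatable arrays declared in a module
    ! (as in module.F90) and allocated once the problem size is known
    ! (as in setup.F90).  Names and dimensions are not GTC's.
    implicit none
    real, allocatable :: particles(:,:)   ! per-particle phase-space data
    real, allocatable :: rho_grid(:,:)    ! charge density on the local grid
  contains
    subroutine allocate_arrays(nattr, nplocal, ngridlocal, nplanes)
      integer, intent(in) :: nattr, nplocal, ngridlocal, nplanes
      allocate(particles(nattr, nplocal))       ! scales with particles per process
      allocate(rho_grid(ngridlocal, nplanes))   ! scales with the local grid size
    end subroutine allocate_arrays
  end module gtc_storage_sketch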

Memory Required (per MPI process) By The Sample Problems:

Small    0.26 GB
Medium   0.26 GB
Large

Required Runs

Two problems are to be run and their outputs reported; see the table below. These two problems use weak scaling. All input from these files is namelist based and is explained at the end of the files; however, nothing in the three input files provided should be changed for the runs.

Problem Size   Concurrency (# of MPI processes)   # of Particles    # of Grid Points
small          64                                 207,097,600       2,076,736
medium         512                                1,656,780,800     16,613,888
large          2048                               6,627,123,200     66,455,552

The NERSC-6 "base case" runs are the 512- and 2048-MPI-process configurations. Use the gtc.input.512p and gtc.input.2048p input files for these runs, respectively. An input file for a smaller configuration (gtc.input.64p) is also provided for correctness checking and/or projections. Computational nodes employed in the benchmark must be fully packed; that is, the number of processes or threads executing on each node must equal the number of physical processors on that node.

The input file must be renamed "gtc.input" and placed in the current working directory. A problem description reflecting the input parameters is printed; it also indicates whether OpenMP threads were used in the run.


Verifying Results

The code checks itself after the last time step by comparing a computed value with the corresponding value calculated on the Cray XT4 system at NERSC using the PathScale compiler. The difference reported should be less than 0.01. The internal error checking only works when the number of MPI tasks equals 64, 512, or 2048; an error condition will result otherwise.


Modification Record

This is Version 2 (03/28/2005).


Record of Formal Questions and Answers

No entries as yet.


Bibliography

[ZIN] Z. Lin, G. Rewoldt, S. Ethier, et al., "Particle-in-cell simulations of electron transport from plasma turbulence: recent progress in gyrokinetic particle simulations of turbulent plasmas," Journal of Physics: Conference Series 16 (2005) 16-24.

[GTC] Gyrokinetic Particle Simulations, http://w3.pppl.gov/theory/proj_gksim.html

[FUS] SciDAC Fusion Research, http://www.scidac.gov/FES/FES_GPSC.html

[ITER] ITER Fusion Research, http://www.iter.org

[LEE] Full-torus simulation of turbulent transport, http://www.nersc.gov/news/annual_reports/annrep01/sh_FES_05.htm

[Ethier] S. Ethier, W. Tang, and Z. Lin, "Gyrokinetic particle-in-cell simulations of plasma microturbulence," Journal of Physics: Conference Series 16 (2005) 1-15.