
GTC

Code Description

GTC: 3D Gyrokinetic Toroidal Code


General Description

GTC is a 3-dimensional code used to study microturbulence in magnetically confined toroidal fusion plasmas via the Particle-In-Cell (PIC) method. It solves the gyro-averaged Vlasov equation in real space using global gyrokinetic techniques and an electrostatic approximation. The Vlasov equation describes the evolution of a system of particles under the effects of self-consistent electromagnetic fields. The unknown is the distribution function f(t,x,v), a function of time t, position x, and velocity v, which gives the density of particles (electrons, ions, ...) in phase space. This model assumes a collisionless plasma; i.e., the particles interact with one another only through the self-consistent field, and thus the algorithm scales as N instead of N^2, where N is the number of particles. [GTC]

A "classical" PIC algorithm uses discrete "particles" to sample a charge distribution function. The particles interact via a grid, on which the potential is calculated from charges "deposited" at the grid points. Four steps are carried out in each iteration (a minimal sketch of one cycle follows the list):

  1. "SCATTER," or deposit, charges on the grid. A particle typically contributes to many grid points, and many particles contribute to any one grid point.
  2. Solve the Poisson Equation for the potential.
  3. "GATHER" the forces on each particle from the resultant potential.
  4. "PUSH" (move) the particles to new locations on the grid.
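
As an illustration only (not code from GTC), the following minimal, self-contained Fortran sketch carries out these four steps for a simple one-dimensional electrostatic model; the array names, the linear weighting, and the crude Jacobi field solve are all assumptions made for brevity.

  program pic_cycle_sketch
    implicit none
    integer, parameter :: np = 10000          ! number of particles
    integer, parameter :: ng = 64             ! number of grid cells
    real, parameter    :: lx = 1.0, dt = 1.0e-3
    real, parameter    :: dx = lx / ng
    real :: x(np), v(np), q(np)               ! particle position, velocity, charge
    real :: rho(0:ng), phi(0:ng), efield(0:ng)
    real :: w, ep
    integer :: ip, ig, it, iter

    call random_number(x);  x = x * lx        ! uniform initial positions
    v = 0.0;  q = 1.0 / np                    ! equal charge fractions

    do it = 1, 100
       ! 1. SCATTER: deposit each charge on the two nearest grid points (linear weighting)
       rho = 0.0
       do ip = 1, np
          ig = int(x(ip) / dx)                ! left grid index, 0..ng-1
          w  = x(ip) / dx - ig                ! fractional distance from the left point
          rho(ig)   = rho(ig)   + q(ip) * (1.0 - w)
          rho(ig+1) = rho(ig+1) + q(ip) * w
       end do

       ! 2. SOLVE the Poisson equation (a few Jacobi sweeps, purely illustrative)
       phi = 0.0
       do iter = 1, 200
          phi(1:ng-1) = 0.5 * (phi(0:ng-2) + phi(2:ng) + dx*dx * rho(1:ng-1))
       end do
       efield(1:ng-1) = -(phi(2:ng) - phi(0:ng-2)) / (2.0*dx)
       efield(0) = 0.0;  efield(ng) = 0.0

       ! 3. GATHER the field at each particle position, then 4. PUSH the particle
       do ip = 1, np
          ig = int(x(ip) / dx)
          w  = x(ip) / dx - ig
          ep = (1.0 - w) * efield(ig) + w * efield(ig+1)
          v(ip) = v(ip) + dt * ep             ! unit charge-to-mass ratio assumed
          x(ip) = modulo(x(ip) + dt * v(ip), lx)
       end do
    end do
  end program pic_cycle_sketch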


In GTC, point-charge particles are replaced by charged rings, using a four-point gyro-averaging scheme.



An important approximation in GTC comes from the fact that fast particle motion along the magnetic field lines leads to a quasi-two-dimensional structure in the electrostatic potential. Thus, the Poisson Equation needs only to be solved on a 2-D poloidal plane. GTC uses a logically non-rectangular field-aligned grid, in part, to keep the number of particles per cell nearly constant. The mesh effectively twists along with the magnetic field in the toroidal direction.

A PIC simulation is characterized by the number of particles used and the size of the mesh. GTC has been used in runs with over 1 billion particles and over 100 million grid points.

Coding

The GTC code has been optimized to achieve high efficiency on both cache-based superscalar and vector architectures. Both vector and superscalar versions of the Poisson Equation solver are provided.

In GTC, as in most particle codes, the main bottleneck is usually the charge deposition, or scatter, operation. The classic scatter algorithm loops over the particles, finding the nearest grid points surrounding each particle position. A fraction of the particle's charge is assigned to each of those grid points according to its distance from the particle's position, and the charge fractions are accumulated in a grid array. The scatter algorithm in GTC is more complex because of the rapidly gyrating particles, whose motion is represented by charged rings tracked through their guiding centers. This results in a larger number of operations than in "classical" PIC, since several points are picked on each ring and each of them has its own set of neighboring grid points.
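
The following Fortran sketch illustrates the idea for a single particle on a uniform 2-D grid; it is not GTC's routine. Four points are placed on the particle's gyro-ring around its guiding center, and each point's quarter-share of the charge is deposited on its four surrounding grid nodes with bilinear weights. GTC's field-aligned, non-rectangular mesh makes the actual index arithmetic more involved.

  subroutine deposit_ring(xgc, ygc, rho_gyro, charge, rho, nx, ny, dx, dy)
    ! Scatter the charge of one particle, represented as a four-point
    ! gyro-ring of radius rho_gyro around its guiding center (xgc, ygc),
    ! onto a uniform grid.  Illustrative only; boundary handling is omitted
    ! and the ring is assumed to stay inside the grid.
    implicit none
    integer, intent(in)    :: nx, ny
    real,    intent(in)    :: xgc, ygc, rho_gyro, charge, dx, dy
    real,    intent(inout) :: rho(0:nx, 0:ny)
    real, parameter :: pi = 3.141592653589793
    real    :: xp, yp, wx, wy, qpt
    integer :: k, i, j

    qpt = 0.25 * charge                       ! each ring point carries 1/4 of the charge
    do k = 1, 4                               ! four points on the gyro-ring
       xp = xgc + rho_gyro * cos(0.5*pi*(k-1))
       yp = ygc + rho_gyro * sin(0.5*pi*(k-1))
       i  = int(xp / dx)                      ! lower-left grid node
       j  = int(yp / dy)
       wx = xp / dx - i                       ! bilinear weights
       wy = yp / dy - j
       rho(i  , j  ) = rho(i  , j  ) + qpt * (1.0-wx) * (1.0-wy)
       rho(i+1, j  ) = rho(i+1, j  ) + qpt * wx       * (1.0-wy)
       rho(i  , j+1) = rho(i  , j+1) + qpt * (1.0-wx) * wy
       rho(i+1, j+1) = rho(i+1, j+1) + qpt * wx       * wy
    end do
  end subroutine deposit_ring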

This version of GTC is written entirely in free-format Fortran 90/95 and comprises about 4,300 lines of code in 17 source files, including routines for which multiple versions are provided.

Authorship

Written by Zhihong Lin, Stephane Ethier, and Jerome Lewandowski of the Princeton Plasma Physics Laboratory.

Relationship to NERSC Workload

GTC is used for fusion energy research under the DOE SciDAC program [FUS] and for the international fusion collaboration ITER [ITER]. Support for it comes from the DOE Office of Fusion Energy Sciences. It is used for studying neoclassical and turbulent transport in tokamaks and stellarators, as well as for investigating hot-particle physics, toroidal Alfven modes, and neoclassical tearing modes [LEE].

Parallelization

The primary parallel programming model in GTC is a 1-D domain decomposition: each MPI process is assigned a toroidal section, and MPI_SENDRECV is used to shift particles that move between domains. A variety of grid-based and particle-based loops can also be parallelized via OpenMP directives included in the code. MPI library calls and OpenMP directives are not isolated in any one routine or source file. Builds can be MPI-only or combined OpenMP/MPI.
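
The sketch below shows the general pattern rather than GTC's actual shifte/shifti routines: each rank owns one toroidal section, and after the push, particles that crossed the right-hand boundary are handed to the neighboring rank with MPI_SENDRECV. Only a one-direction exchange of a fixed number of attributes per particle is shown; the buffer packing, the left-going exchange, and the OpenMP threading are omitted.

  subroutine shift_right(nsend, sendbuf, nrecv, recvbuf, ndata, comm)
    ! Shift particles that crossed the right-hand toroidal boundary to the
    ! next rank (periodic in the toroidal direction).  Illustrative pattern
    ! only; not GTC's shift routines.
    use mpi
    implicit none
    integer, intent(in)  :: nsend, ndata, comm
    real,    intent(in)  :: sendbuf(ndata, nsend)
    integer, intent(out) :: nrecv
    real,    intent(out) :: recvbuf(ndata, *)
    integer :: mype, npes, right, left, ierr
    integer :: status(MPI_STATUS_SIZE)

    call MPI_COMM_RANK(comm, mype, ierr)
    call MPI_COMM_SIZE(comm, npes, ierr)
    right = mod(mype + 1, npes)               ! periodic neighbors in the toroidal direction
    left  = mod(mype - 1 + npes, npes)

    ! First agree on how many particles are arriving from the left ...
    call MPI_SENDRECV(nsend, 1, MPI_INTEGER, right, 0, &
                      nrecv, 1, MPI_INTEGER, left,  0, comm, status, ierr)
    ! ... then exchange the particle data itself.
    call MPI_SENDRECV(sendbuf, ndata*nsend, MPI_REAL, right, 1, &
                      recvbuf, ndata*nrecv, MPI_REAL, left,  1, comm, status, ierr)
  end subroutine shift_right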

In GTC the entire calculation is done in real space, including the Poisson solve, thereby improving parallel scalability. In a typical simulation only about 10-15% of the time is spent in interprocessor communication. However, due to the gather/scatter nature of the charge deposition in PIC codes, GTC is very sensitive to memory latency [Ethier].

The GTC distribution includes highly vectorizable routines for the charge scatter, Poisson solve, particle push, and particle shift between domains. The vectorizable scatter routines can make use of gather/scatter hardware; they have been vectorized using an algorithm that duplicates the charge "bins," at the cost of additional memory, in order to eliminate loop-carried dependences. ("Atomic" memory operations could also mitigate the effect of the dependences.) The duplication is done to the extent of the hardware vector length. The same loop-carried dependences, and the same method of eliminating them, apply to OpenMP parallelization of these loops.
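
A minimal sketch of the idea for the OpenMP case, with made-up names: each thread accumulates into its own private copy ("bin") of the grid array, so the scatter loop has no loop-carried dependence, and the copies are summed at the end. The vector version follows the same pattern with one copy per vector lane instead of one per thread.

  subroutine scatter_replicated(np, ix, wght, ngrid, rho)
    ! Eliminate the loop-carried dependence in the charge scatter by giving
    ! each OpenMP thread its own copy of the grid array and reducing the
    ! copies afterwards.  Illustrative only; names are not GTC's.
    use omp_lib
    implicit none
    integer, intent(in)  :: np, ngrid
    integer, intent(in)  :: ix(np)            ! grid index (1..ngrid) of each particle
    real,    intent(in)  :: wght(np)          ! charge fraction of each particle
    real,    intent(out) :: rho(ngrid)
    real, allocatable :: rho_loc(:,:)         ! one grid copy per thread
    integer :: ip, it, nthreads

    nthreads = omp_get_max_threads()          ! size the replicated bins for the team
    allocate(rho_loc(ngrid, nthreads))
    rho_loc = 0.0

  !$omp parallel private(it, ip)
    it = omp_get_thread_num() + 1
  !$omp do
    do ip = 1, np
       rho_loc(ix(ip), it) = rho_loc(ix(ip), it) + wght(ip)   ! no conflicts between threads
    end do
  !$omp end do
  !$omp end parallel

    rho = sum(rho_loc, dim=2)                 ! combine the per-thread copies
    deallocate(rho_loc)
  end subroutine scatter_replicated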


Obtaining the Code

The NERSC-6 procurement web site is no longer available.

Due to licensing restrictions, GTC cannot be downloaded from NERSC. To obtain the code, visit the PPPL site.

The NERSC-6 GTC benchmark input data files are available from NERSC as a tar file.

Building the Code

GNU make (gmake) is required. Several files are preprocessed by the compiler as part of compilation; there is no separate preprocessing step. All source files end in .F90 or .f90; there are no header files.

All source files are in a single directory and there is a single Makefile with definitions for several different systems included (IRIX64, AIX, Linux, SUPER-UX, and UNICOS). The makefile runs the uname -s command to detect the operating system automatically.

Note that some compilers may not accept the way function types are declared in the file function.f90. An alternate version of this file is provided; rename function_alt.f90 to function.f90 to work around the problem.

Two basic versions of the code can be built:

Version               Executable Name
MPI-only              gtcmpi
Combined MPI/OpenMP   gtc

The basic command to build the code is

gmake <opt1=y/n> <opt2=y/n>

Options included in the Makefile are:

OPENMP=y|n

With or without OpenMP support; default is y.

DOUBLE_PRECISION=y

Use 8-byte (64-bit) floating-point precision.

DEBUG=y|n

With or without debug option (-g). The default is n.

ESSL=y|n

With or without the AIX Engineering and Scientific Subroutine Library (ESSL), used for the FFT routines. The default is n.

64BITS=y|n

64- or 32-bit compilation mode on AIX. The default is 64-bit.

PGI=y

Use the PGI compiler (pgf90) on Linux. The default is to use the Lahey-Fujitsu compiler lf95.

ALTIX=y

Compile with the Intel compiler on the SGI Altix, using ifort ... -lmpi.
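
For example, a double-precision hybrid MPI/OpenMP build (the precision required for the benchmark runs) might be requested with

gmake OPENMP=y DOUBLE_PRECISION=y

while OPENMP=n presumably produces the MPI-only executable gtcmpi. The exact combination of options depends on the target system; consult the Makefile.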

The vendor is expected to provide the appropriate system calls in timers.f to return local processor CPU time and global elapsed (wall clock) time.

MPI ranks are identified by the variable "mype", with mype=0 acting as the master.


Build-Related Files in This Distribution

Each file is listed below with its purpose; preprocessing conditionals that apply to a file are given in brackets.

README.html           - README in HTML format
README                - README in ASCII text
main.F90              - main program with MPI_Init
module.F90            - precision, parameter, and global array declarations  [double precision, NEC SX]
setup.F90             - MPI_comm_size, MPI_comm_rank  [AIX, OpenMP, Debug]
setup_vec.F90         - same as setup, with additional code to minimize bank conflicts on vector systems  [AIX, OpenMP, SX, Debug]
shifte.F90            - move electrons across domains  [OpenMP]
shifti.F90            - move ions across domains  [OpenMP]
shifti_vec.F90        - vectorizable version of shifti, with SX-6-specific performance monitoring calls  [SX]
chargee.F90           - interpolate electron charge density to the grid, first locally and then globally via MPI_SENDRECV and MPI_ALLREDUCE  [OpenMP]
chargei.F90           - interpolate ion charge density to the grid, first locally and then globally via MPI_SENDRECV and MPI_ALLREDUCE  [OpenMP]
chargei_vec.F90       - vectorizable version of chargei  [SX and OpenMP; SX compiler directives]
tracking.F90          - tag each particle with a unique number that is carried with it as it moves from one processor/domain to another; write track data to HDF files when taking a snapshot  [HDF]
smooth.F90            - charge smoothing; MPI_SENDRECV of flux; MPI_GATHER for the transpose of a 2-D matrix  [ESSL]
diagnosis.F90         - global sums; write output files  [NERSC]
pushi_vec.F90         - vectorizable version of the ion particle push  [SX performance trace routines]
pushi.f90             - ion particle push; MPI_ALLREDUCE for various diagnostics
pushe.f90             - electron particle push; MPI_ALLREDUCE for various diagnostics
snapshot.f90          - write particle data for re-runs; MPI_REDUCE
field.f90             - finite differencing for the electric field; MPI_SENDRECV
function.f90          - various Fortran 90 function definitions
function_alt.f90      - alternate version for compilers that reject the function declarations in function.f90
load.f90              - initialize particle positions and velocities
poisson.f90           - iterative Poisson solve, first locally and then globally via MPI_ALLGATHER
restart.f90           - write/read restart dumps
poisson_vec.f90       - vectorizable version of the Poisson solve
set_random_values.f90 - random number generators with Fortran 90 and Fortran 77 interfaces
fft_cray.f90, fft_essl.f90, fft_gl.f90, fft_nag.f90, fft_sx6.f90
                      - interfaces to various system-specific FFT libraries
Makefile              - make file

 


Running the Code

The GTC benchmark code must be run using 64-bit floating-point precision for all non-integer data variables.

There is a subdirectory "run" in which input data files and sample batch submission scripts are located. The code does not require or accept any command-line arguments; however, an input data file named "gtc.input" is required.
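
Since the input is namelist based, it is read with standard Fortran namelist I/O. The sketch below shows the general pattern only; the namelist group and variable names (sketch_params, nstep, nparticle) are placeholders, not GTC's actual parameters, which are documented at the end of the provided input files.

  program read_input_sketch
    ! Sketch of namelist-style input handling; names here are hypothetical.
    implicit none
    integer, parameter :: iunit = 11
    integer :: nstep = 100, nparticle = 100000   ! defaults, overridden by the file
    integer :: ios
    namelist /sketch_params/ nstep, nparticle

    open(unit=iunit, file='gtc.input', status='old', action='read', iostat=ios)
    if (ios /= 0) stop 'gtc.input not found in the current working directory'
    read(iunit, nml=sketch_params)
    close(iunit)
  end program read_input_sketch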


Timing Issues

All timings made by this code are via the timer() subroutine in main.F90. This routine returns both the CPU time and the wall-clock (elapsed) time. The source should be modified by the vendor to provide the former for the particular hardware with acceptable accuracy and resolution; the latter is returned via MPI_WTIME() and should remain unchanged.
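
A minimal version of such a timer routine might look like the sketch below; the routine name matches the description above, but the use of the portable Fortran intrinsic cpu_time for the CPU-time part is only an illustration of what the vendor-supplied call replaces.

  subroutine timer(t_cpu, t_wall)
    ! Return the local CPU time and the elapsed (wall-clock) time.
    ! Sketch only: the benchmark prescribes MPI_WTIME for the wall clock,
    ! while the CPU-time call is to be supplied by the vendor (the Fortran
    ! intrinsic cpu_time is used here purely for illustration).
    use mpi
    implicit none
    real(kind(1.0d0)), intent(out) :: t_cpu, t_wall
    real :: t

    call cpu_time(t)
    t_cpu  = t
    t_wall = MPI_WTIME()
  end subroutine timer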

The time to be reported is the line "TOTAL WALL CLOCK TIME(SEC)."

The code is heavily instrumented, with eight different portions of the code timed separately. The initialization phase is timed, but for runs with large numbers of timesteps it comprises only a small portion of the total time. All MPI processes call the timers, but only the master process prints the times, and there are no barriers before the timing calls that would measure the collective time of all processors.

Here is a timing profile for the three sample problems, obtained from the machine Jacquard at NERSC.

 

              Small      Medium     Large
PEs               4          64       256
Particles    323590      323590    323590
Grid points   32449       32449     32449
# domains         4          64       256
Timesteps       100        2000      2000

Times in seconds:

Pusher        148.4      3,191.    3,251.
Shift          5.35       481.5    1,337.
Charge        156.5      3,483.    3,713.
Poisson        19.1       765.0     760.8
Smooth         10.6       218.6     403.3
Field           3.6        86.1      87.2
Load            2.6         3.6       4.0
Total         347.3      8,229.    9,559.

%Pusher         43%         39%       34%
%Shifter         2%          6%       14%
%Poisson         5%          9%        8%
%Smoother        3%          3%        4%
%Charge         45%         42%       39%


Storage Issues

Almost all of the storage requirements come from the allocatable arrays for particle, field, and diagnostic data declared in module.F90 and allocated in setup.F90.
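
The pattern is the usual Fortran one of module-level allocatable arrays sized at run time, as in the stripped-down sketch below; the array and dimension names are illustrative, not GTC's.

  module gtc_storage_sketch
    ! Illustrative only: global allocatable arrays declared in a module
    ! (as in module.F90) and allocated once the problem size is known
    ! (as in setup.F90).  Names and dimensions are not GTC's.
    implicit none
    real, allocatable :: particles(:,:)   ! per-particle phase-space data
    real, allocatable :: rho_grid(:,:)    ! charge density on the local grid
  contains
    subroutine allocate_arrays(nattr, nplocal, ngridlocal, nplanes)
      integer, intent(in) :: nattr, nplocal, ngridlocal, nplanes
      allocate(particles(nattr, nplocal))       ! scales with particles per process
      allocate(rho_grid(ngridlocal, nplanes))   ! scales with the local grid size
    end subroutine allocate_arrays
  end module gtc_storage_sketch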

Memory Required (per MPI process) By The Sample Problems:

Small    0.26 GB
Medium   0.26 GB
Large

Required Runs

Two problems are to be run and their outputs reported; see the table below. These two problems use weak scaling. All input from these files is namelist based and is explained at the end of the files; however, nothing in the three input files provided should be changed for the runs.

Problem Size   Concurrency (# of MPI processes)   # of Particles    # of Grid Points
small          64                                 207,097,600       2,076,736
medium         512                                1,656,780,800     16,613,888
large          2048                               6,627,123,200     66,455,552

The NERSC-6 "base case" runs are the 512- and 2048-MPI-process configurations. Use the gtc.input.512p and gtc.input.2048p input files for these runs, respectively. An input file for a smaller configuration (gtc.input.64p) is also provided for correctness checking and/or projections. Computational nodes employed in the benchmark must be fully packed; that is, the number of processes or threads executing on each node must equal the number of physical processors on that node.

The input file must be renamed "gtc.input" and placed in the current working directory. A problem description reflecting the input parameters is printed; it also indicates whether OpenMP threads were used in the run.


Verifying Results

The code checks itself after the last time step by comparing a computed value with the corresponding value calculated on the Cray XT4 system at NERSC using the PathScale compiler. The difference reported should be less than 0.01. The internal error checking only works when the number of MPI tasks equals 64, 512, or 2048; an error condition will result otherwise.


Modification Record

This is Version 2 (03/28/2005).


Record of Formal Questions and Answers

No entries as yet.


Bibliography

[ZIN] Z. Lin, G. Rewoldt, S. Ethier, et al., "Particle-in-cell simulations of electron transport from plasma turbulence: recent progress in gyrokinetic particle simulations of turbulent plasmas," Journal of Physics: Conference Series 16 (2005) 16-24.

[GTC] Gyrokinetic Particle Simulations, http://w3.pppl.gov/theory/proj_gksim.html

[FUS] SciDAC Fusion Research, http://www.scidac.gov/FES/FES_GPSC.html

[ITER] ITER Fusion Research, http://www.iter.org

[LEE] Full-torus simulation of turbulent transport, http://www.nersc.gov/news/annual_reports/annrep01/sh_FES_05.htm

[Ethier] S. Ethier, W. Tang, and Z. Lin, "Gyrokinetic particle-in-cell simulations of plasma microturbulence," Journal of Physics: Conference Series 16 (2005) 1-15.