
Programming on Franklin

Important Notice

NERSC is upgrading Franklin to a quad-core XT4 system from July to October 2008. Please refer to the Franklin Quad Core Upgrade Plan for detailed timelines and for changes to the user environment and to programming on Franklin.


Guidelines for Compilation and Linking on Franklin

There are two separate run-time environments on Franklin:

  • The Linux environment on the login nodes, which allows dynamic linking (smaller executable files) and provides a full set of system routines
  • The Compute Node Linux environment on the compute nodes, which requires static linking (much larger executable files) and supports fewer system routines
All executables must be compiled and linked on the login nodes, targeting one environment or the other, but never both. Every compilation for parallel execution on Franklin is, in fact, a cross-compilation. Because of the innate complexity of cross-compilation (compiling and linking in one environment for execution in a different environment), Cray provides wrappers for the compilers which should be used for all parallel user codes.

The default compilers on Franklin are the PGI Fortran and PGI C/C++ compilers. For very short, serial-only test codes that will run only on the login nodes, users may invoke the base compilers from the Portland Group (PGI) suite directly (e.g., pgf77, pgf90, pgcc, pgCC). Production work should not be run on the login nodes.

Here are guidelines for compiling and linking parallel applications to run on the Franklin compute nodes:

  • Always use the vendor-provided compiler wrappers (ftn for Fortran codes, cc for C codes, and CC for C++ codes). The wrappers automatically link with the correct Portals libraries (MPI, SHMEM, etc.) and with some scientific libraries (such as Cray LibSci). Use "ftn -v", "cc -v", or "CC -v" to see all of the libraries that are included.
  • Do not use the vendor base compilers (such as pgf90, pgcc, pgCC, gcc, g++, pathf90, pathcc, etc.), which cannot find the necessary Portals libraries (MPI, SHMEM, etc.).
  • Do not use the direct MPI compiler commands (mpif90, mpicc, and mpicxx), which have problems linking with the correct MPI libraries.
  • Do not set the environment variables MPICH_CC to cc, MPICH_F90 to ftn, or MPICH_F77 to f77; doing so sends the compilation into an infinite loop.

First Examples: Fortran and C++ with MPI "Hello"

A basic MPI "hello" code illustrates how to compile, link, and execute a parallel program on Franklin. A Fortran code is compiled and linked for the parallel environment by invoking the ftn compiler wrapper, a C++ code uses the CC wrapper, and a C code uses the cc wrapper. Man pages are available for all three wrappers.
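
For illustration, here is a minimal sketch of such a "hello" program written in C (the Fortran and C++ versions follow the same pattern):

#include <mpi.h>
#include <stdio.h>

/* hello.c -- minimal MPI "hello" */
int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    printf("Hello from rank %d of %d\n", rank, size);

    MPI_Finalize();
    return 0;
}

It is compiled on a login node with the cc wrapper and executed on the compute nodes from within a batch job (launching with aprun is assumed here):

% cc -o hello hello.c
% aprun -n 4 ./hello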

GNU and PathScale Compilers

GNU and PathScale compilers are also available, through modules, as alternative base compiler sets. As with the PGI suite, users should not invoke the GNU or PathScale base compilers directly to compile and link codes for the parallel compute node environment. The base compiler set underneath the wrappers (ftn, CC, cc) can be swapped from PGI to GNU or PathScale with the module command:

$  module swap PrgEnv-pgi PrgEnv-gnu
$  module swap PrgEnv-pgi PrgEnv-pathscale

There are a total of four PathScale compiler licenses available on Franklin. When all of the licenses are in use, compilation fails with an error message such as the following:


franklin% cc -o demo demo.c 
/opt/xt-pe/2.0.24b/bin/snos64/cc: INFO: linux target is being used
*** Subscription: Unable to obtain subscription. For more information,
please rerun with the -subverbose flag added to the command line].

MPI Programming

The MPI on Franklin is Cray MPICH2. It implements the MPI-2 Standard, except for the dynamic process spawning functions (which are not possible under the microkernel). It also supports the MPI 1.2 Standard, with minor modifications from the MPI 1.1 Standard. Cray MPICH2 is implemented on top of the Portals low-level message passing scheme.

A high-performance, portable MPI-IO library, ROMIO, developed at Argonne National Laboratory, is also available.

As illustrated in the examples above, user codes must include the MPI library header file appropriate for the source language.

! For Fortran codes
include 'mpif.h' 

# For C or C++ codes
#include <mpi.h>

For C++ codes, it is important that the include for mpi.h comes before any other include directives.

Compiler wrappers will automatically link the MPI libraries. These wrappers should be used for all parallel code compile and link steps:


% ftn mpi_program.f
% cc mpi_program.c
% CC mpi_program.C

MPI Deadlock From Send-to-Self Messages

Cray MPICH2 has a known deadlock problem when an MPI task sends a message to itself, caused by the lack of MPI buffering for the same-node send-receive pair. Users must modify their source codes to avoid this message-passing pattern (one possible workaround is sketched below). This restriction may be removed in a future release.
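
As an illustrative sketch of one possible workaround (the routine name and buffer types here are hypothetical), a code that would otherwise send a message from a task to itself can copy the data locally and skip the MPI call:

#include <string.h>
#include <mpi.h>

/* Pairwise exchange that avoids the blocking send-to-self pattern. */
void exchange(double *sendbuf, double *recvbuf, int count,
              int partner, MPI_Comm comm)
{
    int rank;
    MPI_Comm_rank(comm, &rank);

    if (partner == rank) {
        /* Self-message: copy locally instead of calling MPI. */
        memcpy(recvbuf, sendbuf, count * sizeof(double));
    } else {
        MPI_Status status;
        MPI_Sendrecv(sendbuf, count, MPI_DOUBLE, partner, 0,
                     recvbuf, count, MPI_DOUBLE, partner, 0,
                     comm, &status);
    }
}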

MPI Rank Assignments

The distribution of MPI ranks on the nodes can be written to the standard output file by setting environment variable PMI_DEBUG to 1. Users can control the distribution of MPI tasks on the nodes using the environment variable MPICH_RANK_REORDER_METHOD. See MPI Task Distribution on Nodes and the "intro_mpi" man page for more information.
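
Besides setting PMI_DEBUG, one simple way to see the placement from within a code is to have each rank print the name of the node it is running on; the sketch below is only illustrative:

#include <mpi.h>
#include <stdio.h>

/* Print which node each MPI rank was placed on. */
int main(int argc, char **argv)
{
    int rank, namelen;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(name, &namelen);
    printf("rank %d is running on node %s\n", rank, name);
    MPI_Finalize();
    return 0;
}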

Some XT-Specific Tuning for MPI Programs

  • The XT is optimized for message preposting: posting receive calls before the matching sends can improve performance (see the sketch after this list).
  • Avoid MPI_Probe and MPI_Iprobe, which eliminate many of the advantages of the Portals network protocol stack.
  • Aggregate very small messages into larger messages.
  • The XT has limited optimization for non-contiguous MPI derived data types; in contrast to some platforms, it may be better to do multiple transfers of contiguous data rather than to send and receive non-contiguous data types.
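
Here is a minimal sketch of the receive-preposting pattern mentioned in the first item, assuming a simple neighbor exchange (the routine name, buffers, and counts are illustrative):

#include <mpi.h>

/* Post the receive before the matching send so the incoming message can be
   delivered directly into the user buffer rather than an unexpected-message buffer. */
void neighbor_exchange(double *sendbuf, double *recvbuf, int count,
                       int left, int right, MPI_Comm comm)
{
    MPI_Request reqs[2];
    MPI_Status  stats[2];

    MPI_Irecv(recvbuf, count, MPI_DOUBLE, left,  0, comm, &reqs[0]);  /* prepost */
    MPI_Isend(sendbuf, count, MPI_DOUBLE, right, 0, comm, &reqs[1]);

    MPI_Waitall(2, reqs, stats);
}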

SHMEM Programming

The Cray SHared MEMory access (SHMEM) library is a set of logically shared, distributed memory access routines. Cray SHMEM library routines are similar to MPI library routines in that both pass data among a set of parallel processes, but SHMEM routines use one-sided put and get operations on remote address spaces. Cray SHMEM is implemented on top of the Portals low-level message passing scheme.

As with MPI, a header file is required:

! For Fortran
include 'mp/shmem.fh'

# For C/C++
#include <mpp/shmem.h>

Compiler wrappers will automatically link the SHMEM libraries:


% ftn shmem_program.f 
% cc shmem_program.c 
% CC shmem_program.C 

Please refer to the intro_shmem man page for more information about SHMEM.
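
As a simple illustration, here is a minimal SHMEM sketch in C; it assumes the Cray SHMEM calls start_pes, _my_pe, _num_pes, shmem_int_p, and shmem_barrier_all described in the man pages:

#include <stdio.h>
#include <mpp/shmem.h>

static int from_left = -1;   /* static storage is symmetric (same address on every PE) */

int main(void)
{
    int me, npes, right;

    start_pes(0);                       /* initialize SHMEM */
    me    = _my_pe();
    npes  = _num_pes();
    right = (me + 1) % npes;

    shmem_int_p(&from_left, me, right); /* one-sided put to the right neighbor */
    shmem_barrier_all();                /* synchronize and complete all puts */

    printf("PE %d received %d from its left neighbor\n", me, from_left);
    return 0;
}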

Some XT-Specific Tuning for SHMEM Programs

  • Use non-blocking SHMEM operations if possible.
  • SHMEM barriers are performed in software and have high overhead, so use them with care.
  • Use shmem_fence rather than shmem_quiet where possible.
  • Do not use strided SHMEM operations in performance-critical sections of an application.

Executable File Sizes and Compile Times

Consider the following 33 byte Fortran source program:
/scratchdir => cat hello.f
      print *,"Hello!"
      end
When this code is compiled for serial execution on the login nodes under a standard Linux environment that supports dynamic linking, the executable size is 2.2 megabytes with the PGI compilers and 26.4 kilobytes with the GNU compilers. However, when the same source code is compiled with the cross-compiling wrapper ftn for the microkernel environment on the compute nodes, where static linking is required, the executable size is 13.0 megabytes with the PGI compilers and 11.1 megabytes with the GNU compilers. Executables for the parallel compute node environment are larger because of static linking.

If an attempt is made to statically link an executable larger than 2 gigabytes, the linker will produce a truncation error message such as the following:

...
: relocation truncated to fit: ...
It is then generally necessary to reduce the large static arrays in the code, replacing them with dynamically allocated arrays. This problem is most common in older codes with large static arrays (or Fortran common blocks) that are used in various ways by subroutines as a user-managed dynamic memory area.
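
As an illustrative C sketch of this kind of change (the array name and size are hypothetical), a large static array can be replaced by one allocated at run time, so that it no longer contributes to the size of the statically linked executable:

#include <stdio.h>
#include <stdlib.h>

/* Before: static double work[250000000];   about 2 GB in the executable's data segment */

int main(void)
{
    size_t n = 250000000;                    /* hypothetical problem size */
    double *work = malloc(n * sizeof(double));

    if (work == NULL) {
        fprintf(stderr, "allocation of work array failed\n");
        return 1;
    }
    /* ... use work[] exactly as before ... */
    free(work);
    return 0;
}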

Compile times may be significantly longer when cross-compiling for the static linking environment on the compute nodes because of the added I/O time required to make static copies of library routines.

The object mode on Franklin is 64-bit, which means that all executables will run in 64-bit address mode.

Memory Considerations

Each dual-core node has about 3.75 GB of user-accessible memory. When running in the default dual-core mode with two MPI tasks per node, each MPI task has access to about 1.75 GB of memory. Running in explicit single-core mode with one MPI task per node allows each MPI task to use 3.58 GB of user memory. Memory use by the MPI or SHMEM layer may grow as you move to higher processor counts. See Memory Usage Consideration on Franklin for more details.

Debugging and Optimization

The basic debugging tool on Franklin is the Distributed Debugging Tool (DDT) from Allinea Software.

The Multi Core Report, jointly produced by Cray, NERSC, and AMD, presents the dual-core and quad-core processor architectures, analyzes the impact of multi-core processors on the performance of selected micro- and application benchmarks, and discusses compiler options and software optimization techniques. Please also refer to Important Portland Group Compiler Options for basic tuning through compiler option choices.

Here is a collection of papers written by Stephen Whalen from Cray on Optimizing the NPB benchmarks for multi-core AMD Opteron Microprocessors. Many of the techniques described in these papers could be used in optimizing general applications.

