Programming on Franklin

Important Notice

NERSC is upgrading Franklin to a quad-core XT4 system from July to October 2008. Please refer to the Franklin Quad Core Upgrade Plan for detailed timelines and for changes in the user environment and in programming on Franklin.
Guidelines for Compilation and Linking on Franklin

There are two separate run-time environments on Franklin: the full Linux environment on the login nodes, which supports dynamic loading, and the lightweight microkernel environment on the compute nodes, which requires statically linked executables.
The default compilers on Franklin are the PGI Fortran and PGI C/C++ compilers. For very short, serial-only test codes that will run only on the login nodes, users may directly invoke the base compilers from the Portland Group (PGI) suite (e.g., pgf77, pgf90, pgcc, pgCC). Production work should not be run on the login nodes. The guideline for compiling and linking parallel applications to run on the Franklin compute nodes is to always use the compiler wrappers: ftn for Fortran, cc for C, and CC for C++. The wrappers select the cross-compiling environment and automatically link the MPI and other system libraries.
First Examples: Fortran and C++ with MPI "Hello"

Here is a basic example of how to compile, link, and execute a simple Fortran and MPI "hello" code on Franklin; a similar example can be written in C++ and MPI. The Fortran example invokes the ftn compiler wrapper to compile and link for the parallel environment, and the C++ example uses the CC wrapper. Codes written in C should be compiled by invoking the cc wrapper. Man pages are available for these wrappers.
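As a minimal sketch (the linked example page is not reproduced here, so the file name, program text, and process count are illustrative assumptions), a Fortran MPI "hello" program might look like the following:

      program hello
      include 'mpif.h'
      integer ierr, rank, nprocs
      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
      call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
      print *, 'Hello from rank', rank, ' of', nprocs
      call MPI_Finalize(ierr)
      end

It would be compiled and linked with the ftn wrapper and then launched on the compute nodes from within a batch job, for example:

% ftn -o hello_mpi hello_mpi.f
% aprun -n 4 ./hello_mpi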
GNU Compilers and Pathscale Compilers

GNU and Pathscale compilers are also available, through modules, as alternative base compiler sets. As with the PGI compiler suite, users should not attempt to directly compile and link cross-compiled codes for the parallel compute node environment using the GNU or Pathscale compilers themselves. The base compiler set under the wrappers (ftn, CC, cc) can be swapped from PGI to GNU or Pathscale with the module command:

$ module swap PrgEnv-pgi PrgEnv-gnu
$ module swap PrgEnv-pgi PrgEnv-pathscale

There are a total of four Pathscale compiler licenses available on Franklin. When all of the licenses are in use, compilation will fail with an error message such as the following:
franklin% cc -o demo demo.c
/opt/xt-pe/2.0.24b/bin/snos64/cc: INFO: linux target is being used
*** Subscription: Unable to obtain subscription.
For more information, please rerun with the -subverbose flag added to the command line.

MPI Programming

The MPI on Franklin is Cray MPICH2. It implements the MPI-2 Standard, except for the dynamic process spawning functions (which are not possible under the microkernel). It also supports the MPI 1.2 Standard, with minor modifications from the MPI 1.1 Standard. Cray MPICH2 is implemented on top of the Portals low-level message passing scheme. A high-performance, portable MPI-IO library, ROMIO, developed by Argonne National Laboratory, is also available.

As illustrated in the examples above, user codes must include the MPI library header file appropriate for the source language.

For Fortran codes:
      include 'mpif.h'

For C or C++ codes:
#include <mpi.h>

For C++ codes, it is important that the include for mpi.h come before any other include directives. The compiler wrappers will automatically link the MPI libraries. These wrappers should be used for all parallel code compile and link steps:

% ftn mpi_program.f
% cc mpi_program.c
% CC mpi_program.C

MPI Deadlock From Send-to-Self Messages

Cray MPICH2 has a known deadlock problem when an MPI task sends a message to itself. This is due to the lack of MPI buffering for the same-node send-receive pair. Users must modify their source codes to exclude these message passing patterns (a sketch appears at the end of this section). This restriction may be removed in a future release.

MPI Rank Assignments

The distribution of MPI ranks on the nodes can be written to the standard output file by setting the environment variable PMI_DEBUG to 1. Users can control the distribution of MPI tasks on the nodes using the environment variable MPICH_RANK_REORDER_METHOD. See MPI Task Distribution on Nodes and the "intro_mpi" man page for more information. See also: Some XT specific tuning for MPI programs.
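The following is a rough sketch, not taken from the Franklin documentation, of one way to exclude send-to-self messages; the program structure and variable names are illustrative assumptions. The idea is simply to detect when the destination rank equals the calling rank and replace the MPI call with a local copy:

      program noself
      include 'mpif.h'
      integer ierr, me, dest
      integer sendbuf, recvbuf
      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, me, ierr)
      sendbuf = me
c     illustrative: the destination happens to be this task itself
      dest = me
      if (dest .eq. me) then
c        a local copy replaces the send/receive pair to self
         recvbuf = sendbuf
      else
c        remote destination: ordinary send (matching receive is
c        assumed to be posted on the destination task)
         call MPI_Send(sendbuf, 1, MPI_INTEGER, dest, 0,
     &                 MPI_COMM_WORLD, ierr)
      end if
      print *, 'rank', me, 'has', recvbuf
      call MPI_Finalize(ierr)
      end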
SHMEM Programming

The Cray SHared MEMory (SHMEM) library is a set of logically shared, distributed memory access routines. Cray SHMEM library routines are similar to MPI library routines in that they both pass data among a set of parallel processors, but SHMEM routines use one-sided put and get communications to remote address spaces. Cray SHMEM is implemented on top of the Portals low-level message passing scheme. As with MPI, a header file is required.

For Fortran:
      include 'mp/shmem.fh'

For C/C++:
#include <mp/shmem.h>

The compiler wrappers will automatically link the SHMEM libraries:

% ftn shmem_program.f
% cc shmem_program.c
% CC shmem_program.C

Please refer to the intro_shmem man page for more information about SHMEM. See also: Some XT specific tuning for SHMEM programs.
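As a minimal sketch (not from the Franklin documentation; the program structure and names are illustrative assumptions), a one-sided put between neighboring processing elements might look like the following. The target variable is placed in a common block because common-block data is symmetric, i.e., remotely accessible:

      program ring
      include 'mp/shmem.fh'
      integer dest, me, npes, right
c     data in a common block is symmetric (remotely accessible)
      common /sym/ dest
      call start_pes(0)
      me    = my_pe()
      npes  = num_pes()
      right = mod(me + 1, npes)
c     one-sided put of this PE's number into the right neighbor's dest
      call shmem_integer_put(dest, me, 1, right)
c     make sure all puts have completed before reading dest
      call shmem_barrier_all()
      print *, 'PE', me, 'received', dest
      end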
Executable File Sizes and Compile Times

Consider the following 33-byte Fortran source program:

/scratchdir => cat hello.f
      print *,"Hello!"
      end

When this code is compiled for serial execution on the login nodes, under a standard Linux environment that supports dynamic loading, the executable size is 2.2 megabytes using the PGI compilers and 26.4 kilobytes using the GNU compilers. However, when the same source code is compiled with the cross-compiling wrapper ftn for the microkernel environment on the compute nodes, where static loading is required, the executable size is 13.0 megabytes using the PGI compilers and 11.1 megabytes using the GNU compilers. Executables for the parallel, compute node environment are larger because of static linking.

If an attempt is made to statically link an executable in excess of 2 gigabytes, the linker will produce a truncation error message such as the following:

... : relocation truncated to fit: ...

It is then generally necessary to reduce large static arrays in the code, replacing them with dynamically allocated arrays (a sketch appears at the end of this page). This problem is more common in older codes with large static arrays (or Fortran common arrays) that are used in various ways by subroutines as a user-managed dynamic memory area. Compile times may be significantly longer when cross-compiling for the static linking environment on the compute nodes because of the added I/O time required to make static copies of library routines. The object mode on Franklin is 64-bit, which means that all executables run in 64-bit address mode.

Memory Considerations

Each dual-core node has about 3.75 GB of user-accessible memory. When running in the default dual-core mode with two MPI tasks per node, each MPI task will have access to about 1.75 GB of memory. Running in explicit single-core mode with one MPI task per node allows each MPI task to use 3.58 GB of user memory. Memory use by the MPI or SHMEM layer may grow as you move to higher processor counts. See Memory Usage Consideration on Franklin for more details.

Debugging and Optimization

The basic debugging tool on Franklin is the Distributed Debugging Tool (DDT) from Allinea Software. The Multi Core Report, jointly produced by Cray, NERSC, and AMD, presents the dual-core and quad-core processor architectures, analyzes the impact of multi-core processors on the performance of selected micro and application benchmarks, and discusses compiler options and software optimization techniques. Please also refer to Important Portland Group Compiler Options for basic tuning with compiler option choices. Here is a collection of papers written by Stephen Whalen from Cray on optimizing the NPB benchmarks for multi-core AMD Opteron microprocessors. Many of the techniques described in these papers can be used in optimizing general applications.
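As a minimal sketch of the static-to-dynamic conversion mentioned above (the array name and size are illustrative assumptions, not from an actual Franklin code), a large static array can be replaced with an allocatable one so that the memory is obtained at run time instead of contributing to the size of the statically linked executable:

c     A hypothetical "before" declaration that would be compiled into
c     the executable's static data and could trigger the
c     "relocation truncated to fit" error:
c         real*8 work(300000000)
c     The "after" version allocates the array at run time instead:
      program bigarray
      real*8, allocatable :: work(:)
      integer n, ierr
      n = 300000000
      allocate(work(n), stat=ierr)
      if (ierr .ne. 0) stop 'allocation of work array failed'
      work(1) = 1.0d0
      print *, 'allocated', n, 'elements'
      deallocate(work)
      end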