NIH Helix Systems

Biowulf: A Beowulf for Bioscience

Steven Fellini
sfellini@nih.gov


Susan Chacko
susanc@helix.nih.gov


CIT

October 24, 2008

This page is at http://biowulf.nih.gov/seminar.html
This Biowulf User Guide is at http://biowulf.nih.gov/user_guide.html
The Biowulf Home Page is at http://biowulf.nih.gov



What is Biowulf?

  • A large-scale Linux cluster for NIH scientists
  • Linux clusters (also known as Beowulf-type clusters): commodity computers interconnected by (mostly) commodity networks, running the Linux (Unix) operating system
  • Funded by the NIH Management Fund and CIT, built and supported by Helix Systems Staff, CIT (1999)

  • Production facility (high availability, data integrity and backup)
  • General purpose scientific (not dedicated to any one application type)
  • Access to shared central storage (Helix)
  • Research on Biowulf (http://helix.nih.gov/research.html)

    • Molecular dynamics
    • Linkage analysis
    • DNA/Protein sequence analysis
    • NMR spectral analysis
    • Statistical analysis
    • Microarray data analysis
    • Protein folding
    • PET and EPR imaging
    • Free energy calculations
    • Rendering

    Cluster Basics



    Important concepts

  • A node is one computer out of the 1900+ computers in the Biowulf cluster.
  • A processor (or core) is one CPU on a node. Each Biowulf node has 2 or 4 processors.
  • A process is one command or program that runs on 1 processor.
  • The Biowulf login node and all the computational nodes run Linux.
  • Interactive vs. batch processing: the Biowulf cluster is primarily a batch system, managed by PBS.
  • Shared vs. distributed memory: the cluster is a distributed-memory system.
  • Serial programs (run as swarms) vs. parallel programs (message passing).
  • Parallel job scaling: you must benchmark parallel code.

    When do you need to use Biowulf?

    • Long jobs (e.g. molecular dynamics)
    • Many jobs (e.g. bioinformatics with 50+ sequences)
    • Large-memory jobs (e.g. Gaussian requiring 8GB memory)
    • Parallel jobs (e.g. NAMD on 16 processors)

    What jobs are unsuitable for Biowulf?

    • "Serial" jobs: a single program running on a single processor.
    • Small numbers of short jobs.
    • Interactive jobs (but see Interactive Batch below).

    Accounts & Passwords

  • Every user must have his/her own account. NO SHARING of accounts.
  • A Biowulf account requires a pre-existing Helix account (http://helix.nih.gov/new_users/accounts.html).
  • Register for a Biowulf account at http://helix.nih.gov/register/biowulf.html.
  • Passwords: initially the same as your Helix password, but not synchronized thereafter.
  • Connecting to Biowulf

    ssh to biowulf.nih.gov (see http://helix.nih.gov/new_users/connect.html)
    No major computation on the login node!
    Email goes to user@helix.nih.gov.
    Important: forward your Helix email if you don't normally read it! (see http://helix.nih.gov/docs/online/email/email.html#forward)
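
    For example, from a terminal (replace "user" with your own login name; this
    command line is illustrative only):

    ssh user@biowulf.nih.gov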

    Storage & Backups

  • /home/user -- 1 GB; the same directory is shared between Helix and Biowulf.
  • /data/user -- 48 GB default quota.

    Location                   Creation              Backups  Amount of Space           Accessible from (*)
    /home (network, NFS)       with Helix account    yes      1 GB (quota)              B,C
    /scratch on nodes (local)  created by user       no       30 - 166 GB, dedicated    C
                                                              while node is allocated
    /scratch on biowulf (NFS)  created by user       no       500 GB, shared            B,H
    /data (network, NFS)       with Biowulf account  yes      based on quota            B,C,H
                                                              (48 GB default);
                                                              HSM managed
    (*) H = helix, B = biowulf login node, C = biowulf computational nodes

  • checkquota command
  • don't use /tmp on login node
  • use /data/yourname, not /data3/c/yourname
  • clearscratch - to clear out local /scratch on nodes
  • Snapshots. (see http://helix.nih.gov/new_users/backups.html)
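
    Because the local /scratch on a node is dedicated to your job only while the
    node is allocated, a common pattern is to stage data there inside the batch
    script itself. A minimal sketch (the program, options and data paths are
    placeholders; depending on the node you may need to create your own
    subdirectory under /scratch first):

    #!/bin/bash
    #PBS -N ScratchDemo
    mkdir -p /scratch/$USER                # working directory on the node's local disk
    cp /data/me/mydata /scratch/$USER/     # stage the input to local scratch
    cd /scratch/$USER
    myprog -a 100 < mydata > results       # run against the local copy
    cp results /data/me/                   # copy results back before the node is released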

    Running Programs on Biowulf

    Demo: set up and submit a simple batch job.
    #!/bin/tcsh
    #
    # this file is myjob.sh
    #
    # PBS directives: -N sets the job name, -m be sends mail when the job
    # begins and ends, -k oe keeps the standard output and error files
    #PBS -N MyJob
    #PBS -m be
    #PBS -k oe
    #
    myprog -a 100 < /data/me/mydata
    
    qsub -l nodes=1 myjob.sh
    

    Notes:

  • Jobs are queued.
  • Nodes are allocated exclusively.
  • Most jobs should be run in batch.

    Monitoring Batch Jobs

  • qstat
  • cluster monitor
  • jobload

  • Demo: fully utilizing the node.
    #!/bin/bash
    #
    # this file is myjob.sh
    #
    #PBS -N MyJob
    #PBS -m be
    #PBS -k oe
    #
    # one process per processor: "&" runs each program in the background,
    # and "wait" keeps the job alive until both have finished
    myprog -a 100 < infile1 > outfile1 &
    myprog -a 200 < infile2 > outfile2 &
    wait
    

    Notes:

  • If a job ends unexpectedly, check the standard error/output files for clues.
  • Also check your quota!
  • qdel - to delete a job.
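
    For example (the job ID is a placeholder, and the jobload invocation shown
    is illustrative; see the Biowulf user guide for its exact options):

    qstat -u $USER      # list your jobs and their states (Q = queued, R = running)
    jobload 123456      # check how fully job 123456 is using its node(s)
    qdel 123456         # delete job 123456 if something has gone wrong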

    Interactive Batch

    qsub -I -l nodes=1
    
    Demo: Matlab on an interactive node.
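
    A typical session might look like the following sketch (the Matlab start-up
    option is an assumption about how you want to run it):

    qsub -I -l nodes=1     # request one node; your shell moves to it when allocated
    matlab -nodesktop      # run Matlab in the terminal on that node
    exit                   # log out to release the node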

    Installed Applications

    http://biowulf.nih.gov/apps

    Demo: easyblast



    Biowulf Hardware Configuration (1900 nodes, 4700 processors)


    Fileservers:

  • NetApp FAS960c (4) Filers
  • NetApp 3050c (4) Filers
  • NetApp R200, R100 (2) NearStore

    Network Switches:

  • Foundry MG-8 (core): 10 GbE & GbE
  • Foundry EdgeSwitch FES448: GbE
  • Foundry FastIron: 100Base-T & GbE
  • Myricom Clos-128: Myrinet 2000
  • Voltaire ISR 9288: Infiniband
  • Node configurations

    # of nodes  processors per node            memory          network
    232         4 x 2.8 GHz AMD Opteron 290    8 GB            1 Gb/s ethernet
                (1 MB secondary cache)
    471         2 x 2.8 GHz AMD Opteron 254    40 x 8 GB       1 Gb/s ethernet
                (1 MB secondary cache)         226 x 4 GB      160 x 8 Gb/s Infiniband
                                               91 x 2 GB
    289         4 x 2.6 GHz AMD Opteron 285    8 GB            1 Gb/s ethernet
                (1 MB secondary cache)
    389         2 x 2.2 GHz AMD Opteron 248    129 x 2 GB      1 Gb/s ethernet
                (1 MB secondary cache)         66 x 4 GB       80 x 2 Gb/s Myrinet
    91          2 x 2.0 GHz AMD Opteron 246    48 x 1 GB       1 Gb/s ethernet
                (1 MB secondary cache)         43 x 2 GB       48 x 2 Gb/s Myrinet
    390         2 x 2.8 GHz Intel Xeon         119 x 1 GB      64 x 2 Gb/s Myrinet
                (512 kB secondary cache)       207 x 2 GB      189 x 1 Gb/s ethernet
                                               64 x 4 GB       201 x 100 Mb/s ethernet
    1           32 x 1.4 GHz SGI Itanium 2     96 GB           1 Gb/s ethernet

  • 2-4 GB swap space
  • 30-136 GB local scratch disk
  • 19 different node configurations
  • previous generations of nodes: 450 MHz, 550 MHz and 866 MHz Pentiums,
    and 1.4 GHz and 1.8 GHz Athlons

    Jobs on the Biowulf Cluster

  • The batch system allocates nodes (2 or 4 processors/node)
  • Each allocated processor should run a single process
  • Small numbers of serial jobs (~1-4): do you need a cluster?

    Swarms of serial jobs

    #!/bin/bash
    #
    # this file is myjob.sh
    #
    #PBS -N MyJob
    #PBS -m be
    #PBS -k oe
    #
    myprog -a 100 < infile1 > outfile1 &
    myprog -a 200 < infile2 > outfile2 &
    wait
    

    Using swarm

    #
    # this file is cmdfile
    #
    myprog -param a < infile-a > outfile-a
    myprog -param b < infile-b > outfile-b
    myprog -param c < infile-c > outfile-c
    myprog -param d < infile-d > outfile-d
    myprog -param e < infile-e > outfile-e
    myprog -param f < infile-f > outfile-f
    myprog -param g < infile-g > outfile-g
    

    swarm -f cmdfile
    

    32 lines (processes) => 16 jobs
    2000 lines (processes) => 1000 jobs

    # bundle option
    #
    swarm -b 20 -f cmdfile
    
    2000 lines (processes) => 100 bundles => 50 jobs

    swarmdel
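
    A command file like the one above can also be generated with an ordinary
    shell loop instead of being typed by hand. A minimal sketch (the input
    directory, program and options are placeholders):

    # write one command line per input file, then submit the whole set with swarm
    for f in /data/me/inputs/*.in; do
        echo "myprog -param a < $f > $f.out"
    done > cmdfile
    swarm -f cmdfile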



    parallel jobs

    Issues:
  • N processes running on N cpus
  • distributed memory vs. shared memory
  • message passing libraries (MPI, PVM)
  • communication and scalability
  • high-performance interconnects
  • Parallel (MPI), multi-node applications only
  • Benchmarking application is essential
  • Low latency more important than bandwidth
  • Bypasses TCP/IP stack
  • Requires compilation against special libraries
  • Two high performance interconnects:
    Myrinet (165 nodes) and Infiniband (112 nodes)

                      gigabit ethernet   Myrinet            Infiniband
                                                            (Pathscale Infinipath)
    bandwidth         1 Gb/s             2 Gb/s             8 Gb/s
    (full duplex)
    latency           "high"             "low"              "low"
    TCP/IP?           yes                no                 no
    nodes             o2xxx, p2800       k8, p2800          o2800
    PBS property      gige               myr2k              ib
    system            all                charmm, amber,     charmm, NAMD,
    applications                         GROMACS            GROMACS

  • benchmarks
  • GROMACS
  • NAMD
  • charmm

  • large memory jobs (>4 GB)

    memory            2.6 and 2.8 GHz     2.8 GHz     SGI Altix 350        SUN X4600, 3.0 GHz
                      Opterons            Opterons    (1.4 GHz Itanium2)   Opterons (Helix!)
                      (dual-core!)
    8 GB              521                 40          -                    -
    96 GB (shared)    -                   -           1 (32 processors)    -
    128 GB (shared)   -                   -           -                    1 (16 processors)

    See also Firebolt web page (http://biowulf.nih.gov/firebolt.html) for running jobs on the Altix.

    Node Selection and Allocation

  • Node priorities (fastest nodes have highest priorities)
  • Fair Share (1 week half-life)
  • PBS Node Properties

    property   selects
    o2800      2.8 GHz Opteron processor
    o2600      2.6 GHz Opteron processor (dual-core only)
    o2200      2.2 GHz Opteron processor
    o2000      2.0 GHz Opteron processor
    k8         2.2 GHz or 2.0 GHz Opteron processor
    p2800      2.8 GHz Xeon processor
    m2048      2 GB memory
    m4096      4 GB memory
    m8192      8 GB memory (reserved)
    x86-64     64-bit Linux
    gige       Gigabit ethernet network
    myr2k      Myrinet network (reserved)
    ib         Infiniband network (reserved)
    dc         Dual-core processors, 8 GB memory (reserved)
    altix      SGI Altix processor (reserved)

    Notes:

  • specify only the properties your job requires
  • nodes with reserved properties are allocated only if explicitly specified
  • memory properties are per 2 processors
  • if a property list can't ever be satisfied, the job will remain queued forever
  • ncpus= and mem= are for Altix jobs only
  • the freen command
  • Single-core vs. Dual-core nodes

  • dual-core nodes are reserved
  • swarm and multiblast will automatically "do the right thing"
  • most parallel code scales poorly! (exception: NAMD)
  • Examples

    qsub -l nodes=1 myjob
    qsub -l nodes=1:x86-64 my64bitjob
    qsub -l nodes=8:o2800:gige -v np=16 namdjob
    qsub -l nodes=4:o2000:myr2k -v np=8 mdjob
    qsub -l nodes=16:ib -v np=32 bignamd.sh
    qsub -l nodes=1:m4096 bigjob.bat
    swarm -l nodes=1:m4096 -f bigjobs
    qsub -l nodes=1:altix:ncpus=4,mem=12gb verybigmem.bat
    

    Limits

    $ batchlim
                 Max CPUs    Max CPUs
                 Per User   Available
                ---------- -----------
    ib          32         n/a        
    norm        160        n/a        
    nist1       32         172        
    norm3       32         100        
    nist2       16         48
    
    $ freeib
    13 nodes free in ib (large) queue, 25 in ib2 (small)
    


    System Software

                     Biowulf login   compute nodes   compute nodes    compute nodes       compute nodes
                     node (Xeon)     (Xeon)          (Opteron)        (Opteron/Myrinet)   (Opteron/IB)
    hardware         64-bit          32-bit          64-bit           64-bit              64-bit
    system software/
    compilers        32-bit          32-bit          64-bit           64-bit              64-bit
    application
    software         32-bit          32-bit          32-bit, 64-bit   32-bit, 64-bit      32-bit, 64-bit
    Linux distrib    RHEL 5.2        CentOS 5.2      CentOS 5.2       CentOS 5.2          CentOS 5.2
    Linux kernel     2.6.18-92.1.13.el5PAE (login node); 2.6.18-53.1.21.el5 (all compute nodes)
    C library        glibc-2.5 on all systems

    Note: CentOS is a clone of RHEL

    Compiling Code

  • 32-bit code: compile on biowulf login node
  • 64-bit code: compile on a 64-bit node
    qsub -I -l nodes=1:x86-64
    
  • default Linux compilers: GNU 4.x
  • potentially higher performing compilers: Portland Group, Intel, PathScale
  • Compilers and set-up scripts available on Biowulf:

    compiler        Front-ends                       Environment Setup
    GCC 4.1.2       gcc (C), g++ (C++),              Default
                    g77 (Fortran77),
                    gfortran (Fortran90/95)
    PGI 7.2         pgcc (C), pgCC (C++),            % source /usr/local/pgi/pgivars.sh
                    pgf77 (Fortran77),               % source /usr/local/pgi/pgivars.csh
                    pgf90 (Fortran90),
                    pgf95 (Fortran95)
    Intel v10.1     icc (C), icpc (C++),             % source /usr/local/intel/intelvars.sh
                    ifort (Fortran77/90/95)          % source /usr/local/intel/intelvars.csh
    Pathscale 3.1   pathcc (C), pathCC (C++),        % source /usr/local/pathscale/pathvars.sh
                    pathf90 (Fortran77/90),          % source /usr/local/pathscale/pathvars.csh
                    pathf95 (Fortran95)
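
    For example, to build a Fortran program with the Intel compiler (a minimal
    sketch; the source file name and the -O2 option are placeholders, and the
    setup script path is taken from the table above):

    source /usr/local/intel/intelvars.sh    # put the Intel compilers on your PATH
    ifort -O2 -o mycode mycode.f90          # compile a Fortran 90 source file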

    Debuggers

    • GNU Debugger (gdb)
    • Valgrind (valgrind)
    • Intel Debugger (idb)
    • Portland Group Debugger (pgdbg)
    • Pathscale Debugger (pathdb)

    Compiling MPI parallel programs

  • Source compiler environment file (as above).
  • Set your PATH environment variable:
    PATH=<MPICH Home>/bin:$PATH (ethernet)
    PATH=<MPICH-GM Home>/bin:$PATH (myrinet)
    
    Infiniband only, compile on IB node:
    qsub -l nodes=1:ib -I
    
    (doesn't require a special PATH)
  • Compile the program.
  • Runtime PATH must be set to the same PATH as above!
  • MPI PATHS:

    Compiler     Ethernet (MPICH2)                Myrinet (MPICH1)           Infiniband (MPICH1)
    GNU          /usr/local/mpich2                /usr/local/mpich-gm2k      default PATH
                 /usr/local/mpich2-gnu64                                     (all compilers;
    PGI          /usr/local/mpich2-pgi            /usr/local/mpich-gm2k-pg   compile on an
                 /usr/local/mpich2-pgi64                                     IB node)
    Intel        /usr/local/mpich2-intel          /usr/local/mpich-gm2k-i
                 /usr/local/mpich2-intel64
    Pathscale    /usr/local/mpich2-pathscale      /usr/local/mpich-gm2k-ps
                 /usr/local/mpich2-pathscale64

    MPI compiler wrappers:

    C          mpicc
    C++        mpicxx
    Fortran    mpif77, mpif90
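
    For example, to compile an MPI program with the PGI compilers against the
    ethernet MPICH2 build (a sketch; the paths come from the table above, and
    hello_mpi.c is the sample program shown below):

    source /usr/local/pgi/pgivars.sh                   # PGI compiler environment
    PATH=/usr/local/mpich2-pgi/bin:$PATH; export PATH  # matching MPICH2 build
    mpicc -o hello_mpi hello_mpi.c                     # the mpicc wrapper calls pgcc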

    Sample MPI program:

    #include <stdio.h>
    #include <string.h>     /* strlen */
    #include <unistd.h>     /* gethostname */
    #include "mpi.h"
    int
    main(int argc, char **argv)
    {
      int myrank, n_processes, srcrank, destrank;
      char mbuf[512], name[40];
      MPI_Status mstat;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &myrank);      /* this process's rank */
      MPI_Comm_size(MPI_COMM_WORLD, &n_processes); /* total number of processes */
      if (myrank != 0) {
        /* every non-zero rank sends a greeting to rank 0 */
        gethostname(name, 39);
        sprintf(mbuf, "Hello, from process %d on node %s!", myrank, name);
        destrank = 0;
        MPI_Send(mbuf, strlen(mbuf)+1, MPI_CHAR, destrank, 90, MPI_COMM_WORLD);
      } else {
        /* rank 0 collects and prints the greetings */
        for (srcrank = 1; srcrank < n_processes; srcrank++) {
          MPI_Recv(mbuf, 512, MPI_CHAR, srcrank, 90, MPI_COMM_WORLD, &mstat);
          printf("From process %d: %s\n", srcrank, mbuf);
        }
      }
      MPI_Finalize();
      return 0;
    }
    

    % PATH=/usr/local/mpich/bin:$PATH
    % mpicc -o hello_mpi hello_mpi.c
    

    Running MPI Programs (MPICH2)

    #!/bin/bash
    # This file is hello-mpich2.bat
    #
    #PBS -N Hello
    PATH=/usr/local/mpich2/bin:$PATH; export PATH
    mpdboot -f $PBS_NODEFILE -n `cat $PBS_NODEFILE | wc -l`
    mpiexec -n $np /home/steve/hello/hello-mpich2
    mpdallexit
    

    qsub -v np=16 -l nodes=8 hello-mpich2.bat

    Infiniband only:

    % cd ~/.ssh
    % cp /usr/local/etc/ssh_config_ib config
    % chmod 600 config
    

    Running MPI Programs (MPICH1)

    #!/bin/bash
    # This file is hello-mpich.bat
    #
    #PBS -N MyJob
    #PBS -m be
    #PBS -k oe
    PATH=/usr/local/mpich/bin:$PATH; export PATH
    mpirun -machinefile $PBS_NODEFILE -np $np hello_mpi
    
    qsub -v np=8 -l nodes=4:myr2k hello-myr.sh
    qsub -v np=16 -l nodes=8:ib hello-ib.sh
    

    Scientific Libraries

    • FFTW
    • Intel Math Kernel Library (MKL)
    • AMD Core Math Library (ACML)
    • Intel Integrated Performance Primitives (IPP)
    • GNU Scientific Library (GSL)
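
    For example, a C program that uses the GNU Scientific Library might be
    compiled and linked as follows (a sketch; the source file name is a
    placeholder, and additional -I/-L flags may be needed depending on where
    the library is installed on Biowulf):

    gcc -o mygslprog mygslprog.c -lgsl -lgslcblas -lm   # link against GSL, its CBLAS, and libm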

    Contact Biowulf Staff at staff@biowulf.nih.gov