NIH Helix Systems

Large-Scale Bioinformatics
on the NIH Biowulf Supercluster

Susan Chacko
susan.chacko@nih.gov

Steven Fellini
sfellini@nih.gov

CIT

1 Dec 2005

This page is at http://biowulf.nih.gov/easy.html
The Biowulf Home Page is at http://biowulf.nih.gov


  • an overview of Biowulf
  • what sorts of sequence analysis projects are suited for Biowulf
  • overview of the available sequence analysis software
  • how to set up and run a large-scale Blast job (demo)
  • how to run a RepeatMasker job.
  • how to set up your own sequence analysis program
  • Example: Parallel Genehunter (linkage analysis) on Biowulf
  • sequence databases available

  • What is Biowulf?

  • Biowulf is the name of a PC/Linux cluster built and supported by the Helix Systems Staff (CIT)
  • Production facility for computationally intensive biomedical computing
  • Access to shared central storage (accessible from all Helix Systems)
  • Primarily funded by CIT (management fund). Additional nodes funded by NHGRI, NIDDK, & NHLBI.
  • PC/Linux Clusters

  • A cluster of computers assembled from mass-market, commodity "off-the-shelf" components
  • PCs (Intel and AMD) interconnected by low-cost local area network technology (fast ethernet)
  • Linux operating system
  • Comparison of Helix (SGI) and Biowulf (cluster):

    Helix (SGI)                                          Biowulf (cluster)
    ------------------------------------------------    --------------------------------------------
    one computer system with CPUs, memory and disks     many systems (nodes)
    proprietary hardware and software                   commodity hardware and open software (Linux)
    moderate number of CPUs (8-32)                      2000+ CPUs
    shared memory                                       distributed memory
    large memory (8-32 GB)                              smaller memory (1-4 GB)
    computation on login system                         computation on computational nodes
    system runs several applications simultaneously     node dedicated to one computation
    interactive                                         queuing system (batch)

    Research on Biowulf

    • Molecular modeling
    • Linkage analysis
    • DNA/Protein sequence analysis
    • NMR spectral analysis
    • Statistical analysis
    • Microarray data analysis
    • Protein folding
    • PET and EPR imaging
    • Free energy calculations
    • Rendering

    Hardware configuration

    nodes (2p)   processors                         memory        networks
    ----------   --------------------------------   -----------   ------------------------------------------
    805          AMD Opteron, 2.8, 2.2 & 2.0 GHz    2 & 4 GB      Infiniband, Myrinet & Gigabit ethernet
    388          Intel Xeon, 2.8 GHz                1, 2 & 4 GB   Myrinet, Gigabit ethernet & Fast ethernet
    203          AMD Athlon, 1.8 & 1.4 GHz          1 & 2 GB      Myrinet & Fast ethernet

  • Foundry, Myricom & Voltaire switches
  • Network Appliance F960 (4), R200 & R100 Filers
  • Accounts

  • requires a pre-existing Helix account
  • registering for a Biowulf account
  • Logging In

  • biowulf.nih.gov
  • ssh recommended
  • program development/compilations
  • no intensive application codes on the login node!
  • mail routed to helix.nih.gov
  • Storage Options

    Location             Type            Creation               Backups  Performance  Amount of Space                  Accessible from (*)
    /home                network (NFS)   with Biowulf account   yes      high         200 MB (quota)                   B,C
    /scratch (nodes)     local           created by user        no       best         6-30 GB, dedicated while         C
                                                                                      the node is allocated
    /scratch (biowulf)   network (NFS)   created by user        no       low          120 GB, shared                   B,H,N
    /data                network (NFS)   with Biowulf account   yes      high         based on quota (48 GB default)   B,C,H,N

    (*) H = helix, N = nimbus, B = biowulf login node, C = biowulf computational nodes
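
    To see how much of these areas you are currently using, the standard Unix du and df commands work from any system that mounts them (a quick sketch; replace username with your own login name):

    # total size of your home and /data directories
    du -sh /home/username
    du -sh /data/username

    # free space on the filesystem holding /data
    df -h /data/username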

    Snapshots Demo

    Backed-up areas such as /home keep read-only snapshot copies in a hidden .snapshot directory, so an accidentally deleted file can simply be copied back:

    [steve@biobos steve]$ ls -l foo.tmp
    -rw-r--r--    1 steve    wheel           2 Mar 11  2005 foo.tmp
    [steve@biobos steve]$ rm foo.tmp 
    [steve@biobos steve]$ ls -l foo.tmp
    ls: foo.tmp: No such file or directory
    [steve@biobos steve]$ cd .snapshot
    [steve@biobos .snapshot]$ ls
    _hourly.0  _hourly.3  _nightly.0   _nightly.11  _nightly.2  _nightly.5  _nightly.8  _weekly.1
    _hourly.1  _hourly.4  _nightly.1   _nightly.12  _nightly.3  _nightly.6  _nightly.9  _weekly.2
    _hourly.2  _hourly.5  _nightly.10  _nightly.13  _nightly.4  _nightly.7  _weekly.0   _weekly.3
    [steve@biobos .snapshot]$ cd _nightly.0
    [steve@biobos _nightly.0]$ ls -l foo.tmp
    -rw-r--r--    1 steve    wheel           2 Mar 11  2005 foo.tmp
    [steve@biobos _nightly.0]$ cp foo.tmp /home/steve
    

    Features of the Biowulf Cluster

  • shared vs. distributed memory
  • dual-processor nodes (2 processes per node)
  • serial programs (swarms) vs. parallel programs (message passing)
  • batch system
  • Linux (Unix)

  • What kind of sequence analysis projects are suitable for Biowulf?

  • Large-scale, i.e. hundreds or thousands of sequences.
  • Parallel programs (e.g. Genehunter or HMMER)

    Not Suitable:

  • 50 Blast jobs every month
  • Interactive programs (e.g. PDraw)
  • Series of serial programs on a small number of sequences.


    Installed Sequence Analysis software

  • BLAST
  • BLAT
  • EMBOSS
  • Wu-Blast
  • Fasta
  • HMMER
  • PfSearch
  • Repeatmasker

    Phylogenetic/Linkage Analysis

  • Allegro
  • Solar
  • Simwalk
  • Parallel Genehunter
  • Tree-Puzzle
  • FastDNAml
  • Merlin
  • FastSlink


    Blast on Biowulf

    Details, instructions and examples are at biowulf.nih.gov/blast.html.
      Demo: How to run a large-scale Blast job on Biowulf

    1. Transfer your sequence files to Biowulf.
      On a PC: use the ftp program (included with all versions of Windows).
      On a Mac: use the Fetch program (downloadable from NIH Pubnet).
      On a Unix machine: use ftp (comes with all flavors of Unix).

      Open a connection to biowulf.nih.gov (or helix.nih.gov)
      Change directory to /data/username/
      Put your files into that directory.

    2. Use EasyBlast to submit your jobs to the cluster. Sample session (user responses follow each prompt):
      biobos% easyblast
      
      EasyBlast: Blast for large numbers of sequences
      Enter the directory which contains your input sequences: /data/username/blast/myseqs
      
      Enter the directory where you want your Blast output to go: /data/username/blast/results
      ** WARNING: There are already files in /data/username/blast/results which will be deleted by this job.
      ** Continue? (y/n) :y
      
      BLAST programs:
          blastn - nucleotide query sequence against nucleotide database
          blastp - protein query sequence against protein database
          blastx - nucleotide query translated in all 6 reading frames 
                against a protein database
          tblastn - protein query sequence against a nucleotide database
                translated in all 6 reading frames
          tblastx - 6-frame translations of a nucleotide query sequence 
                against the 6-frame translations of a nucleotide database
          blastpgp - PSI-BLAST protein query against protein database
      Which program do you want to run: blastn
      
      The following nucleotide databases are available:
      (or enter your own database with full pathname)
          nt - all nonredundant Genbank+EMBL+DDBJ+PDB (no EST, STS, GSS or HTG)
          hs_genome - human genome assembly (Build 33, 14 Apr 2003)
          est_human - nonredundant Genbank+EMBL+DDBJ EST human sequences
          est_mouse - nonredundant Genbank+EMBL+DDBJ EST mouse sequences
          est_others - nonredundant Genbank+EMBL+DDBJ EST all other organisms
          patnt - from the patent division of Genbank
          pdbnt - nucleotide sequences from 3-dimensional structure (PDB) entries
          htgs - high throughput genome sequences
          ecoli.nt - ecoli genomic sequences
          mito.nt - mitochondrial sequences
          yeast.nt - yeast (Saccharomyces cerevisiae) genomic sequences
          drosoph.nt - drosophila sequences
          hs.fna - RefSeq human sequences
          other_genomic - non-human genomic sequences
          mouse_genome - mouse genome
          mouse_masked - mouse genome, masked
      Database to run against: nt
      
       Want a summary file in the output directory? (y/n, default y) : n
      
      http://biowulf.nih.gov/blast.html has a full list of available parameters.
      Any additional Blast parameters (e.g. -v 10): 
      Checking node situation....
      
      Submitting to 20 nodes. Job number is 709061.biobos
      
      Monitor your job at http://biowulf.nih.gov/cgi-bin/queuemon?709061.biobos
      
      
    Monitoring your job

    Use the URL that EasyBlast gives you to watch your job run. You can see the full range of Biowulf monitors at http://biowulf.nih.gov/sysmon/

    Blue: no load
    Green: load ~ 1
    Yellow: load ~ 2 (i.e. fully utilized)
    Red: load > 2 (problem?)

    What to expect:

  • nodes will start off blue, go to green and then yellow.
  • watch for nodes that stay red.
  • Ideally, all nodes will finish at about the same time. If one node continues running for much longer (e.g. 30 mins) than the others, see if you can 'balance' your job better the next time.
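
    Besides the web monitor, you can also check on a job from the command line with the standard PBS qstat command (a sketch; 709061.biobos is the job number from the example session above):

    # list all of your queued and running jobs
    qstat -u username

    # show the status of a single job
    qstat 709061.biobos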

    Contact the helix staff (staff@helix.nih.gov or 4-6248) if you have questions or concerns about your job.

    What EasyBlast does

    1. Checks that your input sequences exist and that at least one file contains a Fasta sequence.
    2. Checks whether the output directory exists, and warns you if it already contains files that may be overwritten.
    3. Lists the available Blast programs for you to choose.
    4. Lists the appropriate (nucleotide or protein) databases for your choice.
    5. Calculates the node memory required (i.e. the 'type' of node) based on the size of the database.
    6. Writes an environment file with the name of database, name of program, input and output directories, and other parameters that the actual Blast job will use.
    7. Checks the current available node situation, and submits to as many nodes as possible (max 32).
    8. Writes a summary file in the output directory containing just the 'hits' from each Blast output.

    Note: To run against your own database, enter its full pathname, e.g.

    Database to run against: /data/username/blast_db/my_own_db
    
    where my_own_db is a Blast database formatted with the formatdb program (available at /usr/local/blast/formatdb).
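
    As an illustration, building a custom nucleotide database with formatdb might look like this (a minimal sketch: myseqs.fas is a placeholder for your own Fasta file; use -p T instead of -p F for a protein database):

    # build a nucleotide (-p F) Blast database named my_own_db from a Fasta file
    cd /data/username/blast_db
    /usr/local/blast/formatdb -i myseqs.fas -p F -n my_own_db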

    Benchmarks

    Blastn
    Query: 1000 nucleotide EST sequences
    Database: NCBI nt nucleotide database (1,431,631 sequences, 1.8 Gb)

    Serial Blast runs with GCG's Netblast against the NCBI server: ~18 hrs
    Silicon Graphics R14000, 4 processors (nimbus.nih.gov): 5.5 hrs
    10 Biowulf p2800 nodes: 33 mins


    Example 2: RepeatMasker

    Details, instructions and examples are at biowulf.nih.gov/repeatmasker.html.

    RepeatMasker is NOT parallelized on Biowulf. The advantage of using the cluster is that many independent RepeatMasker runs, each on its own batch of sequences, can execute simultaneously on different nodes.

    1. Set up a command file (named swarmcmd here) with one line for each command. If your sequences are in one large multi-Fasta file rather than in separate files, see the splitting sketch after these steps.
      repeatmasker NM_00110*
      repeatmasker NM_00111*
      repeatmasker NM_00112*
      repeatmasker NM_00113*
      repeatmasker NM_00114*
      repeatmasker NM_00115*
      repeatmasker NM_00116*
      repeatmasker NM_00117*
      repeatmasker NM_00118*
      repeatmasker NM_00119*
      
    2. Submit this job with:
      swarm -f swarmcmd
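
    If your input is a single multi-sequence Fasta file, a short awk command can split it into batches before you build the command file (a minimal sketch, assuming the input is named all_seqs.fa and that 100 sequences per batch is a reasonable size; adjust as needed):

      # write every group of 100 sequences to its own file: batch_0.fa, batch_1.fa, ...
      awk '/^>/ { n++ } { print > ("batch_" int((n-1)/100) ".fa") }' all_seqs.fa

      # then generate one repeatmasker command per batch
      ls batch_*.fa | awk '{ print "repeatmasker", $1 }' > swarmcmd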
      


    Example 3: Setting up your own programs for large-scale analysis

    Swarm

    1. Set up a command file with one line for each program you want to run; the sketch after step 2 shows one way to generate such a file automatically.
    #
    # this file is cmdfile
    #
    myprog -param a < infile-a > outfile-a
    myprog -param b < infile-b > outfile-b
    myprog -param c < infile-c > outfile-c
    myprog -param d < infile-d > outfile-d
    myprog -param e < infile-e > outfile-e
    myprog -param f < infile-f > outfile-f
    myprog -param g < infile-g > outfile-g
    

    2. Submit the job via the 'swarm' command.

    swarm -f cmdfile
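
    One way to generate a command file like this without typing each line is a short shell loop (a minimal sketch; myprog, the -param option and the /data/username/seqs directory are placeholders for your own program, options and data):

    # one command line per input file; the output name is derived from the input name
    for f in /data/username/seqs/*.fa; do
        echo "myprog -param x < $f > ${f%.fa}.out"
    done > cmdfile

    # inspect the first few lines before submitting with swarm
    head cmdfile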
    


    Example 4: Linkage Analysis using Parallel Genehunter

    Create a Biowulf batch script.
    #!/bin/bash -v
    # This file name is my_script
    #
    #PBS -N run1
    #PBS -m be
    #PBS -k oe
    
    # run from the directory the job was submitted from
    cd $PBS_O_WORKDIR

    PATH=/usr/local/mpich/bin:$PATH; export PATH

    # one MPI process per processor listed in the PBS node file
    np=$(wc -l < $PBS_NODEFILE)

    mpirun -machinefile $PBS_NODEFILE -np $np ght < test.in
    
    Submit this script to the batch system using the 'qsub' command:
    qsub -l nodes=1 my_script
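
    To use more processors, simply request more nodes at submission time; with the np line above, the number of MPI processes follows the allocation automatically (a sketch):

    qsub -l nodes=8 my_script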
    
    You can test the job interactively:
    biobos% qsub -I -l nodes=1
    qsub: waiting for job 664776.biobos to start
    qsub: job 664776.biobos ready
    
    [user@p2 ~]$ cd /data/username/mydir
    [user@p2 mydir]$ export PATH=/usr/local/mpich/bin:$PATH
    [user@p2 mydir]$ mpirun -machinefile $PBS_NODEFILE -np 1 /usr/local/bin/ght < test.in
    
    
    ************************************************************************
    *                                                                      *
    *        GENEHUNTER-TWOLOCUS - A modified version of GENEHUNTER        *
    *                             (version 1.3)                            *
    *                                                                      *
    ************************************************************************
    
    Type 'help' or '?' for help.
    Can't find help file - detailed help information is not available.
    See installation instructions for details.
    
     running on 1 nodes
    
    npl:1> 'photo' is on: file is 'two02.out'
    
    npl:2> Fri Nov 25 13:54:28 2005
    
    npl:3> Single point mode is now 'off'
    
    npl:4> Count recs is now 'off'
    
    npl:5> Haplotype output is now 'off'
    
    npl:6> Unaffected children are now used.
    
    npl:7> Currently analyzing a maximum of 9 bits per pedigree
    
    npl:8> Large pedigrees are now used but trimmed.
    
    npl:9> The current analysis type is 'BOTH'
     [...]
    
    [user@p2 mydir]$ exit
    logout
    
    qsub: job 668223.biobos completed
    [susanc@biobos ~]$
    

    You must exit the node after an interactive run!


    Databases

    A list of all sequence databases and their current status is at http://molbio.info.nih.gov/helixdb.php. Please contact the helix staff (staff@helix.nih.gov) if you have questions about a particular database.

    URLs

  • The Biowulf Home Page: biowulf.nih.gov
  • Biowulf Cluster/Job Monitors: biowulf.nih.gov/sysmon
  • Getting a Helix Account: helix.nih.gov/new_users/accounts.html
  • Registering for a Biowulf account: helix.nih.gov/register/biowulf.html
  • Research on Biowulf: helix.nih.gov/research.html
  • Scientific Applications on Biowulf: biowulf.nih.gov/apps.html
    Contact Biowulf Staff at staff@biowulf.nih.gov