Biowulf at the NIH
Genome Assembly with PCAP on Biowulf

The PCAP program is intended for large-scale assembly of genomic sequences with quality values and with or without forward-reverse read pairs. Assemblies of 300 Mb can be performed on Helix. For larger assemblies, you can take advantage of multiple processors on Biowulf for some parts of the process, and the large memory of the Biowulf fat node (Firebolt) for the other parts.

A version of PCAP to assemble 454 data is also available in /usr/local/pcap-454/. See the README and Doc files in that directory for more information.

Please read the detailed description of PCAP in Generating a Genome Assembly with PCAP, by X. Huang and S-P Yang, in Current Protocols of Bioinformatics (2005). (available online through the NIH library).

The PCAP package consists of several programs:

Note that the 'autopcap' script which runs these programs in sequence is suitable for small genomes of up to 100 Mb, and can only be used on Helix.

The PCAP package is installed in /usr/local/pcap. The PCAP scripts have been customized to run on the Biowulf cluster.

Sample genome assembly on Biowulf

  1. Create a directory with the pairs of base and quality files (no file should be > 30 Mb), and the constraint file. e.g.
    [susanc@biowulf bacs]$ ls -l 
    -rw-r--r--    1 susanc   staff     1000343 Oct  9 11:38 465083.fasta.screen.gz
    -rw-r--r--    1 susanc   staff     1898352 Oct  9 11:38 465083.fasta.screen.qual.gz
    -rw-r--r--    1 susanc   staff     1602411 Oct  9 11:38 593272.fasta.screen.gz
    -rw-r--r--    1 susanc   staff     3036986 Oct  9 11:38 593272.fasta.screen.qual.gz
    -rw-r--r--    1 susanc   staff     1263376 Oct  9 11:38 600478.fasta.screen.gz
    -rw-r--r--    1 susanc   staff     2412922 Oct  9 11:38 600478.fasta.screen.qual.gz
    -rw-r--r--    1 susanc   staff     1700833 Oct  9 11:38 941368.fasta.screen.gz
    -rw-r--r--    1 susanc   staff     3359995 Oct  9 11:38 941368.fasta.screen.qual.gz
    -rw-r--r--    1 susanc   staff          80 Oct  9 11:38 Bacs
    -rw-r--r--    1 susanc   staff      621398 Oct  9 11:38 Bacs.con
    
    [susanc@biowulf bacs]$ cat Bacs
    465083.fasta.screen
    593272.fasta.screen
    600478.fasta.screen
    941368.fasta.screen
    

    The files *.screen.gz contain the raw data, *.screen.qual.gz contains the quality values, Bacs contains the names of the read files, and Bacs.con contains the constraint information.

  2. sublapjobs - creates and runs a swarm of independent jobs, each of which determines a unique subset of reads and computes overlaps.
    [susanc@biowulf bacs]$ /usr/local/pcap/sublapjobs Bacs -y 10
    
    PCAP code directory path:          -c /usr/local/pcap
    Stringent qual diff score cutoff:  -d 130
    Min depth of coverage for repeats: -l 75
    Running pcap jobs in parallel:     -p 1
    Input data directory path:         -r /data3/c/susanc/pcap/bacs
    Adjusted overlap score cutoff:     -s 4500
    Overlap percent identity cutoff:   -t 92
    Number of pcap jobs:               -y 10
    
    Preparing batch jobs ....
    1608655.biobos
    1608656.biobos
    1608657.biobos
    To see if all the submitted jobs are completed,
    check the Bacs.pcap.info* files in this common input directory.
    

    'qstat -u user' will show the status of these swarm jobs, or use the user monitor. Check the pcap.info* files for successful completion.

  3. runtigcode - runs bdocs and bclean to remove low-quality overlaps, and bcontig to produce the layout of contigs and supercontigs. This process requires large memory and should therefore be run on Firebolt, the 'fat node' of the Biowulf cluster. For a project with N Mb of raw bases, 15N Mb of memory will be required. (e.g. a project of 3 Gb raw bases will require 45 GB of memory on Firebolt.) Firebolt jobs are limited to a maximum of 48 GB of memory; if your project requires more than this, please contact the Helix staff at staff@helix.nih.gov
    • Estimate the required memory according to the formula above. This will be required for job submission to Firebolt.
    • Set up a batch submission script as follows:
      ----- file runtig.bat ---------------------------
      #!/bin/csh
      #PBS -N Pcap
      #PBS -m be
      
      cd /data/user/mypcapdir
      /usr/local/pcap/runtigcode Bacs -y 10  > runtig.log
      
    • Submit this job to the batch system with
      qsub -l nodes=1:altix,mem=##gb runtig.bat
      
    • You should receive email when the job starts and completes. Check the scaffold.info file for successful completion.

      If your job requires less than 8 GB of memory, it can be submitted to the m8192 nodes or the m4096 nodes with the command

      qsub -l nodes=1:m8192 runtig.bat
      
  4. subsenjobs - creates a swarm of jobs that run bconsen.
    [susanc@biowulf bacs]$ /usr/local/pcap/subsenjobs Bacs -y 10
    PCAP code directory path:          -c /usr/local/pcap
    Stringent qual diff score cutoff:  -d 130
    Min depth of coverage for repeats: -l 75
    Running pcap jobs in parallel:     -p 1
    Input data directory path:         -r /data3/c/susanc/pcap/bacs
    Adjusted overlap score cutoff:     -s 4500
    Overlap percent identity cutoff:   -t 92
    Number of pcap jobs:               -y 10
    
    Preparing batch jobs ....
    1608662.biobos
    1608663.biobos
    1608664.biobos
    To see if all the submitted jobs are completed,
    check the Bacs.pcap.consen.pros* files in this common input directory.
    

    Check the consen.pros* files for successful completion.

  5. runstatcode requires large amounts of memory and should be run on Firebolt.
    • Set up a batch script along the following lines:
      ------ file runstat.bat -----
      #!/bin/csh
      #PBS -N Pcap
      #PBS -m be
      
      cd /data/user/mypcapdir
      /usr/local/pcap/runstatcode Bacs -y 10 > runstat.log
      
    • Submit this job to the batch system with
      qsub -l nodes=1:altix,mem=##gb runstat.bat
      
    • You should receive email when the job starts and completes. Check the runstat.log file for successful completion.
Documentation

Generating a Genome Assembly with PCAP. by X. Huang and S-P Yang, in Current Protocols of Bioinformatics (2005). (available online through the NIH library).

PCAP: A Whole-Genome Assembly Program. Huang, X., Wang, J., Aluru, S., Yang, S.-P. and Hillier, L. (2003) Genome Research, 13: 2164-2170.