The PCAP program is intended for large-scale assembly of genomic sequences with quality values and with or without forward-reverse read pairs. Assemblies of 300 Mb can be performed on Helix. For larger assemblies, you can take advantage of multiple processors on Biowulf for some parts of the process, and the large memory of the Biowulf fat node (Firebolt) for the other parts.
A version of PCAP to assemble 454 data is also available in /usr/local/pcap-454/. See the README and Doc files in that directory for more information.
Please read the detailed description of PCAP in Generating a Genome Assembly with PCAP, by X. Huang and S-P Yang, in Current Protocols of Bioinformatics (2005). (available online through the NIH library).
The PCAP package consists of several programs:
- pcap - computes pairwise overlaps between reads.
- bdocs - uses these overlaps to calculate the coverage depths at each region of the genome.
- bclean - removes overlaps between reads with extremely high coverage depths.
- bcontig - builds the assembly layout and assembles contigs and supercontigs.
- bconsen - generates the consensus sequence of the contigs.
- Additional utility programs.
The PCAP package is installed in /usr/local/pcap. The PCAP scripts have been customized to run on the Biowulf cluster.
Sample genome assembly on Biowulf
- Create a directory with the pairs of base and quality files (no file should
be > 30 Mb), and the constraint file. e.g.
[susanc@biowulf bacs]$ ls -l -rw-r--r-- 1 susanc staff 1000343 Oct 9 11:38 465083.fasta.screen.gz -rw-r--r-- 1 susanc staff 1898352 Oct 9 11:38 465083.fasta.screen.qual.gz -rw-r--r-- 1 susanc staff 1602411 Oct 9 11:38 593272.fasta.screen.gz -rw-r--r-- 1 susanc staff 3036986 Oct 9 11:38 593272.fasta.screen.qual.gz -rw-r--r-- 1 susanc staff 1263376 Oct 9 11:38 600478.fasta.screen.gz -rw-r--r-- 1 susanc staff 2412922 Oct 9 11:38 600478.fasta.screen.qual.gz -rw-r--r-- 1 susanc staff 1700833 Oct 9 11:38 941368.fasta.screen.gz -rw-r--r-- 1 susanc staff 3359995 Oct 9 11:38 941368.fasta.screen.qual.gz -rw-r--r-- 1 susanc staff 80 Oct 9 11:38 Bacs -rw-r--r-- 1 susanc staff 621398 Oct 9 11:38 Bacs.con [susanc@biowulf bacs]$ cat Bacs 465083.fasta.screen 593272.fasta.screen 600478.fasta.screen 941368.fasta.screen
The files *.screen.gz contain the raw data, *.screen.qual.gz contains the quality values, Bacs contains the names of the read files, and Bacs.con contains the constraint information.
- sublapjobs - creates and runs a swarm
of independent jobs, each of which determines a unique subset of reads and
computes overlaps.
[susanc@biowulf bacs]$ /usr/local/pcap/sublapjobs Bacs -y 10 PCAP code directory path: -c /usr/local/pcap Stringent qual diff score cutoff: -d 130 Min depth of coverage for repeats: -l 75 Running pcap jobs in parallel: -p 1 Input data directory path: -r /data3/c/susanc/pcap/bacs Adjusted overlap score cutoff: -s 4500 Overlap percent identity cutoff: -t 92 Number of pcap jobs: -y 10 Preparing batch jobs .... 1608655.biobos 1608656.biobos 1608657.biobos To see if all the submitted jobs are completed, check the Bacs.pcap.info* files in this common input directory.
'qstat -u user' will show the status of these swarm jobs, or use the user monitor. Check the pcap.info* files for successful completion.
- runtigcode - runs bdocs and bclean to remove low-quality overlaps,
and bcontig to produce the layout of contigs and supercontigs. This process
requires large memory and should therefore be run on Firebolt, the 'fat node'
of the Biowulf cluster. For a project with N Mb of raw bases, 15N Mb of memory
will be required. (e.g. a project of 3 Gb raw bases will require 45 GB of
memory on Firebolt.) Firebolt jobs are limited to a maximum of 48 GB of memory;
if your project requires more than this, please contact the Helix staff at
staff@helix.nih.gov
- Estimate the required memory according to the formula above. This will be required for job submission to Firebolt.
- Set up a batch submission script as follows:
----- file runtig.bat --------------------------- #!/bin/csh #PBS -N Pcap #PBS -m be cd /data/user/mypcapdir /usr/local/pcap/runtigcode Bacs -y 10 > runtig.log
- Submit this job to the batch system with
qsub -l nodes=1:altix,mem=##gb runtig.bat
- You should receive email when the job starts and completes. Check the
scaffold.info file for successful completion.
If your job requires less than 8 GB of memory, it can be submitted to the m8192 nodes or the m4096 nodes with the command
qsub -l nodes=1:m8192 runtig.bat
- subsenjobs - creates a swarm of jobs that run bconsen.
[susanc@biowulf bacs]$ /usr/local/pcap/subsenjobs Bacs -y 10 PCAP code directory path: -c /usr/local/pcap Stringent qual diff score cutoff: -d 130 Min depth of coverage for repeats: -l 75 Running pcap jobs in parallel: -p 1 Input data directory path: -r /data3/c/susanc/pcap/bacs Adjusted overlap score cutoff: -s 4500 Overlap percent identity cutoff: -t 92 Number of pcap jobs: -y 10 Preparing batch jobs .... 1608662.biobos 1608663.biobos 1608664.biobos To see if all the submitted jobs are completed, check the Bacs.pcap.consen.pros* files in this common input directory.
Check the consen.pros* files for successful completion.
- runstatcode requires large amounts of memory and should be run on
Firebolt.
- Set up a batch script along the following lines:
------ file runstat.bat ----- #!/bin/csh #PBS -N Pcap #PBS -m be cd /data/user/mypcapdir /usr/local/pcap/runstatcode Bacs -y 10 > runstat.log
- Submit this job to the batch system with
qsub -l nodes=1:altix,mem=##gb runstat.bat
- You should receive email when the job starts and completes. Check the runstat.log file for successful completion.
- Set up a batch script along the following lines:
Generating a Genome Assembly with PCAP. by X. Huang and S-P Yang, in Current Protocols of Bioinformatics (2005). (available online through the NIH library).
PCAP: A Whole-Genome Assembly Program. Huang, X., Wang, J., Aluru, S., Yang, S.-P. and Hillier, L. (2003) Genome Research, 13: 2164-2170.