Scientific Supercomputing at the NIH

CAP3 and PCAP

CAP3 is a sequence assembly program for small-scale assembly of EST sequences with or without quality values. PCAP is for large-scale assembly of genomic sequences with quality values and with or without forward-reverse read pairs. CAP3 and PCAP were developed at Iowa State University (see the CAP3/PCAP website).

PCAP can handle a genome of 300 Mb on Helix (requires about 5 GB of memory and 22 GB of disk space), and a genome of 3 Gb on the Biowulf cluster. Any genome assembly project larger than 300 Mb should be run with PCAP on the Biowulf cluster. As a rule of thumb, assembling N base pairs requires about 15N bytes of memory and 75N bytes of disk space, which is consistent with the 300 Mb figures above. Please contact the Helix staff (staff@helix.nih.gov) if you have any questions about disk space, memory, or where to run the program.
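
The rule of thumb above can be turned into a quick estimate before choosing between Helix and Biowulf. The sketch below is only illustrative: it assumes the 15N and 75N figures are in bytes (this matches the 300 Mb example, which works out to about 4.5 GB of memory and 22.5 GB of disk) and uses decimal gigabytes. The function name is made up for this example and is not part of PCAP.

```python
def estimate_pcap_resources(genome_bp):
    """Return (memory_gb, disk_gb) estimated for assembling genome_bp base pairs.

    Assumption: the 15N / 75N rule of thumb is expressed in bytes.
    """
    bytes_per_bp_memory = 15   # rule-of-thumb memory cost per base pair
    bytes_per_bp_disk = 75     # rule-of-thumb disk cost per base pair
    gb = 1e9                   # decimal gigabyte
    memory_gb = genome_bp * bytes_per_bp_memory / gb
    disk_gb = genome_bp * bytes_per_bp_disk / gb
    return memory_gb, disk_gb


if __name__ == "__main__":
    for label, bp in [("300 Mb", 300_000_000), ("3 Gb", 3_000_000_000)]:
        mem, disk = estimate_pcap_resources(bp)
        print(f"{label}: ~{mem:.1f} GB memory, ~{disk:.1f} GB disk")
```

These numbers are rough planning figures only; actual usage depends on the data set.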

For a detailed description of the assembly protocol, see "Generating a Genome Assembly with PCAP" by X. Huang and S-P. Yang, in Current Protocols in Bioinformatics (2005), available online through the NIH Library.

Version

Type '/usr/local/cap3/cap3' or '/usr/local/pcap/pcap' with no parameters. The version date will be displayed on the terminal, along with a brief description of usage.

Usage

CAP3

Usage: ./cap3 File_of_reads [options]

File_of_reads is a file of DNA reads in FASTA format.
If the file of reads is named 'xyz', then the file of quality values must be named 'xyz.qual', and the file of constraints named 'xyz.con'.

Options (default values):
  -a  N  specify band expansion size N > 10 (20)
  -b  N  specify base quality cutoff for differences N > 15 (20)
  -c  N  specify base quality cutoff for clipping N > 5 (12)
  -d  N  specify max qscore sum at differences N > 20 (200)
  -e  N  specify clearance between no. of diff N > 10 (30)
  -f  N  specify max gap length in any overlap N > 1 (20)
  -g  N  specify gap penalty factor N > 0 (6)
  -h  N  specify max overhang percent length N > 2 (20)
  -i  N  specify segment pair score cutoff N > 20 (40)
  -j  N  specify chain score cutoff N > 30 (80)
  -k  N  specify end clipping flag N >= 0 (1)
  -m  N  specify match score factor N > 0 (2)
  -n  N  specify mismatch score factor N < 0 (-5)
  -o  N  specify overlap length cutoff > 15 (40)
  -p  N  specify overlap percent identity cutoff N > 65 (90)
  -r  N  specify reverse orientation value N >= 0 (1)
  -s  N  specify overlap similarity score cutoff N > 250 (900)
  -t  N  specify max number of word matches N > 30 (300)
  -u  N  specify min number of constraints for correction N > 0 (3)
  -v  N  specify min number of constraints for linking N > 0 (2)
  -w  N  specify file name for clipping information (none)
  -x  N  specify prefix string for output file names (cap)
  -y  N  specify clipping range N > 5 (100)
  -z  N  specify min no. of good reads at clip pos N > 0 (3)
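
The naming rule for the CAP3 input files ('xyz', 'xyz.qual', 'xyz.con') is easy to get wrong, so a small check before starting a run can help. The following is a minimal sketch, not part of CAP3: the function names are hypothetical, and it only derives the companion file names from the reads file name and reports whether each one is present. Because CAP3 can run with or without quality values, a missing file is reported rather than treated as an error.

```python
from pathlib import Path


def cap3_companion_files(reads_file):
    """Derive the companion file names CAP3 looks for, given the reads file name."""
    reads = str(reads_file)
    return {
        "quality": Path(reads + ".qual"),     # file of quality values
        "constraints": Path(reads + ".con"),  # file of constraints
    }


def report_cap3_inputs(reads_file):
    """Return one status line per companion file (found or missing)."""
    lines = []
    for kind, path in cap3_companion_files(reads_file).items():
        status = "found" if path.exists() else "missing"
        lines.append(f"{kind}: {path} ({status})")
    return lines


if __name__ == "__main__":
    # 'xyz' is the placeholder name used in the usage text above.
    for line in report_cap3_inputs("xyz"):
        print(line)
```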

PCAP

The 'autopcap' script will run a sequence of PCAP programs with default parameters.

Usage: ./pcap File_of_file_names [options]

File_of_file_names is a file of names of read files.
If File_of_file_names is named 'xyz', then the file of constraints must be named 'xyz.con'.

Options (default values):
  -a  N  specify band expansion size N > 10 (15)
  -c  N  specify base quality cutoff for clipping N > 5 (10)
  -e  N  specify segment pair score cutoff N > 30 (40)
  -f  N  specify chain score cutoff N > 60 (80)
  -g  N  specify gap penalty factor N > 0 (6)
  -i  N  specify max length of a read end to clip N > 50 (400)
  -j  N  specify max sum of quality values to clip N > 1000 (3500)
  -k  N  specify max sum of qv outside similarity N > 100 (400)
  -l  N  specify min depth of coverage for repeats N > 20 (75)
  -m  N  specify match score factor N > 0 (2)
  -n  N  specify mismatch score factor N < 0 (-5)
  -o  N  specify overlap length cutoff > 20 (30)
  -r  N  specify directory name for base/quality files (null)
         Note: If base/quality files are in the current directory,
         then the -r option must not appear on the command line.
  -s  N  specify overlap similarity score cutoff N > 100 (1000)
  -t  N  specify number of segment pairs cutoff N > 10 (150)
  -w  N  specify number of words cutoff N > 20 (500)
  -x  N  specify prefix string for output file names (pcap)
  -y  N  specify number of processors N > 0 (1)
  -z  N  specify processor id N >= 0 (0)
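
The -y (number of processors) and -z (processor id) options suggest that the pcap step can be split into several jobs, each identified by its own id. The sketch below only builds the command lines and does not run anything. It assumes that one pcap job is started per id from 0 to N-1, which is how the option descriptions read but is not stated explicitly in the usage text; the helper name is hypothetical, and the program path is the one given in the Version section. For routine use, 'autopcap' takes care of this sequencing itself.

```python
def pcap_job_commands(fofn, num_jobs, pcap="/usr/local/pcap/pcap"):
    """Build one pcap argument list per processor id (assumed ids: 0 .. num_jobs-1)."""
    if num_jobs < 1:
        raise ValueError("num_jobs must be > 0 (see the -y option)")
    return [
        [pcap, fofn, "-y", str(num_jobs), "-z", str(job_id)]
        for job_id in range(num_jobs)
    ]


if __name__ == "__main__":
    # 'fofn' is the file of file names used in the example data.
    for cmd in pcap_job_commands("fofn", 2):
        print(" ".join(cmd))
```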

Sample session with the PCAP example data

helix% ls
fofn      others.fasta.screen.gz       plasmid.fasta.screen.gz
fofn.con  others.fasta.screen.qual.gz  plasmid.fasta.screen.qual.gz

helix% cat fofn
plasmid.fasta.screen
others.fasta.screen

helix% /usr/local/pcap/autopcap fofn -y 2 
Stringent qual diff score cutoff:  -d 130
Min depth of coverage for repeats: -l 75
Amount of available memory in GB:  -m 1
Running pcap jobs in parallel:     -p 1
Adjusted overlap score cutoff:     -s 4500
Overlap percent identity cutoff:   -t 92
Number of pcap jobs:               -y 2
ProcessOverlaps: lowid 0 and highid 1035 
Number of bdocs jobs must be set to 1 
ProcessOverlaps: Space allocated
ProcessOverlaps: depth of overlaps
ReadLenAndNameSpace is done
NameQualCalClip is done
ProcessOverlaps is done
ReadConstraints is done
The autopcap job is completed.

helix% ls
contigs.bases                     fofn.pcap.contigs1.gz     fofn.pcap.scaffold.new1
contigs.quals                     fofn.pcap.contigs1.links  fofn.pcap.scaffold0
fofn                              fofn.pcap.contigs1.qual   fofn.pcap.scaffold0.ace
fofn.con                          fofn.pcap.contigs1.snp    fofn.pcap.scaffold1
fofn.con.pcap.results             fofn.pcap.docs.info0      fofn.pcap.scaffold1.ace
fofn.con.pcap.results.bpair.info  fofn.pcap.docs0.gz        fofn.pcap.singleton0.ace
fofn.con.pcap.sort                fofn.pcap.goodoverlap0    fofn.pcap.singleton1.ace
fofn.con.pcap.sort.stat           fofn.pcap.goodoverlap1    fofn.pcap.singlets
fofn.pcap.bform.info              fofn.pcap.info0           fofn.pcap.super0
fofn.pcap.cap3out0                fofn.pcap.info1           fofn.pcap.super1
fofn.pcap.cap3out1                fofn.pcap.joins1          fofn.pcap.unused0
fofn.pcap.clean.info              fofn.pcap.joins2          fofn.pcap.unused1
fofn.pcap.clustersize             fofn.pcap.multiple0       others.fasta.screen.gz
fofn.pcap.consen.info0            fofn.pcap.multiple1       others.fasta.screen.qual.gz
fofn.pcap.consen.info1            fofn.pcap.n50             plasmid.fasta.screen.gz
fofn.pcap.consen.pros0            fofn.pcap.overlap0.gz     plasmid.fasta.screen.qual.gz
fofn.pcap.consen.pros1            fofn.pcap.overlap1.gz     readpairs.contigs
fofn.pcap.contigs0.gz             fofn.pcap.repeat0.gz      readpairs.reads
fofn.pcap.contigs0.links          fofn.pcap.repeat1.gz      reads.placed
fofn.pcap.contigs0.qual           fofn.pcap.scaffold.info   reads.unplaced
fofn.pcap.contigs0.snp            fofn.pcap.scaffold.new0   supercontigs

Documentation