Scientific Supercomputing at the NIH

Genome Alignment with SOAPaligner/SOAP2 on Helix

[SOAPalign] [SOAPsnp] [Sample Session] [Documentation]

SOAP, developed by R. Li and colleagues at the Beijing Genomics Institute, is a program for efficient gapped and ungapped alignment of short oligonucleotides onto reference sequences. It features in super fast and accurate alignment for huge amounts of short reads generated by Illumina/Solexa Genome Analyzer. It require only 2 minutes aligning one million single-end reads onto the human reference genome. SOAPaligner is that it now supports a wide range of the read length. For detailed information, see http://soap.genomics.org.cn/

SOAP currently contains 2 programs, will expend soon to 5 programs:

Version

Type 'soap' on commend line.

SOAP programs (2bwt-builder, soap, soapsnp) location

/usr/local/soap/

How To Use

SOAPaligner

To run SOAPaligner, we need to build index files for the reference genome, and then search reads against the formatted index files.

1. Format reference sequence:

/usr/local/soap/2bwt-builder <FastaPath/YourBackboneFasta>
eg: ./2bwt-builder ~/human_genome.fa

Then under the directory there will be 13 index files, all their prefixes are your_fasta file name with “.index” added, e.g. human_genome.fa.index. The suffixes include *.amb, *.ann, *.bwt, *.fmv, *.hot, *.lkt, *.pac, *.rev.bwt, *.rev.fmv, *.rev.lkt, *.rev.pac, *.sa, and *.sai.

2. Alignment quick start:

For alignment of single-end reads:

/usr/local/soap/soap –a <reads_a> -D <index.files> -o <output></output>

For paired-end reads:

/usr/local/soap/soap –a <reads_a> -b <reads_b> -D <index.files> -o <PE_output> -2 <SE_output> -m <min_insert_size> -x <max_insert_size>

NOTE: For the –D option, the program can only accept the prefix of your index files, such as “~/human_genome.fa.index”.

3. Options:

-D STR Prefix name for reference index [*.index].
-a STR Query file, for SE reads alignment or one end of PE reads
-b STR Query b file, one end of PE reads
-o STR Output file for alignment results
-2 STR Output file contains mapped but unpaired reads when do PE alignment
-u STR Output file for unmapped reads, [none]
-m INT Minimal insert size INT allowed for PE, [400]
-x INT Maximal insert size INT allowed for PE, [600]
-n INT Filter low quality reads contain more INT bp Ns, [5]
-t Output reads id instead reads name, [none]
-r INT How to report repeat hits, 0=none; 1=random one; 2=all, [1]
-A Report all the mismatches in SOAP format or not, [none] only report the mismatches in seed, and the number of mismatches in remain
-R RF alignment for long insert size(>= 2k bps) PE data, [none] FR alignment
-l INT For long reads with high error rate at 3'-end, those can't align whole length, then first align 5' INT bp subsequence as a seed, [256] use whole length of the read
-v INT Totally allowed mismatches in one read, [2]
-M INT Match mode for each read or the seed part of read, which shouldn't contain more than 2 mismaches, [4]
0: exact match only
1: 1 mismatch match only
2: 2 mismatch match only
3: [gap] (coming soon)
4: find the best hits
-p INT Multithreads, n threads, [1]

SOAPsnp

1. Required parameters:

-i <FILE> Input SORTED SOAP alignment result
Note that here we say “sorted’ means alignments of each chromosome are sorted first by chromosome name lexicographically and then by coordinates on each chromosome numerically.

-d <FILE> Reference DNA sequence in FASTA format

-o <FILE> Output consensus file

2. Optional parameters:(default in [ ])

 

-z <Char> ASCII character that stands for quality score==0 [@]
FASTQ files generated by Illumina base-calling pipeline use ‘@’ as 0, but some institutes use ‘!’ as 0.

-g <Double> Global error dependency coefficient, 0.0(complete dependent)~1.0(complete independent)[0.9]

-p <Double> PCR error dependency coefficient, 0.0(complete dependent)~1.0(complete independent)[0.5]
Sequencing errors are found slightly repeatable (once an error occur, additional errors also tend to occur) due to various reasons. Therefore, observations of sequencing errors are not complete independent.The main source of repeatable errors is believed to be PCR amplification in sequencing process. The proper values of the two parameters rely on wetlab process. Nonetheless, the default value generally work at most time.

-r <Double> novel altHOM prior probability [0.0005]

-e <Double> novel HET prior probability [0.0010]
The two are prior probabilities of homozygous SNPs (altHOM) and heterozygous SNPs (HET), which are
used in Bayes formula calculation. Note these are prior probabilities of a new (novel) SNP. They are
expected to be stringent. For different species, the two values should change if necessary.

-t set transition/transversion ratio to 2:1 in prior probability

-s <FILE> Pre-formatted known SNP information.
The file consist of a lot of lines like this one:
chr1 201979756 1 1 0 0.161 0 0 0.839 rs568
The columns from left to right are: name of chromosome, coordinate on the chromosome, whether the SNP has allele frequency information (1 is true, 0 is false), whether the SNP is validated by experiment (1 is true, 0 is false), whether the SNP is actually an indel (1 is true, 0 is false), frequency of A, frequency of C, frequency of T, frequency of G, SNP id. For known SNP sites that do not have allele frequency information, the frequency information can be arbitrarily determined as any positive values, which only imply what alleles have already been deposited in the database.

-2 specify this option will REFINE SNP calling using known SNP information [Off]

-a <Double> Validated HET prior, if no allele frequency known [0.1]

-b <Double> Validated altHOM prior, if no allele frequency known[0.05]

-j <Double> Unvalidated HET prior, if no allele frequency known [0.02]

-k <Double> Unvalidated altHOM rate, if no allele frequency known[0.01]
The parameters are related to using external SNP information to alter prior probabilities for SNP calling.
SOAPsnp will try using allele frequency information as prior probability in calling genotypes for each site.
If the allele frequency information is absent, it will use the above 4 parameters as prior probability.

-u Enable rank sum test (that check whether the two allele of a possible HET call have same sequencing quality) to give HET further penalty for better accuracy. [Off]

-n Enable binomial probability calculation (that check whether the two allele are observed equally)to give HET further penalty for better accuracy. [Off]

-m Enable monoploid calling mode, this will ensure all consensus as HOM and you probably should SPECIFY higher altHOM rate. [Off]

-q Only output potential SNPs. Useful in Text output mode. [Off]

-M <FILE> Output the quality calibration matrix; the matrix can be reused with -I if you rerun the program

-I <FILE> Input previous quality calibration matrix. It cannot be used simutaneously with -M

-L <short> maximum length of read [45]
Please note that once length of some reads exceeds the parameter will probably collapse the program.

-Q <short> maximum FASTQ quality score [40]

-F <int> Output format. 0: Text; 1: GLFv2; 2: GPFv2.[0]

-E <String> Extra headers EXCEPT CHROMOSOME FIELD specified in GLFv2 output. Format is "TypeName1:DataName1:TypeName2:DataName2"[]

-T <FILE> Only call consensus on regions specified in FILE. Format of this file is:
ChrName\tStart\tEnd
ChrName\tStart\tEnd

-h Display this help

Output Format

1.Text format

The result of SOAPsnp has 17 columns:
1) Chromosome ID
2) Coordinate on chromosome, start from 1
3) Reference genotype
4) Consensus genotype
5) Quality score of consensus genotype
6) Best base, average quality score of best base
7) Count of uniquely mapped best base
8) Count of all mapped best base
9) Second best bases, average quality score of second best base
10) Count of uniquely mapped second best base
11) Count of all mapped second best base
12) Sequencing depth of the site, rank sum test p_value
13) Average copy number of nearby region
14) Whether the site is a dbSNP.

2.GLFv2 and GPFv2
GLFv2 (Genome Likelihood Format v2) is a binary file format proposed by Prof. R. Durbin.

Sample Session on Helix

Soap sample files can be copied from:

/usr/local/soap/sample

The infile.fasta contains 4 million reads of 33-bp Solexa sequencing data looking like this:

>HWI-EAS3_2:6:1:860:735
GTTTCCGTAGTGTAGTGGTTATTCTTATTCCGT
>HWI-EAS3_2:6:1:446:8
GGTAGATCTGATGTCTGGTGAGTCGTATGCCGT
>HWI-EAS3_2:6:1:29:559
GCATAAGATTAGAAGGTCGTATGCCGTCTTCTG
..........
..........
>HWI-EAS9_2:6:1:454:45
GTACAGTACTGTGATAACTGATCGTATGCCGTC
>HWI-EAS9_2:6:1:181:407
GTACAGTACTGTGATAACTGAATCGTATGCCGT

The backbonefile.fasta is a 250 million bp genomic sequence in fasta format as well.

The following example formats the reference sequence, align reads to reference sequence, then run SOAPsnp to assembly consensus sequence before identify SNPs.

1. Format reference sequence:

<helix> 229% cd /data/userID/soap_run_1
<helix> 230% /usr/local/soap/2bwt-builder backbonefile.fasta

Parsing FASTA file..
Finished. Parsed 1 sequences.
Elapsed time = 13.85 s

Building Look-Up..
Finished.
Elapsed time = 43.20 s

Building BWT..
Finished constructing BWT in 97 iterations.
Elapsed time = 108.90 s

Saving BWT..
Finished saving BWT.
Elapsed time = 1.06 s

Building Reversed BWT..
Finished constructing Reversed BWT in 97 iterations.
Elapsed time = 123.99 s

Saving BWT..
Finished saving BWT.
Elapsed time = 1.05 s

Loading BWT...
Finished loading BWT.
Elapsed time = 0.12 s

Building SA value...
Finished building SA value.
Elapsed time = 98.08 s

Building High-Occ Hash Table...
Finished.
Elapsed time = 54.51 s

Building SA index...
Finished building SA index.
Elapsed time = 6.43 s

Index building is completed.
Total elapsed time = 451.21 s

2. Alignment:

% /usr/local/soap/soap -a infile.fasta -D backbonefile.fasta.index -o out.soap

Begin Program SOAPaligner/soap2
Thu Apr 23 09:03:04 2009
Reference: jean_backbone_in.fasta.index
Query File a: jean_in.solexa.fasta
Output File: out.soap
Load Index Table ...
Load Index Table OK
Begin Alignment ...
131072 ok 6.43 sec
262144 ok 13.03 sec
393216 ok 19.64 sec
524288 ok 26.35 sec
655360 ok 33.02 sec
786432 ok 39.81 sec
917504 ok 46.57 sec
........
3670016 ok 191.44 sec
3801088 ok 198.58 sec
3932160 ok 205.65 sec
4063232 ok 212.68 sec
4150401 ok 217.38 sec
Total Reads: 4150401
Alignment: 5418 ( 0.13%)
Total Elapsed Time: 219.59
- Load Index Table: 2.15
- Alignment: 217.44

SOAPaligner/soap2 End
Thu Apr 23 09:06:44 2009

3. SOAPsnp

% /usr/local/soap/soapsnp -i out.soap -d backbonefile.fasta -o out.soapsnp
Reading Chromosome and dbSNP information Done.
Correction Matrix Done!
........

Documentation

http://soap.genomics.org.cn/