ProSplign

ProSplign is a global alignment tool developed by Dr. Boris Kiryutin. It produces accurate spliced alignments and locates alignments of distantly related proteins with low similarity.

The ProSplign algorithm is an integral component of NCBI's Genome Annotation Pipeline (Gnomon), which has been used to annotate the genomes of many plant and animal species, including human, mouse, and cow. The pipeline was used by the Sea Urchin Genome Sequencing center for sequence analysis of the 814-megabase genome of the sea urchin Strongylocentrotus purpuratus, published in Science in 2006. Integrating ProSplign into the genome annotation pipeline significantly improved annotation quality over the methods previously available. Following this success, the method was used to annotate Tribolium castaneum (Nature, 2008), taurine cattle (Science, 2009), Acyrthosiphon pisum (PLoS Biology, 2010), Nasonia (Science, 2010), and many other genomes.

ProSplign is also a central part of the automatic pipeline for influenza virus genomes, an important part of the Influenza Genome Sequencing Project. Sponsored by the National Institutes of Health, the project is an international collaboration of critical importance to public health. It has already led to multiple discoveries about the recent evolution and pathogenesis of influenza, published in leading journals including Journal of Virology, PLoS Biology, and Nature.

ProSplign is a utility for computing alignments of proteins to genomic nucleotide sequences. These alignments can include eukaryotic splicing. At the heart of the program is a global alignment algorithm that specifically accounts for introns and splice signals. This algorithm is what makes ProSplign accurate in determining splice sites and tolerant of sequencing errors.

ProSplign uses BLAST hits to identify possible locations of genes and their duplications on genomic sequences and then to speed up the core dynamic programming.

This web site is a single-point source of information on ProSplign, the tool for computing protein-to-genomic alignments that include an effort to account for mRNA splicing. ProSplign was developed with the following goals in mind:

Accuracy in determining splice signals
Recognition of short exons and non-consensus splices where feasible
Ability to identify and separate multiple compartments typically representing gene copying events

ProSplign is used to compute transcript alignments as a part of the NCBI Genome Annotation Pipeline.

ProSplign can be used in several ways. There is no online version; you must download and install the console version, which is available for major platforms (it may also be available for platforms not listed, on request). Because ProSplign is part of the NCBI C++ Toolkit, you can also link to it from your own applications in a portable way. Finally, ProSplign is available as a plugin for the NCBI Genome Workbench.

Reference: ProSplign - Protein to Genomic Alignment Tool. B. Kiryutin, A. Souvorov, T. Tatusova. Manuscript in preparation

Binaries (updated 02/23/15)
Pre-built executables are available for Linux/i386 (64bit)

Sources
ProSplign was written for gene prediction at NCBI. No effort is made to maintain backward compatibility between versions.
ProSplign is included in the NCBI C++ Toolkit. For details on how to download, configure, and build the Toolkit, please consult the NCBI C++ Toolkit book.
You can browse the Toolkit's code through the LXR or Doxygen source browsers. Search for CProSplign C/C++ Symbol to go directly to ProSplign sources.

Using the console version

The console version of ProSplign can be launched in two modes: pairwise and batch. Pairwise mode is useful when you need to align a few sequences quickly and do not want to compute separate BLAST hits for them. Batch mode is the better choice for large-scale transcript alignment jobs, e.g. as part of a genome annotation process. To see the parameters, run "./prosplign -help". Most of the parameters exist for the internal NCBI gene prediction process.

In pairwise mode, put your protein query and nucleic acid subject sequences in two files (only the first sequence in each file will be aligned) and run the command line "./prosplign -full -nfa nuc.fa -pfa prot.fa -out aln.txt -fasn aln.asn". The nfa parameter names the file with the nucleic acid subject, and the pfa parameter names the file with the protein query. Text output is written to the file given by the out parameter, and ASN.1 output is written to the file given by the fasn parameter.
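When pairwise runs are launched from a script, it can help to assemble the argument list in one place. The following is a minimal Python sketch that uses only the flags quoted above; the function name pairwise_command and the exe parameter are hypothetical helpers for illustration, not part of ProSplign.

```python
import shlex


def pairwise_command(nuc_fasta, prot_fasta, text_out, asn_out, exe="./prosplign"):
    """Build the argument list for a pairwise run, using the flags quoted in the text.

    `exe` is an assumed path to the console binary; adjust it to your installation.
    """
    return [
        exe, "-full",
        "-nfa", nuc_fasta,   # nucleic acid subject
        "-pfa", prot_fasta,  # protein query
        "-out", text_out,    # text output
        "-fasn", asn_out,    # ASN.1 output
    ]


argv = pairwise_command("nuc.fa", "prot.fa", "aln.txt", "aln.asn")
command_line = shlex.join(argv)
print(command_line)
```

Passing the list to subprocess.run(argv, check=True) would execute the alignment; keeping the arguments as a list avoids shell quoting problems with unusual file names.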

Batch mode is organized in three steps.

First, run a BLAST program to generate 12-column, tab-separated output. Make sure the output is sorted by subject and then by query. For example:

formatdb -i subj.fa -p F -o T 
tblastn -i query.fa -d subj.fa -m 8 | sort -k 2,2 -k 1,1 > test.hit

resulting in:

gi|6679997|ref|NP_032143.1|     gi|37544107|ref|NT_010783.14|Hs17_10940 48.98   98      46      3       97      190     20647295  20647002        1e-23   76.3
gi|6679997|ref|NP_032143.1|     gi|37544107|ref|NT_010783.14|Hs17_10940 55.00   40      17      1       58      96      20647507  20647388        1e-23   50.4
gi|6679997|ref|NP_032143.1|     gi|37544107|ref|NT_010783.14|Hs17_10940 67.65   68      21      1       149     216     20646883  20646683        8e-22   100
gi|6679997|ref|NP_032143.1|     gi|37544107|ref|NT_010783.14|Hs17_10940 66.18   68      22      1       149     216     20624596  20624396        3e-20   94.7
gi|6679997|ref|NP_032143.1|     gi|37544107|ref|NT_010783.14|Hs17_10940 66.18   68      22      1       149     216     20601700  20601500        3e-20   94.7
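Before passing a hit file on to the next step, it can be worth checking that it parses as 12 columns and is ordered by subject and then query. The sketch below is one way to do that in Python; the column names follow the standard BLAST tabular layout, while the Hit type and the function names are illustrative assumptions rather than anything ProSplign provides.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    query: str
    subject: str
    pct_identity: float
    length: int
    mismatches: int
    gap_opens: int
    q_start: int
    q_end: int
    s_start: int
    s_end: int
    evalue: float
    bit_score: float


def parse_hits(text: str) -> List[Hit]:
    """Parse 12-column tabular BLAST output; blank and '#' lines are skipped."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        f = line.split()
        if len(f) != 12:
            raise ValueError(f"expected 12 columns, got {len(f)}: {line!r}")
        hits.append(Hit(f[0], f[1], float(f[2]), *map(int, f[3:10]),
                        float(f[10]), float(f[11])))
    return hits


def is_sorted_by_subject_then_query(hits: List[Hit]) -> bool:
    """True if hits are ordered by (subject, query), compared by code point."""
    keys = [(h.subject, h.query) for h in hits]
    return keys == sorted(keys)


sample = """\
gi|6679997|ref|NP_032143.1| gi|37544107|ref|NT_010783.14|Hs17_10940 48.98 98 46 3 97 190 20647295 20647002 1e-23 76.3
gi|6679997|ref|NP_032143.1| gi|37544107|ref|NT_010783.14|Hs17_10940 55.00 40 17 1 58 96 20647507 20647388 1e-23 50.4
"""
hits = parse_hits(sample)
# A subject start greater than the subject end indicates a minus-strand hit.
strands = ["-" if h.s_start > h.s_end else "+" for h in hits]
print(len(hits), strands, is_sorted_by_subject_then_query(hits))
```

Note that the shell sort command orders lines according to the current locale, so its result can differ from Python's code-point ordering for identifiers with mixed case or punctuation.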

Next, run the compart tool to find approximate locations of the protein instances on the nucleic acid (./compart -f test.hit -add 10000 > comp, where test.hit is the sorted hit file from the previous step). Each line of the output file represents a single instance, or 'compartment':

1       NT_010783.14    NP_032143.1     20591500        20606200        -
2       NT_010783.14    NP_032143.1     20606202        20617641        -
3       NT_010783.14    NP_032143.1     20617643        20632343        -
4       NT_010783.14    NP_032143.1     20632345        20643487        -
5       NT_010783.14    NP_032143.1     20643489        20657875        -
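The compartment listing is easy to load for inspection. The Python sketch below reads the six columns as compartment number, nucleotide ID, protein ID, start, end, and strand; that interpretation is inferred from the example above rather than from a format specification, and the Compartment type is a hypothetical helper.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Compartment:
    number: int
    nucleotide_id: str
    protein_id: str
    start: int
    end: int
    strand: str

    @property
    def span(self) -> int:
        """Number of genomic positions covered, both ends inclusive."""
        return self.end - self.start + 1


def parse_compartments(text: str) -> List[Compartment]:
    """Parse whitespace-separated compartment lines (column meaning assumed)."""
    result = []
    for line in text.splitlines():
        f = line.split()
        if not f:
            continue
        if len(f) != 6:
            raise ValueError(f"expected 6 columns, got {len(f)}: {line!r}")
        result.append(Compartment(int(f[0]), f[1], f[2], int(f[3]), int(f[4]), f[5]))
    return result


sample = """\
1 NT_010783.14 NP_032143.1 20591500 20606200 -
2 NT_010783.14 NP_032143.1 20606202 20617641 -
"""
comps = parse_compartments(sample)
# Consecutive compartments in the example do not overlap.
disjoint = all(a.end < b.start for a, b in zip(comps, comps[1:]))
print([c.span for c in comps], disjoint)
```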

Finally, run ProSplign with the compartment file and the FASTA files to generate an alignment for each compartment (./prosplign -two_stages -pfa p.fa -nfa n.fa -f comp -inf pro.inf -out pro.out). The .inf file is designed for further computation:

1       gi|37544107|ref|NT_010783.14|Hs17_10940 gi|6679997|ref|NP_032143.1|     20591500        20606200        -
1       TCC     20602708        20602550        GT      -       id:     60%     pos:    71%
1       AG      20602325        20602206        GT      -       id:     47%     pos:    75%
1       AG      20602112        20601948        GT      -       id:     61%     pos:    76%
1       AG      20601694        20601506        GGC     -       id:     67%     pos:    78%     total:  id:     60%     pos:    75%frame:  0
start - stop - frameshifts - stop_inside_exon - number_of_pieces 1
2       gi|37544107|ref|NT_010783.14|Hs17_10940 gi|6679997|ref|NP_032143.1|     20606202        20617641        -
1       TCC     20610888        20610730        GT      -       id:     67%     pos:    77%
1       AG      20610519        20610400        GT      -       id:     47%     pos:    72%
1       AG      20610307        20610143        GT      -       id:     67%     pos:    76%
1       AG      20609889        20609701        GGC     -       id:     64%     pos:    76%     total:  id:     62%     pos:    75%frame:  0
start - stop - frameshifts - stop_inside_exon - number_of_pieces 1
3       gi|37544107|ref|NT_010783.14|Hs17_10940 gi|6679997|ref|NP_032143.1|     20617643        20632343        -
1       TCC     20625604        20625446        GT      -       id:     60%     pos:    71%
1       AG      20625221        20625102        GT      -       id:     47%     pos:    75%
1       AG      20625008        20624844        GT      -       id:     61%     pos:    76%
1       AG      20624590        20624402        GGC     -       id:     67%     pos:    78%     total:  id:     60%     pos:    75%frame:  0
start - stop - frameshifts - stop_inside_exon - number_of_pieces 1
4       gi|37544107|ref|NT_010783.14|Hs17_10940 gi|6679997|ref|NP_032143.1|     20632345        20643487        -
1       TCC     20640293        20640131        GC      -       id:     58%     pos:    68%
1       AG      20639906        20639791        GT      -       id:     43%     pos:    74%
1       AG      20639697        20639533        GT      -       id:     60%     pos:    72%
1       AG      20639279        20639091        GGC     -       id:     64%     pos:    75%     total:  id:     58%     pos:    72%frame:  0
start - stop - frameshifts - stop_inside_exon - number_of_pieces 1
5       gi|37544107|ref|NT_010783.14|Hs17_10940 gi|6679997|ref|NP_032143.1|     20643489        20657875        -
1       TCC     20647875        20647717        GT      -       id:     69%     pos:    75%
1       AG      20647507        20647388        GT      -       id:     55%     pos:    77%
1       AG      20647295        20647131        GT      -       id:     69%     pos:    80%
1       AG      20646877        20646689        GGC     -       id:     68%     pos:    79%     total:  id:     66%     pos:    78%frame:  0
start - stop - frameshifts - stop_inside_exon - number_of_pieces 1

The .out file is designed for human reading.
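Since the .inf file is meant for further computation, a small reader is a natural first step. The Python sketch below extracts the exon lines of each compartment from text shaped like the listing above. The field interpretation (flanking bases, exon start and stop, strand, percent identity, percent positives) is an assumption read off that example, and the Exon type and parse_inf function are hypothetical names.

```python
from typing import Dict, List, NamedTuple


class Exon(NamedTuple):
    left_flank: str    # bases printed before the exon, e.g. AG (assumed meaning)
    start: int
    stop: int
    right_flank: str   # bases printed after the exon, e.g. GT (assumed meaning)
    strand: str
    identity_pct: int
    positives_pct: int

    @property
    def length(self) -> int:
        """Exon length in bases, both ends inclusive."""
        return abs(self.start - self.stop) + 1


def parse_inf(text: str) -> Dict[int, List[Exon]]:
    """Map each compartment number to its exons; summary lines are ignored."""
    alignments: Dict[int, List[Exon]] = {}
    current = None
    for line in text.splitlines():
        f = line.split()
        if not f or f[0] == "start":
            continue                      # blank line or per-compartment summary
        if len(f) == 6:                   # compartment header line
            current = int(f[0])
            alignments[current] = []
        elif len(f) >= 10 and f[6] == "id:" and f[8] == "pos:":
            if current is None:
                raise ValueError("exon line before any compartment header")
            alignments[current].append(
                Exon(f[1], int(f[2]), int(f[3]), f[4], f[5],
                     int(f[7].rstrip("%")), int(f[9].rstrip("%"))))
    return alignments


sample = """\
1 gi|37544107|ref|NT_010783.14|Hs17_10940 gi|6679997|ref|NP_032143.1| 20591500 20606200 -
1 TCC 20602708 20602550 GT - id: 60% pos: 71%
1 AG 20602325 20602206 GT - id: 47% pos: 75%
1 AG 20602112 20601948 GT - id: 61% pos: 76%
1 AG 20601694 20601506 GGC - id: 67% pos: 78% total: id: 60% pos: 75%frame: 0
start - stop - frameshifts - stop_inside_exon - number_of_pieces 1
"""
exons = parse_inf(sample)[1]
lengths = [e.length for e in exons]
print(lengths, sum(lengths))
```

The trailing totals on the last exon line are deliberately left unparsed, because their exact layout is not documented here.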

Algorithmic details

ProSplign works with input sequences on a pairwise basis. In other words, exon/intron structures are determined independently for each query/subject pair.

Dynamic programming alone is accurate in determining splice junctions, but it is computationally expensive. In addition, if copies of a gene lie on the same genomic sequence and strand, applying it directly may produce incorrect results by connecting exons from different copies.

Thus, for every input query/subject pair, it is important to localize genes on the genomic sequence. ProSplign does this by grouping the BLAST hits into compartments. The compartmentization step starts with computing protein-to-genomic BLAST hits, which give initial insight into the structure of the compartments. Hits are separated into two same-strand sets, and compartments are then identified within each strand. To do so, we formally define an optimization problem in terms of genomic sequence coverage and solve it with a dynamic programming algorithm whose running time is short compared to that of the core dynamic programming described above.
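To make the idea of compartments concrete, here is a deliberately simplified Python illustration. It is not ProSplign's coverage-based dynamic programming: it swaps in a greedy rule that splits hits by strand, walks them in the direction of translation, and opens a new group whenever the protein start position fails to advance. The function name and the tuple layout (q_start, q_end, s_start, s_end) are assumptions made for this sketch.

```python
from typing import Dict, List, Tuple

HitTuple = Tuple[int, int, int, int]  # (q_start, q_end, s_start, s_end)


def group_hits(hits: List[HitTuple]) -> Dict[str, List[List[HitTuple]]]:
    """Greedy toy grouping of hits into per-strand candidate compartments."""
    by_strand: Dict[str, List[HitTuple]] = {"+": [], "-": []}
    for hit in hits:
        strand = "+" if hit[2] <= hit[3] else "-"
        by_strand[strand].append(hit)

    groups: Dict[str, List[List[HitTuple]]] = {}
    for strand, strand_hits in by_strand.items():
        # On the minus strand, translation runs toward lower genomic coordinates.
        ordered = sorted(strand_hits, key=lambda h: h[2], reverse=(strand == "-"))
        result: List[List[HitTuple]] = []
        current: List[HitTuple] = []
        for hit in ordered:
            # A protein start that does not advance suggests another gene copy.
            if current and hit[0] <= current[-1][0]:
                result.append(current)
                current = []
            current.append(hit)
        if current:
            result.append(current)
        groups[strand] = result
    return groups


# Query and subject coordinates of the five example BLAST hits shown earlier.
example = [
    (97, 190, 20647295, 20647002),
    (58, 96, 20647507, 20647388),
    (149, 216, 20646883, 20646683),
    (149, 216, 20624596, 20624396),
    (149, 216, 20601700, 20601500),
]
groups = group_hits(example)
print(len(groups["-"]), [[h[0] for h in g] for g in groups["-"]])
```

Unlike the real algorithm, this heuristic ignores hit scores, genomic distance, and overlaps, so its output should not be expected to match what the compart tool reports.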

Frequently Asked Questions

Q: Why am I getting "Unable to locate XXX" exceptions?
A: Please make sure that the sequence identifiers in the input hit file match those in the index file. When indexing your FASTA files, ProSplign records sequence IDs exactly as they appear after the leading '>', whereas your BLAST program may have printed them slightly differently.
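A quick way to spot such mismatches is to compare the identifiers in the hit file against the FASTA headers before running ProSplign. The Python sketch below assumes that an ID is the header text after '>' up to the first whitespace, and that the hit file holds the query ID in column 1 and the subject ID in column 2; both function names are hypothetical.

```python
from typing import List, Set


def fasta_ids(fasta_text: str) -> Set[str]:
    """Collect IDs from FASTA headers (text after '>' up to the first whitespace)."""
    ids = set()
    for line in fasta_text.splitlines():
        if line.startswith(">") and line[1:].strip():
            ids.add(line[1:].split()[0])
    return ids


def unmatched_ids(hit_text: str, query_fasta: str, subject_fasta: str) -> List[str]:
    """Return hit-file IDs that have no exact counterpart in the FASTA files."""
    query_ids = fasta_ids(query_fasta)
    subject_ids = fasta_ids(subject_fasta)
    missing = set()
    for line in hit_text.splitlines():
        f = line.split()
        if len(f) < 2:
            continue
        if f[0] not in query_ids:
            missing.add(f[0])
        if f[1] not in subject_ids:
            missing.add(f[1])
    return sorted(missing)


# Made-up inputs: the protein header uses a bare accession, the hit file does not.
query_fasta = ">NP_032143.1 example protein\nMKV\n"
subject_fasta = ">gi|37544107|ref|NT_010783.14|Hs17_10940\nACGT\n"
hit_text = ("gi|6679997|ref|NP_032143.1| gi|37544107|ref|NT_010783.14|Hs17_10940 "
            "48.98 98 46 3 97 190 20647295 20647002 1e-23 76.3\n")
missing = unmatched_ids(hit_text, query_fasta, subject_fasta)
print(missing)
```

Any identifier reported here is a candidate cause of the exception; renaming the FASTA headers or rewriting the hit file so that both sides agree should resolve it.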

Q: What does the 'No compartment found' log file message mean? What is a compartment?
A: A compartment is a localized interval on the genomic sequence that provides bounds for ProSplign in its search for exons. Compartments are identified from the input BLAST hits, so this message is generated when there are too few hits, or when the hits are too weak or too inconsistent with each other to form a compartment.