ProSplign is a global alignment tool developed by Dr. Boris Kiryutin. It produces accurate spliced alignments and locates alignments of distantly related proteins with low similarity.
The ProSplign algorithm is an integral component of NCBI's Genome Annotation Pipeline (Gnomon), which has been used to annotate the genomes of many plant and animal species (such as human, mouse, and cow). The Pipeline was used by the Sea Urchin Genome Sequencing center for sequence analysis of the 814-megabase genome of the sea urchin Strongylocentrotus purpuratus, published in Science in 2006. Integrating ProSplign into the genome annotation pipeline significantly improved the quality of genome annotation over previously available methods. Following this success, the method was used to annotate Tribolium castaneum (Nature, 2008), taurine cattle (Science, 2009), Acyrthosiphon pisum (PLoS Biology, 2010), Nasonia (Science, 2010), and many other genomes.
ProSplign is also a central part of the automatic pipeline for influenza virus genomes, an important part of the Influenza Genome Sequencing Project. Sponsored by the National Institutes of Health, the Influenza Project is an international collaboration of critical importance to public health. It has already led to multiple discoveries about the recent evolution and pathogenesis of influenza, published in leading journals including Journal of Virology, PLoS Biology, and Nature.
ProSplign is a utility for computing the alignment of proteins to genomic nucleotide sequence. This alignment can include eukaryotic splicing. At the heart of the program is a global alignment algorithm that specifically accounts for introns and splice signals. It is due to this algorithm that ProSplign is accurate in determining splice sites and tolerant to sequencing errors.
ProSplign uses BLAST hits to identify possible locations of genes and their duplications on genomic sequences and then to speed up the core dynamic programming.
This web site is a single-point source of information on ProSplign, the tool for computing protein-to-genomic alignments that include an effort to account for mRNA splicing. ProSplign was developed with the following goals in mind:
- Accuracy in determining splice signals
- Recognition of short exons and non-consensus splices where feasible
- Ability to identify and separate multiple compartments typically representing gene copying events
ProSplign is used to compute protein-to-genomic alignments as a part of the NCBI Genome Annotation Pipeline.
ProSplign is available for use in a number of different ways. There is no online version of ProSplign. You must download and install the console version which is available for major platforms (and may also be available for a few platforms not listed - please request). You can also link to ProSplign from your own applications in a portable way since ProSplign is a part of the NCBI C++ Toolkit. And finally, ProSplign is available as a plugin for the NCBI Genome Workbench.
Reference: ProSplign - Protein to Genomic Alignment Tool. B. Kiryutin, A. Souvorov, T. Tatusova. Manuscript in preparation
Binaries (updated 02/23/15)
Pre-built executables are available for
Linux/i386 (64bit)
Sources
ProSplign was written for gene prediction at NCBI. No effort is made to maintain backward compatibility between versions.
ProSplign is included in the NCBI C++ Toolkit. For details on how to download, configure, and build the Toolkit, please consult the NCBI C++ Toolkit book.
You can browse the Toolkit's code through the LXR or Doxygen source browsers. Search for CProSplign C/C++ Symbol to go directly to ProSplign sources.
Using the console version
The console version of ProSplign can be launched in two modes: pairwise and batch. Pairwise mode is useful if you need to quickly align a few sequences and don't want to compute separate BLAST hits for them. Batch mode is the best choice for large-scale alignment jobs, e.g. as a part of your genome annotation process. To see the parameters, run "./prosplign -help". Most of the parameters are for the internal NCBI gene prediction process.
In pairwise mode, put your protein query and nucleic acid subject sequences in two files (only the first sequence in each file will be aligned) and run the command line "./prosplign -full -nfa nuc.fa -pfa prot.fa -out aln.txt -fasn aln.asn". The nfa parameter names the file with the nucleic acid subject, and the pfa parameter names the file with the protein query. Text output goes to the file given by the out parameter, and ASN.1 output goes to the file given by the fasn parameter.
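Because only the first sequence in each file is aligned, it can help to check what that first record is before launching a run. The following is a minimal Python sketch under stated assumptions: the helper names first_fasta_record and pairwise_command are hypothetical and are not part of ProSplign, and the command is assembled only from the flags shown in the example above.

```python
def first_fasta_record(text):
    """Return (header, sequence) for the first record in FASTA-formatted text.

    Hypothetical helper for illustration; later records are ignored,
    mirroring the statement that only the first sequence is aligned.
    """
    header = None
    seq_parts = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                break  # a second record begins here; stop reading
            header = line[1:]
        elif header is not None:
            seq_parts.append(line)
    if header is None:
        raise ValueError("no FASTA header found")
    return header, "".join(seq_parts)


def pairwise_command(nuc_path, prot_path, text_out, asn_out):
    """Build the pairwise-mode argument list using the flags from the example."""
    return ["./prosplign", "-full",
            "-nfa", nuc_path,
            "-pfa", prot_path,
            "-out", text_out,
            "-fasn", asn_out]
```

The argument list returned by pairwise_command can be passed to subprocess.run if you want to launch the alignment from a script.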
Batch mode is organized in three steps.
- Run a BLAST program to generate the 12-column, tab-separated output. Make sure the output is sorted by subject and query. For example:
formatdb -i subj.fa -p F -o T
tblastn -i query.fa -d subj.fa -m 8 | sort -k 2,2 -k 1,1 > test.hit
resulting in:
gi|6679997|ref|NP_032143.1| gi|37544107|ref|NT_010783.14|Hs17_10940 48.98 98 46 3 97 190 20647295 20647002 1e-23 76.3
gi|6679997|ref|NP_032143.1| gi|37544107|ref|NT_010783.14|Hs17_10940 55.00 40 17 1 58 96 20647507 20647388 1e-23 50.4
gi|6679997|ref|NP_032143.1| gi|37544107|ref|NT_010783.14|Hs17_10940 67.65 68 21 1 149 216 20646883 20646683 8e-22 100
gi|6679997|ref|NP_032143.1| gi|37544107|ref|NT_010783.14|Hs17_10940 66.18 68 22 1 149 216 20624596 20624396 3e-20 94.7
gi|6679997|ref|NP_032143.1| gi|37544107|ref|NT_010783.14|Hs17_10940 66.18 68 22 1 149 216 20601700 20601500 3e-20 94.7
- Run the compart tool to find approximate locations of the protein instances on the nucleic acid (./compart -f test.hit -add 10000 > comp). Each line of the output file represents a single instance, or 'compartment'.
1 NT_010783.14 NP_032143.1 20591500 20606200 -
2 NT_010783.14 NP_032143.1 20606202 20617641 -
3 NT_010783.14 NP_032143.1 20617643 20632343 -
4 NT_010783.14 NP_032143.1 20632345 20643487 -
5 NT_010783.14 NP_032143.1 20643489 20657875 -
- Run ProSplign with the compartment file and the FASTA files to generate an alignment for each compartment (./prosplign -two_stages -pfa p.fa -nfa n.fa -f comp -inf pro.inf -out pro.out). The .inf file is designed for further computation.
1 gi|37544107|ref|NT_010783.14|Hs17_10940 gi|6679997|ref|NP_032143.1| 20591500 20606200 -
1 TCC 20602708 20602550 GT - id: 60% pos: 71%
1 AG 20602325 20602206 GT - id: 47% pos: 75%
1 AG 20602112 20601948 GT - id: 61% pos: 76%
1 AG 20601694 20601506 GGC - id: 67% pos: 78% total: id: 60% pos: 75%frame: 0
start - stop - frameshifts - stop_inside_exon - number_of_pieces 1
2 gi|37544107|ref|NT_010783.14|Hs17_10940 gi|6679997|ref|NP_032143.1| 20606202 20617641 -
1 TCC 20610888 20610730 GT - id: 67% pos: 77%
1 AG 20610519 20610400 GT - id: 47% pos: 72%
1 AG 20610307 20610143 GT - id: 67% pos: 76%
1 AG 20609889 20609701 GGC - id: 64% pos: 76% total: id: 62% pos: 75%frame: 0
start - stop - frameshifts - stop_inside_exon - number_of_pieces 1
3 gi|37544107|ref|NT_010783.14|Hs17_10940 gi|6679997|ref|NP_032143.1| 20617643 20632343 -
1 TCC 20625604 20625446 GT - id: 60% pos: 71%
1 AG 20625221 20625102 GT - id: 47% pos: 75%
1 AG 20625008 20624844 GT - id: 61% pos: 76%
1 AG 20624590 20624402 GGC - id: 67% pos: 78% total: id: 60% pos: 75%frame: 0
start - stop - frameshifts - stop_inside_exon - number_of_pieces 1
4 gi|37544107|ref|NT_010783.14|Hs17_10940 gi|6679997|ref|NP_032143.1| 20632345 20643487 -
1 TCC 20640293 20640131 GC - id: 58% pos: 68%
1 AG 20639906 20639791 GT - id: 43% pos: 74%
1 AG 20639697 20639533 GT - id: 60% pos: 72%
1 AG 20639279 20639091 GGC - id: 64% pos: 75% total: id: 58% pos: 72%frame: 0
start - stop - frameshifts - stop_inside_exon - number_of_pieces 1
5 gi|37544107|ref|NT_010783.14|Hs17_10940 gi|6679997|ref|NP_032143.1| 20643489 20657875 -
1 TCC 20647875 20647717 GT - id: 69% pos: 75%
1 AG 20647507 20647388 GT - id: 55% pos: 77%
1 AG 20647295 20647131 GT - id: 69% pos: 80%
1 AG 20646877 20646689 GGC - id: 68% pos: 79% total: id: 66% pos: 78%frame: 0
start - stop - frameshifts - stop_inside_exon - number_of_pieces 1
The .out file is designed for human reading.
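For the first batch step, the hit file can be checked before it is handed to compart. The sketch below, in Python, parses the standard 12-column BLAST tabular format (query, subject, percent identity, alignment length, mismatches, gap openings, query start, query end, subject start, subject end, e-value, bit score) and tests whether the hits are ordered by subject and then query. The names Hit, parse_hit, strand and is_sorted_by_subject_then_query are hypothetical, and the ordering check uses Python's plain string comparison, which is an assumption: the sort command may order identifiers differently under some locales.

```python
from collections import namedtuple

# One row of 12-column BLAST tabular output.
Hit = namedtuple(
    "Hit",
    "query subject pident length mismatch gapopen "
    "qstart qend sstart send evalue bitscore",
)


def parse_hit(line):
    """Parse one tabular hit line; fields are split on any whitespace."""
    f = line.split()
    if len(f) != 12:
        raise ValueError("expected 12 columns, got %d" % len(f))
    return Hit(f[0], f[1], float(f[2]),
               int(f[3]), int(f[4]), int(f[5]),
               int(f[6]), int(f[7]), int(f[8]), int(f[9]),
               float(f[10]), float(f[11]))


def strand(hit):
    """A protein hit with subject start > subject end lies on the minus strand."""
    return "+" if hit.sstart <= hit.send else "-"


def is_sorted_by_subject_then_query(hits):
    """True if hits are ordered by (subject, query) under plain string order."""
    keys = [(h.subject, h.query) for h in hits]
    return keys == sorted(keys)
```

Applied to the first sample hit above, parse_hit gives query positions 97 to 190 and subject positions 20647295 to 20647002, which strand reports as the minus strand, in agreement with the '-' shown in the compartment lines.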
Algorithmic details
ProSplign works with input sequences on a pairwise basis. In other words, exon/intron structures are determined independently for each query and subject.
Dynamic programming alone is accurate in determining splice junctions but computationally expensive. Also, if copies of a gene share the same genomic sequence and strand, applying it directly may produce incorrect results by connecting exons from different copies.
Thus, for every input query/subject pair, it is important to localize genes on the genomic sequence, which ProSplign achieves by grouping the BLAST hits into compartments.
The compartmentization step starts with computing protein-to-genomic BLAST hits. These give initial insight into the structure of compartments. Hits are separated into two same-strand sets, and compartments are then identified within each strand. To do so, we formally define an optimization problem in terms of genomic sequence coverage and solve it with a dynamic programming algorithm whose running time is short compared to the core dynamic programming described above.
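To make the idea of compartments concrete, here is a deliberately simplified Python sketch. It is not ProSplign's algorithm: the coverage-based optimization and its dynamic programming are replaced by a crude greedy rule, chosen only for illustration. Hits are split by strand, walked in the direction of translation, and a new group is started whenever a hit fails to extend protein coverage past the previous hit's query end. The function name greedy_compartments and the tuple layout are hypothetical.

```python
def greedy_compartments(hits):
    """Group hits for one query/subject pair into candidate compartments.

    hits: iterable of (qstart, qend, sstart, send) tuples.
    Returns a list of (strand, [hits]) pairs.
    Greedy stand-in for illustration only, not the ProSplign method.
    """
    by_strand = {"+": [], "-": []}
    for h in hits:
        # subject start > subject end means the hit is on the minus strand
        by_strand["+" if h[2] <= h[3] else "-"].append(h)

    # Order each set along the direction of translation on the genome.
    by_strand["+"].sort(key=lambda h: h[2])
    by_strand["-"].sort(key=lambda h: h[2], reverse=True)

    groups = []
    for strand in ("+", "-"):
        current = []
        for h in by_strand[strand]:
            # A hit that does not advance past the previous query end is
            # treated as the start of another gene copy.
            if current and h[1] <= current[-1][1]:
                groups.append((strand, current))
                current = []
            current.append(h)
        if current:
            groups.append((strand, current))
    return groups
```

With the five sample hits from the batch-mode example, this rule yields three minus-strand groups: one containing the three hits near position 20647000 and one each for the hits near 20624500 and 20601600. The real compart output shown earlier has five compartments because it is computed by a different method and from the full hit set.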
Frequently Asked Questions
Q: Why am I getting "Unable to locate XXX" exceptions?
A: Please make sure that sequence identifiers in the input hit file match those in the index file. When indexing your FASTA files, ProSplign records sequence IDs exactly as they appear after the leading '>', while your BLAST program may have printed them slightly differently.
Q: What does the 'No compartment found' log file message mean? What is a compartment?
A: A compartment is a localized interval on the genomic sequence that provides bounds for ProSplign in its search for exons. Compartments are identified based on input BLAST hits, so this message is generated when there are too few hits, or when the hits are too weak or too inconsistent with each other to form a compartment.
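For the "Unable to locate XXX" question above, a quick way to find the offending identifiers is to compare the first two columns of the hit file with the FASTA headers. The Python sketch below does this under one loudly stated assumption: the sequence ID is taken to be the header text after '>' up to the first whitespace. The helper names fasta_ids and unmatched_hit_ids are hypothetical and are not part of ProSplign.

```python
def fasta_ids(fasta_text):
    """Collect IDs from FASTA headers.

    Assumption: the ID is the text after '>' up to the first whitespace.
    """
    ids = set()
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:
                ids.add(tokens[0])
    return ids


def unmatched_hit_ids(hit_text, known_ids):
    """Return the sorted query/subject IDs in the hit file that are not known."""
    missing = set()
    for line in hit_text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        for seq_id in fields[:2]:
            if seq_id not in known_ids:
                missing.add(seq_id)
    return sorted(missing)
```

Pass the combined IDs from both the protein and the nucleotide FASTA files as known_ids; any identifier returned is one that the hit file spells differently from the FASTA headers.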