Biowulf at the NIH
IPRscan on Biowulf

InterProScan (IPRScan) is a tool that combines different protein signature recognition methods into one resource. IPRScan not only wraps the sequence analysis applications but also performs a considerable amount of data look-up from various databases and program outputs. InterPro (IPR) integrates the PROSITE, PRINTS, Pfam, ProDom, SMART, TIGRFAMs, PIR superfamily, SUPERFAMILY, Gene3D and PANTHER databases. Each member database in IPR uses a different scanning application.

InterPro and InterProScan are developed and maintained at the EMBL-EBI.

IPRScan on Biowulf is intended for bulk sequence analysis of hundreds or thousands of sequences. For small-scale analysis, please use the EBI IPRScan website, which handles 1 sequence at a time.

Input sequences
Iprscan Options
A summary of the iprscan options can be displayed by typing the following at the Biowulf prompt:
/usr/local/iprscan/bin/iprscan -cli -h

Mandatory options

-cli use command-line interface. Required for Biowulf runs.
-i Input sequence file.
-seqtype n Required for nucleotide sequences

Other options

-o outputfilename File to write the results to (optional). The default is STDOUT, which the batch system saves to a file under /home/UserID (see table below).
-email <addr> Submitter email address (Not required for Biowulf batch system).
-appl <name> Application(s) to run (optional), default is all.
Possible values (dependent on set-up):
blastprodom
fprintscan
hmmpfam
hmmpir
hmmpanther
hmmtigr
hmmsmart
superfamily
gene3d
scanregexp
profilescan
seg
coils
-nocrc Skip the CRC64 check and rerun all searches, even for sequences whose results already exist in the database, which is unnecessary. By default, jobs run without this flag.
-altjobs Launch jobs alternatively, chunk after chunk. Default is off.
-seqtype <type> Sequence type: n for DNA/RNA, p for protein (default).
-trlen <n> Transcript length threshold (20-150).
-trtable <table> Codon table number.
-goterms Show GO terms if iprlookup option is also given.
-iprlookup Switch on the InterPro lookup for results.
-format <format> Output results format: raw, txt, html, xml (default), ebixml (EBI header on top of xml), gff.
-verbose Print messages during run
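As an illustration of how the options above combine, the following sketch prints (without running) an iprscan command line. The function name build_iprscan_cmd and the input file name are hypothetical; only the iprscan path and the flags come from this page. It assumes file names without spaces.

```shell
#!/bin/bash
# Sketch only: print, but do not run, an iprscan command line.
# The iprscan path and flags are taken from this page; everything else is illustrative.
IPRSCAN=/usr/local/iprscan/bin/iprscan

build_iprscan_cmd() {
    # $1 = input file, $2 = sequence type (p or n, default p), $3 = output format (default xml)
    local input="$1" seqtype="${2:-p}" format="${3:-xml}"
    local cmd="$IPRSCAN -cli -i $input -format $format -iprlookup -goterms"
    # Protein is the default sequence type, so -seqtype is added only for DNA/RNA input.
    if [ "$seqtype" = "n" ]; then
        cmd="$cmd -seqtype n"
    fi
    printf '%s\n' "$cmd"
}

# Hypothetical nucleotide input, raw output format.
build_iprscan_cmd /data/YourUserID/my_orfs.fa n raw
```

The printed line can be pasted into a batch script once it looks right. Note that -goterms only has an effect together with -iprlookup, which is why the sketch always emits both.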
Prepare batch script
Sample script file. See the Biowulf user guide for more information about batch scripts.
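#!/bin/bash
#
# Job name; it also prefixes the .o/.e files listed in the table below
#PBS -N YourScriptNameHere
# Send mail when the job begins (b) and ends (e)
#PBS -m be
# Keep the job's standard output (o) and standard error (e) files
#PBS -k oe

# -o /dev/null avoids a second copy of the summary in the job's output file
/usr/local/iprscan/bin/iprscan -cli -i /data/maoj/iprscan/test248aa.seq -o /dev/null \
  -format raw -goterms -iprlookup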
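Submit PBS job
Submit the batch script with qsub, for example with the resource request used for the benchmarks below:
qsub -l nodes=1,mem=2048 scriptName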
Structure of output files after the job has started:
Each entry lists the directory or file, an example, and notes.

Directory/File:
  /home/YourUserID/YourJobName.oJobNo
  /home/YourUserID/YourJobName.eJobNo
  /home/YourUserID/iprscan-xxx.oxxxxx
  /home/YourUserID/iprscan-xxx.exxxxx
Example:
  /home/userID/growth.e1151902
  /home/userID/growth.o1151902
  /home/userID/iprscan-2008050.e1269981
  /home/userID/iprscan-2008050.o1269981
Notes:
  YourJobName is the name specified in the batch script file after "#PBS -N".
  JobNo is the number that appears right after the user submits the job.
  If the -o option is not given, the summary result appears both in the 'merged.raw' file (see below) and in this 'xxx.oJobNo' file. To avoid the duplicate, include '-o /dev/null' in the command, as in the sample script above.

Directory/File:
  /data/YourUserID/iprscan-yyyymmdd/
Example:
  /data/userID/iprscan-20080301
Notes:
  Automatically created.

Directory/File:
  /data/YourUserID/iprscan-yyyymmdd/iprscan-yyyymmdd-hhmmssxx/
Example:
  /data/YourUserID/iprscan-20080301/iprscan-20080301-11414247/
Notes:
  Subdirectories named after the time stamp at which the job was submitted; they keep multiple jobs submitted on the same day apart.

Directory/File:
  /data/UserID/pxxxxxxxx.res or .log or .fa
Example:
  /data/UserID/p1063121154651720535.fa
  /data/UserID/p1063121154651720535.log
  /data/UserID/p1063121154651720535.res
Notes:
  These files are temporary and are cleaned up automatically before the job finishes. Do not modify or remove them while the job is running.

Directory/File:
  /data/YourUserID/iprscan-yyyymmdd/iprscan-yyyymmdd-hhmmssxx/iprscan-yyyymmdd-hhmmssxx.exitcode
Example:
  /data/YourUserID/iprscan-20080301/iprscan-20080301-11414247/iprscan-20080301-11414247.exitcode
Notes:
  The file content should be '0' if the job ran successfully. However, the .exitcode files under all chunks should also be checked to confirm. To view the content of all exitcode files in a chunk at once, go to each chunk_x directory and type 'more iprscan*.exitcode'.

Directory/File:
  /data/YourUserID/iprscan-yyyymmdd/iprscan-yyyymmdd-hhmmssxx/iprscan-yyyymmdd-hhmmssxx.input
Example:
  /data/YourUserID/iprscan-20080301/iprscan-20080301-11414247/iprscan-20080301-11414247.input
Notes:
  Input file with the sequences converted to FASTA format.

Directory/File:
  /data/YourUserID/iprscan-yyyymmdd/iprscan-yyyymmdd-hhmmssxx/iprscan-yyyymmdd-hhmmssxx.input.inx
Example:
  /data/YourUserID/iprscan-20080301/iprscan-20080301-11414247/iprscan-20080301-11414247.input.inx
Notes:
  Binary format of the input sequences.

Directory/File:
  /data/YourUserID/iprscan-yyyymmdd/iprscan-yyyymmdd-hhmmssxx/iprscan-yyyymmdd-hhmmssxx.params
Example:
  /data/YourUserID/iprscan-20080301/iprscan-20080301-11414247/iprscan-20080301-11414247.params
Notes:
  Checksum summary of all the input sequences.

Directory/File:
  /data/YourUserID/iprscan-yyyymmdd/iprscan-yyyymmdd-hhmmssxx/iprscan-yyyymmdd-hhmmssxx.seqs
Example:
  /data/YourUserID/iprscan-20080301/iprscan-20080301-11414247/iprscan-20080301-11414247.seqs
Notes:
  Input file with the sequences in their original format.

Directory/File:
  /data/YourUserID/iprscan-yyyymmdd/iprscan-yyyymmdd-hhmmssxx/merged.raw
  In addition to merged.raw, it can also be xxx.html or xxx.xml or xxx.txt, depending on the format the user specified.
Example:
  /data/YourUserID/iprscan-20080301/iprscan-20080301-11414247/merged.raw
Notes:
  Output summary file of all chunks. The format can be merged.raw or html or xml or txt.

Directory/File:
  /data/YourUserID/iprscan-yyyymmdd/iprscan-yyyymmdd-hhmmssxx/chunk_x/
Example:
  /data/YourUserID/iprscan-20080301/iprscan-20080301-11414247/chunk_3
Notes:
  One directory is created for each chunk of sequences; it contains the output files of each of the 13 applications.

Directory/File:
  /data/YourUserID/iprscan-yyyymmdd/iprscan-yyyymmdd-hhmmssxx/chunk_x/iprscan-yyyymmdd-hhmmssxx-APP-cnkX.OUTPUTFILE
Example:
  /data/YourUserID/iprscan-20080301/iprscan-20080301-11414247/chunk_2/iprscan-20080310-13280403-profilescan-cnk2.exitcode
Notes:
  Each application generates 4 output files: .output, .output.inx, .errors and .exitcode. Check the exitcode file of every application in every chunk. The content should be '0' in all exitcode files for a successful run.

Directory/File:
  /data/YourUserID/iprscan-yyyymmdd/iprscan-yyyymmdd-hhmmssxx/chunk_x/iprscan-yyyymmdd-hhmmssxx.nocrc
Example:
  /data/YourUserID/iprscan-20080301/iprscan-20080301-11414247/chunk_2/iprscan-20080310-13280403.nocrc
Notes:
  The .nocrc file under each chunk_x directory contains the query sequences that do not have a known crc64 according to the match.xml file. When the -nocrc flag is not given, applications are launched only against these sequences.

Directory/File:
  /data/YourUserID/iprscan-yyyymmdd/iprscan-yyyymmdd-hhmmssxx/chunk_x/iprscan-yyyymmdd-hhmmssxx.xml
Example:
  /data/YourUserID/iprscan-20080301/iprscan-20080301-11414247/chunk_2/iprscan-20080310-13280403.xml
Notes:
  The 'xxx.xml' file under each chunk_x directory contains the search results for that chunk in the default output format. Other output formats can be obtained by changing the -format option, for example from -format raw to -format html.
  To view an .html output file, type 'firefox YourFileName.html'.

Directory/File:
  /data/YourUserID/iprscan-yyyymmdd/iprscan-yyyymmdd-hhmmssxx/chunk_x/merged.raw
Example:
  /data/YourUserID/iprscan-20080301/iprscan-20080301-11414247/chunk_2/merged.raw
Notes:
  The merged.raw file under each chunk_x directory contains the output in raw format, converted from the .xml file.
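
The exit code checks described above can be done in one pass rather than chunk by chunk. The following is a minimal sketch that assumes only what the table states: the files end in .exitcode and a content of '0' means success. The function name check_exitcodes is hypothetical.

```shell
#!/bin/bash
# Sketch: report every .exitcode file under one run directory whose content is not 0.
# Only the *.exitcode naming and the "0 means success" rule come from this page.

check_exitcodes() {
    local run_dir="$1"
    local total bad
    # Count the exitcode files (top level of the run plus all chunk_x directories).
    total=$(find "$run_dir" -type f -name '*.exitcode' | wc -l | tr -d ' ')
    # Collect the paths whose content, ignoring whitespace, is anything but "0".
    bad=$(find "$run_dir" -type f -name '*.exitcode' | while IFS= read -r file; do
        code=$(tr -d '[:space:]' < "$file")
        [ "$code" = "0" ] || printf '%s\n' "$file"
    done)
    echo "$total exitcode file(s) checked"
    if [ -n "$bad" ]; then
        echo "exit code is not 0 in:"
        echo "$bad"
        return 1
    fi
    # Finding no exitcode files at all is also treated as a failure.
    [ "$total" -gt 0 ]
}
```

For example, 'check_exitcodes /data/YourUserID/iprscan-20080301/iprscan-20080301-11414247' returns a non-zero status if any file holds something other than 0, or if no exitcode file is found.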
Cleanup

Output files from iprscan runs are grouped under two directories: /home/userID and /data/userID/iprscan-xxxxx. Because these files accumulate and can quickly fill up a user's disk space, users are strongly encouraged to clean up frequently.

The main output file(s), such as merged.raw or xxx.html or xxx.xml or xxx.txt, contain the summarized output of interest from all chunks in each run. The other files can be deleted after checking the exitcode files in the chunk_x directories as described above. Sample cleanup commands:

% cd /data/user
% mv iprscan-yyyymmdd/iprscan..../merged.raw . # or *.html or *.txt or *.xml
% rm -r iprscan-yyyymmdd
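
The same cleanup can be wrapped in a small function that refuses to delete anything unless a summary file was actually saved. This is a sketch under stated assumptions: the function name keep_summary_and_clean is hypothetical, and it assumes the summary files sit at the top of the run directory and match merged.* (only merged.raw is named on this page, so adjust the pattern for other formats). It is meant to be run after the exitcode files have been checked as described above.

```shell
#!/bin/bash
# Sketch: keep the summary file(s) of one finished run, then delete the run directory.
# Assumption: summaries match merged.* at the top level of the run directory.

keep_summary_and_clean() {
    # $1 = run directory (iprscan-yyyymmdd-hhmmssxx), $2 = directory that receives the summaries
    local run_dir="${1%/}" dest="$2"
    local run_name kept=0 f
    [ -d "$run_dir" ] || return 1
    run_name=$(basename "$run_dir")
    for f in "$run_dir"/merged.*; do
        [ -f "$f" ] || continue
        # Prefix with the run name so summaries from different runs do not collide.
        mv "$f" "$dest/$run_name-$(basename "$f")" || return 1
        kept=$((kept + 1))
    done
    if [ "$kept" -eq 0 ]; then
        echo "no merged.* file in $run_dir; nothing deleted" >&2
        return 1
    fi
    rm -r "$run_dir"
    echo "kept $kept summary file(s) in $dest, removed $run_dir"
}
```

Prefixing the saved file with the run name is a design choice made here so that several runs can be cleaned into the same destination without overwriting each other's merged.raw.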
Benchmarks

For these benchmarks, an input file containing query sequences was submitted to the batch system using 'qsub -l nodes=1,mem=2048 scriptName', with a batch script like the example above. Jobs were run on ??? nodes with 2 GB of RAM.

IPRscan Databases

Please contact the Helix Systems staff if you have questions.