OMSSA on the Biowulf Linux Cluster

OMSSA on Biowulf

The Open Mass Spectrometry Search Algorithm (OMSSA) is an efficient search engine for identifying MS/MS peptide spectra by searching libraries of known protein sequences. OMSSA scores significant hits with a probability score developed using classical hypothesis testing, the same statistical method used in BLAST.

OMSSA was developed by researchers at the NCBI, National Institutes of Health. [OMSSA website]

Small numbers of OMSSA jobs should be run on the NCBI OMSSA server. OMSSA on Biowulf is intended for running a large number of OMSSA searches, or running OMSSA against a personal database.

A swarm of OMSSA jobs

To run a large number of OMSSA searches, use the swarm utility. Set up a swarm command file containing one line for each of your OMSSA runs. Here is a sample swarm command file:

------------------file sample.com--------------------
/usr/local/omssa/omssacl -d /fdb/blastdb/nr -f file1.dta -ox file1.xml 
/usr/local/omssa/omssacl -d /fdb/blastdb/nr -f file2.dta -ox file2.xml 
/usr/local/omssa/omssacl -d /fdb/blastdb/nr -f file3.dta -ox file3.xml 
/usr/local/omssa/omssacl -d /fdb/blastdb/nr -f file4.dta -ox file4.xml 
----------------end of file -------------------------

Submit this file with

swarm -f sample.com -n 1
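
With many input files, a command file like the one above is easier to generate than to type. The following is a minimal sketch, assuming your spectra are .dta files in a single directory and that file names contain no spaces. The helper name make_swarm_file and the variables OMSSACL and DB are hypothetical and are not part of swarm or OMSSA; their defaults simply mirror the paths used in the sample file.

```shell
#!/bin/sh
# Sketch: write one omssacl command line per .dta file found in a directory.
# OMSSACL and DB default to the paths used in the sample command file.
OMSSACL="${OMSSACL:-/usr/local/omssa/omssacl}"
DB="${DB:-/fdb/blastdb/nr}"

# make_swarm_file DIR OUTFILE   (hypothetical helper)
make_swarm_file() {
    local dir="$1" out="$2" f base
    : > "$out"                      # start with an empty command file
    for f in "$dir"/*.dta; do
        [ -e "$f" ] || continue     # skip the unexpanded pattern if nothing matches
        base="${f%.dta}"            # strip the .dta suffix to name the XML output
        printf '%s -d %s -f %s -ox %s.xml\n' "$OMSSACL" "$DB" "$f" "$base" >> "$out"
    done
}
```

For example, `make_swarm_file /data/user/mydir sample.com` would write one line per .dta file in that directory, and the result can then be submitted with the swarm command shown above.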

Note about multithreading: As of v2.1.0, OMSSA is multithreaded and will attempt to use all available processors on a node. It is therefore critical to use the '-n 1' parameter on the swarm command above, which sends only one OMSSA command to each node; otherwise the nodes will be overloaded and performance will suffer.

These OMSSA commands will produce XML output. You can write your own script to process the XML data. The OMSSA package includes a sample parser: the command to use it is

perl /usr/local/omssa/readOMSSA.pl file1.xml

Thus, it is possible to run an OMSSA search and parse its results in a single swarm command. In the command file below, each command is continued onto a second line with a trailing backslash:

cd /data/user/mydir ; /usr/local/omssa/omssacl -d /fdb/blastdb/nr -f file1.dta -ox file1.xml \
   ; perl /usr/local/omssa/readOMSSA.pl file1.xml >file1.out
cd /data/user/mydir ; /usr/local/omssa/omssacl -d /fdb/blastdb/nr -f file2.dta -ox file2.xml \
   ; perl /usr/local/omssa/readOMSSA.pl file2.xml >file2.out
cd /data/user/mydir ; /usr/local/omssa/omssacl -d /fdb/blastdb/nr -f file3.dta -ox file3.xml \
   ; perl /usr/local/omssa/readOMSSA.pl file3.xml >file3.out
cd /data/user/mydir ; /usr/local/omssa/omssacl -d /fdb/blastdb/nr -f file4.dta -ox file4.xml \
   ; perl /usr/local/omssa/readOMSSA.pl file4.xml >file4.out
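
The search-and-parse lines above follow a fixed pattern, so they can also be produced by a small helper. This is a sketch only: search_and_parse_line, OMSSA_DIR and DB are hypothetical names, and the helper joins the steps with `&&` instead of `;` so that the parser runs only if the search exits successfully. That variation assumes omssacl returns a non-zero exit status on failure, which you should verify for your installation.

```shell
#!/bin/sh
# Sketch: print one combined "search, then parse" swarm command line.
OMSSA_DIR="${OMSSA_DIR:-/usr/local/omssa}"
DB="${DB:-/fdb/blastdb/nr}"

# search_and_parse_line WORKDIR BASENAME   (hypothetical helper)
# BASENAME is the input file name without its .dta suffix.
search_and_parse_line() {
    local workdir="$1" base="$2"
    # cd into the working directory, run the search, then parse the XML output
    printf 'cd %s && %s/omssacl -d %s -f %s.dta -ox %s.xml && perl %s/readOMSSA.pl %s.xml > %s.out\n' \
        "$workdir" "$OMSSA_DIR" "$DB" "$base" "$base" "$OMSSA_DIR" "$base" "$base"
}
```

Calling `search_and_parse_line /data/user/mydir file1 >> sample.com` once per input file builds a command file with one complete command per line.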

Available databases

OMSSA searches Blast-format sequence databases. A large collection of Blast protein databases is available and kept up to date on the Biowulf cluster in /fdb/blastdb/.
See: Names, location and status of Blast databases. (OMSSA will search only protein databases.)
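
Before submitting a large swarm, it can be worth checking that the database prefix you plan to pass to -d actually refers to a protein database. Blast protein databases are stored as files with suffixes beginning with .p (for example .phr, .pin, .psq, or a .pal alias file), while nucleotide databases use suffixes beginning with .n. The sketch below relies on that naming convention only; blast_protein_db_exists is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: test whether PREFIX looks like a Blast protein database.

# blast_protein_db_exists PREFIX   (hypothetical helper)
# Succeeds if at least one file named PREFIX.p* exists.
blast_protein_db_exists() {
    local prefix="$1" f
    for f in "$prefix".p*; do
        [ -e "$f" ] && return 0     # found a protein database file
    done
    return 1                        # nothing matched; not a protein database
}
```

For example, `blast_protein_db_exists /fdb/blastdb/nr || echo "no protein database with that prefix"`. Note that the prefix is given without the .p* suffix, exactly as it is passed to -d.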

Bundling jobs

If you have over 1000 OMSSA searches to run, they should be bundled with the '-b' flag to swarm, such that there are no more than a few hundred jobs. The 'bundle number' is calculated by:

  bundle number = no. of commands / (2* no. of jobs)

Thus, if you have 5000 OMSSA searches and want them packaged into 100 jobs total, the bundle number is 5000/200 = 25. You would submit these jobs with the command:

swarm -b 25 -f sample.com
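
The bundle number formula above can be evaluated with shell arithmetic. This is a small sketch; bundle_number is a hypothetical helper, and rounding up is a choice made here so that the resulting number of jobs does not exceed the number you asked for.

```shell
#!/bin/sh
# Sketch: bundle number = no. of commands / (2 * no. of jobs), rounded up.

# bundle_number COMMANDS JOBS   (hypothetical helper)
bundle_number() {
    local commands="$1" jobs="$2"
    # integer ceiling division by (2 * jobs)
    echo $(( (commands + 2 * jobs - 1) / (2 * jobs) ))
}
```

With the figures from the text, `bundle_number 5000 100` evaluates 5000/(2*100) and prints 25. The count of commands can be taken from the command file itself, for example with `wc -l < sample.com`, provided each command occupies a single line.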

Monitoring your jobs

As always, jobs can be monitored using the Biowulf cluster monitors. Click on 'List status of running jobs only', and then on your username or job number on the resulting page to view only your own jobs.

OMSSA Options

v2.4 (Oct 2008)

USAGE
  omssacl [-h] [-help] [-xmlhelp] [-pm param] [-d blastdb] [-umm] [-f infile]
    [-fx xmlinfile] [-fb dtainfile] [-fp pklinfile] [-fm pklinfile]
    [-foms omsinfile] [-fomx omxinfile] [-fbz2 bz2infile] [-fxml omxinfile]
    [-o textasnoutfile] [-ob binaryasnoutfile] [-ox xmloutfile]
    [-obz2 bz2outfile] [-op pepxmloutfile] [-oc csvfile] [-w] [-to pretol]
    [-te protol] [-tom promass] [-tem premass] [-tez prozdep] [-ta autotol]
    [-tex exact] [-i ions] [-cl cutlo] [-ch cuthi] [-ci cutinc]
    [-cp precursorcull] [-v cleave] [-x taxid] [-w1 window1] [-w2 window2]
    [-h1 hit1] [-h2 hit2] [-hl hitlist] [-ht tophitnum] [-hm minhit]
    [-hs minspectra] [-he evalcut] [-mf fixedmod] [-mv variablemod] [-mnm]
    [-mm maxmod] [-e enzyme] [-zh maxcharge] [-zl mincharge]
    [-zoh maxprodcharge] [-zt chargethresh] [-z1 plusone] [-zc calcplusone]
    [-zcc calccharge] [-pc pseudocount] [-sb1 searchb1] [-sct searchcterm]
    [-sp productnum] [-scorr corrscore] [-scorp corrprob] [-no minno]
    [-nox maxno] [-is subsetthresh] [-ir replacethresh] [-ii iterativethresh]
    [-p prolineruleions] [-il] [-el] [-ml] [-mx modinputfile]
    [-mux usermodinputfile] [-nt numthreads] [-ni] [-ns] [-os]
    [-logfile File_Name] [-conffile File_Name] [-version] [-version-full]
    [-dryrun]

DESCRIPTION
   Search engine for identifying MS/MS peptide spectra

OPTIONAL ARGUMENTS
 -h
   Print USAGE and DESCRIPTION;  ignore other arguments
 -help
   Print USAGE, DESCRIPTION and ARGUMENTS description;  ignore other arguments
 -xmlhelp
   Print USAGE, DESCRIPTION and ARGUMENTS description in XML format;  ignore
   other arguments
 -pm 
   search parameter input in xml format (overrides command line)
   Default = `'
 -d 
   Blast sequence library to search. Do not include .p* filename suffixes.
   Default = `nr'
 -umm
   use memory mapped sequence libraries
 -f 
   single dta file to search
   Default = `'
 -fx 
   multiple xml-encapsulated dta files to search
   Default = `'
 -fb 
   multiple dta files separated by blank lines to search
   Default = `'
 -fp 
   pkl formatted file
   Default = `'
 -fm 
   mgf formatted file
   Default = `'
 -foms 
   omssa oms file
   Default = `'
 -fomx 
   omssa omx file
   Default = `'
 -fbz2 
   omssa omx file compressed by bzip2
   Default = `'
 -fxml 
   omssa xml search request file
   Default = `'
 -o 
   filename for text asn.1 formatted search results
   Default = `'
 -ob 
   filename for binary asn.1 formatted search results
   Default = `'
 -ox 
   filename for xml formatted search results
   Default = `'
 -obz2 
   filename for bzip2 compressed xml formatted search results
   Default = `'
 -op 
   filename for pepXML formatted search results
   Default = `'
 -oc 
   filename for csv formatted search summary
   Default = `'
 -w
   include spectra and search params in search results
 -to 
   product ion m/z tolerance in Da
   Default = `0.8'
 -te 
   precursor ion m/z tolerance in Da
   Default = `2.0'
 -tom 
   product ion search type (0 = mono, 1 = avg, 2 = N15, 3 = exact)
   Default = `0'
 -tem 
   precursor ion search type (0 = mono, 1 = avg, 2 = N15, 3 = exact)
   Default = `0'
 -tez 
   charge dependency of precursor mass tolerance (0 = none, 1 = linear)
   Default = `0'
 -ta 
   automatic mass tolerance adjustment fraction
   Default = `1.0'
 -tex 
   threshold in Da above which the mass of neutron should be added in exact
   mass search
   Default = `1446.94'
 -i 
   id numbers of ions to search (comma delimited, no spaces)
   Default = `1,4'
 -cl 
   low intensity cutoff as a fraction of max peak
   Default = `0.0'
 -ch 
   high intensity cutoff as a fraction of max peak
   Default = `0.2'
 -ci 
   intensity cutoff increment as a fraction of max peak
   Default = `0.0005'
 -cp 
   eliminate charge reduced precursors in spectra (0=no, 1=yes)
   Default = `0'
 -v 
   number of missed cleavages allowed
   Default = `1'
 -x 
   comma delimited list of taxids to search (0 = all)
   Default = `0'
 -w1 
   single charge window in Da
   Default = `20'
 -w2 
   double charge window in Da
   Default = `14'
 -h1 
   number of peaks allowed in single charge window
   Default = `2'
 -h2 
   number of peaks allowed in double charge window
   Default = `2'
 -hl 
   maximum number of hits retained per precursor charge state per spectrum
   Default = `30'
 -ht 
   number of m/z values corresponding to the most intense peaks that must
   include one match to the theoretical peptide
   Default = `6'
 -hm 
   the minimum number of m/z matches a sequence library peptide must have for
   the hit to the peptide to be recorded
   Default = `2'
 -hs 
   the minimum number of m/z values a spectrum must have to be searched
   Default = `4'
 -he 
   the maximum evalue allowed in the hit list
   Default = `1'
 -mf 
   comma delimited (no spaces) list of id numbers for fixed modifications
   Default = `'
 -mv 
   comma delimited (no spaces) list of id numbers for variable modifications
   Default = `'
 -mnm
   n-term methionine should not be cleaved
 -mm 
   the maximum number of mass ladders to generate per database peptide
   Default = `128'
 -e 
   id number of enzyme to use
   Default = `0'
 -zh 
   maximum precursor charge to search when not 1+
   Default = `3'
 -zl 
   minimum precursor charge to search when not 1+
   Default = `1'
 -zoh 
   maximum product charge to search
   Default = `2'
 -zt 
   minimum precursor charge to start considering multiply charged products
   Default = `3'
 -z1 
   fraction of peaks below precursor used to determine if spectrum is charge 1
   Default = `0.95'
 -zc 
   should charge plus one be determined algorithmically? (1=yes)
   Default = `1'
 -zcc 
   how should precursor charges be determined? (1=believe the input file,
   2=use a range)
   Default = `2'
 -pc 
   minimum number of precursors that match a spectrum
   Default = `1'
 -sb1 
   should first forward (b1) product ions be in search (1=no)
   Default = `1'
 -sct 
   should c terminus ions be searched (1=no)
   Default = `0'
 -sp 
   max number of ions in each series being searched (0=all)
   Default = `100'
 -scorr 
   turn off correlation correction to score (1=off, 0=use correlation)
   Default = `0'
 -scorp 
   probability of consecutive ion (used in correlation correction)
   Default = `0.5'
 -no 
   minimum size of peptides for no-enzyme and semi-tryptic searches
   Default = `4'
 -nox 
   maximum size of peptides for no-enzyme and semi-tryptic searches (0=none)
   Default = `40'
 -is 
   evalue threshold to include a sequence in the iterative search, 0 = all
   Default = `0.0'
 -ir 
   evalue threshold to replace a hit, 0 = only if better
   Default = `0.0'
 -ii 
   evalue threshold to iteratively search a spectrum again, 0 = always
   Default = `0.01'
 -p 
   id numbers of ion series to apply no product ions at proline rule at (comma
   delimited, no spaces)
   Default = `'
 -il
   print a list of ions and their corresponding id number
 -el
   print a list of enzymes and their corresponding id number
 -ml
   print a list of modifications and their corresponding id number
 -mx 
   file containing modification data
   Default = `mods.xml'
 -mux 
   file containing user modification data
   Default = `usermods.xml'
 -nt 
   number of search threads to use, 0=autodetect
   Default = `0'
 -ni
   don't print informational messages
 -ns
   depreciated flag
 -os
   use omssa 1.0 scoring
 -logfile 
   File to which the program log should be redirected
 -conffile 
   Program's configuration (registry) data file
 -version
   Print version number;  ignore other arguments
 -version-full
   Print extended version data;  ignore other arguments
 -dryrun
   Dry run the application: do nothing, only test all preconditions
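
As an illustration of how the options listed above combine, the sketch below assembles a search of an MGF file with CSV summary output. Only flags from the listing are used (-d, -fm, -oc, -te, -to, -v, -nt), and the tolerance and missed-cleavage values are simply the documented defaults written out explicitly. The helper mgf_search_cmd and the variable OMSSACL are hypothetical; the helper only prints the command so that it can be reviewed or appended to a swarm command file.

```shell
#!/bin/sh
# Sketch: print an omssacl command line for one MGF file.
OMSSACL="${OMSSACL:-/usr/local/omssa/omssacl}"

# mgf_search_cmd MGF_FILE [THREADS]   (hypothetical helper; prints, does not run)
# THREADS is passed to -nt; 0 (the default) lets OMSSA autodetect.
mgf_search_cmd() {
    local mgf="$1" threads="${2:-0}" base
    base="${mgf%.mgf}"              # name the CSV summary after the input file
    printf '%s -d /fdb/blastdb/nr -fm %s -oc %s.csv -te 2.0 -to 0.8 -v 1 -nt %s\n' \
        "$OMSSACL" "$mgf" "$base" "$threads"
}
```

For example, `mgf_search_cmd run1.mgf 2` prints a command that limits the search to two threads. Enzyme and modification id numbers are not set here; the -el and -ml options print the ids available in your installation.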