Prospector General Instructions

Instructions for General Features
Common to Multiple ProteinProspector Programs

Purpose
This document provides instructions for features found across more than one program in the ProteinProspector package.

Instructions for features particular to an individual ProteinProspector Program

Contents of this document:

Search Times
Stopping / Cancelling a Search
Saving Hits from one ProteinProspector program, searching them with another
Databases
Species Filtering
Intact Protein MW Filtering
Intact Protein pI Filtering
Enzyme specificity / Missed cleavages
Frame Translation in DNA databases
General features of links from program output
Link from the accession number in program output to an annotated remote database entry
Link from the MS-Digest index number in program output to MS-Digest
Link from the peptide sequence in program output to MS-Product
Link from the elemental composition in program output to MS-Isotope
Modified N or C Terminal Groups
Modified Cysteine Residues
Modifying Amino Acids
User Specified Amino Acid
Mass (m/z)
Mass type
Charge (z)
Sample ID (comment)
Max. Reported Hits
AA Composition Ions
Absent Amino Acids
Modified Amino Acids Possibly Present
Instrument

Search Times

Search times vary from a few seconds to a few minutes depending on the computer hardware ProteinProspector is running on, the size of the database being searched, the restrictiveness of the search parameters and the number of searches being simultaneously performed. runs on a 266 MHz Pentium II machine running Windows NT 4.0. runs on a Sun Microsystems Sparc 10 running SunOS 4.1.3. All searches taking longer than 15 minutes are automatically terminated. When two or more searches are being performed simultaneously the searches slow noticeably. In general faster searches result with more discriminating search parameters: single species, narrow intact protein MW range, 0 missed cleavages. For DNA database searches set the intact protein MW filter to All.

Searches will run as much as 2X faster if you are NOT using a web browser on the computer performing the search, but are instead communicating with the search server over a network. As much as 50% of the CPU time can be spent merely keeping the stars shooting in the browser window while the search is being performed.

Stopping / Cancelling a Search

Prior to ProteinProspector 3.1 a submitted search request would run to completion even if the user clicked stop on his/her web browser. The end result of this was that if a user clicked stop, changed a parameter and resubmitted the search request each additional search became progressively slower because the server was running multiple searches.

Starting with ProteinProspector 3.1 some of the programs display messages such as:

Press stop on your browser if you wish to abort this MS-Fit search prematurely.

If you see such a message the search can be stopped and resubmitted without clogging up the server.

Saving Hits from one ProteinProspector program, searching them with another

Starting with ProteinProspector v 3.0 one ProteinProspector search program can serve as a pre-filter for another search program. To accomplish this the Hits (index numbers for matching database entries) from the first program are saved to a user specified file. This file is then retrieved by the second program, and only those matching database entries are searched by the second program. Since this operation requires disk space for the saved files, the Internet versions of ProteinProspector limit the user specified file to 3 possible filenames: lastres1, lastres2, lastres3. For ProteinProspector licensees there is no such limit.

The following programs can both save hits and search saved hits:

MS-Fit
MS-Tag
MS-Seq
MS-Edman

Neither MS-Tag nor MS-Seq can serve as a pre-filter for MS-Fit.

If MS-Tag is used in the Unknome mode as a pre-filter for MS-Edman then a list of possible peptide sequences is saved. MS-Edman should be used with Search Mode set to List of Sequences.

Databases

ProteinProspector programs search sequence databases which are located locally on the server running the programs. The actual files searched are FASTA formatted copies of the source database which contain minimal annotation. Search output typically contains a web-link into a fully annotated version of the source database for each entry matched.

ProteinProspector programs currently allow searching of the publicly available Genome and Proteome databases listed below. However, nearly any sequence database in a suitable FASTA format can be set up for use by contacting the administrator of a ProteinProspector server.

Protein Databases

NCBInr: Current README file
A non-redundant database compiled by NCBI by combining most of the public domain databases (EST's not included).
Genpept: Current Release Notes
Protein translation of Genbank (EST's not included).
Swiss Prot
A curated protein sequence database which strives to provide a high level of annotation (such as the description of the function of a protein, its domain structure, post-translational modifications, variants, etc.), a minimal level of redundancy and a high level of integration with other databases.
Owl
OWL is a non-redundant composite of 4 publicly available primary sources: SWISS-PROT, PIR (1-3), GenBank (translation) and NRL-3D. SWISS-PROT is the highest priority source, all others being compared against it to eliminate identical and trivially different sequences.
Unknome
Karl Clauser's attempt at a "clever" name for a theoretical database used in de novo MS/MS spectral interpretation that is created on-the-fly and contains all AA sequence permutations consistent with the parent mass and AA composition information contained in an MS/MS spectrum.

DNA Databases

dbEST
A division of GenBank that contains sequence data and other information on "single-pass" cDNA sequences, or Expressed Sequence Tags, from a number of organisms.

Reasons to search particular databases:

NCBInr
Largest protein database and updated most frequently.
Swiss Prot
Smallest and best annotated.
dbEST
Use when there are no matches in the protein databases, which suggests that the gene for your protein may not yet be cloned. An EST that contains part of your protein may nevertheless be known. Search times will typically be longer because of multi-frame translation combined with the fact that the dbEST file is > 3x larger than the NCBInr file.

Reasons NOT to search particular databases:

Owl
Least frequently updated, and least consistent species nomenclature.

The local copy of the database being searched with the programs is subject to updating by the administrator of a ProteinProspector server.

Species Filtering

If you don't know the Latin taxonomic name for the species you're interested in try: NCBI Taxonomy Browser

Species limited searches in ProteinProspector programs are performed by means of preliminary filtering of a database according to the user designated species or collection of species. This species pre-filter is bypassed when the species is designated as All.

This species pre-filtering is imperfect because of the poor usage of taxonomy (standard species naming conventions) in the databases, AND the poorly standardized location of this information in the FASTA database formats used by ProteinProspector programs.

Users who desire additional/changed species filtering capability should direct their local ProteinProspector Server administrator to the instructions To Add/Change Species Filter. For the World Wide Web version of ProteinProspector please send email to: .

Species pre-filtering is implemented in ProteinProspector programs by correlating the user selected species name in the HTML form with the variety of pseudonyms for a particular species in the databases through behind the scenes access to a species alias list for all the databases used. This alias list is located on each ProteinProspector server in the directory.

Below is a list of the variety of pseudonyms for Mouse in each database.

NCBInr: MOUSE; MUS MUSCULUS; MUS SP.
dbEST: M. MUSCULUS; M.MUSCULUS; MOUSE; MUS DOMESTICUS; MUS MUSCULUS
Genpept: MUS MUSCULUS
Owl: MOUSE; MUS MUSCULUS; MUS MUSCULUS (MOUSE)
SwissProt: MOUSE

Server administrators can edit this alias list without requiring access to ProteinProspector source code. Although this mechanism of pseudonym correlation is a hassle, it also allows significant flexibility. For example, an alias can be created that includes a collection of species, e.g. mammals, eukaryotes or prokaryotes. Server administrators who create such alias collections are encouraged to submit the modified parameter files for inclusion in subsequent ProteinProspector releases.
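The alias mechanism can be pictured as a lookup table from the species name chosen on the form to the set of pseudonyms found in the databases. The following Python sketch is only an illustration: the table layout and the function name are hypothetical and do not reflect the actual alias file format. The pseudonyms are the Mouse names listed above.

```python
# Hypothetical in-memory alias table; the real alias file format is not shown here.
SPECIES_ALIASES = {
    "MOUSE": {
        "MOUSE",
        "MUS MUSCULUS",
        "MUS SP.",
        "M. MUSCULUS",
        "M.MUSCULUS",
        "MUS DOMESTICUS",
        "MUS MUSCULUS (MOUSE)",
    },
}


def matches_species(entry_species, selected):
    """Return True if a database entry's species string passes the pre-filter."""
    # Selecting All bypasses the species pre-filter entirely.
    if selected.upper() == "ALL":
        return True
    aliases = SPECIES_ALIASES.get(selected.upper(), set())
    return entry_species.strip().upper() in aliases


print(matches_species("Mus musculus", "Mouse"))
print(matches_species("Homo sapiens", "Mouse"))
```

An alias for a collection of species (for example mammals) would simply hold the union of the pseudonym sets of its members.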

Intact Protein MW Filtering

Intact protein MW limited searches in ProteinProspector programs are performed by means of preliminary filtering of a database according to the user designated intact protein MW. This pre-filter is bypassed when the MW range checkbox All is checked.

The intact protein MW pre-filtering is imperfect because sequences in protein databases often exist in pre, pro, and fragment forms. Sequences in DNA databases often exist as fragments (EST's) or as cDNA's.

ProteinProspector programs ALWAYS calculate the intact protein MW, according to the following constraints.

Treat protein as uncharged.
Use average mass scale.
Treat amino acid C as unmodified.
Treat amino acid X as leucine.
Treat amino acid B as glutamic acid.
Treat amino acid Z as glutamine.
Ignore amino acids J, O, U.

Entries in DNA databases are subject to the following additional constraints:

Translate in frame 1.
Ignore stop codons.
If translation of nucleotide N results in a codon that does not uniquely encode an amino acid, call it amino acid X.
Ignore all nucleotides other than A, G, T, C, and N.
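The protein-level constraints above can be sketched in Python as follows. This is a minimal illustration, not ProteinProspector's code: the function names are hypothetical, and the average residue masses are standard textbook values rather than values read from a ProteinProspector parameter file.

```python
# Standard average residue masses in Da (assumed values, not taken from ProteinProspector).
AVERAGE_RESIDUE_MASS = {
    "G": 57.0513, "A": 71.0779, "S": 87.0773, "P": 97.1152, "V": 99.1311,
    "T": 101.1039, "C": 103.1429, "L": 113.1576, "I": 113.1576, "N": 114.1026,
    "D": 115.0874, "Q": 128.1292, "K": 128.1723, "E": 129.1140, "M": 131.1961,
    "H": 137.1393, "F": 147.1739, "R": 156.1857, "Y": 163.1733, "W": 186.2099,
}
WATER = 18.0153  # average mass of H2O, added once for the two termini

SUBSTITUTIONS = {"X": "L", "B": "E", "Z": "Q"}  # X as leucine, B as glutamic acid, Z as glutamine
IGNORED = set("JOU")  # these letters contribute no mass


def intact_protein_mw(sequence):
    """Average MW of an uncharged protein with unmodified cysteines."""
    total = WATER
    for aa in sequence.upper():
        if aa in IGNORED:
            continue
        total += AVERAGE_RESIDUE_MASS[SUBSTITUTIONS.get(aa, aa)]
    return total


def passes_mw_filter(sequence, low, high):
    """Pre-filter: keep the entry only if its MW lies in the requested range."""
    return low <= intact_protein_mw(sequence) <= high


print(round(intact_protein_mw("GG"), 4))
```

For a DNA database entry the same function would be applied to the frame 1 translation of the entry.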

Intact Protein pI Filtering

Intact protein pI limited searches in ProteinProspector programs are performed by means of preliminary filtering of a database according to the user designated intact protein pI. This pre-filter is bypassed when the pI range checkbox All is checked.

The intact protein pI pre-filtering is imperfect because sequences in protein databases often exist in pre, pro, and fragment forms. Sequences in DNA databases often exist as fragments (EST's) or as cDNA's.

ProteinProspector programs ALWAYS calculate the intact protein pI, according to the following constraints.

Treat amino acid C as unmodified.
Treat amino acid X as leucine.
Treat amino acid B as glutamic acid.
Treat amino acid Z as glutamine.
Ignore amino acids J, O, U.

Entries in DNA databases are subject to the following additional constraints:

Translate in frame 1.
Ignore stop codons.
If translation of nucleotide N results in a codon that does not uniquely encode an amino acid, call it amino acid X.
Ignore all nucleotides other than A, G, T, C, and N.

The pK values used to calculate the pI values can be modified by ProteinProspector server administrators. You must remake the database index files using FA-Index if you change the pK values.
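As a rough illustration of how a pI follows from a set of pK values, the Python sketch below finds the pH at which the net charge of a sequence is zero by bisection. The pK values, the charge model and the function names are assumptions made for this example only; the values actually used are whatever the server administrator has configured.

```python
# Illustrative pK values only; NOT the values shipped with ProteinProspector.
PK_POSITIVE = {"N_TERM": 8.6, "K": 10.8, "R": 12.5, "H": 6.5}
PK_NEGATIVE = {"C_TERM": 3.6, "D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}

SUBSTITUTIONS = {"X": "L", "B": "E", "Z": "Q"}  # same substitutions as in the list above
IGNORED = set("JOU")


def net_charge(sequence, ph):
    """Net charge of the protein at a given pH (Henderson-Hasselbalch model)."""
    residues = [SUBSTITUTIONS.get(aa, aa) for aa in sequence.upper() if aa not in IGNORED]
    charge = 0.0
    for group in ["N_TERM", "C_TERM"] + residues:
        if group in PK_POSITIVE:
            charge += 1.0 / (1.0 + 10 ** (ph - PK_POSITIVE[group]))
        elif group in PK_NEGATIVE:
            charge -= 1.0 / (1.0 + 10 ** (PK_NEGATIVE[group] - ph))
    return charge


def isoelectric_point(sequence, tolerance=1e-4):
    """Bisect between pH 0 and 14 for the pH where the net charge crosses zero."""
    low, high = 0.0, 14.0
    while high - low > tolerance:
        middle = (low + high) / 2.0
        if net_charge(sequence, middle) > 0:
            low = middle  # still positively charged, so the pI is higher
        else:
            high = middle
    return (low + high) / 2.0


print(round(isoelectric_point("ACDEK"), 2))
```

Because the result depends directly on the pK table, any change to the pK values changes every stored pI, which is consistent with the requirement to rebuild the index files.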

Frame Translation in DNA databases

DNA databases can NOT be searched with mass spectrometry data from DNA samples. ProteinProspector programs perform translation of DNA sequences to protein sequences.

Frames 1, 2, and 3, represent translation of the database sequence from left to right beginning in positions 1, 2, or 3 respectively. Frames 4, 5, 6 represent translation of the complement of the database sequence from right to left beginning in positions 1, 2, or 3 respectively.

Frame translation in ProteinProspector programs can be designated in 1, -1, 3, -3 or 6 frame translation modes. Frame mode 1 considers only frame 1 described above whereas frame mode -1 considers only frame 4. Frame mode 3 considers only frames 1, 2 and 3 whereas frame mode -3 considers only frames 4, 5 and 6. Frame mode 6 considers all 6 frames. A user should select frame mode 6 unless he/she knows that the database being searched contains sequences exclusively cloned in one direction or contains known genes with sequences already in frame.

Since the capability of searching DNA databases was intended to use EST databases, translation initiation does not require a start codon. If a stop codon is encountered the polypeptide is terminated. Translation is then reinitialized and continued with the following codon, thus beginning a new open-reading frame. MS-Fit requires all matches to a particular database entry to belong not only to the same translational frame, but also to the same open reading frame. Users who feel any of these procedures are inappropriate or inadequate, are urged to contact . Implementation of these procedures was done with significant uncertainty as to optimal strategy.
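The frame numbering, the frame modes and the splitting at stop codons described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the ProteinProspector implementation: the function names are hypothetical, the standard genetic code is assumed, and the input is assumed to contain only A, C, G, T and N.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Frame mode selected on the form -> frames translated.
FRAME_MODES = {1: [1], -1: [4], 3: [1, 2, 3], -3: [4, 5, 6], 6: [1, 2, 3, 4, 5, 6]}


def reverse_complement(dna):
    return dna.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


def translate_codon(codon):
    """Translate one codon; an N that leaves the amino acid ambiguous gives X."""
    choices = [BASES if base == "N" else base for base in codon]
    amino_acids = {CODON_TABLE["".join(c)] for c in product(*choices)}
    return amino_acids.pop() if len(amino_acids) == 1 else "X"


def open_reading_frames(dna, frame):
    """Translate one frame (1-6) and split the result at stop codons."""
    seq = dna.upper()
    if frame > 3:  # frames 4-6 read the complementary strand right to left
        seq = reverse_complement(seq)
        offset = frame - 4
    else:
        offset = frame - 1
    protein = "".join(translate_codon(seq[i:i + 3]) for i in range(offset, len(seq) - 2, 3))
    # No start codon is required; each stop codon ends one ORF and the next codon starts another.
    return [orf for orf in protein.split("*") if orf]


def translate(dna, frame_mode=6):
    return {frame: open_reading_frames(dna, frame) for frame in FRAME_MODES[frame_mode]}


print(translate("ATGGCCTAAGGN", 1))
```

Keeping the open reading frames of each frame in separate lists makes it straightforward to require, as MS-Fit does, that all matches to an entry come from the same open reading frame.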

Enzyme specificity / Missed cleavages

The termini of the matched peptides can be set to be consistent with the cleavage specificity of the enzyme used to generate the peptide. By selecting No enzyme (not available in MS-Fit or MS-Digest) the matched peptides have no constraint on their termini. Increasing the maximum number of missed cleavages allowed enables matching to sequences with uncleaved sites internal to the peptide.

The option for the non-existent enzyme Slymotrypsin was created as a means for allowing Chymotryptic cleavages in Trypsin digests. When using this choice it is important to increase the missed cleavages allowed. Increasing to 9 will result in only a marginal increase in the search time. It is possible to combine the rules for two or more enzymes by adding options to the Enzyme item on the HTML form. For example adding the option:
<OPTION> CNBr/Trypsin/Asp-N
would combine the cleavage rules for CNBr, Trypsin and Asp-N.

It is possible to mix N-terminal cleavage rules with C-terminal ones in this way.

ProteinProspector server administrators can edit the existing enzyme cleavage rules or add new ones.
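The effect of the enzyme choice, of combined enzymes and of the missed cleavages setting can be illustrated with a small Python sketch. The rule table below is an assumption based on the commonly quoted specificities (Trypsin after K or R but not before P, CNBr after M, Asp-N before D); it is not copied from the ProteinProspector enzyme file, and the function names are hypothetical.

```python
# name -> (residues recognised, side of the residue that is cut, residues that block the cut)
# Commonly quoted specificities, assumed for illustration only.
RULES = {
    "Trypsin": ("KR", "C", "P"),
    "CNBr": ("M", "C", ""),
    "Asp-N": ("D", "N", ""),
}


def cleavage_sites(sequence, enzyme):
    """Positions at which the sequence is cut; enzyme may be e.g. 'CNBr/Trypsin/Asp-N'."""
    sites = set()
    for name in enzyme.split("/"):
        residues, side, blocked = RULES[name]
        for i, aa in enumerate(sequence):
            if aa not in residues:
                continue
            if side == "C":  # cut after the residue unless the next residue blocks it
                if i + 1 < len(sequence) and sequence[i + 1] not in blocked:
                    sites.add(i + 1)
            elif i > 0:  # N-terminal rule: cut before the residue
                sites.add(i)
    return sorted(sites)


def digest(sequence, enzyme, max_missed=0):
    """All peptides containing at most max_missed internal uncleaved sites."""
    bounds = [0] + cleavage_sites(sequence, enzyme) + [len(sequence)]
    peptides = []
    for start in range(len(bounds) - 1):
        last = min(start + 2 + max_missed, len(bounds))
        for end in range(start + 1, last):
            peptides.append(sequence[bounds[start]:bounds[end]])
    return peptides


print(digest("AKRPGKM", "Trypsin", 0))
print(digest("AKRPGKM", "Trypsin", 1))
print(digest("AMDK", "CNBr/Asp-N", 0))
```

Combining enzymes only merges their cut positions, which is why N-terminal and C-terminal rules can be mixed freely in this sketch.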

General features of HTML links in program output

The links in program output are intended to give users easy access to obvious sources of additional information about the proteins or peptides matched or under study. Some of the default parameters of these links can be changed by ProteinProspector server administrators. Without access to source code, releases of ProteinProspector v. 2.0 and earlier only allow server administrators to:
change the default parameters in the HTML links from the accession number

ProteinProspector v. 3.0 and later also allow server administrators to:
change the default parameters in the HTML links from the MS-Digest index number
change the default parameters in the HTML links from the peptide sequence
change the default parameters in the HTML links from the elemental composition

Link from the accession number in program output to an annotated remote database entry

The database accession number in the search results has an HTML link to retrieve the complete entry including comments from a remote database. In order for this link to be created the programs need to know the URL for the remote database. Users who desire links to different fully annotated databases, or who find links to a particular database to be defective should contact their local ProteinProspector server administrator. For the World Wide Web version of ProteinProspector please send email to: .

Server administrators can change the default address of links from accession numbers in program output without requiring access to ProteinProspector source code. Those administrators who find improved options for links to publicly available databases are encouraged to submit the modified parameter files for inclusion in subsequent ProteinProspector releases.

Link from the MS-Digest index number in program output to MS-Digest

The MS-Digest index number in the search results has an HTML link to retrieve a listing of all the masses and sequences of peptides that can be produced by digesting the matched protein with the designated enzyme. If No enzyme was designated in the search parameters, then Trypsin is supplied in this HTML link. The number of missed cleavages is set to 2 unless a higher number was designated in the search parameters.

Without access to source code, releases of ProteinProspector v. 2.0 and earlier do not allow server administrators to change the default parameters associated with this HTML link.

In ProteinProspector v. 3.0 and later server administrators can change the HTML link from the MS-Digest index number in the search results.

If the MS-Digest number link marked Coverage Map in the MS-Fit detailed results is pressed then the protein display at the top of the MS-Digest report has the matching peptides highlighted.

Link from the peptide sequence in program output to MS-Product

The peptide sequence in the search results has an HTML link to MS-Product for retrieving a listing of the theoretical fragment-ions that may be formed in an MS/MS experiment. The default set of ion types supplied in this link corresponds to those expected to be formed in post-source decay (PSD) experiments.

Without access to source code, releases of ProteinProspector v. 2.0 and earlier do not allow server administrators to change the default parameters associated with this HTML link.

In ProteinProspector v. 3.0 and later server administrators can customize the HTML link from the peptide sequence in the search results.

Link from the elemental composition in program output to MS-Isotope

The elemental composition in the search results has an HTML link to MS-Isotope for retrieving a listing and visualization of the isotopic distribution corresponding to the composition.

In ProteinProspector v. 3.0 and later server administrators can customize the HTML link from the elemental composition in the search results.

Modified N or C Terminal Groups

Most ProteinProspector programs allow the peptide terminal groups to be modified from the defaults of hydrogen at the N terminus and free acid at the C terminus.

If the N-terminal group chosen is PTC then any lysines in the peptide are also modified.

Users who desire additional options for terminal groups should contact their local ProteinProspector server administrator. For the World Wide Web version of ProteinProspector please send email to: .

Server administrators can add terminal groups without requiring access to ProteinProspector source code. Those administrators who add terminal groups are encouraged to submit the modified parameter files for inclusion in subsequent ProteinProspector releases.

Modified Cysteine Residues

ProteinProspector programs handle the amino acid cysteine differently from every other amino acid. For each execution of a program, all cysteines in a database are treated as though they are modified in the user designated way. In general, more than one method of modification (mixing) canNOT be designated for a single search. The one exception is in the MS-Fit and MS-Digest programs, where Acrylamide Modified Cys can be considered in addition to the selected cysteine modification (Modifying Amino Acids).

Users who desire additional options for modification of cysteine residues should contact their local ProteinProspector server administrator. For the World Wide Web version of ProteinProspector please send email to: .

Server administrators can add cysteine modification options without requiring access to ProteinProspector source code. Those administrators who add cysteine modification options are encouraged to submit the modified parameter files for inclusion in subsequent ProteinProspector releases.

Modifying Amino Acids

User Specified Amino Acid

Some ProteinProspector programs allow the use of a user specified amino acid for which you must supply the elemental composition. To specify the user defined amino acid in a peptide or protein sequence use the letter u (lower case). The default elemental composition for the user defined amino acid is that of glycine.

Mass (m/z)

ProteinProspector programs expect the mass input values to represent the actual m/z values measured on a mass spectrometer. Thus the mass of the protons (H+; other charging agents are not currently allowed) need not be subtracted. However, input data that has had the mass of the protons subtracted can be used; simply designate the charge as 0.

Mass type

Monoisotopic: only the lightest isotope of each element, which is also the most abundant one for these elements, is used in the mass calculations: 12C, 1H, 14N, 16O, 32S, 31P.

Average: all isotopes of each element are used, weighted by their "normal" abundance in the biosphere. The isotope abundances can be changed by editing the elements.txt file.

Par(mi)Frag(av): Parent masses are calculated as monoisotopic and fragment masses are calculated as average. Note: for the purposes of searching, fragment masses are multiplied by a fudge factor and all calculations are done as monoisotopic. However, for the purposes of displaying search results, fragment mass errors are calculated as average mass (without using the fudge factor). This approach should be reasonable as the Par(mi)Frag(av) option should usually be chosen when the mass accuracy on fragment mass measurements is modest ( +/- 1000 ppm ), and the error in the fudge factor is negligible compared to the fragment mass accuracy.

Par(av)Frag(mi): Parent masses are calculated as average and fragment masses are calculated as monoisotopic. Note: for the purposes of searching, the parent mass is multiplied by a fudge factor and all calculations are done as monoisotopic. However for the purposes of displaying search results the parent mass error is calculated as the average mass (without using the fudge factor). This approach should be reasonable as the Par(av)Frag(mi) option should usually be chosen when the mass accuracy of the parent mass measurement is modest ( +/- 1000 ppm ), and the error in the fudge factor is negligible compared to the parent mass accuracy.
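The difference between the monoisotopic and average mass scales can be seen by computing both for one elemental composition. The Python sketch below is illustrative only: the function name is hypothetical and the element masses are standard reference values typed in for this example, not values read from elements.txt.

```python
# Mass of the lightest isotope of each element (Da); standard reference values.
MONOISOTOPIC = {
    "C": 12.0, "H": 1.00782503, "N": 14.00307401,
    "O": 15.99491462, "S": 31.97207100, "P": 30.97376163,
}
# Abundance-weighted average atomic masses (Da); standard reference values.
AVERAGE = {
    "C": 12.0107, "H": 1.00794, "N": 14.0067,
    "O": 15.9994, "S": 32.065, "P": 30.973762,
}


def composition_mass(composition, mass_type="monoisotopic"):
    """Mass of an elemental composition such as {'C': 2, 'H': 5, 'N': 1, 'O': 2}."""
    table = MONOISOTOPIC if mass_type == "monoisotopic" else AVERAGE
    return sum(table[element] * count for element, count in composition.items())


glycine = {"C": 2, "H": 5, "N": 1, "O": 2}
print(round(composition_mass(glycine, "monoisotopic"), 4))
print(round(composition_mass(glycine, "average"), 4))
```

The gap between the two scales grows with the number of atoms, which is why the choice of mass type matters more for large peptides and parent ions than for small fragments.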

Charge (z)

ProteinProspector programs can handle multiply charged data from both positive and negative ion experiments. Simply specify the integer charge state corresponding to the m/z value. Absence of charge specification in the input defaults to a charge state of +1. Input data that has had the mass of the protons subtracted can be used; simply designate the charge as 0. The charge is used to convert the m/z value to an MH⁺ value for search purposes. Output will show the m/z value with the charge as a superscript.
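The conversion from a measured m/z value and charge to an MH+ value can be sketched as follows. This is a minimal Python illustration under stated assumptions: the function name is hypothetical, protons are the only charge carriers, and negative ions are assumed to be formed by proton loss. The exact arithmetic used inside ProteinProspector is not given in this document.

```python
PROTON = 1.007276  # mass of a proton in Da


def mh_plus(mz, charge=1):
    """Convert a measured m/z and integer charge to the singly protonated mass."""
    if charge == 0:
        neutral = mz  # protons were already subtracted by the user
    else:
        # Positive ions carry `charge` extra protons; negative ions have lost them.
        neutral = abs(charge) * mz - charge * PROTON
    return neutral + PROTON


print(round(mh_plus(500.0, 2), 4))
print(round(mh_plus(1000.0, 0), 4))
```

With a charge of +1 the input is returned unchanged, which matches the default of treating an unspecified charge as +1.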

Max. Reported Hits

This option is used to limit the maximum number of hits displayed. For example if the maximum number of reported hits is set to 50 and there are 100 hits then only the first 50 hits are displayed.

Sample ID (comment)

This option allows a user defined comment or sample identifier to be added to the output.

Composition Ions

Searches can be restricted to matching sequences containing particular amino acid(s) by checking the appropriate boxes. This information can be derived from the masses of immonium and related low-mass ions or high-mass ions indicating side-chain losses from the parent ion. The programs do not actually use the mass values but instead filter the matched sequence for the presence of the designated amino acid(s).
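Since the filter works on the presence of amino acids rather than on mass values, it can be pictured as a simple membership test. The Python sketch below is only an illustration with a hypothetical function name.

```python
def contains_required(sequence, required):
    """Keep a matched sequence only if every checked amino acid occurs in it."""
    return all(aa in sequence for aa in required)


candidates = ["LSHEK", "AGPWK", "HFDLK"]
# Suppose the boxes for H and L were checked on the form.
print([seq for seq in candidates if contains_required(seq, "HL")])
```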

In MS-Tag the masses of immonium and related low-mass ions can also be placed directly in the fragment-ion mass window. MS-Tag invokes the same rules as conveyed in the check box chart, and converts the masses to AA characters and filters matched sequences as above for presence of the described amino acid(s). ProteinProspector server administrators can control these immonium ion rules by editing the immonium.txt file.

Absent Amino Acids

Both MS-Comp and MS-Tag Unknome consider the 20 naturally occurring amino acids as a default. If you know that your unknown peptide doesn't contain particular amino acids, you can narrow the range of the search by excluding them. You might also wish to exclude either Leucine or Isoleucine.

Modified Amino Acids Possibly Present

Both MS-Comp and MS-Tag Unknome consider the 20 naturally occurring amino acids as a default. They can also optionally include the following:

m - Oxidized Methionine

q - Pyroglutamic Acid

h - Homoserine Lactone

s - Phosphorylated Serine

t - Phosphorylated Threonine

y - Phosphorylated Tyrosine

u - A User Specified Amino Acid

Instrument

Some ProteinProspector parameters are specific to an instrument type. Server administrators can modify these parameters or add new instrument types by editing the instrument.txt file.

