Description, Instructions, and Tips for


Purpose
This document provides instructions for .

You need not bother reading this document unless you are administering a server running the ProteinProspector programs.

Introduction

FA-Index was developed for five main reasons:
  1. To enable an internal means for the ProteinProspector programs to store an index number when a hit is recorded during a search, then later use that number to retrieve that database entry for output/report generation purposes. This cuts down the memory requirements for program execution.
  2. To provide indices which can be used to accelerate searches that are pre-filtered by intact protein MW, protein pI and/or species.
  3. To aid the ProteinProspector programs in addressing some of the hindrances inherent in FASTA comment line format heterogeneity.
  4. To allow users to create subset databases based on either a Species/Protein MW pre-filter or the results of a previous search. Searches performed on these smaller databases are often very much faster than searches performed on complete databases.
  5. To allow users to create databases containing user defined proteins.
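The first reason can be illustrated with a small sketch. It assumes, purely for illustration, that an index records the byte offset of each comment line and that index numbers start at 1; the actual layout of the FA-Index files is not described in this document, and the function names are hypothetical.

```python
import io
from typing import BinaryIO, List, Tuple


def build_offsets(handle: BinaryIO) -> List[int]:
    """Record the byte offset of every comment line (lines starting with '>')."""
    offsets = []
    handle.seek(0)
    while True:
        position = handle.tell()
        line = handle.readline()
        if not line:
            break
        if line.startswith(b">"):
            offsets.append(position)
    return offsets


def fetch_entry(handle: BinaryIO, offsets: List[int], index_number: int) -> Tuple[str, str]:
    """Return (comment_line, sequence) for a 1-based index number (assumed numbering)."""
    handle.seek(offsets[index_number - 1])
    comment = handle.readline().decode("ascii").rstrip("\r\n")
    pieces = []
    for raw in iter(handle.readline, b""):
        if raw.startswith(b">"):
            break  # next entry begins here
        pieces.append(raw.decode("ascii").strip())
    return comment, "".join(pieces)


# A two-entry database held in memory so the sketch runs without any files.
database = io.BytesIO(b">a first\nMPPS\nISAF\n>b second\nKIWN\n")
offsets = build_offsets(database)
hit = 2  # a search only needs to remember this small integer per hit
print(fetch_entry(database, offsets, hit))
```

Storing one integer per hit, rather than the whole entry, is what keeps the memory footprint of a search small.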


Background on the FASTA format

The FASTA format for sequence databases was originally developed by Pearson for use with the FASTA program. Today it is probably the most widely used standard format, primarily because its brevity results in the smallest possible file size for sequences.

An example of the format is shown below:

>sp|P28190|AA1R_BOVIN ADENOSINE A1 RECEPTOR.
MPPSISAFQAAYIGIEVLIALVSVPGNVLVIWAVKVNQALRDATFCFIVSLAVADVAVGA
LVIPLAILINIGPRTYFHTCLKVACPVLILTQSSILALLAMAVDRYLRVKIPLRYKTVVT
PRRAVVAITGCWILSFVVGLTPMFGWNNLSAVERDWLANGSVGEPVIECQFEKVISMEYM
VYFNFFVWVLPPLLLMVLIYMEVFYLIRKQLSKKVSASSGDPQKYYGKELKIAKSLALIL
FLFALSWLPLHILNCITLFCPSCHMPRILIYIAIFLSHGNSAMNPIVYAFRIQKFRVTFL
KIWNDHFRCQPAPPIDEDAPAERPDD

As a standard it leaves something to be desired. The only fixed rule is that each entry has a single comment line, which must begin with the ">" character, and that all subsequent lines for the entry contain sequence. There are, however, many competing "standards" for the arrangement and delimiting of fields within the comment line. Often the comment line is used to describe basic information like entry name, accession number (or other unique identifier), and the species or organism from which the sequence was obtained.
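The one rule that all FASTA files share (a ">" comment line followed by sequence lines) is enough to split a file into entries. The following is a minimal sketch of such a reader, not the code used by ProteinProspector; the function name read_fasta is hypothetical.

```python
import io
from typing import Iterator, List, Optional, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (comment_line_without_'>', sequence) for each entry in the stream."""
    comment: Optional[str] = None
    chunks: List[str] = []
    for raw in handle:
        line = raw.rstrip("\r\n")
        if line.startswith(">"):
            if comment is not None:
                yield comment, "".join(chunks)
            comment = line[1:]
            chunks = []
        elif comment is not None and line.strip():
            # Every non-blank line after the comment line is sequence.
            chunks.append(line.strip())
    if comment is not None:
        yield comment, "".join(chunks)


example = io.StringIO(
    ">sp|P28190|AA1R_BOVIN ADENOSINE A1 RECEPTOR.\n"
    "MPPSISAFQAAYIGIEVLIALVSVPGNVLVIWAVKVNQALRDATFCFIVSLAVADVAVGA\n"
    "KIWNDHFRCQPAPPIDEDAPAERPDD\n"
)
entries = list(read_fasta(example))
```

Note that the reader makes no attempt to interpret the comment line; that is the part that differs from database to database.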

The FASTA format was chosen for use with ProteinProspector primarily because of its universality, its brevity, and the expected ease with which database files could be shared on the same computer with other programs for sequence analysis.

The FA-Index program creates several indices which are much smaller files than the FASTA database file. These indices aid the ProteinProspector programs in addressing some of the hindrances inherent in the FASTA comment line format heterogeneity.


Using other programs with the same FASTA database files

There is no reason that we know of that should prevent use of the FASTA database files by both ProteinProspector programs and other programs which accept FASTA format. Further, we believe it should be possible for the files to be simultaneously read by more than one program at a time. It may be of interest to some users that the SEQUEST program from John Yates' group at the University of Washington also uses FASTA formatted databases.


ProteinProspector filenaming conventions for public FASTA databases

Often the comment line in a FASTA database is used to describe basic information like entry name, accession number (or other unique identifier), and the species or organism from which the sequence was obtained. However, this information is NOT consistently organized into fields in the comment lines of different FASTA databases, though within a specific database it is usually consistent.

The way ProteinProspector programs "know" which dialect of FASTA to "speak" with a particular database is via the filename. Acceptable filename prefixes are shown below in bold and the associated comment line format described.
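A filename-to-dialect lookup of this kind might be sketched as follows. The prefix list is taken from the prefixes named in this document for the accession number indices; treating the short prefixes (gen, swp, owl, nr, dbest) as synonyms of the long ones, and matching them case-sensitively, are assumptions. The function name detect_dialect is hypothetical.

```python
from typing import Optional

# Filename prefix -> comment line dialect.
# Assumption: short prefixes are synonyms of the long ones next to them.
PREFIX_TO_DIALECT = {
    "Genpept": "Genpept", "gen": "Genpept",
    "SwissProt": "SwissProt", "swp": "SwissProt",
    "Owl": "Owl", "owl": "Owl",
    "NCBInr": "NCBInr", "nr": "NCBInr",
    "dbEST": "dbEST", "dbest": "dbEST",
    "DN": "DN", "PN": "PN", "DA": "DA", "PA": "PA",
}


def detect_dialect(filename: str) -> Optional[str]:
    """Return the dialect for a database filename, or None if no prefix matches."""
    # Try longer prefixes first so that a long prefix is never shadowed by a short one.
    for prefix in sorted(PREFIX_TO_DIALECT, key=len, reverse=True):
        if filename.startswith(prefix):
            return PREFIX_TO_DIALECT[prefix]
    return None


print(detect_dialect("Genpept.r95"))
print(detect_dialect("NCBInr.user"))
```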

Genpept

>gi|216790 (D13314) arginine deiminase [Mycoplasma hominis]

ProteinProspector programs designate:

  • accession number, D13314, as the alphanumeric string in the first set of parentheses in the line
  • name, arginine deiminase, as the string between the first ")" and the last "[" in the line
  • species, Mycoplasma hominis, as the string between the last set of brackets in the line
    Whenever the species cannot be found the species is assigned as UNREADABLE, and the name is assigned as the entire comment line. All of these UNREADABLE lines are then written by FA-Index to the file seqdb/Genpept.usp.
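The Genpept rules above can be sketched in a few lines. This is an illustration of the stated rules only, not FA-Index source code; parse_genpept is a hypothetical name, and returning an empty accession when no parentheses are present is an assumption.

```python
def parse_genpept(line: str) -> dict:
    """Split a Genpept comment line into accession, name and species."""
    open_paren = line.find("(")
    close_paren = line.find(")", open_paren + 1) if open_paren != -1 else -1
    # Accession: contents of the first set of parentheses.
    accession = line[open_paren + 1:close_paren] if close_paren != -1 else ""

    open_bracket = line.rfind("[")
    close_bracket = line.rfind("]")
    species = ""
    if open_bracket != -1 and close_bracket > open_bracket:
        # Species: contents of the last set of brackets.
        species = line[open_bracket + 1:close_bracket].strip()
    if not species:
        # Fallback described above: whole comment line becomes the name.
        return {"accession": accession, "name": line, "species": "UNREADABLE"}

    first_close = line.find(")")
    name_start = first_close + 1 if first_close != -1 else 0
    # Name: text between the first ")" and the last "[".
    name = line[name_start:open_bracket].strip()
    return {"accession": accession, "name": name, "species": species}


fields = parse_genpept(">gi|216790 (D13314) arginine deiminase [Mycoplasma hominis]")
print(fields)
```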

    Owl

    >10KD_VIGUN 10 KD PROTEIN PRECURSOR (CLONE PSAS10). - VIGNA UNGUICULATA (COWPEA).
    >AEOHFPA AEOHFPA NID: g141875 - A.hydrophila DNA, clone pPH4.
    >pir|Q62671|100K_RAT 100 KD PROTEIN (EC 6.3.2.-). - RATTUS NORVEGICUS (RAT).

    ProteinProspector programs designate:

  • accession number, 10KD_VIGUN, AEOHFPA, 100K_RAT as either the string before the first space in the line or the string between the second "|" in the line and the first space in the line. The second case is activated if the letters "pir" immediately follow the '>' character.
  • species, VIGNA UNGUICULATA, A.hydrophila DNA, clone pPH4, RATTUS NORVEGICUS as the string between the last dash " - " in the line and either the character combination " (" or the period character.
  • name, 10 KD PROTEIN PRECURSOR (CLONE PSAS10)., AEOHFPA NID: g141875, 100 KD PROTEIN (EC 6.3.2.-). as the string between the first space " " and the last dash " -" in the line
    Whenever the species cannot be found the species is assigned as UNREADABLE, and the name is assigned as the entire comment line. All of these UNREADABLE lines are then written by FA-Index to the file seqdb/Owl.usp.
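A sketch of the Owl rules follows. It reads the "pir" case as the text between the second "|" and the first space, which is what the 100K_RAT example implies, and it treats "the period character" as a trailing period, which is what the A.hydrophila example implies. Both readings are interpretations, and parse_owl is a hypothetical name.

```python
def parse_owl(line: str) -> dict:
    """Split an Owl comment line into accession, name and species."""
    body = line[1:] if line.startswith(">") else line
    space = body.find(" ")
    head = body[:space] if space != -1 else body
    if body.startswith("pir"):
        parts = head.split("|")
        accession = parts[2] if len(parts) > 2 else head
    else:
        accession = head

    dash = body.rfind(" - ")
    if dash == -1:
        return {"accession": accession, "name": line, "species": "UNREADABLE"}

    # Name: between the first space and the last " -".
    name = body[space + 1:dash].strip()
    # Species: after the last " - ", cut at " (" or else at a trailing period.
    tail = body[dash + 3:]
    cut = tail.find(" (")
    species = (tail[:cut] if cut != -1 else tail.rstrip(".")).strip()
    if not species:
        return {"accession": accession, "name": line, "species": "UNREADABLE"}
    return {"accession": accession, "name": name, "species": species}


owl_examples = [
    ">10KD_VIGUN 10 KD PROTEIN PRECURSOR (CLONE PSAS10). - VIGNA UNGUICULATA (COWPEA).",
    ">AEOHFPA AEOHFPA NID: g141875 - A.hydrophila DNA, clone pPH4.",
    ">pir|Q62671|100K_RAT 100 KD PROTEIN (EC 6.3.2.-). - RATTUS NORVEGICUS (RAT).",
]
owl_parsed = [parse_owl(example) for example in owl_examples]
```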

    SwissProt

    >sp|P15394|REPA_AGRTU REPLICATING PROTEIN

    ProteinProspector programs designate:

  • accession number, P15394, as the alphanumeric string beginning in position 5 and ending before a "|"
  • species, AGRTU, as the string between "_" and " "
  • name, REPLICATING PROTEIN, as the string following the species
    Whenever the species cannot be found the species is assigned as UNREADABLE, and the name is assigned as the entire comment line (this usually does not happen for any entries in SwissProt). All of these UNREADABLE lines are then written by FA-Index to the file seqdb/SwissProt.usp.
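The SwissProt rules are positional, so a sketch is short. Position 5 is counted here from 1 with the ">" included, which matches the example; parse_swissprot is a hypothetical name and this is not the FA-Index implementation.

```python
def parse_swissprot(line: str) -> dict:
    """Split a SwissProt comment line such as '>sp|P15394|REPA_AGRTU NAME'."""
    unreadable = {"accession": "", "name": line, "species": "UNREADABLE"}
    bar = line.find("|", 4)  # the "|" that ends the accession number
    if bar == -1:
        return unreadable
    accession = line[4:bar]  # position 5 (1-based) is index 4
    unreadable["accession"] = accession

    underscore = line.find("_", bar + 1)
    if underscore == -1:
        return unreadable
    space = line.find(" ", underscore + 1)
    if space == -1:
        species, name = line[underscore + 1:], ""
    else:
        species, name = line[underscore + 1:space], line[space + 1:].strip()
    if not species:
        return unreadable
    return {"accession": accession, "name": name, "species": species}


sp_fields = parse_swissprot(">sp|P15394|REPA_AGRTU REPLICATING PROTEIN")
print(sp_fields)
```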

    NCBInr

    The comment lines from this database are tricky to handle because it is a non-redundant database which collects entries from several databases; thus several formats are present in the final database.

    >gi|304881 (L07596) alaS [Escherichia coli]
    >gi|132349|sp|P15394|REPA_AGRTU REPLICATING PROTEIN
    >gi|282349|pir||A41961 chitinase (EC 3.2.1.14) D - Bacillus circulans
    >gi|477498|pir||A49131 releasechannel homolog - fruit fly (Drosophila melanogaster) (fragment)
    >gi|543687|pir||A48298 sodium channel homolog - jellyfish (Cyanea capillata)

    ProteinProspector programs designate:

  • accession number, 304881, as all consecutive digits following the first "|"
  • species
  • name
    Whenever the species cannot be found the species is assigned as UNREADABLE, and the name is assigned as the entire comment line. All of these UNREADABLE lines are then written by FA-Index to the file seqdb/NCBInr.usp.
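Of the NCBInr fields, only the accession number rule is spelled out above, so the sketch below covers that rule alone. The function name gi_accession is hypothetical, and returning None when no digits follow the first "|" is an assumption.

```python
import re
from typing import Optional


def gi_accession(line: str) -> Optional[int]:
    """Return the consecutive digits that follow the first '|' as an integer."""
    match = re.match(r"[^|]*\|(\d+)", line)
    return int(match.group(1)) if match else None


print(gi_accession(">gi|304881 (L07596) alaS [Escherichia coli]"))
print(gi_accession(">gi|132349|sp|P15394|REPA_AGRTU REPLICATING PROTEIN"))
```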

    dbEST
    This database wins the booby prize as the one with the least consistent comment lines.

    >gi|1705383|gb|N20717|N20717 SMNHADA002044SK SmAW Schistosoma mansoni cDNA 5'

    ProteinProspector programs designate:

  • accession number, 1705383, as all consecutive digits following "gi|"
  • species, Schistosoma mansoni, as found by a string search of the comment line. Because this database is so haphazard in its placement of the species, FA-Index first consults the file dbEST.spl.txt for valid species names and then searches the line for them. The string search method is possible with this particular database because a more limited range of species is represented. However, this means that a server administrator needs to keep the dbEST.spl.txt file up to date to ensure continued high-quality species searching of dbEST with ProteinProspector programs. This task, though annoying, is made somewhat easier by consulting the seqdb/dbEST.usp file.
  • name, N20717 SMNHADA002044SK SmAW Schistosoma mansoni cDNA 5', as the string following the 4th "|"
    Whenever the species cannot be found the species is assigned as UNREADABLE, and the name is assigned as the entire comment line. All of these UNREADABLE lines are then written by FA-Index to the file seqdb/dbEST.usp.
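The string search for a species might look like the sketch below. The list of valid names is passed in directly, because the layout of dbEST.spl.txt is not described here; the case-insensitive comparison and the preference for the longest matching name are assumptions, and find_species is a hypothetical name.

```python
from typing import Iterable


def find_species(comment_line: str, valid_species: Iterable[str]) -> str:
    """Return the first valid species name found in the line, longest names first."""
    lowered = comment_line.lower()
    for name in sorted(valid_species, key=len, reverse=True):
        if name.lower() in lowered:
            return name
    return "UNREADABLE"


known = ["Homo sapiens", "Schistosoma mansoni", "Mus musculus"]
est_line = ">gi|1705383|gb|N20717|N20717 SMNHADA002044SK SmAW Schistosoma mansoni cDNA 5'"
print(find_species(est_line, known))
```

A line that reports UNREADABLE here is exactly the kind of line an administrator would look for in the .usp file when deciding which species names to add to the list.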


    ProteinProspector filenaming conventions for proprietary/generic FASTA databases

    Often the comment line in a FASTA database is used to describe basic information like entry name, accession number (or other unique identifier), and the species or organism from which the sequence was obtained. With well curated databases, this information is consistently organized into fields in the comment line of a FASTA formatted database.

    For ProteinProspector programs the sequence field is subject to only two constraints: 1) it must be in CAPITAL letters, and 2) it must be in single-letter code (some people express amino acids in 3-letter code).

    The way ProteinProspector programs "know" which dialect of FASTA to "speak" with a particular database's comment line is via the filename. Generic filename prefixes are shown below in bold and the associated comment line format described. These formats are handled in a relatively robust manner, to allow for the absence of fields or the presence of additional fields. The formats basically consist of "|" delimited fields of accession number, name, and species in that order.

    DN and PN

    The D forms designate that the sequence is DNA and will be translated into protein sequence by ProteinProspector programs. The P forms indicate protein sequence.

    > 417909| Better than sliced bread growth factor beta|Mouse|pancreas|

    ProteinProspector programs designate:

  • accession number, 417909, as the integer before the first "|"
  • name, Better than sliced bread growth factor beta, as the string between the first "|" and second "|" (or the end of the line, if no second "|")
  • species, Mouse, as the string between the second "|" and third "|" (or the end of the line, if no third "|")
    Whenever the species cannot be found the species is assigned as UNREADABLE, and the name is assigned as the entire comment line. All of these UNREADABLE lines are then written by FA-Index to the file seqdb/DN.usp, or seqdb/PN.usp.
    If the accession number is alphanumeric, FA-Index will still run to completion, and all ProteinProspector programs will function properly, except those which retrieve an entry based on the accession number. Currently, this applies only to MS-Digest and MS-Edman when retrieval by accession number is selected. In those cases, supplying an alphanumeric accession number will retrieve the entry closest to the end of the file which has an alphanumeric accession number.
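The "|" delimited generic format is the simplest of the dialects to read. The sketch below follows the field order given above and tolerates missing or extra fields; stripping the spaces around each field is an assumption, and parse_generic is a hypothetical name.

```python
def parse_generic(line: str) -> dict:
    """Split a generic '> accession| name|species|...' comment line."""
    body = line[1:] if line.startswith(">") else line
    fields = [field.strip() for field in body.split("|")]
    accession = fields[0]
    name = fields[1] if len(fields) > 1 else ""
    species = fields[2] if len(fields) > 2 else ""
    if not species:
        # Fallback described above: whole comment line becomes the name.
        name, species = line, "UNREADABLE"
    return {
        "accession": accession,
        "name": name,
        "species": species,
        # The numeric prefixes (DN, PN) expect this to be True.
        "numeric_accession": accession.isdigit(),
    }


generic = parse_generic("> 417909| Better than sliced bread growth factor beta|Mouse|pancreas|")
print(generic)
```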

    DA and PA

    The D forms designate that the sequence is DNA and will be translated into protein sequence by ProteinProspector programs. The P forms indicate protein sequence.

    Note that the DA and PA differ from the DN and PN set only in that the accession number can be alphanumeric rather than numeric. This second set is thus more robust. However, for large, frequently updated databases FA-Index can take an hour to run rather than several minutes simply because creation of the dbfilename.acc file involves the much slower process of sorting strings rather than integers.

    > SlowSort909| Better than sliced bread growth factor beta|Mouse|pancreas|

    ProteinProspector programs designate:

  • accession number, SlowSort909, as the alphanumeric string before the first "|"
  • name, Better than sliced bread growth factor beta, as the string between the first "|" and second "|" (or the end of the line, if no second "|")
  • species, Mouse, as the string between the second "|" and third "|" (or the end of the line, if no third "|")
    Whenever the species cannot be found the species is assigned as UNREADABLE, and the name is assigned as the entire comment line. All of these UNREADABLE lines are then written by FA-Index to the file seqdb/DA.usp or seqdb/PA.usp.

    Any number of proprietary databases may be created with DA, DN, PA or PN prefixes. You must also create species alias lists and accession number links for any databases which you create.


    FA-Index output files (the indices)

    Suffix (databasefilename.xxx)    Description
    .idx    Index assigning a number to each entry in the database. The number is simply the order in which the entries appear in the database file. When a database is updated the number corresponding to a particular entry will change only if the order of the entries in the file changes. Users see this number in ProteinProspector programs designated as the MS-Digest index number. Internally, the programs store this number when a hit is recorded during a search; the number is then used later to retrieve the sequence for output/report generation purposes.
    .unk    Index which keeps track of all foreign characters in the sequence field for each database entry.
            For protein databases any characters other than the 20 standard amino acids are foreign characters.
            For DNA databases any characters other than A, G, C, T, and N are foreign characters.
            Note that the sequences must be in CAPITAL letters, and in single-letter code (some people express amino acids in 3-letter code).
    .mw     Index containing the calculated protein Molecular Weight (MW) of each sequence in the database. For DNA sequences this MW is calculated by translating in frame 1 and ignoring stop codons. The amino acid C is treated as unmodified, the amino acid X is treated as L, the amino acid B is treated as E, the amino acid J is treated as Q. The .mw file is used to accelerate searches that are constrained by intact MW.
    .pi     Index containing the calculated protein pI of each sequence in the database. For DNA sequences this pI is calculated by translating in frame 1 and ignoring stop codons. The amino acid C is treated as unmodified, the amino acid X is treated as L, the amino acid B is treated as E, the amino acid J is treated as Q. The .pi file is used to accelerate searches that are constrained by intact pI.
    .sp     Index containing the Species of each sequence in the database. Used to accelerate searches that are constrained by species.
    .sl     Contains a list in alphabetical order of the text strings used to denote different species. A text string has to occur at least ten times to appear in this file. This file is never used by the ProteinProspector programs. The text strings are the ones you should use in MS-Edman if you have the Search Mode set to Species.
    .usp    File created to list the comment lines of each entry for which FA-Index cannot read the species. This file is never used by the ProteinProspector programs; it is created only for use by server administrators in troubleshooting species problems.
    .acc    Index of alphanumeric accession numbers, created only for database filename prefixes: Genpept, gen, SwissProt, swp, Owl, owl, DA, PA.
    .acn    Index of integer accession numbers, created only for database filename prefixes: NCBInr, nr, dbEST, dbest, DN, PN.
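Two of the indices lend themselves to a short illustration: the foreign character check behind the .unk index and the molecular weight calculation behind the .mw index. The sketch below uses the substitutions stated above (X as L, B as E, J as Q) and unmodified C. The use of average rather than monoisotopic residue masses is an assumption, since this document does not say which FA-Index uses, and the function names are hypothetical.

```python
from typing import List, Tuple

# Average residue masses in Da (amino acid minus one water).
AVERAGE_RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.01528
SUBSTITUTIONS = {"X": "L", "B": "E", "J": "Q"}  # as stated for the .mw index
PROTEIN_ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
DNA_ALPHABET = "AGCTN"


def foreign_characters(sequence: str, alphabet: str = PROTEIN_ALPHABET) -> List[Tuple[int, str]]:
    """Return (position, character) for every character outside the alphabet."""
    return [(i, ch) for i, ch in enumerate(sequence) if ch not in alphabet]


def protein_mw(sequence: str) -> float:
    """Sum the residue masses and add one water for the free termini."""
    total = WATER
    for symbol in sequence:
        residue = SUBSTITUTIONS.get(symbol, symbol)
        if residue not in AVERAGE_RESIDUE_MASS:
            raise ValueError(f"no mass defined for {symbol!r}")
        total += AVERAGE_RESIDUE_MASS[residue]
    return total


print(foreign_characters("ACXZ"))
print(round(protein_mw("MPPSISAF"), 2))
```

How FA-Index treats foreign characters other than X, B and J when computing MW is not stated here; the sketch simply refuses them.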


    Ignore/bypass the indices?

    Suffix (databasefilename.xxx)    Bypassable    How to bypass, if possible
    .idx          no     Necessary for any ProteinProspector program that searches/consults a database file.
    .unk          no     Necessary for any ProteinProspector program that searches/consults a database file.
    .mw           yes    Select All in the MW search parameters.
    .pi           yes    Select All in the pI search parameters.
    .sp           yes    Select All in the Species search parameters.
    .sl           yes    This file is never used by the ProteinProspector programs; it is used to report the contents of the species fields in the database file.
    .usp          yes    This file is never used by the ProteinProspector programs; it is created only for use by server administrators in troubleshooting species problems.
    .acc, .acn    yes    Do not choose retrieve by Accession number in MS-Digest, and do not set the search mode to Accession number in MS-Edman.


    The Browser Version of FA-Index


    Creating Indices for a New Database

    Once you've downloaded a new database into the seqdb directory you need to create the index files described above before you can start to use it. To do this:

    1). Type the name of the database into the Newly Downloaded Database field.

    2). Press the Create Indicies For New Database button.

    3). Update the database list.


    Updating the Database List in the HTML Forms

    The list of databases used by the other forms is held in a JavaScript file; the default location of this file is shown on the FA-Index form. To update the contents of this file, press the Update Database List in HTML Forms button. After doing this you will probably have to reload the relevant HTML form before the new database list appears. If this doesn't work, place the cursor in the URL location box of the browser and press return. If even this doesn't work, investigate the cache settings on your browser.

    The JavaScript file is automatically updated after performing the following operations on the FA-Index form:

  • Create Indicies For New Database
  • Create a Species and Protein MW Subset Database with Indicies
  • Create Subset Database with Indices from Saved Hits
  • Create or Append to User Database

    You will still have to reload the form as described above to be able to select a newly created database.


    Creating a Species and Protein MW Subset Database with Indices

    ProteinProspector licensees can create their own subset databases which have been pre-filtered for species and molecular weight. For example, to create a subset database of human proteins between 1000 and 100000 Da from the SwissProt database:

    1). Choose a suitable suffix for the database such as human.

    2). Select SwissProt.rxx as the existing database.

    3). Select HOMO SAPIENS as the species.

    4). Enter 1000 to 100000 as the MW of the Protein and deselect All.

    5). Press the Create a Species and Protein MW Subset Database with Indicies button.

    6). Update the database list.

    Using subset databases is likely to dramatically decrease search times.

    This feature is only available to ProteinProspector licensees.
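The effect of the species and MW pre-filter can be pictured with a short sketch. It assumes each entry already carries a species string and a calculated MW (the information held in the .sp and .mw indices); the Entry type, the select_subset name and the example values are hypothetical.

```python
from typing import Iterable, List, NamedTuple


class Entry(NamedTuple):
    comment: str
    species: str
    mw: float
    sequence: str


def select_subset(entries: Iterable[Entry], species: str,
                  mw_low: float, mw_high: float) -> List[Entry]:
    """Keep entries of one species whose MW lies within [mw_low, mw_high]."""
    return [
        entry for entry in entries
        if entry.species == species and mw_low <= entry.mw <= mw_high
    ]


# Made-up entries, used only to exercise the filter.
candidates = [
    Entry("entry 1", "HOMO SAPIENS", 52000.0, "MPPS"),
    Entry("entry 2", "HOMO SAPIENS", 250000.0, "MKIW"),
    Entry("entry 3", "RATTUS NORVEGICUS", 48000.0, "MAFQ"),
]
human_subset = select_subset(candidates, "HOMO SAPIENS", 1000, 100000)
print([entry.comment for entry in human_subset])
```

Every later search of the subset skips the entries removed here, which is where the speed gain comes from.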


    Creating a Subset Database with Indices from Saved Hits

    The Hits (index numbers for matching database entries) from ProteinProspector search programs can be saved to a user-specified file. This file can then be used to create a subset database containing only the Hit proteins from the search.

    1). Choose a suitable suffix for the database. The suffix must be unique; if you use the same suffix twice then the previously created subset database will be overwritten.

    2). Identify the database that was used in the original search.

    3). Identify the file containing the saved hits by entering the Program and File Name.

    4). Press the Create Subset Database with Indices from Saved Hits button.

    5). Update the database list.

    This feature is only available to ProteinProspector licensees.
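Building a subset from saved hits amounts to selecting entries by index number and writing them back out in FASTA format. The sketch below assumes index numbers are 1-based and that the saved hits have already been read into a list of integers; the saved-hits file format is not described here, and the function names and the 60-character line width are hypothetical.

```python
from typing import Iterable, List, Sequence, Tuple

FastaEntry = Tuple[str, str]  # (comment line without '>', sequence)


def subset_from_hits(entries: Sequence[FastaEntry], hit_numbers: Iterable[int]) -> List[FastaEntry]:
    """Return the entries whose 1-based position appears in hit_numbers, in file order."""
    wanted = set(hit_numbers)
    return [entry for number, entry in enumerate(entries, start=1) if number in wanted]


def to_fasta(entries: Iterable[FastaEntry], width: int = 60) -> str:
    """Format entries as FASTA text with sequence lines of at most `width` characters."""
    lines: List[str] = []
    for comment, sequence in entries:
        lines.append(">" + comment)
        lines.extend(sequence[i:i + width] for i in range(0, len(sequence), width))
    return "\n".join(lines) + "\n"


all_entries = [("a", "MPPS"), ("b", "KIWN"), ("c", "AAAA")]
hits = subset_from_hits(all_entries, [3, 1])
print(to_fasta(hits), end="")
```

The new file would then need its own set of indices before it could be searched.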


    Creating or Appending to a Database Containing User Supplied Protein or DNA Sequences

    It is possible to create your own FASTA-format database which can be searched by the ProteinProspector search programs. An entry for a single protein or DNA sequence is made up of a comment line containing accession number, species and name fields followed by one or more lines containing the sequence.

    1). Enter the database name. There are several dialects of FASTA, the essential difference between them being the format of the comment line. You are strongly advised to use a proprietary format, but it is also possible to use a public format. If you choose a database name that already exists on the disk then subsequent proteins will be appended to the end of the file; otherwise a new database file will be created. It is possible to append entries to the end of the publicly available databases but this is not advisable: firstly because the index files are remade after each entry, secondly because newer versions of the database won't contain your entries, and thirdly because any errors in the information you supply when adding the entry could potentially damage the whole database. If you want to use a public database format you should use a database name such as NCBInr.user.

    2). Enter a name for the entry. Whether you are using a proprietary format or a public format make sure you don't use characters in the name which might give the ProteinProspector programs problems in sorting out the fields in the comment line.

    3). Enter a species for the entry. This should be consistent with the information in the species.txt file.

    4). Enter an accession number for the entry. The accession number must be unique; the program will alert you if it isn't. If your database uses numeric accession numbers then the accession number must be numeric.

    5). Enter the protein or DNA sequence using only the upper case symbols for the 20 naturally occurring amino acids or the four bases as appropriate. X may also be used if the sequence is unknown at a particular point.

    6). Press the Create or Append to User Database button.
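The checks implied by the steps above can be sketched as a function that builds one entry in the proprietary "|" delimited format. The exact text the form writes is not documented here, so the comment line layout, the rejected characters and the make_entry name are assumptions; allowing X in DNA sequences follows a literal reading of step 5.

```python
PROTEIN_SYMBOLS = set("ACDEFGHIKLMNPQRSTVWY") | {"X"}
DNA_SYMBOLS = set("ACGT") | {"X"}


def make_entry(accession: str, name: str, species: str, sequence: str,
               dna: bool = False, width: int = 60) -> str:
    """Return FASTA text for one user entry, or raise ValueError if a field is unsafe."""
    for label, value in (("accession", accession), ("name", name), ("species", species)):
        # These characters would confuse the splitting of the comment line into fields.
        if any(ch in value for ch in "|>\n\r"):
            raise ValueError(f"{label} contains a character used as a delimiter")
    allowed = DNA_SYMBOLS if dna else PROTEIN_SYMBOLS
    bad = sorted(set(sequence) - allowed)
    if not sequence or bad:
        raise ValueError(f"sequence is empty or contains disallowed symbols: {bad}")
    comment = f">{accession}|{name}|{species}|"
    body = "\n".join(sequence[i:i + width] for i in range(0, len(sequence), width))
    return comment + "\n" + body + "\n"


new_entry = make_entry("417909", "Growth factor", "Mouse", "MPPSX")
print(new_entry, end="")
```

Checking that the accession number is unique, and numeric where the database requires it, would need the existing database and is left out of the sketch.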


    Database Summary Report

    The database summary report option is used to list the accession numbers, species and name fields for a selected index number range of a selected database. Deselect the Hide Protein Sequence checkbox if you also want to see the protein sequences. You can also select the DNA Reading Frame if you are looking at a DNA database.


    The Command Line Version of FA-Index

    Traditionalists and users afraid of mice can still use the command line version of FA-Index.

    FA-Index and the ProteinProspector Directory Structure

    On all operating systems the FA-Index program is expected to reside in the same directory as all other ProteinProspector programs (i.e. ). FA-Index accepts a single input argument (the name of the database file). Upon execution FA-Index issues an instruction to read the database file from seqdb/database_filename and write the indices to seqdb/database_filename.suffix.

    This can cause some confusion about which directory to launch FA-Index from and the syntax for launching it. We've tried to make this as simple as possible; however, system administrators can easily outsmart themselves, particularly if they want to alter the ProteinProspector directory structure.

    Basically you should launch FA-Index from the directory immediately above the seqdb directory, without specifying the path to the database file. FA-Index inserts only seqdb/ in front of the filename, and it "knows" whether to put a forward slash or a back slash for your particular operating system.
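The path handling described above can be mimicked in a few lines, which may help when checking that a launch directory is correct. This is only an illustration of the seqdb/database_filename.suffix convention; the function name index_paths is hypothetical, and the suffix list leaves out .acc and .acn because only one of them is created for a given database.

```python
import os
from typing import List, Tuple

COMMON_SUFFIXES = (".idx", ".unk", ".mw", ".pi", ".sp", ".sl", ".usp")


def index_paths(database_filename: str) -> Tuple[str, List[str]]:
    """Return the relative database path and the index paths derived from it."""
    # os.path.join picks the separator for the current operating system.
    database = os.path.join("seqdb", database_filename)
    return database, [database + suffix for suffix in COMMON_SUFFIXES]


database_path, indices = index_paths("Genpept.r95")
print(database_path)
for path in indices:
    print(path)
```

Because the paths are relative, they resolve correctly only when the current directory is the one immediately above seqdb.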

    If the FA-Index program does not reside in the directory immediately above the seqdb directory (the normal case on Windows NT systems) then you may need to specify the path to faindex (but not to the database).

    On UNIX systems there is no reason why seqdb cannot be a symbolic link to another directory.

    Running FA-Index

    Examples:

    On SunOS UNIX systems issue a command of the form:
          /home/httpd//faindex.cgi Genpept.r95

    On Windows NT systems use an MS-DOS command prompt to issue a command of the form:
          C:\http> faindex.cgi Genpept.r95
    (you may first need to type)
          path=C:\http\
    or try
          C:\http> \faindex.cgi Genpept.r95