You need not bother reading this document unless you are administering a server running the ProteinProspector programs.
Instructions for ProteinProspector Programs
Contents of this document: (all in one file, so it can be printed and read)
An example of the format is shown below:
>sp|P28190|AA1R_BOVIN ADENOSINE A1 RECEPTOR.
MPPSISAFQAAYIGIEVLIALVSVPGNVLVIWAVKVNQALRDATFCFIVSLAVADVAVGA
LVIPLAILINIGPRTYFHTCLKVACPVLILTQSSILALLAMAVDRYLRVKIPLRYKTVVT
PRRAVVAITGCWILSFVVGLTPMFGWNNLSAVERDWLANGSVGEPVIECQFEKVISMEYM
VYFNFFVWVLPPLLLMVLIYMEVFYLIRKQLSKKVSASSGDPQKYYGKELKIAKSLALIL
FLFALSWLPLHILNCITLFCPSCHMPRILIYIAIFLSHGNSAMNPIVYAFRIQKFRVTFL
KIWNDHFRCQPAPPIDEDAPAERPDD
As a standard it leaves something to be desired. The only fixed rules are that each entry has a single comment line, which must begin with the ">" character, and that all subsequent lines for the entry contain sequence. Beyond that, there are many "standards" for the arrangement and delimiting of fields in the comment line. The comment line is often used to give basic information such as the entry name, accession number (or other unique identifier), and the species or organism from which the sequence was obtained.
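The two fixed rules described above (one ">" comment line per entry, followed by sequence lines) are enough to split a file into entries. The following is a minimal Python sketch of that idea, not the parser used by ProteinProspector; the function name `read_fasta` is hypothetical, and the example sequence is shortened for brevity.

```python
from typing import Iterator, List, Tuple


def read_fasta(lines: List[str]) -> Iterator[Tuple[str, str]]:
    """Yield (comment, sequence) pairs from FASTA-formatted lines."""
    comment = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # tolerate blank lines between entries
        if line.startswith(">"):
            # A new comment line closes the previous entry, if any.
            if comment is not None:
                yield comment, "".join(chunks)
            comment = line[1:].strip()
            chunks = []
        elif comment is not None:
            # Every non-comment line belongs to the current entry's sequence.
            chunks.append(line)
    if comment is not None:
        yield comment, "".join(chunks)


example = [
    ">sp|P28190|AA1R_BOVIN ADENOSINE A1 RECEPTOR.",
    "MPPSISAFQAAYIGIEVLIALVSVPGNVLVIWAVKVNQALRDATFCFIVSLAVADVAVGA",
    "KIWNDHFRCQPAPPIDEDAPAERPDD",
]
entries = list(read_fasta(example))
```

The position of each pair in `entries` corresponds to the order in which the entry appears in the file.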
The FASTA format was chosen for use with ProteinProspector primarily because of its universality, its brevity, and the expected ease with which database files could be shared on the same computer with other programs for sequence analysis.
The FA-Index program creates several indices which are much smaller files than the FASTA database file. These indices aid the ProteinProspector programs in addressing some of the hindrances inherent in the FASTA comment line format heterogeneity.
The way ProteinProspector programs "know" which dialect of FASTA to "speak" with a particular database is via the filename. Acceptable filename prefixes are shown below in bold and the associated comment line format described.
Genpept
>gi|216790 (D13314) arginine deiminase [Mycoplasma hominis]
ProteinProspector programs designate:
Owl
>10KD_VIGUN 10 KD PROTEIN PRECURSOR (CLONE PSAS10). - VIGNA UNGUICULATA (COWPEA).
>AEOHFPA AEOHFPA NID: g141875 - A.hydrophila DNA, clone pPH4.
>pir|Q62671|100K_RAT 100 KD PROTEIN (EC 6.3.2.-). - RATTUS NORVEGICUS (RAT).
ProteinProspector programs designate:
SwissProt
>sp|P15394|REPA_AGRTU REPLICATING PROTEIN
ProteinProspector programs designate:
NCBInr
The comment lines from this database are tricky to handle because it is a non-redundant database which collects entries from several databases; thus several formats are present in the final database.
>gi|304881 (L07596) alaS [Escherichia coli]
>gi|132349|sp|P15394|REPA_AGRTU REPLICATING PROTEIN
>gi|282349|pir||A41961 chitinase (EC 3.2.1.14) D - Bacillus circulans
>gi|477498|pir||A49131 releasechannel homolog - fruit fly (Drosophila melanogaster) (fragment)
>gi|543687|pir||A48298 sodium channel homolog - jellyfish (Cyanea capillata)
ProteinProspector programs designate:
dbEST
This database wins the booby prize as the one with the least consistent comment lines.
>gi|1705383|gb|N20717|N20717 SMNHADA002044SK SmAW Schistosoma mansoni cDNA 5'
ProteinProspector programs designate:
For ProteinProspector programs the sequence field is subject to only two constraints: 1) it must be in CAPITAL letters, and 2) it must be in single-letter code (some people express amino acids in 3-letter code).
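A quick way to check a sequence against these constraints before indexing is to list every character outside the expected alphabet. The sketch below is illustrative only: the function name `foreign_characters` is hypothetical, and the alphabets are taken from this document (the 20 standard amino acids for protein; A, G, C, T, and N for DNA).

```python
PROTEIN_ALPHABET = set("ACDEFGHIKLMNPQRSTVWY")  # 20 standard amino acids
DNA_ALPHABET = set("AGCTN")


def foreign_characters(sequence, is_dna=False):
    """Return (position, character) pairs for characters outside the alphabet.

    Lower-case letters are reported too, since sequences must be in capitals.
    """
    allowed = DNA_ALPHABET if is_dna else PROTEIN_ALPHABET
    return [(i, ch) for i, ch in enumerate(sequence) if ch not in allowed]
```

For example, `foreign_characters("MPPSX")` reports the X at position 4, and a lower-case DNA sequence is reported in full.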
The way ProteinProspector programs "know" which dialect of FASTA to "speak" with a particular database's comment line is via the filename. Generic filename prefixes are shown below in bold and the associated comment line format described. These formats are handled in a relatively robust manner, to allow for the absence of fields or the presence of additional fields. The formats basically consist of "|" delimited fields of accession number, name, and species in that order.
DN and PN
The D forms designate that the sequence is DNA and will be translated into protein sequence by ProteinProspector programs. The P forms indicate protein sequence.
> 417909| Better than sliced bread growth factor beta|Mouse|pancreas|
ProteinProspector programs designate:
DA and PA
The D forms designate that the sequence is DNA and will be translated into protein sequence by ProteinProspector programs. The P forms indicate protein sequence.
Note that the DA and PA differ from the DN and PN set only in that the accession number can be alphanumeric rather than numeric. This second set is thus more robust. However, for large, frequently updated databases FA-Index can take an hour to run rather than several minutes simply because creation of the dbfilename.acc file involves the much slower process of sorting strings rather than integers.
> SlowSort909| Better than sliced bread growth factor beta|Mouse|pancreas|
ProteinProspector programs designate:
Any number of proprietary databases may be created with DA, DN, PA or PN prefixes. You must also create species alias lists and accession number links for any databases which you create.
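When building a proprietary database it can help to check that each comment line splits cleanly into the "|" delimited accession number, name, and species fields. The following Python sketch shows one way to do that; it is an assumption about a reasonable parse, not the code used by FA-Index, and the names `CommentFields` and `parse_generic_comment` are hypothetical.

```python
from typing import List, NamedTuple


class CommentFields(NamedTuple):
    accession: str
    name: str
    species: str
    extra: List[str]  # any additional fields after species


def parse_generic_comment(line: str) -> CommentFields:
    """Split a DN/PN/DA/PA style comment line into its fields."""
    body = line[1:] if line.startswith(">") else line
    fields = [f.strip() for f in body.split("|")]
    # A trailing "|" produces an empty final field; drop it.
    while fields and fields[-1] == "":
        fields.pop()
    # Pad so that absent fields come back as empty strings.
    fields += [""] * (3 - len(fields))
    return CommentFields(fields[0], fields[1], fields[2], fields[3:])


numeric = parse_generic_comment(
    "> 417909| Better than sliced bread growth factor beta|Mouse|pancreas|"
)
alphanumeric = parse_generic_comment(
    "> SlowSort909| Better than sliced bread growth factor beta|Mouse|pancreas|"
)
```

Here `numeric.accession.isdigit()` is true, so the entry would suit a DN or PN database, while `alphanumeric.accession` would require a DA or PA database.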
Suffix (databasefilename.xxx) | Description |
---|---|
.idx | Index assigning a number to each entry in the database. The number is simply the order in which the entries appear in the database file. When a database is updated, the number corresponding to a particular entry will change only if the order of the entries in the file changes. Users see this number in ProteinProspector programs designated as the MS-Digest index number. Internally, the programs store this number when a hit is recorded during a search; the number is used later to retrieve the sequence for output/report generation purposes. |
.unk | Index which keeps track of all foreign characters in the sequence field for each database entry. For protein databases any characters other than the 20 standard amino acids are foreign characters. For DNA databases any characters other than A, G, C, T, and N are foreign characters. Note that the sequences must be in CAPITAL letters and in single-letter code (some people express amino acids in 3-letter code). |
.mw | Index containing the calculated protein Molecular Weight (MW) of each sequence in the database. For DNA sequences this MW is calculated by translating in frame 1 and ignoring stop codons. The amino acid C is treated as unmodified, the amino acid X is treated as L, the amino acid B is treated as E, the amino acid J is treated as Q. The .mw file is used to accelerate searches that are constrained by intact MW. |
.pi | Index containing the calculated protein pI of each sequence in the database. For DNA sequences this pI is calculated by translating in frame 1 and ignoring stop codons. The amino acid C is treated as unmodified, the amino acid X is treated as L, the amino acid B is treated as E, the amino acid J is treated as Q. The .pi file is used to accelerate searches that are constrained by intact pI. |
.sp | Index containing the Species of each sequence in the database. Used to accelerate searches that are constrained by species. |
.sl | Contains a list in alphabetical order of the text strings used to denote different species. A text string has to occur at least ten times to appear in this file. This file is never used by the ProteinProspector programs. The text strings are the ones you should use in MS-Edman if you have the Search Mode set to Species. |
.usp | File created to list the comment lines of each entry for which FA-Index cannot read the species. This file is never used by the ProteinProspector programs; it is created only for use by server administrators in troubleshooting species problems. |
.acc | Index of alphanumeric accession numbers, created only for database filename prefixes: Genpept, gen, SwissProt, swp, Owl, owl, DA, PA. |
.acn | Index of integer accession numbers, created only for database filename prefixes: NCBInr, nr, dbEST, dbest, DN, PN. |
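The substitution rules given for the .mw index (X treated as L, B as E, J as Q, C unmodified) can be illustrated with a small Python sketch. This is not the FA-Index implementation: the function name `protein_mw` is hypothetical, and the use of average residue masses (rather than monoisotopic masses) is an assumption, since this document does not say which are used.

```python
# Average residue masses in daltons (assumed mass type; see note above).
AVERAGE_RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.01528  # one water is added for the free N- and C-termini

# Substitutions stated in the .mw and .pi descriptions above.
SUBSTITUTIONS = {"X": "L", "B": "E", "J": "Q"}


def protein_mw(sequence):
    """Sum residue masses plus one water, applying the substitutions."""
    total = WATER
    for ch in sequence:
        ch = SUBSTITUTIONS.get(ch, ch)
        total += AVERAGE_RESIDUE_MASS[ch]  # KeyError for any other character
    return total
```

With these rules `protein_mw("X")` equals `protein_mw("L")`, so an unknown residue contributes the mass of leucine.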
Suffix (databasefilename.xxx) | Bypassable | How to by-pass if possible |
---|---|---|
.idx | no | Necessary for any ProteinProspector program that searches/consults a database file. |
.unk | no | Necessary for any ProteinProspector program that searches/consults a database file. |
.mw | yes | Select All in the MW search parameters. |
.pi | yes | Select All in the pI search parameters. |
.sp | yes | Select All in the Species search parameters. |
.sl | yes | This file is never used by the ProteinProspector programs; it is used to report the contents of the species fields in the database file. |
.usp | yes | This file is never used by the ProteinProspector programs; it is created only for use by server administrators in troubleshooting species problems. |
.acc .acn | yes | Don't choose retrieve by Accession number in MS-Digest, or set the search mode to Accession number in MS-Edman. |
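Which of the two accession indices exists for a given database follows from its filename prefix, as listed in the .acc and .acn rows of the first table. A minimal Python sketch of that lookup is shown below. The function name `accession_index_suffix` is hypothetical, and taking the prefix to be everything before the first "." (as in Genpept.r95 or NCBInr.user) is an assumption about how the filename is interpreted.

```python
ALPHANUMERIC_PREFIXES = {"Genpept", "gen", "SwissProt", "swp", "Owl", "owl", "DA", "PA"}
INTEGER_PREFIXES = {"NCBInr", "nr", "dbEST", "dbest", "DN", "PN"}


def accession_index_suffix(db_filename):
    """Return ".acc", ".acn", or None for an unrecognised prefix."""
    # Assumption: the prefix is the part of the filename before the first ".".
    prefix = db_filename.split(".", 1)[0]
    if prefix in ALPHANUMERIC_PREFIXES:
        return ".acc"
    if prefix in INTEGER_PREFIXES:
        return ".acn"
    return None
```

For example, `accession_index_suffix("Genpept.r95")` gives ".acc" and `accession_index_suffix("NCBInr.user")` gives ".acn".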
1). Type the name of the database into the Newly Downloaded Database field.
2). Press the Create Indicies For New Database button.
The Javascript file is automatically updated after performing the following operations on the FA-Index form:
You will still have to reload the form as described above to be able to select a newly created database.
1). Choose a suitable suffix for the database such as human.
2). Select SwissProt.rxx as the existing database.
3). Select HOMO SAPIENS as the species.
4). Enter 1000 to 100000 as the MW of the Protein and deselect All.
5). Press the Create a Species and Protein MW Subset Database with Indicies button.
Using subset databases is likely to dramatically decrease search times.
This feature is only available to ProteinProspector licensees.
1). Choose a suitable suffix for the database. The suffix must be unique; if you use the same suffix twice then the previously created subset database will be overwritten.
2). Identify the database that was used in the original search.
3). Identify the file containing the saved hits by entering the Program and File Name.
4). Press the Create Subset Database with Indices from Saved Hits button.
This feature is only available to ProteinProspector licensees.
1). Enter the database name. There are several dialects of FASTA, and the essential difference between them is the format of the comment line. You are strongly advised to use a proprietary format, but it is also possible to use a public format. If you choose a database name that already exists on the disk, subsequent proteins will be appended to the end of the file; otherwise a new database file will be created. It is possible to append entries to the end of the publicly available databases, but this is not advisable: first, because the index files are remade after each entry; second, because newer versions of the database won't contain your entries; and third, because any errors in the information you supply when adding the entry could potentially damage the whole database. If you want to use a public database format you should use a database name such as NCBInr.user.
2). Enter a name for the entry. Whether you are using a proprietary format or a public format make sure you don't use characters in the name which might give the ProteinProspector programs problems in sorting out the fields in the comment line.
3). Enter a species for the entry. This should be consistent with the information in the species.txt file.
4). Enter an accession number for the entry. The accession number must be unique; the program will alert you if it isn't. If your database uses numeric accession numbers then the accession number must be numeric.
5). Enter the protein or DNA sequence using only the upper case symbols for the 20 naturally occurring amino acids or the four bases as appropriate. X may also be used if the sequence is unknown at a particular point.
6). Press the Create or Append to User Database button.
Traditionalists and users afraid of mice can still use the command line version of FA-Index.
This raises the questions of which directory to launch FA-Index from and what syntax to use. We've tried to make this as simple as possible; however, system administrators can easily outsmart themselves, particularly if they want to alter the ProteinProspector directory structure.
Basically you should launch FA-Index from the directory immediately above the seqdb directory, without specifying the path to the database file. FA-Index inserts only seqdb/ in front of the filename, and it "knows" whether to put a forward slash or a back slash for your particular operating system.
If the FA-Index program does not reside in the directory immediately above the seqdb directory (the normal case on Windows NT systems) then you may need to specify the path to faindex (but not to the database).
On UNIX systems there is no reason why seqdb cannot be a symbolic link to another directory.
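The path rule described above (only seqdb plus the platform's separator is placed in front of the filename) can be sketched in Python as follows. This only illustrates the resulting relative path, not how FA-Index itself is written; the function name `database_path` and its `windows` flag are hypothetical.

```python
import ntpath
import posixpath


def database_path(filename, windows=False):
    """Build the relative path that would be opened for a database file."""
    # ntpath joins with a back slash, posixpath with a forward slash.
    join = ntpath.join if windows else posixpath.join
    return join("seqdb", filename)
```

Because the path is relative, it resolves correctly only when the program is launched from the directory immediately above seqdb.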
On SunOS UNIX systems issue a command of the form:
      /home/httpd//faindex.cgi Genpept.r95
On Windows NT systems use an MS-DOS command prompt to issue a command of the form:
      C:\http> faindex.cgi Genpept.r95
(you may first need to type)
      path=C:\http\
or try
      C:\http> \faindex.cgi Genpept.r95