BLAST topics

A. Query Input and database selection

The query sequence(s) to be used for a BLAST search should be pasted in the 'Search' text area. BLAST accepts a number of different types of input and automatically determines the format of the input. To allow this feature, certain conventions are required for the input of identifiers (e.g., accessions or gi's). These are described in 3) below. Accepted input types are FASTA, bare sequence, or sequence identifiers.

Accepted Input Formats

  1. FASTA

    A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line (defline) is distinguished from the sequence data by a greater-than (">") symbol at the beginning. It is recommended that all lines of text be shorter than 80 characters in length. An example sequence in FASTA format is:

    		>P01013 GENE X PROTEIN (OVALBUMIN-RELATED)
    		QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQESKPVQMMCMNNSFNVATLPAE
    		KMKILELPFASGDLSMLVLLPDEVSDLERIEKTINFEKLTEWTNPNTMEKRRVKVYLPQMKIEEKYNLTS
    		VLMALGMTDLFIPSANLTGISSAESLKISQAVHGAFMELSEDGIEMAGSTGVIEDIKHSPESEQFRADHP
    		FLFLIKHNPTNTIVYFGRYWSP
    		

    Blank lines are not allowed in the middle of FASTA input.

    Sequences are expected to be represented in the standard IUB/IUPAC amino acid and nucleic acid codes, with these exceptions: lower-case letters are accepted and are mapped into upper-case; a single hyphen or dash can be used to represent a gap of indeterminate length; and in amino acid sequences, U and * are acceptable letters (see below). Before submitting a request, any numerical digits in the query sequence should either be removed or replaced by appropriate letter codes (e.g., N for unknown nucleic acid residue or X for unknown amino acid residue). The nucleic acid codes supported are:

    		A  adenosine          C  cytidine             G  guanine
    		T  thymidine          N  A/G/C/T (any)        U  uridine 
    		K  G/T (keto)         S  G/C (strong)         Y  T/C (pyrimidine) 
    		M  A/C (amino)        W  A/T (weak)           R  G/A (purine)        
    		B  G/T/C              D  G/A/T                H  A/C/T      
    		V  G/C/A              -  gap of indeterminate length
    		

    For those programs that use amino acid query sequences (BLASTP and TBLASTN), the accepted amino acid codes are:

    		A  alanine               P  proline       
    		B  aspartate/asparagine  Q  glutamine      
    		C  cysteine              R  arginine      
    		D  aspartate             S  serine      
    		E  glutamate             T  threonine      
    		F  phenylalanine         U  selenocysteine
    		G  glycine               V  valine        
    		H  histidine             W  tryptophan        
    		I  isoleucine            Y  tyrosine
    		K  lysine                Z  glutamate/glutamine
    		L  leucine               X  any
    		M  methionine            *  translation stop
    		N  asparagine            -  gap of indeterminate length
    		
    NOTE:
    ¹ The degenerate nucleotide codes (marked in red in the original table) are treated as mismatches in nucleotide alignments. Too many such degenerate codes within an input nucleotide query will cause the BLAST webpage to reject the input. For protein queries, too many nucleotide-like codes (A, C, G, T, N) may cause a similar rejection.
    ² The BLAST webpage will not accept "-" in the query. To represent gaps, use a string of N or X instead.

  2. Bare Sequence

    This may be just lines of sequence data, without the FASTA definition line, e.g.:

    		QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQESKPVQMMCMNNSFNVATLPAE
    		KMKILELPFASGDLSMLVLLPDEVSDLERIEKTINFEKLTEWTNPNTMEKRRVKVYLPQMKIEEKYNLTS
    		VLMALGMTDLFIPSANLTGISSAESLKISQAVHGAFMELSEDGIEMAGSTGVIEDIKHSPESEQFRADHP
    		FLFLIKHNPTNTIVYFGRYWSP
    	
    It can also be sequence interspersed with numbers and/or spaces, such as the sequence portion of a GenBank/GenPept flatfile report:
    		  1 qikdllvsss tdldttlvlv naiyfkgmwk tafnaedtre mpfhvtkqes kpvqmmcmnn
    		 61 sfnvatlpae kmkilelpfa sgdlsmlvll pdevsdleri ektinfeklt ewtnpntmek
    		121 rrvkvylpqm kieekynlts vlmalgmtdl fipsanltgi ssaeslkisq avhgafmels
    		181 edgiemagst gviedikhsp eseqfradhp flflikhnpt ntivyfgryw sp
    	

    Blank lines are not allowed in the middle of bare sequence input.

  3. Identifiers

    Normally these are simply an accession or accession.version. The identifier may consist of only one token (i.e., word). Spaces between letters in the input will cause it to be treated as bare sequence (spaces before or after the identifier are allowed). Examples of illegal input are:

    		ACCESSION   P01013
    		AAA68881. 1
    		gi| 129295
    	
    For the first example, "ACCESSION" must be removed; in the second example, there is a space before the version number of the accession; in the third example, there is a space after the bar ("|").

    If more than one query is specified, each identifier should be on a separate line.
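The input conventions above can be sketched in a few lines of Python. This is only an illustration: the helper names (`clean_bare_sequence`, `invalid_codes`, `guess_input_type`) are hypothetical, and the rule used to recognize an identifier (a single token containing a digit) is an assumption of this sketch, not BLAST's actual detection logic.

```python
import re

# Hypothetical helpers for illustration; they are not part of BLAST.
# The alphabets are copied from the code tables above.
NUCLEOTIDE_CODES = set("ACGTUNKSYMWRBDHV-")
AMINO_ACID_CODES = set("ABCDEFGHIKLMNPQRSTUVWXYZ*-")


def clean_bare_sequence(text):
    """Drop digits and whitespace (e.g. from a GenBank flatfile) and upper-case."""
    return re.sub(r"[\d\s]+", "", text).upper()


def invalid_codes(sequence, alphabet):
    """Return the sorted characters of `sequence` that are not in `alphabet`."""
    return sorted(set(sequence.upper()) - alphabet)


def guess_input_type(text):
    """Very rough guess: 'fasta', 'identifier', or 'bare sequence'.

    Assumption of this sketch: every line is a single token that contains a
    digit => identifier. Spaces inside a line make it a bare sequence.
    """
    lines = [line.strip() for line in text.strip().splitlines()]
    if lines and lines[0].startswith(">"):
        return "fasta"
    if lines and all(len(line.split()) == 1 and re.search(r"\d", line) for line in lines):
        return "identifier"
    return "bare sequence"


print(guess_input_type("AAA68881. 1"))                   # bare sequence
print(clean_bare_sequence("  1 qikdllvsss tdldttlvlv"))  # QIKDLLVSSSTDLDTTLVLV
print(invalid_codes("ACGT12JO", NUCLEOTIDE_CODES))       # ['1', '2', 'J', 'O']
```

Note that the illegal identifier examples listed above all contain a space, which is why this sketch would classify them as bare sequence.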

Upload file

This function allows users to upload a text file containing queries formatted in FASTA format. The file can also contain sequence identifiers instead of FASTA sequences.

Query subrange

A segment of the query sequence can be used in BLAST searching. Enter the range in the "From" and "To" boxes provided under "Query subrange" to specify the position of this segment. For example, to limit matches to the region from 24 to 200 of a query sequence, enter 24 in the "From" field and 200 in the "To" field. If one of the limits you enter is out of range, the intersection of the [From,To] and [1,length] intervals will be searched, where length is the length of the whole query sequence.
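The interval intersection described above can be written as a small Python sketch. The function name `clamp_subrange` is hypothetical, and returning `None` when the two intervals do not overlap is an assumption of this example; the text does not say how BLAST handles that case.

```python
def clamp_subrange(from_pos, to_pos, length):
    """Intersect the 1-based inclusive range [from_pos, to_pos] with [1, length].

    Returns (from, to), or None when the intervals do not overlap
    (an assumption of this sketch).
    """
    low = max(from_pos, 1)
    high = min(to_pos, length)
    return (low, high) if low <= high else None


print(clamp_subrange(24, 200, 500))  # (24, 200)
print(clamp_subrange(24, 200, 150))  # (24, 150)
```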

Query Genetic Code

Genetic code to be used in blastx and tblastx translation of the query. See list of Genetic Codes in Taxonomy.

B. BLAST Search Parameters

Limit by Organism

A BLAST search may be limited by organism. The entry field will suggest completions once a user starts typing. A checkbox will exclude rather than include the organism in the search.

Limit by Entrez Query

A BLAST search can be limited to the result of an Entrez query against the database chosen. This restricts the search to a subset of entries from that database fitting the requirement of the Entrez query. Terms normally accepted by Entrez nucleotide or protein searches are accepted here. Examples are given below.

  • protease NOT hiv1[organism]

    This will limit a BLAST search to all proteases, except those in HIV 1.

  • 1000:2000[slen]

    This limits the search to entries with lengths between 1000 and 2000 bases for nucleotide entries, or 1000 and 2000 residues for protein entries.

  • Mus musculus[organism] AND biomol_mrna[properties]

    This limits the search to mouse mRNA entries in the database. For common organisms, one can also select from the pulldown menu.

  • 10000:100000[mlwt]

    This is another example, which limits the search to protein sequences with a calculated molecular weight between 10 kD and 100 kD.

  • src specimen voucher[properties]

    This limits the search to entries that are annotated with a /specimen_voucher qualifier on the source feature.

  • all[filter] NOT environmental sample[filter] NOT metagenomes[orgn]

    This excludes sequences from metagenome studies and uncultured sequences from anonymous environmental sample studies.

For help in constructing Entrez queries, please see the "Writing Advanced Search Statements" section of the Entrez Help document. Knowing the content of a database and applying the Entrez terms accordingly are important. For example, biomol_mrna[prop] should not be applied to the htgs or chromosome databases, since they do not contain mRNA entries.

Compositional adjustments

Amino acid substitution matrices may be adjusted in various ways to compensate for the amino acid compositions of the sequences being compared. The simplest adjustment is to scale all substitution scores by an analytically determined constant, while leaving the gap scores fixed; this procedure is called "composition-based statistics" (Schaffer et al., 2001). The resulting scaled scores yield more accurate E-values than standard, unscaled scores. A more sophisticated approach adjusts each score in a standard substitution matrix separately to compensate for the compositions of the two sequences being compared (Yu et al., 2003; Yu and Altschul, 2005; Altschul et al., 2005). Such "compositional score matrix adjustment" may be invoked only under certain specific conditions for which it has been empirically determined to be beneficial (Altschul et al., 2005); under all other conditions, composition-based statistics are used. Alternatively, compositional adjustment may be invoked universally.

[1] Schaffer, A.A., Aravind, L., Madden, T.L., Shavirin, S., Spouge, J.L., Wolf, Y.I., Koonin, E.V. and Altschul, S.F. (2001) "Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements," Nucleic Acids Res. 29:2994-3005.
[2] Yu, Y.-K., Wootton, J.C. and Altschul, S.F. (2003) "The compositional adjustment of amino acid substitution matrices," Proc. Natl. Acad. Sci. USA 100:15688-15693.
[3] Yu, Y.-K. and Altschul, S.F. (2005) "The construction of amino acid substitution matrices for the comparison of proteins with non-standard compositions," Bioinformatics 21:902-911.
[4] Altschul, S.F., Wootton, J.C., Gertz, E.M., Agarwala, R., Morgulis, A., Schaffer, A.A. and Yu, Y.-K. (2005) "Protein database searches using compositionally adjusted substitution matrices," FEBS J 272(20):5101-9.

Filter

  • Filter (Low-complexity)

    This function masks off segments of the query sequence that have low compositional complexity, as determined by the SEG program of Wootton and Federhen (Computers and Chemistry, 1993) or, for BLASTN, by the DUST program of Tatusov and Lipman. Filtering can eliminate statistically significant but biologically uninteresting reports from the BLAST output (e.g., hits against common acidic-, basic- or proline-rich regions), leaving the more biologically interesting regions of the query sequence available for specific matching against database sequences.

    Filtering is only applied to the query sequence (or its translation products), not to database sequences. Default filtering is DUST for BLASTN, SEG for other programs.

    It is not unusual for nothing at all to be masked by SEG when it is applied to sequences in SWISS-PROT or RefSeq, so filtering should not be expected to always yield an effect. Furthermore, in some cases, sequences are masked in their entirety, indicating that the statistical significance of any matches reported against the unfiltered query sequence should be suspect. A fully masked query will also lead to a search error when the default settings are used.

  • Filter (Human repeats)

    This option masks human repeats (LINEs, SINEs, plus retroviral repeats) and is useful for human sequences that may contain these repeats. Filtering for repeats can increase the speed of a search, especially with very long sequences (>100 kb) and against databases that contain a large number of repeats (htgs). This filter should be checked for genomic queries to prevent potential problems that may arise from the numerous and often spurious matches to those repeat elements.

    For more information please see "Why does my search timeout on the BLAST servers?" in the BLAST Frequently Asked Questions.

  • Filter (Mask for lookup table only)

    BLAST searches consist of two phases, finding hits based upon a lookup table and then extending them. This option masks only for purposes of constructing the lookup table used by BLAST so that no hits are found based upon low-complexity sequence or repeats (if repeat filter is checked). The BLAST extensions are performed without masking and so they can be extended through low-complexity sequence.

  • Mask Lower Case

    With this option selected you can cut and paste a FASTA sequence in upper case characters and denote areas you would like filtered with lower case. This allows you to customize what is filtered from the sequence during the comparison to the BLAST databases.

One can use different combinations of the above filter options to achieve optimal search results.
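To illustrate the "Mask Lower Case" convention, the following Python sketch lists the lower-case regions of a pasted sequence and shows what the sequence looks like once those regions are replaced by a masking character. The function names are hypothetical, and the sketch only mimics the effect on the text of the query; how BLAST handles masked regions internally is not described here.

```python
import re


def lowercase_intervals(sequence):
    """Return 1-based inclusive (start, end) intervals of lower-case runs."""
    return [(m.start() + 1, m.end()) for m in re.finditer(r"[a-z]+", sequence)]


def mask_lowercase(sequence, mask_char="N"):
    """Replace each lower-case letter with `mask_char` (N for nucleotide, X for protein)."""
    return re.sub(r"[a-z]", mask_char, sequence)


query = "ACGTacgtACGTaa"
print(lowercase_intervals(query))  # [(5, 8), (13, 14)]
print(mask_lowercase(query))       # ACGTNNNNACGTNN
```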

Word-size

BLAST is a heuristic that works by finding word-matches between the query and database sequences. One may think of this process as finding "hot-spots" that BLAST can then use to initiate extensions that might eventually lead to full-blown alignments. For nucleotide-nucleotide searches (i.e., "blastn") an exact match of the entire word is required before an extension is initiated, so that one normally regulates the sensitivity and speed of the search by increasing or decreasing the word-size. For other BLAST searches non-exact word matches are taken into account based upon the similarity between words. The amount of similarity can be varied. The webpage allows the word-sizes 2, 3, and 6.
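The idea of exact word matches for nucleotide searches can be sketched as follows. This is a simplified illustration, not BLAST's implementation: the function name `exact_word_hits` is hypothetical, and the sketch only finds the "hot-spots" (it performs no extension).

```python
from collections import defaultdict


def exact_word_hits(query, subject, word_size):
    """Return 0-based (query_offset, subject_offset) pairs where both
    sequences contain the same word of length `word_size`."""
    # Lookup table: word -> offsets at which it starts in the query.
    lookup = defaultdict(list)
    for i in range(len(query) - word_size + 1):
        lookup[query[i:i + word_size]].append(i)

    # Scan the subject and report every word that is also in the table.
    hits = []
    for j in range(len(subject) - word_size + 1):
        for i in lookup.get(subject[j:j + word_size], []):
            hits.append((i, j))
    return hits


print(exact_word_hits("ACGTACGT", "TTACGTAA", 5))  # [(3, 1), (0, 2)]
```

A larger word size yields fewer hits (faster, less sensitive); a smaller one yields more hits (slower, more sensitive).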

Expect

This setting specifies the statistical significance threshold for reporting matches against database sequences. The default value (10) means that 10 such matches are expected to be found merely by chance, according to the stochastic model of Karlin and Altschul (1990). If the Expect value ascribed to a match is greater than the EXPECT threshold, the match will not be reported. Lower EXPECT thresholds are more stringent, leading to fewer chance matches being reported.

Reward and Penalty for Nucleotide Programs

Many nucleotide searches use a simple scoring system that consists of a "reward" for a match and a "penalty" for a mismatch. The (absolute) reward/penalty ratio should be increased as one looks at more divergent sequences. A ratio of 0.33 (1/-3) is appropriate for sequences that are about 99% conserved; a ratio of 0.5 (1/-2) is best for sequences that are 95% conserved; a ratio of about one (1/-1) is best for sequences that are 75% conserved [1].

[1] States DJ, Gish W, and Altschul SF (1991) METHODS: A companion to Methods in Enzymology 3:66-70.
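As a small illustration, the three ratios quoted above can be kept in a lookup table. The function `suggest_scores` is hypothetical, and picking the nearest listed conservation level is an assumption of this sketch rather than a recommendation from the cited paper.

```python
# Conservation level (percent) -> (reward, penalty), as quoted in the text above.
SUGGESTED_SCORES = {99: (1, -3), 95: (1, -2), 75: (1, -1)}


def suggest_scores(percent_identity):
    """Return the (reward, penalty) pair of the nearest listed conservation level."""
    nearest = min(SUGGESTED_SCORES, key=lambda level: abs(level - percent_identity))
    return SUGGESTED_SCORES[nearest]


print(suggest_scores(98))  # (1, -3)
print(suggest_scores(80))  # (1, -1)
```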

Matrix and Gap Costs

  • Matrix

    A key element in evaluating the quality of a pairwise sequence alignment is the "substitution matrix", which assigns a score for aligning any possible pair of residues. The matrix used in a BLAST search can be changed depending on the type of sequences you are searching with (see the BLAST Frequently Asked Questions). See more information on BLAST substitution matrices.

  • Gap Cost

    The pull-down menu shows the Gap Costs for the chosen Matrix. Only a limited number of options is available for these parameters. Increasing the Gap Costs will result in alignments with fewer gaps.

  • PSSM

    PSI-BLAST can save the Position Specific Score Matrix constructed through iterations. The PSSM thus constructed can be used in searches against other databases with the same query by copying and pasting the encoded text into the PSSM field.

    To save a PSSM file:
    1. Run a protein BLAST search.
    2. Check the PSI-BLAST box on the formatting page.
    3. Click the "Format" Button.
    4. On the PSI-BLAST results page, click the "Run PSI-BLAST Iteration 2" button.
    5. Select the Download link at the top of the page and download the PSSM to your computer.

    To use the PSSM in a new protein BLAST search against other databases:

    1. Open a new protein BLAST page.
    2. Select PSI-BLAST as the Algorithm under "Program Selection" (this may already be set).
    3. Select the "+" next to "Algorithm parameters" at the bottom of the search page.
    4. Scroll to the "PSI/PHI/DELTA BLAST" section and use the "Choose File" button to upload the PSSM that you saved in step 5 above.
    5. Select a different target database.
    6. Click the "BLAST" button to start the search.

    If the database is the same as when the PSSM was stored, you'll reproduce the iteration on which you've saved the PSSM; a different database will yield a different hit list.

PHI-BLAST Pattern

PHI-BLAST (Pattern-Hit Initiated BLAST) is a search program that combines matching of regular expressions with local alignments surrounding the match. Given a protein sequence S and a regular expression pattern P occurring in S, PHI-BLAST helps answer the question:

What other protein sequences both contain an occurrence of P and are homologous to S in the vicinity of the pattern occurrences?

PHI-BLAST may be preferable to just searching for pattern occurrences because it filters out those cases where the pattern occurrence is probably random and not indicative of homology. See PHI-BLAST pattern syntax for details.

C. Result Format Options

Graphical Overview

An overview of the database sequences aligned to the query sequence is shown. The score of each alignment is indicated by one of five different colors, which divide the range of scores into five groups. Multiple segments of alignments to the same database sequence are connected by a thin grey line. Mousing over a hit sequence causes the definition and score to be shown in the window at the top; clicking on a hit sequence takes the user to the associated alignments.

CDS feature

Checking this option allows the BLAST formatter to parse out the annotated sequence features found in the vicinity of hits and display them within the BLAST result. For custom query sequences, it will also translate the CDS using the CDS translation annotated on the matching database sequence as a guide. Mismatches in translation are highlighted in pink. A representative example with CDS translation is given below.

>gi|46452254|gb|AY585334.1| Sus scrofa cystic fibrosis transmembrane conductance regulator 
(CFTR) mRNA, complete cds
Length=4449

 Score = 5453 bits (2751),  Expect = 0.0
 Identities = 4036/4449 (90%), Gaps = 6/4449 (0%)
 Strand=Plus/Plus

CDS: Putative 1       1      M  Q  R  S  P  L  E  K  A  S  V  V  S  K  L  F  F  S  W  T 
Query                 133   ATGCAGAGGTCGCCTCTGGAAAAGGCCAGCGTTGTCTCCAAACnnnnnnnCAGCTGGACC  192
                            |||||||||||||||||||||||||||||| |  ||||||||||||||||||||||||||
Sbjct                 1     ATGCAGAGGTCGCCTCTGGAAAAGGCCAGCATCTTCTCCAAACTTTTTTTCAGCTGGACC  60
CDS:cystic fibrosis   1      M  Q  R  S  P  L  E  K  A  S  I  F  S  K  L  F  F  S  W  T 

CDS: Putative 1       21     R  P  I  L  R  K  G  Y  R  Q  R  L  E  L  S  D  I  Y  Q  I 
Query                 193   AGACCAATTTTGAGGAAAGGATACAGACAGCGCCTGGAATTGTCAGACATATACCAAATC  252
                            |||||||||||||| |||||||| |||||||||||||||||||||||||||||||| |||
Sbjct                 61    AGACCAATTTTGAGAAAAGGATATAGACAGCGCCTGGAATTGTCAGACATATACCATATC  120
CDS:cystic fibrosis   21     R  P  I  L  R  K  G  Y  R  Q  R  L  E  L  S  D  I  Y  H  I 

CDS: Putative 1       41     P  S  V  D  S  A  D  N  L  S  E  K  L  E  R  E  W  D  R  E 
Query                 253   CCTTCTGTTGATTCTGCTGACAATCTATCTGAAAAATTGGAAAGAGAATGGGATAGAGAG  312
                             |||||  ||| |||||||||||||| |||||||||||||||||||||||||| ||||| 
Sbjct                 121   TCTTCTTCTGACTCTGCTGACAATCTGTCTGAAAAATTGGAAAGAGAATGGGACAGAGAA  180
CDS:cystic fibrosis   41     S  S  S  D  S  A  D  N  L  S  E  K  L  E  R  E  W  D  R  E 

Masking

Two options determine how filter-masked regions are displayed.

  • Masking Character

    "X or N" displays the masked region as X for protein and N for nucleotide.
    "Lower Case" displays the masked region in lower-case letters.

  • Masking Color

    The masked region can be "highlighted" with grey or red colored fonts

Descriptions

This option restricts the number of short descriptions of matching sequences reported to the number specified. Default setting varies from page to page. See also EXPECT.

Alignment View

  • Pairwise

    The database alignments are displayed as pairs of matches between query and subject sequence. A middle line between the query and subject sequence displays the status of each letter. For protein alignments (e.g., BLASTP/BLASTX/TBLASTN), identities show the letter, conservative substitutions show a "+", and everything else shows nothing. For nucleotide alignments (e.g., BLASTN and megaBLAST), a "|" is shown for matches and nothing for mismatches. This is the default view.

  • Pairwise with dots for identities

    The database alignments are shown pairwise, anchored to (shown in relation to) the query sequence, with mismatches colored in red. Sbjct will be in red and bold font if a line in the alignment contains mismatches. See example below.

  • Query-anchored with dots for identities

    The database alignments are anchored to (shown in relation to) the query sequence. Identities are displayed as dots (.), with mismatches displayed as single letter abbreviations.

  • Query-anchored with letters for identities

    Identities are shown as single letter nucleotide abbreviations.

  • Flat Query-anchored with dots for identities

    The 'flat' display shows inserts as deletions on the query. Identities are displayed as dots (.), with mismatches displayed as single letter abbreviations.

  • Flat Query-anchored with letters for identities

    The 'flat' display shows inserts as deletions on the query. Identities are shown as single letter abbreviations.

>gi|21536448|ref|NM_002622.3|   Homo sapiens prefoldin 1 (PFDN1), mRNA
Length=1296

 Score =  392 bits (212),  Expect = 2e-107
 Identities = 220/223 (98%), Gaps = 3/223 (1%)
 Strand=Plus/Plus

Query  107  TCCTACCTGGAGCGAAG-GTTANAGGAAGCTGAGGACAACATCCGGGAGATGCTGATGGC  165
Sbjct  300  .................C....-.....................................  358

Query  166  ACGAAGGG-CCAGTAGGGAGCCTCTCTGGGAAGCTCTTCCTCCTGCCCCTCCCATTCCTG  224
Sbjct  359  ........C...................................................  418

Query  225  GTGGGGGCAGAGGAGTGTCTGCAGGGAAACAGCTTCTCCTCTGCCCCGATGGATGCTTTA  284
Sbjct  419  ............................................................  478

Query  285  TTTGGATGGCCTGGCAACATCACATTTTCTGCATCACCCTGAG  327
Sbjct  479  ...........................................  521
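The two display conventions described above (a middle line of "|" for nucleotide matches, and dots for identities) can be sketched in Python. The function names are hypothetical, and the sketch works on two already-aligned rows of equal length; it does not reproduce the BLAST formatter's coordinates or coloring.

```python
def pairwise_middle_line(query_row, subject_row):
    """Nucleotide-style middle line: '|' for a match, ' ' otherwise."""
    return "".join(
        "|" if q == s and q != "-" else " " for q, s in zip(query_row, subject_row)
    )


def dots_for_identities(query_row, subject_row):
    """Subject row with '.' wherever it is identical to the query row."""
    if len(query_row) != len(subject_row):
        raise ValueError("aligned rows must have the same length")
    return "".join("." if q == s else s for q, s in zip(query_row, subject_row))


query_row = "GCGAAG-GTTANAGG"
subject_row = "GCGAAGCGTTA-AGG"
print(query_row)
print(pairwise_middle_line(query_row, subject_row))  # |||||| |||| |||
print(dots_for_identities(query_row, subject_row))   # ......C....-...
```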

Download

The Download links allow downloads of the results as Text report, CSV, XML, ASN.1, or JSON.

Format for PSI-BLAST

The Position-Specific Iterated BLAST (PSI-BLAST) program performs iterative searches with a protein query, in which sequences found in one round of search are used to build a custom score model for the next round.

In PSI-BLAST the algorithm is not tied to a specific score matrix such as BLOSUM62, which is implemented as an AxA substitution matrix, where A is the alphabet size. Instead, it uses a QxA matrix, where Q is the length of the query sequence. At each position the cost of a letter depends on the position with regard to the query and the letter in the subject sequence.

To run this search, the "Format for PSI-BLAST" checkbox must be checked.
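The difference between an AxA matrix and a QxA matrix can be made concrete with a toy Python sketch. Everything here is made up for illustration: the four-letter alphabet, the scores, and the function name `score_with_pssm`. A real PSSM covers the full amino acid alphabet, and real alignments may include gaps.

```python
def score_with_pssm(pssm, alphabet, subject_segment):
    """Sum position-specific scores for an ungapped segment aligned to query position 0.

    `pssm` has Q rows (one per query position); each row holds one score
    per letter of `alphabet` (A columns).
    """
    column = {letter: k for k, letter in enumerate(alphabet)}
    return sum(pssm[i][column[letter]] for i, letter in enumerate(subject_segment))


# Toy matrix with made-up scores: Q = 3 query positions, A = 4 letters.
alphabet = "ACDE"
pssm = [
    [5, -1, -2, -3],
    [-2, 6, -1, -2],
    [-1, -2, 4, 0],
]
print(score_with_pssm(pssm, alphabet, "ACD"))  # 15
print(score_with_pssm(pssm, alphabet, "ACE"))  # 11
```

The same letter can score differently at different query positions, which is what distinguishes this model from a fixed substitution matrix.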

Inclusion Threshold

This sets the statistical significance threshold for including a sequence in the model used by PSI-BLAST to create the PSSM on the next iteration.

Limit results by entrez query

This function is similar to the "Limit by Entrez Query" option described above. The only difference is that it applies only to the identified hits. In other words, it is applied post-search and allows users to see only hits fitting the requirement of the Entrez query terms. The default is to format without input query terms and allow users to see all the hits.

Expect value range

This instructs the BLAST formatter to display hits with Expect values within the specified range. The default range is 0 to the Expect value setting. The lower bound goes in the first box, and the upper bound goes in the second box.
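The effect of an Expect value range can be pictured as a simple filter over the hit list. In this Python sketch the hit names and values are invented, the function name is hypothetical, and treating both bounds as inclusive is an assumption.

```python
def hits_in_expect_range(hits, low=0.0, high=10.0):
    """Keep (name, evalue) pairs whose E-value lies in [low, high]."""
    return [(name, evalue) for name, evalue in hits if low <= evalue <= high]


# Invented hits for illustration only.
example_hits = [("hit_a", 2e-107), ("hit_b", 0.003), ("hit_c", 4.2)]
print(hits_in_expect_range(example_hits, low=1e-50, high=1.0))  # [('hit_b', 0.003)]
```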

D. Rules for pattern syntax for PHI-BLAST

A Web PHI-BLAST search requires a pattern along with a protein sequence containing the pattern.

The syntax for pattern specification in PHI-BLAST follows the conventions of PROSITE. When using the stand-alone program, it is permissible to have multiple patterns in a file, separated by a blank line between patterns. When using the Web page, only one pattern is allowed per query.

Accepted PHI-BLAST Pattern Vocabulary

    Symbols                   Description
    ABCDEFGHIKLMNPQRSTVWXYZU  Protein alphabet
    ACGT                      DNA alphabet
    [ ]                       Any one of the characters enclosed in the brackets, e.g., [LFYT] means one occurrence of L or F or Y or T
    -                         Nothing; used as a spacer to clearly separate each position
    x                         With nothing following, means any residue
    (n)                       The preceding residue is repeated n times
    (m,n)                     The preceding residue is repeated between m and n times (n > m)
    >                         May occur only at the end of a pattern and means nothing; it may occur before a period
    .                         May be used at the end; means nothing

When using the stand-alone program, the pattern should be stored in a pattern input file, with the first line starting with ID followed by 2 spaces and a text string giving the pattern a name. There should also be a line starting with PA followed by 2 spaces and then the pattern description.

All other PROSITE codes in the first two columns are allowed, but only the HI code, described below, is relevant to PHI-BLAST.

Here is an example from PROSITE:

ID CNMP_BINDING_2; PATTERN.
AC PS00889;
DT OCT-1993 (CREATED); OCT-1993 (DATA UPDATE); NOV-1995 (INFO UPDATE).
DE Cyclic nucleotide-binding domain signature 2.
PA [LIVMF]-G-E-x-[GAS]-[LIVM]-x(5,11)-R-[STAQ]-A-x-[LIVMA]-x-[STACV].
NR /RELEASE=32,49340;
NR /TOTAL=57(36); /POSITIVE=57(36); /UNKNOWN=0(0); /FALSE_POS=0(0);
NR /FALSE_NEG=1; /PARTIAL=1;
CC /TAXO-RANGE=??EP?; /MAX-REPEAT=2;

The line starting with ID gives the pattern a name.

The lines starting with AC, DT, DE, NR, NR, and CC are relevant to PROSITE users, but irrelevant to PHI-BLAST. These lines are tolerated, but ignored by PHI-BLAST.

The line starting with PA describes the pattern, which can be explained as follows.

Explanation of PROSITE example

    Pattern Position  Pattern Syntax  Meaning
    1                 [LIVMF]         one of LIVMF
    2                 G               G
    3                 E               E
    4                 x               any one residue
    5                 [GAS]           one of GAS
    6                 [LIVM]          one of LIVM
    7                 x(5,11)         5 to 11 of any residue
    8                 R               R
    9                 [STAQ]          one of STAQ
    10                A               A
    11                x               any one residue
    12                [LIVMA]         one of LIVMA
    13                x               any one residue
    14                [STACV]         one of STACV

Note: the total length of this motif/pattern is between 18 and 24 residues.
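One way to get a feel for the pattern syntax is to translate a pattern into an ordinary regular expression. The Python sketch below handles only the vocabulary listed above (residues, bracketed sets, x, and the (n) and (m,n) repeats); the function name `prosite_to_regex` is hypothetical, and treating both x and X as "any residue" is an assumption. It is not how PHI-BLAST itself matches patterns.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression string."""
    # Trailing '>' signs and periods mean nothing, so drop them.
    body = pattern.strip().rstrip(">.")
    parts = []
    for element in body.split("-"):  # '-' only separates positions
        match = re.fullmatch(r"(\[[A-Z]+\]|[A-Za-z])(?:\((\d+)(?:,(\d+))?\))?", element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        residue, low, high = match.groups()
        regex = "." if residue in ("x", "X") else residue
        if low is not None:
            regex += "{%s}" % (low if high is None else f"{low},{high}")
        parts.append(regex)
    return "".join(parts)


pattern = "[LIVMF]-G-E-x-[GAS]-[LIVM]-x(5,11)-R-[STAQ]-A-x-[LIVMA]-x-[STACV]."
print(prosite_to_regex(pattern))
# [LIVMF]GE.[GAS][LIVM].{5,11}R[STAQ]A.[LIVMA].[STACV]
```

The resulting expression matches between 18 and 24 residues, in agreement with the note above.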

In this case the pattern ends with a period. It can end with nothing after the last specifying symbol or any number of > signs or periods or combination thereof. Given below is another example, illustrating the use of an HI line.

	ID ER_TARGET; PATTERN.
	PA [KRHQSA]-[DENQ]-E-L>.
	HI (19 22)
	HI (201 204)

In this example, the HI lines specify that the pattern occurs twice, once from positions 19 through 22 in the sequence and once from positions 201 through 204 in the sequence. These specifications are relevant when stand-alone PHI-BLAST is used with the seedp option, in which the interesting occurrences of the pattern in the sequence are specified. In this case the HI lines specify which occurrence(s) of the pattern should be used to find good alignments.

In general, the seedp option is more useful than the standard patternp option ONLY when the pattern occurs K > 1 times in the sequence AND the user is interested in matching to J < K of those occurrences. Then using the HI lines enables the user to specify which occurrences are of interest.
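The layout of a pattern input file can be illustrated with a minimal Python reader. This is a sketch under stated assumptions: the function name `parse_pattern_file` is hypothetical, the file is assumed to hold a single entry with a one-line PA record, and HI lines are assumed to have the form "(start stop)" shown in the example above.

```python
def parse_pattern_file(text):
    """Collect the ID, PA and HI lines of one pattern entry; ignore other codes."""
    entry = {"id": None, "pattern": None, "occurrences": []}
    for line in text.splitlines():
        code, _, rest = line.strip().partition(" ")
        rest = rest.strip()
        if code == "ID":
            entry["id"] = rest.split(";")[0]
        elif code == "PA":
            entry["pattern"] = rest
        elif code == "HI":
            start, stop = rest.strip("()").split()
            entry["occurrences"].append((int(start), int(stop)))
        # AC, DT, DE, NR, CC and any other codes are tolerated but ignored.
    return entry


example = """ID  ER_TARGET; PATTERN.
PA  [KRHQSA]-[DENQ]-E-L>.
HI  (19 22)
HI  (201 204)"""
print(parse_pattern_file(example))
```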

E. What is discontiguous Mega BLAST?

This version of Mega BLAST is designed specifically for comparison of diverged sequences, especially sequences from different organisms, which have alignments with a low degree of identity, where the original Mega BLAST is not very effective. The major difference is in the use of the 'discontiguous word' approach to finding initial offset pairs, from which the gapped extension is then performed.

Both Mega BLAST and all previous versions of nucleotide-nucleotide BLAST look for exact matches of certain length as the starting points for gapped alignments. When comparing less conserved sequences, i.e. when the expected share of identity between them is e.g. 80% and below, this traditional approach becomes much less productive than for the higher degree of conservation. Depending on the length of the exact match to start the alignments from, it either misses a lot of statistically significant alignments, or on the contrary finds too many short random alignments.

According to [1], as well as our own probability simulations, it turns out that if initial 'words' are based not on the exact match, but on a match of a certain set of nonconsecutive positions within longer segments of the sequences, the productivity of the word finding algorithm is much higher. This way fewer words are found overall, but more of them end up producing statistically significant alignments, than in the case of contiguous words of the same, and even shorter length than the number of matched positions in the discontiguous word.

As an example, we can define a pattern (template) of 0s and 1s of length, e.g., 21:

    100101100101100101101

For each pair of offsets in the query and subject sequences that are being compared, we compare the 21-nucleotide segments in these sequences ending at these offsets, and require a match only at those positions in the segments that correspond to the 1s in the above template.
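The template comparison just described can be sketched in Python. The function name `matches_template` is hypothetical, and the sketch compares two segments that have already been cut out of the sequences; it does not scan a database or perform any extension.

```python
TEMPLATE = "100101100101100101101"  # the length-21 example template from the text


def matches_template(segment_a, segment_b, template=TEMPLATE):
    """True if the two segments agree at every position marked '1' in the template."""
    if not (len(segment_a) == len(segment_b) == len(template)):
        raise ValueError("segments and template must have the same length")
    return all(a == b for a, b, t in zip(segment_a, segment_b, template) if t == "1")


a = "ACGTACGTACGTACGTACGTA"
b = a[0] + "TT" + a[3:]   # differs from a only at two '0' positions
c = "C" + a[1:]           # differs from a at the first position, which is a '1'
print(matches_template(a, b))  # True
print(matches_template(a, c))  # False
```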

There are several advantages in using this approach. First, the conditional probabilities of finding word hits satisfying discontiguous templates, given the expected identity percentage in the alignments between two sequences, are higher than for contiguous words with the same number of positions required to match. If two word hits are required to initiate a gapped extension, the effect of the discontiguous word approach is even larger. In both cases higher sensitivity is achieved because there is less correlation between successive words as the database sequence is scanned across the query sequence. Second, when comparing coding sequences, the conservation of the third nucleotide in every codon is not essential, so there is no need to require it when matching initial words. This implies the advantage of using templates based on the '110' pattern, which are called 'coding'. Finally, to achieve even higher sensitivity, one might combine two different discontiguous word templates and require any one of them to match at a given position to qualify it for the initial word hit.

The following options specific to this approach are supported:

Template length: 16, 18, 21.
Word size (i.e. number of 1s in the template): 11, 12
Template type: coding, non-coding.
Require two words for extension: yes/no.

The 'coding' templates are based on the 110 pattern, although more 0s are required for most of them, so some of the patterns become 010 or 100. These are the most effective for comparison of coding regions.

The non-coding templates attempt to minimize the correlation between successive words, when the database sequence is shifted by 4 positions against the query sequence. This means more 1s are concentrated at the ends of the template (at least 3 on each side).

When the option to require two words for extension is chosen, two word hits matching the template must be found within a distance of 50 nucleotides of one another.

Below are the exact discontiguous word template patterns for different combinations of word sizes and lengths:

   W = 11, t = 16, coding:     1101101101101101
   W = 11, t = 16, non-coding: 1110010110110111
   W = 12, t = 16, coding:     1111101101101101
   W = 12, t = 16, non-coding: 1110110110110111
   W = 11, t = 18, coding:     101101100101101101
   W = 11, t = 18, non-coding: 111010010110010111
   W = 12, t = 18, coding:     101101101101101101
   W = 12, t = 18, non-coding: 111010110010110111
   W = 11, t = 21, coding:     100101100101100101101
   W = 11, t = 21, non-coding: 111010010100010010111
   W = 12, t = 21, coding:     100101101101100101101
   W = 12, t = 21, non-coding: 111010010110010010111
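As a quick consistency check, the Python sketch below confirms that each template in the table above has length t and contains exactly W ones. The dictionary is copied from the table; the function name `check_templates` is hypothetical.

```python
# (W, t, type) -> template, copied from the table above.
TEMPLATES = {
    (11, 16, "coding"): "1101101101101101",
    (11, 16, "non-coding"): "1110010110110111",
    (12, 16, "coding"): "1111101101101101",
    (12, 16, "non-coding"): "1110110110110111",
    (11, 18, "coding"): "101101100101101101",
    (11, 18, "non-coding"): "111010010110010111",
    (12, 18, "coding"): "101101101101101101",
    (12, 18, "non-coding"): "111010110010110111",
    (11, 21, "coding"): "100101100101100101101",
    (11, 21, "non-coding"): "111010010100010010111",
    (12, 21, "coding"): "100101101101100101101",
    (12, 21, "non-coding"): "111010010110010010111",
}


def check_templates(templates):
    """Return the keys whose template length or number of 1s disagrees with (t, W)."""
    return [
        key
        for key, template in templates.items()
        if len(template) != key[1] or template.count("1") != key[0]
    ]


print(check_templates(TEMPLATES))  # []
```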

[1] Ma, B., Tromp, J. and Li, M. (2002) "PatternHunter: faster and more sensitive homology search," Bioinformatics 18(3):440-445.