PSI-BLAST Tutorial
Introduction

Position-specific iterative BLAST (PSI-BLAST) is a feature of BLAST 2.0 in which a profile, or position-specific scoring matrix (PSSM), is constructed automatically from a multiple alignment of the highest-scoring hits in an initial BLAST search. The PSSM is generated by calculating position-specific scores for each position in the alignment. Highly conserved positions receive high scores and weakly conserved positions receive scores near zero. The profile is then used to perform a second BLAST search, and the results of each "iteration" are used to refine the profile for the next one. This iterative searching strategy results in increased sensitivity.
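The scoring idea can be illustrated with a minimal sketch. It assumes a tiny ungapped alignment of made-up sequences, a uniform background frequency of 1/20 and a flat pseudocount; none of these choices come from the tutorial, and the weighting used by PSI-BLAST itself is more elaborate. The function name build_pssm is hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(alignment, pseudocount=1.0):
    """Return one {residue: log-odds score} dict per alignment column.

    Assumptions (illustrative only): ungapped sequences of equal length,
    a uniform background frequency, and a flat pseudocount per residue.
    """
    background = 1.0 / len(AMINO_ACIDS)
    total = len(alignment) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for column in zip(*alignment):
        counts = Counter(column)
        scores = {}
        for aa in AMINO_ACIDS:
            freq = (counts.get(aa, 0) + pseudocount) / total
            # Positive when the residue is seen more often than background.
            scores[aa] = math.log2(freq / background)
        pssm.append(scores)
    return pssm

# Hypothetical aligned hits: column 1 is fully conserved (M),
# column 2 is only partly conserved (K, R, K, Q).
toy_alignment = ["MKVL", "MRVI", "MKIL", "MQVV"]
pssm = build_pssm(toy_alignment)
```

In this toy case the conserved M in the first column scores higher than the most common residue in the variable second column, and residues never observed in a column score slightly below zero.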

This PSI-BLAST tutorial uses as its example the uncharacterized archaebacterial protein MJ0577 from Methanococcus jannaschii. The tutorial illustrates the potential for PSI-BLAST searches to identify even weak (subtle) homologies to annotated entries in the database. It demonstrates that PSI-BLAST is an important tool for predicting both biochemical activities and function from sequence relationships.

A BLAST search of this sequence revealed a number of probable homologs but no hits that provided useful information about the function or biochemical activities of this protein. Since this PSI-BLAST tutorial builds on the example in the BLAST tutorial, it is most useful to tackle the two tutorials in order. Go back to the BLAST tutorial now, if you missed it before.

The PSI-BLAST search in this tutorial has two purposes:
    (1) to identify distant relatives of the MJ0577 family and
    (2) to gain insight into the function of this family of proteins.



Step 1.  Choose the database to search.
Database

PSI-BLAST uses the blastp program exclusively, so there is no need to select the program.
The nr database has been selected for use in this example in order to do a thorough search of all available sequences.

Step 2.  Input the data.

Paste the query sequence, or its GI or accession number, into the query window. The "Query data is formatted as" field indicates which of these forms of input was supplied.


Step 3.   Set program options or choose defaults.

Expect:
The Expect value for inclusion in PSI-BLAST iteration 1:
Two different E value settings need to be specified in the PSI-BLAST program. The first of these (upper) sets the threshold for the initial BLAST search. The default value is 10, as in the standard BLAST program. In this example, the initial (BLAST) E value is set at 1.0. The second E value (lower) is the threshold for inclusion in the position-specific matrix used for PSI-BLAST iterations. In this example, the PSI-BLAST E value is left at the default setting of 0.001. With these settings the user can see (and selectively, based on prior knowledge, include) all of the BLAST hits up to E=1, while only those hits that meet the relatively rigorous E value threshold of 0.001 are included automatically.
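The effect of the two thresholds can be sketched as follows. The hit names and E values are invented for illustration, and treating a hit exactly at a cutoff as passing is an assumption rather than documented behavior.

```python
from dataclasses import dataclass

@dataclass
class Hit:
    accession: str   # hypothetical identifier, for illustration only
    e_value: float

def partition_hits(hits, report_threshold=1.0, inclusion_threshold=0.001):
    """Split hits into (auto_included, shown_only) using the two cutoffs.

    Hits above report_threshold are not reported at all.
    Assumption: a hit exactly at a cutoff passes it.
    """
    reported = [h for h in hits if h.e_value <= report_threshold]
    auto_included = [h for h in reported if h.e_value <= inclusion_threshold]
    shown_only = [h for h in reported if h.e_value > inclusion_threshold]
    return auto_included, shown_only

hits = [
    Hit("hitA", 1e-30),
    Hit("hitB", 5e-4),
    Hit("hitC", 0.02),
    Hit("hitD", 0.8),
    Hit("hitE", 4.0),   # above E=1, so never displayed
]
auto_included, shown_only = partition_hits(hits)
```

Here hitA and hitB would enter the profile automatically, hitC and hitD would be displayed for the user to include by hand if prior knowledge justifies it, and hitE would not appear.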

Filter Low complexity
It is appropriate to filter most queries for low complexity sequences. By taking an advance peek at the first alignment in the BLAST output, it can be seen that MJ0577 has no low complexity regions that are detected by the SEG filtering algorithm. Low complexity regions would appear as X's in the alignment of MJ0577 with itself.

Some types of low complexity sequences may not be detected by the filtering option in BLAST. For example, coiled-coil and transmembrane regions need to be detected using the appropriate programs outside of BLAST. As an example, the COILS algorithm was used to look for coiled-coil regions in MJ0577. The MJ0577 open reading frame is found to have a coiled-coil region, as seen in this COILS output page. Since coiled-coil encoding sequences can lead to matches with other coiled-coil proteins and thus obscure more meaningful hits, the user might consider manually masking the region to optimize the sensitivity of the search. To do this, replace the amino acids between aa 71 (SLLL) and aa 120 (IIVV) with X's. Click here for a query window in which this has already been done.
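Manual masking of this kind is easy to script. The sketch below uses a placeholder sequence rather than the real MJ0577 sequence and assumes that the coordinates 71 and 120 are 1-based and inclusive; the helper name mask_region is hypothetical.

```python
def mask_region(sequence, start, end, mask_char="X"):
    """Replace residues start..end (1-based, inclusive) with mask_char."""
    if not 1 <= start <= end <= len(sequence):
        raise ValueError("region must lie within the sequence")
    return sequence[:start - 1] + mask_char * (end - start + 1) + sequence[end:]

# Placeholder sequence, NOT the real MJ0577 sequence.
toy_sequence = "A" * 150
masked = mask_region(toy_sequence, 71, 120)
```

The masked sequence keeps its original length, so alignment coordinates reported by the search still correspond to positions in the unmasked protein.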




Matrix Gap existence cost Per residue gap cost Lambda ratio
BLOSUM62 is a general purpose matrix and the default choice in PSI-BLAST. A BLOSUM matrix assigns a score to each pair of aligned residues, based on the frequency with which that substitution is observed among related proteins. BLOSUM62 is among the best of the available matrices for detecting weak protein similarities. Other supported options include PAM30, PAM70, BLOSUM80, and BLOSUM45.
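As a rough illustration of how a frequency-based substitution score arises, the sketch below computes a log-odds score from an observed pair frequency and the two background residue frequencies. The formula, with its scaling to half-bit units, follows the usual description of BLOSUM construction rather than anything stated in this tutorial, and the frequencies are invented.

```python
import math

def log_odds_score(pair_freq, freq_a, freq_b, scale=2.0):
    """Rounded log-odds score for one residue pair.

    pair_freq: observed frequency of the aligned pair among related proteins
    freq_a, freq_b: background frequencies of the two residues
    scale=2.0 gives half-bit units (an assumption about the scaling used).
    """
    return round(scale * math.log2(pair_freq / (freq_a * freq_b)))

# Made-up frequencies: a pair seen 8x more often than chance, and one seen 2x less.
favoured = log_odds_score(0.02, 0.05, 0.05)
disfavoured = log_odds_score(0.00125, 0.05, 0.05)
```

Substitutions that occur more often than expected by chance get positive scores, and those that occur less often get negative scores.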

Other advanced options:
In the "Advanced Options" field it is possible to specify gap costs, word size, and other parameters not otherwise selectable on the query form. Output formatting options may also be adjusted here. For example, the user might type: "-v150" to cause 150 descriptions (rather than 100 or 250 available through the pull-down menu) to be displayed. Find out how to specify these options using the details button.

Step 4.   Set the output formatting options.

NCBI-gi Graphical overview
In this example, the NCBI-gi designation is checked to facilitate additional searches (described later) that investigate the significance of a given alignment. The graphics corresponding to PSI-BLAST iterations are still in development; the graphic is therefore suppressed in this PSI-BLAST tutorial.

Descriptions Alignments

The default number of descriptions and alignments to be listed is 500. Although it may seem useful to change the default to something smaller to control the magnitude of the output, these variables affect the search in two important ways:

First, if the total number of hits in which E is less than the threshold exceeds the number (x) of descriptions requested, only the top x most significant would be listed; additional possibly significant alignments would not be shown, though these may embody important information.

Second, the number of sequences used in generating the multiple alignment and the position-specific matrix is set by the larger of the two variables (descriptions, alignments). If, at any point in the iterative PSI-BLAST process, significant sequences are omitted from the profile, all subsequent output will be affected.

By selecting a large number of descriptions (e.g. 250-500) it is possible to ensure that the E value and not the description limit will be the determining factor in generating the profile to be used for additional iterations. Reducing the output can then be accomplished, if desired, by limiting the number of alignments to be reported.
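The interaction between the description limit and the E value threshold described above can be sketched as follows. This is a simplified model of the stated rule, not the program's actual code; the function name and the E values are hypothetical.

```python
def profile_members(e_values, inclusion_threshold, descriptions, alignments):
    """Return the E values of the hits that would feed the next profile.

    Simplified model: only the top max(descriptions, alignments) hits are
    considered, and of those only hits at or below the inclusion threshold
    are kept.
    """
    limit = max(descriptions, alignments)
    ranked = sorted(e_values)
    return [e for e in ranked[:limit] if e <= inclusion_threshold]

# 300 hypothetical hits, all well below the 0.001 inclusion threshold.
significant = [1e-20 * (i + 1) for i in range(300)]

truncated = profile_members(significant, 0.001, descriptions=100, alignments=50)
complete = profile_members(significant, 0.001, descriptions=500, alignments=50)
```

With a limit of 100 descriptions, 200 significant hits are silently left out of the profile. With 500 descriptions all 300 are used, and the alignments setting can still be kept small to shorten the output.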



Alignment view

A variety of different alignment formats are available. The choice of which to use is based on personal preference. Pairwise alignment gives a good view of the quality of an individual hit. However, a flat query-anchored alignment (with identities) is a format in which identities shared by numerous sequences can be easily spotted.

Step 5.   Perform the search.
Click on the search button now to initiate the search. In seconds, the query sequence has been compared to all of the entries in the specified database. Each comparison is scored and the top scores are listed in rank order.
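For readers who work from the command line rather than the web form, a comparable search can be approximated with the standalone psiblast program from the BLAST+ suite. The sketch below only builds the argument list and does not run it. It is an assumption that a local installation and a local copy of the database are available; the query file name and the iteration count are hypothetical, and the other values mirror the settings chosen in the steps above.

```python
import shlex

def build_psiblast_command(query_fasta, database="nr", iterations=3,
                           report_evalue=1.0, inclusion_evalue=0.001,
                           matrix="BLOSUM62", descriptions=500, alignments=500):
    """Build an argument list for the BLAST+ psiblast program (not executed)."""
    return [
        "psiblast",
        "-query", query_fasta,
        "-db", database,
        "-num_iterations", str(iterations),            # assumed value
        "-evalue", str(report_evalue),                 # upper E value (Step 3)
        "-inclusion_ethresh", str(inclusion_evalue),   # lower E value (Step 3)
        "-matrix", matrix,
        "-num_descriptions", str(descriptions),
        "-num_alignments", str(alignments),
    ]

command = build_psiblast_command("mj0577.fasta")  # hypothetical file name
print(shlex.join(command))
```

The list can be passed to subprocess.run once the program and database are in place. Keeping it as a list rather than a single string avoids shell quoting problems.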

  

Revised May 17, 2000
