 
 
 
 VAST: Vector Alignment Search Tool
 
   

This VAST help document provides general information about VAST and describes the output displays. A separate VAST Search help document describes how to use the VAST Search page, which allows you to enter a query structure as a PDB-formatted file.

 
     
 
VAST Help
 
   
 
What is VAST?
 
  VAST, short for Vector Alignment Search Tool, is a computer algorithm developed at NCBI and used to identify similar protein 3-dimensional structures by purely geometric criteria, and to identify distant homologs that cannot be recognized by sequence comparison. The similar 3D structures identified by VAST are also referred to as "structure neighbors."

The following article describes the algorithm and provides examples:

 
  Gibrat JF, Madej T, Bryant SH. Surprising similarities in structure comparison. Curr Opin Struct Biol. 1996 Jun; 6(3): 377-85. [PubMed]  
 
 
 
How can I find similar structures for a structure that is already in the Molecular Modeling Database (MMDB)?
 
  VAST is applied to every protein in the Molecular Modeling Database (MMDB) during MMDB data processing in order to identify similar 3D structures. The pre-computed results are accessible from a structure's summary page by either:
  • following the "Similar Structures" link in the upper right corner of the page. The resulting VAST summary page lists the protein molecules in the structure and the 3D domains found in each protein. Select any protein molecule or 3D domain to open a list of structures that are similar in shape to the protein or 3D domain you selected.

    OR


  • viewing the "show annotation" graphic for any protein molecule of interest, then clicking on the bar graphic for the overall protein molecule or for any 3D domain it contains in order to view a list of other structures that are similar in shape to the molecule or 3D domain you selected.
 
 
 
How can I compare a newly resolved 3D structure against all of the structures in the Molecular Modeling Database (MMDB)?
 
  If you have a newly determined protein structure that is not yet in MMDB, you can use the VAST Search service to input your data in PDB file format and compare your structure against all those in MMDB. The VAST Search Help document provides additional information about using the VAST Search page.

 
 
 
What information is displayed on the initial VAST results page?
 
  Whether you retrieve similar structures for a structure record that's available in the public database or for a newly resolved structure, the initial VAST results page will:

  • display a list of the protein molecules ("chains") in the query structure and the 3D domains that were identified in each protein; and


  • allow you to retrieve structures that are similar in shape to any protein or 3D domain from the structure.

The illustration below provides an example, using the 1PTH (MMDB ID 50885) structure for sheep prostaglandin H2 synthase-1:

 
  Illustration of the compact substructures, called 3D domains, in the structure for 1PTH, sheep prostaglandin H2 synthase, and the list of Similar Structures that have a similar shape to the whole protein molecule or to any 3D domain within it. Click anywhere on this image to open an interactive view of the 3D alignment of protein A, domain 1, in 1PTH and a sample similar structure, 1EGQ. Install Cn3D before clicking, if that program is not yet on your computer.  
 
 
 
What does the "Graphics" display show?
 
  After you select a protein molecule ("chain") or 3D domain of interest from the initial VAST results page, you will see a brief list of structures that are similar in shape to the protein molecule or 3D domain you selected.

By default, the results are shown as a Graphics display (illustrated below) and list only a "medium redundancy" subset of structure neighbors, with red bars representing the alignment footprint of each structure neighbor relative to the query protein. (More details about the display are provided below the illustrated example.)

Controls near the top of the page allow you to change to a Table display (described in the next section of this document), and/or to increase or decrease the number of hits shown on the VAST results page, with options that range from a "low redundancy" subset of proteins from structure records to "all sequences." After you select the desired options, be sure to press the "List" button to refresh the display.

 
  Illustration showing the default VAST results graphic display for the sheep prostaglandin H2 synthase-1 protein from the 1PTH structure record.

 
  The identifier for each structure neighbor is shown in the format of PDB ID + protein chain ID + 3D domain ID (e.g., 1Q4G A 4, which represents domain 4 in protein chain A from the 1Q4G structure record).
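
As an illustration of the identifier format described above, the following minimal Python sketch splits an identifier such as "1Q4G A 4" into its parts. The function name `parse_neighbor_id` and the `NeighborId` type are hypothetical helpers, not part of any NCBI tool, and the sketch assumes whitespace-separated fields (a blank chain name is not handled).

```python
from typing import NamedTuple, Optional


class NeighborId(NamedTuple):
    pdb_id: str
    chain: Optional[str]   # None when no chain name is given
    domain: Optional[int]  # None when the identifier refers to the whole chain


def parse_neighbor_id(text: str) -> NeighborId:
    """Split an identifier such as '1Q4G A 4' into PDB ID, chain, and domain.

    Assumption: fields are separated by whitespace. Blank chain names,
    which PDB allows, are not handled by this sketch.
    """
    parts = text.split()
    if not 1 <= len(parts) <= 3:
        raise ValueError(f"unexpected identifier: {text!r}")
    pdb_id = parts[0].upper()
    if len(pdb_id) != 4:
        raise ValueError(f"expected a four-character PDB ID, got {pdb_id!r}")
    chain = parts[1] if len(parts) >= 2 else None
    domain = int(parts[2]) if len(parts) == 3 else None
    return NeighborId(pdb_id, chain, domain)


neighbor = parse_neighbor_id("1Q4G A 4")
print(neighbor.pdb_id, neighbor.chain, neighbor.domain)
```

An identifier without a domain number, such as "1PTH A", is treated here as referring to the whole chain.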

The red bars indicate the regions (residues) of the query domain that can be superimposed on residues from each neighbor. The gray bars and blank space are unaligned regions. These colors match those shown in Cn3D when a structure superposition is viewed there. Hovering the mouse over an icon displays a description of what it represents.

On the sequence ruler next to the query protein ("1PTH A" in the illustration above), the aligned region is the combined footprint of the aligned regions from all neighbors, i.e. the largest portion of the query that is similar to at least one other structure. The individual 3D domains in the chain are indicated by rectangles below the sequence ruler, with different colors and numbers. MMDB's 3D domains are defined on the basis of structural compactness. Red indicates the query domain. Links to the Conserved Domain Database are provided for convenience, to supply names and descriptions (where possible) of the 3D domains to which they correspond.

The check box at the left-hand side of a structure neighbor's "row" allows you to select individual neighbors for 3D superposition. Clicking the sequence identifier beside it opens the Entrez sequence page of the neighbor. The red aligned regions in a neighbor's sequence are displayed at the positions of their equivalent residues in the query sequence. Clicking on these will display an HTML view of the sequence alignment between the query and the neighbor. One of the VAST similarity measures used for sorting (here, the alignment length: e.g., 551 residues from 1PTH_A are aligned with 1Q4GA) is listed at the right-hand side of the row. Clicking the name of the similarity measure (i.e., "Ali_len" in our example) will display a table with all of the VAST statistics.
 
 
 
 
What does the "Table" display show, and what VAST similarity measures are listed?
 
  The display controls at the top of a VAST results page allow you to change the display from the default "Graphics" format (described in the previous section) to a "Table" format.

The "Table" display lists the identifier for each structure neighbor in the format of PDB ID + protein chain ID + 3D domain ID (e.g., 1Q4G A 4, which represents domain 4 in protein chain A from the 1Q4G structure record), its description, and a number of measures of structural similarity. The columns in the table include:

  • Check box: Allows you to select the structure neighbors you'd like to view in a 3D alignment with the query protein structure.
  • PDB: The four-character PDB-Identifier of the structure neighbor. Click on the Identifier to switch to the MMDB Summary page of the respective neighbor.
  • C: The PDB chain name. A blank space indicates that the chain does not have an identifier (many protein structures have a single chain only). Note that non-alphanumeric characters such as dashes, hyphens, underscores, etc. may be used as chain names by PDB.
  • D: The MMDB 3D domain identifier. Domains are parsed based on geometrical criteria (the ratio of intradomain contacts to interdomain contacts) by an automatic method and can be visualized with Cn3D.
  • Aligned Length: The number of equivalent pairs of C-alpha atoms superimposed between the two structures, i.e. how many residues have been used to calculate the 3D superposition.
  • SCORE: The VAST structure-similarity score. This number is related to the number of secondary structure elements superimposed and the quality of that superposition. Higher VAST scores correlate with higher similarity.
  • P-VAL: The VAST p value is a measure of the significance of the comparison, expressed as a probability. For example, if the p value is 0.001, then the odds are 1000 to 1 against seeing a match of this quality by pure chance. The p value from VAST is adjusted for the effects of multiple comparisons using the assumption that there are 500 independent and unrelated types of domains in the MMDB database. The p value shown thus corresponds to the p value for the pairwise comparison of each domain pair, divided by 500.
  • RMSD: The root mean square superposition residual in Angstroms. This number is calculated after optimal superposition of two structures, as the square root of the mean square distances between equivalent C-alpha atoms. Note that the RMSD value scales with the extent of the structural alignments and that this size must be taken into consideration when using RMSD as a descriptor of overall structural similarity.
  • %Id: Percent identical residues in the aligned sequence region. This is a raw measure of sequence similarity in the parts of the proteins that have been superimposed.
  • LHM: Loop Hausdorff Metric. A Loop Similarity measure that shows how well two structures conform to each other in the loop regions, after structural superposition. The "loop regions" are the parts of the structures between aligned secondary structure elements (helices and strands). LHM is measured in Angstroms, with a smaller value indicative of greater similarity. The loop similarity may be undefined (indicated by 'NA') if there are too many residues with missing coordinates in the loops. Citation: Analysis of protein homology by assessing the (dis)similarity in protein loop regions
  • GSP: Gapped Score. A combination (algebraic) score that uses RMSD, aligned length, and the number of gapped regions in the alignment. A smaller gapped score correlates with greater similarity. Citation: Comprehensive evaluation of protein structure alignment methods: scoring by geometric measures.
  • Description: A string parsed out of PDB's COMPOUND records that describes the nature of the structure neighbor.
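
To make the RMSD and %Id definitions above concrete, here is a minimal Python sketch. It assumes the two structures have already been optimally superposed and that equivalent C-alpha atoms are supplied as paired coordinate lists; it does not perform the superposition itself and is not the VAST implementation. The function names are hypothetical.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def rmsd(query: Sequence[Point], neighbor: Sequence[Point]) -> float:
    """Root mean square distance over equivalent, ALREADY superposed C-alpha pairs.

    The number of pairs is the aligned length.
    """
    if len(query) != len(neighbor) or not query:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = sum(math.dist(p, q) ** 2 for p, q in zip(query, neighbor))
    return math.sqrt(total / len(query))


def percent_identity(query_seq: str, neighbor_seq: str) -> float:
    """Percent of aligned residue pairs that are identical (one letter per pair)."""
    if len(query_seq) != len(neighbor_seq) or not query_seq:
        raise ValueError("need two non-empty aligned sequences of equal length")
    same = sum(a == b for a, b in zip(query_seq, neighbor_seq))
    return 100.0 * same / len(query_seq)


# Two aligned pairs, one of them 2 Angstroms apart.
print(rmsd([(0, 0, 0), (1, 0, 0)], [(0, 0, 0), (1, 0, 2)]))
print(percent_identity("ACDE", "ACDF"))
```

Because the mean is taken over the aligned pairs only, two RMSD values are directly comparable only when the aligned lengths are similar, which is the caveat noted in the RMSD entry.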
 
 
 
How are non-redundant subsets of protein chains selected?
 
  MMDB chains are clustered into groups according to their amino acid sequence similarity in pairwise comparisons. A representative chain is selected from each group to compile a non-redundant subset of MMDB, and only one representative of each group is shown in a neighbor list calculated by VAST. By default, the medium redundancy level (BLAST p value of 10e-40) is used to report structure neighbors. This keeps the table shorter while providing the most informative summary of structural relationships in MMDB.

All-against-all pairwise comparisons of MMDB-domains are calculated with the BLAST algorithm, setting a fixed database size parameter of 500,000 residues. Sequences are then clustered into groups by single linkage, whereby a sequence is merged into a group if it shows a BLAST p value of C or less with any member of the group, where C is the cutoff for the chosen redundancy level. There are 5 levels of redundancy defined in the MMDB database:
  1. Low redundancy: representatives are chosen from each group where sequences show a BLAST p value of 10e-7 to each other
  2. Medium redundancy: representatives are chosen from each group where sequences show a BLAST p value of 10e-40 to each other
  3. High redundancy: representatives are chosen from each group where sequences show a BLAST p value of 10e-80 to each other
  4. Non-identical sequence level: representatives are chosen from each group where sequences are not identical to each other
  5. All sequences level: this is the most redundant level, which includes all of MMDB sequences
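
The single-linkage step described above can be sketched in Python as follows. This is an illustration only, not the MMDB pipeline: the function name, the input format (a dictionary of pairwise p values), and the example values are all hypothetical, and the cutoff written as "10e-40" in this document is read here as 10^-40 (`1e-40`), which is an assumption.

```python
from typing import Dict, Iterable, List, Tuple


def single_linkage(ids: Iterable[str],
                   pair_pvalues: Dict[Tuple[str, str], float],
                   cutoff: float) -> List[List[str]]:
    """Group sequences: a sequence joins a group if its p value with ANY member
    of that group is at or below the cutoff (implemented with union-find)."""
    parent = {i: i for i in ids}

    def find(x: str) -> str:
        # Follow parent links to the group root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), p in pair_pvalues.items():
        if p <= cutoff:
            parent[find(a)] = find(b)

    groups: Dict[str, List[str]] = {}
    for i in parent:
        groups.setdefault(find(i), []).append(i)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical p values: A-B and B-C pass the medium cutoff, A-C does not.
pvalues = {("A", "B"): 1e-50, ("B", "C"): 1e-45, ("A", "C"): 1e-10, ("C", "D"): 1e-5}
print(single_linkage("ABCD", pvalues, cutoff=1e-40))
```

In this example A and C end up in the same group through their shared link to B, even though their own pairwise p value is above the cutoff. That chaining behavior is what distinguishes single linkage from stricter clustering schemes.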
Within each cluster of similar protein chains, cluster members are ranked according to the apparent quality and completeness of the structure data. The following criteria are used (ranked by decreasing priority):
  1. Low fraction of residues with unknown residue type
  2. Low fraction of residues with incomplete coordinates
  3. Low fraction of residues with missing coordinates
  4. Low fraction of residues with incomplete side-chain coordinates
  5. High resolution
  6. High number of chains (subunits) contained in the PDB entry
  7. High number of heterogens contained in the PDB entry
  8. High number of different types of heterogens.
  9. Chain length
For the display of structure neighbors calculated by VAST, the highest ranking chain (according to the criteria above) from each cluster found in the list of neighbors is reported. In most cases this implies that the parent structure is also similar to the other members of the sequence redundant cluster. To have them displayed, the user must select a higher level of redundancy.
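
One way to express the prioritized ranking above is a tuple sort key, where earlier fields take precedence and later fields only break ties. The sketch below is hypothetical (the `ChainQuality` fields and function names are invented for illustration) and makes two labeled assumptions: "high resolution" is taken to mean a smaller value in Angstroms, and longer chains are preferred for the final criterion, which the list above does not state explicitly.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ChainQuality:
    name: str
    frac_unknown_type: float
    frac_incomplete_coords: float
    frac_missing_coords: float
    frac_incomplete_side_chains: float
    resolution_angstrom: float  # assumption: smaller value = higher resolution
    n_chains: int
    n_heterogens: int
    n_heterogen_types: int
    length: int                 # assumption: longer chains rank higher


def rank_key(c: ChainQuality) -> Tuple:
    """Smaller key = better chain; 'high is better' fields are negated."""
    return (
        c.frac_unknown_type,
        c.frac_incomplete_coords,
        c.frac_missing_coords,
        c.frac_incomplete_side_chains,
        c.resolution_angstrom,
        -c.n_chains,
        -c.n_heterogens,
        -c.n_heterogen_types,
        -c.length,
    )


def pick_representative(cluster: List[ChainQuality]) -> ChainQuality:
    """Return the highest-ranking chain of a cluster."""
    return min(cluster, key=rank_key)


cluster = [
    ChainQuality("chain_1", 0.0, 0.0, 0.02, 0.0, 2.5, 2, 1, 1, 300),
    ChainQuality("chain_2", 0.0, 0.0, 0.02, 0.0, 1.8, 2, 1, 1, 300),
    ChainQuality("chain_3", 0.1, 0.0, 0.00, 0.0, 1.0, 4, 3, 2, 310),
]
print(pick_representative(cluster).name)
```

Here chain_3 has the best resolution but loses on the first criterion (unknown residue types), so the choice falls to resolution between chain_1 and chain_2.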
 
 
 
 
How can I display a different subset of similar structures, or find a specific structure within my search results?
 
  The display controls at the top of a VAST results page allow you to switch the display between the Graphics and Table formats. The Graphics display helps you understand the superpositions between a query domain and its neighbors. The Table display is better suited to viewing or saving the statistics from a VAST calculation.

The VAST similarity measures reported for each neighbor can be used to determine sort order. The lengths of the whole graphic and table are strongly influenced by the display subset, which determines the level of sequence redundancy chosen.

The display controls also allow you to change the number of structure neighbors listed in the display. A brief subset of structure neighbors is shown by default. You can increase or decrease the number of hits shown on the VAST results page by using the "List" options, which range from a "low redundancy" subset of proteins from structure records to "all sequences."

The total number of neighbors displayed on a page is limited. At most 60 neighbors from a non-redundant subset can be displayed simultaneously on one page. In addition, by clicking check boxes to select from previously listed neighbors, up to another 40 neighbors can be displayed on the same page. The maximum capacity of one page is therefore 100 neighbors. Together with pagination, this feature lets you keep neighbors of interest from different pages displayed together. The page can be selected from the third pull-down menu in the "List" line.

The "Advanced similar structure search" options allow you to search for specific structures in your current set of search results. For example, if you know that a particular structure should be in your VAST results but you don't see it in the currently displayed subset of hits, you can use the "Find" button to look for that structure by MMDB, PDB, or 3D-Domain identifier. If you have done a previous search in the Entrez Structure (MMDB) database and want to find out if any of the structures retrieved by that search are in the current VAST results, you can use the "Entrez History" function in the "Advanced similar structure search" panel. That will show you the intersection, if any, of the previous Entrez Structure and current VAST search results.

 
 
 
How can I view or save a structure superposition?
 
  On a VAST results page (in either the "Graphics" or the "Table" display), individual structure neighbors can be selected by clicking the check boxes at the left margin. Choosing the button labeled "View 3D Structure" then displays the 3D superposition of the query protein with the selected neighbors in Cn3D. Up to 10 neighbors may be viewed in a superposition simultaneously if Cn3D without the cache mechanism is selected (this is the default). This selection also works for Cn3D version 3.0.

Although the default is to submit all atoms for display in Cn3D, the "Backbone" option can be used to reduce the size of the files downloaded by Cn3D, saving time and memory during data transmission to the viewer. With the release of Cn3D version 4.0, the Cn3D/Cache mechanism is used to store downloaded structure data locally. With this option, the number of neighbors for display is not limited. Take care not to exceed the physical memory available on your computer; if available memory is exceeded, Cn3D will not operate properly.

The Cn3D Tutorial provides additional details about viewing structure alignments in Cn3D.

Alternatively, instead of viewing the 3D superpositions, the data can be examined or saved to disk as a local file, for browser-independent or later viewing. In addition, if the "Asn1" option is selected from the last menu in the "List" line instead of "Graphics" or "Table", a complete alignment file, including all of the neighbors in the subset, will be saved locally.
 
 
 
How can I display a sequence alignment created from a structure superposition?
 
  If the "View Alignment" button is chosen, a multiple alignment view will be opened in HTML, text, or "FASTA with Gap" format. The check boxes in each neighbor "row" allow you to add the "Selected" neighbors to the alignment. The "All on page" option displays a multiple alignment made from all of the neighbors on the same page.

The HTML- and text-format alignment views indicate aligned vs. unaligned residues as uppercase and lowercase letters, respectively. In HTML views, columns with identical residues aligned across all selected sequences are colored red, whereas those with different aligned residues are colored blue. Those not covered by all sequences will be shown in grey.
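
The coloring rule above can be sketched as a small Python function. This is a hypothetical illustration, not the code behind the VAST pages: it assumes the alignment is given as equal-length rows in which uppercase letters are aligned residues, and it treats lowercase letters and a gap character ("-", an assumption) as positions not covered by that sequence.

```python
from typing import List


def classify_columns(rows: List[str]) -> List[str]:
    """Assign 'red', 'blue', or 'grey' to each column of a multiple alignment.

    red  : every sequence has an aligned (uppercase) residue and all are identical
    blue : every sequence has an aligned residue but they differ
    grey : at least one sequence is unaligned (lowercase) or gapped in the column
    """
    if not rows or len({len(r) for r in rows}) != 1:
        raise ValueError("need at least one row, all of equal length")
    colors = []
    for column in zip(*rows):
        if not all(ch.isupper() for ch in column):
            colors.append("grey")
        elif len(set(column)) == 1:
            colors.append("red")
        else:
            colors.append("blue")
    return colors


# Hypothetical three-sequence alignment.
print(classify_columns(["MKVla", "MRV-A", "MKVgA"]))
```

The last column of the example is grey even though two sequences share an aligned "A", because the first sequence is unaligned at that position.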
 
 
 
 
What does it mean when it says VAST did not find any structure neighbors?
 
  There are a few different reasons for this condition. One reason is simply that VAST does not consider this structure to be sufficiently similar to any other structure in the MMDB database. The VAST data use a statistical significance cutoff of P < 0.0001. This cutoff was set to be conservative intentionally, to reduce the number of false positives, but some hits that are biologically significant may be omitted because of this statistical threshold.

There are also some entries where the VAST calculation was not done: those for proteins with fewer than 3 secondary structure elements (SSEs), and structures containing no protein chains (i.e., only DNA or RNA). The molecule type and SSE count can be checked by examining the structure with Cn3D.
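
The reasons listed in this section can be summarized as a short decision sketch in Python. The function and its inputs are hypothetical and only restate the rules given above (no protein chains, fewer than 3 SSEs, or no hit with P < 0.0001); the sketch does not query any NCBI service.

```python
from typing import List

P_VALUE_CUTOFF = 0.0001  # significance cutoff stated above (P < 0.0001)
MIN_SSE_COUNT = 3        # VAST is not run on proteins with fewer SSEs


def explain_result(has_protein_chain: bool,
                   sse_count: int,
                   hit_pvalues: List[float]) -> str:
    """Return a short explanation of why neighbors are, or are not, listed."""
    if not has_protein_chain:
        return "not calculated: structure contains no protein chains"
    if sse_count < MIN_SSE_COUNT:
        return "not calculated: fewer than 3 secondary structure elements"
    if not any(p < P_VALUE_CUTOFF for p in hit_pvalues):
        return "calculated: no comparison passed P < 0.0001"
    return "neighbors found"


print(explain_result(True, 12, [0.02, 0.0005]))
```

In the example call the calculation was done, but both hypothetical p values are above the cutoff, so no neighbors would be reported.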
 
 
 
 
 
 
 
 Revised 13 December 2011