DOE Genomes
The U.S. Department of Energy Biological and Environmental Research program funds this site.

Sequence similarity searching using NCBI BLAST

This tutorial is designed to serve as a basic introduction to NCBI's BLAST, which is used for comparing the sequence of a particular gene or protein with other sequences from a variety of organisms. Since this tutorial is targeted to new users, it will cover only selected BLAST options and features.

What is BLAST?

BLAST (Basic Local Alignment Search Tool) is a set of programs designed to perform similarity searches on all available sequence data. Scientists frequently use such searches to gain insight into the function and biological importance of gene products.

BLAST uses an algorithm developed by NCBI that seeks out local alignment (the alignment of some portion of two sequences) as opposed to global alignment (the alignment of two sequences over their entire length). By searching for local alignments, BLAST is able to identify regions of similarity within two sequences.
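The difference between local and global alignment can be made concrete with a small sketch. The code below is not BLAST's algorithm (BLAST uses a faster heuristic); it uses the classic Smith-Waterman (local) and Needleman-Wunsch (global) dynamic-programming recurrences. The scoring values (match, mismatch, gap) and the two example sequences are assumptions chosen only for illustration.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score (Smith-Waterman, linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, so scores never drop below 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def global_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Global alignment score (Needleman-Wunsch, linear gap penalty)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Every residue of both sequences must be aligned or gapped.
            curr[j] = max(diag, prev[j] + gap, curr[j - 1] + gap)
        prev = curr
    return prev[-1]


# Hypothetical sequences: the short one is embedded inside the long one.
long_seq = "MKTWQHEAGAWGHEEPLR"
short_seq = "HEAGAWGHEE"
print("local :", local_alignment_score(long_seq, short_seq))
print("global:", global_alignment_score(long_seq, short_seq))
```

The local score rewards the shared region alone, while the global score is pulled down by the gaps needed to cover the unmatched ends of the longer sequence.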

Some BLAST search services include the following:

  • blastp - comparing an amino acid query sequence with others stored in protein sequence databases

  • blastn - comparing a nucleotide query sequence against a nucleotide sequence database

  • blastx - comparing a nucleotide query sequence translated in all reading frames with other amino acid sequences stored in protein sequence databases

Which type of BLAST search should you use?

Since more than one codon (triplet of nucleotides) can code for a particular amino acid, considerably different nucleotide sequences can translate into the same amino acid sequence. Comparing amino acid sequences is therefore a more reliable predictor of similarity between two gene products than comparing nucleotide sequences. For this reason, this tutorial will focus on using blastp to compare a gene product's amino acid sequence with other protein sequences. For more information on using a variety of BLAST services, see Additional BLAST Resources.
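As a small illustration of this point, the sketch below translates two different nucleotide sequences with the standard genetic code and shows that they yield the same peptide. The two DNA sequences are made-up examples, not taken from any real gene.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def translate(dna):
    """Translate a DNA coding sequence (reading frame 1) into amino acids."""
    dna = dna.upper()
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def identical_positions(a, b):
    """Count positions at which two equal-length sequences agree."""
    return sum(1 for x, y in zip(a, b) if x == y)


# Hypothetical sequences that use different codons for Leu, Ser, and Arg.
seq1 = "CTTTCTCGT"
seq2 = "TTGAGCAGA"
print(translate(seq1), translate(seq2))
print(identical_positions(seq1, seq2), "of", len(seq1), "nucleotides identical")
```

Here the nucleotide sequences agree at only a few positions, yet a protein-level comparison would see them as identical.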


Obtaining a FASTA Formatted Amino Acid Sequence

As a shortcut, we will use NCBI's LocusLink to quickly access the amino acid sequence of a gene product. The amino acid sequence also could be obtained by searching protein sequence databases such as NCBI's Entrez; this process, however, can be more involved and rather time-consuming since it often requires examining and sifting through several sequence records.

1. Go to the LocusLink Web site:

http://www.ncbi.nlm.nih.gov/LocusLink/

LocusLink Home

2. In the search box at the top of the LocusLink home page:

  • Enter the gene symbol in the query box. Using the field qualifier [sym] to restrict your query tells LocusLink that you are searching by gene symbol only; without it, the search will return every record that mentions the query term anywhere. Since a gene symbol is unique for each human gene, a query restricted by [sym] should retrieve only one result. For the hereditary hemochromatosis gene, HFE[sym] should be entered in the query box. For more information on options for refining your search, see the Query Tips section of LocusLink Help.

  • Choose Human from the Organism drop-down menu on the right; otherwise, LocusLink will also retrieve records from other organisms such as mouse and rat.

  • Once the search box at the top of the LocusLink page looks like the screenshot below, click Go to submit your query.
LocusLink Search Box

3. The search should return one entry.

LocusLink Search Results

4. Click on the LocusID number 3077 to pull up the LocusLink record. The record for HFE should look like the following screenshot.

LocusLink Record

5. Once the record is open, click on RefSeq in the blue navigation column on the left. This is a quick link to NCBI Reference Sequences for the HFE gene and the protein it encodes.

6. In the NCBI Reference Sequences (RefSeq) Section of the LocusLink record, you will find direct links to RefSeq mRNA and protein records.

RefSeq Section of LocusLink Record

Notice that there are multiple variants for the same gene. How each variant differs from the most complete variant (variant 1) is described in the Transcript Variant section of each HFE reference sequence entry.

Accession numbers for RefSeq protein sequences always begin with NP_. Reference sequences should be used when available because they have been (or are in the process of being) reviewed by NCBI staff to ensure completeness and freedom from error and contamination.

7. To open the RefSeq sequence record for the HFE protein, click on the accession number NP_000401 in the first Reference Sequence entry. (Clicking on this link will open a new browser window containing the protein sequence record.)

8. From the Display drop-down menu of the protein sequence record, select FASTA to display the amino acid sequence in FASTA format, and click on the Display button.

Sequence Display Options

9. A sequence in FASTA format consists of a single line of descriptive text that begins with >, followed by sequence data. Highlight the FASTA sequence with your mouse and copy it by pressing Ctrl + C on the keyboard, selecting copy from the Edit menu in your browser, or by right-clicking and selecting the copy option (shown below).

HFE Protein Sequence in FASTA Format
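The FASTA layout described in step 9 is simple enough to read with a few lines of code. The following is a minimal sketch of a parser; the record shown is a made-up placeholder, not the actual HFE sequence.

```python
def parse_fasta(text):
    """Return a list of (description, sequence) tuples from FASTA text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical record for illustration only.
example = """>example_1 hypothetical protein (illustration only)
MKTAYIAKQR
QISFVKSHFS
"""
for description, sequence in parse_fasta(example):
    print(description, len(sequence))
```

Sequence data may be wrapped over several lines, so the parser joins every line up to the next ">" into one sequence string.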

Now that you have found the amino acid sequence of the HFE protein and copied it in FASTA format, you are ready to submit this sequence as a BLAST query, which is covered in the next section of this tutorial.

Submitting a Query Sequence

1. After you have copied the sequence in FASTA format, go to the BLAST home page at http://www.ncbi.nlm.nih.gov/BLAST/ and click on Standard protein-protein BLAST [blastp] to access the protein-protein BLAST service. The protein-protein BLAST search page should look like the following screenshot.

protein-protein BLAST Home

2. Paste the amino acid sequence into the "Search" box by pressing Ctrl + V on the keyboard, by selecting paste from your browser's Edit menu, or by right-clicking inside the search box and selecting the paste option. The pasted sequence in the search box is shown below.

Protein BLAST Search Box

3. Scroll to the Options for advanced blasting section of the protein-protein BLAST page. Leave all search options set to their default values except for Limit by entrez query. For more information about the different search and format options, see the BLAST Search Options Guide at the end of this tutorial.

Since the default database setting will automatically search sequence data from many different organisms, Limit by entrez query allows you to narrow a search by specifying search criteria such as organism type. Adding the qualifier [ORGN] or [organism] to the common or scientific name of a particular organism will retrieve sequences from that organism only.

Let's say that we are interested in finding out which protein sequences in the mouse or rat are most similar to the human HFE protein. To limit our search to mouse and rat, enter the following into the text box as demonstrated below: mouse[ORGN] OR rat[ORGN]

Options for Advanced Blasting
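When several organisms are of interest, the limit string can be assembled mechanically. The helper below is a hypothetical convenience function (its name is an assumption, not part of BLAST or Entrez) that simply joins organism names with the [ORGN] qualifier and OR, as described above.

```python
def organism_query(*organisms):
    """Build an Entrez limit string such as 'mouse[ORGN] OR rat[ORGN]'."""
    names = [name.strip() for name in organisms if name.strip()]
    if not names:
        raise ValueError("at least one organism name is required")
    return " OR ".join(f"{name}[ORGN]" for name in names)


print(organism_query("mouse", "rat"))
```

The resulting text is what would be typed into the Limit by entrez query box.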

4. Click on the BLAST button below the Search boxes or at the bottom of the page to submit your query.

5. After you submit your query, you will be taken to the formatting BLAST page (see screen shot below).

Formatting BLAST Page

The formatting BLAST page displays the results of a conserved domain search. A conserved domain is a recurring sequence pattern or motif. When you submit sequence data, the conserved domain search will detect regions within the sequence that share a common recurring pattern with other proteins. Many conserved domains are associated with certain protein features or functions. For example, the IGc1 domain is an immunoglobulin-like domain that is found in antibodies and other proteins. This type of domain is often found in regions of proteins that interact with other proteins.

The formatting BLAST page also provides options for changing the format of BLAST results. Since we are not changing any format options, simply click on the Format! button on the formatting BLAST page, and a new browser window will open that contains the BLAST results. It may take a few minutes to generate the BLAST Results page.

At certain times when the conserved domain searching takes longer than usual, the formatting BLAST page may not include a conserved domain diagram. Instead, a button that links to the results of the conserved domain search is presented, as shown in the screenshot below. If you are interested in viewing the conserved domains, click on the yellow button. If you are only interested in accessing the BLAST results, just click the Format! button.

Formatting BLAST Page


The next section of this tutorial is designed to help you interpret the results you get from a BLAST search.



Understanding BLAST results

1. The top of the BLAST results page should resemble the screenshot below. Scrolling through the BLAST results, you will see that this page includes a unique request ID (RID), query information, database information, a link to taxonomy reports, a graphical display showing alignments to the query sequence, descriptions of sequences producing significant alignments, and pairwise alignments between the query sequence and each BLAST hit sequence.

BLAST Results

2. Clicking on Taxonomy reports just above the Graphical Display will open a new browser window that displays BLAST results in three different views: Organism Report, Lineage Report, and Taxonomy Report. Organism Report groups all hits by organism. For example, of 100 hits retrieved for this run, Organism Report groups 47 mouse hits together and 53 rat hits together. Organism Report also includes both scientific and common names of organisms included in the BLAST hit list. For more information about BLAST taxonomy reports, see Taxonomy BLAST Help.

3. The graphical overview shown in the previous screenshot displays the top 50 sequence alignments for this search (the default setting). If you would like to see more lower-scoring alignments, return to the formatting BLAST page, specify the desired number of alignments, and resubmit your request to change the results format.

Graphical Display Features

  • The graphical overview aligns hits (database sequences retrieved during BLAST search) with the query sequence. The thick red numbered bar at the top represents the query sequence, and the numbers correspond to those of amino acid residues.

  • All hits are represented by colored bars below the query sequence. Mousing over a hit will display its definition and score in the text box above the graphical display. Clicking on a hit will take you to the pairwise alignment between hit and query sequence.

  • The bar color for a hit refers to alignment score, a mathematically derived value that reflects the degree of similarity between hit and query sequences. The higher the score, the more similar the two. The Color Key at the top of the graphical display gives the range of alignment scores assigned to each color. For example, red hits are most similar, with alignment scores greater than or equal to 200, while black hits are least similar, with alignment scores lower than 40.
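The Color Key amounts to a lookup from alignment score to color. In the sketch below, only the two end ranges (red for scores of 200 or more, black for scores below 40) come from the text above; the three intermediate bands and their colors are assumptions based on the usual NCBI color key and should be checked against the key displayed on your own results page.

```python
def score_color(score):
    """Map an alignment score to the color of its bar in the graphical overview."""
    if score >= 200:
        return "red"        # most similar (stated in the tutorial)
    if score >= 80:
        return "magenta"    # assumed intermediate band
    if score >= 50:
        return "green"      # assumed intermediate band
    if score >= 40:
        return "blue"       # assumed intermediate band
    return "black"          # least similar (stated in the tutorial)


for s in (420, 120, 65, 45, 30):
    print(s, score_color(s))
```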

4. Below the graphical display are descriptions of statistically significant alignments. The most significant alignments are at the top. From these results, we see that the first entry (hemochromatosis protein from the mouse [Mus musculus]) is more similar to the human sequence than the hemochromatosis protein sequence from the rat [Rattus norvegicus].

The default number of descriptions specified for each set of BLAST results is 100. The number of descriptions and other features included on the BLAST results page can be adjusted by returning to the formatting BLAST page.

The first ten descriptions are included in the screenshot below.

Sequence Descriptions

Features of Each Sequence Description

1 - This portion of each description links to the sequence record for a particular hit. See our Sequence Database tutorial to learn more about sequence records.

2 - Score or bit score is a value calculated from the number of gaps and substitutions associated with each aligned sequence. The higher the score, the more significant the alignment. Each score links to the corresponding pairwise alignment between query sequence and hit sequence (also referred to as subject sequence).

3 - E Value (Expect Value) is the number of hits with a similar or better score that would be expected to occur by chance in a database of this size. The smaller the E Value, the more significant the alignment. For example, the first alignment has a very low E Value of e-117 (that is, 1 x 10^-117), meaning that a hit with a similar score is extremely unlikely to occur simply by chance.

4 - These links provide the user with direct access from BLAST results to related entries in other databases. The L icon links to LocusLink records, and the S icon links to structure records in NCBI's Molecular Modeling DataBase.
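The relationship between bit score and E Value can be sketched with the standard formula E = m x n x 2^(-S'), where S' is the bit score, m is the query length, and n is the total length of the database. The numbers used below are hypothetical, and BLAST itself applies corrected "effective" lengths, so this simplified sketch will not reproduce reported E Values exactly.

```python
def expect_value(bit_score, query_length, database_length):
    """Approximate E Value from a bit score: E = m * n * 2**(-S')."""
    return query_length * database_length * 2.0 ** (-bit_score)


# Hypothetical search: a 350-residue query against 500 million residues.
m, n = 350, 500_000_000
for bits in (30, 50, 100, 400):
    print(bits, expect_value(bits, m, n))
```

Each additional bit of score halves the E Value, which is why high-scoring hits have E Values that are vanishingly small.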

5. Below the descriptions are pairwise alignments that show the aligned region of each hit sequence matched up with the corresponding region of the query sequence. Because BLAST reports local alignments, these regions do not necessarily span the entire length of either sequence. With a pairwise alignment you can see how the hit sequence compares with the query sequence amino acid by amino acid. The screenshot below is the pairwise alignment for the first hit. For descriptions of different types of sequence alignments, see NCBI's Examples of Alignment Formats.

Pairwise Alignment

  • The hit sequence is presented in the Sbjct: line, and the query sequence in the Query: line.

  • Each letter between the Sbjct and Query lines indicates that the amino acids at that position in both sequences are identical. A plus sign (+) indicates that the two amino acids differ but score as similar (a conservative substitution). Each blank space between the Sbjct and Query lines means that the amino acids at that position do not match. Use the Table of Genetic Code to see which amino acid is represented by each letter.

  • X's are inserted into the query sequence as a result of automatic filtering. A string of X's is used to replace a sequence's low-complexity regions that can generate artifactual hits. In nucleotide sequences, N's replace low-complexity regions rather than X's.

  • Dashes inserted into either query or subject sequence indicate gaps introduced to compensate for insertions and deletions.
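The conventions above can be sketched in code. The function below takes the Query and Sbjct lines of an alignment (hypothetical strings here), rebuilds a simplified middle line, and counts identities and gaps. It does not mark similar-but-not-identical residues, since that would require a substitution matrix.

```python
def summarize_alignment(query, sbjct):
    """Build a simplified midline and count identities and gaps."""
    if len(query) != len(sbjct):
        raise ValueError("aligned Query and Sbjct lines must have equal length")
    midline = []
    identities = gaps = 0
    for q, s in zip(query, sbjct):
        if q == "-" or s == "-":
            gaps += 1                 # dash: gap from an insertion or deletion
            midline.append(" ")
        elif q == s:
            identities += 1           # same letter: identical residues
            midline.append(q)
        else:
            midline.append(" ")       # blank: residues do not match
    return {
        "midline": "".join(midline),
        "identities": identities,
        "gaps": gaps,
        "length": len(query),
    }


# Hypothetical aligned segments for illustration only.
result = summarize_alignment("MKT-LLV", "MRTALLV")
print("Query:", "MKT-LLV")
print("      ", result["midline"])
print("Sbjct:", "MRTALLV")
print(result["identities"], "identities,", result["gaps"], "gap(s)")
```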



Additional BLAST Resources

This tutorial was designed as a basic introduction to using BLAST and interpreting BLAST results. To learn more about BLAST, check out the following NCBI resources used as references for this tutorial:


BLAST Search Options Guide

BLAST provides several options for narrowing or modifying a search. Several of the options presented on the protein-protein BLAST page and the formatting BLAST page (accessible after submitting a BLAST query) are explained below. Each search option on these pages links to a BLAST Help page that includes a brief description of the option.

Search: Besides pasting sequence data into the search box, you can also submit query sequences by entering sequence identifier numbers such as accession numbers or gi's. For descriptions of what accession numbers and gi's are, see the Glossary of Bioinformatics Terms.

Set Subsequence: Lets you limit your query to a particular portion of your sequence. For example, if you want to limit the query so that only the region between amino acid residues 50 and 150 is compared with other protein sequences, simply enter 50 into the From box and 150 into the To box.
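In code terms, the From and To boxes describe a 1-based, inclusive range of residues. A minimal sketch follows (the function name is hypothetical); note the off-by-one adjustment needed for Python's 0-based slicing.

```python
def subsequence(sequence, start, end):
    """Return residues start..end (1-based, inclusive), as in From/To boxes."""
    if start < 1 or end < start or end > len(sequence):
        raise ValueError("range must satisfy 1 <= start <= end <= len(sequence)")
    return sequence[start - 1:end]


# Hypothetical 200-residue sequence used only to show the arithmetic.
protein = "A" * 200
region = subsequence(protein, 50, 150)
print(len(region))  # residues 50 through 150 inclusive
```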

Choose Database: Choose from among the following protein sequence databases:

NR - Default setting - All non-redundant translations of CDS (coding sequences) of GenBank nucleotide sequences as well as amino acid sequences from Protein Data Bank (PDB), SwissProt, Protein Information Resource (PIR), and Protein Research Foundation (PRF) in Japan. See our Genome Database Guide for more information about these databases. Non-redundant means that the same sequence or translation in more than one database should be listed only once in the BLAST output.

swissprot - Only protein sequences from the last major release of Swiss-Prot protein sequence database. No updates to Swiss-Prot sequences are included.

pat - Protein sequences derived from the Patent division of GenBank.

yeast - Translations of Yeast (Saccharomyces cerevisiae) genomic CDS (coding sequences).

ecoli - Translations of Escherichia coli genomic CDS (coding sequences).

PDB - Protein sequences derived from 3-dimensional structures at Protein Data Bank (PDB). See our Genome Database Guide for more information about PDB.

Drosophila genome - Drosophila genome proteins provided by Celera and Berkeley Drosophila Genome Project (BDGP).

month - Sequences in the NR database that are new or have been revised in the last 30 days.

Do CD Search: Checking this box will compare the query sequence with the Conserved Domain Database. A domain is a protein section that has a distinct evolutionary origin and function. CD Search is carried out by default for each protein-protein BLAST query. BLAST search results will include a link to CD-Search results if this box is checked. For more information about CD Search, see the CDD Home Page.

Options for Advanced Blasting

Limit by entrez query: This option can be used to specify search criteria for limiting or refining BLAST searches. Any query statement that can be submitted to an Entrez database can be entered into the first box. For example, you could enter mouse[ORGN] OR rat[ORGN] to include only protein sequences from mice or rats. A specific organism also may be chosen using the "Select from" drop-down box on the right. For more information on formulating an entrez query, see Refining Your Search from the Entrez Help Document.

Choose filter:

Low complexity - This option is checked as the default. This filter allows the masking of query sequence portions that have low complexity (e.g., a long string of the same amino acid or nucleotide). For a protein sequence query, the filter will replace a low-complexity region with a string of X's (e.g., XXXXXXXXXXXXX), or a string of N's in a nucleotide sequence query. Low-complexity regions can result in high scores that reflect compositional bias rather than significant position-by-position alignment (Wootton & Federhen, 1996). Filtering is applied only to the query sequence (or its translation products), not to database sequences.

Mask for lookup table only - This option for advanced searchers is used in constructing the lookup table used by BLAST. This experimental option is likely to change in the future.

Mask lower case - Select this option to customize filtering from the query sequence when it is compared with other database sequences. The query sequence in uppercase characters is entered into the search box, and areas to be filtered are denoted in lowercase characters.
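The effect of masking can be illustrated with a simplified sketch. This is not the filtering program BLAST actually uses; it only masks runs of a single repeated residue, and separately shows how lowercase letters could mark user-chosen regions for masking. The run-length threshold and the example sequences are assumptions.

```python
import re


def mask_runs(sequence, min_run=5, mask_char="X"):
    """Replace runs of at least min_run identical residues with mask_char."""
    pattern = re.compile(r"(.)\1{%d,}" % (min_run - 1))
    return pattern.sub(lambda m: mask_char * len(m.group(0)), sequence)


def mask_lowercase(sequence, mask_char="X"):
    """Replace lowercase (user-marked) residues with mask_char."""
    return "".join(mask_char if ch.islower() else ch for ch in sequence)


# Hypothetical protein fragments for illustration only.
print(mask_runs("MKTAAAAAAAGLV"))
print(mask_lowercase("MKTaaagLV"))
```

For a nucleotide query the same functions could be called with mask_char="N".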

Expect: All sequences retrieved during a BLAST search must have an Expect value (E Value) lower than the number specified by this option. The E Value is the number of hits with a similar or better score that would be expected to occur in the database by chance. The default Expect value is 10. Since hit sequences with E Values closer to zero are more statistically significant, you may want to set this option to 1 or to a smaller decimal value for a more stringent search.

Other "Options for Advanced Blasting," such as composition-based statistics, Word size, Matrix, PSSM, Other Advanced, and PHI Pattern, are designed for more advanced BLAST users. For our purposes, these options should be left to their default values. For more information about these advanced options, see BLAST help.

Format

Show

Graphical Overview - This option is selected by default. In BLAST results, this option provides a graphic depiction of how the similar sequences retrieved from the databases (the subject sequences) line up with the query sequence (the thick red line at the top). The score of each alignment is indicated by one of five different colors as defined in the Color Key for Alignment Scores shown at the top of the graphical overview.

Linkout - Also selected by default. If this box is unchecked, no links from BLAST results to other NCBI databases are provided.

NCBI-gi - Also selected by default. This option allows the NCBI-GI (GenBank Identifier, a number unique to each sequence) to be displayed for each hit sequence included in output. NCBI-GI links to a subject sequence record from NCBI sequence databases.

Format - Leave the drop-down menu beside the NCBI-GI option set to the default ALIGNMENT. Other selections in the drop-down menu (PSSM and Bioseq) are for more advanced users. To view the graphical overview, the HTML (default) setting should be selected from the second drop-down menu in the Format option. Selecting "Plain Text" from the drop-down menu will present BLAST output in a more printer-friendly format; the graphical overview feature, however, will be omitted and all hyperlinks deactivated.

Number of

Descriptions - Restricts the number of matching-sequence descriptions reported. The default limit is 100 descriptions.

Alignments - Restricts the number of alignments (default alignment type is pairwise) between query and subject sequences included in the BLAST results. The default limit is 50.

Alignment View

To see some of the following formats, see NCBI's Examples of Alignment Formats.

Pairwise - Default setting for alignment view, in which the aligned region of the query sequence is lined up, amino acid by amino acid, with the matching region of each retrieved database sequence. When comparing DNA sequences using BLAST, the query sequence's nucleotides are matched up with those of each database sequence.

Query-anchored with identities - Rather than a pairwise alignment, this is a type of multiple alignment. In this view, a query-sequence segment (for example, amino acids 1 through 60) is displayed with the corresponding section of each retrieved sequence listed below it. Each query-sequence segment begins with the number 1 at the far left, while each database-sequence segment begins with its corresponding gi (GenBank identifier) at the far left. Identities are displayed as dots, with mismatches shown as single-letter amino acid abbreviations.

Query-anchored without identities - This multiple alignment view is similar to query-anchored with identities; each match, however, is indicated by the single-letter amino acid abbreviation instead of a dot.

Hit Table: Presents all BLAST results in a table that summarizes some of the following information for each subject sequence retrieved: subject ID, % identity between query and each subject sequence, alignment length, number of mismatches, number of gap openings, E Value, and bit score.
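A hit table is convenient to process with a short script. The sketch below assumes a tab-delimited table with the twelve columns of BLAST's usual tabular layout (the column order is an assumption, so check it against the header of your own output), and the sample rows are invented for illustration. It also shows the Expect threshold applied as a filter.

```python
FIELDS = [
    "query_id", "subject_id", "percent_identity", "alignment_length",
    "mismatches", "gap_openings", "q_start", "q_end",
    "s_start", "s_end", "e_value", "bit_score",
]


def parse_e_value(text):
    """Convert an E Value string to float; 'e-117' is read as 1e-117."""
    text = text.strip()
    if text.lower().startswith("e"):
        text = "1" + text
    return float(text)


def parse_hit_table(text):
    """Parse tab-delimited hit table rows into dictionaries."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        values = line.split("\t")
        if len(values) != len(FIELDS):
            continue  # skip rows that do not match the assumed layout
        hit = dict(zip(FIELDS, values))
        hit["percent_identity"] = float(hit["percent_identity"])
        hit["e_value"] = parse_e_value(hit["e_value"])
        hit["bit_score"] = float(hit["bit_score"])
        hits.append(hit)
    return hits


def filter_by_expect(hits, threshold=10.0):
    """Keep hits whose E Value is lower than the Expect threshold."""
    return [hit for hit in hits if hit["e_value"] < threshold]


# Hypothetical rows for illustration only.
SAMPLE = "\n".join([
    "# invented example rows",
    "\t".join(["query1", "subjA", "68.50", "340", "100", "3", "1", "338", "5", "344", "e-117", "420"]),
    "\t".join(["query1", "subjB", "35.20", "180", "110", "4", "20", "195", "31", "208", "2e-20", "98.2"]),
    "\t".join(["query1", "subjC", "24.00", "75", "55", "2", "210", "283", "12", "86", "3.1", "30.4"]),
])

for hit in filter_by_expect(parse_hit_table(SAMPLE), threshold=1e-5):
    print(hit["subject_id"], hit["e_value"], hit["bit_score"])
```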

The Limit results by entrez query option is described above. Format for PSI-BLAST and Expect value range options are designed for more advanced BLAST users (see BLAST help).



Acknowledgments

Source for screen shots used in this tutorial:

NCBI BLAST. National Center for Biotechnology Information. <http://www.ncbi.nlm.nih.gov/BLAST/> (January 3, 2003).


Continue with other tutorials:

Searching OMIM: Finding information about genes, traits, and disorders

Finding a gene on a chromosome map

Accessing records in NCBI's sequence databases

Examining protein structures from the Protein Data Bank


Last Updated: January 3, 2003

For feedback and comments about this site, contact the site designer, Jennifer Bownas of HGMIS.




The online presentation of this poster is a special feature of the U.S. Department of Energy (DOE) Human Genome Project Information Web site. The DOE Biological and Environmental Research program of the Office of Science funds this site.