This
tutorial is designed to serve as a basic introduction to NCBI's BLAST,
which is used for comparing the sequence of a particular gene or protein
with other sequences from a variety of organisms. Since this tutorial
is targeted to new users, it will cover only selected BLAST options and
features.
What is BLAST?
BLAST (Basic Local
Alignment Search Tool) is a set of programs designed to perform similarity
searches on all available sequence data. Scientists frequently use such
searches to gain insight into the function and biological importance
of gene products.
BLAST uses an algorithm
developed by NCBI that seeks out local alignment (the alignment of some
portion of two sequences) as opposed to global alignment (the alignment
of two sequences over their entire length). By searching for local alignments,
BLAST is able to identify regions of similarity within two sequences.
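To make the local-versus-global distinction concrete, here is a minimal Python sketch of local alignment scoring using the classic Smith-Waterman dynamic-programming recurrence. It is not BLAST's own heuristic, which is considerably faster and scores protein pairs with a substitution matrix. The function name and the match, mismatch, and gap values are illustrative assumptions.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between strings a and b.

    Smith-Waterman recurrence: every cell is floored at 0, so an
    alignment may start and end anywhere inside either sequence.
    The scoring values are illustrative, not BLAST defaults.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            step = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                    # start a fresh local alignment here
                prev[j - 1] + step,   # extend with a match or mismatch
                prev[j] + gap,        # gap in b
                curr[j - 1] + gap,    # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best


# Two made-up sequences that share only the short region "HFE".
print(local_alignment_score("AAAHFEKKK", "GGGHFEGGG"))
```

Because the score never drops below zero, the dissimilar flanking residues do not penalize the shared region, which is why a local search can find a similar stretch inside otherwise unrelated sequences.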
Some BLAST search
services include the following:
- blastp - comparing
an amino acid query sequence with others stored in protein sequence
databases
- blastn - comparing
a nucleotide query sequence against a nucleotide sequence database
- blastx - comparing
a nucleotide query sequence translated in all reading frames with
other amino acid sequences stored in protein sequence databases
Which type of BLAST
search should you use?
Since more than
one codon or triplet of nucleotides could code for a particular amino
acid, a considerable variation in nucleotide sequences could translate
into the same amino acid sequence. Comparing amino acid sequences is
a more reliable predictor of similarity between two sequences than comparing
nucleotide sequences. For this reason, this tutorial will focus on using
blastp to compare the gene product's amino acid sequence with others.
For more information on using a variety of BLAST services, see Additional
BLAST Resources.
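The effect of codon redundancy can be sketched in a few lines of Python. The example below uses a small excerpt of the standard genetic code (only the codons for leucine, alanine, and glycine that it needs) and two made-up nucleotide sequences; the function names are hypothetical.

```python
# Excerpt of the standard genetic code: only the codons used below.
CODON_TABLE = {
    "CTT": "L", "CTC": "L", "CTA": "L", "CTG": "L",
    "GCT": "A", "GCC": "A", "GCA": "A", "GCG": "A",
    "GGT": "G", "GGC": "G", "GGA": "G", "GGG": "G",
}


def translate(dna):
    """Translate a DNA string codon by codon, ignoring trailing bases."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def percent_identity(a, b):
    """Percentage of positions that match in two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


# Two made-up sequences that differ at every third codon position.
seq1 = "CTTGCTGGT"
seq2 = "CTGGCAGGC"

print(translate(seq1), translate(seq2))
print(round(percent_identity(seq1, seq2), 1))
print(percent_identity(translate(seq1), translate(seq2)))
```

The two nucleotide sequences differ at one position in every codon, yet both translate to the same three amino acids, so a comparison at the protein level reports them as identical.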
Obtaining a FASTA Formatted Amino Acid Sequence
As a shortcut, we
will use NCBI's LocusLink to quickly access the amino acid sequence of
a gene product. The amino acid sequence also could be obtained by searching
protein sequence databases such as NCBI's Entrez; this process, however,
can be more involved and rather time-consuming since it often requires
examining and sifting through several sequence records.
1. Go to the LocusLink
Web site:
http://www.ncbi.nlm.nih.gov/LocusLink/
2. In the search
box at the top of the LocusLink home page:
3.
The search should return one entry.
4.
Click on the LocusID number 3077 to
pull up the LocusLink record. The record for HFE should look like the
following screenshot.
5.
Once the record is open, click on RefSeq in the blue navigation
column on the left. This is a quick link to NCBI Reference Sequences
for the HFE gene and the protein it encodes.
6.
In the NCBI Reference Sequences (RefSeq) Section of the LocusLink record,
you will find direct links to RefSeq mRNA and protein records.
Notice
that there are multiple variants for the same gene. How each variant
differs from the most complete variant (variant 1) is described
in the Transcript Variant section of each HFE
reference sequence entry.
Accession
numbers for RefSeq protein sequences always begin with NP_.
Reference sequences should be used when available because they
have been (or are in the process of being) reviewed by NCBI staff
to ensure completeness and freedom from error and contamination.
7.
To open the RefSeq sequence record for the HFE protein, in the first
Reference
Sequence entry, simply click on the accession
number NP_000401
(Clicking on this link will open a new browser window with the protein
sequence record in it).
8.
From the Display drop-down menu of the protein sequence record, select
FASTA to display the amino acid sequence in FASTA format, and click
on the Display button.
9.
A sequence in FASTA format consists of a single line of descriptive
text that begins with >, followed by sequence data.
Highlight the FASTA sequence with your mouse and copy it by pressing
Ctrl + C on the keyboard, selecting copy from the Edit menu in
your browser, or by right-clicking and selecting the copy option (shown
below).
Now that you have found the amino acid sequence of the HFE protein and
displayed it in FASTA format, you are ready to submit this sequence as a
BLAST query, which is covered in the next section of this tutorial.
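If you ever need to handle a copied FASTA record in a script, the format described in step 9 is simple to read. The following is a minimal Python sketch for a single record; the function name is hypothetical, and the example record is a made-up placeholder, not the HFE sequence.

```python
def parse_fasta(text):
    """Split a single-record FASTA string into (description, sequence).

    The first non-empty line must start with '>'; every remaining line
    is sequence data and is joined into one string.
    """
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("a FASTA record must begin with a '>' description line")
    return lines[0][1:], "".join(lines[1:])


# Made-up placeholder record for illustration only.
example = """>example_protein hypothetical placeholder record
MKVLAAGIVG
LLLAQWER
"""

description, sequence = parse_fasta(example)
print(description)
print(sequence, len(sequence))
```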
Submitting a Query Sequence
1.
After you have copied the sequence in FASTA format, access the protein-protein
BLAST service at http://www.ncbi.nlm.nih.gov/BLAST/
by clicking on Standard
protein-protein BLAST [blastp].
The protein-protein BLAST search page should look like the following
screenshot.
2.
Paste the amino acid sequence into the "Search" box by pressing
Ctrl + V on the keyboard, by selecting paste from your browser's Edit
menu, or by right-clicking inside the search box and selecting the paste
option. The pasted sequence in the search box is shown below.
3. Scroll to the Options for advanced blasting section of the
protein-protein BLAST page. Leave all search options set to their
default values except for Limit by entrez query. For more information
about different search and format options, see BLAST Search Options
Guide at the end of this tutorial.
Since the default
database setting will automatically search sequence data from many
different organisms, Limit by entrez query allows you to narrow
a search by specifying search criteria such as organism type. Adding
the qualifier [ORGN] or [organism] to the common or scientific name
of a particular organism will retrieve sequences from that organism
only.
Let's say that
we are interested in finding out which protein sequences in the mouse
or rat are most similar to the human HFE protein. To limit our search
to mouse and rat, enter the following into the text box as demonstrated
below: mouse[ORGN] OR rat[ORGN]
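The same organism filter can be assembled in a script. The sketch below builds the query string from a list of organism names and URL-encodes it alongside a few request parameters. The parameter names (CMD, PROGRAM, DATABASE, ENTREZ_QUERY) are taken from NCBI's URL-based BLAST interface as an assumption and should be checked against current NCBI documentation; nothing is sent over the network, and the helper name is hypothetical.

```python
from urllib.parse import urlencode


def organism_filter(*organisms):
    """Join organism names into an Entrez query such as 'mouse[ORGN] OR rat[ORGN]'."""
    return " OR ".join(f"{name}[ORGN]" for name in organisms)


# Assumed parameter names; verify them against NCBI's documentation
# before submitting a real request.
params = {
    "CMD": "Put",
    "PROGRAM": "blastp",
    "DATABASE": "nr",
    "ENTREZ_QUERY": organism_filter("mouse", "rat"),
}

encoded = urlencode(params)
print(params["ENTREZ_QUERY"])
print(encoded)
```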
4.
Click on the button
below the Search boxes or at the bottom of the page to submit
your query.
5.
After you submit your query, you will be taken to the formatting
BLAST page (see screen shot below).
The formatting
BLAST page displays the results of a conserved domain search. A
conserved domain is a recurring sequence pattern or motif. When you
submit sequence data, the conserved domain search will detect regions
within the sequence that share a common recurring pattern with other
proteins. Many conserved domains are associated with certain protein
features or functions. For example, the IGc1 domain is an immunoglobulin-like
domain that is found in antibodies and other proteins. This type of
domain is often found in regions of proteins that interact with other
proteins.
The formatting
BLAST page also provides options for changing the format of BLAST
results. Since we are not changing any format options, simply click
on the Format! button on the formatting BLAST page, and a new browser
window will open that contains the BLAST results.
It may take a few minutes to generate the BLAST Results page.
At certain times
when the conserved domain searching takes longer than usual, the formatting
BLAST page may not include a conserved domain diagram. Instead, a button
that links to the results of the conserved domain search is presented,
as shown in the screenshot below. If you are interested in viewing the
conserved domains, click on the yellow button. If you are only interested
in accessing the BLAST results, just click the Format!
button.
The next section of this tutorial is designed to help you interpret the
results you get from a BLAST search.
Understanding BLAST results
1. The top of the BLAST results page should resemble the screenshot
below. Scrolling through the BLAST results, you will see that this page
includes a unique request ID (RID), query information, database information,
a link to taxonomy reports, a graphical display showing alignments to
the query sequence, descriptions of sequences producing significant
alignments, and pairwise alignments between the query sequence and each
BLAST hit sequence.
2.
Clicking on Taxonomy
reports just above the Graphical Display will open
a new browser window that displays BLAST results in three different
views: Organism Report, Lineage Report, and Taxonomy Report. Organism
Report groups all hits by organism. For
example, of 100 hits retrieved for this run, Organism Report groups
47 mouse hits together and 53 rat hits together. Organism Report also
includes both scientific and common names of organisms included in the
BLAST hit list. For more information about BLAST taxonomy reports, see
Taxonomy
BLAST Help.
3. The graphical
overview shown in the previous screenshot displays the top 50 sequence
alignments for this search (the default setting). If you would like
to see more lower-scoring alignments, return to the formatting BLAST
page, specify the desired number of alignments, and resubmit your request
to change the results format.
Graphical Display Features
- The graphical
overview aligns hits (database sequences retrieved during BLAST
search) with the query sequence. The thick red numbered bar at the
top represents the query sequence, and the numbers correspond to
those of amino acid residues.
- All hits are
represented by colored bars below the query sequence. Mousing over
a hit will display its definition and score in the text box above
the graphical display. Clicking on a hit will take you to the pairwise
alignment between hit and query sequence.
- The bar color
for a hit refers to alignment score, a mathematically derived value
that reflects the degree of similarity between hit and query sequences.
The higher the score, the more similar the two. The Color Key at
the top of the graphical display gives the range of alignment scores
assigned to each color. For example, red hits are most similar,
with alignment scores greater than or equal to 200, while black
hits are least similar, with alignment scores lower than 40.
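The Color Key can be expressed as a simple lookup. In the Python sketch below, the red (200 or higher) and black (below 40) ranges come from the description above; the three intermediate colors and their boundaries are assumptions based on the key as it commonly appears on NCBI results pages, so check them against the key on your own results.

```python
def score_color(score):
    """Map an alignment score to its Color Key bar color."""
    if score >= 200:
        return "red"      # most similar (stated in this tutorial)
    if score >= 80:
        return "magenta"  # assumed intermediate range: 80 to below 200
    if score >= 50:
        return "green"    # assumed intermediate range: 50 to below 80
    if score >= 40:
        return "blue"     # assumed intermediate range: 40 to below 50
    return "black"        # least similar (stated in this tutorial)


for s in (250, 120, 60, 45, 12):
    print(s, score_color(s))
```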
4. Below the graphical
display are descriptions of statistically significant alignments. The
most significant alignments are at the top. From these results, we see
that the first entry (hemochromatosis protein from the mouse [Mus musculus])
is more similar to the human sequence than the hemochromatosis protein
sequence from the rat [Rattus norvegicus].
The default number
of descriptions specified for each set of BLAST results is 100. The
number of descriptions and other features included on the BLAST results
page can be adjusted by returning to the formatting BLAST page.
The first ten descriptions
are included in the screenshot below.
Features of Each Sequence Description
1 - This portion of each
description links to the sequence record for a particular hit. See
our Sequence Database tutorial to learn
more about sequence records.
2 - Score or bit score is a value calculated from the number of gaps
and substitutions associated with each aligned sequence. The higher
the score, the more significant the alignment. Each score links to
the corresponding pairwise alignment between query sequence and hit
sequence (also referred to as subject sequence).
3 - E Value (Expect Value) is the number of alignments with a similar
or better score that would be expected to occur in the database by
chance. The smaller the E Value, the more significant the alignment.
For example, the first alignment has a very low E value of e-117
(that is, 1 x 10^-117), meaning that a sequence with a similar score
is very unlikely to occur simply by chance.
4 - These links provide the user with direct access from BLAST results
to related entries in other databases.
links to LocusLink records and
links to structure records in NCBI's Molecular Modeling DataBase.
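The relationship between bit score and E value can be sketched with the standard formula E = m x n x 2^(-S'), where S' is the bit score, m is the query length, and n is the total length of the database. This is a simplified Python illustration: NCBI's servers use adjusted "effective" lengths, so the numbers will not match a real results page exactly, and the lengths used here are hypothetical.

```python
def expect_value(bit_score, query_len, db_len):
    """Approximate E value from a bit score: E = m * n * 2**(-S').

    query_len (m) and db_len (n) are raw lengths here; real BLAST
    reports use adjusted effective lengths.
    """
    return query_len * db_len * 2.0 ** (-bit_score)


# Hypothetical lengths: a 350-residue query against a
# 500,000,000-residue database.
for score in (30, 60, 200):
    print(score, expect_value(score, 350, 500_000_000))
```

Each additional bit of score halves the E value, while doubling the size of the database doubles it. This is why the same alignment can look less significant when searched against a larger database.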
5.
Below the descriptions are pairwise alignments that show the aligned
region of each hit sequence matched up with the corresponding region of
the query sequence.
With a pairwise alignment you can see how the hit sequence compares
with the query sequence amino acid by amino acid. The screen shot below
is the pairwise alignment for the first hit. For descriptions of different
types of sequence alignments see NCBI's
Examples
of Alignment Formats.
- The hit sequence
is presented in the Sbjct: line, and the query sequence in
the Query: line.
- Each letter
between the Subject and Query lines indicates that the amino acids
at that position in both sequences are identical. Use the Table
of Genetic Code to see which amino acid is represented by each
letter. Each blank space between the Subject and Query lines means
that amino acids at the specified position in both sequences do
not match.
- X's are inserted
into the query sequence as a result of automatic filtering. A string
of X's is used to replace a sequence's low-complexity regions that
can generate artifactual hits. In nucleotide sequences, N's replace
low-complexity regions rather than X's.
- Dashes inserted
into either query or subject sequence indicate gaps introduced to
compensate for insertions and deletions.
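The rules above for reading a pairwise alignment can be reproduced in a short Python sketch. Given one aligned Query line and one aligned Sbjct line of equal length, it rebuilds the middle line (a letter for an identity, a blank otherwise) and counts identities and gaps. The function names and the two aligned strings are made up for illustration. Real BLAST output also marks conservative substitutions with a + sign, which this simplified version ignores.

```python
def midline(query, sbjct):
    """Build the line shown between Query and Sbjct.

    A letter means both sequences have the same residue at that
    position; a blank means a mismatch or a gap.
    """
    if len(query) != len(sbjct):
        raise ValueError("aligned lines must have the same length")
    return "".join(q if q == s and q != "-" else " " for q, s in zip(query, sbjct))


def identity_summary(query, sbjct):
    """Return (identities, gaps, alignment_length) for two aligned lines."""
    identities = sum(ch != " " for ch in midline(query, sbjct))
    gaps = query.count("-") + sbjct.count("-")
    return identities, gaps, len(query)


# Made-up aligned lines; the dash in the query marks a gap.
query_line = "MKV-LAG"
sbjct_line = "MRVALAG"

print("Query:", query_line)
print("      ", midline(query_line, sbjct_line))
print("Sbjct:", sbjct_line)
print(identity_summary(query_line, sbjct_line))
```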
Additional BLAST Resources
This tutorial was
designed as a basic introduction to using BLAST and interpreting BLAST
results. To learn more about BLAST, check out the following NCBI resources
used as references for this tutorial:
BLAST Search Options Guide
BLAST provides several options for narrowing or modifying a search. Several
of the options presented on the protein-protein
BLAST page and the formatting BLAST page (accessible after submitting
a BLAST query) are explained below. Each search option on these pages
links to a BLAST Help page that includes a brief description of the option.
Search:
Besides pasting sequence data into the search box, you can also submit
query sequences by entering sequence identifier numbers such as accession
numbers or gi's. For descriptions of what accession numbers and gi's
are, see the Glossary
of Bioinformatics Terms.
Set Subsequence:
Lets you limit your query to a particular portion of your sequence.
For example, if you want to limit the query so that only the region
between amino acid residues 50 and 150 is compared with other protein
sequences, simply enter 50 into the From box and 150 into the
To box.
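In a script, the same restriction amounts to slicing the sequence. The Python sketch below assumes that the From and To values are 1-based and inclusive, which is the usual convention for residue numbering but is not stated explicitly here; the function name is hypothetical.

```python
def subsequence(seq, start, end):
    """Return residues start..end of seq, counting from 1, both ends included."""
    if not 1 <= start <= end <= len(seq):
        raise ValueError("start and end must lie within the sequence")
    # Python slices are 0-based and exclude the end index.
    return seq[start - 1:end]


# Made-up 200-residue placeholder sequence.
placeholder = "A" * 200
region = subsequence(placeholder, 50, 150)
print(len(region))
```

Under this assumption, From 50 and To 150 select 101 residues, because both end positions are included.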
Choose Database:
Choose from among the following protein sequence databases:
NR - Default
setting - All non-redundant translations of CDS (coding sequences)
of GenBank nucleotide sequences as well as amino acid sequences from
Protein Data Bank (PDB), SwissProt, Protein Information Resource (PIR),
and Protein Research Foundation (PRF) in Japan. See our Genome
Database Guide for more information about these databases. Non-redundant
means that the same sequence or translation in more than one database
should be listed only once in the BLAST output.
swissprot
- Only protein sequences from the last major release of Swiss-Prot
protein sequence database. No updates to Swiss-Prot sequences are
included.
pat -
Protein sequences derived from the Patent division of GenBank.
yeast
- Translations of Yeast (Saccharomyces cerevisiae) genomic CDS (coding
sequences).
ecoli
- Translations of Escherichia coli genomic CDS (coding sequences).
PDB -
Protein sequences derived from 3-dimensional structures at Protein
Data Bank (PDB). See our Genome Database
Guide for more information about PDB.
Drosophila
genome - Drosophila genome proteins provided by Celera and Berkeley
Drosophila Genome Project (BDGP).
month
- Sequences in the NR database that are new or have been added in
the last 30 days.
Do
CD Search: Checking this box will compare
the query sequence with the Conserved Domain Database. A domain is a
protein section that has a distinct evolutionary origin and function.
CD Search is carried out by default for each protein-protein BLAST query.
BLAST search results will include a link to CD-Search results if this
box is checked. For more information about CD Search, see the CDD
Home Page.
Options for Advanced Blasting
Limit
by entrez query: This option can be used
to specify search criteria for limiting or refining BLAST searches.
Any query statement that can be submitted to an Entrez database can
be entered into the first box. For example, you could enter mouse[ORGN]
OR rat[ORGN] to include only protein sequences from mice or rats. A
specific organism also may be chosen using the "Select from" drop-down
box on the right. For more information on formulating an entrez query,
see Refining
Your Search from the Entrez Help Document.
Choose filter:
Low complexity
- This option is checked as the default. This filter allows the masking
of query sequence portions that have low complexity (e.g., a long
string of the same amino acid or nucleotide). For a protein sequence
query, the filter will replace a low-complexity region with a string
of X's (e.g., XXXXXXXXXXXXX), or a string of N's in a nucleotide sequence
query. Low-complexity regions can result in high scores that reflect
compositional bias rather than significant position-by-position alignment
(Wootton
& Federhen, 1996). Filtering is applied only to the query sequence
(or its translation products), not to database sequences.
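To illustrate what masking does to a query, here is a deliberately simplified Python sketch that replaces long runs of a single repeated residue with X's. It is not the algorithm BLAST uses (the actual filter measures compositional complexity more generally), and the function name and the minimum run length of 6 are arbitrary assumptions.

```python
import re


def mask_homopolymers(seq, min_run=6, mask_char="X"):
    """Replace every run of min_run or more identical characters with mask_char.

    A toy stand-in for low-complexity filtering; use mask_char="N"
    for nucleotide sequences.
    """
    pattern = re.compile(r"(.)\1{%d,}" % (min_run - 1))
    return pattern.sub(lambda m: mask_char * len(m.group(0)), seq)


# Made-up sequences for illustration.
print(mask_homopolymers("MKVQQQQQQQQLAG"))
print(mask_homopolymers("ACGTAAAAAAAACGT", mask_char="N"))
```

The masked sequence keeps its original length, so residue numbering in the results still lines up with the unmasked query.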
Mask for lookup
table only - This option for advanced searchers is used in constructing
the lookup table used by BLAST. This experimental option is likely
to change in the future.
Mask lower
case - Select this option to customize filtering from the query
sequence when it is compared with other database sequences. The query
sequence in uppercase characters is entered into the search box, and
areas to be filtered are denoted in lowercase characters.
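The effect of the Mask lower case option can be pictured with a one-line Python sketch: lowercase residues are treated as masked and uppercase residues are left alone. Replacing the masked residues with X is an assumption made here for display purposes, and the function name is hypothetical.

```python
def mask_lowercase(seq, mask_char="X"):
    """Replace each lowercase character in seq with mask_char."""
    return "".join(mask_char if ch.islower() else ch for ch in seq)


# Made-up query in which the user marked three residues for filtering.
print(mask_lowercase("MKVqqqLAG"))
```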
Expect:
All sequences retrieved during a BLAST search must have an Expect (E Value)
lower than the number specified by this option. The Expect value is the
number of alignments with a similar or better score that would be expected
to occur in the database by chance.
The default Expect value is 10. Since hit
sequences with Expect values closer to zero are more statistically significant,
you may want to set this option to 1 or to some decimal value.
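The Expect option acts as a cutoff on the hit list. The Python sketch below applies such a threshold to a list of (name, E value) pairs; the hit names and values are made up for illustration and the function name is hypothetical.

```python
def filter_hits(hits, expect=10.0):
    """Keep hits whose E value is lower than the threshold, best first."""
    kept = [hit for hit in hits if hit[1] < expect]
    return sorted(kept, key=lambda hit: hit[1])


# Made-up hits: (name, E value).
hits = [("hit_b", 0.5), ("hit_c", 4.2), ("hit_a", 1e-117)]

print(filter_hits(hits))            # default threshold of 10 keeps all three
print(filter_hits(hits, expect=1))  # stricter threshold drops hit_c
```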
Other "Options
for Advanced Blasting," such as composition-based statistics, Word
size, Matrix, PSSM, Other Advanced, and PHI Pattern, are designed for
more advanced BLAST users. For our purposes, these options should be
left to their default values. For more information about these advanced
options, see BLAST
help.
Format
Show
Graphical
Overview - This option is selected by default. In BLAST results,
this option provides a graphic depiction of how the similar sequences
retrieved from the databases (the subject sequences) line up with
the query sequence (the thick red line at the top). The score of each
alignment is indicated by one of five different colors as defined
in the Color Key for Alignment Scores shown at the top of the graphical
overview.
Linkout
- Also selected by default. If this box is unchecked, no links from
BLAST results to other NCBI databases are provided.
NCBI-gi -
Also selected by default. This option allows the NCBI-GI (GenBank
Identifier, a number unique to each sequence) to be displayed for
each hit sequence included in output. NCBI-GI links to a subject sequence
record from NCBI sequence databases.
Format -
Leave the drop-down menu beside the NCBI-GI option set to the default
ALIGNMENT. Other selections in the drop-down menu
(PSSM and Bioseq) are for more advanced
users. To view the graphical overview, the HTML (default)
setting should be selected from the second drop-down menu in the Format
option. Selecting "Plain Text" from the drop-down menu will present
BLAST output in a more printer-friendly format; the graphical overview
feature, however, will be omitted and all hyperlinks deactivated.
Number of
Descriptions
- Restricts the number of matching-sequence descriptions reported.
The default limit is 100 descriptions.
Alignments
- Restricts the number of alignments (default alignment type
is pairwise) between query and subject sequences included in the BLAST
results. The default limit is 50.
Alignment View
To see some of
the following formats, see NCBI's Examples
of Alignment Formats.
Pairwise -
Default setting for alignment view in which each aligned region of the
query sequence is lined up, amino acid by amino acid, with the matching
region of a retrieved database sequence. When comparing DNA sequences
using BLAST, the query sequence's nucleotides are matched up with
those of each database sequence.
Query-anchored
with identities - Rather than a pairwise alignment, this is a
type of multiple alignment. In this view, a query-sequence segment
(for example, amino acids 1 through 60) is displayed with the corresponding
section of each retrieved sequence listed below it. Each query-sequence
segment begins with the number 1 at the far left, while each database-sequence
segment begins with its corresponding gi (GenBank identifier) at the
far left. Identities are displayed as dashes, with mismatches as single-letter
amino acid abbreviations.
Query-anchored
without identities - This multiple alignment
view is similar to query-anchored with identities; each match, however,
is indicated by the single-letter amino acid abbreviation instead
of a dash.
Hit Table:
Presents all BLAST results in a table that summarizes some of
the following information for each subject sequence retrieved: subject
ID, % identity between query and each subject sequence, alignment
length, number of mismatches, number of gap openings, E Value, and
bit score.
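Because the Hit Table is plain tabular text, it is convenient to read in a script. The Python sketch below assumes the widely used 12-column, tab-separated layout (query ID, subject ID, % identity, alignment length, mismatches, gap openings, query start, query end, subject start, subject end, E value, bit score) with comment lines starting with #. Check the column order against the header of your own output. The field names and the sample row are made up.

```python
from collections import namedtuple

# Field names are hypothetical labels for the assumed 12-column layout.
Hit = namedtuple(
    "Hit",
    "query_id subject_id pct_identity align_len mismatches gap_opens "
    "q_start q_end s_start s_end evalue bit_score",
)


def parse_hit_table(text):
    """Parse tab-separated hit table text into a list of Hit records."""
    hits = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        f = line.split("\t")
        if len(f) != 12:
            raise ValueError("expected 12 tab-separated columns")
        hits.append(Hit(
            f[0], f[1], float(f[2]),
            int(f[3]), int(f[4]), int(f[5]),
            int(f[6]), int(f[7]), int(f[8]), int(f[9]),
            float(f[10]), float(f[11]),
        ))
    return hits


# Made-up sample row for illustration only.
sample = "# hypothetical header line\nquery1\tsubject1\t68.50\t200\t60\t3\t1\t200\t5\t204\t1e-50\t190.0\n"

for hit in parse_hit_table(sample):
    print(hit.subject_id, hit.pct_identity, hit.evalue, hit.bit_score)
```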
The Limit results
by entrez query option is described above.
Format for PSI-BLAST and Expect value range options are
designed for more advanced BLAST users (see BLAST
help).
Acknowledgments
Source for screen
shots used in this tutorial:
NCBI BLAST. National
Center for Biotechnology Information. <http://www.ncbi.nlm.nih.gov/BLAST/>
(January 3, 2003).
Continue with
other tutorials:
Searching
OMIM: Finding information about genes, traits, and disorders
Finding a gene on a chromosome map
Accessing
records in NCBI's sequence databases
Examining
protein structures from the Protein Data Bank