This tutorial serves
as a basic introduction to finding sequence information for the human
genes associated with the genetic disorders and traits listed on the Human
Genome Landmarks Poster. Since this tutorial is targeted to new users
of NCBI's sequence databases, only selected features and options available
from these resources will be addressed. For more information on using
NCBI's sequence databases, see the Entrez
Help Document or the Sample
GenBank Record.
In this tutorial,
we will be using NCBI Entrez Nucleotides. This database contains sequence
data from GenBank, RefSeq, European Molecular Biology Laboratory (EMBL),
DNA DataBank of Japan (DDBJ), and Protein Data Bank.
GenBank is an NCBI
database that serves as an archive for all publicly available DNA sequences
from more than 75,000 organisms. GenBank is part of the International
Nucleotide Sequence Database Collaboration,
which also includes DDBJ and
EMBL. Sequence information
is exchanged daily among these three organizations. In addition to GenBank,
NCBI also maintains RefSeq, a database of non-redundant reference sequence
records. For more information about GenBank and RefSeq, see the GenBank
vs. RefSeq box below or access NCBI's
RefSeq Frequently Asked Questions.
GenBank vs. RefSeq
Sequence records
are created by scientists who submit sequence data to GenBank. As
an archival database, GenBank may contain hundreds of records for
the same gene. In addition, because there is no independent review
system, the types of information may vary from record to record,
and GenBank sequence data may contain errors and contaminant vector
DNA. To address some of the problems associated with GenBank sequence
records, NCBI developed its RefSeq database.
RefSeq is NCBI's
database of reference sequences. RefSeq serves as a curated, non-redundant
source of sequence information for genomic DNA contigs (genomic
segments constructed by ordering cloned DNA fragments), mRNA transcripts,
and proteins associated with known genes. RefSeq records are created
and updated as needed by NCBI staff. Since RefSeq records undergo
a review process that screens for problems such as sequencing errors
and vector contamination, RefSeq records are good sources of sequence
information.
Tutorial Tips
One option for
following along with the steps described in this tutorial is to open
two browser windows at once (one for the tutorial and one for NCBI resources)
and toggle between these two windows as needed. Another option would
be to print this tutorial out and then go to NCBI.
Finding a sequence record for a gene
This part of the
tutorial will demonstrate how to access both genomic and mRNA sequence
data for a particular gene using two NCBI resources: LocusLink and Entrez
Nucleotides. It is helpful to know that there is more than one way to
get to sequence records at NCBI. Not only can you directly search for
records at Entrez Nucleotides, you can also link to sequence records from
other NCBI resources such as LocusLink, OMIM, Map Viewer, and PubMed.
The mRNA sequences
stored in NCBI's sequence databases are really complementary DNA (cDNA)
sequences generated from mRNA transcripts extracted from the cells of
different organisms. Genomic DNA sequences of eukaryotic organisms contain
exons (segments retained in the mature transcript, including those that
encode protein) interspersed with introns (intervening noncoding segments).
After transcription, introns are spliced out of the primary transcript
(pre-mRNA), the exons are joined, and the transcript is processed further
to form mature messenger RNA (mRNA). Because the introns have been removed,
cDNA sequences are much shorter than their genomic counterparts.
Comparing a gene's
mRNA sequence with its genomic sequence reveals a great deal, such as
where nucleotides that code for the mRNA transcript begin and end, what
segments of a gene actually code for amino acids, and how much of a gene's
sequence consists of intron DNA.
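The relationship between a genomic sequence and its spliced transcript can
be sketched in a few lines of Python. This is only an illustration under
stated assumptions: the 21-base sequence and its exon coordinates are made
up (they are not taken from HFE or any NCBI record), the function names
splice_exons and intron_fraction are hypothetical, and exons are assumed to
be non-overlapping intervals given as 1-based, inclusive positions.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), 1-based and inclusive


def splice_exons(genomic: str, exons: List[Interval]) -> str:
    """Join the exon intervals of a genomic sequence, dropping the introns."""
    return "".join(genomic[start - 1:end] for start, end in sorted(exons))


def intron_fraction(exons: List[Interval]) -> float:
    """Fraction of the span from first exon start to last exon end that is intron."""
    ordered = sorted(exons)
    span = ordered[-1][1] - ordered[0][0] + 1
    exon_bases = sum(end - start + 1 for start, end in ordered)
    return (span - exon_bases) / span


# Made-up toy gene: two 6-base exons separated by one 9-base intron.
exon1, intron, exon2 = "ATGGCC", "GTAAGTCAG", "GATTGA"
genomic = exon1 + intron + exon2
exons = [(1, 6), (16, 21)]

spliced = splice_exons(genomic, exons)
print(spliced)                 # the exons joined, with the intron removed
print(intron_fraction(exons))  # share of the gene span that is intron
```

With a real record, the exon coordinates would come from the exon features
of the genomic sequence record rather than from hand-written intervals.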
Before beginning
a search for a gene's sequence, it is helpful to know the gene's official
gene symbol. Searching by gene symbol is more specific than searching
by disorder or trait name. If you do not know the official symbol for
a particular gene, see the OMIM tutorial to learn
how to find information about genes associated with disorders or traits.
Although this tutorial
will demonstrate how to find sequence data for HFE, the human gene associated
with hereditary hemochromatosis, the same process can be used to find
sequence information for other human genes.
Finding sequence records with LocusLink
1. Go to the LocusLink
Web site:
http://www.ncbi.nlm.nih.gov/LocusLink/
2. In the search
box at the top of the LocusLink home page:
3.
The search should return one entry.
4.
Click on the LocusID number 3077 to
pull up the LocusLink record. The record for HFE should look like the
following screenshot.
5.
Once the record is open, click on RefSeq in the blue navigation
column on the left. This is a quick link to NCBI Reference Sequences
for the HFE gene and the protein it encodes.
6.
In the NCBI Reference Sequences (RefSeq) Section of
the LocusLink record, you will find direct links to RefSeq mRNA and
protein records.
Notice
that there are multiple variants for the same gene. How each variant
differs from the most complete variant (variant 1) is described in
the Transcript Variant section of each HFE reference
sequence entry.
Accession
numbers for RefSeq mRNA sequences always begin with NM_, and those
for RefSeq protein sequences always begin with NP_. Reference sequences
should be used when available because they have been (or are in the
process of being) reviewed by NCBI staff to ensure completeness and
freedom from error and contamination.
7.
To open the mRNA sequence record for the HFE gene, simply click on the
reference sequence with the accession number NM_000410.
8.
To find a genomic sequence (a gene sequence that includes both introns
and exons) for HFE, click on GenBank below RefSeq in the
blue navigation column on the left. This will take you to the following
section of the LocusLink record:
9.
In the GenBank Sequences section, all nucleotide sequences that
are type g are genomic sequences for the HFE gene. You will need
to open each record and browse through definition, sequence annotation,
and comments for more information about the nucleotide sequence contained
within each GenBank sequence record.
For
example, the U91328
GenBank record contains the sequence of a genomic segment that not
only includes the HFE gene sequence but also sequences for the histone
2A-like protein gene, RoRet gene, and sodium phosphate transporter
(NPT3) gene. Y09801
contains only sequence information for the HFE promoter and the HFE
gene's first exon. Of the three genomic GenBank records listed for
HFE, Z92910
is best for examining HFE's genomic sequence and its seven exons and
six introns.
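The accession prefixes described in step 6 make it easy to tell RefSeq mRNA
and protein records apart in a script. The sketch below is a minimal
illustration, not an NCBI tool: the function name classify_accession is
hypothetical, and only the two prefixes named in this tutorial (NM_ and NP_)
are recognized. RefSeq uses other prefixes as well, and those are
deliberately left out here.

```python
# Only the prefixes mentioned in this tutorial; other RefSeq prefixes exist.
REFSEQ_PREFIXES = {
    "NM_": "RefSeq mRNA",
    "NP_": "RefSeq protein",
}


def classify_accession(accession: str) -> str:
    """Label an accession by its prefix (hypothetical helper for illustration)."""
    cleaned = accession.strip().upper()
    for prefix, label in REFSEQ_PREFIXES.items():
        if cleaned.startswith(prefix):
            return label
    return "not NM_ or NP_ (for example, a GenBank record)"


for acc in ["NM_000410", "Z92910", "U91328", "Y09801"]:
    print(acc, "->", classify_accession(acc))
```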
Now
you are familiar with accessing sequence records using NCBI's LocusLink
database. The following section will describe how you can access the sequence
data for a gene by directly searching Entrez Nucleotides.
Finding sequence records with NCBI's Entrez Nucleotides
1.
Go to the Entrez Nucleotides Web site:
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Nucleotide
Entrez Nucleotides
provides many options for refining your search strategy. Limits,
Boolean operators, and search fields and qualifiers are briefly
discussed in this tutorial. For more detailed information regarding
these search features, see the Entrez
Help Document.
2. Some search
fields and limits that may be helpful in refining your search for the
sequence of a gene include:
Search Field | Definition | Qualifier
Gene Name | Standard and common names of genes found in database records. | [GENE]
Organism | Standard and common names of source organisms for protein or nucleotide sequences. | [ORGN]
Protein Name | Standard names of proteins found in database records. Common names may not be indexed in this field, so it is best to also consider All Fields or Text Word. | [PROT]
Text Word | All "free text" associated with a record. | [WORD]
Title Word | Words found in a record's definition line, which summarizes sequence biology and is carefully constructed by database staff. A standard definition line will include organism, product name, gene symbol, molecule type, and whether it is a partial or complete CDS. | [TITL]
All Fields | All searchable fields in the database. | [ALL]
Definitions and qualifiers for the search fields listed above were obtained from
the Entrez Help Document Summary Tables.
Limit | Definition
Molecule Type | Limit searches to retrieve only genomic DNA/RNA, messenger RNA (mRNA), or ribosomal RNA (rRNA).
Database | Limit searches to retrieve only from RefSeq, GenBank, EMBL, DDBJ, or the structure database Protein Data Bank (PDB).
3. Example of a
search for the HFE gene's mRNA reference sequence:
The search statement
above will search for records with human HFE gene sequences. Always
enter Boolean operators such as OR, AND, or NOT in all capital letters.
After entering the search statement, select the Limits link
to limit your search to a particular molecule type and database.
Selecting mRNA
from the Molecule drop-down menu will retrieve only mRNA sequences.
Selecting RefSeq
from the Only from drop-down box will limit your search to
sequences in the RefSeq database. After you have selected your limit
options, click the GO button to submit your query.
4. This search
should retrieve eleven records. See screenshot below. Each of these
eleven records is a variant form of the mRNA for the HFE gene. To see
how these variants differ from one another you will need to open each
record and read about the variant in the COMMENT section of the record.
After examining these records, you will find that Variant 1 is the longest,
most complete mRNA reference sequence for the HFE gene.
5. You can modify
the above strategy so that you retrieve only genomic nucleotide sequences
(gene sequences that include both coding and noncoding segments of genes):
The query above
will still search for records that list HFE as the gene symbol and
Homo sapiens or human as the source organism but will eliminate
any records that include partial CDS or partial sequences in the definition
line.
Rather than limiting
this search to mRNA sequences in the RefSeq database, select Genomic
DNA/RNA from the Molecule drop-down menu, and leave the database
limit set to its default value (Only from).
This query will
retrieve a much smaller and more relevant results set than if you
had attempted a search without any qualifiers or limits. Simply typing
HFE into the search box and searching without using
any limits or field qualifiers will retrieve more than 100 results!
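The two search strategies described in steps 3 and 5 can be written out as
query strings. The sketch below assembles them from the field qualifiers in
the table above. The exact statements shown are assumptions reconstructed
from the descriptions in this section, not the tutorial's original search
statements, and the helper names are hypothetical. The last lines also
encode a query for NCBI's E-utilities ESearch service. That step is not part
of this tutorial, and the Molecule and Only from limits are set in the web
form, so they are not reproduced here.

```python
from urllib.parse import urlencode

# ESearch endpoint of NCBI's E-utilities (an addition, not covered by the tutorial).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def field(term: str, qualifier: str) -> str:
    """Attach a search field qualifier, e.g. HFE[GENE]."""
    return f"{term}[{qualifier}]"


def any_of(*terms: str) -> str:
    """Combine terms with the Boolean operator OR (always in capital letters)."""
    return "(" + " OR ".join(terms) + ")"


def build_gene_query(symbol: str, exclude_partial: bool = False) -> str:
    """Build a human gene query; optionally drop 'partial' definition lines."""
    organism = any_of(field("Homo sapiens", "ORGN"), field("human", "ORGN"))
    query = f"{field(symbol, 'GENE')} AND {organism}"
    if exclude_partial:
        query += f" NOT {field('partial', 'TITL')}"
    return query


mrna_query = build_gene_query("HFE")
genomic_query = build_gene_query("HFE", exclude_partial=True)
print(mrna_query)
print(genomic_query)

# URL-encode the query for ESearch; this only builds the URL and sends nothing.
url = ESEARCH_URL + "?" + urlencode({"db": "nucleotide", "term": genomic_query})
print(url)
```

Either query string can be pasted into the Entrez Nucleotides search box and
then combined with the limits described above.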
Now that you have
learned some options for finding sequence records, the next section of
this tutorial will help you make sense of the information provided in
these records.
Making sense of the sequence record
This part of the
tutorial is designed to help you identify the kinds of information contained
in sequence records. By understanding how these records are organized,
you will be able to conduct more effective searches.
Let's examine the
GenBank record for the genomic sequence of HFE, the hereditary hemochromatosis
gene. The GenBank ID for this record is Z92910
(clicking on this link will open the GenBank record in a new browser window).
Field Descriptions
(As adapted from
the NCBI
Sample GenBank Record)
1
The LOCUS field consists of five different
subfields:
1a
Locus Name
(HSHFE) - The locus name is a tag for grouping similar sequences. The
first two or three letters usually designate the organism. In this case
HS stands for Homo sapiens. The last several characters
are associated with another group designation, such as gene product.
In this example, the last three characters represent the gene symbol, HFE.
Currently, the only requirement for assigning a Locus name to a record
is that it be unique.
1b
Sequence
Length (12146 bp) - The total number of nucleotide base pairs (or
amino acid residues) in the sequence record. Nucleotide sequence length
can range from 50 bp to 350 kb.
1c
Molecule
Type (DNA) - Type of molecule that was sequenced. All sequence data
must come from a single molecule type. Some examples of molecule type
include genomic DNA, mRNA (cDNA), genomic RNA, and ribosomal RNA.
1d
GenBank
Division (PRI) - There are 16
different GenBank divisions. In this example, PRI stands for primate
sequences. Some other divisions include ROD (rodent sequences), MAM
(other mammal sequences), PLN (plant, fungal, and algal sequences),
and BCT (bacterial sequences).
1e
Modification
Date (23-July-1999) - Date of most recent modification made to the
record. The date of first public release is not available in the sequence
record. This information can be obtained only by contacting NCBI at
info@ncbi.nlm.nih.gov.
2
DEFINITION
- Brief description of the sequence. The description may include source
organism name, gene or protein name, or function of a noncoding sequence
(e.g., a promoter region). For sequences containing a coding region
(CDS), the definition field may also contain a completeness qualifier
such as "complete CDS" or "exon 1," indicating sequence
information pertaining only to a gene's first exon.
3
ACCESSION
(Z92910) - Unique identifier assigned to a complete sequence record.
This number never changes, even if the record is modified. An accession
number is a combination of letters and numbers that are usually in the
format of one letter followed by five digits (e.g., M12345) or two letters
followed by six digits (e.g., AC123456).
4
VERSION
(Z92910.1) - Identification number assigned to a single, specific
sequence in the database. This number is in the format accession.version.
If any changes are made to the sequence data, the version part of the
number will increase by one. For example U12345.1 becomes U12345.2.
A version number of Z92910.1 for this HFE sequence indicates that the
sequence data has not been altered since its original submission.
5
GI
(1890179) - Also a sequence identification number. Whenever a sequence
is changed, the version number is increased and a new GI is assigned.
If a nucleotide sequence record contains a protein translation of the
sequence, the translation will have its own GI number.
6
KEYWORDS
(haemochromatosis; HFE gene) - A keyword can be any word or phrase used
to describe the sequence. Keywords are not based on any controlled vocabulary.
Notice that in this record the keyword "haemochromatosis,"
the British spelling of the term "hemochromatosis," is used.
For many records, no keywords are included. A period is placed in this
field for records without keywords.
7
SOURCE (human)
- Usually contains an abbreviated or common form of the source organism's
name.
8
ORGANISM (Homo sapiens) - Source organism's formal
scientific name (usually genus and species) and phylogenetic lineage.
See the NCBI Taxonomy
Homepage for more information about the classification scheme used
to construct the organism's lineage.
9
REFERENCE
- Citations of publications by sequence authors that support information
presented in the sequence record. Several references may be included
in one record. References are automatically sorted so that the oldest
are always listed first. The last citation in this field provides contact
information for the submitter of the sequence record, and the title
of this reference contains the words "Direct Submission." Cited publications
listed as references are searchable by author, article or publication
title, journal title, or MEDLINE unique identifier (UID). The UID links
to the reference's MEDLINE record.
FEATURES
In a sequence record,
a list of sequence features follows the references. A feature is simply
an annotation that describes a portion of the sequence. An alphabetical
list of features can be found in Appendix III: Feature Keys Reference of
the DDBJ/EMBL/GenBank
Feature Table. Each feature includes a location (sequence interval
to which the feature refers) and one or several qualifiers. Clicking
on the feature name will open a record for the sequence interval identified
in the feature location.
The following
features are included in the sample HFE sequence record Z92910:
source
- The source feature must be included in each sequence record. The
source gives the length of the entire sequence, the scientific name
of the source organism, and the Taxon ID number. Other types of information
that the submitter may include in this field are chromosome number,
map location, and clone or strain identification.
exon
- Sequence segment that codes for a portion of spliced mRNA, rRNA,
or tRNA. An exon may contain a portion of mRNA's 5' UTR (untranslated
region) or 3' UTR, in addition to part of the coding sequence. The
name of the gene to which the exon belongs and exon number are provided.
gene
- Sequence portion that encodes a specific functional product.
CDS
- Sequence of nucleotides that code for amino acids of the protein
product (coding sequence). The CDS begins with the start codon's first
nucleotide and ends with the stop codon's third nucleotide. This feature
includes the coding sequence's amino acid translation and may also
contain gene name, gene product function, link to protein sequence
record, and cross-references to other database entries.
intron
- Segment of noncoding sequence that is transcribed but removed from
the transcript by splicing together the exons (sequence portions)
on either side of it.
polyA_signal
- Identifies the sequence portion required for endonuclease cleavage
of an mRNA transcript. Consensus sequence for the polyA signal is
AATAAA.
BASE COUNT - Base Count gives the total number
of adenine (A), cytosine (C), guanine (G), and thymine (T) bases in
the sequence.
ORIGIN - Origin contains the sequence data,
which begins on the line immediately below the field title.
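The header fields described above can also be read by a short script. The
sketch below is a minimal illustration, not a complete GenBank parser. The
sample lines are assembled by hand from the values quoted in this section,
so their exact spacing and the DD-MON-YYYY date style are assumptions about
the flat-file layout. The helper names are hypothetical, and the accession
check covers only the two formats described under ACCESSION.

```python
import re
from collections import Counter


def parse_locus(line: str) -> dict:
    """Split a LOCUS line into the five subfields described above."""
    tokens = line.split()
    if len(tokens) < 7 or tokens[0] != "LOCUS":
        raise ValueError("not a LOCUS line in the expected layout")
    return {
        "name": tokens[1],
        "length": int(tokens[2]),  # tokens[3] is the unit, e.g. "bp"
        "molecule": tokens[4],
        "division": tokens[-2],    # counted from the end of the line
        "date": tokens[-1],
    }


def parse_version(line: str) -> dict:
    """Split a VERSION line into accession, version number, and optional GI."""
    tokens = line.split()
    accession, version = tokens[1].rsplit(".", 1)
    gi = None
    for token in tokens[2:]:
        if token.startswith("GI:"):
            gi = int(token[3:])
    return {"accession": accession, "version": int(version), "gi": gi}


def is_classic_accession(accession: str) -> bool:
    """True for one letter + five digits or two letters + six digits."""
    return re.fullmatch(r"[A-Z]\d{5}|[A-Z]{2}\d{6}", accession) is not None


def count_bases(sequence_lines: list) -> Counter:
    """Count bases in the sequence lines below ORIGIN (not the title line)."""
    bases = "".join(re.sub(r"[^a-z]", "", line.lower()) for line in sequence_lines)
    return Counter(bases)


# Hand-assembled sample lines (layout assumed); the sequence line is made up.
print(parse_locus("LOCUS       HSHFE      12146 bp    DNA             PRI       23-JUL-1999"))
print(parse_version("VERSION     Z92910.1  GI:1890179"))
print(is_classic_accession("Z92910"))
print(count_bases(["        1 aaccggtt aacc"]))
```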
Now that you are
familiar with the fields of sequence records, the following section of
this tutorial will describe some of the different options available for
displaying sequence data.
Exploring display options for sequence records
The
GenBank ID for the record used in this part of the tutorial is Z92910
(clicking on this link will open the GenBank record in a new browser
window).
In the left
corner near the top of each record in the Entrez Nucleotide database
is the Display drop-down box. The Display setting described in the
previous section of this tutorial is the default or GenBank
display. NCBI also provides several other formats for viewing
a sequence record.
- The Summary
display will bring up the sequence's Accession number and an abbreviated
description. The GI List is another brief format that lists the
GI (GenInfo identification) number for the sequence.
- The ASN.1
display will bring up a computer-readable data format known as
the Abstract Syntax Notation 1 form. XML (Extensible Markup Language),
GBSeqXML, and TinySeqXML are other computer-readable formats.
- The FASTA
format consists of a single line of descriptive text called the
definition line followed by the sequence characters. The FASTA
format of a sequence can be used as input for sequence analysis
tools such as NCBI's BLAST.
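Of the formats listed above, FASTA is the simplest to produce and read in a
script. The sketch below writes and parses a single FASTA record. It is only
an illustration: the identifier, description, and sequence are made up, the
function names are hypothetical, and the 70-character line width is an
arbitrary choice. The only conventions relied on are that the definition
line begins with ">" and that the sequence follows on the lines after it.

```python
def to_fasta(identifier: str, description: str, sequence: str, width: int = 70) -> str:
    """Format one record: a '>' definition line, then wrapped sequence lines."""
    header = f">{identifier} {description}".rstrip()
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([header] + lines) + "\n"


def parse_fasta(text: str) -> tuple:
    """Return (definition line without '>', sequence) for a single record."""
    header, *sequence_lines = text.strip().splitlines()
    if not header.startswith(">"):
        raise ValueError("a FASTA record must start with a '>' definition line")
    return header[1:], "".join(line.strip() for line in sequence_lines)


# Made-up record, wrapped at 8 characters to keep the output short.
record = to_fasta("seq1", "made-up example sequence", "ACGT" * 5, width=8)
print(record)
print(parse_fasta(record))
```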
In addition to default
or GenBank format, Graphics is another format type that can help you learn
about different features of a particular sequence. To access the graphic
display of a sequence, select Graphics from the Display drop-down
menu and click the Display button. A screenshot from the graphic
display of the human HFE genomic sequence record Z92910
is shown below.
The thick blue
bar at the top represents the entire sequence included in the record.
The gray bar below it indicates where the HFE gene begins and ends.
Below the legend, the annotated sequence is displayed, 2000 base pairs
at a time. Use the blue arrows to move up and down the sequence. Exploring
a sequence in the graphic display can help you identify where the coding
sequence, exons, introns, and other gene features begin and end.
This concludes
our introduction to finding and interpreting sequence records from NCBI
databases. See the following resources for more information about NCBI's
sequence databases:
Acknowledgments
Sources for screenshots
used in this tutorial:
LocusLink. National
Center for Biotechnology Information. <http://www.ncbi.nlm.nih.gov/LocusLink/>
(January 2, 2003).
Entrez Nucleotides.
National Center for Biotechnology Information. <http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?dB=Nucleotide>
(January 2, 2003).
Continue with other
tutorials:
Searching
OMIM: Finding information about genes, traits, and disorders
Finding
a gene on a chromosome map
Sequence
similarity searching using NCBI BLAST
Examining
protein structures from the Protein Data Bank