DOE Genomes
The U.S. Department of Energy Biological and Environmental Research program funds this site.

Accessing records in NCBI's sequence databases


This tutorial serves as a basic introduction to finding sequence information for the human genes associated with the genetic disorders and traits listed on the Human Genome Landmarks Poster. Because this tutorial is aimed at new users of NCBI's sequence databases, it covers only selected features and options of these resources. For more information on using NCBI's sequence databases, see the Entrez Help Document or the Sample GenBank Record.


In this tutorial, we will be using NCBI Entrez Nucleotides. This database contains sequence data from GenBank, RefSeq, European Molecular Biology Laboratory (EMBL), DNA DataBank of Japan (DDBJ), and Protein Data Bank.

GenBank is an NCBI database that serves as an archive for all publicly available DNA sequences from more than 75,000 organisms. GenBank is part of the International Nucleotide Sequence Database Collaboration, which also includes DDBJ and EMBL. Sequence information is exchanged daily among these three organizations. In addition to GenBank, NCBI also maintains RefSeq, a database of non-redundant reference sequence records. For more information about GenBank and RefSeq, see the GenBank vs. RefSeq box below or access NCBI's RefSeq Frequently Asked Questions.

GenBank vs. RefSeq

Sequence records are created by scientists who submit sequence data to GenBank. As an archival database, GenBank may contain hundreds of records for the same gene. In addition, because there is no independent review system, the types of information may vary from record to record, and GenBank sequence data may contain errors and contaminant vector DNA. To address some of the problems associated with GenBank sequence records, NCBI developed its RefSeq database.

RefSeq is NCBI's database of reference sequences. RefSeq serves as a curated, non-redundant source of sequence information for genomic DNA contigs (genomic segments constructed by ordering cloned DNA fragments), mRNA transcripts, and proteins associated with known genes. RefSeq records are created and updated as needed by NCBI staff. Since RefSeq records undergo a review process that screens for problems such as sequencing errors and vector contamination, RefSeq records are good sources of sequence information.

Tutorial Tips

One option for following along with the steps described in this tutorial is to open two browser windows at once (one for the tutorial and one for NCBI resources) and toggle between these two windows as needed. Another option would be to print this tutorial out and then go to NCBI.


Finding a sequence record for a gene

This part of the tutorial will demonstrate how to access both genomic and mRNA sequence data for a particular gene using two NCBI resources: LocusLink and Entrez Nucleotides. It is helpful to know that there is more than one way to get to sequence records at NCBI. Not only can you directly search for records at Entrez Nucleotides, you can also link to sequence records from other NCBI resources such as LocusLink, OMIM, Map Viewer, and PubMed.

The mRNA sequences stored in NCBI's sequence databases are really complementary DNA (cDNA) sequences generated from mRNA transcripts extracted from the cells of different organisms. Genomic DNA sequences of eukaryotic organisms contain exons (segments retained in the mature mRNA, most of which encode protein) interspersed with introns (noncoding segments that are removed from the transcript). After transcription, introns are spliced out, exons are joined together, and the primary transcript is further processed to form mature messenger RNA (mRNA). Because the introns have been removed, cDNA sequences are much shorter than their genomic counterparts.

Comparing a gene's mRNA sequence with its genomic sequence reveals a great deal, such as where the nucleotides that code for the mRNA transcript begin and end, which segments of a gene actually code for amino acids, and how much of a gene's sequence is made up of intron DNA.
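
The comparison described above can be sketched in a few lines of Python. This is only an illustration: the sequence and the exon coordinates below are made up (they are not taken from any HFE record), and the helper names splice and intron_fraction are hypothetical.

```python
# Toy genomic sequence (made-up data): exons in uppercase, introns in lowercase.
GENOMIC = "ATGGCC" + "gtaagtttttcag" + "GATTAC" + "gtgagtccccag" + "TGGTAA"

# Hypothetical exon coordinates as 0-based, end-exclusive (start, end) pairs.
EXONS = [(0, 6), (19, 25), (37, 43)]


def splice(genomic, exons):
    """Join the exon intervals in order, dropping the introns between them."""
    return "".join(genomic[start:end] for start, end in exons)


def intron_fraction(genomic_length, exons):
    """Fraction of the sequence that is not covered by any exon."""
    exonic = sum(end - start for start, end in exons)
    return (genomic_length - exonic) / genomic_length


cdna = splice(GENOMIC, EXONS)
print(len(GENOMIC), len(cdna))
print(round(intron_fraction(len(GENOMIC), EXONS), 2))
```

Even in this tiny example the spliced sequence is less than half the length of the genomic one, which is the effect the paragraph above describes.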

Before beginning a search for a gene's sequence, it is helpful to know the gene's official gene symbol. Searching by gene symbol is more specific than searching by disorder or trait name. If you do not know the official symbol for a particular gene, see the OMIM tutorial to learn how to find information about genes associated with disorders or traits.

Although this tutorial will demonstrate how to find sequence data for HFE, the human gene associated with hereditary hemochromatosis, the same process can be used to find sequence information for other human genes.


Finding sequence records with LocusLink

1. Go to the LocusLink Web site:

http://www.ncbi.nlm.nih.gov/LocusLink/

LocusLink Home

2. In the search box at the top of the LocusLink home page:

  • Enter the gene symbol in the query box, followed by [sym]. The [sym] qualifier tells LocusLink to search by gene symbol only; without it, the search will return every record that mentions the query term anywhere. Since a gene symbol is unique for each human gene, a qualified search should retrieve only one result. For the hereditary hemochromatosis gene, enter HFE[sym] in the query box. For more information on options for refining your search, see the Query Tips section of LocusLink Help.
  • Choose Human from the Organism drop-down menu on the right; otherwise, LocusLink will also retrieve records from other organisms such as mouse and rat.
  • Once the search box at the top of the LocusLink page looks like the screenshot below, click Go to submit your query.
LocusLink Search Box

3. The search should return one entry.

LocusLink Search Results

4. Click on the LocusID number 3077 to pull up the LocusLink record. The record for HFE should look like the following screenshot.

LocusLink Record

5. Once the record is open, click on RefSeq in the blue navigation column on the left. This is a quick link to NCBI Reference Sequences for the HFE gene and the protein it encodes.

6. In the NCBI Reference Sequences (RefSeq) Section of the LocusLink record, you will find direct links to RefSeq mRNA and protein records.

RefSeq Section of LocusLink Record

Notice that there are multiple variants for the same gene. How each variant differs from the most complete variant (variant 1) is described in the Transcript Variant section of each HFE reference sequence entry.

Accession numbers for RefSeq mRNA sequences always begin with NM_, and those for RefSeq protein sequences always begin with NP_. Reference sequences should be used when available because they have been (or are in the process of being) reviewed by NCBI staff to ensure completeness and freedom from error and contamination.
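
As a small illustration of the prefix rule just described, the following Python sketch sorts accession numbers by their RefSeq prefix. The function name refseq_molecule is hypothetical, and only the two prefixes mentioned in this tutorial (NM_ and NP_) are handled; anything else is reported as unknown.

```python
# Prefixes named in this tutorial; other RefSeq prefixes are not covered here.
REFSEQ_PREFIXES = {"NM_": "mRNA", "NP_": "protein"}


def refseq_molecule(accession):
    """Return 'mRNA' or 'protein' for NM_/NP_ accessions, otherwise None."""
    upper = accession.upper()
    for prefix, molecule in REFSEQ_PREFIXES.items():
        if upper.startswith(prefix):
            return molecule
    return None


# NM_000410 is the HFE mRNA accession used in this tutorial;
# NP_123456 is a made-up protein accession, and Z92910 is a GenBank accession.
for acc in ("NM_000410", "NP_123456", "Z92910"):
    print(acc, refseq_molecule(acc))
```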

7. To open the mRNA sequence record for the HFE gene, simply click on the reference sequence with the accession number NM_000410.

8. To find a genomic sequence (a gene sequence that includes both introns and exons) for HFE, click on GenBank below RefSeq in the blue navigation column on the left. This will take you to the following section of the LocusLink record:

GenBank Section of LocusLink Record

9. In the GenBank Sequences section, all nucleotide sequences of type g are genomic sequences for the HFE gene. You will need to open each record and browse through its definition, sequence annotation, and comments for more information about the nucleotide sequence it contains.

For example, the U91328 GenBank record contains the sequence of a genomic segment that not only includes the HFE gene sequence but also sequences for the histone 2A-like protein gene, RoRet gene, and sodium phosphate transporter (NPT3) gene. Y09801 contains only sequence information for the HFE promoter and the HFE gene's first exon. Of the three genomic GenBank records listed for HFE, Z92910 is best for examining HFE's genomic sequence and its seven exons and six introns.

Now you are familiar with accessing sequence records using NCBI's LocusLink database. The following section will describe how you can access the sequence data for a gene by directly searching Entrez Nucleotides.



Finding sequence records with NCBI's Entrez Nucleotides

1. Go to the Entrez Nucleotides Web site:

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Nucleotide

Entrez Nucleotide Home

Entrez Nucleotides provides many options for refining your search strategy. The use of Limits, Boolean operators, and search fields and qualifiers is briefly discussed in this tutorial. For more detailed information regarding these search features, see the Entrez Help Document.

2. Some search fields and limits that may be helpful in refining your search for the sequence of a gene include:

Search Field [Qualifier] - Definition

  • Gene Name [GENE] - Standard and common names of genes found in database records.
  • Organism [ORGN] - Standard and common names of source organisms for protein or nucleotide sequences.
  • Protein Name [PROT] - Standard names of proteins found in database records. Common names may not be indexed in this field, so it is best to also consider All Fields or Text Word.
  • Text Word [WORD] - All "free text" associated with a record.
  • Title Word [TITL] - Words found in a record's definition line, which summarizes sequence biology and is carefully constructed by database staff. A standard definition line will include organism, product name, gene symbol, molecule type, and whether it is a partial or complete CDS.
  • All Fields [ALL] - All searchable fields in the database.

Definitions and qualifiers for the search fields listed above were obtained from the Entrez Help Document Summary Tables.

Limit - Definition

  • Molecule Type - Limit searches to retrieve only genomic DNA/RNA, messenger RNA (mRNA), or ribosomal RNA (rRNA).
  • Database - Limit searches to retrieve only from RefSeq, GenBank, EMBL, DDBJ, or the structure database Protein Data Bank (PDB).

3. Example of a search for the HFE gene's mRNA reference sequence:

Entrez Nucleotides Search Box

The search statement shown in the screenshot above will search for records with human HFE gene sequences. Always enter Boolean operators such as OR, AND, or NOT in all capital letters. After entering the search statement, select the Limits link to limit your search to a particular molecule type and database.

Limit By Molecule Type

Selecting mRNA from the Molecule drop-down menu will retrieve only mRNA sequences.

Limit By Database

Selecting RefSeq from the Only from drop-down box will limit your search to sequences in the RefSeq database. After you have selected your limit options, click the GO button to submit your query.

4. This search should retrieve eleven records (see the screenshot below). Each of these eleven records is a variant form of the mRNA for the HFE gene. To see how these variants differ from one another, you will need to open each record and read about the variant in the COMMENT section of the record. Examining these records shows that Variant 1 is the longest, most complete mRNA reference sequence for the HFE gene.

Results from Search

5. You can modify the above strategy so that you retrieve only genomic nucleotide sequences (gene sequences that include both coding and noncoding segments of genes):

Search Strategy

The query shown in the screenshot above will still search for records that list HFE as the gene symbol and Homo sapiens or human as the source organism, but it will eliminate any records that include partial CDS or partial sequences in the definition line.

Rather than limiting this search to mRNA sequences in the RefSeq database, select Genomic DNA/RNA from the Molecule drop-down menu, and leave the database limit set to its default value (Only from).

This query will retrieve a much smaller and more relevant results set than if you had attempted a search without any qualifiers or limits. Simply typing HFE into the search box and searching without using any limits or field qualifiers will retrieve more than 100 results!
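
To make the structure of such search statements concrete, here is a minimal Python sketch that assembles them from field qualifiers and Boolean operators. The exact statements used in this tutorial's screenshots are not reproduced in the text, so the two queries built below are guesses based on the qualifiers listed earlier and on the descriptions of the searches. The helper names (field, any_of, all_of, exclude) are hypothetical.

```python
def field(term, qualifier):
    """Attach a field qualifier to a term, e.g. HFE + GENE gives HFE[GENE]."""
    return f"{term}[{qualifier}]"


def any_of(*clauses):
    """Combine clauses with OR, wrapped in parentheses."""
    return "(" + " OR ".join(clauses) + ")"


def all_of(*clauses):
    """Combine clauses with AND."""
    return " AND ".join(clauses)


def exclude(query, *clauses):
    """Append one NOT clause per excluded term."""
    return query + "".join(f" NOT {clause}" for clause in clauses)


# Guessed reconstruction: human HFE gene records.
human_hfe = all_of(
    field("HFE", "GENE"),
    any_of(field("Homo sapiens", "ORGN"), field("human", "ORGN")),
)

# Guessed reconstruction: the same search without partial sequences.
genomic_hfe = exclude(human_hfe, field("partial", "TITL"))

print(human_hfe)
print(genomic_hfe)
```

The Boolean operators are written in capital letters, as the tutorial requires. Molecule type and database limits are still chosen on the Limits page and are not part of these strings.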

Now that you have learned some options for finding sequence records, the next section of this tutorial will help you make sense of the information provided in these records.



Making sense of the sequence record

This part of the tutorial is designed to help you identify the kinds of information contained in sequence records. By understanding how these records are organized, you will be able to conduct more effective searches.

Let's examine the GenBank record for the genomic sequence of HFE, the hereditary hemochromatosis gene. The GenBank ID for this record is Z92910 (clicking on this link will open the GenBank record in a new browser window).

HFE Sequence Record

Field Descriptions
(As adapted from the NCBI Sample GenBank Record)

1 The LOCUS field consists of five different subfields:

1a Locus Name (HSHFE) - The locus name is a tag for grouping similar sequences. The first two or three letters usually designate the organism. In this case, HS stands for Homo sapiens. The last several characters are associated with another group designation, such as gene product. In this example, the last three characters represent the gene symbol, HFE. Currently, the only requirement for assigning a locus name to a record is that it be unique.

1b Sequence Length (12146 bp) - The total number of nucleotide base pairs (or amino acid residues) in the sequence record. Nucleotide sequence length can range from 50 bp to 350 kb.

1c Molecule Type (DNA) - Type of molecule that was sequenced. All sequence data must come from a single molecule type. Some examples of molecule type include genomic DNA, mRNA (cDNA), genomic RNA, and ribosomal RNA.

1d GenBank Division (PRI) - There are 16 different GenBank divisions. In this example, PRI stands for primate sequences. Some other divisions include ROD (rodent sequences), MAM (other mammal sequences), PLN (plant, fungal, and algal sequences), and BCT (bacterial sequences).

1e Modification Date (23-July-1999) - Date of most recent modification made to the record. The date of first public release is not available in the sequence record. This information can be obtained only by contacting NCBI at info@ncbi.nlm.nih.gov.

2 DEFINITION - Brief description of the sequence. The description may include source organism name, gene or protein name, or function of a noncoding sequence (e.g., a promoter region). For sequences containing a coding region (CDS), the definition field may also contain a completeness qualifier such as "complete CDS" or "exon 1," indicating sequence information pertaining only to a gene's first exon.

3 ACCESSION (Z92910) - Unique identifier assigned to a complete sequence record. This number never changes, even if the record is modified. An accession number is a combination of letters and numbers, usually in the format of one letter followed by five digits (e.g., M12345) or two letters followed by six digits (e.g., AC123456).

4 VERSION (Z92910.1) - Identification number assigned to a single, specific sequence in the database. This number is in the format accession.version. If any changes are made to the sequence data, the version part of the number will increase by one. For example, U12345.1 becomes U12345.2. A version number of Z92910.1 for this HFE sequence indicates that the sequence data have not been altered since the original submission.

5 GI (1890179) - Also a sequence identification number. Whenever a sequence is changed, the version number is increased and a new GI is assigned. If a nucleotide sequence record contains a protein translation of the sequence, the translation will have its own GI number.

6 KEYWORDS (haemochromatosis; HFE gene) - A keyword can be any word or phrase used to describe the sequence. Keywords are not based on any controlled vocabulary. Notice that in this record the keyword "haemochromatosis," the British spelling of the term "hemochromatosis," is used. For many records, no keywords are included. A period is placed in this field for records without keywords.

7 SOURCE (human) - Usually contains an abbreviated or common form of the source organism's name.

8 ORGANISM (Homo sapiens) - Source organism's formal scientific name (usually genus and species) and phylogenetic lineage. See the NCBI Taxonomy Homepage for more information about the classification scheme used to construct the organism's lineage.

9 REFERENCE - Citations of publications by sequence authors that support information presented in the sequence record. Several references may be included in one record. References are automatically sorted so that the oldest are always listed first. The last citation in this field provides contact information for the submitter of the sequence record, and the title of this reference contains the words "Direct Submission." Cited publications listed as references are searchable by author, article or publication title, journal title, or MEDLINE unique identifier (UID). The UID links to the reference's MEDLINE record.
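
The ACCESSION and VERSION conventions described in fields 3 and 4 above lend themselves to a short worked example. The Python sketch below assumes only the two "usual" accession formats named in this tutorial (one letter plus five digits, or two letters plus six digits); other formats, such as RefSeq accessions, are deliberately not matched. All function names are hypothetical.

```python
import re

# One letter + five digits (M12345) or two letters + six digits (AC123456).
ACCESSION_RE = re.compile(r"[A-Z]\d{5}|[A-Z]{2}\d{6}")


def looks_like_accession(text):
    """True if text matches one of the two usual accession formats."""
    return ACCESSION_RE.fullmatch(text) is not None


def split_version(identifier):
    """Split 'accession.version' into (accession, version number)."""
    accession, _, version = identifier.partition(".")
    if not version.isdigit():
        raise ValueError(f"no numeric version in {identifier!r}")
    return accession, int(version)


def sequence_was_revised(identifier):
    """A version above 1 means the sequence data changed after submission."""
    return split_version(identifier)[1] > 1


print(split_version("Z92910.1"))
print(sequence_was_revised("Z92910.1"), sequence_was_revised("U12345.2"))
```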

FEATURES

In a sequence record, a list of sequence features follows the references. A feature is simply an annotation that describes a portion of the sequence. An alphabetical list of features can be found in Appendix III: Feature Keys Reference of the DDBJ/EMBL/GenBank Feature Table. Each feature includes a location (sequence interval to which the feature refers) and one or several qualifiers. Clicking on the feature name will open a record for the sequence interval identified in the feature location.

The following features are included in the sample HFE sequence record Z92910:

source - The source feature must be included in each sequence record. The source gives the length of the entire sequence, the scientific name of the source organism, and the Taxon ID number. Other types of information that the submitter may include in this field are chromosome number, map location, and clone or strain identification.

exon - Sequence segment that codes for a portion of spliced mRNA, rRNA, or tRNA. An exon may contain a portion of mRNA's 5' UTR (untranslated region) or 3' UTR, in addition to part of the coding sequence. The name of the gene to which the exon belongs and exon number are provided.

gene - Sequence portion that encodes a specific functional product.

CDS - Sequence of nucleotides that code for amino acids of the protein product (coding sequence). The CDS begins with the start codon's first nucleotide and ends with the stop codon's third nucleotide. This feature includes the coding sequence's amino acid translation and may also contain gene name, gene product function, link to protein sequence record, and cross-references to other database entries.

intron - Segment of noncoding sequence that is transcribed but removed from the transcript by splicing together the exons (sequence portions) on either side of it.

polyA_signal - Identifies the sequence portion required for endonuclease cleavage of an mRNA transcript. Consensus sequence for the polyA signal is AATAAA.
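
Two of the feature definitions above are easy to check by hand or in code: a complete CDS runs from a start codon through a stop codon, and the polyA signal has the consensus AATAAA. The Python sketch below applies both rules to a made-up toy transcript (not HFE data). It assumes the standard genetic code, with ATG as the start codon and TAA, TAG, and TGA as stop codons; the function names are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}   # standard genetic code (assumption)
POLYA_SIGNAL = "AATAAA"               # consensus given in this tutorial


def is_complete_cds(cds):
    """True if cds starts with ATG, ends with a stop codon, and is in frame."""
    cds = cds.upper()
    return len(cds) % 3 == 0 and cds.startswith("ATG") and cds[-3:] in STOP_CODONS


def find_polya_signals(sequence):
    """Return the 1-based start position of every AATAAA in the sequence."""
    sequence = sequence.upper()
    positions = []
    start = sequence.find(POLYA_SIGNAL)
    while start != -1:
        positions.append(start + 1)
        start = sequence.find(POLYA_SIGNAL, start + 1)
    return positions


# Toy transcript: 5' UTR + CDS + 3' UTR containing one polyA signal.
TOY_CDS = "ATGGCCGATTACTGGTAA"
TOY_MRNA = "GGC" + TOY_CDS + "CCTTAATAAAGCAGC"

print(is_complete_cds(TOY_CDS))
print(find_polya_signals(TOY_MRNA))
```

Positions are reported 1-based to match the numbering used in feature locations of sequence records.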


BASE COUNT - Gives the total number of adenine (A), cytosine (C), guanine (G), and thymine (T) bases in the sequence.


ORIGIN - Contains the sequence data, which begins on the line immediately below the field title.
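
The BASE COUNT and ORIGIN fields are closely related, as the following Python sketch shows: it reads sequence lines in the layout GenBank uses under ORIGIN (a position number followed by lowercase blocks of ten bases) and tallies the four bases. The two lines below are made-up toy data, not an excerpt from Z92910, and the function names are hypothetical.

```python
from collections import Counter

# Toy lines in the ORIGIN layout: position number, then blocks of ten bases.
ORIGIN_LINES = [
    "        1 atggccgatt actggtaacc ttaataaagc",
    "       31 agc",
]


def read_origin(lines):
    """Drop position numbers and spaces; return one uppercase sequence."""
    return "".join(ch for line in lines for ch in line if ch.isalpha()).upper()


def base_count(sequence):
    """Count A, C, G, and T, in the order used by the BASE COUNT field."""
    counts = Counter(sequence.upper())
    return {base: counts.get(base, 0) for base in "ACGT"}


sequence = read_origin(ORIGIN_LINES)
counts = base_count(sequence)
print(len(sequence), counts)
```

For an unambiguous sequence, the four counts add up to the sequence length reported in the LOCUS field.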

Now that you are familiar with the fields of sequence records, the following section of this tutorial will describe some of the different options available for displaying sequence data.



Exploring display options for sequence records

The GenBank ID for the record used in this part of the tutorial is Z92910 (clicking on this link will open the GenBank record in a new browser window).

In the left corner near the top of each record in the Entrez Nucleotide database is the Display drop-down box. The Display setting described in the previous section of this tutorial is the default or GenBank display. NCBI also provides several other formats for viewing a sequence record.

- The Summary display will bring up the sequence's Accession number and an abbreviated description. The GI List is another brief format that lists the GI (GenInfo identification) number for the sequence.

- The ASN.1 display will bring up a computer-readable data format known as the Abstract Syntax Notation 1 form. XML (Extensible Markup Language), GBSeqXML, and TinySeqXML are other computer-readable formats.

- The FASTA format consists of a single line of descriptive text called the definition line followed by the sequence characters. The FASTA format of a sequence can be used as input for sequence analysis tools such as NCBI's BLAST.
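
Because FASTA is the format most often passed to analysis tools, a short Python sketch of its layout may help. In a FASTA record the definition line starts with a ">" character and the sequence follows on one or more lines. The definition text and sequence below are made up, and the helper names to_fasta and parse_fasta are hypothetical.

```python
def to_fasta(definition, sequence, width=60):
    """Format one record: '>' definition line, then wrapped sequence lines."""
    chunks = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + definition + "\n" + "\n".join(chunks) + "\n"


def parse_fasta(text):
    """Read one record back into (definition, sequence)."""
    lines = text.strip().splitlines()
    if not lines or not lines[0].startswith(">"):
        raise ValueError("FASTA text must start with a '>' definition line")
    return lines[0][1:], "".join(line.strip() for line in lines[1:])


# Made-up example record (not an NCBI definition line).
TOY_DEFINITION = "toy_seq hypothetical example sequence"
TOY_SEQUENCE = "ACGT" * 20

record = to_fasta(TOY_DEFINITION, TOY_SEQUENCE, width=30)
print(record)
```

Wrapping the sequence at a fixed width is a common convention rather than a requirement; the parser above simply joins whatever sequence lines it finds.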

Sequence Display Options

In addition to default or GenBank format, Graphics is another format type that can help you learn about different features of a particular sequence. To access the graphic display of a sequence, select Graphics from the Display drop-down menu and click the Display button. A screen shot from the graphic display of the human HFE genomic sequence record Z92910 is shown below.

HFE Graphic Display

The thick blue bar at the top represents the entire sequence included in the record. The gray bar below it indicates where the HFE gene begins and ends. Below the legend, the annotated sequence is displayed, 2000 base pairs at a time. Use the blue arrows to move up and down the sequence. Exploring a sequence in the graphic display can help you identify where the coding sequence, exons, introns, and other gene features begin and end.



This concludes our introduction to finding and interpreting sequence records from NCBI databases. For more information about NCBI's sequence databases, see the Entrez Help Document or the Sample GenBank Record.


Acknowledgments

Sources for screenshots used in this tutorial:

LocusLink. National Center for Biotechnology Information. <http://www.ncbi.nlm.nih.gov/LocusLink/> (January 2, 2003).

Entrez Nucleotides. National Center for Biotechnology Information. <http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?dB=Nucleotide> (January 2, 2003).


Continue with other tutorials:

Searching OMIM: Finding information about genes, traits, and disorders

Finding a gene on a chromosome map

Sequence similarity searching using NCBI BLAST

Examining protein structures from the Protein Data Bank


Last Updated: January 2, 2003

For feedback and comments about this site, contact the site designer, Jennifer Bownas of HGMIS.




The online presentation of this poster is a special feature of the U.S. Department of Energy (DOE) Human Genome Project Information Web site. The DOE Biological and Environmental Research program of the Office of Science funds this site.