NCBI Reference Sequence (RefSeq)

RefSeq Frequently Asked Questions

What is a Reference Sequence?
What is the difference between RefSeq and GenBank?
How do I cite a RefSeq record?
How do I access RefSeq records?
What are the distinguishing features of a RefSeq record?
What is the difference between different accession prefixes (such as NM_123456 and XM_123456)?
How can I quickly identify RefSeq records in BLAST or Entrez search results?
What do the RefSeq STATUS codes indicate?
In the NCBI curation-supported pipeline, how is the GenBank 'source' record selected?
RefSeq NM_123456 and GenBank AF123456 appear to be duplicates. Will one be removed?
Why aren't RefSeq records made for all organisms, or for all of the loci available in Entrez Gene?
How are gene symbols and names chosen?
Why is the gene symbol in a RefSeq record sometimes different from the symbol used in related GenBank records?
Also see: Entrez Gene FAQ

What is a Reference Sequence?

The NCBI Reference Sequence project provides sequence data and related information for numerous organisms and provides a baseline for medical, functional, and comparative studies. Whereas GenBank is an archival repository of all sequences, the RefSeq database is a non-redundant set of reference standards that includes chromosomes, complete genomic molecules (organelle genomes, viruses, plasmids), intermediate assembled genomic contigs, curated genomic regions, mRNAs, RNAs, and proteins.

RefSeq records are provided using several processes:

Entrez Genomes processing provides genomic, RNA, and protein records for numerous organisms as data becomes available. This pipeline provides all of the bacterial, viral, organelle, and plasmid RefSeq records and also provides some of the records for larger genomes including plants and fungi.
The NCBI Annotation process, an automated computational method, provides intermediate assembled contigs and some records representing potential transcripts and proteins.
The NCBI curation-supported RefSeq pipeline supplies genomic regions, RNA, and protein sequence records for which a significant level of additional curation and review effort is provided.
Collaboration. Collaborations include those that supply a fully annotated genome, gene family, or single gene. Collaborations with official nomenclature groups and organism-specific database groups are also established.

Curated RefSeq records are made available in several status levels (see Status Key). Reviewed records represent a compilation of our current knowledge of a gene and its transcripts.

What is the difference between RefSeq and GenBank?

The GenBank archival sequence database includes publicly available DNA sequences submitted from individual laboratories and large-scale sequencing projects. GenBank accession numbers are assigned to these submitted sequences. Submitted sequence data is exchanged between NCBI's GenBank, the EMBL Data Library (EMBL), and the DNA Data Bank of Japan (DDBJ) to achieve comprehensive worldwide coverage. As an archival database, GenBank can be very redundant for some loci. GenBank sequence records are owned by the original submitter and cannot be altered by a third party.

RefSeq sequences are derived from GenBank and provide non-redundant curated data representing our current knowledge of known genes. Some records include additional sequence information that was never submitted to an archival database but is available in the literature. Some sequence records are provided through collaboration; the underlying primary sequence data is available in GenBank, but may not be available in any one GenBank record. RefSeq sequences are not submitted primary sequences. RefSeq records are owned by NCBI and therefore can be updated as needed to maintain current annotation or to incorporate additional sequence information.

How do I cite a RefSeq record?

It is appropriate to cite the RefSeq accession number and the RefSeq Handbook chapter.

Ideally any accession cited should indicate both the accession and version number. For example:

NCBI Accession NM_000001.1
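As a minimal sketch of the accession.version convention shown above, the following Python splits an identifier into its two parts and refuses to format a citation that lacks a version. The function names are illustrative only and are not part of any NCBI tool.

```python
from typing import Optional, Tuple


def split_accession_version(identifier: str) -> Tuple[str, Optional[int]]:
    """Split 'NM_000001.1' into ('NM_000001', 1).

    Returns (accession, None) when no version suffix is present.
    """
    accession, dot, version = identifier.strip().partition(".")
    if not dot:
        return accession, None
    return accession, int(version)


def format_citation(identifier: str) -> str:
    """Build the citation string, insisting on an explicit version number."""
    accession, version = split_accession_version(identifier)
    if version is None:
        raise ValueError(f"{identifier!r} has no version number")
    return f"NCBI Accession {accession}.{version}"


print(format_citation("NM_000001.1"))
```

Requiring the version makes the citation point at one specific sequence rather than at whatever the record currently contains.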

Please cite the NCBI Handbook as follows:

To cite the whole Handbook:

The NCBI handbook [Internet]. Bethesda (MD): National Library of Medicine (US), National Center for Biotechnology Information; 2002 Oct. Available from http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Books
To cite the RefSeq chapter (Chapter 18):

The NCBI handbook [Internet]. Bethesda (MD): National Library of Medicine (US), National Center for Biotechnology Information; 2002 Oct. Chapter 18, The Reference Sequence (RefSeq) Project. Available from http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Books

For your reference, one web site that lists a number of Citation Guides for Electronic Documents is: http://www.ifla.org/I/training/citation/citing.htm.

How do I access RefSeq records?

RefSeq records can be accessed through several NCBI resources including:

BLAST: Transcript, protein, and 'genomic region' (NG accessions) records are in the nucleotide and protein non-redundant databases (nr).
BLAST against larger genomic records is provided via organism-specific BLAST pages.

Entrez Gene: Entrez Gene reports provide links to all categories of RefSeq records. The Gene database can be queried with a RefSeq accession number in addition to text terms; see the Gene Help documentation for detailed query tips.

Entrez Genomes Division: Records representing completed genomes and chromosomes are presented on the Genomes pages.

FTP: Nucleotide and protein records provided by the Entrez Genomes and RefSeq processes are available from the /refseq directory; nucleotide and protein records provided by the Genome Annotation Pipeline are available in the /genomes/ directory.

Map Viewer: The NCBI Map Viewer includes links to RefSeq records when the annotated genome assembly information is available.

Sequence databases: RefSeq records are included in the Entrez nucleotide and protein databases. See the Entrez Query Hints for hints on formatting your query to retrieve RefSeq records.
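For scripted access to the sequence databases, a query can be restricted to RefSeq records. The sketch below only builds a search URL and makes no network request. It assumes the E-utilities `esearch` endpoint and the `srcdb_refseq[prop]` property filter, neither of which is described in this FAQ, so check both against the Entrez Query Hints and the current E-utilities documentation before relying on them.

```python
from urllib.parse import urlencode

# Assumed endpoint: the E-utilities esearch service (not named in this FAQ).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def refseq_search_url(text_terms: str, db: str = "nucleotide") -> str:
    """Return an esearch URL that limits text_terms to RefSeq records.

    'srcdb_refseq[prop]' is assumed to be the Entrez filter for RefSeq.
    """
    term = f"({text_terms}) AND srcdb_refseq[prop]"
    return ESEARCH_URL + "?" + urlencode({"db": db, "term": term})


print(refseq_search_url("alpha-2-macroglobulin"))
```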

What are the distinguishing features of a RefSeq record?

RefSeq records are distinguished from other GenBank records by:

a distinct accession number (two characters followed by an underscore)
a COMMENT that includes the term REFSEQ and identifies the record status, GenBank source accessions, and collaborating group
consistent use of official nomenclature, when available
inclusion of dbxrefs on the gene feature to link to other sources of information such as OMIM, Entrez Gene, and model-organism databases
protein records indicate REFSEQ as the DBSOURCE

Please see the description of RefSeq accession prefixes.
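The first feature in the list, the accession format, is easy to test mechanically. A minimal sketch, assuming that "two characters" means two uppercase letters and checking nothing beyond the prefix:

```python
import re

# Two uppercase letters followed by an underscore, e.g. "NM_" or "XP_".
# Assumption: the FAQ's "two characters" are uppercase letters.
REFSEQ_PREFIX = re.compile(r"^[A-Z]{2}_")


def looks_like_refseq(accession: str) -> bool:
    """Return True if the accession starts with a RefSeq-style prefix."""
    return REFSEQ_PREFIX.match(accession.strip()) is not None


for acc in ("NM_123456", "AF123456"):
    print(acc, looks_like_refseq(acc))
```

This is a format check only; it does not confirm that a record with that accession exists.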

What is the difference between RefSeq accession prefixes?

The RefSeq accession prefix indicates the molecule type and method used to supply the record. Please see the description of RefSeq accession prefixes.

NCBI accession numbers that begin with the prefix XM_ (mRNA), XR_ (non-coding transcript), or XP_ (protein) are model reference sequences produced by NCBI's Genome Annotation project. These records represent the transcripts and proteins that are annotated on the NCBI contigs, and they may differ from GenBank mRNA submissions and from the curated RefSeq records (with NM_, NR_, and NP_ accession prefixes). These differences may reflect real sequence variation (polymorphism) or errors (or gaps) in the available genomic sequence. Model RefSeq records should be used with caution, after comparing them to other available sequence information (check the evidence viewer, BLink, Gene, or sequence neighbors).
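The prefixes named in this answer can be captured in a small lookup table. This is an illustrative sketch, not the full list of RefSeq prefixes: the XM_, XR_, and XP_ entries follow the paragraph above, while the molecule types given for NM_, NR_, and NP_ are assumed by analogy.

```python
from typing import Optional, Tuple

# Prefix -> (molecule type, how the record is supplied).
# Only the six prefixes mentioned in this answer are included.
PREFIXES = {
    "NM_": ("mRNA", "curated"),
    "NR_": ("non-coding transcript", "curated"),
    "NP_": ("protein", "curated"),
    "XM_": ("mRNA", "model"),
    "XR_": ("non-coding transcript", "model"),
    "XP_": ("protein", "model"),
}


def describe_accession(accession: str) -> Optional[Tuple[str, str]]:
    """Return (molecule type, record class), or None for an unknown prefix."""
    return PREFIXES.get(accession.strip()[:3].upper())


def needs_caution(accession: str) -> bool:
    """True for model records, which should be compared to other evidence."""
    info = describe_accession(accession)
    return info is not None and info[1] == "model"


print(describe_accession("XM_123456"), needs_caution("XM_123456"))
```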

How can I quickly identify RefSeq records in BLAST or Entrez search results?

Entrez and BLAST results both present the following formatted text as part of the returned result:

gi|4557284|ref|NM_000646.1|[4557284]

Data Element	Comment
gi	"GenBank Identifier", or sequence ID number. "gi|" denotes that the number that follows is a unique sequence ID. Any change to the sequence data will result in a new gi number.
4557284	The gi number.
ref	Indicates that RefSeq is the source database.
NM_000646.1	The RefSeq accession and version number.
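The data elements in the table can be pulled apart by splitting on the pipe character. A minimal sketch, assuming the identifier always starts with the four fields shown in the example; the function name and the returned keys are illustrative:

```python
def parse_identifier(text: str) -> dict:
    """Parse an identifier such as 'gi|4557284|ref|NM_000646.1|[4557284]'."""
    fields = text.strip().split("|")
    if len(fields) < 4 or fields[0] != "gi":
        raise ValueError(f"unexpected identifier format: {text!r}")
    accession, _, version = fields[3].partition(".")
    return {
        "gi": int(fields[1]),               # the gi number
        "source": fields[2],                # "ref" means RefSeq
        "accession": accession,
        "version": int(version) if version else None,
        "is_refseq": fields[2] == "ref",
    }


print(parse_identifier("gi|4557284|ref|NM_000646.1|[4557284]"))
```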

Identifying NM_ and NP_ RefSeq records in BLAST results:

The distinct format of RefSeq accession numbers (they include an underscore) provides a quick indication that a BLAST result includes a RefSeq record.

                                                                Score     E
Sequences producing significant alignments:                     (bits)  Value

gi|6226959|ref|NM_000014.3|  Homo sapiens alpha-2-macroglobu...  9073  0.0
            ^     ^
            |     |
            |     RefSeq accession numbers have a distinct format
            |
          "ref" indicates RefSeq database

What do the RefSeq STATUS codes indicate?

RefSeq records are made available in several status levels.

Reviewed records represent a compilation of our current knowledge of a gene and its transcripts. These records have been reviewed by experts, either NCBI staff or collaborating groups, to create a sequence record that is analogous to a 'review article'.

Some enhancements to the reviewed record might include:

addition of sequence data (e.g., to extend UTRs)
removal of sequence data (e.g., vector or linker sequence)
addition of publications of general relevance to the gene
addition of nucleotide and protein features
addition of summary text describing gene function

When a record is reviewed, sequence data from more than one record may be merged together, as deemed appropriate, to construct a more complete mRNA record. The review process includes reading the primary literature to cross-check accuracy and determine whether additional data is available concerning the extent of the UTR, alternate splicing, or function. Transcript variant records are only made when there is information available on the full-length nature of the product; in other words, should multiple alternate exons be found throughout the length of the gene, no assumption is made about what combinations of alternate exons exist in vivo. As a result, the RefSeq collection under-represents alternate splice products.

For the NCBI curation-supported RefSeq pipeline, the review process includes analysis of all sequences that represent the gene at hand (at that time). This analysis results in expanding the list of accessions that represent the gene under investigation, identifying (and correcting) errors in the accession-to-gene associations, and identifying accessions with significant errors (for example, chimeric sequences). The curated list of accessions that represent a gene is available in Entrez Gene. This list is not intended to be fully comprehensive; additional 'related' sequence information will always be available in the Entrez 'related sequences' and BLink reports and in BLAST search results.

For examples of reviewed RefSeq records, see the following entries:

Gene Symbol	GeneID	Comments
PAX2	5076	Example of splice variant treatment. We make RefSeq records for splice variants for which the full length nature of the transcript is well documented and supported by experimental evidence. There is a greater emphasis on providing RefSeq records for cases where some of the transcript variation results in altered coding regions.
MICA	4276	Note several references included; the record is analogous to a 'review article'. A single article is annotated in the Reference field of the source GenBank record.
GCKR	2646	Note the last line of the Comment field on the RefSeq record provides a 'completeness' indicator. If we determine during the review process that the 5' and/or 3' end of the mRNA is complete, then this information is provided on the RefSeq record.

In the NCBI curation-supported pipeline, how is the GenBank source sequence initially selected?

There are several factors used in selecting the source GenBank sequence that is first used to generate the PROVISIONAL mRNA RefSeq record, but quite often the record used is selected primarily because it includes more complete UTR sequence data.

Reference sequence records are not intended to represent the historical 'first sequenced' record (although for genes with very limited available sequence data they may at times do so). PROVISIONAL records may be updated before being fully reviewed to use a longer GenBank source sequence that becomes available. While the PROVISIONAL RefSeq records do represent a single GenBank source sequence, the REVIEWED RefSeq records are intended to represent the current state of knowledge as provided by the whole research community rather than by any one laboratory.

RefSeq NM_123456 and GenBank AF123456 appear to be duplicates. Will one be removed?

No, both records will continue to be available. RefSeq and GenBank are separate databases, and both databases are available in the Entrez nucleotides data set.

Provisional RefSeq records are usually quite similar to the source GenBank records from which they were drawn. However, when RefSeq records are reviewed by experts, additional sequence data, biological annotations, and references are often added. At that time, the original source GenBank record(s) and the corresponding RefSeq entry can be quite different -- the RefSeq entry can represent a combination of information from various labs, which are credited in the Comments and/or References field of the record.

The RefSeq database is designed to reduce duplication by selecting one representative sequence for each human locus, whereas GenBank is a repository of sequences that might contain numerous records for any given gene. The only duplicates in the RefSeq database will represent naturally occurring paralogs and splice variants.

Why aren't RefSeq records made for all organisms, or for all of the loci available in Entrez Gene?

RefSeq records are provided for identified complete genomes and for identified genome sequencing projects as collaborations are established or as the sequence data becomes available. For the NCBI curation-supported pipeline they are made under the following conditions:

The locus in question represents a functional gene that encodes a protein or a structural or other RNA product. In addition, RNA and genomic RefSeq records are provided to represent identified pseudogenes. RefSeq records are not provided for those Gene records that represent a chromosomal region rather than a gene.
At least one representative accession number has been identified for a given locus. The starting point can be either an mRNA or genomic sequence record.
For protein-coding genes, the identified sequence has a full length coding region annotated.

RefSeq records are not made for loci where only partial coding region sequence data is available (as annotated on the GenBank source record). In addition, there are some loci for which we have not yet identified an appropriate representative GenBank accession number.
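The conditions above can be read as a simple eligibility check. The sketch below is a hypothetical model for illustration only: the `Locus` class, its field names, and the `kind` labels are invented here and do not correspond to any NCBI data structure or to how the pipeline is actually implemented.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical locus kinds that can receive a RefSeq record.
GENE_KINDS = {"protein_coding", "rna", "pseudogene"}


@dataclass
class Locus:
    kind: str                            # e.g. "protein_coding", "rna", "pseudogene", "region"
    representative_accessions: List[str]
    has_full_length_cds: bool = False    # as annotated on the GenBank source record


def eligible_for_refseq(locus: Locus) -> bool:
    """Apply the three conditions listed above, in order."""
    if locus.kind not in GENE_KINDS:
        return False                     # chromosomal regions are excluded
    if not locus.representative_accessions:
        return False                     # no representative accession identified yet
    if locus.kind == "protein_coding" and not locus.has_full_length_cds:
        return False                     # only partial coding sequence is available
    return True


print(eligible_for_refseq(Locus("protein_coding", ["AF123456"], True)))
```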

We welcome comments from the research community that provide as yet unidentified representative accession numbers for loci lacking RefSeq data. We also welcome corrections to errors, or additional biological information. Please send comments using the RefSeq and Gene update form, or contact the NCBI Service Desk; be as specific as possible, and cite the RefSeq accession, GeneID, and any relevant published citations.

Why is the gene symbol in a RefSeq record sometimes different from the symbol used in related GenBank records?

RefSeq records use the terminology defined by the original GenBank submission, by a collaborating group, or by an official nomenclature group for the organism. For example, the human RefSeq collection uses gene symbols and names supplied by the HUGO Gene Nomenclature Committee (HGNC); records may also include alternate symbols and names, when available.

GenBank is an archive of publicly available sequence records that were submitted by the originators of the data. Submitters of GenBank records maintain editorial control over their records and decide which gene symbol to use. Some authors consult with the nomenclature committees associated with the organism from which they sequenced a gene to obtain an official gene symbol for that organism. Other authors might not do that, or might not update their submitted records if the official name changes. Therefore, it is possible that GenBank records for a given gene will use different gene symbols.

Last updated October 10, 2007