Annotation Examples

mRNA sequence
Prokaryotic gene
Eukaryotic gene
rRNA and/or ITS
Promoter region
Viral sequence
HIV-1
Influenza viruses
Transposon or insertion sequence
Microsatellite sequence
Repeat regions
Pseudogene
Translocation and/or fusion protein
Cloning vector
Gapped sequence
Phylogenetic or population set
EST submissions
GSS submissions
STS submissions
HTGS submissions
FLICs submissions

mRNA sequence

Relevant feature information for an mRNA (cDNA) sequence encoding a protein:

coding region intervals, including start and stop codons
protein name
gene name, if available
amino acid sequence, if available

We strongly suggest that you provide as much of the above information as possible to ensure the most complete annotation of your sequence. If any of this information is not known, please inform us.

Example:

Homo sapiens prolidase (PEPD) mRNA, complete cds.

                      source          1..1888
                                      /organism="Homo sapiens"
                                      /chromosome="19"
                                      /map="19q12-q13.2"
                                      /cell_type="fibroblasts"
                                                 
                      gene            1..1888
                                      /gene="PEPD"
                                                 
                      CDS             17..1498
                                      /gene="PEPD"
                                      /EC_number="3.4.13.9"
                                      /note="imidodipeptidase"
                                      /product="prolidase"
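
A complete coding region that includes its start and stop codons should span a whole number of codons. The Python sketch below applies that check to the CDS interval in the PEPD example. The function name `cds_length` is hypothetical and is not part of any NCBI tool.

```python
def cds_length(start: int, end: int) -> int:
    """Length of a 1-based, inclusive feature interval such as 17..1498."""
    if start < 1 or end < start:
        raise ValueError(f"invalid interval {start}..{end}")
    return end - start + 1


# Interval taken from the CDS feature in the PEPD example.
length = cds_length(17, 1498)
codons = length // 3

print(length)           # number of nucleotides in the CDS
print(length % 3 == 0)  # True when the CDS is a whole number of codons
print(codons - 1)       # amino acids encoded, excluding the stop codon
```

A length that is not a multiple of three usually means that an interval is off by one or that the coding region is partial.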

Prokaryotic gene

Relevant feature information for a prokaryotic genomic sequence encoding a protein:

coding region intervals, including start and stop codons, if present
protein name
gene name, if known
amino acid sequence, if known

We strongly suggest that you provide as much of the above information as possible to ensure the most complete annotation of your sequence. If any of this information is not known, please inform us.

Example:

Escherichia coli RecA protein (recA) gene, complete cds.

                 source          1..3300
                                 /organism="Escherichia coli"
                                 /strain="K-12"
                                 
                 gene            783..1961
                                 /gene="recA"
                                 
                 CDS             783..1961
                                 /gene="recA"
                                 /function="DNA repair protein"
                                 /product="RecA protein"

Eukaryotic gene

Relevant feature information for a eukaryotic genomic sequence encoding a protein:

coding region intervals, including start and stop codons, if present, and all exon intervals
protein name
gene name, if known
amino acid sequence, if known

We strongly suggest that you provide as much of the above information as possible to ensure the most complete annotation of your sequence. If any of this information is not known, please inform us.

Example:

Caenorhabditis elegans tyrosine kinase PTK-2 (ptk-2) gene, complete cds.

                 source          1..3180
                                 /organism="Caenorhabditis elegans"

                 gene            211..3011
                                 /gene="ptk-2"
                                 
                 mRNA            join(211..288,533..703,763..890,940..1024,
                                 1084..1380,1838..1962,2018..2099,2301..3011)
                                 /gene="ptk-2"
                                 /product="protein kinase PTK-2"
                                 
                 CDS             join(250..288,533..703,763..890,940..1024,
                                 1084..1380,1838..1962,2018..2099,2301..2456)
                                 /gene="ptk-2"
                                 /product="protein kinase PTK-2"
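
The exon intervals in a `join(...)` location can be checked mechanically before submission. The Python sketch below parses the mRNA and CDS locations from the ptk-2 example, sums the CDS length, and confirms that every CDS segment lies inside an mRNA exon. The helper names are hypothetical, and the parser handles only plain intervals, not partial markers or `complement()`.

```python
import re


def parse_join(location: str) -> list[tuple[int, int]]:
    """Parse 'join(a..b,c..d,...)' or 'a..b' into (start, end) pairs.

    Only plain 1-based intervals are supported; '<', '>' and
    complement() are deliberately out of scope for this sketch.
    """
    text = "".join(location.split())  # drop spaces and line breaks
    match = re.fullmatch(r"join\((.*)\)", text)
    body = match.group(1) if match else text
    intervals = []
    for part in body.split(","):
        start, end = part.split("..")
        intervals.append((int(start), int(end)))
    return intervals


def total_length(intervals: list[tuple[int, int]]) -> int:
    """Sum of the inclusive lengths of all intervals."""
    return sum(end - start + 1 for start, end in intervals)


# Locations copied from the ptk-2 example.
mrna = parse_join("join(211..288,533..703,763..890,940..1024,"
                  "1084..1380,1838..1962,2018..2099,2301..3011)")
cds = parse_join("join(250..288,533..703,763..890,940..1024,"
                 "1084..1380,1838..1962,2018..2099,2301..2456)")

# Every CDS segment should fall inside some mRNA exon.
cds_inside_mrna = all(
    any(m_start <= c_start and c_end <= m_end for m_start, m_end in mrna)
    for c_start, c_end in cds
)

print(total_length(cds), total_length(cds) % 3 == 0, cds_inside_mrna)
```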

rRNA and/or ITS

Relevant feature information for a genomic sequence containing structural RNAs and/or spacers:

names of any structural RNAs (eg, tRNA-Ile, 16S ribosomal RNA)
names of any spacer regions (eg, internal transcribed spacer 1, 16S/23S intergenic spacer)
nucleotide spans of each of the above features, if known

We strongly suggest that you provide as much of the above information as possible to ensure the most complete annotation of your sequence. If any of this information is not known, please inform us.

Example:

Saccharomyces cerevisiae 18S ribosomal RNA gene, partial sequence; internal transcribed spacer 1, 5.8S ribosomal RNA gene and internal transcribed spacer 2, complete sequence; and 28S ribosomal RNA gene, partial sequence.

                 source          1..540
                                 /organism="Saccharomyces cerevisiae"
                                 /strain="UMD 334"

                 rRNA            <1..5
                                 /product="18S ribosomal RNA"
                                 
                 misc_RNA        6..178
                                 /product="internal transcribed spacer 1"
                                 
                 rRNA            179..377
                                 /product="5.8S ribosomal RNA"
                                 
                 misc_RNA        378..519
                                 /product="internal transcribed spacer 2"
                                 
                 rRNA            520..>540
                                 /product="28S ribosomal RNA"
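
In the example above, the rRNA and spacer features cover the sequence end to end with no gaps or overlaps. The Python sketch below shows one way to check that property for a list of spans. The function name and the tuple layout are assumptions made for illustration, and not every record is expected to be fully tiled by its features.

```python
def find_tiling_problems(features, seq_length):
    """Return messages describing gaps or overlaps between features.

    `features` is a list of (name, start, end) tuples using 1-based,
    inclusive coordinates with partial markers already stripped.
    """
    problems = []
    expected_start = 1
    for name, start, end in sorted(features, key=lambda f: f[1]):
        if start > expected_start:
            problems.append(f"gap before {name}: {expected_start}..{start - 1}")
        elif start < expected_start:
            problems.append(f"{name} overlaps the previous feature")
        expected_start = max(expected_start, end + 1)
    if expected_start <= seq_length:
        problems.append(f"unannotated tail: {expected_start}..{seq_length}")
    return problems


# Spans copied from the Saccharomyces cerevisiae example.
features = [
    ("18S ribosomal RNA", 1, 5),
    ("internal transcribed spacer 1", 6, 178),
    ("5.8S ribosomal RNA", 179, 377),
    ("internal transcribed spacer 2", 378, 519),
    ("28S ribosomal RNA", 520, 540),
]

print(find_tiling_problems(features, 540))  # an empty list means no problems
```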

Promoter region

Relevant feature information for promoter, genomic 5' flanking sequence, or genomic 3' flanking sequence:

protein or gene name for the sequence to which the promoter or flanking region belongs
intervals of any transcribed regions or coding regions, if present on the sequence

We strongly suggest that you provide as much of the above information as possible to ensure the most complete annotation of your sequence. If any of this information is not known, please inform us.

Example:

Homo sapiens enhancer-binding protein 2 (EBP2) gene, promoter region and partial cds.
  
              source          1..3061
                              /organism="Homo sapiens"
                              /chromosome="15"
                              /map="15q13"
                              /cell_line="H441"
                              /tissue_type="lung"
                             
              gene            1..>3061
                              /gene="EBP2"
                              
              promoter        1..2947
                              /gene="EBP2"
                              
              TATA_signal     2918..2923
                              /gene="EBP2"
                             
              mRNA            2948..>3061
                              /gene="EBP2"
                              /product="enhancer-binding protein 2"
                             
              5'UTR           2948..3010
                              /gene="EBP2"
                             
              CDS             3011..>3061
                              /gene="EBP2"
                              /product="enhancer-binding protein 2"

Viral sequence

Relevant feature information for a viral sequence:

include strain, serotype, host, country, and collection_date when known
coding region intervals, including start and stop codons, if present
protein name
gene name, if known
amino acid sequence, if known

if no coding region is present, other description of the sequence

We strongly suggest that you provide as much of the above information as possible to ensure the most complete annotation of your sequence. If any of this information is not known, please inform us.

Example:

Human adenovirus 3 strain RKI-4263/07 hexon (H) gene, partial cds.  

           source          1..1520
                           /organism="Human adenovirus 3"
                           /mol_type="genomic DNA"
                           /strain="RKI-4263/07"
                           /serotype="3"
                           /host="Homo sapiens"
                           /db_xref="taxon:45659"
                           /country="Germany"
                           /collection_date="Apr-2007"

           gene            <1..>1520
                           /gene="H"
                           
           CDS             <1..>1520
                           /note="major capsid protein"
                           /codon_start=1
                           /product="hexon"
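
The `<` and `>` symbols in the location above mark a feature that extends beyond the start or end of the submitted sequence, which is why the record is described as a partial cds. The Python sketch below reads those markers from a simple location string. The names are hypothetical, and the parser covers only single intervals, not `join()` or `complement()`.

```python
import re

# Optional '<' before the start and optional '>' before the end.
LOCATION = re.compile(r"(<?)(\d+)\.\.(>?)(\d+)")


def parse_partial(location: str) -> dict:
    """Split a location such as '<1..>1520' into coordinates and flags."""
    match = LOCATION.fullmatch(location.strip())
    if match is None:
        raise ValueError(f"unsupported location: {location!r}")
    return {
        "start": int(match.group(2)),
        "end": int(match.group(4)),
        "partial_start": match.group(1) == "<",
        "partial_end": match.group(3) == ">",
    }


def completeness(location: str) -> str:
    """Return 'partial' if either end is marked partial, else 'complete'."""
    parsed = parse_partial(location)
    if parsed["partial_start"] or parsed["partial_end"]:
        return "partial"
    return "complete"


print(parse_partial("<1..>1520"))   # hexon CDS from the example above
print(completeness("<1..>1520"))
print(completeness("783..1961"))    # recA CDS from the prokaryotic example
```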

HIV-1

Relevant feature information for an HIV-1 sequence:

name of the country from which the virus was isolated
clone and isolate information

coding region intervals, including start and stop codons, if present
protein names
gene names, if known
amino acid sequences, if known

if no coding region is present, other description of the sequence

We strongly suggest that you provide as much of the above information as possible to ensure the most complete annotation of your sequence. If any of this information is not known, please inform us.

Example:

HIV-1 isolate X clone 5601 from USA, complete genome.

               source          1..9720
                               /organism="Human immunodeficiency virus type 1"
                               /clone="5601"
                               /isolate="X"
                               /country="USA"

               LTR             1..634

               gene            789..2291
                               /gene="gag"

               CDS             789..2291
                               /gene="gag"
                               /product="gag protein"

               gene            2084..5095
                               /gene="pol"
                               
               CDS             2084..5095
                               /gene="pol"
                               /product="pol protein"

               gene            5040..5618
                               /gene="vif"
                               
               CDS             5040..5618
                               /gene="vif"
                               /product="vif protein"

               gene            5558..5848
                               /gene="vpr"

               CDS             5558..5848
                               /gene="vpr"
                               /product="vpr protein"

               gene            5829..8476
                               /gene="tat"

               CDS             join(5829..6043,8386..8476)
                               /gene="tat"
                               /product="tat protein"

               gene            5968..8660
                               /gene="rev"

               CDS             join(5968..6043,8386..8660)
                               /gene="rev"
                               /product="rev protein"

               gene            6060..6305
                               /gene="vpu"
                               
               CDS             6060..6305
                               /gene="vpu"
                               /product="vpu protein"

               gene            6223..8802
                               /gene="env"
                               /pseudo
                               
               gene            8804..9070
                               /gene="nef"
                               
               CDS             8804..9070
                               /gene="nef"
                               /product="nef protein"

               LTR             9086..9719
               
               polyA_signal    9612..9617

Influenza viruses

Relevant feature information for Influenza sequences:

properly formatted strain identifier. Example: A/chicken/India/1234/2010
name of the country from which the virus was isolated
collection date, including month and day if known
serotype for Influenza A viruses
host

coding region intervals, including start and stop codons and exons, if present
protein names
gene names

For Influenza A and B submissions, use the Influenza Virus Resource Annotation webtool to create a feature table: http://www.ncbi.nlm.nih.gov/genomes/FLU/Database/annotation.cgi

Example:

Influenza A virus (A/Wisconsin/28/2011(H1N1)) segment 8 nuclear export protein (NEP) and nonstructural protein 1 (NS1) genes, complete cds.

             source          1..864
                             /organism="Influenza A virus (A/Wisconsin/28/2011(H1N1))"
                             /mol_type="viral cRNA"
                             /strain="A/Wisconsin/28/2011"
                             /serotype="H1N1"
                             /host="Homo sapiens"
                             /segment="8"
                             /country="USA"
                             /collection_date="01-Dec-2011"
                             /note="C1 passage(s)"

             gene            1..838
                             /gene="NEP"                
                             /gene_synonym="NS2"
                           
             CDS             join(1..30,503..838)
                             /gene="NEP"
                             /note="nonstructural protein 2"
                             /product="nuclear export protein"

             gene            1..660
                             /gene="NS1"                
                           
             CDS             1..660
                             /gene="NS1"
                             /product="nonstructural protein 1"

Transposon or insertion sequence

Relevant feature information for transposons or insertion sequences:

specific name of the transposon or IS, if available
nucleotide spans corresponding to the transposon/IS

Optional:

name and nucleotide intervals of any host gene/product disrupted by the transposon/IS
name and nucleotide intervals of any gene/product in the transposon/IS (eg, transposase)
nucleotide spans of any other features (LTRs, repeat regions)

We strongly suggest that you provide as much of the above information as possible to ensure the most complete annotation of your sequence. If any of this information is not known, please inform us.

Example:

Bacillus subtilis strain RS2 transposon BLT transposase (tnpA) gene, complete cds.

             source          1..1221
                             /organism="Bacillus subtilis"
                             /strain="RS2"

             repeat_region   21..1127
                             /rpt_type="dispersed"
                             /mobile_element="transposon: BLT"

             repeat_region   21..61
                             /rpt_type=inverted
                           
             gene            128..1034
                             /gene="tnpA"                
                           
             CDS             128..1034
                             /gene="tnpA"
                             /product="transposase"
                           
             repeat_region   1085..1127
                             /rpt_type=inverted

Microsatellite sequence

Relevant feature information for a microsatellite sequence:

unique microsatellite/clone name for each sequence
interval of any repeat region(s) within the microsatellite sequence, if known
whether these are considered STS sequences

We strongly suggest that you provide as much of the above information as possible to ensure the most complete annotation of your sequence. If any of this information is not known, please inform us.

Example #1:

Chorthippus parallelus clone IIB-G5 microsatellite sequence.

             source          1..288
                             /organism="Chorthippus parallelus"
                             /mol_type="genomic DNA"
                             /db_xref="taxon:37639"
                             /clone="IIB-G5"

             repeat_region   1..288
                             /rpt_type=tandem
                             /satellite="microsatellite"

Example #2:

Noturus exilis voucher KU 40271 microsatellite Noex254 sequence.

             source          1..556
                             /organism="Noturus exilis"
                             /mol_type="genomic DNA"
                             /specimen_voucher="KU 40271"
                             /db_xref="taxon:61323"
                             /clone="Noex_02_03_H06"
                             /PCR_primers="fwd_seq: catgtttgcacaaagggaaa, rev_seq:
                             atgtggatgcagattgtgga"

             repeat_region   77..100
                             /rpt_type=tandem
                             /rpt_unit_range=77..100
                             /rpt_unit_seq="ca"
                             /satellite="microsatellite:Noex254"
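
In Example #2 the repeat region spans 77..100 and the repeat unit is "ca", so the region should hold a whole number of copies of that unit. The Python sketch below computes the copy count from the interval and, when the sequence is available, confirms that the region is a perfect repeat. The function names and the toy sequence are made up for illustration.

```python
def repeat_copies(start: int, end: int, unit: str) -> tuple[int, int]:
    """Return (full copies, leftover bases) for a 1-based inclusive span."""
    span = end - start + 1
    return divmod(span, len(unit))


def is_perfect_repeat(sequence: str, start: int, end: int, unit: str) -> bool:
    """True if sequence[start..end] is exactly the unit repeated end to end."""
    copies, leftover = repeat_copies(start, end, unit)
    region = sequence[start - 1:end].lower()
    return leftover == 0 and region == unit.lower() * copies


# Interval and unit taken from the Noex254 example.
print(repeat_copies(77, 100, "ca"))

# Toy sequence (not real data): 4 flanking bases, 12 'ca' units, 3 flanking bases.
toy = "acgt" + "ca" * 12 + "ttg"
print(is_perfect_repeat(toy, 5, 28, "ca"))
```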

Repeat regions

Relevant feature information for sequences containing repeat regions:

repeat region intervals
repeat family, if known (eg, Alu, Mer)
repeat type (tandem, inverted, flanking, terminal, direct, dispersed, or other)
repeat unit description/intervals, if region contains more than one repeat

We strongly suggest that you provide as much of the above information as possible to ensure the most complete annotation of your sequence. If any of this information is not known, please inform us.

Example:

Homo sapiens repeat regions

             source          1..2050
                             /organism="Homo sapiens"
                             /chromosome="6"
                             /map="6q25"
                             
             repeat_region   8..126
                             /rpt_type=dispersed
                             /rpt_family="B2" 
                                               
             repeat_region   197..344
                             /rpt_type="direct"
                             /rpt_unit="197..220"
                                                  
             repeat_region   389..673
                             /rpt_family="AluSx"
                             /rpt_type=dispersed
                             
             repeat_region   847..876
                             /rpt_type="tandem"
                             /rpt_unit="ca"
                             /satellite="microsatellite:BT21"

Pseudogene

Relevant feature information for a pseudogene sequence:

gene intervals
gene name

We strongly suggest that you provide as much of the above information as possible to ensure the most complete annotation of your sequence. If any of this information is not known, please inform us.

Example:

Mus musculus DNA methyltransferase (Dmt1) pseudogene, complete sequence.

             source          1..2131
                             /organism="Mus musculus"
                             /strain="SvJ/129"
                             
             gene            123..1444
                             /gene="Dmt1"
                             /note="DNA methyltransferase 1"
                             /pseudo

Translocation and/or fusion protein

Relevant feature information for a sequence resulting from a chromosomal translocation:

nucleotide location of the translocation breakpoint, if known
map information for the translocation breakpoint (e.g., t(18;X)(q11.2;p11.2))

if the translocation results in a fusion protein, please include:

coding region intervals, including start and stop codons, if present
protein name
amino acid sequence, if known

We strongly suggest that you provide as much of the above information as possible to ensure the most complete annotation of your sequence. If any of this information is not known, please inform us.

Example:

Homo sapiens SYT/SSX4 fusion protein mRNA, complete cds.

             source          1..2935
                             /organism="Homo sapiens"
                             /tissue_type="sarcoma"
                             /map="t(18;X)(q11.2;p11.2)"

             source          1..1242
                             /organism="Homo sapiens"
                             /chromosome="18"
                             /map="18q11.2"

             CDS             1..1479
                             /product="SYT/SSX4 fusion protein"

             source          1243..2935
                             /organism="Homo sapiens"
                             /chromosome="X"
                             /map="Xp11.2"

             3'UTR           1480..2935

Cloning vector

Relevant feature information for a cloning vector:

unique name for the cloning vector

Optional:

coding region intervals, including start and stop codons
protein names, gene names

We strongly suggest that you provide as much of the above information as possible to ensure the most complete annotation of your sequence. If any of this information is not known, please inform us.

Example:

Cloning vector pRB223, complete sequence.

             source          1..4361
                             /organism="Cloning vector pRB223"
                             
             gene            86..1276
                             /gene="tet"
                             
             CDS             86..1276
                             /gene="tet"
                             /product="tetracycline resistance protein"

             RBS             1905..1909
                             /note="Shine-Dalgarno sequence"
                             
             rep_origin      2535
                             
             gene            complement(3293..4194)
                             /gene="bla"
                             
             CDS             complement(3293..4153)
                             /gene="bla"
                             /product="beta-lactamase"

             misc_feature    4069..4125
                             /note="multiple cloning site"

             RBS             complement(4161..4165)
                             /gene="bla"     
                             /note="Shine-Dalgarno sequence"

             promoter        complement(4188..4194)
                             /gene="bla"
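
The bla features in this example lie on the complementary strand, as indicated by `complement(...)`. The Python sketch below parses such locations and checks that the CDS, RBS, and promoter all fall inside the bla gene interval on the same strand. The helper names are hypothetical, and single-base locations such as the `rep_origin` above are not handled.

```python
import re


def parse_location(location: str) -> tuple[int, int, str]:
    """Parse 'a..b' or 'complement(a..b)' into (start, end, strand)."""
    text = location.strip()
    strand = "+"
    match = re.fullmatch(r"complement\((.+)\)", text)
    if match:
        strand = "-"
        text = match.group(1)
    start, end = (int(value) for value in text.split(".."))
    return start, end, strand


def is_within(inner: str, outer: str) -> bool:
    """True if `inner` lies inside `outer` and both are on the same strand."""
    i_start, i_end, i_strand = parse_location(inner)
    o_start, o_end, o_strand = parse_location(outer)
    return i_strand == o_strand and o_start <= i_start and i_end <= o_end


# Locations copied from the bla features of the pRB223 example.
gene = "complement(3293..4194)"
parts = {
    "CDS": "complement(3293..4153)",
    "RBS": "complement(4161..4165)",
    "promoter": "complement(4188..4194)",
}

for name, location in parts.items():
    print(name, is_within(location, gene))

cds_start, cds_end, _ = parse_location(parts["CDS"])
print((cds_end - cds_start + 1) % 3 == 0)  # whole number of codons
```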

Gapped sequence

A gapped sequence includes both known, directly sequenced data and unknown data. The unknown sections of sequence are represented by strings of 'nnn' between the known, directly sequenced, contiguous data. All pieces of a gapped sequence must be from the same source and be in the same orientation and in the correct order.

Relevant feature information for a gapped sequence:

if a gap length is estimated, insert the equivalent number of nnns between the directly determined, contiguous sections of sequence
if the gap length is unknown, insert a string of 100 nnns to represent the gap between the sections of sequence
add a misc_feature for each gap with a /note qualifier to describe it as either 'gap of unknown length' or 'gap of estimated length, # nts'
add all other appropriate features (exons, introns, CDS, gene, etc)

We strongly suggest that you provide as much of the above information as possible to ensure the most complete annotation of your sequence. If any of this information is not known, please inform us.

Example:

Homo sapiens MHC class I antigen (HLA-B) gene, HLA-B_458_01445 allele, exons 2, 3 and partial cds.

         source          1..788
                         /organism="Homo sapiens"
                         /mol_type="genomic DNA"
                         /db_xref="taxon:9606"

         gene            <1..>788
                         /gene="HLA-B"
                         /allele="HLA-B_458_01445"

         mRNA            join(<1..270,513..>788)
                         /gene="HLA-B"
                         /allele="HLA-B_458_01445"
                         /product="MHC class I antigen"

         CDS             join(<1..270,513..>788)
                         /gene="HLA-B"
                         /allele="HLA-B_458_01445"
                         /codon_start=3
                         /product="MHC class I antigen"
                         /protein_id="ACR38915.1"
                         /db_xref="GI:238055051"
                         /translation="SHSMRYFDTAMSRPGRGEPRFISVGYVDDTQFVRFDSDAASPRE
                         EPRAPWIEQEGPEYWDRNTQIFKTNTQTDRESLRNLRGYYNQSEAGSHTLQSMYGCDV
                         GPDGRLLRGHDQSAYDGKDYIALNEDLRSWTAADTAAQITQRKWEAARVAEQDRAYLE
                         GTCVEWLRRYLENGKDTLERA"

         exon            1..270
                         /gene="HLA-B"
                         /allele="HLA-B_458_01445"
                         /number=2

         gap             271..512
                         /estimated_length=242

         exon            513..788
                         /gene="HLA-B"
                         /allele="HLA-B_458_01445"
                         /number=3
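
The gap rules described in this section can be sketched in Python as follows: contiguous pieces are joined with runs of n, using the estimated length when one is given and 100 n's when the length is unknown, and the interval of each gap is recorded so that it can be annotated. The function name, the return format, and the placeholder sequences are assumptions made for illustration.

```python
UNKNOWN_GAP_LENGTH = 100  # number of n's used for a gap of unknown length


def assemble_gapped(contigs, gap_lengths):
    """Join contigs with runs of 'n' and report each gap interval.

    `gap_lengths` holds one entry per junction: an estimated length,
    or None when the gap length is unknown.
    Returns (sequence, [(gap_start, gap_end, note), ...]) with 1-based,
    inclusive coordinates.
    """
    if len(gap_lengths) != len(contigs) - 1:
        raise ValueError("need exactly one gap length between each pair of contigs")
    pieces = [contigs[0]]
    gaps = []
    position = len(contigs[0])
    for contig, gap in zip(contigs[1:], gap_lengths):
        if gap is None:
            size, note = UNKNOWN_GAP_LENGTH, "gap of unknown length"
        else:
            size, note = gap, f"gap of estimated length, {gap} nts"
        gaps.append((position + 1, position + size, note))
        pieces.append("n" * size)
        pieces.append(contig)
        position += size + len(contig)
    return "".join(pieces), gaps


# Placeholder contigs with the same lengths as exons 2 and 3 in the example.
sequence, gaps = assemble_gapped(["a" * 270, "c" * 276], [242])
print(len(sequence), gaps)
```

With these inputs the reported gap interval matches the 271..512 gap feature in the HLA-B example.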

Phylogenetic or population set

Relevant feature information for population or phylogenetic studies:

A set comprises a group of sequences that represent the same gene or locus in different organisms or in different isolates, strains, or clones of the same organism. A set can be, for example, phylogenetic (different organisms), population (same organism), or environmental (unclassified or unknown organisms).

unique descriptive information for each sequence (eg, clone, strain, isolate, or organism names)
creating a set allows the sequences to be retrieved from Entrez PopSet as a group

We strongly suggest that you provide as much of the above information as possible to ensure the most complete annotation of your sequence. If any of this information is not known, please inform us.

EST submissions

Please submit directly to dbEST: the EST division of GenBank.

GSS submissions

Please submit directly to dbGSS: the GSS division of GenBank.

STS submissions

Relevant feature information for STS submissions:

submit directly to dbSTS: the STS division of GenBank

submit using BankIt and provide:

chromosome and/or specific map locations
clone name
clone library [catalog number, reference, lab source, and/or specific (in-house) name or number]
PCR conditions and primer binding sites

We strongly suggest that you provide as much of the above information as possible to ensure the most complete annotation of your sequence. If any of this information is not known, please inform us.

HTGS submissions

Requirements for HTGs submissions:

large genome centers should submit these through an FTP account to the High Throughput Genomic (HTG) Sequences division of GenBank
one-time submitters should submit to gb-sub@ncbi.nlm.nih.gov

FLICs submissions

Relevant feature information for FLIC submissions:

explicit labeling as FLICs

Optional:

protein name
gene name
CDS intervals, including start/stop codons

We strongly suggest that you provide as much of the above information as possible to ensure the most complete annotation of your sequence. If any of this information is not known, please inform us.