Viral Genomes FAQ

How to retrieve nucleotide and protein sequences of viral reference genomes?
How to retrieve non-RefSeq (DDBJ/EMBL/GenBank) nucleotide sequences of complete viral genomes?
Why can't I just retrieve sequences marked as "complete" in public databases?
Why is a particular full-length genomic sequence missing in Entrez Genomes? Is there a RefSeq genome for this virus?
How were the virus reference sequences chosen?
Can a reference sequence be replaced with another one?
Which strand is shown if a genome consists of single-stranded RNA or DNA?
Why are the dates on the reference sequences more recent than on their original source records?
How did you choose the names for the viruses?

How to retrieve nucleotide and protein sequences of viral reference genomes?

Most viral genomes are relatively small, so their sequences can be easily retrieved via a Web interface. Nucleotide or protein sequences of all viral reference genomes can be retrieved from the corresponding Entrez database via the Entrez Nucleotide or Entrez Protein hyperlinks located in the "All Viral Genomes" section of the blue bar on the left side of the main page or other informational pages (including this one). To retrieve sequences for a particular virus group, use the "Sequence Info" menu on the corresponding group page. For example, the links on the "Flaviviridae" page will bring up the nucleotide or protein sequences belonging exclusively to this virus family. To narrow the search further, add one or more specific terms to the Entrez query, e.g. "Hepatitis C virus[Organism]" or "(polymerase OR replicase)[Protein Name]", without the quotes.

As viral reference sequences are also part of the NCBI RefSeq collection, they can be downloaded via the NCBI RefSeq Web page. The direct link "RefSeq FTP" is located on the left side blue bar. Alternatively, one can use NCBI eUtils. Note that the site ftp://ftp.ncbi.nih.gov/genomes/ does NOT contain viral sequences.
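
As an illustration of the eUtils route, here is a minimal Python sketch that only builds the request URLs for an ESearch followed by an EFetch. The helper names (build_term, esearch_url, efetch_fasta_url) are hypothetical, and the "refseq[filter]" restriction is an assumption about how one might limit results to reference sequences; check the current E-utilities documentation before relying on either.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_term(organism, extra_terms=()):
    """Combine an organism name with optional extra Entrez terms using AND.

    The "refseq[filter]" part is an assumption, used here to restrict
    the search to RefSeq records.
    """
    parts = [f"{organism}[Organism]", "refseq[filter]"]
    parts.extend(extra_terms)
    return " AND ".join(parts)


def esearch_url(db, term, retmax=20):
    """Return an ESearch URL that lists record IDs matching the term."""
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return f"{EUTILS_BASE}/esearch.fcgi?{query}"


def efetch_fasta_url(db, ids):
    """Return an EFetch URL that downloads the given IDs as FASTA text."""
    query = urlencode(
        {"db": db, "id": ",".join(ids), "rettype": "fasta", "retmode": "text"}
    )
    return f"{EUTILS_BASE}/efetch.fcgi?{query}"


if __name__ == "__main__":
    term = build_term(
        "Hepatitis C virus", ["(polymerase OR replicase)[Protein Name]"]
    )
    print(esearch_url("protein", term))
```

The IDs returned by the ESearch request would then be passed to efetch_fasta_url with the same database name ("nucleotide" or "protein"). The sketch performs no network access itself, so the URLs can be inspected before any request is sent.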

How to retrieve non-RefSeq (DDBJ/EMBL/GenBank) nucleotide sequences of complete viral genomes?

First, retrieve the RefSeq genomes of interest via an Entrez Genome search (e.g., by typing "Flaviviridae[organism]" in the search box). Then either choose "Other genomes" in the "Display" option box or use the "Other genomes for species" link located under the "Links" menu to the right of each RefSeq accession found.

Why can't I just retrieve sequences marked as "complete" in public databases?

You can, but some complete sequences are not marked as "complete" in public records. We identify such sequences manually and make them available as either RefSeq records or Genome neighbors (other complete genomes for the species).

Why is a particular full-length genomic sequence missing in Entrez Genomes? Is there a RefSeq genome for this virus?

Another sequence may have already become the reference sequence for this viral genome (or its component). To check, search Entrez Nucleotide with the DDBJ/EMBL/GenBank accession of interest. Then either choose "RefSeq genome" in the "Display" option box or use the "RefSeq genome for species" link located under the "Links" menu on the right. If the query record was used to build the corresponding RefSeq record, the link is "Genome". If none of these links is present, please let us know.

Note that it does not matter whether the organism name in the query record is a species-level name. If the exact species name is known, however, it can be used to retrieve the corresponding RefSeq record(s), if any, via Entrez Genome (e.g., by typing "Hepatitis C virus[organism]" in the search box). Precomputed global alignments of reference sequences with the corresponding additional complete genomic sequences are available from the lists of reference sequences; see the help page.

How were the virus reference sequences chosen?

Potential genomic records were initially found in GenBank by an automated process. The candidates were then manually reviewed for quality and completeness. For each virus species, only one complete genome was selected as the source for the associated reference sequence(s). Those records that were not selected were cross-referenced to the RefSeq record as Genome neighbors (other complete sequences for the species). To identify sequences belonging to a particular virus we compared the species level tax_ids (taxonomy identification numbers). For a more detailed explanation see About Viral Genomes.
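
The grouping step described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the actual NCBI pipeline: the function name assign_neighbors, the (accession, species tax_id) input format, and the placeholder accessions are all hypothetical, the manual review is represented by a precomputed set of chosen accessions, and segmented genomes (several records per reference genome) are ignored for simplicity.

```python
from collections import defaultdict


def assign_neighbors(records, reference_accessions):
    """Map each reference accession to its Genome neighbors.

    records: iterable of (accession, species_taxid) pairs for complete genomes.
    reference_accessions: set of accessions picked by manual review
        (hypothetical input; one per species is expected).
    """
    # Group complete genomes by their species-level tax_id.
    by_species = defaultdict(list)
    for accession, species_taxid in records:
        by_species[species_taxid].append(accession)

    neighbors = {}
    for accessions in by_species.values():
        refs = [a for a in accessions if a in reference_accessions]
        if len(refs) != 1:
            # No reference chosen yet, or an ambiguous choice: skip the species.
            continue
        reference = refs[0]
        # Every other complete genome of the species becomes a neighbor.
        neighbors[reference] = [a for a in accessions if a != reference]
    return neighbors


if __name__ == "__main__":
    # Placeholder accessions and tax_ids, not real database identifiers.
    records = [("GB_1", 100), ("REF_A", 100), ("GB_2", 100), ("GB_3", 200)]
    print(assign_neighbors(records, {"REF_A"}))
```

Comparing species-level tax_ids rather than organism names means that strains and isolates recorded under different names still fall into the same group.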

Can a reference sequence be replaced with another one?

An existing RefSeq record can be rebuilt from another sequence if the NCBI staff finds the latter to be a better representative for a particular organism. If this happens, old NC_ accession numbers are retained whenever possible. However, if a new RefSeq record gets a different accession number, the old record is still retrievable via Entrez by the old accession or gi number.

Which strand is shown if a genome consists of single-stranded RNA or DNA?

This boils down to the question "In what orientation are sequences accepted by GenBank?". Whenever possible, the default is the coding strand. Beware, however, of ambisense viruses, such as phleboviruses or tenuiviruses, which belong to the ssRNA negative-strand viruses group but have one or more ssRNA components carrying genes in both orientations. For these, the "virion-sense" strand is shown.

Why are the dates on the reference sequences more recent than on their original source records?

The date displayed is the date when the reference sequence record was created or updated. The source record from which the RefSeq record was made is still in GenBank, with its original entry/update date.

How did you choose the names for the viruses?

Wherever possible, virus names are given according to the recommendations described in recent proposals or reports of the International Committee on Taxonomy of Viruses (ICTV), currently the Eighth Report:

Virus Taxonomy: Classification and Nomenclature of Viruses. Eighth Report of the International Committee on Taxonomy of Viruses (book). Ed. C.M. Fauquet, M.A. Mayo, J. Maniloff, D.J., U. Desselberger and L.A. Ball (2005). Academic Press, San Diego.

In other cases, when expert advice is not available, virus names are created de novo by the NCBI Taxonomy group.


Revised: June 8, 2006