How to retrieve nucleotide and protein sequences of viral reference genomes?
Most viral genomes are relatively small, so their sequences can be easily retrieved via a Web interface. Nucleotide or protein sequences of all viral reference genomes can be retrieved from the corresponding Entrez database via the Entrez Nucleotide or Entrez Protein hyperlinks located in the "All Viral Genomes" section of the blue bar on the left side of the main page and other informational pages (including this one). To retrieve sequences for a particular virus group, use the "Sequence Info" menu on the corresponding group page. For example, the links on the "Flaviviridae" page will bring up the nucleotide or protein sequences belonging exclusively to this virus family. To narrow the search further, add one or more specific terms to the Entrez query, e.g. "Hepatitis C virus[Organism]" or "(polymerase OR replicase)[Protein Name]" (without the quotes).
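As an illustration of narrowing a query, the short Python sketch below joins the field-tagged terms quoted above with the Entrez Boolean operator AND. The helper name build_entrez_term is hypothetical and is not part of any NCBI tool; it only shows how such a query string can be assembled before pasting it into the search box.

```python
def build_entrez_term(*terms):
    """Join field-tagged Entrez terms with AND, skipping empty ones.

    Hypothetical helper for illustration only; not an NCBI function.
    """
    cleaned = [t.strip() for t in terms if t and t.strip()]
    return " AND ".join(cleaned)


# Combine the two example terms from the paragraph above.
term = build_entrez_term(
    "Hepatitis C virus[Organism]",
    "(polymerase OR replicase)[Protein Name]",
)
print(term)
```

The resulting string can be typed into the Entrez Protein search box as is.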
As viral reference sequences are also part of the NCBI RefSeq collection, they can be downloaded via the NCBI RefSeq Web page. The direct link "RefSeq FTP" is located on the left side blue bar. Alternatively, one can use NCBI eUtils. Note that the site ftp://ftp.ncbi.nih.gov/genomes/ does NOT contain viral sequences.
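For readers who prefer the eUtils route, the following Python sketch builds (but does not send) ESearch and EFetch request URLs. It assumes the current public E-utilities base address and the "nuccore" database name. The filter "srcdb_refseq[PROP]", used here to restrict results to RefSeq records, is also an assumption that should be checked against the E-utilities help. The function names and the accession "NC_XXXXXX" are placeholders for illustration.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Assumed base address of the NCBI E-utilities.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term, retmax=20):
    """Build an ESearch URL that returns up to retmax record IDs."""
    params = {"db": db, "term": term, "retmax": retmax}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def efetch_fasta_url(db, ids):
    """Build an EFetch URL that returns the given records as FASTA text."""
    params = {
        "db": db,
        "id": ",".join(ids),
        "rettype": "fasta",
        "retmode": "text",
    }
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


# The RefSeq filter below is an assumption; verify it before relying on it.
query = "Hepatitis C virus[Organism] AND srcdb_refseq[PROP]"
search_url = esearch_url("nuccore", query)

# "NC_XXXXXX" is a placeholder, not a real accession.
fetch_url = efetch_fasta_url("nuccore", ["NC_XXXXXX"])

print(search_url)
print(fetch_url)
```

The printed URLs can be opened in a browser or passed to any HTTP client. The IDs returned by the search are then supplied to the fetch call in place of the placeholder.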
How to retrieve non-RefSeq (DDBJ/EMBL/GenBank) nucleotide sequences of complete viral genomes?
First, retrieve the RefSeq genomes of interest via an Entrez Genome search (e.g., by typing "Flaviviridae[organism]" in the search box). Then either choose "Other genomes" in the "Display" option box or use the link "Other genomes for species" located under the "Links" menu to the right of each RefSeq accession found.
Why can't I just retrieve sequences marked as "complete" in public databases?
You can, but some complete sequences are not marked as "complete" in public records. We identify such sequences manually and make them available either as RefSeq records or as Genome neighbors (other complete genomes for the species).
Why is a particular full-length genomic sequence missing in Entrez Genomes? Is there a RefSeq genome for this virus?
Another sequence may have already become the reference sequence for this viral genome (or its component). To check, search Entrez Nucleotide with the DDBJ/EMBL/GenBank accession of interest. Then either choose "RefSeq genome" in the "Display" option box or use the link "RefSeq genome for species" located under the "Links" menu on the right. If the query record was used to build the corresponding RefSeq record, the link is "Genome". If none of those links is present, please let us know.
Note that it does not matter whether the organism name in the query record is a species-level name. But if the exact species name is known, it can be used to retrieve the corresponding RefSeq record(s), if any, via Entrez Genome (e.g., by typing "Hepatitis C virus[organism]" in the search box). Precomputed global alignments of reference sequences with the corresponding additional complete genomic sequences are available from the lists of reference sequences; see the help page.
How were the virus reference sequences chosen?
Potential genomic records were initially found in GenBank by an automated process. The candidates were then manually reviewed for quality and completeness. For each virus species, only one complete genome was selected as the source for the associated reference sequence(s). Records that were not selected were cross-referenced to the RefSeq record as Genome neighbors (other complete sequences for the species). To identify sequences belonging to a particular virus, we compared the species-level tax_ids (taxonomy identification numbers). For a more detailed explanation, see About Viral Genomes.
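The grouping step can be pictured with a small Python sketch. It is not the actual NCBI pipeline: the function name, the data layout, and the accessions and tax_ids in the example are made-up placeholders. The sketch assumes that the manually reviewed choice for each species is supplied as input, then groups candidate records by species-level tax_id and lists every unselected record as a neighbor of the selected one.

```python
from collections import defaultdict


def assign_neighbors(records, chosen):
    """Group complete-genome records by species-level tax_id.

    records: iterable of (accession, species_taxid) pairs.
    chosen:  dict mapping species_taxid -> accession picked by manual review.
    Returns a dict: species_taxid -> {"reference_source", "neighbors"}.
    Illustrative only; not the real selection procedure.
    """
    by_species = defaultdict(list)
    for accession, taxid in records:
        by_species[taxid].append(accession)

    result = {}
    for taxid, accessions in by_species.items():
        ref = chosen.get(taxid)
        if ref not in accessions:
            continue  # no reviewed choice for this species yet
        result[taxid] = {
            "reference_source": ref,
            "neighbors": [a for a in accessions if a != ref],
        }
    return result


# Placeholder accessions and tax_ids, invented for the example.
candidates = [("ACC1", 100), ("ACC2", 100), ("ACC3", 100), ("ACC4", 200)]
reviewed = {100: "ACC2", 200: "ACC4"}
groups = assign_neighbors(candidates, reviewed)
print(groups)
```

Keying the grouping on the species-level tax_id means that records deposited under strain or isolate names still end up attached to the single reference chosen for their species.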
Can a reference sequence be replaced with another one?
An existing RefSeq record can be rebuilt from another sequence if the NCBI staff finds the latter to be a better representative of a particular organism. If this happens, the old NC_ accession numbers are retained whenever possible. However, if a new RefSeq record gets a different accession number, the old record is still retrievable via Entrez by its old accession or gi number.
Which strand is shown if a genome consists of single-stranded RNA or DNA?
This boils down to the question "In what orientation are sequences accepted by GenBank?". Whenever possible, the default is the coding strand. Beware, however, of ambisense viruses, such as Phleboviruses or Tenuiviruses, which belong to the group of ssRNA negative-strand viruses but have one or more ssRNA components carrying genes in both orientations. For these, the "virion-sense" strand is shown.
Why are the dates on the reference sequences more recent than on their original source records?
The date displayed is the date when the reference sequence record was created or updated. The source record from which the RefSeq record was made is still in GenBank, with its original entry/update date.
How did you choose the names for the viruses?
Wherever possible, virus names are given according to the recommendations described in recent proposals or reports of the International Committee on Taxonomy of Viruses (ICTV), currently the Eighth Report:
Virus Taxonomy: Classification and Nomenclature of Viruses. Eighth Report of the International Committee on Taxonomy of Viruses (book).
Ed. C.M. Fauquet, M.A. Mayo, J. Maniloff, D.J., U. Desselberger and L.A. Ball (2005). Academic Press, San Diego.
In other cases, when expert advice is not available, virus names are created de novo by the NCBI Taxonomy group.