Whole Genome Shotgun Submissions

Whole Genome Shotgun Sequence Submissions

DDBJ/EMBL/GenBank accepts contigs from ongoing Whole Genome Shotgun (WGS) sequencing projects. These records can contain annotations, and an entire project is updated as sequencing progresses.

See the list of WGS projects.

Introduction
What To Do
WGS Project
How to Create a WGS Submission
AGP Files to Build Scaffolds and/or Chromosomes
Updating a WGS Project

Introduction

Each WGS project is assigned a stable 4-letter WGS project_ID, which does not change as the project is updated. In addition to the WGS project_ID, the contig identifiers carry a version number corresponding to a particular project update. Finally, each individual contig within the assembly is assigned a unique accession number prefixed by the WGS project_ID and version number. For instance, if a project's accession number is XXXX00000000, then that project's first assembly version would be XXXX01000000, and the first contig of that version would be XXXX01000001. (The last six digits of this ID identify each individual contig.)
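The accession layout described above can be sketched in a few lines of Python. This is only an illustration of the numbering scheme in the example (four letters, two version digits, six contig digits); the function names are hypothetical and are not part of any NCBI tool.

```python
import re

def contig_accession(project_id, version, contig_number):
    """Join a 4-letter project_ID, a 2-digit version and a 6-digit contig number."""
    if not re.fullmatch(r"[A-Za-z]{4}", project_id):
        raise ValueError("project_ID must be exactly four letters")
    if not 0 <= version <= 99 or not 0 <= contig_number <= 999999:
        raise ValueError("version or contig number out of range")
    return f"{project_id}{version:02d}{contig_number:06d}"

def split_accession(accession):
    """Split an accession back into (project_ID, version, contig number)."""
    match = re.fullmatch(r"([A-Za-z]{4})(\d{2})(\d{6})", accession)
    if match is None:
        raise ValueError(f"not a WGS-style accession: {accession!r}")
    return match.group(1), int(match.group(2)), int(match.group(3))
```

With this sketch, version 0 and contig 0 give the project-level number XXXX00000000, and version 1 with contig 1 gives XXXX01000001.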

The nucleotide data from most WGS projects go into the BLAST wgs database, whereas proteins go into the BLAST nr database. Nucleotides from environmental projects are in the BLAST wgs database if the sequence has been identified as a particular organism, and in the env_nt database if the organism is not yet known. Similarly, the proteins from those projects are in the nr or env_nr BLAST database, respectively.

See the Metagenome Submission Guide for information about how to submit the various elements of a metagenome project.

What To Do

Register your project and a locus_tag prefix with the Genome Project database. See the annotation guides for eukaryotes or prokaryotes for more information about locus_tags. Include your GenomeProject ID in correspondence about your project and with your submissions.

Submit the contigs as the WGS project. A WGS project consists only of the contigs (overlapping reads) of a sequencing project, not of supercontigs (assembled contigs separated by gaps). Supercontig or assembly information can be sent to us in AGP format, which will allow us to make CON records that indicate how the pieces of the WGS submission are put together.

Submit your reads to the Trace Archive database as this information is useful for the scientific community. Contact trace@ncbi.nlm.nih.gov for questions about submitting to the Trace Archive.

WGS Project

WGS projects without annotation require at least two weeks to be processed. Projects with annotation require at least one month for processing. Please submit your project with enough lead time.

Complete genomes that lack annotation are processed as WGS projects. When annotation is added, the complete genome is given a new accession number and the WGS accession number is made secondary, so that Entrez searches for either number will retrieve the complete annotated genome.

Submit complete organellar and viral genomes as regular GenBank records by emailing the submissions to GenBank Submissions.

Include specific source information, such as strain or isolate name, country where the sample was collected, specimen voucher, sex, and any other relevant information. See the tbl2asn page for information on how to include source qualifiers in a submission.

In general, submit only those contigs that are longer than 200 bp. However, if you are submitting an AGP file with assembly information, then include in the WGS project all the contigs that are part of scaffolds or chromosomes, regardless of contig length.
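The length rule above can be expressed as a simple filter. The sketch below assumes you already have a mapping of SeqID to contig length and, if an AGP file is being submitted, the set of SeqIDs that appear in scaffolds or chromosomes; the function and parameter names are hypothetical.

```python
def contigs_to_submit(lengths, in_scaffolds=frozenset(), min_length=200):
    """Return the SeqIDs to include in the WGS project.

    lengths      -- dict mapping SeqID to contig length in bp
    in_scaffolds -- SeqIDs that are components of scaffolds/chromosomes
    """
    return [
        seq_id
        for seq_id, length in lengths.items()
        # keep contigs longer than 200 bp, plus any contig used in an assembly
        if length > min_length or seq_id in in_scaffolds
    ]
```

A 150 bp contig is dropped by default but kept when its SeqID is listed in in_scaffolds.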

Annotation can be included on the WGS contigs or on the scaffold or chromosome CON records that are generated from the information in the agp file, whichever is most appropriate for the project. Annotation that is submitted on a WGS contig will be displayed in Entrez on the scaffold or chromosome that includes that contig. Similarly, if a scaffold has annotation and is a component of a chromosome CON record, then its annotation will be displayed in Entrez on the chromosome. However, annotation that is submitted on a scaffold or chromosome CON record is not displayed on the underlying components. Contact NCBI for information about annotating scaffolds or chromosomes.

Examples (annotated records are shown in the GenBank (Full) view in Entrez):

Annotated Contigs               Annotated Scaffolds             No Annotation
WGS contig with annotation      WGS contig without annotation   WGS contig without annotation
Chromosome CON with annotation  Chromosome CON with annotation  Chromosome CON without annotation
Scaffold CON with annotation    Scaffold CON with annotation    Scaffold CON without annotation

How to Create a WGS Submission

Submissions to WGS can be created with tbl2asn, a command line program that automates parts of the submission process. tbl2asn reads a template along with sequence (*.fsa) and optional annotation table (*.tbl) files and outputs an ASN.1 file (*.sqn) for submission to GenBank. The *.sqn files are then deposited in the submitter's FTP account at NCBI. Please contact genomes@ncbi.nlm.nih.gov if you need an FTP account.

File Format

Nucleotide sequences of any size in FASTA format can be used as input to tbl2asn. FASTA format consists of a single definition line, beginning with a '>', followed by text and subsequent lines of sequence. [See below for information about having multiple sequences in a single file.] At a minimum, every definition line must contain [tech=wgs] (to indicate that the sequence is a whole genome shotgun sequence) and an identifier for the nucleotide sequence, called the SeqID. The SeqID must be unique for each sequence, and is important when updating a WGS project. It cannot begin with the word "assembly", as that causes errors in tbl2asn. Other information about the biological source, including the organism name, should also be included in the definition line. In addition to the organism name, other source modifiers include [strain=yyy] and [chromosome=nnn]. Note that there are no spaces surrounding the equal sign. A complete list of modifiers is available from the Sequin FAQ page. The definition line must be on a single line with no line break.

A sample definition line is:
>SeqID [organism=Mus musculus] [strain=BALB/c] [tech=wgs] [chromosome=2]
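The definition line rules above lend themselves to a quick pre-submission check. The following is a minimal sketch, not a replacement for tbl2asn validation: the function name is hypothetical, and the check for "assembly" is done case-insensitively as a conservative assumption.

```python
import re

def check_defline(defline):
    """Return a list of problems found in a FASTA definition line."""
    problems = []
    line = defline.rstrip("\n")
    if "\n" in line:
        problems.append("definition line contains a line break")
    if not line.startswith(">"):
        problems.append("definition line must begin with '>'")
        return problems
    rest = line[1:]
    seq_id = rest.split(maxsplit=1)[0] if rest.strip() else ""
    if not seq_id or seq_id.startswith("["):
        problems.append("missing SeqID after '>'")
    elif seq_id.lower().startswith("assembly"):
        problems.append("SeqID must not begin with 'assembly'")
    # collect [name=value] modifiers
    modifiers = dict(re.findall(r"\[([^=\[\]]*)=([^\[\]]*)\]", line))
    if modifiers.get("tech") != "wgs":
        problems.append("missing [tech=wgs] modifier")
    if re.search(r"\[[^\]]*(\s=|=\s)[^\]]*\]", line):
        problems.append("spaces around '=' in a modifier")
    return problems
```

The sample definition line shown above passes this check with an empty list of problems.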

Annotation can be included by creating a 5-column table in a .tbl file for each .fsa file. Go to the appropriate page for information about the format of the table and the desired annotation for eukaryotic or bacterial genomes. Three fields are required:

locus_tag for genes

The locus_tag is the systematic name of the gene and is used for tracking individual genes. It therefore must be unique across all the genes in a project. If a gene's biological name is known, then it is included as the gene qualifier in the table.

protein_id for proteins

The protein_id is the SeqID of the protein (analogous to the nucleotide SeqID) and is used to track the protein. All of the SeqIDs, both nucleotide and protein, must be unique within a project. For WGS projects you can use type general protein_id's (format: gnl|dbname|SeqID) or local protein_id's (format: lcl|SeqID). Note that during our processing both forms of protein_id's are converted to type general id's in the format gnl|WGS:XXXX|SeqID, where XXXX is the project_ID.

product for proteins

The product is free text, chosen by the submitter. Protein names should be concise names, not descriptions or phrases. BLAST similarity results can be included as a note, or can be modified to be used as the product name. For example, if BLAST results indicate that the translation is similar to XYZ protein, then the product name could be "XYZ-like protein". If the protein is predicted and the product name is not known, use "hypothetical protein" as the product name.

Note that the nucleotide SeqID appears in the DEFINITION line in the flatfile view of the record. Although the protein SeqIDs are not displayed in the final flatfile view, they are present in the ASN.1.
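To make the protein_id formats concrete, the sketch below mimics the conversion described above, in which both local and general protein_ids end up as gnl|WGS:XXXX|SeqID. NCBI performs this step during processing; the function here is hypothetical and only illustrates the resulting string.

```python
def to_wgs_protein_id(protein_id, project_id):
    """Convert lcl|SeqID or gnl|dbname|SeqID to gnl|WGS:<project_ID>|SeqID."""
    parts = protein_id.split("|")
    if len(parts) == 2 and parts[0] == "lcl":
        seq_id = parts[1]
    elif len(parts) == 3 and parts[0] == "gnl":
        seq_id = parts[2]
    else:
        raise ValueError(f"unexpected protein_id format: {protein_id!r}")
    return f"gnl|WGS:{project_id}|{seq_id}"
```

Because both input forms collapse to the same output, two proteins whose SeqIDs differ only in the dbname part would collide, which is one reason all SeqIDs must be unique within a project.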

See example *.fsa and *.tbl files for various situations, such as partial CDS or features on the minus strand.

Use tbl2asn to include the Phrap/Consed quality scores of a sequence. The scores must be in files named *.qvl that are in the same directory and have the same nucleotide SeqIDs as the corresponding *.fsa files.

Template File

The template file for tbl2asn is created with Sequin. On the starting Sequin page, choose "Start New Submission". Enter a manuscript title if desired. Enter the contact, author, and affiliation information, then return to the submission tab and use File->Export Submitter Info. Save the file as 'template.sbt'.

If there is a published reference, it can be sent separately with the submission, to be added to the records by GenBank staff during WGS processing.

tbl2asn does not include a release date in the output file, so include that information in your email message to us when you submit.

Using tbl2asn

Some tbl2asn options that are relevant to WGS submissions are:

  -t  Template file [File In] [required]
  -p  Path for table and sequence files ('-p .' is the current directory)
  -v  Validate [T/F] Optional, default=F
  -j  Source qualifiers to be applied to every submission. Example: -j "[organism=Saccharomyces cerevisiae] [strain=S288C]"
  -b  Generate GenBank file [T/F] Optional, default=F
  -s  Read FASTAs as set [T/F] Optional, default=F
  -i  Only this file [File In] Optional

To create WGS submissions from multiple .fsa files, run tbl2asn with the command:

tbl2asn -t template.sbt -p path_to_files -v

Note that you must also specify the path as part of the template file's name if the template file is in a different directory than where you are running tbl2asn.

In a directory specified by '-p', the program looks for pairs of .fsa and .tbl files with the same file name prefix, for example file.fsa and file.tbl, and it builds ASN.1 records for these pairs. The ASN.1 record will be called file.sqn. The results of the validation (-v; error checking) will be called file.val. Most validation errors must be fixed before the .sqn files can be submitted to GenBank; however, taxonomy-related errors and "No publications anywhere" errors can generally be ignored.
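Before running tbl2asn it can help to list which files it will pair up and which outputs to expect. The sketch below follows the naming rule described above (file.fsa, file.tbl, file.sqn, file.val); the function name is hypothetical, and it only inspects file names without calling tbl2asn.

```python
from pathlib import Path

def expected_outputs(directory):
    """Map each .fsa file name to (has matching .tbl, .sqn name, .val name)."""
    plan = {}
    for fsa in sorted(Path(directory).glob("*.fsa")):
        plan[fsa.name] = (
            fsa.with_suffix(".tbl").exists(),  # annotation table is optional
            fsa.with_suffix(".sqn").name,      # ASN.1 record
            fsa.with_suffix(".val").name,      # validation results (-v)
        )
    return plan
```

A .fsa file reported without a matching .tbl is not an error in itself, since the annotation table is optional, but it is worth checking when annotation was intended.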

If you wish to have GenBank flatfiles generated also, use the -b argument when you run tbl2asn, and a .gbf file will be generated for each .fsa file in the directory.

If you want to create a submission from only one particular .fsa file that is in a directory that contains multiple .fsa files, use the -i argument to indicate which file is to be read.
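If tbl2asn is driven from a script, the command line can be assembled from the options listed above. This sketch only builds the argument list and does not run anything. The function name is hypothetical, the flags are written bare as in the example command above, and the exact argument form may differ between tbl2asn versions, so check the tbl2asn page before relying on it.

```python
import shlex

def build_tbl2asn_command(template, path, genbank=False, as_set=False, only_file=None):
    """Build a tbl2asn argument list from the options described in this guide."""
    cmd = ["tbl2asn", "-t", template, "-p", path, "-v"]
    if genbank:
        cmd.append("-b")          # also write .gbf flatfiles
    if as_set:
        cmd.append("-s")          # multiple sequences per .fsa file
    if only_file is not None:
        cmd += ["-i", only_file]  # read only this file
    return cmd

# Show the command as it would be typed in a shell.
print(shlex.join(build_tbl2asn_command("template.sbt", "path_to_files")))
```

The returned list can be passed to subprocess.run once the command has been confirmed to work by hand.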

Go to the tbl2asn page for more detailed information about tbl2asn, its command line arguments, and file formats.

Multiple Sequences in a Single .fsa File

You can put multiple FASTA sequences (usually less than 10,000 sequences) into a single .fsa file if you wish to have fewer files. The corresponding .tbl file must have the annotation information for all of the sequences in the .fsa file. Run tbl2asn with the -s argument so that each definition line is recognized as the beginning of a new sequence. A single .sqn file will then be generated for the multiple sequences of each .fsa file.

See example .fsa and .tbl files.
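When several sequences share one .fsa file, duplicate SeqIDs are easy to introduce. The sketch below is a simple check under the assumption that the SeqID is the first whitespace-delimited token after '>'. The function names are hypothetical, and the 10,000 figure is treated as the soft guideline mentioned above rather than a hard limit.

```python
from collections import Counter

def seq_ids(fasta_text):
    """Return the SeqID of every definition line, in file order."""
    return [
        line[1:].split()[0]
        for line in fasta_text.splitlines()
        if line.startswith(">") and line[1:].strip()
    ]

def check_multi_fasta(fasta_text, max_records=10000):
    """Report duplicate SeqIDs and unusually large files."""
    ids = seq_ids(fasta_text)
    problems = [f"duplicate SeqID: {s}" for s, n in Counter(ids).items() if n > 1]
    if len(ids) >= max_records:
        problems.append(f"{len(ids)} sequences in one file; consider splitting")
    return problems
```

Running the same check over the concatenated contents of all .fsa files extends it to the whole project.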

AGP Files to Build Scaffolds and/or Chromosomes

If you have assembly information describing how the contigs are assembled into scaffolds (supercontigs) or chromosomes, then submit an AGP file with that information. AGP files provide the ordering and orientation information needed to construct supercontigs or scaffolds from contigs, or to construct chromosomes from supercontigs and/or contigs. More information about genome assemblies and about the AGP format is available from NCBI.

Some specific requests are:

Use "100" as the length of gaps of unknown size, as that is the GenBank convention. They will appear as gap(unk100) in the flatfile view of the GenBank record.

Include the accession.version number as the component identifier, not just the accession number.

If the project is not annotated, then please generate two AGP files. The first is the 'complete' AGP file that includes all of the WGS contigs as components. In addition to that one, please also submit a 'no-singletons' AGP file, whose objects are only the multi-component scaffolds.
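The 'no-singletons' file and the accession.version request can both be checked with a short script. The sketch below assumes tab-delimited AGP lines in which column 1 is the object, column 5 is the component type (with N or U marking a gap) and column 6 is the component identifier; these column positions come from the AGP format and should be confirmed against the specification. All function names are hypothetical.

```python
from collections import Counter
import re

GAP_TYPES = {"N", "U"}  # assumption: component types that denote gaps

def parse_agp(lines):
    """Split non-empty, non-comment AGP lines into lists of fields."""
    return [
        line.rstrip("\n").split("\t")
        for line in lines
        if line.strip() and not line.startswith("#")
    ]

def no_singletons(rows):
    """Keep only rows whose object has more than one non-gap component."""
    counts = Counter(row[0] for row in rows if row[4] not in GAP_TYPES)
    return [row for row in rows if counts[row[0]] > 1]

def components_missing_version(rows):
    """List component identifiers that lack a '.version' suffix."""
    return [
        row[5]
        for row in rows
        if row[4] not in GAP_TYPES and not re.search(r"\.\d+$", row[5])
    ]
```

The 'complete' AGP file is simply the unfiltered set of rows, so both files can be produced from one parse.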

Updating a WGS Project

If the update is a completely new version of the WGS project, then the nucleotide SeqIDs must be distinct from those of all previous versions. A small change, such as adding "_2" to the end of the original SeqID, is sufficient to make the new set unique. Another strategy is to include the version in the contig names, e.g., Cont02_xxxxx.

If the same version is being updated, then the SeqIDs must be identical and the accession numbers must be included in the update, for both nucleotides and proteins. The correct format of a nucleotide identifier in such an update is:

gnl|WGS:XXXX|SeqID|gb|XXXX01xxxxxx

where XXXX is the project_ID and XXXX01xxxxxx is the contig's accession number. We recommend that you contact NCBI before generating a complicated update.
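For a same-version update, the identifier shown above can be assembled mechanically from the project_ID, the original SeqID and the assigned accession number. This is a minimal sketch with a hypothetical function name; it checks only that the accession begins with the project_ID.

```python
def update_identifier(project_id, seq_id, accession):
    """Build gnl|WGS:<project_ID>|<SeqID>|gb|<accession> for a same-version update."""
    if not accession.startswith(project_id):
        raise ValueError("accession does not belong to this project_ID")
    return f"gnl|WGS:{project_id}|{seq_id}|gb|{accession}"
```

For example, SeqID Cont01_00001 with accession XXXX01000001 in project XXXX becomes gnl|WGS:XXXX|Cont01_00001|gb|XXXX01000001.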

If you need additional assistance in preparing WGS submissions, please contact genomes@ncbi.nlm.nih.gov.

Revised December 14, 2007
