Whole Genome Shotgun Submissions
Whole Genome Shotgun Sequence Submissions

DDBJ/EMBL/GenBank accepts contigs from ongoing Whole Genome Shotgun (WGS) sequencing projects. These records can contain annotations, and an entire project is updated as sequencing progresses. See the list of WGS projects.
Introduction

Each WGS project is assigned a stable 4-letter WGS project_ID, which does not change as the project is updated. In addition to the WGS project_ID, the contig identifiers have a version number corresponding to a particular project update. Finally, each individual contig within the assembly is assigned a unique accession number prefixed by the WGS project_ID and version number. For instance, if a project's assigned accession number is XXXX00000000, then that project's first assembly version would be XXXX01000000, and the first contig of that version would be XXXX01000001. (The last six digits of this ID identify each individual contig.)

The nucleotide data from most WGS projects go into the BLAST wgs database, whereas proteins go into the BLAST nr database. Nucleotides from environmental projects are present in either the BLAST env_nt or wgs database, depending on whether the sequence has been identified as a particular organism or the organism is not yet known. Similarly, the proteins from those projects are in the env_nr or nr BLAST database.

See the Metagenome Submission Guide for information about how to submit the various elements of a metagenome project.

What To Do
WGS Project
Examples (annotated records are shown as GenBank(Full) view in Entrez):
How to Create a WGS Submission

Submissions to WGS can be created with tbl2asn, a command line program that automates parts of the submission process. tbl2asn reads a template along with sequence (*.fsa) and optional annotation table (*.tbl) files and outputs an ASN.1 file (*.sqn) for submission to GenBank. The *.sqn files are then deposited in the submitter's FTP account at NCBI. Please contact genomes@ncbi.nlm.nih.gov if you need an FTP account.

A sample definition line is

Annotation can be included by creating a 5-column table in a .tbl file for each .fsa file. Go to the appropriate page for information about the format of the table and the desired annotation for eukaryotic or bacterial genomes. Three required fields are
Note that the nucleotide SeqID appears in the DEFINITION line in the flatfile view of the record. Although the protein SeqIDs are not displayed in the final flatfile view, they are present in the ASN.1. See example *.fsa and *.tbl files for various situations, such as partial CDS or features on the minus strand.

Use tbl2asn to include the Phrap/Consed quality scores of a sequence. The scores must be in files named *.qvl that are in the same directory and have the same nucleotide SeqIDs as the corresponding *.fsa files.

If there is a published reference, it can be sent separately with the submission, to be added to the records by GenBank staff during WGS processing.

tbl2asn does not include a release date in the output file, so include that information in your email message to us when you submit.

Using tbl2asn

Some tbl2asn options that are relevant to WGS submissions are:

-t  Template file [File In] [required]
-p  Path for table and sequence files ('-p .' is the current directory)
-v  Validate [T/F] Optional default=F
-j  Allows the addition of source qualifiers that will be the same for each submission. Example: -j "[organism=Saccharomyces cerevisiae] [strain=S288C]"
-b  Generate GenBank file [T/F] Optional default=F
-s  Read FASTAs as set [T/F] Optional default=F
-i  Only this file [File In] Optional

To create WGS submissions from multiple .fsa files, run tbl2asn with the command:

Note that you must also specify the path as part of the template file's name if the template file is in a different directory than where you are running tbl2asn.

In a directory specified by '-p', the program looks for pairs of .fsa and .tbl files with the same file name prefix, for example file.fsa and file.tbl, and it builds ASN.1 records for these pairs. The ASN.1 record will be called file.sqn. The results of the validation (-v; error checking) will be called file.val. Most validation errors must be fixed before the .sqn files can be submitted to GenBank; however, taxonomy-related errors and "No publications anywhere" errors can generally be ignored.

If you wish to have GenBank flatfiles generated also, use the -b argument when you run tbl2asn, and a .gbf file will be generated for each .fsa file in the directory.

If you want to create a submission from only one particular .fsa file that is in a directory that contains multiple .fsa files, use the -i argument to indicate which file is to be read.

Go to the tbl2asn page for more detailed information about tbl2asn, its command line arguments, and file formats.
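As an illustration only, the options listed in this section can be assembled into a tbl2asn command line with a small Python helper. This is a sketch under assumptions: the helper name build_tbl2asn_command and the template file name template.sbt are hypothetical, the T/F value syntax is taken from the option list in this section and may differ in other tbl2asn releases, and whether -i should be combined with -p is not stated here, so that choice is left to the caller.

```python
import shlex


def build_tbl2asn_command(template, path=".", validate=True, genbank=False,
                          fasta_set=False, only_file=None,
                          source_qualifiers=None):
    """Return a tbl2asn argument list using only the options described above.

    Hypothetical helper: it does not run tbl2asn, it only builds the arguments.
    """
    cmd = ["tbl2asn", "-t", template]
    if path is not None:
        cmd += ["-p", path]               # directory holding .fsa/.tbl pairs
    if validate:
        cmd += ["-v", "T"]                # writes a .val file per .sqn file
    if genbank:
        cmd += ["-b", "T"]                # also writes a .gbf flatfile
    if fasta_set:
        cmd += ["-s", "T"]                # several sequences per .fsa file
    if only_file:
        cmd += ["-i", only_file]          # restrict the run to one .fsa file
    if source_qualifiers:
        cmd += ["-j", source_qualifiers]  # same source qualifiers for all records
    return cmd


# Example: validate every .fsa/.tbl pair in the current directory and
# request GenBank flatfiles as well ("template.sbt" is a placeholder name).
cmd = build_tbl2asn_command(
    "template.sbt",
    genbank=True,
    source_qualifiers="[organism=Saccharomyces cerevisiae] [strain=S288C]",
)
print(shlex.join(cmd))
```

The list returned by the helper can be passed to subprocess.run once tbl2asn is installed; printing it first makes it easy to check the arguments against the tbl2asn page before running anything.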
You can put multiple FASTA sequences (usually fewer than 10,000 sequences) into a single .fsa file if you wish to have
fewer files. The corresponding .tbl file must have the annotation information
for all of the sequences in the .fsa file. Run tbl2asn with the -s argument so
that each definition line is recognized as the beginning of a new sequence. A
single .sqn file will then be generated for the multiple sequences of each .fsa file. See example .fsa and .tbl files.
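If a set of contigs is larger than the suggested size, it can be divided into several .fsa files before running tbl2asn. The following is a minimal sketch, assuming plain FASTA text in which every record starts with a ">" definition line; the function name split_fasta is hypothetical, and the 10,000 limit is treated as an adjustable parameter because the text above gives it only as a usual upper bound.

```python
def split_fasta(text, max_records=10000):
    """Yield FASTA text chunks holding at most max_records records each."""
    records, current = [], []
    for line in text.splitlines():
        # A definition line closes the record collected so far.
        if line.startswith(">") and current:
            records.append("\n".join(current))
            current = []
        if line.strip():                  # skip blank lines
            current.append(line)
    if current:
        records.append("\n".join(current))
    for start in range(0, len(records), max_records):
        yield "\n".join(records[start:start + max_records]) + "\n"


# Tiny example with a limit of two records per chunk.
example = ">c1\nACGT\n>c2\nGG\nTT\n>c3\nA\n"
chunks = list(split_fasta(example, max_records=2))
print(len(chunks))
```

Each chunk would be written to its own .fsa file. Any annotation in a .tbl file has to be divided the same way, so that every .tbl file covers exactly the sequences in the .fsa file with the same prefix.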
If there is assembly information describing how the contigs are assembled into scaffolds
(supercontigs) or chromosomes, then submit an AGP file with that information. AGP files provide the ordering and
orientation information needed to construct supercontigs or scaffolds from contigs, or to
construct chromosomes from supercontigs and/or contigs. More information about
genome assemblies and about the AGP format is available on their respective pages.
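To make the idea of ordering and orientation concrete, here is a rough sketch that writes AGP-style lines for one scaffold. It assumes the nine-column, tab-delimited layout of AGP version 2.x (object, start, end, part number, component type, then either component ID, component start, component end and orientation, or gap length, gap type, linkage and linkage evidence). The AGP format page is the authoritative description, and earlier AGP versions differ in the gap columns. The function name, the fixed gap length, and the gap values used as defaults are all assumptions chosen for illustration.

```python
def scaffold_to_agp(scaffold_id, contigs, gap_length=100,
                    gap_type="scaffold", linkage="yes",
                    evidence="paired-ends"):
    """Build AGP-style text for one scaffold.

    contigs is a list of (contig_id, length, orientation) tuples given in
    scaffold order; a gap of gap_length is placed between neighbours.
    """
    lines, pos, part = [], 1, 1
    for index, (contig_id, length, orientation) in enumerate(contigs):
        if index > 0:
            # Gap line ("N" = gap of specified size) between two contigs.
            end = pos + gap_length - 1
            fields = [scaffold_id, pos, end, part, "N",
                      gap_length, gap_type, linkage, evidence]
            lines.append("\t".join(str(f) for f in fields))
            pos, part = end + 1, part + 1
        # Component line ("W" = WGS contig), using the whole contig.
        end = pos + length - 1
        fields = [scaffold_id, pos, end, part, "W",
                  contig_id, 1, length, orientation]
        lines.append("\t".join(str(f) for f in fields))
        pos, part = end + 1, part + 1
    return "\n".join(lines) + "\n"


# Two hypothetical contigs, the second one reverse-complemented.
agp = scaffold_to_agp("scaf1", [("c1", 500, "+"), ("c2", 300, "-")])
print(agp, end="")
```

In a real submission the component IDs would be the contig SeqIDs or accession numbers used in the .fsa files, and the gap values would reflect what is actually known about the assembly.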
Some specific requests are:
If the update is a completely new version of the WGS project, then the nucleotide SeqIDs must be distinct from those of all previous
versions. A small change, such as adding "_2" to the end of the original SeqID, is sufficient to make the
new set unique. Another strategy is to include the version in the contig names, e.g., Cont02_xxxxx.
If the same version is being updated, then the SeqIDs must be identical to those in the existing records, and the accession numbers must be included
in the update, for both nucleotides and proteins. The correct format of a nucleotide identifier in such an update is:
gnl|WGS:XXXX|SeqID|gb|XXXX01xxxxxx
where XXXX is the project_ID and XXXX01xxxxxx is the contig's accession number. We recommend
that you contact NCBI before generating a complicated update.
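The accession layout and the update identifier described above can be sketched in a few lines of Python. This is an illustration under assumptions: the regular expression covers only the layout used in this document (a 4-letter project_ID, a 2-digit version, and a 6-digit contig number, as in XXXX01000001), and the project_ID ABCD, the SeqID Cont01_00001, and both function names are hypothetical.

```python
import re

# 4-letter project_ID + 2-digit version + 6-digit contig number.
ACCESSION_RE = re.compile(r"^([A-Z]{4})(\d{2})(\d{6})$")


def parse_wgs_accession(accession):
    """Split an accession such as ABCD01000001 into its three parts."""
    match = ACCESSION_RE.match(accession)
    if match is None:
        raise ValueError(f"not a WGS accession in the expected layout: {accession}")
    project_id, version, contig = match.groups()
    return project_id, int(version), int(contig)


def update_seqid(project_id, seq_id, accession):
    """Build the nucleotide identifier used when updating an existing version."""
    accession_project, _, _ = parse_wgs_accession(accession)
    if accession_project != project_id:
        raise ValueError("accession does not belong to this project_ID")
    return f"gnl|WGS:{project_id}|{seq_id}|gb|{accession}"


print(parse_wgs_accession("ABCD01000001"))
print(update_seqid("ABCD", "Cont01_00001", "ABCD01000001"))
```

Checking that the accession prefix matches the project_ID catches the simple mistake of pairing a SeqID with an accession from a different project before the update is sent.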
If you need additional assistance in preparing WGS submissions, please contact genomes@ncbi.nlm.nih.gov.

Revised December 14, 2007