NCBI dbGSS

The GSS division of GenBank is similar in nature to the EST division, except that its sequences are genomic in origin, rather than cDNA (mRNA). The GSS division contains (but is not limited to) the following types of data:

random "single pass read" genome survey sequences
single pass reads from cosmid/BAC/YAC ends (these could be chromosome specific, but need not be)
exon trapped genomic sequences
Alu PCR sequences

Section 1.3.3 of the GenBank 96.0 release notes provides additional information about the GSS division.

GSSs by nature are usually submitted to GenBank and dbGSS as batches of dozens to thousands of entries, with a great deal of redundancy in the citation, submitter, and library information. To improve the efficiency of the submission process for this type of data, we have designed a separate streamlined submission process and data format.

Data submission file types

There are two parts to the submission instructions, one for the sequence data, and one for any mapping data.

The batch submission process for GSS sequence data involves the completion of four file types:

1. Publication
2. Library
3. Contact
4. GSS sequence file

The format for each file is described below.

If all the GSSs share the same Publication, Library, and Contact information, you only need to prepare one of each of those files. Then complete a separate GSS file (file type 4) for each sequence.

If any of the GSS files have different Publication, Library, or Contact information, you must complete a new set of file types 1-3.

Once we have entered particular Publication, Library, or Contact information into the database, you do not need to resend the data input files.

The batch submission process for GSS map data involves the completion of the following four file types:

1. Publication
2. Contact
3. Method
4. Map data

The publication and contact files use the same file format as the publication and contact files described under the submission format for GSS sequence data. If the map data share the same Publication and Contact files as the sequence data, there is no need to resubmit the Publication and Contact files. Rather, the CITATION and contact name (CONT_NAME) fields of the Map Data files will serve as a cross reference to the appropriate Publication and Contact files.

Mailing files to dbGSS

Send the completed files to: batch-sub@ncbi.nlm.nih.gov

You can attach all the files to a single email message, or you can include them in the body of the email message. Please be sure that they are in plain text (ASCII) format.

We prefer to have the individual GSS and Map data files batched together as much as possible: for example, all GSS entries in one file and all Map entries in another file.

You can submit Publication, Library, and Contact data together in one file. You can also send them in the same file as the GSS entries - the TYPE field will differentiate them for the parsing software.

Assignment of GenBank Accession numbers and release of data

You will receive a list of dbGSS IDs and GenBank Accession numbers from a dbGSS curator via email.

If you would like your sequences held confidential until publication, you can indicate that by putting the release date in the PUBLIC field of the GSS files. Your sequences will be released on that date, or when the Accession numbers or sequence data are published, whichever comes first.
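
As an illustration, the PUBLIC line could be generated as follows. This is a minimal sketch in Python, not part of the NCBI software; the function name public_field is an assumption made for this example, and the MM/DD/YYYY format is the one given in the GSS file description.

```python
from datetime import date

def public_field(release=None):
    """Return the PUBLIC line of a GSS entry.

    release=None means immediate release (blank data field);
    otherwise the date is written as MM/DD/YYYY.
    """
    if release is None:
        return "PUBLIC:"
    return "PUBLIC: " + release.strftime("%m/%d/%Y")

# Hold the entry until 5 January 2004 (illustrative date only).
print(public_field(date(2004, 1, 5)))
# Release immediately.
print(public_field())
```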

Once your sequences are released into the public database, they will be available from the GSS division of GenBank (accessible through the Entrez Nucleotide division).

Updating your dbGSS data

Updates to GSS entries are submitted in the same way as new entries. Changes to any item in the GSS input file (other than GSS# or CONT_NAME) are made by completing an input file with new data in the fields that need to be changed. For the STATUS field, enter "Update" instead of "New".

In addition to the fields to be changed, Updates need to include the TYPE, STATUS, GSS#, and CONT_NAME fields.

For changes in Publication, Contact, or Source data, or for changes in GSS#'s or CONT_NAME, send an email message describing the change that is needed.
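
The update rules above can be sketched as a small helper that writes an update entry containing only the required fields plus the fields being changed. This is an illustrative Python sketch under stated assumptions: the name build_update and the MULTILINE set are hypothetical, and the multi-line handling follows the GSS file format described later in this document.

```python
# Fields of the GSS file whose data start on the line below the tag.
MULTILINE = {"CITATION", "COMMENT", "SEQUENCE"}

def build_update(gss_id, cont_name, changes):
    """Return the text of one GSS update entry.

    gss_id and cont_name identify the existing entry; changes maps
    field tags to their new values.
    """
    lines = [
        "TYPE: GSS",
        "STATUS: Update",
        f"CONT_NAME: {cont_name}",
        f"GSS#: {gss_id}",
    ]
    for tag, value in changes.items():
        if tag in ("GSS#", "CONT_NAME"):
            # These changes are requested by email, not by update file.
            raise ValueError(f"{tag} cannot be changed through an update file")
        if tag in MULTILINE:
            lines += [f"{tag}:", value]
        else:
            lines.append(f"{tag}: {value}")
    lines.append("||")  # end-of-record marker
    return "\n".join(lines)

print(build_update("Ayh00001", "Sikela JM", {"PUT_ID": "Actin, gamma, skeletal"}))
```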

Send the update files to: batch-sub@ncbi.nlm.nih.gov

Questions and Comments

If you have questions about the GSS submission format, please contact info@ncbi.nlm.nih.gov

Submission Format for GSS Sequence Data

The following is a specification for flat file formats for delivering GSS and related data to the NCBI GSS database.

The format consists of colon-delineated, capitalized field tags, followed by data.
The data fields should appear on the same line as the tag, with no line wrapping. Exceptions to this are the TITLE and AUTHORS fields of the Publication file; Description (DESCR:) field of the Library file; and the CITATION, COMMENT, and SEQUENCE fields of the GSS file. In these fields, the data text begins on the line following the field tag, and the lines can be wrapped.
Note that some fields are obligatory.
The TYPE field is obligatory at the beginning of each entry, even if there are multiple entries of a given type in a file.
If data are not available for a non-obligatory field, the field can either be omitted entirely, or the tag may be included with an empty data field. Please do not put "*", "-", etc. to indicate missing data.
Each record (including the last record in the file) should end with a double-bar tag (||) to indicate the end of the record.
Please also see the bulleted tips at the end of each file format for important notes about that file type.
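
To make the rules above concrete, the following is a minimal sketch of a reader for this format. It is written in Python for illustration only and is not the NCBI parsing software; the names TAG_RE and parse_records are assumptions made for this example. It treats any line that begins with capitals, digits, "_" or "#" followed by a colon as a tag line, which is a heuristic: a wrapped text line that happens to start that way would be misread as a new field.

```python
import re

# Heuristic tag pattern: e.g. "TYPE:", "GSS#:", "CONT_NAME:", "L1:".
TAG_RE = re.compile(r"^([A-Z][A-Z0-9_#]*):\s*(.*)$")

def parse_records(text):
    """Split flat-file text into a list of {tag: value} dictionaries."""
    records, current, tag = [], {}, None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("||"):          # end-of-record marker
            if current:
                records.append(current)
            current, tag = {}, None
            continue
        if not line:
            continue
        match = TAG_RE.match(line)
        if match:
            tag, value = match.groups()
            current[tag] = value           # may be empty for multi-line fields
        elif tag is not None:
            # Continuation of a wrapped field (TITLE, DESCR, SEQUENCE, ...).
            sep = "" if tag == "SEQUENCE" else " "
            current[tag] = (current[tag] + sep + line).strip()
    return records

example = """TYPE: Cont
NAME: Sikela JM
INST: University of Colorado Health Sciences Center
||
"""
for record in parse_records(example):
    print(record["TYPE"], record["NAME"])
```

Because the TYPE field opens every entry, the same function can read a file in which Publication, Library, Contact, and GSS entries are mixed.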

File Types

There are four types of deliverable files:
1. Publication
2. Library
3. Contact
4. GSS sequence file

Each GSS file needs to reference the Publication, Library, and Contact data. Therefore the Publication, Library, and Contact files must be in the database when the GSS file is entered. Once these files have been submitted and entered, they do not need to be re-submitted for additional GSS files that have the same Publication, Library, or Contact.

1. Publication Files

The following is an example of the valid tags and some illustrative data:


TYPE:    Entry type - must be "Pub" for publication entries. 
         **Obligatory field**.
MEDUID:  Medline unique identifier. 
	 Not obligatory, include if you know it.
TITLE:   Title of article. 
         **Obligatory field**.
	 Begin on line below tag, use multiple lines if needed
AUTHORS: Author name, format:  Name,I.I.; Name2,I.I.; Name3,I.I.
         **Obligatory field**.
	 Begin on line below tag, use multiple lines if needed
JOURNAL: Journal name
VOLUME:  Volume number
SUPPL:   Supplement number
ISSUE:   Issue number
I_SUPPL: Issue supplement number
PAGES:   Page, format:   123-9
YEAR:    Year of publication.
         **Obligatory field**.
STATUS:  Status field. 1=unpublished, 2=submitted, 3=in press, 
	 4=published
         **Obligatory field**.
||

Examples:

TYPE: Pub
MEDUID: 92347897
TITLE: 
Genomic sequences from a subtracted retinal pigment epithelium 
library
AUTHORS: 
Gieser,L.; Swaroop,A.
JOURNAL: Genomics
VOLUME: 13
ISSUE: 2
PAGES: 873-6
YEAR:  1992
STATUS: 4
||

The TYPE field is obligatory at the beginning of each entry, even if there are multiple entries of a given type in a file.
The MEDUID field is a MEDLINE record unique identifier. We do not normally expect you to supply this; we try to retrieve it from our relational version of the MEDLINE database.
The STATUS field is 1=unpublished, 2=submitted, 3=in press, 4=published
The TITLE field is a free format string. The only requirement is that you put an identical string in the CITATION field of the GSS files (or Map Data files, as appropriate), because we will be matching that field automatically against the publications in the publication table and replacing the string with the publication identity number in the GSS table.

2. Library Files

The following is an example of the valid tags and some illustrative data:

TYPE:      Entry type - must be "Lib" for library entries. 
           **Obligatory field**.
NAME:      Name of library. 
           **Obligatory field**.
ORGANISM:  Organism from which library prepared.
STRAIN:    Organism strain
CULTIVAR:  Plant cultivar
ISOLATE:   Individual isolate from which the sequence was obtained
SEX:       Sex of organism (female, male, hermaphrodite)
ORGAN:     Organ name 
TISSUE:    Tissue type
CELL_TYPE: Cell type
CELL_LINE: Name of cell line
STAGE:     Developmental stage
HOST:      Laboratory host
VECTOR:    Name of vector
V_TYPE:    Type of vector (Cosmid, Phage, Plasmid, YAC, other)
RE_1:      Restriction enzyme at site1 of vector
RE_2:      Restriction enzyme at site2 of vector
DESCR:     Description of library preparation methods, 
	   vector, etc. 
           This field starts on the line below the DESCR: tag.
||

Examples:

TYPE: Lib
NAME:  Rat Lambda Zap Express Library
ORGANISM: Rattus norvegicus
STRAIN: Sprague-Dawley
SEX: male
STAGE: embryonic day 17 post-fertilization
TISSUE: aorta
CELL_TYPE: vascular smooth muscle
DESCR: 
Put description here.
||

The TYPE field is obligatory at the beginning of each entry, even if there are multiple entries of a given type in a file.
Try to keep the library NAME field to <= 48 characters. We can accept up to 255 characters, but it will be truncated to 48 characters in the identification line of the FASTA file created for BLAST searching.
When you enter the library NAME in the Library Files, please note that the identical string must be used in the LIBRARY field of the GSS files.
The DESCR field should contain as much detail about the library as seems appropriate.
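
The length limits on the library NAME can be checked before submission. The following Python sketch is illustrative only; the function name check_library_name is an assumption, and the limits (48 and 255 characters) are the ones stated above.

```python
def check_library_name(name, fasta_limit=48, max_length=255):
    """Return (ok, message) for a proposed library NAME value."""
    if not name.strip():
        return False, "NAME is obligatory"
    if len(name) > max_length:
        return False, f"NAME has {len(name)} characters; at most {max_length} are accepted"
    if len(name) > fasta_limit:
        shown = name[:fasta_limit]
        return True, f"NAME will appear truncated as {shown!r} in the FASTA identification line"
    return True, "ok"

print(check_library_name("Rat Lambda Zap Express Library"))
```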

3. Contact Files

The following is an example of the valid tags and some illustrative data:

TYPE:   Entry type - must be "Cont" for contact entries. 
        **Obligatory field**.
NAME:   Name of person providing the GSS sequence 
        **Obligatory field**.
FAX:    Fax number as string of digits.
TEL:    Telephone number as string of digits.
EMAIL:  E-mail address
LAB:    Laboratory
INST:   Institution name
ADDR:   Address string
||

Examples:

TYPE: Cont
NAME: Sikela JM
FAX: 303 270 7097
TEL: 303 270 
EMAIL: tjs@tally.hsc.colorado.edu
LAB: Department of Pharmacology
INST: University of Colorado Health Sciences Center
ADDR: Box C236, 4200 E. 9th Ave., Denver, CO 80262-0236, USA
||

The TYPE field is obligatory at the beginning of each entry, even if there are multiple entries of a given type in a file.
Apart from TYPE and NAME, none of the fields are obligatory; at a minimum we require the name of a contact person.
Please fill in as many of the fields as possible, so that users have complete information for contacting the source of the GSS or requesting further information about it.
The contact name field (CONT_NAME) in the GSS files must contain an identical string to the string used in the NAME field of the Contact file, for automatic matching.

4. GSS Files

The following is an example of the valid tags and some illustrative data:

TYPE:          Entry type - must be "GSS" for GSS entries. 
               **Obligatory field**
STATUS:        Status of GSS entry - "New" or "Update". 
               **Obligatory field**
CONT_NAME:     Name of contact 
               Must be identical string to the contact entry
	       **Obligatory field**
CITATION:      Journal citation 
	       Must be identical string to the publication title
               Begins on line below tag.
               Use continuation lines if needed.
	       **Obligatory field**
LIBRARY:       Library name
               Must be identical string to library name entry.
	       **Obligatory field**
GSS#:          GSS name or number assigned by contact lab. For GSS entry 
               updates, this is the string we match on.
	       **Obligatory field**
GDB#:          Genome Database accession number
GDB_DSEG:      Genome Database Dsegment number
CLONE:         Clone number/name
SOURCE:        Source providing clone, e.g., ATCC
SOURCE_DNA:    Source identity number for the clone as pure DNA
SOURCE_INHOST: Source identity number for the clone stored in the host
OTHER_GSS:     Other GSSs on this clone.
DBNAME:        Database name for cross-reference to another 
	       database
DBXREF:        Database cross-reference accession
PCR_F:         Forward PCR primer sequence
PCR_B:         Backward PCR primer sequence
INSERT:        Insert length (in bases)
ERROR:         Estimated error in insert length (bases)
PLATE:         Plate number or code
ROW:           Row number or letter
COLUMN:        Column number or letter
SEQ_PRIMER:    Sequencing primer description or sequence
P_END:         Which end sequenced, e.g., 5'
HIQUAL_START:  Base position of start of high-quality sequence 
               (default = 1)
HIQUAL_STOP:   Base position of last base of high-quality 
	       sequence
DNA_TYPE:      Genomic (default), cDNA, Viral, Synthetic, Other
CLASS:         Class of sequencing method, e.g., BAC ends, 
	       YAC ends, exon-trapped
	       **Obligatory field**
PUBLIC:        Date of public release
	       Leave blank for immediate release. 
	       **Obligatory field**
               Format:   MM/DD/YYYY
PUT_ID:        Putative identification of sequence by submitter
COMMENT:       Comments about GSS. 
               Text starts on line below COMMENT: tag.
SEQUENCE:      Sequence string. 
               Text starts on line below SEQUENCE: tag. 
	       **Obligatory field**
||

Examples:

TYPE: GSS
STATUS:  New
CONT_NAME: Sikela JM
GSS#: Ayh00001
CLONE: HHC189
SOURCE: ATCC
SOURCE_INHOST: 65128
OTHER_GSS:  GSS00093, GSS000101
CITATION: 
Genomic sequences from Human 
brain tissue
SEQ_PRIMER: M13 Forward
P_END: 5'
HIQUAL_START: 1
HIQUAL_STOP: 285
DNA_TYPE: Genomic
CLASS: shotgun
LIBRARY: Hippocampus, Stratagene (cat. #936205)
PUBLIC: 
PUT_ID: Actin, gamma, skeletal
COMMENT:
This is a comment about the sequence. It may contain features.
It may span several lines.
SEQUENCE:
AATCAGCCTGCAAGCAAAAGATAGGAATATTCACCTACAGTGGGCACCTCCTTAAGAAGCTG
ATAGCTTGTTACACAGTAATTAGATTGAAGATAATGGACACGAAACATATTCCGGGATTAAA
CATTCTTGTCAAGAAAGGGGGAGAGAAGTCTGTTGTGCAAGTTTCAAAGAAAAAGGGTACCA
GCAAAAGTGATAATGATTTGAGGATTTCTGTCTCTAATTGGAGGATGATTCTCATGTAAGGT
TGTTAGGAAATGGCAAAGTATTGATGATTGTGTGCTATGTGATTGGTGCTAGATACTTTAAC
TGAGTATACGAGTGAAATACTTGAGACTCGTGTCACTT
||

The TYPE field is obligatory at the beginning of each entry, even if there are multiple entries of a given type in a file.
Valid data values for the GSS STATUS field are New (new entry) or Update (change existing GSS entry).
When updating a GSS, only the fields present in the GSS file will be changed.
Please try to stick to standard map location formats so that we will be able to write functions to parse them in the future.
The DNA_TYPE is assumed to be Genomic, so this field may be omitted unless the DNA type differs from this.
Sequences start on line below the SEQUENCE field tag and should be 60 bases per line with no blank spaces.
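
As a sketch of the sequence layout rule above, the following Python helper writes the SEQUENCE field with the tag on its own line and 60 bases on each following line. It is an illustration only; the function name format_sequence is an assumption made for this example.

```python
def format_sequence(seq, width=60):
    """Return the SEQUENCE field text: tag line, then `width` bases per line."""
    bases = "".join(seq.split())   # remove blanks and existing line breaks
    lines = [bases[i:i + width] for i in range(0, len(bases), width)]
    return "SEQUENCE:\n" + "\n".join(lines)

print(format_sequence("AATCAGCCTGCAAGCAAAAGATAGGAATATTCACCTACAGTGGGCACCTCCTTAAGAAGCTG"
                      "ATAGCTTGTTACACAGTAATTAGATTGAAGATAATGGACACG"))
```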

The CLASS field is a controlled vocabulary field. Currently (October 2007) accepted values are:

AFLP fragment
Alu-PCR
B1-PCR
BAC ends
BAC sequence gap
BAC subclone
BAC subclone end
BAC/YAC ends
CAPS
Concatamer T-DNA junction
cosmid ends
cosmid sequence
CoT 5E-3 hydroxyapatite-fractioned DNA
DArT clone
deletion endpoint
Ds tagged
Ds/TDNA launch pad
EcoRI fragments
enhancer trap
ERIC-PCR
exon-trapped
fosmid ends
Gene Trap
Genomic PCR
High-Cot
HindIII fragments
HpaII fragments
HpaII/MspI fragment
Hydroxyapatite-fractionated DNA
internal BAC sequence
Intron Spanning
ISSR
Low-Cot
MboI fragments
methylation filtered
microarray
microsatellite
MuTAIL-PCR
NdeI/DraI fragments
NotI site
P1 ends
PAC end
PAC nested deletions
PAC subclone
paralogous sequence variant
partial digestion
PCR fragment
PCR from cDNA
PCR product
PCR product with degenerate primers
PCR with nonspecific primers
PCR with specific primers
PCR-based subtractive hybridization
plasmid
plasmid ends
plasmid insert
plasmid insertion site
primer walking
PSTI fragment
Random amplified microsatellites
random plasmid subclone
Random sheared small inserts
RAPD
REP-PCR
repeat-enriched
representational difference analysis
RFLP clone
RFLP probe
RLGS
SCAR
sheared ends
shotgun
SRAP
SSR-containing BAC subclone
SSR-containing genome clone
Subtraction library
subtractive hybridization
TAC ends
TAIL-PCR
Targeting vectors
TDNA tagged
Telomere Associated Sequences
transposon insertion site
transposon-tagged
U3NeoSV1-trapped
U3NeoSV2-trapped
viral insertion site
viral tagged
virtual transcript
YAC ends

It is important that the strings in the following fields be completely identical:

CONT_NAME of GSS file and NAME field of the Contact file

LIBRARY field of GSS file and NAME field of the Library file

CITATION field of GSS file and TITLE field of the Publication file
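
These three matches can be verified before the files are sent. The following Python sketch is one possible check, not the matching procedure used by NCBI; the function name check_cross_references is an assumption, and each entry is assumed to be available as a dictionary of tag/value pairs.

```python
def check_cross_references(gss, contacts, libraries, publications):
    """Return messages for GSS fields that match no Contact, Library, or Pub entry.

    The comparison is an exact string comparison, so differences in case,
    punctuation, or spacing are reported as mismatches.
    """
    checks = [
        ("CONT_NAME", "Contact NAME", {c["NAME"] for c in contacts}),
        ("LIBRARY", "Library NAME", {l["NAME"] for l in libraries}),
        ("CITATION", "Publication TITLE", {p["TITLE"] for p in publications}),
    ]
    problems = []
    for field, target, known in checks:
        value = gss.get(field, "")
        if value not in known:
            problems.append(f"{field} {value!r} does not match any {target}")
    return problems

contacts = [{"NAME": "Sikela JM"}]
libraries = [{"NAME": "Rat Lambda Zap Express Library"}]
publications = [{"TITLE": "Genomic sequences from Human brain tissue"}]
gss = {
    "CONT_NAME": "Sikela J.M.",    # differs from the Contact NAME
    "LIBRARY": "Rat Lambda Zap Express Library",
    "CITATION": "Genomic sequences from Human brain tissue",
}
for problem in check_cross_references(gss, contacts, libraries, publications):
    print(problem)
```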

Submission Format for GSS Map Data

The following is a specification for flat file formats for delivering GSS mapping and related data to the NCBI GSS database.

The format consists of colon-delineated, capitalized field tags, followed by data.
Each record (including the last record in the file) should end with a double-bar tag (||) to indicate the end of the record.
Each Map Data file needs to reference the Publication, Method, and Contact data. Therefore, the Publication, Method, and Contact files must be in the database when the Map Data file is entered. Once these files have been submitted and entered, they do not need to be re-submitted for additional Map files that have the same Publication, Method, or Contact.
If the map data share the same Publication and Contact files as the sequence data, there is no need to resubmit the Publication and Contact files. Rather, the CITATION and contact name (CONT_NAME) fields of the Map Data files will serve as a cross-reference to the appropriate Publication and Contact files.

File Types

There are four types of deliverable files:
1. Publication
2. Contact
3. Method
4. Map data

1. Publication Files

Use the same Publication file format as shown in Submission Format for GSS Sequence Data.

2. Contact Files

Use the same Contact file format as shown in Submission Format for GSS Sequence Data.

3. Method Files

The following is an example of the valid tags and some illustrative data:

TYPE:     Entry type - must be "Meth" for method entries
          **Obligatory field**.
NAME:     Name of method
          **Obligatory field**.
ORGANISM: Organism from which library prepared
          **Obligatory field**.
ABSOLUTE: Does the method give an absolute (Y) or relative (N) address?
          **Obligatory field**.
L1:       Interpretation of line 1
L2:       Interpretation of line 2
L3:       Interpretation of line 3
L4:       Interpretation of line 4
L5:       Interpretation of line 5
L6:       Interpretation of line 6
L7:       Interpretation of line 7
L8:       Interpretation of line 8
L9:       Interpretation of line 9
L10:      Interpretation of line 10
DESCR:    Description of method. 
          Description starts on line after DESCR tag. 
          May be multi-line, free format text.
||        Entry separator

Examples:

TYPE: Meth
NAME:  YAC/CEPH JMS
ORGANISM: Homo sapiens
ABSOLUTE: n
L1: plate
L2: row
L3: column
L4: comment
L5: comment
L6: comment
L7: comment
DESCR:
PCR-based mapping of 3'UT-derived primers to CEPH YAC 
DNA pools.  Primers are chosen using the PRIMER program 
by Lincoln et al., ver 0.5 (1991).
To date, MIT puts out YAC pools A and B; if both pools
were used for the mapping data given, then 'C' is designated.
||

TYPE: Meth
NAME:  Radiation Hybrid JMS
ORGANISM: Homo sapiens
ABSOLUTE: y
L1: chromosome
L2: bin
L3: comment
L4: comment
L5: comment
DESCR:
Radiation hybrid panels with binning.
Primers are chosen using the PRIMER program by Lincoln et al., 
ver 0.5 (1991).
||

TYPE: Meth
NAME:  Somatic Hybrid JMS
ORGANISM: Homo sapiens
ABSOLUTE: y
L1: chromosome
L2: arm
L3: band
L4: band range
L5: comment
L6: comment
DESCR:
Somatic cell hybrid mapping.
Primers are chosen using the PRIMER program by Lincoln et al., 
ver 0.5 (1991).
||

The TYPE field is obligatory at the beginning of each entry, even if there are multiple entries of a given type in a file.
Lines 1 to 10 are available for describing interpretation of data in the corresponding Map Data files. There must be an interpretation line for each line of parsed mapping information provided in the Map Data files.
The METHOD field of the Map Data files (below) must be identical to the NAME field of the Method file.
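
The requirement that every parsed mapping line has an interpretation line can be checked mechanically. The following Python sketch is an illustration under stated assumptions: the function name uninterpreted_lines is hypothetical, and the Method and Map Data entries are assumed to be available as dictionaries of tag/value pairs.

```python
import re

LINE_TAG = re.compile(r"^L([1-9]|10)$")   # L1 through L10

def uninterpreted_lines(map_entry, method_entry):
    """Return the L tags that carry data in the map entry but have no
    interpretation in the method entry, in numeric order."""
    missing = []
    for tag, value in map_entry.items():
        if LINE_TAG.match(tag) and value.strip():
            if not method_entry.get(tag, "").strip():
                missing.append(tag)
    return sorted(missing, key=lambda t: int(t[1:]))

method = {"TYPE": "Meth", "NAME": "Radiation Hybrid JMS",
          "L1": "chromosome", "L2": "bin"}
map_data = {"TYPE": "Map", "METHOD": "Radiation Hybrid JMS",
            "L1": "4", "L2": "2", "L3": "Product Length: 119"}
print(uninterpreted_lines(map_data, method))
```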

4. Map Data Files

The following is an example of the valid tags and some illustrative data:

TYPE:       Entry type - must be "Map" for map data entries
            **Obligatory field**
STATUS:     Status of GSS entry - "New", "Replace", or "Update" 
            **Obligatory field**
CONT_NAME:  Name of contact 
            (must be identical string to the contact name.)
METHOD:     Method name
            (Must be identical string to the method entry name.)
CITATION:   Citation title
            (Must be identical string to the publication entry title.)
NCBI#:      NCBI Id of GSS
            (File must have either NCBI#, GSS#, or GB#)
GSS#:       Name of GSS 
            (File must have either NCBI#, GSS#, or GB#)
GB#:        GenBank accession number of GSS
PUBLIC:     blank = for release to public; 
	    date (MM/DD/YYYY) = confidential. 
            **Obligatory field**
MAPSTRING:  Full mapping information. Unparsed. For output only.
            **Obligatory field**
CHROM:      Chromosome name or number
L1:         Line 1 of parsed mapping information.
L2:         Line 2 of parsed mapping information.
L3:         Line 3
L4:         Line 4
L5:         Line 5
L6:         Line 6
L7:         Line 7
L8:         Line 8
L9:         Line 9
L10:        Line 10 of parsed mapping information.
||          Entry separator

Examples:

TYPE: Map
STATUS:  New
CONT_NAME: Sikela JM
METHOD: YAC/CEPH JMS
CITATION: Nature Genetics, 2:180-185 (1992)
NCBI#: 51839
PUBLIC: 
MAPSTRING: 956H08
CHROM: 
L1: 956
L2: H
L3: 08
L4: Pool B
L5: Forward Primer: CCCCAGAGTTCCAAGTTAATT
L6: Reverse Primer: GTCGCATTGCTCAACATTCGTTT
L7: Product Length: 162
||

TYPE: Map
STATUS:  New
CONT_NAME: Sikela JM
METHOD: Radiation Hybrid JMS
CITATION: Nature Genetics, 2:180-185 (1992)
GSS#: GSST001a
PUBLIC: 
MAPSTRING: 4, bin 2
CHROM: 4
L1: 4
L2: 2
L3: Forward Primer: TTDDGTAGAGGGTGCTAAGAAGG
L4: Reverse Primer: GAAATGGACCTATTAAAACCAGCT
L5: Product Length: 119
||

TYPE: Map
STATUS:  New
CONT_NAME: Sikela JM
METHOD: Somatic Hybrid JMS
CITATION: Nature Genetics, 2:180-185 (1992)
GB#: T12813
PUBLIC: 
MAPSTRING: 20
CHROM: 20
L1: 20
L2:
L3:
L4:
L5: Forward Primer: CGTAATGTCCCTGTGTCTGAG
L6: Reverse Primer: CACCTCACCCATAGCCTTAGCTA
||
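
Before sending Map Data files, each entry can be checked for its obligatory tags and for at least one GSS identifier. The following Python sketch is illustrative only; the function name check_map_entry is an assumption, entries are assumed to be dictionaries of tag/value pairs, and PUBLIC is treated as satisfied when the tag is present even with a blank data field (as in the examples above).

```python
OBLIGATORY = ("TYPE", "STATUS", "PUBLIC", "MAPSTRING")
IDENTIFIERS = ("NCBI#", "GSS#", "GB#")

def check_map_entry(entry):
    """Return a list of problems found in one Map Data entry."""
    problems = []
    for tag in OBLIGATORY:
        if tag not in entry:
            problems.append(f"missing obligatory tag {tag}")
    if "TYPE" in entry and entry["TYPE"] != "Map":
        problems.append('TYPE must be "Map"')
    if not any(entry.get(tag, "").strip() for tag in IDENTIFIERS):
        problems.append("entry needs one of NCBI#, GSS#, or GB#")
    return problems

entry = {"TYPE": "Map", "STATUS": "New", "CONT_NAME": "Sikela JM",
         "PUBLIC": "", "MAPSTRING": "20", "CHROM": "20"}
print(check_map_entry(entry))
```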

Revised 06/20/2003.