dbEST: database of "Expressed Sequence Tags"
Expressed Sequence Tags (ESTs) are short (usually about 300-500 bp), single-pass sequence reads from mRNA (cDNA). Typically they are produced in large batches. They represent a snapshot of genes expressed in a given tissue and/or at a given developmental stage. They are tags (some coding, others not) of expression for a given cDNA library. Additional information
about ESTs can be found in:
Most EST projects develop large numbers of sequences. These are commonly submitted to GenBank and dbEST as batches of dozens to thousands of entries, with a great deal of redundancy in the citation, submitter and library information. To improve the efficiency of the submission process for this type of data, we have designed a special streamlined submission process and data format.
dbEST also includes sequences that are longer than the traditional ESTs, or are produced as single sequences or in small batches. Among these sequences are products of differential display experiments and RACE experiments. What these sequences have in common with traditional ESTs, regardless of length, quality, or quantity, is that there is little information that can be annotated in the record. If a sequence is later characterized and annotated with biological features such as a coding region, 5'UTR, or 3'UTR, it should be submitted through the regular GenBank submissions procedure (via BankIt or Sequin), even if part of the sequence is already in dbEST.
dbEST is reserved for single-pass reads. Assembled sequences should not be submitted to dbEST. GenBank will accept assembled EST submissions for the forthcoming TSA (Transcriptome Shotgun Assembly) division. Please contact gb-admin@ncbi.nlm.nih.gov for more information about submitting EST assemblies. The individual reads that make up the assembly should be submitted to dbEST, the Trace archive or the Short Read Archive (SRA) before the assemblies are submitted. For additional information about submitting to Trace or SRA, please see the Trace web site.
The following sequences should not be included in EST submissions: mitochondrial sequences, rRNA, viral sequences, and vector sequences. Vector and linker regions should be removed from EST sequences before submission.
There are two parts to the submission instructions, one for the sequence data, and one for any mapping data.
The batch submission process for EST sequence data involves the completion of four file types:
a. Publication
b. Library
c. Contact
d. EST
The format for each file is described below. If all the ESTs share the same Publication, Library, and Contact information, you only need to prepare one of each of those files. Then complete a separate EST file (file type d) for each sequence. If any of the EST files have different Publication, Library, or Contact information, you must complete a new file of type a, b, or c. Once we have entered particular Publication, Library, or Contact information into the database, you do not need to resend the data input files.
The batch submission process for EST map data involves the completion of the four file types below:
a. Publication
b. Contact
c. Method
d. Map Data
The publication and contact files use the same file format as the publication and contact files described under the submission format for EST sequence data. If the map data share the same Publication and Contact files as the sequence data, there is no need to resubmit the Publication and Contact files. Rather, the CITATION and contact name (CONT_NAME) fields of the Map Data files will serve as a cross reference to the appropriate Publication and Contact files.
Send the completed files to:
batch-sub@ncbi.nlm.nih.gov
You can attach all the files to a single email message, or you can include them in the body of the email message. Please be sure that they are in plain text (ASCII) format. We prefer to have the individual EST and Map data files batched together as much as possible: for example, all EST entries in one file and all Map entries in another file. You can submit library, publication, and contact data together in one file. You can also send them in the same file as the EST entries - the TYPE field will differentiate them for the parsing software.
You will receive a list of dbEST IDs and GenBank accession numbers from a dbEST curator via email. If you would like your sequences held confidential until publication, you can indicate that by putting the release date in the PUBLIC field of the EST files. Your sequences will be released on that date, or when the accession numbers or sequence data are published, whichever comes first. Once your sequences are released into the public database, they will be available from the EST division of GenBank (accessible through the Entrez Nucleotides division), and through the separate but related Database of Expressed Sequence Tags (dbEST). The sequences and accession numbers in both sources are the same, but there is additional annotation in the dbEST records such as references to the top nucleotide and protein matches.
Updates to EST entries are made in essentially the same way as new entries. To change any item in the EST input file (other than EST# or CONT_NAME), complete an input file with new data in the fields that need to be changed. For the STATUS field, enter "Update" instead of "New". In addition to the fields to be changed, updates need to include the TYPE, STATUS, EST#, and CONT_NAME fields. For changes in Publication, Contact, or Source data, or for changes in EST#'s or CONT_NAME, send an email message describing the change that is needed. Send the update files to: batch-sub@ncbi.nlm.nih.gov
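The update rules above can be sketched in a few lines of Python. This is only an illustration, not an NCBI tool: the function name `format_est_update` is hypothetical, and the layout (one tag per line, multi-line values starting on the line below the tag, `||` alone on the last line) is an assumption based on the format descriptions in this document.

```python
# Hypothetical helper: build the text of an EST "Update" entry.
# Assumption: one tag per line, "||" alone on the closing line.

# Tags that identify the entry; per the text above they cannot be
# changed through an update file (changes to them go by email).
FIXED_TAGS = {"TYPE", "STATUS", "EST#", "CONT_NAME"}
# Tags whose value starts on the line below the tag.
NEXT_LINE_TAGS = {"CITATION", "COMMENT", "SEQUENCE"}


def format_est_update(est_id, cont_name, changes):
    """Return an update entry containing only the changed fields."""
    forbidden = FIXED_TAGS & set(changes)
    if forbidden:
        raise ValueError("cannot change via update file: "
                         + ", ".join(sorted(forbidden)))
    lines = [
        "TYPE: EST",
        "STATUS: Update",
        f"CONT_NAME: {cont_name}",
        f"EST#: {est_id}",
    ]
    for tag, value in changes.items():
        if tag in NEXT_LINE_TAGS:
            lines.append(f"{tag}:")
            lines.append(str(value))
        else:
            lines.append(f"{tag}: {value}")
    lines.append("||")
    return "\n".join(lines) + "\n"
```

For example, `format_est_update("HHC189f", "Kerlavage AR", {"PUT_ID": "Actin, gamma"})` would yield an entry carrying the four identifying fields plus the one changed field.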
If you have questions about the EST submission format, please contact
info@ncbi.nlm.nih.gov.
1. Submission Format for EST Sequence Data
The following is a specification for flat file formats for delivering EST and related data to the NCBI EST database.
File Types
There are four types of deliverable files:
a. Publication
b. Library
c. Contact
d. EST
Each EST file needs to reference the Publication, Library, and Contact data. Therefore the Publication, Library, and Contact files must be in the database when the EST file is entered. Once these files have been submitted and entered, they do not need to be re-submitted for additional EST files that have the same Publication, Library, or Contact.
a. Publication Files
These are the valid tags and a short description:
TYPE: Entry type - must be "Pub" for publication entries. **Obligatory field**
MEDUID: Medline unique identifier. Not obligatory, include if you know it.
TITLE: Title of article. (Begin on line below tag, use multiple lines if nec.) **Obligatory field**
AUTHORS: Author name, format: Name,I.I.; Name2,I.I.; Name3,I.I. (Begin on line below tag, use multiple lines if nec.) **Obligatory field**
JOURNAL: Journal name
VOLUME: Volume number
SUPPL: Supplement number
ISSUE: Issue number
I_SUPPL: Issue supplement number
PAGES: Page, format: 123-9
YEAR: Year of publication. **Obligatory field**
STATUS: Publication status. 1=unpublished, 2=submitted, 3=in press, 4=published **Obligatory field**
||
Examples:
TYPE: Pub
MEDUID: 92347897
TITLE:
Expressed sequence tags and chromosomal localization of cDNA clones from a subtracted retinal pigment epithelium library
AUTHORS:
Gieser,L.; Swaroop,A.
JOURNAL: Genomics
VOLUME: 13
ISSUE: 2
PAGES: 873-6
YEAR: 1992
STATUS: 4
||
Pub data template with required and most often used fields:
TYPE: Pub
TITLE:
title
AUTHORS:
authors
JOURNAL:
VOLUME:
ISSUE:
PAGES:
YEAR:
STATUS:
||
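As a rough illustration of the Publication format, the sketch below serializes a dictionary of fields into one entry and checks the obligatory tags first. The function name `format_pub_entry` is hypothetical, and the layout (one tag per line, TITLE and AUTHORS values on the line below the tag, `||` alone on the last line) is an assumption drawn from the tag descriptions above.

```python
# Hypothetical helper: serialize one Publication entry.
# Assumption: one tag per line, "||" alone on the closing line.

REQUIRED_PUB_TAGS = ("TYPE", "TITLE", "AUTHORS", "YEAR", "STATUS")
PUB_TAG_ORDER = ("TYPE", "MEDUID", "TITLE", "AUTHORS", "JOURNAL", "VOLUME",
                 "SUPPL", "ISSUE", "I_SUPPL", "PAGES", "YEAR", "STATUS")
# Tags whose value begins on the line below the tag.
NEXT_LINE_TAGS = ("TITLE", "AUTHORS")


def format_pub_entry(fields):
    """Return the flat-file text for one publication entry."""
    missing = [t for t in REQUIRED_PUB_TAGS if not fields.get(t)]
    if missing:
        raise ValueError("missing obligatory fields: " + ", ".join(missing))
    if fields["TYPE"] != "Pub":
        raise ValueError('TYPE must be "Pub" for publication entries')
    if str(fields["STATUS"]) not in {"1", "2", "3", "4"}:
        raise ValueError("STATUS must be 1, 2, 3 or 4")
    lines = []
    for tag in PUB_TAG_ORDER:
        if tag not in fields:
            continue  # optional tag not supplied
        value = str(fields[tag])
        if tag in NEXT_LINE_TAGS:
            lines.append(f"{tag}:")
            lines.append(value)
        else:
            lines.append(f"{tag}: {value}")
    lines.append("||")
    return "\n".join(lines) + "\n"
```

Checking the obligatory fields before writing the file is a design choice for this sketch; it catches a missing TITLE or YEAR locally rather than after the batch has been mailed.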
b. Library Files
These are the valid tags and a short description:
TYPE: Entry type - must be "Lib" for library entries. **Obligatory field**
NAME: Name of library. **Obligatory field**
ORGANISM: Organism from which library prepared. Scientific name. **Obligatory field**
STRAIN: Organism strain
CULTIVAR: Plant cultivar
ISOLATE: Individual isolate from which the sequence was obtained
SEX: Sex of organism (female, male, hermaphrodite)
ORGAN: Organ name
TISSUE: Tissue type
CELL_TYPE: Cell type
CELL_LINE: Name of cell line
STAGE: Developmental stage
HOST: Laboratory host
VECTOR: Name of vector.
V_TYPE: Type of vector (Cosmid, Phage, Plasmid, YAC, other)
RE_1: Restriction enzyme at site1 of vector
RE_2: Restriction enzyme at site2 of vector
DESCR: Description of library preparation methods, vector, etc. Text starts on the line below the DESCR: tag.
||
Examples:
TYPE: Lib
NAME: Rat embryonic day 17 post-fertilization Library
ORGANISM: Rattus norvegicus
STRAIN: Sprague-Dawley
SEX: male
STAGE: embryonic day 17 post-fertilization
TISSUE: aorta
CELL_TYPE: vascular smooth muscle
DESCR:
||
Lib data template with required and most often used fields:
TYPE: Lib
NAME:
ORGANISM:
STRAIN:
CULTIVAR:
SEX:
ORGAN:
TISSUE:
CELL_TYPE:
CELL_LINE:
STAGE:
HOST:
VECTOR:
V_TYPE:
RE_1:
RE_2:
DESCR:
description
||
c. Contact Files
These are the valid tags and a short description:
TYPE: Entry type - must be "Cont" for contact entries. **Obligatory field**
NAME: Name of person submitting the EST. **Obligatory field**
FAX: Fax number as string of digits.
TEL: Telephone number as string of digits.
EMAIL: E-mail address
LAB: Laboratory providing EST.
INST: Institution name
ADDR: Address string, comma delineation.
||
Examples:
TYPE: Cont
NAME: Sikela JM
FAX: 303 270 7097
TEL: 303 270
EMAIL: tjs@tally.hsc.colorado.edu
LAB: Department of Pharmacology
INST: University of Colorado Health Sciences Center
ADDR: Box C236, 4200 E. 9th Ave., Denver, CO 80262-0236, USA
||
Contact data template with required and most often used fields:
TYPE: Cont
NAME:
FAX:
TEL:
EMAIL:
LAB:
INST:
ADDR:
||
d. EST Files
These are the valid tags and a short description:
TYPE: Entry type - must be "EST" for EST entries. **Obligatory field**
STATUS: Status of EST entry - "New" or "Update". **Obligatory field**
CONT_NAME: Name of contact (must be identical string to the contact entry) **Obligatory field**
CITATION: Journal citation. (Must be identical string to the publication title) Begins on line below tag - use continuation lines if necessary. **Obligatory field**
LIBRARY: Library name. (Must be identical string to library name entry.) **Obligatory field**
EST#: EST id assigned by contact lab. For EST updates, this is the string we match on. **Obligatory field**
GB#: GenBank accession number
GB_SEC: Secondary GenBank accessions
GDB#: Genome database accession number
GDB_DSEG: Genome database Dsegment number
CLONE: Clone id.
SOURCE: Source providing clone e.g. ATCC
SOURCE_DNA: Source id number for the clone as pure DNA
SOURCE_INHOST: Source id number for the clone stored in the host.
OTHER_EST: Other ESTs on this clone.
DBNAME: Database name for cross-reference to another database
DBXREF: Database cross-reference accession
PCR_F: Forward PCR primer sequence
PCR_B: Backward PCR primer sequence
INSERT: Insert length (in bases)
ERROR: Estimated error in insert length (bases)
PLATE: Plate number or code
ROW: Row number or letter
COLUMN: Column number or letter
SEQ_PRIMER: Sequencing primer description or sequence.
P_END: Which end sequenced e.g. 5'
HIQUAL_START: Base position of start of highest quality sequence (default=1)
HIQUAL_STOP: Base position of last base of highest quality sequence.
DNA_TYPE: cDNA (default), Genomic, Viral, Synthetic, Other
PUBLIC: Date of public release. Leave blank for immediate release. Format: 9/11/1994 (MM/DD/YYYY) **Obligatory field**
PUT_ID: Putative identification of sequence by submitter.
TAG_LIB: Name of library whose tag is found in this sequence.
TAG_TISSUE: Tissue that was source for the tagged library, if a library tag was found.
TAG_SEQ: The actual sequence of the library tag found in the EST read. If the tag was searched for and not found, put 'Not found' in this field.
POLYA: Y or N to indicate if a polyA tail was or was not found in the EST sequence.
COMMENT: Comments about EST. Starts on line below COMMENT: tag.
SEQUENCE: Sequence string. Starts on line below SEQUENCE: tag. **Obligatory field**
||
Examples:
TYPE: EST
STATUS: New
CONT_NAME: Kerlavage AR
EST#: HHC189f
CLONE: HHC189
SOURCE: ATCC
SOURCE_INHOST: 65128
OTHER_EST: HHC189r
CITATION:
Complementary DNA sequencing: expressed sequence tags and human genome project
SEQ_PRIMER: M13 Forward
P_END: 5'
HIQUAL_START: 1
HIQUAL_STOP: 285
DNA_TYPE: cDNA
LIBRARY: Hippocampus, Stratagene (cat. #936205)
PUBLIC:
PUT_ID: Actin, gamma, skeletal
COMMENT:
This is a comment about the sequence. It may span several lines.
SEQUENCE:
AATCAGCCTGCAAGCAAAAGATAGGAATATTCACCTACAGTGGGCACCTCCTTAAGAAGCTG
ATAGCTTGTTACACAGTAATTAGATTGAAGATAATGGACACGAAACATATTCCGGGATTAAA
CATTCTTGTCAAGAAAGGGGGAGAGAAGTCTGTTGTGCAAGTTTCAAAGAAAAAGGGTACCA
GCAAAAGTGATAATGATTTGAGGATTTCTGTCTCTAATTGGAGGATGATTCTCATGTAAGGT
TGTTAGGAAATGGCAAAGTATTGATGATTGTGTGCTATGTGATTGGTGCTAGATACTTTAAC
TGAGTATACGAGTGAAATACTTGAGACTCGTGTCACTT
||
EST data template with required and most often used fields:
TYPE: EST
STATUS:
CONT_NAME:
CITATION:
publication title
LIBRARY:
EST#:
CLONE:
SOURCE:
SOURCE_DNA:
SOURCE_INHOST:
PCR_F:
PCR_B:
INSERT:
ERROR:
PLATE:
ROW:
COLUMN:
SEQ_PRIMER:
P_END:
HIQUAL_START:
HIQUAL_STOP:
DNA_TYPE:
PUBLIC:
PUT_ID:
POLYA:
COMMENT:
comments
SEQUENCE:
sequence
||
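Before mailing a batch, it can help to read the EST entries back and check the obligatory fields. The sketch below is one possible way to do that, not the parser NCBI uses: `parse_entries` and `validate_est` are hypothetical names, and the code assumes one tag per line, continuation lines for multi-line values, and `||` alone on a line between entries.

```python
import re
from datetime import datetime

# Tags listed in the EST file description above.
EST_TAGS = set("""TYPE STATUS CONT_NAME CITATION LIBRARY EST# GB# GB_SEC GDB#
GDB_DSEG CLONE SOURCE SOURCE_DNA SOURCE_INHOST OTHER_EST DBNAME DBXREF PCR_F
PCR_B INSERT ERROR PLATE ROW COLUMN SEQ_PRIMER P_END HIQUAL_START HIQUAL_STOP
DNA_TYPE PUBLIC PUT_ID TAG_LIB TAG_TISSUE TAG_SEQ POLYA COMMENT
SEQUENCE""".split())
# PUBLIC is obligatory as a tag, but its value may be blank.
OBLIGATORY = ("TYPE", "STATUS", "CONT_NAME", "CITATION", "LIBRARY",
              "EST#", "PUBLIC", "SEQUENCE")
TAG_RE = re.compile(r"^([A-Z0-9_#]+):\s*(.*)$")


def parse_entries(text):
    """Split flat-file text into a list of {tag: value} dictionaries."""
    entries, current, tag = [], {}, None
    for raw in text.splitlines():
        line = raw.strip()
        if line == "||":  # entry separator
            if current:
                entries.append(current)
            current, tag = {}, None
            continue
        match = TAG_RE.match(line)
        if match and match.group(1) in EST_TAGS:
            tag = match.group(1)
            current[tag] = match.group(2).strip()
        elif tag is not None and line:
            # Continuation line: sequence lines are joined without spaces.
            joiner = "" if tag == "SEQUENCE" else " "
            current[tag] = (current[tag] + joiner + line).strip()
    if current:
        entries.append(current)
    return entries


def validate_est(entry):
    """Return a list of problems found in one parsed EST entry."""
    problems = []
    for tag in OBLIGATORY:
        if tag not in entry:
            problems.append(f"{tag}: tag is missing")
        elif tag != "PUBLIC" and not entry[tag]:
            problems.append(f"{tag}: value is empty")
    if entry.get("TYPE") and entry["TYPE"] != "EST":
        problems.append('TYPE: must be "EST"')
    if entry.get("STATUS") and entry["STATUS"] not in ("New", "Update"):
        problems.append('STATUS: must be "New" or "Update"')
    public = entry.get("PUBLIC", "")
    if public:
        try:
            datetime.strptime(public, "%m/%d/%Y")
        except ValueError:
            problems.append("PUBLIC: expected MM/DD/YYYY")
    return problems
```

Restricting tag recognition to the known tag list is deliberate: a COMMENT or CITATION continuation line that happens to contain a colon is then treated as text rather than as a new tag.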
The following pairs of fields must contain identical strings:
CONT_NAME field of the EST file and NAME field of the Contact file
LIBRARY field of the EST file and NAME field of the Library file
CITATION field of the EST file and TITLE field of the Publication file
We scan these fields from the EST file and match them automatically to Library, Contact and Publication records in the other tables, so content, spelling, letter case and spacing must match.
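The exact-match requirement can be checked locally before submission. The sketch below assumes the entries have already been read into dictionaries keyed by tag (how they are parsed is up to the submitter); `find_unmatched_references` is a hypothetical name. It only sees the entries in the current batch, so a reference to a Publication, Library, or Contact that is already in the database would be reported here even though it is acceptable.

```python
# Hypothetical local check of the cross-references described above.
# Each entry is a dict of tag -> value, including its TYPE.

# (field in the EST entry, TYPE of the referenced entry, field it must equal)
CROSS_REFS = (
    ("CONT_NAME", "Cont", "NAME"),
    ("LIBRARY", "Lib", "NAME"),
    ("CITATION", "Pub", "TITLE"),
)


def find_unmatched_references(entries):
    """Return (EST#, field, value) for every reference with no exact match."""
    known = {ref_type: set() for _, ref_type, _ in CROSS_REFS}
    for entry in entries:
        for _, ref_type, ref_field in CROSS_REFS:
            if entry.get("TYPE") == ref_type and ref_field in entry:
                known[ref_type].add(entry[ref_field])
    problems = []
    for entry in entries:
        if entry.get("TYPE") != "EST":
            continue
        for est_field, ref_type, _ in CROSS_REFS:
            value = entry.get(est_field, "")
            # Plain string equality: case and spacing differences count.
            if value not in known[ref_type]:
                problems.append((entry.get("EST#", "?"), est_field, value))
    return problems
```

The comparison is intentionally not normalized (no lower-casing, no whitespace collapsing), because the text above states that spelling, letter case, and spacing must all match.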
2. Submission Format for EST Map Data
The following is a specification for flat file formats for delivering EST mapping and related data to the NCBI EST database. Send the EST map data in a separate message from the EST sequence data to: batch-sub@ncbi.nlm.nih.gov
File Types
There are four types of deliverable files:
a. Publication
b. Contact
c. Method
d. Map Data
a. Publication
b. Contact
c. Method Files
These are the valid tags and a short description:
TYPE: Entry type - must be "Meth" for method entries. **Obligatory field**
NAME: Name of method. **Obligatory field**
ORGANISM: Organism from which library prepared. **Obligatory field**
ABSOLUTE: Method gives absolute or relative address? Y or N. **Obligatory field**
L1: Interpretation of line 1.
L2: Interpretation of line 2.
L3: Interpretation of line 3.
L4: Interpretation of line 4.
L5: Interpretation of line 5.
L6: Interpretation of line 6.
L7: Interpretation of line 7.
L8: Interpretation of line 8.
L9: Interpretation of line 9.
L10: Interpretation of line 10.
DESCR: Description of method. Description starts on line after DESCR heading. May be multi-line free format text.
|| Entry separator
Examples:
TYPE: Meth
NAME: YAC/CEPH JMS
ORGANISM: Homo sapiens
ABSOLUTE: n
L1: plate
L2: row
L3: column
L4: comment
L5: comment
L6: comment
L7: comment
DESCR:
PCR-based mapping of 3'UT-derived primers to CEPH YAC DNA pools. Primers are chosen using the PRIMER program by Lincoln et al., ver 0.5 (1991). To date, MIT puts out YAC pools A and B; if both pools were used for the mapping data given, then 'C' is designated.
||
TYPE: Meth
NAME: Radiation Hybrid JMS
ORGANISM: Homo sapiens
ABSOLUTE: y
L1: chromosome
L2: bin
L3: comment
L4: comment
L5: comment
DESCR:
Radiation hybrid panels with binning. Primers are chosen using the PRIMER program by Lincoln et al., ver 0.5 (1991).
||
TYPE: Meth
NAME: Somatic Hybrid JMS
ORGANISM: Homo sapiens
ABSOLUTE: y
L1: chromosome
L2: arm
L3: band
L4: band range
L5: comment
L6: comment
DESCR:
Somatic cell hybrid mapping. Primers are chosen using the PRIMER program by Lincoln et al., ver 0.5 (1991).
||
Map method data template with required and most often used fields:
TYPE: Meth
NAME:
ORGANISM:
ABSOLUTE:
DESCR:
comments
||
d. Map Data Files
These are the valid tags and a short description:
TYPE: Entry type - must be "Map" for map data entries. **Obligatory field**
STATUS: Status of EST entry - "New", "Replace" or "Update". **Obligatory field**
CONT_NAME: Name of contact (Must be identical string to the contact entry) **Obligatory field**
METHOD: Method name. (Must be identical string to the method entry name) **Obligatory field**
CITATION: Journal citation. (Must be identical string to the publication title) Begins on line below tag - use continuation lines if necessary. **Obligatory field**
NCBI#: NCBI Id of EST. (Must have either NCBI#, EST# or GB#)
GB#: GenBank accession number of EST.
EST#: EST name (can only use this if you are the original submitter of the EST)
PUBLIC: Date of public release. Leave blank for immediate release. Format: 9/11/1990 (MM/DD/YYYY) **Obligatory field**
MAPSTRING: Full mapping information. **Obligatory field**
CHROM: Chromosome name or number
L1: Line 1 of parsed mapping information.
L2: Line 2 of parsed mapping information.
L3: Line 3.
L4: Line 4.
L5: Line 5.
L6: Line 6.
L7: Line 7.
L8: Line 8.
L9: Line 9.
L10: Line 10 of parsed mapping information.
|| Entry separator
Examples:
TYPE: Map
STATUS: New
CONT_NAME: Sikela JM
METHOD: YAC/CEPH JMS
CITATION:
Single pass sequencing and physical and genetic mapping of human cDNAs
NCBI#: 21839
PUBLIC:
MAPSTRING: 959H08
CHROM:
L1: 959
L2: H
L3: 08
L4: Pool B
L5: Forward Primer: CCCCAGCAGAGAAGTTAATT
L6: Reverse Primer: GTCAACGTCAACATTCGTTT
L7: Product Length: 162
||
TYPE: Map
STATUS: New
CONT_NAME: Sikela JM
METHOD: Radiation hybrid JMS
CITATION:
Single pass sequencing and physical and genetic mapping of human cDNAs
NCBI#: 21839
PUBLIC:
MAPSTRING: 4, bin 2
CHROM: 4
L1: 4
L2: 2
L3: Forward Primer: TTGAGGGTTTACAACAGATAGG
L4: Reverse Primer: GAAATGGAAGAGAACCAGCT
L5: Product Length: 119
||
TYPE: Map
STATUS: New
CONT_NAME: Sikela JM
METHOD: Somatic hybrid JMS
CITATION:
Single pass sequencing and physical and genetic mapping of human cDNAs
EST#: EST0023c
PUBLIC:
MAPSTRING: 20
CHROM: 20
L1: 20
L2:
L3:
L4:
L5: Forward Primer: GTCTTCCTGTGTCTGCTGAG
L6: Reverse Primer: CACCTCACCTTACATCCAAA
||
Map data template with required and most often used fields:
TYPE: Map
STATUS:
CONT_NAME:
METHOD:
CITATION:
publication title
EST#:
PUBLIC:
MAPSTRING:
CHROM:
||
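The L1 to L10 lines of a Map entry take their meaning from the matching lines of the Method entry it names. The sketch below pairs the two up and checks the rule that a Map entry must carry at least one of NCBI#, EST# or GB#. Both function names are hypothetical, entries are assumed to be dictionaries keyed by tag, and comparing METHOD to the method NAME by exact string equality is an assumption based on the "identical string" wording above.

```python
# Hypothetical helpers for Map entries; entries are dicts of tag -> value.

ID_TAGS = ("NCBI#", "GB#", "EST#")


def has_est_identifier(map_entry):
    """True if the entry names its EST by NCBI#, GB# or EST#."""
    return any(map_entry.get(tag, "").strip() for tag in ID_TAGS)


def label_map_lines(method, map_entry):
    """Pair each Ln interpretation from the Method entry with the Ln value
    of the Map entry, in order from L1 to L10."""
    if map_entry.get("METHOD") != method.get("NAME"):
        raise ValueError("METHOD does not match the method entry NAME")
    pairs = []
    for n in range(1, 11):
        key = f"L{n}"
        if key in method:
            pairs.append((method[key], map_entry.get(key, "")))
    return pairs
```

A list of pairs is returned rather than a dictionary because a method may use the same interpretation, such as "comment", for several lines.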
Rev. 4/10/2001