Submitting tips

NEW and UPDATE submissions
UPDATEINFO submission
WITHDRAW submission
Tracking Submissions

Place the submission data on the secure FTP site provided by NCBI (ftp-trace.ncbi.nih.gov). Contact trace@ncbi.nlm.nih.gov to obtain a secure FTP account. Please include contact information, the center's full name, and the center's acronym with the request.

All submissions made to NCBI via FTP are automatically picked up by Ensembl (http://trace.ensembl.org).

Submissions made to Ensembl are placed on the NCBI FTP site for NCBI to pick up and load.

Each submission is a single file in UNIX USTAR format, compressed with the "gzip" utility. The suggested size of a submission file is between 1 and 4 GB. Use a unique name for each submission, and include the center's name and the date in the file name.
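The packaging rules above can be sketched in Python with the standard tarfile module. This is only an illustration under stated assumptions: the helper names make_submission_name and pack_submission are hypothetical, and the CENTER_label_date naming pattern is modeled on the file names used in the tracking examples on this page rather than on a required convention.

```python
import tarfile
from datetime import date
from pathlib import Path


def make_submission_name(center: str, label: str, day: date) -> str:
    """Build an archive name from the center acronym, a label, and a date.

    The pattern is an assumption modeled on names such as
    NISC_mkp_2006-09-22.tar.gz; it is not a required format.
    """
    return f"{center}_{label}_{day.isoformat()}.tar.gz"


def _strip_owner(info: tarfile.TarInfo) -> tarfile.TarInfo:
    # Drop local user and group details so the archive does not depend
    # on the account that created it.
    info.uid = info.gid = 0
    info.uname = info.gname = ""
    return info


def pack_submission(top_dir: Path, out_dir: Path, center: str,
                    label: str, day: date) -> Path:
    """Write top_dir into a gzip-compressed USTAR archive in out_dir."""
    archive = out_dir / make_submission_name(center, label, day)
    # USTAR_FORMAT requests the ustar layout; "w:gz" adds gzip compression.
    with tarfile.open(archive, "w:gz", format=tarfile.USTAR_FORMAT) as tar:
        # arcname keeps exactly one top directory inside the archive.
        tar.add(top_dir, arcname=top_dir.name, filter=_strip_owner)
    return archive
```

Writing the archive outside top_dir avoids packing the archive into itself.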

Every submission, when extracted, should have a single top directory. The top directory may be named after the submission file. All ancillary files should be placed directly under that directory. If the submission contains trace files, at least one subdirectory should be created under the top directory, and all trace files should be placed in it.

Below is what should be placed under the top directory.

TRACEINFO.xml or TRACEINFO.txt: one of the two must be present. This is the main file describing the submission. It contains the ancillary data and, where needed, references to trace files. It can be in either XML or tab-delimited format.
MD5: MD5 hashes; including this file is suggested.
README: free text describing this volume and preparation.

Below are examples of the submission directory hierarchy and examples of TRACEINFO ancillary files.

The trace files should not appear in the top-level directory; they should be in a subdirectory. It is suggested to name subdirectories after the traces or the project. Nested subdirectories are allowed and are encouraged as a way to group traces.

NEW and UPDATE submissions

NEW and UPDATE submissions should have the structure shown below:

TOP_DIRECTORY/
TOP_DIRECTORY/TRACEINFO.txt
TOP_DIRECTORY/MD5
TOP_DIRECTORY/README
TOP_DIRECTORY/traces
TOP_DIRECTORY/traces/HBBA/
TOP_DIRECTORY/traces/HBBA/HBBAA1U0001.scf
TOP_DIRECTORY/traces/HBBA/HBBAA1U0002.scf
TOP_DIRECTORY/traces/HBBA/HBBAA1U0003.scf
...

Examples are available for download: ftp://ftp.ncbi.nlm.nih.gov/pub/TraceDB/misc/examples

The ancillary TRACEINFO file describes the submitted data and points to the location of the chromatograms. XML format is preferable, since it is easier for a human to read if necessary. The ancillary data requirements for specific combinations of STRATEGY and TRACE_TYPE_CODE are listed in the Validation Table (Excel format). Both types of ancillary file can begin with a common fields section, which defines the values, if any, that are shared across the submission.

TRACEINFO.xml example

If the trace info is provided as an XML file, the info fields serve as the tags. To preserve the grouping, the trace_volume tag is used.

<?xml version="1.0" encoding="UTF-8"?>
<trace_volume>
   <common_fields>
      <center_name>CENTER NAME ACRONYM IS HERE</center_name>
      <submission_type>NEW</submission_type>
      <strategy>WGS</strategy>
      <trace_type_code>WGS</trace_type_code>
      <center_project>Gorilla WGS</center_project>
      <source_type>G</source_type>
      <species_code>Gorilla gorilla</species_code>
      <insert_size>1500</insert_size>
   </common_fields>
   <trace>
      <template_id>HBBAA1U0001</template_id>
      <trace_name>HBBAA1U0001</trace_name>
      <trace_file>traces/HBBA/HBBAA1U0001.scf</trace_file>
      <trace_format>scf</trace_format>
      <trace_end>R</trace_end>
      <clip_vector_left>56</clip_vector_left>
      <clip_vector_right>737</clip_vector_right>
      <run_machine_id>legrenzi</run_machine_id>
      <chemistry>BIGDYEV2</chemistry>
      <program_id>phred version=0.020425.c</program_id>
      <run_machine_type>ABI 3700</run_machine_type>
   </trace>
   <trace>
      <template_id>HBBAA1U0002</template_id>
      <trace_name>HBBAA1U0002</trace_name>
        ...more info...
   </trace>
</trace_volume>
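A file with this shape could be generated rather than written by hand. The following is a minimal sketch using Python's xml.etree.ElementTree; the function name build_traceinfo_xml and its dictionary-based inputs are assumptions made for illustration, and the sketch does not check the fields against the Validation Table.

```python
import xml.etree.ElementTree as ET


def build_traceinfo_xml(common: dict, traces: list) -> str:
    """Render common fields and per-trace records as a trace_volume document.

    common maps field names to values shared by the whole submission;
    traces is a list of dicts, one per trace. Field names are used as tags.
    """
    root = ET.Element("trace_volume")
    common_el = ET.SubElement(root, "common_fields")
    for field, value in common.items():
        ET.SubElement(common_el, field).text = str(value)
    for record in traces:
        trace_el = ET.SubElement(root, "trace")
        for field, value in record.items():
            ET.SubElement(trace_el, field).text = str(value)
    # ET.indent is available from Python 3.9 onward.
    ET.indent(root, space="   ")
    body = ET.tostring(root, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body + "\n"
```

Special characters in values, such as "&" or "<", are escaped by ElementTree when the document is serialized.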

TRACEINFO.txt example

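The tabular file has the following format. Its name either has no extension or ends in '.txt' or '.tbl'. The common fields come first, one per line, followed by a header line that names the remaining fields. Each line after the header holds the actual values of those fields for one trace. The header and data lines are tab delimited.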

center_name     = CENTER NAME ACRONYM IS HERE
submission_type = NEW
strategy        = WGA
trace_type_code = WGS
center_project  = Gorilla WGS
source_type     = G
species_code    = Gorilla gorilla
insert_size     = 1500
template_id	trace_name	trace_file	trace_format	trace_end	clip_vector_left	clip_vector_right	run_machine_id	chemistry	program_id	run_machine_type
HBBAA1U0001	HBBAA1U0001	traces/HBBA/HBBAA1U0001.scf	scf	R	56	737	legrenzi	BIGDYEV2	phred version=0.020425.c	ABI 3700
...s2SCF	scf	F	44	793	agricola	BIGDYEV2	phred version=0.020425.c	ABI 3700
HBBAA1U0002	HBBAA1U0002  ...
    ...
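The layout shown above (common fields as "name = value" lines, then a tab-delimited header and data rows) can be produced with a few lines of Python. This is a sketch only: build_traceinfo_txt is a hypothetical helper, and rejecting values that contain tabs or newlines is an assumption made so that the columns cannot be broken.

```python
def build_traceinfo_txt(common: dict, columns: list, rows: list) -> str:
    """Render a tab-delimited TRACEINFO file as a string.

    common  : fields shared by the whole submission
    columns : per-trace field names, in header order
    rows    : list of dicts, each holding a value for every column
    """
    lines = [f"{field} = {value}" for field, value in common.items()]
    lines.append("\t".join(columns))
    for row in rows:
        values = [str(row[column]) for column in columns]
        for value in values:
            # A tab or newline inside a value would shift the columns.
            if "\t" in value or "\n" in value:
                raise ValueError(f"value contains a tab or newline: {value!r}")
        lines.append("\t".join(values))
    return "\n".join(lines) + "\n"
```

The same helper covers UPDATEINFO and WITHDRAW files, since they differ only in the common fields and in which columns are listed.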

MD5 example

728018368a7820c50cbaad633bc608a1  TRACEINFO
0cbaad633bc608a1728018368a7820c5  traces/TRACE0001.scf
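An MD5 file in this two-column layout (hash, two spaces, path relative to the top directory) can be generated with Python's hashlib. The sketch below is illustrative: write_md5_file is a hypothetical helper, and the choice to hash every file under the top directory except the MD5 file itself is an assumption, since this page does not say which files must be listed.

```python
import hashlib
from pathlib import Path


def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_md5_file(top_dir: Path) -> Path:
    """Write top_dir/MD5 with one 'hash  relative/path' line per file."""
    lines = []
    for path in sorted(top_dir.rglob("*")):
        # Assumption: list every regular file except the MD5 file itself.
        if path.is_file() and path.name != "MD5":
            relative = path.relative_to(top_dir).as_posix()
            lines.append(f"{md5_of(path)}  {relative}")
    out = top_dir / "MD5"
    out.write_text("\n".join(lines) + "\n")
    return out
```

Reading in chunks keeps memory use flat even when individual trace files are large.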

UPDATEINFO submission

An UPDATEINFO submission should have the structure shown below

TOP_DIRECTORY/
TOP_DIRECTORY/TRACEINFO.txt
TOP_DIRECTORY/MD5
TOP_DIRECTORY/README

In this case the TRACEINFO file must contain SUBMISSION_TYPE = UPDATEINFO, the unique keys of the traces (CENTER_NAME and TRACE_NAME), and the fields you wish to update together with their new values. These data will be loaded into our database without changing the ti's or the rest of the information. The file can begin with a common fields section, which defines the values, if any, that are shared across the submission.

TRACEINFO.xml example

If the trace info is provided as an XML file, the info fields serve as the tags. To preserve the grouping, the trace_volume tag is used.

<?xml version="1.0"?>
<trace_volume>
    <common_fields>
        <center_name>CENTER NAME ACRONYM IS HERE</center_name>
        <submission_type>UPDATEINFO</submission_type>
        <trace_type_code>WGS</trace_type_code>
        <insert_size>40000</insert_size>
    </common_fields>
    <trace>
        <trace_name>HBBA0001</trace_name>
        <template_id>template_id_HBBA0001</template_id>
        ...more info...
    </trace>
    <trace>
        <trace_name>HBBA0002</trace_name>
        <template_id>template_id_HBBA0002</template_id>
        ...more info...
    </trace>
</trace_volume>

TRACEINFO.txt example

The tabular file has the following format. Its name either has no extension or ends in '.txt' or '.tbl'. Each line after the header holds the actual values of the fields named in the header. The header and data lines are tab delimited.

SUBMISSION_TYPE=UPDATEINFO
CENTER_NAME=CENTER NAME ACRONYM IS HERE
trace_name   clip_vector_left    clip_vector_right     more fields (if necessary)... 
my_trace1     33   89    ...
my_trace2     19   80    ...
my_trace3     1    68    ...
more trace_name's...

MD5 example

728018368a7820c50cbaad633bc608a1  TRACEINFO
0cbaad633bc608a1728018368a7820c5  traces/TRACE0001.scf

WITHDRAW submission

To delete traces, use SUBMISSION_TYPE = WITHDRAW.

A WITHDRAW submission is a TRACEINFO file inside a tar file, just as any regular Trace Archive submission. It should have the structure shown below:

TOP_DIRECTORY/
TOP_DIRECTORY/TRACEINFO.txt

TRACEINFO.txt example

A WITHDRAW submission is very similar to UPDATEINFO, except that the only fields you need to supply are center_name, trace_name, and submission_type = WITHDRAW.

submission_type = WITHDRAW
center_name     = CENTER NAME ACRONYM IS HERE
trace_name	
my_trace1
my_trace2
my_trace3
   ...

Tracking Submissions

When a submission is loaded, a log file is generated. It contains the ti and read name of each read that passed, and a list of the reads that were rejected.

If more than 5% of the reads from a particular submission fail, the entire submission will be rejected.
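The 5% rule is easy to check before uploading if you already know how many reads are likely to fail. A minimal sketch, where the function name submission_rejected is hypothetical and integer arithmetic is used so that exactly 5% is not counted as "more than 5%":

```python
def submission_rejected(total_reads: int, failed_reads: int) -> bool:
    """Return True when more than 5% of the reads in a submission failed."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    if not 0 <= failed_reads <= total_reads:
        raise ValueError("failed_reads must be between 0 and total_reads")
    # failed / total > 5 / 100, rewritten without floating point.
    return failed_reads * 100 > total_reads * 5
```

For a submission of 1000 reads, 50 failures stay within the limit and 51 failures exceed it.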

A tracking system allows individual submissions to be tracked. Each FTP submission is given a unique tracking identifier (SID). Submissions can be tracked by name, SID, date, or status. The submitting center will be notified via FTP when a submission has been processed.

After each submission has been processed, log files documenting the load are placed on the FTP site.

Submissions can also be tracked with the query_tracedb Perl script. The output is in XML format.

Here are some examples:

$ query_tracedb "track name='NISC_mkp_2006-09-22.tar.gz'"
$ query_tracedb "track sid=174661"
$ query_tracedb "track name in ('NISC_mkp_2006-09-22.tar.gz', 'NISC_jyp_2006-09-22.tar.gz')"
$ query_tracedb "track sid in (174661, 174657)"
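When many submissions need to be tracked, the query strings can be assembled programmatically. The sketch below only builds strings in the four forms shown above; the helper name track_query is hypothetical, it covers only the name and sid fields, and it does not run query_tracedb or assume anything about its output.

```python
def track_query(field: str, values: list) -> str:
    """Build a 'track ...' query string for the name or sid field."""
    if field not in ("name", "sid"):
        raise ValueError("only 'name' and 'sid' are covered by this sketch")
    if not values:
        raise ValueError("at least one value is required")

    def render(value) -> str:
        if field == "sid":
            return str(int(value))  # SIDs are written as bare integers
        text = str(value)
        if "'" in text:
            raise ValueError("single quotes in names are not handled here")
        return f"'{text}'"

    rendered = [render(value) for value in values]
    if len(rendered) == 1:
        return f"track {field}={rendered[0]}"
    return f"track {field} in ({', '.join(rendered)})"
```

The returned string is meant to be passed as the single quoted argument to query_tracedb, as in the examples above.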

If a submission does not fully comply with the RFC, it will either be rejected or a warning will be sent.

Some ancillary fields are mutually exclusive or not required for a particular type of submission. Please do not include redundant fields in the submission, because it can be rejected for this reason. For example, if no chromosome information is available for a read, the CHROMOSOME field should not be included.

If a read fails, the reason for the failure will be documented in the log file. For example, a read can fail for the following reasons:

Information in the ancillary information file, but no trace file
Zero length trace file
Number of bases does not match the number of quality values
There is a trace file but no ancillary information
If the SUBMISSION_TYPE field has the value 'NEW' but the values in the CENTER_NAME and TRACE_NAME fields are already in the database, the read will be rejected.
If the same read name is found more than once in the tar file, all reads with that name fail.
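Several of these failure reasons can be caught locally before the archive is built. The following Python sketch checks for duplicate read names, missing or zero-length trace files, and trace files with no ancillary record. It is an illustration under stated assumptions: the function name preflight is hypothetical, records are assumed to be dicts carrying the trace_name and trace_file fields, and every file found in a subdirectory of the top directory is assumed to be a trace file. It does not inspect file contents, so it cannot detect a mismatch between bases and quality values.

```python
from collections import Counter
from pathlib import Path


def preflight(top_dir: Path, records: list) -> list:
    """Return a list of human-readable problems found in a submission tree."""
    problems = []

    # The same read name appearing more than once fails every such read.
    counts = Counter(record["trace_name"] for record in records)
    for name, count in sorted(counts.items()):
        if count > 1:
            problems.append(f"duplicate trace_name: {name}")

    referenced = set()
    for record in records:
        relative = record["trace_file"]
        referenced.add(relative)
        path = top_dir / relative
        if not path.is_file():
            problems.append(f"missing trace file: {relative}")
        elif path.stat().st_size == 0:
            problems.append(f"zero-length trace file: {relative}")

    # Assumption: anything below a subdirectory is a trace file.
    for path in sorted(top_dir.rglob("*")):
        if path.is_file() and path.parent != top_dir:
            relative = path.relative_to(top_dir).as_posix()
            if relative not in referenced:
                problems.append(f"trace file without ancillary record: {relative}")
    return problems
```

An empty list means none of the checked conditions were found; it does not guarantee that the submission will load.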
