Submitting tips
Submitted data should be placed on the secure FTP site provided by NCBI (ftp-trace.ncbi.nih.gov). Contact trace@ncbi.nlm.nih.gov to obtain a secure FTP account. Please include your contact information, the center's full name, and the center's acronym in the request.
All submissions made to NCBI via FTP are automatically picked up by Ensembl (http://trace.ensembl.org).
Submissions made to Ensembl are placed on the NCBI FTP site for NCBI to pick up and load.
Each submission is a single file in UNIX USTAR format compressed with the "gzip" utility. The suggested size of a submission file is between 1 and 4 GB. It is also suggested that each submission have a unique name that includes the center's name and the date.
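The packaging step described above could be sketched in Python as follows. This is only an illustration, not an NCBI tool: the function names are hypothetical, and the naming pattern (center acronym, a free label, ISO date) is an assumption modeled on file names such as 'NISC_mkp_2006-09-22.tar.gz' that appear in the tracking examples.

```python
import tarfile
from datetime import date
from pathlib import Path


def submission_name(center: str, label: str, day: date) -> str:
    """Build a file name from the center acronym, a label, and the date.

    The pattern is an assumption; any unique name that includes the
    center's name and the date follows the suggestion in the text.
    """
    return f"{center}_{label}_{day.isoformat()}.tar.gz"


def pack_submission(top_dir: Path, out_file: Path) -> Path:
    """Write top_dir and everything under it to a gzip-compressed USTAR archive."""
    with tarfile.open(out_file, "w:gz", format=tarfile.USTAR_FORMAT) as tar:
        # arcname keeps the top directory as the root of the archive.
        tar.add(top_dir, arcname=top_dir.name)
    return out_file
```

Requesting USTAR explicitly matters because recent Python versions default to the PAX format; the suggested 1 to 4 GB size range is not checked here.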
When extracted, every submission should have a top directory. The top directory may be named after the submission file. All ancillary files should be placed under that directory. If the submission contains trace files, at least one more directory should be created under the top directory, and all trace files should be placed under it.
The following should be placed under the top directory:
- TRACEINFO.xml or TRACEINFO.txt: one of the two must be present. This is the main file describing the submission. It contains ancillary data and, if necessary, references to trace files. It can be in either XML or tab-delimited format.
- MD5: MD5 hashes; recommended.
- README: free text describing this volume and its preparation.
Below are examples of the submission directory hierarchy and examples of TRACEINFO ancillary files.
The trace files should not appear in the top-level directory; they should be in a subdirectory. It is suggested that subdirectories be named after the traces or the project. Nested subdirectories are allowed and are encouraged as a way to group traces.
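The layout rules above can be checked before upload. The sketch below works on a list of archive member paths (for example from tarfile's getnames()). The function name is hypothetical, and treating only ".scf" as a trace file suffix is an assumption taken from the examples in this document.

```python
from pathlib import PurePosixPath

# Assumption: only the .scf suffix used in this document's examples is checked.
TRACE_SUFFIXES = {".scf"}


def check_layout(member_names: list[str]) -> list[str]:
    """Return a list of layout problems for the given archive member paths."""
    problems = []
    paths = [PurePosixPath(name) for name in member_names]
    tops = {p.parts[0] for p in paths if p.parts}
    if len(tops) != 1:
        problems.append(f"expected one top directory, found {sorted(tops)}")
        return problems
    # Names of the entries that sit directly under the top directory.
    direct = {p.parts[1] for p in paths if len(p.parts) == 2}
    if not ({"TRACEINFO.xml", "TRACEINFO.txt"} & direct):
        problems.append("TRACEINFO.xml or TRACEINFO.txt is missing")
    for p in paths:
        if len(p.parts) == 2 and p.suffix.lower() in TRACE_SUFFIXES:
            problems.append(f"{p}: trace file must be in a subdirectory")
    return problems
```

An empty list means none of these particular checks failed; it does not guarantee that the submission will be accepted.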
NEW and UPDATE submissions
NEW and UPDATE submissions should have the structure shown below:
TOP_DIRECTORY/
TOP_DIRECTORY/TRACEINFO.txt
TOP_DIRECTORY/MD5
TOP_DIRECTORY/README
TOP_DIRECTORY/traces
TOP_DIRECTORY/traces/HBBA/
TOP_DIRECTORY/traces/HBBA/HBBAA1U0001.scf
TOP_DIRECTORY/traces/HBBA/HBBAA1U0002.scf
TOP_DIRECTORY/traces/HBBA/HBBAA1U0003.scf
...
Examples are available for download: ftp://ftp.ncbi.nlm.nih.gov/pub/TraceDB/misc/examples
The ancillary TRACEINFO file describes the submitted data and points to the location of the chromatograms. XML format is preferable, since it is easier for a human to read if necessary. The ancillary data requirements for specific combinations of STRATEGY and TRACE_TYPE_CODE are listed in the Validation Table (Excel format). Both types of ancillary file can begin with a common fields section, which defines any values that are common to the whole submission.
TRACEINFO.xml example
If the trace info is provided as an XML file, the info field names serve as the tags. To preserve the grouping, the trace_volume tag is used.
<?xml version="1.0" encoding="UTF-8"?>
<trace_volume>
  <common_fields>
    <center_name>CENTER NAME ACRONYM IS HERE</center_name>
    <submission_type>NEW</submission_type>
    <strategy>WGS</strategy>
    <trace_type_code>WGS</trace_type_code>
    <center_project>Gorilla WGS</center_project>
    <source_type>G</source_type>
    <species_code>Gorilla gorilla</species_code>
    <insert_size>1500</insert_size>
  </common_fields>
  <trace>
    <template_id>HBBAA1U0001</template_id>
    <trace_name>HBBAA1U0001</trace_name>
    <trace_file>traces/HBBA/HBBAA1U0001.scf</trace_file>
    <trace_format>scf</trace_format>
    <trace_end>R</trace_end>
    <clip_vector_left>56</clip_vector_left>
    <clip_vector_right>737</clip_vector_right>
    <run_machine_id>legrenzi</run_machine_id>
    <chemistry>BIGDYEV2</chemistry>
    <program_id>phred version=0.020425.c</program_id>
    <run_machine_type>ABI 3700</run_machine_type>
  </trace>
  <trace>
    <template_id>HBBAA1U0002</template_id>
    <trace_name>HBBAA1U0002</trace_name>
    ...more info...
  </trace>
</trace_volume>
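A file with the shape of the XML example above could be produced with Python's standard library, roughly as sketched here. The function names are hypothetical, and lowercasing the field names to form the tags is an assumption based on the example; which fields are required is governed by the Validation Table, not by this code.

```python
import xml.etree.ElementTree as ET


def build_trace_volume(common: dict[str, str],
                       traces: list[dict[str, str]]) -> ET.Element:
    """Build a trace_volume element with common_fields and one trace per record."""
    volume = ET.Element("trace_volume")
    common_el = ET.SubElement(volume, "common_fields")
    for field, value in common.items():
        # Field names become tags; lowercase follows the example above.
        ET.SubElement(common_el, field.lower()).text = value
    for record in traces:
        trace_el = ET.SubElement(volume, "trace")
        for field, value in record.items():
            ET.SubElement(trace_el, field.lower()).text = value
    return volume


def write_traceinfo_xml(volume: ET.Element, path: str) -> None:
    """Write the element tree to path with an XML declaration."""
    ET.indent(volume)  # pretty-print; available in Python 3.9 and later
    ET.ElementTree(volume).write(path, encoding="UTF-8", xml_declaration=True)
```

ElementTree escapes characters such as "&" and "<" in field values, which is the main reason to prefer it over assembling the file with string concatenation.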
The tabular file has the following format. Its name either has no extension or ends in '.txt' or '.tbl'. Each data row holds the actual values of the fields named in the header. The data rows are also tab delimited.
center_name = CENTER NAME ACRONYM IS HERE
submission_type = NEW
strategy = WGA
trace_type_code = WGS
center_project = Gorilla WGS
source_type = G
species_code = Gorilla gorilla
insert_size = 1500
template_id	trace_name	trace_file	trace_format	trace_end	clip_vector_left	clip_vector_right	run_machine_id	chemistry	program_id	run_machine_type
HBBAA1U0001	HBBAA1U0001	traces/HBBA/HBBAA1U0001.scf	scf	R	56	737	legrenzi	BIGDYEV2	phred version=0.020425.c	ABI 3700
s2SCF	scf	F	44	793	agricola	BIGDYEV2	phred version=0.020425.c	ABI 3700
HBBAA1U0002	HBBAA1U0002	...
...
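The tabular layout above (common "name = value" lines, then a tab-delimited header, then tab-delimited rows) could be rendered as in the following sketch. The function name is hypothetical, and rejecting values that contain tabs or newlines is a precaution assumed here, not a documented rule.

```python
def render_traceinfo_txt(common: dict[str, str],
                         columns: list[str],
                         rows: list[dict[str, str]]) -> str:
    """Render common fields, a header line, and one tab-delimited line per trace."""
    lines = [f"{field} = {value}" for field, value in common.items()]
    lines.append("\t".join(columns))
    for row in rows:
        values = [str(row[column]) for column in columns]
        # Values may contain spaces (e.g. "ABI 3700") but a tab would shift columns.
        if any("\t" in v or "\n" in v for v in values):
            raise ValueError("field values must not contain tabs or newlines")
        lines.append("\t".join(values))
    return "\n".join(lines) + "\n"
```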
Example MD5 file:
728018368a7820c50cbaad633bc608a1 TRACEINFO
0cbaad633bc608a1728018368a7820c5 traces/TRACE0001.scf
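An MD5 file of this kind could be generated as sketched below. The function names are hypothetical. Two details are assumptions read off the example rather than stated rules: paths are written relative to the top directory, and the hash and the path are separated by a single space.

```python
import hashlib
from pathlib import Path


def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def md5_lines(top_dir: Path) -> list[str]:
    """Return one 'hash path' line per file under top_dir, skipping the MD5 file."""
    entries = []
    for path in top_dir.rglob("*"):
        if path.is_file() and path.name != "MD5":
            entries.append((path.relative_to(top_dir).as_posix(), md5_of(path)))
    # Sort by relative path so the output is stable between runs.
    return [f"{digest} {rel}" for rel, digest in sorted(entries)]
```

Reading in chunks keeps memory use flat even when individual files are large.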
UPDATEINFO submission
An UPDATEINFO submission should have the structure shown below:
TOP_DIRECTORY/
TOP_DIRECTORY/TRACEINFO.txt
TOP_DIRECTORY/MD5
TOP_DIRECTORY/README
In this case the TRACEINFO file must contain SUBMISSION_TYPE = UPDATEINFO, the unique keys of the traces (CENTER_NAME and TRACE_NAME), and the fields you wish to update together with their new values. These data will be loaded into our database without changing the ti's or the rest of the information. The file can begin with a common fields section, which defines any values that are common to the whole submission.
TRACEINFO.xml example
If the trace info is provided as an XML file, the info field names serve as the tags. To preserve the grouping, the trace_volume tag is used.
<?xml version="1.0"?>
<trace_volume>
  <common_fields>
    <center_name>CENTER NAME ACRONYM IS HERE</center_name>
    <submission_type>UPDATEINFO</submission_type>
    <trace_type_code>WGS</trace_type_code>
    <insert_size>40000</insert_size>
  </common_fields>
  <trace>
    <trace_name>HBBA0001</trace_name>
    <template_id>template_id_HBBA0001</template_id>
    ...more info...
  </trace>
  <trace>
    <trace_name>HBBA0002</trace_name>
    <template_id>template_id_HBBA0002</template_id>
    ...more info...
  </trace>
</trace_volume>
The tabular file has the following format. Its name either has no extension or ends in '.txt' or '.tbl'. Each data row holds the actual values of the fields named in the header. The data rows are also tab delimited.
SUBMISSION_TYPE=UPDATEINFO
CENTER_NAME=CENTER NAME ACRONYM IS HERE
trace_name	clip_vector_left	clip_vector_right	more fields (if necessary)...
my_trace1	33	89	...
my_trace2	19	80	...
my_trace2	1	68	...
more trace_name's...
Example MD5 file:
728018368a7820c50cbaad633bc608a1 TRACEINFO
0cbaad633bc608a1728018368a7820c5 traces/TRACE0001.scf
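Before uploading a tabular UPDATEINFO file, it may help to read it back and confirm that the required keys are present. The sketch below is a rough reader, not the loader NCBI uses. It assumes that common-field lines are the lines containing "=" before the header, that the header is the first line without "=", and that field names are case-insensitive (this document uses both cases). The function names are hypothetical.

```python
def parse_traceinfo_txt(text: str) -> tuple[dict[str, str], list[dict[str, str]]]:
    """Split a tabular TRACEINFO text into common fields and per-trace rows."""
    common: dict[str, str] = {}
    header = None
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        if header is None and "=" in line:
            # Common field line, e.g. "SUBMISSION_TYPE=UPDATEINFO".
            name, value = line.split("=", 1)
            common[name.strip().lower()] = value.strip()
        elif header is None:
            header = [column.strip().lower() for column in line.split("\t")]
        else:
            rows.append(dict(zip(header, line.split("\t"))))
    return common, rows


def check_updateinfo(common: dict[str, str], rows: list[dict[str, str]]) -> list[str]:
    """Return problems that would make the file unusable as an UPDATEINFO."""
    problems = []
    if common.get("submission_type") != "UPDATEINFO":
        problems.append("submission_type must be UPDATEINFO")
    for key in ("center_name", "trace_name"):
        # A key may be given once in the common section or in every row.
        in_rows = bool(rows) and all(row.get(key) for row in rows)
        if not common.get(key) and not in_rows:
            problems.append(f"{key} is missing")
    return problems
```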
WITHDRAW submission
To delete traces, use SUBMISSION_TYPE = WITHDRAW.
A WITHDRAW submission is a TRACEINFO file inside a tar file, just as any regular Trace Archive submission. It should have the structure shown below:
TOP_DIRECTORY/
TOP_DIRECTORY/TRACEINFO.txt
A WITHDRAW submission is very similar to UPDATEINFO, except that you do not have to supply any fields other than center_name, trace_name, and submission_type=WITHDRAW.
submission_type = WITHDRAW
center_name = CENTER NAME ACRONYM IS HERE
trace_name
my_trace1
my_trace2
my_trace2
...
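Since a WITHDRAW file only needs the three fields shown above, it can be generated from a plain list of trace names. The following is a minimal sketch with a hypothetical function name; it writes the names in the order given and does not check whether they exist in the archive.

```python
def render_withdraw(center_name: str, trace_names: list[str]) -> str:
    """Render a WITHDRAW TRACEINFO text with one trace name per line."""
    if not trace_names:
        raise ValueError("at least one trace name is required")
    for name in trace_names:
        if not name.strip() or "\t" in name or "\n" in name:
            raise ValueError(f"invalid trace name: {name!r}")
    lines = [
        "submission_type = WITHDRAW",
        f"center_name = {center_name}",
        "trace_name",  # single-column header
    ]
    lines.extend(trace_names)
    return "\n".join(lines) + "\n"
```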
Tracking Submissions
When a submission is loaded, a log file is generated. This log file contains the ti and read name of each read that passed and a list of the reads that were rejected.
If more than 5% of the reads from a particular submission fail, the entire submission will be rejected.
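The rejection rule can be expressed as a small check. This sketch reads "more than 5%" as strictly greater than 5%, so a submission with exactly 5% failures is not rejected; how the real loader treats that boundary is an assumption. The function name is hypothetical.

```python
def submission_rejected(failed: int, total: int) -> bool:
    """Return True if more than 5% of the reads in a submission failed."""
    if total <= 0 or failed < 0 or failed > total:
        raise ValueError("expected 0 <= failed <= total and total > 0")
    # Integer arithmetic avoids floating-point rounding at the 5% boundary.
    return failed * 100 > total * 5
```

For example, 50 failed reads out of 1000 gives False, while 51 out of 1000 gives True.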
A tracking system allows individual submissions to be tracked. Each FTP submission is given a unique tracking identifier (SID). Submissions can be tracked by name, SID, date or status. The submitting center will be notified via ftp when a submission has been processed.
After each submission has been processed, log files documenting the load are placed on the FTP site.
Submissions can be tracked with the query_tracedb Perl script. The output is in XML format.
Here are some examples:
$ query_tracedb "track name='NISC_mkp_2006-09-22.tar.gz'"
$ query_tracedb "track sid=174661"
$ query_tracedb "track name in ('NISC_mkp_2006-09-22.tar.gz', 'NISC_jyp_2006-09-22.tar.gz')"
$ query_tracedb "track sid in (174661, 174657)"
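When tracking many submissions from a script, the query strings shown above can be assembled programmatically. The helper below is hypothetical and covers only the name and sid forms that appear in the examples; the syntax for date and status queries is not shown in this document, so it is not attempted. Names containing quote characters are rejected because the escaping rules are unknown.

```python
def track_query(field: str, values: list) -> str:
    """Build a 'track ...' query string for query_tracedb (name or sid only)."""
    if field not in ("name", "sid"):
        raise ValueError("only 'name' and 'sid' are covered by this sketch")
    if not values:
        raise ValueError("at least one value is required")
    if field == "sid":
        rendered = [str(int(value)) for value in values]  # SIDs are written bare
    else:
        for value in values:
            if "'" in value or '"' in value:
                raise ValueError(f"unsupported character in name: {value!r}")
        rendered = [f"'{value}'" for value in values]
    if len(rendered) == 1:
        return f"track {field}={rendered[0]}"
    return f"track {field} in ({', '.join(rendered)})"
```

The resulting string would be passed to the script as a single argument, as in the shell examples; the structure of the XML that comes back is not described here, so parsing it is left out.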
If a submission does not fully comply with the RFC, it will either be rejected or a warning will be sent.
Some ancillary fields are mutually exclusive or not required for a particular type of submission. Please do not include redundant fields in the submission, because it can be rejected for this reason. For example, if no chromosome information is available for a read, the CHROMOSOME field should not be included.
If a read fails, the reason for its failure will be documented in the log file. For example, a read can fail for the following reasons:
- Information in the ancillary information file, but no trace file
- Zero length trace file
- Number of bases does not match the number of quality values
- There is a trace file but no ancillary information
- The SUBMISSION_TYPE field has the value 'NEW' but the values in the CENTER_NAME and TRACE_NAME fields are already in the database
- The same read name is found more than once in the tar file; in that case all reads with that name fail
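Several of the failure reasons listed above can be caught locally before a submission is uploaded. The sketch below covers four of them: an ancillary entry without a trace file, a zero-length trace file, a trace file without ancillary information, and a repeated read name. The base and quality count check and the database lookup are out of scope. The function name and the input shapes (pairs of trace name and trace file path, plus a mapping from trace file path to size in bytes) are assumptions made for illustration.

```python
from collections import Counter


def precheck(records: list[tuple[str, str]],
             trace_file_sizes: dict[str, int]) -> list[str]:
    """Return likely failure reasons for (trace_name, trace_file) records."""
    problems = []
    counts = Counter(name for name, _ in records)
    for name, count in counts.items():
        if count > 1:
            problems.append(f"{name}: read name appears {count} times")
    referenced = set()
    for name, trace_file in records:
        referenced.add(trace_file)
        if trace_file not in trace_file_sizes:
            problems.append(f"{name}: ancillary information but no trace file")
        elif trace_file_sizes[trace_file] == 0:
            problems.append(f"{name}: zero length trace file")
    # Trace files present in the archive that no record points to.
    for path in sorted(set(trace_file_sizes) - referenced):
        problems.append(f"{path}: trace file but no ancillary information")
    return problems
```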