
Formats

Table of Contents

  1. Standard Flowgram Format (SFF)
  2. ZTR SPEC v1.2
  3. SCF File Format version 3.10

Standard Flowgram Format (SFF)

SFF was designed by 454 Life Sciences, Whitehead Institute for Biomedical Research and Sanger Institute.

This document describes proposed changes which will allow the Trace Archive to efficiently incorporate data generated in formats such as those used by the 454 Life Sciences' system.

Among these changes is the definition of a Standard Flowgram Format (SFF), similar to the SCF format, to hold the "trace" data for 454 reads.

The proposed SFF file format is a container file for storing one or many 454 reads. 454 reads differ from standard sequencing reads in that the 454 data does not provide individual base measurements from which basecalls can be derived. Instead, it provides measurements that estimate the length of the next homopolymer stretch in the sequence (i.e., in "AAATGG", "AAA" is a 3-mer stretch of A's, "T" is a 1-mer stretch of T's and "GG" is a 2-mer stretch of G's). A basecalled sequence is then derived by converting each estimate into a homopolymer stretch of that length and concatenating the homopolymers.

The file format consists of three sections: a common header section occurring once in the file, then, for each read stored in the file, a read header section and a read data section. The data in each section consists of a combination of numeric and character data, where the specific fields for each section are defined below. The sections adhere to the following rules:

  • All numeric values are stored in big endian byteorder.
  • All character fields use single-byte ASCII characters.
  • Each section ends on an 8-byte boundary; where a section's fields do not fill a multiple of 8 bytes, an eight_byte_padding field of zero bytes makes up the difference.

Common Header Section

The common header section consists of the following fields:

                
  • magic_number uint32_t
  • version char[4]
  • index_offset uint64_t
  • index_length uint32_t
  • number_of_reads uint32_t
  • header_length uint16_t
  • key_length uint16_t
  • number_of_flows_per_read uint16_t
  • flowgram_format_code uint8_t
  • flow_chars char[number_of_flows_per_read]
  • key_sequence char[key_length]
  • eight_byte_padding uint8_t[*]

where the following properties are true for these fields:

  • The magic_number field value is 0x2E736666, the uint32_t encoding of the string ".sff"
  • The version number corresponding to this proposal is 0001, or the byte array "\0\0\0\1".
  • The index_offset and index_length fields are the offset and length of an optional index of the reads in the SFF file. If no index is included in the file, both fields must be 0.
  • The number_of_reads field should be set to the number of reads stored in the file.
  • The header_length field should be the total number of bytes required by this set of header fields, and should be equal to "31 + number_of_flows_per_read + key_length" rounded up to the next value divisible by 8.
  • The key_length and key_sequence fields should be set to the length and nucleotide bases of the key sequence used for these reads.
    • Note: The key_sequence field is not null-terminated.
  • The number_of_flows_per_read should be set to the number of flows for each of the reads in the file.
  • The flowgram_format_code should be set to the format used to encode each of the flowgram values for each read.
    • Note: Currently, only one flowgram format has been adopted, so this value should be set to 1.
    • The flowgram format code 1 stores each value as a uint16_t, where the floating point flowgram value is encoded as "(int) round(value * 100.0)", and decoded as "(storedvalue * 1.0 / 100.0)".
  • The flow_chars should be set to the array of nucleotide bases ('A', 'C', 'G' or 'T') that correspond to the nucleotides used for each flow of each read. The length of the array should equal number_of_flows_per_read.
    • Note: The flow_chars field is not null-terminated.
  • If any eight_byte_padding bytes exist in the section, they should have a byte value of 0.

If an index is included in the file, the index_offset and index_length values in the common header should point to the section of the file containing the index. To support different indexing methods, the index section should begin with the following two fields:

                
  • index_magic_number uint32_t
  • index_version char[4]

and should end with an eight_byte_padding field, so that the length of the index section is divisible by 8. The format of the rest of the index section is specific to the indexing method used. The index_length given in the common header should include the bytes of these fields and the padding.

Note: Currently, there are no officially supported indexing formats, however support for the io_lib hash table indexing and a simple sorted list indexing should be developed shortly.

Read Header Section

The rest of the file contains the information about the reads, namely number_of_reads entries consisting of read header and read data sections. The read header section consists of the following fields:

                
  • read_header_length uint16_t
  • name_length uint16_t
  • number_of_bases uint32_t
  • clip_qual_left uint16_t
  • clip_qual_right uint16_t
  • clip_adapter_left uint16_t
  • clip_adapter_right uint16_t
  • name char[name_length]
  • eight_byte_padding uint8_t[*]

where these fields have the following properties:

  • The read_header_length should be set to the length of the read header for this read, and should be equal to "16 + name_length" rounded up to the next value divisible by 8.
  • The name_length and name fields should be set to the length and string of the read's accession or name.
    • Note: The name field is not null-terminated.
  • The number_of_bases should be set to the number of bases called for this read.
  • The clip_qual_left and clip_adapter_left fields should be set to the position of the first base after the clipping point, for quality and/or an adapter sequence, at the beginning of the read. If only a combined clipping position is computed, it should be stored in clip_qual_left.
    • The position values use 1-based indexing, so the first base is at position 1.
    • If a clipping value is not computed, the field should be set to 0.
    • Thus, the first base of the insert is "max(1, max(clip_qual_left, clip_adapter_left))".
  • The clip_qual_right and clip_adapter_right fields should be set to the position of the last base before the clipping point, for quality and/or an adapter sequence, at the end of the read. If only a combined clipping position is computed, it should be stored in clip_qual_right.
    • The position values use 1-based indexing.
    • If a clipping value is not computed, the field should be set to 0.
    • Thus, the last base of the insert is "min( (clip_qual_right == 0 ? number_of_bases : clip_qual_right), (clip_adapter_right == 0 ? number_of_bases : clip_adapter_right) )".

Read Data Section

The read data section consists of the following fields:

                
  • flowgram_values uint*_t[number_of_flows]
  • flow_index_per_base uint8_t[number_of_bases]
  • bases char[number_of_bases]
  • quality_scores uint8_t[number_of_bases]
  • eight_byte_padding uint8_t[*]

where the fields have the following properties:

  • The flowgram_values field contains the homopolymer stretch estimates for each flow of the read. The number of bytes used for each value depends on the common header flowgram_format_code value (where the current value uses a uint16_t for each value).
  • The flow_index_per_base field contains the flow positions for each base in the called sequence (i.e., for each base, the position in the flowgram whose estimate resulted in that base being called).
    • These values are "incremental" values, meaning that the stored position is the offset from the previous flow index in the field.
    • All position values (prior to their incremental encoding) use 1-based indexing, so the first flow is flow 1.
  • The bases field contains the basecalled nucleotide sequence.
  • The quality_scores field contains the quality scores for each of the bases in the sequence, where the values use the standard Phred scale (-10 * log10 of the error probability).

Computing Lengths and Scanning the File

The length of each read's section will be different, because of different length accession numbers and different length nucleotide sequences. However, the various flow, name and bases lengths given in the common and read headers can be used to scan the file, accessing each read's information or skipping read sections in the file. The following pseudocode gives an example method of scanning the file and accessing each read's data:

  • Open the file and/or reset the file pointer position to the first byte of the file.
  • Read the first 31 bytes of the file, confirm the magic_number value and version, then extract the number_of_reads, number_of_flows_per_read, flowgram_format_code, header_length, key_length, index_offset and index_length values.
    • Convert the flowgram_format_code into a flowgram_bytes_per_flow value (currently with format_code 1, this value is 2 bytes).
  • If the flow_chars and key_sequence information is required, read the next "header_length - 31" bytes, then extract that information. Otherwise, set the file pointer position to byte header_length.
  • While the file contains more bytes, do the following:
    • If the file pointer position equals index_offset, either read or skip index_length bytes in the file, processing the index if read.
    • Otherwise,
      • Read 16 bytes and extract the read_header_length, name_length and number_of_bases values.
      • Read the next "read_header_length - 16" bytes to read the name.
      • At this point, a test of the name field can be performed, to determine whether to read or skip this entry.
      • Compute the read_data_length as "number_of_flows * flowgram_bytes_per_flow + 3 * number_of_bases" rounded up to the next value divisible by 8.
      • Either read or skip read_data_length bytes in the file, processing the read data if the section is read.

ZTR SPEC v1.2

ZTR format was designed by J.K. Bonfield and R. Staden (ZTR: a new format for DNA sequence trace data. Bioinformatics, Oxford University Press, 2002)

Header

The header consists of an 8 byte magic number (see below), followed by a 1-byte major version number and 1-byte minor version number. Changes in minor numbers should not cause problems for parsers: a minor-version change indicates a change in chunk types (different contents), but the file format itself is the same. The major number is reserved for any incompatible file format changes (which hopefully should be never).

/* The header */
typedef struct {
    unsigned char  magic[8];	  /* 0xae5a54520d0a1a0a (be) */
    unsigned char  version_major; /* 1 */
    unsigned char  version_minor; /* 2 */
} ztr_header_t;

/* The ZTR magic numbers */
#define ZTR_MAGIC		"\256ZTR\r\n\032\n"
#define ZTR_VERSION_MAJOR	1
#define ZTR_VERSION_MINOR	2

So the total header will consist of:

 
Byte number   0  1  2  3  4  5  6  7  8  9
            +--+--+--+--+--+--+--+--+--+--+
Hex values  |ae 5a 54 52 0d 0a 1a 0a|01 02|
            +--+--+--+--+--+--+--+--+--+--+

Chunk format

The basic structure of a ZTR file is (header,chunk*) - ie header followed by zero or more chunks. Each chunk consists of a type, some meta-data and some data, along with the lengths of both the meta-data and data.

   
Byte number   0  1  2  3  4  5  6  7  8  9
            +--+--+--+--+---+---+---+---+--+--+  -  +--+--+--+--+--+--  -  --+
Hex values  |   type    |meta-data length  | meta-data |data length| data .. |
            +--+--+--+--+---+---+---+---+--+--+  -  +--+--+--+--+--+--  -  --+
     

Ie in C:

      
typedef struct {
    uint4 type;			/* chunk type (be) */
    uint4 mdlength;		/* length of meta-data field (be) */
    char *mdata;		/* meta data */
    uint4 dlength;		/* length of data field (be) */
    char *data;			/* a format byte and the data itself */
} ztr_chunk_t;
      

All 2 and 4-byte integer values are stored in big endian format.

The meta-data is uncompressed (and so it does not start with a format byte). The format of the meta-data is chunk specific, and many chunk types will have no meta-data. In this case the meta-data length field will be zero and this will be followed immediately by the data-length field.

The data length is the length in bytes of the entire 'data' block, including the format information held within it.

The first byte of the data consists of a format byte. The most basic format is zero - indicating that the data is "as is"; it's the real thing. Other formats exist in order to encode various filtering and compression techniques. The information encoded in the next bytes will depend on the format byte.

Data format 0 - Raw

Byte number   0 1  2       N
            +--+--+--  -  --+
Hex values  | 0|  raw data  |
            +--+--+--  -  --+
     

Raw data has no compression or filtering. It just contains the unprocessed data. It consists of a one byte header (0) indicating raw format followed by N bytes of data.

Data format 1 - Run Length Encoding

Byte number   0  1    2     3     4      5     6  7  8               N
            +--+----+----+-----+-----+-------+--+--+--+--  -  --+--+--+
Hex values  | 1| Uncompressed length | guard | run length encoded data|
            +--+----+----+-----+-----+-------+--+--+--+--  -  --+--+--+
     

Run length encoding replaces stretches of N identical bytes (with value V) with the guard byte G followed by N and V. All other byte values are stored as normal, except for occurrences of the guard byte, which is stored as G 0. For example with a guard value of 8:

Input data:

 
	20 9 9 9 9 9 10 9 8 7
	 

Output data:

	1			(rle format)
	0 0 0 10		(original length)
	8			(guard)
	20 8 5 9 10 9 8 0 7	(rle data)
	

Data format 2 - ZLIB

Byte number   0  1    2     3     4    5  6  7         N
            +--+----+----+-----+-----+--+--+--+--  -  --+
Hex values  | 2| Uncompressed length | Zlib encoded data|
            +--+----+----+-----+-----+--+--+--+--  -  --+
    

This uses the zlib code to compress a data stream. The ZLIB data may itself be encoded using a variety of methods (LZ77, Huffman), but zlib will automatically determine the format itself. Often using zlib mode Z_HUFFMAN_ONLY will provide best compression when combined with other filtering techniques.

Data format 64/0x40 - 8-bit delta

Byte number   0       1        2      N 
            +--+-------------+--  -  --+
Hex values  |40| Delta level |   data  |
            +--+-------------+--  -  --+
  

This technique replaces successive bytes with their differences. The level indicates how many rounds of differencing to apply, which should be between 1 and 3. For determining the first difference we compare against zero. All differences are internally performed using unsigned values with an automatic wrap-around (taking the bottom 8 bits). Hence 2-1 is 1 and 1-2 is 255.

For example, with level set to 1:

    
Input data: 
      10 20 10 200 190 5
Output data: 
       64			(delta1 format)
       1			(level)
       10 10 246 190 246 71	(delta data)
    

For level set to 2:

Input data: 
 
     10 20 10 200 190 5
Output data: 
       64			(delta1 format)
       2			(level)
       10 0 236 200 56 81	(delta data)
    

Data format 65/0x41 - 16-bit delta

Byte number   0       1        2      N 
            +--+-------------+--  -  --+
Hex values  |41| Delta level |   data  |
            +--+-------------+--  -  --+

This format is as data format 64 except that the input data is read in 2-byte values, so we take the difference between successive 16-bit numbers. For example "0x10 0x20 0x30 0x10" (4 8-bit numbers; 2 16-bit numbers) yields "0x10 0x20 0x1f 0xf0". All 16-bit input data is assumed to be aligned to the start of the buffer and is assumed to be in big-endian format.

Data format 66/0x42 - 32-bit delta

Byte number   0       1        2  3  4      N 
            +--+-------------+--+--+--  -  --+
Hex values  |42| Delta level | 0| 0|   data  |
            +--+-------------+--+--+--  -  --+
     

This format is as data formats 64 and 65 except that the input data is read in 4-byte values, so we take the difference between successive 32-bit numbers.

Two padding bytes (2 and 3) should always be set to zero. Their purpose is to make sure that the compressed block is still aligned on a 4-byte boundary (hence making it easy to pass straight into the 32to8 filter).

Data format 67-69/0x43-0x45 - reserved

At present these are reserved for dynamic differencing where the 'level' field varies - applying the appropriate level for each section of data. Experimental at present...

Data format 70/0x46 - 16 to 8 bit conversion

Byte number   0
            +--+--  -  --+
Hex values  |46|   data  |
            +--+--  -  --+
     

This method assumes that the input data is a series of big endian 2-byte signed integer values. If the value is in the range of -127 to +127 inclusive then it is written as a single signed byte in the output stream, otherwise we write out -128 followed by the 2-byte value (in big endian format). This method works well following one of the delta techniques as most of the 16-bit values are typically then small enough to fit in one byte.

  
Example input data: 
	0 10 0 5 -1 -5 0 200 -4 -32 (bytes)
	(As 16-bit big-endian values: 10 5 -5 200 -800)
Output data: 
       70			(16-to-8 format)
       10 5 -5 -128 0 200 -128 -4 -32
     

Data format 71/0x47 - 32 to 8 bit conversion

Byte number   0
            +--+--  -  --+
Hex values  |47|   data  |
            +--+--  -  --+
     

This format is similar to format 70, but we are reducing 32-bit numbers (big endian) to 8-bit numbers.

Data format 72/0x48 - "follow" predictor

Byte number   0  1     FF 100  101   N
            +--+--  -  -  - --+-- - --+
Hex values  |48| follow bytes |  data |
            +--+--  -  -  - --+-- - --+
     

For each symbol we compute the most frequent symbol following it. This is stored in the "follow bytes" block (256 bytes). The first character in the data block is stored as-is. Then for each subsequent character we store the difference between the predicted character value (obtained by using follow[previous_character]) and the real value. This is a very crude, but fast, method of removing some residual non-randomness in the input data and so will reduce the data entropy. It is best to use this prior to entropy encoding (such as huffman encoding).

Data format 73/0x49 - floating point 16-bit chebyshev polynomial predictor

Version 1.1 only. Replaced by format 74 in Version 1.2.

WARNING: This method was experimental and has been replaced with an integer equivalent. The floating point method may give system-specific results.

Byte number   0  1  2      N
            +--+--+--  -  --+
Hex values  |49| 0|   data  |
            +--+--+--  -  --+
      

This method takes big-endian 16-bit data and attempts to curve-fit it using Chebyshev polynomials. The exact method employed uses the 4 preceding values to calculate Chebyshev polynomials with 5 coefficients. Of these 5 coefficients only 4 are used to predict the next value. Then we store the difference between the predicted value and the real value. This procedure is repeated for each 16-bit value in the data. The first four 16-bit values are stored with a simple 1-level 16-bit delta function. Reversing the predictor follows the same procedure, except now adding the differences between stored value and predicted value to get the real value.

Data format 74/0x4A - integer based 16-bit chebyshev polynomial predictor

Version 1.2 onwards This replaces the floating point code in ZTR v1.1.

Byte number   0  1  2      N
            +--+--+--  -  --+
Hex values  |4A| 0|   data  |
            +--+--+--  -  --+
      

This method takes big-endian 16-bit data and attempts to curve-fit it using Chebyshev polynomials. The exact method employed uses the 4 preceding values to calculate Chebyshev polynomials with 5 coefficients. Of these 5 coefficients only 4 are used to predict the next value. Then we store the difference between the predicted value and the real value. This procedure is repeated for each 16-bit value in the data. The first four 16-bit values are stored with a simple 1-level 16-bit delta function. Reversing the predictor follows the same procedure, except now adding the differences between stored value and predicted value to get the real value.

Chunk types

As described above, each chunk has a type. The format of the data contained in the chunk data field (when written in format 0) is described below. Note that no chunks are mandatory. It is valid to have no chunks at all. However some chunk types may depend on the existence of others. This will be indicated below, where applicable.

Each chunk type is stored as a 4-byte value. Bit 5 of the first byte is used to indicate whether the chunk type is part of the public ZTR spec (bit 5 of first byte == 0) or is a private/custom type (bit 5 of first byte == 1). Bit 5 of the remaining 3 bytes is reserved - they must always be set to zero.

Practically speaking this means that public chunk types consist entirely of upper case letters (eg TEXT) whereas private chunk types start with a lowercase letter (eg tEXT). Note that in this example TEXT and tEXT are completely independent types and they may have no more relationship with each other than (for example) TEXT and BPOS types.

It is valid to have multiples of some chunks (eg text chunks), but not for others (such as base calls). The order of chunks does not matter unless explicitly specified.

A chunk may have meta-data associated with it. This is data about the data chunk. For example the data chunk could be a series of 16-bit trace samples, while the meta-data could be a label attached to that trace (to distinguish trace A from traces C, G and T). Meta-data is typically very small and so it never needs to be compressed in any of the public chunk types (although meta-data is specific to each chunk type and so it would be valid to have private chunks with compressed meta-data if desirable).

The first byte of each chunk data when uncompressed must be zero, indicating raw format. If, having read the chunk data, this is not the case then the chunk needs decompressing or reverse filtering until the first byte is zero. There may be a few padding bytes between the format byte and the first element of real data in the chunk. This is to make file processing simpler when the chunk data consists of 16 or 32-bit words; the padding bytes ensure that the data is aligned to the appropriate word size. Any padding bytes required will be listed in the appropriate chunk definition below.

The following lists the chunk types available in 32-bit big-endian format. In all cases the data is presented in the uncompressed form, starting with the raw format byte and any appropriate padding.

SAMP

Meta-data:

Byte number   0  1  2  3
            +--+--+--+--+
Hex values  | data name |
            +--+--+--+--+
     

Data:

 
Byte number   0  1  2  3  4  5  6  7       N
            +--+--+--+--+--+--+--+--+-     -+
Hex values  | 0| 0| data| data| data|   -   |
            +--+--+--+--+--+--+--+--+-     -+
     

This encodes a series of 16-bit trace samples. The first data byte is the format (raw); the second data byte is present for padding purposes only. After that comes a series of 16-bit big-endian values.

The meta-data for this chunk contains a 4-byte name associated with the trace. If a name is shorter than 4 bytes then it should be right padded with nul characters to 4 bytes. For sequencing traces the four lanes representing A, C, G and T signals have names "A\0\0\0", "C\0\0\0", "G\0\0\0" and "T\0\0\0".

At present other names are not reserved, but it is recommended that (for consistency with elsewhere) you label private trace arrays with names starting in a lowercase letter (specifically, bit 5 is 1).

For sequencing traces it is expected that there will be four SAMP chunks, although the order is not specified.

SMP4

Meta-data: none present

Data:

Byte number   0  1  2  3  4  5  6  7       N
            +--+--+--+--+--+--+--+--+-     -+
Hex values  | 0| 0| data| data| data|   -   |
            +--+--+--+--+--+--+--+--+-     -+
    

The first byte is 0 (raw format). Next is a single padding byte (also 0). Then follows a series of 2-byte big-endian trace samples for the "A" trace, followed by a series of 2-byte big-endian trace samples for the "C" trace, and then likewise for the "G" and "T" traces (in that order). The assumption is made that there is the same number of data points for all traces and hence the length of each trace is simply the number of data elements divided by four.

This chunk is mutually exclusive with the SAMP chunks. If both sets are defined then the last found in the file should be used. Experimentation has shown that this gives around a 3% saving over 4 separate SAMP chunks, but it lacks the flexibility of separate SAMP chunks.

BASE

Meta-data: none present

Data:

Byte number   0  1  2  3      N  
            +--+--+--+--  -  --+
Hex values  | 0| base calls    |
            +--+--+--+--  -  --+
    

The first byte is 0 (raw format). This is followed by the base calls in ASCII format (one base per byte). Base calls may be in either case, and the encoding set should be the IUPAC characters.

BPOS

Meta-data: none present

Data:

Byte number   0  1  2  3  4  5  6  7       
            +--+--+--+--+--+--+--+--+-     -+--+--+--+--+
Hex values  | 0| padding|   data    |   -   |    data   |
            +--+--+--+--+--+--+--+--+-     -+--+--+--+--+
    

This chunk contains the mapping of base call (BASE) numbers to sample (SAMP) numbers; it defines the position of each base call in the trace data. The position here is defined as the numbering of the 16-bit positions held in the SAMP array, counting zero as the first value.

The format is 0 (raw format) followed by three padding bytes (all 0). Next follows a series of 4-byte big-endian numbers specifying the position of each base call as an index into the sample arrays (when considered as a 2-byte array with the format header stripped off).

Excluding the format and padding bytes, the number of 4-byte elements should be identical to the number of base calls. All sample numbers are counted from zero. No sample number in BPOS should be beyond the end of the SAMP arrays (although it should not be assumed that the SAMP chunks will be before this chunk). Note that the BPOS elements may not be totally in sorted order as the base calls may be shifted relative to one another due to compressions.

CNF4

Meta-data: none present

Data:

Byte number   0  1              N              4N
            +--+--+--   -   --+--+----- -  -----+
Hex values  | 0| call confidence | A/C/G/T conf |
            +--+--+--   -   --+--+----- -  -----+
    

(N == number of bases in BASE chunk)

The first byte of this chunk is 0 (raw format). This is then followed by a series of confidence values for the called base. Next come all the remaining confidence values for A, C, G and T, excluding those that have already been written (ie the called base). So for a sequence AGT we would store confidences A1 G2 T3 C1 G1 T1 A2 C2 T2 A3 C3 G3.

The purpose of this is to group the (likely) highest confidence values (those for the called base) at the start of the chunk, followed by the remaining values. Hence if phred confidence values are written in a CNF4 chunk the first quarter of the chunk will consist of phred confidence values and the last three quarters will (assuming no ambiguous base calls) consist entirely of zeros.

For the purposes of storage the confidence value for a base call that is not A, C, G or T (in any case) is stored as if the base call was T.

The confidence values should be computed as "-10 * log10 (1-probability)", where probability is the probability of the base call being correct, converted to the nearest integral value. If a program wishes to store confidence values in a different range then it should use a different chunk type.

If this chunk exists it must exist after a BASE chunk.

TEXT

Meta-data: none present

Data:

 
Byte number   0 
            +--+-  -  -+--+-  -  -+--+-     -+-  -  -+--+-  -  -+--+--+
Hex values  | 0| ident | 0| value | 0|   -   | ident | 0| value | 0| 0|
            +--+-  -  -+--+-  -  -+--+-     -+-  -  -+--+-  -  -+--+--+
    

This contains a series of "identifier\0value\0" pairs.

The identifiers and values may be any length and may contain any data except the nul character. The nul character marks the end of the identifier or the end of the value. Multiple identifier-value pairs are allowable, with a double nul character marking the end of the list.

Identifiers starting with bit 5 clear (uppercase) are part of the public ZTR spec. Any public identifier not listed as part of this spec should be considered as reserved. Identifiers that have bit 5 set (lowercase) are for private use and no restriction is placed on these.

See below for the text identifier list.

CLIP

Meta-data: none present

Data:

Byte number   0  1  2  3  4  5  6  7  8
            +--+--+--+--+--+--+--+--+--+
Hex values  | 0| left clip | right clip|
            +--+--+--+--+--+--+--+--+--+
    

This contains suggested quality clip points. These are stored as zero (raw data) followed by a 4-byte big endian value for the left clip point and a 4-byte big endian value for the right clip point. Clip points are defined in units of base calls, starting from 1.

CR32

Meta-data: none present

Data:

Byte number   0  1  2  3  4 
            +--+--+--+--+--+
Hex values  | 0|   CRC-32  |
            +--+--+--+--+--+
    

This chunk is always just 4 bytes of data containing a CRC-32 checksum, computed according to the widely used ANSI X3.66 standard. If present, the checksum will be a check of all of the data since the last CR32 chunk. This will include checking the header if this is the first CR32 chunk, and including the previous CR32 chunk if it is not. Obviously the checksum will not include checks on this CR32 chunk.

COMM

Meta-data: none present

Data:

Byte number   0  1        N
            +--+--   -   --+
Hex values  | 0| free text |
            +--+--   -   --+
    

This allows arbitrary textual data to be added. It does not require an identifier-value pairing or any nul termination.

Text Identifiers

These are for use in the TEXT segments. None are required, but if any of these identifiers are present they must conform to the description below. Much (currently all) of this list has been taken from the NCBI Trace Archive documentation. It is duplicated here as the ZTR spec is not tied to the same revision schedules as the NCBI Trace Archive (although it is intended that any suitable updates to the Trace Archive should be mirrored in this ZTR spec).

The Trace Archive specifies a maximum length of values. The ZTR spec does not have length limitations, but for compatibility these sizes should still be observed.

The Trace Archive also states some identifiers are mandatory; these are marked by asterisks below. These identifiers are not mandatory in the ZTR spec (but clearly they need to exist if the data is to be submitted to the NCBI).

Finally, some fields are not appropriate for use in the ZTR spec, such as BASE_FILE (the name of a file containing the base calls). Such fields are included only for compatibility with the Trace Archive. It is not expected that use of ZTR would allow for the base calls to be read from an external file instead of the ZTR BASE chunk.

More detailed information on the format of the ancillary values should be obtained from the Trace Archive RFC.

SCF File Format version 3.10

Intro

SCF format files are used to store data from DNA sequencing instruments. Each file contains the data for a single reading and includes: its trace sample points, its called sequence, the positions of the bases relative to the trace sample points, and numerical estimates of the accuracy of each base. Comments and "private data" can also be stored. The format is machine independent and the first version was described in Dear, S and Staden, R. "A standard file format for data from DNA sequencing instruments", DNA Sequence 3, 107-110, (1992).

Since then it has undergone several important changes. The first allowed for different sample point resolutions. The second, in response to the need to reduce file sizes for large projects, involved a major reorganisation of the ordering of the data items in the file and also in the way they are represented. Note that despite these changes we have retained the original data structures into which the data is read. Also this reorganisation in itself has not made the files smaller but it has produced files that are more effectively compressed using standard programs such as gzip. The io library included in the package contains routines that can read and write all the different versions of the format (including reading of compressed files). The header record was not affected by this change. This documentation covers both the format of scf files and the data structures that are used by the io library. Prior to version 3.00 these two things corresponded much more closely.

In order for programs to label themselves as supporting SCF files they need to adhere to one of the SCF versions. If they do not support the latest version then the version of SCF supported should be clearly labelled. Note that although SCF 3.00 and SCF 3.10 are binary compatible, a key difference is that 3.10 does not allow programs to fail due to the existence or non-existence of comment types.

Header Record

The file begins with a 128 byte header record that describes the location and size of the chromatogram data in the file. Nothing is implied about the order in which the components (samples, sequence and comments) appear. The version field is a 4 byte character array representing the version and revision of the SCF format. The current value of this field is "3.10".

 
  /*
   * Basic type definitions
   */
  typedef unsigned int   uint_4;
  typedef signed   int    int_4;
  typedef unsigned short uint_2;
  typedef signed   short  int_2;
  typedef unsigned char  uint_1;
  typedef signed   char   int_1;
  
  /*
   * Type definition for the Header structure
   */
  #define SCF_MAGIC (((((uint_4)'.'<<8)+(uint_4)'s'<<8) \
                       +(uint_4)'c'<<8)+(uint_4)'f')
  
  typedef struct {
      uint_4 magic_number;
      uint_4 samples;          /* Number of elements in Samples matrix */
      uint_4 samples_offset;   /* Byte offset from start of file */
      uint_4 bases;            /* Number of bases in Bases matrix */
      uint_4 bases_left_clip;  /* OBSOLETE: No. bases in left clip (vector) */
      uint_4 bases_right_clip; /* OBSOLETE: No. bases in right clip (qual) */
      uint_4 bases_offset;     /* Byte offset from start of file */
      uint_4 comments_size;    /* Number of bytes in Comment section */
      uint_4 comments_offset;  /* Byte offset from start of file */
      char version[4];         /* "version.revision", eg '3' '.' '0' '0' */
      uint_4 sample_size;      /* Size of samples in bytes 1=8bits, 2=16bits*/
      uint_4 code_set;         /* code set used (may be ignored) */
      uint_4 private_size;     /* No. of bytes of Private data, 0 if none */
      uint_4 private_offset;   /* Byte offset from start of file */
      uint_4 spare[18];        /* Unused */
  } Header;
     

For SCF files of version 2.0 or greater (i.e. Header.version compares `greater than or equal to' "2.00"), the version number, the precision of the data (sample_size) and the uncertainty code set are specified in the header. Otherwise, the precision is assumed to be 1 byte, and the code set to be the default code set. The following uncertainty code sets are recognised (but are generally ignored by programs reading this file). If in doubt, set code_set to 0 or 2.

   
0       {A,C,G,T,-}   (default)
1       Staden
2       IUPAC (NC-IUB)
3       Pharmacia A.L.F. (NC-IUB)
4       {A,C,G,T,N}   (ABI 373A)
5       IBI/Pustell
6       DNA*
7       DNASIS
8       IG/PC-Gene
9       MicroGenie
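As an illustrative sketch of the header rules above, a reader could validate the magic number, version and precision fields before trusting the rest of the file. The names MiniHeader and check_header are hypothetical; only the fields consulted are reproduced, and all uint_4 fields are assumed to have already been byte-swapped to host order (see the byte-ordering section below).

```c
#include <string.h>

typedef unsigned int uint_4;

#define SCF_MAGIC (((((uint_4)'.'<<8)+(uint_4)'s'<<8) \
                     +(uint_4)'c'<<8)+(uint_4)'f')

/* Hypothetical cut-down header for this sketch: only the fields
 * checked below, assumed already converted to host byte order. */
typedef struct {
    uint_4 magic_number;
    char   version[4];
    uint_4 sample_size;
    uint_4 code_set;
} MiniHeader;

/* Returns 0 = ok, 1 = pre-2.00 file (defaults apply), -1 = reject. */
int check_header(const MiniHeader *h)
{
    if (h->magic_number != SCF_MAGIC)
        return -1;                       /* not an SCF file */
    if (memcmp(h->version, "2.00", 4) < 0)
        return 1;                        /* sample_size=1, code_set=0 implied */
    if (h->sample_size != 1 && h->sample_size != 2)
        return -1;                       /* unsupported precision */
    return 0;                            /* header fields usable as stored */
}
```

Because the version field is a fixed-width "digit.digit digit" string, a plain memcmp gives the correct ordering for the version comparison.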
    

Points

The sample data consists of the four chromatogram channels. If no sample data exists, the Header.samples value should be zero.

The trace information is stored at byte offset Header.samples_offset from the start of the file. For each sample point there are values for each of the four bases. Header.sample_size holds the precision of the sample values. The precision must be either "1" (unsigned byte) or "2" (unsigned short). The sample points need not be normalised to any particular value, though it is assumed that they represent positive values; that is, they are of unsigned type.

With the introduction of SCF version 3.00, in an attempt to produce efficiently compressed files, the sample points are stored in A,C,G,T order; i.e. all the values for base A, followed by all those for C, etc. In addition they are stored, not as their original magnitudes, but in terms of the differences between successive values. The C language code used to transform the values for precision 2 samples is shown below.

     
  #define DELTA_IT 1  /* direction flag; defined by the io library (value assumed here) */

  void delta_samples2 ( uint_2 samples[], int num_samples, int job) {
   
      /* If job == DELTA_IT:
       *  change a series of sample points to a series of delta delta values:
       *  ie change them in two steps:
       *  first: delta = current_value - previous_value
       *  then: delta_delta = delta - previous_delta
       * else
       *  do the reverse
       */
   
      int i;
      uint_2 p_delta, p_sample;
   
      if ( DELTA_IT == job ) {
          p_delta  = 0;
          for (i=0;i<num_samples;i++) {
              p_sample = samples[i];
              samples[i] = samples[i] - p_delta;
              p_delta  = p_sample;
          }
          p_delta  = 0;
          for (i=0;i<num_samples;i++) {
              p_sample = samples[i];
              samples[i] = samples[i] - p_delta;
              p_delta  = p_sample;
          }
      }
      else {
          p_sample = 0;
          for (i=0;i<num_samples;i++) {
              samples[i] = samples[i] + p_sample;
              p_sample = samples[i];
          }
          p_sample = 0;
          for (i=0;i<num_samples;i++) {
              samples[i] = samples[i] + p_sample;
              p_sample = samples[i];
          }
      }
  }
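The double-pass transform above can be exercised with a short round trip. In this compact sketch the names `pass` and `delta2` are mine, not part of the io library; it applies the same differencing pass twice to encode, and the summing pass twice to decode, relying on modulo-2^16 unsigned arithmetic so the round trip is exact.

```c
typedef unsigned short uint_2;

/* One differencing (forward) or summing (reverse) pass over the data;
 * delta_samples2 above applies the same pass twice in each direction. */
void pass(uint_2 s[], int n, int forward)
{
    uint_2 prev = 0;
    int i;
    for (i = 0; i < n; i++) {
        if (forward) {
            uint_2 cur = s[i];
            s[i] = (uint_2)(s[i] - prev);   /* delta; wraps mod 2^16 */
            prev = cur;
        } else {
            s[i] = (uint_2)(s[i] + prev);   /* running sum undoes delta */
            prev = s[i];
        }
    }
}

/* Encode (forward != 0) or decode the version-3 delta-delta representation. */
void delta2(uint_2 s[], int n, int forward)
{
    pass(s, n, forward);
    pass(s, n, forward);
}
```

After encoding, each element holds a second difference: for the input {10,12,15,...}, element 2 becomes (15-12) - (12-10) = 1. Decoding with two summing passes restores the original values exactly.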
     

The io library data structure is as follows:

  /*
   * Type definition for the Sample data
   */
  typedef struct {
          uint_1 sample_A;           /* Sample for A trace */
          uint_1 sample_C;           /* Sample for C trace */
          uint_1 sample_G;           /* Sample for G trace */
          uint_1 sample_T;           /* Sample for T trace */
  } Samples1;
  
  typedef struct {
          uint_2 sample_A;           /* Sample for A trace */
          uint_2 sample_C;           /* Sample for C trace */
          uint_2 sample_G;           /* Sample for G trace */
          uint_2 sample_T;           /* Sample for T trace */
  } Samples2;
    

Sequence Information

If no sequence exists then Header.bases should be zero. Otherwise this holds the number of called bases.

Information relating to the base interpretation of the trace is stored at byte offset Header.bases_offset from the start of the file. Stored for each base are: its character representation and a number (an index into the Samples data structure) indicating its position within the trace. The relative probabilities of each of the 4 bases occurring at the point where the base is called can be stored in prob_A, prob_C, prob_G and prob_T, and there may also be substitution, insertion and deletion probabilities too. Note that although the base calls are in sequential order it should not be assumed that the positions will therefore be numerically sorted (due to the possibility of compressions in the data).

From version 3.00 these items are stored in the following order: all "peak indexes", i.e. the positions in the sample points to which the bases correspond; all the accuracy estimates for base type A, then all for C, G and T; the called bases; this is followed by 3 sets of int_1 data items (formerly empty, now the substitution, insertion and deletion scores - see below). These values are read into the following data structure by the routines in the io library.

The format of the prob_A, prob_C, prob_G and prob_T values was not specified, apart from being a 1-byte integral value. From version 3.10 we specify that all probability values will be stored as -10*log10(P(error)), where P(error) is the probability of an error. This is the same format as phred and, more recently, other similar tools. If no probabilities are available, all four values should be set to zero. When only one value is available (for the called base), the relevant "prob_" field should be set accordingly and the other three should be left as zero.

For uncalled bases, or bases that are not A, C, G or T, all four probability values should be specified. Specifically for "N" or "-", all four probability values should be set to the same value (which is typically very low - note that using the log scale above a probability of correctness of 0.25 equates to a prob_* value of 1.25).

From version 3.10 onwards we may also store the substitution, insertion and deletion probability values. These are stored using the same scale as the prob_A, prob_C, prob_G and prob_T values. It is expected that the four prob_A, prob_C, prob_G and prob_T values will encode the absolute probability of that base call being correct, taking into account the chance of it being an overcalled base. For alignment algorithms it may be useful to obtain individual confidence values for the chance of insertion, deletion and substitution. These are stored in prob_ins, prob_del and prob_sub. In version 3.00 these fields existed in the SCF files, but were labelled as "uint_1 spare[3]".

   
  /*
   * Type definition for the sequence data
   */
  typedef struct {
      uint_4 peak_index;        /* Index into Samples matrix for base posn */
      uint_1 prob_A;            /* Probability of it being an A */
      uint_1 prob_C;            /* Probability of it being an C */
      uint_1 prob_G;            /* Probability of it being an G */
      uint_1 prob_T;            /* Probability of it being an T */
      char   base;              /* Called base character        */

      uint_1 prob_sub;          /* Probability of this base call being a
                                 *     substitution for another base */
      uint_1 prob_ins;          /* Probability of it being an overcall */
      uint_1 prob_del;          /* Probability of an undercall at this point
                                 *     (extra base between this base and the
                                 *      previous base) */

  } Base;
  

Comments

Comments are stored at offset Header.comments_offset from the start of the file. Lines in this section are of the format:

<Field-ID>=<Value>

<Field-ID> can be any single-line string not containing spaces or an equals sign. The <Value> may be any single-line string. If you need to include newline characters in <Value> it is recommended that you escape them as "\n".
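Splitting such a line at the first equals sign is straightforward; a minimal sketch (parse_comment is a hypothetical helper, and the buffer sizes are illustrative, not part of the spec):

```c
#include <stddef.h>
#include <string.h>

/* Split one "<Field-ID>=<Value>" comment line at the first '='.
 * Copies the identifier into id and the value into val, truncating to
 * the given buffer sizes.  Returns 0 on success, -1 if no '=' found. */
int parse_comment(const char *line, char *id, size_t id_sz,
                  char *val, size_t val_sz)
{
    const char *eq = strchr(line, '=');
    size_t n;

    if (eq == NULL || id_sz == 0 || val_sz == 0)
        return -1;
    n = (size_t)(eq - line);
    if (n >= id_sz)
        n = id_sz - 1;
    memcpy(id, line, n);
    id[n] = '\0';
    strncpy(val, eq + 1, val_sz - 1);
    val[val_sz - 1] = '\0';
    return 0;
}
```

Only the first '=' delimits the identifier, so values containing further equals signs (e.g. "SIGN=A=10,C=12,...") survive intact.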

No program should fail due to particular <Field-ID>s being missing (all should be considered optional); however, certain <Field-ID>s have historically become commonplace. Hence if any of the following <Field-ID>s are present they must be in the format specified below.

BandSpreadRatio

This indicates the amount of image processing done to create the curves in this file. It is used as reference only. (requested by LI-COR)

BCSW

Base calling software (and optionally version).

CONV

The software used to convert the trace to this file. See DATF and DATN.

DATE

Human readable date for the production of this sequence.

DATF

Format (and optionally version) of the original data that this file was created from (assuming that it was not written natively). See DATN and CONV.

DATN

The filename of the original data file. See DATF and CONV.

DYEP

This indicates the type of chemistry and dye(s) used to create this file. The format is an arbitrary string.

ENHANCEMENT

This indicates whether any image processing has been done to create the curves in this file. It is used as reference only. (requested by LI-COR)

IMAGE

If this exists it contains the full drive, path and file name of the image that was used to generate the SCF file. This is used as a reference when locating raw data to reprocess. (requested by LI-COR)

LANE

The lane number of the sequence when loaded onto the gel, counting from the left edge.

MACH

Sequencing machine type and model.

MTXF

Matrix file name (relative or absolute path name) specified using whatever format is suitable for the OS under which this file was created.

NAME

This is the name of the sample. If it is absent, software that needs it should derive it from the filename. If present, it must be the name of the sample ONLY (no drive, path or suffix information) and must be limited to 31 characters in length. Most software assumes that the name of the file is also the name of the sample, which is fine; the NAME parameter, however, allows some deviation from that.

OPER

The name of the operator who produced this sequence.

PRIM

The position, in samples, of the first base call in the raw (unprocessed) trace data.

RUND

A machine readable "run date" for the production of this sequence. The format should be "YYYYMMDD.HHMMSS - YYYYMMDD.HHMMSS" where YYYYMMDD encodes 4 digits of year, 2 digits of month number and 2 digits of day number (in the month) and HHMMSS encodes the time (hours, minutes, seconds) using a 24 hour clock.

SampleRemark

This is used only to pass along comment information from one processing step to the next. (requested by LI-COR)

SIGN

Average signal strength specified as "A=x,C=x,G=x,T=x" where 'x' is an integer or floating point number.

SPAC

Average base spacing specified as the number of trace samples per base call - an integer or floating point number.

SRCE

File source - synonym for MACH.

TPSW

Trace processing software (and optionally version).

  /*
   * Type definition for the comments
   */
  typedef char Comments[];                /* Zero terminated list of
                                             \n separated entries */

Private data

The private data section is provided to store any information required that is not supported by the SCF standard. If the field in the header is 0 then there is no private data section. We impose no restrictions upon the format of this section. However, it may be a good idea to use the first four bytes as a magic number identifying the format of the private data.

File structure

From SCF version 3.0 onwards the in-memory structures and the data on the disk are not in the same format. The layout of the data on disk for the different versions is summarised below.

Versions 1 and 2:

(Note Samples1 can be replaced by Samples2 as appropriate.)

Length in bytes                        Data
---------------------------------------------------------------------
128                                    header
Number of samples * 4 * sample size    Samples1 or Samples2 structure
Number of bases * 12                   Base structure
Comments size                          Comments
Private data size                      private data

Version 3:

Length in bytes                        Data
---------------------------------------------------------------------------
128                                    header
Number of samples * sample size        Samples for A trace
Number of samples * sample size        Samples for C trace
Number of samples * sample size        Samples for G trace
Number of samples * sample size        Samples for T trace
Number of bases * 4                    Peak index (offset into Samples) per base
Number of bases                        Accuracy estimate of bases being 'A'
Number of bases                        Accuracy estimate of bases being 'C'
Number of bases                        Accuracy estimate of bases being 'G'
Number of bases                        Accuracy estimate of bases being 'T'
Number of bases                        The called bases
Number of bases * 3                    prob_sub, prob_ins, prob_del (reserved in 3.00)
Comments size                          Comments
Private data size                      Private data
---------------------------------------------------------------------------
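The version-3 table above makes the total file size a simple function of the header fields, which a reader can use as a sanity check before parsing. The function name scf3_file_size is a hypothetical helper for this sketch:

```c
typedef unsigned int uint_4;

/* Expected total size of a version-3 SCF file, summing the rows of the
 * layout table: header, four traces, peak indexes, four accuracy
 * arrays, called bases, the three per-base probability/reserved bytes,
 * then the comments and private data sections. */
uint_4 scf3_file_size(uint_4 samples, uint_4 sample_size,
                      uint_4 bases, uint_4 comments_size,
                      uint_4 private_size)
{
    return 128                         /* header            */
         + samples * sample_size * 4   /* A, C, G, T traces */
         + bases * 4                   /* peak indexes      */
         + bases * 4                   /* accuracy A,C,G,T  */
         + bases                       /* called bases      */
         + bases * 3                   /* sub/ins/del bytes */
         + comments_size
         + private_size;
}
```

For example, 1000 two-byte sample points and 500 bases with 100 bytes of comments and no private data give 128 + 8000 + 2000 + 2000 + 500 + 1500 + 100 = 14228 bytes.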

Byte ordering and integer representation

"Forward byte and reverse bit" ordering will be used for all integer values. This is the same as used in the MC680x0 and SPARC processors, but the reverse of the byte ordering used on the Intel 80x86 processors.

         Off+0   Off+1  
       +-------+-------+  
uint_2 |  MSB  |  LSB  |  
       +-------+-------+  
    
         Off+0   Off+1   Off+2   Off+3
       +-------+-------+-------+-------+
uint_4 |  MSB  |  ...  |  ...  |  LSB  | 
       +-------+-------+-------+-------+
     

To read integers on systems with any byte order use something like this:

 
  #include <stdio.h>

  uint_2 read_uint_2(FILE *fp)
  {
      unsigned char buf[sizeof(uint_2)];
  
      fread(buf, sizeof(buf), 1, fp);
      return (uint_2)
          (((uint_2)buf[1]) +
           ((uint_2)buf[0]<<8));
  }
  
  uint_4 read_uint_4(FILE *fp)
  {
      unsigned char buf[sizeof(uint_4)];
  
      fread(buf, sizeof(buf), 1, fp);
      return (uint_4)
          (((uint_4)buf[3]) +
           ((uint_4)buf[2]<<8) +
           ((uint_4)buf[1]<<16) +
           ((uint_4)buf[0]<<24));
  }
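The matching big-endian writers follow the same pattern in reverse; this sketch (the write_uint_2/write_uint_4 names are mine) decomposes the value into bytes explicitly so the output is independent of the host byte order:

```c
#include <stdio.h>

typedef unsigned int   uint_4;
typedef unsigned short uint_2;

/* Write a 2-byte integer in big-endian (MSB-first) order. */
void write_uint_2(FILE *fp, uint_2 v)
{
    unsigned char buf[2];
    buf[0] = (unsigned char)(v >> 8);
    buf[1] = (unsigned char)(v & 0xff);
    fwrite(buf, sizeof(buf), 1, fp);
}

/* Write a 4-byte integer in big-endian (MSB-first) order. */
void write_uint_4(FILE *fp, uint_4 v)
{
    unsigned char buf[4];
    buf[0] = (unsigned char)(v >> 24);
    buf[1] = (unsigned char)(v >> 16);
    buf[2] = (unsigned char)(v >> 8);
    buf[3] = (unsigned char)(v);
    fwrite(buf, sizeof(buf), 1, fp);
}
```

Writing the SCF magic number 0x2E736366 this way produces the bytes '.', 's', 'c', 'f' on disk, regardless of the machine writing the file.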