Genome Informatics Section 

DOE Human Genome Program Contractor-Grantee Workshop VII 
January 12-16, 1999, Oakland, CA


93. A Figure of Merit for DNA Sequence Data 

Mark O. Mundt, Allon G. Percus, and David C. Torney 
Joint Genome Institute, Los Alamos National Laboratory, Los Alamos, New Mexico 
dct@lanl.gov 

We have implemented a new measure of the quality of sequence data. Given a sample of sequence data and its phred scores, our figure of merit is the predicted net error rate of the finished sequence that would be generated, at a given coverage, from sequences of comparable quality. The figure of merit is well suited to assessing the quality of batches of sequence data for continuous quality control in sequencing factories.
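
As a brief illustration of the raw material for such a measure, the sketch below converts phred scores to per-base error probabilities using the standard phred definition Q = -10 log10(P), and averages them over a sample. The function names and example scores are hypothetical and are not taken from the program described in this abstract.

```python
def phred_to_error_prob(q):
    """Error probability implied by a phred score q, from Q = -10 * log10(P)."""
    return 10.0 ** (-q / 10.0)


def mean_error_rate(phred_scores):
    """Average per-base error probability of a sample of phred scores."""
    if not phred_scores:
        raise ValueError("need at least one phred score")
    return sum(phred_to_error_prob(q) for q in phred_scores) / len(phred_scores)


# Hypothetical sample: a few low-quality and a few high-quality base calls.
sample_scores = [10, 20, 30, 40]
print(mean_error_rate(sample_scores))
```

This average is only the single-read error rate of the sample; the figure of merit additionally accounts for how reads overlap at a given coverage.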

This figure of merit avoids the complexities of fragment assembly. It assumes that the sequence reads occur at random positions, uniformly across the target sequence, and with each orientation being equally likely. The figure of merit is then the expected composite rate of erroneous basecalls, for a given coverage in sequences comparable in quality to those of the sample dataset. 

Thus, an average is taken over the different ways in which the bases and their associated phred scores "align" on the different base positions. 
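
The averaging described above can be sketched as a small Monte Carlo simulation. This is an illustration under stated assumptions, not the algorithm of our program: the depth at each position is drawn from a Poisson distribution with mean equal to the coverage (a common approximation for uniformly random read placement), phred scores are drawn independently from the pooled sample, an erroneous call is equally likely to be any of the three wrong bases, the consensus is the base with the largest summed phred score, and uncovered positions are skipped. All names are hypothetical.

```python
import math
import random


def phred_to_error_prob(q):
    """Error probability implied by a phred score q, from Q = -10 * log10(P)."""
    return 10.0 ** (-q / 10.0)


def sample_poisson(rng, mean):
    """Draw a Poisson variate by Knuth's multiplication method (small means only)."""
    limit = math.exp(-mean)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def estimate_net_error_rate(phred_scores, coverage, trials=20000, seed=0):
    """Monte Carlo estimate of the consensus error rate at a given coverage."""
    rng = random.Random(seed)
    errors = 0
    covered = 0
    for _ in range(trials):
        depth = sample_poisson(rng, coverage)
        if depth == 0:
            continue  # assumption: uncovered positions are not counted
        covered += 1
        votes = [0.0, 0.0, 0.0, 0.0]  # index 0 stands for the true base
        for _ in range(depth):
            q = rng.choice(phred_scores)
            if rng.random() < phred_to_error_prob(q):
                called = rng.randint(1, 3)  # assumption: wrong bases equally likely
            else:
                called = 0
            votes[called] += q  # assumption: quality-weighted vote
        best = max(votes)
        winners = [base for base, v in enumerate(votes) if v == best]
        if rng.choice(winners) != 0:  # ties are broken at random
            errors += 1
    return errors / covered if covered else float("nan")


sample_scores = [10, 20, 30, 40]  # hypothetical pooled phred scores
for coverage in (1.0, 3.0, 6.0):
    print(coverage, estimate_net_error_rate(sample_scores, coverage))
```

Because the seed is fixed, repeated runs give the same estimates; more trials are needed to resolve the very small error rates expected at high coverage.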

We implemented this figure of merit in an executable Java program, available by anonymous ftp from cell.lanl.gov, in the directory pub/fom. The inputs to the program are the standard phred output for the sample of sequences whose quality is to be assessed and the desired coverage. There are essentially no restrictions on the size of the dataset: it may consist of the phred scores for one read or for many. The program can trim off specified sequences, such as vector sequences. We illustrate the results obtained with different coverage parameters for three datasets, one of which has noticeably lower quality. Surprisingly, the expected net error rates show no trace of a dependence on the parity of the number of times a base is sequenced, such as would arise from "majority rule" statistics.
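
To make the parity effect concrete, the following sketch computes the error rate of an idealized majority vote. It is not the figure of merit itself: it assumes that every read has the same error probability p, that all erroneous reads agree on the same wrong base, and that ties are broken by a fair coin. The name `majority_error` is hypothetical. Under these assumptions the error rate at depth 2k equals the error rate at depth 2k-1, so it falls in steps as the depth grows, which is the kind of parity dependence that the figure of merit does not show.

```python
from math import comb


def majority_error(depth, p):
    """Probability that a majority vote over `depth` reads is wrong.

    Each read is wrong independently with probability p; a tie between
    the right and wrong base counts as an error half of the time.
    """
    total = 0.0
    for wrong in range(depth + 1):
        prob = comb(depth, wrong) * p**wrong * (1 - p) ** (depth - wrong)
        if 2 * wrong > depth:
            total += prob
        elif 2 * wrong == depth:
            total += 0.5 * prob
    return total


# Depths 1 and 2 give the same error rate, as do depths 3 and 4, and so on.
for depth in range(1, 7):
    print(depth, majority_error(depth, 0.05))
```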


 