![]() |
|
![]() |
![]() |
|
Archive Edition | |
Sponsored
by the U.S. Department of
Energy Human Genome Program
|
Santa Fe, New Mexico, November 13-17, 1994
Introduction to the Workshop
The electronic form of this document may be cited in the following style: Abstracts scanned from text submitted for November 1994 DOE Human Genome Program Contractor-Grantee Workshop. Inaccuracies have not been corrected. |
Towards Simplifying Fragment AssemblyEugene Myers The key computational problem in assembling fragments in a project involving some amount of shotgun sequencing is to determine which overlaps between fragments to use in forming contigs. Particularly when the underlying target sequence is repetitive, there are often several possible overlaps and different choices lead to distinct and often suboptimal reconstructions of the target. We have discovered a very simple preprocess which identifies and melds all fragments that must overlap in any optimal solution, leaving us with an assembly problem that is small and for which the key combinatorial choices are clear. For example, on sequence that is non-repetitive, the process usually melds all fragments providing the one, unique layout for the fragments. This confirms an observation by many that the simple "greedy" algorithm used in most fragment assembly systems is adequate for such problems. The real potential of our simplification arises in problems involving repetitive sequence. For example, given a sequence with 3 exact copies of a 2kb repeat, our process results in 5 contigs that involve 6 overlaps and for which the distinct arrangements are easy to enumerate and the optimal one is self-evident. We will present our simplification strategy (easily understood by non-mathematicians) along with empirical results showing the reductions gained for various types of sequence. We will then discuss some interesting aspects of the reduced problems and how we can lever them to rigorously solve for the target sequence. What is particularly exciting here is that the combinatorial reduction raises our expectations of being able to solve problems involving *constraints* and *repetitive* sequence in a rigorous, accurate way. It also permits us to correctly formulate the assembly problem in terms of the most probable solution using the one-sided Kolmogorov-Smimov statistic. If time permits, we will also report on our progress to assembly 50,000 fragments covering a 700kb stretch of E. Coli with no augmenting or supporting information (data supplied by Fred Blattner's lab). This problem involves 13mb of raw data and is of a scale and order of magnitude larger than any tried before. We are using a combination of supercomputers, state-of-the-art algorithms, and of course, our new simplification technique.
|
Send the url of this page to a friend
Last modified: Wednesday, October 29, 2003
Home * Contacts * Disclaimer
Base URL: www.ornl.gov/hgmis
Site sponsored by the U.S. Department of Energy
Office of Science, Office
of Biological and Environmental Research, Human
Genome Program