Trans-NIH Mouse Initiative

Revised 01/27/99
Princeton Sequencing Meeting - January 8, 1999

On January 8, 1999, a Jackson Laboratory-sponsored meeting of 50 scientists from the US and European mouse genetics, sequencing, and bioinformatics communities took place on the Princeton University campus. The purpose was to establish priorities and approaches as the mouse sequencing initiative begins.

The following summary of the "Sequencing the Mouse Genome" Meeting has been submitted by Dr. Barbara Knowles for distribution to the community.

 

Sequencing the Mouse Genome: Depth of Coverage and Informatics Are the Substantive Issues.

Planning for mouse genome sequencing. Francis Collins spoke first, emphasizing that this meeting was organized by members of the mouse genetics community (Shirley Tilghman, Eric Lander, Barbara Knowles and Ken Paigen), not the NIH. He stated that he felt it was his role to summarize the initiatives underway and to listen to the ideas and concerns voiced. He described the initiative to establish a 15X BAC library from C57BL/6J, which will be fingerprinted and end sequenced, as the basis for the sequencing effort. He stated that a high-resolution radiation hybrid panel will be constructed that will be anchored with 15,000 markers, with the capacity to map an additional 10,000 markers over 3 years. In order to increase the capacity for generating mouse sequence, an RFA has just been issued to establish new mouse sequencing centers with the goal of obtaining 5X coverage by 2003. Proposals are due at NIH by 4/29/99.

The user’s perspective on mouse sequencing. Shirley Tilghman and Janet Rossant outlined the major utilities of having the mouse sequence and then polled the users present to elicit their needs. The consensus was that having the sequence of a second mammalian species will be critical for understanding the human genome sequence in terms of intron/exon structure, for identifying non-coding functional elements such as control regions, and for acquiring an understanding of evolutionary mechanisms. The sequence is immediately and critically important to mouse geneticists now working on positional cloning of spontaneous mutations and QTLs, ENU mutagenesis, chromosome architecture, and the characterization of gene families and proteins. None of these uses necessarily requires deep coverage. However, it was pointed out that the data must be of high quality, and finished sequence comes at a very high cost. The question of the utility of eventually sequencing other strains at the 2-3X draft level was posed.

A community-based approach to sequencing the mouse. Eric Lander outlined an approach to accessing regions of the genome based on coordinated clone-based allocation. Central projects will generate library characterization information. A central server will show contigs based on restriction fingerprint information, identify overlapping clones from end sequencing information, display marker content, and BLAST 1X initial coverage of any new BACs versus the existing clones. Individual groups can "sign out" clones based on their interest in the region. After a given time the allocation will expire if sequence data is not generated and a maximum number of clones per group will be assigned on the basis of recent productivity.
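The sign-out rules Lander describes (allocation, expiry if no sequence is produced, and a productivity-based cap per group) can be sketched in a few lines. This is purely illustrative: the class names, the clone identifier, the 90-day window, and the cap formula are all hypothetical assumptions, not details settled at the meeting.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Assumed expiry window for an unfulfilled clone allocation (hypothetical).
ALLOCATION_WINDOW = timedelta(days=90)

@dataclass
class Allocation:
    """One clone signed out by one group (names are illustrative)."""
    clone_id: str
    group: str
    signed_out: datetime
    sequence_submitted: bool = False

    def expired(self, now: datetime) -> bool:
        # The allocation lapses if no sequence data arrives in time,
        # and the clone goes back into play.
        return (not self.sequence_submitted
                and now - self.signed_out > ALLOCATION_WINDOW)

def max_clones(recent_bacs_finished: int) -> int:
    # Per-group cap scales with recent productivity (assumed formula).
    return max(1, 2 * recent_bacs_finished)
```

A group that signed a clone out in January and submitted nothing by June would see that allocation expire and the clone returned to the pool.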

The nature of the sequence product. Richard Gibbs discussed how draft sequence is produced and emphasized that whichever center does the BAC draft must also do the finishing, which is 1/3 of the effort. He went through the process, sequence to assembly to finishing, for a typical 150kb BAC after it is sheared, subcloned, and the subclones sequenced randomly. Some of the regions are single-stranded, some regions are misassembled, and there are gaps. Computer programs help fill the virtual gaps, and double-ended sequencing can help resolve misassembly and give high accuracy.

He then defined the different draft options: (A) skimming, at 0.5-1X coverage (2 reads/kb), where no contigs are expected; (B) half-shotgun, with 3.5-4X coverage (9 reads/kb), in which 2-8kb contigs are possible; (C) pre-finish, 6.5-8.0X coverage (18 reads/kb), where 8-30kb contigs are obtained; and (D) deep shotgun or 10X draft (22-24 reads/kb), in which 15-70kb contigs can be obtained. He suggested establishing a requirement for a percentage of finished sequence per sequencing center.
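The correspondence between fold coverage and reads per kilobase in these options is simple arithmetic: fold coverage equals reads per kb of target times read length in kb. A minimal sketch, assuming read lengths of roughly 450-500 bp (the assumption that makes the figures above mutually consistent; actual read lengths varied by platform):

```python
def fold_coverage(reads_per_kb: float, read_len_bp: float) -> float:
    """Fold coverage = (reads per kb of target) x (read length in kb)."""
    return reads_per_kb * read_len_bp / 1000.0

# Option (A): 2 reads/kb at 500 bp/read -> 1X coverage.
# Option (B): 9 reads/kb at ~450 bp/read -> ~4X coverage.
```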

There was lively discussion over the immediate trade-off between getting finished sequence on a shorter region or having early access to more extensive, 5-6X, coverage. The genetics community appeared to favor the latter, while the sequencers favored a balance so that finishing gets done.
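The appeal of 5-6X coverage can be put in quantitative terms with the standard Lander-Waterman estimate (not cited at the meeting, but the usual way such figures are derived): at c-fold random coverage, the expected fraction of bases represented in at least one read is 1 - e^(-c).

```python
import math

def fraction_covered(c: float) -> float:
    # Lander-Waterman expectation: fraction of target bases hit
    # at least once under c-fold random shotgun coverage.
    return 1.0 - math.exp(-c)

# At 5X roughly 99.3% of bases are covered; at 6X roughly 99.75%.
# The remaining gaps and low-quality regions are what finishing addresses.
```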

Sequence analysis issues. Lee Hood expressed the opinion that analysis of chromosome function, genes, ESTs and cDNAs is best tackled by comparative analysis of mouse and human sequences. He felt that gene-finding programs are currently spotty. Sequence divergence between the species increases as one goes from cDNAs, where there is about 70-75% identity, to the non-coding regions, although there are some areas of conservation in introns, suggesting these conserved sequences perform some function. Comparative genomics has proven helpful in finding regulatory sequences; however, protein motif identification is difficult because of degeneracy. Polymorphisms such as SNPs, simple-sequence repeats, and internal deletions in various mouse strains have helped to identify functional regions in the regions compared so far. The mouse genome has acquired about three times as many repeats as the human. Those that have arisen since the divergence of these genomes are useful for studying evolution. Isochores, regions of high GC content, are strongly correlated with regions of high gene content, and they may provide a way to identify regions for sequencing. Syntenic comparisons also allow identification of orthologues and paralogues. Genes that work together in particular pathways may need to stay together; comparison of syntenic regions in multiple species can answer this question.

A major effort must be made to develop effective software programs to assemble these regions with minimum error and to search for relationships in the noncontiguous sequences. The challenges are to: assemble 5X sequence; analyze 5X sequence data; align conserved regions; assess orthologues versus paralogues; identify motifs; make graphical displays; sort through polymorphisms/errors in overlapping sequences; develop algorithms to find long range patterns; and develop integrated software tools for use in basic research laboratories. All of this will have to be accomplished on very large data sets.

Webb Miller pointed out there is no real action plan for informatics to go along with the sequencing effort, and this must be addressed. We must find what the users want, make a plan and translate it into action. For example, how many software engineers would be required to accomplish the tasks before us? He suggests meetings to bring software people together with representatives of all of the communities to define the necessary effort. He also suggested sequence analysis workshops where multiple groups concentrate on a given data set and then compare their approaches to determine their relative effectiveness.

The logistics of starting new centers. Elbert Branscomb spoke to the scaling and start-up issues involved when a new center must reach 2 million lanes per year to be cost-effective. The 5 top labs are within 2x of this level, making the total public capacity about 8-10M lanes per year; Celera expects to be at 20 times this level this year. New centers must come on line because the present rate of mouse genomic sequencing will not allow us to reach the goal of a complete, finished sequence by 2005. He also pointed out two major problems for new centers: the critical mass required to support an adequate informatics program, and maintaining quality in a scale-up operation. In his opinion, for small centers to play a substantive role they must do high pay-off, focused things, or the technology must change.

Gerry Rubin discussed the evolution of sequencing at his center, which currently does 400K lanes/year. He made a plea for itemizing the technology that can be transferred from center to center so that each new center does not have to do its own development, repeating, at great expense, lessons that have already been learned elsewhere. For example, developing informatics is very expensive. He suggested there is an optimum size between huge centers and small centers. To cut costs, yet maintain the control possible in a smaller center, he suggested that new centers buddy up with one another or establish some sort of satellite relationship with existing large centers.

The sequencing community stressed the need for analysis workshops, and the necessity for the best sequence, in terms of finish, possible. This reflected back to the earlier discussion of the priority conflicts between the users, who want maximum sequence and the sequencers who are concerned with the need to provide finished sequence. The need for a plan for the logistics of clone distribution, a central server to manage this process, and the need for better and more extensive and effective analysis tools were again emphasized.

Three breakout groups then met for an hour and reported back to the assemblage.

Designing a clone server to support a community-based approach. Greg Schuler, reporter for this group, stated that there must be a central clone server to capture the information generated from a single C57BL/6J library. The primary data elements it should contain to establish clone identity are: the end sequences; the restriction digest fingerprint; RH map position, if known; and a GenBank accession number to link the sequence (random to finished) as it evolves. Computed information will include clone overlap obtained by analysis of fingerprints. Bins of clones will be put together so that, by using related clones, one can design an efficient walk to get a contig. Hits to known ESTs, genes, and STSs would be captured automatically and made publicly available. Some algorithm would be necessary to prioritize the clones to be sequenced.
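The primary and computed data elements Schuler lists could be grouped into a per-clone record along these lines. The field names are hypothetical; the actual NCBI server schema was still being designed at the time of the meeting.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class CloneRecord:
    """Illustrative per-clone record for the central server (assumed schema)."""
    # Primary data elements establishing clone identity.
    clone_id: str
    end_sequences: Tuple[str, str]            # sequence read from each BAC end
    fingerprint: List[int]                    # restriction digest fragment sizes
    rh_map_position: Optional[float] = None   # RH map position, if known
    genbank_accession: Optional[str] = None   # links the evolving sequence entry
    # Computed information.
    overlaps: List[str] = field(default_factory=list)     # overlapping clone ids
    marker_hits: List[str] = field(default_factory=list)  # ESTs, genes, STSs
```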

Allocation of a clone to a laboratory comes with the commitment to sequence it. If this is not done within a specific time period, the clone goes back into play. A working group is being formed to oversee this effort, and a dedicated server is being designed by NCBI. Interaction with the European effort is necessary.

Nature of the sequence product and its analysis. Mark Boguski reported the group’s consensus that the minimum sequence for analysis will be at 5X double-ended coverage, and of the BACs done, 25% will be finished sequence. 5X coverage is sufficient for positional cloners and those doing ENU mutagenesis and gene discovery. However, there must be a commitment on the part of the center to finish what they have started.

Groups getting involved now need to know what it takes to build a good sequencing operation, and they can only find out whether their standards are rigorous enough when they are required to finish a significant percentage of the BACs. There is a real added cost of returning later to get finished sequence, and finished sequence must be the long-term strategy. Each record needs to be properly annotated with, at minimum: the name of the reporting lab, the species/strain from which the clone came, the clone sequence, a determination of the base-by-base accuracy, instructions for assembling larger contigs and overlaps, and any features or markers known. Servers should be available for analyzing homology and synteny.

A major challenge now is to develop transportable informatics tools, especially for the small user labs that do not have the software available to the large sequencing centers. Sequence analysis workshops would be helpful.

The important issues to define are: the criteria by which a center should be qualified, and what each center should be doing. The focus should be on what the user community thinks is important; those are the clones to finish first.

Starting centers. Gerry Rubin reported that the group thought the minimum for start-up of a new center would be $2M/yr to produce 100K lanes in the first year, but that $3M was probably a more reasonable estimate, given that it would take about $1M to equip such a sequencing group. To ameliorate the challenges of setting up a new center, it might buddy up with an existing center, especially to utilize its software. A workshop on how to set up new centers was suggested.
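The start-up figures above imply a first-year cost of roughly $20-30 per lane, far from the ~2M-lane annual scale Branscomb cited as cost-effective. The arithmetic, using only the numbers reported by the breakout group:

```python
def cost_per_lane(annual_budget_usd: float, lanes_per_year: int) -> float:
    """First-year unit cost implied by a center's budget and lane output."""
    return annual_budget_usd / lanes_per_year

# Breakout-group estimates: $2M/yr for 100K lanes, or $3M/yr
# counting the ~$1M needed to equip the group.
low_estimate = cost_per_lane(2_000_000, 100_000)    # $20 per lane
high_estimate = cost_per_lane(3_000_000, 100_000)   # $30 per lane
```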
