Revised 01/27/99
Princeton Sequencing Meeting - January 8, 1999
On January 8th, 1999, a Jackson Laboratory-sponsored
meeting of 50 scientists from the US and European mouse genetics,
sequencing and bioinformatics communities took place on the Princeton
University campus. The order of the day was to establish priorities
and approaches as the mouse sequencing initiative begins.
The following summary of the "Sequencing the Mouse Genome"
Meeting has been submitted by Dr. Barbara Knowles for distribution
to the community.
Sequencing the Mouse Genome: Depth of Coverage
and Informatics Are the Substantive Issues.
Planning for mouse genome sequencing. Francis Collins spoke
first, emphasizing that this meeting was organized by members of
the mouse genetics community (Shirley Tilghman,
Eric Lander, Barbara Knowles and Ken Paigen) not the NIH. He stated
that he felt it was his role to summarize the initiatives underway
and to listen to the ideas and concerns voiced. He described the
initiative to establish a 15X BAC library from C57BL/6J, which will
be fingerprinted and end sequenced, as the basis for the sequencing
effort. He stated that a high-resolution radiation hybrid panel
will be constructed that will be anchored with 15,000 markers, with
the capacity to map an additional 10,000 markers over 3 years. In
order to increase the capacity for generating mouse sequence, an
RFA has just been issued to establish new mouse sequencing centers
with the goal of obtaining 5X coverage by 2003. Proposals are due
at NIH by 4/29/99.
The users' perspective on mouse sequencing. Shirley
Tilghman and Janet Rossant outlined the major utilities of having
mouse sequence and then polled the users present to elicit their
needs. The consensus was that having the sequence of a second mammalian
species will be critical for understanding the human genome sequence
in terms of the intron/exon structure, and for identifying non-coding
functional elements such as control regions, as well as acquiring
an understanding of evolutionary mechanisms. The sequence is immediately
and critically important to mouse geneticists now working on positional
cloning of spontaneous mutations and QTLs, ENU mutagenesis, chromosome
architecture, and the characterization of gene families and proteins.
None of these uses necessarily requires deep coverage. However,
it was pointed out that the data must be of high quality, and there
is a very high cost to finish sequence. The question of the utility
of eventually sequencing other strains at the 2-3X draft level was
posed.
A community-based approach to sequencing the mouse. Eric
Lander outlined an approach to accessing regions of the genome based
on coordinated clone-based allocation. Central projects will generate
library characterization information. A central server will show
contigs based on restriction fingerprint information, identify overlapping
clones from end sequencing information, display marker content,
and BLAST 1X initial coverage of any new BACs versus the existing
clones. Individual groups can "sign out" clones based
on their interest in the region. After a given time the allocation
will expire if sequence data is not generated and a maximum number
of clones per group will be assigned on the basis of recent productivity.
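The sign-out scheme described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name, the 180-day allocation period, the per-group quota table, and the rule that a clone stops counting against a quota once sequence has been deposited are all hypothetical and were not specified at the meeting.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta


@dataclass
class CloneRegistry:
    """Toy sign-out ledger; every name and default here is hypothetical."""
    lease: timedelta = timedelta(days=180)            # assumed allocation period
    allocations: dict = field(default_factory=dict)   # clone -> (group, sign-out date)
    quotas: dict = field(default_factory=dict)        # group -> max open clones
    sequenced: set = field(default_factory=set)       # clones with deposited sequence

    def release_expired(self, today):
        """Return lapsed, still-unsequenced clones to the pool."""
        expired = [clone for clone, (_, start) in self.allocations.items()
                   if clone not in self.sequenced and today - start > self.lease]
        for clone in expired:
            del self.allocations[clone]
        return expired

    def sign_out(self, clone, group, today):
        """Allocate a clone if it is free and the group is under its quota."""
        self.release_expired(today)
        if clone in self.allocations:
            return False
        # Assumption: only clones without deposited sequence count against the quota.
        held = sum(1 for c, (g, _) in self.allocations.items()
                   if g == group and c not in self.sequenced)
        if held >= self.quotas.get(group, 0):
            return False
        self.allocations[clone] = (group, today)
        return True

    def mark_sequenced(self, clone):
        """Record that sequence data has been generated for an allocated clone."""
        if clone in self.allocations:
            self.sequenced.add(clone)
```

In this sketch a quota of zero (the default for an unknown group) blocks any sign-out, which stands in for assigning clone limits on the basis of recent productivity.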
The nature of the sequence product. Richard Gibbs discussed
how draft sequence is produced and emphasized that whichever center
does the BAC draft must also do the finishing, which is 1/3 of the
effort. He went through the process, sequence to assembly to finishing,
for a typical 150kb BAC after it is sheared, subcloned, and the
sequence from the subclones read randomly. Some of the regions are
single-stranded, some regions are misassembled and there are gaps.
Computer programs help fill the virtual gaps and double-ended sequencing
can help solve misassembly and give high accuracy.
He then defined the different draft options: (A) Skimming, at 0.5-1X
coverage (2 reads/kb), no contigs are expected; (B) half-shotgun,
with 3.5-4X coverage (9 reads/kb) in which 2-8kb contigs are possible;
(C) pre-finish, 6.5-8.0X coverage (18 reads/kb) where 8-30kb
contigs are obtained; and (D) deep shotgun or 10X draft (22-24 reads/kb)
in which 15-70kb contigs can be obtained. He suggested establishing
a requirement for a percentage of finished sequence per sequencing
center.
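The relationship between read density and fold coverage in these draft options can be checked with simple arithmetic. The sketch below assumes an average usable read length of 400 bp, a figure that was not given at the meeting; the function names are likewise illustrative. The 150 kb BAC size is the typical size mentioned above.

```python
def coverage(reads_per_kb, read_length_bp=400):
    """Fold coverage implied by a read density (read length is an assumed value)."""
    return reads_per_kb * read_length_bp / 1000.0


def reads_for_bac(bac_kb, reads_per_kb):
    """Total reads needed for one BAC at a given read density."""
    return bac_kb * reads_per_kb


# Read densities (reads/kb) quoted for the four draft options.
DRAFT_OPTIONS = {
    "skimming": 2,
    "half-shotgun": 9,
    "pre-finish": 18,
    "deep shotgun": 22,
}

for name, density in DRAFT_OPTIONS.items():
    print(f"{name}: {coverage(density):.1f}X, "
          f"{reads_for_bac(150, density)} reads per 150 kb BAC")
```

At the assumed read length the first three options fall inside the quoted coverage ranges, while the deep shotgun density comes out somewhat under 10X, which suggests the quoted figures presume slightly longer reads.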
There was lively discussion over the immediate trade-off between
getting finished sequence on a shorter region and having early
access to more extensive, 5-6X, coverage. The genetics community
appeared to favor the latter, while the sequencers favored a balance
so that finishing gets done.
Sequence analysis issues. Lee Hood expressed the opinion
that analysis of chromosome function, genes, ESTs and cDNAs is best
tackled by comparative analysis of mouse and human sequences. He
felt that gene-finding programs are currently spotty. There is great
sequence divergence between species as one goes from cDNAs, where
there is about 70-75% identity, to the non-coding regions, although
there are some areas of conservation in introns, suggesting these
conserved sequences perform some function. Comparative genomics
has proven helpful in finding regulatory sequences; however, protein
motif identification is difficult because of degeneracy. Polymorphisms
of SNPs, single-stranded repeats, and internal deletions in various
mouse strains have helped to identify functional regions in the
regions that have been compared so far. The mouse genome has acquired
about three times as many repeats as the human. Those that have
happened since the divergence of these genomes are useful to study
evolution. Isochores, regions of high GC content, are strongly correlated
with regions of high gene content and they may provide a way to
identify regions for sequencing. Syntenic comparisons also allow
identification of orthologues and paralogues. Genes that work together
in particular pathways may need to stay together. Comparison of
syntenic regions in multiple species can answer this question.
A major effort must be made to develop effective software programs
to assemble these regions with minimum error and to search for relationships
in the noncontiguous sequences. The challenges are to: assemble
5X sequence; analyze 5X sequence data; align conserved regions;
assess orthologues versus paralogues; identify motifs; make graphical
displays; sort through polymorphisms/errors in overlapping sequences;
develop algorithms to find long range patterns; and develop integrated
software tools for use in basic research laboratories. All of this
will have to be accomplished on very large data sets.
Webb Miller pointed out there is no real action plan for informatics
to go along with the sequencing effort, and this must be addressed.
We must find what the users want, make a plan and translate it into
action. For example, how many software engineers would be required
to accomplish the tasks before us? He suggests meetings to bring
software people together with representatives of all of the communities
to define the necessary effort. He also suggested sequence analysis
workshops where multiple groups concentrate on a given data set
and then compare their approaches to determine their relative effectiveness.
The logistics of starting new centers. Elbert Branscomb
spoke to the scaling and start-up issues involved when a new center
must reach 2 million lanes per year to be cost effective. The 5
top labs are within 2x of this level, making the total public capacity
about 8-10M lanes per year; Celera expects to be at 20 times this
level this year. New centers must come on line because the present
rate of mouse genomic sequencing will not allow us to reach the
goal of a completed finished sequence by 2005. He also pointed out
two major problems for new centers: the critical mass required to
support an adequate informatics program; and maintaining quality
in a scale-up operation. In his opinion, for small centers to play
a substantive role they must do high pay-off focused things or the
technology must change.
Gerry Rubin discussed the evolution of sequencing at his center,
which currently does 400K lanes/year. He made a plea for itemizing
the technology that can be transferred from center to center so
that each new center does not have to do its own development,
repeating, in an expensive way, lessons that have already been learned
elsewhere. For example, developing informatics is very expensive.
He suggests there is definitely an optimum size between huge centers
and small centers. To cut costs, yet maintain the control possible
in a smaller center, he suggested the new centers buddy-up with
one another or establish some sort of satellite relationship with
existing large centers.
The sequencing community stressed the need for analysis workshops,
and the necessity for the best sequence, in terms of finish, possible.
This reflected back to the earlier discussion of the priority conflicts
between the users, who want maximum sequence, and the sequencers,
who are concerned with the need to provide finished sequence. The
need for a plan for the logistics of clone distribution, a central
server to manage this process, and the need for better and more
extensive and effective analysis tools were again emphasized.
Three breakout groups then met for an hour and reported back
to the assemblage.
Designing a clone server to support a community-based approach.
Greg Schuler, reporter for this group, stated that there must
be a central clone server to capture the information generated from
a single C57BL/6J library. The primary data elements it should contain
to establish clone identity are: the end sequences; the restriction
digest fingerprint; RH map position if known; and GenBank accession
number to link the sequence (random to finished) as it evolves.
Computed information will include clone overlap obtained by analysis
of fingerprints. Bins of clones will be put together so that by
using related clones one can design an efficient walk to get a contig.
Hits to known ESTs, genes, and STS would be captured automatically
and made publicly available. Some algorithm would be necessary to
prioritize the clones to be sequenced.
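As an illustration of the primary data elements and of computing overlap from fingerprints, a clone record and a crude overlap score might look like the following. The field names, the 10 bp band tolerance, and the shared-band fraction itself are assumptions made for this sketch; it is not the scoring method of any particular fingerprint assembly program, and no design for the NCBI server is implied.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class CloneRecord:
    """Primary data elements for one BAC clone (field names are hypothetical)."""
    name: str
    end_sequences: Tuple[str, str]
    fingerprint: Tuple[int, ...]             # restriction fragment sizes in bp
    rh_position: Optional[str] = None        # RH map position, if known
    genbank_accession: Optional[str] = None  # links the evolving sequence


def shared_band_fraction(a, b, tolerance_bp=10):
    """Bands matched one-to-one between two clones, over the smaller band count."""
    remaining = sorted(b.fingerprint)
    matched = 0
    for band in sorted(a.fingerprint):
        for i, other in enumerate(remaining):
            if abs(band - other) <= tolerance_bp:
                matched += 1
                del remaining[i]   # each band in b may be matched only once
                break
    smaller = min(len(a.fingerprint), len(b.fingerprint))
    return matched / smaller if smaller else 0.0
```

Clones whose score exceeds some chosen threshold could then be placed in the same bin as candidates for a walk; the threshold would have to be tuned on real fingerprint data.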
Allocation of a clone to a laboratory comes with the commitment
to sequencing it. If this is not done within a specific time period
the clone goes back into play. A working group is being formed to
overview this effort and a dedicated server is being designed by
NCBI. Interaction with the European effort is necessary.
Nature of the sequence product and its analysis. Mark Boguski
reported the group's consensus that the minimum sequence for
analysis will be at 5X double-ended coverage, and of the BACs done,
25% will be finished sequence. 5X coverage is sufficient for positional
cloners and those doing ENU mutagenesis and gene discovery. However,
there must be a commitment on the part of the center to finish what
they have started.
Groups getting involved now need to know what it takes
to build a good sequencing operation, and they can only find out
if their standards are rigorous enough when they are required to
finish a significant percentage of the BACs. There is a real added
cost of returning later to get finished sequence, and finished sequence
must be the long-term strategy. Each record needs to be properly
annotated with, at minimum: the name of the reporting lab, the species/strain
from which the clone came, the clone sequence, a determination of
the base-by-base accuracy, instructions for assembling larger contigs
and overlaps, and any features or markers known. Servers should
be available for analyzing homology and synteny.
A major challenge now is to develop transportable informatics tools,
especially for the small user labs that do not have the software
available to the large sequencing centers. Sequence analysis workshops
would be helpful.
The important issues are defining the criteria by which a center
should be qualified and what each center should be doing. The focus
should be on what the user community thinks is important; those
are the clones to finish first.
Starting centers. Gerry Rubin reported that the group thought
the minimum for start-up of a new center would be $2M/yr to produce
100K lanes in the first year but that $3M was probably a more reasonable
estimate, given that it would take about $1M to equip such a sequencing
group. To ameliorate the challenges of setting up a new center it
might buddy-up with an existing center, especially to utilize their
software. A workshop on how to set up new centers was suggested.