Lansdowne Conference Center
Lansdowne, Virginia
December 7-8, 1998
Introduction
Databases are vital for research in
biology and medicine. Databases serve many roles, including the
capture and organization of key information, integration of data from
disparate sources, and facilitation of the formulation of new
hypotheses and new perspectives. Research communities are facing many
challenges, including a flood of new data, the rapid growth in data
diversity, and the complexity of data produced by cross-disciplinary
investigation. Robust and highly interconnected databases are
essential to address these scientific challenges and to capitalize on
new research opportunities. Insufficient support or ineffective
implementation of model organism databases (MODs) will slow the pace
and increase the cost of biological discovery.
Databases From the Perspective of Model Organism Research
In the last 100 years, research on a
handful of organisms has played a profound role in advancing our
understanding of the biological and biomedical sciences. The need to
capture, organize, and access data from these model organisms has
driven the creation of organism-specific databases. These model
organism databases have allowed researchers to sift through masses of
data, to gain access to information or materials they might have
missed, and to go in new research directions. Comparative analysis has
proven to be valuable in increasing our understanding of biological
processes, including those in humans. Because these MODs are of
immense value, offer tremendous opportunities, and represent a
significant fiscal investment, it is timely to examine issues
pertaining to the establishment, maintenance, evaluation, and future
directions of model organism databases. Thus, the NIH convened the
Model Organism Database Workshop, which brought together an
international group representing developers and users of established
databases, investigators interested in developing new databases, and
funding agencies. The goals were to assess the range of data that MODs
capture, evaluate data acquisition strategies, identify means of
community input and support, establish review criteria for new and
existing database projects, and consider mechanisms to support
coordinated efforts. All MODs benefit when each of them
succeeds.
Recommendations of the Workshop
This report addresses MODs as research
resources. We outline the salient features toward which the MODs,
individually and in concert, should strive. We do not have the
knowledge to preordain a "one size fits all" database
project plan. We can nevertheless state the general goals, exemplify
some ways that MODs might achieve these goals, and consider how these
goals should be translated into review criteria for scientific and
administrative evaluation of MODs. The report also addresses
additional initiatives needed to support the MODs and ensure that the
broad U.S. biomedical research community has access to them.
The MODs as Research Resources
MODs deal with two sets of research
communities, with different needs and expectations:
- Model organism community:
This community provides the data to a MOD, adds value by
contributing to the curation of the data, and comprises a major set
of users who need access to a great deal of specialized information,
such as strain collections.
- General research community:
This community uses but does not directly contribute information to
MODs. Unlike the model organism community, the general community
does not usually understand the specialized jargon and nomenclature
for a model organism. The MODs should provide accessible summaries
of genomic, functional, and phenotypic information in addition to
full access to the underlying datasets.
The Model Organism Databases: A Life Cycle Perspective
Database projects have different needs
and goals at different points in their life cycles. The overarching
goal should always be to meet the needs of the research communities.
Some Features Common Throughout the Life Cycle
- Tools to facilitate data submission
should be developed or imported. Both human interface and automated
machine-readable submission tools are needed.
- Where appropriate, raw data should
be captured so that they can be reanalyzed. Because of the expense
of capturing raw data, there must be a balance between taking in raw
and summarized data.
- Curation is demanding and requires
a high level of domain expertise. Ph.D.-level curators are needed,
typically with research experience in the particular experimental
system.
- Continual development of tools that
support queries and graphical summaries of large data sets is
important. Query tools and graphical viewers should address the
needs of the general and the expert user communities, balancing ease
of use with depth of information.
- Controlled vocabularies and
standardized nomenclatures should be developed and implemented to
support database organization and querying. The levels of controlled
vocabularies and free text should be established and periodically
reevaluated.
- Timely and effective user support
is essential in maintaining good relations with the community.
- The MOD data presentation
represents only one view of the biological world. Hence, the MODs
should provide third parties with readily ported access to their
entire data sets so that the information may be viewed in other
ways.
- Database objects such as genes must
have unique permanent identifier numbers in order to provide stable
links, track changes in the names of the objects, and maintain
synonym lists.
- Each MOD should establish extensive
cross-links to other MODs and other types of relevant databases
through the exchange of linked lists of objects and their
identifiers.
- Each MOD should collaborate with
other MODs and relevant databases to develop and share improved
technologies, methods, and controlled vocabularies.
- The MODs should provide gene lists
with Medline identifiers so that Medline curators can build the
links to model organism genes reported in publications. This
facilitates Medline-MOD links for users and aids the identification
and parsing of the model organism literature.
- The MODs should encourage journals
to develop mechanisms to promote MOD user submissions and to
incorporate MOD object identifier numbers as well as valid names.
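As an illustration of the identifier and cross-link recommendations above, the following is a minimal sketch and does not describe any existing MOD's implementation. The class names, the identifier format (`MOD:0000001`), and the method names are all hypothetical. It shows a registry in which each object keeps one permanent identifier while its name changes, retired names are kept as synonyms, and a list of identifiers and links can be exported for exchange with another database.

```python
from dataclasses import dataclass, field


@dataclass
class GeneRecord:
    """One database object with a permanent identifier (hypothetical layout)."""
    identifier: str                                   # never changes once assigned
    name: str                                         # current valid name; may change
    synonyms: list = field(default_factory=list)      # retired names, oldest first
    cross_links: dict = field(default_factory=dict)   # other database -> its identifier


class GeneRegistry:
    """Toy registry that assigns identifiers and tracks name changes."""

    def __init__(self, prefix="MOD"):
        self.prefix = prefix
        self._records = {}      # identifier -> GeneRecord
        self._name_index = {}   # any current or past name -> identifier
        self._next = 1

    def register(self, name):
        """Create a record and return its new permanent identifier."""
        identifier = f"{self.prefix}:{self._next:07d}"
        self._next += 1
        self._records[identifier] = GeneRecord(identifier, name)
        self._name_index[name] = identifier
        return identifier

    def rename(self, identifier, new_name):
        """Change the valid name; the old name is kept as a synonym."""
        record = self._records[identifier]
        record.synonyms.append(record.name)
        record.name = new_name
        self._name_index[new_name] = identifier

    def lookup(self, name):
        """Resolve a current or retired name to its record."""
        return self._records[self._name_index[name]]

    def add_cross_link(self, identifier, database, foreign_id):
        """Record the identifier that another database uses for this object."""
        self._records[identifier].cross_links[database] = foreign_id

    def export_link_list(self):
        """Return (identifier, current name, cross-links) for every record."""
        return [(r.identifier, r.name, dict(r.cross_links))
                for r in self._records.values()]
```

Because lookups go through the name index, a query using a retired name still reaches the same record, which is the property that keeps links stable when nomenclature changes.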
Guiding Principles During the Establishment Phase
When does an organism warrant its own
database? Although it is difficult to come up with definitive answers,
important criteria include the following:
- The experimental system really is a
model system, which means that it is important for studying
some biological processes or human health issues.
- The information should be rich
enough to be the object of higher levels of analysis or of analysis
not available in the primary literature.
- The community has an accepted
system for nomenclature and a gene registry.
- The value-added data of the MOD
should be of interest to both the organismal community and the
general research community.
Once the need for a MOD is
established, some priorities during the establishment phase are as
follows:
- Particularly in the early phases of
the project, there may be much to be gained by piggybacking on the
software and technical expertise of existing MODs. Expanding an
existing MOD or affiliating with one should be considered first.
Highly portable database software could be considered next.
Alternatively, shared data structures, schemas, and tools would
enable software engineers to build rapidly on other database
platforms. This would permit the new MOD to focus on issues of data
curation while gaining a better, "field tested" view of
the needs of its community and would promote the cost-effectiveness
of the project.
- As considerable expertise, both
technical and strategic, is available within the existing MODs, they
should play a mentoring role in facilitating the establishment of
new MOD projects. Hence, ways should be sought to provide
interactions among existing and embryonic MODs. The planners of new
MODs may wish to contact NIH staff early in the planning stage of a
new database project; these staff members are knowledgeable about
the existing projects and can facilitate the necessary contacts.
Travel funds should be provided to the existing MODs for visitor
programs. Participation of individuals from the new MODs at periodic
meetings of the MOD groups would also facilitate interaction. The
availability of a comprehensive WWW site describing MOD sites would
be of considerable help for new MOD groups.
- The most essential needs of the
model organism community should be addressed first. This is crucial
to get those researchers who will be both providers and major users
of the data to identify themselves as the "partners" in
the MOD. The community needs should be assessed through a
combination of advisors and directed surveys. Advisory committees
should be established so that they are independent critics and
intermediaries to the research community.
- Establishing a new database
enterprise is a long-term and complex commitment and should be
implemented in steps. This allows the MOD to address the initial
organizational and logistical issues within the context of a
reasonable set of production goals. MODs and funding agencies need
to plan for the stepwise ramp-up in responsibilities and funding.
- Priority should be given to genomic
and genetic data, and then more complex phenotypic data classes can
be addressed. Phenotypic and expression pattern data should be
treated as attributes of genomic/genetic data objects when possible.
It should be recognized that much genetic and phenotypic data may be
more expensive to collect than genomic data, but they are still
essential for the scientific community.
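The last point above, treating phenotypic and expression pattern data as attributes of genomic or genetic data objects, can be sketched as a data structure. This is a minimal illustration under assumed names: the classes, fields, and the helper method are hypothetical and are not drawn from any MOD's schema.

```python
from dataclasses import dataclass, field


@dataclass
class Phenotype:
    term: str        # ideally drawn from a controlled vocabulary
    note: str = ""   # optional free-text detail


@dataclass
class ExpressionPattern:
    tissue: str
    stage: str


@dataclass
class Allele:
    """A genetic data object; phenotypes hang off it as attributes."""
    identifier: str
    phenotypes: list = field(default_factory=list)


@dataclass
class Gene:
    """A genomic data object that owns its alleles and expression patterns."""
    identifier: str
    name: str
    alleles: list = field(default_factory=list)
    expression: list = field(default_factory=list)

    def phenotype_terms(self):
        """Return the distinct phenotype terms across all alleles, sorted."""
        return sorted({p.term for a in self.alleles for p in a.phenotypes})
```

With this layout the genomic and genetic objects can be loaded first, and phenotype or expression records can be attached later without changing the identifiers that other databases link to.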
Guiding Principles During the Maintenance Phase
Many of the features inherent in the
establishment phase continue to be important as the MOD matures, and
there are additional responsibilities:
- With regular input from the
advisory groups and others in the communities, the MOD should
reevaluate its priorities, policies, and procedures with an eye
towards maintaining a modern and effective resource that supports
the rapidly advancing science in the model organism.
- Bioinformatics is a rapidly
evolving field. Each MOD needs a budget for developing innovative
solutions to problems or for migrating to new platforms while
maintaining the daily operation of the database project.
- The MOD must address the needs of
the general research community as well as the specialized organism
community. Doing so requires outreach to the broader community and
is also likely to involve alternative data views without jargon.
Such outreach might include demonstrations at a range of scientific
meetings and live on-line classes for scientists at their home
institutions.
On the Reproductive, Senescence, and Death Phases of the MOD Life Cycle
MODs are not static or immortal. Over
time even a successful MOD may find it efficient to transfer some
types of information to a central database, or it may become so large
and cumbersome that it proves necessary to divide it into smaller
projects. MODs have to be able to change as needed.
MODs are complicated projects and can
run into difficulties for many reasons. The Human Genome Database
(GDB) example shows that early recognition of such difficulties and
early intervention are preferable to allowing the MOD to undergo a
lingering death. The workshop did not come to any explicit conclusions
on how to achieve early detection and therapy, but part of the answer
is to encourage critical and constructive review by the external
advisory committees. Another part may be mechanisms that encourage
interaction among the existing databases, such as periodic MOD
workshops or visitor programs. A workshop would allow the database
providers to talk freely about problems as well as solutions, which is
essential for improving MOD projects through cooperative efforts.
Computer experts are in great demand.
This makes MODs vulnerable to premature demise through the loss of key
bioinformatics people. Affiliating new MODs with existing groups
permits the technical groups to grow in size and therefore be less
sensitive to staff loss.
Guiding Principles for the Review and Funding of MODs
- Each MOD is a critical research
resource, which has important implications for evaluation.
- From the early stage of project
development, MOD applicants should work with their advisory groups,
other MOD projects, and NIH program staff to prevent as many
pitfalls as possible in developing credible database applications.
- The initial review group must be
put together carefully and must receive considerable education about
the individual MOD. Representation of the specific and general
research communities is essential, possibly including some external
advisors and some officers of governing bodies that exist for some
model organism communities. Other reviewers need to have the
necessary computational or database project management expertise.
The goal of the review committee education process is to ensure
that, regardless of the funding mechanism, these grant applications
are reviewed as research resources.
- Review criteria should be well
established and understood consistently by both reviewers and
applicants. As with any grant application, a complex mixture of
positives and negatives must be distilled into a priority score and
budget recommendations. Although the features of new and ongoing
MODs were listed above as suggestions to provide flexibility and
encourage innovation, the applicant must demonstrate that the goals
of the review criteria have been met.
Specific Review Criteria
Documentation should be provided to
demonstrate the following:
- The MOD is addressing critical
needs of the model organism and general communities.
- The value added to the data for the
primary community.
- The results of community surveys.
- The effective composition and use
of external advisory committees.
- The effectiveness of user support.
- The outreach and education efforts
to the user communities.
- Data on WWW database hits, by a
method that NIH staff and MODs should establish. Although these data
have some problems, trends of hit frequency over time are
informative.
- Interactions with other MODs and
database groups.
- How the MOD has achieved
cost-effectiveness and evaluated technology and software. Choices
for the more expensive of alternative approaches must be carefully
justified.
- The effectiveness of the curation
models.
- How the appropriate data object
relationships are represented in the MOD, and the types of queries
that the database supports.
- Database performance, ease and
transparency of use, interface design, documentation, and data
access.
- How the MOD supports the
advancement of science relating to the data it contains, and how it
will respond to scientific advances.
Funding Considerations
- The workshop considered that the
MODs and other database projects are substantially underfunded,
judged against conservative figures for industry funding distributions
(10 to 15 percent of the research budget devoted to informatics) and
against the amount of support for hypothesis-driven research in a
model system.
Budget increases for effective database support are essential for
maintaining an outstanding publicly funded research enterprise.
- In general, established databases
with strong track records should be on 5-year funding cycles,
whereas those in flux require more frequent review. In both cases,
periodic (typically annual) administrative review would be valuable,
such as program officer visits to the MOD sites or attendance at
external advisory committee meetings.
Additional Recommendations
- Many aspects of database
development and implementation are still experimental. Funding
independent research projects addressing these issues is important
to support the MODs. These projects might focus on important areas,
such as the development of functional ontologies or the production
of reusable and readily portable software modules for data
acquisition, maintenance, analysis, or display.
- There is serious concern that the
capabilities offered by the MODs will outstrip the ability of users
to take advantage of them. The difficulty of obtaining NIH research
grant funding for computer hardware is completely at odds with the
need for effective informatics infrastructure and should be
resolved. The other potential bottleneck in delivering informatics
support is network speed. Universal high-speed networks will be
essential for transporting data sets and display tools across the
WWW.
- The need for increased training in
bioinformatics at all levels is well recognized, and the workshop
encourages efforts to support such training. The MODs are important
training sites in such programs, and affiliation of MODs with such
programs should be fostered.
Concluding Remarks
A great deal of important information
was exchanged, and consensus was reached, during this effective
workshop. The discussions were consistently frank and constructive.
Nonetheless, the workshop could not do justice to many topics in
the limited time, such as how various MODs should interact and
coordinate, which data types should be provided to users from the
nonorganismal community, and how curation should be done. Future
workshops bringing together database providers, users, and NIH staff
should be strongly encouraged. Other mechanisms for encouraging
scientific interaction and collaboration among the database providers
should also be considered.
 