Report of the Protein Structure Initiative Assessment Panel

Message from the NIGMS Director

As part of an ongoing assessment process of the larger programs supported by NIGMS, a working group of the National Advisory General Medical Sciences Council assembled a panel of scientific experts in structural biology and related fields in June 2007 to assess the Protein Structure Initiative (PSI). We greatly appreciate the thoughtful analysis and scientific judgement this panel applied to its charge to examine the goals, progress, impact, opportunities, and limitations of the currently funded programs. The panel’s report will help guide us as we examine our processes for the development and management of this and other initiatives.

The panel’s final report includes six main conclusions that highlight many significant accomplishments of the PSI, but also raise some concerns. I would like to take this opportunity to offer several comments.

The first two conclusions, related to the productivity of the PSI pipelines in determining novel protein structures and to the development and maturation of methodology and technology accomplished by the PSI researchers, are quite positive.

With regard to the conclusion that the dissemination of PSI results has been poor and that PSI efforts to facilitate the use of structures and materials by the broad scientific community have remained ad hoc and low-throughput, it is important to note that information and materials have always been available from the PSI centers individually as a requirement of the program. However, we recognized the need for centralized access to PSI products and established the PSI Structural Genomics Knowledgebase and Materials Repository. A variety of factors delayed the deployment of these resources, but both are now operational (although still under active development). Groups that oversee the PSI have had extensive discussions about how best to disseminate PSI results to the broad scientific community, and the panel’s input reinforces the importance of pursuing this goal even more energetically.

The panel noted that the PSI effort has been aimed at coverage of protein sequence space but that sequence space is still growing linearly with the number of deposited sequences, making the PSI an open-ended endeavor and full coverage of sequence space an unattainable goal. This finding raises an important technical consideration that has been key to the design of the PSI. Along with growth in the number of known protein sequences, largely through genome sequencing projects, the number of sequence families has also continued to grow approximately linearly. However, most of the new families are quite small, containing only one or a few known members, and most of the new sequences fall into existing families. Since three-dimensional structure is much more evolutionarily conserved than is sequence, a single experimental structure within a family provides an initial framework for understanding the structures of all members of the family and for the development of homology models. The target selection strategy utilized in the current phase of the PSI has been directed largely toward the determination of structures within large sequence families for which no experimental structure is yet available, and this approach has the potential to achieve substantial coverage of sequence space.
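The coverage argument above can be illustrated with a small sketch. The family sizes below are hypothetical and are not PSI data; the function name `coverage_by_largest_families` is likewise invented for illustration. The sketch assumes that one experimental structure in a family "covers" every sequence in that family, which is the simplification the paragraph itself makes.

```python
def coverage_by_largest_families(family_sizes, n_structures):
    """Fraction of all sequences covered when one structure is solved
    in each of the n_structures largest families (illustrative assumption:
    one structure covers every member of its family)."""
    if not family_sizes:
        raise ValueError("family_sizes must not be empty")
    ordered = sorted(family_sizes, reverse=True)
    total = sum(ordered)
    return sum(ordered[:n_structures]) / total


# Hypothetical distribution: a few large families and many singletons.
sizes = [500, 300, 100, 50, 20] + [1] * 30   # 35 families, 1000 sequences
fraction = coverage_by_largest_families(sizes, 3)
print(f"3 of {len(sizes)} families cover {fraction:.0%} of sequences")
```

In this made-up distribution, the number of families can keep growing through new singletons while structures for a handful of large families still account for most sequences.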

With regard to the panel’s comments on homology modeling, the most recent Critical Assessment of Structure Prediction (CASP) meeting revealed significant improvements in homology modeling, and the PSI has contributed the majority of the structures used to drive the CASP assessment. In addition, the value of homology models depends tremendously on the question being asked: While a highly accurate model is required to examine issues such as the presence of particular chemical interactions, even a rough model revealing the overall fold of a domain can be extremely useful for such tasks as mapping the positions of mutations or polymorphisms. This point was discussed in some detail in a 2001 review article.

The panel concluded that the focus of PSI efforts on coverage of sequence space has resulted in structures that are by and large divorced from known biological function. This is certainly true for some of the structures determined through the PSI, but approximately 35% have an experimentally assigned function and an additional 45% have a function assigned through computational analysis, so roughly 80% carry some functional assignment. Moreover, some exciting new methods for assigning detailed functions to proteins computationally have recently been developed by PSI investigators and other researchers (see examples 1, 2, 3). In addition, the PSI Structural Genomics Knowledgebase and the Open Protein Structure Annotation Network (TOPSAN) are exploring methods to engage the scientific community in helping to define and refine functional annotation.

Finally, the panel stated that funding for the PSI structure determination centers represents approximately one-fifth to one-fourth of total structure determination expenditures by major funding sources in the United States. This conclusion is based on budget information for all of the PSI centers as well as related NIGMS investments, leading to the implication that annual funding for the PSI has averaged approximately $85 million. In reality, the total cost of the 14 PSI-2 research, modeling, and outreach centers this fiscal year is approximately $65 million. The cost of the PSI components focused on high-throughput structure determination (the four large-scale centers) is approximately $40 million.

I hope that these comments provide valuable context to the panel’s conclusions and help inform a vigorous discussion about future NIGMS support of structural biology. The next step will be one or more workshops focused on the roles of structural studies in biomedical research, including experimental methods and the use of homology models.

Jeremy M. Berg, Ph.D.
Director, NIGMS

December 2007

Executive Summary

This report presents the conclusions of the Panel convened by the National Institute of General Medical Sciences (NIGMS) to assess the Protein Structure Initiative (PSI). On September 24, 2007, the Panel met on the NIH campus and heard presentations from several PSI grantees, from PSI program administrators and scientific advisors, and from members of the scientific community who are skeptical of the PSI program. Before the meeting, the Panel received input from the broader research community in the form of responses to an electronic questionnaire distributed by NIGMS. PSI grantees provided written answers to questions from the Panel, and also provided access to an internal web site with summaries of PSI progress and plans.

The analysis carried out by the Panel is presented in two parts below. First, in Part A, an evaluation is provided of how well the currently funded PSI centers are making progress towards the goals of PSI-2, as presently defined. The Panel, on the whole, is positive about the functioning of the PSI centers. Part B of this report is an assessment of the scientific goals of the PSI, informed by the results obtained thus far, and the value of the financial investment in it. Here, the Panel is negative in its conclusions. Questions are raised about the validity of the PSI enterprise as presently defined, and the NIH is urged to undertake a fundamental reconsideration of the goals of this endeavor. Suggestions for future prospects are presented at the end of the report.

Main Conclusions

The PSI has been highly successful in establishing an automated pipeline for protein production and structure determination. The generation of new structures is proceeding at a commendable rate. The 2700 structures determined in PSI centers are the greatest evidence of the success of the pipeline.
The aim of the PSI to improve methodology and technology is clearly of great importance, and progress in this area is laudable. The PSI has provided an excellent format for scientists to focus on driving methodological advances without necessarily coupling them to immediate biological understanding. While more evolutionary than revolutionary, PSI efforts have accelerated many technology developments that are now increasingly used by the broader research community and deserve continued support.
Dissemination of PSI results (structures and materials) has been poor. PSI efforts to facilitate the use of structures and materials by the broad scientific community have remained ad hoc and low throughput while the structural pipeline has moved successfully to high throughput. The use of structural information is best driven by scientists who wish to understand biological mechanism; deposition of structures in the PDB therefore does not reach the key audience. This failing is critical because the long-term value of PSI-generated structures will depend on the value of the information being generated. The PSI KnowledgeBase and the PSI Materials Repository, now being formed, should have been launched at the start of the PSI, and may flounder as they rush to catch up with a huge backlog of structures and materials.
The PSI effort has been aimed at full coverage of fold space and sparse coverage of "sequence space." Although fold space may be nearing complete coverage, sequence space is still growing linearly with the number of deposited sequences, making the PSI an open-ended endeavor and full coverage of sequence space an unattainable goal. The large numbers of new structures determined by the PSI effort have not led to significant improvements in the accuracy of homology modeling that would allow modeling of more biologically relevant proteins, complexes or conformational states. Taken together, the lack of an end point for the PSI and the lack of modeling improvements indicate that the concepts underlying the current PSI effort are seriously flawed.
The focus of PSI effort on fold and sequence has resulted in structures that are by and large divorced from biological function. There is little incentive for PSI groups to follow up on the few structures of biological importance they have determined. The consensus of both the Panel and the broader community is that the best structural work is done as part of research into a larger biological problem, and that structure interpretation requires closely coupled biophysical, biochemical, genetic and in vivo data. It is increasingly clear that virtually all proteins function by joining multi-protein complexes, binding ligands, changing conformation or undergoing posttranslational modifications. The evolution and diversity of biological structures need to be coupled to biological function. None of this is currently captured by the high-throughput methods of the PSI. By not reaching out seriously to the biological community, the PSI lost an opportunity to expand its vision beyond the original fold space and the metrics of number of structures solved.
Funding for the PSI structure-determination centers represents approximately one-fifth to one-quarter of total structure-determination expenditures by major funding sources in the U.S. At these levels, and given the weakness of the underlying concepts and lack of biological relevance of most PSI structures, the large PSI structure-determination centers are not cost-effective in terms of benefit to biomedical research.

A. Progress of the PSI Toward Its Current Goals

A.1 Main Conclusions

The major goals of PSI-2 are: (i) to increase the number of sequence families with structural representatives, including families with high biological impact, (ii) to continue technology development, especially for challenging classes of proteins, such as membrane proteins, and (iii) to facilitate the use of structures by the broad scientific community. The PSI has succeeded in many of its major goals.

The PSI centers have established successful pipelines for protein production, crystallization and structure determination. Large regions of sequence space have been populated with new structures. Over the past few years, the PSI has contributed more novelty to the protein structural database than has any other effort worldwide.
The PSI centers have matured many new technologies, and the activity around the PSI has led to impressive advances that have a broad impact and are much appreciated by the structural community. These include, but are not limited to, automation for synchrotron beamlines, new vectors for production of recombinant protein, automation of crystallization screening, nanotechniques for crystallization, and automation for many steps in routine cloning, expression and purification.
Although the final crystal structures are being disseminated through the PDB, there has been limited effort to date to collate or analyze these structures so as to usefully inform biologists. Similarly, communication of process improvements and available materials has been inconsistent, making it difficult for the biological community at large to appreciate and build on the accomplishments of the PSI.

A.2 PSI Output: Structures

In terms of the goal of determining novel structures, the PSI centers are making excellent progress. One measure of novelty is the "weighted chain" count, introduced by Michael Levitt,¹ which reduces the impact of protein chains with 25% or greater sequence identity to related structures. Since the PSI began in 2000, 71% of all chains solved by the PSI have been novel (based on the weighted chain count updated to 9 August 2007). This level of novelty contrasts with the situation for all structure determination outside the four large PSI structure-determination centers, where only 16% of the chains determined correspond to a unique weighted chain. The growth in output of the PSI centers is correlated with a decline in the output of novel structures (as defined by the weighted chain criterion) from other efforts, so that efforts outside the PSI are contributing an ever-decreasing fraction of novel solved structures. The efficiency of the PSI centers is also increasing, with the number of weighted chains solved per quarter having doubled from mid-2005 to mid-2006.
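As a rough illustration of how a weighted count discounts redundant chains, the sketch below assumes that chains have already been grouped into clusters at the 25% sequence-identity threshold and that each chain is weighted by the reciprocal of its cluster size. This weighting scheme, the function names and the cluster labels are assumptions made for illustration; the report does not spell out the definition used in the cited work, and the sketch should not be read as that method.

```python
from collections import Counter


def weighted_chain_count(cluster_ids):
    """Sum 1/n over all chains, where n is the size of the chain's cluster.

    Under this assumed scheme each cluster contributes a total weight of 1,
    so many near-identical chains count no more than a single novel chain.
    """
    sizes = Counter(cluster_ids)
    return sum(1.0 / sizes[c] for c in cluster_ids)


def novelty_fraction(cluster_ids):
    """Weighted chain count divided by the raw number of chains."""
    if not cluster_ids:
        raise ValueError("cluster_ids must not be empty")
    return weighted_chain_count(cluster_ids) / len(cluster_ids)


# Hypothetical cluster labels for seven solved chains (not real data).
chains = ["A", "A", "A", "B", "C", "C", "D"]
print(round(weighted_chain_count(chains), 3))  # seven chains, four clusters
print(round(novelty_fraction(chains), 3))
```

A set of chains that all fall into distinct clusters has a novelty fraction of 1, while repeated determinations of close homologs drive the fraction down.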

The PSI has clearly been successful in its goal of increasing the coverage of structural space, and the principal investigators and staff members of the structure determination centers are to be commended for their efforts. The PSI has been less successful in making links to biology, which was a stated goal of PSI-2. This issue is discussed in Section B.

A.3 PSI Output: Technologies

A.3.1 Overview

Technology development is one of the most highly valued and praised aspects of the PSI, even by some in the community who are most negative about the program itself. In fact, many have argued that the methodological improvements are more valuable than the structures themselves, and will prove to be an enduring legacy of the PSI. Impressive progress has been made in all aspects necessary to advance from a genome sequence to a set of structures. Although an argument can be put forward that not all these advances can be directly attributed to the PSI in that some technologies were already under way, some have multiple funding sources, and some have long development times, the Panel felt that such issues of provenance are difficult to assess. What is clear is that the PSI drove the development of many new technologies.

A.3.2 Specific Technological Advances

A.3.2A Pipeline

Particularly notable is the construction of the pipeline that enables the entire process of going from sequence to solved crystal structure to be almost fully automated and capable of working at high throughput for amenable proteins. At the outset of PSI-1, the feasibility of such a pipeline was not obvious. Constructing and coordinating the many steps has been a highly challenging task, and the PSI centers have accomplished it well. Perhaps the most impressive advance of the PSI has been the integration of many new and existing technologies into a smoothly functioning pipeline. In addition, they have demonstrated that the quality of the crystal structures derived from this pipeline is as good as the quality of structures derived from individual labs.

Specific advances include new or improved tools for cloning (vectors, tags), expression (cell free expression systems, media, yeast strains), and purification of proteins (robotic systems), as well as folding and solubility reporters and salvage pathways. While most of these would be considered evolutionary rather than revolutionary advances, collectively they have had a significant impact on accelerating structure determination at the PSI. Many of these have had impact in the broader structural community as well.

A.3.2B Crystallography

Many impressive technological advances have been made at the synchrotron beamline itself, including automation for sample mounting, crystal centering, crystal evaluation and data collection. These improvements were accelerated by the PSI, and are becoming standard at many beamlines across the country, increasing the access and utility of these resources for all investigators. For example, at SSRL, about 75% of users now collect data via remote access in their home laboratories.

Similarly, significant advances in crystallographic software (such as HKL3000, SOLVE and PHENIX, which are supported in part by the PSI) allow raw data to be transformed rapidly to structure. While not applicable in automated fashion to the most difficult cases, these packages provide significant benefits for many proteins. Other useful methods include crystallization chaperones, crystallization cocktails and nanocrystallization procedures. The latter will be particularly important for membrane protein crystallization, where materials are highly limiting.

Overall, technological innovations in crystallization and crystallography that were accelerated by the PSI are helping researchers outside the PSI solve structures of biomedical significance. In other areas, results are mixed.

A.3.2C NMR

NMR technology from the PSI is not seen as having the general impact described above for crystallography. Many of the NMR advances (G-matrix Fourier-transform NMR, automated data analysis and structure determination) are not readily applicable to complex systems with higher molecular weight. Thus, these have not changed the practices of the NMR community the way PSI-supported technologies have changed the crystallographic community.

A.3.2D Generation of Materials

The narrowest bottleneck in structural biology is still at the level of production of soluble materials. While the PSI has made advances in cloning and expression technologies, some in the community expressed disappointment that no major innovations, conceptual advances or more general solutions to major challenges of production of difficult soluble proteins (e.g. eukaryotic proteins) have come from the PSI. However, significant biochemical advances are being made for membrane proteins, for which much less is currently known. For membrane proteins, there is significant utility in having sufficient data to develop statistical rather than anecdotal evidence. These approaches to membrane protein production are promising, but have not yet led to large numbers of publishable results.

A.3.2E Modeling

Modeling efforts were expected to benefit from the large number of PSI structures available as test cases. However, technologies for modeling do not appear to have advanced to the same degree as those for structure determination. There are some notable exceptions, such as the recent advances reported by Baker on combining structure prediction with crystallographic molecular replacement.² It is not clear that the PSI is making a concerted effort to address fundamental shortcomings of computational structure prediction, such as the need for accurate physical descriptions of atomic interactions. The combined PSI expenditure on theoretical and computational methods to improve the accuracy and sensitivity of homology modeling has been only a small fraction of what has been spent on experimental structure determination. Given that the primary aim of the PSI is to maximize the number and diversity of structures that can be modeled accurately, the lack of resources devoted to such theoretical methods may have reduced the effective value of the novel structural data contributed by the PSI.

A.3.2F High-Throughput Technologies

The accessibility of the many PSI-derived methods to the rest of the community is a matter of some contention. The PSI investigators claim that nearly all the technological advances are applicable and available to smaller-scale laboratories because few require equipment that is out of range for such laboratories. They point to the availability of expression vectors and clones, expression and purification protocols, purified proteins, and crystallization strategies as well as the software and robotics. It is critical that these technologies be available to the biological research community in their home laboratories. Equipment (and reagents) has been commercialized, often at a cost that individual investigators or groups of investigators could afford. Roughly 500 publications from the PSI describe the new methods. These have been cited an average of 14.9 times each, and 15 have appeared in journals with an impact factor greater than 20.

However, many in the community expressed concerns that automation and high-throughput approaches may be inherently ill suited to solving the biochemical problems of difficult proteins or complexes. Thus, some PSI methodologies may have limited impact on projects whose focus is on large, dynamic, multi-component (or multi-domain) systems, in which an iterative approach focused on details is necessary. Another concern was the cost of building an automated infrastructure in smaller labs. However, it is not entirely clear whether these concerns reflect differences in the attitude of labs outside the PSI or genuine shortcomings in the applicability of high-throughput approaches to their problems.

A.4 PSI Output: Materials

In addition to advances in technology, the PSI has produced materials of benefit to the general scientific community. These include vectors, expression plasmids, cDNA clones, and proteins as well as information and protocols that relate to expression, purification, and crystallization. Unfortunately, issues of cataloging and storing materials generated by PSI centers were not adequately addressed in PSI-1. A PSI Materials Repository (PSI-MR) was established in September 2006 to deal with the collective assimilation and distribution of some of the above materials, specifically the clones that were generated by PSI-1 and PSI-2 centers. The Panel found it impossible to evaluate the materials output of the PSI because the PSI-MR is not yet operating (scheduled opening is January 2008) and searchable lists of materials are not available on the web sites of the large PSI structure-determination centers.

During its first year of funding, the PSI-MR gathered data from the old and new PSI centers, began to build its database, and began to negotiate Depositor Agreements with each of the participating centers and their corresponding institutions. Obtaining Material Transfer Agreements from each institution has been time-consuming, and only three are finalized at this point. As with the KnowledgeBase, it is unfortunate that this cross-communication between the various PSI centers was not launched early on as a mandated requirement of the PSI so that it would now be routine. A PSI-MR business plan was not presented before or during the Panel meeting. Eventually the Panel received a two-page document describing the progress to date. This did not include details regarding information such as nomenclature and web site design.

A.5 Dissemination of PSI Output

Dissemination is a critical aspect of the PSI mission. It is the mechanism by which the many results of the PSI—structures, technologies and reagents—become useful to the broader community. Dissemination thus adds significant value to the PSI endeavor, and would be expected to play a central role in garnering enthusiasm and support from the community.

A.5.1 Inadequate Past Dissemination

Although there are exceptions, overall dissemination by the PSI has been poorly coordinated between the various centers and has therefore been ineffective. Sharing and distribution of results should have been a high priority at the outset of PSI-1, but only recently has an organized effort begun in earnest. Because of this delay, the community is mostly unaware of PSI progress; lack of dissemination is one of the most widely cited criticisms of the PSI. It should have been expected that, as the paradigm of structure production changed, the paradigm of dissemination of results would change with it. The consistent problem seems to be a reliance on low-throughput dissemination modes in a high-throughput environment. The situation may be ameliorated by the development of the KnowledgeBase and Materials Repository. Nevertheless, there is concern that these resources will be implemented superficially because of the late starting date, and ultimately prove to be insufficient.

There has been no centralized effort to organize, annotate, link, advertise and distribute information from the PSI as a whole. The focus of centers on production rather than distribution has diminished the quality and effectiveness of most individual efforts. The lack of higher coordination has made it difficult for the community to gather and use PSI results, and has prevented the whole from rising above the sum of the parts.

There are three principal products of the PSI: structures and their associated knowledge, technologies and reagents. These are discussed in the sub-sections below.

A.5.1A Structures

Structural results are disseminated primarily through the standard mechanisms of publications, presentations at scientific meetings and deposition of coordinates into the PDB. Since its inception the PSI has published approximately 500 structural papers and presented talks and posters at a number of meetings. However, most PSI structures are deposited without publication and most presentations have been to structural audiences. Thus, deposition of structures into the PDB is the primary mechanism of dissemination to a broad audience. The immediate deposition of PSI structures into the PDB has made coordinates readily available to the structural and computational communities. This steady production of large numbers of diverse structures has benefited the work of some modeling and informatics groups. For example, PSI structures have dominated the CASP competition for the past several years as it was clear that structures for particular sequences would be solved on the four-month time scale of the CASP competition.

However, deposition of structures into the PDB does not necessarily impact the wider biological community. This lack of impact occurs for several reasons.

First, most biologists are unaware of PSI efforts and do not use the structural database. The PSI seems to offer a potentially exciting opportunity to bring structural biologists together with biologists working on a host of problems, but the broader community currently does not appreciate this opportunity.

Second, most biologists lack the knowledge necessary to use structural information effectively, even if they are aware of it. Structural biologists play the role of translators in this relationship, bridging structure to broader function (biochemical, biological, etc.). The PDB was never intended to be a communication tool to biologists, and without expert intermediaries much potential of the structural database will remain unrealized.

Third, a shortcoming of PDB-based dissemination derives from the absence in the PDB of detailed information regarding expression, purification and biochemical characterization of proteins used in structure determination. Such information on both successful and failed structural targets would be very useful to the biochemical and biological communities. Some information of this type is contained in the PepcDB database, which opened in late 2004 as an adjunct to the PDB. However, the information in PepcDB on expression/purification procedures is incomplete, particularly for unsuccessful targets, and largely generic. Protocols for deposition are still under development.

Some effort has been made by the PSI to enhance the utility of structures through annotation. The most prominent effort involves development by the JCMM and JCSG of TOPSAN, a wiki-based website (analogous to Wikipedia) where the PSI and members of the community can provide annotation of individual structures in the database. This is a creative and interesting idea with significant potential, especially if integrated with other sites/information. However, TOPSAN is still experimental, having been initiated in December 2006. Only JCSG appears to be participating currently, and only a relatively small fraction of structures (<25% of JCSG coordinates) are annotated at present.

Until very recently there has not been a coordinated effort among the various participants of the PSI to determine what information is most useful to the larger community; how it should be associated with structures, organized and integrated with other databases; and how it can be most effectively accessed by those outside the PSI. Together, these various problems have led to a clear disconnect between PSI-generated structural information and the biological community that could best use it.

A.5.1B Technology

With respect to PSI technology development, many aspects have been well disseminated through a combination of publications, center websites, meetings, workshops and direct creation of experimental resources. New technologies for rapid, automated acquisition of crystallographic data are being put into place at several synchrotron beamlines. These are of great value and will be distributed widely. These facilities also aid dissemination of new crystallographic software, along with the other mechanisms above.

However, the technologies developed at various centers for cloning, expression, purification and characterization of proteins have not spread as widely into the structural, biochemical or biological communities. In part this failure may reflect a general resistance to new technologies. However, little effort has been made to curate information within and between different sites in order to enable simple searching or to produce summaries that are readily accessible. Thus, potential users are currently forced to search all sites individually and manually to identify advances that may benefit them. The lack of central advertising has led to ad hoc dissemination, which has been ineffective. Given the inherently hands-on aspect of many new technologies, expansion of programs in practical workshops in methodology targeted at students and post-docs, such as those run recently by MCSG and CESG, would be of significant benefit to the community.

A.5.1C Materials

As detailed in Section A.4 above, the materials generated by the PSI, including vectors, clones and proteins, are not yet centrally organized. The lack of a simple, well-understood mechanism for obtaining materials and of a single well-known clearinghouse for all materials generated at any of the PSI centers has led to ad hoc, ineffective dissemination of materials by individual PSI centers. The Panel received a table that lists the number of requests from each PSI center, but it is not clear whether the reagents were actually provided. The requests differ significantly for each group, and, contrary to what was stated at the meeting, there does not in general seem to have been significant distribution of these materials to the biological research community. Only two groups funded by PSI-1 received a significant number of requests for plasmids, and neither of these groups was funded in PSI-2. NYSGXRC received many requests for protein, but not for clones and plasmids. The lists of requests from the PSI-2 centers are unimpressive and suggest that this has not been a major priority so far. The single exception is the CHTSB, which received many requests for protein. It is not clear that protein dissemination is included in the present PSI-MR, which only discusses the deposition and distribution of clones.

A.5.2 Improving Dissemination in the Future

The recently launched KnowledgeBase (KB) has potential to address many of the problems in dissemination described above. Headed by Helen Berman, Director of the PDB, KB should act as a central portal where results from all PSI centers will be archived, organized and made available to the public. Information accessible through KB will include structures, sequences, models, structural/functional annotations, technologies, protein production protocols and protein characterization. The various elements of the database are intended to be tightly linked, both with each other and with external databases, so that users can easily search and extract a broad range of information relevant to systems of interest. Connections will also be made to the Materials Repository to link information with reagents.

If implemented well, KB promises to be a powerful resource for the broad biomedical community. However, there are a number of concerns that it may not reach this potential. Chief among these is that KB will be large and complex, comprising many different types of information derived from many sources. It should not be a simple catalog of information, as is currently the case with most PSI websites. Rather, its holdings must be effectively integrated and coupled with expert knowledge in many areas. It will require thoughtful curation and sophisticated indexing and search capabilities that go beyond simple text matching. Search capabilities in the current PDB are limited; KB will need more advanced and powerful functionality. In addition, KB must be accessible and useful to biologists, not simply to scientists already familiar with structural, biochemical and biophysical information and approaches. KB must function as a translator to the biological world, a role that is typically played by the publications and presentations of structural biologists. This requirement substantially raises the bar for annotation, summary and cross-referencing. It creates special problems in the integration of external analytical tools, which are not designed for non-experts. These are diverse and serious challenges; together with the late starting date, they raise significant concern that KB will ultimately prove to be ineffective in disseminating the results of the PSI to the broader community.

Distribution of vectors, clones and plasmids will be covered under the PSI-MR, although this is not specifically stated in the summary provided to the Panel. The policy regarding distribution of proteins needs to be addressed specifically. The PSI materials database should be easily searchable by the community, not just by structural biologists. In addition to a catalog of clones, there should be a progress page for every plasmid, for its design, for the expression and purification results, and for the crystallization results. Associated protocols also need to be easily accessible. How will nomenclature be addressed? This is an important question because in many cases the function and the fold family of a target protein are unknown before the structure is solved. Presumably there will be a common nomenclature with the KB, but no information was provided. A numbering system alone would not be useful. Industrial groups, e.g. pharmaceutical companies, have operated materials archives for many years. The PSI-MR is encouraged to engage these groups in order to benefit from their expertise.

B. Evaluation of the Rationale and Utility of the PSI Goals

B.1 Main Conclusions

One aim of PSI-2, to improve methodology and technology, is clearly of great importance. Progress in structural biology has, since its inception, been boosted by technical advances in diverse areas, such as X-ray sources (particularly synchrotrons in recent times), crystallographic phase determination, high-field NMR instrumentation, as well as methods of purification and sample preparation, including crystallization. It is crucial to provide resources for technical development in structural biology in a way that allows scientists to focus on driving methodological advances without necessarily coupling them to immediate biological understanding. PSI-2 provides an excellent format for this endeavor, which deserves continued support.

Another PSI aim, facilitating the use of structures by the broad scientific community, is diffuse and difficult to define. The use of structural information is best driven by scientists who wish to understand biological mechanism. The use of PSI-generated structures will depend ultimately on the value of the information being generated. However, the vast majority of PSI-generated structures are divorced from biological function.

The PSI-2 effort is aimed at a sparse coverage of "sequence space." Although fold space may be nearing complete coverage, sequence space continues to grow linearly, making PSI-2 an open-ended endeavor. The ability to model structures, particularly complex ones, is very far from being able to connect most PSI-2 structures to function. It is here that the concepts underlying PSI-2 are seriously flawed.

B.2 A Critique of the PSI-2 Structure Determination Goal

The major thrust of PSI-2 is to increase the number of sequence families that have at least one member that is characterized structurally. This is a laudable goal in the abstract—the three-dimensional structure of the protein encoded by a gene clearly represents a greater density of information than the linear gene sequence. On close examination, however, the value of this goal is seriously undermined by several fundamental issues. One significant problem is that knowledge of the structure of a representative member of a protein family (defined, for example, at the level of 30% sequence identity) provides rather limited information about the specific function and mechanism of other members of the family.

Analysis of sequences and known structures indicates that the coverage of "fold-space" is nearing completion. At the same time, it was appreciated that the concept of a protein fold is an ill-defined one, and the target of PSI-2 was shifted from the determination of representative protein folds to structures that are representative of protein families at the level of 30% sequence identity (this corresponds closely to the weighted chain count discussed earlier). One consequence of this shift in the target of PSI-2 is that the problem becomes unbounded: as the number of genomes that are sequenced becomes larger, the number of protein families (derived on the basis of the 30% sequence identity criterion) also becomes larger. An enormous number of structures would need to be determined before one could hope to link structural data to cellular processes, a goal that is both ill-defined and beyond the grasp of even the most ambitious structural genomics efforts. Thus we perceive no real progress towards closing the gap between sequence families and structures.

The most comprehensive recent analysis of protein sequence space is a sequence-based clustering of over 10,000,000 non-redundant protein sequences.³ A total of 300,000 clusters were found using fairly conservative sequence-matching criteria based on BLASTPGP and PSI-BLAST (40 to 70% similarity). The number of sequence clusters represented by the sequences with known structure depends on the sequence threshold used, and is crudely estimated to be about 10,000, of which about 15% have come from the PSI. Thus, the PSI has closed the gap in number of sequence families with a known structure by less than 1%. Of greater concern is that the number of families is growing almost linearly with the number of non-redundant sequences, so that the protein sequence universe is still in an expansion phase.
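The "less than 1%" figure can be reproduced from the rough estimates quoted above. The following is a minimal sketch that takes those estimates at face value; the variable names are illustrative and are not drawn from the cited analysis.

```python
# Rough estimates quoted in the text; all values are approximate.
total_clusters = 300_000          # sequence clusters found in the clustering analysis
clusters_with_structure = 10_000  # clusters with at least one known structure
psi_share = 0.15                  # fraction of those clusters attributed to the PSI

# Clusters whose structural representative came from the PSI.
psi_clusters = clusters_with_structure * psi_share

# Fraction of all sequence clusters covered by PSI structures.
psi_gap_closed = psi_clusters / total_clusters

# Fraction of all sequence clusters covered by any known structure.
all_coverage = clusters_with_structure / total_clusters

print(f"PSI-contributed clusters: {psi_clusters:.0f}")
print(f"Gap closed by the PSI: {psi_gap_closed:.2%}")
print(f"Clusters with any known structure: {all_coverage:.1%}")
```

Under these estimates the PSI contribution is about 1,500 clusters, or roughly half a percent of the 300,000 clusters, which is consistent with the statement in the text.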

Another limitation of PSI-2 arises from a fundamental difference between gene sequence and protein structure. A gene sequence has meaning in and of itself, i.e., it provides the sequence of a protein. The vast amount of genomic sequence provides a unique and unprecedented window into the evolution of proteins. Proteins, however, gain their functions as a result of multiple additional processes, such as complex formation, the binding of small molecule ligands, or various covalent modifications. Therefore, the structure of an isolated protein domain cannot be used in a reliable way to predict the nature of the structural and functional changes that result from these interactions or modifications. This feature of protein function makes analogies between PSI and genome sequencing efforts particularly problematic.

A critical stumbling block in linking structure to function is the severe limitation of computational homology modeling methods, whereby the details of an unknown structure are deduced from a known structure. Although homology modeling efforts, in many cases, can link a sequence to a known protein fold, these methods are unable to predict with sufficient reliability conformational changes, the docking of ligands and the structures of regions that are unrelated to the target structures. Although progress is certainly being made in improving the power of these methods, it is not clear that a sufficient number of major advances will occur over the next five to ten years to effectively close the gap. The problem of constructing a sufficiently accurate model of a multi-domain protein from knowledge of the structures of the component domains (if the structures of these are known) is even more difficult.

As a consequence of these limitations, the few insights into possible function that have come from analyzing the structures of PSI targets (such as the recent successful identification of substrates for a previously uncharacterized protein⁴) have been massively outweighed by the almost complete lack of insight into function provided by the majority of structures determined within the PSI. The PSI also ignores the fact that most proteins operate within networks of interacting proteins. The structure of an isolated protein might provide valuable insights when determined within a functional context, but it is unlikely to do so within the function-independent context of the PSI.

It might be argued that the growing availability of novel structures improves the development of computational methods that aim to close the gap between structure and sequence. While there may be some broadly defined leverage provided to bioinformatics by the current PSI efforts, there is no compelling evidence that determining additional novel structures will have a decisive impact on the accuracy of homology modeling.

Given the absence of other biological connections, the Panel considered whether the structures determined by the PSI could benefit research aimed at improving drug development, but found that the PSI structures have minimal impact. Two factors underlie this observation: first, the intrinsic flexibility of protein structures, and second, the divergence of detailed structural features, even at active sites, when sequence identity drops below ~80%. The combination of these two factors means that for structural information to be effective in inhibitor development the process must rely on structures of the actual target (or a very close one), often in a particular state of modification or ligation. The lack of impact in biomedicine is to be expected, given the near exclusive emphasis on sequence novelty in the PSI. Nevertheless, it should be recognized that limitations of homology modeling and the vastness of protein family space conspire to minimize the biomedical relevance of the majority of PSI structures.

In summary, the PSI structure determination centers are meeting the goal of increasing the rate of determination of novel structures. While there is some value to determining new structures as an end in itself, the structures determined in this way are limited in their ability to inform biology or biomedical science. The over-arching goal of linking genomic information to three-dimensional structure is unlikely to be met in any meaningful way because the number of protein families continues to expand as additional genomes are sequenced.

B.3 Connection of PSI to Biology

The PSI to date has had minimal relevance for biology or broader biomedical research. Among comments from the community returned to the NIH, a clear majority—encompassing those positive, neutral and negative about the PSI—criticized the initiative for its lack of biological relevance.

The consensus of both the Panel and the broader community is that the best structural work is done as part of research into a larger biological problem, and that it requires additional biophysical, biochemical, genetic and cell-based data for its interpretation. There was also agreement that problems like multi-protein complexes cannot be understood when dissected into individual components or fragments of these components. While the structures of individual domains are of some use, those of multi-domain complexes are essential in the long run. Further, many proteins do not fold or function in the absence of interacting partners, and proteins undergo conformational changes and regulation by posttranslational modifications. None of this is currently captured by the high-throughput methods of the PSI.

The initial guidelines for PSI-1 specifically did not require any relevance for biology. Even in PSI-2, in which a degree of biological relevance could be established through biomedical theme targets and collaboration with the community, the bulk of the PSI effort has gone into filling the gap between sequence and structure with little regard to the significance of these targets. The PSI-2 centers have solved several structures of broader impact through collaborations with the community and through the biomedical theme targets (e.g. Thermotoga maritima proteins, human and pathogen phosphatases, cancer networks and virulence factors). But these have been a relatively small fraction of the total, and in many cases the structures have not had great biological impact because the PSI has not been able to follow up on interesting molecules.

To address the question of target selection and specifically to consider how the structural goals might be made more relevant for biology, a Target Selection Committee was established at the onset of PSI-2. However, the composition of the committee did not include cell biologists or geneticists. With a focus more on folds and sequence families, and with the metric being the number of novel structures, it is not surprising that targets were not chosen for their biological importance. Some strong leaders of the biological community, including, for example, those from the fields of pharmacology, physiology, chemical biology, molecular evolution and metagenomics, need to be part of the deliberations over targets. The lack of biological relevance of PSI structures is further exacerbated by the high-throughput pipeline, which favors proteins whose structures can be solved readily because of the ease of expression, purification, crystallization or other properties.

There is wide concern that the PSI is not solving structures with relevance to biology. The broader biological community, in particular, does not feel that this effort is relevant for them. Even among those in the modeling community, there was only modest enthusiasm for the role played by the PSI. Intrinsically there should not be a problem for the PSI to make a stronger connection to biology. For example, the tuberculosis center in PSI-1 was linked strongly to biology, but was not included in PSI-2. The TB community that formed around the PSI-1 center seems to be vibrant, broad and solidly focused on a biological question.

It is a challenge to educate the broad biological community on the importance of structure for their own work, but the dissemination of PSI structural data and materials is an opportunity to provide such an interface and to increase the biological relevance of the PSI. Unfortunately this has been a missed opportunity so far. Undoubtedly the lack of attention to providing structural and materials resources to the outside world explains in part why the community overall is not more appreciative of the accomplishments of the PSI and why they do not understand how their own work might be benefited.

An additional concern related to the lack of biological relevance is training. With no biological questions being asked or models being tested, the PSI appears not to be a strong training environment for the next generation of structural biologists. In addition, the pipeline format—while productive—does not give the individual scientist experience in working through multiple stages of the structural process.

B.4 Cost Effectiveness of PSI

The PSI was launched in FY2000 as a five-year program through FY2004 (PSI-1). Total funding in FY2000 was $34,356,000 and reached $80,763,000 in FY2004. FY2003 was exceptional in that significant funding of beamline infrastructure occurred, resulting in a total budget that year of $107,704,000. Total funding for PSI-1 was $340,631,000.

PSI-2 has been funded for two years (FY2005 and FY2006) at over $80M per year. The commitments for FY2007 to FY2010 are at the same general level, except for FY2010 at $11,248,000. Total funding for PSI-2 is projected to be $424,816,000.

Total funding for PSI-1 and PSI-2 is therefore $765,447,000.
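The totals above can be cross-checked with simple arithmetic. The sketch below uses only the figures stated in this section; the assumption that PSI-2 spans exactly five full-level fiscal years (FY2005 through FY2009) plus a reduced FY2010 is an interpretation of the text, not a stated budget breakdown.

```python
# Figures stated in the text, in dollars.
psi1_total = 340_631_000   # PSI-1, FY2000 through FY2004
psi2_total = 424_816_000   # PSI-2, projected through FY2010
fy2010 = 11_248_000        # reduced PSI-2 commitment for FY2010

# Combined funding for both phases.
combined = psi1_total + psi2_total

# Assumption: FY2005 through FY2009 are the five years funded at the full level.
psi2_full_years = 5
psi2_avg = (psi2_total - fy2010) / psi2_full_years

print(f"Combined PSI-1 and PSI-2 funding: ${combined:,}")
print(f"Implied PSI-2 average for FY2005-FY2009: ${psi2_avg:,.0f}")
```

Under that assumption, the implied average of about $82.7M per year agrees with the statement that PSI-2 has been funded at over $80M per year.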

Other significant sources of support for structural biology include estimates of $20M/year from the Howard Hughes Medical Institute and $350M/year from NIH ($150M/year from NIGMS and $200M/year from other NIH sources) for a total of $370M/year.

The average annual PSI funding in the last five years (FY2002 through FY2006) is $85M/year, compared with $370M/year from other sources. Thus PSI total funding is approximately 19% of total funds expended for structural studies. If one includes only the total PSI Center portion of the budgets, the PSI Centers represent 14% of total funds expended for structural studies. However, it is important to note that PSI funding is totally dedicated to high-throughput structure determination and does not support either training or research projects that are aimed at learning the functional implications of protein structures. A conservative estimate of these activities in typical structural biology laboratories would be that at most two-thirds of the laboratory funding is comparable to PSI funding. With this discount, the amount of total PSI funding relative to total funding of structure determinations is 26%, and total PSI Center funding relative to total funding of structure determinations is 19%. A more realistic discount rate of 50% results in total PSI funding representing 31% of total funding for structure determinations (PSI Centers would represent 23% of total funding for structure determinations). There is also the issue of the higher value placed on combined structure-function studies, which was raised by many members of the structural biology community. Given the cost per structure of nearly $100,000, the failure to make functional linkages is a serious one. In conclusion, based on the findings of this assessment, a disproportionate amount of resources is going to the PSI.
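The percentages for total PSI funding can be reproduced from the two annual figures given above. This is a minimal sketch of that calculation; the function name and the `comparable_fraction` parameter are illustrative, and the sketch assumes the discount is applied only to the non-PSI funding.

```python
def psi_share(psi_musd, other_musd, comparable_fraction=1.0):
    """Return PSI funding as a fraction of structure-determination funding.

    comparable_fraction is the portion of non-PSI funding treated as
    directly comparable to PSI work (1.0 means no discount).
    """
    comparable_other = other_musd * comparable_fraction
    return psi_musd / (psi_musd + comparable_other)


PSI_MUSD = 85.0     # average annual PSI funding, $M/year (from the text)
OTHER_MUSD = 370.0  # HHMI plus NIH structural biology funding, $M/year (from the text)

scenarios = [
    ("no discount", 1.0),
    ("two-thirds comparable", 2 / 3),
    ("one-half comparable", 0.5),
]

for label, fraction in scenarios:
    share = psi_share(PSI_MUSD, OTHER_MUSD, fraction)
    print(f"{label}: {share:.0%}")
```

The three scenarios give approximately 19%, 26% and 31%, matching the figures quoted for total PSI funding. The Center-only percentages would require the Center portion of the budget, which is not stated in this section.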

C. Future Prospects

In general, the Panel was not enthusiastic about the benefit to biomedical research of the current large-scale, high-throughput structure determination effort, given the impossibility of reaching the core goal summarized in Section B.2 and the cost-benefit analysis summarized in Section B.4. Although the development of concrete suggestions either for improving the goals of the current PSI effort or for revising the larger-scale policy of whether NIGMS should support such an effort is beyond the scope of the Panel's remit, a number of issues emerged as common themes either within the Panel discussion or from the community at large.

Changes in target choice within the current PSI-2 effort should be considered. As noted above, there is no compelling evidence that determining additional novel structures will have a decisive impact on the accuracy of homology modeling. Rather, a much more tightly focused effort into a limited number of protein families may help overcome the present bottlenecks in modeling.

Multiple mechanisms can be envisioned to achieve a closer link to biology. For example, might it be worthwhile to cover the protein space of a single organism? For prokaryotic proteins, there is the possibility of completing the structures of all the proteins involved in various metabolic pathways, or of novel operons of unknown function. The production of mammalian proteins, multi-domain proteins and protein complexes and the continuing technical difficulties associated with membrane proteins remain as stumbling blocks. A central question relates to developing truly interdisciplinary teams that will work together so that when a structure is solved there is some biological context for that structure.

Given the experience gained in the PSI project, the time is ripe for NIGMS to re-assess its policies for supporting high-throughput structure determination. Other options might be considered to move structural biology forward, building on a PSI-2 that has met its goals of increasing the structural novelty of the Protein Data Bank, but without continuing PSI in its present form. Future effort might be focused on smaller projects with much higher experimental coupling to biological function and to improving computational methods of analyzing and predicting protein structure. The legacy of the PSI effort can be positively incorporated by providing funding to disseminate its technological infrastructures across institutions, and further built upon by retaining centers that are focused on further technology development. We urge the Director of the NIGMS to establish a Panel and seek community input to formulate an appropriate policy.

Meeting Agenda
Protein Structure Initiative (PSI) Assessment Panel
September 24, 2007

8:15 – 10:15 Session 1: Presentations from the PSI

8:15 John Norvell Goals, policies, coordination, communication
8:20 Brian Matthews PSIAC advice, site visits
8:25 Andrzej Joachimiak Descriptions of 4 centers, construction of SG pipelines, technology developments
8:55 Ian Wilson Target selection, impact, value of structures, Thermotoga as an example, future plans
9:25 Tom Terwilliger Descriptions of 6 centers, technology developments
9:50 Helen Berman Knowledgebase, metrics, web page

10:15 – 10:30 Break

10:30 – 11:00 Session 2: Questions from the Assessment Panel for PSI participants

11:00 – 12:30 Session 3: Discussions of reservations about the PSI

Walter Chazin, Philip Cole, and Stephen Harrison

12:30 – 1:00 Working Lunch for the Assessment Panel

1:00 – 5:00 Session 4: Panel deliberations

Charge to the PSI Assessment Panel

Background: The NIGMS Advisory Council has requested, through its Large Grants Working Group, that assessments be done of each of the Institute’s large grant programs. These assessments will be used by the Council as it advises NIGMS on the goals and strategic management of its large grants programs. The first of these assessments, an examination of the Protein Structure Initiative (PSI), will be carried out by a small panel of scientists which will meet on September 24, 2007 on the NIH campus. The roster for this panel is included below.

Charge: The Assessment Panel is charged with examining the goals, progress, impact, opportunities and limitations of the currently funded programs. This assessment should include (1) an examination of the original intent and goals of the PSI as described in the Requests for Application and other documents, (2) an examination of technology developed, its role in the PSI, and impact of that technology on the broader structural biology community, and (3) the impact and potential future impact of the structures determined by the PSI on research efforts in structural biology, computational biology, bioinformatics, and the broader biological community. The panel is expected to obtain input from the scientific community in conducting this assessment. The panel is expected to develop a set of specific questions to provide structure to its deliberations. These questions should be made available to the PSI Advisory Committee in a timely manner. The panel may, if it chooses, provide advice to the NIGMS regarding the future of the PSI; this advice may focus on the remaining three years of PSI-2 and/or on extensions of these research activities that might be undertaken after PSI-2 is completed. The panel should document its findings in a report that will be submitted to the Director of NIGMS by December 21, 2007 for consideration by the NIGMS Advisory Council at its meeting in January, 2008.

Jeremy Berg, Ph.D.
Director,
National Institute of General Medical Sciences
8/6/07

National Advisory General Medical Sciences Council
Working Group Panel for the
Assessment of the Protein Structure Initiative

Janet Smith, Ph.D. (Chair)
Margaret J. Hunter Collegiate Professor of Life Sciences
Life Sciences Institute
Department of Biological Sciences
University of Michigan
Ann Arbor, Michigan 48109-2216
734-615-9564
JanetSmith@umich.edu

David A. Clayton, Ph.D.
Vice President for Research Operations
Howard Hughes Medical Institute
4000 Jones Bridge Road
Chevy Chase, MD 20815-6789
301-215-8803
claytond@hhmi.org

Stanley Fields, Ph.D.
Investigator, Howard Hughes Medical Institute
Departments of Genome Sciences and Medicine
University of Washington
Box 355065
Seattle, WA 98195-5065
206-616-4522
fields@u.washington.edu

Homme W. Hellinga, Ph.D.
James B. Duke Professor of Biochemistry
Director of the Institute for Biological Structure and Design
Duke University Medical Center
Durham, North Carolina 27710
919-681-5885
hwh@biochem.duke.edu

John Kuriyan, Ph.D.
Investigator, Howard Hughes Medical Institute
Chancellor's Professor, UC Berkeley
Department of Molecular and Cell Biology
Department of Chemistry
University of California, Berkeley
Berkeley, California 94720-3202
510-643-1710
kuriyan@berkeley.edu

Michael Levitt, Ph.D.
Professor of Structural Biology
Stanford School of Medicine
Stanford, California 94305
650-276-0500
michael.levitt@stanford.edu

Catherine E. Peishoff, Ph.D.
Vice President
Computational & Structural Chemistry
GlaxoSmithKline
1250 South Collegeville Rd.
Collegeville, Pennsylvania 19426
610-917-6584
catherine.e.peishoff@gsk.com

Michael Rosen, Ph.D.
Investigator, Howard Hughes Medical Institute
Professor, Department of Biochemistry
University of Texas Southwestern Medical Center
Dallas, Texas 75390
214-645-6361
michael.rosen@utsouthwestern.edu

Susan S. Taylor, Ph.D.
Investigator, Howard Hughes Medical Institute
Professor, Department of Chemistry and Biochemistry
Professor, Department of Pharmacology
University of California, San Diego
858-534-3677
staylor@ucsd.edu

¹ Levitt, M. Proc. Natl Acad. Sci. USA 104, 3183-3188 (2007).
² Qian, B. et al. Nature 450, 259-264 (2007).
³ Yooseph, S. et al. PLoS Biol. 5, e16 (2007).
⁴ Hermann, J. C. et al. Nature 448, 775-779 (2007).
