Fifth Workshop on Target Selection for the Protein Structure Initiative

June 26-27, 2006

The production phase of PSI-2 is entering its second year; the first unified PSI target lists have been constructed and thousands of protein targets have been progressing rapidly through the experimental pipelines. From these targets, PSI-2 has already delivered about 350 crystal structures. To achieve maximal leverage, the BioInformatics Groups (BIG) of the four large-scale production centers are coordinating their target selection efforts to identify the most valuable protein targets and to avoid overlaps.

The 5th Target Selection workshop aimed to gather input from the broad scientific community on strategies that could further improve and refine current target selection mechanisms. The 59 participants included both bioinformatics and experimental experts. They represented the four large-scale and the six specialized structural genomics centers, as well as the general community of external research groups from related areas. Speakers were invited both from within the centers and from external research groups.

The first day of the workshop was divided into four scientific sessions, which covered clustering and identification of large sequence families; domain boundary prediction; defining modelling families; and predicting function and evaluating structure/function relations. The second day started with a session on evaluation strategies and short progress reports from the large-scale production centers. A main component of the meeting was the two extensive breakout sessions that followed, which channelled discussions into parallel sessions chaired by Janet Thornton (EBI, England), Barry Honig (Columbia University, USA), Chris Sander (Sloan Kettering Cancer Institute, USA) and John Moult (Univ. of Maryland, USA). The breakout sessions reviewed issues arising from the scientific presentations given on the first day and also considered recommendations for possible milestones and evaluation methods to monitor the progress of the four large-scale centers in PSI-2. The second day finished with reports from the chairs of the breakout sessions and an open plenary discussion of the major themes and future plans arising from those sessions.

The presentations and discussions on the first day were driven by a set of questions compiled by the workshop organizers. The organizers included the PIs of the four bioinformatics groups (BIG) of the large-scale centers and representatives of the operations management group (OMG), which includes PIs of the large-scale centers and representatives of NIGMS.

During the session on clustering and identifying large sequence families, Alex Bateman (Sanger Centre, England) presented the strategy for generating new Pfam families and reported back on his curation of 60 families supplied to him by BIG as part of their recent target selection list. Liisa Holm (Univ. of Helsinki, Finland) and Michal Linial (Hebrew Univ., Israel) described their clustering methods and domain family resources called ADDA and EVEREST, respectively. Christine Orengo (UCL, England) discussed the Gene3D and GEMMA resources used by the Midwest Center and emphasised the importance of targeting proteins from large, structurally underrepresented families currently absent from PSI target lists. Burkhard Rost (Columbia Univ., USA) closed the session by presenting target strategies developed for the NESG center.

In the next session, Golan Yona (Cornell Univ., USA) described the BIOZON machine learning approach to detecting domain boundaries, which combines about 20 different features. David Jones (UCL, England) reported new developments of his domain boundary approach, DOMSSEA, while Jinfeng Liu (Columbia Univ., USA) described CHOP and CHOPNet; both of these latter strategies are employed by the NESG. The workshop participants commented on the importance of having accurate domain boundary recognition methods and pointed out the different challenges of testing domain boundaries in known structures versus genomic sequences.

The third session was concerned with defining Modelling Families, a newly introduced technical definition of the fundamental unit for structural genomics. Eashwar Narayanan (UCSF, USA) presented an update on the comparative modelling pipeline developed over the years by Andrej Sali and his group. Barry Honig (Columbia Univ., USA) reviewed strategies for evaluating the quality of models. Roland Dunbrack (Fox Chase Cancer Center, USA) reviewed the current performance of different sequence alignment strategies. Andras Fiser (Albert Einstein Coll. of Med., USA) presented a possible new approach to defining modelling families by relating local sequence signal to structural divergence, and also talked about the surprisingly favourable current coverage of loop fragments in the PDB. Diana Murray (Weill Med. Coll., USA) discussed the three-tier target selection mechanism at NESG, which includes an automatic target selection step followed by manual curation and analysis of the functional impact of potential targets.

In the fourth session, Janet Thornton presented structure-based methods for predicting functions for proteins solved by the PSI, while Alfonso Valencia (CNB-CSIC, Spain) described various approaches for identifying functionally important residues from multiple sequence alignments and phylogenetic analyses. Adam Godzik (Burnham Inst., USA) discussed the challenges of structure- and/or function-based family classification schemes and forecast the deposition of very large amounts of new genomic data that may reshape the sequence landscape. Phil Bourne (UCSD, USA) reported on a database of putative targets implicated in human diseases.

The first day concluded with a discussion led by Wayne Hendrickson (Columbia Univ., USA) inviting general comments from the participants on target selection strategies.

On the second day, in the session on evaluation, Stephen Brenner (UC Berkeley, USA) presented a retrospective Pfam-based analysis of the contribution of PSI centers to the protein fold universe. He also highlighted the reductions in cost per structure and the unparalleled efficiency achieved by the PSI production centers. The four large-scale centers reported strong progress to date for PSI-2; in particular, Ian Wilson (Scripps Inst., USA) mentioned very encouraging overall results suggesting a higher than expected success rate on new targets selected only six months before the workshop.

All the chairs of the four breakout sessions reported strong support for the idea of increasing the biological impact of PSI efforts by carefully exploring the structural and functional diversity of very large families. They also all recommended more concerted efforts to publicize the plans and outcomes of PSI through regular publications and to build stronger ties with other scientific initiatives such as the sequencing centers and the functional genomics community.

Following up on the four scientific themes of the first-day sessions, several recommendations were made. On clustering, the development of improved methods for identifying Modelling Families was encouraged in order to further rationalize target selection strategies. There was a consensus to let each center pursue its own strategies for identifying targets from within large families, but to coordinate efforts to centrally validate these selected families before they enter the production pipeline. With regard to domain boundary prediction, there was a suggestion to publicize the extensive information being collected on domain constructs to the prediction community associated with CASP. On function prediction, it was recommended that experimental research programs validating functional predictions be strongly supported. Finally, on evaluation, a number of recommendations were made, including a suggestion for an annual independent review, which could be linked to a workshop to discuss current scientific developments.