Protein Structure Initiative (PSI) Recommendations on Target Selection for PSI-2 Large-scale Research Centers

I. Overall goals of Target Selection for PSI-2

The overarching goal of the Protein Structure Initiative (PSI) is to make three-dimensional (3D) atomic-level structures of most proteins readily obtainable from knowledge of their corresponding DNA sequences. The primary goal of the second stage of the PSI (PSI-2) is large-scale structure determination to maximize the coverage of protein sequences by structural information. This will be done by experimental determination of an estimated 3,000 structures. The process of target selection will undergo continuous evaluation and refinement as our understanding of protein sequence-structure relationships evolves. PSI-2 will primarily select targets from protein sequence families (so-called BIG families) that are structurally uncharacterized. Very large protein families with enormous phylogenetic variation, but limited structural coverage (so-called MEGA-families), will serve as the source of additional target sets, in order to explore the evolution of structural and functional diversity. New metagenomics data (META-families) from large communities of organisms, such as the human gut microbiome with its implications for human disease, or the environment, e.g. Global Ocean Sampling (GOS), will constitute an important source of novel targets for PSI-2. The biomedical targets and community-proposed targets represent an essential component of each Center's target selection. The coverage of certain model organisms (both prokaryotes and eukaryotes) may also be considered in target selection. In addition to providing ~3,000 new experimental structures, PSI-2 will make available even broader structural coverage of protein sequence space (i.e., leverage) using computational homology modeling to provide structural information for the ever-expanding database of related protein sequences.

The main goal of target selection in PSI-2 is to coarsely sample large protein families (Pfam and other families) with no structural representatives in PDB for broad structural coverage. The current goals include:

To determine at least one structure for each large, hitherto uncharacterized protein sequence family using coarse sampling (BIG families);
To determine representative structures for each branch of very large, diverse protein sequence families (MEGA families) to span the structural and functional diversity within that family (moderate sampling to increase structural coverage and to provide structural coverage of selected families with high biomedical relevance);
To determine representative structures for families that are over-represented in the microbiome and metagenome sequence data (moderate sampling of META-family);
To determine structures of biomedical targets and community-proposed targets.

In order to optimize target selection and to maximize the biological and biomedical insights that can be derived from the PSI-2 efforts, we will evaluate structural coverage of certain genomes or groups of targets (human disease-related microbiomes and metagenomes).

Protein sequence families targeted by PSI-2 will be prioritized on the basis of size, genome coverage, homology modeling coverage, and perceived biological and biomedical relevance.

II. Target selection/distribution processes

Through the advice and participation of the PSI-2 Standing Subcommittee on Target Selection, a centralized target selection mechanism has been implemented within the four Large-Scale Research Centers to ensure target selection consistent with the goals of PSI-2. Structurally uncharacterized (or inadequately characterized) families are identified using well-established protocols developed among the four Centers. Initial rankings are based on family size (total number of sequences) and on family diversity (reflecting structural and functional complexity of the family plus consideration of the number of structures required to model most if not all proteins within the family). Furthermore, targets or target families that are deemed more appropriate for investigation by the six Specialized Centers are identified so that technologies and methodologies for these important classes of proteins (e.g. membrane proteins, protein-protein complexes, eukaryotic proteins, and recalcitrant proteins) can be developed and transferred to the general structural biology community.
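The initial ranking described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that families are ordered first by size and then by a diversity proxy (the estimated number of structures needed to model most members). The Family class, its field names, and the way the two criteria are combined are hypothetical and are not taken from the Centers' actual protocols.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Family:
    """Hypothetical record for one structurally uncharacterized family."""
    name: str
    size: int               # total number of sequences in the family
    structures_needed: int  # assumed diversity proxy: structures required
                            # to model most members of the family


def initial_ranking(families):
    """Order families by size, then by the diversity proxy.

    Assumption: larger and more diverse families rank higher; the family
    name is used only to make the ordering deterministic on exact ties.
    """
    return sorted(
        families,
        key=lambda f: (-f.size, -f.structures_needed, f.name),
    )


if __name__ == "__main__":
    example = [
        Family("famA", 500, 3),
        Family("famB", 1200, 5),
        Family("famC", 500, 8),
    ]
    for family in initial_ranking(example):
        print(family.name, family.size, family.structures_needed)
```

A weighted score could replace the lexicographic ordering; the sketch uses the simpler form because the text does not state how size and diversity are traded off against each other.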

The selection of target families is organized and managed jointly by the four Production Centers with the aims of:

Ensuring maximum leverage of PSI-2 efforts by coordinating and curating target lists and target selection mechanisms among the Centers;
Evaluation of genomic sequence data to assess the quality of the curated data and to minimize the effect of poorly annotated data;
Use of consistent protocols for genome analysis and target selection;
Elimination of unnecessary duplication of effort among Centers;
Establishment of an effective target exchange procedure, so that targets which are not successful in one Center can be considered by other Centers, including Specialized Centers, using alternative technologies or methodologies.

Each Center prioritizes targets according to criteria that reflect its individual scientific interests and technical capabilities, including:

Families containing representatives from selected model organisms or groups of organisms;
Families containing representatives with known or postulated disease association;
Families containing representatives with predicted or known biological/biochemical function; and
Families containing representatives from all three kingdoms of life.

A given target family is, in most cases, assigned to only one Center. Each Center is allocated an approximately equal number of target families (or subfamilies), which are distributed using a rotating draft pick process (or equivalent strategy) from a consensus list of target families. Target families (BIG families, MEGA families, META families) are assigned and constantly re-evaluated from the well-characterized ensemble of Pfam, BIG and other protein sequence families as they become available from the genome sequencing efforts. Additional target families will be selected from other ensembles of protein sequence families identified on a consensus basis by bioinformatics staff members from each Center. Following assignment of a new target family, each Center is responsible for applying its own methodologies for selecting individual candidates for experimental structure determination from within each targeted protein domain sequence family. Sets of very large, diverse and biologically/medically important protein sequence families (MEGA and META families) are being analyzed for PSI-2 structural studies. Each Center is also responsible for applying its own methodologies for selecting individual targets for structure determination of biomedical targets and community targets.
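The rotating draft pick process mentioned above can be sketched as follows. This is one possible reading, not the Centers' documented procedure: it assumes the consensus list is already ranked, that each pick takes the highest-ranked remaining family, and that the Center picking first rotates by one position each round. The function name and the Center labels are hypothetical.

```python
def rotating_draft(ranked_families, centers):
    """Distribute families among centers with a rotating pick order.

    Assumptions (not from the source): `ranked_families` is the consensus
    list in priority order, every pick takes the next family on that list,
    and the first-pick position shifts by one center after each round.
    Each family is assigned to exactly one center.
    """
    if not centers:
        raise ValueError("at least one center is required")

    assignments = {center: [] for center in centers}
    n = len(centers)
    for index, family in enumerate(ranked_families):
        round_number, slot = divmod(index, n)
        # Rotate the starting center each round so no center always picks first.
        center = centers[(slot + round_number) % n]
        assignments[center].append(family)
    return assignments


if __name__ == "__main__":
    families = [f"F{i}" for i in range(1, 11)]
    for center, picks in rotating_draft(families, ["A", "B", "C", "D"]).items():
        print(center, picks)
```

Under these assumptions the number of families held by any two Centers differs by at most one, which matches the stated aim of an approximately equal allocation. A real draft would let each Center choose according to its own priorities rather than simply taking the next family on the list.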

III. Public list of targets and progress

The PSI and each Center provide weekly updates to TargetDB, the public database of selected families and structure determination candidates, and deposit protocols for protein sample production into PepcDB, the public repository of PSI-2 structure determination results.

IV. Milestones

The Target Selection Subcommittee has recommended a well-defined set of milestones and deliverables, which have been incorporated in the PSI-2 Goals and Milestones Statement. It is further recommended that each Center contribute an annual peer-reviewed publication summarizing its progress and highlighting biological/biomedical as well as technological/methodological contributions. These annual publications will provide a facile means of communicating overall progress plus important results, concepts, strategies, and ideas to the public. Rigorous assessments of the achievements of the entire PSI should be undertaken regularly and published, to the extent possible, along with the annual publications of all Centers. Such publications would serve the important goal of providing citations for the activities of the PSI as a whole and those of each Center, and for structures that would otherwise not be the focus of a peer-reviewed publication.

V. Impact

The impact of PSI-2 may be assessed on an annual basis using quantitative criteria that include:

Number of structures determined from within BIG, MEGA and META protein families with no structural coverage;
Number of novel structures determined from BIG, MEGA and META families (defined at the level of ~ 30% sequence identity);
Number of structures determined from within protein families, biomedical-theme targets, and community-proposed targets;
Numbers of proteins and residues that can be modeled that could not previously be modeled;
Number of references (journal citations and web) to PSI-2/SG PDB identifiers;
Number of accesses to PSI-2 structures in PDB by PDBsum, SCOP, CATH, etc.;
Number of downloads of PSI-2 structures from PDB;
New methods and technologies made available to the scientific community;
Number of workshops/meetings to disseminate the new technologies and methodologies to the public;
Materials and reagents distributed to the public;
Intellectual property invented.

Although primarily focused on high-throughput experimental protein structure determination and methodology, contributions from PSI-2 Centers are distinct from those of conventional structural biology in that they also make extensive data and substantial resources available to the general scientific community. It is imperative, therefore, when publicizing PSI-2 that special efforts be devoted to highlighting all such resources that have been made available to and accessed by the community, including: expression vectors; expression clones; protein expression/purification protocols; purified proteins; experimentally determined protein structures; homology models computed from PSI-2 structures; homology modeling techniques; advances in laboratory information management systems; computer programs; robotics; gene cloning and protein expression/purification methodologies; crystallization strategies and protocols; experimental data sets for methods developers; comprehensive positive and negative data for data mining; and X-ray crystallography and solution NMR structure determination methods.

Steering Subcommittee on Target Selection
Chair: Andrzej Joachimiak
Members: Guy Montelione, Ian Wilson, Stephen Burley, Andras Fiser, Adam Godzik, Christine Orengo, Burkhard Rost, Jerry Li
Advisors: David Baker, Steve Brenner