Copyright © 2001 The Protein Society
Circularly permuted proteins in the protein structure database
Reprint requests to: Dr. Byungkook Lee, Laboratory of Molecular Biology, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bldg. 37, Room 4B15, 37 Convent Drive MSC 4255, Bethesda, MD 20892-4255, USA; e-mail: bk/at/nih.gov; fax: (301) 402-1344.
Received February 7, 2001; Revised June 14, 2001; Accepted June 14, 2001.
Abstract
Some proteins are homologous to others after their sequence is circularly permuted. A few such proteins have been recognized, mainly by sequence comparison, but also by comparing their three-dimensional structures. Here we report the result of a systematic search for all protein pairs in the SCOP domain database, filtered at 90% sequence identity, that become structurally superimposable when the sequence of one member of the pair is circularly permuted. Using a reasonable set of criteria, we find that 47% of all protein domains are superimposable on at least one other protein domain in the database after their sequence is circularly permuted. Many of these are symmetric proteins, which superimpose on another protein both with and without a circular permutation of the sequence. However, 412 of the total 3035 domains are nonsymmetric, and these become structurally superimposable on another protein only after a circular permutation of the sequence. These include most known and many previously undetected circularly permuted proteins with remote homology.
Keywords: Circular permutation, protein structure, structure alignment, gene duplication
Proteins have been circularly permuted artificially to study the folding and stability of the protein or, in protein engineering, to move the N or C terminus to another position in the structure (Heinemann and Hahn 1995a; Baird et al. 1999; McWherter et al. 1999; Nakamura and Iwakura 1999; Iwakura et al. 2000). Circularly permuted proteins also occur in nature. Lindqvist and Schneider (1997) reviewed some eight naturally circularly permuted proteins that were known by 1997, and at least six more (Garcia-Vallve et al. 1998; Murzin 1998; Castillo et al. 1999; Jeltsch 1999; Polekhina et al. 1999; Jung and Lee 2000) have been reported since then. Circularly permuted proteins can arise from a posttranslational modification (Carrington et al. 1985; Bowles et al. 1986), but the majority probably arose from gene duplication (Luger et al. 1989; Ponting and Russell 1995; Jeltsch 1999) or exon shuffling (Doolittle 1987; Gilbert 1987) events. Natural circularly permuted proteins occur in a variety of organisms, including viruses, bacteria, plants, and higher animals. They are mostly β-sheet and α/β proteins, although saposins (Ponting and Russell 1995; Liepinsh et al. 1997) are α-helical proteins. In most known cases, the N and C termini are close to each other (Thornton and Sibanda 1983), but we have found in this work many examples in which the two termini are not close together. Methods for detecting repeated sequence segments and circularly permuted proteins in a sequence database have been reported recently (Marcotte et al. 1999; Uliel et al. 1999). Here we report the results of a systematic search for protein pairs that have similar structures but whose structural alignment requires circular permutation of one of the sequences.
Results and Discussion
There are more than 10,000 entries in the protein structure databank (Berman et al. 2000), which consist of more than 16,000 domains according to the manual SCOP domain parsing result (Murzin et al. 1995). We selected 3035 protein domains from the SCOP domain database, version 1.41, that were at least 40 residues long and had 90% or less sequence identity between any pair of them. We attempted to structurally align all pairs of these domains both with and without circularly permuting one of the sequences. Two structures are said to be structurally related when they are similar enough that the structural alignment produces a sufficiently large number of aligned pairs of residues (see Materials and Methods). Of the 9.2 million (3035 × 3035) possible pairs, 136,975 pairs met the criteria for a structural relation when neither sequence was permuted (unpermuted alignment), and 48,016 pairs met the criteria when one of the two sequences was circularly permuted (permuted alignment). The pairs in the latter set are said to be CP related. The automatic procedure found most known CP relations, including those between plant lectins (Cunningham et al. 1979), bacterial glucanases (Heinemann and Hahn 1995b), (β/α)8 barrel proteins (Sergeev and Lee 1994; Jia et al. 1996; Macgregor et al. 1996; Garcia-Vallve et al. 1998), the C2 domain proteins (Nalefski and Falke 1996), ferredoxins (Jung and Lee 2000), flavin-binding β-barrel domains (Murzin 1998), the six-stranded double-ψ β-barrels (Castillo et al. 1999), and the DNA and other methyltransferases (Jeltsch 1999). Some new examples of CP-related protein pairs are shown in Figure 1. When a protein has a symmetric structure, it aligns to itself and to other structurally similar proteins both with and without circular permutation of its sequence. One can use this property to identify symmetric structures.
Therefore, we operationally define a protein to be symmetric if it is related to another protein both with and without circular permutation and if the two alignments are judged to be distinct (see Materials and Methods). One feature of the structures shown in Figure 1 is that the N and C termini are far apart in many of them. Proximity of the N and C termini is not a prerequisite for circular permutation. Individual structural relations are shown in Figure 2. The numbers of relations between proteins that belong to the same or different fold, superfamily, and family, according to the SCOP classification, are shown in Figure 3. The unpermuted relations (blue and green dots in Fig. 2) are mostly between proteins in the same superfamilies (Fig. 3), indicating that our criteria for structural similarity roughly match the criteria used for the manual SCOP superfamily classification. Many relations do connect different classes (blue dots outside of the boxes in Fig. 2), but most of these involve protein domains that are small α-helical pieces or small α + β motifs, which resemble a part of many larger proteins. Most of the symmetric CP relations (green dots in Fig. 2) occur within the same SCOP folds and superfamilies (Fig. 3), but many nonsymmetric CP relations (red dots in Fig. 2) connect proteins in different superfamilies and folds (Fig. 3). The number of proteins that are related to another protein is listed in Table 1. Also listed are the numbers of families, superfamilies, folds, and classes, as defined by SCOP, that these proteins represent. The precise numbers given in the table depend on the criteria used to judge structural similarity (see Materials and Methods). The fact that structural similarity depends on an ultimately arbitrary choice of a cutoff value is somewhat unsatisfactory.
However, the situation is similar in the detection of sequence homology, where a similarly arbitrary cutoff value for the e-score is commonly used. The z-score that we used in this work and the e-score are closely related, being precisely interconvertible when the score distribution for random matches is Gaussian. We made numerous spot checks by visual inspection of superimposed structures, and in every case we agreed with the judgment made by the automatic procedure concerning the structural similarity or the lack thereof. It can be seen from Table 1 that 47% (1433 of 3035) of the protein domains have a CP relation with at least one other known protein domain and that such proteins are not restricted to a few special folds; circularly permuted proteins occur in all structural classes and in about half (226 of 446) of all known folds. In the SCOP classification, more than one-third of the protein domains belong to the 15 largest folds (1068 out of 3035). There is at least one circularly permuted protein in each of these 15 folds and, on average, 44% of the proteins in a given fold are circularly permuted. It has long been recognized that many multidomain proteins were generated by different combinations of a small number of domains (Patthy 1993). The finding that a large number of protein domains have circular permutation relations with other protein domains indicates that individual domains themselves are also made from a combination of smaller units. Some 71% of the circularly permuted proteins (1025 of 1433) have symmetric structures. The number of symmetric proteins detected here is therefore 34% of the total number of proteins. These structures might have arisen from ancient gene duplication events (Lang et al. 2000). Marcotte et al. (1999) reported that duplicated gene segments occur in 14% of all protein sequences and more than 20% of all eukaryotic proteins.
These must reflect relatively recent gene duplication events because they were detected by sequence homology. In the case of the symmetric structural domains detected here, the sequence homology is generally low; only 91 of the 34,581 symmetric circularly permuted pairs have >30% sequence identity between them. If the symmetry has indeed arisen from gene duplication events, most of these must therefore be ancient events. Alternatively, one cannot rule out the possibility that at least some of these structures arose without a gene duplication event (convergent evolution).
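The relation between a z-score cutoff and an e-score-like quantity mentioned in the discussion above can be illustrated with a small calculation. This is only a sketch under stated assumptions: it assumes that random match scores are exactly Gaussian, and it takes the "e-score" to mean the expected number of random hits at or above the cutoff among a given number of independent comparisons. The function names are hypothetical and are not taken from the paper or from SHEBA.

```python
import math


def gaussian_tail_probability(z: float) -> float:
    """Upper-tail probability P(Z >= z) for a standard normal variable.

    Assumption: random match scores follow a Gaussian distribution.
    """
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def expected_random_hits(z: float, n_comparisons: int) -> float:
    """Expected number of random comparisons scoring at or above z.

    Assumption: n_comparisons independent random comparisons; this is the
    quantity treated here as an e-score-like value.
    """
    return n_comparisons * gaussian_tail_probability(z)
```

For instance, `expected_random_hits(5.0, 3035)` gives the expected number of chance hits for one domain compared against 3035 others at the z > 5.0 cutoff, under the Gaussian assumption.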
Materials and methods
Finding circularly permuted alignment
A protein sequence was circularly permuted by deciding on a cut position and then renumbering the residues starting from the carboxy side of the cut position forward to the C terminus of the protein and then continuing to the N terminus and finishing at the amino side of the cut position. The cut position was initially chosen to be the middle of the sequence (Fig. 4). The structure of the permuted protein was then aligned to another protein, the sequence of which is not permuted, using the recently described structure–structure alignment program SHEBA (Jung and Lee 2000). This structural alignment procedure preserves connectivity so that two structures that are identical except for the numbering of the residues are considered distinct. A new cut position was then determined from the structural alignment. Let n_a be the number of residues that are matched in the first half (the half that contains the original C terminus) and n_b the number of residues that are matched in the second half of the permuted protein. The new cut position is chosen to be next to the last residue matched in the second half if n_a > n_b or chosen just before the first residue matched in the first half if n_a ≤ n_b. Circular permutation using this new cut position increases the number of matched residues in the structural superposition.
Criteria for a structural relation
A structural alignment between two proteins, a and b, gives the match score m_ab, which is the fraction of matched residues in protein a. For each protein a, the mean match score m_a of the random distribution was computed by averaging m_ab over all b proteins that are structurally unrelated (those with m_ab < 40%). The root-mean-square deviation σ_a of m_ab about m_a was also computed. The match scores were then converted to the z-score z_ab, which was defined as z_ab = (m_ab − m_a)/σ_a.
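The circular permutation and cut-position update described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, residues are indexed from 0 in the original (unpermuted) numbering, the set of matched residues is assumed to come from an external structural alignment (SHEBA is not called here), and "next to the last residue matched" is read as "immediately after" that residue. Leaving the cut unchanged when the relevant half has no matched residue is also an assumption.

```python
from typing import Sequence, Set


def initial_cut(n_residues: int) -> int:
    """Initial cut position: the middle of the sequence."""
    return n_residues // 2


def circularly_permute(residues: Sequence, cut: int) -> Sequence:
    """Renumber residues so that the one on the carboxy side of the cut
    comes first; `cut` is the number of residues on the amino side."""
    if not 0 < cut < len(residues):
        raise ValueError("cut must lie strictly inside the sequence")
    return residues[cut:] + residues[:cut]


def update_cut(cut: int, matched: Set[int]) -> int:
    """Choose a new cut from the matched residues of a permuted alignment.

    `matched` holds original 0-based indices of residues that were matched.
    The first half of the permuted protein is the original C-terminal part
    (index >= cut); the second half is the original N-terminal part.
    """
    first_half = sorted(i for i in matched if i >= cut)
    second_half = sorted(i for i in matched if i < cut)
    n_a, n_b = len(first_half), len(second_half)
    if n_a > n_b:
        # Cut immediately after the last matched residue of the second half.
        return second_half[-1] + 1 if second_half else cut
    # Cut just before the first matched residue of the first half.
    return first_half[0] if first_half else cut
```

A typical use would alternate `circularly_permute`, an external structural alignment, and `update_cut`, for example `circularly_permute("ABCDEFGH", initial_cut(8))` followed by realignment and `update_cut(4, matched)`.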
For the straight structural alignment, a pair of proteins was considered to be structurally related when z_ab was >5.0. This z-score cutoff value is the same as that used previously for clustering protein structures into groups of similar structures (Jung and Lee 2000). This particular value was chosen primarily because the number of multimember clusters reached a plateau of maximum value at this cutoff value. Two proteins were considered to be related by circular permutation (CP related) if z_a'b was >5.0, where a' is the permuted protein, and if the numbers of matched residues in the C- and N-terminal parts of the permuted protein were both >10% of the total number of matched residues for the protein pair.
Criteria for distinct alignment
Two alignments were judged to be distinct if the mean alignment shift per residue, Δr (Jung and Lee 2000), was greater than 5 positions between the two alignments.
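The criteria above can be summarized in a short sketch. It is illustrative only: the function names and argument layout are hypothetical, match scores are expressed as fractions (so the 40% background threshold is 0.40), the "total number of matched residues" is assumed to be the sum of the two parts, and the mean alignment shift Δr is taken as a precomputed input rather than derived here.

```python
import math
from typing import Dict


def z_scores(match_scores: Dict[str, float],
             unrelated_cutoff: float = 0.40) -> Dict[str, float]:
    """Convert match scores m_ab of one protein a into z-scores z_ab.

    The background mean and root-mean-square deviation are computed over
    the partners b considered structurally unrelated (m_ab < cutoff).
    """
    background = [m for m in match_scores.values() if m < unrelated_cutoff]
    if not background:
        raise ValueError("no unrelated partners to define the background")
    mean = sum(background) / len(background)
    sigma = math.sqrt(sum((m - mean) ** 2 for m in background) / len(background))
    if sigma == 0.0:
        raise ValueError("background scores have zero spread")
    return {b: (m - mean) / sigma for b, m in match_scores.items()}


def is_cp_related(z_permuted: float, n_first: int, n_second: int,
                  z_cutoff: float = 5.0, min_fraction: float = 0.10) -> bool:
    """CP relation: z above cutoff and both parts of the permuted protein
    contribute more than 10% of the matched residues (assumed total:
    n_first + n_second)."""
    total = n_first + n_second
    if total == 0:
        return False
    return (z_permuted > z_cutoff
            and n_first > min_fraction * total
            and n_second > min_fraction * total)


def is_symmetric_pair(z_unpermuted: float, cp_related: bool, delta_r: float,
                      z_cutoff: float = 5.0, min_shift: float = 5.0) -> bool:
    """Pair supports symmetry: related with and without permutation, and
    the two alignments are distinct (mean shift per residue > 5)."""
    return z_unpermuted > z_cutoff and cp_related and delta_r > min_shift
```

Keeping the thresholds as keyword arguments makes it easy to see how the counts of related pairs would change with a different cutoff, which the text notes is an ultimately arbitrary choice.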
Acknowledgments
This study used the high-performance computational capabilities of the Biowulf Cluster at the Center for Information Technology, National Institutes of Health. The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
Notes
Article and publication are at http://www.proteinscience.org/cgi/doi/10.1101/
References