Abstract
Background: Current classifications of protein folds are based, ultimately, on visual inspection of
similarities. Previous attempts to use computerized structure comparison methods have shown only
partial agreement with curated databases, but have failed to provide detailed statistical and
structural analysis of the causes of these divergences.
Results: We construct a map of similarities/dissimilarities among manually defined protein folds,
using a score cutoff value determined by means of the Receiver Operating Characteristics curve.
It identifies folds which appear to overlap or to be "confused" with each other by two distinct
similarity measures. It also identifies folds which appear inhomogeneous in that they contain
apparently dissimilar domains, as measured by both similarity measures. At a low (1%) false positive
rate, 25 to 38% of domain pairs in the same SCOP folds do not appear similar. Our results suggest
either that some of these folds are defined using criteria other than purely structural considerations
or that the similarity measures used do not recognize some relevant aspects of structural similarity
in certain cases. Specifically, variations of the "common core" of some folds are severe enough to
defeat attempts to automatically detect structural similarity and/or to lead to false detection of
similarity between domains in distinct folds. Structures in some folds vary greatly in size because
they contain varying numbers of a repeating unit, while similarity scores are quite sensitive to size
differences. Structures in different folds may contain similar substructures, which produce false
positives. Finally, the common core within a structure may be too small, relative to the entire
structure, to be recognized as the basis of similarity to another.
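The cutoff selection described in the Results can be illustrated with a minimal sketch. It assumes that each domain pair has a single similarity score, that pairs from different folds serve as negatives, and that a pair is called similar when its score is strictly above the cutoff. The function names `cutoff_at_fpr` and `missed_fraction` and the toy scores are hypothetical, chosen for illustration only; they are not the similarity measures or data used in this work.

```python
def cutoff_at_fpr(negative_scores, max_fpr=0.01):
    """Return a score cutoff such that the fraction of negative pairs
    (domains in different folds) scoring strictly above it is at most max_fpr.

    Hypothetical helper: one point on an ROC curve, picked by false positive rate.
    """
    if not negative_scores:
        raise ValueError("need at least one negative score")
    ranked = sorted(negative_scores, reverse=True)
    # Number of false positives we are willing to tolerate.
    allowed = int(max_fpr * len(ranked))
    if allowed >= len(ranked):
        return float("-inf")
    # Only the top `allowed` negatives can score strictly above this value.
    return ranked[allowed]


def missed_fraction(positive_scores, cutoff):
    """Fraction of positive pairs (domains in the same fold) that do NOT
    appear similar, i.e. whose score is at or below the cutoff."""
    if not positive_scores:
        raise ValueError("need at least one positive score")
    missed = sum(1 for score in positive_scores if score <= cutoff)
    return missed / len(positive_scores)


if __name__ == "__main__":
    # Toy scores, invented for illustration only.
    different_fold = [1, 2, 3, 4, 5, 6, 7, 8]
    same_fold = [5, 6, 7, 9, 10]
    cutoff = cutoff_at_fpr(different_fold, max_fpr=0.25)
    print("cutoff:", cutoff)
    print("same-fold pairs not detected:", missed_fraction(same_fold, cutoff))
```

Under these assumptions, the quantity reported above (the share of same-fold domain pairs that do not appear similar at a 1% false positive rate) corresponds to `missed_fraction` evaluated at the cutoff returned for `max_fpr=0.01`.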
Conclusion: A detailed analysis of the entire available protein fold space by two automated
similarity methods reveals the extent and the nature of the divergence between the automatically
determined similarity/dissimilarity and the manual fold type classifications. Some of the observed
divergences can probably be addressed with better structure comparison methods and better
automatic, intelligent classification procedures. Others may be intrinsic to the problem, suggesting
a continuous rather than discrete protein fold space.