Protein Clusters: a collection of proteins grouped by sequence similarity and function

Help Document

Introduction

The NCBI Entrez Protein Clusters database is a collection of Reference Sequence (RefSeq) proteins from the complete genomes of prokaryotes, plasmids, and organelles grouped and annotated based on sequence similarity and protein function. Proteins are automatically grouped into clusters based on reciprocal best hit BLAST scores. NOTE: not all proteins in complete genomes are represented in the Protein Clusters database.
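
As a rough illustration of the reciprocal-best-hit idea, the sketch below pairs proteins from two genomes when each is the other's top-scoring hit. The protein names and scores are made up for the example, and the function names are hypothetical. The actual NCBI clustering procedure is not described in this document and is likely more involved.

```python
def best_hits(scores):
    """Map each query to its highest-scoring subject.

    `scores` maps (query, subject) pairs to a similarity score,
    for example a BLAST bit score. Ties are broken by subject name
    so the result is deterministic.
    """
    best = {}
    for (query, subject), score in scores.items():
        current = best.get(query)
        if current is None or (score, subject) > (current[1], current[0]):
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs where a's best hit is b and b's best hit is a."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Hypothetical scores between proteins of genome A and genome B.
a_vs_b = {
    ("A1", "B1"): 250.0,
    ("A1", "B2"): 90.0,
    ("A2", "B2"): 180.0,
    ("A3", "B2"): 60.0,
}
b_vs_a = {
    ("B1", "A1"): 248.0,
    ("B2", "A2"): 181.0,
    ("B2", "A3"): 75.0,
}

pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
print(pairs)
```

Here A3's best hit is B2, but B2's best hit is A2, so A3 is left unpaired. Only mutual top hits are kept.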

Clusters are named and functional descriptions are assigned by manual curation. Alignments, information on genome neighborhood, and links to NCBI and external databases are provided for each protein cluster, and the clusters are searchable in Entrez. Specific query and search terms can be found under Querying and Searching.

Currently, proteins encoded by prokaryotes and chloroplasts are placed in separate clusters (see Table 1). Each cluster in the database has a unique identifying number (UID). Each cluster also has an accession number consisting of a three- or four-letter code followed by five digits (Table 1). The clusters are divided into curated and non-curated sets. Curated clusters have consistent nomenclature and protein function descriptions. Non-curated clusters are automatically generated and have not yet been manually annotated.

Table 1. Protein Clusters Identifiers.
Protein Cluster Database                        Cluster ID
Prokaryotes: Curated Protein Clusters           PRK#####
Chloroplasts                                    CHL#####
Prokaryotes: Uncurated Protein Clusters         CLS#####
Chloroplasts: Uncurated Chloroplast Clusters    CLSC#####
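
The accession patterns in Table 1 can be checked mechanically. The following is a minimal sketch, assuming an accession is exactly one of the listed letter codes followed by five digits. The function name `classify_accession` is hypothetical, and lowercase input is accepted here only as a convenience.

```python
import re

# Letter codes and database labels as listed in Table 1.
CLUSTER_SETS = {
    "PRK": "Prokaryotes: Curated Protein Clusters",
    "CHL": "Chloroplasts",
    "CLS": "Prokaryotes: Uncurated Protein Clusters",
    "CLSC": "Chloroplasts: Uncurated Chloroplast Clusters",
}

# Three or four letters followed by exactly five digits.
ACCESSION_RE = re.compile(r"^([A-Z]{3,4})(\d{5})$")


def classify_accession(accession):
    """Return the Table 1 label for a cluster accession, or None if it does not fit."""
    match = ACCESSION_RE.match(accession.strip().upper())
    if match is None:
        return None
    return CLUSTER_SETS.get(match.group(1))


for example in ["PRK09525", "CLSC00001", "PRK123", "XYZ12345"]:
    print(example, "->", classify_accession(example))
```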


Protein Cluster Page

The display for each cluster provides information on cluster accession, cluster name, and gene name, as well as links to protein display tools, external databases, and publications. Each of these sections can be expanded or collapsed by clicking on the down or right arrows, respectively.

Figure 1. Sample Protein Cluster Page



Cluster Info

Cluster Info displays information specific to the cluster. The information highlighted in blue at the top of the page contains the core curated identifiers and information. The cluster number and the curation status are at the left, the name of the cluster/protein in the middle, and the gene name or synonym (if annotated) at the right.

Figure 2. Cluster Info - Curated Info



On the left-hand side, below the blue bar, are statistics specific to the protein cluster.

Figure 3. Cluster Info - Statistics



Cluster Tools

Cluster Tools contains several methods of displaying cluster details.

Figure 4. Cluster Tools



Show Detailed Alignment

Show Detailed Alignment displays a scrollable multiple sequence alignment. Selected sequences can be highlighted in the alignment by checking the box next to the species or genus name in the Protein Table before showing the alignment. The alignment can be shown in either a conserved amino acids display or a consensus display (in which only amino acids that differ from the consensus sequence are shown). Conserved domains can be displayed by clicking show/hide all in the Domains box or, for an individual sequence, by clicking the + or - next to the protein accession.

Figure 5. Show Detailed Alignment
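
To make the consensus display concrete, the sketch below derives a simple majority consensus from a small made-up alignment and replaces every residue that matches it with a dot. The dot as the match character, the majority rule, and the function names are assumptions for illustration. The viewer's actual rendering and consensus calculation are not specified in this document.

```python
from collections import Counter


def consensus(alignment):
    """Most common character in each column.

    Ties go to the character seen first in the column. Gap characters,
    if present, are counted like any other character.
    """
    columns = zip(*alignment)
    return "".join(Counter(column).most_common(1)[0][0] for column in columns)


def consensus_view(alignment, match_char="."):
    """Return the consensus and the rows with matching residues masked."""
    cons = consensus(alignment)
    rows = [
        "".join(match_char if residue == c else residue for residue, c in zip(seq, cons))
        for seq in alignment
    ]
    return cons, rows


# Hypothetical aligned sequences of equal length.
alignment = ["MKTAYIAK", "MKTGYIAK", "MRTAYLAK"]
cons, rows = consensus_view(alignment)
print(cons)
for row in rows:
    print(row)
```

Only the substitutions remain visible in each row, which is what makes differences easy to spot in a long alignment.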



Build Tree

Build Tree produces a distance tree. The tree construction method can be chosen using the drop-down menus. Selected sequences can be highlighted in the tree by checking the box next to the species or genus name in the Protein Table before building the tree. The tree display can be expanded or contracted using the Collapsing level drop-down menu, or by clicking on a branch node and using the pop-up menu to expand or collapse around that branch point. The root of the tree can be defined by clicking on a node and selecting Re-root from the pop-up menu.

Figure 6. Build Phylogenetic Tree



Genome ProtMap

Genome ProtMap opens a separate page showing the genome context of either all the proteins in the cluster (Genome ProtMap by PRK#####) or all the proteins that share the same COG (Clusters of Orthologous Groups) (Genome ProtMap by COG#####). In the Genome ProtMap display, clicking an accession number links to the RefSeq nucleotide record. Mousing over a protein shows detailed information such as its name, cluster ID, and genome location. Clicking a protein brings up a pop-up menu with links to the protein, gene, or cluster. The list of taxa in the ProtMap can be collapsed or expanded by clicking the + or - next to the taxon.

Figure 7. Genome ProtMap by Cluster


Cross references

Cross references are calculated for each protein, collected across all proteins in a given cluster, and displayed in the cross reference section. This section provides links to similar clusters and to information on protein families, metabolic roles, conserved protein domains, and protein structure.

Figure 8. Cross References



Entrez Links

Entrez Links provide links to the Entrez Gene, Genome, Nucleotide, Protein, and PubMed records for the proteins in the cluster.

Description

The description contains curated information on the proteins and their function, domain descriptions, COG functional categories, and KEGG BRITE hierarchy showing functional classification.

Figure 9. Description



Publications

Publication links contain curated publications as well as publications found on individual proteins from RefSeq, GeneRIFs, GenBank, SwissProt, Structures, and Conserved Domains. NOTE: publications can be in more than one category. The total shown at the top (show all) is the complete set of non-redundant publications. The title from the most recent publication in each category is shown, along with direct links for each category. The entire publication set can be expanded to show all publication titles.

Figure 10. Publication Links
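
Because a publication can appear in more than one category, the per-category counts can add up to more than the non-redundant total. The sketch below shows the arithmetic with made-up PubMed IDs and a subset of the categories. The function name is hypothetical.

```python
# Hypothetical PubMed IDs grouped by publication category.
publications = {
    "Curated": [1111, 2222],
    "RefSeq": [2222, 3333],
    "GeneRIFs": [3333, 4444, 1111],
}


def non_redundant_total(by_category):
    """Count each publication once, however many categories list it."""
    unique_ids = set()
    for ids in by_category.values():
        unique_ids.update(ids)
    return len(unique_ids)


per_category = {name: len(ids) for name, ids in publications.items()}
total = non_redundant_total(publications)
print(per_category)
print(total)
```

In this example the category counts sum to 7, while the non-redundant total is 4.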



Protein Cluster Table

The Protein Table contains information on each of the proteins in the cluster, organized by taxon. Where there are multiple species or strains from the same genus, the table can be collapsed or expanded. Selected sequences can be highlighted in the alignment or distance tree by checking the box next to the genus or species.

Figure 11. Protein Table



The organism, protein name, accession, locus tag, and length are provided for each protein. Clusters located upstream or downstream of the cluster protein are shown to provide information on genome neighborhood. BLink links to a pre-computed BLAST analysis. Alignment provides a graphic of the alignment, including domain structure. Each of these properties links to the appropriate database record (Table 2).

Table 2. Protein Table Links.
Property              Links to
Organism name         Taxonomic information
Protein Name          Protein record
Upstream Cluster      Cluster page in Protein Clusters
Accession             Protein record
Downstream Cluster    Cluster page in Protein Clusters
Locus tag             Entrez Gene page
BLink                 Pre-computed BLAST analysis
Alignment             Detailed alignment page
Domain                Domain page in CDD


Querying and Searching

The Protein Clusters database uses all of the features of other Entrez databases. You can limit searches, preview and index your search terms, and use the History, Clipboard, and Details features through the tab buttons underneath the search box. More general instructions on Entrez querying can be found in the Entrez help documentation.

Limits

The Limits button on the search bar allows search limits to be set from a drop-down menu. After you select a limit, the selected field shows up in the yellow bar behind the Field tag. The Limits checkbox is also marked and stays marked through subsequent searches. To remove the limits for a particular search, deselect the checkbox.

The following table summarizes the various limits and properties that can be used to refine searches.

Table 3. Limits and Properties in Entrez Protein Clusters
Each entry gives the field name with its field abbreviations in brackets, followed by a definition and an example query.

Accession [ACCN] [ACCESSION]
    Unique identifier for each cluster.
    Example: retrieve the cluster with the accession PRK09525:
        PRK09525[ACCN]

Average Length [Average Length]
    Average length of the proteins in the cluster.
    Example: retrieve all clusters with an average protein length of 100 to 300 amino acids:
        100:300[Average Length]

COG [COG]
    COG (Clusters of Orthologous Groups) is a phylogenetic classification of proteins from complete genomes.
    Example: retrieve all clusters with COG3250:
        COG3250[COG]

Creation date [Creation Date]
    Date the record was created. The format is YEAR/MONTH/DAY, including the forward slashes.
    Example: find all clusters created in 2007:
        2007[Creation Date]

Domain Name [Domain Name]
    Domains are structural or functional units in a protein; nomenclature is based on the NCBI Conserved Domain Database.
    Example: retrieve all clusters with the beta galactosidase small chain, N terminal domain:
        Bgal_small_N[Domain Name]

Domains [Domains]
    Number of domains in the protein cluster.
    Example: retrieve all clusters with 15 domains:
        15[Domains]

EC/RN Number [EC/RN Number]
    The number assigned by the Enzyme Commission or the Chemical Abstracts Service (CAS) to designate a particular enzyme or chemical, respectively.
    Example: retrieve all clusters containing the EC number 3.2.1.23:
        3.2.1.23[EC/RN Number]

Gene synonym [Gene Synonym]
    Alternative name for a gene found in the database records.
    Example: retrieve all clusters with cbiJ as a gene synonym:
        cbij[Gene Synonym]

HAMAP [HAMAP]
    Number assigned by the Swiss Institute of Bioinformatics to designate a well-defined and well-conserved protein family or subfamily. HAMAP stands for High-quality Automated and Manual Annotation of microbial Proteomes.
    Example: retrieve all clusters with the HAMAP MF_00008:
        mf 00008[HAMAP]

KO [KO]
    Number assigned by the Kyoto Encyclopedia of Genes and Genomes to designate a manually curated set of orthologous gene groups in complete genomes.
    Example: retrieve all clusters with a KO of k01190:
        k01190[KO]

Locus Tag [Locus Tag]
    Locus tags are identifiers that are systematically applied to every gene in a genome.
    Example: retrieve all clusters containing a protein with the locus tag b0344:
        b0344[Locus Tag]

Organism [Organism]
    The scientific and common names for the organisms associated with the protein sequence.
    Example: find all clusters associated with Escherichia coli:
        Escherichia coli[Organism]

Paralogs [Paralogs]
    Number of paralogous proteins in a cluster.
    Example: retrieve all clusters with 13 paralogs:
        13[Paralogs]

Properties [PROP] [Properties]
    An attribute of the cluster based on DNA source or curation status.
    Example: retrieve all clusters from chloroplasts:
        source chloroplast[Properties]

Protein Accession [Protein Accession]
    The unique accession number of the protein.
    Example: retrieve all clusters containing the protein accession NP_414878:
        NP_414878[Protein Accession]

Protein GI number [Protein GI]
    A series of digits assigned consecutively by NCBI to each sequence it processes.
    Example: retrieve all clusters containing the protein GI 16128329:
        16128329[Protein GI]

Protein Name [PROT] [Protein Name]
    The standard name of proteins found in database records. Common names may not be indexed in this field, so it is best to also consider All Fields or Text Words.
    Example: retrieve all clusters containing the protein beta galactosidase:
        beta galactosidase[Protein Name]

PubMed ID [PubMed ID]
    Unique identifier for the publication in the PubMed database.
    Example: retrieve all clusters with the PubMed ID 97298:
        97298[PubMed ID]

Sequence Length [Sequence Length]
    Exact number of amino acids in the protein sequence.
    Example: retrieve all clusters with at least one protein with a length of 1024 amino acids:
        1024[Sequence Length]

Size [Size]
    Number of proteins in the cluster.
    Example: retrieve all clusters with 25 proteins:
        25[Size]

Taxonomy ID [Taxonomy ID]
    Identifier for the species or strain in the NCBI taxonomy database.
    Example: retrieve all clusters with proteins from taxonomy ID 83333:
        83333[Taxonomy ID]

Title [Title]
    Title of the protein cluster.
    Example: retrieve all beta-D-galactosidase protein clusters:
        beta-D-galactosidase[Title]

Total Publications [Total Publications]
    Total number of publications associated with proteins in the cluster.
    Example: retrieve all clusters with 51 publications:
        51[Total Publications]
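
The example queries above all follow the same pattern: a value followed by a field name in square brackets, with a colon separating the two ends of a range. The sketch below assembles such strings programmatically. The helper names are hypothetical, and joining terms with AND, OR, or NOT follows general Entrez conventions, which are assumed to apply here because this database shares the features of other Entrez databases.

```python
def field_term(value, field):
    """Format a single field-restricted term, e.g. PRK09525[ACCN]."""
    return f"{value}[{field}]"


def range_term(low, high, field):
    """Format a range term, e.g. 100:300[Average Length]."""
    return f"{low}:{high}[{field}]"


def combine(terms, operator="AND"):
    """Join terms with a Boolean operator (assumed Entrez convention)."""
    if operator not in {"AND", "OR", "NOT"}:
        raise ValueError(f"unsupported operator: {operator}")
    return f" {operator} ".join(terms)


query = combine([
    field_term("Escherichia coli", "Organism"),
    field_term("beta galactosidase", "Protein Name"),
    range_term(100, 300, "Average Length"),
])
print(query)
```

The resulting string can be pasted into the search box. Building it from parts mainly helps when the same search is repeated with different values.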


Preview/Index

The Preview/Index page on any Entrez database is used to construct queries and to view terms that have been indexed under any field name. The table in the previous section describes the fields used in indexing the records and provides some representative queries using those fields. Information on using Preview/Index can be found in the Entrez help documentation.

The History, Clipboard, and Details features are consistent with other Entrez databases. You may find additional information in the Entrez help documentation.

If you have any additional questions, then please send an email to: info@ncbi.nlm.nih.gov


Revised March 26, 2007