NCBI Home Entrez Protein Clusters | |
Introduction |
The NCBI Entrez Protein Clusters database is a collection of Reference Sequence (RefSeq) proteins from the complete genomes of prokaryotes,
plasmids, and organelles grouped and annotated based on sequence similarity and protein function. Proteins are automatically grouped into clusters
based on reciprocal best hit BLAST scores. NOTE: not all proteins in complete genomes are represented in the Protein Clusters database.
Clusters are named and functional description are assigned by manual curation. Alignments, information on genome neighborhood, and links to
NCBI and external databases are provided for each protein cluster which are also searchable in Entrez. Specific query and search terms can be
found under Querying and Searching.
Currently proteins encoded by prokaryotes and chloroplasts are contained within separate clusters(see below). Each cluster in the database has a unique identifying number (UID). Each cluster
also has an accession number consisting of a three letter code followed by five numbers (Table 1). The clusters are divided into
curated and non-curated sets. Curated clusters have consistent nomenclature and protein function descriptions. Non-curated clusters are automatically generated and have not yet been manually annotated.
Table 1. Protein Clusters Identifiers.
Protein Cluster Database | Cluster ID |
Prokaryotes: Curated Protein Clusters | PRK##### |
Chloroplasts | CHL##### |
Prokaryotes: Uncurated Protein Clusters | CLS##### |
Chloroplasts: Uncurated Chloroplast Clusters | CLSC##### |
Protein Cluster Page |
The display for each cluster provides information on cluster accession, cluster name, and gene name, as well as links to protein display tools, external databases, and publications. Each of these sections can be expanded or collapsed by clicking on the down or right arrows, respectively.
Figure 1. Sample Protein Cluster Page
Cluster Info |
Cluster Info displays information specific to the cluster. The information highlighted in blue
at the top of the page contains the core curated identifiers and information. The
cluster number and the curation status are at the left, the name of the cluster/protein in the
middle, and the gene name or synonym (if annotated) at the right.
Figure 2. Cluster Info - Curated Info
Figure 3. Cluster Info - Statistics
Cluster Tools |
Cluster Tools contains several methods of displaying cluster details.
Figure 4. Cluster Tools
Show Detailed Alignment |
Show Detail Alignment displays a scrollable multiple sequence alignment. Selected sequences in the alignment may be highlighted by checking the box next to the species or genus name on the Protein Table before showing the alignment. Amino acids in the alignment display can be shown with either conserved amino acids or with a consensus display (only amino acids that differ from the consensus sequence are shown). Conserved domains can be displayed by clicking the show/hide all in the Domains box or, for any individual sequence by clicking the + or - next to the protein accession.
Figure 5. Show Detailed Alignment
Build Tree |
Build tree produces a distance tree. Tree construct method can be chosen using the drop-down menus. Selected sequences in the tree may be highlighted by checking the box next to the species or genus name on the Protein Table before building a tree. The tree display can be expanded or contracted using the Collapsing level drop-down menu or by clicking on a branch node and using the pop-up menu to expand or collapse around that branch point. The root of the tree can be defined by clicking on a node and selecting Re-root from the pop-up menu.
Figure 6. Build Phylogenetic Tree
Genome ProtMap |
Genome ProtMap displays in a separate page the genome context of the proteins either for all the proteins in the cluster (Genome ProtMap by PRK#####) or for all the proteins that have the same COG (Cluster of Orthologous Group) (Genome ProtMap by COG#####). In the Genome ProtMap display, clicking the accession number will link to the RefSeq nucleotide record. Mousing over the proteins gives detailed information such as name, cluster ID, and genome location. Clicking on any protein brings up a pop-up menu with links to protein, gene or cluster. The list of taxa in the ProtMap can be collapsed or expanded by clicking the + or - next to the taxon.
Figure 7. Genome ProtMap by Cluster
Cross references |
Cross references are calculated at the level of protein, then collected from all proteins in a given cluster, and finally displayed in the cross reference section which provides links to similar clusters, to information on protein families, metabolic roles, conserved protein domains, and protein structure.
Figure 7. Cross References
Entrez Links |
Entrez Links provide links to the Entrez Gene, Genome, Nucleotide, Protein, and PubMed records for the proteins in the cluster.
Description |
The description contains curated information on the proteins and their function, domain descriptions, COG functional categories, and KEGG BRITE hierarchy showing functional classification.
Figure 9. Description
Publications |
Publication links contain curated publications as well as publications found on individual proteins from RefSeq, GeneRIFs, GenBank, SwissProt, Structures, and Conserved Domains. NOTE: publications can be in more than one category. The total shown at the top (show all) is the complete set of non-redundant publications. The title from the most recent publication in each category is shown, along with direct links for each category. The entire publication set can be expanded to show all publication titles.
Figure 10. Publication Links
Protein Cluster Table |
The Protein Table contains information on each of the proteins in the cluster, organized by taxon. Where there are multiple species or strains from the same genus the table can be collapsed or expanded. Selected sequences can be highlighted in the alignment or distance tree by checking the box next to the genus or species.
Figure 11. Protein Table
The organism, protein name, accession, locus tag and length are provided. Clusters which are upstream or downstream of the cluster protein are shown to provide information on genome neighborhood. Blink is a link to pre-computed blast analysis. Alignment provides a graphic of the alignment including domain structure. Each of these properties provides a link to the appropriate database numbers (Table 2).
Table 2. Protein Table Links.Property | Links to: |
Organism name | Taxonomic information |
Protein Name | Protein Record |
Upstream Cluster | Cluster page in Protein Clusters |
Accession | Protein record |
Downstream Cluster | Cluster page in Protein Clusters |
Locus tag | Entrez Gene page |
BLink | Pre-computed blast analysis |
Alignment | Detailed alignment page |
Domain | Domain page in CDD |
Querying and Searching |
The protein clusters database utilizes all of the features of other Entrez databases. You can limit searches, preview/index your search terms, use the history, clipboard, or details by using the tab buttons underneath the search box. More general instructions on Entrez querying can be found here
Limits |
The limit button on the search bar allows search limits to be set from a drop-down menu.
After selecting a limit the selected field will show up in the yellow bar behind the Field tag.
The limits checkbox will also be marked and will remain through subsequent searches. To remove the limits
for a particular search, deselect the checkbox.
The following table summarizes the various limits and properties that can be used to refine searches.
Field name | Definition [including field abbreviations] | Examples |
---|---|---|
Accession | Unique identifier for each cluster. [ACCN][ACCESSION] |
Retrieve cluster with the accession PRK09525: PRK09525[ACCN] |
Average Length | Average length of proteins in the cluster. [Average Length] |
Retrieve all clusters with an average protein length of 100 - 300 amino acids:
100:300[Average Length] |
COG | COG (Clusters of Orthologous Groups) is a phylogenetic classification of proteins from complete genomes.
[COG] |
Retrieve all clusters with COG3250:
COG3250[COG] |
Creation date | Date the record was created. Note the format is: YEAR/MONTH/DAY including the forward slashes.
[Creation Date] |
Find all clusters created in 2007:
2007[Creation Date] |
Domain Name | Domains are structural or functional units in a protein; nomenclature is based on the NCBI Conserved Domain Database.
[Domain Name] |
Retrieve all clusters with the beta galactosidase small chain, N terminal domain:
Bgal_small_N[Domain Name] |
Domains | Number of domains in the proteins cluster.
[Domains] |
Retrieve all clusters with 15 domains:
15[Domains] |
EC/RN Number | The number assigned by the Enzyme Commission or Chemical Abstract Service (CAS) to designate a particular enzyme or chemical, respectively.
[EC/RN Number |
Retrieve all clusters containing the EC number 3.2.1.23:
3.2.1.23[EC/RN Number] |
Gene synonym | Alternative name for gene found in the database records.
[Gene Synonym] |
Retrieve all clusters with the cbiJ as a gene synonym:
cbij[Gene Synonym] |
HAMAP | Number assigned to designate a well defined and well conserved protein family or subfamily by the Swiss Institute for Bioinformatics.
HAMAP stands for High-quality Automated and Manual Annotation of microbial Proteomes.
[HAMAP] |
Retrieve all clusters with the HAMAP MF_00008:
mf 00008[HAMAP] |
KO | Number assigned to designate a manually curated set of orthologous gene groups in the complete genomes
by the Kyoto Encyclopedia of Genes and Genomes.
[KO] |
Retrieve all clusters with a KO of k01190:
k01190[KO] |
Locus Tag | Locus tags are identifiers that are systematically applied to every gene in a genome.
[Locus Tag] |
Retrieve all clusters with a protein with the locus tag of b0344:
Z0440[Locus Tag] |
Organism | The scientific and common names for the organisms associated with the protein sequence.
[Organism] |
Find all projects associated with Escherichia coli:
Escherichia coli[Organism] |
Paralogs | Number of paralog proteins in a cluster.
[Paralogs] |
Retrieve all clusters with 13 paralogs:
13[Paralogs] |
Properties | An attribute of the cluster based on DNA source or curation status.
[PROP][Properties] |
Retrieve all clusters from chloroplasts:
source chloroplast[Properties] |
Protein Accession | The unique accession number of the protein.
[Protein Accession] |
Retrieve all clusters containing the protein accession NP_414878:
NP_414878[Protein Accession] |
Protein GI number | A series of digits that are assigned consecutively by NCBI to each sequence it processes.
[Protein GI] |
Retrieve all clusters containing the protein GI of 16128329:
16128329[Protein GI] |
Protein Name | The standard name of proteins found in database records. Common names may not be indexed in this
field so it is best to also consider All Fields or Text Words. [PROT][Protein Name] |
Retrieve all clusters containing the protein beta galactosidase:
beta galactosidase[Protein Name] |
PubMed ID | Unique identifier for the publication in the PubMed database.
[PubMed ID] |
Retrieve all clusters with the PubMed ID number 97298:
97298[PubMed ID] |
Sequence Length | Exact number of amino acids in the protein sequence.
[Sequence Length] |
Retrieve all clusters with at least one protein with a length of 1024 amino acids:
1024[Sequence Length] |
Size | Number of proteins in the cluster.
[Size] |
Retrieve all clusters with 25 proteins:
25[Size] |
Taxonomy ID | Identifier for the species or strain in the NCBI taxonomy database.
[Taxonomy ID] |
Retrieve all clusters with proteins from taxonomy ID 83333:
83333[Taxonomy ID] |
Title | Title of the protein cluster.
[Title] |
Retrieve all beta-D-galactosidase protein clusters:
beta-D-galactosidase[Title] |
Total Publications | Total number of publications associated with proteins in the cluster.
[Total Publications] |
Retrieve all clusters with 51 publications:
51[Total Publications] |
Preview/Index |
The Preview/Index page on any Entrez database is used to construct queries and to view terms that have been indexed under any field name. The table in the previous section described the fields used in indexing the records and provided some representative queries using those fields. Information on using Preview/Index can be found in the Entrez help documentation here
The History, Clipboard, and Details features are consistent with other Entrez databases. You may find additional information in the Entrez help documentation.
If you have any additional questions, then please send an email to: info@ncbi.nlm.nih.gov
Revised March 26, 2007