Entrez Protein Clusters Help Document

	NCBI Home Entrez Protein Clusters

Protein Clusters: a collection of proteins grouped by sequence similarity and function

Help Document

Introduction
Protein Cluster Page
Querying and Searching
Preview Index

Introduction

The NCBI Entrez Protein Clusters database is a collection of Reference Sequence (RefSeq) proteins from the complete genomes of prokaryotes, plasmids, and organelles grouped and annotated based on sequence similarity and protein function. Proteins are automatically grouped into clusters based on reciprocal best hit BLAST scores. NOTE: not all proteins in complete genomes are represented in the Protein Clusters database.

Clusters are named and functional description are assigned by manual curation. Alignments, information on genome neighborhood, and links to NCBI and external databases are provided for each protein cluster which are also searchable in Entrez. Specific query and search terms can be found under Querying and Searching.

Currently proteins encoded by prokaryotes and chloroplasts are contained within separate clusters(see below). Each cluster in the database has a unique identifying number (UID). Each cluster also has an accession number consisting of a three letter code followed by five numbers (Table 1). The clusters are divided into curated and non-curated sets. Curated clusters have consistent nomenclature and protein function descriptions. Non-curated clusters are automatically generated and have not yet been manually annotated.

Table 1. Protein Clusters Identifiers.

Protein Cluster Database Cluster ID

Prokaryotes: Curated Protein Clusters PRK#####

Chloroplasts CHL#####

Prokaryotes: Uncurated Protein Clusters CLS#####

Chloroplasts: Uncurated Chloroplast Clusters CLSC#####

Protein Cluster Page

Cluster Info
Cluster Tools

Show Detailed Alignment
Build Tree
Genome ProtMap

Cross references
Entrez Links
Protein Table

The display for each cluster provides information on cluster accession, cluster name, and gene name, as well as links to protein display tools, external databases, and publications. Each of these sections can be expanded or collapsed by clicking on the down or right arrows, respectively.

Figure 1. Sample Protein Cluster Page

Cluster Info

Cluster Info displays information specific to the cluster. The information highlighted in blue at the top of the page contains the core curated identifiers and information. The cluster number and the curation status are at the left, the name of the cluster/protein in the middle, and the gene name or synonym (if annotated) at the right.

Figure 2. Cluster Info - Curated Info

On the left hand side below the blue bar are some specific statistics about the protein cluster.

Total proteins in the cluster
Common taxonomic node or conservation
Total number of genera
Total number of organisms
Putative number of paralogs (two or more proteins encoded by identical nucleotide sequences)
Total number of organisms with paralogs
Total number of publications (see below)

Figure 3. Cluster Info - Statistics

Cluster Tools

Cluster Tools contains several methods of displaying cluster details.

Show Detailed Alignment
Build Tree
Genome ProtMap

Figure 4. Cluster Tools

Show Detailed Alignment

Show Detail Alignment displays a scrollable multiple sequence alignment. Selected sequences in the alignment may be highlighted by checking the box next to the species or genus name on the Protein Table before showing the alignment. Amino acids in the alignment display can be shown with either conserved amino acids or with a consensus display (only amino acids that differ from the consensus sequence are shown). Conserved domains can be displayed by clicking the show/hide all in the Domains box or, for any individual sequence by clicking the + or - next to the protein accession.

Figure 5. Show Detailed Alignment

Build Tree

Build tree produces a distance tree. Tree construct method can be chosen using the drop-down menus. Selected sequences in the tree may be highlighted by checking the box next to the species or genus name on the Protein Table before building a tree. The tree display can be expanded or contracted using the Collapsing level drop-down menu or by clicking on a branch node and using the pop-up menu to expand or collapse around that branch point. The root of the tree can be defined by clicking on a node and selecting Re-root from the pop-up menu.

Figure 6. Build Phylogenetic Tree

Genome ProtMap

Genome ProtMap displays in a separate page the genome context of the proteins either for all the proteins in the cluster (Genome ProtMap by PRK#####) or for all the proteins that have the same COG (Cluster of Orthologous Group) (Genome ProtMap by COG#####). In the Genome ProtMap display, clicking the accession number will link to the RefSeq nucleotide record. Mousing over the proteins gives detailed information such as name, cluster ID, and genome location. Clicking on any protein brings up a pop-up menu with links to protein, gene or cluster. The list of taxa in the ProtMap can be collapsed or expanded by clicking the + or - next to the taxon.

Figure 7. Genome ProtMap by Cluster

Cross references

Cross references are calculated at the level of protein, then collected from all proteins in a given cluster, and finally displayed in the cross reference section which provides links to similar clusters, to information on protein families, metabolic roles, conserved protein domains, and protein structure.

Related Clusters
Cluster of Orthologous Groups COG
Enzyme Commission Number
High-quality Automated and Manual Annotation of Microbial Proteomes HAMAP
Kyoto Encyclopedia of Genes and Genomes Orthology Groups KEGG
European Bioinformatics Institute's InterPro Database
The Institute for Genomic Research's Protein Families TIGRfam
Conserved Domain Database CDD
Protein Structure

Figure 7. Cross References

Entrez Links

Entrez Links provide links to the Entrez Gene, Genome, Nucleotide, Protein, and PubMed records for the proteins in the cluster.

Description

The description contains curated information on the proteins and their function, domain descriptions, COG functional categories, and KEGG BRITE hierarchy showing functional classification.

Figure 9. Description

Publications

Publication links contain curated publications as well as publications found on individual proteins from RefSeq, GeneRIFs, GenBank, SwissProt, Structures, and Conserved Domains. NOTE: publications can be in more than one category. The total shown at the top (show all) is the complete set of non-redundant publications. The title from the most recent publication in each category is shown, along with direct links for each category. The entire publication set can be expanded to show all publication titles.

Figure 10. Publication Links

Protein Cluster Table

The Protein Table contains information on each of the proteins in the cluster, organized by taxon. Where there are multiple species or strains from the same genus the table can be collapsed or expanded. Selected sequences can be highlighted in the alignment or distance tree by checking the box next to the genus or species.

Figure 11. Protein Table

The organism, protein name, accession, locus tag and length are provided. Clusters which are upstream or downstream of the cluster protein are shown to provide information on genome neighborhood. Blink is a link to pre-computed blast analysis. Alignment provides a graphic of the alignment including domain structure. Each of these properties provides a link to the appropriate database numbers (Table 2).

Table 2. Protein Table Links.

Property	Links to:
Organism name	Taxonomic information
Protein Name	Protein Record
Upstream Cluster	Cluster page in Protein Clusters
Accession	Protein record
Downstream Cluster	Cluster page in Protein Clusters
Locus tag	Entrez Gene page
BLink	Pre-computed blast analysis
Alignment	Detailed alignment page
Domain	Domain page in CDD

Querying and Searching

The protein clusters database utilizes all of the features of other Entrez databases. You can limit searches, preview/index your search terms, use the history, clipboard, or details by using the tab buttons underneath the search box. More general instructions on Entrez querying can be found here

Limits
Preview/Index
History
Clipboard
Details

Limits

The limit button on the search bar allows search limits to be set from a drop-down menu.
After selecting a limit the selected field will show up in the yellow bar behind the Field tag. The limits checkbox will also be marked and will remain through subsequent searches. To remove the limits for a particular search, deselect the checkbox.

The following table summarizes the various limits and properties that can be used to refine searches.

Field name	Definition [including field abbreviations]	Examples
Table. Limits and Properties in Entrez Genome Project
Accession	Unique identifier for each cluster. [ACCN][ACCESSION]	Retrieve cluster with the accession PRK09525: PRK09525[ACCN]
Average Length	Average length of proteins in the cluster. [Average Length]	Retrieve all clusters with an average protein length of 100 - 300 amino acids: 100:300[Average Length]
COG	COG (Clusters of Orthologous Groups) is a phylogenetic classification of proteins from complete genomes. [COG]	Retrieve all clusters with COG3250: COG3250[COG]
Creation date	Date the record was created. Note the format is: YEAR/MONTH/DAY including the forward slashes. [Creation Date]	Find all clusters created in 2007: 2007[Creation Date]
Domain Name	Domains are structural or functional units in a protein; nomenclature is based on the NCBI Conserved Domain Database. [Domain Name]	Retrieve all clusters with the beta galactosidase small chain, N terminal domain: Bgal_small_N[Domain Name]
Domains	Number of domains in the proteins cluster. [Domains]	Retrieve all clusters with 15 domains: 15[Domains]
EC/RN Number	The number assigned by the Enzyme Commission or Chemical Abstract Service (CAS) to designate a particular enzyme or chemical, respectively. [EC/RN Number	Retrieve all clusters containing the EC number 3.2.1.23: 3.2.1.23[EC/RN Number]
Gene synonym	Alternative name for gene found in the database records. [Gene Synonym]	Retrieve all clusters with the cbiJ as a gene synonym: cbij[Gene Synonym]
HAMAP	Number assigned to designate a well defined and well conserved protein family or subfamily by the Swiss Institute for Bioinformatics. HAMAP stands for High-quality Automated and Manual Annotation of microbial Proteomes. [HAMAP]	Retrieve all clusters with the HAMAP MF_00008: mf 00008[HAMAP]
KO	Number assigned to designate a manually curated set of orthologous gene groups in the complete genomes by the Kyoto Encyclopedia of Genes and Genomes. [KO]	Retrieve all clusters with a KO of k01190: k01190[KO]
Locus Tag	Locus tags are identifiers that are systematically applied to every gene in a genome. [Locus Tag]	Retrieve all clusters with a protein with the locus tag of b0344: Z0440[Locus Tag]
Organism	The scientific and common names for the organisms associated with the protein sequence. [Organism]	Find all projects associated with Escherichia coli: Escherichia coli[Organism]
Paralogs	Number of paralog proteins in a cluster. [Paralogs]	Retrieve all clusters with 13 paralogs: 13[Paralogs]
Properties	An attribute of the cluster based on DNA source or curation status. [PROP][Properties]	Retrieve all clusters from chloroplasts: source chloroplast[Properties]
Protein Accession	The unique accession number of the protein. [Protein Accession]	Retrieve all clusters containing the protein accession NP_414878: NP_414878[Protein Accession]
Protein GI number	A series of digits that are assigned consecutively by NCBI to each sequence it processes. [Protein GI]	Retrieve all clusters containing the protein GI of 16128329: 16128329[Protein GI]
Protein Name	The standard name of proteins found in database records. Common names may not be indexed in this field so it is best to also consider All Fields or Text Words. [PROT][Protein Name]	Retrieve all clusters containing the protein beta galactosidase: beta galactosidase[Protein Name]
PubMed ID	Unique identifier for the publication in the PubMed database. [PubMed ID]	Retrieve all clusters with the PubMed ID number 97298: 97298[PubMed ID]
Sequence Length	Exact number of amino acids in the protein sequence. [Sequence Length]	Retrieve all clusters with at least one protein with a length of 1024 amino acids: 1024[Sequence Length]
Size	Number of proteins in the cluster. [Size]	Retrieve all clusters with 25 proteins: 25[Size]
Taxonomy ID	Identifier for the species or strain in the NCBI taxonomy database. [Taxonomy ID]	Retrieve all clusters with proteins from taxonomy ID 83333: 83333[Taxonomy ID]
Title	Title of the protein cluster. [Title]	Retrieve all beta-D-galactosidase protein clusters: beta-D-galactosidase[Title]
Total Publications	Total number of publications associated with proteins in the cluster. [Total Publications]	Retrieve all clusters with 51 publications: 51[Total Publications]

Preview/Index

The Preview/Index page on any Entrez database is used to construct queries and to view terms that have been indexed under any field name. The table in the previous section described the fields used in indexing the records and provided some representative queries using those fields. Information on using Preview/Index can be found in the Entrez help documentation here

The History, Clipboard, and Details features are consistent with other Entrez databases. You may find additional information in the Entrez help documentation.

If you have any additional questions, then please send an email to: info@ncbi.nlm.nih.gov

Revised March 26, 2007

Protein Cluster Database	Cluster ID
Prokaryotes: Curated Protein Clusters	PRK#####
Chloroplasts	CHL#####
Prokaryotes: Uncurated Protein Clusters	CLS#####
Chloroplasts: Uncurated Chloroplast Clusters	CLSC#####