This document provides tips and examples for searching the three PubChem databases by text term or keyword, as well as tips for searching PubChem Compound by chemical properties. The Structure Search Help document provides tips on using chemical information for basic and advanced searches in the PubChem Structure Search tool. In addition, the PubChem Upload Help document provides procedures and instructions for depositing your structure and assay data into the PubChem system using the PubChem Upload tool. The PubChem Download Facility Help document describes how to use the PubChem Download Facility.
- PubChem Overview
- PubChem FAQ
- PubChem Substance Database
- PubChem Compound Database
- PubChem BioAssay Database
- PubChem Summary and Analysis
- Substance/Compound summary page (separate document)
- Structure Clustering
- BioActivity Services
- PubChem Cross Links
- PubChem Indexes and Filters in Entrez
- Tools & Data Analysis
- BioActivity Services
- Chemical Structure Search
- Classification Browser
- Direct Link Services
- Identifier Exchange Service
- ImageFly (2D) Service
- Power User Gateway (PUG)
- PubChem 3D
- Score Matrix Service
- Standardization Service
- Text Search Tool
- Widgets
- Downloading Data
- Uploading Data
- PubChem Upload (successor to the Deposition Gateway, which has been deprecated)
- Data Transfer Agreement (PDF)(HTML)
- PubChem Data Sources
- PubChem Data Specifications
- Small Molecule and Assay (ASN.1)(XML Schema)
- Fingerprint Description (PDF)(Text)
- SD Field Descriptions (PDF)(Text)
- Reference Materials
PubChem Overview
PubChem provides information on the biological activities of small molecules.
PubChem includes substance information, compound structures, and BioActivity data in three primary databases: Pcsubstance, Pccompound, and PCBioAssay, respectively. The Substance/Compound database, where possible, provides links to BioAssay descriptions, literature, references, and assay data points. The BioAssay database also includes links back to the Substance/Compound database. PubChem is integrated with Entrez, NCBI's primary search engine, and also provides compound neighboring, sub/superstructure, similarity structure, BioActivity data, and other searching features.
- Pcsubstance contains more than 180 million records. You can check the count of substance records as of today.
- Pccompound contains more than 63 million unique structures. You can check the count of compound records as of today.
- PCBioAssay contains more than 1 million BioAssays. Each BioAssay contains a varying number of data points. You can check the count of BioAssay records as of today.
PubChem contains substance and BioAssay information from a multitude of depositors. You can check the PubChem data source status as of today.
PubChem Substance Database
The PubChem Substance database contains chemical structures, synonyms, registration IDs, descriptions, related URLs, database cross-reference links to PubMed, protein 3D structures, and biological screening results. If the contents of a chemical sample are known, the description includes links to PubChem Compound.
Query Examples:
- Molecule synonym search
Which substances have "methotrexate" as a part of their molecule name?
Simply enter methotrexate in the Search textbox on the PubChem homepage or Entrez search page and press the Go button. You will get all substances with the synonym methotrexate and/or with any other keyword methotrexate.
Or enter methotrexate[synonym] in the Search textbox and press the Go button. Note: the term in the brackets "[]", such as "[synonym]", is an index field name or alias. For more information about index searches, please see PubChem Indexes and Index Search.
Which substances have "3'-Azido-3'-deoxythymidine" as their molecule name?
Enter "3'-Azido-3'-deoxythymidine" (including the quotes) in the Search textbox and press the Go button.
- External ID search
Which substances have "NSC78" for DTP/NCI's external ID?
Simply enter "NSC78" in the Search textbox and press the Go button.
Or enter "78[objectid],dtp[sourcename]" in the Search textbox and press the Go button.
Which substances have "aids000006" for NIAID's anti-HIV chemical database external ID?
Enter "aids000006" in the Search textbox and press the Go button.
Or enter "000006[objectid],niaid[sourcename]" in the Search textbox and press the Go button.
- Biology links search
Which substances have biological activity links?
1. Go to the Limits page
2. In the Specify Required Links section, click the checkbox next to BioAssay and press the Go button.
- Combined searches
Which substances contain the element Platinum and have biological activity links?
1. Go to the Limits page.
2. In the Specify Required Links section, click the checkbox next to BioAssay.
3. In the Specify Required Elements section, click the checkbox to the left of Pt.
4. Press the Go button in the Search toolbar.
Query Results:
- Refine your results
The "Refine your results" panel appears on the right of the page. It shows the interesting subsets of the searched results. The name and count of the subset (e.g., "Protein 3D Structure") link to the substance subset. The subsets are grouped into three categories whenever available: "BioActivity Experiments", "BioMedical Annotation", and "Depositor Category". These categories can be hidden or expanded using the "-" or "+" sign.
The "BioActivity Experiments" category includes "BioAssays, Probes", "BioAssays, Active", "BioAssays, Tested", and "Protein 3D Structures". "BioAssays, Probes", "BioAssays, Active", and "BioAssays, Tested" mean screening experiments identifying substances as probe reagents, active, and tested, respectively. "Protein 3D Structures" means structure biology experiments showing substance binding.
The "BioMedical Annotation" category includes "Pharmacological Actions" and "BioSystems". "Pharmacological Actions" means therapeutic use and National Library of Medicine (NLM) substance classifications. "BioSystems" means biological pathways and other BioSystems containing substances.
The "Depositor Category" includes "Biological Properties", "Chemical Vendors", "Journal Publishers", and "NIH Molecular Libraries".
Top three examples: the subsets "Protein 3D Structures" and "Pharmacological Actions" also show the total count in the linked database and the top three sorted examples. The names and counts of the examples link to substance subsets.
BioActivity Analysis: BioActivity Analysis links for reported "Chemical Probes", "Active" compounds, and "Tested" compounds are shown as icons.
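As a rough illustration of the index-field syntax used in the Query Examples of this section, the following Python sketch assembles the same search terms and wraps them in a request URL. It assumes the NCBI E-utilities ESearch endpoint and the database identifier "pcsubstance", neither of which is named in this document; the helper names field_term, quoted, and esearch_url are hypothetical. The sketch only builds the URL and does not send a request.

```python
from urllib.parse import urlencode

# Assumption: the NCBI E-utilities ESearch endpoint (not named in this document).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def field_term(value: str, field: str) -> str:
    """Return an index-restricted term such as 'methotrexate[synonym]'."""
    return f"{value}[{field}]"


def quoted(phrase: str) -> str:
    """Wrap a phrase in double quotes, as the quoted examples in the text do."""
    return f'"{phrase}"'


def esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """Build (but do not send) an ESearch URL for one Entrez database."""
    params = {"db": db, "term": term, "retmax": retmax}
    return f"{ESEARCH_URL}?{urlencode(params)}"


if __name__ == "__main__":
    # "pcsubstance" is an assumed database identifier for PubChem Substance.
    print(esearch_url("pcsubstance", field_term("methotrexate", "synonym")))
    print(esearch_url("pcsubstance", quoted("3'-Azido-3'-deoxythymidine")))
    print(esearch_url("pcsubstance", quoted("NSC78")))
```

Keeping term construction separate from URL construction means the same terms can be pasted directly into the Search textbox instead.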
PubChem Compound Database
The PubChem Compound Database contains validated chemical depiction information that is provided to describe substances in PubChem Substance.
Structures stored within PubChem Compound are pre-clustered and cross-referenced by identity and similarity groups. Additionally, calculated properties and descriptors are available for searching and filtering of chemical structures.
Users can perform a term/keyword search in the same manner as for the Substance database (see above). In addition, the PubChem Compound database provides a chemical property search.
Examples:
- Molecular weight search
Which compounds have molecular weight between 100 and 200?
Enter 100:200[mw] or 100:200[molecularweight] in the Search textbox and press the Go button.
Note: The term in the brackets "[]", such as "[mw]", is an index field name or alias. For more information about index searches, please see PubChem Indexes and Index Search.
Or simply enter 164.2[mw] in the Search textbox and press the Go button to retrieve all compounds with 164.2 as the molecular weight.
- XLogP search
Which compounds have XLogP between 2.3 and 2.4?
Enter 2.3:2.4[xlogp] in the Search textbox and press the Go button.
- Heavy atom count search (Heavy atom means all atoms except hydrogen.)
Which compounds contain 8 heavy atoms?
Enter 8[heavyatomcount] in the Search textbox and press the Go button. Users can also carry out this search for the Substance database.
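The property queries in the examples above all follow the pattern value[field] or low:high[field]. A minimal Python sketch of that pattern is shown here; the function names range_term and exact_term are hypothetical, and the field aliases are the ones used in the examples (mw, xlogp, heavyatomcount).

```python
def exact_term(value, field: str) -> str:
    """Return a single-value property term, e.g. '8[heavyatomcount]'."""
    return f"{value}[{field}]"


def range_term(low, high, field: str) -> str:
    """Return a range property term, e.g. '100:200[mw]'."""
    if low > high:
        raise ValueError("low must not be greater than high")
    return f"{low}:{high}[{field}]"


if __name__ == "__main__":
    print(range_term(100, 200, "mw"))
    print(range_term(2.3, 2.4, "xlogp"))
    print(exact_term(8, "heavyatomcount"))
```

The resulting strings can be typed into the Search textbox as they are.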
The PubChem Compound Limits page provides a quick way to perform complex searches. All of the search examples shown above can also be done on the Limits page. Go to the Limits page to begin any of the examples below.
Examples:
- Chemical property range searches
Which substances do not violate the "Lipinski Rule of 5"?
1. In the Chemical Property Search section:
a. For the Molecular Weight (MW) range, type 0 and 500 in the from and to text boxes, respectively.
b. For the Hydrogen Bond Donor Count (HBD) range, type 0 and 5 in the from and to text boxes, respectively.
c. For the Hydrogen Bond Acceptor Count (HBA) range, type 0 and 10 in the from and to text boxes, respectively.
d. For the XLogP range, type -5 and 5 in the from and to text boxes, respectively.
2. Push the Go button in the top Search bar
- Simple elemental searches of PubChem Compounds
Which substances contain Gallium?
1. In the Specify Required Elements section, select the checkbox to the left of the Ga atomic symbol
2. Push the Go button in the top Search bar.
Which substances contain Carbon, Nitrogen, Oxygen, and Fluorine?
1. In the Specify Required Elements section select the checkboxes to the left of the C, N, O, and F atomic symbols
2. Push the Go button in the top Search bar.
Query Results:
- Refine your results
The "Refine your results" panel appears on the right of the page. It shows the interesting subsets of the searched results. The name and count of the subset (e.g., "Protein 3D Structure") link to the compound subset. The subsets are grouped into three categories whenever available: "BioActivity Experiments", "BioMedical Annotation", and "Depositor Category". These categories can be hidden or expanded using the "-" or "+" sign.
The "BioActivity Experiments" category includes "BioAssays, Probes", "BioAssays, Active", "BioAssays, Tested", and "Protein 3D Structures". "BioAssays, Probes", "BioAssays, Active", and "BioAssays, Tested" mean screening experiments identifying compounds as probe reagents, active, and tested, respectively. "Protein 3D Structures" means structure biology experiments showing compound binding.
The "BioMedical Annotation" category includes "Pharmacological Actions" and "BioSystems". "Pharmacological Actions" means therapeutic use and National Library of Medicine (NLM) compound classifications. "BioSystems" means biological pathways and other BioSystems containing compounds.
The "Depositor Category" includes "Biological Properties", "Chemical Vendors", "Journal Publishers", and "NIH Molecular Libraries".
Top three examples: the subsets "Protein 3D Structures" and "Pharmacological Actions" also show the total count in the linked database and the top three sorted examples. The names and counts of the examples link to compound subsets.
BioActivity Analysis: BioActivity Analysis links for reported "Chemical Probes", "Active" compounds, and "Tested" compounds are shown as icons.
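The four property ranges entered on the Limits page for the "Lipinski Rule of 5" example earlier in this section can also be checked locally. The Python sketch below does so under two assumptions: the bounds are treated as inclusive, and the property values are already available in a dictionary. The key names and the function within_lipinski_ranges are hypothetical and are not PubChem field names.

```python
# Ranges copied from the Limits-page example; inclusive bounds are an assumption.
LIPINSKI_RANGES = {
    "molecular_weight": (0, 500),
    "hbond_donors": (0, 5),
    "hbond_acceptors": (0, 10),
    "xlogp": (-5, 5),
}


def within_lipinski_ranges(props: dict) -> bool:
    """Return True when every property lies inside its range."""
    return all(
        low <= props[name] <= high
        for name, (low, high) in LIPINSKI_RANGES.items()
    )


if __name__ == "__main__":
    # Illustrative values only, not taken from a PubChem record.
    small = {"molecular_weight": 180.2, "hbond_donors": 1,
             "hbond_acceptors": 4, "xlogp": 1.2}
    large = {"molecular_weight": 812.0, "hbond_donors": 7,
             "hbond_acceptors": 14, "xlogp": 3.0}
    print(within_lipinski_ranges(small))
    print(within_lipinski_ranges(large))
```

A record passes only if all four ranges are satisfied, which mirrors filling in all four range boxes before pressing Go.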
PubChem BioAssay Database
The PubChem BioAssay Database contains BioActivity screens of chemical substances described in PubChem Substance. It provides searchable descriptions of each BioAssay, including descriptions of the conditions and readouts specific to a screening protocol.
Query Help:
- Searching for PubChem BioAssay datasets
Select PubChem BioAssay from the pull-down menu. In the Search textbox, enter terms you might expect to find in the description of an assay of interest. The search will consider terms in both the overall description of the assay and in the description of its individual parameters and readouts.
Examples:
1. Searching for yeast cell cycle control finds BioAssay result sets from the NCI Yeast Anticancer Drug Screen.
2. Searching for HIV growth inhibition finds the NCI AIDS Antiviral Assay
- Browsing and downloading PubChem BioAssay results
The PubChem BioAssay browser helps you to examine descriptions of each assay's parameters and readouts. You may use it to select those parameters and readouts most relevant to the biological activity of interest. An example on how to work with assay data is below.
Example:
1. From the Entrez search page Search bar
a. Select PubChem BioAssay from the pull-down menu.
b. Type "NCI AIDS Antiviral Assay" (including the quotes) in the textbox.
You will see a description of the "NCI AIDS Antiviral Assay" within Entrez.
2. Click the hypertext link for "AID: 179".
You will be brought to the "BioAssay Summary" page, where you will see the detailed description of the assay. You can find more help content about the BioAssay summary and result browser.
- Combined searches
The PubChem BioAssay Limits page provides a very useful way to perform complex searches.
BioAssay Type:
BioAssays can be grouped by the following types: Substance Type, Screening Stage, and Target Type.
Substance Type: The assay records could contain data for chemicals or RNAi.
Screening Stage: As described below, there are four Activity Outcome Methods: Summary, Confirmatory, Primary Screening, and Other. Some of the confirmatory assays also contain Dose-Response data.
Target Type: An assay may contain no protein/gene/nucleotide target, one single target, multiple targets, or multiple targets in the members of a panel assay.
Query Results:
- Refine your results
The "Refine your results" panel appears on the right of the page. It shows the interesting subsets of the searched results. The name and count of the subset (e.g., "Protein") link to the BioAssay subset. The subsets are grouped into four categories whenever available: "Target", "Experimental Method", "Related BioAssays", and "Active Chemicals". These categories can be hidden or expanded using the "-" or "+" sign.
The "Target" category includes "Proteins" and "BioSystems". "Proteins" means protein sequences specified as targets of BioAssays. "BioSystems" means biological pathways and other BioSystems containing the target sequences of BioAssays.
The "Experimental Method" category includes "Active Concentration (IC50, etc)", "NIH Molecular Libraries", "Chemical Screens", and "RNAi Screens". "Active Concentration (IC50, etc)" means BioAssays specifying the active concentrations of tested reagents. "NIH Molecular Libraries" means BioAssays supported by the NIH Molecular Libraries Program. "Chemical Screens" and "RNAi Screens" mean BioAssays reporting BioActivity of chemical reagents and RNAi reagents, respectively.
The "Related BioAssays" includes related BioAssays "by Target Similarity", "by Activity Overlap", "by Depositors", and "by Common BioSystems". Related BioAssays "by Target Similarity" means BioAssays with sequence-similar targets. Related BioAssays "by Activity Overlap" means BioAssays reporting activity of the same reagents. Related BioAssays "by Depositors" means Related BioAssays as specified by PubChem depositors. Related BioAssays "by Common BioSystems" means BioAssays with target sequences in the same biological pathways.
The "Active Chemicals" category includes "Chemical Probes" and "Active Compounds". They mean compounds reported as validated chemical probes and biologically active, respectively.
Top three examples: the subsets "Proteins" and "BioSystems" also show the total count in the linked database and the top three sorted examples. The names and counts of the examples link to BioAssay subsets.
BioActivity Analysis: A BioActivity Analysis link for reported "Chemical Probes" is shown as an icon.
PubChem Summary and Analysis
The PubChem results are displayed in three category pages: substance, compound, and BioAssay pages. They provide rich cross links to each PubChem database, other NCBI databases, and depositor's databases. PubChem's default results page is part of the Entrez summary list display system.
Substance Summary:
From the Entrez PubChem Substance database, users can get a substance summary with thumbnails, the corresponding compound ID, depositor source information, etc. You can see an example of a substance result in Entrez.
On this page, users can choose to display brief, summary, ID map, substance neighboring information, synonyms, and other database information from the dropdown list. On the right of the page, users can select a few pop-up windows (when available) to get structure, BioAssay, and literature links related to this substance. Users can choose either to "display" the search results or to "send" them to "text" or to a "file".
Users can find the more detailed substance information and cross links by clicking the structure image or the ID link. Here is an example of the PubChem Substance Summary page:
This page displays the original information provided by the depositor, such as substance information, the deposited structure drawing, older version selection, comments, etc. Users can also find derived information and links, if available.
Click the associated "Chemical Structure" tab to display the standardized compound information including property data, other depositor provided synonyms, descriptors, cross links, PubChem standardized structure drawing, etc.
Power users can even download different data formats, such as ASN.1, XML, and SDF.
Compound Summary:
All compounds have been extracted from deposited substances. Natural-product substances and substances that do not have structures have no associated compound records. A substance that is a mixture is associated with a compound record in mixture format and with one or more component compounds.
From the Entrez PubChem Compound database, users see a compound summary with thumbnails, a few compound properties, etc. Here is an example of a compound result in Entrez.
The page is in the same style as the substance page. Clicking a thumbnail or the CID hyperlink leads users to the Compound Summary page. Users can find this compound's property data, description, related substance information, neighboring structures, and cross links.
All compounds are structurally unique when compared with each other. One compound may link to many substances.
Substance/Compound Summary Content:
Title shows chemical name and PubChem accession identifier. The toolbar contains icons that allow users to launch: a bioactivity summary, when bioactivity is available; a chemical structure search, to search by identity, similarity, super/sub-structure, or molecular formula; a 3D conformer launch tool, when a conformer is available; or data download in various formats, including the native PubChem archive format ASN.1, XML, or the industry-standard SDF format.
BioMedical Annotation:
Content in this section is provided by the NLM MeSH resource. MeSH is the U.S. National Library of Medicine's controlled vocabulary used for indexing articles for MEDLINE/PubMed. MeSH terminology provides a consistent way to retrieve information that may use different terminology for the same concepts.
On the substance page (deposited record), the BioMedical annotation is derived from the MeSH resource by matching deposited synonyms. On the compound page (chemical structure), the information is derived from combined synonyms with a name-weighting algorithm.
This section also contains medication information (from NLM DailyMed), pharmacological action, drug and chemical classification, safety and toxicology, and PubMed linking information, when available.
Safety and Toxicology:
Content in this section is provided by NLM TOXNET.
BioAssay Results:
Content in this section is provided by the PubChem BioAssay database. A summary of available results is provided. A launch point for bioactivity summary analysis is provided for the current compound or the current compound including similar compounds. To view all contributed BioAssays, click the "more..." link.
Synonyms:
Content in this section includes synonyms provided by depositors. "Unfiltered" synonyms are all the synonyms provided by depositors. "Filtered" synonyms are synonyms that have intra-depositor and inter-depositor consistency. The substances assigned to a synonym need to be consistent at one of the following levels (high to low): exact same structure, same stereo form, same connectivity, same parent structure, same parent stereo form, or same parent connectivity. Filtered synonyms are sorted by consistency level, frequency, and readability score, while unfiltered synonyms are sorted by frequency and readability score only. The frequency is the number of times a synonym is provided by depositors for a particular compound structure. The most commonly used synonyms are shown first. For substances, the frequency of synonyms is always 1. The readability score is determined by the size of the synonym, the count of non-alphabetic characters, capitalization, etc. A MeSH tree icon indicates synonyms that are known to MeSH. Sorting and display controls are available. By default, only the first ten synonyms are shown.
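The two sort orders described above can be sketched in Python as follows. This is an illustration only and not PubChem's implementation: the Synonym record and both function names are hypothetical, the readability score is taken as a precomputed number because its exact formula is not given here, and placing higher readability first is an assumption.

```python
from dataclasses import dataclass
from typing import List

# Consistency levels in the order given in the text (high to low).
CONSISTENCY_LEVELS = [
    "exact same structure",
    "same stereo form",
    "same connectivity",
    "same parent structure",
    "same parent stereo form",
    "same parent connectivity",
]


@dataclass
class Synonym:
    name: str
    consistency: str    # one of CONSISTENCY_LEVELS
    frequency: int      # times depositors supplied this synonym
    readability: float  # hypothetical precomputed score


def sort_filtered(synonyms: List[Synonym]) -> List[Synonym]:
    """Sort by consistency level, then frequency, then readability."""
    return sorted(
        synonyms,
        key=lambda s: (CONSISTENCY_LEVELS.index(s.consistency),
                       -s.frequency, -s.readability),
    )


def sort_unfiltered(synonyms: List[Synonym]) -> List[Synonym]:
    """Sort by frequency, then readability (no consistency level)."""
    return sorted(synonyms, key=lambda s: (-s.frequency, -s.readability))


if __name__ == "__main__":
    items = [
        Synonym("alpha", "same connectivity", 9, 0.5),
        Synonym("beta", "exact same structure", 2, 0.9),
        Synonym("gamma", "exact same structure", 5, 0.1),
    ]
    print([s.name for s in sort_filtered(items)])
    print([s.name for s in sort_unfiltered(items)])
```

With this ordering, a frequently supplied synonym can still rank below a rarer one in the filtered list if its consistency level is lower.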
Properties:
Content in this section includes computed properties of the compound record. The properties are listed below and include various counts.
- 2D compound properties:
Molecular Weight
Molecular Formula
XLogP
H-Bond Donor
H-Bond Acceptor
Rotatable Bond Count
Tautomer Count
Exact Mass
MonoIsotopic Mass
Topological Polar Surface Area
Heavy Atom Count
Formal Charge
Complexity
Isotope Atom Count
Defined Atom StereoCenter Count
Undefined Atom StereoCenter Count
Defined Bond StereoCenter Count
Undefined Bond StereoCenter Count
Covalently-Bonded Unit Count
- 3D conformer properties:
Feature 3D Acceptor Count
Feature 3D Hydrophobe Count
Feature 3D Ring Count
Effective Rotor Count
Conformer Sampling RMSD
CID Conformer Count - Conformer count for a given compound.
Descriptors:
Content in this section includes computed descriptors of the compound record. A list of descriptors is below.
IUPAC Name
Canonical SMILES
InChI
InChIKey
Compound Info:
Content in this section is provided by the PubChem Compound database. The PubChem Compound accession identifier (CID) is provided with the date the CID was created and, if a mixture, the parent compound CID (when applicable) and a link to the unique components comprising the compound record. Links to related compounds (when applicable), with varying degrees of identity (e.g., being different by isotopic or stereochemical means), and 2D chemical similarity are provided.
Substance Info:
Content in this section is provided by the PubChem Substance database. When viewing a substance record, this section contains the PubChem Substance accession identifier (SID) along with the dates the SID was first created and last updated by the depositor. When the substance can be linked to a unique compound record, the PubChem Compound accession identifier (CID) is provided along with the date the CID was created, the parent compound CID (when available), and a link to the unique components comprising the compound record.
When viewing a compound record, this section contains links to all related PubChem Substance records, which either are the same compound or contain the compound as part of a mixture. Substance Categorizations are also provided to help you identify useful resources provided by PubChem depositors.
Structure & Quick Link Bar:
Content in this section includes the 2D structure depiction, the 3D conformer image (if available, toggled with the 2D depiction), the Pc3D application download link, frequently used compound property data, and links. Note: this part of the content can be collapsed by clicking the vertical bar at the top and expanded again by clicking the same bar. Double-clicking the long, thin, light grey vertical area on the left side of the quick bar (the mouse cursor changes to "+" on mouseover) performs the same function.
Structure Clustering for Compounds/Substances:
The compounds/substances are clustered based on structure similarity using the Single Linkage algorithm. The structure similarity is either the Tanimoto score calculated from the 2D structure fingerprint or the 3D shape/feature similarity. The 3D coordinates are theoretically calculated. For 2D structure analogs, a Tanimoto score of 0.68 or greater is statistically significant at the 95% confidence interval. For 3D structures, the similarity scores that are statistically significant at the 95% confidence interval are as follows:
- 3D Shape + Feature: 0.88 using 1 conformer, 1.03 using 10 conformers
- 3D Shape ST-Optimized: 0.74 using 1 conformer, 0.85 using 10 conformers
- 3D Feature CT-Optimized: 0.30 using 1 conformer, 0.39 using 10 conformers
Both the simple view with the compound/substance IDs and the view with the structures are provided. The limit is 4000 compounds for 2D structure analogs and 1000 for 3D structures. If more compounds are input, a warning message is shown.
Each compound may have up to 10 calculated conformers. If the compound has no 3D conformers but its parent compound has, the parent compound is used in calculating the 3D conformer similarity. When the compounds are clustered, you can Choose Conformer Pairs by "Most Similar" or "All". "Most Similar" means the clustering is on compounds, and the most similar conformer pair is used to represent a pair of compounds when a set of conformers for compound A and another set of conformers for compound B are compared. "All" means the clustering is on conformers.
During the 3D similarity calculation, 3D Superposition is Optimized by either "Shape" or "Feature". If the calculation is shape-optimized, the 3D similarity can be represented by the sum of "Shape" and "Feature" similarity scores, or just "Shape" similarity score. If the calculation is feature-optimized, the 3D similarity can be represented by the sum of "Feature" and "Shape" similarity scores, or just "Feature" similarity score.
A certain Number of Conformers per Compound (nconf) is chosen to finish the calculation in 1-2 minutes. This number "nconf" is also shown in the clustering image. You can increase this number up to 10 to use more conformers in the calculations.
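For the 2D case described in this section, the combination of a Tanimoto score and single-linkage grouping can be sketched in Python as below. This is a simplified illustration and not the PubChem implementation: fingerprints are modelled as sets of "on" bit positions, the function names are hypothetical, and instead of drawing a full dendrogram the sketch cuts the single-linkage tree at one threshold. The default of 0.68 reuses the 2D significance value quoted above purely as an example cut-off.

```python
from itertools import combinations
from typing import Dict, List, Set


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto score of two fingerprints given as sets of set-bit positions."""
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(union)


def single_linkage_clusters(fingerprints: Dict[int, Set[int]],
                            threshold: float = 0.68) -> List[List[int]]:
    """Cut a single-linkage clustering at the given similarity threshold.

    Two IDs end up in the same cluster when a chain of pairs, each with
    similarity >= threshold, connects them.
    """
    parent = {cid: cid for cid in fingerprints}

    def find(x: int) -> int:
        # Follow parent links to the cluster representative.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for x, y in combinations(fingerprints, 2):
        if tanimoto(fingerprints[x], fingerprints[y]) >= threshold:
            parent[find(x)] = find(y)

    clusters: Dict[int, List[int]] = {}
    for cid in fingerprints:
        clusters.setdefault(find(cid), []).append(cid)
    return sorted(sorted(members) for members in clusters.values())


if __name__ == "__main__":
    # Toy fingerprints keyed by made-up IDs.
    fps = {1: {1, 2, 3, 4}, 2: {1, 2, 3, 4, 5},
           3: {1, 2, 3, 4, 5, 6}, 4: {10, 11}}
    print(single_linkage_clusters(fps))
```

In the toy data, IDs 1 and 3 score below the threshold against each other but are still grouped because each is similar enough to ID 2, which is the chaining behaviour characteristic of single linkage.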
Collapse Compound Cluster: The Compound Cluster Tree can be collapsed if you click on the ruler as shown below. The subtrees beyond the collapsed Tanimoto score will be collapsed into a node, which can be expanded.
Select from Cluster: As shown in the following image, if you click on a blue circle in the Cluster Tree, a menu will pop up. The options for 2D clustering are "Compounds in Entrez", "Compounds in BioActivity Analysis", "Structure Similarity Scores", "Expand Subtree", and two Revise Selections: "Display Subtree Only" and "Remove Subtree & Display the Rest". There are two more options for 3D clustering: "Compounds in 3D Viewer" and "Conformers Used".
Common Substructures: As shown in the following image, if you mouse over a node (blue circle) or the line on its left, the common substructures for the compounds in the subcluster will pop up. Currently only the 2D common structures are shown. If the similarity of the node is >= 0.9, the common fingerprint bits greater than 574 are shown. Otherwise, the fingerprint bits greater than 713 are shown.
Similarity Data: You can export Structure Similarity Scores used to generate the dendrogram.
Conformers Used: This button appears only when you choose "Most Similar" conformer pairs to calculate the 3D Shape/Feature similarity for each pair of compounds. You can export the selected conformer pair for each compound pair.
Image: You can export the display in one full PNG image since the display may consist of many small images.
Clusters in GML: You can export the clusters as a Graph Modelling Language (GML) file, which can be viewed in other software such as Cytoscape. The GML file format can be easily converted to other formats such as the eXtensible Graph Markup and Modeling Language (XGMML), Graph eXchange Language (GXL), and GraphML.
Result Display Option - Group Results by: You can switch between "Compound" and "Substance" views. The compounds are grouped from the substances.
Save View: This option is described below.
PubChem BioActivity Services
This page is the common gateway to the PubChem BioActivity Analysis Service. It provides a central entry point for accessing bioassay records and tools, including BioAssay Summary, BioActivity Summary, Data Table, and Structure-Activity Analysis, for a selected substance/compound/assay set. Data Table also offers services for data analysis through Plots and for selecting detailed test results. The functionality and navigation of these services are documented below.
Files saved for recording analysis status can be imported using the "Open Saved View" tab. The chemical structure clustering tool launch point is also on this page. [Ref: Nucleic Acids Res, 2009; (6).]
BioAssay Summary: The BioAssay Summary service allows one to review the information content of PubChem BioAssay records, including information provided by assay depositors as well as annotations provided at PubChem. To retrieve a specific bioassay record, please provide the PubChem BioAssay accession, AID.
BioActivity Summary: The BioActivity Summary service reports the available biological screening results for a single chemical sample or a set of chemical samples. This service provides a means to examine and compare biological outcomes across multiple biological tests. Please specify compounds using the given input methods. To retrieve specific test results, please specify bioassays.
BioActivity DataTable: The Data Table tool supports rapid search and retrieval of test results for one or more bioassay records. Please specify bioassays using the given input methods.
Structure-Activity Analysis: The Structure-Activity Analysis service clusters compounds and bioassays simultaneously using chemical structure, biological outcome, and target information. This service provides exploratory tools that allow one to identify structure-activity relationships and examine the target selectivity and specificity of a compound. Please specify compounds and bioassays using the given input methods.
Structure Clustering: Chemical Structure Clustering Tool clusters compounds/substances based on the structure (fingerprint) similarity using the Single Linkage algorithm. Please specify compounds/substances using the given input methods.
Open Saved View: The launch point for the saved view file. A "view file" can be saved from BioActivity Summary, BioActivity Datatable, Structure-Activity Analysis, and Chemical Structure Clustering pages. For more information about a view file, click here.
Display: Allows users to switch between compound and substance input.
Compound Input: Allows users to specify compound input. Users can choose only one input method: search term, CID list, CID list file, or an Entrez history key (if available).
Substance Input: When substance input is selected, users can specify it using a search term, SID list, SID list file, or an Entrez history key (if available).
BioAssay Input: Allows users to specify the bioassay input. Users can choose only one input method: search term, AID list, AID list file, or an Entrez history key (if available).
UID List: A UID list (here, UID refers to CID, SID, or AID) should be a comma-separated numeric list. Spaces, semicolons (;), new lines, and tabs can also be used as delimiters. For SID input, users can choose to use the ID-Map file, which can be obtained from the pcsubstance docsum page.
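The accepted UID list format can be illustrated with a short Python parser. This is a sketch of the delimiter rules stated above and not the service's own input handling; the function name parse_uid_list is hypothetical, and rejecting non-numeric tokens with an error is an assumption about how invalid input should be treated.

```python
import re
from typing import List


def parse_uid_list(text: str) -> List[int]:
    """Split a UID list on commas, semicolons, and whitespace.

    Whitespace covers the space, new line, and tab delimiters.
    """
    tokens = [tok for tok in re.split(r"[,;\s]+", text) if tok]
    uids = []
    for tok in tokens:
        if not re.fullmatch(r"[0-9]+", tok):
            raise ValueError(f"not a numeric UID: {tok!r}")
        uids.append(int(tok))
    return uids


if __name__ == "__main__":
    # Made-up identifiers mixing every allowed delimiter.
    print(parse_uid_list("101, 202;303\n404\t505 606"))
```

Runs of mixed delimiters, such as a comma followed by a space, are treated as a single separator.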
BioAssay Summary:
BioAssay Summary may be accessed through the NCBI Entrez system, where one can search the PubChem BioAssay database using a specified keyword. Users can see an example of an Entrez BioAssay search result for the term "peroxiredoxins".
Using the "Display" pull-down menu on this page, users may choose to view lists of summaries, brief summaries, unique identifiers, compounds, substances, free text article links (via PMC), and PubMed citations. On the right of the page, users can select a few pop-up windows (when available) to get Related BioAssays, Compounds, Literature, etc.
Clicking the AID hyperlink leads users to the BioAssay Summary page. This page shows detailed descriptions of a BioAssay, including citation links, experiment protocols, and depositor comments. "Data Table(Active)" links to test results for compounds considered active in the particular BioAssay, while "Data Table(All)" links to the complete test results. This page also provides links to a few data analysis resources/tools that are derived at PubChem, such as "BioActivity Summary" and "Related BioAssay". The bottom of the page shows detailed readouts, such as name, description, and data type. "Test Concentration" and "Active Concentration" attributes are flagged with * and **, respectively. The glossary for this page is listed below.
AID: PubChem's BioAssay identifier.
BioAssay Version: The BioAssay version number is composed of a major version number and a minor version number. We encourage you to look at the current version of the results, as it contains the updated data from the depositors.
Name: The BioAssay name provided by the depositor.
Data Source: Depositor's source name (unique in PubChem)
Deposit Date: Date when data was first deposited.
Modify Date: Date when data was revised.
BioAssay Results: Data table for active substances or all substances.
BioActive Compounds: Active compounds/substances tested in the BioAssay. Related links for the compound/substance set.
Related BioAssays: Related BioAssays by activity overlap, target similarity, and/or related to the same tested compound/substance set.
Protein Target: Protein target related to this BioAssay.
Links: Extra linked information to this BioAssay.
Compounds: Compounds tested for this BioAssay, including activity information.
Substances: Substances tested for this BioAssay, including activity information.
PubMed: PubMed citations related to this BioAssay.
Nucleotide: NCBI Entrez Nucleotide links to this BioAssay if available.
Taxonomy: NCBI Entrez Taxonomy links to this BioAssay if available.
Structure: MMDB links to this BioAssay.
Gene: Gene links to this BioAssay.
BioAssay: The BioAssays related to this one.
Description: The BioAssay's description provided by the depositor.
Protocol: The BioAssay's protocol provided by the depositor.
Comment: The BioAssay's Comment provided by the depositor.
Categorized Comment: The BioAssay's Categorized Comment provided by the depositor.
Result Definition: The BioAssay's result definition provided by the depositors.
Test Concentration: The concentration at which compounds are tested in a BioAssay.
Active Concentration: The concentration that produces 50% of the maximum activity. Same as IC50, EC50, etc.
BioActivity Analysis:
BioActivity Analysis shows the activity analysis for a set of compounds/substances and BioAssays. It has three views: Summary, Data Table, and Structure-Activity, as described below.
BioActivity Analysis - Summary: This is one of the three views of "BioActivity Analysis". It displays the tested compound/substance activity summary across multiple BioAssays. There are three subviews under "BioActivity Analysis - Summary": BioAssays, Targets, and Compounds. These view pages provide the bioactivity information for each bioassay, target, and compound, respectively.
There are three sections on each of the three view pages: "Revise BioAssay and Compound Selection", "Table", and "Extra Options". "Revise BioAssay and Compound Selection" shows the assay and compound/substance counts in several subgroups and provides a way to revise the assay and compound/substance lists used to calculate the counts in the Table. Clicking one of these subgroups revises the lists accordingly, so the subgroups serve as filters. "Extra Options" lets users switch between the compound and substance views and download the data table (in CSV format).
Launch the BioActivity Analysis page: Users can launch this page from the PubChem Substance, Compound, and BioAssay summary reports in Entrez, where users may click the display pull-down, choose "PubChem BioActivity Summary", and see the compound/substance activity distribution across all BioAssays. When launching from Entrez PubChem BioAssay, users see all active compounds across each BioAssay. Other launch points for this service are available from the "BioAssay Summary", "Data Table", and "Structure-Activity Analysis" services, or from https://pubchem.ncbi.nlm.nih.gov/assay/assay.cgi. Most of these launch points lead the user to the "BioActivity Analysis - BioAssays" page. The user can then go to the other pages by clicking the corresponding tab on the page. Users can also launch the "Targets" or "Compounds" page directly for specific AID(s) or CID(s) with the URL //pubchem.ncbi.nlm.nih.gov/assay/assaytool.cgi?q=tgt&cid=xxxx (or aid=xxxx) or //pubchem.ncbi.nlm.nih.gov/assay/assaytool.cgi?q=cmp&cid=xxxx (or aid=xxxx), respectively, where xxxx is the AID or CID.
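The direct-launch URLs can be assembled programmatically. The following Python sketch builds them from the q, cid, and aid parameters named above; the helper name summary_url, the https scheme, and the restriction to a single ID are assumptions, and the sketch does not check whether the service still responds at this address.

```python
from urllib.parse import urlencode

# The https scheme is an assumption; the text gives the URL without a scheme.
BASE = "https://pubchem.ncbi.nlm.nih.gov/assay/assaytool.cgi"

def summary_url(view, id_type, uid):
    """Build a direct-launch URL (hypothetical helper).

    view:    "tgt" for the Targets page, "cmp" for the Compounds page
    id_type: "cid" or "aid"
    uid:     a single numeric identifier
    """
    if view not in ("tgt", "cmp"):
        raise ValueError("view must be 'tgt' or 'cmp'")
    if id_type not in ("cid", "aid"):
        raise ValueError("id_type must be 'cid' or 'aid'")
    return BASE + "?" + urlencode({"q": view, id_type: int(uid)})

print(summary_url("tgt", "cid", 2244))
print(summary_url("cmp", "aid", 1000))
```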
BioActivity Analysis - Summary - BioAssays: You can go to https://pubchem.ncbi.nlm.nih.gov/assay/assay.cgi, click the tab "Assay-centric", and search by the protein target name (e.g., "Protein kinase C alpha type; PKC-A; PKC-alpha"). The table of "BioAssays" contains the AID, the active compound/substance count, the inactive compound/substance count, the total tested count, the counts of compounds/substances with an active concentration less than or equal to 1 uM or 1 nM, the range of active concentrations, the BioAssay name, and the protein target name. Clicking a count leads to the respective Data Table.
BioActivity Analysis - Summary - Targets: You can go to https://pubchem.ncbi.nlm.nih.gov/assay/assay.cgi, click the tab "Target-centric", and search by the protein target name (e.g., "Protein kinase C alpha type; PKC-A; PKC-alpha"). The table of "Targets" displays the tested results for each protein GI. The table contains, for each protein target, the target name, the bioassay count, the chemical probe count, the active compound/substance count, the counts of compounds/substances with an active concentration less than or equal to 1 uM or 1 nM, and the total tested compound/substance count. Some table columns are hidden by default. Users can click "More Columns" in the upper-right corner to show all columns.
All table columns are sortable. The header cell of a sortable column changes its background color from light grey to orange when the mouse pointer is placed over it. Users can click the column heading to sort the table by that column. The arrow just after the heading indicates the sorting direction.
"Tips" To look at the bioactivity results in the PubChem BioAssay database for certain targets, select the targets of interest on the "Targets" view page, click another tab ("Assays" or "Compounds"), and then click the "Targets" tab again. When check boxes on the left of some table rows are checked, only the selected subset of targets is carried to the next page when the user clicks any of the "DataTable", "Structure-Activity", "BioAssays", and "Compounds" buttons.
"Count Links" All counts for targets and bioassays link to an NCBI Entrez page that displays the list. The counts for compounds/substances link to various pages with detailed information for these counts. Clicking one of these links pops up several options: one goes to an NCBI Entrez page that displays the list, one goes to the PubChem Data Table for the tested results for these counts, and another goes to the PubChem Structure-Activity (SAR) analysis.
BioActivity Analysis - Summary - Compounds(Substances): If you search for "aspirin" at https://pubchem.ncbi.nlm.nih.gov/assay/assay.cgi, the table of "Compounds(Substances)" displays the tested results for each chemical compound (substance). By default, the table contains, for each chemical compound, the counts of bioassays in which the compound has been concluded to be a "chemical probe", has been active, has had an active concentration less than or equal to 1 uM or 1 nM, and has been tested. The table also contains the count of protein targets against which the compound has tested active. Unique protein target counts are used here, which means that all protein targets with an identical sequence are grouped together and counted once. The compound's active concentration range (values of IC50, etc.) is also provided in the table.
Users can select "Substance" in the "Display Results By" dropdown menu of the "Result Display Option" section just below the table and then click the "Apply" button to switch from the compound view to the substance view.
"Tips" To look at the bioactivity results in the PubChem BioAssay database for certain compounds, select the compounds of interest on the "Compounds" view page, click another tab ("Assays" or "Targets"), and then click the "Compounds" tab again. When check boxes on the left of some table rows are checked, only the selected subset of compounds is carried to the next page when the user clicks any of the "DataTable", "Structure-Activity", "BioAssays", and "Targets" buttons.
"Count Links" All counts for bioassays, compounds/substances, and targets link to various pages with detailed information for these counts. Clicking one of these links pops up several options: one goes to an NCBI Entrez page that displays the list, one goes to the PubChem Data Table for the tested results for these counts, and another may go to the PubChem Structure-Activity (SAR) analysis.
The Data Table tab shows the result data table for the selected BioAssays, or for all BioAssays (up to a maximum of 50) when nothing is selected, with the substance set on the page.
The Structure-Activity tab shows the Structure-Activity Analysis for the selected or all BioAssays with the substance/compound set on the page.
Revise Substance/Compound Selection allows you to reset the substance/compound selection using a few options.
Revise BioAssay Selection allows you to reset the BioAssay selection using a few options.
Result Display Option - Group Results by: Users can switch between Compound and Substance.
BioActivity Analysis - Data Table:
This is one of the three views of "BioActivity Analysis". The other views, "Summary" and "Structure-Activity", are available as tab options. The Data Table page displays the search results. There are four menus for Data Table: "Data Table, Concise", "Data Table, Complete", "Plot", and "Select".
Result Display Option: is defined above.
Save View: is defined below.
Result Exports allows you to download the result set, including chemical structures and readouts, in the chosen format. If the FTP connection hangs, Passive FTP mode is likely not enabled on your Windows system. You can enable Passive FTP with these steps. Step 1: Click the Start button to open the Start menu. Step 2: Type "Internet Options" (without quotes) in the search box. Step 3: Click "Internet Options" in the list of results to open the Internet Properties dialog window. Step 4: Click the "Advanced" tab to select it. Step 5: Scroll down to the Browsing heading and check the box next to "Use Passive FTP (for Firewall and DSL Modem Compatibility)." Click "OK" to confirm the setting.
Data Table, Concise -- shows concise results, which contain the activity outcome, score, and active concentration if provided.
Data Table, Complete -- shows test results corresponding to complete or selected readout fields.
Dose-Response Curve -- If the data table has dose-response data, the icon shows the link to the Dose-Response Curve.
Curve fit with the Hill Equation: The experimental data are fit with the Hill equation v = (V * S^n / (K + S^n)) + b or v = - (V * S^n / (K + S^n)) + b using the nonlinear regression algorithm described by Pinto et al. (1984), without weighting. Here v is the response, S is the concentration, n is the Hill coefficient, K is the apparent dissociation constant, V is the maximum response, and b is a parameter related to the baseline response. The experimental data are shown as colored symbols, and the fit curve is shown in black.
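To make the two forms of the Hill equation concrete, here is a minimal Python sketch that evaluates the curve for given parameters. It only evaluates the equation; it does not reproduce the nonlinear regression used for the fit. The function name hill_response and the decreasing flag are illustrative assumptions.

```python
def hill_response(S, V, K, n, b, decreasing=False):
    """Evaluate v = +/-(V * S^n / (K + S^n)) + b.

    S: concentration, V: maximum response, K: apparent dissociation constant,
    n: Hill coefficient, b: baseline-related parameter.
    decreasing=True selects the form with the leading minus sign.
    """
    term = V * S**n / (K + S**n)
    return (-term if decreasing else term) + b

# When S^n equals K, the concentration-dependent term is half of V.
print(hill_response(10.0, V=100.0, K=100.0, n=2, b=5.0))
print(hill_response(10.0, V=100.0, K=100.0, n=2, b=5.0, decreasing=True))
```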
You can choose linear or log scales for both the X- and Y-axis. The default is a log scale for the X-axis and a linear scale for the Y-axis. The "Download Data" button shows the dose-response data. The "Data Table" button links to the data table for the pair of assay and chemical. If more than one curve is available, you can click the button "One Curve per Graph" to show each curve in a separate graph.
BioAssay Plot -- This page provides an interface for plotting a "Scatter Plot" and a "Histogram". Users can select up to 5 rows. The "Scatter Plot" will show figures for all pairs. The "Histogram" will show figures for all rows. Users can also click on each row to get the histogram for that row.
Scatter Plot and Histogram: By clicking two diagonal points in the figure, you can view the data with four options: "Plot selected data", "Show selected data", "Show selected data, active only", and "Show selected data, inactive only".
BioAssay Select -- This page provides an interface for carrying out a BioAssay result search.
Navigate Buttons:
Press the "Show" button to retrieve the BioAssay(s) data table results based on your query criteria.
Press the "Clear" button to clear/reset the query form.
Summary Results provides a search interface. You can search by activity outcome, rank score, and/or test date in the displayed search form. Click the expand icon to open the BioAssay result search form. (A collapse icon then appears; clicking it collapses the form.)
Outcome Filter allows you to select tested compounds/substances based on the activity outcome. The checkbox controls whether the outcome is displayed on the result page. By default, it is checked.
Activity Score Filter allows you to select tested compounds/substances based on the activity rank score. The checkbox controls whether the rank score is displayed on the result page. By default, it is checked.
Updated Date Filter allows you to specify the date range for the assay. By default, all results are returned if nothing is entered. The input format is yyyy/mm/dd, where mm and dd are optional.
Other Experimental Results provides a detailed search interface. Click the expand icon to open the BioAssay result search form. (A collapse icon then appears; clicking it collapses the form.)
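The yyyy/mm/dd format with optional month and day can be checked with a small Python sketch. The helper name parse_filter_date, the requirement of two-digit months and days, and the simple range checks are assumptions for illustration; the form's actual validation rules are not documented here.

```python
import re

# Assumption: month and day, when present, are written with two digits.
_DATE = re.compile(r"(\d{4})(?:/(\d{2})(?:/(\d{2}))?)?")

def parse_filter_date(text):
    """Return (year, month, day); month and day are None when omitted."""
    m = _DATE.fullmatch(text.strip())
    if m is None:
        raise ValueError(f"expected yyyy[/mm[/dd]], got {text!r}")
    year, month, day = (int(g) if g else None for g in m.groups())
    if month is not None and not 1 <= month <= 12:
        raise ValueError("month out of range")
    if day is not None and not 1 <= day <= 31:
        raise ValueError("day out of range")
    return year, month, day

for value in ("2009", "2009/06", "2009/06/15"):
    print(value, parse_filter_date(value))
```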
All result fields are checked by default. You can unselect/select all of them by clicking the checkbox in the header row. Selected results are displayed on the result page.
Results of integer/float type can be searched with a lower-bound value and/or an upper-bound value. String type results can be searched either by selecting one string term from the dropdown list or by entering a pattern string. Boolean type results can be searched by selecting one radio button.
Pattern search: You can use a pattern to perform a string search. A pattern is a part of a search term.
Result Filter: A few result filters are available to narrow your result search.
Substance Filter: You can provide an SID list for your search using a list file, list text, or Entrez history.
Compound Filter: You can provide a CID list for your search using a list file, list text, or Entrez history.
Select Other BioAssays allows you to add or change BioAssays. WE DO NOT ENCOURAGE YOU TO PROCESS MULTIPLE BIOASSAYS UNLESS YOU KNOW THAT TWO OR MORE BIOASSAYS ARE RELATED AND YOU WANT TO COMPARE THEIR RESULTS. You can choose up to 5 BioAssays to process their data together.
BioActivity Analysis - Structure-Activity:
This is one of the three views of "BioActivity Analysis". It shows the Structure-Activity relationship in a heatmap display. The sample page is shown below. The default limit on compounds and BioAssays is set to 1000 so that the job finishes in around one minute. If more than 1000 compounds or BioAssays are input, a warning message appears and you can change the limit to a number <= 4000. Jobs above the default limit take longer than one minute to finish.
Compound/BioAssay Clusters: This is probably the most important feature in this tool. Users can cluster compounds and BioAssays differently to do the Structure-Activity analysis. Compounds could be clustered by "2D Structure", "3D Structure", or "Activity" similarities. BioAssays could be clustered by "Activity", "Protein Target" sequence, "Depositor-Specified", or "BioSystems" similarities.
- Compound Cluster: Compounds can be clustered based on their 2D/3D structures or on their activity in the current set of BioAssays. The clustering of compounds is shown on the left of the Heatmap. The scoring function for 2D structure similarity is the Tanimoto score, which is calculated from the Structure Fingerprint. The 3D structure similarity is the calculated Shape or Feature similarity as described below. The scoring function for activity similarity is described below. The clustering algorithm for both compound and BioAssay clusters is Single Linkage.
- BioAssay Cluster: BioAssays can be clustered based on the activity similarity of the current set of compounds in these BioAssays, the sequence identity of the protein targets (the proteins interacting with the compounds in the BioAssays), the depositor-specified similarity, or the BioSystems similarity based on common BioSystems. The clustering of BioAssays is shown at the top of the Heatmap.
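As an illustration of how a Tanimoto score and Single Linkage clustering fit together, the Python sketch below clusters three toy fingerprints. The fingerprints are made-up sets of "on" bit positions rather than real PubChem Structure Fingerprints, the function names are hypothetical, and the naive merge loop is only a sketch of the general technique, not the implementation used by this tool.

```python
from itertools import combinations

def tanimoto(a, b):
    """Tanimoto score of two sets of on-bits: |A and B| / |A or B|."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0  # empty-vs-empty as 0 is an assumption

def single_linkage(fingerprints):
    """Merge clusters by highest similarity; return the merge history.

    Under single linkage, the similarity of two clusters is the best
    (maximum) pairwise similarity between their members.
    """
    def link(pair):
        c1, c2 = pair
        return max(tanimoto(fingerprints[x], fingerprints[y]) for x in c1 for y in c2)

    clusters = [frozenset([name]) for name in fingerprints]
    merges = []
    while len(clusters) > 1:
        best = max(combinations(clusters, 2), key=link)
        merges.append((set(best[0]), set(best[1]), link(best)))
        clusters = [c for c in clusters if c not in best] + [best[0] | best[1]]
    return merges

toy = {"A": {1, 2, 3, 4}, "B": {1, 2, 3, 5}, "C": {7, 8, 9}}
for left, right, score in single_linkage(toy):
    print(sorted(left), sorted(right), score)
```

In this toy input, A and B share three of five distinct bits, so they merge first; C shares no bits with either and joins last.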
Activity Data: There are four kinds of activity data: Activity Outcome, Activity (IC50 etc.), Linear Score, and Percentile Score. These activity data are used in the clustering by Activity Similarity. They are also shown in the Heatmap. Each cell in the Heatmap corresponds to the test result of one compound in one BioAssay.
- Activity Outcome includes Probe, Active, Inactive, Unspecified/Inconclusive, or Untested.
Active Concentration (AC) is reported as the concentration which produces 50% of the maximum possible biological response such as IC50, EC50, AC50, GI50 etc., or as constant parameters such as Ki. It has a unit of uM. In the Activity Similarity calculation, the Active Concentration is normalized to a value between 0 and 1 in log scale. The normalized AC = (log AC - 2) / (-3 - 2). If AC <= 0.001 uM, the normalized AC is 1. If AC > 100 uM, the normalized AC is 0.
Linear Score and Percentile Score are normalized from the raw scores of all compounds in one BioAssay. The Linear Score = ([score] - [min]) / ([max] - [min]), where [min] and [max] are the minimum and maximum score of this assay. The Percentile Score = [rank] / [N], where [rank] is the rank of one compound among all compounds in the assay, [N] is the total number of compounds in the assay.
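The three normalizations above can be written out as a short Python sketch. It assumes that "log" means the base-10 logarithm (which is consistent with 0.001 uM mapping to 1 and 100 uM mapping to 0) and that the rank is supplied by the caller, since the ranking direction is not stated. The function names are illustrative.

```python
import math

def normalized_ac(ac_um):
    """Normalize an Active Concentration (in uM) to [0, 1] on a log scale."""
    if ac_um <= 0:
        raise ValueError("concentration must be positive")
    if ac_um <= 0.001:
        return 1.0
    if ac_um > 100:
        return 0.0
    # (log AC - 2) / (-3 - 2), rewritten as (2 - log AC) / 5
    return (2 - math.log10(ac_um)) / 5

def linear_scores(scores):
    """Linear Score = (score - min) / (max - min) over one assay's raw scores."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        raise ValueError("all scores are equal; Linear Score is undefined")
    return [(s - lo) / (hi - lo) for s in scores]

def percentile_score(rank, n):
    """Percentile Score = rank / N."""
    return rank / n

print(normalized_ac(1.0))           # 1 uM sits between the two cutoffs
print(linear_scores([10, 20, 40]))
print(percentile_score(3, 4))
```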
- Scoring Functions: differ by data type.
- Activity Outcome: The scoring function for Activity Outcome data is "Weighted Similarity" (WS). It uses both the Activity Similarity of Active Compounds (ASAC) and the Activity Similarity of Inactive Compounds (ASIC). "WS" = ("ASAC" + 0.1 * "ASIC") / (1 + 0.1). "ASAC" = [number of compounds active in both sets A and B] / ([number of compounds active in set A] + [number of compounds active in set B] - [number of compounds active in both sets A and B]). ASIC is calculated in the same way from the inactive compounds.
- Active Concentration, Linear Score, and Percentile Score: The selected scoring function for score data is "Euclidean Distance". "Euclidean Distance" = 1.0 - SUM of ([diff] * [diff]) / [N], where [N] is the total number of cells in set A (same as that in set B), and [diff] = [score of a cell in set A] - [score of the corresponding cell in set B]. If both cells are untested, [diff] = 0. If one cell is tested with a score of "S" and the other cell is untested, [diff] is the larger of "S" and 1 - "S".
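Both scoring functions can be sketched in a few lines of Python. The sketch assumes that activity data are given as sets of compound identifiers, that ASIC uses the same overlap formula as ASAC applied to inactive compounds, and that untested cells are represented by None; the function names are illustrative.

```python
def overlap(a, b):
    """[in both] / ([in A] + [in B] - [in both]) for two sets of compound IDs."""
    union = len(a | b)  # equals |A| + |B| - |A and B|
    return len(a & b) / union if union else 0.0

def weighted_similarity(active_a, active_b, inactive_a, inactive_b):
    """WS = (ASAC + 0.1 * ASIC) / (1 + 0.1)."""
    asac = overlap(active_a, active_b)
    asic = overlap(inactive_a, inactive_b)
    return (asac + 0.1 * asic) / (1 + 0.1)

def score_similarity(scores_a, scores_b):
    """1.0 - SUM(diff * diff) / N over paired cells; None marks an untested cell."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("score lists must be non-empty and of equal length")
    total = 0.0
    for a, b in zip(scores_a, scores_b):
        if a is None and b is None:
            diff = 0.0
        elif a is None or b is None:
            s = b if a is None else a
            diff = max(s, 1 - s)
        else:
            diff = a - b
        total += diff * diff
    return 1.0 - total / len(scores_a)

print(weighted_similarity({1, 2, 3}, {2, 3, 4}, {5, 6}, {5, 6}))
print(score_similarity([1.0, 0.5, None, None], [1.0, 0.0, 0.25, None]))
```

Because the sum of squared differences is subtracted from 1.0, a higher value means more similar activity profiles, even though the function is labeled a distance.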
During the 3D similarity calculation, 3D Superposition is Optimized by either "Shape" or "Feature". If the calculation is shape-optimized, the 3D similarity can be represented by the normalized "Shape + Feature" similarity score, or just "Shape" similarity score. If the calculation is feature-optimized, the 3D similarity can be represented by the normalized "Feature + Shape" similarity score, or just "Feature" similarity score.
A certain Number of Conformers per Compound (nconf) is chosen so that the calculation finishes in 1-2 minutes. This number "nconf" is also shown in the clustering image. You can increase this number up to 10 to use more conformers in the calculations.
Revise Selection: Users can revise both the compound/substance and the BioAssay selections. The detailed options are hidden by default. Users can click the "+" sign near Revise Selection to show the details.
Revise Compound Selection: "Select Active" selects the subset of compounds/substances active in some of the selected BioAssays. "Add Active" adds additional compounds/substances active in some of the selected BioAssays. Similarly, you can also "Select" or "Add" by Active Concentration (IC50, etc.). You can also remove a list of comma-separated CIDs from the display. "Add Similar Compounds/Substances" adds compounds/substances similar to the current ones. "Add Similar Conformers" adds compounds/substances whose 3D conformers are similar to the conformers of the current set.
Revise BioAssay Selection: "Select Active" selects the subset of BioAssays in which some of the selected compounds/substances are active. "Add Active" adds additional BioAssays in which some of the selected compounds/substances are active. Similarly, you can also "Select" or "Add" by Active Concentration (IC50, etc.). You can also remove a list of comma-separated AIDs from the display. "Add Related BioAssays" has four pop-up options: "by Target Similarity", "by Activity Overlap", "by Depositor", and "by BioSystems". "Select by BioAssay Type" has four pop-up options: "Summary/Confirmatory", "Summary", "Confirmatory", and "Primary Screening". "Defined Protein Target" shows only those BioAssays with protein targets defined. "Defined BioSystems" shows only those BioAssays with BioSystems defined.
Counts: The "Counts" only appear when users launch the Heatmap for the first time. The counts include the number of "Input" compounds or BioAssays and the number of compounds or BioAssays "Shown" in the Heatmap.
Data Table: Users can export "BioActivity Data", "Compound Similarity Scores", and "BioAssay Similarity Scores". "BioActivity Data" shows the data corresponding to each cell in the heatmap-style display. There are three kinds of data: Activity Outcome, Score, and Active Concentration (if available). "Compound Similarity Scores" and "BioAssay Similarity Scores" show the similarity scores used to generate the dendrograms.
Conformers Used: Each compound could have up to 10 conformers in the 3D similarity calculation. The most similar conformer pair is used to represent the 3D similarity of a pair of compounds. You can export these conformer pairs.
Image: Users can export the display in one full PNG image since the display may consist of many small images.
Clusters in GML: Users can export the clusters as a Graph Modelling Language (GML) file, which can be viewed in other software such as Cytoscape. The GML file format can be easily converted to other formats such as the eXtensible Graph Markup and Modeling Language (XGMML), Graph eXchange Language (GXL), and GraphML.
Result Display Option: is defined above.
Save View: is defined below.
Select a region in Heatmap: One way to show a subset of the Heatmap is to click two diagonal points in the Heatmap to select the compounds and BioAssays in the region, as shown in the image below. A menu will pop up with six options. The first option, "Zoom in", displays a new Heatmap with the compounds and BioAssays in the selected region. The second option, "BioActivity Summary, Selected Compounds and BioAssays", shows the selected compounds and BioAssays in the PubChem BioActivity Summary page. The third option, "BioAssay Data Table, Selected Compounds and BioAssays", shows the selected compounds and BioAssays in the PubChem Data Table page. The fourth option, "Selected Compounds in Structure Clustering", shows the selected compounds in the PubChem Structure Clustering page with all structures displayed. The last two options, "Selected Compounds in Entrez" and "Selected BioAssays in Entrez", show the selected compounds or BioAssays in Entrez.
Click Blue Circles in Clusters: As shown in the following two images, if you click on a blue circle or the line above the circle in the Compound Cluster or BioAssay Cluster, a menu will pop up. The options for the Compound Cluster are "Compounds in Structure Clustering", "Compounds in Entrez", "Compounds in BioActivity Summary", "2D/3D Structure Similarity Scores" or "Activity Similarity Scores", "Expand Subtree", and the following Revise Selections: "Display Subtree Only", "Remove Subtree & Display the Rest", "Add Similar Compounds", and "Add Similar Conformers". There are two more options for 3D Structure similarity: "Compounds in 3D Viewer" and "Conformers Used".
The options for the BioAssay Cluster Tree are "BioAssays in Entrez", "BioAssays in BioActivity Summary", "Similarity Scores" ("Activity Similarity Scores", "Target Sequence Identities", "Depositor-specified similarity", or "BioSystems similarity"), and the following Revise Selections: "Display Subtree Only", "Remove Subtree & Display the Rest", "Add Related BioAssays, by Activity Overlap", "Add Related BioAssays, by Target Similarity", "Add Related BioAssays, by Depositor", and "Add Related BioAssays, by BioSystems".
Common Substructures: As shown in the following image, if you move the mouse pointer over a node (blue circle) or the line on its left, the common substructures for the compounds in the subcluster will pop up. Currently only the 2D common structures are shown. If the similarity of the node is >= 0.9, the common fingerprint bits greater than 574 are shown. Otherwise, the fingerprint bits greater than 713 are shown.
Collapse Compound Cluster Tree: The Compound Cluster Tree can be collapsed if users click on the ruler as shown above. The subtrees beyond the collapsed Tanimoto score will be collapsed into a node, which can be expanded. The corresponding rows in the Heatmap are collapsed as well. The color of collapsed cells is a mixture of green and yellow.
Related BioAssays by Target Similarity:
This page shows the related BioAssays based on target similarity between the queried BioAssay and all other BioAssays. The top 10 related BioAssays are pre-selected.
Protein Target: the protein(s) which the compounds interact with in the BioAssay.
Target Similarity: the similarity of the protein sequences for a pair of targets in two BioAssays. Both Sequence Identity and Blast E-value are shown. The related BioAssays are sorted by Sequence Identity.
Sequence Alignment: the sequences of protein targets in the original BioAssay and the Related BioAssay are aligned using Blast 2. If there are multiple targets in one BioAssay, only the target with the highest similarity is shown.
Related BioAssays by Activity Overlap:
This page shows the related BioAssays based on activity overlap between the queried BioAssay and all other BioAssays. The top 10 related BioAssays are pre-selected.
Activity Similarity: For BioAssays A and B, the activity similarity = [Active in Both A and B] / ([Active in A] + [Active in B] - [Active in Both A and B]).
Active in Both: Links to compounds active in both the queried BioAssay and the BioAssay listed in the current row.
Related BioAssays by Depositor:
This page shows the related BioAssays specified by depositors. The top 10 related BioAssays are pre-selected.
Related BioAssays by Common BioSystems:
This page shows the related BioAssays based on common BioSystems between the queried BioAssay and all other BioAssays. The top 10 related BioAssays are pre-selected.
BioSystems: biological pathways and other BioSystems containing the target sequences of BioAssays. There are two kinds of BioSystems: organism-specific and across-species.
BioAssay View File:
A BioAssay view file enables users to save the state of a BioAssay display so that they may view it again at a later date or share it with colleagues. Please note that PubChem data may change over time as depositors add, update, and delete data. As such, saving a view does not guarantee that exactly the same information will be displayed at a later date. The BioAssay view file is in XML format. The specification for this file can be found at: ftp://ftp.ncbi.nih.gov/pubchem/specifications/pug.xsd.
PubChem Cross Links |
PubChem provides cross links to other databases when that information is available. You can find those links on either the Entrez PubChem pages or the individual record summary pages. These links are reciprocal: other databases also link back to PubChem. The links work well for a single ID input, e.g., the literature about "aspirin" (CID 2244) can be found using the URL https://www.ncbi.nlm.nih.gov/pubmed?LinkName=pccompound_pubmed_mesh&from_uid=2244. Some links were removed in October 2016. Click here to see the list.
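The example URL follows a simple pattern: a LinkName parameter and a from_uid parameter. The Python sketch below rebuilds it from those two values. The helper name entrez_link_url is hypothetical, and the assumption that the path segment is the name of the target database (here "pubmed") is a generalization from the single example above, not a documented rule.

```python
from urllib.parse import urlencode

def entrez_link_url(link_name, from_uid, target_db):
    """Build a cross-link URL for a single ID (hypothetical helper).

    target_db as the path segment is an assumption based on the
    pccompound_pubmed_mesh example, where the target is "pubmed".
    """
    query = urlencode({"LinkName": link_name, "from_uid": int(from_uid)})
    return f"https://www.ncbi.nlm.nih.gov/{target_db}?{query}"

# Literature about aspirin (CID 2244) via MeSH.
print(entrez_link_url("pccompound_pubmed_mesh", 2244, "pubmed"))
```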
Links in PubChem Compound:
pccompound_biosystems: Related BioSystems.
pccompound_gene: Related genes.
pccompound_mesh: Related MeSH.
pccompound_nuccore: Related Nucleotides.
pccompound_omim: Related OMIM.
pccompound_pcassay: PubChem BioAssays where the compounds are tested.
pccompound_pcassay_active: PubChem BioAssays where the compounds are active.
pccompound_pcassay_activityconcmicromolar: PubChem BioAssays where the activity concentration of the compounds is <= 1 uM.
pccompound_pcassay_activityconcnanomolar: PubChem BioAssays where the activity concentration of the compounds is <= 1 nM.
pccompound_pcassay_inactive: PubChem BioAssays where the compounds are inactive.
pccompound_pcassay_probe: PubChem BioAssays where the compounds are chemical probes.
pccompound_pccompound: Related PubChem Similar Compounds.
pccompound_pccompound_3d: Related PubChem Compounds with 3D shape similarity.
pccompound_pccompound_mixture: Related PubChem mixture/component compounds.
pccompound_pccompound_parent: Parent compound.
pccompound_pccompound_parent_connectivity_pulldown: Compounds with same parent, connectivity.
pccompound_pccompound_parent_isotopes_pulldown: Compounds with same parent, isotopes.
pccompound_pccompound_parent_pulldown: Compounds with same parent.
pccompound_pccompound_parent_stereo_pulldown: Compounds with same parent, stereochemistry.
pccompound_pccompound_parent_tautomer_pulldown: Compounds with same parent, any tautomer.
pccompound_pccompound_sameanytautomer_pulldown: Related PubChem identical compounds (same, Any Tautomer).
pccompound_pccompound_sameconnectivity_pulldown: Related PubChem identical compounds (same, Connectivity).
pccompound_pccompound_sameisotopic_pulldown: Related PubChem identical compounds (same, Isotopic).
pccompound_pccompound_samestereochem_pulldown: Related PubChem identical compounds (same, Stereochem).
pccompound_pcsubstance: Related PubChem Substances.
pccompound_pcsubstance_same: PubChem Substances with the exact same structure as the given Compound.
pccompound_pmc: PubMed Central links.
pccompound_protein: Related proteins.
pccompound_pubmed: Related PubMed citations.
pccompound_pubmed_mesh: Related PubMed via MeSH.
pccompound_pubmed_publisher: PubMed articles linked via publisher deposited structures.
pccompound_structure: Related protein structures.
pccompound_taxonomy: Related taxonomy.
Links in PubChem Substance:
pcsubstance_biosystems: Related BioSystems.
pcsubstance_books: Related books.
pcsubstance_gene: Related genes.
pcsubstance_nuccore: Related Nucleotides.
pcsubstance_omim: Related OMIM.
pcsubstance_pcassay: PubChem BioAssays where the substances are tested.
pcsubstance_pcassay_active: PubChem BioAssays where the substances are active.
pcsubstance_pcassay_activityconcmicromolar: PubChem BioAssays where the activity concentration of the substances is <= 1 uM.
pcsubstance_pcassay_activityconcnanomolar: PubChem BioAssays where the activity concentration of the substances is <= 1 nM.
pcsubstance_pcassay_inactive: PubChem BioAssays where the substances are inactive.
pcsubstance_pcassay_probe: PubChem BioAssays where the substances are chemical probes.
pcsubstance_pccompound: Related PubChem Compounds.
pcsubstance_pccompound_same: PubChem Compound with the exact same structure as the given Substance.
pcsubstance_pmc: PubMed Central links.
pcsubstance_probe: Related records in NCBI Probe database.
pcsubstance_protein: Related proteins.
pcsubstance_pubmed: Related PubMed citations.
pcsubstance_pubmed_publisher: PubMed articles linked via publisher deposited structures.
pcsubstance_structure: Related protein structures.
pcsubstance_taxonomy: Related taxonomy.
Links in PubChem BioAssays:
pcassay_books_probe: MLP Probe Report for chemical probe(s) with data reported in the Summary BioAssay record and other records relating to the same chemical probe development project.
pcassay_gene_alltarget_list: Gene by protein target or RNAi target.
pcassay_gene_rnai: Gene targets for all the RNAi reagents tested in a RNAi screening.
pcassay_gene_rnai_active: Gene targets of RNAi reagents which are identified as "hit" in a RNAi screening and flagged as "active" in the corresponding PubChem BioAssay record.
pcassay_gene_target: Gene Targets of PubChem BioAssays.
pcassay_nucleotide: Related Nucleotide sequence.
pcassay_nucleotide_dna_target: Target DNA of a particular PubChem BioAssay.
pcassay_nucleotide_rna_target: Target RNA of a particular PubChem BioAssay.
pcassay_omim: Related OMIM.
pcassay_pcassay_activityneighbor_list: Related BioAssays, by Activity Overlap.
pcassay_pcassay_assay_project: Assay Projects related to these assays.
pcassay_pcassay_common_gene_list: Related BioAssays, by Common Active Gene in RNAi assays.
pcassay_pcassay_gene_interaction_list: Related BioAssays, by Gene-Gene Interaction.
pcassay_pcassay_neighbor_list: PubChem BioAssay Neighbors as described by depositors.
pcassay_pcassay_same_assay_project_list: Related BioAssays, by Same Project.
pcassay_pcassay_same_publication_list: Related BioAssays, by Same Publication.
pcassay_pcassay_similar_publication_list: Related BioAssays, by Similar Publication in ChEMBL assays.
pcassay_pcassay_targetneighbor_list: Related BioAssays, by Target Similarity.
pcassay_pccompound: Related compounds.
pcassay_pccompound_active: Related compounds, active.
pcassay_pccompound_activityconcmicromolar: Related compounds, activity concentration <= 1 uM
pcassay_pccompound_activityconcnanomolar: Related compounds, activity concentration <= 1 nM
pcassay_pccompound_inactive: Related compounds, inactive.
pcassay_pccompound_probe: Related compounds, probe.
pcassay_pcsubstance: Related substances.
pcassay_pcsubstance_active: Related substances, active.
pcassay_pcsubstance_activityconcmicromolar: Related substances, activity concentration <= 1 uM
pcassay_pcsubstance_activityconcnanomolar: Related substances, activity concentration <= 1 nM
pcassay_pcsubstance_inactive: Related Substances, inactive.
pcassay_pcsubstance_probe: Related substances, probe.
pcassay_probe: Nucleic acid reagents used in a particular PubChem BioAssay.
pcassay_protein_target: Target proteins of a particular PubChem BioAssay.
pcassay_pubmed: Related PubMed citation.
pcassay_pubmed_major: High relevance links between PubChem Assay and PubMed.
pcassay_structure: Related protein structures.
pcassay_taxonomy: Related taxonomy.
PubChem Indexes and Filters in Entrez |
The PubChem index search is a very powerful tool within the Entrez system. Type the search term(s), each followed by the index field name in square brackets, then click the "Go" button.
Examples:
Search for DTP/NCI's record with NSC#78:
On the PubChem homepage or Entrez search page, enter "DTP/NCI[Sourcename], 78[objectid]" in the search box, then click the Go button.
Search for all compounds containing gold:
On the PubChem homepage or Entrez search page, enter "Au[el]", then click the Go button.
Search for all compounds with heavy atom count between 10 and 12:
On the PubChem homepage or Entrez search page, choose the 'Pccompound' database from the search dropdown list, enter "10:12[hac]", then click the Go button.
The following fields can be searched within Entrez PubChem databases (with field aliases in square brackets; pick one alias that's easily memorized in case multiple aliases are available). For integer/real number fields, the range search can be done as shown above. Some indices and filters were removed in October 2016. Click here to see the list.
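As a rough illustration of the query syntax described above, the following Python sketch assembles bracketed-field queries and wraps one in an NCBI E-utilities ESearch URL. The helper names (`fielded`, `field_range`, `esearch_url`) are hypothetical, and using ESearch instead of the web search box is an assumption of this example. Nothing is sent over the network.

```python
from urllib.parse import urlencode

# NCBI E-utilities ESearch endpoint. Using it here is an assumption of this
# sketch; the help text itself only describes the web search box.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def fielded(term, field):
    """Return 'term[field]', the bracketed-field form used by Entrez."""
    return f"{term}[{field}]"


def field_range(low, high, field):
    """Return 'low:high[field]', the range form for integer/real fields."""
    return f"{low}:{high}[{field}]"


def esearch_url(db, query):
    """Build (but do not fetch) an ESearch URL for the given database."""
    return ESEARCH + "?" + urlencode({"db": db, "term": query})


gold_query = fielded("Au", "el")        # compounds containing gold
hac_query = field_range(10, 12, "hac")  # heavy atom count from 10 to 12

print(gold_query)
print(hac_query)
print(esearch_url("pccompound", hac_query))
```

The same strings can be pasted directly into the search box; the URL form is only a convenience for scripted searches.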
PubChem Compound:
All [ALL]: All of the following fields are searched. If a string query is presented without a field alias, by default, [ALL] is searched.
Uid [UID]: The integer is the CID of a record in the Pccompound database. By default, an integer without a field alias is recognized as a UID. Same as [CID].
Filter [Filter]: Limits the records. A number of filters are available to restrict the search to compounds with particular information. The specialized filters in this database are:
- has_3d_conformer: records that have 3D conformers
- has_dailymed: records with associated DailyMed info
- has_mesh: records with associated MeSH terms
- has_pharm: records with associated pharmacological actions
- has_parent: records that have a parent structure
- has_patent: records with associated patent info
- has_no_parent: records that do not have a parent
- has_src_nih_mlp: records generated from NIH Molecular Libraries Program
- has_src_vendor: records with vendor info
ActiveAidCount [AC, ACNT]: Number of assays in which a compound is active. Use it to query for compounds that are active in a certain number of assays.
ActiveAidRatio [AAR]: The number of BioAssays in which the compound tested active divided by the number of BioAssays in which it was tested with any result. The ratio lies between 0 and 1.
AtomChiralCount [ACC, ACCNT]: Total count of chiral atoms in a given compound, integer.
AtomChiralDefCount [ACDC, ACDCNT]: Total count of defined chiral atoms in a given compound, integer.
AtomChiralUndefCount [ACUC, ACUCNT]: Total count of undefined chiral atoms in a given compound, integer.
BondChiralCount [BCC, BCCNT]: Total count of chiral bonds in a given compound, integer.
BondChiralDefCount [BCDC, BCDCNT]: Total count of defined chiral bonds in a given compound, integer.
BondChiralUndefCount [BCUC, BCUCNT]: Total count of undefined chiral bonds in a given compound, integer.
CompleteSynonym [CSYN, CSYNO]: Compound's synonyms, based on all substances related to this compound.
Complexity [CPLX]: Compound complexity.
CompoundID [CID]: Compound ID. Same as [UID].
CovalentUnitCount [CUC, CUCNT]: Integer.
CreateDate: Date this compound was created in PubChem.
Element [ELMT, EL]: Chemical element in a compound.
ExactMass [EMAS, EXMASS]: The calculated mass of an ion or molecule with the most likely isotopic composition for a single random molecule, corresponding to the mass of the most intense ion/molecule peak in a mass spectrum. A real number.
HeavyAtomCount [HAC, HACNT]: Atom count in a compound except hydrogen, integer.
HydrogenBondAcceptorCount [HBAC, HBACNT]: Hydrogen bond acceptors for a compound, integer.
HydrogenBondDonorCount [HBDC, HBDCNT]: Hydrogen bond donors for a compound, integer.
InChI [INCH, INCHI]: Standard IUPAC International Chemical Identifier.
InChIKey [INCHIKEY]: Standard IUPAC International Chemical Identifier Key.
InChI strings and InChIKeys can be searched through the Entrez PubChem databases. For example, to search with the InChIKey of aspirin, "BSYNRYMUTXBXSQ-UHFFFAOYSA-N":
Type or paste "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"[InChIKey] into the PubChem Compound, PubChem Substance, or Entrez Global search box, then click the Go button.
Note:
The quote marks and the square brackets are required.
'InChI=' is required when searching with an InChI string.
IsotopeAtomCount [IAC, IACNT]: Number of isotope atoms in a compound.
IUPACName [UPAC, IUPAC]: Standard IUPAC name for compound.
MeSHTerm [MSHT, MESHT]: Medical Subject Heading term. Note that MeSH entry terms (synonyms for the Medical Subject Heading term) are also indexed.
MolecularWeight [MW, MWT, MOLWT]: Mass of a molecule calculated using the average mass of each element weighted for its natural isotopic abundance. E.g., Carbon has two natural isotopes 12 and 13 with relative abundances of 98.9% and 1.1% to yield an average mass of 12.011 g/mol. A real number.
MonoisotopicMass [MMAS, MIMASS]: Mass of a molecule calculated using the mass of the most abundant isotope of each element. E.g., Carbon has a monoisotopic mass of 12.000 g/mol. A real number.
PharmAction [PHMA, PHARMA]: MeSH pharmacological actions.
RotatableBondCount [RBC, RBCNT]: Count of rotatable bonds
SourceName [SRC, SRCNAM, SRCNAME]: Depositor name officially recorded in PubChem databases. For current data sources look here
SourceCategory [SRCC, SRCCAT, SRCCATG]: Depositor categories. For more information and possible categories look here
SubstanceID [SID]: Substance identifier, integer.
Synonym [SYNO]: Synonyms for substance.
TotalAidCount [TAC]: Number of assays in which a compound was tested, regardless of outcome (active, inactive, inconclusive, or unspecified).
TotalFormalCharge [TFC, CHG, CHRG]: Total formal charge.
TPSA [TPSA]: Topological Polar Surface Area.
XLogP [XLGP, LOGP]
PubChem Substance:
All [ALL]: All of the following fields are searched. If a string query is presented without a field alias, by default, [ALL] is searched.
Uid [UID]: The integer is the SID of a record in the Pcsubstance database. By default, an integer without a field alias is recognized as a UID. Same as [SID].
Filter [Filter]: Limits the records. A number of filters are available to restrict the search to substances with particular information. The specialized filters in this database are:
- has_autogen_on: records where structure is to be generated from synonyms
- has_autogen_success: records where structure was successfully generated from synonyms
- has_deposited_3d: records with associated computational 3D info
- has_deposited_3d_experimental: records with associated experimental 3D info
- has_patent: records with associated patent info
- has_src_nih_mlp: records generated from NIH Molecular Libraries Program
- has_src_vendor: records with vendor info
- hasnohold: records that are not on hold
- hasonhold: records that are on hold
AssaySourceName [ASRC, ASRCNAM, ASRCNAME]: Allows filtering by assay source name. For available data sources look here
Comment [CMT]: Substance or BioAssay comment.
CompleteSynonym [CSYN, CSYNO]: Compound's synonyms, based on all substances related to this compound.
ComponentCID [CCID]: Component compound identifier.
CompoundID [CID]: Compound identifier, integer.
DepositDate [DDAT, DEPDAT]: Deposition timestamp for a substance.
ModifyDate: Date this substance record was last modified.
SourceCategory [SRCC, SRCCAT, SRCCATG]: Depositor categories. For more information and possible categories look here
SourceID [SRID, SRCID]: Depositor's external id.
SourceName [SRC, SRCNAM, SRCNAME]: Depositor name officially recorded in PubChem databases. For current data sources look here
SourceReleaseDate [SRD, SRDAT, RLSDAT]
StandardizedCID [SCID]: Standardized compound identifier, integer.
SubstanceID [SID]: Substance ID. Same as [UID].
Synonym [SYNO]: Synonyms for substance.
TotalAidCount [TAC]
PubChem BioAssay:
All [ALL]: All of the following fields are searched. If a string query is presented without a field alias, by default, [ALL] is searched.
Uid [UID]: The integer is the AID of a record in the Pcassay database. By default, an integer without a field alias is recognized as a UID.
Filter [Filter]: Limits the records. A number of filters are available to retrieve records in the same or other databases that the current BioAssay records are cross-referenced to.
- mlp: assay records contributed by Molecular Library Program (MLP). Note that MLP includes both previous phase Molecular Library Screening Center Network (MLSCN) and current phase Molecular Library Probe production Center Network (MLPCN).
- all or pcassay_all: all assays.
- active_concentration: assay records with 'active concentration' attribute provided
- screening: assay of the 'Screening' activity outcome method category
- confirmatory: assay of the 'Confirmatory' activity outcome method category
- summary: assay of the 'Summary' activity outcome method category
- pcassay_biosystems_active: assay records with BioSystems link via active compounds.
- pcassay_biosystems_target: assay records with BioSystems link via protein target.
- pcassay_gene: assay records with gene information provided.
- pcassay_nuccore: assay records with nucleotide link provided.
- pcassay_nuccore_rna_target: assay records with RNA target provided.
- pcassay_nucleotide: assay records with nucleotide link provided.
- pcassay_nucleotide_rna_target: assay records with RNA target provided.
- pcassay_omim: assay records with OMIM link provided.
- pcassay_pathway: assay records with pathway link provided.
- pcassay_pcassay: another filter for all assays.
- pcassay_pcassay_active: assays that contain active results.
- pcassay_pcassay_activityneighbor: assay records with related bioassays based on activity overlap.
- pcassay_pcassay_neighbor: assay records with related bioassays provided by PubChem depositors.
- pcassay_pcassay_neighbor_summary: assay records with a summary for related bioassays provided by PubChem depositors.
- pcassay_pcassay_targetneighbor: assay records with related bioassays based on target similarity
- pcassay_protein_target: assay records with protein target provided.
- pcassay_protein_target_pig: assay records with protein targets that are similar to PIG proteins.
- pcassay_pubmed: assay records with PubMed link provided
- pcassay_taxonomy: assay records with taxonomy link provided
- pcassay_structure: assay records with protein structure link provided
- pcassay_pmc: assay records with PMC link provided
- pcassay_pccompound: assay records with PubChem compound link provided
- pcassay_pccompound_active: assay records with active PubChem compound link provided
- pcassay_pccompound_inactive: assay records with inactive PubChem compound link provided
- pcassay_pccompound_inconclusive: assay records with inconclusive PubChem compound link provided
- pcassay_pccompound_probe: assay records with chemical probe PubChem compound link provided
- pcassay_pcsubstance: assay records with PubChem substance link provided
- pcassay_pcsubstance_active: assay records with active PubChem substance link provided
- pcassay_pcsubstance_inactive: assay records with inactive PubChem substance link provided
- pcassay_pcsubstance_inconclusive: assay records with inconclusive PubChem substance link provided.
- pcassay_pcsubstance_probe: assay records with chemical probe PubChem substance link provided
- Rnai: assay records containing screening data for RNAi.
- Small_molecule: assay records containing screening data for chemicals.
ActiveSidCount [AC, ACNT]: Number of substances (identified by SID, the substance identifier from Pcsubstance) that are considered active in a BioAssay.
Activity Outcome Method [ACMD]: Description of how the activity outcome is determined. Choices of search query include:
- screening: reports number of 'Screening' assays - Single Concentration Activity Observed: Activity outcome was defined based on the percentage of inhibition from a test at a single dose
- confirmatory: reports number of 'Confirmatory' assays - Concentration-Response Relationship Observed: Activity outcome was defined based on EC50/IC50 values and so forth, derived from dose response curves following tests with multiple concentrations
- summary: reports number of 'Summary' assays - Candidate Probes/Leads with Supporting Evidence: An assay which summarizes information from multiple assays
- other: reports number of assays in the 'Other' category - An assay which does not fall into the above categories
AssayComment [ACMT, ACMMNT]: Comment for a BioAssay provided by the depositor.
AssayDescription [ADES, ADESC, ADSC]: Description for the BioAssay.
AssayName [ANAM, ANAME]: Name of a BioAssay provided by depositor.
AssayProtocol [APRL, APRTL]: Protocol for a BioAssay provided by depositor.
AssaySourceID [ASRD, ASRID]: External assay source identifier.
DepositDate [DDAT, DDATE]: Date when BioAssay record is deposited into PubChem. Date format is yyyy/mm/dd. mm and dd are optional.
GrantNumber [GRN, GRNUM]: NIH grant numbers.
ModifyDate [MDAT, MDATE]: Date when the content of a BioAssay record was last modified. Date format is yyyy/mm/dd. mm and dd are optional.
NucleicAcidReagentID [NARD, NARID]: NCBI Probe Database identifiers (ProbeDB IDs) referenced by a BioAssay.
PigGI [PIGI,PIGGI]: Identical sequence NCBI Protein GI number similar to a BioAssay target
ProbeCidCount [ACC, ACCNT]: Number of unique chemicals (identified by CID--compound identifiers from Pccompound) that are considered as probe in a BioAssay.
ProteinTargetGI [PTGI]: NCBI Protein GI number of a BioAssay protein target
ProteinTargetName [PTN]: NCBI Protein name of a BioAssay protein target
RNATargetGI [NARD]: NCBI Nucleotide GI number of a BioAssay nucleotide target
ReleaseDate [RDAT, RDATE]: Date when a BioAssay data is released to public by PubChem. Date format is yyyy/mm/dd. mm and dd are optional.
SourceCategory [SRCC, SRCCAT, SRCCATG]: Category of BioAssay data source
SourceName [SNME, SNAME]: Source name of a BioAssay data specified by depositor.
SynonymTested [SYNT]: MeSH names and synonyms that are associated with any chemical structure tested in a BioAssay.
TaxonomyName [TXNM,TXNAM,TXNAME]: NCBI Entrez Taxonomy name.
TotalSidCount [TSC]: Total number of substances tested in a BioAssay.
PubChem 3D |
PubChem generates a theoretical 3D description for each compound in the PubChem Compound database that meets all of the following conditions:
- It is not too large (<= 50 non-hydrogen atoms).
- It is not too flexible (<= 15 rotatable bonds).
- It consists of only organic elements (H, C, N, O, F, P, S, Cl, Br, and I).
- It has only a single covalent unit (i.e., it is not a salt or a mixture).
- It contains only atom types recognized by the MMFF94s force field.
For more information, please visit PubChem 3D Release Notes.
PubChem also provides 3D viewers for both desktop application and web-based interface.
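The eligibility rules listed above can be expressed as a simple predicate. The Python sketch below is only an illustration of those rules, not PubChem's actual pipeline: the function name `eligible_for_3d` and its parameters are hypothetical, and the MMFF94s atom-type check is passed in as a flag because it cannot be reproduced in a few lines.

```python
# Elements allowed for 3D description generation, as listed in the help text.
ORGANIC_ELEMENTS = {"H", "C", "N", "O", "F", "P", "S", "Cl", "Br", "I"}


def eligible_for_3d(heavy_atoms, rotatable_bonds, elements,
                    covalent_units, mmff94s_ok=True):
    """Return True if a compound meets all five listed conditions.

    mmff94s_ok stands in for the force-field atom-type check, which is
    assumed to be computed elsewhere.
    """
    return (
        heavy_atoms <= 50                           # not too large
        and rotatable_bonds <= 15                   # not too flexible
        and set(elements) <= ORGANIC_ELEMENTS       # organic elements only
        and covalent_units == 1                     # not a salt or mixture
        and mmff94s_ok                              # recognized atom types
    )


# Illustrative inputs: a small organic molecule and a sodium salt.
print(eligible_for_3d(13, 3, ["C", "H", "O"], 1))
print(eligible_for_3d(14, 3, ["C", "H", "O", "Na"], 2))
```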
PubChem FAQ |
- What is PubChem ?
- What is PubChem Substance ?
- What is PubChem Compound ?
- What does the depositor's category tell users and what are the existing depositor categories ?
- Why search PubChem Substance and/or Compound ?
- What is PubChem BioAssay ?
- How does PubChem assign Substance identifiers ? When a substance is revoked by the depositor, can I still see the old record ?
- How does PubChem assign Compound identifiers ? Will the structure represented by a CID ever change ?
- How does PubChem process my deposited structures ?
- How do I process a text search with PubChem databases ?
- How do I perform a structure search ?
- How do I save my search result ?
- Sometimes I see errors in a substance record. Where should I report them ?
- What are exact mass and monoisotopic mass for a substance/compound ?
- How do I find InChI version and parameters?
- What is the legacy designation ?
Q: What is PubChem ?
A: PubChem is a component of NIH's Molecular Libraries Roadmap Initiative. It provides information on the biological activities of small molecules. PubChem is organized as three linked databases within the NCBI's Entrez information retrieval system. These are PubChem Substance, PubChem Compound, and PubChem BioAssay. PubChem also provides a fast chemical similarity search tool.
Q: What is PubChem Substance ?
A: PubChem Substance records contain substance information electronically submitted to PubChem by depositors. This includes any chemical structure information submitted, as well as chemical names, comments, and links to the depositor's web site.
Q: What is PubChem Compound ?
A: PubChem compound records comprise a non-redundant set of standardized and validated chemical structures. A compound record may link to more than one PubChem Substance record, if different depositors supplied the same structure. Chemical names shown in PubChem Compound records are a composite derived from all linked substances, with default ranking of names by weighted frequency of use.
Q: What does the depositor's category tell users and what are the existing depositor categories ?
A: The depositor categories indicate the type of information one may expect to find when following the depositor substance URL, or the type of information provided by the depositor. The possible categories and their meanings are:
Biological Properties: Depositor provides information about the biological properties of a substance or compound.
Chemical Reactions: Depositor provides information about the reactivity, synthesis, or known reactions of a substance or compound.
Imaging Agents: Depositor provides information about the contrast agent or imaging agent used in, for example, MRIs.
Journal Publishers: Depositor is a journal publisher and has articles published about a substance or compound.
Metabolic Pathways: Depositor provides information on the metabolic pathways involving a substance or compound.
Molecular Libraries Screening Center Network: Depositor is part of the NIH Molecular Libraries Screening Center Network (MLSCN).
NIH Substance Repository: Depositor is an NIH Molecular Libraries Small Molecule Repository servicing the MLSCN.
Physical Properties: Depositor provides information about the experimental physical properties of a substance or compound.
Protein 3D Structures: Depositor provides information about the experimental 3-D structure of a substance or compound.
Substance Vendors: Depositor is a seller of a substance or compound.
Theoretical Properties: Depositor provides information about the theoretical properties of a substance or compound.
Toxicology: Depositor provides information about the toxicological properties of a substance or compound.
Q: Why search PubChem Substance and/or Compound ?
A: It is useful to search PubChem's Substance database when one is looking for information from a particular depositor exclusively, and/or when one is looking for information on substances such as natural product extracts which may not have associated chemical structure information. These special cases aside, it is generally most useful to search for chemical names or structures in PubChem's Compound database. This provides a concise view, combining information derived from multiple Substance records that specified the same structure. PubChem's structure search service operates on PubChem's Compound database exclusively.
Q: What is PubChem BioAssay ?
A: The PubChem BioAssay Database contains BioActivity screens of chemical substances described in PubChem Substance. It provides searchable descriptions of each BioAssay, including descriptions of the conditions and readouts specific to that screening procedure.
Q: How does PubChem assign Substance identifiers ? When a substance is revoked by the depositor, can I still see the old record ?
A: A PubChem Substance SID is assigned to each unique external registry ID provided by a PubChem data source. A depositor may "revoke" (or otherwise deprecate) a PubChem SID at any time for any reason. However, the link to the "revoked" PubChem SID lives on in perpetuity. There will be a message stating that the depositor deprecated the SID, but the link to the archived information will still be available. In addition, the PubChem CIDs pointed to by the old version of a PubChem SID at the time it was versioned or deprecated will also remain available.
Q: How does PubChem assign Compound identifiers ? Will the structure represented by a CID ever change ?
A: A PubChem Compound CID is assigned to each unique chemical structure. Different tautomeric forms of the same compound may have different CIDs. The chemical structure represented by a CID is permanent. The URL links to the compound summaries are stable (always live), regardless of whether any substance points to them.
Q: How does PubChem process my deposited structures ?
A: The deposited information goes through a series of validation steps (to confirm the structure is "valid") and then a series of standardization/normalization steps to remove valence bond (VB) redundancy.
The validation steps consist of:
- Atom verification: do all atoms correspond to a known atomic element? E.g., "*" is not a known atom.
- Implicit hydrogen assignment: implicit hydrogens are assigned to organic elements using simple valence rules, e.g., methane "C" gets four implicit hydrogens assigned to it.
- Functional group standardization: common incorrect and hypervalent representations of functional groups are "fixed", e.g., nitro groups represented by N(=O)=O become [N+](=O)[O-].
- Atom valence validation: do all atoms have an "allowed" valence? E.g., five bonds to carbon is not valid.
The standardization steps consist of:
- Valence bond (VB) canonicalization: equivalent/alternate VB/tautomeric forms of a structure are normalized into a single representation.
- Aromaticity detection: structure aromaticity is detected and validated to be kekulizable.
- Stereochemistry detection: SP3 and SP2 stereo centers are detected and stereo-wedge placement standardized.
- Explicit hydrogen assignment: implicit hydrogens are converted to be explicit.
Subsequent additional processing includes 2D coordinate layout assignment.
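To make the implicit hydrogen and valence validation steps described above more concrete, here is a minimal Python sketch of a "simple valence rule". It is an illustration only and not PubChem's actual implementation: the `DEFAULT_VALENCE` table and the `implicit_hydrogens` function are hypothetical, cover only a few neutral elements, and ignore charges and hypervalent atoms.

```python
# Simplified default valences for a few neutral organic elements.
# This table is an assumption of the sketch, not PubChem's rule set.
DEFAULT_VALENCE = {"C": 4, "N": 3, "O": 2, "F": 1, "Cl": 1, "Br": 1, "I": 1}


def implicit_hydrogens(element, bond_order_sum):
    """Return the number of implicit hydrogens for a neutral atom.

    bond_order_sum is the total bond order to explicit neighbours
    (a single bond counts 1, a double bond counts 2, and so on).
    Raises ValueError when the atom exceeds its allowed valence.
    """
    free = DEFAULT_VALENCE[element] - bond_order_sum
    if free < 0:
        raise ValueError(f"{element} with bond order sum {bond_order_sum} "
                         "exceeds its allowed valence")
    return free


print(implicit_hydrogens("C", 0))  # methane written as "C"
print(implicit_hydrogens("O", 1))  # hydroxyl oxygen

try:
    implicit_hydrogens("C", 5)     # five bonds to carbon is not valid
except ValueError as err:
    print("rejected:", err)
```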
Q: How do I process a text search with PubChem databases ?
A: PubChem's Substance, Compound, and BioAssay databases are fully integrated within NCBI's Entrez data retrieval system. You can process any name, keyword, or ID search through the Entrez system. The PubChem homepage also provides a search box. For a specific database query, see related content in the help document above.
Q: How do I perform a structure search ?
A: You can perform a structure search through the PubChem structure database. PubChem provides two search interfaces, basic structure search and advanced structure search. For more information, visit structure search help.
Q: How do I save my search result ?
A: To save your search from Entrez, you can use either the PubChem download facility or the Entrez generic search-save tool. You can get more help by clicking the two links above.
Q: Sometimes I see errors in a substance record. Where should I report them ?
A: PubChem doesn't have curators and never changes or edits substance records. They remain as supplied by our depositors, just as with GenBank records. You can follow the PubChem substance summary page to find the original record on the depositor's page and report the error there. Once the depositor corrects the error, PubChem will pick up the correction at the next update. For compound property/descriptor errors, you can report to the NCBI help desk.
Q: What are exact mass and monoisotopic mass for a substance/compound ?
A: Exact mass is the mass of the most likely isotopic composition for a single random molecule, corresponding to the mass of the most intense ion/molecule peak in a mass spectrum. Monoisotopic mass is the mass of a molecule calculated using the mass of the most abundant isotope of each element.
For example, carbon has a monoisotopic mass of 12.000 g/mol.
Exact mass and monoisotopic mass are the same for more than 90% of structures but differ when atom counts are such that presence of one or more lower abundance isotopes is most probable.
For example, carbon tetrachloride, CCl4, PubChem CID 5943, has an exact mass of 153.872, where, in this case, the prototypical compound is made of three 35Cl, one 37Cl, and one 12C. The monoisotopic mass for carbon tetrachloride is 151.875, where in this case all chlorine atoms are assumed to be 35Cl, with isotope abundance of 75.77%, and the carbon atom is assumed to be 12C, with isotope abundance of 98.9%. In many cases, these two masses are identical, except for compounds with four or more Cl atoms, two or more Br atoms, or other elements not dominated by a single isotope, or for really large compounds such as with the number of carbons greater than 99, i.e., for C100 the exact mass will be 12C * 99 + 13C * 1, while the monoisotopic mass will be 12C * 100.
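The carbon tetrachloride and C100 figures above can be reproduced with a short calculation. The Python sketch below is a simplified model under stated assumptions: each element is treated as having only two isotopes, the abundances for 12C (98.9%) and 35Cl (75.77%) are taken from the text, and the isotope masses are standard reference values that do not appear in this document. The function names are hypothetical.

```python
from math import comb

# element -> (light isotope mass, heavy isotope mass, abundance of light isotope)
# Two-isotope simplification; masses are standard reference values (assumption).
ISOTOPES = {
    "C": (12.0, 13.00335484, 0.989),
    "Cl": (34.96885268, 36.96590259, 0.7577),
}


def monoisotopic_mass(formula):
    """Mass using only the most abundant (here, the light) isotope."""
    return sum(ISOTOPES[el][0] * n for el, n in formula.items())


def most_probable_mass(formula):
    """Mass of the single most likely isotopic composition.

    Elements are independent, so the most likely number of heavy
    isotopes can be chosen per element from a binomial distribution.
    """
    total = 0.0
    for el, n in formula.items():
        light, heavy, p = ISOTOPES[el]
        best_k = max(
            range(n + 1),
            key=lambda k: comb(n, k) * p ** (n - k) * (1 - p) ** k,
        )
        total += light * (n - best_k) + heavy * best_k
    return total


ccl4 = {"C": 1, "Cl": 4}
print(round(most_probable_mass(ccl4), 3))   # exact mass in the sense used above
print(round(monoisotopic_mass(ccl4), 3))
print(most_probable_mass({"C": 100}) - monoisotopic_mass({"C": 100}))
```

For CCl4 the most probable composition contains one 37Cl, and for C100 it contains one 13C, which matches the description above.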
Q: How do I find InChI version and parameters?
A: The InChI version and parameters are detailed in the ASN.1 and XML data for each compound. For example, for aspirin, you can find the information you are seeking by viewing the ASN.1 record:
https://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=2244&disopt=DisplayASN1
If you scroll down to the InChI property record, you will find:
{
urn {
label "InChI",
datatype string,
parameters "options {auxnonr donotaddh w0 fixedh recmet newps}",
implementation "E_INCHI",
version "1.0.1",
software "InChI",
source "nist.gov",
release "2007.09.04"
},
value sval "InChI=1/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12)/f/h11H"
},
The InChI version is (currently) 1.0.1. We are using the parameter options "auxnonr donotaddh w0 fixedh recmet newps".
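If you have saved the ASN.1 text of a record, the quoted fields shown above can be pulled out with a regular expression. The Python sketch below works on a copy of the fragment shown above; the `urn_field` helper is hypothetical, and it is a plain text match, not an ASN.1 parser.

```python
import re

# Copy of the InChI property fragment shown above.
ASN1_FRAGMENT = '''
{
  urn {
    label "InChI",
    datatype string,
    parameters "options {auxnonr donotaddh w0 fixedh recmet newps}",
    implementation "E_INCHI",
    version "1.0.1",
    software "InChI",
    source "nist.gov",
    release "2007.09.04"
  },
  value sval "InChI=1/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12)/f/h11H"
},
'''


def urn_field(text, name):
    """Return the quoted value that follows the field name, or None."""
    match = re.search(rf'\b{re.escape(name)}\s+"([^"]*)"', text)
    return match.group(1) if match else None


print(urn_field(ASN1_FRAGMENT, "version"))
print(urn_field(ASN1_FRAGMENT, "parameters"))
```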
Q: What is the legacy designation ?
A: PubChem uses a "legacy" designation to give users the option to filter collections that are not regularly updated. For more information, please see our legacy designation help page.
PubChem Courses |
PubChem training materials, including slides and exercises, are available as part of A Librarian's Guide to NCBI. The last day of that five-day program includes coverage of Drugs and other small bioactive molecules (slides, exercises).
NCBI has also provided PubChem training courses in the past. Although these courses have been superseded by the newer Discovery Workshops (accessible from the NCBI Education page), the PubChem course materials are still available and helpful in understanding the PubChem resources.
PubChem Documents |
Direct Link Services |
PubChem provides direct-link URLs that let users retrieve data by valid ID.
- PubChem substance summary:
//pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?sid=[a valid SID]
- PubChem compound summary:
//pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=[a valid CID]
- Structure Clustering:
Compound: The url for a short list of compound IDs (CIDs) is //pubchem.ncbi.nlm.nih.gov/assay/assayHeatmap.cgi?service=structcluster&cid=[cidlist], where "cidlist" is a comma separated list of CID. If the list of IDs is too long, the post form should be used for "cid=[cidlist]". If Entrez history is used, the url is //pubchem.ncbi.nlm.nih.gov/assay/assayHeatmap.cgi?service=structcluster&cquery_key=[history key].
Substance: The url for a short list of substance IDs (SIDs) is //pubchem.ncbi.nlm.nih.gov/assay/assayHeatmap.cgi?service=structcluster&sid=[sidlist], where "sidlist" is a comma separated list of SID. If the list of IDs is too long, the post form should be used for "sid=[sidlist]". If Entrez history is used, the url is //pubchem.ncbi.nlm.nih.gov/assay/assayHeatmap.cgi?service=structcluster&squery_key=[history key].
- PubChem bioassay services:
Bioassay summary:
//pubchem.ncbi.nlm.nih.gov/assay/assay.cgi?aid=[a valid AID]
Bioassay datatable:
//pubchem.ncbi.nlm.nih.gov/assay/assay.cgi?q=r&aid=[a valid AID]
Bioactivity summary:
Assay-centric: If the inputs are assays and compounds, the url is //pubchem.ncbi.nlm.nih.gov/assay/assay.cgi?q=cids&aid=[valid AIDs, comma separated]&cid=[valid CIDs, comma separated]. If the inputs are assays and substances, the url is //pubchem.ncbi.nlm.nih.gov/assay/assay.cgi?q=sids&aid=[valid AIDs, comma separated]&sid=[valid SIDs, comma separated]. If Entrez history is used, you can replace "aid=..." with "aquery_key=...", replace "cid=..." with "cquery_key=...", and replace "sid=..." with "squery_key=...".
Target-centric: If the inputs are targets and compounds, the url is //pubchem.ncbi.nlm.nih.gov/assay/assaytool.cgi?q=tgt&gi=[valid GIs, comma separated]&cid=[valid CIDs, comma separated]. If the inputs are assays and compounds, the url is //pubchem.ncbi.nlm.nih.gov/assay/assaytool.cgi?q=tgt&aid=[valid AIDs, comma separated]&cid=[valid CIDs, comma separated]. If the inputs are assays and substances, the url is //pubchem.ncbi.nlm.nih.gov/assay/assaytool.cgi?q=tgt&aid=[valid AIDs, comma separated]&sid=[valid SIDs, comma separated]. If Entrez history is used, you can replace "gi=..." with "gquery_key=...", replace "aid=..." with "aquery_key=...", replace "cid=..." with "cquery_key=...", and replace "sid=..." with "squery_key=...".
Compound-centric: If the inputs are targets and compounds, the url is //pubchem.ncbi.nlm.nih.gov/assay/assaytool.cgi?q=cmp&gi=[valid GIs, comma separated]&cid=[valid CIDs, comma separated]. If the inputs are assays and compounds, the url is //pubchem.ncbi.nlm.nih.gov/assay/assaytool.cgi?q=cmp&aid=[valid AIDs, comma separated]&cid=[valid CIDs, comma separated]. If Entrez history is used, you can replace "gi=..." with "gquery_key=...", replace "aid=..." with "aquery_key=...", and replace "cid=..." with "cquery_key=...".
Structure-Activity Analysis:
Compounds and BioAssays: The url for a short list of compound IDs (CIDs) and BioAssay IDs (AIDs) is //pubchem.ncbi.nlm.nih.gov/assay/assayHeatmap.cgi?service=assaycluster&cid=[cidlist]&aid=[aidlist]. If the list of IDs is too long, the post form should be used for "cid=[cidlist]" and "aid=[aidlist]", where "cidlist/aidlist" is a comma separated list of CID/AID. If Entrez history is used, the url is //pubchem.ncbi.nlm.nih.gov/assay/assayHeatmap.cgi?service=assaycluster&cquery_key=[compound history key]&aquery_key=[assay history key].
Substances and BioAssays: The url for a short list of substance IDs (SIDs) and BioAssay IDs (AIDs) is //pubchem.ncbi.nlm.nih.gov/assay/assayHeatmap.cgi?service=assaycluster&sid=[sidlist]&aid=[aidlist]. If the list of IDs is too long, the post form should be used for "sid=[sidlist]" and "aid=[aidlist]", where "sidlist/aidlist" is a comma separated list of SID/AID. If Entrez history is used, the url is //pubchem.ncbi.nlm.nih.gov/assay/assayHeatmap.cgi?service=assaycluster&squery_key=[history key]&aquery_key=[assay history key].
Options: Extra parameters can be added to the url to change the default display. Adding "&compound_cluster=[number]" clusters the compounds/substances based on "Structure Similarity (0)" or "Activity Similarity (1)". Adding "&assay_cluster=[number]" clusters the BioAssays based on "Activity Similarity (0)", "Protein Target Similarity (1)", or "Depositor-Specified Similarity (2)". Furthermore, adding "&exportImage=1" exports the full image directly.
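As an illustration of how the clustering options combine with the base URL, here is a small Python sketch. The function name, the keyword names ("structure", "activity", "target", "depositor"), and the "https:" scheme are assumptions made for this example; only the query parameters and their numeric codes come from the text above.

```python
from urllib.parse import urlencode

# Assumption: the scheme-relative URL in the text is served over https.
HEATMAP = "https://pubchem.ncbi.nlm.nih.gov/assay/assayHeatmap.cgi"

# Numeric codes as listed under "Options".
COMPOUND_CLUSTER = {"structure": 0, "activity": 1}
ASSAY_CLUSTER = {"activity": 0, "target": 1, "depositor": 2}


def heatmap_url(aids, cids, compound_cluster=None, assay_cluster=None,
                export_image=False):
    """Build a Structure-Activity Analysis URL for compounds and BioAssays."""
    params = {
        "service": "assaycluster",
        "cid": ",".join(str(c) for c in cids),
        "aid": ",".join(str(a) for a in aids),
    }
    if compound_cluster is not None:
        params["compound_cluster"] = COMPOUND_CLUSTER[compound_cluster]
    if assay_cluster is not None:
        params["assay_cluster"] = ASSAY_CLUSTER[assay_cluster]
    if export_image:
        params["exportImage"] = 1
    # safe="," keeps the comma separators unescaped in the query string.
    return HEATMAP + "?" + urlencode(params, safe=",")
```

Options that are not passed are simply omitted from the URL, so the service's default display applies.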
Bioactivity datatable:
//pubchem.ncbi.nlm.nih.gov/assay/assay.cgi?q=cidsr&aid=[valid AIDs, comma separated]&cid=[valid CIDs, comma separated]
//pubchem.ncbi.nlm.nih.gov/assay/assay.cgi?q=sidsr&aid=[valid AIDs, comma separated]&sid=[valid SIDs, comma separated]
Bioassay select:
//pubchem.ncbi.nlm.nih.gov/assay/assay.cgi?q=t&aid=[a valid AID]
Bioassay plot:
//pubchem.ncbi.nlm.nih.gov/assay/assay.cgi?q=p&aid=[a valid AID]
Dose-Response Curve: The url for Dose-Response Curve is //pubchem.ncbi.nlm.nih.gov/assay/assayHeatmap.cgi?service=assaygraph&DID=[DID list].
"DID list" is a comma separated list of "AID_SCID_[chemical type]", such as "DID=523_3243128_0". Among the three items in "AID_SCID_[chemical type]", the first one "AID" is a BioAssay ID. The second one "SCID" is the CID for a compound, or the SID for a substance. The third one "chemical type" is "0" for a compound, and "1" for a substance.
Many pairs of assay and chemical can be used in the "DID list". The Dose-Response Curve shows one figure for each pair.
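The "DID list" encoding described above can be sketched in a few lines of Python. The helper names and the "https:" scheme are assumptions for illustration; the entry layout and the example value "523_3243128_0" come from the text.

```python
# Assumption: the scheme-relative URL in the text is served over https.
HEATMAP = "https://pubchem.ncbi.nlm.nih.gov/assay/assayHeatmap.cgi"


def did_entry(aid, scid, is_substance=False):
    """Encode one assay/chemical pair as AID_SCID_[chemical type].

    The chemical type is 0 when scid is a CID and 1 when it is an SID.
    """
    chemical_type = 1 if is_substance else 0
    return f"{aid}_{scid}_{chemical_type}"


def dose_response_url(entries):
    """Join DID entries with commas; one figure is shown per pair."""
    return f"{HEATMAP}?service=assaygraph&DID={','.join(entries)}"
```

Calling did_entry(523, 3243128) reproduces the example entry "523_3243128_0" from the text.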
Scatter Plot and Histogram: The urls for Scatter Plot and Histogram are //pubchem.ncbi.nlm.nih.gov/assay/assayHeatmap.cgi?service=assaygraph&TID=[TID list] and //pubchem.ncbi.nlm.nih.gov/assay/assayHeatmap.cgi?service=assayhistogram&TID=[TID list], respectively.
"TID list" is a comma separated list of "AID_TID_[data type]", such as "TID=1_rank_2,1_1_1". Among the three items in "AID_TID_[data type]", the first one is a BioAssay ID. The second one is the TID as shown in the "Result Definitions" of a BioAssay Summary. There is no Scatter Plot or Histogram for "Outcome". The term "rank" is used for "Score". The third one is the data "Type" as shown in the "Result Definitions" of a BioAssay Summary. The term "1" is used for "Float", and "2" is used for "Integer".
Different BioAssay IDs can be used in the "TID list". The "Scatter Plot" will show figures for all TID pairs. The "Histogram" will show figures for each TID.
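The "TID list" follows the same pattern as the "DID list". The Python sketch below encodes it under the rules stated above; the helper names, the lowercase type keywords, and the "https:" scheme are assumptions introduced for the example.

```python
# Assumption: the scheme-relative URLs in the text are served over https.
HEATMAP = "https://pubchem.ncbi.nlm.nih.gov/assay/assayHeatmap.cgi"

# Data type codes from the text: 1 for "Float", 2 for "Integer".
DATA_TYPE_CODE = {"float": 1, "integer": 2}

# Service names from the two URLs given in the text.
SERVICE = {"scatter": "assaygraph", "histogram": "assayhistogram"}


def tid_entry(aid, tid, data_type):
    """Encode one result column as AID_TID_[data type].

    tid is the TID from "Result Definitions", or "rank" for "Score".
    """
    return f"{aid}_{tid}_{DATA_TYPE_CODE[data_type]}"


def plot_url(kind, entries):
    """Build a Scatter Plot ("scatter") or Histogram ("histogram") URL."""
    return f"{HEATMAP}?service={SERVICE[kind]}&TID={','.join(entries)}"
```

With entries tid_entry(1, "rank", "integer") and tid_entry(1, 1, "float"), the joined list is "1_rank_2,1_1_1", which matches the example in the text.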
Related BioAssays by Target Similarity:
//pubchem.ncbi.nlm.nih.gov/assay/assayHeatmap.cgi?service=assayneighbor&similarity=target&aid=[assay ID]
Related BioAssays by Activity Overlap:
//pubchem.ncbi.nlm.nih.gov/assay/assayHeatmap.cgi?service=assayneighbor&similarity=activity&aid=[assay ID]
Related BioAssays by Depositor:
//pubchem.ncbi.nlm.nih.gov/assay/assayHeatmap.cgi?service=assayneighbor&similarity=depositor&aid=[assay ID]
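The three "Related BioAssays" URLs differ only in the "similarity" value, so a single helper can cover them. This is a hedged Python sketch: the function name and the "https:" scheme are assumptions, and the AID in the usage note is a placeholder.

```python
# Assumption: the scheme-relative URLs in the text are served over https.
HEATMAP = "https://pubchem.ncbi.nlm.nih.gov/assay/assayHeatmap.cgi"

# The three similarity values that appear in the URLs above.
SIMILARITIES = ("target", "activity", "depositor")


def related_assays_url(aid, similarity):
    """Build a Related BioAssays URL for one assay ID."""
    if similarity not in SIMILARITIES:
        raise ValueError(f"similarity must be one of {SIMILARITIES}")
    return f"{HEATMAP}?service=assayneighbor&similarity={similarity}&aid={int(aid)}"
```

For instance, related_assays_url(523, "target") returns the Target Similarity URL for AID 523.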
Glossary |
AID: PubChem's BioAssay (protocol) identifier, a non-zero integer.
- IC50: the concentration of a compound where 50% of its inhibitory activity is observed (See https://en.wikipedia.org/wiki/IC50)
- EC50: the concentration of a compound where 50% of its maximal effect is observed (See https://en.wikipedia.org/wiki/EC50)
- Kd: the equilibrium dissociation constant for the ligand, determined directly in a binding assay using a labelled ligand (See http://www.guidetopharmacology.org/helpPage.jsp and https://en.wikipedia.org/wiki/Dissociation_constant)
- Ki: the equilibrium dissociation constant for the ligand, determined in inhibition studies (See http://www.guidetopharmacology.org/helpPage.jsp and https://en.wikipedia.org/wiki/Competitive_inhibition)
- AC50/Potency: the concentration of a compound where 50% of the activity is observed. AC50 and Potency are often used in an exchangeable way among PubChem BioAssay submissions, and may represent IC50, EC50, CC50 etc. Please refer to a specific BioAssay record for details.
CID: PubChem's compound identifier, a non-zero integer for a unique chemical structure.
Complexity: The complexity rating of a compound is a rough estimate of how complicated its structure is, seen from both the point of view of the elements contained and the displayed structural features, including symmetry. However, neither stereochemistry nor isotope labeling is used as an auxiliary criterion. The value is computed using the Bertz/Hendrickson/Ihlenfeldt formula. A scaling factor for aromaticity is used so that the complexity of benzene is the same as that of cyclohexane. It is a floating point value, ranging from 0 (simple ions) to several thousand (complex natural products). Generally, larger compounds are more complex than smaller ones, but highly symmetrical compounds, or compounds with few distinct atom types or elements, are downgraded. Complexity is only loosely correlated with synthetic accessibility. The most complex compound in PubChem is CID 6338588 (C124H185N9O207S36) with a complexity rating of about 18425. The average complexity of the structures in the PubChem Compound database is about 551.
Comments: Lists all of the depositor's comments and additional information for a substance.
Component: In a mixture substance/compound, a component is one of the individual molecules.
Compound: The chemical representative of a substance. The chemical structure presented in a compound is standardized through PubChem's data pipeline. A mixture substance may have several standardized compounds. A compound record is structurally unique in the PubChem Compound database.
Computed Descriptors: Information to describe the compound in different formats, including SMILES, InChI, IUPAC names.
Computed Properties: These data are calculated from the compound, including molecular weight, formula, XLogP, etc.
Depositors Category: The depositor's category tells users that additional category-specific information is available either on the depositor's substance summary page or on the depositor's website.
Deprecated Compound: A Compound CID which has no links to any substance. This may occur as PubChem modifies processing. A deprecated compound will not be available within Entrez.
HBA: Number of hydrogen-bond acceptors in the structure. Classification of hydrogens follows [J. Chem. Inf. Comput. Sci. 1997, 37, 615-621].
HBD: Number of hydrogen-bond donors in the structure. Classification of hydrogens follows [J. Chem. Inf. Comput. Sci. 1997, 37, 615-621].
Heavy Atom: All atoms except hydrogen.
InChI: IUPAC International Chemical Identifier. An InChI string can be searched through the Entrez PubChem databases.
Old Version Substance: Substance versions are considered to be "old" when a more recent update is provided by the depositor.
Molecular Formula: A way of expressing information about the atoms that constitute a particular chemical molecule.
Molecular Weight: The molecular weight is the sum of all atomic weights of the constituent atoms in a compound, measured in g/mol. In the absence of explicit isotope labeling, averaged natural abundance (which may, for example in the case of Li and U compounds, not be identical to purchasable material) is assumed. If an atom bears an explicit isotope label, 100% isotopic purity is assumed at this location, even for short-lived radioactive isotopes where this is often physically unrealistic. At this moment, it is not possible to deposit more detailed isotope composition information into the PubChem database. Pseudo-atoms which are not an element have an atomic weight of 0 g/mol.
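The three rules in the Molecular Weight definition (average weights by default, a single isotope mass when an atom is labeled, and 0 g/mol for pseudo-atoms) can be illustrated with a short Python sketch. This is not PubChem's implementation: the weight tables hold only a few rounded values chosen for the example, the pseudo-atom symbols are hypothetical, and the values PubChem uses may differ in the last digits.

```python
# Rounded average atomic weights in g/mol (illustrative subset only).
AVERAGE_WEIGHT = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999}

# Rounded masses of specific isotopes in g/mol (illustrative subset only).
ISOTOPE_MASS = {("H", 2): 2.0141, ("C", 13): 13.00335}

# Hypothetical pseudo-atom symbols; they contribute 0 g/mol.
PSEUDO_ATOMS = {"R", "*"}


def molecular_weight(atoms):
    """Sum atomic weights for a list of (symbol, isotope) pairs.

    isotope is None for an unlabeled atom (natural abundance assumed)
    or a mass number for a labeled atom (100% purity assumed).
    """
    total = 0.0
    for symbol, isotope in atoms:
        if symbol in PSEUDO_ATOMS:
            continue  # pseudo-atoms weigh 0 g/mol
        if isotope is None:
            total += AVERAGE_WEIGHT[symbol]
        else:
            total += ISOTOPE_MASS[(symbol, isotope)]
    return total


water = [("H", None), ("H", None), ("O", None)]
heavy_water = [("H", 2), ("H", 2), ("O", None)]
```

With these rounded tables, water comes to about 18.015 g/mol and fully deuterated water to about 20.027 g/mol, which shows how an explicit isotope label replaces the natural-abundance average at that position.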
Revoked BioAssay: When a depositor removes an assay that the depositor previously deposited into PubChem, the assay is considered revoked. A revoked assay will not be available within Entrez.
Revoked Substance: When a depositor removes a substance from their substance collection, the substance is considered revoked. A revoked substance will not be available within Entrez.
SID: PubChem's substance identifier, a non-zero integer for a deposited substance.
SMILES: Simplified Molecular Input Line Entry System, a line notation (a typographical method using printable characters) for entering and representing molecules. You can also find more related information from PubChem's document section in PDF or Text.
SMARTS: A language that allows you to specify substructures using rules that are straightforward extensions of SMILES.
Stereochemistry: Relative spatial arrangement of atoms within molecules, such as chirality.
Substance: Individual record object collected from depositors, representing a sample used in a BioAssay.
Substance Category: Substance categories (one or more) are assigned to each depositor, based on the nature of that depositor's institution and the type of data they supply.
Suppressed Compound: A Compound CID that links only to an old version substance. A suppressed compound will not be available within Entrez.
Synonyms: All names, trivial names, synonyms, frequently used IDs, and other names collected from depositors. In the compound summary page, synonyms are distinct synonyms from all corresponding substances.
TPSA: Topological Polar Surface Area. This is an estimate of the area (in Å²) which is polar. The implementation follows [J. Med. Chem. 2000, 43, 3714-3717]. It is a simple method: only N and O are considered, 3D coordinates are not used, and there are various precomputed factors for different hybridizations, charges and participation in aromatic systems.
Version: PubChem substance version number is incremented when an update is provided by the depositor.
Xref: The external references/links to PubChem database records.
XLogP: A partition coefficient or distribution coefficient that is a measure of differential solubility of a compound in two solvents.
Since February 2009, PubChem has used version 3 of the algorithm to generate the XLogP value [J. Chem. Inf. Model. 2007, 47, 2140-2148]. You can also visit the XLogP3 website: http://www.sioc-ccbg.ac.cn/software/xlogp3/.