
MLSMR Compounds


Algorithmic Procedures for Compound Selection for the MLSMR Collection

The MLSMR collection of over 300K compounds was built up primarily in three stages of 100K compounds each. Below is a broad description of the algorithmic procedures used to construct the set. For more details, please contact Jamie Driscoll at NIMH.

Stage 1 (the 1st 100K set): Compounds in the MLSMR collection are grouped into four generic categories: (a) specialty sets (SS), such as the set of known drugs and toxins; (b) targeted libraries (TL); (c) natural products (NP); and (d) diversity compounds (DC), that is, a diverse set that complements the first three categories. The target breakdown of compounds across these four categories is roughly 2%, 15%, 3%, and 80%, respectively.

A standard vendor-supplied purity of > 90%, availability (10 mg), and re-supply (20 mg) were required. For some categories (see below), two additional criteria were applied: a calculated water solubility of > 20 ug/mL, and compound stability, which was assessed by implementing excluded-functionality filters using Daylight SMARTS.

It is important to note that the purity of 100% of the compounds was experimentally verified by DPI using LC-MS before each compound was included in the MLSMR collection.

The target for the SS set was to obtain a collection of 1500 drugs and toxins. These compounds were not filtered for diversity, Lipinski Rule of Five, calculated water solubility (using the Tetko AlogPS program), or substructure content.

For the NP set, compounds had to have an exact structural match in either the Chapman and Hall Natural Products Database or the Wiley AntiBase in order to be collected. The NP set targeted genuine, pure, isolated natural products with known structures; NP extracts and broths were excluded. Compounds also had to meet the following criteria: (a) 10 mg availability, (b) MW < 2500, and (c) supplier-stated purity of > 90%. These compounds were not filtered for diversity. The calculated water solubility and substructure content filters (mentioned above) were applied. The physico-chemical property cutoffs applied were MW < 1000, cLogP < 5, HBA < 20, and HBD < 10.

The 15K TL set was divided roughly equally among protease, kinase, GPCR, ion channel, and nuclear receptor targets. The vendor-specified class was taken as given without any further analysis. The Lipinski Rule of Five, calculated water solubility, and substructure content filters were applied to this set.

For the 80K DC set, Daylight fingerprints and clustering procedures were used to generate the final list of compounds to purchase. Topological fingerprints for the diversity analysis were calculated with the Daylight Similarity toolkit, using a 4096-bit fingerprint (without folding) based on all pathways up to 14 bonds in length. No attempt was made to look at inter-category diversity.

The diversity of each chosen vendor’s filtered set was assessed using the Willett average pairwise cosine similarities method: each set measured between 0.72 and 0.75. A set of diverse structures within each compound class was chosen according to the following algorithm:

  1. From the first vendor, select the structure closest to the cosine centroid (Willett method).
  2. Change to the next vendor.
  3. Calculate the Tanimoto similarity between each selected structure and each of the current vendor’s remaining structures.
  4. From among the current vendor’s remaining structures, choose the structure most dissimilar to the selected set (Willett MaxMin algorithm).
  5. Add the structure to the selected set.
  6. Return to step 2 (until an adequate number of diverse structures is selected).
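The six steps above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the code MLSMR ran: fingerprints are modelled as plain Python sets of "on" bit positions rather than Daylight fingerprints, the names `tanimoto`, `closest_to_centroid`, and `select_diverse` are hypothetical, and ties are broken alphabetically by structure name.

```python
from collections import Counter
from math import sqrt

def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def closest_to_centroid(fps):
    """Name of the structure with the highest cosine similarity to the
    centroid (per-bit counts summed over all fingerprints) of `fps`."""
    counts = Counter(bit for fp in fps.values() for bit in fp)
    norm = sqrt(sum(c * c for c in counts.values()))

    def cosine(fp):
        if not fp or norm == 0:
            return 0.0
        return sum(counts[b] for b in fp) / (sqrt(len(fp)) * norm)

    return max(sorted(fps), key=lambda name: cosine(fps[name]))

def select_diverse(vendors, n_wanted):
    """vendors: list of {structure name: fingerprint set}, one dict per vendor.
    Returns the names of the selected structures in selection order."""
    remaining = [dict(v) for v in vendors]          # work on copies
    first = closest_to_centroid(remaining[0])       # step 1
    selected = [(first, remaining[0].pop(first))]
    i = 0
    while len(selected) < n_wanted and any(remaining):
        i = (i + 1) % len(remaining)                # step 2: next vendor
        pool = remaining[i]
        if not pool:
            continue
        # Steps 3-4: for each candidate, find its highest similarity to the
        # selected set, then take the candidate whose value is lowest (MaxMin).
        pick = min(
            sorted(pool),
            key=lambda name: max(tanimoto(pool[name], fp) for _, fp in selected),
        )
        selected.append((pick, pool.pop(pick)))     # step 5
    return [name for name, _ in selected]           # step 6 is the loop itself

# Toy usage with made-up fingerprints for two vendors.
vendor_a = {"a1": {1, 2, 3}, "a2": {1, 2, 4, 5}, "a3": {7, 8, 9}}
vendor_b = {"b1": {1, 2, 3, 4}, "b2": {10, 11, 12}}
print(select_diverse([vendor_a, vendor_b], 3))
```

In this toy run the seed comes from the first vendor, and each later pick is the structure from the current vendor that is least similar to everything already chosen.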

Compounds were collected in micro-clusters containing up to five structures. The micro-clusters were designed to provide (1) incipient SAR around a screening hit, and (2) re-supply of a similar compound for those that could not be replenished from their source.

To meet the micro-cluster objectives, MLSMR chose up to four nearest neighbors for each selected diverse structure, provided the Tanimoto similarity was at least 0.85 (to set a baseline similarity to the selected diverse structure) but not more than 0.99 (to ensure exclusion of duplicates).
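As a rough illustration of the neighbor rule just described, the sketch below picks up to four nearest neighbors whose Tanimoto similarity to the seed lies between 0.85 and 0.99 inclusive. It assumes fingerprints are sets of on-bit positions; the function name `microcluster` and its parameter names are hypothetical and are not taken from the MLSMR implementation.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def microcluster(seed_fp, candidates, low=0.85, high=0.99, max_neighbors=4):
    """Return the names of up to `max_neighbors` candidates, nearest first,
    whose similarity to the seed falls within [low, high].

    candidates: {structure name: fingerprint set}
    """
    scored = []
    for name in sorted(candidates):                 # sorted for stable ties
        sim = tanimoto(seed_fp, candidates[name])
        if low <= sim <= high:                      # drops duplicates (> high)
            scored.append((sim, name))
    scored.sort(key=lambda pair: -pair[0])          # most similar first
    return [name for _, name in scored[:max_neighbors]]

# Toy usage: a 20-bit seed and a handful of made-up neighbors.
seed = set(range(20))
neighbors = {
    "dup": set(range(20)),   # similarity 1.0, rejected as a duplicate
    "n1": set(range(19)),
    "n2": set(range(21)),
    "n3": set(range(18)),
    "n4": set(range(22)),
    "n5": set(range(23)),
    "far": set(range(10)),   # similarity 0.5, rejected as too dissimilar
}
print(microcluster(seed, neighbors))
```

Together with the seed structure, the four neighbors give the maximum micro-cluster size of five structures.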

Stage 2 (the 2nd 100K set): Procedures similar to those of Stage 1 were largely used to select compounds for this round, with the following exceptions:

  • Excluded functionality filters were modified to remove “druggability” criteria, reflecting the fact that MLI’s aim is to find chemical probes, not necessarily drugs.
  • Given the difficulties in sourcing the SS and NP sets, no diversity selection was applied to them, with the aim of obtaining as many compounds as possible.
  • Two different diversity approaches were adopted in this round as described below.
  • For the TL and DC collections, the physico-chemical property requirements from Stage 1 were relaxed. Four physico-chemical categories, (A), (B), (C), and (D), were devised with the requirement that roughly >=25%, >=50%, >=75%, and 100% of the entire MLSMR collection would belong to categories (A) through (D), respectively.
    • (A): MW <= 300; ClogP <= 3; HBD <= 3; HBA <= 6; calculated solubility >= 40 ug/mL
    • (B): MW <= 400; ClogP <= 4; HBD <= 4; HBA <= 8; calculated solubility >= 30 ug/mL
    • (C): MW <= 500; ClogP <= 5; HBD <= 5; HBA <= 10; calculated solubility >= 20 ug/mL
    • (D): MW <= 600; ClogP <= 6; HBD <= 6; HBA <= 12; calculated solubility >= 10 ug/mL
  • In addition to the Tetko calculated solubility, the ACD Labs Solubility Batch software was used to calculate aqueous solubilities.
  • In addition to the Daylight topological fingerprints used in Stage 1, another diversity metric based on the MDL MACCS keys was introduced. These methods were used iteratively so that around half of the compounds selected came from each method.
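The four physico-chemical categories (A) to (D) listed above are nested, so a compound can be labelled with the strictest category it satisfies. The sketch below illustrates this; the thresholds come from the list, while the function name `assign_category` and the use of a single calculated solubility value (rather than separate Tetko and ACD Labs results) are assumptions made for illustration.

```python
# (category, max MW, max ClogP, max HBD, max HBA, min solubility in ug/mL),
# ordered from strictest to most relaxed.
CATEGORIES = [
    ("A", 300, 3, 3, 6, 40),
    ("B", 400, 4, 4, 8, 30),
    ("C", 500, 5, 5, 10, 20),
    ("D", 600, 6, 6, 12, 10),
]

def assign_category(mw, clogp, hbd, hba, solubility):
    """Return the strictest category ("A" to "D") the compound satisfies,
    or None if it fails even category (D)."""
    for name, max_mw, max_clogp, max_hbd, max_hba, min_sol in CATEGORIES:
        if (mw <= max_mw and clogp <= max_clogp and hbd <= max_hbd
                and hba <= max_hba and solubility >= min_sol):
            return name
    return None

# Toy usage with made-up property values.
print(assign_category(mw=250, clogp=2.0, hbd=2, hba=4, solubility=50))
print(assign_category(mw=450, clogp=4.5, hbd=2, hba=4, solubility=25))
```

Because every compound in (A) also satisfies (B), (C), and (D), the >=25%, >=50%, >=75%, and 100% targets read naturally as cumulative fractions of the collection.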

    Stage 3 (the 3rd 100K set): In this case, all of the filter criteria established in Stage 2 were retained, with the following differences:

    • TL compounds based on vendor annotations were dropped as a specific sub-category.
    • The diversity implementation was completely altered.
    • The DTP/NCI compound collection was included along with commercial vendor supplied compounds for this round of purchases.

    One of the aims of building a specific set for screening, apart from diversity, is to ease the creation of potential SAR hypotheses that medicinal chemists can use to improve molecules obtained from the screening collection. Since the natural language of medicinal chemists is substructures, chemotypes, or scaffolds obtained from 2D representations of molecules, a computational procedure that is faithful to this language was used for building out the MLSMR collection. Specifically, an implementation of the Bemis-Murcko definition of scaffolds was used for all compound selection.

    The concept of TL compounds was replaced by “known biologically active molecules”. The set of known biologically active molecules made available by GVK Biosciences was used to define this space. The GVK collection is drawn from published articles and patents and is human curated. The detailed GVK target/disease annotations were grouped into generic bins (e.g., proteases, kinases, GPCRs, HIV, heart disease, etc.) to ensure some level of “biological diversity”. About 30K of the compounds purchased in this round were targeted using known bioactives.

    Detailed Approach: The GVK collection, our vendor collection, and the existing MLSMR collection were all converted into Bemis-Murcko scaffolds.

    • “Known Biologicals”: Compounds in the vendor collection with a scaffold-level similarity (based on topological torsions) of > 0.96 to the GVK set were tagged as “known biologicals”.
    • A machine-learning-based predictive model that separates the existing MLSMR collection from the vendor collection at the scaffold level was built. This extensively cross-validated model was used to rank-order the scaffolds in the vendor collection by how different they are from the existing Stage 2 collection. Topological torsions were used as descriptors.
    • Finally, compounds were selected from scaffolds in this rank-ordered list that contained at least 5-10 compounds. If a scaffold did not contain enough compounds, compounds from neighboring scaffolds (based on a scaffold-based distance matrix using topological torsions) were selected for inclusion.
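The final selection step above, including the fall-back to neighboring scaffolds, might look roughly like the sketch below. It is a hedged illustration only. The text does not say how many compounds were taken per scaffold or what happened when neighbors could not make up the shortfall, so `min_size=5`, `max_size=10`, and skipping under-filled scaffolds are assumptions. The function name `pick_from_ranked_scaffolds` and the input layout are hypothetical, and the scaffold ranking and neighbor lists are taken as given rather than computed from topological torsions.

```python
def pick_from_ranked_scaffolds(ranked, members, neighbors, min_size=5, max_size=10):
    """Walk scaffolds in rank order and pick compounds for each one.

    ranked:    scaffold ids, most novel first (assumed output of the model)
    members:   {scaffold id: [compound ids]} for the vendor collection
    neighbors: {scaffold id: [neighboring scaffold ids, nearest first]}
    Returns {scaffold id: [selected compound ids]}.
    """
    chosen = {}
    used = set()                                    # never pick a compound twice
    for scaffold in ranked:
        picks = [c for c in members.get(scaffold, []) if c not in used][:max_size]
        # Borrow from neighboring scaffolds only while below the minimum.
        for nb in neighbors.get(scaffold, []):
            if len(picks) >= min_size:
                break
            for c in members.get(nb, []):
                if len(picks) >= min_size:
                    break
                if c not in used and c not in picks:
                    picks.append(c)
        if len(picks) >= min_size:                  # assumption: skip otherwise
            chosen[scaffold] = picks
            used.update(picks)
    return chosen

# Toy usage with made-up scaffold and compound ids.
members = {
    "s1": ["c1", "c2", "c3", "c4", "c5", "c6"],
    "s2": ["d1", "d2", "d3"],
    "s3": ["e1", "e2", "e3"],
    "s4": ["f1"],
}
neighbors = {"s2": ["s3"], "s4": []}
print(pick_from_ranked_scaffolds(["s1", "s2", "s4"], members, neighbors))
```

Here `s2` is topped up from its neighbor `s3`, while `s4` is dropped because neither it nor its (empty) neighbor list can supply five compounds.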