Molecular Informatics
Cheminformatics Basics
Representing Molecules
Substructure
Molecular Similarity
Chemical Registration
What is Cheminformatics?
Cheminformatics is a cross between Computer Science and Chemistry: the process of storing and retrieving information about chemical compounds.
Information Systems are concerned with
storing, retrieving, and searching for information, and with storing relationships between bits of data. For example:
Operation | Classical Information System | Chemical Information System | ||
Store | Name = 'Jimmy Carter' | Stores text, numbers, dates, ... | ![]() |
Stores chemical compounds and information about them. |
Retrieve | Find record #13282 | Retrieves 'Jimmy Carter' | Find: CC(=O)C4CC3C2CC(C)C1=C(C) C(=O)CC(O)C1C2CCC3(C)C4 |
Retrieves:![]() |
Search | Find Presidents named 'Bush' | George Bush and George W. Bush | Find molecules containing:![]() |
Retrieves:![]() |
Relationship | Year Carter was elected | Answer: Elected in 1976 | What's the logP(o/w) of:![]() |
logP(o/w) = 2.62 |
How is Cheminformatics Different?
There are four key problems a cheminformatics system solves:
1. Store a Molecule | Computer scientists usually use the valence model of chemistry to represent compounds. Section 2, Representing Molecules, discusses this at length. |
2. Find Exact Molecule | If you ask, "Is Abraham Lincoln in the database?" it's not hard to find the answer. But, given a specific molecule, is it in the database? What do we know about it? This may seem simple at first glance, but it's not, as we'll see when we discuss tautomers, stereochemistry, metals, and other "flaws" in the valence model of chemistry. |
3. Substructure Search | If you ask, "Is anyone named Lincoln in the database?" you usually expect to find the former President and a number of others - this is called a search rather than a lookup. For a chemical informatics system, we have a substructure search: Find all molecules containing a partial molecule (the "substructure") drawn by the user. The substructure is usually a functional group, "scaffold," or core structure representing a class of molecules. This too is a hard problem, much harder than most text searches, for reasons that go to the very root of mathematics and the theory of computability. |
4. Similarity Search | Some databases can find similar-sounding or misspelled words, such as "Find Lincon" or "find Cincinati", which respectively might find Abraham Lincoln and Cincinnati. Many chemical information systems can find molecules similar to a given molecule, ranked by similarity. There are several ways to measure molecular similarity, discussed further in Section 4, Molecular Similarity. |
What is a Molecule?
One of the greatest achievements in chemistry was the development of
the valence model: a molecule is represented
as atoms joined by semi-rigid bonds that can be single,
double, or triple. This simple mental model has little resemblance to the
underlying quantum-mechanical reality of electrons, protons and neutrons,
yet it has proved to be a remarkably useful approximation of how atoms
behave in close proximity to one another. The valence model has been the
foundation of chemical instruction for well over a century.
The valence model is also the foundation of modern chemical information
systems. When a computer scientist approaches a problem, the first task
is to figure out a datamodel that represents the problem to be
solved as information. To the computer scientist, the valence
model naturally transforms into a graph, where the nodes
are atoms and the edges are bonds. Computer scientists know how
to manipulate graphs; mathematical graph theory and computer science
have been close allies since the invention of the digital computer.
There are atoms and space. Everything else is opinion. -- Democritus |
However, the valence model has many
shortcomings. The most obvious of these is aromaticity, a concept that
quickly required the addition of a non-integral "aromatic"
distributed bond to the single/double/triple bonds of the simple valence
model. And that was just the start. Tautomers, ferrocenes, charged
molecules and a host of other common molecules simply don't fit the
valence model well.
The imprecise nature of the valence model of
chemistry complicates life for the computer scientist. As we shall see,
the shortcomings in the valence model are the source of most of the
complexities of modern cheminformatics systems.
Older systems: Connection Tables
Most of the early (and some modern) representations
of molecules were in a connection table, a table
enumerating the atoms, and a table enumerating the bonds and which atoms
each bond connected. Here is an example of connection-table (CTAB)
portion of an MDL "SD" file (the data portion is not shown here):
CHEMW2 12141511122D 3 2 0 0 0 0 999 V2000 -29.5160 2.6603 0.0000 Br 0 0 0 0 0 0 0 0 0 0 0 0 -25.1859 5.1603 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 -20.8558 2.6603 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 1 2 1 0 0 0 0 2 3 1 0 0 0 0 M END
This simple example illustrates most of the key
features of the molecule. The molecule has three atoms, two bonds, and is provided with
three-dimensional (x,y,z) coordinates.
Connection tables can capture the valence model of
chemistry fairly well, but they suffer from three problems:
- Inefficiency: They take on the order up to two dozen bytes of data per atom and per bond. Newer line notations (discussed below) represent molecules with an average of 1.2 to 1.5 bytes per atom, or 6-8 bytes per atom if coordinates are added.
- Lack of specificity. For example, since hydrogens are often implicit, there can be ambiguity as to the electronic state of some molecules. The connection-table format does not explicitly state the valence assumptions.
- Formatting issues: Most connection-table formats mix the concept of connectivity (what are the atoms and how are they connected?) with other data such as 2D and 3D coordinates. For example, if you had two different conformers of a molecule, most connection tables would require you to specify the entire molecule twice, even though the connection table is identical in both.
Line Notations: InChI, SMILES, WLN and others
A line notation represents a molecule as a single-line string of characters.
WLN - Wisswesser Line Notation
WLN, invented by William J. Wisswesser in the early 1950's, was the first comprehensive line notation, capable of representing arbitrarily complex molecules correctly and compactly.
1H = CH4 Methane
2H = CH3-CH3 Ethane
3H = CH3-CH2-CH3 Propane
QVR BG CG DG EG FG = C7HCl5O2 Pentachlorbenzoate
2H = CH3-CH3 Ethane
3H = CH3-CH2-CH3 Propane
QVR BG CG DG EG FG = C7HCl5O2 Pentachlorbenzoate
WLN was the first line notation to feature a canonical form, that is, the rules
for WLN meant there was only one "correct" WLN for any particular molecule. Those
versed in WLN were able to write molecular structure in a line format, communicate molecular
structure to one another and to computer programs. Unfortunately, WLN's complexity prevented
widespread adoption. The rules for correct specification of WLN filled a small book. Encoding
those rules into a computer proved difficult, and the rules for the canonicalization were computationally intractable.
SMILES - Simplified Molecular Input Line Entry System
The best-known line notation today is SMILES. It was created by Arthur
and David Weininger in response to the need for a simpler, more "human
accessible" notation than WLN. While SMILES is not trivial to learn and
write, most chemists can create correct SMILES with just a few minutes of
training, and the entire SMILES language can be learned in an hour or
two. You can read more
details here. Here are some examples:
C methane
CC ethane
C=C ethene
Oc1ccccc1 phenol
CC ethane
C=C ethene
Oc1ccccc1 phenol
SMILES, like WLN, has a canonical form, but unlike WLN, Weininger relied on the
computer, rather than the chemist, to convert a non-canonical SMILES to a canonical SMILES.
This important separation of duties was key to making SMILES easy to enter. (Read more about
canonicalization below.)
InChI
InChI is the latest and most modern of the line notations. It resolves many of the chemical
ambiguities not addressed by SMILES, particularly with respect to stereo centers, tautomers
and other of the "valence model problems" discussed in the opening section this document.
You can read more about InChI at
the Official
Site
Canonicalization
A critical feature of line notations is canonicalization - the ability to choose one "blessed"
representation from among many equivalent representations. Consider:
OCC ethanol
CCO ethanol
CCO ethanol
Both of these SMILES represent the same
molecule. If we could all agree that one of these was the "correct" or
"canonical" SMILES for ethanol, then we could always store it the same
way in a database. More importantly, if we want to ask, "Is ethanol in our database?" we know that
it will only be there once, and that we can generate the canonical SMILES for ethanol and look
it up.
(Note that, in theory, one can create a canonical
connection table too, but it's not as useful since informatics systems
usually have trouble indexing BLOBs - large objects.)
Line Notation versus Connection Tables: A practical matter
Why are line notations preferred over connection-table formats? In theory, either could express the same information. But there are practical difference, mostly related to the complexity of "parsing" a connection table. If you know that the whole molecule is on one line of a file, it's
easy to parse.
Line notations are also very useful for database applications. Relational databases have datatypes
that, roughly speaking, are divided into numbers, text, and "everything else" (also known as "BLOBs"
or Binary Large OBjects). You can easily store line notations in the "text"
fields; multi-line connection tables are trickier.
Line notations also have pragmatic advantages. Modern Unix-like systems (such as UNIX, Linux
and Cygwin) have a number of very powerful "filter" text-processing programs that can be "piped"
together (connected end-to-end) to perform important tasks. For example, to count the number of
molecules containing aliphatic nitrogen in a SMILES file, I can
simply:
grep N <file.smi | wc
"grep" looks for a particular expression, in this case "N", and prints any line that contains
it, and "wc" (word count) counts the number of words and lines.
This is just a simple example of the power available via "script" programs using "filters" on
Unix-like systems. Unix "filters" are much less useful for connection-table formats because each molecule
is spread over many lines.
Query Languages: SMARTS
In addition to a typographical way to represent molecules, we also need a way to enter queries
about molecules. For example, we might want to ask, "Find all molecules that contain a phenol."
With text, we're familiar with the concept of typing a partial word, such as "ford"
to find "Henry Ford" as well as "John Hartford". For chemistry, we can also
specify partial structures, and find anything that contains them. For
example:
Query | Database | Matches? |
![]() |
![]() |
YES (matched portion highlighted in blue) |
![]() |
NO (double bond indicated doesn't match) |
The simplest query language for chemistry is SMILES itself: Just specify a structure, such as
"Oc1ccccc1", and search. This is how eMolecules' basic searching works. It's simple, and
because of the high-performance indexes in eMolecules it is very fast.
However, for general-purpose cheminformatics one needs more power. What if the substructure
you're looking for isn't a valid molecule? For example ClccBr (1,2- substitution on an aromatic
ring) isn't a whole molecule, since the concept of aromaticity is only sensible in the context
of a whole ring system.
Or what if the thing we're looking for isn't a simple atom such as Br, but rather a concept like
"Halogen"? Or, "A terminal methyl"?
To address this, cheminformatics systems have special query languages, such as SMARTS
(SMiles ARbitrary Target Specification). SMARTS is a close cousin to SMILES, but it uses expressions
instead of simple atoms and bonds. For example, [C,N] will find an atom that is either carbon
or nitrogen.
IUPAC Names, Trade Names, Common Names
Chemistry also has three other important name systems:
- IUPAC Names established a naming convention that is widely used throughout chemistry. Any chemical can be named, and all IUPAC names are unambiguous. This textual representation is aimed at humans, not computers: chemists versed in IUPAC nomenclature (which is widely taught) can read an IUPAC name and visualize or draw the molecule.
- Trade Names such as Tylenol® and Valium® are given to compounds and formulations by manufacturers for marketing, sales and regulatory purposes.
- Common names are names such as "aspirin" or "alcohol" for substances that are in widespread use.
- Linear paths: All "paths" through the molecule up to five atoms long. For example, C=NCO would generate five features: C=N, NC, CO, N=NC, NCO and C=NCO.
- Branch points: Any atom with three or more non-hydrogen neighbors.
- Multi-branch points: Pairs of adjacent branch points.
- Single Rings: For example, quinoline would generate two ring features: a benzene ring and a pyridine ring.
- Multi-ring systems: For example, quinoline would generate one multi-ring feature (in addition to the two single-ring features).
- Molecular formula: Each element in the molecule generates at least one feature (some generate several features).
- Structural novelty - ensure that each compound is only registered once
- Structural normalization - ensure that structures with alternative representations (such as nitro groups, ferrocenes, and tautomers) are entered in a uniform way.
- Structure drawing - ensure that compounds are drawn in a uniform fashion, so that they can be quickly recognized "by eye."
- Maintaining relationships among related compounds. For example, all salt forms of a compound should be recognized as being related to one another, and compounds in different solvates are also related.
- Registering mixtures, formulations and alternative structures.
- Registering compounds whose structures are unknown.
- Roles, responsibilities, security, and company workflow.
- Updates, amendments and corrections, and controlling propagation of changes (e.g. does changing a compound change a mixture containing that compound?)
- The contents of the first bottle were carefully analyzed, and found to be a single stereoisomer.
- The contents of the second bottle were carefully analyzed and found to contain a racemic mixture of the stereoisomers.
- The stereoisomers of the third bottle are unknown. It may be pure, or have one predominant form, or be a racemic mixture.
- The fourth bottle was obtained by running part of the contents of bottle #2 through a chromatographic separation. It is isotopically pure, but you don't know which stereoisomer.
- The fifth bottle is the other fraction from the same separation of #4. It is also isotopically pure, you don't know which stereoisomer, but you know it's the opposite of #4.
What is Indexing?
Indexing is pre-computing the answers to portions of expected questions before they're
asked, so that the questions can be quickly answered when they're asked.
Take your favorite search engine (AOL, Yahoo!, Google, MSN, etc) for example. Without indexing,
they might wait until you ask for "John Hartford Bluegrass", then start searching the
web, and in a year or two find all the web pages about the deceased banjo/fiddle player and steamboat
captain. That would probably not impress you.
Instead, these search engines search the web before you ask your question, and build an
index of the words they find. When you type in "Bluegrass John Hartford", they
already know all of the pages that contain "John", all of the
pages that contain "Hartford",
and all of the pages that contain "Bluegrass". Instead of searching, they examine their index,
and find pages that are on all three lists, and quickly returning your results. (NB: It's actually
a lot more complex, but this illustrates the main idea of indexing.)
Indexes for Chemicals
Instead of indexing words, cheminformatics systems index substructures. Although there
are many schemes for doing this, cheminformatics systems all use the same fundamental principle:
they decompose the molecule into smaller bits and index those.

Decomposing the molecule for indexing
Roughly speaking, a cheminformatics system will index each of the substructures (fragments) above,
so that every molecule that contains each fragment is known.
When a query is entered, the cheminformatics system breaks apart the query using the same technique,
to find all of the fragments in the query. It then checks its index for each fragment, and combines
the lists it finds to get only those molecules that have all of those fragments.
This doesn't mean that all molecules returned by the index actually are matches. In the language
of databases, we say the index will return false positives, candidate molecules that don't
actually match the substructure search.
Consider our example of searching for "John Hartford" - the index might return many
pages that have both "John" and "Hartford", yet have nothing to do with bluegrass
music or steamboats. For example, it might return a page containing, "President John F. Kennedy
visited Hartford, Connecticut today..." To confirm that the search system has found something
relevant, it must check the pages returned from the index to ensure that the specific phrase "John
Hartford" is present. However, notice that this is much faster than searching every
page, since the overwhelming majority of web pages were instantly rejected because they have neither
"John" nor "Hartford" on them.
Similarly, a chemical fragment index serves to find only the most likely molecules for
our substructure match - anything that the index didn't find is definitely not a match. But we
still have to examine each of the molecules returned by the indexing system and verify that the
complete substructure for which we are searching is present.
NP-Complete - A Little about Computability
Searching through a page of text for the words, "John Hartford" is pretty easy for
a modern computer. Although false positives returned by the index are a nuisance and impair performance,
they are not a catastrophe. This is not so for substructure matching. Unfortunately, substructure matching
falls into a category of "hard" mathematical problems, which means false positives from
the index are a big problem.
Substructure matching (finding a certain functional group within a molecule) is an example of
what mathematicians call graph
isomorphism, and is in a class of problems called NP
Complete. Roughly speaking, this means that the time it takes to do a substructure search is
non-polynomial (i.e. exponential in the number of atoms and bonds). To see why this is a computational
disaster, compare two tasks, one that takes polynomial time, k1*N2,
versus one that takes exponential time k2*2N.
Our polynomial task is bad enough: If we double N, it takes four times as long to solve.
But the exponential task is worse: Every time we add an atom it doubles. So going from
one atom to two doubles the time, and going from 100 atoms to 101 atoms doubles the time. Even
if we can get k2 down to a millionth of k1,
we're still in trouble - a million is just 220 or twenty atoms
away.
It has been mathematically proven that substructure searching is in the set of NP Complete problems,
so there's no point wasting our time searching for a polynomial algorithm. The good news is that
most molecules have "low connectivity", meaning most atoms have fewer than four bonds,
unlike the weird and twisted graphs that mathematicians consider. In practice, most substructure
matching can be done in polynomial time around N2 or N3.
But even with this improvement, substructure matching is an "expensive" time-consuming
task for a computer.
The key point is that indexing is particularly important for cheminformatics systems. The typical
modern computer can only examine a few thousand molecules per second, so examining millions of
molecules one-by-one is out of the question. The indexing done by a modern cheminformatics system
is the key to its performance.
What is similarity search?
Similarity search finds molecules with a high percentage of
features that are common to the target molecule. Unlike a substructure
search, the molecules found may not actually match the target.
For example, suppose you are looking for compounds related to
Sildenafil. A substructure search would miss the closely-related drug
Vardenifil because one nitrogen is moved.
Query | Target |
![]() |
![]() |
We're looking for Sildenafil-like drugs... | But we don't find Vardenafil because of the shifted N. |
By contrast, a similarity search will find Sildenafil as well as
Vardenafil and its relatives. Although the core ring systems of the two
drugs have significant differences, their similarity index is moderately
high (0.7) because most of the structural features are shared.
The eMolecules Similarity Metric
The eMolecules similarity metric ranks molecules based on their number of shared
"substructural features" between the two molecules. These features are
defined as follows:
The eMolecules similarity metric measures the fraction of features
common to both molecules. A high similarity (> 0.9) means that the two
molecules share most of their features; a low similarity means they share
few or no features. (More details about this computation are at
the end of this article).
Caveats
Similarity search can be very useful for finding related molecules, but
one must be careful not to give much meaning to the numbers.
A low similarity metric should be considered more-or-less meaningless.
It is wrong to think that a similarity of 0.3 is "more similar" than a
similarity of 0.2 – anything at this level is just noise. There is
no simple rule to determine the cutoff for useful similarity
results, but roughly speaking anything below 0.4 is probably not
interesting. (See Limitations... below for a
more technical discussion of this).
High similarity can also be misinterpreted. Similarity metrics do not form
any sort of Euclidian space. They're often discussed with terms like
"distance," "closer," and "farther," but these terms are deceiving. If two
molecules are highly similar, it only means that they share many
substructural features. Highly similar molecules may or may not have
highly similar chemical or medicinal properties. Two molecules that are
similar to a third molecule may be similar to each other or may be very
dissimilar.
Computation of the Similarity Metric
To achieve high speeds when searching for similar molecules, the
molecule features (linear, branch points and rings) described above are
"hashed" into a 512-bit (64 byte) bitmap representation called a
"fingerprint."
A hash function
is a well-defined procedure that maps a large space (in this case, all possible molecular features) into a much
smaller space (a number between zero and 511). For example, it might do
this:
h("C=NCO") = 427
h("c1ncccc1") = 201
... etc.
h("c1ncccc1") = 201
... etc.
To create a fingerprint, the system initializes the bitmap to all zeros.
Then each molecular features is hashed, and the fingerprint's bit
corresponding to that hash value is set to one. A typical molecule in the
eMolecules database has between 20 and 200 features, so the
corresponding fingerprint has roughly that many bits set (more on this
below).
The hash function is designed to distribute the
output evenly from zero to 511 for a large number of input values.
Given the fingerprints of two molecules, their similarity is computed
using the well-known
Tanimoto coefficient.
T(m1, m2) = | nbits-common |
nbits-set |
Where:
nbits-common: | the number of 1 bits common to both fingerprints (the bitwise AND of the fingerprints) |
nbits-set: | the number of 1 bits set in either fingerprint (the bitwise OR of the fingerprints) |
This coefficient and its relatives (e.g. the Tversky coefficient) have
become cheminformatics standards because
of the very high speed at which modern computers can manipulate large integers.
A modern computer can compute billions of logical AND and OR operations per
second. This makes it possible to calculate the Tanimoto coefficients for
millions of molecules in just a few seconds.
Limitations of Tanimoto Similarity
The hash function described above maps a very large space (all possible
molecular features) into a much smaller space (512 bits). The set of
molecules in the eMolecules database together have roughly 150,000 distinct
molecular features. As a molecule's features are turned into a bitmap,
there is no guarantee of uniqueness. Each "1" of a fingerprint could
represent any one of thousands of potential molecular features.
To illustrate, imagine the fingerprint-generation algorithm has hashed
205 features into a 512-bit fingerprint, and is about to hash the next (206th) feature.
Since 40% of the bits have already been set, there's a 40% probability
this bit will "collide" with an already-set bit.
Now imagine you have two 512-bit fingerprints, each with around 204
bits set (40%). Just by random chance, we can expect that these two
fingerprints will have roughly 0.25 similarity (25% of their 1 bits will
happen to coincide) even if they have no common substructural features.
This is why similarity values less than 0.4 are not useful.
Almost any two fingerprints for real molecules will share a few bits just
by random chance. It's only when the similarity metric rises to above 0.5
that it indicates a high probability that the two molecules share many
substructural features.
Other Similarity Metrics
There are many ways to measure similarity besides the substructural
features and Tanimoto coefficient used in the eMolecules system.
2D topology |
The best-known and most widely used similarity metrics are close
relatives of the above-described eMolecules system. That is, they use
the "2D connectivity" of the molecule's atoms and bonds to create a
fingerprint, and search using the Tanimoto coefficient. Daylight CIS, ChemAxon, OpenEye, the OpenBabel and MayaChem projects and many others also provide Tanimoto similarity searches, but the structural features of the molecule that each project uses to create the fingerprint are quite different. OpenBabel offers several algorithms to suit different needs. |
3D configuration |
One of the most important uses of similarity is in the discovery of new drugs, and a molecule's
shape is critical to its medicinal value (see QSAR).
3D similarity searches compare the configuration (also called the "conformation") of a molecule to other molecules. The "electronic surface" of the molecule is the important bit – the part that can interact with other molecules. 3D searches compare the surfaces of the two molecules (especially which parts are polarized or polarizable). 3D similarity searches are uncommon for two reasons: they're difficult and slow. The difficulty comes from the complexity of molecular interactions: a molecule is not a fixed shape, but rather a dynamic object that changes according to its environment. And the slowness comes from the difficulty: to get better results, scientists employ very complex algorithms. |
Physical Properties | The above 2D and 3D similarity are based on the molecule's structure. Another technique compares the properties, either computed or measured or both, and declares that molecules with many properties in common are of interest. It is the idea of QSAR taken to the database. |
Clustering |
"Clustering" is the process of differentiating a set of things into groups based
on common features. Molecules can be clustered using a variety of techniques,
such as common 2D and/or 3D features. Clustering is not a similarity metric per se (the topic of this section), but it is included here because it is a "cheap substitute." That is, when someone wants to find compounds similar to a known compound, you can show them the group (the cluster) to which the compound belongs. It allows you to spend a lot of computational time up front to pre-compute the clusters, and then give answers very quickly. |
Many cheminformatics databases have one or more similarity searches available.
Chemical Registration is the "big brother" of cheminformatics.
A cheminformatics system is primarily devoted to
recording chemical structure. Chemical Registration systems are
additionally concerned with:
The scope of Chemical Registration Systems is far beyond the goals of this brief introduction
to cheminformatics. However, to illustrate just one of the points above, let's consider structural
novelty. In real life, chemical structure can be very ambiguous. Imagine you have five bottles
of a particular compound that has a stereo center:
Which of these five bottles contain the same compound, and which are different? That is the essential
task of a chemical registry system, which would consider all five to be different. After all,
you probably have data about each bottle (that's why you have them), and you must be able to record
it and not confuse it with the other bottles.
In this example above, consider what is known and
not known:
Bottle | Known | Not Known |
1 | Everything | Nothing |
2 | Everything | Nothing |
3 | Compound is known | Stereochemistry |
4 | Compound and purity known, stereochemistry is opposite of #5 | Specific stereochemistry |
5 | Compound and purity known, stereochemistry is opposite of #4 | Specific stereochemistry |
A cheminformatics system has no way to record the contents of the five bottles; it is only concerned
with structure. By contrast, a chemical registration system can record both what is known
as well as what is not known. This is the critical difference between the two.
Copyright © 2016 eMolecules,
Inc. | Terms of
Service | Privacy
Policy | Give us your
feedback | Contact Us