Molecular Informatics

Cheminformatics Basics
Representing Molecules
Substructure
Molecular Similarity
Chemical Registration

What is Cheminformatics?

Cheminformatics is a cross between Computer Science and Chemistry: the process of storing and retrieving information about chemical compounds.

Information Systems are concerned with storing, retrieving, and searching for information, and with storing relationships between bits of data. For example:

Operation	Classical Information System		Chemical Information System
Store	Name = 'Jimmy Carter'	Stores text, numbers, dates, ...		Stores chemical compounds and information about them.
Retrieve	Find record #13282	Retrieves 'Jimmy Carter'	Find: CC(=O)C4CC3C2CC(C)C1=C(C) C(=O)CC(O)C1C2CCC3(C)C4	Retrieves:
Search	Find Presidents named 'Bush'	George Bush and George W. Bush	Find molecules containing:	Retrieves:
Relationship	Year Carter was elected	Answer: Elected in 1976	What's the logP(o/w) of:	logP(o/w) = 2.62

How is Cheminformatics Different?

There are four key problems a cheminformatics system solves:

1. Store a Molecule	Computer scientists usually use the valence model of chemistry to represent compounds. Section 2, Representing Molecules, discusses this at length.
2. Find Exact Molecule	If you ask, "Is Abraham Lincoln in the database?" it's not hard to find the answer. But, given a specific molecule, is it in the database? What do we know about it? This may seem simple at first glance, but it's not, as we'll see when we discuss tautomers, stereochemistry, metals, and other "flaws" in the valence model of chemistry.
3. Substructure Search	If you ask, "Is anyone named Lincoln in the database?" you usually expect to find the former President and a number of others - this is called a search rather than a lookup. For a chemical informatics system, we have a substructure search: Find all molecules containing a partial molecule (the "substructure") drawn by the user. The substructure is usually a functional group, "scaffold," or core structure representing a class of molecules. This too is a hard problem, much harder than most text searches, for reasons that go to the very root of mathematics and the theory of computability.
4. Similarity Search	Some databases can find similar-sounding or misspelled words, such as "Find Lincon" or "find Cincinati", which respectively might find Abraham Lincoln and Cincinnati. Many chemical information systems can find molecules similar to a given molecule, ranked by similarity. There are several ways to measure molecular similarity, discussed further in Section 4, Molecular Similarity.

Next: Representing Molecules

What is a Molecule?

One of the greatest achievements in chemistry was the development of the valence model: a molecule is represented as atoms joined by semi-rigid bonds that can be single, double, or triple. This simple mental model has little resemblance to the underlying quantum-mechanical reality of electrons, protons and neutrons, yet it has proved to be a remarkably useful approximation of how atoms behave in close proximity to one another. The valence model has been the foundation of chemical instruction for well over a century.

The valence model is also the foundation of modern chemical information systems. When a computer scientist approaches a problem, the first task is to figure out a datamodel that represents the problem to be solved as information. To the computer scientist, the valence model naturally transforms into a graph, where the nodes are atoms and the edges are bonds. Computer scientists know how to manipulate graphs; mathematical graph theory and computer science have been close allies since the invention of the digital computer.

There are atoms and space. Everything else is opinion.
-- Democritus

However, the valence model has many shortcomings. The most obvious of these is aromaticity, a concept that quickly required the addition of a non-integral "aromatic" distributed bond to the single/double/triple bonds of the simple valence model. And that was just the start. Tautomers, ferrocenes, charged molecules and a host of other common molecules simply don't fit the valence model well.

The imprecise nature of the valence model of chemistry complicates life for the computer scientist. As we shall see, the shortcomings in the valence model are the source of most of the complexities of modern cheminformatics systems.

Older systems: Connection Tables

Most of the early (and some modern) representations of molecules were in a connection table, a table enumerating the atoms, and a table enumerating the bonds and which atoms each bond connected. Here is an example of connection-table (CTAB) portion of an MDL "SD" file (the data portion is not shown here):


  CHEMW2  12141511122D                              

  3  2  0  0  0  0            999 V2000
  -29.5160    2.6603    0.0000 Br  0  0  0  0  0  0  0  0  0  0  0  0
  -25.1859    5.1603    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  -20.8558    2.6603    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0  0  0  0
  2  3  1  0  0  0  0
M  END

This simple example illustrates most of the key features of the molecule. The molecule has three atoms, two bonds, and is provided with three-dimensional (x,y,z) coordinates.

Connection tables can capture the valence model of chemistry fairly well, but they suffer from three problems:

Inefficiency: They take on the order up to two dozen bytes of data per atom and per bond. Newer line notations (discussed below) represent molecules with an average of 1.2 to 1.5 bytes per atom, or 6-8 bytes per atom if coordinates are added.
Lack of specificity. For example, since hydrogens are often implicit, there can be ambiguity as to the electronic state of some molecules. The connection-table format does not explicitly state the valence assumptions.
Formatting issues: Most connection-table formats mix the concept of connectivity (what are the atoms and how are they connected?) with other data such as 2D and 3D coordinates. For example, if you had two different conformers of a molecule, most connection tables would require you to specify the entire molecule twice, even though the connection table is identical in both.

Line Notations: InChI, SMILES, WLN and others

A line notation represents a molecule as a single-line string of characters.

WLN - Wisswesser Line Notation

WLN, invented by William J. Wisswesser in the early 1950's, was the first comprehensive line notation, capable of representing arbitrarily complex molecules correctly and compactly.

1H = CH4 Methane
2H = CH3-CH3 Ethane
3H = CH3-CH2-CH3 Propane
QVR BG CG DG EG FG = C7HCl5O2 Pentachlorbenzoate

WLN was the first line notation to feature a canonical form, that is, the rules for WLN meant there was only one "correct" WLN for any particular molecule. Those versed in WLN were able to write molecular structure in a line format, communicate molecular structure to one another and to computer programs. Unfortunately, WLN's complexity prevented widespread adoption. The rules for correct specification of WLN filled a small book. Encoding those rules into a computer proved difficult, and the rules for the canonicalization were computationally intractable.

SMILES - Simplified Molecular Input Line Entry System

The best-known line notation today is SMILES. It was created by Arthur and David Weininger in response to the need for a simpler, more "human accessible" notation than WLN. While SMILES is not trivial to learn and write, most chemists can create correct SMILES with just a few minutes of training, and the entire SMILES language can be learned in an hour or two. You can read more details here. Here are some examples:

C methane
CC ethane
C=C ethene
Oc1ccccc1 phenol

SMILES, like WLN, has a canonical form, but unlike WLN, Weininger relied on the computer, rather than the chemist, to convert a non-canonical SMILES to a canonical SMILES. This important separation of duties was key to making SMILES easy to enter. (Read more about canonicalization below.)

InChI

InChI is the latest and most modern of the line notations. It resolves many of the chemical ambiguities not addressed by SMILES, particularly with respect to stereo centers, tautomers and other of the "valence model problems" discussed in the opening section this document.

You can read more about InChI at the Official Site

Canonicalization

A critical feature of line notations is canonicalization - the ability to choose one "blessed" representation from among many equivalent representations. Consider:

OCC ethanol
CCO ethanol

Both of these SMILES represent the same molecule. If we could all agree that one of these was the "correct" or "canonical" SMILES for ethanol, then we could always store it the same way in a database. More importantly, if we want to ask, "Is ethanol in our database?" we know that it will only be there once, and that we can generate the canonical SMILES for ethanol and look it up.

(Note that, in theory, one can create a canonical connection table too, but it's not as useful since informatics systems usually have trouble indexing BLOBs - large objects.)

Line Notation versus Connection Tables: A practical matter

Why are line notations preferred over connection-table formats? In theory, either could express the same information. But there are practical difference, mostly related to the complexity of "parsing" a connection table. If you know that the whole molecule is on one line of a file, it's easy to parse.

Line notations are also very useful for database applications. Relational databases have datatypes that, roughly speaking, are divided into numbers, text, and "everything else" (also known as "BLOBs" or Binary Large OBjects). You can easily store line notations in the "text" fields; multi-line connection tables are trickier.

Line notations also have pragmatic advantages. Modern Unix-like systems (such as UNIX, Linux and Cygwin) have a number of very powerful "filter" text-processing programs that can be "piped" together (connected end-to-end) to perform important tasks. For example, to count the number of molecules containing aliphatic nitrogen in a SMILES file, I can simply:

grep N <file.smi | wc

"grep" looks for a particular expression, in this case "N", and prints any line that contains it, and "wc" (word count) counts the number of words and lines.

This is just a simple example of the power available via "script" programs using "filters" on Unix-like systems. Unix "filters" are much less useful for connection-table formats because each molecule is spread over many lines.

Query Languages: SMARTS

In addition to a typographical way to represent molecules, we also need a way to enter queries about molecules. For example, we might want to ask, "Find all molecules that contain a phenol."

With text, we're familiar with the concept of typing a partial word, such as "ford" to find "Henry Ford" as well as "John Hartford". For chemistry, we can also specify partial structures, and find anything that contains them. For example:

Query	Database	Matches?
		YES (matched portion highlighted in blue)
		NO (double bond indicated doesn't match)

The simplest query language for chemistry is SMILES itself: Just specify a structure, such as "Oc1ccccc1", and search. This is how eMolecules' basic searching works. It's simple, and because of the high-performance indexes in eMolecules it is very fast.

However, for general-purpose cheminformatics one needs more power. What if the substructure you're looking for isn't a valid molecule? For example ClccBr (1,2- substitution on an aromatic ring) isn't a whole molecule, since the concept of aromaticity is only sensible in the context of a whole ring system.

Or what if the thing we're looking for isn't a simple atom such as Br, but rather a concept like "Halogen"? Or, "A terminal methyl"?

To address this, cheminformatics systems have special query languages, such as SMARTS (SMiles ARbitrary Target Specification). SMARTS is a close cousin to SMILES, but it uses expressions instead of simple atoms and bonds. For example, [C,N] will find an atom that is either carbon or nitrogen.

IUPAC Names, Trade Names, Common Names

Chemistry also has three other important name systems:

IUPAC Names established a naming convention that is widely used throughout chemistry. Any chemical can be named, and all IUPAC names are unambiguous. This textual representation is aimed at humans, not computers: chemists versed in IUPAC nomenclature (which is widely taught) can read an IUPAC name and visualize or draw the molecule.

Trade Names such as Tylenol® and Valium® are given to compounds and formulations by manufacturers for marketing, sales and regulatory purposes.

Common names are names such as "aspirin" or "alcohol" for substances that are in widespread use.

Previous: Cheminformatics Basics

Next: Substructure Searching

What is Indexing?

Indexing is pre-computing the answers to portions of expected questions before they're asked, so that the questions can be quickly answered when they're asked.

Take your favorite search engine (AOL, Yahoo!, Google, MSN, etc) for example. Without indexing, they might wait until you ask for "John Hartford Bluegrass", then start searching the web, and in a year or two find all the web pages about the deceased banjo/fiddle player and steamboat captain. That would probably not impress you.

Instead, these search engines search the web before you ask your question, and build an index of the words they find. When you type in "Bluegrass John Hartford", they already know all of the pages that contain "John", all of the pages that contain "Hartford", and all of the pages that contain "Bluegrass". Instead of searching, they examine their index, and find pages that are on all three lists, and quickly returning your results. (NB: It's actually a lot more complex, but this illustrates the main idea of indexing.)

Indexes for Chemicals

Instead of indexing words, cheminformatics systems index substructures. Although there are many schemes for doing this, cheminformatics systems all use the same fundamental principle: they decompose the molecule into smaller bits and index those.

Decomposing the molecule for indexing

Roughly speaking, a cheminformatics system will index each of the substructures (fragments) above, so that every molecule that contains each fragment is known.

When a query is entered, the cheminformatics system breaks apart the query using the same technique, to find all of the fragments in the query. It then checks its index for each fragment, and combines the lists it finds to get only those molecules that have all of those fragments.

This doesn't mean that all molecules returned by the index actually are matches. In the language of databases, we say the index will return false positives, candidate molecules that don't actually match the substructure search.

Consider our example of searching for "John Hartford" - the index might return many pages that have both "John" and "Hartford", yet have nothing to do with bluegrass music or steamboats. For example, it might return a page containing, "President John F. Kennedy visited Hartford, Connecticut today..." To confirm that the search system has found something relevant, it must check the pages returned from the index to ensure that the specific phrase "John Hartford" is present. However, notice that this is much faster than searching every page, since the overwhelming majority of web pages were instantly rejected because they have neither "John" nor "Hartford" on them.

Similarly, a chemical fragment index serves to find only the most likely molecules for our substructure match - anything that the index didn't find is definitely not a match. But we still have to examine each of the molecules returned by the indexing system and verify that the complete substructure for which we are searching is present.

NP-Complete - A Little about Computability

Searching through a page of text for the words, "John Hartford" is pretty easy for a modern computer. Although false positives returned by the index are a nuisance and impair performance, they are not a catastrophe. This is not so for substructure matching. Unfortunately, substructure matching falls into a category of "hard" mathematical problems, which means false positives from the index are a big problem.

Substructure matching (finding a certain functional group within a molecule) is an example of what mathematicians call graph isomorphism, and is in a class of problems called NP Complete. Roughly speaking, this means that the time it takes to do a substructure search is non-polynomial (i.e. exponential in the number of atoms and bonds). To see why this is a computational disaster, compare two tasks, one that takes polynomial time, k₁*N², versus one that takes exponential time k₂*2^N. Our polynomial task is bad enough: If we double N, it takes four times as long to solve. But the exponential task is worse: Every time we add an atom it doubles. So going from one atom to two doubles the time, and going from 100 atoms to 101 atoms doubles the time. Even if we can get k₂ down to a millionth of k₁, we're still in trouble - a million is just 2²⁰ or twenty atoms away.

It has been mathematically proven that substructure searching is in the set of NP Complete problems, so there's no point wasting our time searching for a polynomial algorithm. The good news is that most molecules have "low connectivity", meaning most atoms have fewer than four bonds, unlike the weird and twisted graphs that mathematicians consider. In practice, most substructure matching can be done in polynomial time around N² or N³. But even with this improvement, substructure matching is an "expensive" time-consuming task for a computer.

The key point is that indexing is particularly important for cheminformatics systems. The typical modern computer can only examine a few thousand molecules per second, so examining millions of molecules one-by-one is out of the question. The indexing done by a modern cheminformatics system is the key to its performance.

Previous: Representing Molecules

Next: Molecular Similarity

What is similarity search?

Similarity search finds molecules with a high percentage of features that are common to the target molecule. Unlike a substructure search, the molecules found may not actually match the target.

For example, suppose you are looking for compounds related to Sildenafil. A substructure search would miss the closely-related drug Vardenifil because one nitrogen is moved.

Query	Target

We're looking for Sildenafil-like drugs...	But we don't find Vardenafil because of the shifted N.

By contrast, a similarity search will find Sildenafil as well as Vardenafil and its relatives. Although the core ring systems of the two drugs have significant differences, their similarity index is moderately high (0.7) because most of the structural features are shared.

The eMolecules Similarity Metric

The eMolecules similarity metric ranks molecules based on their number of shared "substructural features" between the two molecules. These features are defined as follows:

Linear paths: All "paths" through the molecule up to five atoms long. For example, C=NCO would generate five features: C=N, NC, CO, N=NC, NCO and C=NCO.
Branch points: Any atom with three or more non-hydrogen neighbors.
Multi-branch points: Pairs of adjacent branch points.
Single Rings: For example, quinoline would generate two ring features: a benzene ring and a pyridine ring.
Multi-ring systems: For example, quinoline would generate one multi-ring feature (in addition to the two single-ring features).
Molecular formula: Each element in the molecule generates at least one feature (some generate several features).

The eMolecules similarity metric measures the fraction of features common to both molecules. A high similarity (> 0.9) means that the two molecules share most of their features; a low similarity means they share few or no features. (More details about this computation are at the end of this article).

Caveats

Similarity search can be very useful for finding related molecules, but one must be careful not to give much meaning to the numbers.

A low similarity metric should be considered more-or-less meaningless. It is wrong to think that a similarity of 0.3 is "more similar" than a similarity of 0.2 – anything at this level is just noise. There is no simple rule to determine the cutoff for useful similarity results, but roughly speaking anything below 0.4 is probably not interesting. (See Limitations... below for a more technical discussion of this).

High similarity can also be misinterpreted. Similarity metrics do not form any sort of Euclidian space. They're often discussed with terms like "distance," "closer," and "farther," but these terms are deceiving. If two molecules are highly similar, it only means that they share many substructural features. Highly similar molecules may or may not have highly similar chemical or medicinal properties. Two molecules that are similar to a third molecule may be similar to each other or may be very dissimilar.

Computation of the Similarity Metric

To achieve high speeds when searching for similar molecules, the molecule features (linear, branch points and rings) described above are "hashed" into a 512-bit (64 byte) bitmap representation called a "fingerprint."

A hash function is a well-defined procedure that maps a large space (in this case, all possible molecular features) into a much smaller space (a number between zero and 511). For example, it might do this:

h("C=NCO") = 427
h("c1ncccc1") = 201
... etc.

To create a fingerprint, the system initializes the bitmap to all zeros. Then each molecular features is hashed, and the fingerprint's bit corresponding to that hash value is set to one. A typical molecule in the eMolecules database has between 20 and 200 features, so the corresponding fingerprint has roughly that many bits set (more on this below).

The hash function is designed to distribute the output evenly from zero to 511 for a large number of input values.

Given the fingerprints of two molecules, their similarity is computed using the well-known Tanimoto coefficient.

T(m1, m2) =	nbits-common
	nbits-set

Where:

nbits-common:	the number of 1 bits common to both fingerprints (the bitwise AND of the fingerprints)
nbits-set:	the number of 1 bits set in either fingerprint (the bitwise OR of the fingerprints)

This coefficient and its relatives (e.g. the Tversky coefficient) have become cheminformatics standards because of the very high speed at which modern computers can manipulate large integers. A modern computer can compute billions of logical AND and OR operations per second. This makes it possible to calculate the Tanimoto coefficients for millions of molecules in just a few seconds.

Limitations of Tanimoto Similarity

The hash function described above maps a very large space (all possible molecular features) into a much smaller space (512 bits). The set of molecules in the eMolecules database together have roughly 150,000 distinct molecular features. As a molecule's features are turned into a bitmap, there is no guarantee of uniqueness. Each "1" of a fingerprint could represent any one of thousands of potential molecular features.

To illustrate, imagine the fingerprint-generation algorithm has hashed 205 features into a 512-bit fingerprint, and is about to hash the next (206th) feature. Since 40% of the bits have already been set, there's a 40% probability this bit will "collide" with an already-set bit.

Now imagine you have two 512-bit fingerprints, each with around 204 bits set (40%). Just by random chance, we can expect that these two fingerprints will have roughly 0.25 similarity (25% of their 1 bits will happen to coincide) even if they have no common substructural features.

This is why similarity values less than 0.4 are not useful. Almost any two fingerprints for real molecules will share a few bits just by random chance. It's only when the similarity metric rises to above 0.5 that it indicates a high probability that the two molecules share many substructural features.

Other Similarity Metrics

There are many ways to measure similarity besides the substructural features and Tanimoto coefficient used in the eMolecules system.

2D topology	The best-known and most widely used similarity metrics are close relatives of the above-described eMolecules system. That is, they use the "2D connectivity" of the molecule's atoms and bonds to create a fingerprint, and search using the Tanimoto coefficient. Daylight CIS, ChemAxon, OpenEye, the OpenBabel and MayaChem projects and many others also provide Tanimoto similarity searches, but the structural features of the molecule that each project uses to create the fingerprint are quite different. OpenBabel offers several algorithms to suit different needs.

3D configuration	One of the most important uses of similarity is in the discovery of new drugs, and a molecule's shape is critical to its medicinal value (see QSAR). 3D similarity searches compare the configuration (also called the "conformation") of a molecule to other molecules. The "electronic surface" of the molecule is the important bit – the part that can interact with other molecules. 3D searches compare the surfaces of the two molecules (especially which parts are polarized or polarizable). 3D similarity searches are uncommon for two reasons: they're difficult and slow. The difficulty comes from the complexity of molecular interactions: a molecule is not a fixed shape, but rather a dynamic object that changes according to its environment. And the slowness comes from the difficulty: to get better results, scientists employ very complex algorithms.

Physical Properties	The above 2D and 3D similarity are based on the molecule's structure. Another technique compares the properties, either computed or measured or both, and declares that molecules with many properties in common are of interest. It is the idea of QSAR taken to the database.

Clustering	"Clustering" is the process of differentiating a set of things into groups based on common features. Molecules can be clustered using a variety of techniques, such as common 2D and/or 3D features. Clustering is not a similarity metric per se (the topic of this section), but it is included here because it is a "cheap substitute." That is, when someone wants to find compounds similar to a known compound, you can show them the group (the cluster) to which the compound belongs. It allows you to spend a lot of computational time up front to pre-compute the clusters, and then give answers very quickly.

Many cheminformatics databases have one or more similarity searches available.

Previous: Substructure

Next: Chemical Registration

Chemical Registration is the "big brother" of cheminformatics.

A cheminformatics system is primarily devoted to recording chemical structure. Chemical Registration systems are additionally concerned with:

Structural novelty - ensure that each compound is only registered once
Structural normalization - ensure that structures with alternative representations (such as nitro groups, ferrocenes, and tautomers) are entered in a uniform way.
Structure drawing - ensure that compounds are drawn in a uniform fashion, so that they can be quickly recognized "by eye."
Maintaining relationships among related compounds. For example, all salt forms of a compound should be recognized as being related to one another, and compounds in different solvates are also related.
Registering mixtures, formulations and alternative structures.
Registering compounds whose structures are unknown.
Roles, responsibilities, security, and company workflow.
Updates, amendments and corrections, and controlling propagation of changes (e.g. does changing a compound change a mixture containing that compound?)

The scope of Chemical Registration Systems is far beyond the goals of this brief introduction to cheminformatics. However, to illustrate just one of the points above, let's consider structural novelty. In real life, chemical structure can be very ambiguous. Imagine you have five bottles of a particular compound that has a stereo center:

The contents of the first bottle were carefully analyzed, and found to be a single stereoisomer.
The contents of the second bottle were carefully analyzed and found to contain a racemic mixture of the stereoisomers.
The stereoisomers of the third bottle are unknown. It may be pure, or have one predominant form, or be a racemic mixture.
The fourth bottle was obtained by running part of the contents of bottle #2 through a chromatographic separation. It is isotopically pure, but you don't know which stereoisomer.
The fifth bottle is the other fraction from the same separation of #4. It is also isotopically pure, you don't know which stereoisomer, but you know it's the opposite of #4.

Which of these five bottles contain the same compound, and which are different? That is the essential task of a chemical registry system, which would consider all five to be different. After all, you probably have data about each bottle (that's why you have them), and you must be able to record it and not confuse it with the other bottles.

In this example above, consider what is known and not known:

Bottle	Known	Not Known
1	Everything	Nothing
2	Everything	Nothing
3	Compound is known	Stereochemistry
4	Compound and purity known, stereochemistry is opposite of #5	Specific stereochemistry
5	Compound and purity known, stereochemistry is opposite of #4	Specific stereochemistry

A cheminformatics system has no way to record the contents of the five bottles; it is only concerned with structure. By contrast, a chemical registration system can record both what is known as well as what is not known. This is the critical difference between the two.

Previous: Molecular Similarity