Information theory and molecular biology touch on a huge number of topics. As a result there are many ways that one can get into intellectual trouble, and many of these are widely repeated in the literature. This page is devoted to listing the pitfalls that I have come across and needed to solve to create a consistent theory. Not everything that is in the literature is correct!
Using ambiguous or poor terminology
Confusing a model with reality: consensus sequences. The main example is confusing a consensus sequence (a model) with a binding site (a natural phenomenon). See The Consensus Sequence Hall of Fame and the paper Consensus Sequence Zen.
Using the popular meaning of the term 'information'. In physics it is well understood that the term 'force' has a precise technical definition, and this allows one to write Newton's famous equation F = ma (force is mass times acceleration). This is quite different from the popular use of 'force', as in 'the force of my rhetoric'. Clearly 'my rhetoric' is not usually meant to be an acceleration applied to the mass of your brain! Likewise, Shannon defined information in a precise technical sense. Beware of writers who slip from the technical definition into the popular one. See also:
Thinking that information (R) is the same as uncertainty (H). Because of noise, after a communication there is always some uncertainty remaining, H_after, and this must be subtracted from the uncertainty before the communication is sent, H_before, so R = H_before - H_after. In combination with the previous pitfall, this one has led many authors to conclude that information is randomness. (A small numerical sketch follows the examples below.) Examples:
"'Information' is, of course, not the very opposite of randomness. Elitzur is using the word 'information' in the semantic sense, as a synonym for knowledge or meaning. Everyone knows that a random sequence, that is, one chosen without intersymbol restrictions or influence, carries the most information in the sense used by Shannon and in computer technology." ... to which I (Tom Schneider) responded:
Here you have made the mistake of setting H_after to zero. A random sequence going into a receiver does not decrease the uncertainty of the receiver, so no information is received. But a message does allow for the decrease. Even the same signal can be information to one receiver and noise to another, depending on the receiver!
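To make the distinction concrete, here is a minimal Python sketch (my own illustration, not from any of the papers cited here) that computes the uncertainty before and after communication and takes information as the difference. The probability values are arbitrary assumptions for the demo.

```python
import math

def uncertainty(probs):
    """Shannon uncertainty H = -sum(p * log2 p), in bits per symbol."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Receiver's uncertainty before communication: any of the
# four DNA bases is equally likely (assumed for the demo).
h_before = uncertainty([0.25, 0.25, 0.25, 0.25])  # 2 bits

# Noise leaves some uncertainty after communication; here the
# receiver still confuses two symbols with probabilities 0.9 / 0.1.
h_after = uncertainty([0.9, 0.1])  # about 0.47 bits

# Information is the DECREASE in uncertainty, not H itself:
R = h_before - h_after
print(f"R = {h_before:.2f} - {h_after:.2f} = {R:.2f} bits per symbol")
```

Setting h_after to zero is exactly the step that makes a random sequence look maximally 'informative'.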
Treating Uncertainty (H) and Entropy (S) as identical OR treating them as completely unrelated. The former philosophy is clearly incorrect because uncertainty has units of bits per symbol while entropy has units of joules per kelvin. The latter philosophy is overcome by noting that the two can be related if one can correlate the probabilities of the microstates of the system under consideration with the probabilities of the symbols. See Theory of Molecular Machines. II. Energy Dissipation from Molecular Machines (J. Theor. Biol. 148:125-137, 1991) for how to do this. Examples:
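As a rough sketch of the conversion (assuming, as in the paper above, that the microstate probabilities of the system can be identified with the symbol probabilities), one bit of uncertainty corresponds to k_B ln(2) joules per kelvin:

```python
import math

K_B = 1.380649e-23  # Boltzmann's constant, J/K

def entropy_from_uncertainty(h_bits):
    """Thermodynamic entropy S (J/K) from uncertainty H (bits),
    valid only when the microstate probabilities can be identified
    with the symbol probabilities: S = k_B * ln(2) * H."""
    return K_B * math.log(2) * h_bits

# One bit of uncertainty corresponds to about 9.57e-24 J/K:
print(entropy_from_uncertainty(1.0))
```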
Using the term "Shannon entropy". Although Shannon himself did this, it was a mistake because it leads to thinking that the thermodynamic entropy is the same as the "Shannon entropy". There are two extreme classes of error:
Modeling or depicting free energy surfaces as two-dimensional. Such surfaces are high dimensional, and this has severe effects on the shape of the path. If the individual valleys are Gaussian, the final shape in high dimensions is a sphere. See Theory of Molecular Machines. I. Channel Capacity of Molecular Machines. Examples:
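A quick numerical illustration (mine, not taken from the paper): points drawn from a standard Gaussian in D dimensions concentrate on a thin shell of radius about sqrt(D), so a 'valley' that looks like a simple bowl in two dimensions is effectively a sphere in high dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Radii of points drawn from a standard Gaussian in D dimensions.
# As D grows, the mass concentrates on a thin shell of radius ~sqrt(D).
for d in (2, 10, 100, 1000):
    r = np.linalg.norm(rng.standard_normal((10_000, d)), axis=1)
    print(f"D={d:5d}  mean radius={r.mean():7.2f}  "
          f"relative spread={r.std() / r.mean():.4f}")
```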
Ignoring the number Zero. Molecular biologists are in a nasty habit of not including zero in their counting systems. Surprisingly, zero was invented several thousand years ago. Physicists are shocked when I tell them that to a molecular biologist, counting goes like this: -3, -2, -1, +1, +2, +3 ... (For this reason, molecular biologists may not have not noticed the millennium Y2K transition.) Methods for how to treat zero coordinate systems are given in the glossary. If one creates a sequence logo without a zero, then one will be seriously bitten later on when one starts using sequence walkers, because the location of a sequence walker has to be specified and the natural place to do this is the zero base. Examples:
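Here is a hypothetical pair of helper functions (illustrative only; see the glossary for the actual treatment) mapping the zero-less biologist convention onto an ordinary integer scale:

```python
def biologist_to_zero_based(coord):
    """Convert a coordinate from the zero-less biologist convention
    (..., -2, -1, +1, +2, ...) to an ordinary integer scale with zero.
    (Hypothetical helper, not part of any published tool.)"""
    if coord == 0:
        raise ValueError("the biologist convention has no zero")
    return coord - 1 if coord > 0 else coord

def zero_based_to_biologist(coord):
    """Inverse conversion: zero and positive coordinates shift up by one."""
    return coord + 1 if coord >= 0 else coord

# The biologist scale -3,-2,-1,+1,+2,+3 maps onto -3,-2,-1,0,1,2:
print([biologist_to_zero_based(c) for c in (-3, -2, -1, 1, 2, 3)])
```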
Thinking that bits are merely a measure of statistical non-randomness. One can compute the significance of a position in a binding site as the number of z scores above background (e.g., for splice junctions). However, this prevents one from thinking of the bits as a measure of sequence conservation, which is a different thing. Aside from small sample effects, which can be corrected, the average number of bits in a binding site does not change as the sample size changes; by contrast, the error bars on a sequence logo show the significance of the conservation.
Maxwell's Demon. There is a huge literature on Maxwell's Demon and it is full of errors, too many to list here. The basic problem is that the people who write about the Demon are not molecular biologists; they are physicists and philosophers who do not know molecular biology, so they are not thinking in realistic molecular terms. If one treats the demon as a real physical being or device, then it is clear that there are natural analogues for the things he has to do, and none of these violate the Second Law of Thermodynamics. If one does not treat the demon as a real physical device, then one has already violated known physics, and so violation of the Second Law is not surprising. See nano2 for a detailed debunking of the Demon.
The meaning of ΔS in the ΔG equation.
It is well known from thermodynamics that the free energy change is:

ΔG = ΔH - T ΔS

Often people talk about ΔS in this equation as "the" entropy. This is misleading if not downright incorrect.

ΔS in the above equation is the entropy change of the system:

ΔS = ΔS_system

ΔH corresponds to the entropy change of the surroundings:

ΔH = ΔH_system = -T ΔS_surroundings

so the total free energy change is:

ΔG_system = ΔH_system - T ΔS_system
          = -T ΔS_surroundings - T ΔS_system
          = -T ΔS_total

This, of course, is why ΔG_system corresponds to the total entropy change, and it is why one can use the sign of ΔG_system to predict the direction of a chemical reaction.
So ΔH_system is misnamed, since it is about what happens outside the system.
The pitfall is to think or say that ΔS_system is "the" entropy change. It's not, since it is only part of the total entropy change.
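A quick numerical check of the identity above, using made-up values for ΔH_system and ΔS_system:

```python
T = 298.0           # temperature, K
dH_system = -40e3   # enthalpy change of the system, J/mol (assumed)
dS_system = -50.0   # entropy change of the system, J/(K mol) (assumed)

dG_system = dH_system - T * dS_system   # -25100 J/mol
dS_surroundings = -dH_system / T        # +134.2 J/(K mol)
dS_total = dS_system + dS_surroundings  # +84.2 J/(K mol)

# dG_system equals -T * dS_total, so a negative dG_system
# means the TOTAL entropy (system + surroundings) increases:
print(dG_system, -T * dS_total)         # both about -25100 J/mol
```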
Reference:
J. Darnell, H. Lodish, and D. Baltimore. Molecular Cell Biology. Scientific American Books, Inc., N.Y., 1986. See pages 36-38.
Entropy is not "disorder" but a measure of the dispersal of energy, as Dr. Frank L. Lambert puts it. An entropy increase MIGHT lead to disorder (by that I mean the scattering of matter), but then, as in living things, it might not!
How can we relate this idea to molecular information theory? 'Disorder' is the patterns (or mess) left behind after energy dissipates away. The measure Rsequence (the information content of a binding site) is a measure of the residue of energy dissipation left as a pattern in the DNA (by mutation and selection) when a protein binds to DNA. On the other hand, Rfrequency, the information required to find a set of binding sites, corresponds to the decrease of the positional entropy of the protein. To drive this decrease, the entropy of the surroundings must increase more, by dissipation of energy. After the energy has dissipated away, the protein is bound. So the protein bound at the specific genetic control points represents 'ordering'. This concept applies in general to the way life dissipates energy to survive.
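As a sketch, Rfrequency can be computed as log2(G/gamma), the number of bits needed to single out gamma sites among G possible positions; the genome size and site count below are illustrative stand-ins, not measured values:

```python
import math

def r_frequency(genome_size, n_sites):
    """Rfrequency = log2(G / gamma): the bits needed to single out
    gamma binding sites among G possible positions."""
    return math.log2(genome_size / n_sites)

# Illustrative stand-in numbers, roughly E. coli-sized:
G = 4.7e6     # possible positions in the genome
gamma = 100   # sites the protein must find
print(f"Rfrequency = {r_frequency(G, gamma):.1f} bits per site")
```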
Using the global ΔG to explain how a single molecule finds its site. It is common to write the binding reaction

EcoRI + DNA <--> EcoRI.DNA

and to talk about the global ΔG. However, this tells us nothing about how a single molecule binds to the DNA. A single molecule will find a binding site IRRESPECTIVE OF THE TOTAL CONCENTRATIONS OF OTHER MOLECULES IN THE SOLUTION. In other words, the global ΔG is NOT relevant to the problem of how EcoRI finds its binding site. This is widely misunderstood in the literature.
"Per cent identity" does not take into account that amino acids are almost always not equally probable and for this reason leads to illusions. Mutual entropy is the correct measure of "similarity".The term 'entropy' should not be used, but otherwise the statement is correct. This means that the basis of the widely used phylogenetic tree generating programs, such as Clustal, is unreliable. These programs begin by pairwise comparison of the percent identity of proteins.
--- H. P. Yockey. Information theory, evolution and the origin of life. Information Sciences, 141:219-225, 2002.
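A small Python sketch (my own construction) of why percent identity can mislead: two sequence pairs can both be 100% identical yet carry very different amounts of mutual information, because mutual information accounts for the residue probabilities.

```python
import math
from collections import Counter

def percent_identity(s1, s2):
    """Fraction of aligned positions with identical residues."""
    return 100.0 * sum(a == b for a, b in zip(s1, s2)) / len(s1)

def mutual_information(s1, s2):
    """Mutual information between two aligned sequences, bits/position."""
    n = len(s1)
    joint = Counter(zip(s1, s2))
    p1, p2 = Counter(s1), Counter(s2)
    return sum((c / n) * math.log2((c / n) / ((p1[a] / n) * (p2[b] / n)))
               for (a, b), c in joint.items())

# Both pairs are 100% identical, but the poly-A pair shares
# zero mutual information: matching a residue that is always
# there tells you nothing about the other sequence.
for s1, s2 in (("ACDEFGHIKL", "ACDEFGHIKL"), ("AAAAAAAAAA", "AAAAAAAAAA")):
    print(f"{percent_identity(s1, s2):.0f}% identity, "
          f"{mutual_information(s1, s2):.2f} bits per position")
```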
"We were wondering if you could point us in the right direction. We are doing some SELEX experiments using rounds of selection with random oligos to determine the DNA binding sites of a zinc finger protein. Do you know of any web sites that can easily determine a possible consensus from such sequences?"
Ok, I can help you with this, thanks for asking, but it is important to understand two things. First, if you create a consensus sequence after having done your beautiful SELEX, you will be throwing out most of your hard-earned data! See the Consensus Sequence Zen paper and also the entry on consensus sequences on this page.
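For instance, here is a minimal sketch (small-sample correction omitted, example sites invented) of how much information a set of aligned SELEX sites carries when computed per position as 2 - H, rather than collapsed to a consensus:

```python
import math

def r_sequence(sites):
    """Information content (bits) of a set of aligned DNA sites,
    summed over positions as 2 - H; small-sample correction omitted."""
    total = 0.0
    for column in zip(*sites):
        freqs = [column.count(b) / len(column) for b in "ACGT"]
        h = -sum(f * math.log2(f) for f in freqs if f > 0)
        total += 2.0 - h
    return total

# Invented SELEX-like sites for the demo:
sites = ["TAATGT", "TATAAT", "TACAAT", "TAAAAT", "TATACT"]
print(f"Rsequence is about {r_sequence(sites):.1f} bits")
# Collapsing these to a consensus keeps only the most frequent
# base per position and throws away the frequency data that
# these bits are computed from.
```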
You wouldn't say that you walked 5 today, would you? 5 what?
"The choice of a logarithmic base corresponds to the choice of a unit for measuring information. If the base 2 is used the resulting units may be called binary digits, or more briefly bits, a word suggested by J. W. Tukey."There are papers that have used the natural log and others that have used log base 2 for measuring information in biology, so it is important to indicate the units. If you don't say the units (as in, "3 bits" or "19 bits per site") your paper will not be precise enough for someone to replicate the work. The word 'bits' is very important to have after every number. Examples:
One example is the relative entropy measure

Sum_i P_ij log2 (P_ij / q_i)

where P_ij is the frequency of amino acid i at a given position j (for example) and q_i is the frequency of amino acid i in proteins in general. The problem with this measure is that it gives results that are not consistent with information theory. For example, the maximum information required to select one amino acid out of 20 is log2 20 = 4.3 bits, yet this statistical measure can give more than 5 bits. So it is incorrect to assign the units of bits to the results of this measure.
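A short demonstration of the inconsistency, with invented background frequencies: when a rare amino acid is perfectly conserved, this measure exceeds the log2 20 = 4.3 bit ceiling for a choice among 20 symbols.

```python
import math

def relative_entropy(p, q):
    """Sum_i p_i log2(p_i / q_i), in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Invented background: one rare amino acid at 1%, the other
# 19 amino acids sharing the remainder equally.
q = [0.01] + [0.99 / 19] * 19

# A position where the rare amino acid is perfectly conserved:
p = [1.0] + [0.0] * 19

print(math.log2(20))           # 4.32 bits: the most a 1-of-20 choice can need
print(relative_entropy(p, q))  # 6.64: yet the measure exceeds that ceiling
```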
Schneider Lab
origin: 2002 March 13
updated: 2008 Aug 12
version = 1.41 of pitfalls.html