Information theory and molecular biology touch on a huge number of topics. As a result there are many ways that one can get into intellectual trouble, and many of these are widely repeated in the literature. This page is devoted to listing the pitfalls that I have come across and needed to solve to create a consistent theory. Not everything that is in the literature is correct!
Using ambiguous or poor terminology
Confusing a model with reality: consensus sequences. The main example is confusing a consensus sequence (a model) with a binding site (a natural phenomenon). See The Consensus Sequence Hall of Fame and the paper Consensus Sequence Zen.
Using the popular meaning of the term 'information'. In physics it is well understood that the term 'force' has a precise technical definition, and this allows one to write Newton's famous equation F = ma (force is mass times acceleration). This is quite different from the popular use of 'force', as in 'the force of my rhetoric'. Clearly 'my rhetoric' is not usually meant to be an acceleration applied to the mass of your brain! Likewise, Shannon defined information in a precise technical sense. Beware of writers who slip from the technical definition into the popular one. See also:
Thinking that information (R) is the same as uncertainty (H). Because of noise, after a communication there is always some uncertainty remaining, H_after, and this must be subtracted from the uncertainty before the communication is sent, H_before, so R = H_before - H_after. In combination with the previous pitfall, this one has led many authors to conclude that information is randomness. (A small numerical sketch follows the examples below.) Examples:
"'Information' is, of course, not the very opposite of randomness. Elitzur is using the word 'information' in the semantic sense, as a synonym for knowledge or meaning. Everyone knows that a random sequence, that is, one chosen without intersymbol restrictions or influence, carries the most information in the sense used by Shannon and in computer technology." ... to which I (Tom Schneider) responded:
Here you have made the mistake of setting H_after to zero. A random sequence going into a receiver does not decrease the uncertainty of the receiver, so no information is received. But a message does allow for the decrease. Even the same signal can be information to one receiver and noise to another, depending on the receiver!
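To make the distinction concrete, here is a minimal Python sketch (my own illustration, not from any of the papers cited here) that computes the uncertainty before and after communication and takes information as the difference. The probability values are arbitrary assumptions for the demo.

```python
import math

def uncertainty(probs):
    """Shannon uncertainty H = -sum(p * log2 p), in bits per symbol."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Receiver's uncertainty before communication: any of the
# four DNA bases is equally likely (assumed for the demo).
h_before = uncertainty([0.25, 0.25, 0.25, 0.25])  # 2 bits

# Noise leaves some uncertainty after communication; here the
# receiver still confuses two symbols with probabilities 0.9 / 0.1.
h_after = uncertainty([0.9, 0.1])  # about 0.47 bits

# Information is the DECREASE in uncertainty, not H itself:
R = h_before - h_after
print(f"R = {h_before:.2f} - {h_after:.2f} = {R:.2f} bits per symbol")
```

Setting h_after to zero is exactly the step that makes a random sequence look maximally 'informative'.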
Treating Uncertainty (H) and Entropy (S) as identical OR treating them as completely unrelated. The former philosophy is clearly incorrect because uncertainty has units of bits per symbol while entropy has units of joules per kelvin. The latter philosophy is overcome by noting that the two can be related if one can correlate the probabilities of the microstates of the system under consideration with the probabilities of the symbols. See Theory of Molecular Machines. II. Energy Dissipation from Molecular Machines (J. Theor. Biol. 148:125-137, 1991) for how to do this. Examples:
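As a rough sketch of the conversion (assuming, as in the paper above, that the microstate probabilities of the system can be identified with the symbol probabilities), one bit of uncertainty corresponds to k_B ln(2) joules per kelvin:

```python
import math

K_B = 1.380649e-23  # Boltzmann's constant, J/K

def entropy_from_uncertainty(h_bits):
    """Thermodynamic entropy S (J/K) from uncertainty H (bits),
    valid only when the microstate probabilities can be identified
    with the symbol probabilities: S = k_B * ln(2) * H."""
    return K_B * math.log(2) * h_bits

# One bit of uncertainty corresponds to about 9.57e-24 J/K:
print(entropy_from_uncertainty(1.0))
```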
Using the term "Shannon entropy". Although Shannon himself did this, it was a mistake because it leads to thinking that the thermodynamic entropy is the same as the "Shannon entropy". There are two extreme classes of error:
Modeling or depicting free energy surfaces as two-dimensional. Such surfaces are high dimensional, and this has severe effects on the shape of the path. If the individual valleys are Gaussian, the final shape in high dimensions is a sphere. See Theory of Molecular Machines. I. Channel Capacity of Molecular Machines. Examples:
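A quick numerical illustration (mine, not taken from the paper): points drawn from a standard Gaussian in D dimensions concentrate on a thin shell of radius about sqrt(D), so a 'valley' that looks like a simple bowl in two dimensions is effectively a sphere in high dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Radii of points drawn from a standard Gaussian in D dimensions.
# As D grows, the mass concentrates on a thin shell of radius ~sqrt(D).
for d in (2, 10, 100, 1000):
    r = np.linalg.norm(rng.standard_normal((10_000, d)), axis=1)
    print(f"D={d:5d}  mean radius={r.mean():7.2f}  "
          f"relative spread={r.std() / r.mean():.4f}")
```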
Ignoring the number Zero. Molecular biologists are in a nasty habit of not including zero in their counting systems. Surprisingly, zero was invented several thousand years ago. Physicists are shocked when I tell them that to a molecular biologist, counting goes like this: -3, -2, -1, +1, +2, +3 ... (For this reason, molecular biologists may not have not noticed the millennium Y2K transition.) Methods for how to treat zero coordinate systems are given in the glossary. If one creates a sequence logo without a zero, then one will be seriously bitten later on when one starts using sequence walkers, because the location of a sequence walker has to be specified and the natural place to do this is the zero base. Examples:
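Here is a hypothetical pair of helper functions (illustrative only; see the glossary for the actual treatment) mapping the zero-less biologist convention onto an ordinary integer scale:

```python
def biologist_to_zero_based(coord):
    """Convert a coordinate from the zero-less biologist convention
    (..., -2, -1, +1, +2, ...) to an ordinary integer scale with zero.
    (Hypothetical helper, not part of any published tool.)"""
    if coord == 0:
        raise ValueError("the biologist convention has no zero")
    return coord - 1 if coord > 0 else coord

def zero_based_to_biologist(coord):
    """Inverse conversion: zero and positive coordinates shift up by one."""
    return coord + 1 if coord >= 0 else coord

# The biologist scale -3,-2,-1,+1,+2,+3 maps onto -3,-2,-1,0,1,2:
print([biologist_to_zero_based(c) for c in (-3, -2, -1, 1, 2, 3)])
```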
Thinking that bits are merely a measure of statistical non-randomness. One can compute the significance of a position in a binding site as the number of z scores above background (e.g., for splice junctions). However, this prevents one from thinking of the bits as a measure of sequence conservation, which is a different thing. Aside from small sample effects, which can be corrected, the average number of bits in a binding site does not change as the sample size changes; by contrast, the error bars on a sequence logo show the significance of the conservation.
Maxwell's Demon. There is a huge literature on Maxwell's Demon and it is full of errors, too many to list here. The basic problem is that the people who write about the Demon are not molecular biologists; they are physicists and philosophers who do not know molecular biology, so they are not thinking in realistic molecular terms. If one treats the demon as a real physical being or device, then it is clear that there are natural analogues for the things he has to do, and none of these violate the Second Law of Thermodynamics. If one does not treat the demon as a real physical device, then one has already violated known physics, and so violation of the Second Law is not surprising. See nano2 for a detailed debunking of the Demon.
The meaning of ΔS in the ΔG equation.
It is well known from thermodynamics that the free energy change is:

ΔG = ΔH - T ΔS

Often people talk about ΔS in this equation as "the" entropy. This is misleading if not downright incorrect.

ΔS in the above equation is the entropy change of the system:

ΔS = ΔS_system

ΔH corresponds to the entropy change of the surroundings:

ΔH = ΔH_system = -T ΔS_surroundings

so the total free energy change is:

ΔG_system = ΔH_system - T ΔS_system
          = -T ΔS_surroundings - T ΔS_system
          = -T ΔS_total

This, of course, is why ΔG_system corresponds to the total entropy change, and it is why one can use the sign of ΔG_system to predict the direction of a chemical reaction.
So ΔH_system is misnamed, since it is about what happens outside the system.
The pitfall is to think or say that ΔS_system is "the" entropy change. It's not, since it is only part of the total entropy change.
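A quick numerical check of the identity above, using made-up values for ΔH_system and ΔS_system:

```python
T = 298.0           # temperature, K
dH_system = -40e3   # enthalpy change of the system, J/mol (assumed)
dS_system = -50.0   # entropy change of the system, J/(K mol) (assumed)

dG_system = dH_system - T * dS_system   # -25100 J/mol
dS_surroundings = -dH_system / T        # +134.2 J/(K mol)
dS_total = dS_system + dS_surroundings  # +84.2 J/(K mol)

# dG_system equals -T * dS_total, so a negative dG_system
# means the TOTAL entropy (system + surroundings) increases:
print(dG_system, -T * dS_total)         # both about -25100 J/mol
```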
Reference:
J. Darnell, H. Lodish, and D. Baltimore. Molecular Cell Biology. Scientific American Books, Inc., N.Y., 1986. See pages 36-38.
Entropy is not "disorder" but a measure of the dispersal of energy, as Dr. Frank L. Lambert puts it. An entropy increase MIGHT lead to disorder (by that I mean the scattering of matter), but then, as in living things, it might not!
How can we relate this idea to molecular information theory? 'Disorder' is the patterns (or mess) left behind after energy dissipates away. The measure Rsequence (the information content of a binding site) is a measure of the residue of energy dissipation left as a pattern in the DNA (by mutation and selection) when a protein binds to DNA. On the other hand, Rfrequency, the information required to find a set of binding sites, corresponds to the decrease of the positional entropy of the protein. To drive this decrease, the entropy of the surroundings must increase more, by dissipation of energy. After the energy has dissipated away, the protein is bound. So the protein bound at the specific genetic control points represents 'ordering'. This concept applies in general to the way life dissipates energy to survive.
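As a sketch, Rfrequency can be computed as log2(G/gamma), the number of bits needed to single out gamma sites among G possible positions; the genome size and site count below are illustrative stand-ins, not measured values:

```python
import math

def r_frequency(genome_size, n_sites):
    """Rfrequency = log2(G / gamma): the bits needed to single out
    gamma binding sites among G possible positions."""
    return math.log2(genome_size / n_sites)

# Illustrative stand-in numbers, roughly E. coli-sized:
G = 4.7e6     # possible positions in the genome
gamma = 100   # sites the protein must find
print(f"Rfrequency = {r_frequency(G, gamma):.1f} bits per site")
```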
Using the global ΔG to explain how a single molecule finds its site. It is common to write the binding reaction

EcoRI + DNA <--> EcoRI.DNA

and to talk about the global ΔG. However, this tells us nothing about how a single molecule binds to the DNA. A single molecule will find a binding site IRRESPECTIVE OF THE TOTAL CONCENTRATIONS OF OTHER MOLECULES IN THE SOLUTION. In other words, the global ΔG is NOT relevant to the problem of how EcoRI finds its binding site. This is widely misunderstood in the literature.
"Per cent identity" does not take into account that amino acids are almost always not equally probable and for this reason leads to illusions. Mutual entropy is the correct measure of "similarity".The term 'entropy' should not be used, but otherwise the statement is correct. This means that the basis of the widely used phylogenetic tree generating programs, such as Clustal, is unreliable. These programs begin by pairwise comparison of the percent identity of proteins.
--- H. P. Yockey. Information theory, evolution and the origin of life. Information Sciences, 141:219-225, 2002.
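A small Python sketch (my own construction) of why percent identity can mislead: two sequence pairs can both be 100% identical yet carry very different amounts of mutual information, because mutual information accounts for the residue probabilities.

```python
import math
from collections import Counter

def percent_identity(s1, s2):
    """Fraction of aligned positions with identical residues."""
    return 100.0 * sum(a == b for a, b in zip(s1, s2)) / len(s1)

def mutual_information(s1, s2):
    """Mutual information between two aligned sequences, bits/position."""
    n = len(s1)
    joint = Counter(zip(s1, s2))
    p1, p2 = Counter(s1), Counter(s2)
    return sum((c / n) * math.log2((c / n) / ((p1[a] / n) * (p2[b] / n)))
               for (a, b), c in joint.items())

# Both pairs are 100% identical, but the poly-A pair shares
# zero mutual information: matching a residue that is always
# there tells you nothing about the other sequence.
for s1, s2 in (("ACDEFGHIKL", "ACDEFGHIKL"), ("AAAAAAAAAA", "AAAAAAAAAA")):
    print(f"{percent_identity(s1, s2):.0f}% identity, "
          f"{mutual_information(s1, s2):.2f} bits per position")
```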
"We were wondering if you could point us in the right direction. We are doing some SELEX experiments using rounds of selection with random oligos to determine the DNA binding sites of a zinc finger protein. Do you know of any web sites that can easily determine a possible consensus from such sequences?"
Ok, I can help you with this, thanks for asking, but it is important to understand two things. First, if you create a consensus sequence after having done your beautiful SELEX, you will be throwing out most of your hard-earned data! See the Consensus Sequence Zen paper and also the entry on consensus sequences on this page.
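For instance, here is a minimal sketch (small-sample correction omitted, example sites invented) of how much information a set of aligned SELEX sites carries when computed per position as 2 - H, rather than collapsed to a consensus:

```python
import math

def r_sequence(sites):
    """Information content (bits) of a set of aligned DNA sites,
    summed over positions as 2 - H; small-sample correction omitted."""
    total = 0.0
    for column in zip(*sites):
        freqs = [column.count(b) / len(column) for b in "ACGT"]
        h = -sum(f * math.log2(f) for f in freqs if f > 0)
        total += 2.0 - h
    return total

# Invented SELEX-like sites for the demo:
sites = ["TAATGT", "TATAAT", "TACAAT", "TAAAAT", "TATACT"]
print(f"Rsequence is about {r_sequence(sites):.1f} bits")
# Collapsing these to a consensus keeps only the most frequent
# base per position and throws away the frequency data that
# these bits are computed from.
```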
You wouldn't say that you walked 5 today, would you? 5 what?
"The choice of a logarithmic base corresponds to the choice of a unit for measuring information. If the base 2 is used the resulting units may be called binary digits, or more briefly bits, a word suggested by J. W. Tukey."There are papers that have used the natural log and others that have used log base 2 for measuring information in biology, so it is important to indicate the units. If you don't say the units (as in, "3 bits" or "19 bits per site") your paper will not be precise enough for someone to replicate the work. The word 'bits' is very important to have after every number. Examples:
One example is the relative entropy measure

Sum_i P_ij log2 (P_ij / q_i)

where P_ij is the frequency of amino acid i at a given position j (for example) and q_i is the frequency of amino acid i in proteins in general. The problem with this measure is that it gives results that are not consistent with information theory. For example, the maximum information required to select one amino acid out of 20 is log2 20 = 4.3 bits, yet this statistical measure can give more than 5 bits. So it is incorrect to assign the units of bits to the results of this measure.
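A short demonstration of the inconsistency, with invented background frequencies: when a rare amino acid is perfectly conserved, this measure exceeds the log2 20 = 4.3 bit ceiling for a choice among 20 symbols.

```python
import math

def relative_entropy(p, q):
    """Sum_i p_i log2(p_i / q_i), in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Invented background: one rare amino acid at 1%, the other
# 19 amino acids sharing the remainder equally.
q = [0.01] + [0.99 / 19] * 19

# A position where the rare amino acid is perfectly conserved:
p = [1.0] + [0.0] * 19

print(math.log2(20))           # 4.32 bits: the most a 1-of-20 choice can need
print(relative_entropy(p, q))  # 6.64: yet the measure exceeds that ceiling
```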
Schneider Lab
origin: 2002 March 13
updated: 2008 Aug 12
version = 1.41 of pitfalls.html