Peter K. Rogan 1 and Thomas D. Schneider2 3 4
version = 1.38 of colonsplice.tex 2000 Nov 8
Research Article,
Human Mutation vol. 6, No. 1, pages 74-76, 1995
short title:
Identifying Splice Site Polymorphisms
Predicting the effects of nucleotide substitutions in human splice sites has been based on analysis of consensus sequences. We used a graphic representation of sequence conservation and base frequency, the sequence logo, to demonstrate that a change in a splice acceptor of hMSH2 (a gene associated with familial nonpolyposis colon cancer) probably does not reduce splicing efficiency. This confirms a population genetic study that suggested that this substitution is a genetic polymorphism. The information-theory based sequence logo is quantitative and more sensitive than the corresponding splice acceptor consensus sequence for detection of true mutations. Information analysis may potentially be used to distinguish polymorphisms from mutations in other types of transcriptional, translational or protein coding motifs.
KEY WORDS: Information theory, Human splice sites, DNA sequencing,
mutation, polymorphism
Nucleotide substitutions in a human splice donor or acceptor recognition sequence often disrupt processing of the normal transcript. Although altered mRNA splicing must ultimately be confirmed experimentally, tools capable of predicting the effects of these substitutions may be useful in recognizing those which are most likely deleterious. Mutations in human splice sites have been conventionally identified by comparing the sequence of the putative mutation with the consensus sequence [Mount, 1982,Nakai and Sakamoto, 1994]. This type of comparison can produce misleading results since the consensus sequence contains only the most representative nucleotides at each position. In some instances, it may not be possible to distinguish between deleterious mutations and silent genetic polymorphisms. This paper describes a more robust alternative.
Information analysis of normal splice junctions reveals partially conserved
nucleotide sequences that are not always reflected in the corresponding
consensus sequence [Stephens and Schneider, 1992]. Information content may be
represented by a sequence logo (Fig. 1),
which depicts the relative contribution of each position of the splice site
and the relative frequencies of each nucleotide at every position
[Schneider and Stephens, 1990]. The logo illustrates the full range of
normal variants in the splice junction.
To determine whether a nucleotide substitution in a splice
site represents a polymorphism or a
mutation, the individual information content of
the site is compared with the overall
distribution of individual
information in a set of 1800 human splice sites.
As an example of this method, we have analyzed the
T
C transition found at
position -5 of the intervening sequence of the hMSH2 gene from multiple,
independent sporadic colon carcinomas and patients with Lynch syndrome
[Fishel et al., 1993].
Other mutations in the coding domain of this gene cause hereditary
nonpolyposis colon cancer by disrupting the repair of somatic lesions that
accumulate in genomic DNA
[Leach et al., 1993].
Although the substitution at -5 was proposed
to cause aberrant splicing of hMSH2 mRNA [Fishel et al., 1993],
our analysis suggests that
it is probably not
deleterious to maturation of the hMSH2 message.
First, upon inspection of the sequence logo, there is a nearly equal
probability of observing C or T at position -5 in this set of splice acceptor
sequences (Fig. 1; this corresponds to position -6 in
[Fishel et al., 1993]).
Second,
cytosine at this position does not
impede the normal splicing of 691 of 1712 acceptor sites derived from numerous
human genes
[Stephens and Schneider, 1992].
Third,
we find that the common allele contains 6.5 bits of information,
and the substitution weakens it to 6.3 bits.
The average of the distribution of sites is 9.3 bits,
and the distribution has a standard deviation of 4.6 bits.
Nonfunctional sites are
predicted to be
below zero on this scale
[Schneider, 1994].
Indeed, 2 of 20 unrelated normal individuals displayed this variant,
consistent with the suggestion that this change represents a polymorphism
[Leach et al., 1993].
This change is unlikely to affect the recognition of other nucleotides in the same acceptor site, as mutational analysis of the polypyrimidine tract in which it resides suggests that these nucleotides are independently recognized by the spliceosome [Stephens and Schneider, 1992,Roscigno et al., 1993]. We have found 196 normal human sites with the same or lower information content as the hMSH2 acceptor containing this substitution. 51 of these contain cytosine at position -5. Either the true mutation lies elsewhere, in this or another gene [Leach et al., 1993,Bronner et al., 1994,Papadopoulos et al., 1994], or the change indicates that this base is involved in a genetic control mechanism other than mRNA splicing [Amrein et al., 1994].
To summarize, inference of genetic mutations in splice junction recognition sites based on consensus sequences may be inaccurate, whereas information analysis of sequence variants can distinguish between polymorphic nucleotides and mutant sites. True mutations are expected to reside in positions in which the sequence conservation in bits significantly exceeds the background variation [Stephens and Schneider, 1992] and where the base frequency decreases significantly. However, the identification of a mutation by information analysis does not always imply that the substitution will have a phenotype. For example, incomplete penetrance may affect the reliability of molecular diagnosis based on information analysis.
A similar approach could be applied to the analysis of other conserved transcriptional and translational signals or protein motifs in human sequences.5
We thank Denise Rubens and Paul N. Hengen for comments on the manuscript. P.K.R. is supported by the Leukemia Research Foundation, the American Cancer Society, the March of Dimes Birth Defects Foundation, the Four Diamonds Pediatric Cancer Research Foundation, the David S. and Amy S. Goldberg Memorial Fund for Pediatric Research, and PHS R29-HD 29098-01.
![]() This sequence logo was created from 1744 wild-type acceptor sites. The height of each nucleotide is proportional to its frequency at that position, while the height of each entire stack of nucleotides corresponds to the information measure (in bits) or, equivalently, the sequence conservation at that position. When sequence conservation is measured in bits, the relative heights of the stacks can be compared to one another and the total sequence conservation in a region can be found by adding the heights of the stacks together [Shannon and Weaver, 1949,Sloane and Wyner, 1993,Pierce, 1980]. Coordinates in the splice site are defined along the abscissa. RNA strand cleavage during splicing occurs at the vertical line between positions 0 and 1. All positions except -3 in this logo are significantly above background ( ![]() ![]() |