A genome-wide analysis of CpG dinucleotides in the human genome distinguishes two distinct classes of promoters

Serge Saxonov*†, Paul Berg†‡, and Douglas L. Brutlag*†§

*BioMedical Informatics Program and †Department of Biochemistry, Stanford University, Stanford, CA 94305

Contributed by Paul Berg, December 2, 2005

A striking feature of the human genome is the dearth of CpG dinucleotides (CpGs) interrupted occasionally by CpG islands (CGIs), regions with relatively high content of the dinucleotide. CGIs are generally associated with promoters; genes whose promoters are especially rich in CpG sequences tend to be expressed in most tissues. However, all working definitions of what constitutes a CGI rely on ad hoc thresholds. Here we adopt a direct and comprehensive survey to identify the locations of all CpGs in the human genome and find that promoters segregate naturally into two classes by CpG content. Seventy-two percent of promoters belong to the class with high CpG content (HCG), and 28% are in the class whose CpG content is characteristic of the overall genome (low CpG content). The enrichment of CpGs in the HCG class is symmetric and peaks around the core promoter. The broad-based expression of the HCG promoters is not a consequence of a correlation with CpG content because within the HCG class the breadth of expression is independent of the CpG content. The overall depletion of CpGs throughout the genome is thought to be a consequence of the methylation of some germ-line CpGs and their susceptibility to mutation. A comparison of the frequencies of inferred deamination mutations at CpG and GpC dinucleotides in the two classes of promoters using SNPs in human-chimpanzee sequence alignments shows that CpGs mutate at a lower frequency in the HCG promoters, suggesting that CpGs in the HCG class are hypomethylated in the germ line.
CpG islands | DNA methylation | epigenetics | gene expression

In vertebrates, the postreplication addition of methyl groups to the 5-position of cytosine in certain CpG dinucleotides and the maintenance of a particular genomic pattern of methylated CpGs provide an epigenetic means for differential regulation of gene expression (1-7). Indeed, the pattern of methylation often varies between cell types and different conditions, changes throughout development, and is abnormal in many disease states (5-10). A prevalent view holds that the state of CpG methylation regulates and stabilizes chromatin structure, perhaps regulating accessibility of the transcription machinery to regions of DNA (6, 9-11). Thus, whereas methylated CpGs restrict transcription, unmethylated CpGs in the vicinity of a gene allow that gene to be expressed.

The abundance of CpG dinucleotides in human DNA is much lower than expected based on the GC content (12-14), a deficit that results from the inherent mutability of methylated cytosine. Whereas the product of cytosine deamination, uracil, is readily recognized as aberrant and is repaired (4, 12, 15), the deamination product of methylated cytosine is thymine, leading to transition mutations in the next round of replication. Consequently, methylated CpGs in the germ line are likely to be lost over time (16-19). The resulting dearth of methylated CpGs is not uniform; typically, regions several hundreds of base pairs long contain an elevated number of CpGs and are referred to as CpG islands (CGIs) (13, 14, 20). Ostensibly, CGIs are retained because their CpGs are hypomethylated in the germ line, but some can arise through circumstances unrelated to methylation, such as strong selection or as a result of the prevalence of CpGs in some repeats (2, 21, 22). Because no objective standard exists for defining a CGI, the prevailing approach is to rely on ad hoc thresholds of length, CpG fraction, and GC content (20, 22, 23).
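To make concrete what a threshold-based definition looks like, the following minimal sketch tests a single sequence window against two commonly quoted parameter sets: the criteria of Gardiner-Garden and Frommer (20) (at least 200 bp, GC content of at least 0.50, observed/expected CpG ratio of at least 0.60) and the stricter criteria of Takai and Jones (22) (at least 500 bp, 0.55, and 0.65). The function name, parameter names, and toy sequence are illustrative assumptions. The observed/expected ratio is the one conventionally paired with those criteria, CpG count × length / (C count × G count), which is not the same as the normalized CpG fraction defined in Methods.

```python
def passes_cgi_thresholds(seq, min_len, min_gc, min_obs_exp):
    """Return True if one window meets all three ad hoc thresholds.

    Illustrative helper (not from the paper): length, GC content, and the
    conventional observed/expected CpG ratio are each compared to a cutoff.
    """
    seq = seq.upper()
    n = len(seq)
    if n == 0 or n < min_len:
        return False
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return False  # the ratio is undefined without both bases
    gc_content = (c + g) / n
    obs_exp = seq.count("CG") * n / (c * g)
    return gc_content >= min_gc and obs_exp >= min_obs_exp


# Two published parameter sets; both are thresholds chosen by convention.
GARDINER_GARDEN_FROMMER = dict(min_len=200, min_gc=0.50, min_obs_exp=0.60)
TAKAI_JONES = dict(min_len=500, min_gc=0.55, min_obs_exp=0.65)

window = "ACGCGT" * 50  # toy 300-bp CpG-rich sequence (hypothetical)
print(passes_cgi_thresholds(window, **GARDINER_GARDEN_FROMMER))
print(passes_cgi_thresholds(window, **TAKAI_JONES))
```

The same window can be an island under one parameter set and not under the other, which is the sense in which such definitions are ad hoc.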
Despite the absence of a satisfactory definition, CGIs have been intensively studied. On the experimental front, CGIs have conventionally been targets for interrogation when probing the methylation status of the genome (24-28). Computationally, it has been observed that CGIs are associated, albeit imperfectly, with promoters, leading to their use in promoter prediction (29, 30). Based on the threshold-based definitions, promoters with higher levels of CpGs are presumed to be associated with widely expressed genes. However, any study that attempts to analyze CGI-related properties of promoters is faced with the dual difficulty of defining what constitutes a CGI and what constitutes a CGI-promoter association.

As a prelude to determining the genome-wide pattern of CpG methylation, we have surveyed the pattern of CpGs over the human genome (31) and have calculated the prevalence of CpGs with respect to various gene-related features as annotated by the RefSeq database (32). By forgoing the use of threshold-based definitions of CGIs, we were able to uncover the existence and catalog the membership of two classes of promoters based on their CpG content: 72% of promoters with high CpG concentrations (HCG) and 28% of promoters whose CpG content was characteristic of the overall genome [low CpG concentration (LCG)]. By cataloging the promoters of the two classes, we were also able to analyze the differences in CpG distributions, mutation rates, and expression profiles.

Results

Although CpGs occur only ≈25% as often over the whole human genome as would be expected based on the GC content, their presence is elevated relative to this background level in exons and upstream regions of genes (Table 1). At any given distance from the transcription start site (TSS), exons are consistently enriched for CpGs compared with introns.
We infer that the retention and enrichment of CpGs in exons stems from coding constraints, which strongly limit the range of acceptable mutations, because noncoding exons closely resemble introns in their CpG content (Fig. 1A). Furthermore, our analysis of the CpG occurrence with respect to the coding frame is consistent with this claim (Table 4, which is published as supporting information on the PNAS web site).

Conflict of interest statement: No conflicts declared.
Freely available online through the PNAS open access option.
Abbreviations: CGI, CpG island; TSS, transcription start site; LCG, low CpG concentration; HCG, high CpG concentration.
‡To whom correspondence may be addressed at: Department of Biochemistry, Beckman Center B400, MC 5307, Stanford University, Stanford, CA 94305-5307. E-mail: pberg@cmgm.stanford.edu.
§To whom correspondence may be addressed at: Department of Biochemistry, Beckman Center B403, 279 Campus Drive, MC 5307, Palo Alto, CA 94305-5307. E-mail: brutlag@stanford.edu.
© 2006 by The National Academy of Sciences of the USA
1412-1417 | PNAS | January 31, 2006 | vol. 103 | no. 5

Table 1. Overview of CpG distribution in the human genome

Subset                     Length, Mb   GC content   Observed CpG fraction   Normalized CpG fraction
Whole genome               3.1*         0.38         0.009                   0.25
1 kb upstream regions      15           0.53         0.042                   0.60
1 kb downstream regions    15           0.45         0.013                   0.26
Transcription units        930          0.42         0.011                   0.26
Exons                      45           0.50         0.028                   0.45
Introns                    880          0.41         0.010                   0.24

Length refers to the total length of DNA examined.
*Length given in gigabases.

In addition to their prevalence in exons, CpGs are also relatively enriched around the TSS. In fact, the enrichment pattern peaks sharply close to the core promoter, 15 bp upstream of the TSS, and extends symmetrically to ≈2 kb from the TSS (Fig. 1B).
Within individual promoters, CpGs tend to come in clusters (data not shown), implying that the enrichment pattern reflects an average across many CpG islands, which tend to appear close to the core promoter and show no preference for being upstream or downstream.

Two Promoter Classes. Considering only the average pattern of CpG occurrence around the TSS conceals the existence of two distinct promoter classes. The distribution of promoters' normalized CpG content is bimodal and can be approximated by a mixture of two Gaussian curves with means of 0.23 and 0.61 normalized CpG content and relative abundances of 28% and 72%, respectively (Fig. 2A). It is unlikely that the bimodality can be explained by AT-rich and GC-rich isochores, because the distribution of GC content is distinctly unimodal (Fig. 2B). Taking the intersection of the Gaussian curves as a decision boundary, we assign a promoter to class LCG if the normalized CpG content of the 3 kb centered at the TSS is <0.35, and we assign a promoter to class HCG otherwise. This partitioning allocates 4,575 promoters (along with the corresponding genes) to the LCG class and 11,305 promoters to the HCG class, for a total of 15,880, although there is minor cross-contamination because of the overlap between the curves. Reexamining the pattern of CpG occurrence around the TSS reveals a striking difference between the two classes. Whereas HCG promoters exhibit a prominent peak in the frequency of CpG centered some 15 bp upstream of the TSS, the CpG frequency for LCG promoters is relatively flat except for a small increase near the TSS (Fig. 2 C and D, lower curves). The most straightforward explanation for this qualitative difference between the classes is that HCG promoters contain CGIs, whereas LCG promoters lack them.

Estimation of CpG Mutation Rates.
As previously discussed, elevated levels of CpGs can be due to the presence of CpG-rich repeats, general selection pressure, or methylation-related CpG-specific effects. To investigate the proximate cause of the difference in CpG content between the two classes, we analyzed mutation frequencies by using SNPs in human-chimpanzee sequence alignments. SNPs represent sites of recent mutations in the human genome, and the aligned chimpanzee sequence can be used to infer which alleles are ancestral (33). To distinguish the effects of methylation from the effects of selection, we examined the frequencies of deamination mutations at the CpG dinucleotides (CpG to TpG or CpA) and those at the GpC dinucleotides (GpC to GpT or ApC). Although negative selection should act indiscriminately on the two dinucleotides, changes related to methylation should only affect mutation frequencies at the CpGs.

The last two rows of Table 2 show that CpGs mutate at a lower frequency in the HCG promoters than they do in the LCG promoters, whereas mutation frequencies of GpCs differ only modestly. Unfortunately, this finding is not sufficient to establish the existence of a CpG-specific effect, because it can, in principle, be explained by a difference in general selection. One would expect that mutation rates of CpGs would be more strongly affected by selection than those of GpCs, because many CpGs have been purged from the genome, making it more likely that the remaining ones are under stronger selection. Therefore, when examining regions conserved by evolution, the frequency of CpG mutations would be expected to be dampened to a greater extent than that of GpC mutations. Consequently, in addition to examining the promoter regions of the two classes, we also examined the mutation patterns in regions downstream of the transcription start sites. Because methylation is unlikely to be a factor in sequences that are distant from the TSS, any differences in mutation frequencies in such sequences should be due to differences in selection pressure.

For the downstream analysis, we examined mutations in introns and the three coding phases of exons (phase 0, phase 1, and phase 2 refer to mutations that are in the first, second, and third positions of a codon, respectively). As expected, frequencies of mutations varied in accordance with the amount of selection on the sequences being considered. For both CpGs and GpCs, mutations were more prevalent in introns and in phase 2 (wobble) exonic positions, compared with phase 0 and 1 exonic positions (Table 2).

Observations of mutation frequencies in downstream introns and exons provide a basis from which to reexamine the differences between the LCG and HCG classes. The frequency of GpC mutations, which we can view as an inverse indicator of general selection, is only slightly higher in the LCG promoters compared with the HCG promoters, and for both classes it is close to the corresponding frequency in introns and at wobble positions. Most importantly, the HCG class appears to be an outlier because the frequency of CpG mutations is the lowest of any of the regions examined and the GpC mutation frequency is consistent with HCG promoters being under only very modest selection. Taken together, the evidence argues for a CpG-specific effect and not general selection as the dominant culprit for the high levels of CpGs in HCG promoters.

Fig. 1. Patterns of CpG occurrence with respect to gene features. The measures were made on overlapping segments aligned with respect to the TSS and identified by the distance of the midpoint from the TSS. The analysis included all (15,880) RefSeq genes for which the TSS was annotated differently from the start of the coding region. (A) To compare CpG presence in exons and introns as well as coding and noncoding sequences, the normalized CpG fraction was computed on overlapping 99-bp segments downstream of the TSS. Sequences were filtered according to whether they were in introns or exons; exons were further split into coding and noncoding (3' and 5' UTRs) sets. Exons carry a consistently higher level of CpGs than introns; the difference between the coding and noncoding exonic sequence shows that the CpG content of noncoding exons is only slightly above that of introns, suggesting the culpability of the coding potential in maintaining the higher CpG levels in exons. (B and C) Patterns of CpG occurrence (B) and GC content (C) around transcription start sites. Normalized CpG fraction and GC content were computed in 50-bp overlapping segments across 4-kb regions centered at the TSS.

Fig. 2. Distribution of promoters with respect to CpG properties. (A and B) Histograms of normalized CpG fractions (A) and GC content (B) of 3-kb regions around TSSs. The y axis counts the number of promoters with the given CpG or GC content in the 3 kb centered at each promoter's TSS. Two Gaussian curves were fitted to the distribution in A with means of 0.23 and 0.61, σ values of 0.07 and 0.14, and weights of 4,430 and 11,450, respectively. The intersection of the two curves, at 0.35, is the decision boundary we used to separate promoters and their genes into classes LCG and HCG. See Table 6, which is published as supporting information on the PNAS web site, for a full listing of the TSSs in the two classes, along with their RefSeq IDs and chromosome locations. (C and D) Plotting the normalized CpG fraction (C) and GC content (D) separately for the two classes, as a function of distance from the TSS.

Table 2. Frequencies of deamination mutations at CpG and GpC dinucleotides in exons, introns, and promoters

Gene regions                 GpC→GpT mutation frequency*   CpG→TpG mutation frequency*   Ratio (CpG frequency/GpC frequency)
Downstream exons, phase 0    0.42 ± 0.06                   2.30 ± 0.04                   5.5
Downstream exons, phase 1    0.39 ± 0.06                   2.78 ± 0.04                   7.2
Downstream exons, phase 2    0.72 ± 0.04                   7.73 ± 0.02                   10.8
Downstream introns           0.75 ± 0.00                   8.31 ± 0.00                   11.1
LCG promoters†               0.75 ± 0.03                   7.31 ± 0.02                   9.8
HCG promoters†               0.64 ± 0.02                   1.62 ± 0.01                   2.5

Downstream refers to all the sequences >3 kb downstream of the TSS. Recent mutations in the human lineage were identified by compiling human SNPs that fell within the examined regions. For every SNP we determined which allele was ancestral by identifying the aligned base in the chimpanzee genome.
*For mutations XpY → X'pY', mutation rate is presented as 1,000 · (XpY → X'pY' mutations/XpY dinucleotides).
†3-kb sequences centered at the TSS.

www.pnas.org/cgi/doi/10.1073/pnas.0510310103

Table 3. Distributions of top-level GO terms for the LCG and the HCG classes

                                                        Appearances
GO code   GO term description                           LCG    HCG
Overrepresented in class LCG
09607     [BP] response to biotic stimulus              307    192
09605     [BP] response to external stimulus            296    218
07582     [BP] physiological process                    603    789
05615     [CC] extracellular space                      116     88
06950     [BP] response to stress                       227    268
09628     [BP] response to abiotic stimulus             108     97
05886     [CC] plasma membrane                          429    656
05102     [MF] receptor binding                         128    146
30246     [MF] carbohydrate binding                      30     19
07267     [BP] cell-cell signaling                      136    196
05576     [CC] extracellular region                      39     36
04872     [MF] receptor activity                        182    288
19825     [MF] oxygen binding                            14      6
05623     [CC] cell                                     512    965
08233     [MF] peptidase activity                        62     76
07154     [BP] cell communication                        75    103
07165     [BP] signal transduction                      429    810
05578     [CC] extracellular matrix                      44     52
Overrepresented in class HCG
05634     [CC] nucleus                                   85    535
06139     [BP] nucleo-metabolism                         78    458
07049     [BP] cell cycle                                37    294
06350     [BP] transcription                             71    401
06259     [BP] DNA metabolism                            22    193
05739     [CC] mitochondrion                             16    168
05575     [CC] cellular_component                        74    367
03723     [MF] RNA binding                               22    180
30528     [MF] transcription regulator activity          52    286
05622     [CC] intracellular                             16    140
09719     [BP] response to endogenous stimulus            8     96
05654     [CC] nucleoplasm                               12    103
03700     [MF] transcription factor activity             52    236
03677     [MF] DNA binding                               22    129
05840     [CC] ribosome                                   1     44
15031     [BP] protein transport                         10     82
06464     [BP] protein modification                      73    277
05730     [CC] nucleolus                                  1     39
05694     [CC] chromosome                                 5     56
08152     [BP] metabolism                               285    836
04672     [MF] protein kinase activity                   51    200
06118     [BP] electron transport                         0     25
05783     [CC] endoplasmic reticulum                     24    113

P value column, in the order it appears in the source (row alignment and the exponents marked "?" could not be recovered): 7.9 × 10^-52, 5.8 × 10^-41, 1.8 × 10^-2?, 1.1 × 10^-15, 2.5 × 10^-73, 2.1 × 10^-?, 1.0 × 10^-10, 2.0 × 10^-0?, 1.5 × 10^-05, 1.1 × 10^-04, 2.2 × 10^-04, 5.6 × 10^-04, 7.9 × 10^-04, 2.6 × 10^-03, 3.0 × 10^-03, 4.0 × 10^-03, 1.1 × 10^-78, 1.2 × 10^-14, 2.6 × 10^-?, 3.3 × 10^-12, 1.1 × 10^-09, 1.3 × 10^-09, 3.1 × 10^-04, 8.7 × 10^-?, 3.9 × 10^-03, 1.3 × 10^-08, 1.5 × 10^-0?, 3.4 × 10^-07, 3.3 × 10^-06, 2.1 × 10^-05, 3.3 × 10^-05, 1.4 × 10^-04, 2.2 × 10^-04, 2.6 × 10^-04, 5.8 × 10^-04, 6.6 × 10^-04, 8.6 × 10^-04, 1.4 × 10^-03, 2.6 × 10^-03, 4.4 × 10^-03, 4.9 × 10^-03.

All of the terms were mapped to the goslim-generic subset, which is meant to represent the top levels of the GO hierarchy. P values were calculated by using the χ² statistic. Only terms significant at the 0.005 level are presented. Bracketed markings stand for the three major subontologies comprising GO: CC for "cellular component," BP for "biological process," and MF for "molecular function." Results for the full ontology (not just the goslim-generic subset) can be found in Table 5.

Differences in Annotation and Expression Between the Two Classes. Evidence from other studies suggests that CGIs are more frequently associated with "house-keeping" genes than with tissue-specific genes (21, 34, 35). Our analysis of Gene Ontology (GO) (36) terms associated with genes in the HCG and LCG classes is consistent with that functional relationship (Table 3; see also Table 5, which is published as supporting information on the PNAS web site). Broadly considered, house-keeping functions are significantly overrepresented in the HCG class, whereas terms associated with specific functions characteristic of more differentiated or highly regulated cells are significantly overrepresented in the LCG class. The correlation of a promoter's CpG content with the breadth of expression of its gene is also borne out by our analysis of expression profiles of genes in the two classes (Fig. 3). Using the data set from Su et al. (37), who measured expression levels of an extensive set of genes in 79 different tissues, we bin genes according to the number of tissues in which they are expressed.
The resulting distributions are significantly different between the two classes, the most pronounced differences being at the extremes of the distributions: genes that are expressed in only a small number of tissues are overrepresented in class LCG, and genes expressed in all or almost all of the tissues are biased toward the HCG class (Fig. 3A). Significantly, genes within the HCG class, irrespective of whether they contain the least or the highest CpG content, exhibit very similar expression profiles (Fig. 3 B and C). The implication is that, within a class, the number of tissues in which a gene is expressed is not significantly dependent on the promoter's CpG content. This point is important because it shows that the universality of a gene's expression is specifically correlated with class membership and not directly with the CpG content.

Fig. 3. A microarray analysis of tissue distribution of genes in class LCG and class HCG. The x axis is the number of tissues in which a gene is expressed. (A) Tissue distributions of genes in the two classes were significantly different (P ~ 1.6 × ...). The fraction of genes expressed in only a few tissues was higher in the LCG class, whereas the fraction of universally expressed genes was higher in the HCG class. For plotting convenience we show distributions of genes grouped in 16 larger bins of size 5. (B) We partitioned class HCG into thirds by CpG content. One third of promoters had normalized CpG fractions between 0.350 and 0.563, the next third was between 0.563 and 0.683, and the last third comprised all of the promoters with normalized CpG at >0.683. The tissue distributions of genes in the three HCG partitions were similar to each other and different from class LCG. (C) We quantified that conclusion by measuring dissimilarities between distributions by using χ² values (P values in parentheses).
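The binning and comparison described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the analysis code used for Fig. 3: the function names and the toy expression values are hypothetical, the expression cutoff of 200 is the one given in Methods, and the statistic is a plain Pearson χ² on a 2 × k table of bin counts, which may differ in detail from the comparison actually performed.

```python
def tissues_expressed(levels, threshold=200):
    """Number of tissues in which one gene's expression level exceeds the cutoff."""
    return sum(1 for value in levels if value > threshold)


def histogram(tissue_counts, n_tissues=79):
    """Bin genes by the number of tissues in which they are expressed (0..n_tissues)."""
    bins = [0] * (n_tissues + 1)
    for k in tissue_counts:
        bins[k] += 1
    return bins


def coarsen(bins, size=5):
    """Merge adjacent bins, e.g. 80 bins into 16 bins of size 5 for plotting."""
    return [sum(bins[i:i + size]) for i in range(0, len(bins), size)]


def chi_square(hist_a, hist_b):
    """Pearson chi-square statistic for a 2 x k table built from two histograms."""
    total_a, total_b = sum(hist_a), sum(hist_b)
    if total_a == 0 or total_b == 0:
        raise ValueError("both histograms must contain at least one gene")
    total = total_a + total_b
    stat = 0.0
    for a, b in zip(hist_a, hist_b):
        column = a + b
        if column == 0:
            continue  # bins empty in both classes contribute nothing
        expected_a = total_a * column / total
        expected_b = total_b * column / total
        stat += (a - expected_a) ** 2 / expected_a
        stat += (b - expected_b) ** 2 / expected_b
    return stat


# Toy data (hypothetical): two genes per class measured in four tissues.
toy_lcg = [[500, 10, 20, 30], [250, 300, 50, 10]]
toy_hcg = [[400, 900, 350, 260], [210, 205, 999, 180]]
hist_lcg = histogram([tissues_expressed(g) for g in toy_lcg], n_tissues=4)
hist_hcg = histogram([tissues_expressed(g) for g in toy_hcg], n_tissues=4)
print(hist_lcg, hist_hcg, chi_square(hist_lcg, hist_hcg))
```

Converting the statistic to a P value requires the χ² distribution with (k - 1) degrees of freedom for k nonempty bins, which is left to a statistics library.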
Discussion

We should note that there have been previous studies comparing genes with or without CGIs in their 5' regions (21, 35, 38). However, all such studies classified genes according to arbitrary and limiting definitions of CGIs, definitions based on thresholds of CpG fraction, GC content, and length. Few inferences could have been made about the underlying distribution of promoters, because applying any threshold would partition a set of promoters regardless of whether they cluster into cohesive subsets. Only one study approached classifying promoters based on CpG properties from an ab initio perspective. Davuluri, Grosse, and Zhang (30) found a bimodal distribution of a sliding window statistic in the vicinity of TSSs and used it to generate two separate models for first exon prediction. Our results are consistent with their findings, while bringing more clarity to the nature of promoter-CGI association and establishing that there is a biologically meaningful separation of genes based on their CGI properties.

Before our work, a continuous gradation of CpG content could not be ruled out because the promoters that were deemed to lack CpG islands could have been at the tail of a distribution of CpG content. We show that there are, in fact, two classes of promoters with distinct CpG sequence profiles and a natural decision boundary. Furthermore, we find that CpG-rich promoters are expressed in more tissues but only to the extent that they are more likely to be in the HCG class.

Incidentally, it may appear surprising that the GC content around promoters forms a unimodal distribution (Fig. 2B), because it has been previously argued that CpG islands are preferentially located in the GC-rich isochores (21), and we have found that the normalized CpG content at the promoter is weakly correlated with the GC content (data not shown).
Most likely, the GC content appears unimodal because, although different between the two classes, it varies to a much smaller extent than the CpG content.

Given the difference in CpG-specific mutation rates (Table 2), CGIs in the HCG promoters are almost certainly a consequence of their methylation state rather than of a general selection or the presence of CpG-rich transposable elements. As mentioned above, the most common explanation for such CGIs is that they are a consequence of hypomethylation in the germ line. The unmethylated CpGs in active promoters would be spared the mutagenic effect seen in methylated regions of the rest of the genome. According to this view, the pattern of CGIs in the genome should reflect a weighted average of methylation patterns in the germ line for which the weight is proportional to the time spent in the particular methylation state (1). The overrepresentation of widely expressed genes in the HCG class is consistent with the supposition that these promoters are hypomethylated in the germ line. Another possible explanation for the origin of CGIs is that they represent regions where natural selection has favored retention of CpGs for use in methylation-mediated regulation. This explanation would account for why some tissue-specific genes contain promoters that are highly enriched for CpGs.

If CGIs are manifestations of methylation patterns, studying the properties of CGIs may yield insights into mechanisms that govern the establishment of these patterns. For instance, any proposed model for such a mechanism must account for the symmetry of CGI distribution around the core promoter. Therefore, the prevailing hypothesis involving the binding of transcription factors, such as SP1, to inhibit methylation (39-41), is probably incomplete because it is unlikely to explain the equal clustering of CpGs upstream and downstream of the core promoter. More generally,
identification of the two promoter classes lays the groundwork for characterization of CGI properties and analysis of sequence elements that influence and are influenced by CGI locations and boundaries. Orthologous sequences from other mammals should be very useful in this regard as they can help to better separate the classes and to identify CGI boundaries more precisely.

The most striking finding of our analysis is the bimodal distribution of CpG content in promoters, which should caution against excessive reliance on CGIs as gene markers. The LCG class represents a substantial fraction of known genes and is likely to be more prevalent among undiscovered genes (42-44). The discovery of the LCG class raises the question about the role of methylation in controlling the expression of LCG genes. At present, we have a paucity of experimental data because most studies of differential methylation focus on CGIs, which are absent in the LCG class. In the end, it is the state of methylation of CpGs in both HCG- and LCG-class promoters and in various physiological states that holds the key to understanding their role in molding the phenotype.

Methods

Sequence Analysis. All of the statistics were compiled for the University of California, Santa Cruz human genome assembly (hg16) from July 2003, and the corresponding gene annotations were from the National Center for Biotechnology Information RefSeq database. To determine whether false TSS predictions were skewing our results, we also analyzed annotations from cap analysis gene expression sites (RIKEN CAGE database), chromatin immunoprecipitation sites, and compiled 5' UTR lengths. It does not appear that the essential conclusions of this work were compromised by false TSS predictions in the RefSeq database. Normalized CpG fraction was computed as (observed CpG)/(expected CpG), where expected CpG was calculated as (GC content/2)².
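The normalization above, together with the class assignment described in Results, can be illustrated with a minimal sketch. The function names are hypothetical, and two details are assumptions rather than statements from the text: the observed CpG fraction is taken as the number of CpG dinucleotides divided by the number of dinucleotide positions, and the decision boundary is located by bisection on the difference of the two weighted Gaussian log densities. The Gaussian parameters are those reported in the legend of Fig. 2.

```python
import math


def normalized_cpg(seq):
    """Observed CpG fraction divided by the expected fraction (GC content / 2)^2."""
    seq = seq.upper()
    n = len(seq)
    if n < 2:
        return 0.0
    gc_content = (seq.count("C") + seq.count("G")) / n
    expected = (gc_content / 2) ** 2
    if expected == 0:
        return 0.0  # no C or G at all, so no CpG can occur
    # Assumption: observed fraction = CpG count / number of dinucleotide positions.
    observed = seq.count("CG") / (n - 1)
    return observed / expected


def weighted_log_density(x, mean, sigma, weight):
    """Log of a weighted Gaussian density, omitting the shared 1/sqrt(2*pi) factor."""
    return math.log(weight / sigma) - (x - mean) ** 2 / (2 * sigma ** 2)


def mixture_boundary(low, high, tol=1e-9):
    """Point between the two means where the weighted Gaussian curves intersect.

    low and high are (mean, sigma, weight) tuples for the two components.
    """
    def diff(x):
        return weighted_log_density(x, *low) - weighted_log_density(x, *high)

    lo, hi = low[0], high[0]
    if diff(lo) <= 0 or diff(hi) >= 0:
        raise ValueError("curves do not cross between the two means")
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if diff(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def classify(norm_cpg, boundary=0.35):
    """Assign a promoter to class LCG below the boundary and to class HCG otherwise."""
    return "LCG" if norm_cpg < boundary else "HCG"


# Parameters from the Fig. 2 legend: (mean, sigma, weight) of each fitted curve.
boundary = mixture_boundary((0.23, 0.07, 4430), (0.61, 0.14, 11450))
print(round(boundary, 2), classify(0.23), classify(0.61))
```

In the analysis described here the normalized CpG fraction used for classification is computed over the 3 kb centered at each TSS; the sketch accepts any sequence string and leaves the extraction of that window to the caller.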
Analysis of Mutation Frequencies. We compiled a list of mutation locations in the human genome by relying on SNPs and inferred the ancestral alleles through comparisons with the chimpanzee genome. A compilation of human SNPs was downloaded from the National Center for Biotechnology Information and was mapped to the University of California, Santa Cruz human-chimpanzee alignments. We compiled statistics for mutations of the CpG dinucleotide to the TpG dinucleotide by collecting all of the {C, T} polymorphisms that were followed by a G and which aligned to a C in the chimpanzee genome. To account for the complementary strand, the CpG-to-CpA mutations were also included in all of the tallies. The statistics of mutations at the GpC dinucleotide were compiled in the analogous fashion. When measuring mutation rates, only nonoverlapping dinucleotides were examined (i.e., cytosines flanked by two guanines were not considered because their mutations could not be used to discriminate between mutations of GpC and CpG dinucleotides).

GO Analysis. GO terms were mapped to RefSeq genes using LocusLink annotations and RefSeq to LocusLink mappings downloaded from the National Center for Biotechnology Information web site. Only experimentally confirmed annotations were used (i.e., evidence codes IDE, IDA, IEP, IGI, IMP, IPI, ISI, and TAS).

Expression Analysis. The data were taken from an analysis of expression in 79 tissues by Su et al. (37); only genes (8,272) with RefSeq identifiers were considered, and each one was deemed to be expressed in a tissue if the average difference value was >200 (45). Consequently, each gene was assigned into one of 80 bins, depending on the number of tissues in which it was expressed (0-79). LCG was represented by 2,202 genes, and HCG was represented by 6,070 genes.

We thank S. Manteuil-Brutlag, B. Naughton, and I. Yeh for helpful comments on the manuscript. S.S. was supported by a National Library of Medicine graduate fellowship.

1. Reik, W., Dean, W. & Walter, J. (2001) Science 293, 1089-1093.
2. Fazzari, M. J. & Greally, J. M. (2004) Nat. Rev. Genet. 5, 446-455.
3. Robertson, K. D. & Wolffe, A. P. (2000) Nat. Rev. Genet. 1, 11-19.
4. Singal, R. & Ginder, G. D. (1999) Blood 93, 4059-4070.
5. Bird, A. (2002) Genes Dev. 16, 6-21.
6. Jaenisch, R. & Bird, A. (2003) Nat. Genet. 33, Suppl., 245-254.
7. Novik, K. L., Nimmrich, I., Genc, B., Maier, S., Piepenbrock, C., Olek, A. & Beck, S. (2002) Curr. Issues Mol. Biol. 4, 111-128.
8. Jones, P. A. & Takai, D. (2001) Science 293, 1068-1070.
9. Geiman, T. M. & Robertson, K. D. (2002) J. Cell. Biochem. 87, 117-125.
10. Herman, J. G. & Baylin, S. B. (2003) N. Engl. J. Med. 349, 2042-2054.
11. Fahrner, J. A., Eguchi, S., Herman, J. G. & Baylin, S. B. (2002) Cancer Res. 62, 7213-7218.
12. Bird, A. P. (1980) Nucleic Acids Res. 8, 1499-1504.
13. Lander, E. S., Linton, L. M., Birren, B., Nusbaum, C., Zody, M. C., Baldwin, J., Devon, K., Dewar, K., Doyle, M., FitzHugh, W., et al. (2001) Nature 409, 860-921.
14. Venter, J. C., Adams, M. D., Myers, E. W., Li, P. W., Mural, R. J., Sutton, G. G., Smith, H. O., Yandell, M., Evans, C. A., Holt, R. A., et al. (2001) Science 291, 1304-1351.
15. Duncan, B. K. & Miller, J. H. (1980) Nature 287, 560-561.
16. Arndt, P. F., Burge, C. B. & Hwa, T. (2003) J. Comput. Biol. 10, 313-322.
17. Arndt, P. F. & Hwa, T. (2004) Bioinformatics 20, 1482-1485.
18. Lunter, G. & Hein, J. (2004) Bioinformatics 20, Suppl. 1, I216-I223.
19. Sved, J. & Bird, A. (1990) Proc. Natl. Acad. Sci. USA 87, 4692-4696.
20. Gardiner-Garden, M. & Frommer, M. (1987) J. Mol. Biol. 196, 261-282.
21. Ponger, L., Duret, L. & Mouchiroud, D. (2001) Genome Res. 11, 1854-1860.
22. Takai, D. & Jones, P. A. (2002) Proc. Natl. Acad. Sci. USA 99, 3740-3745.
23. Ponger, L. & Mouchiroud, D. (2002) Bioinformatics 18, 631-633.
24. Rakyan, V. K., Hildmann, T., Novik, K. L., Lewin, J., Tost, J., Cox, A. V., Andrews, T. D., Howe, K. L., Otto, T., Olek, A., et al. (2004) PLoS Biol. 2, e405.
25. Yamada, Y., Watanabe, H., Miura, F., Soejima, H., Uchiyama, M., Iwasaka, T., Mukai, T., Sakaki, Y. & Ito, T. (2004) Genome Res. 14, 247-266.
26. Huang, T., Perry, M. & Laux, D. (1999) Hum. Mol. Genet. 8, 459-470.
27. Yan, P. S., Perry, M. R., Laux, D. E., Asare, A. L., Caldwell, C. W. & Huang, T. H.-M. (2000) Clin. Cancer Res. 6, 1432-1438.
28. Weinmann, A. S., Yan, P. S., Oberley, M. J., Huang, T. H.-M. & Farnham, P. J. (2002) Genes Dev. 16, 235-244.
29. Ioshikhes, I. P. & Zhang, M. Q. (2000) Nat. Genet. 26, 61-63.
30. Davuluri, R. V., Grosse, I. & Zhang, M. Q. (2001) Nat. Genet. 29, 412-417.
31. Kent, W. J., Sugnet, C. W., Furey, T. S., Roskin, K. M., Pringle, T. H., Zahler, A. M. & Haussler, D. (2002) Genome Res. 12, 996-1006.
32. Pruitt, K. D. & Maglott, D. R. (2001) Nucleic Acids Res. 29, 137-140.
33. Watanabe, H., Fujiyama, A., Hattori, M., Taylor, T. D., Toyoda, A., Kuroki, Y., Noguchi, H., BenKahla, A., Lehrach, H., Sudbrak, R., et al. (2004) Nature 429, 382-388.
34. Larsen, F., Gundersen, G., Lopez, R. & Prydz, H. (1992) Genomics 13, 1095-1107.
35. Robinson, P. N., Böhme, U., Lopez, R., Mundlos, S. & Nürnberg, P. (2004) Hum. Mol. Genet. 13, 1969-1978.
36. Harris, M. A., Clark, J., Ireland, A., Lomax, J., Ashburner, M., Foulger, R., Eilbeck, K., Lewis, S., Marshall, B., Mungall, C., et al. (2004) Nucleic Acids Res. 32, D258-D261.
37. Su, A. I., Wiltshire, T., Batalov, S., Lapp, H., Ching, K. A., Block, D., Zhang, J., Soden, R., Hayakawa, M., Kreiman, G., et al. (2004) Proc. Natl. Acad. Sci. USA 101, 6062-6067.
38. Holmquist, G. P. (1989) J. Mol. Evol. 28, 469-486.
39. Bell, A. C. & Felsenfeld, G. (2000) Nature 405, 482-485.
40. Brandeis, M., Frank, D., Keshet, I., Siegfried, Z., Mendelsohn, M., Nemes, A., Temper, V., Razin, A. & Cedar, H. (1994) Nature 371, 435-438.
41. Siegfried, Z., Eden, S., Mendelsohn, M., Feng, X., Tsuberi, B. Z. & Cedar, H. (1999) Nat. Genet. 22, 203-206.
42. Kapranov, P., Cawley, S. E., Drenkow, J., Bekiranov, S., Strausberg, R. L., Fodor, S. P. & Gingeras, T. R. (2002) Science 296, 916-919.
43. Bertone, P., Stolc, V., Royce, T. E., Rozowsky, J. S., Urban, A. E., Zhu, X., Rinn, J. L., Tongprasit, W., Samanta, M., Weissman, S., et al. (2004) Science 306, 2242-2246.
44. Cawley, S., Bekiranov, S., Ng, H. H., Kapranov, P., Sekinger, E. A., Kampa, D., Piccolboni, A., Sementchenko, V., Cheng, J., Williams, A. J., et al. (2004) Cell 116, 499-509.
45. Su, A. I., Cooke, M. P., Ching, K. A., Hakak, Y., Walker, J. R., Wiltshire, T., Orth, A. P., Vega, R. G., Sapinoso, L. M., Moqrich, A., et al. (2002) Proc. Natl. Acad. Sci. USA 99, 4465-4470.