Organization and Evolution of the Cyp2 Gene Cluster on Mouse Chromosome 7, and Comparison with the Syntenic Human Cluster*

Address correspondence to S. Hoffman, Dept. of Zoology, Miami University, 700 East High St., Oxford, OH 45056 USA. Telephone: (513) 529-3125. Fax: (513) 529-6900. E-mail: hoffmasm@muohio.edu
*The online version of this article (available at http://www.ehponline.org) contains Supplemental Material, Tables 1-4.
We thank K. Gavit, L. Kaplan, A. Parent, M. Smith, S. Whitehead, and E. Workman for laboratory assistance. We also thank D. Nelson (University of Tennessee, Memphis) and D. Nebert (University of Cincinnati) for helpful discussions, and J. Vaughn, K. Killian, and D. Pennock (Miami University) for reviewing the manuscript.
This work was supported by National Institutes of Health (NIH) grant 1R15GM55951 (SMGH), and by the Medical Research Service, Department of Veterans Affairs, and NIH grants R01 AR45603, P30 AR41943, and P30 ES00267 (DSK).
The authors declare they have no conflict of interest.
Received 25 June 2003; accepted 24 September 2003.

Genes from the ancient cytochrome P450 (CYP) superfamily, which encode a large and diverse group of heme-thiolate monooxygenases, are present in the genomes of almost all species examined to date. Mammalian CYP enzymes have been particularly well studied because they detoxify or activate a wide range of environmental and ingested compounds, including many drugs (Nelson et al. 1996), and they play important roles in many endogenous processes such as the metabolism of fatty acids, steroids, and eicosanoids (Nebert and Russell 2002). The CYP superfamily is also an excellent group for studying the evolutionary mechanisms that create gene families; previous studies of this group have provided clear examples of the molecular processes, such as tandem duplication and gene conversion, that are considered to be most important in gene family evolution (Fernandez-Salguero et al. 1995; Gonzalez and Nebert 1990).

The CYP superfamily of genes has been divided hierarchically into families and subfamilies on the basis of sequence similarity, and there is a standardized nomenclature incorporating this hierarchy (Nelson et al. 1993, 1996). Families are designated by adding a number to the root "CYP" ("Cyp" in mouse, e.g., Cyp2), and subfamilies are indicated by a letter (e.g., Cyp2a). Genes within a subfamily are numbered in order of discovery, regardless of species, and pseudogenes (both partial and full-length) are named by adding "ps" to the related mouse gene ("P" for other species) or by adding independent numbers if no true genes are highly related. In general this system appears to reflect evolutionary relationships among the loci (Lewis et al. 1998), with different CYP gene families found in different major taxonomic groups. In vertebrates, larger families are divided into subfamilies that are typically scattered across genomes, but multiple loci from within a subfamily are usually physically clustered together on a single chromosome (Nelson et al. 1996). This pattern has been interpreted as reflecting the creation of most new CYP loci by tandem duplication (Nelson et al. 1993) so that recently duplicated and therefore highly similar loci remain in tandem clusters, whereas older duplications have been broken up over evolutionary time by chromosomal rearrangements.

In this era of genome projects, it would seem relatively simple to analyze the organization and evolution of CYP gene clusters, based on available assembled sequences. However, even though both the human and mouse genome projects have now produced huge stockpiles of sequence information (Waterston et al. 2002), additional study is typically required for highly duplicated portions of mammalian genomes such as gene family clusters (Eichler 1998). Computer-based assemblies cannot by themselves accommodate the complexities presented by clusters of closely related genes, and both polymerase chain reaction (PCR)-based and hybridization-based analyses are confounded by the high levels of sequence similarity between paralogous loci. Mouse and human are estimated to have 190 and 115 CYP loci (including pseudogenes), respectively, some of which have very similar sequences (Hoffman and Keeney 2002). Thus, the study of these gene clusters has been best served by combining genomic sequencing with fine-scale physical mapping based on analyses of cloned DNA (Hoffman et al. 2001).

Detailed physical mapping using genomic clones proved to be an extremely fruitful approach for understanding one human gene cluster, the CYP2 cluster on chromosome 19 (Hoffman et al. 2001), which includes loci from the CYP2A, 2B, 2F, 2G, 2S, and 2T subfamilies (hereafter the CYP2A-T cluster). Extending this map-based approach to the mouse, the most important model organism for genetic studies in mammals, will allow researchers to better study the expression and variation of these genes. Until all individual genes and their related pseudogenes from a cluster have been analyzed in a species, it is nearly impossible to develop PCR primers that are sufficiently locus-specific to use for accurate genotyping, cloning into expression vectors, or development of knockout mice. This study is intended to provide a practical basis for future studies of CYP gene expression, as well as to make a contribution to our understanding of the mechanisms underlying gene family evolution.

Materials and Methods

Families of closely related genes are difficult to analyze using standard techniques--because of high sequence similarity, a closely related gene or pseudogene can easily be mistaken for the target gene during Southern blotting or PCR amplifications. The general procedure discussed below was followed for all loci, but specific techniques were integrated in different combinations as necessary to localize and analyze each putative locus.

Mouse bacterial artificial chromosome (BAC) clones overlapping the Cyp2a-t cluster region were identified by library screening or database analysis, as described below. Specific exons and introns of CYP2 genes (see Supplemental Material) were then PCR-amplified from clone DNAs and sequenced. Some of the BAC clone DNAs were digested by multiple restriction enzymes, blotted onto nylon membranes, and hybridized with PCR-amplified products. Clones were assembled on the basis of shared gene sequences and on restriction mapping. Clone overlaps were confirmed by developing sequence tag sites from the ends of each clone and testing them against other cloned DNAs and against known sequence fragments from both the public and Celera (Celera Genomics, Rockville, MD) mouse genome projects.

Library Screening and Isolation of Bacterial Artificial Chromosome DNA

The RP22 mouse BAC library was screened (Invitrogen Corp., Carlsbad, CA) with PCR products amplified from mouse genomic DNA, using primers designed to match one exon each from the Cyp2a5, Cyp2b9, and Cyp2f2 cDNA sequences. The BAC library clones RP22-78A19, -44B20, -362C15, -127H7, -548M4, and -160O14 gave strong positive signals and were selected for further study. Public mouse genomic sequences and private sequences created by Celera were later examined for the presence of Cyp2a-t genes by comparing them with known Cyp2 cDNA sequences (Table 1). Partially sequenced clones from the RP23 BAC library identified by database analysis as containing Cyp2a-t cluster genes were -430G14, -174D7, -113D13, and -353B5 [GenBank accession nos. AC087157, AC087137, AC087130, AC087155, respectively (gene accession numbers are from GenBank: http://www.ncbi.nlm.nih.gov/Entrez)]. Additional RP23 BAC clones--RP23-368O14, -120B2, and -314C12--were identified using the National Center for Biotechnology Information (NCBI) MapViewer tool (http://www.ncbi.nlm.nih.gov/mapview) as being unsequenced but from the region of interest.

Table 1

Of the 13 BAC clones selected by library screening or by database analysis, one (RP23-353B5) was used only for sequence analysis. The DNAs of the remaining 12 clones were obtained and isolated using Qiagen 100 columns (Qiagen Inc., Valencia, CA), according to the manufacturer's protocol. Clones RP23-368O14, -120B2, and -314C12 were used only as PCR templates for specific primers to establish patterns of clone overlap. RP22-78A19, -44B20, -362C15, -127H7, -548M4, -160O14, and RP23-430G14, -174D7, and -113D13 were used for all other experiments in this study, including PCR amplifications, sequencing, restriction mapping, and Southern blotting. Unless otherwise noted, these nine clones are the BAC clones referred to throughout the rest of this article.

PCR Amplification

Polymerase chain reaction was performed on all samples using primers designed from mouse Cyp2 gene cDNA and genomic sequences in GenBank and from sequences available from Celera Genomics. All primer sequences and annealing temperatures can be found in Table 1 of the Supplemental Material or on the Cytochrome P450 Homepage (Nelson 2003). PCR amplification was carried out in a total volume of 25 µL consisting of 1× PCR buffer (10 mM Tris-HCl, pH 9.0; 50 mM KCl; 0.1% Triton X-100), 1 mM MgCl₂, 200 µM each deoxynucleoside triphosphate, 0.5 µM each primer, 0.75 units of Taq polymerase (Promega Corp., Madison, WI), and 100-500 ng template DNA. Amplification reactions were performed with various annealing temperatures and cycles. PCR products were electrophoresed on 1.5% low-melt agarose gels, from which fragments were excised and purified using the QIAquick Gel Extraction Kit (Qiagen).
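As a quick check of the reaction setup, the final concentrations stated above can be converted into absolute amounts per 25 µL reaction. The sketch below is purely illustrative arithmetic; the function and variable names are hypothetical and are not part of the original protocol.

```python
# Illustrative arithmetic only: absolute amounts implied by the stated
# final concentrations in a 25 µL reaction. Names are hypothetical.
REACTION_VOLUME_UL = 25.0


def amount_pmol(final_conc_um: float, volume_ul: float) -> float:
    """Return the amount in pmol for a concentration in µM and a volume in µL.

    1 µM equals 1 pmol/µL, so the amount is simply concentration x volume.
    """
    return final_conc_um * volume_ul


primer_pmol = amount_pmol(0.5, REACTION_VOLUME_UL)     # each primer
dntp_pmol = amount_pmol(200.0, REACTION_VOLUME_UL)     # each dNTP
mgcl2_pmol = amount_pmol(1000.0, REACTION_VOLUME_UL)   # 1 mM MgCl2

print(f"each primer: {primer_pmol} pmol")
print(f"each dNTP:   {dntp_pmol / 1000} nmol")
print(f"MgCl2:       {mgcl2_pmol / 1000} nmol")
```

Under these figures each reaction contains 12.5 pmol of each primer and 5 nmol of each deoxynucleoside triphosphate.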

Subcloning and Sequencing

Sequencing templates were isolated in either of two ways. For most of the Cyp2a-t loci, PCR amplifications were performed directly from BAC or genomic DNA preparations. In addition, some restriction fragments of BAC clones RP22-78A19, RP22-160O14, RP22-548M4, RP23-430G14, and RP23-174D7 were subcloned into plasmid vector pBlueScript KS- (Stratagene Inc, La Jolla, CA). Five to ten positive subclones were chosen by blue/white screening for each experiment, and plasmid DNAs were recovered using alkaline lysis minipreps, followed by direct sequencing using the T3 and T7 priming sites.

All sequencing was done for both strands, using either BigDye or ET terminator chemistry on an ABI 310 automated DNA sequencer (Applied Biosystems, Foster City, CA). All samples were prepared for sequencing according to manufacturer instructions. The forward and reverse primers were the same primers used for PCR. DNA sequences were analyzed using the NCBI BLAST similarity search tool (http://www.ncbi.nlm.nih.gov/BLAST/).

Southern Blot Analysis

Each BAC clone DNA was separately digested with the restriction enzymes Cla I, Eco RI, Hind III, Pvu II, Sac I, and Xba I at 37°C for 12-36 hr. The digests were electrophoresed on 1% agarose gels, and the DNA fragments were blotted onto Hybond (Boehringer Mannheim, Indianapolis, IN) nylon membranes. PCR amplification products were made into fluorescent probes using the Genius DIG-labeling system (Boehringer Mannheim) and hybridized to the blots at 55°C overnight, followed by rinsing and labeling according to the manufacturer's protocol.

Restriction Mapping

Table 2

Restriction maps of BAC clones RP22-78A19, -44B20, -362C15, -127H7, -548M4, -160O14, and RP23-430G14, -174D7, and -113D13 were constructed using Eco RI and Hind III (data not shown). These restriction maps were compared with the draft genomic sequences and with the sizes of the fragments that hybridized with gene-specific and BAC end probes to confirm the overall assembly. Independent Bst 1107 I restriction maps of clones RP23-430G14, -174D7, -113D13, and -353B5 from the study of Kim et al. (2001) are available as supplemental data through the Lawrence Livermore National Laboratory (Livermore, CA) website (http://greengenes.llnl.gov/mouse/html/syn_table.htm). These maps cover the region from Cyp2a4 through Cyp2b10 (Figure 1) that was most poorly resolved by Eco RI-Hind III mapping and genomic sequencing. Though the online assembled maps (Kim et al. 2001) have large unresolved regions and a number of mistakes, the actual digest data are accurate. We have extracted and reassembled the digest data to form a new map that is consistent with our Eco RI-Hind III restriction maps and with the draft genomic sequences. This composite restriction map (Table 2 of Supplemental Material) is the basis for the distances between the Cyp2a4 through Cyp2b10 loci shown in Figure 1.

Figure 1. Organization of the Cyp2a-t gene cluster on mouse chromosome 7. The total map has been broken in half to fit the page. The public sequences (GenBank nos. NW_000303, NW_000307, and NW_000310, build of 15 November 2002), Celera sequences (assembly GA_x6K02T2PU9B, as of 1 June 2002), and individual BAC clones from the RP22 and RP23 libraries (labeled by GenBank clone number) are aligned to the composite map produced by this study. More recent assemblies of the public data are not included because they are significantly less accurate. The two sequence assemblies are labeled with their own numbering systems in kilobases to clarify the comparisons. The locations of the Cyp2a-t genes and their directions of transcription are indicated with broad arrows on the composite map. A total of 22 loci from six different subfamilies are identified; the distance spanned by each locus is exaggerated relative to the distances between loci for clarity. Large gaps within the assembled sequences are shown by dashed lines, whereas many smaller gaps are unmarked. A region of 300 kb at the break point in the middle of the cluster contains no Cyp genes and is therefore deleted from the map. The Egln2 and Axl genes flanking the cluster are shown by solid arrows on the composite map.

Analysis of Draft Genomic Sequences

Partial draft sequences of some BAC clones from the relevant region of mouse chromosome 7 are available on GenBank as of May 2003 (build 30). Some of the sequences from these clones are incorporated into the supercontig NT_039407, but this assembly contains many errors and should be disregarded. The older supercontigs NW_000303, NW_000307, and NW_000310 (build 29) are incomplete, but for the most part are correctly assembled. We created a more complete and accurate assembly of the draft sequences by rearranging the sequence contigs inside each clone to conform to our maps and by comparing small overlapping portions of contigs from different clones. Predicted restriction enzyme cut sites in the database sequences were compared with actual restriction mapping fragment sizes and with the data from Southern blotting to determine the order of sequence contigs. An independent sequence assembly for this region obtained from Celera Genomics (GA_x6K02T2PU9B, as of 1 June 2002) was also used to fill in some gaps. Sequence comparisons were done using the NCBI BLAST and Jellyfish (LabVelocity Inc., San Francisco, CA) software.
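The comparison of predicted cut sites with observed fragment sizes can be sketched as a simple in silico digest. The example below is a minimal illustration, not the software used in this study: it assumes a linear sequence contig, uses the standard Eco RI (G^AATTC) and Hind III (A^AGCTT) recognition sites, and runs on a made-up toy sequence. All function and variable names are hypothetical.

```python
import re

# Recognition site and the offset within the site after which the top
# strand is cut: Eco RI cuts G^AATTC, Hind III cuts A^AGCTT.
ENZYMES = {
    "EcoRI": ("GAATTC", 1),
    "HindIII": ("AAGCTT", 1),
}


def cut_positions(seq, enzyme_names):
    """Return sorted cut coordinates for the named enzymes on a linear sequence."""
    seq = seq.upper()
    cuts = set()
    for name in enzyme_names:
        site, offset = ENZYMES[name]
        # A lookahead pattern reports every occurrence, including overlapping ones.
        for match in re.finditer(f"(?={site})", seq):
            cuts.add(match.start() + offset)
    return sorted(cuts)


def fragment_sizes(seq, enzyme_names):
    """Return fragment lengths, in order along the sequence, after digestion."""
    bounds = [0] + cut_positions(seq, enzyme_names) + [len(seq)]
    return [right - left for left, right in zip(bounds, bounds[1:])]


# Toy contig (hypothetical sequence, for illustration only).
toy_contig = "AAAA" + "GAATTC" + "CCCC" + "AAGCTT" + "GG"
print(fragment_sizes(toy_contig, ["EcoRI"]))             # single digest
print(fragment_sizes(toy_contig, ["EcoRI", "HindIII"]))  # double digest
```

Predicted fragment lists of this kind, computed for each candidate ordering of contigs within a clone, can then be compared with the fragment sizes measured on gels and blots.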

Results

The mouse Cyp2a-t gene cluster forms a small part of a 30-Mb region of chromosome 7 that is syntenic to most of the q arm of human chromosome 19 (Kim et al. 2001). This entire region of synteny in the two species is in the opposite orientation relative to the centromere; the mouse map in Figure 1 is shown with the telomere to the left, the reverse of the conventional mode of display, so that it aligns with the established human map (Hoffman et al. 2001). The mouse Cyp2a-t cluster spans about 1.4 Mb, as measured by restriction mapping. It is delimited by the Egln2 gene on the telomeric side and the Axl gene on the centromeric side (Figure 1).

These genes are orthologs of the EGLN2 and AXL genes that bracket the human cluster. A total of 22 CYP loci from six subfamilies were found in the mouse gene cluster; 10 of the loci match previously sequenced mRNAs (Hoffman et al. 2001) and are therefore functional genes. Information on individual loci is compiled in Table 1, and the specific evidence used to localize and identify each locus is organized below by subfamily. The complete map of the region is shown in Figure 1. The relevant parts of the public and private sequence assemblies of mouse chromosome 7 (as of December 2002) are compared in Figure 1 with the composite map generated by this study. Both of these previous sequence assemblies are mostly accurate but incomplete across this region of the chromosome; a more recent assembly of the public data (build 30, February 2003) is markedly less accurate. Significant gaps occur in both assemblies. Gaps in the public contigs are quite accurately sized, but the sizes of several large gaps in the Celera assembly are seriously underestimated (Figure 1). Some apparent gaps in the public assembly can be filled, in fact, by integrating sequences from the draft versions of BAC clones RP23-430G14, -174D7, and -113D13 (GenBank accession nos. AC087157, AC087137, and AC087130) and from the small assembled contigs NW_000304, NW_000305, NW_000306, NW_000308, NW_000309, and NW_011833. Two distinct regions of about 50 kb each, which contain the 2b26-ps and 2b27-ps pseudogenes, respectively (Figure 1), are incorrectly merged by both assemblies because of the very high level of sequence similarity between them. The presence of both of these regions on the chromosome was confirmed by restriction mapping and by specific PCR amplifications of fragments that bridge small gaps in the draft sequences. 
Additional details of experimental methods and results, including tables of the primers used, exact exon/intron boundaries, and restriction map data, are available in Tables 1-4 of the Supplemental Material and through the Cytochrome P450 Homepage (Nelson 2003).

Descriptions of Loci by Subfamily

Cyp2a. Three functional mouse Cyp2a genes, 2a4, 2a5, and 2a12, were previously identified from mRNAs (Iwasaki et al. 1993). Our analysis has discovered a new full-length 2a locus, located between the 2a5 and 2a12 genes, that corresponds to a single mouse expressed sequence tag (EST) in the database (GenBank accession no. BB667610) and is therefore likely to be functional. It has been given the name Cyp2a22. There are also three partial 2a pseudogenes--Cyp2a20-ps and Cyp2a23-ps, each of which consists only of exons 1 and 2, and Cyp2a21-ps, which has part of exon 3 and all of exons 4-9. To search for additional 2a loci, PCR-amplified fragments from exons 2, 6, and 9 of 2a5 were used to probe Southern blots of the BAC clone DNAs. They hybridized to the expected fragments for all genes, which collectively accounted for all positive signals. Specific primers were used to amplify and sequence the sixth exons of 2a4, 2a5, and 2a12 and the third exons of 2a12 and 2a22 from the appropriate BAC clone DNAs (Figure 1) to confirm the locations and identities of these genes. The 2a20-ps pseudogene and the 2a12 gene were identified only from the genomic sequence assemblies, as they were not included in any BAC clones used in this study.

The 2a4 and 2a5 genes are extremely similar (98% exons/96% introns), even though they are not physically close together (Figure 2). The 2a12 locus is strongly related by sequence to 2a22 but is quite different from the other genes, with only 75 and 76% exon identities to 2a4 and 2a5, respectively (Table 2). The 2a20-ps and 2a23-ps pseudogenes are very similar to each other and are slightly more similar to the 2a5 locus than to 2a12. The Cyp2a21-ps pseudogene is most similar to 2a5 (Table 2).
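Identity values such as those quoted above are computed from pairwise alignments. The sketch below shows one common way to do this, assuming the two sequences are already aligned to equal length and that gapped columns are excluded from the comparison. The alignment method and gap treatment used for Table 2 are not stated here, so both are assumptions, and the function name is hypothetical.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity over the ungapped columns of two aligned sequences.

    Assumes both strings come from the same pairwise alignment (equal
    length, '-' for gaps). Columns with a gap in either sequence are skipped.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for base_a, base_b in zip(aligned_a.upper(), aligned_b.upper()):
        if base_a == "-" or base_b == "-":
            continue
        compared += 1
        if base_a == base_b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# Toy alignment (hypothetical sequences): one mismatch in ten columns.
print(percent_identity("ACGTACGTAC", "ACGTACGTAT"))
```

Counting gapped columns as mismatches instead would lower the reported values, so identities are only comparable when the same convention is used throughout.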

Figure 2. (A) Postulated evolution of the Cyp2a-t/CYP2A-T clusters in mouse and human. The modern arrangements are shown in the middle of the figure, with the separate evolutionary paths of the two clusters converging, as indicated by the vertical arrows. The ends of the clusters are very similar, but the inverted duplication in the human CYP2A-T cluster (adapted from Hoffman et al. 2001) is not present in mouse. Instead, a tandem duplication of the central Cyp2a, 2g, and perhaps 2b loci, without an inversion, established the organization of the mouse cluster. The telomeric Cyp2a group and the centromeric Cyp2b group in mouse (gray boxes) were probably formed by series of smaller duplications, as shown in B and C. Straight dashed arrows indicate direct duplications, and bent arrows indicate inverted duplications. The large vertical arrows indicate evolutionary time. Mouse pseudogenes are labeled with the suffix “p” rather than “ps” because of space restrictions. (B) Detailed diagram of telomeric mouse 2a gene group showing the hypothesized tandem and inverted duplications that formed the three genes and three pseudogenes in this group. The open boxes show the extent of each duplicated block of DNA. Straight arrows indicate direct duplications, and bent arrows indicate inverted duplications. (C) Detailed diagram of centromeric mouse 2b gene group showing the hypothesized series of duplications that formed this group. The open boxes show the extent of each duplicated block of DNA. Straight arrows indicate direct duplications, and bent arrows indicate inverted duplications.

Cyp2b. The 2b subfamily is the most diverse group within the gene cluster. Four genes previously known to be functional (Nelson et al. 1996)--2b9, 2b10, 2b13, and 2b19--have been identified and localized. A fifth gene previously identified as functional, 2b20 (Damon et al. 1996), and its putative pseudogene 2b20-ps (Marc et al. 1999) are not found in the chromosome 7 gene cluster. Our repeated attempts to amplify a 2b20-specific fragment with the PCR primers designed by Damon et al. (1996), using both BAC clone and mouse genomic templates, were unsuccessful. We therefore conclude that the 2b20 and 2b20-ps transcripts are artifacts of the very similar 2b10 gene. In addition, the 2b10 gene in both sequence assemblies differs markedly (19 base pair substitutions) from the originally reported 2b10 mRNA (Noshiro et al. 1988); this may be due to interstrain heterogeneity or to mistakes in sequencing. We now consider all 2b10, 2b20, and 2b20-ps mRNAs in the database to be products of the single gene identified as 2b10 in Figure 1.

To determine the location of each 2b locus, PCR products generated from exons 1, 7, and 9 of 2b9 were used to probe the BAC clone Southern blots. They hybridized to the appropriate DNA fragments from each of the 2b genes and pseudogenes. Specific probes were also made for exon 4 of 2b13 and exon 3 of 2b19 that hybridized uniquely to fragments of those genes on blots. Primers specific for exon 2 of 2b10 were used to amplify and sequence a fragment of BAC clone RP23-113D13 to confirm the identity of this locus. Specific primers were also used to amplify and sequence fragments of intron 2 and intron 3 from the nearly identical 2b26-ps and 2b27-ps pseudogenes. These fragments were then used as probes on the blots to prove the separate existence of the two pseudogenes. PCR products encoding intron 2 of 2b26-ps/2b27-ps hybridized to Eco RI fragments of 10.6 and 9.5 kb, respectively, in the BAC clones RP23-430G14 and -113D13, and to both fragments in the BAC clone RP23-174D7, which overlaps both pseudogenes (Figure 1).
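The logic used to tell the two nearly identical pseudogenes apart can be summarized as a lookup of diagnostic fragment sizes. The sketch below uses the Eco RI fragment sizes reported above for the intron 2 probe; the size tolerance and all names are hypothetical choices made for illustration and do not come from the study.

```python
# Diagnostic Eco RI fragment sizes (kb) for the intron 2 probe, as reported
# in the text. The matching tolerance below is an illustrative assumption.
DIAGNOSTIC_KB = {"2b26-ps": 10.6, "2b27-ps": 9.5}


def loci_from_bands(observed_kb, tolerance_kb=0.3):
    """Return the pseudogenes whose diagnostic fragment matches an observed band."""
    return sorted(
        locus
        for locus, size in DIAGNOSTIC_KB.items()
        if any(abs(band - size) <= tolerance_kb for band in observed_kb)
    )


print(loci_from_bands([10.6]))       # a clone carrying only the 10.6-kb fragment
print(loci_from_bands([9.5]))        # a clone carrying only the 9.5-kb fragment
print(loci_from_bands([10.6, 9.5]))  # a clone spanning both pseudogenes
```

A clone that shows both bands, as RP23-174D7 does, must overlap both pseudogenes, which is what rules out a single merged locus.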

Five of the 2b loci can confidently be identified as pseudogenes because they consist of fewer than the nine exons common to functional Cyp2b genes (Table 1). The partial nature of the 2b24-ps, 2b25-ps, 2b26-ps, 2b27-ps, and 2b28-ps pseudogenes was confirmed by exon-specific PCRs from the appropriate BAC clones. The locus labeled 2b23 in Figure 1 has not been previously identified as a functional gene, but it has all nine exons, includes a legitimate heme signature in the ninth exon, and has no premature stop codons or frameshift mutations. Differences between the public and Celera versions of this sequence yield a few alternative amino acid residues but do not affect the viability of any potential product. The 2b23 sequence does not match any mRNAs or ESTs currently listed in GenBank, so it must be listed as a potentially functional new 2b gene, pending a search through more tissue types for a matching mRNA. There are thus a total of five confirmed or potentially functional genes and five partial pseudogenes within the 2b subfamily in mouse.

The rule that functional genes in the Cyp2 family have a nine-exon structure is violated in the mouse by the 2b10 gene. Two cDNA sequences were originally described for this gene (Noshiro et al. 1988), one with a standard length of 1,476 base pairs, and a second rare form with a stretch of 27 extra nucleotides that were presumed to belong either to the end of exon 8 or the beginning of exon 9. However, our analysis of the genomic sequence of 2b10 makes it clear that these base pairs in fact make up a small additional exon with valid splice sites (Figure 3). This "miniexon," which encodes only nine amino acids, appears to have been recruited from sequence that previously formed part of the eighth intron of the ancestral 2b gene. We have identified sequences in the eighth introns of the 2b9 and 2b13 genes that are very similar to the miniexon, but both of these introns have critical differences that prevent the formation of splice sites (Figure 3). Because the nine additional amino acids would disrupt a critical motif in the enzyme, it is likely that the long form of 2b10 represents a nonfunctional splice variant.
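The two properties that make the 27-nucleotide segment a plausible exon, flanking splice signals and preservation of the reading frame, can be checked mechanically. The sketch below assumes the canonical GT-AG rule (the upstream intron ends in AG, the downstream intron begins with GT), which is a simplification of real splice-site recognition. It uses invented toy sequences rather than the actual 2b10, 2b9, or 2b13 sequences, and all names are hypothetical.

```python
def has_canonical_splice_sites(genomic, start, end):
    """Check GT-AG flanks for a candidate internal exon.

    `start` is the 0-based index of the first exon base and `end` is the
    index just past the last exon base. The two bases before the exon must
    be AG (acceptor) and the two bases after it must be GT (donor).
    """
    g = genomic.upper()
    acceptor_ok = start >= 2 and g[start - 2:start] == "AG"
    donor_ok = g[end:end + 2] == "GT"
    return acceptor_ok and donor_ok


def codons_added(exon_length):
    """Return the number of whole codons added, or None if the frame shifts."""
    if exon_length % 3 != 0:
        return None
    return exon_length // 3


# Toy sequences (hypothetical): intron bases in lowercase, exon in uppercase.
upstream = "cctttttcag"
exon = "GCT" * 9          # 27 nucleotides
downstream = "gtaagtcc"
start, end = len(upstream), len(upstream) + len(exon)

spliceable = upstream + exon + downstream
mutated_donor = upstream + exon + "ataagtcc"  # GT changed to AT

print(has_canonical_splice_sites(spliceable, start, end))
print(has_canonical_splice_sites(mutated_donor, start, end))
print(codons_added(len(exon)))
```

A 27-nucleotide insertion adds nine codons without shifting the frame, which is consistent with the nine extra amino acids described above; a single substitution in either flanking dinucleotide is enough to abolish a canonical site.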

Figure 3. The structure of the 2b10 gene. There is a miniexon of 27 nucleotides, labeled “10,” between the standard exons 8 and 9. The 2b10 miniexon and surrounding sequences are compared with the corresponding regions from the eighth introns of 2b9 and 2b13. Potential splice sites are underlined. Exon sizes are exaggerated relative to intron sizes for clarity.

Evolutionary relationships among the paralogous mouse 2b loci are far from clear. The 2b9 and 2b13 genes are more highly related to each other than to either 2b10 or 2b19, and the new 2b23 gene is somewhat more similar to 2b19 than to the other genes. Any other pairing among the functional 2b genes gives the same average identity level of about 85% across exons (Table 2). The 2b pseudogenes are also not closely related to specific functional genes except for the 2b28-ps partial pseudogene, which is somewhat more similar to 2b13 and 2b9 than to the other 2b genes (Table 2). As noted above, the 2b26-ps and 2b27-ps pseudogenes are nearly identical, but they do not show a particular affiliation to any of the functional genes.

Cyp2f. There is only a single member of the 2f subfamily in the mouse, the functional 2f2 gene, which is located centromeric of and close to the 2t4 gene (Figure 1) in a position exactly corresponding to that of the human 2F1P locus. Unlike the human and gorilla (Chen et al. 2002), the mouse does not have a second 2f locus. This was established by the failure of intron 1 and exon 9 primers to amplify from any of the BAC clones and by the lack of hybridization to the Southern blots using a 2f2 exon 9 probe.

Cyp2g. The 2g subfamily consists of the gene responsible for the known CYP2G1 enzyme, located just centromeric of the 2a5 gene, and the partial pseudogene 2g1-ps, which lies just centromeric of the 2a4 gene (Figure 1). Cyp2g1-ps has only exons 7, 9, and half of 8, which collectively are about 96% identical to the corresponding portions of 2g1 (Table 2). As both sequence assemblies are very fragmented near 2g1-ps, the pseudogene sequence can currently be found only in the Celera assembly, and even there it is incomplete. Because 2g1 is also incomplete in the public assembly (Table 1), its identity was confirmed by PCR-amplifying and sequencing fragments of exons 1, 2, 6, and 9 from the BAC clone RP22-78A19. To prove the partial nature of the 2g1-ps pseudogene, the RP23-430G14 BAC clone was used as template for the same amplifications; only the exon 9 primers gave a product. In addition, the exon 1 and 6 products hybridized only to clone RP22-78A19 on the Southern blots and not to RP23-430G14.

Cyp2s. As is true for the human, the mouse has a single member of the 2s subfamily located at one end of the cluster, close to the Axl gene (Figure 1). Neither Southern blots nor PCR amplifications gave evidence of any additional 2s loci.

Cyp2t. There is also only a single member of the 2t subfamily in the mouse, 2t4. It is very similar to the functional rat 2T1 gene (Nelson 2003), but the predicted 2t4 mRNA does not match any mouse cDNA or EST now in GenBank. The gene is located at the extreme telomeric end of the mouse cluster, only 8 kb from the Egln2 locus (Figure 1), in the same relative position as the human pseudogene 2T2P. The mouse 2t4 is slightly more related to human 2T2P than to human 2T3P (Table 2).

Discussion

The Cyp2 subfamilies in the mouse cluster are the same six present in the corresponding human cluster (CYP2A, 2B, 2F, 2G, 2S, and 2T), but the total number of loci is significantly greater in mouse (22 vs. 13). This difference is due primarily to the expansion of the 2a and 2b subfamilies in mouse (Figure 2). The similarities between the CYP2A-T gene clusters in the mouse and human indicate that the six component subfamilies were already present in a common mammalian ancestor. The differences in organization indicate that most of the individual loci within the subfamilies developed after the primate and rodent lineages split. Some specific loci, however, may have developed in the common ancestor, and therefore may be truly orthologous. To trace the evolution of the gene clusters in any detail, it is necessary to distinguish these older orthologous loci from newer, species-specific loci. Defining orthologs between mouse and human also facilitates the creation of appropriate knockout animals.

It is often difficult or impossible to identify orthologs of CYP genes in all but very closely related species (Nelson et al. 1993), but when sequence similarity, physical location, and protein function all match, this can be done at least tentatively (Chen et al. 2002; Hoffman et al. 2001). In the case of the CYP2A-T clusters, some orthologous relationships can be reasonably deduced. The mouse Cyp2a5 locus may be a true ortholog of the human CYP2A6, as they have the most similar sequences (Table 2) and they are located at similar positions within the two clusters (Figure 2A). The single mouse 2f2 and the functional human 2F1 both express proteins with similar substrate ranges and the same limited tissue distribution and are thus considered orthologous (Chen et al. 2002). The mRNAs known from the single mouse 2s1 and human 2S1 genes are 81% identical, and these genes are similarly located on the AXL ends of their respective clusters, so they can also be considered orthologous (Hoffman et al. 2001). The mouse 2g1 is equally similar to the two human 2G pseudogenes (83%), which are probably degraded copies of an earlier functional gene that was orthologous to 2g1. Finally, the presumably functional mouse 2t4 has the same position on the EGLN2 end of the cluster as does the human 2T2P and is likely to be its ortholog (Nelson 2003).

Though the two ends of the mouse and human cluster are very similar, with orthologous genes in corresponding positions, the distinct organization in the middle of each cluster indicates that some major rearrangements have occurred since the two species diverged from a common ancestor. In the human, a large inverted repeat was inferred to explain the mirror-image organization underlying the paired 2F, 2T, 2A, and 2G loci in the center of the cluster (Hoffman et al. 1995, 2001). In the mouse, there is no such mirror-image set of loci. Instead, there are single 2f and 2t genes, while the central loci are arranged a-g-b-a-g-b rather than a-g-b-b-g-a as in humans, suggesting a more limited tandem duplication without an inversion (Figure 2A). Additionally, in humans the 2B6 and 2B7P loci were apparently inserted into the middle of the 2A18P locus (Hoffman et al. 2001), whereas in mouse the many 2b subfamily genes occur in two separate groups, with no sign of a late insertion.
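The contrast between the two locus orders can be made concrete with a toy model of block duplication. In the sketch below each locus is a (subfamily, strand) pair; a tandem duplication appends an identical copy of the block, whereas an inverted duplication appends the block in reverse order with every strand flipped. The starting block and its strand assignments are hypothetical and are chosen only to reproduce the orders named above.

```python
def tandem_duplication(block):
    """Append a direct copy of the block: same order, same strands."""
    return list(block) + list(block)


def inverted_duplication(block):
    """Append a mirror-image copy: reversed order, opposite strands."""
    return list(block) + [(name, -strand) for name, strand in reversed(block)]


def order(loci):
    """Render the locus order as a dash-separated string."""
    return "-".join(name for name, _ in loci)


# Hypothetical ancestral block of 2a, 2g, and 2b loci on the same strand.
ancestral = [("a", +1), ("g", +1), ("b", +1)]

print(order(tandem_duplication(ancestral)))    # mouse-like arrangement
print(order(inverted_duplication(ancestral)))  # human-like arrangement
```

In this model the tandem copy keeps each locus in its original transcriptional orientation, while the inverted copy reverses it, so locus order and direction of transcription together distinguish the two mechanisms.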

Figure 2A illustrates our hypothesis that the basic organization of the central loci in mouse is due to a large tandem duplication encompassing 2a, 2g, and perhaps 2b loci. This hypothesis is supported by the fact that the 2a5 and 2g1 genes on the telomeric side of the cluster are transcribed in the same direction as each other and are spaced about the same distance apart as the 2a4 and 2g1-ps loci on the centromeric side. It is also consistent with the suggestion of Aida et al. (1994) that because Mus musculus has both 2a4 and 2a5 genes, whereas its close relative Mus spretus appears to have two nearly identical 2a5-like loci, 2a4 must have been formed recently by the alteration of a critical residue in a previously duplicated copy of the 2a5 gene.

This large tandem duplication may or may not have included loci from the 2b subfamily. Though 2a4 is very similar in sequence to 2a5, and 2g1 to 2g1-ps, there are no highly related 2b genes across the two groups, as would be expected if an a-g-b block of sequence had been recently duplicated. In addition, the 2b loci are not as clearly patterned as the 2a and 2g loci--multiple 2b genes are transcribed in different directions next to both the 2g1 and 2g1-ps loci. Conversely, the 2b loci do occur in two distinct groups that are in similar positions relative to the 2a and 2g loci. Although there is thus good evidence for a tandem duplication that included at least one 2a and one 2g locus, the timing and the extent of this duplication cannot be determined until the corresponding genes are examined in additional mammalian species.

Full or partial deletions of single loci may have occurred in both the primate and rodent lineages, but these cannot be detected. Gene losses by deletion should always be harder to distinguish than duplications, as they are unlikely to leave behind any characteristic pattern or signature sequences. In particular, partial pseudogenes in both clusters may have been created either by a duplication followed by deletion of some exons or by a duplication encompassing only part of a locus.

On a smaller scale, several interesting comparisons can be made among the numerous 2a and 2b subfamily genes. The 2a subfamily expanded in the mouse by a series of duplications involving single and multiple loci. The group of six 2a loci on the telomeric side of the cluster includes three highly related pairs (Table 2). The whole 35-kb block of DNA that includes Cyp2a22 and 2a23-ps is highly similar to the block containing Cyp2a12 and 2a20-ps, with more than 90% identity between the regions around the genes. The 20-kb region that encompasses Cyp2a5 shares more than 90% sequence identity with the region around Cyp2a21-ps. The simplest explanation for this arrangement, shown in Figure 2B, requires several rounds of duplication. The exact order in which the duplications happened is ambiguous, as there is no significant patterning of sequence in between the duplicated blocks.

The two strongest similarities within the 2b subfamily are between the 2b9 and 2b13 loci and between the 2b26-ps and 2b27-ps pseudogenes (Table 2). The relative positions and the directions of transcription of these gene pairs suggest that a second, smaller tandem duplication occurred within the centromeric 2b group to create the 2b9-2b26-ps and 2b13-2b27-ps regions, as shown in Figure 2C. Evidence for this duplication also comes from the extremely high level of identity (99%) found between short noncoding sequences in the introns of the 2b26-ps and 2b27-ps pseudogenes (data not shown).

The information provided by this study has allowed us to draw a complete and accurate picture of the Cyp2a-t gene cluster in the mouse and to understand the similarities and differences between its evolution and that of the human cluster. This comparison should enable researchers to better utilize the mouse as a model system for the study of these CYP genes in humans and in other mammals.