Identification of Disease Genes

Medha Bhagwat

NCBI Bookshelf. A service of the National Library of Medicine, National Institutes of Health.

Bergman NH, editor. Comparative Genomics: Volumes 1 and 2. Totowa (NJ): Humana Press; 2007.

Chapter 24. Identification of Disease Genes

Example-Driven Web-Based Tutorial

Summary

The National Center for Biotechnology Information (NCBI) has developed several web-based mini-courses (www.ncbi.nlm.nih.gov/Class/minicourses) illustrating the applications of NCBI resources. This chapter describes the problem-based mini-course called “Identification of Disease Genes.” The mini-course guides us through one of several ways to identify disease-related genes, starting from expressed sequence data such as might be obtained from patients. The chapter first introduces the human genome assembly and resources such as the Basic Local Alignment Search Tool, Map Viewer, the Single Nucleotide Polymorphism database, and Online Mendelian Inheritance in Man. It then demonstrates the practical application of these resources to the identification of genes related to two diseases, hemochromatosis and sickle cell anemia. The chapter also provides links to the mini-course web pages and includes screen images of the results of the applied steps.

1. Introduction to the National Center for Biotechnology Information Mini-Courses

The National Center for Biotechnology Information (NCBI) provides several focused bioinformatics mini-courses (http://www.ncbi.nlm.nih.gov/Class/minicourses/). The mini-courses are either problem-based, such as “Identification of Disease Genes,” or NCBI resource-based, such as “BLAST Quick Start.” The courses are 2.5 h in length, with the first 90 min devoted to an overview and an online demonstration of a problem or problem set by an instructor. This is followed by a 1-h hands-on session in which students work at their own computers on a problem or problem set similar to the one demonstrated. The courses are taught on the National Institutes of Health (NIH) campus in Bethesda and at academic institutes in the United States.

2. Objective

This chapter describes the mini-course that focuses on the identification of a disease gene using NCBI’s human genome assembly. The reference human genome assembly along with integrated maps, literature, and expression information comprises a powerful discovery system for exploring candidate human disease genes.

3. Genetics and Bioinformatics Background

3.1. Information Transfer Within a Cell

The pathway of genetic information transfer in a cell begins with the transcription of genes within a genome to produce mRNAs and ends with the translation of mRNAs to produce proteins. The sequence databases contain genomic, mRNA, and protein sequences representing all three stages in the pathway.
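As a rough illustration of these three stages, the sketch below transcribes a short DNA coding-strand fragment into mRNA and translates it into protein. The fragment, the function names, and the deliberately tiny codon table are assumptions chosen for illustration only; a real analysis would use a complete genetic code table.

```python
# Minimal sketch of the genome -> mRNA -> protein pathway.
# The DNA fragment and the partial codon table are illustrative assumptions.

CODON_TABLE = {  # only the codons needed for this toy example
    "AUG": "M", "GUG": "V", "CAC": "H", "CUG": "L", "UAA": "*",
}

def transcribe(coding_strand: str) -> str:
    """Return the mRNA corresponding to a DNA coding (sense) strand."""
    return coding_strand.upper().replace("T", "U")

def translate(mrna: str) -> str:
    """Translate an mRNA codon by codon until a stop codon is reached."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":  # stop codon ends translation
            break
        protein.append(amino_acid)
    return "".join(protein)

dna = "ATGGTGCACCTGTAA"
mrna = transcribe(dna)
protein = translate(mrna)
print(mrna, protein)
```

Each of the three strings (dna, mrna, protein) corresponds to one kind of record held in the sequence databases.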

3.2. Data Available for Bioinformatics Analysis

One way to identify genes in a genome is to generate a cDNA library from the pool of RNA messages. To generate a cDNA library, the RNA messages from a tissue or from cells representing a developmental stage are copied into more stable cDNA molecules, which are then placed into an appropriate vector to generate a collection of cDNA clones (vector and the individual cDNA insert). The single pass, short 300–500 nucleotide sequences obtained from sequencing either end of the cDNA insert are called expressed sequence tags (ESTs).

For more background information about genetic terms, the user may refer to the following webpages:

Talking Glossary of Genetic Terms (http://www.genome.gov/glossary.cfm).

The NCBI Handbook Glossary (http://www.ncbi.nlm.nih.gov/books/NBK21106/).

A Science Primer (http://www.ncbi.nlm.nih.gov/About/primer/index.html).

NCBI Bookshelf (http://www.ncbi.nlm.nih.gov/books/).

One way to identify the genes responsible for a particular phenotype is to generate a cDNA library from patient tissues or samples and obtain a number of ESTs. The ESTs are then used to determine which genes express them and whether they contain any nucleotide variations, or single nucleotide polymorphisms (SNPs), when compared to normal individuals. SNPs are sites in a DNA sequence where individuals differ at a single nucleotide. The SNP database is described in more detail later in this chapter.

4. General Protocol and Required Resources

4.1. Outline of Steps

Compare the sequences of ESTs from a patient to the sequences of the human genome (using Basic Local Alignment Search Tool [BLAST]).
Identify the genes aligning to the ESTs and download their sequences (using Map Viewer).
Identify whether the EST sequences contain any known SNPs (using dbSNP).
Determine whether a gene variant is known to cause a phenotype (using Online Mendelian Inheritance in Man [OMIM]).

Thus, starting from the transcribed sequences derived from patients, we will obtain information about expressed genes and determine whether these genes contain known variations that lead to the disease phenotype.

4.2. Descriptions of Resources Used

NCBI assembles component sequences from the human genome sequencing project into longer sequences called contigs whose accession numbers begin with prefix “NT_”. NCBI also performs a number of annotations on the assembly to identify genes, transcripts, clones, repeats, markers, and SNPs. NCBI releases the updated human genome assembly or the new “Build” periodically. For more information about the human genome assembly and annotation, see ref. 1 and the help document (http://www.ncbi.nlm.nih.gov/mapview/static/humansearch.html).

This problem-based mini-course guides us through the use of NCBI resources such as BLAST, Map Viewer, dbSNP, and OMIM as tools to identify disease genes (2).

4.2.1. BLAST

BLAST provides a method for rapid searching of nucleotide and protein databases for similarities with a query nucleotide or protein sequence (3,4). The human genome BLAST page at (http://www.ncbi.nlm.nih.gov/genome/seq/BlastGen/BlastGen.cgi?taxid=9606) provides centralized access to the NCBI human genome assembly and annotated transcript and protein sequences. The BLAST output links directly to the Human Genome Map Viewer, where database hits can be analyzed in their genomic context to see the relationship with other annotated features.

4.2.2. Map Viewer

The Map Viewer (http://www.ncbi.nlm.nih.gov/mapview/) allows us to view and search an organism’s complete genome (5). It shows integrated views of a collection of genetic, physical, and sequence maps for annotated genes, expressed sequences, SNPs, and other features, and, thus, is a valuable tool for the identification and localization of genes that contribute to human disease (as demonstrated in this mini-course).

4.2.3. dbSNP

NCBI’s SNP database (http://www.ncbi.nlm.nih.gov/SNP/) contains both single nucleotide substitutions and short deletions and insertions (6). The data in dbSNP are integrated with other NCBI genomic data. SNPs are aligned to the human genome, and the locations of SNPs with respect to the annotated genes and mRNAs are identified.

4.2.4. OMIM

OMIM (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM) is the database of human genes and genetic disorders developed and edited by Dr. Victor A. McKusick and his colleagues at Johns Hopkins and elsewhere, and adapted for the Internet by NCBI (see Note 1 about Online Mendelian Inheritance in Animals) (7).

5. Detailed Protocol: Problem 1

In this problem, we will use hemochromatosis, a disease characterized by iron overload, as an example. Suppose that a researcher is working on hemochromatosis and needs to obtain information about the gene(s) causing the phenotype. The following steps describe the analysis of EST sequences that might have been obtained from a hemochromatosis patient.

It is recommended to follow the link to the “Identification of Disease Genes” through the mini-course webpage (http://www.ncbi.nlm.nih.gov/Class/minicourses/).

This page contains a link to a file containing the up-to-date screen images of each of the steps described below. Referring to the file is strongly recommended to follow the mini-course steps. However, a number of screen images are provided in this chapter as well for a reader to follow along. These screen images are from human Build 35.1.

5.1. Step 1: Compare ESTs to the Human Genome

One way to identify the genes expressing the ESTs is to compare their sequences using BLAST with the human genome assembly and the genes annotated on it. The specialized BLAST page for searching against the annotated human genome assembly is at (http://www.ncbi.nlm.nih.gov/genome/seq/BlastGen/BlastGen.cgi?taxid=9606) (see Note 2). The user may directly access the human genome BLAST page through the “Identification of Disease Genes” mini-course web page by clicking on the “BLAST (human genome)” link. We can concatenate a number of EST sequences to run the search as a batch. However, we will use only one EST sequence as a query for this analysis (see Note 3). Paste the EST sequence provided on the mini-course page in the query box of the BLAST page and select the “genome (reference only)” database from the pull down menu and use the default program MegaBlast (8) (see Note 4). Start the search by clicking on the “Begin Search” button and obtain the results by clicking on the “Format” button. The BLAST results page shows only one match to the contig sequence NT_007592.14 on chromosome 6 in the human genome Build 35.1. In certain cases, there may be multiple matches to the human genome assembly (see Notes 5 and 6).

The alignment of the query EST sequence (indicated by “query”) and the matched sequence from chromosome 6 (indicated by “sbjct”) shows that the EST sequence is only 99 % identical to the genomic sequence (Fig. 1). Note the location of the nucleotide that is different between the two sequences (a G to A variation at the nucleotide 16951392 of the contig NT_007592.14).

Fig. 1

MegaBlast of the query EST against the human genome: alignment overview. The difference between the EST and genomic sequence (a G to A variation at the nucleotide 16951392 of the contig NT_007592.14) is highlighted by a rectangle.

The difference may be due to a sequencing error in the low quality EST sequence or it may represent a real SNP in the human genome. For future reference, paste your results, such as the alignment and the nucleotide difference (sequence difference at the nucleotide 16951392 on NT_007592.14; G in the genomic and A in the query EST sequence), in the window provided in the mini-course webpage.
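The bookkeeping in this step, locating the mismatched column in an alignment and converting it to a coordinate on the subject sequence, can be sketched as follows. This is a minimal illustration and not part of the mini-course: the function name, the short aligned sequences, and the start coordinate 1000 are all made up, and the real alignment rows would be copied from the BLAST output.

```python
# Sketch: report the subject coordinate of every mismatch in a pairwise
# alignment. All sequences and coordinates below are hypothetical.

def find_mismatches(query: str, sbjct: str, sbjct_start: int, step: int = 1):
    """Return (subject_coordinate, subject_base, query_base) for each mismatch.

    query and sbjct are the two aligned rows (same length, '-' for gaps).
    step is +1 when subject coordinates increase along the row (plus strand)
    and -1 when they decrease (minus strand).
    """
    if len(query) != len(sbjct):
        raise ValueError("aligned rows must have equal length")
    found = []
    coord = sbjct_start
    for q, s in zip(query.upper(), sbjct.upper()):
        if s == "-":
            continue  # insertion in the query: no subject coordinate consumed
        if q != s and q != "-":
            found.append((coord, s, q))
        coord += step
    return found

mismatches = find_mismatches("ACGTATACGT", "ACGTGTACGT", sbjct_start=1000)
print(mismatches)
```

Recording the coordinate together with both bases, as the tuple does here, is what allows the difference to be compared with dbSNP entries in a later step.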

5.2. Step 2: Identify the Genes Expressing the ESTs and Download Their Sequences

We will now take advantage of the NCBI annotation of the human genome assembly to identify the gene corresponding to the EST by using the Map Viewer. To visualize the BLAST hit on the genome using Map Viewer, click the “Genome View” button at the top of the BLAST results page, then on the Map element “NT_007592.” Currently, four maps should be displayed (Model, RNA, Genes_seq, and Contig) (Fig. 2).

Fig. 2

Map Viewer display of the Basic Local Alignment Search Tool hit from Subheading 5.1. The four maps displayed in this view, Model, RNA, Genes_seq, and Contig, are highlighted by a rectangle.

The Genes_seq map shows the “known” genes annotated by alignment of EST and/or mRNA sequences to the assembly. The Contig map shows the assembled genome contig sequence in the region, the Model map shows the ab initio model genes predicted by NCBI’s Gnomon program, and the RNA map shows the alignments of the known alternatively spliced transcripts. For more information about the human genome assembly and annotation, see ref. 1 and the help document (http://www.ncbi.nlm.nih.gov/mapview/static/humansearch.html).

Make the Genes_seq map the master map by clicking on the arrow at its top. The BLAST hit, indicated by the red bar, is in the region of one of the exons of the HFE gene annotated on the human genome (Fig. 3).

Fig. 3

Map Viewer display of the Basic Local Alignment Search Tool (BLAST) hit from Subheading 5. with Genes_seq Map as a master map. The BLAST hit, indicated by a bar on the right side of the Genes_seq map, is in the region of one of the exons of the HFE gene.

The thick bars in the Genes_seq map indicate the exons, and the thin lines joining them indicate the introns of the gene. To see the entire HFE gene structure, zoom out by clicking on the gray line and selecting the option “Zoom out 2 times” from the menu that appears; repeat as needed. The query EST represents a known gene, HFE. The orientation of the arrow next to the gene link indicates whether the gene is on the forward or the reverse strand. A gene annotated on the forward strand is indicated by an arrow pointing downward, whereas a gene annotated on the reverse strand is indicated by an arrow pointing upward (see Note 7). The HFE gene is annotated on the forward strand of chromosome 6.

The right-most map is called the master map and links that give more information about map elements are provided next to it. For example, the current master map, Genes_seq map, has links to resources that provide more information about the HFE gene such as OMIM, sv (Sequence Viewer), pr (Reference Proteins), dl (Download Sequence), ev (Evidence Viewer), mm (Model Maker), and hm (Homologene). For more information, please refer to the Human Maps Help document http://www.ncbi.nlm.nih.gov/mapview/static/humansearch.html. We will use some of these links in this mini-course. Display the entire HFE gene sequence by clicking on the download “dl” link and then on “Display” on the next page (see Notes 8 and 9). Copy the sequence and paste it in the area provided on the mini-course page. Note the accession number of the longest transcript, NM_000410. We will use this information in the next step.

5.3. Step 3: Determine Whether the ESTs Contain Known SNPs

Go back to the Map Viewer report by clicking on the back button of the browser twice. Click the Maps and Options link.

Remove all the maps except the Genes_seq map by selecting the map under the “Maps Displayed” menu and clicking on “Remove.” Now add the Variation map from the “Available maps” menu (by selecting the map and clicking Add). Make the Variation map the master map by selecting it and clicking the “Make Master/Move to Bottom” option. Then click “Apply.” (The Mini-Course Map Viewer Quick Start describes the usage of the Map Viewer in detail.)

Now two maps are displayed, Variation (the rightmost map and the master map) and Genes_seq (see Fig. 4). The master map provides detailed information for the map features, in this case SNPs. Zoom in on the BLAST hit area (bar on the right side of each map) by clicking on the map line next to it and choosing the appropriate zoom level. There are two SNPs in the area: rs1800562 and rs4986950 (see Notes 10 and 11).

Fig. 4

Map Viewer display, containing the variation and Genes_seq maps, zoomed in the region of the Basic Local Alignment Search Tool (BLAST) hit in Subheading 5. There are two SNPs, rs1800562 and rs4986950, in the BLAST hit area indicated by a bar on the right (more...)

Click any of the links and obtain information about the location and the nucleotide variation from the “Fasta sequence” and “Integrated maps” panels. The SNP, rs1800562, represents an A/G SNP (Fig. 5) at the nucleotide position 16951392 on the contig NT_007592.14 of the reference assembly (Fig. 6, see Note 12).

Fig. 5

Fasta sequence section of the SNP entry rs1800562. The A/G allele in the SNP, indicated in the definition line on the record, is highlighted by an oval.

Fig. 6

Integrated maps section of the SNP entry rs1800562. The location of the SNP, nucleotide position 16951392 on the contig NT_007592.14 of the reference assembly, is highlighted by a rectangle.

This is the same nucleotide variation on the contig NT_007592.14 found in the BLAST result in Subheading 5.1. (16951392 G to A). To identify whether this change represents a change in an encoded amino acid, we will refer to the GeneView panel. This view shows the location of the SNP in the alternatively spliced products annotated on all the assemblies. It also provides information at the protein level: the amino acid number and the change in the sequence, if any. Refer to the panel for the longest transcript, transcript variant 1 NM_000410.2, on the reference assembly contig NT_007592.14. The SNP would result in a change of the 282nd amino acid in the protein NP_000401.1, encoded by the mRNA NM_000410.2, from cysteine to tyrosine (Fig. 7).

Fig. 7

GeneView section of the SNP entry rs1800562 for the mRNA NM_000410 alignment on the reference assembly contig NT_007592. The resulting amino acid change, 282nd amino acid in the protein NP_000401.1, from cysteine to tyrosine, is highlighted by a rectangle.

Thus, the query EST sequence contains a known SNP in the HFE gene that results in a cysteine to tyrosine change in the 282nd amino acid (Cys282Tyr) of the protein expressed by the longest HFE transcript variant, variant 1 (see Note 13). The next obvious step is to find out whether the SNP in the HFE gene is known to be associated with a disease phenotype.
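To see how a single G to A substitution can turn a cysteine into a tyrosine, consider the sketch below. It assumes that the affected codon is TGC and that the SNP falls at its middle position; neither detail is stated in this chapter, so treat both as assumptions made for illustration. The function names are likewise hypothetical.

```python
# Sketch: effect of a single-base substitution on one codon.
# ASSUMPTION: the cysteine codon is TGC and the SNP hits its second base.

CODON_TO_AA = {"TGC": "Cys", "TAC": "Tyr"}  # only the two codons needed here

def substitute(codon: str, index: int, new_base: str) -> str:
    """Return the codon with the base at a 0-based index replaced."""
    return codon[:index] + new_base + codon[index + 1:]

def describe_change(ref_codon: str, alt_codon: str, residue_number: int) -> str:
    """Build a label such as Cys282Tyr from the two codons."""
    return f"{CODON_TO_AA[ref_codon]}{residue_number}{CODON_TO_AA[alt_codon]}"

ref_codon = "TGC"                          # assumed reference codon
alt_codon = substitute(ref_codon, 1, "A")  # G to A at the middle position
label = describe_change(ref_codon, alt_codon, 282)
print(ref_codon, "->", alt_codon, label)
```

Because HFE is annotated on the forward strand, the G to A change seen on the contig can be applied to the codon directly, without complementing.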

5.4. Step 4: Determine Whether the HFE Gene Variant is Known to Cause a Disease Phenotype

To determine whether the Cys282Tyr amino acid change is linked to a phenotype, we will access the OMIM database. Go back to the Map Viewer report by clicking the back button of the web browser. Make the Genes_seq map the master map by clicking the arrow at the top of the Genes_seq map. Click on the OMIM link next to the HFE gene. This takes us to the OMIM report for the HFE gene, which describes the relationship between mutations in the HFE gene and the hemochromatosis phenotype. Click the Allelic Variants “View list” in the blue side bar to get information about the mutant proteins from patients. One variant, Cys282Tyr, is reported to cause the hemochromatosis phenotype (see Fig. 8). The query EST contains a known variation that would lead to the expression of the Cys282Tyr variant protein associated with the hemochromatosis phenotype (see Notes 14 and 15).

Fig. 8

Allelic variants list section from the Online Mendelian Inheritance in Man report for the HFE gene. The Cys282Tyr variant, highlighted by a rectangle, is reported to be associated with hemochromatosis.

5.5. Results for Problem 1

This mini-course describes the steps needed to identify the gene producing an EST obtained from a hemochromatosis patient, download the gene sequence, identify known SNPs in the gene, and find SNP-associated phenotypes.

Results of Subheading 5.1.: the query EST sequence was found to align to contig NT_007592.14 on chromosome 6 with one nucleotide difference (G to A with respect to the nucleotide 16951392 on the contig).

Results of Subheading 5.2.: The query EST was found to align to the HFE gene.

Results of Subheading 5.3.: The query EST sequence contains a known SNP (G/A with respect to the nucleotide 16951392 on contig NT_007592.14) that results in the Cys282Tyr change in the hemochromatosis protein expressed by the longest HFE mRNA variant.

Results of Subheading 5.4.: The Cys282Tyr change in the HFE protein is associated with hemochromatosis.

6. Detailed Protocol: Problem 2

For more practice, we will now perform a similar analysis using another EST sequence, this one from a sickle cell anemia patient. Sickle cell anemia is a disease in which the red blood cells are curved (sickle-shaped) and have difficulty passing through small blood vessels. It is recommended to follow along from the webpage (http://www.ncbi.nlm.nih.gov/Class/minicourses/diseasegene2.html).

This page contains a link to a file containing the screen images of each of the steps described next. Referring to the file is strongly recommended to follow the mini-course steps. However, a number of screen images are provided in this chapter as well for a reader to follow along. These screen images are from human Build 35.1.

6.1. Step 1

Paste the EST sequence that is provided in the mini-course page into the query box of the human genome BLAST page (http://www.ncbi.nlm.nih.gov/genome/seq/BlastGen/BlastGen.cgi?taxid=9606). The user may directly access the human genome BLAST page through the “Identification of Disease Genes” mini-course web page by clicking on the “BLAST (human genome)” link. Select the “genome (reference only)” database from the pull down menu, start the search by clicking on the “Begin Search” button and obtain the results by clicking on the “Format” button (see Notes 2–4). The BLAST results page shows four hits to the contig sequence NT_009237.17 on chromosome 11, in the human genome build 35.1, with varying percent identity (Fig. 9).

Fig. 9

MegaBlast of the query expressed sequence tags against the human genome: graphical overview. There are four hits to the contig sequence NT_009237.17 on chromosome 11 as highlighted by rectangles.

These multiple hits could arise from similarity to multiple gene family members and/or from the query EST sequence spanning multiple exons (see Notes 5 and 6).

6.2. Step 2: Identify the Genes Expressing the ESTs and Download Their Sequences

To determine the gene expressing the EST in this case, it is much easier to view the BLAST hits in the Map Viewer. Click the “Genome View” button at the top of the BLAST results page, then on the Map element “NT_009237.”

Currently, four maps are displayed: Model, RNA, Genes_seq, and Contig (see Fig. 10). Refer to Subheading 5.2. for a description of these maps. The four BLAST hits are indicated by the shaded areas on the right side of each map. Two of these align to the two exons of the HBB gene and two align to the two exons of the HBD gene (highlighted by the ovals). Note the percent identity of the BLAST hits (highlighted by the rectangles). The EST sequence is more similar to the HBB exons than to the HBD exons (100 and 99 % compared to 98 and 92 %, respectively). Thus, the query EST is probably expressed by the HBB gene but also aligns to the HBD gene because of its sequence similarity to the HBB gene sequence (Fig. 10).
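The reasoning used here, preferring the gene whose exons match the EST most closely, can be written down as a small ranking. The sketch below uses the four percent identities quoted above; the function name is hypothetical, and averaging the identities per gene is a simplifying assumption (it ignores hit length, which a more careful comparison would take into account).

```python
# Sketch: rank candidate genes by the mean percent identity of their hits.
# Averaging per gene is an illustrative simplification, not an NCBI method.
from collections import defaultdict
from statistics import mean

def rank_genes_by_identity(hits):
    """hits: iterable of (gene, percent_identity); best gene comes first."""
    by_gene = defaultdict(list)
    for gene, identity in hits:
        by_gene[gene].append(identity)
    ranked = [(gene, mean(values)) for gene, values in by_gene.items()]
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked

# Percent identities reported in the text for the four BLAST hits.
hits = [("HBB", 100.0), ("HBB", 99.0), ("HBD", 98.0), ("HBD", 92.0)]
ranking = rank_genes_by_identity(hits)
print(ranking)
```

With these values HBB ranks first, matching the conclusion that the EST is probably expressed by the HBB gene.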

Fig. 10

Map Viewer display obtained from the Genome View button link on the BLAST results page of Subheading 6.1. The four BLAST hits are indicated by the shaded areas on the right side of each map. Two of these align to the two exons of the HBB gene and two (more...)

One of the exons of the HBB gene is only 99 % identical to the query EST. To note the location of the nucleotide difference between the two sequences, click on the corresponding “Blast hit” link to go back to the BLAST results page.

Note that the alignment is on the minus (reverse) strand of the contig NT_009237.17 (Fig. 11). One nucleotide differs between the query EST and the genomic sequence “Sbjct.” To identify the contig coordinate of the difference, start from the closest numbered nucleotide, 4035482, of the “Sbjct” contig NT_009237.17 and count along the alignment to the site of difference, which is the tenth nucleotide when 4035482 is counted as the first. Because the alignment is on the minus strand, the contig coordinates decrease along the row, so the site of difference lies at 4035482 - 9 = 4035473. Thus the query EST has a T to A variation with respect to the nucleotide 4035473 of the contig NT_009237.17. The difference could be due to a sequencing error in the low quality EST sequence or it may represent a real SNP in the human genome. For future reference, paste your results, such as the alignment and the nucleotide difference (sequence difference at the nucleotide 4035473 on NT_009237.17; A in the genomic and T in the query EST sequence), in the window provided in the mini-course webpage.
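The coordinate arithmetic for a minus-strand alignment is easy to get wrong by one, so it may help to spell it out. The sketch below assumes that the differing base is the tenth nucleotide of the row when the nucleotide numbered 4035482 is counted as the first; this reading is an assumption, chosen because it reproduces the position 4035473 given in the text. The function name is hypothetical.

```python
# Sketch: convert a position within an alignment row to a contig coordinate.
# ASSUMPTION: the mismatch is the 10th base of the row, counting the base
# numbered 4035482 as the 1st (0-based offset 9).

def contig_coordinate(row_start: int, ordinal: int, minus_strand: bool) -> int:
    """Coordinate of the ordinal-th base (1-based) in an ungapped row."""
    offset = ordinal - 1  # the first base sits at row_start itself
    return row_start - offset if minus_strand else row_start + offset

snp_position = contig_coordinate(4035482, ordinal=10, minus_strand=True)
print(snp_position)
```

On a plus-strand alignment the same offset would be added rather than subtracted.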

Fig. 11

Alignment of the 99 % identical Basic Local Alignment Search Tool (BLAST) hit in Subheading 6.1. The BLAST hit is on the minus (reverse) strand (highlighted by an oval) of the contig NT_009237.17. There is one nucleotide difference (highlighted by a rectangle) (more...)

To download the HBB genomic sequence, go back to the Map Viewer report by clicking the “back” button on the browser once. Set the Genes_seq map as the master map by clicking on the arrow at its top. Because the gene of interest for further analysis is the HBB gene, we can remove the HBD gene from the view by clicking on the gray line at the appropriate BLAST hit location and selecting the “Recenter” option (see Fig. 12).

Fig. 12

Map Viewer display showing the recenter option by clicking on the gray line indicating the contig Map.

The upward pointing arrow next to the HBB gene link shows the placement of the gene on the reverse strand of chromosome 11 (see Note 7).

Click on the “dl” link next to the HBB gene. Because the gene is on the reverse strand, select minus on the Strand pull-down menu and click on the “Change Region/Strand” button. Display the gene sequence by clicking on the “Display” option (see Notes 8 and 9). Copy the sequence and paste it in the area provided in the mini-course page. You can adjust the nucleotide locations to download the upstream or downstream sequence by using the “adjust by” and “Change Region/Strand” options.

6.3. Step 3: Determine Whether the ESTs Contain Known SNPs

Go back to the Map Viewer report by clicking on the back button of the browser twice. Click the Maps and Options link.

Remove all the maps except the Genes_seq map and add the Variation map as the master map. Zoom in on the BLAST hit area (represented by the bars) by clicking on the thin gray line next to it and choosing the appropriate zoom level. There are three SNPs in the area: rs713040, rs334, and rs11549407 (see Notes 10 and 11; Fig. 13).

Fig. 13

Map Viewer, displaying the variation and Genes_seq maps, zoomed in the region of BLAST hit in Subheading 6. There are three SNPs in the BLAST hit area indicated by the bars on the right side of each map; rs713040, rs334, rs11549407.

Click on any of the links and obtain information about the location and the nucleotide variation from the “Fasta sequence” and “Integrated maps” panels. The SNP, rs334, represents an A/T SNP (Fig. 14) at the nucleotide position 4035473 on the contig NT_009237.17 (Fig. 15 and Note 12).

Fig. 14

Fasta sequence section of the SNP entry rs334. The A/T allele in the SNP, indicated in the definition line of the record, is highlighted by an oval.

Fig. 15

Integrated maps section of the SNP entry rs334. The location of the SNP, nucleotide position 4035473 on the contig NT_009237.17 of the reference assembly, is highlighted by a rectangle.

This is the same nucleotide variation on the contig NT_009237.17 found in the BLAST result in Subheading 6.1. (4035473 T to A). Next, to identify whether this change represents a change in an encoded amino acid, we will refer to the GeneView panel. This view shows the location of the SNP in the alternatively spliced products annotated on all the assemblies. It also provides information at the protein level; the amino acid number and the change in the sequence, if any. Refer to the panel for the transcript, NM_000518, on the reference assembly contig, NT_009237.17. The SNP would result in the change at the seventh amino acid in the protein NP_000509.1, encoded by the mRNA NM_000518.4, from glutamate to valine (see Fig. 16).

Fig. 16

GeneView section of the SNP entry rs334 for the mRNA NM_000518 alignment on the reference assembly contig NT_009237. The SNP results in the change at the seventh amino acid in the protein NP_000509.1 from glutamate to valine.

Thus, the query EST sequence contains a known SNP in the HBB gene that results in a glutamate to valine change in the seventh amino acid (Glu7Val) of the β-globin protein (see Note 13). The next obvious step is to find out whether the SNP in the HBB gene is known to be associated with a disease phenotype.
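Because HBB is annotated on the reverse strand, an allele given with respect to the contig has to be complemented before it can be read in the mRNA. The sketch below reads the "T to A" variation reported in the text as alleles on the contig's forward strand (an interpretation), and assumes that the affected glutamate codon is GAG with the SNP at its middle position; the codon is not stated in this chapter and is an assumption for illustration. Function names are hypothetical.

```python
# Sketch: move an allele from the contig's forward strand to the strand of a
# gene annotated on the reverse strand, then apply it to a codon.
# ASSUMPTION: the glutamate codon is GAG and the SNP hits its second base.

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]

def allele_on_gene_strand(contig_allele: str, gene_on_reverse_strand: bool) -> str:
    """Express a forward-strand allele on the strand the gene is read from."""
    if gene_on_reverse_strand:
        return reverse_complement(contig_allele)
    return contig_allele.upper()

contig_ref, contig_alt = "T", "A"   # T to A with respect to the contig
mrna_ref = allele_on_gene_strand(contig_ref, gene_on_reverse_strand=True)
mrna_alt = allele_on_gene_strand(contig_alt, gene_on_reverse_strand=True)

ASSUMED_REF_CODON = "GAG"           # assumed glutamate codon
alt_codon = ASSUMED_REF_CODON[0] + mrna_alt + ASSUMED_REF_CODON[2]
print(mrna_ref, "->", mrna_alt, ASSUMED_REF_CODON, "->", alt_codon)
```

Under these assumptions the change on the gene's strand is A to T, giving GTG, which encodes valine in the standard genetic code.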

6.4. Step 4: Determine Whether the HBB Gene Variant is Known to Cause a Disease Phenotype

To determine whether the Glu7Val variant is known to cause a disease phenotype, we will access the OMIM database. Go back to the Map Viewer report by clicking on the back button of the web browser. Make the Genes_seq map the master map by clicking on the arrow at the top of the Genes_seq map. Click on the OMIM link next to the HBB gene. This takes us to the OMIM report for the HBB gene, which details how variants of the HBB gene are associated with various phenotypes. As mentioned in the OMIM report under the “Pseudogenes” section, the allelic variants are listed for the mature HBB (β-globin) protein, which lacks the initiator methionine. The SNP database reports them for the precursor protein. Hence, the allelic variants in the OMIM report are off by one amino acid compared to the variants in the SNP report (see Note 14). Thus, the Glu7Val variant in the SNP report corresponds to the Glu6Val variant in the OMIM report. Access the allelic variants list by clicking on the “View list” in the blue side bar. The Glu6Val variant, called hemoglobin S, is reported to cause the sickle cell anemia phenotype. The query EST contains a known variation that leads to the expression of the Glu7Val variant protein associated with the sickle cell anemia phenotype (see Note 15).
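The off-by-one relationship between the two numbering schemes can be captured in a few lines. The helper below is a hypothetical sketch, not an NCBI or OMIM tool; it assumes variant labels of the simple three-letter form used in this chapter (for example, Glu7Val) and that exactly one residue, the initiator methionine, is missing from the mature protein.

```python
# Sketch: convert a variant label from precursor-protein numbering (as in the
# SNP report) to mature-protein numbering (as in the OMIM report).
import re

def precursor_to_mature(variant: str, removed_residues: int = 1) -> str:
    """Shift the residue number down by the number of removed N-terminal residues."""
    match = re.fullmatch(r"([A-Za-z]{3})(\d+)([A-Za-z]{3})", variant)
    if match is None:
        raise ValueError(f"unrecognized variant label: {variant!r}")
    ref, position, alt = match.group(1), int(match.group(2)), match.group(3)
    new_position = position - removed_residues
    if new_position < 1:
        raise ValueError("variant lies within the removed residues")
    return f"{ref}{new_position}{alt}"

omim_label = precursor_to_mature("Glu7Val")
print(omim_label)
```

The same shift would not apply to a protein such as HFE when both reports use the same numbering, which is why the Cys282Tyr label in Problem 1 needed no conversion.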

6.5. Results for Problem 2

This mini-course describes steps to identify the gene expressing the ESTs obtained from a sickle cell anemia patient, download the gene sequence, identify known SNPs in the gene and find SNP-associated phenotypes.

Results of Subheading 6.1.: the query EST sequence was found to align to the contig NT_009237.17 on chromosome 11 with one nucleotide difference (T to A with respect to the nucleotide 4035473 on the contig).

Results of Subheading 6.2.: the query EST was found to be expressed by the HBB gene.

Results of Subheading 6.3.: the query EST sequence contains a known SNP (T/A with respect to the nucleotide 4035473 on contig NT_009237.17).

Results of Subheading 6.4.: the Glu7Val change in the HBB protein is associated with sickle cell anemia.

7. Application to Unknown Disease Genes

The mini-course describes a procedure to identify a known gene and a SNP from the NCBI databases starting from one EST sequence. The same procedure can be used with a batch of EST sequences in the initial human genome BLAST search (see Note 3) followed by a similar analysis to identify genes corresponding to them. Some ESTs may be produced by known genes (as described in the mini-course) and some may be produced by novel genes not yet annotated on the Genes_seq map. The Model map may be useful to identify the novel genes. Also, some ESTs may contain new SNPs, which can be deposited in dbSNP. By comparing the DNA sequence from patients and normal individuals, it can be discerned whether the novel SNP and/or the novel gene are associated with the disease.

8. Further Analysis

The mini-course “Correlating Disease Gene and Phenotype” elucidates the biochemical and structural basis for the function of the mutant proteins and their relationship to the particular phenotype.

9. Notes

Note 1

Online Mendelian Inheritance in Animals is a database of genes, inherited disorders and traits in animal species (other than human and mouse) authored by Professor Frank Nicholas of the University of Sydney, Australia, with help from many collaborators over the years.

Note 2

In addition to the human genome, a number of other genomes are available as BLAST databases. A complete list is available under the Genomes panel on the BLAST page (http://www.ncbi.nlm.nih.gov/BLAST/).

Note 3

MegaBlast also accepts a batch of query sequences. Each query sequence must have a unique identifier written on a separate line before the sequence and the identifier line should begin with a greater than (“>”) sign. For example,

>identifier1

atgcggctta…

>identifier2

ttggcatactg…

>identifier3

ggatcgatcag…
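A batch query in this format can also be assembled programmatically. The following is a minimal sketch with a hypothetical helper name and made-up placeholder sequences; it only builds the text to paste into the query box and does not contact any NCBI service.

```python
# Sketch: build a multi-sequence query in the ">identifier" format described
# above. Identifiers and sequences are placeholders, not real ESTs.

def to_fasta_batch(records):
    """records: list of (identifier, sequence) pairs with unique identifiers."""
    seen = set()
    lines = []
    for identifier, sequence in records:
        if identifier in seen:
            raise ValueError(f"duplicate identifier: {identifier}")
        seen.add(identifier)
        lines.append(f">{identifier}")  # identifier line starts with '>'
        lines.append(sequence)          # sequence goes on the following line
    return "\n".join(lines) + "\n"

batch = to_fasta_batch([
    ("identifier1", "atgcggctta"),
    ("identifier2", "ttggcatactg"),
])
print(batch, end="")
```

Checking for duplicate identifiers up front mirrors the requirement that each query sequence carry a unique identifier.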

Note 4

Since the human Build 36, NCBI also provides access to the previous assembly release (Build) as a BLAST database. More information about each build is provided in the release notes at http://www.ncbi.nlm.nih.gov/genome/guide/human/release_notes.html. You may choose to run the BLAST search against the previous build by using the appropriate option in the database field.

Note 5

In Subheading 5.1., there is only one hit to the database sequence. Multiple hits are possible for several reasons, such as similarity to other members of a gene family. For example, refer to Subheading 6.1. The query EST sequence shows similarity to two members of the globin family, HBB (encoding β-globin) and HBD (encoding δ-globin). The gene that produced the EST is most likely the one with the higher similarity, HBB in this case (Fig. 10).

Note 6

BLAST may align a single EST in two or more segments if the EST sequence spans two or more exons. For example, refer to Subheading 6.1. The query EST sequence aligns to two exons of both the HBB and HBD genes (Fig. 10).

Note 7

The orientation of the gene on the chromosome can also be discerned from the placement of the blue bars representing the gene structure with respect to the gray line. If the gene is on the forward strand, the blue bars are drawn to the right of the gray line (as for the HFE gene in Subheading 5.; see Fig. 3). If the gene is on the reverse strand, the blue bars are drawn to the left of the gray line (as for the HBB gene in Subheading 6.; see Fig. 12).

Note 8

If the gene of interest is on the reverse strand, change the Strand pull-down menu to minus and click on the “Change Region/Strand” option before displaying the sequence (refer to Subheading 6.2.).
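Selecting the minus strand displays the reverse complement of the forward-strand sequence for the region. A minimal sketch of that operation follows; the function name reverse_complement and the short input sequences are illustrative assumptions, not output from the Map Viewer.

```python
# Translation table mapping each base to its complement (case preserved).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


# Toy forward-strand fragment, not taken from any real gene.
forward = "ATGC"
minus_strand_view = reverse_complement(forward)
```

Applying the function twice returns the original sequence, which is a quick sanity check when switching between strands.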

Note 9

The user can also adjust the nucleotide locations to download the upstream or downstream sequence by using the “adjust by” and “Change Region/Strand” options.

Note 10

When a single nucleotide polymorphism is submitted to dbSNP, an identifier with prefix “ss” is assigned to the entry. It is possible that multiple laboratories may submit information on the same SNP as new techniques are developed to assay variation, or new populations are typed for frequency information. Each of these SNP entries is assigned a unique identifier with prefix “ss”. When two or more submitted SNP records refer to the same location in the genome, a Reference SNP record is created, with an “rs” prefix on the identifier, by NCBI during periodic “builds” of the SNP database. This reference record provides a summary list of submitted “ss” records in dbSNP.

For example, the Reference SNP record from Subheading 5., rs1800562, contains three submitted SNP records in dbSNP build 125: ss2420669, ss5586582, and ss24365242. The Reference SNP record from Subheading 6., rs334, contains six submitted SNP records in dbSNP build 125: ss335, ss1536049, ss4397657, ss4440139, ss16249026, and ss24811263.
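The relationship between submitted and reference records can be pictured as grouping submissions by their mapped location. The sketch below is only an illustration of that idea and not the dbSNP build procedure: the function name cluster_submissions and the location labels "HFE_site" and "HBB_site" are hypothetical placeholders, while the ss and rs identifiers are those listed above.

```python
from collections import defaultdict


def cluster_submissions(submissions):
    """Group submitted (ss) identifiers by the location they map to."""
    clusters = defaultdict(list)
    for ss_id, location in submissions:
        clusters[location].append(ss_id)
    return dict(clusters)


# (ss identifier, placeholder location label); labels are hypothetical.
submissions = [
    ("ss2420669", "HFE_site"),
    ("ss5586582", "HFE_site"),
    ("ss24365242", "HFE_site"),
    ("ss335", "HBB_site"),
    ("ss1536049", "HBB_site"),
    ("ss4397657", "HBB_site"),
    ("ss4440139", "HBB_site"),
    ("ss16249026", "HBB_site"),
    ("ss24811263", "HBB_site"),
]
clusters = cluster_submissions(submissions)

# Reference identifiers as reported in the text for dbSNP build 125.
reference_ids = {"HFE_site": "rs1800562", "HBB_site": "rs334"}
summary = {reference_ids[loc]: ss_ids for loc, ss_ids in clusters.items()}
```

Here summary maps each reference record to the list of submissions it summarizes, which is the role the “rs” record plays in dbSNP.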

Note 11

When the “variation” map is selected as the master map in the Map Viewer (Figs. 4 and 13), the user is presented with a graphical summary of several properties of the SNP, such as the quality of the SNP computed from mapping, location in the gene region, marker heterozygosity and validation information. For more information, refer to http://www.ncbi.nlm.nih.gov/SNP/get_html.cgi?whichHtml=verbose.

For example, the green triangle next to rs1800562 indicates that this marker is mapped to a unique position in the genome (Fig. 4). The highlighted L, T, and C symbols indicate that the SNP is in the locus, transcript, and coding regions of the gene. For the SNP rs807209, only the L symbol is highlighted, indicating that the SNP is in the locus region of the gene but not in the transcript or the coding region. In this report, a marker is considered to be in the locus region if its position on the sequence map falls within a 2000-base interval 5′ of the most 5′ feature of the gene (CDS, mRNA, gene) or within a 500-base interval 3′ of the most 3′ feature of the gene.
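The locus definition amounts to a strand-aware interval test. The following sketch makes two assumptions that the report does not spell out: positions inside the gene itself are also counted as locus, and the interval boundaries are inclusive. The function name in_locus_region and all coordinates are hypothetical.

```python
def in_locus_region(pos, gene_start, gene_end, strand,
                    upstream=2000, downstream=500):
    """Return True if pos lies in the gene or its flanking locus interval.

    gene_start <= gene_end are chromosome coordinates of the outermost
    gene features. For a minus-strand gene the 5' end is at gene_end,
    so the 2000-base upstream flank extends past gene_end instead.
    """
    if strand == "+":
        low, high = gene_start - upstream, gene_end + downstream
    elif strand == "-":
        low, high = gene_start - downstream, gene_end + upstream
    else:
        raise ValueError("strand must be '+' or '-'")
    return low <= pos <= high


# Hypothetical gene spanning positions 10000-20000.
plus_hit = in_locus_region(8000, 10000, 20000, "+")    # 2000 bases 5' of start
plus_miss = in_locus_region(7999, 10000, 20000, "+")   # one base too far
minus_hit = in_locus_region(22000, 10000, 20000, "-")  # 5' flank on minus strand
```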

Note 12

NCBI has additional human genome assemblies, such as the assembly submitted by Celera and the chromosome 7 assembly from the Center for Applied Genomics (TCAG). The assembly information is provided under the contig label heading in the Integrated Maps panel. On the Celera assembly, the SNP in the HFE gene in Subheading 5., rs1800562, is at nucleotide 25684223 on the contig NT_086686.1 (see Fig. 6). On the Celera assembly, the SNP in the HBB gene in Subheading 6., rs334, is at nucleotide 860802 on the contig NT_086780.1 (see Fig. 15).

Note 13

The GeneView panel shows the locations of the SNPs on the genomic assemblies with respect to the genes, their alternatively spliced mRNAs and encoded proteins. The view is color coded for quick identification of the location and to show whether the change is synonymous (not altering the amino acid translation) or nonsynonymous (altering the amino acid translation). For example, nonsynonymous SNPs are represented in red, synonymous in green and those in introns are in yellow. A link to the “Color Legend” is provided next to the Gene Model under the GeneView panel.
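The distinction between synonymous and nonsynonymous changes can be reproduced by translating the reference and variant codons with the standard genetic code. This is a minimal sketch rather than the method dbSNP uses: the function name classify_coding_change is hypothetical, and the codons are illustrative (GAG to GTG is a Glu-to-Val change of the kind discussed for HBB).

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_coding_change(ref_codon, alt_codon):
    """Label a codon change as 'synonymous' or 'nonsynonymous'."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    return "synonymous" if ref_aa == alt_aa else "nonsynonymous"


glu_to_val = classify_coding_change("GAG", "GTG")  # Glu -> Val
glu_to_glu = classify_coding_change("GAG", "GAA")  # Glu -> Glu
```

In the GeneView color scheme described above, the first change would be shown in red and the second in green.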

Note 14

Some OMIM entries report the allelic variants for the mature protein, whereas dbSNP reports variants for the precursor protein. Thus, for the same SNP, the amino acid numbering of the allelic variant may differ between the two databases. For example, refer to Subheading 6.6. OMIM reports allelic variants for the mature β-globin protein (after removal of the initiator methionine). Thus, the Glu7Val change reported in dbSNP is the same as the Glu6Val allelic variant reported in the OMIM database.
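The numbering difference is a fixed offset equal to the number of residues removed from the precursor. A minimal sketch of the conversion follows; the function name precursor_to_mature and the three-letter variant notation it parses are assumptions made for illustration, and the default offset of 1 applies only to cases such as β-globin where just the initiator methionine is removed.

```python
import re


def precursor_to_mature(variant, removed_residues=1):
    """Convert a variant such as 'Glu7Val' from precursor to mature numbering."""
    match = re.fullmatch(r"([A-Za-z]{3})(\d+)([A-Za-z]{3})", variant)
    if match is None:
        raise ValueError(f"unrecognized variant notation: {variant}")
    ref, position, alt = match.group(1), int(match.group(2)), match.group(3)
    mature_position = position - removed_residues
    if mature_position < 1:
        raise ValueError("residue is not present in the mature protein")
    return f"{ref}{mature_position}{alt}"


omim_style = precursor_to_mature("Glu7Val")
```

For proteins with a longer signal peptide or propeptide, removed_residues would need to be set to the length of the removed segment.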

Note 15

The OMIM report and thus its allelic variants list are manually derived from publications. dbSNP contains the SNPs reported by the submitters. Currently, a link is provided from dbSNP to OMIM if the amino acid number of the allelic variant in the OMIM report matches the number of the changed amino acid due to a SNP in dbSNP. Since the sources of the two databases, OMIM and dbSNP, are different, each may contain information not found in the other.

References

1. Kitts P. Genome assembly and annotation process. In: McEntyre J, Ostell J, editors. The NCBI Handbook. National Library of Medicine (US), NCBI; Bethesda, MD: 2002–2005.
2. Wheeler DL, Barrett T, Benson DA, et al. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2006;34:D173–D180. [PMC free article: PMC1347520] [PubMed: 16381840]
3. Altschul SF, Madden TL, Schaffer AA, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–3402. [PMC free article: PMC146917] [PubMed: 9254694]
4. Madden T. The BLAST sequence analysis tool. In: McEntyre J, Ostell J, editors. The NCBI Handbook. National Library of Medicine (US), NCBI; Bethesda, MD: 2002–2005.
5. Dombrowski SM, Maglott M. Using the Map Viewer to Explore Genomes. In: McEntyre J, Ostell J, editors. The NCBI Handbook. National Library of Medicine (US), NCBI; Bethesda, MD: 2002–2005.
6. Kitts A, Sherry S. The single nucleotide polymorphism database (dbSNP) of nucleotide sequence variation. In: McEntyre J, Ostell J, editors. The NCBI Handbook. National Library of Medicine (US), NCBI; Bethesda, MD: 2002–2005.
7. Maglott D, Amberger JS, Hamosh A. Online Mendelian Inheritance in Man (OMIM): a directory of human genes and genetic disorders. In: McEntyre J, Ostell J, editors. The NCBI Handbook. National Library of Medicine (US), NCBI; Bethesda, MD: 2002–2005.
8. Zhang Z, Schwartz S, Wagner L, Miller W. A greedy algorithm for aligning DNA sequences. J Comput Biol. 2000;7:203–214. [PubMed: 10890397]

Bookshelf ID: NBK1735
