Genetic Associations in Biomedical Informatics

In collaboration with NIA, HPCIO develops and enhances tools for the archival, retrieval, and mining of genetic association study data. The Genetic Association Database (GAD) is an archive of human genetic association studies of complex diseases and disorders. GAD enables scientists to query association data systematically and to integrate association data with other molecular databases. Study data are recorded in the context of official human gene nomenclature, with additional molecular reference numbers and links. The goal of this project is to collect all published genetic association study data and to allow users to rapidly identify medically relevant polymorphisms from the large volume of polymorphism and mutational data, in the context of standardized nomenclature.

PubMatrix SE is a Web-based text-mining tool for MEDLINE citations. It applies natural language processing and statistical methods to biomedical literature text to estimate the strength of associations among various entities, including genes and diseases. The results are presented in a matrix format, which makes large amounts of text data easier to interpret and can assist in microarray studies.
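The matrix idea can be illustrated with a minimal sketch. It is not the PubMatrix SE method: it swaps the natural language processing and statistical scoring for plain word matching and raw co-occurrence counts, and the function name, term lists, and citations are all invented for illustration.

```python
from collections import Counter
from itertools import product


def cooccurrence_matrix(citations, genes, diseases):
    """Count the citations that mention each (gene, disease) pair.

    Simplifying assumption: a term is "mentioned" if it appears as a
    whole word after lowercasing; no real NLP is performed.
    """
    counts = Counter()
    for text in citations:
        words = set(text.lower().replace(",", " ").replace(".", " ").split())
        found_genes = [g for g in genes if g.lower() in words]
        found_diseases = [d for d in diseases if d.lower() in words]
        for pair in product(found_genes, found_diseases):
            counts[pair] += 1
    # Rows follow the order of `genes`, columns the order of `diseases`.
    return [[counts[(g, d)] for d in diseases] for g in genes]


# Toy citations, invented purely for illustration.
citations = [
    "APOE variants and Alzheimer risk in older adults.",
    "APOE and Alzheimer pathology: a review.",
    "COMT polymorphism in schizophrenia.",
]
genes = ["APOE", "COMT"]
diseases = ["alzheimer", "schizophrenia"]

matrix = cooccurrence_matrix(citations, genes, diseases)
for gene, row in zip(genes, matrix):
    print(gene, row)
```

Each cell holds the number of toy citations in which the row gene and the column disease appear together. A real tool would replace the raw count with a statistical association score.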

Caption: A simple search of positive associations for the disease schizophrenia. Fields in this view include Official Gene Symbol, Disease Phenotype, Disease Class, Chromosome, Chromosome Band, Genomic DNA Position, P Value, Reference, PubMed ID, and Links to other gene-related resources.

Caption: Search results for “Candidate Genes for ALZHEIMER DISEASE” from the SNPs3D web site. GAD data have been integrated in the search results. Each “Y” represents one positive association record and each “N” represents one negative association record in GAD.

Caption: GAD links are provided by multiple NCBI applications, including Entrez. Users can access GAD data by following the LinkOut resources for each gene.

In collaboration with NIA, HPCIO continually improves the quality and quantity of GAD. GAD data underwent several major updates in FY2007 to fill in missing data and correct errors. The total number of records has increased from roughly 28,000 to more than 40,000. In collaboration with the University of California Santa Cruz, we created a GAD track on the UCSC browser system. This allows integration of large-scale genetic disease data with molecular annotation such as SNPs and RNA splicing, and facilitates integration with the genomes of other model organisms.

A copy of the MEDLINE citations was imported to a local database. The gene normalization algorithm used in PubMatrix SE was entered into the BioCreative II challenge, an international event to assess state-of-the-art text-mining techniques for bioinformatics. Our algorithm placed in the top half of the 20 participating groups. A standalone Web application, called GIANT, was developed to normalize gene mentions in free text. This work is part of a larger effort, involving a number of institutions in the U.S. and abroad, to build a meta-server that connects to various text-mining servers and provides a one-stop service to researchers in the bioinformatics community.
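For readers unfamiliar with the task, gene normalization maps a gene mention in text to a standard identifier such as an official gene symbol. The sketch below is a generic dictionary lookup with an arbitrary confidence value. It is not the GIANT or PubMatrix SE algorithm, and the lexicon, function names, and confidence scores are assumptions made only for illustration.

```python
import re

# Hypothetical synonym lexicon: normalized mention -> official gene symbol.
LEXICON = {
    "apoe": "APOE",
    "apolipoprotein e": "APOE",
    "comt": "COMT",
    "catechol o methyltransferase": "COMT",
}


def normalize_key(mention):
    """Lowercase the mention and collapse punctuation and spacing."""
    return " ".join(re.sub(r"[^a-z0-9]+", " ", mention.lower()).split())


def normalize_gene(mention):
    """Return (official_symbol, confidence), or (None, 0.0) if unknown.

    The confidence values are arbitrary illustrative choices: 1.0 when
    the mention already is the official symbol, 0.8 when it matched
    only after rule-based normalization.
    """
    symbol = LEXICON.get(normalize_key(mention))
    if symbol is None:
        return None, 0.0
    return symbol, (1.0 if mention == symbol else 0.8)


for mention in ["COMT", "Apolipoprotein-E", "unknown factor"]:
    print(mention, "->", normalize_gene(mention))
```

A production system would draw its lexicon from a curated nomenclature source and would estimate confidence from data rather than from fixed constants.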

In FY 2008, HPCIO will create new features to make integrating GAD data more convenient for outside biomedical databases. For example, GAD will add the UMLS Concept Unique Identifier (CUI) for each disease. Because GAD does not currently store UMLS CUIs, some users have to manually associate GAD diseases with UMLS CUIs before integrating GAD data with their own data. To maintain the quality and consistency of GAD data, we shall make UMLS CUIs available in the future.
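The manual association step that users currently face amounts to a lookup from disease names to UMLS concept identifiers. A minimal sketch follows, assuming a hypothetical record layout and using placeholder identifiers rather than real UMLS values.

```python
def attach_concept_ids(records, concept_ids):
    """Split records into those with a known concept ID and those without.

    `records` is a list of dicts with a "disease" field (a hypothetical
    layout, not the actual GAD schema). Disease names are matched after
    lowercasing and collapsing whitespace.
    """
    linked, unmatched = [], []
    for record in records:
        key = " ".join(record["disease"].lower().split())
        concept_id = concept_ids.get(key)
        if concept_id is None:
            unmatched.append(record)  # still needs manual curation
        else:
            linked.append({**record, "concept_id": concept_id})
    return linked, unmatched


# Placeholder identifiers, NOT real UMLS values.
concept_ids = {
    "schizophrenia": "PLACEHOLDER-1",
    "alzheimer disease": "PLACEHOLDER-2",
}
records = [
    {"gene": "COMT", "disease": "Schizophrenia"},
    {"gene": "APOE", "disease": "ALZHEIMER  DISEASE"},
    {"gene": "GENE_X", "disease": "unlisted disorder"},
]

linked, unmatched = attach_concept_ids(records, concept_ids)
print(len(linked), "linked;", len(unmatched), "left for manual review")
```

Storing the identifier in the database itself would remove the need for each user to maintain such a mapping separately.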

Albino Bacolla and his colleagues from Texas A&M University System Health Science Center used GAD data in their data-mining approach when they conducted the first systematic search for the longest uninterrupted R•Y tracts in the human genome and tested specific hypotheses by employing experimental methodologies in silico. They showed that long R•Y tracts constitute mutational hotspots and are likely to have played a key role in genome plasticity and evolution.

Lau, W.W. and Johnson, C.A., “Rule-based gene normalization with a statistical and heuristic confidence measure,” in Proc of the Second BioCreative Challenge Evaluation Workshop. 2007. Madrid, Spain.

Lau, W.W. and Johnson C.A. “Rule-based Human Gene Normalization in Biomedical Text with Confidence Estimation,” in Proc of Comput Syst Bioinformatics Conf. 2007. San Diego, CA.

Kevin G Becker, Yonqing Zhang, Narmada Shenoy, Kayla E Smith, Donna Karolchik, Fan Hsu, S Alex Wang, “GAD and GADview: a genomic view of common human disease,” HUGO's 12th Human Genome Meeting, Montreal, Canada, Mon 21-Thu 24 May 2007

Becker, K.G., Barnes, K.C., Bright, T.J. & Wang, S.A.
"The Genetic Association Database" Nature Genetics 36: 431-432 (2004)

Sun G, Lau W, Wang A, Shenoy N, Becker K, Cheung H "Ranking and Presenting Gene-Disease Associations from Biomedical Literature." Poster. 2006 Summer Research Program Student Poster Day.

Han A, Kim WY, Park SM, “SNP2NMD: a database of human single nucleotide polymorphisms causing nonsense-mediated mRNA decay,” Bioinformatics. 2007 Feb 1;23(3):397-9. Epub 2006 Nov 22.

Frodsham AJ, Higgins JP, “Online genetic databases informing human genome epidemiology,” BMC Med Res Methodol. 2007 Jul 4;7:31

Nobuhara Y, Usuku K, Saito M, Izumo S, Arimura K, Bangham CR, Osame M. “Genetic variability in the extracellular matrix protein as a determinant of risk for developing HTLV-I-associated neurological disease,” Immunogenetics. 2006 Jan;57(12):944-52. Epub 2006 Jan 10.

Lussier YA, Liu Y., “Computational approaches to phenotyping: high-throughput phenomics,” Proc Am Thorac Soc. 2007 Jan;4(1):18-25.

Jegga AG, Gowrisankar S, Chen J, Aronow BJ. “PolyDoms: a whole genome database for the identification of non-synonymous coding SNPs with the potential to impact disease,” Nucleic Acids Res. 2007 Jan;35(Database issue):D700-6. Epub 2006 Nov 16.

Bacolla A, Collins JR, Gold B, Chuzhanova N, Yi M, Stephens RM, Stefanov S, Olsh A, Jakupciak JP, Dean M, Lempicki RA, Cooper DN, Wells RD, “Long homopurine*homopyrimidine sequences are characteristic of genes expressed in brain and the pseudoautosomal region,” Nucleic Acids Res. 2006 May 19;34(9):2663-75. Print 2006.

Anil G. Jegga, Jing Chen, Sivakumar Gowrisankar, Mrunal A. Deshmukh, RangaChandra Gudivada, Sue Kong, Vivek Kaimal, and Bruce J. Aronow, “GenomeTrafac: a whole genome resource for the detection of transcription factor binding site clusters associated with conventional and microRNA encoding genes conserved between mouse and human gene orthologs,” Nucleic Acids Res. 2007 January; 35(Database issue): D116–D121.

Simon N. Twigger, Mary Shimoyama, Susan Bromberg, Anne E. Kwitek, Howard J. Jacob, and the RGD Team, “The Rat Genome Database, update 2007—Easing the path from disease to data and back again,” Nucleic Acids Res. 2007 January; 35(Database issue): D658–D662.

Number of records in the database:	40,339
Number of hits:	1,238,655
Number of unique visitors:	49,494
Number of whole database downloads:	420
Number of direct links:	207,200
Gene Normalization Accuracy:	76.2%