CGAP-GAI Identified Variation in Genes: SNP Finder:
Genetic Annotation Initiative Information:

CHLC: Cooperative Human Linkage Center integrated maps, including SNP locations

What is the Genetic Annotation Initiative (GAI)?
The Genetic Annotation Initiative extends the utility of Cancer Genome Anatomy Project (CGAP) resources. More specifically, it "annotates" the genes being discovered as part of CGAP with variation information so that they are useful in "genetic" studies.

How is the GAI finding genetic variation?
The GAI is using multiple approaches to find variation in genes. In one component it is performing sequencing (actually resequencing) of genes that are of interest to cancer researchers. It is called resequencing because one sequences the same region multiple times in different individuals starting from a known sequence.

The GAI is also using "data mining" to find variation in genes. More than a million gene transcripts (ESTs, or Expressed Sequence Tags) exist in the public domain. These tags come from sequencing different individuals, so the differences present in these individuals' genes are captured in the ESTs. The GAI has developed software tools to find these differences and to separate the true variants from errors in the data (see Nature Genetics). You can use these same tools by clicking on the SNP FINDER. Experimental studies have shown that these "candidate" SNPs identify "common" variation more than 75% of the time when stringent scoring conditions are used (SNP scores of 0.99 or greater).

The GAI used NCBI's UniGene clusters as a starting point. All clusters with raw data for five or more sequences were used. More than 20,000 such clusters met this criterion. From these clusters, more than 10,000 high-confidence SNPs were identified.
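The two filters described above (five or more sequences with raw data per cluster, and a stringent SNP score of 0.99 or greater) can be sketched in a few lines of Python. This is only an illustration: the `Cluster` and `CandidateSNP` records and their field names are hypothetical and do not reflect the GAI's actual software or data formats.

```python
from dataclasses import dataclass

MIN_SEQUENCES = 5     # clusters need raw data for five or more sequences
MIN_SNP_SCORE = 0.99  # stringent scoring condition quoted in the text

@dataclass
class Cluster:
    cluster_id: str       # hypothetical identifier for a UniGene-style cluster
    n_raw_sequences: int  # number of member sequences that have raw data

@dataclass
class CandidateSNP:
    cluster_id: str  # cluster in which the candidate was predicted
    position: int    # position within the cluster assembly
    score: float     # SNP score between 0 and 1

def eligible_clusters(clusters):
    """Keep only clusters with enough raw-data sequences to be mined."""
    return [c for c in clusters if c.n_raw_sequences >= MIN_SEQUENCES]

def high_confidence(candidates, clusters):
    """Keep candidates from eligible clusters that pass the score threshold."""
    eligible_ids = {c.cluster_id for c in eligible_clusters(clusters)}
    return [s for s in candidates
            if s.cluster_id in eligible_ids and s.score >= MIN_SNP_SCORE]
```

Lowering `MIN_SNP_SCORE` in this sketch would admit more candidates at the cost of more false positives.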

Candidate SNP? Validated SNP? Confirmed SNP?
Candidate SNPs are predicted from sequence data by computer tools. Validated SNPs are candidate SNPs that have been observed in an experiment. Within the GAI, eight or fewer individuals are tested for the occurrence of the SNP. Confirmed SNPs have been tested in a minimum of five CEPH families for Mendelian transmission and placed in genetic reference maps.
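For readers who track SNPs in their own records, the three categories can be written as a small decision rule. The sketch below is one way to encode the definitions above; the function name and its parameters are hypothetical, and it assumes that a confirmed SNP has also been validated.

```python
from enum import Enum

class SNPStatus(Enum):
    CANDIDATE = "candidate"  # predicted from sequence data by computer tools
    VALIDATED = "validated"  # observed in an experiment
    CONFIRMED = "confirmed"  # tested in CEPH families and placed on a map

MIN_CEPH_FAMILIES = 5  # minimum number of CEPH families stated in the text

def classify(observed_in_experiment, ceph_families_tested=0,
             on_reference_map=False):
    """Return the status implied by the evidence gathered so far.

    Assumption: confirmation builds on validation, so an SNP that has not
    been observed experimentally stays a candidate.
    """
    if not observed_in_experiment:
        return SNPStatus.CANDIDATE
    if ceph_families_tested >= MIN_CEPH_FAMILIES and on_reference_map:
        return SNPStatus.CONFIRMED
    return SNPStatus.VALIDATED
```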

What is an SNP score?
An SNP score measures the confidence with which an SNP is identified within a given collection of sequence data. It is calculated using the quality scores obtained when analyzing raw sequence data. The quality scores assess the chance that a given nucleotide is determined accurately. The SNP score is a formal statistical test measuring whether more than one nucleotide is likely at a given location.
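To make the idea concrete, here is a simplified Python sketch of how per-base quality scores could be combined into a score between 0 and 1. It is not the GAI's published statistic. It assumes Phred-style qualities (Q = -10 log10 of the error probability), evenly spread miscalls, and illustrative values for the prior chance of a polymorphism and for the minor-allele fraction. It compares two explanations for one aligned position: a single true nucleotide with sequencing errors, or two true nucleotides.

```python
from collections import Counter

def error_prob(quality):
    """Assumed Phred-style conversion: Q = -10 * log10(P_error)."""
    return 10 ** (-quality / 10)

def base_likelihood(observed, true_base, err):
    """P(observed call | true base), errors spread evenly over 3 other bases."""
    return 1 - err if observed == true_base else err / 3

def snp_score(column, prior=0.001, minor_fraction=0.2):
    """Posterior probability that a position carries two nucleotides.

    column: list of (base, quality) pairs, one per sequence at this position.
    prior and minor_fraction are illustrative assumptions, not GAI values.
    """
    top_two = Counter(base for base, _ in column).most_common(2)
    if len(top_two) < 2:
        return 0.0  # only one nucleotide observed: no evidence of variation
    major, minor = top_two[0][0], top_two[1][0]
    like_one = 1.0  # likelihood if every sequence truly carries `major`
    like_two = 1.0  # likelihood if sequences carry `major` or `minor`
    for base, quality in column:
        err = error_prob(quality)
        p_major = base_likelihood(base, major, err)
        p_minor = base_likelihood(base, minor, err)
        like_one *= p_major
        like_two *= (1 - minor_fraction) * p_major + minor_fraction * p_minor
    weighted = prior * like_two
    return weighted / (weighted + (1 - prior) * like_one)
```

In this sketch, two high-quality G calls among three high-quality A calls score above 0.99, while a single low-quality G among four A calls scores close to zero, because a sequencing error explains it more cheaply than a second allele.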

Why don't all candidate SNPs turn out to be variants?
There are many reasons. Reverse transcriptase errors introduced when the cDNA libraries were created can masquerade as true polymorphisms. The validation experiments also detect only common variants (only 8 individuals have been tested). It has yet to be determined whether the candidates not seen in these individuals are variants that occur at a lower frequency. The cDNA libraries are also sampled from an eclectic collection of individuals (mixed geographic origin, various disease states), whereas the current validation individuals are of European ancestry. As technology permits, the GAI will extend these characterizations to clinically and epidemiologically defined populations. It is hoped that multiple investigators examining many differently defined populations will benefit from (and utilize) the GAI distributed SNP outcomes.

Why don't I find some known gene variants?
In order to be found by this process the variant must be common. The data mining algorithm requires only five sequences with raw data. Many of the SNPs described in the literature are not common enough to be observed in so little data. Moreover, the SNP score used is very conservative. Additional known variants are often observed when the SNP score threshold is reduced.
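A rough calculation shows why rare variants are missed. The Python sketch below estimates the chance that both alleles of a two-allele SNP appear at least once among a handful of sequences. It assumes each sequence is an independent draw from the population, and it ignores sequencing error and the SNP score threshold, so it gives an upper bound on detection rather than a prediction of the GAI's results.

```python
def prob_both_alleles_seen(minor_freq, n_sequences=5):
    """Chance that n independent sequences include each allele at least once.

    Subtracts the two ways of seeing a single allele only: every sequence
    carries the major allele, or every sequence carries the minor allele.
    """
    major_freq = 1 - minor_freq
    return 1 - major_freq ** n_sequences - minor_freq ** n_sequences

if __name__ == "__main__":
    for freq in (0.01, 0.05, 0.30):
        print(f"minor allele frequency {freq:.2f}: "
              f"{prob_both_alleles_seen(freq):.2f}")
```

With five sequences, a variant at 1% frequency would show both alleles only about 5% of the time, while one at 30% frequency would do so roughly 83% of the time.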

If I don't find my gene does it mean it has no variant forms?
Not really. Only ~20,000 gene clusters were examined. Even among those examined, only common variation will be observed. Moreover, many of the genes have been surveyed only incompletely: EST data commonly cover only the extreme 3' and 5' regions of genes. The SNP viewing tools permit one to determine the degree of coverage of each gene examined and the number of libraries included in the assembly.

Where do we go from here?
Are we done? We have really just begun! The CGAP's Tumor Gene Index project is generating more than 10,000 ESTs per week. This should continue to feed the data mining identification of candidate SNPs. The GAI is moving to rapidly validate, then confirm, these candidates. Experiments are also being conducted to assess the accuracy of lower SNP scores. The GAI is also working with technology partners to assess new methods to cost-effectively identify and characterize SNPs. Stay tuned!


For additional information on CGAP, contact
Dr. Robert Strausberg, 301-496-1550.

If you have comments or questions on this Web Site, contact Dr. Ken Buetow, 301-435-8953.