National Cancer Institute Center for Bioinformatics
Cancer Gene Data Curation Project
Poster: Towards a Comprehensive Catalog of Gene-disease and Gene-drug Relationships in Cancer.

The Cancer Gene Data Curation Project is an ongoing effort to create a database of gene-disease and gene-drug compound associations derived from Medline abstracts. The project combines automatic text mining, semi-automatic verification, and manual validation and scoring of results. A pilot study involving 1000 genes was completed in March 2005 by the NCICB, ScenPro, and Biomax Solutions, Inc. Subsequent phases of this project, performed by Sophic Systems Alliance Inc. in partnership with Biomax Informatics AG, have resulted in the curation of a total of 4,658 genes. Initial phases have focused on annotating genes with moderate sentence counts in Medline (<1000 sentences) to promote near-term discovery research. The completion phase will include the curation of highly cited cancer genes, toward the final goal of a complete index of cancer gene-disease and cancer gene-compound relationships.

Data from all phases of this project have been integrated into caBIO and are available at: http://cabioapi.nci.nih.gov/cabio41. These data are also available for bulk download via the links below.

Phase I (Pilot)

Gene Selection Methodology (1000 gene pilot)
The scope of the pilot was 1000 genes selected from the set of 4800 genes with at least one true sentence (i.e., a sentence indicating a Gene->Disease association) from a semiautomated annotation process. The 1000 genes were selected using a five-step approach designed to test whether it was possible to handle genes with differing levels of information. The set therefore contains some ‘known’ genes but focuses on high-value data: specifically, genes that first appeared in the last couple of years and genes with moderate sentence counts. The selection criteria were:
  • 100 genes that were selected because of interest from NCICB personnel
  • 300 randomly selected genes with first date in 2003-2004
  • 250 randomly selected ‘older’ genes with 1-10 sentences
  • 250 randomly selected ‘older’ genes with 11-100 sentences
  • 100 randomly selected ‘older’ genes with 101 or more sentences
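The five-step selection above amounts to a stratified sample with fixed quotas. The following is a minimal sketch of that idea, not the project's actual procedure. The record fields (`id`, `first_year`, `sentence_count`), the function names, and the rule that any gene not first seen in 2003-2004 counts as ‘older’ are all assumptions made for illustration.

```python
import random

# Quotas for the four randomly sampled strata, taken from the list above.
# The 100 genes of NCICB interest are hand-picked, so they are handled separately.
RANDOM_QUOTAS = {
    "recent_2003_2004": 300,
    "older_1_10": 250,
    "older_11_100": 250,
    "older_101_plus": 100,
}


def assign_stratum(gene):
    """Map a gene record (hypothetical fields) to one of the random strata."""
    if gene["first_year"] in (2003, 2004):
        return "recent_2003_2004"
    # Assumption: every other gene is treated as 'older' and binned by sentence count.
    count = gene["sentence_count"]
    if count <= 10:
        return "older_1_10"
    if count <= 100:
        return "older_11_100"
    return "older_101_plus"


def select_pilot_genes(candidates, hand_picked_ids, seed=0):
    """Return the hand-picked genes plus a quota-sized random sample per stratum."""
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    selected = [g for g in candidates if g["id"] in hand_picked_ids]

    # Bin the remaining candidates so that no gene is selected twice.
    pools = {name: [] for name in RANDOM_QUOTAS}
    for gene in candidates:
        if gene["id"] not in hand_picked_ids:
            pools[assign_stratum(gene)].append(gene)

    for name, quota in RANDOM_QUOTAS.items():
        # random.sample raises ValueError if a pool is smaller than its quota.
        selected.extend(rng.sample(pools[name], quota))
    return selected
```

Sampling each stratum separately fixes how many genes come from each level of information, which a single random draw over all 4800 candidates would not guarantee.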

Full Data Downloads (1000 gene pilot)
Cancer Gene Database:    Download    README
Gene Drug Database:    Download    README

Aggregated Data Downloads (1000 gene pilot)
Aggregated Gene-Disease Data (Gene Centric/No Role Codes):    Download    README
Aggregated Gene-Disease Data (Disease Centric/No Role Codes):    Download    README
Aggregated Gene-Disease Data (Gene Centric/Role Codes):    Download    README

Phase II

The second phase consists of an additional 1500 fully annotated genes. The selection criteria for this set emphasized recently discovered gene-disease associations; the remainder of the selected genes were those with moderate numbers of citations (between 10 and 100). Overall, the genes in this set had a mean of 200 sentences with significant co-occurrence of gene and disease or cancer terms.

This data set is reported in XML format for the convenience of researchers who wish to parse the data without concern over problems with delimiters. An XSD is available by link from each data set.
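As a rough illustration of why XML sidesteps delimiter problems, the sketch below parses a small made-up record with Python's standard `xml.etree.ElementTree`. The element names (`GeneEntry`, `GeneSymbol`, `Sentence`, `DiseaseTerm`, `Statement`) and the sample content are hypothetical placeholders; the real structure is defined by the XSD that accompanies each data set.

```python
import xml.etree.ElementTree as ET

# Made-up sample record. Element names are placeholders, NOT the project's schema.
SAMPLE = """<GeneEntryCollection>
  <GeneEntry>
    <GeneSymbol>GENE1</GeneSymbol>
    <Sentence>
      <DiseaseTerm>example disease</DiseaseTerm>
      <Statement>Free text with a comma, a semicolon; and a | pipe.</Statement>
    </Sentence>
  </GeneEntry>
</GeneEntryCollection>"""


def iter_associations(xml_text):
    """Yield (gene symbol, disease term, sentence text) tuples from the XML."""
    root = ET.fromstring(xml_text)
    for entry in root.iter("GeneEntry"):
        symbol = entry.findtext("GeneSymbol")
        for sentence in entry.iter("Sentence"):
            yield (
                symbol,
                sentence.findtext("DiseaseTerm"),
                # The sentence text comes back intact, whatever punctuation it contains.
                sentence.findtext("Statement"),
            )


for row in iter_associations(SAMPLE):
    print(row)
```

Because each field sits in its own element, commas or pipes inside a sentence cannot be mistaken for column separators. For the full downloads, `ET.iterparse` may be a better fit than loading the whole file into memory.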

Full Data Downloads (1500 genes)
Cancer Gene Database:    Download Part I    Download Part II    README
Gene Drug Database:    Download Part I    Download Part II    README

Phases III and IV

These phases consist of 2675 manually annotated cancer genes with moderate (<1000) sentence counts.

This data set is reported in XML format for the convenience of researchers who wish to parse the data without concern over problems with delimiters. An XSD is available with each data set.

Full Data Download (All Phases)
Gene Drug Database:   Download
Cancer Gene Database:   Download
