NCICB: Cancer Gene Data Curation Project

Cancer Gene Data Curation Project

Poster: Towards a Comprehensive Catalog of Gene-disease and Gene-drug Relationships in Cancer.

The Cancer Gene Data Curation Project is an ongoing effort to create a database of associations between genes and diseases, and between genes and drug compounds, derived from Medline abstracts. The project combines automatic text mining, semi-automatic verification, and manual validation and scoring of results. A pilot study involving 1000 genes was completed in March 2005 by the NCICB, ScenPro, and Biomax Solutions, Inc. Subsequent phases of this project, performed by Sophic Systems Alliance Inc. in partnership with Biomax Informatics AG, have resulted in the curation of a total of 4,658 genes. Initial phases of this project have focused on the annotation of genes with moderate sentence counts in Medline (<1000 sentences), to promote near-term discovery research. The completion phase of the project will include the curation of highly cited cancer genes, toward the final goal of a complete index of cancer gene-disease and cancer gene-compound relationships.

Data from all phases of this project have been integrated into caBIO and are available at: http://cabioapi.nci.nih.gov/cabio41. These data are also available for bulk download via the links below.

Phase I (Pilot)

Gene Selection Methodology (1000 gene pilot)

The scope of the pilot was 1000 genes selected from the set of 4800 genes with at least one true sentence (i.e., a sentence indicating a Gene->Disease association) from a semiautomated annotation process. The 1000 genes were selected using a five-step approach designed to test whether it was possible to handle genes with differing levels of information. The set therefore contains some ‘known’ genes but focuses on high-value data: specifically, genes that first appeared in the last couple of years and those with moderate sentence counts. The selection criteria were:

100 genes that were selected because of interest from NCICB personnel
300 randomly selected genes with first date in 2003-2004
250 randomly selected ‘older’ genes with 1-10 sentences
250 randomly selected ‘older’ genes with 11-100 sentences
100 randomly selected ‘older’ genes with 101 or more sentences
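The five groups above add up to the 1000-gene pilot (100 + 300 + 250 + 250 + 100). The following is a minimal sketch of how such a quota-based random selection could be reproduced; it is not the project's actual procedure. The record fields (`id`, `first_year`, `sentence_count`), the function names, and the rule that any gene not first dated 2003-2004 counts as "older" are all assumptions made for illustration.

```python
import random

# Group sizes as listed above; they sum to the 1000-gene pilot.
PILOT_QUOTAS = {
    "ncicb_interest": 100,
    "first_date_2003_2004": 300,
    "older_1_10_sentences": 250,
    "older_11_100_sentences": 250,
    "older_101_plus_sentences": 100,
}


def assign_stratum(gene):
    """Map a gene record to one of the randomly sampled groups.

    Assumption: any gene whose first date is not 2003-2004 is 'older'.
    """
    if gene["first_year"] in (2003, 2004):
        return "first_date_2003_2004"
    count = gene["sentence_count"]
    if count <= 10:
        return "older_1_10_sentences"
    if count <= 100:
        return "older_11_100_sentences"
    return "older_101_plus_sentences"


def select_pilot_genes(genes, interest_ids, quotas=PILOT_QUOTAS, seed=0):
    """Take the hand-picked genes, then sample each remaining group to its quota."""
    rng = random.Random(seed)  # fixed seed so the selection is repeatable
    # Hand-picked genes are taken as given; their quota is not sampled.
    selected = [g for g in genes if g["id"] in interest_ids]

    # Pool the remaining genes by group so no gene is picked twice.
    pools = {}
    for gene in genes:
        if gene["id"] not in interest_ids:
            pools.setdefault(assign_stratum(gene), []).append(gene)

    for name, quota in quotas.items():
        if name == "ncicb_interest":
            continue
        pool = pools.get(name, [])
        # Sample without replacement; a short pool yields fewer genes.
        selected.extend(rng.sample(pool, min(quota, len(pool))))
    return selected
```

Sampling without replacement inside each group keeps the groups disjoint, and the fixed seed makes a selection reproducible for later review.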

Full Data Downloads (1000 gene pilot)

Cancer Gene Database: Download

README

Gene Drug Database: Download

README

Aggregated Data Downloads (1000 gene pilot)

Aggregated Gene-Disease Data (Gene Centric/No Role Codes): Download

README

Aggregated Gene-Disease Data (Disease Centric/No Role Codes): Download

README

Aggregated Gene-Disease Data (Gene Centric/Role Codes): Download

README

Phase II

The second phase consists of an additional 1500 genes that have been fully annotated. The selection criteria for this set emphasized recently discovered gene-disease associations, with the remainder chosen to emphasize genes with moderate numbers of citations (between 10 and 100). Overall, this set of genes had a mean of 200 sentences with significant co-occurrence of gene and disease or cancer terms.

This data set is reported in XML format for the convenience of researchers who wish to parse the data without having to deal with delimiter problems. An XSD is linked from each data set.
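Because the element names are defined by the XSD and are not listed here, a useful first step with a downloaded file is to survey which elements it contains. The sketch below streams a file with Python's standard library and counts element tags; it assumes nothing about the schema. The function name and the tiny in-memory document are illustrative only and are not project data.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def count_element_tags(source):
    """Stream an XML file (path or binary file object) and count each element tag."""
    counts = Counter()
    # 'end' events fire once an element has been fully read.
    for _event, elem in ET.iterparse(source, events=("end",)):
        counts[elem.tag] += 1
        elem.clear()  # drop parsed content so large files stay within memory
    return counts


# Tiny in-memory document standing in for a downloaded file (not real project data).
demo = io.BytesIO(b"<root><a/><a/><b>x</b></root>")
demo_counts = count_element_tags(demo)
print(demo_counts.most_common())
```

Passing the path of a downloaded file instead of `demo` gives a quick inventory of tags to compare against the XSD before writing a full parser.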

Full Data Downloads (1500 genes)

Cancer Gene Database: Download Part I Download Part II

README

Gene Drug Database: Download Part I Download Part II

README

Phases III and IV

These phases consist of 2675 manually annotated cancer genes with moderate (<1000) sentence counts.

This data set is reported in XML format for the convenience of researchers who wish to parse the data without having to deal with delimiter problems. An XSD is available with each data set.

Full Data Download (All Phases)

Gene Drug Database: Download
Cancer Gene Database: Download
