Poster: Towards a Comprehensive Catalog of Gene-disease and Gene-drug Relationships in Cancer.
The Cancer Gene Data Curation Project is an ongoing effort to create a database of gene-disease and gene-drug compound associations derived from Medline abstracts. The project combines automatic text mining, semi-automatic verification, and manual validation and scoring of results. A pilot study involving 1000 genes was completed in March 2005 by the NCICB, ScenPro, and Biomax Solutions, Inc. Subsequent phases of this project, performed by Sophic Systems Alliance Inc. in partnership with Biomax Informatics AG, have resulted in the curation of a total of 4,658 genes. Initial phases of this project have focused on the annotation of genes with moderate sentence counts in Medline (<1000 sentences), to promote near-term discovery research. The completion phase of the project will include the curation of highly cited cancer genes, toward the final goal of a complete index of cancer gene-disease and cancer gene-compound relationships.
Data from all phases of this project have been integrated into caBIO and are available at: http://cabioapi.nci.nih.gov/cabio41. These data are also available for bulk download via the links below.
Phase I (Pilot)
The scope of the pilot was 1000 genes selected from the set of 4800 genes with at least one true
sentence (i.e. a sentence indicating a Gene->Disease association) from a semi-automated annotation
process. The 1000 genes were selected using a five-step approach designed to test
whether it was possible to handle genes with differing levels of information. The set therefore contains
some ‘known’ genes but focuses on high-value data, specifically genes that appeared in
the last couple of years and those with moderate sentence counts. The selection criteria were:
- 100 genes that were selected because of interest from NCICB personnel
- 300 randomly selected genes with first date in 2003-2004
- 250 randomly selected ‘older’ genes with 1-10 sentences
- 250 randomly selected ‘older’ genes with 11-100 sentences
- 100 randomly selected ‘older’ genes with 101 or more sentences
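The five-step selection above amounts to a stratified random sample with fixed quotas. The following is a minimal sketch only, not the project's actual procedure: the record layout (`symbol`, `first_year`, `sentences`), the function name `select_genes`, and the reading of ‘older’ as "first date before 2003" are all assumptions made for illustration.

```python
import random

# Quotas come from the list above. The predicates are assumptions:
# "older" is taken here to mean a first date before 2003.
PILOT_STRATA = [
    ("first date 2003-2004", lambda g: g["first_year"] in (2003, 2004), 300),
    ("older, 1-10 sentences", lambda g: g["first_year"] < 2003 and 1 <= g["sentences"] <= 10, 250),
    ("older, 11-100 sentences", lambda g: g["first_year"] < 2003 and 11 <= g["sentences"] <= 100, 250),
    ("older, 101+ sentences", lambda g: g["first_year"] < 2003 and g["sentences"] >= 101, 100),
]


def select_genes(genes, interest_symbols, strata=PILOT_STRATA, seed=0):
    """Return hand-picked genes plus a random sample from each stratum.

    genes: list of dicts with 'symbol', 'first_year', 'sentences' (hypothetical layout).
    interest_symbols: set of symbols chosen by hand (step 1 in the list above).
    """
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    selected = [g for g in genes if g["symbol"] in interest_symbols]
    taken = {g["symbol"] for g in selected}
    for _label, matches, quota in strata:
        # Only genes not already chosen are eligible for this stratum.
        pool = [g for g in genes if g["symbol"] not in taken and matches(g)]
        picked = rng.sample(pool, min(quota, len(pool)))
        selected.extend(picked)
        taken.update(g["symbol"] for g in picked)
    return selected
```

Excluding already selected genes from each pool keeps the strata disjoint, so the quotas add up to 1000 whenever every pool is large enough.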
Cancer Gene Database: Download README
Gene Drug Database: Download README
Aggregated Gene-Disease Data (Gene Centric/No Role Codes): Download README
Aggregated Gene-Disease Data (Disease Centric/No Role Codes): Download README
Aggregated Gene-Disease Data (Gene Centric/Role Codes): Download README
Phase II
The second phase consists of an additional 1500 genes that have been fully annotated. The selection criteria
for this set emphasized recently discovered gene-disease associations, with the remainder
of the selected genes emphasizing genes with moderate numbers of citations (between 10 and 100).
Overall, this set of genes had a mean of 200 sentences with significant co-occurrence of gene and
disease or cancer terms.
This data set is reported in XML format for the convenience of researchers who wish to
parse the data without concern over problems with delimiters. An XSD is available by link from
each data set.
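As an illustration of why XML sidesteps delimiter problems, the sketch below parses a small made-up record with Python's standard library. The element names (`GeneEntryCollection`, `GeneEntry`, `GeneSymbol`, `Sentence`, `Statement`, `DiseaseTerm`) and the sample content are placeholders assumed for this example; the actual structure is defined by the XSD linked from each data set.

```python
import xml.etree.ElementTree as ET

# Made-up sample record. Element names are hypothetical; consult the XSD
# for the real schema. The statement deliberately contains commas, a tab,
# and a pipe, which would need escaping in a delimited text file.
SAMPLE = """<GeneEntryCollection>
  <GeneEntry>
    <GeneSymbol>GENE1</GeneSymbol>
    <Sentence>
      <Statement>GENE1, also written GENE-1,\tco-occurs | with a disease term.</Statement>
      <DiseaseTerm>placeholder disease</DiseaseTerm>
    </Sentence>
  </GeneEntry>
</GeneEntryCollection>"""


def gene_disease_rows(xml_text):
    """Yield (gene symbol, disease term, statement) for each sentence."""
    root = ET.fromstring(xml_text)
    for entry in root.iter("GeneEntry"):
        symbol = entry.findtext("GeneSymbol")
        for sentence in entry.iter("Sentence"):
            yield (
                symbol,
                sentence.findtext("DiseaseTerm"),
                sentence.findtext("Statement"),
            )


rows = list(gene_disease_rows(SAMPLE))
```

Each field comes back as a whole string regardless of the punctuation it contains, so no quoting or delimiter handling is needed. For the full downloads, a streaming parser such as `ET.iterparse` may be preferable to loading the whole file at once.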
Cancer Gene Database: Download Part I Download Part II README
Gene Drug Database: Download Part I Download Part II README
Phases III and IV
These phases consist of 2675 manually annotated cancer genes with moderate (<1000) sentence counts.
This data set is reported in XML format for the convenience of researchers who wish to parse the data without concern over problems with delimiters. An XSD is available with each data set.
Gene Drug Database: Download
Cancer Gene Database: Download