LHNCBC: Document Abstract
Year: 2006
LHNCBC-2006-073
Argument-predicate Distance as a Filter for Enhancing Precision in Extracting Predications on the Genetic Etiology of Disease
Masseroli M, Kilicoglu H, Lang F-M, Loane RF, Rindflesch TC
BMC Bioinformatics 2006; 7:291
Genomic functional information is valuable for biomedical research. However, such information frequently needs to be extracted from the scientific literature and structured in order to be exploited by automatic systems. Natural language processing is increasingly used for this purpose although it inherently involves errors. A postprocessing strategy that selects relations most likely to be correct is proposed and evaluated on the output of SemGen, a system that extracts semantic predications on the etiology of genetic diseases. Based on the number of intervening phrases between an argument and its predicate, we defined a heuristic strategy to filter the extracted semantic relations according to their likelihood of being correct. We also applied this strategy to relations identified with co-occurrence processing. Finally, we exploited postprocessed SemGen predications to investigate the genetic basis of Parkinson's disease.

The filtering procedure for increased precision is based on the intuition that arguments which occur close to their predicate are easier to identify than those at a distance. For example, if gene-gene relations are filtered for arguments at a distance of 1 phrase from the predicate, precision increases from 41.95% (baseline) to 70.75%. Since this proximity filtering is based on syntactic structure, applying it to the results of co-occurrence processing is useful, but not as effective as when applied to the output of natural language processing.

In an effort to exploit SemGen predications on the etiology of disease after increasing precision with postprocessing, a gene list was derived from extracted information enhanced with postprocessing filtering and was automatically annotated with GFINDer, a Web application that dynamically retrieves functional and phenotypic information from structured biomolecular resources. Two of the genes in this list are likely relevant to Parkinson's disease but are not associated with this disease in several important databases on genetic disorders.

Information based on the proximity postprocessing method we suggest is of sufficient quality to be profitably used for subsequent applications aimed at uncovering new biomedical knowledge. Although proximity filtering is only marginally effective for enhancing the precision of relations extracted with co-occurrence processing, it is likely to benefit methods based, even partially, on syntactic structure, regardless of the relation.
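The proximity filter described in the abstract can be illustrated with a minimal Python sketch. It is not SemGen's implementation. The `Predication` class, its field names, and the function `filter_by_distance` are hypothetical, and the sketch assumes that distance is the difference between the phrase index of an argument and that of its predicate, and that both arguments must fall within the threshold. The two example predications are invented placeholders, not data from the paper.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Predication:
    """A subject-predicate-object relation with phrase positions (hypothetical layout)."""
    subject: str
    predicate: str
    obj: str
    subject_phrase: int    # index of the phrase containing the subject
    predicate_phrase: int  # index of the phrase containing the predicate
    object_phrase: int     # index of the phrase containing the object

    def max_argument_distance(self) -> int:
        # Assumption: distance is the absolute difference in phrase indices,
        # so an argument in the phrase adjacent to the predicate is at distance 1.
        return max(
            abs(self.subject_phrase - self.predicate_phrase),
            abs(self.object_phrase - self.predicate_phrase),
        )


def filter_by_distance(
    predications: Iterable[Predication], max_distance: int = 1
) -> List[Predication]:
    """Keep predications whose arguments both lie within max_distance phrases."""
    return [p for p in predications if p.max_argument_distance() <= max_distance]


# Invented placeholder predications, for illustration only.
candidates = [
    Predication("GENE_A", "ASSOCIATED_WITH", "DISEASE_X", 0, 1, 2),
    Predication("GENE_B", "ASSOCIATED_WITH", "DISEASE_X", 0, 1, 5),
]
kept = filter_by_distance(candidates, max_distance=1)
```

Raising `max_distance` admits more predications at the likely cost of precision, which is the trade-off the abstract reports for gene-gene relations.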