Abstract

The ability to identify gene mentions in text and normalize them to the proper unique identifiers is crucial for downstream text mining applications in bioinformatics. We have developed a rule-based algorithm that combines pattern matching for gene symbols with an approximate term searching technique for gene names. The algorithm measures several features based on morphological, statistical, and contextual information to estimate the level of confidence that the correct identifier has been selected for a potential mention. We introduce and quantify three novel features: uniqueness, inverse distance, and coverage. The algorithm was evaluated against the BioCreAtIvE datasets, and the feature weights were tuned with the Nelder-Mead simplex method. Using the set of weights optimized on the training data, the algorithm achieved an F-score of 0.7622 and an AUC (area under the recall-precision curve) of 0.7461 on the test data.
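As a rough illustration of how weighted features might yield a confidence estimate and how the result could be scored, the following minimal sketch may help. It rests on several assumptions that the abstract does not state: the confidence is taken to be a simple weighted sum, and the feature values, weights, threshold, and identifiers (`GENE:1` and so on) are all hypothetical. Only the three feature names come from the text above.

```python
# Hypothetical sketch: the feature names come from the abstract; the linear
# combination, the weights, the threshold, and the identifiers are all
# assumptions made for illustration, not the published algorithm.

def confidence(features, weights):
    """Weighted sum of feature values for one candidate identifier."""
    return sum(weights[name] * value for name, value in features.items())


def f_score(predicted, gold):
    """Balanced F-score between a predicted and a gold set of identifiers."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Assumed weights; in the paper these are the quantities being tuned.
weights = {"uniqueness": 0.5, "inverse_distance": 0.25, "coverage": 0.25}

# Assumed feature values for two candidate identifiers in one document.
candidates = {
    "GENE:1": {"uniqueness": 1.0, "inverse_distance": 0.5, "coverage": 1.0},
    "GENE:2": {"uniqueness": 0.0, "inverse_distance": 0.5, "coverage": 0.5},
}

threshold = 0.5  # assumed cut-off for accepting a candidate
predicted = {
    gene_id
    for gene_id, feats in candidates.items()
    if confidence(feats, weights) >= threshold
}

gold = {"GENE:1", "GENE:3"}  # assumed gold-standard identifiers
score = f_score(predicted, gold)
```

Under these assumptions, tuning the weights amounts to searching for the weight vector that maximizes `f_score` over the training documents. A derivative-free optimizer such as Nelder-Mead suits that search because the F-score is not a smooth function of the weights.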


