Word Sense Disambiguation

Consider the following sentences taken from three different MEDLINE® abstracts, each containing the word cold:

  1. A greater proportion of mesophil micro-organisms were to be found during the cold months than in warmer months.

  2. In a controlled randomised trial we analysed whether the use of the term ``smoker's lung'' instead of chronic bronchitis when talking to patients with chronic obstructive lung disease (COLD) changed their smoking habits.

  3. The overall infection rate was 83% and of those infected, 88% felt that they had a cold.

The sense of the word cold is different in each sentence. In sentence (1), cold indicates temperature; in sentence (2), it is the acronym for chronic obstructive lung disease; and in sentence (3), it is a disease. The fact that a single word may have more than one sense is called ambiguity. In natural language, ambiguity occurs at many levels, e.g., lexical, structural, semantic, and pragmatic. It also pervades normal language use; humans disambiguate constantly (and subconsciously) in normal communication, using textual and other types of context.

The general opinion is that language in more restricted environments, such as medical research, is more specific and straightforward, and therefore less ambiguous. This may well be the case, but ambiguity is still present, as the examples above show. Additionally, the UMLS® Metathesaurus®, the largest medical thesaurus, has more than 7,400 ambiguous strings that map to more than one thesaurus concept. The word cold, for instance, maps to six different UMLS concepts, three of which we used in sentences (1)-(3).
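
As a rough illustration of what "ambiguous string" means here, the sketch below groups string-to-concept pairs and keeps the strings that map to more than one concept. The pairs and concept labels are hypothetical placeholders invented for this example; they are not real UMLS identifiers, and this is not how the Metathesaurus itself is queried.

```python
from collections import defaultdict

# Hypothetical (string, concept label) pairs, made up for illustration only.
# Real Metathesaurus data uses its own concept identifiers and file formats.
PAIRS = [
    ("cold", "CONCEPT_TEMPERATURE"),
    ("cold", "CONCEPT_COMMON_COLD"),
    ("cold", "CONCEPT_COLD_ACRONYM"),
    ("bronchitis", "CONCEPT_BRONCHITIS"),
]


def ambiguous_strings(pairs):
    """Return {string: sorted concept labels} for strings with more than one concept."""
    concepts = defaultdict(set)
    for string, concept in pairs:
        # Compare case-insensitively so "Cold" and "COLD" count as the same string.
        concepts[string.lower()].add(concept)
    return {s: sorted(c) for s, c in concepts.items() if len(c) > 1}


result = ambiguous_strings(PAIRS)
for string, labels in result.items():
    print(string, "->", ", ".join(labels))
```

In this toy data only cold is reported, because bronchitis maps to a single concept.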

Error analysis performed during the Indexing Initiative evaluation process identified word sense disambiguation as an area of focus for continued enhancement of the Medical Text Indexer. Indexing errors due to word sense ambiguity arise when the UMLS® Metathesaurus® has a single string referring to two or more distinct concepts. We do not currently have the means of choosing which concept is appropriate in the given textual context. Current research in statistically based natural language processing addresses automatic resolution of this type of ambiguity. One challenge with this approach is that it requires a significant amount of training text, which must often be disambiguated by hand. We have initiated research on a memory-based learning approach that minimizes this effort by first concentrating on non-ambiguous training text.
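
To make the idea of memory-based learning more concrete, here is a minimal sketch of a nearest-neighbour disambiguator: it stores labelled example sentences and assigns a new occurrence the sense of the stored example whose context words overlap most. The class name, the sense labels, the tokenization, and the overlap measure are all assumptions made for illustration; this is not the implementation used in the Indexing Initiative, and it does not model the step of starting from non-ambiguous training text.

```python
from collections import Counter


def context_features(sentence, target):
    """Lower-cased context words of a sentence, excluding the target word itself."""
    tokens = [t.strip(".,;()").lower() for t in sentence.split()]
    return frozenset(t for t in tokens if t and t != target)


class MemoryBasedWSD:
    """Toy k-nearest-neighbour disambiguator (hypothetical, for illustration)."""

    def __init__(self, target, k=1):
        self.target = target
        self.k = k
        self.memory = []  # list of (context feature set, sense label)

    def add(self, sentence, sense):
        """Store one labelled training instance in memory."""
        self.memory.append((context_features(sentence, self.target), sense))

    def predict(self, sentence):
        """Return the majority sense among the k most similar stored instances."""
        if not self.memory:
            raise ValueError("no training instances stored")
        feats = context_features(sentence, self.target)
        # Similarity is the number of shared context words (a simple overlap count).
        ranked = sorted(self.memory, key=lambda item: len(feats & item[0]), reverse=True)
        votes = Counter(sense for _, sense in ranked[: self.k])
        return votes.most_common(1)[0][0]


wsd = MemoryBasedWSD("cold", k=1)
wsd.add("Temperatures drop during the cold months of winter", "temperature")
wsd.add("Patients felt that they had a cold and a sore throat", "common_cold")

print(wsd.predict("Most of those infected felt they had a cold"))
print(wsd.predict("More organisms were found during the cold months"))
```

A realistic system would need far more training instances per sense and a better similarity measure, which is why the amount of hand-disambiguated text matters.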

To test such techniques in the biomedical language domain, we have developed a Word Sense Disambiguation (WSD) test collection (requires UMLS KS login) that comprises 5,000 disambiguated instances for 50 ambiguous UMLS® Metathesaurus® strings. Initial Machine Learning (ML) testing with this collection has yielded encouraging results, which will be reported later.

In addition, the work on Journal Descriptor (JD) Indexing offers another promising approach to word sense disambiguation.

Word sense disambiguation by selecting the best semantic type based on Journal Descriptor Indexing: preliminary experiment.
Humphrey, SM; Rogers, WJ; Kilicoglu H; Demner-Fushman, D; Rindflesch, TC. J Am Soc Inf Sci Technol 2006 Jan;57(1):96-113.
Erratum in: J Am Soc Inf Sci, Mar. 2006, 57(4):726.
PDF: Erratum for Word sense disambiguation by selecting ... paper (20.6kb)
PDF: Word sense disambiguation by selecting the best semantic ... (386kb)


Last Modified: October 09, 2007