Word Sense Disambiguation (WSD) Test Collection

Word sense ambiguity is a pervasive characteristic of natural language. For example, the word "cold" has several senses and may refer to a disease, a temperature sensation, or a natural phenomenon. The specific sense intended is determined by the textual context in which an instance of the ambiguous word appears. In "I am taking aspirin for my cold" the disease sense is intended, while in "Let's go inside, I'm cold" the temperature sensation sense is meant.

It is convenient to refer to an ambiguous word along with all of its individual senses as an ambiguity case. Further, we call each textual occurrence of the ambiguity an instance. In the UMLS Metathesaurus, a large number of ambiguity cases are represented by separate concepts, each of which refers to one of the individual senses.

In order to support research investigating the automatic resolution of word sense ambiguity using natural language processing techniques, we have constructed this test collection of medical text in which the ambiguities were resolved by hand. Evaluators were asked to examine instances of an ambiguous word and determine the sense intended by selecting the Metathesaurus concept (if any) that best represents the meaning of that sense.

The test collection consists of 50 highly frequent ambiguous UMLS concepts from 1998 MEDLINE. For each of the 50 ambiguity cases, 100 instances were randomly selected from the 1998 MEDLINE citations, for a total of 5,000 instances. Of the 11 evaluators, 8 completed 100% of the 5,000 instances, 1 completed 56%, 1 completed 44%, and the final evaluator completed 12%. An evaluator's judgments for a given ambiguity were used only if that evaluator completed all 100 instances of that ambiguity.
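The completeness rule described above can be illustrated with a minimal sketch. The tuple layout, the evaluator names, and the sense labels are hypothetical and chosen only for illustration; they do not reflect the actual file format of the test collection.

```python
from collections import defaultdict

# Figures stated in the text: 50 ambiguity cases, 100 instances each.
AMBIGUITY_CASES = 50
INSTANCES_PER_AMBIGUITY = 100


def usable_evaluations(judgments):
    """Keep only complete evaluations.

    judgments: iterable of (evaluator, ambiguity, instance_id, sense) tuples
    (a hypothetical layout). Returns {(evaluator, ambiguity): {instance_id: sense}}
    for pairs where the evaluator judged all 100 instances.
    """
    grouped = defaultdict(dict)
    for evaluator, ambiguity, instance_id, sense in judgments:
        grouped[(evaluator, ambiguity)][instance_id] = sense
    return {
        key: senses
        for key, senses in grouped.items()
        if len(senses) == INSTANCES_PER_AMBIGUITY
    }


# Toy data: E1 finished every "cold" instance, E2 stopped partway through.
judgments = [("E1", "cold", i, "M1") for i in range(100)]
judgments += [("E2", "cold", i, "M2") for i in range(56)]

kept = usable_evaluations(judgments)
total_instances = AMBIGUITY_CASES * INSTANCES_PER_AMBIGUITY
```

In this toy run only the pair ("E1", "cold") survives the filter, because E2's partial set of judgments for "cold" is discarded as a whole.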

The following links provide information about the process of building the WSD Test Collection:

Evaluator Guidelines (July 11, 2000) (PDF, 95kb)
Final Ambiguity Choices (July 5, 2000) (PDF, 134kb)

The following paper describes in more detail the development of the test collection:
Developing a Test Collection for Biomedical Word Sense Disambiguation, AMIA 2001 (PDF, 93kb)

Now Available from Dr. Ted Pedersen at the University of Minnesota, Duluth:

nlm2sval2 is a small utility package that converts the WSD Test Collection into the Senseval-2 lexical sample format. It is written in Perl and is freely available from Dr. Pedersen's data conversion page at the following URL: http://www.d.umn.edu/~tpederse/tools.html

Please Note

Users are responsible for compliance with the UMLS copyright restrictions.

To use this test collection, you must have signed the UMLS agreement. The UMLS agreement requires those who use the UMLS to file a brief report once a year to summarize their use of the UMLS. It also requires the acknowledgment that the UMLS contains copyrighted material and that those copyright restrictions be respected. The UMLS agreement requires users to agree to obtain agreements for EACH copyrighted source prior to its use within a commercial or production application.

[ Use of all the sources is permitted if the application is used for research purposes only. ]

The 5,000 MEDLINE citations included at this site are for exclusive use with the Test Collection and cannot be redistributed. In addition, the citations were retrieved in late 1999 and represent a static view of MEDLINE at that time.


To access the WSD Test Collection, you must have a UMLSKS account. For more information about how we use UMLSKS authentication data, or for information on how to set up a UMLSKS account, see the help page about UMLSKS accounts.

Access WSD Test Collection  (RESTRICTED)


Last Modified: April 14, 2008