United States National Library of Medicine National Institutes of Health

Fact Sheet
SPECIALIST Lexicon


The SPECIALIST Lexicon

The SPECIALIST Lexicon is one of three UMLS® Knowledge Sources under development by the National Library of Medicine (NLM) as part of the Unified Medical Language System® project.

The SPECIALIST Lexicon has been developed to provide the lexical information needed for the SPECIALIST Natural Language Processing (NLP) System. It is intended to be a general English lexicon that also covers biomedical vocabulary, so coverage includes both commonly occurring English words and biomedical terms. The lexicon entry for each word or term records the syntactic, morphological, and orthographic information needed by the SPECIALIST NLP System.

Scope and Content of the SPECIALIST Lexicon

The lexicon consists of a set of lexical entries, with one entry for each spelling or set of spelling variants in a particular part of speech. A lexical item may be a multi-word term if that term appears as an entry in general English or medical dictionaries, or in medical thesauri such as MeSH®. Expansions of generally used acronyms and abbreviations are also allowed as multi-word terms.
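The "one entry per set of spelling variants per part of speech" rule can be pictured with a small Python sketch. This is only an illustration under stated assumptions: the class name, field names, category labels, and identifiers such as `X0001` are hypothetical and are not taken from the lexicon distribution.

```python
from dataclasses import dataclass, field


@dataclass
class LexicalEntry:
    """One entry: a base form, its spelling variants, and one part of speech."""
    entry_id: str   # hypothetical identifier, not the real ID scheme
    base: str       # base form; may be a multi-word term
    category: str   # part of speech, e.g. "noun" or "adj" (labels assumed)
    spelling_variants: list = field(default_factory=list)


def build_index(entries):
    """Map every (spelling, category) pair to the single entry that covers it."""
    index = {}
    for entry in entries:
        for spelling in [entry.base, *entry.spelling_variants]:
            index[(spelling.lower(), entry.category)] = entry
    return index


# Illustrative data only: the same spellings get separate entries per
# part of speech, and a multi-word term is an entry in its own right.
entries = [
    LexicalEntry("X0001", "anesthetic", "noun", ["anaesthetic"]),
    LexicalEntry("X0002", "anesthetic", "adj", ["anaesthetic"]),
    LexicalEntry("X0003", "magnetic resonance imaging", "noun"),
]
index = build_index(entries)
```

In this sketch, looking up either spelling with the same part of speech returns the same entry object, while the same spelling with a different part of speech returns a different entry.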

Words are selected for lexical coding from a variety of sources. Approximately 20,000 words form the core of the coded words. These are taken from the UMLS Test Collection of MEDLINE® abstracts, together with words that appear in both the UMLS Metathesaurus and Dorland's Illustrated Medical Dictionary. In addition, an effort has been made to include words from the general English vocabulary: the 10,000 most frequent words listed in The American Heritage Word Frequency Book and the list of 2,000 words used in definitions in Longman's Dictionary of Contemporary English have also been coded. Since the majority of the words selected for coding are nouns, an effort has been made to include verbs and adjectives by identifying verbs in current MEDLINE citation records, by using the Computer Usable Oxford Advanced Learner's Dictionary, and by identifying potential adjectives from Dorland's Illustrated Medical Dictionary using heuristics developed by McCray and Srinivasan (1990).

A variety of reference sources are used in coding lexical records. Coding is based on actual usage in the UMLS Test Collection and MEDLINE, on dictionaries of general English (primarily learner's dictionaries, which record the kind of syntactic information needed for NLP), and on medical dictionaries. The dictionaries used were Longman's Dictionary of Contemporary English, Dorland's Illustrated Medical Dictionary, Collins COBUILD Dictionary, The Oxford Advanced Learner's Dictionary, and Webster's Medical Desk Dictionary.

Distribution Formats

The SPECIALIST Lexicon is provided in two formats: a unit record format and a relational table format. The information associated with each lexical entry includes a unique identifier, a base form, a syntactic category code, certain agreement information, complementation information if relevant, and various other properties relevant to the particular lexical entry.

The unit record format is a frame structure consisting of slots and fillers. The slots are the basic lexical attributes, and the fillers express the possible values of those attributes for that particular lexical item.

Data for lexical entries are also represented in a set of relational tables. The lexicon relational format is not fully normalized. By design, there is duplication of data among different relations and within certain relations. Developers will need to decide the extent to which this redundancy should be retained, reduced, or increased for their applications. Among other tables, there are separate tables for agreement and inflection information, complementation patterns, spelling variants, and abbreviations and acronyms and their fully expanded forms.
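As a rough illustration of the slot-and-filler idea behind the unit record format, the Python sketch below reads a frame into a dictionary that maps each slot to its list of fillers. The brace-delimited `slot=filler` layout, the slot names, and the sample record are all assumptions made for this example; the documentation shipped with the lexicon is the authority on the actual record syntax.

```python
# Hypothetical sample frame: layout, slot names, and values are illustrative.
SAMPLE_RECORD = """\
{base=anesthetic
spelling_variant=anaesthetic
entry=X0000001
cat=noun
variants=reg
}
"""


def parse_unit_records(text):
    """Parse brace-delimited slot=filler frames into a list of dicts.

    Each slot maps to a list of fillers, because a slot may repeat
    within one frame (for example, several spelling variants).
    """
    records = []
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith("{"):
            # An opening brace starts a new frame; the first slot may follow it.
            current = {}
            line = line[1:].strip()
        if line == "}":
            # A closing brace ends the current frame.
            if current is not None:
                records.append(current)
            current = None
            continue
        if current is None or not line:
            continue
        slot, _, filler = line.partition("=")
        current.setdefault(slot.strip(), []).append(filler.strip())
    return records


records = parse_unit_records(SAMPLE_RECORD)
```

Keeping every filler in a list avoids losing data when a slot occurs more than once, at the cost of an extra indexing step for single-valued slots.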

Obtaining the SPECIALIST Lexicon

The SPECIALIST Lexicon is available as an open source resource as part of the SPECIALIST NLP tools (http://SPECIALIST.nlm.nih.gov). Distribution is subject to terms and conditions.

The SPECIALIST Lexicon and the other UMLS Knowledge Sources are available through the UMLS Knowledge Source Server (http://umlsks.nlm.nih.gov/). Further information about the UMLS project can be found at http://umlsinfo.nlm.nih.gov/.

For additional information, send an email to custserv@nlm.nih.gov or call 1-888-FINDNLM.

A complete list of NLM Factsheets is available at:
(alphabetical list) http://www.nlm.nih.gov/pubs/factsheets/factsheets.html
(subject list) http://www.nlm.nih.gov/pubs/factsheets/factsubj.html

Or write to:

FACT SHEETS
Office of Communications and Public Liaison
National Library of Medicine
8600 Rockville Pike
Bethesda, MD 20894

Phone: (301) 496-6308
Fax: (301) 496-4450
email: publicinfo@nlm.nih.gov

Last updated: 28 March 2006
First published: 28 March 2006
Permanence level: Permanent: Stable Content