The Unified Medical Language System® (UMLS®) Project


Stuart J. Nelson, MD, Tammy Powell, and Betsy L. Humphreys

National Library of Medicine, Bethesda, MD


I. INTRODUCTION

In 1986, Donald A. B. Lindberg, M.D., Director of the National Library of Medicine (NLM), initiated a long-term research and development effort known as the Unified Medical Language System (UMLS). Anticipating increasing amounts of biomedical information available in electronic form, he believed that NLM should facilitate the development of advanced information systems that could retrieve and integrate information from a variety of disparate information sources, including bibliographic databases, patient record systems, factual databanks, and knowledge bases. He recognized that a major barrier to effective retrieval and integration of information from multiple sources was the "naming problem" or the variety of different ways that the same concepts are expressed in different information sources and by different information seekers.

To address the complex problems of relating user inquiries to the content of biomedical information sources and of aggregating comparable data derived from disparate databases, the NLM assembled a multidisciplinary in-house research group and also awarded a series of research contracts to a number of primarily academic investigators. The first several years of UMLS research were devoted to studying user needs, developing research tools, identifying required capabilities, exploring alternative methods for delivering these capabilities, and defining in general terms the new knowledge sources that would be needed to support integrated use of information from disparate electronic biomedical sources. Based on the results of this early work, the conception of UMLS components as "middleware" designed for use by system developers emerged. Since 1990, NLM has issued annual editions of UMLS Knowledge Sources and associated lexical programs. Over the past decade, these resources have grown and developed, the methodology of creating them has matured, and their utility has been demonstrated in many different information systems. Today more than 1,000 individuals and institutions worldwide license the UMLS resources, which are free of charge. The majority of the licensees use one or more of the UMLS components in information systems, often in creative and innovative undertakings. The NLM itself uses UMLS components to enhance retrieval from a number of its information services, including the MEDLINE database available via PubMed, the ClinicalTrials.gov database of ongoing clinical trials sponsored by the National Institutes of Health and other organizations, and the NLM Gateway, which provides a single point of entry to a number of different NLM databases. The Library also relies heavily on the UMLS resources in its natural language processing and digital library research programs.

II. THE KNOWLEDGE SOURCES

There are three major UMLS knowledge sources: a large Metathesaurus® of concepts and terms from many biomedical vocabularies and classifications; a Semantic Network of sensible relationships among the broad semantic types or categories to which all Metathesaurus concepts are assigned; and the SPECIALIST Lexicon which contains syntactic, morphological, and orthographic information for biomedical and common words in the English language. The Lexicon and its associated lexical resources are used to generate the indexes to the Metathesaurus and also have wide applicability in natural language processing applications in the biomedical domain. From 1991 to 1998, the NLM also produced a fourth UMLS Knowledge Source, an Information Sources Map that described the scope, content, and access conditions for many publicly available biomedical databases. The development of the World Wide Web offered other promising approaches to addressing the problem of determining which of many information sources contain content relevant to a particular inquiry.

The Metathesaurus and the Semantic Network will be discussed in this article. Those interested in the SPECIALIST lexicon and its related natural language processing resources should consult the readings at the end of this article.

II.A. THE METATHESAURUS

The Metathesaurus is the central vocabulary component of the UMLS. The term Metathesaurus draws on Webster's Dictionary third definition for the prefix "Meta," i.e., "more comprehensive, transcending." In a sense, the Metathesaurus transcends the specific vocabularies and classifications it encompasses.

The Metathesaurus is a database of information on concepts whose names appear in one or more of a number of different controlled vocabularies and classifications used in the field of biomedicine. In general, the scope of the Metathesaurus is determined by the combined scope of its source vocabularies. The Metathesaurus preserves the meanings, hierarchical connections, and other relationships between terms present in its source vocabularies, while adding certain basic information about each of its concepts and establishing new relationships between concepts and terms from different source vocabularies.

The Metathesaurus contains concepts and concept names from more than 60 vocabularies and classifications, some in multiple editions. The 2000 edition of the Metathesaurus includes approximately 730,000 concepts and 1.5 million concept names. Most of the source vocabularies are included in their entirety. Some material in the UMLS Metathesaurus is from copyrighted sources.

II.A.1. Organization of the Metathesaurus

A thesaurus, and the Metathesaurus is no exception to this observation, is a strange construction. I. A. Richards, in his introduction to Roget's Pocket Thesaurus, described it as the opposite of a dictionary: a place you go to look up a word when you already know the meaning. Thesauri are organized by the principle of semantic locality. That is, words and phrases close to one another in meaning can be found in the same area. The first thesaurus in the English language, that designed by Roget, illustrates the principle: it is organized into general categories, then into more specific ones, each holding lists of words and phrases close in meaning to one another.

The Metathesaurus is organized similarly. All words and phrases that mean the same thing form a distinct concept or synonym class in the Metathesaurus. Each separate meaning appears as its own concept, together with links (represented relationships) to other concepts in the Metathesaurus. These relationships to the other concepts serve to define the semantic neighborhood of the concept. A user of the Metathesaurus, whether human or program, can navigate within this semantic neighborhood to find the names for the concept sought.

Multiple meanings of the same term are dealt with by separating the meanings and presenting them in different semantic neighborhoods. If you look up the word "corn" in a dictionary you will see one entry with multiple definitions. If you search for "corn" in the Metathesaurus you will find that it names two separate concepts. One concept is the plant used for food and the other one the anatomical abnormality. Each of these meanings has differing relationships within the Metathesaurus.
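The separation of meanings can be pictured as an index from a name to every concept it names. The following is a minimal sketch only: the concept labels and the dictionary layout are illustrative assumptions, not the Metathesaurus file format, and the names for the plant concept are taken from Table 1.

```python
from collections import defaultdict

# Hypothetical concept labels; real Metathesaurus concepts carry identifiers.
CONCEPT_NAMES = {
    "corn (plant used for food)": ["Corn", "Maize", "Zea mays", "Indian Corn"],
    "corn (anatomical abnormality)": ["Corn"],
}


def build_name_index(concept_names):
    """Map each lower-cased name to the set of concepts it names."""
    index = defaultdict(set)
    for concept, names in concept_names.items():
        for name in names:
            index[name.lower()].add(concept)
    return index


NAME_INDEX = build_name_index(CONCEPT_NAMES)


def concepts_named(name):
    """Return the concepts named by `name`, sorted for stable output."""
    return sorted(NAME_INDEX.get(name.lower(), set()))
```

Looking up "corn" in this sketch returns both concepts, while "maize" returns only the plant, which mirrors the way each meaning sits in its own semantic neighborhood.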

II.A.2. Concept Structure

The first step in organizing the Metathesaurus is to connect the alternative names for the same concept. Each concept record contains the strings of alphanumeric characters and terms that express the meaning of the concept. Strings that are lexical variants of each other (that is, identical after a series of well-defined manipulations that can be done computationally, e.g., making all characters lower case, putting all words in a defined order, and changing all plural forms to singular) are grouped together as a single term. One string is designated, by convention, as the preferred form of that term. Table 1 illustrates the terms for one meaning of corn. In Table 1, Indian Corn and Corn, Indian are lexical variants. Also in Table 1, the five terms all have the same meaning. They are therefore linked together as alternate names of the same concept, with one term designated as the preferred name of the concept. The designation of preferred forms and preferred names is done by algorithm based on an order of precedence among source vocabularies. Because this is done by an arbitrary convention, any user can change and select their own order of precedence among vocabularies.
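The grouping of lexical variants and the choice of a preferred name can be sketched as follows. This is an illustration under stated assumptions, not NLM's lexical programs: the `singularize` helper is a deliberately crude stand-in for real plural handling, and the precedence list passed to `preferred_name` is a hypothetical ordering chosen by the caller.

```python
import re
from collections import defaultdict


def singularize(word):
    """Very rough plural stripping; a stand-in for real lexical tools."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def normalize(string):
    """Lower-case, singularize, and sort the words of a string."""
    words = re.findall(r"[a-z0-9]+", string.lower())
    return " ".join(sorted(singularize(w) for w in words))


def group_into_terms(strings):
    """Group strings that normalize identically into one term."""
    terms = defaultdict(list)
    for s in strings:
        terms[normalize(s)].append(s)
    return dict(terms)


def preferred_name(candidates, precedence):
    """Pick the (string, source) pair whose source ranks first in `precedence`.

    Raises ValueError if a candidate's source is missing from `precedence`.
    """
    return min(candidates, key=lambda pair: precedence.index(pair[1]))
```

With this sketch, "Indian Corn" and "Corn, Indian" normalize to the same key and fall into one term, while "Maize" forms its own. Because the precedence list is just an argument, a user who prefers a different ordering of source vocabularies passes a different list.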

Each concept in the Metathesaurus has a unique concept identifier (CUI), which itself has no intrinsic meaning. This unique identifier is represented in the Metathesaurus by the letter C followed by 7 digits (e.g., C0010028). This identifier remains the same across versions of the Metathesaurus, irrespective of the term designated as the preferred name of the concept. This facilitates file maintenance and management, as well as tracking how the meanings assigned to a given term change over time. It is "the name [of a concept] that never changes."
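The identifier format described above (the letter C followed by 7 digits) is simple to check mechanically. A minimal sketch, in which the names `CUI_PATTERN` and `is_cui` are illustrative and not part of any UMLS tool:

```python
import re

# The letter C followed by exactly seven ASCII digits.
CUI_PATTERN = re.compile(r"C[0-9]{7}")


def is_cui(value):
    """Return True if `value` has the shape of a concept unique identifier."""
    return CUI_PATTERN.fullmatch(value) is not None
```

A check like this validates only the shape of the string; it says nothing about whether the identifier is actually assigned to a concept in a given edition.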

II.A.3. Relationships

The Metathesaurus also represents relationships between different concepts. Many relationships are derived directly from the source vocabularies. For example, the fact that there is a relationship between "Coffee" and "Rubiaceae" is derived from the hierarchical tree structures in the Medical Subject Headings (MeSH) vocabulary. The exact nature of the relationship may or may not be represented in the source vocabulary, although the contextual and hierarchical relationships are identified as such. Relationships between concepts from different source vocabularies are also, on occasion, created during Metathesaurus construction. For example, the COSTAR concept "CHOCOLATE INTOLERANCE OR ALLERGY" is identified in the Metathesaurus as having a narrower-than relationship to "Food Hypersensitivity", a concept that is present in MeSH, the United Kingdom National Health Service Clinical Terms, and others.

Nine types of relationships exist in the Metathesaurus. They are listed in Table 2. Relationships are reciprocal, so that, for example, where one concept is broader than another, the other is noted as being narrower than the first. Relationship attributes, describing the exact nature of a relationship, may be assigned to a given relationship. These attributes are drawn from the set of permissible relationships within the Semantic Network (discussed below), with the additional relationship of "mapped_to".
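Reciprocity can be sketched as a table of inverse labels applied to every stored relationship. Only the broader/narrower pairing is stated in the text; the other pairings below are assumptions inferred from the definitions in Table 2. The triple convention, in which `(a, rel, b)` is read as "a is REL with respect to b", is also hypothetical and is not the layout of the released files.

```python
# Assumed inverse labels; only RB/RN is stated explicitly in the text.
INVERSE = {
    "RB": "RN", "RN": "RB",      # broader / narrower
    "PAR": "CHD", "CHD": "PAR",  # parent / child
    "AQ": "QB", "QB": "AQ",      # allowed qualifier / can be qualified by
    "RO": "RO", "RL": "RL", "SIB": "SIB",  # treated here as symmetric
}


def with_reciprocals(relations):
    """Return the given relations plus the reciprocal of each one."""
    closed = set(relations)
    for subject, rel, obj in relations:
        closed.add((obj, INVERSE[rel], subject))
    return closed
```

Given the single triple stating that "Chocolate Intolerance or Allergy" is narrower than "Food Hypersensitivity", the function adds the triple stating that "Food Hypersensitivity" is broader than "Chocolate Intolerance or Allergy".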

Many of the source vocabularies included in the Metathesaurus place the concepts they include in some context. These contexts are in general hierarchical arrangements, for some organizational or classification purpose. NLM endeavors to preserve all of these different contexts in the Metathesaurus, so a single concept in the Metathesaurus can appear in multiple different hierarchies. As an example, several of the contexts in which the concept "fruit" appears are shown in Table 3. There is no attempt to merge or combine the different contextual views into one coherent hierarchical arrangement for the Metathesaurus. Given the different perspectives and purposes of the many UMLS source vocabularies, this would be an essentially impossible task.
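One way to picture several contexts being kept side by side, rather than merged, is to store each source's path to the concept separately. The sketch below is illustrative only: the paths are read from Table 3, and the dictionary layout and function name are assumptions rather than the Metathesaurus format.

```python
# Each source keeps its own path(s) from the top of its hierarchy to the concept.
FRUIT_CONTEXTS = {
    "Alcohol and Other Drug Thesaurus": [
        ["technology, safety and accidents",
         "technology, manufacturing, and agriculture",
         "food product", "fruits"],
    ],
    "CRISP Thesaurus": [
        ["food science/technology", "food", "fruit"],
    ],
    "MeSH": [
        ["Organisms (MeSH Category)", "Plants", "Plant Components", "Fruit"],
        ["Technology, Food and Beverages (MeSH Category)",
         "Food and Beverages", "Food", "Fruit"],
    ],
}


def parents_by_source(contexts):
    """Return the immediate parent of the concept in each stored context."""
    return {
        source: [path[-2] for path in paths]
        for source, paths in contexts.items()
    }
```

The same concept has a different parent in each view, and two different parents within MeSH alone, which is why no single merged hierarchy is attempted.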

II.A.4. Semantic Types

Each Metathesaurus concept is assigned at least one semantic type. The types are drawn from the Semantic Network (see below). The types provide a general categorization of the concept, and allow some reasoning about the possible meaning of the concept. In all cases, the most specific semantic type available in the hierarchy is assigned to the concept. For example, the concept "Macaca" receives the semantic type "Mammal" because there is not a more specific type, e.g., "Primate," available in the Network.

II.A.5. Additional Attributes

Within each concept, there are many data elements provided both by the sources and created by NLM. There are over 110 attributes and data elements in the Metathesaurus. Each attribute in the Metathesaurus is labeled with the source which asserted that attribute.

Of particular note is that there may be narrative description(s) of the meaning of the concept. The majority of these definitions come from MeSH, but there are also definitions from a number of other sources. A few definitions are created specifically for the Metathesaurus when they are needed to distinguish among different meanings of the same string.

II.B. THE SEMANTIC NETWORK

The Semantic Network is tied very closely to the content of the Metathesaurus. In general, semantic networks attempt to impart common sense knowledge to computers, allowing them to "reason" and draw conclusions about entities by virtue of the categories to which they have been assigned. The semantic links provide the structure for the network and represent important relationships in the domain. The UMLS Semantic Network consists of 134 semantic types, broad categories intended to indicate the general area of meaning of a Metathesaurus concept, together with 54 relationships or semantic links.

The Semantic Network can be visualized as a diagram where the types make up nodes within a network. The top of the network has two nodes, "Entity" and "Event". The remaining types each appear only in one location within the network. Figure 1 includes a portion of the UMLS Semantic Network.

The primary link between the nodes is the "isa" link. This establishes the hierarchy of types within the Network and is used for deciding on the most specific semantic type available for assignment to a Metathesaurus concept. In addition, a set of non-hierarchical relations between the types has been identified. These are grouped into five major categories, which are themselves relations: "physically related to," "spatially related to," "temporally related to," "functionally related to," and "conceptually related to."
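Because each type appears in only one location, the "isa" hierarchy can be sketched as a child-to-parent map. The fragment below uses the "Manufactured Object" sub-tree mentioned in this article; placing "Physical Object" directly under "Entity" is an assumption of the sketch, and the function names are illustrative.

```python
# Child -> parent along the isa link (a small fragment, not the full Network).
ISA = {
    "Physical Object": "Entity",  # assumed direct link for this sketch
    "Manufactured Object": "Physical Object",
    "Medical Device": "Manufactured Object",
    "Research Device": "Manufactured Object",
}


def ancestors(semantic_type):
    """Follow isa links upward and return all ancestors, nearest first."""
    chain = []
    while semantic_type in ISA:
        semantic_type = ISA[semantic_type]
        chain.append(semantic_type)
    return chain


def most_specific(applicable_types):
    """Drop every type that is an ancestor of another applicable type."""
    redundant = set()
    for t in applicable_types:
        redundant.update(ancestors(t))
    return sorted(set(applicable_types) - redundant)
```

If both "Manufactured Object" and "Medical Device" apply to a concept, only "Medical Device" survives; if nothing more specific than "Manufactured Object" applies, that broader type is kept.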

The relations are stated between semantic types and do not necessarily apply to all instances of concepts that have been assigned to those semantic types. That is, the relation may or may not hold between any particular pair of concepts. So, although "treats" is one of several valid relations between the semantic types "Pharmacologic Substance" and "Disease or Syndrome," a particular pharmacologic substance (e.g., penicillin) may not treat a particular disease (e.g., AIDS).
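The distinction between a relation that is permissible at the type level and one that actually holds between two concepts can be sketched as two separate lookups. Everything below is illustrative: the data structures and function names are assumptions, the type assignments are simplified to match the example in the text, and the set of asserted concept-level facts is left empty on purpose.

```python
# Type-level relations: what the Semantic Network allows.
TYPE_RELATIONS = {
    ("Pharmacologic Substance", "treats", "Disease or Syndrome"),
}

# Simplified type assignments for the example concepts.
CONCEPT_TYPES = {
    "penicillin": {"Pharmacologic Substance"},
    "AIDS": {"Disease or Syndrome"},
}

# Concept-level facts; none are asserted in this sketch.
ASSERTED = set()


def is_permissible(subject, relation, obj):
    """True if some pair of the concepts' types allows the relation."""
    return any(
        (s_type, relation, o_type) in TYPE_RELATIONS
        for s_type in CONCEPT_TYPES.get(subject, ())
        for o_type in CONCEPT_TYPES.get(obj, ())
    )


def is_asserted(subject, relation, obj):
    """True only if the relation is recorded for this specific pair."""
    return (subject, relation, obj) in ASSERTED
```

Here `is_permissible("penicillin", "treats", "AIDS")` is true while `is_asserted` is false: the Network says the relation is sensible for these types, not that it holds for this pair.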

III. PRODUCTION AND DISTRIBUTION OF THE UMLS

III.A. MAKING THE METATHESAURUS

To understand the process of making the Metathesaurus, it is helpful to review the steps taken in adding a new vocabulary to the Metathesaurus. (Adding an update to a vocabulary already present in the Metathesaurus is quite similar.) The NLM begins by acquiring the rights to incorporate a vocabulary into the Metathesaurus. The rights include permission to include and represent the vocabulary in this form, and to distribute the vocabulary to UMLS licensees. However, a UMLS license alone does not permit a licensee to use every source vocabulary for any purpose. For some applications of some source vocabularies, the user must also establish a separate agreement with the individual vocabulary producer. The UMLS license describes when this is necessary.

Once a machine-readable version of a vocabulary is made available to the NLM it is converted into a "normal" or canonical form. This "inversion" process requires careful consideration of how the source represents its meanings and attempts to make all of this representation explicit. Each source is then added to the existing Metathesaurus. Terms from different sources which are lexically similar to each other or to existing terms in the Metathesaurus, or which appear from other indications to be semantically identical to concepts in the Metathesaurus, are brought together (merged) as proposed synonyms located in a single Metathesaurus concept.

After this merging, the results are reviewed by editors, largely to assess if the proposed concept merge is appropriate. Editors also may add information such as additional relationships and semantic types. This human review is expedited by computational assistance, as is the quality assurance which takes place after the editing.

In categorizing the concepts, editors are encouraged to consider the most specific semantic type available. If the concept is broad or not represented by a more specific type, a broad category in the semantic type hierarchy is used. For example, a sub-tree under the node "Physical Object" is "Manufactured Object." It has only two child nodes, "Medical Device" and "Research Device." It is clear that there are manufactured objects other than medical devices and research devices. Rather than proliferate the number of semantic types to encompass multiple additional subcategories for these objects, concepts that are neither medical devices nor research devices are simply assigned the more general semantic type "Manufactured Object."

Periodically, various types of quality assurance efforts are performed. The most important of these are efforts to ensure that there is no missed synonymy, i.e., no terms meaning exactly the same thing in different concepts. Another important effort is made to ensure that every concept is linked to others by some relationship.
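The two checks just described can be sketched as simple scans over concept names and relationships. This is a hedged illustration, not NLM's quality assurance software: the identifiers in the usage note are made up, the default normalization is plain lower-casing, and a shared name only flags a candidate for human review, since one string can legitimately name two different concepts.

```python
from collections import defaultdict


def missed_synonymy_candidates(concept_names, normalize=str.lower):
    """Return normalized names that occur in more than one concept."""
    seen = defaultdict(set)
    for cui, names in concept_names.items():
        for name in names:
            seen[normalize(name)].add(cui)
    return {name: sorted(cuis) for name, cuis in seen.items() if len(cuis) > 1}


def unlinked_concepts(cuis, relations):
    """Return concepts that take part in no relationship at all."""
    linked = set()
    for cui1, _rel, cui2 in relations:
        linked.update((cui1, cui2))
    return sorted(set(cuis) - linked)
```

For example, with the made-up identifiers C0000001 (named "Maize") and C0000002 (named "maize"), the first function reports "maize" as shared by both concepts, and the second reports any identifier that appears in no relationship triple.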

In preparation for the release of the next version of the Metathesaurus, all new releasable concepts are assigned concept unique identifiers (CUIs), while concepts previously present in the Metathesaurus retain their CUIs. The Metathesaurus is released as a set of relational tables.
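The rule that existing concepts keep their CUIs while new concepts receive fresh ones can be sketched as below. The internal concept keys and the sequential numbering scheme are assumptions made for illustration; the article does not say how NLM chooses the digits of a new identifier.

```python
def assign_cuis(previous, releasable_concepts):
    """Keep existing CUIs and mint new ones for concepts not seen before.

    `previous` maps a (hypothetical) internal concept key to the CUI it had
    in the last release; `releasable_concepts` lists the keys for this release.
    """
    assigned = dict(previous)
    # Assumed scheme: continue counting from the highest number used so far.
    highest = max((int(cui[1:]) for cui in previous.values()), default=0)
    next_number = highest + 1
    for key in releasable_concepts:
        if key not in assigned:
            assigned[key] = f"C{next_number:07d}"
            next_number += 1
    return assigned
```

The important property is the first one: a key already present in `previous` is never reassigned, so its identifier stays stable from one release to the next.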

III.B. LICENSING AND DISTRIBUTION OF THE UMLS

The UMLS is available free of charge to anyone who wishes to license it. The license agreement, available at http://www.nlm.nih.gov/research/umls/, must be completed and signed. As noted above, UMLS licensees may also have to enter into separate license agreements with the producers of specific vocabularies present in the Metathesaurus. No additional agreements are required for use of the other UMLS components.

Licensed users may ftp the UMLS Knowledge Sources or access them interactively from the UMLS Knowledge Source Server. On request, CD-ROMs are provided to users who do not have adequate connectivity to ftp the large files.

IV. FOR FURTHER READING

Additional information and documentation may be found at
http://www.nlm.nih.gov/research/umls/.

IV.A. GENERAL BACKGROUND

Lindberg DA, Humphreys BL, McCray AT. The Unified Medical Language System. Methods Inf Med. 1993 Aug;32(4):281-91.

McCray AT, Razi AM, Bangalore AK, Browne AC, Stavri PZ. The UMLS Knowledge Source Server: a versatile Internet-based research tool. Proc AMIA Annu Fall Symp. 1996:164-8.

Campbell KE, Oliver DE, Spackman KA, Shortliffe EH. Representing thoughts, words, and things in the UMLS. J Am Med Inform Assoc. 1998 Sep-Oct;5(5):421-31.

Humphreys BL, Lindberg DA, Schoolman HM, Barnett GO. The Unified Medical Language System: an informatics research collaboration. J Am Med Inform Assoc. 1998 Jan-Feb;5(1):1-11.

IV.B. THE SPECIALIST LEXICON AND NATURAL LANGUAGE PROCESSING

McCray AT, Srinivasan S, Browne AC. Lexical methods for managing variation in biomedical terminologies. Proc Annu Symp Comput Appl Med Care. 1994:235-9.

Divita G, Browne AC, Rindflesch TC. Evaluating lexical variant generation to improve information retrieval. Proc AMIA Annu Fall Symp. 1998:775-9.

McCray AT, Browne AC. Discovering the modifiers in a terminology data set. Proc AMIA Annu Fall Symp. 1998:780-4.

McCray AT. The nature of lexical knowledge. Methods Inf Med. 1998 Nov;37(4-5):353-60.

McCray AT, Loane RF, Browne AC, Bangalore AK. Terminology issues in user access to Web-based medical information. Proc AMIA Annu Fall Symp. 1999:107-11.

IV.C. SEMANTICS OF THE METATHESAURUS

Nelson SJ, Tuttle MS, Cole WG, Sherertz DD, Sperzel WD, Erlbaum MS, Fuller LL, Olson NE. From meaning to term: semantic locality in the UMLS Metathesaurus. Proc Annu Symp Comput Appl Med Care. 1991:209-13.

Tuttle MS, Cole WG, Sherertz DD, Nelson SJ. Navigating to knowledge. Methods Inf Med. 1995 Mar;34(1-2):214-31.

McCray AT, Nelson SJ. The representation of meaning in the UMLS. Methods Inf Med. 1995 Mar;34(1-2):193-201.

Hole WT, Srinivasan S. Discovery of missed synonymy in a large concept-oriented Metathesaurus. Proc AMIA Annu Fall Symp. 2000:354-358.

IV.D. THE SEMANTIC NETWORK

McCray AT. UMLS semantic network. Proc Annu Symp Comput Appl Med Care. 1989:503-7.

McCray AT, Hole WT. The scope and structure of the first version of the UMLS Semantic Network. Proc Annu Symp Comput Appl Med Care. 1990:126-3

Tuttle MS, Nelson SJ, Fuller LF, Sherertz DD, Erlbaum MS, Sperzel WD, Olson NE, Suarez-Munist ON. The semantic foundations of the UMLS metathesaurus. Medinfo 1992:1506-11.


TABLE 1

SYNONYMOUS TERMS AND THEIR SOURCES


  TERM            SOURCE     TERM TYPE   CODE
  Zea mays        MTH        PN          NOCODE
  Zea mays        SNMI98     PT          L-DB941
  Zea mays        CSP2000    ET          2340-8793
  ZEA MAYS        NDDF00     IN          006695
  Corn <1>        MTH        MM          U000077
  Corn            MSH2000    MH          D003313
  Corn            SNMI98     SY          L-DB941
  Corn            LCH90      PT          U001161
  corn            AOD99      DE          0000013135
  corn            CSP2000    PT          2340-8793
  Indian Corn     MSH2000    EP          D003313
  Corn, Indian    MSH2000    PM          D003313
  Maize           MSH2000    EP          D003313
  Maize           SNMI98     SY          L-DB941
  maize           AOD99      NP          0000026523
  maize           CSP2000    ET          2340-8793

Material drawn from the 2000 UMLS Metathesaurus. The meaning of the Source Abbreviations and Term Types can be obtained by reviewing the UMLS documentation.


TABLE 2

RELATIONSHIPS WITHIN THE UMLS METATHESAURUS


RELATIONSHIP          DEFINITION
Broader (RB)          Has a meaning which includes that of the concept.
Narrower (RN)         Has a meaning which is included in that of the concept.
Other related (RO)    Has a relationship other than synonymous, narrower, or broader.
LIKE (RL)             The two concepts are similar or "alike". In the current edition of the Metathesaurus, most relationships with this attribute link MeSH supplementary concepts which are largely chemicals. Many of the concepts linked by this relationship may be synonymous and will be in a single concept identifier in future editions of the Metathesaurus. Source-specific mappings from one vocabulary to another also have this relationship, along with the label for the relationship attribute of "mapped_to".
Parent (PAR)          Is a parent in a hierarchy of a Metathesaurus source vocabulary.
Child (CHD)           Is a child in a hierarchy of a Metathesaurus source vocabulary.
Sibling (SIB)         Shares a parent in a hierarchy in a Metathesaurus source vocabulary.
AQ                    Is an allowed qualifier for a concept in a Metathesaurus source vocabulary.
QB                    Can be qualified by a concept in a Metathesaurus source vocabulary.


TABLE 3

REPRESENTATIVE HIERARCHICAL CONTEXTS FOR THE CONCEPT

"FRUIT"


ALCOHOL AND OTHER DRUG THESAURUS

technology, safety and accidents
  technology, manufacturing, and agriculture
    food product
      beverage +
      chocolate
      dairy product +
      fish, shellfish, other seafood
      food product ingredient
      fruits
      grain, cereal +
      junk food
      meat, poultry, eggs +
      nut, seed
      pastry, sweets
      vegetables
      vitamin supplement


CRISP THESAURUS

food science/technology
  food
    animal food
    baby food
    beverage +
    dairy product +
    fruit
      tomato
    grain +
    meat
    poultry product +
    seafood
    vegetable +


MESH

Organisms (MeSH Category)
  Plants
    Plant Components
      Fruit
      Nuts
      Plant Epidermis
      Plant Roots +
      Plant Shoots +
      Pollen
      Seeds


MESH

Technology, Food and Beverages (MeSH Category)
  Food and Beverages
    Food
      Bread
      Candy +
      Cereals +
      Condiments +
      Crops, Agricultural +
      Dairy Products +
      Dietary Fats +
      Dietary Fiber
      Dietary Supplements
      Eggs +
      Flour
      Food Additives +
      Foods, Specialized +
      Fruit
        Citrus
        Coconut
      Honey
      Meat +
      Micronutrients
      Molasses
      Nuts +
      Vegetables

Concept name bolded for readability.
Contexts drawn from 2000 UMLS Metathesaurus.
+ Indicates that the concept has children not shown.


FIGURE 1
A PORTION OF THE UMLS SEMANTIC NETWORK


[Figure image: Semantic Network]
* Additional children not shown

Last reviewed: 18 May 2006
Last updated: 18 May 2006
First published: 08 January 2001