
White Paper: UMLS® Metathesaurus® Rich Release (MR+) Format

8/8/2003

NOTE: This paper contains references to Sections of the 2003AC UMLS documentation.

  1. Introduction

    Effective with the 2003AC release of the UMLS Metathesaurus in November 2003, the release file structure will be substantially expanded. The old file structure will continue to be available as an output option of MetamorphoSys, the tool to customize and create subsets of the Metathesaurus that is distributed with the UMLS files (Section 2.8).

    MetamorphoSys will have several file output options: the current relational file formats, the new expanded relational file formats, and (in 2004) XML formats.

    SNOMED CT® will be released in the first 2004 release (2004AA), in the new Rich Release Format.

    Detailed documentation of the new rich release file structure is available here.

    Sample files are available, and Java APIs representing the Rich Release Format Object model of the Metathesaurus will be available; API documentation is available here.



  2. General Description of Additions to the File Structure

    Additional fields will be added to many Metathesaurus files. Several new relational files will be added. Three existing relational files (MRCON, MRSO and MRATX - see below) will be deprecated - that is, their continued use is NOT recommended, but they will still be available as an output of MetamorphoSys so that existing applications do not break. The specific changes to the release files are also described in detail.

    1. Purpose of the Additions

      All of the additions are designed to make it easier for applications developers to customize the UMLS Metathesaurus for particular applications and to maintain these applications appropriately as the source vocabularies in the Metathesaurus are updated and new versions of the Metathesaurus appear. In particular, the additions to the format will:

      1. Simplify extraction of particular source vocabularies and groups of vocabularies useful for particular purposes (e.g., clinical applications, natural language processing).

        As described in UMLS training resources, the Metathesaurus almost always requires customization for particular applications. A very common method of customization is by source. In the new file structure, a new table, MRCONSO, will combine and expand the concept and vocabulary source information from the existing MRCON and MRSO files, thus eliminating the need to join tables to select concepts and terms from particular sources. This file will have rows and identifiers for every occurrence of every string in every source. For example, if 3 different sources contain the exact string "Atrial Fibrillation", there will be 3 rows for that string in MRCONSO. In addition to the Metathesaurus concept (CUI), term (LUI), and string (SUI) identifiers, each row will have a unique Metathesaurus "atom" identifier (AUI) that identifies that occurrence of the string in that source. The addition of the AUI to other Metathesaurus files will also facilitate customization by source.
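        The MRCONSO description above can be illustrated with a minimal sketch. The row layout here is a simplified assumption that carries only the identifiers named in the text plus a source label and the string; it is not the actual MRCONSO column layout. All identifiers and source names (SRC_A, SRC_B, SRC_C) are made up, and the helper atoms_from_sources is hypothetical.

```python
from collections import namedtuple

# Simplified, hypothetical row: only the identifiers named in the text,
# plus a source label and the string. The real file has a different layout.
Atom = namedtuple("Atom", ["cui", "lui", "sui", "aui", "source", "string"])

# Made-up identifiers and source names, for illustration only.
rows = [
    Atom("C0000001", "L0000001", "S0000001", "A0000001", "SRC_A", "Atrial Fibrillation"),
    Atom("C0000001", "L0000001", "S0000001", "A0000002", "SRC_B", "Atrial Fibrillation"),
    Atom("C0000001", "L0000001", "S0000001", "A0000003", "SRC_C", "Atrial Fibrillation"),
    Atom("C0000002", "L0000002", "S0000002", "A0000004", "SRC_A", "Heart"),
]


def atoms_from_sources(rows, wanted_sources):
    """Select rows by source in a single pass over one table (no join)."""
    return [row for row in rows if row.source in wanted_sources]


# Customization by source: keep only two of the three sources.
subset = atoms_from_sources(rows, {"SRC_A", "SRC_C"})

# The same string in three sources yields three rows with three distinct AUIs.
same_string = [row for row in rows if row.string == "Atrial Fibrillation"]
```

        In this sketch the three "Atrial Fibrillation" rows share one CUI, LUI, and SUI, and differ only in their AUI and source.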

        A new "Content View Flag" will be added to many tables to allow easier extraction of the vocabularies - as well as the specific concept names, relationships, and attributes - believed to be useful for particular purposes, e.g., for natural language processing. Customization by source alone is usually insufficient to eliminate content that is superfluous or detrimental to certain applications, e.g., obsolete terms, terms that lack face validity, inappropriate hierarchical relationships. The number of content views will be expanded over time based on input from UMLS users.

      2. Provide complete "source transparency" - that is, make it possible to extract any source vocabulary from the Metathesaurus and demonstrate that there is no information loss from the original source input.

        As emphasized in Section 2, the Metathesaurus has always endeavored to preserve the meanings, attributes, hierarchical connections, and other relationships between terms present in its source vocabularies. The existing concept-oriented distribution file format accurately preserves meanings, attributes, and relationships between concepts. However, by representing relationships at the conceptual level only, it obscures some relationships that are not concept-oriented and, in some cases, makes it difficult to generate completely accurate source hierarchies.

        Additional source-specific information needed to correct this situation (e.g., the previously described Metathesaurus "atom identifier" (AUI) for each occurrence of each string in each source) is already present in the internal system that NLM uses to maintain the Metathesaurus. Although such information is used to aid Metathesaurus construction, it has not previously been distributed in the Metathesaurus release files. Expansion of the Metathesaurus distribution formats to include this information will enable accurate representation of all intra-source relationships, including novel types of relationships present in SNOMED CT and the NCI Thesaurus, but not in other source vocabularies. NLM believes that the benefits in source transparency will far outweigh the costs in file size and complexity - especially since UMLS users will be able to employ MetamorphoSys to generate the previous file formats. Additional "atomic" level data will be added to many of the Metathesaurus release files. There will also be a more consistent and explicit approach to labeling source-asserted identifiers and source-asserted relationship directionality.
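        A small sketch may help show why concept-level relationships alone can obscure a source hierarchy. It assumes a toy model in which two terms from one source fall in the same concept but have different parents in that source. The data, the identifiers, and the helper functions are all hypothetical and do not reflect the actual release files.

```python
# Hypothetical data: every identifier below is made up for illustration.
# A1 and A2 are two terms that the Metathesaurus places in the same concept.
atom_to_concept = {"A1": "C1", "A2": "C1", "A3": "C2", "A4": "C3"}

# Parent links as one source might assert them, atom to atom: (child, parent).
atom_links = [("A1", "A3"), ("A2", "A4")]


def collapse_to_concepts(atom_links, atom_to_concept):
    """Rewrite each atom-level link as a concept-level link."""
    return {(atom_to_concept[child], atom_to_concept[parent])
            for child, parent in atom_links}


def parents_of_atom(atom_links, aui):
    """Return the parents asserted for one specific atom."""
    return [parent for child, parent in atom_links if child == aui]


concept_links = collapse_to_concepts(atom_links, atom_to_concept)
a1_parents = parents_of_atom(atom_links, "A1")
```

        After the collapse, concept C1 simply has parents C2 and C3, and the information about which term had which parent is gone. With the atom-level links retained, the parent of A1 can still be recovered exactly.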

        "Source transparency" ensures that there is no information loss when a vocabulary is inserted in the Metathesaurus. It does NOT mean that the Metathesaurus will reproduce the original file formats of each of its source vocabularies. The Metathesaurus will continue to provide all of its source vocabularies in a common, fully-specified format.

      3. Enable production of complete "change sets" for each new version of the UMLS Metathesaurus.

        The Metathesaurus release format already includes files that track the disappearance of concepts and strings from the Metathesaurus between versions and, in the case of concept identifiers, over most of the history of the Metathesaurus. However, the current release format does not allow easy detection of other types of changes in the Metathesaurus, such as the addition or disappearance of specific relationships and attributes.

        In addition to the "atom identifiers" and other source specific identifiers described above, persistent Metathesaurus identifiers will be added for all relationships (RUI) and all attributes (ATUI) released in the Metathesaurus. The continued existence of these identifiers will indicate content that is unchanged across versions of the Metathesaurus. The appearance or disappearance of these identifiers will signal change. This will enable generation of complete Metathesaurus change sets, which will provide a simpler method for updating applications as new releases of the Metathesaurus are issued.
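        The change-set idea above reduces to set comparison on persistent identifiers. The following is a minimal sketch, assuming the identifiers from two releases have already been read into lists. The function change_set and the sample RUI values are hypothetical.

```python
def change_set(old_ids, new_ids):
    """Compare persistent identifiers from two releases.

    Identifiers present in both releases mark unchanged content; identifiers
    present in only one release mark additions or removals.
    """
    old_ids, new_ids = set(old_ids), set(new_ids)
    return {
        "added": sorted(new_ids - old_ids),
        "removed": sorted(old_ids - new_ids),
        "unchanged": sorted(old_ids & new_ids),
    }


# Made-up relationship identifiers (RUIs) from two hypothetical releases.
old_ruis = ["R001", "R002", "R003"]
new_ruis = ["R002", "R003", "R004"]

rel_changes = change_set(old_ruis, new_ruis)
```

        The same comparison would apply to attribute identifiers (ATUI) or atom identifiers (AUI), since it relies only on the identifiers being persistent across versions.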

      4. Provide enhanced ability to create and distribute robust, purpose-specific mappings between different source vocabularies and classifications within the Metathesaurus.

        Although the current Metathesaurus release format can represent one-to-one, one-to-many, and one-to-Boolean expression mappings, the more complex mappings are cumbersome to maintain and to use, and the format does not accommodate rule-based mappings.

        In the new release format, the Associated Expressions file (MRATX) will be deprecated in favor of a new mappings file (MRMAP), which will have a more robust structure for representing simple, complex, and rule-based mappings using Metathesaurus or source-asserted unique identifiers.

      5. Provide enhanced documentation of the Metathesaurus file formats.

        A new file (MRDOC) will list all possible values for fields containing a finite set of such values, e.g., TTY, ATN, TS, STT, REL, RELA. By joining this file with MRCOLS, a user will be able to identify which files contain these fields (columns).
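        The join described above can be sketched as follows. Both tables are reduced to two-element rows: (field, value) for MRDOC and (column, file) for MRCOLS. This reduction is an assumption made for illustration and not the actual layout of either file. The values (VALUE_1, ...), the file names (FILE_A, FILE_B), and the function files_by_field are placeholders.

```python
from collections import defaultdict

# Hypothetical, simplified rows. Real MRDOC and MRCOLS rows differ.
mrdoc = [("TTY", "VALUE_1"), ("TTY", "VALUE_2"), ("REL", "VALUE_3")]
mrcols = [("TTY", "FILE_A"), ("REL", "FILE_B"), ("REL", "FILE_A"), ("CUI", "FILE_A")]


def files_by_field(mrdoc, mrcols):
    """Join the two tables on the field (column) name.

    For each field that has enumerated values, report those values together
    with the files that contain a column of that name.
    """
    values = defaultdict(list)
    for field, value in mrdoc:
        values[field].append(value)

    joined = {}
    for column, file_name in mrcols:
        if column in values:  # skip columns that have no enumerated values
            entry = joined.setdefault(
                column, {"values": values[column], "files": []}
            )
            entry["files"].append(file_name)
    return joined


field_index = files_by_field(mrdoc, mrcols)
```

        Columns with no enumerated values (CUI in this sketch) drop out of the result, as they would in an inner join.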

    2. New Object Model of the Metathesaurus

      A standard model has been defined for the objects in the Metathesaurus such as concepts, attributes, relationships, etc. A reference implementation in Java along with associated Javadoc documentation will be made available with the UMLS Knowledge Sources.

      MetamorphoSys has been re-written to use this model internally and will be able to consume or produce representations of these objects in either the MR+ or, in 2004, serialized XML formats. The UMLS Knowledge Source Server (KSS) will eventually support this model for an API to the Metathesaurus.




Last reviewed: 29 January 2008
Last updated: 29 January 2008
First published: 08 August 2003
Permanence level: Permanence Not Guaranteed