8/8/2003
NOTE: This paper contains references to Sections of the 2003AC UMLS documentation.
- Introduction
Effective with the 2003AC release of the UMLS Metathesaurus in November
2003, the release file structure will be substantially expanded. The
old file structure will continue to be available as an output option of
MetamorphoSys, the tool to customize and create subsets of the
Metathesaurus that is distributed with the UMLS files (Section 2.8).
MetamorphoSys will have several file output options: the current
relational file formats, the new expanded relational file formats, and
(in 2004) XML formats.
SNOMED CT® will be released in the first 2004 release (2004AA), in the new Rich Release Format.
Detailed documentation of the new rich release file structure is
available here.
Sample files are available, and Java APIs representing the Rich Release Format Object model of the
Metathesaurus will be available; API documentation is available here.
- General Description of Additions to the File Structure
Additional fields will be added to many Metathesaurus files. Several
new relational files will be added. Three existing relational files
(MRCON, MRSO and MRATX - see below) will be deprecated - that is, their
continued use is NOT recommended, but they will still be available as
an output of MetamorphoSys so that existing applications do not break.
The specific changes to the release files are also described in detail.
- Purpose of the Additions
All of the additions are designed to make it easier for application
developers to customize the UMLS Metathesaurus for particular
applications and to maintain these applications appropriately as the
source vocabularies in the Metathesaurus are updated and new versions
of the Metathesaurus appear. In particular, the additions to the
format will:
- Simplify extraction of particular source vocabularies and groups of
vocabularies useful for particular purposes (e.g., clinical
applications, natural language processing).
As described in UMLS training resources, the Metathesaurus almost always requires
customization for particular applications. A very common method of
customization is by source. In the new file structure, a new table,
MRCONSO, will combine and expand the concept and vocabulary source
information from the existing MRCON and MRSO files - thus eliminating
the need to join tables to select concepts and terms from particular
sources. This file will have rows and identifiers for every occurrence
of every string in every source. For example, if 3 different sources
contain the exact string "Atrial Fibrillation" there will be 3 rows for
that string in MRCONSO, and, in addition to the Metathesaurus concept
(CUI), term (LUI), and string (SUI) identifiers, each row will have a
unique Metathesaurus "atom" identifier (AUI) for each occurrence of
each string in each source. The addition of the AUI to other
Metathesaurus files will also facilitate customization by source.
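The single-table selection described above can be sketched as follows. This is a minimal illustration, not the released layout: the pipe-delimited rows, the reduced column list, the source abbreviation field name (SAB), the source names (SRC_A, SRC_B, SRC_C), and all identifier values are assumptions made up for the example. The actual column order should be taken from the release documentation.

```python
import csv
import io

# Hypothetical, simplified MRCONSO-like rows (assumed layout: CUI|LUI|SUI|AUI|SAB|STR|).
# Identifiers and source names are invented for illustration only.
SAMPLE = (
    "C0000001|L0000001|S0000001|A0000001|SRC_A|Atrial Fibrillation|\n"
    "C0000001|L0000001|S0000001|A0000002|SRC_B|Atrial Fibrillation|\n"
    "C0000001|L0000001|S0000001|A0000003|SRC_C|Atrial Fibrillation|\n"
    "C0000001|L0000002|S0000002|A0000004|SRC_A|Auricular Fibrillation|\n"
)
FIELDS = ["CUI", "LUI", "SUI", "AUI", "SAB", "STR"]


def read_rows(text):
    """Yield one dict per row; the empty field after the trailing pipe is dropped."""
    reader = csv.reader(io.StringIO(text), delimiter="|", quoting=csv.QUOTE_NONE)
    for record in reader:
        yield dict(zip(FIELDS, record))


def atoms_from_sources(text, sources):
    """Select atoms by source with a single pass over one table (no join)."""
    return [row for row in read_rows(text) if row["SAB"] in sources]


selected = atoms_from_sources(SAMPLE, {"SRC_A"})
for row in selected:
    print(row["AUI"], row["STR"])
```

In this sample the string "Atrial Fibrillation" occurs in three sources, so it has three rows that share one CUI, LUI, and SUI but carry three distinct AUIs.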
A new "Content View Flag" will be added to many tables to allow easier
extraction of the vocabularies - as well as the specific concept names,
relationships, and attributes - believed to be useful for particular
purposes, for example, natural language processing.
Customization by source alone is usually insufficient to eliminate
content that is superfluous or detrimental to certain applications,
e.g., obsolete terms, terms that lack face validity, inappropriate
hierarchical relationships. The number of content views will be
expanded over time based on input from UMLS users.
- Provide complete "source transparency" - that is, make it possible
to extract any source vocabulary from the Metathesaurus and demonstrate
that there is no information loss from the original source input.
As emphasized in Section 2, the Metathesaurus has
always endeavored to preserve the meanings, attributes, hierarchical
connections, and other relationships between terms present in its
source vocabularies. The existing concept-oriented distribution file
format accurately preserves meanings, attributes, and relationships
between concepts. However, by representing relationships at the
conceptual level only, it obscures some relationships that are not
concept-oriented and, in some cases, makes it difficult to generate
completely accurate source hierarchies.
Additional source-specific information needed to correct this situation
(e.g., the previously described Metathesaurus "atom identifier" (AUI)
for each occurrence of each string in each source) is already present
in the internal system that NLM uses to maintain the Metathesaurus.
Although such information is used to aid Metathesaurus construction, it
has not previously been distributed in the Metathesaurus release files.
Expansion of the Metathesaurus distribution formats to include this
information will enable accurate representation of all intra-source
relationships, including novel types of relationships present in SNOMED
CT and the NCI Thesaurus, but not in other source vocabularies. NLM
believes that the benefits in source transparency will far outweigh the
costs in file size and complexity - especially since UMLS users will be
able to employ MetamorphoSys to generate the previous file formats.
Additional "atomic" level data will be added to many of the Metathesaurus
release files. There will also be a more consistent and explicit
approach to labeling source-asserted identifiers and source-asserted
relationship directionality.
"Source transparency" ensures that there is no information loss when a
vocabulary is inserted in the Metathesaurus. It does NOT mean that the
Metathesaurus will reproduce the original file formats of each of its
source vocabularies. The Metathesaurus will continue to provide all of
its source vocabularies in a common, fully-specified format.
- Enable production of complete "change sets" for each new version of
the UMLS Metathesaurus.
The Metathesaurus release format already includes files that track the
disappearance of concepts and strings from the Metathesaurus between
versions and, in the case of concept identifiers, over most of the
history of the Metathesaurus. However, the current release format does
not allow easy detection of other types of changes in the
Metathesaurus, such as the addition or disappearance of specific
relationships and attributes.
In addition to the "atom identifiers" and other source specific
identifiers described above, persistent Metathesaurus identifiers will
be added for all relationships (RUI) and all attributes (ATUI) released
in the Metathesaurus. The continued existence of these identifiers
will indicate content that is unchanged across versions of the
Metathesaurus. The appearance or disappearance of these identifiers
will signal change. This will enable generation of complete
Metathesaurus change sets, which will provide a simpler method for
updating applications as new releases of the Metathesaurus are issued.
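Because the identifiers are persistent, a change set reduces to a set comparison between two releases. The sketch below assumes the RUI (or ATUI) values have already been read from each release; the function name and the identifier values are hypothetical.

```python
def change_set(old_ids, new_ids):
    """Compare persistent identifiers (e.g., RUIs or ATUIs) from two releases."""
    old_ids, new_ids = set(old_ids), set(new_ids)
    return {
        "added": sorted(new_ids - old_ids),      # present only in the new release
        "removed": sorted(old_ids - new_ids),    # present only in the old release
        "unchanged": sorted(old_ids & new_ids),  # identifier persists across releases
    }


# Made-up relationship identifiers for two consecutive releases.
old_ruis = ["R0000001", "R0000002", "R0000003"]
new_ruis = ["R0000002", "R0000003", "R0000004"]

diff = change_set(old_ruis, new_ruis)
print(diff)
```

The same comparison applies to any of the persistent identifiers, so an application could update only the rows listed under "added" and "removed" rather than reloading a full release.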
- Provide enhanced ability to create and distribute robust,
purpose-specific mappings between different source vocabularies and
classifications within the Metathesaurus.
Although the current Metathesaurus release format can represent
one-to-one, one-to-many, and one-to-Boolean expression mappings, the
more complex mappings are cumbersome to maintain and to use, and the
format does not accommodate rule-based mappings.
In the new release format, the Associated Expressions file (MRATX) will
be deprecated in favor of a new mappings file (MRMAP), which will have
a more robust structure for representing simple, complex, and
rule-based mappings using Metathesaurus or source-asserted unique
identifiers.
- Provide enhanced documentation of the Metathesaurus file formats.
A new file (MRDOC) will list all possible values for fields containing
a finite set of such values, e.g., TTY, ATN, TS, STT, REL, RELA. By
joining this file with MRCOLS, a user will be able to identify which files
contain these fields (columns).
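The join described above might look like the following sketch. It assumes each MRDOC row can be reduced to a (field, value) pair and each MRCOLS row to a (column, file) pair; the real files carry more columns, and the sample values and file assignments here are illustrative only.

```python
from collections import defaultdict

# Hypothetical reduced rows: MRDOC as (field, allowed value),
# MRCOLS as (column name, file containing that column).
mrdoc = [("TTY", "PT"), ("TTY", "SY"), ("REL", "PAR"), ("REL", "CHD")]
mrcols = [("TTY", "MRCONSO"), ("REL", "MRREL"), ("CUI", "MRCONSO"), ("CUI", "MRREL")]


def files_by_enumerated_field(mrdoc_rows, mrcols_rows):
    """Join on the field name: for each field listed in MRDOC, find its files."""
    enumerated = {field for field, _ in mrdoc_rows}
    result = defaultdict(set)
    for column, file_name in mrcols_rows:
        if column in enumerated:
            result[column].add(file_name)
    return {field: sorted(files) for field, files in result.items()}


lookup = files_by_enumerated_field(mrdoc, mrcols)
print(lookup)
```

Fields such as CUI that have no enumerated values in MRDOC drop out of the result, which leaves only the columns whose contents can be validated against a finite value list.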
- New Object Model of the Metathesaurus
A standard model has been defined for the objects in the Metathesaurus
such as concepts, attributes, relationships, etc. A reference
implementation in Java along with associated Javadoc documentation will
be made available with the UMLS Knowledge Sources.
MetamorphoSys has been re-written to use this model internally and will
be able to consume or produce representations of these objects in
either the MR+ or, in 2004, serialized XML formats. The UMLS Knowledge
Source Server (KSS) will eventually support this model for an API to
the Metathesaurus.