An Interlingual Database of MeSH Translations

Stuart J. Nelson, MD [1], Michael Schopen, MD [2],
Jacque-Lynne Schulman [1], and Natalie Arluk [1]
1. National Library of Medicine, Bethesda, MD, USA
2. Deutsches Institut für Medizinische Dokumentation und Information, Köln, Deutschland


ABSTRACT

The National Library of Medicine produces the Medical Subject Headings (MeSH®) with an annual update cycle. Many interested parties produce translations of MeSH to make the vocabulary useful for non-English users. Some have been issued annually; others irregularly. The translations of MeSH into German, French, Spanish, Portuguese, Italian, Finnish, and Russian have been included in the year 2000 version of the Unified Medical Language System (UMLS®) Metathesaurus®.

MeSH translators have encountered difficulties with entry vocabulary as they maintain and update their translations to reflect changes in the annual version of MeSH. An entry term might move from one main heading to another main heading, or, more commonly, an entry term might become a new main heading. Translators have difficulty tracking these changes. Another problem arises for terms in other languages that have no exact English equivalent. In that case it may not be possible to identify the correct mapping to the MeSH descriptor or to concepts in other vocabulary databases, such as the UMLS Metathesaurus.

NLM has developed and implemented a concept-centered vocabulary maintenance system for MeSH. While remaining invisible to the users of the system, it enables a better understanding of the role of MeSH and of the composition of the thesaurus, and provides a useful method of representing the relationships among its terms. Each main heading is a descriptor class, composed of one or more concepts closely linked in meaning. This system is being extended to create an interlingual database of translations. Each translated term is identified as a name of an existing concept, or as the name of a new concept created within the descriptor class. This database allows continual updating of the translations and facilitates tracking of the changes within MeSH from one year to another.

Issues in implementing this database include defining the character set, defining privileges for modifying the files, intellectual property rights, and quality control. We anticipate that this interlingual database will facilitate effective use of the Medical Subject Headings by many international users and will facilitate incorporation of MeSH translations in the UMLS Metathesaurus.

BACKGROUND

The National Library of Medicine (NLM) has produced the MEDLINE database since 1966. The MEDLINE database includes over 10 million literature citations of articles written in 41 languages. Each article is indexed with MeSH descriptors assigned by an individual who reads the article in its original language and assigns the headings to indicate what the article is about. MeSH is now in its 40th year of production, and is added to and otherwise modified on an annual basis. These modifications are then applied automatically to the MEDLINE database; articles are not reindexed, but the database is kept consistent with the current version of MeSH.

The International MEDLARS Centers, including those in Germany, Japan, Brazil, and France, have long produced translations of MeSH to make the vocabulary useful for non-English users. These translations were generally not in machine-readable form and varied in frequency of appearance. Some translations were issued annually and others irregularly. Translations of MeSH were also provided by national medical information centers in Polish, Romanian, Arabic, Greek, Dutch, and other languages. The translations enable users not facile in English to identify articles which are of sufficient potential interest to warrant reading.

In 1986, the National Library of Medicine began the UMLS project, a long-term effort to improve retrieval of electronically available information from a variety of sources. The UMLS Metathesaurus, a large database of naming information encompassing terms and concepts from more than 50 biomedical vocabularies and classifications, has been built as part of this project, and continues to be updated on an annual basis. Since the inception of the Metathesaurus, MeSH has been one of the principal vocabularies. The Metathesaurus is organized around concepts. This concept-oriented approach allows direct linking of all terms from various vocabularies and languages that have the same meaning.

A number of translations of MeSH are included in the UMLS Metathesaurus. The 2000 Metathesaurus includes translations into German (provided by DIMDI), French (INSERM), Portuguese (BIREME), Spanish (BIREME), Russian (Central Medical Library, Moscow), Italian (Istituto Superiore di Sanità), and Finnish (the Finnish Medical Society). MeSH has also been fully or partly translated into Dutch, Slovak, Slovene, Swedish, Norwegian, Turkish, Chinese, and Arabic.

PROBLEMS NOTED IN MAINTAINING TRANSLATIONS

Until the recent development of the new MeSH maintenance system, the internal MeSH structure was based on terms, not concepts. Each term was related to the main heading term, but was often not synonymous with it. Unless a translation of an entry term was explicitly linked to a MeSH term, it was difficult for UMLS editors to know which concept a translated term named. Furthermore, entry vocabulary within MeSH may be moved from one subject heading to another, or, more commonly, elevated to the status of a subject heading itself. For example, in the 2001 MeSH the term "Cerebrum" will move from its place as an entry term to "Brain" to the position of entry term to "Telencephalon". Because this type of information is not readily apparent to translators, it has often been difficult to track where a given translated term belongs.
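The kind of year-to-year tracking described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that each annual version of MeSH is reduced to a mapping from entry term to main heading; the function name `track_term_changes` and the "Hypothetical" entries are inventions for this example, and only the "Cerebrum" move comes from the text.

```python
def track_term_changes(previous, current):
    """Compare two {entry term: main heading} mappings from consecutive years.

    Returns {term: (kind, old_heading, new_heading)} for every term whose
    placement changed. This is an illustrative sketch, not NLM's procedure.
    """
    changes = {}
    for term, old_heading in previous.items():
        new_heading = current.get(term)
        if new_heading is None:
            # The term no longer appears in the newer version.
            changes[term] = ("dropped", old_heading, None)
        elif new_heading == term and old_heading != term:
            # The entry term has become a main heading in its own right.
            changes[term] = ("promoted", old_heading, term)
        elif new_heading != old_heading:
            # The entry term now belongs to a different main heading.
            changes[term] = ("moved", old_heading, new_heading)
    return changes


# "Cerebrum" follows the example in the text; the other entry is hypothetical.
mesh_2000 = {"Cerebrum": "Brain", "Hypothetical Term": "Hypothetical Heading"}
mesh_2001 = {"Cerebrum": "Telencephalon", "Hypothetical Term": "Hypothetical Term"}

changes = track_term_changes(mesh_2000, mesh_2001)
for term, (kind, old, new) in sorted(changes.items()):
    print(f"{term}: {kind} from {old!r} to {new!r}")
```

A translator holding a term linked to "Cerebrum" could use such a report to see that the link now leads to a different main heading.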

In each of these situations, the term-based environment in which MeSH was maintained limited effective and efficient maintenance of the translations. Unless there were specific links between the translated terms and English terms in MeSH, maintaining appropriate representation in the Metathesaurus required considerable effort on the part of the translators or bilingual editors.

Another area of difficulty was that of a term in a language other than English for which there was no exact English equivalent. While an original translation of all of MeSH might link that term with the appropriate subject heading, with modifications of MeSH the linkage of that translated term might no longer be appropriate.

THE NEW MESH MAINTENANCE ENVIRONMENT

As part of the reinvention efforts at the National Library of Medicine, the MeSH maintenance environment was moved off the mainframe computer. In making that move, the database was redesigned, with the aim of making the MeSH database more compatible with the UMLS Metathesaurus. In doing so, the familiar term-based system was changed to a concept-oriented system. The two-year project included the design and development of a new data structure and the migration of all vocabulary data from the old Model 204 maintenance environment to an Oracle-based client-server system.

The new structure is centered on descriptors, concepts, and terms rather than just descriptors and terms. Our understanding of what a descriptor consists of has been refined. A descriptor is now viewed as a class of concepts, and a concept as a class of synonymous terms within a descriptor class.

As an example of how the modifications are represented in the structure, consider the Main Heading, AIDS DEMENTIA COMPLEX. Under the old term-based system, AIDS DEMENTIA COMPLEX had six print entry terms:

AIDS DEMENTIA COMPLEX
HIV Dementia (equivalent)
HIV-Associated Cognitive Motor Complex (equivalent)
Dementia Complex, AIDS-Related (equivalent)
HIV Encephalopathy (narrower)
AIDS Encephalopathy (narrower)
HIV-1-Associated Cognitive Motor Complex (related)

Each term's relationship to the descriptor is listed in parentheses. But there was no way to tell the relationship of the narrower entry terms to each other. In the concept-oriented structure we have:

AIDS DEMENTIA COMPLEX [Descriptor Class]
Concept Class I - Preferred Concept
Terms: AIDS Dementia Complex (Preferred Term)
  HIV Dementia
  HIV-Associated Cognitive Motor Complex
  Dementia Complex, AIDS-Related
Concept Class II - Subordinate Concept (narrower)
Terms: HIV Encephalopathy (Preferred Term)
  AIDS Encephalopathy
Concept Class III - Subordinate Concept (related)
Terms: HIV-1-Associated Cognitive Motor Complex (Preferred Term)

It can be seen that concept classes II and III are, respectively, narrower than and related to concept class I (the preferred concept), but are not equivalent to each other. Each concept class could be given its own definition if desired. It can also be seen that HIV Encephalopathy and AIDS Encephalopathy are synonymous terms within the same concept class.
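
The three-level structure in this example can be modeled in a few lines. The following is a minimal sketch, not the actual MeSH schema: the class names `Concept` and `DescriptorClass` and their fields are assumptions made for illustration, while the terms and relationships are those listed above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Concept:
    """A class of synonymous terms (hypothetical model, not the NLM schema)."""
    preferred_term: str
    synonyms: List[str] = field(default_factory=list)
    # None for the preferred concept; otherwise "broader", "narrower" or "related".
    relation: Optional[str] = None

    def terms(self):
        return [self.preferred_term] + self.synonyms


@dataclass
class DescriptorClass:
    """A class of concepts closely related in meaning."""
    preferred: Concept
    subordinates: List[Concept] = field(default_factory=list)

    @property
    def main_heading(self):
        # The preferred term of the preferred concept names the descriptor.
        return self.preferred.preferred_term

    def concept_of(self, term):
        for concept in [self.preferred] + self.subordinates:
            if term in concept.terms():
                return concept
        return None

    def synonymous(self, a, b):
        # Two terms are synonymous only if they name the same concept.
        ca, cb = self.concept_of(a), self.concept_of(b)
        return ca is not None and ca is cb


aids_dementia = DescriptorClass(
    preferred=Concept(
        "AIDS Dementia Complex",
        ["HIV Dementia", "HIV-Associated Cognitive Motor Complex",
         "Dementia Complex, AIDS-Related"],
    ),
    subordinates=[
        Concept("HIV Encephalopathy", ["AIDS Encephalopathy"], relation="narrower"),
        Concept("HIV-1-Associated Cognitive Motor Complex", relation="related"),
    ],
)

print(aids_dementia.main_heading)
print(aids_dementia.synonymous("HIV Encephalopathy", "AIDS Encephalopathy"))
print(aids_dementia.synonymous("HIV Dementia", "HIV Encephalopathy"))
```

In this model every term still leads to the same descriptor, so the traditional function of entry vocabulary is preserved, while synonymy is decided at the concept level.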

By using the concept as the key unit in the new structure, appropriate non-synonymous relationships can now be represented separately and finer shades of meaning disambiguated. Putting these non-synonymous concepts together into one descriptor class does not alter the traditional function of entry vocabulary.

A descriptor class consists of one or more concepts closely related to each other in meaning. For the purposes of indexing, retrieval, and organization of the literature, these concepts are best lumped together in one class. It has been recognized for some time that not every term that we might wish to explore is sufficiently distinct in meaning that it would serve well as a descriptor. For example, the NISO standard for Monolingual Thesauri talks of quasi-synonyms (terms that don't mean the same, like "roughness" and "smoothness", but are means of addressing the same underlying phenomenon). Entry terms like "Isometric Exercise" are narrower in meaning than the main heading "Exercise", but left in the descriptor class because of the overlap in meaning with another entry term, "Aerobic Exercise." The recognition of the nature of a descriptor as a class of concepts helps us to understand what we are dealing with.

MeSH Descriptors (main headings), Qualifiers, and Supplementary Concept Records still exist. Entry terms, whether printed in the MeSH Tools (Print Entry Terms) or not (Non-Print Entry Terms), still provide access points to the MeSH components. The change in data structure allows a greater degree of organization. Each descriptor class has a preferred concept. The term that names the preferred concept (the preferred term of the preferred concept) is referred to as the descriptor or as the main heading. Each of the subordinate concepts also will have a preferred term, as well as a labeled (broader, narrower, related) relationship to the preferred concept. Terms meaning the same (naming the same concept) are grouped together in the concept record. Attributes of descriptors (tree positions), of concepts (definitions and semantic types), and of terms (lexical tags) are all represented at the appropriate level in the data structure.

The above "related" relationship between a subordinate concept and the preferred concept is not to be confused with the relationship between descriptors described with the traditional thesaurus pointer "see related". Those "see related" relationships are between descriptors, and serve as pointers from one descriptor to other descriptors whose use should be considered in indexing or in searching. The "related" relationship between a subordinate concept and a preferred concept indicates that the same descriptor is to be used for both meanings.

The parent/child structure of the MeSH trees has not been changed. These relationships are between descriptors, not between concepts, although in many instances they can appear to be identical. In the past, the blurring of the significance of the hierarchical relationships has often led to confusion about the motivation for a given representation. Now, a simpler test should clarify the motivation for a given tree structure. A should be a parent to B in a tree if the answer to the following question is positive:

Should a search for documents dealing with A find documents dealing with B?
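
The test in the question above can be read as a statement about search expansion: a search on a heading also retrieves documents indexed under the headings below it in the tree. The following is a minimal sketch under that reading; the headings "A", "B", "C", "X", the document identifiers, and the function names are all placeholders invented for illustration.

```python
def descendants(tree, heading):
    """Return the heading together with every heading below it in the tree."""
    found = {heading}
    stack = [heading]
    while stack:
        for child in tree.get(stack.pop(), []):
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


def search(tree, index, heading):
    """Find documents indexed with the heading or any of its descendants."""
    wanted = descendants(tree, heading)
    return sorted(doc for doc, headings in index.items() if wanted & set(headings))


# Hypothetical tree: A is a parent of B, and B is a parent of C.
tree = {"A": ["B"], "B": ["C"]}
# Hypothetical index of documents to the headings assigned to them.
index = {"doc1": ["A"], "doc2": ["B"], "doc3": ["C"], "doc4": ["X"]}

print(search(tree, index, "A"))
print(search(tree, index, "B"))
```

If a search on A should not retrieve documents about B, then under this test B does not belong beneath A in the tree.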

A TRANSLATION DATABASE

With the new MeSH structure, it is relatively easy to conceive how a translator might work in MeSH. The translated terms would be included in the MeSH maintenance environment, as an extension of the current database. Translators could then use the maintenance environment to manage their translations.

Translated terms could be provided as synonyms to existing concepts. The database table for terms requires that each term carry an attribute identifying its language. If more than one character set is in use, the term might also carry an attribute identifying the character set in which it is presented.
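
A term table of this kind might look like the following minimal sketch. It uses SQLite from the Python standard library only for convenience; the table name, column names, concept identifier "C1", and language codes are assumptions made for illustration and do not describe the actual MeSH database. The German and French words are ordinary dictionary translations of "brain", not entries taken from any MeSH translation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE term (
        term_id    INTEGER PRIMARY KEY,
        concept_id TEXT NOT NULL,   -- the concept this term names
        name       TEXT NOT NULL,
        language   TEXT NOT NULL,   -- required for every term
        charset    TEXT             -- optional, only if several are in use
    )
    """
)

# Hypothetical rows: three synonymous terms attached to one concept.
rows = [
    ("C1", "Brain", "ENG", None),
    ("C1", "Gehirn", "GER", "ISO-8859-1"),
    ("C1", "Cerveau", "FRE", "ISO-8859-1"),
]
conn.executemany(
    "INSERT INTO term (concept_id, name, language, charset) VALUES (?, ?, ?, ?)",
    rows,
)


def terms_for(concept_id, language):
    """Return the names of a concept's terms in one language."""
    cursor = conn.execute(
        "SELECT name FROM term WHERE concept_id = ? AND language = ? ORDER BY name",
        (concept_id, language),
    )
    return [row[0] for row in cursor]


print(terms_for("C1", "GER"))
```

Because every translated term points at a concept rather than at an English string, a later change to the English terms does not break the link.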

For non-synonymous entry terms not present in the English version, but useful in the language of the translation, a new concept would be created by the translator. The concept would, of course, belong to a descriptor class, that of the main heading for which it was an appropriate entry term. In this case of a concept class for which there was no English synonym, a definition of the concept would be in order, so that translators using other languages might have the ability to include their terms in that concept class.

In order to avoid difficulties with trying to maintain interface clients, the interface is intended to be Web-based, with a variety of security measures to limit participation to authorized individuals. Java servlets, running on the Web server, would enable the transmission of the submitted information to the database server.

Privileges for translators would be limited to insertion of terms in their own language, and to creation of new subordinate concepts. When a new subordinate concept is created, the submission of a definition (in English) of the new concept would both support the translation of that term into other non-English languages and enable proper maintenance when that descriptor class is edited in the MeSH section. A member of the MeSH section at the National Library of Medicine would serve as coordinator to manage the review and quality control applied to all changes in MeSH before the changes become an official part of MeSH.
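
The privilege rules in this paragraph can be sketched as simple checks applied before a submission is queued for the coordinator. Everything named here (`Translator`, `SubmissionError`, `submit_term`, `submit_concept`, the status string, the language codes) is hypothetical and only illustrates the rules as stated; it is not the interface of the planned system.

```python
from dataclasses import dataclass

ALLOWED_RELATIONS = ("broader", "narrower", "related")


@dataclass
class Translator:
    name: str
    language: str  # the only language this translator may submit terms in


class SubmissionError(Exception):
    """Raised when a submission falls outside the translator's privileges."""


def submit_term(translator, term, language, concept_id):
    """Queue a translated term; allowed only in the translator's own language."""
    if language != translator.language:
        raise SubmissionError(
            f"{translator.name} may only insert terms in {translator.language}"
        )
    return {"kind": "term", "term": term, "language": language,
            "concept": concept_id, "status": "pending review"}


def submit_concept(translator, descriptor, relation, preferred_term, english_definition):
    """Queue a new subordinate concept; an English definition is required."""
    if relation not in ALLOWED_RELATIONS:
        raise SubmissionError(f"unknown relation: {relation!r}")
    if not english_definition.strip():
        raise SubmissionError("a new concept needs a definition in English")
    return {"kind": "concept", "descriptor": descriptor, "relation": relation,
            "preferred_term": preferred_term, "language": translator.language,
            "definition": english_definition, "status": "pending review"}


translator = Translator("example translator", "GER")
accepted = submit_term(translator, "example-term", "GER", "C1")
print(accepted["status"])
```

Nothing submitted this way takes effect immediately; each submission stays pending until the coordinator's review, matching the quality control step described above.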

To institute this database will require the agreement and cooperation of the translators. After that has been obtained, previous translations can be loaded into the database from the UMLS Metathesaurus. Translations that have not been previously included in the UMLS Metathesaurus will be dealt with on an individual basis. The translators would then be able to review areas in which the mapping from one term to another might be problematic, and to find the descriptors in MeSH for which there was no translated term.

COPYRIGHT ISSUES

The method of dealing with copyrights for MeSH translations is anticipated to be very similar to the way copyrights have been handled in the UMLS. Some of the material in the UMLS Metathesaurus is from copyrighted sources. If the licensee uses any data from the UMLS Metathesaurus:

a) the licensee is required to display in full specified wording in order that its users be made aware of these copyright constraints: "Some material in the UMLS Metathesaurus is from copyrighted sources of the respective copyright claimants. Users of the UMLS Metathesaurus are solely responsible for compliance with any copyright restrictions and are referred to the copyright notices appearing in the original sources, all of which are hereby incorporated by reference."

b) the licensee is prohibited from altering data obtained from the UMLS Metathesaurus, but may include data from other sources in applications that also contain UMLS data. The licensee may not imply in any way that data from other sources is part of the UMLS Metathesaurus or of any of its vocabulary sources.

c) the licensee is required to include in its applications identifiers from the UMLS Metathesaurus such that the original source vocabularies for any data obtained from the UMLS Metathesaurus can be determined by reference to a complete version of the UMLS Metathesaurus.

For material in the UMLS Metathesaurus obtained from some sources, additional restrictions may apply. Within the UMLS, each vocabulary producer decides the level of copyright that will be claimed or established for its data, based on the policies of its institution and on national patterns. There are presently three levels of copyright designated for licensees:

"Category 1: Licensee is prohibited from translating the vocabulary source into another language or from producing other derivative works based on this single vocabulary source.

"Category 2: All category 1 restrictions and Licensee is prohibited from using the vocabulary source in operational applications that create records or information containing data from the vocabulary source. Use for data creation research or product development is allowed.

"Category 3: Licensee's right to use material from the source vocabulary is restricted to internal use at the Licensee's site(s) for research, product development, and statistical analysis only. Internal use includes use by employees, faculty, and students of a single institution at multiple sites. Notwithstanding the foregoing, use by students is limited to doing research under the direct supervision of faculty. Internal research, product development, and statistical analysis use expressly excludes: use of material from these copyrighted sources in routine patient data creation; incorporation of material from these copyrighted sources in any publicly accessible computer-based information system or public electronic bulletin board including the Internet; publishing or translating or creating derivative works from material from these copyrighted sources; selling, leasing, licensing, or otherwise making available material from these copyrighted works to any unauthorized party; and copying for any purpose except for back up or archival purposes."

CHARACTER SET ISSUES

The Latin-1 character set accommodates the alphabets of most western European countries. Not coincidentally, these were the first translations of MeSH included in the UMLS. Inclusion of other translations not using the Latin alphabet, e.g., Russian, Arabic, and Japanese, becomes more problematic. In the case of Russian, we have been given versions transliterated by the Central Medical Library for inclusion in the UMLS.

At present, the character set used depends on the operating system and the coding scheme it uses for the language. Knowing the coding scheme, it is often possible to find a set of fonts (or glyphs) to make the character appear as it should in the specific written language. However, the coding schemes are not unique, and far from universal. Unless the scheme is understood properly, sorting and presenting material in an orthographic manner becomes quite difficult.

The best long-term solution to the character set problem is one that correctly represents languages with their native alphabets and full orthography. UNICODE appears to be one means of achieving that goal. The MeSH database is being converted to Oracle version 8i, a database management system which supports the use of UNICODE. Java, which supports the servlet and the MeSH client used at the NLM, is fully UNICODE compliant. We anticipate that we will encourage translators to submit their terms in UNICODE, though we may need to make provisions for those who are unable to do so.
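
The difference between Latin-1 and UNICODE can be shown with a short check. This is an illustrative sketch in Python rather than in the Java used by the MeSH client; the helper name `fits_latin1` is an invention for this example, and UTF-8 is used here as one common UNICODE encoding, which the text itself does not specify.

```python
def fits_latin1(text):
    """Report whether every character of the text exists in Latin-1."""
    try:
        text.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False


german = "Köln"      # western European text: representable in Latin-1
russian = "Москва"   # Cyrillic text ("Moscow"): not representable in Latin-1

print(fits_latin1(german))
print(fits_latin1(russian))

# A UNICODE encoding such as UTF-8 stores both without transliteration,
# and decoding returns the original text unchanged.
for word in (german, russian):
    assert word.encode("utf-8").decode("utf-8") == word
```

With a UNICODE encoding, a Russian term could be stored in its native alphabet instead of in a transliterated form.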

CONCLUSION

As the number of translations of MeSH into other languages attests, it is often easier, when searching for information about a potentially relevant topic, to use the language with which one has the most facility. Translations of MeSH are valuable to persons not facile in English. Creating a database of translations will enable correct mappings from one language to another to be maintained, and enable translators to stay current with MeSH as it continues to evolve.

ADDITIONAL MATERIALS

For further information about the UMLS, see http://www.nlm.nih.gov/research/umls/
For further information about MeSH, see http://www.nlm.nih.gov/mesh

Last updated: 20 November 2001
First published: 19 October 2000