
MEDLINE®/PubMed® Character Set

MEDLINE/PubMed is an English language database. Since September 2010, it has used an expanded character set that permits Latin (Roman) and Greek characters as well as mathematical and other symbols commonly found in medical literature. Characters with diacritical marks may appear in author names, article titles, vernacular titles, full journal titles, and abstracts.

Through August 2010, a limited character set was used. The database supported only the following nine non-spacing diacritical marks, and only in combination with Latin small letters: diaeresis, breve, cedilla, acute, ring-above, macron, circumflex, tilde, and grave. Additionally, the database supported uppercase and lowercase o and l with stroke. The expanded character set was not applied retroactively: data that existed before its introduction in September 2010 was not modified.
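
As a rough illustration of the pre-September 2010 restrictions, the following Python sketch checks whether a string stays within the limited set described above. The mapping of the nine marks to Unicode combining code points, the function name fits_legacy_set, and the assumption that all other characters were plain ASCII are ours, not part of the MEDLINE documentation.

```python
import unicodedata

# Assumed Unicode equivalents of the nine legacy non-spacing marks.
LEGACY_MARKS = {
    "\u0308",  # combining diaeresis
    "\u0306",  # combining breve
    "\u0327",  # combining cedilla
    "\u0301",  # combining acute accent
    "\u030a",  # combining ring above
    "\u0304",  # combining macron
    "\u0302",  # combining circumflex accent
    "\u0303",  # combining tilde
    "\u0300",  # combining grave accent
}

# Uppercase and lowercase o and l with stroke.
STROKE_LETTERS = {"\u00d8", "\u00f8", "\u0141", "\u0142"}


def fits_legacy_set(text: str) -> bool:
    """Return True if text uses only ASCII, the stroke letters, and the
    nine marks attached to Latin small letters (hypothetical helper)."""
    base = ""  # most recent non-combining character
    for ch in unicodedata.normalize("NFD", text):
        if ch in LEGACY_MARKS:
            # Marks were supported only on Latin small letters.
            if not ("a" <= base <= "z"):
                return False
        elif ch in STROKE_LETTERS:
            base = ch
        elif ord(ch) > 127:
            # Assumption: everything else in the legacy set was ASCII.
            return False
        else:
            base = ch
    return True
```

Under these assumptions, a name such as "Müller" passes, while "Ünal" (a mark on a capital letter) and "β-blocker" (a Greek letter) do not.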

The XML file extracts of MEDLINE data use UTF-8 encoding (from ISO/IEC 10646 and the Unicode Standard; see Unicode for more information on Unicode and UTF-8 encoding). The UTF-8 encoded data is in Unicode Normalization Form C (see Unicode Technical Report #15), which uses precomposed (composite) Unicode characters where they exist. This approach is consistent with the direction of the World Wide Web Consortium as described in Character Model for the World Wide Web.
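
The effect of Normalization Form C on UTF-8 output can be seen in a minimal Python sketch that uses the standard library's unicodedata module. The helper names to_nfc_utf8 and is_nfc and the sample name are illustrative assumptions, not part of any NLM tooling.

```python
import unicodedata


def to_nfc_utf8(text: str) -> bytes:
    """Normalize text to Form C, then encode it as UTF-8."""
    return unicodedata.normalize("NFC", text).encode("utf-8")


def is_nfc(text: str) -> bool:
    """Return True if text is already in Normalization Form C."""
    return unicodedata.normalize("NFC", text) == text


# The same name written two ways: a base letter followed by a
# combining diaeresis, and a single precomposed character.
decomposed = "Mu\u0308ller"
composed = "M\u00fcller"

# After normalization both spellings yield identical UTF-8 bytes.
assert to_nfc_utf8(decomposed) == to_nfc_utf8(composed)
```

Normalizing search input the same way before comparing it with extracted data avoids mismatches between visually identical strings.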