DATE: May 25, 2004
REVISED:
NAME: Changing the MARC-8 to UCS Mapping for the Halves of Doublewide Diacritics from the Unicode/UCS Half Diacritic Characters to the Unicode/UCS Doublewide Diacritic Characters
SOURCE: RLG
SUMMARY: The doublewide tilde of Tagalog and the ligature used in Cyrillic Romanization are encoded as half diacritic characters in ANSEL. In Unicode/UCS, there are two ways to represent each of these double-wide diacritics: use the appropriate double diacritic character that spans two base letters (recommended in the Unicode Standard) or use the Combining Half Mark characters (analogous to current MARC 21 practice).
The current mapping for the four “diacritic halves” in ANSEL is to the Combining Half Mark characters. This proposal, in response to MARBI Discussion Paper 2004-DP03 in January 2004, recommends that the MARC 21 community change the official mapping to Unicode/UCS to the double diacritic characters, U+0360 COMBINING DOUBLE TILDE and U+0361 COMBINING DOUBLE INVERTED BREVE. A specification for conversion is included in the proposal.
KEYWORDS: Double tilde; Ligature; ANSEL; Unicode/UCS; Half marks
RELATED: 96-10 (July 1996); 2004-DP03 (January 2004)
STATUS/COMMENTS:
05/25/04 - Made available to the MARC 21 community for discussion.
06/26/04 - Results of the MARC Advisory Committee discussion - Approved.
08/26/04 - Results of LC/LAC/BL review - Approved.
The MARC-8 ANSEL character set, the Extended Latin character set for MARC 21, includes two double-wide diacritics: the double tilde and ligature. Each double-wide diacritic is encoded in the MARC-8 set as two halves, with each half positioned over a letter so that the halves combine to make one double diacritic when rendered. The ANSEL characters are:
EB | Ligature, first half |
EC | Ligature, second half |
FA | Double tilde, first half |
FB | Double tilde, second half |
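As a rough illustration of the two-halves encoding, the sketch below builds the MARC-8 byte sequence for a ligature over "ts". It assumes that each half is stored immediately before the letter it sits over (the order implied by the conversion specification at the end of this proposal) and that the base letters are plain ASCII. The helper name `marc8_double` is hypothetical.

```python
# Code points of the ANSEL half diacritics listed above.
FIRST_HALF = {"ligature": 0xEB, "double tilde": 0xFA}
SECOND_HALF = {"ligature": 0xEC, "double tilde": 0xFB}


def marc8_double(kind, first, second):
    """Return MARC-8 bytes for two ASCII letters under a double-wide diacritic.

    Assumption: each half precedes the letter it is positioned over.
    """
    return bytes([FIRST_HALF[kind], ord(first), SECOND_HALF[kind], ord(second)])


print(marc8_double("ligature", "t", "s").hex(" "))  # eb 74 ec 73
```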
1.2 Occurrence of the Characters
The double-wide diacritics are used for the following languages: the double tilde was once used in the Tagalog language written in Latin script, and the ligature is used in the ALA-LC Romanization of Belorussian, Bulgarian, Russian, Ukrainian, and Church Slavic, as well as various non-Slavic languages written in Cyrillic script. (In the RLG database, this represents about 3% of the titles in Books and Serials.) The character combinations that use the ligature in those transliterations are: dz, gh, ia, ie, io, iu, iq, in, kh, ks, ng, ot, ps, th, ts, zh.
1.3 Equivalent Characters in Unicode/UCS
The Unicode Standard and its International Standard counterpart, ISO/IEC 10646 Universal Character Set (UCS), provide two ways to represent each of the double-wide diacritics: a double diacritic character that spans two base letters or pairs of combining half mark characters.
The Double Diacritic characters are:
U+0360 | COMBINING DOUBLE TILDE |
U+0361 | COMBINING DOUBLE INVERTED BREVE |
The Combining Half Mark characters are:
U+FE20 | COMBINING LIGATURE LEFT HALF |
U+FE21 | COMBINING LIGATURE RIGHT HALF |
U+FE22 | COMBINING DOUBLE TILDE LEFT HALF |
U+FE23 | COMBINING DOUBLE TILDE RIGHT HALF |
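The two representations can be compared side by side with a short sketch using Python's standard `unicodedata` module. The sample strings (a ligature over "ts") and the helper name `describe` are illustrative only, and the placement of each combining character after its base letter follows general Unicode practice.

```python
import unicodedata

# Ligature over "ts" in each representation.
double_form = "t\u0361s"        # base, double diacritic, base
half_form = "t\uFE20s\uFE21"    # base, left half, base, right half


def describe(text):
    """Return the Unicode character name of every character in text."""
    return [unicodedata.name(ch) for ch in text]


for label, text in (("double diacritic", double_form), ("half marks", half_form)):
    print(label, len(text), describe(text))
```

The double diacritic form uses three characters and the half mark form uses four.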
The double width diacritics and the combining half marks are discussed with illustrations in Section 7.7, Combining Marks of The Unicode Standard, Version 4.0 (Addison-Wesley, 2003), pp. 186-188. (Chapter 7 is also available online at: www.unicode.org/versions/Unicode4.0.0/ch07.pdf.) Section 7.7 states that the preferred way to encode the doublewide tilde and the ligature is with the double diacritic characters.
1.4 Unicode/UCS Mappings for the Characters
In 1994, the MARC Advisory Committee (MAC) revived its Character Sets Subcommittee to plan for a transition path from MARC-8 character sets to Unicode/UCS. The Character Sets Subcommittee was charged with mapping the MARC-8 sets (except the CJK set, or EACC) to Unicode/UCS equivalents.
The MARC/Unicode mapping, approved by MARBI (Proposal 96-10) and published on the MARC Web site in 1997, maps the half diacritic characters in ANSEL to the corresponding half diacritic characters in Unicode/UCS. (The complete mapping for Extended Latin (ANSEL) is at: www.loc.gov/marc/specifications/specchartables.html.) Proposal 96-10 indicates that use of the double diacritic characters was considered, but the half diacritic characters were chosen. No reason is given in the proposal, but presumably it was because mapping between MARC-8 data and Unicode/UCS would be simpler.
2.1 Scope
Many bibliographic records containing half diacritic characters can be identified by language. However, an authority record is not marked for language, and in the bibliographic database many records for English language material will contain the ligature in added entries (e.g., added entries for the central committee of the communist party of the former USSR).
2.2 Font Support
In general, modern computer fonts now position diacritical characters correctly on the base characters, but rendering of the half diacritic characters is problematic. However, recent changes to OpenType have addressed the typographical challenge of positioning the diacritical halves for a proportional font.
To obtain satisfactory rendering, libraries will need to upgrade their fonts. But even though a technical solution is available, there is no guarantee that both half and double-width options will be supported in a font. Although it is not in the “spirit” of Unicode/UCS, a typographer may consider one option to be sufficient, and the preferred option will almost certainly be the double diacritics because the Unicode Standard prefers these to the “half marks” option. Libraries may prefer to conform to what is most widely implemented to avoid creating problems when users download bibliographic data (e.g., for use in bibliographies).
2.3 Data Input Errors
Four characters must be entered in the correct order when combining half marks are used; only three characters are needed when a double diacritic is used. When combining half marks are used, erroneous data is created when (a) another diacritic is entered instead of the second half, or (b) the first or second half is omitted. In addition, there is the possibility that one of the half diacritic characters will be entered instead of a different single diacritic. Such a mistake is less likely when the character being entered is wider. With the double diacritic, the main issue may be ensuring that it is entered after the correct character in the pair.
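The error cases described above for combining half marks can be detected mechanically. The following is a minimal sketch, not part of the proposal: it assumes Unicode text in which each half mark follows its base letter, treats any alphabetic character as a valid base, and uses the hypothetical function name `unpaired_half_marks`.

```python
# Left half mark -> the right half mark that must complete it.
LEFT_HALVES = {"\uFE20": "\uFE21", "\uFE22": "\uFE23"}
ALL_HALVES = set(LEFT_HALVES) | set(LEFT_HALVES.values())


def unpaired_half_marks(text):
    """Return indexes of half marks that are not part of a well-formed
    base + left half + base + matching right half sequence."""
    bad = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch in LEFT_HALVES:
            ok = (
                i >= 1 and text[i - 1].isalpha()
                and i + 2 < len(text)
                and text[i + 1].isalpha()
                and text[i + 2] == LEFT_HALVES[ch]
            )
            if ok:
                i += 3  # skip past the matching right half
                continue
            bad.append(i)
        elif ch in ALL_HALVES:
            bad.append(i)  # right half with no preceding left half
        i += 1
    return bad


print(unpaired_half_marks("t\uFE20s\uFE21"))  # well formed: []
print(unpaired_half_marks("t\uFE20s\u0303"))  # tilde entered instead of second half: [1]
print(unpaired_half_marks("ts\uFE21"))        # first half omitted: [2]
```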
2.4 Identification of Existing Data Errors
Valid sequences in MARC-8 records that include the combining half marks can be converted to the Unicode/UCS double diacritic plus base characters. (The conversion method is described in an appendix to this proposal.) When the conversion flags a sequence as erroneous, the MARC-8 half diacritic character would have to be converted to its Unicode/UCS combining half mark character or omitted. As a result, sequences that are defective and should be corrected will be readily apparent (especially if the font does not support the half diacritic characters).
2.5 Keyboard issues
If this proposal is accepted, the four characters used for input will be replaced by two, providing room for the Euro sign and the Eszett that were added to the MARC 21 character repertoire in 2002. The half diacritic characters will continue to be valid for use in MARC 21 data, as discussed in Section 2.4, but the double diacritics would be used in newly created records.
There is a benefit for both librarians and library users if libraries abandon the half diacritic character practice in Unicode/UCS-encoded data and adopt the doublewide diacritical character option instead. Although some libraries have already converted their records to Unicode, we are just at the beginning of its use, and there are benefits to making library Unicode/UCS-encoded data more in synch with the mainstream implementations of Unicode.
Changing from the half diacritic characters to the doublewide diacritic characters for prospective MARC 21 data in Unicode/UCS encodings has the following implications:
It is noted that internal use of the combining half marks for the representation of the double-width tilde and the ligature is not prohibited by this proposal. What is required is that such a system must export the double-width tilde and the ligature as the double diacritic characters specified in the first bullet point, and must also be able to accept these characters.
One alternative to the use of the double diacritic character may be to change the transliteration tables to use a grave accent on the second of the two characters, “pointing back to the first character”. The grave is not currently used in the transliteration tables that employ the ligature. Such a change, however, would require consultation by ALA/LC with the affected community.
Note: The double diacritic characters are identified by their Unicode Scalar Values for ease of reference. MARC 21 requires use of the UTF-8 encoding form in data exchange, so the UTF-8 byte values are also given (the space between each hex value is for clarity, and is not part of the UTF-8 value).
Condition | Result |
Current character = EC or FB | ERROR 1 |
Current character = EB or FA | Examine the next 3 characters (“string”) |
Fewer than 3 characters remaining | ERROR 2 |
1st character of string not a Latin letter | ERROR 2 |
Current character = EB and 2nd character of string not EC | ERROR 2 |
Current character = FA and 2nd character of string not FB | ERROR 2 |
3rd character of string not a Latin letter | ERROR 2 |
Convert current character and next 3 characters as follows:
Drop current character
Convert 1st character of string to Unicode/UCS equivalent
Convert 2nd character of string (either EC or FB): EC to U+0361 (= CD A1 in UTF-8); FB to U+0360 (= CD A0 in UTF-8)
Convert 3rd character of string to Unicode/UCS equivalent
ERROR handling
ERROR 1: EC or FB should not be encountered unless EB or FA has been encountered
Convert EC to U+FE21 (= EF B8 A1 in UTF-8)
Convert FB to U+FE23 (= EF B8 A3 in UTF-8)
ERROR 2: String sequence incorrect (length or content)
Convert EB to U+FE20 (= EF B8 A0 in UTF-8)
Convert FA to U+FE22 (= EF B8 A2 in UTF-8)
Convert character(s) of string to Unicode/UCS equivalent(s), including: EB to U+FE20, EC to U+FE21, FA to U+FE22, FB to U+FE23
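The specification above can be sketched in Python as follows. This is an illustrative reading of the table, not a reference implementation, and it rests on several assumptions: only ASCII A-Z and a-z are treated as Latin letters; `convert_char` is a hypothetical stand-in for the full MARC-8 to Unicode/UCS table and maps unknown bytes to U+FFFD; on ERROR 2 the examined string is consumed along with the current character; and fallback half marks are emitted in input order, since the table does not say where they should be placed relative to the base letter. All function and variable names are invented for the example.

```python
# Fallback mapping to the combining half marks (used only on errors).
HALF_MARK = {0xEB: "\uFE20", 0xEC: "\uFE21", 0xFA: "\uFE22", 0xFB: "\uFE23"}
# Second-half byte expected after each first-half byte.
EXPECTED_SECOND = {0xEB: 0xEC, 0xFA: 0xFB}
# Double diacritic produced from each second-half byte.
DOUBLE = {0xEC: "\u0361", 0xFB: "\u0360"}


def is_latin_letter(byte):
    # Assumption: only ASCII A-Z / a-z count as Latin letters in this sketch.
    return (0x41 <= byte <= 0x5A) or (0x61 <= byte <= 0x7A)


def convert_char(byte):
    # Hypothetical stand-in for the full MARC-8 to Unicode/UCS table.
    if byte in HALF_MARK:
        return HALF_MARK[byte]
    if byte < 0x80:
        return chr(byte)
    return "\uFFFD"  # placeholder: the rest of ANSEL is not mapped here


def convert_half_diacritics(data):
    """Return (text, errors) for a bytes object of MARC-8 Latin data."""
    out, errors, i = [], [], 0
    while i < len(data):
        cur = data[i]
        if cur in DOUBLE:  # EC or FB met without a preceding EB or FA
            errors.append((i, "ERROR 1"))
            out.append(HALF_MARK[cur])
            i += 1
        elif cur in EXPECTED_SECOND:  # EB or FA: examine the next 3 characters
            string = data[i + 1:i + 4]
            valid = (
                len(string) == 3
                and is_latin_letter(string[0])
                and string[1] == EXPECTED_SECOND[cur]
                and is_latin_letter(string[2])
            )
            if valid:
                # Drop the first half; the second half becomes the double diacritic.
                out.append(convert_char(string[0]))
                out.append(DOUBLE[string[1]])
                out.append(convert_char(string[2]))
            else:
                errors.append((i, "ERROR 2"))
                out.append(HALF_MARK[cur])
                out.extend(convert_char(b) for b in string)
            i += 1 + len(string)
        else:
            out.append(convert_char(cur))
            i += 1
    return "".join(out), errors


text, errors = convert_half_diacritics(bytes([0xEB, 0x74, 0xEC, 0x73]))
print(text.encode("utf-8").hex(" "), errors)  # 74 cd a1 73 []
```

A stray second half, as in the bytes 74 EC 73, is reported as ERROR 1 and falls back to U+FE21, so defective sequences remain visible in the converted data.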