PROPOSAL NO.: 2004-08

DATE: May 25, 2004
REVISED:

NAME: Changing the MARC-8 to UCS Mapping for the Halves of Doublewide Diacritics from the Unicode/UCS Half Diacritic Characters to the Unicode/UCS Doublewide Diacritic Characters

SOURCE: RLG

SUMMARY: The doublewide tilde of Tagalog and the ligature used in Cyrillic Romanization are encoded as half diacritic characters in ANSEL. In Unicode/UCS, there are two ways to represent each of these double-wide diacritics: use the appropriate double diacritic character that spans two base letters (recommended in the Unicode Standard) or use the Combining Half Mark characters (analogous to current MARC 21 practice).

The current mapping for the four “diacritic halves” in ANSEL is to the Combining Half Mark characters. This proposal, in response to MARBI Discussion Paper 2004-DP03 in January 2004, recommends that the MARC 21 community change the official mapping to Unicode/UCS to the double diacritic characters, U+0360 COMBINING DOUBLE TILDE and U+0361 COMBINING DOUBLE INVERTED BREVE. A specification for conversion is included in the proposal.

KEYWORDS: Double tilde; Ligature; ANSEL; Unicode/UCS; Half marks

RELATED: 96-10 (July 1996); 2004-DP03 (January 2004)

STATUS/COMMENTS:

05/25/04 - Made available to the MARC 21 community for discussion.

06/26/04 - Results of the MARC Advisory Committee discussion - Approved.

08/26/04 - Results of LC/LAC/BL review - Approved.

Proposal 2004-08: Changing the Unicode/UCS Mapping for the Double-Wide Diacritics

1. BACKGROUND

1.1 Description of the Characters

The MARC-8 ANSEL character set, the Extended Latin character set for MARC 21, includes two double-wide diacritics: the double tilde and ligature. Each double-wide diacritic is encoded in the MARC-8 set as two halves, with each half positioned over a letter so that the halves combine to make one double diacritic when rendered. The ANSEL characters are:

EB	Ligature, first half
EC	Ligature, second half
FA	Double tilde, first half
FB	Double tilde, second half

1.2 Occurrence of the Characters

The double-wide diacritics are used for the following languages: the double tilde was once used in the Tagalog language written in Latin script, and the ligature is used in the ALA-LC Romanization of Belorussian, Bulgarian, Russian, Ukrainian, and Church Slavic, as well as various Non-Slavic languages written in Cyrillic script. (In the RLG database, this represents about 3% of the titles in Books and Serials.) The character combinations that use the ligature in those transliterations are: dz, gh, ia, ie, io, iu, iq, in, kh, ks, ng, ot, ps, th, ts, zh.

1.3 Equivalent Characters in Unicode/UCS

The Unicode Standard and its International Standard counterpart, ISO/IEC 10646 Universal Character Set (UCS), provide two ways to represent each of the double-wide diacritics: a double diacritic character that spans two base letters or pairs of combining half mark characters.

The Double Diacritic characters are:

U+0360	COMBINING DOUBLE TILDE
U+0361	COMBINING DOUBLE INVERTED BREVE

The Combining Half Mark characters are:

U+FE20	COMBINING LIGATURE LEFT HALF
U+FE21	COMBINING LIGATURE RIGHT HALF
U+FE22	COMBINING DOUBLE TILDE LEFT HALF
U+FE23	COMBINING DOUBLE TILDE RIGHT HALF

The double-width diacritics and the combining half marks are discussed with illustrations in Section 7.7, Combining Marks, of The Unicode Standard, Version 4.0 (Addison-Wesley, 2003), pp. 186-188. (Chapter 7 is also available online at: www.unicode.org/versions/Unicode4.0.0/ch07.pdf) Section 7.7 states that the preferred way to encode the doublewide tilde and the ligature is with the double diacritic characters.
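To make the two representations concrete, the following minimal Python sketch spells the romanized pair "ts" with a ligature both ways and lists the Unicode character names involved. The variable and function names are illustrative only and are not part of any MARC 21 specification.

```python
import unicodedata

# Two Unicode spellings of "ts" carrying a ligature.
double_form = "t\u0361s"        # base, COMBINING DOUBLE INVERTED BREVE, base
half_form = "t\uFE20s\uFE21"    # base, LEFT HALF, base, RIGHT HALF


def describe(text):
    """Return the Unicode character name of every code point in text."""
    return [unicodedata.name(ch) for ch in text]


for label, text in (("double diacritic", double_form), ("half marks", half_form)):
    print(label, len(text), describe(text))
```

The double diacritic form needs three code points and the half mark form needs four, which is the difference in keying effort discussed in Section 2.3.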

1.4 Unicode/UCS Mappings for the Characters

In 1994, the MARC Advisory Committee (MAC) revived its Character Sets Subcommittee to plan for a transition path from MARC-8 character sets to Unicode/UCS. The Character Sets Subcommittee was charged with mapping the MARC-8 sets (except the CJK set, or EACC) to Unicode/UCS equivalents.

The MARC/Unicode mapping, approved by MARBI (Proposal 96-10) and published on the MARC Web site in 1997, maps the half diacritic characters in ANSEL to the corresponding half diacritic characters in Unicode/UCS. (The complete mapping for Extended Latin (ANSEL) is at: www.loc.gov/marc/specifications/specchartables.html.)

Proposal 96-10 indicates that use of the double diacritic characters was considered, but the half diacritic characters were chosen. No reason is given in the proposal, but presumably it was because mapping between MARC-8 data and Unicode/UCS would be simpler.

2. DISCUSSION

2.1 Scope

Many bibliographic records containing half diacritic characters can be identified by language. However, an authority record is not marked for language and in the bibliographic database many records will contain the ligature in added entries for English language material (e.g., added entries for the central committee of the communist party of the former USSR).

2.2 Font Support

In general, modern computer fonts now position diacritical characters correctly on the base characters, but rendering of the half diacritic characters is problematic. However, recent changes to OpenType have addressed the typographical challenge of positioning the diacritical halves for a proportional font.

To obtain satisfactory rendering, libraries will need to upgrade their fonts. But even though a technical solution is available, there is no guarantee that both half and double-width options will be supported in a font. Although it is not in the “spirit” of Unicode/UCS, a typographer may consider one option to be sufficient, and the preferred option will almost certainly be the double diacritics because the Unicode Standard prefers these to the “half marks” option. Libraries may prefer to conform to what is most widely implemented to avoid creating problems when users download bibliographic data (e.g., for use in bibliographies).

2.3 Data Input Errors

Four characters must be entered in the correct order when combining half marks are used; only three characters are needed when a double diacritic is used. When combining half marks are used, erroneous data is created when (a) another diacritic is entered instead of the second half, or (b) the first or second half is omitted. In addition, there is the possibility that one of the half diacritic characters will be entered instead of a different single diacritic. Such a mistake is less likely when the character being entered is wider. With the double diacritic, the main issue may be ensuring that it is entered after the correct character in the pair.

2.4 Identification of Existing Data Errors

Valid sequences in MARC-8 records that include the combining half marks can be converted to the Unicode/UCS double diacritic plus base characters. (The conversion method is described in an appendix to this proposal.) When the conversion flags a sequence as erroneous, the MARC-8 half diacritic character would have to be converted to its Unicode/UCS combining half mark character or omitted. As a result, sequences that are defective and should be corrected will be readily apparent (especially if the font does not support the half diacritic characters).
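Because only defective sequences would keep the combining half marks after conversion, any half mark left in converted Unicode data points to a record needing correction. A minimal Python sketch of such a check follows; the function name find_half_marks is hypothetical and the reporting format is an assumption.

```python
# The four combining half marks that would survive only in defective sequences.
HALF_MARKS = {"\uFE20", "\uFE21", "\uFE22", "\uFE23"}


def find_half_marks(text):
    """Return (index, code point label) for every combining half mark in text."""
    return [(i, f"U+{ord(ch):04X}") for i, ch in enumerate(text) if ch in HALF_MARKS]


# A correctly converted ligature contains no half marks.
print(find_half_marks("t\u0361s"))
# A stray second half is reported with its position.
print(find_half_marks("ts\uFE21"))
```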

2.5 Keyboard issues

If this proposal is accepted, four characters used for input will be replaced by two, providing room for the Euro sign and the Eszett that were added to the MARC 21 character repertoire in 2002. The half diacritic characters will continue to be valid for use in MARC 21 data, as discussed in Section 2.4, but the double diacritics would be used in newly created records.

3. IMPLICATIONS

There is a benefit for both librarians and library users if libraries abandon the half diacritic character practice in Unicode/UCS-encoded data and adopt the doublewide diacritical character option instead. Although some libraries have already converted their records to Unicode, we are just at the beginning of its use, and there are benefits to making library Unicode/UCS-encoded data more in synch with the mainstream implementations of Unicode.

Changing from the half diacritic characters to the doublewide diacritic characters for prospective MARC 21 data in Unicode/UCS encodings has the following implications:

Keyboards for input to Unicode/UCS systems would have to be modified to replace the existing half diacritic character pairs with the corresponding doublewide diacritic. (Input of half diacritic characters would not need to be supported because half diacritic characters would be used only for existing defective data).
System changes would need to be made to tables and software for loading, exporting, and internal use, according to the design of a system.
Existing Unicode data would need to be modified, either via a one-time conversion or by incorporating “on-the-fly” conversion for outgoing data. (For conversion details, see the Appendix.)
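For data already held in Unicode with combining half marks, a one-time or on-the-fly conversion might look like the following Python sketch. It assumes that each half mark directly follows its base letter (the Unicode ordering) and that neither base letter carries a further combining mark; other sequences are left unchanged so that they remain visible as defective. The names LETTER and halves_to_double are hypothetical, and the letter test accepts any Unicode letter rather than Latin letters only.

```python
import re

# Assumption: any Unicode letter is accepted as a base; a production
# converter might restrict this to Latin letters.
LETTER = r"[^\W\d_]"

# base letter, left half, base letter, matching right half
LIGATURE = re.compile(f"({LETTER})\uFE20({LETTER})\uFE21")
TILDE = re.compile(f"({LETTER})\uFE22({LETTER})\uFE23")


def halves_to_double(text):
    """Replace well-formed half mark pairs with the double diacritic."""
    text = LIGATURE.sub("\\1\u0361\\2", text)
    text = TILDE.sub("\\1\u0360\\2", text)
    return text


print(halves_to_double("t\uFE20s\uFE21ar") == "t\u0361sar")
```

Running the two substitutions separately keeps a mismatched pair (for example a ligature first half followed by a tilde second half) from being converted.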

4. PROPOSAL:

It is proposed that:

MARC 21 prefers the double diacritic character U+0360 COMBINING DOUBLE TILDE as the Unicode/UCS representation of the double-width tilde of Tagalog and U+0361 COMBINING DOUBLE INVERTED BREVE as the Unicode/UCS representation of the ligature used in ALA-LC Romanization.
The corresponding UTF-8 encoded values are CD A0 for U+0360 COMBINING DOUBLE TILDE and CD A1 for U+0361 COMBINING DOUBLE INVERTED BREVE.
In addition, MARC 21 permits the use of the Unicode/UCS combining half marks (U+FE20..U+FE23) for the exchange of defective sequences that cannot be corrected by the sender of the data.
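The UTF-8 values quoted in this proposal can be checked with a short Python sketch. The helper name utf8_hex is hypothetical; the space-separated output format is chosen only for readability.

```python
def utf8_hex(code_point):
    """Return the UTF-8 encoding of a code point as spaced uppercase hex."""
    return " ".join(f"{byte:02X}" for byte in chr(code_point).encode("utf-8"))


for cp in (0x0360, 0x0361, 0xFE20, 0xFE21, 0xFE22, 0xFE23):
    print(f"U+{cp:04X} = {utf8_hex(cp)}")
```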

It is noted that internal use of the combining half marks for the representation of the double-width tilde and the ligature is not prohibited by this proposal. What is required is that such a system must export the double-width tilde and the ligature as the double diacritic characters specified in the first bullet point, and must also be able to accept these characters.

One alternative to the use of the double diacritic character may be to change the transliteration tables to use a grave accent on the second of the two characters “pointing back to the first character”. The grave is not currently used in the transliteration tables that employ the ligature. Such a change, however, would require consultation by ALA/LC with the affected community.

APPENDIX: SUBSTITUTING A UNICODE/UCS DOUBLE DIACRITIC FOR A MARC-8 HALF-MARK PAIR

Note: The double diacritic characters are identified by their Unicode Scalar Values for ease of reference. MARC 21 requires use of the UTF-8 encoding form in data exchange, so UTF-8 encoded values are also given (the space between each hex value is for clarity, and is not part of the UTF-8 value).

Condition		Result
Current character = EC | FB		ERROR 1
Current character = EB | FA		Examine the next 3 characters (“string”)
<3 characters remaining		ERROR 2
1st character of string not Latin letter		ERROR 2
Current character = EB
	2nd character of string not EC	ERROR 2
Current character = FA
	2nd character of string not FB	ERROR 2
3rd character of string not Latin letter		ERROR 2
Convert current character and next 3 characters as follows:
	Drop current character
	Convert 1st character of string to Unicode/UCS equivalent
	2nd character of string is either EC or FB
	Convert EC to U+0361 (= CD A1 in UTF-8)
	Convert FB to U+0360 (= CD A0 in UTF-8)
	Convert 3rd character of string to Unicode/UCS equivalent
ERROR handling
ERROR 1: EC | FB should not be encountered unless EB | FA has been encountered
	Convert EC to U+FE21 (= EF B8 A1 in UTF-8)
	Convert FB to U+FE23 (= EF B8 A3 in UTF-8)
ERROR 2: String sequence incorrect (length or content)
	Convert EB to U+FE20 (= EF B8 A0 in UTF-8)
	Convert FA to U+FE22 (= EF B8 A2 in UTF-8)
Convert character(s) of string to Unicode/UCS equivalent(s) including: EB to U+FE20, EC to U+FE21, FA to U+FE22, FB to U+FE23
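
The table above can be sketched in Python as follows. This is an illustration under stated assumptions, not a definitive implementation: only ASCII letters are treated as Latin letters, base_to_unicode is a hypothetical stand-in for a full MARC-8 to Unicode table, and the function names are invented for this example. The sketch also moves a leftover half mark to follow the next base character, because MARC-8 places combining characters before the base and Unicode places them after; the table does not spell that step out.

```python
SECOND_HALF_FOR = {0xEB: 0xEC, 0xFA: 0xFB}
HALF_TO_DOUBLE = {0xEC: "\u0361", 0xFB: "\u0360"}
HALF_TO_HALF = {0xEB: "\uFE20", 0xEC: "\uFE21", 0xFA: "\uFE22", 0xFB: "\uFE23"}


def is_latin_letter(byte):
    # Assumption: only ASCII A-Z and a-z count as Latin letters here.
    return byte < 0x80 and chr(byte).isalpha()


def base_to_unicode(byte):
    """Hypothetical stand-in for a full MARC-8 table (ASCII only)."""
    if byte < 0x80:
        return chr(byte)
    raise NotImplementedError(f"no mapping in this sketch for {byte:02X}")


def convert(data):
    """Convert MARC-8 bytes to a Unicode string following the table above."""
    out = []
    pending = []  # half marks from ERROR cases, waiting for the next base
    i = 0
    while i < len(data):
        cur = data[i]
        if cur in (0xEC, 0xFB):
            # ERROR 1: second half met without a preceding first half.
            pending.append(HALF_TO_HALF[cur])
            i += 1
        elif cur in (0xEB, 0xFA):
            string = data[i + 1:i + 4]  # the next 3 characters
            valid = (
                len(string) == 3
                and is_latin_letter(string[0])
                and string[1] == SECOND_HALF_FOR[cur]
                and is_latin_letter(string[2])
            )
            if valid:
                # Drop the first half; emit base, double diacritic, base.
                out.append(base_to_unicode(string[0]))
                out.extend(pending)
                pending.clear()
                out.append(HALF_TO_DOUBLE[string[1]])
                out.append(base_to_unicode(string[2]))
                i += 4
            else:
                # ERROR 2: keep the first half as a combining half mark.
                pending.append(HALF_TO_HALF[cur])
                i += 1
        else:
            out.append(base_to_unicode(cur))
            out.extend(pending)
            pending.clear()
            i += 1
    out.extend(pending)  # half marks with no following base character
    return "".join(out)


print(convert(b"\xebt\xecs") == "t\u0361s")
```

A valid pair such as EB t EC s yields t, U+0361, s, while a sequence with a missing or mismatched half keeps its half marks so that the defect stays visible in the converted record.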
