DISCUSSION PAPER NO.: 2004-DP03

DATE: December 17, 2003
REVISED:

NAME: Changing the Mapping for the Double-Wide Diacritics from MARC8 to Unicode/UCS from the Unicode/UCS Half Diacritic Characters to the Unicode/UCS Double-Wide Diacritic Characters.

SOURCE: RLG

SUMMARY: The double-wide tilde of Tagalog and the ligature used in Cyrillic romanization are encoded as half diacritic characters in ANSEL, and are mapped to equivalent half diacritic characters in Unicode/UCS. Another way to represent the double-width tilde and the ligature in Unicode is to use single diacritic characters that span two alphabetic characters: the combining double tilde and combining double inverted breve. The Unicode Consortium and implementers prefer the double-width characters to the halves. This discussion paper gives a number of reasons why the use of the single double-wide diacritics may be preferable to using the two half diacritic characters, and suggests that the MARC 21 community change the official encoding in Unicode/UCS to the double-wide diacritics.

KEYWORDS: Double tilde; Ligature; ANSEL; Unicode/UCS

RELATED: 96-10 (July 1996)

STATUS/COMMENTS:

12/17/03 - Made available to the MARC 21 community for discussion.

01/10/04 - Results of the MARC Advisory Committee discussion - The participants discussed whether the manner in which MARC converts double-width diacritics to Unicode could be improved upon by another method. Converting double-width diacritics is currently difficult to accomplish using proportional fonts. Because there are other characters besides the double ligature that do not display well in browsers, participants agreed that the committee should look at the larger issues of implementing and mapping to Unicode. It was also agreed that coding for double-width diacritics has not been consistent in the past. There may be mismatched halves of characters using the current method, for example. This particular problem is limited to romanized records, however. RLG indicated that it has created an algorithm to fix these coding inconsistencies. This algorithm may need to be revised in the future. A formal proposal should be presented to the committee at the annual meeting in June.

Discussion Paper 2004-DP03: Changing the Unicode/UCS Mapping for the Double-Wide Diacritics

1. BACKGROUND

An expert group was appointed by the MARC Advisory Committee (MAC) in 1994 to plan a transition path from the MARC8 character sets to Unicode. This first of four expert groups tackled the mapping of the MARC8 sets into Unicode (except the CJK set, EACC). The MARC 8 ANSEL character set, the Extended Latin character set for MARC 21, includes two double-wide diacritics: the double tilde and the ligature. Each is encoded in the MARC8 set as two halves, one half over each of two adjacent letters; when rendered, the halves combine to form one double diacritic. They were mapped to corresponding halves in Unicode. The mapping was approved by the community in 1997 and published with the MARC documentation on the web in that year.

The double-wide diacritics are used for the following languages: the long tilde is used in the Tagalog language written in Latin script, and the ligature is used in the ALA-LC romanization of Russian, other languages written in Cyrillic script, and Church Slavic. These characters are:

	EB	Ligature, first half
	EC	Ligature, second half
	FA	Double tilde, first half
	FB	Double tilde, second half

The Unicode Standard and its International Standard counterpart, ISO/IEC 10646 Universal Character Set (UCS), provide for encoding both the whole “double-wide” diacritics and the diacritic “halves” from the ANSEL standard. The Unicode/UCS characters are:

Half diacritic characters (to which the ANSEL halves are officially mapped)

	U+FE20	COMBINING LIGATURE LEFT HALF
	U+FE21	COMBINING LIGATURE RIGHT HALF
	U+FE22	COMBINING DOUBLE TILDE LEFT HALF
	U+FE23	COMBINING DOUBLE TILDE RIGHT HALF

Double-wide diacritic characters (alternative representations of these diacritics in Unicode)

	U+0360	COMBINING DOUBLE TILDE
	U+0361	COMBINING DOUBLE INVERTED BREVE

The Unicode Standard states that the preferred way to encode the double-wide tilde and the ligature is with the double-wide diacritic characters (see The Unicode Standard, Version 4.0, Addison-Wesley, 2003, p. 188). For more information on options, see Section 7.7, Combining Marks, (p. 186-188). The text of Chapter 7 is available online at: www.unicode.org/versions/Unicode4.0.0/ch07.pdf. Illustrations in Section 7.7 show the order of diacritics and base characters when a double-width diacritic character is used alone and with other diacritics with alphabetic characters. Another illustration contrasts use of two half diacritic characters and use of the single double-width tilde to represent the Tagalog "ng" with tilde.
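As an illustration only, the two representations of the Tagalog "ng" with tilde can be compared with a short Python sketch that uses the standard unicodedata module. The variable names are assumptions made for this example and are not part of the MARC 21 specifications.

```python
import unicodedata

# Half diacritic form: each base letter is followed by its own half.
halves = "n\uFE22g\uFE23"
# Double-wide form: one combining mark placed between the two base letters.
double_wide = "n\u0360g"

# List the code points and character names of each representation.
for label, text in (("halves", halves), ("double-wide", double_wide)):
    print(label, "-", len(text), "code points")
    for ch in text:
        print(f"  U+{ord(ch):04X} {unicodedata.name(ch)}")
```

The half diacritic form needs four code points and the double-wide form needs three, which is the difference in structure that the Appendix describes.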

The MARC/Unicode mapping approved by MARBI (Proposal 96-10) maps the half diacritic characters in ANSEL to the corresponding half diacritic characters in Unicode/UCS. (The complete mapping for Extended Latin (ANSEL) is at: www.loc.gov/marc/specifications/specchartables.html.)

Proposal 96-10 indicates that use of the double-wide diacritic characters was considered, but the half diacritic characters were chosen. No reason is given in the proposal, but the likely reason is that a one-to-one mapping between MARC-8 data and Unicode would be simpler.

2. STATEMENT OF THE PROBLEM

The double-wide diacritic characters have been input and rendered in various ways over the years depending on the devices available. Most typewriters used the half diacritic character technique for input, overstriking the individual base letters, and many supported only monospaced fonts for rendering, although proportionally spaced fonts were possible on some devices. The Library of Congress used proportionally spaced typesetting for rendering on catalog cards, and the double-width diacritic characters were often precomposed along with their alphabetic base characters. When the ALA character set was developed for the computer era, the half diacritic characters were selected for the extended Latin set. They were often displayed as separate characters before the base characters on display screens, as were other diacritics on early devices. It is only with the introduction of Unicode that composite accented letters have become common in the display of library data. Proportional fonts are supported by modern typography and are preferred to monospaced fonts except for some applications, such as cataloging input.

In general, fonts now position diacritical characters correctly on the base characters, but problems with the display of the half diacritic characters have been observed. If the halves are treated as regular diacritics, they do not connect when the base letters are of different heights. Because of the requirement for connection, the halves cannot be treated like other diacritics, where the positioning of the diacritic is determined by the base letter. RLG first noticed the problem in romanized Russian data being added to the RLG Cultural Materials database. Modifying the font was not an easy technical task for RLG. After LC noticed the problem in its Voyager system, Microsoft was asked to modify the font being used so that the halves would be positioned correctly over their base characters.

The primary way that the double-wide tilde and the ligature are supported in fonts based on Unicode/UCS is with the double-width diacritic characters as these are easier to implement. While it is possible to request a typographer to add special handling to a particular font so that the half diacritic characters are displayed properly, libraries are a small market, and the special handling is complex. If libraries continue to use the half diacritic characters, they may be limited in their choice of fonts.

Many bibliographic records containing half diacritic characters can be identified by language. They occur in Tagalog (in Latin script) and in the romanization of some Slavic languages (Belorussian, Bulgarian, Russian, Ukrainian), various non-Slavic languages written in Cyrillic script, and Church Slavic. (In the RLG database, this represents about 3% of the titles in Books and Serials.) However, authority records are not marked for language, and in the bibliographic database many records for English-language material will contain the ligature in added entries (e.g., added entries for the central committee of the communist party of the former USSR).

3. ADDITIONAL CONSIDERATIONS

Even if a library system is able to display Unicode records containing the half diacritic characters adequately because it uses a special font, library data requiring the double tilde and ligature will not be compatible with mainstream data, causing various problems.

3.1. Half diacritic characters create difficulties for users

Library data is often downloaded by authors to create citations. The double-wide diacritic characters are more likely to be supported in an author's word-processing software than the half diacritic characters. The fonts that the author wishes to use may not support the half diacritic characters at all, or may do so inadequately. Users will be confused when the ligature or double tilde appears correctly on the library system, but not when the bibliographic record is downloaded to a different environment.

3.2. The double-wide diacritic characters eliminate some data entry errors

Another reason to prefer the double-wide diacritic characters to the half diacritic characters is that some data entry errors are eliminated. When half diacritic characters are used, both halves must be entered correctly to achieve the proper result. Erroneous data is created when (a) another diacritic is entered instead of the second half, or (b) the first or second half is omitted. These errors do not occur when there is only one character to enter. In addition, there is the possibility that one of the half diacritic characters will be entered instead of a different single diacritic. Such a mistake is less likely when the character being entered is wider.

3.3. Using double-wide diacritic characters highlights errors in existing data

Legitimate half diacritic character sequences in MARC-8 records can be converted to base characters plus the double-width characters in Unicode/UCS. Half diacritic character sequences that cannot be converted are erroneous; for these, a MARC-8 half diacritic character would have to be converted to its Unicode/UCS half diacritic character or omitted. As a result, sequences that are defective and should be corrected will be readily apparent (especially if the font does not support the half diacritic characters). If the half diacritic characters are to be allowed in interchange when the sequences are defective, the mapping would need to indicate that both encodings are supported.
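The kinds of defective sequences described in 3.2 and 3.3 can be located mechanically. The following Python sketch is one possible approach under stated assumptions: the function name defective_half_positions is hypothetical, base letters are limited to unaccented A-Z and a-z, and a macron is the only other diacritic allowed on the first base letter. It does not reproduce the algorithm that RLG has developed.

```python
import re

# Any of the four Unicode/UCS half diacritic characters (U+FE20..U+FE23).
HALVES = re.compile("[\uFE20-\uFE23]")

# A well-formed pair: <base> <first half, optional macron before or after>
# <base> <matching second half>.
WELL_FORMED = re.compile(
    "[A-Za-z]\u0304?"
    "(?:\uFE20\u0304?[A-Za-z]\uFE21|\uFE22\u0304?[A-Za-z]\uFE23)"
)

def defective_half_positions(text):
    """Return the indexes of half diacritics outside any well-formed pair."""
    covered = set()
    for match in WELL_FORMED.finditer(text):
        covered.update(range(match.start(), match.end()))
    return [m.start() for m in HALVES.finditer(text) if m.start() not in covered]
```

A record in which a half is missing, or in which a ligature half is paired with a tilde half, yields a non-empty list, so it can be routed for correction instead of silent conversion.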

3.4. Using the double-wide diacritic characters could free up two of the locations on the keyboard

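The euro sign and the eszett were added to the MARC 21 character repertoire in 2002. Some keyboard layouts may not have room to add these. Replacing the four half diacritic characters with the two double-wide diacritic characters would free two key positions that could be used for them.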

4. EFFECTS ON EXISTING UNICODE IMPLEMENTATIONS

A number of systems have already implemented Unicode, and the records of these systems contain the half diacritic characters specified in the existing MARC-8/Unicode mapping. What is used internally by a system is outside the scope of the MARC 21 Specifications. At issue is what should be imported and exported from a library system. The internal half diacritic characters should be converted to Unicode/UCS double-wide diacritic characters for record exchange and for downloading of data by users. A receiving system could load the Unicode/UCS double-wide diacritic characters directly, or decompose the double-wide diacritic and its base characters to Unicode/UCS half diacritic characters with the base characters if it preferred for internal processing.
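A receiving system that prefers the half diacritic characters internally could decompose incoming data along the following lines. This is a minimal Python sketch, not a specified procedure: the function name double_wide_to_halves is hypothetical, base letters are assumed to be unaccented A-Z or a-z, and a macron is the only other diacritic handled on the first base letter.

```python
import re

# Double-wide mark -> (first half, second half), per the existing mapping.
DOUBLE_TO_HALVES = {
    "\u0361": ("\uFE20", "\uFE21"),  # double inverted breve -> ligature halves
    "\u0360": ("\uFE22", "\uFE23"),  # double tilde -> double tilde halves
}

# <first base> <optional macron> <double-wide mark> <second base>
PATTERN = re.compile("([A-Za-z])(\u0304?)([\u0360\u0361])([A-Za-z])")

def double_wide_to_halves(text):
    """Rewrite double-wide sequences as half diacritic sequences."""
    def _replace(match):
        base1, macron, double, base2 = match.groups()
        first, second = DOUBLE_TO_HALVES[double]
        # Each half follows the base letter that it sits over.
        return base1 + macron + first + base2 + second
    return PATTERN.sub(_replace, text)
```

Text that contains no double-wide diacritic is returned unchanged, so the function could be applied to every incoming field.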

The Unicode/UCS half diacritic characters should be exchanged or downloaded only when a sequence with half diacritic characters cannot be converted to a sequence with double-wide diacritic characters (i.e., when it is a defective sequence that the sender is unable to correct), if allowed in the mapping.

5. ALA/LC ROMANIZATION PROBLEM

The Abkhazian ligature TE TSE (U+04B4 and U+04B5) is romanized to TS (uppercase) or ts (lowercase) joined by a ligature and with a single dot centered above the ligature. This romanization cannot be represented in either ASCII/ANSEL or in Unicode. It cannot be represented correctly in ASCII/ANSEL because there is no way to know that a dot is to be centered above the middle of the ligature: it can only be positioned on one of the individual letters (depending on the position of the dot in the input sequence). The romanization cannot be represented in Unicode using the half diacritic characters for the same reason as in ASCII/ANSEL. The romanization cannot be represented in Unicode using the double-wide diacritic character U+0361 COMBINING DOUBLE INVERTED BREVE because the combining class of U+0361 is greater than the combining class of U+0307 COMBINING DOT ABOVE, so the double-wide diacritic character will be positioned above the combining dot (i.e., be further away from the “TS” digraph).
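The combining-class relationship described above can be checked with Python's standard unicodedata module. The sketch below is illustrative only, and the variable names are assumptions. It shows that canonical ordering places the dot before the double-wide diacritic even when the data is entered in the opposite order.

```python
import unicodedata

DOUBLE_INVERTED_BREVE = "\u0361"
DOT_ABOVE = "\u0307"

# Canonical combining classes as recorded in Python's Unicode database.
breve_class = unicodedata.combining(DOUBLE_INVERTED_BREVE)
dot_class = unicodedata.combining(DOT_ABOVE)
print(breve_class, dot_class)

# Data entered with the ligature first and the dot second.
entered = "t" + DOUBLE_INVERTED_BREVE + DOT_ABOVE + "s"

# Canonical ordering sorts adjacent marks by ascending combining class,
# so normalization moves the dot in front of the double-wide diacritic.
normalized = unicodedata.normalize("NFD", entered)
print([f"U+{ord(ch):04X}" for ch in normalized])
```

Because the dot always sorts first, it attaches to the first base letter and the double-wide diacritic is drawn outside it, which is the positioning problem this section describes.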

6. CONCLUSION

There is a benefit for both librarians and library users if libraries abandon the practice of using the half diacritic characters in Unicode/UCS-encoded data and adopt the double-wide diacritic character option instead. Although some libraries have already converted their records to Unicode, we are just at the beginning of its use, and there are benefits to bringing library Unicode/UCS-encoded data more in line with mainstream implementations of Unicode.

One additional option is for the use of the Unicode/UCS half diacritic characters to continue to be legitimate for MARC-8 compatible data, but only for the transmission of data that could not be corrected by the sending library.

7. IMPLICATIONS

Changing from the half diacritic characters to the double-wide diacritic characters for prospective MARC 21 data in Unicode/UCS encodings has the following implications:

Keyboards for input to Unicode/UCS systems would have to be modified to replace the existing half diacritic character pairs with the corresponding double-wide diacritic. (Input of half diacritic characters would not need to be supported because half diacritic characters would be used only for existing defective data.)
Existing Unicode data would need to be modified, either via a one-time conversion or by incorporating “on-the-fly” conversion for outgoing data. (For conversion details, see the Appendix.)

8. QUESTIONS FOR CONSIDERATION

Is it desirable to modify the existing conversion table from MARC8 to Unicode/UCS for the double tilde and ligature to use the Unicode/UCS double-wide diacritic characters?
Does the price of system and data changes required by the modification of the table outweigh the advantage of using the mainstream Unicode/UCS double-wide diacritic characters that will be commonly supported in fonts?
Should the half diacritic Unicode/UCS encodings continue to be supported in the mapping table for defective data?
Would existing Unicode data be modified more effectively via a one-time conversion or by incorporating an “on-the-fly” conversion?

APPENDIX: Converting Compatibility Half Characters to Combining Diacritical Characters U+0360 and U+0361

The majority of records containing half diacritic characters can be identified by language.

When half diacritic characters are used, the UTF-8 encoding for the accented Tagalog letter “nang” or a ligatured digraph has the following structure: <first base character> <diacritic string including first half> <second base character> <diacritic string including second half>. Each component of the structure is required.

When double-wide diacritic characters are used, the corresponding UTF-8 encoding has this structure: <first base character> <combining double-wide diacritic character> <second base character>

For Tagalog, the components of the source legal sequence are:

Component	Value
<first base character>	N|n
<diacritic string including first half>	U+FE22 COMBINING DOUBLE TILDE LEFT HALF
<second base character>	G|g
<diacritic string including second half>	U+FE23 COMBINING DOUBLE TILDE RIGHT HALF

and the components after conversion are:

Component	Value
<first base character>	N|n
<combining double-wide diacritic character>	U+0360 COMBINING DOUBLE TILDE
<second base character>	G|g

For Cyrillic script languages and Church Slavic, the components of the source legal sequences are:

Component
<first base character>	D|d	I|i	K|k	N|n	O|o	P|p	T|t	Z|z
<diacritic string including first half>	U+FE20	U+FE20	U+FE20	U+FE20	U+FE20 U+0304 | U+0304 U+FE20	U+FE20	U+FE20	U+FE20
<second base character>	Z|z	A|a|E|e|N|n|O|o|U|u	H|h	G|g	T|t	S|s	S|s|Sh|sh	H|h
<diacritic string including second half>	U+FE21	U+FE21	U+FE21	U+FE21	U+FE21	U+FE21	U+FE21	U+FE21

and the components after conversion are:

Component
<first base character>	D|d	I|i	K|k	N|n	O|o	P|p	T|t	Z|z
<diacritic string>	U+0361	U+0361	U+0361	U+0361	U+0304 U+0361	U+0361	U+0361	U+0361
<second base character>	Z|z	A|a|E|e|N|n|O|o|U|u	H|h	G|g	T|t	S|s	S|s|Sh|sh	H|h

Note: Canonical ordering is specified for macron and double-wide inverted breve.

When an attempt at conversion fails, pass the sequence unchanged. (This is a defective sequence, which is preserved by using the half diacritic characters.)
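The conversion described in this Appendix might be sketched in Python as follows. This is a simplified illustration under stated assumptions, not a definitive implementation: the function name halves_to_double_wide is hypothetical, any pair of unaccented Latin letters is accepted (the specific letter pairs in the tables are not checked), and a macron is the only additional diacritic handled on the first base letter.

```python
import re

# (first half, second half) -> double-wide diacritic character.
HALVES_TO_DOUBLE = {
    ("\uFE20", "\uFE21"): "\u0361",  # ligature halves -> double inverted breve
    ("\uFE22", "\uFE23"): "\u0360",  # double tilde halves -> double tilde
}

# <first base> <first half, with an optional macron before or after it>
# <second base> <second half>
PATTERN = re.compile(
    "([A-Za-z])(\u0304?)([\uFE20\uFE22])(\u0304?)([A-Za-z])([\uFE21\uFE23])"
)

def _replace(match):
    base1, before, first, after, base2, second = match.groups()
    double = HALVES_TO_DOUBLE.get((first, second))
    if double is None:
        # Mismatched halves: a defective sequence, passed through unchanged.
        return match.group(0)
    # The macron is written before the double-wide diacritic, which is the
    # canonical order for these two marks.
    return base1 + before + after + double + base2

def halves_to_double_wide(text):
    """Convert well-formed half diacritic sequences to the double-wide form."""
    return PATTERN.sub(_replace, text)
```

A sequence with a missing half does not match the pattern at all, and a sequence with mismatched halves is returned as is, so in both cases the half diacritic characters survive in the output and the defect remains visible.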

Library of Congress Help Desk ( 03/18/2004 )