NAME: USMARC Character Set Issues and Mapping to Unicode/UCS
SOURCE: MARBI Character Set Subcommittee
SUMMARY: This paper identifies character set issues related to the mapping between USMARC and Unicode/UCS. It reviews the working principles of the MARBI Character Set Subcommittee and discusses the mapping of certain USMARC characters to Unicode/UCS. In addition, it suggests the correction of various elements in the USMARC Specifications for Record Structure, Character Sets, and Media Exchange and proposes the addition of four characters: one to the Cyrillic set and three to the Arabic set. Appendix 1 contains USMARC to Unicode/UCS mapping tables for the ASCII and ANSEL character sets, and Appendix 2 contains mapping tables for basic and extended Arabic, basic Hebrew, and basic and extended Cyrillic.
KEYWORDS: USMARC Character Sets; Unicode/UCS
STATUS/COMMENTS:
5/24/96 - Forwarded to USMARC Advisory Group for discussion at the July 1996 MARBI meetings.
7/7/96 - Result of USMARC Advisory Group discussion - Approved, noting the unfinished resolution of the ASCII clone mapping. That part should be approved separately at a future meeting.
8/6/96 - Results of final LC review - Agreed with MARBI decision.
3/99 - The mapping for the character called Ayn in MARC (B0) was changed from that in this proposal. The MARC character B0 is used to represent both weak aspiration (romanized Chinese according to the Wade-Giles system) and the voiced pharyngeal fricative (romanized Arabic or Hebrew). In the Unicode standard, U+02BD represents "weak aspiration" and U+02BF represents "voiced pharyngeal fricative." The mapping of B0 was changed from U+02BF to U+02BB because the dual function of U+02BB (as an alternate for the transliterated representation of the ayn or for indication of weak aspiration) reflects the multiple uses of the MARC B0 character.
PROPOSAL NO. 96-10: USMARC Character Set Issues and Mapping to Unicode/UCS

1. BACKGROUND

The Character Set Subcommittee was appointed in June 1994 (following MARBI discussion of Discussion Paper #73) with the following charge:

* To review the character set issues related to mapping between USMARC and Unicode/UCS;
* To formulate a proposal for review and comment by LC, MARBI, and the USMARC Advisory Group;
* To identify other issues related to character sets which should be addressed by MARBI and/or the library community.

Members of the Subcommittee:

Joan Aliprand - Research Libraries Group
Randy Barry - Library of Congress
Candy Bogar - Data Research Associates
John Espley - VTLS, Inc.
Robyn Greenlund - Microlif, Inc.
Sally McCallum - Library of Congress
Gary Smith - OCLC, Inc.
Paul Weiss - University of New Mexico
Larry Woods - University of Iowa, Chair

Glossary and conventions:

UCS = Universal Character Set (International Standard ISO/IEC 10646).
U+nnnn = An individual Unicode/UCS value, where nnnn is a four-digit number expressed in hexadecimal notation.
Private Use Area = Unicode/UCS values in the range U+E000 through U+F8FF. Codes in this range are for the use of software developers and end users who need a special set of characters for their applications. The code points in this area do not have defined, interpretable semantics except by private agreement.

2. PROPOSAL

WORKING PRINCIPLES TO BE FOLLOWED IN MAPPING OF CHARACTERS FROM USMARC TO UNICODE/UCS

The following Working Principles were established by the Subcommittee and continue to inform its mapping decisions:

* Round-trip mapping will be provided between USMARC characters and Unicode/UCS characters wherever possible.
* Transliteration tables will remain unchanged unless there is no Unicode/UCS equivalent for a diacritical mark, in which case a change to the transliteration table may be considered by the Library of Congress.
* Accented letters (and vocalized consonants in Hebrew and Arabic) will continue to be encoded as a base letter and non-spacing marks. Use of precomposed accented letters is not sanctioned at this stage.
* Codes in the Private Use Area will be used only if necessary to facilitate round-trip mapping.

MAPPINGS FROM USMARC TO THE UNICODE/UCS STANDARD

The Subcommittee has completed mappings for the following USMARC character sets:

* Basic Latin (ASCII) and Extended Latin (ANSEL);
* Greek Symbols (the Greek lowercase letters Alpha, Beta and Gamma);
* Subscript Characters;
* Superscript Characters;
* Basic Hebrew;
* Basic and Extended Cyrillic; and
* Basic Arabic and Extended Arabic.

However, the correct mapping for certain characters could not be determined unequivocally, and two options are presented. The Cyrillic, Hebrew, and Arabic character sets of USMARC contain characters which are also found in columns 2 and 3 of the ASCII range of the USMARC Latin set. The term "ASCII clones" is used to refer to these characters as a group. The ASCII clones include numbers, punctuation marks, the space, and symbols such as the asterisk. The problem arises because Unicode/UCS encodes each of these characters only once, as the ASCII character. One option is to map all ASCII clones to the ASCII-equivalent characters in Unicode/UCS. The alternative option is to preserve the ASCII clones by mapping them to unique values in the Private Use Area. The options are discussed in more detail below in the section "Mappings for ASCII Clones".

The proposed mappings are listed in Appendix 1, with the two proposed options for mapping ASCII clones listed in Appendix 2. For the most part the mappings were straightforward and non-controversial. A few engendered discussion, and some recommendations were not unanimous. Those mappings are listed here along with a summary of the discussion.

Issue 1: A3 D WITH CROSSBAR UPPERCASE

The USMARC Latin character A3 (Uppercase D with crossbar) is used to encode Croatian and Vietnamese letters and transliterated Macedonian and Serbian, and is also considered to be the uppercase form of the Eth. Unicode/UCS includes three "crossed D" characters. Because the Eth is generally regarded as a lowercase letter, the Subcommittee chose to map A3 to U+0110, on the basis of the most common usage (Croatian, Vietnamese, etc.).

The proposed mapping is:

A3   D WITH CROSSBAR UPPERCASE   0110   LATIN CAPITAL D WITH STROKE

Issue 2: AA SUBSCRIPT PATENT MARK

The Subcommittee felt that the loss of subscript positioning (U+00AE is not a subscript character) was not crucial for this character.

The proposed mapping is:

AA   SUBSCRIPT PATENT MARK   00AE   REGISTERED TRADEMARK SIGN

Issue 3: EB LIGATURE, FIRST HALF
         EC LIGATURE, SECOND HALF
         FA DOUBLE TILDE, FIRST HALF
         FB DOUBLE TILDE, SECOND HALF

There were two possible mappings for these four characters: to a single character (which extends over two letters) or to a pair of characters corresponding to the "halves". Mapping to the "halves" was chosen.

The proposed mappings are:

EB   LIGATURE, FIRST HALF        FE20   COMBINING LIGATURE, LEFT HALF
EC   LIGATURE, SECOND HALF       FE21   COMBINING LIGATURE, RIGHT HALF
FA   DOUBLE TILDE, FIRST HALF    FE22   COMBINING DOUBLE TILDE, LEFT HALF
FB   DOUBLE TILDE, SECOND HALF   FE23   COMBINING DOUBLE TILDE, RIGHT HALF

Issue 4: F7 LEFT HOOK (COMMA BELOW)

This character is used in Latvian, Romanian, and Polish. The issue was whether the mapping should be based on the appearance of the character or on its function. The recommendation accepted by a majority of the Subcommittee was a mapping based on function, supported by a reference to the use of a comma-like descender in Romanian typography. Other members felt that the graphic appearance was important.

The proposed mapping is:

F7   LEFT HOOK (COMMA BELOW)   0326   COMBINING COMMA BELOW

Issue 5: F8 RIGHT CEDILLA

We were fortunate to locate a Thai linguistics expert from the University of Wisconsin, Robert Bickner, who confirmed Joan Aliprand's hypothesis that the "RIGHT CEDILLA" was an artifact created through transcription and is similar to the International Phonetic Alphabet (IPA) symbol for an open vowel, which is encoded in Unicode/UCS as 031C COMBINING LEFT HALF RING BELOW. This gives us unique mappings for all four "comma below" characters in the USMARC Latin Set.

The proposed mappings are:

F0   CEDILLA         0327   COMBINING CEDILLA
F1   RIGHT HOOK      0328   COMBINING OGONEK
F7   LEFT HOOK       0326   COMBINING COMMA BELOW
F8   RIGHT CEDILLA   031C   COMBINING LEFT HALF RING BELOW

Issue 6: GREEK LETTERS

The Subcommittee recommended mapping the three Greek letters in USMARC to the corresponding Greek script characters in Unicode/UCS rather than trying to retain the "latinness" of those characters by some other mapping (e.g., to values in the Private Use Area).

Issue 7: 45 HOLAM

The issue concerned the recommended mapping for the USMARC Hebrew character 45 HOLAM. USMARC has only a single character, HOLAM, which should have been listed as HOLAM/RIGHT SIN DOT. Unicode/UCS has two distinct characters, HOLAM and SIN DOT. The discussion was about whether to map contextually to Unicode/UCS HOLAM or Unicode/UCS RIGHT SIN DOT. This has now been resolved.

The proposed mapping is:

45   HOLAM   05B9   HEBREW POINT HOLAM

ADDITION OF NINE ARABIC SCRIPT CHARACTERS TO UNICODE/UCS

The following nine script characters from the USMARC Arabic Set have no corresponding equivalents in Unicode/UCS:

A1   DOUBLE ALEF WITH HAMZA ABOVE
B2   TCHEH WITH DOT ABOVE
C9   SHEEN WITH DOT BELOW
CC   DAD WITH DOT BELOW
CF   GHAIN WITH DOT BELOW
E7   LAM WITH THREE DOTS BELOW
EC   NOON WITH DOT BELOW
FD   SHORT E
FE   SHORT U

Since USMARC Arabic has been officially adopted as an ISO standard, this will be our primary justification for getting them added to Unicode/UCS. We assume this will be routine and will list the mapping as "in process, pending addition to Unicode/UCS" or a similar phrase. We will not map them to the Private Use Area.

MAPPINGS FOR "ASCII CLONES"

The USMARC character set includes characters from the Latin, Cyrillic, Hebrew and Arabic scripts. The Latin character set is a superset of US ASCII. Each of the other three character sets contains duplicates of some of the characters present in the Latin set, typically digits and punctuation. For the purposes of this discussion, the Character Set Subcommittee has dubbed these characters "ASCII clones".

Unicode/UCS provides only one standard value with which to encode each of these characters. The Standard includes an algorithm for the display of text in opposite directions which accommodates ASCII punctuation and digits. When the clones are mapped to these standard values, it is not possible to map a string of characters from USMARC to Unicode/UCS and back to USMARC and guarantee that the output will be identical to the input. Algorithms exist which will produce a reasonable result in the case of Cyrillic, although equivalence still cannot be guaranteed. It has not been determined whether algorithms exist which will produce satisfactory results for Hebrew and Arabic.
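
The round-trip problem described above can be illustrated with a minimal sketch. It is not part of the proposal. The model of a USMARC character as a (character set, byte) pair, every function name, and the Private Use Area base value are assumptions made for illustration only; the base value is not one of the values proposed in Appendix 2. The sketch contrasts mapping a clone to its ASCII-equivalent value with mapping it to a Private Use Area value.

```python
# Illustrative sketch only. A USMARC character is modeled as a
# (character_set, byte) pair; this model and all names are hypothetical.
LATIN, HEBREW = "latin", "hebrew"

# Hypothetical Private Use Area base for Hebrew-set clones. This is NOT a
# value proposed in Appendix 2; it merely falls inside U+E000..U+F8FF.
PUA_BASE = {HEBREW: 0xF800}


def is_ascii_clone(charset, byte):
    """Simplification: treat columns 2 and 3 (0x20-0x3F) of a non-Latin set as clones."""
    return charset != LATIN and 0x20 <= byte <= 0x3F


def to_unicode_ascii(charset, byte):
    """Map a Latin ASCII character or an ASCII clone to the ASCII code point."""
    if (charset == LATIN and byte < 0x80) or is_ascii_clone(charset, byte):
        return chr(byte)
    raise ValueError("outside the scope of this sketch")


def to_unicode_private_use(charset, byte):
    """Map ASCII clones to distinct Private Use Area code points instead."""
    if is_ascii_clone(charset, byte):
        return chr(PUA_BASE[charset] + byte)
    return to_unicode_ascii(charset, byte)


def to_usmarc(ch):
    """Reverse mapping that looks only at the code point, with no context."""
    cp = ord(ch)
    for charset, base in PUA_BASE.items():
        if base + 0x20 <= cp <= base + 0x3F:
            return (charset, cp - base)
    if cp < 0x80:
        return (LATIN, cp)
    raise ValueError("outside the scope of this sketch")


def round_trip(chars, forward):
    """USMARC -> Unicode/UCS -> USMARC using the given forward mapping."""
    return [to_usmarc(forward(charset, byte)) for charset, byte in chars]


# DIGIT ONE and FULL STOP, both taken from the Hebrew set.
original = [(HEBREW, 0x31), (HEBREW, 0x2E)]
via_ascii = round_trip(original, to_unicode_ascii)
via_private_use = round_trip(original, to_unicode_private_use)

print("ASCII-equivalent mapping preserved input:", via_ascii == original)
print("Private Use mapping preserved input:", via_private_use == original)
```

The reverse mapping here is deliberately context-free. A reverse mapping that inspects neighboring characters might recover the original set in more cases, as the text notes for Cyrillic, but the sketch does not attempt this.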

Two options have been advanced for mapping the ASCII clones into Unicode/UCS:

Option #1 maps the ASCII clones to Latin equivalents in Unicode/UCS, relying on reverse mapping algorithms to make the necessary distinctions. The algorithms to do this have not yet been developed.

Option #2 maps the ASCII clones to Private Use Space in Unicode/UCS. This is defined in Unicode/UCS as those hex values between U+E000 and U+F8FF. The Corporate Use Zone starts at U+F8FF and allocates downward. The End User Zone starts at U+E000 and allocates upward. In Option #2 we have used the Corporate Use Zone.

Option #1. (Latin Equivalents)

Advantages

* Only universally defined Unicode/UCS characters are mapped to, thus users outside the USMARC community would encounter no additional difficulty in interpreting USMARC data.
* No complications are created in the use of standard Unicode/UCS-compatible print and display drivers.

Disadvantages

* The original coding of the characters is lost as a result of round-trip mapping particularly when a string of characters begins with one of these characters. Additional parsing may be able to restore the original identity in some cases.
* The handling of bi-directional text may be more complex.
* Current USMARC records containing bi-directional text would require complex processing if records were to remain unaltered.

Option #2. (Private Use Space)

Advantages

* The identity of each character is preserved unambiguously and the ability to perform round-trip mapping is guaranteed.
* The handling of bi-directional text may be simpler.

Disadvantages

* Use of non-standard encoding requires that all users have USMARC-specific documentation in order to interpret the data successfully.
* Print and display drivers, or the software which invokes them, will need to be aware of the Private Use characters and their relationship to the standard characters.
  We could end up repeating the same problem we have had with the ALA character set, where standard hardware was not able to print or display some of the characters without special adaptation.

Questions to be answered

* If Option #1 is used, can an algorithm be devised that will handle the round-trip mapping of bi-directional text in an acceptable manner?
* If Option #1 is used, is there any case in which the inability to recreate the original byte sequence from a USMARC-Unicode/UCS-USMARC round-trip mapping would cause a problem?
* If Option #2 is used, is the concern regarding print and display drivers well-founded?
* Is it acceptable to change the character content of a bibliographic record as a result of conversion into Unicode/UCS and then back into USMARC?

The two proposed options for mapping ASCII clones are listed in Appendix 2.

CORRECTIONS TO USMARC

The following character descriptions in USMARC Specifications for Record Structure, Character Sets, and Media Exchange, 1994 need to be corrected:

ASCII Basic And Extended Latin

A1 LOWERCASE POLISH L should read A1 UPPERCASE POLISH L.
8D JNR (Joiner) is in the USMARC tables but is not in the code list.
8E NJR (Non-joiner) is in the USMARC tables but is not in the code list.

Basic Hebrew

3D EQUAL SIGN should read 3D EQUALS SIGN.
45 HOLAM should read 45 HOLAM/RIGHT SIN DOT.

Basic And Extended Cyrillic

3D EQUAL SIGN should read 3D EQUALS SIGN.
7E UPPERCASE CHA should read 7E UPPERCASE CHE.
E0 UPPERCASE G WITH UPTURN should read E0 UPPERCASE GE WITH UPTURN.

Basic And Extended Arabic

3D EQUAL SIGN should read 3D EQUALS SIGN.

REPRESENTATION OF THE PRESENCE OF UNICODE/UCS CHARACTERS IN MARC RECORDS

Discussion of this is continuing. We need to show when a record contains only Unicode/UCS values and when it contains Unicode/UCS values as well as USMARC characters. This will be addressed in a separate proposal to MARBI.

MAPPINGS FOR THE EAST ASIAN CHARACTER CODE (EACC)

The Subcommittee recommends that a new committee be appointed to handle these mappings, because the Subcommittee felt it lacked the expertise to deal with East Asian scripts. The Subcommittee further recommends that at least one member of the present Subcommittee be named to the EACC Mapping Group, and that the Library of Congress also be represented. We also recommend that the EACC Mapping Group observe the same Working Principles that this Subcommittee put into place.