PROPOSAL NO. 96-10

DATE: May 24, 1996
REVISED: July 22, 1996

NAME: USMARC Character Set Issues and Mapping to Unicode/UCS

SOURCE: MARBI Character Set Subcommittee

SUMMARY: This paper identifies character set issues related to the mapping between USMARC and Unicode/UCS. It reviews the working principles of the MARBI Character Set Subcommittee and discusses the mapping of certain USMARC characters to Unicode/UCS. In addition, it suggests the correction of various elements in the USMARC Specifications for Record Structure, Character Sets, and Media Exchange and proposes the addition of four characters, one to the Cyrillic set and three to the Arabic set. Appendix 1 contains USMARC to Unicode/UCS mapping tables for ASCII and ANSEL character sets, and Appendix 2 contains mapping tables for basic and extended Arabic, basic Hebrew, and basic and extended Cyrillic.

KEYWORDS: USMARC Character Sets; Unicode/UCS

STATUS/COMMENTS:

5/24/96 - Forwarded to USMARC Advisory Group for discussion at the July 1996 MARBI meetings.

7/7/96 - Result of USMARC Advisory Group discussion - Approved, noting the unfinished resolution of the ASCII clone mapping. That part should be approved separately at a future meeting.

8/6/96 - Results of final LC review - Agreed with MARBI decision. In March 1999, the mapping for the character called Ayn in MARC (B0) was changed from that in this proposal. The MARC character B0 is used to represent both weak aspiration (romanized Chinese according to the Wade-Giles system) and the voiced pharyngeal fricative (romanized Arabic or Hebrew). In the Unicode standard, U+02BD represents "weak aspiration" and U+02BF represents "voiced pharyngeal fricative." The mapping of B0 was changed from U+02BF to U+02BB because the dual function of U+02BB (as an alternate for the transliterated representation of the ayn or for indication of weak aspiration) reflects the multiple uses of the MARC B0 character.


PROPOSAL NO. 96-10:   USMARC Character Set Issues and Mapping to
Unicode/UCS

1. BACKGROUND

The Character Set Subcommittee was appointed in June 1994
(following MARBI discussion of Discussion Paper #73) with the
following charge:

   * To review the character set issues related to mapping between
     USMARC and Unicode/UCS; 
   * To formulate a proposal for review and comment by LC, MARBI,
     and the USMARC Advisory Group; 
   * To identify other issues related to character sets which
     should be addressed by MARBI and/or the library community.

Members of the Subcommittee:
     Joan Aliprand - Research Libraries Group
     Randy Barry - Library of Congress
     Candy Bogar - Data Research Associates
     John Espley - VTLS, Inc.
     Robyn Greenlund - Microlif, Inc.
     Sally McCallum - Library of Congress
     Gary Smith - OCLC, Inc.
     Paul Weiss - University of New Mexico
     Larry Woods - University of Iowa, Chair

Glossary and conventions:

UCS = Universal Character Set (International Standard ISO/IEC
10646).

U+nnnn = An individual Unicode/UCS value, where nnnn is a
four-digit number expressed in hexadecimal notation.

Private Use Area = Unicode/UCS values in the range U+E000 through
U+F8FF. Codes in this range are for the use of software developers
and end users who need a special set of characters for their
applications. The code points in this area do not have defined,
interpretable semantics except by private agreement.
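
As an illustration of these two conventions, the following Python
sketch formats a value in the U+nnnn notation and tests whether it
falls in the Private Use Area. The function names are hypothetical
and are not part of the proposal; the range bounds are those given
in the glossary entry above.

```python
# Private Use Area bounds as stated in the glossary.
PUA_START = 0xE000
PUA_END = 0xF8FF


def format_code_point(cp: int) -> str:
    """Render a code point in the U+nnnn convention (four hex digits)."""
    return f"U+{cp:04X}"


def is_private_use(cp: int) -> bool:
    """Return True if the code point lies in the Private Use Area."""
    return PUA_START <= cp <= PUA_END


if __name__ == "__main__":
    for cp in (0x0041, 0x0110, 0xE000, 0xF8FF):
        print(format_code_point(cp), is_private_use(cp))
```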


2.  PROPOSAL

WORKING PRINCIPLES TO BE FOLLOWED IN MAPPING OF CHARACTERS FROM
USMARC TO UNICODE/UCS

The following Working Principles were established by the
Subcommittee and continue to inform their mapping decisions:

   * Round-trip mapping will be provided between USMARC characters
     and Unicode/UCS characters wherever possible.
   * Transliteration tables will remain unchanged unless there is
     no Unicode/UCS equivalent for a diacritical mark, in which
     case a change to the transliteration table may be considered
     by the Library of Congress.
   * Accented letters (and vocalized consonants in Hebrew and
     Arabic) will continue to be encoded as a base letter and non-
     spacing marks. Use of precomposed accented letters is not
     sanctioned at this stage.
   * Codes in the Private Use Area will be used only if necessary
     to facilitate round-trip mapping.
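
The third principle can be illustrated with a minimal Python sketch.
It assumes that Unicode canonical decomposition (NFD), as exposed by
Python's standard unicodedata module, is an acceptable stand-in for
"a base letter and non-spacing marks"; the proposal itself does not
prescribe this mechanism, and the function name is hypothetical.

```python
import unicodedata


def to_base_plus_marks(text: str) -> str:
    """Split precomposed letters into base letter + combining marks.

    Assumption: NFD is used here only to illustrate the principle; it
    is not a mechanism named by the proposal.
    """
    return unicodedata.normalize("NFD", text)


precomposed = "\u00E9"  # LATIN SMALL LETTER E WITH ACUTE
decomposed = to_base_plus_marks(precomposed)

if __name__ == "__main__":
    # Each character is listed with its code point and Unicode name.
    for ch in decomposed:
        print(f"U+{ord(ch):04X}", unicodedata.name(ch))
```

In this sketch the single precomposed letter becomes two code
points: the base letter followed by a non-spacing mark.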


MAPPINGS FROM USMARC TO THE UNICODE/UCS STANDARD

The Subcommittee has completed mappings for the following USMARC
character sets:

   * Basic Latin (ASCII) and Extended Latin (ANSEL);
   * Greek Symbols (the Greek lowercase letters Alpha, Beta and
     Gamma);
   * Subscript Characters; 
   * Superscript Characters;
   * Basic Hebrew;
   * Basic and Extended Cyrillic; and
   * Basic Arabic and Extended Arabic.

However, the correct mapping for certain characters could not be
determined unequivocally, and two options are presented. The
Cyrillic, Hebrew, and Arabic character sets of USMARC contain
characters which are also found in columns 2 and 3 of the ASCII
range of the USMARC Latin set. The term "ASCII clones" is used to
refer to these characters as a group. The "ASCII clones" include
numbers, punctuation marks, the space, and symbols such as the
asterisk.

The problem arises because Unicode/UCS contains only the ASCII
characters. One option is to map all ASCII clones to the ASCII-
equivalent characters in Unicode/UCS. The alternative option is to
preserve the ASCII clones by mapping them to unique values in the
Private Use Area. The options are discussed in more detail below in
the section "Mappings for ASCII Clones".

The proposed mappings are listed in Appendix 1 with the two
proposed options for mapping ASCII clones listed in Appendix 2. For
the most part the mappings were straightforward and non-
controversial. A few engendered discussion, and some
recommendations were not unanimous. Those mappings are listed here
along with a summary of the discussion.


Issue 1:  A3   D WITH CROSSBAR UPPERCASE

The USMARC Latin character A3 (Uppercase D with crossbar) is used
to encode Croatian and Vietnamese letters as well as transliterated
Macedonian and Serbian, and is also considered to be the uppercase
form of the Eth.  Unicode/UCS includes three "crossed D"
characters.
     
Because the Eth is generally regarded as a lowercase letter, the
Subcommittee chose to map A3 to U+0110, on the basis of the most
common usage (Croatian, Vietnamese, etc.). The proposed mapping is:
A3   D WITH CROSSBAR UPPERCASE     0110 LATIN CAPITAL D WITH STROKE
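
The ambiguity can be seen by inspecting candidate code points with
Python's standard unicodedata module. The sketch assumes that the
three "crossed D" characters are U+00D0, U+0110, and U+0189; the
proposal does not enumerate them, so this list is an assumption, as
is the helper name.

```python
import unicodedata

# Assumed candidates for the three "crossed D" characters.
CANDIDATES = [0x00D0, 0x0110, 0x0189]


def describe(cp: int) -> tuple:
    """Return (code point, Unicode name, lowercase code point)."""
    ch = chr(cp)
    lower = ch.lower()
    return (f"U+{cp:04X}", unicodedata.name(ch), f"U+{ord(lower):04X}")


rows = [describe(cp) for cp in CANDIDATES]

if __name__ == "__main__":
    for row in rows:
        print(row)
```

The uppercase forms look alike, but each pairs with a different
lowercase letter, so the choice of target also fixes which lowercase
letter a case conversion will produce.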


Issue 2:  AA   SUBSCRIPT PATENT MARK

It was felt that the loss of subscriptedness (U+00AE is not a
subscripted character) was not crucial for this character. The
proposed mapping is:
AA   SUBSCRIPT PATENT MARK    00AE REGISTERED TRADEMARK SIGN


Issue 3:  EB   LIGATURE, FIRST HALF
          EC   LIGATURE, SECOND HALF
          FA   DOUBLE TILDE, FIRST HALF
          FB   DOUBLE TILDE, SECOND HALF

There were two possible mappings for these four characters: to a
single character (which extends over two letters) or to a pair of
characters corresponding to the "halves". Mapping to the "halves"
was chosen. The proposed mappings are:
EB   LIGATURE, FIRST HALF       FE20 COMBINING LIGATURE, LEFT HALF
EC   LIGATURE, SECOND HALF      FE21 COMBINING LIGATURE, RIGHT HALF
FA   DOUBLE TILDE, FIRST HALF   FE22 COMBINING DOUBLE TILDE, LEFT HALF
FB   DOUBLE TILDE, SECOND HALF  FE23 COMBINING DOUBLE TILDE, RIGHT HALF


Issue 4:  F7   LEFT HOOK (COMMA BELOW)

This character is used in Latvian, Romanian, and Polish. The issue
was whether mapping should be based on the appearance of the
character, or on its function. The recommendation accepted by a
majority of the Subcommittee was a mapping based on function, and
supported with a reference to the use of a comma-like descender in
Romanian typography. Other members felt that the graphic appearance
was important. The proposed mapping is:
F7   LEFT HOOK (COMMA BELOW)  0326 COMBINING COMMA BELOW


Issue 5:  F8   RIGHT CEDILLA

We were fortunate to locate a Thai linguistics expert from the
University of Wisconsin, Robert Bickner, who confirmed Joan
Aliprand's hypothesis that the "RIGHT CEDILLA" was an artifact
created through transcription and is similar to the International
Phonetic Alphabet (IPA) symbol for an open vowel, which is depicted
in the Unicode/UCS as 031C COMBINING LEFT HALF RING BELOW. This
gives us unique mappings for all four "comma below" characters in
the USMARC Latin Set. The proposed mappings are:
F0   CEDILLA        0327 COMBINING CEDILLA
F1   RIGHT HOOK     0328 COMBINING OGONEK
F7   LEFT HOOK      0326 COMBINING COMMA BELOW
F8   RIGHT CEDILLA  031C COMBINING LEFT HALF RING BELOW
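
The uniqueness claim can be checked mechanically. The following
Python sketch holds the four proposed mappings in a table, builds
the reverse table, and confirms that no two USMARC values share a
Unicode/UCS target. The variable names are hypothetical; the code
values are those listed above.

```python
import unicodedata

# USMARC Latin value -> proposed Unicode/UCS value (from the table above).
BELOW_MARKS = {
    0xF0: 0x0327,  # CEDILLA
    0xF1: 0x0328,  # RIGHT HOOK
    0xF7: 0x0326,  # LEFT HOOK
    0xF8: 0x031C,  # RIGHT CEDILLA
}

# The reverse table loses entries only if two sources share a target.
REVERSE = {ucs: marc for marc, ucs in BELOW_MARKS.items()}
round_trip_ok = len(REVERSE) == len(BELOW_MARKS)

# Unicode names as reported by the Python runtime's character database.
names = {f"{marc:02X}": unicodedata.name(chr(ucs))
         for marc, ucs in BELOW_MARKS.items()}

if __name__ == "__main__":
    print("one-to-one:", round_trip_ok)
    for marc, name in names.items():
        print(marc, name)
```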


Issue 6:  GREEK LETTERS

The Subcommittee recommended mapping the three Greek letters in
USMARC to the corresponding Greek script characters in Unicode/UCS
rather than try to retain the "latinness" of those characters by
some other mapping (e.g. to values in the Private Use Area).


Issue 7:  45   HOLAM

The issue concerned the recommended mapping for the USMARC Hebrew
character 45 HOLAM. USMARC has only a single character HOLAM, which
should have been listed as HOLAM/RIGHT SIN DOT. There are two
distinct characters -- HOLAM and SIN DOT -- in Unicode/UCS. The
discussion was about whether to map contextually to Unicode/UCS
HOLAM or Unicode/UCS RIGHT SIN DOT. This has now been resolved. 
The proposed mapping is:
45   HOLAM     05B9 HEBREW POINT HOLAM


ADDITION OF NINE ARABIC SCRIPT CHARACTERS TO UNICODE/UCS

The following nine script characters from the USMARC Arabic Set
have no corresponding equivalents in Unicode/UCS:

     A1   DOUBLE ALEF WITH HAMZA ABOVE
     B2   TCHEH WITH DOT ABOVE
     C9   SHEEN WITH DOT BELOW
     CC   DAD WITH DOT BELOW
     CF   GHAIN WITH DOT BELOW
     E7   LAM WITH THREE DOTS BELOW
     EC   NOON WITH DOT BELOW
     FD   SHORT E
     FE   SHORT U

Since USMARC Arabic has been officially adopted as an ISO standard,
that adoption will be our primary justification for getting them
added to Unicode/UCS. We assume this will be routine and will list
the mapping as "in process, pending addition to Unicode/UCS" or some
other similar phrase. We will not map them to the Private Use Area.


MAPPINGS FOR "ASCII CLONES"

The USMARC character set includes characters from the Latin,
Cyrillic, Hebrew and Arabic scripts. The Latin character set is a
superset of US ASCII. Each of the other three character sets
contains duplicates of some of the characters present in the Latin
set, typically digits and punctuation. For the purposes of this
discussion, the Character Set Subcommittee has dubbed these
characters "ASCII clones".

Unicode/UCS provides only one standard value with which to encode
each of these characters. The Standard includes an algorithm for
the display of text in opposite directions which accommodates ASCII
punctuation and digits. It is not possible to map a string of
characters from USMARC to Unicode/UCS and back to USMARC, and to
guarantee that the output will be identical to the input.
Algorithms exist which will produce a reasonable result in the case
of Cyrillic, although equivalence still cannot be guaranteed. It
has not been determined whether algorithms exist which will produce
satisfactory results for Hebrew and Arabic.

Two options have been advanced for mapping the ASCII clones into
Unicode/UCS:

Option #1 maps the ASCII clones to Latin equivalents in
Unicode/UCS, relying on reverse mapping algorithms to make the
necessary distinctions. The algorithms to do this have not yet been
developed.

Option #2 maps the ASCII clones to Private Use Space in
Unicode/UCS. This is defined in Unicode/UCS as those hex values
between U+E000 and U+F8FF. The Corporate Use Zone starts at U+F8FF
and allocates downward. The End User Zone starts at U+E000 and
allocates upward. In Option #2 we have used the Corporate Use Zone.
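
The difference between the two options can be sketched in Python.
Everything in this example is an assumption made for illustration:
a USMARC character is modeled as a (set name, code) pair, the
Private Use base values are invented and are not the assignments in
Appendix 2, and the Option #1 reverse mapping is a naive default
that stands in for the algorithms that have not yet been developed.
The example character, 3D EQUALS SIGN in the Cyrillic set, is taken
from the corrections listed later in this proposal.

```python
# Hypothetical Private Use bases, one block of 0x20 values per set.
# These are NOT the values proposed in Appendix 2.
PUA_BASE = {"cyrillic": 0xF700, "hebrew": 0xF740, "arabic": 0xF780}


def to_unicode_option1(ch):
    """Option #1: the clone collapses onto the ASCII code point."""
    _set_name, code = ch
    return code


def from_unicode_option1(cp):
    """Naive reverse mapping: nothing records the original set."""
    return ("latin", cp)


def to_unicode_option2(ch):
    """Option #2: the clone keeps its identity in the Private Use Area."""
    set_name, code = ch
    return PUA_BASE[set_name] + (code - 0x20)


def from_unicode_option2(cp):
    """Recover both the set and the code from the Private Use value."""
    for set_name, base in PUA_BASE.items():
        if base <= cp < base + 0x20:
            return (set_name, cp - base + 0x20)
    raise ValueError(f"U+{cp:04X} is not a clone value in this sketch")


original = ("cyrillic", 0x3D)  # EQUALS SIGN from the Cyrillic set
option1_result = from_unicode_option1(to_unicode_option1(original))
option2_result = from_unicode_option2(to_unicode_option2(original))

if __name__ == "__main__":
    print("Option #1 round trip:", option1_result)
    print("Option #2 round trip:", option2_result)
```

Under these assumptions the Option #1 round trip returns the Latin
equals sign rather than the Cyrillic one, while the Option #2 round
trip returns the original pair.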

Option #1. (Latin Equivalents)

Advantages  
   * Only universally defined Unicode/UCS characters are mapped to,
     thus users outside the USMARC community would encounter no
     additional difficulty in interpreting USMARC data.
   * No complications are created in the use of standard
     Unicode/UCS-compatible print and display drivers.

Disadvantages
   * The original coding of the characters is lost as a result of
     round-trip mapping, particularly when a string of characters
     begins with one of these characters. Additional parsing may be
     able to restore the original identity in some cases.
   * The handling of bi-directional text may be more complex.
   * Current USMARC records containing bi-directional text would
     require complex processing if records were to remain
     unaltered.

Option #2. (Private Use Space)

Advantages
   * The identity of each character is preserved unambiguously and
     the ability to perform round-trip mapping is guaranteed.
   * The handling of bi-directional text may be simpler.

Disadvantages
   * Use of non-standard encoding requires that all users have
     USMARC-specific documentation in order to interpret the data
     successfully.
   * Print and display drivers, or the software which invokes them,
     will need to be aware of the Private Use characters and their
     relationship to the standard characters. We could end up
     repeating the same problem we have had with the ALA character
     set, where standard hardware was not able to print or display
     some of the characters without special adaptation.

Questions to be answered
   * If Option #1 is used, can an algorithm be devised that will
     handle the round-trip mapping of bi-directional text in an
     acceptable manner?
   * If Option #1 is used, is there any case in which the inability
     to recreate the original byte sequence from a USMARC-
     Unicode/UCS-USMARC round-trip mapping would cause a problem?
   * If Option #2 is used, is the concern regarding print and
     display drivers well-founded?
   * Is it acceptable to change the character content of a
     bibliographic record as a result of conversion into
     Unicode/UCS and then back into USMARC?

The two proposed options for mapping ASCII clones are listed in
Appendix 2.


CORRECTIONS TO USMARC

The following character descriptions in USMARC Specifications for
Record Structure, Character Sets, and Media Exchange, 1994 need to
be corrected:

ASCII Basic And Extended Latin
     A1  LOWERCASE POLISH L should read A1  UPPERCASE POLISH L.
     8D  JNR (Joiner) is in the USMARC tables but is not in the
          code list.
     8E  NJR (Non-joiner) is in the USMARC tables but not in the
          code list.
Basic Hebrew
     3D  EQUAL SIGN should read 3D  EQUALS SIGN.
     45  HOLAM should read 45  HOLAM/RIGHT SIN DOT.
Basic And Extended Cyrillic  
     3D  EQUAL SIGN should read 3D  EQUALS SIGN.
     7E  UPPERCASE CHA should read 7E   UPPERCASE CHE.
     E0  UPPERCASE G WITH UPTURN should read E0  UPPERCASE GE WITH
          UPTURN.
Basic And Extended Arabic  
     3D  EQUAL SIGN should read 3D  EQUALS SIGN.


REPRESENTATION OF THE PRESENCE OF UNICODE/UCS CHARACTERS IN MARC
RECORDS

Discussion of this is continuing. We need to show when a record
contains only Unicode/UCS values and when it contains Unicode/UCS
values as well as USMARC characters. This will be addressed in a
separate proposal to MARBI.


MAPPINGS FOR THE EAST ASIAN CHARACTER CODE (EACC)

The Subcommittee recommends that a new committee be appointed to
handle these mappings, because the Subcommittee felt it lacked the
expertise to deal with East Asian scripts. The Subcommittee further
recommends that at least one member of the present Subcommittee be
named to the EACC Mapping Group, and that the Library of Congress
also be represented. We also recommend that the same set of Working
Principles be observed that this Subcommittee put into place.
Appendixes to 96-10

