PROPOSAL NO.: 2001-09

DATE: May 7, 2001
REVISED:

NAME: Mapping of EACC characters to Unicode/UCS

SOURCE: Library of Congress

SUMMARY: This paper proposes a mapping of characters from the MARC East Asian Character Code (EACC) set to Unicode/UCS.

KEYWORDS: Character sets; East Asian Character Set; EACC; UCS; Unicode

RELATED:

STATUS/COMMENTS:

05/07/01 - Made available to the MARC 21 community for discussion.

06/18/01 - Results of the MARC Advisory Committee discussion - Approved.
The identification of further mappings for EACC will be delegated to RLG and the CJK experts of the Unicode Consortium. The EACC Task Force will be dissolved.

08/07/01 - Results of LC/NLC review - Approved.


PROPOSAL NO. 2001-09: Mapping of EACC characters to Unicode/UCS

1. BACKGROUND

The East Asian Character Code (EACC) is a bibliographic character set for Chinese, Japanese, and Korean (CJK) approved for use in MARC 21 records. Developed by the Research Libraries Group in 1983, EACC was approved as an American National Standard, ANSI/NISO Z39.64, in 1989. As of May 2001, the EACC repertoire contained 15,728 characters (15,704 from version L of the CJK thesaurus, plus the CJK space and 23 punctuation and pronunciation marks). These characters were broken into the following categories for the purpose of the MARBI mapping:

13,468...Ideographs (for Chinese, Japanese, and Korean)
173...Japanese kana (86 katakana, 83 hiragana, 4 sound marks)
2,028...Korean hangul (1,966 modern, 29 archaic, 33 jamo)
24...CJK punctuation marks (9 East Asian, 14 Western, CJK space)
35...Ideographic "component input method" characters (used in RLIN system)

In 1996, the reconstituted MARBI Character Set Subcommittee developed mappings for most of the MARC character sets to the Unicode (TM) Standard (which is synchronized with the Universal Coded Character Set (UCS), ISO/IEC 10646) to support migration of MARC data from the 8-bit library sets to the new universal character set. The 1996 work focused on MARC character sets for the Arabic, Cyrillic, Hebrew, and Latin scripts, as well as Greek symbols, superscripts, and subscripts. These character sets cover the bulk of the character encodings that currently exist in MARC 21 bibliographic records. The MARBI subcommittee did not deal with the large East Asian Character Code (EACC) set (ANSI/NISO Z39.64) because it involved the Chinese, Japanese, and Korean scripts, with which subcommittee members had little familiarity. The subcommittee recommended that MARBI establish a special group to deal with the East Asian characters.

The East Asian Character Set Task Force was formed by MARBI in 1997 to establish mappings between EACC and Unicode. The work of the Task Force focused specifically on reviewing mappings of East Asian characters already done by the Unicode Consortium, identifying characters missing from Unicode, establishing mappings for Korean hangul, Japanese kana, CJK punctuation and component characters, and working out a solution for mapping duplicate and variant ideographic characters. This proposal details the results of the Task Force's work and presents a proposal for establishing the definitive mapping of EACC characters to Unicode/UCS. The work of the Task Force was reviewed by a special committee of the Council on East Asian Libraries (CEAL). Corrections from members of the special committee have been incorporated into this proposal.

Proposed mappings for Korean hangul, Japanese kana, CJK punctuation and component characters, and additions and changes to the Unicode Consortium's mappings for EACC ideographs were posted for public review by the library community in March 2001. The availability of these proposed mappings was announced on the MARC list.

2. DISCUSSION

The EACC set used in MARC 21 records for vernacular Chinese, Japanese, and Korean script information contains 15,728 characters. The Unicode Consortium's mapping of 21,204 Unicode ideographic characters to other standards included mappings to 13,226 of the ideographs in EACC. This left 2,502 EACC characters to be mapped or otherwise dealt with by the MARBI Task Force. The review of the Unicode to EACC mappings done by the Unicode Consortium identified 203 changes that were needed. The changes needed were broken down as follows:

131...corrections to existing Unicode to EACC character mappings
30...deletions of existing Unicode to EACC character mappings
42...additions of Unicode to EACC character mappings using existing Unicode characters
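
The three kinds of change can be pictured as edits to a lookup table keyed by Unicode code point. The following Python sketch is illustrative only: the function name apply_changes, the table layout (Unicode code point to a six-hex-digit EACC string), and every code value in the sample data are hypothetical placeholders, not entries from the actual mapping.

```python
def apply_changes(base, corrections, deletions, additions):
    """Return a new Unicode -> EACC table with the three change sets applied.

    base, corrections, additions: dicts of {unicode_code_point: eacc_code}
    deletions: iterable of unicode code points whose mapping is withdrawn
    """
    table = dict(base)  # work on a copy; leave the published table untouched

    # A correction replaces the EACC value of a mapping that already exists.
    for ucs, eacc in corrections.items():
        if ucs not in table:
            raise KeyError(f"correction for unmapped code point U+{ucs:04X}")
        table[ucs] = eacc

    # A deletion withdraws an existing mapping entirely.
    for ucs in deletions:
        del table[ucs]

    # An addition maps an existing Unicode character that had no EACC value.
    for ucs, eacc in additions.items():
        if ucs in table:
            raise ValueError(f"U+{ucs:04X} is already mapped")
        table[ucs] = eacc

    return table


# Counts taken from the breakdown above.
CHANGE_COUNTS = {"corrections": 131, "deletions": 30, "additions": 42}
TOTAL_CHANGES = sum(CHANGE_COUNTS.values())
NET_GAIN = CHANGE_COUNTS["additions"] - CHANGE_COUNTS["deletions"]

# Hypothetical placeholder data (not real EACC assignments).
base = {0x4E00: "AAAAAA", 0x4E01: "BBBBBB", 0x4E02: "CCCCCC"}
revised = apply_changes(
    base,
    corrections={0x4E00: "AAAAAB"},
    deletions=[0x4E01],
    additions={0x4E03: "DDDDDD"},
)
```

Corrections leave the number of mapped characters unchanged, so only additions and deletions affect how many EACC characters remain without a Unicode mapping.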

The mapping of the remaining 2,490 EACC characters (the 2,502 noted above, less what appears to be the net gain of 12 mappings from the 42 additions and 30 deletions) was handled in a variety of ways, depending upon the repertoire of characters into which each group fit. In all cases, an effort was made to find a mapping among the existing Unicode characters. When no mapping could be found, characters were mapped to unique codes in the Private Use Area (PUA) reserved in Unicode. The list below details how the remaining groups of EACC characters were handled.

173.....Japanese hiragana and katakana, plus sound marks: mapped to Unicode values
24......CJK punctuation marks: mapped to corresponding Unicode values
2,001...Korean hangul: mapped to corresponding Unicode values
27......Korean hangul: mapped to code values in PUA
5.......Ideographs not in Unicode: mapped to code values in PUA
28......Ideographs lacking exact forms in Unicode: mapped to code values in PUA
151.....Related variant ideographs unified in Unicode: mapped to code values in PUA
8.......Unrelated variant ideographs unified in Unicode: mapped to code values in PUA
38......Duplicate simplified ideographs: mapped to code values in PUA
35......"Component input method" characters: mapped to code values in PUA

The list above includes 2,490 additional EACC to Unicode mappings not previously available to MARC users. It includes a category of 35 code values for "component input method" characters that should not appear in MARC records. These component characters were defined for input interfaces in which the person keying data enters a series of components that, when combined, are interpreted to offer a choice from among a small number of matching ideographs. Although the components were not intended to become part of EACC data, mappings for them are provided in case any made their way into MARC records.
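
The split between standard Unicode targets and Private Use Area targets in the list above can be tallied, and a converted code point tested for PUA membership, as in the Python sketch below. The proposal does not say which private use range was chosen; the sketch assumes the Basic Multilingual Plane PUA (U+E000 to U+F8FF), and the names BMP_PUA, is_private_use, and GROUPS are illustrative only.

```python
# Assumption: PUA targets fall in the BMP Private Use Area, U+E000..U+F8FF.
BMP_PUA = range(0xE000, 0xF900)


def is_private_use(code_point):
    """True if the code point lies in the (assumed) BMP Private Use Area."""
    return code_point in BMP_PUA


# (group, count, mapped_to_pua) taken from the list above.
GROUPS = [
    ("Japanese kana and sound marks", 173, False),
    ("CJK punctuation marks", 24, False),
    ("Korean hangul with Unicode values", 2001, False),
    ("Korean hangul without Unicode values", 27, True),
    ("Ideographs not in Unicode", 5, True),
    ("Ideographs lacking exact forms in Unicode", 28, True),
    ("Related variant ideographs unified in Unicode", 151, True),
    ("Unrelated variant ideographs unified in Unicode", 8, True),
    ("Duplicate simplified ideographs", 38, True),
    ("Component input method characters", 35, True),
]

unicode_total = sum(count for _, count, pua in GROUPS if not pua)
pua_total = sum(count for _, count, pua in GROUPS if pua)

print(f"mapped to standard Unicode values: {unicode_total}")
print(f"mapped to Private Use Area values: {pua_total}")
print(f"total: {unicode_total + pua_total}")
```

A check such as is_private_use could be used after conversion to flag records whose characters will display correctly only on systems that share the same private use assignments.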

Since the mapping of EACC to Unicode was based primarily on work done by the Unicode Consortium in 1994, the primary reference document for MARC users is on the Unicode Web site. The URL is:

http://www.unicode.org/charts/unihan.html

Mappings for the special groups of characters described in this proposal, including the 203 changes to the Unicode Consortium's mapping of Unicode to EACC, will be made available on the Unicode Web site. A link will be made to this mapping from the MARC Web site when it is available. In the interim, the following text documents are available for FTP transfer from RLG at host domain:

ftp://ftp.rlg.org

Look in subdirectory "/pub/EACC" for the following ASCII text files:

mu-kana.txt .....Japanese kana
mu-punc.txt .....CJK punctuation
mu-hangl.txt ....Korean hangul
id-mssng.txt ....Ideographs not in Unicode
id-inxct.txt ....Ideographs lacking exact forms in Unicode
id-2to1r.txt ....Related variant ideographs unified in Unicode
id-2to1u.txt ....Unrelated variant ideographs unified in Unicode
id-simpl.txt ....Duplicate simplified ideographs
mu-compn.txt ...."Component input method" characters
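
This proposal does not describe the internal layout of these text files. Purely as an illustration of how such a file might be loaded into a lookup table, the Python sketch below assumes a hypothetical layout of one mapping per line: an EACC code in hexadecimal, whitespace, a Unicode code point in hexadecimal, and an optional comment introduced by "#". The function name parse_mapping and all sample values are placeholders; the real files should be inspected before any parser is written for them.

```python
def parse_mapping(text):
    """Parse lines of the ASSUMED form '<EACC hex> <Unicode hex> [# comment]'.

    Returns a dict of {eacc_code: unicode_code_point}, both as integers.
    """
    mapping = {}
    for line_no, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue  # skip blank and comment-only lines
        fields = line.split()
        if len(fields) < 2:
            raise ValueError(f"line {line_no}: expected two hex fields")
        eacc = int(fields[0], 16)
        ucs = int(fields[1], 16)
        mapping[eacc] = ucs
    return mapping


# Placeholder content in the assumed layout (not real assignments).
SAMPLE = """\
# placeholder values, not real EACC or Unicode assignments
111111 E000
222222 E001   # trailing comment

333333 E002
"""

table = parse_mapping(SAMPLE)
for eacc, ucs in sorted(table.items()):
    print(f"EACC {eacc:06X} -> U+{ucs:04X}")
```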

3. PROPOSED CHANGES

In the MARC 21 Formats:


Library of Congress Help Desk (08/07/01)