Discussion Paper 2002-DP06

DATE: December 18, 2001
REVISED:

NAME: Repertoire Expansion in the Universal Character Set for Canadian Aboriginal Syllabics

SOURCE: Canadian Committee on MARC

SUMMARY: This paper proposes that, for records in UCS/Unicode only, the repertoire permitted in MARC 21 be expanded to include all characters from the Canadian Aboriginal Syllabics set. (Appendix A provides background on Canadian Aboriginal Syllabics.)

KEYWORDS: Canadian Aboriginal Syllabic set (All); Leader (All); MARC-8 (All); UCS/Unicode (All)

RELATED: DP 73 (February 1994); 96-10 (June 1996); 97-10 (June 1997); 98-18 (June 1998)

STATUS/COMMENTS:

12/18/01 - Made available to the MARC 21 community for discussion.

01/21/02 - Results of the MARC Advisory Committee discussion - Most participants favored expanding the current character set repertoire for CAS; however, they did not think that a mechanism was needed to flag the presence of the expansion. The discussion indicated some interest in exploring further expansion of the repertoire for other uses. A proposal for CAS will be presented during the annual 2002 meetings.

Discussion Paper 2002-DP06: Repertoire Expansion in the Universal Character Set for Canadian Aboriginal Syllabics

1. BACKGROUND

1.1 UCS/Unicode

The use of UCS/Unicode in MARC 21 records as an alternative character set encoding to the MARC-8 set was approved in June 1998, following the work of the MARBI Unicode Encoding and Recognition Technical Issues Task Force presented in Proposal 98-18 (Unicode Identification and Encoding in MARC records). This Task Force followed the mapping work presented by the MARBI Character Set Subcommittee in Proposals 96-10 (MARC Character Set Issues and Mapping to Unicode) and 97-10 (Use of the universal code character set in MARC records). The Subcommittee was created following the consideration of Discussion Paper 73 (UCS and MARC Mapping) in February 1994.

One of the working principles of the Character Set Subcommittee stated in Proposal 97-10 is that round-trip mapping would be provided between the MARC-8 set and the accepted repertoire in UCS. This has the effect of providing UCS as an alternate encoding without approving any repertoire expansion. The desirability of eventual expansion in a UCS framework, however, was recognized even when DP73 was written, and repeated in Proposal 98-18 (section 1.1), which considered the restriction temporary until expansion proposals were prepared and approved.

1.2 Canadian Aboriginal Syllabics (CAS)

The impetus for this discussion paper is a current difficulty in Canadian cataloging: the use of Canadian Aboriginal Syllabics (CAS), a writing system used for the Cree and Inuktitut languages. Since the creation of the new territory of Nunavut in April 1999, usage of the script has increased. Because the use of romanized forms is not an acceptable solution within Canada's national perspective, repertoire expansion to accommodate bibliographic information using CAS is an urgent national priority in Canada. Although the need for CAS is limited to Canada, any solution allowing these characters to be part of a MARC record requires solving the more general difficulty of character repertoire expansion in MARC records. The appendix provides background on Canadian Aboriginal Syllabics.

The solutions proposed in this paper do not oblige libraries outside of Canada to adopt CAS, nor would they be likely to have any need to do so, because implementation of the CAS set is optional. Those libraries not needing to create records including these characters can continue to apply transliteration according to current practice. Romanization of Inuktitut in syllabics can be found in Cataloging Service Bulletin, no. 87 (winter 2000), p. 35-36. Whether a library needs to adopt the expanded CAS set depends on the policy decision it makes about the use of Model A (transliteration plus parallel vernacular 880 fields) or Model B (simple multiscript records with no parallel fields) for multiscript records.

The proposed solutions are directed solely at Unicode-encoded records. There is no suggestion to expand the repertoire of MARC records that are not in Unicode. This paper is being submitted as a discussion paper rather than a proposal because it raises fundamental issues; guidance from MARBI is therefore requested.

2. DISCUSSION

The adoption of Unicode for general and library software is progressing, but Unicode is still far from being universally available. An interesting development is that Unicode has become available for desktop computing and in the recent releases of some library management software prior to being generally available in bibliographic utilities. Thus, a library could be in the situation of being able to handle an expanded repertoire internally to its operations, for input, display, printing, and processing, but not be able to exchange expanded records with certain exchange partners.

In Proposal 98-18, Leader/09 was defined as Character coding scheme, with two values:

# (blank)	MARC-8
a	UCS/Unicode

Using Leader/09 value a implies that the restricted repertoire has been respected, although this is not made explicit in the definition.
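
The two values above can be read directly from the leader, since Leader/09 sits at a fixed offset. The following is a minimal sketch, assuming the 24-character leader is already available as a string; the function name and the sample leaders are made up for illustration and do not come from real records.

```python
# Labels taken from the Leader/09 table above; "#" in the table denotes a blank.
CODING_SCHEMES = {" ": "MARC-8", "a": "UCS/Unicode"}


def coding_scheme(leader: str) -> str:
    """Return the character coding scheme named by Leader/09."""
    if len(leader) != 24:
        raise ValueError("a MARC 21 leader is exactly 24 characters long")
    # Leader positions are counted from 0, so Leader/09 is index 9.
    return CODING_SCHEMES.get(leader[9], "undefined")


# Illustrative leaders (hypothetical, constructed for this sketch).
MARC8_LEADER = "00714cam  2200205 a 4500"
UNICODE_LEADER = "00714cam a2200205 a 4500"

print(coding_scheme(MARC8_LEADER))
print(coding_scheme(UNICODE_LEADER))
```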

If repertoire expansion for CAS is approved, it will probably be useful to provide a mechanism for identifying and labeling those MARC 21 records coded in UTF-8 that cannot be completely converted to the MARC-8 set. These records can then be omitted from files exchanged with partners who are not ready for the expanded character set. This will be needed only until the transition to general Unicode support is accomplished.
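
As an illustration of the identification step, a system could scan record text for characters in the CAS code range given in the appendix (hexadecimal 1401-1676). This is only a sketch under a simplifying assumption: it treats CAS as the sole expansion, so it does not test whether other characters fall inside the MARC-8 repertoire. The function names and sample titles are hypothetical.

```python
# Assumption: only the CAS range from the appendix counts as "beyond MARC-8".
# A complete check would need the full MARC-8 to UCS mapping tables.
CAS_FIRST = 0x1401
CAS_LAST = 0x1676


def contains_cas(text: str) -> bool:
    """Return True if any character lies in the CAS code range."""
    return any(CAS_FIRST <= ord(ch) <= CAS_LAST for ch in text)


def exportable_titles(titles: list[str]) -> list[str]:
    """Keep only the titles that a MARC-8-only partner could receive."""
    return [title for title in titles if not contains_cas(title)]


# "Nunavut" in romanized form and in syllabics (written as escapes).
SAMPLE_TITLES = ["Nunavut", "\u14c4\u14c7\u1557\u1466"]

print(exportable_titles(SAMPLE_TITLES))
```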

2.1 Leader/09

One mechanism for accommodating additions may be to add a value to Leader/09 as follows:

b - UCS/Unicode (record includes characters beyond MARC-8 subset)
Code b indicates that the character coding in the record includes characters beyond the MARC-8 subset of the Universal Coded Character Set (UCS) (ISO 10646), or Unicode, an industry subset.

The definition of existing value a may be edited as follows:

a - UCS/Unicode (record is restricted to and can be completely mapped to MARC-8 subset)
Code a indicates that the character coding in the record is restricted to the MARC-8 subset of the Universal Coded Character Set (UCS) (ISO 10646), or Unicode, an industry subset.

The advantage of using Leader/09 for identification is that it is early in the record and in a mandatory position. The value b could be set manually by the cataloger at the time of input, since that person would know that characters from the CAS set have been used in the record.
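
A local system could also set the value without cataloger intervention. The sketch below assumes the proposed value b were adopted (it is only a suggestion in this paper) and that the CAS range from the appendix is the only expansion; the function names and the sample leader are hypothetical.

```python
CAS_FIRST, CAS_LAST = 0x1401, 0x1676  # code range given in the appendix


def flag_leader(leader: str, field_values: list[str]) -> str:
    """Return the leader with Leader/09 changed from "a" to the proposed
    value "b" when any field value contains a CAS character."""
    if len(leader) != 24:
        raise ValueError("a MARC 21 leader is exactly 24 characters long")
    has_cas = any(
        CAS_FIRST <= ord(ch) <= CAS_LAST
        for value in field_values
        for ch in value
    )
    if leader[9] == "a" and has_cas:
        return leader[:9] + "b" + leader[10:]
    return leader


def safe_for_marc8_partner(leader: str) -> bool:
    """Blank (MARC-8) and "a" (restricted Unicode) can be fully converted."""
    return leader[9] in (" ", "a")


leader = "00714cam a2200205 a 4500"  # hypothetical Unicode record leader
flagged = flag_leader(leader, ["\u14c4\u14c7\u1557\u1466"])
print(flagged[9], safe_for_marc8_partner(flagged))
```

Because the flag sits at a fixed offset, an export routine under this scheme could decide whether to send a record without parsing any variable fields.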

2.2 Field 066

Variable field 066 (Character sets present) could also be reintroduced into the UCS environment to carry this labeling function. This would require a new subfield (e.g. subfield $d) with a single code that would flag only expanded repertoire records. This would flag just the fact of expansion, however, and not which additional scripts or character blocks are involved, in line with what was previously determined in Proposal 98-18. Section 2.7 of Proposal 98-18 states that "it is difficult to define or represent appropriate subsets of the Unicode repertoire that would prove either workable or meaningful." The proposal also states that "more fundamentally, no demonstrable need or concrete use for this information [script or character block] emerged from the discussions." The drawback to the 066 approach is that the field appears later in the record, in a variable data field that is not mandatory, and thus does not fulfill the early warning function expected of such a flag.
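
To make the "later in the record" drawback concrete, the sketch below builds a small ISO 2709 record and shows that detecting field 066 requires reading the directory, whereas Leader/09 is a single fixed byte. The record content, the helper names, and the subfield $d code "x" are all hypothetical; this paper does not define a code value.

```python
FIELD_TERMINATOR = b"\x1e"
RECORD_TERMINATOR = b"\x1d"
SUBFIELD_DELIMITER = b"\x1f"


def build_record(fields: list[tuple[str, bytes]]) -> bytes:
    """Assemble a minimal ISO 2709 record from (tag, data) pairs."""
    directory = b""
    body = b""
    for tag, data in fields:
        data += FIELD_TERMINATOR
        # Directory entry: 3-char tag, 4-digit length, 5-digit start offset.
        directory += f"{tag}{len(data):04d}{len(body):05d}".encode("ascii")
        body += data
    base = 24 + len(directory) + 1  # leader + directory + its terminator
    length = base + len(body) + 1   # plus the record terminator
    leader = f"{length:05d}nam a22{base:05d} a 4500".encode("ascii")
    return leader + directory + FIELD_TERMINATOR + body + RECORD_TERMINATOR


def directory_tags(record: bytes) -> list[str]:
    """List the field tags by walking the 12-byte directory entries."""
    base = int(record[12:17])        # Leader/12-16: base address of data
    directory = record[24:base - 1]  # skip leader, drop field terminator
    return [directory[i:i + 3].decode("ascii")
            for i in range(0, len(directory), 12)]


SAMPLE = build_record([
    ("001", b"demo0001"),
    ("066", b"  " + SUBFIELD_DELIMITER + b"dx"),  # "x" is a placeholder code
    ("245", b"00" + SUBFIELD_DELIMITER + b"aNunavut"),
])

print(SAMPLE[9:10])                     # Leader/09: one fixed-offset byte
print("066" in directory_tags(SAMPLE))  # 066: requires a directory scan
```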

2.3 Use of Local Systems

The identification function for selective exporting may also be considered a local system responsibility and thus, one may not need to internally flag or identify individual records carrying repertoire expansion at all. When FTP is the transfer method used, the character set information about a file could be transmitted in the electronic transfer label file using the Character Set and Set Variations fields as already defined. Repertoire is a property at the record level rather than at the database or file level, however, so exchange partners are better served with an explicit identifier in each record.

3. QUESTIONS

Should the current character set repertoire of MARC 21 records in Unicode be expanded for those users who have valid and perhaps urgent needs for expansion? If yes, then should the case of CAS be used as a test case in this expansion?
If CAS may be used as a test case in expanding the current character set repertoire, should a mechanism then be found to flag the presence of this expansion within the MARC record?
What mechanism should be used to flag the presence of the expanded character set repertoire?

APPENDIX

The Canadian Aboriginal Syllabics (CAS) writing system was created in 1840 by James Evans, a Wesleyan missionary in what is now Manitoba. CAS was first used for the writing of Cree and Ojibwe and was adapted later by others for writing the Inuit language (Inuktitut). A number of other Athabaskan and Algonquian languages have used syllabics, but currently the principal usage is for Cree and Inuktitut.

CAS characters were added to Unicode in version 3.0, published as The Unicode Standard, Version 3.0, Reading, Mass. : Addison-Wesley, 2000 (ISBN 0-201-61633-5). The code range is 1401-1676 (hexadecimal). Approximately 600 code positions are currently defined, around 100 of which are in actual use for either Cree or Inuktitut.
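
The size of the stated range can be checked with simple arithmetic, reading 1401 and 1676 as hexadecimal code points. This is a small illustrative sketch; the variable names are made up here, and the character names come from Python's standard unicodedata module rather than from this paper.

```python
import unicodedata

CAS_FIRST = 0x1401
CAS_LAST = 0x1676

# Number of code positions in the inclusive range.
position_count = CAS_LAST - CAS_FIRST + 1

# Look up the Unicode character names at both ends of the range.
first_name = unicodedata.name(chr(CAS_FIRST))
last_name = unicodedata.name(chr(CAS_LAST))

print(position_count)
print(first_name)
print(last_name)
```

The count of 630 is consistent with the "approximately 600 code positions" cited above.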

On April 1, 1999, Canada created in the eastern Arctic a new territory called Nunavut. Eighty-five percent of the population of this territory is Inuit. The territorial government, by policy, has adopted both Inuktitut and English as the languages of the workplace. Rapidly increasing amounts of written materials are being produced in Nunavut using CAS. Cataloging of these materials at the Legislative Library as well as at other government libraries is being done using CAS (using a VTLS local system). CAS is also in widespread use in the area of northern Quebec (called Nunavik), where initiatives are underway to create a form of self-government similar to that established in Nunavut.

Canadian Aboriginal Syllabics can be found at: www.unicode.org/charts/PDF/U1400.pdf

Library of Congress Help Desk (03/21/2002)