NAME: Addition of new characters to existing USMARC sets
SOURCE: Research Libraries Group
SUMMARY: This proposal suggests defining three new characters in the existing USMARC character set for the basic Arabic script
KEYWORDS: Arabic script; Character sets
RELATED:
STATUS/COMMENTS:
5/1/97 - Forwarded to USMARC Advisory Group for discussion at the June 1997 MARBI meetings.
6/28/97 - Results of USMARC Advisory Group discussion - Approved.
8/21/97 - Result of final LC review - Approved.
PROPOSAL NO. 97-14: Addition of new characters to existing USMARC sets

1. BACKGROUND

Several non-Latin character sets have been developed for use in USMARC records since the late 1970s. All USMARC character sets are based on standards, when available. An attempt has always been made to keep the USMARC character sets in sync with other standards when possible. This proposal recommends the addition of three Arabic script characters to the USMARC Basic Arabic set to synchronize it with several standards as well as with the Arabic implementations of USMARC users.

2. DISCUSSION

The USMARC character sets for the Arabic script were developed in the late 1980s and early 1990s to support cataloging in the vernacular for Arabic and Persian (Farsi) language materials. The basic and extended Arabic script character sets provide sufficient characters to support vernacular cataloging in other languages as well, including Kashmiri, Kurdish, Moplah, Turkish (Ottoman period), Pushto, Sindhi, Uighur, and Urdu.

The basic USMARC Arabic script set was based on two standards: ASMO Standard Specification 449 (an Arab standard) and ISO 9036 (Information Processing--Arabic 7-bit Coded Character Set for Information Interchange). The extended USMARC Arabic script set was developed at the same time as ISO 11822 (Information and documentation--Extension of the Arabic Alphabet Coded Character Set for Bibliographic Information Interchange). The USMARC and ISO extended Arabic script sets are completely in sync.

Due to slightly different requirements for bibliographic applications, the USMARC basic Arabic set has some differences from ISO 9036 and ASMO 449. The most noteworthy difference is the inclusion of Arabic style digits 0 through 9 in the USMARC set rather than the Indic style digits. The USMARC basic Arabic script set also includes two additional characters, defined in character code positions that are unassigned in ISO 9036 and ASMO 449.
The extra characters correspond to letters that sometimes appear in cataloging data.

During the recent work of the MARBI Character Set Subcommittee, which has been developing a mapping of the existing USMARC character sets to ISO 10646 (Information Technology--Universal Multiple-Octet Coded Character Set (UCS)), discrepancies between the published USMARC Arabic set and USMARC implementations of that set were discovered. The Research Libraries Group (RLG), which has worked closely with MARBI and the Library of Congress in the development and implementation of non-Latin character sets, implemented the Arabic sets in November 1991. RLG's implementation has been the source of the largest number of Arabic vernacular cataloging records in the U.S. Differences between RLG's Arabic implementation and the USMARC sets are considered important because of the number of vernacular Arabic script cataloging records that RLG maintains.

The MARBI Character Set Subcommittee came to the conclusion that the three extra Arabic characters in the RLG Arabic implementation should be added to the USMARC set. Mappings to the universal coded character set have already been determined and will be added to the mapping document. The basic Arabic characters in question are the following (UCS character names have been used):

    ARABIC THOUSANDS SEPARATOR: USMARC character code '78' (maps to U+066C)
    RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK: USMARC character code '79' (maps to U+00BB)
    LEFT-POINTING DOUBLE ANGLE QUOTATION MARK: USMARC character code '7A' (maps to U+00AB)

Initial justification for including these characters in the Arabic set was that they were found to occur on Arabic script title pages. For example, the Arabic thousands separator occurs in the classic title "1001 Nights". The characters are also included in the DOS and Windows code pages for the Arabic script, upon which most Arabic implementations are based.
Other USMARC implementers such as VTLS and Innovative Interfaces use the Arabic version of Windows as the platform for their Arabic systems. These characters are also, of course, in the universal coded character set, which was based on independent assessments of the existence and usefulness of characters for a variety of scripts.

It is unclear how many occurrences of these characters can be found in existing USMARC records. (Statistics of this sort are not available for all USMARC databases.) Considering the number of years that USMARC Arabic implementations have existed, and the presence of these characters in user documentation, it is likely that they have been used and will someday need to be handled in conversion to universal character encodings. It is best to add these missing characters to the USMARC sets now.

It is important to note that the addition of characters to the USMARC Arabic set will not affect most USMARC users. Only a small number of USMARC implementations support the Arabic sets. It is suspected that those implementations already support the characters suggested in this proposal.

3. PROPOSED CHANGES

The following is presented for consideration:

- Define the following three new characters in the existing USMARC Basic Arabic set:

    ARABIC THOUSANDS SEPARATOR: USMARC character code '78' (maps to U+066C)
    RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK: USMARC character code '79' (maps to U+00BB)
    LEFT-POINTING DOUBLE ANGLE QUOTATION MARK: USMARC character code '7A' (maps to U+00AB)
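
As an illustration only, the three proposed assignments can be written as a small lookup table from USMARC Basic Arabic code values to UCS characters. This is a minimal sketch in Python and is not part of the proposal: the names PROPOSED_BASIC_ARABIC_TO_UCS and describe are hypothetical, the code values are assumed to be the hexadecimal values 78, 79, and 7A given above, and the sketch ignores how the Basic Arabic set is designated within a record.

```python
import unicodedata

# Hypothetical table (not from the proposal text itself): USMARC Basic Arabic
# code value (hex, as listed in the proposal) -> UCS character.
PROPOSED_BASIC_ARABIC_TO_UCS = {
    0x78: "\u066C",  # ARABIC THOUSANDS SEPARATOR
    0x79: "\u00BB",  # RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
    0x7A: "\u00AB",  # LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
}


def describe(code: int) -> str:
    """Return a one-line description of a proposed code value.

    Raises KeyError for any code value outside the three proposed additions;
    the rest of the Basic Arabic set is deliberately not modeled here.
    """
    ch = PROPOSED_BASIC_ARABIC_TO_UCS[code]
    # unicodedata.name() looks up the UCS character name, which lets the
    # table be checked against the names quoted in the proposal.
    return f"{code:02X} -> {ord(ch):04X} {unicodedata.name(ch)}"


if __name__ == "__main__":
    for code in sorted(PROPOSED_BASIC_ARABIC_TO_UCS):
        print(describe(code))
```

Checking each entry against the character names reported by unicodedata is a simple way to confirm that a mapping table agrees with the UCS names used in the proposal.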