The Library of Congress >> Especially for Librarians and Archivists >> Standards
HOME >> MARC Development >> Proposals List
DATE: Dec. 16, 2005
REVISED:
NAME: Technique for conversion of Unicode to MARC-8
SOURCE: Unicode-MARC Forum
SUMMARY: This paper summarizes the discussion on the Unicode-MARC Forum about converting Unicode to MARC-8 for systems that cannot handle Unicode records. The Forum consensus was to define a placeholder character to be substituted for each unmappable Unicode character.
KEYWORDS: Unicode (all formats); MARC-8 (all formats); character sets; UCS/Unicode
RELATED: Assessment of Options for Handling Full Unicode Character Encodings in MARC 21 -- Part 1: New Scripts (January 2004); Assessment of Options for Handling Full Unicode Character Encodings in MARC 21 -- Part 2: Issues (June 2005)
STATUS/COMMENTS:
12/16/05 - Made available to the MARC 21 community for discussion.
01/22/06 - Results of the MARC Advisory Committee discussion - The Placeholder approach was approved as amended. MARBI will consider a second proposal in the summer for a lossless numeric character reference conversion technique. Option 1 was approved in that the fill character is used as a placeholder for the lossy technique. A decomposition list will be prepared for preprocessing normalization.
03/07/06 - Results of LC/LAC/BL review - Approved
1 BACKGROUND
One major issue that needed resolution before full Unicode could be adopted was how the mapping from Unicode, with its 95,000+ characters, to MARC-8, with its 16,000+, was to be handled. From August to October 2005 the Unicode-MARC discussion list took up that issue. The discussion focused on three main techniques before reaching a consensus based on input from the systems that would be most affected by implementing a standard method. The discussions and conclusions are summarized below. The archive of all the messages is available via the MARC home page, www.loc.gov/marc/. (Under General Information, click on Unicode-MARC Forum.)
2 FORUM DISCUSSION
2.1 General Observations
The primary need for a Unicode to MARC-8 conversion comes from users whose systems cannot be modified, or who have no need to modify their systems in the near future, to enable the import of the full Unicode character set. Many also use only the Latin part of MARC-8, either because they do not collect non-Latin script material or because they do so only minimally and find that romanization of bibliographic data into the Latin script serves them well enough. While the continued use of MARC-8 may be viewed as temporary, temporary in this case could mean 15-20 years or so.
Many institutions distribute records to these sites, and the distributors need an agreed-upon Unicode to MARC-8 conversion: they do not want to customize for different customer preferences, which would require them to maintain multiple character set reduction streams.
Another situation where such conversion may be needed was also pointed out on the Forum: movement of data internally between Unicode-based and MARC-8 based data modules. Since this is an internal issue and may be handled differently by different systems, it was not pursued.
The presumption was made that the record source system would have the responsibility of producing a record converted to MARC-8 according to the rules agreed upon by the community, so that the receiving systems could continue to expect consistent, standard MARC-8 data.
It was noted that the language code in field 008 has no role in specifying characters present in the record. A record could be entirely English and contain many unmappable characters. A record for a Chinese-language item may contain nothing but ASCII and ANSEL.
There was general agreement that the technique needed to be cheap and easy to implement. Two classes of techniques were considered: lossy, where data would be lost and could not be recovered, and lossless, where data was carried over into the MARC-8 environment in a coded form and could therefore be recovered on reconversion to Unicode.
2.2 Solution Options
(For all solutions the question of converting precomposed letter/diacritics was noted. See further explanations of this in section 2.3.)
2.2.1 Drop record. (Do not distribute records that have non-MARC-8 characters.)
This approach was briefly considered. It would not be acceptable to institutions that are record suppliers. When a system accepted a distribution request, it would not necessarily know the character set makeup of the record accurately enough to inform the requester that the record might be ineligible to be sent.
2.2.2 Drop character. (Delete any character that is not mappable to MARC-8.)
This approach would drop any characters that would not convert. It would be non-reversible, i.e., lossy. It would be easy and cheap for the receiver to implement, as it would have little impact on their systems. A number of points were made pertaining to this approach.
2.2.3 Placeholder character. (Substitute a placeholder character for any character not mappable to MARC-8).
This is a lossy technique as it would produce records that could not be reversed. It would be relatively simple and cheap to implement by both the receiver and the source. A placeholder character would need to be determined, as discussed below. The following points were made.
The unused graphic character code point hex DF was favored. In ANSEL, DF is the last code position in the columns that contain spacing characters. Using this code point would not embed the placeholder among conventional graphic characters, as there are no other defined characters in that column of the table.
A suitable specification for the actual graphic image to be used for this placeholder character was discussed and the consensus was to leave it to the users. (The MARC documentation would use some graphic convention, but as with the subfield delimiter (double dagger, dollar sign), the visual representation would not be part of the standard.)
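As a rough illustration of the placeholder technique, the following sketch substitutes a single placeholder for every unmappable character. The names PLACEHOLDER, MAPPABLE_SAMPLE, and to_marc8_placeholder are illustrative only; a real converter would consult the full MARC-8 code tables.

```python
PLACEHOLDER = "|"  # the fill character, ASCII hex 7C (Option 1 of section 3)

# Tiny illustrative stand-in for a real Unicode-to-MARC-8 mapping table.
MAPPABLE_SAMPLE = set("abcdefghijklmnopqrstuvwxyz .,") | {"\u0301"}

def to_marc8_placeholder(text, mappable=MAPPABLE_SAMPLE):
    """Replace every character without a MARC-8 mapping by the placeholder."""
    return "".join(c if c in mappable else PLACEHOLDER for c in text)
```

Note that the normalization of section 2.3 would run first, so that a precomposed é, which decomposes to e plus combining acute, would map rather than be replaced.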
2.2.4 Numeric Character Reference. (Substitute a Numeric Character Reference (NCR) for each unmappable MARC-8 character.)
This is the main lossless technique that was discussed. It allows the original Unicode characters to be recovered. The following points were made.
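A minimal sketch of the NCR idea: each unmappable character is written out as an XML-style numeric character reference, which a receiving Unicode system can reverse on re-import. The function names and the sample mapping set are illustrative, not part of any standard.

```python
import re

def to_ncr(text, mappable):
    # Lossless: encode each unmappable character as a numeric character
    # reference so the original code point can be recovered later.
    return "".join(c if c in mappable else "&#x%04X;" % ord(c) for c in text)

def from_ncr(text):
    # Reverse conversion on re-import into a Unicode environment.
    return re.sub(r"&#x([0-9A-Fa-f]+);",
                  lambda m: chr(int(m.group(1), 16)), text)

SAMPLE_MAPPABLE = set("abcdefghijklmnopqrstuvwxyz ")
```

The round trip from_ncr(to_ncr(s, ...)) returns the original string, which is the defining property of the lossless class of techniques.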
2.3 Preprocessing Requirements
The listserv also discussed preprocessing of the Unicode record before the conversion to MARC-8 takes place. In all of the above techniques, the following steps for decomposing diacritics were presumed.
The exceptions to NFD need to be carefully and completely identified (see draft list in Annex A to this document).
Other preprocessing that might be useful is the application of Unicode Normalization Form KD. This normalization includes not only the exact-equivalent conversions of NFD but also some "compatibility equivalent" conversions. (An example is the Unicode character that is a ligature of "ffi", which has the compatibility equivalent "f" "f" "i".) This normalization before conversion might reduce the number of unmappable characters, but it would not be reversible in most cases. NFKD needs to be reviewed to see whether it would be useful to carry out, especially if techniques 2 and 3 above (drop character and placeholder) are being used. Since those are already lossy, and if compatibility equivalents are acceptable, NFKD could enable more characters to be mapped rather than deleted or converted to a placeholder.
See the full explanation of Unicode normalization at http://www.unicode.org/reports/tr15/ and a brief summary in Section 4 of Assessment of Options ... Part 2.
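The difference between canonical (NFD) and compatibility (NFKD) decomposition can be seen with Python's standard unicodedata module:

```python
import unicodedata

ligature = "\uFB03"  # LATIN SMALL LIGATURE FFI
nfd = unicodedata.normalize("NFD", ligature)    # canonical only: unchanged
nfkd = unicodedata.normalize("NFKD", ligature)  # compatibility: "f" "f" "i"

precomposed = "\u00E9"  # LATIN SMALL LETTER E WITH ACUTE
decomposed = unicodedata.normalize("NFD", precomposed)  # "e" + combining acute
```

As the text notes, NFKD maps more characters into MARC-8 range but cannot be reversed: nothing marks "ffi" as having once been a ligature.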
2.4 Information from MicroLIF
The Unicode-MARC Listserv discussion often went back to the question of what was preferred by the group of users that most needed to continue to receive MARC-8 records, as described above. The MicroLIF group is made up of many of these users' vendors, and in mid-October the MicroLIF representative to MARBI, Gail Lewis, reported to the list the response she had received from her constituents to the following query:
"Given that you're receiving data from a system which implements characters that cannot be mapped directly to MARC-8, how would you want the originating system to deal with those characters? For purposes of discussion, assume that such characters may occur anywhere other than in the fixed fields."
She summarized the response she received as follows:
"... they would like the non MARC-8 characters to be removed from the incoming records, or replaced with a single character (such as the fill character "|"). From the people I talked with, they felt that substituting a Unicode character with a character string might have negative system impacts, and would not likely ever be used. The librarians that we are representing are not sending their records to other utilities or systems, so the use of a single replacement character came thru as the solution that they liked best."
Until that response, the consensus had been tending toward the Numeric Character Reference approach.
3 PROPOSAL
Adopt the Placeholder approach for the distribution of records to MARC-8 systems from Unicode systems, with the following specifications.

Placeholder character:
Option 1: the vertical bar (ASCII hex 7C), i.e., the fill character
Option 2: define a new ANSEL graphic character, hex DF, for the placeholder function (with no associated graphic image)

Preprocessing:
a) Normalize the string
Option 1: Unicode Normalization Form D, with exceptions (draft in Annex A).
Option 2: Unicode Normalization Form KD, with exceptions to be designated (partial draft in Annex A).
b) Reverse the order of the character modifiers and base characters so that modifiers precede the base.
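Steps a) and b) together can be sketched as follows. The function name modifiers_before_base is illustrative, and the sketch assumes plain NFD without the Annex A exceptions.

```python
import unicodedata

def modifiers_before_base(text):
    # Decompose (NFD), then move each run of combining marks in front of
    # its base character: MARC-8 diacritics precede the letter they
    # modify, whereas Unicode combining marks follow it.
    decomposed = unicodedata.normalize("NFD", text)
    out, i, n = [], 0, len(decomposed)
    while i < n:
        base = decomposed[i]
        j = i + 1
        while j < n and unicodedata.combining(decomposed[j]):
            j += 1
        out.append(decomposed[i + 1:j] + base)  # marks first, then base
        i = j
    return "".join(out)
```

For example, é (hex 00E9) becomes the combining acute accent followed by the letter e, matching the MARC-8 convention.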
ANNEX A: EXCEPTIONS TO NORMALIZATION FORM D CONVERSION
In general, when converting data from Unicode encoding to MARC-8 encoding, it is necessary to convert precomposed characters such as é (Unicode hex 00E9; Latin small letter e with acute) to their separate parts, in this case the letter e and the acute accent. There are, however, some exceptions: MARC-8 has a few precomposed characters of its own, which should therefore not be decomposed. These are listed below by the MARC-8 set to which they belong. There are no occurrences of this situation in the Hebrew set, the Greek sets, or the superscript and subscript sets. (These tables were prepared by Jack Cain.)
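A minimal sketch of NFD-with-exceptions, using a few of the ANSEL horn vowels as the exception set. EXCEPTIONS and nfd_with_exceptions are illustrative names, and a real exception list would be the full Annex A.

```python
import unicodedata

# Illustrative subset of the Annex A ANSEL exceptions: the horn vowels,
# which MARC-8 encodes precomposed and which must not be decomposed.
EXCEPTIONS = {"\u01A0", "\u01AF", "\u01A1", "\u01B0"}

def nfd_with_exceptions(text):
    # Simplification: normalizing character by character ignores the
    # reordering of combining sequences that full NFD would perform.
    return "".join(ch if ch in EXCEPTIONS
                   else unicodedata.normalize("NFD", ch)
                   for ch in text)
```

Thus O with horn (hex 01A0) passes through intact, while é is still decomposed as usual.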
ANSEL Set
MARC-8 as G0 | MARC-8 as G1 | UCS | UTF8 | NAME
---|---|---|---|---
2C | AC | 01A0 | C6A0 | UPPERCASE O-HOOK / LATIN CAPITAL LETTER O WITH HORN
2D | AD | 01AF | C6AF | UPPERCASE U-HOOK / LATIN CAPITAL LETTER U WITH HORN
3C | BC | 01A1 | C6A1 | LOWERCASE O-HOOK / LATIN SMALL LETTER O WITH HORN
3D | BD | 01B0 | C6B0 | LOWERCASE U-HOOK / LATIN SMALL LETTER U WITH HORN
Arabic Set
MARC-8 as G0 | MARC-8 as G1 | UCS | UTF8 | NAME
---|---|---|---|---
42 | C2 | 0622 | D8A2 | ARABIC LETTER ALEF WITH MADDA ABOVE
43 | C3 | 0623 | D8A3 | ARABIC LETTER ALEF WITH HAMZA ABOVE
44 | C4 | 0624 | D8A4 | ARABIC LETTER WAW WITH HAMZA ABOVE
45 | C5 | 0625 | D8A5 | ARABIC LETTER ALEF WITH HAMZA BELOW
46 | C6 | 0626 | D8A6 | ARABIC LETTER YEH WITH HAMZA ABOVE
69 | E9 | 0649 | D989 | ARABIC LETTER ALEF MAKSURA
73 | F3 | 0671 | D9B1 | ARABIC LETTER ALEF WASLA
Extended Arabic Set
MARC-8 as G0 | MARC-8 as G1 | UCS | UTF8 | NAME
---|---|---|---|---
6E | EE | 06C0 | DB80 | HEH WITH HAMZA ABOVE / ARABIC LETTER HEH WITH YEH ABOVE
78 | F8 | 06D3 | DB93 | ARABIC LETTER YEH BARREE WITH HAMZA ABOVE
Cyrillic Set
MARC-8 as G0 | MARC-8 as G1 | UCS | UTF8 | NAME
---|---|---|---|---
4A | CA | 0439 | D0B9 | LOWERCASE SHORT II / CYRILLIC SMALL LETTER SHORT I
6A | EA | 0419 | D099 | UPPERCASE SHORT II / CYRILLIC CAPITAL LETTER SHORT I
Extended Cyrillic Set
MARC-8 as G0 | MARC-8 as G1 | UCS | UTF8 | NAME
---|---|---|---|---
42 | C2 | 0453 | D193 | CYRILLIC SMALL LETTER GJE
44 | C4 | 0451 | D191 | CYRILLIC SMALL LETTER IO
47 | C7 | 0457 | D197 | LOWERCASE YI / CYRILLIC SMALL LETTER YI (Ukrainian)
4C | CC | 045C | D19C | CYRILLIC SMALL LETTER KJE
4D | CD | 045E | D19E | LOWERCASE SHORT U / CYRILLIC SMALL LETTER SHORT U (Byelorussian)
62 | E2 | 0403 | D083 | CYRILLIC CAPITAL LETTER GJE
64 | E4 | 0401 | D081 | CYRILLIC CAPITAL LETTER IO
67 | E7 | 0407 | D087 | UPPERCASE YI / CYRILLIC CAPITAL LETTER YI (Ukrainian)
6C | EC | 040C | D08C | CYRILLIC CAPITAL LETTER KJE
6D | ED | 040E | D08E | UPPERCASE SHORT U / CYRILLIC CAPITAL LETTER SHORT U (Byelorussian)