The Library of Congress >> Especially for Librarians and Archivists >> Standards
HOME >> MARC Development >> Proposals List
DATE: Dec. 16, 2005
REVISED:
NAME: Technique for conversion of Unicode to MARC-8
SOURCE: Unicode-MARC Forum
SUMMARY: This paper summarizes the discussion on the Unicode-MARC Forum about converting Unicode to MARC-8 for systems that cannot handle Unicode records. The Forum consensus was to define a placeholder character to be substituted for each unmappable Unicode character.
KEYWORDS: Unicode (all formats); MARC-8 (all formats); character sets; UCS/Unicode
RELATED: Assessment of Options for Handling Full Unicode Character Encodings in MARC 21 -- Part 1: New Scripts (January 2004); Assessment of Options for Handling Full Unicode Character Encodings in MARC 21 -- Part 2: Issues (June 2005)
STATUS/COMMENTS:
12/16/05 - Made available to the MARC 21 community for discussion.
01/22/06 - Results of the MARC Advisory Committee discussion - The Placeholder approach was approved as amended. MARBI will consider a second proposal in the summer for a lossless numeric character reference conversion technique. Option 1 was approved in that the fill character is used as a placeholder for the lossy technique. A decomposition list will be prepared for preprocessing normalization.
03/07/06 - Results of LC/LAC/BL review - Approved
1 BACKGROUND
One major issue that needed resolution before full Unicode could be adopted was how the mapping from Unicode, with its 95,000+ characters, to MARC-8, with its 16,000+, was to be handled. From August to October 2005 the Unicode-MARC discussion list took up that issue. The discussion focused on three main techniques before reaching a consensus based on input from the systems that would be most affected by implementing a standard method. The discussions and conclusions are summarized below. The archive of all the messages is available via the MARC home page, www.loc.gov/marc/. (Under General Information, click on Unicode-MARC Forum.)
2 FORUM DISCUSSION
2.1 General Observations
The primary need for a Unicode to MARC-8 conversion comes from users whose systems cannot be modified, or who have no need to modify their systems in the near future, to enable the import of the full Unicode character set. Many also use only the Latin part of MARC-8, either because they do not collect non-Latin script material or because they do so only minimally and find that romanization of bibliographic data into the Latin script serves them well enough. While the continued use of MARC-8 may be viewed as temporary, temporary in this case could mean 15-20 years or so.
Many institutions distribute records to these sites, and the distributors need an agreed-upon Unicode to MARC-8 conversion: they do not want to customize for different customer preferences, which would require them to maintain multiple character set reduction streams.
Another situation where such conversion may be needed was also pointed out on the Forum: movement of data internally between Unicode-based and MARC-8 based data modules. Since this is an internal issue and may be handled differently by different systems, it was not pursued.
The presumption was made that the record source system would have the responsibility of producing a record converted to MARC-8 according to the rules agreed upon by the community, so that the receiving systems could continue to expect consistent, standard MARC-8 data.
It was noted that the language code in field 008 has no role in specifying characters present in the record. A record could be entirely English and contain many unmappable characters. A record for a Chinese-language item may contain nothing but ASCII and ANSEL.
There was general agreement that the technique needed to be cheap and easy to implement. Two classes of techniques were considered: lossy, where data would be lost and could not be recovered, and lossless, where data was carried over into the MARC-8 environment in a coded form and could therefore be recovered on reconversion to Unicode.
2.2 Solution Options
(For all solutions the question of converting precomposed letter/diacritics was noted. See further explanations of this in section 2.3.)
2.2.1 Drop record. (Do not distribute records that have non-MARC-8 characters.)
This approach was briefly considered. It would not be acceptable to institutions that are record suppliers. When a system accepted a distribution request, it would not necessarily know the character set makeup of the record accurately enough to inform the requester that the record might be ineligible to be sent.
2.2.2 Drop character. (Delete any character that is not mappable to MARC-8.)
This approach would drop any characters that would not convert. It would be non-reversible, i.e., lossy. It would be easy and cheap for the receiver to implement, as it would have little impact on their systems. A number of points were made pertaining to this approach.
2.2.3 Placeholder character. (Substitute a placeholder character for any character not mappable to MARC-8).
This is a lossy technique as it would produce records that could not be reversed. It would be relatively simple and cheap to implement by both the receiver and the source. A placeholder character would need to be determined, as discussed below. The following points were made.
The unused graphic character code point hex DF was favored. In ANSEL, DF is the last code position in the columns that contain spacing characters. Using this code point would not embed the placeholder among conventional graphic characters, as there are no other defined characters in that column of the table.
A suitable specification for the actual graphic image to be used for this placeholder character was discussed and the consensus was to leave it to the users. (The MARC documentation would use some graphic convention, but as with the subfield delimiter (double dagger, dollar sign), the visual representation would not be part of the standard.)
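As a rough illustration of the placeholder technique, the following sketch substitutes a single placeholder for every unmappable character. The names PLACEHOLDER, MAPPABLE_SAMPLE, and to_marc8_placeholder are illustrative only; a real converter would consult the full MARC-8 code tables.

```python
PLACEHOLDER = "|"  # the fill character, ASCII hex 7C (Option 1 of section 3)

# Tiny illustrative stand-in for a real Unicode-to-MARC-8 mapping table.
MAPPABLE_SAMPLE = set("abcdefghijklmnopqrstuvwxyz .,") | {"\u0301"}

def to_marc8_placeholder(text, mappable=MAPPABLE_SAMPLE):
    """Replace every character without a MARC-8 mapping by the placeholder."""
    return "".join(c if c in mappable else PLACEHOLDER for c in text)
```

Note that the normalization of section 2.3 would run first, so that a precomposed é, which decomposes to e plus combining acute, would map rather than be replaced.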
2.2.4 Numeric Character Reference. (Substitute a Numeric Character Reference (NCR) for each unmappable MARC-8 character.)
This is the main lossless technique that was discussed. It allows the original Unicode characters to be recovered. The following points were made.
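A minimal sketch of the NCR idea: each unmappable character is written out as an XML-style numeric character reference, which a receiving Unicode system can reverse on re-import. The function names and the sample mapping set are illustrative, not part of any standard.

```python
import re

def to_ncr(text, mappable):
    # Lossless: encode each unmappable character as a numeric character
    # reference so the original code point can be recovered later.
    return "".join(c if c in mappable else "&#x%04X;" % ord(c) for c in text)

def from_ncr(text):
    # Reverse conversion on re-import into a Unicode environment.
    return re.sub(r"&#x([0-9A-Fa-f]+);",
                  lambda m: chr(int(m.group(1), 16)), text)

SAMPLE_MAPPABLE = set("abcdefghijklmnopqrstuvwxyz ")
```

The round trip from_ncr(to_ncr(s, ...)) returns the original string, which is the defining property of the lossless class of techniques.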
2.3 Preprocessing Requirements
The listserv also discussed preprocessing of the Unicode record before the conversion to MARC-8 takes place. In all of the above techniques, the following steps for decomposing diacritics were presumed.
The exceptions to NFD need to be carefully and completely identified (see draft list in Annex A to this document).
Other preprocessing that might be useful is the application of Unicode Normalization Form KD. This normalization includes not only the exact-equivalent conversions of NFD but also some "compatibility equivalent" conversions. (An example is the Unicode character that is a ligature of "ffi", which has the compatibility equivalent "f" "f" "i".) This normalization before conversion might reduce the number of unmappable characters, but it would not be reversible in most cases. NFKD needs to be reviewed to see whether it would be useful to carry out, especially if techniques 2 and 3 above (drop character and placeholder) are being used. Since those are already lossy, and if compatibility equivalents are acceptable, NFKD could enable more characters to be mapped rather than deleted or converted to a placeholder.
See the full explanation of Unicode normalization at http://www.unicode.org/reports/tr15/ and a brief summary in Section 4 of Assessment of Options ... Part 2.
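The difference between canonical (NFD) and compatibility (NFKD) decomposition can be seen with Python's standard unicodedata module:

```python
import unicodedata

ligature = "\uFB03"  # LATIN SMALL LIGATURE FFI
nfd = unicodedata.normalize("NFD", ligature)    # canonical only: unchanged
nfkd = unicodedata.normalize("NFKD", ligature)  # compatibility: "f" "f" "i"

precomposed = "\u00E9"  # LATIN SMALL LETTER E WITH ACUTE
decomposed = unicodedata.normalize("NFD", precomposed)  # "e" + combining acute
```

As the text notes, NFKD maps more characters into MARC-8 range but cannot be reversed: nothing marks "ffi" as having once been a ligature.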
2.4 Information from MicroLIF
The Unicode-MARC Listserv discussion often went back to the question of what was preferred by the group of users that most needed to continue to receive MARC-8 records, as described above. The MicroLIF group is made up of many of these users' vendors, and in mid-October the MicroLIF representative to MARBI, Gail Lewis, reported to the list the response she had received from her constituents to the following query:
"Given that you're receiving data from a system which implements characters that cannot be mapped directly to MARC-8, how would you want the originating system to deal with those characters? For purposes of discussion, assume that such characters may occur anywhere other than in the fixed fields."
She summarized the response she received as follows:
"... they would like the non MARC-8 characters to be removed from the incoming records, or replaced with a single character (such as the fill character "|"). From the people I talked with, they felt that substituting a Unicode character with a character string might have negative system impacts, and would not likely ever be used. The librarians that we are representing are not sending their records to other utilities or systems, so the use of a single replacement character came thru as the solution that they liked best."
Until that response, the consensus had been tending toward the Numeric Character Reference approach.
3 PROPOSAL
Adopt the Placeholder approach for the distribution of records to MARC-8 systems from Unicode systems, with the following specifications.

Placeholder character:
Option 1: the vertical bar (ASCII hex 7C), i.e., the fill character
Option 2: define a new ANSEL graphic character, hex DF, for the placeholder function (with no associated graphic image)

Preprocessing:
a) Normalize the string
Option 1: Unicode Normalization Form D, with exceptions (draft in Annex A).
Option 2: Unicode Normalization Form KD, with exceptions to be designated (partial draft in Annex A).
b) Reverse the order of the character modifiers and base characters so that modifiers precede the base.
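Steps a) and b) together can be sketched as follows. The function name modifiers_before_base is illustrative, and the sketch assumes plain NFD without the Annex A exceptions.

```python
import unicodedata

def modifiers_before_base(text):
    # Decompose (NFD), then move each run of combining marks in front of
    # its base character: MARC-8 diacritics precede the letter they
    # modify, whereas Unicode combining marks follow it.
    decomposed = unicodedata.normalize("NFD", text)
    out, i, n = [], 0, len(decomposed)
    while i < n:
        base = decomposed[i]
        j = i + 1
        while j < n and unicodedata.combining(decomposed[j]):
            j += 1
        out.append(decomposed[i + 1:j] + base)  # marks first, then base
        i = j
    return "".join(out)
```

For example, é (hex 00E9) becomes the combining acute accent followed by the letter e, matching the MARC-8 convention.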
ANNEX A: EXCEPTIONS TO NORMALIZATION FORM D CONVERSION
In general, when converting data from Unicode encoding to MARC-8 encoding, it is necessary to convert precomposed characters such as é (Unicode hex 00E9; Latin small letter e with acute) to their separate parts, in this case the letter e and the acute accent. There are, however, some exceptions: MARC-8 has a few precomposed characters of its own, which should therefore not be decomposed. These are listed below by the MARC-8 set to which they belong. There are no occurrences of this situation in the Hebrew set, the Greek sets, or the superscript and subscript sets. (These tables were prepared by Jack Cain.)
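A minimal sketch of NFD-with-exceptions, using a few of the ANSEL horn vowels as the exception set. EXCEPTIONS and nfd_with_exceptions are illustrative names, and a real exception list would be the full Annex A.

```python
import unicodedata

# Illustrative subset of the Annex A ANSEL exceptions: the horn vowels,
# which MARC-8 encodes precomposed and which must not be decomposed.
EXCEPTIONS = {"\u01A0", "\u01AF", "\u01A1", "\u01B0"}

def nfd_with_exceptions(text):
    # Simplification: normalizing character by character ignores the
    # reordering of combining sequences that full NFD would perform.
    return "".join(ch if ch in EXCEPTIONS
                   else unicodedata.normalize("NFD", ch)
                   for ch in text)
```

Thus O with horn (hex 01A0) passes through intact, while é is still decomposed as usual.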
ANSEL Set
MARC-8 as G0 | MARC-8 as G1 | UCS | UTF8 | NAME
---|---|---|---|---
2C | AC | 01A0 | C6A0 | UPPERCASE O-HOOK / LATIN CAPITAL LETTER O WITH HORN
2D | AD | 01AF | C6AF | UPPERCASE U-HOOK / LATIN CAPITAL LETTER U WITH HORN
3C | BC | 01A1 | C6A1 | LOWERCASE O-HOOK / LATIN SMALL LETTER O WITH HORN
3D | BD | 01B0 | C6B0 | LOWERCASE U-HOOK / LATIN SMALL LETTER U WITH HORN
Arabic Set
MARC-8 as G0 | MARC-8 as G1 | UCS | UTF8 | NAME
---|---|---|---|---
42 | C2 | 0622 | D8A2 | ARABIC LETTER ALEF WITH MADDA ABOVE
43 | C3 | 0623 | D8A3 | ARABIC LETTER ALEF WITH HAMZA ABOVE
44 | C4 | 0624 | D8A4 | ARABIC LETTER WAW WITH HAMZA ABOVE
45 | C5 | 0625 | D8A5 | ARABIC LETTER ALEF WITH HAMZA BELOW
46 | C6 | 0626 | D8A6 | ARABIC LETTER YEH WITH HAMZA ABOVE
69 | E9 | 0649 | D989 | ARABIC LETTER ALEF MAKSURA
73 | F3 | 0671 | D9B1 | ARABIC LETTER ALEF WASLA
Extended Arabic Set
MARC-8 as G0 | MARC-8 as G1 | UCS | UTF8 | NAME
---|---|---|---|---
6E | EE | 06C0 | DB80 | HEH WITH HAMZA ABOVE / ARABIC LETTER HEH WITH YEH ABOVE
78 | F8 | 06D3 | DB93 | ARABIC LETTER YEH BARREE WITH HAMZA ABOVE
Cyrillic Set
MARC-8 as G0 | MARC-8 as G1 | UCS | UTF8 | NAME
---|---|---|---|---
4A | CA | 0439 | D0B9 | LOWERCASE SHORT II / CYRILLIC SMALL LETTER SHORT I
6A | EA | 0419 | D099 | UPPERCASE SHORT II / CYRILLIC CAPITAL LETTER SHORT I
Extended Cyrillic Set
MARC-8 as G0 | MARC-8 as G1 | UCS | UTF8 | NAME
---|---|---|---|---
42 | C2 | 0453 | D193 | CYRILLIC SMALL LETTER GJE
44 | C4 | 0451 | D191 | CYRILLIC SMALL LETTER IO
47 | C7 | 0457 | D197 | LOWERCASE YI / CYRILLIC SMALL LETTER YI (Ukrainian)
4C | CC | 045C | D19C | CYRILLIC SMALL LETTER KJE
4D | CD | 045E | D19E | LOWERCASE SHORT U / CYRILLIC SMALL LETTER SHORT U (Byelorussian)
62 | E2 | 0403 | D083 | CYRILLIC CAPITAL LETTER GJE
64 | E4 | 0401 | D081 | CYRILLIC CAPITAL LETTER IO
67 | E7 | 0407 | D087 | UPPERCASE YI / CYRILLIC CAPITAL LETTER YI (Ukrainian)
6C | EC | 040C | D08C | CYRILLIC CAPITAL LETTER KJE
6D | ED | 040E | D08E | UPPERCASE SHORT U / CYRILLIC CAPITAL LETTER SHORT U (Byelorussian)