The Library of Congress >> Especially for Librarians and Archivists >> Standards

MARC Standards

HOME >> MARC Development >> Proposals List


MARC PROPOSAL NO. 2006-04

DATE: Dec. 16, 2005

REVISED:

NAME: Technique for conversion of Unicode to MARC-8

SOURCE: Unicode-MARC Forum

SUMMARY: This paper summarizes the discussion on the Unicode-MARC Forum for converting Unicode to MARC-8 for systems that cannot handle Unicode records. The Forum consensus was for defining a placeholder character that was to be substituted for each unmappable Unicode character.

KEYWORDS: Unicode (all formats); MARC-8 (all formats); character sets; UCS/Unicode

RELATED: Assessment of Options for Handling Full Unicode Character Encodings in MARC 21 -- Part 1: New Scripts (January 2004); Assessment of Options for Handling Full Unicode Character Encodings in MARC 21 -- Part 2: Issues (June 2005)

STATUS/COMMENTS:

12/16/05 - Made available to the MARC 21 community for discussion.

01/22/06 - Results of the MARC Advisory Committee discussion - The Placeholder approach was approved as amended. MARBI will consider a second proposal in the summer for a lossless numeric character reference conversion technique. Option 1 was approved in that the fill character is used as a placeholder for the lossy technique. A decomposition list will be prepared for preprocessing normalization.

03/07/06 - Results of LC/LAC/BL review - Approved


PROPOSAL No. 2006-04: Technique for conversion of Unicode to MARC-8

1 BACKGROUND

One major issue that needed resolution for the adoption of full Unicode was how the mapping from Unicode, with 95,000+ characters, to MARC-8, with 16,000+, was to be handled. From August to October 2005 the Unicode-MARC discussion list visited that issue. The discussion focused on three main techniques before reaching a consensus that was based on input from the systems that would be most affected by implementing a standard method. The discussions and conclusions are summarized below. The archive of all the messages is available via the MARC home page www.loc.gov/marc/. (Under General Information, click on Unicode-MARC Forum.)

2 FORUM DISCUSSION

2.1 General Observations

The primary need for the Unicode to MARC-8 conversion is for users with systems that they cannot modify, or users that have no need to modify their systems in the near future to enable the import of the full Unicode character set. Many also use only the Latin part of MARC-8, as they do not collect non-Latin script material, or do so only minimally, and they find that romanization of bibliographic data to the Latin script serves them well enough. Although the continued use of MARC-8 may be viewed as temporary, temporary in this case could mean 15-20 years or so.

Many institutions distribute records to these sites, and the distributors need an agreed-upon Unicode to MARC-8 conversion. They do not want to customize to different customer preferences, as that would require them to maintain multiple set reduction streams.

Another situation where such conversion may be needed was also pointed out on the Forum: movement of data internally between Unicode-based and MARC-8 based data modules. Since this is an internal issue and may be handled differently by different systems, it was not pursued.

The presumption was made that the record source system would have the responsibility of producing a record that was converted to MARC-8 according to the rules agreed upon by the community, so that the receiving systems could continue to expect consistent, standard MARC-8 data.

It was noted that the language code in field 008 has no role in specifying characters present in the record. A record could be entirely English and contain many unmappable characters. A record for a Chinese-language item may contain nothing but ASCII and ANSEL.

There was general agreement that the technique needed to be cheap and easy to implement. Two classes of techniques were considered: lossy, where data would be lost and could not be recovered, and non-lossy, where data would be carried over into the MARC-8 environment in a coded form and could therefore be recovered on reconversion to Unicode.

2.2 Solution Options

(For all solutions the question of converting precomposed letter/diacritics was noted. See further explanations of this in section 2.3.)

2.2.1 Drop record. (Do not distribute records that have non-MARC-8 characters.)

This approach was briefly considered. It would not be acceptable to institutions who are record suppliers. When a system accepted a distribution request, it would not necessarily know the character set characteristics of the record accurately enough to inform the requester that the record might not be eligible to be sent.

2.2.2 Drop character. (Delete any character that is not mappable to MARC-8.)

This approach would drop any characters that would not convert. It would be non-reversible, i.e., lossy. It would be easy and cheap to implement by the receiver as it would have little impact on their systems. A number of points were made pertaining to this approach.

2.2.3 Placeholder character. (Substitute a placeholder character for any character not mappable to MARC-8).

This is a lossy technique, as the conversion could not be reversed. It would be relatively simple and cheap to implement by both the receiver and the source. A placeholder character would need to be determined, as discussed below. The following points were made.

The unused graphic character code hex DF was thus favored. In ANSEL, DF is the last code position in the columns that contain spacing characters. Using this code point would not embed the "place-holder" among conventional graphic characters as there are no other defined characters in that column of the table.

A suitable specification for the actual graphic image to be used for this placeholder character was discussed and the consensus was to leave it to the users. (The MARC documentation would use some graphic convention, but as with the subfield delimiter (double dagger, dollar sign), the visual representation would not be part of the standard.)
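The two lossy options can be contrasted in a short Python sketch. It is illustrative only: `is_mappable` is a hypothetical stand-in that treats only ASCII as mappable, whereas a real converter would consult the full MARC-8 mapping tables. The function names are likewise assumptions made for this example.

```python
# Hypothetical sketch of the lossy techniques in sections 2.2.2 and 2.2.3.
PLACEHOLDER = 0xDF  # unused ANSEL code position favored in the discussion


def is_mappable(ch: str) -> bool:
    """Assumption for this sketch only: just ASCII maps to MARC-8."""
    return ord(ch) < 0x80


def to_marc8_lossy(text: str, drop: bool = False) -> bytes:
    """Replace (or, with drop=True, delete) each unmappable character."""
    out = bytearray()
    for ch in text:
        if is_mappable(ch):
            out.append(ord(ch))
        elif not drop:
            # Placeholder approach: one byte per unmappable character.
            out.append(PLACEHOLDER)
        # With drop=True the character is simply omitted.
    return bytes(out)
```

With this stand-in table, `to_marc8_lossy("a\u4e2db")` yields `b"a\xdfb"`, while `to_marc8_lossy("a\u4e2db", drop=True)` yields `b"ab"`. The placeholder version at least signals to the receiver that something was lost at that position.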

2.2.4 Numeric Character Reference. (Substitute a Numeric Character Reference (NCR) for each Unicode character that is not mappable to MARC-8.)

This is the main lossless technique that was discussed. It allows Unicode characters to be recovered. The following points were made.
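A minimal Python sketch of the idea follows. The exact reference syntax is not specified in this section, so the HTML/XML-style hexadecimal form `&#xXXXX;` used here is an assumption, as are the function names and the ASCII-only `is_mappable` stand-in for a real MARC-8 mapping table.

```python
import re


def is_mappable(ch: str) -> bool:
    """Assumption for this sketch only: just ASCII maps to MARC-8."""
    return ord(ch) < 0x80


def to_ncr(text: str) -> str:
    """Replace each unmappable character with a hexadecimal NCR."""
    return "".join(
        ch if is_mappable(ch) else f"&#x{ord(ch):04X};" for ch in text
    )


# Matches the assumed reference form: "&#x", 4 to 6 hex digits, ";".
_NCR = re.compile(r"&#x([0-9A-Fa-f]{4,6});")


def from_ncr(text: str) -> str:
    """Restore the original characters on reconversion to Unicode."""
    return _NCR.sub(lambda m: chr(int(m.group(1), 16)), text)
```

Here `to_ncr("a\u4e2db")` gives `"a&#x4E2D;b"`, and `from_ncr` restores the original string. Note that one character becomes a string of eight or more characters, and that source data which already contains text resembling a reference would be ambiguous on reconversion.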

2.3 Preprocessing Requirements

The listserv also discussed preprocessing of the Unicode record before the conversion to MARC-8 takes place. In all of the above techniques, the following steps for decomposing diacritics were presumed.

The exceptions to NFD need to be carefully and completely identified (see draft list in Annex A to this document).

Other preprocessing that might be useful is the application of Unicode Normalization Form KD (NFKD). This normalization includes not only the exact equivalent conversions of NFD but also some "compatible equivalent" conversion specifications. (An example is the Unicode character that is a ligature of "ffi", which has the compatible equivalent "f" "f" "i".) This normalization before conversion might reduce the number of unmappable characters, but would not be reversible in most cases. NFKD needs to be reviewed to see whether it would be useful to carry out, especially if the techniques in sections 2.2.2 and 2.2.3 above are being used. Since they are already lossy, if compatible equivalents are acceptable, NFKD could enable the mapping of more characters rather than their being deleted or converted to a placeholder.

See full explanation of Unicode normalization at: http://www.unicode.org/reports/tr15/ and a brief summary in Section 4 of Assessment of Options ... Part 2.
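The difference between the two normalization forms can be seen with Python's standard `unicodedata` module. This is only an illustration of the Unicode forms themselves, not of any MARC-specific processing; the helper `code_points` is a name introduced for this example.

```python
import unicodedata


def code_points(text: str) -> list[str]:
    """Return the hexadecimal code point of each character."""
    return [f"{ord(ch):04X}" for ch in text]


e_acute = "\u00e9"       # LATIN SMALL LETTER E WITH ACUTE
ffi_ligature = "\ufb03"  # LATIN SMALL LIGATURE FFI

# NFD splits the precomposed letter into base letter + combining accent.
print(code_points(unicodedata.normalize("NFD", e_acute)))        # ['0065', '0301']

# NFD leaves the ligature alone: it has no exact (canonical) equivalent.
print(code_points(unicodedata.normalize("NFD", ffi_ligature)))   # ['FB03']

# NFKD applies the compatible equivalent, which cannot be reversed.
print(code_points(unicodedata.normalize("NFKD", ffi_ligature)))  # ['0066', '0066', '0069']
```

After NFKD the record no longer shows that a ligature was ever present, which is why this step only makes sense alongside a technique that is already lossy.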

2.4 Information from MicroLIF

The Unicode-MARC Listserv discussion often went back to the question of what was preferred by the group of users that most needed to continue to receive MARC-8 records, as described in section 2.1. The MicroLIF group is made up of many of these users' vendors. In mid-October the MicroLIF representative to MARBI, Gail Lewis, reported to the list the response that she had received from her constituents to the following query:

"Given that you're receiving data from a system which implements characters that cannot be mapped directly to MARC-8, how would you want the originating system to deal with those characters? For purposes of discussion, assume that such characters may occur anywhere other than in the fixed fields."

She summarized the response she received as follows:

"... they would like the non MARC-8 characters to be removed from the incoming records, or replaced with a single character (such as the fill character "|"). From the people I talked with, they felt that substituting a Unicode character with a character string might have negative system impacts, and would not likely ever be used. The librarians that we are representing are not sending their records to other utilities or systems, so the use of a single replacement character came thru as the solution that they liked best."

Until that response, the consensus was tending toward the Character Reference approach.

3 PROPOSAL

Adopt the Placeholder approach to the distribution of records to MARC-8 systems from Unicode systems, with the following specifications.

b) Reverse the order of the character modifiers and base characters so that modifiers precede the base.
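The reordering step can be sketched in Python as follows. The sketch assumes the text is first decomposed with NFD, as presumed in section 2.3, and it uses a non-zero canonical combining class as the test for a "modifier", which is a simplification. The function name is hypothetical, and the relative order of several modifiers on one base is simply preserved, since the required order is not stated here.

```python
import unicodedata


def modifiers_first(text: str) -> str:
    """Decompose with NFD, then move combining marks before their base."""
    clusters = []  # each entry is [base character, list of combining marks]
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch) and clusters:
            # Non-zero combining class: attach to the preceding base.
            clusters[-1][1].append(ch)
        else:
            clusters.append([ch, []])
    # Emit each cluster with its marks ahead of the base character.
    return "".join("".join(marks) + base for base, marks in clusters)
```

For example, `modifiers_first("caf\u00e9")` returns `"caf\u0301e"`: the acute accent (U+0301) now precedes the letter "e", matching the MARC-8 convention that a diacritic comes before the character it modifies.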

ANNEX A: EXCEPTIONS TO NORMALIZATION FORM D CONVERSION

In general, when converting data from Unicode encoding to MARC-8 encoding, it is necessary to convert precomposed characters such as é (Unicode hex 00E9; Latin small letter e with acute) to their separate parts; in this case the letter e and the acute accent. There are, however, some exceptions, as MARC-8 has a few precomposed characters that should therefore not be decomposed. These characters are listed below by the MARC-8 set to which they belong. There are no occurrences of this situation in the Hebrew set, the Greek sets, or the superscript and subscript sets. (These tables were prepared by Jack Cain.)

ANSEL Set

MARC-8 as G0  MARC-8 as G1  UCS   UTF8  NAME
2C            AC            01A0  C6A0  UPPERCASE O-HOOK / LATIN CAPITAL LETTER O WITH HORN
2D            AD            01AF  C6AF  UPPERCASE U-HOOK / LATIN CAPITAL LETTER U WITH HORN
3C            BC            01A1  C6A1  LOWERCASE O-HOOK / LATIN SMALL LETTER O WITH HORN
3D            BD            01B0  C6B0  LOWERCASE U-HOOK / LATIN SMALL LETTER U WITH HORN

Arabic Set

MARC-8 as G0  MARC-8 as G1  UCS   UTF8  NAME
42            C2            0622  D8A2  ARABIC LETTER ALEF WITH MADDA ABOVE
43            C3            0623  D8A3  ARABIC LETTER ALEF WITH HAMZA ABOVE
44            C4            0624  D8A4  ARABIC LETTER WAW WITH HAMZA ABOVE
45            C5            0625  D8A5  ARABIC LETTER ALEF WITH HAMZA BELOW
46            C6            0626  D8A6  ARABIC LETTER YEH WITH HAMZA ABOVE
69            E9            0649  D989  ARABIC LETTER ALEF MAKSURA
73            F3            0671  D9B1  ARABIC LETTER ALEF WASLA

Extended Arabic Set

MARC-8 as G0  MARC-8 as G1  UCS   UTF8  NAME
6E            EE            06C0  DB80  HEH WITH HAMZA ABOVE / ARABIC LETTER HEH WITH YEH ABOVE
78            F8            06D3  DB93  ARABIC LETTER YEH BARREE WITH HAMZA ABOVE

Cyrillic Set

MARC-8 as G0  MARC-8 as G1  UCS   UTF8  NAME
4A            CA            0439  D0B9  LOWERCASE SHORT II / CYRILLIC SMALL LETTER SHORT I
6A            EA            0419  D099  UPPERCASE SHORT II / CYRILLIC CAPITAL LETTER SHORT I

Extended Cyrillic Set

MARC-8 as G0  MARC-8 as G1  UCS   UTF8  NAME
42            C2            0453  D193  CYRILLIC SMALL LETTER GJE
44            C4            0451  D191  CYRILLIC SMALL LETTER IO
47            C7            0457  D197  LOWERCASE YI / CYRILLIC SMALL LETTER YI (Ukrainian)
4C            CC            045C  D19C  CYRILLIC SMALL LETTER KJE
4D            CD            045E  D19E  LOWERCASE SHORT U / CYRILLIC SMALL LETTER SHORT U (Byelorussian)
62            E2            0403  D083  CYRILLIC CAPITAL LETTER GJE
64            E4            0401  D081  CYRILLIC CAPITAL LETTER IO
67            E7            0407  D087  UPPERCASE YI / CYRILLIC CAPITAL LETTER YI (Ukrainian)
6C            EC            040C  D08C  CYRILLIC CAPITAL LETTER KJE
6D            ED            040E  D08E  UPPERCASE SHORT U / CYRILLIC CAPITAL LETTER SHORT U (Byelorussian)
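One way the exception lists above might be applied is sketched below in Python. The code points are taken from the UCS column of the tables; the function name and the approach of normalizing only the runs of text between exception characters are assumptions made for this example. The sketch does not handle characters whose decomposition passes through a listed character (for example U+1EDD, which decomposes to U+01A1 plus a grave accent); such cases would need additional rules.

```python
import unicodedata

# UCS values from the ANSEL, Arabic, Extended Arabic, Cyrillic and
# Extended Cyrillic tables in this annex.
NFD_EXCEPTIONS = frozenset({
    0x01A0, 0x01AF, 0x01A1, 0x01B0,
    0x0622, 0x0623, 0x0624, 0x0625, 0x0626, 0x0649, 0x0671,
    0x06C0, 0x06D3,
    0x0439, 0x0419,
    0x0453, 0x0451, 0x0457, 0x045C, 0x045E,
    0x0403, 0x0401, 0x0407, 0x040C, 0x040E,
})


def nfd_with_exceptions(text: str) -> str:
    """Apply NFD, but leave the listed precomposed characters intact."""
    out = []
    pending = []  # run of characters that should be normalized normally
    for ch in text:
        if ord(ch) in NFD_EXCEPTIONS:
            out.append(unicodedata.normalize("NFD", "".join(pending)))
            pending.clear()
            out.append(ch)  # keep the precomposed form for MARC-8
        else:
            pending.append(ch)
    out.append(unicodedata.normalize("NFD", "".join(pending)))
    return "".join(out)
```

With this sketch, `nfd_with_exceptions("\u00e9\u0439")` returns `"e\u0301\u0439"`: the é is decomposed, while the Cyrillic short i (U+0439) stays precomposed even though plain NFD would split it into U+0438 and U+0306.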


( 03/07/2006 )