The Library of Congress >> Especially for Librarians and Archivists >> Standards
MARC Standards
MARC 21 Specifications for Record Structure, Character Sets, and Exchange Media

CHARACTER SETS AND ENCODING OPTIONS: Part 1

General Character Set Issues

December 2007


INTRODUCTION

Character set issues that pertain to both the MARC-8 and Unicode environments are dealt with here. In this section, references to Unicode imply its UTF-8 encoding form, the only one approved for use in MARC 21. References to a particular Unicode character will be expressed in the conventional way, as a hexadecimal number identifying the code point, not as a representation of the individual octets in the UTF-8 transformation. See Part 3 for a discussion of UTF-8.


ENCODING MARKER

In MARC 21 records, Leader character position 9 (Character coding scheme) must indicate whether the record uses MARC-8 encoding or Unicode encoding. This is necessary processing information because any character other than an ASCII graphic or control character -- i.e., any character whose code point is greater than 7F(hex) -- is represented by different code points in the two encoding schemes. It is not permitted to use both encoding schemes in the same MARC 21 record.
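As an illustration, the check on Leader position 9 might be sketched as follows. The mapping of values (blank for MARC-8, "a" for UCS/Unicode) is taken from the MARC 21 Leader definition rather than from this section, and the function name and sample Leader strings are hypothetical.

```python
# Leader/09 values, assumed from the MARC 21 Leader definition:
# blank (space) = MARC-8, "a" = UCS/Unicode.
CODING_SCHEMES = {" ": "MARC-8", "a": "Unicode (UTF-8)"}


def coding_scheme(leader: str) -> str:
    """Return the character coding scheme named by Leader position 9."""
    if len(leader) != 24:
        raise ValueError("a MARC 21 Leader is exactly 24 characters long")
    marker = leader[9]  # character positions are counted from 0
    if marker not in CODING_SCHEMES:
        raise ValueError(f"unrecognized Leader/09 value: {marker!r}")
    return CODING_SCHEMES[marker]


# Hypothetical Leader values, used only for illustration.
unicode_leader = "00714cam a2200205 a 4500"
marc8_leader = "00714cam  2200205 a 4500"
```

A reader would normally make this check once per record, before decoding any variable field data, since the two schemes cannot be mixed within a record.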


LEADER, DIRECTORY, AND CONTENT DESIGNATION

The Leader, Directory, and all content designation in MARC 21 records must be encoded using only the repertoire found in Code for Information Interchange (ASCII) (ANSI X3.4) or its international counterpart, ISO 646 (IRV). The encoding of these characters is identical in the two MARC 21 encoding schemes, MARC-8 and Unicode, when the Unicode form is UTF-8.
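The UTF-8 side of this equivalence can be checked with a short sketch. It assumes only the Python standard library, which has no MARC-8 codec, so the MARC-8 side is not demonstrated here. The helper name and the sample Leader value are hypothetical.

```python
def is_ascii_only(data: bytes) -> bool:
    """True when every octet falls in the ASCII range 00-7F(hex)."""
    return all(octet <= 0x7F for octet in data)


# Hypothetical Leader value, used only for illustration.
leader = "00714cam a2200205 a 4500"

# Each ASCII character occupies a single octet in UTF-8, with the
# same numeric value it has in ASCII.
utf8_octets = leader.encode("utf-8")
ascii_octets = leader.encode("ascii")
```

A validator could apply `is_ascii_only` to the Leader, the Directory, and each tag, indicator, and subfield code before accepting a record.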


CONTROL FUNCTION CODES

Eight characters are specifically designated as control characters for MARC 21 use:

These are the only control characters that may be used in MARC-8 encoded records. The joiner and nonjoiner characters are sometimes required to control the display form of graphic characters whose proximity to other characters affects their shape, as can happen, for example, in the Arabic script. Specifications for use of the sorting control characters are contained in the MARC 21 Format for Bibliographic Data.


BASE CHARACTERS AND DIACRITICS

There are four characters appearing in ASCII and Unicode which look like combining marks but are actually base characters. (The four characters and their code points are: circumflex accent (5E(hex)), low line (5F(hex)), grave accent (60(hex)), and tilde (7E(hex)).) MARC 21 records do not use these four code points as combining characters. Equivalent combining characters are represented by other code points in either encoding.
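For the Unicode environment, the distinction can be seen in the character properties. The sketch below uses Python's `unicodedata` module. The combining counterparts listed (U+0302, U+0332, U+0300, U+0303) are the Unicode combining forms of the same marks and are named here for illustration; the MARC-8 combining code points are not shown.

```python
import unicodedata

# The four spacing (base) characters named in the text.
SPACING = {
    "circumflex accent": "\u005E",
    "low line": "\u005F",
    "grave accent": "\u0060",
    "tilde": "\u007E",
}

# Unicode combining counterparts (illustrative, Unicode side only).
COMBINING = {
    "circumflex accent": "\u0302",
    "low line": "\u0332",
    "grave accent": "\u0300",
    "tilde": "\u0303",
}


def is_combining(ch: str) -> bool:
    """True when the character has a non-zero canonical combining class."""
    return unicodedata.combining(ch) != 0
```

Under this test every value in `SPACING` is a base character and every value in `COMBINING` is a combining mark, which is why a tilde at 7E(hex) cannot stand in for a tilde over a letter.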


DIRECTIONALITY OF TEXT

The contents of a field in a MARC 21 record in either the MARC-8 or the Unicode environment are recorded in their logical order, from the first character to the last, regardless of the directionality of text. Although most scripts display characters from left to right, some scripts such as Arabic and Hebrew display characters primarily from right to left. More information on handling bidirectional text is found in Part 2 for MARC-8-encoded MARC 21 records. For MARC 21 records using the Unicode encoding, consult the Unicode Standard, Annex #9: The Bi-Directional Algorithm.

Note: When bidirectional scripts were first permitted in MARC, numerous records were created with the embedded sections of left-to-right data entered in visual rather than logical order. This is no longer considered good practice.
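A minimal sketch of logical order in the Unicode environment follows. The sample string (a Hebrew word followed by a year) and the helper name are hypothetical. The bidirectional classes "R" and "AL" are the Unicode classes for right-to-left letters, as reported by Python's `unicodedata` module.

```python
import unicodedata

# Hypothetical field content: Hebrew letters followed by a space and
# digits, stored in logical (reading) order.
text = "\u05E9\u05DC\u05D5\u05DD 1984"


def has_right_to_left(s: str) -> bool:
    """True when any character has a right-to-left bidirectional class."""
    return any(unicodedata.bidirectional(ch) in ("R", "AL") for ch in s)


# The first character stored is the first character read (shin),
# even though a display would place it at the right edge.
first_stored = unicodedata.name(text[0])
```

Reordering for display is left to the rendering software. The stored sequence itself is never reversed.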

Left-to-right field orientation is the default for fields in MARC 21 records. No designation of field orientation is required for character sets with left-to-right orientation. When a field contains data whose orientation is from right to left, orientation is indicated with a field orientation code appended to subfield $6 (Linkage). (See MARC 21 Format for Bibliographic Data, Appendix A, subfield $6).
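As a hedged illustration, reading the orientation code from subfield $6 might look like the sketch below. It assumes the $6 layout described in Appendix A of the MARC 21 Format for Bibliographic Data (linking tag and occurrence number, then a script identification code, then a field orientation code, separated by slashes, with "r" meaning right-to-left). That layout is not stated in this section, and the function name is hypothetical.

```python
def field_orientation(subfield_6: str) -> str:
    """Return the orientation implied by a subfield $6 value.

    Assumed layout: <tag>-<occurrence>/<script code>/<orientation code>.
    The orientation code is optional; "r" marks right-to-left.
    """
    parts = subfield_6.split("/")
    if len(parts) >= 3 and parts[2] == "r":
        return "right-to-left"
    # No orientation code present: left-to-right is the default.
    return "left-to-right"
```

For example, `field_orientation("880-01/(2/r")` yields `"right-to-left"`, while a value with no orientation code, such as `"880-02/(N"`, falls back to the left-to-right default.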

The decision to designate the field orientation as right-to-left depends on the predominance of data in a script that is read right-to-left at the field and/or the record level. A field may contain a mixture of scripts. Right-to-left field orientation is usually designated in the following instances:


FILL CHARACTER

The key to retaining the MARC structure, while simultaneously reducing required coding specificity, is the fill character. For MARC 21 records, the use of this fill character is limited to variable control fields such as field 008 (Fixed-Length Data Elements). It may not be used in the leader or in tags, indicators, or subfield codes. Presence of a fill character in a variable control field indicates that the creator of the record has not attempted to supply a value. In contrast, use of a character signifying "unknown" in a variable control field indicates that the creator of the record has attempted to supply a value, but was unable to determine what the appropriate value should be. The fill character may be used in undefined character positions and in character positions for which the MARC 21 format defines one or more values. Use of the fill character in variable control fields is usually regulated by the policy of the inputting agency.

For communication purposes, the fill character is represented by the code point 7C(hex). The fill character is represented graphically as the vertical bar ( | ).
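Locating fill characters in a variable control field can be sketched as below. The function name and the sample fragment are hypothetical. The code does not check which positions a given format allows to be filled, since that depends on the format definition and on agency policy.

```python
FILL = "\x7c"  # code point 7C(hex), displayed as the vertical bar


def fill_positions(control_field: str) -> list[int]:
    """Return the 0-based character positions that hold the fill character."""
    return [pos for pos, ch in enumerate(control_field) if ch == FILL]


# Hypothetical fragment of a control field, used only for illustration.
fragment = "nyu|| 00|"
```

A companion check could confirm that `FILL` never appears in the Leader, tags, indicators, or subfield codes, where the fill character is not permitted.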

Note: Another use of 7C(hex) is as a placeholder for a character outside the MARC-8 repertoire in a record converted from Unicode to MARC-8. This lossy conversion technique is described in Part 4. There is no conflict between the functions of 7C(hex). The places mentioned above where the 7C(hex) fill character is valid are constrained to use only characters in the ASCII repertoire; hence the placeholder 7C(hex) can only occur elsewhere, in the text of variable data fields.


( 12/04/2007 )