Discussion Paper 2001-DP02

DATE: December 15, 2000
REVISED:

NAME: Non-MARC Language Codes in Field 041 of the Bibliographic and Community Information Formats

SOURCE: Library of Congress; OCLC CORC

SUMMARY: This paper discusses four different approaches to using non-MARC language codes in MARC 21 bibliographic and community information records.

KEYWORDS: Field 041 (BD)(CI); Subfield $j, in field 041 (BD)(CI); Subfield $2, in field 041 (BD)(CI); Language codes (BD)(CI); Non-MARC language codes (BD)(CI); Source of term (BD)(CI)

RELATED:

STATUS/COMMENTS:

12/15/00 - Forwarded to the MARC Advisory Committee for discussion at the January 2001 MARBI meetings.

1/13/01 - Results of the MARC Advisory Committee discussion - Most participants felt that option 4 was the best solution for adding non-MARC language codes to field 041. The participants felt that stacking language codes was not useful because most library systems cannot parse stacked codes correctly, and that consideration should therefore be given to eliminating this practice.

The following decisions were made for field 041:

MARBI agreed that there should be further discussion about the legacy records containing stacked codes. The group should also look further into the possible impact that these changes may have on the bibliographic community. A proposal paper reflecting these decisions will be presented at the annual 2001 meeting.


Discussion Paper 2001-DP02: Non-MARC Language Codes in Field 041

1. BACKGROUND

The MARC Code List for Languages contains a list of languages and their associated three-character alphabetic codes. Its purpose is to allow for the designation of the language or languages used in MARC records. The language codes are three-character lowercase alphabetic strings based on the first three letters of the English form or on the vernacular of the corresponding language name. Language codes are used in the following MARC 21 bibliographic fields: 008/35-37 (Fixed-length data elements/Language); 040 $b (Cataloging source/Language of cataloging); 041 (Language code); 242 $y (Translation of Title by Cataloging Agency/Language code of translated title); 775 $e (Other edition entry/Language code). Language codes are also used in the following MARC 21 community information fields: 008/12-14 (Fixed-length data elements/Language); 040 $b (Cataloging source/Language of cataloging); 041 (Language code).

There are two international standards for language codes developed by the International Organization for Standardization. The ISO 639-1 standard is a 2-character code list. It was created in 1988 for use in lexicography, linguistics and terminology and describes most of the predominant written languages of the world.

A joint project between ISO TC 37, which maintains the two-character list (previously called ISO 639; now ISO 639-1), and ISO TC 46 (Information and Documentation) resulted in the approval of a three-character code list in 1998. This allowed for the definition of many more language codes than the two-character list could accommodate. This list, ISO 639-2, was based on the MARC list and will remain consistent with it. Both ISO standards allow for the addition of ISO country codes to the language codes to distinguish national linguistic differences, e.g., eng-GB for English spoken in Great Britain.

The ISO 639-1 language code scheme, being the older ISO set, is used in SGML and some metadata applications. For example, the SGML standard (ISO 8879) requires that the public text language be declared using the ISO 639-1 language codes. In addition, these non-MARC codes have been widely used on the Internet because of the approved Internet RFC 1766 (Request for Comments, essentially an Internet standard), which designates the use of 2-character codes (with an optional country code). The Dublin Core element "Language" also recommends the use of the ISO 639-1 codes.

OCLC's Cooperative Online Resource Catalog (CORC) includes descriptions of Web resources which may be created or displayed in either a MARC or Dublin Core view. A crosswalk between MARC and Dublin Core elements was developed by OCLC and LC's Network Development and MARC Standards Office, which allows data to be transformed in either direction. It is expected that some CORC records may include an ISO 639-1 two-character language code instead of or in addition to a MARC code.

Because of the increasing use of SGML encoding and diverse metadata standards for description of information resources, and the need to create MARC records from these, it is desirable to provide for other standard types of language codes in MARC records. This would ease mapping and subsequent conversion from various metadata standards to the MARC 21 format and would thus enhance libraries' participation in the organization of online web products, e-books and other media that use metadata standards. Defining an element for non-MARC language codes would also assist in the increasing internationalization of the MARC 21 formats by allowing the use of other language code schemes in MARC 21 records.

2. DISCUSSION

2.1. Language fields/subfields in MARC 21

Field 041 (Language code) contains the three-character MARC alphabetic code for languages associated with an item when field 008/35-37 (Language) is insufficient to convey full information for a multilingual item or an item that involves translation. The source of the language codes is the MARC Code List for Languages. Several additional non-repeatable subfields in 041 are used to designate other language aspects, such as language in summaries (subfield $b), or in sung or spoken text (subfield $d). Use of multiple MARC language codes in all of the subfields in field 041 is accommodated by "stacking" them in their appropriate subfields. For example, a text in original Greek with an English translation would be coded as "enggrc" by stacking the code for English (eng) onto the code for Greek (grc).

Although field 041 has sufficiently expressed the various languages found in items using the MARC language codes, problems arise when one considers coding it using non-MARC language codes. For example, the practice of "stacking" non-MARC language codes would make it difficult for systems to adequately parse the codes because of their varying lengths (even assuming all codes in a given field occurrence are from the same list). ISO 639-1 codes are two characters long, and thus if three were stacked together in subfield $a (as in "itfren"), the system may incorrectly parse them as two three-character MARC codes. Likewise, both ISO standards allow for the addition of ISO country codes to distinguish national linguistic differences, which may make parsing non-MARC language codes even more difficult.
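The parsing ambiguity described above can be illustrated with a minimal Python sketch. The function name split_fixed_width is hypothetical and is not part of any MARC system; it simply cuts a stacked string into equal-width pieces, which is the assumption a system parsing stacked MARC codes would make.

```python
def split_fixed_width(stacked, width):
    """Cut a stacked code string into consecutive chunks of `width` characters."""
    return [stacked[i:i + width] for i in range(0, len(stacked), width)]


# Stacked MARC codes are always three characters long, so they split cleanly.
marc_codes = split_fixed_width("enggrc", 3)   # ["eng", "grc"]

# Three stacked ISO 639-1 codes also total six characters. A parser that
# assumes three-character MARC codes reads them as two meaningless codes.
misread = split_fixed_width("itfren", 3)      # ["itf", "ren"]

# Only a parser that already knows the scheme uses two-character codes
# recovers the intended languages.
intended = split_fixed_width("itfren", 2)     # ["it", "fr", "en"]
```

Nothing in the stacked string itself indicates which width applies, which is why the options below either repeat subfields or identify the source of the code explicitly.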

The following describes various alternatives for accommodating non-MARC language codes in field 041.

2.2. Option 1. Definition of subfield $j in the same field with other MARC codes

Subfield $j (Non-MARC language code) could be defined for non-MARC language codes in field 041.

For example:

1.) 041 0#$aeng$jen$2 [Code for ISO 639-1]
     
2.) 041 0#$aeng$jeng-US$2 [Code for ISO 639-2 with ISO 3166 code]
[Subfield $j includes a 3-character language code for English with an ISO 3166 country code for US]
     
3.) 041 0#$aeng$jeng-US$jeng-GB$2 [Code for ISO 639-2 with ISO 3166 code]

In these examples, subfield $a (containing a MARC language code) and subfield $j (containing a non-MARC language code) are contained in the same field.

The use of subfield $j would not affect the language code in field 008/35-37, which would remain always a MARC code.

4.) 008/35-37 eng
  041 0#$aeng$jen$2 [Code for ISO 639-1]

If the MARC code is unknown, the non-MARC code could be placed in field 041 and 008/35-37 would contain 3 fill characters (| | |).

5.) 008/35-37 |||
  041 0#$jen$2 [Code for ISO 639-1]

When using more than one non-MARC language code from the same scheme, subfield $j could simply be repeated (as in example 3, above), thus eliminating the problem of stacking non-MARC language codes.

This repeatable subfield could be defined as follows:

Subfield $j contains a non-MARC language code associated with the item. The source of the code is specified in subfield $2 (Source of code). When coding more than one non-MARC language code from the same language code scheme, repeat subfield $j.

In this option, subfield $2 (Source of code) would also be defined to indicate the source of the code in subfield $j. Subfield $2 values will be defined in the future.

This subfield could be defined as follows:

Subfield $2 contains a MARC code that identifies the source of the language code used in subfield $j. The source of the MARC code is MARC Code Lists for Relators, Sources, Description Conventions that is maintained by the Library of Congress.

Because subfield $j would be repeatable, the problem of stacking multiple non-MARC codes is eliminated. However, if more than one scheme is used in a field, the different schemes must be indicated by using multiple subfield $2 (Source of code). Since systems could have difficulty differentiating which subfield $2 went to which subfield $j, subfield $2 could be input after its corresponding subfield $j. For example:

6.) 041 0#$jen$2[Code for ISO 639-1]$jeng-GB$2[Code for ISO 639-2 with ISO 3166 code]
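The convention of inputting subfield $2 after its corresponding subfield $j could be processed along the following lines. This is only a sketch in Python: the function names are hypothetical, the field is given as a plain "$"-delimited string, and SOURCE-A and SOURCE-B are placeholders, since the actual subfield $2 values have not yet been defined.

```python
def parse_subfields(data):
    """Split '$jen$2SOURCE-A' style data into (subfield code, value) pairs."""
    return [(chunk[0], chunk[1:].strip()) for chunk in data.split("$") if chunk]


def pair_codes_with_sources(subfields):
    """Give each $j the value of the next $2 that follows it.

    Several $j in a row share the $2 that follows them; a $j with no
    following $2 is paired with None. Other subfields are ignored.
    """
    pairs = []
    pending = []  # $j values still waiting for their $2
    for code, value in subfields:
        if code == "j":
            pending.append(value)
        elif code == "2":
            pairs.extend((lang, value) for lang in pending)
            pending = []
    pairs.extend((lang, None) for lang in pending)
    return pairs


# Example 6, with placeholder source codes (assumed values).
example_6 = pair_codes_with_sources(
    parse_subfields("$jen$2SOURCE-A$jeng-GB$2SOURCE-B"))
# [("en", "SOURCE-A"), ("eng-GB", "SOURCE-B")]
```

The pairing depends entirely on subfield order, so any system that reorders subfields on input or export would lose the association.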

Unfortunately, using a single subfield $j for non-MARC codes will not allow the same distinctions to be made as when using the other existing subfields. For example, when using MARC codes, one can make the distinction between language of summary, language of spoken text, etc., in an item. However, when using subfield $j, no such distinctions can be made.

7.) 008/35-37 |||
  041 0#$jit$jes$jen

2.3. Option 2. Definition of subfield $j using repeatable 041

Field 041 (Language code) could be made repeatable to accommodate the use of non-MARC language codes. With this option, field 041 would be repeated for each non-MARC code scheme used in a record. This may be a cleaner approach because it does not mix non-MARC codes with MARC codes in the same field.

For example:

8.) 041 0#$aeng
  041 07$jen$2[Code for ISO 639-1]
  041 07$jeng-GB$2[Code for ISO 639-2 with ISO 3166 code]
[The field is repeated because two non-MARC language code schemes have been used. The MARC code is placed in subfield $a.]

As with the previous option, subfield $2 would be defined to specify the language code scheme. Likewise, the second indicator, value 7 (Source specified in subfield $2) could also be defined to indicate when non-MARC codes are used in a field. Value # (blank) could be used if MARC codes are recorded.

Subfield $j is defined as in Option 1, but it is specified that the field is repeated with non-MARC codes.
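Under this option, a system could group the language codes in a record by scheme without relying on subfield order. The Python sketch below is illustrative only: each 041 occurrence is represented as a hypothetical (second indicator, subfields) pair, "#" stands for a blank indicator, and SOURCE-A and SOURCE-B are placeholder subfield $2 values.

```python
from collections import defaultdict


def codes_by_scheme(fields):
    """Group language codes from repeated 041 fields by their code scheme.

    `fields` is a list of (second_indicator, subfields) pairs, where
    subfields is a list of (code, value) tuples. Fields with second
    indicator 7 are keyed by their $2 value; all others are keyed "marc".
    """
    grouped = defaultdict(list)
    for ind2, subfields in fields:
        if ind2 == "7":
            source = next((v for c, v in subfields if c == "2"), None)
            grouped[source].extend(v for c, v in subfields if c == "j")
        else:
            grouped["marc"].extend(v for c, v in subfields if c == "a")
    return dict(grouped)


# Example 8, with placeholder source codes (assumed values).
example_8 = [
    ("#", [("a", "eng")]),
    ("7", [("j", "en"), ("2", "SOURCE-A")]),
    ("7", [("j", "eng-GB"), ("2", "SOURCE-B")]),
]
grouped = codes_by_scheme(example_8)
# {"marc": ["eng"], "SOURCE-A": ["en"], "SOURCE-B": ["eng-GB"]}
```

Because each field occurrence carries at most one scheme, a single $2 lookup per field is enough.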

2.4. Option 3. Redefinition and repeatability of subfields $a-$h; repeat subfields only for non-MARC codes

Another alternative is to redefine subfields $a - $h to include both MARC and non-MARC language codes and to change their repeatability status to "repeatable." When coding non-MARC language codes, separate 041 fields could be used with the source of the code placed in subfield $2. Subfield $2 (Source of code) would be defined to indicate the source of the non-MARC code scheme used in subfields $a-$h and the second indicator, value 7 (Source specified in subfield $2) could also be defined to indicate when non-MARC codes are used in a field. Value # (blank) could be used if MARC codes are recorded. In this alternative, MARC codes in subfields $a-$h could be stacked, but subfields containing non-MARC codes could be repeated to indicate multiple languages in an item. Different language code schemes could also be used by simply repeating field 041 for each scheme used. As in the other options, if the MARC code is unknown, 008/35-37 would contain 3 fill characters (| | |).

For example:

9.) 008/35-37 eng
  041 0#$aengfreita
  041 07$aen$afr$ait$2[Code for ISO 639-1]
[The MARC language codes are stacked in subfield $a and the non-MARC language codes are placed in repeated occurrences of subfield $a. A separate field 041 is used when coding languages using non-MARC codes]
     
10.) 008/35-37 |||
  041 07$afr$aen$ait$2[Code for ISO 639-1]
[Only non-MARC language codes are used so 008/35-37 contains fill characters]
     
11.) 008/35-37 eng
  041 0#$aeng
  041 07$aen$2[Code for ISO 639-1]
  041 07$aeng-GB$2[Code for ISO 639-2 with ISO 3166 code]
[Two non-MARC language code schemes are used and field 041 is repeated]

This option would allow systems to effectively handle non-MARC codes, while continuing the current practice of stacking MARC language codes. It would also allow easier mapping of field 041 to other metadata standards without changing any existing coding practices, such as stacking. Likewise, unlike options 1 and 2 of this discussion paper, non-MARC codes can be used to make distinctions between the languages found in different parts of an item, such as the language of librettos, language of table of contents, etc., as indicated in subfields $a-$h.
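A system reading field 041 under Option 3 would need to treat the two kinds of field differently. The following Python sketch shows one way this could work; the function name and the (code, value) representation of subfields are assumptions made for illustration, and SOURCE-A is a placeholder subfield $2 value.

```python
def subfield_a_codes(ind2, subfields):
    """Return the individual language codes recorded in subfield $a.

    With second indicator 7 the codes are non-MARC and each repeated $a
    holds exactly one code. Otherwise the codes are MARC codes, which may
    be stacked, so each $a is cut into three-character pieces.
    """
    codes = []
    for code, value in subfields:
        if code != "a":
            continue
        if ind2 == "7":
            codes.append(value)
        else:
            codes.extend(value[i:i + 3] for i in range(0, len(value), 3))
    return codes


# Example 9: stacked MARC codes in one field, repeated ISO 639-1 codes in another.
marc_field = subfield_a_codes("#", [("a", "engfreita")])
# ["eng", "fre", "ita"]
iso_field = subfield_a_codes(
    "7", [("a", "en"), ("a", "fr"), ("a", "it"), ("2", "SOURCE-A")])
# ["en", "fr", "it"]
```

The same branching would apply to subfields $b-$h; only $a is shown to keep the sketch short.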

2.5. Option 4. Redefinition and repeatability of subfields $a-$h; change practice of stacking MARC codes

As in Option 3, subfields $a - $h could be redefined to include both MARC and non-MARC language codes and to change their repeatability status to "repeatable." However, unlike the previous option, stacking MARC language codes could be completely eliminated, so that separate subfields would be used to indicate each language used in an item. The source of the non-MARC code could then be placed in subfield $2, optionally with second indicator value 7. Value # (blank) could be used if MARC codes are recorded. Different language code schemes could also be used by simply repeating field 041. As in the other options, if the MARC code is unknown, 008/35-37 would contain 3 fill characters (| | |).

For example:

12.) 008/35-37 |||
  041 0#$aen$afr$ait$2[Code for ISO 639-1]
[All of the codes come from the ISO 639-1 standard]
     
13.) 008/35-37 eng
  041 0#$aeng$afre$aita$bfre$bita
[The code comes from the MARC Code List for Languages]
     
14.) 008/35-37 eng
  041 0#$aeng$afre
  041 0#$aen$afr$2[Code for ISO 639-1]
[Two language code schemes are used and field 041 is repeated]

This option would allow for more consistent coding of language codes in field 041. It would also make mapping field 041 to other metadata standards more precise. Like Option 3, it would enable non-MARC codes to be used to make distinctions between the languages found in different parts of an item, such as the language of librettos, language of table of contents, etc., as indicated in subfields $a-$h.

However, because field 041 is used extensively in the bibliographic community, changing its definition, repeatability and coding practices could be expensive to implement. Likewise, most library systems automatically parse stacked MARC language codes and changing this practice would require system vendors to reconfigure their systems.
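Part of the implementation cost would be converting legacy records that contain stacked codes. As a rough illustration only, the Python sketch below rewrites a legacy field into repeated subfields. It assumes the field holds nothing but three-character MARC codes; the function name and the (code, value) representation are hypothetical.

```python
def unstack_legacy_field(subfields):
    """Rewrite stacked MARC codes as one repeated subfield per code.

    Assumes every value consists only of three-character MARC codes and
    raises ValueError when a value cannot be cut into three-character
    pieces, so that such records can be reviewed by hand.
    """
    converted = []
    for code, value in subfields:
        if len(value) % 3 != 0:
            raise ValueError(f"subfield ${code} is not a stack of MARC codes: {value!r}")
        converted.extend((code, value[i:i + 3]) for i in range(0, len(value), 3))
    return converted


# A legacy field 041 0#$aengfreita$bfreita becomes the form shown in example 13.
example_13 = unstack_legacy_field([("a", "engfreita"), ("b", "freita")])
# [("a", "eng"), ("a", "fre"), ("a", "ita"), ("b", "fre"), ("b", "ita")]
```

A real conversion would also have to check each piece against the MARC Code List for Languages, which this sketch does not attempt.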

3. QUESTIONS FOR CONSIDERATION

  1. Is there a need to code non-MARC codes in MARC 21 records?

  2. Should a new subfield (such as subfield $j) be defined in field 041 to indicate non-MARC codes?

  3. Would repeating field 041 be beneficial in indicating different language schemes for non-MARC codes? Or, would simply repeating subfield $2 (Source of Code) for each scheme used in field 041 be adequate? If repeating subfield $2, should it follow its corresponding language code?

  4. Would repeating subfields $a-$h be beneficial in coding non-MARC codes? Is there a need to differentiate between different languages used in various aspects of items using non-MARC language codes? If so, would redefining subfields $a-$h be the most effective way to show this distinction?

  5. If using option 3 or 4, is the second indicator needed or is subfield $2 sufficient?

  6. How expensive would the elimination of stacking codes be for libraries? Would the long-run cost savings be more than this initial cost?


Library of Congress Help Desk (02/01/01)