PROPOSAL NO: 98-16

DATE: May 1, 1998
REVISED:

NAME: Nonfiling characters in all USMARC formats

SOURCE: USMARC electronic list

SUMMARY: This proposal presents techniques that are more flexible and extensible for dealing with non-filing characters that appear at the beginning of cataloging data in access fields in USMARC records, including titles embedded in author/title fields. It proposes adopting control characters for marking nonsorting information.

KEYWORDS: Nonfiling indicator

RELATED: DP102 (June 1997)

STATUS/COMMENTS:

5/1/98 - Forwarded to the USMARC Advisory Group for discussion at the June 1998 MARBI meetings.

6/29/98 - Results of USMARC Advisory Group discussion - There was general consensus that the technique needed to be changed and the pair of control characters is the solution. Reconciliation of the previous identifications of the nonfiling zones and a more precise new definition are needed. The situations to which the technique could and should be applied need to be specified, for example, only at the beginning of a field or subfield? internal to a subfield? There was a preference for limiting the technique to sorting, i.e., not to include indexing. The proposal should return at the next meeting with these additions. There was also consensus that implementation of this change will need considerable lead time.

7/29/98 - Results of LC/NLC review - Agreed with the MARBI decisions.


PROPOSAL NO. 98-16: Nonfiling characters

1 BACKGROUND

MARC records are created with data elements to support the processing of the information in a variety of ways including online applications and various output products (e.g., catalog cards, book catalogs, and COM catalogs). Various products form sorted lists of access points for browsing. Online applications index access fields for titles, persons, corporate bodies, and subject terms to provide left match retrieval. Access points for titles and names sometimes include parts of speech or other character strings at the beginning of the name or title parts of the heading that users and systems tend to ignore for sorting and retrieval, commonly called "nonfiling characters".

1.1 Current Technique

The current USMARC technique for identifying non-filing characters retained in records involves the use of an indicator position that carries a digit (0 through 9) representing the number of characters to be ignored. It is defined for Titles and Uniform Titles in the following fields:

Examples:
     240 14  $aThe Pickwick papers
     245 18  $aThe ... annual report of the Governor 
     245 12  $aL'enfant criminal
     245 13  $aal-Sharq as-'Arabi
It is also common for some types of introductory characters (diacritics or punctuation marks) not to be identified as nonfiling characters, with the expectation that software will ignore them, and for others to be omitted by the cataloger.

Examples:
     130 0#  $a"Hsuan lai hsi kan" hsi liah
     245 10  $a[Diary]
     245 10  $a­as others see us. 
A nonfiling indicator was defined for Authority format fields X00 (Personal Name), X10 (Corporate Name), and X11 (Meeting Name) until it was made obsolete in 1993. The change was made because the X00, X10, and X11 fields in the Bibliographic format did not (and could not) have corresponding non-filing indicators; systems with authority control modules therefore found that the Authority format indicators had no practical use.

1.2 Need

A more flexible and extensible technique for identifying nonfiling strings is needed since the indicator technique is not available for all places where initial nonfiling information occurs and cataloging conventions for dropping information in those situations may not be acceptable. This is especially important since the format is used by communities that do not use the same cataloging rules and interpretations as the AACR2 community in the US and whose cataloging conventions may therefore differ. Within a community, agreement to omit the article or character may be made, but there is a need to have a mechanism that can be used across all communities.

While the handling of nonfiling characters by an indicator value generally works in those fields for which it is defined, the technique cannot be used in the following cases. (1) When indicator positions are already defined in a field. For example, titles recorded in field 246, like those in field 245 (Title Statement), sometimes have initial articles, but field 246 does not have a non-filing indicator and both indicator positions are already defined for other uses. There is also a need for a nonfiling technique for name headings in some cases. (2) When a title occurs in an author/title field where the title part begins in the $t subfield (e.g., Bibliographic format 7XX fields).

Examples:
     Anwar al-Sadat
     100 1#  $aSadat, Anwar 
         [article omitted since nonfiling characters cannot be indicated]
     The Henry (ship)
     610 1#  $aHenry (Ship) 
         [article omitted since both indicators used already]
     Stower, Caleb.  The printer's manual
     700 1#  $aStower, Caleb.$tPrinter's manual 
         [article omitted since no indicator available for $t subfield]
     N,N-Dimethyltryptamine
         Filed as Dimethyltryptamine
1.3 June 1997 Discussion

This issue was investigated in Discussion Paper 102. The following summarizes the discussion at MARBI.

There were different preferences for the technique to be used -- subfields, graphic characters, and control characters. Subfields would be easy to implement; graphic characters would be difficult to identify; and control characters are theoretically desirable but have some system drawbacks. There was a preference for two distinct characters, to be used before and after the nonfiling part. Several participants pointed out that the function under question is nonfiling, not non-indexing. For example, the English word "the" might not be indexed anywhere in a string, but the characters being identified here are only those that occur at the beginning of a string.

Gary Smith of OCLC proposed three properties that a new technique should have:
(1) It must not introduce any conflicts with existing data and thus must not use any code which has previously been assigned a graphic or control function in USMARC.
(2) It must be location independent, i.e., it must be interpretable without knowledge of the identity or nature of the field in which it occurs.
(3) It must not require the conversion of existing records.

Other characteristics (from discussions) that apply if a graphic or control character technique is used:
(1) There should be two different characters for beginning and end.
(2) The characters should exist also in Unicode.

2 DISCUSSION

There are three types of initial characters that occur in bibliographic data that need to be considered.

2.1 Initial Articles

The most common non-filing characters in MARC data are initial definite and indefinite articles, "the" and "a"/"an" in English and their foreign language counterparts. In English, they may be used intermittently, but in some languages and situations initial articles can be very important grammatically. German and French, for example, express grammatical case and gender through a variety of initial definite and indefinite articles. Arabic and Hebrew use initial definite articles with both nouns and adjectives.

Examples:
     A place like Alice
     Anwar al-Sadat
     Der Spiegel
     La philosophie et le Quebec
Not all languages possess parts of speech such as articles (all Slavic languages except Bulgarian lack articles), or the articles associated with the first word of a title may be appended to the end of a word, for example, articles in Bulgarian and Romanian. The latter are not treated in this proposal, although some techniques might be extensible to them.

Articles are important enough that many cataloging rules allow them to be included in bibliographic data in MARC records, even though custom tends to omit them if possible in search, retrieval, and filing operations.

2.2 Initial Punctuation and Symbols

Punctuation (e.g., ;, ") and other symbols (e.g., $, *) are commonly found at the beginning of titles. For some languages, other marks can occur, e.g., in Spanish the inverted question mark and inverted exclamation mark. Other non-filing characters found in MARC data include the opening square bracket ("[", signifying a cataloger-supplied title) and ellipsis marks ("..."). Most of these characters are either omitted in the access field or detected and omitted for sorting and retrieval by software.

Examples:  
     ...and then I said
     ¿Quién es quién en el Perú?
2.3 Other Initial Nonfiling Characters

On occasion, characters other than articles and punctuation/symbol marks may also occur at the beginning of access points. In MeSH (Medical Subject Headings), for example, names of chemical compounds, when they include prefixed letters or numbers, are sorted and filed ignoring the prefixes.

Examples:  
     16,16-Dimethylprostaglandin E2 
     N,N-Dimethyltryptamine
2.4 Options for more Flexible Indication of Nonfiling Characters

2.4.1 Using indicators

The current USMARC solution for dealing with non-filing characters has been described briefly above. It makes use of an indicator position to signal the number of initial characters in a field to be ignored in processing.

Advantages:

Disadvantages:
Example:      
     245 12  $aA place like Alice 
     (See also section 1.2)
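
The indicator technique can be illustrated with a minimal sketch. The function name filing_form_from_indicator is hypothetical and is not part of the format; the sketch assumes the field data is already available as a character string and that the indicator value has been read as an integer.

```python
def filing_form_from_indicator(data: str, nonfiling_count: int) -> str:
    """Return the filing form of `data` by skipping the leading characters.

    `nonfiling_count` stands for the digit carried in the nonfiling
    indicator position, so only values 0 through 9 are possible.
    """
    if not 0 <= nonfiling_count <= 9:
        raise ValueError("a single indicator position holds only 0 through 9")
    return data[nonfiling_count:]


# Field 245, second indicator 2: "A " (two characters) is skipped.
print(filing_form_from_indicator("A place like Alice", 2))
# Field 240, second indicator 4: "The " (four characters) is skipped.
print(filing_form_from_indicator("The Pickwick papers", 4))
```

The sketch also makes the limits visible: the count applies only to the start of the field, and a single digit cannot mark more than nine characters.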
2.4.2 Using graphic characters

Another approach is to use a graphic character, such as spacing underscore or angle brackets to identify the beginning and end of the nonfiling sequence. On the USMARC list, Elhanan Adler said that double angle brackets (<< and >>) are often used to surround the nonfiling parts in Israel and Bernhard Eversberg indicated that in Germany the spacing underscore and other graphics are commonly used in systems.

Advantages:
- The characters are standardly available in most computer systems.
- The characters can be used anywhere, therefore they would accommodate subfield $t of the author/title and linking entry (7XX) fields and elsewhere.
Disadvantages:
- Data has extraneous graphic characters that must be omitted in displays and printed output.
- Most graphic characters are used by some user group as part of data, for example, the spacing underscore is now found in Internet addresses.

Examples:  
     245 1#  $a_A _place like Alice
     100 1#  $a<<al->>Sadat, Anwar
     245 1#  $a<<¿>>Quién es quién en el Perú
     650 #2  $a_N,N-_Dimethyltryptamine
     700 1#  $aStower, Caleb.$t<<The >>printer's manual
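
A minimal sketch of how a system might process the double angle bracket convention follows. The function names are hypothetical, and the sketch assumes that "<<" and ">>" never occur as genuine data, which is exactly the assumption that the second disadvantage above calls into question.

```python
import re

# Matches one nonfiling zone delimited by << and >> (non-greedy).
NONFILING_ZONE = re.compile(r"<<(.*?)>>")


def display_form_graphic(data: str) -> str:
    """Keep the nonfiling text but drop the delimiters for display."""
    return NONFILING_ZONE.sub(r"\1", data)


def filing_form_graphic(data: str) -> str:
    """Drop the delimiters and the text they enclose for sorting."""
    return NONFILING_ZONE.sub("", data)


heading = "<<al->>Sadat, Anwar"
print(display_form_graphic(heading))
print(filing_form_graphic(heading))
```

Because the delimiters travel with the data, the same two functions could be applied to a $t subfield as readily as to a $a subfield.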
2.4.3 Using special control characters

A pair of special control characters, such as the NON-SORTING CHARACTER(S), BEGIN (hex'88') and NON-SORTING CHARACTER(S), END (hex'89') characters defined in ISO 6630 (Bibliographic control set), could be used as delimiters for strings of non-filing characters. It is not clear how widely the above characters have been implemented, but they are specified for use with IFLA's UNIMARC format and some implementors of that format have used them.

Advantages:

Disadvantages:
The graphics B (for begin) and E (for end) have been used in the following examples to represent the two CONTROL characters from ISO 6630.
Examples: 
     245 1#  $aBA Eplace like Alice
     100 1#  $aBal-ESadat, Anwar
     245 1#  $aB¿EQuién es quién en el Perú?
     650 #2  $aBN,N-EDimethyltryptamine
     700 1#  $aStower, Caleb.$tBThe Eprinter's manual
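
The control character technique might be processed along the following lines. This is a sketch under stated assumptions: the function name is hypothetical, and the two characters are represented here as the code points with values hex 88 and hex 89 in a Python string, which is an illustrative choice and not a statement about how they would be carried in any particular character set.

```python
# Assumed stand-ins for the two ISO 6630 control characters.
NSB = "\x88"  # NON-SORTING CHARACTER(S), BEGIN
NSE = "\x89"  # NON-SORTING CHARACTER(S), END


def display_and_filing_forms(data: str) -> tuple[str, str]:
    """Return (display form, filing form) for one field or subfield.

    The control characters are removed from both forms; text between
    them is kept for display and dropped for filing.
    """
    display, filing = [], []
    in_nonfiling_zone = False
    for ch in data:
        if ch == NSB:
            in_nonfiling_zone = True
        elif ch == NSE:
            in_nonfiling_zone = False
        else:
            display.append(ch)
            if not in_nonfiling_zone:
                filing.append(ch)
    return "".join(display), "".join(filing)


print(display_and_filing_forms(NSB + "A " + NSE + "place like Alice"))
```

The function needs no knowledge of the tag, indicators, or subfield code, which corresponds to the location independence property listed in section 1.3.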
2.4.4 Using system recognition

One solution is to program library systems to recognize grammatical articles automatically. Unfortunately, in practice it is very difficult for a computer to identify initial articles because English articles such as "the", "a", and "an" are legitimate non-article words in other languages (in French, "thé" means "tea", "à" means "to", and "an" means "year"). The variety of languages that might be represented in a single record, both in the description and access points, also makes machine determination of articles based on language coding impractical.

Examples of fields (no nonfiling marking):
     245 1#  $aA place like Alice
     100 1#  $aal-Sadat, Anwar
     245 1#  $a¿Quién es quién en el Perú?
     650 #2  $aN,N-Dimethyltryptamine
     700 1#  $aStower, Caleb.$tThe printer's manual
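
The difficulty can be seen in a deliberately naive sketch. The article list and function name are hypothetical, and the second test string is an illustration in which the French preposition is written as an unaccented capital "A"; it is not taken from any record.

```python
# A deliberately simplistic stoplist of English initial articles.
ENGLISH_ARTICLES = ("the ", "a ", "an ")


def naive_strip_article(title: str) -> str:
    """Strip a leading English article, with no knowledge of language."""
    lowered = title.lower()
    for article in ENGLISH_ARTICLES:
        if lowered.startswith(article):
            return title[len(article):]
    return title


# Correct for an English title.
print(naive_strip_article("A place like Alice"))
# Wrong for a French title: the initial "A" is a preposition here,
# yet it is removed as if it were the English indefinite article.
print(naive_strip_article("A la recherche du temps perdu"))
# Untouched, although "Der" is a German article the stoplist lacks.
print(naive_strip_article("Der Spiegel"))
```

Extending the stoplist to more languages would repair the third case only at the cost of producing more errors of the second kind.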
2.4.5 Using a special subfield

Another idea has been to establish a special subfield for non-filing characters. It would have the disadvantage of separating pieces of titles and names that belong together for other processing and could be confusing with $t situations. The subfield would need to be one that could be used in any field where needed and a common unused subfield code would be difficult to identify. (Subfield $? is used in the examples.)

Examples:  
     245 1#  $?A $aplace like Alice
     100 1#  $?al-$aSadat, Anwar
     245 1#  $?¿$aQuién es quién en el Perú?
     650 #2  $?N,N-$aDimethyltryptamine
     700 1#  $aStower, Caleb.$?The $tprinter's manual
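
A sketch of the subfield approach follows, assuming a field is held as a list of (subfield code, data) pairs and using the placeholder code "?" from the examples above. The function names are hypothetical.

```python
# Placeholder code taken from the examples; no real code has been chosen.
NONFILING_SUBFIELD = "?"

Subfields = list[tuple[str, str]]


def display_from_subfields(subfields: Subfields) -> str:
    """Rejoin all subfields, nonfiling text included, for display."""
    return "".join(data for _code, data in subfields)


def filing_from_subfields(subfields: Subfields) -> str:
    """Rejoin the subfields, leaving out the nonfiling subfield."""
    return "".join(
        data for code, data in subfields if code != NONFILING_SUBFIELD
    )


field_245 = [("?", "A "), ("a", "place like Alice")]
print(display_from_subfields(field_245))
print(filing_from_subfields(field_245))
```

The need to rejoin the pieces for every display is the separation disadvantage described above.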
2.4.6 Using omission

Rules for the inclusion or omission of initial articles in access points vary but have tended to favor omission in recent years in North America. This solution has been particularly widespread in the treatment of initial articles associated with personal names. The omission of initial articles to deal with not being able to handle them otherwise is not totally acceptable to some MARC users. European and Middle Eastern libraries have been particularly vocal in their request for a generalizable technique for indicating non-filing characters. Their chief argument has been that the simple omission of articles corrupts the cataloging data grammatically and yields title strings that the public finds unacceptable.

Examples:  
     245 1#  $aPlace like Alice
     100 1#  $aSadat, Anwar
     245 1#  $aQuién es quién en el Perú
     650 #2  $aDimethyltryptamine
     700 1#  $aStower, Caleb.$tPrinter's manual
 
2.5 Character Modifiers and Special Characters

The format states that diacritics at the beginning of a field that does NOT have any nonfiling characters are not counted as nonfiling characters, e.g.,

     222 #0  $aÖsterreich in Geschichte und Literatur
but those that are associated with the first word after an article are counted, e.g.,
     222 #5  $aDer Öffentliche Dienst
This practice is not thought to be consistently followed, however, and CAN/MARC specifies NOT to count the diacritic in the latter case. It will cause difficulty in the future with Unicode as in Unicode the diacritics are encoded after the characters they modify rather than before, as is currently done in USMARC. Thus in cases where the diacritic has been counted (the latter case above) the indicator count will be erroneous for data converted to Unicode.
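
The effect of the change in encoding order can be shown with a small sketch. It is a simulation only: the USMARC order is imitated by placing a Unicode combining diaeresis before the base letter, which stands in for the USMARC character set and is not an actual encoding of it.

```python
import unicodedata

DIAERESIS = "\u0308"  # Unicode COMBINING DIAERESIS


def skip_nonfiling(data: str, count: int) -> str:
    """Apply a nonfiling indicator value by skipping `count` characters."""
    return data[count:]


# Simulated USMARC order: the modifier precedes the letter it modifies.
usmarc_order = "Der " + DIAERESIS + "Offentliche Dienst"
# Unicode decomposed order: the modifier follows the letter it modifies.
unicode_order = unicodedata.normalize("NFD", "Der \u00d6ffentliche Dienst")

# Indicator value 5 counts "Der " plus the diacritic.
print(repr(skip_nonfiling(usmarc_order, 5)))   # starts at the base letter
print(repr(skip_nonfiling(unicode_order, 5)))  # base letter "O" is lost
# Not counting the diacritic (value 4) works in the Unicode order.
print(repr(skip_nonfiling(unicode_order, 4)))
```

With the value 5, the converted data loses its base letter and begins with an orphaned combining mark, which is the erroneous count described above.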

Special characters, such as "[", "..." and ALIF, are treated in the same way as diacritics in USMARC. They are included in the nonfiling zone when they occur in conjunction with an article:

     245 05  $a[The part of Pennsylvania that ... townships].
but are ignored, with the expectation that systems will also ignore them automatically, when they are not with an article:

     245 00  $a[Diary].
Again CAN/MARC does not count the special character in either case. The first example would have indicator value of "4" in CAN/MARC.

The format specification should be changed to consistently indicate that the diacritic is not counted and the new technique, whichever is selected, needs to likewise specify that the diacritic is outside the nonfiling zone. Consideration should be given to whether the format specification should also note that special characters are consistently considered in a nonfiling zone, even when they do not occur in conjunction with an article.

2.6 Proposal to use the Control Character Technique

The control characters from ISO 6630 for marking the beginning and end of the nonfiling zone meet most of the characteristics indicated in the previous discussion of DP 102.

The last point needs to be discussed. The nonfiling indicators would become obsolete, but that would not mean they needed to be changed in existing records. It is unrealistic to expect that all records will be changed retrospectively. A particular environment might, however, want to convert records to the new technique as they are stored. Particular problems would occur if a system had linked authority and bibliographic records. There would be a need to coordinate the heading fields across the records, probably bringing pressure to retrospectively convert. Any heading matching routines that did not ignore the indicator would need to be adjusted.
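
An environment that chose to convert records on input might do so along these lines. This is a sketch only: the function name is hypothetical, the control characters are represented as the code points with values hex 88 and hex 89, and the sketch ignores the diacritic counting question raised in section 2.5 and does not reset the indicator itself.

```python
# Assumed stand-ins for the two ISO 6630 control characters.
NSB = "\x88"  # NON-SORTING CHARACTER(S), BEGIN
NSE = "\x89"  # NON-SORTING CHARACTER(S), END


def indicator_to_control_characters(data: str, nonfiling_count: int) -> str:
    """Wrap the first `nonfiling_count` characters in NSB ... NSE.

    A count of 0 means there is no nonfiling zone, so the data is
    returned unchanged.
    """
    if nonfiling_count == 0:
        return data
    return NSB + data[:nonfiling_count] + NSE + data[nonfiling_count:]


converted = indicator_to_control_characters("The Pickwick papers", 4)
print(repr(converted))
```

Since the conversion is mechanical, it is consistent with the expectation that no manual re-marking of existing records would be required.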

2.7 Impact Considerations

There are a number of points that should be taken into account when evaluating the impact of this change on existing systems. It can be assumed, however, that there would not be a need for any manual re-marking of retrospective records.

3 PROPOSED CHANGE

In the USMARC Bibliographic, Authority, Classification, and Community Information formats:



Library of Congress
Library of Congress Help Desk (09/01/98)