Date:Thu, 8 Dec 2005 14:55:48 -0800
Reply-To:UNICODE-MARC Discussion List <[log in to unmask]>
Sender:UNICODE-MARC Discussion List <[log in to unmask]>
From:Johan Zeeman <[log in to unmask]>
Subject:Re: MARC Filing and Unicode Exclusion
Comments:To: UNICODE-MARC Discussion List <[log in to unmask]>
Comments:cc: UNICODE-MARC Discussion List <[log in to unmask]>
In-Reply-To:<[log in to unmask]>
Content-type:text/plain; charset=US-ASCII
Most Hebrew characters are encoded as two-byte values in UTF-8; at least
one is encoded in three bytes.
If Voyager does behave as you have described, it would seem to be
incorrect. The second indicator of 245 is defined to be "non-filing
CHARACTERS", not bytes. Voyager should be converting UTF-8 encodings
(whether one-byte or multi-byte) in the exchange record to an internal
character representation before determining at what point the indexed term
starts.
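
To illustrate the difference, here is a minimal Python sketch. It is not Voyager's code; the function names are hypothetical, and it assumes that one MARC "character" corresponds to one Unicode code point. The first function decodes before skipping, as described above; the second skips bytes first, which is the behavior the report seems to suggest.

```python
def filing_form(field_bytes: bytes, nonfiling: int) -> str:
    """Decode UTF-8 first, then skip `nonfiling` characters."""
    text = field_bytes.decode("utf-8")
    return text[nonfiling:]


def filing_form_bytewise(field_bytes: bytes, nonfiling: int) -> str:
    """Suspected faulty behavior: skip `nonfiling` bytes, then decode."""
    # errors="replace" turns a broken byte sequence into U+FFFD
    # instead of raising an exception.
    return field_bytes[nonfiling:].decode("utf-8", errors="replace")


# Hebrew "ha-sefer": he (the article "ha"), samekh, pe, resh.
title = "\u05d4\u05e1\u05e4\u05e8".encode("utf-8")

print(filing_form(title, 1))           # article skipped cleanly
print(filing_form_bytewise(title, 1))  # starts mid-character
print(filing_form_bytewise(title, 2))  # only works when the count is doubled
```

With an indicator of 1, the bytewise version starts in the middle of the first character, so the index term begins with an undecodable byte.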
Note that the record length count in the leader and (I think) the lengths
and offsets in the directory are counts of bytes; non-filing indicators in
the data fields are counts of characters. In the MARC-8 environment these
are more or less synonymous. In the UTF-8 environment, they most
definitely are not synonyms. Just another one of the multitude of joys in
a MARC record.
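
A small Python sketch of the two kinds of count, assuming again that a character means a Unicode code point (the helper name is hypothetical). For plain ASCII text the two numbers agree; for Hebrew text in UTF-8 they do not.

```python
def char_and_byte_counts(text: str) -> tuple[int, int]:
    """Return (number of characters, number of UTF-8 bytes)."""
    return len(text), len(text.encode("utf-8"))


# ASCII title: one byte per character, so both counts match.
print(char_and_byte_counts("The book"))

# Hebrew title (he, samekh, pe, resh): two bytes per character in UTF-8.
print(char_and_byte_counts("\u05d4\u05e1\u05e4\u05e8"))
```

The byte count is what belongs in the leader and directory; the character count is what belongs in a non-filing indicator.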
Johan Zeeman
RLG
From: Daniel Lovins <daniel.lovins@YALE.EDU>
Sent by: UNICODE-MARC Discussion List <UNICODE-MARC@loc.gov>
Date: 12/08/2005 02:09 PM
To: [log in to unmask]
Subject: Re: [UNICODE-M] MARC Filing and Unicode Exclusion
Please respond to: UNICODE-MARC Discussion List <UNICODE-MARC@loc.gov>
Dear group,
I have a question about whether certain Unicode characters--in my case,
certain Hebrew ones--are represented by double bytes in UTF-8, and if so,
whether this would explain the following situation:
While testing the Unicode release of Endeavor Voyager, a member of my team,
Jerry Anne Dickel, found that Hebrew-script titles beginning with definite
(and for Yiddish, also indefinite) articles no longer indexed properly.
The titles were failing to show up in browse displays. Jerry Anne was able
to implicate the second indicator of the 245 field (= number of non-filing
characters) in this: Ordinarily, with the Hebrew article "ha" [a
one-character prefix], the second indicator of the 245 would be 1, but it was
only when Jerry Anne changed it to a 2 that the title once again indexed
correctly. The same thing happened with the Yiddish definite article "der"
(3 letters plus a space as in the Latin script), where the numeral 4
(representing the three letters plus space) would normally be used in the
second indicator; in Voyager Unicode, however, the title would only index
if the 4 were replaced by a 7 (i.e., doubling the Hebrew characters (3x2)
but not the space).
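
The arithmetic above can be checked with a short Python sketch. It assumes each Hebrew letter occupies two bytes in UTF-8 and the space one byte; the helper name and the sample titles are illustrative only.

```python
def bytes_spanned(text: str, n_chars: int) -> int:
    """Number of UTF-8 bytes occupied by the first `n_chars` characters."""
    return len(text[:n_chars].encode("utf-8"))


# Hebrew "ha-sefer": the article is the single letter he.
ha_title = "\u05d4\u05e1\u05e4\u05e8"

# Yiddish "der bukh": dalet, ayin, resh, space, then the noun.
der_title = "\u05d3\u05e2\u05e8 \u05d1\u05d5\u05da"

print(bytes_spanned(ha_title, 1))   # 1 character  -> 2 bytes
print(bytes_spanned(der_title, 4))  # 4 characters -> 3*2 + 1 = 7 bytes
```

The results, 2 and 7, match the indicator values that made the titles index again, which is consistent with the indicator being read as a byte count.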
We replicated the problem in LC's Unicode-compliant Voyager and in OCLC
WorldCat.
Interestingly, there did not seem to be a problem in RLIN21.
Did RLG anticipate (what I'm assuming is) the doubled bytes and apply a
fix? Alternatively, do you think it might be something other than byte
number that's causing the problem?
Thank you very much for your help.
Daniel
>------------------------------------
Daniel Lovins
Hebraica Team Leader
Catalog Department
Sterling Memorial Library
Yale University
PO Box 208240
New Haven, CT 06520
tel: 203/432-1707
fax: 203/432-7231