UNICODE-MARC Archives -- December 2005 (#2)


Date: Thu, 8 Dec 2005 14:55:48 -0800
Reply-To: UNICODE-MARC Discussion List <[log in to unmask]>
Sender: UNICODE-MARC Discussion List <[log in to unmask]>
From: Johan Zeeman <[log in to unmask]>
Subject: Re: MARC Filing and Unicode Exclusion
Comments: To: UNICODE-MARC Discussion List <[log in to unmask]>
Comments: cc: UNICODE-MARC Discussion List <[log in to unmask]>
In-Reply-To: <[log in to unmask]>
Content-type: text/plain; charset=US-ASCII

Most Hebrew characters are encoded in two-byte values in UTF-8; at least one is encoded in 3 bytes. If Voyager does behave as you have described, it would seem to be incorrect. The second indicator of 245 is defined to be "non-filing CHARACTERS", not bytes. Voyager should be converting UTF-8 encodings (whether one-byte or multi-byte) in the exchange record to an internal character representation before determining at what point the indexed term starts.

Note that the record length count in the leader and (I think) the lengths and offsets in the directory are counts of bytes; non-filing indicators in the data fields are counts of characters. In the MARC-8 environment these are more or less synonymous. In the UTF-8 environment, they most definitely are not synonyms.

Just another one of the multitude of joys in a MARC record.

Johan Zeeman
RLG


Daniel Lovins <daniel.lovins@YALE.EDU>
Sent by: [log in to unmask]
12/08/2005 02:09 PM
Please respond to UNICODE-MARC Discussion List <UNICODE-MARC@loc.gov>

To: UNICODE-MARC Discussion List <UNICODE-MARC@loc.gov>
cc:
Subject: Re: [UNICODE-M] MARC Filing and Unicode Exclusion

Dear group,

I have a question about whether certain Unicode characters--in my case, certain Hebrew ones--are represented by double bytes in UTF-8, and if so, whether this would explain the following situation:

While testing the Unicode release of Endeavor Voyager, a member of my team, Jerry Anne Dickel, found that Hebrew script titles beginning with definite (and for Yiddish, also indefinite) articles no longer indexed properly. The titles were failing to show up in browse displays.

Jerry Anne was able to implicate the second indicator of the 245 field (= number of non-filing characters) in this: Ordinarily, with the Hebrew article "ha" [a one-character prefix], the second indicator of the 245 would be 1, but it was only when Jerry Anne changed it to a 2 that the title once again indexed correctly. The same thing happened with the Yiddish definite article "der" (3 letters plus a space, as in the Latin script), where the numeral 4 (representing the three letters plus space) would normally be used in the second indicator; in Voyager Unicode, however, the title would only index if the 4 were replaced by a 7 (i.e., doubling the Hebrew characters (3x2) but not the space).

We replicated the problem in LC's Unicode-compliant Voyager and in OCLC WorldCat. Interestingly, there did not seem to be a problem in RLIN21. Did RLG anticipate (what I'm assuming is) the doubled bytes and apply a fix? Alternatively, do you think it might be something other than byte number that's causing the problem?

Thank you very much for your help.

Daniel

>------------------------------------
Daniel Lovins
Hebraica Team Leader
Catalog Department
Sterling Memorial Library
Yale University
PO Box 208240
New Haven, CT 06520
tel: 203/432-1707
fax: 203/432-7231
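
The character-versus-byte arithmetic discussed in this thread (an indicator of 1 behaving like 2, and 4 behaving like 7) can be checked with a short Python sketch. It is an illustration only, not a description of how Voyager, WorldCat, or RLIN21 work internally. The function names and the two sample titles (a Hebrew title beginning with the article "ha" and a Yiddish title beginning with "der") are hypothetical, and "character" here means a Unicode code point, which is itself an assumption about how the indicator should be counted.

```python
def nonfiling_byte_length(title: str, nonfiling_chars: int) -> int:
    """Return how many UTF-8 bytes the first `nonfiling_chars` characters occupy."""
    return len(title[:nonfiling_chars].encode("utf-8"))


def filing_form(raw_subfield: bytes, nonfiling_chars: int) -> str:
    """Decode the UTF-8 bytes first, then skip characters (not bytes)."""
    return raw_subfield.decode("utf-8")[nonfiling_chars:]


# Hypothetical sample titles, written as escapes to avoid display problems.
# Hebrew: he (the article "ha") + samekh, pe, resh.
HEBREW_TITLE = "\u05d4\u05e1\u05e4\u05e8"
# Yiddish: dalet, ayin, resh ("der") + space + mem, ayin, nun, tet, shin.
YIDDISH_TITLE = "\u05d3\u05e2\u05e8 \u05de\u05e2\u05e0\u05d8\u05e9"

# Second-indicator values as a cataloger would enter them (character counts).
for title, indicator in ((HEBREW_TITLE, 1), (YIDDISH_TITLE, 4)):
    # Prints the character count next to the byte count of the same prefix.
    print(indicator, nonfiling_byte_length(title, indicator))
```

Under these assumptions the loop prints `1 2` and `4 7`, matching the values reported above. A system that skips bytes rather than characters would need the larger number, while `filing_form` skips the article correctly with the indicator as defined.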


LISTSERV.LOC.GOV