LHNCBC: Document Abstract
Year: 2001
LHNCBC-2001-002
Stochastic Language Models for Automatic Acquisition of Lexicons from Printed Bilingual Dictionaries
Mao S, Kanungo T
Document Layout Interpretation and Its Application. 2001.
Electronic bilingual lexicons are crucial for machine translation, cross-lingual information retrieval, and speech recognition. For low-density languages, however, such lexicons are often not available in electronic form. One solution is to acquire electronic lexicons from printed bilingual dictionaries. While manual data entry is a possibility, automatic acquisition of lexicons from scanned images of bilingual dictionaries would speed up the prototyping of cross-language systems. Printed dictionaries have a logical model that defines the syntax of the dictionary entries, i.e., the order of the entry word, its part of speech, its pronunciation, and its definition. In this article we propose an algorithm that automatically extracts bilingual dictionary entries based on stochastic language models. We demonstrate this algorithm on a printed Chinese-English dictionary. The approach can readily be applied to extracting information from other tabular structures such as telephone books and catalogs.
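As a rough illustration of how a stochastic model can encode the entry syntax described above, the following is a minimal sketch of a small hidden Markov model with Viterbi decoding that assigns a field label to each token of one entry. It is not the authors' algorithm, which the abstract does not specify: the field names, the transition probabilities, the surface cues in `emission`, and the sample tokens are all assumptions made up for this example.

```python
import math

# Hypothetical field labels for one dictionary entry (illustrative only).
STATES = ["HEADWORD", "POS", "PRON", "DEF"]

# Assumed transition probabilities: fields appear in a fixed order,
# and the part of speech or pronunciation may be skipped.
TRANS = {
    "HEADWORD": {"HEADWORD": 0.1, "POS": 0.6, "PRON": 0.3},
    "POS": {"PRON": 0.7, "DEF": 0.3},
    "PRON": {"DEF": 1.0},
    "DEF": {"DEF": 1.0},
}
START = {"HEADWORD": 1.0}  # every entry is assumed to start with its headword

POS_TAGS = {"n.", "v.", "adj.", "adv."}


def emission(state, token):
    """Toy emission model based on surface cues (hand-set, not learned)."""
    is_pos = token in POS_TAGS
    is_pron = token.startswith("[") and token.endswith("]")
    if state == "POS":
        return 0.9 if is_pos else 0.01
    if state == "PRON":
        return 0.9 if is_pron else 0.01
    # HEADWORD and DEF prefer ordinary words.
    return 0.05 if (is_pos or is_pron) else 0.5


def _log(p):
    return math.log(p) if p > 0 else float("-inf")


def label_entry(tokens):
    """Return the most probable field label for each token (Viterbi)."""
    # best[state] = (log probability, label path ending in state)
    best = {
        s: (_log(START.get(s, 0.0)) + _log(emission(s, tokens[0])), [s])
        for s in STATES
    }
    for token in tokens[1:]:
        step = {}
        for s in STATES:
            score, path = max(
                ((best[p][0] + _log(TRANS[p].get(s, 0.0)), best[p][1])
                 for p in STATES),
                key=lambda item: item[0],
            )
            step[s] = (score + _log(emission(s, token)), path + [s])
        best = step
    return max(best.values(), key=lambda item: item[0])[1]


if __name__ == "__main__":
    # Made-up tokens standing in for OCR output of a single entry.
    print(label_entry(["ma", "n.", "[ma3]", "horse"]))
```

In a real system the probabilities would presumably be estimated from labeled entries rather than set by hand, and the tokens would come from OCR of the scanned page.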