Staff Bibliography |
Document Citation
Title:
Design of a Digital Library for Early 20th Century Medico-legal Documents.
Author(s):
Thoma GR, Mao S, Misra D, Rees J.
Institution(s):
1) Lister Hill National Center for Biomedical Communications, National Library of Medicine, Bethesda, Maryland 20894, USA.
Source:
Proc. ECDL 2006. Eds: Gonzalo J et al. Berlin: Springer-Verlag; LNCS. September 2006;4172:147-57.
Abstract:
A digital library being designed at the U.S. National Library of Medicine aims to enhance the research value of an important collection of government documents for historians of medicine and law. This paper presents work toward the design of a system for preserving and providing access to this material, focusing mainly on the automated extraction of the descriptive metadata needed for future access. Because manual entry of these metadata for thousands of documents is unaffordable, automation is required. Successful metadata extraction relies on accurate classification of key textlines in the document. Methods are described for selecting scanning alternatives that lead to high OCR conversion performance, and for combining a Support Vector Machine (SVM) with a Hidden Markov Model (HMM) to classify textlines and extract metadata. Experimental results from our initial research toward an optimal textline classifier and metadata extractor are given.
Publication Type: CONFERENCE