LHNCBC: Document Abstract
Year: 2003
LHNCBC-2003-002
Generating Robust Features for Style-Independent Labeling of Bibliographic Fields in Medical Journal Articles
Mao S, Kim J, Le DX, Thoma GR
Proc. 7th World Multiconference on Systemics, Cybernetics and Informatics. 2003 July;III:53-6.
Bibliographic data such as title, author, affiliation, and abstract are crucial for indexing biomedical journal articles. The Medical Article Records System (MARS) has been developed at the National Library of Medicine (NLM) to automate bibliographic data extraction for MEDLINE, the NLM's premier database of citations to the biomedical literature. The automatic extraction of bibliographic data involves assigning logical labels (title, author, affiliation, and abstract) to homogeneous regions, or zones, on page images. While an OCR- and rule-based labeling module in MARS (called ZoneCzar) can reliably label medical journals with regular layout styles, it cannot accurately label journals with arbitrary or unusual layout styles, and new rules have to be created manually for these journals. Furthermore, OCR zoning errors, particularly merging errors, can greatly affect the labeling accuracy of ZoneCzar. In this paper, we describe an algorithm for the automatic generation of robust features that the labeling algorithm uses to perform style-independent labeling.