LHNCBC: Document Abstract
Year: 2003
LHNCBC-2003-008
OCR Correction Using Historical Relationships from Verified Text in Biomedical Citations
Hauser SE, Sabir TF, Thoma GR
Proc. of 2003 Symposium on Document Image Understanding Technology. College Park MD: Institute for Advanced Computer Studies, University of Maryland. 2003 April: 171-7.
The Lister Hill National Center for Biomedical Communications has developed a system that incorporates OCR and automated recognition and reformatting algorithms to extract bibliographic citation data from scanned biomedical journal articles to populate the NLM's MEDLINE database. The multi-engine OCR server incorporated in the system performs well in general, but fares less well with text printed in the small or italic fonts often used to print institutional affiliations. Because of poor OCR and other reasons, the resulting affiliation field frequently requires a disproportionate amount of time to manually correct and verify. In contrast, author names are usually printed in large, normal fonts that are correctly recognized by the OCR system. We describe techniques to exploit the more successful OCR conversion of author names to help find the correct affiliations from MEDLINE data.
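The general idea of using reliably recognized author names to recover an affiliation can be illustrated with a minimal sketch. This is not the method from the paper, whose details are not given in the abstract. It assumes a hypothetical lookup table of affiliations previously verified for each author, and it uses plain string similarity from Python's standard difflib module to pick the historical affiliation closest to the noisy OCR output. The names, the sample data, and the 0.6 threshold are all invented for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical history: author name -> affiliations seen in earlier verified
# citations. The entries are made-up sample data, not MEDLINE records.
HISTORY = {
    "Smith JA": ["Department of Biology, Example University, Springfield"],
    "Jones RB": ["Institute of Chemistry, Sample College, Riverton"],
}


def candidate_affiliations(authors, history):
    """Collect, in order and without duplicates, affiliations linked to the authors."""
    seen = []
    for name in authors:
        for affiliation in history.get(name, []):
            if affiliation not in seen:
                seen.append(affiliation)
    return seen


def similarity(a, b):
    """Case-insensitive similarity ratio between two strings, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def suggest_affiliation(ocr_affiliation, authors, history, threshold=0.6):
    """Return the historical affiliation most similar to the OCR text.

    Returns None when no candidate reaches the (assumed) threshold, in which
    case the field would still need manual correction.
    """
    best, best_score = None, 0.0
    for candidate in candidate_affiliations(authors, history):
        score = similarity(ocr_affiliation, candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best if best_score >= threshold else None


if __name__ == "__main__":
    # Typical OCR confusions in small fonts: "m" read as "rn", "l" as "1".
    noisy = "Departrnent of Bio1ogy, Exarnple Universlty, Springfie1d"
    print(suggest_affiliation(noisy, ["Smith JA", "Jones RB"], HISTORY))
```

In this sketch the author names act as the key into the verified history, so the noisy affiliation text only has to be good enough to choose among a few candidates rather than to be corrected character by character.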