LHNCBC: Document Abstract


Year: 2001
LHNCBC-2001-010
Pattern Matching Techniques for Correcting Low Confidence OCR Words in a Known Context
Ford G, Hauser SE, Le DX, Thoma GR
Proc. SPIE, Document Recognition and Retrieval VIII. 2001 Jan;4307:241-9.
A commercial OCR system is a key component of a system developed at the National Library of Medicine for the automated extraction of bibliographic fields from biomedical journals. This 5-engine OCR system performs well overall but does not reliably convert very small characters, especially those in italics. As a result, the 'affiliations' field, which in most journals contains such characters, is not captured accurately and requires a disproportionately large amount of manual input. To correct this problem, dictionaries have been created from words occurring in this field (e.g., university, department, street addresses, and names of cities) in 230,000 articles already processed. The OCR output corresponding to the affiliation field is then matched against these dictionary entries by approximate string-matching techniques, and the ranked matches are presented to operators for verification. This paper outlines the techniques employed and the results of a comparative evaluation.
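The matching step described above can be illustrated with a minimal sketch. The abstract does not say which approximate string-matching techniques the paper compares, so this example assumes plain Levenshtein edit distance as a stand-in. The function names, the sample dictionary, and the sample OCR word are all hypothetical and are not taken from the paper.

```python
from typing import List, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a and b, computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete a character from a
                            curr[j - 1] + 1,      # insert a character into a
                            prev[j - 1] + cost))  # substitute (or keep) a character
        prev = curr
    return prev[-1]


def rank_candidates(ocr_word: str, dictionary: List[str],
                    max_results: int = 3) -> List[Tuple[str, int]]:
    """Return up to max_results dictionary entries, closest match first."""
    word = ocr_word.lower()
    scored = [(edit_distance(word, entry.lower()), entry) for entry in dictionary]
    scored.sort()  # by distance, then alphabetically for ties
    return [(entry, dist) for dist, entry in scored[:max_results]]


# Hypothetical dictionary; the real ones were built from 230,000 processed articles.
SAMPLE_DICTIONARY = ["university", "department", "medicine", "bethesda", "maryland"]

if __name__ == "__main__":
    # "Univcrsitv" imitates an OCR misreading of small italic type.
    for candidate, distance in rank_candidates("Univcrsitv", SAMPLE_DICTIONARY):
        print(candidate, distance)
```

In a workflow like the one described, the ranked list would be shown to an operator for verification rather than applied automatically.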

Lister Hill National Center for Biomedical Communications
U.S. National Library of Medicine, 8600 Rockville Pike, Bethesda, MD 20894
National Institutes of Health, Department of Health & Human Services
Site last updated: 30 January 2009