LHNCBC: Document Abstract


Year: 2005
LHNCBC-2005-019
Unsupervised Style Classification of Document Page Images
Mao S, Nie L, Thoma GR
Proc. IEEE International Conference on Image Processing, September 2005, Genova, Italy; Vol. II: 510-13
Style classification of document page images is crucial for logical structure analysis of heterogeneous collections of documents. Both layout and contextual features contain significant information about document styles. Most existing methods are supervised: specific document models or classifiers are learned from a training set of document page images with known class labels. In this paper, we propose an unsupervised classification method that involves no training or manual selection of algorithm parameters. In particular, we first represent each document page as an ordered labeled X-Y tree. A tree matching algorithm is then used to compute the style dissimilarity between two document pages. We propose a set of tree edit cost functions based on the Karl Pearson distance between two multivariate feature observations, which is robust to the over-segmentation problem and to zone length variations of the same logical entities. Finally, the K-medoids algorithm is used to find an optimal grouping of the trees into K clusters, each of which corresponds to a distinct document style. We evaluate our algorithm on test datasets with different cluster sizes and degrees of style similarity. Experimental results show that our algorithm achieved an average classification accuracy of 95.69% over six datasets consisting of 150 pages of 11 different styles.
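
The last two steps of the abstract (a pairwise dissimilarity followed by K-medoids grouping) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the X-Y tree matching step is replaced here by a direct Karl Pearson (variance-standardized Euclidean) distance between flat feature vectors, and the toy `pages` data, the function names, and the naive "first K items" initialization are all assumptions made for illustration.

```python
import math
import statistics


def karl_pearson_distance(x, y, variances):
    """Euclidean distance with each squared difference scaled by that feature's variance."""
    return math.sqrt(sum((a - b) ** 2 / v for a, b, v in zip(x, y, variances)))


def assign(dist, medoids):
    """Label every item with the index of its nearest medoid."""
    return [min(range(len(medoids)), key=lambda c: dist[i][medoids[c]])
            for i in range(len(dist))]


def k_medoids(dist, k, max_iter=100):
    """Cluster items given a symmetric dissimilarity matrix (list of lists).

    Alternates between assigning items to the nearest medoid and moving each
    medoid to the cluster member with the smallest total dissimilarity.
    """
    medoids = list(range(k))  # assumption: naive initialization with the first k items
    for _ in range(max_iter):
        labels = assign(dist, medoids)
        new_medoids = []
        for c in range(k):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if not members:  # keep the old medoid if a cluster becomes empty
                new_medoids.append(medoids[c])
                continue
            new_medoids.append(min(members, key=lambda m: sum(dist[m][j] for j in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(dist, medoids)


# Hypothetical per-page feature vectors standing in for the tree representation.
pages = [(1.0, 2.0), (1.1, 2.1), (0.9, 1.9), (8.0, 9.0), (8.2, 9.1), (7.9, 8.8)]
variances = [statistics.pvariance(col) for col in zip(*pages)]
dist = [[karl_pearson_distance(p, q, variances) for q in pages] for p in pages]
medoids, labels = k_medoids(dist, k=2)
```

Because K-medoids needs only the dissimilarity matrix, the same clustering routine would apply unchanged if `dist` were filled with tree edit distances instead of vector distances.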

Lister Hill National Center for Biomedical Communications
U.S. National Library of Medicine, 8600 Rockville Pike, Bethesda, MD 20894
National Institutes of Health, Department of Health & Human Services