LHNCBC: Document Abstract
Year: 2007
LHNCBC-2007-066
Simultaneous Layout Style and Logical Entity Recognition in a Heterogeneous Collection of Documents
Chen S, Mao S, Thoma GR
Proceedings of 9th Int. Conf. on Document Analysis and Recognition (ICDAR2007). Curitiba, Brazil; September 2007, pp. 118-22
Logical entity recognition in heterogeneous collections of document page images remains a challenging problem, since the performance of traditional supervised methods degrades dramatically when a collection contains many distinct layout styles. In this paper we present an unsupervised method in which layout style information is used explicitly in both the training and recognition phases. We represent the layout style, local features, and logical labels of the physical regions of a document compactly by an ordered labeled X-Y tree. The style dissimilarity of two document pages is represented by the distance between their representing trees. During the training phase, document pages with true logical labels in the training set are classified into distinct layout styles by unsupervised clustering. During the recognition phase, the layout style and logical entities of an input document are recognized simultaneously by matching the input tree to the trees in the closest-matched layout style cluster of the training set. The experimental results show that our algorithm is robust to balanced and unbalanced style cluster sizes, zone over-segmentation, zone length variation, and variation in tree representations of the same layout style.
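The pipeline described in the abstract (tree representation, tree distance, style clustering, nearest-cluster matching) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the `Node` encoding ("H"/"V" cuts and "zone" leaves), the unit-cost top-down tree edit distance (in the style of Selkow) that stands in for the paper's tree distance, the greedy threshold clustering that stands in for its unsupervised clustering, and the function names `tree_distance`, `cluster_styles`, and `recognize`. Local zone features are omitted entirely.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    """One node of an ordered labeled tree (hypothetical encoding)."""
    kind: str                       # "H" or "V" cut, or "zone" for a leaf
    logical: Optional[str] = None   # logical label, known for training leaves
    children: List["Node"] = field(default_factory=list)


def size(t: Node) -> int:
    """Number of nodes in the subtree rooted at t."""
    return 1 + sum(size(c) for c in t.children)


def tree_distance(a: Node, b: Node) -> int:
    """Unit-cost top-down ordered tree edit distance (assumed stand-in).

    Roots are compared by kind; child sequences are aligned with a
    string-edit-style DP where inserting or deleting a whole subtree
    costs its node count.
    """
    m, n = len(a.children), len(b.children)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + size(a.children[i - 1])
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + size(b.children[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + size(a.children[i - 1]),          # delete subtree
                d[i][j - 1] + size(b.children[j - 1]),          # insert subtree
                d[i - 1][j - 1]
                + tree_distance(a.children[i - 1], b.children[j - 1]),
            )
    return int(a.kind != b.kind) + d[m][n]


def cluster_styles(trees: List[Node], threshold: int) -> List[List[int]]:
    """Greedy clustering: a tree joins the first cluster whose leader
    (first member) is within `threshold`, otherwise it starts a new one."""
    clusters: List[List[int]] = []
    for i, t in enumerate(trees):
        for c in clusters:
            if tree_distance(trees[c[0]], t) <= threshold:
                c.append(i)
                break
        else:
            clusters.append([i])
    return clusters


def leaves(t: Node) -> List[Node]:
    """Leaf zones of t in left-to-right order."""
    if not t.children:
        return [t]
    return [leaf for c in t.children for leaf in leaves(c)]


def recognize(page: Node, trees: List[Node],
              clusters: List[List[int]]) -> Tuple[int, List[Optional[str]]]:
    """Pick the closest style cluster (by leader), then the closest tree
    inside it, and return (cluster index, that tree's leaf labels)."""
    style = min(range(len(clusters)),
                key=lambda k: tree_distance(trees[clusters[k][0]], page))
    best = min(clusters[style], key=lambda i: tree_distance(trees[i], page))
    return style, [leaf.logical for leaf in leaves(trees[best])]
```

In this sketch, `recognize` returns the style cluster index together with the logical labels of the best-matching training tree in reading order. Mapping those labels onto the individual zones of the input page would require recovering the edit alignment, which is left out here for brevity.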