LHNCBC: Document Abstract
Year: 2007
LHNCBC-2007-082
Structure and Content Analysis for HTML Medical Articles: A Hidden Markov Model Approach
Zou J, Le DX, Thoma GR
Proc. ACM Symposium on Document Engineering, August 2007, pp. 199-201.
We describe ongoing research on segmenting and labeling HTML medical journal articles. In contrast to existing approaches in which HTML tags usually serve as strong indicators, we seek to minimize dependence on HTML tags. Designing logical component models for general Web pages is a challenging task. However, in the narrow domain of online journal articles, we show that the HTML document, modeled with a Hidden Markov Model, can be accurately segmented into logical zones.
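As a rough illustration of how a Hidden Markov Model can assign logical zones to a sequence of text blocks, the sketch below runs Viterbi decoding over a toy model. It is not the authors' implementation: the zone names, the coarse text features (`short_caps`, `name_list`, `long_text`), and every probability are hypothetical values chosen only for this example, and the features are text-based rather than tag-based to mirror the stated goal of reducing dependence on HTML tags.

```python
import math

# Hypothetical logical zones (hidden states). These names and all numbers
# below are illustrative assumptions, not values from the paper.
STATES = ["title", "authors", "abstract", "body"]

# Assumed probability that a document starts in each zone.
START = {"title": 0.7, "authors": 0.1, "abstract": 0.1, "body": 0.1}

# Assumed zone-to-zone transition probabilities (each row sums to 1).
TRANS = {
    "title":    {"title": 0.20, "authors": 0.60, "abstract": 0.10, "body": 0.10},
    "authors":  {"title": 0.05, "authors": 0.30, "abstract": 0.55, "body": 0.10},
    "abstract": {"title": 0.05, "authors": 0.05, "abstract": 0.40, "body": 0.50},
    "body":     {"title": 0.05, "authors": 0.05, "abstract": 0.10, "body": 0.80},
}

# Assumed probability of observing each coarse text feature in each zone.
EMIT = {
    "title":    {"short_caps": 0.7, "name_list": 0.1, "long_text": 0.2},
    "authors":  {"short_caps": 0.2, "name_list": 0.7, "long_text": 0.1},
    "abstract": {"short_caps": 0.1, "name_list": 0.1, "long_text": 0.8},
    "body":     {"short_caps": 0.1, "name_list": 0.1, "long_text": 0.8},
}


def viterbi(observations):
    """Return the most likely zone label for each observed text block."""
    if not observations:
        return []

    # best[t][s]: log-probability of the best path ending in state s at step t.
    first = observations[0]
    best = [{s: math.log(START[s]) + math.log(EMIT[s][first]) for s in STATES}]
    back = []  # back[t][s]: predecessor of s on the best path into step t + 1

    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in STATES:
            # Pick the predecessor that maximizes the path score into s.
            prev = max(STATES, key=lambda p: best[-1][p] + math.log(TRANS[p][s]))
            scores[s] = (best[-1][prev] + math.log(TRANS[prev][s])
                         + math.log(EMIT[s][obs]))
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)

    # Trace back from the best final state to recover the full labeling.
    state = max(STATES, key=lambda s: best[-1][s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path
```

With these assumed parameters, `viterbi(["short_caps", "name_list", "long_text", "long_text"])` returns `["title", "authors", "abstract", "body"]`. The two `long_text` blocks have identical emission probabilities under `abstract` and `body`, so it is the transition structure, not the per-block features, that separates them.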