LHNCBC: Document Abstract
Year: 2005
LHNCBC-2005-023
Automated Cleanup Processing for Extracting Bibliographic Data from Biomedical Online Journals
Kim I, Le DX, Thoma GR
In: Callaos N, Lesso W, editors. SCI 2005. Proc. 9th World Multiconference on Systemics, Cybernetics and Informatics; 2005 Jul 10-13; Vol. 4; Orlando (FL): International Institute of Informatics and Systemics; c2005. 401-5.
An R&D division of the National Library of Medicine (NLM) has developed the Web-based Medical Article Records System (WebMARS) to create citations from online biomedical journals. This paper presents one important part of this system, the automated cleanup module that extracts bibliographic information from HTML-formatted text based on a rule-based approach. A learning scheme comparing the output of the cleanup module to the verified processing result is newly introduced to create and update cleanup rules automatically, thereby minimizing the manual effort for rule setting and improving the performance of the cleanup processing. Experimental results show that the proposed automated cleanup module can effectively detect and extract the bibliographic data of interest from HTML-formatted online journal articles using relevant rules identified through the learning process.
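As a rough illustration of the idea described in the abstract, the following is a minimal sketch of rule-based field extraction from HTML, with a learning step that compares the module's output to a verified record and adds a rule when they disagree. It is not the WebMARS implementation: the rule representation (field name plus regular expression), the function names `apply_rules`, `learn_rule`, and `update_rules`, and the fixed-width context heuristic are all assumptions made for this example.

```python
import re
from html import unescape


def apply_rules(html, rules):
    """Extract fields from HTML using (field, regex) rules; first match per field wins."""
    record = {}
    for field, pattern in rules:
        if field in record:
            continue
        match = re.search(pattern, html, flags=re.S)
        if match:
            # Drop any nested tags and decode entities in the captured text.
            text = re.sub(r"<[^>]+>", "", match.group(1))
            record[field] = unescape(text).strip()
    return record


def learn_rule(html, field, verified_value, context=20):
    """Build a rule from the markup around a verified value (hypothetical heuristic)."""
    idx = html.find(verified_value)
    if idx < 0:
        return None  # value not present verbatim; no rule can be derived this way
    end = idx + len(verified_value)
    prefix = html[max(0, idx - context):idx]
    suffix = html[end:end + context]
    return (field, re.escape(prefix) + r"(.*?)" + re.escape(suffix))


def update_rules(html, rules, verified):
    """Compare extraction output with the verified record and add rules for mismatches."""
    record = apply_rules(html, rules)
    updated = list(rules)
    for field, value in verified.items():
        if record.get(field) != value:
            rule = learn_rule(html, field, value)
            if rule is not None:
                # New rules go first so they take precedence over the failing ones.
                updated.insert(0, rule)
    return updated
```

In this sketch, a page whose author line is not covered by any existing rule would yield a new rule after one verified record is supplied, and that rule could then be tried on other articles sharing the same page template. A real system would need to generalize the learned context and handle values that differ from the markup by entities or nested tags.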