The data here is ground truth for representative article page images
drawn from the corpus of important biomedical journals. It is thus
most suitable for the development of algorithms to locate and extract
text from the bibliographic fields typical of articles within such
journals. These fields include the article title, author names, institutional
affiliations, abstracts and possibly others. Only the first page of
each article is available, or the second page if the abstract runs
over.
This data was collected during the operation of a system called
MARS (Medical Article Records System). MARS was developed by the
Communications Engineering Branch of the Lister Hill National Center
for Biomedical Communications, an R&D division of the U.S. National
Library of Medicine. MARS
combines scanning, optical character recognition (OCR), and document
image analysis techniques to automatically extract bibliographic
data from paper-based biomedical journals to populate the Library's
flagship database, MEDLINE®, used worldwide by biomedical researchers
and clinicians.
This data includes document images and OCR-converted and operator-verified
data at the page, zone, line, word, and character levels. In addition
to providing a public site for researchers worldwide to develop
and test their algorithms, we offer a tool to graphically visualize
the ground truth data and conduct comparisons and analysis. Code-named
Rover (gROundtruth Vs. Engineered Results),
this automated analysis assistant will compare the results of a
researcher's program to the ground truth data. The ground truth
and results data are in XML, and Rover is written in Java.
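The kind of comparison described above can be illustrated with a minimal sketch. This is not Rover itself (Rover is written in Java), and it does not use the actual ground truth schema: the element name `Zone`, the attributes `label`, `x`, `y`, `w`, `h`, and the sample values are all hypothetical, chosen only to show how zone-level ground truth and a program's results might be compared once both are in XML.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data: element and attribute names are illustrative
# assumptions, not the real ground truth format.
GROUND_TRUTH = """<Page>
  <Zone label="Title" x="100" y="50" w="400" h="60">Sample article title</Zone>
  <Zone label="Author" x="100" y="120" w="300" h="30">A. Author</Zone>
</Page>"""

RESULT = """<Page>
  <Zone label="Title" x="100" y="50" w="400" h="30">Sample article title</Zone>
  <Zone label="Author" x="100" y="120" w="300" h="30">A. Auth0r</Zone>
</Page>"""


def load_zones(xml_text):
    """Map each zone label to ((x, y, w, h), text) for one page."""
    zones = {}
    for zone in ET.fromstring(xml_text).iter("Zone"):
        box = tuple(int(zone.get(key)) for key in ("x", "y", "w", "h"))
        zones[zone.get("label")] = (box, (zone.text or "").strip())
    return zones


def overlap_ratio(a, b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    inter_w = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union else 0.0


def compare(ground_truth_xml, result_xml):
    """For each ground truth zone, report (box overlap, exact text match)."""
    truth = load_zones(ground_truth_xml)
    found = load_zones(result_xml)
    report = {}
    for label, (truth_box, truth_text) in truth.items():
        if label not in found:
            # The program under test did not locate this zone at all.
            report[label] = (0.0, False)
            continue
        box, text = found[label]
        report[label] = (overlap_ratio(truth_box, box), text == truth_text)
    return report


if __name__ == "__main__":
    for label, (overlap, same_text) in compare(GROUND_TRUTH, RESULT).items():
        print(label, overlap, same_text)
```

In this made-up sample, the Title zone is located with only half the true height but correct text, while the Author zone is located exactly but contains an OCR-style substitution. A real evaluation would also descend to the line, word, and character levels.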