Why?
MARG is a freely available repository of document page images and their associated textual and layout data. The data has been reviewed and corrected to establish "ground truth". Research in document image analysis and understanding is greatly facilitated by such repositories, which support the design, training, and testing of algorithms for data identification and extraction.
What is it?
The data here is ground truth for representative article page images
drawn from the corpus of important biomedical journals. It is thus
most suitable for the development of algorithms to locate and extract
text from the bibliographic fields typical of articles within such
journals. These fields include the article title, author names, institutional
affiliations, abstracts and possibly others. Only the first page of
each article is available, or the second page if the abstract runs
over.
This data was collected during the operation of a system called MARS (Medical Article Records System). MARS was developed by the Communications Engineering Branch of the Lister Hill National Center for Biomedical Communications, an R&D division of the U.S. National Library of Medicine. MARS combines scanning, optical character recognition (OCR), and document image analysis techniques to automatically extract bibliographic data from paper-based biomedical journals to populate the Library's flagship database, MEDLINE®, used worldwide by biomedical researchers and clinicians. The data includes document images and OCR-converted, operator-verified data at the page, zone, line, word, and character levels.
In addition to providing a public site for researchers worldwide to develop and test their algorithms, we offer a tool to graphically visualize the ground truth data and to conduct comparisons and analysis. Code-named Rover (gROundtruth Vs. Engineered Results), this automated analysis assistant compares the results of a researcher's program to the ground truth data. The ground truth and results data are in XML, and Rover is written in Java.
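To illustrate the kind of comparison described above, here is a minimal sketch in Python of checking a program's zone-level output against ground truth when both are stored as XML. It is not Rover and does not reproduce Rover's method. The element and attribute names (`Page`, `Zone`, `id`, `label`, `Text`) and the sample page content are assumptions invented for this example, and the actual MARG XML layout may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical schema and made-up sample content, for illustration only.
# The real MARG ground truth files may use different element names.
GROUND_TRUTH_XML = """
<Page>
  <Zone id="1" label="Title"><Text>Gene expression in yeast</Text></Zone>
  <Zone id="2" label="Author"><Text>J. Smith</Text></Zone>
  <Zone id="3" label="Abstract"><Text>We study expression.</Text></Zone>
</Page>
"""

RESULT_XML = """
<Page>
  <Zone id="1" label="Title"><Text>Gene expression in yeast</Text></Zone>
  <Zone id="2" label="Affiliation"><Text>J. Smith</Text></Zone>
  <Zone id="3" label="Abstract"><Text>We study expresion.</Text></Zone>
</Page>
"""


def load_zones(xml_text):
    """Return {zone id: (label, text)} for every Zone element in the page."""
    root = ET.fromstring(xml_text)
    zones = {}
    for zone in root.iter("Zone"):
        text = zone.findtext("Text", default="").strip()
        zones[zone.get("id")] = (zone.get("label"), text)
    return zones


def compare_zones(truth, result):
    """Compare result zones against ground truth zones, matched by zone id."""
    report = {"label_errors": [], "text_errors": [], "missing": []}
    for zone_id, (true_label, true_text) in truth.items():
        if zone_id not in result:
            # The program produced no zone with this id.
            report["missing"].append(zone_id)
            continue
        label, text = result[zone_id]
        if label != true_label:
            report["label_errors"].append(zone_id)
        if text != true_text:
            report["text_errors"].append(zone_id)
    return report


truth = load_zones(GROUND_TRUTH_XML)
result = load_zones(RESULT_XML)
report = compare_zones(truth, result)
print(report)
```

Matching zones by a shared id keeps the sketch short. A program that performs its own zoning would more likely need to match zones by their position on the page before comparing labels and text.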
Who is it for?
This data is for the computer science and informatics research communities
interested in the development of innovative and efficient algorithms
for automatic zoning (page segmentation), labeling (field identification),
and syntax reformatting to adhere to the conventions of the database
for which the fields are extracted.
All of the data collected for this website was produced by the Medical Article Records System (MARS).
MARS is a production system affiliated with the National Library of Medicine and was developed by the Communications Engineering Branch.
If you have any questions, please contact the MARG admin.
News last updated: 02/28/2005