Ground Truth Data for Biomedical Journals

Why?

MARG is a freely available repository of document page images and their associated textual and layout data. The data has been reviewed and corrected to establish its "ground truth". Research in document image analysis and understanding is greatly facilitated by such repositories for the design, training, and testing of algorithms for data identification and extraction.

What is it?

The data here is ground truth for representative article page images drawn from the corpus of important biomedical journals. It is thus most suitable for the development of algorithms to locate and extract text from the bibliographic fields typical of articles within such journals. These fields include the article title, author names, institutional affiliations, abstracts and possibly others. Only the first page of each article is available, or the second page if the abstract runs over.

This data was collected during the operation of a system called MARS (for Medical Article Records System). MARS was developed by the Communications Engineering Branch of the Lister Hill National Center for Biomedical Communications, an R&D division of the U.S. National Library of Medicine. MARS combines scanning, optical character recognition (OCR), and document image analysis techniques to automatically extract bibliographic data from paper-based biomedical journals to populate the Library’s flagship database, MEDLINE®, used worldwide by biomedical researchers and clinicians.

This data includes document images and OCR-converted and operator-verified data at the page, zone, line, word, and character levels. In addition to providing a public site for researchers worldwide to develop and test their algorithms, we offer a tool to graphically visualize the ground truth data and conduct comparisons and analysis. Code-named Rover (gROundtruth Vs. Engineered Results), this automated analysis assistant will compare the results of a researcher’s program to the ground truth data. The ground truth and results data are in XML, and Rover is written in Java.
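As an illustration of what comparing a program's results against ground truth can look like, here is a minimal Python sketch. It is not Rover, and it does not use the actual MARG XML schema: the `Page` and `Zone` element names, the `label`, `x`, `y`, `w`, and `h` attributes, and the choice of intersection-over-union as the score are all assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML layout invented for this sketch; the real MARG schema may differ.
GROUND_TRUTH_XML = """
<Page>
  <Zone label="Title" x="100" y="50" w="400" h="60"/>
  <Zone label="Author" x="100" y="120" w="400" h="30"/>
</Page>
"""

# Hypothetical output of a researcher's zoning program, in the same assumed layout.
RESULT_XML = """
<Page>
  <Zone label="Title" x="100" y="50" w="400" h="30"/>
  <Zone label="Author" x="100" y="120" w="400" h="30"/>
</Page>
"""


def read_zones(xml_text):
    """Return {label: (left, top, right, bottom)} for every Zone element."""
    zones = {}
    for zone in ET.fromstring(xml_text).iter("Zone"):
        x, y = int(zone.get("x")), int(zone.get("y"))
        w, h = int(zone.get("w")), int(zone.get("h"))
        zones[zone.get("label")] = (x, y, x + w, y + h)
    return zones


def overlap_ratio(a, b):
    """Intersection-over-union of two (left, top, right, bottom) boxes."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0


def compare_zones(truth_xml, result_xml):
    """Score each ground truth zone against the result zone with the same label."""
    truth = read_zones(truth_xml)
    result = read_zones(result_xml)
    return {
        label: overlap_ratio(box, result[label]) if label in result else 0.0
        for label, box in truth.items()
    }


scores = compare_zones(GROUND_TRUTH_XML, RESULT_XML)
for label, score in scores.items():
    print(f"{label}: {score:.2f}")
```

In this made-up example, the result's Title zone covers only the top half of the ground truth zone, so it scores lower than the Author zone, which matches exactly. A real comparison would read the zone, line, word, and character data from the ground truth files in whatever structure they actually use.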

Who is it for?

This data is for the computer science and informatics research communities interested in the development of innovative and efficient algorithms for automatic zoning (page segmentation), labeling (field identification), and syntax reformatting to adhere to the conventions of the database for which the fields are extracted.

U.S. National Library of Medicine, 8600 Rockville Pike, Bethesda, MD 20894
National Institutes of Health | U.S. Dept. of Health and Human Services

The MARS System

All of the data collected for this website has been produced by the Medical Article Records System (MARS).

MARS is a production system affiliated with the U.S. National Library of Medicine and was developed by the Communications Engineering Branch.

For more information about the MARS system click here.

If you have any questions, please contact the MARG admin.

News last updated: 02/28/2005