Ground Truth Introduction

MARS II has processed more than 200,000 citations. Somewhere in that collection is the perfect set of examples to use as ground truth data. While a perfect set is not realistic, the goal is to provide as thorough a sample of articles as possible. The MARS system currently categorizes journals into major layout types. The majority of journals in the entire NLM collection are represented in the ground truth data available here.

What format are the files in?

After identifying a number of journal titles to use for ground truth data, we needed a common format that could support the community's needs. XML was chosen as the format for the ground truth (GT) data because it excels in:

· Intelligence - How well data knows itself
· Adaptation - How well data changes to changing times
· Maintenance - How easily data is cared for
· Linking - Letting one piece of data carry you to another
· Simplicity - Easy to learn
· Portability - Portable over networks, operating systems and development environments.

Given that XML is growing in popularity and that its strengths suit our structured data, XML is the format of choice.

How are the files stored?

All the ground truth files are stored in subdirectories that match the layout used by the MARS system. Each journal issue is assigned a unique number, called an MRI, and a directory named after that MRI has been created for it. Inside this directory are the individual scanned pages (TIFF files) and the corresponding ground truth files (XML files). We took a sample of one to five pages from specific journal issues to match the different layout types.
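The directory layout described above can be walked with a short script. The following is only a minimal sketch: it assumes that a scanned page and its ground truth file share the same base name and differ only in extension (.tif or .tiff versus .xml), which this page does not state explicitly. The function name pair_pages is hypothetical.

```python
from pathlib import Path


def pair_pages(mri_dir):
    """Pair scanned pages with ground truth files inside one MRI directory.

    Assumption (not confirmed by the source): a page and its ground truth
    file share the same file name stem, e.g. page1.tif and page1.xml.
    """
    mri_dir = Path(mri_dir)
    tiffs = {}
    xmls = {}
    for path in mri_dir.iterdir():
        suffix = path.suffix.lower()
        if suffix in (".tif", ".tiff"):
            tiffs[path.stem] = path
        elif suffix == ".xml":
            xmls[path.stem] = path

    # Stems present on both sides form (TIFF, XML) pairs.
    pairs = [(tiffs[stem], xmls[stem]) for stem in sorted(tiffs.keys() & xmls.keys())]
    # Stems present on only one side are reported so gaps are visible.
    unmatched = sorted(tiffs.keys() ^ xmls.keys())
    return pairs, unmatched
```

Calling pair_pages on an unzipped MRI directory returns the matched (TIFF, XML) paths plus the base names that have no counterpart, which is a quick way to check that a download is complete.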

For convenience, we provide links to zip files that contain either a sample of each type or the entire collection for a particular layout type. To download the entire ground truth collection, use the "download all" link below. To download a specific layout type, follow the links below or use the menu system on the left.

Introduction Definitions Download_All Type_A Type_B Type_C Type_D Type_E Type_F Type_G Type_H Type_Other

U.S. National Library of Medicine, 8600 Rockville Pike, Bethesda, MD 20894
National Institutes of Health | U.S. Dept. of Health and Human Services

XML Format

The file TrueViz.DTD contains the data definition, but many of the fields it lists are not used in our initial version of the GT data. If a field is unused but required, we fill it with the keyword Unknown. If an unused field is not required, it is simply omitted. You can download the DTD to create your own XML data to run in Rover.
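Because required but unused fields carry the keyword Unknown, it can help to list them before relying on a ground truth file. The sketch below uses Python's standard xml.etree.ElementTree module. It makes several assumptions: the element and attribute names in SAMPLE are invented for illustration and are not taken from TrueViz.DTD, and since this page does not say whether Unknown appears as element text or as an attribute value, the function checks both.

```python
import xml.etree.ElementTree as ET


def find_unknown_fields(root):
    """Return (tag, attribute) pairs whose value is the placeholder 'Unknown'.

    The attribute slot is None when the element's own text is 'Unknown'.
    """
    hits = []
    for elem in root.iter():
        # Element text such as <Field>Unknown</Field>
        if (elem.text or "").strip() == "Unknown":
            hits.append((elem.tag, None))
        # Attribute values such as <Field Name="Unknown"/>
        for name, value in elem.attrib.items():
            if value.strip() == "Unknown":
                hits.append((elem.tag, name))
    return hits


# Hypothetical document: these names are NOT from TrueViz.DTD.
SAMPLE = (
    "<Page>"
    "<Creator>Unknown</Creator>"
    '<Zone Font="Unknown"><Text>Abstract</Text></Zone>'
    "</Page>"
)

if __name__ == "__main__":
    print(find_unknown_fields(ET.fromstring(SAMPLE)))
```

For a real file, ET.parse(path).getroot() can replace ET.fromstring(SAMPLE). Note that ElementTree does not validate against the DTD, so this only reports placeholders and does not check that a file conforms to TrueViz.DTD.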