Layout types
The page layouts selected for this ground truth collection represent
typical biomedical journals from which we automatically extracted bibliographic
fields (e.g., article Title, Authors, Affiliation, Abstract). We have
classified the page layouts into nine types. These types are graphically
represented in the sidebar and described below.
Note that we have zoned and labeled only the title, author, affiliation
and abstract; no other data is zoned or labeled. We hope to provide other
zoned data in the future. The data extracted from MARS includes only the
fields that were created by our automated processes.
Type A
The article Title, Author(s), Affiliation and Abstract appear in
the defined order and are located in the upper half of the page.
The lower portion could include the Abstract or the beginning of
the article text. Variations include: single column (as shown in
sidebar); two single columns (large white space separating); single
column followed by double column (large white space separating
single from double); single, double, single column; and single, double,
double column.
Type B
The Title, Author and Abstract are in the upper portion of the page. The
Affiliation is located at the bottom.
Type C
The Title, Author and Abstract are in the upper half of the page.
The Affiliation is single columned and located in the left column
of double column text. Variations include a body of text below the
double column Affiliation area that contains other data, that is, data
that MEDLINE does not record and that is not classified.
Type D
The Title, Author, and Affiliation are in the upper half of the page.
The Abstract usually is all or some of the first column. Variations
include the Abstract continuing into a portion of column 2.
Type E
The Title, Author, and Affiliation are in the upper half of the page.
The Abstract is double columned, but above the body text of the
article in most cases.
Type F
The Title and Author are in the upper half of the page. The Affiliation
is along the bottom. The Abstract usually is all or some of the first
column. Variations include the Abstract continuing into a portion
of column 2.
Type G
The Title and Author are in the upper half of the page. The Affiliation
is along the bottom-left. The Abstract usually is all or some of the
first column. Variations include the Abstract continuing into a portion
of column 2.
Type H
The Title and Author are in the upper half of the page. The Affiliation
is along the bottom. The Abstract is double columned, but above the
body text of the article in most cases.
Type Other
This category holds all unusual article layouts encountered. About 25%
of the NLM collection falls into this category and has not yet been
categorized. Little internal research has been done on these layouts,
and this would be a good area to improve upon.
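As a rough summary of the named types above, the following Python sketch encodes each of Types A through H by where the Affiliation and the Abstract are located, and falls back to "Other" when nothing matches. The names Layout, LAYOUTS and classify, as well as the position strings, are assumptions made for illustration only; they do not come from the ground truth collection or from MARS. The Title and Author are omitted because every named type places them in the upper part of the page, and the listed variations are not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layout:
    """Hypothetical summary of one layout type (illustrative field names)."""
    affiliation: str  # where the Affiliation sits on the page
    abstract: str     # where the Abstract sits or how it is set


# Simplified reading of the type descriptions; variations are ignored.
LAYOUTS = {
    "A": Layout("upper", "upper"),
    "B": Layout("bottom", "upper"),
    "C": Layout("left-column", "upper"),
    "D": Layout("upper", "first-column"),
    "E": Layout("upper", "double-column"),
    "F": Layout("bottom", "first-column"),
    "G": Layout("bottom-left", "first-column"),
    "H": Layout("bottom", "double-column"),
}


def classify(affiliation: str, abstract: str) -> str:
    """Return the matching type letter, or "Other" if no named type matches."""
    target = Layout(affiliation, abstract)
    for name, layout in LAYOUTS.items():
        if layout == target:
            return name
    return "Other"
```

Under these assumptions, classify("bottom", "double-column") returns "H", and any combination not listed in the table returns "Other".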
NOTE
All Layout Types have been groundtruthed at the ZONE level, including
Label Tags. Line, Word, and Character level information is still being
worked on.
Chart with a distribution of the page layouts in 3,337 journals.
U.S. National Library of Medicine, 8600 Rockville Pike, Bethesda, MD 20894
National Institutes of Health | U.S. Dept. of Health and Human Services