Toward a Metadata
Standard for Digitized Historical Newspapers
Ray L. Murray
Library of Congress
101 Independence Ave SE
Washington, DC
+1 (202) 707-6080
ramu@loc.gov
ABSTRACT
This paper is a case study of metadata development in the early
stages of the National Digital Newspaper Program, a twenty-year
digital initiative to expand access to historical newspapers
in support of research and education. Some of the issues involved
in newspaper metadata are examined, and a new XML-based standard
is described that suits the large volume of data while remaining
flexible enough to evolve in the future.
Categories and Subject Descriptors
H.3.7 [Information Systems]: Information Storage and Retrieval -- Digital
Libraries -- Collection, Standards.
General Terms
Design, Experimentation, Standardization.
Keywords
Historical Newspapers, Digitization, National Digital Newspaper
Program.
1. INTRODUCTION
On March 31, 2004, the Library of Congress and the National Endowment
for the Humanities signed an agreement to jointly launch the
National Digital Newspaper Program (NDNP), a twenty-year initiative
with the goal of creating an online resource for research on historical
newspapers. It will provide access to bibliographic records describing
every title published in the United States, from 1690 to the
present. State programs will receive awards to digitize, primarily
from microfilm, selected local newspapers. This national online
resource will allow full-text searching of these titles as they
are added, with an eventual aim of tens of millions of digitized
pages [1].
2. NEED FOR STRUCTURAL METADATA
Gaining intellectual control over this large number of items, in
order to sustain and provide access in a national system, required
a considered metadata design. Previously, there was no universally
accepted metadata standard for historical newspapers. Online
historical newspapers produced by the public and private sectors
often existed as discrete systems, their metadata structures
not designed for interoperability with other systems. To coordinate
materials from fifty state institutions, a unified standard was
needed.
Conceptually, a newspaper manifests itself in different forms.
A newspaper can appear as a sequence of pages on a microfilm reel;
certain sequences of pages represent original issues; and a sequence
of issues may share the same title. Ideally, the metadata system
should proceed from the physical object, the type of manifestation,
and preserve information about the provenance and original order
of the historical materials [2]. Titles, issues, pages and reels
all need to be represented as different yet related classes of
objects in the metadata system.
The interrelationships between these classes of physical objects
are not simple. Technical resolution targets are associated with
a given microfilm reel, but not with a particular newspaper issue. Page
images are associated with both a reel and a parent issue. A
reel may contain issues from multiple titles, or a title may exist
across many reels. Each page will have multiple surrogates: the
scanned TIFF file, a service image JPEG2000 or PDF, and the optical
character recognition (OCR) text for the page.
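The relationships described above can be illustrated with a small data model. This is a minimal sketch, not the NDNP design: the class names, field names and the two helper functions are hypothetical and chosen only to show that a page points to both a reel and an issue, that resolution targets hang off the reel, and that the title-to-reel relationship is many-to-many.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Set


@dataclass
class Title:
    title_id: str
    name: str


@dataclass
class Issue:
    issue_id: str
    title_id: str  # parent title
    page_ids: List[str] = field(default_factory=list)


@dataclass
class Reel:
    reel_id: str
    # Resolution targets belong to the reel, not to any one issue.
    resolution_targets: List[str] = field(default_factory=list)


@dataclass
class Page:
    page_id: str
    issue_id: str  # parent issue
    reel_id: str   # reel the page was scanned from
    # Surrogates keyed by role (hypothetical keys): "master" for the TIFF,
    # "service" for the JPEG2000 or PDF, "ocr" for the OCR text.
    surrogates: Dict[str, str] = field(default_factory=dict)


def titles_on_reel(reel_id: str, pages: Iterable[Page], issues: Iterable[Issue]) -> Set[str]:
    """Return the ids of all titles with at least one page on the reel."""
    title_of = {issue.issue_id: issue.title_id for issue in issues}
    return {title_of[p.issue_id] for p in pages if p.reel_id == reel_id}


def reels_for_title(title_id: str, pages: Iterable[Page], issues: Iterable[Issue]) -> Set[str]:
    """Return the ids of all reels that hold pages of the title."""
    issue_ids = {i.issue_id for i in issues if i.title_id == title_id}
    return {p.reel_id for p in pages if p.issue_id in issue_ids}
```

Neither lookup can be answered from the title, issue or reel records alone; both pass through the page, which is why the page must carry both links.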
3. METS SOLUTION
To handle the complex links between these compound objects, the
NDNP Technical Development Team developed a solution conforming
to the Metadata Encoding and Transmission Standard (METS). METS
is an XML document format designed to handle complex objects,
and to facilitate management of objects within a repository,
or between repositories [3]. The development team designed separate
METS document templates for the following classes of objects:
titles, issues and reels.
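As an illustration of the template idea, the sketch below creates an empty METS root for each of the three object classes using Python's standard library. The METS namespace and the element and attribute names (mets, structMap, div, TYPE, OBJID) come from the METS schema; the function name, the TYPE values and the identifier format are assumptions made for this example, not the NDNP templates themselves, and the output is not validated against the schema.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
ET.register_namespace("mets", METS_NS)

# The three classes of METS document named in the text.
OBJECT_CLASSES = ("title", "issue", "reel")


def new_mets_document(object_class: str, object_id: str) -> ET.Element:
    """Return a skeleton METS root for one title, issue or reel.

    The TYPE values used here are hypothetical labels for this sketch.
    """
    if object_class not in OBJECT_CLASSES:
        raise ValueError(f"unsupported object class: {object_class!r}")
    root = ET.Element(
        f"{{{METS_NS}}}mets", {"TYPE": object_class, "OBJID": object_id}
    )
    # METS expects at least one structural map with a root div.
    struct_map = ET.SubElement(root, f"{{{METS_NS}}}structMap")
    ET.SubElement(struct_map, f"{{{METS_NS}}}div", {"TYPE": object_class})
    return root
```

Calling `ET.tostring(new_mets_document("reel", "reel-0001"), encoding="unicode")` serializes the skeleton for inspection.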
4. METS: TITLE DOCUMENT
Metadata at the title level already exists for most NDNP titles.
For over twenty years the United States Newspaper Program (USNP)
sponsored creation of bibliographic records for newspapers published
in the United States. This NEH-funded work included cataloging
information and location of holdings, standardized in the MARC
format. Records for 140,000 titles along with their 450,000 holdings
records will be incorporated into the NDNP system [4], allowing
users to locate historical newspapers in all formats: digital,
microfilm or the original paper. The title METS document brings
together bibliographic and holdings data in a single title record,
after the data has been transformed losslessly from MARC to MARC XML format.
Titles that are digitized will have additional data -- descriptive
essays, more precise geographic coverage data -- included
in the title records. This new data takes the form of a Metadata
Object Description Schema (MODS) object within the larger METS
document [5].
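One way to picture the title document is as a METS wrapper with one descriptive section per embedded record. The sketch below assumes the MARC XML records have already been produced elsewhere and are passed in as elements. The namespaces and the dmdSec, mdWrap and xmlData elements are standard METS, MARC XML and MODS; the section identifiers, the function names and the choice of a MODS note element for the essay are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Optional, Sequence

METS_NS = "http://www.loc.gov/METS/"
MODS_NS = "http://www.loc.gov/mods/v3"
MARC_NS = "http://www.loc.gov/MARC21/slim"


def _wrap(root: ET.Element, section_id: str, md_type: str, payload: ET.Element) -> None:
    """Embed one metadata record in its own METS descriptive section."""
    dmd = ET.SubElement(root, f"{{{METS_NS}}}dmdSec", {"ID": section_id})
    md_wrap = ET.SubElement(dmd, f"{{{METS_NS}}}mdWrap", {"MDTYPE": md_type})
    ET.SubElement(md_wrap, f"{{{METS_NS}}}xmlData").append(payload)


def build_title_document(
    bib_record: ET.Element,
    holdings_records: Sequence[ET.Element] = (),
    essay: Optional[str] = None,
) -> ET.Element:
    """Combine bibliographic, holdings and (optionally) MODS data."""
    root = ET.Element(f"{{{METS_NS}}}mets", {"TYPE": "title"})
    _wrap(root, "dmd-bib", "MARC", bib_record)
    for number, record in enumerate(holdings_records, start=1):
        _wrap(root, f"dmd-hold-{number}", "MARC", record)
    if essay is not None:
        # Only digitized titles carry the additional MODS section.
        mods = ET.Element(f"{{{MODS_NS}}}mods")
        note = ET.SubElement(mods, f"{{{MODS_NS}}}note", {"type": "essay"})
        note.text = essay
        _wrap(root, "dmd-mods", "MODS", mods)
    return root


# Example: empty MARC XML records stand in for converted USNP records.
bib = ET.Element(f"{{{MARC_NS}}}record")
holding = ET.Element(f"{{{MARC_NS}}}record")
title_doc = build_title_document(bib, [holding], essay="Placeholder essay text.")
```

Keeping the MARC XML and MODS payloads in separate sections leaves the converted USNP records untouched when new data is added for a digitized title.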
5. METS: ISSUE DOCUMENT
The issue/edition information serves as an intermediary level of
object, between page and title levels. It includes information
about which pages belong to it, and to which title it belongs.
The model allows for multiple editions on the same date, distinguishing
one from another with an "edition order" data element.
An "issue present" indicator allows records to be created for
issues known to exist but unavailable for digitization. This
retains the collation work that often appears in the
form of an "issue missing" frame on microfilm.
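The two data elements just described can be sketched as fields on an issue record. The field and function names below are hypothetical; the sketch only shows how an edition order separates editions published on the same date and how an issue-present flag keeps a record for an issue that could not be digitized.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, List


@dataclass(frozen=True)
class IssueRecord:
    issue_date: date
    edition_order: int    # 1 for the first edition of the day, 2 for the next
    present: bool = True  # False: known to exist but not available to digitize


def in_publication_order(records: Iterable[IssueRecord]) -> List[IssueRecord]:
    """Sort by date, then by edition order within a date."""
    return sorted(records, key=lambda r: (r.issue_date, r.edition_order))


def missing_issues(records: Iterable[IssueRecord]) -> List[IssueRecord]:
    """Return the placeholder records, in publication order."""
    return [r for r in in_publication_order(records) if not r.present]
```

Because a missing issue is still a record, a gap in the run is recorded explicitly rather than inferred from absent dates.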
5.1 The Page Object
The page is the fundamental unit, the atom of the structural metadata.
Metadata must exist down to the page level, to be able to associate
and order files of pages within an issue. The page is also natural
as the smallest object to track with full structural metadata.
The two dimensional layout of the page carries editorial information
about the relative importance of the items on the page, and best
replicates the way the page was perceived by its original readers.
A sub-page-level metadata system could work, but, as with
a physical page carved up into clippings, each data item would
lose the contextual information carried in the original two-dimensional
order of presentation.
Page-level metadata was defined robustly enough to record
information for missing pages, to accommodate pages of the same issue
digitized from different holdings, and to preserve original order for
unnumbered pages and pages in multi-section newspapers. For simplicity,
individual page information was rolled up into the parent issue/edition document.
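Rolling page information into the issue document might look like the following sketch, which builds a METS structural map with one div per page. The structMap, div and fptr elements and the ORDER, ORDERLABEL and FILEID attributes are standard METS; representing a missing page as a div with no file pointers, and the TYPE values, are assumptions made for this example.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional, Sequence, Tuple

METS_NS = "http://www.loc.gov/METS/"

# One entry per page, in original order:
# (printed page label, or None if unnumbered; file ids, empty if page is missing)
PageEntry = Tuple[Optional[str], Sequence[str]]


def build_issue_struct_map(pages: List[PageEntry]) -> ET.Element:
    """Build a structMap whose page divs preserve the original sequence."""
    struct_map = ET.Element(f"{{{METS_NS}}}structMap")
    issue_div = ET.SubElement(struct_map, f"{{{METS_NS}}}div", {"TYPE": "issue"})
    for position, (label, file_ids) in enumerate(pages, start=1):
        # ORDER records the physical sequence even when no number is printed.
        attrs = {"TYPE": "page", "ORDER": str(position)}
        if label is not None:
            attrs["ORDERLABEL"] = label
        page_div = ET.SubElement(issue_div, f"{{{METS_NS}}}div", attrs)
        for file_id in file_ids:
            ET.SubElement(page_div, f"{{{METS_NS}}}fptr", {"FILEID": file_id})
    return struct_map
```

A label such as "B1" can carry section-and-page numbering for multi-section papers, while ORDER alone fixes the sequence.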
6. METS: REEL DOCUMENT
Digitizing from microfilm is an efficient way to capture a high
volume of data. Although in the end what is created is a digital
image of the original page, the characteristics of the intermediary
medium of the film should not be ignored. The content will have
been transferred three times: once to the film, once to the print
negative and once when being digitized. Administrative metadata
can help trace effects of that process on the final product.
The reel document will capture metadata on whether the paper
was filmed from loose leaves or bound volumes, the camera’s
effective reduction ratio, resolution quality of the film and
photographic emulsion density. This will allow study of whether
these characteristics impact the quality of the end product,
especially OCR accuracy.
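The kind of study this metadata enables can be sketched as a simple grouping of OCR accuracy by one reel characteristic. The record fields mirror the four characteristics named above, but their names, types and units are hypothetical, as is the accuracy measure, which is assumed to be supplied per reel by some separate evaluation.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean
from typing import Dict, Hashable, Iterable


@dataclass
class ReelTechnicalMetadata:
    reel_id: str
    filmed_from: str           # "loose leaves" or "bound volumes"
    reduction_ratio: float     # camera's effective reduction ratio
    resolution_quality: float  # resolution reading taken from the film
    emulsion_density: float    # photographic emulsion density


def mean_accuracy_by(
    reels: Iterable[ReelTechnicalMetadata],
    ocr_accuracy: Dict[str, float],
    characteristic: str,
) -> Dict[Hashable, float]:
    """Average per-reel OCR accuracy, grouped by one characteristic.

    Reels with no accuracy figure are skipped.
    """
    grouped = defaultdict(list)
    for reel in reels:
        if reel.reel_id in ocr_accuracy:
            key = getattr(reel, characteristic)
            grouped[key].append(ocr_accuracy[reel.reel_id])
    return {key: mean(values) for key, values in grouped.items()}
```

A call such as `mean_accuracy_by(reels, accuracy, "filmed_from")` would compare reels filmed from loose leaves with those filmed from bound volumes.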
7. CONCLUSION
The METS approach to NDNP metadata allows for easier exchange of
data between state institutions and the Library of Congress,
where the national repository will reside. Because METS is an open
standard, states can more easily reuse resources created for
NDNP. It also provides the flexibility needed for future evolution.
As technical capabilities and user expectations change, these
XML data objects can be changed as well. For now, the current
approach is workable, robust and builds on the knowledge gained
through the METS and USNP initiatives.
8. REFERENCES
[1] Cole, B. The National Digital Newspaper Program. Organization
of American Historians Newsletter 32 (May 2004).
[2] Miller, F. Arranging and Describing Archives and Manuscripts.
Society of American Archivists, Chicago, IL, 1990.
[3] Metadata Encoding and Transmission Standard (METS) Official
Web Site. http://www.loc.gov/standards/mets/.
[4] Library of Congress. National Digital Newspaper Program. http://www.loc.gov/ndnp/.
[5] Metadata Object Description Schema (MODS) Official Web Site. http://www.loc.gov/standards/mods/.