
System For Preservation of Electronic Resources (SPER)

Overview

SPER is part of the Digital Preservation Research project at the Lister Hill Center's Communications Engineering Branch. Its main objective is to support the long-term preservation of digitized or born-digital documents at the National Library of Medicine (NLM) in a cost-effective way.

As part of ongoing research, SPER provides a testbed for exploring and experimenting with important digital preservation standards, tools, and techniques. It also comprises a prototype system that performs actual preservation of digital documents in a convenient manner, using selected open-source tools. An important component of SPER's preservation function is the automated extraction of metadata from textual documents using machine learning tools, which significantly lowers the cost of metadata acquisition compared to manual input.

The following sections describe the SPER preservation framework (also called SPER for simplicity) and its automated metadata extraction component.

System Description

SPER provides the ability to extract, review, and validate metadata and to ingest documents in an integrated environment with convenient graphical user interfaces. The system is implemented as a Java-based client-server system, in which the client and the server communicate using the Java RMI protocol. The archiving infrastructure (for building the repository and indexing its contents) is provided by DSpace, an open-source, OAIS-compliant software system. Ingested documents may be searched and retrieved publicly through a Web browser, using simple or advanced queries. A MySQL database is used by SPER to store all ingest and retrieval information.
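
As a rough illustration of this client-server split, the sketch below shows what such an RMI interface might look like. The service and method names here are illustrative assumptions, not the actual SPER API.

    import java.rmi.Remote;
    import java.rmi.RemoteException;
    import java.util.List;
    import java.util.Map;

    // Hypothetical remote interface for the SPER client-server link.
    // All names are illustrative assumptions, not the real SPER API.
    public interface SperService extends Remote {
        // Request automated metadata extraction for a batch of page images.
        Map<String, String> extractMetadata(String batchId, List<String> pagePaths)
                throws RemoteException;

        // Ingest a validated batch into the DSpace-backed archive.
        String ingestBatch(String batchId) throws RemoteException;
    }

    // A client would obtain a stub from the RMI registry, for example:
    //   SperService sper = (SperService) Naming.lookup("rmi://sper-host/SperService");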

The basic architecture of SPER is shown in Figure 1. The SPER Server can communicate with multiple remote clients simultaneously to perform metadata extraction or ingest functions. The automated metadata extraction (AME) capability is implemented as a plug-in service, so different AME tools may be used for different document types (scanned text images, HTML pages, etc.). The AME function is also provided as a stand-alone service for text-based documents. Document Converters are used by SPER to generate derivatives, such as PDF files, from the archived source documents for display purposes.
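
Because AME is a plug-in service keyed by document type, one plausible shape for the plug-in contract is a small interface like the one sketched below. This is an assumption for illustration; SPER's real plug-in interface is not described here.

    import java.nio.file.Path;
    import java.util.List;
    import java.util.Map;

    // Assumed plug-in contract for an AME tool; all names are illustrative.
    public interface AmePlugin {
        // Document types this tool handles, e.g. "tiff" or "html".
        List<String> supportedTypes();

        // Extract field-name -> value pairs from one document's pages.
        Map<String, String> extract(List<Path> pages);
    }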


Figure 1 - SPER system architecture


Workflow

SPER supports a batch-based processing scheme, in which a set of document pages from a collection is grouped into a batch and submitted remotely, either for metadata extraction or for ingest into the archive, using the SPER Client GUI. In the case of digitized documents, the batch comprises scanned TIFF images of the original documents or their reproduction copies. The SPER Server copies the specified pages from the client's node and submits them to the AME module for metadata extraction. The extracted results are then presented to the operator for review and editing as needed, and the validated metadata is stored at the server node. A validated batch may then be submitted for ingest into the SPER archive. Derivatives useful for viewing a target document, such as PDF and text files, are created automatically by SPER from the text documents.
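
Read as a lifecycle, the workflow above moves each batch through four stages. The enum below is an assumed summary of that progression, not SPER's actual internal representation.

    // Assumed batch lifecycle, inferred from the workflow described above.
    public enum BatchState {
        SUBMITTED,   // pages copied from the client node to the server
        EXTRACTED,   // the AME module has produced candidate metadata
        VALIDATED,   // an operator has reviewed and corrected the metadata
        INGESTED     // batch and derivatives are stored in the archive
    }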

SPER customizes the DSpace (v1.4) software in a number of ways to support preservation operations at NLM. The highest-level entity in the SPER archive is called a Repository (equivalent to a DSpace Community), which is composed of one or more Collections. All collections in a repository share similar properties and the same metadata schema. SPER can support metadata schemas other than Dublin Core for browse and search.
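
DSpace identifies each metadata field by a schema, an element, and an optional qualifier (e.g. dc.title), which is what makes non-Dublin-Core schemas possible. The sketch below shows one way such a field could be modeled; the custom "fdanj" schema name is a hypothetical example, not a schema the project necessarily defines.

    // A metadata field as schema.element[.qualifier], following the DSpace
    // convention; the "fdanj" schema below is a hypothetical example.
    public class MetadataField {
        public final String schema;     // e.g. "dc", or a custom schema such as "fdanj"
        public final String element;    // e.g. "title", "njnumber"
        public final String qualifier;  // optional; may be null

        public MetadataField(String schema, String element, String qualifier) {
            this.schema = schema;
            this.element = element;
            this.qualifier = qualifier;
        }

        @Override public String toString() {
            return qualifier == null ? schema + "." + element
                                     : schema + "." + element + "." + qualifier;
        }
    }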

Automated Metadata Extraction

The automated metadata extraction system identifies and extracts embedded descriptive metadata from digitized documents with structured or semi-structured text layouts, applying machine learning and pattern search techniques.


Figure 2 - Automated metadata extraction in SPER


Prior to any metadata extraction from a document collection, the layouts of its text pages are analyzed, and corresponding recognition models are developed by the Layout Recognition Model Trainer module. The model parameters are trained, using a combination of two learning models (a Support Vector Machine and a Hidden Markov Model), on the OCR output of a training set for each layout. SPER uses FineReader 8.0 as the OCR engine to obtain character-level features in the OCR output.
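
To make the training input concrete, each OCR zone or line can be summarized as a feature record whose numeric form feeds the SVM, while the HMM models the sequence of labels down the page. The fields below are assumptions for illustration; the actual feature set used by SPER's trainer may differ.

    // Assumed zone-level features drawn from the OCR output; the actual
    // feature set used by SPER's trainer may differ.
    public class OcrFeature {
        public final String text;      // recognized text of the zone or line
        public final double fontSize;  // point size reported by the OCR engine
        public final boolean bold;     // style flag from the OCR engine
        public final double x, y;      // position of the zone on the page
        public final String label;     // training label, e.g. "TITLE", "DATE", "BODY"

        public OcrFeature(String text, double fontSize, boolean bold,
                          double x, double y, String label) {
            this.text = text;
            this.fontSize = fontSize;
            this.bold = bold;
            this.x = x;
            this.y = y;
            this.label = label;
        }
    }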

During operation, the Metadata Extractor module uses the layout recognition models to determine the structure of individual pages in a submitted batch and to identify the boundaries and text segments of individual items (an abstract, a pamphlet, or an official Government record) in the batch. This text is then processed by the Metadata Search Engine, which identifies and extracts the embedded metadata for each item by applying appropriate string search patterns for the metadata fields.
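
The string pattern search can be illustrated with ordinary Java regular expressions. The patterns below are hypothetical, loosely modeled on notice-of-judgment headers; they are not the rules SPER actually uses.

    import java.util.LinkedHashMap;
    import java.util.Map;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Illustrative string-pattern search over an item's OCR text;
    // the patterns are hypothetical, not SPER's actual rules.
    public class MetadataSearch {
        // Example: an NJ number heading such as "12345. Adulteration of ..."
        private static final Pattern NJ_NUMBER =
                Pattern.compile("^(\\d{1,6})\\.\\s", Pattern.MULTILINE);
        // Example: a filing date such as "Filed June 3, 1938"
        private static final Pattern FILED_DATE =
                Pattern.compile("Filed\\s+(\\w+\\s+\\d{1,2},\\s+\\d{4})");

        public static Map<String, String> extract(String itemText) {
            Map<String, String> fields = new LinkedHashMap<>();
            Matcher m = NJ_NUMBER.matcher(itemText);
            if (m.find()) fields.put("nj_number", m.group(1));
            m = FILED_DATE.matcher(itemText);
            if (m.find()) fields.put("date_filed", m.group(1));
            return fields;
        }
    }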

The metadata thus extracted for individual items is then sent to the SPER Server for formatting and storage. When running in stand-alone mode, the AME system can also save the extracted metadata locally for review, editing, and generation of ground truth.

Archival of the FDANJ Collection

SPER has been used for end-to-end archiving of a set of historic medico-legal documents acquired by NLM from the Food and Drug Administration. This collection of more than 40,000 digitized pages, referred to as the FDA Notices of Judgment (FDANJ) collection, consists of about 70,000 published notices of judgment (NJs) from court cases involving products seized under the authority of the 1906 Pure Food and Drug Act and the 1938 Food, Drug, and Cosmetic Act.

Using AME, individual court cases (NJs) in the collection were identified, and the corresponding metadata elements were extracted and stored. The source pages and metadata for each court case were then ingested into the SPER archive and indexed for full-text and metadata field-based search. SPER customized and enhanced the search and retrieval capability originally provided by DSpace, including browsing on FDANJ-specific metadata fields.

The archived FDANJ collection is now publicly available at http://archive.nlm.nih.gov/fdanj.

Project Papers

[CONFERENCE]
Misra D, Chen S, Thoma GR.
A System for Automated Extraction of Metadata from Scanned Documents using Layout Recognition and String Pattern Search Models.
Proc. IS&T Archiving 2009. Arlington, Virginia. May 2009 (in press).


[CONFERENCE]
Misra D, Mao S, Rees J, Thoma GR.
Archiving a Historic Medico-legal Collection: Automation and Workflow Customization.
Proc. IS&T Archiving 2007. Arlington, Virginia. May 2007:157-61.


[CONFERENCE]
Thoma GR, Mao S, Misra D, Rees J.
Design of a Digital Library for Early 20th Century Medico-legal Documents.
Proc. ECDL 2006. Eds: Gonzalo J et al. Berlin: Springer-Verlag; LNCS. September 2006;4172:147-57.


[CONFERENCE]
Thoma GR, Mao S, Misra D.
Automated metadata extraction to preserve the digital contents of biomedical collections.
Proc. 5th IASTED International Conference on Visualization, Imaging and Image Processing (VIIP 2005). Benidorm, Spain. September 2005:214-19.


[CONFERENCE]
Mao S, Misra D, Seamans J, Thoma GR.
Design Strategies for a Prototype Electronic Preservation System for Biomedical Documents.
Proc. IS&T Archiving 2005. Washington, D.C. April 2005:48-53.


[CONFERENCE]
Misra D, Seamans J, Thoma GR.
Testing the Scalability of a DSpace-based Archive.
Proc. IS&T Archiving 2008. Bern, Switzerland. June 2008:36-40.

 
