Tool Investigations
The Library, in conjunction with its partners, is developing a common set of Web Capture tools in four areas: curator selection, verification, and permissions; acquisition; collection storage and maintenance; and access.
In several of these areas, open-source software provides the foundation for our work:
- Selection and permissions: The Library is investigating the use of the Web Curator Tool (WCT), an open-source Web harvesting management system released by the National Library of New Zealand in September 2006.
- Acquisition: Our primary acquisition tool is the Heritrix Web crawler (a simplified sketch of the harvesting loop such a crawler performs follows this list).
- Access: We are working with the WERA and Wayback access tools and the NutchWAX search engine.
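Heritrix itself is a Java application configured through crawl jobs; the Python sketch below is not Heritrix code, only a minimal illustration of the basic harvesting loop a crawler of this kind performs: start from seed URLs, fetch each response, extract links, and follow only those that fall within the crawl's scope. The names `harvest` and `LinkExtractor` are hypothetical.

```python
# Illustrative sketch only -- not Heritrix or its configuration.
# A real crawler also honors robots.txt, permissions, and politeness rules.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects href/src attributes so embedded resources are followed too."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)

def harvest(seeds, scope_host, limit=100):
    """Breadth-first crawl restricted to one host (the crawl 'scope')."""
    frontier, seen, records = deque(seeds), set(seeds), []
    while frontier and len(records) < limit:
        url = frontier.popleft()
        try:
            with urlopen(url, timeout=10) as resp:
                body = resp.read()
                # Keep the payload and response headers exactly as delivered.
                records.append({"url": url,
                                "headers": dict(resp.headers),
                                "body": body})
        except OSError:
            continue
        parser = LinkExtractor()
        parser.feed(body.decode("utf-8", errors="replace"))
        for link in parser.links:
            absolute = urljoin(url, link)
            if urlparse(absolute).netloc == scope_host and absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return records
```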
The technical environment for our acquisition, access, and storage is based on open platforms such as Linux.
Guidelines for the Web Harvesting Process
Our goals for capturing, storing, and preserving content during the Web harvest process are to:
- Retrieve all code, images, documents, and other files essential to reproduce the site as completely as possible
- Capture and preserve all technical metadata from the Web server (e.g., HTTP headers) and the crawler (e.g., context of capture, date and time stamp, and crawl conditions). Date information is especially important for distinguishing repeated crawls of the same site.
- Store the content exactly as delivered. HTML and other code are always left intact; we do not alter the stored files, even when modifications would be needed for access.
- Maintain platform and file system independence. We do not use file system features such as naming or timestamps to record technical metadata; each capture record carries its own metadata, as illustrated in the sketch after this list.
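As a concrete illustration of these goals, the sketch below uses a hypothetical, much-simplified record layout (real crawlers write container formats such as ARC/WARC). Each capture is self-describing: the payload is stored byte-for-byte as delivered, and all technical metadata (URL, UTC capture time, HTTP headers) lives inside the record rather than in file names or file-system timestamps.

```python
# Illustrative sketch only; the record layout is hypothetical and far simpler
# than the container formats an actual crawler writes.
import json
from datetime import datetime, timezone

def write_capture(archive, url, http_headers, payload):
    """Append one self-describing capture record to an open binary archive."""
    metadata = {
        "url": url,
        # UTC timestamp distinguishes repeated crawls of the same site.
        "capture_time": datetime.now(timezone.utc).isoformat(),
        "http_headers": http_headers,      # server headers, stored verbatim
        "payload_length": len(payload),
    }
    header_bytes = json.dumps(metadata).encode("utf-8")
    archive.write(len(header_bytes).to_bytes(4, "big"))
    archive.write(header_bytes)
    archive.write(payload)                 # content exactly as delivered, unmodified

# Example: record one response captured by the crawler.
with open("captures.dat", "ab") as archive:
    write_capture(archive, "http://www.example.gov/",
                  {"Content-Type": "text/html"}, b"<html>...</html>")
```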