Sustainability of Digital Formats
 Planning for Library of Congress Collections


Web Sites and Pages >> Quality and Functionality Factors


Scope
This discussion concerns Web sites as they may be collected and archived for research access and long-term preservation. At issue is harvesting sites as they present themselves to users at a particular time.

The formats discussed here are those that might hold the results of a crawl of a Web site or set of Web sites, a dynamic action resulting from the use of a software package (e.g., Heritrix) that calls up Web pages and captures them in the form disseminated to users.

The goal for a Web archiving activity is typically to collect Web pages, each with such embedded resources as images, sounds, and the like, in as complete a manner as possible, and to capture the link structure in a way that allows the researcher to identify what was linked to and, if a linked resource has also been captured, to follow the link to it.

The focus of a Web archiving activity may be guided by the concept of a Web site. The terms Web page and Web site must be understood in a flexible manner. A useful definition for page is provided in Web Archive Metrics: Definitions and Framework (draft, December 2005), prepared by the Library of Congress Web Capture team for the International Internet Preservation Consortium (IIPC): "a page is a set of one or more Web resources expected to be rendered simultaneously, which can be identified by the URI of the item that embeds the other resources in the set." The same document suggests the following definition for site: "an intellectually related set of resources often (but not always) bounded by technical division, such as content from a domain, which may include several related domains, or a subset of content from a host." In practice, the boundaries for a Web site are often hard to define.
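
As a rough illustration of the page definition quoted above, the following Python sketch lists the resources embedded in one HTML document, that is, the set a browser would render together with the embedding page. The class and the sample HTML are hypothetical and are not part of any IIPC or Library of Congress tool.

    # Sketch of the "page = embedding resource plus embedded resources" idea:
    # given the HTML retrieved from one URI, list the other resources a browser
    # would fetch in order to render the page.
    from html.parser import HTMLParser
    from urllib.parse import urljoin

    class EmbeddedResourceFinder(HTMLParser):
        """Collect URLs of resources embedded in a page (images, scripts, stylesheets)."""
        EMBED_ATTRS = {"img": "src", "script": "src", "link": "href", "embed": "src"}

        def __init__(self, base_url):
            super().__init__()
            self.base_url = base_url
            self.resources = []

        def handle_starttag(self, tag, attrs):
            wanted = self.EMBED_ATTRS.get(tag)
            if wanted:
                for name, value in attrs:
                    if name == wanted and value:
                        self.resources.append(urljoin(self.base_url, value))

    html = '<html><body><img src="logo.gif"><script src="menu.js"></script></body></html>'
    finder = EmbeddedResourceFinder("http://example.org/index.html")
    finder.feed(html)
    print(finder.resources)  # the resources expected to be rendered with the embedding page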

For consideration of functionality required for the digital formats used for the captured Web sites, it is useful to provide examples of scenarios and categorizations that have been used to describe Web archiving activities. In Archiving Websites: General Considerations and Strategies, Niels Brugger distinguishes between micro and macro archiving. [1]

Macro archiving
Macro archiving is carried out on a large scale, typically by large institutions, as Brugger suggests "in order to archive (part of) the (inter)national cultural heritage." Members of the IIPC, including the Library of Congress, are primarily engaged in macro archiving. In Web Archive Metrics, a draft prepared for the IIPC, Boyko distinguishes between metrics for internet-based aggregations and for collection-based aggregations of Web pages.[2] The need for two sets of metrics reflects different scenarios for future use of the archived Web pages. Some researchers will want to study the Web as a network, analyzing patterns of links and changes over time. Others will want to locate materials of a particular type (e.g., blogs) or pages devoted to a particular topic.

Some macro archiving is scoped by Internet-based characteristics (e.g., a national domain). In other cases, collection-based aggregations are assembled by purposeful capture of pages on the basis of criteria established by a human curator, perhaps using a pre-determined set of URLs, rules that attempt to limit the degree to which linked resources beyond a site are also captured, and harvesting frequencies based on how often the sites are updated. For example, the Library of Congress has collected Web sites related to particular events, such as elections and disasters, and in 2006 is collecting Web sites of organizations that have donated or committed to donate organizational papers to its Manuscript Division. [See http://www.loc.gov/webcapture/.] Internet-based aggregations may be the result of a less controlled process: taking the URL for a Web page, harvesting it, following the links that can be reached from that page, harvesting those pages in turn, and then following the links on those pages, and so on.
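
The less controlled, link-following process just described can be reduced to a very small sketch. The Python below is illustrative only: a breadth-first frontier with an arbitrary depth limit standing in for real scope rules. Production crawlers such as Heritrix add politeness delays, robots.txt handling, and far more elaborate scoping.

    # Toy breadth-first crawl frontier: visit seed URLs, then the pages they
    # link to, and so on, up to a depth limit. The caller supplies the
    # fetching and link-extraction step.
    from collections import deque

    def crawl_frontier(seed_urls, extract_links, max_depth=2):
        """Yield (url, depth) pairs in the order a simple breadth-first crawl visits them."""
        seen = set(seed_urls)
        queue = deque((url, 0) for url in seed_urls)
        while queue:
            url, depth = queue.popleft()
            yield url, depth
            if depth >= max_depth:
                continue
            for link in extract_links(url):
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))

    # A toy link graph stands in for real fetching and parsing.
    graph = {"a": ["b", "c"], "b": ["c"], "c": []}
    print(list(crawl_frontier(["a"], lambda url: graph.get(url, []))))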

Macro archiving is usually of the open "surface" Web and includes the intent of supporting study of the Web itself, including its link structure and changes or trends over time. This is not simply a matter of archiving the content of selected Web pages and preserving the ability to follow links when the linked pages have also been harvested.

The need for efficiency at scale (for both capture and subsequent processing) is likely to dominate other functionality factors. The ability to reproduce dynamic elements of a Web-based presentation may be considered less significant.

Other functionality factors that may be significant for macro archiving include: the ability to combine and de-duplicate the results of crawls at different times or by different institutions (e.g., different national libraries); the ability to extract subsets; support for very efficient indexing for access by URL and chronology for simulating the original Web experience; support for indexing the full text of pages; and the retention of the original URLs for harvested content and links in order to relate pages and other content objects and to analyze link structures.
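
The last of these factors, retaining original URLs so that link structures can be analyzed, is illustrated by the hypothetical sketch below: given a mapping from each captured page URL to the URLs it links to, it rebuilds the link graph and notes which link targets were themselves captured.

    # Rebuild a simple link graph from crawl results and flag whether each
    # link target is itself in the archive. Input and output structures are
    # illustrative only.
    def build_link_graph(captured_outlinks):
        captured = set(captured_outlinks)
        edges = []  # (source URL, target URL, target was captured)
        for page, outlinks in captured_outlinks.items():
            for target in outlinks:
                edges.append((page, target, target in captured))
        return edges

    crawl = {
        "http://example.org/": ["http://example.org/about", "http://other.example/"],
        "http://example.org/about": ["http://example.org/"],
    }
    for source, target, in_archive in build_link_graph(crawl):
        print(source, "->", target, "(captured)" if in_archive else "(not captured)")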

At this writing, most macro archiving activities use one of two related formats designed for Web archiving at scale: ARC and WARC. ARC was developed by the Internet Archive to support its own work; WARC is a refined and extended format based on ARC that, in 2006, was under consideration as a standard by ISO.
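
To make the discussion of these container formats more concrete, the sketch below assembles a single WARC-style response record by hand: a version line, named header fields, a blank line, and then the captured content block (here, a full HTTP response). The field names follow the WARC specification as later published; the 2006 draft differs in detail, so the sketch is illustrative rather than normative.

    # Build one WARC-style response record as bytes. Illustrative only; real
    # tools handle many more header fields and write records into (usually
    # compressed) container files.
    import uuid
    from datetime import datetime, timezone

    def build_warc_response_record(target_uri, http_response_bytes):
        record_id = f"<urn:uuid:{uuid.uuid4()}>"
        warc_date = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
        header_lines = [
            "WARC/1.0",
            "WARC-Type: response",
            f"WARC-Record-ID: {record_id}",
            f"WARC-Date: {warc_date}",
            f"WARC-Target-URI: {target_uri}",
            "Content-Type: application/http; msgtype=response",
            f"Content-Length: {len(http_response_bytes)}",
        ]
        # Record = header block + blank line + content block + record separator.
        return ("\r\n".join(header_lines) + "\r\n\r\n").encode("utf-8") + http_response_bytes + b"\r\n\r\n"

    http_response = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html>...</html>"
    print(build_warc_response_record("http://example.org/", http_response).decode("utf-8"))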

Micro archiving
The intent of micro archiving is to take a snapshot of a single site for a specific purpose. Several scenarios can lead to this need. A Web site, or a cluster of Web pages, may be considered a single published work, to be collected and treated as a freestanding object. An organization may wish to take a snapshot of its own Web site. A site may be captured with the immediate purpose of using it as an object of study. It is important to note that a site cannot be harvested instantaneously and completely. As a result, dynamic sites may change during the capture process, with the result that the site as harvested never existed at a single point in time.

Brugger's Archiving Websites focuses on micro archiving. In addition to discussing harvesting, he highlights the challenges of "archiving the dynamics of the Internet," including not only the dynamics of updating, but also the experience of dynamic elements embedded in pages, some of which may rely on human interaction. Brugger's team tested nine programs for capturing complete individual Web sites; results are reported at http://www.cfi.au.dk/publikationer/archiving/.

Brugger's team also tested software for recording screen shots or interactive Web browsing activity. Since the resulting content objects are still images or video, the sustainability of capture formats for such recordings is dealt with elsewhere on this site. This type of capture is very labor-intensive.

Out of scope
The Library of Congress Web harvesting activities can primarily be characterized as macro archiving. Hence, this section of the format resource will focus on formats appropriate for Web capture on a large scale. Formats for individual documents (such as still or moving images, sound, or self-contained text documents) that happen to be on the Web, and may be captured using Web-harvesting tools, are discussed in the areas of the site dealing with those content categories.

Also out of scope are:

• Capture of the interactive nature of a Web site by recording screenshots or a video of an interactive session. Formats for still and moving images, some of which may be used for such documentation of interactive site behavior, are dealt with elsewhere on this Web site.

• Capture of the deep Web from digital asset management systems. Many resources made available on the Web, particularly in the deep Web, are generated from content managed in databases. The Web provides an interface, for example to search an encyclopedia or image archive, but the best strategy for preserving the managed content is likely to be archiving it in source form.

• Obtaining a copy of all of the files that may exist in the Web site provider's server system, the files from which the Web presentation is constructed. Many current Web presentations are built on the fly using databases, scripts, and other programs. This underlying data can be used to recreate the experience of the Web site only if many other dependencies are in place, e.g., specific operating systems, Web server software such as the Apache HTTP Server, database management applications, and the like. The aim of Web harvesting is not to be able to rebuild the Web site with all its functionality, but to capture Web pages as the user viewed them, to the extent possible.

Normal rendering for archived Web sites
Normal rendering for archived Web sites is identical to that expected for active Web sites on the Internet: users may read and scroll through text, follow hyperlinks from one page to another, and copy and print. Assuming that the harvesting tool has succeeded in collecting the images, sounds, or other elements embedded in a page, these are also presented or made accessible to users.

Functionality: documentation of harvesting context
Whether an individual Web site is captured for a particular purpose or as part of a large-scale program, the context and circumstances of the capture process must be recorded to enable future analysis or scholarship. For this reason, record-keeping about the date and circumstances of harvesting is important, and preferred archiving formats will store and maintain such documentation. Harvesting detail includes information about how the record was requested (typically an HTTP request) and the response (typically a full HTTP response, including headers and content body).
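
The sketch below, using only the Python standard library, illustrates this record-keeping at its simplest: both sides of the exchange (the request that was sent, and the status line, headers, and body that came back) are kept together with the capture date. It is illustrative only; a production crawler records this information at a lower level and writes it into an archival container format.

    # Fetch one page and keep request, full response, and capture date together.
    import http.client
    from datetime import datetime, timezone

    def capture_exchange(host, path="/"):
        request_headers = {"Host": host, "User-Agent": "example-archiver/0.1"}
        connection = http.client.HTTPConnection(host)
        connection.request("GET", path, headers=request_headers)
        response = connection.getresponse()
        record = {
            "capture-date": datetime.now(timezone.utc).isoformat(),
            "request": {"method": "GET", "path": path, "headers": request_headers},
            "response": {
                "status": response.status,
                "reason": response.reason,
                "headers": response.getheaders(),
                "body": response.read(),
            },
        }
        connection.close()
        return record

    record = capture_exchange("www.loc.gov", "/webcapture/")
    print(record["capture-date"], record["response"]["status"])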

Some related functionality elements are among the goals articulated for the WARC format. The following have been selected from the list in the February 2006 draft ISO specification document for WARC:

• Ability to store both the payload content and control information from mainstream Internet application layer protocols, including HTTP, FTP, NNTP, and SMTP.

• Ability to store arbitrary metadata linked to other stored data (e.g., subject classifier, discovered language, encoding).

• Ability to store all control information from the harvesting protocol (e.g., request headers), not just response information.

Functionality: efficiency at scale
Macro archiving demands simplicity, flexibility, and amenability to efficient processing in the format used to store the harvested Web pages. The format should not require that pages be held in a logical order, since any harvesting process is subject to interruption. It is desirable to have a format that offers flexibility in file size; that is, the format should not have an inherent limit on file size, but should also allow segmentation of large harvested resources, in order to permit a harvesting application to impose a size limit based on the practicalities of technology at the time of harvesting. A format that will support straightforward merging of aggregations of harvested pages is also desirable.
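
The segmentation mentioned above can be sketched in a few lines: an oversized payload is simply chunked, and each segment records its position and the total count so the pieces can be reassembled. WARC addresses the same need with dedicated continuation records, so the structure below is illustrative rather than the format's actual mechanism.

    # Split a harvested payload into ordered segments no larger than a size
    # limit chosen at harvest time.
    def segment_payload(payload, max_segment_size):
        segments = []
        total = (len(payload) + max_segment_size - 1) // max_segment_size or 1
        for index in range(total):
            chunk = payload[index * max_segment_size:(index + 1) * max_segment_size]
            segments.append({
                "segment-number": index + 1,   # 1-based position within the sequence
                "segment-total": total,
                "data": chunk,
            })
        return segments

    parts = segment_payload(b"x" * 2500, max_segment_size=1000)
    print([(p["segment-number"], len(p["data"])) for p in parts])  # [(1, 1000), (2, 1000), (3, 500)]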

Since simulation of the original Web experience in terms of following links found in pages is a part of normal rendering, the format must permit efficient indexing by original URL and the date and time of harvesting.
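
A minimal sketch of such an index follows: captures are keyed by original URL and ordered by harvest time, so that a wayback-style viewer can resolve a link to the capture made closest to, but not after, the date being browsed. The storage location (for example, a file name and byte offset) is whatever the container format provides; the class itself is hypothetical.

    # Index captures by (original URL, harvest timestamp) and look up the
    # latest capture at or before a requested date.
    import bisect
    from collections import defaultdict

    class CaptureIndex:
        def __init__(self):
            self._times = defaultdict(list)      # url -> sorted harvest timestamps
            self._locations = defaultdict(list)  # url -> storage locations, parallel to _times

        def add(self, url, timestamp, location):
            position = bisect.bisect_right(self._times[url], timestamp)
            self._times[url].insert(position, timestamp)
            self._locations[url].insert(position, location)

        def lookup(self, url, timestamp):
            """Return the latest capture of url made at or before the given timestamp."""
            position = bisect.bisect_right(self._times[url], timestamp)
            if position == 0:
                return None
            return self._times[url][position - 1], self._locations[url][position - 1]

    index = CaptureIndex()
    index.add("http://example.org/", "20060301120000", ("crawl-2006-03.arc", 0))
    index.add("http://example.org/", "20060401120000", ("crawl-2006-04.arc", 0))
    print(index.lookup("http://example.org/", "20060315000000"))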

Web sites may be crawled periodically, e.g., once a week or once a month. In many instances, much of the content will be unchanged from the previous crawl. At this time, there are few effective tools for the elimination of duplicate content. Nevertheless, the possibility of avoiding duplication in the future has led specialists in the field to define an action ("duplicate detection event") and to establish a related requirement, i.e., that archiving formats be capable of storing relevant metadata that points to the already stored copy of the duplicated data in another location, e.g., the dataset from a preceding crawl.
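
The duplicate detection idea can be illustrated with a digest-based sketch: when the payload of a newly harvested resource hashes to a value already seen, only a small event record pointing to the earlier capture is kept, and the payload is not stored again. The field names below are illustrative and are not the syntax any archiving format uses for recording such events.

    # Store a resource only once; later identical captures produce a small
    # "duplicate detection event" that refers to the record holding the bytes.
    import hashlib

    class DeduplicatingStore:
        def __init__(self):
            self._seen = {}  # payload digest -> identifier of the record holding the bytes

        def store(self, url, timestamp, payload):
            digest = hashlib.sha1(payload).hexdigest()
            if digest in self._seen:
                return {"type": "duplicate-detection-event", "url": url, "date": timestamp,
                        "digest": digest, "refers-to": self._seen[digest]}
            record_id = f"{url}@{timestamp}"
            self._seen[digest] = record_id
            return {"type": "response", "id": record_id, "url": url, "date": timestamp,
                    "digest": digest, "payload": payload}

    store = DeduplicatingStore()
    store.store("http://example.org/logo.gif", "20060301", b"GIF89a...")
    repeat = store.store("http://example.org/logo.gif", "20060401", b"GIF89a...")
    print(repeat["type"], "->", repeat["refers-to"])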

Among the goals listed in the draft ISO specification document for WARC that emphasize this aspect of functionality are:

• Amenability to efficient processing.

• Ability to store a duplicate detection event linked to other stored data (to reduce storage in the presence of identical or substantially similar resources).

• Support for data compression and maintenance of data record integrity. (A sketch of per-record compression follows this list.)

• Support for deterministic handling of long records (e.g., truncation, segmentation).
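
On the compression point, ARC and WARC files are commonly stored as a series of independently compressed gzip members, one per record, so that a single record can be decompressed from its byte offset without reading the whole file. The standard-library sketch below shows the idea; it is not tied to either format's exact conventions.

    # Append each record as its own gzip member and read one back by offset.
    import gzip
    import zlib

    def append_compressed_record(path, record_bytes):
        """Append one record as an independent gzip member; return its starting byte offset."""
        with open(path, "ab") as archive:
            archive.seek(0, 2)            # position at end of file so the offset is meaningful
            offset = archive.tell()
            archive.write(gzip.compress(record_bytes))
        return offset

    def read_compressed_record(path, offset):
        """Decompress only the gzip member that starts at the given offset."""
        decompressor = zlib.decompressobj(wbits=16 + zlib.MAX_WBITS)  # gzip framing
        content = bytearray()
        with open(path, "rb") as archive:
            archive.seek(offset)
            while not decompressor.eof:
                chunk = archive.read(64 * 1024)
                if not chunk:
                    break
                content += decompressor.decompress(chunk)
        return bytes(content)

    first = append_compressed_record("example-archive.gz", b"record one")
    second = append_compressed_record("example-archive.gz", b"record two")
    print(read_compressed_record("example-archive.gz", second))  # b'record two'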

Functionality: support for stewardship
Experience with the capture of Web sites by archival institutions making a commitment to long-term preservation is growing, for example among the members of the International Internet Preservation Consortium [4]. Functionalities expected by these institutions contributed to the extensions proposed for WARC over the original ARC format. These expectations include support for activities that will enhance access for researchers and for activities that will support future preservation.

In July 2004, the Danish Royal Library, as part of planning for the Danish national Web archiving program (Netarkivet), produced the report, Archive Format and Metadata Requirements [5]. This report recommended extending ARC to allow richer metadata, rather than using an XML-based structure or creating a new format from scratch. This report was influential in the development of WARC.

The ability to record metadata about harvested resources based on analysis of the harvested content can support preservation activities and enhance access. For example, documenting the character encoding or recording whether a harvested file is technically valid might support future preservation activities. Access for researchers could be enhanced by assigning topical subject terms based on textual analysis. The May 2006 report Use Cases for Access to Internet Archives identifies the need for capabilities to extract a subset of a Web archive for a researcher to use for specialized analysis [6]. The ability to assign terms could support subset generation.

The resources harvested from the Web may be in a wide variety of digital formats, some widely used and others relatively obscure. In the future, the format used for some resource (for example an embedded image) may no longer be supported by browsers. It may be appropriate for custodial institutions to transform such images into a supported format and store the transformed images as part of the Web archive.
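
The two stewardship ideas above, metadata derived from analyzing a harvested resource and a format transformation stored alongside the original, can be sketched as records that refer to the identifier of the original capture rather than replacing it. The dictionary layout below is illustrative and is not the WARC record syntax, although WARC defines metadata and conversion record types for these purposes.

    # Records derived from an existing capture, linked to it by identifier.
    def make_metadata_record(original_record_id, discovered):
        """Record analysis results (encoding, validity, subject terms) about an existing capture."""
        return {"type": "metadata", "refers-to": original_record_id, "fields": discovered}

    def make_conversion_record(original_record_id, new_content_type, converted_bytes):
        """Record a migrated copy of an existing capture alongside, not instead of, the original."""
        return {"type": "conversion", "refers-to": original_record_id,
                "content-type": new_content_type, "payload": converted_bytes}

    meta = make_metadata_record("urn:example:record-42",
                                {"charset": "ISO-8859-1", "valid-html": False,
                                 "subjects": ["elections"]})
    migrated = make_conversion_record("urn:example:record-42", "image/png", b"\x89PNG...")
    print(meta["refers-to"], migrated["content-type"])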

The list of goals in the February 2006 draft ISO specification document for WARC includes some features that relate particularly to stewardship:

• Ability to store arbitrary metadata linked to other stored data (e.g., subject classifier, discovered language, encoding).

• Ability to store the results of data transformations linked to other stored data.

• Ability to store globally unique record identifiers.

Notes

1 It is worth noting that the U.S. Copyright Office, part of the Library of Congress, does sometimes receive file sets for Web sites as a part of the creator's copyright registration and deposit process. Although policies have not been established, the writers of this document do not anticipate that these file sets will be selected for the Library's permanent collections. In contrast, harvested Web sites are being added to the collections today; see http://www.loc.gov/webcapture/.

References

1. Brugger, Niels. Archiving Websites: General Considerations and Strategies. Aarhus, Denmark: The Centre for Internet Research, 2005.

2. Boyko, Andrew, and Michael Ashenfelder. Web Archive Metrics: Definitions and Framework (Working draft - IIPC internal review). Washington, DC: Library of Congress, October 2005.

3. Library of Congress. Web Capture. Information for the public. http://www.loc.gov/webcapture/

4. International Internet Preservation Consortium (IIPC). http://netpreserve.org/

5. Christensen, Steen Sloth. Archive Format and Metadata Requirements. Copenhagen, Denmark: Royal Library of Denmark, July 2004. http://netarchive.dk/publikationer/Archival_format_requirements-2004.pdf

6. IIPC. Use Cases for Access to Internet Archives. May 2006. http://netpreserve.org/publications/iipc-r-003.pdf


