Web Archive Service Models and Metrics

Web Archive Service Models

Providers of web archives are an emerging class of content suppliers for which the business cases and service models have yet to be developed. Web archives are essentially repositories of born-digital and digitized resources. In terms of current collection management practices in libraries, web archives are most akin to the broad category of electronic resources and to the specific material types of electronic journals and databases. It is largely current collection management practices and librarian experiences in the electronic resources arena, particularly selection and retention practices, that may best inform the development of models and usage statistics for Web archives.

Fundamentally, we have found that current service and acquisition models for electronic resources provide guidance for extending these models to include web archives. Further, it appears that networked access to shared content, rather than library ownership of web archive content, will predominate as the acquisition model.

Networked Access Model
In a networked access model, a library provides discovery services to users, via its catalog or web pages, and network services for resources that are served by another entity. This model applies both to licensed and freely-available materials.
Ownership Model
In an ownership model, a library acquires materials and typically provides services such as storage, maintenance, discovery, access, and metrics. These services may be provided to users who are associated with the library (e.g., students, faculty, and staff), to other libraries via consortia and consortia-like arrangements, or to web users of any ilk.
Hybrid Model
It is also possible that a library might purchase/own web-published materials but choose to have a Web archive service provider host the materials. In this case, the library would provide networked access to its patrons.

Whatever model(s) libraries adopt, they will need to collect and report statistics about the web-published materials in their collections. As this class of materials increases in importance in library collections, it will be in the best interest of libraries to have these materials represented in the annual survey data that libraries report. Current reporting practices for academic libraries provide a framework within which statistical measurements for web archive content can be developed.

Web Archive Metrics: Draft Proposal

Measurement Categories

In general there are four categories of measurement for which academic libraries collect data:

  1. Scope (How much; how many)
  2. Expenditures (Cost)
  3. Usage (Counts)
  4. Quality (Outcomes; Value)

The eotcd project's subject matter experts (SMEs) identified two critical areas for which Web archive statistics will be needed to inform selection and retention decisions:

  1. Scope (How much; how many)
  2. Usage (Counts)

These two areas are the primary focus of our current metrics work. The discussion tab includes some thoughts and ideas in these and other areas of interest for Web archive metrics. We note, for example, that the United Kingdom Serials Group (UKSG) is investigating the concept of a journal usage factor as a measure of the quality and value of online journals. It may be possible to logically extend the usage factor concept to Web archive content, thereby creating a quality measure for this class of materials.

Web Archive Terminology


SCOPE

Pertinent Academic Library Statistics

Among the statistics tracked and reported by academic libraries, those needed for Web archives most closely resemble the statistics reported on the ARL supplemental statistics worksheet for (a) the use of networked electronic resources and services (i.e., e-resources such as databases) and (b) library digitization activities.

Use of Networked Electronic Resources and Services

ARL includes three measures for the use of databases and services:

  1. number of sessions
  2. number of searches
  3. number of successful article requests

While these three statistics pertain to usage, which is discussed below, ARL requests two additional measures of scope for each usage statistic:

  1. Number of resources for each usage statistic
  2. Types of resources comprising the number of resources

Library Digitization Activities

ARL statistics for library digitization activities also provide some guidance regarding measures for Web archives. In this regard, the pertinent statistics include:

  1. Number of collections
  2. Size (in gigabytes)
  3. Items (e.g., digital objects or unique files)

These statistics (a) include items that are made available to users and (b) exclude backup copies and e-reserves.

Archive Content Categories

Fundamentally, an objective of this project is to suggest metrics that characterize the resources in a Web archive in a manner that is meaningful to librarians and library administrators, who range in their degree of familiarity with the technical definitions employed by standards bodies and the wider technical community. To meet this objective, we analyzed the content of the EOT Web Archive by mime types and subsequently identified categories for some of the resource formats associated with the "application" and "text" mime types. The resulting categories are listed in the following chart and table. The categories suggest aggregate measurement units for Web archive resources.


[Chart: archive content categories]

Category              # URIs        # Formats   Formats
text                  109,498,363   2           html, plain
image                 29,140,868    8           jpeg, gif, png, tiff, pjpeg, x-icon, jpg, bmp
document-like         11,234,522    4           pdf, msword, postscript, vnd.ms-powerpoint
computer files
  * coded/formatted   2,427,349     11          x-javascript, javascript (both text and application type), x-cgi, xml (both text and application type), atom+xml, rss+xml, x-vcal, x-vcalendar, css
  * compressed        526,105       5           zip, x-zip-compressed, x-gzip, x-compress, vnd.google-earth.kmz
  * binary            503,660       2           octet-stream, x-octet-stream
  * executable        15,079        1           download
dataset               908,339       5           vnd.ms-excel, csv, comma-separated-values, x-netcdf, fits
video                 318,498       5           quicktime, x-ms-asf, mpeg, x-ms-wmv, x-shockwave-flash
audio                 198,349       3           mpeg, x-pn-realaudio, x-wav
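
The grouping in the table above amounts to a lookup from MIME type to category. The following is a minimal sketch in Python, not the project's actual tooling: the names `SUBTYPE_CATEGORY`, `TOP_LEVEL_CATEGORY`, `categorize`, and `count_by_category` are hypothetical, only a subset of the listed formats is included, and the "uncategorized" fallback is an assumption for formats the table does not cover. The full MIME type is consulted rather than the subtype alone because some subtypes (for example, mpeg) appear under both video and audio.

```python
from collections import Counter

# Subtype-to-category lookup for the "application" and "text" MIME types.
# This is an illustrative subset of the formats in the table, not a full list.
SUBTYPE_CATEGORY = {
    "html": "text",
    "plain": "text",
    "pdf": "document-like",
    "msword": "document-like",
    "postscript": "document-like",
    "vnd.ms-powerpoint": "document-like",
    "javascript": "computer file",
    "x-javascript": "computer file",
    "xml": "computer file",
    "css": "computer file",
    "zip": "computer file",
    "x-gzip": "computer file",
    "octet-stream": "computer file",
    "vnd.ms-excel": "dataset",
    "csv": "dataset",
    "x-netcdf": "dataset",
    "x-shockwave-flash": "video",
}

# Top-level MIME types that map directly onto a category of the same name.
TOP_LEVEL_CATEGORY = {"image": "image", "video": "video", "audio": "audio"}


def categorize(mime_type: str) -> str:
    """Return the content category for a MIME type, or "uncategorized"."""
    # Drop parameters such as "; charset=utf-8" and normalize case.
    essence = mime_type.split(";", 1)[0].strip().lower()
    top, _, sub = essence.partition("/")
    if top in TOP_LEVEL_CATEGORY:
        return TOP_LEVEL_CATEGORY[top]
    if top in ("application", "text"):
        return SUBTYPE_CATEGORY.get(sub, "uncategorized")
    return "uncategorized"


def count_by_category(mime_types):
    """Tally one URI per MIME type string into category counts."""
    return Counter(categorize(m) for m in mime_types)
```

Given the MIME type recorded for each archived URI, `count_by_category` would yield per-category URI counts of the kind shown in the "# URIs" column.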

Proposed Data Elements

For a Web archive:

  1. Size (in gigabytes, terabytes, etc. as appropriate)
  2. Number of discrete collections

For each collection within a Web archive:

  1. Size (in gigabytes, terabytes, etc. as appropriate)
  2. Number of objects by type:
    • Text
    • Image
    • Document-like
    • Computer file
    • Dataset
    • Video
    • Audio
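
The scope data elements above can be pictured as a small record structure. This is a sketch under stated assumptions, not a proposed schema: the class and function names (`CollectionScope`, `ArchiveScope`, `format_size`) are hypothetical, sizes are stored in bytes and reported in decimal units, and the archive size is assumed to be the sum of its collection sizes, which would overcount if collections share content.

```python
from dataclasses import dataclass, field

# The seven object types listed in the proposed data elements.
CATEGORIES = (
    "text", "image", "document-like", "computer file",
    "dataset", "video", "audio",
)


@dataclass
class CollectionScope:
    """Scope statistics for one collection within a Web archive."""
    name: str
    size_bytes: int
    objects_by_type: dict = field(default_factory=dict)

    def __post_init__(self):
        # Reject object types outside the proposed list.
        unknown = set(self.objects_by_type) - set(CATEGORIES)
        if unknown:
            raise ValueError(f"unknown object types: {sorted(unknown)}")

    def total_objects(self) -> int:
        return sum(self.objects_by_type.values())


@dataclass
class ArchiveScope:
    """Scope statistics for a whole Web archive."""
    collections: list = field(default_factory=list)

    @property
    def collection_count(self) -> int:
        return len(self.collections)

    @property
    def size_bytes(self) -> int:
        # Assumption: collections do not overlap, so sizes can be summed.
        return sum(c.size_bytes for c in self.collections)


def format_size(n_bytes: int) -> str:
    """Express a byte count in the largest decimal unit that fits."""
    units = ["B", "KB", "MB", "GB", "TB", "PB"]
    value = float(n_bytes)
    for unit in units:
        if value < 1000 or unit == units[-1]:
            return f"{value:.1f} {unit}"
        value /= 1000
```

For example, a collection created as `CollectionScope("example", 16_000_000_000, {"text": 10, "image": 5})` reports 15 objects, and `format_size` renders its size in gigabytes.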



USAGE

Pertinent Academic Library Statistics

As mentioned earlier, the statistics needed for Web archives most closely resemble those that academic libraries report on the ARL supplemental statistics worksheet for the use of networked electronic resources and services. ARL includes three usage measures for databases and services, and it specifically instructs libraries to derive the values for these measures from reports specified in the COUNTER Code of Practice.

Statistic                               COUNTER Code of Practice
number of sessions                      Database Reports 1 and 3
number of searches                      Database Reports 1 and 3
number of successful article requests   Journal Report 1


The PIRUS and PIRUS2 projects are investigating the adaptation of COUNTER usage measurements and reports for materials in institutional repositories. These investigations have a similar purpose to our investigation into usage statistics for Web archives. It seems prudent that our work to establish usage statistics for Web archives should also be informed by the COUNTER Code of Practice.

Libraries have indicated they are quite interested in having a standard set of statistics for similar material types. It is hoped that such a standard set will enable libraries to evaluate their patrons' use of the materials in Web archives in the manner they are already familiar with for other classes of electronic resources (i.e., ebooks, databases, and journals).

COUNTER Terminology & Reports: Brief Overview

Perspectives

Usage statistics are critical from the perspective of both service providers and libraries. This dual perspective has been a driver in the development of the COUNTER Codes of Practice:

"The guidelines provided in the Codes of Practice enable librarians to compare statistics from different vendors, to make better-informed purchasing decisions, and to plan infrastructure more effectively. COUNTER also provides vendors/intermediaries with the detailed specifications they need to generate data in a format useful to customers, to compare the relative usage of different delivery channels, and to learn more about online usage patterns."

It is expected that a set of standard usage statistics will likewise meet the reporting and assessment needs of Web archive service providers and libraries.

Proposed Data Elements

For each collection within a Web archive:

  1. Number of sessions
    • Total number
    • Number federated or automated
  2. Number of searches (queries)
    • Total number of searches run
    • Number federated or automated
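
The usage data elements above pair each total with the share that is federated or automated. A minimal sketch of how such counts might be kept, assuming Python: the names `UsageCount` and `CollectionUsage`, and the idea that each session or search arrives with an `automated` flag already determined, are hypothetical and are not drawn from the COUNTER Code of Practice.

```python
from dataclasses import dataclass, field


@dataclass
class UsageCount:
    """A total count plus the portion that was federated or automated."""
    total: int = 0
    federated_or_automated: int = 0

    def record(self, automated: bool = False) -> None:
        # Every event counts toward the total; automated ones are also
        # counted separately so they can be reported on their own line.
        self.total += 1
        if automated:
            self.federated_or_automated += 1

    @property
    def user_initiated(self) -> int:
        # Derived figure: events not flagged as federated or automated.
        return self.total - self.federated_or_automated


@dataclass
class CollectionUsage:
    """Usage statistics for one collection within a Web archive."""
    collection: str
    sessions: UsageCount = field(default_factory=UsageCount)
    searches: UsageCount = field(default_factory=UsageCount)


# Usage: tally one session containing two searches, one of them automated.
usage = CollectionUsage("example collection")
usage.sessions.record()
usage.searches.record()
usage.searches.record(automated=True)
```

Keeping the automated count alongside the total, rather than in a separate tally, makes it harder for the two reported numbers to drift out of step.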

NOTE: See the discussion tab content for thoughts regarding the applicability of item requests as a data element for Web archives.