Talk:Web Archive Service Models and Metrics

Scope

Scope is a measure of the amount of material held in the archive.

Level of Aggregation

We considered the aggregate level (e.g., archive or collection) at which size should be measured. We suggest that statistics be measured at both the archive level and the collection level. This may be important to archive service providers when they report their statistics.

Unit of Measure

We considered what unit would be meaningful to measure within a Web archive. We suggest that the URI be the unit counted, and that both the number of URIs and their aggregate size be measured for each meaningful class of content (materials).
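As a rough illustration of the two suggestions above (URIs as the unit, reported at both archive and collection level), the following sketch tallies URI counts and aggregate sizes per content class. The record layout, the collection names, and the `summarize` function are assumptions made for this example; they do not describe any existing archive tooling.

```python
from collections import defaultdict

def summarize(records):
    """Tally URI counts and total bytes per content class.

    Each record is a hypothetical (collection, content_class, size_in_bytes)
    tuple, one per captured URI. Returns collection-level and archive-level
    totals as dictionaries whose values are [uri_count, total_bytes].
    """
    per_collection = defaultdict(lambda: [0, 0])  # (collection, class) -> [uris, bytes]
    archive = defaultdict(lambda: [0, 0])         # class -> [uris, bytes]
    for collection, content_class, size in records:
        for totals in (per_collection[(collection, content_class)],
                       archive[content_class]):
            totals[0] += 1
            totals[1] += size
    return dict(per_collection), dict(archive)

# Made-up sample data, for illustration only.
sample = [
    ("collection-a", "text", 1200),
    ("collection-a", "image", 50000),
    ("collection-b", "text", 800),
]
per_collection, archive = summarize(sample)
```

Keeping both tallies in one pass means the archive-level figures are always the sum of the collection-level figures, which may matter if providers report both.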

EOT Archive: Mime Type Evaluation

Web Archive Content Categories

We first identified the mime types represented in the Archive and these are shown in the table and chart below.

EOT Archive Mime Types (RFC 2046)
Valid Types	# Formats	# URIs	% URIs	Formats
application	22	14,386,967	9.30%	pdf, vnd.ms-excel, x-javascript, msword, octet-stream, zip, x-shockwave-flash, postscript, vnd.ms-powerpoint, atom+xml, xml, x-cgi, x-octet-stream, x-zip-compressed, rss+xml, x-gzip, x-compress, javascript, x-netcdf, fits, vnd.google-earth.kmz, download
audio	3	198,349	0.13%	mpeg, x-pn-realaudio, x-wav
image	8	29,140,868	18.83%	jpeg, gif, png, tiff, pjpeg, x-icon, jpg, bmp
text	9	110,918,439	71.67%	html, plain, css, xml, csv, javascript, x-vcal, x-vcalendar, comma-separated-values
video	4	126,509	0.08%	quicktime, x-ms-asf, mpeg, x-ms-wmv
Total		154,771,132	100%

The following table lists the mime types that each account for at least 1% of the Archive's URIs. There are six of these, and together they constitute 96% of the Archive's URIs. The 40 remaining mime types constitute 4% of the total number of URIs in the Archive. However, because of the Archive's size, even the least of these (application/download) includes 15,079 URIs.

Type	Format	# URIs	% URIs	Cum. %
text	html	105,590,929	68.22%	68.22%
image	jpeg	13,665,196	8.83%	77.05%
image	gif	13,031,046	8.42%	85.47%
application	pdf	10,320,163	6.67%	92.14%
text	plain	3,907,434	2.52%	94.67%
image	png	2,066,892	1.34%	96.00%
all others (n = 40)		6,189,472	4.00%	100.00%
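The percentages in the table above can be reproduced from the URI counts. The sketch below does so for the six mime types listed, using the counts and the Archive total given on this page; the function name and the 1% default threshold parameter are our own choices for illustration.

```python
# URI counts copied from the table above.
counts = {
    "text/html": 105_590_929,
    "image/jpeg": 13_665_196,
    "image/gif": 13_031_046,
    "application/pdf": 10_320_163,
    "text/plain": 3_907_434,
    "image/png": 2_066_892,
}
TOTAL_URIS = 154_771_132  # Archive total from the first table

def share_table(counts, total, threshold=0.01):
    """Return (mime_type, uris, percent, cumulative_percent) rows, largest first,
    keeping only types whose share of the total meets the threshold."""
    rows = []
    cumulative = 0
    for name, n in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
        if n / total < threshold:
            continue
        cumulative += n
        rows.append((name, n, 100 * n / total, 100 * cumulative / total))
    return rows

rows = share_table(counts, TOTAL_URIS)
# Everything not listed individually falls into "all others".
others = TOTAL_URIS - sum(n for _, n, _, _ in rows)
```

The cumulative column is accumulated from the raw counts rather than from the rounded percentages, which is why the fifth row reads 94.67% and not 94.66%.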

The formats for audio, video, and image mime types in the Archive are straightforward and, we think, fairly familiar to users. The text and application mime types include a range of formats that we thought would describe the Archive's content better if they were further classified. To that end, we classified the 31 formats in those two mime types as follows:

Category	# URIs	Formats
Text	109,498,363	html, plain
Document-like	11,234,522	pdf, msword, postscript, vnd.ms-powerpoint
Computer Files	3,472,193	(total of the four sub-classes below)
- Compressed	526,105	zip, x-zip-compressed, x-gzip, x-compress, vnd.google-earth.kmz
- Binary	503,660	octet-stream, x-octet-stream
- Executable	15,079	download
- Coded/Formatted	2,427,349	x-javascript, javascript, javascript, x-cgi, xml, xml, atom+xml, rss+xml, x-vcal, x-vcalendar, css
Data-set	908,339	vnd.ms-excel, csv, comma-separated-values, x-netcdf, fits
Video	191,989	x-shockwave-flash
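One way to apply the classification above is a lookup from full mime type to category. In the sketch below, the assignment of each format to `application/` or `text/` follows the first table on this page (which is why javascript and xml each appear twice). The fallback to the top-level type for audio, image, and video is our assumption, as are the names `CATEGORY_FORMATS` and `classify`.

```python
CATEGORY_FORMATS = {
    "Text": ["text/html", "text/plain"],
    "Document-like": [
        "application/pdf", "application/msword",
        "application/postscript", "application/vnd.ms-powerpoint",
    ],
    "Compressed": [
        "application/zip", "application/x-zip-compressed", "application/x-gzip",
        "application/x-compress", "application/vnd.google-earth.kmz",
    ],
    "Binary": ["application/octet-stream", "application/x-octet-stream"],
    "Executable": ["application/download"],
    "Coded/Formatted": [
        "application/x-javascript", "application/javascript", "text/javascript",
        "application/x-cgi", "application/xml", "text/xml",
        "application/atom+xml", "application/rss+xml",
        "text/x-vcal", "text/x-vcalendar", "text/css",
    ],
    "Data-set": [
        "application/vnd.ms-excel", "text/csv", "text/comma-separated-values",
        "application/x-netcdf", "application/fits",
    ],
    "Video": ["application/x-shockwave-flash"],
}

# Invert the table once so each lookup is a single dictionary access.
MIME_TO_CATEGORY = {
    mime: category
    for category, mimes in CATEGORY_FORMATS.items()
    for mime in mimes
}

def classify(mime_type):
    """Return the category for a mime type.

    Types outside the 31 classified formats (audio, image, video) fall back
    to their top-level type; this fallback is an assumption of the sketch.
    """
    if mime_type in MIME_TO_CATEGORY:
        return MIME_TO_CATEGORY[mime_type]
    return mime_type.split("/", 1)[0]
```

For example, `classify("text/css")` gives "Coded/Formatted", while `classify("image/png")` falls back to "image".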

Rules for Inclusion

Web pages are often composed of multiple content types; a simple example is a web page that includes text (html and css) and images. Since we are suggesting that URIs be the unit measured for content types, rendering a single web page would generally count as more than one URI. We wonder if rules might be established to produce more meaningful counts of an archive's content.

Should specific types (or formats) of materials be excluded from counts altogether, for example x-icon images or all types of coded/formatted files (css, xml, javascript, etc.)?
- NOTE: In the ARL instructions for reporting counts for computer files, those files counted include "machine-readable files comprising data or programs that are locally held as part of the library's collections available to library clients. Examples are U.S. Census data tapes, sample research software, locally-mounted databases, and reference tools on CD-ROM, tape or disk."
- Would it be reasonable to include the sub-classes of compressed files, binary files, and executable files and to exclude all coded/formatted files?
Would it be reasonable to measure images only if their size exceeded an established threshold?
Would it be meaningful in characterizing a collection or an archive to separately count (and report) certain formats within a class of content, for example PDF-formatted files/URIs, if their count exceeded an established threshold?
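To make the first two questions concrete, here is a minimal sketch of how such inclusion rules might be expressed. The excluded formats and the image size threshold are placeholders chosen for illustration; no values have been agreed, and `is_countable` is a hypothetical name.

```python
# Hypothetical rule set: both the excluded formats and the threshold are
# placeholders, not recommended values.
EXCLUDED_MIME_TYPES = {"image/x-icon", "text/css", "text/xml", "text/javascript"}
MIN_IMAGE_BYTES = 5_000

def is_countable(mime_type, size_in_bytes):
    """Return True if a URI would be included in content counts under these rules."""
    if mime_type in EXCLUDED_MIME_TYPES:
        return False
    if mime_type.startswith("image/") and size_in_bytes < MIN_IMAGE_BYTES:
        return False
    return True

# Made-up captures as (mime_type, size_in_bytes) pairs.
captures = [
    ("text/html", 12_000),
    ("text/css", 3_000),
    ("image/x-icon", 1_100),
    ("image/gif", 400),
    ("image/jpeg", 85_000),
]
counted = [c for c in captures if is_countable(*c)]
```

Under these placeholder rules only the html page and the large jpeg would be counted; the stylesheet, the icon, and the small gif would not.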

Usage

The URL+timestamp might be considered the unique identifier for discrete objects in a Web archive. Is this the level at which usage ought to be tracked and usage reports generated for institutions?

Unique identifiers for materials in the Web archive are:
- Needed for tracking usage at the unique identifier level
- Needed for identifying versions of the same identifier
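If URL+timestamp were adopted as the identifier, both needs listed above could be served by the same key. The sketch below assumes a 14-digit capture timestamp string and an in-memory tally; both are assumptions for illustration, and `record_request` is a hypothetical name.

```python
from collections import Counter, defaultdict

request_counts = Counter()          # (url, timestamp) -> number of requests
versions_by_url = defaultdict(set)  # url -> timestamps of the versions requested

def record_request(url, timestamp):
    """Record one request for the capture of `url` taken at `timestamp`."""
    identifier = (url, timestamp)
    request_counts[identifier] += 1
    versions_by_url[url].add(timestamp)
    return identifier

# Made-up requests: two for one capture, one for a later capture of the same URL.
record_request("http://example.gov/", "20081104120000")
record_request("http://example.gov/", "20081104120000")
record_request("http://example.gov/", "20090120170000")
```

Usage is tracked per capture, while grouping by URL alone recovers the different versions of the same resource.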

Item Requests

For journal articles ARL tracks the number of successful requests, as defined by COUNTER:

Number of items requested by users as a result of a search. User requests include viewing, downloading, emailing and printing of items, where this activity can be recorded and controlled by the server rather than the browser. Turnaways will also be counted.

What types of viewing, downloading, emailing and printing would the Archive Service include that would be server-controlled versus browser-controlled? Are there probable use cases for Web archives? One such case might be an authentication service, in which certain publications are vetted by an authority and assigned a certificate of authenticity that could be requested by a user.

Quality

UKSG has been investigating a journal usage factor as a measure of the quality and value of online journals. It may be possible to logically extend the usage factor concept to Web archive content, thereby creating a quality measure for this class of materials.

Costs

It may be that the Web archive service providers will not charge for access to the materials in their collections. However, this may not always be the case.

It seems likely that different cost models will emerge as Web archives become more commonplace. It may be that certain collections within an institution's archive will be free, while other collections will have costs associated with them.

A tiered cost structure might work well. It could allow for both free and fee-based services; for example, a service provider could offer:

Free basic discovery and access services
Fee-based options and services:
1. usage reports
2. hosting

Usage Reports for Web Archives

To start, these could emulate the COUNTER usage reports for databases and journals. As such, they would include:

Sessions by Month by Collection
Searches by Month by Collection
Searches and Sessions by Year by Collection
Searches and Sessions by Year by Archive

As appropriate, these reports could be produced for consortia as well as for individual institutions.
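As a rough sketch of how the monthly reports might be tallied, the following assumes a log of (collection, ISO date, kind) events, where kind is either "session" or "search". The event layout and the `monthly_report` function are assumptions for illustration and are not taken from the COUNTER specification.

```python
from collections import Counter

def monthly_report(events):
    """Count sessions and searches per (collection, 'YYYY-MM') pair."""
    report = {"session": Counter(), "search": Counter()}
    for collection, date, kind in events:
        month = date[:7]  # 'YYYY-MM-DD' -> 'YYYY-MM'
        report[kind][(collection, month)] += 1
    return report

# Made-up events, for illustration only.
events = [
    ("collection-a", "2009-01-15", "session"),
    ("collection-a", "2009-01-15", "search"),
    ("collection-a", "2009-01-20", "search"),
    ("collection-b", "2009-02-02", "session"),
]
report = monthly_report(events)
```

Yearly and archive-level reports could be derived by summing these monthly, collection-level counts.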

Content Description

There are at least two perspectives to content description for Web archives: user and provider.

User perspective.

We were concerned with one class of user, the library. We asked librarians serving as project SMEs what criteria their libraries used in making acquisition decisions. From their responses we learned that describing an archive's content is essential and goes beyond measures of its scope. Further, libraries require consistent content descriptions for the same type of materials (e.g., journal databases or Web collections) when those materials are available from different providers. Content description allows a library to assess how broadly all, or a portion of, a provider's content applies to the library's collection. For libraries, this assessment is fundamental to the material selection process.

We identified three attributes of content description to consistently describe a collection within a Web archive:

Topical areas covered
Unique or exclusive content available
Dates materials were harvested

Provider perspective.

Content description is important to Web archive providers for a few reasons:

To determine change-over-time for similar content captured at different points in time
To identify content overlap among collections
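Content overlap between two collections could be estimated in several ways. One simple possibility, which is our own suggestion and not an established practice for Web archives, is to compare the sets of URIs each collection holds and report the shared URIs along with their Jaccard index (shared URIs divided by all distinct URIs).

```python
def overlap(uris_a, uris_b):
    """Return the URIs two collections share and their Jaccard index."""
    a, b = set(uris_a), set(uris_b)
    shared = a & b
    union = a | b
    # Two empty collections are treated as having no overlap.
    jaccard = len(shared) / len(union) if union else 0.0
    return shared, jaccard

# Made-up URI lists, for illustration only.
collection_a = ["http://example.gov/a", "http://example.gov/b", "http://example.gov/c"]
collection_b = ["http://example.gov/b", "http://example.gov/c", "http://example.gov/d"]
shared, score = overlap(collection_a, collection_b)
```

Comparing by URI alone ignores capture dates; comparing two captures of the same collection taken at different times would give a crude measure of change over time.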

It seems reasonable that, if reported in a consistent manner, these characteristics of a Web archive will promote access to and discovery of materials. We wonder what other descriptive measures might help characterize materials in Web archives and further support user access and discovery. Two possibilities are:

Leveraging page rank data, for example to suggest possibly related content as measured by the underlying Web graph
Targeted, human-mediated, topical analysis of machine-identified clusters of related content

Summary thoughts on content description.

Common attributes.

The two perspectives share common attributes for content description. We can suggest the following:

Topical areas addressed
- At a feasible level of effort, whether resulting from human mediation or machine analysis
Unique or exclusive content available
- Dates materials in the collection were captured
- Measure of how the collection changed over time
- Analysis of collection's overlap with other known collections

Discovery and access.

Utilize page rank data when providing search results
- Possibly a "view related content" feature
- Design experiments to measure effectiveness
As it becomes feasible, provide users with a feature that identifies similar or related content on another Web archive
- For example, another capture of the material hosted by a different provider

Terminology

Source: DCMI Glossary

document-like object (DLO): Originally defined as an entity that resembles a document from the standpoint that it is substantially text-based and shares other properties of a document; e.g., electronic mail messages or spreadsheets. The definition was expanded at the 3rd DC workshop to refer to any discrete information resource that are characterized by being fixed (i.e., having identical content for each user). Examples include text, images, movies, and performances.

Resource: A resource is anything that has identity. Familiar examples include an electronic document, an image, a service (e.g., "today's weather report for Los Angeles"), and a collection of other resources. Not all resources are network "retrievable"; e.g., human beings, corporations, and bound books in a library can also be considered resources. http://dublincore.org/documents/2003/04/02/dc-xml-guidelines/

Type: The Dublin Core element used to designate the nature or genre of the content of the resource. Type includes terms describing general categories, functions, genres, or aggregation levels for content. Recommended best practice is to select a value from a controlled vocabulary. See also "Using Dublin Core".

References

Buckland, Michael K. What is a "Document"? Journal of the American Society for Information Science, Sep 1997, Vol. 48, Issue 9, pp. 804-809. Abstract: Presents information on the definition of a document, while focusing on the development of a functional view, and discusses whether sculpture, museum, live animals and objects can be considered "documents". Objectives of this article; Background information on document; Explanation of documentation in historical terms; Definition of documentation.

Frohmann, Bernd. Revisiting "What is a Document?" Journal of Documentation, Jan 1, 2009, Vol. 65, Issue 2, pp. 291-303. Abstract: Purpose -- The purpose of this paper is to provide a reconsideration of Michael Buckland's important question, "What is a document?," analysing the point and purpose of definitions of "document" and "documentation." Design/methodology/approach -- Two philosophical notions of the point of definitions are contrasted: John Stuart Mill's concept of a "real" definition, purporting to specify the nature of the definiendum; and a concept of definition based upon a foundationalist philosophy of language. Both conceptions assume that a general, philosophical justification for using words as we do is always in order. This assumption is criticized by deploying Hilary Putnam's arguments against the orthodox Wittgensteinian interpretation of criteria governing the use of language. The example of the cabinets of curiosities of the sixteenth-century English and European virtuosi is developed to show how one might productively think about what documents might be, but without a definition of a document. Findings -- Other than for specific, instrumentalist purposes (often appropriate for specific case studies), there is no general philosophical reason for asking, what is a document? There are good reasons for pursuing studies of documentation without the impediments of definitions of "document" or "documentation." Originality/value -- The paper makes an original contribution to the new interest in documentation studies by providing conceptual resources for multiplying, rather than restricting, the areas of application of the concepts of documents and documentation. Adapted from the source document.

From eotcd
