Talk:Web Archive Service Models and Metrics

From eotcd

Revision as of 14:55, 19 December 2011

Scope

Scope is a measure of the amount of materials in the archive.


Level of Aggregation

We considered at what aggregate level (e.g., archive or collection) size should be measured. We suggest that both archive and collection level statistics be measured. This may be of importance to the archive service providers in reporting their statistics.


Unit of Measure

We considered what unit would be meaningful to measure within a Web archive. We are suggesting that URIs be the unit counted and that the number of URIs and their corresponding aggregate sizes for meaningful classes of content (materials) be measured.


EOT Archive: Mime Type Evaluation

Web Archive Content Categories

We first identified the mime types represented in the Archive and these are shown in the table and chart below.

EOT Archive Mime Types (RFC 2046)
Valid Type | # Formats | # URIs | % URIs | Formats
application | 22 | 14,386,967 | 9.30% | pdf, vnd.ms-excel, x-javascript, msword, octet-stream, zip, x-shockwave-flash, postscript, vnd.ms-powerpoint, atom+xml, xml, x-cgi, x-octet-stream, x-zip-compressed, rss+xml, x-gzip, x-compress, javascript, x-netcdf, fits, vnd.google-earth.kmz, download
audio | 3 | 198,349 | 0.13% | mpeg, x-pn-realaudio, x-wav
image | 8 | 29,140,868 | 18.83% | jpeg, gif, png, tiff, pjpeg, x-icon, jpg, bmp
text | 9 | 110,918,439 | 71.67% | html, plain, css, xml, csv, javascript, x-vcal, x-vcalendar, comma-separated-values
video | 4 | 126,509 | 0.08% | quicktime, x-ms-asf, mpeg, x-ms-wmv
Total | | 154,771,132 | 100% |


[Chart: percent of EOT URIs by mime type]


The following table lists the mime types that each account for at least 1% of the Archive's URIs. There are six of these, and together they constitute 96% of the Archive's URIs. The 40 remaining mime types constitute 4% of the total number of URIs in the Archive. However, because of the Archive's size, even the smallest of these (application/download) includes 15,079 URIs.


Type | Format | # URIs | % URIs | Cum. %
text | html | 105,590,929 | 68.22% | 68.22%
image | jpeg | 13,665,196 | 8.83% | 77.05%
image | gif | 13,031,046 | 8.42% | 85.47%
application | pdf | 10,320,163 | 6.67% | 92.14%
text | plain | 3,907,434 | 2.52% | 94.67%
image | png | 2,066,892 | 1.34% | 96.00%
all others (n = 40) | | 6,189,472 | 4.00% | 100.00%
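
The percentage and cumulative-percentage columns above can be recomputed from the URI counts. The following is a minimal Python sketch, not the procedure actually used for the EOT analysis; the function name `cumulative_shares` is our own, and the counts are transcribed from the two tables above.

```python
# Grand total taken from the "Total" row of the EOT Archive Mime Types table.
TOTAL_URIS = 154_771_132

# The six mime types that each account for at least 1% of the Archive.
TOP_TYPES = [
    ("text/html", 105_590_929),
    ("image/jpeg", 13_665_196),
    ("image/gif", 13_031_046),
    ("application/pdf", 10_320_163),
    ("text/plain", 3_907_434),
    ("image/png", 2_066_892),
]


def cumulative_shares(counts, total):
    """Return (mime_type, uris, percent, cumulative_percent) rows,
    ordered from the most to the least common mime type."""
    rows = []
    running = 0
    for mime_type, uris in sorted(counts, key=lambda pair: pair[1], reverse=True):
        running += uris
        rows.append((
            mime_type,
            uris,
            round(100 * uris / total, 2),
            # Cumulative share is computed from the running count,
            # not by summing already-rounded percentages.
            round(100 * running / total, 2),
        ))
    return rows


if __name__ == "__main__":
    for mime_type, uris, pct, cum in cumulative_shares(TOP_TYPES, TOTAL_URIS):
        print(f"{mime_type:<16} {uris:>12,} {pct:>6.2f}% {cum:>7.2f}%")
    others = TOTAL_URIS - sum(uris for _, uris in TOP_TYPES)
    print(f"{'all others':<16} {others:>12,}")
```

Computing the cumulative column from running counts rather than from rounded percentages avoids small rounding drift between rows.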


The formats for audio, video, and image mime types in the Archive are straightforward, and we think fairly familiar to users. The text and application mime types included a range of formats that we thought would be more descriptive of the Archive's content if they were further classified. To that end, we have classified the 31 formats in those two mime types as follows:

Category | # URIs | Formats
Text | 109,498,363 | html, plain
Document-like | 11,234,522 | pdf, msword, postscript, vnd.ms-powerpoint
Dataset | 908,339 | vnd.ms-excel, csv, comma-separated-values, x-netcdf, fits
Computer Files | 3,472,193 | (sum of the four sub-classes below)
  Compressed | 526,105 | zip, x-zip-compressed, x-gzip, x-compress, vnd.google-earth.kmz
  Binary | 503,660 | octet-stream, x-octet-stream
  Executable | 15,079 | download
  Coded/Formatted | 2,427,349 | x-javascript, javascript (application and text), x-cgi, xml (application and text), atom+xml, rss+xml, x-vcal, x-vcalendar, css
Video | 191,989 | x-shockwave-flash
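
To illustrate how this classification might be applied to harvested URIs, here is a minimal Python sketch. The format-to-category assignments are transcribed from the table above (using the spelling x-netcdf); the names `FORMATS_BY_CATEGORY`, `CATEGORY_BY_FORMAT`, and `classify`, and the "Unclassified" fallback, are our own assumptions rather than part of any existing tool.

```python
# Category assignments for formats in the text and application mime types.
FORMATS_BY_CATEGORY = {
    "Text": ["html", "plain"],
    "Document-like": ["pdf", "msword", "postscript", "vnd.ms-powerpoint"],
    "Dataset": ["vnd.ms-excel", "csv", "comma-separated-values", "x-netcdf", "fits"],
    "Compressed": ["zip", "x-zip-compressed", "x-gzip", "x-compress",
                   "vnd.google-earth.kmz"],
    "Binary": ["octet-stream", "x-octet-stream"],
    "Executable": ["download"],
    "Coded/Formatted": ["x-javascript", "javascript", "x-cgi", "xml", "atom+xml",
                        "rss+xml", "x-vcal", "x-vcalendar", "css"],
    "Video": ["x-shockwave-flash"],
}

# Invert the mapping so a format name can be looked up directly.
CATEGORY_BY_FORMAT = {
    fmt: category
    for category, formats in FORMATS_BY_CATEGORY.items()
    for fmt in formats
}


def classify(mime_type):
    """Return the content category for a mime type such as 'application/pdf'.

    Parameters such as '; charset=UTF-8' are ignored. Audio, image, and
    video types keep their top-level type as the category.
    """
    essence = mime_type.split(";", 1)[0].strip().lower()
    top_level, _, fmt = essence.partition("/")
    if top_level in ("text", "application"):
        # "Unclassified" is a hypothetical fallback for formats not in the table.
        return CATEGORY_BY_FORMAT.get(fmt, "Unclassified")
    return top_level.capitalize()


if __name__ == "__main__":
    for example in ("application/pdf", "text/css", "image/jpeg",
                    "application/x-shockwave-flash"):
        print(example, "->", classify(example))
```

Compressed, Binary, Executable, and Coded/Formatted are the sub-classes of Computer Files; a report could roll them up into that parent class where only the top-level categories are needed.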


Rules for Inclusion

Web pages are often composed of multiple content types, a simple example being a web page that includes text (html and css) and images. Since we are suggesting that URIs be the unit measured for content types, the number of URIs counted to render a single web page would generally exceed one. We wonder if rules might be established to inform more meaningful counts of an archive's content.

  • Should specific types (or formats) of materials be excluded from counts altogether, for example, x-icon images, all types of coded/formatted files (css, xml, javascript, etc.)?
    • NOTE: In the ARL instructions for reporting counts for computer files, those files counted include "machine-readable files comprising data or programs that are locally held as part of the library's collections available to library clients. Examples are U.S. Census data tapes, sample research software, locally-mounted databases, and reference tools on CD-ROM, tape or disk."
    • Would it be reasonable to include the sub-classes of compressed files, binary files, and executable files and to exclude all coded/formatted files?
  • Would it be reasonable to only measure images if their size exceeded established size thresholds?
  • Would it be meaningful in characterizing a collection or an archive to separately count (and report) certain formats within a class of content, for example PDF-formatted files/URIs, if their count exceeded a certain established threshold?
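
If rules of this kind were adopted, they could be expressed as a simple filter over URI records. The Python sketch below is illustrative only: none of the questions above has been settled, so the excluded types, the 10,000-byte image threshold, and the names `UriRecord`, `is_countable`, and `formats_reported_separately` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UriRecord:
    mime_type: str    # e.g. "image/jpeg"
    size_bytes: int   # size of the captured payload


# Hypothetical rule settings; the discussion above has not fixed any of these.
EXCLUDED_TYPES = frozenset({
    "image/x-icon",
    "text/css",
    "text/javascript",
    "application/javascript",
    "application/x-javascript",
    "text/xml",
    "application/xml",
})
MIN_IMAGE_BYTES = 10_000


def is_countable(record, excluded_types=EXCLUDED_TYPES,
                 min_image_bytes=MIN_IMAGE_BYTES):
    """Return True if the record should be included in content counts."""
    if record.mime_type in excluded_types:
        return False
    # Images below the size threshold are left out of the counts.
    if record.mime_type.startswith("image/") and record.size_bytes < min_image_bytes:
        return False
    return True


def formats_reported_separately(uri_counts, threshold):
    """Return the mime types whose URI count exceeds the reporting threshold."""
    return sorted(m for m, n in uri_counts.items() if n > threshold)


if __name__ == "__main__":
    records = [
        UriRecord("image/x-icon", 1_150),
        UriRecord("image/jpeg", 2_000),
        UriRecord("image/jpeg", 250_000),
        UriRecord("application/pdf", 480_000),
    ]
    print(sum(is_countable(r) for r in records), "of", len(records), "counted")
```

Keeping the rule settings as parameters makes it easy to compare how different candidate rules change the resulting counts.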

Usage

The URL+timestamp might be considered the unique identifier for discrete objects in a Web archive. Is this the level at which usage ought to be tracked and usage reports generated for institutions?

  • Unique identifiers for materials in the Web archive are:
    • Needed for tracking usage at the unique identifier level
    • Needed for identifying versions of the same identifier
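
As a minimal Python sketch of the two needs listed above, a URL+timestamp pair can serve directly as a key for counting usage and for grouping versions. The 14-digit YYYYMMDDhhmmss timestamp format, the example.gov URLs, and the names `CaptureId`, `versions_by_url`, and `usage_by_capture` are assumptions made for illustration.

```python
from collections import Counter, defaultdict
from typing import NamedTuple


class CaptureId(NamedTuple):
    """Identifier for one discrete object in a Web archive."""
    url: str
    timestamp: str  # assumed format: YYYYMMDDhhmmss, so strings sort by time


def versions_by_url(capture_ids):
    """Group capture timestamps by URL, oldest first."""
    grouped = defaultdict(set)
    for capture in capture_ids:
        grouped[capture.url].add(capture.timestamp)
    return {url: sorted(stamps) for url, stamps in grouped.items()}


def usage_by_capture(requested_ids):
    """Count requests per URL+timestamp identifier."""
    return Counter(requested_ids)


if __name__ == "__main__":
    home_2008 = CaptureId("http://example.gov/", "20081101120000")
    home_2009 = CaptureId("http://example.gov/", "20090120170000")
    about = CaptureId("http://example.gov/about", "20081101120500")

    print(versions_by_url([home_2008, home_2009, about]))
    print(usage_by_capture([home_2008, home_2008, home_2009]))
```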

Item Requests

For journal articles, ARL tracks the number of successful requests, as defined by COUNTER:
Number of items requested by users as a result of a search. User requests include viewing, downloading, emailing and printing of items, where this activity can be recorded and controlled by the server rather than the browser. Turnaways will also be counted.

What types of viewing, downloading, emailing and printing would the Archive Service include that would be server-controlled versus browser-controlled? Are there probable use cases for Web archives? One such case might be an authentication service, in which certain publications are vetted by an authority and assigned a certificate of authenticity that could be requested by a user.

Quality

UKSG has been investigating a journal usage factor as a measure of the quality and value of online journals. It may be possible to logically extend the usage factor concept to Web archive content, thereby creating a quality measure for this class of materials.

Costs

It may be that the Web archive service providers will not charge for access to the materials in their collections. However, this may not always be the case.

It seems likely that different cost models will emerge as Web archives become more commonplace. It may be that certain collections within an institution's archive will be free, while other collections will have costs associated with them.

A tiered cost structure might work well. This could allow for both free and fee-based services; for example, a service provider could offer:

  1. Free basic discovery and access services
  2. Fee-based options and services:
    1. usage reports
    2. hosting

Usage Reports for Web Archives

For a start these could emulate the COUNTER usage reports for databases and journals. As such they would include:

  • Sessions by Month by Collection
  • Searches by Month by Collection
  • Searches and Sessions by Year by Collection
  • Searches and Sessions by Year by Archive

As appropriate, these reports could be done for consortia as well as individual institutions.
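
The reports listed above amount to counting session and search events by collection and time period. A minimal Python sketch follows; it does not implement the COUNTER specification, and the event tuple layout, the collection name "EOT 2008", and the function names are assumptions made for illustration.

```python
from collections import Counter
from datetime import date


def monthly_report(events, kind):
    """Count events of one kind ('session' or 'search') per
    (collection, 'YYYY-MM') pair.

    Each event is assumed to be a (collection, day, kind) tuple.
    """
    report = Counter()
    for collection, day, event_kind in events:
        if event_kind == kind:
            report[(collection, day.strftime("%Y-%m"))] += 1
    return report


def yearly_report(events, by_collection=True):
    """Count searches and sessions per (scope, year, kind).

    The scope is the collection name, or 'archive' when
    by_collection is False (an archive-level roll-up).
    """
    report = Counter()
    for collection, day, kind in events:
        scope = collection if by_collection else "archive"
        report[(scope, day.year, kind)] += 1
    return report


if __name__ == "__main__":
    events = [
        ("EOT 2008", date(2008, 11, 3), "session"),
        ("EOT 2008", date(2008, 11, 20), "session"),
        ("EOT 2008", date(2008, 11, 20), "search"),
        ("EOT 2008", date(2008, 12, 1), "search"),
    ]
    print(monthly_report(events, "session"))
    print(yearly_report(events, by_collection=False))
```

A consortium-level report could reuse the same functions by adding an institution field to each event and filtering on it before counting.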

Content Description

There are at least two perspectives to content description for Web archives: user and provider.

User perspective.

We were concerned with one class of user, a library. We asked librarians serving as project SMEs what criteria their libraries used in making acquisition decisions. From their responses we discovered that describing an archive's content is essential and goes beyond measures of its scope. Further, libraries require consistency in content descriptions for the same type of materials (e.g., journal databases or Web collections) that are available from different providers. Content description allows a library to assess the broadness of applicability of all, or a portion of, a provider's content to a library's collection. For libraries, this assessment is fundamental in their material selection process.

We identified three attributes of content description to consistently describe a collection within a Web archive:

  1. Topical areas covered
  2. Unique or exclusive content available
  3. Dates materials were harvested

Provider perspective.

Content description is important to Web archive providers for a few reasons:

  • To determine change-over-time for similar content captured at different points in time
  • To identify content overlap among collections

It seems reasonable that, if reported in a consistent manner, these characteristics of a Web archive will promote access and discovery of materials. We wonder what other descriptive measures might help characterize materials in Web archives and further support user access and discovery. Two possibilities are:

  • Leveraging page rank data, for example to suggest possibly related content as measured by the underlying Web graph
  • Targeted, human-mediated, topical analysis of machine-identified clusters of related content

Summary thoughts on content description.

Common attributes.

The two perspectives share common attributes for content description. We can suggest the following:

  • Topical areas addressed
    • At a feasible level of effort, whether resulting from human mediation or machine analysis
  • Unique or exclusive content available
    • Dates materials in the collection were captured
    • Measure of how the collection changed over time
    • Analysis of collection's overlap with other known collections

Discovery and access.

  • Utilize page rank data when providing search results
    • Possibly a "view related content" feature
    • Design experiments to measure effectiveness
  • As it becomes feasible, provide users with a feature that identifies similar or related content on another Web archive
    • For example, another capture of the material hosted by a different provider

Terminology

Source: DCMI Glossary

document-like object (DLO)
Originally defined as an entity that resembles a document from the standpoint that it is substantially text-based and shares other properties of a document; e.g., electronic mail messages or spreadsheets. The definition was expanded at the 3rd DC workshop to refer to any discrete information resource that are characterized by being fixed (i.e., having identical content for each user). Examples include text, images, movies, and performances.
Resource
A resource is anything that has identity. Familiar examples include an electronic document, an image, a service (e.g., "today's weather report for Los Angeles"), and a collection of other resources. Not all resources are network "retrievable"; e.g., human beings, corporations, and bound books in a library can also be considered resources. http://dublincore.org/documents/2003/04/02/dc-xml-guidelines/
Type
The Dublin Core element used to designate the nature or genre of the content of the resource. Type includes terms describing general categories, functions, genres, or aggregation levels for content. Recommended best practice is to select a value from a controlled vocabulary. See also "Using Dublin Core".

References

Buckland, Michael K. What is a "Document"? Journal of the American Society for Information Science, Sep 1997, Vol. 48, Issue 9, p804-809.
Abstract: Presents information on the definition of a document, while focusing on the development of a functional view, and discusses whether sculpture, museum, live animals and objects can be considered `documents'. Objectives of this article; Background information on document; Explanation of documentation in historical terms; Definition of documentation.
Frohmann, Bernd. Revisiting "What is a Document?" Journal of Documentation, Jan 1, 2009, Vol. 65, Issue 2, p291-303.
Abstract: Purpose -- The purpose of this paper is to provide a reconsideration of Michael Buckland's important question, "What is a document?," analysing the point and purpose of definitions of "document" and "documentation." Design/methodology/approach -- Two philosophical notions of the point of definitions are contrasted: John Stuart Mill's concept of a "real" definition, purporting to specify the nature of the definiendum; and a concept of definition based upon a foundationalist philosophy of language. Both conceptions assume that a general, philosophical justification for using words as we do is always in order. This assumption is criticized by deploying Hilary Putnam's arguments against the orthodox Wittgensteinian interpretation of criteria governing the use of language. The example of the cabinets of curiosities of the sixteenth-century English and European virtuosi is developed to show how one might productively think about what documents might be, but without a definition of a document. Findings -- Other than for specific, instrumentalist purposes (often appropriate for specific case studies), there is no general philosophical reason for asking, what is a document? There are good reasons for pursuing studies of documentation without the impediments of definitions of "document" or "documentation." Originality/value -- The paper makes an original contribution to the new interest in documentation studies by providing conceptual resources for multiplying, rather than restricting, the areas of application of the concepts of documents and documentation. Adapted from the source document.