Talk:Web Archive Service Models and Metrics
Scope
Scope is a measure of the amount of materials in the archive.
Level of Aggregation
We considered at what aggregate level (e.g., archive or collection) size should be measured. We suggest that both archive and collection level statistics be measured. This may be of importance to the archive service providers in reporting their statistics.
Unit of Measure
We considered what unit would be meaningful to measure within a Web archive. We are suggesting that URIs be the unit counted and that the number of URIs and their corresponding aggregate sizes for meaningful classes of content (materials) be measured.
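As a rough illustration of the two suggestions above, the sketch below counts URIs and sums their sizes per content class, at both the collection and the archive level. The record layout (collection, content class, size in bytes) and the sample values are assumptions made for the example, not data from the Archive.

```python
from collections import defaultdict

def tally(records):
    """Count URIs and sum bytes per (collection, class) and per class archive-wide."""
    by_collection = defaultdict(lambda: {"uris": 0, "bytes": 0})
    by_archive = defaultdict(lambda: {"uris": 0, "bytes": 0})
    for collection, content_class, size in records:
        # Each URI contributes once to its collection bucket and once to the archive bucket.
        for bucket in (by_collection[(collection, content_class)],
                       by_archive[content_class]):
            bucket["uris"] += 1
            bucket["bytes"] += size
    return dict(by_collection), dict(by_archive)

# Hypothetical sample records: (collection, content class, size in bytes)
sample = [
    ("collection-a", "text", 12_000),
    ("collection-a", "image", 48_000),
    ("collection-b", "text", 8_000),
]
per_collection, per_archive = tally(sample)
print(per_archive["text"])
print(per_collection[("collection-a", "image")])
```

Keeping both levels in one pass means a provider could report archive totals and per-collection totals from the same crawl index.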
EOT Archive: Mime Type Evaluation
Web Archive Content Categories
We first identified the mime types represented in the Archive and these are shown in the table and chart below.
Valid Types | # Formats | # URIs | % URIs | Formats |
---|---|---|---|---|
application | 22 | 14,386,967 | 9.30% | pdf, vnd.ms-excel, x-javascript, msword, octet-stream, zip, x-shockwave-flash, postscript, vnd.ms-powerpoint, atom+xml, xml, x-cgi, x-octet-stream, x-zip-compressed, rss+xml, x-gzip, x-compress, javascript, x-netcdf, fits, vnd.google-earth.kmz, download |
audio | 3 | 198,349 | 0.13% | mpeg, x-pn-realaudio, x-wav |
image | 8 | 29,140,868 | 18.83% | jpeg, gif, png, tiff, pjpeg, x-icon, jpg, bmp |
text | 9 | 110,918,439 | 71.67% | html, plain, css, xml, csv, javascript, x-vcal, x-vcalendar, comma-separated-values |
video | 4 | 126,509 | 0.08% | quicktime, x-ms-asf, mpeg, x-ms-wmv |
Total | 46 | 154,771,132 | 100% | |
The following table lists the mime types that each account for at least 1% of the Archive's URIs. There are six of these, and together they constitute 96% of the Archive's URIs. The 40 remaining mime types constitute 4% of the total number of URIs in the Archive. However, because of the Archive's size, even the smallest of these (application/download) includes 15,079 URIs.
Type | Format | # URIs | % URIs | Cum. % |
---|---|---|---|---|
text | html | 105,590,929 | 68.22% | 68.22% |
image | jpeg | 13,665,196 | 8.83% | 77.05% |
image | gif | 13,031,046 | 8.42% | 85.47% |
application | pdf | 10,320,163 | 6.67% | 92.14% |
text | plain | 3,907,434 | 2.52% | 94.67% |
image | png | 2,066,892 | 1.34% | 96.00% |
all others (n = 40) | | 6,189,472 | 4.00% | 100.00% |
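The percentage and cumulative-percentage columns are simple arithmetic on the URI counts. The sketch below applies the same calculation to the five top-level mime types from the first table, using the 1% cut-off described above; the function and variable names are illustrative only.

```python
# URI counts per top-level mime type, copied from the first table.
URI_COUNTS = {
    "application": 14_386_967,
    "audio": 198_349,
    "image": 29_140_868,
    "text": 110_918_439,
    "video": 126_509,
}

def share_table(counts):
    """Return rows of (name, count, percent, cumulative percent), largest first."""
    total = sum(counts.values())
    rows, running = [], 0
    for name, count in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
        running += count
        rows.append((name, count, 100 * count / total, 100 * running / total))
    return rows

rows = share_table(URI_COUNTS)

THRESHOLD = 1.0  # percent; the 1% cut-off used in the text
major = [name for name, _, pct, _ in rows if pct >= THRESHOLD]

for name, count, pct, cum in rows:
    print(f"{name:<12} {count:>12,} {pct:6.2f}% {cum:7.2f}%")
print("at or above threshold:", major)
```

The same function could be run on per-format counts to rebuild the table above, given the full list of 46 formats and their counts.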
The formats for audio, video, and image mime types in the Archive are straightforward, and we think fairly familiar to users. The text and application mime types included a range of formats that we thought would be more descriptive of the Archive's content if they were further classified. To that end, we have classified the 31 formats in those two mime types as follows:
Category | # URIs | Formats |
---|---|---|
Text | 109,498,363 | html, plain |
Document-like | 11,234,522 | pdf, msword, postscript, vnd.ms-powerpoint |
Computer Files (sum of the four sub-classes below) | 3,472,193 | |
Computer Files: Compressed | 526,105 | zip, x-zip-compressed, x-gzip, x-compress, vnd.google-earth.kmz |
Computer Files: Binary | 503,660 | octet-stream, x-octet-stream |
Computer Files: Executable | 15,079 | download |
Computer Files: Coded/Formatted | 2,427,349 | x-javascript, javascript (application and text), x-cgi, xml (application and text), atom+xml, rss+xml, x-vcal, x-vcalendar, css |
Data-set | 908,339 | vnd.ms-excel, csv, comma-separated-values, x-netcdf, fits |
Video | 191,989 | x-shockwave-flash |
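One way to apply this classification mechanically is a lookup from format to category. The sketch below copies the mapping from the table above; the function name, the pass-through handling of audio, image, and video types, and the "Unclassified" fallback are assumptions added for the example.

```python
# Category -> formats, copied from the classification table.
# Compressed, Binary, Executable, and Coded/Formatted are the Computer Files sub-classes.
CATEGORY_FORMATS = {
    "Text": ["html", "plain"],
    "Document-like": ["pdf", "msword", "postscript", "vnd.ms-powerpoint"],
    "Compressed": ["zip", "x-zip-compressed", "x-gzip", "x-compress",
                   "vnd.google-earth.kmz"],
    "Binary": ["octet-stream", "x-octet-stream"],
    "Executable": ["download"],
    "Coded/Formatted": ["x-javascript", "javascript", "x-cgi", "xml", "atom+xml",
                        "rss+xml", "x-vcal", "x-vcalendar", "css"],
    "Data-set": ["vnd.ms-excel", "csv", "comma-separated-values", "x-netcdf", "fits"],
    "Video": ["x-shockwave-flash"],
}

# Invert to format -> category for direct lookup.
FORMAT_TO_CATEGORY = {
    fmt: category for category, fmts in CATEGORY_FORMATS.items() for fmt in fmts
}

def classify(mime_type):
    """Map a mime type such as 'application/pdf' to a content category."""
    # Drop any parameters (e.g. '; charset=utf-8') before splitting type/format.
    base = mime_type.split(";")[0].strip().lower()
    top, _, fmt = base.partition("/")
    if top in ("audio", "image", "video"):
        return top.capitalize()  # these types are reported as-is in the text
    return FORMAT_TO_CATEGORY.get(fmt, "Unclassified")  # fallback is an assumption

print(classify("application/pdf"))
print(classify("text/css"))
```

Keying on the format alone works here because the two formats that appear under both text and application (javascript and xml) fall into the same category.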
Rules for Inclusion
Web pages are often composed of multiple content types; a simple example is a web page that includes text (html and css) and images. Since we are suggesting that URIs be the unit measured for content types, the number of URIs counted to render a single web page would generally exceed one. We wonder if rules might be established to inform more meaningful counts of an archive's content.
- Should specific types (or formats) of materials be excluded from counts altogether, for example, x-icon images, all types of coded/formatted files (css, xml, javascript, etc.)?
- NOTE: In the ARL instructions for reporting counts for computer files, those files counted include "machine-readable files comprising data or programs that are locally held as part of the library's collections available to library clients. Examples are U.S. Census data tapes, sample research software, locally-mounted databases, and reference tools on CD-ROM, tape or disk."
- Would it be reasonable to include the sub-classes of compressed files, binary files, and executable files and to exclude all coded/formatted files?
- Would it be reasonable to only measure images if their size exceeded established size thresholds?
- Would it be meaningful in characterizing a collection or an archive to separately count (and report) certain formats within a class of content, for example PDF-formatted files/URIs, if their count exceeded a certain established threshold?
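If rules along these lines were adopted, they could be expressed as a small, configurable filter. The sketch below is only one possible shape: the rule names, the excluded format and category, and the 5,000-byte image threshold are all hypothetical values chosen for illustration, not proposals.

```python
from dataclasses import dataclass, field

@dataclass
class InclusionRules:
    # All defaults below are hypothetical examples, not agreed rules.
    excluded_formats: set = field(default_factory=lambda: {"image/x-icon"})
    excluded_categories: set = field(default_factory=lambda: {"Coded/Formatted"})
    min_image_bytes: int = 5_000

def is_counted(mime_type, category, size_bytes, rules):
    """Return True if a URI should be included in the reported counts."""
    if mime_type in rules.excluded_formats:
        return False
    if category in rules.excluded_categories:
        return False
    # Images below the size threshold are treated as page furniture and skipped.
    if mime_type.startswith("image/") and size_bytes < rules.min_image_bytes:
        return False
    return True

rules = InclusionRules()
print(is_counted("image/x-icon", "Image", 1_200, rules))
print(is_counted("application/pdf", "Document-like", 250_000, rules))
```

Reporting both the raw URI count and the filtered count would let readers see how much each rule changes the picture.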
Usage
The URL+timestamp might be considered the unique identifier for discrete objects in a Web archive. Is this the level at which usage ought to be tracked and usage reports generated for institutions?
- Unique identifiers for materials in the Web archive are:
- Needed for tracking usage at the unique identifier level
- Needed for identifying versions of the same identifier
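A minimal sketch of how a URL+timestamp identifier could serve both purposes follows. The 14-digit timestamp format, the example URLs, and the request log are assumptions made for the example.

```python
from collections import Counter, defaultdict
from typing import NamedTuple

class CaptureId(NamedTuple):
    url: str
    timestamp: str  # assumed format: 14 digits, YYYYMMDDhhmmss

def versions_by_url(captures):
    """Group capture timestamps by URL so versions of the same URL sit together."""
    grouped = defaultdict(list)
    for capture in captures:
        grouped[capture.url].append(capture.timestamp)
    return {url: sorted(stamps) for url, stamps in grouped.items()}

# Hypothetical request log: one CaptureId per successful request.
requests = [
    CaptureId("http://example.gov/", "20081115120000"),
    CaptureId("http://example.gov/", "20090120120000"),
    CaptureId("http://example.gov/", "20081115120000"),
    CaptureId("http://example.gov/about", "20081115120500"),
]

usage = Counter(requests)                   # usage per unique identifier
versions = versions_by_url(set(requests))   # versions per URL

print(usage[CaptureId("http://example.gov/", "20081115120000")])
print(versions["http://example.gov/"])
```

Because the identifier is a plain (URL, timestamp) pair, usage can be rolled up to the URL level simply by summing over its versions.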
Item Requests
For journal articles, ARL tracks the number of successful requests, as defined by COUNTER: Number of items requested by users as a result of a search. User requests include viewing, downloading, emailing and printing of items, where this activity can be recorded and controlled by the server rather than the browser. Turnaways will also be counted.
What types of viewing, downloading, emailing and printing would the Archive Service include that would be server-controlled versus browser-controlled? Are there probable use cases for Web archives? One such case might be an authentication service, in which certain publications are vetted by an authority and assigned a certificate of authenticity that could be requested by a user.
Quality
UKSG has been investigating a journal usage factor as a measure of the quality and value of online journals. It may be possible to logically extend the usage factor concept to Web archive content, thereby creating a quality measure for this class of materials.
Costs
It may be that the Web archive service providers will not charge for access to the materials in their collections. However, this may not always be the case.
It seems likely that different cost models will emerge as Web archives become more commonplace. It may be that certain collections within an institution's archive will be free, while other collections will have costs associated with them.
A tiered cost structure might work well. This could allow for both free and fee-based services; for example, a service provider could offer:
- Free basic discovery and access services
- Fee-based options and services:
- usage reports
- hosting
Usage Reports for Web Archives
For a start these could emulate the COUNTER usage reports for databases and journals. As such they would include:
- Sessions by Month by Collection
- Searches by Month by Collection
- Searches and Sessions by Year by Collection
- Searches and Sessions by Year by Archive
As appropriate, these reports could be produced for consortia as well as for individual institutions.
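To make the report shapes concrete, the sketch below builds a "sessions and searches by month by collection" summary from an event log. The event layout (date, collection, session id, event kind) and the sample events are assumptions for the example; they are not taken from COUNTER or from any provider's logs.

```python
from collections import defaultdict
from datetime import date

def monthly_report(events):
    """Summarize distinct sessions and search counts per (month, collection)."""
    sessions = defaultdict(set)
    searches = defaultdict(int)
    for day, collection, session_id, kind in events:
        key = (day.strftime("%Y-%m"), collection)
        sessions[key].add(session_id)  # a session is counted once per month
        if kind == "search":
            searches[key] += 1
    return {
        key: {"sessions": len(ids), "searches": searches[key]}
        for key, ids in sessions.items()
    }

# Hypothetical events: (date, collection, session id, event kind)
events = [
    (date(2009, 1, 5), "collection-a", "s1", "search"),
    (date(2009, 1, 5), "collection-a", "s1", "view"),
    (date(2009, 1, 9), "collection-a", "s2", "search"),
    (date(2009, 2, 2), "collection-b", "s3", "view"),
]
report = monthly_report(events)
print(report[("2009-01", "collection-a")])
```

Yearly and archive-level reports would follow from the same data by widening the grouping key.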
Content Description
There are at least two perspectives to content description for Web archives: user and provider.
User perspective.
We were concerned with one class of user, a library. We asked librarians serving as project SMEs what criteria their libraries used in making acquisition decisions. From their responses we discovered that describing an archive's content is essential and goes beyond measures of its scope. Further, libraries require consistency in content descriptions for the same type of materials (e.g., journal databases or Web collections) that are available from different providers. Content description allows a library to assess the breadth of applicability of all, or a portion of, a provider's content to a library's collection. For libraries, this assessment is fundamental in their material selection process.
We identified three attributes of content description to consistently describe a collection within a Web archive:
- Topical areas covered
- Unique or exclusive content available
- Dates materials were harvested
Provider perspective.
Content description is important to Web archive providers for a few reasons:
- To determine change-over-time for similar content captured at different points in time
- To identify content overlap among collections
It seems reasonable that, if reported in a consistent manner, these characteristics of a Web archive will promote access and discovery of materials. We wonder what other descriptive measures might help characterize materials in Web archives and further support user access and discovery. Two possibilities are:
- Leveraging page rank data, for example to suggest possibly related content as measured by the underlying Web graph
- Targeted, human-mediated, topical analysis of machine-identified clusters of related content
Summary thoughts on content description.
Common attributes.
The two perspectives share common attributes for content description. We can suggest the following:
- Topical areas addressed
- At a feasible level of effort, whether resulting from human mediation or machine analysis
- Unique or exclusive content available
- Dates materials in the collection were captured
- Measure of how the collection changed-over-time
- Analysis of collection's overlap with other known collections
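The last two attributes lend themselves to simple set comparisons. The sketch below is one possible approach, not a method proposed in the discussion above: it uses the Jaccard index over URL sets as an overlap measure, and a comparison of content digests as a change-over-time measure. The digests and URLs are made-up sample values.

```python
def overlap(urls_a, urls_b):
    """Jaccard index of two URL collections: shared URLs / all distinct URLs."""
    a, b = set(urls_a), set(urls_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def change_between(earlier, later):
    """Compare two captures given as {url: content digest} mappings."""
    added = later.keys() - earlier.keys()
    removed = earlier.keys() - later.keys()
    changed = {u for u in earlier.keys() & later.keys() if earlier[u] != later[u]}
    return {
        "added": sorted(added),
        "removed": sorted(removed),
        "changed": sorted(changed),
    }

# Hypothetical captures of the same collection at two points in time.
capture_2008 = {"http://example.gov/": "d1", "http://example.gov/old": "d2"}
capture_2009 = {"http://example.gov/": "d9", "http://example.gov/new": "d3"}

print(overlap(capture_2008, capture_2009))
print(change_between(capture_2008, capture_2009))
```

URL-level comparison is deliberately coarse; it would miss identical content served from different URLs, which a digest-based overlap measure could catch.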
Discovery and access.
- Utilize page rank data when providing search results
- Possibly a "view related content" feature
- Design experiments to measure effectiveness
- As it becomes feasible, provide users with a feature that identifies similar or related content on another Web archive
- For example, another capture of the material hosted by a different provider
Terminology
Source: DCMI Glossary
- document-like object (DLO)
- Originally defined as an entity that resembles a document from the standpoint that it is substantially text-based and shares other properties of a document; e.g., electronic mail messages or spreadsheets. The definition was expanded at the 3rd DC workshop to refer to any discrete information resource that are characterized by being fixed (i.e., having identical content for each user). Examples include text, images, movies, and performances.
- Resource
- A resource is anything that has identity. Familiar examples include an electronic document, an image, a service (e.g., "today's weather report for Los Angeles"), and a collection of other resources. Not all resources are network "retrievable"; e.g., human beings, corporations, and bound books in a library can also be considered resources. http://dublincore.org/documents/2003/04/02/dc-xml-guidelines/
- Type
- The Dublin Core element used to designate the nature or genre of the content of the resource. Type includes terms describing general categories, functions, genres, or aggregation levels for content. Recommended best practice is to select a value from a controlled vocabulary. See also "Using Dublin Core".
References
- Buckland, Michael K. What is a "Document"? Journal of the American Society for Information Science, Sep 1997, Vol. 48, Issue 9, p804-809.
- Abstract: Presents information on the definition of a document, while focusing on the development of a functional view, and discusses whether sculpture, museum, live animals and objects can be considered `documents'. Objectives of this article; Background information on document; Explanation of documentation in historical terms; Definition of documentation.
- Frohmann, Bernd. Revisiting "What is a Document?" Journal of Documentation, Jan 1, 2009, Vol. 65, Issue 2, p291-303.
- Abstract: Purpose -- The purpose of this paper is to provide a reconsideration of Michael Buckland's important question, "What is a document?," analysing the point and purpose of definitions of "document" and "documentation." Design/methodology/approach -- Two philosophical notions of the point of definitions are contrasted: John Stuart Mill's concept of a "real" definition, purporting to specify the nature of the definiendum; and a concept of definition based upon a foundationalist philosophy of language. Both conceptions assume that a general, philosophical justification for using words as we do is always in order. This assumption is criticized by deploying Hilary Putnam's arguments against the orthodox Wittgensteinian interpretation of criteria governing the use of language. The example of the cabinets of curiosities of the sixteenth-century English and European virtuosi is developed to show how one might productively think about what documents might be, but without a definition of a document. Findings -- Other than for specific, instrumentalist purposes (often appropriate for specific case studies), there is no general philosophical reason for asking, what is a document? There are good reasons for pursuing studies of documentation without the impediments of definitions of "document" or "documentation." Originality/value -- The paper makes an original contribution to the new interest in documentation studies by providing conceptual resources for multiplying, rather than restricting, the areas of application of the concepts of documents and documentation. Adapted from the source document.