ARC_IA, Internet Archive ARC file format

Sustainability of Digital Formats Planning for Library of Congress Collections

Table of Contents

Identification and description
Local use
Sustainability factors
Quality and functionality factors
File type signifiers
Notes
Format specifications
Useful references

Format Description Properties

ID: fdd000235
Short name: ARC_IA
Content categories: aggregate, web-archive
Format Category: file-format
Other facets: container-wrapper
Last significant update: 2008-02-14
Draft status: Partial

Identification and description

Full name	ARC_IA, Internet Archive ARC file format.
Description	Specifies a method for combining multiple digital resources into an aggregate archival file together with related information, used since 1996 by the Internet Archive to store 'web crawls' as sequences of content blocks harvested from the World Wide Web.
Production phase	Used for web-accessible content in its archived state, that is, the final form in which the content was disseminated over the web to a user agent (web browser).
Relationship to other formats
May contain	Data of various types, for example, HTML pages and images in GIF, JPEG, and other formats.
Has later version	WARC

Local use

LC experience or existing holdings	LC has large volumes of captured web sites in the ARC_IA format. See http://www.loc.gov/webcapture/
LC preference	LC's preferred formats for Web sites harvested in bulk are ARC_IA and WARC. As capture tools are developed to support WARC, WARC will be preferred to ARC_IA.

Sustainability factors

Disclosure	Developed by the Internet Archive (Brewster Kahle). Documentation and tools to use files in the format are freely available.
Documentation	Described at http://www.archive.org/web/researcher/ArcFileFormat.php
Adoption	Used by the Heritrix web crawler, which is supported by the International Internet Preservation Consortium.
Licensing and patents	None.
Transparency	The wrapper is transparent; contained data varies.
Self-documentation	In the ARC files containing the actual archived "documents" (html, gif, jpeg, ps, etc.) each document is preceded by some header information about the document: the document file format, the document size, outward links that the document contains, etc. At the Internet Archive, each ARC file has a corresponding DAT file that contains only the header information.
External dependencies	User access depends on large-scale indexing of a corpus of ARC files or a separate copy of the record headers (e.g. Internet Archive DAT files). Indexing the DAT files can support user access by URL and date, as in the Wayback Machine.
Technical protection considerations	None.
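
The self-documenting layout described above, in which each archived document is preceded by header information, can be illustrated with a minimal Python sketch. The layout is an assumption drawn from the version 1 URL record in the Internet Archive's ARC specification: a single space-separated line giving the URL, IP address, 14-digit archive date, content type, and body length in bytes, followed by the body. The names ArcHeader, parse_header, and iter_records are hypothetical, and the sketch assumes an uncompressed buffer; it is not the Internet Archive's own tooling.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterator, Tuple


@dataclass(frozen=True)
class ArcHeader:
    """Header fields of one record (assumed version 1 layout)."""
    url: str
    ip_address: str
    archive_date: datetime
    content_type: str
    length: int  # size of the document body in bytes


def parse_header(line: str) -> ArcHeader:
    """Parse one space-separated, five-field header line."""
    fields = line.strip().split(" ")
    if len(fields) != 5:
        raise ValueError(f"expected 5 header fields, got {len(fields)}")
    url, ip_address, date, content_type, length = fields
    return ArcHeader(
        url=url,
        ip_address=ip_address,
        archive_date=datetime.strptime(date, "%Y%m%d%H%M%S"),
        content_type=content_type,
        length=int(length),
    )


def iter_records(data: bytes) -> Iterator[Tuple[int, ArcHeader, bytes]]:
    """Yield (offset, header, body) for each record in an uncompressed buffer."""
    pos = 0
    while pos < len(data):
        # Skip the newline(s) that separate one record from the next.
        if data[pos:pos + 1] == b"\n":
            pos += 1
            continue
        end_of_line = data.index(b"\n", pos)
        header = parse_header(data[pos:end_of_line].decode("latin-1"))
        body_start = end_of_line + 1
        # The body is read by its declared length, not by scanning for a delimiter.
        body = data[body_start:body_start + header.length]
        yield pos, header, body
        pos = body_start + header.length
```

Because the body is read using the length declared in the header, a reader can skip from record to record without interpreting the contained data, which is consistent with the wrapper being transparent while the contained data varies.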

Quality and functionality factors

Web Archive
Normal rendering	Supported through Internet Archive's Wayback Machine or equivalent tool.
Documentation of harvesting context	Allows for basic information about the time of harvesting, the IP address of the harvesting machine, Internet Media Type (MIME type) and response code for the harvest transaction, etc.
Efficiency at scale	Excellent for efficient bulk harvesting and efficient indexing for access by URL and date. The use of coordinated ARC and DAT files is one way to support efficient indexing for such access.
Support for stewardship	The capabilities in ARC that support long-term management of a corpus of web archive files are basic. WARC was developed as an extension to ARC, in part to provide better capabilities for managing Web archives for the long term. See Web Sites and Pages: Quality and Functionality Factors.
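
As a rough illustration of access by URL and date, the following Python sketch keeps, for each URL, a date-sorted list of captures and returns the capture closest to a requested date. It is a minimal in-memory sketch, not a description of how the Wayback Machine or the DAT files are actually implemented; the names Capture and CaptureIndex, and the arc_file and offset fields, are hypothetical.

```python
import bisect
from collections import defaultdict
from datetime import datetime
from typing import Dict, List, NamedTuple, Optional


class Capture(NamedTuple):
    captured_at: datetime
    arc_file: str  # hypothetical: name of the ARC file holding the record
    offset: int    # hypothetical: byte offset of the record in that file


class CaptureIndex:
    """Maps each URL to its captures, kept sorted by capture time."""

    def __init__(self) -> None:
        self._by_url: Dict[str, List[Capture]] = defaultdict(list)

    def add(self, url: str, capture: Capture) -> None:
        # insort keeps the per-URL list ordered as entries arrive.
        bisect.insort(self._by_url[url], capture)

    def closest(self, url: str, when: datetime) -> Optional[Capture]:
        """Return the capture of `url` nearest in time to `when`, if any."""
        captures = self._by_url.get(url)
        if not captures:
            return None
        # Binary search for the insertion point, then compare its neighbours.
        i = bisect.bisect_left(captures, (when,))
        candidates = captures[max(i - 1, 0):i + 1]
        return min(candidates, key=lambda c: abs(c.captured_at - when))
```

An index of this kind needs only the header fields of each record, which is why a separate copy of the headers can be enough to support lookup without reading the archived documents themselves.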

File type signifiers

Tag	Value	Note
Filename extension	.arc	ARC files are not typically transmitted to users or used in ways that depend on recognition by file type.

Notes

General
History

Format specifications

Internet Archive: Research Access, ARC file format (http://www.archive.org/web/researcher/ArcFileFormat.php).
Internet Archive: Research Access, DAT file format (http://www.archive.org/web/researcher/dat_file_format.php).

Useful references

URLs

Internet Archive: Research Access, Data Available (http://www.archive.org/web/researcher/data_available.php).
Internet Archive: Wayback Machine (http://web.archive.org/collections/web/advanced.html).
Heritrix developer documentation: Chapter 13. Internet Archive ARC files (http://crawler.archive.org/articles/developer_manual/arcs.html).

Last Updated: 02/14/2008