Sustainability of Digital Formats
Full name | ARC_IA, Internet Archive ARC file format. |
---|---|
Description | Specifies a method for combining multiple digital resources into an aggregate archival file together with related information, used since 1996 by the Internet Archive to store 'web crawls' as sequences of content blocks harvested from the World Wide Web. |
Production phase | Used for web-accessible content in its archived state, representing the final form in which the content was disseminated over the web to a user agent (web browser). |
Relationship to other formats | |
May contain | Data of various types, for example, HTML pages, images as GIF, JPEG, etc. |
Has later version | WARC |
LC experience or existing holdings | LC has large volumes of captured web sites in the ARC_IA format. See http://www.loc.gov/webcapture/ |
---|---|
LC preference | LC's preferred formats for Web sites harvested in bulk are ARC_IA and WARC. As capture tools are developed to support WARC, WARC will be preferred to ARC. |
Disclosure | Developed by the Internet Archive (Brewster Kahle). Documentation and tools for using files in the format are freely available. |
---|---|
Documentation | Described at http://www.archive.org/web/researcher/ArcFileFormat.php |
Adoption | The file format developed for the Heritrix web crawler, supported by the International Internet Preservation Consortium. |
Licensing and patents | None. |
Transparency | The wrapper is transparent; contained data varies. |
Self-documentation | In the ARC files containing the actual archived "documents" (html, gif, jpeg, ps, etc.) each document is preceded by some header information about the document: the document file format, the document size, outward links that the document contains, etc. At the Internet Archive, each ARC file has a corresponding DAT file that contains only the header information. |
External dependencies | User access depends on large-scale indexing of a corpus of ARC files or a separate copy of the record headers (e.g. Internet Archive DAT files). Indexing the DAT files can support user access by URL and date, as in the Wayback Machine. |
Technical protection considerations | None. |
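
The per-document header described under Self-documentation can be illustrated with a short parser. This is a minimal sketch, assuming the version 1 layout in which each record begins with one space-separated line of the form `URL IP-address Archive-date Content-type Archive-length`, with the date written as 14 digits (YYYYMMDDhhmmss). The names `ArcRecordHeader` and `parse_header_line` are hypothetical and do not come from any Internet Archive tool; other ARC versions carry additional fields that this sketch does not handle.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class ArcRecordHeader:
    """Fields of an assumed ARC version 1 URL-record header line."""
    url: str
    ip_address: str
    archive_date: datetime
    content_type: str
    length: int  # size in bytes of the content block that follows


def parse_header_line(line: str) -> ArcRecordHeader:
    """Parse one header line (hypothetical helper, version 1 layout only)."""
    # Split from the right so that a URL containing spaces stays in one piece.
    url, ip_address, date, content_type, length = line.strip().rsplit(" ", 4)
    return ArcRecordHeader(
        url=url,
        ip_address=ip_address,
        archive_date=datetime.strptime(date, "%Y%m%d%H%M%S"),
        content_type=content_type,
        length=int(length),
    )


header = parse_header_line(
    "http://www.example.org/ 192.0.2.1 19970417175710 text/html 200\n"
)
print(header.url, header.archive_date.isoformat(), header.length)
```

A reader would use `length` to skip over or extract the content block that follows the header line, which is what makes sequential scanning of a large ARC file practical.
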
Web Archive | |
---|---|
Normal rendering | Supported through Internet Archive's Wayback Machine or equivalent tool. |
Documentation of harvesting context | Allows for basic information about the time of harvesting, the IP address of the harvesting machine, Internet Media Type (MIME type) and response code for the harvest transaction, etc. |
Efficiency at scale | Excellent for efficient bulk harvesting and efficient indexing for access by URL and date. The use of coordinated ARC and DAT files is one way to support efficient indexing for such access. |
Support for stewardship | The capabilities in ARC that support long-term management of a corpus of web archive files are basic. WARC was developed as an extension to ARC, in part to provide better capabilities for managing Web archives for the long term. See Web Sites and Pages: Quality and Functionality Factors. |
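
Access by URL and date, as mentioned under Efficiency at scale, can be sketched as an in-memory index that maps each URL to a date-sorted list of captures. This is an illustration only and not a description of how the Wayback Machine is implemented. The class `CaptureIndex`, its methods, and the idea of storing an ARC file name plus byte offset per capture are assumptions made for the example; timestamps are assumed to be 14-digit strings (YYYYMMDDhhmmss), which sort chronologically when compared as text.

```python
import bisect
from collections import defaultdict


class CaptureIndex:
    """Hypothetical URL-and-date index over record headers."""

    def __init__(self):
        # url -> sorted list of (timestamp, arc_file, offset) tuples
        self._by_url = defaultdict(list)

    def add(self, url, timestamp, arc_file, offset):
        # insort keeps each per-URL list ordered by timestamp.
        bisect.insort(self._by_url[url], (timestamp, arc_file, offset))

    def lookup(self, url, timestamp):
        """Return the latest capture at or before timestamp, or None."""
        captures = self._by_url.get(url, [])
        # The sentinel sorts after any real file name, so a capture made
        # exactly at `timestamp` falls to the left of the insertion point.
        i = bisect.bisect_right(captures, (timestamp, chr(0x10FFFF)))
        return captures[i - 1] if i else None


index = CaptureIndex()
index.add("http://example.org/", "19970417175710", "a.arc", 0)
index.add("http://example.org/", "20010101000000", "b.arc", 512)
print(index.lookup("http://example.org/", "19991231235959"))
```

The lookup is a binary search, so its cost grows with the logarithm of the number of captures for a URL. A production index over a large corpus would normally live on disk, but the sorted-by-URL-then-date ordering is the same idea.
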
Tag | Value | Note |
---|---|---|
Filename extension | .arc | ARC files are not typically transmitted to users or used in ways that depend on recognition by file type. |
General | |
---|---|
History |