Sustainability of Digital Formats
Full name | WARC (Web ARChive) file format |
---|---|
Description | The WARC (Web ARChive) format specifies a method for combining multiple digital resources into an aggregate archival file together with related information. The WARC format is a revision of the Internet Archive's ARC file format [ARC_IA], which has traditionally been used to store "web crawls" as sequences of content blocks harvested from the World Wide Web. The WARC format generalizes the older format to better support the harvesting, access, and exchange needs of archiving organizations. Besides the primary content currently recorded, the revision accommodates related secondary content, such as assigned metadata, abbreviated duplicate detection events, and later-date transformations. |
Production phase | Used for web-accessible content in an archived state, representing the final form in which the content was disseminated over the web to a user agent (web browser). |
Relationship to other formats | |
May contain | Data of various types; see Notes below |
Has earlier version | ARC_IA, Internet Archive ARC file format. |
LC experience or existing holdings | LC's web harvesting activities capture web sites in the WARC format. LC also has web archives in the predecessor ARC_IA format. |
---|---|
LC preference | WARC is LC's preferred format for Web sites harvested in bulk. |
Disclosure | Open standard, publicly documented, developed under the auspices of the International Internet Preservation Consortium. The format was submitted as a work item through ISO TC46/SC4 in May 2005 and approved as an International Standard in May 2009. ISO TC46/SC4/WG12, convened by the Bibliothèque nationale de France, is the working group responsible for maintenance. |
---|---|
Documentation | ISO 28500:2009, Information and documentation -- WARC file format, is available for purchase from ISO. The draft standard that was the basis for approval, ISO/DIS 28500, is at http://bibnum.bnf.fr/WARC/warc_ISO_DIS_28500.pdf. |
Adoption | The file format was designed to support the requirements of members of the International Internet Preservation Consortium. |
Licensing and patents | None. |
Transparency | The wrapper is transparent; contained data varies. |
Self-documentation | In WARC files containing the actual archived "documents" (HTML, GIF, JPEG, PS, etc.), each document is preceded by basic information about that document. |
External dependencies | User access depends on large-scale indexing of a corpus. |
Technical protection considerations | None. |
Web Archive | |
---|---|
Normal rendering | Supported through Internet Archive's Wayback Machine or equivalent tool. |
Documentation of harvesting context | Allows for substantial information about the time of harvesting, the IP address of the harvesting machine, Internet Media Type (MIME type) and response code for the harvest transaction, the purpose of harvesting, etc. |
Efficiency at scale | Excellent for efficient bulk harvesting and efficient indexing for access by URL and date. The structured record headers can be extracted and stored separately for efficient indexing. WARC supports duplicate elimination and compression to reduce file sizes for storage, transmission, and indexing after harvesting. |
Support for stewardship | WARC was developed as an extension to ARC in part to provide better capabilities for managing Web archives for the long term. See Web Sites and Pages: Quality and Functionality Factors. |
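
The rows above note that each record carries structured headers describing the harvest, and that those headers can be extracted and stored separately to index a corpus by URL and date. The following is a minimal sketch of that idea, not a reference implementation. It assumes the record layout defined in the WARC standard (a `WARC/1.0` version line, named fields such as `WARC-Type`, `WARC-Target-URI`, `WARC-Date`, and `Content-Length`, a blank line, the content block, and two trailing CRLFs), none of which is spelled out on this page. The helper names (`make_record`, `read_records`, `build_index`) and the sample URI and dates are hypothetical. The sketch handles only uncompressed data and ignores header continuation lines.

```python
import io

CRLF = b"\r\n"


def make_record(fields, block):
    """Serialise one record: version line, named fields, blank line, block, CRLF CRLF."""
    lines = [b"WARC/1.0"]
    lines += [f"{name}: {value}".encode("utf-8") for name, value in fields.items()]
    lines.append(f"Content-Length: {len(block)}".encode("utf-8"))
    return CRLF.join(lines) + CRLF + CRLF + block + CRLF + CRLF


def read_records(stream):
    """Yield (offset, headers, block) for each record in an uncompressed stream."""
    while True:
        offset = stream.tell()
        version = stream.readline()
        if not version:
            return  # end of stream
        if not version.startswith(b"WARC/"):
            raise ValueError(f"expected a WARC version line at offset {offset}")
        headers = {}
        while True:
            line = stream.readline()
            if line in (CRLF, b"\n", b""):
                break  # blank line ends the header section
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        block = stream.read(int(headers["Content-Length"]))
        stream.read(4)  # skip the two CRLFs that terminate the record
        yield offset, headers, block


def build_index(stream):
    """Map each target URI to a list of (capture date, byte offset) pairs."""
    index = {}
    for offset, headers, _block in read_records(stream):
        uri = headers.get("WARC-Target-URI")
        if uri is not None:
            index.setdefault(uri, []).append((headers["WARC-Date"], offset))
    return index


# Hypothetical sample: two captures of the same URI on different dates.
first_record = make_record(
    {"WARC-Type": "response",
     "WARC-Target-URI": "http://example.org/",
     "WARC-Date": "2009-05-01T12:00:00Z"},
    b"HTTP/1.1 200 OK\r\n\r\nfirst capture",
)
second_record = make_record(
    {"WARC-Type": "response",
     "WARC-Target-URI": "http://example.org/",
     "WARC-Date": "2009-06-01T12:00:00Z"},
    b"HTTP/1.1 200 OK\r\n\r\nsecond capture",
)
archive = first_record + second_record
index = build_index(io.BytesIO(archive))
```

Because the index stores byte offsets, a replay tool could seek directly to a capture without scanning the whole file. Production tools generally also handle compressed files, which this sketch does not attempt.
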
Tag | Value | Note |
---|---|---|
Filename extension | warc | WARC files are not typically transmitted to users or used in ways that depend on recognition by file type. |
Internet Media Type | application/warc |
General | The WARC file format is a revision and generalization of the ARC format used by the Internet Archive to store information blocks harvested by web crawlers. |
---|---|
History | An HTML version of WARC File Format (Version 0.9) is at http://archive-access.sourceforge.net/warc/warc_file_format-0.9.html. Subsequent drafts are also available at http://archive-access.sourceforge.net/warc/ in various formats. |