Sustainability of Digital Formats
 Planning for Library of Congress Collections


WARC, Web ARChive file format

Format Description Properties

Identification and description

Full name WARC (Web ARChive) file format
Description The WARC (Web ARChive) format specifies a method for combining multiple digital resources into an aggregate archival file together with related information. The WARC format is a revision of the Internet Archive's ARC File Format [ARC_IA], which has traditionally been used to store "web crawls" as sequences of content blocks harvested from the World Wide Web. The WARC format generalizes the older format to better support the harvesting, access, and exchange needs of archiving organizations. Besides the primary content currently recorded, the revision accommodates related secondary content, such as assigned metadata, abbreviated duplicate detection events, and later-date transformations.
Production phase Used for web-accessible content in its archived state, representing the final form disseminated over the web to a user agent (web browser).
Relationship to other formats
    May contain Data of various types; see Notes below
    Has earlier version ARC_IA
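
The aggregate structure given in the Description can be sketched in a few lines. This is an illustration under stated assumptions, not the method defined by the standard: the version line WARC/1.0 and the header names follow the published ISO 28500 text and may differ in the draft cited on this page, and build_record and the example.org values are hypothetical.

```python
import uuid
from datetime import datetime, timezone


def build_record(record_type: str, target_uri: str, content_type: str, block: bytes) -> bytes:
    """Serialize one record: version line, named headers, blank line, block."""
    headers = [
        ("WARC-Type", record_type),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", content_type),
        ("Content-Length", str(len(block))),
    ]
    head = "WARC/1.0\r\n" + "".join(f"{k}: {v}\r\n" for k, v in headers) + "\r\n"
    # Assumption: each record is closed by two CRLF pairs after the block.
    return head.encode("utf-8") + block + b"\r\n\r\n"


# Primary content plus related secondary content (here, assigned metadata),
# concatenated into a single aggregate file body.
page = build_record("resource", "http://example.org/", "text/html", b"<html></html>")
note = build_record(
    "metadata",
    "http://example.org/",
    "application/warc-fields",
    b"description: test capture\r\n",
)
aggregate = page + note
```

Because every record carries its own length, records can be appended to the aggregate without rewriting what is already stored.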

Local use

LC experience or existing holdings LC has large volumes of captured web sites in the predecessor ARC_IA format.
LC preference LC's preferred formats for Web sites harvested in bulk are ARC_IA and WARC. As capture tools are developed to support WARC, WARC will be preferred to ARC_IA.

Sustainability factors

Disclosure Developed under the auspices of the International Internet Preservation Consortium. Proposed in May 2006 as an ISO standard through ISO TC46/SC4 and accepted as a fast-track work item. In December 2007, ISO/CD 28500 was approved for circulation to national standards bodies as a Draft International Standard (DIS).
    Documentation ISO/DIS 28500 is at http://bibnum.bnf.fr/WARC/warc_ISO_DIS_28500.pdf.
Adoption The file format was designed to support the requirements of members of the International Internet Preservation Consortium.
    Licensing and patents None.
Transparency The wrapper is transparent; contained data varies.
Self-documentation In the WARC files containing the actual archived "documents" (html, gif, jpeg, ps, etc.) each document is preceded by information about the document.
External dependencies User access depends on large-scale indexing of a corpus.
Technical protection considerations None.
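
The self-documentation noted above, where each archived document is preceded by information about it, can be illustrated by splitting one record into its header and its content block. This is a minimal sketch assuming an uncompressed record, one header per line, and the header names of the published ISO 28500 text; parse_record and the sample record are hypothetical.

```python
def parse_record(record: bytes):
    """Split one uncompressed record into (version, headers, block)."""
    head, _, rest = record.partition(b"\r\n\r\n")
    lines = head.decode("utf-8").split("\r\n")
    version = lines[0]
    headers = {}
    for line in lines[1:]:
        # Split on the first colon only, since URIs and dates contain colons.
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    length = int(headers["content-length"])
    return version, headers, rest[:length]


sample = (
    b"WARC/1.0\r\n"
    b"WARC-Type: resource\r\n"
    b"WARC-Target-URI: http://example.org/logo.gif\r\n"
    b"WARC-Date: 2008-09-05T12:00:00Z\r\n"
    b"Content-Type: image/gif\r\n"
    b"Content-Length: 6\r\n"
    b"\r\n"
    b"GIF89a"
    b"\r\n\r\n"
)
version, headers, block = parse_record(sample)
```

The header alone identifies what the document is, where it came from, and when it was captured, without the content block being decoded.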

Quality and functionality factors

Web Archive
Normal rendering Supported through Internet Archive's Wayback Machine or equivalent tool.
Documentation of harvesting context Allows for substantial information about the time of harvesting, the IP address of the harvesting machine, Internet Media Type (MIME type) and response code for the harvest transaction, the purpose of harvesting, etc.
Efficiency at scale Excellent for efficient bulk harvesting and efficient indexing for access by URL and date. The structured record headers can be extracted and stored separately for efficient indexing. WARC supports duplicate elimination and compression to reduce file sizes for storage, transmission, and indexing after harvesting.
Support for stewardship WARC was developed as an extension to ARC, in part to provide better capabilities for managing Web archives for the long term. See Web Sites and Pages: Quality and Functionality Factors.
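
The indexing and harvesting-context points above can be combined in one sketch: record headers are read without decoding payloads, and each captured HTTP response contributes its response code and MIME type to an index keyed by URL and date. The assumptions are an uncompressed file, header names from the published ISO 28500 text, and response records whose block holds the full HTTP response; all function names are hypothetical, and duplicate elimination and compression are not shown.

```python
def iter_records(data: bytes):
    """Yield (offset, headers, block) for each record in an uncompressed WARC body."""
    pos = 0
    while pos < len(data):
        head_end = data.index(b"\r\n\r\n", pos)
        lines = data[pos:head_end].decode("utf-8").split("\r\n")
        headers = {}
        for line in lines[1:]:  # lines[0] is the version line
            name, _, value = line.partition(":")
            headers[name.strip().lower()] = value.strip()
        start = head_end + 4
        end = start + int(headers["content-length"])
        yield pos, headers, data[start:end]
        pos = end + 4  # skip the two CRLF pairs that close the record


def http_status_and_type(block: bytes):
    """Read the response code and MIME type from a captured HTTP response."""
    lines = block.partition(b"\r\n\r\n")[0].decode("iso-8859-1").split("\r\n")
    status = int(lines[0].split(" ", 2)[1])
    mime = None
    for line in lines[1:]:
        name, _, value = line.partition(":")
        if name.strip().lower() == "content-type":
            mime = value.split(";")[0].strip().lower()
    return status, mime


def build_index(data: bytes):
    """Return sorted (uri, date, offset, status, mime) rows for response records."""
    rows = []
    for offset, headers, block in iter_records(data):
        if headers.get("warc-type") != "response":
            continue
        status, mime = http_status_and_type(block)
        rows.append((headers["warc-target-uri"], headers["warc-date"], offset, status, mime))
    return sorted(rows)


def make_record(warc_type: str, uri: str, date: str, block: bytes) -> bytes:
    """Hypothetical helper that builds a small test record."""
    head = (
        f"WARC/1.0\r\nWARC-Type: {warc_type}\r\nWARC-Target-URI: {uri}\r\n"
        f"WARC-Date: {date}\r\nContent-Length: {len(block)}\r\n\r\n"
    )
    return head.encode("utf-8") + block + b"\r\n\r\n"


http = b"HTTP/1.1 200 OK\r\nContent-Type: text/html; charset=UTF-8\r\n\r\n<html></html>"
first = make_record("response", "http://example.org/", "2008-09-05T12:00:00Z", http)
second = make_record("response", "http://example.org/", "2008-09-06T12:00:00Z", http)
index = build_index(first + second)
```

Each row stores a byte offset, so an access tool can seek directly to a capture of a given URL and date rather than scanning the whole file.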

File type signifiers

Tag | Value | Note
Filename extension | warc | WARC files are not typically transmitted to users or used in ways that depend on recognition by file type.
Internet Media Type | application/warc |
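
Since recognition by file type is not normally relied upon, a tool may prefer to inspect the leading bytes of a file. The sketch below assumes that a record begins with a version line starting with "WARC/" and that compressed files use gzip; both are assumptions drawn from the published ISO 28500 text rather than from this page, and looks_like_warc is a hypothetical name.

```python
import gzip
import zlib


def looks_like_warc(head: bytes) -> bool:
    """Guess from the first bytes of a file whether it is a (possibly gzipped) WARC."""
    if head[:2] == b"\x1f\x8b":  # gzip magic number
        try:
            # wbits=31 selects gzip framing; decode at most the first 5 bytes.
            head = zlib.decompressobj(31).decompress(head, 5)
        except zlib.error:
            return False
    return head.startswith(b"WARC/")


plain = looks_like_warc(b"WARC/1.0\r\nWARC-Type: warcinfo\r\n")
zipped = looks_like_warc(gzip.compress(b"WARC/1.0\r\nWARC-Type: warcinfo\r\n"))
other = looks_like_warc(b"<html></html>")
```

Only a short prefix of the file is needed, which keeps the check cheap on large archive files.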
 

Notes

General The WARC file format is a revision and generalization of the ARC format used by the Internet Archive to store information blocks harvested by web crawlers.
History An HTML version of WARC File Format (Version 0.9) is at http://archive-access.sourceforge.net/warc/warc_file_format-0.9.html. Subsequent drafts are also available at http://archive-access.sourceforge.net/warc/ in various formats.

Format specifications


Last Updated: 09/05/2008