About Web Capture Activities at the Library of Congress
-
Why is the Library of Congress collecting and creating an archive of Web sites?
The Library of Congress, along with libraries and archives around the world, is interested in collecting and preserving the Web because an ever-increasing amount of the world’s cultural and intellectual output is created in digital formats and does not exist in any physical form. Creating an archive of Web sites supports the goals of the Library’s Digital Strategic Plan, announced in March 2003, which focuses on the collection and management of digital content.
-
How does the Library’s Web Capture program relate to the National Digital Information Infrastructure and Preservation Program (NDIIPP)?
In 2004 the Library formed the Web Capture team to support the National Digital Information Infrastructure and Preservation Program’s (NDIIPP) strategic goal to manage and sustain at-risk digital content. The team’s focus has been on technologies, tools, and infrastructure to assist Library staff in building born-digital Web collections.
In the Library’s role as administrator of NDIIPP, the Web Capture team also collaborates on two NDIIPP projects: The Web-at-Risk: A Distributed Approach to Preserving Our Nation’s Political Cultural Heritage, led by the California Digital Library, and The ECHO DEPository, led by the University of Illinois at Urbana-Champaign.
-
How large is the Library’s archive?
As of May 2008, the Library has collected more than 82.6 terabytes of data.
-
What kinds of Web sites does the Library archive?
Library of Congress recommending officers, or curators, select a variety of Web sites to archive, depending on the theme of the collection activity. The Library’s MINERVA project was the initial pilot effort to capture Web sites. Event-based or thematic collections publicly available through the MINERVA site include Election 2002, September 11, Election 2004, and the 107th Congress Web archive.
Categories of sites captured include, but are not limited to: United States government (federal, state, district, local), foreign government, candidates for political office, political commentary, political party, media, religious organizations, support groups, tributes and memorials, advocacy groups, educational and research institutions, creative expressions (cartoons, poetry, etc.), and blogs.
-
Do you have published selection guidelines?
A collections policy statement and other internal documents describe current policies for the selection of electronic resources. In addition, selection criteria are developed for each collection, and our publicly accessible Web archives make their collection-specific selection criteria available for review.
-
Are other libraries and organizations doing similar work?
Since 1996, the Internet Archive has archived 40 billion Web pages. The Library of Congress contracts with the Internet Archive for many of its Web capture projects.
U.S. federal government agencies, including the National Archives and Records Administration (NARA) and the Government Printing Office, collect official Web content from the U.S. government. NARA documented federal agencies' presence on the World Wide Web when the presidential administration's term ended in early 2005.
The Library of Congress also works closely with members of the International Internet Preservation Consortium (IIPC), which was formed in 2003 to enable the collection of a rich body of Internet content from around the world and to foster the development and use of common tools, techniques, and standards for creating international archives. The Library of Congress is a founding member of the consortium, whose members include the national libraries of Australia, Canada, Denmark, Finland, France, Iceland, Italy, Norway, Sweden, and the United Kingdom, as well as the Internet Archive (U.S.).
-
How do I ask a question about Web Capture projects at the Library of Congress?
Use our online form to contact the Library about its Web Capture activities.
How Web Capture Works
-
What is the difference between Web Capture, Web harvesting, Web archiving, and Web collecting?
These terms are used interchangeably to describe the same activity.
-
How does the Library archive Web sites?
The Library or its agent makes a copy of a Web site using an open-source archival-quality Web crawler called Heritrix. The Library uses other in-house tools to manage the selection and permissions process.
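To illustrate the general idea, the following sketch shows a much-simplified crawl in Python: starting from a seed URL, it fetches pages, records each response with a capture timestamp, and follows links within the same site. It is an illustration only, not Heritrix or any tool the Library actually uses, and the function name crawl_snapshot is hypothetical.

```python
# Much-simplified sketch of an archival crawl. This is NOT Heritrix or a
# Library of Congress tool; the function name and limits are hypothetical.
import datetime
import re
import urllib.parse
import urllib.request

def crawl_snapshot(seed_url, max_pages=50):
    """Fetch up to max_pages pages reachable from seed_url on the same host."""
    host = urllib.parse.urlparse(seed_url).netloc
    queue, seen, snapshot = [seed_url], set(), {}
    while queue and len(snapshot) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            with urllib.request.urlopen(url, timeout=10) as response:
                body = response.read()
        except Exception:
            continue  # skip pages that cannot be fetched
        # Store the content together with its capture time: the "snapshot".
        snapshot[url] = (datetime.datetime.utcnow().isoformat(), body)
        # Queue links, but only those that stay on the seed's host.
        for href in re.findall(rb'href="([^"]+)"', body):
            link = urllib.parse.urljoin(url, href.decode("utf-8", "ignore"))
            if urllib.parse.urlparse(link).netloc == host:
                queue.append(link)
    return snapshot
```

A production crawler such as Heritrix adds, among many other things, politeness controls, robots.txt handling, and output in standard archival container formats.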
-
How much of a Web site is collected?
The Library’s goal is to create an archival copy, essentially a snapshot, of how the site appeared at a particular point in time. Depending on the collection, the Library captures as much of the site as possible, including HTML pages, images, Flash, PDFs, and audio and video files, to provide context for future researchers. The Heritrix crawler is currently unable to capture streaming media, "deep Web" or database content requiring user input, and content requiring payment or a subscription for access. In addition, there will always be some Web sites that take advantage of emerging or unusual technologies that the crawler cannot anticipate.
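As a rough, hypothetical illustration of why some content is out of reach, the sketch below shows the kind of URL-based scope check a crawler might apply; the rules shown are not the Library's actual configuration.

```python
# Hypothetical sketch of a crawl-scope check. Not the Library's configuration.
from urllib.parse import urlparse

STREAMING_SCHEMES = {"rtsp", "mms"}  # streaming media served outside plain HTTP

def in_scope(url, seed_host):
    """Return True if a crawler could plausibly fetch this URL for a snapshot."""
    parts = urlparse(url)
    if parts.scheme in STREAMING_SCHEMES:
        return False              # streaming media cannot be captured
    if parts.netloc != seed_host:
        return False              # stay within the site being archived
    return parts.scheme in ("http", "https")

# Database-driven ("deep Web") content, and pages behind logins or paywalls,
# are unreachable for a different reason: the crawler never encounters a link
# to them, so no URL rule can bring them into scope.
```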
-
Do you capture all identifying site documentation, including URL, trademark, copyright statement, ownership, publication date, etc.?
We attempt to reproduce a site as completely as possible for archival purposes.