Harvesting the Sustainability of Digital Formats Site

The following is a guest post by Jimi Jones, Digital Audiovisual Formats Specialist with the Office of Strategic Initiatives.

The World Wide Web is a complex and constantly evolving network of linkages. Maintaining access to web content can be very challenging because content producers can change or remove pages or entire sites at any time. We’ve all experienced the frustration of broken links and the inability to find content we just knew was there yesterday.

Digital Formats Sustainability website

The Library of Congress’ Sustainability of Digital Formats website draws upon many web resources to describe and assess digital formats. To do this, we consult online specifications and technical information. When the web resources we link to vanish, it becomes difficult or impossible for the Formats team to get the information we need to make informed decisions about the formats we assess.

For this reason, the Formats site team – Caroline Arms, Carl Fleischhauer and Jimi Jones – worked with Gina Jones, Nicholas Taylor and Pranay Pramod of the Repository Development Center to capture web resources to which the Formats site links. By capturing these resources before they can vanish, we are building a reference library of sites to help us create a better site. For more about the Sustainability of Digital Formats website and what it does, take a look at Carl’s blog post here.

The concept of web archiving is fairly straightforward, even though the practice can be quite complex. The user points the archiving software at a particular web resource and the software “captures” or saves the information as “an aggregate archival file together with related information” to our secure space (see our discussion of WARC and ARC file formats). The chief benefit of WARC is that it is an ISO standardized format.

Standardization goes a long way towards the sustainability of a format, and the Formats team believes in practicing what we preach! Once we have “parked” the information in this new location, we can adjust our links to point to these new, more stable locations and not have to worry that the producers of the original content will move it or make it inaccessible to the Formats team. For more about the mechanics of web archiving, take a look at Nicholas Taylor’s excellent blog post here.
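To make the idea of an “aggregate archival file” a little more concrete, here is a minimal sketch of how a single WARC record could be assembled by hand. It is an illustration only, not the software used for this harvest: the function name `build_warc_record` and the example URL are hypothetical, and a real crawler writes many more record types and fields than this.

```python
import uuid
from datetime import datetime, timezone


def build_warc_record(target_uri: str, payload: bytes,
                      content_type: str = "text/html") -> bytes:
    """Assemble one WARC/1.0 'resource' record for an already-fetched payload.

    Hypothetical helper for illustration; it covers only the mandatory
    fields plus the target URI and content type.
    """
    # WARC dates are UTC timestamps in the form YYYY-MM-DDThh:mm:ssZ.
    warc_date = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    header_lines = [
        "WARC/1.0",
        "WARC-Type: resource",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"WARC-Date: {warc_date}",
        f"WARC-Target-URI: {target_uri}",
        f"Content-Type: {content_type}",
        f"Content-Length: {len(payload)}",
    ]
    # Header lines end with CRLF, a blank line separates header from content,
    # and two CRLFs close the record.
    header = "\r\n".join(header_lines) + "\r\n\r\n"
    return header.encode("utf-8") + payload + b"\r\n\r\n"


# Example: wrap a tiny (made-up) HTML payload in a record.
record = build_warc_record("http://example.org/spec.html", b"<html>spec</html>")
```

Because every record carries its own length, date and source URL, many captures can be concatenated into one file and still be read back one record at a time.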

robots.txt felt robot by user silvertje on Flickr

This all sounds so simple, doesn’t it? It turns out that we had a few challenges with the web harvesting of the Formats site. The first challenge is common to web archiving and concerns the “Robots Exclusion Standard.” Our harvest of the Formats site complied with the robots.txt convention, which helped us choose which directories to harvest from the websites linked from the Formats pages. In some cases, complying with this convention ended up excluding content that we were interested in archiving, but we were often able to work around this limitation by finding other locations for the same information. If the Formats team had more time, we would have considered soliciting permission from site owners to bypass their robots.txt exclusions.
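For readers who have not worked with robots.txt, the following sketch shows how a crawler might check seed URLs against a site's exclusion rules using Python's standard `urllib.robotparser` module. The robots.txt content, the URLs and the user-agent name `example-harvester` are all made up for illustration; this is not the configuration used in our harvest.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt that excludes one directory for all crawlers.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def filter_seeds(robots_txt, seed_urls, user_agent="example-harvester"):
    """Split seed URLs into (allowed, excluded) according to robots.txt rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    allowed, excluded = [], []
    for url in seed_urls:
        if parser.can_fetch(user_agent, url):
            allowed.append(url)
        else:
            excluded.append(url)
    return allowed, excluded


allowed, excluded = filter_seeds(
    EXAMPLE_ROBOTS_TXT,
    [
        "http://example.org/public/spec.pdf",
        "http://example.org/private/spec.pdf",
    ],
)
```

The excluded list is the interesting one in practice: those are the seeds for which an alternative location, or permission from the site owner, would be needed.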

The next challenge was the sheer number of links from the Formats site. The Formats team pulls from hundreds of online resources to inform our assessment work. Gina, Nicholas and Pranay used a tool to extract all the links from the Formats site. The two teams worked to put these “seed” URLs into a spreadsheet. That action revealed that our collection included 1,280 URLs. In many cases, we collect not only the seed page but also a number of related pages at the same website, so the total number of captured pages is even higher. Then Gina, Nicholas and Pranay used the list of seed URLs as an input to the web archiving software to “crawl” the links and capture the corresponding web pages.
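The link-extraction step can be sketched in a few lines of Python. This is not the tool the team used (the post does not name it); it is a rough illustration built on the standard `html.parser` module. The class and function names, the example host names, and the choice to keep only links pointing away from the site itself are all assumptions.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collect the absolute target of every <a href="..."> on one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links and drop any #fragment.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                self.links.append(absolute)


def extract_seeds(pages, site_host):
    """pages maps page URL -> HTML text; returns sorted, de-duplicated seeds.

    Assumption: only http(s) links to other hosts count as seeds.
    """
    seeds = set()
    for page_url, html in pages.items():
        extractor = LinkExtractor(page_url)
        extractor.feed(html)
        for link in extractor.links:
            parsed = urlparse(link)
            if parsed.scheme in ("http", "https") and parsed.netloc != site_host:
                seeds.add(link)
    return sorted(seeds)


example_pages = {
    "http://formats.example/fdd/a.html": (
        '<a href="http://spec.example/doc.pdf#page=2">spec</a> '
        '<a href="b.html">next</a> '
        '<a href="http://spec.example/doc.pdf">spec again</a>'
    )
}
seeds = extract_seeds(example_pages, "formats.example")
```

De-duplicating matters at this scale: the same specification is often cited from several pages, and each distinct URL only needs to appear in the seed list once.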

The majority of the pages we link to from the Formats site were crawled in a straightforward manner. As we performed quality review, we annotated seeds to indicate that they had been successfully captured. We also assessed whether each resource needed to be harvested annually. Many of our links point to entities like PDF documents: capture-once items that are rarely revised once published.

Spore 404 Error Page by user laughingsquid on Flickr

There were, however, over two hundred links that were problematic. Most of them returned “404 – Not Found” errors but some led to placeholder pages saying that the content had moved to somewhere else on the site. Gina, Nicholas and Pranay had to pass these problematic “seeds” to the Formats team to review and decide how they should be handled.
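Sorting crawl results into “fine” and “problematic” can be partly automated. The sketch below is an assumption-laden illustration rather than our actual process: the category names and the list of “moved” phrases are hypothetical, and a placeholder page that returns a normal 200 status can only be guessed at from its text.

```python
# Hypothetical phrases that suggest a placeholder page rather than real content.
MOVED_PHRASES = ("has moved", "has been moved", "page moved")


def classify_crawl_result(status_code, body_text=""):
    """Return a rough category for one crawled seed.

    Categories (all illustrative): 'captured', 'not_found', 'moved',
    'needs_review'.
    """
    if status_code == 404:
        return "not_found"
    if status_code in (301, 302, 303, 307, 308):
        return "moved"
    if status_code == 200:
        lowered = body_text.lower()
        if any(phrase in lowered for phrase in MOVED_PHRASES):
            # A "this content has moved" placeholder served with status 200.
            return "moved"
        return "captured"
    return "needs_review"


results = [
    classify_crawl_result(200, "<html>Format specification</html>"),
    classify_crawl_result(404),
    classify_crawl_result(200, "This page has moved to our new site."),
    classify_crawl_result(500),
]
```

Anything other than “captured” would still go to a person for review; the point of a filter like this is only to shorten the list that needs human attention.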

Caroline, Carl and I divvied up these seeds to decide how each should be treated. For each seed, we considered the content of the web resource and decided whether the resource was actually worth capturing. If it was, we looked to see if the resource had already been captured by the Internet Archive. If no suitable iteration of the resource existed in the Internet Archive, we would try to track down the resource. In most cases the content of the resource existed elsewhere – on another part of the same website or sometimes in a totally new location. Once we found the content, the resource could be moved to “post crawl” status.

Post crawl is the term we use for resources that have been successfully captured and need not be crawled again. The resources that were not captured successfully were either set aside for further review or rejected completely. The last step in this process was to change the appropriate links in the Formats website to point to the new locations for the resources. In the cases where the information couldn’t be found online or in the Internet Archive, we deleted the links from the Formats pages.
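The review steps described above amount to a small decision procedure followed by a link update, which can be sketched as follows. The `Seed` fields, the status strings and the function names are hypothetical. Treating a seed with a suitable Internet Archive copy as “post crawl” is an assumption, as is using `None` to mean “delete this link.”

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Seed:
    url: str
    worth_capturing: bool
    internet_archive_url: Optional[str] = None  # suitable existing capture, if any
    relocated_url: Optional[str] = None         # same content found elsewhere


def triage(seed: Seed) -> Tuple[str, Optional[str]]:
    """Return (status, replacement link) for one problematic seed."""
    if not seed.worth_capturing:
        return "rejected", None
    if seed.internet_archive_url:
        # Assumption: an existing suitable capture needs no further crawling.
        return "post crawl", seed.internet_archive_url
    if seed.relocated_url:
        return "post crawl", seed.relocated_url
    return "further review", None


def update_links(links: List[str],
                 replacements: Dict[str, Optional[str]]) -> List[str]:
    """Apply replacements; a value of None removes the link entirely."""
    updated = []
    for link in links:
        if link not in replacements:
            updated.append(link)           # untouched link
        elif replacements[link] is not None:
            updated.append(replacements[link])
        # else: content could not be found anywhere, so the link is dropped
    return updated


found = triage(Seed("http://old.example/a", True,
                    relocated_url="http://new.example/a"))
missing = triage(Seed("http://old.example/b", True))
unwanted = triage(Seed("http://old.example/c", False))

page_links = update_links(
    ["http://old.example/a", "http://old.example/b", "http://ok.example/d"],
    {"http://old.example/a": "http://new.example/a",
     "http://old.example/b": None},
)
```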

Web harvesting the Formats site was a long and challenging process. One of the biggest challenges was developing a workflow between the Repository Development Center and the Formats team. The Repository Development folks did the heavy lifting of collating the links, doing the original crawl and educating us about how the process would go. The Formats team looked through each seed to decide if the site was something we wanted to capture and, if it was, how often we wanted to capture it.

Some web resources change so frequently that capturing them on a regular basis makes the most sense, but some resources are fairly static and a one-time capture is sufficient. This process of web archiving has made the Formats team reevaluate and update existing content on the site – a definite “added value” to the harvesting. As we go forward with new edits and additions to the Formats site, we plan to nominate URLs for web archiving as we go. Now that the bulk of the work has been done, making web archiving a part of the Formats team’s regular workflow will help to ensure that the team will be able to get the information we need to maintain the utility of the Formats site.

2 Comments

mary lukanuski
May 29, 2012 at 2:32 pm
when the images are screen captions of an active website, please link to the url not to a larger version of the image. thanks,
Butch Lazorchak
May 31, 2012 at 2:34 pm
Great point! We fixed it so that it does.
