I recently attempted to web-archive an interesting website called Letters of Charlotte Mary Yonge. The creators had approached us for some preservation advice, as there was some danger of losing institutional support.

The site was built on a WordPress platform, with some functional enhancements undertaken by computer science students, and presents a very useful, well-organised collection of transcribed correspondence by this influential Victorian woman writer; within the texts, important names, dates and places have been identified and hyperlinked.

Since I’ve successfully harvested many WordPress sites before, I added the URL to Web Curator Tool, confident of a straightforward job. However, problems appeared right from the start. One concern was that the harvest was taking many hours to complete, which seemed unusual for a small, text-based site with no large assets such as images or media attachments. One of my test harvests even ran up to the 3 GB limit. As I often do in such cases, I terminated the harvests and examined the log files and folder structures of what had been collected up to that point.

This revealed that a number of page requests were disproportionately large, some collecting over 40 MB for a single page, which was odd considering that the average size of a gathered page elsewhere on the site was less than 50 KB. When I tried to open these 40 MB pages in the Web Curator Tool viewer, they failed badly, often yielding an Apache Tomcat error report and rendering no viewable text at all.
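
A quick way to find the culprits is to pull the largest fetches straight out of the crawl log. The Python sketch below does just that; it assumes the standard Heritrix crawl.log layout, in which the third whitespace-separated field is the document size in bytes and the fourth is the URI, so adjust the indices if your log differs.

```python
# List the largest fetches recorded in a Heritrix crawl.log.
# Assumes the usual layout: timestamp, status, size-in-bytes, URI, ...
# (adjust the column indices if your log differs).
import sys

def largest_fetches(log_path, top_n=20):
    fetches = []
    with open(log_path, encoding="utf-8", errors="replace") as log:
        for line in log:
            fields = line.split()
            if len(fields) < 4:
                continue  # skip malformed or truncated lines
            try:
                size = int(fields[2])   # document size in bytes
            except ValueError:
                continue                # size can be '-' for failed fetches
            fetches.append((size, fields[3]))  # (bytes, URI)
    return sorted(fetches, reverse=True)[:top_n]

if __name__ == "__main__":
    for size, uri in largest_fetches(sys.argv[1]):
        print(f"{size / 1_000_000:8.1f} MB  {uri}")
```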

These pages weren’t actually static pages as such; it might be more accurate to call them responses to a query, as indeed are all pages that WordPress generates. A typical query was

http://www.yongeletters.com/letters-1850-1859?year_id=1850

a simple script that would display all letters tagged with a year value of 1850. Again, I’ve encountered such queries in my web-archiving activities before, and they don’t usually present problems like this one.

I decided to investigate this link’s behaviour, and others like it, on the live site. The page is supposed to display a truncated, paginated list of links to the letters for that year. Instead it loads the same request into the page over and over, ad infinitum. The code is effectively looping, endlessly returning the result “Letters 1 to 10 of 11”, and will never complete its task.
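
The loop is easy to demonstrate without a harvester at all. The sketch below (my own illustration, using only the URL and banner text quoted above) fetches the query with a hard byte cap and counts how many times the pagination banner repeats; on a well-behaved page the count would be one.

```python
# Probe the live query URL and count how often the pagination banner
# repeats. A hard byte cap keeps us from being dragged into the loop
# ourselves. Illustrative only; the URL and banner text come from the
# behaviour described above.
from urllib.request import urlopen

URL = "http://www.yongeletters.com/letters-1850-1859?year_id=1850"
BANNER = "Letters 1 to 10 of 11"
BYTE_CAP = 5_000_000  # stop reading after 5 MB

def count_banner_repeats(url, cap=BYTE_CAP):
    chunks, total = [], 0
    with urlopen(url, timeout=30) as response:
        while total < cap:
            chunk = response.read(65536)
            if not chunk:
                break  # the server finished the response (it probably won't)
            chunks.append(chunk)
            total += len(chunk)
    html = b"".join(chunks).decode("utf-8", errors="replace")
    return html.count(BANNER), total

if __name__ == "__main__":
    repeats, received = count_banner_repeats(URL)
    print(f"Received {received} bytes; banner repeated {repeats} times")
```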

When the Heritrix web harvester encounters this behaviour on the live site, it is likewise sent into a loop of requests that can never complete. This is what caused the 40 MB “page bloat” for these requests.

We have two options for web-archiving in this instance; neither one is satisfactory.

  • Remove the 3 GB system limit and let the harvester keep running. However, as my aborted harvests suggested, it would probably keep running forever, and the results still would not produce readable (or useful) pages.
  • Using exclusion commands, filter out links such as the one above (see the sketch after this list). The problem with that approach is that the harvester misses a large amount of the very content it is supposed to be collecting, and the archived version becomes practically useless as a resource. To be precise, it would still collect the pages containing the actual transcribed letters, but the means of navigating the collection by date would fail. Since the live site only offers navigation via the dated Letter Collection links, the archived version would be effectively inaccessible.

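For completeness, the sketch below shows roughly what the second option amounts to: a regular expression that rejects any URI carrying a year_id query parameter, applied here to a few candidate links purely for illustration. The pattern is my own example rather than the exact exclusion I used; in Heritrix itself it would sit in a regex-based decide rule in the crawl configuration.

```python
# A hypothetical exclusion filter of the kind option 2 would require:
# reject any URI whose query string carries a year_id parameter.
# Applied here to a few example URIs for illustration only.
import re

EXCLUDE = re.compile(r"[?&]year_id=\d+")

candidates = [
    "http://www.yongeletters.com/letters-1850-1859?year_id=1850",
    "http://www.yongeletters.com/letters-1850-1859",
    "http://www.yongeletters.com/some-letter-transcript",
]

for uri in candidates:
    verdict = "EXCLUDE" if EXCLUDE.search(uri) else "keep"
    print(f"{verdict:8} {uri}")
```
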
This is, therefore, an example of a situation where a web site is effectively un-archivable, as it never completes executing its scripts and potentially ties the harvester up forever. The only sensible solution is for the website owners to fix and test their code (which, arguably, they should have done when developing it). Until then, a valuable resource, and all the labour that went into it, will continue to be at risk of oblivion.
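
To illustrate how small the required fix is, the sketch below shows a pagination loop that does terminate. It is written in Python rather than the site’s actual PHP, and the fetch_letters helper is a stand-in that simulates a year with eleven matching letters.

```python
# A minimal sketch of a pagination loop with a proper exit condition.
# fetch_letters is a hypothetical stand-in returning
# (letters_on_this_page, total_matching_letters).
def fetch_letters(year, offset, limit):
    all_letters = [f"Letter {i + 1} ({year})" for i in range(11)]
    return all_letters[offset:offset + limit], len(all_letters)

def list_letters_for_year(year, page_size=10):
    collected, offset = [], 0
    while True:
        letters, total = fetch_letters(year, offset, page_size)
        collected.extend(letters)
        offset += page_size
        # The exit condition the live page appears to lack: stop once we
        # have paged past the total number of matching letters.
        if offset >= total or not letters:
            break
    return collected

if __name__ == "__main__":
    print(len(list_letters_for_year(1850)))  # prints 11, then stops
```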
