Gina Jones and 20 Years of Web Archiving at the Library of Congress

Today’s guest blog post is from Gina Jones and Abbie Grotke, both of the Web Archiving Team.


As a part of our series looking back at some of the people and stories around our 20th Anniversary of Web Archiving, I wanted to share with you an interview with a person who has been working on the web archiving team even longer than I have been — Gina Jones, Digital Projects Coordinator. Gina is retiring this month after an incredible 18 years of service at the Library of Congress and 20 years in the Marine Corps before that. We’re sorry to see her go, but wish her well in her retirement – she deserves it!

Here are Gina’s responses (lightly edited!) to some questions I posed to her a few weeks ago via email:

Join us in celebrating Gina Jones’ 18 years of web archiving at the Library of Congress! Photo courtesy of Gina Jones.

How long have you been working with the Web Archiving team at LC?

I was hired in 2002, when I came to the Library from the University of Maryland.

How did you become a web archiving professional?

Luck. I applied for a job at the Library when they were looking for someone with my skill set.  My military jobs gave me experience managing tactical communications teams and an understanding of the communication functions of a telecommunication system.  I had my first Apple computer in 1982 and I learned to program in BASIC.  As time went on, even though I have never had any computer classes, I self-studied and learned about the Internet and the web as it came online.  Two of my proudest achievements were getting my mother on the Internet (Prodigy) in the early ’90s, and getting the Internet deployed to the Marine Corps Family Service Centers before even the base generals were provided access.  It is a critical service to families that provides transition and relocation support through employment opportunities and information about duty stations and their locales.  As I was headed to (my first) retirement, I got my MLS, something that I had thought about for years.  While at the University of Maryland, I was able to get a graduate assistant position in the university’s IT department, and my team of students was responsible for supporting the departments’ and colleges’ web development needs.  So, given my resume of web skills, the Library hired me and assigned me to the web archiving project.

Did you know about web archiving before you came to the Library?

Actually, yes.  The Arts and Humanities College at the University of Maryland “lost” their college’s web pages at the time when the Internet Archive was just coming online, sometime around 1999 or 2000.  I was assigned to fix the problem, and I used what IA had to reconstruct the site.

What has been your primary role on the Web Archiving Team? 

I have been responsible for the implementation and use over time of the web archiving tools, including Heritrix, Wayback, and Digiboard (our in-house curatorial tool, formerly known as Leaderboard), but have focused primarily on trying to figure out what we got [in the archiving process], what we didn’t get, and why.

Rumor has it that you were one of the first people to do quality review on web archives. Could you tell us a story about that early process and what it was like?

It has always been a matter of understanding how the archive viewer and the crawler work.  I have two classic experiences.  For the first, I was doing quality review on the Election 2002 archive and I kept coming across content that was not in the archive; they were all URLs that had spaces in them.  I asked our crawl contractor (at that time) to see if the content was there, and if it was, why their Wayback [the access tool] couldn’t handle it.  Their response was that people shouldn’t be posting content to the web with spaces in the URIs.  To me, that is exactly why it’s so hard to archive the web: if it can be done, people will do it on the web.  The developers who were handling our early days, of course, dealt in a Unix environment, where spaces in file names are avoided, but people were posting content to the web using Microsoft platforms, which do allow spaces.
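[Editor’s note: for readers less familiar with the issue Gina describes, a space is not a valid character in a URI and has to be percent-encoded as %20 before standards-compliant tools will handle it reliably. Here is a minimal sketch in Python, using only the standard library and a made-up example URL, of the kind of normalization involved:]

from urllib.parse import quote, urlsplit, urlunsplit

def normalize_url(url: str) -> str:
    """Percent-encode spaces (and other unsafe characters) in the path and
    query so that tools expecting RFC 3986-compliant URIs can handle the URL."""
    parts = urlsplit(url)
    path = quote(parts.path, safe="/%")      # " " becomes "%20"
    query = quote(parts.query, safe="=&%")   # keep query delimiters intact
    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))

# A hypothetical URL of the kind that tripped up early access tools:
print(normalize_url("http://example.org/campaign docs/press release.html"))
# -> http://example.org/campaign%20docs/press%20release.html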

My second example is about depth.  I spent about a month looking at the results of the 107th Congressional Web Archive project.  We have always said our goal was to get the depth and breadth of a website.  As I looked at the results of the crawls, I kept coming across content “not in archive” that the Library considered important to render those pages in the future: the look (missing images) and feel (layout scripts and files).  The crawler behaved as it did because it was designed for a business purpose, not for an archival purpose that documents how that page looked when we archived it.  That drove us in pushing for an international, curatorial solution for web archiving that would acquire those web objects.
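[Editor’s note: the “look and feel” objects Gina mentions are the images, stylesheets, and scripts a page references. As an illustration only, and not the Library’s actual tooling, a quality reviewer could list what a captured page depends on with a short standard-library Python script like this:]

from html.parser import HTMLParser

class EmbeddedResourceLister(HTMLParser):
    """Collect the URLs of images, scripts, and stylesheets a page references.
    A curatorial crawl needs these objects too, not just the page text."""
    def __init__(self):
        super().__init__()
        self.resources = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("img", "script") and attrs.get("src"):
            self.resources.append(attrs["src"])
        elif tag == "link" and attrs.get("rel") == "stylesheet" and attrs.get("href"):
            self.resources.append(attrs["href"])

# Hypothetical markup standing in for a captured page:
html = ('<html><head><link rel="stylesheet" href="/site.css"></head>'
        '<body><img src="/banner.gif"><script src="/menu.js"></script></body></html>')
parser = EmbeddedResourceLister()
parser.feed(html)
print(parser.resources)   # ['/site.css', '/banner.gif', '/menu.js']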

You’ve been particularly focused on ensuring that we adequately archive campaign websites during the U.S. elections. What have been the biggest changes you’ve seen in archiving campaign sites?

The rise of social media and inexpensive or free web platforms has given campaigns that don’t have much money the opportunity to have a web presence.  In the early days, it was expensive to buy domains and platforms to serve content, and it was primarily the big Democratic and Republican party candidates who had a sophisticated (for that time) web presence.  We did crawl some early GeoCities sites, but unfortunately, the crawler was not able to capture the documents and objects needed to render those pages, so they cannot be displayed in the archives.

What are the most significant changes that have taken place in web archiving since you began?

Content owners are becoming more knowledgeable about web archiving and work with us more proactively when we come across issues during the harvesting process. There is also a growing community of interested developers who are tackling capture and replay issues that get more challenging as web platforms get more sophisticated.  And lastly, changes over the years in how we are able to manage and handle the permissions process have significantly increased the kinds and numbers of sites that we are able to archive.

What technical skills do you think would be useful for those starting out in web archiving?

Of course, we assume people understand the web because we’ve been using it for decades now, but over time, the more successful people who have worked with us on the web archiving team have had great observation skills and a level of curiosity about the content. Although curiosity is not a technical skill, I think it is important for those doing any investigation or review of the archives, because, to me, it is the ability to notice things that are not quite right that makes for successful quality review.  It’s the anecdotal side of “hmm, why is this happening?” that leads the person to look at more of the same kind of content and figure out if there is a crawler or display issue that could be addressed.

What do you see as the biggest challenge in web archiving?

The biggest challenge, I think, is along the lines of what I found when I first started out.  I was told that the Internet Archive was archiving the web.  But when I looked at their archive and what the crawler was getting for our collections (using the same crawler), there were glaring content holes because the crawler was designed for [Alexa Internet’s] business model, which only cared about text. Thankfully, in 2003 the international web archiving community came together, and we were able to launch Heritrix, a curatorial web crawler, by 2004.  So, in our archives, we have a lot of webpages for which we don’t have the images or other kinds of content that provided the look and feel of the site at the time, particularly in those early years.  And, of course, as new platforms come online, the crawler lags in getting some of the hard-to-get content. Identifying those gaps is important, and that, I think, is the job for those of us creating the archives.  The smart web archivists of the future can figure out later how to make that content display, but if you don’t preserve it, it will be lost.

What one thing did you wish you knew about web archiving when you started your job?

I wish that digital preservation had been a “thing” while I was working on my MLS and that they had had courses in that.

What are some of the favorite collections in the Library’s web archive that you helped us preserve? 

I remember the days when [former team member] David Brooks and I were working on election campaign sites and would bet on the political party based on the colors used on websites.  He could call it correctly every time!

This is our 20th year of web archiving at LC, and you’ve been here most of it. What are you most proud of about your work in the early days of Web Archiving at LC?

I do look at what we do and have done, and a lot of that is based on recommendations that I have made over the years.  I think that I have impacted the trajectory of the web archiving program at LC, hopefully for the good.  But it’s been a team effort; from the early days of MINERVA to what we are doing now, the people who have worked on the team have brought wonderful skill sets and perspectives that greatly enhanced our activities.

What advice do you have for the remaining team members as you depart?

Best wishes and good luck! I think with the creation of the Digital Content Management Section [where the Web Archiving team now sits organizationally], under the leadership of Trevor Owens, the team is in great hands, the best it has been since the days of Martha Anderson.

What are your plans for retirement?

It is going to be a challenge. I’ve had to work and support myself for over 50 years, and going to work is an ingrained habit that will be hard to break. I do have a lot of home projects that need to be worked on, but I’d like to get my master beekeeper’s certification, get my fluency in Spanish back, and volunteer on beekeeping projects with the Peace Corps or other volunteer organizations.