Upgrading Image Thumbnails… Or How to Fill a Large Display Without Your Content Team Quitting

The following is a guest post by Chris Adams from the Repository Development Center at the Library of Congress, the technical lead for the World Digital Library.

Preservation is usually about maintaining as much information as possible for the future, but access requires us to balance factors like image quality against file size and design requirements. These decisions often need revisiting as technology improves and what previously seemed like a reasonable compromise starts to feel constricting.

I recently ran into an example of this while working on the next version of the World Digital Library website, which still has substantially the same look and feel as it did when the site launched in April of 2009. The web has changed considerably since then, with a huge increase in users on mobile phones and tablets, so the new site uses responsive design techniques to adjust the display for a wide range of screen sizes. Because high-resolution displays are becoming common, this has also involved serving images at larger sizes than in the past, which fits our goal of keeping the focus on the wonderful content provided by WDL partners.

When viewing the actual scanned items, this is a simple technical change to serve larger versions of each, but one area posed a significant challenge: the thumbnail or reference image used on the main item page. These images are cropped from a hand-selected master image to provide consistently sized, interesting images which represent the nature of the item – a goal which could not easily be met by an automatic process. Unfortunately, the content guidelines used in the past specified a thumbnail size of only 308 by 255 pixels, which increasingly feels cramped as popular web sites feature much larger images and modern operating systems display icons as large as 256×256 or even 512×512 pixels. A “Retina” icon is significantly larger than the thumbnail below:

[Image: Icon Sizes]

Going back to the source

All new items being processed for WDL now include a reference image at the maximum possible resolution, which the web servers can resize as necessary. This left around 10,000 images which had been processed before the policy changed, and nobody wanted to take time away from expanding the collection to reprocess old items. The new site design allows flexible image sizes, but we wanted to find an automated solution to avoid a second-class presentation for the older items.

Our original master images are much higher resolution, and we had a record of the source image for each thumbnail but not the crop or rotation settings which had been used to create the original thumbnail. Researching the options for reconstructing those settings led me to OpenCV, a popular open-source computer vision toolkit.

At first glance, the OpenCV template matching tutorial appears to be perfect for the job: give it a source image and a template image and it will attempt to locate the latter in the former. Unfortunately, the way it works is by sliding the template image around the source image one pixel at a time until it finds a close match, a common approach but one which fails when the images differ in size or have been rotated or enhanced.
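To make the limitation concrete, here is a minimal sketch of the sliding-window idea in plain Python. It is not OpenCV's implementation: the function name `match_template`, the sum-of-squared-differences score and the tiny grayscale grids are all assumptions chosen for illustration.

```python
def match_template(source, template):
    """Return (row, col, score) for the best placement of template in source.

    Both arguments are 2-D lists of grayscale values. The score is the sum of
    squared differences, so 0 means a pixel-perfect match.
    """
    src_h, src_w = len(source), len(source[0])
    tpl_h, tpl_w = len(template), len(template[0])
    best = None
    # Slide the template across every position where it fits entirely.
    for row in range(src_h - tpl_h + 1):
        for col in range(src_w - tpl_w + 1):
            score = sum(
                (source[row + r][col + c] - template[r][c]) ** 2
                for r in range(tpl_h)
                for c in range(tpl_w)
            )
            if best is None or score < best[2]:
                best = (row, col, score)
    return best


# Hypothetical 4x5 "master" image with a distinctive 2x2 patch.
source = [
    [0, 0, 0, 0, 0],
    [0, 9, 8, 0, 0],
    [0, 7, 6, 0, 0],
    [0, 0, 0, 0, 0],
]
exact_crop = [[9, 8], [7, 6]]
rotated_crop = [[7, 9], [6, 8]]  # the same crop rotated 90 degrees clockwise

print(match_template(source, exact_crop))
print(match_template(source, rotated_crop))
```

The unmodified crop is found with a score of 0, while the rotated crop never reaches 0 at any position. With real photographs, resizing or contrast enhancement degrades the score in the same way.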

Fortunately, there are far more advanced techniques available for what is known as scale- and rotation-invariant feature detection, and OpenCV has an extensive feature detection suite. Encouragingly, the first example in the documentation shows a much harder variant of our problem: locating a significantly distorted image within a photograph. We don’t even have to worry about matching the 3D distortion of a printed image!

Finding the image

The locate-thumbnail program works in four steps:

  1. Locate distinctive features in each image, where features are simply mathematically interesting points which will hopefully be relatively consistent across different versions of the image – resizing, rotation, lighting changes, etc.
  2. Compare the features found in each image and attempt to identify the points in common
  3. If a significant number of matches were found, replicate any rotation which was applied to the original image
  4. Generate a new thumbnail at full resolution and save the matched coordinates and rotation as a separate data file in case future reprocessing is required
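
The geometry behind steps 3 and 4 can be sketched without any computer-vision library. The snippet below is not the locate-thumbnail code: it assumes the matched point pairs have already been found (by OpenCV or anything else) and fits a least-squares similarity transform to recover scale, rotation and offset. The function name `estimate_similarity` and the sample coordinates are hypothetical.

```python
import cmath
import math


def estimate_similarity(thumb_points, master_points):
    """Fit master = a * thumb + b in the least-squares sense.

    Points are (x, y) tuples treated as complex numbers, so the single complex
    factor `a` encodes both scale (its magnitude) and rotation (its angle).
    Returns (scale, rotation_in_degrees, (offset_x, offset_y)).
    """
    if len(thumb_points) != len(master_points) or len(thumb_points) < 2:
        raise ValueError("need at least two matched point pairs")
    p = [complex(x, y) for x, y in thumb_points]
    q = [complex(x, y) for x, y in master_points]
    p_mean = sum(p) / len(p)
    q_mean = sum(q) / len(q)
    # Closed-form least-squares solution on the centred point sets.
    numerator = sum(
        (qi - q_mean) * (pi - p_mean).conjugate() for pi, qi in zip(p, q)
    )
    denominator = sum(abs(pi - p_mean) ** 2 for pi in p)
    if denominator == 0:
        raise ValueError("thumbnail points must not all coincide")
    a = numerator / denominator
    b = q_mean - a * p_mean
    return abs(a), math.degrees(cmath.phase(a)), (b.real, b.imag)


# Hypothetical matches: the master is 4x larger, rotated a quarter turn,
# and the crop starts at (100, 50) in master coordinates.
thumb = [(0, 0), (10, 0), (0, 10)]
master = [(100, 50), (100, 90), (60, 50)]
scale, rotation, offset = estimate_similarity(thumb, master)
print(round(scale, 3), round(rotation, 3), offset)
```

Saving the recovered scale, rotation and offset alongside the new thumbnail is what makes later reprocessing cheap. Real matches include outliers, so a production version would need a robust estimator rather than a plain least-squares fit.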

You can see this process in the sample visualizations below which have lines connecting each matched point in the thumbnail and full-sized master image:

The technique even works surprisingly well with relatively low-contrast images such as this 1862 photograph from the Thereza Christina Maria Collection courtesy of the National Library of Brazil where the original thumbnail crop included a great deal of relatively uniform sky or water with few unique points:

Scaling up

After successful test runs on a small number of images, locate-thumbnail was ready to try against the entire collection. We added a thumbnail reconstruction job to our existing task queue system and over the next week each item was processed using idle time on our cloud servers. Based on the results, some items were reprocessed with different parameters to better handle some of the more unusual images in our collection, such as this example where the algorithm matched only a few points in the drawing, producing an interesting but rather different result:

Reviewing the results

Automated comparison

For the first pass of review, we wanted a fast way to compare images which should be very close to identical. For this work, we turned to libphash, which attempts to calculate the perceptual difference between two images, so we could find gross failures rather than cases where the original thumbnail had been slightly adjusted or was shifted by an insignificant amount. This approach is commonly used to detect copyright violations, but it also works well as a way to quickly and automatically compare images or even cluster a large number of images based on similarity.
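
The general idea of perceptual hashing can be illustrated with a much simpler scheme than the one libphash uses. The sketch below implements an "average hash" (a deliberately swapped-in, simpler technique) over a small grayscale grid and compares two hashes by Hamming distance. The function names and the 8×8 test images are assumptions for illustration only.

```python
def average_hash(pixels):
    """Hash a small grayscale image given as a 2-D list of values.

    Each pixel contributes one bit: 1 if it is brighter than the image's
    mean, 0 otherwise. The image is assumed to be already downscaled
    (for example to 8x8, giving a 64-bit hash).
    """
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(hash_a, hash_b):
    """Count the bits which differ between two hashes."""
    return bin(hash_a ^ hash_b).count("1")


# Hypothetical 8x8 gradient, a uniformly brightened copy, and a negative.
original = [[row * 8 + col for col in range(8)] for row in range(8)]
brightened = [[value + 10 for value in row] for row in original]
inverted = [[63 - value for value in row] for row in original]

print(hamming_distance(average_hash(original), average_hash(brightened)))
print(hamming_distance(average_hash(original), average_hash(inverted)))
```

A small distance means the two images look alike even though their pixel values differ, which is the property that lets a slightly adjusted thumbnail pass while a gross failure stands out.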

A simple Python program was created and run across all of the reconstructed images, reporting the similarity of each pair for human review. The gross failures were used to correct bugs in the reconstruction routine, and they also revealed a few interesting cases where the thumbnail had been significantly altered, such as this cover page where a stamp added by a previous owner had been digitally removed:

[Images: 7778 original, 7778 reconstructed]

http://www.wdl.org/en/item/7778/ now shows that this was corrected to follow the policy of fidelity to the physical item.

Human review

The entire process up to this point had been automated, but human review was essential before we could use the results. A simple webpage was created which offered fast keyboard navigation and the ability to view sets of images at either the original or larger sizes:

[Image: screenshot of the review page]

This was used to review items which phash had flagged as matching below a particular threshold, and to randomly sample the remaining items to confirm that the phash algorithm wasn’t masking differences which a human would notice.
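
That review selection can be sketched roughly as follows. This is not the WDL review tool: the function name `select_for_review`, the 0-to-1 similarity scale, the threshold and the item ids are all hypothetical.

```python
import random


def select_for_review(similarities, threshold, sample_size, seed=None):
    """Split items into those flagged for review and a random spot check.

    `similarities` maps an item id to a similarity score where higher means
    a closer match. Items scoring below `threshold` are always reviewed;
    `sample_size` of the remaining items are drawn at random so that a
    human can confirm the automated comparison is not hiding problems.
    """
    flagged = sorted(i for i, score in similarities.items() if score < threshold)
    passed = sorted(i for i, score in similarities.items() if score >= threshold)
    rng = random.Random(seed)  # a fixed seed makes the sample reproducible
    sampled = sorted(rng.sample(passed, min(sample_size, len(passed))))
    return flagged, sampled


# Hypothetical similarity report for six items.
report = {101: 0.99, 102: 0.42, 103: 0.97, 104: 0.95, 105: 0.10, 106: 0.98}
flagged, sampled = select_for_review(report, threshold=0.9, sample_size=2, seed=1)
print("flagged:", flagged)
print("spot check:", sampled)
```

Sampling from the items which passed is the important part: reviewing only the flagged items would never reveal cases where the hash reports a close match for images a person would consider different.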

In some cases where the source image had interacted poorly with the older down-sampling, the results are dramatic – the reviewers reported numerous eye-catching improvements such as this example of an illustration in an Argentinian newspaper:

Illustration from “El Mosquito, March 2, 1879” (reconstructed).

Conclusion

This project was completed towards the end of this spring and I hope you will enjoy the results when the new version of WDL.org launches soon. On a wider scale, I also look forward to finding other ways to use computer-vision technology to process large image collections. Many groups are used to sophisticated bulk text processing, but many of the same approaches are now feasible for image-based collections. There are a number of interesting possibilities, such as suggesting items which are visually similar to the one currently being viewed or using clustering or face detection to review incoming archival batches.

Most of the tools referenced above have been released as open-source and are freely available:
