It’s been a few weeks since my last post, but that doesn’t mean we haven’t been making progress. On the contrary, we’ve been quite busy, with several irons in the fire.
Our principal focus remains very much on how to achieve the public access goal of the project, and we’re looking at all of the options. We’re fortunate to have several staff from all parts of the Copyright Office helping on a voluntary, part-time basis with the analysis of the Copyright catalog cards. These cards remain the best source of information for building online indexes to the pre-1978 records. But, as with typical catalog cards, the content is not labeled, which makes it a challenge to extract index terms and identify their types. The volunteers are studying the cards to identify patterns that could allow programmatic parsing of the data. For instance, the class codes will allow us to recognize registration numbers, and the copyright notice symbol should allow us to recognize the claimant name text string. Also, the relative location of other text strings in the card content, when compared with the card header information, should allow us to recognize some if not all of the other index terms. The easy approach would be to invert all of the text in a card and provide general word searching, but we’re making the extra effort to try to index these older records in the same way as the post-1977 records.
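To make the parsing idea concrete, here is a minimal sketch in Python of how the two patterns mentioned above might be applied to the text of a single card. Everything specific in it is an assumption for illustration only: the list of class codes, the shape of a registration number (a class code followed by digits), the rule that the claimant name runs from the copyright notice symbol to the next semicolon, and the sample card text itself. The real rules would come out of the volunteers’ analysis of the cards.

```python
import re

# Hypothetical class-code prefixes, for illustration only; the actual
# list would come from the analysis of the catalog cards.
CLASS_CODES = ("EU", "EP", "A", "B", "R")

# Assumption: a registration number is a class code followed by digits,
# e.g. "A123456". Longer codes are tried first so "EU" wins over "E...".
REG_NUMBER = re.compile(
    r"\b(?:%s)\d{1,7}\b" % "|".join(sorted(CLASS_CODES, key=len, reverse=True))
)

# Assumption: the claimant name follows the copyright notice symbol and
# runs up to the next semicolon or the end of the line.
CLAIMANT = re.compile(r"©\s*([^;\n]+)")


def extract_terms(card_text):
    """Return the registration numbers and claimant found in one card's text."""
    match = CLAIMANT.search(card_text)
    return {
        "registration_numbers": REG_NUMBER.findall(card_text),
        "claimant": match.group(1).strip() if match else None,
    }


# Made-up card text, not a real record.
sample_card = "SMITH, JOHN.\nA day in the park. © John Smith; 12Mar56; A123456."
terms = extract_terms(sample_card)
print(terms)
```

A card with no copyright notice symbol simply yields no claimant in this sketch, which is where the relative position of text against the card header would have to take over.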
On the data capture front, we published a request for information (RFI) to learn what skills, experience and technology exist in the marketplace to support the capture of information through crowdsourcing. We’ve got a lot of data to capture but very limited resources to get the work done. We found companies that build workflow processes to capture and verify data through keyboarding from displayed images. These processes are offered through online service providers, which bring together those who have data needing capture and those who are willing to spend some time contributing towards meeting that need. Some of our data capture requirements appear to lend themselves to crowdsourcing, so we are planning to invite those who responded to the RFI to see the actual records and to discuss what’s feasible.
We published a similar request for information about building a virtual card catalog (see the March 22nd and April 5th blog posts) and were similarly encouraged by the responses about the feasibility of using this as an interim approach to making the records available online. We’ll continue to explore this as a possible avenue for sharing the card images with you online pending a full search capability.
In the course of scanning the published Catalogs of Copyright Entries (CCEs) for preservation purposes, we captured OCR output of the content. For some volumes, particularly those that were typeset in at least 8-point font, the OCR output is relatively good and may allow us to avoid keyboarding all of the records. Some CCEs published in the 1970s were produced by computer line printing followed by photo reduction, which resulted in indistinct characters at about 6-point font size. While eye-readable, the content is less than clear to the OCR engine, and the output attests to that.
We’re also refining the estimates of how many cards exist in the catalog and finding that there are fewer cards than originally estimated. Card thickness varied over the years, so the number of cards per inch is not uniform across the 108-year span of the catalog, and an estimate based on a single cards-per-inch figure is unreliable. Based on the new estimates, we believe that all of the cards in the catalog could be imaged by the end of FY2014.
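As a rough illustration of why the estimate changes, the sketch below (in Python) compares a single cards-per-inch figure with a separate figure for each period. The period names, drawer measurements and densities are all made-up placeholders, not our actual numbers; in practice the densities would come from sampling drawers from each period.

```python
def estimate_cards(inches_by_period, cards_per_inch_by_period):
    """Estimate the total card count using a separate density for each period."""
    return sum(
        round(inches * cards_per_inch_by_period[period])
        for period, inches in inches_by_period.items()
    )


# Placeholder figures for illustration only.
inches = {"earlier cards": 100, "later cards": 200}
density = {"earlier cards": 80, "later cards": 120}  # cards per inch

# One density applied to every drawer, versus one density per period.
uniform_estimate = round(sum(inches.values()) * 100)
per_period_estimate = estimate_cards(inches, density)

print(uniform_estimate, per_period_estimate)
```

The two totals differ even though the inches measured are the same, which is the effect we are correcting for in the revised estimates.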
So there’s lots going on, and the knowledge we’re gaining about the records is helping us plan the shortest and most efficient road to making them available online for your use. This blog is a way to keep you informed about the project and the progress we’re making, but it’s also a means for you to provide feedback about what we can do to best meet your needs when it comes to Copyright records. Your comments and suggestions are most important and always welcome.