The Determinator: Behind the Scenes at the Stanford
Copyright Renewal Database. An Interview with Mimi Calter [*]
August 2007
Mary Minow: What was the impetus to put
together the Stanford Copyright Renewal
Database?
Mimi Calter: The project grew out of a
conversation between Michael Keller, Stanford's University Librarian,
and Lawrence Lessig. A student of Professor Lessig's had composed an
interrogatory Q&A for determining the copyright status of any work.
This led to a discussion of the possibility of automating the
determination of the copyright status of a work, and of the necessary
inputs to such a system. The 1923-1963 renewal data is one obvious
input required for that system, and members of the Stanford
University Libraries staff very quickly learned about the work of
Project Gutenberg to scan the relevant Catalog of Copyright Entries,
and the early version of a copyright renewal database compiled by
Michael Lesk. (I did not get involved in the project until somewhat
later). It was decided to pursue the project that Professor Lesk had
started, with an eye to the framework suggested by Lawrence Lessig.
We still hope to see the copyright renewal data integrated into a
larger tool for copyright status analysis, and are having
conversations with possible partners.
Minow:
Who actually put "The Determinator" database together? Is that its
official name?
Calter: We call it "The Determinator"
in-house, but the official name is the Stanford Copyright Renewal
Database. I was the project coordinator, and our Chief Information
Architect, Jerry Persons, as well as several members of our wonderful
Academic Computing team worked quite hard on this.
Minow: Can you give me an overview of your
process for compiling the database?
Calter: Renewals
for books originally registered between 1923 and 1963 should have
taken place between 1950 and 1992. The Copyright Office moved to
electronic records in 1978, which meant that we had to deal with two
broad groups of records: the paper records from before 1978, and the
electronic records that came after. Our mission was to have all of
the data fielded, and searchable in a single database.
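The date arithmetic behind that scope can be sketched as follows. This is an illustrative reading of the 1909 Act's 28-year first term, with renewal due during the 28th year, and not code from the project itself:

```python
# Sketch of the renewal-window arithmetic, assuming the 1909 Act rule:
# a 28-year first term, with renewal due during the 28th year. Depending
# on the registration date within the year, the filing falls in one of
# two calendar years; late entries could appear in the printed Catalog
# of Copyright Entries somewhat after that.
def renewal_window(registration_year):
    """Calendar years in which renewal of a first-term copyright was due."""
    return registration_year + 27, registration_year + 28

print(renewal_window(1923))  # earliest works in scope -> (1950, 1951)
print(renewal_window(1963))  # latest works in scope -> (1990, 1991)
```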
For
the 1950-1977 records, we started with the Project Gutenberg
transcriptions of Class A renewals, which include books, pamphlets,
and articles in serials. These records were the most challenging, as
the data was completely unfielded, and we were essentially starting
from scratch. Even worse, the record format used by the Copyright
Office in the print books changed several times during those years.
For the 1978-1992 records, we used records extracted from the
Copyright Office's online database, which had been collected by a
member of the Project Gutenberg team. This data was largely fielded;
we only had to clean up formatting and break the author data out of
the title field.
For each type of data, we developed
schemas for extracting the appropriate fields, and then worked with an
outside firm, Innodata Isogen, to tag all of the records and handle
some of the parsing.
We've also
done testing of the database. In our first round, we pulled 500
titles from the library catalog, so we could have the actual books in
hand. We searched them manually in the Catalog of Copyright Entries
(CCE), published by the Copyright Office, and also sent a subset of
those to the Copyright Office to be searched (at $100 plus $150 per
hour). We then repeated the searches in the Determinator. Overall,
we're very happy with the accuracy of the database, but we did find
some unusual problems. For example, there is a book in our catalog
titled Memoirs of a Spy that is listed as Memories of a Spy in both
the Determinator and the CCE. That's not something we can fix, as
the error is in the Copyright Office record.
The testing
did reveal a few small problems with the database that are being
cleaned up now. We'll be doing a second round of testing once that is
complete, and will make those results public.
Minow: How was the project
funded?
Calter: We had a grant from the William and
Flora Hewlett Foundation. In addition, the Stanford Library
contributed staff time and in-house resources. We're working on the
final report for the Foundation right now, and it will be available
online when it is complete.
Minow: Who are
the target users of the database?
Calter: Frankly, we
had a bit of a selfish motivation here. We are very interested in
digitizing as much of the material in our library as we legally can.
Looking at the copyright status of 1923-1963 works is an important
part of this. We expect that the primary users of the database will
be libraries and groups like ourselves that are involved in
digitization projects, although I'm certain there will be other uses
found!
Minow: Will this database be helpful
to libraries, archives and museums who are digitizing "orphan works"?
That is, do you think it will help them show "due diligence" when
searching for copyright ownership?
Calter: Studies
show that less than 15% of items eligible for renewal were in fact
renewed. [1] Our work on this database has uncovered only about 280,000
renewal records, and a surprising number of these are for things like
court reporters and Singer sewing machine manuals, so it's clear to
us that a very large portion of the books published in this period
are now in the public domain. Nevertheless, we know that due
diligence is important when dealing with orphan works, and we think
our database can be very helpful in that regard.
Minow: What were the biggest challenges in
the project?
Calter: By far the biggest challenge to
this point in working with the Copyright Office records has been
parsing out author names. Even in the electronic records that it
produced after 1978, the Copyright Office included the author name as
part of the title field. Extracting that information in order to
allow searching has required significant effort.
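A toy illustration of the kind of extraction involved. The pattern and the sample record below are hypothetical and far simpler than the actual Copyright Office formats, which varied over the years and required many more cases:

```python
import re

# Hypothetical sketch of splitting an author out of a combined title
# field. This single pattern is illustrative only; real records varied
# in punctuation and phrasing and needed much more elaborate handling.
TITLE_AUTHOR = re.compile(r"^(?P<title>.+?)[.,]?\s+[Bb]y\s+(?P<author>.+?)\.?$")

def split_title_author(field):
    """Return (title, author); author is None if no 'by' clause is found."""
    m = TITLE_AUTHOR.match(field)
    if not m:
        return field, None
    return m.group("title"), m.group("author")

print(split_title_author("The Great Gatsby. By F. Scott Fitzgerald."))
# -> ('The Great Gatsby', 'F. Scott Fitzgerald')
```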
That said,
bigger challenges remain for those interested in using the data. We
have only extracted the names; we have not yet attempted to apply
authority control, and that makes matching a challenge. And since
records from this time period have no ISBNs, there's no easy way to
tie the copyright records to particular books. We would like to see
the renewal records in our database matched against catalog records,
so that users can even more easily determine the status of a
particular work.
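One way such matching might work in the absence of ISBNs is normalized fuzzy comparison of title strings. The normalization and use of difflib here are illustrative assumptions, not the project's actual method:

```python
from difflib import SequenceMatcher

# Illustrative sketch only: fuzzy title comparison as one possible way
# to match renewal records against catalog records without ISBNs.
def normalize(title):
    """Lowercase and collapse whitespace before comparing."""
    return " ".join(title.lower().split())

def similarity(a, b):
    """Similarity ratio in [0, 1] between two normalized titles."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

# The "Memoirs of a Spy" / "Memories of a Spy" discrepancy mentioned
# earlier is exactly the kind of near-match an exact lookup would miss:
print(round(similarity("Memoirs of a Spy", "Memories of a Spy"), 2))
# -> 0.91
```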
Minow: What feedback have
you gotten from users?
Calter: Very positive. Lots of
folks want to know when we'll be creating similar tools for other
classes of works, but that's not something we're pursuing right now.
But I have been contacted by a few organizations that are
incorporating the database into their standard search process. I
recently received a nice thank you note from Jack Herrick, a Stanford
alum and founder of wikiHow, which has done just that:
www.wikihow.com/Import-Old-Public-Domain-Books-to-wikiHow
Minow: What do you wish you could add to the
database?
Calter: I certainly think it would be
beneficial to expand the database to other classes of works, but
that's not something we have funding or staff to manage. However,
I'm actually more interested in seeing our database become part of a
tool that addresses a wide variety of copyright concerns and
questions.
*Mimi Calter is the Executive Assistant to the University Librarian at Stanford University.
[1] For example, a 1961 Copyright Office study found that fewer than 15% of all registered copyrights were renewed; for books, the figure was even lower: 7%. See Barbara Ringer, "Study No. 31: Renewal of Copyright" (1960), reprinted in Library of Congress Copyright Office, Copyright Law Revision: Studies Prepared for the Subcommittee on Patents, Trademarks, and Copyrights of the Committee on the Judiciary, United States Senate, Eighty-sixth Congress (Washington: U.S. Govt. Print. Off., 1961), p. 220. See also Peter Hirtle,
Copyright Term and the Public Domain in the United States, 1 January 2007.