Metadata, Cataloging, Digitization and Retrieval: Who's Doing What to Whom: The Colorado Digitization Project Experience

For the Library of Congress Bicentennial Conference on Bibliographic Control for the New Millennium: Confronting the challenges of networked resources and the web

Submitted by
Liz Bishoff, Project Director
Colorado Digitization Project
And
William A. Garrison, Head of Cataloging
University of Colorado, Boulder
November 2000

Final version

In the last five years there has been significant growth in museum/library collaboration, in part due to the Institute of Museum and Library Services national leadership program and in part due to the growing realization that both libraries and museums are holders of collections that represent our rich and diverse cultural heritage.

Museum/library collaboration isn't just occurring in the United States. In 1999, the European Commission's Information Society Directorate General appointed a working group to develop a research framework for archives, libraries and museums that would support their work in the networked environment. The primary purpose of the research framework is to support access to resources available on the Internet. The document notes that the framework is based on the assumption, "...that libraries, archives and museums have shared research interests... can identify several broad goals that underpin these and encourage collaborative activity.... " [1]. The goals are:

While these institutions share similar goals and missions, there is no common vocabulary, no common policies on access and use by the public, no common term for this group of institutions, and no common standards to support the goal of access.

The report summarizes that these institutions:

Within this common vision, each of the communities addresses the goals within its own curatorial traditions and organizational contexts, and within its specific national or administrative framework. "The recognition that common interests converge on the Internet, driven by the desire to release the value of their collections...that support creative use by as many users as possible." [4] The participating institutions understand that users desire increased access to the intellectual and cultural materials in a flexible manner, without concern for who owns the resource. "To support this need, they recognize the need for services that provide unified routes into their deep collective resources...." [5]. At the same time these institutions are all developing their own approaches for organization and access to their resources. They may be working with subject based peer institutions across the continent or internationally to develop versions of Dublin Core (DC) or the Encoded Archival Description (EAD); or they may be working within their type of organization to develop Visual Resources Association (VRA) description for visual resources. There is little evidence of work across institutions of different types at the implementation stage.

Assuming that U.S. museums, libraries and archives share the same goals and vision as our European colleagues, then the issues discussed at the 'Bicentennial Conference on Bibliographic Control for the New Millennium' must be discussed within a community that involves our museum and archive colleagues. For as the EC paper notes, without providing our common users with a means of identifying the unique resources and special collections, the mission of access to our heritage will be severely restricted. Several papers, including that by Caroline Arms, touch on the issues related to the collaboration of many institutions on the American Memory Project [6].

This paper will focus on the specific experiences of the Colorado Digitization Project (CDP) related to accessing a diverse set of primary resources held by many different cultural heritage institutions. The paper will address issues that arise from different cataloging and metadata standards and diverse user populations and needs. The biggest challenge for the CDP is to bring metadata from the various institutions together in a single union catalog and to present the user with retrieval of digital objects stored in a distributed network environment.

Description of the project:

The Colorado Digitization Project, begun in the fall of 1998, is a collaborative initiative involving Colorado's archives, historical societies, libraries, and museums. The CDP's goal is to create a virtual digital collection of resources that provides the people of Colorado access to the rich historical, scientific and cultural resources of the state. Project participants will be able to contribute content that has been reformatted into digital format, as well as born-digital content. The virtual collection will include such resources as letters, diaries, government documents, manuscripts, music scores, digital versions of exhibits, artifacts, oral histories, and maps.

Initial funding from the Colorado State Library supported the development of the collaborative, identification of ongoing and planned digitization initiatives, development of best practices for digitization projects, a small pilot project and identification of future funding options. For fiscal years 1999-2001, the CDP was awarded a two-year $499,999 grant from the Institute of Museum and Library Services and a second LSTA grant of $107,000. In addition, the Regional Library Systems of Colorado awarded the CDP a $36,000 grant. The grant funds supported the expansion of the project to include:

The CDP Strategic Plan for 1999-2002 (http://coloradodigital.coalliance.org/about.html) establishes the project goals:

To implement the plans, the CDP has a variety of working groups with membership from different constituent groups. These groups were responsible for developing best practices for metadata and scanning. The CDP website (http://coloradodigital.coalliance.org), introduced in January 1999, provides access to resources and information about the project, the best practices on metadata and scanning, links to digital resources and information on legal issues. As of summer 2000, the website links to more than 40 digital collections available in Colorado. That number will double as the funded projects come online.

As part of the IMLS grant, the CDP will conduct two research projects, the first focusing on the impact that digital images available via the internet will have on museum attendance, and the second a project researching user satisfaction with two approaches for providing access to digital objects, the exhibit/interpretative approach vs. the catalog/database approach.

Environment for standards application in a cross-cultural heritage institution group:

In order to meet the objective of increased access to digital collections, the first effort undertaken by the CDP was identifying the approaches used among the existing projects to provide access to their collections. Among the initial 15 projects, there were 8 libraries and 7 non-library participants. Among those there was a range of approaches to providing access to the collections. Several provided access through their local library system and MARC records. Several presented exhibits with an additional database to search for individual digital objects. Many offered only exhibits, while two provided access via a locally developed database. One university library offered collection level MARC records linked to HTML finding aids, which in turn linked to images. Clearly even at this early stage, there was no dominant approach and therefore little possibility of a single standard or a single search engine. This is due in part to the lack of a dominant standard, in part to the early stage of development of systems supporting access to digital objects through the new standards, and in part to the lack of a funded mandate that would provide for a single system or approach. Additionally, when a web search was undertaken these sites frequently weren't located, as they were several layers down on the host website. Where a database supported searching of specific images, the images weren't located, as web search engines cannot search the contents of an underlying database.

Outside the library community, these organizations either used a specialized standard for description and specialized thesauri or taxonomies or they created their own with some providing no metatags at all. In the library and archival community there was use of collection level description and item level description. None of the current or planned projects had adopted the Encoded Archival Description, Dublin Core, Text Encoding Initiative or any of their derivative standards.

Like the European Community, the CDP found that there was a lack of common vocabulary, lack of common software, and a lack of standards that would support interoperability.

The CDP and standards:

It is within this environment that the Metadata Working Group began its work. In addition to understanding the approaches taken by current and planned projects, the group reviewed current and emerging standards, including EAD, MARC, Government Information Locator Service, DC, VRA, etc., for common elements. As web searching would not provide the desired access, and a single centralized metadata and image system would not be politically or financially feasible, the working group recommended the development of a union catalog of metadata to provide a desired level of access, hoping that future developments in web searching would negate the long term need for the union catalog. The working group's metadata guidelines are intended to promote best practices and consistency in the creation of metadata records across the different cultural heritage institutions and skill levels, while enhancing online search and retrieval accuracy, improving resource discovery capabilities, and facilitating interoperability. To achieve this objective, institutions must create metadata or cataloging data at a sufficient level to support the identification and access needs.

The metadata standard chosen by an institution depends on a variety of factors. These factors include the type of materials that are being described and digitized, the purpose of the digitization project (access or preservation or both), the user community, the knowledge and expertise of project staff, and the technical infrastructure of the institution. The level of detail for a resource also varies from institution to institution. Some administrative information may be proprietary or confidential and may not be distributed or accessible on systems open to public access. Agreement on inclusion of such administrative information is unlikely. As a result, the Metadata Working Group determined that information of this type would not be retained in the union catalog record.

The CDP Core Elements:

Based on the analysis of the metadata standards, the working group recommended adoption of the Dublin Core/XML metadata for the union catalog. Rather than adopting a specific communication form such as MARC or EAD, the working group developed a minimum set of elements that must be included in a cataloging or metadata record based on the fifteen Dublin Core elements. The working group recognized that additional elements might be required for particular formats and has accommodated this in its recommendations.

The recommendations of the group for the "core" and "full" record in Dublin Core are as follows:

Mandatory Elements:       Optional (Desirable) Elements:
Title                     Contributor
Creator                   Publisher
Subject                   Relation
Description               Type
Identifier                Source
Date Digital              Language
Date Original             Coverage
Format View               Rights
Subject: Classification number
Identifier: Owning Institution

The "mandatory" or "core" elements were designed along the same lines as the core records developed by the Program for Cooperative Cataloging. In addition, the working group recommended that a "qualified" Dublin Core be implemented. This record employs modifiers and schemes for each element as appropriate. For example, the working group recommends that subject terms be drawn from a recognized thesaurus. The CDP Metadata Guidelines (http://coloradodigital.coalliance.org/guides) provide links to all publicly accessible subject heading lists and thesauri.
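A check for the mandatory elements could be sketched as follows. This is only an illustration, not the CDP's software: the record layout (a dictionary of element names to lists of values), the function name `missing_mandatory`, and the sample record are all assumptions made for the example. The element names are copied from the Mandatory Elements list above, with "Format View" kept as printed there.

```python
# Mandatory element names, copied from the Mandatory Elements list above.
# "Format View" is kept exactly as printed there.
MANDATORY_ELEMENTS = [
    "Title", "Creator", "Subject", "Description",
    "Identifier", "Date Digital", "Date Original", "Format View",
]


def missing_mandatory(record):
    """Return the mandatory elements that are absent or empty in a record.

    `record` maps an element name to a list of string values, so that
    repeatable elements such as Subject can carry several values.
    """
    missing = []
    for name in MANDATORY_ELEMENTS:
        values = record.get(name, [])
        # Treat whitespace-only values as empty.
        if not any(value.strip() for value in values):
            missing.append(name)
    return missing


# Hypothetical record, invented for illustration only.
sample = {
    "Title": ["Main Street, looking north"],
    "Creator": ["Unknown"],
    "Subject": ["Streets", "Commercial buildings"],
    "Description": ["Black-and-white photograph of a town street."],
    "Identifier": ["http://example.org/images/0001.jpg"],
    "Date Digital": ["2000"],
}

print(missing_mandatory(sample))
```

A contributing institution could run such a check before batch loading, so that incomplete records are corrected locally rather than in the union catalog.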

Each element of the Dublin Core has been defined. For example, the subject element has a web page as follows:

Subject

Label: Subject

Definition: Topic of the digital resources. Typically, subject will be expressed as keywords or phrases that describe the subject content of the resource, or terms related to significant associations of people, places, and events, or other contextual information.

Mandatory: Yes

Repeatable: Yes

Scheme: Use established thesaurus: Library of Congress Subject Headings (LCSH), Art and Architecture Thesaurus (AAT), Thesaurus for Graphic Materials (TGM), Medical Subject Headings (MESH), ICONCLASS, etc.

Input guidelines:

  1. Prefer use of most significant or unique words, with more general words used as necessary
  2. Subjects may come from the title or description field, or elsewhere in the resource
  3. If the subject is a person or organization, enter as outlined under Creator

Examples of subject terms/descriptors are also provided.
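The Subject definition above (mandatory, repeatable, qualified by a scheme) might be encoded in XML along the following lines. This is a minimal sketch under stated assumptions: the `<record>` wrapper, the `scheme` attribute, the function name `build_subjects`, and the two sample terms are illustrative choices, not the CDP's actual encoding. Only the Dublin Core element namespace and the scheme abbreviations come from established usage and from the definition above.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"  # Dublin Core element set namespace
ET.register_namespace("dc", DC_NS)

# Scheme abbreviations taken from the Subject definition above.
SUBJECT_SCHEMES = {"LCSH", "AAT", "TGM", "MESH", "ICONCLASS"}


def build_subjects(subjects):
    """Build a <record> element with one dc:subject per (term, scheme) pair.

    The <record> wrapper and the `scheme` attribute are illustrative
    assumptions, not the CDP's actual encoding.
    """
    # Subject is mandatory, so an empty list is rejected.
    if not subjects:
        raise ValueError("Subject is mandatory: supply at least one term")
    record = ET.Element("record")
    # Subject is repeatable, so each term gets its own element.
    for term, scheme in subjects:
        if scheme not in SUBJECT_SCHEMES:
            raise ValueError(f"Unrecognized subject scheme: {scheme}")
        element = ET.SubElement(record, f"{{{DC_NS}}}subject")
        element.set("scheme", scheme)
        element.text = term
    return record


# Hypothetical terms, chosen only to show two schemes side by side.
record = build_subjects([
    ("Fossils", "LCSH"),
    ("costume accessories", "AAT"),
])
print(ET.tostring(record, encoding="unicode"))
```

Recording the scheme alongside each term keeps the source vocabulary visible after records from different institutions are merged.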

Issues with Dublin Core:

Adopting the Dublin Core framework at this early stage is risky; however it is likely to be the best option for integrating records using a variety of international best practices/standards. Adopting Dublin Core in 2000 is like adopting MARC in 1970. Early adopters of MARC recognized that there would be changes to MARC, that systems would become available to support it, etc. We are facing a similar issue in 2000 with Dublin Core. As the project was focusing on metadata for digital objects vs. websites, significant interpretation was required. Most problematic for the working group was the handling of the date for the original object, which was needed to qualify searches. Using the Source field for this information would negate the possibility of qualifying searches by date. After many discussions, the group decided to add an additional date field to accommodate the original date. The other aspect that caused the group difficulty was accommodating the functional metadata relating to the digital object. Again after much discussion, the group decided to use two Format fields: one for the requirements for use of the materials and a second for the requirements for creation of the resource. Lastly the group added a field for holding institution, allowing the user to limit searches by the owning institution.
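The value of a separate original-date field and an owning-institution field can be illustrated with a small search limit. This is a sketch only: the field names `date_original`, `date_digital`, and `owning_institution`, the function `limit_search`, and the three records are hypothetical and are not drawn from the CDP system.

```python
def limit_search(records, start_year, end_year, owning_institution=None):
    """Keep records whose ORIGINAL date falls within [start_year, end_year].

    The digitization date is deliberately ignored, and an optional
    owning institution narrows the result further.
    """
    matches = []
    for record in records:
        year = record.get("date_original")
        if year is None or not (start_year <= year <= end_year):
            continue
        if (owning_institution is not None
                and record.get("owning_institution") != owning_institution):
            continue
        matches.append(record)
    return matches


# Hypothetical records, invented for illustration only.
records = [
    {"title": "Mining camp diary", "date_original": 1882,
     "date_digital": 2000, "owning_institution": "Historical Society A"},
    {"title": "Town plat map", "date_original": 1921,
     "date_digital": 1999, "owning_institution": "Library B"},
    {"title": "Born-digital exhibit", "date_original": 2000,
     "date_digital": 2000, "owning_institution": "Museum C"},
]

hits = limit_search(records, 1880, 1930)
print([r["title"] for r in hits])
```

Had the original date been placed in a free-text Source field, a range limit of this kind would not be possible without parsing that text.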

As noted in other papers, software supporting both the creation and use of Dublin Core based records is slow to develop and implementation is unsettled due to the evolving nature of the standard. The advantage of adopting Dublin Core is that many specialized communities, archives, libraries and museums are creating Dublin Core based derivatives for their communities.

What do you describe?

Not unexpectedly, the issue of cataloging the original versus cataloging the digital object has arisen, regardless of whether the owning institution is a museum or library. Some institutions catalog the original item, providing a link to the digital image/object. This practice, in most instances, does not preserve or record any of the details of the digital object (e.g., scanning equipment, resolution, rights management, etc.). In many of these cases, it is a financial decision. The cataloging already exists for the original and the most cost effective approach is to provide access to the digital version by adding a URL or other linking identifier. In many instances the digital object is considered secondary to the original, so even where the original item is not yet cataloged, cataloging the original is preferred, with a URL link to the digital version. The public service and reference librarians have also expressed concern about multiple records for the same item. This discussion is not dissimilar to the multiple version discussions the library community has had for more than two decades.

Some institutions catalog at the collection level and not at the item level, others catalog at the item level only, and others catalog both at the collection and item level. Within one institution all three approaches have been taken. Those that provide access to the digital object through a collection level record generally have finding aids. As with original cataloging, the existing finding aid is converted to HTML rather than to another format. As finding aids focus on the hierarchical relationship of the items within the collection, there is little subject rich terminology for the item level materials, limiting access to the individual resources in the collection. In response, institutions are expanding the subject terms for the collection level cataloging. With the future hope of full text indexing of the resources in the collection, enhanced retrieval is a possibility, but until then the only other option is providing the enhanced subject terms in the finding aids themselves.

To accommodate the different approaches and different standards, the CDP licensed the OCLC SiteSearch software to build its union catalog for accessing the digital collections in Colorado. The SiteSearch software allows CDP participants to batch load records into the system and supports online record creation. The CDP is working with OCLC on enhancements to the software, as there are currently limitations on the variety of formats handled. It is anticipated that SiteSearch, as implemented by CDP, will enable participants to contribute records in a variety of formats. A loader profile has been developed for the CDP participants. Initially records may be batch loaded in either MARC format or SGML/XML. The SGML/XML capability will be used to load locally developed databases, as well as commercial databases supporting the museum and historical society communities. The capability to load records in Encoded Archival Description (EAD) as well as records in other formats (e.g., VRA) is being explored with OCLC. Initially the CDP had planned to use the SiteSearch record builder capability allowing input in either Dublin Core or MARC, but due to time constraints in implementation, the CDP will offer a locally developed search intake mechanism for online input. These online records, built with a Dublin Core template, together with the MARC records loaded from library local systems and the SGML/XML loaded records, will create a single union catalog. All records will be converted to the CDP defined Dublin Core elements.
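The conversion of loaded MARC records to Dublin Core elements can be pictured as a simple crosswalk. The sketch below is not the SiteSearch loader profile: the tag-to-element mapping is a simplified assumption loosely following common MARC-to-Dublin Core practice, the `(tag, value)` representation of a MARC record is a convenience for the example, and the sample record is invented.

```python
# Simplified, assumed mapping from MARC field tags to Dublin Core elements.
# A real loader profile would also handle indicators and subfields.
MARC_TO_DC = {
    "100": "Creator",
    "245": "Title",
    "520": "Description",
    "650": "Subject",
    "856": "Identifier",
}


def marc_to_dc(marc_fields):
    """Convert (tag, value) pairs into a dict of element name -> list of values.

    Tags with no entry in MARC_TO_DC are dropped; repeated tags
    (for example several 650 fields) accumulate in order.
    """
    record = {}
    for tag, value in marc_fields:
        element = MARC_TO_DC.get(tag)
        if element is None:
            continue
        record.setdefault(element, []).append(value)
    return record


# Hypothetical MARC record, invented for illustration only.
marc_record = [
    ("100", "Smith, Jane"),
    ("245", "View of the river canyon"),
    ("300", "1 photograph"),  # no mapping in this sketch, so it is dropped
    ("650", "Canyons"),
    ("650", "Rivers"),
    ("856", "http://example.org/images/0042.jpg"),
]

print(marc_to_dc(marc_record))
```

Once every incoming format is reduced to the same element names, records from library systems, museum databases, and online entry can be indexed together in one catalog.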

Among the features that CDP hopes to have incorporated into SiteSearch in the future are the ability to load records in formats other than MARC and SGML/XML, the ability to export records, an authority control feature/system, and an improved online entry and maintenance system. While SiteSearch has been specifically designed as a library "system", CDP is expanding the system to meet the needs of the varied cultural heritage institutions involved in this collaborative venture.

Subject terminology:

A wide range of issues exists in the area of subject retrieval in the CDP Union Catalog. The mix of cultural heritage institutions resulted in many specialized institutions, for example the Florissant Fossil Beds National Monument and its collection of 6000 unique fossils, the Crow Canyon Archaeological Center and its large collection of archaeological materials, or the Boulder History Museum and its more than 4000 costumes and accessories. The first two use taxonomies from their specialized fields, while the third uses The Revised Nomenclature for Museum Cataloging: A Revised and Expanded Version of Robert G. Chenhall's System for Classifying Man-Made Objects, by James R. Blackaby, Patricia Greeno, and the Nomenclature Committee (American Association for State and Local History, 1988). At the same time some of the smaller or more general collections will contain these types of resources or subjects, but use a more generalized subject heading list such as the Library of Congress Subject Heading List. The CDP Union Catalog will provide access to this entire range of terms without an authority control system. As a result, unless the user knows both the general and specialized taxonomy, retrieval will be limited to the term input. To address this situation, the project is testing the use of Dewey Decimal Classification numbers that will be assigned to each record, allowing the linkage of general terms and highly specialized terms within a browse feature. When using the keyword or advanced search capabilities, users will retrieve only the term or terms entered, a common approach for both museums and libraries.
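The contrast between keyword retrieval and a classification-based browse can be sketched as follows. The records, subject terms, and class numbers are invented for illustration and have not been checked against the Dewey schedules; the function names and the choice to group on the first three digits are likewise assumptions, not a description of the feature being tested.

```python
from collections import defaultdict


def keyword_search(records, term):
    """Return titles of records that carry exactly the term entered."""
    wanted = term.casefold()
    return [r["title"] for r in records
            if any(s.casefold() == wanted for s in r["subjects"])]


def build_browse_index(records, digits=3):
    """Group records by the leading digits of their classification number."""
    index = defaultdict(list)
    for record in records:
        index[record["ddc"][:digits]].append(record)
    return index


def browse(index, ddc_class):
    """List the distinct subject terms filed under one class, in load order."""
    terms = []
    for record in index.get(ddc_class, []):
        for term in record["subjects"]:
            if term not in terms:
                terms.append(term)
    return terms


# Hypothetical records; class numbers are illustrative, not verified.
records = [
    {"title": "Fossil leaf", "subjects": ["Fossils"], "ddc": "560"},
    {"title": "Insect compression", "subjects": ["Eocene insect compressions"], "ddc": "560"},
    {"title": "Silk bodice", "subjects": ["Costume"], "ddc": "391"},
]

index = build_browse_index(records)
print(keyword_search(records, "Fossils"))  # only the record with that exact term
print(browse(index, "560"))                # general and specialized terms together
```

In this sketch a keyword search for the general term misses the record described with the specialized term, while the browse by class number presents both terms side by side.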

The project is addressing one area of authority control: terms for Colorado geographic names and subjects. In order to assure some level of consistency in terminology, the CDP has developed a list of Colorado terms that a user can search from the SiteSearch web interface. The list can be searched by specific term or through a browse function. The list is being created by extracting headings from the Prospector database, the database reflecting the collections of Colorado's major public and academic research libraries, as well as the community colleges and four year schools. The Metadata Working Group has begun exploring the idea of turning this list into a real thesaurus and/or a full authority file. The latter would be approached through a statewide NACO/SACO project creating name headings and subject headings to be added to the Library of Congress Name Authority File and Library of Congress Subject Heading List.
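A term list that supports both an exact search and a browse from a starting point can be sketched with a sorted list and binary search. The class `TermList`, its methods, and the sample headings are hypothetical; nothing here reflects how the SiteSearch list is actually stored or how headings are extracted from Prospector.

```python
import bisect


class TermList:
    """A sorted heading list supporting exact search and alphabetical browse."""

    def __init__(self, headings):
        # Remove duplicate headings and sort without regard to case.
        self._headings = sorted(set(headings), key=str.casefold)
        self._keys = [h.casefold() for h in self._headings]

    def search(self, term):
        """Return True if the term is present, ignoring case."""
        key = term.casefold()
        i = bisect.bisect_left(self._keys, key)
        return i < len(self._keys) and self._keys[i] == key

    def browse(self, start, count=3):
        """Return up to `count` headings filed at or after `start`."""
        i = bisect.bisect_left(self._keys, start.casefold())
        return self._headings[i:i + count]


# Hypothetical headings, invented for illustration only.
terms = TermList([
    "Pikes Peak (Colo.)",
    "Boulder (Colo.)",
    "Denver (Colo.)",
    "Boulder (Colo.)",  # duplicate from a second source record
])

print(terms.search("boulder (colo.)"))
print(terms.browse("D", count=2))
```

A catalog record creator could consult such a list before entering a geographic heading, which is the kind of consistency the Colorado term list is meant to encourage.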

What needs to be addressed in the shared cultural heritage environment?

Shared development: In order to reach commonality in standards and address the interoperability issues, participants from across the range of institution types need to be at the table at the start of the discussions. Libraries cannot determine the standards and assume that museums, archives and other cultural heritage institutions will adopt them.

Standards: The key to participation of a wide range of institutions lies in the ability to allow the metadata creators to use multiple standards while attempting to ensure that there is agreement between the various standards for some commonality in the access points provided. This will clearly call for the cultural heritage institutions (including libraries) to have discussions related to access and interoperability issues. Assuming that some commonality among/between the various standards can be reached, there will clearly be an impact on the search engines used to access these resources.

Interoperability: Many projects state that they have as an objective the interoperability of the systems; however, when queried, interoperability means adoption of a single set of standards and use of a single system or adoption of one vendor's software. At this time the predominant communication format for libraries, MARC, doesn't support the descriptive elements required by museums and archives. The same is true in reverse: museum based software doesn't meet the standards of libraries. With the development of XML and Dublin Core there is some hope that the different needs may be accommodated in a single system.

Resource discovery services: With the development of the OCLC CORC service, we have the first opportunity to build a resource discovery service that supports standards (Dublin Core) that have possible use by different cultural heritage institutions. Unfortunately OCLC services are library-centric. Adoption of CORC by non-libraries will not come easily, as the system development did not include non-library representation and input.

Cataloging issues: Cataloging differences also pose some challenges. The cataloging of three-dimensional objects provides a good example. The museum community typically does not assign titles to such objects whereas libraries routinely supply titles to objects or items that lack them. The question arises: does it make a difference if there isn't a title supplied? How is retrieval affected? Another example occurs with the level of specificity applied in subject analysis. A very small historical society may not need the same level of specificity in the description of its materials, as does a large historical society, library or museum. What impact will different levels of subject analysis and specificity have on retrieval?

Authority control: Our discussions of authority control innovation must also include use of taxonomies as well as thesauri and subject heading lists. Barbara Tillett's suggestion of a single integrated authority record sounds appealing, however complicated [7]. The subject "field" as defined in Dublin Core with the appropriate scheme qualifiers almost presumes an ability for a system/search engine to perform cross-vocabulary searching. This certainly also poses a whole different set of challenges.

Will we succeed?

We expect to succeed. To do that, the best practices will have to become standards and the standards will have to continue to evolve, much as MARC has, and most important, the standards will have to be adopted. It is only when the standards are adopted that systems will be developed to support the widespread use. For us to achieve the vision of providing our citizens with the broadest possible access to the cultural resources of our peoples, we will need to develop standards and systems that have broad-based adoption across the different cultural heritage communities. To do that, we have to sit down at the table together. The people at today's conference have the opportunity to take a leadership role in calling together the cultural heritage institutions of the United States to begin working on the issue of how to increase access to our collective digital resources.


Notes

1-5. "Scientific, Industrial and Cultural heritage: a shared approach; a research framework for digital libraries, museums and archives," Ariadne, Issue 22.
6. Arms, Caroline, "Some Observations on Metadata and Digital Libraries," Bicentennial Conference on Bibliographic Control for the New Millennium, November 15-17, 2000.
7. Tillett, Barbara, "Authority Control on the Web," Bicentennial Conference on Bibliographic Control for the New Millennium, November 15-17, 2000. (http://www.loc.gov)

Library of Congress
November 7, 2000