Using XML and the Open Archives Initiative to Harvest and Reuse Content
November 6, 2001
Walter L. Warnick, Ph.D.
Director, Department of Energy Office of Scientific and Technical Information
http://www.osti.gov

Presentation at ASIST Annual Meeting

 


[Slide 1 - Title, intro]

Knowledge and ignorance are always at war. The remarkable improvement in the human condition--the great achievement of our time--depends upon knowledge more than any physical resource. Information programs like ours at DOE have been advancing knowledge and beating back ignorance for decades. We must continue to push back the boundaries of ignorance for the sake of our future.

The current situation with Afghanistan is a perfect example. What underlies our military advantages in that conflict? Is it the superior ferocity of our soldiers? Is it the superior intellect of our generals? Perhaps, but we have no real evidence to support these notions. Rather, our military advantages stem, in large part, from superior technology. And what is technology? Isn't it simply the physical expression of scientific and technical knowledge?

Just as the military has specialists--the army to project force on land and the navy to project force on the seas--the information community has specialists, too. The role of my office is to facilitate information discovery. In the words of our enabling legislation, quote, "The dissemination of scientific and technical information relating to energy should be permitted and encouraged so as to provide free interchange of ideas and criticism which is essential to scientific and industrial progress and public understanding and to enlarge the fund of technical information."

To help us in the war between knowledge and ignorance, we are now blessed with a revolutionary new tool--the Internet. We at DOE are deploying it as fast as we can. We are taking every opportunity to break new ground with each new deployment. My hope is to rally lovers of science and technology to our cause by our efforts and by our successes.

Today in government, there is much talk about performance metrics. My office can offer up wonderful success stories about numbers of information transactions, numbers of patrons served, and quantities of information accessible and delivered. We invite benchmarking against our sister government agencies. We invite benchmarking against the private sector. The conventional wisdom that government IT deployment does not compare well with the private sector simply does not apply to my organization.

While the course of the war between knowledge and ignorance is unknown, its long-term outcome is certain. In the long run, the promise of greater access to, and retrievability of, more and more knowledge is too great to be denied.

The next steps already loom on the horizon. Right now, for example, the focus of on-line retrieval of journal literature is often limited to searching ABSTRACTS, titles, and authors. It is a trivial matter to build a system to allow FULL-TEXT searching of journal literature, and some progress in that direction has already been made. Particularly exciting now is the prospect of applying existing technology to create full-text indexes very inexpensively. Somewhere, somehow, someone will build such a system, and build it soon, because the promise is so great. This will be another major advance for knowledge over ignorance, as journal literature is such an important vehicle for scientific communication.

Other advances loom on the horizon as well, such as the Open Archives Initiative, which enlists simple metadata standards and XML to allow information to be repurposed easily.

[Slide 2 - Map, path]

[Slide 3 - The Challenge]

The Challenge

Today those of us who are in the federal scientific and technical information sector face a particular challenge: how can we harvest, share, and reuse our information content in a way that facilitates information discovery?

Historically, each federal agency has used different formats and technologies, intended mainly for its own digital library. Typically, these agency-specific systems are designed to facilitate neither inter-agency information discovery nor information sharing.

This is a fundamental interoperability problem. It will not be resolved until we find a way for content owners--our agencies--to work together to harvest, share, and archive our information resources. We need to adopt standards that will promote the required interoperability.

As a first step toward interagency cooperation, we have formed the Science.gov Alliance. Eleanor Frierson has already discussed the Alliance. I will focus on what I propose to be the next phase for Science.gov. What I will offer are my views. They have been discussed only in very general terms with other Alliance members. This presentation is food for thought.

Those of us who have been around for a while might have a sense of déjà vu. Schemes to facilitate information discovery and retrieval have come and gone. For example, some years ago the Government Information Locator Service (GILS) was the center of attention as a standard that agencies were to adopt. But GILS has had its day and has now fallen from favor. Someone should do a post mortem so that there might be a definitive analysis of what went wrong.

In my opinion, GILS failed because its benefits were too small in comparison with its burdens. If a GILS record were made at the document level, it placed a burden on the information creator, who realized little or no direct benefit from the record. On the other hand, if a GILS record were made at the system level, too little information was contained therein to facilitate discovery. Either way, there was a fatal flaw with GILS.

From this experience, I conclude that, to succeed, an information architecture must be simple and relatively easy to implement. There is hope on the horizon.

[Slide 4 - Three approaches]

Alternative Approaches

The National Science Foundation has advanced the National SMETE Digital Library (NSDL) community. It has identified three approaches for establishing cooperation across digital libraries: Federation, Harvesting, and Gathering. I believe this terminology originated with Bill Arms of Cornell University. The three approaches differ in the extent of the burden accepted by participating libraries and, conversely, in the benefits provided for discovery.

At the high-burden end of the spectrum is the Federation approach. Participating organizations agree on interoperability standards and protocols, and then build the systems that form a federation. Extensive up-front efforts are made to reach agreement on organizational, content, and technical issues. The library community is an excellent example of such a federation, sharing online catalog records using Z39.50, MARC, and the Anglo-American Cataloguing Rules (AACR2). The cost of participation is normally high for federated activities: it is resource intensive to implement, adhere to, and maintain standards. As a result, membership in these kinds of federations tends to be limited.

At the low-burden end of the spectrum is the Gathering approach. It requires no cooperation and depends on web-based search engines, such as Google and AltaVista, and on special applications like my office's PrePRINT Network (http://www.osti.gov/preprint), to provide information discovery. This approach mirrors the current situation on the Web. Its advantage is the minimal resource investment required.

In the middle of the spectrum we find the Harvesting approach, which mitigates some of the more onerous and difficult issues associated with Federation and Gathering. In short, Harvesting lowers the barrier to participation while facilitating accessibility of digital library collections. We believe Harvesting, this middle ground, is the preferred approach for many practical areas of information discovery.

[Slide 5 - OAI background]

The Open Archives Initiative (OAI)

The Open Archives Initiative (OAI) is a Harvesting approach. The OAI has its origins in the e-prints community, where the word "archive" is accepted to mean repository. The OAI has subsequently broadened its scope to provide a technical umbrella for interoperability - an umbrella used by many different publishing communities.

OAI is designed to provide a low-barrier approach to interoperability among publishers of information. Further, OAI provides an easy way to implement and deploy mechanisms to harvest and disseminate information. Using a choice of common metadata standards, OAI facilitates searching across all participating repositories - in fact, users can query these repositories as if they were a single site. By agreeing to common standards, one repository can be part of a larger community of repositories, thus increasing the value of both the individual repository and the entire community. In our minds, this is a "win-win" situation. Active supporters of OAI include the Digital Library Federation, CrossRef (a collaboration of 78 learned society and commercial publishers), and OCLC (Online Computer Library Center, a worldwide consortium of libraries and agencies).

[Slide 6 - OAI Division of labor]

Fundamental to OAI is a division of labor between participants:

* Data Providers. These participants make their metadata accessible through the OAI protocol.

* Service Providers. These participants harvest metadata from Data Providers using the OAI protocol and then use the metadata for value-added services. For example, DTIC could add OSTI's metadata to its existing repository simply by harvesting the metadata using the OAI protocol.

* An organization can be both a Data Provider and a Service Provider. In that scenario, the organization would expose its metadata for harvesting by others (data provider) and also harvest metadata from others to develop its own value-added services (service provider).

NSDL is an example of an organization that promises to be both a data provider and a service provider. OSTI will also perform both roles. In my view, OAI is particularly attractive to organizations that combine both roles, because the party that invests the effort to be a data provider also reaps the reward of being a service provider.
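
To make the DTIC example concrete: under OAI, a harvest is nothing more than an HTTP GET request against a data provider's repository interface. The protocol defines six verbs - Identify, ListMetadataFormats, ListSets, ListIdentifiers, ListRecords, and GetRecord. The request below is a sketch only; the repository address is hypothetical, and oai_dc names the Dublin Core metadata format discussed under Slide 8.

    http://www.osti.gov/oai?verb=ListRecords&metadataPrefix=oai_dc

A service provider would issue a request like this on a schedule, load the returned records into its own database, and build search or alerting services on top of them.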

[Slide 7 - OAI Framework]

OAI provides a technical and organizational metadata-harvesting framework designed to facilitate information discovery in distributed repositories. This framework consists of two parts: a metadata definition and a common protocol that together allow the harvesting of metadata from participating digital libraries.
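
In outline, a harvesting exchange looks like the following. The response shown is a sketch: exact element names, namespace declarations, and envelope details depend on the version of the protocol, and the repository address and record identifier are invented for illustration.

    Request (HTTP GET):
    http://www.osti.gov/oai?verb=GetRecord&identifier=oai:osti.gov:12345&metadataPrefix=oai_dc

    Response (XML):
    <?xml version="1.0" encoding="UTF-8"?>
    <GetRecord>
      <record>
        <header>
          <identifier>oai:osti.gov:12345</identifier>  <!-- hypothetical -->
          <datestamp>2001-10-01</datestamp>
        </header>
        <metadata>
          <!-- a Dublin Core record, as illustrated under Slide 8 -->
        </metadata>
      </record>
    </GetRecord>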

[Slide 8 - Dublin Core]

Dublin Core

The OAI framework relies on interoperable and extensible metadata. To meet these requirements, the OAI selected a choice of metadata standards, one of which is the existing Dublin Core Metadata Element Set. The Dublin Core gets its name from the first workshop in the Dublin Core series, which took place in Dublin, Ohio, in 1995. Developed with international participation, Dublin Core is an Internet Engineering Task Force (IETF) standard (RFC 2413) and is used extensively worldwide.

The Office of Scientific and Technical Information adopted Dublin Core as its metadata standard three years ago. We recognized that it provided a more economical and efficient way of describing our information. Dublin Core is also the metadata basis for our new product, the Energy Citations Database (ECD). ECD contains bibliographic records for energy and energy-related scientific and technical information from the Department of Energy (DOE) and its predecessor agencies, the Energy Research and Development Administration (ERDA) and the Atomic Energy Commission (AEC). As such, ECD provides access to publicly available DOE citations from 1948 through the present, with continued growth through regular updates. The Dublin Core standard is described in detail at http://dublincore.org/ - we invite your perusal.

All the fields in the Dublin Core are optional. While this may seem unusual or counterproductive, it is consistent with the philosophy of keeping the barriers to participation low. Organizations also have the option of maintaining richer metadata while using the Dublin Core as an exchange format.
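
By way of illustration, a minimal Dublin Core record encoded in XML might look like the sketch below. The title, date, and identifier are invented for this example, and a real record would also carry the namespace declarations prescribed by the harvesting protocol.

    <dc>
      <title>Annual Report on Fusion Energy Research</title>      <!-- hypothetical -->
      <creator>Office of Scientific and Technical Information</creator>
      <subject>fusion energy</subject>
      <description>A sample bibliographic record.</description>
      <publisher>U.S. Department of Energy</publisher>
      <date>2001-11-06</date>
      <type>text</type>
      <identifier>http://www.osti.gov/example/12345</identifier>  <!-- hypothetical -->
      <language>en</language>
    </dc>

Because every element is optional, an organization with sparse records can still participate; the remaining descriptors of the fifteen (Contributor, Format, Source, Relation, Coverage, and Rights) are simply omitted when they do not apply.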

[Slide 9 - Dublin Core Characteristics]

Characteristics of Dublin Core are:

* Simplicity: The Dublin Core has about the complexity of a library catalog card, with 15 metadata descriptors such as Creator, Date, and Description.

* Semantic Interoperability: It provides a commonly understood set of descriptors that facilitates mapping to other data content standards, increasing the possibility of interoperability.

* International Consensus: The Dublin Core is accepted and in use in some 20 countries in North America, Europe, Australia, and Asia.

* Extensibility: The Dublin Core provides an economical alternative to more complex metadata models. It also includes the flexibility and extensibility to accommodate more complex metadata standards.

[Slide 10 - XML]

Extensible Markup Language - XML

Recall that OAI-compliant repositories are Internet-accessible servers operated by the data providers. These repositories disseminate Extensible Markup Language (XML)-encoded metadata records.

XML is not a new standard; it has roots in SGML. The development of XML began in 1996, and it was adopted as a W3C Recommendation in February 1998. At one time, OSTI strongly supported the SGML standard, but there was much resistance even within DOE, owing to the very high labor demands then required to convert a document to SGML. Recognizing this problem, OSTI subsequently and enthusiastically adopted XML as a less demanding standard for information exchange and discovery.

[Slide 11 - XML Benefits]

We remain strong advocates for XML, for the following reasons. Like the Hypertext Markup Language (HTML), XML is a nonproprietary standard. Further, XML provides access to data across hardware platforms and operating systems. In addition, XML takes HTML to the next step by using tags (for a value, person, place, or thing) only to delimit components of data. Once delimited, the actual interpretation of the data is left to the application program that reads it. This is indeed a far cry from the inflexible HTML situation, with its preset tags and attribute definitions.

Unlike HTML, XML can describe information in very granular detail, providing the opportunity for increased precision in search results. In addition, XML is internationally recognized as the de facto standard for business-to-business (B2B) and business-to-consumer (B2C) exchange.
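
A small illustration of that granularity, using an invented citation: HTML can only say how the text should look, while XML can say what each piece of the text is.

    <!-- HTML: presentation markup only -->
    <p><b>W. L. Warnick</b>, <i>Energy Citations Database</i>, 2001</p>

    <!-- XML: each component delimited by a descriptive tag -->
    <citation>
      <author>W. L. Warnick</author>
      <title>Energy Citations Database</title>
      <year>2001</year>
    </citation>

A search application reading the XML version can answer a query like "documents authored by Warnick" precisely; against the HTML version, it could only match the string wherever it happened to appear.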

[Slide 12 - Selective Harvesting]

Selective Harvesting

A key feature of the OAI is selective harvesting. Selective harvesting means that a subset of data can be identified for harvesting at any given time. There are two simple criteria for selective harvesting: date-based and set-based. That is, data can be harvested based on date fields (e.g., entry date, date last updated) or based on some discrete identifier (e.g., subject category, keyword, organization code). In addition, selective harvesting in lieu of large-scale queries would tend to reduce the operational workload on the system server - thus minimizing user complaints about system response time.
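
As a sketch (the repository address and set name below are hypothetical), the two criteria correspond directly to arguments of the harvesting request:

    # date-based: records added or changed during October 2001
    http://www.osti.gov/oai?verb=ListRecords&metadataPrefix=oai_dc&from=2001-10-01&until=2001-10-31

    # set-based: only records belonging to a 'physics' set
    http://www.osti.gov/oai?verb=ListRecords&metadataPrefix=oai_dc&set=physics

A harvester that last visited a repository in October need only ask for records changed since then, rather than re-querying the entire collection.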

[Slide 13 - Next Steps]

Next Steps

The next steps should be viewed in the context of Science.gov, an alliance of ten federal agencies:

* Department of Agriculture
* Department of Commerce
* Department of Defense
* Department of Education
* Department of Energy
* Department of Health and Human Services
* Department of the Interior
* Environmental Protection Agency
* National Aeronautics and Space Administration
* National Science Foundation

Each agency possesses a unique digital library of information. Using the OAI protocol, these unique collections can be shared and harvested. In that process, customized compilations of information can be identified and reused to create exciting new synergies for scientific discovery. Scientific and technical reports, preprints, journal literature, and research summaries can be shared and repurposed while adding value to each existing digital library. In this community of digital libraries, each repository of scientific and technical information that is added increases the value of the whole for those who participate.

OAI deserves a serious look as a standard protocol for government agencies to harvest, share, and reuse information to enable good science and promote our national interests.

[Slide 14 - Map - post mortem]

Sources

1. National SMETE Digital Library, Technical Infrastructure and Interoperability. http://www.smete.org/nsdl/workgroups/technical/technical_wg.html

2. Carl Lagoze and Herbert Van de Sompel (Digital Library Research Group, Cornell University, Ithaca, NY), The Open Archives Initiative: Building a Low-Barrier Interoperability Framework.

3. The Open Archives Initiative. http://www.openarchives.org

4. The Dublin Core Initiative. http://dublincore.org/about/overview/