Brief Overview of the DOE Scientific & Technical Information (STI) Harvesting Program 
Office of Scientific & Technical Information 

February 2003 - Revision 1

I. Background

As a major science and technology agency, the Department of Energy (DOE) conducts research and development (R&D) in a variety of fields. The knowledge gained during the R&D process is frequently imparted through scientific and technical information (STI), a key outcome of the Department's R&D and related endeavors. Within the DOE's Office of Science, the Office of Scientific and Technical Information (OSTI) is responsible for coordinating STI activities and for leading a collaboration to establish a distributed, electronic STI environment that meets Department-wide needs. In support of this mission, DOE organizations and contractor sites have for over 50 years provided the results of their R&D activities, in the form of scientific and technical documents, to OSTI in Oak Ridge, Tennessee. These STI documents are provided to OSTI in compliance with DOE Order 241.1A. Currently, document data are submitted to OSTI electronically through OSTI 241-type forms with accompanying electronic full text. As a result, the Department maintains in one source an archival and current record of DOE scientific and technical information from 1948 to the present.

New technology options have led OSTI to emphasize that DOE and DOE contractor sites provide electronic full text. As of January 2001, OSTI announced it would no longer accept paper documents. In keeping with the organization's effort to take advantage of new technology, OSTI is looking at new ways to acquire bibliographic information about DOE scientific and technical documents and to access the documents electronically at the DOE/DOE contractor site of publication. The plan is to accomplish this goal for unclassified, unlimited documents by "harvesting" a site's existing bibliographic database and accessing electronic documents maintained at the sites.

II. What is Harvesting?

Today, DOE and DOE contractor sites manually submit bibliographic data and electronic full text to OSTI via OSTI's E-Link 241 data entry format. A 2000 study of DOE/DOE contractor sites revealed that unclassified, unlimited bibliographic information and full-text documents are on a majority of DOE Web sites. It is now technically possible, using the OSTI Data Harvester System, to "harvest" data and documents already in existence at the sites and to incorporate the information into the OSTI Record System. In effect, OSTI can harvest a site's bibliographic information, put it into the OSTI System (repository), and point to the electronic full text located at the DOE/DOE contractor site. Through the DOE Harvesting Program, OSTI would like to capitalize on the work already done by sites and minimize any additional operational and manual requirements for the sites.

III. What are the benefits of harvesting?

Harvesting is expected to simplify the current DOE and DOE contractor site effort of manually providing STI metadata to OSTI: OSTI will access and "harvest" existing DOE unclassified, unlimited bibliographic records and point to electronic full text located at the DOE/DOE contractor sites. Harvesting of metadata and electronic full text can decrease the processing and/or forms input required by sites to comply with DOE Order 241.1A, increase the timeliness of announcement of STI to both DOE and public users, and minimize storage requirements for the Department. Use of the originating research organization name in the record assures that each DOE/DOE contractor site is credited in the OSTI record for the information it provides. For this approach to be successful and to assure continuing availability of full text, a site needs to notify OSTI if it decides to take an electronic document off-line, at which time OSTI will incorporate the electronic full text into its archival collection. A site's participation in the DOE Harvesting Program will facilitate OSTI's continued centralized maintenance of citations to documents published by DOE from 1948 to the present, as first provided in the Nuclear Science Abstracts (NSA), 1948-1976, and then the Energy Science and Technology Database (EDB), 1974 - September 30, 2001. As of September 2001, historical records were converted to a standardized Dublin Core format, available in the Energy Citations Database (ECD) at http://www.osti.gov/energycitations. ECD integrates STI from DOE/DOE contractors and provides search and retrieval capability for DOE STI from 1948 to the present. The DOE version of ECD (DOE ECD) is accessible to DOE and DOE contractor staff on a free domain-controlled Web site.

IV. What types of resources/documents will be involved in the Harvesting Program?

Initially, only unclassified, unlimited documents will be harvested. For now, unclassified/limited and/or classified documents must continue to be submitted to OSTI through the current channels (DOE F 241.1 or DOE F 241.3). Unclassified/limited documents may be harvested in the next phase.

V. What are the technical requirements for a DOE/DOE contractor site to participate in harvesting?

To fully participate in the DOE Harvesting Program, a site needs to have and/or to establish:
a. An up-to-date bibliographic database for unclassified unlimited documents.
b. A record update (add/change) date field that facilitates metadata selection.
c. A unique identifier for each record in the file.
- Note: This is key for OSTI duplicate record checks and for record updates.
- Note: Harvested data will be updated outside the E-Link 241 record correction facility. Changed harvested data will be "re-harvested" to replace an existing record.
d. An XML interface so that OSTI can access site metadata.
- Note: OSTI will access metadata via a URL with two parameters - (1) a start date and (2) an end date (based on the record update date field).
- Note: Fields may be in any order in the XML file. However, once OSTI and the site agree on an established order, the order must remain the same.
e. Availability of the site-developed XML interface at the predefined "harvest" time.
f. A site URL in each harvested record that references the electronic full text (if available) corresponding to the metadata harvested from the site. The site URL must be specific to the electronic full-text location.
- Note: This requirement is not applicable for bibliographic records without full text.
- Note: For electronic full text, PDF Normal format is preferred for harvested records; however, other formats will be considered.
g. A site technical contact with whom OSTI technical staff can communicate and interact in the implementation process.
h. An E-mail address for OSTI to notify the site when and which site records have been harvested.
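
As a minimal sketch of requirement d, the following Python shows how a harvester might build the two-parameter URL and read the XML a site returns. The base URL, the parameter names start_date and end_date, the ISO date format, and the element names record, id, title, updated, and url are all hypothetical; the actual names would be agreed between OSTI and each site.

```python
from datetime import date
from urllib.parse import urlencode
import xml.etree.ElementTree as ET


def build_harvest_url(base_url, start, end):
    """Return a harvest URL for records updated between start and end."""
    if start > end:
        raise ValueError("start date must not be after end date")
    # Hypothetical parameter names; dates are sent in ISO format (YYYY-MM-DD).
    query = urlencode({"start_date": start.isoformat(), "end_date": end.isoformat()})
    return f"{base_url}?{query}"


def parse_records(xml_text):
    """Turn each <record> element into a dict of child tag -> text."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iter("record"):
        # Fields are read by tag name, so this parser does not depend on field order.
        records.append({child.tag: (child.text or "").strip() for child in rec})
    return records


# Hypothetical response from a site's XML interface.
SAMPLE = """<records>
  <record>
    <id>SITE-0001</id>
    <title>Example report</title>
    <updated>2003-01-15</updated>
    <url>http://site.example.org/docs/0001.pdf</url>
  </record>
</records>"""

url = build_harvest_url("http://site.example.org/harvest", date(2003, 1, 1), date(2003, 1, 31))
records = parse_records(SAMPLE)
print(url)
print(records[0]["id"], records[0]["url"])
```

In this sketch the url element carries the site-specific full-text location described in requirement f; a record without full text would simply omit it.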

VI. What metadata is needed for the DOE Dublin Core Bibliographic Record?

A companion paper to this Overview is Metadata Useful for OSTI Harvesting from DOE Sites for unclassified unlimited resources. It includes field information from the acquisitions perspective, including type and data specifics. The amount of metadata a DOE/DOE contractor site maintains in its database may vary. While required metadata are noted by an M (Mandatory), all metadata maintained by a site that match a DOE/OSTI Dublin Core data element are of value in the search and retrieval process and, therefore, will be harvested if available.

If a DOE/DOE contractor site meets the technical requirements for harvesting (see V. above), the site must provide OSTI with 1) name/labels for metadata, 2) the definition for each metadata field, and 3) sample records, so that existing site bibliographic database records can be converted to the Dublin Core format. Using these three sets of information, OSTI staff will collaborate with each site to define its specific harvesting program.
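
To illustrate how the field labels a site supplies could drive the conversion, here is a small Python sketch that renames site fields to Dublin Core element names and reports missing mandatory elements. The site labels, the mapping itself, and the MANDATORY set are hypothetical placeholders; the real mandatory elements are those marked M in the companion paper.

```python
# Hypothetical mapping from one site's field labels to Dublin Core element names.
FIELD_MAP = {
    "report_title": "title",
    "authors": "creator",
    "pub_date": "date",
    "report_number": "identifier",
}

# Placeholder only; the actual mandatory elements are listed in the companion paper.
MANDATORY = {"title", "identifier"}


def to_dublin_core(site_record):
    """Rename mapped, non-empty site fields to Dublin Core element names."""
    return {
        FIELD_MAP[label]: value
        for label, value in site_record.items()
        if label in FIELD_MAP and value
    }


def missing_mandatory(dc_record):
    """Return the mandatory elements absent from a mapped record, sorted."""
    return sorted(MANDATORY - set(dc_record))


site_record = {"report_title": "Example report", "authors": "Doe, J.", "internal_code": "X9"}
dc_record = to_dublin_core(site_record)
print(dc_record)
print(missing_mandatory(dc_record))
```

Site fields with no Dublin Core counterpart (internal_code here) are dropped, which mirrors the statement that only metadata matching a DOE/OSTI Dublin Core data element are harvested.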

In addition to the above metadata, site-specific metadata necessary for each record to meet technical requirements for harvesting are:

VII. What initial steps are involved for a DOE/DOE contractor site to participate in the DOE STI Harvesting Program?

1. Initial contact by OSTI staff with a DOE/DOE contractor site to determine interest in harvesting, and/or an approach to OSTI by a site with an expressed interest.
2. Identification of a technical contact at the DOE/DOE contractor site with whom OSTI technical staff can communicate.
3. Follow-up phone discussion (conference call) with OSTI and site personnel, including technical contacts, to discuss details, answer questions, etc.
4. Determination of whether DOE/DOE contractor site meets or can meet technical requirements (see V. above).
5. Provision, by the DOE/DOE contractor site, of 1) name/labels for site metadata, 2) the definition for each metadata field, and 3) sample database records to OSTI.
6. Mapping, by OSTI, of DOE/DOE contractor site records to DOE/OSTI Dublin Core.
7. Activation of the OSTI Data Harvester System by OSTI and the site technical contact.
8. Periodic evaluation of the Harvesting Program by OSTI and site personnel.
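
Once the harvester is active (step 7), each run has to add new records and replace changed ones using the unique identifier required in section V. The Python below is a rough sketch of that logic, not a description of the OSTI Data Harvester System itself; the in-memory dictionary standing in for the repository and the field names site and id are assumptions made for illustration.

```python
def apply_harvest(repository, harvested):
    """Add new records and replace changed ones, keyed by site and record identifier.

    repository is a dict standing in for the record store (an assumption);
    the returned summary lists which identifiers were added, replaced, or unchanged.
    """
    summary = {"added": [], "replaced": [], "unchanged": []}
    for record in harvested:
        key = (record["site"], record["id"])  # hypothetical field names
        existing = repository.get(key)
        if existing is None:
            summary["added"].append(record["id"])
        elif existing != record:
            # A changed record is "re-harvested" and replaces the stored copy.
            summary["replaced"].append(record["id"])
        else:
            summary["unchanged"].append(record["id"])
            continue
        repository[key] = record
    return summary


repository = {}
first = apply_harvest(repository, [{"site": "SITE", "id": "0001", "title": "Draft"}])
second = apply_harvest(repository, [
    {"site": "SITE", "id": "0001", "title": "Final"},
    {"site": "SITE", "id": "0002", "title": "New"},
])
print(first)
print(second)
```

A summary of this kind could serve as the basis for the e-mail notice telling a site which of its records were harvested.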