This list of steps in the harvesting process covers three main phases.  Preparing to harvest is when the site does the most work; this is when metadata mapping and technical development take place.  The testing phase is simple but takes up a significant share of the overall timeline.  The production phase is ongoing and consists primarily of monitoring for quality control.  In the lists below, S stands for “Site” and marks a step performed by the participating site; O stands for OSTI and marks a step that is OSTI’s responsibility; S&O marks a step shared by both.

 

 

 

Preparing:  Mapping and Technical Development

1

S

Site contacts OSTI’s harvesting manager (HM) to express interest.

2

S

Site sends data dictionary to HM via email.  Data dictionary may be as simple as a list of the fields (and their definitions) in the site’s review/approval or other database from which metadata could be retrieved for harvesting.  

3

O

HM prepares first draft of a mapping matrix and returns it to site for review.  Includes suggested names for XML tags, if needed.

4

S&O

Back-and-forth editing until the matrix is relatively complete.

5

O

HM provides completed matrix to Dublin Core manager (DCM) and to harvesting programmer (HP) for review.

6

S&O

Conference call to go over the matrix together, ask final questions, and make final agreements.  Call includes site contacts, HM, and HP, and may include DCM.  Questions that are resolved via the matrix are:

  • What metadata OSTI will retrieve from the site database and what metadata OSTI will insert into the metadata record on the site’s behalf;
  • What format the metadata must have for various fields;
  • The XML tag names and their case;
  • Business rules at both the site and OSTI;
  • Expected values for certain fields, such as document types.

 

The call is also used to discuss the technical requirements for the next step in the process.

 

7

O

HM incorporates all changes from the conference call and produces the final version of the mapping matrix.  Distributes it to all involved.

8

O

HM prepares the DTD (Document Type Definition) and provides it to all involved.

9

O

HM prepares the detailed data and parsing specifications and provides them to the HP.

10

O

HP builds the customized directories, parser/mapping file, and default processes that will handle the site’s metadata.

11

S

Site programmer (SP) establishes a web service for harvesting and provides URL to OSTI.

12

S

SP prepares query script and sample XML output file.  Sends sample file to OSTI.

13

O

HP manually loads the sample file into the test E-Link database.  Suggests revisions as needed.
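
The mapping matrix agreed in steps 3 through 7 can be pictured as a small lookup table that a parser applies to each record in the site's sample XML output.  The following is a minimal sketch in Python, not OSTI's actual parser: the tag names (`title`, `doc_type`, `pub_date`), the target field names, the date format, the expected document types, and the sample records are all hypothetical, chosen only to illustrate tag mapping, format rules, and expected values.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical mapping matrix: site XML tag -> (target field, format rule, allowed values).
# Real tag names, formats, and values are whatever the site and OSTI agree on.
MAPPING_MATRIX = {
    "title": ("title", None, None),
    "doc_type": ("document_type", None, {"Technical Report", "Journal Article"}),
    "pub_date": ("publication_date", r"\d{2}/\d{2}/\d{4}", None),
}

# Hypothetical sample XML output file of the kind sent in step 12.
SAMPLE_XML = """
<records>
  <record>
    <title>Example Report</title>
    <doc_type>Technical Report</doc_type>
    <pub_date>01/15/2009</pub_date>
  </record>
  <record>
    <title>Second Example</title>
    <doc_type>Memo</doc_type>
    <pub_date>2009-01-15</pub_date>
  </record>
</records>
"""


def map_record(record):
    """Apply the matrix to one <record>; return (mapped fields, list of problems)."""
    mapped, problems = {}, []
    for tag, (field, pattern, allowed) in MAPPING_MATRIX.items():
        value = (record.findtext(tag) or "").strip()
        if not value:
            problems.append(f"{tag}: missing")
            continue
        if pattern and not re.fullmatch(pattern, value):
            problems.append(f"{tag}: unexpected format {value!r}")
        if allowed and value not in allowed:
            problems.append(f"{tag}: unexpected value {value!r}")
        mapped[field] = value
    return mapped, problems


results = [map_record(r) for r in ET.fromstring(SAMPLE_XML).iter("record")]
for mapped, problems in results:
    print(mapped.get("title"), "->", problems or "OK")
```

In this sketch the first record passes cleanly, while the second is flagged for a document type and a date format that fall outside the assumed rules; these are the kinds of revisions that would be suggested back to the site.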

 

 

 

Practicing:  The Test Phase

14

O

HP runs specific tests on the connection with the web service newly established by the site.  Adjustments may be needed on either end.

15

O

OSTI manually conducts a harvesting run against the site’s database.  It is a real-time test of the new web service, the site’s query script and XML output file, OSTI’s customized mapping file for the site, etc.  It is the first chance to see how the real process will work.

16

O

HM checks each citation individually, comparing how it appears as raw data in the XML output file with how it is loaded and formatted in Test E-Link.  Problems are relayed back to the site for correction.

17

S&O

When the first run is deemed successful, a schedule is agreed upon for the test runs.  These typically occur once a week on a Tuesday, Wednesday, or Thursday.  The start date for the test runs is set.
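
As an illustration of such a weekly schedule, the sketch below lists run dates for a given start date and weekday.  The function name `weekly_run_dates`, the sample start date, and the number of runs are hypothetical; the restriction to Tuesday through Thursday simply mirrors the typical days named above.

```python
from datetime import date, timedelta

# Typical run days named above, in date.weekday() terms:
# Tuesday (1), Wednesday (2), Thursday (3).
TYPICAL_RUN_DAYS = {1, 2, 3}


def weekly_run_dates(start, weekday, runs):
    """Return `runs` weekly dates falling on `weekday`, on or after `start`."""
    if weekday not in TYPICAL_RUN_DAYS:
        raise ValueError("test runs typically fall on Tuesday, Wednesday, or Thursday")
    # Days to add so that the first run lands on the requested weekday.
    first = start + timedelta(days=(weekday - start.weekday()) % 7)
    return [first + timedelta(weeks=i) for i in range(runs)]


# Hypothetical example: six weekly Wednesday runs starting on or after 2 March 2009.
schedule = weekly_run_dates(date(2009, 3, 2), weekday=2, runs=6)
print(schedule[0], schedule[-1])
```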

18

O

HP sets the test system to perform these test runs automatically.

19

O

Following each test run, HM reviews the records and informs the site of errors or changes in mapping that need to be made.  DCM may also review for correct loading into Test Dublin Core.  Testing continues until all document types have been represented and runs are consistently error free.  This typically takes six weeks, but may take longer depending on downtime, time lost while corrections are being made, etc.

20

O

When OSTI and the site agree that the testing is consistently stable and successful, the OSTI HM prepares the final test report and provides it to the site.
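
The exit criterion described in step 19 (every document type represented, and runs consistently error free) can be expressed as a simple check over a log of test runs.  This is a sketch under stated assumptions: the `RunSummary` structure, the list of expected document types, and the choice of three consecutive clean runs as the meaning of "consistent" are illustrative, not OSTI's rules.

```python
from dataclasses import dataclass, field


@dataclass
class RunSummary:
    """Hypothetical summary of one weekly test run."""
    doc_types_seen: set = field(default_factory=set)
    error_count: int = 0


def is_testing_complete(runs, expected_doc_types, clean_streak=3):
    """True when every expected document type has appeared in some run and
    the most recent `clean_streak` runs had no errors (assumed threshold)."""
    seen = set().union(*(r.doc_types_seen for r in runs))
    recent = runs[-clean_streak:]
    stable = len(recent) == clean_streak and all(r.error_count == 0 for r in recent)
    return expected_doc_types <= seen and stable


# Hypothetical run history: one early run with errors, then three clean runs.
expected = {"Technical Report", "Journal Article"}
history = [
    RunSummary({"Technical Report"}, error_count=2),
    RunSummary({"Journal Article"}, error_count=0),
    RunSummary({"Technical Report"}, error_count=0),
    RunSummary({"Technical Report", "Journal Article"}, error_count=0),
]
print(is_testing_complete(history[:3], expected))  # errors still within the last three runs
print(is_testing_complete(history, expected))      # three clean runs, all types seen
```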

 

 

 

 

Doing:  The Production Phase

21

S

Site notifies OSTI that it has resolved any issues that were still open when the test report was issued.

22

O

OSTI performs spot checks and tests the changes made by the site.

23

S&O

Schedule for production runs is set.  It is often the same time that was used for the test runs.  Final decisions are made on things such as:

  • Is an offset needed?
  • Does the site’s query script need to incorporate parameters that will block old, pre-harvesting records from coming into OSTI’s system again?
  • Who at the site will be on the distribution list for the harvesting email that is automatically generated after a run?
  • When will the catch-up run be scheduled?
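
One way a site's query script might block old, pre-harvesting records is with a cutoff date parameter.  The sketch below is illustrative only: the record fields (`id`, `approved`), the cutoff date, and the function name `records_to_harvest` are hypothetical, and a real script would more likely apply the same condition in its database query.

```python
from datetime import date

# Hypothetical cutoff: records approved before this date predate harvesting
# and are assumed to be in OSTI's system already.
HARVEST_CUTOFF = date(2009, 1, 1)

# Hypothetical rows from the site's review/approval database.
site_records = [
    {"id": "A-100", "approved": date(2008, 11, 3)},
    {"id": "A-101", "approved": date(2009, 1, 1)},
    {"id": "A-102", "approved": date(2009, 2, 17)},
]


def records_to_harvest(records, cutoff=HARVEST_CUTOFF):
    """Keep only records approved on or after the cutoff date."""
    return [r for r in records if r["approved"] >= cutoff]


harvest_ids = [r["id"] for r in records_to_harvest(site_records)]
print(harvest_ids)
```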

24

O

HP sets the production system to perform these production runs automatically.

25

O

The first production run is performed according to the agreed-upon schedule. The automated email is sent to the site and to the appropriate OSTI distribution list.

26

O

HM reviews the first production run in the same detailed way that the test runs were checked.  Confers with site on any errors.

27

O

After the first production run is pronounced successful, HP performs the catch-up run according to the agreed-upon schedule.

28

S

Each week the site checks to ensure that its citations harvested in the previous week have moved into the Dublin Core database and appear with the correct OSTI ID.  Site notifies OSTI of any omissions.

29

S&O

Site notifies OSTI of any changes in URLs or metadata.  HM ensures that the changes are reflected in the mapping matrix, DTD, parser, etc.

30

S&O

OSTI notifies site of any system changes that require site to modify query script or XML output file.  Site ensures these changes are made in a timely manner.
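
The weekly check in step 28 amounts to reconciling what the site sent with what now appears in the Dublin Core database.  The following is a minimal sketch, assuming the site can list the identifiers harvested in the previous week and can look up which of them now carry an OSTI ID; the identifiers, the OSTI ID value, and the function name `find_omissions` are hypothetical.

```python
def find_omissions(harvested_ids, osti_ids_by_site_id):
    """Return, sorted, the site identifiers harvested last week that have no OSTI ID yet."""
    return sorted(
        site_id for site_id in harvested_ids
        if not osti_ids_by_site_id.get(site_id)
    )


# Hypothetical data: identifiers harvested last week and the OSTI IDs found for them.
harvested_last_week = {"A-101", "A-102", "A-103"}
found_in_dublin_core = {"A-101": "9000001", "A-103": ""}

omissions = find_omissions(harvested_last_week, found_in_dublin_core)
print(omissions)
```

A record counts as an omission here whether it is absent from the lookup or present with an empty OSTI ID; either case would be reported to OSTI.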