Last updated: June 18, 2008

CEB Projects


Automating the production of bibliographic records for MEDLINE


9.2 Edit workstation

While our goal is to automate data extraction from the scanned journals as much as possible, at any point in the life cycle of the MARS system some manual data entry will be needed for the fields not automatically extracted. One reason is that some data, such as NIH grant numbers and databank accession numbers, appears on pages other than the first page of the article, which is usually the only page scanned. (If the abstract continues on to the article's second page, this page is also scanned, but this occurs infrequently.) Another reason is that the OCR system does not reliably detect very small or italicized characters, such as those in the affiliation field; in this situation, the edit operator might choose to simply type in the text. Furthermore, the MARS-2 algorithms handle only the "compliant" journals, while a significant, though decreasing, portion of the journal collection remains noncompliant. The edit workstation is designed for manual data entry; its name reflects the combined actions of the editor and keyboard operator in the MARS-1 system.

The edit workstation takes advantage of upstream processes (OCR, autozoning and autolabeling), provides an interface for data entry, and allows the operator to correct errors in zoning and labeling before passing the data on to subsequent processes.

In order to minimize human errors at the edit stage, data entry is double keyed, i.e., two different operators produce two versions of the data for the same article: one version in Edit_One, another version in Edit_Two. The two versions of the data are differenced by a daemon process called Diff so that the reconcile operator may clearly detect any differences and select the correct version.
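The comparison step can be illustrated with a minimal sketch. It assumes each keyed version is held as a simple mapping from field name to text; the function name `diff_fields` and the sample values are hypothetical and are not taken from the actual Diff daemon.

```python
def diff_fields(edit_one, edit_two):
    """Return {field: (value_one, value_two)} for every field whose text differs."""
    differences = {}
    # Look at every field keyed by either operator; a missing field counts as empty.
    for field in sorted(set(edit_one) | set(edit_two)):
        value_one = edit_one.get(field, "")
        value_two = edit_two.get(field, "")
        if value_one != value_two:
            differences[field] = (value_one, value_two)
    return differences


# Made-up sample data: the two operators disagree on one digit of the pagination.
edit_one = {"Pagination": "123-130", "Language": "English"}
edit_two = {"Pagination": "123-180", "Language": "English"}
mismatches = diff_fields(edit_one, edit_two)
for field, (value_one, value_two) in mismatches.items():
    print(f"{field}: Edit_One={value_one!r} Edit_Two={value_two!r}")
```

Only the mismatched fields are reported, which is what lets the reconcile operator concentrate on the places where the two versions disagree.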

As shown in the workstation GUI in Figure 9.2.1, the edit operator is required to enter the pagination and language fields, the latter set to English as the default option. All other fields are optional, entered only if necessary. The left window of the workstation screen displays the TIFF page image including the (color coded) results of autozoning and autolabeling performed by upstream processes, so that the operator can identify and correct an incorrectly labeled zone, to ensure that errors do not propagate to downstream processes such as Confidence Edit, Reformat and Reconcile.

In addition to showing the zones and labels produced automatically by daemons, the system also displays, for each zone, the percentage of OCR output characters recognized with high confidence (confidence level of 9). These items of information, displayed on the bitmapped image, support a DoItAgain feature: the edit operator can ask the Admin operator to order a rerun of earlier processes such as Scan, OCR, zoning and labeling, which may have produced errors through poor scanning, an unacceptably high rate of misinterpretation by the OCR system, or incorrect zoning or labeling.
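As an illustration of the per-zone figure, the following sketch computes the percentage of characters at confidence level 9. The data layout (a list of integer confidence levels per zone) and the sample numbers are assumptions made for the example, not the MARS data format.

```python
HIGH_CONFIDENCE = 9  # the level the text describes as high confidence


def percent_high_confidence(confidences):
    """Percentage of characters in a zone recognized at the high confidence level."""
    if not confidences:
        return 0.0  # an empty zone has no high-confidence characters
    high = sum(1 for level in confidences if level == HIGH_CONFIDENCE)
    return 100.0 * high / len(confidences)


# Hypothetical OCR confidence levels, one per character, grouped by zone.
zone_confidences = {
    "Title": [9, 9, 9, 9],
    "Affiliation": [9, 4, 9, 2, 7],
}
zone_percentages = {
    zone: percent_high_confidence(levels)
    for zone, levels in zone_confidences.items()
}
```

A zone with a low percentage, such as the affiliation zone in this made-up data, is the kind of case in which an operator might request that an earlier process be redone.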

Screen shot of the edit workstation interface, with the different article zones highlighted.
Figure 9.2.1 Edit workstation GUI

The operator may also defer the process, or set an error flag at any time as needed, providing the flexibility for trouble-shooting and production scheduling.

The Edit software, running under Windows 2000, is developed using Visual C++ 6.0 and MFC 6.0. It uses Kodak Imaging Professional 2.5 to handle all image related functionality, and RogueWave Tools and DBTools 3.20 to implement client communications with the SQL database server.

The Edit software is equipped to assist in system performance evaluation by automatically recording the time taken by the operator to key in data (entered in the PerformanceData table in the database) and the processing time (entered in the ProcessTime table). It also incorporates error handling components to automatically detect and handle database errors and developer-defined errors. When an error condition is triggered, the Edit software warns the operator and allows the entry of more information. This information is automatically recorded and handled by the Admin module.

The design of this workstation also seeks to minimize delays caused by data and image transfer over the network. For example, after completing data entry for one of the articles in a journal issue, the operator clicks the page tab to go to the next TIFF image. At this point, the data for the current article must be written to the database, and the information related to the next article must be retrieved from the database and its page image from the file server for display on the screen. The database I/O and image retrieval from the file server and its transfer over the LAN are all time consuming. To reduce the effective delay perceived by the operator, the Edit software reads the database information related to all of the articles in a particular journal issue into the workstation memory at the beginning of the process, so that this data is available immediately as the operator moves from one article to the next.
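The prefetching idea can be sketched as follows. This is a simplified illustration, not the Edit software's code: the class `IssueCache`, the function `fake_fetch_issue` and the record layout are hypothetical, and the database is replaced by an in-memory stand-in.

```python
class IssueCache:
    """Holds the records for all articles of one journal issue in memory."""

    def __init__(self, fetch_issue):
        self._fetch_issue = fetch_issue  # callable that queries the database
        self._articles = {}
        self.round_trips = 0  # number of database reads, kept for illustration

    def load(self, issue_id):
        # One bulk read at the start of the session instead of one read per article.
        self._articles = dict(self._fetch_issue(issue_id))
        self.round_trips += 1

    def get(self, article_id):
        # Served from workstation memory, so no network delay when paging.
        return self._articles[article_id]


def fake_fetch_issue(issue_id):
    """Stand-in for the database query; returns made-up article records."""
    return {
        1: {"Pagination": "1-8", "Language": "English"},
        2: {"Pagination": "9-15", "Language": "English"},
        3: {"Pagination": "16-30", "Language": "English"},
    }


cache = IssueCache(fake_fetch_issue)
cache.load("issue-A")
records = [cache.get(article_id) for article_id in (1, 2, 3)]
```

The trade-off is a longer wait at the start of the issue in exchange for immediate access as the operator moves from one article to the next.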

9.3 Reconcile workstation

The purpose of the reconcile workstation is to enable an operator to check the accuracy of the bibliographic data extracted by the automated processes, as well as that entered manually by the Edit operator. Any errors are corrected at this stage before the citation is uploaded to the DCMS database.

The general view in the reconcile workstation's GUI gives an overall picture of the bibliographic data captured (Figure 9.3.1). Prior to the operator verifying the contents of the bibliographic fields, the field windows are highlighted by background color: green for fields created by the automated processes, cyan for those entered manually, yellow for those created by combining the outputs of both automated and manual processes, and red if no text appears in the window. In the example shown, the windows for NIH Grant Numbers and Databank Accession Numbers appear in red, since these data were not entered by the edit operator earlier in the workflow. (These fields are not captured automatically since they could appear anywhere in the article and not just on the scanned page.) Once the operator verifies the fields, the windows turn white, as shown for Pagination and Language in this example.

The reconcile workstation interface.
Figure 9.3.1 General view for all bibliographic fields in an article.
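The color rules for the general view can be summarized in a short sketch. The function name, the `sources` representation and the precedence given to the verified state are assumptions made for illustration; only the color assignments come from the description above.

```python
def field_background(text, sources, verified):
    """Pick the background color of a field window.

    text     -- current field contents
    sources  -- set containing "automated", "manual", or both
    verified -- True once the operator has verified the field
    """
    if verified:
        return "white"   # assumption: verification overrides every other state
    if not text:
        return "red"     # no text appears in the window
    if sources == {"automated"}:
        return "green"   # created by the automated processes
    if sources == {"manual"}:
        return "cyan"    # entered manually
    if sources == {"automated", "manual"}:
        return "yellow"  # combination of automated and manual output
    raise ValueError(f"unexpected sources for non-empty field: {sources!r}")


colors = {
    "Title": field_background("Sample title", {"automated"}, False),
    "Pagination": field_background("123-130", {"manual"}, True),
    "NIH Grant Numbers": field_background("", set(), False),
}
```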

The GUI for this workstation also gives a split view (Figure 9.3.2) that displays both the image and the corresponding text, allowing the operator to verify the text against the scanned image. Low confidence characters are highlighted in red in the text to draw the operator's attention to them.

Figure 9.3.2 Split view shows both the bitmapped image and corresponding text.
Low confidence characters are highlighted in the text window.

The reconcile workstation provides the operator with several additional functions. The operator may redefine a field labeled incorrectly, and may either activate a standalone OCR system to extract the field contents from the image or type in the text. If a page image is missing or duplicated, the operator may insert the missing page or delete the duplicate. If there are 'invalid' characters, the Reconcile software converts them to the form required in MEDLINE.

Many of the functions in the reconcile workstation are provided by Character Verification, a program we developed that allows the reconcile operator to view the bitmapped document images and to verify the text in all the fields, whether entered by the edit operator or produced by the automated processes. It is an ActiveX control embedded in the Reconcile software, and is itself based on two ActiveX controls: the Eastman Kodak Image ActiveX Control and Microsoft's Rich Text Editor ActiveX control.

Character Verification allows the operator to view the image and perform manipulations such as rotation and zooming in or out. A bounding box also shows the area on the image that corresponds to the text that the operator is focusing on. The design of this functionality is based on the Eastman Kodak Image ActiveX Control.

Character Verification also allows the operator to edit the field contents. The design of this editor is based on the Microsoft Rich Text Editor and draws on its heavily used functions such as copy, cut, paste, search and replace. The software can also relate the position of the text characters to the corresponding ones in the image and provide confidence levels. In addition, it allows the operator to enter diacritical marks, Greek letters and mathematical signs; to change the first character of a selected word to upper case while leaving the rest in lower case; to convert case; and to complete words in the affiliation field from a partial output.

Character Verification and the Reconcile main program communicate via methods and events. Reconcile calls the control's methods, setting or getting values, to instruct Character Verification. In turn, Character Verification fires events back to Reconcile to provide the requested information.

Figure 9.3.3 Reconcile main program communicating with Character Verification.
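The methods-and-events pattern can be sketched in a language-neutral way. The real module is an ActiveX control; the Python classes below only mimic the direction of the calls, and every method and event name in them (`set_field`, `replace_text`, `text_changed`) is hypothetical.

```python
class CharacterVerification:
    """Stand-in for the embedded control: exposes methods and fires events."""

    def __init__(self):
        self._handlers = {}
        self._label = None
        self._text = ""

    def on(self, event_name, handler):
        # The host program registers a callback for an event.
        self._handlers.setdefault(event_name, []).append(handler)

    def _fire(self, event_name, **payload):
        for handler in self._handlers.get(event_name, []):
            handler(**payload)

    # Methods called by the host to instruct the control.
    def set_field(self, label, text):
        self._label = label
        self._text = text

    def get_text(self):
        return self._text

    def replace_text(self, new_text):
        old_text = self._text
        self._text = new_text
        # Event fired back to the host to report what changed.
        self._fire("text_changed", label=self._label, old=old_text, new=new_text)


class Reconcile:
    """Stand-in for the main program that hosts the control."""

    def __init__(self, control):
        self.log = []
        self.control = control
        control.on("text_changed", self._on_text_changed)

    def _on_text_changed(self, label, old, new):
        self.log.append((label, old, new))


control = CharacterVerification()
app = Reconcile(control)
control.set_field("Title", "Teh title")
control.replace_text("The title")
```

The host never reads the control's internal state directly: instructions flow in through methods, and results flow back through events.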

Character Verification retrieves the OCR output for every article (bibliographic) field, including text contents (character codes), confidence levels and character coordinates, from the MARS database, and the images from a file server. Keyed-in characters are assumed to be at the highest level of confidence and have no coordinates for their location in the image.

Figure 9.3.4 Character Verification displays label text, confidence and character position.
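One way to picture the per-character data is the sketch below. The record layout, the names `CharRecord`, `keyed_in`, `low_confidence_positions` and `bounding_box`, and the use of 9 as the highest confidence level are assumptions for illustration; the actual MARS database schema is not described here.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

MAX_CONFIDENCE = 9  # assumed to be the highest level

Box = Tuple[int, int, int, int]  # left, top, right, bottom in image pixels


@dataclass
class CharRecord:
    code: str            # the character itself
    confidence: int      # OCR confidence level
    box: Optional[Box]   # None when the character has no location in the image


def keyed_in(character: str) -> CharRecord:
    """Typed characters get the highest confidence and no image coordinates."""
    return CharRecord(character, MAX_CONFIDENCE, None)


def low_confidence_positions(chars: List[CharRecord], threshold: int) -> List[int]:
    """Indexes of characters to highlight because they fall below the threshold."""
    return [i for i, ch in enumerate(chars) if ch.confidence < threshold]


def bounding_box(chars: List[CharRecord]) -> Optional[Box]:
    """Smallest box enclosing every character that has image coordinates."""
    boxes = [ch.box for ch in chars if ch.box is not None]
    if not boxes:
        return None
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )


# Made-up word "cell": one doubtful OCR character and one typed correction.
word = [
    CharRecord("c", 9, (10, 20, 18, 32)),
    CharRecord("e", 3, (19, 20, 27, 32)),
    keyed_in("l"),
    CharRecord("l", 9, (36, 20, 40, 32)),
]
to_highlight = low_confidence_positions(word, threshold=MAX_CONFIDENCE)
selection_box = bounding_box(word)
```

Because keyed-in characters carry no coordinates, they are simply skipped when the box for a selection is computed.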

Character Verification also presents the results of the PatternMatch program to the operator as a word list to choose from. The example below shows the alternative words offered to the operator in an affiliation field.

Figure 9.3.5 Operator can select alternative words in an affiliation field.
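The PatternMatch algorithm itself is not described in this section. As a stand-in, the sketch below uses approximate string matching from Python's standard `difflib` module to build a word list for a misrecognized word; the vocabulary, the function name `suggest_words` and the cutoff value are all hypothetical.

```python
import difflib

# Hypothetical vocabulary of words that commonly occur in affiliation fields.
AFFILIATION_WORDS = ["University", "Universidad", "Hospital", "Institute", "Department"]


def suggest_words(ocr_word, vocabulary, max_suggestions=3, cutoff=0.6):
    """Return vocabulary words similar to the OCR output, best match first.

    This is generic approximate matching, not the PatternMatch program.
    """
    return difflib.get_close_matches(
        ocr_word, vocabulary, n=max_suggestions, cutoff=cutoff
    )


# "Univcrsity" imitates an OCR error in which "e" was read as "c".
suggestions = suggest_words("Univcrsity", AFFILIATION_WORDS)
print(suggestions)
```

The operator would then pick the intended word from the list rather than retyping it.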



National Institutes of Health (NIH)
9000 Rockville Pike
Bethesda, Maryland 20892

U.S. Dept. of Health and Human Services