Automating the production of bibliographic records for MEDLINE

George Thoma
National Library of Medicine
Bethesda, MD USA

An R&D report of the
Communications Engineering Branch
Lister Hill National Center for Biomedical Communications
National Library of Medicine

September 2001

Contents

• Background
• Project objectives
• Project significance
• System description
• Overview
• Database Considerations
• OCR system evaluation and selection
• Automated zoning
• Methods and procedures
• Evaluation of automated zoning
• Implementation
• Performance in production
• Automated Labeling
• Definition of layout types
• Structure of AL module
• Rule-based algorithms in AL module
• Research tool for labeling
• Performance in production
• Ongoing research
• Automated reformatting
• Reformatting the author field
• Reformatting the data field
• Reformatting the Affiliation field
• Ongoing work
• Lexical analysis to improve recognition
• Lexical analysis to reduce highlighted words
• Lexical analysis to improve recognition of affiliation
• Implementation
• Operator workstation design
• Scan workstation
• Edit workstation
• Reconcile workstation
• Production supervision and control tools
• CheckIn
• Admin workstation
• Crash Patrol
• System performance evaluation
• Process performance analysis
• Comparison for three data entry systems
• Next tasks
• Ground truth data: PathFinder
• Data extraction from online journals
• Summary
• Alternative method for text verification
• References

Screen capture of the MARS program with a split screen. On the left is a scanned journal article with the different zones highlighted.

1. Background

The Communications Engineering Branch has had a longstanding involvement in document and biomedical imaging from many different standpoints: image capture, storage and retrieval, lossy and lossless compression, image enhancement and other types of image processing. Document imaging research, in particular, has been applied to the design and development of prototype systems that serve as testbeds for investigations into electronic archiving and preservation of library materials,^1-4 automated interlibrary loan systems,^5,6 on-demand document delivery,^7,8 and Internet-enabled document delivery and management for the end user.^9-12 In addition, we have engaged in several years of related R&D in document image analysis and understanding.^13-19

In 1996, following the unprecedented shutdown of the Federal Government, the National Library of Medicine unexpectedly lost its longstanding contractual arrangements for the keyboarding of bibliographic data into MEDLINE, with no immediate prospects for reinstating those contracts. As a consequence, MEDLINE, our flagship database, was falling out of date at the rate of 30,000 to 40,000 citations every month. This clearly untenable situation proved to be an opportunity to improve on the production performance of labor-intensive manual keyboarding of bibliographic data by combining key technologies already familiar to our R&D staff from the projects listed above: in particular, document scanning, optical character recognition (OCR) and image analysis techniques.^13,14,19

To make an immediate difference, we focused on the OCR conversion of the abstract, the largest chunk of a MEDLINE bibliographic record, amounting to a maximum of 400 words. Within a few weeks we had created two workstations by integrating scanners and a low-end OCR package already in our lab, for NLM's Library Operations staff to use as a short-term tool to tackle the growing backlog. It helped to some extent, but the error rate of the simple OCR package proved to be unacceptably high, requiring a great deal of manual correction.

Seeking a more satisfactory solution, we embarked on the design of a larger scale system centered on a multi-engine OCR system and operator workstations for scanning, keyboard entry and verification ("reconciling"). This first-generation system,^20 code-named Medical Article Records System (MARS-1), allowed a scan operator to capture the first page of every article and manually zone the article title and abstract in the TIFF image for conversion by OCR. An editor marked up the journal article with instructional notes for the keyboarders to enter other fields. Concurrently or at any time, a keyboarder entered into a template all fields except for the abstract. Since double-keying was considered necessary for high accuracy, a second keyboarder repeated this process for the same articles. These two manual entries (one from each keyboarder) were then compared by a DIFF module in a matching server, producing a "citation difference" file highlighting inconsistencies. A MATCH module then matched the article title field in the citation difference file with the article title from the OCR output. The system now "knew" that the abstract and the rest of the keyed-in fields belonged to a particular article, and the reconcile operators were presented with a combined set of fields to be verified and corrected. The verified fields were then uploaded to the NLM mainframe for use by the library's indexers. The indexers added the appropriate descriptive information such as MeSH terms, thereby completing the bibliographic record to be added to MEDLINE.
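The DIFF and MATCH steps described above can be illustrated with a minimal sketch. This is not the MARS-1 implementation: the function names, the dictionary record layout, the title normalization and the 0.85 similarity threshold are all assumptions made for illustration, and the fuzzy comparison uses Python's standard difflib rather than whatever method the matching server actually used.

```python
from difflib import SequenceMatcher


def diff_fields(entry_a, entry_b):
    """Return the names of fields whose values differ between two keyings.

    Hypothetical stand-in for the DIFF module: each entry is a dict of
    field name -> keyed text from one keyboarder.
    """
    fields = sorted(set(entry_a) | set(entry_b))
    return [f for f in fields if entry_a.get(f) != entry_b.get(f)]


def normalize(text):
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in text)
    return " ".join(cleaned.split())


def match_title(keyed_title, ocr_records, threshold=0.85):
    """Return the OCR record whose title best matches the keyed title.

    Hypothetical stand-in for the MATCH module. Returns None when no
    OCR title reaches the (assumed) similarity threshold.
    """
    best, best_score = None, 0.0
    for record in ocr_records:
        score = SequenceMatcher(
            None, normalize(keyed_title), normalize(record["title"])
        ).ratio()
        if score > best_score:
            best, best_score = record, score
    return best if best_score >= threshold else None
```

A fuzzy comparison is used here because the OCR title may contain recognition errors, so an exact string match between the keyed title and the OCR title would often fail.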

At installation, in its skeletal form, MARS-1 produced about 67 completed records a day. Over the next several months, the system was scaled up, eventually doubling the number of machines and operators. This scaling up and, equally important, a series of incremental improvements increased the production rate to over 600 completed records a day, a third of the total requirement at NLM. This had been the goal from the beginning: a third produced by MARS, a third by contract keyboarders and the remaining third from SGML (later, XML) coded records sent to NLM directly by journal publishers.

While pursuing image analysis research toward the design of a more comprehensively automated system, we introduced incremental improvements to increase the productivity of MARS-1 operators as outlined below:

Barcode scanners. Because manual entry of the nine-digit "MRI number" uniquely identifying the journal issue proved burdensome and error-prone, all workstations were equipped with a barcode scanner. The MRI is a very important number because it identifies the journal issue and is the gateway to all the articles in that issue, the image files and the OCR output files.

Click-select special symbols. For the Greek letters and biomedical symbols that are not recognized by the OCR, a window listing these symbols as words (ALPHA for α, for instance) was provided so that operators could click on a word and have it automatically entered at the right place in the text. So even though the OCR system does not recognize these symbols, we made it easier for the operators to enter them.

Programmed keys. At first, diacritics, NIH grant numbers and databank accession numbers (e.g., from GenBank) were entered separately after the MARS process, which turned out to be laborious. It was much better for the keyboarders to enter these as they encountered them in the article, so the keyboarding template software was modified to accommodate programmed keys for diacritics, resulting in a 4% improvement in production rate.

Spellcheck using biomedical lexicons. While OCR errors were expected, a more serious problem was that the OCR system incorrectly assigned low confidence values to perfectly correct characters. This resulted in the highlighting of an excessive number of correct words on the workstation screen, requiring the reconcile operator to tab unnecessarily through all of them. The solution was to develop a spellcheck module relying on a combination of biomedical lexicons (the NLM's UMLS Metathesaurus and the Specialist Lexicon) and heuristic rules related to word lengths.^21 This feature cut down the highlighted correct words, and the corresponding burden on the operators, by about 50%. The net increase in production level due to this improvement was about 4%.
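The idea behind the lexicon-based spellcheck can be sketched as follows. This is a simplified illustration, not the module described in reference 21: the function name, the confidence threshold, the small in-memory lexicon and the specific word-length rule (short words are never cleared by lexicon lookup) are assumptions, since the actual heuristics are not spelled out here.

```python
def words_to_highlight(ocr_words, lexicon, confidence_threshold=0.9, min_length=3):
    """Return the words that should still be highlighted for the operator.

    ocr_words: list of (word, confidence) pairs from the OCR output.
    lexicon:   set of lowercase known terms (a stand-in for terms drawn
               from biomedical lexicons).
    A low-confidence word is cleared, i.e. not highlighted, only if it is
    long enough and appears in the lexicon. The length rule is an assumed
    heuristic: very short strings can match a lexicon entry by chance.
    """
    flagged = []
    for word, confidence in ocr_words:
        if confidence >= confidence_threshold:
            continue  # the OCR is already confident; nothing to highlight
        token = word.strip(".,;:()").lower()
        if len(token) >= min_length and token in lexicon:
            continue  # low confidence, but a known word: treat as correct
        flagged.append(word)
    return flagged
```

With a lexicon containing "hepatitis", a low-confidence "Hepatitis" would no longer be highlighted, while a misrecognized string such as "patienls" would still be presented to the reconcile operator.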

With MARS-1 in production, the design of the next-generation system^20a was begun to introduce more comprehensive automation and lower per-unit cost. MARS-2 is a database-centered and database-driven system that incorporates subsystems for automated page segmentation (zoning), automatic field identification (or labeling), automated syntax reformatting, and a classifier for the Greek letters and other symbols that the OCR system does not handle, all of which eliminate or reduce human labor. Three types of operators are still necessary, for scanning, editing and reconciling, but their roles differ. The scan operator does less than in MARS-1, since manual zoning, a time-consuming step, is eliminated. The "edit" operator combines the jobs of the editor and keyboarder in MARS-1, but much of the previous keyboarder's work is eliminated. The reconcile operator's task of verifying and correcting any errors remains the same. A description of MARS-2 appears in Section 4.

This report is organized as follows. Following the background statement, project objectives and project significance, Section 4 presents a system description including database organization and testing to select the OCR system; Sections 5-7 describe the image analysis and understanding work underlying the automated processes for zoning, labeling and reformatting; Section 8 describes lexical analysis techniques that improve the character and word recognition in the extracted fields; Section 9 describes the operator workstation design; Section 10 describes other systems designed to assist production control and supervision; Section 11 gives an account of performance evaluation; and Section 12 outlines next steps. References to the literature appear at the end.
