Last updated: May 26, 2009

CEB Projects


Publisher Data Review System

Project Member(s): Daniel Le, Jong Woo Kim, Chan Moon, Joseph Chow, Loc Tran, In Cheol Kim, George Thoma

The PDR system provides operators with data missing from the XML citations sent in directly by publishers (such as databank accession numbers, grant numbers, grant supports, investigator names, and PubMed IDs of commented articles). As a result, it speeds up the creation of citation records for the MEDLINE database and reduces the manual data entry costs of completing them.

The system consists of five automated subsystems and a client-based reconciliation subsystem as shown below. All subsystems are networked via a LAN and communicate through a PDR database server and an XML file server.

Briefly, the PDR system works as follows. The "Get Citations from DCMS Queue" subsystem periodically retrieves publisher-supplied XML citation files from DCMS. DCMS is another NLM system that communicates with publisher Web sites over the Internet to receive and store these publisher-supplied XML files. The "HTML/PDF Files Download" subsystem processes these XML files to obtain article links, which are used to connect to the publisher Web sites and download full-text article files. This subsystem also converts articles in PDF format to HTML and validates that they are full-text articles rather than abstracts or summaries. The "Zone Creation" subsystem segments the articles into zones based on geometric layout, the recursive X-Y cut algorithm, and the HTML Document Object Model (DOM). The "Zones Labeling" subsystem labels and ranks these zones with appropriate field labels, such as databank accession numbers, grant numbers, and grant supports, using Naïve Bayes and Support Vector Machine (SVM) algorithms. The "Bibliographic Data Extraction" subsystem discards irrelevant content in the labeled zones and extracts bibliographic items using a hybrid contextual and statistical method and the SVM algorithm. Finally, the "Client-based PDR Reconcile" subsystem, which operators activate through another NLM subsystem named "Client-based DCMS", presents the extracted bibliographic data to operators for verification before it is uploaded to DCMS for indexing by indexers.
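To illustrate the recursive X-Y cut idea mentioned for the "Zone Creation" subsystem, the following is a minimal sketch in Python. It is not the PDR implementation: the box representation, the function names (xy_cut, bounding, _widest_gap), the min_gap threshold, and the choice to try horizontal cuts before vertical ones are all assumptions made for illustration. The sketch takes bounding boxes of text blocks, looks for an empty gap in their projection onto one axis, splits the page at that gap, and repeats on each part until no sufficiently wide gap remains.

```python
from typing import List, Optional, Tuple

# Hypothetical box format for this sketch: (x0, y0, x1, y1) in page units.
Box = Tuple[int, int, int, int]


def bounding(boxes: List[Box]) -> Box:
    """Smallest box that encloses every box in the list."""
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )


def _widest_gap(intervals: List[Tuple[int, int]], min_gap: int) -> Optional[float]:
    """Midpoint of the widest empty gap (>= min_gap) in a 1-D projection, or None."""
    intervals = sorted(intervals)
    best_width, best_cut = 0, None
    reach = intervals[0][1]  # furthest coordinate covered so far
    for start, end in intervals[1:]:
        gap = start - reach
        if gap >= min_gap and gap > best_width:
            best_width, best_cut = gap, (reach + start) / 2
        reach = max(reach, end)
    return best_cut


def xy_cut(boxes: List[Box], min_gap: int = 10) -> List[Box]:
    """Recursively split boxes into zones at empty horizontal or vertical gaps."""
    if not boxes:
        return []
    if len(boxes) == 1:
        return [bounding(boxes)]
    # (1, 3) projects onto the Y axis (horizontal cut); (0, 2) onto the X axis.
    for lo, hi in ((1, 3), (0, 2)):
        cut = _widest_gap([(b[lo], b[hi]) for b in boxes], min_gap)
        if cut is not None:
            first = [b for b in boxes if b[hi] <= cut]
            second = [b for b in boxes if b[lo] >= cut]
            return xy_cut(first, min_gap) + xy_cut(second, min_gap)
    # No gap wide enough on either axis: the boxes form a single zone.
    return [bounding(boxes)]


# A made-up page: a full-width title above a two-column body.
page = [
    (0, 0, 200, 20),      # title
    (0, 40, 90, 100),     # left column, first paragraph
    (0, 105, 90, 150),    # left column, second paragraph (5-unit gap)
    (110, 40, 200, 150),  # right column
]
zones = xy_cut(page, min_gap=10)
print(zones)
```

On this made-up page the title is separated first by a horizontal cut, then the body is split into two columns by a vertical cut. The two left-column paragraphs stay in one zone because the gap between them is below min_gap. A production system would typically work from rendered layout or DOM geometry and tune the threshold per axis.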

 

National Institutes of Health (NIH)
9000 Rockville Pike
Bethesda, Maryland 20892

U.S. Dept. of Health and Human Services
