eRA Working Group Explores Technology-Assisted Disease Coding

eRA is collaborating with representatives from seven Institutes and Centers (ICs) to evaluate the use of advanced text-mining technology to improve NIH’s reporting on funding by disease. In his testimony to the House Appropriations Labor/Health and Human Services (HHS) Subcommittee on April 22, NIH Director Elias Zerhouni said that NIH would implement “intelligent data mining” to provide better accounting to Congress and the public on NIH’s investment by disease.

Currently, the NIH Office of the Budget must prepare agency-wide reports on more than 230 diseases and conditions. With the dual objectives of standardizing definitions and automating the coding process, the NIH Director’s Steering Committee requested that eRA form a working group to test the feasibility of using specific text-mining tools to assist with disease coding accurately and consistently. The Knowledge Management Disease Coding (KMDC) Working Group, which began meeting in April, presented its findings to the Steering Committee on October 21.

About Collexis® Technology

Collexis® refers to a family of intelligent, text-searching tools for examining vast quantities of data to identify patterns and establish relationships. As biomedical data grows to petabytes (millions of gigabytes), managing this data becomes increasingly important. Intelligent text mining holds promise for promoting health research and accelerating discoveries by automating the integration of multiple databases to find linkages and generate hypotheses.

In January 2004, after evaluating available systems, NIH procured a site license for Collexis software. This software is based on the principle of “fingerprinting” each piece of text that contains relevant information, such as an article in a scientific journal. The fingerprinting process makes use of the professional terminology of a particular field. For example, the system can fingerprint an article based on the National Library of Medicine Medical Subject Headings (MeSH®) Thesaurus. Collexis can then condense the fingerprints of all of a researcher’s publications into a knowledge profile of that individual.

Once Collexis has completed the fingerprinting/profiling of all sources of input, the system can make associations based on criteria established by the user. Consider this application. A busy Helpdesk receives several hundred emails daily that require responses from an expert. Knowledge management (KM) technology helps the Helpdesk by building knowledge profiles of all its employees. From then on, routing an incoming email is a matter of matching its fingerprint against the catalog of employee knowledge profiles.
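
The Helpdesk scenario above can be sketched in a few lines of Python. This is only an illustration of the general idea, not the Collexis implementation, whose internals are not described here. It assumes a fingerprint is a simple count of controlled-vocabulary terms and that matching uses cosine similarity. The function names (fingerprint, build_profile, cosine, route), the tiny vocabulary, and the employee names are all hypothetical.

```python
import math
import re
from collections import Counter


def fingerprint(text, vocabulary):
    """Count how often each controlled-vocabulary term appears in the text.

    Assumption: single-word terms and plain counts; a real thesaurus such as
    MeSH has multi-word terms and a hierarchy that this sketch ignores.
    """
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w in vocabulary)


def build_profile(documents, vocabulary):
    """Condense the fingerprints of several documents into one knowledge profile."""
    profile = Counter()
    for doc in documents:
        profile.update(fingerprint(doc, vocabulary))
    return profile


def cosine(a, b):
    """Cosine similarity between two term-count fingerprints (0.0 if either is empty)."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def route(email_text, profiles, vocabulary):
    """Return the name of the employee whose profile best matches the email."""
    fp = fingerprint(email_text, vocabulary)
    return max(profiles, key=lambda name: cosine(fp, profiles[name]))


if __name__ == "__main__":
    # Hypothetical vocabulary and employees, for illustration only.
    vocab = {"password", "login", "grant", "submission", "budget"}
    staff = {
        "alice": build_profile(["Password reset and login problems"], vocab),
        "bob": build_profile(["Grant submission deadlines and budget forms"], vocab),
    }
    print(route("I cannot login and my password is rejected", staff, vocab))
```

Cosine similarity is used here because it compares the mix of terms rather than document length, so a short email can still match a long profile.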

Progress of Disease Coding Working Group

The KMDC Working Group, led by Richard Morris, has made significant headway over the past four months. The group completed the following tasks:

  • Established contracts with Collexis and Mitretek to support the KMDC initiative. Patti Gaines is the eRA task order manager.   
  • Set up a policy sub-group to draft an NIH KM basic principles and implementation document.   
  • Created a test database, comprising the research plans for FY2003 R01 competing applications.   
  • Fingerprinted 12 disease categories using a variety of methods:
    1. Grants that IC-experts had previously assigned to the disease codes   
    2. Articles by experts in each disease category   
    3. Trans-NIH definitions   
    4. MeSH Thesaurus   
    5. Combination of method 1 and method 4   
    6. Method 1 modified by IC coding experts
  • Tested the accuracy of the fingerprints in several trials. Used lessons learned from successes and failures to fine tune the fingerprinting process.   
  • Established a Web portal where group members can test the KMDC system.   
  • Developed draft algorithms for converting the percent relevance of disease-code fingerprints to dollar amounts for budget reporting.
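
The draft algorithms for converting relevance to dollars are not described in this article. The sketch below shows two plausible approaches under stated assumptions: a proportional rule that credits a disease with the award amount multiplied by its relevance score, and an all-or-nothing rule that credits the full award once relevance passes a cutoff. The function name dollars_by_disease, the grant record layout, the 0.5 threshold, and the sample figures are hypothetical.

```python
def dollars_by_disease(grants, threshold=0.5, proportional=True):
    """Total the dollars attributed to each disease category.

    grants: list of dicts with an "award" amount and a "relevance" mapping of
    disease name to a score between 0 and 1 (assumed record layout).
    threshold: scores below this cutoff contribute nothing (assumed value).
    proportional: if True, credit award * relevance; otherwise the full award.
    """
    totals = {}
    for grant in grants:
        for disease, relevance in grant["relevance"].items():
            if relevance < threshold:
                continue
            share = relevance if proportional else 1.0
            totals[disease] = totals.get(disease, 0.0) + grant["award"] * share
    return totals


if __name__ == "__main__":
    # Hypothetical grants, for illustration only.
    sample = [
        {"award": 100000, "relevance": {"asthma": 0.8, "diabetes": 0.2}},
        {"award": 50000, "relevance": {"asthma": 0.5}},
    ]
    print(dollars_by_disease(sample))
    print(dollars_by_disease(sample, proportional=False))
```

The two rules can produce quite different totals for the same grants, which is one reason a conversion method would need to be agreed on before the figures are used in budget reports.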

The KMDC policy group, composed of representatives from the participating ICs and led by Izja Lederhendler, began meeting to consider basic principles of operation and options for governance. With technical assistance from Patti Gaines, Archna Bhandari, and Chanath Ratnanather, who prepared the data, the policy group will present its findings to the NIH Steering Committee. Throughout the process, Norka Ruiz-Bravo (NIH deputy director for Extramural Research) and Richard Turman (director, NIH Office of the Budget) worked with Richard Morris, Izja Lederhendler, Della Hann, and Lee Pushkin to guide the effort.

Previous NIH Pilots

Several earlier pilots at NIH demonstrated proof of concept of KM’s promise for optimizing eRA knowledge assets and shortening grant cycle times. According to a statement by CSR Division of Biologic Basis of Disease Director Elliot Postow on May 17, the introduction of electronic grant applications and referral technologies could reduce the review cycle by six to eight weeks.

  • KM-Assisted Reviewer Selection (1) –– In the spring of 2003, Dr. Arthur Petrosian, a scientific review administrator at the Center for Scientific Review (CSR), used his Computerized Reviewer Assignment and Search Program (CRASP) to assist scientific review administrators (SRAs) in locating reviewers for specific ad hoc diagnostic imaging study sections. CRASP matches fingerprints based on keywords in CRISP and PubMed with reviewer profiles. Although there was no systematic evaluation of this pilot, SRAs found the tool encouraging.   
  • KM-Assisted Reviewer Selection (2) –– Mitretek Systems developed a Grant Reviewer Selection (GRS) prototype using 60,000 candidate reviewers drawn from CRISP and MEDLINE and 30,000 FY2003 R01 research proposals. Collexis software generated fingerprints for each reviewer and for each proposal. GRS then served as a user interface to match (1) reviewers to a given proposal, (2) other reviewers to a given reviewer, and (3) other proposed or funded research to a given proposal. Mitretek demonstrated GRS at the Third Annual eRA Symposium in April 2003. See presentation materials for more details.
  • KM-Assisted Referral –– In the spring of 2003, CSR used KM to fingerprint 86 randomly selected R01 research plans for the October 2003 council round. KM technology then matched the fingerprints to CSR Integrated Review Group (IRG) descriptions to generate referral recommendations. In 35 percent of cases, the top-ranked KM referral matched the top-ranked human referral. In 64 percent of cases, one of the top three KM recommendations matched the top-ranked human referral. Testers believe that they will achieve better results using study section descriptions, which are more specific.   
  • KM-Assisted Self-Referral –– Last winter, Tom Tatham led a CSR effort to explore the possibility of using Collexis to profile study sections. The ultimate goal is to enable principal investigators (PIs) to input their abstract or research plan and have Collexis return a list of suggested study sections. Using KM to help PIs recommend a study section would save time and promote appropriate referrals.
  • KM-Assisted Scientific Trends Detection –– In the spring of 2004, Mitretek Systems developed several KM prototypes to identify trends among 206 scientific poster proposals submitted to the Biomedical Information Science and Technology Initiative (BISTI) Symposium, “Digital Biology: the Emerging Paradigm” (November 2003). The prototypes include: (1) data visualization (graphic representation) of emerging concepts and their inter-relationships; (2) individual poster-level analysis, aimed at identifying concepts present in each poster for descriptive and comparative purposes; (3) author profiling to create a composite profile of an author's expertise by mining his/her poster abstracts; and (4) distribution of major concepts in a collection of documents, as well as the relative frequency of their occurrence. Mitretek’s presentation about these prototypes received favorable comments from the BISTI user group.
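
The agreement figures reported for the KM-Assisted Referral pilot (top-ranked match and match within the top three recommendations) are examples of a top-k agreement rate. The sketch below shows how such a rate could be computed. It is an illustration only: the function name top_k_agreement and the review group labels are hypothetical, and the sample data are not the pilot data.

```python
def top_k_agreement(km_rankings, human_choices, k):
    """Fraction of cases where the human referral appears in the top k KM picks.

    km_rankings: one ranked list of review group labels per application,
    best match first.
    human_choices: the top-ranked human referral for each application,
    in the same order.
    """
    if len(km_rankings) != len(human_choices):
        raise ValueError("rankings and human choices must have the same length")
    if not km_rankings:
        return 0.0
    hits = sum(
        1
        for ranked, human in zip(km_rankings, human_choices)
        if human in ranked[:k]
    )
    return hits / len(km_rankings)


if __name__ == "__main__":
    # Hypothetical review group labels, for illustration only.
    km = [["A", "B", "C"], ["B", "A", "C"], ["C", "D", "A"], ["D", "C", "B"]]
    human = ["A", "A", "A", "A"]
    print(top_k_agreement(km, human, k=1))
    print(top_k_agreement(km, human, k=3))
```

Reporting both k=1 and k=3 separates cases where the tool could replace a human decision from cases where it could narrow the choices for a human to confirm.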

For more information about the eRA KM initiative, contact Richard Morris at RMorris@niaid.nih.gov.