Clinical Language Engineering Workbench (CLEW)

The Clinical Language Engineering Workbench (CLEW) provides free access to natural language processing (NLP) and machine-learning tools that any public health agency can use to develop NLP services. The pipelines, language models, and other algorithms that are part of this platform identify reportable cancer cases and code key data elements in electronic pathology reports and other medical reports.

The CLEW project included five steps: environmental scan; stakeholder engagement, requirements gathering, and technical design; prototype development; pilot testing; and release.

The Cancer Domain Pilot

Pathology reports are an integral part of cancer data. About 90% of cancer cases require pathological confirmation of the diagnosis. The College of American Pathologists (CAP) requires accredited laboratories to use the CAP Cancer Protocols and standard templates to capture key data in electronic checklists for pathology and biomarker outcomes.

However—

  • Pathology reports are still mostly text-based narratives, which are time-consuming to process.
  • CAP checklists are not required for biomarkers.
  • Laboratories are not required to store or transmit cancer data in discrete data elements.
  • Terminologies, test names, and data included in the biomarker reports are inconsistent among laboratories.
  • The organization of the reports and the reporting of information in HL7 messages are inconsistent (an example message follows this list).
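To make the last point concrete, here is a minimal sketch, using only the Python standard library, of the kind of HL7 v2.5.1 ORU payload a laboratory might send, with the diagnosis buried in free text. All identifiers, codes, and field values are fabricated for illustration.

```python
# A fabricated HL7 v2.5.1 ORU^R01 message; segments are separated by
# carriage returns and fields by "|". Codes here are local placeholders,
# not real LOINC or laboratory values.
RAW_ORU = "\r".join([
    "MSH|^~\\&|LAB|ExampleLab|CLEW|CDC|20240101120000||ORU^R01^ORU_R01|MSG0001|P|2.5.1",
    "PID|1||PATID1234||DOE^JANE",
    "OBR|1||SPEC-001|PATH^Pathology report^L",
    "OBX|1|TX|DX^Final diagnosis^L||INVASIVE DUCTAL CARCINOMA, LEFT BREAST, GRADE 2.||||||F",
])

def narrative_text(raw: str) -> list[str]:
    """Collect the free-text observation values (OBX-5) from a raw message."""
    texts = []
    for segment in raw.split("\r"):
        fields = segment.split("|")
        if fields[0] == "OBX" and len(fields) > 5 and fields[2] == "TX":
            texts.append(fields[5])
    return texts

print(narrative_text(RAW_ORU))
# ['INVASIVE DUCTAL CARCINOMA, LEFT BREAST, GRADE 2.']
```

Everything an NLP pipeline needs is inside that one TX field, which is why laboratory-to-laboratory differences in segment ordering and terminology make automated extraction hard.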

CDC developed CLEW in a way that allows it to be expanded to improve extraction and automatic coding of key data elements for pathology cases.

Project specifications included—

  • Collecting de-identified data from at least four national laboratories for breast, lung, prostate, and colorectal cancers.
  • Collecting histopathology cases from several states.
  • Collecting 125 cases per cancer site from each laboratory, for a total of at least 2,000 cases.
  • Completing double annotation by certified tumor registrars, with a master reviewer resolving disagreements (a sketch of an agreement check follows this list).
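The document does not say how agreement between the two annotators was measured; a standard choice for double-annotated corpora is Cohen's kappa, sketched below over hypothetical ICD-O-3 histology codes. The function and sample data are illustrative, not drawn from the project.

```python
from collections import Counter

def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Chance-corrected agreement between two annotators over the same cases."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    # Expected chance agreement from each annotator's label distribution.
    expected = sum(
        (counts_a[label] / n) * (counts_b[label] / n)
        for label in counts_a.keys() | counts_b.keys()
    )
    return (observed - expected) / (1 - expected)

# Hypothetical ICD-O-3 histology codes assigned by two registrars to five reports.
registrar_1 = ["8500/3", "8140/3", "8070/3", "8500/3", "8140/3"]
registrar_2 = ["8500/3", "8140/3", "8500/3", "8500/3", "8140/3"]
print(f"kappa = {cohens_kappa(registrar_1, registrar_2):.2f}")  # kappa = 0.67
```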

How NLP Automates Case Reporting Into Cancer Registries

The following process illustrates the use cases for the cancer domain (a minimal client sketch follows the list).

  • Laboratory information systems transmitted unstructured text in the form of an HL7 version 2.5.1 Observation Result (ORU) message to CLEW.
  • CLEW returned a reportability determination based on International Classification of Diseases, Tenth Revision, Clinical Modification (ICD-10-CM) codes.
  • Narrative pathology reports were sent to the cancer registry’s eMaRC Plus system in the form of an HL7 version 2.5.1 ORU message.
  • eMaRC Plus transmitted the unstructured text to CLEW.
  • CLEW returned structured data, including primary site, histology, laterality, behavior, and grade.
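As an illustration of this round trip, here is a minimal sketch of a client that submits an ORU message and reads back a structured result. The endpoint URL, media type, payload shape, and field names are all assumptions for illustration; CLEW's actual interface is not specified in this document.

```python
import json
import urllib.request

# Hypothetical endpoint and response contract; every name here is an
# assumption, since the document does not describe CLEW's actual API.
CLEW_URL = "https://clew.example.gov/api/process"

def submit_oru(raw_oru: str) -> dict:
    """POST a raw HL7 v2.5.1 ORU message and return a structured reply."""
    request = urllib.request.Request(
        CLEW_URL,
        data=raw_oru.encode("utf-8"),
        headers={"Content-Type": "x-application/hl7-v2+er7"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)

# Minimal fabricated message; see the earlier ORU sketch for a fuller one.
sample = "MSH|^~\\&|LAB|ExampleLab|CLEW|CDC|20240101120000||ORU^R01^ORU_R01|MSG0001|P|2.5.1"
result = submit_oru(sample)
print(result.get("reportable"))    # e.g. True
print(result.get("primary_site")) # e.g. "C50.9" (breast, unspecified)
print(result.get("histology"))    # e.g. "8500/3" (invasive ductal carcinoma)
print(result.get("laterality"))   # e.g. "left"
print(result.get("behavior"))     # e.g. "3" (malignant)
print(result.get("grade"))        # e.g. "2"
```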

CLEW is a valuable platform that will allow the clinical NLP community to collaborate and share their work in a central repository. This new approach could enable development of NLP and machine-learning solutions to provide high-quality structured data for clinical, academic, government, and public health organizations. It could also minimize duplication in the development of solutions and expand the availability of tools and services to additional clinical domains.

Have questions? Contact us at cancerinformatics@cdc.gov.