LHNCBC: Medical Informatics
Building on past efforts and experience with de-identification procedures (the NCI Shared Pathology Informatics Network (SPIN) grant) and on existing Lister Hill Center tools that recognize sensitive content such as dates, person names, locations, and numeric identifiers, LHC researchers have initiated an effort to develop an open-source text de-identification tool.
Our system uses more than 700,000 clinical records from the Clinical Center (under IRB exemptions) to test and validate current work. Developers are identifying and scrubbing sensitive information in the clinical text and labeling these items by type. To accomplish the task, the system draws on several data sources. Statistical information about the usage of clinical words, common English words, and word co-occurrences has been extracted from multi-billion-word corpora such as Wikipedia and Core clinical journal article abstracts.
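To make the idea of scrubbing sensitive items and labeling them by type more concrete, here is a minimal sketch in Python. It is not the LHC tool's actual method: the patterns, the label names (DATE, NUMERIC_ID), and the scrub function are all hypothetical, chosen only for illustration. A real system would also need the statistical word-usage resources described above to handle person names and locations, which simple patterns cannot reliably catch.

```python
import re

# Hypothetical, illustrative patterns only; these are assumptions,
# not the rules used by the LHC de-identification tool.
PATTERNS = [
    ("DATE", re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b")),
    ("NUMERIC_ID", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
]


def scrub(text):
    """Replace each pattern match with a [TYPE] placeholder.

    Returns the scrubbed text and a list of (type, original_value)
    pairs, in the order the patterns were applied.
    """
    found = []
    for label, pattern in PATTERNS:
        def replace(match, label=label):
            # Record what was removed so items can be reviewed by type.
            found.append((label, match.group(0)))
            return f"[{label}]"
        text = pattern.sub(replace, text)
    return text, found


if __name__ == "__main__":
    scrubbed, items = scrub("Seen on 03/14/2009, MRN 123-45-6789.")
    print(scrubbed)
    print(items)
```

Keeping the list of removed items alongside the scrubbed text is one possible design choice: it lets reviewers check what was labeled and as which type without re-reading the original record.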