Professional Profile Data: The Cost of Flexibility

The eRA project team recently released funds for a $2 million per year effort to clean up personal data in the IMPAC II database. A contract will be awarded to: 1) identify existing data conflicts; 2) manually correct any errors; and 3) develop and implement proposals to increase the accuracy of the data. This cost will continue until business rules and algorithms are in place to reduce the possibility of redundant data entries and the redesigned NIH Commons enables grantees to verify their own data.

Data integrity -- the accuracy or correctness of the data in a database -- is not a significant problem for most eRA data. For personal data, however, it is a major issue. A recent analysis showed that less than 1 percent of overall data in IMPAC II was inaccurate, whereas the error rate in personal data was roughly 8 percent. These errors make it difficult to report accurate statistics (e.g., support for new investigators or physician scientists) and to track grantee award history (mistaken identity of an investigator could pose serious problems).

The problem of conflicting data goes back many years. Typos in names, social security numbers and degrees entered into the legacy system were never systematically corrected. The user community chose not to impose checks in the IMPAC II system that would slow down production work. As a result, the system allows users to enter a duplicate social security number or a duplicate record when entering personal data. This practice has exacerbated the inadvertent creation of duplicate profiles and of conflicting data on existing profiles. One alternative is to require the user to investigate and verify any conflicting information before proceeding. The adoption of an appropriate algorithm could assist in data verification.
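
To illustrate the kind of check described above, the following is a minimal sketch in Python of how a new personal data entry might be compared against existing profiles before it is saved. It is not the eRA algorithm. The Profile fields, the function names and the 0.85 name-similarity threshold are all hypothetical, chosen only for illustration; the actual IMPAC II data model is not described here.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class Profile:
    # Hypothetical fields; the real IMPAC II schema is not described in the article.
    profile_id: int
    name: str
    ssn: str
    degree: str


def normalize_ssn(ssn: str) -> str:
    """Keep digits only, so '123-45-6789' and '123456789' compare equal."""
    return "".join(ch for ch in ssn if ch.isdigit())


def name_similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio for case-folded, whitespace-collapsed names."""
    def clean(s: str) -> str:
        return " ".join(s.casefold().split())
    return SequenceMatcher(None, clean(a), clean(b)).ratio()


def find_conflicts(new, existing, name_threshold=0.85):
    """List (profile_id, reason) pairs the user should review before saving `new`.

    The threshold is an assumed value for this sketch, not a tuned one.
    """
    conflicts = []
    new_ssn = normalize_ssn(new.ssn)
    for p in existing:
        same_ssn = new_ssn != "" and normalize_ssn(p.ssn) == new_ssn
        similar_name = name_similarity(new.name, p.name) >= name_threshold
        if same_ssn and similar_name:
            conflicts.append((p.profile_id, "possible duplicate profile"))
        elif same_ssn:
            conflicts.append((p.profile_id, "same SSN, different name"))
        elif similar_name:
            conflicts.append((p.profile_id, "similar name, different SSN"))
    return conflicts
```

In a data entry screen, a non-empty result from find_conflicts would be the point at which the user is asked to investigate before proceeding. A stricter threshold produces fewer prompts but misses more typos, which is the same trade-off between rigor and production speed discussed above.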

Background information and further explanation of data integrity issues are available from eRA.