As noted in the papers on data quality and data cleaning discussed in the article below, issues related to the quality of data served through the GBIF (or any other) Data Portal are of critical importance to both data providers and data users.
Data quality and errors in data are often neglected because of the time and expense involved in checking data sets for typographical errors, empty or misused fields in a data record, etc.
Now, however, GBIF is providing open-source software to assist with some of the tasks involved in checking data sets for quality. This software package, DataTester, can be downloaded from SourceForge, as can its documentation. A web page has been set up to demonstrate how the software works, using a number of sample tests.
This extensible Java package is intended to support a number of functions within the GBIF Data Portal. For instance, when new data sets are registered with GBIF and indexed, the software will be used to identify possible errors in them. These issues can then be reported, in a standard format generated by DataTester, to the custodian of the data for correction.
Tests that can be executed include the following:
- Reporting unrecognized values for data elements (e.g. country names or basis of record values)
- Checking that coordinates fall within the boundaries of named geographic areas
- Finding scientific names that are not known to external lists such as the Catalogue of Life or nomenclators
- Checking that scientific names have an appropriate format.
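To make the kinds of checks listed above more concrete, here is a minimal sketch in Java of three of them: an unrecognized-value check, a coordinate check and a name-format check. It is not DataTester's own code. The class and method names, the short vocabulary and the bounding-box figures are all assumptions chosen for illustration, and the name pattern is deliberately simplified.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;

/** Hypothetical helper class; none of these names come from DataTester itself. */
final class RecordChecks {

    // Illustrative vocabulary only; a real test would load the full controlled list.
    static final Set<String> BASIS_OF_RECORD =
            new HashSet<>(Arrays.asList("specimen", "observation", "living", "fossil"));

    // Deliberately simplified: a capitalized genus, a lowercase species epithet
    // and an optional lowercase infraspecific epithet. Authorship, hybrid
    // markers and rank markers are not handled.
    private static final Pattern SIMPLE_NAME =
            Pattern.compile("[A-Z][a-z]+ [a-z]+( [a-z]+)?");

    /** True if the value, ignoring case and surrounding whitespace, is in the vocabulary. */
    static boolean isRecognized(String value, Set<String> vocabulary) {
        return value != null && vocabulary.contains(value.trim().toLowerCase(Locale.ROOT));
    }

    /** True if the point lies inside a rectangular bounding box (edges inclusive). */
    static boolean inBoundingBox(double lat, double lon,
                                 double minLat, double maxLat,
                                 double minLon, double maxLon) {
        return lat >= minLat && lat <= maxLat && lon >= minLon && lon <= maxLon;
    }

    /** True if the name matches the simplified genus-plus-epithet pattern. */
    static boolean hasSimpleNameFormat(String name) {
        return name != null && SIMPLE_NAME.matcher(name.trim()).matches();
    }

    public static void main(String[] args) {
        System.out.println(isRecognized("Specimen", BASIS_OF_RECORD));   // true
        System.out.println(isRecognized("specimn", BASIS_OF_RECORD));    // false
        // Rough, illustrative box around Denmark; not an authoritative boundary.
        System.out.println(inBoundingBox(55.7, 12.6, 54.5, 57.8, 8.0, 15.2)); // true
        System.out.println(hasSimpleNameFormat("Puma concolor"));        // true
        System.out.println(hasSimpleNameFormat("puma Concolor"));        // false
    }
}
```

A rectangular box is only a coarse stand-in for the boundaries of a named geographic area, and this version does not handle boxes that cross the 180° meridian. Checking names against external lists such as the Catalogue of Life would require a lookup service and is left out of the sketch.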
DataTester can also be employed directly by data providers, other portals, or anyone preparing to analyse data retrieved via GBIF. Nor is the software limited to biodiversity data: those working in fields other than biodiversity informatics can add tests for the kinds of errors that might be found in their own data sets.
The software is particularly suited to reporting on XML data sets, but can be applied to other data formats or relational databases. It allows programmers to develop new tests and to generalize tests so that they can work against multiple data standards (e.g. the Darwin Core and ABCD schemas). Each test may be associated with a severity (error, warning, info) to make it easier to focus on the most significant issues. Developers can also write extensions to support different reporting mechanisms.
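The following sketch shows one way these ideas could fit together: a test interface, a severity attached to each reported issue, a test that is generalized by taking the field name as a parameter, and a pluggable reporter. Every type and method name here is hypothetical and does not reflect DataTester's actual API. The field names are illustrative only.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// All type and method names below are hypothetical, not DataTester's API.

/** Declared from most to least significant, so natural order sorts errors first. */
enum Severity { ERROR, WARNING, INFO }

final class Issue {
    final Severity severity;
    final String message;

    Issue(Severity severity, String message) {
        this.severity = severity;
        this.message = message;
    }
}

/** A test inspects one record (field name to value) and returns any issues found. */
interface RecordTest {
    List<Issue> check(Map<String, String> record);
}

/** Generalized test: the concrete field name is supplied per data standard. */
final class RequiredFieldTest implements RecordTest {
    private final String fieldName;
    private final Severity severity;

    RequiredFieldTest(String fieldName, Severity severity) {
        this.fieldName = fieldName;
        this.severity = severity;
    }

    @Override
    public List<Issue> check(Map<String, String> record) {
        List<Issue> issues = new ArrayList<>();
        String value = record.get(fieldName);
        if (value == null || value.trim().isEmpty()) {
            issues.add(new Issue(severity, "Missing value for " + fieldName));
        }
        return issues;
    }
}

/** Extension point for alternative report formats. */
interface Reporter {
    String render(List<Issue> issues);
}

/** Lists issues one per line, most severe first. */
final class PlainTextReporter implements Reporter {
    @Override
    public String render(List<Issue> issues) {
        List<Issue> sorted = new ArrayList<>(issues);
        sorted.sort(Comparator.comparing((Issue i) -> i.severity));
        StringBuilder out = new StringBuilder();
        for (Issue issue : sorted) {
            out.append(issue.severity).append(": ").append(issue.message).append('\n');
        }
        return out.toString();
    }
}

final class SeverityDemo {
    public static void main(String[] args) {
        Map<String, String> record = new HashMap<>();
        record.put("Country", "Denmark");

        List<Issue> issues = new ArrayList<>();
        issues.addAll(new RequiredFieldTest("Locality", Severity.INFO).check(record));
        issues.addAll(new RequiredFieldTest("ScientificName", Severity.ERROR).check(record));

        // Prints the ERROR line before the INFO line.
        System.out.print(new PlainTextReporter().render(issues));
    }
}
```

Because the field name is a constructor argument, the same test class could be configured once per data standard, for example with a Darwin Core element name in one configuration and the corresponding ABCD element name in another. A reporter that emits XML or sends e-mail would be another implementation of the same `Reporter` interface.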
Persons with questions or comments can post them to the gbif-datatester mailing list that has been set up on SourceForge or, for more information, can contact Donald Hobern, GBIF Programme Officer for Data Access and Database Interoperability (dhobern@gbif.org).