
Story: DataTester: Software to Help Automate Data Cleansing Available from GBIF



The task of checking a set of biodiversity data records for accuracy can be daunting, but now there is free, open-source software available to help. DataTester was developed by the Centro de Referência em Informação Ambiental (CRIA, Campinas, Brazil) with support from GBIF and the Gordon and Betty Moore Foundation.
Released on: 07 October 2005
Contributor: Meredith Lane
Language: English
Spatial coverage: Not applicable
Keywords:
Source of information: GBIF Secretariat
Concerned URL: http://www.gbif.net/datatester/index.jsp

As noted in the papers on data quality and data cleaning discussed in the article below, issues related to the quality of data served through the GBIF (or any other) Data Portal are of critical importance to both data providers and data users.

Data quality and errors in data are often neglected because of the time and expense involved in checking data sets for typographical errors, empty or misused fields in a data record, etc.

Now, however, GBIF is providing open-source software to assist with some of the tasks involved in checking data sets for quality. Both the software package and its documentation can be downloaded from SourceForge. A web page has been set up to demonstrate the software with a number of sample tests.

This extensible Java package is intended to support a number of functions within the GBIF Data Portal. For instance, when new data sets are registered with GBIF and indexed, the software will be used to identify possible errors in them. These issues can then be reported, in a standard format generated by DataTester, to the custodian of the data for correction.

Tests that can be executed include the following:

  • Reporting unrecognized values for data elements (e.g. country names or basis of record values)
  • Checking that coordinates fall within the boundaries of named geographic areas
  • Finding scientific names that are not known to external lists such as the Catalogue of Life or nomenclators
  • Checking that scientific names have an appropriate format.
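
To give a feel for what checks of this kind involve, here is a minimal Java sketch of three of them. It is not DataTester's own code or API: the class name `RecordChecks`, its methods, the tiny country list and the rough bounding box for Brazil are all assumptions made up for illustration.

```java
import java.util.Set;
import java.util.regex.Pattern;

public class RecordChecks {
    // Hypothetical controlled vocabulary; a real list would be far longer.
    static final Set<String> KNOWN_COUNTRIES = Set.of("Brazil", "Denmark", "Kenya");

    // Rough bounding box for Brazil in decimal degrees (approximate, illustration only).
    static final double MIN_LAT = -34.0, MAX_LAT = 5.5;
    static final double MIN_LON = -74.0, MAX_LON = -34.0;

    // Capitalised genus followed by a lower-case epithet, e.g. "Puma concolor".
    static final Pattern BINOMIAL = Pattern.compile("^[A-Z][a-z]+ [a-z]+(-[a-z]+)?$");

    // True if the value appears in the controlled vocabulary.
    static boolean isRecognizedCountry(String value) {
        return value != null && KNOWN_COUNTRIES.contains(value.trim());
    }

    // True if the coordinates fall inside the rough box for Brazil.
    static boolean isInsideBrazilBox(double lat, double lon) {
        return lat >= MIN_LAT && lat <= MAX_LAT && lon >= MIN_LON && lon <= MAX_LON;
    }

    // True if the name looks like a simple binomial (no author, no rank markers).
    static boolean hasBinomialFormat(String name) {
        return name != null && BINOMIAL.matcher(name.trim()).matches();
    }

    public static void main(String[] args) {
        System.out.println(isRecognizedCountry("Brasil"));     // false: not in the list
        System.out.println(isInsideBrazilBox(-22.9, -47.06));  // true: near Campinas
        System.out.println(hasBinomialFormat("puma Concolor")); // false: wrong capitalisation
    }
}
```

A bounding box is only a coarse stand-in for a real boundary test, and the name pattern deliberately ignores authorship, subspecies and hybrids; both would need refinement before use on real records.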

DataTester is not restricted to use within the GBIF Data Portal, however: it can be employed directly by data providers, by other portals, or by anyone preparing to analyse data retrieved via GBIF. Nor is the software limited to biodiversity data types. Users in other fields can add tests for the kinds of errors likely to be found in their own data sets.

The software is particularly suited to reporting on XML data sets, but can be applied to other data formats or relational databases. It allows programmers to develop new tests and to generalize tests so that they can work against multiple data standards (e.g. Darwin Core and ABCD schema). Each test may be associated with a severity (error, warning, info) to make it easier to focus on the most significant issues. Developers can also write extensions to support different reporting mechanisms.
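
The idea of attaching a severity to each test can be sketched as follows. This is an illustration only and does not reproduce DataTester's actual interfaces: the names `SeveritySketch`, `RecordTest`, `Issue` and `run`, and the record keys `scientificName` and `collector`, are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.Optional;

public class SeveritySketch {
    // Declared from most to least serious, so enum order sorts errors first.
    enum Severity { ERROR, WARNING, INFO }

    static final class Issue {
        final Severity severity;
        final String message;
        Issue(Severity severity, String message) {
            this.severity = severity;
            this.message = message;
        }
    }

    // Hypothetical test contract: return an issue, or empty if the record passes.
    interface RecordTest {
        Optional<Issue> check(Map<String, String> record);
    }

    static boolean isBlank(String s) {
        return s == null || s.trim().isEmpty();
    }

    static Optional<Issue> checkName(Map<String, String> record) {
        return isBlank(record.get("scientificName"))
                ? Optional.of(new Issue(Severity.ERROR, "scientificName is empty"))
                : Optional.<Issue>empty();
    }

    static Optional<Issue> checkCollector(Map<String, String> record) {
        return isBlank(record.get("collector"))
                ? Optional.of(new Issue(Severity.INFO, "collector not given"))
                : Optional.<Issue>empty();
    }

    // The two example tests, deliberately listed with the less serious one first.
    static List<RecordTest> exampleTests() {
        List<RecordTest> tests = new ArrayList<>();
        tests.add(SeveritySketch::checkCollector);
        tests.add(SeveritySketch::checkName);
        return tests;
    }

    // Runs every test against one record and returns the issues, most severe first.
    static List<Issue> run(List<RecordTest> tests, Map<String, String> record) {
        List<Issue> issues = new ArrayList<>();
        for (RecordTest test : tests) {
            test.check(record).ifPresent(issues::add);
        }
        issues.sort(Comparator.comparing((Issue i) -> i.severity));
        return issues;
    }

    public static void main(String[] args) {
        for (Issue issue : run(exampleTests(), Map.of("country", "Brazil"))) {
            System.out.println(issue.severity + ": " + issue.message);
        }
    }
}
```

Because each test only sees a map of field names to values, the same test could in principle be pointed at records drawn from different standards once their field names are mapped to a common key; how DataTester itself achieves this is not described here.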

Questions and comments can be posted to the gbif-datatester mailing list, which has been set up on SourceForge. For more information, contact Donald Hobern, GBIF Programme Officer for Data Access and Database Interoperability (dhobern@gbif.org).

Please note that this story expired on 2006/01/15
