Develop a Method to Compare Facility Data across Disparate EPA Datasets

This application idea is approved by the EPA.

Source Code: GitHub

Additional Source Code files:
https://github.com/cman81/hackathon-geofence-server

Problem: Finding Matching Facilities across Datasets
The EPA struggles to identify matching facilities across datasets. Currently, a variety of conditions are appended to queries in an attempt to prevent false positive matches.

Challenge: Improve facility matching

Develop a prototype tool that identifies similar facilities across three disparate datasets and a master list of facilities. The EPA needs a tool that lets a user locate similar facilities across information systems using an array of fuzzy string matching algorithms (such as Levenshtein distance and Soundex), and that lets the user choose which algorithm is used and set its parameters before starting a comparison. The user should be able to make the matching more or less strict when deciding which facilities may be a match, based on the facility attributes being compared. These attributes include facility name, address, industry code, company name, contact information, and geographic proximity.
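As an illustration of the two algorithms named above, here is a minimal Python sketch of Levenshtein distance and American Soundex, with a threshold that makes the name comparison more or less strict. The function names, the upper-casing step, and the default threshold of 0.85 are assumptions made for this example and are not part of the challenge materials.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def name_similarity(a: str, b: str) -> float:
    """Edit distance scaled to [0, 1]; 1.0 means identical after normalization."""
    a, b = a.strip().upper(), b.strip().upper()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


# Soundex digit for each consonant group; vowels, H, W, and Y have no code.
_SOUNDEX_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _SOUNDEX_CODES[_ch] = _digit


def soundex(word: str) -> str:
    """American Soundex: first letter plus three digits, zero-padded."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    prev = _SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = _SOUNDEX_CODES.get(ch, "")
        if code and code != prev:
            result += code
        if ch not in "HW":  # H and W do not separate letters with the same code
            prev = code
    return (result + "000")[:4]


def is_match(a: str, b: str, threshold: float = 0.85) -> bool:
    """Assumed rule for this sketch: match when similarity reaches the threshold."""
    return name_similarity(a, b) >= threshold
```

With these definitions, "ACME CHEMICAL CO" and "Acme Chemical Co." differ by one edit, so they match at a threshold of 0.85 but not at 0.95. Soundex compares single words by sound, so a facility name would need to be split into tokens before it is applied.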

The prototype should include a mechanism that lets a user easily adjust the matching criteria.
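One way to expose adjustable criteria is a small settings object that the user edits before each comparison. The Python sketch below is an assumption-laden example: the `MatchCriteria` fields, their default values, and the record keys (`name`, `address`, `lat`, `lon`) are hypothetical, and string similarity uses the standard library's `difflib.SequenceMatcher` as a stand-in for whichever fuzzy algorithm is selected. Geographic proximity is computed with the haversine great-circle formula.

```python
import math
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class MatchCriteria:
    """User-adjustable strictness settings (hypothetical names and defaults)."""
    name_threshold: float = 0.85
    address_threshold: float = 0.80
    max_distance_km: float = 1.0


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometers, using a mean Earth radius of 6371 km."""
    radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(h))


def ratio(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1] from difflib (a stand-in algorithm)."""
    return SequenceMatcher(None, a.strip().upper(), b.strip().upper()).ratio()


def facilities_match(a: dict, b: dict, criteria: MatchCriteria) -> bool:
    """Two records match only if every criterion is satisfied."""
    if ratio(a["name"], b["name"]) < criteria.name_threshold:
        return False
    if ratio(a["address"], b["address"]) < criteria.address_threshold:
        return False
    distance = haversine_km(a["lat"], a["lon"], b["lat"], b["lon"])
    return distance <= criteria.max_distance_km


# Made-up records for illustration only.
frs_record = {"name": "ACME CHEMICAL CO", "address": "100 MAIN ST",
              "lat": 40.0, "lon": -75.0}
tri_record = {"name": "Acme Chemical Co.", "address": "100 Main Street",
              "lat": 40.001, "lon": -75.001}

loose = MatchCriteria()
strict = MatchCriteria(name_threshold=0.99)
```

Here `facilities_match(frs_record, tri_record, loose)` accepts the pair, while the same call with `strict` rejects it because the trailing period lowers the name score below 0.99. Tightening `max_distance_km` has a similar effect on the geographic test.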

Overview of tasks that need to be completed for this challenge:

1. Review the FRS facility table sample and samples of 3 disparate datasets.

2. Develop an interface that yields matches and incorporates the following components:
1) Ability to specify various fuzzy string matching algorithms.
2) Ability to customize parameters for the selected fuzzy string matching algorithm.
3) Ability to add additional constraints such as the ones found in no_merge_without_analysis.txt
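The three interface components above can be sketched as a matcher that takes an algorithm name, a threshold, and a list of extra constraints. This Python example is only a rough outline under stated assumptions: the registry keys, the record fields (`id`, `name`, `state`), and the `same_state` constraint are all hypothetical, and the constraint is not taken from no_merge_without_analysis.txt, whose contents are not reproduced here.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict, Iterable, List, Sequence, Tuple

Record = Dict[str, str]
Similarity = Callable[[str, str], float]
Constraint = Callable[[Record, Record], bool]


def sequence_ratio(a: str, b: str) -> float:
    """Character-level similarity from the standard library's difflib."""
    return SequenceMatcher(None, a.upper(), b.upper()).ratio()


def token_jaccard(a: str, b: str) -> float:
    """Share of whitespace-separated tokens common to both names."""
    ta, tb = set(a.upper().split()), set(b.upper().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


# Component 1: algorithms the user can choose from (hypothetical registry).
ALGORITHMS: Dict[str, Similarity] = {
    "sequence_ratio": sequence_ratio,
    "token_jaccard": token_jaccard,
}


def same_state(a: Record, b: Record) -> bool:
    """Component 3: an example constraint (hypothetical, not from the EPA file)."""
    return a.get("state") == b.get("state")


def find_matches(master: Iterable[Record],
                 candidates: Sequence[Record],
                 algorithm: str,
                 threshold: float,
                 constraints: Sequence[Constraint] = ()) -> List[Tuple[str, str, float]]:
    """Return (master id, candidate id, score) for every pair that passes."""
    score = ALGORITHMS[algorithm]
    matches = []
    for m in master:
        for c in candidates:
            s = score(m["name"], c["name"])
            # Component 2: the threshold is the tunable parameter in this sketch.
            if s >= threshold and all(ok(m, c) for ok in constraints):
                matches.append((m["id"], c["id"], s))
    return matches


# Made-up records for illustration only.
master_list = [{"id": "FRS-1", "name": "Acme Chemical Co", "state": "PA"}]
other_dataset = [
    {"id": "TRI-1", "name": "ACME CHEMICAL COMPANY", "state": "PA"},
    {"id": "TRI-2", "name": "Acme Chemical Co", "state": "NJ"},
    {"id": "TRI-3", "name": "Zenith Paper Mill", "state": "PA"},
]
```

Calling `find_matches(master_list, other_dataset, "sequence_ratio", 0.8)` pairs FRS-1 with TRI-1 and TRI-2; passing `constraints=[same_state]` drops TRI-2 because its state differs. The nested loop compares every pair, which is acceptable for samples but would likely need blocking or indexing on full datasets.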

Additional Information/Data Sources:

EPA’s Facility Registry System (FRS) provides quality facility data to support EPA’s mission of protecting human health and the environment. It contains data about facilities, sites, or places of environmental interest that are subject to regulation.

https://www.epa.gov/enviro/epa-state-combined-csv-download-files

EPA’s Toxics Release Inventory (TRI) is a resource for learning about toxic chemical releases and pollution prevention activities reported by industrial and federal facilities. TRI data support informed decision-making by communities, government agencies, companies, and others.

https://www.epa.gov/toxics-release-inventory-tri-program/tri-basic-data-files-calendar-years-1987-2014

EPA’s Office of Enforcement and Compliance Assurance goes after pollution problems that impact American communities through vigorous civil and criminal enforcement that targets the most serious water, air, and chemical hazards. As part of this mission, the office works to advance environmental justice by protecting communities most vulnerable to pollution.

https://echo.epa.gov/files/echodownloads/echo_exporter.zip

OSHA’s Enforcement Data consists of inspection case detail for approximately 100,000 OSHA inspections conducted annually. The dataset includes information regarding the impetus for conducting the inspection, and details on citations and penalty assessments resulting from violations of OSHA standards. Additionally, accident investigation information is provided, including textual descriptions of the accident, and details regarding the injuries and fatalities which occurred.

http://ogesdw.dol.gov/views/data_catalogs.php

*Note: Select OSHA Enforcement Data on the linked page. From there, you can download accident, inspection, and violation data.

Related Open Source Project on GitHub:

http://flori.github.io/amatch/
