Guidelines for Data Providers


Principal Investigators are Encouraged to Arrange with their Publishers for Access to Journal Articles via CEDR Prior to Submission of Data File Sets to CEDR

CEDR requests that investigators obtain permission from their publishers for inclusion of their research papers on the public CEDR web site, CEDRView. While cooperation from researchers is, at present, voluntary, the public need for reasonable access to government-funded research on the health of the DOE workforce is compelling. Investigators are urged to consider publication venues that permit such access.

Discussion. The rationale for CEDR is grounded in the need for openness of government and public access to records of government-funded research. While researchers have historically made the results of their work available through peer-reviewed scientific literature and standards, concern has recently arisen that public access to published papers is inadequate. The barriers to access, predominantly cost, have been widely acknowledged (http://publicaccess.nih.gov/publicaccess_QandA.htm and http://osc.universityofcalifornia.edu/). Legislative and academic initiatives to confront the cost of access to scholarly literature are becoming more frequent. Indeed, recent legislative efforts to require NIH-funded research to be publicly available one year after publication are indicative of the trend.

By making documentation and underlying data files accessible to the public, the Department of Energy's CEDR program has been at the forefront of this paradigm shift. Nevertheless, conventional practices still result in completion of research and publication phases prior to submission of data and documentation to CEDR. Because published articles are generally more intellectually accessible than analytic data files, CEDR users need access to journal articles. Thus, the public is disadvantaged when access to research articles is obstructed by rights models that require fees for use of reports of research already paid for by public funds.

To improve public access to CEDR-related research funded by DOE, including research conducted by others for DOE, CEDR strongly encourages researchers to consider rights management concerns in advance of submitting research papers to publishers. While there are many variants of permissions agreements, researchers have been successful in obtaining permission to install some (e.g., pre-final publication) versions of their papers on public institutional web servers (including CEDR).

Confidentiality Statement for Data Providers

A Confidentiality Statement is required of Data Providers to comply with both Federal privacy requirements and those imposed by the States that provide death certificate data.

This Confidentiality Statement form is available in PostScript® or Adobe® Acrobat PDF format. To print the PostScript version, you must have a PostScript printer. To print the PDF version, you must have Adobe Acrobat Reader (free) and a printer.

General Considerations for Submission of Data to CEDR

The Comprehensive Epidemiologic Data Resource (CEDR) Program is an effort of the U.S. Department of Energy to integrate and broaden access to its epidemiologic information. The CEDR information system, located at Lawrence Berkeley National Laboratory, began as a repository of data to support the DOE Worker Health and Mortality Study program, and is expanding to provide access to dose reconstruction, area monitoring, community health studies, and epidemiologic surveillance data.

To facilitate scientific data sharing, complete documentation is requested from CEDR data providers. Three aspects of our approach to documentation are especially important: the provision of documentation for working as well as analytic data; support of structured documentation; and protection of individual identities.

PLEASE NOTE: According to Department of Energy Office of Epidemiologic Studies policy, all data provided to CEDR must be documented according to these Guidelines for Data Providers. Non-conforming data files and documentation must be returned to the provider for revision and resubmission to CEDR.

Clarifications and Corrections

CEDR will post on the CEDR website any clarifications and corrections that data providers convey to CEDR. Data providers have a standing invitation to review the materials, to comment on errors of style, fact, or omission, and to make any other suggestions they believe will benefit CEDR users and the public.

Provision of working and analytic data file sets

In the process of conducting a study, researchers usually assemble files from a variety of data sources, such as plant payrolls, dosimetry systems, and death certificates. This initial set of files is termed the "working file set". These working files may include data not ultimately used in the preliminary cohort selection (including data on individuals who were excluded from the final analysis), variables not used in the study calculations, and data that prove to be inconsistent or unreliable and are discarded before the final analysis.

The "analytic file set" (which might consist of only a single file) contains the data upon which a researcher directly bases a study's reported findings. These data may be more carefully verified than the working file set, and the files are typically more concise. They frequently contain variables which are calculated from the variables in the working files. Of course, some researchers may use working files directly in an analysis, while others differentiate between working and analytic data differently than described here.

CEDR is structured to include both analytic and working sets of data files, with their supporting documentation. Provision of working as well as analytic data files enables CEDR users not only to verify the original analysis, but also to choose an analysis path different from that of the original researcher.

What is a CEDR data file? A data file submitted to CEDR must be rectangular. That is, each record must have the same variables, in the same order, and be of the same length. Where a variable includes decimals, the decimal points must be aligned in a single column. Data files are required to be in ASCII format. Every data file must be accompanied by Columnar Specification Documentation.
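As an illustration only, a data provider might pre-check a file against the rectangularity rules above before submission. The following Python sketch is not a CEDR tool; the function name `check_rectangular`, the `decimal_columns` parameter, and the sample records are all hypothetical, and the real column layout would come from the file's Columnar Specification Documentation.

```python
def check_rectangular(lines, decimal_columns=()):
    """Return a list of problems found in fixed-width ASCII records.

    lines: iterable of record strings without trailing newlines.
    decimal_columns: 0-based column positions where a '.' is expected
    (hypothetical parameter; the real layout would be taken from the
    file's Columnar Specification Documentation).
    """
    problems = []
    expected_length = None
    for number, record in enumerate(lines, start=1):
        # Data files are required to be in ASCII format.
        if not record.isascii():
            problems.append(f"record {number}: non-ASCII character")
        # Every record must have the same length as the first one.
        if expected_length is None:
            expected_length = len(record)
        elif len(record) != expected_length:
            problems.append(
                f"record {number}: length {len(record)}, expected {expected_length}"
            )
        # Decimal points must line up in a single column.
        for column in decimal_columns:
            if column >= len(record) or record[column] != ".":
                problems.append(
                    f"record {number}: no decimal point in column {column}"
                )
    return problems


# Made-up sample records: an identifier, a decimal value, and a one-letter code.
records = [
    "0001  12.50 M",
    "0002   3.75 F",
    "0003 104.00 M",
]
print(check_rectangular(records, decimal_columns=[8]))
```

An empty list means that no problems were found. The sketch checks only record length, character set, and decimal alignment; it does not verify that each record holds the same variables in the same order, which requires the columnar specification itself.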

CEDR data files are understood to be original work by the investigator(s) on whose behalf the data and documentation are submitted to CEDR. Occasionally, the line of demarcation between data and its descriptive documentation, such as a code set, can seem ambiguous. Generally, codes used in a variable are originated by the epidemiologic investigator. However, the investigator may also adopt a code set formulated by others, as in the International Classification of Diseases. Any set of values used in a variable in a data file, and intended to serve as a translation table or to interpret a controlled set of symbols into a natural language such as English, is considered a "code set". Such sets of value pairs (code value, and interpretation of the code value) are installed in CEDR as Structured Documentation rather than as another data file.
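To make the notion of a code set concrete, the sketch below represents one as a mapping from code value to interpretation and uses it to translate a variable's values. The code set shown is invented for illustration and is not an actual CEDR code set; the names `SEX_CODES` and `decode` are likewise hypothetical.

```python
# Hypothetical code set for an illustrative variable; not an actual CEDR code set.
# Each entry is a value pair: (code value, interpretation of the code value).
SEX_CODES = {"1": "male", "2": "female", "9": "unknown"}


def decode(values, code_set):
    """Translate coded values to text and list codes the code set omits."""
    decoded = [code_set.get(value) for value in values]
    undocumented = sorted({value for value in values if value not in code_set})
    return decoded, undocumented


decoded, undocumented = decode(["1", "2", "7", "9"], SEX_CODES)
print(decoded)       # interpretations, with None for any undocumented code
print(undocumented)  # codes present in the data but missing from the code set
```

A check of this kind can help a provider confirm that every code appearing in a data file has an interpretation in the accompanying documentation.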

CEDR ID Variables. As described in more detail below (Protection of Individual Identity), CEDR does not include "identifying" data such as personal name or Social Security Number. Many studies make use of a "de-identified" identifier, which allows data about an individual in one CEDR data file to be matched with data about that same individual in another CEDR data file without revealing the personal identity of the individual. This anonymized identifier is called a CEDR ID. If a CEDR ID variable is included in a data file submitted to CEDR, it should be the first variable in the file. The variable name should be specified as "id", and the variable description should be specified as "identification number".
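The CEDR ID conventions can be sketched in Python as follows. This is a minimal illustration, assuming that a file's variables are available as an ordered list of (name, description) pairs and that records are dictionaries keyed by variable name; those representations, the function names, and the sample values are hypothetical and are not part of any CEDR software.

```python
def check_id_variable(variables):
    """Check the CEDR ID conventions for one file.

    variables: ordered list of (name, description) pairs (assumed format).
    Returns a list of problems; empty if the file has no 'id' variable
    or if the 'id' variable follows the conventions.
    """
    names = [name for name, _ in variables]
    if "id" not in names:
        return []  # no CEDR ID variable in this file; nothing to check
    problems = []
    if names[0] != "id":
        problems.append("'id' should be the first variable in the file")
    if dict(variables)["id"] != "identification number":
        problems.append("description of 'id' should be 'identification number'")
    return problems


def match_by_id(file_a, file_b):
    """Pair records from two files that share the same CEDR ID."""
    by_id = {record["id"]: record for record in file_b}
    return [
        (record, by_id[record["id"]])
        for record in file_a
        if record["id"] in by_id
    ]


# Made-up example: a dosimetry file and a mortality file sharing de-identified IDs.
dosimetry = [{"id": "000017", "dose": 1.25}, {"id": "000042", "dose": 0.40}]
mortality = [{"id": "000042", "death_year": 1987}]
print(match_by_id(dosimetry, mortality))
print(check_id_variable([("id", "identification number"), ("dose", "annual dose")]))
```

The matching step relies only on the de-identified identifier, so records can be linked across files without any personal identifier being present in either file.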

Structured Documentation

Researchers maintain information, often as documents called "codebooks", to describe the variables and data values within the data file. Such information includes the location of variables within records, the meaning of coded values, and the precise definition of each calculated variable.

A CEDR data file set consists of:

CEDR organizes and stores this documentation in a consistent structure across all data sets in the CEDR collection. The intent of this documentation is to aid the CEDR user in selecting among the available data and to provide the detailed information on files and variables necessary for use.

For these reasons we have developed documentation guidelines (attached, with examples of their use) which are sent to providers of CEDR data. These guidelines provide for data and documentation to be submitted to CEDR in specific formats and with descriptive information that is best originated by the investigator or data provider. These formats and descriptive information specifications might differ from the codebook maintained by originating investigators or data providers in that they specify summary paragraphs describing the data sources as a whole, and, in the case of analytic data, study methods and results.

How to Prepare Structured Documentation for CEDR

Protection of Individual Identity

1. Identifying data of human subjects. CEDR does not accept data files that are not de-identified and does not perform de-identification. Epidemiologic data typically contain information that uniquely identifies individuals who were subjects in the study. Such data can only be used under rigorous legal and statutory controls, and a system of extensive protocol and other institutional reviews.

To provide for easier access to epidemiologic data overall, CEDR excludes identifying information that might incur regulatory burden. That is, CEDR does not include data that directly identify individuals by name, Social Security Number, specific birth or death dates, or specific job hire or termination dates.

This policy for CEDR collection development requires that these types of data be redacted by the provider prior to submission to CEDR.
See: Data Truncation Guidelines

2. Identifying data of individuals other than human subjects. In the course of collecting data related to an epidemiologic study, an investigator or institution might collect information about individuals who are not human subjects per se. Such individuals include the investigators themselves, professional and technical staff associated with the project (including technicians who performed sampling or dose measurement recording tasks), and independent contractors. Identifying data describing such persons might not be subject to the same regulatory, professional, and institutional guidelines as human subjects data. Nevertheless, unless CEDR has been notified that explicit permission has been granted for identifying such individuals, CEDR neither accepts nor includes such identifying information, except for the identification of principal investigators, co-authors and co-investigators on published papers associated with CEDR data file sets, and institutional contacts. References to individuals (other than principal investigators and study authors), such as names, initials, abbreviations, and personal names used in partnership or corporate format, should be provided to CEDR in de-identified form.

Summary

These Guidelines minimize the effort required of CEDR data providers while obtaining the necessary information to make CEDR materials comprehensible to users. We encourage those who plan to submit data to CEDR to contact us early in their work in order that we can work with data providers in an efficient and supportive way. Typically, CEDR engages with data providers as needed to transfer data and the necessary documentation.



URL: http://cedr.lbl.gov