NIH Data Sharing Policy and Implementation Guidance

(Updated: March 5, 2003)

This guidance provides the National Institutes of Health (NIH) policy statement on data sharing and additional information on the implementation of this policy.

Goals of Data Sharing
Applicability
Implementation

Timeliness of Data Sharing
Human Subjects and Privacy Issues
Proprietary Data
Methods for Data Sharing
Data Documentation
Funds for Data Sharing
Review Considerations

What to Include in an NIH Application
Examples of Data Sharing Plans
Definitions

Covered Entity
Data
Data Archive
Data Enclave
Final Research Data
Restricted Data
Timeliness
Unique Data

GOALS OF DATA SHARING

Data sharing promotes many goals of the NIH research endeavor. It is particularly important for unique data that cannot be readily replicated. Data sharing allows scientists to expedite the translation of research results into knowledge, products, and procedures to improve human health.

There are many reasons to share data from NIH-supported studies. Sharing data reinforces open scientific inquiry, encourages diversity of analysis and opinion, promotes new research, makes possible the testing of new or alternative hypotheses and methods of analysis, supports studies on data collection methods and measurement, facilitates the education of new researchers, enables the exploration of topics not envisioned by the initial investigators, and permits the creation of new datasets when data from multiple sources are combined.

In NIH's view, all data should be considered for data sharing. Data should be made as widely and freely available as possible while safeguarding the privacy of participants, and protecting confidential and proprietary data. To facilitate data sharing, investigators submitting a research application requesting $500,000 or more of direct costs in any single year to NIH on or after October 1, 2003 are expected to include a plan for sharing final research data for research purposes, or state why data sharing is not possible.

APPLICABILITY

The NIH policy on data sharing applies:

To the sharing of final research data for research purposes.
To basic research, clinical studies, surveys, and other types of research supported by NIH. It applies to research that involves human subjects and laboratory research that does not involve human subjects. It is especially important to share unique data that cannot be readily replicated.
To applicants seeking $500,000 or more in direct costs in any year of the proposed project period through grants, cooperative agreements, or contracts.
To research applications submitted beginning October 1, 2003.
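
As an illustration only, the dollar threshold and effective date listed above can be expressed as a simple check. The function name, argument names, and the representation of costs as a list are assumptions made for this sketch; they are not part of the NIH policy or of any NIH system.

```python
from datetime import date

# Threshold and effective date as stated in the policy text.
DIRECT_COST_THRESHOLD = 500_000
EFFECTIVE_DATE = date(2003, 10, 1)


def data_sharing_plan_expected(direct_costs_by_year, submission_date):
    """Return True if an application is expected to address data sharing.

    direct_costs_by_year: requested direct costs (USD), one entry per project year.
    submission_date: date the application is submitted.
    Both argument names are hypothetical, chosen for this sketch.
    """
    # The policy applies if ANY single year reaches the threshold.
    meets_threshold = any(cost >= DIRECT_COST_THRESHOLD for cost in direct_costs_by_year)
    return meets_threshold and submission_date >= EFFECTIVE_DATE
```

Under these assumptions, an application requesting 350,000, 500,000, and 400,000 dollars across three years and submitted in February 2004 would be expected to include a plan, because one year reaches the threshold even though the others do not.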

Policies with respect to data sharing vary across countries. Investigators from foreign institutions and U.S. investigators collecting data in other countries should familiarize themselves with the policies governing data sharing in the countries in which they plan to work and should address any specific limitations in the data-sharing plan in their application.

Even if NIH support is sought to transform or link datasets (as opposed to producing a new set of data), the investigator should still include a data-sharing plan in the application. If there are limitations associated with a data-sharing agreement for the original data that preclude subsequent sharing, then the applicant should explain this in the application.

IMPLEMENTATION

The NIH data-sharing policy applies to applicants seeking $500,000 or more in direct costs in any year of the proposed research. The $500,000 threshold corresponds to the threshold set in the October 16, 2001 NIH Guide, where applicants requesting $500,000 or more in direct costs for any year must seek agreement by NIH Institute or Center (IC) staff to accept assignment of their application at least 6 weeks prior to the anticipated submission date. (See http://grants2.nih.gov/grants/guide/notice-files/NOT-OD-02-004.html). That policy directs applicants to contact in writing or by telephone IC program staff during the development process of the application but no later than 6 weeks before the anticipated submission date. Applicants are encouraged to discuss their proposed data-sharing plan with IC program staff at that time.

Final research data are recorded factual material commonly accepted in the scientific community as necessary to document, support, and validate research findings. This does not mean summary statistics or tables; rather, it means the data on which summary statistics and tables are based. For most studies, final research data will be a computerized dataset. For example, the final research data for a clinical study would include the computerized dataset upon which the accepted publication was based, not the underlying pathology reports and other clinical source documents. For some but not all scientific areas, the final dataset might include both raw data and derived variables, which would be described in the documentation associated with the dataset.

Given the breadth and variety of science that NIH supports, neither the precise content for the data documentation, nor the formatting, presentation, or transport mode for data is stipulated. What is sensible in one field or one study may not work at all for others. It would be helpful for members of multiple disciplines and their professional societies to discuss data sharing, determine what standards and best practices should be proposed, and create a social environment that supports data sharing. NIH is planning to convene workshops where investigators with experience in data sharing will share their expertise with others. These workshops will address areas such as cleaning and formatting data, writing documentation, redacting data to protect subjects' identities and proprietary information, and estimating costs to prepare documentation and data for sharing.

When the Principal Investigator (PI) and the authorized institutional official sign the face page of an NIH application, they are assuring compliance with policies and regulations governing research awards. NIH expects grantees to follow these rules and to conduct the work described in the application. Thus, if an application describes a data-sharing plan, NIH expects that plan to be enacted. If progress has been made with the data-sharing plan, then the grantee should note this in the progress report. In the final progress report, if not sooner, the grantee should note what steps have been taken with respect to the data-sharing plan. In the case of noncompliance (depending on its severity and duration) NIH can take various actions to protect the Federal Government's interests. In some instances, for example, NIH may make data sharing an explicit term and condition of subsequent awards.

Grantees should note that, under the NIH Grants Policy Statement, they are required to keep the data for 3 years following closeout of a grant or contract agreement. (Contracts may specify different time periods.) For the most part, NIH makes awards to institutions and not individuals (with very few exceptions, such as F32 awards). Thus, the grantee institution may have additional policies and procedures regarding the custody, distribution, and required retention period for data produced under research awards.

Timeliness of Data Sharing

Recognizing that the value of data often depends on their timeliness, data sharing should occur in a timely fashion. NIH expects the timely release and sharing of data to be no later than the acceptance for publication of the main findings from the final dataset. The specific time will be influenced by the nature of the data collected. Data from small studies can be analyzed and submitted for publication relatively quickly. If data from large epidemiologic or longitudinal studies are collected over several discrete time periods or waves, it is reasonable to expect that the data would be released in waves as data become available or main findings from waves of the data are published. NIH recognizes that the investigators who collected the data have a legitimate interest in benefiting from their investment of time and effort. NIH continues to expect that the initial investigators may benefit from first and continuing use but not from prolonged exclusive use.

Human Subjects and Privacy Issues

The rights and privacy of human subjects who participate in NIH-sponsored research must be protected at all times. It is the responsibility of the investigators, their Institutional Review Board (IRB), and their institution to protect the rights of subjects and the confidentiality of the data. Prior to sharing, data should be redacted to strip all identifiers, and effective strategies should be adopted to minimize risks of unauthorized disclosure of personal identifiers. Stripping a dataset of items that could identify individual participants is referred to by several different terms, such as "data redaction," "de-identification of data," and "anonymizing data." In addition to removing direct identifiers, e.g., name, address, telephone numbers, and Social Security Numbers, researchers should consider removing indirect identifiers and other information that could lead to "deductive disclosure" of participants' identities. Deductive disclosure of individual subjects becomes more likely when there are unusual characteristics or the joint occurrence of several unusual variables. Samples drawn from small geographic areas, rare populations, and linked datasets can present particular challenges to the protection of subjects' identities.
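
The two steps described above, removing direct identifiers and then looking for combinations of indirect identifiers that could permit deductive disclosure, might be sketched as follows. This is a minimal illustration, not a prescribed procedure: the field names, the list of indirect identifiers, and the cutoff of 5 records are all assumptions chosen for the example, and a real study would set these in consultation with its IRB.

```python
from collections import Counter

# Hypothetical field names; a real study would define its own lists.
DIRECT_IDENTIFIERS = {"name", "address", "telephone", "ssn"}
INDIRECT_IDENTIFIERS = ("zip3", "birth_year", "diagnosis")


def strip_direct_identifiers(records):
    """Return copies of the records with direct identifier fields removed."""
    return [
        {key: value for key, value in rec.items() if key not in DIRECT_IDENTIFIERS}
        for rec in records
    ]


def rare_combinations(records, keys=INDIRECT_IDENTIFIERS, min_count=5):
    """Return combinations of indirect identifiers shared by fewer than min_count records.

    The default min_count of 5 is an arbitrary illustrative cutoff,
    not an NIH requirement.
    """
    counts = Counter(tuple(rec.get(key) for key in keys) for rec in records)
    return {combo: n for combo, n in counts.items() if n < min_count}
```

Records whose combination of values appears in the output of `rare_combinations` are candidates for further redaction, coarsening, or release only under stricter controls.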

Investigators may use different methods to reduce the risk of subject identification. One possible approach is to withhold some part of the data. Another approach is to statistically alter the data in ways that will not compromise secondary analyses but will protect individual subjects' identities. Alternatively, an investigator may restrict access to the data at a controlled site, sometimes referred to as a data enclave. Some investigators may employ hybrid methods, such as releasing a highly redacted dataset for general use but providing access to more sensitive data with stricter controls through a data enclave.

Researchers who seek access to individual-level data are typically required to enter into a data-sharing agreement. Data-sharing agreements, which go by many names, including "license agreements" and "data distribution agreements," generally include requirements to protect participants' privacy and data confidentiality. They may prohibit the recipient from transferring the data to other users or require that the data be used for research purposes only, among other provisions, and they may stipulate penalties for violations. For further information on these alternative mechanisms to share data while protecting participant confidentiality, see also the section concerning "Methods for Data Sharing." In most instances, sharing and archiving of data is possible without compromising confidentiality and privacy rights. The procedures adopted to share data while protecting privacy should be individually tailored to the specific dataset.

Investigators seeking NIH support for clinical trials may wish to consider several factors as they develop their data-sharing plan. Researchers who are planning clinical trials and intend to share the resulting data should think carefully about the study design, the informed consent documents, and the structure of the resulting dataset prior to the initiation of the study. For example, many early phase clinical trials use small samples, which make it difficult to protect the privacy of the participants. Furthermore, some study designs afford greater privacy protection to subjects than others. For example, longitudinal research poses challenges because of the need to retain identifiers in order to link individual-specific data collected at different time points.

NIH recognizes that the sharing of data from clinical trials and under other situations may require making the data anonymous or sharing under more controlled means, as through a restricted access data enclave. Sharing through data enclaves would grant access only to researchers who agree to preserve the privacy of subjects and provide means to protect the confidentiality of the data.

Investigators who are working for or who are themselves covered entities under the Health Insurance Portability and Accountability Act (HIPAA) must consider issues related to the Privacy Rule, a Federal regulation under HIPAA that governs the protection of individually identifiable health information. The Department of Health and Human Services (DHHS) provides guidance on research and the Privacy Rule elsewhere (http://www.hhs.gov/ocr/). It should be noted that the Privacy Rule is relatively new, and additional information and guidance will be shared on the DHHS website as soon as it is available.

If research participants are promised that their data will not be shared with other researchers, the application should explain the reasons for such promises. Such promises should not be made routinely and without adequate justification. For the most part, it is not appropriate for the initial investigator to place limits on the research questions or methods other investigators might pursue with the data. It is also not appropriate for the investigator who produced the data to require coauthorship as a condition for sharing the data.

Many research efforts supported by NIH do not include human subjects. Final research datasets from studies that do not include human subjects generally should not be constrained by the limitations deemed necessary and appropriate for human subjects.

Proprietary Data

Although Small Business Innovation Research (SBIR) applicants are also to address data sharing in their applications, under the Small Business Act, SBIR grantees may withhold their data for 4 years after the end of the award. The Small Business Act provides authority for NIH to protect from disclosure and nongovernmental use all SBIR data developed from work performed under an SBIR funding agreement for a period of 4 years after the closeout of either a phase I or phase II grant unless NIH obtains permission from the awardee to disclose these data. The data rights protection period lapses only upon expiration of the protection period applicable to the SBIR award, or by agreement between the small business concern and NIH.

Issues related to proprietary data also can arise when cofunding is provided by the private sector (e.g., the pharmaceutical or biotechnology industries) with corresponding constraints on public disclosure. NIH recognizes the need to protect patentable and other proprietary data. Any restrictions on data sharing due to cofunding arrangements should be discussed in the data-sharing plan section of an application and will be considered by program staff. While NIH understands that an institution's desire to exercise its intellectual property rights may justify a need to delay disclosure of research findings, a delay of 30 to 60 days is generally viewed as a reasonable period for such activity.

Methods for Data Sharing

There are many ways to share data.

Under the auspices of the PI
Data archive
Data enclave
Mixed mode sharing

The method for sharing that an investigator selects is likely to depend on several factors, including the sensitivity of the data, the size and complexity of the dataset, and the volume of requests anticipated. Investigators sharing under their own auspices may simply mail a CD with the data to the requestor, or post the data on their institutional or personal Website. Although not a condition for data access, some investigators sharing under their own auspices may form collaborations with other investigators seeking their data in order to pursue research of mutual interest. Others may simply share the data by transferring them to a data archive facility to distribute more widely to interested users, to maintain associated documentation, and to meet reporting requirements. Data archives can be particularly attractive for investigators concerned about a large volume of requests, vetting frivolous or inappropriate requests, or providing technical assistance for users seeking help with analyses.

There are several mechanisms for data sharing that investigators can use. For example, investigators sharing under their own auspices should consider using a data-sharing agreement to impose appropriate limitations on users. Such an agreement usually indicates the criteria for data access and whether or not there are any conditions for research use, and it can incorporate privacy and confidentiality standards to ensure data security at the recipient site and prohibit manipulation of data for the purposes of identifying subjects. Many examples of data-sharing agreements for specific datasets are available on the Internet, including the following:

AHRQ National Inpatient Sample at http://www.ahcpr.gov/data/hcup/datause.htm

Russian Longitudinal Monitoring Survey at http://www.cpc.unc.edu/dataarch/iprimary/rlms.html

Centers for Medicare and Medicaid Services Data at http://hrsonline.isr.umich.edu/rda/userdocs/cmsdua.pdf (PDF - 59 KB)

Alternatively, researchers may want to add their data to a data archive or a data enclave. Datasets that cannot be distributed to the general public, for example, because of participant confidentiality concerns, third-party licensing or use agreements that prohibit redistribution, or national security considerations, can be accessed through a data enclave. A data enclave provides a controlled, secure environment in which eligible researchers can perform analyses using restricted data resources.

Investigators may also wish to develop a "mixed mode" for data sharing that allows for more than one version of the dataset and provides different levels of access depending on the version. For example, a redacted dataset could be made available for general use, but stricter controls through a data enclave would be applied if access to more sensitive data were required.

Investigators will need to determine which method of data sharing is best for their particular dataset. The Data Sharing Workbook (PDF - 75 KB) or (MS Word - 74 KB) provides information and examples of how others have shared data.

Data Documentation

Regardless of the mechanism used to share data, each dataset will require documentation. (Some fields refer to data documentation by other terms, such as metadata or codebooks). Proper documentation is needed to ensure that others can use the dataset and to prevent misuse, misinterpretation, and confusion. Documentation provides information about the methodology and procedures used to collect the data, details about codes, definitions of variables, variable field locations, frequencies, and the like. The precise content of documentation will vary by scientific area, study design, the type of data collected, and characteristics of the dataset.
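
For a computerized dataset, part of this documentation can be generated from the data themselves. The sketch below builds a very simple codebook that pairs each variable's definition with its count of missing values and its value frequencies. The function name, the record layout (a list of dictionaries), and the example variables are assumptions made for illustration; the policy does not stipulate any particular format or tool.

```python
from collections import Counter


def build_codebook(records, definitions):
    """Summarize each variable: definition, missing-value count, and value frequencies.

    records: list of dicts, one per observation.
    definitions: dict mapping variable name to a human-readable definition.
    Values that are absent or None are counted as missing.
    """
    codebook = {}
    for variable, definition in definitions.items():
        values = [rec.get(variable) for rec in records]
        present = [value for value in values if value is not None]
        codebook[variable] = {
            "definition": definition,
            "n_missing": len(values) - len(present),
            "frequencies": dict(Counter(present)),
        }
    return codebook
```

A summary like this covers only codes and frequencies; the methodology and collection procedures described above still need to be written by the investigators.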

It is appropriate for scientific authors to acknowledge the source of data upon which their manuscript is based. Many investigators include this information in the methods and/or reference sections of their manuscripts. Journals generally include an acknowledgement section, in which the authors can recognize people who helped them gain access to the data. Authors using shared data should check the policies of the journal to which they plan to submit to determine the precise location in the manuscript for such acknowledgement. Most journals now expect that DNA and amino acid sequences that appear in articles will be submitted to a sequence database before publication.

Funds for Data Sharing

NIH recognizes that it takes time and money to prepare data for sharing. Thus, applicants can request funds for data sharing and archiving in their grant application. (See also the section on What to Include in an NIH Application.) Investigators who incorporate data sharing in the initial design of the study may more readily and economically establish adequate procedures for protecting the identities of participants and share a useful dataset with appropriate documentation.

Review Considerations

Reviewers will not factor the proposed data-sharing plan into the determination of scientific merit or priority score. Program staff will be responsible for overseeing the data sharing policy and for assessing the appropriateness and adequacy of the proposed data-sharing plan.

WHAT TO INCLUDE IN AN NIH APPLICATION

Investigators seeking $500,000 or more in direct costs in any year should include a description of how final research data will be shared, or explain why data sharing is not possible. It is expected that the data sharing discussion will be provided primarily in the form of a brief paragraph immediately following the Research Plan Section of the PHS 398 application form (i.e., immediately after I. Letters of Support), and would not count towards the application page limit.

Data Sharing Plan (to follow immediately after the Research Plan Section)

The precise content of the data-sharing plan will vary, depending on the data being collected and how the investigator is planning to share the data. Applicants who are planning to share data may wish to describe briefly the expected schedule for data sharing, the format of the final dataset, the documentation to be provided, whether or not any analytic tools also will be provided, whether or not a data-sharing agreement will be required and, if so, a brief description of such an agreement (including the criteria for deciding who can receive the data and whether or not any conditions will be placed on their use), and the mode of data sharing (e.g., under their own auspices by mailing a disk or posting data on their institutional or personal website, through a data archive or enclave). Investigators choosing to share under their own auspices may wish to enter into a data-sharing agreement.

References to data sharing may also be appropriate in other sections of the application, as discussed below.

Budget and Budget Justification Sections

Applicants may request funds in their application for data sharing. If funds are being sought, the applicant should address the financial issues in the budget and budget justification sections. Some investigators have more experience than others in estimating costs associated with preparing the dataset and associated documentation, and providing support to data users. As investigators gain experience with the process, their ability to estimate costs will improve. Investigators working with archives can get help with data preparation and cost estimation. Investigators who are concerned about paying for data-sharing costs at the end of their grant can make prior arrangements with archives. Investigators facing considerable delays in the preparation of the final dataset for sharing should consult with the NIH program about how to manage this situation, such as requesting a no-cost extension.

Background and Significance Section (PHS 398 Research Plan Section B)

If support is being sought to develop a large database that will serve as an important resource for the scientific community, the applicant may wish to make a statement about this in the significance section of the application.

Human Subjects Section (PHS 398 Research Plan Section E)

If the research involves human subjects and the data are intended to be shared, the application should discuss how the rights and confidentiality of participants would be protected. In the Human Subjects section of the application, the applicant should discuss the potential risks to research participants posed by data sharing and steps taken to address those risks.

EXAMPLES OF DATA-SHARING PLANS

The precise content and level of detail to be included in a data-sharing plan depends on several factors, such as whether or not the investigator is planning to share data, the size and complexity of the dataset, and the like. Below are several examples of data-sharing plans.

Example 1

The proposed research will involve a small sample (fewer than 20 subjects) of individuals with Williams syndrome recruited from clinical facilities in the New York City area. This rare craniofacial disorder is associated with distinguishing facial features, as well as mental retardation. Even with the removal of all identifiers, we believe that it would be difficult if not impossible to protect the identities of subjects given the physical characteristics of subjects, the type of clinical data (including imaging) that we will be collecting, and the relatively restricted area from which we are recruiting subjects. Therefore, we are not planning to share the data.

Example 2

The proposed research will include data from approximately 500 subjects being screened for three bacterial sexually transmitted diseases (STDs) at an inner city STD clinic. The final dataset will include self-reported demographic and behavioral data from interviews with the subjects and laboratory data from urine specimens provided. Because the STDs being studied are reportable diseases, we will be collecting identifying information. Even though the final dataset will be stripped of identifiers prior to release for sharing, we believe that there remains the possibility of deductive disclosure of subjects with unusual characteristics. Thus, we will make the data and associated documentation available to users only under a data-sharing agreement that provides for: (1) a commitment to using the data only for research purposes and not to identify any individual participant; (2) a commitment to securing the data using appropriate computer technology; and (3) a commitment to destroying or returning the data after analyses are completed.

Example 3

This application requests support to collect public-use data from a survey of more than 22,000 Americans over the age of 50 every 2 years. Data products from this study will be made available without cost to researchers and analysts at https://ssl.isr.umich.edu/hrs/

User registration is required in order to access or download files. As part of the registration process, users must agree to the conditions of use governing access to the public release data, including restrictions against attempting to identify study participants, destruction of the data after analyses are completed, reporting responsibilities, restrictions on redistribution of the data to third parties, and proper acknowledgement of the data resource. Registered users will receive user support, as well as information related to errors in the data, future releases, workshops, and publication lists. The information provided to users will not be used for commercial purposes, and will not be redistributed to third parties.

DEFINITIONS

Covered Entity - A covered entity is defined as a health care clearinghouse, health plan, or health care provider that electronically transmits health information in connection with a transaction for which DHHS has adopted standards under the Health Insurance Portability and Accountability Act (HIPAA). An example of a researcher who may be a covered entity is a physician who electronically bills for health care services and conducts clinical trials. A set of decision tools on "Am I a covered entity?" is available from the DHHS Office for Civil Rights Website at http://www.hhs.gov/ocr/

Data - see Final Research Data

Data Archive - A place where machine-readable data are acquired, manipulated, documented, and finally distributed to the scientific community for further analysis.

Data Enclave - A controlled, secure environment in which eligible researchers can perform analyses using restricted data resources.

Final Research Data - Recorded factual material commonly accepted in the scientific community as necessary to document and support research findings. This does not mean summary statistics or tables; rather, it means the data on which summary statistics and tables are based. For the purposes of this policy, final research data do not include laboratory notebooks, partial datasets, preliminary analyses, drafts of scientific papers, plans for future research, peer review reports, communications with colleagues, or physical objects, such as gels or laboratory specimens. NIH has separate guidance on the sharing of research resources, which can be found at http://grants.nih.gov/grants/policy/nihgps_2003/NIHGPS_Part7.htm#_Toc54600131

Restricted Data - Datasets that cannot be distributed to the general public because of, for example, participant confidentiality concerns, third-party licensing or use agreements, or national security considerations.

Timeliness - In general, NIH considers the timely release and sharing of data to be no later than the acceptance for publication of the main findings from the final dataset. However, the actual time will be influenced by the nature of the data collected.

Unique Data - Data that cannot be readily replicated. Examples of studies producing unique data include: large surveys that are too expensive to replicate; studies of unique populations, such as centenarians; studies conducted at unique times, such as a natural disaster; studies of rare phenomena, such as rare metabolic diseases.
