NDA Data Harmonization Approach

 

Overall Approach

The NIMH Data Archive (NDA) follows a researcher-driven approach to harmonizing individual-level data collected by hundreds of different laboratories, primarily those studying mental health and substance and alcohol use disorders. The NDA data harmonization approach is extensible and currently accommodates phenotypic, clinical, behavioral, neuroimaging, neuro-signal recording, and omics data. The approach is designed to accommodate most kinds of research data from any human disease research area.

NDA’s data dictionary is a database of over 2,500 tables, each of which is called a data structure. A data structure represents a single measure, data collection instrument, assessment, or metadata manifest, and contains data submitted by one or more researchers. Each data structure was created by a researcher submitting data to NDA.

NDA data structures are updated regularly as new projects add variables (data elements), create aliases, and update descriptions; all changes are recorded in the change history record. All data submitted to NDA are submitted to one of these structures, which allows researchers to query across the entire NDA database. Users can download data collection templates to simplify data collection and submission to NDA.

 

Required Data Elements

All NDA data structures include 5 required data elements that are used to merge data across structures and facilitate both quality assurance (QA) checks and querying across all structures. NDA does not allow missing or NA values for these 5 required data elements.

The first required data element is the Global Unique Identifier (GUID).  This element is called the NDAR GUID as it was developed by a research consortium when NDA was the National Database for Autism Research (NDAR).  The GUID allows NDA to link subjects across different studies, without handling any personally identifiable information.  Technical details on GUID generation are available at https://nda.nih.gov/guid.

NDA’s Required Data Elements:

  1. Subjectkey – the NDAR GUID for a subject; this must be in NDAR GUID format and can represent a real GUID or a pseudoGUID. Real GUIDs are preferable to pseudoGUIDs, and pseudoGUIDs can be retrospectively and permanently promoted to real GUIDs.
  2. Src_subject_id – a research study’s internal subject identifier; this ID should not contain any personally identifiable information (e.g., names, DOB, initials).
  3. Interview_age – the age in months (rounded) at the time when data were collected from the subject.
  4. Interview_date – the date when data were collected from the subject, in MM/DD/YYYY format.
  5. Sex – subject’s sex at birth; currently NDA supports only M/F values for this data element.
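The rules in the list above can be sketched as a small pre-submission check. This is an illustrative sketch only, not NDA's validation tool: the function names, the lowercase column names, the set of strings treated as missing, and the month-rounding rule (an average month of 30.4375 days) are all assumptions, and the NDAR GUID format itself is not checked.

```python
import re
from datetime import date, datetime

REQUIRED = ("subjectkey", "src_subject_id", "interview_age", "interview_date", "sex")
# Assumption: strings treated as "missing"; a real validator may use another list.
MISSING_TOKENS = {"", "NA", "N/A", "NULL"}
DATE_PATTERN = re.compile(r"\d{2}/\d{2}/\d{4}")


def age_in_months(birth: date, interview: date) -> int:
    """Approximate age in months, rounded to the nearest whole month.

    Assumption: an average month length of 30.4375 days is used.
    """
    return round((interview - birth).days / 30.4375)


def validate_required(row: dict) -> list:
    """Return a list of problems found in the five required elements."""
    errors = []
    cleaned = {}
    for name in REQUIRED:
        raw = row.get(name)
        value = "" if raw is None else str(raw).strip()
        if value.upper() in MISSING_TOKENS:
            errors.append(f"{name}: missing or NA value")
        else:
            cleaned[name] = value

    # Format checks run only on elements that are present.
    age = cleaned.get("interview_age")
    if age is not None and not age.isdigit():
        errors.append("interview_age: expected a whole number of months")

    when = cleaned.get("interview_date")
    if when is not None:
        try:
            if not DATE_PATTERN.fullmatch(when):
                raise ValueError(when)
            datetime.strptime(when, "%m/%d/%Y")  # rejects dates such as 02/30/2020
        except ValueError:
            errors.append("interview_date: expected MM/DD/YYYY")

    sex = cleaned.get("sex")
    if sex is not None and sex not in {"M", "F"}:
        errors.append("sex: expected M or F")
    return errors
```

For example, a row with an ISO-style date and `NA` for sex would produce two errors, one for each element, while a row with all five elements correctly filled in produces an empty list.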

 

Data Structure Creation

Researchers submitting data to an NDA Collection (a virtual container for data from one research project) can create new data structures if there is no existing data structure that matches their data collection instrument.  Data submitters initiate the structure creation process in the Data Expected tab of their NDA Collection.  NDA’s Curation team reviews the request and either suggests an existing structure that could accommodate the data or creates a new data structure.  Data submitters make the final determination for data harmonization decisions.

If a new data structure is created, it will contain all item level questions (elements) and scores (scoring algorithm outputs or summary elements) defined by the original assessment/instrument creator or publisher.  The structure is then published to the NDA data dictionary, so other researchers can find it and use it to collect data from their own research studies and submit those data to the same table in the NDA database.

 

Data Structure Extension

If NDA has an existing structure that could accommodate newly collected data, the Curation team will provide documentation showing how most of the data elements in the structure described by the data submitter are identical to those in the existing structure. If the researcher agrees, NDA will add new data elements to the existing structure in order to accommodate all data to be submitted. 

NDA data structures often contain contextual data elements such as version, visit number, respondent, or pipeline. These elements facilitate normalization of data collected using the same instrument in different contexts, while always presenting end users with the contextual information they need for research decisions.

NDA does not create new versions of existing data structures unless a new edition of an existing assessment or instrument is published or re-modeled. 

 

Data Element Mapping, Aliases, Translations, and Constraints

Mapping

The NDA data dictionary contains over 1.5 million data elements.  Data elements are not defined by their names, as most NDA element names do not follow any sort of ontology and are just a combination of letters and numbers. 

NDA defines a single data element by the combination of three fields in the element definition: element description, value range, and notes. The element description can contain information about the respondent (e.g., child or parent), and the value range and notes can contain information on how missing or unanswered data are coded. NDA uses this approach to map data elements across multiple data structures. When a new structure is created or an existing structure is extended with new elements, any data element that already exists (as defined by these three fields) is reused across the structures.
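The idea that an element's identity comes from its three definition fields rather than its name can be illustrated with a small registry. This is a hypothetical sketch: `ElementDefinition`, `ElementRegistry`, and the exact-match comparison are assumptions made for illustration, and the source does not describe how NDA actually compares definitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ElementDefinition:
    """The three fields that together identify a data element."""
    description: str
    value_range: str
    notes: str


class ElementRegistry:
    """Maps each distinct definition to one element name (hypothetical)."""

    def __init__(self):
        self._name_by_definition = {}

    def get_or_create(self, proposed_name: str, definition: ElementDefinition) -> str:
        # Assumption: two definitions match only if all three fields are identical.
        existing = self._name_by_definition.get(definition)
        if existing is not None:
            return existing  # reuse the element across structures
        self._name_by_definition[definition] = proposed_name
        return proposed_name
```

With this sketch, a second structure that proposes a differently named element with the same description, value range, and notes gets back the existing element name, while a change to any one of the three fields yields a new element.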

Element description, value range, and notes are regularly updated as new researchers extend the NDA data dictionary.  These updates are propagated across all data structures for a given data element.  When a data submitter requests substantive changes to an existing element, NDA will create a new data element.

Aliases

Many data submitters do use ontologies or other naming conventions for their data management needs. Data submitters provide those names as aliases during the Data Expected process. NDA creates data element aliases in the NDA data dictionary and associates the aliases with the specific NDA Collection. The Data Curator will provide an updated data submission template to the data submitter, who can then submit data in an NDA data structure without changing element names. Data submitters should select “use custom scope” in the Validation Tool in order to submit with aliases.
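Conceptually, a collection-scoped alias is a lookup from the submitter's column name to the NDA element name. The sketch below illustrates that lookup only; the function name, the collection key, and the alias names are hypothetical, and it does not reflect how the Validation Tool resolves aliases internally.

```python
# Hypothetical alias table: collection identifier -> {submitter name: NDA element name}
ALIASES_BY_COLLECTION = {
    "example_collection": {"pid": "src_subject_id", "age_mo": "interview_age"},
}


def apply_aliases(row: dict, aliases: dict) -> dict:
    """Rename submitter column names to NDA element names.

    Columns without an alias are passed through unchanged.
    """
    renamed = {}
    for column, value in row.items():
        target = aliases.get(column, column)
        if target in renamed:
            # Two source columns resolving to one element would silently lose data.
            raise ValueError(f"duplicate element after aliasing: {target}")
        renamed[target] = value
    return renamed
```

A usage example: `apply_aliases({"pid": "S001", "age_mo": 120, "sex": "F"}, ALIASES_BY_COLLECTION["example_collection"])` returns a row keyed by `src_subject_id`, `interview_age`, and `sex`.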

Translations

Data submitters may use different coding for data element value ranges.  NDA can create a translation that maps the data submitter’s value range to the NDA value range.  Translations are specific to an NDA Collection.  Data submitters can use translations and do not have to recode data for submission.  Translations are not always possible, as they require a direct mapping between value ranges.
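A translation can be pictured as a direct lookup from each submitter code to one NDA code. The following is a minimal sketch under that reading; the function names and the example codes are hypothetical, and it treats "direct mapping" as meaning that every submitter value maps to exactly one value in the NDA value range.

```python
def build_translation(mapping: dict, nda_values: set) -> dict:
    """Check that every target is an allowed NDA value, then return the mapping."""
    invalid = sorted(str(v) for v in mapping.values() if v not in nda_values)
    if invalid:
        raise ValueError(f"targets outside the NDA value range: {invalid}")
    return dict(mapping)


def translate(values: list, translation: dict) -> list:
    """Recode submitter values; fail loudly on a value with no direct mapping."""
    recoded = []
    for value in values:
        if value not in translation:
            raise KeyError(f"no translation for submitter value {value!r}")
        recoded.append(translation[value])
    return recoded
```

For instance, a submitter who codes a four-point scale as 1 to 4 while the NDA value range is 0 to 3 could use the mapping `{1: 0, 2: 1, 3: 2, 4: 3}`. A submitter value that has no counterpart in the NDA range cannot be translated, which corresponds to the case where a translation is not possible.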

Database Constraints

NDA data are stored in an Oracle database, which imposes some technical constraints:

  1. Element name – at most 30 characters; no spaces; cannot start with a number; no special characters except underscore (_)
  2. Alias – at most 100 characters; no spaces
  3. Elements per structure – at most 995; NDA creates multipart structures to manage this limitation
  4. Text element value – at most 4,000 characters
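The constraints above can be expressed as simple checks. This is a sketch under stated assumptions: the function names are hypothetical, element names are assumed to be limited to ASCII letters, digits, and underscores, and the splitting function only chunks a list of element names. It does not model how NDA actually lays out multipart structures, for example how the required data elements are handled in each part.

```python
import re

MAX_NAME_LENGTH = 30
MAX_ALIAS_LENGTH = 100
MAX_ELEMENTS_PER_STRUCTURE = 995
MAX_TEXT_LENGTH = 4000

# Letters, digits, and underscores only; the first character is not a digit.
_NAME_PATTERN = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def is_valid_element_name(name: str) -> bool:
    return len(name) <= MAX_NAME_LENGTH and _NAME_PATTERN.fullmatch(name) is not None


def is_valid_alias(alias: str) -> bool:
    return 0 < len(alias) <= MAX_ALIAS_LENGTH and not any(ch.isspace() for ch in alias)


def is_valid_text_value(value: str) -> bool:
    return len(value) <= MAX_TEXT_LENGTH


def split_into_parts(elements: list, limit: int = MAX_ELEMENTS_PER_STRUCTURE) -> list:
    """Chunk a long element list into consecutive parts of at most `limit` names."""
    return [elements[i:i + limit] for i in range(0, len(elements), limit)]
```

As a usage example, an instrument with 2,000 elements would be split by `split_into_parts` into three parts of 995, 995, and 10 elements.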