
Data Management for Data Providers

Data Management and Data Collection

A small amount of time invested in consistently defining, organizing, and documenting your data products during collection will save time and effort in the future when preparing data to archive.

Best Practices for Data Management

Keeping a few best practices in mind during the data collection phase will make the process of documenting your data set quick and easy when the time comes.


The ORNL DAAC has developed best practices for data management to help data providers more efficiently manage their data. These practices do not need to be completed sequentially.


Assign descriptive file names

Names should contain only numbers, letters, and underscores, and ideally include the project acronym, study title, location, investigator, year(s) of study, data type, version number, and file type.

Jump to Assign descriptive file names for more information.

Use consistent data organization

One organizational style is multiple rows, each with comma-separated values. Another style is individual columns for each value. Be sure to provide a definition for all coded values or abbreviations.

Jump to Use consistent data organization for more information.

Define the contents of your data files

Provide names, units of measure, formats, and definitions of coded values. Be consistent.

Jump to Define the contents of your data files for more information.

Preserve information

Save raw data files with no transformations or analyses as "read-only". Use a scripted language to process data in a separate file.

Jump to Preserve information for more information.

Use stable file formats

Text-based comma-separated values are ideal. Avoid proprietary formats that may not be readable in the future.

Jump to Use stable file formats for more information.

Protect your data

Ensure that file transfers are done without error by comparing checksums before and after transfers. Create and test back-up copies often to prevent the disaster of lost data.

Jump to Protect your data for more information.

Assign descriptive data set titles

A descriptive title briefly summarizes your data set and will help researchers search for and identify it as pertinent and useful for future research.

Jump to Assign descriptive data set titles for more information.

Perform basic quality assurance

Check that there are no missing values for key parameters. Scan and/or plot for impossible and anomalous values. Perform and review statistical summaries.

Jump to Perform basic quality assurance for more information.

Prepare documentation

Consider what a future investigator needs to know in order to obtain and use your data.

Jump to Prepare documentation for more information.


Define the contents of your data files

In order for others to use your data, they must fully understand the contents of the data set, including the parameter names, units of measure, formats, and definitions of coded values. Parameters, units, and other coded values may be required to follow certain naming standards as defined in experiment plans and the destination archive.

Common data field types

Recommendations for several common field types are listed below.

Parameter Name

Parameter names should describe the contents of the parameter and be standardized across files, data sets, and the project. Documentation should contain a full description of each parameter. Use commonly accepted parameter names and abbreviations. Parameter name standards are in use, for example, GCMD and CDIAC AmeriFlux, but are not consistently implemented across scientific communities. If a standard vocabulary is implemented, be sure to include the citation in the metadata and/or documentation.

Units

Units of reported parameters must be explicitly stated in the data file and in the documentation. SI units are preferable, but each discipline may have its own commonly used units of measure. Units standards are available, for example, the CF Metadata Conventions and the Meteorology and Micrometeorology Data Submission Guidelines.

Parameter format

Choose a format for each parameter and use that format consistently throughout the data set. Explain formats in the documentation.

Dates

  • Use yyyy-mm-dd or yyyymmdd; January 2, 1997 is 1997-01-02.
  • The hyphens can be omitted if compactness is more important than human readability, for example 19970102.
  • If only the month or only the year is of interest, use 1995-02 or 1995.

Based on ISO 8601:2004 (Wikipedia).

Time

  • Use 24-hour notation (13:30 instead of 1:30 p.m. and 04:30 instead of 4:30 a.m.).
  • Report both local time and Coordinated Universal Time (UTC).
  • Include the local time zone in a separate field. Standard time is preferred.
  • Be sure to define the local time zone in the documentation.

Based on ISO 8601:2004 (Wikipedia).
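
For illustration, a minimal Python sketch (standard library only) of writing dates and times in these recommended forms:

    from datetime import datetime, timezone

    # Example observation time: January 2, 1997, 13:30 UTC
    obs = datetime(1997, 1, 2, 13, 30, tzinfo=timezone.utc)

    print(obs.strftime("%Y-%m-%d"))  # 1997-01-02 (ISO 8601 date)
    print(obs.strftime("%Y%m%d"))    # 19970102 (compact form)
    print(obs.strftime("%H:%M"))     # 13:30 (24-hour time, UTC)
    # Report local time in a separate field and define the
    # local time zone in the documentation.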


Spatial coordinates

Report in decimal degrees (≥4 decimal places). Record south latitudes and west longitudes as negative values, e.g., 80° 30' 00" W longitude is -80.5000. All location information in a data set should use the same coordinate system, including coordinate type, datum, and spheroid. Document all of these characteristics (e.g., lat/long in decimal degrees, NAD83 (North American Datum of 1983), WGS84 (World Geodetic System 1984)). The ORNL DAAC provides a tool to convert coordinates into decimal degrees.
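
The conversion is simple arithmetic: decimal degrees = degrees + minutes/60 + seconds/3600, negated for south and west. A minimal Python sketch (an illustrative function, not the ORNL DAAC tool itself):

    def dms_to_decimal(degrees, minutes, seconds, hemisphere):
        """Convert degrees/minutes/seconds to decimal degrees.
        South latitudes and west longitudes are negative."""
        dd = degrees + minutes / 60.0 + seconds / 3600.0
        return -dd if hemisphere in ("S", "W") else dd

    # 80° 30' 00" W longitude -> -80.5000
    print(f"{dms_to_decimal(80, 30, 0, 'W'):.4f}")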

Elevation

Provide elevation in meters. Include vertical datum used, e.g. North American Vertical Datum 1988 (NAVD 1988) or Australian Height Datum (AHD).

Coded fields

Standardized lists of predefined values from which the data provider may choose are useful; two good examples are U.S. state abbreviations and postal ZIP codes. Data collectors may also establish their own coded fields with defined values to be used consistently across several data files. Be aware of, and document, any changes in the coding scheme, particularly for externally defined schemes; postal codes, for example, can change over time.

Flags

A separate field may be used for quality considerations, reasons for missing values, or indicating replicated samples. Codes should be consistent across parameters and data files. Definitions of flag codes should be included in the data set documentation.

Missing values

Use consistent missing value notations throughout your data set.
For numeric fields, represent missing data with a specified extreme value (e.g., -9999), the IEEE floating point NaN value (Not a Number), or the database NULL. NULL and NaN can cause some problems, particularly with older programs.
For character fields, use NULL, "not applicable", "n/a" or "none". Explicit missing value representations are better than empty fields.
Document how missing and nodata values are represented.
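
As an illustration, a short sketch of reading and writing a documented missing-value code consistently (assuming the pandas library; the file names are hypothetical):

    import pandas as pd

    # Treat the documented missing-value code as missing on read
    df = pd.read_csv("hogi_met_1996.csv", na_values=[-9999])

    # Write the same code back out when exporting
    df.to_csv("hogi_met_export.csv", index=False, na_rep="-9999")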



Assign descriptive data set titles

Because the title is often the first thing people will see when looking at a data set, data set titles should be as descriptive as possible. These data sets may be accessed many years in the future by people who are unaware of the details of the project. Data set titles should contain the type of data and other information such as the date range, the location, and the instrument used. If your data set is part of a larger project (e.g., SAFARI 2000 or LBA-ECO), you may want to include the project name in your titles. Restrict the title to 85 characters (spaces included) to be compatible with other clearinghouses for ecological and global change data collections. Titles should contain only numbers, letters, dashes, underscores, periods, commas, colons, parentheses, and spaces; no special characters.

Examples of bad titles are:
  • The Aerostar 100 Data Set
  • Respiration Data
Some great titles are:
  • SAFARI 2000 Upper Air Meteorological Profiles, Skukuza, Dry Seasons 1999-2000
  • NACP Integrated Wildland and Cropland 30-m Fuel Characteristics Map, U.S.A., 2010
  • Global Fire Emissions Database, Version 2 (GFEDv2.2)

Assign descriptive file names

File names should reflect the contents of the file and uniquely identify the data file. File names may contain information such as project acronym, study title, location, investigator, year(s) of study, data type, version number, and file type. Avoid using file names such as mydata.dat or 1998.dat. Data sets may be compressed into *.zip or *.tar.gz formats. File names should be constructed to contain only lower-case letters, numbers, and underscores – no spaces or special characters – for easy management by various data systems and to decrease software and platform dependency. File names should not be more than 64 characters in length and if well-constructed could be considerably less. Similar logic is useful when designing directory structures and names.
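
A descriptive file name can be assembled from these components. A minimal Python sketch (all component values are hypothetical):

    import re

    # Hypothetical components; adjust to your project's conventions
    project, site, parameter, date, version = "lba", "km67", "leafarea", "20091001", "1"

    fname = f"{project}_{site}_{parameter}_{date}_v{version}.csv"
    # Only lower-case letters, numbers, underscores, and the extension's period
    assert re.fullmatch(r"[a-z0-9_.]+", fname) and len(fname) <= 64
    print(fname)  # lba_km67_leafarea_20091001_v1.csv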

Examples of tabular data file names

Tabular data should be in *.txt or *.csv files with descriptive file names.

Examples of image data file names

A variety of file formats are used for image data; descriptive file names should be used here as well. See Use stable file formats below for stable formats used with image data.

  • LBA_leafarea_20091001_20100101.tif: example GeoTIFF name from The Large Scale Biosphere-Atmosphere Experiment in Amazonia (LBA)
  • BOREAS_SSA_Young_Aspen_20100719.tif: from the data set SAR Subsets for Selected Field Sites, 2007-2010
  • MODIS_landcover-IGBP_2001.tif: example GeoTIFF name from MODIS land subsets


Include a description of the data file names in the data set documentation.


Use consistent data organization

There are two common ways to organize tabular data. In either case, each separate line or row represents an observation. Each line is a complete record.

Station   Date       Temp   Precip
Units     YYYYMMDD   C      mm
HOGI      19961001   12     0
HOGI      19961002   14     3
HOGI      19961003   19     -9999

Most often, the columns represent all the parameters that make up the record. Similar to a spreadsheet, this is the potentially "short and fat" style of data organization. Note: -9999 is a missing value code

Station   Date       Parameter   Value   Unit
HOGI      19961001   Temp        12      C
HOGI      19961002   Temp        14      C
HOGI      19961001   Precip      0       mm
HOGI      19961002   Precip      3       mm

If most parameters in a record do not have measurements, you can define the parameter and value in two columns. Other columns may be used for data about the measurement like site, date, units of measure, etc. This is the "long and skinny" style of data organization.
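
Converting between the two styles is straightforward with scripting tools. A sketch using the tables above (assuming the pandas library):

    import pandas as pd

    # The "short and fat" table
    wide = pd.DataFrame({
        "Station": ["HOGI", "HOGI"],
        "Date": [19961001, 19961002],
        "Temp": [12, 14],
        "Precip": [0, 3],
    })

    # Reshape to the "long and skinny" style
    long = wide.melt(id_vars=["Station", "Date"],
                     var_name="Parameter", value_name="Value")
    long["Unit"] = long["Parameter"].map({"Temp": "C", "Precip": "mm"})
    print(long)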

Keep similar measurements together (same investigator, methods, time basis, and instruments) in one data file. Many small files are more difficult to process than one larger file. There are exceptions: observations of different types of measurements might be placed into separate data files. Data collected on different time scales or temporal resolution might be handled more efficiently in separate files.

Use similar data organization, parameter formats, and common site names across the data set. Describe the data set organization and provide definitions for all coded values or abbreviations, including spatial coordinates, in the documentation.


Use stable file formats

Select a consistent format that can be read well into the future and is independent of changes in applications. If your data collection process used proprietary file formats, convert those files into a stable, well-documented, non-proprietary format to maximize others' ability to use and build upon your data.

Stable file formats

The data you are recording or generating will determine what file format you should use.

Tabular data

Delimited text file formats ensure data are readable in the future. Use a consistent structure throughout the data set. Report figures and analyses in companion documents, not the data file. Use ASCII text encoding if possible, with UTF-8 or UTF-16 as secondary options.

Headers and Delimiters

A header row should contain descriptors that link the data file to the data set: the data file name, data set title, author, date of creation, and date of last modification. Column headings describe the content of each column, including parameter names and units. Delimit fields using commas, tabs, semicolons, or vertical bars (|), in that order of preference. Avoid delimiters that occur in the data fields. If the data fields use the comma as the decimal separator, the semicolon is the preferred column delimiter.
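
A sketch of writing such a file with Python's standard csv module; one common convention, shown here, is to prefix the descriptor rows with "#" (file name, author, and dates are hypothetical):

    import csv

    header = [
        "# Data file: hogi_met_1996.csv",
        "# Data set: Example Meteorological Observations at HOGI Station 1996",
        "# Author: J. Smith; created: 1997-01-15; last modified: 1997-02-01",
    ]
    columns = ["Station", "Date", "Temp_degC", "Precip_mm"]
    rows = [["HOGI", "19961001", 12, 0],
            ["HOGI", "19961002", 14, 3]]

    with open("hogi_met_1996.csv", "w", newline="") as f:
        f.write("\n".join(header) + "\n")
        writer = csv.writer(f)  # comma-delimited, the preferred delimiter
        writer.writerow(columns)
        writer.writerows(rows)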

File extensions

Use the file extension that best indicates the file type and field delimiter: *.csv for a comma-delimited text file and *.txt for a tab- or semicolon-delimited text file. Do not use *.dat.

Raster Image data

Good raster file formats are open, non-proprietary, simple, and commonly used. More importantly, they are self-descriptive; in other words, metadata are included inside the file. The ORNL DAAC recommends:

  • GeoTIFF
  • NetCDF v3/v4
  • HDF-EOS

Thoroughly document the format and structure.
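
For example, a minimal sketch of writing a georeferenced GeoTIFF (assuming the third-party rasterio and numpy packages; all grid values, dimensions, and the file name are illustrative):

    import numpy as np
    import rasterio
    from rasterio.transform import from_origin

    # Illustrative 1-degree global grid of coded byte values
    data = np.random.randint(1, 101, size=(180, 360)).astype(np.uint8)
    transform = from_origin(-180.0, 90.0, 1.0, 1.0)  # UL corner, pixel size

    with rasterio.open(
        "example_landcover_2010.tif", "w",
        driver="GTiff", height=data.shape[0], width=data.shape[1],
        count=1, dtype=str(data.dtype), crs="EPSG:4326",
        transform=transform, nodata=255,
    ) as dst:
        dst.write(data, 1)  # georeferencing metadata travel inside the file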

Geospatial information for image data

Define the projection/coordinate reference system, referenced datum, EPSG code, spatial resolution, and bounding box. Provide a companion header file if geospatial information cannot be embedded inside the image file. Georeference images prior to archival. Documentation should include the rationale for choosing a particular projection, issues with reprojecting the data, suggested resampling techniques (e.g., nearest neighbor, cubic convolution), and projection constraints.

View a webinar on providing geospatial information (YouTube).

Storage structure for image data

Store the image as byte, signed integer, unsigned integer, float, etc., depending on the data range and type. For example, if values of a parameter range from 1 to 100 and are integers only, pick the BYTE data type to ensure that the least amount of disk space is used while maintaining data integrity. Embed the NODATA values in the image files if possible. Document NODATA values, fill values, valid ranges, and the scale factor and offset of the data values.
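
A sketch of the disk-space reasoning (assuming the numpy package; values are illustrative):

    import numpy as np

    # Integer values from 1 to 100 fit in a single unsigned byte per pixel
    values = np.random.randint(1, 101, size=(360, 720))   # platform default int
    img = values.astype(np.uint8)                         # 1 byte per pixel

    print(values.nbytes, "->", img.nbytes)                # far fewer bytes after casting

    NODATA = 255        # an out-of-range code, embedded and documented
    img[0, 0] = NODATA  # e.g., flag a missing pixel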

Additional considerations for image data

Provide a color lookup table for visualization of the image file. Include pictures of binary image files so correct reading of the binary images can be checked. Avoid using generic file extensions (e.g., *.bin or *.img) which could cause issues. Document which software package and version were used to create the data file(s). If the files were created with custom code, provide a software program to enable the user to read the files.

Proprietary formats

Data that are provided in a proprietary software format must include documentation of the software specifications (i.e., software package, version, vendor, and native platform). The archive data center will use this information to convert to a non-proprietary format for the archive.

Vector Image data

Good vector file formats are open, non-proprietary, simple, and commonly used. More importantly, they are self-descriptive; in other words, metadata are included inside the file. The ORNL DAAC recommends:

  • Shapefile
  • KML

Georeference vectors and specify the geometry type (Point, Line, Polygon, Multipoint). For proprietary vector file formats, document the software package, version, vendor, and native platform.


Storing data in recommended formats with detailed documentation will allow your data to be easily read many years into the future. Easy access means improved usability of your data and more researchers using and citing your data. Users can spend more time analyzing the data and spend less time in data preparation. Data sets can be combined and compressed into standard *.zip or *.tar.gz formats.


Preserve information

To preserve your data and its integrity, save a "read-only" copy of your raw data files with no transformations, interpolation, or analyses. Use a scripted programming language to process the data, writing results to a separate file in a separate directory. The code you have written is an excellent record of data processing. Your code can easily and quickly be revised and rerun in the event of data loss or requests for edits. Programming has the added benefit of allowing a future worker to follow up on or reproduce your processing. GUI-based tools are easy on the front end, but they do not keep a record of changes to your data and make reproducing results difficult.
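
A minimal sketch of this pattern (assuming the pandas library; all paths and column names are hypothetical):

    import os
    import stat
    import pandas as pd

    RAW = "raw/hogi_met_1996.csv"           # untouched raw data
    OUT = "processed/hogi_daily_means.csv"  # derived product, separate directory

    # Make the raw file read-only so it cannot be altered accidentally
    os.chmod(RAW, stat.S_IREAD | stat.S_IRGRP | stat.S_IROTH)

    df = pd.read_csv(RAW, na_values=[-9999])
    daily = df.groupby("Date", as_index=False)["Temp"].mean()

    os.makedirs("processed", exist_ok=True)
    daily.to_csv(OUT, index=False, na_rep="-9999")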


Protect your data

Ensure that file transfers are done without error by comparing checksums before and after transfers. Create and test back-up copies often to prevent the disaster of lost data. Maintain at least three copies of your data: the original, an on-site but external backup, and an off-site backup in case of a disaster. The advent of cloud storage allows for remote file storage that can be accessed from virtually anywhere. Periodically, test your ability to recover your data.
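
A sketch of the checksum comparison using Python's standard hashlib (file paths are hypothetical):

    import hashlib

    def sha256sum(path, chunk_size=1 << 20):
        """Compute a file's SHA-256 checksum in chunks."""
        h = hashlib.sha256()
        with open(path, "rb") as f:
            while chunk := f.read(chunk_size):
                h.update(chunk)
        return h.hexdigest()

    before = sha256sum("hogi_met_1996.csv")
    # ...transfer or back up the file, then recompute at the destination...
    after = sha256sum("/backup/hogi_met_1996.csv")
    assert before == after, "file changed during transfer!"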


Prepare documentation and metadata

As with data, documentation should be saved using stable, non-proprietary formats. Images, figures, and pictures should be individual GIF or JPEG files. Documents should be in separate PDF or PS files identified in the data files. Names of documentation files should be similar to the name of the data set and the data file(s). The documentation is most useful when structured as a user's guide for the data product.

Documentation can never be too complete. Users who are not familiar with your data will need more detailed documentation to understand your data set. Long-term experimental activities require more documentation because personnel change over time.

Data set documentation

Data set documentation should provide a detailed description of your data.

Project description

  • The name of the data set, which will be the title of the documentation
  • What data were collected
  • The scientific reason why the data were collected
  • Who collected the data and whom to contact with questions (include email and website if appropriate)
  • Who funded the investigation (award/grant numbers)

Data characteristics and description

  • When and how frequently the data were collected
  • Where and with what spatial resolution the data were collected
  • The name(s) of the data file(s) in the data set
  • Example data file records for each data type file
  • Special codes used, including those for missing values
  • Date the data set was last modified
  • English language translation of any data values and descriptors in another language
  • How each parameter was measured or produced (methods), units of measure, format, precision and accuracy, and relationship to other data in the data set

Data acquisition

  • What instruments (including model and serial number, e.g., a rain gauge) and sources (e.g., a meteorological station) were used
  • Standards or calibrations that were used
  • What the environmental conditions were (e.g., cloud cover, atmospheric influences, etc.)
  • The data processing that was performed, including screening
  • The lineage (provenance) of the data
  • Software (including version number) used to prepare the data set
  • Software (including version number) needed to read the data set

Quality assessment

  • The quality assurance and quality control that have been applied
  • Describe the quality level of the data
  • Known problems that limit the data's use (e.g., uncertainty, sampling problems, blanks, QC samples)
  • Summary statistics generated directly from the final data file for use in verifying file transfer and transformations.

Supplemental information

  • Pertinent field notes or other companion files; the names of the files should be similar to the documentation and data file names
  • Related or ancillary data sets
  • References to published papers, including DOI where available, describing the collection and/or analysis of the data
  • How to cite the data set


Metadata describe your data so that others can understand what your data set represents; they are often thought of as "data about the data" or the "who, what, where, when, and why" of the data. Structured metadata, prepared by the ORNL DAAC staff, are used for data set search and discovery.


Write documentation for a user who is unfamiliar with your project, methods, or observations. What does a user, 20 years into the future, need to know to use your data properly?


Perform basic data quality assurance

In addition to scientific quality assurance (QA), we suggest you perform basic data QA on the data files prior to sharing. Tabular data and spatial data present different QA challenges.

Tabular data QA

Your data might be damaged during manipulation, storage, or transfer. QA ensures the data you are archiving are coherent and correct, just the way you created them. QA for tabular data includes reviewing and checking your data files, data values, and data set documentation.


File structure

Check file structure by making sure the data are delimited properly or line up in the proper columns.

File organization

Check file organization and descriptors to ensure that there are no missing values for key parameters (such as sample identifier, station, time, date, geographic coordinates). Sort the records by key data fields to highlight discrepancies.

Documentation

Review the documentation to ensure that descriptions accurately reflect the data file names, format, and content. Check any included example data records to ensure that they are from the latest version of the data file.

Valid values

Check the validity of measured or derived values. Scan parameters for impossible values (e.g., pH of 74 or negative values where negative values are not possible). Review printed copies of the data file(s) and generate time series plots to detect anomalous values or data gaps.

Statistical summaries

Perform statistical summaries (frequency of parameter occurrence) and review results.
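
A sketch of these tabular checks, covering missing key values, impossible values, and summary statistics (assuming the pandas library; file and column names are hypothetical):

    import pandas as pd

    df = pd.read_csv("hogi_met_1996.csv", na_values=[-9999])

    # Missing values in key fields
    print(df[["Station", "Date"]].isna().sum())

    # Impossible values, e.g., negative precipitation
    print(df[df["Precip_mm"] < 0])

    # Statistical summaries to review for anomalies
    print(df.describe(include="all"))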

Location

If location is a parameter (latitude/longitude), then use scatter plots or GIS software to map each location to see if there are any errors in coordinates.

Data transfers

Verify data transfers (from field notebooks, data loggers, or instruments). For data transformations done by hand, consider double data entry (entering data twice, comparing the two data sets, and reconciling any differences). Where possible, compare summary statistics before and after data transformation.

Checksums

Calculate checksums for final data files and include verification files along with the data files when transferred to the data archive. Checksums are values computed from the bytes of a file; they are easily calculated and compared, and if the current value matches a previously computed checksum, your data have likely not changed. Many command line, GUI, and scripting tools are available to calculate checksums from your data.
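
A sketch of writing such a verification file, reusing a helper like the sha256sum function shown earlier (directory and file names are hypothetical):

    import hashlib
    import pathlib

    def sha256sum(path, chunk_size=1 << 20):
        h = hashlib.sha256()
        with open(path, "rb") as f:
            while chunk := f.read(chunk_size):
                h.update(chunk)
        return h.hexdigest()

    # One "<digest>  <name>" line per data file, in the common sha256sum format
    files = sorted(pathlib.Path("final_data").glob("*.csv"))
    manifest = "\n".join(f"{sha256sum(p)}  {p.name}" for p in files)
    pathlib.Path("final_data/manifest.sha256").write_text(manifest + "\n")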

Version control

Track versions of your data files as changes are made, using a number, date, or both to identify versions. Save and store earlier versions in a separate directory. Keep a history of changes to data files/versions and who made the changes. Make sure the data files submitted to the archive are the correct version. If you provide public access to your data files at your host institution after they are archived by the ORNL DAAC, be sure the data at your site are the same as what was archived. Version control tools, like Git and Subversion, are useful, especially if a group manipulates the data.

Spatial data like GIS images and vector files present unique challenges for data QA. You should perform QA specific to your data. The following list is instructive, but not exhaustive.

Spatial data QA

For GIS image and vector files, ensure the projection parameters have been accurately given. Additional information such as data type, scale, corner coordinates, missing data value, size of image, number of bands, and endian type should be checked for accuracy.


File size

Provide checksum files to ensure data integrity during network transfer. Do not perform file compression unless file sizes are very large. Consult with the data archive on acceptable file sizes.

Data format

Check that the data are in the format indicated by the file extension and documentation and are readable in standard GIS/image processing software (e.g., ENVI, ERDAS IMAGINE, or ArcGIS). If files are binary, the header information should be accurate. ASCII file values should be expressed as plain numbers (e.g., 0.000222), not scientific notation. The number of bands in multi-band image data must equal what is specified. Additional documentation may be required for some data formats.

Projection

Ensure any provided projection information is correct, including parameters such as central meridian, datum, standard parallels, radius of earth. If the projection is a non-standard projection, provide a projection file in .prj or Well-Known Text (WKT) format. The projection should render within a GIS software package such as ENVI or ArcGIS.
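
A short sketch of validating a projection file programmatically (assuming the third-party pyproj package; the file name is hypothetical):

    from pyproj import CRS

    # Parse a .prj file's WKT; this raises an error if the WKT is malformed
    wkt = open("example_grid.prj").read()
    crs = CRS.from_wkt(wkt)

    print(crs.to_epsg())  # e.g., 4326 if the CRS matches a known EPSG code
    print(crs.datum)      # compare the datum against the documentation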

Spatial extent

The spatial extent of the data should match the number of pixels and resolution of the data in the units (degrees/meters) and projection of the extents provided. Provide the extents of the actual data instead of the extent of the study area. In some cases the image files might include additional area.

Spatial resolution

Ensure the resolution specified and the resolution units (meters, feet, decimal degrees, etc.) are correct.

Specify how the data are to be read

Are the data to be read from upper left to lower right or lower left to upper right or any other way? Specify the nodata value(s), the scale of the data, any data offsets, the units of the data, and any color table that can be used with the data. Provide the color table in "Value, Red, Green, Blue" format.
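
A sketch of reading a flat binary file once these details are specified (assuming the numpy package; all layout values are hypothetical):

    import numpy as np

    # Assumed layout: 360 rows x 720 columns, 16-bit signed integers,
    # big-endian, read from upper left to lower right in row-major order
    ROWS, COLS = 360, 720
    NODATA, SCALE, OFFSET = -9999, 0.01, 0.0

    raw = np.fromfile("example_sst_200001.bin", dtype=">i2").reshape(ROWS, COLS)
    data = np.where(raw == NODATA, np.nan, raw * SCALE + OFFSET)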

Temporal resolution

Check the time frame specified and the temporal units against the data.

When QA is finished, describe the overall quality level and maturity of the data. Include quality levels in data files as coded values, and include a more complete description in metadata and documentation. Coded values for quality description should identify the level of maturity of your data products as they progress. A typical measurement of data maturity could be:

  • Level 0: raw data streams
  • Level 1: data have undergone automated quality control, data management procedures, and calibration
  • Level 2: data that have been integrated, analyzed, and gridded
  • Level 3: data are derived for other products and used in models and higher level analyses.

NASA's Earth Science Data Systems provide additional guidelines for defining data quality: Data Processing Levels. We have combined Level 1a and Level 1b.



More details of our Data Management Best Practices can be found in Best Practices for Preparing Environmental Data Sets to Share and Archive (PDF), published by the ORNL DAAC in 2010, and in Environmental Data Management Best Practices Part 1: Tabular Data and Part 2: Geospatial Data, from the NASA Earthdata webinar series, presented by ORNL DAAC staff and hosted on YouTube.