NOAA   ERDDAP
Easier access to scientific data

Brought to you by NOAA NMFS SWFSC ERD    
 

Working with the datasets.xml File

[This web page will only be of interest to ERDDAP administrators.]

After you have followed the ERDDAP installation instructions, you must edit the datasets.xml file in [tomcat]/content/erddap/ to describe the datasets that your ERDDAP installation will serve.

 

Introduction

Some Assembly Required - Setting up a dataset in ERDDAP isn't just a matter of pointing to the dataset's directory or URL. You have to write a chunk of XML for datasets.xml which describes the dataset.

  • For gridded datasets, in order to make the dataset conform to ERDDAP's data structure for gridded data, you have to identify a subset of the dataset's variables which share the same dimensions. (Why? How?)
  • The dataset's current metadata is imported automatically. But if you want to modify that metadata or add other metadata, you have to specify it in datasets.xml. ERDDAP also needs other metadata, including global attributes (such as infoUrl, institution, sourceUrl, summary, and title) and variable attributes (such as long_name and units). This additional metadata is a good addition to your dataset and helps ERDDAP do a better job of presenting your data to users who aren't familiar with it.
  • ERDDAP needs you to do special things with the longitude, latitude, altitude, and time variables.
If you buy into these ideas and expend the effort to create the XML for datasets.xml, you get all the advantages of ERDDAP, including:
  • Full text search for datasets
  • Search for datasets by category
  • Data Access Forms so users can request subsets of data in lots of different file formats
  • Forms to request graphs and maps
  • Web Map Service (WMS) for gridded datasets
  • RESTful access to your data
Making the XML for datasets.xml takes considerable effort for the first few datasets, but it gets easier. After the first dataset, you can often re-use a lot of your work for the next dataset. Fortunately, there are two Tools to help you create the XML for each dataset in datasets.xml. And if you get stuck, please send an email with the details to bob dot simons at noaa dot gov.

Tools - There are two command-line programs that help you create the XML for each dataset that you want your ERDDAP to serve. Once you have ERDDAP installed in Tomcat and Tomcat has unpacked the erddap.war file, you can find these programs in the [tomcat]/webapps/erddap/WEB-INF directory. There are Linux/Unix shell scripts (the program name, with no extension) and Windows .bat files for each program. When you run each program, it will ask you questions. For each question, type a response, then press Enter. Or press ^C to exit a program at any time. The tools print various diagnostic messages:

  • The word "error" is used when something went so wrong that the procedure failed to complete. Although it is annoying to get an error, the error forces you to deal with the problem.
  • The word "warning" is used when something went wrong, but the procedure was able to complete. These are pretty rare.
  • Anything else is just an informative message. You can add -verbose to the GenerateDatasetsXml or DasDds command line to get additional informative messages, which sometimes helps solve problems.

The two tools are a big help, but you still must read all of these instructions on this page carefully and make important decisions yourself.

  • GenerateDatasetsXml is a command line program that can generate a rough draft of the dataset XML for almost any type of dataset. When you use the GenerateDatasetsXml program:
    1. GenerateDatasetsXml asks you a series of questions so that it can access the dataset's source.
    2. If you answer the questions correctly, GenerateDatasetsXml will connect to the dataset's source and gather basic information (e.g., variable names).
    3. GenerateDatasetsXml will generate and print a rough draft of the dataset XML for that dataset and put the information on the system clipboard.
    4. You can then paste it into your datasets.xml file and start to edit it.
    5. You can then use DasDds (see below) to repeatedly test the XML for that dataset.
    Often, one of your answers won't be what GenerateDatasetsXml needs. You can then try again, with revised answers to the questions, until GenerateDatasetsXml can successfully connect to the dataset. If you use "GenerateDatasetsXml -verbose", it will print more diagnostic messages than usual.

    DISCLAIMER: The chunk of datasets.xml made by GenerateDatasetsXml isn't perfect. YOU MUST READ AND EDIT THE XML BEFORE USING IT IN A PUBLIC ERDDAP. GenerateDatasetsXml relies on a lot of rules-of-thumb which aren't always correct. YOU ARE RESPONSIBLE FOR ENSURING THE CORRECTNESS OF THE XML THAT YOU ADD TO ERDDAP'S datasets.xml FILE.

    EDDGridFromThreddsCatalog - In general, the options in GenerateDatasetsXml generate a datasets.xml chunk for one dataset from one specific data source. An exception to this is the EDDGridFromThreddsCatalog option. It generates all of the datasets.xml chunks needed for all of the EDDGridFromDap datasets that it can find by crawling recursively through a THREDDS (sub) catalog. There are many forms of THREDDS catalog URLs. This option REQUIRES a THREDDS .xml URL with /catalog/ in it, for example,
    http://oceanwatch.pfeg.noaa.gov/thredds/catalog/catalog.xml or
    http://oceanwatch.pfeg.noaa.gov/thredds/catalog/Satellite/aggregsatMH/chla/catalog.xml
    (note that the comparable .html catalog is at
    http://oceanwatch.pfeg.noaa.gov/thredds/Satellite/aggregsatMH/chla/catalog.html ).
    If you have problems with EDDGridFromThreddsCatalog:

    • Make sure the URL you are using is valid, includes /catalog/, and ends with /catalog.xml .
    • If possible, use a public URL (e.g., with oceanwatch.pfeg.noaa.gov) in the URL, not a private numeric IP address (e.g., with 12.34.56.78). If the THREDDS is only accessible privately, you can use <convertToPublicSourceUrl> so ERDDAP users see the public URL, even though ERDDAP gets data from the private URL.
    • Look in the log file, [bigParentDirectory]/logs/log.txt, for error messages.
    • Send an email to Bob with as much information as possible.
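    For example, if the THREDDS server is only accessible via a private numeric IP address, you can put a <convertToPublicSourceUrl> tag near the top of datasets.xml. A minimal sketch (the addresses here are hypothetical):

    ```xml
    <!-- ERDDAP reads data from the private address, but users see the
         public address in the dataset's sourceUrl metadata. -->
    <convertToPublicSourceUrl from="http://12.34.56.78/" to="http://oceanwatch.pfeg.noaa.gov/" />
    ```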
       
  • DasDds is a command line program that you can use after you have created a first attempt at the XML for a new dataset in datasets.xml. With DasDds, you can repeatedly test and refine the XML. When you use the DasDds program:
    1. DasDds asks you for the datasetID for the dataset you are working on.
    2. DasDds tries to create the dataset with that datasetID.
      • It always prints lots of diagnostic messages.
      • It always deletes the dataset's cached information files (in [bigParentDirectory]/dataset/) for safety before trying to create the dataset. So for aggregated datasets, you might want to adjust the fileNameRegex temporarily to limit the number of files the data constructor finds.
      • If it fails (for whatever reason), it will show you the error message. Read the diagnostic messages and the error message carefully. Then you can make a change to the XML and let DasDds try to create the dataset again.
    3. If DasDds can create the dataset, DasDds will then show you the .das and .dds for the dataset and put the information on the system clipboard. Often, you will want to make some small change to the dataset's XML to clean up the dataset's metadata.
    By going through this cycle repeatedly, you will eventually revise the dataset's XML so that the dataset can be created and so that the dataset's metadata is as you want it to be. If you use "DasDds -verbose", it will print more diagnostic messages than usual.

The basic structure of the datasets.xml file is:

<?xml version="1.0" encoding="ISO-8859-1" ?>
<erddapDatasets>
  <convertToPublicSourceUrl /> <!-- 0 or more -->
  <requestBlacklist>...</requestBlacklist> <!-- 0 or 1 -->
  <subscriptionEmailBlacklist>...</subscriptionEmailBlacklist> <!-- 0 or 1 -->
  <user username="..." password="..." roles="..." /> <!-- 0 or more -->
  <dataset>...</dataset> <!-- 1 or more -->
</erddapDatasets>
It is possible that other encodings will be allowed in the future, but for now, only ISO-8859-1 is recommended.
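As a hedged sketch of some of the optional top-level tags (all of the values here are hypothetical), a blacklist entry and a user definition might look like:

```xml
<!-- Hypothetical values. A trailing * in requestBlacklist matches
     the whole last octet of the IP address. -->
<requestBlacklist>98.76.54.32, 98.76.54.*</requestBlacklist>
<user username="aUser" password="..." roles="role1" />
```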
 

Notes

Working with the datasets.xml file is a non-trivial project. Please read this entire web page carefully, especially these notes.
  • Hint - It is often easier to generate the XML for a dataset by making a copy of a working dataset description in datasets.xml and then modifying it.
     
  • Encoding Special Characters - Since datasets.xml is an XML file, you need to encode "&", "<", and ">" in any content as "&amp;", "&lt;", and "&gt;".
    Wrong: <title>Time & Tides</title>
    Right:   <title>Time &amp; Tides</title>
     
  • XML doesn't tolerate syntax errors. After you edit the datasets.xml file, it is a good idea to verify that the result is well-formed XML by pasting the XML text into an XML checker like RUWF.
     
  • Other Ways To Diagnose Problems With Datasets
    In addition to the two main Tools,
    • log.txt is a log file with all of ERDDAP's diagnostic messages.
    • The Daily Report has more information than the status page, including a list of datasets that didn't load and the exceptions (errors) they generated.
    • The Status Page is a quick way to check ERDDAP's status from any web browser. It includes a list of datasets that didn't load (although not the related exceptions) and taskThread statistics (showing the progress of EDDGridCopy and EDDTableCopy datasets).
    • If you get stuck, please send an email with the details to bob dot simons at noaa dot gov.
       
  • The longitude, latitude, altitude, and time (LLAT) variable names are special.
    • LLAT variables are made known to ERDDAP if the axis variable's (for EDDGrid datasets) or data variable's (for EDDTable datasets) destinationName is "longitude", "latitude", "altitude", or "time".
    • We strongly encourage you to use these standard names for these variables whenever possible. If you don't use these special variable names, ERDDAP won't recognize their significance and, for example, will make a graph instead of a map if the x axis variable is lon and the y axis variable is lat.
    • Use the destinationNames "longitude" and "latitude" only if the units are degrees_east and degrees_north, respectively. If your data doesn't fit these requirements, use different variable names (e.g., lonRadians, latRadians).
    • Use the destinationName "altitude" only if the data is the distance above or below sea level. Use <altitudeMetersPerSourceUnit> to convert the data to meter above sea level (e.g., use -1 for data that was originally depth in meters). If you know the vertical datum, please specify it in the metadata. If your data doesn't fit these requirements, use a different destinationName (e.g., aboveGround, depth, distanceToBottom).
    • Use the destinationName "time" only for variables that include the entire date+time (or date, if that is all there is). If, for example, there are separate columns for date and timeOfDay, don't use the variable name "time". See units for a discussion of time units.
    • ERDDAP will automatically add lots of metadata to LLAT variables (e.g., "ioos_category", "units", and several standards-related attributes like "_CoordinateAxisType").
    • ERDDAP will automatically, on-the-fly, add lots of global metadata related to the LLAT values of the selected data subset (e.g., "geospatial_lon_min").
    • Clients that support these metadata standards will be able to take advantage of the added metadata to position the data in time and space.
    • Clients will find it easier to generate queries that include LLAT variables because the variables' names are the same in all relevant datasets.
    • LLAT variables are treated specially by Make A Graph. For example, if the X Axis variable is "longitude" and the Y Axis variable is "latitude", you will get a map (using a standard projection, and with a land mask, political boundaries, etc.) instead of a graph.
    • If you have longitude and latitude data expressed in different units and thus with different destinationNames, e.g., lonRadians and latRadians, Make A Graph will make graphs (e.g., time series) instead of maps.
    • The time variable and related timeStamp variables are unique in that they always convert data values from the source's time format (whatever it is) into a numeric value (seconds since 1970-01-01T00:00:00Z) or a String value (ISO 8601:2004(E) format), depending on the situation.
    • Note that ERDDAP does NOT follow the CF standard when converting "years since" and "months since" time values to "seconds since". The CF standard defines a year as a fixed, single value: 3.15569259747e7 seconds. And CF defines a month as year/12. Unfortunately, most/all datasets that we have seen that use "years since" or "months since" clearly intend the values to be calendar years or calendar months. For example, "3 months since 1970-01-01" is clearly intended to mean 1970-04-01. So, ERDDAP interprets "years since" and "months since" as calendar years and months, and does not follow the strict CF standard.
    • When a user requests time data, they can request it by specifying the time as a numeric value (seconds since 1970-01-01T00:00:00Z) or a String value (ISO 8601:2004(E) format).
    • See units for more information about time and timeStamp variables.
    • ERDDAP has a utility to Convert a Numeric Time to/from a String Time.
    • See How ERDDAP Deals with Time.
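    For example, here is a sketch of how a tabular dataset's source time column might be mapped to ERDDAP's special time variable (the sourceName and the units pattern here are hypothetical; adjust them to match your source's actual time format):

    ```xml
    <dataVariable>
      <sourceName>TIME</sourceName>
      <destinationName>time</destinationName>
      <addAttributes>
        <!-- tells ERDDAP how to interpret the source's time strings -->
        <att name="units">yyyy-MM-dd HH:mm:ss</att>
      </addAttributes>
    </dataVariable>
    ```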
       
  • Why just two basic data structures?
    • Since it is difficult for human clients and computer clients to deal with a complex set of possible dataset structures, ERDDAP uses just two basic data structures: a grid of data (for gridded datasets) and a table of data (for tabular datasets).
    • Certainly, not all data can be expressed in these structures, but much of it can. Tables, in particular, are very flexible data structures (look at the success of relational database programs).
    • This makes data queries easier to construct.
    • This makes data responses have a simple structure, which makes it easier to serve the data in a wider variety of standard file types (which often just support simple data structures). This is the main reason that we set up ERDDAP this way.
    • This, in turn, makes it very easy for us (or anyone) to write client software which works with all ERDDAP datasets.
    • This makes it easier to compare data from different sources.
    • We are very aware that if you are used to working with data in other data structures you may initially think that this approach is simplistic or insufficient. But all data structures have tradeoffs. None is perfect. Even the do-it-all structures have their downsides: working with them is complex and the files can only be written or read with special software libraries. If you accept ERDDAP's approach enough to try to work with it, you may find that it has its advantages (notably the support for multiple file types that can hold the data responses). The ERDDAP slide show (particularly the data structures slide) talks a lot about these issues.
    • And even if this approach sounds odd to you, most ERDDAP clients will never notice -- they will simply see that all of the datasets have a nice simple structure and they will be thankful that they can get data from a wide variety of sources returned in a wide variety of file formats.
       
  • What if the grid variables in the source dataset DON'T share the same axis variables? In EDDGrid datasets, all data variables MUST use (share) all of the axis variables. So if a source dataset has some variables with one set of dimensions, and other variables with a different set of dimensions, you will have to make two datasets in ERDDAP. For example, you might make one ERDDAP dataset entitled "Some Title (at surface)" to hold variables that just use [time][latitude][longitude] dimensions and make another ERDDAP dataset entitled "Some Title (at depths)" to hold the variables that use [time][altitude][latitude][longitude]. Or perhaps you can change the data source to add a dimension with a single value (for example, altitude=0) to make the variables consistent.
     
  • Projected Gridded Data - Modelers (and others) often work with gridded data on various non-cylindrical projections (e.g., conic, polar stereographic). Some end users want the projected data so there is no loss of information. For those clients, ERDDAP can serve the data as is, if the ERDDAP administrator breaks the original dataset into a few datasets, with each part including variables which share the same axis variables. Yes, that seems odd to the people involved, and it is different from most OPeNDAP servers. But ERDDAP emphasizes making the data available in many formats. That is possible because ERDDAP uses/requires a more uniform data structure. Although it is a little awkward (i.e., different than expected), ERDDAP can distribute the projected data.

    [Yes, ERDDAP could have looser requirements for the data structure, but keep the requirements for the output formats. But that would lead to confusion among many users, particularly newbies, since many seemingly valid requests for data with different structures would be invalid because the data wouldn't fit into the file type. We keep coming back to the current system's design.]

    Some end users want lat lon geographic data (plate carree) for ease-of-use in different situations. For that, we encourage the ERDDAP administrator to re-project the data onto a geographic (plate carree) projection and serve that form of the data as a different dataset. Then both types of users are happy.


 

List of Dataset Types

Datasets fall into two categories. (Why?)


 

Detailed Descriptions of Dataset Types

EDDGridFromDap handles grid variables from DAP servers.

EDDGridFromErddap handles gridded data from a remote ERDDAP server.
EDDTableFromErddap handles tabular data from a remote ERDDAP server.

  • EDDGridFromErddap and EDDTableFromErddap behave differently from all other types of datasets in ERDDAP.
    • Like other types of datasets, these datasets get information about the dataset from the source and keep it in memory.
    • Like other types of datasets, when ERDDAP searches for datasets, displays the Data Access Form, or displays the Make A Graph form, ERDDAP uses the information about the dataset which is in memory.
    • Unlike other types of datasets, when ERDDAP receives a request for data or images from these datasets, ERDDAP redirects the request to the remote ERDDAP server. The result is:
      • This is very efficient (CPU, memory, and bandwidth), because otherwise
        1. The composite ERDDAP has to send the request to the other ERDDAP (which takes time).
        2. The other ERDDAP has to get the data, reformat it, and transmit the data to the composite ERDDAP.
        3. The composite ERDDAP has to receive the data (using bandwidth), reformat it (using CPU and memory), and transmit the data to the user (using bandwidth).
        By redirecting the request and allowing the other ERDDAP to send the response directly to the user, the composite ERDDAP spends essentially no CPU time, memory, or bandwidth on the request.
      • The redirect is transparent to the user regardless of the client software (a browser or any other software or command line tool).
  • Normally, when an EDDGridFromErddap and EDDTableFromErddap are (re)loaded on your ERDDAP, they try to add a subscription to the remote dataset via the remote ERDDAP's email/URL subscription system. That way, whenever the remote dataset changes, the remote ERDDAP contacts the setDatasetFlag URL on your ERDDAP so that the local dataset is reloaded ASAP and so that the local dataset always mimics the remote dataset. So, the first time this happens, you should get an email requesting that you validate the subscription. However, if the local ERDDAP can't send an email or if the remote ERDDAP's email/URL subscription system isn't active, you should email the remote ERDDAP administrator and request that s/he manually add <onChange>...</onChange> tags to all of the relevant datasets to call your dataset's setDatasetFlag URLs. See your ERDDAP daily report for a list of setDatasetFlag URLs, but just send the ones for EDDGridFromErddap and EDDTableFromErddap datasets to the remote ERDDAP administrator.
  • EDDGridFromErddap and EDDTableFromErddap are the basis for clusters and federations of ERDDAPs, which efficiently distribute the CPU usage (mostly for making maps), memory usage, dataset storage, and bandwidth usage of a large data center.
  • EDDGridFromErddap and EDDTableFromErddap can't be used with remote datasets that require logging in (because they use <accessibleTo>).
  • For security reasons, EDDGridFromErddap and EDDTableFromErddap don't support the <accessibleTo> tag. See ERDDAP's security system for restricting access to some datasets to some users.
  • The skeleton XML for an EDDGridFromErddap dataset is very simple, because the intent is just to mimic the remote dataset which is already suitable for use in ERDDAP:
    <dataset type="EDDGridFromErddap" datasetID="..." active="..." >
      <sourceUrl>...</sourceUrl>
      <reloadEveryNMinutes>...</reloadEveryNMinutes>
      <fgdcFile>...</fgdcFile> <!-- 0 or 1 -->
      <iso19115File>...</iso19115File> <!-- 0 or 1 -->
      <onChange>...</onChange> <!-- 0 or more -->
    </dataset>
    
  • The skeleton XML for an EDDTableFromErddap dataset is very simple, because the intent is just to mimic the remote dataset, which is already suitable for use in ERDDAP:
    <dataset type="EDDTableFromErddap" datasetID="..." active="..." >
      <sourceUrl>...</sourceUrl>
      <reloadEveryNMinutes>...</reloadEveryNMinutes>
      <fgdcFile>...</fgdcFile> <!-- 0 or 1 -->
      <iso19115File>...</iso19115File> <!-- 0 or 1 -->
      <onChange>...</onChange> <!-- 0 or more -->
    </dataset>
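    For example, if the subscription system can't be used, the remote ERDDAP administrator would add an <onChange> tag like this to each relevant dataset (the host, datasetID, and flagKey here are hypothetical placeholders; the real setDatasetFlag URLs come from your ERDDAP's daily report):

    ```xml
    <onChange>http://your.server.org/erddap/setDatasetFlag.txt?datasetID=someDatasetID&amp;flagKey=123456789</onChange>
    ```

    Note that the & in the URL must be encoded as &amp; since this is XML content.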
    

EDDGridFromEtopo just serves the ETOPO1 Global 1-Minute Gridded Elevation Data Set (Ice Surface, grid registered, binary, 2-byte int: etopo1_ice_g_i2.zip) which is distributed with ERDDAP.

  • Only two datasetIDs are supported for EDDGridFromEtopo, so that you can access the data with longitude values -180 to 180, or longitude values 0 to 360.
  • There are never any sub tags, since the data is already described within ERDDAP.
  • So the two options for EDDGridFromEtopo datasets are (literally):
      <!-- etopo180 serves the data from longitude -180 to 180 -->
      <dataset type="EDDGridFromEtopo" datasetID="etopo180" /> 
      <!-- etopo360 serves the data from longitude 0 to 360 -->
      <dataset type="EDDGridFromEtopo" datasetID="etopo360" /> 
    

EDDGridFromFiles is the superclass of all EDDGridFrom...Files classes. You can't use EDDGridFromFiles directly. Instead, use a subclass of EDDGridFromFiles (currently, just EDDGridFromNcFiles, below) to handle the specific file type.

Currently, no other file types are supported. But it is usually relatively easy to support other file types. Contact us if you have requests. Or, if your data is in an old file format that you would like to move away from, we recommend converting the files to be NetCDF .nc files. NetCDF is a widely supported format, allows fast random access to the data, and is already supported by ERDDAP.

Details - The following information applies to all of the subclasses of EDDGridFromFiles.

  • Aggregation - This class aggregates data from local files. The resulting dataset appears as if all of the file's data had been combined. The local files all MUST have the same dataVariables (as defined in the dataset's datasets.xml). All of the dataVariables MUST use the same axisVariables/dimensions (as defined in the dataset's datasets.xml). The files will be aggregated based on the first (left-most) dimension, sorted in ascending order. Each file MAY have data for one or more values of the first dimension, but there can't be any overlap between files. If a file has more than one value for the first dimension, the values MUST be sorted in ascending order, with no ties. All files MUST have exactly the same values for all of the other dimensions. All files MUST have exactly the same units metadata for all axisVariables and dataVariables. For example, the dimensions might be [time][altitude][latitude][longitude], and the files might have the data for one time (or more) value(s) per file. The big advantages of aggregation are:
    • The size of the aggregated dataset can be much larger than a single file can conveniently be (~2GB).
    • For near-real-time data, it is easy to add a new file with the latest chunk of data. You don't have to rewrite the entire dataset.
  • Directories - The files MAY be in one directory, or in a directory and its subdirectories (recursively). Note that if there are a large number of files (e.g., >1000), the operating system (and thus EDDGridFromFiles) will operate much more efficiently if you store the files in a series of subdirectories.
  • Cached File Information - When an EDDGridFromFiles dataset is first loaded, EDDGridFromFiles reads information from all of the relevant files and creates tables in memory with information about each valid file and each invalid file (one file per row).
    • The tables are also stored on disk, as .json files in [bigParentDirectory]/dataset in files named:
        [datasetID].dirs.json (which holds a list of unique directory names),
        [datasetID].files.json (which holds the table with each valid file's information),
        [datasetID].bad.json (which holds the table with each bad file's information).
    • The copy of the file information tables on disk is also useful when ERDDAP is shut down and restarted: it saves EDDGridFromFiles from having to re-read all of the data files.
    • You shouldn't ever need to delete or work with these files. But if you ever do need to delete them, you can do so while ERDDAP is running; then use the flag system so ERDDAP rebuilds the cached file information.
    • If you want to encourage ERDDAP to update the stored dataset information (for example, if you just added, removed, or changed some files to the dataset's data directory), use the flag system to force ERDDAP to update the cached file information.
  • Handling Requests - When a client's request for data is processed, EDDGridFromFiles can quickly look in the table with the valid file information to see which files have the requested data.
  • Updating the Cached File Information - Whenever the dataset is reloaded, the cached file information is updated.
    • The dataset is reloaded periodically as determined by the <reloadEveryNMinutes> in the dataset's information in datasets.xml.
    • The dataset is reloaded as soon as possible whenever ERDDAP detects that you have added, removed, touch'd (to change the file's lastModified time), or changed a data file.
    • The dataset is reloaded as soon as possible if you use the flag system.
    When the dataset is reloaded, ERDDAP compares the currently available files to the cached file information tables. New files are read and added to the valid files table. Files that no longer exist are dropped from the valid files table. Files where the file timestamp has changed are read and their information is updated. The new tables replace the old tables in memory and on disk.
  • Bad Files - The table of bad files and the reasons the files were declared bad (corrupted file, missing variables, etc.) is emailed to the emailEverythingTo email address (probably you) every time the dataset is reloaded. You should replace or repair these files as soon as possible.
  • FTP Trouble/Advice - If you FTP new data files to the ERDDAP server while ERDDAP is running, there is the chance that ERDDAP will be reloading the dataset during the FTP process. It happens more often than you might think! If it happens, the file will appear to be valid (it has a valid name), but the file isn't yet valid. If ERDDAP tries to read data from that invalid file, the resulting error will cause the file to be added to the table of invalid files. This is not good. To avoid this problem, use a temporary file name when FTP'ing the file, e.g., ABC2005.nc_TEMP . Then, the fileNameRegex test (see below) will indicate that this is not a relevant file. After the FTP process is complete, rename the file to the correct name. The renaming process will cause the file to become relevant in an instant.
  • The skeleton XML for all EDDGridFromFiles subclasses is:
    <dataset type="EDDGridFrom...Files" datasetID="..." active="..." >
      <accessibleTo>...</accessibleTo> <!-- 0 or 1 -->
      <reloadEveryNMinutes>...</reloadEveryNMinutes>
      <fgdcFile>...</fgdcFile> <!-- 0 or 1 -->
      <iso19115File>...</iso19115File> <!-- 0 or 1 -->
      <onChange>...</onChange> <!-- 0 or more -->
      <altitudeMetersPerSourceUnit>...</altitudeMetersPerSourceUnit> 
      <fileDir>...</fileDir> <!-- The directory (absolute) with the data files. -->
      <recursive>true|false</recursive> <!-- Indicates if subdirectories
        of fileDir have data files, too. -->
      <fileNameRegex>...</fileNameRegex> <!-- A regular expression 
        (tutorial) describing valid data file names, 
        e.g., ".*\.nc" for all .nc files. -->
      <metadataFrom>...</metadataFrom> <!-- The file to get 
        metadata from ("first" or "last" (the default) based on file's 
        lastModifiedTime). -->
      <addAttributes>...</addAttributes>
      <axisVariable>...</axisVariable> <!-- 1 or more -->
      <dataVariable>...</dataVariable> <!-- 1 or more -->
    </dataset>
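    As an example of fileNameRegex and the FTP advice above (the file names here are hypothetical), this regular expression matches finished .nc files but not a file that is still being transferred under a temporary name:

    ```xml
    <!-- Matches ABC2005.nc but not ABC2005.nc_TEMP, so ERDDAP ignores
         files that are still being FTP'd into place. -->
    <fileNameRegex>.*\.nc</fileNameRegex>
    ```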
    

EDDGridFromNcFiles aggregates data from local, gridded GRIB .grb and .grb2 files, HDF4 .hdf files (and HDF5?), and NetCDF .nc files. It may work with other file types (e.g., BUFR); we just haven't tested it -- please send us some sample files.

Note that for GRIB files, ERDDAP will make a .gbx index file the first time it reads each GRIB file. So the GRIB files must be in a directory where the "user" that ran Tomcat has read+write permission.

See this class' superclass, EDDGridFromFiles, for information on how to use this class and how this class works.

EDDGridSideBySide aggregates two or more EDDGrid datasets (the children) side by side.

  • The resulting dataset has all of the variables from all of the child datasets.
  • The parent dataset and all of the child datasets MUST have different datasetIDs. If any names in a family are exactly the same, the dataset will fail to load (with the error message that the values of the aggregated axis are not in sorted order).
  • All children MUST have the same source values for axisVariables[1+] (e.g., latitude, longitude).
  • The children may have different source values for axisVariables[0] (e.g., time), but they are usually largely the same.
  • The parent dataset will appear to have all of the axisVariables[0] source values from all of the children.
  • For example, this lets you combine a source dataset with a vector's u-component and another source dataset with a vector's v-component, so the combined data can be served.
  • Children created by this method are held privately. They are not separately accessible datasets (e.g., by client data requests or by flag files).
  • The only allowed sub tags are <dataset> tags specifying the child datasets.
  • The global metadata and settings for the parent come from the global metadata and settings for the first child.
  • If there is an exception while creating the first child, the parent will not be created.
  • If there is an exception while creating one of the other children, ERDDAP sends an email to emailEverythingTo (as specified in setup.xml) and continues with the other children.
  • The skeleton XML for an EDDGridSideBySide dataset is:
    <dataset type="EDDGridSideBySide" datasetID="..." active="..." >
      <dataset>...</dataset> <!-- 2 or more -->
      <accessibleTo>...</accessibleTo> <!-- 0 or 1 -->
      <fgdcFile>...</fgdcFile> <!-- 0 or 1 -->
      <iso19115File>...</iso19115File> <!-- 0 or 1 -->
      <onChange>...</onChange> <!-- 0 or more -->
    </dataset>
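    For example, a sketch that combines hypothetical u-component and v-component wind datasets (the child dataset definitions are abbreviated):

```xml
<dataset type="EDDGridSideBySide" datasetID="myWindVectors" active="true">
  <!-- child 1: the u-component source dataset (abbreviated) -->
  <dataset type="EDDGridFromDap" datasetID="myWindU">...</dataset>
  <!-- child 2: the v-component source dataset (abbreviated) -->
  <dataset type="EDDGridFromDap" datasetID="myWindV">...</dataset>
</dataset>
```

    Note that the parent and both children have different datasetIDs, as required.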
    

EDDGridAggregateExistingDimension aggregates two or more EDDGrid datasets based on different values of the first dimension.

  • For example, one child dataset might have 366 values (for 2004) for the time dimension and another child might have 365 values (for 2005) for the time dimension.
  • All the values for all of the other dimensions (e.g., latitude, longitude) MUST be identical for all of the children.
  • The parent dataset and the child dataset MUST have different datasetIDs. If any names in a family are exactly the same, the dataset will fail to load (with the error message that the values of the aggregated axis are not in sorted order).
  • Currently, the child dataset MUST be an EDDGridFromDap dataset and MUST have the lowest values of the aggregated dimension (usually the oldest time values). All of the other children MUST be almost identical datasets (differing just in the values for the first dimension) and are specified by just their sourceUrl.
  • The aggregate dataset gets its metadata from the first child.
  • ensureAxisValuesAreEqual - This tag is OPTIONAL. If true (the default), the non-first-axis values MUST be exactly equal in all children. If false, some minor variation is allowed (for example, 0.1 in one child and 0.1000000002 in another). Only use false if you need to and if you are certain that the variation is acceptable to you.
  • The GenerateDatasetsXml program can make a rough draft of the datasets.xml for an EDDGridAggregateExistingDimension based on a set of files served by a Hyrax or THREDDS server. For example, use this input for the program (the "/1988" in the URL makes the example run faster):
      EDDType? EDDGridAggregateExistingDimension
      Server type (hyrax or thredds)? hyrax
      Parent URL (e.g., for hyrax, ending in "contents.html";
        for thredds, ending in "catalog.xml")
      ? http://dods.jpl.nasa.gov/opendap/ocean_wind/ccmp/L3.5a/data/
        flk/1988/contents.html
      File name regex (e.g., ".*\.nc")? month.*flk\.nc\.gz
      ReloadEveryNMinutes (e.g., 10080)? 10080

    You can use the resulting <sourceUrl> tags, or delete them and uncomment the <sourceUrls> tag (so that new files are noticed each time the dataset is reloaded).
  • The skeleton XML for an EDDGridAggregateExistingDimension dataset is:
    <dataset type="EDDGridAggregateExistingDimension" datasetID="..." 
        active="..." >
      <dataset>...</dataset> <!-- This is a regular EDDGridFromDap 
        dataset description child with the lowest values for the aggregated dimensions. -->
      <sourceUrl>...</sourceUrl> <!-- 0 or many; the sourceUrls for 
        other children.  These children must be listed in order of ascending values 
        for the aggregated dimension. -->
      <sourceUrls serverType="..." regex="..." recursive="true" 
        >http://someServer/thredds/someSubdirectory/catalog.xml</sourceUrls> 
        <!-- 0 or 1. This specifies how to find the other children, instead 
        of using separate sourceUrl tags for each child.  The advantage of this
        is: new children will be detected each time the dataset is reloaded. 
        The serverType must be "thredds" or "hyrax".  
        An example of a regular expression (regex) (tutorial) is .*\.nc 
        recursive can be "true" or "false".  
        An example of a thredds catalogUrl is
        http://thredds1.pfeg.noaa.gov/thredds/catalog/Satellite/aggregsatMH/chla/catalog.xml
        An example of a hyrax catalogUrl is
        http://podaac-opendap.jpl.nasa.gov/opendap/allData/ccmp/L3.5a/monthly/flk/1988/contents.html
        When these children are sorted by file name, they must be in order of
        ascending values for the aggregated dimension. -->
      <ensureAxisValuesAreEqual>true(the default) or 
        false</ensureAxisValuesAreEqual> 
      <accessibleTo>...</accessibleTo> <!-- 0 or 1 -->
      <fgdcFile>...</fgdcFile> <!-- 0 or 1 -->
      <iso19115File>...</iso19115File> <!-- 0 or 1 -->
      <onChange>...</onChange> <!-- 0 or more -->
    </dataset>
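    For example, a sketch that aggregates a fully defined 1988 child with later years found via a <sourceUrls> tag (the URLs and regex are hypothetical):

```xml
<dataset type="EDDGridAggregateExistingDimension" datasetID="myWindAll"
    active="true">
  <!-- the fully defined child with the lowest (oldest) time values -->
  <dataset type="EDDGridFromDap" datasetID="myWind1988">
    <sourceUrl>http://someServer/opendap/wind/1988</sourceUrl>
    ...
  </dataset>
  <!-- find the other children each time the dataset is reloaded -->
  <sourceUrls serverType="hyrax" regex="month.*flk\.nc\.gz" recursive="true"
    >http://someServer/opendap/wind/contents.html</sourceUrls>
</dataset>
```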
    

EDDGridCopy makes and maintains a local copy of another EDDGrid's data and serves data from the local copy.

  • EDDGridCopy (and, for tabular data, EDDTableCopy) is an easy-to-use and very effective solution to some of the biggest problems with serving data from remote data sources:
    • Accessing data from a remote data source can be slow.
      • They may be slow because they are inherently slow (e.g., an inefficient type of server),
      • because they are overwhelmed by too many requests,
      • or because your server or the remote server is bandwidth limited.
    • The remote dataset is sometimes unavailable (again, for a variety of reasons).
    • Relying on one source for the data doesn't scale well (e.g., when many users and many ERDDAPs utilize it).
       
  • How It Works - EDDGridCopy solves these problems by automatically making and maintaining a local copy of the data and serving data from the local copy. ERDDAP can serve data from the local copy very, very quickly. And making a local copy relieves the burden on the remote server. And the local copy is a backup of the original, which is useful in case something happens to the original.

    There is nothing new about making a local copy of a dataset. What is new here is that this class makes it *easy* to create and *maintain* a local copy of data from a *variety* of types of remote data sources and *add metadata* while copying the data.

  • Chunks of Data - EDDGridCopy makes the local copy of the data by requesting chunks of data from the remote <dataset>. There will be a chunk for each value of the leftmost axis variable. Note that EDDGridCopy doesn't rely on the remote dataset's index numbers for the axis -- those may change.

    WARNING: If the size of a chunk of data is so big that it causes problems (> 1GB?), EDDGridCopy can't be used. (Sorry, we hope to have a solution for this problem in the future.)

  • Local Files - Each chunk of data is stored in a separate netCDF file in a subdirectory of [bigParentDirectory]/copy/datasetID/ (as specified in setup.xml). File names created from axis values are modified to make them file-name-safe (e.g., hyphens are replaced by "x2D") -- this doesn't affect the actual data.
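    The hyphen substitution can be illustrated with a tiny sketch (this shows only the hyphen replacement described above; ERDDAP's actual encoder also handles other file-name-unsafe characters):

```python
def file_name_safe(axis_value: str) -> str:
    """Sketch: make an axis value safe for use in a file name by
    replacing each hyphen with 'x2D', as described above."""
    return axis_value.replace("-", "x2D")

print(file_name_safe("2005-04-26T00:00:00Z"))  # 2005x2D04x2D26T00:00:00Z
```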
     
  • New Data - Each time EDDGridCopy is reloaded, it checks the remote <dataset> to see what chunks are available. If the file for a chunk of data doesn't already exist, a request to get the chunk is added to a queue. ERDDAP's taskThread processes all the queued requests for chunks of data, one-by-one. You can see statistics for the taskThread's activity on the Status Page and in the Daily Report. (Yes, ERDDAP could assign multiple tasks to this process, but that would use up lots of the remote data source's bandwidth, memory, and CPU time, and lots of the local ERDDAP's bandwidth, memory, and CPU time, neither of which is a good idea.)

    NOTE: The very first time an EDDGridCopy is loaded, (if all goes well) lots of requests for chunks of data will be added to the taskThread's queue, but no local data files will have been created. So the constructor will fail but taskThread will continue to work and create local files. If all goes well, the taskThread will make some local data files and the next attempt to reload the dataset (in ~15 minutes) will succeed, but initially with a very limited amount of data.

    WARNING: If the remote dataset is large and/or the remote server is slow (that's the problem, isn't it?!), it will take a long time to make a complete local copy. In some cases, the time needed will be unacceptable. For example, transmitting 1 TB of data over a T1 line (~0.19 MB/s) takes at least 60 days, under optimal conditions. Plus, it uses lots of bandwidth, memory, and CPU time on the remote and local computers. The solution is to mail a hard drive to the administrator of the remote dataset so that s/he can make a copy of the dataset and mail the hard drive back to you. Use that data as a starting point and EDDGridCopy will add data to it. (That is one way that Amazon's EC2 Cloud Service handles the problem, even though their system has lots of bandwidth.)

    WARNING: If a given value for the leftmost axis variable disappears from the remote dataset, EDDGridCopy does NOT delete the local copied file. If you want to, you can delete it yourself.

  • Recommended use -
    1. Create the <dataset> entry (the native type, not EDDGridCopy) for the remote data source.
      Get it working correctly, including all of the desired metadata.
    2. If it is too slow, add XML code to wrap it in an EDDGridCopy dataset.
      • Use a different datasetID (perhaps by changing the old datasetID slightly).
      • Copy the <accessibleTo>, <reloadEveryNMinutes> and <onChange> from the remote EDDGrid's XML to the EDDGridCopy's XML. (Their values for EDDGridCopy matter; their values for the inner dataset become irrelevant.)
    3. ERDDAP will make and maintain a local copy of the data.
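    Putting the steps above together, a sketch of the resulting XML (the names and URL are hypothetical):

```xml
<dataset type="EDDGridCopy" datasetID="mySstCopy" active="true">
  <accessibleTo>...</accessibleTo>
  <reloadEveryNMinutes>10080</reloadEveryNMinutes>
  <!-- the original remote dataset, now wrapped inside EDDGridCopy -->
  <dataset type="EDDGridFromDap" datasetID="mySst">
    <sourceUrl>http://someServer/opendap/sst</sourceUrl>
    ...
  </dataset>
</dataset>
```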
       
  • WARNING: EDDGridCopy assumes that the data values for each chunk don't ever change. If/when they do, you need to manually delete the chunk files in [bigParentDirectory]/copy/datasetID/ which changed and flag the dataset to be reloaded so that the deleted chunks will be replaced. If you have an email subscription to the dataset, you will get two emails: one when the dataset first reloads and starts to copy the data, and another when the dataset loads again (automatically) and detects the new local data files.
     
  • Change Metadata - If you need to change any addAttributes or change the order of the variables associated with the source dataset:
    1. Change the addAttributes for the source dataset in datasets.xml, as needed.
    2. Delete one of the copied files.
    3. Set a flag to reload the dataset immediately. If you do use a flag and you have an email subscription to the dataset, you will get two emails: one when the dataset first reloads and starts to copy the data, and another when the dataset loads again (automatically) and detects the new local data files.
    4. The deleted file will be regenerated with the new metadata. If the source dataset is ever unavailable, the EDDGridCopy dataset will get metadata from the regenerated file, since it is the youngest file.
       
  • Skeleton XML - The skeleton XML for an EDDGridCopy dataset is:
    <dataset type="EDDGridCopy" datasetID="..." active="..." >
      <accessibleTo>...</accessibleTo> <!-- 0 or 1 -->
      <reloadEveryNMinutes>...</reloadEveryNMinutes>
      <fgdcFile>...</fgdcFile> <!-- 0 or 1 -->
      <iso19115File>...</iso19115File> <!-- 0 or 1 -->
      <onChange>...</onChange> <!-- 0 or more -->
      <sourceNeedsExpandedFP_EQ>true(default)|false</sourceNeedsExpandedFP_EQ>
      <dataset>...</dataset> <!-- 1 -->
    </dataset>
    

EDDTableFromDapSequence handles variables within 1- and 2-level sequences from DAP servers such as DAPPER.

  • You can gather the information you need to create the XML for an EDDTableFromDapSequence dataset by looking at the source dataset's DDS and DAS files in your browser (by adding .das and .dds to the sourceUrl, for example, http://dapper.pmel.noaa.gov/dapper/epic/tao_time_series.cdp.dds).
  • A variable is in a DAP sequence if the .dds response indicates that the data structure holding the variable is a "sequence" (case insensitive).
  • In some cases, you will see a sequence within a sequence, a 2-level sequence -- EDDTableFromDapSequence handles these, too.
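    For example, a 2-level sequence in a .dds response looks roughly like this (the structure and variable names here are hypothetical, in the style of Dapper in-situ datasets; "location" is the outer sequence and "time_series" is the inner sequence):

```
Dataset {
    Sequence {
        Float32 longitude;
        Float32 latitude;
        Sequence {
            Float64 time;
            Float32 temperature;
        } time_series;
    } location;
} tao_time_series.cdp;
```

    With a DDS like this, you would use <outerSequenceName>location</outerSequenceName> and <innerSequenceName>time_series</innerSequenceName>.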
  • The skeleton XML for an EDDTableFromDapSequence dataset is:
    <dataset type="EDDTableFromDapSequence" datasetID="..." active="..." >
      <sourceUrl>...</sourceUrl>
      <accessibleTo>...</accessibleTo> <!-- 0 or 1 -->
      <reloadEveryNMinutes>...</reloadEveryNMinutes>
      <fgdcFile>...</fgdcFile> <!-- 0 or 1 -->
      <iso19115File>...</iso19115File> <!-- 0 or 1 -->
      <onChange>...</onChange> <!-- 0 or more -->
      <altitudeMetersPerSourceUnit>...</altitudeMetersPerSourceUnit>
      <outerSequenceName>...</outerSequenceName>
        <!-- The name of the outer sequence for DAP sequence data. 
        This tag is REQUIRED. -->
      <innerSequenceName>...</innerSequenceName>
        <!-- The name of the inner sequence for DAP sequence data. 
        This tag is OPTIONAL; use it if the DAP data is a two level 
        sequence. -->
      <sourceNeedsExpandedFP_EQ>true(default)|false</sourceNeedsExpandedFP_EQ>
      <sourceCanConstrainStringEQNE>true|false</sourceCanConstrainStringEQNE>
      <sourceCanConstrainStringGTLT>true|false</sourceCanConstrainStringGTLT>
      <sourceCanConstrainStringRegex>...</sourceCanConstrainStringRegex>
      <skipDapperSpacerRows>...</skipDapperSpacerRows>
        <!-- skipDapperSpacerRows specifies whether the dataset 
        will skip the last row of each innerSequence other than the 
        last innerSequence (because Dapper servers put NaNs in the 
        row to act as a spacer).  This tag is OPTIONAL. The default 
        is false.  It is recommended that you set this to true for 
        all Dapper sources and false for all other data sources. -->
      <addAttributes>...</addAttributes>
      <dataVariable>...</dataVariable> <!-- 1 or more -->
    </dataset>
    

EDDTableFromDatabase handles data from one database table or view.

  • If the data you want to serve is in two or more tables (and needs a JOIN to extract data), you need to make a new table (or a view) with the JOINed/flattened information. Contact your database administrator.
  • You must get the appropriate JDBC 3 or JDBC 4 driver .jar file and put it in [tomcat]/webapps/erddap/WEB-INF/lib after you install ERDDAP. For PostgreSQL, we got the JDBC 4 driver from http://jdbc.postgresql.org and we use "org.postgresql.Driver" for the <driverName> in datasets.xml (see below). For SQL Server, you can get the jTDS JDBC driver from http://jtds.sourceforge.net and use "net.sourceforge.jtds.jdbc.Driver" for the <driverName>.
  • You can gather most of the information you need to create the XML for an EDDTableFromDatabase dataset by contacting the database administrator and by searching the web. The <driverName>, driver .jar file, <connectionProperty> names (e.g., "user", "password", and "ssl"), and some of the connectionProperty values can be found by searching the web for "JDBC connection properties databaseType" (e.g., Oracle, MySQL, PostgreSQL).
  • It is difficult to create the correct datasets.xml information needed for ERDDAP to establish a connection to the database. Be patient. Be methodical. Search the web for examples of using JDBC to connect to your type of database. Work closely with the database administrator, who may have relevant experience. If the dataset fails to load, read the error message carefully to find out why.
  • Database Date Time Data - Some database date time columns have no explicit time zone. Such columns are trouble for ERDDAP. Databases support the concept of a date (with or without a time) without a time zone, as an approximate range of time. But Java (and thus ERDDAP) only deals with instantaneous date+times with a time zone. So you may know that the date time data is based on a local time zone (with or without daylight saving) or the GMT/Zulu time zone, but Java (and ERDDAP) doesn't. We originally thought we could work around this problem (e.g., by specifying a time zone for the column), but the database+JDBC+Java interactions made this an unreliable solution.
    • So, ERDDAP requires that you store all date and date time data in the database table with a database data type that corresponds to the JDBC type "timestamp with time zone" (ideally, that uses the GMT/Zulu time zone).
    • In ERDDAP's datasets.xml, in the <dataVariable> tag for this variable, set
        <dataType>double</dataType>
      and in <addAttributes> set
        <att name="units">seconds since 1970-01-01T00:00:00Z</att> .
    • Suggestion: If the data is a time range, it is useful to have the timestamp values refer to the center of the implied time range (e.g., noon). For example, if a user has other data for 2010-03-26T13:00Z and they want the closest database data, then the data for 2010-03-26T12:00Z (representing data for that date) is obviously the best (as opposed to the midnight before or after, where it is less obvious which is best).
    • ERDDAP has a utility to Convert a Numeric Time to/from a String Time.
    • See How ERDDAP Deals with Time.
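    Combining those settings, a sketch of the <dataVariable> for a timestamp-with-time-zone column (the sourceName is hypothetical):

```xml
<dataVariable>
  <sourceName>observation_time</sourceName>  <!-- hypothetical database column -->
  <destinationName>time</destinationName>
  <dataType>double</dataType>
  <addAttributes>
    <att name="units">seconds since 1970-01-01T00:00:00Z</att>
  </addAttributes>
</dataVariable>
```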
  • Security - When working with databases, you need to do things as safely and securely as possible to avoid allowing a malicious user to damage your database or gain access to data they shouldn't have access to. ERDDAP tries to do things in a secure way, too.
    • Consider replicating, on a different computer, the database and database tables with the data that you want ERDDAP to serve. (Yes, for commercial databases like Oracle, this involves additional licensing fees. But for open source databases, like PostgreSQL and MySQL, this costs nothing.) This gives you a high level of security and also prevents ERDDAP requests from slowing down the original database.
    • We encourage you to set up ERDDAP to connect to the database as a database user that only has access to the relevant database(s) and only has READ privileges.
    • We encourage you to set up the connection from ERDDAP to the database so that it
      • always uses SSL,
      • only allows connections from one IP address (or one block of addresses) and from the one ERDDAP user, and
      • only transfers passwords in their MD5 hashed form.
    • [KNOWN PROBLEM] The connectionProperties (including the password!) are stored as plain text in datasets.xml. Only the administrator should have READ and WRITE access to this file! No other users of the computer should have READ or WRITE access to this file! We haven't found a way to allow the administrator to enter the database password during ERDDAP's startup in Tomcat (which occurs without user input), so the password must be accessible in a file.
    • Within ERDDAP, the password and other connection properties are stored in "private" Java variables.
    • Requests from clients are parsed and checked for validity before generating the SQL requests for the database.
    • Requests to the database are made with SQL PreparedStatements, to prevent SQL injection.
    • Requests to the database are submitted with executeQuery (not executeUpdate), which limits requests to being read-only (so attempted SQL injection that tries to alter the database will fail for this reason, too).
  • SQL - It is easy for ERDDAP to convert user requests into simple SQL PreparedStatements. For example, the ERDDAP request
      time,temperature&time>=2008-01-01T00:00:00&time<=2008-02-01T00:00:00

    will be converted into the SQL PreparedStatement
      SELECT time, temperature FROM tableName WHERE time >= 2008-01-01T00:00:00
      AND time <= 2008-02-01T00:00:00

    ERDDAP requests with &distinct() and/or &orderBy(variables) will add DISTINCT and/or ORDER BY variables to the SQL prepared statement. In general, this will greatly slow down the response from the database.
    ERDDAP logs the PreparedStatement in log.txt as
      statement=thePreparedStatement
  • Views - EDDTableFromDatabase is limited to getting data from one table, but that shouldn't be a problem. If a table of interest has foreign keys which link to other tables, we recommend that you ask the database administrator to create a VIEW. Views "can join and simplify multiple tables into a single virtual table" (Wikipedia). Views are great because:
    • They simplify queries (since the queries don't have to specify the JOINs, etc.).
    • They are efficient (since the database just has to set it up once).
    • They increase abstraction (since the database can be changed without having to change how the VIEW appears to the client).
  • Speed - If speed is a problem:
    • Set the Fetch Size - Databases return the data to ERDDAP in chunks. By default, different databases return a different number of rows in the chunks. Often this number is very small and so very inefficient. For example, the default for Oracle is 10! Read the JDBC documentation for your database's JDBC driver to find the connection property to set in order to increase this, and add this to the dataset's description in datasets.xml. For example,
      For MySQL, use
      <connectionProperty name="defaultFetchSize">4096</connectionProperty>
      For Oracle, use
      <connectionProperty name="defaultRowPrefetch">4096</connectionProperty>
      For PostgreSQL, use
      <connectionProperty name="defaultFetchSize">4096</connectionProperty>
      but feel free to change the number. Note that setting the number too big will
      cause ERDDAP to use lots of memory and be more likely to run out of memory.
    • ConnectionProperties - Each database has other connection properties which can be specified in datasets.xml. Many of these will affect the performance of the ERDDAP to database connection. Please read the documentation for your database's JDBC driver to see the options. If you find connection properties that are useful, please send an email with the details to bob dot simons at noaa dot gov.
    • Make a Table - You will probably get faster responses if you periodically (every day? whenever there is new data?) generate an actual table (similarly to how you generated the VIEW) and tell ERDDAP to get data from the table instead of the VIEW. Since any request to the table can then be fulfilled without JOINing another table, the response will be much faster.
    • Optimize/Vacuum the Table -
      MySQL will respond much faster if you use OPTIMIZE TABLE.
      PostgreSQL will respond much faster if you VACUUM the table.
      Oracle doesn't have or need an analogous command.
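      For example (the table name is hypothetical):

```sql
OPTIMIZE TABLE observations;   -- MySQL
VACUUM ANALYZE observations;   -- PostgreSQL (ANALYZE also updates planner statistics)
```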
    • Connection Pooling - ERDDAP currently doesn't use connection pooling. ERDDAP makes a new connection to the database for each SQL query that it sends to the database. This adds about 0.1 seconds per request (sometimes longer, e.g., for remote databases), but is a more robust and safe approach. We may add optional connection pooling in the future.
    • If all else fails, consider storing the data in a collection of .nc files. If they are logically organized (each with data for a chunk of space and time), ERDDAP can extract data from them very quickly.
  • The skeleton XML for an EDDTableFromDatabase dataset is:
    <dataset type="EDDTableFromDatabase" datasetID="..." active="..." >
      <sourceUrl>...</sourceUrl>
        <!-- Put the database name at the end, for example, 
          "jdbc:postgresql://123.45.67.89:5432/databaseName". REQUIRED. -->
      <driverName>...</driverName>
        <!-- The high-level name of the database driver, e.g., 
          "org.postgresql.Driver".  You need to put the actual database 
          driver .jar file (for example, postgresql.jdbc.jar) in 
          [tomcat]/webapps/erddap/WEB-INF/lib.  REQUIRED. -->
      <connectionProperty name="name">value</connectionProperty>
        <!-- The names (e.g., "user", "password", and "ssl") and values 
          of the properties needed for ERDDAP to establish the connection
          to the database.  0 or more. -->
      <catalogName>...</catalogName>
        <!-- The name of the catalog which has the schema which has the 
          table, default = "".  OPTIONAL. -->
      <schemaName>...</schemaName> <!-- The name of the 
        schema which has the table, default = "".  OPTIONAL. -->
      <tableName>...</tableName>  <!-- The name of the 
        table, default = "".  REQUIRED. -->
      <orderBy>...</orderBy>  <!-- A comma-separated list of
        sourceNames to be used in an ORDER BY clause at the end of the 
        every query sent to the database (unless the user's request
        includes an &orderBy() filter, in which case the user's 
        orderBy is used).  The order of the sourceNames is important. 
        The leftmost sourceName is most important; subsequent 
        sourceNames are only used to break ties.  Only relevant 
        sourceNames are included in the ORDER BY clause for a given user 
        request.  If this is not specified, the order of the returned 
        values is not specified. Default = "".  OPTIONAL. -->
      <sourceNeedsExpandedFP_EQ>true(default)|false</sourceNeedsExpandedFP_EQ>
      <accessibleTo>...</accessibleTo> <!-- 0 or 1 -->
      <reloadEveryNMinutes>...</reloadEveryNMinutes>
      <fgdcFile>...</fgdcFile> <!-- 0 or 1 -->
      <iso19115File>...</iso19115File> <!-- 0 or 1 -->
      <onChange>...</onChange> <!-- 0 or more -->
      <altitudeMetersPerSourceUnit>...</altitudeMetersPerSourceUnit>
      <addAttributes>...</addAttributes>
      <dataVariable>...</dataVariable> <!-- 1 or more.
         For date and timestamp database columns, set dataType=double and 
         units=seconds since 1970-01-01T00:00:00Z -->
    </dataset>
    

EDDTableFromFiles is the superclass of all EDDTableFrom...Files classes. You can't use EDDTableFromFiles directly. Instead, use the subclass of EDDTableFromFiles that handles your specific file type.

Currently, no other file types are supported, but it is usually relatively easy to add support for other file types; contact us if you have requests. Or, if your data is in an old file format that you would like to move away from, we recommend converting the files to NetCDF .nc files. NetCDF is a widely supported format, allows fast random access to the data, and is already supported by ERDDAP.

Details - The following information applies to all of the subclasses of EDDTableFromFiles.

  • Aggregation - This class aggregates data from local files. Each file holds a (relatively) small table of data.
    • The resulting dataset appears as if all of the file's tables had been combined (all of the rows of data from file #1, plus all of the rows from file #2, ...).
    • The files don't all have to have all of the specified variables.
    • The variables in all of the files MUST have the same values for the add_offset, missing_value, _FillValue, scale_factor, and units attributes (if any). ERDDAP checks, but it is an imperfect test -- if there are different values, ERDDAP doesn't know which is correct and therefore which files are invalid.
  • Directories - The files can be in one directory, or in a directory and its subdirectories (recursively). Note that if there are a large number of files (e.g., >1000), the operating system (and thus EDDTableFromFiles) will operate much more efficiently if you store the files in a series of subdirectories.
  • Cached File Information - When an EDDTableFromFiles dataset is first loaded, EDDTableFromFiles reads all of the relevant files and creates tables in memory with information about each valid file (one file per row, including the minimum and maximum value of each variable, even String variables) and each invalid file.
    • The tables are also stored on disk, as .json files in [bigParentDirectory]/dataset in files named:
        [datasetID].dirs.json (which holds a list of unique directory names) and
        [datasetID].files.json (which holds the table with each valid file's information),
        [datasetID].bad.json (which holds the table with each bad file's information).
    • The copy of the file information tables on disk is also useful when ERDDAP is shut down and restarted: it saves EDDTableFromFiles from having to re-read all of the data files.
    • You shouldn't ever need to work with these files directly, and you may delete them at any time (ERDDAP will regenerate them). You can use the flag system to force ERDDAP to update the cached file information.
  • Handling Requests - ERDDAP tabular data requests can put constraints on any variable.
    • When a client's request for data is processed, EDDTableFromFiles can quickly look in the table with the valid file information to see which files might have relevant data. For example, if each source file has the data for one fixed-location buoy, EDDTableFromFiles can very efficiently determine which files might have data within a given longitude range and latitude range.
    • Because the valid file information table includes the minimum and maximum value of every variable for every valid file, EDDTableFromFiles can often handle other queries quite efficiently. For example, if some of the buoys don't have an air pressure sensor, and a client requests data for airPressure!=NaN, EDDTableFromFiles can efficiently determine which buoys have air pressure data.
  • Updating the Cached File Information - Whenever the dataset is reloaded, the cached file information is updated.
    • The dataset is reloaded periodically as determined by the <reloadEveryNMinutes> in the dataset's information in datasets.xml.
    • The dataset is reloaded as soon as possible whenever ERDDAP detects that you have added, removed, touch'd (to change the file's lastModified time), or changed a datafile.
    • The dataset is reloaded as soon as possible if you use the flag system.
    When the dataset is reloaded, ERDDAP compares the currently available files to the cached file information table. New files are read and added to the valid files table. Files that no longer exist are dropped from the valid files table. Files where the file timestamp has changed are read and their information is updated. The new tables replace the old tables in memory and on disk.
  • Bad Files - The table of bad files and the reasons the files were declared bad (corrupted file, missing variables, incorrect axis values, etc.) is emailed to the emailEverythingTo email address (probably you) every time the dataset is reloaded. You should replace or repair these files as soon as possible.
  • Near Real Time Data - EDDTableFromFiles treats requests for very recent data as a special case. The problem: If the files making up the dataset are updated frequently, it is likely that the dataset won't be updated every time a file is changed. So EDDTableFromFiles won't be aware of the changed files. (You could use the flag system, but this might lead to ERDDAP reloading the dataset almost continually. So in most cases, we don't recommend it.) Instead, EDDTableFromFiles does two things to deal with this situation:
    1. When the dataset is loaded, if the maximum value for the time variable is in the last 24 hours, ERDDAP sets the maximum time to be NaN (meaning Now).
    2. When ERDDAP gets a request for data within the last 20 hours (e.g., 8 hours ago until Now), ERDDAP will search all files which have any data in the last 20 hours.
    Thus, ERDDAP doesn't need to have perfectly up-to-date data for all of the files in order to find the latest data. You should still set <reloadEveryNMinutes> to a reasonably small value (e.g., 60), but it doesn't have to be tiny (e.g., 3).

    Not recommended organization of near-real-time data in the files: If, for example, you have a dataset that stores data for numerous stations (or buoys, or ...) for many years, you could arrange the files so that, for example, there is one file per station. But then, every time new data for a station arrives, you have to read a large old file and write a large new file. And when ERDDAP reloads the dataset, it notices that some files have been modified, so it reads those files completely. That is inefficient.

    Recommended organization of near-real-time data in the files: We recommend that you store the data in chunks, e.g., all data for one station for one year (or one month). Then, when a new datum arrives, you only have to read and rewrite the file with this year's (or month's) data. All of the files for previous years (or months) for that station remain unchanged. And when ERDDAP reloads the dataset, most files are unchanged; only a few, small files have changed and need to be read.

  • FTP Trouble/Advice - If you FTP new data files to the ERDDAP server while ERDDAP is running, there is the chance that ERDDAP will be reloading the dataset during the FTP process. It happens more often than you might think! If it happens, the file will appear to be valid (it has a valid name), but the file isn't valid. If ERDDAP tries to read data from that invalid file, the resulting error will cause the file to be added to the table of invalid files. This is not good. To avoid this problem, use a temporary file name when FTP'ing the file, e.g., ABC2005.nc_TEMP . Then, the fileNameRegex test (see below) will indicate that this is not a relevant file. After the FTP process is complete, rename the file to the correct name. The renaming process will cause the file to become relevant in an instant.
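    Why the temporary name works can be shown with a quick regular-expression check (Python's re.fullmatch stands in here for ERDDAP's whole-name fileNameRegex test; the file names are illustrative):

```python
import re

file_name_regex = r".*\.nc"  # the dataset's fileNameRegex

def is_relevant(file_name):
    """True if the whole file name matches fileNameRegex,
    i.e., ERDDAP would treat this file as part of the dataset."""
    return re.fullmatch(file_name_regex, file_name) is not None

print(is_relevant("ABC2005.nc_TEMP"))  # False: ignored while the FTP is in progress
print(is_relevant("ABC2005.nc"))       # True: relevant as soon as it is renamed
```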
  • File Name Extracts - EDDTableFromFiles has a system for extracting a String from each file name and using that to make a pseudo data variable. Currently, there is no system to interpret these Strings as dates/times. There are several XML tags to set up this system. If you don't need part or all of this system, just don't specify these tags or use "" values.
    • preExtractRegex is a regular expression (tutorial) used to identify text to be removed from the start of the file name. The removal only occurs if the regex is matched. This usually begins with "^" to match the beginning of the file name.
    • postExtractRegex is a regular expression used to identify text to be removed from the end of the file name. The removal only occurs if the regex is matched. This usually ends with "$" to match the end of the file name.
    • extractRegex is a regular expression used (after preExtractRegex and postExtractRegex have been applied) to identify a string to be extracted from the file name (e.g., the stationID). If the regex isn't matched, the entire file name is used (minus preExtract and postExtract). Use ".*" to match the entire file name that is left after preExtractRegex and postExtractRegex.
    • columnNameForExtract is the data column name for the extracted Strings. A dataVariable with this sourceName must be in the dataVariables list (with any data type, but usually String).
    For example, if a dataset has files with names like XYZAble.nc, XYZBaker.nc, XYZCharlie.nc, ..., and you want to create a new variable (stationID) when each file is read which will have station ID values (Able, Baker, Charlie, ...) extracted from the file names, you could use these tags:
    • <preExtractRegex>^XYZ</preExtractRegex>
      The initial ^ is a regular expression special character which forces ERDDAP to look for XYZ at the beginning of the file name. This causes XYZ, if found at the beginning of the file name, to be removed (e.g., the file name XYZAble.nc becomes Able.nc).
    • <postExtractRegex>\x2Enc$</postExtractRegex>
      The $ at the end is a regular expression special character which forces ERDDAP to look for .nc at the end of the file name. Since . is a regular expression special character (which matches any character), it is encoded as \x2E here (because 2E is the hexadecimal character number for a period). This causes .nc, if found at the end of the file name, to be removed (e.g., the partial file name Able.nc becomes Able).
    • <extractRegex>.*</extractRegex>
      The .* regular expression matches all remaining characters (e.g., the partial file name Able becomes the extract for the first file).
    • <columnNameForExtract>stationID</columnNameForExtract>
      This tells ERDDAP to create a new column called stationID when reading each file. Every row of data for a given file will have the text extracted from its file name (e.g., Able) as the value in the stationID column.
    In most cases, there are numerous values for these extract tags that will yield the same results -- regular expressions are very flexible. But in a few cases, there is just one way to get the desired results.
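    The four extract tags in the example above behave roughly like this sketch (a simplified re-implementation for illustration, not ERDDAP's actual code):

```python
import re

def extract_from_file_name(file_name,
                           pre_extract_regex=r"^XYZ",
                           post_extract_regex=r"\x2Enc$",
                           extract_regex=r".*"):
    """Mimic preExtractRegex, postExtractRegex, and extractRegex."""
    # Remove the matched text from the start, then from the end.
    name = re.sub(pre_extract_regex, "", file_name, count=1)
    name = re.sub(post_extract_regex, "", name, count=1)
    # Extract the matched string; if the regex doesn't match,
    # the whole remaining name is used.
    m = re.search(extract_regex, name)
    return m.group(0) if m else name

# The value that would land in the stationID column:
print(extract_from_file_name("XYZAble.nc"))  # Able
```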
  • global: sourceNames - Global metadata in each file can be converted to be data. If the sourceName of a variable starts with global: (e.g., global:PI), when ERDDAP is reading the data from a file, ERDDAP will look for a global attribute of that name (e.g., PI) and create a column filled with the attribute's value.
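    The effect of a global: sourceName can be sketched in a few lines (the attribute name PI and the data rows are hypothetical):

```python
def add_global_column(global_attributes, rows, attribute_name):
    """Fill a new column with one global attribute's value,
    as a sourceName of "global:PI" would for attribute "PI"."""
    value = global_attributes[attribute_name]
    for row in rows:
        row[attribute_name] = value
    return rows

rows = [{"time": 1}, {"time": 2}]
attrs = {"PI": "Dr. Smith"}  # a global attribute read from the file
add_global_column(attrs, rows, "PI")
print(rows[0]["PI"])  # Dr. Smith
```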
  • The skeleton XML for all EDDTableFromFiles subclasses is:
    <dataset type="EDDTableFrom...Files" datasetID="..." active="..." >
      <nDimensions>...</nDimensions>  <!-- This was used prior to ERDDAP version 1.30, 
        but is now ignored. -->
      <accessibleTo>...</accessibleTo> <!-- 0 or 1 -->
      <reloadEveryNMinutes>...</reloadEveryNMinutes>
      <fgdcFile>...</fgdcFile> <!-- 0 or 1 -->
      <iso19115File>...</iso19115File> <!-- 0 or 1 -->
      <onChange>...</onChange> <!-- 0 or more -->
      <altitudeMetersPerSourceUnit>...</altitudeMetersPerSourceUnit>
      <specialMode>mode</specialMode>  <!-- This rarely-used, optional tag can be used 
        with EDDTableFromThreddsFiles to specify that special, hard-coded rules 
        should be used to determine which files should be downloaded from the server.
        Currently, the only valid mode is SAMOS which is used with datasets from
        http://coaps.fsu.edu/thredds/catalog/samos to download only the files with 
        the last version number. -->
      <sourceUrl>...</sourceUrl>  <!-- For subclasses like EDDTableFromHyraxFiles and 
        EDDTableFromThreddsFiles, this is where you specify the base URL for the files 
        on the remote server.  For subclasses that get data from local files, ERDDAP 
        doesn't use this information to get the data, but does display the information 
        to users. So I usually use "(local files)". -->
      <fileDir>...</fileDir> <!-- The directory (absolute) with the data files. -->
      <recursive>true|false</recursive> <!-- Indicates if subdirectories
        of fileDir have data files, too. -->
      <fileNameRegex>...</fileNameRegex> <!-- A regular expression 
        (tutorial) describing valid data file names, e.g., ".*\.nc" for 
        all .nc files. -->
      <metadataFrom>...</metadataFrom> <!-- The file to get metadata
        from ("first" or "last" (the default) based on the file's 
        lastModifiedTime). -->
      <columnNamesRow>...</columnNamesRow> <!-- (For 
        EDDTableFromAsciiFiles only) This specifies the number of the row
        with the column names in the files. (The first row is "1". 
        Default = 1.)  If you specify 0, ERDDAP will not look for column names
        and will assign names: Column#1, Column#2, ... -->
      <firstDataRow>...</firstDataRow> <!-- (For 
        EDDTableFromAsciiFiles only) This specifies the number of the first
        row with data in the files. (The first row is "1". Default = 2.) -->
      <!-- For the next four tags, see File Name Extracts. -->
      <preExtractRegex>...</preExtractRegex>
      <postExtractRegex>...</postExtractRegex>
      <extractRegex>...</extractRegex>
      <columnNameForExtract>...</columnNameForExtract> 
      <sortedColumnSourceName>...</sortedColumnSourceName> 
        <!-- The sourceName of the numeric column that the data files are 
        usually already sorted by within each file, e.g., "time".
        Use null or "" if no variable is suitable.
        It is ok if not all files are sorted by this column.
        If present, this can greatly speed up some data requests. 
        For EDDTableFromHyraxFiles, EDDTableFromNcFiles and 
        EDDTableFromThreddsFiles, this must be the leftmost axis 
        variable. -->
      <sortFilesBySourceNames>...</sortFilesBySourceNames>
        <!-- This is a space-separated list of source variable names 
        which specifies how the internal list of files should be sorted
        (in ascending order), for example "id time". 
        It is the minimum value of the specified columns in each file
        that is used for sorting.
        When a data request is filled, data is obtained from the files
        in this order. Thus it determines the overall order of the data
        in the response.  If you specify more than one column name, the
        second name is used if there is a tie for the first column; the
        third is used if there is a tie for the first and second columns; ...
        This is OPTIONAL (the default is fileDir+fileName order). -->
      <isLocal>false</isLocal> <!-- (may be true or false, 
        the default). This is only used by EDDTableFromNcCFFiles. It 
        indicates if the files are local (actual files) or remote 
        (accessed via the web). The two types are treated slightly 
        differently. -->
      <sourceNeedsExpandedFP_EQ>true(default)|false</sourceNeedsExpandedFP_EQ>
      <addAttributes>...</addAttributes>
      <dataVariable>...</dataVariable> <!-- 1 or more -->
        <!-- For EDDTableFromHyraxFiles, EDDTableFromNcFiles, and 
        EDDTableFromThreddsFiles, the axis variables (e.g., time) needn't
        be first or in any specific order. -->
    </dataset>
    

EDDTableFromAsciiFiles aggregates data from comma-, tab-, or space-separated tabular ASCII data files.

  • Normally, the files will have column names on the first row and data starting on the second row. But you can use <columnNamesRow> and <firstDataRow> in your datasets.xml file to specify different row numbers.
  • Note that ASCII files are not a very efficient way to store/retrieve data. For greater efficiency, save the files as .nc files (with one dimension, "row", shared by all variables) instead.
  • See this class' superclass, EDDTableFromFiles, for information on how to use this class and how this class works.
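  A minimal sketch of how the 1-based <columnNamesRow> and <firstDataRow> values select rows from an ASCII file (a toy parser for illustration, not ERDDAP's actual reader):

```python
def parse_ascii(lines, column_names_row=1, first_data_row=2, sep=","):
    """Split a tabular ASCII file into column names and data rows,
    using 1-based row numbers as the XML tags do."""
    names = lines[column_names_row - 1].split(sep)
    data = [line.split(sep) for line in lines[first_data_row - 1:]]
    return names, data

lines = ["time,temp", "2012-06-01,12.3", "2012-06-02,12.9"]
names, data = parse_ascii(lines)
print(names)    # ['time', 'temp']
print(data[0])  # ['2012-06-01', '12.3']
```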

EDDTableFromAwsXmlFiles aggregates data from a set of Automatic Weather Station (AWS) XML data files. Some background information is at WeatherBug_Rest_XML_API.

  • This type of file is a simple but inefficient way to store the data, because each file usually seems to contain the observation from just one time point. So there may be a large number of files. If you want to improve performance, consider consolidating groups of observations (a week's worth?) in .nc files and using EDDTableFromNcFiles to serve the data.
  • See this class' superclass, EDDTableFromFiles, for information on how to use this class and how this class works.

EDDTableFromHyraxFiles aggregates data files with several variables, each with one or more shared dimensions (e.g., time, altitude, latitude, longitude), and served by a Hyrax OPeNDAP server.

  • In most cases, each file has multiple values for the leftmost dimension, e.g. time.
  • The files often (but don't have to) have a single value for the other dimensions (e.g., altitude, latitude, longitude).
  • The files may have character variables with an additional dimension (e.g., nCharacters).
  • Hyrax servers can be identified by the "/dods-bin/nph-dods/" or "/opendap/" in the URL.
  • This class screen-scrapes the Hyrax web pages with the lists of files in each directory. Because of this, it is very specific to the current format of Hyrax web pages. We will try to adjust ERDDAP quickly if/when future versions of Hyrax change how the files are listed.
  • The <fileDir> setting is ignored. Since this class downloads and makes a local copy of each remote data file, ERDDAP forces the fileDir to be [bigParentDirectory]/copy/datasetID/.
  • For <sourceUrl>, use the URL of the base directory of the dataset in the Hyrax server, for example,
    <sourceUrl>http://edac-dap.northerngulfinstitute.org/dods-bin/nph-dods/WCOS/nmsp/wcos/</sourceUrl>
    (although that server is no longer available).
    The sourceUrl web page usually has "OPeNDAP Server Index of [directoryName]" at the top.
  • Since this class always downloads and makes a local copy of each remote data file, you should never wrap this dataset in EDDTableCopy.
  • See this class' superclass, EDDTableFromFiles, for information on how to use this class and how this class works.
  • See the 1D, 2D, 3D, and 4D examples for EDDTableFromNcFiles.

EDDTableFromNcFiles aggregates data from .nc files with several variables, each with one shared dimension (e.g., time) or more than one shared dimension (e.g., time, altitude, latitude, longitude). The files must have the same dimension names. A given file may have multiple values for each of the dimensions and the values may be different in different files. The files may have character variables with an additional dimension (e.g., nCharacters). See this class' superclass, EDDTableFromFiles, for information on how to use this class and how this class works.

  • 1D Example: 1D files are somewhat different from 2D, 3D, 4D, ... files.
    • You might have a set of .nc data files where each file has one month's worth of data from one drifting buoy.
    • Each file will have 1 dimension, e.g., time (size = [many]).
    • Each file will have one or more 1D variables which use that dimension, e.g., time, longitude, latitude, air temperature, ....
    • Each file may have 2D character variables, e.g., with dimensions (time,nCharacters).
  • 2D Example:
    • You might have a set of .nc data files where each file has one month's worth of data from one drifting buoy.
    • Each file will have 2 dimensions, e.g., time (size = [many]) and id (size = 1).
    • Each file will have 2 1D variables with the same names as the dimensions and using the same-name dimension, e.g., time(time), id(id). These 1D variables should be included in the list of <dataVariable>'s in the dataset's XML.
    • Each file will have one or more 2D variables, e.g., longitude, latitude, air temperature, water temperature, ...
    • Each file may have 3D character variables, e.g., with dimensions (time,id,nCharacters).
  • 3D Example:
    • You might have a set of .nc data files where each file has one month's worth of data from one stationary buoy.
    • Each file will have 3 dimensions, e.g., time (size = [many]), lat (size = 1), and lon (size = 1).
    • Each file will have 3 1D variables with the same names as the dimensions and using the same-name dimension, e.g., time(time), lat(lat), lon(lon). These 1D variables should be included in the list of <dataVariable>'s in the dataset's XML.
    • Each file will have one or more 3D variables, e.g., air temperature, water temperature, ...
    • Each file may have 4D character variables, e.g., with dimensions (time,lat,lon,nCharacters).
    • The buoy's name might be included in the file's name.
  • 4D Example:
    • You might have a set of .nc data files where each file has one month's worth of data from one station. At each time point, the station takes readings at a series of depths.
    • Each file will have 4 dimensions, e.g., time (size = [many]), depth (size = [many]), lat (size = 1), and lon (size = 1).
    • Each file will have 4 1D variables with the same names as the dimensions and using the same-name dimension, e.g., time(time), depth(depth), lat(lat), lon(lon). These 1D variables should be included in the list of <dataVariable>'s in the dataset's XML.
    • Each file will have one or more 4D variables, e.g., air temperature, water temperature, ...
    • Each file may have 5D character variables, e.g., with dimensions (time,depth,lat,lon,nCharacters).
    • The station's name might be included in the file's name.

EDDTableFromNcCFFiles aggregates data from .nc files which use one of the file formats specified by the CF Discrete Sampling Geometries conventions. See this class' superclass, EDDTableFromFiles, for information on how to use this class and how this class works.

The CF DSG conventions define dozens of file formats and include numerous minor variations. This class deals with all of the variations we are aware of, but we may have missed one (or more). So if this class can't read data from your CF DSG files, please email bob.simons at noaa.gov and include a sample file.

EDDTableFromThreddsFiles aggregates data files with several variables, each with one or more shared dimensions (e.g., time, altitude, latitude, longitude), and served by a THREDDS OPeNDAP server.

  • In most cases, each file has multiple values for the leftmost dimension, e.g. time.
  • The files often (but don't have to) have a single value for the other dimensions (e.g., altitude, latitude, longitude).
  • The files may have character variables with an additional dimension (e.g., nCharacters).
  • THREDDS servers can be identified by the "/thredds/" in the URLs. For example,
    http://data.nodc.noaa.gov/thredds/catalog/nmsp/wcos/catalog.html
  • This class reads the catalog.xml files served by THREDDS with the lists of <catalogRefs> (references to additional catalog.xml sub-files) and <dataset>s (data files).
  • The <fileDir> setting is ignored. Since this class downloads and makes a local copy of each remote data file, ERDDAP forces the fileDir to be [bigParentDirectory]/copy/datasetID/.
  • For <sourceUrl>, use the URL of the catalog.xml file for the dataset in the THREDDS server, for example: for this URL which may be used in a web browser,
    http://data.nodc.noaa.gov/thredds/catalog/nmsp/wcos/catalog.html ,
    use <sourceUrl>http://data.nodc.noaa.gov/thredds/catalog/nmsp/wcos/catalog.xml</sourceUrl> .
  • Since this class always downloads and makes a local copy of each remote data file, you should never wrap this dataset in EDDTableCopy.
  • This dataset type supports an optional, rarely-used, special tag, <specialMode>mode</specialMode> which can be used to specify that special, hard-coded rules should be used to determine which files should be downloaded from the server. Currently, the only valid mode is SAMOS which is used with datasets from http://coaps.fsu.edu/thredds/catalog/samos to download only the files with the last version number.
  • See this class' superclass, EDDTableFromFiles, for information on how to use this class and how this class works.
  • See the 1D, 2D, 3D, and 4D examples for EDDTableFromNcFiles.

EDDTableFromNOS handles data from a NOAA NOS source, which uses SOAP+XML for requests and responses. It is very specific to NOAA NOS's XML. See the sample EDDTableFromNOS dataset in datasets2.xml.

EDDTableFromOBIS handles data from an Ocean Biogeographic Information System (OBIS) server.

  • OBIS servers expect an XML request and return an XML response.
  • Because all OBIS servers serve the same variables the same way (see the OBIS schema), you don't have to specify much to set up an OBIS dataset in ERDDAP.
  • You MUST include a "creator_email" attribute in the global addAttributes, since that information is used within the license. A suitable email address can be found by reading the XML response from the sourceURL.
  • You may or may not be able to get the global attribute <subsetVariables> to work with a given OBIS server. If you try, just try one variable (e.g., ScientificName or Genus).
  • The skeleton XML for an EDDTableFromOBIS dataset is:
    <dataset type="EDDTableFromOBIS" datasetID="..." active="..." >
      <sourceUrl>...</sourceUrl>
      <sourceCode>...</sourceCode>
        <!-- If you read the XML response from the sourceUrl, the 
        source code (e.g., GHMP) is the value from one of the 
        <resource><code> tags. -->
      <accessibleTo>...</accessibleTo> <!-- 0 or 1 -->
      <reloadEveryNMinutes>...</reloadEveryNMinutes>
      <fgdcFile>...</fgdcFile> <!-- 0 or 1 -->
      <iso19115File>...</iso19115File> <!-- 0 or 1 -->
      <onChange>...</onChange> <!-- 0 or more -->
      <!-- All ...SourceMinimum and Maximum tags are OPTIONAL -->
      <longitudeSourceMinimum>...</longitudeSourceMinimum> 
      <longitudeSourceMaximum>...</longitudeSourceMaximum> 
      <latitudeSourceMinimum>...</latitudeSourceMinimum> 
      <latitudeSourceMaximum>...</latitudeSourceMaximum> 
      <altitudeSourceMinimum>...</altitudeSourceMinimum> 
      <altitudeSourceMaximum>...</altitudeSourceMaximum> 
      <!-- For timeSource... tags, use yyyy-MM-dd'T'HH:mm:ssZ format. -->
      <timeSourceMinimum>...</timeSourceMinimum> 
      <timeSourceMaximum>...</timeSourceMaximum> 
      <sourceNeedsExpandedFP_EQ>true(default)|false</sourceNeedsExpandedFP_EQ>
      <addAttributes>...</addAttributes> 
    </dataset>
    

EDDTableFromSOS handles data from a Sensor Observation Service (SWE/SOS) server.

  • This dataset type aggregates data from a group of stations which are all served by one SOS server.
  • The stations all serve the same set of variables (although the source for each station doesn't have to serve all variables).
  • SOS servers expect an XML request and return an XML response.
  • It is not easy to generate the dataset XML for SOS datasets. To find the needed information, you must visit sourceUrl+"?service=SOS&request=GetCapabilities" in a browser; look at the XML; make a GetObservation request by hand; and look at the XML response to the request.
  • SOS overview:
    • SWE (Sensor Web Enablement) and SOS (Sensor Observation Service) are OpenGIS® standards. That web site has the standards documents.
    • The OGC Web Services Common Specification ver 1.1.0 (OGC 06-121r3) covers construction of GET and POST queries (see section 7.2.3 and section 9).
    • If you send a getCapabilities xml request to a SOS server (sourceUrl + "?service=SOS&request=GetCapabilities"), you get an xml result with a list of stations and the observedProperties that they have data for.
    • An observedProperty is a formal URI reference to a property. For example, urn:ogc:phenomenon:longitude:wgs84 or http://marinemetadata.org/cf#sea_water_temperature
    • An observedProperty isn't a variable.
    • More than one variable may have the same observedProperty (for example, insideTemp and outsideTemp might both have observedProperty http://marinemetadata.org/cf#air_temperature).
    • If you send a getObservation xml request to a SOS server, you get an xml result with descriptions of field names in the response, field units, and the data. The field names will include longitude, latitude, depth(perhaps), and time.
    • Each dataVariable for an EDDTableFromSOS must include an "observedProperty" attribute, which identifies the observedProperty that must be requested from the server to get that variable. Often, several dataVariables will list the same composite observedProperty.
    • The dataType for each dataVariable may not be specified by the server. If so, you must look at the XML data responses from the server and assign appropriate <dataType>s in the ERDDAP dataset dataVariable definitions.
    • (At the time of writing this) some SOS servers respond to getObservation requests for more than one observedProperty by just returning results for the first of the observedProperties. (No error message!) See the constructor parameter requestObservedPropertiesSeparately.
  • EDDTableFromSOS automatically adds
    <att name="subsetVariables">station_id, longitude, latitude</att>
    to the dataset's global attributes when the dataset is created.
  • SOS servers usually express units with the UCUM system. Most ERDDAP servers express units with the UDUNITS system. If you need to convert between the two systems, you can use ERDDAP's web service to convert UCUM units to/from UDUNITS.
  • The skeleton XML for an EDDTableFromSOS dataset is:
    <dataset type="EDDTableFromSOS" datasetID="..." active="..." >
      <sourceUrl>...</sourceUrl>
      <accessibleTo>...</accessibleTo> <!-- 0 or 1 -->
      <reloadEveryNMinutes>...</reloadEveryNMinutes>
      <fgdcFile>...</fgdcFile> <!-- 0 or 1 -->
      <iso19115File>...</iso19115File> <!-- 0 or 1 -->
      <onChange>...</onChange> <!-- 0 or more -->
      <stationIdSourceName>...</stationIdSourceName> <!-- 0 or 1. 
        Default="station_id". -->
      <longitudeSourceName>...</longitudeSourceName>
      <latitudeSourceName>...</latitudeSourceName>
      <altitudeSourceName>...</altitudeSourceName>
      <altitudeSourceMinimum>...</altitudeSourceMinimum> <!-- 0 or 1 -->
      <altitudeSourceMaximum>...</altitudeSourceMaximum> <!-- 0 or 1 -->
      <altitudeMetersPerSourceUnit>...</altitudeMetersPerSourceUnit> 
      <timeSourceName>...</timeSourceName>
      <timeSourceFormat>...</timeSourceFormat>
        <!-- timeSourceFormat MUST be either
        * For numeric data: a 
          UDUnits-compatible
          string (with the format 
          "units since baseTime") describing how to interpret
          source time values (e.g., "seconds since 1970-01-01T00:00:00Z"),
          where the base time is an ISO 8601:2004(E) formatted date time string 
          (yyyy-MM-dd'T'HH:mm:ssZ).
        * For String data: an org.joda.time.format.DateTimeFormat 
          string (which is mostly compatible with java.text.SimpleDateFormat)
          describing how to interpret string times  (e.g., the 
          ISO8601TZ_FORMAT "yyyy-MM-dd'T'HH:mm:ssZ").  See Joda DateTimeFormat -->
      <observationOfferingIdRegex>...</observationOfferingIdRegex>
        <!-- Only observationOfferings with IDs (usually the station names) 
        which match this regular expression (tutorial) will be included 
        in the dataset (".+" will catch all station names). -->
      <requestObservedPropertiesSeparately>true|false(default)
        </requestObservedPropertiesSeparately>
      <sourceNeedsExpandedFP_EQ>true(default)|false</sourceNeedsExpandedFP_EQ>
      <addAttributes>...</addAttributes>
      <dataVariable>...</dataVariable> <!-- 1 or more. 
        * Each dataVariable MUST include the dataType tag.
        * Each dataVariable MUST include the observedProperty attribute. 
        * For IOOS SOS servers, *every* variable returned in the text/csv
          response MUST be included in this ERDDAP dataset definition. -->
    </dataset>
    

EDDTableCopy makes and maintains a local copy of another EDDTable's data and serves data from the local copy.

  • EDDTableCopy (and, for grid data, EDDGridCopy) is a very easy-to-use and very effective solution to some of the biggest problems with serving data from remote data sources:
    • Accessing data from a remote data source can be slow.
      • They may be slow because they are inherently slow (e.g., an inefficient type of server),
      • because they are overwhelmed by too many requests,
      • or because your server or the remote server is bandwidth limited.
    • The remote dataset is sometimes unavailable (again, for a variety of reasons).
    • Relying on one source for the data doesn't scale well (e.g., when many users and many ERDDAPs utilize it).
       
  • How It Works - EDDTableCopy solves these problems by automatically making and maintaining a local copy of the data and serving data from the local copy. ERDDAP can serve data from the local copy very, very quickly. And making and using a local copy relieves the burden on the remote server. And the local copy is a backup of the original, which is useful in case something happens to the original.

    There is nothing new about making a local copy of a dataset. What is new here is that this class makes it *easy* to create and *maintain* a local copy of data from a *variety* of types of remote data sources and *add metadata* while copying the data.

  • <extractDestinationNames> - EDDTableCopy makes the local copy of the data by requesting chunks of data from the remote dataset. EDDTableCopy determines which chunks to request by requesting the &distinct() values for the <extractDestinationNames> (specified in the datasets.xml, see below), which are the space-separated destination names of variables in the remote dataset. For example,
    <extractDestinationNames>drifter profile</extractDestinationNames>
    might yield distinct values combinations of drifter=tig17,profile=1017, drifter=tig17,profile=1095, ... drifter=une12,profile=1223, drifter=une12,profile=1251, ....

    Even when one column (e.g., profile) is all that is required to uniquely identify a group of rows of data, if there are a very large number of profiles it may be useful to also specify an additional extractDestinationName (e.g., drifter) which serves to subdivide the profiles. That leads to fewer data files in a given directory, which may lead to faster access.
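    The chunking driven by <extractDestinationNames> amounts to a distinct-values query. A sketch, using the drifter/profile names from the example above (the row data is made up):

```python
def distinct_chunks(rows, extract_names):
    """Return the sorted, distinct value combinations that EDDTableCopy
    would request one-by-one from the remote dataset."""
    return sorted({tuple(row[name] for name in extract_names) for row in rows})

rows = [
    {"drifter": "tig17", "profile": 1017, "temp": 12.1},
    {"drifter": "tig17", "profile": 1017, "temp": 12.4},
    {"drifter": "tig17", "profile": 1095, "temp": 11.9},
    {"drifter": "une12", "profile": 1223, "temp": 13.0},
]
print(distinct_chunks(rows, ["drifter", "profile"]))
# [('tig17', 1017), ('tig17', 1095), ('une12', 1223)]
```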

  • Local Files - Each chunk of data is stored in a separate netCDF file in a subdirectory of [bigParentDirectory]/copy/datasetID/ (as specified in setup.xml). There is one subdirectory level for all but the last extractDestinationName. For example, data for tig17+1017, would be stored in
    [bigParentDirectory]/copy/sampleDataset/tig17/1017.nc .
    For example, data for une12+1251, would be stored in
    [bigParentDirectory]/copy/sampleDataset/une12/1251.nc .
    Directory and file names created from data values are modified to make them file-name-safe (e.g., spaces are replaced by "x20") -- this doesn't affect the actual data.
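    The file-name-safe conversion can be approximated like this (an illustrative encoder; the exact set of characters ERDDAP keeps unchanged is an assumption here -- the text above only guarantees that, e.g., a space becomes "x20"):

```python
def file_name_safe(value):
    """Replace risky characters with 'x' + 2-digit hex, e.g., ' ' -> 'x20'."""
    safe = []
    for ch in value:
        if ch.isalnum() or ch in "_-":
            safe.append(ch)
        else:
            safe.append("x%02x" % ord(ch))
    return "".join(safe)

print(file_name_safe("tig 17"))  # tigx2017
```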
     
  • New Data - Each time EDDTableCopy is reloaded, it checks the remote dataset to see what distinct chunks are available. If the file for a chunk of data doesn't already exist, a request to get the chunk is added to a queue. ERDDAP's taskThread processes all the queued requests for chunks of data, one-by-one. You can see statistics for the taskThread's activity on the Status Page and in the Daily Report. (Yes, ERDDAP could assign multiple tasks to this process, but that would use up lots of the remote data source's bandwidth, memory, and CPU time, and lots of the local ERDDAP's bandwidth, memory, and CPU time, neither of which is a good idea.)

    NOTE: The very first time an EDDTableCopy is loaded, (if all goes well) lots of requests for chunks of data will be added to the taskThread's queue, but no local data files will have been created. So the constructor will fail but taskThread will continue to work and create local files. If all goes well, the taskThread will make some local data files and the next attempt to reload the dataset (in ~15 minutes) will succeed, but initially with a very limited amount of data.

    WARNING: If the remote dataset is large and/or the remote server is slow (that's the problem, isn't it?!), it will take a long time to make a complete local copy. In some cases, the time needed will be unacceptable. For example, transmitting 1 TB of data over a T1 line (~0.19 MB/s) takes at least 60 days, under optimal conditions. Plus, it uses lots of bandwidth, memory, and CPU time on the remote and local computers. One solution is to mail a hard drive to the administrator of the remote dataset so that s/he can make a copy of the dataset and mail the hard drive back to you. Use that data as a starting point and EDDTableCopy will add data to it. (That is how Amazon's EC2 Cloud Service handles the problem, even though their system has lots of bandwidth.)

    WARNING: If a given combination of values disappears from remote dataset, EDDTableCopy does NOT delete the local copied file. If you want to, you can delete it yourself.

  • Recommended Use -
    1. Create the <dataset> entry (the native type, not EDDTableCopy) for the remote data source. Get it working correctly, including all of the desired metadata.
    2. If it is too slow, add XML code to wrap it in an EDDTableCopy dataset.
      • Use a different datasetID (perhaps by changing the old datasetID slightly).
      • Copy the <accessibleTo>, <reloadEveryNMinutes> and <onChange> from the remote EDDTable's XML to the EDDTableCopy's XML. (Their values for EDDTableCopy matter; their values for the inner dataset become irrelevant.)
      • Create the <extractDestinationNames> tag (see above).
      • <orderExtractBy> is an OPTIONAL space-separated list of destination variable names in the remote dataset. When each chunk of data is downloaded from the remote server, the chunk will be sorted by these variables (by the first variable, then by the second variable if the first variable is tied, ...). In some cases, ERDDAP will be able to extract data faster from the local data files if the first variable in the list is a numeric variable ("time" counts as a numeric variable). But choose these variables in a way that is appropriate for the dataset.
    3. ERDDAP will make and maintain a local copy of the data.
       
  • WARNING: EDDTableCopy assumes that the data values for each chunk don't ever change. If/when they do, you need to manually delete the chunk files in [bigParentDirectory]/copy/datasetID/ which changed and flag the dataset to be reloaded so that the deleted chunks will be replaced. If you have an email subscription to the dataset, you will get two emails: one when the dataset first reloads and starts to copy the data, and another when the dataset loads again (automatically) and detects the new local data files.
     
  • Change Metadata - If you need to change any addAttributes or change the order of the variables associated with the source dataset:
    1. Change the addAttributes for the source dataset in datasets.xml, as needed.
    2. Delete one of the copied files.
    3. Set a flag to reload the dataset immediately. If you do use a flag and you have an email subscription to the dataset, you will get two emails: one when the dataset first reloads and starts to copy the data, and another when the dataset loads again (automatically) and detects the new local data files.
    4. The deleted file will be regenerated with the new metadata. If the source dataset is ever unavailable, the EDDTableCopy dataset will get metadata from the regenerated file, since it is the youngest file.
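    On a Linux server, steps 2 and 3 above might look like the following sketch. The directory names, datasetID, and file name here are hypothetical; substitute your actual [bigParentDirectory] and datasetID (the mkdir line just stands in for directories that ERDDAP normally creates):

    ```shell
    BPD=/tmp/erddapData          # hypothetical stand-in for [bigParentDirectory]
    DATASETID=myTableCopy        # hypothetical datasetID
    mkdir -p "$BPD/copy/$DATASETID" "$BPD/flag"   # normally created by ERDDAP
    # 2. Delete one of the copied files:
    rm -f "$BPD/copy/$DATASETID/someChunk.nc"
    # 3. Set a flag file so ERDDAP reloads the dataset as soon as possible:
    touch "$BPD/flag/$DATASETID"
    ```

    The flag file's name must exactly match the datasetID; ERDDAP notices and deletes the flag file when it reloads the dataset.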
       
  • Note that EDDGridCopy is very similar to EDDTableCopy, but works with gridded datasets.
     
  • Skeleton XML - The skeleton XML for an EDDTableCopy dataset is:
    <dataset type="EDDTableCopy" datasetID="..." active="..." >
      <accessibleTo>...</accessibleTo> <!-- 0 or 1 -->
      <reloadEveryNMinutes>...</reloadEveryNMinutes>
      <fgdcFile>...</fgdcFile> <!-- 0 or 1 -->
      <iso19115File>...</iso19115File> <!-- 0 or 1 -->
      <onChange>...</onChange> <!-- 0 or more -->
      <extractDestinationNames>...</extractDestinationNames>  <!-- 1 -->
      <orderExtractBy>...</orderExtractBy> <!-- 0 or 1 -->
      <dataset>...</dataset> <!-- 1 -->
    </dataset>
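    As a concrete sketch, a filled-in EDDTableCopy entry wrapping a remote EDDTableFromDapSequence dataset might look like this (the datasetID, sourceUrl, and variable name below are made up for illustration; the inner dataset is your working native-type entry from step 1):

    ```xml
    <dataset type="EDDTableCopy" datasetID="myTableCopy">
      <reloadEveryNMinutes>1440</reloadEveryNMinutes>
      <extractDestinationNames>stationID</extractDestinationNames>
      <orderExtractBy>time</orderExtractBy>
      <!-- the inner dataset describes the remote data source -->
      <dataset type="EDDTableFromDapSequence" datasetID="myTableCopy_child">
        <sourceUrl>http://example.com/dapper/someDataset.cdp</sourceUrl>
        ...
      </dataset>
    </dataset>
    ```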
    

 

Details

Here are detailed descriptions of common tags and attributes.
  • <convertToPublicSourceUrl /> is an OPTIONAL tag within an <erddapDatasets> tag whose "from" and "to" attributes specify how to convert a matching local sourceUrl into a public sourceUrl. "from" must have the form "[something]//[something]/". There can be 0 or more of these tags. For more information see <sourceUrl>. For example,
    <convertToPublicSourceUrl from="http://192.168.31.18/" to="http://oceanwatch.pfeg.noaa.gov/" />
    will convert a matching local sourceUrl (such as http://192.168.31.18/thredds/dodsC/satellite/BA/ssta/5day)
    into a public sourceUrl (http://oceanwatch.pfeg.noaa.gov/thredds/dodsC/satellite/BA/ssta/5day).
     
  • <requestBlacklist> is an optional tag within an <erddapDatasets> tag which contains a comma-separated list of numeric IP addresses which will be immediately blacklisted.
    • You can also replace the last number in an IP address with * to block 0-255 (e.g., 123.45.67.*).
    • Any request from one of these addresses will receive an HTTP Error 403: Forbidden.
    • This can be used to fend off a Denial of Service attack or an overly zealous web robot.
    • For example,
      <requestBlacklist>98.76.54.321, 123.45.68.*</requestBlacklist>
    • See your ERDDAP daily report for a list/tally of the most active allowed and blocked requesters.
    • You can try to convert the IP numbers to domain names with free reverse-DNS web services like http://www.hcidata.info/host2ip.htm.
       
  • <subscriptionEmailBlacklist> is an optional tag within an <erddapDatasets> tag which contains a comma-separated list of email addresses which are immediately blacklisted from the subscription system, for example
    <subscriptionEmailBlacklist>bob@badguy.com, john@badguy.com</subscriptionEmailBlacklist>
    If an email address on the list has subscriptions, the subscriptions will be cancelled. If an email address on the list tries to subscribe, the request will be refused.
     
  • <user> is an OPTIONAL tag within an <erddapDatasets> tag that identifies a user's username, password, and roles (a comma-separated list).
    • This is part of ERDDAP's security system for restricting access to some datasets to some users.
    • Make a separate <user> tag for each user.
    • If setup.xml's <authentication> is openid, use the user's OpenID URL as the username and don't specify a password. For example, if the user's OpenID URL is "http://jsmith.myopenid.com/", the <user> tag is
      <user username="http://jsmith.myopenid.com/" roles="role1, role2"/>
      OpenID is great because it frees you, the ERDDAP admin, from managing and dealing with passwords. A user's OpenID URL is public information, so there is no need to keep it secret.
    • The comma-separated list of roles specifies which roles you are assigning to the user. The "admin" role has a special meaning -- it identifies the ERDDAP administrator. Users with the admin role have special privileges. But admins don't automatically get access to any of the private datasets.
    • The user will then have access to datasets that list one of these roles in the dataset's <accessibleTo> tag.
    • Thus, this is role-based access control.
    • If there is no <user> tag for a client, s/he will only be able to access public datasets, i.e., datasets which don't have an <accessibleTo> tag.
    • If setup.xml's <authentication> is custom, you need to specify the username and the password (at least 7 characters long) attributes in the <user> tag.
      • The passwords that users enter are case sensitive.
      • setup.xml's <passwordEncoding> determines how passwords are stored in the <user> tags in datasets.xml. In order of increasing security, the options are:
        • plaintext (NOT RECOMMENDED!)
        • MD5 - the stored form of the password is made from MD5(password)
        • UEPMD5 - (the default) the stored form of the password is made from MD5(UserName:ERDDAP:password) The UserName and "ERDDAP" are used to salt the hash value, making it more difficult to decode.
        (See <passwordEncoding> in setup.xml.)
      • On Windows, you can generate MD5 password digest values by downloading an MD5 program (such as MD5) and using (for example): md5 -djsmith:ERDDAP:actualPassword
      • On Linux/Unix, you can generate MD5 digest values by using the built-in md5sum program (for example):
        echo -n "jsmith:ERDDAP:actualPassword" | md5sum
      • Stored plaintext passwords are case sensitive. The stored forms of MD5 and UEPMD5 passwords are not case sensitive.
      • For example (using UEPMD5), if username="jsmith" and password="myPassword", the <user> tag is:
        <user username="jsmith"
        password="57AB7ACCEB545E0BEB46C4C75CEC3C30"
        roles="role1, role2" />

        where the stored password was generated with
        md5 -djsmith:ERDDAP:myPassword
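       Equivalently, a short Python sketch can generate the UEPMD5 stored form (this is not part of ERDDAP; it just applies the MD5(UserName:ERDDAP:password) recipe described above, using the example username and password):

       ```python
       import hashlib

       def uepmd5(username: str, password: str) -> str:
           """Return the UEPMD5 stored form: MD5(username:ERDDAP:password) as upper-case hex."""
           salted = f"{username}:ERDDAP:{password}".encode("ascii")
           return hashlib.md5(salted).hexdigest().upper()

       # Print the value to paste into the <user> tag's password attribute:
       print(uepmd5("jsmith", "myPassword"))
       ```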

       
  • <dataset>

    Two attributes can appear within a <dataset> tag:

    • datasetID="..." is a REQUIRED attribute within a <dataset> tag which assigns a short (usually <15 characters), unique, identifying name to a dataset.
      • Valid characters are A-Z, a-z, 0-9, _, and -, but we recommend starting with a letter and then just using A-Z, a-z, 0-9, and _.
      • DatasetID's are case sensitive, but DON'T create two datasetID's that only differ in upper/lower case letters. It will cause problems on Windows computers (yours and/or a user's computer).
      • Best practices: We recommend using camelCase.
      • Best practices: We recommend that the first part be an acronym or abbreviation of the source institution's name and the second part be an acronym or abbreviation of the dataset's name. When possible, we create a name which reflects the source's name for the dataset. For example, we used datasetID="erdPHssta8day" for a dataset from the NOAA NMFS SWFSC Environmental Research Division which is designated by the source to be satellite/PH/ssta/8day.
         
    • active="..." is an OPTIONAL attribute within the <dataset> tag which indicates if a dataset is active (eligible for use in ERDDAP) or not.
      • Valid values are true (the default) and false.
      • Since the default is true, you don't need to use this attribute except to use active="false" to force a dataset's removal as soon as possible (if it is currently loaded in ERDDAP) and to tell ERDDAP not to try to load it in the future.
         

    Several tags can appear between the <dataset> and </dataset> tags:

    • <accessibleTo> is an OPTIONAL tag within a <dataset> tag that specifies a space-separated list of roles which are allowed to have access to this dataset.
      • This is part of ERDDAP's security system for restricting access to some datasets to some users.
      • If this tag is not present, all users (even if they haven't logged in) will have access to this dataset.
      • If this tag is present, this dataset will only be visible and accessible to logged-in users who have one of the specified roles. This dataset won't be visible to users who aren't logged in.
         
    • <altitudeMetersPerSourceUnit> is an OPTIONAL tag within a <dataset> tag that specifies a number which is multiplied by the source altitude or depth values (after scale_factor and add_offset have been applied) to convert them into altitude values (in meters above sea level).
      • For example, if the source is already measured in meters above sea level, use 1.
      • For example, if the source is measured in meters below sea level, use -1.
      • For example, if the source is measured in km above sea level, use 0.001.
      • This tag is OPTIONAL, but recommended. The default value is 1.
      • An example is:
        <altitudeMetersPerSourceUnit>-1</altitudeMetersPerSourceUnit>
         
    • <fgdcFile> is an OPTIONAL tag within a <dataset> tag that tells ERDDAP to use a pre-made FGDC file that is somewhere on the server's file system instead of having ERDDAP try to generate the file. Usage:
      <fgdcFile>fullFileName</fgdcFile>

      If fullFileName="" or the file isn't found, the dataset will have no FGDC metadata. So this is also useful if you want to suppress the FGDC metadata for a specific dataset.
      Or, you can put <fgdcActive>false</fgdcActive> in setup.xml to tell ERDDAP not to offer FGDC metadata for any dataset.
       
    • <iso19115File> is an OPTIONAL tag within a <dataset> tag that tells ERDDAP to use a pre-made ISO 19115 file that is somewhere on the server's file system instead of having ERDDAP try to generate the file. Usage:
      <iso19115File>fullFileName</iso19115File>

      If fullFileName="" or the file isn't found, the dataset will have no ISO 19115 metadata. So this is also useful if you want to suppress the ISO 19115 metadata for a specific dataset.
      Or, you can put <iso19115Active>false</iso19115Active> in setup.xml to tell ERDDAP not to offer ISO 19115 metadata for any dataset.
       
    • <onChange> is an OPTIONAL tag within a <dataset> tag that specifies an action which will be done when this dataset is created (when ERDDAP is restarted) and whenever this dataset changes in any way.
      • Currently, for EDDGrid subclasses, any change to metadata or to an axis variable (e.g., a new time point for near-real-time data) is considered a change, but a reloading of the dataset is not considered a change (by itself).
      • Currently, for EDDTable subclasses, any reloading of the dataset is considered a change.
      • Currently, only two types of actions are allowed:
        • http:// - If the action starts with "http://", ERDDAP will send an HTTP GET request to the specified URL. The response will be ignored. For example, the URL might tell some other web service to do something.
          • If the URL has a query part (after the "?"), it MUST be already percent encoded. In practice, this can be very minimal percent encoding: all you have to do is convert special characters in the right-hand-side values of any constraints: % into %25, & into %26, " into %22, = into %3D, + into %2B, and space into %20 (or +) and convert all characters above #126 to their %HH form (where HH is the 2-digit hex value). Unicode characters above #255 must be UTF-8 encoded and then each byte must be converted to %HH form (ask a programmer for help).
          • Since datasets.xml is an XML file, you also need to encode '&', '<', and '>' in the URL as '&amp;', '&lt;', and '&gt;'.
          • Example: For a URL that you might type into a browser as: http://www.company.com/webService?department=R%26D&param2=value2 You should specify an <onChange> tag via (on one line)
            <onChange>http://www.company.com/webService?department=R%26D&amp;param2=value2</onChange>
        • mailto: - If the action starts with "mailto:", ERDDAP will send an email to the subsequent email address indicating that the dataset has been updated/changed.
          Example: <onChange>mailto:john.smith@company.com</onChange>
        If you have a good reason for ERDDAP to support some other type of action, send us an email describing what you want.
      • This tag is OPTIONAL. There can be as many of these tags as you want.
      • This is analogous to ERDDAP's email/URL subscription system, but these actions aren't stored persistently (i.e., they are only stored in an EDD object).
      • To remove a subscription, just remove the <onChange> tag. The change will be noted the next time the dataset is reloaded.
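      The two-stage encoding described above (percent encode the query values, then XML-encode the result) can be sketched in Python, using the hypothetical example URL from above:

      ```python
      from urllib.parse import quote
      from xml.sax.saxutils import escape

      # Percent encode the right-hand-side value of the constraint:
      department = quote("R&D", safe="")  # "&" becomes "%26"

      url = f"http://www.company.com/webService?department={department}&param2=value2"

      # Then XML-encode the whole URL for use inside datasets.xml ("&" becomes "&amp;"):
      onchange_value = escape(url)
      print(onchange_value)
      # prints http://www.company.com/webService?department=R%26D&amp;param2=value2
      ```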
         
    • <reloadEveryNMinutes> is an OPTIONAL tag within a <dataset> tag that specifies how often the dataset should be reloaded.
      • Generally, datasets that are updated frequently should be reloaded frequently, for example, every 60 minutes.
      • Datasets that are updated infrequently should be reloaded infrequently, for example, every 1440 minutes (daily) or 10080 minutes (weekly).
      • This tag is OPTIONAL, but recommended. The default is 10080.
      • An example is: <reloadEveryNMinutes>1440</reloadEveryNMinutes>
      • Note that when a dataset is reloaded, all files in the bigParentDirectory/cache/datasetID directory are deleted.
      • No matter what this is set to, a dataset won't be loaded more frequently than <loadDatasetsMinMinutes> (default = 15), as specified in setup.xml. So if you want datasets to be reloaded very frequently, you need to set both reloadEveryNMinutes and loadDatasetsMinMinutes to small values.
      • Don't set reloadEveryNMinutes to the same value as loadDatasetsMinMinutes, because the elapsed time is likely to be (for example) 14:58 or 15:02, so about half the time the dataset won't be reloaded. Instead, use a smaller reloadEveryNMinutes value (e.g., 10).
      • Regardless of reloadEveryNMinutes, you can manually tell ERDDAP to reload a specific dataset as soon as possible via a flag file.
         
    • <sourceCanConstrainStringEQNE> is an OPTIONAL tag within an EDDTable <dataset> tag that specifies if the source can constrain String variables with the = and != operators.
      • For EDDTableFromDapSequence, this applies to the outer sequence String variables only. It is assumed that the source can't handle any constraints on inner sequence variables.
      • This tag is OPTIONAL. Valid values are true (the default) and false.
      • For EDDTableFromDapSequence OPeNDAP DRDS servers, this should be set to true (the default).
      • For EDDTableFromDapSequence Dapper servers, this should be set to false.
      • An example is:
        <sourceCanConstrainStringEQNE>true</sourceCanConstrainStringEQNE>
         
    • <sourceCanConstrainStringGTLT> is an OPTIONAL tag within an EDDTable <dataset> tag that specifies if the source can constrain String variables with the <, <=, >, and >= operators.
      • For EDDTableFromDapSequence, this applies to the outer sequence String variables only. It is assumed that the source can't handle any constraints on inner sequence variables.
      • Valid values are true (the default) and false.
      • This tag is OPTIONAL. The default is true.
      • For EDDTableFromDapSequence OPeNDAP DRDS servers, this should be set to true (the default).
      • For EDDTableFromDapSequence Dapper servers, this should be set to false.
      • An example is:
        <sourceCanConstrainStringGTLT>true</sourceCanConstrainStringGTLT>
         
    • <sourceCanConstrainStringRegex> is an OPTIONAL tag within an EDDTable <dataset> tag that specifies if the source can constrain String variables by regular expressions, and if so, what the operator is.
      • Valid values are "=~" (the DAP standard), "~=" (mistakenly supported by many DAP servers), or "" (indicating that the source doesn't support regular expressions).
      • This tag is OPTIONAL. The default is "".
      • For EDDTableFromDapSequence OPeNDAP DRDS servers, this should be set to "" (the default).
      • For EDDTableFromDapSequence Dapper servers, this should be set to "" (the default).
      • An example is:
        <sourceCanConstrainStringRegex>=~</sourceCanConstrainStringRegex>
         
    • <sourceNeedsExpandedFP_EQ> is an OPTIONAL tag within an EDDTable <dataset> tag that specifies if the source needs help with queries with <numericVariable>=<floatingPointValue> (and !=, >=, <=).
      • For some data sources, numeric queries involving =, !=, <=, or >= may not work as desired with floating point numbers. For example, a search for longitude=220.2 may fail if the value is stored as 220.20000000000001.
      • This problem arises because floating point numbers are not represented exactly within computers.
      • If sourceNeedsExpandedFP_EQ is set to true (the default), ERDDAP modifies the queries sent to the data source to avoid this problem. It is always safe and fine to leave this set to true.
         
    • <sourceUrl> is a REQUIRED tag within a <dataset> tag that specifies the url source of the data.
      • For most dataset types, this is REQUIRED; for others, it isn't used. See each dataset type's description for details.
      • For most datasets, this is the base of the url that is used to request data.
      • For example, for DAP servers, this is the url to which .dods, .das, .dds, or .html could be added.
      • If the URL has a query part (after the "?"), it MUST be already percent encoded. In practice, this can be very minimal percent encoding: all you have to do is convert special characters in the right-hand-side values of any constraints: % into %25, & into %26, " into %22, = into %3D, + into %2B, and space into %20 (or +) and convert all characters above #126 to their %HH form (where HH is the 2-digit hex value). Unicode characters above #255 must be UTF-8 encoded and then each byte must be converted to %HH form (ask a programmer for help).
      • Since datasets.xml is an XML file, you MUST also encode '&', '<', and '>' in the URL as '&amp;', '&lt;', and '&gt;'.
      • An example is:
        <sourceUrl>http://dapper.pmel.noaa.gov/dapper/epic/tao_time_series.cdp</sourceUrl>
      • For most dataset types, ERDDAP adds the original sourceUrl (the "localSourceUrl" in the source code) to the global attributes (where it becomes the "publicSourceUrl" in the source code). When the data source is local files, ERDDAP adds sourceUrl="(local files)" to the global attributes as a security precaution. When the data source is a database, ERDDAP adds sourceUrl="(source database)" to the global attributes as a security precaution. If some of your datasets use non-public sourceUrl's (usually because their computer is in your DMZ or on a local LAN) you can use <convertToPublicSourceUrl> tags to specify how to convert the local sourceUrls to public sourceUrls.
         
    • <addAttributes> is an OPTIONAL tag for each dataset and for each variable which lets ERDDAP administrators control the metadata attributes associated with a dataset and its variables.
      • ERDDAP combines the attributes from the dataset's source ("sourceAttributes") and the "addAttributes" which you define in datasets.xml (which have priority) to make the "combinedAttributes", which are what ERDDAP users see. Thus, you can use addAttributes to redefine the values of sourceAttributes, add new attributes, or remove attributes.
      • The <addAttributes> tag encloses 0 or more <att> subtags, which are used to specify individual attributes.
      • Each attribute consists of a name and a value (which has a specific data type, e.g., double).
      • There can be only one attribute with a given name. If there are more, the last one has priority.
      • The value can be a single value or a space-separated list of values.
      • Syntax
        • The order of the <att> subtags within addAttributes is not important.
        • The <att> subtag format is
          <att name="name" [type="type"] >value</att>
        • If an <att> subtag has no value or a value of null, that attribute will be removed from the combined attributes.
          For example, <att name="rows" /> will remove rows from the combined attributes.
          For example, <att name="coordinates">null</att> will remove coordinates from the combined attributes.
        • The OPTIONAL type value for <att> subtags indicates the data type for the values. The default type is string.
          • Valid types for single values are byte, unsignedShort, short, int, long, float, double, and string.
          • Valid types for space-separated lists of values (or single values) are byteList, unsignedShortList, shortList, intList, longList, floatList, doubleList.
          • There is no stringList. Store the String values in a newline-separated String.
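        For example, a set of <att> subtags using several of these types might look like this (the attribute names and values here are hypothetical):

        ```xml
        <addAttributes>
          <att name="instrument">CTD</att>                          <!-- default type: string -->
          <att name="actual_range" type="doubleList">0.0 32.0</att> <!-- space-separated list -->
          <att name="numberOfCasts" type="int">1247</att>
          <att name="history">null</att>                            <!-- removes the attribute -->
        </addAttributes>
        ```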
             
    • Global Attributes / Global <addAttributes> -
      <addAttributes> is an OPTIONAL tag within the <dataset> tag which is used to change attributes that apply to the entire dataset.
      • Use the global <addAttributes> to change the dataset's global attributes. ERDDAP combines the global attributes from the dataset's source (sourceAttributes) and the global addAttributes which you define in datasets.xml (which have priority) to make the global combinedAttributes, which are what ERDDAP users see. Thus, you can use addAttributes to redefine the values of sourceAttributes, add new attributes, or remove attributes.
      • See the <addAttributes> information which applies to global and variable <addAttributes>.
      • FGDC and ISO 19115-2/19139 Metadata - Normally, ERDDAP will automatically generate ISO 19115 and FGDC XML metadata files for each dataset using information from the dataset's metadata. So, good dataset metadata leads to good ERDDAP-generated ISO 19115 and FGDC metadata. Please consider putting lots of time and effort into improving your datasets' metadata (which is a good thing to do anyway). Most of the dataset metadata attributes which are used to generate the ISO 19115 and FGDC metadata are from the ACDD metadata standard and are so noted below.
      • Many global attributes are special in that ERDDAP looks for them and uses them in various ways. For example, a link to the infoUrl is included on web pages with lists of datasets, and other places, so that users can find out more about the dataset.
      • When a user selects a subset of data, globalAttributes related to the variable's longitude, latitude, altitude, and time ranges (for example, Southernmost_Northing, Northernmost_Northing, time_coverage_start, time_coverage_end) are automatically generated or updated.
      • A simple sample global <addAttributes> is:
        <addAttributes> 
          <att name="Conventions">COARDS, CF-1.6, Unidata Dataset 
          Discovery v1.0</att>
          <att name="infoUrl">http://coastwatch.pfeg.noaa.gov/infog/
          PH_ssta_las.html</att>
          <att name="institution">NOAA CoastWatch, West Coast Node</att>
          <att name="title">SST, Pathfinder Ver 5.0, Day and Night, 
          Global, Science Quality (1 Day Composite)</att>
          <att name="cwhdf_version" />
        </addAttributes>  
      • Supplying this information helps ERDDAP do a better job and helps users understand the datasets. Please take the time to do a good job with metadata attributes. The users will thank you.

      Comments about global attributes that are special in ERDDAP:
       

      • acknowledgment (from the ACDD metadata standard) is a RECOMMENDED way to acknowledge the group or groups that provided support (notably, financial) for the project that created this data.
      • cdm_altitude_proxy is just for EDDTable datasets that don't have an altitude variable but do have a variable that is a proxy for altitude (e.g., depth, pressure, sigma, bottleNumber); you may use this attribute to identify that variable. For example,
        <att name="cdm_altitude_proxy">depth</att>
        If the cdm_data_type is Profile or TrajectoryProfile and there is no altitude variable, cdm_altitude_proxy MUST be defined. If cdm_altitude_proxy is defined, ERDDAP will add the following metadata to the variable: _CoordinateAxisType=Height and axis=Z.
      • cdm_data_type (from the ACDD metadata standard) is a global attribute that indicates the Unidata Common Data Model data type for the dataset. The CDM standard is still evolving and may change again. The best overview of the different feature types (but out-of-date!) is currently Unidata's CDMfeatures.doc.
        ERDDAP complies with the newly ratified Discrete Sampling Geometries chapter of the CF 1.6 metadata conventions (previously called the CF Point Observation Conventions).
        • Either the dataset's global sourceAttributes or its global <addAttributes> MUST include the cdm_data_type attribute. A few dataset types (like EDDTableFromObis) will set this automatically.
        • For EDDGrid datasets, the cdm_data_type options are Grid (the default and by far the most common type for EDDGrid datasets), MovingGrid, Other, Point, Profile, RadialSweep, TimeSeries, TimeSeriesProfile, Swath, Trajectory, and TrajectoryProfile. Currently, EDDGrid does not require that any related metadata be specified, nor does it check that the data matches the cdm_data_type. That will probably change in the near future.
        • EDDTable uses cdm_data_type in a rigorous way. If a dataset doesn't comply with the cdm_data_type's requirements, the dataset will fail to load and will generate an error message. (That's a good thing, in the sense that the error message will tell you what is wrong so that you can fix it.)

          For all of these datasets, in the Conventions and Metadata_Conventions global attributes, please refer to CF-1.6 (not CF-1.0, 1.1, 1.2, 1.3, 1.4, or 1.5), since CF-1.6 is the first version to include the changes related to Discrete Sampling Geometry conventions.

          For EDDTable datasets, the cdm_data_type options (and related requirements) are

          • Point - for a dataset with unrelated points.
            • As with all cdm_data_types other than Other, Point datasets MUST have longitude, latitude, and time variables.
          • Profile - for data from multiple depths at one or more longitude,latitude locations.
            • The dataset MUST include the globalAttribute cdm_profile_variables, where the value is a comma-separated list of variables with profile information.
            • One of the variables MUST have the attribute cf_role=profile_id to identify the variable that uniquely identifies the profiles. If no other variable is suitable, consider using the time variable.
          • TimeSeries - for data from a set of stations with fixed longitude,latitude(,altitude).
            • The dataset MUST include the globalAttribute cdm_timeseries_variables, where the value is a comma-separated list of variables with station information.
            • One of the variables MUST have the attribute cf_role=timeseries_id to identify the variable that uniquely identifies the stations.
            • It is okay if the longitude and latitude vary slightly over time. If the longitude and latitude don't vary, include them in the cdm_timeseries_variables. If they do vary, don't include them in the cdm_timeseries_variables.
          • TimeSeriesProfile - for profiles from a set of stations.
            • The dataset MUST include the globalAttribute cdm_timeseries_variables, where the value is a comma-separated list of variables with station information.
            • The dataset MUST include the globalAttribute cdm_profile_variables, where the value is a comma-separated list of variables with profile information.
            • One of the variables MUST have the attribute cf_role=timeseries_id to identify the variable that uniquely identifies the stations.
            • One of the variables MUST have the attribute cf_role=profile_id to identify the variable that uniquely identifies the profiles. (A given profile_id only has to be unique for a given timeseries_id.) If no other variable is suitable, consider using the time variable.
          • Trajectory - for data from a set of longitude,latitude(,altitude) paths called trajectories.
            • The dataset MUST include the globalAttribute cdm_trajectory_variables, where the value is a comma-separated list of variables with trajectory information.
            • One of the variables MUST have the attribute cf_role=trajectory_id to identify the variable that uniquely identifies the trajectories.
          • TrajectoryProfile - for profiles taken along trajectories.
            • The dataset MUST include the globalAttribute cdm_trajectory_variables, where the value is a comma-separated list of variables with trajectory information.
            • The dataset MUST include the globalAttribute cdm_profile_variables, where the value is a comma-separated list of variables with profile information.
            • One of the variables MUST have the attribute cf_role=trajectory_id to identify the variable that uniquely identifies the trajectories.
            • One of the variables MUST have the attribute cf_role=profile_id to identify the variable that uniquely identifies the profiles. (A given profile_id only has to be unique for a given trajectory_id.) If no other variable is suitable, consider using the time variable.
          • Other - has no requirements. Use it if the dataset doesn't fit one of the other options.
          Related notes:
          • All EDDTable datasets with a cdm_data_type other than "Other" MUST have longitude, latitude, and time variables.
          • Datasets with profiles MUST have an altitude variable or a cdm_altitude_proxy variable.
          • If you can't make a dataset comply with all of the requirements for the ideal cdm_data_type, use "Point" (which has few requirements) or "Other" (which has no requirements) instead.
          • This information is used by ERDDAP in various ways, for example, when making .ncCF files (.nc files which comply with the Contiguous Ragged Array Representations associated with the dataset's cdm_data_type, as defined in the newly ratified Discrete Sampling Geometries chapter of the CF 1.6 metadata conventions, which were previously named "CF Point Observation Conventions").
          • Hint: Usually, a good starting point for subsetVariables is the combined values of the cdm_..._variables. For example, for TimeSeriesProfile, start with the cdm_timeseries_variables plus the cdm_profile_variables.
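          For example, the TimeSeries requirements above might be met with addAttributes like these (the stationID variable name is hypothetical):

          ```xml
          <!-- in the dataset's global <addAttributes>: -->
          <att name="cdm_data_type">TimeSeries</att>
          <att name="cdm_timeseries_variables">stationID, longitude, latitude</att>

          <!-- in the stationID variable's <addAttributes>: -->
          <att name="cf_role">timeseries_id</att>
          ```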
      • contributor_name (from the ACDD metadata standard) is the RECOMMENDED way to identify a person, organization, or project which contributed to this dataset (e.g., the original creator of the data, before it was reprocessed by the creator of this dataset). If "contributor" doesn't really apply to a dataset, omit this attribute. Compared to creator_name, this is more focused on the funding source.
      • contributor_role (from the ACDD metadata standard) is the RECOMMENDED way to identify the role of contributor_name, e.g., Source of Level 2b data. If "contributor" doesn't really apply to a dataset, omit this attribute.
      • coverage_content_type (from the ISO 19115 metadata standard) is the RECOMMENDED way to identify the type of gridded data (in EDDGrid datasets). The only allowed values are auxiliaryInformation, image, modelResult, physicalMeasurement (the default when ISO 19115 metadata is generated), qualityInformation, referenceInformation, and thematicClassification. (Don't use this tag for EDDTable datasets.)
      • creator_name (from the ACDD metadata standard) is the RECOMMENDED way to identify the person, organization, or project (if not a specific person or organization), most responsible for the creation (or most recent reprocessing) of this data. If the data was extensively reprocessed (e.g., satellite data from level 2 to level 3 or 4), then usually the reprocessor is listed as the creator and the original creator is listed via contributor_name. Compared to project, this is more flexible, since it may identify a person, an organization, or a project.
      • creator_email (from the ACDD metadata standard) is the RECOMMENDED way to identify an email address (correctly formatted, e.g., john_smith@great.org) that provides a way to contact the creator.
      • creator_url (from the ACDD metadata standard) is the RECOMMENDED way to identify a URL for the organization that created the dataset, or a URL with the creator's information about this dataset (but that is more the purpose of infoUrl).
      • date_created (from the ACDD metadata standard) is the RECOMMENDED way to identify the date on which the data was created (e.g., processed into this form), in ISO 8601 format, e.g., 2010-01-30.
      • date_modified (from the ACDD metadata standard) is the RECOMMENDED way to identify the date on which the data was last modified (e.g., when an error was fixed or when the latest data was added), in ISO 8601 format, e.g., 2012-03-15.
      • date_issued (from the ACDD metadata standard) is the RECOMMENDED way to identify the date on which the data was made available to others, in ISO 8601 format, e.g., 2012-03-15. For example, the dataset may have a date_created of 2010-01-30, but was not made publicly available until 2010-07-30.
      • featureType (from the CF metadata standard) is IGNORED and/or REPLACED. If the dataset's cdm_data_type is appropriate, ERDDAP will automatically use it to create a featureType attribute. So there is no need for you to add it.
      • drawLandMask - This is a RECOMMENDED global attribute used by ERDDAP (and no metadata standards) which specifies the default value for the "Draw Land Mask" option on the dataset's Make A Graph form and for the &.land parameter in a URL requesting a graph/map of the data. (However, if drawLandMask is specified in a variable's attributes, that value has precedence.)
        • For EDDGrid datasets, this specifies whether the land mask on a map is drawn over or under the grid data. over is recommended for oceanographic data (so that grid data over land is obscured by the landmask). under is recommended for all other data.
        • For EDDTable datasets: over makes the land mask on a map visible (land appears as a uniform gray area). over is commonly used for purely oceanographic datasets. under makes the land mask invisible (topography information is displayed for ocean and land areas). under is commonly used for all other data.
        • If any other value (or no value) is specified, the drawLandMask value from setup.xml is used. If none is specified there, over is the default.
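        For example, a purely oceanographic gridded dataset might set the dataset-wide default in its global <addAttributes> (a minimal sketch, using one of the values described above):

        ```xml
        <!-- global addAttributes: draw the land mask over the grid data by default -->
        <att name="drawLandMask">over</att>
        ```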
      • history (from the CF and ACDD metadata standards) is a RECOMMENDED multi-line string global attribute with a line for every processing step that the data has undergone.
        • Ideally, each line has an ISO 8601:2004(E) formatted date+timeZ (for example, 1985-01-31T15:31:00Z) followed by a description of the processing step.
        • ERDDAP creates this if it doesn't already exist.
        • If it already exists, ERDDAP will append new information to the existing information.
        • history is important because it allows clients to backtrack to the original source of the data.
      • infoUrl is a REQUIRED global attribute with the URL of a web page with more information about this dataset (usually at the source institution's web site).
        • Either the dataset's global sourceAttributes or its global <addAttributes> MUST include this attribute.
        • infoUrl is important because it allows clients to find out more about the data from the original source.
        • ERDDAP displays a link to the infoUrl on the dataset's Data Access Form, Make A Graph web page, and other web pages.
        • If the URL has a query part (after the "?"), it MUST be already percent encoded. In practice, this can be very minimal percent encoding: all you have to do is convert special characters in the right-hand-side values of any constraints: % into %25, & into %26, " into %22, = into %3D, + into %2B, and space into %20 (or +) and convert all characters above #126 to their %HH form (where HH is the 2-digit hex value). Unicode characters above #255 must be UTF-8 encoded and then each byte must be converted to %HH form (ask a programmer for help).
        • Since datasets.xml is an XML file, you MUST also encode '&', '<', and '>' in the URL as '&amp;', '&lt;', and '&gt;'.
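        As a sketch of both encoding steps, suppose the (hypothetical) source web page is https://www.example.com/info with a query constraint whose value is "5% off": the % and space in the value must be percent encoded, and the & joining the query parameters must then be XML-encoded:

        ```xml
        <!-- hypothetical URL: the value "5% off" becomes 5%25%20off, and the
             literal & between query parameters is written as &amp; -->
        <att name="infoUrl">https://www.example.com/info?tag=5%25%20off&amp;id=42</att>
        ```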
        • infoUrl is unique to ERDDAP. It is not from any metadata standard.
      • institution (from the CF and ACDD metadata standards) is a REQUIRED global attribute with the short version of the name of the institution which is the source of this data (usually an acronym, usually <20 characters).
        • Either the dataset's global sourceAttributes or its global <addAttributes> MUST include this attribute.
        • ERDDAP displays the institution whenever it displays a list of datasets. If an institution is longer than 20 characters, only the first 20 characters will be visible in the list of datasets (but the whole institution can be seen by putting the mouse cursor over the adjacent "?" icon).
        • If you add institution to the list of <categoryAttributes> in ERDDAP's setup.xml file, users can easily find datasets from the same institution via ERDDAP's "Search for Datasets by Category" on the home page.
      • keywords (from the ACDD metadata standard) is a RECOMMENDED comma-separated list of words and short phrases (e.g., GCMD Science Keywords) that describe the dataset in a general way, and not assuming any other knowledge of the dataset (e.g., for oceanographic data, include ocean).
      • keywords_vocabulary (from the ACDD metadata standard) is a RECOMMENDED attribute: if you are following a guideline for the words/phrases in your keywords attribute (e.g., GCMD Science Keywords), put the name of that guideline here.
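      A sketch of these two attributes for a hypothetical sea surface temperature dataset (GCMD-style keywords use > as a separator, which must be written as &gt; in the XML):

      ```xml
      <att name="keywords">ocean, satellite, sea surface temperature,
      Oceans &gt; Ocean Temperature &gt; Sea Surface Temperature</att>
      <att name="keywords_vocabulary">GCMD Science Keywords</att>
      ```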
      • license (from the ACDD metadata standard) is a STRONGLY RECOMMENDED global attribute with the license and/or usage restrictions.
        • If "[standard]" occurs in the attribute value, it will be replaced by the standard ERDDAP license from the <standardLicense> tag in messages.xml.
      • processing_level (from the ACDD metadata standard) is a RECOMMENDED textual description of the processing (e.g., NASA satellite data processing levels, e.g., Level 3) or quality control level (e.g., Science Quality) of the data.
      • project (from the ACDD metadata standard) is an OPTIONAL attribute to identify the project (e.g., GTSPP) that the dataset is part of. If the dataset isn't part of a project, don't use this attribute. Compared to creator_name, this is focused on the project (not a person or an organization, which may be involved in multiple projects).
      • publisher_name (from the ACDD metadata standard) is the RECOMMENDED way to identify the person, organization, or project which is publishing this dataset. For example, you are the publisher if another person or group created the dataset and you are just re-serving it via ERDDAP. If "publisher" doesn't really apply to a dataset, omit this attribute. Compared to creator_name, the publisher probably didn't significantly modify or reprocess the data; the publisher is just making the data available in a new venue.
      • publisher_email (from the ACDD metadata standard) is the RECOMMENDED way to identify an email address (correctly formatted, e.g., john_smith@great.org) that provides a way to contact the publisher. If "publisher" doesn't really apply to a dataset, omit this attribute.
      • publisher_url (from the ACDD metadata standard) is the RECOMMENDED way to identify a URL for the organization that created the dataset, or a URL with the publisher's information about this dataset (but that is more the purpose of infoUrl). If "publisher" doesn't really apply to a dataset, omit this attribute.
      • sourceUrl is a global attribute with the URL of the source of the data.
        • ERDDAP usually creates this global attribute automatically. Two exceptions are EDDTableFromHyraxFiles and EDDTableFromThreddsFiles.
        • sourceUrl is important because it allows clients to backtrack to the original source of the data.
        • sourceUrl is unique to ERDDAP. It is not from any metadata standard.
      • standard_name_vocabulary (from the ACDD metadata standard) is a RECOMMENDED attribute to identify the name of the controlled vocabulary from which variable standard_names are taken (e.g., CF-19 for version 19 of the CF standard name table).
      • subsetVariables (for EDDTable datasets only) is a RECOMMENDED global attribute that lets you specify a comma-separated list of destinationNames of variables which have a limited number of values (stated another way: variables for which each of the values has many duplicates). If this attribute is present, the dataset will have a .subset web page (and a link to it on every dataset list) which lets users quickly and easily select various subsets of the data.
        • Each time a dataset is loaded, ERDDAP loads and caches all of the distinct() subsetVariable data. Then, all user requests for distinct() subsetVariable data will be very fast.
        • The order of the destinationNames you specify determines the sort order on the .subset web page, so you will usually specify the most important variables first, then the least important. For example, for datasets with time series data for several stations, you might use, e.g.,
          <att name="subsetVariables">station_id, longitude, latitude</att>
          so that the values are sorted by station_id.
        • If the number of distinct combinations of these variables is greater than about 1,000,000, you should consider restricting the subsetVariables that you specify to reduce the number of distinct combinations to below 1,000,000; otherwise, the .subset web pages may be generated slowly.
        • If the number of distinct values of any one subset variable is greater than about 20,000, you should consider not including that variable in the list of subsetVariables; otherwise, it takes a long time to transmit the .subset, .graph, and .html web pages.
        • You should test each dataset to see if the subsetVariables setting is okay. If the source data server is slow and it takes too long (or fails) to download the data, either reduce the number of variables specified or remove the subsetVariables global attribute.
        • The subsetVariables feature is very useful, so if your dataset is suitable, please create a subsetVariables attribute.
        • EDDTableFromSOS automatically adds
          <att name="subsetVariables">station_id, longitude, latitude</att>
          when the dataset is created.
        • Possible warning: if a user using the .subset web page selects a value which has a carriageReturn or newline character, .subset will fail. ERDDAP can't work around this issue because of some HTML details. In any case, it is almost always a good idea to remove the carriageReturn and newline characters from the data. To help you fix the problem, if the EDDTable.subsetVariablesDataTable method in ERDDAP detects data values that will cause trouble, it will email a warning with a list of offending values to the emailEverythingTo email addresses specified in setup.xml. That way, you know what needs to be fixed.
        • Pre-generated subset tables. Normally, when ERDDAP loads a dataset, it requests the distinct() subset variables data table from the data source, just via a normal data request. In some cases, this data is not available from the data source or retrieving from the data source may be hard on the data source server. If so, you can supply a table with the information in a .json or .csv file with the name [tomcat]/content/erddap/subset/[datasetID].json (or .csv). If present, ERDDAP will read it once when the dataset is loaded and use it as the source of the subset data.
          • If there is an error while reading it, the dataset will fail to load.
          • It MUST have exactly the same column names (e.g., same case) as <subsetVariables>, but the columns MAY be in any order.
          • It MAY have extra columns (they'll be removed and newly redundant rows will be removed).
          • TimeStamp columns should have ISO 8601:2004(E) formatted date+timeZ strings (for example, 1985-01-31T15:31:00Z).
          • Missing values should be represented as actual missing values (e.g., empty cells), not placeholder numbers like -99.
          • .json files may be a little harder to create by hand, but they handle Unicode characters well. (They are easy to create if you generate them with ERDDAP.)
          • .csv files are easy to work with, but suitable for ISO 8859-1 characters only. .csv files MUST have column names on the first row and data on subsequent rows.
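          For example, a minimal hypothetical [datasetID].csv for a dataset whose subsetVariables are station_id, longitude, latitude might be (the station IDs and positions are illustrative):

          ```csv
          station_id,longitude,latitude
          StationA,-123.17,48.33
          StationB,-122.23,36.75
          ```

          The column names match the subsetVariables' destinationNames exactly (including case).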
      • summary (from the CF and ACDD metadata standards) is a REQUIRED global attribute with a long description of the dataset (usually <500 characters).
        • Either the dataset's global sourceAttributes or its global <addAttributes> MUST include this attribute.
        • summary is important because it allows clients to read a description of the dataset that has more information than the title.
        • ERDDAP displays the summary on the dataset's Data Access Form, Make A Graph web page, and other web pages.
      • title (from the CF and ACDD metadata standards) is a REQUIRED global attribute with the short description of the dataset (usually <80 characters).
        • Either the dataset's global sourceAttributes or its global <addAttributes> MUST include this attribute.
        • title is important because every list of datasets presented by ERDDAP (other than search results) lists the datasets in alphabetical order, by title. So if you want to specify the order of datasets, or have some datasets grouped together, you have to create titles with that in mind. Many lists of datasets (e.g., in response to a category search), show a subset of the full list and in a different order. So the title for each dataset should stand on its own.
        • If a title is longer than 80 characters, only the first and last 40 characters will be visible in the list of datasets (but the whole title can be seen by putting the mouse cursor over the adjacent "?" icon).
           
    • <axisVariable> is a REQUIRED tag within an EDDGrid <dataset> tag, which is used to describe a dimension (also called "axis") shared by the data variables in an EDDGrid dataset. It is not allowed for EDDTable datasets.
      • All data variables in an EDDGrid dataset MUST use (share) all of the axis variables. (Why? What if they don't?)
      • There MUST be an axis variable for each dimension. Axis variables MUST be specified in the order that the data variables use them.
      • There MUST be 1 or more instances of this tag.
      • <axisVariable> supports the following subtags:
        • <sourceName> - the data source's name for the variable. This is the name that ERDDAP will use when requesting data from the data source. This is the name that ERDDAP will look for when data is returned from the data source. This is case sensitive. This is REQUIRED.
        • <destinationName> is the name for the variable that will be shown to and used by ERDDAP users.
          • This is OPTIONAL. If absent, the sourceName is used.
          • This is useful because it allows you to change a cryptic or odd sourceName.
          • destinationName is case sensitive.
          • destinationNames MUST start with a letter (A-Z, a-z) and MUST be followed by 0 or more characters (A-Z, a-z, 0-9, and _). ('-' was allowed before ERDDAP version 1.10.) This restriction allows axis variable names to be used as variable names in a programming language (such as Matlab).
        • <addAttributes> defines an OPTIONAL set of attributes (name = value) which are added to the source's attributes for a variable, to make the combined attributes for a variable.
      • If the variable's sourceAttributes or <addAttributes> include scale_factor and/or add_offset attributes, their values will be used to unpack the data from the source before distribution to the client (resultValue = sourceValue * scale_factor + add_offset). The unpacked variable will be of the same data type (e.g., float) as the scale_factor and add_offset values.
      • An example is:
        <axisVariable>
          <sourceName>MT</sourceName> 
          <destinationName>time</destinationName>
          <addAttributes>
            <att name="units">days since 1902-01-01 12:00:00Z</att>
          </addAttributes>
        </axisVariable> 
      • In EDDGrid datasets, the longitude, latitude, altitude, and time axis variables are special.
         
    • <dataVariable> is a REQUIRED (for almost all datasets) tag within the <dataset> tag which is used to describe a data variable.
      • There MUST be 1 or more instances of this tag.
      • <dataVariable> supports the following subtags:
        • <sourceName> - the data source's name for the variable. This is the name that ERDDAP will use when requesting data from the data source. This is the name that ERDDAP will look for when data is returned from the data source. This is case sensitive. This is REQUIRED.

          If you want to create a variable (with a fixed value) that isn't in the source dataset, use:
          <sourceName>=fixedValue</sourceName>
          The initial equals sign tells ERDDAP that a fixedValue will follow.
          The other tags for the <dataVariable> work as if this were a regular variable.
          For example, to create a variable called altitude with a fixed value of 0.0 (float), use:
          <sourceName>=0</sourceName>
          <destinationName>altitude</destinationName>
          <dataType>float</dataType>

        • <destinationName> - the name for the variable that will be shown to and used by ERDDAP users.
          • This is OPTIONAL. If absent, the sourceName is used.
          • This is useful because it allows you to change a cryptic or odd sourceName.
          • destinationName is case sensitive.
          • destinationNames MUST start with a letter (A-Z, a-z) and MUST be followed by 0 or more characters (A-Z, a-z, 0-9, and _). ('-' was allowed before ERDDAP version 1.10.) This restriction allows data variable names to be used as variable names in a programming language (like Matlab).
        • <dataType> - specifies the data type coming from the source.
          • This is REQUIRED by some dataset types and IGNORED by others. Dataset types that require this for their dataVariables are: EDDGridFromXxxFiles, EDDTableFromXxxFiles, EDDTableFromMWFS, EDDTableFromNOS, EDDTableFromSOS. Other dataset types ignore this tag because they get the information from the source.
          • Valid values are: "double" (64-bit), "float" (32-bit), "long" (64-bit signed), "int" (32-bit signed), "short" (16-bit signed), "byte" (8-bit signed), "char" (essentially: 16-bit unsigned), "boolean", and "String" (any length).
          • "boolean" is a special case.
            • Internally, ERDDAP doesn't support a boolean type because booleans can't store missing values.
            • Also, DAP doesn't support booleans, so there is no standard way to query boolean variables.
            • Specifying "boolean" for the dataType in datasets.xml will cause boolean values to be stored and represented as bytes: 0=false, 1=true.
            • Clients can specify constraints by using the numeric values (for example, "isAlive=1"). But ERDDAP administrators need to use the "boolean" dataType in datasets.xml to tell ERDDAP how to interact with the data source.
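            A sketch of a boolean dataVariable in datasets.xml, using the isAlive example from above (the sourceName and ioos_category are illustrative):

            ```xml
            <dataVariable>
              <sourceName>isAlive</sourceName>
              <dataType>boolean</dataType>
              <addAttributes>
                <att name="ioos_category">Biology</att>
              </addAttributes>
            </dataVariable>
            ```

            Users will see the variable as byte values: 0 (false) and 1 (true).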
        • <addAttributes> - defines a set of attributes (name = value) which are added to the source's attributes for a variable, to make the combined attributes for a variable. This is OPTIONAL.
      • If the variable's sourceAttributes or <addAttributes> include scale_factor and/or add_offset attributes, their values will be used to unpack the data from the source before distribution to the client. The unpacked variable will be of the same data type (e.g., float) as the scale_factor and add_offset values.
      • An example is:
        <dataVariable>
          <sourceName>waterTemperature</sourceName>
          <destinationName>sea_water_temperature</destinationName>
          <dataType>float</dataType>
          <addAttributes>
            <att name="ioos_category">Temperature</att>
            <att name="long_name">Sea Water Temperature</att>
            <att name="standard_name">sea_water_temperature</att>
            <att name="units">degree_C</att>
          </addAttributes>
        </dataVariable>  
      • In EDDTable datasets, longitude, latitude, altitude, and time data variables are special.
         
    • Variable Attributes / Variable <addAttributes> - <addAttributes> is an OPTIONAL tag within an <axisVariable> or <dataVariable> tag which is used to change the variable's attributes.
      • Use a variable's <addAttributes> to change the variable's attributes. ERDDAP combines a variable's attributes from the dataset's source (sourceAttributes) and the variable's addAttributes which you define in datasets.xml (which have priority) to make the variable's "combinedAttributes", which are what ERDDAP users see. Thus, you can use addAttributes to redefine the values of sourceAttributes, add new attributes, or remove attributes.
      • See the <addAttributes> information which applies to global and variable <addAttributes>.
      • ERDDAP looks for and uses many of these attributes in various ways. For example, the colorBar values are required to make a variable available via WMS, so that maps can be made with consistent colorBars.
      • The longitude, latitude, altitude, and time variables get lots of appropriate metadata automatically (e.g., units).
      • A sample <addAttributes> for a data variable is:
        <addAttributes> 
          <att name="colorBarMinimum" type="double">0</att>
          <att name="colorBarMaximum" type="double">32</att>
          <att name="data_min" type="double">10.34</att>
          <att name="data_max" type="double">17.91</att>
          <att name="ioos_category">Temperature</att>
          <att name="long_name">Sea Surface Temperature</att>
          <att name="numberOfObservations" /> 
          <att name="units">degree_C</att>
        </addAttributes>
      • Supplying this information helps ERDDAP do a better job and helps users understand the datasets. Please take the time to do a good job with metadata attributes. The users will thank you.

      Comments about variable attributes that are special in ERDDAP:
       

      • actual_range (CDC COARDS) is a RECOMMENDED variable attribute.
        • If present, it is an array of two values of the same data type as the variable, specifying the actual (not the theoretical or the allowed) minimum and maximum values of the data for that variable.
        • If the data is packed with scale_factor and/or add_offset, actual_range should have packed values.
        • If present, ERDDAP will extract the information and display it to the user on the Data Access Form and Make A Graph web pages for that dataset.
        • When a user selects a subset of data, ERDDAP modifies actual_range automatically, to reflect the subset's range.
        • See also data_min and data_max.
      • Color Bar Attributes - There are several OPTIONAL variable attributes which specify the suggested default attributes for a color bar (used to convert data values into colors on images) for this variable.
        • If present, this information is used as default information by griddap and tabledap whenever you request an image that uses a color bar.
        • For example, when latitude-longitude gridded data is plotted as a coverage on a map, the color bar specifies how the data values are converted to colors.
        • Having these values allows ERDDAP to create images which use a consistent color bar across different requests, even when the time or other dimension values vary.
        • These attribute names were created for use in ERDDAP. They are not from a metadata standard.
        • WMS - The main requirements for a variable to be accessible via ERDDAP's WMS server are:
          • The dataset must be an EDDGrid... dataset.
          • The data variable MUST be a gridded variable.
          • The data variable MUST have longitude and latitude axis variables. (Other axis variables are OPTIONAL.)
          • There MUST be some longitude values between -180 and 180.
          • The colorBarMinimum and colorBarMaximum attributes MUST be specified. (Other color bar attributes are OPTIONAL.)
        • The attributes related to the color bar are:
          • colorBarMinimum specifies the minimum value on the colorBar.
            • If the data is packed with scale_factor and/or add_offset, specify the colorBarMinimum as an unpacked value.
            • Data values lower than colorBarMinimum are represented by the same color as colorBarMinimum values.
            • The attribute should be of type="double", regardless of the data variable's type.
            • The value is usually a nice round number.
            • Best practices: We recommend a value slightly higher than the minimum data value.
            • There is no default value.
          • colorBarMaximum specifies the maximum value on the colorBar.
            • If the data is packed with scale_factor and/or add_offset, specify the colorBarMaximum as an unpacked value.
            • Data values higher than colorBarMaximum are represented by the same color as colorBarMaximum values.
            • The attribute should be of type="double", regardless of the data variable's type.
            • The value is usually a nice round number.
            • Best practices: We recommend a value slightly lower than the maximum data value.
            • There is no default value.
          • colorBarPalette specifies the palette for the colorBar.
            • All ERDDAP installations support these standard palettes: BlackBlueWhite, BlackRedWhite, BlackWhite, BlueWhiteRed, LightRainbow, Ocean, Rainbow, RedWhiteBlue, ReverseRainbow, Topography, WhiteBlack, WhiteBlueBlack, and WhiteRedBlack.
            • If you have installed additional palettes, you can refer to one of them.
            • If this attribute isn't present, the default is BlueWhiteRed if -1*colorBarMinimum = colorBarMaximum; otherwise the default is Rainbow.
          • colorBarScale specifies the scale for the colorBar.
            • Valid values are Linear and Log.
            • If the value is Log, colorBarMinimum must be greater than 0.
            • If this attribute isn't present, the default is Linear.
          • colorBarContinuous specifies whether the colorBar has a continuous palette of colors, or whether the colorBar has a few discrete colors.
            • Valid values are the strings true and false.
            • If this attribute isn't present, the default is true.
        • Example:
            <att name="colorBarMinimum" type="double">0</att>
            <att name="colorBarMaximum" type="double">32</att>
            <att name="colorBarPalette">Rainbow</att>
            <att name="colorBarContinuous">true</att>
            <att name="colorBarScale">Linear</att>
      • data_min and data_max - These are RECOMMENDED variable attributes defined in the World Ocean Circulation metadata description.
        • If present, they are of the same data type as the variable, and specify the actual (not the theoretical or the allowed) minimum and maximum values of the data for that variable.
        • If the data is packed with scale_factor and/or add_offset, data_min and data_max should be packed values.
        • If present, ERDDAP will extract the information and display it to the user on the Data Access Form and Make A Graph web pages for that dataset.
        • See also actual_range.
      • drawLandMask - This is an OPTIONAL variable attribute used by ERDDAP (and no metadata standards) which specifies the default value for the "Draw Land Mask" option on the dataset's Make A Graph form and for the &.land parameter in a URL requesting a graph/map of the data.
        • For variables in EDDGrid datasets, this specifies whether the land mask on a map is drawn over or under the grid data. over is recommended for oceanographic data (so that grid data over land is obscured by the landmask). under is recommended for all other data.
        • For variables in EDDTable datasets: over makes the land mask on a map visible (land appears as a uniform gray area). over is commonly used for purely oceanographic datasets. under makes the land mask invisible (topography information is displayed for ocean and land areas). under is commonly used for all other data.
        • If any other value (or no value) is specified, the drawLandMask value from the dataset's global attributes is used.
      • ioos_category - This is a REQUIRED variable attribute if <variablesMustHaveIoosCategory> is set to true in setup.xml; otherwise, it is OPTIONAL.
        • (As of this writing) we aren't aware of formal definitions of these names.
        • The core names are from Zdenka Willis' .ppt "Integrated Ocean Observing System (IOOS) NOAA's Approach to Building an Initial Operating Capability" and from the US IOOS Blueprint.
        • Bob Simons added additional names (mostly based on the names of scientific fields, e.g., Biology, Ecology, Meteorology, Statistics, Taxonomy) for other types of data.
        • There is some overlap and ambiguity -- do your best.
        • It is likely that this list will be revised in the future. If you have requests, please email bob.simons at noaa.gov.
        • The current valid values are Bathymetry, Biology, Bottom Character, Colored Dissolved Organic Matter, Contaminants, Currents, Dissolved Nutrients, Dissolved O2, Ecology, Fish Abundance, Fish Species, Heat Flux, Hydrology, Ice Distribution, Identifier, Location, Meteorology, Ocean Color, Optical Properties, Other, Pathogens, pCO2, Phytoplankton Species, Pressure, Productivity, Quality, Salinity, Sea Level, Statistics, Stream Flow, Surface Waves, Taxonomy, Temperature, Time, Total Suspended Matter, Unknown, Wind, Zooplankton Species, and Zooplankton Abundance.
        • If you add ioos_category to the list of <categoryAttributes> in ERDDAP's setup.xml file, users can easily find datasets with similar data via ERDDAP's "Search for Datasets by Category" on the home page.
      • long_name (COARDS, CF, netCDF and ACDD metadata standards) is a RECOMMENDED variable attribute in ERDDAP.
        • ERDDAP uses the long_name for labeling axes on graphs.
        • Best practices: Capitalize the words in the long_name as if it were a title (capitalize the first word and all non-article words). Don't include the units in the long_name. The long name shouldn't be very long (usually <20 characters), but should be more descriptive than the destinationName, which is often very concise.
        • If "long_name" isn't defined in the variable's sourceAttributes or <addAttributes>, ERDDAP will generate it by cleaning up the standard_name (if present) or the destinationName.
      • missing_value (default = NaN) and _FillValue (default = NaN) (COARDS, CF, and netCDF) are variable attributes which describe a number (for example, -9999) which is used to represent a missing value.
        • ERDDAP supports missing_value and _FillValue, since some data sources assign slightly different meanings to them.
        • If present, they should be of the same data type as the variable.
        • If the data is packed with scale_factor and/or add_offset, the missing_value and _FillValue values should be likewise packed.
        • If a variable uses these special numbers, the missing_value and/or _FillValue attributes are REQUIRED.
        • For some output data formats, ERDDAP will leave these special numbers intact.
        • For other output data formats, ERDDAP will replace these special numbers with NaN or "".
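          For example, for a hypothetical short (int16) variable that uses -9999 as its special number, the attributes might look like this (note that the type matches the variable's data type):

```xml
<addAttributes>
    <att name="_FillValue" type="short">-9999</att>
    <att name="missing_value" type="short">-9999</att>
</addAttributes>
```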
      • scale_factor (default = 1) and add_offset (default = 0) (COARDS, CF, and netCDF) are OPTIONAL variable attributes which describe data which is packed in a simpler data type via a simple transformation.
        • If present, their data type is different from the source data type and describes the data type of the destination values.
          For example, a data source might have stored float data values with one decimal digit packed as short ints (int16), using scale_factor = 0.1 and add_offset = 0:
          <att name="scale_factor" type="float">0.1</att>
          <att name="add_offset" type="float">0</att>
          In this example, ERDDAP would unpack the data and present it to the user as float data values.
        • If present, ERDDAP will extract the values from these attributes, remove the attributes, and automatically unpack the data for the user:
            destinationValue = sourceValue * scale_factor + add_offset
          Or, stated another way:
            unpackedValue = packedValue * scale_factor + add_offset
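            As a standalone sketch of that arithmetic (not ERDDAP's own code; the values match the scale_factor = 0.1, add_offset = 0 example above):

```python
# Sketch of the unpacking arithmetic ERDDAP applies to packed data
# (hypothetical values, matching the int16 example above).
scale_factor = 0.1
add_offset = 0.0

def unpack(packed_value):
    # destinationValue = sourceValue * scale_factor + add_offset
    return packed_value * scale_factor + add_offset

print(unpack(156))   # a packed short of 156 represents a value of about 15.6
```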
      • standard_name (from the CF metadata standard) is a RECOMMENDED variable attribute in ERDDAP. (CF maintains a list of CF standard names)
        • If you add standard_name to variables' attributes and add standard_name to the list of <categoryAttributes> in ERDDAP's setup.xml file, users can easily find datasets with similar data via ERDDAP's "Search for Datasets by Category" on the home page.
        • Best practices: Part of the power of controlled vocabularies comes from using only the terms in the list. So we recommend sticking to the terms defined in the controlled vocabulary, and we recommend against making up a term if there isn't an appropriate one in the list. If you need additional terms, see if the standards committee will add them to the controlled vocabulary.
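          For example (sea_water_temperature is a name from the CF standard name table; the surrounding context is a sketch):

```xml
<addAttributes>
    <att name="standard_name">sea_water_temperature</att>
</addAttributes>
```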
      • time_precision
        • time_precision is an OPTIONAL attribute used by ERDDAP (it is not from any metadata standard) for time and timestamp variables. It specifies the precision with which time values from the variable are displayed on web pages in ERDDAP. The only data file output format that uses it is .htmlTable.
        • Valid values are 1970-01, 1970-01-01, 1970-01-01T00Z, 1970-01-01T00:00Z, 1970-01-01T00:00:00Z (the default), 1970-01-01T00:00:00.0Z, 1970-01-01T00:00:00.00Z, 1970-01-01T00:00:00.000Z. [1970 is not an option because it is a single number, so ERDDAP can't know if it is a formatted time string (a year) or if it is some number of seconds since 1970-01-01T00:00:00Z.]
        • If time_precision isn't specified or the value isn't matched, the default value will be used.
        • Here, as in other parts of ERDDAP, any fields of the formatted time that are not displayed are assumed to have the minimum value. For example, 1985-07, 1985-07-01, 1985-07-01T00Z, 1985-07-01T00:00Z, and 1985-07-01T00:00:00Z are all considered equivalent, although with different levels of precision implied. This matches the ISO 8601:2004 "extended" Time Format Specification.
        • WARNING: Use a reduced time_precision only if every data value for the variable has the minimum value for all of the hidden fields.
          • For example, you can use a time_precision of 1970-01-01 if all of the data values have hour=0, minute=0, and second=0 (for example, 2005-03-04T00:00:00Z and 2005-03-05T00:00:00Z).
          • For example, don't use a time_precision of 1970-01-01 if there are non-zero hour, minute, or second values (for example, 2005-03-05T12:00:00Z), because the non-default hour value wouldn't be displayed.
            Then, if a user asks for all data with time=2005-03-05, the request will fail unexpectedly.
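          For example, if every time value in a dataset falls exactly at midnight, a sketch of the relevant entry is:

```xml
<addAttributes>
    <att name="time_precision">1970-01-01</att>
</addAttributes>
```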
      • units (COARDS, CF, netCDF and ACDD metadata standard) defines the units of the data values.
        • "units" is REQUIRED as either a sourceAttribute or an addAttribute for "time" variables and is STRONGLY RECOMMENDED for other variables whenever appropriate (which is almost always).
        • In general, we recommend UDUnits-compatible units, which are required by the COARDS and CF standards.
        • Another common standard is UCUM - the Unified Code for Units of Measure. OGC services such as SOS, WCS, and WMS require UCUM and often refer to UCUM as UOM (Units Of Measure).
        • We recommend that you use one units standard for all datasets in your ERDDAP. You should tell ERDDAP which standard you are using with <units_standard>, in your setup.xml file.
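          For example, in setup.xml (a sketch, assuming your datasets use UDUnits-style units):

```xml
<units_standard>UDUNITS</units_standard>
```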
        • For time variables, either the variable's sourceAttributes or <addAttributes> (which takes precedence) MUST have a units attribute, which is either
          • For time axis variables or time data variables with numeric data: UDUnits-compatible string (with the format units since baseTime) describing how to interpret source time values (e.g., seconds since 1970-01-01T00:00:00Z), where the base time is an ISO 8601:2004(E) formatted date time string (yyyy-MM-dd'T'HH:mm:ssZ).
          • For time data variables with String data: an org.joda.time.format.DateTimeFormat string (which is mostly compatible with java.text.SimpleDateFormat) describing how to interpret string times (e.g., the ISO8601TZ_FORMAT yyyy-MM-dd'T'HH:mm:ssZ).
            A Z (not the literal 'Z') at the end of the format string tells Java/Joda/ERDDAP to look for the character 'Z' (indicating the Zulu time zone with offset=0) or look for a time zone offset in the form +hh:mm, +hh, -hh:mm, or -hh. Examples of String dates in this format are
            2012-11-20T10:12:59-07:00
            2012-11-20T17:12:59Z
            2012-11-20T17:12:59
            all of which are equivalent times in ERDDAP because ERDDAP's default time zone (relevant for the last example) is Zulu.
            Other examples are
            2012-11-20T10:12 (missing seconds are assumed to be 0)
            2012-11-20T17 (missing minutes are assumed to be 0)
            2012-11-20 (missing hours are assumed to be 0)
            2012-11 (missing date is assumed to be 1)
            See Joda DateTimeFormat.
          The main time data variable (for tabular datasets) and the main time axis variable (for gridded datasets) are recognized by the destinationName time and their units metadata (which must be suitable).

          For tabular datasets, other variables can be timeStamp variables. They behave like the main time variable (converting the source's time format into "seconds since 1970-01-01T00:00:00Z" and/or ISO 8601:2004(E) format), but have a different destinationName. TimeStamp variables are recognized by their "units" metadata, which must contain " since " (for numeric dateTimes) or "yy" or "YY" (for formatted String dateTimes). But please still use the destinationName "time" for the main dateTime variable.
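          As a sketch (the sourceNames and the timeStamp variable here are hypothetical), a main numeric time variable and a String timeStamp variable might be declared like this:

```xml
<!-- The main time variable: destinationName MUST be "time";
     numeric source times are described with "units since baseTime". -->
<dataVariable>
    <sourceName>obs_seconds</sourceName>
    <destinationName>time</destinationName>
    <addAttributes>
        <att name="units">seconds since 1970-01-01T00:00:00Z</att>
    </addAttributes>
</dataVariable>
<!-- A timeStamp variable: recognized by its units (a String date format),
     not by its destinationName. -->
<dataVariable>
    <sourceName>date_deployed</sourceName>
    <destinationName>deployment_time</destinationName>
    <addAttributes>
        <att name="units">yyyy-MM-dd</att>
    </addAttributes>
</dataVariable>
```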

          Always check your work to be sure that the time data that shows up in ERDDAP is correct. Working with time data is always tricky and error-prone.

          See more information about time variables.
          ERDDAP has a utility to Convert a Numeric Time to/from a String Time.
          See How ERDDAP Deals with Time.


 

Contact

Questions, comments, suggestions? Please send an email to bob dot simons at noaa dot gov and include the ERDDAP URL directly related to your question or comment.
 

ERDDAP, Version 1.42