Working with the datasets.xml File
[This web page will only be of interest to ERDDAP administrators.]
After you have followed the ERDDAP
installation instructions,
you must edit the datasets.xml file
in [tomcat]/content/erddap/ to describe the datasets that your ERDDAP installation will serve.
Table of Contents
Some Assembly Required -
Setting up a dataset in ERDDAP isn't just a matter of pointing to the dataset's
directory or URL. You have to write a chunk of XML for datasets.xml which describes the dataset.
- For gridded datasets, in order to make the dataset conform to ERDDAP's data structure for gridded data,
you have to identify a subset of the dataset's variables which share the same dimensions.
(Why? How?)
- The dataset's current metadata is imported automatically.
But if you want to modify that metadata or add other metadata, you have to specify it in datasets.xml.
And ERDDAP needs other metadata, including global attributes
(such as infoUrl, institution,
sourceUrl, summary, and title) and variable attributes
(such as long_name and units).
Just like the metadata already in the dataset, the metadata requested by ERDDAP adds
descriptive information to the dataset.
The additional metadata is a good addition to your dataset and helps ERDDAP do a better job of
presenting your data to users who aren't familiar with it.
- ERDDAP needs you to do special things with the
longitude, latitude, altitude, and time variables.
If you buy into these ideas and expend the effort to create the XML for datasets.xml,
you get all the advantages of ERDDAP, including:
- Full text search for datasets
- Search for datasets by category
- Data Access Forms so you can request subsets of data in lots of different file formats
- Forms to request graphs and maps
- Web Map Service (WMS) for gridded datasets
- RESTful access to your data
Making the datasets.xml takes considerable effort for the first few datasets, but it gets easier.
After the first dataset, you can often re-use a lot of your work for the next dataset.
Fortunately, there are two Tools to help you create the XML for each
dataset in datasets.xml.
And if you get stuck, please send an email with the details to bob dot simons at noaa dot gov.
Tools -
There are two command line programs which are tools to help you create the XML
for each dataset that you want your ERDDAP to serve.
Once you have ERDDAP installed in Tomcat and Tomcat has unpacked the erddap.war file,
you can find these programs in the [tomcat]/webapps/erddap/WEB-INF directory.
There are Linux/Unix shell scripts (the program name, with no extension) and
Windows .bat files for each program.
When you run each program, it will ask you questions.
For each question, type a response, then press Enter.
Or press ^C to exit a program at any time.
The tools print various diagnostic messages:
- The word "error" is used when something went so wrong that the procedure failed to complete.
Although it is annoying to get an error, the error forces you to deal with the problem.
- The word "warning" is used when something went wrong, but the procedure was able to complete.
These are pretty rare.
- Anything else is just an informative message.
You can add -verbose to the GenerateDatasetsXml or DasDds command line to get
additional informative messages, which sometimes helps solve problems.
The two tools are a big help, but you still must read all of the instructions on this page carefully
and make important decisions yourself.
- GenerateDatasetsXml
is a command line program that can generate a rough draft
of the dataset XML for almost any type of dataset.
When you use the GenerateDatasetsXml program:
- GenerateDatasetsXml asks you a series of questions so that it can access the dataset's source.
- If you answer the questions correctly, GenerateDatasetsXml will connect to the dataset's source
and gather basic information (e.g., variable names).
- GenerateDatasetsXml will generate and print a rough draft of the dataset XML for that dataset
and put the information on the system clipboard.
- You can then paste it into your datasets.xml file and start to edit it.
- You can then use DasDds (see below) to repeatedly test the XML for that dataset.
Often, one of your answers won't be what GenerateDatasetsXml needs.
You can then try again, with revised answers to the questions,
until GenerateDatasetsXml can successfully connect to the dataset.
If you use "GenerateDatasetsXml -verbose", it will print more diagnostic messages than usual.
DISCLAIMER:
The chunk of datasets.xml made by GenerateDatasetsXml isn't perfect.
YOU MUST READ AND EDIT THE XML BEFORE USING IT IN A PUBLIC ERDDAP.
GenerateDatasetsXml relies on a lot of rules-of-thumb which aren't always correct.
YOU ARE RESPONSIBLE FOR ENSURING THE CORRECTNESS OF THE XML THAT YOU
ADD TO ERDDAP'S datasets.xml FILE.
EDDGridFromThreddsCatalog -
In general, the options in GenerateDatasetsXml generate a datasets.xml chunk
for one dataset from one specific data source. An exception to this is the
EDDGridFromThreddsCatalog option.
It generates all of the datasets.xml chunks needed for all of the EDDGridFromDap datasets
that it can find by crawling recursively through a THREDDS (sub) catalog.
There are many forms of THREDDS catalog URLs.
This option REQUIRES a THREDDS .xml URL with /catalog/ in it, for example,
http://oceanwatch.pfeg.noaa.gov/thredds/catalog/catalog.xml or
http://oceanwatch.pfeg.noaa.gov/thredds/catalog/Satellite/aggregsatMH/chla/catalog.xml
(note that the comparable .html catalog is at
http://oceanwatch.pfeg.noaa.gov/thredds/Satellite/aggregsatMH/chla/catalog.html ).
If you have problems with EDDGridFromThreddsCatalog:
- Make sure the URL you are using is valid, includes /catalog/,
and ends with /catalog.xml .
- If possible, use a public URL (e.g., with oceanwatch.pfeg.noaa.gov) in the
URL, not a private numeric IP address (e.g., with 12.34.56.78).
If the THREDDS is only accessible privately, you can use
<convertToPublicSourceUrl>
so ERDDAP users see the public URL, even though ERDDAP gets data from the
private URL.
- Look in the log file, [bigParentDirectory]/logs/log.txt, for error messages.
- Send an email to Bob with as much information as possible.
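As a sketch, a <convertToPublicSourceUrl> entry in datasets.xml maps the private source
URL to the public one that users should see (the addresses below reuse the examples
above; adapt them to your own servers):

```xml
<!-- ERDDAP reads data from the private numeric IP,
     but shows users the public hostname -->
<convertToPublicSourceUrl from="http://12.34.56.78/"
    to="http://oceanwatch.pfeg.noaa.gov/" />
```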
- DasDds is a command line program that you can use
after you have created a first attempt at the XML for a new dataset in datasets.xml.
With DasDds, you can repeatedly test and refine the XML.
When you use the DasDds program:
- DasDds asks you for the datasetID for the dataset you are working on.
- DasDds tries to create the dataset with that datasetID.
- It always prints lots of diagnostic messages.
- It always deletes all /dataset/ files for the dataset (for safety) before trying
to create the dataset. So for aggregated datasets, you might want to adjust the
fileNameRegex temporarily to limit the number of files the data constructor finds.
- If it fails (for whatever reason), it will show you the error message.
Read the diagnostic messages and the error message carefully.
Then you can make a change to the XML and let DasDds try to create the dataset again.
- If DasDds can create the dataset, DasDds will then show you the .das and .dds for the dataset
and put the information on the system clipboard. Often, you will want to make some small
change to the dataset's XML to clean up the dataset's metadata.
By going through this cycle repeatedly, you will eventually revise the dataset's XML
so that the dataset can be created and so that the dataset's metadata is as you want it to be.
If you use "DasDds -verbose", it will print more diagnostic messages than usual.
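For example, while testing an aggregated dataset with DasDds, you might temporarily
narrow the fileNameRegex so the data constructor only finds a few files (the pattern
below is hypothetical):

```xml
<!-- temporarily, while testing with DasDds: match only the 2004 files -->
<fileNameRegex>2004.*\.nc</fileNameRegex>
<!-- afterwards, restore the full pattern, e.g., .*\.nc -->
```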
The basic structure of the datasets.xml file is:
<?xml version="1.0" encoding="ISO-8859-1" ?>
<erddapDatasets>
<convertToPublicSourceUrl /> <!-- 0 or more -->
<requestBlacklist>...</requestBlacklist> <!-- 0 or 1 -->
<subscriptionEmailBlacklist>...</subscriptionEmailBlacklist> <!-- 0 or 1 -->
<user username="..." password="..." roles="..." /> <!-- 0 or more -->
<dataset>...</dataset> <!-- 1 or more -->
</erddapDatasets>
It is possible that other encodings will be allowed in the future, but for now, only ISO-8859-1 is recommended.
Working with the datasets.xml file is a non-trivial project.
Please read this entire web page carefully, especially these notes.
- Hint - It is often easier to generate the XML for a dataset by making a copy of a working
dataset description in datasets.xml and then modifying it.
- Encoding Special Characters - Since datasets.xml is an XML file, you need to encode "&", "<", and ">"
in any content as "&amp;", "&lt;", and "&gt;".
Wrong: <title>Time & Tides</title>
Right: <title>Time &amp; Tides</title>
- XML doesn't tolerate syntax errors. After you edit the datasets.xml file, it is a good idea
to verify that the result is well-formed XML by pasting the XML text into an XML checker like
RUWF.
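Since hand-escaping is error-prone, a quick sanity check can help. This hypothetical
shell function (not part of ERDDAP) escapes the three special characters; note that the
"&" replacement must come first, or it would re-escape the entities it just produced:

```shell
# Hypothetical helper (not part of ERDDAP): escape &, <, and > for XML content.
# The & substitution must run first, or it would corrupt the other entities.
escape_xml() {
  printf '%s' "$1" | sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g'
}

escape_xml 'Time & Tides'   # prints: Time &amp; Tides
```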
- Other Ways
To Diagnose Problems With Datasets
In addition to the two main Tools,
- log.txt
is a log file with all of ERDDAP's diagnostic messages.
- The Daily Report
has more information than the status page, including a list of datasets that
didn't load and the exceptions (errors) they generated.
- The Status Page
is a quick way to check ERDDAP's status from any web browser.
It includes a list of datasets that didn't load (although not the related exceptions) and
taskThread statistics (showing the progress of
EDDGridCopy and
EDDTableCopy datasets).
- If you get stuck, please send an email with the details to bob dot simons at noaa dot gov.
- The longitude, latitude, altitude, and time (LLAT) variable names are special.
- LLAT variables are made known to ERDDAP if the axis variable's (for EDDGrid datasets) or data
variable's (for EDDTable datasets) destinationName is "longitude", "latitude", "altitude", or "time".
- We strongly encourage you to use these standard names for these variables whenever possible.
If you don't use these special variable names, ERDDAP won't recognize their significance and, for example,
will make a graph instead of a map if the x axis variable is lon and the y axis variable is lat.
- Use the destinationNames "longitude" and "latitude" only if the units are degrees_east
and degrees_north, respectively.
If your data doesn't fit these requirements, use different variable names
(e.g., lonRadians, latRadians).
- Use the destinationName "altitude" only if the data is the distance above or below sea level.
Use
<altitudeMetersPerSourceUnit>
to convert the data to meters above sea level (e.g., use -1 for data that was
originally depth in meters).
If you know the
vertical datum,
please specify it in the metadata.
If your data doesn't fit these requirements,
use a different destinationName (e.g., aboveGround, depth,
distanceToBottom).
- Use the destinationName "time" only for variables that include the entire date+time
(or date, if that is all there is).
If, for example, there are separate columns for date and timeOfDay, don't use the variable name "time".
See units for a discussion of time units.
- ERDDAP will automatically add lots of metadata to LLAT variables (e.g., "ioos_category", "units",
and several standards-related attributes like "_CoordinateAxisType").
- ERDDAP will automatically, on-the-fly, add lots of global metadata related to the LLAT values
of the selected data subset (e.g., "geospatial_lon_min").
- Clients that support these metadata standards will be able to take advantage of the added metadata
to position the data in time and space.
- Clients will find it easier to generate queries that include LLAT variables because the variable's
names are the same in all relevant datasets.
- LLAT variables are treated specially by Make A Graph. For example, if the X Axis variable is
"longitude" and the Y Axis variable is "latitude", you will get a map (using a standard
projection, and with a land mask, political boundaries, etc.) instead of a graph.
- If you have longitude and latitude data expressed in different units and thus
with different destinationNames, e.g.,
lonRadians and latRadians, Make A Graph will make graphs (e.g., time series) instead of maps.
- The time variable and related timeStamp variables are unique in that they
always convert data values from the source's time format (whatever it is) into a numeric value
(seconds since 1970-01-01T00:00:00Z) or a String value (ISO 8601:2004(E) format), depending on the
situation.
- Note that ERDDAP does NOT follow the CF standard when converting "years since" and
"months since" time values to "seconds since". The CF standard defines a year as a
fixed, single value: 3.15569259747e7 seconds. And CF defines a month as year/12.
Unfortunately, most/all datasets that we have seen that use "years since" or
"months since" clearly intend the values to be calendar years or calendar months.
For example, "3 months since 1970-01-01" is clearly intended to mean 1970-04-01.
So, ERDDAP interprets "years since" and "months since" as calendar years and months,
and does not follow the strict CF standard.
- When a user requests time data, they can request it by specifying the time as a numeric value
(seconds since 1970-01-01T00:00:00Z) or a String value (ISO 8601:2004(E) format).
- See units for more information about time and timeStamp variables.
- ERDDAP has a utility to
Convert
a Numeric Time to/from a String Time.
- See How ERDDAP
Deals with Time.
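The numeric/String time duality above, and the calendar interpretation of "months
since", can be sketched with ordinary GNU date commands (the specific timestamp is just
an illustration):

```shell
# Numeric time (seconds since 1970-01-01T00:00:00Z) -> ISO 8601 String:
date -u -d '@1041379200' +%Y-%m-%dT%H:%M:%SZ    # prints: 2003-01-01T00:00:00Z

# ISO 8601 String -> numeric time:
date -u -d '2003-01-01T00:00:00Z' +%s           # prints: 1041379200

# Calendar-month arithmetic, as ERDDAP interprets "months since":
# "3 months since 1970-01-01" means 1970-04-01, not 3 * (year/12) seconds.
date -u -d '1970-01-01 + 3 months' +%Y-%m-%d    # prints: 1970-04-01
```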
- Why just two basic data structures?
- Since it is difficult for human clients and computer clients to deal with a complex set of
possible dataset structures, ERDDAP uses just two basic data structures:
a grid data structure (for gridded data) and a table data structure (for tabular data).
- Certainly, not all data can be expressed in these structures, but much of it can.
Tables, in particular, are very flexible data structures
(look at the success of relational database programs).
- This makes data queries easier to construct.
- This makes data responses have a simple structure, which makes it easier to serve the data
in a wider variety of standard file types (which often just support simple data structures).
This is the main reason that we set up ERDDAP this way.
- This, in turn, makes it very easy for us (or anyone) to write client software which works with all
ERDDAP datasets.
- This makes it easier to compare data from different sources.
- We are very aware that if you are used to working with data in other data structures
you may initially think that this approach is simplistic or insufficient.
But all data structures have tradeoffs. None is perfect.
Even the do-it-all structures have their downsides: working with them is complex and
the files can only be written or read with special software libraries.
If you accept ERDDAP's approach enough to try to work with it, you may find that it has its
advantages (notably the support for multiple file types that can hold the data responses).
The
ERDDAP slide show
(particularly the
data
structures slide)
talks a lot about these issues.
- And even if this approach sounds odd to you, most ERDDAP clients will never notice --
they will simply see that all of the datasets have a nice simple structure
and they will be thankful that they can get data from a wide variety of sources returned in a
wide variety of file formats.
- What if the grid variables in the source dataset DON'T share the same axis variables?
In EDDGrid datasets, all data variables MUST use (share) all of the axis variables.
So if a source dataset has some variables with one set of dimensions, and other variables with
a different set of dimensions, you will have to make two datasets in ERDDAP.
For example, you might make one ERDDAP dataset entitled "Some Title (at surface)" to hold variables
that just use [time][latitude][longitude] dimensions and make another ERDDAP dataset entitled
"Some Title (at depths)" to hold the variables that use [time][altitude][latitude][longitude].
Or perhaps you can change the data source to add a dimension with a single value (for example,
altitude=0) to make the variables consistent.
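A sketch of that split in datasets.xml (the dataset type, datasetIDs, and titles here
are hypothetical; "..." stands for the rest of each dataset description):

```xml
<!-- hypothetical: variables that use [time][latitude][longitude] -->
<dataset type="EDDGridFromDap" datasetID="someTitleSurface">
  <sourceUrl>...</sourceUrl>
  ...
</dataset>

<!-- hypothetical: variables that use [time][altitude][latitude][longitude] -->
<dataset type="EDDGridFromDap" datasetID="someTitleDepths">
  <sourceUrl>...</sourceUrl>
  ...
</dataset>
```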
- Projected Gridded Data
- Modelers (and others) often work with gridded data on various
non-cylindrical projections (e.g., conic, polar stereographic).
Some end users want the projected data so there is no loss of information.
For those clients, ERDDAP can serve the data, as is, if the ERDDAP administrator breaks the
original dataset into a few datasets, with each part including variables which share the same
axis variables.
Yes, that seems odd to the people involved and it is different from most OPeNDAP servers.
But ERDDAP emphasizes making the data available in many formats.
That is possible because ERDDAP uses/requires a more uniform data structure.
Although it is a little awkward (i.e., different than expected), ERDDAP can distribute the projected data.
[Yes, ERDDAP could have looser requirements for the data structure, but keep the requirements for
the output formats. But that would lead to confusion among many users, particularly newbies,
since many seemingly valid requests for data with different structures would be invalid
because the data wouldn't fit into the file type.
We keep coming back to the current system's design.]
Some end users want lat lon geographic data (plate carree) for ease-of-use in different situations.
For that, we encourage the ERDDAP administrator to re-project the data onto a geographic
(plate carree) projection and serve that form of the data as a different dataset.
Then both types of users are happy.
List of Dataset Types
Datasets fall into two categories. (Why?)
- EDDGrid datasets handle gridded data.
- In EDDGrid datasets, data variables are multi-dimensional arrays of data.
- There MUST be an axis variable for each dimension.
Axis variables MUST be specified in the order that the data variables use them.
- In EDDGrid datasets, all data variables MUST use (share) all of the axis variables.
(Why? What if they don't?)
- See the more complete description of the
EDDGrid data model.
- The EDDGrid dataset types are:
- EDDTable datasets handle tabular data.
- Tabular data can be represented as a table with rows and columns.
Each column (a data variable) has a name, a set of attributes, and stores just one type of data.
- See the more complete description of the
EDDTable data model.
- The EDDTable dataset types are:
Detailed Descriptions of Dataset Types
EDDGridFromDap handles grid variables from
DAP servers.
EDDGridFromErddap handles gridded data from
a remote ERDDAP server.
EDDTableFromErddap handles tabular data from
a remote ERDDAP server.
- EDDGridFromErddap and EDDTableFromErddap behave differently from all other types of datasets in ERDDAP.
- Like other types of datasets, these datasets get information about the dataset from the source
and keep it in memory.
- Like other types of datasets, when ERDDAP searches for datasets, displays the Data Access Form,
or displays the Make A Graph form, ERDDAP uses the information about the dataset which is in memory.
- Unlike other types of datasets, when ERDDAP receives a request for data or images from these datasets,
ERDDAP
redirects
the request to the remote ERDDAP server. The result is:
- This is very efficient (CPU, memory, and bandwidth), because otherwise
- The composite ERDDAP has to send the request to the other ERDDAP (which takes time).
- The other ERDDAP has to get the data, reformat it, and transmit the data to the composite
ERDDAP.
- The composite ERDDAP has to receive the data (using bandwidth), reformat it (using CPU and
memory), and transmit the data to the user (using bandwidth).
By redirecting the request and allowing the other ERDDAP to send the response directly to the user,
the composite ERDDAP spends essentially no CPU time, memory, or bandwidth on the request.
- The redirect is transparent to the user regardless of the client software (a browser or any
other software or command line tool).
- Normally, when an EDDGridFromErddap or EDDTableFromErddap dataset is (re)loaded on your ERDDAP, it tries
to add a subscription to the remote dataset via the remote ERDDAP's email/URL subscription system.
That way, whenever the remote dataset changes, the remote ERDDAP contacts the
setDatasetFlag URL
on your ERDDAP so that the local dataset is reloaded ASAP and so that the local dataset always
mimics the remote dataset.
So, the first time this happens, you should get an email requesting that you validate the subscription.
However, if the local ERDDAP can't send an email or if the remote ERDDAP's email/URL subscription
system isn't active, you should email the remote ERDDAP administrator and request that s/he manually
add
<onChange>...</onChange>
tags to all of the relevant datasets to call your dataset's
setDatasetFlag URLs.
See your ERDDAP daily report for a list of setDatasetFlag URLs, but just send the ones for
EDDGridFromErddap and EDDTableFromErddap datasets to the remote ERDDAP administrator.
- EDDGridFromErddap and EDDTableFromErddap are the basis for
clusters and federations
of ERDDAPs, which efficiently distribute the CPU usage (mostly for making maps), memory usage,
dataset storage, and bandwidth usage of a large data center.
- EDDGridFromErddap and EDDTableFromErddap can't be used with remote datasets that require logging in
(because they use <accessibleTo>).
- For security reasons, EDDGridFromErddap and EDDTableFromErddap don't support the
<accessibleTo> tag.
See ERDDAP's
security system
for restricting access to some datasets to some users.
- The skeleton XML for an EDDGridFromErddap dataset
is very simple, because the intent is just to mimic
the remote dataset which is already suitable for use in ERDDAP:
<dataset type="EDDGridFromErddap" datasetID="..." active="..." >
<sourceUrl>...</sourceUrl>
<reloadEveryNMinutes>...</reloadEveryNMinutes>
<fgdcFile>...</fgdcFile> <!-- 0 or 1 -->
<iso19115File>...</iso19115File> <!-- 0 or 1 -->
<onChange>...</onChange> <!-- 0 or more -->
</dataset>
- The skeleton XML for an EDDTableFromErddap dataset
is very simple, because the intent is just to mimic
the remote dataset, which is already suitable for use in ERDDAP:
<dataset type="EDDTableFromErddap" datasetID="..." active="..." >
<sourceUrl>...</sourceUrl>
<reloadEveryNMinutes>...</reloadEveryNMinutes>
<fgdcFile>...</fgdcFile> <!-- 0 or 1 -->
<iso19115File>...</iso19115File> <!-- 0 or 1 -->
<onChange>...</onChange> <!-- 0 or more -->
</dataset>
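For example, a filled-in EDDGridFromErddap chunk needs little more than the remote
dataset's griddap URL (the datasetID and sourceUrl below are hypothetical):

```xml
<!-- hypothetical datasetID and sourceUrl -->
<dataset type="EDDGridFromErddap" datasetID="localSstMirror" active="true" >
  <sourceUrl>http://someRemoteServer/erddap/griddap/remoteSstDatasetID</sourceUrl>
  <reloadEveryNMinutes>10080</reloadEveryNMinutes>
</dataset>
```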
EDDGridFromEtopo just serves the
ETOPO1 Global 1-Minute Gridded Elevation Data Set
(Ice Surface, grid registered, binary, 2-byte int: etopo1_ice_g_i2.zip) which is
distributed with ERDDAP.
EDDGridFromFiles is the superclass of all
EDDGridFrom...Files classes.
You can't use EDDGridFromFiles directly.
Instead, use a subclass of EDDGridFromFiles to handle the specific file type:
Currently, no other file types are supported.
But it is usually relatively easy to support other file types. Contact us if you have requests.
Or, if your data is in an old file format that you would like to move away from,
we recommend converting the files to be NetCDF .nc files. NetCDF is a widely supported format,
allows fast random access to the data, and is already supported by ERDDAP.
Details - The following information applies to all of the subclasses of EDDGridFromFiles.
- Aggregation - This class aggregates data from local files.
The resulting dataset appears as if all of the files' data had been combined.
The local files all MUST have the same dataVariables (as defined in the dataset's datasets.xml).
All of the dataVariables MUST use the same axisVariables/dimensions (as defined in the dataset's
datasets.xml).
The files will be aggregated based on the first (left-most) dimension, sorted in ascending order.
Each file MAY have data for one or more values of the first dimension, but there can't be any
overlap between files.
If a file has more than one value for the first dimension, the values MUST be sorted in ascending
order, with no ties.
All files MUST have exactly the same values for all of the other dimensions.
All files MUST have exactly the same units metadata for all axisVariables
and dataVariables.
For example, the dimensions might be [time][altitude][latitude][longitude], and the files might
have the data for one time (or more) value(s) per file.
The big advantages of aggregation are:
- The aggregated dataset can be much larger than a single file can conveniently be (~2GB).
- For near-real-time data, it is easy to add a new file with the latest chunk of data.
You don't have to rewrite the entire dataset.
- Directories - The files MAY be in one directory, or in a directory and its subdirectories (recursively).
Note that if there are a large number of files (e.g., >1000), the operating system (and thus
EDDGridFromFiles) will operate much more efficiently if you store the files in a series of
subdirectories.
- Cached File Information - When an EDDGridFromFiles dataset is first loaded,
EDDGridFromFiles reads information from all of the relevant files
and creates tables in memory with information about each valid file and each invalid file
(one file per row).
- The tables are also stored on disk, as .json files in [bigParentDirectory]/dataset in
files named:
[datasetID].dirs.json (which holds a list of unique directory names),
[datasetID].files.json (which holds the table with each valid file's information),
[datasetID].bad.json (which holds the table with each bad file's information).
- The copy of the file information tables on disk is also useful when ERDDAP is shut down and restarted:
it saves EDDGridFromFiles from having to re-read all of the data files.
- You shouldn't ever need to delete or work with these files.
If you ever do need to delete them (why?), you can do so while ERDDAP is running.
(Then set a flag.)
- If you want to encourage ERDDAP to update the stored dataset information
(for example, if you just added, removed, or changed some files to the dataset's data directory),
use the
flag system
to force ERDDAP to update the cached file information.
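The flag mechanism itself is just a file: create an empty file whose name is the
datasetID in [bigParentDirectory]/flag, and ERDDAP will reload that dataset ASAP. A
sketch (a temporary directory stands in for your real bigParentDirectory, and
myDatasetID is hypothetical):

```shell
# A temporary directory stands in for your real bigParentDirectory.
bigParentDirectory=$(mktemp -d)
mkdir -p "$bigParentDirectory/flag"

# Setting the flag: an empty file named after the (hypothetical) datasetID.
touch "$bigParentDirectory/flag/myDatasetID"

ls "$bigParentDirectory/flag"    # prints: myDatasetID
```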
- Handling Requests -
When a client's request for data is processed, EDDGridFromFiles can quickly look
in the table with the valid file information to see which files have the requested data.
- Updating the Cached File Information - Whenever the dataset is reloaded, the cached file
information is updated.
- The dataset is reloaded periodically as determined by the
<reloadEveryNMinutes>
in the dataset's information in datasets.xml.
- The dataset is reloaded as soon as possible whenever ERDDAP detects that you have added,
removed, touch'd
(to change the file's lastModified time), or changed a data file.
- The dataset is reloaded as soon as possible if you use the
flag system.
When the dataset is reloaded, ERDDAP compares the currently available files to the cached file
information tables.
New files are read and added to the valid files table.
Files that no longer exist are dropped from the valid files table.
Files where the file timestamp has changed are read and their information is updated.
The new tables replace the old tables in memory and on disk.
- Bad Files - The table of bad files and the reasons the files were declared bad (corrupted file,
missing variables, etc.) is emailed to the emailEverythingTo email address (probably you)
every time the dataset is reloaded. You should replace or repair these files as soon as possible.
- FTP Trouble/Advice - If you FTP new data files to the ERDDAP server while ERDDAP is running,
there is a chance that ERDDAP will be reloading the dataset during the FTP process.
It happens more often than you might think!
If it happens, the file will appear to be valid (it has a valid name), but the file isn't yet valid.
If ERDDAP tries to read data from that invalid file, the resulting error will cause the file
to be added to the table of invalid files.
This is not good.
To avoid this problem, use a temporary file name when FTP'ing the file, e.g., ABC2005.nc_TEMP .
Then, the fileNameRegex test (see below) will indicate that this is not a relevant file.
After the FTP process is complete, rename the file to the correct name.
The renaming process will cause the file to become relevant in an instant.
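The temporary-name trick above, sketched with local commands (a temporary directory
stands in for the dataset's fileDir, and the file name is hypothetical):

```shell
# A temporary directory stands in for the dataset's fileDir.
fileDir=$(mktemp -d)

# 1) Transfer under a name that the fileNameRegex (e.g., ".*\.nc") won't match:
printf 'stand-in file contents' > "$fileDir/ABC2005.nc_TEMP"

# 2) When the transfer is complete, rename; the file becomes relevant instantly:
mv "$fileDir/ABC2005.nc_TEMP" "$fileDir/ABC2005.nc"
```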
- The skeleton XML for all EDDGridFromFiles subclasses is:
<dataset type="EDDGridFrom...Files" datasetID="..." active="..." >
<accessibleTo>...</accessibleTo> <!-- 0 or 1 -->
<reloadEveryNMinutes>...</reloadEveryNMinutes>
<fgdcFile>...</fgdcFile> <!-- 0 or 1 -->
<iso19115File>...</iso19115File> <!-- 0 or 1 -->
<onChange>...</onChange> <!-- 0 or more -->
<altitudeMetersPerSourceUnit>...</altitudeMetersPerSourceUnit>
<fileDir>...</fileDir> <!-- The directory (absolute) with the data files. -->
<recursive>true|false</recursive> <!-- Indicates if subdirectories
of fileDir have data files, too. -->
<fileNameRegex>...</fileNameRegex> <!-- A regular expression
(tutorial) describing valid data file names,
e.g., ".*\.nc" for all .nc files. -->
<metadataFrom>...</metadataFrom> <!-- The file to get
metadata from ("first" or "last" (the default), based on the file's
lastModifiedTime). -->
<addAttributes>...</addAttributes>
<axisVariable>...</axisVariable> <!-- 1 or more -->
<dataVariable>...</dataVariable> <!-- 1 or more -->
</dataset>
EDDGridFromNcFiles aggregates data from local, gridded
GRIB .grb and .grb2 files,
HDF 4 .hdf files (and HDF 5?),
and NetCDF .nc files.
This may work with other file types (e.g., BUFR); we just haven't tested it -- please send us
some sample files.
Note that for GRIB files, ERDDAP will make a .gbx index file the first time it reads each GRIB file.
So the GRIB files must be in a directory where the "user" that ran Tomcat has read+write permission.
See this class' superclass, EDDGridFromFiles, for information
on how to use this class and how this class works.
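A filled-in sketch of an EDDGridFromNcFiles chunk (the datasetID, fileDir, and elided
attribute/variable tags are hypothetical; see the EDDGridFromFiles skeleton above for
the full list of tags):

```xml
<!-- hypothetical datasetID and fileDir -->
<dataset type="EDDGridFromNcFiles" datasetID="mySstFiles" active="true" >
  <reloadEveryNMinutes>10080</reloadEveryNMinutes>
  <fileDir>/u00/data/sst/</fileDir>
  <recursive>true</recursive>
  <fileNameRegex>.*\.nc</fileNameRegex>
  <metadataFrom>last</metadataFrom>
  <addAttributes>...</addAttributes>
  <axisVariable>...</axisVariable>
  <dataVariable>...</dataVariable>
</dataset>
```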
EDDGridSideBySide
aggregates two or more EDDGrid datasets (the children) side by side.
EDDGridAggregateExistingDimension
aggregates two or more EDDGrid datasets based on different
values of the first dimension.
- For example, one child dataset might have 366 values (for 2004) for the time dimension and another
child might have 365 values (for 2005) for the time dimension.
- All the values for all of the other dimensions (e.g., latitude, longitude) MUST be identical
for all of the children.
- The parent dataset and the child dataset MUST have different datasetIDs.
If any names in a family are exactly the same, the dataset will fail to load
(with the error message that the values of the aggregated axis are not in sorted order).
- Currently, the child dataset MUST be an EDDGridFromDap dataset and MUST have
the lowest values of the aggregated dimension (usually the oldest time values).
All of the other children MUST be almost identical datasets (differing just in the values
for the first dimension) and are specified by just their sourceUrl.
- The aggregate dataset gets its metadata from the first child.
- ensureAxisValuesAreEqual - This tag is OPTIONAL.
If true (the default), the non-first-axis values MUST be exactly equal in all children.
If false, some minor variation is allowed (for example, it would allow 0.1 in one child and
0.1000000002 in another).
Only use false if you need to and if you are certain that the variation that is present
is acceptable to you.
- The GenerateDatasetsXml program can make a rough draft of the
datasets.xml for an
EDDGridAggregateExistingDimension based on a set of files served by a Hyrax or THREDDS server.
For example, use this input for the program (the "/1988" in the URL makes the example run faster):
EDDType? EDDGridAggregateExistingDimension
Server type (hyrax or thredds)? hyrax
Parent URL (e.g., for hyrax, ending in "contents.html";
for thredds, ending in "catalog.xml")
? http://dods.jpl.nasa.gov/opendap/ocean_wind/ccmp/L3.5a/data/
flk/1988/contents.html
File name regex (e.g., ".*\.nc")? month.*flk\.nc\.gz
ReloadEveryNMinutes (e.g., 10080)? 10080
You can use the resulting <sourceUrl> tags, or
delete them and uncomment the <sourceUrls> tag
(so that new files are noticed each time
the dataset is reloaded).
- The skeleton XML for an
EDDGridAggregateExistingDimension dataset is:
<dataset type="EDDGridAggregateExistingDimension" datasetID="..."
active="..." >
<dataset>...</dataset> <!-- This is a regular EDDGridFromDap
dataset description child with the lowest values for the aggregated dimensions. -->
<sourceUrl>...</sourceUrl> <!-- 0 or many; the sourceUrls for
other children. These children must be listed in order of ascending values
for the aggregated dimension. -->
<sourceUrls serverType="..." regex="..." recursive="true"
>http://someServer/thredds/someSubdirectory/catalog.xml</sourceUrls>
<!-- 0 or 1. This specifies how to find the other children, instead
of using separate sourceUrl tags for each child. The advantage of this
is: new children will be detected each time the dataset is reloaded.
The serverType must be "thredds" or "hyrax".
An example of a regular expression (regex) (tutorial) is .*\.nc
recursive can be "true" or "false".
An example of a thredds catalogUrl is
http://thredds1.pfeg.noaa.gov/thredds/catalog/Satellite/aggregsatMH/chla/catalog.xml
An example of a hyrax catalogUrl is
http://podaac-opendap.jpl.nasa.gov/opendap/allData/ccmp/L3.5a/monthly/flk/1988/contents.html
When these children are sorted by file name, they must be in order of
ascending values for the aggregated dimension. -->
<ensureAxisValuesAreEqual>true (the default) or
false</ensureAxisValuesAreEqual>
<accessibleTo>...</accessibleTo> <!-- 0 or 1 -->
<fgdcFile>...</fgdcFile> <!-- 0 or 1 -->
<iso19115File>...</iso19115File> <!-- 0 or 1 -->
<onChange>...</onChange> <!-- 0 or more -->
</dataset>
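As a concrete sketch (the datasetIDs, URLs, and regex below are hypothetical placeholders, not a working configuration), an EDDGridAggregateExistingDimension entry that aggregates along time might look like:

```xml
<dataset type="EDDGridAggregateExistingDimension" datasetID="myWindAggregate" active="true">
  <!-- The one fully-described child: an EDDGridFromDap dataset with the
       lowest (usually oldest) values of the aggregated dimension. -->
  <dataset type="EDDGridFromDap" datasetID="myWind1988">
    <sourceUrl>http://someServer/opendap/wind/1988/month19880101.nc.gz</sourceUrl>
    ...
  </dataset>
  <!-- Let ERDDAP find the other children on the server each time the
       dataset is reloaded, instead of listing each one with <sourceUrl>. -->
  <sourceUrls serverType="hyrax" regex="month.*\.nc\.gz" recursive="true"
    >http://someServer/opendap/wind/contents.html</sourceUrls>
</dataset>
```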
EDDGridCopy makes and maintains a local copy of another EDDGrid's data and
serves data from the local copy.
- EDDGridCopy (and for tabular data, EDDTableCopy)
is a very easy-to-use and very effective
solution to some of the biggest problems with serving data from remote data sources:
- Accessing data from a remote data source can be slow: the source may be
inherently slow (e.g., an inefficient type of server), it may be overwhelmed
by too many requests, or your server or the remote server may be bandwidth limited.
- The remote dataset is sometimes unavailable (again, for a variety of reasons).
- Relying on one source for the data doesn't scale well (e.g., when many users and many ERDDAPs
utilize it).
- How It Works - EDDGridCopy solves these problems by automatically making and maintaining
a local copy of the data and serving data from the local copy.
ERDDAP can serve data from the local copy very, very quickly.
And making a local copy relieves the burden on the remote server.
And the local copy is a backup of the original, which is useful in case something happens to the
original.
There is nothing new about making a local copy of a dataset. What is new here is that this class
makes it *easy* to create and *maintain* a local copy of data from a *variety* of types
of remote data sources and *add metadata* while copying the data.
- Chunks of Data - EDDGridCopy makes the local copy of the data by requesting chunks of data from the
remote <dataset> .
There will be a chunk for each value of the leftmost axis variable.
Note that EDDGridCopy doesn't rely on the remote dataset's index numbers for the axis -- those
may change.
WARNING: If the size of a chunk of data is so big that it causes problems (> 1GB?), EDDGridCopy
can't be used. (Sorry, we hope to have a solution for this problem in the future.)
- Local Files - Each chunk of data is stored in a separate netCDF file in a subdirectory of
[bigParentDirectory]/copy/datasetID/ (as specified in
setup.xml).
File names created from axis values are modified to make them file-name-safe
(e.g., hyphens are replaced by "x2D") -- this doesn't affect the actual data.
- New Data - Each time EDDGridCopy is reloaded, it checks the remote <dataset> to see what
chunks are available.
If the file for a chunk of data doesn't already exist, a request to get the chunk is added to a queue.
ERDDAP's taskThread processes all the queued requests for chunks of data,
one-by-one.
You can see statistics for the taskThread's activity on the
Status Page and in the
Daily Report.
(Yes, ERDDAP could assign multiple tasks to this process, but that would use up lots of the
remote data source's bandwidth, memory, and CPU time, and lots of the local ERDDAP's bandwidth,
memory, and CPU time, neither of which is a good idea.)
NOTE: The very first time an EDDGridCopy is loaded, (if all goes well) lots of requests for chunks
of data will be added to the taskThread's queue, but no local data files will have been created.
So the constructor will fail but taskThread will continue to work and create local files.
If all goes well, the taskThread will make some local data files and the next attempt to
reload the dataset (in ~15 minutes) will succeed, but initially with a very limited amount of data.
WARNING: If the remote dataset is large and/or the remote server is slow
(that's the problem, isn't it?!), it will take a long time to make a complete local copy.
In some cases, the time needed will be unacceptable.
For example, transmitting 1 TB of data over a T1 line (1.544 Mbit/s, about 0.19 MB/s) takes at least 60 days,
under optimal conditions.
Plus, it uses lots of bandwidth, memory, and CPU time on the remote and local computers.
The solution is to mail a hard drive to the administrator of the remote data set so that
s/he can make a copy of the dataset and mail the hard drive back to you.
Use that data as a starting point and EDDGridCopy will add data to it.
(That is one way that Amazon's EC2 Cloud Service
handles the problem, even though their system
has lots of bandwidth.)
WARNING: If a given value for the leftmost axis variable disappears from the remote dataset,
EDDGridCopy does NOT delete the local copied file. If you want to, you can delete it yourself.
- Recommended use -
- Create the <dataset> entry (the native type, not EDDGridCopy)
for the remote data source.
Get it working correctly, including all of the desired metadata.
- If it is too slow, add XML code to wrap it in an EDDGridCopy dataset.
- Use a different datasetID (perhaps by changing the old datasetID slightly).
- Copy the <accessibleTo>, <reloadEveryNMinutes> and
<onChange> from the
remote EDDGrid's XML to the EDDGridCopy's XML.
(Their values for EDDGridCopy matter; their values for the inner dataset become irrelevant.)
- ERDDAP will make and maintain a local copy of the data.
- WARNING: EDDGridCopy assumes that the data values for each chunk don't ever change.
If/when they do, you need to manually delete the chunk files in
[bigParentDirectory]/copy/datasetID/
which changed and flag
the dataset to be reloaded so that the deleted chunks will be replaced.
If you have an email subscription to the dataset, you will get two emails:
one when the dataset first reloads and starts to copy the data,
and another when the dataset loads again (automatically) and detects the new local data files.
- Change Metadata - If you need to change any addAttributes or change the order of the variables
associated with the source dataset:
- Change the addAttributes for the source dataset in datasets.xml, as needed.
- Delete one of the copied files.
- Set a flag
to reload the dataset immediately.
If you do use a flag and you have an email subscription to the dataset, you will get two emails:
one when the dataset first reloads and starts to copy the data,
and another when the dataset loads again (automatically) and detects the new local data files.
- The deleted file will be regenerated with the new metadata.
If the source dataset is ever unavailable, the EDDGridCopy dataset will get metadata
from the regenerated file, since it is the youngest file.
- Skeleton XML - The skeleton XML for an EDDGridCopy dataset is:
<dataset type="EDDGridCopy" datasetID="..." active="..." >
<accessibleTo>...</accessibleTo> <!-- 0 or 1 -->
<reloadEveryNMinutes>...</reloadEveryNMinutes>
<fgdcFile>...</fgdcFile> <!-- 0 or 1 -->
<iso19115File>...</iso19115File> <!-- 0 or 1 -->
<onChange>...</onChange> <!-- 0 or more -->
<sourceNeedsExpandedFP_EQ>true (default)|false</sourceNeedsExpandedFP_EQ>
<dataset>...</dataset> <!-- 1 -->
</dataset>
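For example (a minimal sketch; the datasetIDs and values below are hypothetical), wrapping a working EDDGridFromDap entry in an EDDGridCopy might look like:

```xml
<dataset type="EDDGridCopy" datasetID="mySSTCopy" active="true">
  <!-- accessibleTo, reloadEveryNMinutes, and onChange belong here now;
       their values on the inner dataset become irrelevant. -->
  <reloadEveryNMinutes>1440</reloadEveryNMinutes>
  <!-- The original, already-working dataset entry, with its datasetID
       changed slightly so that the two datasetIDs differ. -->
  <dataset type="EDDGridFromDap" datasetID="mySSTSource">
    <sourceUrl>http://someRemoteServer/opendap/sst</sourceUrl>
    ...
  </dataset>
</dataset>
```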
EDDTableFromDapSequence handles variables within
1- and 2-level sequences from
DAP servers such as
DAPPER.
EDDTableFromDatabase handles data from one
database table or view.
- If the data you want to serve is in two or more tables (and needs a JOIN to extract data),
you need to make a new table (or a view) with the JOINed/flattened information.
Contact your database administrator.
- You must get the appropriate JDBC 3 or JDBC 4 driver .jar file and put it in
[tomcat]/webapps/erddap/WEB-INF/lib after you install ERDDAP.
For Postgresql, we got the JDBC 4 driver from
http://jdbc.postgresql.org
and we use "org.postgresql.Driver" for the <driverName>
in datasets.xml (see below).
For SQL Server, you can get the JTDS JDBC driver from
http://jtds.sourceforge.net
and use "net.sourceforge.jtds.jdbc.Driver" for the <driverName>.
- You can gather most of the information you need to create the XML for an EDDTableFromDatabase
dataset by contacting the database administrator and by searching the web.
The <driverName>, driver .jar file, <connectionProperty>
names (e.g., "user",
"password", and "ssl"), and some of the connectionProperty values can be found by searching
the web for "JDBC connection properties databaseType" (e.g., Oracle, MySQL, PostgreSQL).
- It is difficult to create the correct datasets.xml information needed for ERDDAP to establish
a connection to the database. Be patient. Be methodical.
Search the web for examples of using JDBC to connect to your type of database.
Work closely with the database administrator, who may have relevant experience.
If the dataset fails to load, read the
error message carefully to find out why.
- Database Date Time Data -
Some database date time columns have no explicit time zone.
Such columns are trouble for ERDDAP.
Databases support the concept of a date (with or without a time) without a time zone, as an
approximate range of time.
But Java (and thus ERDDAP) only deals with instantaneous date+times with a time zone.
So you may know that the date time data is based on a local time zone (with or without daylight savings)
or the GMT/Zulu time zone, but Java (and ERDDAP) don't.
We originally thought we could work around this problem (e.g., by specifying a time zone for the
column), but the database+JDBC+Java interactions made this an unreliable solution.
- So, ERDDAP requires that you store all date and date time data in the database table
with a database data type that corresponds to the JDBC type "timestamp with time zone"
(ideally, that uses the GMT/Zulu time zone).
- In ERDDAP's datasets.xml, in the <dataVariable> tag for this variable, set
<dataType>double</dataType>
and in <addAttributes> set
<att name="units">seconds since 1970-01-01T00:00:00Z</att> .
- Suggestion: If the data is a time range, it is useful to have the timestamp values refer to
the center of the implied time range (e.g., noon).
For example, if a user has other data for 2010-03-26T13:00Z and they want the closest database
data, then the data for 2010-03-26T12:00Z (representing data for that date) is obviously the best
(as opposed to the midnight before or after, where it is less obvious which is best).
- ERDDAP has a utility to
Convert
a Numeric Time to/from a String Time.
- See How ERDDAP
Deals with Time.
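Putting the datasets.xml pieces above together, the <dataVariable> for a timestamp column might look like this (the sourceName is a hypothetical database column name):

```xml
<dataVariable>
  <!-- A hypothetical column stored as "timestamp with time zone" -->
  <sourceName>observation_time</sourceName>
  <destinationName>time</destinationName>
  <dataType>double</dataType>
  <addAttributes>
    <att name="units">seconds since 1970-01-01T00:00:00Z</att>
  </addAttributes>
</dataVariable>
```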
- Security - When working with databases, you need to do things as safely and securely as possible
to avoid allowing a malicious user to damage your database or gain access to data they shouldn't
have access to.
ERDDAP tries to do things in a secure way, too.
- Consider replicating, on a different computer, the database and database tables
with the data that you want ERDDAP to serve.
(Yes, for commercial databases like Oracle, this involves
additional licensing fees. But for open source databases, like PostgreSQL and MySQL,
this costs nothing.) This gives you a high level of security and also
prevents ERDDAP requests from slowing down the original database.
- We encourage you to set up ERDDAP to connect to the database as a database user that only has
access to the relevant database(s) and only has READ privileges.
- We encourage you to set up the connection from ERDDAP to the database so that it
- always uses SSL,
- only allows connections from one IP address (or one block of addresses) and from the one
ERDDAP user, and
- only transfers passwords in their MD5 hashed form.
- [KNOWN PROBLEM]The connectionProperties (including the password!) are stored as plain text
in datasets.xml.
Only the administrator should have READ and WRITE access to this file!
No other users of the computer should have READ or WRITE access to this file!
We haven't found a way to allow the administrator to enter the database password during
ERDDAP's startup in Tomcat (which occurs without user input), so the password must be
accessible in a file.
- When in ERDDAP, the password and other connection properties are stored in "private"
Java variables.
- Requests from clients are parsed and checked for validity before generating the SQL requests
for the database.
- Requests to the database are made with SQL PreparedStatements, to prevent SQL injection.
- Requests to the database are submitted with executeQuery (not executeStatement) to limit
requests to be read-only (so attempted SQL injection to alter the database will fail for this
reason, too).
- SQL - It is easy for ERDDAP to convert user requests into simple SQL PreparedStatements.
For example, the ERDDAP request
time,temperature&time>=2008-01-01T00:00:00&time<=2008-02-01T00:00:00
will be converted into the SQL PreparedStatement
SELECT time, temperature FROM tableName WHERE time >= 2008-01-01T00:00:00 AND
time <= 2008-02-01T00:00:00
ERDDAP requests with &distinct() and/or &orderBy(variables) will add
DISTINCT and/or ORDER BY variables to the SQL prepared statement. In general, this will greatly slow down the response from the database.
ERDDAP logs the PreparedStatement in
log.txt
as
statement=thePreparedStatement.
- Views - EDDTableFromDatabase is limited to getting data from one table, but that shouldn't be a problem.
If a table of interest has foreign keys which link to other tables,
we recommend that you ask the database administrator to create a
VIEW.
Views "can join and simplify multiple tables into a single virtual table" (Wikipedia).
Views are great because:
- They simplify queries (since the queries don't have to specify the JOINs, etc.).
- They are efficient (since the database just has to set it up once).
- They increase abstraction (since the database can be changed without having to change how the
VIEW appears to the client).
- Speed - If speed is a problem:
- Set the Fetch Size - Databases return the data to ERDDAP in chunks.
By default, different databases return a different number of rows in the chunks.
Often this number is very small and so very inefficient.
For example, the default for Oracle is 10!
Read the JDBC documentation for your database's JDBC driver
to find the connection property to set
in order to increase this, and add this to the dataset's description
in datasets.xml. For example,
For MySQL, use
<connectionProperty name="defaultFetchSize">4096</connectionProperty>
For Oracle, use
<connectionProperty name="defaultRowPrefetch">4096</connectionProperty>
For PostgreSQL, use
<connectionProperty name="defaultFetchSize">4096</connectionProperty>
but feel free to change the number. Note that setting the number too big will
cause ERDDAP to use lots of memory and be more likely to run out of memory.
- ConnectionProperties - Each database has other connection properties which
can be specified in datasets.xml. Many of these will affect the performance
of the ERDDAP to database connection. Please read the documentation for
your database's JDBC driver to see the options.
If you find connection properties that are useful, please
send an email with the details to bob dot simons at noaa dot gov.
- Make a Table - You will probably get faster responses if you periodically
(every day? whenever there is new data?) generate an actual table (similarly to how you
generated the VIEW) and tell ERDDAP to get data from the table instead of the VIEW.
Since any request to the table can then be
fulfilled without JOINing another table, the response will be much faster.
- Optimize/Vacuum the Table -
MySQL will respond much faster if you use
OPTIMIZE TABLE.
PostgreSQL will respond much faster if you
VACUUM the table.
Oracle doesn't have or need an analogous command.
- Connection Pooling -
ERDDAP currently doesn't use connection pooling. ERDDAP makes a new connection
to the database for each SQL query that it sends to the database.
This adds about 0.1 seconds per request (sometimes longer, e.g., for remote databases),
but is a more robust and safe approach.
We may add optional connection pooling in the future.
- If all else fails, consider storing the data in a collection of .nc files.
If they are logically organized (each with data for a chunk of space and time),
ERDDAP can extract data from them very quickly.
- The skeleton XML for an EDDTableFromDatabase dataset is:
<dataset type="EDDTableFromDatabase" datasetID="..." active="..." >
<sourceUrl>...</sourceUrl>
<!-- Put the database name at the end, for example,
"jdbc:postgresql://123.45.67.89:5432/databaseName". REQUIRED. -->
<driverName>...</driverName>
<!-- The high-level name of the database driver, e.g.,
"org.postgresql.Driver". You need to put the actual database
driver .jar file (for example, postgresql.jdbc.jar) in
[tomcat]/webapps/erddap/WEB-INF/lib. REQUIRED. -->
<connectionProperty name="name">value</connectionProperty>
<!-- The names (e.g., "user", "password", and "ssl") and values
of the properties needed for ERDDAP to establish the connection
to the database. 0 or more. -->
<catalogName>...</catalogName>
<!-- The name of the catalog which has the schema which has the
table, default = "". OPTIONAL. -->
<schemaName>...</schemaName> <!-- The name of the
schema which has the table, default = "". OPTIONAL. -->
<tableName>...</tableName> <!-- The name of the
table, default = "". REQUIRED. -->
<orderBy>...</orderBy> <!-- A comma-separated list of
sourceNames to be used in an ORDER BY clause at the end of
every query sent to the database (unless the user's request
includes an &orderBy() filter, in which case the user's
orderBy is used). The order of the sourceNames is important.
The leftmost sourceName is most important; subsequent
sourceNames are only used to break ties. Only relevant
sourceNames are included in the ORDER BY clause for a given user
request. If this is not specified, the order of the returned
values is not specified. Default = "". OPTIONAL. -->
<sourceNeedsExpandedFP_EQ>true (default)|false</sourceNeedsExpandedFP_EQ>
<accessibleTo>...</accessibleTo> <!-- 0 or 1 -->
<reloadEveryNMinutes>...</reloadEveryNMinutes>
<fgdcFile>...</fgdcFile> <!-- 0 or 1 -->
<iso19115File>...</iso19115File> <!-- 0 or 1 -->
<onChange>...</onChange> <!-- 0 or more -->
<altitudeMetersPerSourceUnit>...</altitudeMetersPerSourceUnit>
<addAttributes>...</addAttributes>
<dataVariable>...</dataVariable> <!-- 1 or more.
For date and timestamp database columns, set dataType=double and
units=seconds since 1970-01-01T00:00:00Z -->
</dataset>
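To make the skeleton concrete, here is a hypothetical minimal entry for a PostgreSQL table (every name, address, and property value below is a placeholder to adapt, not a recommendation):

```xml
<dataset type="EDDTableFromDatabase" datasetID="myStationObs" active="true">
  <sourceUrl>jdbc:postgresql://123.45.67.89:5432/observations</sourceUrl>
  <driverName>org.postgresql.Driver</driverName>
  <connectionProperty name="user">erddap_reader</connectionProperty>
  <connectionProperty name="password">thePassword</connectionProperty>
  <connectionProperty name="ssl">true</connectionProperty>
  <connectionProperty name="defaultFetchSize">4096</connectionProperty>
  <schemaName>public</schemaName>
  <tableName>station_obs</tableName>
  <orderBy>station_id, time</orderBy>
  <reloadEveryNMinutes>60</reloadEveryNMinutes>
  <addAttributes>...</addAttributes>
  <dataVariable>...</dataVariable> <!-- one per database column served -->
</dataset>
```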
EDDTableFromFiles is the superclass of all
EDDTableFrom...Files classes.
You can't use EDDTableFromFiles directly.
Instead, use a subclass of EDDTableFromFiles to handle the specific file type:
- EDDTableFromAsciiFiles aggregates data from
comma-, tab-, or space-separated tabular ASCII data files.
- EDDTableFromAwsXmlFiles aggregates data from
a set of Automatic Weather Station (AWS) XML files.
- EDDTableFromHyraxFiles aggregates data with several variables,
each with shared dimensions
(e.g., time, altitude, latitude, longitude), and served by a
Hyrax OPeNDAP server.
- EDDTableFromNcFiles aggregates data from .nc files
with several variables,
each with shared dimensions (e.g., time, altitude, latitude, longitude).
- EDDTableFromNcCFFiles aggregates data from
.nc files which use one of the file formats specified by the
CF Discrete Sampling Geometries
conventions.
- EDDTableFromThreddsFiles aggregates data from
files with several variables with
shared dimensions served by a
THREDDS OPeNDAP server.
Currently, no other file types are supported.
But it is usually relatively easy to support other file types. Contact us if you have requests.
Or, if your data is in an old file format that you would like to move away from,
we recommend converting the files to NetCDF .nc files. NetCDF is a widely supported format,
allows fast random access to the data, and is already supported by ERDDAP.
Details - The following information applies to all of the subclasses of EDDTableFromFiles.
- Aggregation - This class aggregates data from local files. Each file holds a (relatively)
small table of data.
- The resulting dataset appears as if all of the files' tables had been combined
(all of the rows of data from file #1, plus all of the rows from file #2, ...).
- The files don't all have to have all of the specified variables.
- The variables in all of the files MUST have the same values for the
add_offset,
missing_value,
_FillValue,
scale_factor, and
units attributes (if any).
ERDDAP checks, but it is an imperfect test -- if there are different values, ERDDAP
doesn't know which is correct and therefore which files are invalid.
- Directories - The files can be in one directory, or in a directory and its subdirectories
(recursively).
Note that if there are a large number of files (e.g., >1000), the operating system
(and thus EDDTableFromFiles) will operate much more efficiently if you store the files in a
series of subdirectories.
- Cached File Information - When an EDDTableFromFiles dataset is first loaded,
EDDTableFromFiles reads all of the relevant files and creates tables in memory with information
about each valid file (one file per row, including the minimum and maximum value of each variable,
even String variables) and each invalid file.
- The tables are also stored on disk, as .json files in [bigParentDirectory]/dataset
in files named:
[datasetID].dirs.json (which holds a list of unique directory names) and
[datasetID].files.json (which holds the table with each valid file's information),
[datasetID].bad.json (which holds the table with each bad file's information).
- The copy of the file information tables on disk is also useful when ERDDAP is shut down and
restarted: it saves EDDTableFromFiles from having to re-read all of the data files.
- You shouldn't ever need to work with these files directly.
You can delete them (but why?). You can use the
flag system
to force ERDDAP to update
the cached file information.
- Handling Requests - ERDDAP tabular data requests can put constraints on any variable.
- When a client's request for data is processed, EDDTableFromFiles can quickly look
in the table with the valid file information to see which files might have relevant data.
For example, if each source file has the data for one fixed-location buoy, EDDTableFromFiles
can very efficiently determine which files might have data within a given longitude range and
latitude range.
- Because the valid file information table includes the minimum and maximum value of every
variable for every valid file, EDDTableFromFiles can often handle other queries quite efficiently.
For example, if some of the buoys don't have an air pressure sensor, and a client requests
data for airPressure!=NaN, EDDTableFromFiles can efficiently determine which buoys
have air pressure data.
- Updating the Cached File Information -
Whenever the dataset is reloaded, the cached file
information is updated.
- The dataset is reloaded periodically as determined by the
<reloadEveryNMinutes> in the
dataset's information in datasets.xml.
- The dataset is reloaded as soon as possible whenever ERDDAP detects that you have added,
removed,
touch'd
(to change the file's lastModified time), or changed a datafile.
- The dataset is reloaded as soon as possible if you use the
flag system.
When the dataset is reloaded, ERDDAP compares the currently available files to the cached file
information table.
New files are read and added to the valid files table.
Files that no longer exist are dropped from the valid files table.
Files where the file timestamp has changed are read and their information is updated.
The new tables replace the old tables in memory and on disk.
- Bad Files - The table of bad files and the reasons the files were declared bad (corrupted file,
missing variables, incorrect axis values, etc.) is emailed to the emailEverythingTo email address
(probably you) every time the dataset is reloaded.
You should replace or repair these files as soon as possible.
- Near Real Time Data -
EDDTableFromFiles treats requests for very recent data as a special case.
The problem: If the files making up the dataset are updated frequently, it is likely that the
dataset won't be updated every time a file is changed. So EDDTableFromFiles won't be aware of
the changed files.
(You could use the
flag system,
but this might lead to ERDDAP reloading the dataset almost continually.
So in most cases, we don't recommend it.)
Instead, EDDTableFromFiles does two things to deal with this situation:
- When the dataset is loaded, if the maximum value for the time variable is in the last 24 hours,
ERDDAP sets the maximum time to be NaN (meaning Now).
- When ERDDAP gets a request for data within the last 20 hours (e.g., 8 hours ago until Now),
ERDDAP will search all files which have any data in the last 20 hours.
Thus, ERDDAP doesn't need to have perfectly up-to-date data for all of the files in order to
find the latest data.
You should still set <reloadEveryNMinutes>
to a reasonably
small value (e.g., 60),
but it doesn't have to be tiny (e.g., 3).
Not recommended organization of near-real-time data in the files:
If, for example, you have a dataset that stores data for numerous stations (or buoy, or ...)
for many years, you could arrange the files so that, for example, there is one file per station.
But then, every time new data for a station arrives, you have to read a large old file and
write a large new file.
And when ERDDAP reloads the dataset, it notices that some files have been modified, so it reads
those files completely.
That is inefficient.
Recommended organization of near-real-time data in the files:
We recommend that you store the data in chunks, e.g., all data for one station for
one year (or one month).
Then, when a new datum arrives, you only have to read and rewrite the file with this year's
(or month's) data.
All of the files for previous years (or months) for that station remain unchanged.
And when ERDDAP reloads the dataset, most files are unchanged; only a few, small files have
changed and need to be read.
- FTP Trouble/Advice - If you FTP new data files to the ERDDAP server while ERDDAP is running,
there is the chance that ERDDAP will be reloading the dataset during the FTP process.
It happens more often than you might think!
If it happens, the file will appear to be valid (it has a valid name), but the file isn't valid.
If ERDDAP tries to read data from that invalid file, the resulting error will cause the file to
be added to the table of invalid files.
This is not good.
To avoid this problem, use a temporary file name when FTP'ing the file, e.g., ABC2005.nc_TEMP .
Then, the fileNameRegex test (see below) will indicate that this is not a relevant file.
After the FTP process is complete, rename the file to the correct name.
The renaming process will cause the file to become relevant in an instant.
- File Name Extracts -
EDDTableFromFiles has a system for extracting a String from each file name
and using that to make a pseudo data variable.
Currently, there is no system to interpret these Strings as dates/times.
There are several XML tags to set up this system.
If you don't need part or all of this system, just don't specify these tags or use "" values.
- preExtractRegex is a
regular expression
(tutorial)
used to identify text to be removed
from the start of the file name.
The removal only occurs if the regex is matched.
This usually begins with "^" to match the beginning of the file name.
- postExtractRegex is a regular expression used to identify text to be removed from
the end of the file name.
The removal only occurs if the regex is matched.
This usually ends with "$" to match the end of the file name.
- extractRegex If present, this regular expression is used after preExtractRegex and
postExtractRegex to identify a string to be extracted from the file name (e.g., the stationID).
If the regex isn't matched, the entire file name is used (minus preExtract and postExtract).
Use ".*" to match the entire file name that is left after preExtractRegex and postExtractRegex.
- columnNameForExtract is the data column name for the extracted Strings.
A dataVariable with this sourceName must be in the dataVariables list (with any data type,
but usually String).
For example, if a dataset has files with names like
XYZAble.nc, XYZBaker.nc, XYZCharlie.nc, ...,
and you want to create a new variable (stationID) when each file is read
which will have station ID values
(Able, Baker, Charlie, ...) extracted from the file names, you could use these tags:
- <preExtractRegex>^XYZ</preExtractRegex>
The initial ^ is a regular expression special character which
forces ERDDAP to look for XYZ at the beginning of the file name.
This causes XYZ, if found at the beginning of the file name, to be removed
(e.g., the file name XYZAble.nc becomes Able.nc).
- <postExtractRegex>\x2Enc$</postExtractRegex>
The $ at the end is a regular expression special character which
forces ERDDAP to look for .nc at the end of the file name.
Since . is a regular expression special character (which matches any character),
it is encoded as \x2E here
(because 2E is the hexadecimal character number for a period).
This causes .nc, if found at the end of the file name, to be removed
(e.g., the partial file name Able.nc becomes Able).
- <extractRegex>.*</extractRegex>
The .* regular expression matches all remaining characters
(e.g., the partial file name Able becomes the extract for the first file).
- <columnNameForExtract>stationID</columnNameForExtract>
This tells ERDDAP to create a new column called stationID when reading each file.
Every row of data for a given file will have the text extracted
from its file name (e.g., Able) as the value in the stationID column.
In most cases, there are numerous values for these extract tags that will yield the same results --
regular expressions are very flexible. But in a few cases, there is just one
way to get the desired results.
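Collected in one place, the four tags from the XYZ example above would appear in the dataset's datasets.xml entry like this, along with a minimal matching dataVariable (a sketch; real entries usually also add attributes such as long_name):

```xml
<preExtractRegex>^XYZ</preExtractRegex>
<postExtractRegex>\x2Enc$</postExtractRegex>
<extractRegex>.*</extractRegex>
<columnNameForExtract>stationID</columnNameForExtract>
...
<dataVariable>
  <sourceName>stationID</sourceName>
  <dataType>String</dataType>
</dataVariable>
```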
- global: sourceNames -
Global metadata in each file can be converted to be data.
If the sourceName of a variable starts with
global: (e.g., global:PI), when ERDDAP is reading the data from a file,
ERDDAP will look for a global attribute of that name (e.g., PI) and
create a column filled with the attribute's value.
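For example, to turn a global attribute named PI in each file into a data column (the attribute and destination names here are hypothetical):

```xml
<dataVariable>
  <!-- Reads the global attribute "PI" from each file and creates a
       column in which every row for that file holds the attribute's value. -->
  <sourceName>global:PI</sourceName>
  <destinationName>principalInvestigator</destinationName>
  <dataType>String</dataType>
</dataVariable>
```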
- The skeleton XML
for all EDDTableFromFiles subclasses is:
<dataset type="EDDTableFrom...Files" datasetID="..." active="..." >
<nDimensions>...</nDimensions> <!-- This was used prior to ERDDAP version 1.30,
but is now ignored. -->
<accessibleTo>...</accessibleTo> <!-- 0 or 1 -->
<reloadEveryNMinutes>...</reloadEveryNMinutes>
<fgdcFile>...</fgdcFile> <!-- 0 or 1 -->
<iso19115File>...</iso19115File> <!-- 0 or 1 -->
<onChange>...</onChange> <!-- 0 or more -->
<altitudeMetersPerSourceUnit>...</altitudeMetersPerSourceUnit>
<specialMode>mode</specialMode> <!-- This rarely-used, optional tag can be used
with EDDTableFromThreddsFiles to specify that special, hard-coded rules
should be used to determine which files should be downloaded from the server.
Currently, the only valid mode is SAMOS which is used with datasets from
http://coaps.fsu.edu/thredds/catalog/samos to download only the files with
the last version number. -->
<sourceUrl>...</sourceUrl> <!-- For subclasses like EDDTableFromHyraxFiles and
EDDTableFromThreddsFiles, this is where you specify the base URL for the files
on the remote server. For subclasses that get data from local files, ERDDAP
doesn't use this information to get the data, but does display the information
to users. So I usually use "(local files)". -->
<fileDir>...</fileDir> <!-- The directory (absolute) with the data files. -->
<recursive>true|false</recursive> <!-- Indicates if subdirectories
of fileDir have data files, too. -->
<fileNameRegex>...</fileNameRegex> <!-- A regular expression
(tutorial) describing valid data files names, e.g., ".*\.nc" for
all .nc files. -->
<metadataFrom>...</metadataFrom> <!-- The file to get metadata
from ("first" or "last" (the default) based on file's
lastModifiedTime). -->
<columnNamesRow>...</columnNamesRow> <!-- (For
EDDTableFromAsciiFiles only) This specifies the number of the row
with the column names in the files. (The first row is "1".
Default = 1.) If you specify 0, ERDDAP will not look for column names
and will assign names: Column#1, Column#2, ... -->
<firstDataRow>...</firstDataRow> <!-- (For
EDDTableFromAsciiFiles only) This specifies the number of the first
row with data in the files. (The first row is "1". default = 2.) -->
<!-- For the next four tags, see File Name Extracts. -->
<preExtractRegex>...</preExtractRegex>
<postExtractRegex>...</postExtractRegex>
<extractRegex>...</extractRegex>
<columnNameForExtract>...</columnNameForExtract>
<sortedColumnSourceName>...</sortedColumnSourceName>
<!-- The sourceName of the numeric column that the data files are
usually already sorted by within each file, e.g., "time".
Use null or "" if no variable is suitable.
It is ok if not all files are sorted by this column.
If present, this can greatly speed up some data requests.
For EDDTableFromHyraxFiles, EDDTableFromNcFiles and
EDDTableFromThreddsFiles, this must be the leftmost axis
variable. -->
<sortFilesBySourceNames>...</sortFilesBySourceNames>
<!-- This is a space-separated list of source variable names
which specifies how the internal list of files should be sorted
(in ascending order), for example "id time".
It is the minimum value of the specified columns in each file
that is used for sorting.
When a data request is filled, data is obtained from the files
in this order. Thus it determines the overall order of the data
in the response. If you specify more than one column name, the
second name is used if there is a tie for the first column; the
third is used if there is a tie for the first and second columns; ...
This is OPTIONAL (the default is fileDir+fileName order). -->
<isLocal>false</isLocal> <!-- (may be true or false,
the default). This is only used by EDDTableFromNcCFFiles. It
indicates if the files are local (actual files) or remote
(accessed via the web). The two types are treated slightly
differently. -->
<sourceNeedsExpandedFP_EQ>true(default)|false</sourceNeedsExpandedFP_EQ>
<addAttributes>...</addAttributes>
<dataVariable>...</dataVariable> <!-- 1 or more -->
<!-- For EDDTableFromHyraxFiles, EDDTableFromNcFiles, and
EDDTableFromThreddsFiles, the axis variables (e.g., time) needn't
be first or in any specific order. -->
</dataset>
EDDTableFromAsciiFiles aggregates data from
comma-, tab-, or space-separated tabular ASCII data files.
- Normally, the files will have column names on the first row and data starting on the second row.
But you can use <columnNamesRow> and <firstDataRow>
in your datasets.xml file to
specify different row numbers.
- Note that ASCII files are not a very efficient way to store/retrieve data.
For greater efficiency, save the files as .nc files (with one dimension, "row", shared by
all variables) instead.
- See this class' superclass, EDDTableFromFiles, for information
on how to use this class and how this class works.
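As a sketch (the datasetID, fileDir, and row numbers here are all hypothetical), a dataset whose .csv files have column names on row 3 and data starting on row 5 might be set up as:

```xml
<!-- Hypothetical EDDTableFromAsciiFiles setup; all names and values
     are illustrative, following the EDDTableFromFiles skeleton above. -->
<dataset type="EDDTableFromAsciiFiles" datasetID="myAsciiData" active="true">
    <reloadEveryNMinutes>10080</reloadEveryNMinutes>
    <fileDir>/data/ascii/</fileDir>
    <recursive>false</recursive>
    <fileNameRegex>.*\.csv</fileNameRegex>
    <metadataFrom>last</metadataFrom>
    <columnNamesRow>3</columnNamesRow>
    <firstDataRow>5</firstDataRow>
    <addAttributes>...</addAttributes>
    <dataVariable>...</dataVariable>
</dataset>
```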
EDDTableFromAwsXmlFiles aggregates data from
a set of Automatic Weather Station (AWS) XML data files. Some background information is at
WeatherBug_Rest_XML_API.
- This type of file is a simple but inefficient way to store the data,
because each file usually seems to contain the observation from just one time point.
So there may be a large number of files. If you want to improve performance,
consider consolidating groups of observations (a week's worth?) in .nc files
and using EDDTableFromNcFiles to serve the data.
- See this class' superclass, EDDTableFromFiles, for information
on how to use this class and how this class works.
EDDTableFromHyraxFiles aggregates data files with several variables, each with
one or more shared dimensions (e.g., time, altitude, latitude, longitude), and served by a
Hyrax OPeNDAP server.
- In most cases, each file has multiple values for the leftmost dimension, e.g., time.
- The files often (but don't have to) have a single value for the other dimensions (e.g., altitude,
latitude, longitude).
- The files may have character variables with an additional dimension (e.g., nCharacters).
- Hyrax servers can be identified by the "/dods-bin/nph-dods/" or "/opendap/" in the URL.
- This class screen-scrapes the Hyrax web pages with the lists of files in each directory.
Because of this, it is very specific to the current format of Hyrax web pages.
We will try to adjust ERDDAP quickly if/when future versions of Hyrax change how the files are listed.
- The <fileDir> setting is ignored. Since this class downloads
and makes a local copy of each remote data file, ERDDAP forces the fileDir
to be [bigParentDirectory]/copy/datasetID/.
- For <sourceUrl>, use the URL of the base directory of the dataset in the Hyrax server, for example,
<sourceUrl>http://edac-dap.northerngulfinstitute.org/dods-bin/nph-dods/WCOS/nmsp/wcos/</sourceUrl>
(although that server is no longer available).
The sourceUrl web page usually has "OPeNDAP Server Index of [directoryName]" at the top.
- Since this class always downloads and makes a local copy of each remote
data file, you should never wrap this dataset in EDDTableCopy.
- See this class' superclass, EDDTableFromFiles, for information
on how to use this class and how this class works.
- See the 1D, 2D, 3D, and 4D examples for EDDTableFromNcFiles.
EDDTableFromNcFiles
aggregates data from .nc files with several variables, each with one shared
dimension (e.g., time) or more than one shared dimension (e.g., time, altitude, latitude,
longitude).
The files must have the same dimension names.
A given file may have multiple values for each of the dimensions and the values may be
different in different files.
The files may have character variables with an additional dimension (e.g., nCharacters).
See this class' superclass, EDDTableFromFiles,
for information on how to use this class and how this class works.
- 1D Example:
1D files are somewhat different from 2D, 3D, 4D, ... files.
- You might have a set of .nc data files where each file has one month's worth of data from one
drifting buoy.
- Each file will have 1 dimension, e.g., time (size = [many]).
- Each file will have one or more 1D variables which use that dimension, e.g., time, longitude,
latitude, air temperature, ....
- Each file may have 2D character variables, e.g., with dimensions (time,nCharacters).
- 2D Example:
- You might have a set of .nc data files where each file has one month's worth of data from one
drifting buoy.
- Each file will have 2 dimensions, e.g., time (size = [many]) and id (size = 1).
- Each file will have 2 1D variables with the same names as the dimensions and using the
same-name dimension, e.g., time(time), id(id).
These 1D variables should be included in the list of <dataVariable>'s in the dataset's XML.
- Each file will have one or more 2D variables, e.g., longitude, latitude, air temperature, water
temperature, ...
- Each file may have 3D character variables, e.g., with dimensions (time,id,nCharacters).
- 3D Example:
- You might have a set of .nc data files where each file has one month's worth of data from one
stationary buoy.
- Each file will have 3 dimensions, e.g., time (size = [many]), lat (size = 1), and lon (size = 1).
- Each file will have 3 1D variables with the same names as the dimensions and using the
same-name dimension, e.g., time(time), lat(lat), lon(lon).
These 1D variables should be included in the list of <dataVariable>'s in the dataset's XML.
- Each file will have one or more 3D variables, e.g., air temperature, water temperature, ...
- Each file may have 4D character variables, e.g., with dimensions (time,lat,lon,nCharacters).
- The file's name might include the buoy's name.
- 4D Example:
- You might have a set of .nc data files where each file has one month's worth of data from one
station. At each time point, the station takes readings at a series of depths.
- Each file will have 4 dimensions, e.g., time (size = [many]), depth (size = [many]), lat (size = 1),
and lon (size = 1).
- Each file will have 4 1D variables with the same names as the dimensions and using the
same-name dimension, e.g., time(time), depth(depth), lat(lat), lon(lon).
These 1D variables should be included in the list of <dataVariable>'s in the dataset's XML.
- Each file will have one or more 4D variables, e.g., air temperature, water temperature, ...
- Each file may have 5D character variables, e.g., with dimensions (time,depth,lat,lon,nCharacters).
- The file's name might include the station's name.
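For instance, the 2D example above might list its dataVariables as sketched below (the variable names are hypothetical; the point is that the 1D variables, time and id, are included along with the 2D variables):

```xml
<!-- Hypothetical dataVariables for the 2D example: the 1D variables
     (time, id) are listed along with the 2D data variables. -->
<dataVariable> <sourceName>time</sourceName> ... </dataVariable>
<dataVariable> <sourceName>id</sourceName> ... </dataVariable>
<dataVariable> <sourceName>longitude</sourceName> ... </dataVariable>
<dataVariable> <sourceName>latitude</sourceName> ... </dataVariable>
<dataVariable> <sourceName>airTemperature</sourceName> ... </dataVariable>
```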
EDDTableFromNcCFFiles
aggregates data from
.nc files which use one of the file formats specified by the
CF Discrete Sampling Geometries
conventions.
See this class' superclass, EDDTableFromFiles,
for information on how to use this class and how this class works.
The CF DSG conventions define dozens of file formats and include numerous
minor variations. This class deals with all of the variations we are aware of, but
we may have missed one (or more). So if this class can't read data from your CF DSG files,
please email bob.simons at noaa.gov and include a sample file.
EDDTableFromThreddsFiles aggregates data files
with several variables,
each with one or more shared dimensions (e.g., time, altitude, latitude, longitude), and
served by a
THREDDS OPeNDAP server.
- In most cases, each file has multiple values for the leftmost dimension, e.g., time.
- The files often (but don't have to) have a single value for the other dimensions (e.g., altitude,
latitude, longitude).
- The files may have character variables with an additional dimension (e.g., nCharacters).
- THREDDS servers can be identified by the "/thredds/" in the URLs.
For example,
http://data.nodc.noaa.gov/thredds/catalog/nmsp/wcos/catalog.html
- This class reads the catalog.xml files served by THREDDS with the lists of
<catalogRefs>
(references to additional catalog.xml sub-files) and <dataset>s (data files).
- The <fileDir> setting is ignored. Since this class downloads
and makes a local copy of each remote data file, ERDDAP forces the fileDir
to be [bigParentDirectory]/copy/datasetID/.
- For <sourceUrl>, use the URL of the catalog.xml file for the dataset in the
THREDDS server,
for example:
for this URL which may be used in a web browser,
http://data.nodc.noaa.gov/thredds/catalog/nmsp/wcos/catalog.html ,
use <sourceUrl>http://data.nodc.noaa.gov/thredds/catalog/nmsp/wcos/catalog.xml</sourceUrl> .
- Since this class always downloads and makes a local copy of each remote
data file, you should never wrap this dataset in EDDTableCopy.
- This dataset type supports an optional, rarely-used, special tag,
<specialMode>mode</specialMode>
which can be used to specify that special, hard-coded rules should be
used to determine which files should be downloaded from the server.
Currently, the only valid mode is SAMOS which is used with datasets
from http://coaps.fsu.edu/thredds/catalog/samos to download only the files with
the last version number.
- See this class' superclass, EDDTableFromFiles,
for information on how to use this class and how this class works.
- See the 1D, 2D, 3D, and 4D examples for EDDTableFromNcFiles.
EDDTableFromNOS handles data from a NOAA
NOS source,
which uses
SOAP+XML for requests and responses. It is very specific to NOAA NOS's XML.
See the sample EDDTableFromNOS dataset in datasets2.xml.
EDDTableFromOBIS handles data from an
Ocean Biogeographic Information System (OBIS) server.
- OBIS servers expect an XML request and return an XML response.
- Because all OBIS servers serve the same variables the same way
(see the OBIS schema),
you don't have to specify much to set up an OBIS dataset in ERDDAP.
- You MUST include a "creator_email" attribute in the global addAttributes,
since that information is used within the license.
A suitable email address can be found by reading the XML response from the sourceURL.
- You may or may not be able to get the global attribute
<subsetVariables> to work with
a given OBIS server. If you try, just try one variable (e.g., ScientificName or Genus).
- The skeleton XML for an EDDTableFromOBIS dataset is:
<dataset type="EDDTableFromOBIS" datasetID="..." active="..." >
<sourceUrl>...</sourceUrl>
<sourceCode>...</sourceCode>
<!-- If you read the XML response from the sourceUrl, the
source code (e.g., GHMP) is the value from one of the
<resource><code> tags. -->
<accessibleTo>...</accessibleTo> <!-- 0 or 1 -->
<reloadEveryNMinutes>...</reloadEveryNMinutes>
<fgdcFile>...</fgdcFile> <!-- 0 or 1 -->
<iso19115File>...</iso19115File> <!-- 0 or 1 -->
<onChange>...</onChange> <!-- 0 or more -->
<!-- All ...SourceMinimum and Maximum tags are OPTIONAL -->
<longitudeSourceMinimum>...</longitudeSourceMinimum>
<longitudeSourceMaximum>...</longitudeSourceMaximum>
<latitudeSourceMinimum>...</latitudeSourceMinimum>
<latitudeSourceMaximum>...</latitudeSourceMaximum>
<altitudeSourceMinimum>...</altitudeSourceMinimum>
<altitudeSourceMaximum>...</altitudeSourceMaximum>
<!-- For timeSource... tags, use yyyy-MM-dd'T'HH:mm:ssZ format. -->
<timeSourceMinimum>...</timeSourceMinimum>
<timeSourceMaximum>...</timeSourceMaximum>
<sourceNeedsExpandedFP_EQ>true(default)|false</sourceNeedsExpandedFP_EQ>
<addAttributes>...</addAttributes>
</dataset>
EDDTableFromSOS handles data from a
Sensor Observation Service
(SWE/SOS) server.
- This dataset type aggregates data from a group of stations which are all served by one SOS server.
- The stations all serve the same set of variables (although the source for each station doesn't
have to serve all variables).
- SOS servers expect an XML request and return an XML response.
- It is not easy to generate the dataset XML for SOS datasets.
To find the needed information, you must visit
sourceUrl+"?service=SOS&request=GetCapabilities" in a browser;
look at the XML; make a GetObservation request by hand;
and look at the XML response to the request.
- SOS overview:
- SWE (Sensor Web Enablement) and SOS (Sensor Observation Service) are
OpenGIS® standards.
That web site has the standards documents.
- The OGC Web Services Common Specification ver 1.1.0 (OGC 06-121r3) covers construction of
GET and POST queries (see section 7.2.3 and section 9).
- If you send a getCapabilities xml request to a SOS server
(sourceUrl + "?service=SOS&request=GetCapabilities"), you get an xml result
with a list of stations and the observedProperties that they have data for.
- An observedProperty is a formal URI reference to a property. For example,
urn:ogc:phenomenon:longitude:wgs84 or http://marinemetadata.org/cf#sea_water_temperature
- An observedProperty isn't a variable.
- More than one variable may have the same observedProperty (for example, insideTemp
and outsideTemp might both have observedProperty
http://marinemetadata.org/cf#air_temperature).
- If you send a getObservation xml request to a SOS server, you get an xml result with
descriptions of field names in the response, field units, and the data.
The field names will include longitude, latitude, depth (perhaps), and time.
- Each dataVariable for an EDDTableFromSOS must include an "observedProperty" attribute,
which identifies the observedProperty that must be requested from the server to
get that variable. Often, several dataVariables will list the same composite
observedProperty.
- The dataType for each dataVariable may not be specified by the server.
If so, you must look at the XML data responses from the server and assign appropriate
<dataType>s in the ERDDAP dataset dataVariable
definitions.
- (At the time of writing this) some SOS servers respond to getObservation requests for
more than one observedProperty by just returning results for the first of the
observedProperties. (No error message!)
See the constructor parameter requestObservedPropertiesSeparately.
- EDDTableFromSOS automatically adds
<att name="subsetVariables">station_id,
longitude, latitude</att>
to the dataset's global attributes when the dataset is created.
- SOS servers usually express units with the
UCUM system.
Most ERDDAP servers express units with the
UDUNITS system.
If you need to convert between the two systems, you can use
ERDDAP's web service to convert UCUM units to/from UDUNITS.
- The skeleton XML for an EDDTableFromSOS dataset is:
<dataset type="EDDTableFromSOS" datasetID="..." active="..." >
<sourceUrl>...</sourceUrl>
<accessibleTo>...</accessibleTo> <!-- 0 or 1 -->
<reloadEveryNMinutes>...</reloadEveryNMinutes>
<fgdcFile>...</fgdcFile> <!-- 0 or 1 -->
<iso19115File>...</iso19115File> <!-- 0 or 1 -->
<onChange>...</onChange> <!-- 0 or more -->
<stationIdSourceName>...</stationIdSourceName> <!-- 0 or 1.
Default="station_id". -->
<longitudeSourceName>...</longitudeSourceName>
<latitudeSourceName>...</latitudeSourceName>
<altitudeSourceName>...</altitudeSourceName>
<altitudeSourceMinimum>...</altitudeSourceMinimum> <!-- 0 or 1 -->
<altitudeSourceMaximum>...</altitudeSourceMaximum> <!-- 0 or 1 -->
<altitudeMetersPerSourceUnit>...</altitudeMetersPerSourceUnit>
<timeSourceName>...</timeSourceName>
<timeSourceFormat>...</timeSourceFormat>
<!-- timeSourceFormat MUST be either
* For numeric data: a
UDUnits-compatible
string (with the format
"units since baseTime") describing how to interpret
source time values (e.g., "seconds since 1970-01-01T00:00:00Z"),
where the base time is an ISO 8601:2004(E) formatted date time string
(yyyy-MM-dd'T'HH:mm:ssZ).
* For String data: an org.joda.time.format.DateTimeFormat
string (which is mostly compatible with java.text.SimpleDateFormat)
describing how to interpret string times (e.g., the
ISO8601TZ_FORMAT "yyyy-MM-dd'T'HH:mm:ssZ"). See Joda DateTimeFormat -->
<observationOfferingIdRegex>...</observationOfferingIdRegex>
<!-- Only observationOfferings with IDs (usually the station names)
which match this regular expression (tutorial) will be included
in the dataset (".+" will catch all station names). -->
<requestObservedPropertiesSeparately>true|false(default)
</requestObservedPropertiesSeparately>
<sourceNeedsExpandedFP_EQ>true(default)|false</sourceNeedsExpandedFP_EQ>
<addAttributes>...</addAttributes>
<dataVariable>...</dataVariable> <!-- 1 or more.
* Each dataVariable MUST include the dataType tag.
* Each dataVariable MUST include the observedProperty attribute.
* For IOOS SOS servers, *every* variable returned in the text/csv
response MUST be included in this ERDDAP dataset definition. -->
</dataset>
EDDTableCopy makes and maintains a local copy of another EDDTable's
data and serves
data from the local copy.
- EDDTableCopy (and for grid data, EDDGridCopy)
is an easy-to-use and very effective
solution to some of the biggest problems with serving data from remote data sources:
- Accessing data from a remote data source can be slow.
The source may be inherently slow (e.g., an inefficient type of server),
overwhelmed by too many requests,
or bandwidth limited (on your server or the remote server).
- The remote dataset is sometimes unavailable (again, for a variety of reasons).
- Relying on one source for the data doesn't scale well (e.g., when many users and many
ERDDAPs utilize it).
- How It Works - EDDTableCopy solves these problems by automatically making and maintaining
a local copy of the data and serving data from the local copy.
ERDDAP can serve data from the local copy very, very quickly.
And making and using a local copy relieves the burden on the remote server.
And the local copy is a backup of the original, which is useful in case something happens
to the original.
There is nothing new about making a local copy of a dataset. What is new here is that this
class makes it *easy* to create and *maintain* a local copy of data from a *variety* of types
of remote data sources and *add metadata* while copying the data.
- <extractDestinationNames> - EDDTableCopy makes the local copy of the
data by requesting
chunks of data from the remote dataset.
EDDTableCopy determines which chunks to request by requesting the &distinct() values
for the <extractDestinationNames> (specified in the datasets.xml, see below),
which are the space-separated destination names of variables in the remote dataset.
For example,
<extractDestinationNames>drifter profile</extractDestinationNames>
might
yield distinct values combinations of drifter=tig17,profile=1017, drifter=tig17,profile=1095,
... drifter=une12,profile=1223, drifter=une12,profile=1251, ....
In situations where one column (e.g., profile) may be all that is required to uniquely
identify a group of rows of data, if there are a very large number of, e.g., profiles,
it may be useful to also specify an additional extractDestinationName (e.g., drifter)
which serves to subdivide the profiles.
That leads to fewer data files in a given directory, which may lead to faster access.
- Local Files - Each chunk of data is stored in a separate netCDF file in a subdirectory of
[bigParentDirectory]/copy/datasetID/ (as specified in
setup.xml).
There is one subdirectory level for all but the last extractDestinationName.
For example, data for tig17+1017, would be stored in
[bigParentDirectory]/copy/sampleDataset/tig17/1017.nc .
For example, data for une12+1251, would be stored in
[bigParentDirectory]/copy/sampleDataset/une12/1251.nc .
Directory and file names created from data values are modified to make them file-name-safe
(e.g., spaces are replaced by "x20") -- this doesn't affect the actual data.
- New Data - Each time EDDTableCopy is reloaded, it checks the remote dataset
to see what distinct chunks are available.
If the file for a chunk of data doesn't already exist, a request to get the chunk is
added to a queue.
ERDDAP's taskThread processes all the queued requests for chunks of data, one-by-one.
You can see statistics for the taskThread's activity on the
Status Page and in the
Daily Report.
(Yes, ERDDAP could assign multiple tasks to this process, but that would use up
lots of the remote data source's bandwidth, memory, and CPU time, and
lots of the local ERDDAP's bandwidth, memory, and CPU time, neither of which is a good idea.)
NOTE: The very first time an EDDTableCopy is loaded, (if all goes well)
lots of requests for chunks of data will be added to the taskThread's queue,
but no local data files will have been created.
So the constructor will fail but taskThread will continue to work and create local files.
If all goes well, the taskThread will make some local data files and the next attempt to reload
the dataset (in ~15 minutes) will succeed, but initially with a very limited amount of data.
WARNING: If the remote dataset is large and/or the remote server is slow (that's the problem,
isn't it?!), it will take a long time to make a complete local copy.
In some cases, the time needed will be unacceptable.
For example, transmitting 1 TB of data over a T1 line (1.544 Mbit/s, roughly 0.19 MB/s) takes at least 60 days,
under optimal conditions.
Plus, it uses lots of bandwidth, memory, and CPU time on the remote and local computers.
The solution is to mail a hard drive to the administrator of the remote data set so that
s/he can make a copy of the dataset and mail the hard drive back to you.
Use that data as a starting point and EDDTableCopy will add data to it.
(That is how Amazon's EC2 Cloud Service handles the problem, even
though their system has lots of bandwidth.)
WARNING: If a given combination of values disappears from the remote dataset,
EDDTableCopy does NOT delete the local copied file. If you want to, you can delete it yourself.
- Recommended Use -
- Create the <dataset> entry (the native type, not EDDTableCopy)
for the remote data source.
Get it working correctly, including all of the desired metadata.
- If it is too slow, add XML code to wrap it in an EDDTableCopy dataset.
- Use a different datasetID (perhaps by slightly modifying the old datasetID).
- Copy the <accessibleTo>, <reloadEveryNMinutes> and
<onChange> from the
remote EDDTable's XML to the EDDTableCopy's XML.
(Their values for EDDTableCopy matter; their values for the inner dataset become irrelevant.)
- Create the <extractDestinationNames> tag (see above).
- <orderExtractBy> is an OPTIONAL space separated list of destination
variable names
in the remote dataset.
When each chunk of data is downloaded from the remote server, the chunk will be sorted by
these variables (by the first variable, then by the second variable if the first variable
is tied, ...).
In some cases, ERDDAP will be able to extract data faster from the local data files
if the first variable in the list is a numeric variable ("time" counts as a numeric variable).
But choose these variables in a way that is appropriate for the dataset.
- ERDDAP will make and maintain a local copy of the data.
- WARNING: EDDTableCopy assumes that the data values for each chunk don't ever change.
If/when they do, you need to manually delete the chunk files in
[bigParentDirectory]/copy/datasetID/
which changed and flag
the dataset to be reloaded so that the deleted
chunks will be replaced.
If you have an email subscription to the dataset, you will get two emails:
one when the dataset first reloads and starts to copy the data,
and another when the dataset loads again (automatically) and detects the new local data files.
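Putting the steps above together, a wrapped dataset might look like this sketch (all IDs and values are hypothetical; the inner dataset is your already-working, native-type dataset):

```xml
<!-- Hypothetical EDDTableCopy wrapper around a working native-type dataset.
     The outer accessibleTo/reloadEveryNMinutes/onChange values are the ones
     that matter; the inner dataset's become irrelevant. -->
<dataset type="EDDTableCopy" datasetID="myDataCopy" active="true">
    <reloadEveryNMinutes>60</reloadEveryNMinutes>
    <extractDestinationNames>drifter profile</extractDestinationNames>
    <orderExtractBy>time</orderExtractBy>
    <dataset type="EDDTableFromNcFiles" datasetID="myData">
        ... <!-- the original, working dataset definition -->
    </dataset>
</dataset>
```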
- Change Metadata - If you need to change any addAttributes or change the order of the
variables associated with the source dataset:
- Change the addAttributes for the source dataset in datasets.xml, as needed.
- Delete one of the copied files.
- Set a flag
to reload the dataset immediately.
If you do use a flag and you have an email subscription to the dataset, you will get two emails:
one when the dataset first reloads and starts to copy the data,
and another when the dataset loads again (automatically) and detects the new local data files.
- The deleted file will be regenerated with the new metadata.
If the source dataset is ever unavailable, the EDDTableCopy dataset will get metadata
from the regenerated file, since it is the youngest file.
- Note that EDDGridCopy is very similar to EDDTableCopy,
but works with gridded datasets.
- Skeleton XML - The skeleton XML for an EDDTableCopy dataset is:
<dataset type="EDDTableCopy" datasetID="..." active="..." >
<accessibleTo>...</accessibleTo> <!-- 0 or 1 -->
<reloadEveryNMinutes>...</reloadEveryNMinutes>
<fgdcFile>...</fgdcFile> <!-- 0 or 1 -->
<iso19115File>...</iso19115File> <!-- 0 or 1 -->
<onChange>...</onChange> <!-- 0 or more -->
<extractDestinationNames>...</extractDestinationNames> <!-- 1 -->
<orderExtractBy>...</orderExtractBy> <!-- 0 or 1 -->
<dataset>...</dataset> <!-- 1 -->
</dataset>
Here are detailed descriptions of common tags and attributes.
Questions, comments, suggestions? Please send an email to
bob dot simons at noaa dot gov
and include the ERDDAP URL directly related to your question or comment.
ERDDAP, Version 1.42