
Searching for Data Sets Within the Ocean Use Case

Update - 06/2010

The work so far has focused on two parallel goals:

  • Build, distribute and support the NcML handler in Hyrax so that we can establish more metadata uniformity across servers without modifying datasets; and
  • Build Java/XSLT software to transform the DAP 3.x DDX response into EML.

NcML Handler

Hyrax 1.5.x contains the NcML handler with support for adding attributes, and Hyrax 1.6.x contains a version of the NcML handler that also supports aggregations. Work will continue to add new capabilities to the handler.
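
As an illustration of what the handler enables (file names, variable names, and values here are hypothetical), an NcML file served alongside a dataset can add or correct attributes without modifying the underlying file:

    <?xml version="1.0" encoding="UTF-8"?>
    <netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2"
            location="data/nc/sst_monthly_2009_01.nc">
        <!-- Add CF attributes that the original granule lacks -->
        <variable name="sst">
            <attribute name="standard_name" value="sea_surface_temperature"/>
            <attribute name="units" value="degC"/>
        </variable>
    </netcdf>

A second NcML file, usable with Hyrax 1.6.x, could present a set of granules as a single aggregated dataset:

    <netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
        <!-- Join monthly granules along their existing time dimension -->
        <aggregation type="joinExisting" dimName="time">
            <netcdf location="data/nc/sst_monthly_2009_01.nc"/>
            <netcdf location="data/nc/sst_monthly_2009_02.nc"/>
        </aggregation>
    </netcdf>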

EML generation

In general, we cannot build valid EML using only a 'raw' DDX and XSLT. In most cases we will need some Java code to read values from the dataset and/or additional metadata inserted into the DDX using NcML. However, for datasets that conform to the Climate and Forecast 1.0 metadata convention (CF-1.0), we can build EML directly from the 'raw' DDX response (without using Java to read data values or NcML to insert additional metadata).
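
To see why, consider a hypothetical fragment of a DDX for such a dataset; the CF attributes the transform needs are already attached to the variables. The elements below follow the general shape of the DAP 3.x DDX, but the grid, sizes, and values are invented for illustration:

    <Dataset xmlns="http://xml.opendap.org/ns/DAP/3.2#" name="sst_monthly_2009_01.nc">
        <Grid name="sst">
            <Attribute name="standard_name" type="String">
                <value>sea_surface_temperature</value>
            </Attribute>
            <Attribute name="units" type="String">
                <value>degC</value>
            </Attribute>
            <Array name="sst">
                <Float32/>
                <dimension name="time" size="1"/>
                <dimension name="lat" size="180"/>
                <dimension name="lon" size="360"/>
            </Array>
            <Map name="time"/>
            <Map name="lat"/>
            <Map name="lon"/>
        </Grid>
    </Dataset>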

Here is an EML document built using XSLT from the DDX of a dataset with CF-1.0 metadata.

About the EML: The generated EML, sketched below, assumes that:

  • Only Grids are interesting variables in a given dataset/granule
  • The dataset complies with CF-1.0
  • The dataset provides dataset-scoped geo-temporal metadata
  • Every Grid shares that dataset-scoped geo-temporal metadata
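
A minimal sketch of the shape of that EML (this is not schema-valid EML 2.1; the packageId, title, and coordinate values are invented):

    <eml:eml xmlns:eml="eml://ecoinformatics.org/eml-2.1.0"
             packageId="reap.sst_monthly.1" system="reap">
        <dataset>
            <title>Monthly Sea Surface Temperature</title>
            <coverage>
                <!-- Dataset-scoped coverage, shared by every Grid -->
                <geographicCoverage>
                    <geographicDescription>Global, 1-degree grid</geographicDescription>
                    <boundingCoordinates>
                        <westBoundingCoordinate>-180.0</westBoundingCoordinate>
                        <eastBoundingCoordinate>180.0</eastBoundingCoordinate>
                        <northBoundingCoordinate>90.0</northBoundingCoordinate>
                        <southBoundingCoordinate>-90.0</southBoundingCoordinate>
                    </boundingCoordinates>
                </geographicCoverage>
                <temporalCoverage>
                    <rangeOfDates>
                        <beginDate><calendarDate>2009-01-01</calendarDate></beginDate>
                        <endDate><calendarDate>2009-01-31</calendarDate></endDate>
                    </rangeOfDates>
                </temporalCoverage>
            </coverage>
            <!-- One entity per Grid variable would follow here -->
        </dataset>
    </eml:eml>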

About

This is a summary of a design to provide EML documents for the data sets that are part of the REAP Ocean Use Case. The complete design can be found at REAP Cataloging and Searching. Note that the software described there makes use of two other components proposed to be developed not only for this effort but also for other projects. Those are the NcML AIS handler and the NcML Aggregation handler, both modules that will run in Hyrax.

Note that this does not provide any hints on how to build a user interface that would actually use the information, only how to get that information into the database.

The Problem

Finding data scattered among distributed servers that are run independently is a long-standing problem. Various solutions, such as crawling the servers' contents or requiring standardized information about the contents so they can be indexed, have been tried with varying success. The design here addresses the twin problems of motivating providers to write the extra information that will make the data locatable and then increasing the time those documents remain valid. It is not a complete solution in the sense that not all providers will write the needed documents and, over time, some documents will go stale. However, it should have fewer problems than designs that fail to consider these realities.

DAP servers, which provide all of the data used by the Ocean Use Case, are completely autonomous entities with their own network interfaces. Each supports a uniform base set of operations, but there is no requirement to organize or catalog those servers in a coordinated, unified or centralized way. For the purposes of this use case, we assume that implementing a distributed query is outside the scope of solutions we can consider. Thus, we must create a centralized store of information about the collection of known data sets that can be searched in response to various types of queries.

Solution

The Kepler client supports searching the Metacat database system to locate data sets, and it seems that populating an instance of this database with information about the data sets in the use case will provide the needed information on which to build a search user interface. However, the Metacat/Kepler combination is not completely general; while Metacat is capable of indexing arbitrary XML documents (so it could index the DDX returned by a DAP server for a given data set), Kepler expects the records it searches to be (a subset of) EML documents.

Since the servers that provide the use case's data are distributed, it seems like a good plan to have each server provide information about its (relevant) holdings using EML. Each server could return these XML documents as a response. In fact, since all DAP servers are bundled with some form of an HTTP server, the EML documents could be stored in the HTTP server's 'DocumentRoot'. However, our experience, and the experience of others, is that a solution based on such a collection of documents does not work in practice. The servers move, the URLs to data sets within a server change, and the records about the data sets often contain errors. Furthermore, as new data sets are added, data providers don't supply the needed information about them, so while the collection of records starts out complete (not considering errors in the existing information), it rapidly ages and grows more and more stale with time.

To solve the problem of maintaining an accurate static collection of information about servers, this design will generate the EML records from the automatically-built DDX response supported by the newer DAP servers. Because this response does not necessarily contain all of the raw information needed to make a valid EML record, we will use the AIS handler, designed to be used by other projects as well, to supplement the information in the DDX with 'micro documents' that will be collected in files separate from, but closely bound to, the original data. The AIS will combine the information in these files with the information in the data sets themselves and return the resulting DDX (think of it as DDX' - DDX prime).
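
As a hypothetical sketch of such a 'micro document' (the attribute names, file name, and values are invented; the real vocabulary would be generalized along the lines described next), an NcML file bound to a dataset could add the dataset-scoped geo-temporal metadata that the raw DDX lacks:

    <?xml version="1.0" encoding="UTF-8"?>
    <netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2"
            location="data/nc/sst_monthly_2009_01.nc">
        <!-- Global attributes merged into the DDX' by the AIS/NcML handler -->
        <attribute name="geospatial_lat_min" type="float" value="-90.0"/>
        <attribute name="geospatial_lat_max" type="float" value="90.0"/>
        <attribute name="geospatial_lon_min" type="float" value="-180.0"/>
        <attribute name="geospatial_lon_max" type="float" value="180.0"/>
        <attribute name="time_coverage_start" value="2009-01-01"/>
        <attribute name="time_coverage_end" value="2009-01-31"/>
    </netcdf>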

However, if we were to require data providers to write 'AIS files' with EML that was then merged with the data set's information to build the DDX, we really would not have solved much, if anything. The same basic problem would remain - the collection of information would quickly grow stale. Instead, we will isolate the specific parts of the EML document that are needed for the Metacat/Kepler combination and generalize those. The generalized 'micro documents' will be based on ISO 19115 to the extent possible, and the result merged into the DDX. A DDX-->EML transform (implemented in XSLT, which is supported by Metacat) will actually generate the EML documents. This provides two important sources of leverage in this design. First, since the source information is more general than EML itself, it should be easier to motivate data providers to write it. Second, the same source information can be used to build other kinds of records used by other search systems.
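
A minimal sketch of the DDX-->EML transform, assuming the DAP 3.2 DDX namespace; the output is only a skeleton (the coverage construction is elided), and this is illustrative rather than the project's actual stylesheet:

    <?xml version="1.0" encoding="UTF-8"?>
    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:dap="http://xml.opendap.org/ns/DAP/3.2#"
        xmlns:eml="eml://ecoinformatics.org/eml-2.1.0">

        <xsl:output method="xml" indent="yes"/>

        <!-- One EML document per DDX; only Grid variables are mapped. -->
        <xsl:template match="/dap:Dataset">
            <eml:eml packageId="{@name}" system="reap">
                <dataset>
                    <title><xsl:value-of select="@name"/></title>
                    <!-- Dataset-scoped geo-temporal coverage would be built here
                         from the global attributes merged in via the AIS/NcML handler -->
                    <xsl:apply-templates select="dap:Grid"/>
                </dataset>
            </eml:eml>
        </xsl:template>

        <!-- Each Grid becomes a minimal entity description. -->
        <xsl:template match="dap:Grid">
            <otherEntity>
                <entityName><xsl:value-of select="@name"/></entityName>
                <entityType>DAP Grid</entityType>
            </otherEntity>
        </xsl:template>
    </xsl:stylesheet>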

While not in the scope of this design, this AIS --> DDX --> EML data flow provides the basis for more sophisticated processing than XSLT can provide.

Risks

  • There's no discussion of how to modify Kepler so users can search for the data sets. This isn't a risk per se; it just indicates that there's a whole other issue to be addressed.

Search Parameters

This search sub-system is driven by the Ocean Use Case, in its second form, and so the parameters are derived from it. However, these are general search parameters applicable to just about any search involving satellite data. The parameters are (a sketch of how such a search might be posed to Metacat follows the list):

  • Space (Latitude and Longitude)
  • Time (really date & time)
    • Question: Should we include qualifiers like 'day' or 'night' to accommodate finding only daytime or nighttime images?
  • Resolution (i.e., the area in km^2 covered by a pixel)
  • Parameter (Sea Surface Temperature, Wind vectors, etc.)
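
To make that concrete, a 'Parameter' search could be posed to Metacat as a pathquery over the generated EML. This is only a sketch: the keyword path, return field, and doctype are illustrative and would need to be checked against the Metacat version in use, and range terms for space, time, and resolution would be expressed with additional query terms.

    <?xml version="1.0"?>
    <pathquery version="1.2">
        <querytitle>Sea Surface Temperature granules</querytitle>
        <returndoctype>eml://ecoinformatics.org/eml-2.1.0</returndoctype>
        <returnfield>dataset/title</returnfield>
        <querygroup operator="INTERSECT">
            <queryterm casesensitive="false" searchmode="contains">
                <value>sea_surface_temperature</value>
                <pathexpr>dataset/keywordSet/keyword</pathexpr>
            </queryterm>
        </querygroup>
    </pathquery>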

