
Ocean_use_case_searching

This is version 4. It is not the current version, and thus it cannot be edited.

Searching for Data Sets Within the Ocean Use Case

This is a summary of a design for providing EML documents for the data sets that are part of the REAP Ocean Use Case. The complete design can be found at REAP Cataloging and Searching. Note that the software described there makes use of two other components, the NcML AIS handler and the NcML Aggregation handler, which are proposed for development not only for this effort but also for other projects. Both are modules that will run in Hyrax.

Summary

The Problem

DAP servers, which provide all of the data used by the Ocean Use Case, are completely autonomous entities with their own network interfaces. Each supports a uniform base set of operations, but there is no requirement to organize or catalog those servers in a coordinated and centralized way. For the purposes of this use case, we assume that implementing a distributed query is outside the scope of the solutions we want to consider. Thus, we must create a centralized store of information about the collection of known data sets, one that can be searched in response to various types of queries.

Solution

The Kepler client supports searching the Metacat database system to locate data sets, and it seems that populating an instance of this database with information about the data sets in the use case will provide the information needed to build a search user interface. However, the Metacat/Kepler combination is not completely general: while Metacat is capable of indexing arbitrary XML documents (so it could index the DDX returned by a DAP server for a given data set), Kepler expects the records it searches to be (a subset of) EML documents.

Since the servers that provide the use case's data are distributed, it seems like a good plan to have each server provide information about its (relevant) holdings using EML. Each server could return these XML documents as a response. In fact, since all DAP servers are bundled with some form of an HTTP server, the EML documents could be stored in the HTTP server's 'DocumentRoot'. However, our experience, and the experience of others, is that a solution based on such a collection of documents does not work in practice. The servers move, the URLs to data sets within a server move, and the records about the data sets often contain errors. Furthermore, as new data sets are added, data providers don't supply the needed information about them. So while the collection of records starts out complete (not considering errors in the existing information), it ages rapidly and grows more and more stale with time.

To solve the problem of maintaining an accurate static collection of information about servers, this design will generate the EML records from the automatically built DDX response, which is supported by the newer DAP servers. Because this response does not necessarily contain all of the raw information needed to make a valid EML record, we will use the AIS handler, designed to be used by other projects as well, to supplement the information in the DDX with 'micro documents' that will be collected in files separate from, but closely bound to, the original data. The AIS will combine the information in these files with the information in the data sets themselves and return the resulting DDX (think of it as a DDX' - DDX prime).
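The flow described above (DDX plus AIS 'micro document' gives DDX', which is then turned into an EML record) can be illustrated with a small Python sketch. This is not the AIS handler or the actual design: the XML below is a simplified, namespace-free stand-in for a DDX and for EML, the attribute names 'title' and 'creator', the sample values, and the functions merge_ais, attribute_value and ddx_to_eml are all hypothetical, and the rule that an AIS attribute replaces a same-named DDX attribute is an assumption made only for this example.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for a DDX response (assumption: real DDX documents
# use an XML namespace and also describe the data set's variables).
DDX = """<Dataset name="sst.nc">
  <Attribute name="title" type="String"><value>Blended SST analysis</value></Attribute>
</Dataset>"""

# Hypothetical AIS 'micro document' kept in a file next to the data.
AIS = """<Dataset>
  <Attribute name="creator" type="String"><value>Example Data Center</value></Attribute>
</Dataset>"""


def merge_ais(ddx_xml, ais_xml):
    """Return DDX': the DDX with the AIS attributes folded in.

    Assumption for this sketch: an AIS attribute replaces a DDX
    attribute that has the same name.
    """
    ddx = ET.fromstring(ddx_xml)
    ais = ET.fromstring(ais_xml)
    existing = {a.get("name"): a for a in ddx.findall("Attribute")}
    for attr in ais.findall("Attribute"):
        old = existing.get(attr.get("name"))
        if old is not None:
            ddx.remove(old)
        ddx.append(attr)
    return ddx


def attribute_value(ddx, name):
    """Return the text of the named top-level attribute, or None."""
    attr = ddx.find(f"Attribute[@name='{name}']")
    return None if attr is None else attr.findtext("value")


def ddx_to_eml(ddx):
    """Build a minimal EML-like record (not schema-valid EML)."""
    eml = ET.Element("eml")
    dataset = ET.SubElement(eml, "dataset")
    title = attribute_value(ddx, "title") or ddx.get("name")
    ET.SubElement(dataset, "title").text = title
    creator = attribute_value(ddx, "creator")
    if creator:
        creator_el = ET.SubElement(dataset, "creator")
        ET.SubElement(creator_el, "organizationName").text = creator
    return eml


if __name__ == "__main__":
    ddx_prime = merge_ais(DDX, AIS)
    print(ET.tostring(ddx_to_eml(ddx_prime), encoding="unicode"))
```

The point of the sketch is only that the record is rebuilt from the server's current response each time, so it cannot drift from the data the way a hand-maintained static EML file can.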

However, if we were to require data providers to write 'AIS files' containing EML that was then merged with the data set's information to build the DDX, we really would not have solved much, if anything. The same basic problems would remain: the collection of information would grow stale.

Why will storing information in the 'AIS files' work when writing full EML records doesn't? First, the AIS files will be closely bound to the data sets they describe, so the providers will likely move those AIS files when they move the data files, whether the data move within a host or between hosts. Second, the AIS files will use

We have experimented with searching solutions that depend on server's

Risks

Attachments:

20040103-NCDC-L4LRblend-GLOB-v01-fv01_0-AVHRR_AMSR_OI.nc.bz2.eml

6840 bytes


This particular version was published on 01-May-2009 14:34:13 PDT by uid=gallagher,o=unaffiliated.