Searching for Data Sets Within the Ocean Use Case

About

This is a summary of a design to provide EML documents for the data sets that are part of the REAP Ocean Use Case. The complete design can be found at REAP Cataloging and Searching. Note that the software described there makes use of two other components proposed for development not only for this effort but also for other projects: the NcML AIS handler and the NcML Aggregation handler, both modules that will run in Hyrax.

Note that this does not provide any hints on how to build a user interface that would actually use the information; it only describes how to get that information into the database.

The Problem

Finding data scattered among distributed servers that are run independently is a long-standing problem. Various solutions, such as crawling the servers' contents or requiring standardized information about the contents so they can be indexed, have been tried with varying success. The design here addresses the twin problems of motivating providers to write the extra information that makes data locatable and of increasing the time those documents remain valid. It is not a complete solution, in the sense that not all providers will write the needed documents and, over time, some documents will go stale. However, it should have fewer problems than designs that fail to consider these realities.

DAP servers, which provide all of the data used by the Ocean Use Case, are completely autonomous entities with their own network interfaces. Each supports a uniform base set of operations, but there is no requirement to organize or catalog those servers in a coordinated and centralized way. For the purposes of this use case, we assume that implementing a distributed query is outside the scope of solutions we want to consider. Thus, we must create a centralized store of information about the collection of known data sets that can be searched in response to various types of queries.
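
As a point of reference, a DAP server typically returns a DDX when '.ddx' is appended to a data set's URL. The URL, variable, and attribute names below are hypothetical, and the exact element layout varies with the server version, but the response has this general shape:

    http://example.org/opendap/data/sst.nc.ddx

    <?xml version="1.0" encoding="UTF-8"?>
    <Dataset name="sst.nc" xmlns="http://xml.opendap.org/ns/DAP2">
        <Attribute name="NC_GLOBAL" type="Container">
            <Attribute name="title" type="String">
                <value>Sea Surface Temperature (example)</value>
            </Attribute>
        </Attribute>
        <Array name="SST">
            <Float32/>
            <dimension name="time" size="12"/>
            <dimension name="lat" size="180"/>
            <dimension name="lon" size="360"/>
        </Array>
    </Dataset>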

Solution

The Kepler client supports searching the Metacat database system to locate data sets, and populating an instance of this database with information about the data sets in the use case should provide the needed foundation on which to build a search user interface. However, the Metacat/Kepler combination is not completely general: while Metacat can index arbitrary XML documents (so it could index the DDX returned by a DAP server for a given data set), Kepler expects the records it searches to be (a subset of) EML documents.
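
To make the target concrete, here is a minimal sketch of the kind of EML record Kepler can search. The packageId, title, and organization name are hypothetical, and the particular fields Kepler indexes (title, creator, keywords, and coverage, as we understand it) should be checked against the Metacat configuration:

    <?xml version="1.0" encoding="UTF-8"?>
    <eml:eml xmlns:eml="eml://ecoinformatics.org/eml-2.0.1"
             packageId="reap.42.1" system="knb">
        <dataset>
            <title>Sea Surface Temperature (example)</title>
            <creator>
                <organizationName>Example Data Provider</organizationName>
            </creator>
            <contact>
                <organizationName>Example Data Provider</organizationName>
            </contact>
        </dataset>
    </eml:eml>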

Since the servers that provide the use case's data are distributed, it seems like a good plan to have each server provide information about its (relevant) holdings using EML. Each server could return these XML documents as a response. In fact, since all DAP servers are bundled with some form of HTTP server, the EML documents could be stored in the HTTP server's 'DocumentRoot'. However, our experience, and the experience of others, is that a solution based on such a collection of static documents does not work in practice. The servers move, the URLs to data sets within a server change, and the records about the data sets often contain errors. Furthermore, as new data sets are added, data providers don't supply the needed information about them, so while the collection of records starts out complete (not considering errors in the existing information), it rapidly ages and grows more and more stale with time.

To solve the problem of maintaining an accurate static collection of information about servers, this design will generate the EML records from the automatically-built DDX response, which is supported by the newer DAP servers. Because this response does not necessarily contain all of the raw information needed to make a valid EML record, we will use the AIS handler, designed to be used by other projects as well, to supplement the information in the DDX with 'micro documents' that will be collected in files separate from, but closely bound to, the original data. The AIS will combine the information in these files with the information in the data sets themselves and return the resulting DDX (think of it as DDX' - DDX prime).
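
A micro document for the NcML AIS handler might look something like the sketch below. The attribute names here are illustrative only (loosely following the kinds of fields ISO 19115 covers); settling the actual vocabulary is part of the design work:

    <?xml version="1.0" encoding="UTF-8"?>
    <netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2"
            location="sst.nc">
        <!-- Supplement the data set's own attributes with the fields
             needed to build a complete catalog record. -->
        <attribute name="title" value="Sea Surface Temperature (example)"/>
        <attribute name="creator" value="Example Data Provider"/>
        <attribute name="geospatial_lat_min" type="float" value="30.0"/>
        <attribute name="geospatial_lat_max" type="float" value="50.0"/>
    </netcdf>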

However, if we were to require data providers to write 'AIS files' containing EML that was then merged with the data set's information to build the DDX, we would not really have solved much, if anything. The same basic problem would remain: the collection of information would quickly grow stale. Instead, we will isolate the specific parts of the EML document that the Metacat/Kepler combination needs and generalize those. The generalized 'micro documents' will be based on ISO 19115 to the extent possible, and the result merged into the DDX. A DDX-->EML transform (implemented in XSLT, which Metacat supports) will actually generate the EML documents. This provides two important sources of leverage. First, since the source information is more general than EML itself, it should be easier to motivate data providers to write it. Second, the same source information can be used to build other kinds of records used by other search systems.
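
The DDX-->EML transform might begin along the lines of the sketch below. This is a minimal illustration, not the actual transform: it assumes the DAP2 DDX namespace, pulls only a title from the global attributes, and invents a packageId convention; a real transform would also have to map creators, coverage, and the variable descriptions:

    <?xml version="1.0" encoding="UTF-8"?>
    <xsl:stylesheet version="1.0"
                    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                    xmlns:dap="http://xml.opendap.org/ns/DAP2"
                    xmlns:eml="eml://ecoinformatics.org/eml-2.0.1">
        <xsl:output method="xml" indent="yes"/>
        <!-- Build a skeletal EML record from the DDX's global attributes. -->
        <xsl:template match="/dap:Dataset">
            <eml:eml packageId="reap.{@name}.1" system="knb">
                <dataset>
                    <title>
                        <xsl:value-of select="dap:Attribute[@name='NC_GLOBAL']/dap:Attribute[@name='title']/dap:value"/>
                    </title>
                </dataset>
            </eml:eml>
        </xsl:template>
    </xsl:stylesheet>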

While not in the scope of this design, this AIS --> DDX --> EML data flow provides the basis for more sophisticated processing than XSLT alone can support.

Risks

  • There's no discussion of how to modify Kepler so users can search for the data sets. This isn't a risk per se; it just indicates that there's a whole other issue to be addressed.

