
OPeNDAP Kepler Data Model Resolution

The OPeNDAP data model (aka the DAP2 data model) supports more complex data objects than the Kepler/Ptolemy data model. In particular, DAP2 supports deeper hierarchies and N-dimensional arrays. Although these data can find logical representations in Kepler/Ptolemy, the end result has the following problems:

N-Dimensional Array Issues

N-dimensional arrays are poorly optimized in memory (unless the data is represented as a Ptolemy matrix type).

Comments from Dan Higgins - 9/28/2007

The Ptolemy array of tokens is inefficient, while the Ptolemy Matrix is designed for 2-D information (images). I suggest that we add a new multidimensional array type to Kepler. Data would be stored linearly in memory (a 1-D Java array) with a second array indicating dimensionality (the approach used in R). Thus a 1-D array with 12 elements could be dimensioned as a 1x12, 2x6, 3x4, or 2x3x2 array, and the dimensionality can be changed without moving any data.

(This is in effect the way that OPeNDAP stores its array data in memory. If care were taken in designing this Token type, we could allow the internal storage to be passed in, thus allowing us to read OPeNDAP data and seamlessly wrap it in a Token without being forced to do an element-by-element copy. ndp 11/14/07)
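The two comments above can be illustrated with a minimal sketch. `LinearNdArray` is a hypothetical class, not an existing Kepler or Ptolemy type. It assumes `double` elements and row-major index order (R itself uses column-major order), and it keeps a reference to the array passed in rather than copying it.

```java
import java.util.Arrays;

/** Hypothetical N-dimensional array: one flat storage array plus a shape array. */
final class LinearNdArray {
    private final double[] data; // caller's storage, wrapped without copying
    private int[] shape;         // size of each dimension

    LinearNdArray(double[] data, int... shape) {
        checkShape(data.length, shape);
        this.data = data;
        this.shape = shape.clone();
    }

    /** Changes the dimensionality; the stored elements are not moved. */
    void reshape(int... newShape) {
        checkShape(data.length, newShape);
        this.shape = newShape.clone();
    }

    /** Returns the element at the given N-dimensional index. */
    double get(int... index) {
        return data[offset(index)];
    }

    int[] shape() {
        return shape.clone();
    }

    /** Converts an N-dimensional index to a flat offset (row-major: assumption). */
    private int offset(int[] index) {
        if (index.length != shape.length) {
            throw new IllegalArgumentException("expected " + shape.length + " indices");
        }
        int off = 0;
        for (int d = 0; d < shape.length; d++) {
            if (index[d] < 0 || index[d] >= shape[d]) {
                throw new IndexOutOfBoundsException("index " + index[d] + " in dimension " + d);
            }
            off = off * shape[d] + index[d];
        }
        return off;
    }

    /** Verifies that the dimension sizes multiply out to the storage length. */
    private static void checkShape(int length, int[] shape) {
        long n = 1;
        for (int s : shape) {
            if (s < 0) {
                throw new IllegalArgumentException("negative dimension size: " + s);
            }
            n *= s;
        }
        if (n != length) {
            throw new IllegalArgumentException(
                "shape " + Arrays.toString(shape) + " does not match length " + length);
        }
    }
}
```

For a 12-element array holding the values 0 to 11, `new LinearNdArray(raw, 3, 4).get(1, 2)` reads flat element 6. After `reshape(2, 3, 2)`, `get(1, 2, 1)` reads flat element 11 from the same storage.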

Although a new multidimensional array type would be more space efficient than arrays of array tokens, any purely RAM-based implementation will rapidly run into memory limitations as we try to handle bigger data sets (e.g. multiple datasets over time, with time as the 3rd dimension). So why not consider a disk-based option now?

Why not consider a new Kepler datatype that is file-based? That is, store the OPeNDAP data in local files (perhaps CDF or HDF files; I think there are Java tools for reading such files) and use 'file reference tokens' (these don't currently exist). (Currently we do use simple strings as file references in Kepler. We should add a 'ReferenceToken' that is immutable, i.e. for files this would involve file locking, etc.)

Java NIO routines offer some methods for optimizing the speed of random access to large disk-based files, using OS disk caches and other methods. We might want to investigate these to optimize the performance of disk-based data storage.
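As one possible starting point for that investigation, the sketch below uses the standard `FileChannel.map` call to memory-map a file of raw doubles, so that random reads go through the OS page cache. `MappedDoubleStore` and its on-disk layout (big-endian 8-byte doubles with no header) are assumptions made for illustration. It is not a CDF or HDF reader.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.DoubleBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical disk-backed store of raw doubles, read through a memory map. */
final class MappedDoubleStore {

    /** Writes the values to the file as consecutive big-endian doubles. */
    static void write(Path file, double[] values) throws IOException {
        ByteBuffer bytes = ByteBuffer.allocate(values.length * Double.BYTES);
        bytes.asDoubleBuffer().put(values); // the view has its own position
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.CREATE,
                StandardOpenOption.WRITE,
                StandardOpenOption.TRUNCATE_EXISTING)) {
            while (bytes.hasRemaining()) {
                ch.write(bytes);
            }
        }
    }

    /**
     * Maps the whole file read-only and exposes it as doubles.
     * The mapping stays valid after the channel is closed.
     */
    static DoubleBuffer map(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            return ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size()).asDoubleBuffer();
        }
    }
}
```

A caller would use `MappedDoubleStore.map(file).get(i)` to fetch element `i` without reading the whole file into the Java heap. A single mapping is limited to 2 GB, so larger files would need to be mapped in several regions.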

Comments from Nathan Potter - 11/15/2007: I have added an optimization step to the OPeNDAP actor in which it "squeezes" incoming arrays to remove dimensions whose size is equal to one. The result is that if the user subsets an N-dimensional array in such a way that the result is effectively a 1- or 2-dimensional array, then the Actor will map it to a matrix (1xN and MxN respectively). While this doesn't address all of the memory usage concerns, it will enable us to move forward with workflow development for the SST use case. That work should bring more focus to the other memory limitations that we may encounter.
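The squeeze step can be sketched on the shape array alone. This is an illustration of the idea, not the OPeNDAP actor's actual code. `ShapeSqueezer` and its method names are hypothetical, and returning `null` for anything that does not squeeze to one or two dimensions is an assumption.

```java
import java.util.Arrays;

/** Hypothetical helper illustrating the "squeeze" step on an array's shape. */
final class ShapeSqueezer {

    /** Drops every dimension whose size is exactly one. */
    static int[] squeeze(int[] shape) {
        return Arrays.stream(shape).filter(n -> n != 1).toArray();
    }

    /**
     * Returns {rows, cols} if the squeezed shape has one or two dimensions
     * (1xN or MxN respectively), or null if it cannot be mapped to a matrix.
     */
    static int[] matrixShape(int[] shape) {
        int[] s = squeeze(shape);
        if (s.length == 1) {
            return new int[] {1, s[0]};
        }
        if (s.length == 2) {
            return s;
        }
        return null; // assumption: other cases are left to the caller
    }
}
```

For example, a subset with shape `{1, 180, 1, 360}` squeezes to `{180, 360}` and maps to a 180x360 matrix, while `{1, 1, 12}` squeezes to `{12}` and maps to a 1x12 matrix.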

Data Structure Complexity

DAP data is not really available to the existing Kepler/Ptolemy actor suite due to its inherent complexity.

Because the DAP is rich in data stored in variations of the Structure data type, much of what is produced from DAP data sources will naturally map to a RecordToken. Currently there is a RecordDisassembler Actor that can be used to break apart these RecordTokens. Unfortunately, it is very cumbersome for the user to configure. Because of this, Ilkay attempted to write an AutomatedRecordDisassembler Actor and immediately ran into a wall, because it was not possible for the Actor to determine the structure of the incoming RecordToken at design time.

Comments From Dan Higgins (11/15/2007)


This page last changed on 14-Jan-2008 11:13:24 PST by uid=barseghian,o=NCEAS.