OPeNDAP Kepler Data Model Resolution

The OPeNDAP data model (aka the DAP2 data model) supports more complex data objects than the Kepler/Ptolemy data model. In particular, DAP2 supports deeper hierarchies and N-dimensional arrays. Although these data can find logical representations in Kepler/Ptolemy, the mapping raises the following issues:

N-Dimensional Array Issues

N-dimensional arrays are poorly optimized in memory (unless the data can be represented as a Ptolemy matrix type).

Comments from Dan Higgins - 9/28/2007
  1. The Ptolemy array of tokens is inefficient, while the Ptolemy Matrix is designed for 2-D information (images). I suggest that we add a new multidimensional array type to Kepler. Data would be stored linearly in memory (a 1-D Java array) with a second array indicating dimensionality (the approach used in R). Thus a 1-D array with 12 elements could be dimensioned as a 1x12, 2x6, 3x4, or 2x3x2 array, and the dimensionality can be changed without moving any data.

    (This is in effect the way that OPeNDAP stores its array data in memory. If care were taken in designing this Token type, we could allow the internal storage to be passed in - thus allowing us to read OPeNDAP data and seamlessly wrap it in a Token without being forced to do an element-by-element copy. ndp 11/14/07)

  2. Although a new multidimensional array type would be more space efficient than arrays of array tokens, any purely RAM based implementation will rapidly run into memory limitations as we try to handle bigger data sets (e.g. multiple datasets over time with time the 3rd dimension). So why not consider a disk-based option now?

  3. Why not consider a new Kepler datatype that is file-based? That is, store the OPeNDAP data in local files (perhaps CDF or HDF files? I think there are Java tools for reading such files) and use 'file reference tokens' (these don't currently exist). (Currently we do use simple strings as file references in Kepler. We should add a 'ReferenceToken' that is immutable - i.e., for files this would involve file locking, etc.)

  4. Java NIO routines offer some methods for optimizing speed of random access of large disk-based files using OS disk caches and other methods. We might want to investigate these to optimize performance of disk-based data storage.
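The multidimensional array type proposed in point 1 above could be sketched roughly as follows: one flat 1-D Java array holds the data, a separate dims array records the current shape, and reshaping only rewrites the dims array. This is an illustrative sketch, not Kepler code; the class and method names are invented.

```java
// Hypothetical sketch of the proposed multidimensional array token:
// data lives in one flat 1-D array, and a separate dims array records
// the dimensionality (the scheme R uses). All names here are invented.
public class NDArrayToken {
    private final double[] data;   // linear storage, never copied on reshape
    private int[] dims;            // e.g. {2, 3, 2} for a 2x3x2 array

    public NDArrayToken(double[] data, int[] dims) {
        if (data.length != product(dims))
            throw new IllegalArgumentException("dims do not match data length");
        this.data = data;          // wraps the caller's array; no element copy
        this.dims = dims.clone();
    }

    // Re-dimension in place: a 12-element array can become 1x12, 2x6,
    // 3x4, 2x3x2, etc. without moving any data.
    public void reshape(int[] newDims) {
        if (product(newDims) != data.length)
            throw new IllegalArgumentException("reshape must preserve length");
        this.dims = newDims.clone();
    }

    // Row-major index arithmetic: offset = ((i0*d1)+i1)*d2 + i2 ...
    public double get(int[] index) {
        int offset = 0;
        for (int k = 0; k < dims.length; k++)
            offset = offset * dims[k] + index[k];
        return data[offset];
    }

    private static int product(int[] d) {
        int p = 1;
        for (int x : d) p *= x;
        return p;
    }
}
```

Because the constructor wraps the caller's array rather than copying it, a reader of OPeNDAP data could hand its internal storage directly to the token, as suggested in the parenthetical note above.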
Comments from Nathan Potter - 11/15/2007
I have added an optimization step to the OPeNDAP actor in which it "squeezes" incoming arrays to remove dimensions whose size is equal to one. The result is that if the user subsets an N-dimensional array in such a way that the result is effectively a one- or two-dimensional array, then the actor will map it to a matrix (1xN or MxN, respectively). While this doesn't address all of the memory usage concerns, it will enable us to move forward with workflow development for the SST use case. That work should bring more focus to the other memory limitations that we may encounter.
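The squeeze step described above amounts to dropping every size-one entry from the array's shape. A minimal sketch of that shape transformation (the class and method names are illustrative, not the actor's actual API):

```java
// Sketch of the "squeeze" optimization: drop every dimension of size one,
// so that e.g. a 1x180x1x360 subset of a 4-D array can be handed to
// Kepler as a 180x360 matrix. Names are invented for illustration.
public class DimensionSqueezer {
    public static int[] squeeze(int[] dims) {
        int kept = 0;
        for (int d : dims)
            if (d != 1) kept++;
        if (kept == 0)
            return new int[]{1};   // a single-element array stays 1-D
        int[] out = new int[kept];
        int i = 0;
        for (int d : dims)
            if (d != 1) out[i++] = d;
        return out;
    }
}
```

If the squeezed shape has one or two dimensions, the data can then be mapped to a 1xN or MxN matrix as described above.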

Data Structure Complexity

DAP data is not readily usable by the existing Kepler/Ptolemy actor suite due to its inherent complexity.

Because the DAP is rich in data stored in variations of the Structure data type, much of what is produced from DAP data sources will naturally map to a RecordToken. Currently there is a RecordDisassembler actor that can be used to break apart these RecordTokens. Unfortunately, it is very cumbersome for the user to configure. Because of this, Ilkay attempted to write an AutomatedRecordDisassembler actor and immediately ran into a wall: it was not possible for the actor to determine the structure of the incoming RecordToken at design time.

Comments From Dan Higgins (11/15/2007)
I got thinking about this issue after our last REAP call and thought I would try to summarize some thoughts here.

At design time, actors are really independent. Most of the actors that dynamically generate output ports are data source actors. Except for trigger inputs, they have no inputs, and their output ports can be generated because the actor uses some parameter to get the data needed to generate ports when it is instantiated (i.e., dropped on the canvas) or when input parameters are changed. (Example: the EML2 Datasource uses the EML metadata; the OpenDAP actor uses a URL. Changing an actor parameter during design will trigger changes in the output ports via AttributeChanged events.)

But the actor connected to the output of a data source only gets port data from the preceding actor during the fire cycle, which doesn't occur during design. Say some data source puts its output in a complex form (e.g., a Kepler Record, XML, or an R dataframe). If that output is connected to the input of another actor, then without additional information the following actor cannot know any details of its input until it receives the data token! The existing RecordDisassembler works by requiring the user to know some 'names' of items in the Record token and creating output ports with those names.

Now, a RecordDisassembler actor could be given a Parameter that is an array of strings that are names of Record elements and then could automatically generate output ports based on that array (or it could be given a 'template' RecordToken and figure out the name array from the Token). A change in this parameter at design time would trigger changes in the outputs. The parameter could even be placed on the canvas and shared between multiple actors. (i.e. Parameters are a way for actors to share information at design time).
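The idea above - a names parameter set at design time driving which output ports exist - could be sketched in plain Java as follows. A Map stands in for Ptolemy's RecordToken here, and all class and method names are invented for illustration; a real implementation would work against the Ptolemy actor and token APIs.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Plain-Java sketch of a name-driven record disassembler: a list of
// element names (the proposed design-time Parameter) determines which
// "output ports" are created, instead of inspecting a token at run time.
// Map stands in for RecordToken; all names are illustrative.
public class NameDrivenDisassembler {
    private final String[] portNames;   // from the design-time Parameter

    public NameDrivenDisassembler(String[] portNames) {
        this.portNames = portNames.clone();
    }

    // At fire time, route each named field of the record to its port.
    public Map<String, Object> disassemble(Map<String, Object> record) {
        Map<String, Object> outputs = new LinkedHashMap<>();
        for (String name : portNames) {
            if (!record.containsKey(name))
                throw new IllegalArgumentException("record has no field: " + name);
            outputs.put(name, record.get(name));
        }
        return outputs;
    }
}
```

The same constructor argument could just as easily be derived from a 'template' RecordToken, as suggested above, by taking its label set as the name array.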

Note that the problem is related to complex data types like a Record or XML file. The data type of ports can be set at design time and checked. But with complex types, the type itself is incomplete (i.e.more information is needed). For a Record, one would really need a complete description of the Record with element names and element types (perhaps recursive) and with XML, one would need a complete schema. Requiring such complete type descriptions would make actors so specific that their usefulness would be limited to only a few cases.

Theoretically, one could have an actor query every one of its predecessors when its input ports are connected, to see if there is information about the details of the data that will be sent to it (e.g., a RecordDisassembler could ask its predecessor(s) for a prototype record) and use that data (if available) to configure its outputs. But that greatly increases the complexity of a workflow, because every actor is then (possibly) strongly linked to all of its predecessors!

So all this brings me to one possible solution. Assume that any data source that actually gets information at design time (like the EML of the EML actor, or the OpenDAP actor) ALSO created a Kepler parameter when it was dropped on the canvas (or when its parameters were changed), and that this new parameter were automatically made visible on the workspace canvas. The parameter would basically be the schema of the Record or EML that the actor might output. Any other actor added to the model could then use that parameter; e.g., if the parameter were a Record, a RecordDisassembler could use it as a template for creating outputs.
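The design-time handshake proposed above can be sketched as a simple shared object: the data source publishes a schema (field name to type name) when it is configured, and a downstream actor reads that schema to build its output ports before any data flows. This is a sketch of the idea only; the class and method names are invented and do not correspond to existing Kepler APIs.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of a shared design-time schema parameter: the data source fills
// it in when dropped on the canvas or reconfigured; downstream actors
// read it to create ports. All names are invented for illustration.
public class SchemaParameter {
    private final Map<String, String> fields = new LinkedHashMap<>();

    // Called by the data source at design time (e.g. after parsing EML
    // metadata or an OPeNDAP DDS) to declare one output field.
    public void declare(String fieldName, String typeName) {
        fields.put(fieldName, typeName);
    }

    // Called by a downstream actor, such as a RecordDisassembler, to
    // learn which output ports to create - before any fire cycle runs.
    public String[] portNames() {
        return fields.keySet().toArray(new String[0]);
    }

    public String typeOf(String fieldName) {
        return fields.get(fieldName);
    }
}
```

Because the parameter lives on the canvas rather than inside either actor, multiple actors can consume the same schema without being strongly linked to the data source.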

This page last changed on 14-Jan-2008 11:13:24 PST by uid=barseghian,o=NCEAS.