The DAP data model#
Warning
The information on this page may be outdated. If you notice that the following examples do not run correctly, consider reporting it on the GitHub issue tracker pydap/pydap#issues.
The DAP is a protocol designed for the efficient transmission of scientific data over the internet. To transmit data from a server to a client, both must agree on a way to represent the data: is it an array of integers, or a multi-dimensional grid? To make this possible, the specification defines a data model that, in theory, should be able to represent any existing dataset.
Metadata#
pydap has a series of classes in the pydap.model module, representing the DAP data model. The most fundamental data type is called BaseType, and it represents a value or an array of values. Here is an example of creating one of these objects:
Note
Prior to pydap 3.2, the name argument was optional for all data types. Since pydap 3.2, it is mandatory.
All pydap types share a common set of attributes, four of which are described below. The first one is the name of the variable; in this case, our variable is called “a”:
Note that there’s a difference between the variable name (the local name a) and its attribute name; in this example they are equal, but we could reference our object using any other name:
We can use special characters for the variable names; they will be quoted accordingly:
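As a rough illustration of what "quoted accordingly" can mean, the sketch below percent-encodes a name with the standard library's urllib.parse.quote. This is an assumption about the flavor of quoting involved: pydap's own rules (for instance, which characters it treats as safe) may differ, and quote_name is a hypothetical helper, not a pydap function.

```python
from urllib.parse import quote, unquote


def quote_name(name):
    """Percent-encode characters that are awkward in a URL (hypothetical helper)."""
    # Assumption: URL-style percent-encoding; pydap may treat some characters differently.
    return quote(name)


quoted = quote_name("long & complicated")
print(quoted)           # long%20%26%20complicated
print(unquote(quoted))  # long & complicated
```

The quoted form is safe to embed in a request URL, and the original name can be recovered by unquoting it.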
The second attribute is called id. In the examples we’ve seen so far, id and name are equal:
This is because the id is used to show the position of the variable in a given dataset, and in these examples the variables do not belong to any dataset. First let’s store our variables in a container object called StructureType. A StructureType is a special type of ordered dictionary that holds other pydap types:
Note that the variable name has to be used as its key in the StructureType. This can be easily remedied:
There is a special derivative of the StructureType called DatasetType, which represents the dataset. The difference between the two is that there should be only one DatasetType, but it may contain any number of StructureType objects, which can be deeply nested. Let’s create our dataset object:
Note that for objects on the first level of the dataset, like s, the id is identical to the name. Deeper objects, like a, which is stored in s, have their id calculated by joining the names of the variables with a period. One detail is that we can access variables stored in a structure using a “lazy” syntax like this:
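The id rules described above can be pictured with a small stand-alone model. The classes Node, Container and Root are hypothetical stand-ins for BaseType, StructureType and DatasetType; they are a sketch of the described behavior under those assumptions, not pydap's implementation.

```python
class Node:
    """Toy stand-in for a variable: it only tracks a name and an id."""

    def __init__(self, name):
        self.name = name
        self.id = name  # outside any container, the id equals the name


class Container(Node):
    """Toy stand-in for a structure: an ordered mapping of child nodes."""

    def __init__(self, name):
        super().__init__(name)
        self.children = {}  # dicts preserve insertion order

    def __setitem__(self, key, child):
        # The key must match the variable name.
        if key != child.name:
            raise KeyError(f"key {key!r} differs from variable name {child.name!r}")
        self.children[key] = child
        self._update_ids()

    def __getitem__(self, key):
        return self.children[key]

    def __getattr__(self, attr):
        # "Lazy" syntax: called only when normal lookup fails, so s.a works like s["a"].
        children = self.__dict__.get("children", {})
        if attr in children:
            return children[attr]
        raise AttributeError(attr)

    def child_prefix(self):
        return self.id + "."

    def _update_ids(self):
        # Recompute the ids of all descendants from this container's id.
        for child in self.children.values():
            child.id = self.child_prefix() + child.name
            if isinstance(child, Container):
                child._update_ids()


class Root(Container):
    """Toy stand-in for the dataset: its own name is left out of child ids."""

    def child_prefix(self):
        return ""


a = Node("a")
s = Container("s")
s["a"] = a
dataset = Root("example")
dataset["s"] = s

print(dataset["s"].id)       # s
print(dataset["s"]["a"].id)  # s.a
print(dataset.s.a.id)        # s.a
```

Ids are recomputed whenever a container is inserted somewhere, so a variable's id always reflects its current position in the tree.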
The third common attribute that variables share is called attributes, which holds most of the variable’s metadata. This attribute is a dictionary of keys and values, and the values themselves can also be dictionaries. For our variable a we have:
These attributes can be accessed lazily directly from the variable:
But if you want to create a new attribute you’ll have to insert it directly into attributes:
It’s always better to use the correct syntax instead of the lazy one when writing code. Use the lazy syntax only when introspecting a dataset on the Python interpreter, to save a few keystrokes.
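To see why the lazy syntax works for reading but not for creating metadata, consider the following minimal sketch. Variable is a hypothetical toy class that falls back to its attributes dictionary when normal attribute lookup fails; it is meant to show the general Python mechanism, not pydap's actual code.

```python
class Variable:
    """Toy variable with an `attributes` dictionary and lazy read access."""

    def __init__(self, name, attributes=None):
        self.name = name
        self.attributes = dict(attributes or {})

    def __getattr__(self, attr):
        # Called only when normal lookup fails: fall back to the metadata dict.
        try:
            return self.__dict__["attributes"][attr]
        except KeyError:
            raise AttributeError(attr) from None


a = Variable("a", attributes={"long_name": "variable a"})
print(a.long_name)  # variable a  (lazy read)

a.units = "m"                   # creates a plain Python attribute, NOT metadata
print("units" in a.attributes)  # False

a.attributes["units"] = "m"     # the explicit way stores it as metadata
print(a.attributes["units"])    # m
```

Assignment through the lazy syntax silently bypasses the metadata dictionary, which is one reason to prefer the explicit form in code.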
The fourth attribute is called data, and it holds a representation of the actual data. We’ll take a detailed look at this attribute in the next subsection.
Note
Prior to pydap 3.2, all variables also had an attribute called _nesting_level. This attribute had value 1 if the variable was inside a SequenceType object, 0 if it was outside, and >1 if it was inside a nested sequence. Since pydap 3.2, _nesting_level has been deprecated and there is no intrinsic way of finding where in a deep object a variable is located.
Data#
As we saw in the last subsection, all pydap objects have a data attribute that holds a representation of the variable data. This representation will vary depending on the variable type.
BaseType#
For the simple BaseType objects the data attribute is usually a NumPy array, though we can also use a NumPy scalar or Python number:
Note that starting from pydap 3.2 the datatype is inferred from the input data:
When you slice a BaseType array, the slice is simply passed on to the data attribute. So we may have:
You can think of a BaseType object as a thin layer around NumPy arrays, until you realize that the data attribute can be any object implementing the array interface! This is how the DAP client works: instead of assigning an array with data directly to the attribute, we assign a special object which behaves like an array and acts as a proxy to a remote dataset. Here’s an example:
In the example above, the data is only downloaded in the last line, when the pseudo array is sliced. The object will construct the appropriate DAP URL, request the data, unpack it and return a Numpy array.
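The deferred download can be sketched without any network access. LazyArrayProxy and fake_fetch are hypothetical names, the URL is a placeholder, and the server is simulated with a local list; the only thing the sketch is meant to show is that no request is made until the object is sliced, at which point the slice is translated into a DAP-style hyperslab in the URL.

```python
class LazyArrayProxy:
    """Toy array-like proxy: nothing is fetched until the object is sliced."""

    def __init__(self, base_url, var_id, fetch):
        self.base_url = base_url
        self.var_id = var_id
        self.fetch = fetch  # callable(url) -> list of values

    def __getitem__(self, index):
        # Sketch only: non-negative integers and slices with an explicit stop.
        if isinstance(index, int):
            index = slice(index, index + 1)
        start = index.start or 0
        step = index.step or 1
        stop = index.stop
        # DAP-style hyperslab [start:step:last], where the last index is inclusive.
        url = f"{self.base_url}.dods?{self.var_id}[{start}:{step}:{stop - 1}]"
        return self.fetch(url)


requests = []                   # records every URL that would be requested
remote = list(range(100, 110))  # pretend this array lives on the server


def fake_fetch(url):
    """Simulated server: parse the hyperslab and return the matching values."""
    requests.append(url)
    spec = url[url.index("[") + 1:-1]
    start, step, last = (int(x) for x in spec.split(":"))
    return remote[start:last + 1:step]


proxy = LazyArrayProxy("http://example.com/data.nc", "a", fake_fetch)
print(len(requests))  # 0, creating the proxy requests nothing
print(proxy[2:5])     # [102, 103, 104]
print(requests)       # ['http://example.com/data.nc.dods?a[2:1:4]']
```

A real proxy would additionally know the variable's shape and dtype and would decode a binary response, both of which are left out here.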
StructureType#
A StructureType holds no data; instead, its data attribute is a property that collects data from the children variables:
The opposite is also true; it’s possible to specify the structure data and have it propagated to the children:
The same is true for objects of DatasetType, since the dataset is simply the root structure.
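A property with a getter and a setter is enough to model this two-way relationship. The sketch below uses hypothetical Leaf and Struct classes and plain Python values in place of arrays; it illustrates the idea, not the library's code.

```python
class Leaf:
    """Toy variable holding its own data."""

    def __init__(self, name, data=None):
        self.name = name
        self.data = data


class Struct:
    """Toy structure: stores no data itself, only children."""

    def __init__(self, name, children):
        self.name = name
        self.children = list(children)

    @property
    def data(self):
        # Collect the data from each child, in order.
        return [child.data for child in self.children]

    @data.setter
    def data(self, values):
        # Propagate one value to each child, in order.
        values = list(values)
        if len(values) != len(self.children):
            raise ValueError("expected one value per child")
        for child, value in zip(self.children, values):
            child.data = value


s = Struct("s", [Leaf("a", 1), Leaf("c", [10, 20])])
print(s.data)  # [1, [10, 20]]

s.data = (2, [30, 40])
print(s.children[0].data)  # 2
print(s.children[1].data)  # [30, 40]
```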
SequenceType#
A SequenceType object is a special kind of StructureType holding sequential data. Here’s an example of a sequence holding the variables a and c that we created before:
Let’s add some data to our sequence. This can be done by assigning a structured NumPy array to the data attribute:
Note that the data for the sequence is an aggregation of the children data, similar to Python’s zip() builtin. This will be more complicated when encountering nested sequences, but for flat sequences they behave the same.
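The zip() analogy for a flat sequence can be shown with plain lists. The column values here are made up for illustration.

```python
a_values = [1, 2, 3]
c_values = ["x", "y", "z"]

# Column-wise view: one list per child variable.
columns = {"a": a_values, "c": c_values}

# Record-wise view: one tuple per row, as zip() produces.
records = list(zip(columns["a"], columns["c"]))
print(records)  # [(1, 'x'), (2, 'y'), (3, 'z')]

# Going back from records to columns is the same operation again.
a_again, c_again = zip(*records)
print(a_again)  # (1, 2, 3)
```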
We can also iterate over the SequenceType using its iterdata() method. In this case, it will return a series of tuples with the data:
Prior to pydap 3.2.2, this approach was not possible and one had to iterate directly over the SequenceType:
Direct iteration will be deprecated in pydap 3.4.
The SequenceType behaves pretty much like structured arrays from NumPy, since we can reference them by column (s['a']) or by index:
Note that these objects are also SequenceType themselves.
The basic rules when working with sequence data are:
1. When a SequenceType is sliced with a string, the corresponding child is returned. For example, s['a'] will return child a;
2. When a SequenceType is iterated over (using .iterdata() after pydap 3.2.2), it will return a series of tuples, each one containing the data for a record;
3. When a SequenceType is sliced with an integer, a comparison or a slice(), a new SequenceType will be returned;
4. When a SequenceType is sliced with a tuple of strings, a new SequenceType will be returned, containing only the children defined in the tuple, in the new order. For example, s[('c', 'a')] will return a sequence s with the children c and a, in that order.
Note that, except for rule 4, SequenceType mimics the behavior of NumPy structured arrays.
Now imagine that we want to add to a SequenceType data pulled from a relational database. The easy way would be to fetch the data in the correct column order, and insert it into the sequence. But what if we don’t want to store the data in memory, and instead we would like to stream it directly from the database? In this case we can create an object that behaves like a structured array, similar to the proxy object that implements the array interface. pydap defines a “protocol” called IterData, which is simply any object that:
1. Returns data when iterated over.
2. Returns a new IterData when sliced, such that:
   - if the slice is a string, the new IterData contains data only for that child;
   - if the slice is a tuple of strings, the object contains only those children, in that order;
   - if the slice is an integer, a slice() or a comparison, the data is filtered accordingly.
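A minimal sketch of an object following these rules is given below. ToyIterData is a hypothetical class written for illustration and is unrelated to pydap's own IterData implementation; it wraps a callable that returns a fresh iterator (standing in for a database cursor) so that no rows are held by the object itself. Filtering by comparison is left out to keep the sketch short.

```python
from itertools import islice


class ToyIterData:
    """Toy record stream that is iterable and sliceable (not pydap's class)."""

    def __init__(self, names, source):
        self.names = tuple(names)  # column names, in order
        self.source = source       # zero-argument callable returning a fresh iterator

    def __iter__(self):
        return self.source()

    def __getitem__(self, key):
        if isinstance(key, str):
            # One child: stream bare values for that column.
            i = self.names.index(key)
            return ToyIterData((key,), lambda: (row[i] for row in self.source()))
        if isinstance(key, tuple):
            # Several children, in the requested order.
            idx = [self.names.index(k) for k in key]
            return ToyIterData(
                key, lambda: (tuple(row[i] for i in idx) for row in self.source())
            )
        if isinstance(key, int):
            key = slice(key, key + 1)
        if isinstance(key, slice):
            # Filter records by position (islice needs non-negative indices).
            return ToyIterData(
                self.names,
                lambda: islice(self.source(), key.start, key.stop, key.step),
            )
        raise TypeError(f"unsupported key: {key!r}")


table = [(1, "x", 0.5), (2, "y", 1.5), (3, "z", 2.5)]  # stand-in for a database table
seq = ToyIterData(("a", "b", "c"), lambda: iter(table))

print(list(seq["a"]))         # [1, 2, 3]
print(list(seq[("c", "a")]))  # [(0.5, 1), (1.5, 2), (2.5, 3)]
print(list(seq[1:]))          # [(2, 'y', 1.5), (3, 'z', 2.5)]
print(list(seq[0]))           # [(1, 'x', 0.5)]
```

Because every slice returns a new object that wraps the previous one, slices can be chained (for example seq[("c", "a")][1:]) and rows are only pulled from the source when the final object is iterated.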
The base implementation works by wrapping data from a basic NumPy array. Here is an example of how we would use it:
One can also iterate directly over the IterData object to obtain the data:
This approach will not be deprecated in pydap 3.4. NOTE: For numpy > 2.0, iterating over the IterData object returns a record specifying the individual types of the elements of the sequence.
There are many implementations of classes derived from IterData: pydap.handlers.dap.SequenceProxy is a proxy to sequential data on OPeNDAP servers, pydap.handlers.csv.CSVProxy wraps a CSV file, and pydap.handlers.sql.SQLProxy works as a stream to a relational database.
GridType#
A GridType is a special kind of object that behaves like an array and a StructureType. The class is derived from StructureType; the major difference is that the first defined variable is a multidimensional array, while subsequent children are vector maps that define the axes of the array. This way, the data attribute on a GridType returns the data of all its children: the n-dimensional array followed by n maps. Here is a simple example:
Grids behave like arrays in that they can be sliced. When this happens, a new GridType is returned with the proper data and axes:
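The idea that slicing a grid also slices its axes can be pictured with a toy two-dimensional grid built from nested lists. ToyGrid and the rain, y and x names are hypothetical and chosen for illustration; this is a sketch of the behavior, not the GridType implementation.

```python
class ToyGrid:
    """Toy 2-D grid: a nested list plus one coordinate list (map) per axis."""

    def __init__(self, name, array, maps):
        self.name = name
        self.array = array      # list of rows
        self.maps = dict(maps)  # axis name -> coordinate values, in axis order

    @property
    def data(self):
        # The array first, followed by one map per dimension.
        return [self.array] + list(self.maps.values())

    def __getitem__(self, index):
        rows, cols = index      # expects a pair of slice objects
        (yname, y), (xname, x) = self.maps.items()
        new_array = [row[cols] for row in self.array[rows]]
        # The same slices are applied to the maps, so the axes stay aligned.
        return ToyGrid(self.name, new_array, {yname: y[rows], xname: x[cols]})


rain = ToyGrid(
    "rain",
    [[0, 1, 2], [3, 4, 5]],
    {"y": [10, 20], "x": [100, 200, 300]},
)
sub = rain[0:1, 1:3]
print(sub.array)  # [[1, 2]]
print(sub.maps)   # {'y': [10], 'x': [200, 300]}
print(sub.data)   # [[[1, 2]], [10], [200, 300]]
```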
It is possible to disable this feature (some older servers might not handle it nicely):