The DAP data model

Warning

The information on this page may be outdated. If you notice that the following examples do not run correctly, consider reporting it in the GitHub issue tracker (pydap/pydap#issues).

The DAP is a protocol designed for the efficient transmission of scientific data over the internet. In order to transmit data from the server to a client, both must agree on a way to represent data: is it an array of integers, or a multi-dimensional grid? To make this possible, the specification defines a data model that, in theory, should be able to represent any existing dataset.

Metadata

pydap has a series of classes in the pydap.model module, representing the DAP data model. The most fundamental data type is called BaseType, and it represents a value or an array of values. Here is an example of creating one of these objects:

Note

Prior to pydap 3.2, the name argument was optional for all data types. Since pydap 3.2, it is mandatory.

All pydap types have four attributes in common. The first one is the name of the variable; in this case, our variable is called “a”:

Note that there’s a difference between the Python variable name (the local name a) and the object’s name attribute; in this example they are equal, but we could reference our object using any other name:

We can use special characters for the variable names; they will be quoted accordingly:
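As a rough illustration of what quoting means here, the sketch below percent-encodes a name with Python's standard library. It assumes that pydap's quoting is ordinary URL percent-encoding; the exact set of characters pydap leaves unescaped is not stated on this page, so treat the result as indicative only.

```python
from urllib.parse import quote, unquote

# A variable name containing spaces and an ampersand.
raw_name = "long & complicated"

# Percent-encode every character that is not a letter, digit or one of "_.-~".
quoted_name = quote(raw_name, safe="")
print(quoted_name)  # long%20%26%20complicated

# Quoting is reversible, so the original name can always be recovered.
print(unquote(quoted_name) == raw_name)  # True
```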

The second attribute is called id. In the examples we’ve seen so far, id and name are equal:

This is because the id is used to show the position of the variable in a given dataset, and in these examples the variables do not belong to any dataset. First let’s store our variables in a container object called StructureType. A StructureType is a special type of ordered dictionary that holds other pydap types:

Note that the variable’s name has to be used as its key in the StructureType; assigning it under a different key fails. This is easily remedied by using the variable’s own name attribute as the key:

There is a special derivative of the StructureType called DatasetType, which represents the dataset. The difference between the two is that there should be only one DatasetType, but it may contain any number of StructureType objects, which can be deeply nested. Let’s create our dataset object:

Note that for objects on the first level of the dataset, like s, the id is identical to the name. Deeper objects, like a which is stored in s, have their id calculated by joining the names of the variables with a period. One detail is that we can access variables stored in a structure using a “lazy” syntax like this:
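The id rule and the lazy access described above can be sketched with a small stand-in class. This is not pydap's StructureType or DatasetType; the `Node` class, its `is_root` flag and its `add` method are hypothetical names used only to show how an id can be derived from the chain of parents.

```python
class Node:
    """Toy container: NOT pydap's StructureType, just the id rule in miniature."""

    def __init__(self, name, is_root=False):
        self.name = name
        self.is_root = is_root  # True only for the dataset itself
        self.parent = None
        self.children = {}

    def add(self, child):
        # Children are stored under their own name.
        child.parent = self
        self.children[child.name] = child
        return child

    @property
    def id(self):
        # Unattached and first-level variables use their bare name.
        if self.parent is None or self.parent.is_root:
            return self.name
        # Deeper variables join the ancestor names with a period.
        return f"{self.parent.id}.{self.name}"

    def __getitem__(self, key):
        return self.children[key]

    def __getattr__(self, key):
        # "Lazy" access: only reached when normal attribute lookup fails.
        try:
            return self.__dict__["children"][key]
        except KeyError:
            raise AttributeError(key) from None


dataset = Node("example", is_root=True)
s = dataset.add(Node("s"))
a = s.add(Node("a"))

print(dataset["s"].id)       # s
print(dataset["s"]["a"].id)  # s.a
print(dataset.s.a.id)        # s.a, the same object reached lazily
```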

The third common attribute that variables share is called attributes, which holds most of a variable’s metadata. This attribute is a dictionary of keys and values, and the values themselves can also be dictionaries. For our variable a we have:

These attributes can be accessed lazily directly from the variable:

But if you want to create a new attribute you’ll have to insert it directly into attributes:
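One way to picture why reading works lazily while writing does not is a fallback in `__getattr__`, which Python calls only when regular attribute lookup fails. The sketch below is a toy model built on that assumption, not pydap's actual implementation; the `Variable` class and the sample metadata are hypothetical.

```python
class Variable:
    """Toy variable: NOT pydap's BaseType, only the attribute lookup idea."""

    def __init__(self, name, attributes=None):
        self.name = name
        self.attributes = dict(attributes or {})

    def __getattr__(self, key):
        # Called only when regular lookup fails, so fall back to the metadata.
        try:
            return self.__dict__["attributes"][key]
        except KeyError:
            raise AttributeError(key) from None


a = Variable("a", {"long_name": "variable a", "history": {"created": "today"}})

print(a.long_name)              # read through the lazy fallback
print(a.attributes["history"])  # values may themselves be dictionaries

# Plain assignment creates an ordinary Python attribute, not metadata...
a.units = "m/s"
units_is_metadata = "units" in a.attributes
print(units_is_metadata)        # False

# ...so new metadata has to be inserted into the dictionary itself.
a.attributes["units"] = "m/s"
```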

It’s always better to use the correct syntax instead of the lazy one when writing code. Use the lazy syntax only when introspecting a dataset in the Python interpreter, to save a few keystrokes.

The fourth attribute is called data, and it holds a representation of the actual data. We’ll take a detailed look at this attribute in the next subsection.

Note

Prior to pydap 3.2, all variables also had an attribute called _nesting_level. This attribute had the value 1 if the variable was inside a SequenceType object, 0 if it was outside, and >1 if it was inside a nested sequence. Since pydap 3.2, _nesting_level has been deprecated and there is no intrinsic way of finding where in a deep object a variable is located.

Data

As we saw in the previous subsection, all pydap objects have a data attribute that holds a representation of the variable data. This representation will vary depending on the variable type.

BaseType

For the simple BaseType objects the data attribute is usually a Numpy array, though we can also use a Numpy scalar or Python number:

Note that starting from pydap 3.2 the datatype is inferred from the input data:

When you slice a BaseType array, the slice is simply passed onto the data attribute. So we may have:

You can think of a BaseType object as a thin layer around Numpy arrays, until you realize that the data attribute can be any object implementing the array interface! This is how the DAP client works – instead of assigning an array with data directly to the attribute, we assign a special object which behaves like an array and acts as a proxy to a remote dataset.

Here’s an example:

In the example above, the data is only downloaded in the last line, when the pseudo array is sliced. The object will construct the appropriate DAP URL, request the data, unpack it and return a Numpy array.
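The deferred download can be pictured with a minimal array-like proxy. Everything in this sketch is illustrative: `LazyArray`, `fake_fetch` and the example URL are hypothetical, the "server" is a local list, and the request string only imitates the `name[start:step:last]` hyperslab form with an inclusive last index. It is not pydap's proxy class.

```python
class LazyArray:
    """Array-like proxy that defers loading until it is sliced (toy example)."""

    def __init__(self, url, name, fetch):
        self.url = url
        self.name = name
        self.fetch = fetch  # hypothetical callable standing in for an HTTP request

    def __getitem__(self, index):
        if isinstance(index, int):
            index = slice(index, index + 1)
        if index.stop is None:
            raise ValueError("this sketch needs an explicit stop")
        start = index.start or 0
        step = index.step or 1
        last = index.stop - 1  # the request uses an inclusive last index
        request = f"{self.url}?{self.name}[{start}:{step}:{last}]"
        return self.fetch(request)


requests_made = []
remote_values = list(range(100))  # pretend this lives on a server


def fake_fetch(request):
    # Record the request, then answer it from the local list.
    requests_made.append(request)
    hyperslab = request[request.index("[") + 1 : -1]
    start, step, last = (int(part) for part in hyperslab.split(":"))
    return remote_values[start : last + 1 : step]


temperature = LazyArray("http://example.com/data.nc.dods", "temperature", fake_fetch)
print(len(requests_made))  # 0, nothing has been requested yet

values = temperature[10:15]
print(values)              # [10, 11, 12, 13, 14]
print(requests_made[0])
```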

StructureType

A StructureType holds no data; instead, its data attribute is a property that collects data from the child variables:

The opposite is also true; it’s possible to specify the structure data and have it propagated to the children:
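Both directions can be modelled with a Python property whose getter gathers the children's data and whose setter hands values back down. The `Leaf` and `Structure` classes below are hypothetical stand-ins written for this illustration, not pydap's classes.

```python
class Leaf:
    """Toy variable that simply stores its data."""

    def __init__(self, name, data=None):
        self.name = name
        self.data = data


class Structure:
    """Toy structure: NOT pydap's StructureType, only the data property idea."""

    def __init__(self, name, *children):
        self.name = name
        self.children = list(children)

    @property
    def data(self):
        # Nothing is stored here; the values are gathered from the children.
        return [child.data for child in self.children]

    @data.setter
    def data(self, values):
        # Assigning to the structure hands one value down to each child.
        values = list(values)
        if len(values) != len(self.children):
            raise ValueError("expected one value per child")
        for child, value in zip(self.children, values):
            child.data = value


a = Leaf("a", [1, 2, 3])
c = Leaf("c", [10.0])
s = Structure("s", a, c)

print(s.data)  # [[1, 2, 3], [10.0]]

s.data = [[4, 5], [20.0]]
print(a.data)  # [4, 5]
print(c.data)  # [20.0]
```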

The same is true for objects of DatasetType, since the dataset is simply the root structure.

SequenceType

A SequenceType object is a special kind of StructureType holding sequential data. Here’s an example of a sequence holding the variables a and c that we created before:

Let’s add some data to our sequence. This can be done by assigning a structured numpy array to the data attribute:

Note that the data for the sequence is an aggregation of the children data, similar to Python’s zip() builtin. This will be more complicated when encountering nested sequences, but for flat sequences they behave the same.
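For a flat sequence, the zip() analogy can be made concrete with plain lists. The values below are made up for the illustration, and no pydap objects are involved.

```python
# Column-wise data for two hypothetical children, a and c.
a_values = [1, 2, 3]
c_values = [10.0, 20.0, 30.0]

# The sequence data is the record-wise view of the same values.
records = list(zip(a_values, c_values))
print(records)  # [(1, 10.0), (2, 20.0), (3, 30.0)]

# Unzipping the records recovers the per-child columns.
columns = [list(column) for column in zip(*records)]
print(columns)  # [[1, 2, 3], [10.0, 20.0, 30.0]]
```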

We can also iterate over the SequenceType using its iterdata() method. In this case, it will return a series of tuples with the data:

Prior to pydap 3.2.2, this approach was not possible and one had to iterate directly over SequenceType:

Iterating directly over the SequenceType will be deprecated in pydap 3.4.

The SequenceType behaves pretty much like structured arrays from Numpy, since we can reference them by column (s['a']) or by index:

Note that these objects are also SequenceType themselves. The basic rules when working with sequence data are:

  1. When a SequenceType is sliced with a string the corresponding child is returned. For example: s['a'] will return child a;

  2. When a SequenceType is iterated over (using .iterdata() after pydap 3.2.2) it will return a series of tuples, each one containing the data for a record;

  3. When a SequenceType is sliced with an integer, a comparison or a slice() a new SequenceType will be returned;

  4. When a SequenceType is sliced with a tuple of strings a new SequenceType will be returned, containing only the children defined in the tuple in the new order. For example, s[('c', 'a')] will return a sequence s with the children c and a, in that order.

Note that except for rule 4 SequenceType mimics the behavior of Numpy structured arrays.
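For comparison, here is how a plain Numpy structured array responds to the same kinds of indexing. This sketch uses only Numpy, with made-up field names and values; it says nothing about what pydap returns beyond what the rules above state.

```python
import numpy as np

records = np.array(
    [(1, 10.0), (2, 20.0), (3, 30.0)],
    dtype=[("a", "i4"), ("c", "f8")],
)

column = records["a"]              # a string selects one field
first = records[0]                 # an integer selects one record
tail = records[1:]                 # a slice() selects a range of records
large = records[records["a"] > 1]  # a comparison filters the records

print(column.tolist())      # [1, 2, 3]
print(large["a"].tolist())  # [2, 3]

# Numpy selects several fields with a list of names rather than a tuple,
# which is where it differs from rule 4.
reordered = records[["c", "a"]]
print(reordered.dtype.names)  # ('c', 'a')
```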

Now imagine that we want to add to a SequenceType data pulled from a relational database. The easy way would be to fetch the data in the correct column order, and insert it into the sequence. But what if we don’t want to store the data in memory, and instead we would like to stream it directly from the database? In this case we can create an object that behaves like a structured array, similar to the proxy object that implements the array interface. pydap defines a “protocol” called IterData, which is simply any object that:

  1. Returns data when iterated over.

  2. Returns a new IterData when sliced such that:

    1. if the slice is a string the new IterData contains data only for that child;

    2. if the slice is a tuple of strings the object contains only those children, in that order;

    3. if the slice is an integer, a slice() or a comparison, the data is filtered accordingly.
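The protocol above can be sketched in pure Python with a generator standing in for a database cursor. `RowStream` and `make_rows` are hypothetical names, and this is not pydap's IterData class. To stay short, the sketch leaves out comparisons, supports only non-negative indices, and does not compose two row slices (the most recent one wins).

```python
from itertools import islice


class RowStream:
    """Toy object following the slicing rules above (not pydap's IterData)."""

    def __init__(self, make_rows, columns, selected=None, window=None):
        self.make_rows = make_rows     # callable returning a fresh row iterator
        self.columns = tuple(columns)  # column names, in source order
        self.selected = selected       # None, one name, or a tuple of names
        self.window = window           # optional slice() applied to the rows

    def __iter__(self):
        rows = self.make_rows()
        if self.window is not None:
            w = self.window
            rows = islice(rows, w.start, w.stop, w.step)
        for row in rows:
            if self.selected is None:
                yield tuple(row)
            elif isinstance(self.selected, str):
                yield row[self.columns.index(self.selected)]
            else:
                yield tuple(row[self.columns.index(n)] for n in self.selected)

    def __getitem__(self, key):
        # Slicing never reads rows; it only returns a new, narrower stream.
        if isinstance(key, (str, tuple)):
            return RowStream(self.make_rows, self.columns, key, self.window)
        if isinstance(key, int):
            key = slice(key, key + 1)
        if isinstance(key, slice):
            return RowStream(self.make_rows, self.columns, self.selected, key)
        raise TypeError(f"unsupported key: {key!r}")


def make_rows():
    # Stand-in for a database cursor: rows are produced one at a time.
    yield from [(1, "x", 10.0), (2, "y", 20.0), (3, "z", 30.0)]


stream = RowStream(make_rows, ("a", "b", "c"))
print(list(stream["a"]))         # [1, 2, 3]
print(list(stream[("c", "a")]))  # [(10.0, 1), (20.0, 2), (30.0, 3)]
print(list(stream[1:]))          # [(2, 'y', 20.0), (3, 'z', 30.0)]
```

Because each slice only records what to select, no row is read until the resulting object is iterated over.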

The base implementation works by wrapping data from a basic Numpy array. And here is an example of how we would use it:

One can also iterate directly over the IterData object to obtain the data:

This approach will not be deprecated in pydap 3.4.

Note

For numpy > 2.0, iterating over the IterData object returns a record specifying the individual types of the elements of the sequence.

There are many implementations of classes derived from IterData: pydap.handlers.dap.SequenceProxy is a proxy to sequential data on OPeNDAP servers, pydap.handlers.csv.CSVProxy wraps a CSV file, and pydap.handlers.sql.SQLProxy works as a stream to a relational database.

GridType

A GridType is a special kind of object that behaves like an array and a StructureType. The class is derived from StructureType; the major difference is that the first defined variable is a multidimensional array, while subsequent children are vector maps that define the axes of the array. This way, the data attribute on a GridType returns the data of all its children: the n-dimensional array followed by n maps.

Here is a simple example:

Grids behave like arrays in that they can be sliced. When this happens, a new GridType is returned with the proper data and axes:
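The idea that slicing a grid also slices its axes can be sketched with nested lists. The `Grid` class below is a hypothetical two-dimensional stand-in that accepts only a pair of slice() objects; it is not pydap's GridType.

```python
class Grid:
    """Toy 2-D grid: an array plus one coordinate list per axis."""

    def __init__(self, array, y, x):
        self.array = array  # list of rows
        self.y = y          # coordinates along the first axis
        self.x = x          # coordinates along the second axis

    def __getitem__(self, key):
        rows, cols = key  # expects exactly two slice() objects
        return Grid(
            [row[cols] for row in self.array[rows]],  # slice the array...
            self.y[rows],                             # ...and each axis map
            self.x[cols],                             # with the matching slice
        )


grid = Grid(
    array=[[0, 1, 2], [3, 4, 5]],
    y=[10.0, 20.0],
    x=[100.0, 200.0, 300.0],
)

subset = grid[0:1, 1:3]
print(subset.array)  # [[1, 2]]
print(subset.y)      # [10.0]
print(subset.x)      # [200.0, 300.0]
```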

It is possible to disable this feature (some older servers might not handle it nicely):