5 minute tutorial#

The easiest way to use pydap is as a client to access remote data hosted on OPeNDAP servers. You can use pydap’s open_url directly, or, better, use pydap as an engine for xarray. xarray allows OPeNDAP users to exploit many of Pangeo’s modern capabilities for scalable computing.

OPeNDAP - the vision#

The original vision of OPeNDAP (Cornillon et al., 1993) was to establish the equivalence:

\( \boxed{\text{URL} \approx \text{Remote Dataset}} \)

Furthermore,

\( \boxed{\text{URL + Constraints} \approx \text{Subset of Remote Dataset}} \)

Here, we demonstrate this. For this short tutorial we will access a remote dataset hosted on OPeNDAP’s Hyrax server. For more information about OPeNDAP and Hyrax, see the official OPeNDAP documentation.

The remote dataset used in this tutorial can be inspected in the browser HERE.

from pydap.client import open_url
import xarray as xr
import numpy as np

We define a URL pointing to a remote dataset.

url = "http://test.opendap.org:8080/opendap/tutorials/20220531090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc"

pure PyDAP#

First, we demonstrate access to the remote dataset via pure PyDAP:

pydap_ds = open_url(url, protocol='dap4')

Note the extra argument protocol='dap4'. One could instead pass protocol='dap2' to use the older DAP2 protocol.
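For completeness, here is a minimal sketch of opening the same URL over DAP2, assuming the server also exposes a DAP2 interface for this file (the variable name is introduced here only for illustration; the rest of the tutorial uses DAP4):

pydap_ds_dap2 = open_url(url, protocol='dap2')  # legacy DAP2 access to the same dataset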

We can inspect the contents of the dataset as follows:

pydap_ds.tree()
.20220531090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc
├──time
├──lat
├──lon
├──analysed_sst
├──analysis_error
├──mask
├──sea_ice_fraction
├──dt_1km_data
└──sst_anomaly
pydap_ds['sst_anomaly'].shape
(1, 17999, 36000)
print('This array occupies: ', pydap_ds['sst_anomaly'].nbytes/1e9, '[GBs] in memory')
This array occupies:  1.295928 [GBs] in memory

Note

No data has been downloaded yet. PyDAP reads the metadata of the remote URL to create the Dataset.

Each variable contains CF-compliant metadata that can be recognized by various APIs, such as scale_factor, add_offset and _FillValue. These attributes are needed to mask missing values (e.g. over land) and to scale the packed data to physical units. Some APIs, like xarray, recognize these automatically, while for others a user must manually transform the data (a short sketch of this appears after the download example below).

pydap_ds['sst_anomaly'].attributes
{'long_name': 'SST anomaly from a seasonal SST climatology based on the MUR data over 2003-2014 period',
 'units': 'kelvin',
 '_FillValue': -32768,
 'add_offset': 0.0,
 'scale_factor': 0.001,
 'valid_min': -32767,
 'valid_max': 32767,
 'comment': 'anomaly reference to the day-of-year average between 2003 and 2014',
 'coordinates': 'lon lat',
 'Maps': ('/time', '/lat', '/lon')}

You can read more about the NetCDF Climate and Forecast (CF) Metadata Conventions HERE.

Downloading the Array into memory#

You can trigger a download on the fly as needed. However, in almost all cases only a subset of the entire dataset is needed. You can download just the piece you want by slicing the array as follows:

%%time
array = pydap_ds['sst_anomaly'][0, 0:10, 0:10]
CPU times: user 7.62 ms, sys: 1.35 ms, total: 8.97 ms
Wall time: 348 ms
np.shape(array)
(1, 10, 10)

With the above command, the sliced data has been downloaded into memory and assigned to the variable array. However, array is not a numpy array, but rather a BaseType from pydap’s data model:

type(array)
pydap.model.BaseType

To extract the numpy array from pydap’s BaseType do:

data = array.data
type(data)
numpy.ndarray
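When working with pure pydap (outside of xarray), the CF packing attributes shown earlier must be applied by hand. Below is a minimal sketch using the scale_factor, add_offset, and _FillValue attributes of sst_anomaly shown above; the names raw and decoded are introduced here only for illustration:

attrs = pydap_ds['sst_anomaly'].attributes
raw = data.astype('float64')                                  # packed values downloaded above
raw[data == attrs['_FillValue']] = np.nan                     # mask fill values (e.g. over land)
decoded = raw * attrs['scale_factor'] + attrs['add_offset']   # physical units (kelvin)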

Using server-side subsetting#

Because the data is hosted on a Hyrax server, you can exploit server-side subsetting that happens local to the data. OPeNDAP servers support subsetting through Constraint Expressions appended to the URL.

In this scenario, where we want to subset the variable sst_anomaly, we can request it directly from OPeNDAP’s Hyrax server using the following syntax:

<OPeNDAP_URL> + "?dap4.ce=/sst_anomaly[0][0:1:9][0:1:9]"

This means that from the <OPeNDAP_URL> associated with the complete dataset, we request only the variable sst_anomaly, subset as [0][0:1:9][0:1:9]. This index-based subsetting means:

  • [0] selects a single (the first) element of the first (time) dimension,

  • [0:1:9] selects the first 10 elements of the second (lat) dimension,

  • [0:1:9] selects the first 10 elements of the third (lon) dimension.

CE = "?dap4.ce=/lat;/sst_anomaly[0][0:1:9][0:1:9]"
pydap_ds = open_url(url+CE, protocol='dap4')
pydap_ds.tree()
.20220531090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc
├──lat
└──sst_anomaly
pydap_ds['sst_anomaly'].shape
(1, 10, 10)
pydap_ds['lat'].shape
(17999,)

Note

OPeNDAP only applied the subset to the variable sst_anomaly, while lat (and any other variable) retains its original size. One can use a different Constraint Expression syntax so that a subset along a shared dimension is applied to every variable that shares that dimension. See below for more.

xarray approach#

PyDAP’s open_url can be used internally by xarray by passing an extra parameter when creating an xarray Dataset. This extra parameter is:

engine='pydap'

Moreover, we can combine this with the server-side subsetting that occurs local to the data on the OPeNDAP server.

Note

There exist many, many OPeNDAP servers, but only two DAP protocol implementations: DAP2 and DAP4. The differences between the two go beyond this 5-minute intro; we simply note that DAP4 is the newer protocol, that any server implementing DAP4 can also implement DAP2, and so we focus only on DAP4 in this short tutorial. PyDAP accepts a protocol argument which specifies "dap2" vs "dap4"; xarray does not.

Tip

Within xarray, when setting pydap as the engine, we can specify DAP4 as the protocol by passing a URL in which the scheme (http or https) is replaced with dap4.

'dap4'+url[4:]
'dap4://test.opendap.org:8080/opendap/tutorials/20220531090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc'
dataset = xr.open_dataset('dap4'+url[4:], engine='pydap')
dataset
<xarray.Dataset> Size: 29GB
Dimensions:           (/time: 1, /lat: 17999, /lon: 36000)
Coordinates:
    lat               (/lat) float32 72kB ...
    lon               (/lon) float32 144kB ...
Dimensions without coordinates: /time, /lat, /lon
Data variables:
    time              (/time) datetime64[ns] 8B ...
    analysed_sst      (/time, /lat, /lon) float64 5GB ...
    analysis_error    (/time, /lat, /lon) float64 5GB ...
    mask              (/time, /lat, /lon) float32 3GB ...
    sea_ice_fraction  (/time, /lat, /lon) float64 5GB ...
    dt_1km_data       (/time, /lat, /lon) timedelta64[ns] 5GB ...
    sst_anomaly       (/time, /lat, /lon) float64 5GB ...
Attributes: (12/47)
    Conventions:                CF-1.7
    title:                      Daily MUR SST, Final product
    summary:                    A merged, multi-sensor L4 Foundation SST anal...
    references:                 http://podaac.jpl.nasa.gov/Multi-scale_Ultra-...
    institution:                Jet Propulsion Laboratory
    history:                    created at nominal 4-day latency; replaced nr...
    ...                         ...
    project:                    NASA Making Earth Science Data Records for Us...
    publisher_name:             GHRSST Project Office
    publisher_url:              http://www.ghrsst.org
    publisher_email:            ghrsst-po@nceo.ac.uk
    processing_level:           L4
    cdm_data_type:              grid

Constraint Expression and shared dimensions within xarray#

Similarly to pure PyDAP, we can define a URL with a Constraint Expression (CE). Here we follow a more flexible syntax based on the concept of shared dimensions: we define the subset at the dimension level, and include in the URL all the variables (including the dimensions themselves) that we want in the resulting (subset) dataset.

For example, considering the shared dimensions /dimension1 and /dimension2, we follow the syntax:

<OPeNDAP_URL>?dap4.ce=/dimension1=[start:step:end];/dimension2=[start:step:end];/dimension1;/dimension2;/variable1;/variable2

Note

In the CE above, if step is omitted then step=1 by default.

where variable1 and variable2 are 2D variables defined over these dimensions, i.e. variable1[dimension1][dimension2] and variable2[dimension1][dimension2].

CE = "?dap4.ce=/time=[0];/lat=[0:9];/lon=[0:9];/time;/lat;/lon;/sst_anomaly;/analysed_sst"
print(url+CE)
http://test.opendap.org:8080/opendap/tutorials/20220531090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc?dap4.ce=/time=[0];/lat=[0:9];/lon=[0:9];/time;/lat;/lon;/sst_anomaly;/analysed_sst
dataset_ce = xr.open_dataset(url+CE, engine='pydap')
dataset_ce
<xarray.Dataset> Size: 2kB
Dimensions:       (/time: 1, /lat: 10, /lon: 10)
Coordinates:
    lat           (/lat) float32 40B ...
    lon           (/lon) float32 40B ...
Dimensions without coordinates: /time, /lat, /lon
Data variables:
    time          (/time) datetime64[ns] 8B ...
    analysed_sst  (/time, /lat, /lon) float64 800B ...
    sst_anomaly   (/time, /lat, /lon) float64 800B ...
Attributes: (12/47)
    Conventions:                CF-1.7
    title:                      Daily MUR SST, Final product
    summary:                    A merged, multi-sensor L4 Foundation SST anal...
    references:                 http://podaac.jpl.nasa.gov/Multi-scale_Ultra-...
    institution:                Jet Propulsion Laboratory
    history:                    created at nominal 4-day latency; replaced nr...
    ...                         ...
    project:                    NASA Making Earth Science Data Records for Us...
    publisher_name:             GHRSST Project Office
    publisher_url:              http://www.ghrsst.org
    publisher_email:            ghrsst-po@nceo.ac.uk
    processing_level:           L4
    cdm_data_type:              grid
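As with pure PyDAP, no values have been transferred yet; accessing a variable’s values triggers the download of only the constrained 1 x 10 x 10 subset. A minimal sketch (note that xarray recognizes the CF attributes and applies scale_factor, add_offset, and _FillValue automatically, which is likely why analysed_sst and sst_anomaly appear as float64 above):

sst = dataset_ce['analysed_sst'].load()   # downloads only the (1, 10, 10) subset
print(sst.values.shape)                   # decoded values, already in kelvin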