5 minute tutorial#
The easiest way to use pydap
is to use it as a client
to access remote data hosted on OPeNDAP servers. You can use pydap
’s open_url
directly, or better use pydap
as an engine for xarray
. xarray
allows for OPeNDAP users to exploit many of Pangeo’s modern capabilities for scalable computing.
OPeNDAP - the vision#
The original vision of OPeNDAP (Cornillion, et al 1993) was to make the equivalency:
\( \;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\; \boxed{\text{URL} \approx \text{Remote Dataset} }\)
Furthermore,
\( \;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\; \boxed{\text{URL + Constraints} \approx \text{Subset of Remote Dataset}} \)
Here, we demonstrate this. For this short tutorial we will access a remote dataset hosted on OPeNDAP’s Hyrax server. For more information about OPeNDAP and Hyrax you can go to the official OPeNDAP documentation.
The remote dataset that will be used in this tutorial can be inspected via the browser HERE
from pydap.client import open_url
import xarray as xr
import numpy as np
We define a URL pointing to a remote dataset.
url = "http://test.opendap.org:8080/opendap/tutorials/20220531090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc"
pure PyDAP
#
First, we demonstrate access to the remote dataset via PyDAP
pydap_ds = open_url(url, protocol='dap4')
Note the extra argument protocol='dap4'
. One could also pass protocol='dap2'
.
We can inspect the contents of the dataset as follows:
pydap_ds.tree()
.20220531090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc
├──time
├──lat
├──lon
├──analysed_sst
├──analysis_error
├──mask
├──sea_ice_fraction
├──dt_1km_data
└──sst_anomaly
pydap_ds['sst_anomaly'].shape
(1, 17999, 36000)
print('This array occupies: ', pydap_ds['sst_anomaly'].nbytes/1e9, '[GBs] in memory')
This array occupies: 1.295928 [GBs] in memory
Note
No data has been downloaded yet. PyDAP
reads the metadata of the remote URL to create the Dataset
.
Each variable contains CF-compliant metadata
that can be recognized by various APIs, such as scale_factor
, offsets
and _FillValue
. These parameters are necessary to mask over land areas, and scale values. Some APIs like xarray
can recognize these, while for others a user must manually transform the data.
pydap_ds['sst_anomaly'].attributes
{'long_name': 'SST anomaly from a seasonal SST climatology based on the MUR data over 2003-2014 period',
'units': 'kelvin',
'_FillValue': -32768,
'add_offset': 0.0,
'scale_factor': 0.001,
'valid_min': -32767,
'valid_max': 32767,
'comment': 'anomaly reference to the day-of-year average between 2003 and 2014',
'coordinates': 'lon lat',
'Maps': ('/time', '/lat', '/lon')}
You can read more about NetCDF Climate and Forcasts (CF) Metadata Conventions
HERE.
Downloading the Array into memory#
You can trigger a download on-the-fly as needed. However in almost all cases only a subset of an entire dataset is needed
. You can download only the piece you want, by slicing the array as follows:
%%time
array = pydap_ds['sst_anomaly'][0, 0:10, 0:10]
CPU times: user 6.73 ms, sys: 1.76 ms, total: 8.49 ms
Wall time: 274 ms
np.shape(array)
(1, 10, 10)
With the above command, all the data-array has been downloaded into memory and assigned to the variable array
. However, the variable array
is not a numpy array, but rather a BaseType
of pydap
’s model:
type(array)
pydap.model.BaseType
To extract the numpy array from pydap
’s BaseType
do:
data = array.data
type(data)
numpy.ndarray
Using server-side subsetting#
Because data is hosted on Hyrax
, you can exploit server-side subsetting local to the data
. OPeNDAP servers support subsetting by adding Constraint Expressions
to the URL
.
In this scenario were we want to subset the variable sst_anomaly
, we can request it directly to OPeNDAP’s Hyrax server using the following syntax:
<OPeNDAP_URL> + "?dap4.ce=\sst_anomaly[0][0:1:9][0:1:9]"
This means that from the <OPeNDAP_URL>
associated with the complete dataset, we request to only select the variable sst_anomaly
, and subset it as follows: [0][0:1:9][0:1:9]
. This index-based subsetting implies:
A single (first) element of the first (
time
) dimension,[0:1:9]
indicates the first10
elements of the second (lat
) dimension[0:1:9]
indicates the first10
elements of the third (lon
) dimension.
CE = "?dap4.ce=/lat;/sst_anomaly[0][0:1:9][0:1:9]"
pydap_ds = open_url(url+CE, protocol='dap4')
pydap_ds.tree()
.20220531090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc
├──lat
└──sst_anomaly
pydap_ds['sst_anomaly'].shape
(1, 10, 10)
pydap_ds['lat'].shape
(17999,)
Note
OPeNDAP
only applied the subset to the variable sst_anomaly
, while lat
(and any other) would retain the original size. One can use a different syntax of the Constraint Expression so that subsets along shared dimensions
are applied across variables that share the dimension. For more, See below.
xarray approach#
PyDAP
’s open_url
can be used internally within xarray
, by defining an extra parameter when creating an xarray
Dataset. This extra parameter is:
engine='pydap'
Moreoever, we can combine the server-side
subsetting that occurs local to the data
on the OPeNDAP
server.
Note
There exists many many many servers, but only two DAP implementation: DAP2
and DAP4
. The differences between the two go beyond this 5 minute intro. We will simply restrict to say that DAP4
is newer, that any server implementing DAP4
can implement DAP2
, and so will only focus on DAP4
in this short tutorial. PyDAP
accepts a protocol
argument which specifies "dap2"
vs "dap4"
. xarray
does not.
Tip
Within xarray
and when setting pydap
as the engine, we can specify DAP4
as the protocol by passing a URL with dap4
replacing the https
.
'dap4'+url[4:]
'dap4://test.opendap.org:8080/opendap/tutorials/20220531090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc'
dataset = xr.open_dataset('dap4'+url[4:], engine='pydap')
dataset
<xarray.Dataset> Size: 29GB Dimensions: (/time: 1, /lat: 17999, /lon: 36000) Coordinates: lat (/lat) float32 72kB ... lon (/lon) float32 144kB ... Dimensions without coordinates: /time, /lat, /lon Data variables: time (/time) datetime64[ns] 8B ... analysed_sst (/time, /lat, /lon) float64 5GB ... analysis_error (/time, /lat, /lon) float64 5GB ... mask (/time, /lat, /lon) float32 3GB ... sea_ice_fraction (/time, /lat, /lon) float64 5GB ... dt_1km_data (/time, /lat, /lon) timedelta64[ns] 5GB ... sst_anomaly (/time, /lat, /lon) float64 5GB ... Attributes: (12/47) Conventions: CF-1.7 title: Daily MUR SST, Final product summary: A merged, multi-sensor L4 Foundation SST anal... references: http://podaac.jpl.nasa.gov/Multi-scale_Ultra-... institution: Jet Propulsion Laboratory history: created at nominal 4-day latency; replaced nr... ... ... project: NASA Making Earth Science Data Records for Us... publisher_name: GHRSST Project Office publisher_url: http://www.ghrsst.org publisher_email: ghrsst-po@nceo.ac.uk processing_level: L4 cdm_data_type: grid