PyDAP as a client#

PyDAP can be used to “lazily” inspect and retrieve data from any of the thousands of scientific datasets available on the internet on OPeNDAP servers. This way, it’s possible to instrospect and manipulate a Dataset as if it were stored locally, with data being downloaded on-the-fly as necessary.

PyDAP can be used to open data structures that may fall within one of the following data types (not to be confused with file format):

  • Gridded Arrays.

  • Sequences (Tabular data)

  • Structures.

When opening a remote URL containing one or more data types, PyDAP will create a Dataset object which acts like the root directly. PyDAP’s Dataset can then contain multiple data types, which may be nested. Moreover, since PyDAP now supports the DAP4 model, it supports Groups, nested Groups which may themselves hold other data types, and other nested PyDAP objects named above.

Gridded Arrays#

Let’s start by accessing Gridded Array, i.e., data that is stored as a regular multidimensional array. Here’s a simple example where we access the COADS climatology from the official OPeNDAP test server:

from pydap.client import open_url
import numpy as np
import matplotlib.pyplot as plt
dataset = open_url('http://test.opendap.org/dap/data/nc/coads_climatology.nc', protocol='dap2') # dap2 is the default
dataset
<DatasetType with children 'COADSX', 'COADSY', 'TIME', 'SST', 'AIRT', 'UWND', 'VWND'>

Here we used the pydap.client.open_url function to open an OPeNDAP URL specifying the location of the dataset. When we access the remote dataset the function returns a DatasetType object, which is a Structure – a fancy dictionary that stores other variables. We can check the names of the store variables like we would do with a Python dictionary:

list(dataset.keys())
['COADSX', 'COADSY', 'TIME', 'SST', 'AIRT', 'UWND', 'VWND']

Another useful way to inspect the dataset is the .tree() method, which provides an in-depth inspection of your dataset

dataset.tree()
.coads_climatology.nc
├──COADSX
├──COADSY
├──TIME
├──SST
│  ├──SST
│  ├──TIME
│  ├──COADSY
│  └──COADSX
├──AIRT
│  ├──AIRT
│  ├──TIME
│  ├──COADSY
│  └──COADSX
├──UWND
│  ├──UWND
│  ├──TIME
│  ├──COADSY
│  └──COADSX
└──VWND
   ├──VWND
   ├──TIME
   ├──COADSY
   └──COADSX

Note

No data has been downloaded yet.

Let’s work with the SST variable; we can “lazly” reference it using the usual dictionary syntax of dataset['SST'], or dataset.SST:

sst = dataset['SST'] # or dataset.SST
sst
<GridType with array 'SST' and maps 'TIME', 'COADSY', 'COADSX'>

Note that the variable is of type GridType, a multidimensional array with specific axes defining each of its dimensions:

sst.dimensions
('TIME', 'COADSY', 'COADSX')
sst.shape
(12, 90, 180)
len(sst)
4

Note

len and shape of sst differ! The difference between them is a property of the GridType: in this case sst packs with itself the coordinate axes necessary to be fully “self-described”. This behavior was done to mimick the netCDF model, that each file (in this case each gridded array) is “self-describing” and therefore contains all the relevant information to process it. The tree view of the dataset above further illustrates the point, with the multiple copies of coordinate data in dataset. Everytime you download a GridType you also download the dimension arrays used fully describe them.

It is possible to specify NOT to download the coordinate axes of a GridType variable when opening a URL as follows:

dataset = open_url("http://test.opendap.org/dap/data/nc/coads_climatology.nc", output_grid=False)
dataset
<DatasetType with children 'COADSX', 'COADSY', 'TIME', 'SST', 'AIRT', 'UWND', 'VWND'>

NO data has been downloaded into memory yet, but when you download a GridType array, the coordinate axes will not be downloaded. This important workaround is particularly usefull for speeding up workflows, and to not overwhelm older OPeNDAP servers that might run out of memory when attempting to retrieve both the data and the coordinate axes of a variable.

Note

The GridType is a DAP2 model specification and was dropped in the much newer DAP4. Nonetheless, OPeNDAP DAP4 servers support DAP2. Currently PyDAP when opening a remote URL, if protocol is not specified, PyDAP assumes by default the DAP2 model specification. This may change in the future.

To access the URL according to the DAP4 model specification, run:

dataset = open_url("http://test.opendap.org/dap/data/nc/coads_climatology.nc", protocol='dap4')
dataset.tree()
.coads_climatology.nc
├──COADSX
├──COADSY
├──TIME
├──SST
├──AIRT
├──UWND
└──VWND
dataset['SST']
<BaseType with data Dap4BaseProxy('http://test.opendap.org/dap/data/nc/coads_climatology.nc', 'SST', dtype('>f4'), (12, 90, 180), (slice(None, None, None), slice(None, None, None), slice(None, None, None)))>

In PyDAP, the BaseType is a thin wrapper to a numpy.ndarray with some of the same basic attributes like shape, nbytes. The Dap4BaseProxy is an empty data container with all the attributes of the remote array, including:

  • name

  • dtype

  • shape

  • slice

To download data into memory you must slice the array. This is:

%%time
sst = dataset['SST'][:]
print('Type of object: ', type(sst))
Type of object:  <class 'pydap.model.BaseType'>
CPU times: user 25.9 ms, sys: 7.53 ms, total: 33.5 ms
Wall time: 342 ms

Note that upon download, the variable sst is a BaseType, rather than a numpy array. The BaseType has CF-compliant metadata as attributes, which are defined by the data provider and contain useful information for decoding the variable.

sst.attributes
{'missing_value': -9.99999979e+33,
 '_FillValue': -9.99999979e+33,
 'long_name': 'SEA SURFACE TEMPERATURE',
 'history': 'From coads_climatology',
 'units': 'Deg C',
 'checksum': array([1974861177], dtype=uint32),
 'Maps': ('/TIME', '/COADSY', '/COADSX')}

To access the array for further (numerical) manipulation you must add a .data to the BaseType. This is:

type(sst.data)
numpy.ndarray

Note

One of the features of OPeNDAP is the server-side subsetting, which can be exploited by PyDAP via slicing of the array when downloading data. For example, if you are only interested in a localized subregion of the entire domain (e.g. finite range of latitude and longitudes of the global coverage), and you know the index that span your desired area of interest.

%%time
sst = dataset['SST'][0, 45:80, 45:125]
print('Type of object: ', type(sst))
Type of object:  <class 'pydap.model.BaseType'>
CPU times: user 5.91 ms, sys: 1.98 ms, total: 7.89 ms
Wall time: 369 ms
sst.shape
(1, 35, 80)

Note

The subsetting took place close to the data thanks for OPenDAP’s server functionality, and only the data that we request in the slice, is downloaded.

Note

Slicing the array before downloading will almost always result in better performance and speed ups. This can be significant for very large datasets, and particularly useful when automatizing workflows that require downloading data into memory for further processing.

Tabular data#

Let’s look now at tabular data, which PyDAP calls them Sequences (typically asociated with Sequential in situ data). Here we consider another file from within the test OPeNDAP server for simplicity. You can access the catalog HERE.

Note

Sequential data consists of one of more records of related variables, such as simultaneous measurements of temperature and wind velocity for example.

ds = open_url("http://test.opendap.org/opendap/hyrax/data/ff/gsodock1.dat")
ds.tree()
.gsodock1.dat
└──URI_GSO-Dock
   ├──Time
   ├──Depth
   ├──Sea_Temp
   ├──Salinity
   ├──DO%20
   ├──pH
   ├──Turbidity
   ├──Air_Temp
   ├──Wind_Speed
   ├──Wind_Direction
   └──Barometric_Pres
ds['URI_GSO-Dock']
<SequenceType with children 'Time', 'Depth', 'Sea_Temp', 'Salinity', 'DO%20', 'pH', 'Turbidity', 'Air_Temp', 'Wind_Speed', 'Wind_Direction', 'Barometric_Pres'>

ERDDAP in situ data#

Consider a more complex example. Here we look at data from the glider DAC found at the Integrated Ocean Observing System . The data can be accessed through an OPeNDAP server, as well as the ERRDAP server. In the example below we demostrate how to access glider data from a Deep-Pelagic Nekton study off the Gulf of Mexico, with pydap through ERRDAP.

We look at the Dataset ID: Murphy-20150809T1355. It’s ERDDAP access form can be found HERE

ds = open_url("https://gliders.ioos.us/erddap/tabledap/Murphy-20150809T1355")['s']
ds.tree()
.s
├──trajectory
├──wmo_id
├──profile_id
├──time
├──latitude
├──longitude
├──depth
├──conductivity
├──conductivity_qc
├──density
├──density_qc
├──depth_qc
├──instrument_ctd
├──lat_qc
├──lat_uv
├──lat_uv_qc
├──lon_qc
├──lon_uv
├──lon_uv_qc
├──platform_meta
├──precise_lat
├──precise_lon
├──precise_time
├──pressure
├──pressure_qc
├──profile_lat_qc
├──profile_lon_qc
├──profile_time_qc
├──qartod_conductivity_flat_line_flag
├──qartod_conductivity_gross_range_flag
├──qartod_conductivity_primary_flag
├──qartod_conductivity_rate_of_change_flag
├──qartod_conductivity_spike_flag
├──qartod_density_flat_line_flag
├──qartod_density_gross_range_flag
├──qartod_density_primary_flag
├──qartod_density_rate_of_change_flag
├──qartod_density_spike_flag
├──qartod_monotonic_pressure_flag
├──qartod_pressure_flat_line_flag
├──qartod_pressure_gross_range_flag
├──qartod_pressure_primary_flag
├──qartod_pressure_rate_of_change_flag
├──qartod_pressure_spike_flag
├──qartod_salinity_flat_line_flag
├──qartod_salinity_gross_range_flag
├──qartod_salinity_primary_flag
├──qartod_salinity_rate_of_change_flag
├──qartod_salinity_spike_flag
├──qartod_temperature_flat_line_flag
├──qartod_temperature_gross_range_flag
├──qartod_temperature_primary_flag
├──qartod_temperature_rate_of_change_flag
├──qartod_temperature_spike_flag
├──salinity
├──salinity_qc
├──temperature
├──temperature_qc
├──time_qc
├──time_uv
├──time_uv_qc
├──u
├──u_qc
├──v
└──v_qc

Where the variables s identifies the sequential data.

type(ds)
pydap.model.SequenceType

We can further identify each individual glider data by looking at profile_id, a value that is unique for each of them. You can inspect the raw values are follows

[id for id in ds['profile_id'].iterdata()]
[np.int32(1),
 np.int32(2),
 np.int32(3),
 np.int32(4),
 np.int32(5),
 np.int32(6),
 np.int32(7),
 np.int32(8),
 np.int32(9),
 np.int32(10),
 np.int32(11),
 np.int32(12),
 np.int32(13),
 np.int32(14),
 np.int32(15),
 np.int32(16),
 np.int32(17),
 np.int32(18),
 np.int32(19),
 np.int32(20),
 np.int32(21),
 np.int32(22),
 np.int32(23),
 np.int32(24),
 np.int32(25),
 np.int32(26),
 np.int32(27),
 np.int32(28),
 np.int32(29),
 np.int32(30),
 np.int32(31),
 np.int32(32),
 np.int32(33),
 np.int32(34),
 np.int32(35),
 np.int32(36),
 np.int32(37),
 np.int32(38),
 np.int32(39),
 np.int32(40),
 np.int32(41),
 np.int32(42),
 np.int32(43),
 np.int32(44),
 np.int32(45),
 np.int32(46),
 np.int32(47),
 np.int32(48),
 np.int32(49),
 np.int32(50),
 np.int32(51),
 np.int32(52),
 np.int32(53),
 np.int32(54),
 np.int32(55),
 np.int32(56),
 np.int32(57),
 np.int32(58),
 np.int32(59),
 np.int32(60),
 np.int32(61),
 np.int32(62),
 np.int32(63),
 np.int32(64),
 np.int32(65),
 np.int32(66),
 np.int32(67),
 np.int32(68),
 np.int32(69),
 np.int32(70),
 np.int32(71),
 np.int32(72),
 np.int32(73),
 np.int32(74),
 np.int32(75),
 np.int32(76),
 np.int32(77),
 np.int32(78),
 np.int32(79),
 np.int32(80),
 np.int32(81),
 np.int32(82),
 np.int32(83),
 np.int32(84),
 np.int32(85),
 np.int32(86),
 np.int32(87),
 np.int32(88),
 np.int32(89),
 np.int32(90),
 np.int32(91),
 np.int32(92),
 np.int32(93),
 np.int32(94),
 np.int32(95),
 np.int32(96),
 np.int32(97),
 np.int32(98),
 np.int32(99),
 np.int32(100),
 np.int32(101),
 np.int32(102),
 np.int32(103),
 np.int32(104),
 np.int32(105),
 np.int32(106),
 np.int32(107),
 np.int32(108),
 np.int32(109),
 np.int32(110),
 np.int32(111),
 np.int32(112),
 np.int32(113),
 np.int32(114),
 np.int32(115),
 np.int32(116),
 np.int32(117),
 np.int32(118),
 np.int32(119),
 np.int32(120),
 np.int32(121),
 np.int32(122),
 np.int32(123),
 np.int32(124),
 np.int32(125),
 np.int32(126),
 np.int32(127),
 np.int32(128),
 np.int32(129),
 np.int32(130),
 np.int32(131),
 np.int32(132),
 np.int32(133),
 np.int32(134),
 np.int32(135),
 np.int32(136),
 np.int32(137),
 np.int32(138),
 np.int32(139),
 np.int32(140),
 np.int32(141),
 np.int32(142),
 np.int32(143),
 np.int32(144),
 np.int32(145),
 np.int32(146),
 np.int32(147),
 np.int32(148),
 np.int32(149),
 np.int32(150),
 np.int32(151),
 np.int32(152),
 np.int32(153),
 np.int32(154),
 np.int32(155),
 np.int32(156),
 np.int32(157),
 np.int32(158),
 np.int32(159),
 np.int32(160),
 np.int32(161),
 np.int32(162),
 np.int32(163),
 np.int32(164),
 np.int32(165),
 np.int32(166),
 np.int32(167),
 np.int32(168),
 np.int32(169),
 np.int32(170),
 np.int32(171),
 np.int32(172),
 np.int32(173),
 np.int32(174),
 np.int32(175),
 np.int32(176),
 np.int32(177),
 np.int32(178),
 np.int32(179),
 np.int32(180),
 np.int32(181),
 np.int32(182),
 np.int32(183),
 np.int32(184),
 np.int32(185),
 np.int32(186),
 np.int32(187),
 np.int32(188),
 np.int32(189)]

Note

OPeNDAP and therefore PyDAP support fully-qualifying names, which come in handly for complex (nested) data structures. In the case of Sequential data, you can access a variable <VarName> within a sequence named <Seq> as <Seq.VarName>

These datasets are rich in metadata, which can be accessed through the attributes property as follows

ds['profile_id'].attributes
{'_FillValue': -1,
 'actual_range': [1, 189],
 'cf_role': 'profile_id',
 'comment': 'Sequential profile number within the trajectory.  This value is unique in each file that is part of a single trajectory/deployment.',
 'ioos_category': 'Identifier',
 'long_name': 'Profile ID',
 'valid_max': 2147483647,
 'valid_min': 1}

The first thing we’d like to do is limit our very simple analysis. We consider only a single glider and only inspect the variables depth and temperature. This is simlar to column-wise selection.

To accomplish that we use pydap’s simple logic as follows

seq = ds[('profile_id', 'depth', 'temperature')]
print(type(seq))
seq.tree()
<class 'pydap.model.SequenceType'>
.s
├──profile_id
├──depth
└──temperature

Filtering data#

With Sequential data we can make use of filters to extract only those values associated with one or more rows. This is, identify all the values of the sequence that are less than, equal to, or greater than.

For example, lets consider depth and temperature values associated with the value profile_id=5.

glid5 = seq[('profile_id', 'depth', 'temperature')][seq['profile_id'].data==5]
glid5
<SequenceType with children 'profile_id', 'depth', 'temperature'>
Depths5 = np.array([depth for depth in glid5['depth']])
ids5 = np.array([_id for _id in glid5['profile_id']])
Temps5 = np.array([temp for temp in glid5['temperature']])

Lets try another profile_id.

glid6 = seq[('profile_id', 'depth', 'temperature')][seq['profile_id'].data==6]
Depths6 = np.array([depth for depth in glid6['depth']])
ids6 = np.array([_id for _id in glid6['profile_id']])
Temps6 = np.array([temp for temp in glid6['temperature']])
plt.plot(Temps5, -Depths5, ls='', marker='o', markersize=10, alpha=0.5, label='profile id: ' +str(ids5[0]))
plt.plot(Temps6, -Depths6, 'k', ls='', marker='d', markersize=10, alpha=0.25, label='profile id: ' +str(ids6[0]))
plt.xlabel(r'$\text{Temp} \;[^\circ C]$', fontsize=12.5)
plt.ylabel(r'$Z\;[m]$', fontsize=12.5, rotation=0, labelpad=10)
plt.legend(fontsize=12.5, frameon=False);
_images/42b2d9a78cd0dfb3940cfade0cd07a426159dd4dff5e6947d271cdbcb23d8190.png

Authentication#

Basic and Digest

To use Basic and Digest authentication, simply add your username and password to the dataset URL. Keep in mind that if the server only supports Basic authentication your credentials will be sent as plaintext, and could be sniffed on the network.

dataset = open_url('http://username:password@server.example.com/path/to/dataset')

CAS#

The Central Authentication Service (CAS) is a single sign-on protocol for the web, usually involving a web browser and cookies. Nevertheless it’s possible to use pydap with an OPeNDAP server behind a CAS. The function install_cas_client below replaces pydap’s default HTTP function with a new version able to submit authentication data to an HTML form and store credentials in cookies. (In this particular case, the server uses Javascript to redirect the browser to a new location, so the client has to parse the location from the Javascript code; other CAS would require a tweaked function.)

To use it, just attach a web browsing session with authentication cookies:

from pydap.client import open_url
from pydap.cas.get_cookies import setup_session
session = setup_session(authentication_url, username, password)
dataset = open_url('http://server.example.com/path/to/dataset', session=session)

This method could work but each CAS is slightly different and might require a specifically designed setup_session instance. Two CAS are however explicitly supported by PyDAP:

URS NASA EARTHDATA#

Note

This method requires an EDL account. If you do not have one, go HERE to create one

Authentication can done through a username and a password. However, it is highly recommended to use instead tokens. This ECCO provides an example of how to use tokens to authenticate.

edl_token = "YourTokenHere"
auth_hdr="Bearer " + edl_token
my_session = requests.Session()
my_session.headers={"Authorization": auth_hdr}

With that you can now pass the my_session as an argument to PyDAP’s open_url:

dataset = open_url('http://server.example.com/path/to/dataset', session=my_session)

Earth System Grid Federation (ESGF)#

Authentication is done through an openid and a password:

from pydap.client import open_url
from pydap.cas.esgf import setup_session
dataset_url = 'http://server.example.com/path/to/dataset'
session = setup_session(openid, password, check_url=dataset_url)
dataset = open_url(dataset_url, session=session)

If your openid contains contains the string ceda.ac.uk authentication requires an additional username argument:

from pydap.client import open_url
from pydap.cas.esgf import setup_session
session = setup_session(openid, password, check_url=dataset_url, username=username)
dataset = open_url(dataset_url, session=session)

Warning

All Advanced features described below may be outdated.

Advanced features#

Calling server-side functions

When you open a remote dataset, the DatasetType object has a special attribute named functions that can be used to invoke any server-side functions. Here’s an example of using the geogrid function from Hyrax:

>>> dataset = open_url("http://test.opendap.org/dap/data/nc/coads_climatology.nc")
>>> new_dataset = dataset.functions.geogrid(dataset.SST, 10, 20, -10,60)
>>> new_dataset.SST.COADSY[:] 
[-11. -9. -7. -5. -3. -1. 1. 3. 5. 7. 9. 11.]
>>> new_dataset.SST.COADSX[:]
[ 21. 23. 25. 27. 29. 31. 33. 35. 37. 39. 41. 43. 45. 47. 49. 51. 53. 55. 57. 59. 61.]

Unfortunately, there’s currently no standard mechanism to discover which functions the server support. The function attribute will accept any function name the user specifies, and will try to pass the call to the remote server.

Using a cache#

Warning

This may be outdated.

You can specify a cache directory in the pydap.lib.CACHE global variable. If this value is different than None, the client will try (if the server headers don’t prohibit) to cache the result, so repeated requests will be read from disk instead of the network:

import pydap.lib
pydap.lib.CACHE = "/tmp/pydap-cache/"

Timeout#

To specify a timeout for the client, just set the desired number of seconds using the timeout option to open_url(...) or open_dods(...). For example, the following commands would timeout after 30 seconds without receiving a response from the server:

dataset = open_url('http://test.opendap.org/dap/data/nc/coads_climatology.nc', timeout=30)
dataset = open_dods('http://test.opendap.org/dap/data/nc/coads_climatology.nc.dods', timeout=30)

Configuring a proxy#

Warning

This may be outdated.

It’s possible to configure pydap to access the network through a proxy server. Here’s an example for an HTTP proxy running on localhost listening on port 8000:

import httplib2
from pydap.util import socks
import pydap.lib
pydap.lib.PROXY = httplib2.ProxyInfo(
         socks.PROXY_TYPE_HTTP, 'localhost', 8000)

This way, all further calls to pydap.client.open_url will be routed through the proxy server. You can also authenticate to the proxy:

pydap.lib.PROXY = httplib2.ProxyInfo(
         socks.PROXY_TYPE_HTTP, 'localhost', 8000,
         proxy_user=USERNAME, proxy_pass=PASSWORD)

A user has reported that httplib2 has problems authenticating against a NTLM proxy server. In this case, a simple solution is to change the pydap.http.request function to use urllib2 instead of httplib2, monkeypatching the code like in the CAS authentication example above:

import urllib2
import logging

def install_urllib2_client():
    def new_request(url):
        log = logging.getLogger('pydap')
        log.INFO('Opening %s' % url)

        f = urllib2.urlopen(url.rstrip('?&'))
        headers = dict(f.info().items())
        body = f.read()
        return headers, body

    from pydap.util import http
    http.request = new_request

The function install_urllib2_client should then be called before doing any requests.