PyDAP as a client#
PyDAP can be used to “lazily” inspect and retrieve remote data from any of the thousands of scientific datasets available on OPeNDAP data servers. The user can manipulate a Dataset as if it were stored locally, and data is downloaded on the fly only when necessary. To transmit data from the server to the client, both must agree on how the data is represented: is it an array of integers, or a multi-dimensional grid? To this end, the DAP protocol defines a data model that, in theory, should be able to represent any existing (scientific) dataset.
Pydap uses the requests library to fetch remote data from an OPeNDAP data server. Each response from such a server is one of the following types:
| File Extension | File Type | Protocol | Example URL |
|---|---|---|---|
| DMR | Metadata | DAP4 | http://test.opendap.org/opendap/data/nc/coads_climatology.nc.dmr |
| DAP | Metadata and binary | DAP4 | http://test.opendap.org/opendap/data/nc/coads_climatology.nc.dap |
| DDS | Metadata | DAP2 | http://test.opendap.org/opendap/data/nc/coads_climatology.nc.dds |
| DAS | Metadata | DAP2 | http://test.opendap.org/opendap/data/nc/coads_climatology.nc.das |
| DODS | Metadata and binary | DAP2 | http://test.opendap.org/opendap/data/nc/coads_climatology.nc.dods |
Note
Clicking on any of the dap or dods example URLs will trigger a download of OPeNDAP binary data. Pydap parses this binary data and turns it into a pydap Dataset.
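Every example URL in the table is the dataset URL followed by a response suffix. The sketch below makes that pattern explicit. The names `RESPONSE_TYPES` and `response_url` are hypothetical helpers introduced here for illustration; they are not part of pydap.

```python
# Illustrative only: RESPONSE_TYPES and response_url are hypothetical helpers,
# not part of the pydap API.
RESPONSE_TYPES = {
    "dmr": ("DAP4", "metadata"),
    "dap": ("DAP4", "metadata and binary"),
    "dds": ("DAP2", "metadata"),
    "das": ("DAP2", "metadata"),
    "dods": ("DAP2", "metadata and binary"),
}


def response_url(base_url, extension):
    """Return the URL of one OPeNDAP response for a dataset URL."""
    if extension not in RESPONSE_TYPES:
        raise ValueError(f"unknown response extension: {extension!r}")
    return f"{base_url}.{extension}"


base = "http://test.opendap.org/opendap/data/nc/coads_climatology.nc"
for ext, (protocol, content) in RESPONSE_TYPES.items():
    # One line per response type: protocol, content, full URL
    print(f"{protocol:5s} {content:20s} {response_url(base, ext)}")
```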
Requests library#
As of version 3.5.4, pydap uses the Python requests library to fetch the remote responses described in the table above, and can also use the requests_cache library to cache responses. Pydap provides a function, create_session, to initialize either kind of session:
| Session with No Cache | Cached Session |
|---|---|
| use_cache=False (default) | use_cache=True |
from pydap.client import open_url
from pydap.net import create_session
data_url = "http://test.opendap.org/opendap/data/nc/coads_climatology.nc"
Use default non-cached Session#
# default
my_session = create_session()
%%time
pyds = open_url(data_url, protocol="dap4", session=my_session)
CPU times: user 17.2 ms, sys: 1.82 ms, total: 19.1 ms
Wall time: 269 ms
Let’s try again#
%%time
pyds = open_url(data_url, protocol="dap4", session=my_session)
CPU times: user 4.24 ms, sys: 1.09 ms, total: 5.32 ms
Wall time: 155 ms
What is happening?#
In both cases, only the dmr associated with the remote dataset was fetched and used to create the pydap dataset.
The apparent difference in timing can sometimes be attributed to what is called “cold reading” vs “warm reading”. But in both scenarios, each time pyds is created, the remote dmr is fetched and processed by pydap to create the lazy dataset that points to the original OPeNDAP source.
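The `%%time` magic used above only works in IPython or Jupyter. In a plain Python script, a similar wall-time measurement can be sketched with `time.perf_counter` from the standard library. The helper name `timed` is hypothetical, and the timed call here is a local stand-in rather than a real request.

```python
import time


def timed(func, *args, **kwargs):
    """Call func and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


# Local stand-in for a remote call. With pydap this could be, for example:
#   timed(open_url, data_url, protocol="dap4", session=my_session)
result, elapsed = timed(sum, range(1000))
print(f"result={result}, wall time: {elapsed * 1e3:.3f} ms")
```

Calling `timed` twice on the same `open_url` arguments would give a rough cold versus warm comparison like the one shown above.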
To avoid downloading the same resource over and over, potentially overwhelming remote data servers, pydap can now cache responses.
Use Cached-Session#
# Non-default
cached_session = create_session(use_cache=True)
Clear any previous cached session#
cached_session.cache.clear()
%%time
new_pyds = open_url(data_url, protocol="dap4", session=cached_session)
CPU times: user 6.62 ms, sys: 3.62 ms, total: 10.2 ms
Wall time: 191 ms
The time required to download a remote dmr from the same server remains close to that of the warm case.
Now let’s try again!#
%%time
new_pyds = open_url(data_url, protocol="dap4", session=cached_session)
CPU times: user 2.29 ms, sys: 432 μs, total: 2.72 ms
Wall time: 2.58 ms
The elapsed time has dropped significantly. This is because the dmr was not downloaded from the remote source this time. Instead, it was fetched from the cache.
print("Default location of cached response: ", cached_session.cache.db_path)
Default location of cached response: /var/folders/hc/tkfpclz952n091r0k5b2t9jr0000gn/T/http_cache.sqlite
print("URLs of cached responses: ", cached_session.cache.urls())
URLs of cached responses: ['http://test.opendap.org/opendap/data/nc/coads_climatology.nc.dmr']
Finally, let’s clear the cache#
cached_session.cache.clear()
print("URLs of cached responses: ", cached_session.cache.urls())
URLs of cached responses: []
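The cache behavior shown above (first request downloaded, second request served locally, then cleared) can be mimicked with a toy in-memory cache keyed by URL. This is only a conceptual sketch: `ToyCachedFetcher` and `fake_fetch` are hypothetical names, no network request is made, and the real cached session stores responses in an SQLite file (as the `db_path` output above shows) rather than in a dictionary.

```python
class ToyCachedFetcher:
    """Toy in-memory response cache keyed by URL (illustration only)."""

    def __init__(self, fetch):
        self._fetch = fetch   # function that performs the "download"
        self._cache = {}      # url -> response body
        self.downloads = 0    # how many times fetch was actually called

    def get(self, url):
        if url not in self._cache:
            # Cache miss: download once and remember the response
            self.downloads += 1
            self._cache[url] = self._fetch(url)
        return self._cache[url]

    def urls(self):
        return list(self._cache)

    def clear(self):
        self._cache.clear()


def fake_fetch(url):
    # Stand-in for a network request; returns a made-up body.
    return f"<response for {url}>"


fetcher = ToyCachedFetcher(fake_fetch)
dmr_url = "http://test.opendap.org/opendap/data/nc/coads_climatology.nc.dmr"
fetcher.get(dmr_url)  # first call: "downloaded"
fetcher.get(dmr_url)  # second call: served from the cache
print("downloads:", fetcher.downloads, "cached URLs:", fetcher.urls())
fetcher.clear()
print("cached URLs after clear:", fetcher.urls())
```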
Timeout#
To specify a timeout for the client, set the desired number of seconds using the timeout option of open_url(...). For example, the following call would time out after 30 seconds without receiving a response from the server:
dataset = open_url('http://test.opendap.org/dap/data/nc/coads_climatology.nc', timeout=30)
Note
The default timeout is 120 seconds, or 2 minutes.
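When a server is slow to respond, one possible pattern is to retry with progressively longer timeouts. The sketch below is a hedged illustration: `open_with_timeouts` and `flaky_opener` are hypothetical names, and the exception type raised by pydap on a timeout is not stated in this document, so `TimeoutError` is only an assumed default that can be overridden through the `errors` argument.

```python
def open_with_timeouts(opener, url, timeouts=(30, 60, 120), errors=(TimeoutError,)):
    """Call opener(url, timeout=t) with each timeout in turn.

    Returns the first successful result; re-raises the last error if all fail.
    The default `errors` is an assumption, not pydap's documented behavior.
    """
    if not timeouts:
        raise ValueError("at least one timeout is required")
    last_error = None
    for t in timeouts:
        try:
            return opener(url, timeout=t)
        except errors as exc:
            last_error = exc  # remember the failure and try a longer timeout
    raise last_error


def flaky_opener(url, timeout):
    # Stand-in for open_url: "times out" unless given at least 60 seconds.
    if timeout < 60:
        raise TimeoutError(f"no response within {timeout} s")
    return f"dataset from {url} (timeout={timeout})"


url = "http://test.opendap.org/dap/data/nc/coads_climatology.nc"
print(open_with_timeouts(flaky_opener, url))
```

With pydap, `open_url` could be passed as the `opener`, since it accepts a `timeout` keyword as shown above.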