## Pydap is slow. How can I improve the download time?
There are two stages at which `pydap` downloads content: a) during dataset creation, and b) when fetching numerical / array data.
### a) Metadata / dataset creation
```python
pyds = open_url(URL, session=my_session, protocol='dap2')  # protocol="dap4" works too
```
The performance depends on various factors:

- **Authentication.** There may be many redirects. If possible, use token authentication, as it reduces the number of redirects.
- **Hierarchical metadata.** Some datasets contain O(100) variables and complex nested `Group`s (`Group`s are part of the DAP4 protocol), and parsing the metadata to create the dataset can be time-consuming. To reduce this time, you can use the Data Request Form to construct a Constraint Expression that reduces the number of `Group`s and variables you wish to include in your dataset. This allows you to discard variables before creating the dataset. The documentation on Constraint Expressions has an example demonstrating the use of `CE`s to reduce the size of the dataset before dataset creation for the Atlas03 experiment.
- **Cache the session.** Starting with `pydap` version `3.5.4`, `pydap` can cache sessions, storing the `dmr` (i.e. the metadata) after the first download, for later use.
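As a rough sketch of how a Constraint Expression narrows a request before dataset creation, a `dap4.ce` query string can be assembled from the variable paths you want to keep. The variable selection below is hypothetical; the Data Request Form can generate such expressions for you:

```python
from urllib.parse import quote

base_url = "http://test.opendap.org/opendap/data/nc/coads_climatology.nc"

# Hypothetical selection: keep only these two variables.
keep = ["/TIME", "/SST"]

# DAP4 constraint clauses are separated by semicolons.
ce = ";".join(keep)
url = f"{base_url}.dmr?dap4.ce={quote(ce, safe='/;')}"
print(url)
# → http://test.opendap.org/opendap/data/nc/coads_climatology.nc.dmr?dap4.ce=/TIME;/SST
```

Opening this constrained URL with `open_url` should parse metadata for only the selected variables, which is where the time savings come from.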
**Note:** `requests-cache` can also recover credentials from the `~/.netrc` file and handle token authentication.
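For token authentication, one option is to attach a bearer token to a `requests` session and pass that session to `open_url`. This is a minimal sketch; the token value below is a placeholder (in practice you would use, e.g., an Earthdata Login token):

```python
import requests

# Placeholder token value; replace with a real token.
TOKEN = "my-token"

my_session = requests.Session()
# A single Authorization header avoids the redirect-heavy
# username/password login flow.
my_session.headers.update({"Authorization": f"Bearer {TOKEN}"})
```

The session can then be passed to `open_url(URL, session=my_session, ...)`.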
To cache the session, you can initialize it as follows:

```python
from pydap.net import create_session

my_session = create_session(use_cache=True)  # False is the default
pyds = open_url(URL, session=my_session, protocol='dap4')  # protocol="dap2" works too
```
The documentation section on Pydap as a Client has a short example showing how to cache the `dmr` during dataset creation.
### b) Fetching numerical data
`pydap` downloads array data in the form of `.dap` (DAP4) or `.dods` (DAP2) responses when slicing the array. That is, when:

```python
pyds["VarName"][:]  # this downloads the entire array; a different indexing only downloads the subset
```
or when accessing data via `xarray` (with `engine="pydap"`):

```python
ds['varName'].isel(dim1=slice_dim1, dim2=slice_dim2).data  # e.g. ds['Theta'].isel(X=slice(1, 10), Y=slice(10, 20)).data
```
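To see why the slice matters, compare the bytes requested by a small subset with those of the full array. The array shape below is hypothetical, chosen only to illustrate the arithmetic:

```python
import math

# Hypothetical float64 variable Theta with shape (X=1440, Y=720).
full_bytes = math.prod((1440, 720)) * 8

# isel(X=slice(1, 10), Y=slice(10, 20)) requests only 9 x 10 values.
subset_bytes = 9 * 10 * 8

print(full_bytes, subset_bytes, full_bytes // subset_bytes)
# → 8294400 720 11520
```

In this (made-up) case the subset transfers over four orders of magnitude less data than the full array.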
The speed of download can depend on many factors: the chunking of the remote dataset, the size of the download, internet speed, the remote server, etc. We recommend:
- **Subset the variable.** This limits the size of the download (especially when the remote dataset is a virtual aggregation of many remote files). Some organizations impose a 2 GB limit on downloads. The PACE Example illustrates this point: in it, the coordinate arrays (`lat` and `lon`) are used to identify the subset of the 2D array of interest.
- **Cache the session.** As with dataset creation, a cached session can also store `.dap` / `.dods` responses. This also limits how many times a (repeated) download is requested from the server.
- **Diagnose.** It is possible that the remote dataset has many small chunks, resulting in very slow performance. This, along with the internet connection, is a performance problem outside the scope of `pydap`. A useful way to diagnose whether the issue is with `pydap` or with the remote server is to use `curl` to download the response:
```shell
curl -L -n "<opendap_url_with_constraint_expression>"
```
where `-L` instructs `curl` to follow redirects, and `-n` instructs `curl` to recover authentication from the `~/.netrc` file. The latter is only necessary when authentication is required. For example, to download a `.dap` (DAP4) response from a DAP4 server (with no authentication required):
```shell
curl -L -o output.dap "http://test.opendap.org/opendap/data/nc/coads_climatology.nc.dap?dap4.ce=/TIME"
```
The command above downloads only the variable `TIME` from this test dataset. The download should be very fast. When slicing an array, `pydap` does something very similar: it downloads a `.dap` response for a single variable, in this case `TIME`. `pydap` should not take much longer than `curl` to download the `.dap` response.
- **Check variable sizes and avoid downloading entire arrays of `ncml` datasets.** `ncml` datasets are a virtual aggregation of a collection of NetCDF files. The `ncml` is great because it provides a single URL endpoint for an entire collection, but many users experience long wait times and download errors when requesting even a single variable.
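Before slicing, it can help to estimate a variable's uncompressed footprint from its shape and dtype. The helper below is a sketch defined here for illustration, not part of `pydap`; it assumes the variable exposes numpy-style `shape` and `dtype` attributes:

```python
import math

def nbytes_of(var):
    """Estimate the uncompressed size in bytes of an array-like
    variable exposing numpy-style `shape` and `dtype` attributes."""
    return math.prod(var.shape) * var.dtype.itemsize

# With a pydap dataset this could be used as, e.g.:
#   nbytes_of(pyds["VarName"])
# Anything in the multi-GB range is worth subsetting before download.
```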