Pydap is slow. How can I improve the download time?

There are two stages at which pydap downloads content: a) during the dataset creation, and b) fetching numerical / array data.

a) Metadata / dataset creation

from pydap.client import open_url

pyds = open_url(URL, session=my_session, protocol='dap2') # protocol='dap4' also works

The performance depends on several factors:

  • Authentication. There may be many redirects. If possible, use token authentication, as it reduces the number of redirects.

  • Hierarchical metadata. Some datasets contain O(100) variables and complex nested Groups (Groups are part of the DAP4 protocol), and parsing the metadata to create the dataset can be time-consuming. To reduce this time, you can use the Data Request Form to construct a Constraint Expression that restricts the Groups and variables included in your dataset. This allows you to discard variables before the dataset is created. The documentation on Constraint Expressions has an example demonstrating the use of CEs to reduce the size of the dataset before dataset creation for the Atlas03 experiment.

  • Cache the Session. Starting with pydap version 3.5.4, pydap can cache sessions, storing the DMR (i.e. the metadata) after the first download for later use.

Note

requests-cache can also recover credentials from the ~/.netrc file, and handle token authentication.

To cache the session, initialize it as follows:

from pydap.net import create_session

my_session = create_session(use_cache=True) # False is the default
pyds = open_url(URL, session=my_session, protocol='dap4') # protocol="dap2" works too

The documentation section on Pydap as a Client has a short example showing how to cache the dmr during the dataset creation.

b) Fetching numerical data

pydap downloads array data in the form of .dap (DAP4) or .dods (DAP2) responses when slicing the array. That is, when:

pyds["VarName"][:] # downloads the entire array; a narrower slice downloads only that subset

or when accessing via xarray (with engine="pydap")

ds['varName'].isel(dim1=slice_dim1, dim2=slice_dim2).data # e.g. ds['Theta'].isel(X=slice(1,10), Y=slice(10, 20)).data

The speed of download can depend on many factors: chunking of the remote dataset, size of download, internet speed, the remote server, etc. We recommend:

  • Subset the Variable. This limits the size of the download (especially when the remote dataset is a virtual aggregation of many remote files). Some organizations impose a 2 GB limit on downloads. The PACE Example illustrates this point: in it, the coordinate arrays (lat and lon) are used to identify the subset of the 2D array of interest.

  • Cache the Session. As with dataset creation, a cached session can also store .dap/.dods responses. This also reduces the number of times a (repeated) download is requested from the server.

  • Diagnosing. It is possible that the remote dataset has many small chunks, resulting in very slow performance. This, along with internet connection, is a performance problem outside the scope of pydap. A useful way to diagnose whether the issue is with pydap or with the remote server is to use curl to download the response:

curl -L -n "<opendap_url_with_constraint_expression>" 

where -L instructs curl to follow redirects, and -n instructs curl to recover authentication from the ~/.netrc file. The latter is only necessary when authentication is required. For example, to download a .dap (DAP4) response from a DAP4 server (with no authentication required):

curl -L -o output.dap "http://test.opendap.org/opendap/data/nc/coads_climatology.nc.dap?dap4.ce=/TIME"

The command above downloads only the variable TIME from this test dataset. The download should be very fast. When slicing an array, pydap does something very similar: it downloads a .dap response for a single variable, in this case TIME. pydap should not take much longer than curl to download the .dap response.

  • Check variable sizes and avoid downloading entire arrays of ncml datasets. ncml datasets are a virtual aggregation of a collection of NetCDF files. The ncml approach is great because it provides a single URL endpoint for a whole collection, but many users experience long download times and download errors when requesting even a single variable in its entirety.