Pydap is slow. How can I improve the download time?#

There are two stages at which pydap downloads data: a) during dataset creation, and b) when downloading array data. Either stage can be slow on its own, due to scenarios such as:

  1. Data behind Authentication.

  2. Using DAP2 protocol instead of DAP4 protocol.

  3. Remote dataset is chunked with small chunks.

  4. The client API is sending too many unnecessary requests to the remote server.

  5. Server is not configured correctly.

  6. Internet connection is intermittent or subpar.

In the vast majority of cases, it is good practice to keep the number of requests to the server to a minimum, i.e., to request only what is necessary. This is even more important when data is behind authentication, since each request from the client API can involve multiple redirects.

Below is some guidance to improve data access for users. If you continue to experience performance issues, please consider opening an issue in Pydap’s GitHub issue tracker. In what follows we assume users are using pydap as a backend engine for Xarray, i.e.:

ds = xr.open_dataset(url, engine='pydap', ...) # or xr.open_mfdataset()

a) Creating the Dataset#

When creating a Dataset, Xarray’s internal logic sends requests to the server to:

  • Download metadata. With the DAP2 protocol these are the .dds and .das, one of each per remote file. With DAP4, only a single metadata request is sent per remote file: the .dmr. These metadata responses describe the internal structure and contents of each remote file (all variable names, types, shapes, and attributes), which Xarray uses to create the Dataset.

  • Download all dimension array data. Per file, Xarray by default downloads all dimensions and loads them into memory. When opening multiple files, this can lead to poor performance, as Xarray requests all dimension data from every file in order to perform a safety check, even when the user knows a priori that this data is identical across all remote files. While these safety checks are very important (and should remain in place), they can lead to large performance losses when aggregating hundreds of remote files, each with 2 or more dimensions, in particular after restarting the kernel (and being forced to rerun the workflow from scratch). One way to relax these checks is sketched after the note below.

Note

As of xarray=v2025.10, Xarray sends an individual request per variable within a remote file, instead of fetching all variables’ array data within a single request. This behavior is not optimal and will be improved in the future.
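
When the user already knows that all files share identical coordinates along the non-concatenated dimensions, one common way to reduce these cross-file checks is to relax Xarray’s combine options. The sketch below is a minimal illustration (the urls list and the time concatenation dimension are placeholders), and it complements, rather than replaces, the recommendations that follow.

import xarray as xr

# A minimal sketch: skip most cross-file coordinate comparisons, assuming
# the user knows a priori that coordinates are identical across files.
urls = ["dap4://<server>/file1.nc", "dap4://<server>/file2.nc"]  # placeholders

ds = xr.open_mfdataset(
    urls,
    engine="pydap",
    combine="nested",
    concat_dim="time",     # hypothetical concatenation dimension
    coords="minimal",      # only concatenate coordinates that actually vary
    data_vars="minimal",
    compat="override",     # take coordinates from the first file, skip equality checks
    parallel=True,
)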

To improve performance and avoid re-downloading the data required to open a dataset, we recommend:

  1. Use DAP4. This is the more modern DAP protocol, better supported by OPeNDAP services and the overall geospatial community. Moreover, by choosing DAP4 you automatically halve the number of metadata responses per remote file. To specify the DAP protocol with PyDAP you can change the scheme of the URL as follows:

dap4_url = "dap4://<www.opendap.org/remote_file>" # <------- DAP4
dap2_url = "dap2://<www.opendap.org/remote_file>" # <------- DAP2
url = "https://<www.opendap.org/remote_file>"     # <------- Pydap assumes DAP2.

  2. Use Constraint Expressions to produce a Constrained DMR. A Constraint Expression can be appended to a DAP4 URL following the syntax:

url = "https://<...opendap_url...>"
CE = "dap4.ce=/VarName1;/VarName2;/...;/VarNameN"
url_ce = url + "?" + CE

where VarName1, VarName2, and VarNameN are variables present in a remote file containing M>=N variables. Passing this URL to Xarray with pydap as the engine enables the remote OPeNDAP server to produce a Constrained DMR, which only contains information about those variables. This can lead to significant performance gains when N<<M (for example, N=4 and M=1000).

Note

Xarray has internal logic to drop variables, but it parses the metadata of ALL the variables before subsequently dropping the ones specified by the user in .drop_vars(). With a Constrained DMR produced via the Constraint Expression above, Xarray only processes the N variables.

Warning

Xarray requires the presence of dimension data matching the shape of every data variable. When constructing a Constraint Expression as above, include in the CE all the dimensions associated with the variables of interest (see the sketch below).
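
As a concrete illustration, the sketch below builds a Constrained DMR request for the test dataset used later in this FAQ; the variable names (SST and its dimensions TIME, COADSX, COADSY) are assumptions about that file and should be replaced with the names in your own data.

import xarray as xr

# Hypothetical example: keep one data variable plus its dimensions.
url = "dap4://test.opendap.org/opendap/data/nc/coads_climatology.nc"
CE = "dap4.ce=/SST;/TIME;/COADSX;/COADSY"   # variable of interest + its dimensions

ds = xr.open_dataset(url + "?" + CE, engine="pydap")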

  3. Consolidate Metadata. PyDAP has a method to persist metadata for later reuse. PyDAP can use a requests_cache.CachedSession to download and persist the metadata required to initiate an Xarray Dataset. A CachedSession uses a SQLite backend and acts as a database manager, and since it can also be used to authenticate, it is a drop-in replacement for the requests.Session object typically used by PyDAP. Consider the example below:

from pydap.net import create_session
from pydap.client import consolidate_metadata

URLS = [url1, url2, url3, url4, url5, ...., urlN]
database_name = '</path_to_persistent_directory_of_metadata_for_the_files/NAME_OF_DATABASE>'

my_session = create_session(use_cache=True, cache_kwargs={'cache_name': database_name})

consolidate_metadata(URLS, concat_dim="time", session=my_session)

ds = xr.open_mfdataset(URLS, engine='pydap', session=my_session, parallel=True, combine='nested', concat_dim='time', ...)

The resulting my_session points to a SQLite database that can persist and even be version controlled, if the /path_to_persistent_directory_of_metadata_for_this_files/ directory containing the NAME_OF_DATABASE.sqlite file is under version control (for example, using GitHub).

consolidate_metadata pre-downloads all the DMRs from the remote server, along with any necessary dimension array data, and stores them in the SQLite database for later reuse. It is up to the user to know along which dimension the data should be concatenated; in the example above, it is time. After restarting the kernel (or deleting the ds reference), as long as my_session points to the database, all the metadata persists, and creating/opening the Xarray dataset should take no more than 2-5 seconds.
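
For instance, after a kernel restart the dataset can be reopened against the cached metadata by recreating the session that points at the same SQLite database (a sketch; the paths and the time dimension follow the example above):

from pydap.net import create_session
import xarray as xr

# Recreate the cached session after a restart; no new metadata requests
# should reach the server as long as the database below already exists.
database_name = '</path_to_persistent_directory_of_metadata_for_the_files/NAME_OF_DATABASE>'
my_session = create_session(use_cache=True, cache_kwargs={'cache_name': database_name})

ds = xr.open_mfdataset(URLS, engine='pydap', session=my_session, parallel=True,
                       combine='nested', concat_dim='time')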

In addition, consider the case where the data provider has added more remote files to the collection that the initial URLs belong to. Running the following will update the database with the new data. Since the original files are already cached, no data from the original URLs should be re-downloaded.

# new urls
new_URLs = [urlN1, urlN2, urlN3, urlN4, urlN5, ...., urlNN]

# add new url to previous ones
updated_URLs = URLS + new_URLs

# load the session that points to the SQLite database
database_name = '</path_to_persistent_directory_of_metadata_for_this_files/NAME_OF_DATABASE>'
my_session = create_session(use_cache=True, cache_kwargs={'cache_name': database_name})

consolidate_metadata(updated_URLs, concat_dim="time", session=my_session)

Now the SQLite database contains updated dimension data and metadata.

Note

To clear the metadata, simply run my_session.cache.clear(). After clearing the metadata, especially after restarting the kernel, we suggest starting again from create_session.

b) Fetching numerical data#

  • It is strongly advised to configure the OPeNDAP server to implement the DAP4 protocol, since it fully supports all data types supported by DAP2 and can send chunked responses over the web.

Note

In the DAP2 protocol, the entire response is sent as a single chunk; that is, DAP2 does not support chunked responses. This can lead to timeout errors on the server side.

  • When streaming data over the network (as opposed to working within the same cloud region), it is strongly advised to exploit the OPeNDAP server’s specialized infrastructure to subset in a data-proximate way. When working with Xarray, this implies making sure the slice is passed down to the server. This is achieved as follows:

ds = xr.open_dataset(url, engine='pydap', session=my_session)
ds['varName'].isel(dim1=slice_dim1, dim2=slice_dim2).load()

where slice_dim1 and slice_dim2 are slices (e.g. slice(0, 10)) that have been predetermined by the user.

When working with multiple files, as of Xarray <=v2025.10, the slice is not passed down to the server unless the dataset is chunked when it is created. That is:

expected_sizes = {'dim1':expected_size_dim1, 'dim2':expected_size_dim2, 'dim3': expected_size_dim3}

ds = xr.open_mfdataset(urls, engine='pydap', session=my_session, chunks=expected_sizes, concat_dim='dim1', combine='nested', parallel=True)
ds['varName'].isel(dim1=slice_dim1, dim2=slice_dim2, dim3=slice_dim3).load()

where dim1 is the concatenating dimension, dim2 and dim3 are other dimensions in the aggregated dataset, and expected_size_dim1, expected_size_dim2, and expected_size_dim3 together define the expected size of the subset within each granule. This size cannot exceed the original size of the corresponding dimension. For some examples, see the 5 minute tutorial.
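
As a concrete, hypothetical illustration, suppose each granule holds 24 time steps on a 720x1440 grid and the files are concatenated along time (all names and numbers below are placeholders):

# Hypothetical dimension names and sizes; none may exceed the size of the
# corresponding dimension within a single granule.
expected_sizes = {'time': 24, 'lat': 720, 'lon': 1440}

ds = xr.open_mfdataset(urls, engine='pydap', session=my_session, chunks=expected_sizes,
                       concat_dim='time', combine='nested', parallel=True)
ds['varName'].isel(time=slice(0, 24), lat=slice(0, 360), lon=slice(0, 720)).load()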

  • Diagnosing. It is possible that the remote dataset is stored with many small chunks, resulting in very slow performance. This, along with the quality of the internet connection, is a performance problem outside the scope of pydap. A useful way to diagnose whether the issue lies with pydap or with the remote server is to use curl to download the response:

curl -L -n "<opendap_url_with_constraint_expression>" 

where -L tells curl to follow redirects, and -n instructs curl to read authentication credentials from the ~/.netrc file (only necessary when authentication is required). For example, to download a .dap (DAP4) response from a DAP4 server (with no authentication required):

curl -L -o output.dap "http://test.opendap.org/opendap/data/nc/coads_climatology.nc.dap?dap4.ce=/TIME"

The command above downloads only the variable TIME from this test dataset; the download should be very fast. When slicing an array, pydap does something very similar: it downloads a .dap response for a single variable, in this case TIME. Pydap should not take much longer than curl to download the .dap response.
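
A rough pydap equivalent of the curl check above is sketched below, using the same test dataset; timing it against curl helps isolate whether the slowdown comes from the client or from the server/network.

from pydap.client import open_url

# Open the test dataset over DAP4 and download only TIME.
ds = open_url("dap4://test.opendap.org/opendap/data/nc/coads_climatology.nc")
time = ds["TIME"][:]   # triggers a single .dap response containing TIME only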

  • Check variable sizes and avoid downloading entire arrays of ncml datasets. ncml datasets are a virtual aggregation of a collection of NetCDF files. The ncml is great because it provides a single URL endpoint for an entire collection, but many users experience long waits and download errors when requesting even a single variable in full.
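
Before requesting data from an ncml aggregation, it helps to inspect how large a variable actually is and to subset explicitly. Below is a minimal sketch; the ncml_url, the variable name varName, and the time dimension are placeholders.

import xarray as xr

ncml_url = "dap4://<server>/<collection>.ncml"   # placeholder

ds = xr.open_dataset(ncml_url, engine="pydap")
print(ds["varName"].sizes)             # elements per dimension
print(ds["varName"].nbytes / 1e9)      # uncompressed size in GB

# Request an explicit subset rather than the entire aggregated array:
subset = ds["varName"].isel(time=slice(0, 10)).load()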