Pydap is slow. How can I improve the download time?#
There are two stages at which pydap downloads data: a) during dataset creation, and b) when downloading array data. Either stage can be slow for a number of reasons:
Data is behind authentication.
The DAP2 protocol is used instead of the DAP4 protocol.
The remote dataset is stored with many small chunks.
The client API sends too many unnecessary requests to the remote server.
The server is not configured correctly.
The internet connection is intermittent or subpar.
These are only some of the possible scenarios. In the vast majority of cases, it is good practice to keep the number of requests to the server at a minimum, i.e., to request only what is necessary. This is even more important when data is behind authentication, as there can be many redirects per request from the client API.
Below is some guidance to improve data access for users. If you continue to experience performance issues, please consider opening an issue on Pydap’s GitHub issue tracker. In what follows we assume users are using pydap as a backend engine for Xarray, i.e.:
ds = xr.open_dataset(url, engine='pydap', ...) # or xr.open_mfdataset()
a) Creating the Dataset#
When creating a Dataset, Xarray’s internal logic sends requests to the server to:
Download metadata. For the DAP2 protocol these are the .dds and .das responses, one of each per remote file. In DAP4, only a single metadata request is sent per remote file, the .dmr. These metadata responses describe the internal structure and information within each remote file (all variable names, types, shapes, and attributes), which Xarray uses to create the Dataset.
Download all dimension array data. Per file, Xarray downloads all dimensions by default and loads them into memory. When opening multiple files, this behavior can lead to sub-par performance, as Xarray will request all dimension data from every file in order to perform a safety check, even when the user knows a priori that this data is identical across all remote files. While these safety checks are very important (and should be in place), they can lead to huge performance losses when aggregating 100s of remote files, each with 2 or more dimensions, in particular when restarting the kernel (and being forced to run the workflow from scratch).
Note
As of xarray=v2025.10, Xarray sends an individual request per variable within a remote file, instead of fetching all of the variables’ array data within a single request. This behavior is not optimal, and will be improved in the future.
To improve performance and avoid re-downloading the data required to open a dataset, we recommend the following:
Use DAP4. This is the more modern DAP protocol, and it is better supported by OPeNDAP services and the overall geospatial community. Moreover, by choosing DAP4 you automatically cut in half the number of metadata responses per file. To specify the DAP protocol with PyDAP, change the scheme of the url as follows:
dap4_url = "dap4://<www.opendap.org/remote_file>" # <------- DAP4
dap2_url = "dap2://<www.opendap.org/remote_file>" # <------- DAP2
url = "https://<www.opendap.org/remote_file>" # <------- Pydap assumes DAP2.
Use Constraint Expressions to produce a Constrained DMR. A Constraint Expression can be added to a DAP4 url following the syntax:
url = "https://<...opendap_url...>"
CE = "dap4.ce=/VarName1;/VarName2;/...;/VarNameN"
url_ce = url + "?" + CE
where VarName1, VarName2, and VarNameN are all variables present in a remote file with M>=N variables. Passing this URL to Xarray with pydap as the engine will enable the remote OPeNDAP server to produce a Constrained DMR, which will only contain information about those variables. This can lead to significant performance gains when N<<M (for example N=4 and M=1000).
Note
Xarray has internal logic to drop variables, but Xarray will parse the metadata from ALL the variables before subsequently dropping the variables specified by the user in .drop_vars(). With a Constrained DMR produced via the Constraint Expression in the example above, Xarray would only process the N variables.
Warning
Xarray requires the presence of dimension data to match the shape of any data variable. When constructing a Constraint Expression as above, include in the CE all the dimensions associated with the variables of interest, as shown in the sketch below.
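A minimal sketch (with hypothetical variable and dimension names) of a Constraint Expression that requests a single data variable together with its dimensions:
# Hypothetical example: request only VarName1 and its dimensions time, lat, and lon
# from a remote file that contains many more variables.
url = "https://<...opendap_url...>"
CE = "dap4.ce=/VarName1;/time;/lat;/lon"
ds = xr.open_dataset(url + "?" + CE, engine='pydap')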
Consolidate Metadata. PyDAP has a mechanism to persist metadata for later reuse. PyDAP can use a requests_cache.CachedSession to download and persist the metadata required to initiate an Xarray Dataset. A CachedSession makes use of a SQLite backend and can act as a database manager, and since a CachedSession can also be used to authenticate, it is a drop-in replacement for the requests.Session object typically used by PyDAP. Consider the example below:
import xarray as xr
from pydap.net import create_session
from pydap.client import consolidate_metadata
URLS = [url1, url2, url3, url4, url5, ...., urlN]
database_name = '</path_to_persistent_directory_of_metadata_for_the_files/NAME_OF_DATABASE>'
my_session = create_session(use_cache=True, cache_kwargs={'cache_name': database_name})
consolidate_metadata(URLS, concat_dim="time", session=my_session)
ds = xr.open_mfdataset(URLS, engine='pydap', session=my_session, parallel=True, concat_dim='time', combine='nested', ...)
The resulting my_session points to a SQLite database that can persist, and that can be version controlled if the directory /path_to_persistent_directory_of_metadata_for_the_files/, where the NAME_OF_DATABASE.sqlite file lives, is itself version controlled (for example with GitHub).
consolidate_metadata pre-downloads all the DMRs from the remote server, along with any necessary dimension array data, and stores them in the SQLite database for later reuse. It is up to the user to know along which dimension the data should be concatenated; in the example above it is time. After restarting the kernel (or deleting the ds reference), as long as my_session points to the database, all the metadata persists. Creating/opening the Xarray dataset should take no more than 2-5 seconds.
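For example, after restarting the kernel, a minimal sketch (assuming URLS and database_name are defined as above) would be:
import xarray as xr
from pydap.net import create_session

# Re-create the cached session pointing to the same SQLite database; the metadata
# is read from the cache instead of being re-downloaded from the server.
my_session = create_session(use_cache=True, cache_kwargs={'cache_name': database_name})
ds = xr.open_mfdataset(URLS, engine='pydap', session=my_session, concat_dim='time', combine='nested', parallel=True)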
In addition, consider the case where the data provider has added more remote files to the same collection that the initial URLs belong to. Running the following should update the dataset with the new data. Since the original files are already cached, no data from the original urls should be downloaded.
# new urls
new_URLs = [urlN1, urlN2, urlN3, urlN4, urlN5, ...., urlNN]
# add the new urls to the previous ones
updated_URLs = URLS + new_URLs
# re-create the session that points to the same SQLite database
database_name = '</path_to_persistent_directory_of_metadata_for_the_files/NAME_OF_DATABASE>'
my_session = create_session(use_cache=True, cache_kwargs={'cache_name': database_name})
consolidate_metadata(updated_URLs, concat_dim="time", session=my_session)
Now the SQLite database contains updated dimension data and metadata.
Note
To clear the metadata, simply run my_session.cache.clear(). When clearing the metadata, especially after restarting the kernel, we suggest starting from create_session all over again.
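A minimal sketch, reusing database_name from above:
# Drop all cached metadata, then start again from a fresh cached session.
my_session.cache.clear()
my_session = create_session(use_cache=True, cache_kwargs={'cache_name': database_name})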
b) Fetching numerical data#
It is strongly advised to configure the OPeNDAP server to implement the DAP4 protocol, since it fully supports all data types supported by DAP2, and it can send chunked responses over the web.
Note
In the DAP2 protocol, the entire response is sent in a single chunk; that is, the DAP2 protocol does not support chunked responses. This can lead to timeout errors on the server side.
When streaming data over the network (as opposed to working within the same cloud region), it is strongly advised to exploit the OPeNDAP server’s specialized infrastructure to subset in a data-proximate way. When working with Xarray, this implies making sure the slice is passed down to the server. This is achieved as follows:
ds = xr.open_dataset(url, engine='pydap', session=my_session)
ds['varName'].isel(dim1=slice_dim1, dim2=slice_dim2).load()
where slice_dim1 and slice_dim2 are slices that have been predetermined by the user, for example as in the sketch below.
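A minimal sketch with hypothetical dimension names and index ranges:
# Only this subset is requested from the server; the full arrays are never downloaded.
slice_dim1 = slice(0, 10)
slice_dim2 = slice(100, 200)
ds['varName'].isel(dim1=slice_dim1, dim2=slice_dim2).load()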
When working with multiple files, as of Xarray <=v2025.10, the slice is not passed to the server unless the dataset is chunked when it is created. That is:
expected_sizes = {'dim1': expected_size_dim1, 'dim2': expected_size_dim2, 'dim3': expected_size_dim3}
ds = xr.open_mfdataset(urls, engine='pydap', session=my_session, chunks=expected_sizes, concat_dim='dim1', combine='nested', parallel=True)
ds['varName'].isel(dim1=slice_dim1, dim2=slice_dim2, dim3=slice_dim3).load()
where dim1 is the concatenating dimension, dim2 and dim3 are other dimensions in the aggregated dataset, and expected_size_dim1, expected_size_dim2, and expected_size_dim3 together define the expected size of the subset within each granule. This size cannot exceed the original size of the dimension. For some examples, see the 5 minute tutorial.
Diagnosing. It is possible that the remote dataset has many small chunks, resulting in very slow performance. This, along with the quality of the internet connection, is a performance problem outside of the scope of pydap. A useful way to diagnose whether the issue lies with pydap or with the remote server is to use curl to download the response.
curl -L -n "<opendap_url_with_constraint_expression>"
where -L instructs curl to follow redirects, and -n instructs curl to recover authentication credentials from the ~/.netrc file. The latter is only necessary when authentication is required. For example, to download a .dap (DAP4) response from a DAP4 server (with no authentication required):
curl -L -o output.dap "http://test.opendap.org/opendap/data/nc/coads_climatology.nc.dap?dap4.ce=/TIME"
The command above downloads only the variable TIME from this test dataset. The download should be very fast. When slicing an array, pydap does something very similar: it downloads a .dap response for a single variable, in this case TIME. Pydap should not take much longer than curl to download the .dap response.
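For a rough comparison on the pydap side, a minimal sketch using the same test URL (the dap4:// scheme, as described above, selects the DAP4 protocol):
from pydap.client import open_url

# Open the test dataset and fetch only the TIME variable; this triggers a single
# .dap response, similar to the curl command above.
pyds = open_url("dap4://test.opendap.org/opendap/data/nc/coads_climatology.nc")
time_data = pyds['TIME'][:]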
Check variable sizes and avoid downloading entire arrays of ncml datasets. ncml datasets are a virtual aggregation of a collection of NetCDF files. The ncml is great because it provides a single URL endpoint for an entire collection, but many users experience long wait times and download errors when requesting even a single variable.
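A minimal sketch (with a hypothetical ncml endpoint and variable name) for inspecting sizes before downloading; with engine='pydap' the arrays stay lazy, so checking shapes does not trigger a download:
# ncml_url is a hypothetical ncml endpoint for an aggregated collection.
ds = xr.open_dataset(ncml_url, engine='pydap')
print(ds['varName'].shape, ds['varName'].nbytes / 1e9, 'GB')  # size known from metadata only
ds['varName'].isel(time=0).load()  # subset before loading to keep each request small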