PACE Chlorophyll A data over a region in the North Atlantic Ocean#
This notebook demonstrates access to and subsetting of PACE Ocean Color data. Broad information about the dataset can be found on the PACE website (see here).
Requirements to run this notebook
Have an Earthdata Login account.
Have a Bearer Token.
Knowledge of a collection concept ID (or DOI) of a dataset.
Internet connection.
Objectives
Use pydap to demonstrate:
Discovery of OPeNDAP granules from a collection of interest, further filtering with a time range.
Use of tokens to authenticate (from earthaccess).
Accessing Level 3 PACE data, and identifying an area of interest.
Subsetting, and enabling the Constraint Expression to make its way to the OPeNDAP server.
Storing the subset locally.
Author: Miguel Jimenez-Urias, ‘25
from pydap.client import get_cmr_urls, consolidate_metadata, open_url
from pydap.net import create_session
import xarray as xr
import earthaccess
import datetime as dt
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.colors as colors
Use a Session as a Metadata Database and Authentication#
Pydap can make use of a requests_cache.CachedSession object, which can be used for:
Authentication (via netrc or token).
Database management (sqlite3 as backend). Only necessary data is cached.
Streaming data.
Below we initialize the session, inheriting auth credentials from earthaccess, and point to a local database PACE.sqlite stored in the data/ folder.
auth = earthaccess.login(strategy="interactive", persist=True)
session = create_session(use_cache=True, session=earthaccess.get_requests_https_session(), cache_kwargs={'cache_name': 'data/PACE'})
session.cache.clear()
Query CMR for OPeNDAP URLs#
We query NASA’s CMR for all OPeNDAP URLs associated with the 4 km, daily, Level 3 gridded Chlorophyll A dataset, version 3.1. For this, you can use Earthdata Search to identify the collection concept ID. We focus on two months of data from this year.
PACE_ccid = "C3620140255-OB_CLOUD" # version 3.1
time_range=[dt.datetime(2025, 6, 1), dt.datetime(2025, 8, 1)]
urls = get_cmr_urls(ccid=PACE_ccid, limit=1000, time_range=time_range) # limit by default = 50
Note
This collection has two versions of the same data. We are interested in the 4 km resolution. One way to be certain that a set of OPeNDAP URLs belongs to the same collection is by inspecting the URL strings (unnecessary in most cases).
Further, we subset by variable name, retaining only the lat, lon, and chlor_a variables. These are the variables of interest, and discarding the rest reduces the amount of time it takes Xarray to process the metadata.
Note
Xarray has a .drop_vars method, but those variables are dropped AFTER the dataset is created. If a dataset has O(1000) variables, Xarray would process all of them first, and only then drop them. With OPeNDAP, you can instruct the server a priori which variables Xarray needs to process.
CEs = "?dap4.ce=/lat;/lon;/chlor_a"
_urls = [url for url in urls if '4km' in url and "DAY" in url]
pace_urls = [url.replace("https", "dap4") + CEs for url in _urls] # a dap4 schema implies a DAP4 protocol in pydap.
len(pace_urls)
32
pace_urls[:4]
['dap4://oceandata.sci.gsfc.nasa.gov/opendap/PACE_OCI/L3SMI/2025/0701/PACE_OCI.20250701.L3m.DAY.CHL.V3_1.chlor_a.4km.NRT.nc?dap4.ce=/lat;/lon;/chlor_a',
'dap4://oceandata.sci.gsfc.nasa.gov/opendap/PACE_OCI/L3SMI/2025/0702/PACE_OCI.20250702.L3m.DAY.CHL.V3_1.chlor_a.4km.NRT.nc?dap4.ce=/lat;/lon;/chlor_a',
'dap4://oceandata.sci.gsfc.nasa.gov/opendap/PACE_OCI/L3SMI/2025/0703/PACE_OCI.20250703.L3m.DAY.CHL.V3_1.chlor_a.4km.NRT.nc?dap4.ce=/lat;/lon;/chlor_a',
'dap4://oceandata.sci.gsfc.nasa.gov/opendap/PACE_OCI/L3SMI/2025/0704/PACE_OCI.20250704.L3m.DAY.CHL.V3_1.chlor_a.4km.NRT.nc?dap4.ce=/lat;/lon;/chlor_a']
Consolidate metadata aggregation#
Here, the dimensions and some metadata are eagerly downloaded. This step is not necessary, but it provides a major performance boost when using pydap as a backend for Xarray.
The metadata is stored as sqlite, and can persist across sessions.
Note
This collection only defines the time dimension (or coordinate) in the metadata. Because there is no variable in the remote file associated with the time coordinate, there is no need to define a concatenating dimension (concat_dim) when running the code below. There is almost always a concat_dim, but not in this special case.
%%time
consolidate_metadata(pace_urls, session=session)
datacube has dimensions ['lat[0:1:4319]', 'lon[0:1:8639]'] , and concat dim: `None`
CPU times: user 766 ms, sys: 142 ms, total: 908 ms
Wall time: 2.7 s
Create a single Dataset object#
We use Xarray with pydap as the backend. Pydap is one of the backends Xarray supports for OPeNDAP URLs, but it is not yet the default, so we must specify it explicitly.
%%time
ds = xr.open_mfdataset(
pace_urls,
engine='pydap',
session=session,
parallel=True,
concat_dim='time', # <------ a time dimension will be created
combine='nested',
)
CPU times: user 973 ms, sys: 384 ms, total: 1.36 s
Wall time: 1.79 s
ds
<xarray.Dataset> Size: 5GB Dimensions: (time: 32, lat: 4320, lon: 8640) Coordinates: * lat (lat) float32 17kB 89.98 89.94 89.9 89.85 ... -89.9 -89.94 -89.98 * lon (lon) float32 35kB -180.0 -179.9 -179.9 ... 179.9 179.9 180.0 Dimensions without coordinates: time Data variables: chlor_a (time, lat, lon) float32 5GB dask.array<chunksize=(1, 4320, 8640), meta=np.ndarray> Attributes: (12/64) product_name: PACE_OCI.20250701.L3m.DAY.CHL.V3_1.chl... instrument: OCI title: OCI Level-3 Standard Mapped Image project: Ocean Biology Processing Group (NASA/G... platform: PACE source: satellite observations from OCI-PACE ... ... identifier_product_doi: 10.5067/PACE/OCI/L3M/CHL/3.1 keywords: Earth Science > Oceans > Ocean Chemist... keywords_vocabulary: NASA Global Change Master Directory (G... data_bins: Attribute elided: Unsupported attribut... data_minimum: 0.0134539828 data_maximum: 15.9237003
The dataset above is a lazy, metadata-only representation of the entire data of interest. It is chunked, and only the dimensions lat and lon have been downloaded.
Note
A named dimension time was created, and all remote granules were concatenated along it.
print('Uncompressed dataset size [GBs]: ', ds.nbytes / 1e9)
Uncompressed dataset size [GBs]: 4.77762624
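As a sanity check, this number can be reproduced from the dimensions alone: one 4-byte float32 value per element of chlor_a, plus the 1D lat and lon coordinate arrays.

```python
import numpy as np

# Reproduce the uncompressed size from the dimensions reported above:
# chlor_a is (time, lat, lon) = (32, 4320, 8640) float32, plus 1D lat/lon.
nt, nlat, nlon = 32, 4320, 8640
itemsize = np.dtype("float32").itemsize  # 4 bytes
nbytes = (nt * nlat * nlon + nlat + nlon) * itemsize
print(nbytes / 1e9)  # 4.77762624
```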
Data-proximate subsetting with OPeNDAP, Pydap, and Xarray#
While it is possible to download the entire dataset, it is better practice to subset it first! There are many tools to subset data in a data-proximate way, and OPeNDAP is one of them. We need to identify the subset first and pass it to the server.
Identifying the subset enables users to construct Constraint Expressions, which instruct the OPeNDAP server which subset is of interest. Pydap and Xarray construct these slices interactively, hiding the abstraction from the user. Below we demonstrate how to pass the subset to the OPeNDAP server so that Xarray does less work.
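For reference, a Constraint Expression can also carry index subsets directly. Below is a minimal sketch of how a Python slice maps onto DAP4 hyperslab notation; the helper function is hypothetical (not part of pydap), but the `[start:stride:stop]` notation, with an inclusive stop index, is DAP4's.

```python
def hyperslab(name, *slices):
    # Render Python index slices in DAP4 hyperslab notation [start:stride:stop],
    # where (unlike Python's slice.stop) the stop index is inclusive.
    idx = "".join(f"[{s.start}:{s.step or 1}:{s.stop - 1}]" for s in slices)
    return f"/{name}{idx}"

# e.g. the first 32 time steps plus a lat/lon index box:
hyperslab("chlor_a", slice(0, 32), slice(840, 2376), slice(2016, 4560))
# -> '/chlor_a[0:1:31][840:1:2375][2016:1:4559]'
```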
Identify spatial subset#
In this case, we are interested in a spatial subset. The data is Level 3 (gridded), so latitude and longitude are uniform. Moreover, these are 1D and have already been downloaded into memory!
lat, lon = ds['lat'].values, ds['lon'].values
Say we define the area of interest#
# Min/max of lon values
minLon, maxLon = -96, 10
# Min/Max of lat values
minLat, maxLat = 6, 70
iLon = np.where((lon > minLon) & (lon < maxLon))[0]
iLat = np.where((lat > minLat) & (lat < maxLat))[0]
So the slices simply span from the first to the last elements of iLon and iLat:
lon_slice = slice(iLon[0], iLon[-1] + 1)  # +1 so the last matching index is included
lat_slice = slice(iLat[0], iLat[-1] + 1)
Warning
Not all lat/lon coordinate arrays are monotonic. Always make sure they are before slicing this way, even when the data is Level 3 or Level 4.
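The check itself is cheap, since the coordinates are 1D and already in memory. A minimal sketch on a synthetic stand-in for the grid (the PACE latitudes themselves are stored in descending order):

```python
import numpy as np

# Synthetic stand-in for the 4 km latitudes, stored north-to-south.
lat = np.linspace(89.98, -89.98, 4320)

# Strictly monotonic means every consecutive difference has the same sign.
dlat = np.diff(lat)
print(bool(np.all(dlat < 0) or np.all(dlat > 0)))  # True
```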
Download only the subset#
With Xarray, subsetting syntax is similar to pandas. For example
ds.isel(lon=lon_slice, lat=lat_slice)
produces a client-side subset. However:
Warning
When using OPeNDAP, slicing as above and then downloading will only subset by variable: in most cases, all of the data is downloaded, only to then be further subset by Xarray.
To ensure the slices are sent to the remote server, we must chunk the dataset when creating it. This will guarantee that the subset is mostly done on the server side. You MUST choose the chunk size to match the size of the slice, as shown below.
%%time
ds = xr.open_mfdataset(
pace_urls,
engine='pydap',
session=session,
parallel=True,
concat_dim='time', # <------ a time dimension will be created
combine='nested',
chunks = {'lon': len(iLon), 'lat':len(iLat)} # <----------- This instructs the OPeNDAP server to subset in space
)
ds
CPU times: user 271 ms, sys: 99.9 ms, total: 371 ms
Wall time: 278 ms
<xarray.Dataset> Size: 5GB Dimensions: (time: 32, lat: 4320, lon: 8640) Coordinates: * lat (lat) float32 17kB 89.98 89.94 89.9 89.85 ... -89.9 -89.94 -89.98 * lon (lon) float32 35kB -180.0 -179.9 -179.9 ... 179.9 179.9 180.0 Dimensions without coordinates: time Data variables: chlor_a (time, lat, lon) float32 5GB dask.array<chunksize=(1, 1536, 2544), meta=np.ndarray> Attributes: (12/64) product_name: PACE_OCI.20250701.L3m.DAY.CHL.V3_1.chl... instrument: OCI title: OCI Level-3 Standard Mapped Image project: Ocean Biology Processing Group (NASA/G... platform: PACE source: satellite observations from OCI-PACE ... ... identifier_product_doi: 10.5067/PACE/OCI/L3M/CHL/3.1 keywords: Earth Science > Oceans > Ocean Chemist... keywords_vocabulary: NASA Global Change Master Directory (G... data_bins: Attribute elided: Unsupported attribut... data_minimum: 0.0134539828 data_maximum: 15.9237003
ds["chlor_a"] ## inspect the chunk of the data
<xarray.DataArray 'chlor_a' (time: 32, lat: 4320, lon: 8640)> Size: 5GB dask.array<concatenate, shape=(32, 4320, 8640), dtype=float32, chunksize=(1, 1536, 2544), chunktype=numpy.ndarray> Coordinates: * lat (lat) float32 17kB 89.98 89.94 89.9 89.85 ... -89.9 -89.94 -89.98 * lon (lon) float32 35kB -180.0 -179.9 -179.9 ... 179.9 179.9 180.0 Dimensions without coordinates: time Attributes: long_name: Chlorophyll Concentration, OCI Algorithm units: mg m^-3 standard_name: mass_concentration_of_chlorophyll_in_sea_water valid_min: 0.00100000005 valid_max: 100.0 reference: Hu, C., Lee Z., and Franz, B.A. (2012). Chlorophyll-a alg... display_scale: log display_min: 0.00999999978 display_max: 20.0 Maps: ('/lat', '/lon')
We now ensure Xarray will treat this subset as an individual chunk (per variable):
ds = ds.isel(lon=slice(iLon[0], iLon[-1]+1), lat=slice(iLat[0], iLat[-1]+1)).chunk({'lon': len(iLon), 'lat':len(iLat)})
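A quick sanity check on the resulting shape: at 4 km resolution the grid has 24 cells per degree, so the bounds above (64° of latitude, 106° of longitude) should yield 1536 × 2544 points. The sketch below uses a hypothetical reconstruction of the grid cell centers, not the actual downloaded coordinates:

```python
import numpy as np

# Hypothetical reconstruction of the 4 km grid cell centers (1/24 degree spacing).
lat = 90 - (np.arange(4320) + 0.5) / 24    # descending, like the PACE grid
lon = -180 + (np.arange(8640) + 0.5) / 24

# Same bounds as above: lon in (-96, 10), lat in (6, 70).
iLon = np.where((lon > -96) & (lon < 10))[0]
iLat = np.where((lat > 6) & (lat < 70))[0]
print(len(iLat), len(iLon))  # 1536 2544
```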
Finally#
Store data locally, as NetCDF4. At this stage:
Data is downloaded (spatially subsetted via OPeNDAP).
Data is resampled (via Xarray).
%%time
ds.to_netcdf("data/pace_subset.nc4", mode='w')
CPU times: user 8.01 s, sys: 6.25 s, total: 14.3 s
Wall time: 51.2 s
We inspect our data#
mds = xr.open_dataset("data/pace_subset.nc4")
mds
<xarray.Dataset> Size: 500MB Dimensions: (time: 32, lat: 1536, lon: 2544) Coordinates: * lat (lat) float32 6kB 69.98 69.94 69.9 69.85 ... 6.104 6.062 6.021 * lon (lon) float32 10kB -95.98 -95.94 -95.9 -95.85 ... 9.896 9.938 9.979 Dimensions without coordinates: time Data variables: chlor_a (time, lat, lon) float32 500MB ... Attributes: (12/64) product_name: PACE_OCI.20250701.L3m.DAY.CHL.V3_1.chl... instrument: OCI title: OCI Level-3 Standard Mapped Image project: Ocean Biology Processing Group (NASA/G... platform: PACE source: satellite observations from OCI-PACE ... ... identifier_product_doi: 10.5067/PACE/OCI/L3M/CHL/3.1 keywords: Earth Science > Oceans > Ocean Chemist... keywords_vocabulary: NASA Global Change Master Directory (G... data_bins: Attribute elided: Unsupported attribut... data_minimum: 0.0134539828 data_maximum: 15.9237003
print("Size of downloaded subset: ", mds.nbytes/1e9)
Size of downloaded subset: 0.500187072