PACE Chlorophyll A data over a region in the North Atlantic Ocean#

This notebook demonstrates access to, and subsetting of, PACE Ocean Color data. Broad information about the dataset can be found on the PACE website.

Requirements to run this notebook

  1. Have an Earthdata Login account.

  2. Have a Bearer Token.

  3. Knowledge of a collection concept ID (or DOI) of a dataset.

  4. Internet connection.

Objectives

Use pydap to demonstrate:

  • Discovering OPeNDAP granules from a collection of interest, further filtered by a time range.

  • Using tokens (from earthaccess) to authenticate.

  • Accessing Level 3 PACE data and identifying an area of interest.

  • Subsetting, and enabling the Constraint Expression to make its way to the OPeNDAP server.

  • Storing the subset locally.

Author: Miguel Jimenez-Urias, ‘25

from pydap.client import get_cmr_urls, consolidate_metadata, open_url
from pydap.net import create_session
import xarray as xr
import earthaccess
import datetime as dt
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.colors as colors

Use a Session as a Metadata Database and Authentication#

Pydap can make use of a requests_cache.CachedSession object, which provides:

  • Authentication (via netrc or token).

  • Database management (sqlite3 backend); only necessary data is cached.

  • Data streaming.

Below we initialize the session, inheriting auth credentials from earthaccess, and point it to a local database, PACE.sqlite, stored in the data/ folder.

auth = earthaccess.login(strategy="interactive", persist=True)
session = create_session(use_cache=True, session=earthaccess.get_requests_https_session(), cache_kwargs={'cache_name': 'data/PACE'})
session.cache.clear()

Query CMR for OPeNDAP URLs#

We query NASA’s CMR for all OPeNDAP URLs associated with the 4km, daily, Level 3 gridded Chlorophyll A collection, version 3.1. You can use Earthdata Search to identify the collection concept ID. We focus on two months of data from 2025.

PACE_ccid = "C3620140255-OB_CLOUD" # version 3.1
time_range=[dt.datetime(2025, 6, 1), dt.datetime(2025, 8, 1)]
urls = get_cmr_urls(ccid=PACE_ccid, limit=1000, time_range=time_range) # limit by default = 50

Note

This collection contains two versions of the same data; we are interested in the 4km resolution. One way to be certain that a set of OPeNDAP URLs belongs to the same collection is to inspect the URL strings (unnecessary in most cases).

Further, we subset by variable name, retaining only the lat, lon, and chlor_a variables. These are the variables of interest, and discarding the rest reduces the time it takes Xarray to process the metadata.

Note

Xarray has a .drop_vars method, but those variables are dropped AFTER the dataset is created. If a dataset has O(1000) variables, Xarray would process all of them first, and only then drop them. With OPeNDAP, you can instruct the server a priori which variables Xarray needs to process.
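As a sketch, a projection-only constraint expression like the one used in this notebook can be assembled programmatically from a list of variable names. The helper below is hypothetical (not part of pydap); it only does string formatting:

```python
# Hypothetical helper (not part of pydap): build a DAP4 projection-only
# constraint expression from a list of variable names.
def projection_ce(varnames):
    # Each variable is referenced by its full path (e.g. "/chlor_a");
    # projections are separated by ";".
    return "?dap4.ce=" + ";".join("/" + name for name in varnames)

print(projection_ce(["lat", "lon", "chlor_a"]))
# ?dap4.ce=/lat;/lon;/chlor_a
```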

CEs = "?dap4.ce=/lat;/lon;/chlor_a"
_urls = [url for url in urls if '4km' in url and "DAY" in url]
pace_urls = [url.replace("https", "dap4") + CEs for url in _urls] # a dap4 schema implies a DAP4 protocol in pydap.
len(pace_urls)
32
pace_urls[:4]
['dap4://oceandata.sci.gsfc.nasa.gov/opendap/PACE_OCI/L3SMI/2025/0701/PACE_OCI.20250701.L3m.DAY.CHL.V3_1.chlor_a.4km.NRT.nc?dap4.ce=/lat;/lon;/chlor_a',
 'dap4://oceandata.sci.gsfc.nasa.gov/opendap/PACE_OCI/L3SMI/2025/0702/PACE_OCI.20250702.L3m.DAY.CHL.V3_1.chlor_a.4km.NRT.nc?dap4.ce=/lat;/lon;/chlor_a',
 'dap4://oceandata.sci.gsfc.nasa.gov/opendap/PACE_OCI/L3SMI/2025/0703/PACE_OCI.20250703.L3m.DAY.CHL.V3_1.chlor_a.4km.NRT.nc?dap4.ce=/lat;/lon;/chlor_a',
 'dap4://oceandata.sci.gsfc.nasa.gov/opendap/PACE_OCI/L3SMI/2025/0704/PACE_OCI.20250704.L3m.DAY.CHL.V3_1.chlor_a.4km.NRT.nc?dap4.ce=/lat;/lon;/chlor_a']

Consolidate metadata aggregation#

Here, the dimensions and some metadata are eagerly downloaded. This step is not necessary, but it provides a major performance boost when using pydap as an Xarray backend.

The metadata is stored in the SQLite cache, and can persist across sessions.

Note

This collection defines the time dimension (or coordinate) only in the metadata. Because no variable in the remote file is associated with the time coordinate, there is no need to define a concatenating dimension (concat_dim) when running the code below. There is almost always a concat_dim, but not in this special case.

%%time
consolidate_metadata(pace_urls, session=session)
datacube has dimensions ['lat[0:1:4319]', 'lon[0:1:8639]'] , and concat dim: `None`
CPU times: user 766 ms, sys: 142 ms, total: 908 ms
Wall time: 2.7 s

Create a single Dataset object#

We use Xarray with pydap as the backend. Pydap is not yet the default engine when working with OPeNDAP URLs, so we must specify it explicitly.

%%time
ds = xr.open_mfdataset(
    pace_urls, 
    engine='pydap', 
    session=session, 
    parallel=True, 
    concat_dim='time',  # <------ a time dimension will be created 
    combine='nested',
)
CPU times: user 973 ms, sys: 384 ms, total: 1.36 s
Wall time: 1.79 s
ds
<xarray.Dataset> Size: 5GB
Dimensions:  (time: 32, lat: 4320, lon: 8640)
Coordinates:
  * lat      (lat) float32 17kB 89.98 89.94 89.9 89.85 ... -89.9 -89.94 -89.98
  * lon      (lon) float32 35kB -180.0 -179.9 -179.9 ... 179.9 179.9 180.0
Dimensions without coordinates: time
Data variables:
    chlor_a  (time, lat, lon) float32 5GB dask.array<chunksize=(1, 4320, 8640), meta=np.ndarray>
Attributes: (12/64)
    product_name:                      PACE_OCI.20250701.L3m.DAY.CHL.V3_1.chl...
    instrument:                        OCI
    title:                             OCI Level-3 Standard Mapped Image
    project:                           Ocean Biology Processing Group (NASA/G...
    platform:                          PACE
    source:                            satellite observations from OCI-PACE
    ...                                ...
    identifier_product_doi:            10.5067/PACE/OCI/L3M/CHL/3.1
    keywords:                          Earth Science > Oceans > Ocean Chemist...
    keywords_vocabulary:               NASA Global Change Master Directory (G...
    data_bins:                         Attribute elided: Unsupported attribut...
    data_minimum:                      0.0134539828
    data_maximum:                      15.9237003

The dataset above is a lazy, metadata-only representation of the entire data of interest. It is chunked, and only the dimensions lat and lon have been downloaded.

Note

A named dimension time was created, and all remote granules were concatenated along it.

print('Uncompressed dataset size [GBs]: ', ds.nbytes / 1e9)
Uncompressed dataset size [GBs]:  4.77762624
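The reported size can be reproduced by hand from the dataset’s shape and dtypes: 32 daily granules of 4320 x 8640 float32 values, plus the two float32 coordinate arrays.

```python
import numpy as np

itemsize = np.dtype("float32").itemsize  # 4 bytes
nbytes = 32 * 4320 * 8640 * itemsize     # chlor_a
nbytes += (4320 + 8640) * itemsize       # lat and lon coordinates
print("Uncompressed dataset size [GBs]: ", nbytes / 1e9)
# Uncompressed dataset size [GBs]:  4.77762624
```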

Data-proximate subsetting: How to with OPeNDAP, Pydap, and Xarray?#

While it is possible to download the entire dataset, it is better practice to subset it first! There are many tools to subset data in a data-proximate way, and OPeNDAP is one of them. We need to identify the subset first, and then pass it to the server.

Identifying the subset lets users construct Constraint Expressions, which instruct the OPeNDAP server which subset is of interest. Pydap and Xarray, however, are tools that construct slices interactively and can request subsets while hiding that abstraction. Below we demonstrate how to pass the subset to the OPeNDAP server so that Xarray does less work.
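For reference, an index-subsetting Constraint Expression uses the DAP4 hyperslab syntax [start:stride:stop], where the stop index is inclusive (compare the lat[0:1:4319] dimension printed by consolidate_metadata above: 4320 points). A minimal sketch of how such a CE could be assembled; the helper and the index values are hypothetical:

```python
# Hypothetical helper: format a DAP4 index-subsetting constraint expression.
# bounds is a list of (start, stop) index pairs, with stop INCLUSIVE,
# following the DAP4 hyperslab syntax [start:stride:stop].
def hyperslab_ce(var, bounds, stride=1):
    dims = "".join(f"[{start}:{stride}:{stop}]" for start, stop in bounds)
    return f"?dap4.ce=/{var}{dims}"

# Illustrative index bounds (not the notebook's actual values):
print(hyperslab_ce("chlor_a", [(480, 2015), (2016, 4559)]))
# ?dap4.ce=/chlor_a[480:1:2015][2016:1:4559]
```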

Identify spatial subset#

In this case, we are interested in a spatial subset. The data is Level 3 (gridded), so latitude and longitude are uniform. Moreover, they are 1D, and have already been downloaded into memory!

lat, lon = ds['lat'].values, ds['lon'].values

Say we define the area of interest#

# Min/max of lon values
minLon, maxLon = -96, 10

# Min/Max of lat values
minLat, maxLat = 6, 70
iLon = np.where((lon>minLon)&(lon < maxLon))[0]
iLat= np.where((lat>minLat)&(lat < maxLat))[0]

So the slices are simply built from the first and last elements of iLon and iLat (adding 1 to the stop, since Python slices exclude it):

lon_slice = slice(iLon[0], iLon[-1] + 1)
lat_slice = slice(iLat[0], iLat[-1] + 1)

Warning

Not all lat/lon coordinate arrays are monotonic. Always make sure that they are, even when the data is Level 3 or Level 4.
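A quick monotonicity check along these lines can be run before trusting the first/last-index approach. This is a sketch with synthetic stand-ins for the grid; in the notebook you would pass the downloaded lat and lon arrays:

```python
import numpy as np

def strictly_monotonic(a):
    # True if the array is strictly increasing or strictly decreasing.
    d = np.diff(a)
    return bool(np.all(d > 0)) or bool(np.all(d < 0))

# Synthetic stand-ins for the PACE grid (lat descends, lon ascends):
lat = np.linspace(89.979, -89.979, 4320)
lon = np.linspace(-179.979, 179.979, 8640)
print(strictly_monotonic(lat), strictly_monotonic(lon))
# True True
```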

Download only the subset#

With Xarray, subsetting syntax is similar to pandas. For example

ds.isel(lon=lon_slice, lat=lat_slice)

Produces a client-side subset. However

Warning

When using OPeNDAP, slicing as above and then downloading will only subset by variable: in most cases all of the data is downloaded, only to then be further subset by Xarray.

To ensure the slices are sent to the remote server, we must chunk the dataset when creating it. This guarantees that the subsetting is mostly done on the server side. You MUST choose the chunk sizes to match the sizes of the slices, as shown below.

%%time
ds = xr.open_mfdataset(
    pace_urls, 
    engine='pydap', 
    session=session, 
    parallel=True, 
    concat_dim='time',  # <------ a time dimension will be created 
    combine='nested',
    chunks = {'lon': len(iLon), 'lat':len(iLat)} #  <----------- This instructs the OPeNDAP server to subset in space
)
ds
CPU times: user 271 ms, sys: 99.9 ms, total: 371 ms
Wall time: 278 ms
<xarray.Dataset> Size: 5GB
Dimensions:  (time: 32, lat: 4320, lon: 8640)
Coordinates:
  * lat      (lat) float32 17kB 89.98 89.94 89.9 89.85 ... -89.9 -89.94 -89.98
  * lon      (lon) float32 35kB -180.0 -179.9 -179.9 ... 179.9 179.9 180.0
Dimensions without coordinates: time
Data variables:
    chlor_a  (time, lat, lon) float32 5GB dask.array<chunksize=(1, 1536, 2544), meta=np.ndarray>
Attributes: (12/64)
    product_name:                      PACE_OCI.20250701.L3m.DAY.CHL.V3_1.chl...
    instrument:                        OCI
    title:                             OCI Level-3 Standard Mapped Image
    project:                           Ocean Biology Processing Group (NASA/G...
    platform:                          PACE
    source:                            satellite observations from OCI-PACE
    ...                                ...
    identifier_product_doi:            10.5067/PACE/OCI/L3M/CHL/3.1
    keywords:                          Earth Science > Oceans > Ocean Chemist...
    keywords_vocabulary:               NASA Global Change Master Directory (G...
    data_bins:                         Attribute elided: Unsupported attribut...
    data_minimum:                      0.0134539828
    data_maximum:                      15.9237003
ds["chlor_a"] ## inspect the chunk of the data
<xarray.DataArray 'chlor_a' (time: 32, lat: 4320, lon: 8640)> Size: 5GB
dask.array<concatenate, shape=(32, 4320, 8640), dtype=float32, chunksize=(1, 1536, 2544), chunktype=numpy.ndarray>
Coordinates:
  * lat      (lat) float32 17kB 89.98 89.94 89.9 89.85 ... -89.9 -89.94 -89.98
  * lon      (lon) float32 35kB -180.0 -179.9 -179.9 ... 179.9 179.9 180.0
Dimensions without coordinates: time
Attributes:
    long_name:      Chlorophyll Concentration, OCI Algorithm
    units:          mg m^-3
    standard_name:  mass_concentration_of_chlorophyll_in_sea_water
    valid_min:      0.00100000005
    valid_max:      100.0
    reference:      Hu, C., Lee Z., and Franz, B.A. (2012). Chlorophyll-a alg...
    display_scale:  log
    display_min:    0.00999999978
    display_max:    20.0
    Maps:           ('/lat', '/lon')

We now ensure Xarray treats this subset as a single chunk (per variable):

ds = ds.isel(lon=slice(iLon[0], iLon[-1]+1), lat=slice(iLat[0], iLat[-1]+1)).chunk({'lon': len(iLon), 'lat':len(iLat)})

Finally#

Store the data locally as NetCDF4. At this stage:

  • Data is downloaded (spatially subsetted via OPeNDAP)

  • Data is further subset client-side and rechunked (via Xarray)

%%time
ds.to_netcdf("data/pace_subset.nc4", mode='w')
CPU times: user 8.01 s, sys: 6.25 s, total: 14.3 s
Wall time: 51.2 s

We inspect our data#

mds = xr.open_dataset("data/pace_subset.nc4")
mds
<xarray.Dataset> Size: 500MB
Dimensions:  (time: 32, lat: 1536, lon: 2544)
Coordinates:
  * lat      (lat) float32 6kB 69.98 69.94 69.9 69.85 ... 6.104 6.062 6.021
  * lon      (lon) float32 10kB -95.98 -95.94 -95.9 -95.85 ... 9.896 9.938 9.979
Dimensions without coordinates: time
Data variables:
    chlor_a  (time, lat, lon) float32 500MB ...
Attributes: (12/64)
    product_name:                      PACE_OCI.20250701.L3m.DAY.CHL.V3_1.chl...
    instrument:                        OCI
    title:                             OCI Level-3 Standard Mapped Image
    project:                           Ocean Biology Processing Group (NASA/G...
    platform:                          PACE
    source:                            satellite observations from OCI-PACE
    ...                                ...
    identifier_product_doi:            10.5067/PACE/OCI/L3M/CHL/3.1
    keywords:                          Earth Science > Oceans > Ocean Chemist...
    keywords_vocabulary:               NASA Global Change Master Directory (G...
    data_bins:                         Attribute elided: Unsupported attribut...
    data_minimum:                      0.0134539828
    data_maximum:                      15.9237003
print("Size of downloaded subset: ", mds.nbytes/1e9)
Size of downloaded subset:  0.500187072
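As a sanity check, this number matches the subset’s shape and dtypes: 32 time steps of a 1536 x 2544 float32 field, plus the coordinate arrays.

```python
import numpy as np

itemsize = np.dtype("float32").itemsize  # 4 bytes
nbytes = 32 * 1536 * 2544 * itemsize     # chlor_a subset
nbytes += (1536 + 2544) * itemsize       # lat and lon coordinates
print("Size of downloaded subset: ", nbytes / 1e9)
# Size of downloaded subset:  0.500187072
```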