Using the client ================ pydap can be used as a client to inspect and retrieve data from any of the `hundreds of scientific datasets `_ available on the internet on `OPeNDAP `_ servers. This way, it's possible to instrospect and manipulate a dataset as if it were stored locally, with data being downloaded on-the-fly as necessary. Accessing gridded data ---------------------- Let's start accessing gridded data, i.e., data that is stored as a regular multidimensional array. Here's a simple example where we access the `COADS `_ climatology from the official OPeNDAP server: .. doctest:: >>> from pydap.client import open_url >>> dataset = open_url('http://test.opendap.org/dap/data/nc/coads_climatology.nc') >>> type(dataset) Here we use the ``pydap.client.open_url`` function to open an URL specifying the location of the dataset; this URL should be stripped of the extensions commonly used for OPeNDAP datasets, like `.dds` or `.das`. When we access the remote dataset the function returns a ``DatasetType`` object, which is a *Structure* -- a fancy dictionary that stores other variables. We can check the names of the store variables like we would do with a Python dictionary: .. doctest:: >>> list(dataset.keys()) ['COADSX', 'COADSY', 'TIME', 'SST', 'AIRT', 'UWND', 'VWND'] Let's work with the ``SST`` variable; we can reference it using the usual dictionary syntax of ``dataset['SST']``, or using the "lazy" syntax ``dataset.SST``: .. doctest:: >>> sst = dataset['SST'] # or dataset.SST >>> type(sst) Note that the variable is of type ``GridType``, a multidimensional array with specific axes defining each of its dimensions: .. doctest:: >>> sst.dimensions ('TIME', 'COADSY', 'COADSX') >>> sst.maps OrderedDict([('TIME', f8'), (12,), (slice(None, None, None),))>), ('COADSY', f8'), (90,), (slice(None, None, None),))>), ('COADSX', f8'), (180,), (slice(None, None, None),))>)]) Each map is also, in turn, a variable that can be accessed using the same syntax used for Structures: .. doctest:: >>> sst.TIME f8'), (12,), (slice(None, None, None),))> The axes are all of type ``BaseType``. This is the OPeNDAP equivalent of a multidimensional array, with a specific shape and type. Even though no data have been downloaded up to this point, we can introspect these attributes from the axes or from the Grid itself: .. doctest:: >>> sst.shape (12, 90, 180) >>> sst.dtype dtype('>f4') >>> sst.TIME.shape (12,) >>> sst.TIME.dtype dtype('>f8') We can also introspect the variable attributes; they are stored in an attribute appropriately called ``attributes``, and they can also be accessed with a "lazy" syntax: .. doctest:: >>> import pprint >>> pprint.pprint(sst.attributes) {'_FillValue': -9.99999979e+33, 'history': 'From coads_climatology', 'long_name': 'SEA SURFACE TEMPERATURE', 'missing_value': -9.99999979e+33, 'units': 'Deg C'} >>> sst.units 'Deg C' Finally, we can also download some data. To download data we simply access it like we would access a `Numpy `_ array, and the data for the corresponding subset will be dowloaded on the fly from the server: .. doctest:: >>> sst.shape (12, 90, 180) >>> grid = sst[0,10:14,10:14] # this will download data from the server >>> grid The data itself can be accessed in the ``array`` attribute of the Grid, and also on the individual axes: .. doctest:: >>> grid.array[:] >>> print(grid.array[:].data) [[[ -1.26285708e+00 -9.99999979e+33 -9.99999979e+33 -9.99999979e+33] [ -7.69166648e-01 -7.79999971e-01 -6.75454497e-01 -5.95714271e-01] [ 1.28333330e-01 -5.00000156e-02 -6.36363626e-02 -1.41666666e-01] [ 6.38000011e-01 8.95384610e-01 7.21666634e-01 8.10000002e-01]]] >>> grid.COADSX[:] >>> print(grid.COADSX[:].data) [ 41. 43. 45. 47.] Alternatively, we could have dowloaded the data directly, skipping the axes: .. doctest:: >>> print(sst.array[0,10:14,10:14].data) [[[ -1.26285708e+00 -9.99999979e+33 -9.99999979e+33 -9.99999979e+33] [ -7.69166648e-01 -7.79999971e-01 -6.75454497e-01 -5.95714271e-01] [ 1.28333330e-01 -5.00000156e-02 -6.36363626e-02 -1.41666666e-01] [ 6.38000011e-01 8.95384610e-01 7.21666634e-01 8.10000002e-01]]] Older Servers ~~~~~~~~~~~~~ Some servers using a very old OPeNDAP application might run of of memory when attempting to retrieve both the data and the coordinate axes of a variable. The work around is to simply disable the retrieval of coordinate axes by using the ``output_grid`` option to open url: .. doctest:: >>> from pydap.client import open_url >>> dataset = open_url('http://test.opendap.org/dap/data/nc/coads_climatology.nc', output_grid=False) >>> grid = sst[0,10:14,10:14] # this will download data from the server >>> grid Accessing sequential data ------------------------- Now let's see an example of accessing sequential data. Sequential data consists of one or more records of related variables, such as a simultaneous measurements of temperature and wind velocity, for example. In this example we're going to access data from the `Argo project `_, consisting of profiles made by autonomous buoys drifting on the ocean: .. doctest:: python >>> from pydap.client import open_url >>> dataset = open_url('http://dapper.pmel.noaa.gov/dapper/argo/argo_all.cdp') This dataset is fairly complex, with several variables representing heterogeneous 4D data. The layout of the dataset follows the `Dapper in-situ conventions `_, consisting of two nested sequences: the outer sequence contains, in this case, a latitude, longitude and time variable, while the inner sequence contains measurements along a z axis. The first thing we'd like to do is limit our region; let's work with a small region in the Tropical Atlantic: .. doctest:: python >>> type(dataset.location) >>> dataset.location.keys() ['LATITUDE', 'JULD', 'LONGITUDE', '_id', 'profile', 'attributes', 'variable_attributes'] >>> my_location = dataset.location[ ... (dataset.location.LATITUDE > -2) & ... (dataset.location.LATITUDE < 2) & ... (dataset.location.LONGITUDE > 320) & ... (dataset.location.LONGITUDE < 330)] Note that the variable ``dataset.location`` is of type ``SequenceType`` -- also a Structure that holds other variables. Here we're limiting the sequence ``dataset.location`` to measurements between given latitude and longitude boundaries. Let's access the identification number of the first 10-or-so profiles: .. code-block:: python >>> for i, id_ in enumerate(my_location['_id'].iterdata()): ... print(id_) ... if i == 10: ... print('...') ... break 1125393 835304 839894 875344 110975 864748 832685 887712 962673 881368 1127922 ... >>> len(my_location['_id'].iterdata()) 623 Note that calculating the length of a sequence takes some time, since the client has to download all the data and do the calculation locally. This is why you should use ``len(my_location['_id'])`` instead of ``len(my_location)``. Both should give the same result (unless the dataset changes between requests), but the former retrieves only data for the ``_id`` variable, while the later retrives data for all variables. We can explicitly select just the first 5 profiles from our sequence: .. doctest:: python >>> my_location = my_location[:5] >>> len(my_location['_id'].iterdata()) 5 And we can print the temperature profiles at each location. We're going to use the `coards `_ module to convert the time to a Python ``datetime`` object: .. code-block:: python >>> from coards import from_udunits >>> for position in my_location.iterdata(): ... date = from_udunits(position.JULD.data, position.JULD.units.replace('GMT', '+0:00')) ... print(position.LATITUDE.data, position.LONGITUDE.data, date) ... print('=' * 40) ... i = 0 ... for pressure, temperature in zip(position.profile.PRES, position.profile.TEMP): ... print(pressure, temperature) ... if i == 10: ... print('...') ... break ... i += 1 -1.01 320.019 2009-05-03 11:42:34+00:00 ======================================== 5.0 28.59 10.0 28.788 15.0 28.867 20.0 28.916 25.0 28.94 30.0 28.846 35.0 28.566 40.0 28.345 45.0 28.05 50.0 27.595 55.0 27.061 ... -0.675 320.027 2006-12-25 13:24:11+00:00 ======================================== 5.0 27.675 10.0 27.638 15.0 27.63 20.0 27.616 25.0 27.617 30.0 27.615 35.0 27.612 40.0 27.612 45.0 27.605 50.0 27.577 55.0 27.536 ... -0.303 320.078 2007-01-12 11:30:31.001000+00:00 ======================================== 5.0 27.727 10.0 27.722 15.0 27.734 20.0 27.739 25.0 27.736 30.0 27.718 35.0 27.694 40.0 27.697 45.0 27.698 50.0 27.699 55.0 27.703 ... -1.229 320.095 2007-04-22 13:03:35.002000+00:00 ======================================== 5.0 28.634 10.0 28.71 15.0 28.746 20.0 28.758 25.0 28.755 30.0 28.747 35.0 28.741 40.0 28.737 45.0 28.739 50.0 28.748 55.0 28.806 ... -1.82 320.131 2003-04-09 13:20:03+00:00 ======================================== 5.1 28.618 9.1 28.621 19.4 28.637 29.7 28.662 39.6 28.641 49.6 28.615 59.7 27.6 69.5 26.956 79.5 26.133 89.7 23.937 99.2 22.029 ... These profiles could be easily plotted using `matplotlib `_: .. code-block:: python >>> for position in my_location.iterdata(): ... plot(position.profile.TEMP, position.profile.PRES) >>> show() You can also access the deep variables directly. When you iterate over these variables the client will download the data as nested lists: .. code-block:: python >>> for value in my_location.profile.PRES.iterdata(): ... print(value[:10]) [5.0, 10.0, 15.0, 20.0, 25.0, 30.0, 35.0, 40.0, 45.0, 50.0] [5.0, 10.0, 15.0, 20.0, 25.0, 30.0, 35.0, 40.0, 45.0, 50.0] [5.0, 10.0, 15.0, 20.0, 25.0, 30.0, 35.0, 40.0, 45.0, 50.0] [5.0, 10.0, 15.0, 20.0, 25.0, 30.0, 35.0, 40.0, 45.0, 50.0] [5.0999999, 9.1000004, 19.4, 29.700001, 39.599998, 49.599998, 59.700001, 69.5, 79.5, 89.699997] Authentication -------------- Basic & Digest ~~~~~~~~~~~~~~ To use Basic and Digest authentication, simply add your username and password to the dataset URL. Keep in mind that if the server only supports Basic authentication your credentials will be sent as plaintext, and could be sniffed on the network. .. code-block:: python >>> from pydap.client import open_url >>> dataset = open_url('http://username:password@server.example.com/path/to/dataset') CAS ~~~ The `Central Authentication Service `_ (CAS) is a single sign-on protocol for the web, usually involving a web browser and cookies. Nevertheless it's possible to use pydap with an OPeNDAP server behind a CAS. The function ``install_cas_client`` below replaces pydap's default HTTP function with a new version able to submit authentication data to an HTML form and store credentials in cookies. (In this particular case, the server uses Javascript to redirect the browser to a new location, so the client has to parse the location from the Javascript code; other CAS would require a tweaked function.) To use it, just attach a web browsing ``session`` with authentication cookies: .. code-block:: python >>> from pydap.client import open_url >>> from pydap.cas.get_cookies import setup_session >>> session = setup_session(authentication_url, username, password) >>> dataset = open_url('http://server.example.com/path/to/dataset', session=session) This method could work but each CAS is slightly different and might require a specifically designed ``setup_session`` instance. Two CAS are however explicitly supported by ``pydap``: URS NASA EARTHDATA ^^^^^^^^^^^^^^^^^^ Authentication is done through a ``username`` and a ``password``: .. code-block:: python >>> from pydap.client import open_url >>> from pydap.cas.urs import setup_session >>> dataset_url = 'http://server.example.com/path/to/dataset' >>> session = setup_session(username, password, check_url=dataset_url) >>> dataset = open_url(dataset_url, session=session) Earth System Grid Federation (ESGF) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Authentication is done through an ``openid`` and a ``password``: .. code-block:: python >>> from pydap.client import open_url >>> from pydap.cas.esgf import setup_session >>> dataset_url = 'http://server.example.com/path/to/dataset' >>> session = setup_session(openid, password, check_url=dataset_url) >>> dataset = open_url(dataset_url, session=session) If your ``openid`` contains contains the string ``ceda.ac.uk`` authentication requires an additional ``username`` argument: .. code-block:: python >>> from pydap.client import open_url >>> from pydap.cas.esgf import setup_session >>> session = setup_session(openid, password, check_url=dataset_url, username=username) >>> dataset = open_url(dataset_url, session=session) Advanced features ----------------- Calling server-side functions ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ When you open a remote dataset, the ``DatasetType`` object has a special attribute named ``functions`` that can be used to invoke any server-side functions. Here's an example of using the ``geogrid`` function from Hyrax: .. doctest:: >>> dataset = open_url('http://test.opendap.org/dap/data/nc/coads_climatology.nc') >>> new_dataset = dataset.functions.geogrid(dataset.SST, 10, 20, -10, 60) >>> new_dataset.SST.shape (12, 12, 21) >>> new_dataset.SST.COADSY[:] [-11. -9. -7. -5. -3. -1. 1. 3. 5. 7. 9. 11.] >>> new_dataset.SST.COADSX[:] [ 21. 23. 25. 27. 29. 31. 33. 35. 37. 39. 41. 43. 45. 47. 49. 51. 53. 55. 57. 59. 61.] Unfortunately, there's currently no standard mechanism to discover which functions the server support. The ``function`` attribute will accept any function name the user specifies, and will try to pass the call to the remote server. Opening a specific URL ~~~~~~~~~~~~~~~~~~~~~~ You can pass any URL to the ``open_url`` function, together with any valid constraint expression. Here's an example of restricting values for the months of January, April, July and October: .. doctest:: >>> dataset = open_url('http://test.opendap.org/dap/data/nc/coads_climatology.nc?SST[0:3:11][0:1:89][0:1:179]') >>> dataset.SST.shape (4, 90, 180) This can be extremely useful for server side-processing; for example, we can create and access a new variable ``A`` in this dataset, equal to twice ``SSH``: .. doctest:: >>> dataset = open_url('http://hycom.coaps.fsu.edu:8080/thredds/dodsC/las/dynamic/data_A5CDC5CAF9D810618C39646350F727FF.jnl_expr_%7B%7D%7Blet%20A=SSH*2%7D?A') >>> dataset.keys() ['A'] In this case, we're using the Ferret syntax ``let A=SSH*2`` to define the new variable, since the data is stored in an `F-TDS server `_. Server-side processing is useful when you want to reduce the data before downloading it, to calculate a global average, for example. Accessing raw data ~~~~~~~~~~~~~~~~~~ The client module has a special function called ``open_dods``, used to access raw data from a DODS response: .. doctest:: >>> from pydap.client import open_dods >>> dataset = open_dods( ... 'http://test.opendap.org/dap/data/nc/coads_climatology.nc.dods?SST[0:3:11][0:1:89][0:1:179]') This function allows you to access raw data from any URL, including appending expressions to `F-TDS `_ and `GDS `_ servers or calling server-side functions directly. By default this method downloads the data directly, and skips metadata from the DAS response; if you want to investigate and introspect datasets you should set the ``get_metadata`` parameter to true: .. doctest:: >>> dataset = open_dods( ... 'http://test.opendap.org/dap/data/nc/coads_climatology.nc.dods?SST[0:3:11][0:1:89][0:1:179]', ... get_metadata=True) >>> dataset.attributes['NC_GLOBAL']['history'] FERRET V4.30 (debug/no GUI) 15-Aug-96 Using a cache ~~~~~~~~~~~~~ You can specify a cache directory in the ``pydap.lib.CACHE`` global variable. If this value is different than ``None``, the client will try (if the server headers don't prohibit) to cache the result, so repeated requests will be read from disk instead of the network: .. code-block:: python >>> import pydap.lib >>> pydap.lib.CACHE = "/tmp/pydap-cache/" Timeout ~~~~~~~ To specify a timeout for the client, just set the desired number of seconds using the ``timeout`` option to ``open_url(...)`` or ``open_dods(...)``. For example, the following commands would timeout after 30 seconds without receiving a response from the server: .. code-block:: python >>> dataset = open_url('http://test.opendap.org/dap/data/nc/coads_climatology.nc', timeout=30) >>> dataset = open_dods('http://test.opendap.org/dap/data/nc/coads_climatology.nc.dods', timeout=30) Configuring a proxy ~~~~~~~~~~~~~~~~~~~ It's possible to configure pydap to access the network through a proxy server. Here's an example for an HTTP proxy running on ``localhost`` listening on port 8000: .. code-block:: python >>> import httplib2 >>> from pydap.util import socks >>> import pydap.lib >>> pydap.lib.PROXY = httplib2.ProxyInfo( ... socks.PROXY_TYPE_HTTP, 'localhost', 8000) This way, all further calls to ``pydap.client.open_url`` will be routed through the proxy server. You can also authenticate to the proxy: .. code-block:: python >>> pydap.lib.PROXY = httplib2.ProxyInfo( ... socks.PROXY_TYPE_HTTP, 'localhost', 8000, ... proxy_user=USERNAME, proxy_pass=PASSWORD) A user `has reported `_ that ``httplib2`` has problems authenticating against a NTLM proxy server. In this case, a simple solution is to change the ``pydap.http.request`` function to use ``urllib2`` instead of ``httplib2``, monkeypatching the code like in the `CAS authentication example above <#cas>`_: .. code-block:: python import urllib2 import logging def install_urllib2_client(): def new_request(url): log = logging.getLogger('pydap') log.INFO('Opening %s' % url) f = urllib2.urlopen(url.rstrip('?&')) headers = dict(f.info().items()) body = f.read() return headers, body from pydap.util import http http.request = new_request The function ``install_urllib2_client`` should then be called before doing any requests.