Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Status of dask support in pyresample? #206

Open
rabernat opened this issue Aug 13, 2019 · 7 comments
Open

Status of dask support in pyresample? #206

rabernat opened this issue Aug 13, 2019 · 7 comments

Comments

@rabernat
Copy link

The latest pyresample docs state:

Interfaces to XArray objects (including dask array support) are provided in separate Resampler class interfaces and are in active development.

However, there is no further mention of xarray / dask support in the rest of the docs. There seem to be a few dask issues (e.g. #148), but I could not ascertain the status of xarray / dask support based on browsing the docs and the repo.

Could you clarify where things stand? My use case is that I would like to use pyresample lazily on dask arrays, where the data is chunked contiguously in space but has many samples in time. Here's what I have tried, representing each timestep as a different channel:

import numpy as np
import dask.array as da
from pyresample import image, geometry

area_def = geometry.AreaDefinition('areaD', 'Europe (3km, HRV, VTC)', 'areaD',
                               {'a': '6378144.0', 'b': '6356759.0',
                                'lat_0': '50.00', 'lat_ts': '50.00',
                                'lon_0': '8.00', 'proj': 'stere'},
                               800, 800,
                               [-1370912.72, -909968.64,
                                1029087.28, 1490031.36])
msg_area = geometry.AreaDefinition('msg_full', 'Full globe MSG image 0 degrees',
                               'msg_full',
                               {'a': '6378169.0', 'b': '6356584.0',
                                'h': '35785831.0', 'lon_0': '0',
                                'proj': 'geos'},
                               3712, 3712,
                               [-5568742.4, -5568742.4,
                                5568742.4, 5568742.4])

# Here I have 10 "timesteps" (pyresample calls these "channels")
# The channels are the final axis (axis=2) of the array
# works if I use numpy
# data = np.random.rand(3712, 3712, 10)
data = da.random.random((3712, 3712, 10), chunks=(3712, 3712, 1))

msg_con_nn = image.ImageContainerNearest(data, msg_area, radius_of_influence=50000)
msg_con_nn.resample(area_def)

I would expect this to lazily return a dask array with the same chunk structure as the input array, with resampling performed on demand as the chunks are loaded. Instead I hit an error when creating the ImageContainerNearest object:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-9-d3a77de18e5c> in <module>
      1 #data = np.random.rand(3712, 3712, 10)
      2 data = da.random.random((3712, 3712, 10), chunks=(3712, 3712, 1))
----> 3 msg_con_nn = image.ImageContainerNearest(data, msg_area, radius_of_influence=50000)

/srv/conda/envs/notebook/lib/python3.7/site-packages/pyresample/image.py in __init__(self, image_data, geo_def, radius_of_influence, epsilon, fill_value, reduce_data, nprocs, segments)
    255         super(ImageContainerNearest, self).__init__(image_data, geo_def,
    256                                                     fill_value=fill_value,
--> 257                                                     nprocs=nprocs)
    258         self.radius_of_influence = radius_of_influence
    259         self.epsilon = epsilon

/srv/conda/envs/notebook/lib/python3.7/site-packages/pyresample/image.py in __init__(self, image_data, geo_def, fill_value, nprocs)
     59             geo_def = geo_def.freeze()
     60         if not isinstance(image_data, (np.ndarray, np.ma.core.MaskedArray)):
---> 61             raise TypeError('image_data must be either an ndarray'
     62                             ' or a masked array')
     63         elif ((image_data.ndim > geo_def.ndim + 1) or

TypeError: image_data must be either an ndarray or a masked array
@djhoese
Copy link
Member

djhoese commented Aug 13, 2019

Sadly, the old interfaces do not support dask. We've been playing around with support for Xarray DataArray objects backed by dask when using Satpy and have had a lot of success, but not everything we want lives in pyresample at the moment. That said I think you could do what you want if you are willing to wrap things in a DataArray first and are willing to deal with some less than perfect interfaces.

import numpy as np
import dask.array as da
import xarray as xr
from pyresample import geometry
from pyresample.kd_tree import XArrayResamplerNN

area_def = geometry.AreaDefinition('areaD', 'Europe (3km, HRV, VTC)', 'areaD',
                               {'a': '6378144.0', 'b': '6356759.0',
                                'lat_0': '50.00', 'lat_ts': '50.00',
                                'lon_0': '8.00', 'proj': 'stere'},
                               800, 800,
                               [-1370912.72, -909968.64,
                                1029087.28, 1490031.36])
msg_area = geometry.AreaDefinition('msg_full', 'Full globe MSG image 0 degrees',
                               'msg_full',
                               {'a': '6378169.0', 'b': '6356584.0',
                                'h': '35785831.0', 'lon_0': '0',
                                'proj': 'geos'},
                               3712, 3712,
                               [-5568742.4, -5568742.4,
                                5568742.4, 5568742.4])

# Here I have 10 "timesteps" (pyresample calls these "channels")
# The channels are the final axis (axis=2) of the array
# works if I use numpy
# data = np.random.rand(3712, 3712, 10)
data = da.random.random((3712, 3712, 10), chunks=(3712, 3712, 1))
data = xr.DataArray(data, dims=('y', 'x', 'time'))

resampler = XArrayResamplerNN(msg_area, area_def, 50000)
resampler.get_neighbour_info()
result = resampler.get_sample_from_neighbour_info(data)

Note that nearest neighbor in pyresample uses the KDTree from pykdtree which is faster than scipy and uses OpenMP, but is not dask or multi-process friendly otherwise. This means that we have to create a single KDTree from the lon/lats of the source area definition (3712, 3712). We then query it in parallel. Also, since it is using OpenMP you may want to limit the number of OpenMP threads being used export OMP_NUM_THREADS=1 (or 2). See https://satpy.readthedocs.io/en/latest/faq.html for other tips that we normally recommend when using dask and resampling.

Edit: The point of the last paragraph was to say: I've never tested these dask interfaces on a cluster or multiprocess scheduler. Threaded scheduler will probably perform best.

@djhoese
Copy link
Member

djhoese commented Aug 13, 2019

And...because I have you here, keep an eye out for https://github.com/geoxarray/geoxarray which should simplify some of this in the future (taking a Dataset from anywhere and remap it with pyresample).

@mraspaud
Copy link
Member

mraspaud commented Oct 3, 2022

@rabernat I don't know if you question is still relevant, but I just want to name the recent work on the resample_blocks function: https://pyresample.readthedocs.io/en/latest/api/pyresample.html#pyresample.resampler.resample_blocks
It aims at being a counterpart to dask's map_blocks for transforming arrays between different projections.

@maxrjones
Copy link

@djhoese @mraspaud are there any examples of using the resample_blocks function with Xarray objects? I'm interested in trying it out as an alternative to using map_blocks with rasterio's reprojection (xref carbonplan/ndpyramid#10).

@djhoese
Copy link
Member

djhoese commented Mar 20, 2024

@maxrjones I'm just coming back from paternity leave and also don't have much experience with customizing calls to resample_blocks, but if I remember correctly @mraspaud designed it only for dask array inputs and not DataArray/Dataset inputs. @mraspaud will have to provide any more details if he has them.

@mraspaud
Copy link
Member

resample_blocks was written with dask arrays in mind, but it should be quite straightforward to add a small wrapper to support xarray also. The docstring shows how it was thought to be used, but feels free to ask more specific questions if you need help.

@maxrjones
Copy link

Sounds good, thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants