Commit c6c7335

shoyer authored and Xarray-Beam authors committed
Add a "missing features" documentation section on Dataset
PiperOrigin-RevId: 817803789
1 parent 1d9beec commit c6c7335

1 file changed

Lines changed: 45 additions & 0 deletions

docs/high-level.ipynb

@@ -346,6 +346,51 @@
 ],
 "outputs": [],
 "execution_count": 13
+},
+{
+"metadata": {
+"id": "EQMXRq18_G-Y"
+},
+"cell_type": "markdown",
+"source": [
+"## Missing features\n",
+"\n",
+"`xbeam.Dataset` is not yet complete, and we would welcome contributions! Here are a few features that would be particularly welcome. If you're interested in any of these, the easiest way to get in touch is to [raise an issue](https://github.com/google/xarray-beam/issues) on GitHub.\n",
+"\n",
+"### Operations that combine multiple datasets\n",
+"\n",
+"Support for operations that merge different datasets would be quite welcome, e.g., to evaluate model outputs against ground truth. Currently, tools like [WeatherBenchX](https://github.com/google-research/weatherbenchX/) achieve this by writing custom Beam pipelines.\n",
+"\n",
+"There are two ways these might be implemented for `Dataset`:\n",
+"\n",
+"1. By supporting multiple `Dataset` arguments in a `xbeam.map_blocks()` function ([tracking issue](https://github.com/google/xarray-beam/issues/173)).\n",
+"2. By supporting `xarray.DataTree` objects ([tracking issue](https://github.com/google/xarray-beam/issues/124)).\n",
+"\n",
+"In the long term, `DataTree` support for simultaneously loading data is the better option, because merging separate Beam PTransforms (as would be required for `map_blocks`) requires an expensive shuffle step via `beam.CoGroupByKey`. This will require the upstream Xarray project to support a bit more functionality with `DataTree`, most notably `concat` and `combine_nested`.\n",
+"\n",
+"### Aggregations other than `mean`\n",
+"\n",
+"Currently, Xarray-Beam only has an efficient aggregation implementation for {py:meth}`~xarray_beam.Dataset.mean`, but it should be relatively straightforward to extend this to many other common Xarray aggregations, e.g., `sum`, `min`, `max`, `all`, `any`, `std`, and `var`.\n",
+"\n",
+"### IO connectors\n",
+"\n",
+"Tools for reading and writing Xarray-Beam datasets from and to other distributed storage systems, such as Google Earth Engine (see [XEE](https://github.com/google/Xee)) and [Icechunk](https://icechunk.io/), would be very welcome.\n",
+"\n",
+"### Other `Dataset` operations\n",
+"\n",
+"`xbeam.Dataset` has an intentionally small API surface, so features that can be implemented via a trivial call to `map_blocks()` are probably not a good fit for Xarray-Beam itself.\n",
+"\n",
+"That said, there are plenty of other Xarray methods that _do_ require updates to underlying `chunks` and `xbeam.Key` objects beyond what `map_blocks` can handle (e.g., `rename`, `thin`, `assign_coords`), or for which more efficient distributed algorithms exist (e.g., for [groupby and resampling](https://github.com/xarray-contrib/flox)).\n",
+"\n",
+"We are also contemplating starting another open source project to collect generally useful utilities that are a little too weather/climate domain-specific to make sense in Xarray-Beam, e.g., for regridding."
+]
+},
+{
+"metadata": {
+"id": "8pdnX_kJ_gYV"
+},
+"cell_type": "markdown",
+"source": []
 }
 ],
 "metadata": {
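
A note on the "Aggregations other than `mean`" item in the added cell: reductions such as `sum`, `min` and `max` combine across chunks directly, while `std` and `var` need a small mergeable state per chunk. The sketch below illustrates that idea in plain Python, using the standard pairwise update for count, mean and sum of squared deviations. It is not Xarray-Beam code: `ChunkStats`, `summarize`, `merge` and `variance` are hypothetical names, and a real implementation would presumably wrap similar logic in a Beam combiner operating on xarray chunks.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Sequence


@dataclass(frozen=True)
class ChunkStats:
    """Mergeable summary of one chunk (hypothetical, not an Xarray-Beam API)."""
    count: int
    mean: float
    m2: float  # sum of squared deviations from the mean


def summarize(values: Sequence[float]) -> ChunkStats:
    """Compute the summary for a single non-empty chunk."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values)
    return ChunkStats(n, mean, m2)


def merge(a: ChunkStats, b: ChunkStats) -> ChunkStats:
    """Combine two summaries without revisiting the underlying values."""
    n = a.count + b.count
    delta = b.mean - a.mean
    mean = a.mean + delta * b.count / n
    m2 = a.m2 + b.m2 + delta ** 2 * a.count * b.count / n
    return ChunkStats(n, mean, m2)


def variance(stats: ChunkStats, ddof: int = 0) -> float:
    """Variance from a merged summary; ddof=1 gives the sample variance."""
    return stats.m2 / (stats.count - ddof)


# Three chunks of unequal size stand in for the blocks of a distributed dataset.
chunks = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0, 7.0, 8.0, 9.0]]
total = reduce(merge, (summarize(c) for c in chunks))
print(total.count, total.mean, variance(total))
```

Because `merge` is associative, the per-chunk summaries can be combined in any grouping, which is the property a distributed combiner relies on.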
