Add a "missing features" documentation section on Dataset

shoyer · Xarray-Beam authors · commit c6c73355b81c · 2025-10-10T15:10:31.000-07:00
PiperOrigin-RevId: 817803789
diff --git a/docs/high-level.ipynb b/docs/high-level.ipynb
@@ -346,6 +346,51 @@
       ],
       "outputs": [],
       "execution_count": 13
+    },
+    {
+      "metadata": {
+        "id": "EQMXRq18_G-Y"
+      },
+      "cell_type": "markdown",
+      "source": [
+        "## Missing features\n",
+        "\n",
+        "`xbeam.Dataset` is not yet complete, and we welcome contributions! Here are a few features that would be particularly valuable. If you're interested in any of these, the easiest way to get in touch is to [raise an issue](https://github.com/google/xarray-beam/issues) on GitHub.\n",
+        "\n",
+        "### Operations that combine multiple datasets\n",
+        "\n",
+        "Support for operations that merge together different datasets would be quite welcome, e.g., to evaluate model outputs against ground truth. Currently, tools like [WeatherBenchX](https://github.com/google-research/weatherbenchX/) achieve this by writing custom Beam pipelines.\n",
+        "\n",
+        "There are two ways these might be implemented for `Dataset`:\n",
+        "\n",
+        "1. By supporting multiple Dataset arguments in a `xbeam.map_blocks()` function ([tracking issue](https://github.com/google/xarray-beam/issues/173)).\n",
+        "2. By supporting `xarray.DataTree` objects ([tracking issue](https://github.com/google/xarray-beam/issues/124)).\n",
+        "\n",
+        "In the long term, `DataTree` support for simultaneously loading data is a better option, because merging together separate Beam ptransforms (as would be required for `map_blocks`) requires an expensive shuffle step via `beam.CoGroupByKey`. Supporting `DataTree` will require the upstream Xarray project to support a bit more functionality for `DataTree`, most notably `concat` and `combine_nested`.\n",
+        "\n",
+        "### Aggregations other than `mean`\n",
+        "\n",
+        "Currently, Xarray-Beam only provides an efficient aggregation implementation for {py:meth}`~xarray_beam.Dataset.mean`, but it should be relatively straightforward to extend this to many other common Xarray aggregations, e.g., `sum`, `min`, `max`, `all`, `any`, `std`, and `var`.\n",
+        "\n",
+        "### IO connectors\n",
+        "\n",
+        "Tools for reading and writing Xarray-Beam datasets to and from other distributed storage systems, such as Google Earth Engine (see [XEE](https://github.com/google/Xee)) and [Icechunk](https://icechunk.io/), would be very welcome.\n",
+        "\n",
+        "### Other `Dataset` operations\n",
+        "\n",
+        "`xbeam.Dataset` has an intentionally small API surface, so features that can be implemented via a trivial call to `map_blocks()` are probably not a good fit for Xarray-Beam itself.\n",
+        "\n",
+        "That said, there are plenty of other Xarray methods that _do_ require updates to underlying `chunks` and `xbeam.Key` objects beyond what `map_blocks` can handle (e.g., `rename`, `thin`, `assign_coords`), or for which more efficient distributed algorithms exist (e.g., for [groupby and resampling](https://github.com/xarray-contrib/flox)).\n",
+        "\n",
+        "We are also contemplating starting another open source project for collecting generally useful utilities that are a little too weather/climate domain-specific to make sense in Xarray-Beam, e.g., for regridding."
+      ]
+    },
+    {
+      "metadata": {
+        "id": "8pdnX_kJ_gYV"
+      },
+      "cell_type": "markdown",
+      "source": []
     }
   ],
   "metadata": {