Fix two pandas 3.0 incompatibilities (StringDtype groupby, Series positional indexing)#321

Open
benmsanderson wants to merge 2 commits into openscm:main from benmsanderson:fix/pandas3-compat

@benmsanderson commented May 23, 2026

Summary

pandas 3.0 introduced two changes that scmdata 0.18 trips on for any
multi-scenario ScmRun:

  1. Default StringDtype inference. String columns now come back as
    pd.StringDtype rather than object. RunGroupBy.__init__ calls
    numpy.issubdtype(col.dtype, numpy.number) to detect numeric meta
    columns; on StringDtype this raises:

    TypeError: Cannot interpret '<StringDtype(storage='python', na_value=nan)>' as a data type
    

    Route the check through pd.api.types.is_numeric_dtype instead,
    which returns False for StringDtype and True for numeric
    dtypes.

  2. Removal of Series positional integer indexing.
    _xarray._many_to_one ended with
    checker.groupby(col2).count().max()[0]. .max() on a DataFrame
    returns a Series indexed by column label, and pandas 3.0 no longer
    falls back to positional lookup for an integer key on a
    non-integer index, so [0] now raises KeyError: 0. Use .iloc[0]:
    same result, explicitly positional.
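
As a rough illustration of the first change, the sketch below builds a small meta table with an explicit StringDtype column, so it behaves the same on pandas 2.x and 3.0, and compares the two checks. The helper `numeric_columns` and the column names are hypothetical and are not taken from scmdata's RunGroupBy.

```python
import numpy as np
import pandas as pd


def numeric_columns(meta: pd.DataFrame) -> list[str]:
    """Return the names of columns that pandas classifies as numeric.

    Hypothetical helper for illustration; not scmdata's implementation.
    """
    return [
        col for col in meta.columns
        if pd.api.types.is_numeric_dtype(meta[col].dtype)
    ]


# Force StringDtype explicitly so the example does not depend on the
# pandas 3.0 default string inference.
meta = pd.DataFrame(
    {
        "scenario": pd.array(["ssp119", "ssp245"], dtype="string"),
        "run_id": [0, 1],
        "ecs": [2.5, 3.1],
    }
)

# The numpy-based check cannot interpret a pandas extension dtype.
try:
    np.issubdtype(meta["scenario"].dtype, np.number)
    legacy_check_raised = False
except TypeError:
    legacy_check_raised = True

numeric = numeric_columns(meta)
print(legacy_check_raised, numeric)
```

The pandas helper accepts extension dtypes directly, which is why it can classify the string column without raising.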

Both calls are exercised by every multi-scenario ScmRun. The second
in particular blocks ScmRun.to_nc entirely on pandas 3.0, so any
downstream that streams scenarios to disk currently cannot run.

Backwards compatibility

Both replacements have been pandas's canonical APIs since well before
pandas 2.0:

  • pandas.api.types.is_numeric_dtype: exposed under pandas.api.types since pandas 0.19
  • Series.iloc[0]: long-standing positional accessor

So the change is safe on pandas 2.x as well; no version pin needed.
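
A minimal sketch of the positional lookup, assuming a many-to-one check shaped like the expression quoted in the summary. The function name `max_group_count` and the sample columns are hypothetical. Because .iloc is positional regardless of the index labels, the same code should run unchanged on pandas 2.x and 3.0.

```python
import pandas as pd


def max_group_count(df: pd.DataFrame, col1: str, col2: str) -> int:
    """Largest number of distinct col1 values seen for any single col2 value.

    Hypothetical helper for illustration; not scmdata's implementation.
    """
    checker = df[[col1, col2]].drop_duplicates()
    # count() yields one column (col1) indexed by col2; max() reduces it to
    # a Series whose only label is the string col1, so [0] is not a valid
    # label and .iloc[0] is the explicit positional access.
    counts = checker.groupby(col2).count().max()
    return int(counts.iloc[0])


df = pd.DataFrame(
    {
        "variable": ["a", "b", "c"],
        "unit": ["K", "K", "W"],
    }
)

# Two variables share the unit "K", so unit -> variable is not one-to-one.
variables_per_unit = max_group_count(df, "variable", "unit")
# Every variable has exactly one unit.
units_per_variable = max_group_count(df, "unit", "variable")
print(variables_per_unit, units_per_variable)
```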

Test plan

Existing tests/unit/test_groupby.py and tests/unit/test_netcdf.py
both exercise the affected code paths and were failing on pandas 3.0
before this change. No new tests added — the existing suite is the
regression coverage.

Context

Found while deploying openscm/openscm-runner for AR7-cycle work.

…itional indexing)

pandas 3.0 introduced two changes that scmdata 0.18 trips on for any
multi-scenario ScmRun:

1. Default StringDtype inference. String columns now come back as
   pd.StringDtype rather than object. RunGroupBy.__init__ called
   numpy.issubdtype(col.dtype, numpy.number) to detect numeric meta
   columns; on StringDtype this raises
   'TypeError: Cannot interpret <StringDtype(...)> as a data type'.
   Route the check through pd.api.types.is_numeric_dtype instead,
   which returns False for StringDtype and True for numeric dtypes.

2. Removal of Series positional integer indexing.
   _xarray._many_to_one ended with checker.groupby(col2).count().max()[0].
   max() on a DataFrame returns a label-indexed Series and pandas 3.0
   removed positional integer indexing on those, so [0] raises
   'KeyError: 0'. Use .iloc[0]: same semantics, explicit positional.

Both calls are exercised by every multi-scenario ScmRun. The second
in particular blocks ScmRun.to_nc entirely on pandas 3.0, so any
downstream that streams scenarios to disk (e.g. openscm-runner's
NetCDFChunkWriter) currently cannot run.

The fixes are backward-compatible: pd.api.types.is_numeric_dtype and
Series.iloc[0] have been pandas's canonical APIs since well before
pandas 2.0.

benmsanderson added a commit to benmsanderson/openscm-runner that referenced this pull request May 23, 2026

Mirror of scripts/run_rcmip_fair2.py for the CICEROSCMPY2 adapter:
runs every SSP in the RCMIP fixture (ssp119, ssp126, ssp245, ssp370
and the two lowNTCF variants, ssp434, ssp460, ssp534-over, ssp585)
against N posterior members of a CICERO-SCM v2.x distribution and
prints a per-scenario 2100 GSAT / CO2 / ERF summary.

Defaults to splice mode (user emissions + bundled ssp245 historical),
which is the path the demo uses. Pass --cicero-bundle-dir to switch
to bundle mode (Marit RCMIP-aligned setup) where gaspam and conc
files are resolved per-scenario from inside the bundle directory.

Smoke-tested end-to-end against draw_samples_500.json with 20
members: ~44 s for 10 scenarios x 20 members on a single thread.
2100 GSAT medians are systematically warmer than the FaIRv2 numbers
on the same protocol (e.g. ssp245 3.77 K vs FaIR 2.63 K, ssp585
6.72 K vs FaIR 4.82 K). The CICEROSCM bundle's ECS distribution is
wider than FaIR's, and the 20-member subset is small relative to the
full 500-member posterior, so the offset is consistent with the
expected inter-model spread.

Results are kept in memory: the NetCDFChunkWriter path currently
trips a scmdata-pandas-3 incompatibility (fixed in PR #11 / upstream
openscm/scmdata#321), so writer support stays out of this script
until those land in main.