fix: Handle array of strings columns in Athena materialization #6324

Open
alan-gauthier-jt wants to merge 2 commits into feast-dev:master from alan-gauthier-jt:fix-empty-string-array

@alan-gauthier-jt alan-gauthier-jt commented Apr 24, 2026

What this PR does / why we need it

Fixes two related bugs that cause TypeError and ValueError when materializing
feature views with array-typed columns (e.g. Array(String), Array(Int64)) using
the Athena offline store.

Arrow/Athena deserializes array columns as numpy.ndarray (object dtype) instead of
plain Python lists. This affects three code paths in type_map.py, the first of which
is already guarded:

  1. _convert_scalar_values_to_proto: pd.isnull(ndarray) returns an array of bools,
    and not <array> raises ValueError: The truth value of an empty array is ambiguous.
    → Already guarded by _is_array_like in newer Feast versions; no change needed here.

  2. _convert_list_values_to_proto (generic list path): proto_type(val=ndarray) passes
    the raw numpy array to the protobuf constructor, which only accepts Python lists →
    TypeError: bad argument type for built-in operation. Additionally, Arrow nullable
    columns can yield None elements inside the ndarray, which protobuf repeated fields
    also reject.

  3. _validate_collection_item_types: None elements inside an ndarray failed the
    type(item) in valid_types check before reaching the sanitization step.
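
The failure modes listed above can be reproduced outside Feast with NumPy alone. This is a minimal sketch, not Feast code: `row` and `valid_types` are illustrative names. It uses a two-element array because the truthiness of an empty array may behave differently between NumPy versions.

```python
import numpy as np

# Object-dtype array standing in for what an Arrow/Athena array column yields.
row = np.array(["a", None], dtype=object)

# 1. The truth value of a multi-element array is ambiguous, so `not row` raises.
try:
    is_empty = not row
    truthiness_error = None
except ValueError as exc:
    truthiness_error = exc

# 2. An ndarray is not a list, so list-only handling does not recognise it.
is_list = isinstance(row, list)

# 3. None does not match any expected element type (illustrative tuple).
valid_types = (str,)
none_is_valid = type(None) in valid_types

# .tolist() produces a plain Python list and keeps None elements in place.
as_list = row.tolist()
```

After `.tolist()` the value is a plain list, but the `None` element is still present, which is why a separate replacement step is needed.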

Changes

feast/type_map.py

  • Add module-level _LIST_NONE_DEFAULTS dict mapping each list ValueType to a
    type-appropriate zero/empty default value used to replace None elements:

    • STRING_LIST, UUID_LIST, TIME_UUID_LIST, DECIMAL_LIST → ""
    • BYTES_LIST → b""
    • INT32_LIST, INT64_LIST → 0
    • FLOAT_LIST, DOUBLE_LIST → 0.0
    • BOOL_LIST → False
    • UNIX_TIMESTAMP_LIST → NULL_TIMESTAMP_INT_VALUE
  • Add module-level _sanitize_list_value(value, feast_value_type) helper that:

    • Calls .tolist() on any numpy.ndarray to produce a plain Python list
      (empty ndarray → None, treated as a missing row)
    • Replaces None elements with the type-appropriate default from _LIST_NONE_DEFAULTS
    • Is a no-op for plain Python lists without None and for scalar values
  • Apply sanitization upfront in _convert_list_values_to_proto: both values and
    sample are normalised via _sanitize_list_value before any type-checking or proto
    conversion, removing the need for per-path ndarray handling.

  • Remove the old _to_proto_safe_list / _DROP_NONE / _LIST_TYPE_NONE_REPLACEMENT
    module-level helpers, which have been superseded by the above.

  • Skip None elements in _validate_collection_item_types: None entries are
    valid in nullable Arrow columns and are sanitized upstream; raising a TypeError on
    them before that point was incorrect.
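
The sanitization step described above could look roughly like the following. This is a sketch under stated assumptions, not the code in this PR: `ListType`, `LIST_NONE_DEFAULTS`, and `sanitize_list_value` are hypothetical stand-ins for Feast's `ValueType`, `_LIST_NONE_DEFAULTS`, and `_sanitize_list_value`, and the timestamp entry is omitted because `NULL_TIMESTAMP_INT_VALUE` is defined inside Feast.

```python
from enum import Enum, auto
from typing import Any

import numpy as np


class ListType(Enum):
    """Hypothetical stand-in for the list members of feast's ValueType."""
    STRING_LIST = auto()
    BYTES_LIST = auto()
    INT64_LIST = auto()
    DOUBLE_LIST = auto()
    BOOL_LIST = auto()


# Type-appropriate default substituted for None elements (assumed mapping).
LIST_NONE_DEFAULTS = {
    ListType.STRING_LIST: "",
    ListType.BYTES_LIST: b"",
    ListType.INT64_LIST: 0,
    ListType.DOUBLE_LIST: 0.0,
    ListType.BOOL_LIST: False,
}


def sanitize_list_value(value: Any, list_type: ListType) -> Any:
    """Normalise one row value before type checks and proto conversion."""
    if isinstance(value, np.ndarray):
        if value.size == 0:
            return None  # empty ndarray is treated as a missing row
        value = value.tolist()
    if not isinstance(value, list):
        return value  # scalars pass through unchanged
    if not any(item is None for item in value):
        return value  # plain lists without None are returned as-is
    default = LIST_NONE_DEFAULTS[list_type]
    return [default if item is None else item for item in value]
```

For example, `sanitize_list_value(np.array(["a", None], dtype=object), ListType.STRING_LIST)` yields `["a", ""]`, so the list keeps its length.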

Testing

Added TestArrowArrayStringListMaterialization in
sdk/python/tests/unit/test_type_map.py covering:

| Test | Scenario |
| --- | --- |
| test_sanitize_list_value_ndarray | ndarray → plain list |
| test_sanitize_list_value_empty_ndarray | empty ndarray → None (missing row) |
| test_sanitize_list_value_ndarray_with_none | None elements in STRING_LIST replaced with "" |
| test_sanitize_list_value_plain_list | plain list passthrough |
| test_sanitize_list_value_plain_list_with_none | None in plain STRING_LIST list replaced with "" |
| test_sanitize_list_value_numeric_none_replaced | None in numeric/bool lists replaced with zero default |
| test_sanitize_list_value_bytes_none_replaced | None in BYTES_LIST replaced with b"" |
| test_sanitize_list_value_scalar_passthrough | non-list, non-ndarray values unchanged |
| test_string_list_from_ndarray | full round-trip via python_values_to_proto_values |
| test_string_list_from_empty_ndarray | empty ndarray no longer raises ValueError |
| test_string_list_from_ndarray_with_none_elements | None in ndarray no longer raises TypeError |
| test_string_list_null_row_produces_empty_proto | None rows produce empty ProtoValue |
| test_mixed_batch_simulating_athena_chunk | full simulation of a failing Athena materialization batch |

pytest sdk/python/tests/unit/test_type_map.py::TestArrowArrayStringListMaterialization -v

Which issues this PR fixes

Fixes #6325

Does this PR introduce a user-facing change?

Yes — materialization of array-typed feature columns from Athena no longer fails with
TypeError or ValueError when a batch contains empty arrays, None rows, or None
elements inside arrays. None elements inside an array are now stored as the
type-appropriate zero/empty value (e.g. "" for strings, 0 for integers).

Previously:
  TypeError: bad argument type for built-in operation
  ValueError: The truth value of an empty array is ambiguous

After this fix:
  Materialization completes successfully.
  None elements inside arrays are replaced with type-appropriate defaults.
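
The trade-off behind replacing rather than dropping None elements can be seen in a few lines of plain Python. This is an illustrative sketch only; the variable names are not taken from the PR.

```python
values = ["a", None, "c"]

# Dropping None shortens the list, so element positions no longer line up.
dropped = [v for v in values if v is not None]

# Replacing None keeps the length, at the cost that the default ("") can no
# longer be told apart from a genuine empty string in the stored value.
replaced = ["" if v is None else v for v in values]
```

Here `dropped` has two elements while `replaced` still has three, which matters when consumers rely on positional alignment within the array.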

devin-ai-integration[bot]

This comment was marked as resolved.

@alan-gauthier-jt alan-gauthier-jt changed the title from "fix: handle numpy.ndarray Array(String) columns in Athena materialization" to "fix: handle numpyndarray Array(String) columns in Athena materialization" on Apr 24, 2026
Signed-off-by: Alan Gauthier <alan.gauthier@jobteaser.com>
@alan-gauthier-jt alan-gauthier-jt changed the title from "fix: handle numpyndarray Array(String) columns in Athena materialization" to "fix: Handle array of strings columns in Athena materialization" on Apr 24, 2026
Comment thread sdk/python/feast/type_map.py Outdated
# Per-type default values substituted for None elements inside list columns.
# Only STRING_LIST uses ""; numeric/bytes types drop None entirely because
# there is no meaningful in-band sentinel (protobuf rejects wrong scalar types).
_LIST_TYPE_NONE_REPLACEMENT: Dict[ValueType, Any] = {
Member

@alan-gauthier-jt The approach used in https://github.com/feast-dev/feast/pull/6327/changes seems safer and preserves list length.

Contributor Author

I changed the implementation to a solution similar to #6327.

Comment thread sdk/python/feast/type_map.py Outdated
none_replacement = _LIST_TYPE_NONE_REPLACEMENT.get(feast_value_type, _DROP_NONE)
if none_replacement is _DROP_NONE:
    return [x for x in value if x is not None]
return [x if x is not None else none_replacement for x in value]
Member

if feast_value_type in _LIST_TYPE_NONE_REPLACEMENT instead?

Signed-off-by: Alan Gauthier <alan.gauthier@jobteaser.com>
@alan-gauthier-jt alan-gauthier-jt requested a review from ntkathole May 7, 2026 09:57

Development

Successfully merging this pull request may close these issues.

Bug: TypeError / ValueError when materializing Array(String) feature views with Athena offline store