Fix Iceberg ARRAY columns with dot-separated names returning empty lists #1826
il9ue wants to merge 1 commit into
When querying an Iceberg table through the `iceberg(...)` table function or a DataLakeCatalog, a column whose name contains a `.` and whose type is `Array(T)` (e.g. `` `a.b` ARRAY<STRING> ``) returned empty arrays instead of the stored values. The same data read by Spark returned the expected values.

Fixes ClickHouse#90731.

The Parquet V3 reader path (`SchemaConverter` + `ColumnMapper` + `FormatFilterInfo`) is already correct after the dotted-name field-id work in 0a218cd, 4b733ba and f24c1a4. This change addresses two residual upstream defects that affect dotted-name `Array(T)` columns regardless of source:

* `ColumnsDescription::getAllRegisteredNames` explicitly filtered out any column whose name contained `.`, under the assumption such names were always flattened Nested subcolumns. A column whose stored name literally contains a dot (allowed by MergeTree with backticks, and produced by Iceberg / Spark) is a first-class registered name and must appear in `IHints` misspelling suggestions. The function is only consumed by `IHints`-style suggestion paths (and by `StorageSystemZooKeeper` for column-name iteration, where no dotted names exist), so relaxing it has no effect on parsing, planning, storage, or wire protocol.
* `NestedUtils::getSubcolumnsOfNested` treated every `Array(T)` column whose name contained `.` as a flattened element of a synthetic `Nested` structure named after the prefix. This caused the Arrow, ORC and pre-V3 Parquet readers to look for a struct field with the prefix name in the data file rather than the literal dotted column, returning an empty array. The fix uses a two-pass scan: a synthetic `Nested` entry is only emitted when at least two `Array(T)` columns share the same dotted prefix. A lone column such as `a.b: Array(T)` no longer appears in the synthetic-Nested map. Genuine flattened `Nested` with multiple fields is unaffected; the existing early-continue on `isNested()` also covers the one-field-Nested edge case.
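The two-pass scan described for `NestedUtils::getSubcolumnsOfNested` can be sketched roughly as follows. This is an illustration only, not the code from the patch: `Column`, `dottedPrefix` and `collectSyntheticNested` are hypothetical names, the real function works on ClickHouse's own column and type classes, and splitting at the first dot is an assumption made here for simplicity.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for a column description (not a ClickHouse type).
struct Column
{
    std::string name;
    bool is_array; // true when the column type is Array(T)
};

// Returns the part of the name before the first '.', or an empty string
// when the name has no dot or the dot is at either end of the name.
// Assumption: the prefix is delimited by the first dot.
static std::string dottedPrefix(const std::string & name)
{
    const auto pos = name.find('.');
    if (pos == std::string::npos || pos == 0 || pos + 1 == name.size())
        return {};
    return name.substr(0, pos);
}

// Maps each synthetic Nested prefix to the Array columns grouped under it.
std::map<std::string, std::vector<std::string>>
collectSyntheticNested(const std::vector<Column> & columns)
{
    // Pass 1: count how many Array columns share each dotted prefix.
    std::map<std::string, std::size_t> count_by_prefix;
    for (const auto & column : columns)
    {
        if (!column.is_array)
            continue;
        const auto prefix = dottedPrefix(column.name);
        if (!prefix.empty())
            ++count_by_prefix[prefix];
    }

    // Pass 2: emit a synthetic Nested entry only for prefixes shared by
    // at least two Array columns; a lone `a.b` stays a plain column.
    std::map<std::string, std::vector<std::string>> result;
    for (const auto & column : columns)
    {
        if (!column.is_array)
            continue;
        const auto prefix = dottedPrefix(column.name);
        if (!prefix.empty() && count_by_prefix.at(prefix) >= 2)
            result[prefix].push_back(column.name);
    }
    return result;
}
```

With the columns `a.b`, `n.x` and `n.y` (all `Array`), this sketch groups `n.x` and `n.y` under `n` and leaves `a.b` out of the map, which mirrors the behaviour the commit message describes for a lone dotted column next to a genuine flattened `Nested`.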
Tests:

* `tests/integration/test_storage_iceberg_with_spark/test_column_names_with_dots.py::test_dotted_array_column`: end-to-end repro of ClickHouse#90731 against s3, azure and local storage.
* `test_dotted_array_alongside_real_nested` in the same file: mixed-schema regression guard verifying a lone dotted `Array` column coexists with genuine flattened-Nested siblings.
* `tests/queries/0_stateless/04259_dotted_array_not_nested.sql`: isolates Bug B without Iceberg.
* `tests/queries/0_stateless/04260_dotted_column_in_hints.sh`: verifies Bug A by checking the misspelling hint output.

Changelog category (leave one):
- Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry:
Fix reading Iceberg tables whose `ARRAY` column names contain a dot (e.g. `` `a.b` ARRAY<STRING> ``), which previously returned empty arrays. Two upstream defects were responsible: `ColumnsDescription::getAllRegisteredNames` filtered out dotted names, and `NestedUtils::getSubcolumnsOfNested` misclassified lone dotted `Array(T)` columns as flattened `Nested` children.

(cherry picked from commit f8467af)
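Backport of upstream fix for ClickHouse#90731
Fixes the customer-reported symptom from ClickHouse/ClickHouse#90731 against the 26.3 release line.
Symptom
When querying an Iceberg table through the `iceberg(...)` table function or a DataLakeCatalog, a column whose name contains a `.` and whose type is `Array(T)` (e.g. `` `a.b` ARRAY<STRING> ``) returned empty arrays instead of the stored values.
-- Spark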
CREATE TABLE table7 (`a.b` ARRAY<STRING>);
INSERT INTO table7 VALUES (ARRAY('a','b','c'));
-- ClickHouse (before fix)
SELECT `a.b` FROM iceberg('...');
-- got: [ ]
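-- expected: ['a','b','c']
Root cause
The Parquet V3 reader path (`SchemaConverter` + `ColumnMapper` + `FormatFilterInfo`) is already correct after the dotted-name field-id work in 0a218cd, 4b733ba and f24c1a4.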
Fix
Tests
Risk
Low. Five-line removal in
Scope
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Fix reading Iceberg tables whose `ARRAY` column names contain a dot (e.g. `` `a.b` ARRAY<STRING> ``), which previously returned empty arrays. Two upstream defects were responsible: `ColumnsDescription::getAllRegisteredNames` filtered out dotted names, and `NestedUtils::getSubcolumnsOfNested` misclassified lone dotted `Array(T)` columns as flattened `Nested` children.
Documentation entry for user-facing changes