Skip to content

Cursor.description collapses TIMESTAMP_NTZ to 'timestamp' on the SELECT path #786

@rshura

Description

@rshura

Cursor.description reports TIMESTAMP_NTZ columns as 'timestamp' — the _NTZ distinction is lost on the SELECT path

Summary

For any SELECT-style query, cursor.description returns the same type_code ('timestamp') for columns/expressions of Spark type TIMESTAMP and Spark type TIMESTAMP_NTZ. The _NTZ distinction is erased before it reaches the DB-API caller, so downstream code that relies on cursor.description to decide whether a column is timezone-aware cannot tell the two apart.

This is the cursor-description analog of the SQLAlchemy reflection bug fixed in #295 / #296. That fix landed only on the DESCRIBE-driven reflection path; the SELECT path was not touched.

It is also structurally identical to the still-open #336 ("Cursor().description reports NULL (VOID) and INTERVAL fields as 'string'") — same root cause, different victim type.

Reproduction

from databricks import sql

with sql.connect(
    server_hostname=...,
    http_path=...,
    access_token=...,
) as conn:
    with conn.cursor() as cur:
        cur.execute(
            "SELECT "
            "  CAST('2024-10-07 12:00:00' AS TIMESTAMP)     AS tz_aware, "
            "  CAST('2024-10-07 12:00:00' AS TIMESTAMP_NTZ) AS tz_naive"
        )
        for col in cur.description:
            print(col[0], "->", col[1])

Observed output (against a Databricks SQL Warehouse, connector 4.2.6):

tz_aware -> timestamp
tz_naive -> timestamp

The same collapse happens for SELECT <ntz_column> FROM <table> where the underlying column is genuinely TIMESTAMP_NTZ (verified via DESCRIBE), and for explicit CAST(... AS TIMESTAMP_NTZ) over any expression.

The repo's own unit-test fixture pins this behavior:
tests/unit/test_util.py

("timestamp_column",     "timestamp", None, None, None, None, None),
("timestamp_ntz_column", "timestamp", None, None, None, None, None),

Expected

cursor.description[i][1] should be 'timestamp_ntz' for genuine TIMESTAMP_NTZ columns/expressions, and 'timestamp' for TIMESTAMP. This matches the string returned by DESCRIBE on the same column and the type names already used by the SQLAlchemy dialect's reflection map (where "timestamp_ntz" is a valid key — see databricks-sqlalchemy/src/databricks/sqlalchemy/_parse.py).

Root cause

In src/databricks/sql/backend/thrift_backend.py, _col_to_description derives the type_code from the Thrift TTypeId enum name:

# thrift_backend.py, _col_to_description
name = ttypes.TTypeId._VALUES_TO_NAMES[type_entry.primitiveEntry.type]
# Drop _TYPE suffix
cleaned_type = (name[:-5] if name.endswith("_TYPE") else name).lower()

The Thrift TTypeId enum has a single TIMESTAMP_TYPE; there is no TIMESTAMP_NTZ_TYPE. Both Spark TIMESTAMP and Spark TIMESTAMP_NTZ columns arrive over the wire with primitiveEntry.type = TIMESTAMP_TYPE, so after the suffix-strip + lowercase both end up as 'timestamp'.

The same function already contains the mechanism needed to recover the lost distinction. Further down in _col_to_description:

# Extract variant type from field if available
if field is not None:
    try:
        if field.metadata and b"Spark:DataType:SqlName" in field.metadata:
            sql_type = field.metadata.get(b"Spark:DataType:SqlName")
            if sql_type == b"VARIANT":
                cleaned_type = "variant"
    except Exception as e:
        logger.debug(f"Could not extract variant type from field: {e}")

The Arrow field metadata key Spark:DataType:SqlName carries the true Spark SQL type name, including the TIMESTAMP_NTZ distinction. PR #560 used this hook to recover the VARIANT type that was being collapsed to 'string'. The same hook can recover TIMESTAMP_NTZ collapsed to 'timestamp'.

Proposed fix

Extend the existing Spark:DataType:SqlName override in _col_to_description to also handle TIMESTAMP_NTZ:

if field is not None:
    try:
        if field.metadata and b"Spark:DataType:SqlName" in field.metadata:
            sql_type = field.metadata.get(b"Spark:DataType:SqlName")
            if sql_type == b"VARIANT":
                cleaned_type = "variant"
            elif sql_type == b"TIMESTAMP_NTZ":
                cleaned_type = "timestamp_ntz"
    except Exception as e:
        logger.debug(f"Could not extract type from field metadata: {e}")

The same shape would cleanly extend to the cases in #336 (VOID, INTERVAL) if their Spark:DataType:SqlName metadata is populated by DBR — worth verifying as part of this fix.

The corresponding unit-test fixture in tests/unit/test_util.py would need to be updated to assert "timestamp_ntz" (rather than "timestamp") for the timestamp_ntz_column row.

Impact

Any consumer that uses cursor.description to classify columns as timezone-aware vs timezone-naive on Databricks must currently treat every timestamp column as tz-aware, or fall back to a separate DESCRIBE round trip to recover the _NTZ suffix. This is a real-world blocker for tools building on the connector — for example, any validation that requires "this time axis is tz-naive" cannot be expressed against a Databricks TIMESTAMP_NTZ column based on connector output alone.

Bumping the connector pin does not help: verified that 4.2.6 (latest at time of writing) still collapses the type. The TIMESTAMP_NTZ-related CHANGELOG entries since the initial fix concern parameter binding (TimestampNTZParameter) and SQLAlchemy reflection — neither touches the cursor.description path.

Related

Environment

  • databricks-sql-connector 4.2.6
  • Backend: Databricks SQL Warehouse
  • Behavior is independent of the Python or OS version.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions