
Defer heavy Google SDK imports to runtime in Dataproc and BigQuery operators#64938

Open
alamashir wants to merge 4 commits into apache:main from alamashir:fix/lazy-imports-google-operators

Conversation

@alamashir
Contributor

Summary

Fixes #62373

  • Moves heavy imports (google.cloud.dataproc_v1, google.cloud.bigquery, hooks, triggers) from module-level to method bodies in dataproc.py and bigquery.py operators
  • Uses _UNSET sentinel pattern for DEFAULT/DEFAULT_RETRY default parameter values to avoid importing google.api_core at class definition time
  • Moves _BigQueryHookWithFlexibleProjectId class definition inside get_db_hook() method to avoid importing BigQueryHook at module level
  • Updates test mock paths to point to actual source modules since symbols are no longer re-exported at the operator module level

This prevents 15-26s import delays on small workers (1 vCPU, 2 GiB RAM) that cause DagBag timeout errors (30s limit) during DAG parsing. Heavy imports now happen only at task execution time.

Test plan

  • pytest providers/google/tests/unit/google/cloud/operators/test_dataproc.py — 108 passed
  • pytest providers/google/tests/unit/google/cloud/operators/test_bigquery.py — 128 passed
  • ruff check passes on all changed files
  • Import smoke test: python -c "from airflow.providers.google.cloud.operators.dataproc import DataprocCreateBatchOperator" completes without loading google.cloud.dataproc_v1
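The import smoke test above can be generalized into a small helper that checks `sys.modules` after the import. This is an illustrative sketch, not part of the PR; `import_without` is a hypothetical name, and it should be run in a fresh interpreter for a meaningful result.

```python
import importlib
import sys


def import_without(module_name: str, forbidden_prefix: str) -> bool:
    """Import module_name, then report whether any module under
    forbidden_prefix is loaded in this interpreter.

    Only meaningful in a fresh interpreter, since it inspects the
    global sys.modules state.
    """
    importlib.import_module(module_name)
    return not any(
        m == forbidden_prefix or m.startswith(forbidden_prefix + ".")
        for m in sys.modules
    )


# The smoke test from the test plan would then be roughly:
# assert import_without(
#     "airflow.providers.google.cloud.operators.dataproc",
#     "google.cloud.dataproc_v1",
# )
```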

…erators (apache#62373)

Move heavy imports (google.cloud.dataproc_v1, google.cloud.bigquery,
hooks, triggers) from module-level to method bodies in dataproc.py and
bigquery.py operators. This prevents 15-26s import delays on small
workers that cause DagBag timeout errors during DAG parsing.

Key changes:
- Use _UNSET sentinel pattern for DEFAULT/DEFAULT_RETRY parameters
- Move _BigQueryHookWithFlexibleProjectId into get_db_hook() method
- Add lazy imports in all execute/helper methods
- Update test mock paths to match new import locations
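The second change above (moving `_BigQueryHookWithFlexibleProjectId` into `get_db_hook()`) follows a general pattern that can be sketched like this. The operator and base class here are stand-ins, not the provider code: the real method imports `BigQueryHook` from `airflow.providers.google.cloud.hooks.bigquery`, whereas this runnable sketch substitutes a stdlib class.

```python
class ExampleSqlOperator:
    """Stand-in (not the real operator) showing the lazy nested-class
    pattern: the wrapper subclass is defined inside the method, so the
    base class's module is only imported when a hook is actually built."""

    def __init__(self, conn_id: str = "google_cloud_default"):
        self.conn_id = conn_id

    def get_db_hook(self):
        # In the provider this import would be BigQueryHook; a stdlib
        # class stands in so the sketch runs without Airflow installed.
        from collections import UserDict as _BaseHook  # stand-in base

        class _HookWithFlexibleProjectId(_BaseHook):
            # Defined lazily inside the method, so nothing here is
            # evaluated at operator-module import time.
            project_id = "override-me"

        return _HookWithFlexibleProjectId()
```

The trade-off is that a new class object is created on each call; that is usually acceptable for a hook factory invoked once per task execution.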
@boring-cyborg boring-cyborg bot added the area:providers and provider:google labels Apr 9, 2026
@alamashir alamashir marked this pull request as ready for review April 9, 2026 03:03
@alamashir alamashir requested a review from shahar1 as a code owner April 9, 2026 03:03
@eladkal
Contributor

eladkal commented Apr 9, 2026

cc @VladaZakharova can you review?

@VladaZakharova
Contributor

Can you please share the results of all the system tests for Dataproc and BigQuery? You changed a lot of templated fields, so I want to verify that all the tests are still green.

Contributor

@VladaZakharova VladaZakharova left a comment


deleted

Contributor

@VladaZakharova VladaZakharova left a comment


Please provide the screenshots.

@alamashir
Contributor Author

@VladaZakharova do you just mean running test suite and providing screenshot of that?

@alamashir
Contributor Author

@VladaZakharova
Dataproc tests: [screenshot attached]
BigQuery tests: [screenshot attached]

@alamashir alamashir requested a review from VladaZakharova April 9, 2026 14:48
@kaxil kaxil requested a review from Copilot April 10, 2026 19:55
Contributor

Copilot AI left a comment


Pull request overview

This PR reduces DAG-parse import latency for the Google Dataproc and BigQuery operators by deferring heavy Google SDK and provider imports until execution-time code paths, addressing DagBag import timeouts on small workers (issue #62373).

Changes:

  • Moved google.* SDK imports and Google provider hook/trigger imports out of module scope in dataproc.py and bigquery.py, and introduced an _UNSET sentinel to avoid importing google.api_core defaults at definition time.
  • Refactored BigQuery’s flexible-project hook wrapper to be defined inside get_db_hook() so BigQueryHook isn’t imported at operator module import time.
  • Updated unit tests’ mock.patch() targets to patch the hook/SDK symbols at their new import locations.
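The mock-path updates follow from how `mock.patch()` resolves names: a symbol imported inside a function body is looked up in its defining module at call time, so that module is what must be patched. A runnable illustration, using made-up module names (`demo_hooks` is hypothetical, standing in for `airflow.providers.google.cloud.hooks.dataproc`):

```python
import sys
import types
from unittest import mock

# Fabricate a "hooks" module to stand in for the real provider module.
hooks_mod = types.ModuleType("demo_hooks")
hooks_mod.get_hook = lambda: "real hook"
sys.modules["demo_hooks"] = hooks_mod


def execute():
    # Lazy import, as in the refactored operators: the name is resolved
    # from demo_hooks at call time, not at module import time.
    from demo_hooks import get_hook
    return get_hook()


# Patching the *source* module works, because execute() re-imports
# the name on every call.
with mock.patch("demo_hooks.get_hook", return_value="mock hook"):
    patched = execute()

assert patched == "mock hook"
assert execute() == "real hook"  # patch is undone outside the context
```

Before the change, tests could patch the name re-exported by the operator module; after it, there is no module-level re-export to patch.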

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

Reviewed files:

  • providers/google/src/airflow/providers/google/cloud/operators/dataproc.py: Defers Dataproc SDK/hook/trigger imports and replaces default retry args with _UNSET + runtime DEFAULT resolution.
  • providers/google/src/airflow/providers/google/cloud/operators/bigquery.py: Defers BigQuery SDK/hook/trigger imports and nests the flexible-project hook wrapper inside get_db_hook().
  • providers/google/tests/unit/google/cloud/operators/test_dataproc.py: Updates patch paths to hook/SDK modules now that operators no longer re-export symbols at module scope.
  • providers/google/tests/unit/google/cloud/operators/test_bigquery.py: Updates patch paths (hooks/mixin method) to match the new import structure.
Comments suppressed due to low confidence (1)

providers/google/src/airflow/providers/google/cloud/operators/dataproc.py:1390

  • DataprocJobBaseOperator.__init__ imports and instantiates DataprocHook, which imports google.cloud.dataproc_v1 at module import time (providers/google/cloud/hooks/dataproc.py:32). Operator instances are created during DAG parsing, so this can still trigger heavy Google SDK imports before task execution and undermine the goal of avoiding DagBag import timeouts. Consider deferring hook import/creation until execute() (or via a lazy cached_property) and postponing the project_id fallback resolution (project_id or hook.project_id) to runtime.
        from airflow.providers.google.cloud.hooks.dataproc import DataprocHook

        self.job_error_states = job_error_states or {"ERROR"}
        self.impersonation_chain = impersonation_chain
        self.hook = DataprocHook(gcp_conn_id=gcp_conn_id, impersonation_chain=impersonation_chain)
        self.project_id = project_id or self.hook.project_id
        self.job_template: DataProcJobBuilder | None = None
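The reviewer's suggestion of a lazy `cached_property` could look roughly like the sketch below. This is hypothetical, not the provider code: the real property would import and construct `DataprocHook`; a `SimpleNamespace` stands in so the sketch runs without the Google SDK, and the `project_id` fallback is likewise deferred to first access.

```python
import types
from functools import cached_property


class ExampleJobOperator:
    """Hypothetical sketch of deferring hook creation out of __init__,
    so DAG parsing never constructs the hook or imports the SDK."""

    def __init__(self, gcp_conn_id="google_cloud_default", project_id=None):
        self.gcp_conn_id = gcp_conn_id
        self._project_id = project_id  # fallback resolution deferred too

    @cached_property
    def hook(self):
        # Real version (assumption about placement):
        #   from airflow.providers.google.cloud.hooks.dataproc import DataprocHook
        #   return DataprocHook(gcp_conn_id=self.gcp_conn_id)
        # Stand-in object so this sketch is runnable.
        return types.SimpleNamespace(project_id="default-project")

    @property
    def project_id(self):
        # Resolved at runtime; __init__ no longer touches the hook.
        return self._project_id or self.hook.project_id
```

With `cached_property`, the hook is built once on first access; an operator given an explicit `project_id` never constructs it at all.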

        def project_id(self, value: str) -> None:
            cached_creds, _ = self.get_credentials_and_project_id()
            self._cached_project_id = value or PROVIDE_PROJECT_ID
            self._cached_credntials = cached_creds

Copilot AI Apr 10, 2026


In _BigQueryHookWithFlexibleProjectId.project_id setter, this assigns to self._cached_credntials, but the base hook caching attribute is _cached_credentials (see airflow/providers/google/common/hooks/base_google.py). As written, the credentials cache is not updated/used and a new misspelled attribute is created instead. Update this to write to the correct _cached_credentials attribute (and consider keeping backwards-compat only if needed).

Suggested change:

-        self._cached_credntials = cached_creds
+        self._cached_credentials = cached_creds

@VladaZakharova
Contributor

@VladaZakharova do you just mean running test suite and providing screenshot of that?

No, I mean running in Breeze the system tests located in the tests/system/dataproc and tests/system/bigquery folders, and attaching screenshots showing those tests green.


Labels

area:providers, provider:google (Google (including GCP) related issues)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

DataprocCreateBatchOperator and BigQueryInsertJobOperator imports result in DagBag timeout error

4 participants