workflow update for including embeddings as a parallel processing job with aggregation #599
… with aggregation. Also update the config to read the template config from a YAML file for embeddings.
Not up to standards ⛔

🔴 Issues

| Category | Results |
|---|---|
| CodeStyle | 1 minor |

🟢 Metrics: 12 complexity

| Metric | Results |
|---|---|
| Complexity | 12 |
Code Review
This pull request introduces support for embedding generation within the Spanner ingestion workflow, running embedding jobs in parallel with aggregation jobs. It adds configuration loading from a YAML specification file, along with corresponding unit tests and dependencies. The review feedback highlights three critical issues: the HTTP timeout in the Cloud Workflow exceeds the 1800-second limit, the gcsfs dependency is missing for reading GCS paths with pandas, and resolving the embedding spec path relative to the current working directory may fail in Cloud Run.
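Two of the points raised above can be illustrated with a minimal sketch. It is not the PR's code: the helper names `resolve_spec_path` and `clamp_http_timeout` and the file name `embedding_spec.yaml` are hypothetical, and the 1800-second figure is taken from the review summary. The first helper anchors the spec path to a source file's directory instead of the current working directory; the second caps a requested HTTP timeout at the limit.

```python
from pathlib import Path

# Limit cited in the review for HTTP calls from the Cloud Workflow, in seconds.
MAX_HTTP_TIMEOUT_SECONDS = 1800


def resolve_spec_path(spec_name: str, anchor_file: str) -> Path:
    """Resolve spec_name against the directory of anchor_file, not the CWD.

    Hypothetical helper: pass __file__ as anchor_file so the result does not
    depend on where the process was started.
    """
    return Path(anchor_file).resolve().parent / spec_name


def clamp_http_timeout(requested_seconds: int) -> int:
    """Cap a requested timeout at MAX_HTTP_TIMEOUT_SECONDS."""
    if requested_seconds <= 0:
        raise ValueError("timeout must be positive")
    return min(requested_seconds, MAX_HTTP_TIMEOUT_SECONDS)
```

A caller might use `resolve_spec_path("embedding_spec.yaml", __file__)` inside the module that ships alongside the spec. Work expected to run longer than the limit would still need a different pattern (for example, starting the job and polling for completion) rather than a larger timeout.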
The workflow was tested end to end in https://pantheon.corp.google.com/workflows/workflow/us-central1/spanner-ingestion-workflow/execution/bbbac16f-c86f-4f20-9e4f-9ea8e0bd1997/summary?e=13803378&mods=-monitoring_api_staging&project=datcom-ci
It correctly filtered the stat vars and added embeddings to the test DB: https://pantheon.corp.google.com/spanner/instances/datcom-spanner-test/databases/dc-test-db