Skip to content

UN-3479 [FEAT] SDK subpackage for org-to-org data migration#15

Merged
chandrasekharan-zipstack merged 33 commits into
mainfrom
feat/org-migration
May 27, 2026
Merged

UN-3479 [FEAT] SDK subpackage for org-to-org data migration#15
chandrasekharan-zipstack merged 33 commits into
mainfrom
feat/org-migration

Conversation

@chandrasekharan-zipstack
Copy link
Copy Markdown
Contributor

@chandrasekharan-zipstack chandrasekharan-zipstack commented May 23, 2026

Summary

New unstract.migration sub-package — moves an org's configured resources (adapters, connectors, tags, prompt-studio projects, workflows, tool instances, workflow endpoints, ETL/TASK pipelines, API deployments) into another org via two admin-issued Platform API keys. Works cloud↔cloud, on-prem↔on-prem, or within the same deployment.

Tracked in UN-3479. Backend companion: Zipstack/unstract#1987.

Architecture

  • OrgEndpoint — base URL + org slug + Platform API key + configurable api_path_prefix (api/v1 default).
  • PlatformClient — entity-scoped HTTP wrapper; Bearer auth; DRF-OPTIONS-driven writable-field set via get_post_schema(entity_path).
  • MigrationContext — bundles source/target clients, options, per-run RemapTable (source UUID → target UUID).
  • Phase base + per-entity subclasses; each phase: list → per-id GET → POST (or adopt-by-name).
  • walker.remap_uuids — JSON walker that rewrites embedded UUIDs using RemapTable.resolve_any before POST.
  • MigrationReport — rich table summary + Source → Target UUID Map for audit.
  • click CLI: unstract-migrate migrate --source-url ... --source-org ... --target-url ... --target-org ... with env-var fallback for keys, plus --include / --exclude for phase selection and --dry-run.

Phase order is owned by orchestrator.PHASES — phases never call each other:

adapter → connector → tag → custom_tool → workflow → tool_instance
       → workflow_endpoint → pipeline → api_deployment

Phases

Phase Notes
Adapter Skips is_friction_less=True. Encrypted adapter_metadata_b survives via per-id GET.
Connector Same list→GET→POST pattern. Adopt-by-name idempotent.
Tag Backend now exposes POST/PATCH/DELETE; SDK creates tags on target.
CustomTool Composite — tool + profile_managers + prompts + republish to registry. Resolves profile adapters by name (UUIDs differ across orgs).
Workflow tool_metadata UUID-walk during remap.
ToolInstance ≤1 per workflow; FK-rewrites via remap.
WorkflowEndpoint Rewrites connector + workflow refs.
Pipeline ETL + TASK only. DEFAULT (legacy v1) and APP (Streamlit) skipped. Warns when source has >1 active key.
APIDeployment Adopts by api_name. Same key-warning behaviour. Sequenced after workflow_endpoint (serializer requires configured source+destination endpoints).

Constraints (from design ADRs)

  • Secrets carried verbatim through the SDK (same posture as the FE; ADR 0006).
  • No local state file (ADR 0007); idempotency = name-based GET against target.
  • No UUID preservation; embedded refs rewritten via the walker.
  • No DB / schema migrations.

Carry-forward gotchas

  1. List endpoints often serve stripped payloads (e.g. AdapterListSerializer omits adapter_metadata_b) — every phase does list → per-id GET before POST.
  2. PATH_PREFIX is env-controlled on the backend; --api-prefix flag exposed.
  3. AuditSerializer auto-adds creator to shared_users M2M — relevant for any cleanup scripting.
  4. Service-account is a real User row (is_service_account=True) + OrganizationMember.
  5. Backend auto-provisions one active API key per Pipeline/APIDeployment on POST; key UUIDs are editable=False. SDK does not preserve source key values — operator must rotate. Warning is emitted when source has >1 active key.

Test plan

  • 56 unit tests across all phases (happy / dry-run / adopt / abort / extra-keys-warning paths) + remap table + walker + HTTP client + CLI.
  • Local end-to-end smoke against docker stack (org_Q4qgjLWIbaJlfStsorg_migration_target) — full pipeline creates on first run, adopts idempotently on re-run.
  • Cloud↔cloud smoke against staging before un-drafting.
  • On-prem↔on-prem smoke before un-drafting.

Out of scope (separate work)

  • shared_to_org / shared_users propagation (resource-ownership semantics).
  • Prompt-studio uploaded document blobs (file migration).
  • API-key value preservation (would need backend design change).
  • Pipeline types DEFAULT (legacy) and APP (Streamlit).

Related

  • JIRA: UN-3479
  • Backend PR: Zipstack/unstract#1987 (feat/org-migration-platform-api-gaps)
  • KB (internal): ~/Documents/Obsidian Vault/zipstuff/org-data-migration/

chandrasekharan-zipstack and others added 11 commits May 24, 2026 01:15
…apter)

Adds `unstract.migration` — a CLI + library that drives org-to-org data
migration between Unstract deployments via the existing Platform API
surface. v1 ships the adapter phase end-to-end as the reference impl;
remaining phases (connector, tag, custom_tool, workflow, tool_instance,
workflow_endpoint) follow in subsequent commits.

Design highlights:
- Two admin Platform API keys in, populated target out. No bundle, no
  Django mgmt command, no out-of-band secret step.
- Idempotency via in-memory remap (per run) + name-based GET against
  target (across runs). No state file — target IS the state.
- Fresh target UUIDs on every create; embedded UUID references remapped
  in-memory via `walker.remap_uuids` pre-POST.
- Secrets carried verbatim from source GET to target POST (same surface
  the FE already consumes when an admin opens an adapter card).

Package layout:
- `client.py`      thin Platform API wrapper (one per OrgEndpoint)
- `context.py`     OrgEndpoint, MigrationOptions, MigrationContext, RemapTable
- `walker.py`     `remap_uuids` JSON walker
- `report.py`      rich-rendered MigrationReport + plain-text fallback
- `phases/`        base.Phase ABC + adapter.AdapterPhase (reference impl)
- `orchestrator.py` top-level `migrate()` + phase order
- `cli.py`         click CLI: `unstract-migrate migrate ...`

CLI:
- Entry point `unstract-migrate` (also `python -m unstract.migration`)
- Platform keys via flags or env (UNSTRACT_SRC_PLATFORM_KEY /
  UNSTRACT_TGT_PLATFORM_KEY) to keep them out of shell history
- `--api-prefix` overrides PATH_PREFIX (default api/v1 to match OSS
  docker compose; cloud/on-prem set as needed)
- `--dry-run`, `--include`, `--exclude`, `--on-name-conflict adopt|abort`

Audit:
- Per-line logs include source UUID + target UUID for every adopted/created
  entity (`src=... -> tgt=...`)
- Final report renders a source -> target UUID map table for traceability

Deps gated behind optional `[migration]` extra — core SDK consumers
unaffected. Tests: 13 unit tests (RemapTable, remap_uuids, AdapterPhase
happy path / idempotency / dry-run / abort). Integration smoke verified
locally against docker compose: 9 adapters migrated source -> target,
re-run reports 0 created / 9 adopted.
Per the project's code-comments guidance: comments explain WHY in
generic terms, not 'see file X in repo Y'. Path/class references rot
when files move and don't help a future reader of this package who
may not have the backend repo open.

Behavior unchanged. 13 unit tests + integration smoke still green.
Two independent leaf phases following the AdapterPhase template:
list -> per-id GET -> POST/adopt by name, record source->target remap.

- ConnectorPhase: skips Unstract Cloud Storage rows (catalog id is
  redacted on the wire — target re-provisions per-org). OAuth-backed
  connectors land without refresh tokens; operator re-authorises on
  target.
- TagPhase: simplest entity — name + description, no encryption, no
  list-vs-detail divergence.

Orchestrator PHASES order extended; 22 unit tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drops the hardcoded *_POST_FIELDS tuples and the UCS catalog-ID constant.
Backend serializer is now the single source of truth for which fields the
SDK posts:

- PlatformClient.get_post_schema(entity_path) issues OPTIONS once per
  path, caches the writable-field set (DRF SimpleMetadata already strips
  read_only fields from actions.POST).
- Each phase fetches its schema in run(); builds the POST body by
  intersecting the source GET payload with the schema.
- ConnectorPhase: replace UCS_CONNECTOR_ID hardcode with an empty-metadata
  signal — backend redacts metadata to {} for auto-provisioned rows, so a
  falsy connector_metadata on the wire is unmigratable. Future redactions
  (any catalog) are covered automatically.

Test FakeClients gain a get_post_schema() that mirrors the writable
subset; UCS test renamed to test_redacted_metadata_connector_skipped.
22/22 unit tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add phases.base.build_post_payload(src, writable) used by every phase:

- Subtracts SERVER_MANAGED from the OPTIONS-derived writable set. DRF
  exposes id/organization/created_by/modified_by/shared_users/timestamps
  as writable on ModelSerializer, but the view's perform_create overrides
  them server-side — posting them is noise (and a source-org value for
  organization/created_by would mismatch the target).
- Skips empty strings as well as None. DRF treats '' on a required field
  as blank and 400s (hit on connector_version='').

Local smoke: 8/8 connectors + 2/2 tags migrated, idempotent re-run
adopts all 10.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Migrates prompt-studio projects with their ProfileManager + ToolStudioPrompt
children, then republishes PromptStudioRegistry via the backend's export-tool
action (avoids carrying tool_metadata across orgs). Walker-remaps adapter
UUIDs into profile FKs and across embedded JSON fields. Fresh-tool path
deletes the backend's auto-default profile before recreating from source.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Migrates Workflow rows with walker-remapped connector UUIDs in
source_settings + destination_settings JSON blobs. WorkflowEndpoints
are auto-created by the backend on workflow POST; reconciled later by
the dedicated WorkflowEndpoint phase.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the workflow execution loop on target. CustomToolPhase now records
a prompt_studio_registry remap (src_registry_id -> tgt_registry_id) by
looking up both sides via the newly-filterable registry list endpoint.

ToolInstancePhase walks the workflow remap, creates one bare instance per
target workflow (POST overrides metadata server-side), then PATCHes
metadata so source's adapter selections survive. Source metadata stores
adapters as names, which match across orgs since AdapterPhase preserves
them; the backend resolves names to local UUIDs on PATCH.

WorkflowEndpointPhase PATCHes the SOURCE/DESTINATION endpoints that the
backend auto-created on workflow POST, pairing by endpoint_type and
remapping connector FK + walker-rewriting embedded UUIDs in configuration.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drop the field-by-field reconcile loop (create_custom_tool +
delete_profile + create_profile + set_default_profile + create_prompt)
in favour of the backend's purpose-built endpoints:

- GET prompt-studio/project-transfer/{id} bundles tool_metadata,
  tool_settings, default_profile_settings, prompts in one shot.
- POST prompt-studio/project-transfer/ creates the tool, default
  profile (wired with target-org adapter ids the SDK supplies), and
  prompts server-side in one call.
- POST prompt-studio/{id}/sync-prompts/ rip-and-replaces prompts on
  an existing target tool for the adopt path.

Adapter ids for the import are resolved from the source's default
ProfileManager via the adapter remap table; missing remap fails the
tool cleanly instead of landing a half-wired profile.

Removes hardcoded PROFILE_WRITABLE / PROMPT_WRITABLE frozensets and
the OPTIONS schema fetch for prompt-studio — the project-transfer
endpoint owns the field shape server-side.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Live ProfileManagerSerializer renders adapter FKs as flat NAME strings
(e.g. "gpt4-o"), not UUIDs. The first project-transfer smoke flagged
"no adapter remap for gpt4-o" for every tool — the phase was looking
up a name in the UUID-keyed adapter remap.

Switch resolution to target-side lookup: read the adapter NAME from
the source default profile and ask target's list_adapters(name=...)
for the target UUID. AdapterPhase preserves names across orgs, so the
lookup hits whenever adapters migrated cleanly.

Smoke-test result against local stack: 5/6 source tools created on
fresh run, all 5 adopted on re-run (sync_prompts), 1 source tool
failed cleanly because it has no profile (test data, not a code bug).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Closes the last gap in the org-to-org data migration: ETL/TASK
pipelines and API deployments.

Both phases mirror the WorkflowPhase shape: get-or-create by name on
target, FK rewrites via the workflow remap table, dry-run + abort
support, idempotent re-run. Phase order extended to:

  ... workflow_endpoint -> pipeline -> api_deployment

`api_deployment` requires endpoints to be configured before the
serializer accepts it, so it must run after WorkflowEndpointPhase.

PipelinePhase scope: ETL + TASK only. DEFAULT is dead v1 code; APP is
a Streamlit-style deployment that doesn't fit the pipeline model.

API key handling: backend auto-provisions one active key per
pipeline/deployment on POST. Extra rotated source keys are NOT
mirrored — UUIDs are server-generated (not settable) and operators
should rotate post-migration anyway. Both phases log a WARNING when
the source had more than one active key so the operator notices.

Client surface added:
- list_pipelines, get_pipeline, create_pipeline, update_pipeline
- list_api_deployments, get_api_deployment, create_api_deployment,
  update_api_deployment
- list_pipeline_keys, list_api_deployment_keys, create_api_key

13 new unit tests; full suite green (43 -> 56).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@chandrasekharan-zipstack chandrasekharan-zipstack changed the title feat(migration): SDK subpackage for org-to-org data migration feat(migration): SDK subpackage for org-to-org data migration [UN-3479] May 24, 2026
@chandrasekharan-zipstack chandrasekharan-zipstack changed the title feat(migration): SDK subpackage for org-to-org data migration [UN-3479] UN-3479 [FEAT] SDK subpackage for org-to-org data migration May 24, 2026
@chandrasekharan-zipstack chandrasekharan-zipstack marked this pull request as ready for review May 24, 2026 07:00
@greptile-apps
Copy link
Copy Markdown

greptile-apps Bot commented May 24, 2026

Greptile Summary

This PR introduces unstract.clone, a new SDK sub-package for migrating all configured resources (adapters, connectors, tags, prompt-studio projects, workflows, tool instances, workflow endpoints, ETL/TASK pipelines, API deployments) from one Unstract org to another via two Platform API keys. It ships a click CLI, a thread-pooled phase orchestrator, a JSON UUID walker, and 56 unit tests.

  • All major bugs flagged in previous review rounds have been resolved: per-id GET is in place for every phase, the build_post_payload false/zero drop is fixed, the dry-run adopt path no longer fires export_custom_tool, resolve_any is now safely snapshotted, and unresolvable connector references skip rather than null-patch.
  • Two minor quality issues remain: _lookup_tool_name in FilesPhase calls list_custom_tools() once per tool in a sequential loop (N+1, same pattern already fixed in CustomToolPhase), and the --api-prefix CLI flag is applied to both source and target endpoints with no way to specify different prefixes for cross-version deployments.

Confidence Score: 5/5

Safe to merge; all blocking correctness bugs from prior rounds have been addressed and only minor quality improvements remain.

Every phase now follows the list → per-id GET → POST pattern, resolve_any is safely snapshotted, the dry-run adopt path no longer fires real writes, build_post_payload correctly preserves False and 0, and the unresolvable-connector PATCH-null was eliminated. The two remaining findings — an N+1 list_custom_tools() call in FilesPhase and a shared --api-prefix CLI flag — are quality/UX concerns with no correctness impact on the migration data.

src/unstract/clone/phases/files.py — _lookup_tool_name N+1 call; src/unstract/clone/cli.py — shared --api-prefix for source and target.

Important Files Changed

Filename Overview
src/unstract/clone/phases/base.py Core phase helpers: build_post_payload fix is correct (explicit is not None and != ""); misleading comment about not in (None, "") semantics.
src/unstract/clone/phases/custom_tool.py All previously flagged bugs fixed: list_custom_tools() hoisted, dry-run adopt path short-circuits before export_custom_tool, bare return replaced with return result.
src/unstract/clone/phases/workflow_endpoint.py Unresolvable-connector skip and null connection_type coercion both fixed; patch logic is correct.
src/unstract/clone/phases/pipeline.py Per-id GET before POST is in place; adopt/create/dry-run paths are all correct; extra-key warning logic is sound.
src/unstract/clone/phases/api_deployment.py Per-id GET before POST added; adopt-by-name idempotency correct; mirrors pipeline pattern faithfully.
src/unstract/clone/phases/workflow.py Per-id GET (get_workflow) now called before build/remap on the create path; cascade-skip logic via skipped_custom_tool_registry_ids is correct.
src/unstract/clone/context.py resolve_any snapshotting fix (list(self._table.values())) correctly prevents RuntimeError: dict changed size under concurrent record calls.
src/unstract/clone/client.py Clean entity-scoped HTTP wrapper; get_post_schema caching is correct (only called from sequential run() methods, not from parallel workers).
src/unstract/clone/phases/files.py N+1 list_custom_tools() calls in _lookup_tool_name (called once per tool in the Pass 1 loop); all lock usage in parallel pass is correct; retry/backoff logic is sound.
src/unstract/clone/cli.py Single --api-prefix applied to both source and target prevents cross-version migrations with different path prefixes; all other CLI wiring is correct.
src/unstract/clone/orchestrator.py Phase ordering is topologically correct; CloneError abort and duration stamping work correctly; clients closed in finally block.
src/unstract/clone/phases/tool_instance.py Broken-adapter sentinel detection, metadata PATCH, and remap recording are all correct; dry-run paths are consistent.
src/unstract/clone/walker.py UUID regex walker is correct and safe; unknown UUIDs pass through untouched.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    CLI["CLI / clone()"] --> A["AdapterPhase\nlist → GET → POST/adopt"]
    CLI --> B["ConnectorPhase\nlist → GET → POST/adopt"]
    CLI --> C["TagPhase\nlist → POST/adopt"]
    A --> D["CustomToolPhase\nexport → import/sync-prompts → registry"]
    B --> D
    D --> E["FilesPhase\nlist DM → download → upload"]
    D --> F["WorkflowPhase\nlist → GET → POST/adopt"]
    A --> F
    B --> F
    F --> G["ToolInstancePhase\ncreate → PATCH metadata"]
    D --> G
    G --> H["WorkflowEndpointPhase\nPATCH connector + config"]
    B --> H
    H --> I["PipelinePhase\nlist → GET → POST/adopt"]
    F --> I
    H --> J["APIDeploymentPhase\nlist → GET → POST/adopt"]
    F --> J
    subgraph remap["RemapTable (shared, thread-safe snapshot)"]
        R1["adapter uuid → uuid"]
        R2["connector uuid → uuid"]
        R3["custom_tool uuid → uuid"]
        R4["prompt_studio_registry uuid → uuid"]
        R5["workflow uuid → uuid"]
        R6["tool_instance uuid → uuid"]
    end
    A -.-> R1
    B -.-> R2
    D -.-> R3
    D -.-> R4
    F -.-> R5
    G -.-> R6
Loading

Fix All in Claude Code

Prompt To Fix All With AI
Fix the following 3 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 3
src/unstract/clone/phases/files.py:451-471
**N+1 `list_custom_tools()` calls in file-name lookup**

`_lookup_tool_name` is called once per entry in `tool_remap` (line 81), and every call issues a fresh `GET prompt-studio/` request to list all target tools. For N migrated tools this produces N round-trips that return the same payload. The exact same pattern was hoisted out of the loop in `CustomToolPhase` in a previous review round; the same fix applies here: fetch the list once before the loop and build a `{tool_id: tool_name}` dict that subsequent iterations query locally.

### Issue 2 of 3
src/unstract/clone/cli.py:183-194
**`--api-prefix` is shared between source and target**

Both `OrgEndpoint` objects receive the same `api_prefix` value. A migration between a deployment with `PATH_PREFIX=api/v1` and one with `PATH_PREFIX=api/v2` (or any cross-version setup) has no way to supply different prefixes per side. Consider adding `--source-api-prefix` / `--target-api-prefix` flags (with `--api-prefix` kept as a shorthand that sets both when they would be equal), or at minimum document that both deployments must share the same prefix.

### Issue 3 of 3
src/unstract/clone/phases/base.py:47-49
**Misleading comment about `not in (None, "")` semantics**

The comment states that `0 not in (None, "")` "falsely returns True," but in standard Python that expression returns `True` because `0` is neither equal to `None` nor to `""` — the result is correct, not false. The actual footgun in the old code was a truthiness check (`if v:`) that would silently drop `False` and `0`. The current explicit `is not None and != ""` guards are correct; consider rewriting the comment to say "replaced a truthiness guard that dropped `False`/`0`" so the rationale is unambiguous.

Reviews (19): Last reviewed commit: "Merge pull request #16 from Zipstack/ci/..." | Re-trigger Greptile

Comment thread src/unstract/migration/phases/base.py Outdated
Comment thread src/unstract/clone/phases/workflow_endpoint.py
Comment thread src/unstract/clone/phases/custom_tool.py Outdated
Comment thread src/unstract/migration/phases/custom_tool.py Outdated
chandrasekharan-zipstack and others added 3 commits May 24, 2026 13:18
Adds a `files` phase that moves Prompt Studio document files between orgs
using the existing Platform API endpoints — no new BE surface.

- Default mode (`--file-strategy=platform_api`): lists target DM rows
  once per tool for idempotency, downloads each missing source file via
  `fetch_contents_ide`, decodes per mime, and POSTs as multipart through
  `upload_for_ide`.
- Skip mode (`--skip-files`): metadata only; source filenames go into
  `MigrationReport.skipped_files` for operator-driven UI re-upload.
- Mixed-mode reporting: files above `--max-file-size` (default 25MB)
  land in `oversize_files`; mime types the BE endpoint can't round-trip
  losslessly (Excel placeholder, etc.) land in `unsupported_files`.
  Sibling files always continue — phase never aborts on file-level
  issues. Transport-level failures land in `failed_files`.
- Idempotency-only retries on 5xx + transient connection errors.
- Wired after `custom_tool` in the orchestrator (consumes its remap).

Report grows four new typed lists (`uploaded_files`, `skipped_files`,
`oversize_files`, `unsupported_files`, `failed_files`) plus
end-of-phase rendering in both rich and plain modes.

12 new unit tests cover happy path, idempotency skip, oversize,
unsupported mime, skip strategy, dry-run, retry on 5xx, missing
custom_tool remap, per-tool source-list failure isolation, upload
failure capture, and parametrised text/csv + text/plain round-trips.
…nce NOT FOUND

Multiple fixes uncovered by the first local-stack run:

client.py
- list_prompt_documents: mount under prompt-studio/ prefix
  (BE include in urls_v2.py).
- download_prompt_file: use ?document_id=, matching
  fetch_contents_ide serializer (was ?file_name=, BE ignored it
  and returned 400 ValidationError).
- upload_prompt_file: drop trailing slash, BE pattern is
  prompt-studio/file/<uuid:pk> with no slash so POST 404'd.
- add get_custom_tool / update_custom_tool for default-doc PATCH.

phases/files.py
- After upload loop per tool, mirror source's CustomTool.output
  by filename so FE auto-selects on load. Fall back to first
  target doc. Preserve any existing target output (operator
  may have already picked manually on a re-run).

phases/tool_instance.py
- Detect source serializer sentinels ([X NOT FOUND],
  [DELETED ADAPTER ...], [NEEDS UPDATE]) in stored metadata and
  skip the PATCH instead of round-tripping a broken adapter
  reference. ToolInstance row exists with backend defaults;
  operator re-binds in UI.

report.py
- Drop the full source->target UUID map from rendered output
  (noisy on large migrations). Print per-entity counts only;
  full map still in as_dict() and at DEBUG log level.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
base.build_post_payload:
  Previous `value not in (None, "")` dropped booleans False and numeric
  0 along with None/"" — DRF BooleanField False and numeric defaults
  were silently stripped from POST payloads. Switch to explicit
  identity + equality checks. New test_base_helpers.py guards this.

workflow_endpoint._patch_endpoint:
  When source endpoint had a connector but its remap entry is missing
  (e.g. connector phase skipped a row), we previously PATCHed the
  target endpoint with connector_instance_id=None — silently detaching
  it. Now skip the PATCH, increment result.skipped, and append an
  error entry so the operator sees the broken link in the report.
  Existing test rewritten to assert the new skip-and-flag behaviour.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Comment thread src/unstract/clone/phases/custom_tool.py
list_custom_tools() was called once per source tool — N target-API
round-trips for the same invariant data. Fetch once before the loop
and append locally on create so adoption lookups stay correct on
re-runs.

Also drop the value.get("id") fallback in _extract_adapter_name —
returning a UUID where the caller expects a name made list_adapters
silently miss with a confusing warning.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- src/unstract/migration -> src/unstract/clone, tests/migration -> tests/clone
- MigrationReport/Context/Options/Error -> Clone*
- CLI subcommand 'migrate' -> 'clone', script 'unstract-migrate' -> 'unstract-clone'
- pyproject extra 'migration' -> 'clone'
- README + docstrings + log lines updated; report title 'Migration Report' -> 'Clone Report'

All 75 unit tests pass.
Copy link
Copy Markdown
Contributor

@jaseemjaskp jaseemjaskp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Automated multi-agent review of the clone subpackage (Code Reviewer, Silent-Failure Hunter, Type-Design, Test Analyzer, Comment Analyzer, Code Simplifier). Findings below are grouped by file; severity is tagged inline (P1 highest). Findings already covered by existing review comments (e.g. the greptile dry-run write on custom_tool.py:180, the _extract_adapter_name / list_custom_tools / base.build_post_payload / workflow_endpoint items) are intentionally omitted as fixed/duplicate. Tests pass (75 green).

Comment thread src/unstract/clone/phases/tool_instance.py Outdated
Comment thread src/unstract/clone/phases/files.py
Comment thread src/unstract/clone/phases/files.py Outdated
Comment thread src/unstract/clone/phases/custom_tool.py
Comment thread src/unstract/clone/walker.py
Comment thread src/unstract/clone/orchestrator.py
Comment thread src/unstract/clone/report.py
Comment thread src/unstract/clone/context.py
Comment thread src/unstract/clone/phases/base.py
Comment thread src/unstract/clone/phases/base.py
Comment thread src/unstract/clone/phases/workflow_endpoint.py
Comment thread src/unstract/clone/phases/custom_tool.py Outdated
Comment thread src/unstract/clone/phases/pipeline.py Outdated
P1 fixes:
- tool_instance: dry-run no longer PATCHes adopted targets.
- custom_tool: dry-run no longer republishes the registry on adopt.
- custom_tool: registry remap-lookup failure now counts as failed.
- custom_tool: phase-init failure returns `result`, not `None`.
- files: malformed DM rows + unsupported-mime + oversize bump skipped
  with an error entry so the run no longer reports green when items
  needed manual attention.
- files: tighten `_lookup_tool_name` exceptions to PlatformAPIError
  and transport errors.
- workflow_endpoint: drop `connection_type or ""` coercion that
  could turn None into a blank DRF would reject; omit the key.
- pipeline/api_deployment: do per-id GET before POST so list-only
  serializer fields don't get stripped from the create payload.

P2/P3 quick wins:
- PlatformClient: add `close()` + context manager; orchestrator
  closes both clients in a finally block.
- pipeline/api_deployment: promote DEBUG to WARNING when the
  source key-list call fails (operator-facing).
- cli: distinguish `--max-file-size 0` from unparseable; preserve 0.
- files: drop the dead `docs/internal/...` reference.

Tests:
- New test_client.py + test_orchestrator.py + test_cli.py cover
  HTTP-layer + orchestrator paths that previously had no coverage.
- Regression tests added to existing phase tests for each P1 fix.
- 105 tests pass (was 75).
@jaseemjaskp jaseemjaskp self-requested a review May 27, 2026 03:24
Copy link
Copy Markdown
Contributor

@jaseemjaskp jaseemjaskp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Second-pass review (post-9391ce3). The prior round's P1/P2/P3s are addressed — verified the dry-run adopt short-circuits, file-row counters, narrowed excepts, session close, and connection_type omission. One new correctness-of-reporting issue remains; everything else I'd add is either already deferred-with-rationale (StrEnum option sets, report dict buckets, six-phase run() duplication) or below the bar.

Comment thread src/unstract/clone/phases/tool_instance.py Outdated
Adds --concurrency N (1-32, default 4) to fan out per-phase work across
threads. Phases mutate shared state (counters, RemapTable, file lists)
under a lock owned by the parallel_map helper. concurrency=1 short-
circuits the executor for byte-for-byte identical sequential behaviour.

Also:
- PlatformAPIError now surfaces response body in str(e) so logger.exception
  emits the backend error text.
- CloneReport tracks per-phase + total wall-clock; rendered as a Time
  column in the run report.
- Files phase restructured into 3 passes (prep sequentially, per-file
  download/upload in parallel, set default doc sequentially).
…skipped

The broken-adapter-refs branch logged a warning and appended an error
entry but didn't bump any counter, so a degraded clone (tool_instance
row landed with backend defaults, adapters silently unbound) reported
as 'Completed successfully' with exit 0.
@Zipstack Zipstack deleted a comment from sergiojrdotnet May 27, 2026
@jaseemjaskp jaseemjaskp self-requested a review May 27, 2026 04:07
Comment thread src/unstract/clone/phases/workflow.py
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Comment thread src/unstract/clone/context.py
- Detect frictionless adapter dependence in custom_tool phase (source
  service-account view hides frictionless adapters; profile names not
  in that view can't be migrated). Skip the tool and cascade-skip its
  workflow via a registry-id set on CloneContext.
- Detect OAuth-backed connectors (metadata carries access_token /
  refresh_token) and skip ahead of POST. Avoids the backend's
  OAuthTimeOut 408 and surfaces a re-auth instruction for the operator.
- CloneReport now prints a "Failures" section before the table with
  one truncated line per recorded error (capped at 30 rows). Entity
  labels in the section are bold-cyan to match the phase column.
- Drop "workflow_owner" from POST payloads (server-managed).
Comment thread src/unstract/clone/context.py
chandrasekharan-zipstack and others added 2 commits May 27, 2026 14:37
- WorkflowPhase now does list → per-id GET → POST like other phases,
  so source_settings/destination_settings JSON blobs (carrying
  connector UUIDs) aren't dropped by stripped list serializers.
- RemapTable.resolve_any iterates over a snapshot to avoid
  `dictionary changed size during iteration` against a concurrent
  record() from sibling workers.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Existing main.yml only fires on workflow_dispatch (release path), so
unit tests never ran on PRs. Add a workflow that runs pytest on every
PR (and pushes to main) across Python 3.11 and 3.12. Lint is left as
a follow-up — src/ currently has 10 pre-existing E501 violations that
would block any gate; cleaning those up is a separate change.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
chandrasekharan-zipstack and others added 2 commits May 27, 2026 14:56
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
ci: add PR-gate workflow running pytest on every PR
@chandrasekharan-zipstack chandrasekharan-zipstack merged commit c3c6d37 into main May 27, 2026
3 checks passed
@chandrasekharan-zipstack chandrasekharan-zipstack deleted the feat/org-migration branch May 27, 2026 10:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants