fix: filter out datasets with inconsistent database and LakeFS records by xuang7 · Pull Request #5171 · apache/texera

xuang7 · 2026-05-24T00:41:19Z

What changes were proposed in this PR?

This PR fixes an issue where dataset listings fail when dataset records in the database and repositories in LakeFS are inconsistent (for example, a database row whose LakeFS repository no longer exists). The failure breaks the workflow dataset picker and can also affect Hub dataset listings. The fix updates the dataset listing endpoints to first fetch the names of existing LakeFS repositories and filter out dataset records whose repositories are missing, so valid datasets are still returned.
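
The filtering step described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `DatasetRecord`, `SnapshotFilterSketch`, and `dropOrphans` are hypothetical names, and the sketch assumes the repository names are available as a `Set[String]` (the actual return type of `LakeFSStorageClient.listAllRepoNames()` is not shown here).

```scala
// Hypothetical stand-in for a dataset row; not Texera's actual model.
final case class DatasetRecord(did: Int, repositoryName: String)

object SnapshotFilterSketch {
  // Keep only the rows whose repository name appears in the snapshot of
  // existing LakeFS repositories; rows without a repository are dropped.
  def dropOrphans(
      rows: List[DatasetRecord],
      existingRepos: Set[String] // assumed shape of the repo-name snapshot
  ): List[DatasetRecord] =
    rows.filter(row => existingRepos.contains(row.repositoryName))
}
```

Because the snapshot is taken once per request, the cost is one extra LakeFS call plus a set lookup per row.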

Demo: before and after screenshots.

Any related issues, documentation, discussions?

Closes #5106

How was this PR tested?

Added two tests.

Was this PR authored or co-authored using generative AI tooling?

Generated-by: Claude Opus 4.7

codecov-commenter · 2026-05-24T00:43:23Z

Codecov Report

❌ Patch coverage is 20.00000% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 45.81%. Comparing base (c435aa7) to head (c4a945d).

Files with missing lines	Patch %	Lines
...exera/web/resource/dashboard/hub/HubResource.scala	0.00%	2 Missing ⚠️
.../amber/core/storage/util/LakeFSStorageClient.scala	0.00%	2 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##               main    #5171      +/-   ##
============================================
- Coverage     47.13%   45.81%   -1.33%     
- Complexity     2344     2345       +1     
============================================
  Files          1042     1046       +4     
  Lines         39989    40033      +44     
  Branches       4260     4258       -2     
============================================
- Hits          18849    18341     -508     
- Misses        20015    20582     +567     
+ Partials       1125     1110      -15

Flag	Coverage Δ		*Carryforward flag
access-control-service	`39.53% <ø> (ø)`
agent-service	`33.74% <ø> (-0.03%)`	⬇️	Carriedforward from 662c8cb
amber	`50.31% <0.00%> (-0.02%)`	⬇️
computing-unit-managing-service	`0.00% <ø> (ø)`
config-service	`0.00% <ø> (ø)`
file-service	`32.89% <100.00%> (+0.70%)`	⬆️
frontend	`34.62% <ø> (-3.20%)`	⬇️	Carriedforward from 662c8cb
python	`90.50% <ø> (ø)`		Carriedforward from 662c8cb
workflow-compiling-service	`56.81% <ø> (ø)`

*This pull request uses carry forward flags.


mengw15

Left one comment

mengw15 · 2026-05-25T08:02:44Z

  ): List[DashboardDataset] = {
    val uid = user.getUid
+    // Drop DB rows whose LakeFS repo is missing.
+    val existingRepos = LakeFSStorageClient.listAllRepoNames()


TOCTOU note: listAllRepoNames() is a snapshot taken before the .map that calls retrieveRepositorySize per row. If a concurrent admin / orchestrator deletes a LakeFS repo between the snapshot and the per-row size lookup, the request will still 500 on the now-stale "exists" check. The window is small but non-zero.

Worth knowing the existing dataset-search path (DatasetSearchQueryBuilder.toEntryImpl at lines 127-137 on main) already handles this with a try { retrieveRepositorySize(...) } catch (ApiException) { return null } pattern, logging and silently dropping the orphan. After this PR the two read paths have two different defenses for the same underlying inconsistency.

Could we use try-catch here too? That would close the race window, drop the need for the new listAllRepoNames() helper entirely, and unify the orphan defense with the existing search path. What do you think?
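
To make the suggestion concrete, here is a minimal sketch of the per-row try-catch approach. Everything in it is a stand-in: `RepoMissingException` plays the role of the LakeFS client's `ApiException`, `sizeOf` plays the role of `retrieveRepositorySize`, and `DatasetRow`, `DatasetWithSize`, and `TryCatchSketch` are hypothetical names, not Texera code.

```scala
// Stand-in for the LakeFS client's ApiException; hypothetical, for illustration only.
final class RepoMissingException(repo: String)
    extends RuntimeException(s"repository not found: $repo")

// Hypothetical stand-ins for a dataset row and its enriched form.
final case class DatasetRow(did: Int, repositoryName: String)
final case class DatasetWithSize(row: DatasetRow, sizeBytes: Long)

object TryCatchSketch {
  // sizeOf stands in for retrieveRepositorySize and may throw if the repo is gone.
  // The existence check and the size lookup are the same call, so there is no
  // gap between "check" and "use" for a concurrent delete to land in.
  def withSizes(rows: List[DatasetRow], sizeOf: String => Long): List[DatasetWithSize] =
    rows.flatMap { row =>
      try Some(DatasetWithSize(row, sizeOf(row.repositoryName)))
      catch {
        // Orphan: a DB row without a LakeFS repo. Drop it instead of failing the request.
        case _: RepoMissingException => None
      }
    }
}
```

A real version would presumably also log each dropped orphan, as the search path does, and would catch only the exception that signals a missing repository so that unrelated failures still surface.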

xuang7 added 2 commits May 23, 2026 17:22

update.

d352629

Merge branch 'main' into fix/filter-mismatched-datasets

a98106b

github-actions Bot assigned xuang7 May 24, 2026

github-actions Bot added engine fix common platform Non-amber Scala service paths labels May 24, 2026

xuang7 requested a review from aicam May 24, 2026 00:44

chenlica requested a review from mengw15 May 25, 2026 07:08

mengw15 reviewed May 25, 2026

Merge branch 'main' into fix/filter-mismatched-datasets

c4a945d

xuang7 wants to merge 3 commits into apache:main from xuang7:fix/filter-mismatched-datasets

3 participants