Skip to content

fix: commit request queue dedup cache only after batch_add_requests succeeds#975

Draft
vdusek wants to merge 5 commits into
masterfrom
fix/rq-cache-commit-on-success
Draft

fix: commit request queue dedup cache only after batch_add_requests succeeds#975
vdusek wants to merge 5 commits into
masterfrom
fix/rq-cache-commit-on-success

Conversation

@vdusek

@vdusek vdusek commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

Both Apify request queue clients (ApifyRequestQueueSingleClient, ApifyRequestQueueSharedClient) cached new requests locally (the single client also updates _head_requests) before calling batch_add_requests. That caused two bugs:

  • A failed call left the cache entry behind, so a later retry of the same request was deduplicated locally and never re-sent. The request was silently lost.
  • A concurrent producer could match the in-flight cache entry and return was_already_present before the first call finished. If that call then failed, the producer had already reported success for a request that never reached the platform.

Now the cache and head are committed only after batch_add_requests succeeds, and only for requests the platform accepted (unprocessed_requests are skipped). A failed call commits nothing.

To keep deduplicating concurrent producers (so overlapping batches do not multiply platform writes), each in-flight add is tracked by a per-request future. A concurrent producer of the same request awaits it instead of re-sending. It is told the request is present only if the original add committed it. If the original add failed, the producer reports the request unprocessed so Crawlee retries it, rather than receiving false success.

Within-batch deduplication is unaffected, since Crawlee already collapses a batch by unique_key (via _transform_requests) before it reaches these clients.

Tests (parametrized over both clients, plus integration):

  • A failed batch_add_requests leaves no cached entry, and the retry reaches the platform.
  • A concurrent producer of an in-flight request deduplicates against it (a single API call) and is told the request is present only once it is accepted.
  • When the in-flight add fails, the concurrent producer reports the request unprocessed instead of false success.
  • test_request_queue_parallel_deduplication confirms overlapping concurrent producers write each request to the platform exactly once.

@vdusek vdusek added adhoc Ad-hoc unplanned task added during the sprint. t-tooling Issues with this label are in the ownership of the tooling team. labels Jun 12, 2026
@vdusek vdusek self-assigned this Jun 12, 2026
@github-actions github-actions Bot added this to the 142nd sprint - Tooling team milestone Jun 12, 2026
@github-actions github-actions Bot added the tested Temporary label used only programatically for some analytics. label Jun 12, 2026
@codecov

codecov Bot commented Jun 12, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 90.07%. Comparing base (0daca28) to head (56d7c55).
⚠️ Report is 1 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #975      +/-   ##
==========================================
+ Coverage   89.90%   90.07%   +0.16%     
==========================================
  Files          49       49              
  Lines        3091     3143      +52     
==========================================
+ Hits         2779     2831      +52     
  Misses        312      312              
Flag Coverage Δ
e2e 35.53% <12.16%> (-0.38%) ⬇️
integration 57.52% <97.29%> (+0.64%) ⬆️
unit 81.45% <95.94%> (+2.70%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

vdusek added 4 commits June 12, 2026 18:27
…-on-success

# Conflicts:
#	tests/unit/storage_clients/test_apify_request_queue_client.py
…form writes

Commit-after-success regressed concurrent deduplication. Overlapping producers
re-sent the same request and multiplied platform writes, which the parallel-dedup
integration test caught. In-flight adds are now tracked as per-request futures
that concurrent producers await instead of re-sending. A producer learns the
request is present only once the platform accepts it. If the original add fails,
awaiters are told it was not committed and report it unprocessed so it gets
retried, never falsely succeeded.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

adhoc Ad-hoc unplanned task added during the sprint. t-tooling Issues with this label are in the ownership of the tooling team. tested Temporary label used only programatically for some analytics.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants