Offline storage: data-safety fixes + batched flush (empty-filter guard, store-failure propagation, event-loss fix, one-transaction flush) by bmehta001 · Pull Request #1491 · microsoft/cpp_client_telemetry

bmehta001 · 2026-06-22T15:56:27Z

Summary

Offline-storage data-safety + flush performance (combines the data-safety fixes with the SQLite batch-flush optimization, formerly PR #1496 and #1497).

Area	Fix
`MemoryStorage::DeleteRecords(filter)`	An empty filter matched every record and silently wiped the RAM queue. Now ignored (use `DeleteAllRecords()`), mirroring the SQLite backend.
`OfflineStorage_SQLite::StoreRecords`	Batches the whole flush into one `BEGIN EXCLUSIVE`/`COMMIT` (one fsync) instead of one transaction per record. ~11× (200 recs) / ~40× (1000 recs) faster (measured on the SDK's vendored sqlite). All-or-nothing: any insert failure rolls the batch back (new `SqliteDB::rollback`), so it can be re-queued without duplicate rows (no unique `record_id`).
`OfflineStorage_SQLite::StoreRecord`	Returns `false` and reports `OnStorageFailed` on a real write failure (was ignoring the `execute()` result). Validation runs before the transaction; write failures are reported after it closes.
`OfflineStorageHandler::StoreRecord`	Propagates the disk store result (was always `true`).
`OfflineStorageHandler::Flush`	Drained the RAM queue and persisted before confirming, losing events on a disk-write failure. Now drains and persists via the batched, all-or-nothing `StoreRecords()`, and re-queues the whole batch if it doesn't fully persist — realizing the batch speedup and keeping the no-loss / no-duplicate guarantee. Also null-guards the size read (disk-only storage).

Tests

MemoryStorageTests.DeleteRecordsWithEmptyFilterDoesNotDeleteAll
OfflineStorageHandlerFlushTests.FailedDiskStoreDuringFlushReturnsRecordsToMemory — records the disk store rejects remain in memory after Flush(). Verified it fails against the previous GetRecords()-based Flush.
OfflineStorageTests_SQLite.StoreRecordsBatchStoresAllRecords
Full WSL UnitTests: 527/527 pass.

History

The batching half (PR #1497) was driven to a clean Copilot review over ~9 rounds, then folded here so the batched StoreRecords() and the Flush data-loss fix cooperate (all-or-nothing batch + re-queue) rather than conflict.

…ailure Two latent data-loss bugs found during a repo-wide review: 1) MemoryStorage::DeleteRecords(whereFilter) matched EVERY record when whereFilter was empty (the matcher starts `matched = true` and the per-key loop never runs), silently wiping the entire in-memory queue. This contradicts the fail-closed OfflineStorage_SQLite::DeleteRecords and the Room backend. Guard an empty filter and return without deleting; intentional full clears use DeleteAllRecords(). 2) OfflineStorage_SQLite::StoreRecord ignored the bool returned by SqliteStatement::execute(), returning true and bumping m_DbSizeEstimate even on a real write failure (SQLITE_FULL/IOERR/etc). The event is silently lost with no OnStorageFailed notification and the size estimate drifts. Capture the result; on failure log, notify the observer, and return false (skipping the size bump). Tests: added MemoryStorageTests.DeleteRecordsWithEmptyFilterDoesNotDeleteAll (fails without the guard -- the queue is wiped to 0; passes with it). The StoreRecord write-failure path isn't unit-testable here (the insert is REPLACE INTO with no constraint to violate), so it's covered by build + review. Verified locally on Linux: all 9 MemoryStorageTests and 32 OfflineStorageTests_SQLite pass. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Verified TDD: this test fails without the empty-filter guard (the queue is wiped, GetSize()/GetRecordCount() drop to 0) and passes with it. Run on Linux host. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

lib/offline/OfflineStorage_SQLite.cpp::StoreRecord now returns false on a write failure (this PR), but OfflineStorageHandler::StoreRecord ignored the disk result and always returned true, so a failed synchronous store (RAM queue disabled or during shutdown) was counted as successfully persisted by StoreRecords()/StorageObserver. Return the disk StoreRecord() result in the direct-to-disk path. The memory path is unchanged: MemoryStorage::StoreRecord returning false means an intentional latency-Off skip, not a failure, so it must not surface as an error. Verified at lib/offline/OfflineStorageHandler.cpp:266-275 and lib/offline/OfflineStorage_SQLite.cpp:180-186. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

This PR hardens the offline-storage layer against two data-loss scenarios by (1) preventing accidental full-queue deletes when a delete predicate is empty and (2) ensuring SQLite write failures are surfaced to callers/observers instead of being treated as successful stores.

Changes:

Add an empty-filter guard to MemoryStorage::DeleteRecords(whereFilter) and a unit test to prevent unintended full deletes.
Propagate synchronous disk StoreRecord() failures through OfflineStorageHandler::StoreRecord so callers don’t count failed persistence as success.
Treat SqliteStatement::execute() failures in OfflineStorage_SQLite::StoreRecord as hard failures: log, notify OnStorageFailed, and return false without updating size estimates.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File	Description
tests/unittests/MemoryStorageTests.cpp	Adds coverage to ensure `DeleteRecords({})` does not wipe the in-memory queue.
lib/offline/MemoryStorage.cpp	Guards empty `whereFilter` to avoid “match-all” deletion behavior.
lib/offline/OfflineStorageHandler.cpp	Returns disk `StoreRecord()` result to propagate synchronous persistence failures to callers.
lib/offline/OfflineStorage_SQLite.cpp	Checks `execute()` result and reports SQLite insert failures via logging + `OnStorageFailed`.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+                    // Propagate a synchronous disk write failure to the caller so a
+                    // failed store is not counted as successfully persisted.
+                    return m_offlineStorageDisk->StoreRecord(record);


+                    // Propagate a synchronous disk write failure to the caller so a
+                    // failed store is not counted as successfully persisted.
+                    return m_offlineStorageDisk->StoreRecord(record);


OfflineStorage_SQLite::StoreRecord: return false when the INSERT execute() fails (was ignoring the result and returning true even on a real write failure). The Flush() reserve/confirm-delete logic in this PR relies on the disk backend reporting per-record failure; without this, a failed sqlite3_step would still be treated as persisted and dropped from memory. (This is the same one-line fix as PR microsoft#1491, included here so this PR is correct on its own; trivial overlap that resolves cleanly at merge.) OfflineStorageTests.cpp NoopTaskDispatcher: own queued tasks and delete them on Cancel()/Join()/destruction instead of dropping the Task* (PAL::scheduleTask allocates with new and expects the dispatcher to take ownership), so the helper cannot leak. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Combine the Flush() data-loss fix into this storage-data-safety PR (the two are halves of the same fix: this PR already makes OfflineStorage_SQLite::StoreRecord report write failures; Flush() must act on that). OfflineStorageHandler::Flush() previously drained the in-memory queue with GetRecords() (which removes records) and handed them to StoreRecords() before confirming persistence. On a partial/total disk write failure the un-persisted records were already gone from memory and never re-queued -> events lost. Flush() now drains into a local batch, persists one record at a time, and re-inserts only the records that fail to persist (so failures are retried, not lost). Per-record StoreRecord() is used deliberately: a batched StoreRecords() only returns a count, so on a partial failure we could not tell which records to re-queue, and re-storing already-saved records would duplicate them (no unique record_id constraint). Also null-guards the dbSizeBeforeFlush read so Flush() is safe with disk-only storage (CFG_INT_RAM_QUEUE_SIZE == 0). Adds OfflineStorageHandlerFlushTests.FailedDiskWriteDuringFlushReturnsRecordsToMemory (records the SQLite store rejects stay in memory after Flush; verified it fails against the previous GetRecords()-based Flush). Closes the separate PR microsoft#1496. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

+            if (it != m_tasks.end())
+            {
+                delete *it;
+                m_tasks.erase(it);
+            }
+            return true;
+        }


The test helper's Cancel() returned true unconditionally, violating the ITaskDispatcher::Cancel contract (return whether the task was found/cancelled). Return true only when the task was present in the queue, false otherwise. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

+            size_t totalSaved = 0;
+            size_t totalFailed = 0;
+            for (auto& record : records)
+            {
+                if (m_offlineStorageDisk->StoreRecord(record))


+// Regression test: when records pulled from the in-memory queue fail to persist
+// to disk during Flush(), they must be returned to the queue rather than lost.
+TEST(OfflineStorageHandlerFlushTests, FailedDiskWriteDuringFlushReturnsRecordsToMemory)


+    // A timestamp <= 0 is accepted by the in-memory queue but rejected by the
+    // SQLite disk store's input validation, so its StoreRecord() returns false.
+    // This drives the same Flush() failure-handling path as a disk write failure
+    // (a failed record must be returned to memory, not dropped).


Rename FailedDiskWriteDuringFlush... -> FailedDiskStoreDuringFlush... and reword its comments: the test exercises a disk StoreRecord() rejection (SQLite input validation), which drives the same Flush() re-queue path as any disk store failure, not a literal disk write/IO error. (The reviewer's separate note that Flush() ignores EventPersistence_DoNotStoreOnDisk is a pre-existing behavior, out of scope for this data-safety change and not cleanly unit-testable via the public API; tracked as a follow-up.) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated no new comments.

…ta-safety

…(was PR microsoft#1497) Combine the batched-flush perf work into this PR and make it cooperate with the Flush() data-loss fix, so both land together. OfflineStorage_SQLite: StoreRecords() now inserts the whole batch in a single BEGIN EXCLUSIVE / COMMIT (one fsync) instead of one transaction per record (~11x at 200 records, ~40x at 1000 vs the SDK's vendored sqlite). Shared per-record logic is factored into isValidRecord / insertRecordUnsafe / checkStorageSizeLimits. The batch is all-or-nothing: if any insert fails, the transaction is rolled back (new SqliteDB::rollback / DbTransaction::markForRollback) and the size estimate is undone, so callers can re-queue the whole batch without risking duplicate rows (the events table has no unique record_id constraint). OfflineStorageHandler::Flush() now uses the batched StoreRecords() to persist a drained batch in one transaction. Because StoreRecords() is all-or-nothing, on failure nothing is committed and Flush returns every record to the in-memory queue for retry -- realizing the batching speedup while keeping the no-event-loss / no-duplicate guarantee. StoreRecords/StoreRecord report write failures via OnStorageFailed after the transaction closes; validation runs before the transaction. Adds OfflineStorageTests_SQLite.StoreRecordsBatchStoresAllRecords. Full UnitTests (527) pass. Closes PR microsoft#1497. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Copilot

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.

+        std::vector<const StorageRecord*> valid;
+        valid.reserve(records.size());
+        for (auto const& i : records) {
+            if (isValidRecord(i)) {
+                valid.push_back(&i);
+            }
+        }
+
+        if (valid.empty()) {
+            // Every record was invalid (already reported above). Match the single
+            // StoreRecord(), which returns after validation without checking
+            // DB-open.
+            return 0;
+        }


…cords StoreRecords() previously filtered out invalid records and committed the valid ones, so it could return a count < records.size() even though some records were persisted. OfflineStorageHandler::Flush() treats totalSaved < records.size() as a batch failure and re-queues ALL drained records, which would duplicate the valid records that were actually stored. Make StoreRecords() truly all-or-nothing: if ANY input record is invalid, store nothing and return 0 (invalids are still reported via isValidRecord()). Combined with the existing rollback-on-write-failure, StoreRecords() now returns either records.size() (whole batch committed) or 0 (nothing committed), so Flush's re-queue-all-on-short-return can never duplicate records. Adds OfflineStorageTests_SQLite.StoreRecordsBatchWithAnyInvalidStoresNothing. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.

+            // than reserving) keeps only a single copy of each record in flight and
+            // avoids stamping a reservation lease that the Room backend would
+            // persist to disk.
            auto records = m_offlineStorageMemory->GetRecords(false, EventLatency_Unspecified);


Flush() re-queued the whole drained batch whenever StoreRecords() returned a count < records.size(). Both disk backends are all-or-nothing (SQLite rolls back; Room returns 0 on a failed JNI batch), so the only meaningful "failure" value is 0. Room also caps its returned count at min(size, INT32_MAX); keying off < records.size() would treat that capped count as a failure and re-queue already-persisted records (duplicates). Key the re-queue off totalSaved == 0 instead, which is the true "nothing committed" signal. (The cap only matters for a batch larger than the RAM queue could ever hold.) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated no new comments.

bmehta001 and others added 2 commits June 22, 2026 10:48

Add MemoryStorage empty-filter delete regression test

325c55b

Verified TDD: this test fails without the empty-filter guard (the queue is wiped, GetSize()/GetRecordCount() drop to 0) and passes with it. Run on Linux host. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

bmehta001 requested a review from a team as a code owner June 22, 2026 15:56

lalitb reviewed Jun 22, 2026

View reviewed changes

Comment thread lib/offline/OfflineStorage_SQLite.cpp Outdated

bmehta001 requested a review from Copilot June 22, 2026 21:28

Copilot started reviewing on behalf of bmehta001 June 22, 2026 21:28 View session

Copilot AI reviewed Jun 22, 2026

View reviewed changes

bmehta001 mentioned this pull request Jun 22, 2026

Prevent event loss when a disk write fails during Flush() #1496

Closed

bmehta001 changed the title ~~Offline storage: guard empty-filter delete and propagate SQLite store failures~~ Offline storage data-safety: empty-filter guard, store-failure propagation, and flush event-loss fix Jun 23, 2026

bmehta001 requested a review from Copilot June 23, 2026 03:59

Copilot started reviewing on behalf of bmehta001 June 23, 2026 04:00 View session

Copilot AI reviewed Jun 23, 2026

View reviewed changes

Comment thread tests/unittests/OfflineStorageTests.cpp

Comment on lines +196 to +202

if (it != m_tasks.end())

{

delete *it;

m_tasks.erase(it);

}

return true;

}

bmehta001 requested a review from Copilot June 23, 2026 04:09

Copilot started reviewing on behalf of bmehta001 June 23, 2026 04:09 View session

Copilot AI reviewed Jun 23, 2026

View reviewed changes

bmehta001 requested a review from Copilot June 23, 2026 04:25

Copilot started reviewing on behalf of bmehta001 June 23, 2026 04:26 View session

Copilot AI reviewed Jun 23, 2026

View reviewed changes

bmehta001 mentioned this pull request Jun 23, 2026

Offline storage: batch flush inserts in one transaction #1497

Closed

bmehta001 and others added 2 commits June 23, 2026 10:02

Merge remote-tracking branch 'msft/main' into bhamehta/fix-storage-da…

e9c7ee3

…ta-safety

bmehta001 changed the title ~~Offline storage data-safety: empty-filter guard, store-failure propagation, and flush event-loss fix~~ Offline storage: data-safety fixes + batched flush (empty-filter guard, store-failure propagation, event-loss fix, one-transaction flush) Jun 23, 2026

bmehta001 requested a review from Copilot June 23, 2026 16:34

Copilot started reviewing on behalf of bmehta001 June 23, 2026 16:35 View session

Copilot AI reviewed Jun 23, 2026

bmehta001 requested a review from Copilot June 23, 2026 16:41

Copilot started reviewing on behalf of bmehta001 June 23, 2026 16:42 View session

Copilot AI reviewed Jun 23, 2026

View reviewed changes

bmehta001 requested a review from Copilot June 23, 2026 17:06

Copilot started reviewing on behalf of bmehta001 June 23, 2026 17:07 View session

Copilot AI reviewed Jun 23, 2026

View reviewed changes

Comment thread lib/offline/OfflineStorageHandler.cpp

// than reserving) keeps only a single copy of each record in flight and

// avoids stamping a reservation lease that the Room backend would

// persist to disk.

auto records = m_offlineStorageMemory->GetRecords(false, EventLatency_Unspecified);

bmehta001 requested a review from Copilot June 23, 2026 17:18

Copilot started reviewing on behalf of bmehta001 June 23, 2026 17:19 View session

Copilot AI reviewed Jun 23, 2026

View reviewed changes

Merge branch 'main' into bhamehta/fix-storage-data-safety

97fee8e

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Offline storage: data-safety fixes + batched flush (empty-filter guard, store-failure propagation, event-loss fix, one-transaction flush)#1491

Offline storage: data-safety fixes + batched flush (empty-filter guard, store-failure propagation, event-loss fix, one-transaction flush)#1491
bmehta001 wants to merge 11 commits into
microsoft:mainfrom
bmehta001:bhamehta/fix-storage-data-safety

bmehta001 commented Jun 22, 2026 •

edited

Loading

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

bmehta001 commented Jun 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Tests

History

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

bmehta001 commented Jun 22, 2026 •

edited

Loading