Skip to content

Offline storage: data-safety fixes + batched flush (empty-filter guard, store-failure propagation, event-loss fix, one-transaction flush)#1491

Open
bmehta001 wants to merge 11 commits into
microsoft:mainfrom
bmehta001:bhamehta/fix-storage-data-safety
Open

Offline storage: data-safety fixes + batched flush (empty-filter guard, store-failure propagation, event-loss fix, one-transaction flush)#1491
bmehta001 wants to merge 11 commits into
microsoft:mainfrom
bmehta001:bhamehta/fix-storage-data-safety

Conversation

@bmehta001

@bmehta001 bmehta001 commented Jun 22, 2026

Copy link
Copy Markdown
Contributor

Summary

Offline-storage data-safety + flush performance (combines the data-safety fixes with the SQLite batch-flush optimization, formerly PR #1496 and #1497).

Area Fix
MemoryStorage::DeleteRecords(filter) An empty filter matched every record and silently wiped the RAM queue. Now ignored (use DeleteAllRecords()), mirroring the SQLite backend.
OfflineStorage_SQLite::StoreRecords Batches the whole flush into one BEGIN EXCLUSIVE/COMMIT (one fsync) instead of one transaction per record. ~11× (200 recs) / ~40× (1000 recs) faster (measured on the SDK's vendored sqlite). All-or-nothing: any insert failure rolls the batch back (new SqliteDB::rollback), so it can be re-queued without duplicate rows (no unique record_id).
OfflineStorage_SQLite::StoreRecord Returns false and reports OnStorageFailed on a real write failure (was ignoring the execute() result). Validation runs before the transaction; write failures are reported after it closes.
OfflineStorageHandler::StoreRecord Propagates the disk store result (was always true).
OfflineStorageHandler::Flush Drained the RAM queue and persisted before confirming, losing events on a disk-write failure. Now drains and persists via the batched, all-or-nothing StoreRecords(), and re-queues the whole batch if it doesn't fully persist — realizing the batch speedup and keeping the no-loss / no-duplicate guarantee. Also null-guards the size read (disk-only storage).

Tests

  • MemoryStorageTests.DeleteRecordsWithEmptyFilterDoesNotDeleteAll
  • OfflineStorageHandlerFlushTests.FailedDiskStoreDuringFlushReturnsRecordsToMemory — records the disk store rejects remain in memory after Flush(). Verified it fails against the previous GetRecords()-based Flush.
  • OfflineStorageTests_SQLite.StoreRecordsBatchStoresAllRecords
  • Full WSL UnitTests: 527/527 pass.

History

The batching half (PR #1497) was driven to a clean Copilot review over ~9 rounds, then folded here so the batched StoreRecords() and the Flush data-loss fix cooperate (all-or-nothing batch + re-queue) rather than conflict.

bmehta001 and others added 2 commits June 22, 2026 10:48
…ailure

Two latent data-loss bugs found during a repo-wide review:

1) MemoryStorage::DeleteRecords(whereFilter) matched EVERY record when
   whereFilter was empty (the matcher starts `matched = true` and the
   per-key loop never runs), silently wiping the entire in-memory queue.
   This contradicts the fail-closed OfflineStorage_SQLite::DeleteRecords
   and the Room backend. Guard an empty filter and return without
   deleting; intentional full clears use DeleteAllRecords().

2) OfflineStorage_SQLite::StoreRecord ignored the bool returned by
   SqliteStatement::execute(), returning true and bumping
   m_DbSizeEstimate even on a real write failure (SQLITE_FULL/IOERR/etc).
   The event is silently lost with no OnStorageFailed notification and
   the size estimate drifts. Capture the result; on failure log, notify
   the observer, and return false (skipping the size bump).

Tests: added MemoryStorageTests.DeleteRecordsWithEmptyFilterDoesNotDeleteAll
(fails without the guard -- the queue is wiped to 0; passes with it).
The StoreRecord write-failure path isn't unit-testable here (the insert
is REPLACE INTO with no constraint to violate), so it's covered by build
+ review. Verified locally on Linux: all 9 MemoryStorageTests and 32
OfflineStorageTests_SQLite pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Verified TDD: this test fails without the empty-filter guard (the queue
is wiped, GetSize()/GetRecordCount() drop to 0) and passes with it.
Run on Linux host.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@bmehta001 bmehta001 requested a review from a team as a code owner June 22, 2026 15:56
Comment thread lib/offline/OfflineStorage_SQLite.cpp Outdated
lib/offline/OfflineStorage_SQLite.cpp::StoreRecord now returns false on a write
failure (this PR), but OfflineStorageHandler::StoreRecord ignored the disk
result and always returned true, so a failed synchronous store (RAM queue
disabled or during shutdown) was counted as successfully persisted by
StoreRecords()/StorageObserver.

Return the disk StoreRecord() result in the direct-to-disk path. The memory
path is unchanged: MemoryStorage::StoreRecord returning false means an
intentional latency-Off skip, not a failure, so it must not surface as an error.

Verified at lib/offline/OfflineStorageHandler.cpp:266-275 and
lib/offline/OfflineStorage_SQLite.cpp:180-186.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR hardens the offline-storage layer against two data-loss scenarios by (1) preventing accidental full-queue deletes when a delete predicate is empty and (2) ensuring SQLite write failures are surfaced to callers/observers instead of being treated as successful stores.

Changes:

  • Add an empty-filter guard to MemoryStorage::DeleteRecords(whereFilter) and a unit test to prevent unintended full deletes.
  • Propagate synchronous disk StoreRecord() failures through OfflineStorageHandler::StoreRecord so callers don’t count failed persistence as success.
  • Treat SqliteStatement::execute() failures in OfflineStorage_SQLite::StoreRecord as hard failures: log, notify OnStorageFailed, and return false without updating size estimates.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File Description
tests/unittests/MemoryStorageTests.cpp Adds coverage to ensure DeleteRecords({}) does not wipe the in-memory queue.
lib/offline/MemoryStorage.cpp Guards empty whereFilter to avoid “match-all” deletion behavior.
lib/offline/OfflineStorageHandler.cpp Returns disk StoreRecord() result to propagate synchronous persistence failures to callers.
lib/offline/OfflineStorage_SQLite.cpp Checks execute() result and reports SQLite insert failures via logging + OnStorageFailed.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +272 to +274
// Propagate a synchronous disk write failure to the caller so a
// failed store is not counted as successfully persisted.
return m_offlineStorageDisk->StoreRecord(record);
Comment on lines +272 to +274
// Propagate a synchronous disk write failure to the caller so a
// failed store is not counted as successfully persisted.
return m_offlineStorageDisk->StoreRecord(record);
bmehta001 added a commit to bmehta001/cpp_client_telemetry that referenced this pull request Jun 23, 2026
OfflineStorage_SQLite::StoreRecord: return false when the INSERT execute() fails
(was ignoring the result and returning true even on a real write failure). The
Flush() reserve/confirm-delete logic in this PR relies on the disk backend
reporting per-record failure; without this, a failed sqlite3_step would still be
treated as persisted and dropped from memory. (This is the same one-line fix as
PR microsoft#1491, included here so this PR is correct on its own; trivial overlap that
resolves cleanly at merge.)

OfflineStorageTests.cpp NoopTaskDispatcher: own queued tasks and delete them on
Cancel()/Join()/destruction instead of dropping the Task* (PAL::scheduleTask
allocates with new and expects the dispatcher to take ownership), so the helper
cannot leak.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Combine the Flush() data-loss fix into this storage-data-safety PR (the two are
halves of the same fix: this PR already makes OfflineStorage_SQLite::StoreRecord
report write failures; Flush() must act on that).

OfflineStorageHandler::Flush() previously drained the in-memory queue with
GetRecords() (which removes records) and handed them to StoreRecords() before
confirming persistence. On a partial/total disk write failure the un-persisted
records were already gone from memory and never re-queued -> events lost.

Flush() now drains into a local batch, persists one record at a time, and
re-inserts only the records that fail to persist (so failures are retried, not
lost). Per-record StoreRecord() is used deliberately: a batched StoreRecords()
only returns a count, so on a partial failure we could not tell which records to
re-queue, and re-storing already-saved records would duplicate them (no unique
record_id constraint). Also null-guards the dbSizeBeforeFlush read so Flush() is
safe with disk-only storage (CFG_INT_RAM_QUEUE_SIZE == 0).

Adds OfflineStorageHandlerFlushTests.FailedDiskWriteDuringFlushReturnsRecordsToMemory
(records the SQLite store rejects stay in memory after Flush; verified it fails
against the previous GetRecords()-based Flush). Closes the separate PR microsoft#1496.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@bmehta001 bmehta001 changed the title Offline storage: guard empty-filter delete and propagate SQLite store failures Offline storage data-safety: empty-filter guard, store-failure propagation, and flush event-loss fix Jun 23, 2026
@bmehta001 bmehta001 requested a review from Copilot June 23, 2026 03:59

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

Comment on lines +196 to +202
if (it != m_tasks.end())
{
delete *it;
m_tasks.erase(it);
}
return true;
}
The test helper's Cancel() returned true unconditionally, violating the
ITaskDispatcher::Cancel contract (return whether the task was found/cancelled).
Return true only when the task was present in the queue, false otherwise.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

Comment thread lib/offline/OfflineStorageHandler.cpp Outdated
Comment on lines +193 to +197
size_t totalSaved = 0;
size_t totalFailed = 0;
for (auto& record : records)
{
if (m_offlineStorageDisk->StoreRecord(record))
Comment thread tests/unittests/OfflineStorageTests.cpp Outdated
Comment on lines +217 to +219
// Regression test: when records pulled from the in-memory queue fail to persist
// to disk during Flush(), they must be returned to the queue rather than lost.
TEST(OfflineStorageHandlerFlushTests, FailedDiskWriteDuringFlushReturnsRecordsToMemory)
Comment thread tests/unittests/OfflineStorageTests.cpp Outdated
Comment on lines +238 to +241
// A timestamp <= 0 is accepted by the in-memory queue but rejected by the
// SQLite disk store's input validation, so its StoreRecord() returns false.
// This drives the same Flush() failure-handling path as a disk write failure
// (a failed record must be returned to memory, not dropped).
Rename FailedDiskWriteDuringFlush... -> FailedDiskStoreDuringFlush... and reword
its comments: the test exercises a disk StoreRecord() rejection (SQLite input
validation), which drives the same Flush() re-queue path as any disk store
failure, not a literal disk write/IO error.

(The reviewer's separate note that Flush() ignores EventPersistence_DoNotStoreOnDisk
is a pre-existing behavior, out of scope for this data-safety change and not
cleanly unit-testable via the public API; tracked as a follow-up.)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated no new comments.

bmehta001 and others added 2 commits June 23, 2026 10:02
…(was PR microsoft#1497)

Combine the batched-flush perf work into this PR and make it cooperate with the
Flush() data-loss fix, so both land together.

OfflineStorage_SQLite: StoreRecords() now inserts the whole batch in a single
BEGIN EXCLUSIVE / COMMIT (one fsync) instead of one transaction per record
(~11x at 200 records, ~40x at 1000 vs the SDK's vendored sqlite). Shared per-record
logic is factored into isValidRecord / insertRecordUnsafe / checkStorageSizeLimits.
The batch is all-or-nothing: if any insert fails, the transaction is rolled back
(new SqliteDB::rollback / DbTransaction::markForRollback) and the size estimate is
undone, so callers can re-queue the whole batch without risking duplicate rows
(the events table has no unique record_id constraint).

OfflineStorageHandler::Flush() now uses the batched StoreRecords() to persist a
drained batch in one transaction. Because StoreRecords() is all-or-nothing, on
failure nothing is committed and Flush returns every record to the in-memory queue
for retry -- realizing the batching speedup while keeping the no-event-loss /
no-duplicate guarantee.

StoreRecords/StoreRecord report write failures via OnStorageFailed after the
transaction closes; validation runs before the transaction. Adds
OfflineStorageTests_SQLite.StoreRecordsBatchStoresAllRecords. Full UnitTests (527)
pass. Closes PR microsoft#1497.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@bmehta001 bmehta001 changed the title Offline storage data-safety: empty-filter guard, store-failure propagation, and flush event-loss fix Offline storage: data-safety fixes + batched flush (empty-filter guard, store-failure propagation, event-loss fix, one-transaction flush) Jun 23, 2026
@bmehta001 bmehta001 requested a review from Copilot June 23, 2026 16:34

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.

Comment thread lib/offline/OfflineStorage_SQLite.cpp Outdated
Comment on lines +276 to +289
std::vector<const StorageRecord*> valid;
valid.reserve(records.size());
for (auto const& i : records) {
if (isValidRecord(i)) {
valid.push_back(&i);
}
}

if (valid.empty()) {
// Every record was invalid (already reported above). Match the single
// StoreRecord(), which returns after validation without checking
// DB-open.
return 0;
}
…cords

StoreRecords() previously filtered out invalid records and committed the valid
ones, so it could return a count < records.size() even though some records were
persisted. OfflineStorageHandler::Flush() treats totalSaved < records.size() as a
batch failure and re-queues ALL drained records, which would duplicate the valid
records that were actually stored.

Make StoreRecords() truly all-or-nothing: if ANY input record is invalid, store
nothing and return 0 (invalids are still reported via isValidRecord()). Combined
with the existing rollback-on-write-failure, StoreRecords() now returns either
records.size() (whole batch committed) or 0 (nothing committed), so Flush's
re-queue-all-on-short-return can never duplicate records.

Adds OfflineStorageTests_SQLite.StoreRecordsBatchWithAnyInvalidStoresNothing.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.

// than reserving) keeps only a single copy of each record in flight and
// avoids stamping a reservation lease that the Room backend would
// persist to disk.
auto records = m_offlineStorageMemory->GetRecords(false, EventLatency_Unspecified);
Flush() re-queued the whole drained batch whenever StoreRecords() returned a
count < records.size(). Both disk backends are all-or-nothing (SQLite rolls back;
Room returns 0 on a failed JNI batch), so the only meaningful "failure" value is
0. Room also caps its returned count at min(size, INT32_MAX); keying off
< records.size() would treat that capped count as a failure and re-queue
already-persisted records (duplicates). Key the re-queue off totalSaved == 0
instead, which is the true "nothing committed" signal. (The cap only matters for
a batch larger than the RAM queue could ever hold.)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated no new comments.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants