Offline storage: data-safety fixes + batched flush (empty-filter guard, store-failure propagation, event-loss fix, one-transaction flush)#1491
Open
bmehta001 wants to merge 11 commits into
Conversation
…ailure Two latent data-loss bugs found during a repo-wide review: 1) MemoryStorage::DeleteRecords(whereFilter) matched EVERY record when whereFilter was empty (the matcher starts `matched = true` and the per-key loop never runs), silently wiping the entire in-memory queue. This contradicts the fail-closed OfflineStorage_SQLite::DeleteRecords and the Room backend. Guard an empty filter and return without deleting; intentional full clears use DeleteAllRecords(). 2) OfflineStorage_SQLite::StoreRecord ignored the bool returned by SqliteStatement::execute(), returning true and bumping m_DbSizeEstimate even on a real write failure (SQLITE_FULL/IOERR/etc). The event is silently lost with no OnStorageFailed notification and the size estimate drifts. Capture the result; on failure log, notify the observer, and return false (skipping the size bump). Tests: added MemoryStorageTests.DeleteRecordsWithEmptyFilterDoesNotDeleteAll (fails without the guard -- the queue is wiped to 0; passes with it). The StoreRecord write-failure path isn't unit-testable here (the insert is REPLACE INTO with no constraint to violate), so it's covered by build + review. Verified locally on Linux: all 9 MemoryStorageTests and 32 OfflineStorageTests_SQLite pass. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Verified TDD: this test fails without the empty-filter guard (the queue is wiped, GetSize()/GetRecordCount() drop to 0) and passes with it. Run on Linux host. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
lalitb
reviewed
Jun 22, 2026
lib/offline/OfflineStorage_SQLite.cpp::StoreRecord now returns false on a write failure (this PR), but OfflineStorageHandler::StoreRecord ignored the disk result and always returned true, so a failed synchronous store (RAM queue disabled or during shutdown) was counted as successfully persisted by StoreRecords()/StorageObserver. Return the disk StoreRecord() result in the direct-to-disk path. The memory path is unchanged: MemoryStorage::StoreRecord returning false means an intentional latency-Off skip, not a failure, so it must not surface as an error. Verified at lib/offline/OfflineStorageHandler.cpp:266-275 and lib/offline/OfflineStorage_SQLite.cpp:180-186. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Contributor
There was a problem hiding this comment.
Pull request overview
This PR hardens the offline-storage layer against two data-loss scenarios by (1) preventing accidental full-queue deletes when a delete predicate is empty and (2) ensuring SQLite write failures are surfaced to callers/observers instead of being treated as successful stores.
Changes:
- Add an empty-filter guard to
MemoryStorage::DeleteRecords(whereFilter)and a unit test to prevent unintended full deletes. - Propagate synchronous disk
StoreRecord()failures throughOfflineStorageHandler::StoreRecordso callers don’t count failed persistence as success. - Treat
SqliteStatement::execute()failures inOfflineStorage_SQLite::StoreRecordas hard failures: log, notifyOnStorageFailed, and returnfalsewithout updating size estimates.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| tests/unittests/MemoryStorageTests.cpp | Adds coverage to ensure DeleteRecords({}) does not wipe the in-memory queue. |
| lib/offline/MemoryStorage.cpp | Guards empty whereFilter to avoid “match-all” deletion behavior. |
| lib/offline/OfflineStorageHandler.cpp | Returns disk StoreRecord() result to propagate synchronous persistence failures to callers. |
| lib/offline/OfflineStorage_SQLite.cpp | Checks execute() result and reports SQLite insert failures via logging + OnStorageFailed. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Comment on lines
+272
to
+274
| // Propagate a synchronous disk write failure to the caller so a | ||
| // failed store is not counted as successfully persisted. | ||
| return m_offlineStorageDisk->StoreRecord(record); |
Comment on lines
+272
to
+274
| // Propagate a synchronous disk write failure to the caller so a | ||
| // failed store is not counted as successfully persisted. | ||
| return m_offlineStorageDisk->StoreRecord(record); |
bmehta001
added a commit
to bmehta001/cpp_client_telemetry
that referenced
this pull request
Jun 23, 2026
OfflineStorage_SQLite::StoreRecord: return false when the INSERT execute() fails (was ignoring the result and returning true even on a real write failure). The Flush() reserve/confirm-delete logic in this PR relies on the disk backend reporting per-record failure; without this, a failed sqlite3_step would still be treated as persisted and dropped from memory. (This is the same one-line fix as PR microsoft#1491, included here so this PR is correct on its own; trivial overlap that resolves cleanly at merge.) OfflineStorageTests.cpp NoopTaskDispatcher: own queued tasks and delete them on Cancel()/Join()/destruction instead of dropping the Task* (PAL::scheduleTask allocates with new and expects the dispatcher to take ownership), so the helper cannot leak. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Combine the Flush() data-loss fix into this storage-data-safety PR (the two are halves of the same fix: this PR already makes OfflineStorage_SQLite::StoreRecord report write failures; Flush() must act on that). OfflineStorageHandler::Flush() previously drained the in-memory queue with GetRecords() (which removes records) and handed them to StoreRecords() before confirming persistence. On a partial/total disk write failure the un-persisted records were already gone from memory and never re-queued -> events lost. Flush() now drains into a local batch, persists one record at a time, and re-inserts only the records that fail to persist (so failures are retried, not lost). Per-record StoreRecord() is used deliberately: a batched StoreRecords() only returns a count, so on a partial failure we could not tell which records to re-queue, and re-storing already-saved records would duplicate them (no unique record_id constraint). Also null-guards the dbSizeBeforeFlush read so Flush() is safe with disk-only storage (CFG_INT_RAM_QUEUE_SIZE == 0). Adds OfflineStorageHandlerFlushTests.FailedDiskWriteDuringFlushReturnsRecordsToMemory (records the SQLite store rejects stay in memory after Flush; verified it fails against the previous GetRecords()-based Flush). Closes the separate PR microsoft#1496. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Comment on lines
+196
to
+202
| if (it != m_tasks.end()) | ||
| { | ||
| delete *it; | ||
| m_tasks.erase(it); | ||
| } | ||
| return true; | ||
| } |
The test helper's Cancel() returned true unconditionally, violating the ITaskDispatcher::Cancel contract (return whether the task was found/cancelled). Return true only when the task was present in the queue, false otherwise. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Comment on lines
+193
to
+197
| size_t totalSaved = 0; | ||
| size_t totalFailed = 0; | ||
| for (auto& record : records) | ||
| { | ||
| if (m_offlineStorageDisk->StoreRecord(record)) |
Comment on lines
+217
to
+219
| // Regression test: when records pulled from the in-memory queue fail to persist | ||
| // to disk during Flush(), they must be returned to the queue rather than lost. | ||
| TEST(OfflineStorageHandlerFlushTests, FailedDiskWriteDuringFlushReturnsRecordsToMemory) |
Comment on lines
+238
to
+241
| // A timestamp <= 0 is accepted by the in-memory queue but rejected by the | ||
| // SQLite disk store's input validation, so its StoreRecord() returns false. | ||
| // This drives the same Flush() failure-handling path as a disk write failure | ||
| // (a failed record must be returned to memory, not dropped). |
Rename FailedDiskWriteDuringFlush... -> FailedDiskStoreDuringFlush... and reword its comments: the test exercises a disk StoreRecord() rejection (SQLite input validation), which drives the same Flush() re-queue path as any disk store failure, not a literal disk write/IO error. (The reviewer's separate note that Flush() ignores EventPersistence_DoNotStoreOnDisk is a pre-existing behavior, out of scope for this data-safety change and not cleanly unit-testable via the public API; tracked as a follow-up.) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…(was PR microsoft#1497) Combine the batched-flush perf work into this PR and make it cooperate with the Flush() data-loss fix, so both land together. OfflineStorage_SQLite: StoreRecords() now inserts the whole batch in a single BEGIN EXCLUSIVE / COMMIT (one fsync) instead of one transaction per record (~11x at 200 records, ~40x at 1000 vs the SDK's vendored sqlite). Shared per-record logic is factored into isValidRecord / insertRecordUnsafe / checkStorageSizeLimits. The batch is all-or-nothing: if any insert fails, the transaction is rolled back (new SqliteDB::rollback / DbTransaction::markForRollback) and the size estimate is undone, so callers can re-queue the whole batch without risking duplicate rows (the events table has no unique record_id constraint). OfflineStorageHandler::Flush() now uses the batched StoreRecords() to persist a drained batch in one transaction. Because StoreRecords() is all-or-nothing, on failure nothing is committed and Flush returns every record to the in-memory queue for retry -- realizing the batching speedup while keeping the no-event-loss / no-duplicate guarantee. StoreRecords/StoreRecord report write failures via OnStorageFailed after the transaction closes; validation runs before the transaction. Adds OfflineStorageTests_SQLite.StoreRecordsBatchStoresAllRecords. Full UnitTests (527) pass. Closes PR microsoft#1497. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Comment on lines
+276
to
+289
| std::vector<const StorageRecord*> valid; | ||
| valid.reserve(records.size()); | ||
| for (auto const& i : records) { | ||
| if (isValidRecord(i)) { | ||
| valid.push_back(&i); | ||
| } | ||
| } | ||
|
|
||
| if (valid.empty()) { | ||
| // Every record was invalid (already reported above). Match the single | ||
| // StoreRecord(), which returns after validation without checking | ||
| // DB-open. | ||
| return 0; | ||
| } |
…cords StoreRecords() previously filtered out invalid records and committed the valid ones, so it could return a count < records.size() even though some records were persisted. OfflineStorageHandler::Flush() treats totalSaved < records.size() as a batch failure and re-queues ALL drained records, which would duplicate the valid records that were actually stored. Make StoreRecords() truly all-or-nothing: if ANY input record is invalid, store nothing and return 0 (invalids are still reported via isValidRecord()). Combined with the existing rollback-on-write-failure, StoreRecords() now returns either records.size() (whole batch committed) or 0 (nothing committed), so Flush's re-queue-all-on-short-return can never duplicate records. Adds OfflineStorageTests_SQLite.StoreRecordsBatchWithAnyInvalidStoresNothing. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
| // than reserving) keeps only a single copy of each record in flight and | ||
| // avoids stamping a reservation lease that the Room backend would | ||
| // persist to disk. | ||
| auto records = m_offlineStorageMemory->GetRecords(false, EventLatency_Unspecified); |
Flush() re-queued the whole drained batch whenever StoreRecords() returned a count < records.size(). Both disk backends are all-or-nothing (SQLite rolls back; Room returns 0 on a failed JNI batch), so the only meaningful "failure" value is 0. Room also caps its returned count at min(size, INT32_MAX); keying off < records.size() would treat that capped count as a failure and re-queue already-persisted records (duplicates). Key the re-queue off totalSaved == 0 instead, which is the true "nothing committed" signal. (The cap only matters for a batch larger than the RAM queue could ever hold.) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Offline-storage data-safety + flush performance (combines the data-safety fixes with the SQLite batch-flush optimization, formerly PR #1496 and #1497).
MemoryStorage::DeleteRecords(filter)DeleteAllRecords()), mirroring the SQLite backend.OfflineStorage_SQLite::StoreRecordsBEGIN EXCLUSIVE/COMMIT(one fsync) instead of one transaction per record. ~11× (200 recs) / ~40× (1000 recs) faster (measured on the SDK's vendored sqlite). All-or-nothing: any insert failure rolls the batch back (newSqliteDB::rollback), so it can be re-queued without duplicate rows (no uniquerecord_id).OfflineStorage_SQLite::StoreRecordfalseand reportsOnStorageFailedon a real write failure (was ignoring theexecute()result). Validation runs before the transaction; write failures are reported after it closes.OfflineStorageHandler::StoreRecordtrue).OfflineStorageHandler::FlushStoreRecords(), and re-queues the whole batch if it doesn't fully persist — realizing the batch speedup and keeping the no-loss / no-duplicate guarantee. Also null-guards the size read (disk-only storage).Tests
MemoryStorageTests.DeleteRecordsWithEmptyFilterDoesNotDeleteAllOfflineStorageHandlerFlushTests.FailedDiskStoreDuringFlushReturnsRecordsToMemory— records the disk store rejects remain in memory afterFlush(). Verified it fails against the previousGetRecords()-based Flush.OfflineStorageTests_SQLite.StoreRecordsBatchStoresAllRecordsUnitTests: 527/527 pass.History
The batching half (PR #1497) was driven to a clean Copilot review over ~9 rounds, then folded here so the batched
StoreRecords()and the Flush data-loss fix cooperate (all-or-nothing batch + re-queue) rather than conflict.