simplify usage of genomic_range_index_levels() (#16)

mlin · web-flow · commit 800a996a4439 · 2021-05-25T13:26:29.000-10:00
by caching results when used repeatedly on the same connection, so long as the database has not been changed
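
For reviewers, the invalidation idea can be sketched in Python. This is an illustration under stated assumptions, not the extension's C++ code: it substitutes `PRAGMA data_version` for the `SQLITE_FCNTL_DATA_VERSION` file control used in the patch, and `CachedAnalysis`, `max_length`, and the `start_pos`/`end_pos` columns are hypothetical names standing in for the real GRI level detection.

```python
import sqlite3


class CachedAnalysis:
    """Memoize a per-table analysis until this connection observes a change."""

    def __init__(self, con, analyze):
        self.con = con
        self.analyze = analyze  # hypothetical stand-in for GRI level detection
        self.cache = {}  # table name -> (stamp, result)
        self.misses = 0

    def _stamp(self):
        # PRAGMA data_version changes when another connection commits;
        # total_changes counts rows modified through this connection.
        (data_version,) = self.con.execute("PRAGMA data_version").fetchone()
        return (data_version, self.con.total_changes)

    def get(self, table):
        stamp = self._stamp()
        hit = self.cache.get(table)
        if hit is not None and hit[0] == stamp:
            return hit[1]  # nothing changed since the last analysis
        self.misses += 1
        result = self.analyze(self.con, table)
        self.cache[table] = (stamp, result)
        return result


def max_length(con, table):
    # Toy analysis: length of the longest feature (table name is trusted here).
    return con.execute(f"SELECT max(end_pos - start_pos) FROM {table}").fetchone()[0]


con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE exons(chrom TEXT, start_pos INTEGER, end_pos INTEGER)")
con.execute("INSERT INTO exons VALUES('chr1', 100, 200)")

cached = CachedAnalysis(con, max_length)
first = cached.get("exons")   # analysis runs
second = cached.get("exons")  # served from the cache
con.execute("INSERT INTO exons VALUES('chr1', 0, 4097)")
third = cached.get("exons")   # total_changes moved, so the analysis reruns
```

As in the patch, comparing both stamps is meant to cover writes made through this connection as well as commits from elsewhere; a cache miss simply repeats the analysis.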
diff --git a/docs/guide_gri.md b/docs/guide_gri.md
@@ -164,26 +164,20 @@ queryChrom = featureChrom AND
 
 The optional, trailing `ceiling` & `floor` arguments to `genomic_range_rowids()` optimize GRI queries by bounding their search *levels*, skipping steps that'd be useless in view of the overall length distribution of the indexed features. (See [Internals](internals.md) for full explanation.)
 
-The extension supplies a SQL helper function `genomic_range_index_levels(tableName)` to detect appropriate level bounds for the current version of the table. This procedure has to analyze the GRI, and the logarithmic cost of doing so will be worthwhile if used to optimize many subsequent GRI queries (but not for just one or a few). Therefore, a typical program should query `genomic_range_index_levels()` once upfront, then pass the detected bounds in to subsequent prepared queries, e.g. in Python:
+The extension supplies a SQL helper function `genomic_range_index_levels(tableName)` to detect appropriate level bounds for the current version of the table. Example usage:
 
-```python3
-(gri_ceiling, gri_floor) = next(
-    con.execute("SELECT * FROM genomic_range_index_levels('exons')")
-  )
-for (queryChrom, queryBegin, queryEnd) in queryRanges:
-  exons = list(
-    con.execute(
-      "SELECT * from exons WHERE exons._rowid_ IN \
-        genomic_range_rowids('exons',?,?,?,?,?)",
-      (queryChrom, queryBegin, queryEnd, gri_ceiling, gri_floor)
-    )
-  )
-  ...
+```sql
+SELECT col1, col2, ... FROM exons, genomic_range_index_levels('exons')
+  WHERE exons._rowid_ IN
+    genomic_range_rowids('exons', 'chr12', 111803912, 111804012,
+                         _gri_ceiling, _gri_floor)
 ```
 
-**❗ Don't use the detected level bounds if the table can be modified in the meantime. GRI queries with inappropriate bounds are liable to produce incomplete results.**
+Here `_gri_ceiling` and `_gri_floor` are columns of the single row computed by `genomic_range_index_levels('exons')`.
+
+`genomic_range_index_levels()` performs some upfront analysis of the table's GRI upon its first use on any database connection. The cost of this analysis should be worthwhile if it's used to optimize many `genomic_range_rowids()` operations (but not just one or a few). Subsequent uses of `genomic_range_index_levels()` on the same connection & table reuse the first analysis, unless the database changes in the meantime, in which case the analysis must be redone. This suggests using `genomic_range_index_levels()` only once the database is read-only.
 
-Omitting the bounds is always safe, albeit slightly slower. <small>Instead of detecting current bounds, they can be figured manually as follows. Set the integer ceiling to *C*, 0 &lt; *C* &lt; 16, such that all (present & future) indexed features are guaranteed to have lengths &le;16<sup>*C*</sup>. For example, if you're querying features on the human genome, then you can set ceiling=7 because the lengthiest chromosome sequence is &lt;16<sup>7</sup>nt. Set the integer floor *F* to (i) the floor value supplied at GRI creation, if any; (ii) *F* &gt; 0 such that the minimum possible feature length &gt;16<sup>*F*-1</sup>, if any; or (iii) zero. The default, safe, albeit slower bounds are C=15, F=0.</small>
+<small>Instead of being detected, the bounds can be figured manually as follows. Set the integer ceiling to *C*, 0 &lt; *C* &lt; 16, such that all (present & future) indexed features are guaranteed to have lengths &le;16<sup>*C*</sup>. For example, if you're querying features on the human genome, then you can set ceiling=7 because the lengthiest chromosome sequence is &lt;16<sup>7</sup>nt. Set the integer floor *F* to (i) the floor value supplied at GRI creation, if any; (ii) *F* &gt; 0 such that the minimum possible feature length &gt;16<sup>*F*-1</sup>, if any; or (iii) zero. The safe, default bounds are C=15, F=0. GRI queries with inappropriate bounds are liable to produce incomplete results.</small>
 
 #### Joining tables on range overlap
 
@@ -202,7 +196,7 @@ FROM variants LEFT JOIN exons ON exons._rowid_ IN
 
 We fill out the GRI query range using the three coordinate columns of the variants table.
 
-We may be able to speed this up by supplying level bounds, as shown above. Optionally, in this case where we expect a "tight loop" of many GRI queries, we can even inline the bounds detection:
+We may be able to speed this up by supplying level bounds, as discussed above:
 
 ``` sql
 SELECT variants.*, exons._rowid_
@@ -218,8 +212,6 @@ FROM genomic_range_index_levels('exons'),
   )
 ```
 
-Here `_gri_ceiling` and `_gri_floor` are columns of the single row computed by `genomic_range_index_levels('exons')`.
-
 See also "Advice for big data" below on optimizing storage layout for GRI queries.
 
 ### Reference genome metadata
diff --git a/src/genomicsqlite.cc b/src/genomicsqlite.cc
@@ -855,11 +855,13 @@ class GenomicRangeRowidsTVF : public SQLiteVirtualTable {
 // genomic_range_index_levels(tableName): inspect the GRI to detect the gri_ceiling and gri_floor
 // of the (current snapshot of) the given table. (returns just one row)
 class GenomicRangeIndexLevelsCursor : public SQLiteVirtualTableCursor {
-    sqlite3 *db_;
-    sqlite_int64 ceiling_ = -1, floor_ = -1;
-
   public:
-    GenomicRangeIndexLevelsCursor(sqlite3 *db) : db_(db) {}
+    struct cached_levels {
+        uint32_t data_version = UINT32_MAX;
+        int db_total_changes = INT_MAX, ceiling = 15, floor = 0;
+    };
+    using levels_cache = map<string, cached_levels>;
+    GenomicRangeIndexLevelsCursor(sqlite3 *db, levels_cache &cache) : db_(db), cache_(cache) {}
 
     int Filter(int idxNum, const char *idxStr, int argc, sqlite3_value **argv) override {
         ceiling_ = floor_ = -1;
@@ -870,15 +872,60 @@ class GenomicRangeIndexLevelsCursor : public SQLiteVirtualTableCursor {
         } else {
             string table_name = (const char *)sqlite3_value_text(argv[0]);
             // TODO: sanitize table_name
+            auto schema_table = split_schema_table(table_name);
+            string schema = schema_table.first;
+            transform(schema.begin(), schema.end(), schema.begin(), ::tolower);
+
+            uint32_t data_version = UINT32_MAX;
+            int db_total_changes = INT_MAX;
+            bool main = schema.empty() || schema == "main.";
+            if (main) {
+                // cache levels for tables of the main database, invalidated when database changes
+                // are indicated by SQLITE_FCNTL_DATA_VERSION and/or sqlite3_total_changes().
+                // Exclude attached databases because we can't know if a schema name could have
+                // been reattached to a different file between invocations.
+                int rc =
+                    sqlite3_file_control(db_, nullptr, SQLITE_FCNTL_DATA_VERSION, &data_version);
+                if (rc != SQLITE_OK) {
+                    Error("genomic_range_index_levels(): error in SQLITE_FCNTL_DATA_VERSION");
+                    return rc;
+                }
+                db_total_changes = sqlite3_total_changes(db_);
+                auto cached = cache_.find(schema_table.second);
+                if (cached != cache_.end() && data_version == cached->second.data_version &&
+                    db_total_changes == cached->second.db_total_changes) {
+                    floor_ = cached->second.floor;
+                    ceiling_ = cached->second.ceiling;
+                    _DBG << "genomic_range_index_levels() cache hit on " << table_name
+                         << " ceiling = " << ceiling_ << " floor = " << floor_ << endl;
+                    return SQLITE_OK;
+                }
+            }
+
             try {
                 auto p = DetectLevelRange(db_, table_name);
                 floor_ = p.first;
                 ceiling_ = p.second;
-                assert(floor_ >= 0 && ceiling_ >= floor_ && ceiling_ <= 15);
-                return SQLITE_OK;
             } catch (std::exception &exn) {
                 Error(exn.what());
+                return SQLITE_ERROR;
             }
+            assert(floor_ >= 0 && ceiling_ >= floor_ && ceiling_ <= 15);
+
+            if (main) {
+                auto cached = cache_.find(schema_table.second);
+                if (cached == cache_.end()) {
+                    cache_[schema_table.second] = cached_levels();
+                    cached = cache_.find(schema_table.second);
+                    assert(cached != cache_.end());
+                }
+                cached->second.data_version = data_version;
+                cached->second.db_total_changes = db_total_changes;
+                cached->second.ceiling = ceiling_;
+                cached->second.floor = floor_;
+            }
+
+            return SQLITE_OK;
         }
         return SQLITE_ERROR;
     }
@@ -910,11 +957,18 @@ class GenomicRangeIndexLevelsCursor : public SQLiteVirtualTableCursor {
         *pRowid = 1;
         return SQLITE_OK;
     }
+
+  private:
+    sqlite3 *db_;
+    levels_cache &cache_;
+    sqlite_int64 ceiling_ = -1, floor_ = -1;
 };
 
 class GenomicRangeIndexLevelsTVF : public SQLiteVirtualTable {
+    GenomicRangeIndexLevelsCursor::levels_cache cache_;
+
     unique_ptr<SQLiteVirtualTableCursor> NewCursor() override {
-        return unique_ptr<SQLiteVirtualTableCursor>(new GenomicRangeIndexLevelsCursor(db_));
+        return unique_ptr<SQLiteVirtualTableCursor>(new GenomicRangeIndexLevelsCursor(db_, cache_));
     }
 
   public:
diff --git a/test/genomicsqlite_big_tests.wdl b/test/genomicsqlite_big_tests.wdl
@@ -235,6 +235,7 @@ task test_sam_web {
             'SELECT gri_refseq_name, count(1) AS read_count
                 FROM reads LEFT JOIN _gri_refseq USING(_gri_rid)
                 GROUP BY gri_refseq_name
+                HAVING read_count >= 1000000
                 ORDER BY read_count DESC'
 
         # stop nginx
diff --git a/test/test_gri.py b/test/test_gri.py
@@ -324,6 +324,26 @@ def test_gri_levels_in_sql(tmp_path):
     _fill_exons(con)
     con.commit()
 
+    # test caching & invalidation:
+    results = list(con.execute("SELECT * FROM genomic_range_index_levels('exons')"))
+    assert results == [(3, 1)]
+    results = list(con.execute("SELECT * FROM genomic_range_index_levels('exons')"))
+    assert results == [(3, 1)]
+    results = list(con.execute("SELECT * FROM genomic_range_index_levels('main.exons')"))
+    assert results == [(3, 1)]
+    tch1 = con.total_changes
+    con.execute("INSERT INTO exons VALUES('ether',0,4097,4097,'ether')")
+    tch2 = con.total_changes
+    assert tch2 > tch1
+    results = list(con.execute("SELECT * FROM genomic_range_index_levels('exons')"))
+    assert results == [(4, 1)]
+    con.commit()
+    results = list(con.execute("SELECT * FROM genomic_range_index_levels('exons')"))
+    assert results == [(4, 1)]
+    con.execute("DELETE FROM exons WHERE rid = 'ether'")
+    con.commit()
+    results = list(con.execute("SELECT * FROM genomic_range_index_levels('main.exons')"))
+    assert results == [(3, 1)]
     results = list(con.execute("SELECT * FROM genomic_range_index_levels('exons')"))
     assert results == [(3, 1)]