Skip to content

Commit 800a996

Browse files
authored
simplify usage of genomic_range_index_levels() (#16)
by caching results when used repeatedly on the same connection, so long as the database has not been changed
1 parent fc705b1 commit 800a996

4 files changed

Lines changed: 93 additions & 26 deletions

File tree

docs/guide_gri.md

Lines changed: 11 additions & 19 deletions
Original file line numberDiff line numberDiff line change
@@ -164,26 +164,20 @@ queryChrom = featureChrom AND
164164

165165
The optional, trailing `ceiling` & `floor` arguments to `genomic_range_rowids()` optimize GRI queries by bounding their search *levels*, skipping steps that'd be useless in view of the overall length distribution of the indexed features. (See [Internals](internals.md) for full explanation.)
166166

167-
The extension supplies a SQL helper function `genomic_range_index_levels(tableName)` to detect appropriate level bounds for the current version of the table. This procedure has to analyze the GRI, and the logarithmic cost of doing so will be worthwhile if used to optimize many subsequent GRI queries (but not for just one or a few). Therefore, a typical program should query `genomic_range_index_levels()` once upfront, then pass the detected bounds in to subsequent prepared queries, e.g. in Python:
167+
The extension supplies a SQL helper function `genomic_range_index_levels(tableName)` to detect appropriate level bounds for the current version of the table. Example usage:
168168

169-
```python3
170-
(gri_ceiling, gri_floor) = next(
171-
con.execute("SELECT * FROM genomic_range_index_levels('exons')")
172-
)
173-
for (queryChrom, queryBegin, queryEnd) in queryRanges:
174-
exons = list(
175-
con.execute(
176-
"SELECT * from exons WHERE exons._rowid_ IN \
177-
genomic_range_rowids('exons',?,?,?,?,?)",
178-
(queryChrom, queryBegin, queryEnd, gri_ceiling, gri_floor)
179-
)
180-
)
181-
...
169+
```sql
170+
SELECT col1, col2, ... FROM exons, genomic_range_index_levels('exons')
171+
WHERE exons._rowid_ IN
172+
genomic_range_rowids('exons', 'chr12', 111803912, 111804012,
173+
_gri_ceiling, _gri_floor)
182174
```
183175

184-
**❗ Don't use the detected level bounds if the table can be modified in the meantime. GRI queries with inappropriate bounds are liable to produce incomplete results.**
176+
Here `_gri_ceiling` and `_gri_floor` are columns of the single row computed by `genomic_range_index_levels('exons')`.
177+
178+
`genomic_range_index_levels()` performs some upfront analysis of table's GRI upon its first use on any database connection. The cost of this analysis should be worthwhile if it's used to optimize many `genomic_range_rowids()` operations (but not just one or a few). Subsequent uses of `genomic_range_index_levels()` on the same connection & table reuse the first analysis, unless the database changes in the meantime, in which case the analysis must be redone. This suggests using `genomic_range_index_levels()` only once the database is read-only.
185179

186-
Omitting the bounds is always safe, albeit slightly slower. <small>Instead of detecting current bounds, they can be figured manually as follows. Set the integer ceiling to *C*, 0 &lt; *C* &lt; 16, such that all (present & future) indexed features are guaranteed to have lengths &le;16<sup>*C*</sup>. For example, if you're querying features on the human genome, then you can set ceiling=7 because the lengthiest chromosome sequence is &lt;16<sup>7</sup>nt. Set the integer floor *F* to (i) the floor value supplied at GRI creation, if any; (ii) *F* &gt; 0 such that the minimum possible feature length &gt;16<sup>*F*-1</sup>, if any; or (iii) zero. The default, safe, albeit slower bounds are C=15, F=0.</small>
180+
<small>Instead of detecting current bounds, they can be figured manually as follows. Set the integer ceiling to *C*, 0 &lt; *C* &lt; 16, such that all (present & future) indexed features are guaranteed to have lengths &le;16<sup>*C*</sup>. For example, if you're querying features on the human genome, then you can set ceiling=7 because the lengthiest chromosome sequence is &lt;16<sup>7</sup>nt. Set the integer floor *F* to (i) the floor value supplied at GRI creation, if any; (ii) *F* &gt; 0 such that the minimum possible feature length &gt;16<sup>*F*-1</sup>, if any; or (iii) zero. The safe, default bounds are C=15, F=0. GRI queries with inappropriate bounds are liable to produce incomplete results.</small>
187181

188182
#### Joining tables on range overlap
189183

@@ -202,7 +196,7 @@ FROM variants LEFT JOIN exons ON exons._rowid_ IN
202196

203197
We fill out the GRI query range using the three coordinate columns of the variants table.
204198

205-
We may be able to speed this up by supplying level bounds, as shown above. Optionally, in this case where we expect a "tight loop" of many GRI queries, we can even inline the bounds detection:
199+
We may be able to speed this up by supplying level bounds, as discussed above:
206200

207201
``` sql
208202
SELECT variants.*, exons._rowid_
@@ -218,8 +212,6 @@ FROM genomic_range_index_levels('exons'),
218212
)
219213
```
220214

221-
Here `_gri_ceiling` and `_gri_floor` are columns of the single row computed by `genomic_range_index_levels('exons')`.
222-
223215
See also "Advice for big data" below on optimizing storage layout for GRI queries.
224216

225217
### Reference genome metadata

src/genomicsqlite.cc

Lines changed: 61 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -855,11 +855,13 @@ class GenomicRangeRowidsTVF : public SQLiteVirtualTable {
855855
// genomic_range_index_levels(tableName): inspect the GRI to detect the gri_ceiling and gri_floor
856856
// of the (current snapshot of) the given table. (returns just one row)
857857
class GenomicRangeIndexLevelsCursor : public SQLiteVirtualTableCursor {
858-
sqlite3 *db_;
859-
sqlite_int64 ceiling_ = -1, floor_ = -1;
860-
861858
public:
862-
GenomicRangeIndexLevelsCursor(sqlite3 *db) : db_(db) {}
859+
struct cached_levels {
860+
uint32_t data_version = UINT32_MAX;
861+
int db_total_changes = INT_MAX, ceiling = 15, floor = 0;
862+
};
863+
using levels_cache = map<string, cached_levels>;
864+
GenomicRangeIndexLevelsCursor(sqlite3 *db, levels_cache &cache) : db_(db), cache_(cache) {}
863865

864866
int Filter(int idxNum, const char *idxStr, int argc, sqlite3_value **argv) override {
865867
ceiling_ = floor_ = -1;
@@ -870,15 +872,60 @@ class GenomicRangeIndexLevelsCursor : public SQLiteVirtualTableCursor {
870872
} else {
871873
string table_name = (const char *)sqlite3_value_text(argv[0]);
872874
// TODO: sanitize table_name
875+
auto schema_table = split_schema_table(table_name);
876+
string schema = schema_table.first;
877+
transform(schema.begin(), schema.end(), schema.begin(), ::tolower);
878+
879+
uint32_t data_version = UINT32_MAX;
880+
int db_total_changes = INT_MAX;
881+
bool main = schema.empty() || schema == "main.";
882+
if (main) {
883+
// cache levels for tables of the main database, invalidated when database changes
884+
// are indicated by SQLITE_FCNTL_DATA_VERSION and/or sqlite3_total_changes().
885+
// Exclude attached databases because we can't know if a schema name could have
886+
// been reattached to a different file between invocations.
887+
int rc =
888+
sqlite3_file_control(db_, nullptr, SQLITE_FCNTL_DATA_VERSION, &data_version);
889+
if (rc != SQLITE_OK) {
890+
Error("genomic_range_index_levels(): error in SQLITE_FCNTL_DATA_VERSION");
891+
return rc;
892+
}
893+
db_total_changes = sqlite3_total_changes(db_);
894+
auto cached = cache_.find(schema_table.second);
895+
if (cached != cache_.end() && data_version == cached->second.data_version &&
896+
db_total_changes == cached->second.db_total_changes) {
897+
floor_ = cached->second.floor;
898+
ceiling_ = cached->second.ceiling;
899+
_DBG << "genomic_range_index_levels() cache hit on " << table_name
900+
<< " ceiling = " << ceiling_ << " floor = " << floor_ << endl;
901+
return SQLITE_OK;
902+
}
903+
}
904+
873905
try {
874906
auto p = DetectLevelRange(db_, table_name);
875907
floor_ = p.first;
876908
ceiling_ = p.second;
877-
assert(floor_ >= 0 && ceiling_ >= floor_ && ceiling_ <= 15);
878-
return SQLITE_OK;
879909
} catch (std::exception &exn) {
880910
Error(exn.what());
911+
return SQLITE_ERROR;
881912
}
913+
assert(floor_ >= 0 && ceiling_ >= floor_ && ceiling_ <= 15);
914+
915+
if (main) {
916+
auto cached = cache_.find(schema_table.second);
917+
if (cached == cache_.end()) {
918+
cache_[schema_table.second] = cached_levels();
919+
cached = cache_.find(schema_table.second);
920+
assert(cached != cache_.end());
921+
}
922+
cached->second.data_version = data_version;
923+
cached->second.db_total_changes = db_total_changes;
924+
cached->second.ceiling = ceiling_;
925+
cached->second.floor = floor_;
926+
}
927+
928+
return SQLITE_OK;
882929
}
883930
return SQLITE_ERROR;
884931
}
@@ -910,11 +957,18 @@ class GenomicRangeIndexLevelsCursor : public SQLiteVirtualTableCursor {
910957
*pRowid = 1;
911958
return SQLITE_OK;
912959
}
960+
961+
private:
962+
sqlite3 *db_;
963+
levels_cache &cache_;
964+
sqlite_int64 ceiling_ = -1, floor_ = -1;
913965
};
914966

915967
class GenomicRangeIndexLevelsTVF : public SQLiteVirtualTable {
968+
GenomicRangeIndexLevelsCursor::levels_cache cache_;
969+
916970
unique_ptr<SQLiteVirtualTableCursor> NewCursor() override {
917-
return unique_ptr<SQLiteVirtualTableCursor>(new GenomicRangeIndexLevelsCursor(db_));
971+
return unique_ptr<SQLiteVirtualTableCursor>(new GenomicRangeIndexLevelsCursor(db_, cache_));
918972
}
919973

920974
public:

test/genomicsqlite_big_tests.wdl

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -235,6 +235,7 @@ task test_sam_web {
235235
'SELECT gri_refseq_name, count(1) AS read_count
236236
FROM reads LEFT JOIN _gri_refseq USING(_gri_rid)
237237
GROUP BY gri_refseq_name
238+
HAVING read_count >= 1000000
238239
ORDER BY read_count DESC'
239240
240241
# stop nginx

test/test_gri.py

Lines changed: 20 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -324,6 +324,26 @@ def test_gri_levels_in_sql(tmp_path):
324324
_fill_exons(con)
325325
con.commit()
326326

327+
# test caching & invalidation:
328+
results = list(con.execute("SELECT * FROM genomic_range_index_levels('exons')"))
329+
assert results == [(3, 1)]
330+
results = list(con.execute("SELECT * FROM genomic_range_index_levels('exons')"))
331+
assert results == [(3, 1)]
332+
results = list(con.execute("SELECT * FROM genomic_range_index_levels('main.exons')"))
333+
assert results == [(3, 1)]
334+
tch1 = con.total_changes
335+
con.execute("INSERT INTO exons VALUES('ether',0,4097,4097,'ether')")
336+
tch2 = con.total_changes
337+
assert tch2 > tch1
338+
results = list(con.execute("SELECT * FROM genomic_range_index_levels('exons')"))
339+
assert results == [(4, 1)]
340+
con.commit()
341+
results = list(con.execute("SELECT * FROM genomic_range_index_levels('exons')"))
342+
assert results == [(4, 1)]
343+
con.execute("DELETE FROM exons WHERE rid = 'ether'")
344+
con.commit()
345+
results = list(con.execute("SELECT * FROM genomic_range_index_levels('main.exons')"))
346+
assert results == [(3, 1)]
327347
results = list(con.execute("SELECT * FROM genomic_range_index_levels('exons')"))
328348
assert results == [(3, 1)]
329349

0 commit comments

Comments
 (0)