docs/guide_gri.md: 21 additions & 17 deletions
@@ -160,20 +160,9 @@ By the half-open position convention, this includes features that *abut* as well
160
160
161
161
#### Level bounds optimization
162
162
163
-
The optional, trailing `ceiling` & `floor` arguments to `genomic_range_rowids()` optimize GRI queries by skipping steps that'd be useless in view of the length distribution of the indexed features. (See Internals for full explanation.)
163
+
The optional, trailing `ceiling` & `floor` arguments to `genomic_range_rowids()` optimize GRI queries by bounding their search *levels*, skipping steps that'd be useless in view of the overall length distribution of the indexed features. (See [Internals](internals.md) for full explanation.)
164
164
165
-
The extension supplies a SQL helper function `genomic_range_index_levels(tableName)` to detect the appropriate bounds for (the current snapshot of) the table. Example usage:
166
-
167
-
```sql
168
-
SELECT col1, col2, ... FROM exons, genomic_range_index_levels('exons')
Here `_gri_ceiling` and `_gri_floor` are columns of the single row computed by `genomic_range_index_levels('exons')`.
175
-
176
-
Alternatively, your program might first query `genomic_range_index_levels()` alone, then pass the bounds in to subsequent prepared queries, e.g. in Python:
165
+
The extension supplies a SQL helper function `genomic_range_index_levels(tableName)` to detect appropriate level bounds for the current version of the table. This procedure has to analyze the GRI, and the cost of doing so will be worthwhile if used to optimize many subsequent GRI queries (but not for just one or a few). Therefore, a typical program should query `genomic_range_index_levels()` once upfront, then pass the detected bounds in to subsequent prepared queries, e.g. in Python:
177
166
178
167
```python3
179
168
(gri_ceiling, gri_floor) = next(
@@ -190,16 +179,29 @@ for (queryChrom, queryBegin, queryEnd) in queryRanges:
190
179
...
191
180
```
192
181
193
-
This bounds detection procedure has a small cost, which will be worthwhile if used to optimize many subsequent GRI queries (but possibly not if just for a few).
194
-
195
-
**❗ The bounds should be redetected if the min/max feature length may have been changed by inserts or updates to the table. GRI queries with incorrect bounds are liable to produce incomplete results.**
182
+
**❗ Don't use the detected level bounds if the table can be modified in the meantime. GRI queries with inappropriate bounds are liable to produce incomplete results.**
196
183
197
184
Omitting the bounds is always safe, albeit slower. <small>Instead of detecting the current bounds, you can figure them manually as follows. Set the integer ceiling to *C*, 0 < *C* < 16, such that all (present & future) indexed features are guaranteed to have lengths ≤ 16<sup>*C*</sup>. For example, if you're querying features on the human genome, then you can set ceiling=7 because the lengthiest chromosome sequence is < 16<sup>7</sup> nt. Set the integer floor *F* to (i) the floor value supplied at GRI creation, if any; (ii) *F* > 0 such that the minimum possible feature length > 16<sup>*F*-1</sup>, if any; or (iii) zero. The default, safe, albeit slower bounds are *C*=15, *F*=0.</small>
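
The manual rules above can be sketched in a few lines of Python. This is only an illustration: `manual_gri_bounds` and its parameters are hypothetical names, not part of the extension. It assumes that the creation-time floor takes precedence when one is known, and that for rule (ii) the largest qualifying *F* is wanted since it gives the tightest bound.

```python
def manual_gri_bounds(max_feature_length, min_feature_length=None, creation_floor=None):
    """Hypothetical helper (not part of the extension): figure (ceiling, floor) manually."""
    if max_feature_length < 1:
        raise ValueError("max_feature_length must be positive")

    # Ceiling: smallest C in 1..15 with max_feature_length <= 16**C.
    # If no such C exists, fall back to the default ceiling of 15.
    ceiling = next((c for c in range(1, 16) if max_feature_length <= 16 ** c), 15)

    # Floor: (i) the floor supplied at GRI creation, if any; else
    # (ii) the largest F > 0 with min_feature_length > 16**(F-1), if any; else
    # (iii) zero.
    if creation_floor is not None:
        floor = creation_floor
    elif min_feature_length is not None:
        floor = max(
            (f for f in range(1, 16) if min_feature_length > 16 ** (f - 1)),
            default=0,
        )
    else:
        floor = 0
    return ceiling, floor


# Longest human chromosome (chr1 in GRCh38) is 248,956,422 nt, which is < 16**7.
print(manual_gri_bounds(248_956_422))  # (7, 0)
```

The resulting pair would then be passed as the trailing `ceiling` and `floor` arguments to `genomic_range_rowids()`. The lengths given to the helper must be guarantees that hold for all present and future features, not just the current minimum and maximum.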
198
185
199
186
#### Joining tables on range overlap
200
187
201
188
Suppose we have two tables with genomic features to join on range overlap. Only the "right-hand" table must have a GRI; preferably the smaller of the two. For example, annotating a table of variants with the surrounding exon(s), if any:
202
189
190
+
```sql
191
+
SELECT variants.*, exons._rowid_
192
+
FROM variants LEFT JOIN exons ON exons._rowid_ IN
193
+
genomic_range_rowids(
194
+
'exons',
195
+
variants.chrom,
196
+
variants.beginPos,
197
+
variants.endPos
198
+
)
199
+
```
200
+
201
+
We fill out the GRI query range using the three coordinate columns of the variants table.
202
+
203
+
We may be able to speed this up by supplying level bounds, as shown above. Optionally, in this case where we expect a "tight loop" of many GRI queries, we can even inline the bounds detection:
204
+
203
205
```sql
204
206
SELECT variants.*, exons._rowid_
205
207
FROM genomic_range_index_levels('exons'),
@@ -214,7 +216,9 @@ FROM genomic_range_index_levels('exons'),
214
216
)
215
217
```
216
218
217
-
We fill out the GRI query range using the three coordinate columns of the variants table. The level bounds optimization is highly desirable for the "tight loop" of GRI queries during a join. See also "Advice for big data" below.
219
+
Here `_gri_ceiling` and `_gri_floor` are columns of the single row computed by `genomic_range_index_levels('exons')`.
220
+
221
+
See also "Advice for big data" below on optimizing storage layout for GRI queries.